Rosetta@home using AVX / AVX2 ?

Message boards : Number crunching : Rosetta@home using AVX / AVX2 ?



[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,552,383
RAC: 6,167
Message 80461 - Posted: 2 Aug 2016, 10:13:14 UTC - in response to Message 80459.  

There are MANY things that can be done to significantly improve performance without a major rewrite.

So, what's the problem??

The first change I talked about was introducing "homogeneous coordinates". This is very nice because, it does not "really" change the "project code". You can introduce the C++ TEMPLATE typedef changes, recompile and you should get the EXACT SAME ANSWER with the new compile options.

So, again, what's the problem??

The second place where substantial improvement can be accomplished with little effort is by upgrading the server to steer optimized applications to target crunchers.

Waiting for crowdfunding :-P
ID: 80461
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 80465 - Posted: 3 Aug 2016, 0:59:00 UTC

I believe the reference to upgrading the server was meant to indicate software updates that would support task routing by host type.

rjs5 will have to clarify, but I believe his study and figures are estimates. To track back to original source code and make the suggested change and measure results is another story. That makes miscommunication very easy too when one is looking at one end of the elephant, and the other is looking at the other (source code vs. executable code).
Rosetta Moderator: Mod.Sense
ID: 80465
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,552,383
RAC: 6,167
Message 80468 - Posted: 3 Aug 2016, 7:55:54 UTC - in response to Message 80465.  

I believe the reference to upgrading the server was meant to indicate software updates that would support task routing by host type.

Crowdfunding could also cover software (a RHEL license, for example, or a new Visual Studio license).

To track back to original source code and make the suggested change and measure results is another story. That makes miscommunication very easy too when one is looking at one end of the elephant, and the other is looking at the other (source code vs. executable code).

Are they afraid to fork the code and try it? Do they see it as a waste of resources? I think, for example, that the admins spent a lot of time and resources on the Android version. IMHO.

ID: 80468
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 80471 - Posted: 3 Aug 2016, 17:13:18 UTC

Rosetta is constantly evolving and keeping up with the latest optimization technology and developing/testing/maintaining many different builds would be a very complex task which we currently don't have the resources for aside from taking away resources for research. rjs5 has put a huge effort to help and look into optimization possibilities with Rosetta. Now that CASP is almost over, we can get back to this.

Our servers are chugging along, and our throughput has nearly doubled quite recently, due, I believe, to an influx of hosts from Charity Engine. Despite this, the load on our servers is fine. In the meantime, there have been publications in Nature and Science, and some exciting co-evolution results that relied heavily on R@h are under review right now. This research will hopefully have a huge positive impact in the future and make good use of DNA sequence data. There are videos and articles explaining this in a recent Science publication, as noted in our news.

We are not sure if hiring a developer specifically for the task of optimization will work as DB has mentioned it has been tried before but with the complexity and dynamics of Rosetta, it has been difficult to accomplish.
ID: 80471
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,552,383
RAC: 6,167
Message 80472 - Posted: 3 Aug 2016, 17:27:50 UTC - in response to Message 80471.  

Rosetta is constantly evolving and keeping up with the latest optimization technology and developing/testing/maintaining many different builds would be a very complex task which we currently don't have the resources for aside from taking away resources for research......We are not sure if hiring a developer specifically for the task of optimization will work as DB has mentioned it has been tried before but with the complexity and dynamics of Rosetta, it has been difficult to accomplish.


Ok, we understand:
- no optimizations in near future
- no new servers/update servers

Please, close my thread about crowdfunding, it's a waste of time.
And, personally, I'll stop bothering you.



ID: 80472
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 80473 - Posted: 3 Aug 2016, 17:38:11 UTC - in response to Message 80472.  
Last modified: 3 Aug 2016, 17:41:02 UTC

Rosetta is constantly evolving and keeping up with the latest optimization technology and developing/testing/maintaining many different builds would be a very complex task which we currently don't have the resources for aside from taking away resources for research......We are not sure if hiring a developer specifically for the task of optimization will work as DB has mentioned it has been tried before but with the complexity and dynamics of Rosetta, it has been difficult to accomplish.


Ok, we understand:
- no optimizations in near future
- no new servers/update servers

Please, close my thread about crowdfunding, it's a waste of time.
And, personally, I'll stop bothering you.





We are going to update the database and file servers. And rjs5 may be able to help us further with optimizations. And you are not bothering at all, we appreciate the discussion and input! It's all with good intentions. And crowd funding may be promising. I haven't had a chance to read that thread yet. But we did look into our donations and it's under a couple grand I believe so more would help!
ID: 80473
Dr. Merkwürdigliebe
Joined: 5 Dec 10
Posts: 81
Credit: 2,657,273
RAC: 0
Message 80474 - Posted: 3 Aug 2016, 17:54:14 UTC - in response to Message 80473.  
Last modified: 3 Aug 2016, 17:55:29 UTC

I haven't had a chance to read that thread yet. But we did look into our donations and it's under a couple grand I believe so more would help!

It's good to hear from you. :-)

I think experiment.com would be a good place to start. There are already crowdfunding campaigns from other universities.

Random pick:

Bacterial Vesicular Delivery: A One-Step Protein Transport Method

Kickstarter et al. are probably not the right choice: there are some hefty fees and they are not really science-related.

The r@h users usually participate in other forums, too, so it would be a good idea to come up with a standardized signature for those forum posts and advertise a little ;-)
ID: 80474
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 80478 - Posted: 3 Aug 2016, 18:17:25 UTC - in response to Message 80472.  

Ok, we understand:
- no optimizations in near future
- no new servers/update servers

I think that is overly negative. If they can make scientific progress by changing around their present applications, that may be a more productive use of their time than optimizing their applications. The latter might tend to freeze the present science in place rather than allowing it to advance (I don't know, but just raise the issue).

We are dealing with cutting-edge science here, not turning out widgets on a production line, and they have to be free to go where it leads them. But if they just need money for servers, I am sure we can help. Money is not always the limiting factor though.
ID: 80478
rjs5

Joined: 22 Nov 10
Posts: 273
Credit: 23,001,518
RAC: 6,291
Message 80486 - Posted: 5 Aug 2016, 1:14:31 UTC - in response to Message 80465.  

I believe the reference to upgrading the server was meant to indicate software updates that would support task routing by host type.

rjs5 will have to clarify, but I believe his study and figures are estimates. To track back to original source code and make the suggested change and measure results is another story. That makes miscommunication very easy too when one is looking at one end of the elephant, and the other is looking at the other (source code vs. executable code).


I think the 32-bit SSE2 applications can be shipped to any cruncher. It probably makes sense to build in the application routing for new and more highly optimized applications.

Level 1) What is being shipped today.

New level 2) applications modified for SSE2 PLUS vector padding

Fast level 3) AFTER ROUTING implemented ... application #2) but compiled for AVX2 for wider optimization and routed to AVX2 crunchers.
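
The three levels above amount to a small selection rule. A minimal sketch of that rule, assuming the server knows which CPU features a host has; `HostCpu`, `Build` and `choose_build` are hypothetical names for illustration and are not part of BOINC or Rosetta:

```cpp
// Hypothetical description of what a host reports about its CPU.
struct HostCpu {
    bool has_sse2;
    bool has_avx2;
};

// Hypothetical build tiers matching the three levels listed above.
enum class Build { Baseline, Sse2Padded, Avx2Padded };

// Pick the most optimized build the host can run. routing_enabled models
// whether the server is able to steer applications by host type; without
// it, the AVX2 build is never handed out.
inline Build choose_build(const HostCpu& cpu, bool routing_enabled) {
    if (routing_enabled && cpu.has_avx2) return Build::Avx2Padded;  // level 3
    if (cpu.has_sse2) return Build::Sse2Padded;                     // level 2
    return Build::Baseline;                                         // level 1
}
```

With routing switched off, an AVX2 host still receives the level 2 build, which matches the idea that the SSE2 application can go to any cruncher.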



The figures are my estimates based on analysis of dynamic execution profiling code over 1 hour Rosetta runs.

It is VERY, VERY, VERY HARD to assign a FIXED improvement since the machines are very different: microarchitecture, cache, memory size, disk type (HDD or SSD), and so on. When I started a couple of PrimeGrid jobs in parallel, they degraded Rosetta by 30%, so giving a single number is "tough".



ID: 80486
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,552,383
RAC: 6,167
Message 80530 - Posted: 11 Aug 2016, 7:16:43 UTC - in response to Message 80486.  

Level 1) What is being shipped today.


The great strength of the Rosetta code is that, with a single code base, you can crunch a lot of different and heterogeneous simulations. This strength, on the other hand, is also its biggest weakness: a lot of different needs create a lot of "fluffy" code.
A solution may be to split the code into different specialized apps (one for ab initio, one for folding, one for docking, etc.), like A LOT of BOINC projects do.
ID: 80530
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,552,383
RAC: 6,167
Message 80585 - Posted: 1 Sep 2016, 15:24:48 UTC - in response to Message 80449.  
Last modified: 1 Sep 2016, 15:25:24 UTC

Developer "F": commenting on my recommendation for homogeneous coordinates ...
"Storing 3D Cartesian coordinates as homogeneous coordinates is well established practice. For example, Eigen::Geometry uses homogeneous coordinates in geometric expressions to support SIMD parallelism."


Eigen is very powerful (TensorFlow uses it, for example), but I don't know if Rosy's team uses it
http://programmingexamples.net/wiki/Eigen
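
A minimal sketch of what a padded, homogeneous coordinate type could look like in plain C++, without Eigen. The names `Vec4`, `Mat4`, `xyzVectorPadded`, `make_point` and `transform` are made up for illustration and are not taken from Rosetta or Eigen; whether the compiler actually vectorizes the multiply depends on the compiler and its flags.

```cpp
#include <cstddef>

// Hypothetical coordinate type: x, y, z plus a fourth "w" lane.
// For double, the alignment is 32 bytes, so one point fills exactly one
// 256-bit AVX register.
template <typename T>
struct alignas(4 * sizeof(T)) Vec4 {
    T v[4];  // v[3] is the padding / homogeneous component
};

// Hypothetical 4x4 row-major transform (rotation plus translation).
template <typename T>
struct Mat4 {
    T m[4][4];
};

using xyzVectorPadded = Vec4<double>;  // illustrative typedef only
static_assert(sizeof(xyzVectorPadded) == 32, "one point, one AVX register");

// Apply a transform to a point. The fixed trip count of 4 and the
// contiguous, aligned storage give the compiler a simple loop shape.
template <typename T>
Vec4<T> transform(const Mat4<T>& m, const Vec4<T>& p) {
    Vec4<T> out{};  // value-initialized to zeros
    for (std::size_t row = 0; row < 4; ++row)
        for (std::size_t col = 0; col < 4; ++col)
            out.v[row] += m.m[row][col] * p.v[col];
    return out;
}

// Build a point with w = 1 so that a translation stored in the last
// matrix column is applied by the same multiply.
inline xyzVectorPadded make_point(double x, double y, double z) {
    return xyzVectorPadded{{x, y, z, 1.0}};
}
```

Because only the typedef changes, code that uses the coordinate type through its named interface would not need to be rewritten, which is the appeal of the approach described in the quote.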
ID: 80585
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,552,383
RAC: 6,167
Message 80692 - Posted: 29 Sep 2016, 14:50:51 UTC

ID: 80692
sgaboinc

Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 80809 - Posted: 30 Oct 2016, 19:07:48 UTC
Last modified: 30 Oct 2016, 19:10:17 UTC

I got a little too curious about AVX / AVX2 and decided to do a little experiment:

I made a little program that multiplies a 4x4 matrix by a 4x1 vector. It has 2 subroutines: one does it using simple loops, the other tries to be as 'AVX' as possible. Each is run a hundred times and the CPU cycles are counted.

#include <iostream>
#include <x86intrin.h>   // __rdtsc
#include <immintrin.h>   // AVX2 / FMA intrinsics
using namespace std;
#define ALIGN __attribute__ ((aligned (32)))
void matrix_multiply(double *mat, double *vec, double *res);
void matrix_mul_simd(double *mat, double *vec, double *res);
void print(double res[4]);

int main() {
	unsigned long long timestm, delta;
	cout << "avxtest" << endl; // prints avxtest
	double ALIGN mat[4][4] = {{1.0, 2.0, 3.0, 4.0}, {5.0, 6.0, 7.0, 8.0},
			{9.0, 10.0, 11.0, 12.0}, {13.0, 14.0, 15.0, 16.0}};
	double ALIGN vec[4] = {1.0, 2.0, 3.0, 4.0}, res[4] = { 0.0, 0.0, 0.0, 0.0};
	// note: print(res) sits inside each timed region, so the cycle counts
	// include the console output as well as the 100 multiplications
	timestm = __rdtsc();
	for(int i=0;i<100; i++)
		matrix_multiply((double *)&mat, (double *)&vec, (double *)&res);
	print(res);
	delta = __rdtsc() - timestm;
	cout << "Loop: " << delta << endl;
	timestm = __rdtsc();
	for(int i=0;i<100; i++)
		matrix_mul_simd((double *)&mat, (double *)&vec, (double *)&res);
	print(res);
	delta = __rdtsc() - timestm;
	cout << "AVX: " << delta << endl;
	return 0;
}

// print the 4 result values, comma separated
void print(double res[4]) {
	cout << "result: ";
	for(int i=0; i<4; i++) {
		if (i > 0)	cout << ", ";
		cout << res[i] ; }
	cout << endl;
}

// res = mat * vec using plain loops; mat is 4x4, row-major
void matrix_multiply(double *mat, double *vec, double *res) {
	int i, j;
	for(i=0; i<4; i++) *(res+i) = 0;
	for(i=0; i<4; i++) {
		for(j=0; j<4; j++) {
			*(res+i) += *(mat + j + i*4) * *(vec+j);
		}}}

// res = mat * vec using AVX2 gather + FMA; res must be 32-byte aligned
void matrix_mul_simd(double *mat, double *vec, double *res) {
	double ALIGN t[4] = {0.0, 0.0, 0.0, 0.0};

	// accumulator r starts at zero
	__m256d r = _mm256_broadcast_sd(&t[0]);
	// element offsets 0, 4, 8, 12: one entry from each row of mat
	__m128i d = _mm256_castsi256_si128 (_mm256_set_epi32 (0, 0, 0, 0, 12, 8, 4, 0));
	for(int i=0; i<4; i++) {
		// v = vec[i] in all 4 lanes, a = column i of mat
		__m256d v = _mm256_broadcast_sd(vec+i);
		__m256d a = _mm256_i32gather_pd (mat + i, d, 8);

		// r += a * v (fused multiply-add on 4 doubles)
		r = _mm256_fmadd_pd(a,v,r);
	}
	_mm256_store_pd(res, r);
}


compile and run the code with GCC (on Linux)
> g++ -O2 -mavx2 -mavx -mfma  -o avxtest avxtest.cpp 
> g++ -O2 -mavx2 -mavx -mfma  -S avxtest.cpp 
> ./avxtest 
avxtest
result: 30, 70, 110, 150
Loop: 96439 << these are cpu cycles
result: 30, 70, 110, 150
AVX: 71015 << these are cpu cycles
> ./avxtest 
avxtest
result: 30, 70, 110, 150
Loop: 95024
result: 30, 70, 110, 150
AVX: 64296
> ./avxtest 
avxtest
result: 30, 70, 110, 150
Loop: 68096
result: 30, 70, 110, 150
AVX: 53596


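so AVX didn't really reduce it to a small fraction; the differences are perhaps marginal.

And as it turns out, GCC made use of the new instructions in the 'loop' code as well: the inner loop is compiled to an FMA instruction, although vfmadd231sd is the scalar form that handles one double at a time rather than a full 4-wide vector (excerpts from the generated assembly):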
        .globl  _Z15matrix_multiplyPdS_S_
        .type   _Z15matrix_multiplyPdS_S_, @function
_Z15matrix_multiplyPdS_S_:
.LFB2030:
        .cfi_startproc
        movq    $0, (%rdx)
        movq    $0, 8(%rdx)
        xorl    %ecx, %ecx
        movq    $0, 16(%rdx)
        movq    $0, 24(%rdx)
.L16:
        vmovsd  (%rdx,%rcx), %xmm0
        xorl    %eax, %eax
.L19:
        vmovsd  (%rdi,%rax), %xmm1
        vfmadd231sd     (%rsi,%rax), %xmm1, %xmm0
        addq    $8, %rax
        vmovsd  %xmm0, (%rdx,%rcx)


and this is the part that is hand optimised
        .globl  _Z15matrix_mul_simdPdS_S_
        .type   _Z15matrix_mul_simdPdS_S_, @function
_Z15matrix_mul_simdPdS_S_:
.LFB2031:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        vxorpd  %xmm0, %xmm0, %xmm0
        xorl    %eax, %eax
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        andq    $-32, %rsp
        addq    $16, %rsp
        vcmppd  $0, %ymm0, %ymm0, %ymm3
        movq    $0, -48(%rsp)
        movq    $0, -40(%rsp)
        vmovdqa .LC3(%rip), %xmm4
        movq    $0, -32(%rsp)
        movq    $0, -24(%rsp)
.L23:
        leaq    (%rdi,%rax), %rcx
        vmovapd %ymm3, %ymm5
        vbroadcastsd    (%rsi,%rax), %ymm1
        addq    $8, %rax
        vgatherdpd      %ymm5, (%rcx,%xmm4,8), %ymm2
        cmpq    $32, %rax
        vfmadd231pd     %ymm1, %ymm2, %ymm0
        jne     .L23
        vmovapd %ymm0, (%rdx)
        vzeroupper
        leave


conclusions:
1) GCC/G++ is pretty (very) 'smart': if you simply select options such as -O2 -mavx2 -mavx -mfma, the compiler uses the newer instructions (FMA in this case) for plain loops all by itself, even though in the listing above they are the scalar forms rather than full-width vector operations.
2) the hand-optimised code seemed fairly consistent in taking less time, but this is a 'toy' problem. a real problem may take an exorbitant effort to optimise.

it would seem good to let GCC / compilers do the optimizations where convenient / appropriate.

and an app like r@h needs to run on a large number of platforms, some (many) of which do not have AVX, let alone AVX2. we would not want the app to 'crash' on those platforms simply because they lack AVX/AVX2. Hence, such optimization is a compromise of sorts: the same program may need to contain both (AVX) optimised code and non-optimised code. the run-time switching may incur some additional performance penalty. nevertheless, 'safe' compiler optimizations are probably a 'good thing', e.g. a hybrid build that uses AVX where available but also runs on non-AVX platforms.
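
A minimal sketch of such a 'hybrid' binary, assuming GCC or Clang on x86: one function is compiled for AVX2 through the `target` attribute, and `__builtin_cpu_supports` picks it at run time only when the CPU has AVX2. The names `axpy_generic`, `axpy_avx2` and `select_axpy` are made up for illustration; on other compilers or architectures the sketch simply falls back to the plain loop.

```cpp
#include <cstddef>

// Portable fallback: plain loop, runs on any CPU.
static void axpy_generic(double a, const double* x, double* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) y[i] += a * x[i];
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
// Same loop, but only this function is compiled with AVX2 enabled, so the
// rest of the binary keeps the baseline instruction set.
__attribute__((target("avx2")))
static void axpy_avx2(double a, const double* x, double* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) y[i] += a * x[i];
}
#endif

using AxpyFn = void (*)(double, const double*, double*, std::size_t);

// Decide once which variant to use; afterwards callers pay only for an
// indirect call, not for a feature check per invocation.
static AxpyFn select_axpy() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return axpy_avx2;
#endif
    return axpy_generic;
}
```

A caller would store the result of `select_axpy()` once at start-up and use that pointer everywhere, so the same executable never runs an AVX2 instruction on a host that lacks it.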
ID: 80809
sgaboinc

Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 80810 - Posted: 30 Oct 2016, 19:54:57 UTC
Last modified: 30 Oct 2016, 20:09:23 UTC

Intel CPUs can operate on 4 double-precision values per SIMD instruction with AVX2, 'per core'. i'm not sure whether things like instruction-level parallelism and hyperthreading (makes it 2 cores of AVX2?) could make that even 'more parallel'.

but i've also run some other 'benchmark' apps (e.g. http://www.openblas.net/) and noted that AVX/AVX2 depends on the problem being able to run 'completely in the cpu' without needing to touch RAM or disk (slowest). the cases that truly reach > 100 (say close to 200) Gflops even on 'average' i7 desktops are *multiplying large square matrices*, mostly dense square matrices of large dimensions, say 10,000 x 10,000 (i.e. 10,000 unknowns). tiny matrices like 4x4 show little if any perceptible performance gain from AVX. Other overheads such as disk I/O far outweigh the time needed to work on that 4x4 matrix.

I'd guess, in the same light, that if r@h problem scenarios fit those *special cases*, such as multiplying 10,000 x 10,000 square matrices, AVX/AVX2 (SIMD) may turn out to be a significant advantage. And if the matrix dimensions are even larger, high-end GPUs may show (very) significant performance gains over the CPU, but at the cost of much higher power consumption (e.g. 200-300 watts for the GPU card alone).
but as it stands, i'd think the 'problem' would need to be expressible as 'multiplying large square matrices'. not all problems are that simple: some have if-then-else dependencies, and in others the next iteration of a 'small' problem depends on the results of the previous iteration, which probably makes them 'impossible' to vectorize.
there is, of course, also the additional effort needed to study the code and make those changes, which may be non-trivial.
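
A minimal sketch of the two loop shapes contrasted above: the first has independent iterations, the second carries its result from one iteration to the next. The function names and the particular recurrence are chosen only for illustration.

```cpp
#include <cstddef>
#include <vector>

// Independent iterations: element i never reads another element's result,
// so a compiler is free to process several elements per SIMD instruction.
inline void scale_all(std::vector<double>& v, double k) {
    for (std::size_t i = 0; i < v.size(); ++i) v[i] *= k;
}

// Loop-carried dependency: step i needs the value produced by step i-1,
// so the iterations of this loop have to run one after another.
inline double iterate_map(double x0, double r, std::size_t steps) {
    double x = x0;
    for (std::size_t i = 0; i < steps; ++i) x = r * x * (1.0 - x);
    return x;
}
```

Wider vector registers help the first loop directly; for the second, they only help if many independent starting values can be advanced side by side.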
ID: 80810
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,552,383
RAC: 6,167
Message 80811 - Posted: 31 Oct 2016, 9:06:29 UTC - in response to Message 80809.  

and an app like r@h needs to run on a large number of platforms, some (many) of which do not have AVX, let alone AVX2. we would not want the app to 'crash' on those platforms simply because they lack AVX/AVX2. Hence, such optimization is a compromise of sorts: the same program may need to contain both (AVX) optimised code and non-optimised code.


No problem, 2 apps, updated scheduler recognizes correctly the cpu and sends the right app.
I think, to start, SSE3 will be enough....

ID: 80811
sgaboinc

Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 80812 - Posted: 31 Oct 2016, 9:37:27 UTC
Last modified: 31 Oct 2016, 10:05:39 UTC

same avxtest as 2 posts earlier, but running each subroutine 1000 times. the results are much closer. it shows that the GCC/G++ optimised code is pretty much as good as the hand-optimised code. However, GCC/G++ may not 'catch' all loops and optimise them, especially in 'real world' problems where it isn't this simple to 'unroll' the loops.

> ./avxtest 
avxtest
result: 30, 70, 110, 150
Loop: 103224
result: 30, 70, 110, 150
AVX: 100612
> ./avxtest 
avxtest
result: 30, 70, 110, 150
Loop: 107428
result: 30, 70, 110, 150
AVX: 108980
> ./avxtest 
avxtest
result: 30, 70, 110, 150
Loop: 111144
result: 30, 70, 110, 150
AVX: 108093
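
In the avxtest listing, `print(res)` sits between the two `__rdtsc()` reads, so console output is part of every measurement. A minimal sketch of a harness that keeps output outside the timed region, assuming `std::chrono::steady_clock` is an acceptable stand-in for `__rdtsc`; `matvec4` and `time_matvec` are names made up here.

```cpp
#include <chrono>
#include <cstdint>

// Plain 4x4 (row-major) matrix times 4x1 vector, as in the loop version.
inline void matvec4(const double* mat, const double* vec, double* res) {
    for (int i = 0; i < 4; ++i) {
        double sum = 0.0;
        for (int j = 0; j < 4; ++j) sum += mat[i * 4 + j] * vec[j];
        res[i] = sum;
    }
}

// Time `reps` multiplications and return the elapsed nanoseconds.
// Nothing is printed inside the timed region; the volatile sink keeps the
// compiler from discarding the repeated work.
inline std::int64_t time_matvec(int reps, double* res_out) {
    const double mat[16] = {1.0,  2.0,  3.0,  4.0,  5.0,  6.0,  7.0,  8.0,
                            9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0};
    const double vec[4] = {1.0, 2.0, 3.0, 4.0};
    volatile double sink = 0.0;
    const auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        matvec4(mat, vec, res_out);
        sink = sink + res_out[0];
    }
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
```

Printing the result after `time_matvec` returns keeps the I/O cost out of the number being compared, and dividing by `reps` gives a per-call figure that can be compared across run lengths.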
ID: 80812
sgaboinc

Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 80813 - Posted: 31 Oct 2016, 9:50:32 UTC - in response to Message 80811.  
Last modified: 31 Oct 2016, 10:08:02 UTC


No problem, 2 apps, updated scheduler recognizes correctly the cpu and sends the right app.
I think, to start, SSE3 will be enough....


the other issue, though, is that it does not seem possible (yet) for BOINC to distribute apps based on cpu 'architecture' (i.e. whether it has SSE, AVX, AVX2, FMA etc.); currently the only distinction that seems possible is 32 bits or 64 bits. yup, going 64 bits, especially on the Windows platform (which today is 32 bits), would likely bring some gains.

A 'hybrid' app (for SSE/AVX or none) is possibly more appropriate, because distributing apps is a 'logistics' issue of its own. it is just that a hybrid app depends on compiler capabilities, and if the compiler can't do it on its own, the code may need to be 'hand tuned'.
ID: 80813
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,552,383
RAC: 6,167
Message 80814 - Posted: 31 Oct 2016, 15:01:51 UTC - in response to Message 80813.  

the other issue, though, is that it does not seem possible (yet) for BOINC to distribute apps based on cpu 'architecture' (i.e. whether it has SSE, AVX, AVX2, FMA etc.); currently the only distinction that seems possible is 32 bits or 64 bits. yup, going 64 bits, especially on the Windows platform (which today is 32 bits), would likely bring some gains.


A little help with app_config.... :-)

ID: 80814
sgaboinc

Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 80840 - Posted: 12 Nov 2016, 17:17:16 UTC
Last modified: 12 Nov 2016, 17:40:23 UTC

here is a very interesting set of slides on *AVX/AVX2* from 2014, from the HPC (high performance computing) people at CERN who deal with *physics*:

Haswell Conundrum: AVX or not AVX?
https://indico.cern.ch/event/327306/contributions/760669/attachments/635800/875267/HaswellConundrum.pdf
Conclusions

– Free lunch is over
  » In 2 years the computational power of Intel workstations has increased by 30% max (including core count and freq-boost)
  » For servers even less
– Power management affects individual components:
  » Achieving maximal throughput requires to make choices among features to activate
– Memory wall is higher than ever
  » HSW improves on instruction caching though..
– Wide SIMD vectors are effective only for highly specialized code
– Little support for this new brave world in generic high level languages and libraries

Summary

– Haswell is a great new Architecture:
  » Not because of AVX
– Long SIMD vectors are worth only for intensive vectorized code
  » Are not GPUs then a better option?
– Power Management cannot be ignored while assessing computational efficiency
– On modern architecture, extrapolation based on synthetic benchmarks is mission impossible


they are on BOINC too and you can run their simulations:
http://atlasathome.cern.ch/

that *special scenario* is apparently things like the Linpack benchmark, which depends heavily on the subroutine DGEMM (double precision general matrix multiplication), e.g. multiplying very *big/large* *square matrices*, say 10,000 x 10,000

https://www.pugetsystems.com/labs/hpc/Haswell-Floating-Point-Performance-493/

once the math scenario falls outside this DGEMM use case of multiplying very big square matrices, all that vector / parallel cpu hardware, and even those extreme speed GPUs (*teraflops*), are simply *useless*. e.g. if you are trying to solve 2x2 matrices a billion times and the result of each iteration depends on the previous one, it would be just as slow as doing it in plain loops with no SSE, AVX or AVX2 lol

in short, everything from SSE/AVX up to those super high end, vectorized, extreme performance GPUs is only good if the whole world is simply DGEMM. too bad DGEMM covers only very few of the true real world scenarios lol
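
A rough way to put numbers on why large DGEMM is the friendly case is to compare floating-point operations per byte of data. A minimal sketch, assuming each matrix or vector is moved to or from memory exactly once (ideal caching); `gemm_intensity` and `gemv_intensity` are names made up here.

```cpp
#include <cassert>

// Flops per byte for a dense n x n times n x n double-precision multiply:
// 2*n^3 operations on three n x n matrices of 8-byte values.
inline double gemm_intensity(double n) {
    assert(n > 0.0);
    return (2.0 * n * n * n) / (3.0 * n * n * 8.0);
}

// Flops per byte for an n x n matrix times an n-vector:
// 2*n^2 operations on n^2 + 2*n values of 8 bytes each.
inline double gemv_intensity(double n) {
    assert(n > 0.0);
    return (2.0 * n * n) / ((n * n + 2.0 * n) * 8.0);
}
```

Under these assumptions a 10,000 x 10,000 multiply performs roughly 833 operations per byte, while the 4x4 matrix-vector product from the earlier test performs about 0.17, so the small case has far less arithmetic available to hide the cost of moving data.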
ID: 80840
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,552,383
RAC: 6,167
Message 80848 - Posted: 15 Nov 2016, 7:49:22 UTC

Interesting tests, but every simulation is different (denis@home accelerated its computation 10 times with SSE3), so the results may be different in Rosetta's environment.
In the end we "know" that AVX/AVX2/AVX-512 requires a large refactoring of the code, while SSE2, for example, needs only a compiler flag when recompiling....
ID: 80848


