Rosetta@home using AVX / AVX2 ?

Author	Message
[VENETO] boboviz Joined: 1 Dec 05 Posts: 2207 Credit: 13,720,774 RAC: 3	Message 80461 - Posted: 2 Aug 2016, 10:13:14 UTC - in response to Message 80459.
> There are MANY things that can be done to significantly improve performance without a major rewrite.
So, what's the problem??
> The first change I talked about was introducing "homogeneous coordinates". This is very nice because it does not "really" change the "project code". You can introduce the C++ TEMPLATE typedef changes, recompile and you should get the EXACT SAME ANSWER with the new compile options.
So, again, what's the problem??
> The second place where substantial improvement can be accomplished with little effort is by upgrading the server to steer optimized applications to target crunchers.
Waiting for crowdfunding :-P
ID: 80461 · Rating: 0

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 80465 - Posted: 3 Aug 2016, 0:59:00 UTC
I believe the reference to upgrading the server was meant to indicate software updates that would support task routing by host type. rjs5 will have to clarify, but I believe his study and figures are estimates. To track back to original source code and make the suggested change and measure results is another story. That makes miscommunication very easy too when one is looking at one end of the elephant, and the other is looking at the other (source code vs. executable code).
Rosetta Moderator: Mod.Sense
ID: 80465 · Rating: 0

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2207 Credit: 13,720,774 RAC: 3	Message 80468 - Posted: 3 Aug 2016, 7:55:54 UTC - in response to Message 80465.
> I believe the reference to upgrading the server was meant to indicate software updates that would support task routing by host type.
Crowdfunding could also cover software (a RHEL license, for example, or a new Visual Studio license).
> To track back to original source code and make the suggested change and measure results is another story. That makes miscommunication very easy too when one is looking at one end of the elephant, and the other is looking at the other (source code vs. executable code).
Are they scared to "fork the code" and try it? Is it a waste of resources? I think that, for example, the admins lost a lot of time and resources with the Android version. IMHO.
ID: 80468 · Rating: 0

David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0	Message 80471 - Posted: 3 Aug 2016, 17:13:18 UTC
Rosetta is constantly evolving, and keeping up with the latest optimization technology and developing/testing/maintaining many different builds would be a very complex task which we currently don't have the resources for aside from taking away resources for research. rjs5 has put in a huge effort to help and look into optimization possibilities with Rosetta. Now that CASP is almost over, we can get back to this.
Our servers are chugging along and our throughput has nearly doubled quite recently, relatively speaking, due to an influx of hosts from Charity Engine, I believe. Despite this, the load on our servers is fine.
In the meantime, there have been publications in Nature and Science and some exciting results with co-evolution that are under review right now, all of which relied heavily on R@h. This research will hopefully have a huge positive impact in the future and make good use of DNA sequence data. There are videos and articles explaining this in a recent Science publication, as noted in our news.
We are not sure if hiring a developer specifically for the task of optimization will work. As DB has mentioned, it has been tried before, but with the complexity and dynamics of Rosetta it has been difficult to accomplish.
ID: 80471 · Rating: 0

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2207 Credit: 13,720,774 RAC: 3	Message 80472 - Posted: 3 Aug 2016, 17:27:50 UTC - in response to Message 80471.
> Rosetta is constantly evolving, and keeping up with the latest optimization technology and developing/testing/maintaining many different builds would be a very complex task which we currently don't have the resources for aside from taking away resources for research......We are not sure if hiring a developer specifically for the task of optimization will work as DB has mentioned it has been tried before but with the complexity and dynamics of Rosetta, it has been difficult to accomplish.
Ok, we understand:
- no optimizations in the near future
- no new servers/server updates
Please close my thread about crowdfunding, it's a waste of time. And, personally, I'll stop bothering you.
ID: 80472 · Rating: 0

David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0	Message 80473 - Posted: 3 Aug 2016, 17:38:11 UTC - in response to Message 80472. Last modified: 3 Aug 2016, 17:41:02 UTC
>> Rosetta is constantly evolving, and keeping up with the latest optimization technology and developing/testing/maintaining many different builds would be a very complex task which we currently don't have the resources for aside from taking away resources for research......We are not sure if hiring a developer specifically for the task of optimization will work as DB has mentioned it has been tried before but with the complexity and dynamics of Rosetta, it has been difficult to accomplish.
> Ok, we understand:
> - no optimizations in the near future
> - no new servers/server updates
> Please close my thread about crowdfunding, it's a waste of time. And, personally, I'll stop bothering you.
We are going to update the database and file servers. And rjs5 may be able to help us further with optimizations. And you are not bothering us at all, we appreciate the discussion and input! It's all with good intentions. And crowdfunding may be promising. I haven't had a chance to read that thread yet. But we did look into our donations and it's under a couple grand, I believe, so more would help!
ID: 80473 · Rating: 0

Dr. Merkwürdigliebe Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0	Message 80474 - Posted: 3 Aug 2016, 17:54:14 UTC - in response to Message 80473. Last modified: 3 Aug 2016, 17:55:29 UTC
> I haven't had a chance to read that thread yet. But we did look into our donations and it's under a couple grand I believe so more would help!
It's good to hear from you. :-)
I think experiment.com would be a good place to start. There are already crowdfunding campaigns from other universities. Random pick: Bacterial Vesicular Delivery: A One-Step Protein Transport Method
Kickstarter et al. are probably not the right choice: there are some hefty fees and they are not really science-related.
The r@h users usually participate in other forums, too, so it would be a good idea to come up with a standardized signature for those forum posts and advertise a little ;-)
ID: 80474 · Rating: 0

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 80478 - Posted: 3 Aug 2016, 18:17:25 UTC - in response to Message 80472.
> Ok, we understand:
> - no optimizations in near future
> - no new servers/update servers
I think that is overly negative. If they can make scientific progress by changing around their present applications, that may be a more productive use of their time than optimizing their applications. The latter might tend to freeze the present science in place rather than allowing it to advance (I don't know, but just raise the issue).
We are dealing with cutting-edge science here, not turning out widgets on a production line, and they have to be free to go where it leads them. But if they just need money for servers, I am sure we can help. Money is not always the limiting factor though.
ID: 80478 · Rating: 0

rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0	Message 80486 - Posted: 5 Aug 2016, 1:14:31 UTC - in response to Message 80465.
> I believe the reference to upgrading the server was meant to indicate software updates that would support task routing by host type. rjs5 will have to clarify, but I believe his study and figures are estimates. To track back to original source code and make the suggested change and measure results is another story. That makes miscommunication very easy too when one is looking at one end of the elephant, and the other is looking at the other (source code vs. executable code).
I think the 32-bit SSE2 applications can be shipped to any cruncher. It probably makes sense to build in the application routing for new and more highly optimized applications.
Level 1) What is being shipped today.
New level 2) Applications modified for SSE2 PLUS vector padding.
Fast level 3) AFTER ROUTING is implemented ... application #2) but compiled for AVX2 for wider optimization and routed to AVX2 crunchers.
The figures are my estimates based on analysis of dynamic execution profiling over 1-hour Rosetta runs. It is VERY, VERY, VERY HARD to assign a FIXED improvement since the machines are very different ... microarchitecture, cache, memory sizes, disk type (HDD or SSD), .... When I started a couple of PrimeGrid jobs in parallel, they degraded Rosetta by 30% ... so giving a single number is "tough".
ID: 80486 · Rating: 0
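
[Editor's sketch] The three levels in the post above amount to a small routing decision on the server side. The following is only an illustration under stated assumptions, not BOINC scheduler code: the enum `AppTier`, the functions `parse_features` and `choose_tier`, and the idea that the host reports its CPU features as a space-separated string containing tokens such as `sse2` and `avx2` are all hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// Hypothetical tiers mirroring the three levels described in the post.
enum class AppTier { Level1Current, Level2PaddedSSE2, Level3AVX2 };

// Split a space-separated CPU feature string into a set for easy lookup.
// (Assumption: the host reports features in this form.)
std::set<std::string> parse_features(const std::string& s) {
    std::istringstream in(s);
    std::set<std::string> out;
    std::string token;
    while (in >> token) out.insert(token);
    return out;
}

// Pick the most optimized tier the host can run.
// routing_enabled models "AFTER ROUTING is implemented": without it,
// nothing beyond the SSE2 builds is ever handed out.
AppTier choose_tier(const std::string& host_features, bool routing_enabled) {
    const std::set<std::string> f = parse_features(host_features);
    if (routing_enabled && f.count("avx2")) return AppTier::Level3AVX2;
    if (f.count("sse2")) return AppTier::Level2PaddedSSE2;
    return AppTier::Level1Current;  // fallback: whatever is shipped today
}
```

The point of the ordering is that a host is never sent a build it cannot execute: the AVX2 tier is only reachable when routing exists and the feature is reported, and everything else falls back to a build that runs everywhere.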

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2207 Credit: 13,720,774 RAC: 3	Message 80530 - Posted: 11 Aug 2016, 7:16:43 UTC - in response to Message 80486.
> Level 1) What is being shipped today.
The great strength of the Rosetta code is that, with a single code base, you can crunch a lot of different and heterogeneous simulations. This strength, on the other hand, is also its biggest weakness: a lot of different needs create a lot of "fluffy" code.
A solution may be to split the code into different specialized apps (one for ab initio, one for folding, one for docking, etc.) like A LOT of BOINC projects do.
ID: 80530 · Rating: 0

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2207 Credit: 13,720,774 RAC: 3	Message 80585 - Posted: 1 Sep 2016, 15:24:48 UTC - in response to Message 80449. Last modified: 1 Sep 2016, 15:25:24 UTC
> Developer "F": commenting on my recommendation for homogeneous coordinates ... "Storing 3d cartesian coordinates as homogenous coordinates is well established practice. For example, Eigen::Geometry using homogenous coordinates in geometric expressions to support SIMD parallelism."
Eigen is very powerful (TensorFlow uses it, for example), but I don't know if Rosy's team uses it.
http://programmingexamples.net/wiki/Eigen
ID: 80585 · Rating: 0
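
[Editor's sketch] To make the "homogeneous coordinates" and "vector padding" idea concrete: a 3D point is stored as four doubles instead of three, so one point fills exactly one 256-bit register and a rotation plus translation becomes a single 4x4 matrix product. This is a minimal sketch assuming nothing about Rosetta's real types; `Vec4`, `Mat4`, `make_point` and `transform` are hypothetical names, and the code does not use Eigen.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical padded point: x, y, z plus a fourth component w.
// alignas(32) lets one point be loaded as a single aligned 256-bit vector.
struct alignas(32) Vec4 {
    double v[4];
};

// Points carry w = 1 (directions would carry w = 0), the usual convention.
inline Vec4 make_point(double x, double y, double z) {
    return {{x, y, z, 1.0}};
}

// Rotation and translation folded into one row-major 4x4 matrix.
struct alignas(32) Mat4 {
    double m[4][4];
};

// Fixed trip counts and no branches: a loop shape that compilers can
// often vectorize without hand-written intrinsics.
inline Vec4 transform(const Mat4& a, const Vec4& p) {
    Vec4 r{{0.0, 0.0, 0.0, 0.0}};
    for (std::size_t i = 0; i < 4; ++i)
        for (std::size_t j = 0; j < 4; ++j)
            r.v[i] += a.m[i][j] * p.v[j];
    return r;
}
```

For example, a matrix that is the identity with (5, 6, 7) in the last column moves the point (1, 2, 3) to (6, 8, 10) and leaves w at 1. The padding costs one extra double per point, which is the memory-for-regularity trade the quoted developer describes.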

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2207 Credit: 13,720,774 RAC: 3	Message 80692 - Posted: 29 Sep 2016, 14:50:51 UTC
Interesting historical article... Vectors: How the Old Became New Again in Supercomputing
ID: 80692 · Rating: 0

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 80809 - Posted: 30 Oct 2016, 19:07:48 UTC Last modified: 30 Oct 2016, 19:10:17 UTC
i got a little too curious about AVX / AVX2 & decided to do a little experiment: i made a little program that multiplies a 4x4 matrix by a 4x1 vector. it has 2 subroutines: one does it using simple loops, the other tries to be as 'AVX' as possible. each is run a hundred times & the cpu cycles are counted.

    #include <iostream>
    //#include <ia32intrin.h>
    #include <x86intrin.h>
    #include <immintrin.h>
    using namespace std;

    #define ALIGN __attribute__ ((aligned (32)))

    void matrix_multiply(double *mat, double *vec, double *res);
    void matrix_mul_simd(double *mat, double *vec, double *res);
    void print(double res[4]);

    int main() {
        unsigned long long timestm, delta;
        cout << "avxtest" << endl; // prints avxtest
        double ALIGN mat[4][4] = {{1.0, 2.0, 3.0, 4.0},
                                  {5.0, 6.0, 7.0, 8.0},
                                  {9.0, 10.0, 11.0, 12.0},
                                  {13.0, 14.0, 15.0, 16.0}};
        double ALIGN vec[4] = {1.0, 2.0, 3.0, 4.0}, res[4] = { 0.0, 0.0, 0.0, 0.0};

        // time 100 calls of the plain-loop version
        timestm = __rdtsc();
        for(int i=0;i<100; i++)
            matrix_multiply((double *)&mat, (double *)&vec, (double *)&res);
        print(res);
        delta = __rdtsc() - timestm;
        cout << "Loop: " << delta << endl;

        // time 100 calls of the hand-written AVX2 version
        timestm = __rdtsc();
        for(int i=0;i<100; i++)
            matrix_mul_simd((double *)&mat, (double *)&vec, (double *)&res);
        print(res);
        delta = __rdtsc() - timestm;
        cout << "AVX: " << delta << endl;
        return 0;
    }

    void print(double res[4]) {
        cout << "result: ";
        for(int i=0; i<4; i++) {
            if (i > 0) cout << ", ";
            cout << res[i] ;
        }
        cout << endl;
    }

    // res = mat * vec, mat stored row-major as 16 doubles
    void matrix_multiply(double *mat, double *vec, double *res) {
        int i, j;
        for(i=0; i<4; i++)
            *(res+i) = 0;
        for(i=0; i<4; i++) {
            for(j=0; j<4; j++) {
                *(res+i) += *(mat + j + i*4) * *(vec+j);
            }
        }
    }

    // same product, one column of mat per iteration: r += column_i * vec[i]
    void matrix_mul_simd(double *mat, double *vec, double *res) {
        double ALIGN t[4] = {0.0, 0.0, 0.0, 0.0};
        __m256d r = _mm256_broadcast_sd(&t[0]);
        // element offsets 0, 4, 8, 12 pick one column out of the row-major matrix
        __m128i d = _mm256_castsi256_si128 (_mm256_set_epi32 (0, 0, 0, 0, 12, 8, 4, 0));
        for(int i=0; i<4; i++) {
            __m256d v = _mm256_broadcast_sd(vec+i);
            __m256d a = _mm256_i32gather_pd (mat + i, d, 8);
            r = _mm256_fmadd_pd(a,v,r);
        }
        _mm256_store_pd(res, r);
    }

compile and run the code in GCC (in Linux)

    > g++ -O2 -mavx2 -mavx -mfma -o avxtest avxtest.cpp
    > g++ -O2 -mavx2 -mavx -mfma -S avxtest.cpp
    > ./avxtest
    avxtest
    result: 30, 70, 110, 150
    Loop: 96439 << these are cpu cycles
    result: 30, 70, 110, 150
    AVX: 71015 << these are cpu cycles
    > ./avxtest
    avxtest
    result: 30, 70, 110, 150
    Loop: 95024
    result: 30, 70, 110, 150
    AVX: 64296
    > ./avxtest
    avxtest
    result: 30, 70, 110, 150
    Loop: 68096
    result: 30, 70, 110, 150
    AVX: 53596

so AVX didn't really reduce it to a small fraction, the differences are perhaps marginal. And as it turns out GCC is simply too 'smart': with these flags even the 'loop' code comes out using the new instructions, a fused multiply-add in its scalar form (vfmadd231sd) in the excerpt below (abstracts from the generated assembly)

        .globl _Z15matrix_multiplyPdS_S_
        .type _Z15matrix_multiplyPdS_S_, @function
    _Z15matrix_multiplyPdS_S_:
    .LFB2030:
        .cfi_startproc
        movq $0, (%rdx)
        movq $0, 8(%rdx)
        xorl %ecx, %ecx
        movq $0, 16(%rdx)
        movq $0, 24(%rdx)
    .L16:
        vmovsd (%rdx,%rcx), %xmm0
        xorl %eax, %eax
    .L19:
        vmovsd (%rdi,%rax), %xmm1
        vfmadd231sd (%rsi,%rax), %xmm1, %xmm0
        addq $8, %rax
        vmovsd %xmm0, (%rdx,%rcx)

and this is the part that is hand optimised

        .globl _Z15matrix_mul_simdPdS_S_
        .type _Z15matrix_mul_simdPdS_S_, @function
    _Z15matrix_mul_simdPdS_S_:
    .LFB2031:
        .cfi_startproc
        pushq %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        vxorpd %xmm0, %xmm0, %xmm0
        xorl %eax, %eax
        movq %rsp, %rbp
        .cfi_def_cfa_register 6
        andq $-32, %rsp
        addq $16, %rsp
        vcmppd $0, %ymm0, %ymm0, %ymm3
        movq $0, -48(%rsp)
        movq $0, -40(%rsp)
        vmovdqa .LC3(%rip), %xmm4
        movq $0, -32(%rsp)
        movq $0, -24(%rsp)
    .L23:
        leaq (%rdi,%rax), %rcx
        vmovapd %ymm3, %ymm5
        vbroadcastsd (%rsi,%rax), %ymm1
        addq $8, %rax
        vgatherdpd %ymm5, (%rcx,%xmm4,8), %ymm2
        cmpq $32, %rax
        vfmadd231pd %ymm1, %ymm2, %ymm0
        jne .L23
        vmovapd %ymm0, (%rdx)
        vzeroupper
        leave

conclusions:
1) GCC/G++ is pretty (very) 'smart': if you simply select optimizations, e.g. -O2 -mavx2 -mavx -mfma, the compiler can optimize the loops and use the AVX/FMA instructions all by itself
2) the hand optimised code seemed somewhat consistent in taking less time, but this is a 'toy' problem. a real problem may take an exorbitant effort to optimise.
it'd seem it'd be good to let GCC / compilers do the optimizations where convenient / appropriate.
and apps like r@h need to run on a large number of platforms, some (many) of which do not have AVX let alone AVX2. we'd not want the app to 'crash' on those platforms simply because they don't have AVX/AVX2. Hence, such optimization is a compromise of sorts: the same program may need to have both (AVX) optimised code and non-optimised code. the run time switching may incur some performance penalty in addition.
but nevertheless 'safe' compiler optimizations are probably a 'good thing', e.g. an app that contains hybrid optimised code, hopefully AVX, but runs on non-AVX platforms as well.
ID: 80809 · Rating: 0
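
[Editor's sketch] The "hybrid" program with run time switching described at the end of the post above can be illustrated with GCC/Clang function attributes and the `__builtin_cpu_supports` builtin. This is a sketch under assumptions, not how Rosetta is built: the names `matvec_base`, `matvec_avx2`, `MatVecFn` and `select_matvec` are hypothetical, and the AVX2 variant here simply lets the compiler use AVX2 for the same plain loop rather than using hand-written intrinsics.

```cpp
#include <cassert>
#include <cstddef>

// Baseline kernel: plain loops, compiled with the default flags,
// so it runs on any CPU the binary targets.
static void matvec_base(const double* mat, const double* vec, double* res) {
    for (std::size_t i = 0; i < 4; ++i) {
        double s = 0.0;
        for (std::size_t j = 0; j < 4; ++j) s += mat[i * 4 + j] * vec[j];
        res[i] = s;
    }
}

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
// Same source, but the compiler may use AVX2 instructions in this one
// function only. It must never be called on a CPU without AVX2.
__attribute__((target("avx2")))
static void matvec_avx2(const double* mat, const double* vec, double* res) {
    for (std::size_t i = 0; i < 4; ++i) {
        double s = 0.0;
        for (std::size_t j = 0; j < 4; ++j) s += mat[i * 4 + j] * vec[j];
        res[i] = s;
    }
}
#endif

using MatVecFn = void (*)(const double*, const double*, double*);

// Decide once at start-up; later calls go through the returned pointer,
// so the per-call cost of the switch is one indirect call.
MatVecFn select_matvec() {
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return matvec_avx2;
#endif
    return matvec_base;
}
```

A caller would do `MatVecFn f = select_matvec();` once and then call `f(mat, vec, res)` in the hot loop. On a host without AVX2 the baseline is picked, so the binary does not crash with an illegal-instruction fault there; the price is the indirect call and the larger executable the post alludes to.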

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 80810 - Posted: 30 Oct 2016, 19:54:57 UTC Last modified: 30 Oct 2016, 20:09:23 UTC
the intel cpus can work on 4 double precision values per SIMD instruction in AVX2 'per core'. i'm not sure if things like instruction level parallelism & hyperthreading (makes it 2 cores of AVX2?) could possibly make that even 'more parallel'.
but i've also run some other 'benchmark' apps (e.g. http://www.openblas.net/), and noted that things like AVX/AVX2 depend on problems being capable of 'completely running in the cpu' without needing to touch ram or disk (slowest). and for that matter the cases that truly see > 100 (say close to 200) Gflops on even the 'average' i7 desktops are multiplications of large square matrices, say 10,000 x 10,000 (i.e. 10,000 unknowns & dense matrices), while tiny matrices like 4x4 have little if any perceptible performance gain from AVX. The other overheads such as disk I/O far overwhelm the time to work that 4x4 matrix.
I'd guess in the same light that if r@h problem scenarios can fit those special cases, such as multiplying 10,000 x 10,000 square matrices, AVX/AVX2 (SIMD) may turn out to be a significant advantage. And if the square matrices' dimensions are even larger, possibly the high end GPUs may show (very) significant performance gains over CPU, but at a cost of much higher power consumption (e.g. 200-300 watts just on the GPU cards themselves).
but as it stands, i'd think the 'problem' would need to be expressible as 'multiplying large square matrices'. not all problems are that simple: some have if-then-else dependencies, and in yet others the next iteration of a 'small' problem depends on the results of the previous iteration, which makes it probably 'impossible' to vectorize.
there is, of course, the additional effort to study the code and make those changes, which may be non-trivial.
ID: 80810 · Rating: 0

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2207 Credit: 13,720,774 RAC: 3	Message 80811 - Posted: 31 Oct 2016, 9:06:29 UTC - in response to Message 80809.
> and for apps like r@h, it needs to run on a large number of platforms some (many) do not have AVX let alone AVX2. we'd not want the app to 'crash' on those platforms simply because they don't have AVX/AVX2. Hence, such optimizations is a compromise of sorts, the same program may need to have both (AVX) optimised codes and non-optimised codes.
No problem: 2 apps, and an updated scheduler that correctly recognizes the CPU and sends the right app. I think, to start, SSE3 will be enough....
ID: 80811 · Rating: 0

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 80812 - Posted: 31 Oct 2016, 9:37:27 UTC Last modified: 31 Oct 2016, 10:05:39 UTC
same avxtest as 2 posts earlier, but running the subroutines 1000 times each. the results are much closer. it shows that the GCC/G++ optimised code is pretty much as good as the hand optimised code. GCC/G++ may not 'catch all' cases of loops and do those optimizations, especially for 'real world' problems where it isn't all this simple to 'unroll' the loops.

    > ./avxtest
    avxtest
    result: 30, 70, 110, 150
    Loop: 103224
    result: 30, 70, 110, 150
    AVX: 100612
    > ./avxtest
    avxtest
    result: 30, 70, 110, 150
    Loop: 107428
    result: 30, 70, 110, 150
    AVX: 108980
    > ./avxtest
    avxtest
    result: 30, 70, 110, 150
    Loop: 111144
    result: 30, 70, 110, 150
    AVX: 108093
ID: 80812 · Rating: 0
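
[Editor's sketch] One thing that may blur cycle counts like the ones above: in the avxtest program the `print(res)` call, which writes to the console, sits inside the timed region, and each variant is measured only once per run. A hedged alternative is to time only the kernel, repeat the measurement, and keep the fastest repeat. The helper name `best_time_ns` is hypothetical, and the sketch uses `std::chrono::steady_clock` instead of `__rdtsc()`, so it reports nanoseconds rather than cycles.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical helper: call fn() iters times inside one timed region,
// repeat that `repeats` times, and return the fastest repeat in ns.
// Taking the minimum filters out repeats disturbed by other processes.
template <typename Fn>
std::int64_t best_time_ns(Fn fn, int iters, int repeats) {
    using clock = std::chrono::steady_clock;
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = clock::now();
        for (int i = 0; i < iters; ++i) fn();  // only the kernel is timed
        const auto t1 = clock::now();
        const std::int64_t ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        if (ns < best) best = ns;
    }
    return best;
}
```

Usage would look like `best_time_ns([&] { matrix_multiply(m, v, r); }, 1000, 20)`, with the result printed only after timing has finished. Writing the kernel's output to a `volatile` variable, or otherwise using it afterwards, helps keep the compiler from removing the work being measured.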

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 80813 - Posted: 31 Oct 2016, 9:50:32 UTC - in response to Message 80811. Last modified: 31 Oct 2016, 10:08:02 UTC
> No problem, 2 apps, updated scheduler recognizes correctly the cpu and sends the right app. I think, to start, SSE3 will be enough....
the other issue though, is that it seems it isn't quite possible (yet) for boinc to distribute apps based on cpu 'architecture' (i.e. has SSE, AVX, AVX2, FMA etc); it seems that currently the only choice possible is 32 bits or 64 bits. yup, going 64 bits, esp for the Windows platform (which today is 32 bits), would likely see some gains.
A 'hybrid' app (for SSE/AVX or none) is possibly more appropriate, because distributing apps is its own 'logistics' issue. it's just that a hybrid app depends on compiler capabilities, and if the compiler can't do it on its own, it may need to be 'hand tuned'.
ID: 80813 · Rating: 0

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2207 Credit: 13,720,774 RAC: 3	Message 80814 - Posted: 31 Oct 2016, 15:01:51 UTC - in response to Message 80813.
> the other issue though, is that it seemed it isn't quite possible (yet) for boinc to distribute apps based on cpu 'architecture' (i.e. has SSE, AVX, AVX2, FMA etc), it seemed currently what's possible is 32 bits or 64 bits. yup going 64 bits esp for Windows platform (which today is 32 bits) would likely see some gains.
A little help with app_config.... :-)
ID: 80814 · Rating: 0

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 80840 - Posted: 12 Nov 2016, 17:17:16 UTC Last modified: 12 Nov 2016, 17:40:23 UTC
here is a very interesting article / set of slides on AVX/AVX2, from CERN, the HPC (high performance computing) people who deal with physics:
Haswell Conundrum: AVX or not AVX? (2014)
https://indico.cern.ch/event/327306/contributions/760669/attachments/635800/875267/HaswellConundrum.pdf
Conclusions
– Free lunch is over
  » In 2 years the computational power of Intel workstations has increased by 30% max (including core count and freq-boost)
  » For servers even less
– Power management affects individual components:
  » Achieving maximal throughput requires to make choices among features to activate
– Memory wall is higher than ever
  » HSW improves on instruction caching though..
– Wide SIMD vectors are effective only for highly specialized code
– Little support for this new brave world in generic high level languages and libraries
Summary
– Haswell is a great new Architecture:
  » Not because of AVX
– Long SIMD vectors are worth only for intensive vectorized code
  » Are not GPUs then a better option?
– Power Management cannot be ignored while assessing computational efficiency
– On modern architecture, extrapolation based on synthetic benchmarks is mission impossible
they are in Boinc too & u can run their simulations: http://atlasathome.cern.ch/
that special scenario is apparently things like the Linpack benchmark, which depends heavily on the subroutine DGEMM (double precision general matrix multiplication), e.g. multiplying very big square matrices, say 10,000 x 10,000
https://www.pugetsystems.com/labs/hpc/Haswell-Floating-Point-Performance-493/
once the math scenario falls outside this DGEMM 'multiply very big square matrices' use case, all that vector / parallel cpu and even those extreme speed GPU (petaflops) hardware is simply useless. e.g. if you are trying to solve 2x2 matrices a billion times and the result of the next iteration depends on the previous one, it would be just as slow as if you simply did it in loops with no SSE, AVX, AVX2 lol
in short, everything from SSE/AVX to all those super high end vectorized extreme performance GPUs is only good if the whole world is simply DGEMM. too bad DGEMM covers just a few of the true real world scenarios lol
ID: 80840 · Rating: 0
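
[Editor's sketch] The 2x2 example in the post above turns on whether one step needs the result of the previous one. A small illustration of the two shapes, with hypothetical names (`Vec2`, `Mat2`, `mul`, `iterate`, `transform_all`): in the first function each step consumes the previous result, so the steps cannot be spread across SIMD lanes; in the second, every element is a separate problem, which is the situation where wide vectors have something to work on.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec2 = std::array<double, 2>;
using Mat2 = std::array<double, 4>;  // row-major 2x2

inline Vec2 mul(const Mat2& a, const Vec2& x) {
    return {a[0] * x[0] + a[1] * x[1], a[2] * x[0] + a[3] * x[1]};
}

// Dependent chain: step n+1 needs the output of step n, so the loop
// iterations have to run one after another.
Vec2 iterate(const Mat2& a, Vec2 x, std::size_t steps) {
    for (std::size_t n = 0; n < steps; ++n) x = mul(a, x);
    return x;
}

// Independent batch: no element depends on any other, so a compiler
// (or hand-written SIMD) is free to process several elements at once.
void transform_all(const Mat2& a, std::vector<Vec2>& xs) {
    for (Vec2& x : xs) x = mul(a, x);
}
```

With a = {1, 1, 0, 1}, `iterate(a, {0, 1}, 3)` walks (0,1) to (1,1) to (2,1) to (3,1) strictly in sequence, while `transform_all` could handle its inputs in any order. Whether Rosetta's inner loops look more like the first or the second is not something this thread establishes.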

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2207 Credit: 13,720,774 RAC: 3	Message 80848 - Posted: 15 Nov 2016, 7:49:22 UTC
Interesting tests, but every simulation is different (denis@home sped up its computation 10 times with SSE3), so these results may be different in Rosetta's environment.
In the end we "know" that AVX/AVX2/AVX-512 require a large refactoring of the code, while SSE2, for example, needs only a flag at recompile....
ID: 80848 · Rating: 0