Message boards : Number crunching : Rosetta@home using AVX / AVX2 ?
rjs5 (Joined: 22 Nov 10, Posts: 273, Credit: 23,037,952, RAC: 7,560)
Sure. Be happy to. There are many similar tools for both Windows and Linux. At the time I only had access to my Windows machine, so I got the output from the Intel VTune sampling profiler. For Linux environments I usually use "perf". Both will annotate the disassembly with the source code if you have the source, and with symbols if you have them, which makes tracking back to the specific source line easy. They use the CPU event counters, and you can set the time or event domain to trigger on. I just used the default CYCLES and INSTRUCTION COMPLETIONS to find where the program was burning clocks. That tells you where you will get the biggest return for your effort.

Sample Haswell event list: https://code.google.com/p/likwid/wiki/Haswell

EXAMPLE: I was running 8 copies of minirosetta on my 64-bit Ubuntu machine and profiled all CPUs with the command:

```
sudo perf record -a -- sleep 10
```

Run as sudo and record what all the CPUs ("-a") are doing. After the 10 second sleep, it dumps the samples. Then:

```
sudo perf report --demangle
```

processes the counts and demangles the C++ symbols into a more readable form. 7.6% of the time was spent in numeric::MathMatrix<double>::inverse_square_matrix() (_ZN7numeric10MathMatrixIdE21inverse_square_matrixEv). Drilling down into that function ....

```
 7.60%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] _ZN7numeric10MathMatrixIdE21inverse_square_matrixEv
 4.95%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] _ZN4core7scoring6etable5etrie16TrieCountPairAll20resolve_trie_vs_trieERKNS0_4trie
 4.46%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] _ZNK7numeric9xyzVectorIdE16distance_squaredERKS1_
 2.96%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] _ZN4core7scoring6etable5etrie16TrieCountPairAll20resolve_trie_vs_trieERKNS0_4trie
 2.60%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] _ZNK4core7scoring7vdwaals10VDW_Energy19residue_pair_energyERKNS_12conformation7Re
 2.10%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] _ZNK4core7scoring4elec13FA_ElecEnergy15score_atom_pairERKNS_12conformation7Residu
 2.05%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] memcpy
 1.94%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] _int_malloc
 1.61%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] _int_free
 1.34%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] _ZN4core10kinematics4tree10BondedAtom17update_xyz_coordsERNS0_4StubE
 1.33%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] malloc
 1.20%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] _ZNK4core12conformation7Residue3xyzEj
 1.17%  minirosetta_3.5  minirosetta_3.54_x86_64-pc-linux-gnu  [.] _ZNK4core12conformation7Residue15atom_type_indexEj
```

Perf will open up a disassembly display with the hot spot highlighted, which I marked with "=====". Even though the file has x86_64 in its name, it is still a 32-bit application.
```
file boinc.bakerlab.org_rosetta/*gnu
boinc.bakerlab.org_rosetta/minirosetta_3.54_x86_64-pc-linux-gnu: ELF 32-bit LSB executable, Intel 80386,
version 1 (SYSV), statically linked, for GNU/Linux 2.6.9, not stripped
```

```
_ZN7numeric10MathMatrixIdE21inverse_square_matrixEv  /var/lib/boinc-client/projects/boinc.bakerlab.org_rosetta/minirosetta_3.54_x86_64-pc-linux-gnu
  0.07 │        add    $0x8,%ebx
  0.17 │        fmul   %st(1),%st
  0.44 │        fsubrl (%edx)
  0.56 │        fstpl  (%edx)
  0.21 │        add    $0x8,%edx
  0.08 │        cmp    %esi,-0xf4(%ebp)
       │      ↓ je     7eb
  0.75 │ 778:   fldl   (%eax)
  0.93 │        fmul   %st(1),%st
  3.35 │        mov    -0xf0(%ebp),%edi
  0.89 │        addl   $0x4,-0xf4(%ebp)
  1.63 │        fsubrl (%ecx)
  4.62 │        fstpl  (%ecx)
  1.80 │        fldl   (%ebx)
  0.30 │        fmul   %st(1),%st
  2.57 │        fsubrl (%edx)
  6.25 │        fstpl  (%edx)
=====================================
  1.70 │        fldl   0x8(%eax)
  0.09 │        fmul   %st(1),%st
  1.20 │        fsubrl 0x8(%ecx)
  4.50 │        fstpl  0x8(%ecx)
  1.87 │        fldl   0x8(%ebx)
  0.12 │        fmul   %st(1),%st
  0.84 │        fsubrl 0x8(%edx)
  4.62 │        fstpl  0x8(%edx)
  1.98 │        fldl   0x10(%eax)
  0.09 │        fmul   %st(1),%st
  0.95 │        fsubrl 0x10(%ecx)
  4.81 │        fstpl  0x10(%ecx)
```
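To give a feel for what that hot spot is, here is a hedged reconstruction of the kind of inner loop that produces the fldl/fmul/fsubrl/fstpl pattern above. It is an illustration of a Gauss-Jordan style row update, not the actual Rosetta source:

```cpp
// Hypothetical illustration only -- not the Rosetta source. A row-elimination
// inner loop like this compiles to the x87 load/multiply/reverse-subtract/store
// sequence (fldl / fmul / fsubrl / fstpl) seen in the annotation when targeting
// plain i386. Every iteration is independent, so the same loop is a textbook
// candidate for SSE2/AVX vectorization.
void eliminate_row(double* target, const double* pivot_row, double factor, int n)
{
    for (int j = 0; j < n; ++j)
        target[j] -= factor * pivot_row[j];
}
```

With a vector target, the compiler can process two (SSE2) or four (AVX) doubles per instruction in that loop instead of one x87 stack operation at a time.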
[VENETO] boboviz (Joined: 1 Dec 05, Posts: 1994, Credit: 9,604,436, RAC: 8,743)
> All post-Pentium4 CPUs (newer than Nov. 2000) support the SSE2 register model. Simply adding the SSE2 target option to the builds would require the machines to be made this century but would use the SSE registers.

Who cares about 15-year-old CPUs??? A single modern 16-core Xeon is hundreds of times more powerful.

> Beyond that, the developers would need to look more closely at the code.

I hope the Rosetta admins read your post. SSE2 may be a first step.
rjs5 (Joined: 22 Nov 10, Posts: 273, Credit: 23,037,952, RAC: 7,560)
> All post-Pentium4 CPUs (newer than Nov. 2000) support the SSE2 register model. Simply adding the SSE2 target option to the builds would require the machines to be made this century but would use the SSE registers.

The applications seem to be built on Red Hat RHEL4, which is not too old and still in corporate use. It seems to be built with GCC 4.1.2. This is a "vintage" application and very unlikely to attract any administration upgrade attention.

By comparison, World Community Grid Mapping Cancer Markers is built with GCC 4.4.7. A similar clip of code shows a comparable bottleneck: a use-after-computation stall where the result of a divide operation (divsd xmm2, qword ptr [rsi]) is used by the next instruction (xorpd xmm2, xmm6). The density of compute instructions compared to spill/fill memory reads and writes is much higher. You see references to "r8d", "rdx" ... 64-bit registers, which you don't see with Rosetta. A 64-bit recompile would let the compilers use both the extra registers and SSE2 instructions, which dramatically reduce the excess memory operations.

A VTune clip of World Community Grid Mapping Cancer Markers ....

```
Address       Assembly
0x140058e2e   sub r8, 0x1
0x140058e32   mulsd xmm0, qword ptr [rdx-0x8]
0x140058e37   addsd xmm2, xmm0
0x140058e3b   jnz 0x140058e22 <Block 12>
0x140058e3d   Block 13:
0x140058e3d   movsd xmm0, qword ptr [rsi]
0x140058e41   xor r8d, r8d
0x140058e44   mulsd xmm0, qword ptr [r10+rbx*1]
0x140058e4a   subsd xmm2, xmm0
0x140058e4e   divsd xmm2, qword ptr [rsi]
0x140058e52   xorpd xmm2, xmm6        ======= stall waiting for the xmm2 result
0x140058e56   comisd xmm13, xmm2
0x140058e5b   movsd qword ptr [r10], xmm2
0x140058e60   jbe 0x140058e65 <Block 15>
0x140058e62   Block 14:
0x140058e62   mov qword ptr [r10], r8
0x140058e65   Block 15:
0x140058e65   movsd xmm1, qword ptr [r10]
0x140058e6a   movapd xmm0, xmm1
0x140058e6e   subsd xmm0, qword ptr [r10+rbx*1]
0x140058e74   andpd xmm0, xmm9
```
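To make the contrast concrete, here is a minimal sketch of my own (not code from either project) showing the kind of kernel whose generated code changes with the target flags:

```cpp
// Hypothetical example, not project source. Built 32-bit for generic i386
// (the current Rosetta target), GCC emits x87 stack code (fldl/fmul/fstpl);
// built 64-bit, or 32-bit with options such as -msse2 -mfpmath=sse, it emits
// scalar SSE2 (mulsd/subsd on xmm registers) with far fewer spills and reloads.
struct Vec3 { double x, y, z; };

double distance_squared(const Vec3& a, const Vec3& b)
{
    const double dx = a.x - b.x;
    const double dy = a.y - b.y;
    const double dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}
```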
[VENETO] boboviz (Joined: 1 Dec 05, Posts: 1994, Credit: 9,604,436, RAC: 8,743)
> It seems to be built with GCC 4.1.2. This is a "vintage" application and very unlikely to attract any administration upgrade attention.

GCC 4.1.2 is from February 2007.... I understand that it's difficult to keep the software constantly updated (GCC is now at 5.1), but eight years is too much. Please, admins, use the Ralph@home server to test these updates, eventually.
Chilean (Joined: 16 Oct 05, Posts: 711, Credit: 26,694,507, RAC: 0)
Didn't know R@H was that unoptimized.
[VENETO] boboviz (Joined: 1 Dec 05, Posts: 1994, Credit: 9,604,436, RAC: 8,743)
> Didn't know R@H was that unoptimized.

We do not know if the admins read this thread, and we do not even know if they are interested...
rjs5 (Joined: 22 Nov 10, Posts: 273, Credit: 23,037,952, RAC: 7,560)
> It seems to be built with GCC 4.1.2. This is a "vintage" application and very unlikely to attract any administration upgrade attention.

I had to go see what "RALPH@HOME" was. You mentioned it several times. It gave me a good chuckle. They could have just put another checkbox on the Rosetta@home "Edit Rosetta@home preferences" options page; they would not have had to build a duplicate project. Few people will add RALPH just to run Rosetta ALPHA versions. Many more would click the opt-in option on the preferences page.

Much of any performance increase beyond compiling for SSE2 comes from understanding the program and avoiding subtle coding problems. Keeping tight control on the data type sizes is one commonly overlooked problem (see the sketch below).
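To illustrate what I mean by data type sizes, here is a small example of my own (not Rosetta code): mixing single-precision arrays with double-precision constants forces a widen/narrow conversion on every element and blocks clean packed-float code.

```cpp
// Hedged illustration of the "data type sizes" pitfall; not Rosetta code.
void scale_mixed(float* v, int n)
{
    for (int i = 0; i < n; ++i)
        v[i] = v[i] * 0.5;   // 0.5 is a double: each element is converted
                             // float -> double -> float, hurting vectorization
}

void scale_float(float* v, int n)
{
    for (int i = 0; i < n; ++i)
        v[i] = v[i] * 0.5f;  // stays single precision throughout and
                             // vectorizes to packed-float operations cleanly
}
```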
[VENETO] boboviz (Joined: 1 Dec 05, Posts: 1994, Credit: 9,604,436, RAC: 8,743)
> Few people will add RALPH just to run Rosetta ALPHA versions. Many more would click the opt-in option on the preferences page.

I don't agree with you.
1) Rosetta and Ralph are on two different servers with the same software version/configuration. If the admins want to try some updates/upgrades, it's better to test them in the alpha configuration than on the production server.
2) Do not underestimate Ralph's volunteers. :-)

> Much of any performance increase beyond compiling for SSE2 comes from understanding the program and avoiding subtle coding problems. Keeping tight control on the data type sizes is one commonly overlooked problem.

+1 (obviously, it would be better if the Rosetta admins had updated tools/servers/debuggers/etc.)
Timo (Joined: 9 Jan 12, Posts: 185, Credit: 45,649,459, RAC: 0)
I doubt this thread is being read by any admins at this point. Until rjs5 brought some tangible analysis, it was mostly hot air and knee-jerk suggestions about adding support for different technologies, which may be good ideas but may also be too technically challenging or time-consuming to make it onto the roadmap any time soon.

... but this thing about upgrading the compiler from the *ancient* version they currently build with is truly low-hanging fruit!

Thus, I would encourage rjs5 to consider spinning up a new thread strictly showcasing the details he/she was able to pull together. Lay it out as a new thread, call it 'An analysis of the Rosetta binaries...' or something that shows that rjs5 has already done the legwork, and in the first post break down what was found, maybe even with some charts or graphs. I have to put together 'business cases' like this at work often, so let me know if you need any help!
Chilean (Joined: 16 Oct 05, Posts: 711, Credit: 26,694,507, RAC: 0)
> I doubt this thread is being read by any admins at this point. Until rjs5 brought some tangible analysis, it was mostly hot air and knee-jerk suggestions about adding support for different technologies, which may be good ideas but may also be too technically challenging or time-consuming to make it onto the roadmap any time soon.

Done!
[VENETO] boboviz (Joined: 1 Dec 05, Posts: 1994, Credit: 9,604,436, RAC: 8,743)
> I doubt this thread is being read by any admins

:-( I opened a similar thread on Ralph@home, without any reply from the admins....

> ... but this thing about upgrading the compiler from the *ancient* version they currently build with is truly low-hanging fruit!

+1
rjs5 (Joined: 22 Nov 10, Posts: 273, Credit: 23,037,952, RAC: 7,560)
> yup i'd think AVX / AVX2 is a good thing, actually this is very similar (or of the same nature) to the GPU request threads, i.e. to exploit vectorized CPU or GPU functionality to significantly accelerate computations

Generating an AVX/AVX2/AVX-512 binary does not necessarily mean a "rewrite". All you have to do is turn on the compile-time option for the target feature you want to enable and just recompile. The benefit, however, will depend on the problem and on how the coder expressed the algorithm that solves it. The Intel compiler can even generate a startup section of code that tests the CPU and branches to the correct code path for each CPU. U of Washington probably already has dozens of Intel ICC licenses. Either ICC or GCC will handle AVX, since Intel also contributes to GCC.

The AVX vector operations vectorize loops where the results of one operation are not needed for the next. In many cases this would "fold" the Amdahl sections of code, and the code owners could estimate the speedup.

The primary problem they would have, and the first one I would look at, is the conversion from 80-bit floating-point i387 register computations to the smaller 64-bit floating-point computations. The old i387 had 80-bit internal registers and would truncate/round/... to 64 bits when storing to 64-bit memory data types, but a sequence of 80-bit float operations will give a slightly different answer than a sequence of 64-bit float operations. Those pesky extra 16 bits keep data that is lost when using only 64-bit operations. Once they confirm that they can use 64-bit operations, they can move on. This is probably the barrier that worries the Rosetta people.

Simple illustration: suppose I had a 1024-element vector of floating-point numbers that I wanted to add together:

```cpp
for (i = 0; i < 1024; i++)
    sum += vec[i];
```

For each element, i386 code would load the next value of the vector from memory into an FP register and then add it. AVX1/2/512 code would operate on strided sub-vectors of the array: add the next 2/4/8 values of the vector into a register of partial sums, then do a horizontal add at the end to accumulate the subtotals. There might be some tail-processing time if the coder did not let the compiler know the length of the vector and just used an arbitrary length (instead of a multiple of 2/4/8).

It is hard to know what the speedup would be without more analysis, but modernizing Rosetta would multiply the data the Rosetta people get from the CPU cycles currently donated to their project. If they are satisfied with their poverty, then I am OK with it too.
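For the curious, here is what that illustration looks like with AVX intrinsics. This is my own sketch, and it assumes the vector length is a multiple of 4 (so there is no tail loop); note that the reordered additions round slightly differently than the strictly sequential sum, which is the same flavor of numerical-difference concern mentioned above.

```cpp
#include <immintrin.h>

// Scalar version: one load and one dependent add per element.
double sum_scalar(const double* vec, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += vec[i];
    return sum;
}

// AVX version (sketch): accumulate 4 doubles per iteration into a 256-bit
// register of partial sums, then do one horizontal add at the end.
// Assumes n is a multiple of 4; a real implementation needs a tail loop.
double sum_avx(const double* vec, int n)
{
    __m256d acc = _mm256_setzero_pd();
    for (int i = 0; i < n; i += 4)
        acc = _mm256_add_pd(acc, _mm256_loadu_pd(vec + i));

    double partial[4];
    _mm256_storeu_pd(partial, acc);
    return partial[0] + partial[1] + partial[2] + partial[3];
}
```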
[VENETO] boboviz (Joined: 1 Dec 05, Posts: 1994, Credit: 9,604,436, RAC: 8,743)
> The Intel compiler can even generate a startup section of code that tests the CPU and branches to the correct code path for each CPU. U of Washington probably already has dozens of Intel ICC licenses. Either ICC or GCC will handle AVX, since Intel also contributes to GCC.

I hope these improvements will also benefit AMD CPUs.... Intel cripples AMD CPUs
[VENETO] boboviz (Joined: 1 Dec 05, Posts: 1994, Credit: 9,604,436, RAC: 8,743)
> http://code.compeng.uni-frankfurt.de/projects/vc

It is now on GitHub: https://github.com/VcDevel/Vc
rjs5 (Joined: 22 Nov 10, Posts: 273, Credit: 23,037,952, RAC: 7,560)
> The Intel compiler can even generate a startup section of code that tests the CPU and branches to the correct code path for each CPU. U of Washington probably already has dozens of Intel ICC licenses. Either ICC or GCC will handle AVX, since Intel also contributes to GCC.

You will certainly get what you hope for as detailed in your January 2010 article.

The ICC compiler now just checks the CPUID FEATURE support bits and if the feature is supported, ICC will generate the optimized code. ICC can 100% trust the CPUID FEATURE bits.

The problem is now for AMD and for vendors developing software for multiple target CPUs. For example, when AMD had AVX problems with Bulldozer/Interlagos, AMD recommended compiling with -mssse3 and avoiding AVX. Since the CPUID FEATURE bit was on, vendors wanting to support AVX had problems with Bulldozer silicon. Now you have people trying to report Intel ICC bugs because their code did not run on the AMD transistors. Intel is now prohibited by court order from generating separate bits for Intel and AMD.

Avoid -mavx on Interlagos/Bulldozer (middle of page)

The Rosetta people will need to deal with these decisions during their optimization effort. Unless they can vectorize their code, there is limited upside to pushing beyond 64-bit SSE2/3/4.

Lots of fun.
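For anyone curious, here is a minimal sketch of this kind of runtime feature check, using the GCC/Clang __builtin_cpu_supports() built-in (available in newer GCC releases, not the 4.1.2 discussed above) instead of ICC's generated dispatcher. Illustration only, not code from any project:

```cpp
#include <cstdio>

// Illustration only: a hand-rolled dispatch in the spirit of the startup code
// ICC generates. __builtin_cpu_supports() reads the cached CPUID feature bits.
int main()
{
    __builtin_cpu_init();  // GCC asks that this run before the queries below

    if (__builtin_cpu_supports("avx2"))
        std::puts("using the AVX2 code path");
    else if (__builtin_cpu_supports("avx"))
        std::puts("using the AVX code path");
    else if (__builtin_cpu_supports("sse2"))
        std::puts("using the SSE2 code path");
    else
        std::puts("falling back to generic x87 code");
    return 0;
}
```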
[VENETO] boboviz (Joined: 1 Dec 05, Posts: 1994, Credit: 9,604,436, RAC: 8,743)
> Intel is now prohibited by court order from generating separate bits for Intel and AMD.

Not so fun.
xdarma (Joined: 20 Jan 08, Posts: 5, Credit: 5,045,782, RAC: 2,688)
> You will certainly get what you hope for as detailed in your January 2010 article.

Sorry for re-posting, but this article is dated November 2014: Intel finally agrees to pay $15 to Pentium 4 owners over AMD Athlon benchmarking shenanigans

> The ICC compiler now just checks the CPUID FEATURE support bits and if the feature is supported, ICC will generate the optimized code. ICC can 100% trust the CPUID FEATURE bits.

Of course ICC must check the CPUID, otherwise it couldn't cripple non-Intel CPUs. The author applied a patch to fool the software created with ICC. According to the previous article, the gain (or the loss?) seems to be around 8-12%. And it is still like that these days.

> The problem is now for AMD and for vendors developing software for multiple target CPUs. For example, when AMD had AVX problems with Bulldozer/Interlagos, AMD recommended compiling with -mssse3 and avoiding AVX. Since the CPUID FEATURE bit was on, vendors wanting to support AVX had problems with Bulldozer silicon.

Which people? Can you elaborate? I can't find anything useful. Thanks.

> Now you have people trying to report Intel ICC bugs because their code did not run on the AMD transistors. Intel is now prohibited by court order from generating separate bits for Intel and AMD.

IMO, the only way is to separate ICC from Intel, so that ICC has to be fair to the other CPU makers. Maybe your job would be hit by this, but that's another story.

> Avoid -mavx on Interlagos/Bulldozer (middle of page)

Avoid it if you use ICC. If you use GCC, you can optimize with -mprefer-avx128 (on the previous page of your link).

> Lots of fun.

Another reason not to buy Intel CPUs.
Betting Slip (Joined: 26 Sep 05, Posts: 71, Credit: 5,702,246, RAC: 0)
> Good to hear they are thinking of updating their server code... because it is ANCIENT.

Well, if it was ANCIENT in October 2014 but they were thinking of updating it, that's good news. Oh wait, it's now October 2015. Oh well, maybe they haven't done that much thinking in 12 months. They are very BUSY, you know.
[VENETO] boboviz (Joined: 1 Dec 05, Posts: 1994, Credit: 9,604,436, RAC: 8,743)
Very interesting analysis about optimization by rjs5 here
[VENETO] boboviz (Joined: 1 Dec 05, Posts: 1994, Credit: 9,604,436, RAC: 8,743)
From rjs5:

> Any performance improvement will only make progress when someone on the Rosetta team wants it to. I have a couple of unanswered messages to developers volunteering time and expertise. If anyone has interested contacts, please pass me along to them. I fully expect that there is no real need nor incentive for Rosetta to mess with the code to improve performance.

:-(