Message boards : Number crunching : How to download the client? (AMD64 users)
ebahapo Joined: 17 Sep 05 Posts: 29 Credit: 413,302 RAC: 0 |
All projects do if you use dumas777's Linux x86_64 compiled client
But this client doesn't download native 64-bit applications when they are available, such as from SIMAP and Chess960 and, who knows, soon Rosetta. ;-) |
FluffyChicken Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0 |
All projects do if you use dumas777's Linux x86_64 compiled client
He said it did... (how is that coming along? ;-) ) Team mauisun.org |
ebahapo Joined: 17 Sep 05 Posts: 29 Credit: 413,302 RAC: 0 |
But this client doesn't download native 64-bit applications when they are available, such as from SIMAP and Chess960 and, who knows, soon Rosetta. ;-)
He said it did...
I don't think so. The author himself said that it downloads just x86 applications.
(how is that coming along? ;-) )
I've already built it for 64 bits using GCC 3.3.3, but it has some run-time problems that I want to address before trying GCC 4.2. Over time, GCC has changed its interpretation of the C++ standard from version to version, which can make some programs fail to compile. |
FluffyChicken Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0 |
[quote]But this client doesn't download native 64-bit applications when they are available, such as from SIMAP and Chess960 and, who knows, soon Rosetta. ;-)
He said it did...
I don't think so. The author himself said that it downloads just x86 applications.[/quote] I must have misread what he wrote. Team mauisun.org |
Who? Joined: 2 Apr 06 Posts: 213 Credit: 1,366,981 RAC: 0 |
FWIW, running two instances of the client, one the 32-bit client, the other the 64-bit client, on the same 4-core system, but limiting each client to 2 cores, I can compare the relative performance of 32-bit and 64-bit SIMAP's HMMER: the 64-bit version is about 7% faster.
Well, you actually get a performance improvement on K8 from 32 to 64 bits. It is due to a poor implementation of the x87 stack in the processor. Modern processors will not see any difference. The code is more compact*** in SSE2 scalar or packed than it is in x87, so you need less decoding bandwidth to decode SSE2 than x87. I can bet a lot of money that the K8L will not see the 32-to-64-bit performance improvement that K8 used to see. It was good marketing work, since Intel's FP performance was lower, to convert the weakness of K8's x87 into a great 64-bit advantage... only architects saw the trick, and people thought that 64 bits was faster... Bottom line: 64-bit execution units have been available since MMX, and the media boost in Core 2 is 2 x 128-bit execution units... for processing/computing, 64 bits is totally obsolete, and the beauty of it: AMD is going to prove me right with K8L: they are "upgrading" to 128-bit execution units too, said Mister Ruiz. So, if your compiler generates SSE2 in 32-bit mode, your benefit from 64 bits goes down to 0. The only good side of a 64-bit OS is its addressing; more than 4 GB of RAM is nice ;) If I have to choose between threaded CPUs or 64 bits, I know for sure what I choose, don't you?
*** When I say compact, here is what I mean: x87 requires more instructions to execute an algorithm than SSE2 requires. In the case of a multiply:
x87:
fld a
fld b
fmul
In the case of SSE2:
movapd xmm0, [a]
mulpd xmm0, [b]
This is a 33% saving... if you count equal instruction sizes.
(Unfortunately, the AMD64 spec was not that smart; they increased the size of all instruction encodings in 64-bit mode, keeping their decoder the bottleneck.) The stack management of x87 costs decoding bandwidth, and K8 is not a beast at decoding. In SSE2, the load and store are included in the instructions, saving decoding bandwidth. There you go, the explanation for your 64-bit goodness... in fact, say thank you to SSE2 for boosting K8 in 64-bit mode. This is the awful truth, sorry: a weakness of K8 turned into a great 64-bit story! who? Oups, I forgot: my employer has nothing to do with my posting here; I am just posting as myself. In fact, I am sick of hearing the 64-bit story over and over... 64 bits for processing small data is useless, end of story!!!! |
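The instruction-count argument above can be made concrete with SSE2 intrinsics (a sketch added for illustration, x86 targets only; the function names are ours, not from the thread). The multiply that x87 spells as fld/fld/fmul becomes one mulsd with the load folded in, and the packed form does two doubles per instruction:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Scalar product of two doubles via SSE2: the memory operand is folded
// into the multiply, with no x87 stack juggling.
inline double mul_sse2(double a, double b) {
    __m128d va = _mm_load_sd(&a);
    __m128d vb = _mm_load_sd(&b);
    return _mm_cvtsd_f64(_mm_mul_sd(va, vb));
}

// Packed form: two independent multiplies in one instruction, which is
// where the real SSE2 win over x87 comes from.
inline void mul2_sse2(const double a[2], const double b[2], double out[2]) {
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_mul_pd(va, vb));
}
```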
Mats Petersson Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
But not if you do the equivalent using the same type of addressing mode:
fld a
fmul b
According to the AMD64 PRM Vol 5 and the Intel 486 PRM, there is such an instruction... So the number of instructions for the same work is the same... [Whether the compiler chooses this type of operation or not is a completely different question, and that depends very much on other factors - but since the FPU stack is a fairly scarce resource, I'd expect it to avoid any unnecessary loads, and using memory as a 9th "register" would help a bit.] -- Mats |
Who? Joined: 2 Apr 06 Posts: 213 Credit: 1,366,981 RAC: 0 |
1st of all: Correct, it was a simplistic example... you are not supposed to focus on this one; it was just to give an overall idea of the bandwidth problem. OK... so, if you raise the level you look at it a little, x87 is a stack. If you want to use the stack properly, you have to do the loading in a different order. It is called RPN (reverse Polish notation) order. You can learn more about this here: click here and move forward and backward. Then the instruction you are speaking about (fmul b) will have exactly the same effect as what I was talking about, except that you'll have to "split" the instruction into "fld b and then fmul". Now, if you look in detail at fdiv, for example:
fdiv()
fdivp()
fdivr()
fdivrp()
fdiv( sti, st0 );
fdiv( st0, sti );
fdivp( st0, sti );
fdivr( sti, st0 );
fdivr( st0, sti );
fdivrp( st0, sti );
you will figure out that you are required to use the top of the x87 stack to do your operation. If you do not specify your 1st or 2nd parameter, fdiv will use st0. This forces you to change the ordering of the loading and storing, increasing the number of instructions required to process the same algorithm. (Here the memory disambiguation of Core 2 helps to solve this problem.) In SSE2, there is no constraint on what register you use to add, sub, mul or div; this is why the code gets more compact. Intel did not decide to invest so much in this without good reason. AMD themselves added more XMM registers to decrease the register pressure and avoid those issues; they don't do that randomly either. Mats, you can't defend AMD on this. With the list of AMD prototypes and machines you have, I understand that you would like to defend 64 bits, but it is obvious that x87 to SSE2 is the performance improvement AMD used to claim 64-bit goodness. That is pure BS, and if you refuse to see it, fine! But at least avoid misleading people. I don't even add on top of this that SSE2 can process 2 floating-point values at the same time if you use PACKED...
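The RPN ordering referred to above is ordinary postfix evaluation on a stack; operands are pushed and each operator consumes the top of the stack, just as the fdiv variants are pinned to st0. A small illustrative C++ evaluator (our sketch, not x87 code, added to make the discipline concrete):

```cpp
#include <stack>
#include <string>
#include <sstream>

// Evaluate a postfix (RPN) expression such as "6 3 /": operands are
// pushed, and each operator pops its arguments off the top of the
// stack -- the same discipline the x87 stack imposes via st0.
double eval_rpn(const std::string& expr) {
    std::stack<double> st;
    std::istringstream in(expr);
    std::string tok;
    while (in >> tok) {
        if (tok == "+" || tok == "-" || tok == "*" || tok == "/") {
            double b = st.top(); st.pop();  // second operand is on top
            double a = st.top(); st.pop();
            if (tok == "+")      st.push(a + b);
            else if (tok == "-") st.push(a - b);
            else if (tok == "*") st.push(a * b);
            else                 st.push(a / b);
        } else {
            st.push(std::stod(tok));  // operand: push onto the stack
        }
    }
    return st.top();
}
```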
Those 7 to 8% of performance improvement are 99.9% from SSE2, not from 64 bits. The only case I know of where 64-bit registers help is encryption in INT, and if you recode it in MMX2 (SSE2), you'll beat your INT code by a large amount too! The 64-bit imul is 3 times faster than the same algorithm with 32-bit imul. Just use MMX2 anyway :) The experiment to do yourself: just take any simple 32-bit program and compile it in 64 bits using the AMD64 flavor (with MSVC 2005). Then you take the same program and "32-bit compile" it with vectorization ON: oups... your 32-bit version goes as fast as your 64-bit version now. Somebody charged people a premium for something that is only a compiler trick! If I were a consumer association, AMD would be done in 5 minutes. There is a fine line between good marketing and misleading people; they obviously crossed the line about the 64-bit goodness. who? Again, this is my own opinion, and it does not represent my employer's point of view in any way; I am the only person responsible for these arguments. |
ebahapo Joined: 17 Sep 05 Posts: 29 Credit: 413,302 RAC: 0 |
Just take any simple 32-bit program and compile it in 64 bits using the AMD64 flavor (with MSVC 2005). Then you take the same program and "32-bit compile" it with vectorization ON: oups... your 32-bit version goes as fast as your 64-bit version now.
You forgot that AMD64 doubles the number of registers. If you do some research and compare, say, the SPECfp 2000 results for 32 and for 64 bits, both using scalar SSE, you will see that the 64-bit results are about 10% better. Don't trust me, see for yourself (scroll down to the bottom for the mean peak score):
|
Who? Joined: 2 Apr 06 Posts: 213 Credit: 1,366,981 RAC: 0 |
Just take any simple 32-bit program and compile it in 64 bits using the AMD64 flavor (with MSVC 2005). Then you take the same program and "32-bit compile" it with vectorization ON: oups... your 32-bit version goes as fast as your 64-bit version now.
You prove exactly my point: register pressure was the problem of the Athlon 64 in 32-bit mode. By adding more registers, you decrease register pressure and help the flexibility of the computation. It does not mean you are using anything close to 64-bit "stuff"; those are just registers. Nice marketing trick again! In modern processors, like Core 2, your execution units do not work with registers any more; in fact, they work with load and store buffers and micro-registers. In Core 2, you have plenty of internal registers, and dynamic renaming is done, so the number of registers does not matter at all, and you can see how much faster Core 2 is. Even the Pentium 4 had 40 load and 24 store buffers. Those explain why the Pentium 4 core still beats the b..t of K8 on SPEC :) Again, if the Athlon were not so weak at using the eight x87 registers, you would not see the benefit. In your case, they used more registers to decrease the register pressure they had. You are right, this is different; it is another issue they had, and the Pentium III used to have the same issue. They still told you it was 64 bits :) In FACT, that was just more SSE2 :) who? |
ebahapo Joined: 17 Sep 05 Posts: 29 Credit: 413,302 RAC: 0 |
Who, Oh my! What can I say? Better yet, I won't say it. I'll let Mitch Alsup, an architect at AMD, speak: http://groups-beta.google.com/group/comp.arch/msg/26cd41f07d11a33a HTH |
Who? Joined: 2 Apr 06 Posts: 213 Credit: 1,366,981 RAC: 0 |
Who, hehehehe :) Too bad they did not have the bandwidth to feed them :) They totally got destroyed on SPEC2006... so I guess something in their plan did not work out... hahahahaha. Of course he will not explain his problems to you... neither do I :) And if you really want to know why it is funny, check my profile on SETI... in the crunching forum :) who is who? Sorry, but he can pretend whatever he wants; he has some serious problems with x87 compared to SSE2. who? |
FluffyChicken Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0 |
|
Mats Petersson Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
Who?,
Mats has actually been against 64bit for Rosetta@home (in general)
I'm not against it, I'm just saying that there's not going to be any great gain from it, because the limitation (in my experience) isn't the bitness of the code (or the number of registers used). I'm also against people saying "X will be faster because Y was faster" when doing whatever it is suggested one should do. I can say for sure that Rosetta would gain from using packed SSE, but it will require a fairly major re-org of the data, since the current data format is a simple array with index 0 = x, index 1 = y, index 2 = z. Since the values x, y and z are generally calculated in the same way for many iterations of a loop, it would be good to just load 4 of each into an SSE register and do the calculation 4 in parallel [1]. But to do that, the x, y and z values need to be arrayed individually (or in groups of 4 or some multiple thereof). But there's another problem too, and I haven't spent enough time to figure out how to fix it. The code is basically doing:
for (i = 0; i < n; i++) {
    for (j = 0; j < m; j++) {
        calculation...;
    }
}
where n is in the range of a few hundred and m is less than 10. To do it the other way around would help a whole lot in unrolling the loop and parallelizing the calculation. However, that is a major re-org of the data flow, and I'm not sure if the calculation will still be valid [because the whole code is a lot more complex than the above much simplified case]. -- Mats |
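The loop interchange Mats describes can be sketched as follows (an illustrative example with a made-up, iteration-independent calculation; Rosetta's real loop body is far more complex, which is exactly why he is unsure the transform stays valid). Interchanging puts the long n trip count innermost, where an unroller or vectorizer has something to work with:

```cpp
#include <vector>

// Original shape: long outer loop (n in the hundreds), short inner
// loop (m < 10).  The innermost trip count the vectorizer sees is only m.
void calc_original(std::vector<double>& out, const std::vector<double>& in,
                   int n, int m) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            out[i * m + j] = in[i * m + j] * 2.0 + 1.0;  // stand-in calculation
}

// Interchanged shape: the long n loop is now innermost, giving the
// unroller/vectorizer hundreds of independent iterations.  This is only
// legal when iterations really are independent -- the "is the calculation
// still valid" question raised above.
void calc_interchanged(std::vector<double>& out, const std::vector<double>& in,
                       int n, int m) {
    for (int j = 0; j < m; j++)
        for (int i = 0; i < n; i++)
            out[i * m + j] = in[i * m + j] * 2.0 + 1.0;
}
```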
ebahapo Joined: 17 Sep 05 Posts: 29 Credit: 413,302 RAC: 0 |
I can say for sure that Rosetta would gain from using packed SSE, but it will require a fairly major re-org of the data, since the current data format is a simple array with index 0 = x, index 1 = y, index 2 = z. Since the values x, y and z are generally calculated in the same way for many iterations of a loop, it would be good to just load 4 of each into an SSE register and do the calculation 4 in parallel [1]. But to do that, the x, y and z values need to be arrayed individually (or in groups of 4 or some multiple thereof).
The vectorizer can swizzle the data and unroll the loop automatically. We'll see. |
ebahapo Joined: 17 Sep 05 Posts: 29 Credit: 413,302 RAC: 0 |
hehehehe :) too bad they did not have the bandwidth to feed them :) They totally get destroy on Spec2006 ... so, i guess, something in their plan did not work out ... hahahahaha I guess that I shouldn't feed the trolls, especially those who laugh nervously. |
Mats Petersson Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
I can say for sure that Rosetta would gain from using packed SSE, but it will require a fairly major re-org of the data, since the current data format is a simple array with index 0 = x, index 1 = y, index 2 = z. Since the values x, y and z are generally calculated in the same way for many iterations of a loop, it would be good to just load 4 of each into an SSE register and do the calculation 4 in parallel [1]. But to do that, the x, y and z values need to be arrayed individually (or in groups of 4 or some multiple thereof).
Yes, it can, but swizzling isn't entirely free - in fact it's fairly costly when it's as "disorganized" as it is in this case... :-( And more importantly, perhaps, the length of the inner loop (from memory) is 6 items [the loop is repeated several times in the busiest section of the code, and does a bunch of other things too, as I explained earlier]. So even unrolling probably won't really do the trick. We'd need to reform the loop so that it goes the other way around - then maybe it will work... -- Mats |
Who? Joined: 2 Apr 06 Posts: 213 Credit: 1,366,981 RAC: 0 |
I can say for sure that Rosetta would gain from using packed SSE, but it will require a fairly major re-org of the data, since the current data format is a simple array with index 0 = x, index 1 = y, index 2 = z. Since the values x, y and z are generally calculated in the same way for many iterations of a loop, it would be good to just load 4 of each into an SSE register and do the calculation 4 in parallel [1]. But to do that, the x, y and z values need to be arrayed individually (or in groups of 4 or some multiple thereof).
Moving to SIMD SSE2 is not really a big deal. As I posted in the SETI forum:
I am open to doing Rosetta too. As one of the designers of SSE2/SSE3, in my hobby time I spend quite some time helping people port data structures to SSE2. When you are used to it, a few macros usually do the trick. It is the right time to do SIMDization, because SSE4 will buy a lot of performance in Rosetta-like algorithms; we actually included instructions to help pattern matching in SIMD. SSE4 will also help to deal with changes of code path across a set of SIMD data. SIMDization is less complex than many people want to say; there is no "major" data reorganization, and a fairly good programmer knows how to change his data structure without "major" work. You need to write a few macros like "GetVec(x), GetVec4(x)" etc. and an adaptive data structure and you are done. Then a gigantic search and replace for 1 or 2 hours, well focused, and you are done. The key to SIMDization is data locality. who? |
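The data-structure change being debated here amounts to switching from array-of-structures to structure-of-arrays, so that consecutive x values sit contiguously for packed loads. A minimal sketch (our own names and layout, for illustration; GetVec and friends above are the poster's hypothetical macros, not real Rosetta code):

```cpp
#include <vector>

// Array-of-structures: x, y, z interleaved, matching the layout Mats
// describes (index 0 = x, index 1 = y, index 2 = z).  Packed SSE loads
// cannot grab four x values in one go from this.
struct PointAoS { double x, y, z; };

// Structure-of-arrays: each coordinate contiguous, so a packed load can
// fetch consecutive x values directly.
struct PointsSoA {
    std::vector<double> x, y, z;
};

// One-time conversion; after this, inner loops index soa.x[i] etc.
PointsSoA to_soa(const std::vector<PointAoS>& aos) {
    PointsSoA soa;
    for (const PointAoS& p : aos) {
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.z.push_back(p.z);
    }
    return soa;
}
```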
Mats Petersson Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
I can say for sure that Rosetta would gain from using packed SSE, but it will require a fairly major re-org of the data, since the current data format is a simple array with index 0 = x, index 1 = y, index 2 = z. Since the values x, y and z are generally calculated in the same way for many iterations of a loop, it would be good to just load 4 of each into an SSE register and do the calculation 4 in parallel [1]. But to do that, the x, y and z values need to be arrayed individually (or in groups of 4 or some multiple thereof).
Yes, I've spent some time doing that sort of work too - however, there's more than one data structure... And more importantly, for some of the code in Rosetta it isn't straightforward to understand what depends on what [and it's a tad more than 200K lines in Rosetta, and if memory serves me right, that's more than 5 times the amount of code in SETI]. Most of it is using Fortran-style array-structuring C++ classes, with optimized sections of code that "grab" a pointer to the array and do the math to calculate the index "inline" - with three-dimensional arrays, that gets interesting... ;-) It's not the 1-2 hours of search and replace that is the problem - it's figuring out why it doesn't work for 4 weeks afterwards that scares me. -- Mats |
ebahapo Joined: 17 Sep 05 Posts: 29 Credit: 413,302 RAC: 0 |
Here's a development version of the x86-64 Linux client:
|
ebahapo Joined: 17 Sep 05 Posts: 29 Credit: 413,302 RAC: 0 |
Even though there's an official AMD64 client for Linux, it refers to too many dynamic libraries and requires a fairly recent Linux setup to run on. So, one more time, I'm making the AMD64 Linux client available here. It refers to a minimal set of standard dynamic libraries whose version requirements should be satisfied by Linux systems up to 2 or 3 years old, even though it was built with a fairly recent version of GCC, 4.1.2. The drill's still the same:
|
©2024 University of Washington
https://www.bakerlab.org