Message boards : Number crunching : Still ignoring 64 bit users?
David (Joined: 12 Dec 05, Posts: 5, Credit: 546,401, RAC: 0)
Is there a timeline for when you will accept spare processor cycles from us AMD64 users? I keep seeing the statement that it isn't worthwhile to recompile the client. I find that questionable, but I'll accept it for the sake of argument. Does that mean you see no other way for those of us on a 64-bit OS to contribute? I can think of a few:

1) Make a special app_info.xml available to ease use of the anonymous-platform hack (a sketch of one follows below).
2) Better yet, automatically serve the 32-bit application where no 64-bit application is available.
3) There's always the ia32 wrapper, which many of us already use...

Until then, I'll be looking into other projects...
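For readers who haven't seen one, a minimal app_info.xml for BOINC's anonymous-platform mechanism looks roughly like this; the binary name and version number below are illustrative guesses, not Rosetta's real ones:

```xml
<!-- Anonymous-platform sketch; goes in the project directory.
     The file name and version number are hypothetical examples. -->
<app_info>
    <app>
        <name>rosetta</name>
    </app>
    <file_info>
        <name>rosetta_5.25_i686-pc-linux-gnu</name>
        <executable/>
    </file_info>
    <app_version>
        <app_name>rosetta</app_name>
        <version_num>525</version_num>
        <file_ref>
            <file_name>rosetta_5.25_i686-pc-linux-gnu</file_name>
            <main_program/>
        </file_ref>
    </app_version>
</app_info>
```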
Mats Petersson (Joined: 29 Sep 05, Posts: 225, Credit: 951,788, RAC: 0)
I'm not really sure why some people have problems with this; evidently there's some gap in my understanding of the combination of Linux and BOINC. I presume some distributions of Linux don't work "right" for BOINC. I'm running a mixture of 32- and 64-bit Linuxes from both SuSE and Red Hat on my machines [SuSE 9.3, Fedora Core 4 & 5], and it's working just fine. Whilst I realize that not everyone is happy with those particular companies as distributors of Linux, I would hazard a guess that many other distributions will ALSO work with no more effort than downloading BOINC and signing up for Rosetta, just as you would on a 32-bit machine...

As to compiling for 64-bit: I have spent quite a few hours trying to find out how to make Rosetta better. Compiling it for 64-bit will most likely make it slightly slower, since some of the code uses C++ data structures holding pointers to the actual data (as is usual in C++), which would increase the size of the application's data set, push more data outside the cache, and so cause it to run marginally slower. None of the code is "call-intensive" [1], so it wouldn't benefit much from the reduced call overhead of having more registers for passing arguments, and none of the code uses 64-bit integer arithmetic. All the math is done in 32-bit floating point, and it's limited by floating-point unit capacity, not by the registers available to the processor. Using SSE instructions may have a benefit, but to make it worthwhile would require some pretty noticeable reorganizing of the data structures used all over the application, which I haven't experimented with yet, as that requires MAJOR work, not just modifying one function.

[1] The Rosetta compiler settings use VERY aggressive inlining, which means that any small function that is called often will automatically be inlined in the code...

-- Mats
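To make the cache argument concrete, here is a minimal sketch (the struct is hypothetical, not from Rosetta's source) of how a pointer-bearing data structure grows when the same code is rebuilt as 64-bit:

```cpp
#include <cstdio>

// Hypothetical node type of the pointer-heavy kind described above;
// the payload stays 32-bit, but the links double on a 64-bit build.
struct Node {
    float value;   // 4 bytes in both modes
    Node* next;    // 4 bytes when built 32-bit, 8 bytes when built 64-bit
    Node* prev;
};

int main() {
    // Typically prints 12 for a 32-bit build and 24 for a 64-bit build
    // (pointer growth plus alignment padding), so the same number of
    // nodes occupies roughly twice the cache space.
    std::printf("sizeof(Node) = %zu\n", sizeof(Node));
    return 0;
}
```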
FluffyChicken (Joined: 1 Nov 05, Posts: 1260, Credit: 369,635, RAC: 0)
> Is there a timeline for when you will accept spare processor cycles from us AMD64 users? ... Does that mean you see no other way for those of us on a 64-bit OS to contribute? I can think of a few:

Quite a few members of my team happily run BOINC and Rosetta@home on Linux and Windows x64 computers, with no file modification as far as I know.

Also note: there are four official BOINC platforms: Windows, Linux, Mac OS X x86 (32-bit) and Mac OS X PPC. If you are running a 64-bit BOINC client rather than the 32-bit one, you are doing so unofficially. You should ask whoever created that 64-bit client to add a feature that requests the 32-bit application when no 64-bit application is present; that would solve your problem at all projects. Either that, or put your request to the BOINC server-side developers as a recommendation to implement in their server code. That is, until all projects start producing 64-bit applications by default, and/or BOINC develops an official 64-bit client.

Team mauisun.org
River~~ (Joined: 15 Dec 05, Posts: 761, Credit: 285,578, RAC: 0)
> I'm not really sure why some people have problems with this; evidently there's some gap in my understanding of the combination of Linux and BOINC.

I think the problem comes from two areas: on the user side, a lack of willingness to run 32-bit software on a 64-bit box; on the project side, a lack of understanding (dare I say) by project admins.

User side: the 32-bit client will download and run 32-bit apps on a 64-bit box. The client itself does not run enough of the time to make any appreciable difference to your throughput, so there is no point at all in running a 64-bit client if all the projects you crunch have only the 32-bit app.

Project side: the second problem arises where users want to run 64-bit apps on some projects, because they get better throughput there. These 64-bit apps may be custom builds or (in principle) may be provided officially by the project. The same users want to run the 32-bit apps from other projects, and that is entirely within the spirit of BOINC. It is also entirely within the AMD design of their 64-bit chip (since copied by Intel) to mix 32- and 64-bit apps in this way.

Here the lack of understanding is from the project admins, who do not provide the 64-bit-compatible app they already have: the 32-bit app, which of course runs perfectly well on the 64-bit machine (thanks to that AMD design). Providing it is as simple as copying the 32-bit app to a filename that claims to be a 64-bit app; to save space on the server, they could even use a Linux ln (link) to the very same file. And of course you do it separately for both Windows and Linux.

The question in the thread title is entirely fair, imo, except it is not the users who are being ignored so much as the solution. If the admins knew it was so easy, they would do it. Unless, of course, there is some glaring point that I have overlooked entirely.

Finally, Mats's point that on some projects the extra width will not help is entirely true and (unusually for him) not really relevant. It is true because float arithmetic is already 64-bit on a 32-bit machine, and has been since the maths co-processor moved inside the CPU core (was that the 486DX?). It is only if you need to work with integers bigger than +-2 billion, or address spaces bigger than 4 GB, that there is any advantage at all. It is not really relevant because in that case the right response from the project is still to supply a 32-bit app aliased as a 64-bit app, not to say "we don't do that here, it is not worth the programmer effort."

River~~
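River~~'s aliasing trick, sketched as server-side shell commands; the BOINC platform strings are the standard ones, but the file names and version number are made up:

```sh
# Expose the existing 32-bit Linux binary under the 64-bit platform name.
# A hard link avoids storing the file twice (file names are hypothetical).
ln rosetta_5.25_i686-pc-linux-gnu rosetta_5.25_x86_64-pc-linux-gnu

# Same idea for the Windows platforms (on a Windows server, a plain copy):
#   copy rosetta_5.25_windows_intelx86.exe rosetta_5.25_windows_x86_64.exe
```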
ebahapo (Joined: 17 Sep 05, Posts: 29, Credit: 413,302, RAC: 0)
Mats,

> All the math is done in 32-bit floating point, and it's limited by floating-point unit capacity, not by the registers available to the processor.

This statement is disputable. x87 code is quite inefficient because of the register stack: unlike SSE code, about 30% of x87 instructions exist only to move values into the right slots, because operations must use the top of the stack.

> Using SSE instructions may have a benefit, but to make it worthwhile would require some pretty noticeable reorganizing of the data structures used all over the application, which I haven't experimented with yet, as that requires MAJOR work, not just modifying one function.

No real code change is needed: just recompiling the application with vectorization enabled is enough to use SSE efficiently. SIMAP reported an improvement of about 7% when its application was ported to 64 bits, and an extra 8% when vectorization was enabled in the compiler (see this), without ANY source code change.

HTH
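The kind of recompile ebahapo means, sketched as a GCC 4.x invocation; the source file name is hypothetical, and the exact flag set varies by compiler and version:

```sh
# -m64 selects 64-bit code generation (SSE scalar math by default);
# -ftree-vectorize turns on GCC's auto-vectorizer, and -ffast-math
# relaxes IEEE ordering rules, which many vectorizable loops need.
g++ -O3 -m64 -ftree-vectorize -ffast-math -c rosetta_energy.cc
```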
FluffyChicken (Joined: 1 Nov 05, Posts: 1260, Credit: 369,635, RAC: 0)
BUT this is not SIMAP; the code here is DIFFERENT from SIMAP. Just because Joe Bloggs at RunMeOver@School can flick a switch and get it to work does not mean it will work for someone else.

Team mauisun.org
ebahapo (Joined: 17 Sep 05, Posts: 29, Credit: 413,302, RAC: 0)
> BUT this is not SIMAP; the code here is DIFFERENT from SIMAP. Just because Joe Bloggs at RunMeOver@School can flick a switch and get it to work does not mean it will work for someone else.

The potential to improve the performance is there, especially by getting rid of arcane x87 code. But one will only know after trying, which hasn't been done yet. Instead, Rosetta went on to support cottage OSX/x86 systems...
SekeRob (Joined: 7 Sep 06, Posts: 35, Credit: 19,984, RAC: 0)
> The potential to improve the performance is there, especially by getting rid of arcane x87 code. But one will only know after trying, which hasn't been done yet. Instead, Rosetta went on to support cottage OSX/x86 systems...

That's highly presumptuous! Who's privy to what has been tried in the labs, or to the reasoning, after so many years of x64 on the market and so little growth in its user base, for the very reason that most users simply don't need it? The projects are aimed at the vast resource readily available, not at the bleeding edge of calculation power that requires a disproportionate amount of effort to support. If easy access can be had to 10,000 x32 machines versus 100 x64, where do you think the bets are placed? Well, I tell you what: I'd go for the 10,000 x32.

Coelum Non Animum Mutant, Qui Trans Mare Currunt
Mats Petersson (Joined: 29 Sep 05, Posts: 225, Credit: 951,788, RAC: 0)
> The potential to improve the performance is there, especially by getting rid of arcane x87 code. But one will only know after trying, which hasn't been done yet. Instead, Rosetta went on to support cottage OSX/x86 systems...

I think it's more important to note that getting rid of x87 and its arcaneness will not, in itself, improve the floating-point performance, as I posted in another thread several weeks ago: x87 (at least on the Athlon64/Opteron processors that I have access to) is as fast as single-data SSE operations. And actually getting SIMD functionality into Rosetta would require a fair bit more than just throwing the -m64 switch at the compiler, such as reorganizing the data structures so that data is stored in a form more suitable for loading 4 (or 2) values at once.

I would guess that the effort to support OSX/x86 isn't that much: just compile the Linux version against a different library, and it's done... Some testing effort may be needed to ensure that it calculates correctly in the different environment, but I would think that's relatively easy.

By the way, the x87 register stack is completely "removed" as a bottleneck in Athlon64/Opteron processors, as "change top of stack" is a no-latency operation in this generation of processors. So as long as you can keep the calculation within the stack, it's no different from any other 8-register architecture. Yes, 64-bit mode has 16 registers, but:

1. There's a penalty for using the "upper" register range, as any register in this range must be saved and restored by the callee so that it doesn't clobber the values used by the caller, which means a register needs to be used several times before the upper range pays off. Of course, for functions that call other functions, there is the benefit that these registers are preserved across the call, so there's a use for them in lower-level functions.

2. Few calculations are actually so complex that you need more than 8 registers at once. If you're going to load new data from memory anyway, you may just as well load it into a register you no longer need from some previous calculation. [Note that the internal architecture has more registers than those visible to the programmer; this means that two loads to the same architectural register that could run in parallel actually do run in parallel, as the processor uses different "behind the scenes" registers...]

-- Mats
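A minimal sketch of the data reorganization Mats describes (the atom types are hypothetical, not Rosetta's real structures): SIMD wants the values that are loaded together to sit together in memory.

```cpp
#include <cstddef>
#include <vector>

// Array-of-structures: the usual C++ layout.  x, y and z for one atom
// are interleaved, so one SSE load cannot fetch four x values at once.
struct AtomAoS { float x, y, z; };

// Structure-of-arrays: each coordinate is contiguous, so a vectorizing
// compiler (or hand-written SSE) can process four atoms per instruction.
struct AtomsSoA { std::vector<float> x, y, z; };

float sum_r2_aos(const std::vector<AtomAoS>& a) {
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)    // strided access: resists vectorization
        s += a[i].x * a[i].x + a[i].y * a[i].y + a[i].z * a[i].z;
    return s;
}

float sum_r2_soa(const AtomsSoA& a) {
    float s = 0.0f;
    for (std::size_t i = 0; i < a.x.size(); ++i)  // unit-stride access: vectorizes cleanly
        s += a.x[i] * a.x[i] + a.y[i] * a.y[i] + a.z[i] * a.z[i];
    return s;
}
```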
ebahapo (Joined: 17 Sep 05, Posts: 29, Credit: 413,302, RAC: 0)
Mats,

> And actually getting SIMD functionality into Rosetta would require a fair bit more than just throwing the -m64 switch at the compiler...

Actually, the -m64 and -ftree-vectorize switches. But the vectorizer is smart enough to pipeline the swizzling needed to get the data into the right places; along with loop unrolling, most of the operations are then vectorized.

> By the way, the x87 register stack is completely "removed" as a bottleneck in Athlon64/Opteron processors, as "change top of stack" is a no-latency operation in this generation of processors.

FXCH is almost free, but the code is riddled with FLD too, and that is not free (see Appendix C). Regardless, these are instructions which take up resources all the way from fetch to retirement.

> 1. There's a penalty for using the "upper" register range, as any register in this range must be saved and restored by the callee so that it doesn't clobber the values used by the caller

Ahem... you're confusing the Windows and Linux ABIs. What you said is true only for Windows.

> [Note that the internal architecture has more registers than those visible to the programmer...]

We work for the same company, so it goes without saying... ;-)

> I would guess that the effort to support OSX/x86 isn't that much: just compile the Linux version against a different library, and it's done... Some testing effort may be needed to ensure that it calculates correctly in the different environment, but I would think that's relatively easy.

And how is that different from porting Rosetta to AMD64? My point is that it must be tried. I offer my own time for that, if the developers are interested...
Mats Petersson (Joined: 29 Sep 05, Posts: 225, Credit: 951,788, RAC: 0)
As you work for the same company, I can probably let you get at my machine and have a play with the source code... Just e-mail me, and I'll let you know where to go...

As to FLD: in 80-90% of the cases I've translated to SSE, those convert straight into a mov instruction, with more or less the same latency and other consequences in the processor.

-ftree-vectorize works reasonably well on some code, but when I tried it on Rosetta, it wasn't successful at all...

-- Mats
ebahapo (Joined: 17 Sep 05, Posts: 29, Credit: 413,302, RAC: 0)
> As you work for the same company, I can probably let you get at my machine and have a play with the source code... Just e-mail me, and I'll let you know where to go...

I will.

> As to FLD: in 80-90% of the cases I've translated to SSE, those convert straight into a mov instruction, with more or less the same latency and other consequences in the processor.

FLD is often used to get some ST entry into ST(0), something that doesn't happen in SSE code, which can use arbitrary registers.

> -ftree-vectorize works reasonably well on some code, but when I tried it on Rosetta, it wasn't successful at all...

Have you tried GCC 4.2?
Mats Petersson (Joined: 29 Sep 05, Posts: 225, Credit: 951,788, RAC: 0)
I think it's better if we take this discussion offline and discuss the results publicly when/if we come up with anything useful. As most people don't have access to the source code, discussing what I have and haven't done is pretty pointless in this public forum.

-- Mats
FluffyChicken (Joined: 1 Nov 05, Posts: 1260, Credit: 369,635, RAC: 0)
> I think it's better if we take this discussion offline and discuss the results publicly when/if we come up with anything useful.

Discuss it over at Ralph, since that's the testing and development side of Rosetta. Hope something comes out of it.

Team mauisun.org
Who? (Joined: 2 Apr 06, Posts: 213, Credit: 1,366,981, RAC: 0)
I think working on the threading is much more important than the 64-bit side. The number of consumers who have dual-core, quad-core or SMP machines is much higher than the tiny set of people running a 64-bit OS.

The number of applications for 64 bits is small, and the performance gain seen going from 32 to 64 bits exposes how weak the platform was in 32-bit mode. If you compare 64-bit and 32-bit Core 2 performance, you'll find the improvement is smaller than on the Athlon. One reason is that the Athlon does not deal well with the x87 stack, while Core 2 handles it as well as it handles SSE2 (Athlon x87 instructions are slower than Athlon SSE2 ones). This is why you see a 64-bit performance improvement on the Athlon and you don't see it on Core 2: it exposes a weakness of the Athlon, and was marketed as a "64-bit improvement"! Good job from the 3 marketeers ;-)

Now, if you compare the performance improvement of "64 bits" with the performance improvement of multi-core, it is pretty clear that HyperThreading, then multi-core, was the right way to go. 64 bits is irrelevant for workloads of the Rosetta and SETI kind unless you have 64 processors to fill the memory with 64 workloads (duplicating threads multiplies the memory footprint). I hope the K8L will fix the x87 floating-point glass jaw of the K8, and that we will see less 64-bit marketing on performance-related topics! 64 bits is good for memory addressing; the rest is pure marketing. Your CPU has had a 128-bit instruction set for a long time: it's called SSE!

This post engages only myself; it reflects experimentation I did by myself at home, and my employer is not responsible for any of these statements. I am just fed up with all of this #@!$!$#!@$ about 64 bits.

The common mistake will be to run several instances of the same program, each duplicating the original set of data. In the case of SETI and Rosetta, you can open the data set read-only and share it among all the processors, minimizing memory traffic dramatically and avoiding L2 and L3 cache synchronization issues. Of course, for marketing reasons, multiple instances of the workload will be pushed by the companies selling multiple memory controllers, but this is not the optimum algorithm; read-only shared memory buffers are the way to go for protein folding and pattern matching. Demo coming soon on SETI.

Who?
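A minimal POSIX sketch of the read-only sharing Who? proposes; the file name and float payload are hypothetical. Several worker processes can map the same input file, and the kernel keeps one physical copy that all cores share:

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a shared input data set read-only.  With PROT_READ + MAP_SHARED,
// every process mapping this file sees the same physical pages, and no
// process can dirty a cache line, so there is nothing to snoop.
const float* map_shared_dataset(const char* path, std::size_t* len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }

    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after close
    if (p == MAP_FAILED) return nullptr;

    *len_out = static_cast<std::size_t>(st.st_size);
    return static_cast<const float*>(p);
}

// Usage (hypothetical file name):
//   std::size_t n;
//   const float* data = map_shared_dataset("workunit.dat", &n);
```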
Who? (Joined: 2 Apr 06, Posts: 213, Credit: 1,366,981, RAC: 0)
I forgot: if you want to do the experiment yourself, take a simple algorithm using single- or double-precision floats, compile it in 32-bit mode, and compile it in 64-bit mode. Run it on an AMD... oops! It is faster in 64 bits. Then recompile using a compiler that can vectorize, and make sure it really does vectorize! (/QxW and later should do on Intel C++ 9.0.) Now compare both the 32- and 64-bit versions... ho ho! They are about the speed of your non-vectorized 64-bit version. Your 64-bit goodness is in fact SSEx goodness: x87 is ignored by default in 64-bit mode. Do this experiment yourself if you are not convinced; somebody sold you potatoes with banana taste...

Who?
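Who?'s experiment, sketched as compiler invocations; fpbench.c is a hypothetical benchmark source, and the flags are of the GCC 4.x / ICC 9.x era, so they may need adjusting:

```sh
# 1) Plain 32- and 64-bit builds.  The 64-bit one is faster, because the
#    compiler abandons x87 for scalar SSE by default in 64-bit mode.
gcc -O3 -m32 fpbench.c -o fp32
gcc -O3 -m64 fpbench.c -o fp64

# 2) Now force SSE in the 32-bit build as well:
gcc -O3 -m32 -mfpmath=sse -msse2 fpbench.c -o fp32sse
#    (with Intel's compiler on Windows: icl /O3 /QxW fpbench.c)

# fp32sse typically lands close to fp64: the "64-bit" win was an SSE win.
```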
River~~ (Joined: 15 Dec 05, Posts: 761, Credit: 285,578, RAC: 0)
> I think working on the threading is much more important than the 64-bit side.

As a general point in computing theory, I agree with you. In terms of a distributed computing project, I disagree.

First, the design of a DC project involves splitting the big task into as many separate work units as possible. If this is done effectively, each WU should end up single-threaded.

Secondly, think of a grid project, by which I mean a project where all the computers are controlled by the same organisation. In that case it is possible to make all the computers identical. If each were a quad core, it would be possible to design WUs to run quad-threaded. There would still be times when one thread, or three, were not running, and that is likely to outweigh the advantages you mention from running four threads from the same task.

Coming back to a DC project, with its diversity in numbers of cores, it would take a lot more human effort to tailor different apps for one core, two cores, one core with hyperthreading (which could in principle differ from two cores), and so on. A lot more programmer effort, and we already have reason to think the result would be less good than doing it the easy way: sending N tasks to an N-core machine.

Thirdly, think of BOINC. By running each app single-threaded, BOINC can adapt the mix of tasks more effectively: loads can be balanced by running Einstein on one core and Rosetta on another.

Fourthly, think of the user. When the user starts work on a multi-core box, only one of the BOINC apps needs to be stood down (this is done by the OS, not by the client). A task optimised to be quad-threaded might run horrendously if allowed only three of the cores while the user ran something that took one core away.

In a previous discussion of this issue, FluffyChicken summarised all this by saying that a DC project is already multi-threaded anyway; a nice way to put it.

So yes, insist that your spreadsheet is multi-threaded, and your database engine, and your word processor, and so on. But not a DC application.

River~~
Who? (Joined: 2 Apr 06, Posts: 213, Credit: 1,366,981, RAC: 0)
> I think working on the threading is much more important than the 64-bit side.

Intel has already released dual cores with a shared cache, and AMD will follow next year with the same kind of idea. Avoiding data duplication across multiple instances makes a lot of sense: doing so avoids cache thrashing, and the improvement factor can be dramatic, especially if you get the chance to fit into the L2 cache (L3 in the case of the K8L). For example, multiple instances of SETI or Rosetta use the same data set. READ_ONLY allocation can stop cache snooping and help the effective size of the buffers: two or more independent tasks can pick from this read-only buffer without bothering the caches or memory controllers. That is what has to be done in the near future.

who?
River~~ (Joined: 15 Dec 05, Posts: 761, Credit: 285,578, RAC: 0)
> For example, multiple instances of SETI or Rosetta use the same data set. READ_ONLY allocation can stop cache snooping and help the effective size of the buffers: two or more independent tasks can pick from this read-only buffer without bothering the caches or memory controllers.

For Rosetta, that would mean the project team going over to the scheme used by Einstein and SETI to tell the client that the data is the same. I advocated this about 10 months ago, but the project team had some reason (which I forget) for not doing it. The advantage I was suggesting then was that by making the data files sticky, the host would save on downloads and the project would save on bandwidth. Your suggestion would add a further significant saving *if* the saved cache misses outweighed the fact that the two apps are competing for the same resources.

In addition, the client would have to be modified (a BOINC responsibility, not Rosetta's) so that it preferentially scheduled tasks together when they use the same data files. This would mean different scheduling for cores that share cache and cores that don't (e.g. cores in different sockets), because the latter certainly run faster on mixed work than on similar work.

On the other side of the balance, where the two or four cores on a chip share hardware (like the two virtual cores in an Intel HT chip), the advantage of not running the same program twice can also be significant. I am not convinced, as you seem to be, that the savings in cache misses would outweigh this.

R~~
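The sticky-file mechanism River~~ mentions lives in the workunit's input template on the server side. A guess at the relevant fragment (the rest of the template is omitted, and whether Rosetta's data files fit this scheme is exactly the open question above):

```xml
<file_info>
    <number>0</number>
    <sticky/>       <!-- keep this input file on the host after the task ends -->
    <no_delete/>    <!-- don't delete it on the server while WUs still reference it -->
</file_info>
```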