Message boards : Number crunching : Rosetta@home using AVX / AVX2 ?
Author | Message |
---|---|
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,001,518 RAC: 6,291 |
Hi everybody, The gcc8 -march=skylake-avx512 builds do not perform well compared with the gcc8 -march=skylake builds. My guess is that -march=skylake turns on full AVX2 and the other new instructions added through Broadwell. The skylake-avx512 option is said to also enable the AVX512F, AVX512VL, AVX512BW, AVX512DQ, and AVX512CD instruction families, but it is unclear whether the benchmark code has any code sequences that can take advantage of AVX512 instructions beyond AVX512BW. I would expect the FFT code to benefit, and it saw an 8% increase in performance using AVX512BW. If the benchmark code happened to be written like Rosetta, I could see the AVX512 code running slower. The compiler can operate on 8 single-precision or 4 double-precision values in parallel only if the code is written to allow it. For Rosetta, it is a moot point until the project developers see a need for it. The design of Rosetta dictates that it processes all floating-point operations in SCALAR mode rather than in VECTOR mode. I guess I need to download the code and look at the code sequences again to see if they are doing anything performance related. Interestingly, there is a World Community Grid project that uses Rosetta code too. Project Name: Microbiome Immunity Project wcgrid_mip1_rosetta_7.11_windows_intelx86.exe |
Dr. Merkwürdigliebe Send message Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0 |
My answer to all of this up to this point: FAHCore a7. I don't mean to be rude, but most likely I am. Certain programmers don't know what the hell they're doing. This is a badly managed project. I draw my conclusions from that. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,560,976 RAC: 6,611 |
This is a badly managed project. I draw my conclusions from that. During my years of participation (and talking with rjs5), it seems to me that the development of this large codebase by many people from different institutions may have led to "confusion" about the code. Rjs5 has read the source code and said that it is like a "jumble" where every developer added what interested him, without centralized strict rules. So it is difficult to optimize, and, moreover, it seems they are not even interested in doing so.... I'm curious to see whether the new "C++ wave" (conversion of all libraries to this language) will improve how the code is written. |
Dr. Merkwürdigliebe Send message Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0 |
Maybe they should just abandon this nonsensical attempt to support HPC (sort of) on smartphones and tablets... 4, 6 and even 8-core AVX-capable CPUs will be ubiquitous in a few years with a TDP in the 90-140W range. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,560,976 RAC: 6,611 |
Maybe they should just abandon this nonsensical attempt to support HPC (sort of) on smartphones and tablets... Although I'm crunching (usually on Ralph) with my phone, I agree with you. For example, a Ryzen 1700 (only 65W) will outclass a top-level smartphone by some orders of magnitude... not to mention a high-end GPU... :-P |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,001,518 RAC: 6,291 |
Maybe they should just abandon this nonsensical attempt to support HPC (sort of) on smartphones and tablets... I would suspect that the cost of supporting phones and tablets is about the same as supporting Windows, Linux or macOS. They just recompile the code, and the execution is limited by scalar floating point and call/return chains. That porting effort is likely independent of the main developers. Going from scalar to 2 floating-point computations in parallel is likely just a TYPEDEF change (add a 4th dimension) that is sprinkled over the code. That has to be done before ANY reasonable parallel operation can be done. Going from 2-wide to 4-wide parallel operation (128-bit AVX to 256-bit AVX) would be a recompile and would require the Rosetta server to recognize a subset of machines to steer the 256-bit binary to. Alternatively, some compilers support runtime detection of the machine type and fire up different versions of the compiled code. These "fat binaries" will run on an old CPU that only supports SSE or on a new one that supports AVX2. I am not sure how well or poorly the project is managed. There is substantial evidence that comments from this group of crunchers carry little weight ... other than the valuable ... "it is broken" when problems develop. |
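The runtime-detection idea in the post above can be sketched roughly as follows. This is only an illustration of the selection logic: the variant names, the flag strings, and the `pick_variant` helper are hypothetical and are not taken from Rosetta@home, BOINC, or any compiler. A real fat binary would query the CPU directly rather than receive a list of flags.

```python
# Hypothetical sketch: choose the most capable build that a CPU can run.
# Variant names and feature-flag strings are illustrative assumptions,
# not Rosetta@home or BOINC identifiers.

# Ordered from most to least demanding; each entry lists the flags it requires.
VARIANTS = [
    ("avx2", {"sse2", "avx", "avx2"}),
    ("avx", {"sse2", "avx"}),
    ("sse2", {"sse2"}),
]


def pick_variant(cpu_flags):
    """Return the name of the first variant whose required flags are all present."""
    flags = {f.lower() for f in cpu_flags}
    for name, required in VARIANTS:
        if required <= flags:  # subset test: every required flag is available
            return name
    return "scalar"  # fallback that assumes no SIMD extensions at all


if __name__ == "__main__":
    print(pick_variant(["SSE2", "AVX", "AVX2"]))
    print(pick_variant(["sse2"]))
```

The same ordering could in principle be applied on the server side, steering each host to the widest binary it reports support for.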
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,560,976 RAC: 6,611 |
Optimization of the code. In the Acoustics@Home project, a volunteer optimized the code, and these are the results: Here are results from But the R@H guys are not interested... |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,001,518 RAC: 6,291 |
My Skylake-X system with AVX-512 support is scheduled to be delivered today. I think I might download the Rosetta source and see what the developers have done to the code in the last year. 8-) Intel Core X (Socket 2066) w/ Core i9 9980XE (18 Core 36 Thread @ 4.5GHz) Asus ROG-Strix-X299-E XL Liquid Cooling Package 32GB DDR4-2666 (2x16GB Kit) 1TB Solid State Drive (SSD) - M.2 - NVMe Antec Silent Mid-Tower Case 850W - Power Supply w/ Active PFC DVD +/- RW Drive - Internal (SATA) Optimization of the code. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,560,976 RAC: 6,611 |
I think I might download the rosetta source and see what the developers have done to the code in the last year. 8-) I'm not so optimistic. The latest version of the R@H code (4.07) is almost one year old... P.S. Great PC!!! |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,001,518 RAC: 6,291 |
I think I might download the rosetta source and see what the developers have done to the code in the last year. 8-) The code version doesn't really matter to me as long as it runs the workloads. I am going to make some minor changes to the "vector" object definition ... from an array of 3 FP numbers to an array of 4 FP numbers. I am going to try to get the compilers to generate code that uses both halves of the SSE2 registers when computing. When that works, I will turn on AVX-512 and see how crunching the vectors in one operation changes performance. I am just going to try to get some "empirical" data that might interest the developers enough to introduce similar changes. BTW, the new machine is still accumulating Rosetta RAC; it is around 21,800 and topping out. Top machine #34. With the liquid cooler, it is running at about 65 degrees C. The only BIOS change I made was to tell the CPU not to exceed 80 degrees, but Linux tools say that the CPU is running at 3.8 GHz. I am running predominantly Rosetta, with a random WCG "Help TB" thrown in, and GPU WUs too. |
Dr. Merkwürdigliebe Send message Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0 |
Vote with your feet |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,560,976 RAC: 6,611 |
The code version doesn't really matter to me as long as it runs the workloads. I am going to make some minor changes to the "vector" object definition ... from an array of 3 FP numbers to an array of 4 FP numbers. I am going to try to get the compilers to generate code that uses both halves of the SSE2 registers when computing. When I said I'm not optimistic, I didn't mean about you. I know you have the knowledge to work on the optimization side. You said "what the developers have done to the code"... I think that, during this year, the R@H developers have not worked much on the code (because the exe is still the same). |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
When that works, I will turn on AVX-512 and see how crunching the vectors in one operation changes performance. My understanding is that AVX-512 is not likely to be widely adopted by Intel on future chips, due to space and power requirements. It seems to have been a special case for the Skylake generation. So I hope your efforts can be applied to something more mundane, such as AVX2. It appears that AMD is going to improve their implementation of it on Ryzen 3000, and so should be widely available before long. Good luck. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,560,976 RAC: 6,611 |
The code version doesn't really matter to me as long as it runs the workloads. I am going to make some minor changes to the "vector" object definition ... from an array of 3 FP numbers to an array of 4 FP numbers. I am going to try to get the compilers to generate code that uses both halves of the SSE2 registers when computing. When that works, I will turn on AVX-512 and see how crunching the vectors in one operation changes performance. An example of your knowledge: you noticed, some time ago, that they are using a very old version of the GCC compiler. Are you using the latest version? Are THEY using the latest version?? |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,560,976 RAC: 6,611 |
So I hope your efforts can be applied to something more mundane, such as AVX2. It appears that AMD is going to improve their implementation of it on Ryzen 3000, and so should be widely available before long. Good luck. +1. AVX seems to be enough. An example: Acoustics |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 23,001,518 RAC: 6,291 |
So I hope your efforts can be applied to something more mundane, such as AVX2. It appears that AMD is going to improve their implementation of it on Ryzen 3000, and so should be widely available before long. Good luck. Interesting link. Since Rosetta is working on only 3-dimensional arrays, it currently performs 3 loads, 3 operations and then 3 stores. Changing to 4 dimensions will have SSE2 do 2 loads, 2 operations and 2 stores. AVX-512 would not make much difference, and few could run the binary. I think that from Skylake forward, Intel was looking at not stalling on software prefetches. If the code issued a software prefetch and all the read/write buffers were busy, the software prefetch was not executed. 32-bit code has a smaller code footprint and a smaller data footprint and therefore makes the on-chip caches more effective. IMO, a 32-bit code version with a 4-dimensional vector, so that SSE2 does only 2 operations instead of 3, would probably be the fastest. Rosetta probably measures performance by running one copy on one of their servers with large caches. The optimizations they chose bloat the runtime size, and running multiple Rosetta binaries stresses the hardware and slows all copies down. They over-tune the binary using a single execution. |
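The load/operation/store counts in the post above follow from simple lane arithmetic, sketched here. The sketch assumes 64-bit (double-precision) coordinates, which the thread does not confirm for Rosetta, and it treats scalar execution as a one-lane "register". The helper names `lanes` and `register_ops` are made up for illustration.

```python
import math


def lanes(register_bits, value_bits=64):
    """How many values of value_bits fit into one register of register_bits."""
    return register_bits // value_bits


def register_ops(n_values, register_bits, value_bits=64):
    """Register-wide operations needed to process n_values elements once."""
    return math.ceil(n_values / lanes(register_bits, value_bits))


if __name__ == "__main__":
    # Assumption: doubles (64-bit). Scalar code is modelled as a 64-bit register.
    print("scalar, 3 values: ", register_ops(3, 64))
    print("SSE2,   4 values: ", register_ops(4, 128))
    print("AVX,    4 values: ", register_ops(4, 256))
    print("AVX-512, 4 values:", register_ops(4, 512))
```

Under these assumptions a padded 4-element vector already fits in a single 256-bit operation, so a 512-bit register would run with half of its lanes idle, which is consistent with the remark that AVX-512 would not make much difference for this data layout.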
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1994 Credit: 9,560,976 RAC: 6,611 |
Since Rosetta is working on only 3-dimensional arrays, it currently performs 3-loads, 3-operations and then 3-stores. It seems that they have some problems with 3-dimensional arrays: 60692 Make the "Cannot normalize xyzVector of length() zero" error more informative. The "Cannot normalize xyzVector of length() zero" error is a pain in the neck, because it's hard to know exactly what vector was tripping it up, and in what context. @everyday847 had the idea a while back of adding more try/catch blocks around calls to xyzVector::normalize() which would themselves re-throw after adding more information about the context to the error message, to aid debugging. This is a first pass at that. |
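The catch-and-re-throw pattern described in that commit note might look roughly like this. It is a minimal sketch of the general technique in Python, not the Rosetta C++ code: `ZeroLengthVectorError`, `normalize`, and `normalize_with_context` are hypothetical names, and the context string is whatever the caller chooses to pass.

```python
import math


class ZeroLengthVectorError(ValueError):
    """Raised when a zero-length vector cannot be normalized (hypothetical name)."""


def normalize(v):
    """Return v scaled to unit length; v is a sequence of floats."""
    length = math.sqrt(sum(c * c for c in v))
    if length == 0.0:
        raise ZeroLengthVectorError("Cannot normalize vector of length zero")
    return tuple(c / length for c in v)


def normalize_with_context(v, context):
    """Call normalize() and, on failure, re-raise with the caller's context added."""
    try:
        return normalize(v)
    except ZeroLengthVectorError as err:
        # Keep the original exception as the cause so the full chain is visible.
        raise ZeroLengthVectorError(f"{err} (while {context})") from err


if __name__ == "__main__":
    print(normalize_with_context((3.0, 0.0, 4.0), "computing a bond direction"))
```

The low-level function stays unaware of its callers, while each call site can say what it was doing when the degenerate vector appeared.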
G.L.I.S. Send message Joined: 25 Dec 08 Posts: 26 Credit: 2,227,945 RAC: 2,778 |
Without wishing to flame or sound rude, we regret to see projects in 2019 that do not exploit the potential of modern processors. I don't necessarily mean AVX, but some form of SIMD; it would mainly benefit the project itself. Best regards |
©2024 University of Washington
https://www.bakerlab.org