GPU computing

Message boards : Number crunching : GPU computing

Mark

Joined: 10 Nov 13
Posts: 40
Credit: 397,847
RAC: 0
Message 80082 - Posted: 14 May 2016, 17:10:05 UTC
Last modified: 14 May 2016, 17:23:58 UTC

I've just been looking at the performance of the new GTX 1080, and for DOUBLE precision calculations it does 4 TFLOPS!!!! For comparison, a relatively high-performance chip like an overclocked 5820K will do maybe 350 GFLOPS. So we are talking about an order of magnitude difference. In addition, the Tesla HPC version will probably be double that at 8 TFLOPS. (Edit: Looks like it is actually 5.3 TFLOPS.) The Volta version of the GTX 1080 (the next generation, due in about 18 months) is rumoured to reach 7 TFLOPS FP64 in the consumer version.

There is no way that conventional processors can keep up with that level of calculation. How big does the gap between serial CPU and parallel GPU have to be before the project leaders decide they cannot afford NOT to invest in recoding for parallel processing? Because in 2 years' time, HPC GPUs will be around 35 times faster than CPUs. How much will it cost to rewrite the code, $100-150K maybe? Isn't that worth paying for such a huge step up?

With that kind of performance increase, you can make the calculations more accurate. You no longer have to use approximations like LJ potentials; you can calculate the energy accurately and get a better answer in less time than now. What's not to like?

As with so many projects, it seems everyone is comfortable with what they are doing now. Revolution has been forsaken for evolution. Understandable, but is it the best way to do things?

Be bold and take the leap!

dcdc
Joined: 3 Nov 05
Posts: 1832
Credit: 119,870,251
RAC: 1,154
Message 80083 - Posted: 14 May 2016, 23:20:17 UTC
Last modified: 14 May 2016, 23:26:24 UTC

It's been said a thousand times: The people who have looked have all said it's not viable. If you think you can get Rosetta to run faster on a GPU than on a CPU then offer your services - they've shown that they're willing to work with serious and capable people.

GPUs are great for some stuff, but most programs run faster on a CPU regardless of the theoretical numbers for perfect, hugely parallel workloads.
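
The gap between headline numbers and real programs can be illustrated with Amdahl's law. This is only a sketch: the 35x factor is the GPU-vs-CPU ratio suggested earlier in the thread, the offloadable fractions are made-up illustrations rather than Rosetta measurements, and `amdahl_speedup` is a hypothetical helper name.

```python
def amdahl_speedup(parallel_fraction: float, accelerator_factor: float) -> float:
    """Overall speedup when only part of the runtime can be accelerated.

    parallel_fraction: share of the original runtime that can be offloaded (0..1).
    accelerator_factor: how many times faster the offloaded part runs.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if accelerator_factor <= 0:
        raise ValueError("accelerator_factor must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / accelerator_factor)


if __name__ == "__main__":
    # Assumed values for illustration only.
    for fraction in (0.25, 0.50, 0.90, 0.99):
        overall = amdahl_speedup(fraction, 35.0)
        print(f"{fraction:.0%} offloadable -> {overall:.2f}x overall")
```

Under this model, a program that can offload half of its runtime gains less than 2x overall, however fast the accelerator is.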

Brute force through more cores, and more efficient and faster ones, is currently the best option unless rjs5 works some magic.

Hopefully Zen will deliver a cheap way to 16 fast threads.

D

Timo
Joined: 9 Jan 12
Posts: 185
Credit: 45,649,459
RAC: 0
Message 80084 - Posted: 14 May 2016, 23:31:05 UTC

We basically have someone starting a thread like this one every 3-4 months.

Rosetta@Home's protocols simply don't lend themselves well to GPU architecture. If you have a GPU floating around and want to do protein-related research with it, POEM@Home and Folding@Home both run protein folding simulations on the GPU. They each tackle a fundamentally different problem from Rosetta's, though, so their simulations can use common molecular dynamics libraries that are very GPU-friendly.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 2014
Credit: 9,821,437
RAC: 2,210
Message 80088 - Posted: 16 May 2016, 9:19:45 UTC - in response to Message 80084.  

We basically have someone starting a thread like this one every 3-4 months.


That's true, but....
The last "public" test of Rosetta on GPU, if I'm not wrong, was 4-5 years ago. Some things have changed since then in both hardware and software. :-)

dcdc
Joined: 3 Nov 05
Posts: 1832
Credit: 119,870,251
RAC: 1,154
Message 80089 - Posted: 16 May 2016, 9:54:00 UTC - in response to Message 80088.  

We basically have someone starting a thread like this one every 3-4 months.


That's true, but....
The last "public" test of Rosetta on GPU, if I'm not wrong, was 4-5 years ago. Some things have changed since then in both hardware and software. :-)

The problem is that the people who post saying it must be done are generally not capable of doing it, of knowing whether it is doable, or of doing the cost-benefit analysis of whether it would be worthwhile even if it were viable. I think one of the main things that is often overlooked is that a priority for Rosetta is the continual development of the software's capabilities, rather than optimising it at any one point in time.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 2014
Credit: 9,821,437
RAC: 2,210
Message 80090 - Posted: 16 May 2016, 11:59:15 UTC - in response to Message 80089.  
Last modified: 16 May 2016, 12:00:22 UTC

The problem is that the people who post saying it must be done are generally not capable of doing it, of knowing whether it is doable, or of doing the cost-benefit analysis of whether it would be worthwhile even if it were viable.


I agree with you. I know the problems of GPU coding, and I think it is important to have "do-better" code for good science. But I also understand people who want "do-faster" code to produce more results (with GPU or with CPU optimizations).
Is it possible to have "do-better-faster" code?? :-P

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 80091 - Posted: 16 May 2016, 21:42:08 UTC

I believe rjs5 is doing a general review, looking for the kind of parallel operations that GPUs do so well, as he looks for CPU optimizations.
Rosetta Moderator: Mod.Sense

Mark
Joined: 10 Nov 13
Posts: 40
Credit: 397,847
RAC: 0
Message 80092 - Posted: 17 May 2016, 13:26:35 UTC

I get all the points made here. I'm not saying it's easy. The original question is: at what point does the difference between CPU and GPU performance become so great that the conversion project becomes worthwhile?

The Rosetta code got upgraded in a big project (to C++ I think) a while back. I am talking about a similar effort. Yes, I realise it will take time and ultimately money. Anyone have a feel for how much it would cost? I plucked a figure out of the air a bit, was it reasonable?

dcdc
Joined: 3 Nov 05
Posts: 1832
Credit: 119,870,251
RAC: 1,154
Message 80093 - Posted: 17 May 2016, 18:58:05 UTC - in response to Message 80092.  

I get all the points made here. I'm not saying it's easy. The original question is: at what point does the difference between CPU and GPU performance become so great that the conversion project becomes worthwhile?

I'm not sure that it ever does, unless either the compiler does the work to make the code work on GPU, or there are parts of the code that are largely static so that the ongoing code development isn't hindered/complicated.

The Rosetta code got upgraded in a big project (to C++ I think) a while back. I am talking about a similar effort. Yes, I realise it will take time and ultimately money. Anyone have a feel for how much it would cost? I plucked a figure out of the air a bit, was it reasonable?

I think that's right - I think they moved from Fortran(?). I think rjs5 is the best qualified/positioned to answer that one (although putting a cost to it might not be possible).

D

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 2014
Credit: 9,821,437
RAC: 2,210
Message 80095 - Posted: 18 May 2016, 10:13:03 UTC - in response to Message 80092.  
Last modified: 18 May 2016, 10:14:00 UTC

The original question is at what point does the difference between cpu and gpu performance become so great that the conversion project becomes worthwhile.


With the upcoming generation of GPUs (Nvidia and AMD are presenting new families on new 16/14 nm manufacturing processes), the gap between CPU and GPU will be incredible, but....

- Not all software can run on GPUs.
- Rosetta is now in C++ but, if I'm not wrong, still has some parts/libraries in Fortran (or Fortran-like).
- The Rosetta team is focused on the scientific part of the code, not on optimization/GPU/whatever. The only person working on that side of the code is, indeed, a volunteer.
- They are using old compilers, server OS, BOINC server, etc., and do not seem very interested in updating.

I think one solution may be a graduate/PhD/etc. in computer science who works, within the Rosetta team, ONLY on optimizing the code. Another solution may be to open-source the Rosetta code.
Only the admins can decide.

rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 23,229,863
RAC: 567
Message 80097 - Posted: 18 May 2016, 15:06:21 UTC - in response to Message 80095.  
Last modified: 18 May 2016, 15:07:35 UTC

The original question is at what point does the difference between cpu and gpu performance become so great that the conversion project becomes worthwhile.


With the upcoming generation of GPUs (Nvidia and AMD are presenting new families on new 16/14 nm manufacturing processes), the gap between CPU and GPU will be incredible, but....

1- Not all software can run on GPUs.
2- Rosetta is now in C++ but, if I'm not wrong, still has some parts/libraries in Fortran (or Fortran-like).
3- The Rosetta team is focused on the scientific part of the code, not on optimization/GPU/whatever. The only person working on that side of the code is, indeed, a volunteer.
4- They are using old compilers, server OS, BOINC server, etc., and do not seem very interested in updating.

I think one solution may be a graduate/PhD/etc. in computer science who works, within the Rosetta team, ONLY on optimizing the code. Another solution may be to open-source the Rosetta code.
Only the admins can decide.




1- GPUs are essentially very wide, heterogeneous "AVX" registers. You ship a vector of data to the GPU, it crunches MANY elements (hundreds/thousands) at once ... and then you retrieve the results. The overhead of the transfers has to be small compared to the benefit.

2- It appears to me that Rosetta (or major chunks of it) started out in Fortran and was then converted to C++. I am not a C++ programmer, but it appears the programmer or the converter tool went slightly overboard on the templates and made some fundamental mistakes in the data structure design. Rosetta is based on an XYZ vector data element. All Rosetta XYZ operations are performed on X, then Y and then Z. IF they changed the XYZ to an XYZW 4-element structure, the compilers could be encouraged to perform operations on the XY pair, then the ZW pair ... for a 50% improvement with SSE. AVX2 could perform operations on the combined XYZW elements for a 75% speedup ON THOSE SECTIONS OF CODE. This is what I am looking at. I speak "C" with an "Assembler accent" and I am looking at C++ through "very thick glasses" using a C-to-C++ translator. I do OK with C but C++ is new.

Crunching an XYZW vector coordinate is attainable without much Rosetta modification. I am not familiar enough to guess where the next step in parallelism might be ... but the XYZW conversion would be a first step in either case anyway.

3- I can testify that this is accurate. With 800k users and 1.7 million hosts, they cannot afford to speed Rosetta up too much or they will melt the internet transferring the 279 MB database files. 8-) I am also looking at the impact and issues with partitioning Rosetta protocols.

4- The Rosetta BOINC server SW could use updating. They are building Rosetta now with newer versions of SW that are doing a pretty good job. Their HW is pretty dated. I suspect that the HW is groaning under the weight of its success ... assuming their equipment/server descriptions are current (other than the GB vs. GHz typos). The only thing that would cost more than a few $k to upgrade would be the 48 x 600GB disk drives on the GPFS SAN fileserver. It really depends on the system loading patterns and where the bottlenecks are. You could probably make a noticeable difference in performance with $5 of equipment judiciously applied.
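
The instruction counts behind point 2 can be sketched with a little arithmetic. This is an idealized model, not a measurement of Rosetta: it assumes double-precision coordinates (so a 128-bit SSE register holds 2 values and a 256-bit AVX register holds 4), counts one instruction per register-wide operation, and ignores loads, shuffles and memory bandwidth. The function names are hypothetical.

```python
import math

# One scalar operation each for X, Y and Z, as described in point 2.
SCALAR_OPS_XYZ = 3


def padded_vector_ops(lanes: int, components: int = 4) -> int:
    """Register-wide instructions needed for one element-wise operation
    on a padded XYZW element, given how many doubles fit in a register."""
    if lanes < 1:
        raise ValueError("lanes must be at least 1")
    return math.ceil(components / lanes)


def ideal_ratio(lanes: int) -> float:
    """Scalar XYZ instruction count divided by the padded vector count."""
    return SCALAR_OPS_XYZ / padded_vector_ops(lanes)


if __name__ == "__main__":
    # Assumed register widths: 2 doubles for SSE2, 4 doubles for AVX.
    for name, lanes in (("SSE2", 2), ("AVX", 4)):
        print(f"{name}: {padded_vector_ops(lanes)} instruction(s), "
              f"ideal ratio {ideal_ratio(lanes):.1f}x")
```

By this count the XY/ZW split needs 2 instructions where the scalar code needs 3, and a 4-wide register needs 1. Real gains depend on how much of the runtime is spent in such code and on how the data is laid out in memory.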





Rosetta@home Hardware

Web servers: boinc, srv1, srv2, srv3, srv4, srv5
4-core Dell R210s
mirrored 146GB 15K RPM system disks
8 GB RAM

Database server
Dell 2950, dual-quad 2GB Xeon E5405 w/ 6MB cache
32 GB RAM
mirrored root raid, running 2.6.18-92.1.22.el5 (64 bit)
the database files are written to a hardware mirror of two 15K 300GB disks
dual GB NIC

GPFS SAN fileserver
two IBM x3650
one IBM (1)DS3512/(3)EXP3512 disk controllers
48 600GB 15,000 RPM SAS disks
clustered fileserver running IBM's GPFS filesystem
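
Point 1 in this post says the transfer overhead has to be small compared to the benefit. A minimal sketch of that trade-off follows, assuming a simple model in which the GPU runs a batch `gpu_factor` times faster than the CPU but pays a fixed transfer cost per batch. All names and numbers are illustrative assumptions, not Rosetta measurements.

```python
def offload_speedup(cpu_seconds: float, gpu_factor: float,
                    transfer_seconds: float) -> float:
    """Speedup of offloading one batch, including data transfer time."""
    if cpu_seconds <= 0 or gpu_factor <= 0 or transfer_seconds < 0:
        raise ValueError("times and factor must be positive (transfer may be 0)")
    gpu_total = cpu_seconds / gpu_factor + transfer_seconds
    return cpu_seconds / gpu_total


def break_even_transfer(cpu_seconds: float, gpu_factor: float) -> float:
    """Largest transfer time at which offloading is no slower than the CPU."""
    return cpu_seconds * (1.0 - 1.0 / gpu_factor)


if __name__ == "__main__":
    # Assumed: a batch that takes 1 ms on the CPU and runs 10x faster on the GPU.
    cpu_s, factor = 1e-3, 10.0
    for transfer_s in (0.0, 1e-4, 5e-4, 9e-4, 2e-3):
        print(f"transfer {transfer_s * 1e3:.1f} ms -> "
              f"{offload_speedup(cpu_s, factor, transfer_s):.2f}x")
    print(f"break-even transfer: {break_even_transfer(cpu_s, factor) * 1e3:.1f} ms")
```

In this model, small batches lose to the CPU once the transfer takes longer than the break-even time, which is why the amount of work shipped per transfer matters as much as the raw GPU speed.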

Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 80105 - Posted: 19 May 2016, 15:19:28 UTC - in response to Message 80097.  

I can't fathom the computing knowledge you need for something like Rosetta. Or anything useful for that matter... I just got into learning Python (I figured an EE should know a good bit of programming) and I'm struggling like mad. MATLAB is the only language I'm proficient at, but it's so user friendly it doesn't count IMO.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 2014
Credit: 9,821,437
RAC: 2,210
Message 80106 - Posted: 19 May 2016, 16:13:37 UTC - in response to Message 80105.  
Last modified: 19 May 2016, 16:14:16 UTC

I can't fathom the computing knowledge you need for something like Rosetta. Or anything useful for that matter... I just got into learning Python (I figured an EE should know a good bit of programming) and I'm struggling like mad. MATLAB is the only language I'm proficient at, but it's so user friendly it doesn't count IMO.


If I remember correctly, the public test of Rosy on GPU used an old version of pycl.

This is the post one developer wrote about the test. It's a pity that the PDFs are no longer available.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 2014
Credit: 9,821,437
RAC: 2,210
Message 80172 - Posted: 13 Jun 2016, 7:00:16 UTC - in response to Message 80082.  

I've just been looking at the performance of the new GTX1080 and for DOUBLE precision calculations it does 4 Tflops!!!!


I'm curious to see the upcoming RX 480: over 5 TFLOPS SP for $199 and 150 W.


Link
Joined: 4 May 07
Posts: 356
Credit: 382,349
RAC: 0
Message 80173 - Posted: 13 Jun 2016, 17:46:32 UTC - in response to Message 80097.  

3- I can testify that this is accurate. With 800k users and 1.7mil hosts, they cannot afford to speed Rosetta up too much or they will melt the internet transferring the 279mb database files. 8-)

Every host downloads this file just once, when a new version is released. More efficient application code won't change anything here. I guess most WUs still finish before reaching the max decoys allowed, so even the amount of other files downloaded should not change much (except for some fast hosts with long target runtimes); possibly the result files will become larger.
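
The arithmetic behind this point can be sketched as follows. It assumes "279mb" means 279 megabytes and that every one of the 1.7 million hosts mentioned above fetches the database once per release (in practice only active hosts would). The function name is made up for illustration.

```python
def release_traffic_tb(hosts: int, database_mb: float) -> float:
    """Total database download volume for one release, in decimal terabytes.

    Application speed does not appear in the formula: the volume depends
    only on how many hosts download the file and how large it is.
    """
    if hosts < 0 or database_mb < 0:
        raise ValueError("hosts and database_mb must be non-negative")
    return hosts * database_mb / 1_000_000  # 1 TB = 1,000,000 MB


if __name__ == "__main__":
    # Figures quoted in the thread; treating all hosts as active is an assumption.
    print(f"{release_traffic_tb(1_700_000, 279):.1f} TB per release")
```

Under these assumptions the database traffic scales with the number of releases and hosts, not with how fast each work unit runs.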

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 2014
Credit: 9,821,437
RAC: 2,210
Message 80434 - Posted: 26 Jul 2016, 8:27:00 UTC

The old problem with Rosy on GPU is the GPU memory.
Problem solved!!! AMD Radeon Pro SSG

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 2014
Credit: 9,821,437
RAC: 2,210
Message 80705 - Posted: 5 Oct 2016, 19:03:16 UTC

Poem@Home is closing.
Two things:
- A lot of GPU power will be freed up
- The source code of their OpenCL protein folding app will be released to the public

Dr. Merkwürdigliebe
Joined: 5 Dec 10
Posts: 81
Credit: 2,657,273
RAC: 0
Message 80706 - Posted: 5 Oct 2016, 19:28:45 UTC

First things first

1) folding@home for YOUR GPU and probably soon for your CPU, too (AVX support) through GROMACS.

2) but AVXx support first for rosetta.

Is there any progress worth speaking of?

It's interesting: in the past they (poem@home) have started to recruit external personnel in order to optimize their app. Mr. Tankovich obviously did a great job: They no longer depend on the donations of pesky contributors like us.

In the long run (John Maynard Keynes: "In the long run we are all dead...") I predict this solution for Rosetta, too. There is really no way that scientists should have to wait hours or even days for their computation results. Personally, I hope BOINC will die because it's a kludge.

So unless another thing like Charity Engine comes along, I think that instead of having to rely on a legion of desktop-grade CPUs, it would be wiser to spend all the resources on a central cluster of AVXx-powered servers under central control.

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1480
Credit: 4,334,829
RAC: 0
Message 80707 - Posted: 5 Oct 2016, 21:15:11 UTC - in response to Message 80706.  
Last modified: 5 Oct 2016, 21:28:23 UTC


So unless there is another thing like Charity Engine coming along the way, I think that instead of having to rely on a legion of desktop-grade CPUs, it would be more wise to spend all the resources on a central cluster of AVXx-powered servers under central control.


R@h is currently running on just a handful of machines, as it has for over 10 years. It's not a significant resource burden given the amount of volunteer computing and scientific progress.

The lab also has access to central clusters and supercomputing resources granted to scientists.

-- forgot to mention R@h also runs the automated structure prediction jobs from the public server, Robetta.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 2014
Credit: 9,821,437
RAC: 2,210
Message 80708 - Posted: 5 Oct 2016, 21:35:49 UTC - in response to Message 80706.  

There is really no way that scientists have to wait hours or even days for their computation results. Personally, I hope BOINC will die because it's a kludge.


I hope not.
If it is a "kludge", why are you here?

I think that instead of having to rely on a legion of desktop-grade CPUs, it would be more wise to spend all the resources on a central cluster of AVXx-powered servers under central control.


It depends on how much money and how many resources you have, particularly if you have only CPU code.

©2025 University of Washington
https://www.bakerlab.org