Message boards : Number crunching : GPU-Tech Releases GPU Computing API, Benchmarks
The_Bad_Penguin Send message Joined: 5 Jun 06 Posts: 2751 Credit: 4,271,025 RAC: 0
GPU-Tech Releases GPU Computing API, Benchmarks and the Press Kit

GPU-Tech, a French developer of high-performance computing applications that run on graphics cards, launched a new set of APIs on Wednesday that, while reportedly compatible with all GPUs, seems to show better performance on Nvidia workstation cards. On a "Monte Carlo" algorithm, for example, performance on a GPU was roughly 20 times faster than on an AMD Athlon 64 3800+ microprocessor. Running the algorithm on an Nvidia Quadro 4600 generated a roughly 10 percent performance improvement over an ATI HD 2900 XT, while running a "Black-Scholes" algorithm gave Nvidia a 20 to 33 percent performance edge over ATI and a roughly 40X edge over the Athlon 64. In double-precision matrix multiplication, however, ATI was found to have a significant edge.

Why the sudden push to make coding for GPUs easier? As the company explains, graphics card manufacturers are increasingly optimizing their hardware for parallel programming: the simultaneous execution of the same task (split up and specially adapted) on multiple processors in order to obtain results faster. ... Both Nvidia's CUDA and ATI's CTM provide a higher-level way of coding as well as some specific libraries tuned to run on the GPU. But these solutions as they exist now also tend to present a number of limitations for programmers, namely requiring extensive GPU programming knowledge and being tied to one manufacturer's graphics cards. GPU-Tech said it worked with both Nvidia and ATI on the new API and libraries and will continue to offer services compatible with all hardware manufacturers in the future.

(From page 6 of the Press Kit): It is also possible to use two graphics cards installed in parallel on a single PC and to nearly double the available power. For example, two ATI HD 2900 XT cards in CrossFire will provide around 1 Tflops.
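To give a feel for why these finance benchmarks suit GPUs so well, here is a minimal sketch of a Black-Scholes-style kernel in CUDA. This is only an illustration, not GPU-Tech's API; the kernel and parameter names are made up. Each thread prices one option independently, which is exactly the embarrassingly parallel pattern the benchmark exploits.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Cumulative normal distribution via the complementary error function.
__device__ float cnd(float x) {
    return 0.5f * erfcf(-x * 0.70710678f);  // 0.70710678 = 1/sqrt(2)
}

// One thread prices one European call option with the Black-Scholes formula.
__global__ void price_calls(const float *S, const float *K, const float *T,
                            float r, float sigma, float *call, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float sqrtT = sqrtf(T[i]);
    float d1 = (logf(S[i] / K[i]) + (r + 0.5f * sigma * sigma) * T[i])
               / (sigma * sqrtT);
    float d2 = d1 - sigma * sqrtT;
    call[i] = S[i] * cnd(d1) - K[i] * expf(-r * T[i]) * cnd(d2);
}

int main() {
    const int n = 1 << 20;  // one million independent options
    const size_t bytes = n * sizeof(float);
    float *S, *K, *T, *call;
    cudaMallocManaged(&S, bytes);
    cudaMallocManaged(&K, bytes);
    cudaMallocManaged(&T, bytes);
    cudaMallocManaged(&call, bytes);
    for (int i = 0; i < n; ++i) { S[i] = 100.0f; K[i] = 95.0f; T[i] = 1.0f; }

    price_calls<<<(n + 255) / 256, 256>>>(S, K, T, 0.05f, 0.2f, call, n);
    cudaDeviceSynchronize();
    printf("call[0] = %f\n", call[0]);
    return 0;
}
```

Since every option is priced with no inter-thread communication at all, the reported speedups over a single CPU core can be very large.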
David Emigh Send message Joined: 13 Mar 06 Posts: 158 Credit: 417,178 RAC: 0
Very cool. I look forward to the time when Rosie can use my video card's processing power also.

Rosie, Rosie, she's our gal,
If she can't do it, no one shall!
The_Bad_Penguin Send message Joined: 5 Jun 06 Posts: 2751 Credit: 4,271,025 RAC: 0
With all of the recent "alliances" (AMD/ATI, Intel/Nvidia, etc.), I am not certain what "could" actually be available (Intel & ATI), but in theory (if Intel chipsets would "talk" to AMD/ATI CrossFire)...

My understanding is that for each GPU card, a CPU "core" is required. So... next month a $270 Intel quad-core, and how about mating that with quad-CrossFire? Four GPUs with a quad-core CPU. What's that, like 2 Tflops?

What were Doc Baker and team looking for in order for the research to "break open"? 150 Tflops?

Yes, I do understand, à la Folding@Home, that there have to be "special" work units for GPUs... but while not cheap ($270 for the quad-core CPU, $1600(?) for four HD2900XTs, $260 for 4 GB RAM, etc.), it is still less than I paid for my Apple ][ Plus computer (with cassette tape data recorder!) all those years ago.

Scary thought: 2 Tflops for ~$2k? I'd compare that "bang per buck" any day against what they've just announced today for a local university, $26m for 100 Tflops.

Very cool.

> (From page 6 of the Press Kit): It is also possible to use two graphics cards installed in parallel on a single PC and to nearly double the available power. For example, two ATI HD 2900 XT cards in CrossFire will provide around 1 Tflops.
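Taking the post's own numbers at face value, the "bang per buck" gap is easy to put in code. These are the rough figures quoted above, not measured values:

```cuda
#include <cstdio>

// Back-of-the-envelope Gflops-per-dollar, using the numbers from the post.
int main() {
    const double diy_gflops = 2000.0,   diy_cost = 2000.0;      // quad-CrossFire guess
    const double hpc_gflops = 100000.0, hpc_cost = 26000000.0;  // the university cluster
    const double diy = diy_gflops / diy_cost;
    const double hpc = hpc_gflops / hpc_cost;
    printf("DIY box: %.3f Gflops/$\n", diy);  // ~1.000
    printf("Cluster: %.4f Gflops/$\n", hpc);  // ~0.0038
    printf("Ratio:   %.0fx\n", diy / hpc);    // ~260x
    return 0;
}
```

Of course the comparison ignores double precision, reliability, interconnect and everything else a real cluster buys, but that two-orders-of-magnitude gap is the whole GPGPU pitch.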
The_Bad_Penguin Send message Joined: 5 Jun 06 Posts: 2751 Credit: 4,271,025 RAC: 0
Put a quad-core and quad-CrossFire on this!

DFI prepping ultimate Intel motherboard

OUR FRIENDS FROM XtremeSystems.org told us that Oskar Wu, DFI's crazy BIOS magician, managed to get a 632 MHz front-side bus on an upcoming motherboard from DFI, the P35-T2R. Since Intel's current FSB is actually 266 MHz in QDR (Quad Data Rate) mode, seeing 632 MHz (a 2.53 GHz effective FSB) is nothing short of a spectacular display of performance and stability for a motherboard. We took a picture of this baby bare naked, since DFI still has not finished the layout of the heatpipe cooling for this monster.
Paydirt Send message Joined: 10 Aug 06 Posts: 127 Credit: 960,607 RAC: 0
The problem with two or more GPUs is whether or not the PCIe lanes will run in 16x mode... then you have overall system bandwidth and latency. For motherboards that can handle multiple 16x slots at once, the question becomes one of cost: is the more expensive motherboard worth the increase in performance? When I made my decision with dual x1950xtx's, the extra $100 at the time for a 16x motherboard was too much for the potential 7% performance gain over 8x. Turns out my mobo runs both cards in 4x mode, so it was a bad choice.

Presently there is no crunching client for the 2900, but it should be coming within two months or so (I think). FAH won't say when it's coming because it isn't their policy to announce anything until release (beta or otherwise).
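For reference, a back-of-the-envelope sketch of what those link widths mean in raw numbers, assuming first-generation PCIe's nominal ~250 MB/s of bandwidth per lane per direction (protocol overhead ignored):

```cuda
#include <cstdio>

// Theoretical one-way PCIe 1.x bandwidth per link width.
int main() {
    const int widths[] = {1, 4, 8, 16};
    for (int i = 0; i < 4; ++i)
        printf("%2dx link: ~%4d MB/s per direction\n",
               widths[i], widths[i] * 250);
    return 0;
}
```

Dropping from 8x to 4x halves the theoretical transfer budget from ~2 GB/s to ~1 GB/s each way, which lines up with where the performance hit shows up.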
The_Bad_Penguin Send message Joined: 5 Jun 06 Posts: 2751 Credit: 4,271,025 RAC: 0
@Paydirt: What effect do you believe the 2Q/3Q release of chipsets supporting PCIe 2.0, and the announced arrival of 1GB HD2900XTs, will have on this type of setup (quad-core + quad-GPU (CrossFire/SLI))?
dcdc Send message Joined: 3 Nov 05 Posts: 1832 Credit: 119,821,902 RAC: 15,180
> The problem with two or more GPUs is whether or not the PCIe lanes will run in 16x mode... then you have overall system bandwidth and latency. For motherboards that can handle multiple 16x slots at once, the question becomes one of cost: is the more expensive motherboard worth the increase in performance? When I made my decision with dual x1950xtx's, the extra $100 at the time for a 16x motherboard was too much for the potential 7% performance gain over 8x. Turns out my mobo runs both cards in 4x mode, so it was a bad choice.

Can any current graphics card saturate a PCIe 8x link? Even if it could, would a crunching client require that much bandwidth? I'd think the GPU would be crunching on data held in its own memory for some of that time. I'd be interested to see how two 8x lanes cope ;)
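One way to answer that empirically is to time a large pinned-memory copy across the bus and compare it against the link's theoretical ceiling. A minimal sketch in CUDA terms (an illustration, not a tool either project shipped):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;  // 64 MB transfer
    float *h, *d;
    cudaMallocHost((void **)&h, bytes);  // pinned host memory for peak PCIe speed
    cudaMalloc((void **)&d, bytes);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    cudaEventRecord(t0, 0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1, 0);
    cudaEventSynchronize(t1);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    const double mb = bytes / (1024.0 * 1024.0);
    printf("Host->device: %.1f MB/s\n", mb / (ms / 1000.0));

    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```

If transfers only briefly touch the link ceiling and the kernel then crunches for long stretches out of on-board memory, a narrower link costs little, which matches what crunchers report at 8x.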
Paydirt Send message Joined: 10 Aug 06 Posts: 127 Credit: 960,607 RAC: 0
First, I'm not an expert; like you folks I just read a lot about this stuff.

> @Paydirt: What effect do you believe the 2Q/3Q release of chipsets supporting PCIe 2.0, and the announced arrival of 1GB HD2900XTs, will have on this type of setup (quad-core + quad-GPU (CrossFire/SLI))?

8x and 16x PCIe lanes get about the same crunching performance; the performance hit comes when cards drop down to 4x. I don't think 32x would do much. I do think that if more "lanes of data" are added, and if quad-core CPUs and the various buses can handle it, then you may get better returns on running dual to quad GPUs.

The crunching performance improvement for the 512MB 2900s is not expected to scale linearly over the x1950xtx (I was hoping it would...). The (good) problem is that the 2900's shaders are 5-wide vectors, whereas the x1950xtx uses 4-wide vectors, and the Folding@Home code isn't designed to use that 5th "register" in each shader; it likely won't for quite some time. So the 2900 is not going to be 6 to 7 times more powerful than the x1950xtx (which is itself about 3 times more powerful than the PS3); in the present state of the code it's estimated at 70% to 3 times more powerful. The code that will crunch on the 2900 has not been released. I'm waiting for some numbers from after the release to make my next purchase decision.

> Can any current graphics card saturate a PCIe 8x link? Even if it could, would a crunching client require that much bandwidth?

Crunching CAN require a ton of bandwidth. For instance, I remember reading that each work unit for SMP folding (where all the cores work in tandem instead of independently, I think) passes over a terabyte worth of data around the system. For GPU crunching to work with the current DirectX, each GPU requires a CPU core that does "polling" to coordinate the work (DX10 is supposed to get rid of this need, but it hasn't yet). I think what is happening is that the CPU keeps asking "Is shader 15 done with the calculation yet? How about now? How about now?" That's why a CPU core is required for each GPU. And though the data is "dumb", a lot of it passes through the system. Dual-GPU crunching on a dual-core CPU DOES take a performance hit; I'm guessing that's because total system bandwidth is being strained.
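That "how about now?" loop is a real pattern. Below is a minimal sketch of what busy-polling looks like in CUDA terms; the DirectX path Folding@Home actually used isn't public, so treat this only as an analogy. The spin loop is exactly what pins a full CPU core per GPU:

```cuda
#include <cuda_runtime.h>

// A stand-in for a long-running crunching kernel.
__global__ void crunch(float *out, int iters) {
    float x = threadIdx.x * 0.001f;
    for (int i = 0; i < iters; ++i)
        x = x * 1.0000001f + 0.5f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

int main() {
    float *d_out;
    cudaMalloc((void **)&d_out, 128 * 256 * sizeof(float));

    cudaEvent_t done;
    cudaEventCreate(&done);

    crunch<<<128, 256>>>(d_out, 1 << 20);
    cudaEventRecord(done, 0);  // marks the point in the stream after the kernel

    // Busy-wait: the host spins asking the GPU "done yet?" over and over.
    // This loop is what eats a full CPU core per GPU while the GPU crunches.
    while (cudaEventQuery(done) == cudaErrorNotReady) {
        /* spin */
    }

    cudaEventDestroy(done);
    cudaFree(d_out);
    return 0;
}
```

A driver that instead puts the host thread to sleep until the GPU signals completion would free that core, which is the direction later APIs went.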
The_Bad_Penguin Send message Joined: 5 Jun 06 Posts: 2751 Credit: 4,271,025 RAC: 0
Nvidia Expands From Gaming to High-Performance Chips

Nvidia today introduced the Tesla line of processors, which it bills as bringing high-density parallel processing capabilities to workstation computers. The Tesla graphics processing unit (GPU) features 128 parallel processors and delivers up to 518 gigaflops of parallel computation. A gigaflop refers to the processing of a billion floating-point operations per second. Nvidia envisions Tesla being used in high-performance computing environments such as geosciences, molecular biology or medical diagnostics.

Nvidia also will offer Tesla in a workstation form it calls a "Deskside Supercomputer": a unit that includes two Tesla GPUs, attaches to a PC or workstation via a PCI Express connection, and delivers over a teraflop of processing power. A teraflop is the processing of a trillion floating-point operations per second. A Tesla Computing Server puts eight Tesla GPUs, with 1,000 parallel processors in total, into a 1U server rack.

Tesla is the third major product line from Nvidia, whose GeForce GPUs deliver high-end PC graphics. Its Quadro processor line enables computer-aided design and the creation of digital content, including 3-D graphics. Nvidia also released in February a beta version of software it calls CUDA, for Compute Unified Device Architecture, which enables software to be written that uses a computer's GPU as well as the CPU (central processing unit) for added processing power. General availability of CUDA is expected in the second half of this year.
The_Bad_Penguin Send message Joined: 5 Jun 06 Posts: 2751 Credit: 4,271,025 RAC: 0
Tesla GPGPU monster

What happens when an 8800GTX board loses its video outputs? A GPGPU monster, that's what happens.

The Tesla line-up consists of a discrete GPGPU card, a desktop supercomputer, and enterprise-class 1U, 3U and 5U units. For optimal GPGPU performance there should be one x86 processing core per GPGPU processing unit, so for four GPGPU units a four-core processor is a must.

The discrete GPGPU card is named Tesla D870; basically it's a heavily modified GeForce 8800GTX board with 1.5 GB of GDDR3 memory and no video connectors (resulting in improved cooling). The Desktop Supercomputer is nothing else but a QuadroPlex system with two such cards, while the enterprise "GPU servers", as Nvidia calls them, are what Nvidia needs to get a good crack at the HPC market. Every Tesla part is PCIe Generation 2 compliant, while the enterprise parts feature the next-gen nForce Professional chipset from Nvidia.

All in all, Nvidia continues what ATI started last year with FireStream. Nvidia also mentioned that dual-GPGPU boards are coming, which is a technical pre-announcement of the GeForce 8950GX2, a dual-GPU G80 board. These should come to life when Nvidia finishes up the mobile version of the G80 chip, aimed at notebooks with a GPU power budget of 22 watts or higher.

Want to see what a GPGPU card looks like in a 1U form factor?
The_Bad_Penguin Send message Joined: 5 Jun 06 Posts: 2751 Credit: 4,271,025 RAC: 0
Asus Launches Gaming Motherboard Line

...a technology called CrossLinx, a proprietary technology that Asus claims will better balance graphics performance. While the boards support ATI's CrossFire technology, CrossLinx takes a 4x and a 16x PCI Express graphics slot and reconfigures them as, essentially, two 8x slots, eliminating the 4x bottleneck.

> 8x and 16x PCIe lanes get about the same crunching performance. The performance hit is when they drop down to 4x.
FluffyChicken Send message Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0
Somebody is actually using and testing these things out. Astrophysics this time. (Preprint paper.)

Graphic-Card Cluster for Astrophysics (GraCCA) -- Performance Tests
http://arxiv.org/abs/0707.2991

Direct link to the PDF:
http://arxiv.org/ftp/arxiv/papers/0707/0707.2991.pdf

Team mauisun.org
The_Bad_Penguin Send message Joined: 5 Jun 06 Posts: 2751 Credit: 4,271,025 RAC: 0
AMD intros ATI FireGL workstation accelerators

CHIP FIRM AMD said it has introduced five FireGL graphics accelerators aimed at the medical imaging market and at CAD workstations. That's computer-aided design.

The V8650, V8600, V7600, V5600 and V3600 boards have up to 320 individual stream processing units. Features include up to 2GB of memory on the cards, and AMD-ATI claims the V5600 gives over 300 per cent of the performance of the V5200 on the Viewperf 9.0.3 UGS Teamcenter Visualisation mockup benchmark. Prices are not cheap - these are workstation boards - and cost $2,800, $1,900, $1,000, $600 and $300 respectively, with memory sizes of 2GB, 1GB, 512MB, 512MB and 256MB in turn.

THERE'S A REVIEW of this range, which rates it highly, at 3D Professor.
Jmarks Send message Joined: 16 Jul 07 Posts: 132 Credit: 98,025 RAC: 0
Hey, I do not know if this is new to you, but this article says that all but one of the models are available now.

Nvidia Quadro Plex VCS Models
http://www.macsimumnews.com/index.php/archive/nvidia_releases_new_gpu_server/

Great article.

Jon