How to download the client? (AMD64 users)

Message boards : Number crunching : How to download the client? (AMD64 users)



ebahapo
Avatar

Send message
Joined: 17 Sep 05
Posts: 29
Credit: 413,302
RAC: 0
Message 31197 - Posted: 15 Nov 2006, 19:17:46 UTC - in response to Message 31195.  

All projects do if you use dumas777's Linux x86_64 compiled client
http://boese.dnsalias.com:6969/

But this client doesn't download native 64-bit applications when they are available, such as those from SIMAP and Chess960 and, who knows, soon Rosetta. ;-)


FluffyChicken
Avatar

Send message
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 31242 - Posted: 16 Nov 2006, 9:04:42 UTC - in response to Message 31197.  

All projects do if you use dumas777's Linux x86_64 compiled client
http://boese.dnsalias.com:6969/

But this client doesn't download native 64-bit applications when they are available, such as those from SIMAP and Chess960 and, who knows, soon Rosetta. ;-)



He said it did...

(how is that coming along? ;-) )
Team mauisun.org
ebahapo
Avatar

Send message
Joined: 17 Sep 05
Posts: 29
Credit: 413,302
RAC: 0
Message 31258 - Posted: 16 Nov 2006, 16:49:43 UTC - in response to Message 31242.  
Last modified: 16 Nov 2006, 16:50:00 UTC

But this client doesn't download native 64-bit applications when they are available, such as those from SIMAP and Chess960 and, who knows, soon Rosetta. ;-)
He said it did...

I don't think so. The author himself said that it downloads just x86 applications.
(how is that coming along? ;-) )

I've already built it for 64 bits using GCC 3.3.3, but it has some run-time problems that I want to address before trying GCC 4.2.

Over time, GCC has changed its interpretation of the C++ standard from version to version, which can cause some programs to fail to compile.


FluffyChicken
Avatar

Send message
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 31288 - Posted: 17 Nov 2006, 8:35:04 UTC - in response to Message 31258.  

But this client doesn't download native 64-bit applications when they are available, such as those from SIMAP and Chess960 and, who knows, soon Rosetta. ;-)
He said it did...

I don't think so. The author himself said that it downloads just x86 applications.


I must have misread what he wrote.
Team mauisun.org
Profile Who?

Send message
Joined: 2 Apr 06
Posts: 213
Credit: 1,366,981
RAC: 0
Message 31451 - Posted: 20 Nov 2006, 7:34:05 UTC - in response to Message 28637.  
Last modified: 20 Nov 2006, 8:13:14 UTC

FWIW, running two instances of the client on the same 4-core system, one the 32-bit client and the other the 64-bit client, each limited to 2 cores, I can compare the relative performance of the 32-bit and 64-bit builds of SIMAP's HMMER: the 64-bit version is about 7% faster.

By enabling vectorization (supported by default on AMD64), the SIMAP developers observed another 8% improvement.

Bottom line: porting the project application to AMD64 has the potential to improve performance by 15%!


Well, you actually do get a performance improvement on K8 going from 32 to 64 bits, but it is due to a poor implementation of the x87 stack in the processor. Modern processors will not see any difference.
The code is more compact*** in SSE2, scalar or packed, than it is in x87, so you need less decoding bandwidth for SSE2 than for x87. I would bet a lot of money that the K8L will not see the 32-to-64-bit performance improvement that the K8 is used to seeing. It was good marketing work, since Intel's FP performance was lower at the time, to convert the weakness of K8's x87 into a great 64-bit advantage... only architects saw the trick, and people thought that 64 bits was simply faster...

Bottom line: 64-bit execution units have been available since MMX, and the media boost in Core 2 comes from 2 x 128-bit execution units... for processing/computing, 64-bit width is totally obsolete, and the beauty of it: AMD is going to prove me right with K8L. They are "upgrading" to 128-bit execution units too, said Mr. Ruiz.

So, if your compiler generates SSE2 in 32-bit mode, your benefit from 64 bits goes down to 0. The only good side of a 64-bit OS is its addressing: more than 4 GB of RAM is nice ;)

If I have to choose between threaded CPUs and 64 bits, I know for sure which I would choose, don't you?


*** When I say compact, here is what I mean: x87 requires more instructions to execute an algorithm than SSE2 does. In the case of a multiply:

x87:
...
fld a
fld b
fmul
...
in the case of SSE2:
...
movapd xmm0, [a]
mulpd xmm0, [b]
...

This is a 33% saving... if you count equal instruction sizes.
(Unfortunately, the AMD64 spec was not that smart: they increased the size of all instruction encodings in 64-bit mode, keeping their decoder as the bottleneck.)

The stack management of x87 costs decoding bandwidth, and K8 is not a beast at decoding. In SSE2, the loads and stores are included in the instructions, saving decoding bandwidth. There you have the explanation for your 64-bit goodness... in fact, say thank you to SSE2 for boosting K8 in 64-bit mode.
That is the awful truth, sorry: a weakness of K8 turned into a great 64-bit story!
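For illustration, the same compactness argument in C with SSE2 intrinsics: one packed multiply handles two doubles per loop iteration. This is a sketch with invented names, not code from any project mentioned in this thread.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Pairwise-multiply two double arrays, two lanes per iteration.
   Each pair costs a load, a multiply, and a store, instead of the
   longer fld/fld/fmul/fstp sequence the x87 stack forces. */
void mul_pairs(const double *a, const double *b, double *out, int n)
{
    int i = 0;
    for (; i + 1 < n; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);           /* a[i], a[i+1] */
        __m128d vb = _mm_loadu_pd(&b[i]);           /* b[i], b[i+1] */
        _mm_storeu_pd(&out[i], _mm_mul_pd(va, vb)); /* both lanes at once */
    }
    for (; i < n; i++)       /* scalar tail for odd n */
        out[i] = a[i] * b[i];
}
```

(The movapd in the snippet above assumes 16-byte-aligned data; the unaligned loads here trade a little speed for safety.)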

who?
Oops, I forgot: my employer has nothing to do with my posting here; I am posting just as myself. In fact, I am sick of hearing the 64-bit story over and over... 64 bits for processing small data is useless, end of story!!!!
Mats Petersson

Send message
Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 31462 - Posted: 20 Nov 2006, 14:29:43 UTC


x87:
...
fld a
fld b
fmul
...
in the case of SSE2:
...
movapd xmm0, [a]
mulpd xmm0, [b]

But not if you do the equivalent using the same type of addressing mode:

fld a
fmul b

According to the AMD64 PRM Vol. 5 and the Intel 486 PRM, there is such an instruction...

So number of instructions for the same work is the same...

[Whether the compiler chooses this type of operation or not is a completely different question, and that depends very much on other factors - but since the FPU stack is a fairly scarce resource, I'd expect it to avoid any unnecessary loads, and using memory as a 9th "register" would help a bit with that].

--
Mats
Profile Who?

Send message
Joined: 2 Apr 06
Posts: 213
Credit: 1,366,981
RAC: 0
Message 31469 - Posted: 20 Nov 2006, 18:04:39 UTC - in response to Message 31462.  
Last modified: 20 Nov 2006, 18:16:36 UTC


x87:
...
fld a
fld b
fmul
...
in the case of SSE2:
...
movapd xmm0, [a]
mulpd xmm0, [b]

But not if you do the equivalent using the same type of addressing mode:

fld a
fmul b

According to the AMD64 PRM Vol. 5 and the Intel 486 PRM, there is such an instruction...

So number of instructions for the same work is the same...

[Whether the compiler chooses this type of operation or not is a completely different question, and that depends very much on other factors - but since the FPU stack is a fairly scarce resource, I'd expect it to avoid any unnecessary loads, and using memory as a 9th "register" would help a bit with that].

--
Mats


First of all:
Correct, it was a simplistic example... you are not supposed to focus on that one; it was just to give an overall idea of the bandwidth problem. OK... so, if you raise the level you look at it a little: x87 is a stack. If you want to use the stack properly, you have to do the loading in a different order. It is called RPN order.
You can learn more about this here: click here and move forward and backward.

Then, the instruction you are speaking about (fmul b) will have exactly the same effect as what I was talking about, except that it will have to be "split" into "fld b, then fmul" internally.


Now, if you look in detail at fdiv, for example:

fdiv()
fdivp()
fdivr()
fdivrp()
fdiv( sti, st0 );
fdiv( st0, sti );
fdivp( st0, sti );
fdivr( sti, st0 );
fdivr( st0, sti );
fdivrp( st0, sti );

you will figure out that you are required to use the top of the x87 stack for your operation. If you do not specify your 1st or 2nd parameter, fdiv will use st0. This forces you to change the ordering of the loads and stores, increasing the number of instructions required to process the same algorithm. (Here the memory disambiguation of Core 2 helps to solve this problem.)

In SSE2, there is no constraint on which register you use to add, sub, mul, or div; this is why the code gets more compact. Intel did not decide to invest so much in this without good reason. AMD themselves added more XMM registers to decrease the register pressure and avoid those issues; they don't do that randomly either.

Mats, you can't defend AMD on this. With the list of AMD prototypes and machines you have, I understand that you would like to defend 64 bits, but it is obvious that x87-to-SSE2 is the performance improvement AMD used to claim 64-bit goodness. That is pure BS, and if you refuse to see it, fine! But at least avoid misleading people. And that is without even adding that SSE2 can process 2 floating-point values at the same time if you use PACKED... that 7 to 8% performance improvement is 99.9% from SSE2, not from 64 bits.

The only case I know of where 64-bit registers help is integer encryption, and if you recode it in MMX2 (SSE2), you'll beat your integer code by a large amount too! A 64-bit imul is 3 times faster than the same algorithm done with 32-bit imul. Just use MMX2 anyway :)

The experiment to do yourself:
Take any simple 32-bit program and compile it for 64 bits using the AMD64 flavor (with MSVC 2005). Then take the same program and "32-bit compile" it with vectorization ON: oops... your 32-bit version now goes as fast as your 64-bit version. Somebody charged people a premium for something that is only a compiler trick! If I were a consumer association, AMD would be done in 5 minutes. There is a fine line between good marketing and misleading people; they obviously crossed that line with the 64-bit goodness.


who?

Again, this is my own opinion; it does not represent my employer's point of view in any way. I am solely responsible for these arguments.
ebahapo
Avatar

Send message
Joined: 17 Sep 05
Posts: 29
Credit: 413,302
RAC: 0
Message 31473 - Posted: 20 Nov 2006, 19:19:52 UTC - in response to Message 31469.  
Last modified: 20 Nov 2006, 19:21:09 UTC

Take any simple 32-bit program and compile it for 64 bits using the AMD64 flavor (with MSVC 2005). Then take the same program and "32-bit compile" it with vectorization ON: oops... your 32-bit version now goes as fast as your 64-bit version.

You forgot that AMD64 doubles the number of registers. If you do some research and compare, say, the SPECfp 2000 results for 32 and for 64 bits, both using scalar SSE, you will see that the 64-bit results are about 10% better.

Don't trust me, see for yourself (scroll down to the bottom for the mean peak score):



Profile Who?

Send message
Joined: 2 Apr 06
Posts: 213
Credit: 1,366,981
RAC: 0
Message 31491 - Posted: 21 Nov 2006, 4:33:01 UTC - in response to Message 31473.  
Last modified: 21 Nov 2006, 4:34:15 UTC

Take any simple 32-bit program and compile it for 64 bits using the AMD64 flavor (with MSVC 2005). Then take the same program and "32-bit compile" it with vectorization ON: oops... your 32-bit version now goes as fast as your 64-bit version.

You forgot that AMD64 doubles the number of registers. If you do some research and compare, say, the SPECfp 2000 results for 32 and for 64 bits, both using scalar SSE, you will see that the 64-bit results are about 10% better.

Don't trust me, see for yourself (scroll down to the bottom for the mean peak score):





You prove exactly my point: register pressure was the problem of the Athlon 64 in 32-bit mode. By adding more registers, you decrease register pressure and improve the flexibility of the computation. It does not mean you are using anything close to 64-bit "stuff"; those are just registers. Nice marketing trick again!

In modern processors, like Core 2, your execution units do not work with architectural registers any more; in fact, they work with load and store buffers and micro-registers. In Core 2, you have plenty of internal registers and dynamic renaming, so the number of architectural registers does not matter much at all, and you can see how much faster Core 2 is. Even in the Pentium 4, we had 40 load and 24 store buffers. That explains why the Pentium 4 core still beats the b..t of K8 on SPEC :)

Again, if the Athlon were not so weak at using the eight x87 registers, you would not see the benefit. In your case, they used more registers to decrease the register pressure they had. You are right, this is different; it is another issue they had, and the Pentium III used to have the same issue.

They still told you it was 64 bits :) In FACT, it was just more SSE2 :)

who?
ebahapo
Avatar

Send message
Joined: 17 Sep 05
Posts: 29
Credit: 413,302
RAC: 0
Message 31493 - Posted: 21 Nov 2006, 4:47:13 UTC - in response to Message 31491.  

Who,

Oh my! What can I say? Better yet, I won't say it. I'll let Mitch Alsup, an architect at AMD, speak: http://groups-beta.google.com/group/comp.arch/msg/26cd41f07d11a33a

HTH

Profile Who?

Send message
Joined: 2 Apr 06
Posts: 213
Credit: 1,366,981
RAC: 0
Message 31498 - Posted: 21 Nov 2006, 8:17:42 UTC - in response to Message 31493.  
Last modified: 21 Nov 2006, 8:27:34 UTC

Who,

Oh my! What can I say? Better yet, I won't say it. I'll let Mitch Alsup, an architect at AMD, speak: http://groups-beta.google.com/group/comp.arch/msg/26cd41f07d11a33a

HTH


Hehehe :) Too bad they did not have the bandwidth to feed them :) They get totally destroyed on SPEC 2006... so, I guess, something in their plan did not work out... hahahaha
Of course he will not explain his problems to you... and neither will I :)
And if you really want to know why it is funny, check my profile on SETI... in the crunching forum :) Who is who?

Sorry, but he can pretend whatever he wants; he has some serious problems with x87 compared to SSE2.



who?
FluffyChicken
Avatar

Send message
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 31500 - Posted: 21 Nov 2006, 9:23:38 UTC

Who?, Mats has actually been against 64-bit for Rosetta@home (in general)
Team mauisun.org
Mats Petersson

Send message
Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 31507 - Posted: 21 Nov 2006, 12:38:32 UTC - in response to Message 31500.  

Who?, Mats has actually been against 64-bit for Rosetta@home (in general)


I'm not against it; I'm just saying that there's not going to be any great gain from it, because the limitation (in my experience) isn't the bitness of the code (or the number of registers used).

I'm also against people saying "X will be faster because Y was faster" when doing whatever it is suggested one should do.

I can say for sure that Rosetta would gain from using packed SSE, but it will require a fairly major re-org of the data, since the current data format is a simple array with index 0 = x, index 1 = y, index 2 = z. Since the values x, y and z are generally calculated in the same way for many iterations of a loop, it would be good to just load 4 of each into an SSE register and do the calculation 4 in parallel [1]. But to do that, the x, y and z values need to be arrayed individually (or in groups of 4 or some multiple thereof).

But there's another problem too, and I haven't spent enough time to figure out how to fix it, but the code is basically doing:

for (i = 0; i < n; i++) {
    for (j = 0; j < m; j++) {
        calculation...;
    }
}

where n is in the range of a few hundred and m is less than 10.

Doing it the other way around would help a whole lot in unrolling the loop and parallelizing the calculation.

However, that is a major re-org of the data-flow and I'm not sure if the calculation will still be valid [because the whole code is a lot more complex than the above much simplified case].
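For illustration, the loop interchange described above, sketched in C with hypothetical bounds and a stand-in "calculation": with j outermost, the long i loop becomes the inner one, which a compiler can unroll and vectorize.

```c
enum { N = 300, M = 6 };   /* hypothetical: "a few hundred" by "less than 10" */

/* Original shape: the short j loop is innermost, leaving the
   vectorizer little to work with. */
void sum_orig(double out[N], const double w[M], const double v[N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            out[i] += w[j] * v[i];
}

/* Interchanged: the long i loop is innermost; out[i] and v[i] are
   accessed contiguously, so unrolling and vectorization pay off. */
void sum_swapped(double out[N], const double w[M], const double v[N])
{
    for (int j = 0; j < M; j++)
        for (int i = 0; i < N; i++)
            out[i] += w[j] * v[i];
}
```

Whether the interchanged form still computes the same thing depends on the real loop body; that is exactly the part that needs checking.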

--
Mats
ebahapo
Avatar

Send message
Joined: 17 Sep 05
Posts: 29
Credit: 413,302
RAC: 0
Message 31519 - Posted: 21 Nov 2006, 16:23:26 UTC - in response to Message 31507.  

I can say for sure that Rosetta would gain from using packed SSE, but it will require a fairly major re-org of the data, since the current data format is a simple array with index 0 = x, index 1 = y, index 2 = z. Since the values x, y and z are generally calculated in the same way for many iterations of a loop, it would be good to just load 4 of each into an SSE register and do the calculation 4 in parallel [1]. But to do that, the x, y and z values need to be arrayed individually (or in groups of 4 or some multiple thereof).

But there's another problem too, and I haven't spent enough time to figure out how to fix it, but the code is basically doing:

for (i = 0; i < n; i++) {
    for (j = 0; j < m; j++) {
        calculation...;
    }
}

where n is in the range of a few hundred and m is less than 10.

Doing it the other way around would help a whole lot in unrolling the loop and parallelizing the calculation.

The vectorizer can swizzle the data and unroll the loop automatically. We'll see.

ebahapo
Avatar

Send message
Joined: 17 Sep 05
Posts: 29
Credit: 413,302
RAC: 0
Message 31520 - Posted: 21 Nov 2006, 16:25:09 UTC - in response to Message 31498.  
Last modified: 21 Nov 2006, 16:33:39 UTC

Hehehe :) Too bad they did not have the bandwidth to feed them :) They get totally destroyed on SPEC 2006... so, I guess, something in their plan did not work out... hahahaha

I guess that I shouldn't feed the trolls, especially those who laugh nervously.
Mats Petersson

Send message
Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 31521 - Posted: 21 Nov 2006, 17:23:59 UTC - in response to Message 31519.  
Last modified: 21 Nov 2006, 17:28:09 UTC

I can say for sure that Rosetta would gain from using packed SSE, but it will require a fairly major re-org of the data, since the current data format is a simple array with index 0 = x, index 1 = y, index 2 = z. Since the values x, y and z are generally calculated in the same way for many iterations of a loop, it would be good to just load 4 of each into an SSE register and do the calculation 4 in parallel [1]. But to do that, the x, y and z values need to be arrayed individually (or in groups of 4 or some multiple thereof).

But there's another problem too, and I haven't spent enough time to figure out how to fix it, but the code is basically doing:

for (i = 0; i < n; i++) {
    for (j = 0; j < m; j++) {
        calculation...;
    }
}

where n is in the range of a few hundred and m is less than 10.

Doing it the other way around would help a whole lot in unrolling the loop and parallelizing the calculation.

The vectorizer can swizzle the data and unroll the loop automatically. We'll see.


Yes, it can, but swizzling isn't entirely free - in fact it's fairly costly when it's as "disorganized" as it is in this case... :-(

And more importantly perhaps, the length of the inner loop (from memory) is 6 items [the loop is repeated several times in the busiest section of the code, and does a bunch of other things too, as I explained earlier]. So even unrolling probably won't really do the trick. We'd need to reform the loop so that it goes the other way around - then maybe it will work...


--
Mats

Profile Who?

Send message
Joined: 2 Apr 06
Posts: 213
Credit: 1,366,981
RAC: 0
Message 31640 - Posted: 24 Nov 2006, 18:16:30 UTC - in response to Message 31521.  

I can say for sure that Rosetta would gain from using packed SSE, but it will require a fairly major re-org of the data, since the current data format is a simple array with index 0 = x, index 1 = y, index 2 = z. Since the values x, y and z are generally calculated in the same way for many iterations of a loop, it would be good to just load 4 of each into an SSE register and do the calculation 4 in parallel [1]. But to do that, the x, y and z values need to be arrayed individually (or in groups of 4 or some multiple thereof).

But there's another problem too, and I haven't spent enough time to figure out how to fix it, but the code is basically doing:

for (i = 0; i < n; i++) {
    for (j = 0; j < m; j++) {
        calculation...;
    }
}

where n is in the range of a few hundred and m is less than 10.

Doing it the other way around would help a whole lot in unrolling the loop and parallelizing the calculation.

The vectorizer can swizzle the data and unroll the loop automatically. We'll see.


Yes, it can, but swizzling isn't entirely free - in fact it's fairly costly when it's as "disorganized" as it is in this case... :-(

And more importantly perhaps, the length of the inner loop (from memory) is 6 items [the loop is repeated several times in the busiest section of the code, and does a bunch of other things too, as I explained earlier]. So even unrolling probably won't really do the trick. We'd need to reform the loop so that it goes the other way around - then maybe it will work...


--
Mats



Moving to SIMD SSE2 is not really a big deal.
As I posted in the SETI forum:


As you probably noticed, I have been playing with SETI since 2000. SETI has always been an interesting distributed-computing problem, and the FFT is a challenge for my little brain all by itself.
If you look at the FFT using 4 vectors in parallel, you have to try to code your FFT in a way that minimizes the penalties: branching and memory footprint; and in the case of Core, you want to use as many 128-bit SSEx instructions as you can.
To use SIMD efficiently, you want to move your data from Array of Structures to Structure of Structures.

For example, in 3D, it is very common to store X,Y,Z,W in memory like this:
X,Y,Z,W,X,Y,Z,W,X,Y,Z,W,X,Y,Z,W,X,Y,Z,W... (Array of Structures)

The natural way to store your SIMD data is
XXXXXXXXXXX....
YYYYYYYYYYYY...
ZZZZZZZZ.....
WWWWWWWWWWW... (Structure of Arrays)

But this has the bad side effect of opening more memory streams, and most modern processors allow only 4 or 8 streams open at the same time.
One of my co-workers, AlexK, came up with this data structure in 1998, called Structure of Structures:

XXXX,YYYY,ZZZZ,WWWW,XXXX,YYYY,ZZZZ,WWWW,XXXX,YYYY,ZZZZ,WWWW...

Like this, you access memory with only one or two streams, your data locality is tight, and your cache lines get really efficient.
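For illustration, the three layouts in C (sizes hypothetical; only the memory order matters):

```c
/* Array of Structures: components interleaved per point. */
struct PointAoS { float x, y, z, w; };

/* Structure of Arrays: one long stream per component. */
#define NPTS  1024
struct CloudSoA { float x[NPTS], y[NPTS], z[NPTS], w[NPTS]; };

/* "Structure of Structures": SIMD-width blocks of each component,
   contiguous in memory, so it still reads as a single stream. */
#define LANES 4
struct Block    { float x[LANES], y[LANES], z[LANES], w[LANES]; };
struct CloudSoS { struct Block blk[NPTS / LANES]; };
```

Element i's x component lives at blk[i / LANES].x[i % LANES], and a SIMD load can fetch the whole 4-wide block at once.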

What I am doing today in the SETI code is simply trying to apply Alex's idea to the FFT.
I'll need a few more weeks to get it done; it is a nice mind game, but it should dramatically increase the instructions per clock on the FFT side.

Let's be clear: I am doing SETI for fun. I am a very happy/lucky man; my hobby and my job are very intertwined, and I rarely have the feeling of working. Intel did not ask me to do anything on SETI; Intel gives me access to the best toys I could dream of. Performance in general is a very interesting problem, and not only with computers; I work on it with cars as well.

Anybody who wants to help with the SIMDization of SETI is welcome :)

FrancoisP


I am open to doing Rosetta too. As one of the designers of SSE2/SSE3, in my hobby time I spend quite some time helping people port data structures to SSE2.
When you are used to it, a few macros usually do the trick.

It is the right time to do SIMDization because SSE4 will buy a lot of performance in Rosetta-like algorithms; we actually included instructions to help with pattern matching in SIMD. SSE4 will also help to deal with changes of code path within a set of SIMD data.
SIMDization is less complex than many people make it out to be; there is no "major" data reorganization, and a fairly good programmer knows how to change his data structures without "major" work.

You need to write a few macros like "GetVec(x), GetVec4(x)" etc., and an adaptive data structure, and you are done. Then a gigantic search-and-replace for 1 or 2 hours, well focused, and you are done. The key to SIMDization is data locality.
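For illustration, a hypothetical sketch of such accessor macros over an adaptive layout, here with SSE2 doubles (two per register); all names are invented, and the GetVec4 mentioned above would be the 4-wide float variant.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

#define BLK 2   /* doubles per __m128d */

/* Adaptive layout: 2-wide blocks per component, so scalar and SIMD
   code can share the same array. */
typedef struct { double x[BLK], y[BLK]; } Blk;

/* Scalar accessor: element i of the x component. */
#define GetVec(a, i)   ((a)[(i) / BLK].x[(i) % BLK])
/* Vector accessor: the whole 2-wide block that holds element i. */
#define GetVec2(a, i)  (_mm_loadu_pd((a)[(i) / BLK].x))
```

Scalar code keeps indexing with GetVec, while the SIMD path grabs whole blocks with GetVec2; the search-and-replace is mostly swapping one for the other in the hot loops.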

who?
Mats Petersson

Send message
Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 31708 - Posted: 27 Nov 2006, 12:13:11 UTC - in response to Message 31640.  

I can say for sure that Rosetta would gain from using packed SSE, but it will require a fairly major re-org of the data, since the current data format is a simple array with index 0 = x, index 1 = y, index 2 = z. Since the values x, y and z are generally calculated in the same way for many iterations of a loop, it would be good to just load 4 of each into an SSE register and do the calculation 4 in parallel [1]. But to do that, the x, y and z values need to be arrayed individually (or in groups of 4 or some multiple thereof).

But there's another problem too, and I haven't spent enough time to figure out how to fix it, but the code is basically doing:

for (i = 0; i < n; i++) {
    for (j = 0; j < m; j++) {
        calculation...;
    }
}

where n is in the range of a few hundred and m is less than 10.

Doing it the other way around would help a whole lot in unrolling the loop and parallelizing the calculation.

The vectorizer can swizzle the data and unroll the loop automatically. We'll see.


Yes, it can, but swizzling isn't entirely free - in fact it's fairly costly when it's as "disorganized" as it is in this case... :-(

And more importantly perhaps, the length of the inner loop (from memory) is 6 items [the loop is repeated several times in the busiest section of the code, and does a bunch of other things too, as I explained earlier]. So even unrolling probably won't really do the trick. We'd need to reform the loop so that it goes the other way around - then maybe it will work...


--
Mats



Moving to SIMD SSE2 is not really a big deal.
As I posted in the SETI forum:


As you probably noticed, I have been playing with SETI since 2000. SETI has always been an interesting distributed-computing problem, and the FFT is a challenge for my little brain all by itself.
If you look at the FFT using 4 vectors in parallel, you have to try to code your FFT in a way that minimizes the penalties: branching and memory footprint; and in the case of Core, you want to use as many 128-bit SSEx instructions as you can.
To use SIMD efficiently, you want to move your data from Array of Structures to Structure of Structures.

For example, in 3D, it is very common to store X,Y,Z,W in memory like this:
X,Y,Z,W,X,Y,Z,W,X,Y,Z,W,X,Y,Z,W,X,Y,Z,W... (Array of Structures)

The natural way to store your SIMD data is
XXXXXXXXXXX....
YYYYYYYYYYYY...
ZZZZZZZZ.....
WWWWWWWWWWW... (Structure of Arrays)

But this has the bad side effect of opening more memory streams, and most modern processors allow only 4 or 8 streams open at the same time.
One of my co-workers, AlexK, came up with this data structure in 1998, called Structure of Structures:

XXXX,YYYY,ZZZZ,WWWW,XXXX,YYYY,ZZZZ,WWWW,XXXX,YYYY,ZZZZ,WWWW...

Like this, you access memory with only one or two streams, your data locality is tight, and your cache lines get really efficient.

What I am doing today in the SETI code is simply trying to apply Alex's idea to the FFT.
I'll need a few more weeks to get it done; it is a nice mind game, but it should dramatically increase the instructions per clock on the FFT side.

Let's be clear: I am doing SETI for fun. I am a very happy/lucky man; my hobby and my job are very intertwined, and I rarely have the feeling of working. Intel did not ask me to do anything on SETI; Intel gives me access to the best toys I could dream of. Performance in general is a very interesting problem, and not only with computers; I work on it with cars as well.

Anybody who wants to help with the SIMDization of SETI is welcome :)

FrancoisP


I am open to doing Rosetta too. As one of the designers of SSE2/SSE3, in my hobby time I spend quite some time helping people port data structures to SSE2.
When you are used to it, a few macros usually do the trick.

It is the right time to do SIMDization because SSE4 will buy a lot of performance in Rosetta-like algorithms; we actually included instructions to help with pattern matching in SIMD. SSE4 will also help to deal with changes of code path within a set of SIMD data.
SIMDization is less complex than many people make it out to be; there is no "major" data reorganization, and a fairly good programmer knows how to change his data structures without "major" work.

You need to write a few macros like "GetVec(x), GetVec4(x)" etc., and an adaptive data structure, and you are done. Then a gigantic search-and-replace for 1 or 2 hours, well focused, and you are done. The key to SIMDization is data locality.

who?


Yes, I've spent some time doing that sort of work too - however, there's more than one data structure... And more importantly, with some of the code in Rosetta it isn't straightforward to understand what depends on what [and it's a tad more than 200K lines in Rosetta; if memory serves me right, that's more than 5 times the amount of code in SETI]. Most of it uses Fortran-style array-structuring C++ classes, with optimized sections of code that "grab" a pointer to the array and do the math to calculate the index "inline" - with three-dimensional arrays, that gets interesting... ;-)

It's not the 1-2 hours of search and replace that is the problem - it's figuring out why it doesn't work for the 4 weeks afterwards that scares me.

--
Mats
ebahapo
Avatar

Send message
Joined: 17 Sep 05
Posts: 29
Credit: 413,302
RAC: 0
Message 44956 - Posted: 13 Aug 2007, 23:43:48 UTC

Here's a development version of the x86-64 Linux client:


The official client for x64 Windows can be found at boinc_5.10.13_windows_x86_64.exe.

The BOINC client 5.10 can now get 32-bit applications from projects that haven't added support for AMD64 (e.g., Lattice, QMC, etc.), provided that they run at least BOINC server 5.0.9. However, AMD64 clients for Windows may not get applications from some projects that supported AMD64 before, due to a platform-name change, at least until such projects are updated.

For more information, see the BoincStats Forum.

HTH


ebahapo
Avatar

Send message
Joined: 17 Sep 05
Posts: 29
Credit: 413,302
RAC: 0
Message 49757 - Posted: 17 Dec 2007, 17:58:02 UTC

Even though there's an official AMD64 client for Linux, it links against too many dynamic libraries and requires a fairly recent Linux setup to run on.

So, one more time, I'm making the AMD64 Linux client available here. It links against a minimal set of standard dynamic libraries whose version requirements should be satisfied by Linux systems up to 2 or 3 years old, even though it was built with a fairly recent version of GCC, 4.1.2.

The drill's still the same:


The official AMD64 Windows client can be found here.

For more information, see the BoincStats Forum.

HTH






©2024 University of Washington
https://www.bakerlab.org