Technology Stocks : Advanced Micro Devices - Moderated (AMD)
To: wanna_bmw who wrote (52890)
From: pgerassi
Date: 8/29/2001 1:02:22 AM
 
Wanna_bmw:

Wrong again! The Athlon has three integer units, and each one has a barrel shifter. A barrel shifter can rotate by many bits at once. So if the Athlon is asked to rotate something by 15 bits, it still does it in one cycle: one cycle on each of 3 ALUs, for a total of 12 bytes per cycle. The P4 takes 1 cycle for each bit rotated (the actual number of cycles per rotate op is not stated beyond 1 bit) on each of 2 ALUs, for a total of 15 cycles for 8 bytes.
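To make the distinction concrete, here is a minimal Python sketch of two ways to reach the same rotate result: a single multi-bit rotate, which is what a barrel shifter provides in one operation, and a chain of one-bit rotates. The function names and the 32-bit register width are assumptions for illustration. The step count models the number of operations only, not measured cycle timings on either chip.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit integer registers


def rotl32(value, count):
    """Rotate a 32-bit value left by `count` bits in a single operation."""
    count %= 32
    value &= MASK32
    return ((value << count) | (value >> (32 - count))) & MASK32


def rotl32_bit_by_bit(value, count):
    """Same result built from one-bit rotates; returns (result, steps used)."""
    value &= MASK32
    steps = 0
    for _ in range(count % 32):
        # Move the top bit around to the bottom, one position per step.
        value = ((value << 1) | (value >> 31)) & MASK32
        steps += 1
    return value, steps
```

For a 15-bit rotate, `rotl32` does the work in one call, while `rotl32_bit_by_bit` needs 15 steps to produce the identical value.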

Note that none of the SSE(2) or MMX units can do a bit rotate. The SSE(2) units shift by bytes only (MMX can do a bit shift, but only one unit exists and it can do only 1 bit per cycle), and none of them do rotates at all. Check the relevant pages for yourself if you do not believe me. Besides, on page 116 of the guide, Intel advises against rotating by more than 1 bit and recommends adding a register to itself instead of left shifts of up to 3 bits, because the latency is lower. This shows how slowly shifts, and especially rotates, execute on the P4.
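The add-instead-of-shift advice works because adding a value to itself doubles it, which is the same as a left shift by one bit. A small Python sketch, assuming 32-bit registers and using a hypothetical helper name, shows that three self-adds equal a left shift by three:

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit integer registers


def shl_by_adds(value, count):
    """Left shift by `count` bits using `count` self-adds (x + x == x << 1)."""
    value &= MASK32
    for _ in range(count):
        value = (value + value) & MASK32  # one add per bit position
    return value
```

Each bit of shift costs one dependent add, so the trick only pays off for small shift counts, which matches the up-to-3-bits limit cited from the guide.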

From the guide, it must be that a double quadword to be shifted left by three bits has to be moved to 32-bit registers, shifted via three successive adds, and moved back to the SSE register (the throughput is the same using MMX): a minimum of 10 cycles, versus 6 cycles to do the same on Palomino. Doing four of them sequentially would use 40 cycles on the P4 and only 9 cycles on the Palomino or Morgan. Who would have thought that a simple code loop could run on a 1GHz Morgan as fast as on a 4.4GHz P4? If it were a rotate of those three bits, the Morgan would beat a 4.8GHz P4. Northwood had better add those barrel shifters, as shifts and rotates are heavily used in legacy code. It would help performance far more than another 256KB of L2.
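The 4.4GHz figure follows directly from the cycle counts quoted above: if the same work takes 40 cycles on one chip and 9 on the other, the slower chip needs 40/9 times the clock rate to finish in the same time. A quick check in Python, with a hypothetical helper name and taking the post's cycle counts as given:

```python
def equivalent_clock_ghz(base_clock_ghz, slow_cycles, fast_cycles):
    """Clock the slower chip needs to match the faster chip on the same work."""
    return base_clock_ghz * slow_cycles / fast_cycles


# Four sequential 3-bit shifts: 40 cycles on the P4, 9 on a 1GHz Morgan.
p4_equiv = equivalent_clock_ghz(1.0, 40, 9)
print(f"{p4_equiv:.2f} GHz")  # prints "4.44 GHz"
```

The 4.8GHz rotate figure is not reproduced here because no cycle count is given for that case.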

Again, you did not check what the SSE(2) and MMX instructions actually do in an x86 system, and on the P4 especially.

Any complaints about the post's length will be ignored, as it is evident that people neglect to read the "fine" print even when it is not small and is either bolded or italicized. Perhaps the "2x4 and the mule" story will need to be told.

Pete