Technology Stocks : Advanced Micro Devices - Moderated (AMD)


To: Steve Porter who wrote (20798), 11/28/2000 9:21:40 AM
From: combjelly
 
"Well Intel's new compiler (to the best of my knowledge) supports SSE, SSE2 and MMX vectorization automagically"

And other than generating code for SPEC, what use do Intel compilers get?



To: Steve Porter who wrote (20798), 11/28/2000 7:49:01 PM
From: Scott D.
 
Re: automatic use of MMX/SSE/SSE2

You are right! But as Intel warns, you still need to help it along:

The goal of vectorizing compilers is to exploit single-instruction,
multiple-data (SIMD) processing automatically. In practice, however,
this goal has proven difficult to achieve, for two major reasons:

Style. The style in which you write source code can inhibit optimization.
For example, a common problem with global pointers is that they often
prevent the compiler from proving that two memory references refer to
distinct locations, which in turn rules out certain reordering
transformations.

Hardware Restrictions. The compiler is also limited by restrictions
imposed by the underlying hardware. In the case of the Streaming SIMD
Extensions, vector memory operations are limited to stride-1 accesses,
with a preference for 16-byte-aligned memory references. This means
that even if the compiler abstractly recognizes a loop as vectorizable,
it still might not be able to vectorize it for a given target architecture.

I tried this example:

void copy1 (float *restrict dest, float *restrict src, unsigned length)
{
    unsigned index;

    for (index = 0; index < length; index++)
        dest[index] = src[index];
}

int main (void)
{
    __declspec (align (32)) float src [100];
    __declspec (align (32)) float dest [100];

    copy1 (dest, src, 100);

    return 0;
}

The alignment declarations are there for the SSE instructions (movaps
requires 16-byte-aligned operands). Believe it or not, the compiler
generates two versions of the function and, at runtime, calls the one
best suited to the actual alignment of the arguments. Here is some of
the SSE code it generated:

9: dest [index] = src [index];
00401050 0F 18 44 8E 40 prefetchnta [esi+ecx*4+40h]
00401055 0F 28 04 8E movaps xmm0,xmmword ptr [esi+ecx*4]
00401059 0F 29 44 8D 00 movaps xmmword ptr [ebp+ecx*4],xmm0
0040105E 0F 28 4C 8E 10 movaps xmm1,xmmword ptr [esi+ecx*4+10h]
00401063 0F 29 4C 8D 10 movaps xmmword ptr length[ecx*4],xmm1