To: Steve Porter who wrote (20798)    11/28/2000 7:49:01 PM
From: Scott D.

Re: automatic use of MMX/SSE/SSE2

You are right! But as Intel warns, you still need to help it along:

    The goal of vectorizing compilers is to exploit single-instruction multiple data (SIMD) processing automatically. However, the realization of this goal has been difficult to achieve. The reason for the difficulty in achieving vectorization is due to two major factors:

    Style. The style in which you write source code can inhibit optimization. For example, a common problem with global pointers is that they often prevent the compiler from being able to prove two memory references are distinct locations. Consequently, this prevents certain reordering transformations.

    Hardware Restrictions. The compiler is limited by restrictions imposed by the underlying hardware. In the case of Streaming SIMD Extensions, the vector memory operations are limited to stride-1 accesses with a preference to 16-byte aligned memory references. This means that if the compiler abstractly recognizes a loop as vectorizable, it still might not vectorize it to a distinct target architecture.

I tried this example:

    void copy1 (float *restrict dest, float *restrict src, unsigned length)
    {
        unsigned index;

        for (index = 0; index < length; index++)
            dest [index] = src [index];
    }

    int main (void)
    {
        __declspec (align (32)) float src [100];
        __declspec (align (32)) float dest [100];

        copy1 (dest, src, 100);
        return 0;
    }

The align declarations are there for the SSE instructions: the aligned moves (movaps) need their memory operands on a 16-byte boundary. Believe it or not, the compiler generates two versions of the function and calls the faster one at runtime, depending on the alignment of the arguments.
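
Here is some of the SSE code the compiler generated:

    9:        dest [index] = src [index];
    00401050 0F 18 44 8E 40   prefetchnta [esi+ecx*4+40h]
    00401055 0F 28 04 8E      movaps      xmm0,xmmword ptr [esi+ecx*4]
    00401059 0F 29 44 8D 00   movaps      xmmword ptr [ebp+ecx*4],xmm0
    0040105E 0F 28 4C 8E 10   movaps      xmm1,xmmword ptr [esi+ecx*4+10h]
    00401063 0F 29 4C 8D 10   movaps      xmmword ptr length[ecx*4],xmm1

(The "length[ecx*4]" operand in the last line is just the debugger's name for [ebp+ecx*4+10h], as the encoding 0F 29 4C 8D 10 shows. It is the store that pairs with the second load.)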
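
For anyone who wants to see what the two-version trick amounts to, here is a rough hand-written sketch in C. I have not seen the compiler's actual dispatch code, so the names (copy1_dispatch, is_aligned16) and the overall structure are my own assumptions. The only part taken from the listing is the movaps load/store pair, written here with the _mm_load_ps and _mm_store_ps intrinsics from xmmintrin.h. Using _mm_loadu_ps and _mm_storeu_ps (movups) for the other path is also an assumption, since the listing does not show what the second version does. On a target without SSE the sketch falls back to the plain scalar loop.

```c
#include <stdint.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Returns 1 if p sits on a 16-byte boundary, which movaps requires. */
static int is_aligned16 (const void *p)
{
    return ((uintptr_t) p & 15u) == 0;
}

/* Hypothetical stand-in for the compiler-generated dispatch.
   Copies length floats and returns 1 if the aligned path was chosen,
   0 otherwise. */
static int copy1_dispatch (float *restrict dest, const float *restrict src,
                           unsigned length)
{
    unsigned index = 0;
    int aligned = is_aligned16 (dest) && is_aligned16 (src);

#if HAVE_SSE
    if (aligned) {
        /* Aligned version: four floats per iteration (movaps). */
        for (; length - index >= 4; index += 4)
            _mm_store_ps (dest + index, _mm_load_ps (src + index));
    } else {
        /* Unaligned version: same loop with unaligned moves (movups). */
        for (; length - index >= 4; index += 4)
            _mm_storeu_ps (dest + index, _mm_loadu_ps (src + index));
    }
#endif

    /* Scalar loop for the leftover elements (or for everything
       when SSE is not available). */
    for (; index < length; index++)
        dest [index] = src [index];

    return aligned;
}
```

Called as copy1_dispatch (dest, src, 100) with two 16-byte-aligned arrays, it takes the movaps path. Passing dest + 1 and src + 1 moves both pointers 4 bytes off the boundary and forces the unaligned one.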