Technology Stocks : Advanced Micro Devices - Moderated (AMD)


To: Steve Porter who wrote (20798), 11/28/2000 9:21:40 AM
From: combjelly
 
"Well Intel's new compiler (to the best of my knowledge) supports SSE, SSE2 and MMX vectorization automagically"

And other than generating code for SPEC, what use do Intel compilers get?



To: Steve Porter who wrote (20798), 11/28/2000 7:49:01 PM
From: Scott D.
 
Re: automatic use of MMX/SSE/SSE2

You are right! But as Intel warns, you still need to help it along:

The goal of vectorizing compilers is to exploit single-instruction,
multiple-data (SIMD) processing automatically. In practice, however,
this goal has proven difficult to achieve, for two major reasons:

Style. The style in which you write source code can inhibit optimization.
For example, a common problem with global pointers is that they often
prevent the compiler from proving that two memory references refer to
distinct locations, which in turn rules out certain reordering
transformations.

Hardware Restrictions. The compiler is also limited by restrictions
imposed by the underlying hardware. In the case of the Streaming SIMD
Extensions, vector memory operations are limited to stride-1 accesses,
with a preference for 16-byte-aligned memory references. This means
that even if the compiler abstractly recognizes a loop as vectorizable,
it still might not be able to vectorize it for a given target architecture.

I tried this example:

void copy1 (float *restrict dest, float *restrict src, unsigned length)
{
    unsigned index;

    for (index = 0; index < length; index++)
        dest[index] = src[index];
}

int main (void)
{
    __declspec (align (32)) float src [100];
    __declspec (align (32)) float dest [100];

    copy1 (dest, src, 100);

    return 0;
}

The alignment declarations are there for the SSE instructions (movaps
requires 16-byte-aligned operands). Believe it or not, the compiler
generates two versions of the function and, at runtime, calls the one
best suited to the actual alignment of the arguments. Here is some of
the SSE code it generated:

9: dest [index] = src [index];
00401050 0F 18 44 8E 40 prefetchnta [esi+ecx*4+40h]
00401055 0F 28 04 8E movaps xmm0,xmmword ptr [esi+ecx*4]
00401059 0F 29 44 8D 00 movaps xmmword ptr [ebp+ecx*4],xmm0
0040105E 0F 28 4C 8E 10 movaps xmm1,xmmword ptr [esi+ecx*4+10h]
00401063 0F 29 4C 8D 10 movaps xmmword ptr length[ecx*4],xmm1