Technology Stocks : Rambus (RMBS) - Eagle or Penguin


To: Alex Fleming who wrote (61658), 11/21/2000 7:02:39 PM
From: SBHX
 
Hi Alex,

In my context, streaming data, as applied to the CPU, refers to large blocks of sequential data read in from main memory. One typical example is MPEG2 compression. Compressing the I (essentially JPEG) frames requires reading an entire frame of data into the CPU; this is not the problem, since the DCT step takes more cycles than the memory read. The P and B (predicted and bidirectionally predicted) frames are worse: the even more memory-intensive motion estimation requires searching for the minimum sum of absolute differences between blocks drawn from other reference frames. This is the problem, since such a search is memory intensive while the sum-of-abs-diff itself is not that CPU intensive. Essentially every byte read in gets used, so this is one of the ideal scenarios for memory subsystems that require very long bursts to be efficient. Add to that the new Katmai instructions, which (1) do the sum-of-abs-diff in one instruction and (2) provide prefetch and fence instructions to start asynchronous prefetches of blocks of data, and you have a pretty fancy MPEG compressor. But this requires assembler coding; it is not a simple recompile of C code, since SIMD is not abstracted well in C.
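To make that concrete, here is a minimal sketch of the inner motion-estimation loop, written with today's C intrinsics rather than raw Katmai assembler (the 128-bit _mm_sad_epu8 form of PSADBW is actually the later SSE2 flavor; Katmai's original worked on 64-bit MMX registers). The function name, the 16x16 block size, and the one-row-ahead prefetch distance are my own choices for illustration:

#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in _mm_prefetch */
#include <stdint.h>

/* Sum of absolute differences between a 16x16 block of the current
   frame and a candidate block in the reference frame.  PSADBW
   (_mm_sad_epu8) produces two partial sums per 16-byte row in one
   instruction, replacing 16 subtract/abs/accumulate steps. */
static uint32_t sad_16x16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    __m128i acc = _mm_setzero_si128();
    for (int row = 0; row < 16; row++) {
        __m128i a = _mm_loadu_si128((const __m128i *)(cur + row * stride));
        __m128i b = _mm_loadu_si128((const __m128i *)(ref + row * stride));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(a, b));
        /* hint the next reference row into cache ahead of use;
           prefetch never faults, so running one row past the block is safe */
        _mm_prefetch((const char *)(ref + (row + 1) * stride), _MM_HINT_T0);
    }
    /* PSADBW leaves one partial sum in each 64-bit half; combine them */
    return (uint32_t)(_mm_cvtsi128_si32(acc) +
                      _mm_cvtsi128_si32(_mm_srli_si128(acc, 8)));
}

The outer search loop just calls this over every candidate offset in the search window, which is where the memory traffic comes from: the SAD arithmetic is one instruction per row, but the reference frame gets swept over again and again.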

A second often-discussed example is the 3D geometry transform problem: basically a floating-point 4x4 matrix multiplication against each (x,y,z,1) component of the vertex data. There are Katmai floating-point SIMD instructions that can take two sets of four 32-bit single-precision FP values and compute a dot product between them. This would have been ideal, except people found that when the other components of each vertex (texture coordinates, color, and light values) come right after each x,y,z, those values are read in and discarded, and memory then becomes the bottleneck. Intel's proposal is for everyone else to represent their vertices as either
A. (x1,x2,...,x16)(y1,y2,...,y16)(z1,...,z16),(s1,...,s16)_1,(t1,...,t16)_1,...,(s1,...,s16)_N,(t1,...,t16)_N, where each (S)_k and (T)_k is a set of texture coordinates,
OR
B. (x1,y1,z1,*)(x2,y2,z2,*),...,(x8,y8,z8,*),{(S),(T)}

This means that the texture coordinates and other junk never have to travel across the memory bus during the transform, hence solving the granularity loss.
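Here is a sketch of what layout A buys you, in C with SSE intrinsics (the struct and function names are mine, and I ignore the 16-element grouping for simplicity; I assume 16-byte-aligned component arrays, a row-major matrix, w = 1 for every vertex, and a vertex count that is a multiple of 4):

#include <xmmintrin.h>  /* SSE (Katmai-era) intrinsics */

/* AoS: positions interleaved with per-vertex extras; a transform
   over this layout drags s,t across the bus only to discard them. */
struct vertex_aos { float x, y, z, s, t; };

/* SoA (layout A above): each component in its own contiguous array,
   so the transform only ever touches x[], y[], z[]. */
struct vertex_soa {
    float *x, *y, *z;   /* positions, `count` floats each, 16-byte aligned */
    float *s, *t;       /* texture coordinates, never read here */
};

/* Transform `count` vertices in place by the 4x4 row-major matrix m,
   four vertices per iteration, taking w = 1 for every input vertex. */
void transform_soa(struct vertex_soa *v, const float m[16], int count)
{
    for (int i = 0; i < count; i += 4) {
        __m128 x = _mm_load_ps(v->x + i);
        __m128 y = _mm_load_ps(v->y + i);
        __m128 z = _mm_load_ps(v->z + i);
        /* out = m[row0]*x + m[row1]*y + m[row2]*z + m[row3]  (w == 1) */
        __m128 ox = _mm_add_ps(_mm_add_ps(
            _mm_mul_ps(_mm_set1_ps(m[0]), x),
            _mm_mul_ps(_mm_set1_ps(m[1]), y)), _mm_add_ps(
            _mm_mul_ps(_mm_set1_ps(m[2]), z), _mm_set1_ps(m[3])));
        __m128 oy = _mm_add_ps(_mm_add_ps(
            _mm_mul_ps(_mm_set1_ps(m[4]), x),
            _mm_mul_ps(_mm_set1_ps(m[5]), y)), _mm_add_ps(
            _mm_mul_ps(_mm_set1_ps(m[6]), z), _mm_set1_ps(m[7])));
        __m128 oz = _mm_add_ps(_mm_add_ps(
            _mm_mul_ps(_mm_set1_ps(m[8]), x),
            _mm_mul_ps(_mm_set1_ps(m[9]), y)), _mm_add_ps(
            _mm_mul_ps(_mm_set1_ps(m[10]), z), _mm_set1_ps(m[11])));
        _mm_store_ps(v->x + i, ox);
        _mm_store_ps(v->y + i, oy);
        _mm_store_ps(v->z + i, oz);
    }
}

Note that the s and t arrays are declared but never touched by the transform loop, which is exactly the point: with the AoS struct, every cache-line burst would haul them in anyway.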

I think one of the side effects was that every graphics vendor changed how they walk these structures to get around this problem. Making data bursts too long is never without some corresponding pain paid by someone else.

SbH