Technology Stocks : Advanced Micro Devices - Moderated (AMD)


To: Scumbria who wrote (7169) | 8/31/2000 3:32:13 PM
From: EricRR
 
Scumbria- on caches:

I'm still trying to get a handle on your view of cache philosophy vs Paul's.

You say that the small cache is bad because it will cause misses, and that the aggressive 2-clock latency will limit future clock speed.

Paul claims that the small cache, versus a larger one, won't limit clock speed because small caches have shorter speedpaths.

If I understand correctly, you claim that a 3-clock L1 latency isn't so bad because the requests can be pipelined. I assume then that only the L1 can be pipelined, because every memory access is assumed to hit there; is this right? Also, is pipelining the cache a difficult thing to do? Can the request for data be made 3 clocks before the data is needed in a register, or does that require compiler support, like prefetching?

Paul:
Q2) Won't the 8 KB dcache in Wilma (compared to 16 KB in P6 and 64 KB in K7) really hurt performance?

No. The L1 cache is a small part of the overall memory system in an MPU. In a chip like Wilma the L1 serves primarily as a bandwidth-to-latency "transformer" to impedance match the CPU core to the L2 cache and reduce the effective latency of L2 and memory accesses. The big performance loser is going off chip to main memory, and the 256 KB L2 in Wilma is what is relevant to that, not the 8 KB dcache. The size of an 8 KB cache is insignificant compared to the scale of the Wilma die and could easily be larger. I think the reason it isn't larger is that Intel wanted to hit a 2-cycle load-use penalty, and at the clock rate Wilma targets a larger cache would be a speed path. An 8 KB dcache has a hit rate of around 92% and a 32 KB cache around 96%. A two-cycle 8 KB dcache beats a much larger three-cycle dcache for the vast majority of applications, given the rest of the Wilma memory system design.
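Paul's hit-rate figures can be plugged into the standard average-memory-access-time formula to see why the smaller, faster cache wins. A minimal sketch in Python; the 10-cycle L2 load-use latency is my assumption for illustration, not a number from the post:

```python
def amat(l1_latency, l1_hit_rate, l2_latency):
    """Average memory access time in cycles: L1 latency plus the
    miss-rate-weighted L2 penalty (L2 assumed to always hit)."""
    return l1_latency + (1.0 - l1_hit_rate) * l2_latency

# Paul's figures: 8 KB dcache ~92% hits at 2 cycles,
# 32 KB dcache ~96% hits at 3 cycles.
L2_LATENCY = 10  # assumed L2 load-use latency, illustrative only

small = amat(2, 0.92, L2_LATENCY)   # 2 + 0.08 * 10 = 2.8 cycles
large = amat(3, 0.96, L2_LATENCY)   # 3 + 0.04 * 10 = 3.4 cycles
print(f"8 KB / 2-cycle:  {small:.1f} cycles")
print(f"32 KB / 3-cycle: {large:.1f} cycles")
```

Under this assumption the 8 KB/2-cycle design averages fewer cycles per access despite the extra misses; the larger cache's extra hit latency is paid on every access, while the miss penalty is paid on only a few.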

The cache info from IDF is quite interesting. The L1 dcache can perform a 128-bit wide load and store per clock cycle. According to Intel, the average cache latency of a 1.4 GHz Wilma is a little over half (55%) that of a 1 GHz PIII in ABSOLUTE TIME. On a clock-normalized basis the memory latency is only 77% of the P3's.
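Intel's two percentages are mutually consistent: 55% of the absolute time at 1.4x the clock rate works out to 0.55 x 1.4 = 0.77 of the cycles. A quick check:

```python
P3_CLOCK = 1.0e9      # 1 GHz Pentium III
WILMA_CLOCK = 1.4e9   # 1.4 GHz Wilma

abs_time_ratio = 0.55  # Wilma's average latency vs P3, in absolute time

# cycles = time * clock, so the cycle (clock-normalized) ratio
# scales the absolute-time ratio by the clock ratio.
cycle_ratio = abs_time_ratio * (WILMA_CLOCK / P3_CLOCK)
print(f"clock-normalized latency ratio: {cycle_ratio:.2f}")  # ~0.77

# Equivalently, the P3 spends about 1/0.77 - 1 = ~30% more cycles.
print(f"P3 extra cycles: {1 / cycle_ratio - 1:.0%}")
```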
That is right, boys and girls, a P6 memory access averages about 30% more *clock cycles* than a 1.4 GHz Wilma. How is that possible? Well, it isn't just the smaller/faster L1. Intel borrowed a neat trick from the Alpha EV6 core. The Wilma performs load data speculation in its pipeline and assumes the load hits in L1. If it doesn't, it executes the cache fill and then uses a replay trap to rerun the load. A football analogy: this is like a receiver running a downfield pattern and expecting the football over the shoulder, sight unseen. It is a lot faster than stopping and waiting to catch the football before running downfield.
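The hit-speculation-plus-replay idea can be sketched as a toy model: schedule the load's dependents assuming an L1 hit, and on a miss fill the line and rerun (replay) the load. This is purely an illustrative sketch of the concept, not Intel's actual mechanism; the cycle counts are assumptions:

```python
def run_load(addr, l1, l2, stats):
    """Speculative load: the pipeline assumes an L1 hit.
    On a miss, fill L1 from L2 and replay the load."""
    if addr in l1:
        stats["cycles"] += 2      # fast path: assumed 2-cycle L1 hit
        return l1[addr]
    # Hit speculation was wrong: fill the line, then replay.
    stats["replays"] += 1
    stats["cycles"] += 10         # assumed L2 fill latency, illustrative
    l1[addr] = l2[addr]
    return run_load(addr, l1, l2, stats)

l2 = {0x100: 42, 0x200: 7}       # backing L2 contents (toy data)
l1 = {}                          # L1 starts cold
stats = {"cycles": 0, "replays": 0}
print(run_load(0x100, l1, l2, stats))  # miss, fill, replay -> 42
print(run_load(0x100, l1, l2, stats))  # now hits in L1 -> 42
print(stats)  # {'cycles': 14, 'replays': 1}
```

The payoff in the real pipeline is that the common case (a hit) never waits for the hit/miss determination; only the rare miss pays the replay cost.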