Technology Stocks : Intel Corporation (INTC)


To: wanna_bmw who wrote (157028)1/27/2002 1:20:52 PM
From: milo_morai  Read Replies (1) | Respond to of 186894
 
Wanna, NW isn't beating the Athlon with greater memory bandwidth. It has the same bandwidth (a.k.a. throughput) it's always had. Is your head in the clouds?

The L2 cache on NW is twice the size of Willamette's. Keeping more code in the L2 cache reduces the latency of the information that gets to the CPU, since it doesn't have to be fetched from the DRAM module as often. That increases system performance.
Here's a definition that you can understand:
anandtech.com

L2 Cache: What it does

We often take for granted that having an L2 cache means that your system runs faster than it would if it wasn't there, but what does that L2 cache actually do?

L2 cache, just like any other cache, acts as sort of a middle man between two mediums, in this case, your CPU’s L1 cache and your system memory (as well as other storage mediums). When the CPU wants to request a bit of data, it first searches in its L1 cache to see if it can find it there; if it does, then this results in what is known as a cache hit and the CPU retrieves it from the extremely fast, low latency L1 cache.

If it can’t retrieve it from L1 cache, it then goes to the L2 cache where it attempts to do the same – obtain a cache “hit.” In the event of a miss, the CPU must then go all the way to system memory in order to retrieve the data it needs. With the L2 cache of today’s CPUs operating at a much higher frequency and at much lower latency than system memory, if the L2 cache weren’t there or the cache mapping technique wasn’t as effective, we would see considerably lower performance figures from our systems.
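To make that hit/miss flow concrete, here is a minimal Python sketch of the lookup chain; the cycle counts and the dictionary-based "caches" are illustrative assumptions, not figures for any particular CPU:

# Hypothetical lookup chain: L1 -> L2 -> DRAM.
# Latency values are illustrative placeholders only.
L1_LATENCY, L2_LATENCY, DRAM_LATENCY = 3, 10, 100

def load(addr, l1, l2, dram):
    """Return (value, cycles spent) for a read of addr."""
    if addr in l1:                        # L1 hit: the fast path
        return l1[addr], L1_LATENCY
    if addr in l2:                        # L2 hit: fill L1, then return
        l1[addr] = l2[addr]
        return l2[addr], L1_LATENCY + L2_LATENCY
    value = dram[addr]                    # miss everywhere: go all the way to memory
    l2[addr] = value
    l1[addr] = value
    return value, L1_LATENCY + L2_LATENCY + DRAM_LATENCY

The point of the sketch is simply that every level missed adds its full latency to the read.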

Cache Mapping Techniques

We just established that the function of the L2 cache is to provide access to commonly used data in system RAM. It does so by essentially mapping the cache lines of the L2 cache to multiple addresses in the system memory (the number of which is defined by the cacheable memory area of the L2 cache).

There are a number of methods that can be used to dictate how this mapping should occur. On one end of the spectrum, we have a direct mapped cache, which divides the system memory into a number of equal sections, each one being mapped to a single cache line in the L2 cache.

The beauty of a direct mapped cache is that it can be searched quickly and effectively, since everything is organized into sections of equal size, but this comes at the cost of hit rate because the technique does not allow for any bias toward more frequently used sections of data.
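A direct mapped placement boils down to one modulo operation. Here is a sketch in Python, with arbitrary example sizes (a 256 KB cache with 64-byte lines):

# Direct mapped: each memory block can live in exactly one cache line.
LINE_SIZE = 64            # bytes per cache line (example value)
NUM_LINES = 4096          # 4096 lines x 64 bytes = 256 KB

def direct_mapped_line(addr):
    block = addr // LINE_SIZE       # which memory block the address falls in
    return block % NUM_LINES        # the single line that block is allowed to occupy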

On the other end of the spectrum, we have a fully associative cache, which is the exact opposite of a direct mapped cache. Instead of equally dividing up the memory into sections mapped to individual address lines, a fully associative cache acts as more of a dynamic entity that allows for a cache line to be mapped to any section of system memory.

This flexibility allows for a much greater hit rate since allowances can be made for the most frequently used data. However, since there is no organized structure to the mapping technique, searching through a fully associative cache is much slower than through a direct mapped cache.
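By contrast, a fully associative lookup has to compare the address tag against every line in the cache, which is where the slower search comes from. A rough sketch, again with example values:

# Fully associative: a block may sit in any line, so every stored tag is checked.
LINE_SIZE = 64                      # bytes per cache line (example value)

def fully_associative_hit(addr, stored_tags):
    tag = addr // LINE_SIZE         # the whole block address acts as the tag
    for t in stored_tags:           # worst case: scan every line in the cache
        if t == tag:
            return True
    return False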

Establishing a mid-point between these two cache mapping techniques, we have a set associative cache, which is what the current crop of processors uses.

A set associative cache divides the cache into various sections, referred to as sets, with each set containing a number of cache lines. With an 8-way set associative L2 cache, each set contains 8 cache lines, and in a 16-way set associative L2 cache, each set contains 16 cache lines.

The beauty of this is that the cache acts as if it were a direct mapped cache except that, instead of the 1 cache line per memory section requirement, we get x number of cache lines per section of memory addresses.

This helps to sustain a balance between the pros and the cons of a direct mapped and a fully associative cache.
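Putting the two ideas together, an 8-way set associative lookup might be sketched like this; the set count and line size are arbitrary example values, and "sets" is assumed to be a list with one small collection of tags per set:

# 8-way set associative: the index picks one set, then only that set's 8 ways are searched.
LINE_SIZE = 64                      # bytes per cache line (example values throughout)
NUM_SETS  = 512
WAYS      = 8                       # 512 sets x 8 ways x 64 bytes = 256 KB of cache

def set_associative_hit(addr, sets):
    block = addr // LINE_SIZE
    index = block % NUM_SETS        # direct-mapped part: which set to look in
    tag   = block // NUM_SETS
    return tag in sets[index]       # fully-associative part: search the 8 ways of that set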

In the case of the Thunderbird and the Pentium III Coppermine, the 16-way set associative L2 cache of the Thunderbird allows for a higher hit rate for the L2 cache than the 8-way set associative L2 cache of the Pentium III Coppermine. In comparison, the older Athlons featured a 2-way set associative L2 cache.

What is it you do for INTC again? PR work?

M.



To: wanna_bmw who wrote (157028)1/27/2002 7:58:23 PM
From: Dan3  Read Replies (1) | Respond to of 186894
 
Re: Do you actually think that lower latency will be able to sustain performance as processors grow in clock frequency?

Do you think processors will be able to sustain performance as clock speeds increase without lower latency?

Modern chips, with on-die L1 and L2 caches, have hit rates in the high 90s (percent). Data reads can be bytes, words, or longs, and instructions are often long words, so let's just assume 4 bytes are needed for each of those reads. If the miss rate is as high as 5% (and it rarely is), a 10 GHz chip would need bandwidth of 10,000 million / 20 = 500 million 4-byte reads per second, or 2 gigabytes per second. One PC266 channel can transfer 266 MHz x 8 bytes = 2,128 megabytes, or about 2 gigabytes per second - about what would be used by a 10 GHz processor.

So, one PC266 channel minimally supports the bandwidth needed by a 10 gigahertz processor.
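Working the same numbers in a few lines of Python (re-using the post's 5% miss rate and 4-bytes-per-read assumptions; the figures are illustrative, not measurements):

clock_hz = 10_000_000_000       # hypothetical 10 GHz core, one read per cycle assumed
miss_rate = 0.05                # 5% of reads miss the on-die caches
bytes_per_miss = 4              # the post's simplifying assumption

needed = clock_hz * miss_rate * bytes_per_miss   # bytes/second needed from DRAM
pc266  = 266_000_000 * 8                         # one PC266 channel: 266 MHz x 8 bytes

print(needed / 1e9)             # 2.0  GB/s required
print(pc266 / 1e6)              # 2128 MB/s available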

The thing is, when the chip has a cache miss, it stalls until the required instruction or data is available. Even the best PC266 has latency on the order of 50 ns - 100 clocks on a 2 GHz chip. So latency is very important. At 10 GHz the chip would be stalled for 500 clocks every time it needed an outside read. Rambus has longer latency than DDR, which makes Rambus even worse at higher speeds. If AMD can cut latency by 15 to 20 ns by bypassing a chipset memory controller, the result will be an instant and substantial increase in performance - especially as clock speeds increase.
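The stall arithmetic is just as simple to check; the 50 ns DRAM latency and the 15-20 ns on-die-controller saving are the post's own estimates, used here only as example inputs:

dram_latency_s = 50e-9                      # ~50 ns to first data from PC266 (post's estimate)

for clock_hz in (2e9, 10e9):                # 2 GHz today, 10 GHz hypothetical
    stall = dram_latency_s * clock_hz       # clocks lost on every cache miss
    print(f"{clock_hz/1e9:.0f} GHz: {stall:.0f} stalled clocks per miss")

saved = 20e-9 * 10e9                        # shaving ~20 ns with an on-die memory controller
print(f"on-die controller saves up to {saved:.0f} clocks per miss at 10 GHz")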

Is that enough? No, of course not. Even today's processors benefit from higher memory bandwidth. When a miss occurs, there is a pretty good probability that a new branch has been taken and that a number of additional reads will be required, which can be predicted and prefetched. So additional bandwidth is helpful and can improve performance. But latency is what's critical. That's why one channel of Rambus can't compete with even one channel of PC133 memory, despite its much greater bandwidth.