To: fyodor_ who wrote (74944), 3/19/2002 12:06:52 PM
From: pgerassi
 
Dear Fyo:

Let's compare the 64-bit on-chip memory controller against the 128-bit off-chip, FSB-connected controller. To provide some numbers, I acknowledge that these are EGs (educated guesses). First, the underlying memory should be the same to keep it an apples-to-apples comparison, so assume CAS2, 1T PC2700 DDR SDRAM (6ns clock cycle, 5 cycles to the first data word on reads, and 4-cycle bursts delivering 8 data transfers). The FSB path adds a clock cycle of command latency to gain control of the FSB, and the on-die FSB interface takes a cycle to initiate the process. The on-chip controller needs 1 CPU clock for the request plus an average of 1/2 DDR clock (3ns for PC2700) for synchronization, and about a 140ns delay to reach memory that is not directly attached (memory hanging off the other CPU, reached over HT).
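To keep the later arithmetic in one place, here is a small Python sketch of the assumed timings. Every number here is just the educated guess from this post, not a measurement.

# Assumed timing parameters (educated guesses from this post, not measurements).
DRAM_CYCLE_NS = 6.0          # PC2700 DDR: 166MHz command clock -> 6ns per cycle
CAS_TO_FIRST_DATA_CYCLES = 5 # cycles to the first data word on a read
BURST_CYCLES = 4             # 4 cycles per burst, 8 data transfers
P4_CLK_GHZ = 3.0             # assumed P4 clock
CLAW_CLK_GHZ = 2.0           # assumed Clawhammer clock
FSB_CYCLE_NS_100 = 10.0      # 100MHz base FSB clock
FSB_CYCLE_NS_133 = 7.5       # 133MHz base FSB clock
SYNC_NS = DRAM_CYCLE_NS / 2  # average DDR clock synchronization (3ns)
REMOTE_HT_NS = 140.0         # guess for memory hanging off the other CPU via HT

print(f"DRAM cycle: {DRAM_CYCLE_NS}ns, sync: {SYNC_NS}ns, "
      f"FSB cycle: {FSB_CYCLE_NS_100}ns @100MHz / {FSB_CYCLE_NS_133}ns @133MHz")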

Now we know that the P4 uses 128-byte cache lines and an inclusive 8-way cache, while Clawhammer uses 64-byte cache lines and a 16-way exclusive cache (effectively 20-way once you add the L1 in). The P4 uses a dual DDR channel (not here yet) connected through a northbridge, and Clawhammer uses a single, directly connected DDR channel.

The test loads are truly random access, sequential, and normal (time-sliced at 1ms per slice). The first occurs with large, complex loads that have lots of traffic. Normal loads are like those encountered in Windows or on workstations.

For truly random memory access, caches are useless for data but still fine for code: every cache line is used once and overwritten. This is the worst case for most memory systems. The P4 needs 11 cycles to determine that the data is not in the cache, 1 cycle to start the FSB transaction, 1 FSB cycle to gain control of the bus (arbitration), 2 FSB cycles (1 command and 1 address) to move the read command to the northbridge, 1 cycle for the northbridge's internal work, 5 cycles for the pair of channels, 4 cycles for the data transfer to the northbridge, 1 cycle to regain control of the FSB, 4 FSB cycles to return the data, and 1 cycle of L2 overhead. That totals 13 CPU clock cycles, 8 FSB cycles, and 10 DDR cycles, or (5ns + 80ns + 60ns) about 145ns at 3.0GHz, roughly 7 million memory reads per second. Clawhammer needs 11 cycles to determine the data is not in the cache, 1 cycle for the memory request, 1/2 cycle for synchronization, 5 cycles for the read command, 4 cycles for the data transfer, 1 cycle for the hub, and 1 cycle of L2 overhead. That totals 13 CPU cycles and 9.5 DDR cycles, or (7ns + 57ns) about 64ns at 2.0GHz, roughly 16 million memory reads per second. Winner: Clawhammer!
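In case anyone wants to check the arithmetic, here is a minimal Python sketch that just multiplies the cycle counts above by the assumed clock periods. The 13/8/10 and 13/9.5 breakdowns are the estimates from this post, not measured figures.

# Random-access read latency, using the cycle counts estimated above.
DRAM_NS = 6.0                      # PC2700 DDR cycle time

# P4: 13 CPU cycles @3GHz, 8 FSB cycles @100MHz, 10 DRAM cycles
p4_ns = 13 / 3.0 + 8 * 10.0 + 10 * DRAM_NS
# Clawhammer: 13 CPU cycles @2GHz, 9.5 DRAM cycles
claw_ns = 13 / 2.0 + 9.5 * DRAM_NS

print(f"P4 random read:         {p4_ns:.0f}ns (~{1e3 / p4_ns:.0f}M reads/s)")
print(f"Clawhammer random read: {claw_ns:.0f}ns (~{1e3 / claw_ns:.0f}M reads/s)")
# ~144ns vs ~64ns, i.e. roughly 7 million vs 16 million random reads per second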

Sequential loads. For the P4, each cache line fill brings in 128 bytes and is fully used: 16 reads of 8 bytes each. Without pipelining or prefetch, that is about 112 million 8-byte reads per second. However, since the next read will probably use the same RAS as the previous one, only 2 DDR cycles cannot be overlapped at the memory level, so the bottleneck becomes the FSB: 8 FSB cycles is the minimum needed per cache line fill even with prefetch and pipelining. At a 100MHz P4 bus clock, that is 80ns per cache line fill, or 5ns per 8-byte read, for 200 million reads per second. At 133MHz, the 60ns per fill is still larger than the 6 DDR cycles (36ns) the memory needs for sequential reads, so you get 266 million reads per second.
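As a quick sanity check on the FSB-bound numbers, another small Python sketch (the 8-cycles-per-fill minimum is the estimate above):

# P4 sequential reads: FSB-limited at 8 FSB cycles per 128-byte line fill.
LINE_BYTES, READ_BYTES = 128, 8
reads_per_line = LINE_BYTES // READ_BYTES           # 16 reads per fill

for fsb_mhz in (100, 133):
    fill_ns = 8 * (1000.0 / fsb_mhz)                # 80ns or ~60ns per fill
    reads_per_sec = reads_per_line / (fill_ns * 1e-9)
    print(f"{fsb_mhz}MHz FSB: {fill_ns:.0f}ns/fill -> "
          f"{reads_per_sec / 1e6:.0f}M 8-byte reads/s")
# ~200M/s at 100MHz, ~266M/s at 133MHz (still above the 36ns DRAM minimum)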

For Clawhammer, the bottleneck is still on the memory channel side: 5 DDR cycles per 64-byte fill (the 1-cycle saving is the maximum possible overlap for a controller that knows it just needs a sequential burst), or 8 reads of 8 bytes each. At 30ns per cache line fill, that is 266 million 8-byte reads per second. Winner: a tie against the P4 at a 133MHz FSB, but Clawhammer against the P4 at 100MHz!
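And the matching sketch for Clawhammer's single, directly attached channel, again using the 5-DDR-cycles-per-fill estimate from above:

# Clawhammer sequential reads: channel-limited at 5 DDR cycles per 64-byte fill.
DRAM_NS = 6.0
fill_ns = 5 * DRAM_NS                        # 30ns per 64-byte cache line fill
reads_per_fill = 64 // 8                     # 8 reads of 8 bytes each
reads_per_sec = reads_per_fill / (fill_ns * 1e-9)
print(f"Clawhammer: {fill_ns:.0f}ns/fill -> {reads_per_sec / 1e6:.0f}M 8-byte reads/s")
# ~266M/s, matching the P4 at 133MHz FSB and beating it at 100MHz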

Normal loads. The L1 and L2 caches are fully effective. The P4 has a 2-way, 8KB, 2-cycle L1, for a total of 64 cache lines and roughly an 88% hit rate (an estimate, but probably very close), and an 8-way, 512KB, 9-cycle L2 (latency in addition to the L1) with about a 95% hit rate. That works out to about (0.88*2 + 0.12*0.95*(9 + 2))/3GHz + 0.12*0.05*(145ns), or 1.875ns per 8-byte read.

Clawhammer has a 2-way, 64KB, 3-cycle L1, for a total of 1024 cache lines and about a 95% hit rate, and a 16-way, 512KB, 8-cycle L2 with about a 98.5% hit rate. That works out to about (0.95*3 + 0.05*0.985*(8 + 3))/2GHz + 0.05*0.015*(64ns), or 1.744ns per 8-byte read. Winner: Clawhammer!
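To make those weighted averages easy to re-run, here is one more short Python sketch. The hit rates and cache latencies are the guesses from the two paragraphs above, not measurements.

# Effective per-read latency under "normal" loads: hit-rate-weighted average of
# L1, L2, and main-memory access times. L2 latency is in addition to the L1 lookup.
def effective_ns(l1_hit, l1_cyc, l2_hit, l2_cyc, clk_ghz, mem_ns):
    cycles = l1_hit * l1_cyc + (1 - l1_hit) * l2_hit * l2_cyc
    return cycles / clk_ghz + (1 - l1_hit) * (1 - l2_hit) * mem_ns

p4   = effective_ns(0.88, 2, 0.95,  9 + 2, 3.0, 145.0)
claw = effective_ns(0.95, 3, 0.985, 8 + 3, 2.0, 64.0)
print(f"P4:         {p4:.3f}ns per 8-byte read")    # ~1.875ns
print(f"Clawhammer: {claw:.3f}ns per 8-byte read")  # ~1.744ns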

All in all, even with only one DDR channel, a 2GHz Clawhammer wins 2 out of 3 against a 3GHz P4 with a 133MHz FSB (533 QDR) and two DDR channels. With the slower FSB, Clawhammer wins hands down.

Pete