To: Rob Young who wrote (88181) 9/12/1999 10:36:00 PM From: Tenchusatsu
Rob, I'll try to kill two birds with one stone by making one reply. Since you seem to be pretty knowledgeable in these areas, I feel free to go into gritty detail.

<Directory-based coherence ... dumbed down chipset as the memory controller and network (CPU<->CPU) controller are on-chip... thereby a relatively cheap server>

I just looked over the Microprocessor Report article on the 21364, and now it makes sense to me, especially when I saw the words "directory-based coherence." Alpha's approach is much tougher to do than a shared processor bus like Merced's, and it requires both hardware and software support, but it'll be quite a killer in memory bandwidth. (Originally, I thought the chipset would still need to participate in coherence, but when I read the article again, I saw how it was done without a central coherence controller. Very good scheme, though it will require specific OS support.)

<Memory bandwidth 40/20 GByte/sec>

I don't know how these guys are cooking the numbers. The RDRAM ports on a single CPU provide 6.4 GB/sec (bidirectional). The interprocessor ports have 6.4 GB/sec going in and 6.4 GB/sec going out, both unidirectional. Adding those numbers up gives us 19.2 GB/sec. Where did those guys get the 40 GB/sec figure from?

Oh, wait a minute. If we also count the RDRAM ports from the three other CPUs, then we'll have 25.6 GB/sec (4 x 6.4). Then toss in the interprocessor connections (6.4 in plus 6.4 out, which aren't multiplied by four because that bandwidth is shared) and you'll get 38.4 GB/sec. I see now.

<Memory latency 60-100 ns ... Of course the latency is the big thing. That IS an outstanding number for a 4 processor config. I went and looked at others and this is from my old and foggy memory but I believe the best I found for current day is 270 ns with 4 processors.>

Latency for remote memory accesses will be much longer, but of course they will be kept to a minimum in a NUMA system.
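For anyone who wants to check the bandwidth tally from earlier in this reply, here is a quick sketch of the arithmetic. It only uses the figures quoted above (6.4 GB/sec per RDRAM port set, 6.4 GB/sec in and 6.4 GB/sec out on the interprocessor ports, four CPUs). The variable and function names are made up for illustration, and the way the terms are summed is my guess at how the 40 GB/sec headline was reached, not anything confirmed by the article.

```python
# Figures quoted in the post, in GB/sec. Names are hypothetical.
RDRAM_PER_CPU = 6.4   # RDRAM ports on one CPU (bidirectional)
LINK_IN = 6.4         # interprocessor ports, inbound
LINK_OUT = 6.4        # interprocessor ports, outbound
NUM_CPUS = 4          # 4-processor configuration


def single_cpu_total():
    """One CPU's RDRAM bandwidth plus its interprocessor in/out bandwidth."""
    return RDRAM_PER_CPU + LINK_IN + LINK_OUT


def four_way_total():
    """RDRAM bandwidth of all four CPUs plus the interprocessor bandwidth.

    Assumption (from the post): the interprocessor bandwidth is shared,
    so it is counted once rather than once per CPU.
    """
    return NUM_CPUS * RDRAM_PER_CPU + LINK_IN + LINK_OUT


print(f"single CPU: {single_cpu_total():.1f} GB/sec")  # 19.2
print(f"four-way:   {four_way_total():.1f} GB/sec")    # 38.4
```

The four-way total lands just under the quoted 40 GB/sec, which is why the rounded headline figure looks plausible under this way of counting.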
As for your "current day" latency figure, that sounds like a good estimate for a 4-way shared bus with normal TPC-C traffic.

<They run at 90% of the speed of PA 8500.. i.e. 31 SpecInt95 .. no mention of MHz, etc. I would think they took everything into account. Doubt a 440 MHz Merced could run PA binaries at 31 Specint.>

You sure about that? The Merced simulators are clock-based. They do a great job of predicting per-clock performance. However, they say nothing about performance once clock speed is factored in, because no one knows for sure what the clock speeds will be in the future. Mispredicting clock speed by even 10% could throw off those emulation percentages by a significant amount. That's why I think those infamous performance figures only count per-clock performance, not actual performance.

On the other hand, a 440 MHz Merced running PA-RISC binaries at 30.6 SpecInt (90% of 34.0) sounds awfully good, even to me. Either PA-RISC emulation on a Merced is quite efficient, or clock speeds were indeed taken into account and I was just talking too much ...

Tenchusatsu