To: Tenchusatsu who wrote (135429) 5/21/2001 2:36:39 PM
From: Rob Young
Read Replies (1) | Respond to of 186894

Tench,

"1) Validation - The added complexity of validation will surely impact the schedules. And like Itanium and McKinley, Alpha is not immune to schedule slips."

This is true. But schedule slips can't ALWAYS be visited upon succeeding generations. For instance, the 21164 was delivered ahead of schedule. The 21264 was a brute with OOO nastiness, and the slippage was there. Wildfire (GS160 and GS320) ASIC verification took 16 months of non-stop runtime on a large AlphaServer cluster (farm?), according to the architect at rollout. EV68 and EV67? Not sure about their delays. EV7? That too was delayed. However, I don't believe EV8 will be as delayed, since SMT for Alpha should be much simpler than SMT for other architectures, right?

"2) Software support - What I mean here is whether existing applications, including the ones that run several threads, take advantage of SMT without any modifications whatsoever. Of course the processor has to be able to run all existing apps, but if SMT requires specific software support, then all those existing apps would be running in single-threaded mode, not SMT."

You are highlighting what SMT is all about. Remember, the EV8 description says it contains 4 program counters. As described, "it appears to the OS as 4 CPUs and performs with the power of 2 CPUs." So while that last statement of yours is correct, an SMT processor should outperform all comers, and a good indicator to me would be how well a single SMT CPU does on SPECrate2000 for integer and floating point. Selling futures? Sure, that is under discussion.

"3) Memory bandwidth requirements - Certainly Alpha would not have to concern themselves in this area, since EV7 provides boatloads of bandwidth. And tolerance to latency is somewhat of a don't care, since the latency of EV7 will be very low anyway. I can only assume that EV8 has a similar memory subsystem and infrastructure as EV7, right?"

Believe it does. Latency is very important to OLTP, as highlighted in one of the papers cited earlier.

"4) Impact on caches - The issue here is pretty complicated, more than you and I would be qualified to address. (But that's not going to stop us, now is it?) Anyway, the problem is similar to the increased memory bandwidth requirements in SMT mode. More threads running simultaneously means more demand on the L1 and L2 caches. And if the caches aren't up to the task, you'll get lousy speed-up with SMT."

Actually, if you had read the paper entitled "An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors," you would note it is all about cache studies. The conclusion states:

"This paper explored the behavior of database workloads on simultaneous multithreaded processors, concentrating in particular on the challenges presented to the memory system. For our study, we collected traces of the Oracle DBMS executing under Digital Unix on DEC Alpha processors, and processed those traces with simulators of wide-issue superscalar and simultaneous multithreaded processors. The conclusions from our simulation study are three-fold. First, while database workloads have large footprints, there is substantial reuse that results in a small, cacheable 'critical' working set, if cache conflicts can be minimized. Second, while cache conflicts are high, particularly in the per-process private data region of the database, they often result from poorly-suited mapping policies in the operating system. By selecting the appropriate mapping policy, and using application-based address offsetting, conflict misses on an 8-context SMT can be reduced to the point where they are roughly commensurate with miss rates for a superscalar. Third, once cache conflicts are reduced to the superscalar level, an SMT processor can achieve substantial performance gains over a superscalar with similar resources. We show a 3-fold improvement in utilization (IPC) for the OLTP workload and a 1.5-fold increase for DSS. These performance gains occur because of SMT's ability to tolerate latencies through fine-grained sharing of all processor resources among the executing threads; this is demonstrated by our detailed measurements indicating that SMT achieves a striking improvement in memory-system utilization and fetch and issue efficiency, when compared with the single-threaded superscalar. We believe that these results indicate that SMT's latency tolerance makes it an excellent architecture for database servers."

Elsewhere you read about "constructive cache interference," listed in addition to the latency-hiding capabilities of SMT.

"No doubt the Alpha EV8 guys are all over these issues already. Surely thread-level parallelism is the next step beyond instruction-level parallelism. But exploiting TLP using SMT is very tricky, though the rewards could be significant depending on the application, platform, and many other factors."

TLP? Now you are off on another angle. I'm not so sure how TLP and SMT differ, or whether they are the same thing. The "instruction-level parallelism" argument is a poor one, as many major workloads exhibit VERY poor parallelism. See the paper about the OS and SMT; in it we see that Apache spends 75% of its time in the kernel and has an IPC of 1.1. The key to SMT is making the most of the resources that you have, and the PowerPoint shows how SMT's strength is keeping the ALUs busy. Busy ALUs translate into work ALWAYS being done. SMT is the future for all, as it will deliver the highest TPC and commercial benchmark results. That is where the money is; that is where everyone will go.

Rob
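P.S. To see why the paper's "application-based address offsetting" trick works, here is a toy sketch I threw together. All the numbers are made up (a tiny direct-mapped cache, nothing like EV8's real hierarchy): eight contexts whose private regions are 1 MB aligned all hammer the same cache sets and thrash each other, while a small per-context stagger spreads their hot lines across distinct sets.

```python
# Toy model (my own, not from the paper): conflict misses in a direct-mapped
# cache when several SMT contexts' private regions map to the same sets,
# versus the same accesses with a small per-context address offset.

CACHE_SETS = 128     # direct-mapped: one line per set (8 KB cache)
LINE_SIZE = 64       # bytes per cache line

def misses(thread_bases, lines_per_thread=16, rounds=100):
    """Replay each context's hot lines round-robin, counting misses."""
    cache = {}                       # set index -> tag currently resident
    miss_count = 0
    for _ in range(rounds):
        for base in thread_bases:
            for i in range(lines_per_thread):
                line = (base + i * LINE_SIZE) // LINE_SIZE
                idx, tag = line % CACHE_SETS, line // CACHE_SETS
                if cache.get(idx) != tag:
                    miss_count += 1
                    cache[idx] = tag         # evict whatever was there
    return miss_count

threads = 8
region = 1 << 20    # 1 MB-aligned private regions: identical set mapping
aligned = [t * region for t in range(threads)]
offset  = [t * region + t * 16 * LINE_SIZE for t in range(threads)]

print("aligned bases:", misses(aligned))  # every access evicts another thread
print("offset bases: ", misses(offset))   # only the initial cold misses remain
```

With aligned bases every single access is a conflict miss; with the 16-line stagger the eight hot regions tile the sets exactly and only cold misses remain. Crude, but it is the same effect the paper gets with OS mapping policy plus offsetting.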
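P.P.S. The "busy ALUs" argument can be put in back-of-envelope form, too. This little model is my own invention (the miss interval and latency are arbitrary, not EV8 figures): each context can issue one instruction per cycle but stalls for 40 cycles after every 10th instruction. A single-context core sits idle through the stall; an SMT core issues from whichever context is ready, so the issue slot keeps doing work.

```python
# Back-of-envelope model (my own numbers, not EV8 specifics): one issue slot,
# contexts that stall MISS_LATENCY cycles after every MISS_EVERY instructions.

MISS_EVERY = 10      # instructions between cache misses
MISS_LATENCY = 40    # stall cycles per miss

def utilization(contexts, cycles=100_000):
    done = [0] * contexts          # instructions retired per context
    ready_at = [0] * contexts      # cycle when each context may issue again
    issued = 0
    for cycle in range(cycles):
        for t in range(contexts):  # issue from the first ready context
            if ready_at[t] <= cycle:
                issued += 1
                done[t] += 1
                if done[t] % MISS_EVERY == 0:   # this instruction missed
                    ready_at[t] = cycle + MISS_LATENCY
                break
    return issued / cycles

print(f"1 context : {utilization(1):.2f} IPC")   # mostly stalled on misses
print(f"4 contexts: {utilization(4):.2f} IPC")   # other contexts hide latency
```

One context spends most cycles waiting on memory; four contexts overlap their stalls and keep the slot busy roughly 80% of the time, a ~4x gain in utilization from latency tolerance alone. That is the whole SMT pitch in miniature.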