To: Elmer who wrote (145414) 10/16/2001 1:40:52 PM
From: Paul Engel

Elmer - A glimpse into Intel's Hyperthreading implementation and potential benefits:

So rather than use that "basic trigger" for the new threads, Intel researchers developed a way for threads to beget more threads. Intel claims the speedup gained from creating such thread chains averages 76 percent but can range up to 169 percent. "We get a significant speedup using these chain triggers," Shen said. "If you have two logical processors, you can get 30 percent speedup. If you can do more than two logical processors — say, eight threads — you can now see a very significant speedup by using speculative precomputation threads."

I assume the patent applications for these developments are very large in number!

Paul

{============================}

Intel looks to bridge gap in multithreading CPU landscape
By Anthony Cataldo, EE Times
Oct 16, 2001 (10:29 AM)
URL: eetimes.com

SAN JOSE, Calif. — With instruction-level parallelism out of fashion and thread-level parallelism the buzzword du jour, Intel Corp. is proposing "pseudo-parallelism" as the next step in microprocessor design. The approach lets a CPU force single-threaded applications to act as if they have multiple threads.

The new wrinkle in parallelism, which Intel discussed here at the Microprocessor Forum, builds on a multithreading scheme called hyperthreading that Intel disclosed earlier this year. Hyperthreading, a latent feature in the P4 architecture that will be activated first in a Xeon processor next year, allows one CPU to act as two logical processors when it encounters applications that are split into separate threads.

But with the exception of such server applications as database management software, most applications can't take advantage of hyperthreading, because they are still single-threaded. That presents a problem for Intel, which plans eventually to deploy hyperthreading in desktop systems.
With pseudo-parallelism — more formally known as speculative precomputation — Intel exploits this "second," latent processor in single-threaded applications that would otherwise have remained idle.

Intel is looking to apply the new form of parallelism to one of the most vexing trouble spots in microprocessors: memory access. Cache misses in local cache memory are getting harder for CPU designers to swallow as the penalties worsen for accessing external DRAM. An early Pentium at 66 MHz lost only 70 instruction cycles for a DRAM access. But when processors reach 5 to 10 GHz, the number of instructions required to retrieve data from DRAM will be measured in the thousands of cycles. "It reminds me of what disk latencies used to be 10 or 20 years ago," said Glenn Hinton, Intel fellow and director of IA-32 architecture development.

Pseudo-parallelism attacks memory latency by minimizing cache misses, thus reducing the frequency of accesses to main memory. Intel has identified 10 static loads that do not lend themselves to prefetching and that are susceptible to stalls, either because they have too many dependencies or because they don't otherwise exhibit predictable access patterns. It is those "delinquent loads" that cause 80 to 90 percent of the cache misses. "We're looking at lots of pointer chasing that induces L2 and L3 cache misses," said John Shen, director of Intel's Advanced Architecture Labs.

Speculative precomputation works by spawning a new thread in an otherwise single-threaded application when an instruction reaches a certain stage in a pipeline. That is done by attaching code to the tail end of an existing binary; recompiling the code is unnecessary. When a thread is triggered, the second, idle logical processor comes to life and performs the cache prefetching. "The objective is that the speculative-precomputation thread will trigger cache accesses much earlier than the main thread that encounters the delinquent load.
We're trying to mask or eliminate all the cache miss latencies," Shen said.

Early experiments with speculative precomputation backfired: Instead of helping CPU performance, the technique slowed it down because the pipeline had to be flushed out every time a new thread was spawned, eating more CPU cycles, said Shen. So rather than use that "basic trigger" for the new threads, Intel researchers developed a way for threads to beget more threads. Intel claims the speedup gained from creating such thread chains averages 76 percent but can range up to 169 percent. "We get a significant speedup using these chain triggers," Shen said. "If you have two logical processors, you can get 30 percent speedup. If you can do more than two logical processors — say, eight threads — you can now see a very significant speedup by using speculative precomputation threads."

By taking advantage of the extra thread produced in hyperthreading, the technique remains consistent with Intel's overall goal of staying within strict power and die size budgets. In the strictest sense, a hyperthreaded CPU is not a full-fledged multithreaded machine, because it cuts most processor resources in half to accommodate two threads instead of duplicating the hardware. But it derives better performance by behaving like two logical processors, with only 5 percent more die area and power consumption.

One alternative to speculative precomputation would be to try to get even more instruction-level parallelism using superscalar techniques. But that approach is reaching its practical limits from a power and die size point of view. "There are a lot of execution units not highly utilized, because of instruction dependencies," Hinton said. "With two processors, the peak execution is six instructions per clock for each processor, but each processor is not necessarily using all the execution resources.

"It still takes twice the power and twice the die size when only one thread is available.
Half the hardware is completely idle."

Copyright 1998 CMP Media Inc.