Itanic jokes running out of steam :-) (from McJerk's mouth *lol*)
Part I: The aftermath of Alphacide, one year on
By Nebojsa Novakovic: Thursday 04 July 2002, 17:47
THIS TIME ALMOST EXACTLY a year ago, a crime took place. Alpha, the fastest and most elegant general-purpose processor architecture to date, and the first pure 64-bit microprocessor, was brutally murdered by its owner - at the height of its performance leadership, which it kept stubbornly despite its funding being gradually strangled by "PC-centric, industry-standard" [Ed: Lousy Dell-wannabe?] Compaq over the last few years. Yes, the only problem Alpha ever had, but bad enough to be fatal, was its owner company.
All the diabolical spin Compaq spewed about questionable technical sustainability was proven wrong not just by discovering what marvels the EV-8 and EV-9 Alpha chips would have brought, but also by the events that followed soon after.
I said on 27th June 2001, much to the jeers of some low-level Compaq staff here in Singapore: "The demise of Alpha means the demise of Compaq, and deliberately at that", and that's exactly what happened.
Alpha's gone - unless, somewhere deep in the Tibetan mountains, the Chinese military is working real hard on its own cloned EV8 / EV9 processors running COSIX (the Chinese Tru64 UNIX whose source code Compaq gave to China some time ago, plus EV8 etc plans that might have leaked out of the US).
Well, it would be fun to watch an imaginary short battle between Chinese Red Alpha COSIX -based battleships and their US Billy Windows-based opposites - one nasty Outlook virus (he, he, I only use Eudora) could bring a new, literal, meaning to the Blue Screen of Death for the Uncle Sam crew...
Well, we've got to move on in life, and here's where the 64-bit world stands right now.
Only two [Ed: surely three] survivors?
Memory prices drop, apps bloat even more than before, and many workstations and servers come with more than 2 GB RAM (the per-process limit on a 32-bit CPU) installed. If the memory trend continues, an average high-end PC with 2 to 4 GB RAM could be common out there in three years. So, let's avoid all the segment juggling and other tricks - the time to start planning the 64-bit move is now.
AMD's Opteron (Hammer, Athlon, whatever) is still far from volume shipping,[Pareti, Joseph] (what a polite way to describe it :-) and Sun is stuck with just-appearing 1050 MHz UltraSparc III whose SPEC performance is behind the year-old 1 GHz Alpha, not to mention Itanium2 or POWER4. Well, in the worst case, Sun could always use Opteron, at least for their high-end Linux machines. (why not, birds of a feather ... :-) < my note, of course :-)
Just a hint: imagine Sun going straight against Dell as a Tier 1 PC server vendor across all price/performance levels; it could be Scotty's wet dream :-) More on Sun and AMD in the second saga of this story (the third part will reflect feedback and positions from the vendors themselves, the 64-bit OS situation, as well as a look at the flagship systems of each platform).
Prove me wrong, but I believe that, unless Sun and AMD join hands or AMD gets another fully supporting Tier 1 vendor, the real battle for 64-bit domination will be between POWER and Itanium platforms. Whoever wins will probably dominate for the next 10 or so years, maybe until the first really usable quantum computers are ready - IBM, what say you?
Why POWER and Itanium? They both have huge vendors behind them: top technology with their own fabs, huge R&D teams, strong marketing, and above all, plenty of money. They both aim very high on the performance front, and both have the potential to reach the desktop mainstream some years from now - if you consider that POWER4 is in fact a 64-bit PowerPC, you can see some really interesting hints...
Let's look at each of them - starting with...
The good ship Itanic
After all the hype related to its EPIC architecture, Intel's first Itanium has been more of a disappointment than a leader - but the first Alpha, the 21064 in 1992, didn't fare much better compared to its key competitors, HP-PA and MIPS, at that time either. Without going into technical details, the key issues facing the Itanium CPU architecture remain a comparatively large code footprint (keep in mind that every 3 EPIC instructions take 128 bits, same as 4 RISC instructions), complicated instruction formats and no out-of-order execution for runtime execution parallelisation - the compiler can't always predict what added parallel execution can happen within that particular nanosecond on the chip.
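To put a number on that code footprint remark, here is a minimal sketch in Python. It only uses the figures quoted above (a 128-bit bundle holding 3 EPIC instructions, against 32-bit RISC instructions). It assumes, purely for illustration, that a program needs the same number of instructions on both architectures and that no bundle slots are wasted on padding; real compiled code will differ on both counts, so treat the result as a lower bound on the gap, not a measurement.

```python
# Back-of-the-envelope code density comparison.
# Figures from the article: 3 EPIC instructions per 128-bit bundle,
# versus fixed 32-bit RISC instructions.
# Assumptions (illustrative only): equal instruction counts on both
# architectures, and every bundle slot holds a useful instruction.

BUNDLE_BITS = 128
EPIC_INSTRUCTIONS_PER_BUNDLE = 3
RISC_INSTRUCTION_BITS = 32


def epic_bits_per_instruction():
    """Average storage cost of one EPIC instruction, in bits."""
    return BUNDLE_BITS / EPIC_INSTRUCTIONS_PER_BUNDLE


def risc_instructions_in_bundle_space():
    """How many 32-bit RISC instructions fit in the space of one bundle."""
    return BUNDLE_BITS // RISC_INSTRUCTION_BITS


def footprint_ratio():
    """EPIC code size relative to RISC for the same instruction count."""
    return epic_bits_per_instruction() / RISC_INSTRUCTION_BITS


if __name__ == "__main__":
    print(f"EPIC bits per instruction: {epic_bits_per_instruction():.1f}")
    print(f"RISC instructions per 128 bits: {risc_instructions_in_bundle_space()}")
    print(f"EPIC/RISC footprint ratio: {footprint_ratio():.2f}")
```

Under those assumptions the EPIC code is about a third larger; any slots the compiler has to fill with no-ops would widen the gap further.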
Can Itanium2 scale? It follows the shared-bus philosophy found in Pentium/Xeon processors all this while, so adding more CPUs on the same bus increases contention and causes delays that affect real performance. This 128-bit 400 MHz (effective throughput) 6.4 GB/s bus is still 50% faster than the Pentium 4 64-bit 533 MHz FSB; however, the memory-bus load of the tasks you'd expect to run on Itanium2 is definitely much higher too. Therefore, anything beyond 4 CPUs on a single bus is out of the question, and you need a 2-level structure of quad-CPU nodes connected via a Scalability Port switch to create larger, NUMA-like systems (the Intel E8870 chipset provides this feature).
What about per-CPU performance? Whoever used the first Merced iteration was probably flabbergasted by its performance, especially integer and in general under Linux (for some reason, code compiled on HP-UX and even WindowsXP64 seems to perform better on IA-64 than on Linux - at least the benchmarks I ran on an 800 MHz HP Itanium show that). Itanium2, under HP-UX, is a wholly different story, with many benchmark results shooting through the roof, especially FP - some even faster than POWER4. However, Itanium2 Linux compiler support (in particular Fortran) leaves a lot of room for more optimisations.
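The bus figures above can be checked with a few lines of Python. This is only a sketch of the standard peak-bandwidth arithmetic (bus width in bytes times effective transfer rate); the function names are made up for illustration, and the even four-way split of the shared bus is a simplifying assumption rather than a measured result.

```python
# Peak front-side bus bandwidth from width and effective clock.
# Widths and clocks are the ones quoted in the article; the helper
# names are hypothetical, and the even per-CPU split ignores real
# arbitration and contention effects.


def peak_bus_bandwidth_gb_s(width_bits, effective_mhz):
    """Peak bandwidth in GB/s: bytes per transfer times transfers per second."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * effective_mhz * 1e6 / 1e9


def per_cpu_share_gb_s(bus_gb_s, cpus_on_bus):
    """Best-case share of a shared bus if all CPUs pull equally."""
    return bus_gb_s / cpus_on_bus


itanium2_bus = peak_bus_bandwidth_gb_s(width_bits=128, effective_mhz=400)
pentium4_bus = peak_bus_bandwidth_gb_s(width_bits=64, effective_mhz=533)

if __name__ == "__main__":
    print(f"Itanium2 bus:  {itanium2_bus:.2f} GB/s")
    print(f"Pentium 4 bus: {pentium4_bus:.2f} GB/s")
    print(f"Ratio:         {itanium2_bus / pentium4_bus:.2f}x")
    print(f"Per CPU, 4 on one bus: {per_cpu_share_gb_s(itanium2_bus, 4):.2f} GB/s")
```

The ratio comes out at roughly 1.5, matching the "50% faster" claim, while the per-CPU share on a fully loaded quad bus drops below what a single Pentium 4 gets to itself.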
While the benchmark results are not official yet, the performance mentioned by the "industry sources" for the 1 GHz, 3 MB cache Itanium2 part really is about double that of the 800 MHz Itanium - competing with the POWER4 for the top 64-bit rank. Since the Itanium platform is quite dependent on memory subsystem speed for optimal performance, going to a 533 MHz effective FSB even before next year's Madison platform would be a good move to improve the performance somewhat.
With "Tiger 4", a handsome quad-CPU, 4U rack Itanium2 E8870 OEM system (Bandera+Mini-Boone chassis & board bundle), Intel has made a technically excellent rackmount solution with modularity and hot-pluggability implemented wherever it could have been done - look forward to our detailed review late next week. Even HP's quad Itanium2 server looks much bulkier, at 7U. With this box, Intel is obviously trying hard to attract more OEMs to come out with a quickie McKinley, ready-to-roll server or cluster node. Technically, it is a very good system, and my experience with its early beta version was satisfying - except for a few things we'll cover next week.
Raw POWER...
The sadness, sorrow and anger that many in this field felt after 25 June 2001 were somewhat mellowed, and replaced by quite a bit of admiration, in early October last year. While POWER4's SPEC2000 benchmark setup was a bit "unusual" (a mild word for letting one CPU core use all the 512 MB L3 cache on all other CPUs in a 32-way system), the performance was nevertheless leading - and it could only go forward as large page sizes and better cache usage optimisation came along.
For the next few days still, the 1.3 GHz POWER4 is the champion 64-bit CPU on both integer and FP portions of the SPEC2000 benchmark suite. Itanium2 will take away the FP crown, so IBM will need the expected 1.5 to 1.8 GHz POWER4+ speed-up later this year to reclaim full leadership. We've seen Itanium2 under HP-UX beat POWER4 under AIX on some compute-intensive benchmarks substantially, mainly because of one feature: 4 GB large pages, which allow for basically no memory translation, similar to direct translation blocks which POWER3 supposedly had, but its successor doesn't? Cutting page-miss overhead does seem to speed up some stuff tremendously.
In reality, IBM did a fantastic job building a complex, multi-level interconnect architecture that fits all 32 CPUs in 16 two-way chips (4 MCMs), on one 17"x17" mainboard!!! And an ultrareliable one at that, with a slew of eLiza self-healing features which you can check out on their web site. Plus, just like on good old Transputers or the poor soul called Alpha EV7, every chip has its own memory and I/O bus plus interconnect to several neighbouring chips at a very high speed - in the case of POWER4, they really went overboard and enabled each chip within an MCM to read the 1.5 MB L2 caches of neighbouring chips at over 40 GB/s - the L2 cache speed, in fact!
Even the memory speed, at 12.8 GB/s per dual-core chip using 512-bit wide DDR-SDRAM, is a record breaker - a 4 CPU dual-chip system like the new IBM p630 would have a 25.6 GB/s peak memory bandwidth, compared to 6.4 GB/s for quad Itanium2. The problem: too much circuitry between the CPU and the main memory, pushing the first-word latency way up - it would be interesting to compare the latency and bandwidth figures for the p630 vs HP's Itanium2 with the fast, direct-DDR ZX1 chipset. The 32 MB per chip L3 (integrated DRAM solution) does help the IBM flagship, but the direct on-chip controller approach of Alpha EV7 and Hammer could be better in the long run - after all, at 5,000 pins per chip, adding 8 DDR or 16 Rambus channels per chip wouldn't be that noticeable :-)
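The memory bandwidth comparison above is simple multiplication, sketched here in Python. The per-chip and shared-bus figures are the ones quoted in the text; the helper names are hypothetical, and the "implied transfer rate" is merely what the quoted 12.8 GB/s and 512-bit width work out to, not a confirmed memory specification.

```python
# Peak memory bandwidth: per-chip memory (POWER4) versus one shared
# bus (quad Itanium2). All inputs are figures quoted in the article;
# helper names are illustrative, and peak numbers say nothing about
# latency, which the article flags as POWER4's weak spot.

POWER4_CHIP_BANDWIDTH_GB_S = 12.8   # per dual-core chip
POWER4_MEMORY_WIDTH_BITS = 512
ITANIUM2_SHARED_BUS_GB_S = 6.4      # shared by all four CPUs


def system_peak_bandwidth_gb_s(chips, per_chip_gb_s=POWER4_CHIP_BANDWIDTH_GB_S):
    """Aggregate peak bandwidth when every chip has its own memory path."""
    return chips * per_chip_gb_s


def implied_transfer_rate_mt_s(bandwidth_gb_s, width_bits):
    """Transfers per second (in millions) implied by bandwidth and width."""
    bytes_per_transfer = width_bits / 8
    return bandwidth_gb_s * 1e9 / bytes_per_transfer / 1e6


p630_bandwidth = system_peak_bandwidth_gb_s(chips=2)

if __name__ == "__main__":
    print(f"p630 (2 chips, 4 CPUs): {p630_bandwidth:.1f} GB/s peak")
    print(f"Advantage over quad Itanium2: "
          f"{p630_bandwidth / ITANIUM2_SHARED_BUS_GB_S:.1f}x")
    rate = implied_transfer_rate_mt_s(POWER4_CHIP_BANDWIDTH_GB_S,
                                      POWER4_MEMORY_WIDTH_BITS)
    print(f"Implied memory transfer rate: {rate:.0f} MT/s")
```

The point of the per-chip design is that aggregate bandwidth grows with the chip count, whereas on a shared bus it stays fixed no matter how many CPUs are attached.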
Internally, even though the instruction set structure may not be as elegant as Alpha's, POWER4 is close enough - in fact, its design philosophy bears the most resemblance to the dying performance leader. As it sheds the remaining baggage that blocks fast speed ramps and moves towards the mainstream (Apple?), this architecture could be well positioned for the final 64-bit showdown.
What does the future hold for these two? Well, Intel now has the best of its own, HP-PA and Alpha teams to try to make the best out of the Itanium. But first, the Madison/Deerfield and Montecito successors to the Itanium2 have to be done properly. Based on independent sources, I expect Madison to appear as early as Q2 2003, probably starting at 1.33 GHz and going all the way up to 1.6 GHz during the year. It should have a 533 MHz effective bus throughput, bringing its bandwidth to 8.5 GB/s. Coupled with large on-chip caches of up to 6 MB, and low-latency multichannel DDR or Rambus chipsets like successors to the HP zx1 or Intel E8870, this could be a very good performance workstation / small server solution.
I don't expect any major changes for the 2004 Montecito - maybe hyperthreading if they manage to squeeze it in on time - except reaching or exceeding a 2 GHz clock, even larger caches and possibly a 667 MHz effective FSB (double the Prescott / Nocona width). The "Chivano", to follow a year or two after, should be the first CPU to have a major Alpha team influence, at least in the memory interface department - those distributed interconnects with scalable memory, I/O and interprocessor bandwidth must be there....
My layman's hope would be to try something: look at the Alpha EV8 diagram - a beautiful 8-way superscalar CPU made to sustain that rate. With Intel's top semicon technology, and the ability to pack zillions of transistors with ultrawide internal buses, imagine that a 32-bit single instruction slot on the EV-8 diagram is replaced by a 128-bit, 3 instruction bundle. Why this? Well, the compilers could take care of intra-bundle instruction scheduling, as well as a portion of inter-bundle scheduling. But the final out-of-order inter-bundle scheduling would be done at runtime within a (very complex, I guess) out-of-order unit in such a CPU. So, you'd be out-of-order processing 8 bundles per clock instead of 8 32-bit instructions per clock. Then, add the EV9 vector unit and its 16 PC1200 Rambus channels, and you've got a winner.
On the IBM side, expect a very different approach in POWER5 for 2004 - besides the usual clock, cache, memory, CPU execution parallelisation and interconnect improvements, there will be "Fast Path" hardwiring (on-chip hardware acceleration) of some common tasks like TCP/IP processing - later maybe even stuff like high-level database or bioinformatics routines. The 0.13 micron copper POWER5 will have 2-way simultaneous multithreading, just like Xeon, and is expected to run well above 2 GHz. It should be much cheaper, cooler and less power-guzzling than POWER4. Don't be surprised to see Apple Macs or thin blade servers based on POWER5.
Two years after, we should see POWER6, with expected 4-way multithreading, wide parallel execution and even more hardwiring capability (maybe they'll also consider direct on-chip memory controllers and fast vector FP units?).
IBM may tend to say how they're happy to hold the high-end and not directly compete against IA-64 across the board, but the reality is different - simply because Intel's 64-bit offerings will ultimately attack that high-end niche as well. So, they should ensure as wide a spread as possible for the POWER4 and its successors: from workstations or even Apple systems, all the way up to very large servers and NUMA-coupled clusters using either their upcoming Federation switch or more popular stuff like the Quadrics Elan4 interconnect.