Technology Stocks : Advanced Micro Devices - Moderated (AMD)


To: eracer who wrote (255269) 8/5/2008 11:21:27 PM
From: pgerassi
 
Eracer:

Did you notice that the 1600x1200 test is not using AA, just 4 samples (4xAF)? Comparing it to a test at 1600x1200 using 16xAF and 4xAA is like turning the settings down to the bare minimum. Thus it's like dropping to low quality for 60 FPS at 1600x1200 in a game like FEAR, where most modern entry-level GPU cards are CPU limited. Besides, the current top IGP is the 790GX, which runs like a Radeon 3470 with the 128MB sideport (SP) memory and dual-channel PC2-8500. The 40:4:4 790GX at 500MHz/533MHz with SP enabled, paired with a Phenom 9550 and unganged DDR2-1066, gets 30 FPS at 1600x1200 in FEAR at high quality. Pushing it to 1000MHz/533MHz with SP enabled boosts that by 10% to 33 FPS. Disabling the SP at 500MHz/533MHz reduces it to 26 FPS. Thus a 20% memory BW reduction yields a 13.3% FPS reduction, which shows it to be mostly memory BW limited.
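The scaling argument above can be checked with a few lines of arithmetic. This is only a sketch: the FPS figures and the 20% bandwidth figure are the ones quoted in the post, while the function names and the "sensitivity" ratio are made up here for illustration.

```python
def percent_change(before, after):
    """Signed percentage change going from `before` to `after`."""
    return (after - before) / before * 100.0

def sensitivity(fps_change_pct, resource_change_pct):
    """Share of a resource change that shows up in the frame rate.
    Near 1.0 means bound by that resource, near 0.0 means barely affected."""
    return fps_change_pct / resource_change_pct

# Figures quoted in the post (790GX, FEAR, 1600x1200, high quality)
fps_base = 30.0          # 500MHz core, sideport enabled
fps_core_doubled = 33.0  # 1000MHz core, sideport enabled
fps_no_sideport = 26.0   # 500MHz core, sideport disabled

# Core clock doubled (+100%) versus memory bandwidth cut by the post's 20%
core_gain = percent_change(fps_base, fps_core_doubled)
bw_loss = percent_change(fps_base, fps_no_sideport)

core_sens = sensitivity(core_gain, 100.0)
bw_sens = sensitivity(bw_loss, -20.0)

print(f"core clock +100% -> FPS {core_gain:+.1f}% (sensitivity {core_sens:.2f})")
print(f"memory BW  -20%  -> FPS {bw_loss:+.1f}% (sensitivity {bw_sens:.2f})")
```

On these numbers the frame rate tracks memory bandwidth far more closely than it tracks core clock, which is the point being made.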

Imagine a 2008 IGP getting a real 50% of a 13 core Larrabee estimate for 2010. Given the memory BW considerations, look for a 2010 8 core Larrabee to be the equal of a 2008 790GX IGP. Of course, given that IGPs have doubled or tripled in performance every year, the 2010 40nm SOI IGP (R870 based) would be about 4 to 9 times the performance of the current 790GX, or the equivalent of between 26 and 60 1GHz Larrabee cores.

And that assumes the 80:4:4 Stream:TU:ROP RV710 configuration holds for the RV810. Rumor has the TSMC 40nm SOI R870 being 12 cores (16 SP subcores plus 4 TUs per core) plus 24 ROPs at 1GHz. That yields a configuration of 960:48:24 for the 58xx Radeons. That is about 160 1GHz Larrabee cores. I think it will be more like 16 cores, or 1280:64:32, given that perfect scaling from 55nm to 40nm is 1.89x and I figure on only about 1.6x in real scaling. I also think that a 33% rise to 1GHz is conservative, given that the frequency scaling from a 55nm bulk process to a 40nm dual-strain SOI process should be higher at the same power.
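For anyone who wants to check the configuration numbers, here is a rough sketch. The 5 stream processors per SP subcore and the 2 ROPs per core are not stated in the post; they are inferred from the 960:48:24 and 1280:64:32 figures. The 10-core baseline for the R770 is likewise an assumption (800 stream processors at 80 per core), and the function name is hypothetical.

```python
def radeon_config(cores, subcores_per_core=16, sps_per_subcore=5,
                  tus_per_core=4, rops_per_core=2):
    """Return (stream processors, texture units, ROPs) for a core count.
    sps_per_subcore and rops_per_core are inferred, not given in the post."""
    stream = cores * subcores_per_core * sps_per_subcore
    return stream, cores * tus_per_core, cores * rops_per_core

# Ideal area scaling from a 55nm to a 40nm process: ratio of feature sizes squared
ideal_scaling = (55 / 40) ** 2

# Assumed baseline: R770 with 10 cores (800 stream processors / 80 per core)
r770_cores = 10
cores_at_real_scaling = round(r770_cores * 1.6)  # the post's 1.6x real scaling

print("rumored 12 cores:", radeon_config(12))
print("guessed 16 cores:", radeon_config(16))
print(f"ideal 55nm -> 40nm scaling: {ideal_scaling:.2f}x")
print("cores at 1.6x real scaling:", cores_at_real_scaling)
```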

And Larrabee supporters seem to forget that the 8 SP MAC wide vector unit in each core will vastly increase power consumption compared to the ALU, DP FPADD and DP FPMUL units in the Pentium core. Some of the other features, like the 512 bit bidirectional ring bus and the 512 bit DDR3 memory interface, will boost that further. So look for Larrabee clocks to be quite a bit slower than some expect.

As for CrossFire scaling, do recall that, unlike in Larrabee, the ODMC is duplicated as well, giving the CF 48xx twice the BW of a single 48xx. A 64 core Larrabee would have the same bandwidth as a 32 core one. A 1GHz 32 core Larrabee gets 1.0 TFlops of GPGPU performance, which is the same as a 625MHz R770 (4850). What isn't compared is the ROPs and scheduling units that the R770 has and Larrabee doesn't. The latter has to use the less efficient scalar CPU cores for that. Also, the 4850 gets 200 GFlops of DP power compared to 64 GFlops for a 32 core 1GHz Larrabee. The 4870 gets 1.2 TFlops / 240 GFlops respectively.
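The peak throughput figures above follow from units x flops per clock x clock. The sketch below reproduces them under stated assumptions: 800 stream processors for the R770 (2.5x the R670's 320), one multiply-add (2 flops) per unit per clock, a 750MHz clock for the 4870, DP at one fifth of the SP rate (the ratio implied by 200 vs 1000 GFlops), and 16 single-precision lanes per Larrabee core, chosen only because that reproduces the 1.0 TFlops figure. The helper name is made up.

```python
def peak_gflops(units, flops_per_unit_per_clock, clock_ghz):
    """Theoretical peak throughput in GFlops; real workloads get less."""
    return units * flops_per_unit_per_clock * clock_ghz

R770_STREAM_PROCESSORS = 800  # assumed: 2.5 x 320
MAC_FLOPS = 2                 # one multiply-add counts as two flops
DP_RATIO = 1 / 5              # assumed from the 200 vs 1000 GFlops figures

hd4850_sp = peak_gflops(R770_STREAM_PROCESSORS, MAC_FLOPS, 0.625)
hd4870_sp = peak_gflops(R770_STREAM_PROCESSORS, MAC_FLOPS, 0.750)
hd4850_dp = hd4850_sp * DP_RATIO
hd4870_dp = hd4870_sp * DP_RATIO

# Larrabee: 32 cores x 16 lanes (assumed lane count) at 1GHz
larrabee_sp = peak_gflops(32 * 16, MAC_FLOPS, 1.0)

print(f"4850: {hd4850_sp:.0f} SP / {hd4850_dp:.0f} DP GFlops")
print(f"4870: {hd4870_sp:.0f} SP / {hd4870_dp:.0f} DP GFlops")
print(f"32 core 1GHz Larrabee: {larrabee_sp:.0f} SP GFlops")
```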

The stuff about the R600 is the same. If Larrabee has so little memory usage, then the infrastructure of a 32 core Larrabee is at least as overbuilt as the R600's. The R600's performance was bad, and so will Larrabee's be given that scaling chart. Thus, just as the R600 was hot, slow and a failure in comparison to its competition, so will Larrabee be against its competition, especially since it tries to do more in software. Another telling example of this do-it-all-in-software performance hit is the original Macintosh Lisa. It did everything in software, including reading and writing the floppy and communicating with serial devices. It took a reasonably fast 8MHz 68K and made it run far slower than a lowly 4.77MHz 8088 in the IBM PC. All to save about $10 to $20 in parts.

Pete



To: eracer who wrote (255269) 8/6/2008 7:21:20 AM
From: mas_
 
We get an idea of the power consumption of Larrabee from that Table 1. 10 cores of a Larrabee-like design are roughly the same power and area as a Core 2 Duo at the same clock and process. So 30 Larrabee units (10 cores * 3 GHz) are about 65W according to that simulation. Obviously there's memory/PCIe consumption to think of as well, but it still looks competitive.

Unlike some other tile-based rendering methods, there is no attempt at perfect occlusion culling before shading, reordering of shading, or any other non-standard rendering methods. When taking commands from a DirectX or OpenGL command stream, rendering for a single tile is performed in the order in which the commands are submitted. Using a conventional rendering pipeline within each tile avoids surprises in either functionality or performance and works consistently well across a broad spectrum of existing applications.


That's just an excuse for not implementing a very powerful feature of tile-based renderers, which PowerVR (aka Kyro) showed to be very effective at saving memory bandwidth. Maybe Revision 2 ;-)



To: eracer who wrote (255269) 8/6/2008 4:11:56 PM
From: pgerassi
 
Eracer:

Another look at your post also shows that you ascribe to me things that I did not say. The point was that a 64 bit memory controller, 64 bit ring bus, setup engine, 4 ROPs, UVD, 16x PCIe interface and Tessellator together use 40mm2 of the 55nm RV610 die. Then some intellabee took that to mean just the memory controller, forgot the rest, and assumed that there were 4 equal copies of it in the R670. Well, given the above, there would be a 256 bit memory controller and ring bus, 4x the setup engine and 16 ROPs, but there aren't 4 Tessellators, 4 UVDs or 64x PCIe, so that number is high.

Then again, an R770 is 260mm2, some 68mm2 bigger than the R670's 192mm2 on the exact same process, with 2.5x the stream processors and much else roughly the same. So that 68mm2 is equal to 1.5 times the original 320 stream processors and 16 TUs. Thus one set of them is 46mm2, leaving all of the rest (everything other than stream processors and TUs) to use the other 146mm2. My original back-of-the-napkin estimate isn't far wrong. The big thing it would have missed is that using a crossbar switch and hub instead of the ring bus probably saved more die area, some of which was used to beef up the ROPs, move PCIe to version 2.0, revise the UVD to UVD2, allow for remote memory accesses through the CrossFire or PCIe interface, and upgrade the memory controllers to handle GDDR5.
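The die area estimate above works out as follows. This is a sketch using only the figures quoted in the post; it takes on the post's own assumption that everything other than the stream processors and TUs stayed roughly the same size between the two chips, and the variable names are made up.

```python
# Die areas and stream processor counts quoted in the post (same 55nm process)
r670_area_mm2 = 192.0
r770_area_mm2 = 260.0
r670_stream_processors = 320
r770_stream_processors = 800  # 2.5 x 320

# Extra area, attributed entirely to the extra shader/TU sets
extra_area = r770_area_mm2 - r670_area_mm2
extra_sets = r770_stream_processors / r670_stream_processors - 1

# Area of one set of 320 stream processors plus 16 TUs
area_per_set = extra_area / extra_sets

# Whatever is left of the R670 is everything else (ROPs, memory, UVD, PCIe, ...)
non_shader_area = r670_area_mm2 - area_per_set

print(f"extra area: {extra_area:.0f}mm2 for {extra_sets:.1f} extra sets")
print(f"one set of SPs + TUs: {area_per_set:.1f}mm2")
print(f"everything else on the R670: {non_shader_area:.1f}mm2")
```

The unrounded results are about 45.3mm2 and 146.7mm2, which the post rounds to 46mm2 and 146mm2.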

The same is true of the Atom. Although it's stated all over that it uses 2.5W, that does not include the NB or memory controller, which add another 5.5W for a total of 8W. There are 1GHz 90nm Semprons that use 7.7W including a dual-channel DDR2 memory controller and, being fully three-issue out-of-order processors, they have far more IPC than Atom does. They likely run rings around the Atom. There is even a 6W Geode NX1500, which is really a 1GHz 130nm bulk Athlon K7 and likely still runs rings around a 1.8GHz Atom, also because it's a three-issue OOO processor. You have to add in the NB, but given that it uses very cheap 130nm bulk, it likely sells for less than Atom.

Besides, without adding in the other power users in a netbook, comparing processors on performance per watt leads to garbage results, as users look at the netbook as a whole, not at just the CPU core. A 2.5W Atom that gets 1x in performance might look good on paper versus a 7.7W Sempron that gets 2x. But in a netbook where the screen uses 10W and the DDR2 memory uses another 5W, the Atom-powered netbook uses 23W (you have to add in the chipset) and the Sempron netbook uses 26W (the chipset doesn't have the memory controller, just a small IGP like an M740G/M700SB, which would still vastly outperform the IGP in the Atom chipset). Then the Sempron comes to 13W per unit of performance while the Atom comes to 23W per unit of performance. And if the Sempron were upclocked to 2GHz and used 25W, the netbook would go to 51W but have 3.5 times the performance, for an increase to 14.5W per unit of performance. That is still a higher perf/W than Atom's when looking at the netbook as a whole.
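A quick sketch of the comparison above, CPU alone versus the netbook as a whole. All wattages and relative performance figures are the ones quoted in the post, including the 51W total for the 2GHz case, which is taken as given rather than recomputed. The function and dictionary names are made up for illustration.

```python
def watts_per_unit_perf(watts, relative_perf):
    """Power needed per unit of performance; lower is better."""
    return watts / relative_perf

# (CPU watts, whole-netbook watts, relative performance), all from the post
systems = {
    "Atom":         (2.5, 23.0, 1.0),
    "Sempron 1GHz": (7.7, 26.0, 2.0),
    "Sempron 2GHz": (25.0, 51.0, 3.5),
}

results = {}
for name, (cpu_w, system_w, perf) in systems.items():
    results[name] = (
        watts_per_unit_perf(cpu_w, perf),     # looking at the CPU only
        watts_per_unit_perf(system_w, perf),  # looking at the whole netbook
    )
    cpu_only, whole = results[name]
    print(f"{name}: CPU only {cpu_only:.2f} W/perf, netbook {whole:.2f} W/perf")
```

Looking at the CPU alone the Atom comes out ahead, but on the whole-netbook figures the ranking flips, which is the argument being made.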

Pete