Technology Stocks : Intel Corporation (INTC)


To: Noel who wrote (144588) 10/2/2001 4:57:02 PM
From: pgerassi
 
Dear Noel:

Did you ever take a look at how they did it? It is two pipelines 180 degrees out of phase like I stated before. It is technically incorrect to call it a 4GHz pipeline as no section runs at that speed, only 2GHz, 1.33GHz or worse.

Just think of it: by your lights, a section that runs at 1000MHz and one that runs at 1000MHz but 0.1 cycle out of phase could be thought of as a single 10GHz pipeline, because one uop could be followed by another after only 0.1 of a base cycle. By that reasoning, Athlon has a section that runs at infinite speed, since two things occur at exactly the same time.
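The arithmetic behind that argument can be sketched in a few lines of Python. The function names are made up for illustration and this is not a model of any real chip: it only shows that the gap between two offset pipelines sets an "apparent" clock, while the sustained uop rate depends on the number of pipelines and their real clock.

```python
import math

def apparent_mhz(base_mhz: float, offset_cycles: float) -> float:
    """Clock you would claim if the gap between two offset pipelines counted as one period."""
    if offset_cycles == 0:
        return math.inf  # both pipelines fire at the same instant
    return base_mhz / offset_cycles

def sustained_uops_per_us(base_mhz: float, pipelines: int) -> float:
    """Real issue rate: each pipeline starts one uop per base cycle, whatever the phase."""
    return base_mhz * pipelines

# Offsets of half a cycle, a tenth of a cycle, and none at all.
for offset in (0.5, 0.1, 0.0):
    print(offset, apparent_mhz(1000, offset), sustained_uops_per_us(1000, 2))
```

The apparent clock climbs from 2GHz to about 10GHz to infinity as the offset shrinks, while the sustained rate stays at 2000 uops per microsecond in every case.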

The scheduler of the P4 proves that there are 6 pipelines in the P4: 2 of them do the first half of the cycle uops, 2 of them do the last half of the cycle uops, one does loads and one does stores, making a total of six. At one point in the last half cycle pipes, the uop goes through a stage 1.5 cycles long to line it up to the middle of the first half cycle pipe. Then the uop takes 1 cycle when scheduled. Later, the last half pipe goes through a 1.5 cycle stage to resynchronize to the first half cycle pipeline for easier retirement (if they saved the room to duplicate all of the later stages and it shows as a single stage in the regular pipeline). Notice that at no time are the stages any shorter than 1 cycle. Thus the double speed moniker is not correct.

That was explained in one of the white papers I read on the subject. The marketing people took a statement that the part that uses half cycles may look like it runs at double speed, but that is not what actually occurs.

Pete



To: Noel who wrote (144588) 10/2/2001 5:47:23 PM
From: pgerassi
 
Dear Noel:

According to the white paper I read, what is occurring is exactly what I said. There do not have to be two completely different pipelines, just a small section where one stage splits into two. The main pipeline has a stage that lasts exactly 1 cycle and the aux pipeline starts with a stage 1.5 cycles long (triggering on the same edge of the inverted clock is one method of doing this). Now the data is ready on the execution stage of the aux pipeline like any normal execute stage. Then either the results are delayed by a 1.5 cycle stage where the original pipeline has two stages, or the aux pipeline continues to the retirement stage always a half cycle off. This may be one of the reasons why the P4 is much larger than the equivalent pipeline in the P3.

Another good reason for there being two partially separate pipelines is the way in which the last half cycle uops are scheduled. The aux pipeline cannot have a uop scheduled if the first uop would take nearer to a full cycle. Also, there can be no branches between half cycle uops, as retirement becomes a problem. Look at what the boundary conditions are for two "half cycle" uops to be scheduled together.

All of this points to a second aux branch of the pipeline, not a section that runs at double speed.

Pete



To: Noel who wrote (144588) 10/5/2001 12:39:15 AM
From: Dan3
 
Re: What's really mindboggling is the frequency at which this ALU will operate

And how little actual work it will do while churning that clock and sucking so many watts. One in 5 instructions is a branch, and going to main memory after a cache miss will cost several hundred CPU cycles! The data bus runs at 100MHz (though it transfers at 4 times that, once the read instruction is acted on). The fastest RDRAM has a component latency of 40ns, in which time P4's "double pumped" unit doesn't do anything for 160 clocks. There are additional latencies from the bus handshaking, then it takes P4 28 clocks to decode and execute an instruction. Whenever a cache miss takes place, a 3GHz P4's "double pumped" unit does nothing but burn power for more than 240 clocks, plus bus latency, plus decode and execute, before it can finally output something.

Not much of a CPU, maybe, but a really great space heater!
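
The clock counts quoted in the post above follow from multiplying the memory latency by the effective clock of the "double pumped" unit. A minimal sketch of that arithmetic, assuming a 2GHz core for the 160 figure (the post does not state the core clock for that number) and using a hypothetical helper name:

```python
def idle_clocks(latency_ns: float, core_ghz: float, pump: int = 2) -> float:
    """Clocks of the 'double pumped' unit that elapse during a memory wait."""
    # GHz is cycles per nanosecond, so ns * GHz gives core cycles;
    # the pump factor converts core cycles to double-pumped clocks.
    return latency_ns * core_ghz * pump

RDRAM_LATENCY_NS = 40  # component latency quoted in the post

print(idle_clocks(RDRAM_LATENCY_NS, 2.0))  # 160.0 (2GHz core is an assumption)
print(idle_clocks(RDRAM_LATENCY_NS, 3.0))  # 240.0 (3GHz core, as in the post)
```

Bus handshaking and the 28 clocks for decode and execute are left out, so these are lower bounds on the stall.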