Now I may have got the wrong idea on this one, but I think that Intel's anointed successor to the PCI bus, 3GIO, is some super-fast serial bus. That would work nicely on a 1024-core, 32x32 planar arrangement: each CPU would need 1023 serial lines to communicate directly with all the others.
Each CPU would have its own gig of memory locally, accessed through the high-speed serial bus, and any of the others could access it almost as fast: NUMA, but not very N. Remember I'm talking 2010; a gig in the CPU should be reachable by then.
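To put a number on that wiring, here's a quick back-of-envelope sketch. It assumes a plain full mesh with one dedicated point-to-point link per pair of CPUs; the function name is just something I made up for illustration, and it says nothing about how 3GIO itself would be wired.

```python
def full_mesh_links(n_cpus):
    """Return (links per CPU, total links) for a full point-to-point mesh.

    Assumption: every pair of CPUs shares exactly one dedicated link.
    """
    per_cpu = n_cpus - 1            # each CPU talks to all the others
    total = n_cpus * per_cpu // 2   # each link is shared by two CPUs
    return per_cpu, total


per_cpu, total = full_mesh_links(32 * 32)
print(per_cpu, total)  # 1023 523776
```

So the 1023 lines per CPU add up to a bit over half a million links across the whole plane, which is the real cost of going fully direct.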
The main thing about thread parallel execution is that local memory is more important than global memory, since threads tend to work on well-focused tasks that are somewhat independent of the data in other threads, and only need to communicate with other threads on a sporadic basis.
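A rough sketch of that pattern, for what it's worth: each thread grinds away on its own private chunk of data and only talks to the rest of the world once, to hand back a result. The names (`worker`, `run`) and the sum-of-squares workload are my own stand-ins, and Python threads are only a model of the idea here, not of real per-CPU memory.

```python
import queue
import threading


def worker(local_data, results):
    # All the heavy work touches only this thread's own data.
    subtotal = sum(x * x for x in local_data)
    # The one sporadic communication step: report a single number.
    results.put(subtotal)


def run(n_threads=4, items_per_thread=1000):
    results = queue.Queue()  # the only shared structure
    threads = []
    for t in range(n_threads):
        start = t * items_per_thread
        local = list(range(start, start + items_per_thread))  # thread-local chunk
        th = threading.Thread(target=worker, args=(local, results))
        th.start()
        threads.append(th)
    for th in threads:
        th.join()
    # Combine one result per thread.
    return sum(results.get() for _ in range(n_threads))


print(run())
```

The shared queue sees one message per thread, however big the local chunks get, which is the point about local memory mattering more than global.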
But I'm off into speculation-land here. I don't know if the high-speed serial bus will materialize and be able to run at 10GHz, but Intel are betting on it.
P.