Technology Stocks : Silicon Graphics, Inc. (SGI)


To: Jerry Whlan who wrote (3645), 11/23/1997 6:40:00 PM
From: John M. Zulauf
 
From a buddy of mine who just got his PhD on the subject of high degrees of parallelization in web serving. He related to me that MPI programming is a big pain in the neck, and not nearly as automated as one could want. His experience was that "SCI" is incredibly bus limited, and that data locality is extremely important (and not something that can necessarily be handled well automatically). Certainly, data migration over the SCI was not something you wanted to see much of in an SCI environment. He held HP/Convex big clusters and their O/S MPI implementation in great disdain; he said he spent so much time working around the bugs that it left little time for real investigation of the problem.
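
To make the complaint concrete, here is a minimal sketch of my own (purely illustrative, nothing to do with his code): even a trivial global sum in MPI makes you partition the data and move the partial results around by hand, bookkeeping that a single shared-memory image would hide from you.

/* Minimal MPI sketch (illustrative only): explicit decomposition and
 * explicit communication for a trivial global sum. */
#include <mpi.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* explicit data decomposition: each rank owns a slice of the index space */
    long lo = (long)N * rank / nprocs;
    long hi = (long)N * (rank + 1) / nprocs;

    double local = 0.0, global = 0.0;
    for (long i = lo; i < hi; i++)
        local += 1.0 / (double)(i + 1);      /* stand-in for real work */

    /* explicit communication: combine the partial sums on rank 0 */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f\n", global);

    MPI_Finalize();
    return 0;
}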

For him, the idea of a big flat single image with cross-section bandwidth growing with the number of processors seemed the holy grail (and is the design of the O2000) for the back end of the server. To be fair, he never had the opportunity to work on an O2000, so there's no way to know if the reality would have been as good for him as the expectation. Sadly, he was lost to academe and graduated before his school put in its 32-proc O2K.

Remember, software ALWAYS lags hardware. Anything that makes the software problem harder eventually results in lower performance, features, or stability (pick two).

Having said that, MPI schemes certainly **are** useful for non-flat architectures. There are also interesting classes of problems that can be attacked with almost no data passing at all (** see postscript). SGI also has a variety of clustering and failover products for their server lines, and (as you noted) has MPI support as well. For problems tractable by these schemes, this is fine.

However, for general programming tasks, performance (typically) degrades faster with the number of nodes in a cluster than it does with the number of processors in a given system (or single image). Thus, for a given number of processors, the more processors per system the better.
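
A toy model of my own makes the shape of that curve obvious (the numbers are made up; only the trend matters): give every step a fixed communication cost and watch efficiency fall off much faster when that cost is cluster-interconnect sized than when it is intra-machine sized.

/* Toy scaling model (illustrative, made-up numbers): perfect compute
 * scaling plus a flat per-step communication term. */
#include <stdio.h>

static double efficiency(double t_comp, double t_comm, int nproc)
{
    double t_parallel = t_comp / nproc + t_comm;
    return (t_comp / nproc) / t_parallel;
}

int main(void)
{
    const double t_comp = 100.0;           /* arbitrary work units         */
    const double t_comm_smp = 0.1;         /* cheap intra-machine traffic  */
    const double t_comm_cluster = 5.0;     /* expensive inter-node traffic */

    for (int p = 4; p <= 64; p *= 2)
        printf("%2d procs: single image %.0f%%, cluster %.0f%%\n",
               p, 100.0 * efficiency(t_comp, t_comm_smp, p),
               100.0 * efficiency(t_comp, t_comm_cluster, p));
    return 0;
}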

I didn't know that HP has a 256-proc system operational. Is that a product, a tech demo, or a prototype?

Best unofficial regards,

john

** Postscript **
Turner Whitted (of UNC Chapel Hill) suggested many years ago that a real-time ray-tracing rendering system could be built out of an array of Cray supercomputers, each solving for a single pixel. Apparently at that time, a Cray (model #?) could solve for about 30 pixels per second. If each Cray were to have three (R, G, B) spotlights mounted on the roof shining the current pixel value, and only solved for one pixel per frame, a camera mounted on a hot air balloon could film the array from above, capturing a real-time ray-traced scene. Since ray tracing is fully parallelizable, the only data to be sent per frame would be the frame number to solve, or simply a synchronization signal. But then again, that is not a very generally useful solution, is it? ;-)
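
For fun, here is a sketch of my own of what the "program" for that array might look like if each Cray were an MPI rank owning exactly one pixel (trace_pixel and set_roof_spotlights are hypothetical stand-ins, obviously):

/* Illustrative sketch of the scheme above: the only inter-node data per
 * frame is the frame number broadcast from rank 0. */
#include <mpi.h>

typedef struct { unsigned char r, g, b; } Pixel;

static Pixel trace_pixel(int frame, int pixel_index)
{
    /* stand-in for real ray tracing: any per-pixel, per-frame function */
    Pixel p = { (unsigned char)(frame + pixel_index), 0, 0 };
    return p;
}

static void set_roof_spotlights(Pixel p)
{
    /* stand-in for driving the R, G, B lamps on the roof of each Cray */
    (void)p;
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int frame = 0; ; frame++) {
        /* the *only* communication: which frame to solve (a sync signal) */
        MPI_Bcast(&frame, 1, MPI_INT, 0, MPI_COMM_WORLD);

        Pixel p = trace_pixel(frame, rank);  /* fully independent work */
        set_roof_spotlights(p);              /* the balloon camera sees it */
    }

    MPI_Finalize();   /* unreachable in this toy loop */
    return 0;
}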



To: Jerry Whlan who wrote (3645), 11/24/1997 6:08:00 AM
From: Alexis Cousein
 
>The bonus of doing it for a system like IBM's was that using MPI
>(a message passing interface for distributed memory systems) was
>extremely portable; all high-performance vendors support MPI, so it
>is not a big deal to move your code from that Cray to the IBM, and a
>lot of the tuning you did for MPI on the Cray is just as useful on
>the IBM or the Origin.

1)
You'll see there is a very efficient MPI library on the Origin as well, *and*, thanks to the low-latency, high-bandwidth interconnect, it runs better than an SP2 on a very significant class of problems (not *all* problems can be domain-decomposed so well that the inter-node communication overhead becomes negligible).
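
A concrete (and purely illustrative) sketch of the kind of code I mean: a 1-D domain decomposition where each rank trades one halo cell with its neighbours every step. When that exchange is cheap relative to the compute, MPI scales; when it isn't, the interconnect latency and bandwidth are what decide the outcome.

/* Illustrative 1-D domain decomposition with halo exchange. */
#include <mpi.h>
#include <string.h>

#define LOCAL_N 1024
#define STEPS   100

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Status st;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
    int right = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

    /* local slab plus two halo cells */
    double u[LOCAL_N + 2], unew[LOCAL_N + 2];
    memset(u, 0, sizeof u);
    u[LOCAL_N / 2] = 1.0;                 /* some initial condition */

    for (int step = 0; step < STEPS; step++) {
        /* the per-step inter-node traffic: one boundary cell each way */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, &st);
        MPI_Sendrecv(&u[LOCAL_N], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, &st);

        /* purely local compute: a simple 3-point smoothing stencil */
        for (int i = 1; i <= LOCAL_N; i++)
            unew[i] = 0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1];
        memcpy(&u[1], &unew[1], LOCAL_N * sizeof(double));
    }

    MPI_Finalize();
    return 0;
}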

2)
You don't have to use MPI if you don't want to. Compare the number of commercial packages (finite elements, chemistry, etc.) available for SMP versus for MPI-style programming, and you'll see.
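
For comparison, a purely illustrative sketch of the same kind of global sum written for SMP: one directive and no explicit decomposition or message passing at all (OpenMP-style syntax here; the vendor shared-memory directives of the day are similar in spirit).

/* Illustrative shared-memory version: the directive parallelizes the
 * loop; the shared address space removes the explicit bookkeeping. */
#include <stdio.h>

#define N 1000000

int main(void)
{
    double sum = 0.0;
    int i;

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++)
        sum += 1.0 / (double)(i + 1);    /* same stand-in work as before */

    printf("sum = %f\n", sum);
    return 0;
}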

And don't try to put all the cc-NUMA machines into one bucket. The entire effort some SGI benchmarkers I know put into 'porting' an SMP application and 'tuning' it for a 64P machine was setting

setenv _DSM_ROUND_ROBIN

to distribute the memory evenly over the nodes; that's all. The latency of less than 1 microsecond, even to the *other* side of the machine, was negligible in this case, and bandwidth was all that counted. Not so if you incur a large inter-node latency, as on other cc-NUMA architectures.