November 24, 1997; Issue 982; Section: Technology
Distributed computing highlights Supercomputing 97 -- Networks harness supercomputers
By Chappell Brown
San Jose, Calif. - The quest to build the ultimate supercomputer is showing signs of becoming enmeshed in the networking revolution. As the technical program at last week's Supercomputing 97 (SC97) show revealed, clusters of high-performance workstations linked to central supercomputing resources may prove to be the ultimate problem-solving tool.
New approaches based on high-performance networking, or simply on the global reach of the Internet, reveal some of the potential that widely distributed computing systems offer.
Networking already addresses the need for scientists and engineers to collaborate on projects. Whether it is a group of electrical engineers within a company designing a system or a group of physicists dispersed over geographically remote universities, sharing data and algorithms is critical to large-scale efforts. Networks allow such groups to tap the latest supercomputers when needed without having to make a major investment in hardware.
"I think a lot of people are realizing that they can't afford the latest supercomputer and a more-economical route is to link smaller resources over networks," said Andrew Chien, a computer scientist at the University of Illinois at Urbana-Champaign, who presented his work on networks for high-performance clusters at the conference. "Networking has really taken over in the supercomputing world and you can see the extent of it at a conference like this," he added, pointing out that commercial vendors such as Hewlett-Packard Co. and Sun Microsystems Inc. are introducing clustered systems and high-speed interconnect to reach supercomputer levels of performance.
Chien's concept, developed with colleague Kay Connelly at the University of Illinois, is to simplify the network for high performance and ease of use. "We would like to be able to aggregate resources easily while maintaining network quality," he said. The design involves a simplified switch for routing data along with a higher-level network-management system that ensures the timely delivery of data between nodes.
Called FM-QoS, the network has a specially designed traffic-management system that can predetermine the allowable time delays between two nodes. The approach uses feedback from the destination node to the node of origin to maintain a given level of performance. Quality of service (QoS) is the term generally used to express the need for a specified level of data throughput on a network. The QoS issue arises because different computational tasks have different requirements for the delivery of data: a high-speed numerical simulation might tolerate only millisecond delays in data delivery, while a group of users interacting online could absorb much longer delays. The FM-QoS system ensures that each process can operate at its unique speed requirements without having network traffic interfere and block program execution.
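The admission side of that idea can be pictured with a short sketch. The names, numbers and structure below are illustrative assumptions, not FM-QoS code: each flow declares the worst-case delay it can tolerate, and a predetermined schedule is acceptable only if it keeps every flow inside its own bound.

```python
# Toy illustration (not FM-QoS itself): each traffic flow declares the worst-case
# delay it can tolerate, and a manager checks whether a proposed schedule keeps
# every flow within its bound. All names and figures here are hypothetical.

from dataclasses import dataclass

@dataclass
class Flow:
    name: str
    max_delay_ms: float        # longest delay the application can tolerate
    scheduled_delay_ms: float  # delay the predetermined schedule would deliver

def schedule_is_admissible(flows):
    """Return True only if every flow meets its own delay requirement."""
    return all(f.scheduled_delay_ms <= f.max_delay_ms for f in flows)

flows = [
    Flow("numerical-simulation", max_delay_ms=1.0, scheduled_delay_ms=0.5),
    Flow("interactive-session", max_delay_ms=250.0, scheduled_delay_ms=40.0),
]
print(schedule_is_admissible(flows))  # True: both requirements are satisfied
```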
By using network feedback with self-synchronizing communication schedules, the network-management system achieves synchronous I/O at all network interfaces. The network can therefore be scheduled for predictable performance without needing any specially designed QoS hardware. A prototype system demonstrated over Myricom's Myrinet network showed synchronization overheads of less than 1 percent of the total communication traffic. The demonstration network achieved predictable latencies of 23 ms for an eight-node network, which Chien said was more than four times better than conventional best-effort schemes, whose predictable delay was 104 ms at best.
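The feedback mechanism can be sketched roughly as follows. This is a toy model of the self-synchronizing idea under assumed parameters, not the Myrinet prototype: each node transmits in a fixed slot of a global schedule, the destination reports how far off schedule the message arrived, and the sender corrects part of that error, so residual skew shrinks every round.

```python
# Minimal sketch of feedback-driven schedule synchronization (hypothetical
# numbers, not the FM-QoS implementation). Each sender transmits in a fixed
# slot; the receiver reports how early or late the message arrived; the sender
# nudges its local clock by a fraction of that offset.

import random

SLOT_MS = 5.0  # length of one communication slot in the schedule

class Node:
    def __init__(self, name):
        self.name = name
        self.clock_offset_ms = random.uniform(-2.0, 2.0)  # initial clock drift

    def send_time(self, slot_index):
        # When this node actually transmits, as seen by a global observer.
        return slot_index * SLOT_MS + self.clock_offset_ms

    def apply_feedback(self, observed_offset_ms):
        # Destination feedback: correct half of the observed scheduling error.
        self.clock_offset_ms -= 0.5 * observed_offset_ms

nodes = [Node(f"node{i}") for i in range(8)]
for slot in range(20):                                # run the schedule for 20 rounds
    for n in nodes:
        offset = n.send_time(slot) - slot * SLOT_MS   # how far off schedule it sent
        n.apply_feedback(offset)                      # feed the error back

worst = max(abs(n.clock_offset_ms) for n in nodes)
print(f"worst residual offset after feedback: {worst:.6f} ms")
```

Because each round halves the remaining error in this toy, the nodes converge to a common schedule without any dedicated QoS hardware, which is the flavor of the result Chien described.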
The current design is optimized for clusters of workstations and supercomputers in a building and there is flexibility in the architecture to scale up to large systems such as a campuswide computer cluster. "It resembles the structure of 'clock domains' on VLSI chips where subcircuits are carefully timed and then linked by a clock that can have more skew," Chien said.
At the other extreme is the Internet, which is not under local control but nevertheless offers opportunities for solving supercomputer-level problems. Some breakthroughs in solving intractable problems have come from a group of researchers breaking the problem down into smaller segments, doing computer runs locally and then combining the results. Systems such as the Parallel Virtual Machine, a network operating system that emulates a supercomputer on top of a set of widely distributed computing resources, are being developed by engineers at supercomputing centers.
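The segment-and-combine pattern described above reduces to a few lines. The sketch below uses ordinary local processes as stand-ins for geographically dispersed machines; it is an illustration of the pattern, not PVM itself.

```python
# A minimal sketch of the divide-and-combine pattern: a large job is split into
# independent segments, each segment is computed separately ("local runs"), and
# the partial results are recombined at the end.

from multiprocessing import Pool

def run_segment(segment):
    """Stand-in for a locally executed piece of the overall computation."""
    return sum(x * x for x in segment)

if __name__ == "__main__":
    data = list(range(1_000_000))
    segments = [data[i::4] for i in range(4)]       # break the problem into 4 parts
    with Pool(processes=4) as pool:
        partials = pool.map(run_segment, segments)  # run the segments independently
    print(sum(partials))                            # combine the results
```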
That technique may get a boost from an Internet-based operating system called the Scalable Networked Information Processing Environment (Snipe) being developed at the University of Tennessee. Collaborations over the Web involve a lot of "hand coding," in the sense that everyone involved must take their segments of the problem and code them locally on whatever computing resources they have. Snipe would automate that process by identifying and tracking algorithm segments and data that are distributed over the Web. "We started with a Web model in which all the segments of a process are a resource tagged with a URL [uniform resource locator]," said Keith Moore, who presented the latest results from the project at the conference. Not only is the Internet unique in the scope of its interconnectivity, but it can potentially involve millions of users. The Snipe system therefore duplicates data and algorithms and dynamically allocates them to computing resources on the Web.
A local cluster of computing resources (workstations, supercomputers and high-performance database servers) would be assigned a URL. The Snipe system then sets up virtual interconnects between these clusters and implements them over the Web, managing data transfers and resource duplication to emulate the virtual network of links.
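One way to picture the bookkeeping involved, based only on the article's description rather than the project's actual interfaces, is a registry that tags every cluster and every code or data segment with a URL, replicates segments across clusters and assigns work to a cluster that both holds a replica and has spare capacity. All identifiers below are hypothetical.

```python
# Hedged sketch of URL-tagged resources with replication and dynamic allocation
# (assumed names and structure, not Snipe's API).

import random

clusters = {
    "http://clusterA.example.edu/": {"capacity": 4, "holds": set()},
    "http://clusterB.example.edu/": {"capacity": 2, "holds": set()},
    "http://clusterC.example.edu/": {"capacity": 3, "holds": set()},
}

def replicate(segment_url, copies=2):
    """Place copies of a URL-tagged segment on several clusters."""
    for url in random.sample(sorted(clusters), copies):
        clusters[url]["holds"].add(segment_url)

def assign(segment_url):
    """Pick a cluster that holds the segment and still has capacity."""
    candidates = [u for u, c in clusters.items()
                  if segment_url in c["holds"] and c["capacity"] > 0]
    if not candidates:
        return None
    chosen = candidates[0]
    clusters[chosen]["capacity"] -= 1
    return chosen

replicate("http://project.example.org/segments/fft-stage1")
print(assign("http://project.example.org/segments/fft-stage1"))
```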
Evaluating traffic
A variety of other network design and evaluation tools surfaced at the conference. Several technical sessions described systems for evaluating traffic and performance at the network level, at the I/O level and inside a given computer system. Clusters of symmetric multiprocessors and virtual memory-management systems spanning networks appear to be emerging as a force in high-end computing, answering the need for tightly managed local computing resources that support work groups.
One system that may help diverse technical computer users make sense of their execution environment is a performance evaluation tool from the University of Wisconsin-Madison, presented by Karen Karavanic, one of the researchers developing it. Called the Experiment Management System, the program is organized around the metaphor of a lab notebook. The system records results from a set of program executions collected over the life of an application. The data is stored as "program events," which include the components of the code executed, the execution environment and the performance data collected.
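A rough sketch of what such a "program event" record and notebook might look like follows. The field names are guesses at the kind of information the article lists (code components, execution environment, performance data), not the tool's actual schema.

```python
# Hedged sketch of a program-event record and a notebook that accumulates them
# over the life of an application (invented fields and values).

from dataclasses import dataclass, field

@dataclass
class ProgramEvent:
    code_components: list   # which pieces of the application ran
    environment: dict       # machine, node count, compiler flags, ...
    performance: dict       # metrics measured for this execution

@dataclass
class Notebook:
    events: list = field(default_factory=list)

    def record(self, event):
        """Append one execution's results, notebook-style, in order."""
        self.events.append(event)

nb = Notebook()
nb.record(ProgramEvent(["solver.f", "io.c"],
                       {"machine": "sp2", "nodes": 16},
                       {"wall_seconds": 410.0}))
nb.record(ProgramEvent(["solver.f", "io.c"],
                       {"machine": "sp2", "nodes": 32},
                       {"wall_seconds": 230.0}))
print(len(nb.events))  # two executions recorded over the life of the application
```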
The combinations of code and execution environment are organized as a multidimensional program space. Visualization methods allow the user to view the collected data using different criteria. A graphical representation of the program space is the basis for the interface to the Experiment Management System.
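That organization can be illustrated with invented data rather than the tool's actual interface: recorded executions are indexed by dimensions such as code version and node count, and the space is sliced along one dimension to see how a metric changed, which is exactly the question quoted next.

```python
# Hedged sketch of a "program space" (hypothetical dimensions and values): each
# execution is a point indexed by code version and environment, and slicing the
# space shows how and how much performance changed along one dimension.

program_space = {
    # (code version, node count) -> measured performance for that execution
    ("solver-v1", 16): {"wall_seconds": 410.0},
    ("solver-v1", 32): {"wall_seconds": 230.0},
    ("solver-v2", 32): {"wall_seconds": 190.0},
}

def slice_space(code_version):
    """Hold the code version fixed and vary the environment dimension."""
    return sorted((nodes, perf["wall_seconds"])
                  for (ver, nodes), perf in program_space.items()
                  if ver == code_version)

(base_nodes, base), (new_nodes, new) = slice_space("solver-v1")
change = (new - base) / base
print(f"{base_nodes} -> {new_nodes} nodes: wall time changed {change:+.0%}")
# prints: 16 -> 32 nodes: wall time changed -44%
```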
"Essentially you want to answer the question 'how and how much did the performance change?' " Karavanic said. Currently, complex and detailed benchmarking procedures are used to answer that question, but those systems are relatively fixed. The lab-notebook system is designed to make the whole evaluation process more flexible by including it as part of an ongoing computational project.
"For example, we tried the system out with a group of chemists that were doing computer-based chemical simulations. The evaluation process was compressed from about a year down to a few weeks," Karavanic said. The lab-notebook metaphor helps scientists, who may not be computer scientists, to perform effective performance evaluations as part of their scientific experiments. "This system is not specifically tied to computer-system evaluation," said Barton Miller, another researcher on the project who evolved the basic concept. "We intend to develop it as a general lab-notebook system that could be used in any lab."
The computerized notebook includes the chronological page-by-page structure required by standard lab practice. But the same information, once recorded, can be reorganized by the system to display the results in a variety of modes, helping high-performance-computer users grasp the effectiveness of their particular mix of applications and systems.