Distributed processing is on top. Electronic Engineering Times, Jan. 24, 2000, p. 96
By Charlie Jenkins
The need for flexible hardware solutions to process higher-layer applications within IP packets is spurring the use of network processors. As deployments proceed and backplanes push above the 50-Gbit/second range, designers will discover that distributed processing is the preferred method for cost-effective, highly scalable designs in an IP world.
A short history of the technology can be instructive. In the late 1980s, two-port bridges appeared as a means of connecting shared LAN subnets. The bridges used dual network interface functionality with a high-performance CPU engine performing the Layer 2 packet-processing functions. The CPU performance was adequate and the flexibility of software programming was ideal. But by the early 1990s, multiport bridges called switches were needed to connect the growing number of LAN subnets. Additionally, Ethernet became the dominant Layer 2 LAN protocol, thus solidifying the switching requirements and enabling clever designers to look to ASICs to provide more bandwidth for increasing port counts. The CPU became RISC-based and was still involved in most data path transactions, providing protocol demultiplexing, higher layer processing and supervisory and control functions.
By the mid-1990s, it was clear that IP over Ethernet would become the dominant LAN protocol. The race to design switch/routers was on. Companies began introducing IP switches using ASICs to provide pseudo Layer 3 processing for IP packets on the fast data path. Exception processing for all other packet types used a RISC engine or engines on a much slower data path. Then Internet growth exploded and Fast Ethernet LANs began to overwhelm the corporate backbone. A new Gigabit Ethernet standard was quickly deployed, pushing ATM out of the LAN space. So LAN switch companies had to redesign their L2-L3 boxes for gigabits of bandwidth while ATM switch vendors were redeploying their architectures for IP traffic in the lucrative Internet service provider (ISP) market.
As the end of the decade approached, the higher bandwidth requirements of Gigabit Ethernet in the LAN and OC3/12/48 in the WAN accelerated the move to distributed ASIC-based switching solutions. Unfortunately, these solutions must continually change to support quality of service in the converging world, but they lack the flexibility of the software solutions of the late 1980s.
Market trends
An inter-switch view of switching fundamentals highlights the market trends driving mega-switches. This is where traffic engineering (TE) comes in. TE is the mapping of traffic flows onto an existing physical topology, in this case a collection of switches/routers over a wide area. TE involves flow identification, provisioning, servicing and queue management for multiple users and application services over varying distances and with varying bandwidth requirements. It is the primary method for balancing traffic loads on the various links in both LAN and WAN networks, helping to make the business case for service-based fee structures. TE provides bounded performance capabilities for deployment in multiservice Internet environments by minimizing packet transfer delays, packet loss and delay variations. It also offers enhanced reliability and control by rerouting around congested and failed links, and it maximizes bandwidth utilization by provisioning flows onto underutilized paths.
However, to provide these benefits, the TE function must recognize flows in real time, based on the services they require, and provision traffic between nodes before congestion occurs.
By the mid-1990s, multihop routing was performed by metric-based control, in which flows traveled least-cost (metric) paths from point A to point B. By changing the metrics on individual links, the traffic pattern could be changed. Unfortunately, as the network cloud grew, ISPs became uncomfortable with the notion that fixing one congested link might cause several new problems elsewhere. Service providers required equipment that could scale to OC3/OC12 speeds while provisioning IP traffic flows, so they turned to IP overlay networks consisting of edge routers surrounding core ATM switches for the switch-fabric infrastructure. The router tables were configured offline, using permanent virtual circuits to establish point-to-point links between edge routers. The ATM core switches were generally used as a high-speed switch-fabric transport without regard to ATM class-of-service capabilities.
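To see why metric tweaking is such a blunt instrument, consider a minimal C sketch with a purely hypothetical two-path topology: every flow follows the least-cost path, so raising one link's metric to relieve congestion can swing the entire load onto another path at once.

    /* Minimal sketch (hypothetical topology): two candidate paths from A to B.
     * Under metric-based control, every flow follows the least-cost path, so
     * nudging a single link metric can shift all traffic at once. */
    #include <stdio.h>

    int main(void)
    {
        int path1[] = { 10, 10 };      /* A-C-B: two links of metric 10 */
        int path2[] = { 15, 10 };      /* A-D-B: metrics 15 and 10      */
        int cost1 = path1[0] + path1[1];
        int cost2 = path2[0] + path2[1];

        printf("all flows use path %d\n", cost1 <= cost2 ? 1 : 2);

        /* "Fix" congestion on A-C by raising its metric ... */
        path1[0] = 30;
        cost1 = path1[0] + path1[1];

        /* ... and the entire load swings onto the other path instead. */
        printf("all flows now use path %d\n", cost1 <= cost2 ? 1 : 2);
        return 0;
    }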
Overlay model
Before the advent of today's high-performance Internet backbone routers, the overlay model served ISPs well by providing scalable, reliable, deterministic performance for their networks. Today, however, the decision to use overlay networks is costly for several reasons.
The TE overhead of running two different networks, one for IP routing and one for ATM switching, is too large when a mega IP switch could do both. Moreover, the availability and price of ATM interfaces continue to lag behind packet-over-Sonet optical interfaces. The reasons: using ATM cells to transport IP data carries roughly a 20 percent overhead tax, and a fully meshed ATM switch fabric built on permanent virtual circuits exhibits an N-squared interconnect problem.
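As a rough, illustrative C sketch (the figures are the nominal ones cited above, not measurements from any particular network), the two penalties can be quantified like this:

    /* Back-of-the-envelope sketch of the two costs cited above: the ATM
     * "cell tax" on IP traffic and the number of permanent virtual circuits
     * needed to fully mesh N edge routers. */
    #include <stdio.h>

    int main(void)
    {
        /* Each 53-byte ATM cell carries a 5-byte header; AAL5 trailers and
         * padding push the total overhead to roughly 20 percent. */
        double header_tax = 5.0 / 53.0;   /* ~9.4% from cell headers alone */
        printf("cell-header overhead: %.1f%%\n", header_tax * 100.0);

        /* A full PVC mesh among N edge routers needs N*(N-1)/2 circuits,
         * which grows as N squared. */
        for (int n = 10; n <= 100; n += 30)
            printf("%3d edge routers -> %5d PVCs\n", n, n * (n - 1) / 2);
        return 0;
    }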
The newest IP switches overcome these Layer 3 routing problems using emerging and proprietary versions of several routing protocols implemented in hardware and software. These include Multiprotocol Label Switching (MPLS) for forwarding; an interior gateway protocol (IGP) such as Open Shortest Path First (OSPF) for distributing routing information and selecting paths; and the Resource Reservation Protocol (RSVP) for end-to-end signaling and provisioning.
Mega-switches now must overcome the TE Layer 3 routing dilemma in an extremely scalable platform and must provide multiservice TE capabilities at all layers. Those new application service requirements significantly increase the complexity of the TE problem and drive the need for network processing of Layers 4 through 7 in order to fit the Layer 3 TE routing model.
What functions are required in a switch to provide all-layer processing at higher speeds? Inside the switch, bandwidth conservation, or pipe management, is the rule: Ideally, what goes in must come out. In the days when all packets were equal, only Layer 2 pipe management was required. But quality of service for applications is a Layer N pipe-management problem. Network processing reduces the variability of higher-layer protocol processing into a deterministic Layer 3 routing problem.
The functions that make up network processing exist between the physical layer (PHY) and the switch fabric. The PHY converts electrical or photonic signals on the transmission media into bits. The switching function takes data, after network processing, from ingress (incoming) nodes and switches it, via the switch fabric, to the proper egress (output) nodes based on the processed criteria of the ingress data. The criteria for switching can be based on any layer or multilayer information (multifield), hence the term Layer N switches. Therefore, network processing generally is performed at Layers 2 through 7.
It is the responsibility of a network-processing implementation to interpret the unprocessed bit stream from the PHY; classify, modify and shape the packet traffic based on the desired criteria; and send it to the switch fabric for switching to the egress nodes. The network-processing functions are framing, verification, classification, modification, encryption, compression and traffic shaping.
The framing function takes an arbitrary stream of bits and bytes and converts them into a logical grouping of bits commonly referred to as packets. Verification answers the question, "Is this a legitimate, uncorrupted packet?" and is easily performed in hardware.
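As an example of the verification step, here is a minimal C sketch of one common check, the IPv4 header checksum of RFC 791 (the function name and layout are illustrative assumptions, not any particular product's interface):

    /* A minimal verification sketch, assuming IPv4: the one's-complement
     * header checksum. A failing check means a corrupted header; the packet
     * is dropped before any further processing. */
    #include <stdint.h>
    #include <stddef.h>

    int ipv4_header_ok(const uint8_t *hdr, size_t hdr_len_bytes)
    {
        uint32_t sum = 0;

        /* Sum the header as 16-bit big-endian words, checksum field included. */
        for (size_t i = 0; i + 1 < hdr_len_bytes; i += 2)
            sum += (uint32_t)(hdr[i] << 8 | hdr[i + 1]);

        /* Fold the carries back in (one's-complement addition). */
        while (sum >> 16)
            sum = (sum & 0xFFFF) + (sum >> 16);

        return (uint16_t)~sum == 0;    /* a valid header sums to 0xFFFF */
    }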
Classification answers the question, "What is this packet?" Classification should be performed at all layers, identifying protocols, priorities and data patterns within the packet. The most general case of packet classification for arbitrary bit patterns is an extremely difficult problem requiring an enormous amount of network-processing bandwidth.
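A much-simplified software sketch of multifield classification, assuming a small linear rule table keyed on the IP/TCP 5-tuple (all names here are hypothetical), shows why the step is so cycle-hungry: every relevant field of every packet must be compared against the rule set, and real switches hold thousands of rules.

    #include <stdint.h>

    struct five_tuple {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint8_t  protocol;
    };

    struct rule {
        struct five_tuple match;   /* values to compare                 */
        struct five_tuple mask;    /* 1-bits select which fields matter */
        int    action;             /* e.g. queue, priority or drop code */
    };

    /* Return the action of the first matching rule, or -1 for "no match". */
    int classify(const struct five_tuple *pkt,
                 const struct rule *rules, int nrules)
    {
        for (int i = 0; i < nrules; i++) {
            const struct rule *r = &rules[i];
            if ((pkt->src_ip   & r->mask.src_ip)   == r->match.src_ip   &&
                (pkt->dst_ip   & r->mask.dst_ip)   == r->match.dst_ip   &&
                (pkt->src_port & r->mask.src_port) == r->match.src_port &&
                (pkt->dst_port & r->mask.dst_port) == r->match.dst_port &&
                (pkt->protocol & r->mask.protocol) == r->match.protocol)
                return r->action;
        }
        return -1;
    }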
Modification acts on the packet based on the classification results. Compression answers the question, "Can this data be represented in a more-compact fashion?" Encryption answers the question, "Can this data be represented in a more secure fashion?" Encryption and compression are often used together to offset the packet expansion that encryption causes.
Traffic shaping
The last function is traffic shaping, which answers the question, "How can this data be sent in a way that best utilizes the available resources of the media, whether on the egress node or into a switch fabric?" Verification and classification are primarily application-independent receive functions performed at an ingress port. Modification, encryption, compression and traffic shaping are application-dependent and commonly performed at the transmit or egress nodes.
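One widely used shaping discipline is the token bucket. The following C sketch is purely illustrative (not a description of any specific switch's shaper); it shows the basic bookkeeping: credit accrues at the provisioned rate, and a packet is released only when enough credit is available, which smooths bursts onto the egress link or fabric.

    #include <stdint.h>

    struct shaper {
        double tokens;        /* current credit, in bytes       */
        double rate_Bps;      /* provisioned rate, bytes/second */
        double burst_bytes;   /* bucket depth                   */
        double last_time;     /* time of last update, seconds   */
    };

    /* Returns 1 if the packet may be sent now, 0 if it must wait. */
    int shaper_try_send(struct shaper *s, double now, uint32_t pkt_bytes)
    {
        /* Accrue credit for the time elapsed, capped at the bucket depth. */
        s->tokens += (now - s->last_time) * s->rate_Bps;
        if (s->tokens > s->burst_bytes)
            s->tokens = s->burst_bytes;
        s->last_time = now;

        if (s->tokens >= pkt_bytes) {
            s->tokens -= pkt_bytes;
            return 1;
        }
        return 0;
    }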
At speeds of 155 Mbits/s and less, most network-processing functions are integrated into a single ASIC or application-specific standard part (ASSP). Today, processing functions at higher rates of OC12 and above generally require multiple ASIC and ASSP chips. Over time these functions will integrate as Moore's Law (processing performance doubles roughly every 18 months) catches up to the next higher data rate.
However, Internet bandwidth is doubling every four months. Therefore, Moore's Law never catches up to Internet switching requirements at the highest data rates unless a new paradigm for network processing is found.
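A quick C sketch using those two doubling periods shows how fast the gap opens (the periods are the rough figures cited above, not precise forecasts):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        for (int months = 0; months <= 36; months += 12) {
            double cpu     = pow(2.0, months / 18.0);  /* Moore's Law growth */
            double traffic = pow(2.0, months / 4.0);   /* bandwidth growth   */
            printf("after %2d months: processing x%.1f, traffic x%.0f\n",
                   months, cpu, traffic);
        }
        return 0;   /* after three years: processing x4, traffic x512 */
    }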
Since all network services require a mix of network-processing functions, it is worth understanding which functions consume most of the overall network-processing bandwidth.
Differing applications such as firewalls, VPNs, VoIP gateways, Layer N switches and IPsec adapter cards have one thing in common: The majority of the network-processing bandwidth is associated with classifying the data packet. Until classification is complete, the network application does not know how to treat the packet according to the required services. Therefore, whether network-processing functions are done in hardware or software, the majority of the processing cycles are required for classification.
Future switches will require powerful classification engine technology to efficiently solve the traffic-management problem at all layers.
Specialized memory chips called content-addressable memories (CAMs) can be used to perform packet classification. Such a memory holds a large data table that can be searched rapidly, and its contents represent the bit patterns of data to be matched.
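The following C sketch mimics in software what a CAM does in a single hardware cycle (the entry layout and names are assumptions for illustration; a real CAM or ternary CAM compares every entry in parallel rather than in a loop):

    #include <stdint.h>

    struct cam_entry {
        uint32_t pattern;   /* stored bit pattern              */
        uint32_t mask;      /* 1-bits are significant ("care") */
    };

    /* Present a search key; return the index of the first matching entry,
     * or -1 if no entry matches. */
    int cam_search(const struct cam_entry *table, int nentries, uint32_t key)
    {
        for (int i = 0; i < nentries; i++)   /* done in parallel in silicon */
            if ((key & table[i].mask) == (table[i].pattern & table[i].mask))
                return i;
        return -1;
    }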
For distributed processing to succeed, an individual line card must provide enough processing bandwidth for the highest ingress data rate.
There are several methods to scale processing up to OC192 rates, including extracting more work from each cycle through pipelining, adding more processing cycles overall through parallelism, and processing more bits per second by increasing the bit-symbol size handled in each processing cycle.
To process data at 10 Gbits/s, several functions are required. First is the ability to accept that much data into the input FIFOs. The data is accepted into the input buffers 64 bits at a time, at 166 MHz. Pipelining maximizes the forwarding memory bandwidth by using multiple processing engines to access a common memory during unused cycles.
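A one-line sanity check in C confirms that a 64-bit path clocked at 166 MHz keeps up with an OC192 stream (illustrative arithmetic only):

    #include <stdio.h>

    int main(void)
    {
        double width_bits = 64.0;
        double clock_hz   = 166e6;
        printf("input bandwidth: %.1f Gbit/s\n", width_bits * clock_hz / 1e9);
        return 0;   /* prints roughly 10.6 Gbit/s */
    }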
Processing clusters
Data is then transferred from the input buffers to multiple engines in a statistically inverse-multiplexed fashion. The second technique is to add more memory bandwidth by using multiple clusters of processing engines and memory. Each engine/memory cluster provides a certain amount of processing bandwidth. By adding clusters, more memory bandwidth is available for processing.
The third method for reaching OC192 speeds uses processing gain to achieve bandwidth improvements of two to 10 times. With no processing gain, 10 engines at 1 Gbit/s each are required to process the data. Processing gain reduces the number of engines and the amount of forwarding memory required to process data in real time.
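The engine count falls directly with the gain, as this small illustrative C calculation shows (the 10-Gbit/s line rate and the 1-Gbit/s-per-engine figure are the numbers above):

    #include <stdio.h>

    int main(void)
    {
        double line_rate_gbps   = 10.0;
        double engine_rate_gbps = 1.0;
        double gains[] = { 1.0, 2.0, 5.0, 10.0 };

        for (int i = 0; i < 4; i++) {
            double engines = line_rate_gbps / (engine_rate_gbps * gains[i]);
            printf("gain %4.1fx -> %4.1f engines\n", gains[i], engines);
        }
        return 0;   /* 10 engines with no gain, 1 engine at 10x gain */
    }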
The new architecture uses classification engines as ingress processors and network processors for egress processing. A typical distributed switch architecture contains a host CPU administering system and control functions to multiple line cards interfaced to a switch fabric.
The policies are compiled in the control plane with background host processing and loaded into the forwarding tables on the line cards during unused access periods. The line card's pre- and post-processors must process the data at wire speed. This minimizes the host CPU's involvement with the data: the host CPU performs all control functions in the background, while all data processing occurs in real time on the line cards.
There are several implementation advantages to a pre-/post-processor line-card architecture. For example, each card acts as a wire-speed programmable network interface card that is adaptable both to short-term policy changes and to longer-term emerging standards and services. Also, each card can execute independent policies, and more cards can be added to attain more network throughput, distributing the processing load among all the cards; more cards mean more processing power. Essentially, the distributed architecture allows for a scalable approach to switching and routing.
Taking a closer look at an individual line card, assume the medium is an OC192 fiber link from another OC192 node. The fiber is terminated on the line card into a fiber module that converts the pulses of light into serial electrical signals in the form of bits.
A serializer/deserializer (serdes) function converts the serial bit stream into a parallel bit stream. The data at the deserializer's input arrives at 10 Gbits/s; its output is 64 bits wide at 166 MHz. For packet-over-Sonet applications, the next chip is a POS framer. This chip groups the bits into packets for pre- and post-processing. Next, a preprocessor with multiple engines talking to dedicated SRAM performs classification. The combined stream from the POS-PHY framing chip is divided, or distributed, among the two classification clusters.