High Performance Communication

We are working to improve communication performance in both massively-parallel machines and workstation clusters, reducing overhead and latency while increasing the bandwidth delivered to the application programs. The efforts are most naturally divided across software messaging and hardware messaging approaches, though in many cases the functionality can be included in either.

Our work on hardware routers has developed a number of efficient routing algorithms for tightly-coupled multiprocessors. This work covers a range of hardware implementation, routing, and end-to-end architectural issues.

Our software messaging research has produced several high-performance messaging layers for both clusters of workstations and massively-parallel processors, and has characterized the overheads in existing messaging layers. Close examination of these problems reveals that hardware and software issues are closely intertwined, with small design decisions in the former often having a dramatic impact on the latter.

Background

To increase the utility of parallel computing we must increase its applicability beyond dense, regular numeric problems. Improvements in communication performance can dramatically increase the applicability of multicomputers to irregular numeric and symbolic applications. For such applications, it is much harder to mask communication latency with large amounts of local computation, making low-latency access to remote data essential. For example, sparse matrix and other irregular computational structures (which are of increasing interest because of their computational efficiency) are often extremely sensitive to network latency. Such applications will benefit significantly from further advances in routing network technology.

Our goal is to provide cheap, efficient communication in multiprocessor systems. Toward this goal, our work has focused on two areas: developing efficient software messaging layers and network interface hardware, and developing and evaluating new adaptive routing algorithms. In some cases these efforts come together, in particular in the development of routing algorithms which reduce the software (and thereby system) cost of features such as fault tolerance.

Messaging software

Despite significant improvements in network interfaces, and even software messaging layers, in most systems the software overhead of communication still dominates the hardware routing cost. Our software messaging research attempts to understand the reasons for these overheads and develop new techniques for minimizing them.

Illinois Fast Messages (FM)

To further investigate the software overheads, we are currently implementing high-performance messaging layers for both networks of workstations (NOWs) and massively-parallel processors (MPPs). We have designed a lean messaging interface - Illinois FM (Fast Messages) - which supports low-latency communication by providing an Active Messages-like interface. The FM interface consists of simple send and extract primitives for transmitting messages and processing pending messages.
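As an illustration, the send/extract pairing can be sketched as a local queue whose extract call runs the handler associated with each pending message. The names, signatures, and queue-based "transport" below are hypothetical stand-ins for exposition, not the actual FM API:

```c
#include <stddef.h>

/* Illustrative sketch of an Active Messages-style send/extract pair.
 * All names and the in-process queue are hypothetical; the real FM
 * layer moves packets over the network rather than a local buffer. */

typedef void (*fm_handler)(void *data, size_t len);

#define QUEUE_CAP 64

typedef struct {
    fm_handler handler;
    void *data;
    size_t len;
} fm_packet;

static fm_packet queue[QUEUE_CAP];
static int head = 0, tail = 0;

/* Enqueue a message; its handler runs when the receiver extracts. */
int fm_send(fm_handler h, void *data, size_t len) {
    int next = (tail + 1) % QUEUE_CAP;
    if (next == head) return -1;       /* queue full: caller retries */
    queue[tail].handler = h;
    queue[tail].data = data;
    queue[tail].len = len;
    tail = next;
    return 0;
}

/* Process all pending messages by invoking their handlers;
 * returns the number of messages handled. */
int fm_extract(void) {
    int n = 0;
    while (head != tail) {
        fm_packet *p = &queue[head];
        head = (head + 1) % QUEUE_CAP;
        p->handler(p->data, p->len);
        n++;
    }
    return n;
}

/* Sample handler: accumulate integer payloads. */
static int received = 0;
static void count_handler(void *data, size_t len) {
    (void)len;
    received += *(int *)data;
}
```

The key property this models is that message handling is driven by the receiver's calls to extract, so the messaging layer never needs interrupts or separate server threads.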

Our FM implementation on the Cray T3D makes use of the fetch-and-increment and atomic swap hardware to provide very low latency messaging primitives. The FM primitives provide an order of magnitude lower latency than Cray's PVM and achieve performance comparable to SHMEM get while providing a message-passing interface.
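A rough sketch of why hardware fetch-and-increment helps: concurrent senders can each claim a distinct receive-queue slot with a single atomic operation, avoiding locks entirely. C11 atomics stand in here for the T3D's dedicated hardware, and the slot count and names are illustrative:

```c
#include <stdatomic.h>

/* Sketch: hand out unique receive-queue slots without locks.
 * NSLOTS and the function name are hypothetical; on the T3D the
 * increment would be performed by dedicated hardware. */

#define NSLOTS 1024

static atomic_uint next_slot = 0;

/* Each caller atomically claims the next slot index; no two callers
 * ever receive the same value, so no mutual exclusion is needed. */
unsigned claim_slot(void) {
    return atomic_fetch_add(&next_slot, 1u) % NSLOTS;
}
```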

The FM implementation on workstation clusters is built on top of Myricom's fast and relatively inexpensive Myrinet network. This implementation provides a latency of less than 30 microseconds with a sustained bandwidth of 14 MB/s. A future goal is to support the Concert parallel programming system, as well as other more coarse-grained programming models, on top of this FM implementation.

Analysis of software messaging overhead

To identify the various causes of software overhead, both in terms of the communication functionality required by applications and the software cost that network hardware features incur, we have conducted a detailed analysis of the overheads in the CM-5 Active Messages layer. The results show that 50-70% of the software cost of messaging can be attributed to providing end-to-end flow control, in-order delivery, and reliable transmission services. These overheads are a direct result of specific network features - arbitrary delivery order, finite buffering, and limited fault handling - and are unlikely to be eliminated through improved software implementations.

Network interfaces and routing

In highly parallel machines, a collection of computing nodes works in concert to solve large application problems. The nodes communicate data and coordinate their efforts by sending and receiving messages through a routing network. Consequently, the achieved performance of such machines depends critically on that of their routing networks.

Most existing multicomputers use deterministic wormhole routing due to its simplicity. Such deterministic routing algorithms do not make effective use of the network's physical channels because they assign only a single path to each source-destination pair. If the channels on that path are congested, the traffic between that source and destination is delayed, despite the presence of uncongested alternative paths.
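The single-path property of deterministic routing can be seen in a minimal sketch of dimension-order (X-then-Y) routing on a 2D mesh: the routing decision depends only on the current and destination coordinates, so every source-destination pair uses exactly one path. The coordinate and direction encodings below are illustrative:

```c
/* Sketch of dimension-order routing on a 2D mesh: fully correct the
 * X offset before touching Y.  Direction names are illustrative. */

typedef enum { XPLUS, XMINUS, YPLUS, YMINUS, ARRIVED } direction;

direction route_step(int cur_x, int cur_y, int dst_x, int dst_y) {
    if (cur_x < dst_x) return XPLUS;    /* correct X first */
    if (cur_x > dst_x) return XMINUS;
    if (cur_y < dst_y) return YPLUS;    /* then correct Y */
    if (cur_y > dst_y) return YMINUS;
    return ARRIVED;
}
```

Because the next hop is a pure function of the two coordinates, a congested channel on the unique path stalls all traffic for that pair, which is exactly the limitation adaptive routing addresses.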

Adaptive routing allows paths to be chosen dynamically, using network status information. Thus, it offers the potential for making better use of network resources. However, though adaptive routing increases routing freedom, potentially improving performance, it also increases the cost of preventing deadlock. This cost can reduce network clock speed, overwhelming the benefits of adaptive routing. The CSAG group has explored a variety of hardware routing schemes and novel parallel architectures, including Compressionless Routing, Planar-Adaptive Routing, and Dynamic Interconnection.

Compressionless Routing (CR)

Compressionless Routing (CR) is a new adaptive routing framework which supports both adaptive and fault-tolerant routing while eliminating much of the software overhead for buffer management and retransmission. The basic idea is to use the fine-grained flow control and backpressure of wormhole routing to communicate routing status and error conditions to network interfaces. The network interface uses the information to detect possible deadlock situations and network faults and recover from them, eliminating the need for costly network protocol layers.

Advantages of Compressionless Routing include deadlock-free adaptive routing with no virtual channels (in any topology), simple router implementations, end-to-end flow control in hardware, and order-preserving message transmission. Fault-Tolerant Compressionless Routing (FCR) extends CR to provide end-to-end fault-tolerant delivery with the following advantages: tolerance of transient faults while maintaining data integrity (nonstop fault tolerance), tolerance of permanent faults, applicability to a wide variety of network topologies, and elimination of software buffering for reliability.
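The sender-side decision rule at the heart of CR can be sketched roughly as follows: once enough flits have been accepted by the network to guarantee that the header has exited at the destination, the message is committed; a stall before that point signals possible deadlock or a fault, so the sender kills the worm and retransmits. The threshold arithmetic and stall signal below are simplified assumptions (the real scheme pads short messages and detects stalls with timeouts):

```c
/* Simplified sketch of the CR sender's commit/kill rule.  Wormhole
 * backpressure means each hop buffers at most a bounded number of
 * flits, so once path_len + padding flits have drained into the
 * network, the header must have reached the destination.  All names
 * and the one-flit-per-hop assumption are illustrative. */

typedef enum { CR_IN_FLIGHT, CR_COMMITTED, CR_KILL_AND_RETRY } cr_status;

/* flits_accepted: flits the network has drained from the interface;
 * path_len: hops to the destination; padding: extra flits appended
 * so short messages still span the path; stalled: injection timeout. */
cr_status cr_check(int flits_accepted, int path_len, int padding,
                   int stalled) {
    if (flits_accepted >= path_len + padding)
        return CR_COMMITTED;          /* header has exited the network */
    if (stalled)
        return CR_KILL_AND_RETRY;     /* possible deadlock or fault */
    return CR_IN_FLIGHT;
}
```

The point of the sketch is that the network interface alone, observing only backpressure, can decide between guaranteed delivery and retransmission, with no software protocol layer in the path.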

Planar-adaptive Routing (PAR)

Planar-adaptive routing (PAR) combines elements of dimension-order and adaptive routing to produce a network with limited adaptivity. By carefully structuring routing freedom, it is possible to dramatically reduce the requirements for deadlock-prevention. Planar-adaptive networks are provably deadlock-free for networks of arbitrary dimension with only two virtual channels for each physical channel. In addition, planar-adaptive routing can be extended to support in-order packet delivery and adaptive, unordered packet delivery simultaneously. The designated in-order traffic is delivered in sequence, while the other packets arrive in unspecified order.
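At a high level, the plane-by-plane discipline can be sketched as follows: in plane A_i (spanning dimensions i and i+1), the packet may adaptively route in either dimension while both have nonzero offsets, and once dimension i is fully corrected it advances to the next plane. Virtual-channel assignment and the deadlock-freedom argument are omitted, and the function name and encoding are illustrative:

```c
/* Sketch of planar-adaptive dimension selection on an n-dimensional
 * mesh.  offsets[d] is the remaining signed distance in dimension d.
 * The current plane A_i is identified by the lowest uncorrected
 * dimension i; within it, dimension i+1 is a legal adaptive choice
 * whenever it too has a nonzero offset.  Names are illustrative. */

/* Returns the dimension to route in next, or -1 if arrived. */
int par_next_dim(const int *offsets, int ndims, int prefer_second) {
    for (int i = 0; i < ndims; i++) {
        if (offsets[i] != 0) {
            /* In plane A_i: adaptivity between dims i and i+1. */
            if (prefer_second && i + 1 < ndims && offsets[i + 1] != 0)
                return i + 1;
            return i;
        }
    }
    return -1;  /* all offsets zero: destination reached */
}
```

The limited, structured choice (two dimensions at a time rather than all n) is what keeps the deadlock-prevention cost down to two virtual channels per physical channel.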

Dynamic Interconnection (DI) Multicomputer

Multicomputers use a parallel interconnect to link a number of uniprocessor systems, forming a parallel machine. However, attaching that network to the tightly coupled and highly-tuned local memory hierarchy can have a major impact on performance. We have designed and are evaluating several different alternatives for attaching the network: to the processor, between the cache and memory, and to both via a novel scheme called Dynamic Interconnection. The basic idea is to explore the integration and unification of the parallel interconnect and the local memory hierarchy. Dynamic Interconnection may not only lead to more flexible parallel machines, it may also lead to a new style of modular hardware design, based on packet interfaces.



Last updated 3 November 1995
