SPADE-2: A Scalable Distributed Shared Memory Parallel Architecture

Sergio Takeo Kofuji

University of São Paulo, Brazil

kofuji@lsi.usp.br, http://www.lsi.usp.br/~dsd/hpcac/spade

Sept 1, 1996

extended abstract

Emerging high-speed network technologies, such as ATM, SCI and Fibre Channel, together with high-performance workstations, are making it feasible to build high-performance computing systems with performance comparable to that of commercial MPPs.

However, in order to exploit the full potential of such systems, issues such as fast message passing protocols and shared memory support must be investigated in more depth. This work presents a new parallel system based on stand-alone workstations and high-speed interconnection networks, with support for efficient message passing and shared memory.

Several prototypes of this architecture are being, or will be, implemented, based on interconnection systems composed of network adaptors and switches following the ATM and SCI standards. Prototypes using commercial Fast Ethernet, ATM and Myrinet are also being implemented.

1 - Introduction

Emerging technologies for the interconnection of computers through high-speed networks, with low latency and high bandwidth, are enabling the development of high-performance processing systems based on LANs of high-performance workstations. These systems, called NOWs (networks of workstations) or COWs (clusters of workstations), are currently a hot research topic, both in academia and in industry.

The bandwidth and latency of these interconnection systems are comparable to those exhibited by several MPPs, especially if we take into account the software overheads associated with the network protocol stacks of most MPP operating systems.

In order to make such systems more usable, especially when we consider non-scientific programming, it is necessary to provide a more adequate programming environment. Several studies have shown that the shared-variable model is more comfortable both for the user and for the construction of development tools (compilers).

This paper describes SPADE-2, a proposed architecture based on standard workstations and communication protocols, with support for efficient message passing and shared memory programming.

Following similar research lines, this work focuses on user-level network communication protocols and remote memory access. In addition, it is oriented towards the implementation of the CC-NUMA and COMA models.

First, we give a general description of the architecture; then we discuss issues such as fast barrier synchronization and software distributed shared memory. After that, we discuss implementation issues, fast interconnection network implementations, intended applications and the current status. Finally, we give a brief summary and outline intended future work.

2 - General Description

SPADE-2 is built from conventional (single-processor or SMP) PCI-bus workstations, each fitted with a PCI network adaptor and running a standard operating system (FreeBSD, Linux, Solaris, Windows NT). These nodes are interconnected in a generic topology by means of switches.

Using only standard commercial systems, the idea is to provide only the NORMA (NO Remote Memory Access) model, with user-level message communication protocols and shared memory implemented in software by means of a DSM employing a relaxed memory consistency model.

Using custom boards, we will support the other models: NUMA, CC-NUMA and COMA. In addition, we intend to emulate these shared memory models with Myrinet boards.

In all models, but especially in the NUMA and COMA models, it is desirable that the network and its protocol layers provide reliable communication between nodes, through error correction and automatic retransmission of packets. These features are not essential to correctly implement these memory models, but they significantly simplify the implementation and can make it more efficient. In this context, it is also very important that the network provide lossless (no packet loss) flow control.
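To make the cost of doing this in software concrete, the sketch below shows a minimal stop-and-wait retransmission scheme; send_raw(), wait_ack() and the timeout value are illustrative assumptions, not part of any SPADE-2 interface. Every shared memory transaction would pay this bookkeeping on a lossy network, which is exactly what hardware-level reliability and lossless flow control avoid.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical unreliable primitives provided by a NIC driver. */
    extern void send_raw(int dest, uint32_t seq, const void *buf, int len);
    extern bool wait_ack(uint32_t seq, int timeout_ms);  /* true if ACK arrived */

    /* Stop-and-wait reliable send: retransmit until the ACK for this
       sequence number arrives. */
    static void reliable_send(int dest, uint32_t seq, const void *buf, int len)
    {
        do {
            send_raw(dest, seq, buf, len);
        } while (!wait_ack(seq, /*timeout_ms=*/5));
    }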

NORMA model

The emphasis is on implementing efficient message passing through user-level communication protocols, such as U-Net, Active Messages and others. On top of this layer, we intend to implement legacy protocols such as TCP/IP, message passing packages such as PVM and MPI, and software DSM packages such as TreadMarks, Quarks and Pulsar.

Within NORMA, the main goal is to achieve the best communication performance (latency and throughput) for small messages.
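As an illustration of what "user level" means here, the sketch below composes a small message directly in NIC memory, in the style of Active Messages; the ring layout and the names am_send_ring, am_send_head, etc. are assumptions made for the example, not an actual SPADE-2 or U-Net API.

    #include <stdint.h>
    #include <string.h>

    #define AM_PAYLOAD_MAX 32   /* small fixed payload: the latency-critical case */
    #define AM_RING_SLOTS  64

    struct am_packet {
        uint32_t dest;                    /* destination node id */
        uint32_t handler;                 /* handler index to invoke on arrival */
        uint8_t  payload[AM_PAYLOAD_MAX];
    };

    /* Send ring mapped into user space by the (hypothetical) NIC. */
    extern volatile struct am_packet *am_send_ring;
    extern volatile uint32_t         *am_send_head;   /* doorbell register */

    /* User-level send: the packet is composed directly in NIC memory and the
       doorbell rung, with no system call or kernel buffering on the path. */
    static void am_send(uint32_t dest, uint32_t handler,
                        const void *buf, size_t len)
    {
        volatile struct am_packet *p =
            &am_send_ring[*am_send_head % AM_RING_SLOTS];
        p->dest    = dest;
        p->handler = handler;
        memcpy((void *)p->payload, buf,
               len > AM_PAYLOAD_MAX ? AM_PAYLOAD_MAX : len);
        (*am_send_head)++;                /* ring the doorbell */
    }

Avoiding the kernel on the send and receive paths is what makes the small-message latency goal attainable at all; the protocol stack traversal otherwise dominates the cost.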

NUMA model

Like MPPs such as the Cray T3E, interconnection systems such as Dolphin SCI and VMIC reflective memory networks, and academic systems such as AXON, we intend to implement remote memory access.

Fixed-length messages, composed of an address, a command and data, are exchanged by the nodes for data sharing. Examples of such commands are remote read and remote write requests and their corresponding replies.

These transactions are implemented in hardware, without the intervention of any processor on the targeted node. As soon as the target NIC receives the message, it accesses the bus in order to fetch the requested data.
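A minimal sketch of such a fixed-length transaction as a C structure follows; the field widths and the command set are illustrative assumptions, not the actual IN1 packet layout.

    #include <stdint.h>

    /* Illustrative command set for remote memory access. */
    enum rm_cmd { RM_READ_REQ, RM_READ_REPLY, RM_WRITE_REQ, RM_WRITE_ACK };

    /* Fixed-length remote-memory transaction: address + command + data. */
    struct rm_message {
        uint16_t dest_node;   /* target node id */
        uint16_t command;     /* one of enum rm_cmd */
        uint64_t address;     /* location in the target node's memory */
        uint64_t data;        /* one word of payload */
    };

On receiving RM_READ_REQ, the target NIC performs the bus access itself and answers with RM_READ_REPLY; the host processor is never interrupted.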

CC-NUMA model

This model is an extension of the previous one, but keeps the most recently accessed data in a cache. The main problem is the monitoring of the processor's memory accesses, given that the interconnection board is not on the memory bus but on the I/O bus (PCI). To avoid redesigning the PCI bridge, we have, among others, the following alternatives:

Both solutions are being evaluated by means of program-driven simulations, in order to obtain cost/performance figures for several sharing patterns and system parameters (hardware and software). In order to allow interconnection with future SCI systems by means of bridges, we are using the SCI cache coherence protocols (typical set).

In order to exploit the multicast facilities of ATM switches, in the first version we adopted a directory scheme (one bit per node), similar to the one employed by the Stanford DASH, instead of the standard SCI doubly-linked list scheme.
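A sketch of such a full-bit-vector directory entry is given below; the sizes are illustrative assumptions (at most 32 nodes), not the actual SPADE-2 layout.

    #include <stdint.h>

    enum line_state { DIR_UNCACHED, DIR_SHARED, DIR_DIRTY };

    /* DASH-style directory entry: one presence bit per node. */
    struct dir_entry {
        uint32_t sharers;   /* bit i set => node i holds a copy */
        uint8_t  state;     /* one of enum line_state */
    };

    /* On a write, invalidations go to all sharers except the writer; with
       a bit vector they can be issued as a single multicast, which is what
       lets us exploit the ATM switches' multicast support. */
    static uint32_t invalidation_targets(const struct dir_entry *e, int writer)
    {
        return e->sharers & ~(1u << writer);
    }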

COMA model

In this last model, the memory on the interconnection card corresponds to a tertiary cache. Following standard naming, this memory is an "Attraction Memory". The model is similar to operating systems that keep sequential consistency through replication/migration of pages; the difference is basically the size of the data block (a cache block instead of a page). The block fetch and consistency maintenance protocols are based on the Stanford COMA-F proposal; this approach was chosen because it does not rely on a specific interconnection system.
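As a concrete, purely illustrative picture of the attraction memory, the sketch below treats the card memory as a direct-mapped cache of cache-block-sized units; the block size, table size and state encoding are assumptions for the example.

    #include <stdbool.h>
    #include <stdint.h>

    #define AM_BLOCK   64         /* cache-block granularity, not a 4 KB page */
    #define AM_ENTRIES 65536

    enum am_state { AM_INVALID = 0, AM_SHARED, AM_EXCLUSIVE };

    struct am_block {
        uint64_t tag;             /* global block number held in this slot */
        uint8_t  state;           /* one of enum am_state */
        uint8_t  data[AM_BLOCK];
    };

    static struct am_block attraction_memory[AM_ENTRIES];

    /* On a miss, the block is fetched from whichever node currently holds
       it (located by the COMA-F protocol) and "attracted" into this slot. */
    static bool am_lookup(uint64_t addr, struct am_block **out)
    {
        struct am_block *b = &attraction_memory[(addr / AM_BLOCK) % AM_ENTRIES];
        *out = b;
        return b->state != AM_INVALID && b->tag == addr / AM_BLOCK;
    }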

3 - Efficient Barrier Synchronization

To implement fast barrier synchronization on networks of workstations, it is necessary to:

In the case of parallel machines based on clusters connected by ATM switches
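Whatever support the interconnect offers, the logical structure of a fast software barrier is well known; below is a minimal sketch of a dissemination barrier, assuming hypothetical user-level msg_send()/msg_recv() primitives (not part of SPADE-2's actual interface). It completes in ceil(log2 P) rounds of small messages, which is one more reason why low small-message latency matters.

    /* Hypothetical user-level message primitives. */
    extern void msg_send(int dest, int tag, const void *buf, int len);
    extern void msg_recv(int src, int tag, void *buf, int len);

    /* Dissemination barrier: in round k, node i notifies node
       (i + 2^k) mod P and waits for node (i - 2^k) mod P; after
       ceil(log2 P) rounds every node has transitively heard from all. */
    static void dissemination_barrier(int me, int nprocs)
    {
        for (int dist = 1; dist < nprocs; dist <<= 1) {
            msg_send((me + dist) % nprocs, /*tag=*/dist, NULL, 0);
            msg_recv((me - dist + nprocs) % nprocs, /*tag=*/dist, NULL, 0);
        }
    }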

4 - Software Distributed Shared Memory

Software DSM systems constitute a hot research topic, providing a shared memory abstraction over networks of workstations. We have developed a software DSM system, based on a relaxed memory consistency model, to provide shared memory over conventional NOWs or over SPADE-2 operating in the NORMA and NUMA models.

The performance of a DSM (Distributed Shared Memory) system is determined by the traffic in the interconnection system. The messages generated by a DSM system are basically of three types: messages containing pages, synchronization messages and consistency messages. The size of these messages is an important factor in the performance of the system, and consistency messages dominate the traffic of a network running a DSM system.

Messages containing pages typically have a size of 4096 bytes in most systems. It is necessary to determine the typical size of synchronization and consistency messages. The PULSAR DSM system can use updates to maintain consistency. In this case, it uses techniques developed for Munin, such as multiple-writer protocols, to eliminate the effects of false sharing and to increase parallelism. This technique implies the creation and transmission of "diffs" (an encoding of the changes made to a page within a critical section). These diffs are sent through the network to maintain consistency, and constitute the bulk of the consistency messages. Our work with Pulsar running programs from the SPLASH suite has shown that the average size of consistency messages is smaller than 160 bytes.
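For concreteness, the sketch below shows the core of multiple-writer diff creation in the Munin/TreadMarks style: before the first write, a "twin" copy of the page is made; at release time the page is compared word by word against its twin and only the changed words are encoded. The (offset, value) encoding is an illustrative assumption, not Pulsar's actual format.

    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_WORDS (4096 / sizeof(uint32_t))

    struct diff_entry { uint16_t offset; uint32_t value; };

    /* Compare a dirty page against its twin and emit one entry per changed
       word; out[] must have room for PAGE_WORDS entries in the worst case.
       Returns the number of entries written. Applying the diff at a remote
       node is the symmetric loop. */
    static size_t make_diff(const uint32_t *page, const uint32_t *twin,
                            struct diff_entry *out)
    {
        size_t n = 0;
        for (size_t i = 0; i < PAGE_WORDS; i++)
            if (page[i] != twin[i]) {
                out[n].offset = (uint16_t)i;
                out[n].value  = page[i];
                n++;
            }
        return n;
    }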

Several works have shown that, for small messages, the bandwidth of the interconnection system is not effectively used, mainly due to software overheads; this makes it very important to optimize communication (bandwidth and latency) for small messages. One of the topics to be investigated is the efficacy of applying compression algorithms (for instance, Huffman coding) to the messages.
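A back-of-the-envelope example shows why (the overhead figure below is an assumption, not a measurement). With a per-message software overhead t_o and a link bandwidth B, an n-byte message sees an effective bandwidth of

    B_eff(n) = n / (t_o + n/B)

Taking t_o = 100 us and B = 155 Mbit/s (about 19.4 Mbyte/s), a 160-byte consistency message achieves roughly 160 bytes / (100 + 8.25) us, i.e. about 1.5 Mbyte/s, or less than 8% of the link bandwidth. Reducing per-message overhead, or shrinking the messages (e.g. by compression), therefore matters far more than raw link speed for consistency traffic.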

5 - Implementation

We intend to implement several prototypes, some of them based solely on commercial subsystems (computers, communication boards, switches, etc.) and others based on in-house hardware.

What follows is a list of the prototypes under development.

SPADE-2/FastEthernet and SPADE-2/ATM

These prototypes are based on Intel Pentium PCI boards, off-the-shelf network adaptors and switches, and the FreeBSD/Linux operating systems.

SPADE-2/Myrinet

This prototype uses the Myrinet interconnection system, composed of PCI adapters, 8-port switches and 1.2+1.2 Gbit/s copper links. The firmware is being modified in order to support efficient message passing and shared memory.

SPADE-2/IN1

This prototype consists of PCI boards implementing the IN1 interconnection network. The system supports the NUMA model at the hardware level, as well as the NORMA model. The on-board memory of the network interface card can be used both as the local portion of the global shared memory and as buffer memory for message passing.

SPADE-2/IN2

The previous system is based on small switches (with a small number of ports) in order to simplify the design and minimize costs. A further advantage is the natural support for multicasting provided by the bus-based implementation. To build larger systems, these switches must be interconnected in some fashion, for instance in a fat-tree such as the one employed in the Thinking Machines CM-5. The SPADE-2/IN2 is based on a larger switch architecture, and its implementation lies closer to the ATM standard. Besides that, this system supports the CC-NUMA and COMA models along with the others.

6 - Interconnection Network IN1

Network Adaptor:

Switch:

Packets:

7 - Interconnection Network IN2

The IN2 is an evolution of the IN1, presenting the following characteristics:

8 - Applications

Besides scientific and engineering applications, we intend to investigate the use of the described architectures in other fields, such as distributed databases, Web servers, video-on-demand servers, image rendering engines, etc., mainly exploiting the shared memory model implemented by SPADE-2.

9 - Related Work

10 - Current Status

We are porting operating systems, communication packages and applications to a cluster of Intel Pentium boards interconnected by Fast Ethernet and ATM switches.

The PULSAR system for FreeBSD and SunOS (Solaris 1.x) is complete, in its alpha version.

We are currently acquiring a Myrinet system and modifying its firmware in order to support more efficient message passing, remote memory access and PULSAR.

The IN1 system is completely defined, and a version based on ALTERA FPGAs is being developed.

The IN2 system is being defined through simulation studies.

11 - Summary

In this work we presented a proposal for high-performance system architectures using, as far as possible, standard hardware and software, while also offering the possibility of an effective implementation of the shared memory as well as the message passing models. Unlike other proposals, this one offers innovative solutions that do not rely on exotic, closed technologies. Most of the features discussed here for the ATM switch system, namely credit-based flow control, reliable multicasting, etc., are provided by the Gigaswitch of Washington University in Saint Louis. Another related project is FORTH's Telegraphos, which is also developing ATM switches based on credit flow control. In the future, however, we also intend to work with circuits and protocols fully compliant with the IEEE-ANSI SCI standard, mainly because of its 1 Gbyte/s per-link bandwidth.

Acknowledgments

This work was supported by FINEP, through grant no. 56.94.0260.00. We would like to thank Marcio Lobo Netto, Casimiro A. Barreto, Jecel M. Assumpção Jr., Martha X. T. Delgado, Mario D. Marino, Marcelo Cintra, Li Kuan, and Carlos E. Valle, of LSI-EPUSP, for their valuable suggestions and text revision work. We would also like to thank Prof. João A. Zuffo, head of LSI-EPUSP, for his support of this project.