Emerging high speed network technologies, such as
ATM, SCI and Fibre Channel, together with high performance workstations,
are making it feasible to build high performance computing systems
with performance equivalent to commercial MPPs.
However, in order to exploit the full potential
of such a system, issues such as fast message passing protocols and
shared memory support must be investigated in more depth. This
work presents a new parallel system based on stand-alone workstations
and high speed interconnection networks, with support for efficient
message passing and shared memory.
Several prototypes of this architecture are being,
or will be, implemented, based on interconnection systems composed
of network adapters and switches following the ATM and SCI standards.
Prototypes with commercial fast Ethernet, ATM and Myrinet are
also being implemented.
1 - Introduction
Emerging technologies for the interconnection of
computers through high speed networks, with low latency and high
bandwidth, are enabling the development of high
performance processing systems based on LANs of high performance
workstations. These systems, called NOWs or COWs, are currently a hot
research topic, both in academia and in industry.
The bandwidth and latency of their interconnection systems are comparable
to those exhibited by several MPPs, especially once we take into account
the software overheads associated with the network protocol stacks
of most MPP operating systems.
In order to make such systems more usable, especially
for non-scientific programming, it is necessary to
provide a more adequate programming environment. Several research
efforts have shown that the shared variable model is more convenient
both for the user and for the construction of development tools (compilers).
This paper describes SPADE-2, a proposal for an architecture
based on standard workstations and communication protocols, with
support for efficient message passing and shared memory programming.
Following similar research lines, this work focuses
on user-level network communication protocols and remote memory
access. In addition, it is oriented toward the implementation
of the CC-NUMA and COMA models.
First, we give a general description of the architecture;
then we discuss issues such as fast barrier synchronization and
software distributed shared memory implementation. After that,
we discuss implementation issues, fast interconnection network
implementations, intended applications and the current status.
Finally, we give a brief summary and outline future work.
2 - General Description
SPADE-2 is built from conventional (single processor
or SMP) PCI bus workstations, to which a PCI network adapter
is attached, running standard operating systems (FreeBSD, Linux, Solaris,
Windows NT). These nodes are interconnected in a generic
topology by means of switches.
Using only standard commercial components, the system
provides the NORMA model, with user-level message communication
protocols and shared memory implemented in software by means
of a DSM employing a relaxed memory consistency model.
Using custom boards, we will support the other models
(NUMA, CC-NUMA, COMA). We also intend to emulate these shared
memory models through Myrinet boards.
In all models, but especially in the NUMA and COMA
models, it is desirable that the network and its protocol layers
provide reliable communication between nodes, through error correction
and automatic retransmission of packets. These features are not
essential to correctly implement these memory models, but they
significantly simplify the implementation and can make it more efficient.
In this context, it is also very important that the network provide
lossless flow control.
NORMA model
The emphasis is on implementing efficient message passing
through user-level communication protocols, such as U-Net, active
messages, and others. On top of these, we intend to implement legacy
protocols such as TCP/IP, message passing packages such as PVM
and MPI, and software DSM packages such as TreadMarks, Quarks and
Pulsar.
Within NORMA, the main goal is to achieve the best
communication performance (latency and throughput) for small messages.
NUMA model
Following MPPs such as the Cray T3E, interconnection
systems such as Dolphin SCI and VMIC reflective memory networks,
and academic systems such as AXON, we intend to implement remote
memory access.
Fixed-length messages, composed of an address, a command
and data, are exchanged by the nodes for data sharing. Examples
of commands are:
These transactions are implemented in hardware, without
intervention from any processor of the target node. As soon
as the target NIC receives the message, it tries to access the
bus in order to get the data.
CC-NUMA model
This model is an extension of the previous one, but
it keeps the most recently accessed data in a cache. The main problem
is monitoring the processor's memory accesses, given that
the interconnection board sits not on the memory bus but on the
I/O bus (PCI). To avoid redesigning the PCI bridge, we have,
among others, the following alternatives:
Both solutions are being evaluated by means of program-driven
simulations, to obtain cost/performance figures for several
sharing patterns and system parameters (hardware and software).
In order to allow interconnection with future SCI systems
by means of bridges, we are using the SCI cache coherence protocols
(typical set).
In order to exploit the multicasting facilities of
ATM switches, in the first version we adopted a directory scheme
(one bit per node) similar to the one employed by the Stanford
DASH, instead of the standard SCI doubly linked list scheme.
COMA model
In this last model, the memory on the interconnection
card acts as a tertiary cache. Following standard naming,
this memory corresponds to an "Attraction Memory".
This model is similar to the page replication/migration schemes
implemented in operating systems that keep sequential consistency;
the difference is basically the size of
the data block (a cache block instead of a page). The block
fetch and consistency maintenance protocols are based on the Stanford
COMA-F proposal; this approach was chosen because it does not
rely on a specific interconnection system.
3 - Efficient Barrier Synchronization
To implement fast barrier synchronization on networks
of workstations, it is necessary to:
In the case of parallel machines based on clusters connected
by ATM switches
4 - Software Distributed Shared Memory
Software DSM systems constitute a hot topic of research,
providing a shared memory abstraction over networks of workstations.
We developed a software DSM system, based on a relaxed
memory consistency model, to provide shared memory over conventional
NOWs or over SPADE-2 operating in the NORMA and NUMA models.
The performance of a DSM (Distributed Shared Memory)
system is determined by the traffic in the interconnection system.
The messages generated by a DSM system are basically of three
types: messages containing pages, synchronization messages and
consistency messages. The size of these messages is an important
factor in the performance of the system. Consistency messages
prevail in a network running a DSM system.
Messages containing pages typically carry 4096 bytes in most systems.
It is therefore necessary to determine the typical size of
synchronization and consistency messages. The PULSAR DSM system can use
updates to maintain consistency. In this case, it
uses techniques developed for Munin, such as multiple writer
protocols, to mitigate the effects of false sharing and to increase
parallelism. This technique implies the creation and transmission
of diffs (an encoding of the changes made to a page within a critical
section). These diffs are sent through the network to
maintain consistency, and they form the bulk of the consistency
messages. Our experiments with Pulsar running programs from the SPLASH
suite have shown that the average size of consistency messages
is smaller than 160 bytes.
Several works have shown that, for small messages,
the bandwidth of the interconnection system is not used
effectively, mainly due to software overheads; it is therefore
very important to optimize communication (bandwidth and latency)
for small messages. One of the topics to be investigated is the efficacy
of applying compression algorithms (for instance, Huffman coding)
to these messages.
5 - Implementation
We intend to implement several prototypes, some of
them based solely on commercial subsystems (computers, communication
boards, switches, etc.) and others based on in-house hardware.
What follows is a list of the prototypes under development.
SPADE-2/FastEthernet and SPADE-2/ATM
These prototypes are based on Intel Pentium PCI
boards, off-the-shelf network adapters and switches, and the FreeBSD/Linux
operating systems.
SPADE-2/Myrinet
This prototype uses the Myrinet interconnection system,
composed of PCI adapters, 8-port switches and copper 1.2+1.2 Gbit/s
links. The firmware has been modified in order to support efficient
message passing and shared memory.
SPADE-2/IN1
This prototype consists of PCI boards with the IN1
interconnection network. The system supports the NUMA model at the
hardware level, as well as the NORMA model. The on-board memory
present on the network interface card can be used both as the local
portion of the global shared memory and as buffer memory for message
passing.
SPADE-2/IN2
The previous system is based on small switches
(with a small number of ports) in order to simplify the design and
minimize costs. Another advantage is the natural support for
multicasting provided by the bus implementation. To build larger
systems, these switches must be interconnected in some way, for
instance as a fat-tree such as the one employed in the Thinking
Machines CM-5. The SPADE-2/IN2 is based on a larger switch
architecture, and its implementation lies closer to the ATM standard.
In addition, this system supports the CC-NUMA and COMA models along
with the others.
6 - Interconnection Network IN1
Network Adaptor:
Switch:
Packets:
7 - Interconnection Network IN2
The IN2 is an evolution of the IN1, with the
following characteristics:
8 - Applications
Besides scientific and engineering applications,
we intend to investigate the use of the described architectures
in other fields, such as distributed databases, Web servers, video
on demand servers and image rendering engines, mainly exploiting
the shared memory model implemented by SPADE-2.
9 - Related Work
10 - Current Status
We are porting operating systems, communication packages
and applications to a cluster of Intel Pentium boards, interconnected
by fast Ethernet and ATM switches.
The PULSAR system for FreeBSD and SunOS (Solaris
1.x) is complete (in its alpha version).
Currently, we are purchasing a Myrinet system and
modifying the board firmware in order to support more efficient
message passing, remote memory access and PULSAR.
The IN1 system is completely defined, and a version
based on ALTERA FPGAs is being developed.
The IN2 system is being defined through simulation
studies.
11 - Summary
In this work we presented a proposal for high performance system architectures using, as far as possible, standard hardware and software, while also offering the possibility of an effective implementation of shared memory as well as message passing models. Unlike other proposals, this one offers solutions that do not rely on exotic, closed technologies. Most of the features discussed here for the ATM switch system, such as credit-based flow control and reliable multicasting, are provided by the Washington University in Saint Louis Gigaswitch. Another related project is FORTH's Telegraphos, which is also developing ATM switches with credit-based flow control. In the future, however, we also intend to work with circuits and protocols fully compliant with the IEEE-ANSI SCI standard, mainly because of its 1 Gbyte/s per-link bandwidth.
Acknowledgments
This work was supported by FINEP, through grant n.
56.94.0260.00. We would like to thank Marcio Lobo Netto, Casimiro
A. Barreto, Jecel M. Assumpção Jr., Martha X. T.
Delgado, Mario D. Marino, Marcelo Cintra, Li Kuan, and Carlos
E. Valle, of LSI-EPUSP, for their valuable suggestions and text
revision work. We would also like to thank Prof. João A.
Zuffo, head of LSI-EPUSP, for his support of this project.