The ParaStation Project: Using Workstations as Building Blocks for Parallel Computing
Thomas M. Warschko, Joachim M. Blum, and Walter F. Tichy
University of Karlsruhe, Dept. of Informatics
Am Fasanengarten 5, 76128 Karlsruhe, Germany
email: {warschko,blum,tichy}@ira.uka.de
URL: http://wwwipd.ira.uka.de/parastation

The ParaStation communication fabric provides a high-speed communication network for efficient parallel computing on workstation clusters. The architecture, implemented on off-the-shelf workstations coupled by the ParaStation communication hardware, removes the kernel and common network protocols from the communication path while still providing protection in a multiuser, multiprogramming environment. The programming interface presented by ParaStation consists of a UNIX socket emulation and widely used parallel programming environments such as PVM, p4, and MPI. This allows a wide range of client/server and parallel applications to be ported to the ParaStation architecture. Supported platforms include Digital's AlphaGeneration workstations running Digital Unix (OSF/1) and Intel-based PCs running Linux. Ports to other platforms (Linux on Digital Alpha machines and Windows NT on both Digital Alphas and Intel PCs) are in progress. On the pairwise exchange benchmark, ParaStation achieves a process-to-process communication latency of 2.5 µs and a sustained throughput of 10.5 Mbyte/s on Digital Alpha machines. The PC-based platform outperforms the Alpha platform, with a latency of 1.7 µs and a throughput of 15.5 Mbyte/s. Application benchmarks using ScaLAPACK (on top of PVM) demonstrate a real performance of 1 GFLOP/s on an 8-node 21064a (275 MHz) Alpha cluster.

The basic idea of the ParaStation architecture is to use a second high-speed interconnect in addition to the standard network used by the operating system on each workstation. This second network, the ParaStation network, is used exclusively by parallel applications.
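As a rough way to relate the latency and throughput figures quoted above (an interpretation added here, not a model taken from the ParaStation measurements), a common first-order model charges an n-byte message a fixed start-up latency t_0 plus a per-byte cost 1/B, where B is the sustained throughput. The message size n_{1/2} at which half of B is reached then follows directly. The numbers below assume 1 Mbyte = 10^6 bytes.

```latex
\begin{aligned}
t(n) &= t_0 + \frac{n}{B}, \qquad
\frac{n_{1/2}}{t(n_{1/2})} = \frac{B}{2} \;\Longrightarrow\; n_{1/2} = t_0 B \\
\text{Alpha: } n_{1/2} &= 2.5\,\mu\mathrm{s} \times 10.5\,\mathrm{Mbyte/s} = 26.25\ \text{bytes} \\
\text{PC: } n_{1/2} &= 1.7\,\mu\mathrm{s} \times 15.5\,\mathrm{Mbyte/s} = 26.35\ \text{bytes}
\end{aligned}
```

Under this simple model both platforms reach half of their sustained throughput at messages of about 26 bytes, and because the PC platform has both the lower latency and the higher throughput, it is faster for every message size. Real measurement curves need not follow the model exactly.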
This technique allows interfacing the network hardware at user level, implementing high-speed protocol software, and reducing interaction with the operating system to a minimum.

At the hardware level, the ParaStation network provides reliable data transmission (no packet loss), temporal decoupling by buffering, error detection, variable packet sizes (4 to 508 bytes), in-order delivery of packets, and packet-based flow control. These features support small and efficient protocol stacks. Fragmentation and defragmentation are simple, because data transmission is reliable and packets are delivered in order. Software checksums to detect data corruption and transmission errors are unnecessary, because the ParaStation network has a very low error rate (< 10^-11) and built-in error detection logic. No software mechanisms for flow control are needed, because the hardware already guarantees that no packets are lost due to buffer overflow or insufficient buffer space. In addition to the information used by the hardware protocol (destination ID and packet length), the software protocol provides a source ID, a source port, and a destination port as basic support for a multitasking, multiuser environment.

The main complexity of the ParaStation protocol stack lies in handling individual and multiple communication channels (called ports) between different processes. To ensure correct interaction between competing processes, several critical code regions within the protocol stack are protected by semaphores. To speed up protocol processing, these semaphores are also implemented at user level, using atomic operations supported by the processor. The problem of scheduling incoming packets to different applications is solved by a commonly accessible message pool that buffers messages not targeted at the current application. This approach enables true zero-copy protocols and requires only one buffering operation in the worst case.
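The fragmentation and header handling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not ParaStation code: the struct names `ps_sw_header` and `ps_packet`, the 16-bit field widths, and the choice to count the software header inside the 508-byte maximum are all hypothetical. It shows why reliable, in-order delivery keeps the stack small: fragmentation is plain slicing and reassembly is plain concatenation, with no sequence numbers, checksums, or retransmission buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Maximum packet size from the text; counting the software header
   inside it is an assumption of this sketch. */
#define PS_MAX_PACKET 508u

/* Hypothetical software header (field widths assumed). */
typedef struct {
    uint16_t src_id;
    uint16_t src_port;
    uint16_t dst_port;
} ps_sw_header;

#define PS_MAX_PAYLOAD (PS_MAX_PACKET - sizeof(ps_sw_header))

typedef struct {
    uint16_t dst_id;                 /* hardware-level: routing target */
    uint16_t length;                 /* hardware-level: payload bytes in this packet */
    ps_sw_header hdr;                /* software-level: source and ports */
    uint8_t payload[PS_MAX_PAYLOAD];
} ps_packet;

/* Fragmentation: slice the message into maximal chunks. No sequence
   numbers are added, because packets arrive reliably and in order.
   Returns the number of packets produced, or 0 if max_packets is too small. */
static size_t ps_fragment(const uint8_t *msg, size_t len, uint16_t dst_id,
                          ps_sw_header hdr, ps_packet *out, size_t max_packets) {
    size_t n = 0, off = 0;
    do {
        size_t chunk = len - off;
        if (chunk > PS_MAX_PAYLOAD) chunk = PS_MAX_PAYLOAD;
        if (n == max_packets) return 0;
        out[n].dst_id = dst_id;
        out[n].length = (uint16_t)chunk;
        out[n].hdr = hdr;
        if (chunk > 0) memcpy(out[n].payload, msg + off, chunk);
        off += chunk;
        n++;
    } while (off < len);
    return n;
}

/* Defragmentation: append payloads in arrival order. There is no
   checksum, reordering, or retransmission logic, since the hardware
   detects errors and never drops packets.
   Returns the reassembled length, or (size_t)-1 if cap is too small. */
static size_t ps_reassemble(const ps_packet *pkts, size_t n,
                            uint8_t *buf, size_t cap) {
    size_t off = 0;
    for (size_t i = 0; i < n; i++) {
        if (pkts[i].length > cap - off) return (size_t)-1;
        memcpy(buf + off, pkts[i].payload, pkts[i].length);
        off += pkts[i].length;
    }
    return off;
}
```

With these assumptions a 1200-byte message becomes three packets carrying 502, 502, and 196 payload bytes.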
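The user-level semaphores and the shared message pool can be illustrated together. In this hypothetical sketch, C11 `atomic_flag` stands in for the processor-supported atomic operations. The pool would live in memory shared by all processes on a node (assumed here; the sketch runs in a single process), and `pool_dispatch`, `pool_take`, the slot count, and the arrival stamps that keep per-port order are illustrative inventions. A message for the polling process's own port is handed over directly (the struct copy stands in for the transfer into the application buffer). Any other message is copied once into the pool, which is the single extra buffering operation of the worst case.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define POOL_SLOTS 16 /* hypothetical pool size */
#define MSG_MAX 64    /* hypothetical, kept small for the sketch */

typedef struct {
    uint16_t dst_port;
    uint16_t len;
    uint8_t data[MSG_MAX];
} msg_t;

/* Message pool, assumed to sit in shared memory. Each parked message gets
   an arrival stamp so per-port order survives (wraparound ignored). */
typedef struct {
    atomic_flag lock;
    uint32_t next_stamp;
    bool used[POOL_SLOTS];
    uint32_t stamp[POOL_SLOTS];
    msg_t slot[POOL_SLOTS];
} msg_pool;

/* User-level critical section: spin on an atomic test-and-set, no system call. */
static void pool_lock(msg_pool *p) {
    while (atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire))
        ; /* busy-wait */
}
static void pool_unlock(msg_pool *p) {
    atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

static void pool_init(msg_pool *p) {
    atomic_flag_clear(&p->lock);
    p->next_stamp = 0;
    for (int i = 0; i < POOL_SLOTS; i++) p->used[i] = false;
}

typedef enum { DELIVERED, PARKED, POOL_FULL } dispatch_result;

/* Called by whichever process drained `in` from the network. */
static dispatch_result pool_dispatch(msg_pool *p, uint16_t my_port,
                                     const msg_t *in, msg_t *mine) {
    if (in->dst_port == my_port) { *mine = *in; return DELIVERED; }
    dispatch_result r = POOL_FULL;
    pool_lock(p);
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (!p->used[i]) {            /* one copy into the pool */
            p->slot[i] = *in;
            p->stamp[i] = p->next_stamp++;
            p->used[i] = true;
            r = PARKED;
            break;
        }
    }
    pool_unlock(p);
    return r;
}

/* Owner side: take the oldest parked message for `port`, if any. */
static bool pool_take(msg_pool *p, uint16_t port, msg_t *out) {
    int best = -1;
    pool_lock(p);
    for (int i = 0; i < POOL_SLOTS; i++)
        if (p->used[i] && p->slot[i].dst_port == port &&
            (best < 0 || p->stamp[i] < p->stamp[best]))
            best = i;
    if (best >= 0) { *out = p->slot[best]; p->used[best] = false; }
    pool_unlock(p);
    return best >= 0;
}
```

A spin lock suits this setting because the critical regions are only a few instructions long, so busy-waiting is cheaper than a kernel-mediated semaphore.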
On top of this protocol layer (called ParaStation ports), software emulates higher-level communication interfaces such as Unix TCP and UDP sockets. The interface has the same semantics as the operating system interface; the only difference is a naming convention (ParaStation sockets add a PSS prefix to all system calls dealing with sockets). Thus, any socket-based application can be ported to the ParaStation platform by replacing the socket calls, recompiling, and relinking with the ParaStation library. As complex examples of this approach, we provide adapted versions of the PVM, p4, and MPI programming environments.
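The porting step can be sketched as below. The text states only that ParaStation socket calls carry a PSS prefix. The exact names (`PSSsocket`, `PSSbind`, `PSSgetsockname`, `PSSsendto`, `PSSrecvfrom`, `PSSclose`) are assumed from that convention, and their bodies are hypothetical stand-ins that forward to the operating system so the sketch runs anywhere; the real library would run over ParaStation ports. The `SOCK` macro and the `USE_PARASTATION` switch are likewise illustrative, showing one way to keep a single source tree for both interfaces.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-ins for the ParaStation socket library: same
   semantics as the OS calls, only the names carry the PSS prefix. */
static int PSSsocket(int domain, int type, int protocol) { return socket(domain, type, protocol); }
static int PSSbind(int fd, const struct sockaddr *a, socklen_t len) { return bind(fd, a, len); }
static int PSSgetsockname(int fd, struct sockaddr *a, socklen_t *len) { return getsockname(fd, a, len); }
static ssize_t PSSsendto(int fd, const void *buf, size_t n, int flags,
                         const struct sockaddr *to, socklen_t tolen) {
    return sendto(fd, buf, n, flags, to, tolen);
}
static ssize_t PSSrecvfrom(int fd, void *buf, size_t n, int flags,
                           struct sockaddr *from, socklen_t *fromlen) {
    return recvfrom(fd, buf, n, flags, from, fromlen);
}
static int PSSclose(int fd) { return close(fd); }

/* Compile-time switch: with USE_PARASTATION set, every socket call gets
   the PSS prefix; otherwise the plain operating-system call is used. */
#define USE_PARASTATION 1
#if USE_PARASTATION
#define SOCK(fn) PSS##fn
#else
#define SOCK(fn) fn
#endif

/* Application code: send one UDP datagram to itself over loopback and
   read it back. Returns the number of bytes received, or -1 on error. */
static ssize_t udp_self_echo(const char *text, char *out, size_t cap) {
    struct sockaddr_in addr;
    socklen_t alen = sizeof addr;
    ssize_t got = -1;
    int fd = SOCK(socket)(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) return -1;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the system choose a free port */
    if (SOCK(bind)(fd, (struct sockaddr *)&addr, sizeof addr) == 0 &&
        SOCK(getsockname)(fd, (struct sockaddr *)&addr, &alen) == 0 &&
        SOCK(sendto)(fd, text, strlen(text), 0, (struct sockaddr *)&addr, alen) >= 0) {
        got = SOCK(recvfrom)(fd, out, cap, 0, NULL, NULL);
    }
    SOCK(close)(fd);
    return got;
}
```

Because only the names differ, the application logic in `udp_self_echo` is identical for both settings of `USE_PARASTATION`. That is the property the socket emulation relies on.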