Communication Library Design

Supported in part by DOE grant DE-FG02-93ER25167.
Principal Investigators:
Philip K. McKinley
Lionel M. Ni
Department of Computer Science
Michigan State University

Project Description. The main goal of this project was to improve the programmability of scalable parallel architectures while offering performance comparable to that of programming with low-level interfaces. The approach proposed to achieve this goal was to develop and study a scalable library of communication routines for a variety of parallel platforms, including both massively parallel processors (MPPs) and networks of workstations (NOWs). These routines are collectively referred to as the ComPaSS Library. The ComPaSS project is intended to complement related projects in the field by investigating high-performance implementations of operations needed in both shared-memory and message-passing programming paradigms, while providing interfaces compatible with existing and emerging industry standards.

The ComPaSS library components are organized according to a hierarchy. The Low-Level Communication Services (LLCS) module contains operations that can be invoked either directly by message-passing applications or indirectly through calls generated by a compiler for a data parallel language. The High-Level Communication Services (HLCS) module offers an assortment of data manipulation and control operations to SPMD applications. These operations include both those related to data decomposition and realignment and those that implement application-specific functions, for example, linear algebra routines. Finally, the project studied how specific applications can benefit from ComPaSS services.

The hierarchical structure of ComPaSS supports both the message-passing programming paradigm and programming in data-parallel languages. For example, the interfaces to message-passing operations are consistent with other packages and with the emerging Message Passing Interface (MPI) standard, and the operations for data parallel programming are consistent with the needs of High Performance Fortran (HPF).

ComPaSS provides global communication operations for both data manipulation and process control, many of which are based upon a small set of low-level communication primitives. In particular, the one-to-many, or multicast, operation distributes a single value to multiple nodes, and the multireceive operation is used to read input from multiple nodes. In the ComPaSS library, these low-level operations are optimized for specific classes of commercial scalable parallel computers, such as the nCUBE-2 and IBM SP-1/2. Moreover, we have used these operations to construct efficient implementations of other collective communication operations, such as global reduction and all-to-all broadcast, for these architectures.
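The kind of low-level multicast described above is, on a hypercube machine such as the nCUBE-2, classically realized by a dimension-ordered (binomial-tree) scheme that covers 2^d nodes in d steps. The following sketch simulates only the message schedule of that well-known scheme; the function name and structure are illustrative assumptions, not the ComPaSS interface.

```python
# Hedged sketch: dimension-ordered (binomial-tree) broadcast schedule
# on a 2^d-node hypercube, the classic multicast scheme for machines
# like the nCUBE-2.  Only the send schedule is computed; no actual
# communication is performed.

def hypercube_broadcast_schedule(d, root=0):
    """Return a list of steps; each step is a list of (src, dst) sends.

    In step i, every node that already holds the data forwards it to
    its neighbor across dimension i (obtained by flipping bit i of the
    node address), so all 2**d nodes are covered after d steps.
    """
    have = {root}
    steps = []
    for i in range(d):
        sends = [(src, src ^ (1 << i)) for src in sorted(have)]
        steps.append(sends)
        have |= {dst for _, dst in sends}
    return steps

schedule = hypercube_broadcast_schedule(3)
covered = {0} | {dst for step in schedule for _, dst in step}
assert covered == set(range(8))   # all 8 nodes reached
assert len(schedule) == 3         # in log2(8) = 3 steps
```

Running the same schedule in reverse order yields a reduction tree, which is one reason a small set of such primitives suffices to build operations like global reduction and all-to-all broadcast.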

We have also developed efficient communication interfaces for workstation clusters, including those based on FDDI and Asynchronous Transfer Mode (ATM) interconnects. In the case of ATM networks, we have developed a set of collective operations that use an underlying virtual topology composed of ATM virtual multicast channels. We have analyzed, prototyped, and tested implementations on an ATM testbed, and have shown how several of ATM's salient features, such as full-duplex I/O, high bit rate, concurrent sends, and hardware multicast, are very useful in reducing the turnaround time of such operations.

At the higher level, we have developed an efficient data redistribution library for HPF, which has been implemented on both the IBM SP-1/2 and FDDI-based workstation clusters. We have proposed efficient data distribution and alignment algorithms which can help programmers and compilers to achieve better load balancing and to reduce interprocessor communication. In addition, we have used the ComPaSS library routines to improve the performance of parallel applications, namely, parallel eigenvalue and singular value algorithms, and parallel data clustering and image segmentation algorithms.
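A representative redistribution problem of the kind such a library must handle is moving a one-dimensional array from a BLOCK to a CYCLIC distribution, as HPF permits. The sketch below computes the resulting communication schedule under standard HPF distribution semantics; the function names are hypothetical and do not reflect the ComPaSS routines themselves.

```python
# Hedged sketch of an HPF-style redistribution: which array elements
# each processor must send when a 1-D array of n elements moves from
# BLOCK to CYCLIC distribution over p processors.

def block_owner(i, n, p):
    """Owner of global index i under BLOCK distribution (ceiling blocks)."""
    block = (n + p - 1) // p
    return i // block

def redistribution_schedule(n, p):
    """Map (src_proc, dst_proc) -> sorted list of global indices to send."""
    sched = {}
    for i in range(n):
        src = block_owner(i, n, p)
        dst = i % p                   # owner under CYCLIC distribution
        if src != dst:
            sched.setdefault((src, dst), []).append(i)
    return sched

sched = redistribution_schedule(8, 4)
# Under BLOCK, proc 0 holds elements 0 and 1; element 1 belongs to
# proc 1 under CYCLIC, so it must be sent.
assert sched[(0, 1)] == [1]
```

Precomputing such index sets lets the library pack each processor pair's elements into a single message, which is the usual way redistribution cost is kept proportional to the data actually moved.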