SPIDER: Flexible and Efficient Communication Support for Point-to-Point Distributed Systems*

James Dolter, Stuart Daniel, Ashish Mehra, Jennifer Rexford, Wu-chang Feng, and Kang Shin
Real-Time Computing Laboratory
Department of Electrical Engineering and Computer Science
The University of Michigan
Ann Arbor, MI 48109-2122

Abstract

SPIDER is a network adapter that provides scalable communication support for point-to-point distributed systems. The device exports an efficient interface to the host processor, provides transparent support for dependable, time-constrained communication, and handles packet routing and switching. The communication support provided by SPIDER exploits concurrency between the independent data channels feeding the point-to-point network, and offers flexible and transparent hardware mechanisms. SPIDER allows the host to exercise fine-grain control over its operation, enabling the host to monitor and influence data transmission and reception efficiently. In the current implementation, SPIDER interfaces to the Ironics IV-3207, a VMEbus-based 68040 card, and will be controlled by the x-kernel, a communication executive allowing the flexible composition of communication protocols.

1 Introduction

Traditionally, parallel computers and distributed systems have been employed in disparate application domains. Parallel computing has been motivated primarily by the need for high-performance scientific computing, resulting in regular interconnection networks and tightly-coupled processing elements. Distributed systems, on the other hand, arose from the need for connectivity, communication, and resource sharing between network-based machines.
This paper presents SPIDER (Scalable Point-to-point Interface DrivER), a network adapter that combines the protocol support and media access of distributed systems with the low-level packet routing and switching schemes of the point-to-point, parallel computing domain.

In recent years distributed computing has emerged as a scalable and cost-effective solution to many classes of applications with widely-varying characteristics and resource requirements. Technological advances in VLSI, networking, and operating systems have expanded the domain of distributed computing, facilitating the merger of the seemingly disparate disciplines of parallel computing and distributed computing. Faster networks now allow distributed systems to employ mechanisms previously applied only to tightly-coupled parallel machines, including system-wide shared memory and a finer grain of computation. In addition, parallel programming abstractions are now being applied across a wide variety of distributed computing platforms.

It is also becoming commonplace to use digital computers for real-time applications such as fly-by-wire, industrial process control, computer-integrated manufacturing, and medical life-support systems. These applications impose stringent timing and dependability requirements on the computer system, since a disruption of service caused by a physical failure or inadequate response time can result in a catastrophe. Commonly, dependability is provided by incorporating some form of redundancy into the system. One technique replicates critical software components on a collection of nodes that fail independently [5].

*The work reported in this paper was supported in part by the National Science Foundation under Grant MIP-9203895. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the view of the NSF.
Coordinating this software replication necessitates timely and dependable communication between nodes.

Point-to-point networks, with their multiplicity of processors and internode routes, provide a natural platform for applications that require both high performance and dependability [18]. Many parallel computers connect the processing elements with a point-to-point network [7,9,11,20] to provide scalable communication bandwidth to applications. However, these networks often consist of short links, such as on-board wires or ribbon cables, with no need for higher-level error control. Centralized hardware and software can make parallel machines vulnerable to single-point failures. For example, the message-driven processor (MDP) [9] for the J-machine is a chip that connects to a 3D-mesh network. With 64 nodes on a board and multiple boards in a chassis, a single board failure can disrupt several processing elements.

1063-6927/94 $03.00 © 1994 IEEE