Clusterfile: A Flexible Physical Layout Parallel File System

Florin Isaila, Walter F. Tichy
Department of Computer Science
University of Karlsruhe, Germany
{florin,tichy}@ira.uka.de

Abstract

This paper presents Clusterfile, a parallel file system that provides parallel file access on a cluster of computers. Existing parallel file systems offer little control over matching the I/O access patterns to the file data layout. Without this matching, applications may face the following problems: contention at I/O nodes, fragmentation of file data, false sharing, small network messages, and high overhead of scattering/gathering the data. Clusterfile addresses some of these inefficiencies. Parallel applications can physically partition a file in arbitrary patterns. They can also set arbitrary views on a file. Views hide the parallel structure of the file and ease the programmer's burden of computing complex access indices. The intersections between views and layouts are computed by a memory redistribution algorithm. Read and write operations are optimized by pre-computing the direct mapping between access patterns and disks. Clusterfile uses the same data representation for file layouts, access patterns, and the mappings between them.

1. Introduction

The tremendous increases in processor speeds have exposed the I/O subsystem as a bottleneck in a cluster of computers. This especially affects the performance of applications that need to bring large amounts of data from the disks into memory, such as scientific applications. It is therefore critical that the I/O operations execute as fast as possible in order to minimize their impact on performance.

Parallel file systems have converged toward the generic configuration shown in figure 1. The nodes in a cluster are divided into two sets, which may or may not overlap: the compute nodes and the I/O nodes. Files are typically striped over the I/O nodes. Applications run on the compute nodes.
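To make the notion of striping concrete, the following minimal sketch maps a file byte offset to an I/O node and to an offset within that node's portion of the file. It assumes plain round-robin striping with a fixed stripe unit; the function name and its parameters are hypothetical and are not taken from Clusterfile, whose layouts are more general.

```python
def stripe_location(offset, stripe_unit, num_io_nodes):
    """Map a file byte offset to (io_node, local_offset).

    Assumes round-robin striping: stripe unit k of the file is stored
    on I/O node k % num_io_nodes. All names here are illustrative.
    """
    if offset < 0 or stripe_unit <= 0 or num_io_nodes <= 0:
        raise ValueError("offset must be >= 0; sizes must be positive")
    stripe_index = offset // stripe_unit          # which stripe unit of the file
    io_node = stripe_index % num_io_nodes         # node holding that unit
    local_stripe = stripe_index // num_io_nodes   # units stored before it on that node
    local_offset = local_stripe * stripe_unit + offset % stripe_unit
    return io_node, local_offset
```

With a stripe unit of 4 bytes and 3 I/O nodes, for instance, byte 13 of the file falls into the fourth stripe unit, which wraps around to the first I/O node and is the second unit stored there.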

Parallel applications access files differently than sequential ones do. UNIX file systems and even some distributed file systems (NFS) were designed on the premise that file sharing is rare, whereas parallel applications usually access a file concurrently. This means that the file structure of a parallel file system must not only allow parallel access to the file, but must also be scalable, as scalable as the computation, if possible.

Parallel applications also have a wide range of I/O access patterns. At the same time, they don't have a sufficient degree of control over the placement of file data on a cluster. Therefore, they often access files in patterns that differ from the physical layout of the file on the cluster. This can hurt performance in several ways. First, a poor layout can fragment data on the disks of the I/O nodes, so that complex index computations are needed for each access. Second, the fragmentation of data results in sending many small messages over the network instead of a few large ones. Message aggregation is possible, but the costs of gathering and scattering are not negligible. Third, contention of related processes at the I/O nodes can lead to overload and can hinder parallelism. Fourth, poor spatial locality of data on the disks of the I/O nodes translates into non-sequential disk access. A poor layout also increases the probability of false sharing within file blocks.

A particular file layout may improve the performance of parallel applications, but the same layout has to be used by different access patterns. Computing the mapping between an arbitrary access pattern and the file layout may become tricky. That is why we give applications the possibility of setting views on the data, and we use an efficient data redistribution algorithm for computing the indices.
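The effect of a mismatch between access pattern and layout on the number of messages can be illustrated with a small sketch. It assumes a row-major matrix stored in a file that is striped round-robin over the I/O nodes, and it counts one request per contiguous piece on one I/O node, without any message aggregation. All function names and parameters are hypothetical; this is not the redistribution algorithm used by Clusterfile.

```python
def split_into_requests(start, length, stripe_unit, num_io_nodes):
    """Split one contiguous file range into (node, local_offset, size) pieces."""
    requests = []
    end = start + length
    while start < end:
        stripe_index = start // stripe_unit
        # Bytes left in the current stripe unit, capped by the end of the range.
        chunk = min(end - start, (stripe_index + 1) * stripe_unit - start)
        node = stripe_index % num_io_nodes
        local = (stripe_index // num_io_nodes) * stripe_unit + start % stripe_unit
        requests.append((node, local, chunk))
        start += chunk
    return requests


def column_block_ranges(rows, cols, first_col, width, elem_size=1):
    """File ranges touched by a block of columns: one short range per row."""
    return [((r * cols + first_col) * elem_size, width * elem_size)
            for r in range(rows)]


def row_block_ranges(cols, first_row, height, elem_size=1):
    """File ranges touched by a block of rows: a single contiguous range."""
    return [(first_row * cols * elem_size, height * cols * elem_size)]


def count_requests(ranges, stripe_unit, num_io_nodes):
    """Total number of per-node requests needed for a list of file ranges."""
    return sum(len(split_into_requests(s, n, stripe_unit, num_io_nodes))
               for s, n in ranges)
```

For an 8 x 8 matrix of one-byte elements, a stripe unit of 16 bytes, and 4 I/O nodes, a process that reads the first two rows needs a single 16-byte request, whereas a process that reads the first two columns of every row needs eight 2-byte requests for the same amount of data.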

In this paper we will present the design and features of Clusterfile, a cluster parallel file system which offers an increased degree of control of the file layout over a cluster. Section 2 presents prerequisites of our approach: existing studies of parallel I/O characterization and our data structure for representing subfiles and views. Section 3 shows how a file can be physically and logically partitioned. Section 4 describes the architecture of the parallel file system. Section 5 presents the experiments we performed. Section 6 discusses some related work. Section 7 contains conclu-