Evaluation of a Performance Model of the Lustre File System
Tiezhu Zhao¹, Verdi March²,³, Shoubin Dong¹, Simon See²,⁴
¹ Guangdong Key Laboratory of Computer Network, South China University of Technology, Guangzhou, 510641, P.R. China
² Asia-Pacific Science and Technology Center (APSTC), Sun Microsystems
³ Department of Computer Science, National University of Singapore
⁴ Department of Mechanical & Aerospace Engineering, Nanyang Technological University
zhao.tiezhu@mail.scut.edu.cn, Verdi.March@Sun.COM, sbdong@scut.edu.cn, Simon.See@Sun.COM
Abstract—As a large-scale global parallel file system, the Lustre file system plays a key role in High Performance Computing (HPC) systems, yet its potential performance can be difficult to predict because the impact on application performance is not clearly understood. It is therefore important to gain insight into the I/O efficiency that a Lustre file system can deliver, and into which factors affect its performance and how. This paper presents a performance evaluation of the Lustre file system and proposes a novel relative performance model to predict overhead under different performance factors. In our previous experiments, we found that different performance factors are closely correlated. To exploit these correlations, we introduce a relative performance model that predicts the performance difference between a pair of Lustre file systems configured with different performance factors. On average, the relative model predicts bandwidth within 17%-28%. The results show that our relative prediction model achieves good prediction accuracy.
Keywords-performance evaluation; parallel file system; model; Lustre
I. INTRODUCTION
Parallel file systems are a key part of any complete massively parallel computing environment and are widely used in clusters dedicated to I/O-intensive parallel applications. The Lustre parallel file system is best known for powering seven of the ten largest high-performance computing (HPC) clusters in the world, with tens of thousands of client systems, petabytes (PBs) of storage, and hundreds of gigabytes per second (GB/s) of I/O throughput. Many HPC sites use Lustre as a site-wide global file system, servicing dozens of clusters on an unprecedented scale [1]. In the current cloud computing era, research on the performance of the Lustre file system has increasingly attracted the attention of both industry and research communities. Further details on Lustre are available in [9][10][11][12].
The rapid development of HPC applications is aggressively pushing the demands on parallel file systems in terms of high aggregate I/O bandwidth, massive storage capacity, and strong fault tolerance. HPC platforms need to be coupled with efficient parallel file systems, such as Lustre, that can deliver commensurate I/O throughput to scientific applications. Although the performance characteristics of HPC workloads have been studied through experimental and empirical analysis, the potential performance of such systems remains difficult to predict: the impact on application performance in a parallel file system environment is not clearly understood, and most internal details of the basic components of parallel file systems are not public. It is therefore important to gain insight into the deliverable I/O efficiency of the Lustre file system.
As is well known, building a parallel file system is expensive and complex. When a parallel file system is not properly tuned or configured, this cost may not pay off. Consequently, questions of how to optimize the design of a parallel file system, how to evaluate and tune its performance, and how to predict its performance trends are of growing concern to both the storage industry and research communities.
Storage systems can be complex to manage. Management involves sizing storage devices (volumes or LUNs), selecting a RAID level for each device, and mapping application workloads to storage devices. Automating management is one way to offer administrators some relief and to help reduce the total cost of ownership in the data center; in particular, the mapping of workloads to storage devices could be automated. However, storage administration currently continues to be overly complex and costly, and faces many challenges in deciding the mapping from application data sets to storage devices, balancing loads, and matching workload characteristics to device strengths. Unfortunately, storage administration today relies on experts who use rules of thumb to make decisions [18][19][20]. What is needed is a mechanism that predicts the performance of any given workload and automates the prediction process.
Based on the above observations, we propose an in-depth performance evaluation of the Lustre file system. Our evaluation mainly covers the number of object storage servers (OSSes), the storage connection approach, the type of disks, the type of journal for each OST, and the number of threads per OST. We perform a correlation analysis of the performance overhead under these different performance factors and find that they are closely correlated. To mine these potential performance correlations, we construct a novel relative performance prediction model that predicts performance under different factors.
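The idea of relative prediction can be illustrated with a minimal sketch: measure the bandwidth of a baseline configuration, then scale it by per-factor relative ratios to estimate a differently configured system. The function name, factor names, and ratio values below are illustrative assumptions, not the paper's actual model or measured data.

```python
# Hypothetical sketch of a relative performance model. The factor names
# and ratios are assumed for illustration; the paper's model is derived
# from measured correlations between performance factors.
def relative_predict(baseline_bw, factor_ratios):
    """Estimate target bandwidth by scaling a measured baseline
    bandwidth with the relative ratio of each changed factor."""
    predicted = baseline_bw
    for ratio in factor_ratios.values():
        predicted *= ratio
    return predicted

# Example: a baseline system measured at 400 MB/s; the target doubles
# the OSS count (assumed speedup ratio 1.8) and uses faster disks
# (assumed ratio 1.2).
ratios = {"oss_count": 1.8, "disk_type": 1.2}
estimate = relative_predict(400.0, ratios)
print(round(estimate, 1))  # 864.0
```

The appeal of a relative model is that it sidesteps absolute modeling of opaque file system internals: only the ratio between a pair of configurations must be characterized, which is what the correlation analysis in this paper provides.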
The remainder of this paper is organized as follows. We first introduce related work in Section II. Then, we conduct an in-depth survey of the performance factors of the Lustre file system in Section III. In Section IV, we present the relative performance prediction model and carry out a detailed analysis of its prediction accuracy. We conclude the paper in Section V.
The Fifth Annual ChinaGrid Conference
978-0-7695-4106-8/10 $26.00 © 2010 IEEE
DOI 10.1109/ChinaGrid.2010.38