SCIENCE CHINA Information Sciences
March 2022, Vol. 65 132101:1–132101:17
https://doi.org/10.1007/s11432-019-2833-1
© Science China Press and Springer-Verlag GmbH Germany, part of Springer Nature 2021
info.scichina.com link.springer.com

RESEARCH PAPER

A self-tuning client-side metadata prefetching scheme for wide area network file systems

Bing WEI 1,2, Limin XIAO 1,2*, Yao SONG 1,2, Guangjun QIN 3, Jinbin ZHU 1,2, Baicheng YAN 1,2, Chaobo WANG 1,2 & Zhisheng HUO 1,2

1 Laboratory of Software Development Environment, Beihang University, Beijing 100191, China;
2 School of Computer Science and Engineering, Beihang University, Beijing 100191, China;
3 Smart City College, Beijing Union University, Beijing 100101, China

Received 28 September 2019/Revised 14 December 2019/Accepted 17 March 2020/Published online 22 February 2021

Abstract Client-side metadata prefetching is commonly used in wide area network (WAN) file systems because it can effectively hide network latency. However, most existing prefetching approaches do not meet the varied prefetching requirements of multiple workloads. They are usually optimized for only one specific workload and have no effect, or even a harmful effect, on other workloads. In this paper, we present a new self-tuning client-side metadata prefetching scheme that uses two different prefetching strategies and dynamically adapts to workload changes. It uses a directory-directed prefetching strategy to prefetch the related file metadata in the same directory, and a correlation-directed prefetching strategy to prefetch the related file metadata accessed across directories. A novel self-tuning mechanism is proposed to efficiently switch the prefetching strategy between directory-directed and correlation-directed prefetching. Experimental results using real system traces show that the hit ratio of the client-side cache can be significantly improved by our self-tuning client-side prefetching.
In the multi-workload concurrency scenario, our approach improves the hit ratio over no prefetching, directory-directed prefetching, the variant probability graph algorithm, the variant apriori algorithm, and the variant semantic distance algorithm by up to 15.22%, 6.32%, 10.08%, 11.65%, and 10.73%, corresponding to 25.24%, 18.11%, 23.53%, 24.94%, and 24.19% reductions in the average access time, respectively.

Keywords wide area network file systems, multiple workloads, metadata prefetching, correlation-directed prefetching, directory-directed prefetching, self-tuning prefetching

Citation Wei B, Xiao L M, Song Y, et al. A self-tuning client-side metadata prefetching scheme for wide area network file systems. Sci China Inf Sci, 2022, 65(3): 132101, https://doi.org/10.1007/s11432-019-2833-1

1 Introduction

In a wide area environment, heterogeneous storage resources owned by different organizations are geographically distributed, resulting in barriers between applications and data. Network-based file systems offer promising solutions to this problem. In network-based file systems (such as Onedata [1] and GFFS [2]), the client and server are decoupled and interact with each other through network communications. Several network-based file systems use client-side metadata caching to reduce the number of network communications and achieve better access performance [1–5]. The client caches a certain amount of metadata and periodically refreshes the cached metadata [1].

As cache hit ratios are crucial for the performance of network-based file systems [6], several prefetching schemes [3–5, 7–11] have been proposed to improve them. These approaches can generally be classified into two categories: directory-directed and correlation-directed prefetching. Directory-directed prefetching is commonly used to alleviate access latency in several network-based storage systems [3–5].
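As a rough illustration of the directory-directed category, the sketch below shows a client-side cache that, on a miss, asks the server for the metadata of every file in the requested file's directory in one request. It is a minimal sketch under stated assumptions, not the scheme proposed in this paper: the class names `FakeMetadataServer` and `DirectoryPrefetchingCache`, the method `list_directory_metadata`, the LRU eviction policy, and the sample paths are all hypothetical and chosen only for illustration.

```python
import posixpath
from collections import OrderedDict


class FakeMetadataServer:
    """Hypothetical stand-in for a remote metadata server (not a real API)."""

    def __init__(self, files):
        # files: dict mapping absolute path -> metadata dict
        self.files = dict(files)
        self.round_trips = 0  # number of simulated network communications

    def list_directory_metadata(self, directory):
        """Return the metadata of every file directly inside `directory`."""
        self.round_trips += 1
        return {
            path: meta
            for path, meta in self.files.items()
            if posixpath.dirname(path) == directory
        }


class DirectoryPrefetchingCache:
    """Client-side LRU metadata cache with directory-directed prefetching."""

    def __init__(self, server, capacity):
        self.server = server
        self.capacity = capacity  # assumed to be at least 1
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def _insert(self, path, meta):
        self.entries[path] = meta
        self.entries.move_to_end(path)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

    def lookup(self, path):
        if path in self.entries:
            self.hits += 1
            self.entries.move_to_end(path)
            return self.entries[path]
        # Miss: one request brings back the whole directory's metadata.
        self.misses += 1
        fetched = self.server.list_directory_metadata(posixpath.dirname(path))
        if path not in fetched:
            raise FileNotFoundError(path)
        for sibling, meta in fetched.items():
            if sibling != path:
                self._insert(sibling, meta)
        # Insert the requested entry last so it is the most recently used.
        self._insert(path, fetched[path])
        return fetched[path]


# Small demonstration with made-up paths.
server = FakeMetadataServer({
    "/proj/a.c": {"size": 10},
    "/proj/b.c": {"size": 20},
    "/proj/c.h": {"size": 30},
    "/docs/readme": {"size": 5},
})
cache = DirectoryPrefetchingCache(server, capacity=8)
for p in ["/proj/a.c", "/proj/b.c", "/proj/c.h", "/docs/readme", "/proj/a.c"]:
    cache.lookup(p)
print(cache.hits, cache.misses, server.round_trips)
```

In this toy trace, accesses that stay inside one directory are served from the cache after the first miss, whereas an access that crosses into another directory costs a new request; this is the kind of cross-directory access that correlation-directed prefetching targets.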
It prefetches all file metadata in the same directory in a single network communication. This type of approach can prefetch metadata without knowing the semantic correlations between files [11]. Directory-directed prefetching is effective because it can capture the natural organization imposed by
* Corresponding author (email: xiaolm@buaa.edu.cn)