Low-latency Memory-mapped I/O for Data-intensive Applications on Fast Storage Devices

Nae Young Song, Young Jin Yu, Woong Shin, Hyeonsang Eom and Heon Young Yeom
Department of Computer Science and Engineering
Seoul National University
Seoul, South Korea
{nysong, yjyu, wshin, hseom, yeem}@dcslab.snu.ac.kr

Abstract—These days, along with read()/write(), mmap() is used for file I/O in data-intensive applications as an alternative I/O method on emerging low-latency devices such as flash-based SSDs. Although memory-mapped file I/O has many advantages, it yields little benefit when combined with large data sets and fast storage devices. When the working set of an application accessing files with mmap() is larger than physical memory, I/O performance is severely degraded compared to the same application using read()/write(). This is mainly because the virtual memory subsystem does not reflect the performance characteristics of the underlying storage device. In this paper, we examine the Linux virtual memory subsystem and the mmap() I/O path to determine how low-latency storage devices affect the existing virtual memory subsystem. We also suggest optimization policies to reduce the overheads and implement a prototype in a recent Linux kernel. Our solution guarantees that 1) memory-mapped I/O is several times faster than read-write I/O when the cache-hit ratio is high, and 2) it performs at least as well as read-write I/O even when cache misses occur frequently and the overhead of mapping/unmapping pages becomes significant, neither of which is achievable with the existing virtual memory subsystem.

I. INTRODUCTION

Recently, the size of data that many applications use has been increasing with the development of multimedia and scientific applications. Because of this growing use of big data, much research has focused on eliminating the storage stack overhead that arises from file I/O via read()/write() system calls [1], [2], [3].
Moreover, the advent of low-latency storage devices enables fast access to files and has accelerated the study of read()/write() file I/O optimizations.

However, there is a separate file I/O path, called memory-mapped I/O (which we call mmio); it maps a file into a contiguous virtual memory address space and provides a load/store interface to applications, as if the file were an in-memory variable. This feature makes mmio more advantageous than read/write I/O (which we call rwio) under certain circumstances, since 1) any complex in-memory object can easily be made persistent, and 2) the cached region of the file can be accessed faster due to the reduced number of context switches. For these reasons, NoSQL systems such as MongoDB and Cassandra, among the most widely used data-intensive storage frameworks, use mmio to manage their index files. NV-heaps [4] also supports high-performance persistent objects by using mmio.

Unfortunately, when it comes to data-intensive applications with large-scale data, the performance of mmio is worse than that of rwio, especially when evaluated on low-latency storage devices (see Section II-C). This implies that mmio does not scale with large data. The most important reason is that the unmapping procedure in the virtual memory subsystem is not scalable enough, so its software overhead dominates the overall I/O performance. For example, the overhead of Inter-Processor Interrupts (IPIs) now becomes significant when unmapping a page. It is therefore necessary to probe the parts of the virtual memory subsystem relevant to mmio, to find the root cause of the problem mentioned above, and to seek optimizations.

In this paper, we profiled mmap() and read()/write() I/O performance and suggest the following two optimizations: (1) recycling pages directly instead of (global) reclamation, and (2) coalescing multiple IPIs instead of sending a per-page IPI. We also implemented these optimizations in Linux kernel 2.6.32.
With the optimized virtual memory subsystem, we observe that memory-mapped I/O now shows performance comparable to read-write I/O even in the worst-case scenario, where map/unmap overhead degrades the overall I/O performance.

The remainder of this paper is organized as follows. Section II presents our motivation and background. Section III contains our solutions to reduce the overheads. The experimental setup is described in Section IV and a preliminary evaluation in Section V. Finally, our conclusion and future work are presented in Sections VI and VII.

II. MOTIVATION AND BACKGROUND

A. Tradeoffs between mmap I/O and file I/O system calls

File I/O via mmap() is considered an advanced form of file I/O because it has many advantages, such as fewer memory copies. In addition to these merits, mmap() has several properties that make memory-mapped file I/O more sophisticated than read()/write(). In practice, memory-mapped file access is more efficient than access via read()/write() if the file is small enough that all of its pages fit in the page cache in main memory. This is because mmio on an in-memory file copies data only once, and an application can directly access the file at byte granularity in main memory through memory variables. Thus, if the mapped file size fits within the physical