Low-latency Memory-mapped I/O for Data-intensive Applications on Fast Storage Devices

Nae Young Song, Young Jin Yu, Woong Shin, Hyeonsang Eom and Heon Young Yeom
Department of Computer Science and Engineering
Seoul National University
Seoul, South Korea
{nysong, yjyu, wshin, hseom, yeem}@dcslab.snu.ac.kr

Abstract—These days, along with read()/write(), mmap() is used for file I/O in data-intensive applications as an alternative I/O method on emerging low-latency devices such as flash-based SSDs. Although memory-mapped file I/O has many advantages, it yields little benefit when combined with large data sets and fast storage devices. When the working set of an application accessing files with mmap() is larger than physical memory, I/O performance is severely degraded compared to the same application using read()/write(). This is mainly because the virtual memory subsystem does not reflect the performance characteristics of the underlying storage device. In this paper, we examine the Linux virtual memory subsystem and the mmap() I/O path to determine how low-latency storage devices affect the existing virtual memory subsystem. We also suggest optimization policies to reduce the overheads and implement a prototype in a recent Linux kernel. Our solution guarantees that 1) memory-mapped I/O is several times faster than read-write I/O when the cache-hit ratio is high, and 2) it performs at least as well as read-write I/O even when cache misses occur frequently and the overhead of mapping/unmapping pages becomes significant, neither of which is achievable with the existing virtual memory subsystem.

I. INTRODUCTION

Recently, the size of data that many applications use has been increasing with the development of multimedia and scientific applications. Because of this growing use of big data, much research has focused on eliminating the storage stack overhead that arises from file I/O via read()/write() system calls [1], [2], [3].
Moreover, the advent of low-latency storage devices enables fast access to files and has accelerated the study of read()/write() file I/O optimizations.

However, there is a separate file I/O path, called memory-mapped I/O (which we call mmio); it maps a file into a contiguous virtual memory address space and provides a load/store interface to applications, as if the file were an in-memory variable. This feature makes mmio more advantageous than read/write I/O (which we call rwio) under certain circumstances, since 1) any complex in-memory object can easily be made persistent, and 2) the cached region of the file can be accessed faster due to the reduced number of context switches. For these reasons, NoSQL systems such as MongoDB and Cassandra, among the most widely used data-intensive storage frameworks, use mmio to manage their index files. NV-heaps [4] also supports high-performance persistent objects by using mmio.

Unfortunately, when it comes to data-intensive applications with large-scale data, the performance of mmio is worse than that of rwio, especially when evaluated on low-latency storage devices (see Section II-C). This implies that mmio does not scale with large data. The most important reason is that the unmapping procedure in the virtual memory subsystem is not scalable enough, so its software overhead dominates the overall I/O performance. For example, the overhead of Inter-Processor Interrupts (IPIs) now becomes significant when unmapping a page. It is therefore necessary to probe the parts of the virtual memory subsystem relevant to mmio, to find the root cause of the problem mentioned above, and to seek optimizations.

In this paper, we profiled mmap() and read()/write() I/O performance and suggest the following two optimizations: (1) recycling pages directly instead of (global) reclamation, and (2) coalescing multiple IPIs instead of sending a per-page IPI. We also implemented these optimizations in Linux kernel 2.6.32.
With the optimized virtual memory subsystem, we observe that memory-mapped I/O now shows performance comparable to read-write I/O even in the worst-case scenario, where map/unmap overhead degrades the overall I/O performance.

The remainder of this paper is organized as follows. Section II presents our motivation and background. Section III contains our solutions to reduce the overheads. The experimental setup is described in Section IV and a preliminary evaluation in Section V. Finally, our conclusion and future work are presented in Sections VI and VII.

II. MOTIVATION AND BACKGROUND

A. Tradeoffs between mmap I/O and file I/O system calls

File I/O via mmap() is considered an advanced form of file I/O because it has many advantages, such as fewer memory copies. In addition to these merits, mmap() has several properties that make memory-mapped file I/O more sophisticated than read()/write(). In practice, memory-mapped file access is more efficient than access via read()/write() if the file is small enough that all of its pages fit in the page cache in main memory. This is because mmio on an in-memory file copies data only once, and an application can directly access the file at byte granularity in main memory through memory variables. Thus, if the mapped file size fits within the physical