TBM: Twin Block Management Policy to Enhance the Utilization of Plane-Level Parallelism in SSDs

Arash Tavakkol, Pooyan Mehrvarzy, and Hamid Sarbazi-Azad

Abstract—The internal architecture of an SSD provides channel-, chip-, die-, and plane-level parallelism to concurrently perform multiple data accesses and compensate for the performance gap between a single flash chip and the host interface. Although a good striping strategy can effectively exploit the first three levels, parallel I/O accesses at the plane level can be performed only for operations of the same type and page address. In this work, we propose the Twin Block Management (TBM) policy, which symmetrically conducts the usage and recycling of flash block addresses on the planes of a die, thus enhancing the utilization of plane-level parallelism for reads, writes, and erases. Evaluation results show that TBM improves IOPS and response time by up to 73 and 42 percent, respectively.

Index Terms—Flash memory, solid-state drive, plane-level parallelism, garbage collection

1 INTRODUCTION

In NAND flash solid state drives (SSDs), despite the availability of high-performance NAND flash communication interfaces (up to 800 MB/s [1]), the maximum achievable I/O performance of a single flash chip (package) is restricted by the execution latency of flash operations. In particular, the flash write (program) operation is very slow, and it becomes even slower with VLSI technology shrinking. Consequently, SSDs use a set of flash chips organized in a massively parallel architecture (see Fig. 1) to achieve high I/O performance through the simultaneous execution of multiple flash operations. In this architecture, a multi-channel multi-way bus structure facilitates concurrent accesses to the flash chips. Each flash chip is itself composed of a set of dies that share the chip communication interface and can independently execute flash operations.
At the lowest level, there are multiple planes within a die that can operate in parallel. However, plane-level parallelism has a strict restriction that must be adhered to, i.e., the same operations on the same flash memory addresses are required for simultaneous execution on the planes of a die.

Many recent studies have been proposed to exploit plane-level parallelism more effectively and alleviate its inherent limitations. They generally reorder and reschedule queued I/O operations to increase the chance of parallel execution at the plane level [3], [5], [8]. However, the efficiency of these methods is highly sensitive to the behavior of the SSD flash management policy. Strictly speaking, an out-of-place update policy is used in SSDs to reduce the negative impact of the NAND flash erase-before-write property. To perform an update under this policy, the previous version of the data is marked invalid and the new data is written to a free location. Therefore, a logical-to-physical address mapping scheme is used in conjunction with a garbage collection (GC) mechanism to manage data placement, the consumption of free memory locations, and the recycling of invalid locations. These mechanisms greatly affect whether the I/O queue contains operations that are mapped onto different planes of a die and, at the same time, access identical addresses inside these planes. For instance, fewer queued write operations conform to the plane-level addressing constraint if the memory addresses of neighboring planes are assigned asymmetrically and invalidated memory locations are recycled without any address consideration. However, this critical influence of flash management mechanisms was not considered in previous proposals, and hence their performance gain becomes negligible when random use of page addresses becomes more frequent in the long term.
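The plane-level addressing constraint described above can be made concrete with a small sketch. This is an illustrative check, not code from the letter: the `FlashOp` record and the `require_bac` switch (modeling control logic that additionally demands identical block addresses) are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FlashOp:
    kind: str      # "read", "write", or "erase"
    channel: int
    chip: int
    die: int
    plane: int
    block: int
    page: int      # page offset within the block

def multiplane_eligible(a: FlashOp, b: FlashOp,
                        require_bac: bool = True) -> bool:
    """Two queued operations can merge into one multi-plane command only if
    they target different planes of the same die, perform the same kind of
    operation, and share the required address fields."""
    if (a.channel, a.chip, a.die) != (b.channel, b.chip, b.die):
        return False                  # must land on planes of one die
    if a.plane == b.plane:
        return False                  # access within a single plane is serial
    if a.kind != b.kind:
        return False                  # same operation type is mandatory
    if require_bac and a.block != b.block:
        return False                  # some control logic: same block address
    return a.page == b.page           # same page offset is always required
```

For example, two writes to plane 0 and plane 1 of the same die with identical block and page addresses are eligible, while changing only the page offset of one of them breaks eligibility. Relaxing `require_bac` models the products, mentioned later in the letter, whose control logic does not demand identical block addresses.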
In this work, we propose the Twin Block Management (TBM) policy for out-of-place update that is aware of the plane-level addressing constraint. TBM defines new strategies for physical address assignment and GC execution in order to symmetrically conduct the usage and recycling of memory addresses on the planes of a die.

2 SSD INTERNALS

Fig. 1 shows the internal architecture of a NAND flash SSD, composed of four main components: 1) a host interface, which provides communication with the host system and performs I/O request queuing; 2) a controller, composed of a microprocessor and a DRAM memory, which executes a special management firmware called the Flash Translation Layer (FTL); 3) a flash controller, a hardware driver that enables communication with the flash chips; and 4) flash chips, which provide the raw SSD storage capacity. As we mentioned previously, flash chips are organized in a hierarchy of four parallelism levels, i.e., channel-, chip-, die-, and plane-level. Multiple flash I/O operations can be executed simultaneously through striping over the communication channels and pipelining among the set of flash chips connected to each channel. Furthermore, each die of a flash chip has its own command and address registers, and hence the execution of different operations can be interleaved between dies. At the lowest level, the planes of a die share the same control logic. Therefore, the same operations with the same addresses must be available in the I/O queue for parallel-plane (multi-plane) command execution. Access to the internal storage space of a plane is serial, and read/write operations are performed at the unit of a page, which typically holds 4 KB, 8 KB, or a larger volume of data. Updating the content of a previously written flash page requires an erase operation. Due to its very slow execution, a flash erase is performed at the unit of a block, composed of a set of 128, 256, or more pages.
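The four-level hierarchy plus the block/page structure inside each plane means a physical page address can be viewed as a tuple of coordinates. The sketch below shows one conventional way to decompose a flat physical page number into those coordinates; the geometry values are hypothetical examples, not taken from the letter.

```python
# Hypothetical example geometry; real SSDs vary widely.
GEOMETRY = {
    "channels": 8,
    "chips_per_channel": 4,
    "dies_per_chip": 2,
    "planes_per_die": 2,
    "blocks_per_plane": 2048,
    "pages_per_block": 256,
}

def decompose_ppa(ppa: int, g: dict = GEOMETRY) -> dict:
    """Peel off each level of the hierarchy, least-significant field first:
    page within block, block within plane, plane, die, chip, channel."""
    out = {}
    for level, size in [("page", g["pages_per_block"]),
                        ("block", g["blocks_per_plane"]),
                        ("plane", g["planes_per_die"]),
                        ("die", g["dies_per_chip"]),
                        ("chip", g["chips_per_channel"]),
                        ("channel", g["channels"])]:
        ppa, out[level] = divmod(ppa, size)
    return out
```

With this layout, physical page 256 decodes to page offset 0 of block 1 in plane 0, which is exactly the granularity at which the multi-plane constraints of the next section are stated: the page offset (and possibly the block index) must match across planes.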
According to the ONFI standard specification [1], multi-plane commands must access the same page offset within the target blocks (referred to as PAC). Besides, the flash control logic may also require the block addresses to be identical for multi-plane command execution (referred to as BAC). As of 2015, all flash products require PAC, but BAC is relaxed in many products.

In order to emulate the interface of a conventional block device (HDD), the FTL performs the following tasks:

1) Management of the queued I/O requests and address mapping: host I/O requests are segmented into several page-size transactions, each with a specific logical page address (LPA). Due to out-of-place update, LPAs must be translated into physical page addresses (PPAs). The translation procedure follows two different paths for write and read operations. For a write operation, the FTL allocates a free physical page. First, a plane allocation function (PLAlloc) determines the address of the target channel, flash chip, die, and plane according to a predefined allocation strategy. Then, a block allocation function (BLAlloc) assigns a write-frontier within the selected plane. Inside the write-frontier, pages are allocated sequentially, from the first to the last index; thus the PPA is determined, and the (LPA, PPA) pair is stored in a mapping table for future reads. For read operations, translation is performed by searching the mapping table for the LPA entry.

2) GC: the out-of-place update policy quickly consumes free flash pages. Consequently, a GC procedure must be triggered by the FTL to recycle the physical pages holding invalid data. This procedure selects a victim block based on a predetermined policy, moves its valid pages to a new location, and finally triggers the execution of an erase operation.

A. Tavakkol is with the HPCAN Lab, Computer Engineering Department, Sharif University of Technology, Tehran, Iran. E-mail: tavakkol@ce.sharif.edu.
P. Mehrvarzy is with the School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran. E-mail: p.mehrvarzy@ipm.ir.
H. Sarbazi-Azad is with the HPCAN Lab, Computer Engineering Department, Sharif University of Technology, and the School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran. E-mail: azad@ipm.ir.

Manuscript received 8 Feb. 2015; revised 2 July 2015; accepted 13 July 2015. Date of publication 26 July 2015; date of current version 5 Jan. 2017.
Digital Object Identifier no. 10.1109/LCA.2015.2461162

IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 15, NO. 2, JULY-DECEMBER 2016
© 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
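The write path described in Section 2 (PLAlloc, then BLAlloc, then sequential page allocation inside the write-frontier) can be sketched as follows. This is a minimal illustrative sketch, not the letter's implementation: the round-robin PLAlloc policy, the tiny per-plane geometry, and the class layout are assumptions, and GC (victim selection, valid-page migration, erase) is omitted.

```python
class PlaneFTL:
    """Toy FTL write path over the planes of one die (illustration only)."""
    PAGES_PER_BLOCK = 4      # tiny values to keep the example readable
    BLOCKS_PER_PLANE = 4

    def __init__(self, num_planes: int = 2):
        self.num_planes = num_planes
        self.mapping = {}                  # LPA -> (plane, block, page)
        self.free_blocks = [list(range(self.BLOCKS_PER_PLANE))
                            for _ in range(num_planes)]
        self.frontier = [None] * num_planes  # (block, next_page) per plane
        self.next_plane = 0

    def _pl_alloc(self) -> int:
        """PLAlloc: pick the target plane (round-robin here, as an assumption)."""
        p = self.next_plane
        self.next_plane = (self.next_plane + 1) % self.num_planes
        return p

    def _bl_alloc(self, plane: int) -> None:
        """BLAlloc: open a new write-frontier block in the selected plane."""
        self.frontier[plane] = (self.free_blocks[plane].pop(0), 0)

    def write(self, lpa: int):
        """Out-of-place update: always consume the next free frontier page."""
        plane = self._pl_alloc()
        if self.frontier[plane] is None:
            self._bl_alloc(plane)
        block, page = self.frontier[plane]
        self.mapping[lpa] = (plane, block, page)   # record (LPA, PPA) pair
        page += 1
        # pages inside the write-frontier are allocated sequentially
        self.frontier[plane] = None if page == self.PAGES_PER_BLOCK \
                               else (block, page)
        return self.mapping[lpa]

    def read(self, lpa: int):
        """Read-path translation: look the LPA up in the mapping table."""
        return self.mapping[lpa]
```

Rewriting an LPA simply maps it to a fresh frontier page, leaving the old copy invalid for GC to reclaim later; note how a plain round-robin PLAlloc gives no guarantee that the two planes' frontiers stay at matching block and page addresses, which is precisely the asymmetry TBM is designed to remove.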