An Energy-efficient Nonvolatile In-memory Computing Architecture for Extreme Learning Machine by Domain-wall Nanowire Devices

Yuhao Wang, Hao Yu, Senior Member, IEEE, Leibin Ni, Guang-Bin Huang, Senior Member, IEEE, Mei Yan, Chuliang Weng, Wei Yang and Junfeng Zhao

Abstract—Data-oriented applications have introduced increased demands on memory capacity and bandwidth, which raises the need to rethink the architecture of current computing platforms. The logic-in-memory architecture is highly promising as a future logic-memory integration paradigm for high-throughput data-driven applications. From the memory-technology aspect, the domain-wall nanowire (or racetrack), a recently introduced non-volatile memory (NVM) device, not only shows potential as a future power-efficient memory but also offers computing capability through its unique spintronic physics. This paper explores a novel distributed in-memory computing architecture in which most logic functions are executed within the memory, which significantly alleviates the bandwidth congestion issue and improves energy efficiency. The proposed distributed in-memory computing architecture is built purely from domain-wall nanowires, i.e., both memory and logic are implemented by domain-wall nanowire devices. As a case study, a neural-network-based image resolution enhancement algorithm, called DW-NN, is examined within the proposed architecture. We show that all operations involved in machine learning on the neural network can be mapped to a logic-in-memory architecture by non-volatile domain-wall nanowires. Domain-wall nanowire based logic is customized for machine learning within the image data storage. As such, both neural network training and processing can be performed locally within the memory.
The experimental results show that domain-wall memory can reduce leakage power by 92% and dynamic power by 16% compared to main memory implemented in DRAM; and domain-wall logic can reduce dynamic power by 31% and leakage power by 65% at similar performance compared to CMOS-transistor-based logic. The system throughput of DW-NN is improved by 11.6x and the energy efficiency by 56x when compared to a conventional image processing system.

Y. Wang, H. Yu, L. Ni, G.-B. Huang and M. Yan are with the School of Electrical and Electronic Engineering, Nanyang Technological University (NTU), Singapore. C. Weng, W. Yang and J. Zhao are with Shannon Laboratory, Huawei Technologies Co., Ltd, China. The preliminary work was published in IEEE/ACM ISLPED'13. This work is sponsored by Singapore NRF-CRP (NRF-CRP9-2011-01), MOE Tier-2 (MOE2010-T2-2-037 (ARC 5/11)), A*STAR PSF fund 11201202015 and Huawei Shannon Research Lab. Please address comments to haoyu@ntu.edu.sg.

I. INTRODUCTION

THE analysis of big data at exascale (10^18 bytes/s or flops) has introduced the emerging need to reexamine the existing hardware platforms that can support memory-oriented computing. A big-data-driven application requires huge bandwidth while maintaining low power density. The most widespread data-driven application is machine learning in big-data storage systems, as the most exciting feature of a future big-data storage system is to find implicit patterns in data and extract the valuable behavior behind them. Take image search as an example: instead of performing the search by calculating pixel similarity, image search by machine learning works similarly to the human brain, which learns the features of all images by feature extraction algorithms and compares the features in the form of strings. As such, image search becomes a traditional string-matching problem that is much easier to solve.
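The reduction of image search to string matching described above can be illustrated with a minimal sketch. The binary feature strings and the `best_match` helper below are hypothetical stand-ins for the output of a real feature-extraction algorithm, used only to show why matching fixed-length feature strings is cheap:

```python
# A minimal sketch of feature-based image matching as string matching.
# The feature strings here are hypothetical stand-ins for the output of
# a real feature-extraction algorithm.

def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length feature strings differ."""
    assert len(a) == len(b)
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def best_match(query: str, database: dict) -> str:
    """Return the key of the stored feature string closest to the query."""
    return min(database, key=lambda k: hamming_distance(query, database[k]))

# Hypothetical binary feature strings (e.g., from a learned hash function).
db = {
    "cat.jpg":  "10110010",
    "dog.jpg":  "01101100",
    "bird.jpg": "10010111",
}
print(best_match("10110011", db))  # prints "cat.jpg" (distance 1)
```

Once features are precomputed, each query touches only short fixed-length strings rather than full-resolution pixel data, which is the property the in-memory architecture later exploits.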
However, to handle big image data at exascale, there is a memory wall: long memory access latency as well as limited memory bandwidth. Again take the example of image search in a big-data storage system: there may be billions of images, so performing feature extraction for all images will lead to significant congestion at the I/Os when migrating data between memory and processor. In addition, a large volume of memory will experience significant leakage power, especially at advanced CMOS technology nodes, in order to hold data in volatile memory for fast accesses [1], [2].

From the memory-technology point of view, there have been many recent explorations of emerging non-volatile memory (NVM) technologies at nanoscale, such as phase-change memory (PCM), spin-transfer torque memory (STT-RAM), and resistive memory (ReRAM) [3], [4], [5], [6], [7]. The primary advantage of NVM is its potential as a universal memory with significantly reduced leakage power. For example, STT-RAM is considered the second generation of spin-based memory, with sub-nanosecond magnetization switching time and sub-pJ switching energy [8], [9], [10]. As the third generation of spin-based memory, the domain-wall nanowire, also known as racetrack memory [11], [12], is a newly introduced NVM device that can have multiple bits densely packed in one single nanowire, where each bit is accessed by manipulation of the domain walls. Compared with STT-RAM, the domain-wall nanowire provides similar speed and power but much higher density and throughput [13]. Since the domain-wall nanowire has close-to-DRAM density but close-to-zero standby power, it becomes an ideal candidate for a future main memory that can be utilized for big-data processing.

From the architecture point of view, the logic-in-memory architecture has been introduced to overcome the memory bandwidth issue [14], [15], [16], [17], [18].
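The shift-based access of the domain-wall nanowire described above, where many densely packed bits share one access port, can be captured in a small behavioral model. The class below is a hypothetical abstraction for illustration only (the shift counter is a cost proxy, not device physics):

```python
# An illustrative behavioral model of a domain-wall (racetrack) nanowire:
# many bits share one access port, and a bit is read by shifting the
# domain walls until that bit aligns with the port. The shift-cost
# accounting below is a hypothetical abstraction, not device physics.

class DomainWallNanowire:
    def __init__(self, bits):
        self.bits = list(bits)   # magnetization domains along the wire
        self.port = 0            # index currently aligned with the access port
        self.shifts = 0          # cumulative shift operations (cost proxy)

    def read(self, index: int) -> int:
        """Shift the requested bit under the single access port, then sense it."""
        self.shifts += abs(index - self.port)
        self.port = index
        return self.bits[index]

wire = DomainWallNanowire([1, 0, 1, 1, 0, 0, 1, 0])
print(wire.read(5), wire.read(2))  # prints "0 1"
print(wire.shifts)                 # prints "8" (5 + 3 shift operations)
```

The model makes the density/latency trade-off explicit: one port serves many bits, so access cost depends on how far the target bit must be shifted.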
The basic idea behind it is that, instead of feeding the processor a large volume of raw data, it is beneficial to preprocess the data and provide the processor only the intermediate results. In other words, the key is to lower