HCube: A Server-centric Data Center Structure for Similarity Search

Rodolfo da Silva Villaça*, Rafael Pasquini†, Luciano Bernardes de Paula‡ and Maurício Ferreira Magalhães*
*School of Electrical and Computer Engineering (FEEC/UNICAMP), Campinas – SP – Brazil
†Faculty of Computing (FACOM/UFU), Uberlândia – MG – Brazil
‡Federal Institute of Education, Science and Technology of São Paulo (IFSP), Bragança Paulista – SP – Brazil
Email: {rodolfo, mauricio}@dca.fee.unicamp.br, pasquini@facom.ufu.br, lbernardes@ifsp.edu.br

Abstract—The information society is facing a sharp increase in the amount of information, driven by the plethora of new applications that sprout all the time. The amount of data now circulating on the Internet is over zettabytes (ZB), resulting in a scenario defined in the literature as Big Data. In order to handle such a challenging scenario, the deployed solutions rely not only on the massive storage, memory and processing capacity installed in Data Centers (DC) maintained by big players all over the globe, but also on shrewd computational techniques, such as BigTable, MapReduce and Dynamo. In this context, this work presents a DC structure designed to support similarity search. The proposed solution aims at concentrating similar data on servers physically close to each other within a DC. It accelerates the recovery of all data related to queries performed using a primitive get(k, sim), in which k represents the query identifier, i.e., the data used as reference, and sim a similarity level.

Index Terms—Similarity Search, Big Data, Data Center, Hamming similarity

I. INTRODUCTION

In the current Big Data scenario of the Internet, users have become data sources; companies store countless pieces of information about clients; millions of sensors monitor the real world, creating and exchanging data in the Internet of Things.
According to a study from the International Data Corporation (IDC) published in May 2010 [1], the amount of data available on the Internet surpassed 2 ZB in 2010, doubling every two years, and might surpass 8 ZB in 2015. The study also revealed that approximately 90% of this data is unstructured, heterogeneous and variable in nature, such as texts, images and videos.

Emerging technologies, such as Hadoop [2] and MapReduce [3], are examples of solutions designed to address the challenges imposed by Big Data in the so-called three Vs: Volume, Variety and Velocity. Through parallel computing techniques, in conjunction with grid computing and/or, more recently, taking advantage of the DC infrastructure offered by the cloud computing concept, IT organizations offer means for handling large-scale, distributed and data-intensive jobs across commodity servers. Usually, such technologies offer a distributed file system and automated tools for adjusting, on the fly, the number of servers involved in the processing tasks. In such cases, large volumes of data are pushed over the networking facility connecting the servers, transferring <key, value> pairs from mappers to reducers in order to obtain the desired results. Extensions like [4] avoid the reprocessing of information stored in the distributed file systems, also minimizing the need for moving data across the network in order to speed up the overall processing task.

While the current solutions are unquestionably efficient for handling traditional applications, such as batch processing of large volumes of data, they do not offer adequate support for similarity search [5], whose objective is the retrieval of sets of similar data given a similarity level. In previous works [6], an overlay solution for similarity search was developed on top of a Distributed Hash Table (DHT) structure.
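The identifier scheme underlying this line of work can be illustrated with a short, self-contained sketch (all names, dimensions and parameter values below are illustrative, not taken from the paper): random hyperplane hashing maps feature vectors to bit strings, and the Hamming similarity of two such identifiers approximates the cosine similarity of the original vectors, which is the property a get(k, sim) query exploits.

```python
import random

def rhh_signature(vec, hyperplanes):
    # One bit per random hyperplane: 1 if the vector falls on the
    # positive side of the hyperplane, 0 otherwise.
    bits = 0
    for h in hyperplanes:
        dot = sum(v * w for v, w in zip(vec, h))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

def hamming_similarity(a, b, nbits):
    # Fraction of identical bits between two nbits-long identifiers.
    return 1.0 - bin(a ^ b).count("1") / nbits

NBITS = 64
random.seed(42)
# Hyperplanes drawn from a Gaussian distribution, one per identifier bit.
planes = [[random.gauss(0, 1) for _ in range(3)] for _ in range(NBITS)]

d1 = [1.0, 0.9, 0.1]   # two feature vectors with a small angle between them
d2 = [0.9, 1.0, 0.2]
d3 = [-1.0, 0.1, 0.9]  # a dissimilar vector

s12 = hamming_similarity(rhh_signature(d1, planes), rhh_signature(d2, planes), NBITS)
s13 = hamming_similarity(rhh_signature(d1, planes), rhh_signature(d3, planes), NBITS)
# Similar vectors are expected to agree on far more identifier bits
# than dissimilar ones, so s12 should exceed s13.
```

Because identifiers of similar data differ in few bits, a query can fix the bits of k and enumerate identifiers within a Hamming radius derived from sim, which is what makes a single get(k, sim) retrieval of a whole similarity neighborhood possible.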
Essentially, it was confirmed that it is possible to assign identifiers to data that represent, with an elevated accuracy level, the similarity existing between them through the Hamming distance of their identifiers (defined in this paper as their Hamming similarity). Afterwards, it was shown that it is possible to store similar data in servers that are close in the logical space of the overlay network by using a put(k, v) primitive, and that it is also possible to efficiently recover a set of similar data by using a single get(k, sim) primitive, requiring a reduced number of logical hops between the peers in a DHT.

In this current work, the main objective is to present a DC structure, named Hamming Cube (HCube), as a proof of concept for demonstrating the feasibility of storing similar data in peers that are physically near, as opposed to the logical neighborhood characteristic of the previous overlay solution, which does not necessarily reflect physical distance. Although it is in an early stage, the results indicate its feasibility, and it serves as a basis for the development of solutions for several applications, such as recommender systems for social networks and/or medical image repositories, in which queries for similar past diagnoses may help in a new treatment.

To achieve this goal, the HCube includes a data representation model, a Locality Sensitive Hashing (LSH) function [5], a data storage infrastructure and a routing solution. For the data representation, the HCube uses a vector representation, in which each dimension is related to a characteristic of the data being stored, such as keywords in a text, the color histogram of a picture or profile attributes in a social network. In order to index the data, HCube adopts the Random Hyperplane Hashing function (RHH) [7], a family of LSH functions whose similarity corresponds to the cosine of the angle between vectors. The data storage infrastructure of HCube uses servers as