Similarity Group-by Operators for Multi-dimensional Relational Data

Mingjie Tang, Ruby Y. Tahboub, Walid G. Aref, Senior Member, IEEE, Mikhail J. Atallah, Fellow, IEEE and ACM, Qutaibah M. Malluhi, Mourad Ouzzani, Member, IEEE, and Yasin N. Silva

Abstract—The SQL group-by operator plays an important role in summarizing and aggregating large datasets in a data analytics stack. While the standard group-by operator, which is based on equality, is useful in several applications, similarity-aware grouping provides a more realistic view of real-world data that could lead to better insights. The Similarity SQL-based Group-by operator (SGB, for short) extends the semantics of the standard SQL group-by by grouping data with similar but not necessarily equal values. While existing similarity-based grouping operators efficiently materialize these approximate semantics, they primarily focus on one-dimensional attributes and treat multi-dimensional attributes independently. As a result, correlated attributes, such as those in spatial data, are processed independently, and groups in the multi-dimensional space are not detected properly. To address this problem, we introduce two new SGB operators for multi-dimensional data. The first operator is the clique (or distance-to-all) SGB, where all the tuples in a group are within some distance from each other. The second operator is the distance-to-any SGB, where a tuple belongs to a group if the tuple is within some distance from any (i.e., at least one) other tuple in the group. Since a tuple may satisfy the membership criterion of multiple groups, we introduce three different semantics to deal with such a case: (i) eliminate the tuple, (ii) put the tuple in any one group, and (iii) create a new group for this tuple. We implement and test the new SGB operators and their algorithms inside PostgreSQL. The overhead introduced by these operators proves to be minimal, and the execution times are comparable to those of the standard group-by.
The experimental study, based on TPC-H and social check-in data, demonstrates that the proposed algorithms can achieve up to three orders of magnitude improvement in performance over baseline methods developed to solve the same problem.

Index Terms—similarity query, relational database

M. Tang and R. Tahboub are with the Department of Computer Science, Purdue University, Indiana, IN, 47906. E-mail: tang49@purdue.edu
W.G. Aref is with the Department of Computer Science, Purdue, and Center for Education and Research in Information Assurance and Security (CERIAS).
M. Atallah is with the Department of Computer Science, Purdue University.
Q. Malluhi is with the Department of Computer Science and Engineering, Qatar University.
M. Ouzzani is with the Qatar Computing Research Institute.
Y. Silva is with the School of Mathematical and Natural Science, Arizona State University.

1 INTRODUCTION

The deluge of data accumulated from sensors, social networks, computational sciences, and location-aware services calls for advanced querying and analytics that often depend on efficient aggregation and summarization techniques. The SQL group-by operator is one main construct that is used in conjunction with aggregate operations to cluster the data into groups and produce useful summaries. Grouping is usually performed by placing tuples with equal values on a certain subset of the attributes into the same group. However, many applications (e.g., those in Section 5) need to group based on similar rather than strictly equal values.

Clustering [1] is a well-known technique for grouping similar data items in the multi-dimensional space. In most cases, clustering is performed outside of the database system. Moving the data outside of the database to perform the clustering and then back into the database for further processing results in a costly impedance mismatch.
Moreover, based on the needs of the underlying applications, the output clusters may need to be further processed by SQL to filter out some of the clusters and to perform other SQL operations on the remaining clusters. Hence, it would be greatly beneficial to develop practical and fast similarity group-by operators that can be embedded within SQL to avoid such impedance mismatch and to benefit from the processing power of all the other SQL operators.

SQL-based Similarity Group-by (SGB) operators were proposed in [2] to support several semantics for grouping similar but not necessarily equal data. Although several applications can benefit from using the existing SGB operators instead of the standard group-by, a key shortcoming of these operators is that they focus on one-dimensional data. Consequently, data can only be approximately grouped based on one attribute at a time.

arXiv:1412.4842v1 [cs.DB] 16 Dec 2014

In this paper, we introduce new similarity-based group-by operators that group multi-dimensional data using various metric distance functions. More specifically, we propose two SGB operators, namely SGB-All and SGB-Any, for grouping multi-dimensional data. SGB-All forms groups such that a tuple or a data item, say o, belongs to a group, say g, if o is at a distance within a user-defined threshold from all other data items in g. In other words, each group in SGB-All forms a clique of nearby