978-1-7281-6251-5/20/$31.00 ©2020 IEEE

Graph Filtering to Remove the “Middle Ground” for Anomaly Detection

William Eberle
Department of Computer Science
Tennessee Tech University
Cookeville, TN, USA
weberle@tntech.edu

Lawrence Holder
School of Electrical Engineering & Computer Science
Washington State University
Pullman, WA, USA
holder@wsu.edu

Abstract—Discovering patterns and anomalies in the variety of voluminous data represented as graphs is challenging. Current research has demonstrated success in discovering graph patterns from a sample of the data, but there has been little work on discovering anomalies that depend on understanding what is normative. In this work we present two approaches to reducing graph data: subgraph filtering and graph filtering. The idea behind the proposed algorithms is the removal of a “murky middle”: data that is likely neither normative nor anomalous is removed from the discovery process. We empirically validate the proposed approaches on real-world, pseudo-real-world, and synthetic data, and compare them against a similar approach.

Keywords—graph-based anomaly detection, graph filtering, knowledge discovery

I. INTRODUCTION

The ever-increasing volume, velocity, and variety of data continues to challenge our ability to extract knowledge from it. In addition, interconnected relationships across data sources introduce further representational and computational challenges. Examples abound, including social networks, biological networks, communication networks, and even brain networks. The graph has emerged as an appropriate representation for such data, and numerous methods have been developed for extracting knowledge from networks. However, since the data we seek to analyze, such as telecommunications network traffic, continues to grow, we must be able to handle it in a computationally efficient way in order to extract relevant knowledge.
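The “murky middle” removal described in the abstract can be sketched concretely. The following toy example (with a hypothetical frequency-based score and invented function names, not the authors' implementation) ranks candidate subgraphs and discards the middle band, keeping only the most common candidates (potentially normative) and the rarest (potentially anomalous):

```python
def filter_middle(candidates, scores, keep_top, keep_bottom):
    """Drop the 'murky middle' of a scored candidate list.

    candidates : list of hashable subgraph identifiers
    scores     : dict mapping candidate -> frequency score
    keep_top   : number of highest-scoring (normative) candidates to keep
    keep_bottom: number of lowest-scoring (anomalous) candidates to keep
    """
    ranked = sorted(candidates, key=lambda c: scores[c], reverse=True)
    top = ranked[:keep_top]                                 # best normative candidates
    bottom = ranked[-keep_bottom:] if keep_bottom else []   # rarest candidates
    keep = set(top) | set(bottom)
    # preserve the original candidate order in the filtered result
    return [c for c in candidates if c in keep]

# Toy data: six candidate patterns with frequency scores.
scores = {"A": 50, "B": 42, "C": 20, "D": 18, "E": 3, "F": 1}
kept = filter_middle(list(scores), scores, keep_top=2, keep_bottom=2)
print(kept)  # the middle-scoring C and D are removed
```

By never re-examining the discarded middle band, later iterations perform fewer expensive graph matches, which is the computational payoff the paper targets.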
Current work in the area of pattern discovery and anomaly detection in networks, or more specifically in graphs, has had to deal with computational difficulties. Most approaches address this issue by handling only a sample of the graph [6][7][8][9], transforming the graph into a smaller representation of its structural properties [12][10], or reducing a visualization of the graph [14][11]. In each case, however, these methods do not address anomaly detection in graphs while also reducing the computational complexity. In this work, we develop new methods for filtering graphs so as to reduce the computational complexity without losing the ability to discover relevant normative patterns and interesting anomalous subgraphs.

We present two approaches to address the computational challenges: subgraph filtering, which reduces the number of subgraphs at each iteration of edge extensions when growing the list of candidate patterns; and graph filtering, which reduces the size of the input graph before searching for patterns and anomalies. What makes this novel is that we intelligently filter the graph rather than sample it. By excluding the “middle” of the graph, or the “middle” candidate subgraphs, we retain the best normative patterns and the best potential anomalous subgraphs, thereby reducing the number of graph matches by ignoring those subgraphs that should never be normative or anomalous. We evaluate these methods using actual, near real-time network data provided by one of the leading managed security services providers at a major telecommunications company, with input from their domain experts.

II. RELATED WORK

Much work has been done on sampling and filtering graphs in order to improve the efficiency of graph mining methods that operate on high-volume data. Kashtan et al. [6] propose a sampling method that randomly samples nodes in order to extract connected subgraph samples of order n.
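The node-sampling idea of Kashtan et al. can be sketched roughly as follows (a simplified illustration on a toy adjacency dict, not the published algorithm, which also reweights samples to correct for bias): grow a connected sample from a random edge by repeatedly adding a random neighboring node until it contains n nodes.

```python
import random

def sample_connected_subgraph(adj, n, seed=0):
    """Simplified sketch of Kashtan-style connected-subgraph sampling.

    adj : dict mapping node -> set of neighboring nodes
    n   : desired sample order (number of nodes)
    """
    rng = random.Random(seed)
    u = rng.choice(sorted(adj))        # random starting node
    v = rng.choice(sorted(adj[u]))     # random incident edge -> second node
    sample = {u, v}
    while len(sample) < n:
        # nodes adjacent to the current sample but not yet in it
        frontier = {w for s in sample for w in adj[s]} - sample
        if not frontier:
            break  # component exhausted before reaching n nodes
        sample.add(rng.choice(sorted(frontier)))
    return sample

# Toy graph (hypothetical data).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
print(sample_connected_subgraph(adj, 3))
```

Repeating this procedure and tallying how often each subgraph shape appears gives the concentration estimates used for motif finding.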
Their approach repeatedly samples subgraphs of order n until the desired sample size is reached. The concentration of the different subgraphs is then estimated and used to find motifs in the network. Wernicke [7] modified the sampling scheme of Kashtan et al. in order to correct its sampling bias. The TIES algorithm [8] also uses a sampling scheme; however, the objective of TIES is to sample a subgraph that is representative of the entire graph. Ahmed et al. [9] propose the PIES algorithm for sampling a representative subgraph from a large streaming graph represented as a sequence of edges. PIES samples edges randomly and stores the nodes of the sampled edges. It performs stream sampling by maintaining a reservoir of nodes: when the reservoir is full, it probabilistically drops old nodes and their incident edges from the sample in order to admit new nodes and edges. In all of the above cases, however, the approaches do not deal with anomaly detection. Sampling is complicated in our setting because, when finding normative patterns, we want to keep common structures and discard unique structures, whereas for anomaly detection we want the opposite. This suggests more of a filtering strategy that identifies “middle