Relational Data Pre-Processing Techniques for Improved Securities Fraud Detection

Andrew Fast, Lisa Friedland, Marc Maier, Brian Taylor, David Jensen
Department of Computer Science
University of Massachusetts Amherst
Amherst MA 01003-9264 USA
[afast, lfriedl, maier, btaylor, jensen]@cs.umass.edu

Henry G. Goldberg, John Komoroske
National Association of Securities Dealers
1725 K Street NW
Washington, DC 20006-1516 USA
[henry.goldberg, john.komoroske]@nasd.com

ABSTRACT
Commercial datasets are often large, relational, and dynamic. They contain many records of people, places, things, events and their interactions over time. Such datasets are rarely structured appropriately for knowledge discovery, and they often contain variables whose meanings change across different subsets of the data. We describe how these challenges were addressed in a collaborative analysis project undertaken by the University of Massachusetts Amherst and the National Association of Securities Dealers (NASD). We describe several methods for data pre-processing that we applied to transform a large, dynamic, and relational dataset describing nearly the entirety of the U.S. securities industry, and we show how these methods made the dataset suitable for learning statistical relational models. To better utilize social structure, we first applied known consolidation and link formation techniques to associate individuals with branch office locations. In addition, we developed an innovative technique to infer professional associations by exploiting dynamic employment histories. Finally, we applied normalization techniques to create a suitable class label that adjusts for spatial, temporal, and other heterogeneity within the data. We show how these pre-processing techniques combine to provide the necessary foundation for learning high-performing statistical models of fraudulent activity.
Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications – Data Mining; I.2.6 [Artificial Intelligence]: Learning

General Terms
Algorithms, Measurement, Design, Experimentation.

Keywords
Fraud detection, data pre-processing, statistical relational learning, normalization, relational probability trees.

1. INTRODUCTION
The National Association of Securities Dealers (NASD) is charged with overseeing and regulating over 5,000 securities firms in the United States. One primary aim of NASD is to prevent and discover securities fraud and other forms of misconduct by member firms and their employees, called registered representatives, or reps. With over 659,000 reps currently employed, it is imperative for NASD to direct its limited regulatory resources towards the parties most likely to engage in risky behavior in the future. It is generally believed by experts at NASD and others that fraud happens among small, interconnected groups of individuals [2]. Due to the large number of potential interactions between reps, timely identification of persons and groups of interest is a sizeable challenge for NASD regulators. This work is a joint effort between researchers at the University of Massachusetts Amherst (UMass) and staff at NASD to identify effective, automated methods for detecting these high-risk entities to aid NASD in their regulatory efforts.

The securities fraud domain, along with other similar domains, presents many challenges to knowledge discovery practitioners. The NASD dataset is large, containing historical records on over 3.4 million reps, 360,000 branches and over 25,000 firms. These entities also have many rich interactions over time as reps change jobs, as they move among branches and firms, and as branches and firms change ownership. In addition, the background rate of misconduct varies over time and geography.

In this paper, we describe our techniques to address each of the challenges presented by the NASD dataset.
These techniques allow the transformation from raw data to high-performing models that are useful for detecting high-risk individuals. First, we utilized standard consolidation and link formation techniques as described by Goldberg and Senator [4] to infer branch entities from rep employment records. Using the inferred branch entities, we created a new technique for identifying groups of reps, which we call tribes. Tribes are groups of reps that move from branch to branch together over time.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
KDD’07, August 12–15, 2007, San Jose, California, USA.
Copyright 2007 ACM 978-1-59593-609-7/07/0008...$5.00.
Industrial and Government Track Paper

Finally, we addressed the variability in