SYNTHETIC DATA GENERATION CAPABILITIES FOR TESTING DATA MINING TOOLS

Daniel R. Jeske, Pengyue J. Lin, Carlos Rendón, Rui Xiao
University of California, Riverside
djeske@ucr.edu

Behrokh Samadi
Lucent Technologies
samadi@lucent.com

ABSTRACT

Recently, due to the commercial success of data mining tools, much attention has been paid to extracting hidden information from large databases to predict security problems and terrorist threats. Security applications are somewhat more complicated than commercial applications because (i) there is insufficient specific knowledge of what to look for, and (ii) the R&D labs developing these tools cannot easily obtain sensitive information due to security, privacy, or cost issues. Tools developed for security applications therefore require substantially more testing and revision in order to prevent costly errors. This paper describes a platform for generating realistic synthetic data that can facilitate the development and testing of data mining tools. The original applications for this platform were people information and credit card transaction data sets. In this paper, we introduce a new shipping container application that can support the testing of data mining tools developed for port security.

KEYWORDS

Knowledge Discovery and Data Mining, Synthetic Data Generation, Semantic Graphs, Shipping Container.

INTRODUCTION

Knowledge discovery and data mining (KDD) includes the technology of extracting unknown and possibly useful information from data. This process has been compared to finding a needle in a haystack. Despite its apparent complexity, KDD technology has proven successful in some commercial applications, such as fraud prevention and medical diagnosis. KDD is a powerful technology with great potential to help focus attention on the most important information in data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions.
Recently, due to this commercial success, much attention has been paid to extracting hidden information from large databases to predict security problems and terrorist threats. Security applications are somewhat more complicated than commercial applications because (1) there is insufficient specific knowledge of what to look for, and (2) the R&D labs developing these tools cannot easily obtain sensitive information due to security, privacy, or cost issues. KDD tools developed for security applications require substantially more testing and revision of rules in order to minimize false positives and false negatives, which could be very costly. The combination of (1) and (2) motivated us to develop a platform for generating realistic synthetic data that can facilitate the development and testing of KDD tools. Realistic synthetic data can serve as background data sets onto which hypothetical future scenarios can be overlaid. KDD tools can then be measured in terms of their false positive and false negative error rates. In addition, the availability of synthetic data sets provides necessary traction for new data mining ideas and approaches, and thereby facilitates the development and feasibility assessment of techniques that might otherwise die on the vine. To be adequate substitutes for real data, synthetic data sets need to be of reasonable quality. Pitfalls