Software is Data Too

Andrian Marcus
Wayne State University
Department of Computer Science
5057 Woodward Ave., Detroit, MI 48202, USA
+1 313 577 5408
amarcus@wayne.edu

Timothy Menzies
West Virginia University
Lane Dept. of Comp. Science & Electrical Engineering
Morgantown, WV 26506, USA
+1 304 293 9127
tim@menzies.us

ABSTRACT
Software systems are designed and engineered to process data. However, software is data too. The size and variety of today's software artifacts and the multitude of stakeholder activities generate so much data that individuals can no longer reason about all of it. We argue in this position paper that data mining, statistical analysis, machine learning, information retrieval, data integration, and related techniques are necessary to deal with software data. New research is needed to adapt existing algorithms and tools to software engineering data and processes, and new ones will have to be created. For this type of research to succeed, it must be supported by new approaches to empirical work, in which data and results are shared globally among researchers and practitioners. Software engineering researchers can draw inspiration from other fields, such as bioinformatics, where the results of mining and analyzing biological data are often stored in databases shared across the world.

Categories and Subject Descriptors
D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement - Restructuring, reverse engineering, and reengineering; H.2.8 [Database Management]: Database Applications - Data mining; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Management, Algorithms, Performance, Experimentation

Keywords
Data mining, machine learning, information retrieval, statistical analysis, software engineering, empirical research.

1. HOW MUCH DATA IS IN SOFTWARE?
Software development no longer produces only source code and external documentation.
The global nature of today's development processes and the complexity and size of most software systems result in staggering amounts of data. How much data are we talking about? The source code of large systems comprises millions of lines of code and comments. Versioning systems keep copies of source code that has evolved over decades. Analysis data (static and dynamic) populate databases in the range of terabytes. Test data often exceeds the source code in size. Bug tracking systems store information not only about defects and their fixes, but also large amounts of text from developer discussions about these bugs. In distributed development environments, e-mail communications between developers are also stored. Usage and run-time data of deployed software are collected in huge databases. Developer activities (inside and outside a development environment) are often monitored and stored as well. Process and management information is routinely stored and analyzed. External documentation, ranging from requirements to user manuals, is part of any software system.

These are only the more common types of data that are generated, stored, and used during the life of a software system. There is more, but the picture is clear: the amount of data is so large that people can no longer reason about it without specialized, computer-assisted tool support. The key to dealing with so much data is extracting what is important to different stakeholders and tasks. Building software is no longer just an engineering problem; it is also an information management problem.

2. HOW TO DEAL WITH THE DATA?
Analyzing and managing software data are activities that software engineers are not trained to do. We have to look for solutions outside software engineering, adopt them, and make them our own. These solutions can come from data mining, information retrieval, machine learning, statistical analysis, and related fields. This is not the first time software engineers have looked to such solutions.
It has been going on, in one form or another, for about two decades. For example, data mining techniques have been proposed by many researchers to extract what is relevant to the stakeholders (i.e., developers, managers, testers, etc.), help them understand data about a software system and its development process, and make predictions about its future quality, cost, and evolution. Many traditional software engineering tasks and newer research areas already rely heavily on data mining, machine learning, statistical analysis, and similar techniques. The newer approaches include: search based software engineering (SBSE - http://www.sebase.org/), mining software repositories (MSR - http://www.msrconf.org/), recommendation systems in software engineering (RSSE - http://sites.google.com/site/rsseresearch), and predictor models in software engineering (PROMISE - http://promisedata.org/). These areas of research already have dedicated conferences or workshops, and their communities are growing.

Many software engineering tasks are supported with such techniques, used to analyze software data. These tasks include: defect prediction, effort estimation, impact analysis, bug triage, bug assignment, source code searching,

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
FoSER 2010, November 7–8, 2010, Santa Fe, New Mexico, USA.
Copyright 2010 ACM 978-1-4503-0427-6/10/11...$10.00.
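To make one of the tasks mentioned above concrete, defect prediction of the kind studied in the PROMISE community can be sketched as a simple learner over code metrics. The sketch below is illustrative only, not any published model: the module metrics, labels, and normalization constants are invented toy data, and a plain k-nearest-neighbor vote stands in for the many learners actually used in the literature.

```python
# Toy defect predictor: classify a module as defect-prone from two
# metrics (lines of code, cyclomatic complexity) mined, in real studies,
# from version-control and bug-tracking data. All values are invented.
import math

# (loc, cyclomatic_complexity) -> known to be defective? (toy training set)
train = [
    ((120, 4), False),
    ((800, 25), True),
    ((300, 9), False),
    ((1500, 40), True),
    ((90, 2), False),
    ((650, 30), True),
]

def predict(module, k=3):
    """Majority vote of the k nearest training modules in metric space."""
    def dist(a, b):
        # Normalize each metric by an assumed maximum before measuring distance.
        return math.hypot((a[0] - b[0]) / 1500, (a[1] - b[1]) / 40)
    neighbors = sorted(train, key=lambda t: dist(t[0], module))[:k]
    votes = sum(1 for _, defective in neighbors if defective)
    return votes > k // 2

print(predict((1000, 35)))  # large, complex module -> True
print(predict((150, 3)))    # small, simple module -> False
```

Real predictor models replace the toy table with thousands of modules mined from repositories and the nearest-neighbor vote with learners such as naive Bayes or decision trees, but the workflow (metrics in, defect-proneness out) is the same.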