Implementation, Use, and Sharing of Data Structures in Java Programs

Syed S. Albiz and Patrick Lam
University of Waterloo

ABSTRACT

Programs manipulate data. For many classes of programs, this data is organized into data structures. Java’s standard libraries include robust, general-purpose data structure implementations; however, standard implementations may not meet developers’ needs, forcing them to implement ad-hoc data structures. The well-organized use of standard data structure implementations contributes to good modularity. We empirically investigate this aspect of modularity (namely, the implementation, use, and sharing of data structures in practice) by developing a tool to statically analyze Java libraries and applications. Our DSFinder tool reports 1) the number of likely and possible data structure implementations and interfaces in a program and 2) characteristics of the program’s uses of data structures. We applied our tool to 62 open-source Java programs and manually classified possible data structures. We found that 1) developers overwhelmingly used Java data structures over ad-hoc data structures; 2) applications and libraries confine data structure implementation code to small portions of a software project. XXX something about exposure/sharing.

1. INTRODUCTION

Data structures are central to many software systems. Classically, programmers implement in-memory data structures with pointers. Understanding a software system typically requires understanding how the system manipulates data structures as well as understanding the relationships between its different data structures. Well-designed software systems with modular, well-encapsulated data structures are clearly easier to understand and maintain than systems where data structures are shared between and manipulated by many disparate parts of the code.
Also, no matter how well-encapsulated the data structure manipulations may be, such code always poses a challenge to static analysis techniques, as it requires intricate reasoning about the code’s behaviour.

Modern programming environments, however, include rich standard libraries. Since version 1.0, Java has included data structure implementations in its library. Java 2’s Collections API [23] defines standard interfaces for data structures and includes implementations of standard data structures. While the general contract of a data structure is to implement a mutable set, general-purpose data structure implementations might not meet developers’ specific needs, forcing them to implement ad-hoc data structures.

The goals of our research are: 1) to understand how often developers implement ad-hoc data structures versus how often they use library data structures, and 2) to estimate how widely programs share data structures between their different modules. We are particularly interested in how programs organize information in the heap: do they use system collections such as LinkedList and HashMap, or do they implement their own ad-hoc lists, trees, graphs, and maps using unbounded-size pointer structures, like C programmers? Beyond the implementation style, we also seek to understand whether data structure manipulations are confined to a small set of classes, or whether data structures are accessed and updated by dozens of different classes.
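To make the distinction between ad-hoc and library data structures concrete, the following minimal sketch contrasts the two styles. The class names IntNode and AdHocVersusLibrary are hypothetical and chosen only for illustration; they do not come from the studied corpus. The hand-written list links nodes through a self-referential field, while the library version delegates all linking to java.util.LinkedList.

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

// Hypothetical ad-hoc list node: the self-referential `next` field lets the
// structure grow without bound, in the style of a C linked list.
final class IntNode {
    final int value;
    IntNode next; // pointer to the rest of the list, or null at the end

    IntNode(int value, IntNode next) {
        this.value = value;
        this.next = next;
    }
}

final class AdHocVersusLibrary {
    // Sums the values by walking the hand-written pointer structure.
    static int sumAdHoc(IntNode head) {
        int total = 0;
        for (IntNode n = head; n != null; n = n.next) {
            total += n.value;
        }
        return total;
    }

    // Sums the values of a library collection; the node linking is hidden
    // inside the java.util implementation.
    static int sumLibrary(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        IntNode head = new IntNode(1, new IntNode(2, new IntNode(3, null)));
        List<Integer> values = new LinkedList<>(Arrays.asList(1, 2, 3));
        System.out.println(sumAdHoc(head));     // prints 6
        System.out.println(sumLibrary(values)); // prints 6
    }
}
```

In the first style, every client that traverses or updates the list must manipulate next pointers directly; in the second, clients see only the List interface.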
Our results can help guide research in higher-level program understanding and verification (e.g., [2, 11]) and the development of software maintenance tools by identifying the code idioms that analysis tools need to understand. For instance, linked-list data structure manipulations require shape analysis techniques. Our operational definition of a data structure is therefore driven by static analysis considerations: what types of analysis suffice to understand typical Java applications? However, while our primary motivation is to investigate the necessity for shape analysis, we believe that our results have broader implications for software engineering in general, especially in terms of understanding modularity as well as how programs are built.

In this paper, we present the results of our analysis of data structure implementation, interfaces, and exposure in a corpus of 62 open-source Java programs and libraries. Our work was driven by the following three hypotheses.

Hypothesis 1: Implementations. It is possible to automatically identify data structure implementations. Pointer-based data structure implementations are extremely rare.

Hypothesis 2: Interfaces. It is possible to automatically identify data structure interfaces. Interfaces are ubiquitous.

Hypothesis 3: Sharing. While many collections are exposed to clients, the actual sharing of changing collections between program modules is rare.

To explore these hypotheses, we developed a number of definitions for data structure implementations and interfaces. We have developed an analysis tool which identifies data structure implementations, interfaces, and exposure. To identify implementations, it searches for recursive type definitions and arrays, which signal the possible presence of sets of unbounded size. A simple analysis of a Java program’s class definitions (available in the program’s bytecode) thus suffices to identify its potential data structures.
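The idea of flagging recursive type definitions and array fields can be sketched as follows. This is an illustration under stated assumptions, not DSFinder’s implementation: it uses Java reflection on loaded classes as a stand-in for bytecode analysis, it treats a field as recursive only when its declared type is exactly the declaring class, and the names RecursiveTypeSketch, TreeNode, Buffer, and Point are hypothetical.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

final class RecursiveTypeSketch {
    // Hypothetical classes to scan.
    static final class TreeNode { TreeNode left; TreeNode right; int key; }
    static final class Buffer { int[] slots; int size; }
    static final class Point { int x; int y; }

    // Names of instance fields whose declared type is the declaring class
    // itself (simplifying assumption: exact type match only).
    static List<String> recursiveFields(Class<?> c) {
        List<String> names = new ArrayList<>();
        for (Field f : c.getDeclaredFields()) {
            if (Modifier.isStatic(f.getModifiers()) || f.isSynthetic()) {
                continue; // skip static and compiler-generated fields
            }
            if (f.getType() == c) {
                names.add(f.getName());
            }
        }
        return names;
    }

    // Names of instance fields whose declared type is an array.
    static List<String> arrayFields(Class<?> c) {
        List<String> names = new ArrayList<>();
        for (Field f : c.getDeclaredFields()) {
            if (Modifier.isStatic(f.getModifiers()) || f.isSynthetic()) {
                continue;
            }
            if (f.getType().isArray()) {
                names.add(f.getName());
            }
        }
        return names;
    }

    // A class is a potential data structure if it has either kind of field.
    static boolean isCandidate(Class<?> c) {
        return !recursiveFields(c).isEmpty() || !arrayFields(c).isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(isCandidate(TreeNode.class)); // prints true
        System.out.println(isCandidate(Buffer.class));   // prints true
        System.out.println(isCandidate(Point.class));    // prints false
    }
}
```

A check of this kind only yields candidates: a class with an array field may hold a fixed-size buffer rather than an unbounded set, so further classification is still needed.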
Our tool applies several automatic type- and name-based classification