Automatically Analyzing Groups of Crashes for Finding Correlations

Marco Castelluccio
Mozilla, London, UK
University Federico II of Naples, Naples, Italy
marco.castelluccio@unina.it

Carlo Sansone
University Federico II of Naples, Naples, Italy
carlo.sansone@unina.it

Luisa Verdoliva
University Federico II of Naples, Naples, Italy
verdoliv@unina.it

Giovanni Poggi
University Federico II of Naples, Naples, Italy
poggi@unina.it

ABSTRACT
We devised an algorithm, inspired by contrast-set mining algorithms such as STUCCO, to automatically find statistically significant properties (correlations) in crash groups. Many earlier works focused on improving the clustering of crashes but, to the best of our knowledge, the problem of automatically describing properties of a cluster of crashes is so far unexplored. This means developers currently spend a fair amount of time analyzing the groups themselves, which in turn means that a) they are not spending their time actually developing a fix for the crash; and b) they might miss something in their exploration of the crash data (there is a large number of attributes in crash reports and it is hard and error-prone to manually analyze everything). Our algorithm helps developers and release managers understand crash reports more easily and in an automated way, helping in pinpointing the root cause of the crash. The tool implementing the algorithm has been deployed on Mozilla’s crash reporting service.

CCS CONCEPTS
• Software and its engineering → Software reliability;

KEYWORDS
Crashes; Crash Reports; Crash Analysis.

ACM Reference format:
Marco Castelluccio, Carlo Sansone, Luisa Verdoliva, and Giovanni Poggi. 2017. Automatically Analyzing Groups of Crashes for Finding Correlations. In Proceedings of 2017 11th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, Paderborn, Germany, September 4–8, 2017 (ESEC/FSE’17), 10 pages.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ESEC/FSE’17, September 4–8, 2017, Paderborn, Germany
© 2017 Association for Computing Machinery.
ACM ISBN 978-1-4503-5105-8/17/09...$15.00
https://doi.org/10.1145/3106237.3106306

1 INTRODUCTION
Fixing crashes is one of the top priorities for software organizations, as crashes are one of the main pain points for users and might lead them to leave. Even a single crash can dramatically worsen how users perceive a piece of software, especially if it causes the loss of important data. Acting quickly is thus essential to avoid losing users and to maintain high-quality software.

Several software organizations have deployed automated crash reporting systems, such as Mozilla’s Socorro [1] and Windows Error Reporting [12], which collect reports from users at the time of a crash. A report received by Socorro typically comprises more than a hundred attribute-value fields. These reports are then analyzed by dedicated personnel to find fixes and improve software quality. However, these systems collect a huge number of crash reports daily (about three hundred thousand reports per day for Socorro), far too many to process individually. Therefore, the typical workflow consists of two key phases: (1) crash report clustering; (2) cluster featuring and analysis.
The goal of clustering is to group together similar reports, as they likely originate from multiple instances of the same software problem. Once the problem is fixed, all these reports can be discarded at once from further analysis. Moreover, clustering allows one to compute valuable statistics on the cluster itself, enabling the second phase of the workflow. In fact, the typical features of interest in a cluster concern the frequency of occurrence of attribute-value pairs, which may provide useful hints for the solution of the problem. As an example, assume that a perfect clustering process succeeds in grouping together all crash reports caused by a given software bug, and assume also that all such reports are characterized by a distinctive feature which is never observed in reports of other clusters. While not conclusive, this observation would provide a strong clue for the analyst, and would probably allow a quick fix of the problem. This idealized process is summarized graphically in Figure 1.
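
To make the idealized example concrete, the comparison of attribute-value frequencies inside and outside a cluster can be sketched as follows. This is only a minimal illustration, not the algorithm proposed in this paper: it ranks pairs by a plain frequency difference and performs no statistical significance testing. The function names, the `min_gap` threshold, and the report attributes (`os`, `gpu_driver`) are hypothetical, and reports are assumed to be flat dictionaries with hashable values.

```python
from collections import Counter


def pair_frequencies(reports):
    """Return the fraction of reports containing each (attribute, value) pair."""
    counts = Counter()
    for report in reports:
        # Each report is assumed to be a flat dict of attribute -> hashable value.
        counts.update(report.items())
    total = len(reports)
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}


def distinctive_pairs(cluster, others, min_gap=0.5):
    """List pairs that are much more frequent in `cluster` than in `others`.

    `min_gap` is an arbitrary illustrative threshold on the frequency
    difference; it is not a statistical test.
    """
    in_freq = pair_frequencies(cluster)
    out_freq = pair_frequencies(others)
    result = []
    for pair, f_in in in_freq.items():
        f_out = out_freq.get(pair, 0.0)
        if f_in - f_out >= min_gap:
            result.append((pair, f_in, f_out))
    # Largest frequency gap first.
    result.sort(key=lambda item: item[1] - item[2], reverse=True)
    return result


# Hypothetical reports: every crash in the cluster shares one driver version.
cluster = [
    {"os": "Windows", "gpu_driver": "1.2"},
    {"os": "Linux", "gpu_driver": "1.2"},
]
others = [
    {"os": "Windows", "gpu_driver": "2.0"},
    {"os": "Linux", "gpu_driver": "2.0"},
]
found = distinctive_pairs(cluster, others)
```

In this toy data the pair `("gpu_driver", "1.2")` appears in every report of the cluster and in none of the others, so it is the only pair returned, whereas the `os` values are equally frequent in both groups and are filtered out.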