Investigation of Failure Causes in Workload-Driven Reliability Testing*

Domenico Cotroneo and Roberto Pietrantuono
Dipartimento di Informatica e Sistemistica
Università degli Studi di Napoli Federico II
Via Claudio 21, 80125 - Naples, Italy
{cotroneo,roberto.pietrantuono}@unina.it

Leonardo Mariani and Fabrizio Pastore
Dipartimento di Informatica, Sistemistica e Comunicazione
Università degli Studi di Milano Bicocca
Via Bicocca degli Arcimboldi, 8 - 20126 Milano, Italy
{mariani,pastore}@disco.unimib.it

ABSTRACT

Virtual execution environments and middleware are required to be extremely reliable because applications running on top of them are developed assuming their correctness, and platform-level failures can result in serious and unexpected application-level problems. Since software platforms and middleware are often executed for a long time without any interruption, a large part of the testing process is devoted to investigating their behavior during long and stressful executions (these test cases are called workloads). When a problem is identified, software engineers examine log files to find its root cause. Unfortunately, because of the length of workloads, log files can contain a huge amount of information, and manual analysis is often prohibitive. Thus, de facto, the identification of the root cause is mostly left to the intuition of the software engineer.

In this paper, we propose a technique to automatically analyze logs obtained from workloads and retrieve important information that can relate a failure to its cause. The technique works in three steps: (1) during workload executions, the system under test is monitored; (2) logs extracted from workloads that have completed successfully are used to derive compact and general models of the expected behavior of the target system; (3) logs corresponding to workloads terminated unsuccessfully are compared with the inferred models to identify anomalous event sequences.
Anomalies help software engineers to identify failure causes. The technique can also be used during the operational phase, to discover possible causes of unexpected failures by comparing logs corresponding to failing executions with models derived at testing time. Preliminary experimental results conducted on the Java Virtual Machine indicate that several bugs can be rapidly identified thanks to the feedback provided by our technique.

* This work has been supported by both MIUR, under the project PRIN 2006-2007 "Mutant hardware/software components for dynamically reconfigurable distributed systems" (COMMUTA), and the European Community, under the project FP6 "Self-Healing Approach to Designing Complex Software Systems" (SHADOWS).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SOQUA'07, September 3-4, 2007, Dubrovnik, Croatia.
Copyright 2007 ACM 978-1-59593-724-7/07/09 ...$5.00.

Categories and Subject Descriptors

D.2.5 [Software Engineering]: Testing and Debugging—Monitors, Tracing

General Terms

Reliability

Keywords

log file analysis, automated analysis, model inference, workload execution, JVM monitoring

1. INTRODUCTION

Nowadays we are witnessing an increasing use of virtual execution environments and middleware platforms for the development of large and complex applications. Examples are Java Virtual Machines (JVMs), enterprise systems, and CORBA-based platforms. During application testing, a large part of the testing process is devoted to investigating the behavior of the underlying platforms during long and stressful executions (these test cases are called workloads).
When a problem is detected, software engineers examine log files to gain insights about failure manifestations and to discover potential root causes. Although several research efforts have addressed the design and implementation of (semi)automatic tools for log file analysis, such as [9] and [10], the problem of automatically analyzing huge log files has been only partially solved [1, 13, 4]. De facto, it is extremely hard to understand failure manifestations, error propagation and isolation, and to discover potential root causes.

In this paper, we propose a technique to collect and analyze log files to obtain important information about failure manifestations. The technique works in three steps:

1. during workload executions, the system under test is monitored;

2. logs corresponding to workloads terminated correctly are used to derive a model of the behavior expected from the target system;
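The three steps summarized in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the event names and logs are hypothetical, and the "model" here is just the set of event transitions observed in passing executions, whereas the actual technique infers compact and general behavioral models.

```python
# Hypothetical sketch: infer a transition model from passing-workload logs
# (step 2) and flag never-observed transitions in a failing log (step 3).
from collections import defaultdict

def infer_model(passing_logs):
    """Step 2: record, for each event, which events followed it in
    successfully completed workloads."""
    transitions = defaultdict(set)
    for log in passing_logs:
        for curr, nxt in zip(log, log[1:]):
            transitions[curr].add(nxt)
    return transitions

def find_anomalies(model, failing_log):
    """Step 3: report event transitions in the failing log that were
    never seen in any passing execution."""
    return [(curr, nxt)
            for curr, nxt in zip(failing_log, failing_log[1:])
            if nxt not in model.get(curr, set())]

# Step 1 (monitoring) would produce event logs such as these
# (invented event names, for illustration only):
passing = [["load", "verify", "init", "run", "exit"],
           ["load", "verify", "run", "exit"]]
failing = ["load", "init", "run", "gc", "exit"]

model = infer_model(passing)
print(find_anomalies(model, failing))
# [('load', 'init'), ('run', 'gc'), ('gc', 'exit')]
```

The reported anomalous transitions are the kind of feedback meant to direct the engineer's attention toward the failure cause, instead of a manual scan of the whole log.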