Experience Report: Verifying Data Interaction Coverage to Improve Testing of Data-intensive Systems
The Norwegian Customs and Excise Case Study

Sagar Sen, Carlo Ieva, Arnab Sarkar
Certus V&V Center, Simula Research Laboratory, Oslo, Norway
Email: {sagar,carlo,arnab}(at)simula.no

Atle Sander, Astrid Grime
Directorate of Norwegian Customs and Excise
Email: {Atle.Sander,Astrid.Grime}(at)toll.no

Abstract—Testing data-intensive systems is paramount to increase our reliance on information processed in e-governance, scientific/medical research, and social networks. A common practice in the industrial testing process is to use test databases copied from live production streams to test the functionality of complex database applications that manage the well-formedness of data and its adherence to business rules in these systems. This practice is often based on the assumption that the test database adequately covers realistic scenarios to test, hopefully, all functionality in these applications. There is a need to systematically evaluate this assumption. We present a tool-supported method to model realistic scenarios and verify whether copied test databases actually cover them and consequently facilitate adequate testing. We conceptualize realistic scenarios as data interactions between fields cross-cutting a complex database schema and model them as test cases in a classification tree model. We present a human-in-the-loop tool, DEPICT, that uses the classification tree model as input to (a) facilitate interactive selection of a connected subgraph from the often many possible paths of interactions between tables specified in the model, (b) automatically generate SQL queries to create an inner join between the tables in the connected subgraph, and (c) extract records from the join and generate a visual report of satisfied and unsatisfied interactions, thereby quantifying the test adequacy of the test database.
We report our experience as a qualitative evaluation of the approach on a large industrial database from the Norwegian Customs and Excise information system TVINN, featuring large and complex databases with millions of records.

I. INTRODUCTION

Data-intensive software systems are increasingly prominent in driving global processes such as e-governance, data curation for scientific and medical research, and social networking. Large amounts of data are collected, processed, and stored by these systems in databases. For example, the Directorate of the Norwegian Customs and Excise (DNCE) uses the TVINN system (http://toll.no/) to process about 30,000 customs declarations a day coming in from both individuals and enterprises. The live transaction stream of declarations is processed for conformance to well-formedness rules, customs laws, and regulations by complex batch applications. This scenario is prevalent in many data-intensive software systems dealing with transaction data, which comprises semi-structured/structured data in medium/high volume.

The typical process to rapidly and effectively test batch applications (including regression testing [24]) on data-intensive systems involves using input test databases regularly copied from the live transaction processing stream. Using test databases is based on the assumption that data from live transactions represents realistic scenarios. These realistic scenarios correspond to patterns found in test databases that are expected either to demonstrate the correctness of application functionality or to uncover bugs in the way transactions were processed by batch applications. For instance, testing a customs regulation or business rule in the TVINN system for value added tax (VAT) on an alcohol such as whisky requires a test database with customs declarations for imports of a specific type of alcohol (whisky) carrying a specific kind of tax (VAT).
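The coverage check behind such an example can be sketched in miniature. The following is an illustrative sketch only, using an in-memory SQLite database with hypothetical table and column names (declaration, decl_item, commodity, tax_type) that do not reflect TVINN's actual schema: a data interaction is considered covered when an inner join over the relevant tables, filtered on the interaction's field values, returns at least one record.

```python
import sqlite3

# Hypothetical, highly simplified two-table schema; the real TVINN
# schema is far larger and more complex.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE declaration (decl_id INTEGER PRIMARY KEY, declarant TEXT);
CREATE TABLE decl_item (item_id INTEGER PRIMARY KEY, decl_id INTEGER,
                        commodity TEXT, tax_type TEXT);
INSERT INTO declaration VALUES (1, 'ACME Imports'), (2, 'Nordic Foods');
INSERT INTO decl_item VALUES
  (10, 1, 'whisky', 'VAT'),
  (11, 2, 'cheese', 'DUTY');
""")

def interaction_covered(commodity, tax_type):
    # An interaction (commodity, tax_type) cross-cutting the two tables
    # is covered if the inner join yields at least one matching record.
    cur.execute("""
        SELECT COUNT(*) FROM declaration d
        INNER JOIN decl_item i ON d.decl_id = i.decl_id
        WHERE i.commodity = ? AND i.tax_type = ?
    """, (commodity, tax_type))
    return cur.fetchone()[0] > 0

print(interaction_covered('whisky', 'VAT'))   # True: satisfied interaction
print(interaction_covered('whisky', 'DUTY'))  # False: unsatisfied interaction
```

Running all interactions specified in a classification tree model through such a check, and tallying satisfied versus unsatisfied ones, is the intuition behind the adequacy report described in the abstract.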
There is often a high probability that declarations coming into TVINN have exercised, and consequently tested, a large number of business rules. However, despite the high practical reliance on such test databases, there exist very few systematic approaches to verify their adequacy for testing. This is the problem area we address in the overall testing process. We believe that verifying test databases in long-running data-intensive systems with a medium/high volume of transactions will ensure adequate coverage [13] to test batch applications. Verification will also make the overall testing process more efficient, as manual testing (such as the creation of specific declarations by customs personnel) can be limited to those cases that have not been covered by data from live streams. Therefore, we ask: how can we automate and simplify steps in the verification of a test database to improve testing of data-intensive systems?