Choosing the content of textual summaries of large time-series data sets

JIN YU, EHUD REITER, JIM HUNTER AND CHRIS MELLISH
Department of Computing Science, University of Aberdeen, Aberdeen, United Kingdom AB24 3UE
email: {jyu, ereiter, jhunter, cmellish}@csd.abdn.ac.uk

Abstract

Natural Language Generation (NLG) can be used to generate textual summaries of numeric data sets. In this paper we develop an architecture for generating short (a few sentences) summaries of large (100KB or more) time-series data sets. The architecture integrates pattern recognition, pattern abstraction, selection of the most significant patterns, microplanning (especially aggregation), and realisation. We also describe and evaluate SumTime-Turbine, a prototype system which uses this architecture to generate textual summaries of sensor data from gas turbines.

1 Introduction

It is often said in the NLP community that the world is being flooded with text, but the flood of text is insignificant compared to the flood of data. A professional writer may write as much as 1MB of text in a year; the sensors in the writer's car can easily produce this much data each day, as he or she drives to work and back. Kilgarriff and Grefenstette (2003) estimate that there are 20TB of text on the World-Wide Web; 100TB or more of data is produced each day simply from sensors in aircraft engines (Hey and Trefethen 2003). Currently, numeric time-series data is usually presented to people visually (Spence 2001). However, textual summaries are advantageous when graphical reports require domain knowledge to interpret, or cannot be delivered to end-users. For example, weather forecasts generated from weather data are sent to users' mobile phones as text rather than as graphics.
Good human-written summaries of data can certainly be effective (Law, Freer, Hunter, Logie, McIntosh and Quinn 2005), but they are prohibitively expensive to produce. The challenge for Natural Language Generation (NLG) is to produce good summaries of numerical time-series data automatically.
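To make the five pipeline stages named in the abstract concrete, the following is a minimal illustrative sketch on a toy time series. It is not the authors' implementation: the function names, the naive spike detector, and the magnitude threshold are all hypothetical, chosen only to show how the stages compose.

```python
# Toy sketch of the five-stage summarisation pipeline from the abstract.
# All names and heuristics here are illustrative, not SumTime-Turbine's.

def detect_patterns(series):
    """Pattern recognition: find local maxima ("spikes") in the series."""
    return [{"type": "spike", "time": i, "value": series[i]}
            for i in range(1, len(series) - 1)
            if series[i] > series[i - 1] and series[i] > series[i + 1]]

def abstract_patterns(patterns):
    """Pattern abstraction: label each spike with a coarse magnitude class
    (hypothetical threshold of 100)."""
    for p in patterns:
        p["size"] = "large" if p["value"] > 100 else "small"
    return patterns

def select_significant(patterns, limit=2):
    """Content selection: keep only the most significant patterns."""
    return sorted(patterns, key=lambda p: p["value"], reverse=True)[:limit]

def microplan(patterns):
    """Microplanning: aggregate patterns of the same class into one message."""
    groups = {}
    for p in patterns:
        groups.setdefault(p["size"], []).append(p)
    return [f"{len(ps)} {size} spike(s) at time(s) "
            + ", ".join(str(p["time"]) for p in ps)
            for size, ps in groups.items()]

def realise(messages):
    """Realisation: render the messages as an English sentence."""
    return "There were " + "; ".join(messages) + "."

series = [10, 50, 10, 200, 10, 150, 10]
summary = realise(microplan(select_significant(
    abstract_patterns(detect_patterns(series)))))
```

On this toy series the pipeline detects three spikes, keeps the two largest, aggregates them into a single "large spike" message, and realises one sentence; the real system replaces each stage with domain-specific components for gas-turbine sensor data.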