Statistics and Computing (1995) 5, 59-72

Query languages for statistical databases

ABDULLAH UZ TANSEL
Baruch College, City University of New York, New York, NY 10010, USA

Received 1992

Statistical database management systems keep raw, elementary and/or aggregated data and include query languages with facilities to calculate various statistics from this data. In this article we examine statistical database query languages with respect to the criteria identified and taxonomy developed in Ozsoyoglu and Ozsoyoglu (1985b). The criteria include statistical metadata and objects, aggregation features and interface to statistical packages. The taxonomy of statistical database query languages classifies them with respect to the data model used, the type of user interface and method of implementation. Temporal databases are rich sources of data for statistical analysis. Aggregation features of temporal query languages, as well as the issues in calculating aggregates from temporal data, are also examined.

Keywords: Aggregation, statistical query languages, statistical databases, summary tables, temporal databases

1. Introduction

Database management systems (DBMS) maintain and manage the data about an organization and its operations. Traditionally, databases have been developed for commercial business data processing to allow easy and fast access to data, and to improve productivity of application development. Such databases can be labelled corporate database management systems (CDBMS). These databases provide vital information for the operation and management of the organizations they serve. They support day-to-day operation of the enterprise (such as transaction processing), as well as the functions of middle and top management (auditing, planning, staffing, marketing, etc.).
These functions require extensive use of reports that summarize and/or classify data extracted from the database and that present the results of applying various mathematical and statistical techniques, such as the calculation of averages, sums, indexes or trends. However, CDBMS generally provide limited support in this regard, perhaps providing no more than sums, averages, maxima and minima.

Moreover, CDBMS are not suitable for the management of demographic, census, social and economic data. These applications require extensive use of statistical analysis techniques that range from calculating simple summary statistics to complex statistical techniques such as factor analysis, discriminant analysis and so on. They also require special conceptual and internal modelling constructs which are not available in CDBMS. Furthermore, data aggregation features of CDBMS are add-on, ad hoc and usually inefficient.

0960-3174 © 1995 Chapman & Hall

Databases that provide statistical analysis capabilities and/or maintain data about large populations are called statistical databases (SDB). A statistical database management system (SDBMS) models data in a way suitable for the SDB user's needs and makes statistical analysis techniques available through its user interface. Thus, an SDBMS is expected to have powerful, easy-to-use, and efficient data aggregation features. For more advanced statistical data analysis requirements, the SDBMS provides interfaces to statistical analysis procedures, which may be transparent to users or produce explicit output data in a format to be fed into statistical packages.

Statistical software packages have been available for a long time. They have been widely and extensively used by economists and researchers in social sciences. Examples of such packages are SPSS, P-STAT, BMD and SAS.
However, the data management capabilities of these packages are limited and most user requirements are met by file management systems and customized application programs. To accommodate these needs, new features have been added, at an increasing pace, to the statistical packages: for example, B+ tree file organization in P-STAT (Buhler 1981), new data manipulation commands of SPSS-X (Fry 1981), and an SQL interface to SAS (SAS 1982). However, there are major differences between statistical packages and statistical databases. Statistical databases provide conceptual modelling of statistical