A W@DIS-based data quality analysis of the energy levels and wavenumbers of isotopologues of the water molecule

A. Fazliev 1, O. Naumenko 1, A. Privezentsev 1, A. Akhlyostin 1, N. Lavrentiev 1, A. Kozodoev 1, S. Voronina 1, A. Apanovich 2, A.G. Császár 3, and J. Tennyson 4

1. Institute of Atmospheric Optics SB RAS, Tomsk 634055, Russia
2. Institute of Informatics Systems SB RAS, Novosibirsk, Russia
3. MTA-ELTE Complex Chemical Systems Research Group, Budapest, Hungary
4. Department of Physics and Astronomy, University College London, London WC1E 6BT, United Kingdom

The XVIII-th Symposium and School on High Resolution Molecular Spectroscopy, June 30 - July 4, 2015, Tomsk, Russia

2. Water line lists

A critical evaluation of the ro-vibrational spectra of nine major water isotopologues was performed [1-4]. One of the subjects of these IUPAC-sponsored activities [5] was the evaluation and validation of all the published measured spectra of these isotopologues. Following the MARVEL algorithm [6] and employing high-level first-principles data, the measured transitions and energy levels were made fully consistent. A small part of the measured transitions had to be rejected, while some of the published spectra had to be recalibrated. The results obtained, together with lists of the validated and rejected transitions and energy values, were imported into the W@DIS [7] and RESPECTH [8] information systems.

The development of the W@DIS system was motivated in part by the fact that tens of articles on the spectral parameters of water (for instance, energy levels and molecular transitions) are published every year. These publications contain new data on parameters relevant to the water molecule or list more accurately measured energy levels, transitions, and the like. In a few cases, the newly published data were found to be inconsistent with those presented by other researchers engaged in these investigations.
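The notion of consistency between measured transitions and energy levels used above can be illustrated with a small check: in a consistent set, each measured wavenumber agrees, within its uncertainty, with the difference between the energies of the upper and lower states. The sketch below is not the MARVEL algorithm itself; it only tests this condition for a given set of levels. The state labels, the numerical values, the `Transition` class, and the rejection factor `k` are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    upper: str          # label of the upper state (hypothetical label)
    lower: str          # label of the lower state (hypothetical label)
    wavenumber: float   # measured vacuum wavenumber, cm^-1
    uncertainty: float  # stated measurement uncertainty, cm^-1


def residuals(levels, transitions):
    """Pair each transition with (measured - (E_upper - E_lower))."""
    pairs = []
    for t in transitions:
        predicted = levels[t.upper] - levels[t.lower]
        pairs.append((t, t.wavenumber - predicted))
    return pairs


def rejected(levels, transitions, k=3.0):
    """Transitions whose residual exceeds k times the stated uncertainty.

    The factor k is an assumption of this sketch, not a value from the text.
    """
    return [t for t, r in residuals(levels, transitions)
            if abs(r) > k * t.uncertainty]


# Illustrative, made-up energy levels (cm^-1) and measurements.
levels = {"a": 0.0, "b": 100.0, "c": 250.0}
measured = [
    Transition("b", "a", 100.001, 0.002),
    Transition("c", "a", 250.000, 0.002),
    Transition("c", "b", 150.050, 0.002),  # deliberately inconsistent
]

bad = rejected(levels, measured)
for t in bad:
    print(f"inconsistent: {t.upper} <- {t.lower} at {t.wavenumber} cm^-1")
```

In this toy data set only the third transition fails the check, which mirrors the situation described above where a small part of the measured transitions had to be rejected.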
W@DIS contains several applications that provide facilities for spectral data export and import, comparison of the spectral data related to certain spectroscopic tasks, and representation of the data properties. Most of the data properties are indicative of data quality, i.e. of the validity of and trust in the expert data available. In this work, user interfaces are described and computer-generated reports on spectral data quality for all isotopologues of the water molecule are presented.

1. Data sources. Import and export

3. A computed description of data sources. Information sources

Many research teams compile collections of spectral data. The amount of data in these collections is growing very rapidly, as is the body of information about the properties of the data they contain. Users choose one collection or another, or a part thereof, according to the set of properties of that collection. That is why collections must have a reasonable set of data properties that allow automatic comparison of collections, thus helping the end user make an optimal choice. In W@DIS, the construction of reports is automated by means of ontologies. A key role in the reports is played by data sources for publications, transitions, and states. Structures of information sources for a molecular state and transition [13] are presented in Figs. 4 and 5.

Figure 5. Structure of the data source describing a molecular transition

4. A systematization of information sources, molecular states, and transitions

Footnotes: A - information sources describing at least one identified transition, B - information sources describing all transitions satisfying selection rules, C - information sources describing some unidentified transitions, A including B and some C, VQN - variational quantum numbers.

Introduction

Research into a subject domain is known to consist of several stages.
These are the collection of facts, construction of subject domain models, comparison of the proposed models with those developed by other researchers, provision of access to the models for other researchers, and, finally, publication of the models. This brings up questions such as the following: 1) Is the set of facts collected and generated by a researcher complete? 2) Are these facts consistent with each other? 3) Does the formal language of the model specification allow researchers to build subject domain models that are adequate to the collected facts? 4) Is the proposed model consistent with those developed by other researchers? 5) How can quick access to the results of investigations be gained? These topical questions arise in molecular spectroscopy, which is one of the fields of physics widely used in many applied research areas.

Spectral data sets require a systematization of information and the design of data processing software. Software implementation implies the construction of subject domain models related to these data as well as the development of facilities for searching for the relevant information resources. In spectroscopy, such resources are solutions of spectroscopic tasks. In quantitative spectroscopy, different research groups have acquired sets of expert data [9,10] that were found to be inconsistent with each other. Ensuring consistency between the data obtained by different investigators involved in collecting expert data is one of the main tasks not only in spectroscopy, but also in other subject domains.

In the mid-2000s, an IUPAC project was launched with the task of building an information system intended for the collection of all currently available published data on the water molecule and water isotopologues. The system contains facilities for finding inconsistencies between the vacuum wavenumbers. There are formal and informal criteria for matching inconsistent data.
The former criteria include selection rules, root-mean-square deviations (RMSDs), the difference between the vacuum wavenumbers of identical transitions, etc. The latter criteria imply expert assessments of data quality like those used in the information system discussed here, viz. W@DIS. An ontology (logical theory) of information resources for the water molecule was built to describe the properties of published solutions of spectroscopic tasks [11]. At present, two groups of ontologies developed in W@DIS provide a description of the state of the art of the published data and descriptions of states and transitions in water spectroscopy that can be accessed via the Internet (http://wadis.saga.iao.ru).

The goal of this work is to collect publications on the water molecule and its isotopologues, import spectral data describing the results of computer processing of measured and calculated data, develop software for the alignment of data relating to solutions of spectroscopic tasks, automatically build the information sources and a few ontologies for quantitative spectroscopy, and assess the data quality of a complete set of the relevant data.

The work was performed in 3 stages. First, publications on the water molecule for 1929-2015 were collected and systematized, and a quantum number notation was selected for building the W@DIS database. Second, data were imported into W@DIS, and computer-generated ontologies of information resources, states, and transitions were obtained to assess the quality of the imported data. Associated with the imported data sources are the data source properties generated with the use of a specific metadata set. The metadata are intended for searching for information resources in accordance with a number of criteria. A key criterion is the validity of the collected values of the physical quantities involved.
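Two of the formal criteria named above, the difference between the vacuum wavenumbers of identical transitions and the RMSD, can be sketched in a few lines. The example is a minimal illustration and not the W@DIS implementation: the dictionary layout, the transition keys, the numerical values, and the tolerance `delta` are assumptions made for this sketch.

```python
import math


def compare_sources(source_a, source_b, delta):
    """Compare two data sources on the transitions they share.

    Each source is assumed to be a dict mapping a transition identifier
    to a vacuum wavenumber in cm^-1. Returns the RMSD over the shared
    transitions and the identifiers whose difference exceeds delta.
    """
    shared = sorted(set(source_a) & set(source_b))
    diffs = {key: source_a[key] - source_b[key] for key in shared}
    if diffs:
        rmsd = math.sqrt(sum(d * d for d in diffs.values()) / len(diffs))
    else:
        rmsd = float("nan")  # no identical transitions to compare
    inconsistent = [key for key in shared if abs(diffs[key]) > delta]
    return rmsd, inconsistent


# Made-up values for two hypothetical data sources.
source_a = {"t1": 1000.000, "t2": 1500.250, "t3": 2000.000}
source_b = {"t1": 1000.003, "t2": 1500.250, "t4": 3000.000}

rmsd, inconsistent = compare_sources(source_a, source_b, delta=0.002)
print(f"RMSD over shared transitions: {rmsd:.5f} cm^-1")
print("inconsistent transitions:", inconsistent)
```

Only transitions present in both sources enter the comparison, so `t3` and `t4` are ignored here; how unmatched transitions should be reported is left open in this sketch.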
The imported data and generated data properties make up an information source pertaining to the solution of one or another spectroscopic task. This aim entails the construction of a data import system, the development of applications that generate spectral line lists in which inconsistent transitions can be found, the development of logical theories of publications on water spectroscopy, and the selection of sets of criteria for assessing the validity of and trust in the data sources for water spectroscopy.

In W@DIS, data sources are parts of publications containing data about the solution of one of the six spectroscopic tasks [11]. Data source import and export are detailed in monograph [12]. The data distribution statistics for water isotopologues and spectroscopic tasks to be solved are presented in Table 1.

Table 1. Statistics of the data sources imported into W@DIS depending on the type of spectroscopic task and form of representation of quantum numbers

A special facility is used to make a water line list containing the measured values of the characteristics of transitions available in W@DIS. The interface for this application is shown in Fig. 1. The application can generate both a complete spectral line list and its individual parts, for example, all transitions in a given spectral range or in a certain vibrational band. In the tabular representation of the line list, inconsistent transitions are highlighted in color. Transitions are considered inconsistent when the difference in the vacuum wavenumbers of identical transitions exceeds a certain value Δ set by the researcher. By default, W@DIS uses the spectral ranges and values of Δ shown in Fig. 2. A more detailed analysis of the inconsistency between the data sources is given in Section 5.

Figure 1. Representation of a spectral line list illustrated by the example of H2O

Figure 2. List of wavenumber ranges and the corresponding values of Δ

Figure 3.
Tabular representation of the properties of expert data source 1956_RoVaNi

In computer data processing, the information stored in databases can be represented in the form of different structures. W@DIS uses two groups of data representation. One group describes data sources, which are parts of publications. The other group represents particular transitions and states of the water molecule. To each particular transition there corresponds a set of values of the vacuum wavenumber of this transition. The properties of the data acquired from the data sources, and of the information objects describing states and transitions, are chosen so that their analysis allows the quality of the data sources, states, and transitions, respectively, to be assessed. In W@DIS, the values of the data properties are computed automatically. A data source and its properties, as well as the information objects describing states and transitions and their properties, comprise information sources. For the end user, the properties retrieved from an information source are represented in W@DIS in the tabular form shown in Fig. 3.

Figure 4. Structure of the data source describing a molecular state
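The relation between the two groups of data representation described above can be sketched as a regrouping step: records organized by data source are rearranged so that each transition carries the set of vacuum wavenumber values reported for it, from which simple properties are then computed. The record layout, the source names, the state labels, and the properties chosen here are hypothetical and serve only as an illustration; the actual W@DIS property set is not reproduced.

```python
from collections import defaultdict


def group_by_transition(records):
    """Regroup (source_id, transition, wavenumber) records by transition."""
    grouped = defaultdict(list)
    for source_id, transition, wavenumber in records:
        grouped[transition].append((source_id, wavenumber))
    return dict(grouped)


def transition_properties(values):
    """Compute illustrative properties of one transition's value set."""
    wavenumbers = [w for _, w in values]
    return {
        "n_sources": len({s for s, _ in values}),
        "n_values": len(wavenumbers),
        "min": min(wavenumbers),
        "max": max(wavenumbers),
        "spread": max(wavenumbers) - min(wavenumbers),
    }


# Made-up records from two hypothetical data sources.
records = [
    ("source_A", ("upper1", "lower1"), 1234.5670),
    ("source_B", ("upper1", "lower1"), 1234.5695),
    ("source_A", ("upper2", "lower1"), 2345.1000),
]

by_transition = group_by_transition(records)
properties = {key: transition_properties(vals)
              for key, vals in by_transition.items()}
for key, props in properties.items():
    print(key, props)
```

A property such as `spread` gives a quick indication of how well the sources agree on a transition and could be compared against a tolerance when assessing data quality.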