Using Signifiers for Data Integration in Rail Automation

Alexander Wurl 1, Andreas Falkner 1, Alois Haselböck 1 and Alexandra Mazak 2,*
1 Siemens AG Österreich, Corporate Technology, Vienna, Austria
2 TU Wien, Business Informatics Group, Austria

Keywords: Data Integration, Signifier, Data Quality.

Abstract: In Rail Automation, planning future projects requires the integration of business-critical data from heterogeneous data sources. As a consequence, data quality of integrated data is crucial for the optimal utilization of the production capacity. Unfortunately, current integration approaches mostly neglect uncertainties and inconsistencies in the integration process in terms of railway specific data. To tackle these restrictions, we propose a semi-automatic process for data import, where the user resolves ambiguous data classifications. The task of finding the correct data warehouse classification of source values in a proprietary, often semi-structured format is supported by the notion of a signifier, which is a natural extension of composite primary keys. In a case study from the domain of asset management in Rail Automation we evaluate that this approach facilitates high-quality data integration while minimizing user interaction.

1 INTRODUCTION

In order to properly plan the utilization of production capacity, e.g., in a Rail Automation factory, information from all business processes and project phases must be taken into account. Sales people scan the market and derive rough estimations of the number of assets (i.e. producible units) of various types (e.g. control units for main signals, shunting signals, distant signals, etc.) which may be ordered in the next few years. The numbers of assets get refined phase by phase, such as bid preparation or order fulfillment. Since these phases are often executed by different departments with different requirements and interests (e.g.
rough numbers such as 100 signals for cost estimations in an early planning phase, vs. a detailed bill-of-material with sub-components such as different lamps for different signal types for a final installation phase), the same assets are described by different properties (i.e. with perhaps slightly different contents) and in different proprietary formats (e.g. spreadsheets or XML files). Apart from the technical challenges of extracting data from such proprietary structures, heterogeneous feature and asset representations hinder the process of mapping and merging information. This process is crucial for a smooth overall workflow and for efficient data analytics, which aims at optimizing future projects based upon experiences from all phases of previous projects. One solution approach is to use a data warehouse and to map all heterogeneous data sets of the different departments to its unified data schema.

* Alexandra Mazak is affiliated with the CDL-MINT at TU Wien.

To achieve high data quality in this process, it is important to avoid uncertainties and inconsistencies while integrating data into the data warehouse. Especially if data includes information concerning costs, it is essential to avoid storing duplicate or contradicting information because this may have business-critical effects. Part of the information can be used to identify corresponding data in some way (i.e. used as a key), while another part can be seen as relevant values (such as quantities and costs). Only if the keys of existing information objects in the data warehouse are comparable to those of newly added information from heterogeneous data sets can that information be stored unambiguously and its values be referenced correctly.

Keys are formed from one or more components of the information object and are significant for comparing information of heterogeneous data sets with information stored in the data warehouse.
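The role of keys can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the key components (`project`, `asset_type`), the helper names and the toy warehouse content are hypothetical and not taken from the data warehouse schema discussed here.

```python
from typing import NamedTuple, Optional


class AssetKey(NamedTuple):
    """Hypothetical composite key; the two components are assumptions."""
    project: str
    asset_type: str


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace to hide trivial format differences."""
    return " ".join(text.lower().split())


def make_key(project: str, asset_type: str) -> AssetKey:
    """Build the composite key from the normalized components."""
    return AssetKey(normalize(project), normalize(asset_type))


# Toy warehouse content (assumed): composite key -> planned quantity.
warehouse = {
    make_key("Project A", "Main Signal"): 100,
    make_key("Project A", "Shunting Signal"): 40,
}


def lookup(project: str, asset_type: str) -> Optional[int]:
    """Return the stored quantity if the key matches exactly, else None."""
    return warehouse.get(make_key(project, asset_type))


# Formatting differences are absorbed by normalization ...
print(lookup("project a", "  main   signal "))
# ... but an abbreviated component yields a key that does not match.
print(lookup("Project A", "Main Sig."))
```

Values such as quantities are attached to a key, so they can only be referenced correctly when the incoming key matches a stored one.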
If two such keys do not match, there are two significantly different possible causes: (i) the two objects should have the same key, but their keys differ slightly from each other, and (ii) the two objects really have different keys. Using solely heuristic lexicographical algorithms (Cohen et al., 2003) to automatically find proper matches

Wurl, A., Falkner, A., Haselböck, A. and Mazak, A. Using Signifiers for Data Integration in Rail Automation. DOI: 10.5220/0006416401720179. In Proceedings of the 6th International Conference on Data Science, Technology and Applications (DATA 2017), pages 172-179. ISBN: 978-989-758-255-4. Copyright © 2017 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved