XML based Framework for ETL Processes For Relational Databases

TASSAWAR IQBAL, NADEEM DAUDPOTA
Department of Computer Science
COMSATS Institute of Information Technology, Abbottabad, NWFP
PAKISTAN

Abstract:- In data warehousing, Extraction-Transformation-Loading (ETL) processes are the key tasks responsible for extracting data from several sources, cleansing and customizing it, and inserting it into the data warehouse [10]. More specifically, ETL tools are a category of specialized tools that deal with data warehouse cleaning and loading problems. These tasks are critical in every data warehouse environment. ETL and data cleaning tools are estimated to account for at least one third of the effort and expense in the budget of a data warehouse [1,11]; other evidence indicates that the ETL process accounts for 55% of the total cost of the data warehouse [1,12]. In this paper, we focus on the problem of defining ETL processes using XML, in order to make the framework generic and capable of dealing with heterogeneous source systems. We describe a framework that extracts data from various heterogeneous source systems and carries it in XML files. Data cleaning is then performed using a few predefined XML templates and predefined functions, and finally the data is loaded into the data warehouse according to the warehouse schema.

1. Introduction
Data warehousing systems integrate information from Transaction Processing Systems (TPS) into a central repository to enable analysis and mining of the integrated information. This can be achieved only when all relevant information from the various TPSs is available and has been standardized and cleaned: the information must have no missing values, no extraneous or varying symbols, no inconsistent codes and no duplicates. Normally, customized applications are used to perform this task; they are designed with the structure and hierarchy of both the source and destination systems in mind.
In the proposed framework, extraction is performed by a component named the Extractor, which is customized for each source database. It extracts data from relational databases using SQL commands and presents it in an XML document. The second module, named the Cleansing Engine, operates on these XML documents and standardizes and cleans the data using predefined XML templates based on business rules. Finally, the standardized and cleaned data is mapped to the central repository of the warehouse through the Mapping Engine. The complete model is shown in Figure 1. The significance of this model lies in the introduction of two generic components, the Cleansing Engine and the Mapping Engine, which are designed so that the user can customize them through a few simple options at the interface level to define the business rules and achieve the desired results.

The rest of the paper is organized as follows: Section 2.1 describes the Extractor component in detail, Section 2.2 explains the Cleansing Engine, Section 2.3 describes the Mapping Engine, and Section 3 gives a brief overview of the GUI. Section 4 presents the conclusion and Section 5 lists the references.

2. Framework
2.1 Extractor
This component of the model is the initiator, responsible for extracting data from the various TPSs. It has as many sub units as there are TPSs in the data warehousing environment. Each sub unit is customized for the needs and architecture of a particular TPS. Each sub unit extracts the data from its respective TPS and converts that data into the XML format that is standard in this model, which keeps the rest of the framework generic.

Proceedings of the 5th WSEAS International Conference on Applied Computer Science, Hangzhou, China, April 16-18, 2006 (pp481-485)
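
As an illustration of the Extractor sub unit described in Section 2.1, the following is a minimal sketch in Python of how one sub unit might run a SQL command against a relational source and present the result as an XML document. It is not the authors' implementation. The function name extract_to_xml, the element names table, row and field, and the sample customer table with its rows are all assumptions made for this example, and an in-memory SQLite database stands in for a real TPS.

```python
import sqlite3
import xml.etree.ElementTree as ET


def extract_to_xml(conn, table, query):
    """Run a SQL query and wrap each result row in an XML <row> element.

    The <table>/<row>/<field> layout is an assumed format for this sketch,
    not the XML schema used in the paper.
    """
    cursor = conn.execute(query)
    columns = [d[0] for d in cursor.description]
    root = ET.Element("table", name=table)
    for record in cursor:
        row = ET.SubElement(root, "row")
        for column, value in zip(columns, record):
            field = ET.SubElement(row, "field", name=column)
            # NULL values become empty elements so a later cleansing
            # step can detect them as missing values.
            field.text = "" if value is None else str(value)
    return root


# Hypothetical source TPS: an in-memory SQLite database with one table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Ali", "Abbottabad"), (2, "Sara", None)],
)

xml_root = extract_to_xml(
    conn, "customer", "SELECT id, name, city FROM customer ORDER BY id"
)
xml_text = ET.tostring(xml_root, encoding="unicode")
print(xml_text)
```

In this sketch only the connection and the SQL query are specific to the source system; the XML layout is the same for every source, which is the property that lets the downstream components stay generic.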