Text Gathering and Processing Agent for Language Modeling Corpus

Daniel Hladek
Department of Electronics and Multimedia Telecommunications
Technical University of Kosice
Kosice, Slovak Republic
Email: daniel.hladek@tuke.sk

Jan Stas
Department of Electronics and Multimedia Telecommunications
Technical University of Kosice
Kosice, Slovak Republic
Email: jan.stas@tuke.sk

Abstract—An approach to the acquisition, preprocessing and storage of a large quantity of text for the creation of a Slovak language model is presented. Text downloaded from the web is preprocessed by an automatically generated parser. Heuristic parser rules are used to identify entities in the text, such as abbreviations or the end of a sentence. Both raw and processed text are stored in a relational database. Effective filters for erroneous and duplicate sentences are proposed. Evaluation of the results covers the effects of the proposed filters on the corpus and the tendency of the database growth.

I. INTRODUCTION

A language model is a key component of automatic speech recognition, grammar correction and spoken language recognition. In order to create a language model, it is necessary to gather a large amount of text called a training corpus. There have been only a few attempts to create a database of corpora for statistical processing of the Slovak language. One of them, the Slovak National Corpus [1], does not meet our needs due to its different approach, insufficient size and licensing. This paper continues the work presented in [2], [3], [4].

For building a good language model it would be very helpful to have a sufficiently large database of text that unifies various sources in one place. Such a database can then be used to easily construct domain-specific corpora from the already collected and prepared data. Once the text is stored, it has to be preprocessed before it can be used. Recent methods use statistical methods, context-free grammars, or their combination, as described in [5].

II. TEXT DATABASE

For the creation of the text database we can use several types of electronic sources:

1) Printed media - Classic paper books, newspapers and magazines. Printed text must first be scanned and processed by OCR software. In this phase we do not focus on this source.
2) Static electronic sources - Electronic databases of text, such as laws or theses, available on removable media such as DVD, or downloadable from a certain Internet site.
3) On-line electronic sources - Usually a web page that contains news, magazines or commercial presentations. This source seems to be the most promising, but many problems with cleaning and extraction have to be solved.

All these sources need to be gathered in one place. The best way to store, process and sort a large quantity of structured data seems to be a relational database. Text data are collected in smaller parts called documents. One document contains one piece of text written about one theme from one source; an example of a document is a magazine article about motorcycles in the PDF format. One document contains a certain amount of text that can be used for the creation of a text corpus. The process of acquiring the text and sorting out irrelevant text is called extraction.

All documents in the database can be stored in just one relation that has the following attributes:

1) Document file - Location of the document file on the disk. It has been found that storing binary data directly in the database is not efficient for this use case, so files are stored in a dedicated directory on the disk instead.
2) Extracted text - Text that was extracted from the document file and is ready to be incorporated into the corpus.
3) Document source - Description of the original document location. The most convenient format seems to be a URI string, which is general enough to assign each document a unique name.
4) Document processing status - Preliminary information about the outcome of the processing:
   a) Processing - The document is not ready yet.
   b) Final - The extracted text can be inserted into the corpus.
   c) Error - The document contains an error and could not be processed.
   d) Copy - The document is a copy of another document in the database.
   e) Bad text - The extracted text of the document has low quality.
5) Meta information - A string with additional information about the document, such as keywords that can help determine the document's domain.
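The single-relation design described above can be sketched as a concrete schema. The following is a minimal illustration, assuming SQLite and hypothetical column names; it is not the authors' actual implementation, but it shows how the five attributes and the processing-status states map onto a table definition.

```python
# Sketch of the document relation described above (assumed SQLite backend,
# hypothetical column names). One row per document; binary files stay on
# disk and only their path is stored, as recommended in the text.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id             INTEGER PRIMARY KEY,
    file_path      TEXT NOT NULL,        -- 1) location of the file on disk
    extracted_text TEXT,                 -- 2) text ready for the corpus
    source_uri     TEXT NOT NULL UNIQUE, -- 3) original document location
    status         TEXT NOT NULL DEFAULT 'processing'
                   CHECK (status IN ('processing', 'final', 'error',
                                     'copy', 'bad text')),  -- 4) states above
    meta_info      TEXT                  -- 5) e.g. keywords for the domain
);
"""

def add_document(conn, file_path, source_uri, meta_info=None):
    """Register a newly downloaded document; extraction fills in the text later."""
    conn.execute(
        "INSERT INTO documents (file_path, source_uri, meta_info) VALUES (?, ?, ?)",
        (file_path, source_uri, meta_info),
    )
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    add_document(conn, "/data/docs/article-123.pdf",
                 "http://example.com/magazine/123", "motorcycles")
    # A freshly inserted document starts in the 'processing' state.
    print(conn.execute("SELECT status FROM documents").fetchone()[0])
```

The `UNIQUE` constraint on `source_uri` reflects the requirement that the URI assigns each document a unique name, and the `CHECK` constraint restricts the status column to the five states listed above.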