Bibliographic Attributes Extraction with Layer-upon-Layer Tagging

Wei Wei
School of ICT
Royal Institute of Technology KTH, Sweden
weiwe@kth.se

Irwin King
Dept. of Comp. Sci. & Eng.
The Chinese University of Hong Kong
Shatin, Hong Kong
king@cse.cuhk.edu.hk

Jimmy Ho-Man Lee
Dept. of Comp. Sci. & Eng.
The Chinese University of Hong Kong
Shatin, Hong Kong
jlee@cse.cuhk.edu.hk

Abstract

Bibliographic attributes extraction is an important research topic for digital libraries. In this paper we propose a rule-based method for bibliographic attributes extraction with Layer-upon-Layer Tagging (LLT). The method analyzes the appearance and punctuation of bibliographic attributes to perform format and semantic tagging on two defined parsing layers. The method also resorts to specifically constructed lexicons to achieve high accuracy in semantic tagging. In an experimental evaluation on 1,000 reference strings, the accuracy of author tagging reaches 96.8% and the accuracy of whole-reference tagging is 82.9%. The experimental results demonstrate that the proposed LLT method can tag bibliographic attributes in reference strings with a high degree of accuracy.

1. Introduction

Bibliographic information is widely used in digital libraries. With the rapid growth of available digital information resources, it has become useful and important to be able to extract bibliographic information automatically. A method for automatically and accurately generating structured, machine-understandable data from unstructured reference strings is in urgent demand.

Because reference structures are heterogeneous, a universal method for bibliographic attributes extraction still faces some challenges. The problem has usually been treated as a Named Entity Recognition (NER) problem.
Methods based on statistical models such as Hidden Markov Models [6, 7], Maximum Entropy Models [9], and Conditional Random Field Models [8] have been proposed for solving NER problems. Doan [5] proposed a multistrategy learning approach to match schemas of data sources. However, these methods are not dedicated to extracting bibliographic attributes from reference strings. Furthermore, the models in these methods need to be well trained. Takasu [10] proposed an Extended Hidden Markov Model for extracting bibliographic attributes from reference strings captured using OCR, but the proposed model still needs well-prepared training data. Chowdhury [3] mentioned that template mining can be used for information extraction from digital documents and also pointed out that, in order to facilitate template mining, standardization in the presentation style and layout of information within digital documents has to be ensured. Ding [4] produced four templates for information extraction from citing and cited articles. However, the results of template mining still depend heavily on the style and layout of the digital documents. Besagni [2] proposed a method based on part-of-speech tagging for bibliographic reference segmentation. That method did not fully utilize the rules in reference strings, and its results are less than satisfactory.

In this paper we propose a rule-based method for bibliographic attributes extraction with Layer-upon-Layer Tagging (LLT). Based on an analysis of the difficulties and rules of bibliographic attributes extraction, the method tackles the problem by performing format and semantic tagging on two defined parsing layers.

Consider the following two references from two published scientific papers.

- Example 1: “Template mining for information extraction from digital documents”, G. Chowdhury, Library Trends, vol. 48, 1999.
- Example 2: Chowdhury, G.G. Template mining for information extraction from digital documents. Library Trends. 48(1), pp.182-208, 1999

Both references refer to the same article, but they differ in the details of how it is expressed. We briefly summarize typical difficulties in solving the problem as follows:

- The delimiters between attribute fields may vary among different reference strings. For example, commas are used as delimiters between attribute fields in Example 1, while full stops are used as delimiters in Example 2.
- Delimiters for attribute fields may also be used within an attribute field. For example, a comma is used between the author’s family name and given name in Example 2, while no comma is used for this purpose in Example 1.
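
The two delimiter difficulties described so far can be made concrete by splitting Example 1 and Example 2 on a single fixed delimiter. The following Python sketch is only an illustration of the problem, not part of the LLT method; the helper name `naive_split` is hypothetical and the reference strings are copied from the examples above.

```python
# Illustration only: a naive fixed-delimiter split, NOT the LLT method.
# `naive_split` is a hypothetical helper introduced for this sketch.

ref1 = ('“Template mining for information extraction from digital documents”, '
        'G. Chowdhury, Library Trends, vol. 48, 1999.')
ref2 = ('Chowdhury, G.G. Template mining for information extraction from '
        'digital documents. Library Trends. 48(1), pp.182-208, 1999')


def naive_split(reference, delimiter):
    """Split a reference string on one delimiter and drop empty pieces."""
    return [part.strip() for part in reference.split(delimiter) if part.strip()]


# Example 1 uses commas between fields, so a comma split recovers
# title, author, journal, volume and year as separate pieces.
fields1 = naive_split(ref1, ",")

# Example 2 uses full stops between fields, so the same comma split
# cuts the author field apart ("Chowdhury" is separated from "G.G.")
# and merges author initials, title, journal and volume into one piece.
fields2_comma = naive_split(ref2, ",")

# Splitting Example 2 on full stops does not help either: the full stops
# inside the initials "G.G." and inside "pp.182-208" also act as cuts.
fields2_period = naive_split(ref2, ".")

for name, fields in [("Example 1 / comma", fields1),
                     ("Example 2 / comma", fields2_comma),
                     ("Example 2 / full stop", fields2_period)]:
    print(name)
    for field in fields:
        print("   ", field)
```

No single delimiter segments both strings correctly, which is why the delimiters have to be interpreted together with the content around them.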