A trigram hidden Markov model for metadata extraction from heterogeneous references

Bolanle Ojokoh a,b, Ming Zhang a,⇑, Jian Tang a

a School of Electronic Engineering and Computer Science, Peking University, Beijing 100871, PR China
b Department of Computer Science, Federal University of Technology, P.M.B. 704 Akure, Nigeria

Article info

Article history: Received 24 March 2010; received in revised form 25 December 2010; accepted 2 January 2011; available online 11 January 2011

Keywords: Metadata extraction; Hidden Markov models; Bibliography; Second order; Shrinkage

Abstract

Our objective was to explore efficient and accurate extraction of metadata such as author, title and institution from heterogeneous references, using hidden Markov models (HMMs). The major contributions of the research were (i) the development of a trigram, full second order hidden Markov model that gives more priority to words emitted in transitions to the same state, with a corresponding new Viterbi algorithm; (ii) the introduction of a new smoothing technique for transition probabilities; and (iii) the proposal of a modification of the back-off shrinkage technique for emission probabilities. The effect of the size of the data set on the training procedure was also measured. Comparisons were made with other related works, and the model was evaluated with three different data sets. The results showed overall accuracy, precision, recall and F1 measure of over 95%, suggesting that the method outperforms other related methods in the task of metadata extraction from references.

© 2011 Elsevier Inc. All rights reserved.

1. Introduction

The dramatic growth of digital libraries in recent years has not only simplified access to existing information sources, but has also made the task of finding, extracting and aggregating relevant information difficult. In the bibliographic research community, several studies are being conducted on citation analysis, grouping and social network creation for subsequent mining.
A prerequisite to such tasks is an accurate reference metadata extraction process. References are most commonly found near the end of an article, in a section often labeled "References", "Bibliography" or "List of References". The information normally contained in this section includes the author names, title, journal, volume, number (issue), year, and page information. These fields constitute an important kind of metadata, valuable for literature search, analysis, and evaluation [9]. Automatic reference extraction is particularly difficult because of inconsistent formatting, semantically overloaded punctuation and field separators, and the existence of many dramatically different reference styles.

Inspired by the work of Yin et al. [30], where the inner emission probability was computed according to the bigram sequence relationship of words within the same field, we describe a method that uses trigram HMMs, with more priority given to the words emitted in transitions to the same state, for the task of metadata extraction from references. We propose a three-dimensional transition matrix in which the probability of transitioning to a new state depends not only on the current state, as in the traditional HMM, but also on the previous state. Our method improves on those adopted in previous studies by recommending a new approach for smoothing transition probabilities, a modified shrinkage technique for smoothing emission probabilities, and optimization of the emission vocabulary.

⇑ Corresponding author. Tel.: +86 13601036730; fax: +86 10 62765822.
E-mail addresses: bolanleojokoh@yahoo.com (B. Ojokoh), mzhang@net.pku.edu.cn (M. Zhang), tangjian_0@126.com (J. Tang).
Information Sciences 181 (2011) 1538–1551. doi:10.1016/j.ins.2011.01.014
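
As a rough illustration of the three-dimensional transition matrix described in the introduction, the sketch below estimates P(next state | previous state, current state) by counting state trigrams in labeled sequences. It is a minimal sketch and not the authors' implementation: the function name `trigram_transitions`, the toy label sequences, and the uniform fallback for unseen contexts are all assumptions made for illustration. In particular, the uniform fallback is a placeholder and does not reproduce the smoothing technique proposed in the paper.

```python
def trigram_transitions(sequences, states):
    """Estimate P(next | prev, cur) from labeled state sequences.

    Returns a nested list probs[i][j][k], where i indexes the previous
    state, j the current state and k the next state, plus the
    state-to-index mapping.
    """
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    counts = [[[0] * n for _ in range(n)] for _ in range(n)]

    # Count every consecutive state trigram (prev, cur, next).
    for seq in sequences:
        for prev, cur, nxt in zip(seq, seq[1:], seq[2:]):
            counts[index[prev]][index[cur]][index[nxt]] += 1

    probs = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = sum(counts[i][j])
            for k in range(n):
                if total:
                    # Maximum-likelihood estimate for a seen context.
                    probs[i][j][k] = counts[i][j][k] / total
                else:
                    # Assumption: uniform fallback for unseen (prev, cur)
                    # contexts; NOT the paper's smoothing method.
                    probs[i][j][k] = 1.0 / n
    return probs, index


# Hypothetical toy data: one state label per token of a reference string.
STATES = ["author", "title", "year"]
SEQUENCES = [
    ["author", "author", "title", "title", "title", "year"],
    ["author", "author", "author", "title", "title", "year"],
]

A, IDX = trigram_transitions(SEQUENCES, STATES)
a, t, y = IDX["author"], IDX["title"], IDX["year"]
print(A[a][a][t])  # P(title | author, author)
print(A[t][t][y])  # P(year | title, title)
```

A first order HMM would store only a two-dimensional table indexed by (current, next). The extra leading axis is what lets the model distinguish, for example, a title token that follows two title tokens from one that has just followed an author token.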