An Automatic Method for Extracting Citations from Google Books
1
Kayvan Kousha
Statistical Cybermetrics Research Group, School of Technology, University of Wolverhampton, Wulfruna Street,
Wolverhampton WV1 1LY, UK. E-mail: k.kousha@wlv.ac.uk
Mike Thelwall
Statistical Cybermetrics Research Group, School of Technology, University of Wolverhampton, Wulfruna Street,
Wolverhampton WV1 1LY, UK. E-mail: m.thelwall@wlv.ac.uk
Recent studies have shown that counting citations from books can help scholarly impact assessment and
that Google Books (GB) is a useful source of such citation counts, despite its lack of a public citation index.
Searching GB for citations produces approximate matches, however, and so its raw results need
time-consuming human filtering. In response, this article introduces a method to automatically remove false
and irrelevant matches from GB citation searches and refines a previous manual GB citation extraction
method. The method was evaluated by manually checking sampled GB results and by comparing
citations to about 14,500 monographs in the Thomson Reuters Book Citation Index
(BKCI) against automatically extracted citations from GB across 24 subject areas. GB citations were
103% to 137% as numerous as BKCI citations in the humanities, except for tourism (72%) and linguistics
(91%), 46% to 85% in social sciences, but only 8% to 53% in the sciences. In all cases, however, GB found
substantially more citing books than did BKCI, with BKCI's results coming predominantly from journal
articles. Moderate correlations between the GB and BKCI citation counts in social sciences and
humanities, with most BKCI results coming from journal articles rather than books, suggest that they
could measure different aspects of impact, however.
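As an illustration only, the two comparisons summarised above (GB citations as a percentage of BKCI citations, and the correlation between the two sets of counts) could be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure used in the study: the function names and the sample counts are hypothetical, the percentage is assumed to be a ratio of total citations within a subject area, and a Spearman rank correlation is assumed because citation counts are typically skewed.

```python
from statistics import mean


def percent_of(gb_counts, bkci_counts):
    """GB citations as a percentage of BKCI citations (assumed: ratio of totals)."""
    total_bkci = sum(bkci_counts)
    if total_bkci == 0:
        raise ValueError("BKCI total is zero; the ratio is undefined")
    return 100.0 * sum(gb_counts) / total_bkci


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks.

    Assumes neither input is constant (otherwise the variance is zero).
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-monograph citation counts for one subject area.
gb = [12, 0, 5, 7]
bkci = [8, 1, 4, 7]
gb_percent = percent_of(gb, bkci)
rho = spearman(gb, bkci)
```

With these made-up counts, GB has 24 citations against 20 in BKCI, so `gb_percent` is 120, and because both sources rank the four monographs identically, `rho` is 1.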
Introduction
Books are major scholarly outputs in many social sciences and humanities disciplines
and are therefore important for research evaluation (e.g., Moed, 2005; Nederhof, 2006; Huang
& Chang, 2008). For instance, about a third of the submissions in social sciences and
humanities fields to the 2008 U.K. Research Assessment Exercise (RAE) were books in
comparison to about 1% in the sciences (Kousha, Thelwall & Rezaie, 2011). Moreover,
counting citations from books rather than journal articles can give different results when
benchmarking authors (Cronin, Snyder & Atkins, 1997) and countries (Archambault et al.,
2006) in the social sciences and humanities. This shows that citations from books are an
important source of impact evidence that cannot be replaced by citations from journal articles.
The lack of a comprehensive index for the bibliographic references of books is therefore an
issue for bibliometric monitoring of research in book-based disciplines. Almost two decades
ago, this led to a call to include citations from books in academic citation databases (Garfield,
1996). Nevertheless, most previous quantitative investigations into the impact of book-based
scholarship have counted citations from journal articles indexed in the commercial citation
databases (Web of Science and Scopus) (e.g., Glänzel & Schoepflin, 1999; Butler & Visser,
2006; Bar-Ilan, 2010; Hammarfelt, 2011) rather than citations from other books, although
some studies have manually extracted cited references from selected monographs for
bibliometric analysis (e.g., Cullars, 1998; Krampen, Becker, Wahner & Montada, 2007).
There have also been initiatives to use non-citation metrics for usage assessment of books,
such as counting library holdings (“libcitations”) (White, Boell, Yu et al., 2009) and using
library loan statistics (Cabezas-Clavijo et al., 2013).
Several attempts have been made to extract citations from academic books on a large
scale for citation analysis or citation searching. In 2011 Thomson Reuters introduced the
Book Citation Index, a set of citations from selected academic books and book chapters that
could be added to the journal citations in the Web of Science (WoS). Whilst this is a valuable
1
This is a preprint of an article to be published in the Journal of the American Society for Information Science
and Technology © copyright 2013 John Wiley & Sons, Inc.