A Language-based Approach to Measuring Scholarly Impact

Sean M. Gerrish  SGERRISH@CS.PRINCETON.EDU
David M. Blei  BLEI@CS.PRINCETON.EDU
Department of Computer Science, Princeton University, 35 Olden St., Princeton, NJ 08540

Abstract

Identifying the most influential documents in a corpus is an important problem in many fields, from information science and historiography to text summarization and news aggregation. Unfortunately, traditional bibliometrics such as citations are often not available. We propose using changes in the thematic content of documents over time to measure the importance of individual documents within the collection. We describe a dynamic topic model for both quantifying and qualifying the impact of these documents. We validate the model by analyzing three large corpora of scientific articles. Our measurement of a document's impact correlates significantly with its number of citations.

1 Introduction

Measuring the influence of a scientific article is an important and challenging problem. Influence measurements are used to assess the quality of academic instruments, such as journals, scientists, and universities, and can play a role in decisions surrounding publishing and funding. They are also important for academic research: finding and reading the influential articles of a field is central to good research practice.

The traditional method of assessing an article's influence is to count the citations to it. The impact factor of a journal, for example, is based on aggregate citation counts (Garfield, 2002). This is intuitive: if more people have cited an article, then more people have read it, and it is likely to have had more impact on its field. Citation counts are used with other types of documents as well: the PageRank algorithm, which uses the hyperlinks between web pages, has been essential to Google's early success in Web search (Brin & Page, 1998).
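The link-analysis idea behind PageRank can be sketched with a few lines of power iteration. The citation graph below is invented for illustration; the damping factor of 0.85 is the value suggested by Brin & Page (1998). This is a minimal sketch of the algorithm, not a production implementation.

```python
# Toy power-iteration PageRank on a small, invented citation graph.
import numpy as np

links = {  # article -> articles it cites (hypothetical data)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
nodes = sorted(links)
idx = {n: i for i, n in enumerate(nodes)}
n = len(nodes)

# Column-stochastic transition matrix: M[j, i] = 1/outdegree(i) if i cites j.
M = np.zeros((n, n))
for src, outs in links.items():
    for dst in outs:
        M[idx[dst], idx[src]] = 1.0 / len(outs)

d = 0.85  # damping factor
r = np.full(n, 1.0 / n)  # start from the uniform distribution
for _ in range(100):
    r = (1 - d) / n + d * (M @ r)

ranks = dict(zip(nodes, r))
print(max(ranks, key=ranks.get))  # "C", the most-cited node
```

Node C, which is cited by three of the four articles, accumulates the most rank mass, mirroring the intuition that heavily cited documents are influential.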
Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Though citation counts can be powerful, they can be hard to use in practice. Some collections, such as news stories, blog posts, or legal documents, contain articles that were influential on others but lack explicit citations between them. Other collections, like OCR scans of historical scientific literature, do contain citations, but these are difficult to extract in reliable electronic form. Finally, citation counts capture only one kind of influence. All citations from an article are counted equally in an impact factor, when some articles of a bibliography might have influenced the authors more than others.

We take a different approach to identifying influential articles in a collection. Our idea is that an influential article will affect how future articles are written and that this effect can be detected by examining the way corpus statistics change over time. We encode this intuition in a time-series model of sequential document collections.

We base our model on dynamic topic models, allowing for multiple threads of influence within a corpus (Blei & Lafferty, 2006). Though our algorithm aims to capture something different from citation, we validate the inferred influence measurements by comparing them to citation counts. We analyzed one hundred years of the Proceedings of the National Academy of Sciences, one hundred years of Nature, and a forty-year corpus of articles on computational linguistics. With only the language of the articles as input, our algorithm produces a meaningful measure of each document's influence in the corpus.

2 The Document Influence Model

We develop a probabilistic model that captures how past articles exhibit varying influence on future articles.
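The underlying intuition — that an influential article pulls the word statistics of later documents toward its own language — can be illustrated with a deliberately simplified toy. This sketch is not the Document Influence Model itself (which is a dynamic topic model with hidden influence variables); the documents, epochs, and scoring rule here are invented purely to show the signal being exploited: each early article is scored by how well its word frequencies align with the direction in which the corpus vocabulary subsequently drifts.

```python
# Toy illustration (not the paper's model): score epoch-1 articles by the
# projection of their word frequencies onto the epoch-1 -> epoch-2 drift.
from collections import Counter

epoch1 = {  # hypothetical early articles
    "paper_a": "genes expression microarray clustering",
    "paper_b": "protein folding structure energy",
}
epoch2_docs = [  # hypothetical later corpus, drifted toward paper_a's language
    "microarray expression profiles clustering genes",
    "clustering gene expression data microarray",
]

def freqs(text):
    """Normalized word frequencies of a whitespace-tokenized text."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

# Direction of corpus drift: epoch-2 frequencies minus epoch-1 frequencies.
f1 = freqs(" ".join(epoch1.values()))
f2 = freqs(" ".join(epoch2_docs))
vocab = set(f1) | set(f2)
drift = {w: f2.get(w, 0.0) - f1.get(w, 0.0) for w in vocab}

# Influence proxy: how strongly an article's language points along the drift.
scores = {
    name: sum(freqs(text).get(w, 0.0) * drift[w] for w in vocab)
    for name, text in epoch1.items()
}
print(max(scores, key=scores.get))  # "paper_a"
```

Because the epoch-2 documents reuse paper_a's vocabulary, its score is positive while paper_b's is negative. The actual model replaces this crude projection with posterior inference over per-document influence variables in a dynamic topic model.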
Our hypothesis is that an article's influence on the future is corroborated by how the language of its field changes subsequent to its publication. In the model, the influence of each article is encoded as a hidden variable, and posterior inference reveals the influential articles of the collection.

Our model is based on the dynamic topic model (DTM) (Blei & Lafferty, 2006), a model of sequential corpora that allows language statistics to drift over time. Pre-