Journal of Informetrics 11 (2017) 989–1002
Regular article
How is R cited in research outputs? Structure, impacts, and citation standard
Kai Li*, Erjia Yan, Yuanyuan Feng
College of Computing and Informatics, Drexel University, Philadelphia, PA 19104, United States
Article info
Article history:
Received 2 February 2017
Received in revised form 9 August 2017
Accepted 9 August 2017
Keywords:
R
Software citation
Content analysis
Bibliometrics
Scholarly communication
Abstract
This paper addresses software citation by analyzing how R and its packages are cited in a
sample of PLoS papers. A codebook is developed to support a content analysis of the full-text
papers. Our results indicate that the software R and its packages are inconsistently
cited, as is the case with other scientific software. The inconsistency derives partly from
the variety of citation standards currently used for software, and partly from the fact that these
standards are not well followed by authors on multiple levels. This work sheds light on the
future development of software citation standards, especially given the present landscape
of conflicting citation practices. Moreover, our approach furnishes a possible blueprint for
dealing with the granularity of software entities in scientific citation: we consider citations
of the core R software environment, of specific R packages, and of individual functions.
© 2017 Elsevier Ltd. All rights reserved.
1. Introduction
This paper concerns the citation and mention of scientific software in research papers. It is commonly accepted that
proper citation of research datasets is important to the growing field of data science, because it provides a basic mechanism
for linking datasets to other scientific entities. This linkage provides a fundamental infrastructure for other scientific tasks,
including data sharing, data reuse, and reproducible research (e.g., Mooney & Newton, 2012). As data objects in their own
right, scientific software environments, applications, and packages are also amenable to this infrastructure of citation, with
similar benefits for researchers.
Earlier evidence suggests that data citation practices are highly inconsistent, largely because authors lack standards and
policies to guide them in citing datasets (Belter, 2014; Mooney, 2011; Mooney & Newton, 2012). These findings have inspired
the development of several data citation standards (e.g., Altman & King, 2007; Starr & Gastl, 2011) and the identification of
key principles for citing datasets (Altman, Borgman, & Crosas, 2015; Altman & Crosas, 2013; Martone, 2014).
Parallel developments are underway in the study of software citation: researchers have provided some early analysis of
software citation trends (e.g., Howison & Bullard, 2015; Li, Greenberg, & Lin, 2016; Pan, Yan, Wang, & Hua, 2015) and have
proposed guiding principles for future work (Smith, Katz, & Niemeyer, 2016). However, several factors limit the quality of
information obtainable from current software citation practices. For example, existing studies tend to describe software as
a single unit in the scientific workflow; this does not always reflect the way in which the software is used.
Increasingly, scientific software is extensible. Extensibility is one of the most important principles of software design
(Johansson & Löfgren, 2009). R, a statistical and data visualization software and environment, represents a successful example
* Corresponding author.
E-mail address: kl696@drexel.edu (K. Li).
http://dx.doi.org/10.1016/j.joi.2017.08.003
1751-1577/© 2017 Elsevier Ltd. All rights reserved.