Article

Meta-Analysis of Gene Popularity: Less Than Half of Gene Citations Stem from Gene Regulatory Networks

Ionut Sebastian Mihai 1,2,3, Debojyoti Das 1, Gabija Maršalkaite 1 and Johan Henriksson 1,*

1 Molecular Infection Medicine Sweden (MIMS), Umeå Centre for Microbial Research, Department of Molecular Biology, Umeå University, 901 87 Umeå, Sweden; ionut.sebastian.mihai@umu.se (I.S.M.); debojyoti.das@umu.se (D.D.); marschalka@gmail.com (G.M.)
2 Industrial Doctoral School, Umeå University, 901 87 Umeå, Sweden
3 National Clinical Research School in Chronic Inflammatory Diseases (NCRSCID), Karolinska Institutet, 171 77 Solna, Sweden
* Correspondence: johan.henriksson@umu.se

Citation: Mihai, I.S.; Das, D.; Maršalkaite, G.; Henriksson, J. Meta-Analysis of Gene Popularity: Less Than Half of Gene Citations Stem from Gene Regulatory Networks. Genes 2021, 12, 319. https://doi.org/10.3390/genes12020319

Academic Editor: Piero Fariselli
Received: 27 January 2021
Accepted: 20 February 2021
Published: 23 February 2021

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Abstract: The reasons for selecting a gene for further study might vary from historical momentum to funding availability, thus leading to unequal attention distribution among all genes. However, certain biological features tend to be overlooked in evaluating a gene’s popularity. Here we present a meta-analysis of the reasons why different genes have been studied and to what extent, with a focus on the gene-specific biological features.
From unbiased datasets we can define biological properties of genes that reasonably may affect their perceived importance. We make use of both linear and nonlinear computational approaches for estimating gene popularity to then compare their relative importance. We find that roughly 25% of the studies are the result of historical positive feedback, which we may think of as social reinforcement. Of the remaining features, gene family membership is the most indicative, followed by disease relevance and finally regulatory pathway association. Disease relevance was an important driver until the 1990s, after which the focus shifted to exploring every single gene. We also present a resource that allows one to study the impact of reinforcement, which may guide our research toward genes that have not yet received proportional attention.

Keywords: gene; Matthew effect; biological feature; genomics; machine learning; linear model; gene regulatory networks

1. Introduction

One of the current great challenges in biology is integrating knowledge into a coherent model, thus allowing predictions to be made. However, this quest relies heavily on our understanding of all the different features that define our biological question. How well do we understand the different features, and has the manner or motivation for study affected our conclusions about them? As systems biologists, we wondered if we could somehow address these questions based on the intrinsic properties of the genes. Similar studies have previously addressed this question, making a great contribution in highlighting social features (funding, transitioning to principal investigator status, model organism and scientific literature database availability) and a plethora of physicochemical properties of protein-coding genes [1].
However, the literature on the factors behind gene popularity that integrates biological feature information from NGS-derived datasets, CRISPR screens, gene regulatory networks (GRNs), and protein–protein interaction (PPI) networks is limited.

Scientometry is the discipline that studies scientific and technological literature from a quantitative perspective [2]. From scientometry it is known that literature is usually skewed to cover some subjects at a much greater depth than others. Various statistical distributions seem able to explain current and past publication trends, which have been proposed to follow laws such as those of Bradford and Lotka, or the Pareto distribution [3]. It is, however,