Article
Meta-Analysis of Gene Popularity: Less Than Half of Gene
Citations Stem from Gene Regulatory Networks
Ionut Sebastian Mihai 1,2,3, Debojyoti Das 1, Gabija Maršalkaite 1 and Johan Henriksson 1,*
Citation: Mihai, I.S.; Das, D.; Maršalkaite, G.; Henriksson, J. Meta-Analysis of Gene Popularity:
Less Than Half of Gene Citations Stem from Gene Regulatory Networks. Genes 2021, 12, 319.
https://doi.org/10.3390/genes12020319
Academic Editor: Piero Fariselli
Received: 27 January 2021
Accepted: 20 February 2021
Published: 23 February 2021
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution (CC BY)
license (https://creativecommons.org/licenses/by/4.0/).
1 Molecular Infection Medicine Sweden (MIMS), Umeå Centre for Microbial Research,
Department of Molecular Biology, Umeå University, 901 87 Umeå, Sweden;
ionut.sebastian.mihai@umu.se (I.S.M.); debojyoti.das@umu.se (D.D.); marschalka@gmail.com (G.M.)
2 Industrial Doctoral School, Umeå University, 901 87 Umeå, Sweden
3 National Clinical Research School in Chronic Inflammatory Diseases (NCRSCID), Karolinska Institutet,
171 77 Solna, Sweden
* Correspondence: johan.henriksson@umu.se
Abstract: The reasons for selecting a gene for further study might vary from historical momentum
to funding availability, thus leading to unequal attention distribution among all genes. However,
certain biological features tend to be overlooked in evaluating a gene’s popularity. Here we present a
meta-analysis of the reasons why different genes have been studied and to what extent, with a focus
on the gene-specific biological features. From unbiased datasets we can define biological properties
of genes that reasonably may affect their perceived importance. We make use of both linear and
nonlinear computational approaches for estimating gene popularity to then compare their relative
importance. We find that roughly 25% of the studies are the result of a historical positive feedback,
which we may think of as social reinforcement. Of the remaining features, gene family membership is
the most indicative, followed by disease relevance and finally regulatory pathway association. Disease
relevance was an important driver until the 1990s, after which the focus shifted to exploring
every single gene. We also present a resource that allows one to study the impact of reinforcement,
which may guide our research toward genes that have not yet received proportional attention.
Keywords: gene; Matthew effect; biological feature; genomics; machine learning; linear model; gene
regulatory networks
1. Introduction
One of the current great challenges in biology is integrating knowledge into a coherent
model, thus allowing predictions to be made. However, this quest heavily relies on our
understanding of all the different features that define our biological question. How well do
we understand the different features, and has the manner or motivation for study affected
our conclusions about them? As systems biologists, we wondered if we could somehow
address these questions based on the intrinsic properties of the genes. Similar studies
have previously addressed this question, making a great contribution in highlighting
social features (funding, transitioning to principal investigator status, model organism and
scientific literature database availability) and a plethora of physicochemical properties of
protein-coding genes [1]. However, the literature on the factors behind gene popularity
that integrates biological feature information from NGS-derived datasets, CRISPR screens,
gene regulatory networks (GRNs), and protein–protein interaction (PPI) networks
is limited.
Scientometry is the discipline that studies scientific and technological literature from
a quantitative perspective [2]. From scientometry it is known that the literature is usually
skewed to cover some subjects at a much greater depth than others. Various statistical
distributions seem able to explain current and past publication trends, which have been
proposed to follow laws such as those of Bradford and Lotka, or the Pareto distribution [3]. It is, however,