0007-4888/18/16610135 © 2018 Springer Science+Business Media, LLC
Surface Molecular Markers of Cancer Stem Cells:
Computation Analysis of Full-Text Scientific Articles
R. E. Suvorov¹, Ya. S. Kim², A. M. Gisina², J. H. Chiang³, K. N. Yarygin², and A. Yu. Lupatov²
Translated from Kletochnye Tekhnologii v Biologii i Meditsine, No. 3, pp. 157-163, September, 2018
Original article submitted May 17, 2018
Data on cancer stem cell surface molecular markers for the 27 most common cancers
were analyzed using natural language processing and data mining techniques. The source
material comprised 8933 full-text, open-access, English-language scientific articles
available on the Internet. Text mining was based on searching for three entities within
one sentence: a tumor name, the phrase “cancer stem cells” or a synonym, and the name of
a cluster of differentiation (CD) molecule. The result was a list of the surface molecular
markers most frequently mentioned in the context of particular tumor types and used in
studies of human and animal tumor cells. Based on the similarity of their associated
markers, the tumors were divided into five groups.
Key Words: cancer stem cells; surface molecular markers; natural language processing;
information extraction; data mining
¹Federal Research Center Computer Science and Control, Russian Academy of Sciences;
²V. N. Orekhovich Research Institute of Biomedical Chemistry, Moscow, Russia;
³National Cheng Kung University, Tainan City, Taiwan.
Address for correspondence: alupatov@mail.ru. A. Yu. Lupatov
Cancer stem cells (CSC) are a subpopulation of the
most aggressive tumor cells that appear as a result of
malignant transformation of regional stem or progenitor
cells. CSC can retain some properties of normal
stem cells and serve as the source of tumor cells of
varying degrees of differentiation [6]. There is evidence
that this particular cell subpopulation is responsible for
tumor progression, including metastasis and postoperative
recurrence [10]. As CSC are highly resistant
to chemotherapy and radiotherapy [2], the development
of effective methods for their elimination is
a pressing problem. To solve this problem, comprehensive
data on the surface molecules expressed on CSC
are required. These molecules can be used not only
for evaluating disease prognosis, but also as
targets for anti-CSC drugs. Earlier, on the
basis of an analysis of published data, we described
the most reliable markers of CSC; we confirmed their
relevance by cell transplantation into immunodeficient
animals [1,4]. At the same time, there are thousands
of publications exploring one or another aspect
of CSC. Analyzing this mass of data is hardly possible
without computer technology. From the viewpoint
of computer science, tasks of this kind amount to the
extraction of named entities and the relations between them.
Named entity extraction methods are usually based
either on vocabulary matching or on rules that select
candidates, followed by disambiguation with classifiers
trained on annotated data, such as conditional random
fields (CRF) or support vector machines (SVM). Relations
between entities are usually extracted by classifying all
pairs of entities that occur in one sentence [7].
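As an illustration of the general approach described above (vocabulary matching followed by enumeration of entity pairs within a sentence), a minimal Python sketch might look as follows. It is not the software described in this article: the vocabularies, the function names, and the example sentence are all hypothetical, and the classifier that would normally score each candidate pair is omitted.

```python
import re
from itertools import combinations

# Hypothetical toy vocabularies; a real system would load curated dictionaries.
VOCAB = {
    "TUMOR": ["glioblastoma", "breast cancer", "melanoma"],
    "MARKER": ["CD133", "CD44", "CD24"],
}

def compile_vocab(vocab):
    """Build one case-insensitive, whole-word regex per entity type."""
    patterns = {}
    for label, terms in vocab.items():
        # Longest terms first so multi-word names win over shorter overlaps.
        ordered = sorted(terms, key=len, reverse=True)
        alternation = "|".join(re.escape(t) for t in ordered)
        patterns[label] = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return patterns

def find_entities(sentence, patterns):
    """Return (label, matched text, start offset) for each vocabulary hit."""
    hits = []
    for label, pattern in patterns.items():
        for m in pattern.finditer(sentence):
            hits.append((label, m.group(0), m.start()))
    return sorted(hits, key=lambda h: h[2])

def candidate_pairs(entities):
    """All pairs of differently typed entities found in one sentence.

    In a full pipeline each pair would be passed to a trained classifier;
    here the pairs are only enumerated.
    """
    return [(a, b) for a, b in combinations(entities, 2) if a[0] != b[0]]

PATTERNS = compile_vocab(VOCAB)
sentence = "CD133 and CD44 were enriched in glioblastoma spheres."
entities = find_entities(sentence, PATTERNS)
pairs = candidate_pairs(entities)
```

For the example sentence, the sketch finds two marker mentions and one tumor mention, which gives two marker-tumor candidate pairs.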
Our aim was the automated extraction and analysis
of information on surface molecular markers of CSC
available in the scientific literature. To this end, we
developed an algorithm and software implementing it
that automatically extract the names of cluster of
differentiation (CD) molecules from CSC-relevant
context and link them to
specific types of cancer.
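The criterion stated in the abstract (a tumor name, a CSC phrase, and a marker name within one sentence) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the term lists, the regular expression for marker names, the naive sentence splitter, and the function names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy dictionaries, for illustration only.
TUMORS = ["glioblastoma", "breast cancer"]
CSC_TERMS = ["cancer stem cell", "tumor-initiating cell"]
# Assumed naming pattern for markers, e.g. CD44, CD133, CD49f.
CD_MARKER = re.compile(r"\bCD\d+[a-z]?\b")

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def count_links(text):
    """Count (tumor, marker) pairs in sentences that also mention CSC."""
    counts = Counter()
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        # Skip sentences without a CSC phrase or synonym.
        if not any(term in lowered for term in CSC_TERMS):
            continue
        tumors = {t for t in TUMORS if t in lowered}
        markers = set(CD_MARKER.findall(sentence))
        for tumor in tumors:
            for marker in markers:
                counts[(tumor, marker)] += 1
    return counts

text = (
    "CD133 marks cancer stem cells in glioblastoma. "
    "CD44 is expressed in breast cancer. "
    "In breast cancer, cancer stem cells are CD44 positive and CD24 negative."
)
links = count_links(text)
```

In this toy text the second sentence is ignored because it lacks a CSC phrase, so only the first and third sentences contribute tumor-marker links. Aggregating such counts over a corpus gives a per-tumor frequency list of markers.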
MATERIALS AND METHODS
The search for and retrieval of full-text articles were carried
out using the PubMed database of the National Cen-
Cell Technologies in Biology and Medicine, No. 3, November, 2018
DOI 10.1007/s10517-018-4302-8