135 0007-4888/18/16610135--©-2018--Springer-Science+Business-Media,-LLC Surface Molecular Markers of Cancer Stem Cells: Computation Analysis of Full-Text Scientific Articles R. E. Suvorov 1 , Ya. S. Kim 2 , A. M. Gisina 2 , J. H. Chiang 3 , K. N. Yarygin 2 , and A. Yu. Lupatov 2 Translated from Kletochnye Tekhnologii v Biologii i Meditsine, No. 3, pp. 157-163, September, 2018 Original article submitted May 17, 2018 The data on cancer stem cell surface molecular markers of 27 most common cancer diseases were analyzed using natural language processing and data mining techniques. As a source, 8933 full-text open-access English-language scientific articles available on the Internet were used. Text mining was based on searching for three entities within one sentence, namely a tumor name, a phrase “cancer stem cells” or its synonym, and a name of differentiation cluster molecule. As a result, a list of surface molecular markers was formed that included markers most frequently mentioned in the context of certain tumor diseases and used in studies of hu- man and animal tumor cells. Based on similarity of the associated markers, the tumors were divided into five groups. Key Words: cancer stem cells; surface molecular markers; natural language processing; information extraction; data mining 1 Federal Research Center Computer Science and Control, Russian Academy of Sciences; 2 V. N. Orekhovich Research Institute of Bio- medical Chemistry, Moscow, Russia; 3 National Cheng Kung Univer- sity, Tainan City, Taiwan. Address for correspondence: alupatov@ mail.ru. A. Yu. Lupatov Cancer stem cells (RCS) are a subpopulation of the most aggressive tumor cells that appear as a result of malignant transformation of regional stem or progeni- tor cells. CSC can retain some properties of normal stem cells and serve as the source of tumor cells of varying differentiation degree [6]. 
There is evidence that this particular cell subpopulation is responsible for tumor progression, including metastasis and postoperative recurrence [10]. Because CSC are highly resistant to chemotherapy and radiotherapy [2], developing effective methods to eliminate them is a pressing problem. Solving it requires comprehensive data on the surface molecules expressed on CSC. These molecules can be used not only to evaluate disease prognosis, but also as targets for anti-CSC agents. Earlier, based on an analysis of published data, we described the most reliable markers of CSC and confirmed their relevance by cell transplantation into immunodeficient animals [1,4]. At the same time, there are thousands of publications exploring one aspect of CSC or another. Analyzing such a mass of data is hardly possible without computer technology. From the viewpoint of computer science, tasks of this kind amount to the extraction of named entities and the relations between them. Methods for named entity extraction are usually based either on dictionary matching or on rules for candidate selection followed by disambiguation with classifiers trained on annotated data (CRF, SVM). Relations between entities are usually extracted by classifying all pairs of entities that occur in one sentence [7].

Our aim was the automated extraction and analysis of information on surface molecular markers of CSC available in the scientific literature. To this end, we developed an algorithm and software implementing it that automatically extract the names of cluster of differentiation molecules from CSC-relevant context and link them to specific types of cancer.

MATERIALS AND METHODS

The search for and retrieval of full-text articles were carried out using the PubMed database of the National Cen-

Cell Technologies in Biology and Medicine, No. 3, November, 2018 DOI 10.1007/s10517-018-4302-8