Impact Factor: ISRA (India) = 4.971, ISI (Dubai, UAE) = 0.829, GIF (Australia) = 0.564, JIF = 1.500, SIS (USA) = 0.912, РИНЦ (Russia) = 0.126, ESJI (KZ) = 8.997, SJIF (Morocco) = 5.667, ICV (Poland) = 6.630, PIF (India) = 1.940, IBI (India) = 4.260, OAJI (USA) = 0.350

Philadelphia, USA
SOI: 1.1/TAS
DOI: 10.15863/TAS
International Scientific Journal Theoretical & Applied Science
p-ISSN: 2308-4944 (print), e-ISSN: 2409-0085 (online)
Year: 2020, Issue: 05, Volume: 85
Published: 30.05.2020
http://T-Science.org

Vadim Andreevich Kozhevnikov
Peter the Great St.Petersburg Polytechnic University
Senior Lecturer
vadim.kozhevnikov@gmail.com

Evgeniya Sergeevna Pankratova
Peter the Great St.Petersburg Polytechnic University
student
jane_koks@mail.ru

RESEARCH OF THE TEXT DATA VECTORIZATION AND CLASSIFICATION ALGORITHMS OF MACHINE LEARNING

Abstract: The article surveys classification algorithms and vectorization methods and gives the advantages and disadvantages of the classification methods. We consider not only conventional classification algorithms but also classification with neural networks, specifically convolutional neural networks. In addition to describing these methods, we discuss metrics that can be used to rate the quality of trained classification models.

Key words: text classification, vectorization, neural networks, machine learning.

Language: English

Citation: Kozhevnikov, V. A., & Pankratova, E. S. (2020). Research of the text data vectorization and classification algorithms of machine learning. ISJ Theoretical & Applied Science, 05 (85), 574-585.
Soi: http://s-o-i.org/1.1/TAS-05-85-106
Doi: https://dx.doi.org/10.15863/TAS.2020.05.85.106
Scopus ASCC: 2800.

Introduction

Solving problems with machine learning is very popular in the modern IT community: there are many competitions on Kaggle and many courses on edX, Coursera and Stepik.
Many tools and platforms are also available for machine learning, for example Scikit-learn, TensorFlow and Keras. One of the classic and popular tasks is the classification of various data (texts or images). The basic algorithm for solving such problems is:
– Create a dataset and label it.
– Split the dataset into training and test sets.
– Fit vectorizers and choose classifiers.
– Fit the classifiers on the training set and calculate their accuracy on the test set.
– Choose the most accurate classifier.
– Use it.

We described how to create and prepare a dataset in a previous article [1]. Using the prepared dataset, we can train a model that predicts which category an input message belongs to. Let us now look in detail first at vectorization and then at classification.

Vectorization

Machine learning algorithms operate in a space of numerical features: they expect a two-dimensional array as input, whose rows are individual instances and whose columns are attributes, or features. Thus, to perform machine learning on text, the source documents must be converted into vector representations, to which numerical machine learning is then applied. This process is called vectorization, and it is the first step in analyzing natural language data. Converting documents to numerical form makes it possible to analyze them and to create the instances with which the chosen machine learning algorithm will work. Documents (or sentences) can have different sizes, but the vectors that we define for them will