Effective Products Categorization with Importance Scores and Morphological Analysis of the Titles

Leonidas Akritidis, Athanasios Fevgas, Panayiotis Bozanis
Data Structuring & Engineering Lab
Department of Electrical and Computer Engineering
University of Thessaly, Volos, Greece
Email: {leoakr,fevgas,pbozanis}@e-ce.uth.gr

Abstract: During the past few years, e-commerce platforms and marketplaces have enriched their services with new features to improve the user experience and increase their profitability. Such features include relevant product suggestions, personalized recommendations, query understanding algorithms, and numerous others. To implement all these features effectively, a robust product categorization method is required. Due to its importance, the problem of automatically classifying products into a given taxonomy has attracted the attention of multiple researchers. The current literature offers a broad variety of solutions, ranging from supervised learning algorithms to deep learning models such as convolutional and recurrent neural networks. In this paper we introduce a supervised learning method which performs morphological analysis of the product titles by extracting and processing a combination of words and n-grams. Subsequently, each of these tokens receives an importance score according to several criteria which reflect the strength of the correlation of the token with a category. Based on these importance scores, we also propose a dimensionality reduction technique to reduce the size of the feature space without sacrificing much of the performance of the algorithm. The experimental evaluation of our method was conducted on a real-world dataset comprising approximately 320 thousand product titles, which we acquired by crawling a product comparison Web platform. The results of this evaluation indicate that our approach is highly accurate, since it achieves a classification accuracy of over 95%.

I. INTRODUCTION

E-commerce is one of the fastest growing Web-based enterprises. In the last few years there has been a significant increase in the e-commerce share of total global retail sales. From 7.4% in 2015, the percentage of online sales increased to 11.9% in 2018, and it is predicted to reach 17.5% by the end of 2021 (footnote 1). Consequently, the research topics related to e-commerce have become particularly important. More specifically, the effective and efficient management, mining, and processing of product data are presently of top priority for the leading e-commerce platforms.

Footnote 1: https://www.statista.com/statistics/534123/e-commerce-share-of-retail-sales-worldwide/

One of the most fundamental problems in this area is the automatic classification of products into an existing hierarchy of categories. The successful solution of this problem can lead to numerous novel applications, including query expansion and rewriting, retrieval of relevant products, personalized recommendations, etc. On the other hand, the current large-scale commercial systems now offer tens or even hundreds of millions of products, and their warehouses are updated constantly at high rates. Therefore, manual categorization is not scalable, and automatic classification becomes even more compelling.

Due to its importance, the problem in question has gained a lot of research attention recently. The methods which we encounter in the relevant literature take into consideration the product titles and/or their textual descriptions to train their classification models. A number of approaches treat product classification as a standard short-text classification problem and apply slight modifications of the algorithms proposed for it. Other methods perform morphological analysis of the titles and the textual descriptions by extracting words and n-grams.
Finally, a third family of solutions employs deep learning approaches, which are based on convolutional or recurrent neural networks.

A significant portion of these methods take into consideration additional information about a product (known as metadata) and extract features from brands, technical specifications, or textual descriptions. Nevertheless, these types of metadata are not always present, and even when they exist, they are occasionally incomplete or incorrect. In addition, although many models have been effectively applied to text classification, they are difficult to apply to product categorization due to their lack of scalability and the sparsity and skewness of the data contained in commercial product catalogs.

In this paper, we present a supervised algorithm to address this interesting problem. Similarly to some of the existing methods, our approach is based solely on the titles of the products. However, in contrast to the existing methods, our algorithm does not employ standard n-grams of fixed length; instead, it sets a parameter N and extracts all the 1, 2, ..., N-grams from the titles. All generated n-grams are stored within a dictionary data structure which, for each token, maintains a series of useful statistics. We show that this strategy leads to significant performance benefits.

As might be expected, a portion of the generated n-grams are strongly correlated with a category, whereas others are