On the use of Machine Learning and Deep Learning for Text Similarity and Categorization and its Application to Troubleshooting Automation

Julia Colleoni Couto, Laura Tomaz, Julia Godoy, Davi Kniest, Daniel Callegari, Felipe Meneguzzi, Duncan Ruiz
School of Technology, PUCRS University
{julia.couto, laura.tomaz, julia.godoy, davi.silva01}@edu.pucrs.br; {daniel.callegari, felipe.meneguzzi, duncan.ruiz}@pucrs.br

Abstract

Troubleshooting is a labor-intensive task that often involves repeating solutions to similar problems. This task can be partially or fully automated using text-similarity matching to find previous solutions, lowering the workload of technicians. We conduct a systematic literature review to identify the best approaches to solve the problem of troubleshooting automation and classify incidents effectively. We identify promising approaches and point in the direction of a comprehensive set of solutions that could be employed in solving the troubleshooting automation problem.

1. Introduction

Learning from text is an important subject for machine learning and deep learning models, providing valuable insights based on the execution of different algorithms [1]. For example, we can use models for text similarity recognition and text categorization or classification to learn from text [1]. Being able to identify text similarity and categorize text allows us to solve problems such as helping answer questions on community-based websites [2], predicting author gender based on the text they wrote [3], and handling tasks related to bug triage [4, 5, 6, 7] to automate the resolution of issues.

Large companies create many internal service requests, known as issues or incidents, and different people work to resolve them. These services have diverse complexities and levels of urgency. Some services are very specific, while a considerable portion is very similar or even contains identical requests.
For the latter, companies can automatically replicate the solution employed in the first occurrence of an issue. Answering these service requests takes time and requires the action of a specialist, who may not always be available full-time. For this reason, the internal customer (and sometimes the external customer) needs to wait until an expert can resolve the request and fix the problem, and waiting usually leads to customer dissatisfaction. In this context, a model that suggests solutions for service requests can improve customer satisfaction and reduce the time spent by users, collaborators, and specialists, consequently reducing costs.

This paper aims to understand the most used machine learning and deep learning algorithms for text similarity or text categorization applicable to troubleshooting automation. To do so, we perform a systematic literature review using five widely used web search engines to answer the following question: What are the most used machine learning and deep learning techniques, algorithms, tools, or models for text similarity or categorization? We start with 957 papers, and in the end, we accept 35 papers that answer our research question. We classify the papers into eight categories: models and frameworks, proposed methods, preprocessing and dimension reduction, new text representations, attention-based models, multi-label related, comparative approaches, and bug-triage related.

Our review leads to two key findings. First, most papers perform experiments based on only one dataset, and the most used publicly available datasets are Reuters-21578¹ and 20 Newsgroups². Second, Support Vector Machine (SVM) is the most used machine learning model, while Convolutional Neural Network (CNN) is the most used deep learning model for text similarity or categorization.

In what follows, we investigate applications of machine learning to troubleshooting automation.
Section 3 describes the methodology we followed, and Section 4 presents the results we achieved. Section 5 answers our research question and discusses our results. Finally, Section 6 summarizes our conclusions and future work.

¹ Available at https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection. Accessed in May 2021.
² Available at https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups. Accessed in May 2021.

Proceedings of the 55th Hawaii International Conference on System Sciences | 2022, Page 778
URI: https://hdl.handle.net/10125/79427
ISBN 978-0-9981331-5-7 (CC BY-NC-ND 4.0)