Language-Independent Text Tokenization Using Unsupervised Deep Learning

Hanan A. Hosni Mahmoud 1, Alaaeldin M. Hafez 2 and Eatedal Alabdulkreem 1,*

1 Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh, 11671, Saudi Arabia
2 Department of Information Systems, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
*Corresponding Author: Eatedal Alabdulkreem. Email: eaalabdulkareem@pnu.edu.sa

Received: 19 December 2021; Accepted: 24 February 2022

Abstract: Language-independent text tokenization can aid in the classification of languages with few resources. There is a global research effort to enable text classification for any language. Human text classification is a slow procedure; consequently, machine text classification for generating text summaries in different languages has received attention in recent years. There is no research on machine text classification for many languages, such as Czech, Rome, and Urdu. This research proposes a cross-language text tokenization model using a Transformer technique. The proposed Transformer employs an encoder of ten layers, each with a self-attention sublayer and a feedforward sublayer. This model improves the efficiency of text classification by providing a draft classification for a number of documents. We also propose a novel sub-word tokenization model based on frequent vocabulary usage in the documents. The Sub-Word Byte-Pair Tokenization technique (SBPT) utilizes the sharing of the vocabulary of one sentence with other sentences. The proposed model improves on other sub-word tokenization models, such as the byte-pair encoding model, by +10% in precision.

Keywords: Text classification; language-independent tokenization; sub-word tokenization

1 Introduction

Recently, a great amount of data has become available electronically in digital form.
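As a point of reference for the encoder described in the abstract (ten layers, each combining self-attention with a feedforward sublayer), a minimal NumPy sketch of such a stack is shown below. The model dimension, single-head attention, ReLU feedforward, and layer normalization placement are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

class EncoderLayer:
    """One encoder layer: single-head self-attention plus a position-wise feedforward sublayer."""
    def __init__(self, d_model=64, d_ff=256, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(d_model)
        self.Wq, self.Wk, self.Wv, self.Wo = (rng.normal(0, s, (d_model, d_model)) for _ in range(4))
        self.W1 = rng.normal(0, s, (d_model, d_ff))
        self.W2 = rng.normal(0, s, (d_ff, d_model))

    def __call__(self, x):
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(x.shape[-1])) @ v   # scaled dot-product attention
        x = layer_norm(x + attn @ self.Wo)                   # residual + norm around attention
        ffn = np.maximum(0.0, x @ self.W1) @ self.W2         # ReLU feedforward
        return layer_norm(x + ffn)                           # residual + norm around feedforward

class Encoder:
    """Stack of encoder layers; the paper's model uses ten."""
    def __init__(self, n_layers=10, d_model=64):
        self.layers = [EncoderLayer(d_model, seed=i) for i in range(n_layers)]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

# A sequence of 12 token embeddings of width 64 passes through unchanged in shape.
tokens = np.random.default_rng(42).normal(size=(12, 64))
out = Encoder()(tokens)
```

The encoder maps a sequence of token embeddings to contextualized representations of the same shape, which a downstream classification head can then pool and score.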
This has introduced a great opportunity for retrieval, analysis, and processing. However, manual analysis or processing of such huge content is costly and time-consuming. Hence, several computerized models have been proposed to process this data automatically and deliver a classification.

Text classification models usually select key points in texts to produce comprehensible classifications of the target documents. In general, a text classification model attempts to analyze a document by picking the main topics that constitute it and identifying the relevant ideas within those topics. Current models therefore attempt to improve classification performance in identifying a document's key points by accounting for all the themes that exist in it.

This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Intelligent Automation & Soft Computing, DOI: 10.32604/iasc.2023.026235. Article. Tech Science Press.
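For context on the sub-word tokenization the abstract builds on, the following is a minimal sketch of the standard byte-pair encoding merge loop: repeatedly count adjacent symbol pairs across the vocabulary and merge the most frequent pair. This is the generic BPE procedure that SBPT is compared against, not the authors' SBPT algorithm itself; the toy corpus and the `</w>` end-of-word marker are illustrative.

```python
from collections import Counter

def get_pair_counts(vocab):
    """vocab maps a space-separated symbol sequence to its corpus frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the adjacent pair with the merged symbol."""
    bigram, joined = " ".join(pair), "".join(pair)
    return {word.replace(bigram, joined): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Greedily merge the most frequent adjacent pair, num_merges times."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return vocab, merges

# Toy corpus: words split into characters, with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
vocab, merges = learn_bpe(vocab, 4)
```

On this corpus the first learned merge is ('e', 's'), since that pair occurs 9 times (6 in "newest", 3 in "widest"); the learned merge list is then applied in order to tokenize unseen words into sub-word units.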