Language-Independent Text Tokenization Using Unsupervised Deep Learning

Hanan A. Hosni Mahmoud 1, Alaaeldin M. Hafez 2 and Eatedal Alabdulkreem 1,*

1 Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh, 11671, Saudi Arabia
2 Department of Information Systems, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
*Corresponding Author: Eatedal Alabdulkreem. Email: eaalabdulkareem@pnu.edu.sa

Received: 19 December 2021; Accepted: 24 February 2022

Abstract: Language-independent text tokenization can aid in the classification of languages with few resources. There is a global research effort to enable text classification for any language. Human text classification is a slow procedure; consequently, machine text classification for generating text summaries in different languages has received attention in recent years. There is no research on machine text classification for many languages, such as Czech, Rome, and Urdu. This research proposes a cross-language text tokenization model using a Transformer technique. The proposed Transformer employs an encoder of ten layers, each with a self-attention sublayer and a feedforward sublayer. This model improves the efficiency of text classification by providing a draft classification for a number of documents. We also propose a novel sub-word tokenization model based on frequent vocabulary usage in the documents. The Sub-Word Byte-Pair Tokenization technique (SBPT) utilizes the sharing of the vocabulary of one sentence with other sentences. The proposed model improves on other sub-word tokenization models, such as the byte-pair encoding model, by +10% in precision.

Keywords: Text classification; language-independent tokenization; sub-word tokenization

1 Introduction

Recently, a great amount of data has become available electronically in digital form.
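As a point of reference for the encoder described in the abstract (ten layers, each combining self-attention with a feedforward sublayer), a minimal NumPy sketch of such a stack is shown below. The model dimension, single-head attention, ReLU feedforward, and layer normalization placement are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

class EncoderLayer:
    """One encoder layer: single-head self-attention plus a position-wise feedforward sublayer."""
    def __init__(self, d_model=64, d_ff=256, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(d_model)
        self.Wq, self.Wk, self.Wv, self.Wo = (rng.normal(0, s, (d_model, d_model)) for _ in range(4))
        self.W1 = rng.normal(0, s, (d_model, d_ff))
        self.W2 = rng.normal(0, s, (d_ff, d_model))

    def __call__(self, x):
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(x.shape[-1])) @ v   # scaled dot-product attention
        x = layer_norm(x + attn @ self.Wo)                   # residual + norm around attention
        ffn = np.maximum(0.0, x @ self.W1) @ self.W2         # ReLU feedforward
        return layer_norm(x + ffn)                           # residual + norm around feedforward

class Encoder:
    """Stack of encoder layers; the paper's model uses ten."""
    def __init__(self, n_layers=10, d_model=64):
        self.layers = [EncoderLayer(d_model, seed=i) for i in range(n_layers)]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

# A sequence of 12 token embeddings of width 64 passes through unchanged in shape.
tokens = np.random.default_rng(42).normal(size=(12, 64))
out = Encoder()(tokens)
```

The encoder maps a sequence of token embeddings to contextualized representations of the same shape, which a downstream classification head can then pool and score.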
This has introduced a great opportunity for retrieval, analysis, and processing. However, manual analysis or processing of such huge content is costly and time-consuming. Hence, several computerized models have been proposed to process this data automatically and deliver a classification.

Text classification models usually select key points in texts to produce comprehensible classifications of the target documents. In general, a text classification model attempts to analyze a document by picking the main topics that constitute it and identifying the relevant ideas within those topics. Current models therefore attempt to improve classification performance in identifying a document's key points by accounting for all the themes that exist in it.

This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Intelligent Automation & Soft Computing, DOI: 10.32604/iasc.2023.026235. Article. Tech Science Press.
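For context on the sub-word tokenization the abstract builds on, the following is a minimal sketch of the standard byte-pair encoding merge loop: repeatedly count adjacent symbol pairs across the vocabulary and merge the most frequent pair. This is the generic BPE procedure that SBPT is compared against, not the authors' SBPT algorithm itself; the toy corpus and the `</w>` end-of-word marker are illustrative.

```python
from collections import Counter

def get_pair_counts(vocab):
    """vocab maps a space-separated symbol sequence to its corpus frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the adjacent pair with the merged symbol."""
    bigram, joined = " ".join(pair), "".join(pair)
    return {word.replace(bigram, joined): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Greedily merge the most frequent adjacent pair, num_merges times."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return vocab, merges

# Toy corpus: words split into characters, with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
vocab, merges = learn_bpe(vocab, 4)
```

On this corpus the first learned merge is ('e', 's'), since that pair occurs 9 times (6 in "newest", 3 in "widest"); the learned merge list is then applied in order to tokenize unseen words into sub-word units.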