Archaea Taxonomic Classification

Jorge Miguel Silva 1, jorge.miguel.ferreira.silva@ua.pt
Diogo Pratas 1, pratas@ua.pt
Tânia Caetano 2, tcaetano@ua.pt
Sérgio Matos 1, aleixomatos@ua.pt

1 DETI/IEETA, University of Aveiro, Portugal
2 CESAM and Department of Biology, University of Aveiro, Portugal
3 Department of Virology, University of Helsinki, Finland

Abstract

Archaea are a domain of single-celled organisms that live in almost every environment and play significant environmental roles, such as carbon fixation and nitrogen cycling. However, their classification is difficult because most have not been isolated in a laboratory and have been detected only by their gene sequences in environmental samples. Moreover, archaeal genomes are characterized by significant dissimilarity. This manuscript presents an automatic classification methodology based on an ensemble method that combines reference-free compression measures with GC-content and genome length. Notably, the results show that we can automatically and accurately distinguish between Archaea genomes at different taxonomic levels.

1 Introduction

Archaea are a domain of single-celled organisms that lack a nucleus. Their cells have unique properties that distinguish them from both the Bacteria and Eukaryota domains. Archaea and bacteria are generally similar in size and shape. However, despite the morphological similarities to bacteria, Archaea have genes and metabolic pathways more closely related to those of eukaryotes, most prominently for the enzymes involved in transcription and translation. In addition, other aspects of archaeal biochemistry are unique, such as their reliance on ether lipids in their cell membranes. Furthermore, Archaea are characterized by significant dissimilarity between their genomes.

Although they were first detected in extreme environments, such as hot springs and salt lakes with no other organisms, Archaea live in almost every environment. In the human microbiome, they are essential in the gut, mouth, and skin.
Furthermore, they play significant environmental roles, such as carbon fixation, nitrogen cycling, organic compound turnover, and maintaining microbial symbiotic and syntrophic communities.

Currently, Archaea are further divided into multiple recognized phyla. However, classification is difficult because most have not been isolated in a laboratory and have been detected only by their gene sequences in environmental samples. Studying the complexity of a DNA sequence (its quantity of information) may help solve this classification problem. As such, this manuscript proposes an Archaea genomic taxonomic classification tool. Specifically, it performs classification without resorting directly to the sequence of the reference genomes. Instead, it uses an ensemble of three predictors, namely the normalized compression and two simple sequence properties, for probabilistic classification of DNA sequences.

It is counter-intuitive that a genome could be classified, for example by phylum, order, class or genus, relying only on how much it can be compressed, its length, and its percentage of guanine and cytosine. Nevertheless, this manuscript shows that this is not only possible but can be done automatically and with high accuracy, using a small and diverse dataset and relying on alignment-free approaches [11]. The complete study can be fully replicated using the repository https://github.com/jorgeMFS/Archaea.

2 Methods

2.1 Database

The Archaea NCBI database is small compared with those of other domains of life. The dataset comprises 216 complete reference genomes retrieved from the NCBI database on 30 September 2021. In addition, the taxonomic description was also retrieved from the NCBI database and manually corrected so that each genome is labelled correctly at the different taxonomic levels. This mapping is available in the project to simplify future usage and replication.
2.2 Normalized Compression (NC)

An efficient compressor, C(x), provides an upper-bound approximation of the Kolmogorov complexity, K(x), where K(x) < C(x) ≤ |x| (|x| is the length of string x in the appropriate scale). Usually, an efficient data compressor is a program that approximates both probabilistic and algorithmic sources using affordable computational resources (time and memory). Although algorithmic sources may be more complex to model, data compressors can embed sub-programs to handle them. The normalized version, known as the Normalized Compression (NC), is defined by

$$\mathrm{NC}(x) = \frac{C(x)}{|x| \log_2 |A|}, \tag{1}$$

where C(x) is the compressed size of x in bits and |A| is the number of different elements in x (the size of the alphabet). Given the normalization, the NC makes it possible to compare the proportion of information contained in strings independently of their sizes [7]. If the compressor is efficient, then it can approximate the quantity of probabilistic-algorithmic information in data using affordable computational resources. In this work, the NC was computed with the state-of-the-art DNA sequence compressor GeCo3 [10].

2.3 Other Measures

The two other measures used to perform Archaea taxonomic classification are the GC-content (GC) and the length of the genome, |x|.

GC-content (GC) represents the proportion of guanine (G) and cytosine (C) bases out of the quaternary alphabet {A, C, G, T/U}, which includes thymine (T) in DNA and uracil (U) in RNA. The GC percentage is given by the number of cytosine (C) and guanine (G) bases in an Archaea genome x with length |x| according to

$$\mathrm{GC}(x) = \frac{100}{|x|} \sum_{i=1}^{|x|} N(x_i \mid x_i \in \Xi), \tag{2}$$

where x_i is each symbol of x (assuming causal order), Ξ is a subset of the genomic alphabet containing the symbols {G, C}, and N is the program that counts the number of symbols in Ξ. GC-content varies between different organisms.
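Equations (1) and (2) can be illustrated with a minimal Python sketch. It is not the pipeline used in this work: the general-purpose `lzma` compressor from the standard library stands in for GeCo3 as an assumption, so the NC values it yields are much coarser (and can exceed 1 on short or random inputs). The function names `normalized_compression` and `gc_content` are hypothetical.

```python
import lzma
import math
import random


def normalized_compression(seq: str) -> float:
    """Eq. (1): NC(x) = C(x) / (|x| * log2|A|), with C(x) in bits.

    Assumption: lzma is only a stand-in for a DNA-specific compressor
    such as GeCo3; |A| is taken as the number of distinct symbols in x.
    """
    alphabet_size = len(set(seq))
    if alphabet_size < 2:
        raise ValueError("sequence needs at least two distinct symbols")
    compressed_bits = 8 * len(lzma.compress(seq.encode("ascii")))
    return compressed_bits / (len(seq) * math.log2(alphabet_size))


def gc_content(seq: str) -> float:
    """Eq. (2): percentage of symbols of x that belong to {G, C}."""
    if not seq:
        raise ValueError("empty sequence")
    gc_count = sum(1 for symbol in seq.upper() if symbol in "GC")
    return 100.0 * gc_count / len(seq)


if __name__ == "__main__":
    rng = random.Random(0)
    repetitive = "ACGT" * 5000  # highly redundant sequence
    shuffled = "".join(rng.choice("ACGT") for _ in range(20000))
    for name, seq in (("repetitive", repetitive), ("random", shuffled)):
        print(name, len(seq), gc_content(seq), normalized_compression(seq))
```

With this stand-in, the repetitive sequence obtains a far lower NC than the pseudo-random one of the same length, which is the kind of contrast that NC is meant to capture. Together with the genome length |x|, these two functions give the three values that the ensemble takes as input.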
In addition, the GC-content value correlates with the organism's life-history traits, genome size [9], and GC-biased gene conversion [3]. As such, this measure is useful for Archaea classification. Furthermore, an organism with a genome high in GC-content is rich in energy and more prone to mutation. Thus, over time, a species tends to decrease its GC-content to become more stable, which gives further information for Archaea characterization.

As a baseline for the obtained results, we assessed the outcomes of a random classifier. For that purpose, for each task, we determined the probability of a random sequence being correctly classified (p_hit) as

$$p_{\mathrm{hit}} = \sum_{i=0}^{n} \left[ p(c_i) \cdot p_{\mathrm{correct}}(c_i) \right], \tag{3}$$

where p(c_i) is the probability of each class, determined as

Proceedings of RECPAD 2021, 27th Portuguese Conference on Pattern Recognition