Using distance-based methods to calculate optimal and suboptimal parsimony trees

Osama A. Salman, Gábor Hosszú
Department of Electron Devices, Budapest University of Technology and Economics, Budapest, Hungary

Abstract

This study examines how effectively different distance-based techniques produce optimal and suboptimal parsimony trees for large datasets. We introduce a distance-based approach for constructing the tree topology, combined with a random method for score computation, which can efficiently detect optimal or suboptimal trees in large datasets. The study examines various distance metrics and their impact on hierarchical clustering results, with a specific focus on tree length, Consistency Index (CI), and Retention Index (RI). The analysis yields findings on which metrics and techniques are suitable for improving the accuracy and computational efficiency of phylogenetic analysis in scriptinformatics and beyond.

Categories and Subject Descriptors (according to ACM CCS): I.2.m [Artificial Intelligence]: Modelling of Evolution

1. Introduction

Finding the best-scoring topology remains difficult for character-based methods. Several methods have been used in the past, such as branch-and-bound and branch-swapping algorithms [1]. Although these algorithms are relatively fast on small datasets, the number of possible trees that must be checked increases dramatically as the dataset grows [2]. In this research, we propose to construct the tree topology using a distance-based method and then compute the score using an arbitrary method, which can be efficient for finding a suboptimal or even optimal tree on large datasets. The dataset used in this research is publicly available on GitHub and ResearchGate [3, 4]. Some phenetic and cladistic analyses have already been published [2, 5].
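
The two-step idea described above (fix a topology with a distance-based method, then score it under parsimony) can be illustrated with a minimal sketch. It assumes Fitch's small-parsimony algorithm for unordered characters as the scoring step, which the text does not specify. The nested-tuple tree encoding, the toy taxon-feature matrix, and all function names are hypothetical and serve only as an illustration.

```python
# Illustrative sketch only: Fitch small-parsimony scoring of a fixed topology.
# The scoring algorithm, tree encoding, and toy data are assumptions, not the
# procedure used in this study.

def fitch_steps(tree, states):
    """Return (state_set, steps) for one character on a rooted binary tree.

    tree   -- a leaf name (str) or a 2-tuple of subtrees
    states -- dict mapping each leaf name to its character state
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch_steps(tree[0], states)
    right_set, right_steps = fitch_steps(tree[1], states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed.
        return common, left_steps + right_steps
    # Children disagree: take the union and count one change.
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, matrix):
    """Sum the Fitch steps over all characters (columns) of the matrix."""
    n_chars = len(next(iter(matrix.values())))
    total = 0
    for j in range(n_chars):
        column = {taxon: row[j] for taxon, row in matrix.items()}
        total += fitch_steps(tree, column)[1]
    return total


def consistency_index(tree, matrix):
    """CI = minimum conceivable steps / observed steps on this tree."""
    n_chars = len(next(iter(matrix.values())))
    # A character with k observed states needs at least k - 1 changes.
    min_steps = sum(
        len({row[j] for row in matrix.values()}) - 1 for j in range(n_chars)
    )
    observed = tree_length(tree, matrix)
    if observed == 0:
        raise ValueError("CI is undefined when no character varies")
    return min_steps / observed


# Hypothetical toy matrix: four taxa, three binary characters.
toy_matrix = {"A": "001", "B": "011", "C": "110", "D": "100"}

# Two candidate topologies, e.g. as produced by two different clusterings.
topology_1 = (("A", "B"), ("C", "D"))
topology_2 = (("A", "D"), ("B", "C"))

length_1 = tree_length(topology_1, toy_matrix)  # 4 steps
length_2 = tree_length(topology_2, toy_matrix)  # 5 steps
ci_1 = consistency_index(topology_1, toy_matrix)  # 3 / 4
```

In this toy case the first topology is shorter, so it would be preferred under parsimony. Comparing topologies derived from different distance metrics in this way avoids enumerating all possible trees, at the cost of not guaranteeing the optimum.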
Scriptinformatics investigates the evolution of graphemes in scripts and explores the relationships between scripts, where a script can be any sequence of symbols of cultural origin, such as a historical writing system [6, 7]. Evolutionary modelling of scripts includes phylogenetic modelling, namely phenetic, evolutionary, and statistical analyses of the studied scripts' features [8, 9]. The core of the phenetic approach is the cluster analysis of the taxon-feature data matrix. Cluster analysis can yield various types of cluster structures, including dendrograms when the clustering method is hierarchical [5].

2. Background

Pairwise distance measures are crucial for understanding the relationships and similarities between data points in a multidimensional space. In this research, we employ a comprehensive set of distance metrics to elucidate the phylogenetic relationships among taxa. The 'euclidean' distance, the most commonly used metric, measures the straight-line distance between two points in Euclidean space [10]. The 'squaredeuclidean' distance is similar and is used for efficiency because it omits the square root calculation; however, it does not satisfy the triangle inequality, which limits its use in some applications [11]. The 'seuclidean' or standardized Euclidean distance scales each coordinate difference by the standard deviation of that dimension, thus compensating for differences in the scale of measurement [12]. For large datasets, 'fasteuclidean' and its squared counterpart provide computational efficiency at the potential cost of some accuracy, and they are not suitable for sparse datasets [13]. The 'mahalanobis' distance accounts for the correlation between variables, offering a scale-invariant measure that is especially important in multivariate analysis [12].
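
The Euclidean variants described above can be made concrete with a short sketch. It is a minimal illustration using only the Python standard library; the function names and sample points are hypothetical, and the use of the sample standard deviation (n - 1 denominator) for the standardized distance is an assumption.

```python
import math
from statistics import stdev


def euclidean(x, y):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    """Euclidean distance without the square root."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def column_stdevs(rows):
    """Sample standard deviation of each dimension (assumed n - 1 denominator).

    A dimension with zero variance would cause a division by zero in
    standardized_euclidean, so such columns should be dropped beforehand.
    """
    return [stdev(column) for column in zip(*rows)]


def standardized_euclidean(x, y, scales):
    """Euclidean distance after dividing each difference by its scale."""
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, scales)))


# Three collinear points show why the squared form is not a true metric.
p, q, r = (0.0, 0.0), (1.0, 0.0), (2.0, 0.0)
euclid_ok = euclidean(p, r) <= euclidean(p, q) + euclidean(q, r)
squared_ok = squared_euclidean(p, r) <= (
    squared_euclidean(p, q) + squared_euclidean(q, r)
)

# Hypothetical data whose second dimension has a much larger scale.
rows = [(0.0, 0.0), (1.0, 100.0), (2.0, 200.0)]
scales = column_stdevs(rows)
plain = euclidean(rows[0], rows[2])
standardized = standardized_euclidean(rows[0], rows[2], scales)
```

Here `euclid_ok` is true while `squared_ok` is false, since the squared distance from `p` to `r` is 4 but the two intermediate squared distances sum to 2. The plain distance between the first and last row is dominated by the second dimension, whereas after standardization both dimensions contribute equally.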
The 'cityblock' or Manhattan distance is the sum of the absolute differences between the Cartesian coordinates of two points, and the 'minkowski' distance generalizes both the Euclidean and Manhattan distances through an adjustable exponent [14]. We also consider the 'chebychev' distance, which equals the maximum coordinate difference between two points [15], and the 'cosine' distance, which is based on the cosine of the angle between two vectors and therefore provides a scale-invariant measure of their similarity [16]. The 'correlation' distance is based on Pearson's correlation coefficient, reflecting the linear relationship between data points [17]. The 'hamming' distance, suitable for categorical data, counts the number of differing