PRZEGLĄD ELEKTROTECHNICZNY, ISSN 0033-2097, R. 92 NR 11/2016, p. 15

Zbigniew OMIOTEK, Waldemar WÓJCIK
Lublin University of Technology, Institute of Electronics and Information Technology

doi:10.15199/48.2016.11.04

An efficient method for analyzing measurement results on the example of thyroid ultrasound images

Abstract. The paper presents a method that supports the choice of a clustering procedure and makes it possible to select parameters for the most important steps of this process. The method is demonstrated on thyroid ultrasound images from healthy individuals and from patients suffering from Hashimoto's thyroiditis. In total, 11 360 variants of the clustering procedure were analyzed, and optimal parameters were chosen for 4 different forms of the data set.

Streszczenie. The paper presents a method that supports the choice of an object clustering procedure and makes it possible to determine parameters for the most important stages of this process. The operation of the method is shown on thyroid ultrasound images from healthy individuals and from patients with Hashimoto's disease. The method made it possible to analyze 11 360 variants of the clustering procedure and to select optimal parameters for four different forms of the data set. (Wydajna metoda analizy wyników pomiarów na przykładzie badań USG: An efficient method for analyzing measurement results on the example of ultrasound examinations.)

Keywords: cluster analysis, clustering procedure, clustering validation, clusterSim.

Introduction

In many areas of science and technology, measurement results are subsequently used to build a computer recognition system. Depending on the response such a system produces for a given input object, we say that it performs a supervised classification, regression, or clustering task.
We can distinguish 7 steps in a typical clustering procedure: selection of objects and variables, choice of a variable normalization formula, selection of a distance measure, selection of a clustering method, determination of the number of clusters, clustering validation, and description and profiling of the groups. The critical stages are the choice of the normalization formula, the selection of the distance measure, the selection of the clustering method, and the determination of the number of clusters. These decisions are largely arbitrary [1]. Depending on the similarity measure, the type of clustering algorithm, and the values of its parameters, we obtain different partitions of a given set of objects. Dividing objects into classes is therefore a difficult task, and clustering methods remain an active area of research [2-6].

The paper shows that clustering can be automated and that the most important parameters of this process can be selected objectively. Our work aims to present a method that accomplishes this task and to verify its usefulness on the example of thyroid ultrasound images.

Research material, tools and methods

In the study, we used series of thyroid ultrasound images belonging to 60 patients: 28 healthy individuals and 32 patients diagnosed with Hashimoto's disease [7, 8]. On this basis, we obtained 126 samples from cases identified as sick and 108 samples from healthy cases. The image analysis produced a set of 281 image feature descriptors, which we then reduced using 3 different methods. We obtained 48 descriptors using the correlation method, 57 descriptors using the HINoV method [9], and 3 descriptors using the Hellwig method [10, 11].
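The text does not spell out how the correlation method selects descriptors. As a rough illustration only, the Python sketch below shows one common form of correlation-based reduction: a descriptor is kept only if its absolute Pearson correlation with every descriptor already kept stays below a threshold. The function names, the threshold of 0.9, and the toy data are assumptions made for this example and are not taken from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlation_filter(columns, threshold=0.9):
    """Greedy filter: keep a descriptor only if |r| < threshold against all kept ones.

    `columns` maps a descriptor name to its list of values (one per sample).
    The threshold is a hypothetical choice, not a value from the paper.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy data: d2 is almost a multiple of d1, d3 is only weakly related to d1.
descriptors = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d2": [2.1, 3.9, 6.2, 8.0, 9.9],
    "d3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
selected = correlation_filter(descriptors)
print(selected)
```

Because the filter is greedy, the outcome depends on the order of the descriptors; a real reduction step would need a documented ordering rule.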
During the clustering, the following tools and methods were used: 5 data normalization formulas (classic standardization, Weber standardization, unitarization, zero unitarization, normalization in the interval [-1; 1]); 5 distance measures for variables measured on a metric scale (Manhattan, Euclidean, Chebyshev, squared Euclidean, generalized distance measure GDM1); a simulation method for optimizing the selection of the clustering procedure (the clusterSim package was used); 5 clustering quality indexes for evaluating the simulation results (Caliński and Harabasz, Baker and Hubert, Hubert and Levine, Krzanowski and Lai, and Silhouette); 9 clustering methods (nearest neighbor, furthest neighbor, group average, weighted group average, Ward, centroid, median, k-medoids and k-means).

The number of variants of the clustering procedure under consideration depends on the number of normalization formulas, the number of distance measures and the number of clustering methods. These numbers vary with the measurement scale of the variables in the data matrix. The variables used in the study were measured on ratio and interval scales. For these scales and a given clustering quality index, the number of variants for the 7 hierarchical agglomerative methods and the k-medoids method is equal to 140 (5 normalization formulas and 5 distance measures¹): 5 × 5 × 5 = 125 variants for the 5 methods that accept every distance measure, plus 5 × 3 = 15 variants for the 3 methods restricted to the squared Euclidean distance. In addition, for 2 indexes (Caliński and Harabasz, and Krzanowski and Lai) the k-means method is also used, which adds 5 variants per index (one per normalization formula), i.e. 10 variants in total. Because the study included 5 clustering quality indexes, the total number of variants for a single fixed number of groups was equal to 710 (5 × 140 + 2 × 5). We ran the simulation procedure for numbers of groups from 2 to 5, so this number must be multiplied by 4.
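The count can be checked by enumerating the combinations directly. The Python sketch below is illustrative only and is not part of clusterSim; all names in it are hypothetical. It assumes, as the counts in the text imply, that k-means varies only over the normalization formula and that Ward, centroid and median are paired with the squared Euclidean distance alone.

```python
from itertools import product

NORMALIZATIONS = [
    "classic standardization", "Weber standardization", "unitarization",
    "zero unitarization", "normalization to [-1; 1]",
]
DISTANCES = ["Manhattan", "Euclidean", "Chebyshev", "squared Euclidean", "GDM1"]
HIERARCHICAL = [
    "nearest neighbor", "furthest neighbor", "group average",
    "weighted group average", "Ward", "centroid", "median",
]
# Methods with a geometric interpretation only for squared Euclidean distance.
SQUARED_EUCLIDEAN_ONLY = {"Ward", "centroid", "median"}
INDEXES = ["Calinski-Harabasz", "Baker-Hubert", "Hubert-Levine",
           "Krzanowski-Lai", "Silhouette"]
# Indexes for which k-means is additionally evaluated.
KMEANS_INDEXES = {"Calinski-Harabasz", "Krzanowski-Lai"}
CLUSTER_COUNTS = range(2, 6)  # 2, 3, 4, 5 groups
DATA_SETS = ["full", "correlation", "HINoV", "Hellwig"]


def variants_for_index(index):
    """List (normalization, distance, method) variants evaluated for one index."""
    variants = []
    methods = HIERARCHICAL + ["k-medoids"]
    for norm, dist, method in product(NORMALIZATIONS, DISTANCES, methods):
        if method in SQUARED_EUCLIDEAN_ONLY and dist != "squared Euclidean":
            continue  # skip distance measures these methods are not paired with
        variants.append((norm, dist, method))
    if index in KMEANS_INDEXES:
        # Assumption: k-means takes no distance measure from the list above.
        variants.extend((norm, None, "k-means") for norm in NORMALIZATIONS)
    return variants


base_per_index = len(variants_for_index("Silhouette"))
per_group_count = sum(len(variants_for_index(i)) for i in INDEXES)
per_data_set = per_group_count * len(CLUSTER_COUNTS)
total = per_data_set * len(DATA_SETS)
print(base_per_index, per_group_count, per_data_set, total)
```

Writing the restrictions as explicit filters makes it easy to see how the total changes if a normalization formula, distance measure or method is added or removed.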
As a result, the number of variants for 1 type of data set was equal to 2 840. The analysis used 4 types of data sets (the full one and 3 reduced ones), so the total number of variants of the clustering procedure under consideration was 11 360.

¹ For 3 hierarchical methods (Ward, centroid, median), the squared Euclidean distance is used as the distance measure, because these methods have a geometric interpretation only in this case.

Simulation method for optimization of the clustering procedure selection

We used a simulation method to handle a task as complex as the analysis of 11 360 variants of the clustering procedure. For this purpose, the clusterSim package written in R was used. The package consists of the basic cluster.Sim function and 16 auxiliary functions. The basic function searches for the optimal clustering procedure