Cluster of Tweet Users Based on Optimal Set

Amit Paul
Department of Computer Science & Engineering, BTKIT, Dwarahat
Email: amitpaul06@gmail.com

Animesh Dutta
Department of Information Technology, National Institute of Technology Durgapur
Email: animeshnit@gmail.com

Frans Coenen
Department of Computer Science, University of Liverpool
Email: coenen@liverpool.ac.uk

Abstract: For years, even decades, researchers have dealt with the problem of duplicate or overlapping clusters in a cluster set. Clusters overlap with each other, as in social networking groups or when grouping movies by genre. In this paper, a hierarchical form of clustering is used to cluster users based on their interactions, which creates numerous clusters of different sizes at different hierarchical levels. In doing so, many overlapping clusters are generated, and duplicates are not removed. Duplication poses a challenge for differentiation. Our work is twofold: first, to cluster users at different hierarchical levels to generate one set of clusters per level; and second, to find the optimal set among these cluster sets using only the mean and standard deviation. The sense of optimality differs with the requirements. Our work shows that the optimal set can be chosen according to the requirement.

1. INTRODUCTION

Clustering is a form of unsupervised learning in which similar objects form groups such that the objects in a group are more similar to each other than to objects in any other group. There are many examples of unsupervised learning in the online world, such as grouping movies by parameters like genre or age (old and new), or forming groups of users in a social network by their interaction. In all these cases overlapping occurs. A social network such as Twitter is used here to generate clusters based only on the interaction among individuals. Here, interaction means only who has retweeted or replied to whom.
By doing so, many overlapping groups or clusters are generated. The actual content of the messages is not considered. Many approaches have been proposed in the past for generating overlapping clusters [1], [2], [3], [4], [5]. Here, an algorithm is proposed for generating cluster sets whose clusters overlap, but our main focus is on finding the stability of these cluster sets and choosing the optimal one. The clusters in an individual set are overlapping: a user may be found in more than one cluster. Since clustering is done hierarchically, a cut has to be made somewhere to get the optimal cluster set. The clusters in a set are generated by level. Goldberg M K et al [6] compare two sets of overlapping clusters. Duplication among clusters complicates matters further. Since users interact with each other, there is a hierarchy of users starting from any one user to form a cluster. The question that arises is up to which hierarchy level a cluster should extend. If no level is chosen, there will be many clusters with numerous duplicates. Finding duplicate clusters, or the number of duplicate clusters, within a cluster set is one challenge; finding the optimal cluster set by level, knowing that clusters within the set might overlap, is another. This paper focuses on the latter. Heise A et al [7] came up with a method to detect duplicate clusters within a set.

2. RELATED WORK

The complexity of online networks can be gauged from their structure, where community or group clusters are either very close to each other or overlapping. Some works [1], [2], [5] focus on finding overlapping communities efficiently and also on finding outliers that do not belong to any community [2]. Moreover, [2] uses a fuzzy technique to detect communities. Gregory S [1] proposed an improved CONGA algorithm based on a 'local' form of betweenness.
Palla et al [3] analyse statistical features of overlapping communities, introduce an approach for complex systems, and find that the overlaps are significant. Newman M E J [4] proposed a method to detect community structure and to determine whether there exists any natural cut into non-overlapping communities. Wang X et al [5] proposed a co-clustering method to group communities using the tag information in messages. Goldberg M K et al [6] measure the distance between two overlapping cluster sets using three different measures, assuming that each individual cluster contains no duplicates.

Duplication in clusters is a major area of concern for data quality. Detecting and cleaning duplicates is considered part of data preprocessing in data mining, before the data is put to different uses. Hassanzadeh et al [8] used several clustering techniques to detect duplicates and also used the Stringer system to evaluate cluster quality. Banerjee et al [9] worked on overlapping clusters in which some entities are allowed to be members of more than one cluster. Our work is somewhat similar, as duplicates are allowed in the clusters and are not removed. Keeping interesting duplicates will enrich the information for finding the localness of tweet users in our further study. In [10], the stability of clusters is determined by principal component analysis. That work is done on gene data but can be extended to any sort of data. Hennig C [11], [12] used the Jaccard coefficient to find the stability between two cluster sets.

3. SCOPE OF THE WORK

In this paper, the work revolves around generating clusters and finding the optimal set of clusters by level. The problem addressed is not only generating overlapping clusters but also comparing cluster sets at given hierarchical levels. In a single cluster set, at a particular level, there are numerous clusters, which are compared with each other to find the percentage of similarity. At each level the cluster size threshold is different.
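
The two steps outlined in this section can be illustrated with a minimal Python sketch. It is not the algorithm of this paper, only one plausible reading under stated assumptions: interactions are given as (user, user) pairs and treated as undirected; a cluster is grown from each user by following interactions up to a chosen level; and the percentage of similarity between two clusters is taken to be the Jaccard coefficient scaled to 100. The function names `clusters_at_level`, `similarity_percent` and `similarity_stats` are hypothetical.

```python
from collections import deque
from itertools import combinations
from statistics import mean, pstdev


def clusters_at_level(edges, level):
    """Grow one cluster per user by following interactions up to `level` hops.

    Assumption: each edge (a, b) means a retweeted or replied to b, and the
    direction is ignored here. Duplicate clusters are deliberately kept.
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    clusters = []
    for start in sorted(neighbours):
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            user, depth = queue.popleft()
            if depth == level:
                continue  # do not expand beyond the chosen level
            for other in neighbours[user]:
                if other not in seen:
                    seen.add(other)
                    queue.append((other, depth + 1))
        clusters.append(frozenset(seen))
    return clusters


def similarity_percent(a, b):
    """Jaccard coefficient of two clusters, as a percentage (an assumed measure)."""
    return 100.0 * len(a & b) / len(a | b)


def similarity_stats(clusters):
    """Mean and population standard deviation of all pairwise similarities."""
    scores = [similarity_percent(a, b) for a, b in combinations(clusters, 2)]
    if not scores:
        return 0.0, 0.0
    return mean(scores), pstdev(scores)


if __name__ == "__main__":
    # Toy interaction chain a-b-c-d; real input would come from tweet data.
    edges = [("a", "b"), ("b", "c"), ("c", "d")]
    for level in (1, 2, 3):
        avg, std = similarity_stats(clusters_at_level(edges, level))
        print(f"level {level}: mean={avg:.1f}% std={std:.1f}")
```

On this toy chain, every cluster at level 3 contains all four users, so the mean similarity is 100% and the standard deviation is 0. Which combination of mean and standard deviation counts as optimal depends on the requirement and is left open in this sketch.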
Goldberg et al [3] worked on finding similarity between two