Robust Scalable Visualized Clustering in Metric and non-Metric Spaces

Geoffrey Fox
School of Informatics and Computing, Indiana University, Bloomington IN 47408, USA
gcf@indiana.edu

Abstract

We describe an approach to data analytics on large systems using a suite of robust parallel algorithms running on both clouds and HPC systems. We apply this both to cases where the data is defined in a vector space and to cases where only pairwise distances between points are defined. We introduce improvements to known algorithms for functionality, features and performance. Visualization is valuable for steering complex analytics, and we discuss it both for the non-metric case and for clustering high-dimensional vector spaces. We exploit deterministic annealing, which is heuristic but rests on clear general principles that yield reasonably fast, robust algorithms. We apply these methods to several life sciences applications.

1 Introduction

The importance of big data is well understood, but so far there is no core library of "big algorithms" that tackles the new issues that arise. These include, of course, parallelism, which should be scalable, i.e. run at good efficiency as problem and machine are scaled up. Further, one can expect that larger datasets will increase the need for robust algorithms that, for example, do not easily get trapped in local minima when applied to the many optimization problems in big data. Section 2 describes deterministic annealing as a generally applicable principle that makes many algorithms more robust and builds in the important multi-scale concept. In section 3, we focus on clustering, with an emphasis on some of the advanced features that are typically not provided in openly available software such as R [3]. We discuss some challenging parallelization issues for cases with heterogeneous geometries. Further, we note that the majority of datasets are high-dimensional and so not easily visualized to inspect the quality of an analysis.
Thus we suggest that it is good practice to follow a data mining algorithm with some form of visualization. We suggest that the well-studied area of dimension reduction deserves more systematic use, and we show how one can map high-dimensional metric and non-metric spaces into 3 dimensions for this purpose. This process is itself time consuming and an optimization problem (find the best 3D representation of a set of points), and so needs the same consideration of parallelization. This is briefly described in section 4.

2 Deterministic Annealing

Deterministic annealing [4] is motivated by the same key concept as the more familiar simulated annealing, which is well understood from physics. We are considering optimization problems and want to follow nature's approach, which finds true minima of energy functions rather than local minima with dislocations of some sort. At high temperatures systems equilibrate easily, as there is no roughness in the energy (objective) function. If one lowers the temperature on an equilibrated system, there is a short, safe path between the minima at the current temperature and those at a higher temperature. Thus systems which are equilibrated iteratively at gradually lowered temperatures tend to avoid local minima. The Monte Carlo approach of simulated annealing is often too slow, so we perform the integrals analytically, using a variety of approximations within the well-known mean field approximation of statistical physics. In the basic case we have a Hamiltonian H(χ) which is to be minimized with respect to variables χ, and we introduce the Gibbs distribution at temperature T:

P(χ) = exp(−H(χ)/T) / ∫ dχ exp(−H(χ)/T)   (1)

or

P(χ) = exp(−H(χ)/T + F/T)   (2)

where F = −T ln ∫ dχ exp(−H(χ)/T), and minimize this Free Energy F, which combines the Objective Function and the Entropy,

F = <H − T S(P)> = ∫ dχ [P(χ) H(χ) + T P(χ) ln P(χ)]   (3)

as a function of χ, which are (a subset of) the parameters to be minimized. The temperature is lowered slowly, say by a factor of 0.95 to 0.9995 at each iteration.
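For vector-space clustering the recipe above reduces to a simple iteration: at each temperature, points are assigned to cluster centers with Gibbs probabilities derived from equation (1), centers are re-estimated as probability-weighted means (the mean-field fixed point of the Free Energy in (3)), and the temperature is then lowered by the cooling factor. The following is a minimal sketch in Python/NumPy, not the paper's implementation; the function name, the cooling and iteration parameters, and the symmetry-breaking noise scale are all illustrative assumptions.

```python
import numpy as np

def da_cluster(X, k, T_min=0.01, cool=0.95, inner=10, seed=0):
    """Deterministic-annealing clustering sketch (soft k-means with temperature).

    At temperature T each point i is assigned to center j with Gibbs
    probability P(j|i) ∝ exp(-|x_i - y_j|^2 / T); centers are then the
    probability-weighted means. T is lowered by a cooling factor each
    step (the text suggests factors of 0.95 to 0.9995 per iteration).
    """
    rng = np.random.default_rng(seed)
    # Start above the first critical temperature (about twice the largest
    # eigenvalue of the data covariance), where the Free Energy has a
    # single minimum and all centers sit at the overall mean.
    T = 2.0 * np.linalg.eigvalsh(np.cov(X.T)).max()
    centers = np.tile(X.mean(axis=0), (k, 1))
    while T > T_min:
        # Tiny perturbation so duplicated centers can split at phase transitions.
        centers += 1e-4 * rng.standard_normal(centers.shape)
        for _ in range(inner):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            logp = -d2 / T
            logp -= logp.max(axis=1, keepdims=True)   # numerically stable softmax
            P = np.exp(logp)
            P /= P.sum(axis=1, keepdims=True)         # Gibbs probabilities P(j|i)
            centers = (P.T @ X) / P.sum(axis=0)[:, None]  # probability-weighted means
        T *= cool                                     # lower the temperature
    return centers
```

As T → 0 the Gibbs probabilities harden into the nearest-center assignments of ordinary k-means, while the high-temperature start gives the multi-scale, local-minimum-avoiding behavior described above.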
For some cases such as metric space clustering and Mixture Models, one can do integrals of equations (1) and (2) analytically but usually that will be