Eurographics Conference on Visualization (EuroVis) 2013
B. Preim, P. Rheingans, and H. Theisel (Guest Editors)
Volume 32 (2013), Number 3

Towards High-dimensional Data Analysis in Air Quality Research

Submission #: 257, type: application

Abstract

The analysis of aerosol emission sources involves mass spectrometry data factorization, an approximation of high-dimensional data in lower-dimensional space. The optimization problem associated with this analysis is non-convex and cannot be solved optimally with currently known algorithms, resulting in factorizations with large approximation errors that are inaccessible to scientists. We describe a new methodology for user-guided, error-aware data factorization that mitigates this problem. Based on a novel formulation of factorization basis suitability and an effective combination of visualization techniques, we provide means for the visual analysis of factorization quality and for the local refinement of factorizations to minimize approximation errors. A case study and a domain-expert evaluation by collaborating atmospheric scientists show that our method communicates errors of numerical optimization effectively and enables the computation of high-quality data factorizations in a simple way.

Categories and Subject Descriptors (according to ACM CCS): I.5.5 [Pattern Recognition]: Design Methodology—Feature evaluation and selection

1. Introduction

Atmospheric particles have been shown to increase morbidity and mortality in urban areas and to alter the Earth’s radiative energy balance. A key step in delineating this problem is identifying the emission sources of ambient airborne particles. Using innovative instruments, atmospheric scientists are now able to chemically analyze aerosols in real time, providing unprecedentedly rich data sets for air quality research.
These single particle mass spectrometers (SPMS) measure the mass spectrum of aerosols, thereby fundamentally characterizing particles in high-dimensional space. An example mass spectrum is shown in Figure 1. In order to factor out emission sources from these measurements, analysis requires non-negative matrix factorization (NMF). The optimization problem can be defined as follows: given data derived from a combination of unknown sources in unknown abundance, the goal is to factor out both unknowns, provided only with an estimate of the number of sources and an assumption about their mixing model. In air quality research, sources represent a non-negative (and non-orthogonal) basis in high-dimensional space, and each SPMS sample is approximated linearly by its coefficients with respect to this basis. However, computing suitable basis vectors and coefficients proves difficult in practice, as the optimization problem is ill-posed and non-convex. Currently known algorithms produce sub-optimal factorization results. The approximation error can be defined as the discrepancy between the data and its lower-dimensional approximation. While such errors are, in general, unavoidable in dimension reduction, they can become particularly large for sub-optimal factorizations and are hard for atmospheric scientists to assess without the proper visual analytical tools. Yet the visual communication of errors in non-negative matrix factorization has not been studied in visualization research, and common visualization tools are not applicable to this problem.
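The factorization problem described above can be illustrated with a minimal sketch. It assumes a Frobenius-norm objective and the classic multiplicative update rules of Lee and Seung, neither of which is specified in the text. The names `X`, `W`, `H`, `nmf_multiplicative`, and `approximation_error`, as well as the synthetic data, are hypothetical and serve only as illustration.

```python
import numpy as np


def nmf_multiplicative(X, k, n_iter=500, seed=0, eps=1e-12):
    """Approximate a non-negative matrix X (samples x channels) as W @ H.

    H (k x channels) holds the non-negative basis vectors (the "sources");
    W (samples x k) holds the non-negative coefficients of each sample.
    Assumption: Frobenius-norm objective with multiplicative updates.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Random strictly positive start; the result depends on this choice
    # because the objective is non-convex.
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep all entries non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def approximation_error(X, W, H):
    """Discrepancy between the data and its low-dimensional approximation."""
    return np.linalg.norm(X - W @ H)  # Frobenius norm of the residual


# Synthetic stand-in for SPMS data: 40 samples, 25 mass channels,
# mixed linearly from 3 hypothetical sources.
data_rng = np.random.default_rng(42)
true_H = data_rng.random((3, 25))
true_W = data_rng.random((40, 3))
X = true_W @ true_H

# Different initializations may end in different local optima.
for seed in (0, 1, 2):
    W, H = nmf_multiplicative(X, k=3, seed=seed)
    print(f"seed={seed}  error={approximation_error(X, W, H):.6f}")
```

Running the factorization from several random starting points may yield different residual errors, which is one simple way to observe the non-convexity mentioned above. The residual `X - W @ H` is the quantity whose structure an error visualization would need to expose.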
We discuss our new approaches for the visual analysis of approximation errors in non-negative matrix factorization by describing (i) a methodology for assessing the quality of a factorization basis based on the amount of information introduced by each basis vector, (ii) a visualization of factorization errors designed to depict the major features that are in the data but not included in its factorization, and (iii) means to interactively minimize specific errors. During analysis, the scientist can compare the numerical benefit of introducing a basis vector that minimizes error features selected in the visualization against the benefit of each vector currently in the basis. Following this methodology, the scientist can interactively discover and overcome “being stuck” in local optima of the non-convex factorization, consequently improving the factorization quality. Due to the high degree of interac-