Today, as data sets used in computations grow in size and complexity, the technologies developed over the years to deal with scientific data sets have become less efficient and effective. Many frequently used operations, such as Eigenvector computation, can quickly exhaust our desktop workstations once the data size reaches certain limits. The high-dimensional data sets we collect every day only compound the problem. Many conventional metric designs that build on quantitative or categorical data sets cannot be applied directly to heterogeneous data sets with multiple data types. While building new machines with more resources might conquer the data size problems, the complexity of today’s computations requires a new breed of projection techniques to support analysis of the data and verification of the results. We introduce the concept of a data signature, which captures the essence of a scientific data set in a compact format, and use it to conduct analysis as if using the original. We demonstrate our approach and present the results using a time-dependent climate simulation data set.

Background

In 1995, scientists at the Pacific Northwest National Laboratory (PNNL, on the Web at http://www.pnl.gov) had a challenging task: to analyze hundreds of thousands of unstructured text articles interactively on a desktop workstation. The solution, a system called Spire (Spatial Paradigm for Information Retrieval and Exploration), has become one of the most powerful text analysis systems developed to date. (R&D magazine recognized it with an R&D 100 Award in 1996.) Among all the core technologies developed for this project, implementation of the document vectors, which represent individual topics of a corpus, plays a critical role in the system’s success.

Because of the extremely compact design of the document vectors, many powerful but potentially expensive analysis techniques now can be applied to huge amounts of text data.
Today we can interactively analyze more than half a million news articles, study their time trends, review topic correlation, and read the original text, all on a desktop workstation such as a Sun Ultra 10. Figure 1 shows a visualization of a corpus with more than 60,000 medical research articles collected in 1997. We first project the corpus into individual document vectors before generating the terrain visualization using scaling and other analysis techniques. Refer to an earlier article [1] or to http://www.pnl.gov/infoviz on the Web for details of this visualization and the other interactive features the system provides.

Data signature

Our research on information abstraction is neither static nor complete. The idea of document vectors has since evolved into the powerful concept of a data signature that represents the content within the context of scientific data sets. In scientific computations such as climate and combustion simulations and modeling, we encounter large data sets with up to tens of gigabytes of data recorded per time step. Many conventional analysis techniques are hopelessly ineffective when faced with this much data, and the development of new tools seemingly lags behind. The data signature concept represents one promising approach to analyzing and understanding scientific data sets.

A data signature can be described as a mathematical data vector that captures the essence of a large data set in a small fraction of its original size. It’s designed to characterize a portion of a data set, such as an individual timeframe of a scientific simulation or an article within a corpus. These signatures enable us to conduct analysis at a higher level of abstraction and yet still reflect the intended results as if using the original data. For example, we can now measure the dissimilarity between two text articles by computing the difference between the two corresponding signatures and return a quantitative answer.
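To make the dissimilarity idea concrete, here is a minimal sketch in Python. The article does not say which difference measure is used, so the Euclidean and cosine measures below are assumptions chosen for illustration, and the function names and sample signature values are hypothetical.

```python
import math


def euclidean_dissimilarity(sig_a, sig_b):
    """Euclidean distance between two equal-length signature vectors.

    Assumption: the article only says "computing the difference"; Euclidean
    distance is one common choice, not necessarily the authors' metric.
    """
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sig_a, sig_b)))


def cosine_dissimilarity(sig_a, sig_b):
    """One minus the cosine similarity of two signature vectors.

    Ignores overall magnitude, which can help when signatures are built
    from documents of very different lengths (again, an assumption).
    """
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    dot = sum(a * b for a, b in zip(sig_a, sig_b))
    norm_a = math.sqrt(sum(a * a for a in sig_a))
    norm_b = math.sqrt(sum(b * b for b in sig_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine dissimilarity is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical three-component signatures for two text articles.
article_1 = [1.0, 0.0, 2.0]
article_2 = [1.0, 4.0, 5.0]

print(euclidean_dissimilarity(article_1, article_2))
print(cosine_dissimilarity(article_1, article_2))
```

Either function returns a single number, which is the "quantitative answer" described above. Because the signatures are short vectors, the cost of a comparison depends on the signature length rather than on the size of the original articles.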
We have so far investigated designing data signatures for text, scalar fields, tensor fields, and a combination of these for data sets with multiple parameters. Our design is flexible enough to process both scalar and tensor fields, and project them into one numerical signature. The construction of a data signature is based on one or more of the following features and approaches:

- Velocity gradient tensors (Jacobians)
- Critical points and their Eigenvalues
- Orthogonal and nonorthogonal edges
- Covariance matrices
- Intensity histograms
- Content segmentation
- Conditional probability

Data Signatures and Visualization of Scientific Data Sets
Pak Chung Wong, Harlan Foote, Ruby Leung, Dan Adams, and Jim Thomas, Pacific Northwest National Laboratory
Visualization Viewpoints. Editors: Theresa-Marie Rhyne and Lloyd Treinish
March/April 2000. 0272-1716/00/$10.00 © 2000 IEEE
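Two of the features named in the list of signature features, intensity histograms and covariance matrices, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not describe how the features are computed or combined, so the bin layout, the choice to concatenate the histogram with the upper triangle of the covariance matrix, and all function names and sample values are hypothetical.

```python
def intensity_histogram(values, bins, lo, hi):
    """Normalized histogram of scalar samples over the range [lo, hi].

    Values outside the range are clamped into the first or last bin.
    """
    if bins <= 0 or hi <= lo:
        raise ValueError("need bins > 0 and hi > lo")
    if not values:
        raise ValueError("need at least one sample")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def covariance_matrix(samples):
    """Sample covariance matrix; each row of `samples` is one grid point
    and each column is one simulation parameter."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    d = len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(d)]
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in samples)
            / (n - 1)
            for j in range(d)
        ]
        for i in range(d)
    ]


def build_signature(scalar_values, parameter_samples, bins, lo, hi):
    """Concatenate a histogram with the upper triangle of the covariance
    matrix into one numerical vector (a hypothetical combination)."""
    hist = intensity_histogram(scalar_values, bins, lo, hi)
    cov = covariance_matrix(parameter_samples)
    upper = [cov[i][j] for i in range(len(cov)) for j in range(i, len(cov))]
    return hist + upper


# Hypothetical time step: four scalar samples and two two-parameter grid points.
signature = build_signature(
    [0.1, 0.2, 0.55, 0.9], [[1.0, 2.0], [3.0, 6.0]], bins=2, lo=0.0, hi=1.0
)
print(signature)
```

The resulting vector has a fixed length (the number of bins plus the size of the covariance upper triangle) regardless of how many grid points the time step contains, which is what lets a signature stand in for a much larger data set.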