Today, as data sets used in computations grow in size
and complexity, the technologies developed over
the years to deal with scientific data sets have become
less efficient and effective. Many frequently used operations,
such as eigenvector computation, can quickly exhaust the
resources of a desktop workstation once the data size
reaches a certain limit.
The high-dimensional data sets we collect every day
compound the problem. Many conventional metric designs
that build on quantitative or categorical data sets
cannot be applied directly to heterogeneous data sets
with multiple data types.
While building new machines with more resources
might conquer the data size problems, the complexity
of today’s computations requires a new breed of projec-
tion techniques to support analysis of the data and ver-
ification of the results. We introduce the concept of a
data signature, which captures the essence of a scien-
tific data set in a compact format, and use it to conduct
analysis as if using the original. We demonstrate the
approach and present results using a time-dependent
climate simulation data set.
Background
In 1995, scientists at the Pacific Northwest National
Laboratory (PNNL, on the Web at http://www.pnl.gov)
had a challenging task: to analyze hundreds of thousands
of unstructured text articles interactively on a desktop
workstation. The solution—a system called Spire (Spatial
Paradigm for Information Retrieval and Exploration)—
has become one of the most powerful text analysis sys-
tems developed to date. (R&D magazine recognized it
with an R&D 100 Award in 1996.) Among all the core
technologies developed for this project, implementation
of the document vectors, which represent individual top-
ics of a corpus, plays a critical role in the system’s success.
Because of the extremely compact design of the doc-
ument vectors, many powerful—but potentially expen-
sive—analysis techniques now can be applied to huge
amounts of text data. Today we can interactively analyze
more than half a million news articles, study their time
trends, review topic correlation, and read the original
text, all on a desktop workstation such as a Sun Ultra 10.
Figure 1 shows a visualization of a corpus with more
than 60,000 medical research articles collected in 1997.
We first project the corpus into individual document vec-
tors before generating the terrain visualization using
scaling and other analysis techniques. Refer to an earlier
article [1] or to http://www.pnl.gov/infoviz on the Web
for details of this visualization and the other interactive
features the system provides.
Data signature
Our research on information abstraction is neither
static nor complete. The idea of document vectors has
since evolved into the powerful concept of a data signa-
ture that represents the content within the context of
scientific data sets. In scientific computations such as
climate and combustion simulations and modeling, we
encounter large data sets with up to tens of gigabytes of
data recorded per time step. Many conventional analysis
techniques are ineffective when faced with this much data,
and the development of new tools appears to lag behind.
The data signature concept represents one promising
approach to analyzing and understanding scientific data sets.
A data signature can be described as a mathematical
data vector that captures the essence of a large data set
in a small fraction of its original size. It’s designed to char-
acterize a portion of a data set, such as an individual time-
frame of a scientific simulation or an article within a
corpus. These signatures enable us to conduct analysis at
a higher level of abstraction and yet still reflect the intend-
ed results as if using the original data. For example, we
can now measure the dissimilarity between two text arti-
cles by computing the difference between the two corre-
sponding signatures and return a quantitative answer.
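The dissimilarity measure described above can be sketched in a few lines. This is a minimal illustration, assuming each signature is a fixed-length numeric vector and that the "difference" is the Euclidean distance; the article does not name a specific metric, and the function name and sample vectors below are hypothetical.

```python
import math
from typing import Sequence


def signature_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length signature vectors.

    Assumption: Euclidean distance stands in for the unspecified
    "difference" between signatures.
    """
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical 4-component signatures for three articles.
sig_a = [0.5, 0.25, 0.25, 0.0]
sig_b = [0.5, 0.25, 0.0, 0.25]
sig_c = [0.0, 0.0, 0.5, 0.5]

# A smaller distance means the two articles are more alike.
print(signature_distance(sig_a, sig_b))
print(signature_distance(sig_a, sig_c))
```

With these sample vectors, the first pair differs in only two components, so it yields the smaller distance. Because the comparison touches only the compact signatures, its cost does not depend on the size of the original articles.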
We have so far investigated designing data signatures
for text, scalar fields, tensor fields, and a combination
of these for data sets with multiple parameters. Our
design is flexible enough to process both scalar and ten-
sor fields, and project them into one numerical signa-
ture. The construction of a data signature is based on
one or more of the following features and approaches:
■ Velocity gradient tensors (Jacobians)
■ Critical points and their eigenvalues
■ Orthogonal and nonorthogonal edges
■ Covariance matrices
■ Intensity histograms
■ Content segmentation
■ Conditional probability
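As an illustration of one feature from the list above, the sketch below builds a signature for a single timeframe of a scalar field from a normalized intensity histogram. It is not the authors' construction: the function name, the bin count, the value range, and the sample data are all assumptions made for the example, and a practical signature would likely combine several of the listed features.

```python
from typing import List, Sequence


def histogram_signature(
    values: Sequence[float], lo: float, hi: float, bins: int = 8
) -> List[float]:
    """Normalized intensity histogram of one timeframe over [lo, hi].

    Assumption: a fixed number of equal-width bins; the article does
    not specify how its histograms are built.
    """
    if bins <= 0 or hi <= lo:
        raise ValueError("need bins > 0 and hi > lo")
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp so out-of-range values fall into the first or last bin.
        idx = min(max(int((v - lo) / width), 0), bins - 1)
        counts[idx] += 1
    total = len(values)
    if total == 0:
        return [0.0] * bins
    # Dividing by the sample count makes signatures of differently
    # sized timeframes comparable.
    return [c / total for c in counts]


# Hypothetical scalar samples from one timeframe, values in [0, 1].
frame = [0.0, 0.1, 0.2, 0.9, 1.0, 0.5, 0.55, 0.6]
signature = histogram_signature(frame, lo=0.0, hi=1.0, bins=4)
print(signature)
```

The signature length depends only on the number of bins, not on the number of samples in the timeframe, which is what keeps it a small fraction of the original data size.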
Pak Chung Wong, Harlan Foote, Ruby Leung, Dan Adams, and Jim Thomas
Pacific Northwest National Laboratory
0272-1716/00/$10.00 © 2000 IEEE
Data Signatures and Visualization of Scientific Data Sets
Visualization Viewpoints
Editors: Theresa-Marie Rhyne and
Lloyd Treinish
March/April 2000