A Pyramid Data Model for Supporting Content-based Browsing and Knowledge Discovery

Zuotao Li, X. Sean Wang, Menas Kafatos, Ruixin Yang
{zli, xywang, mkafatos, yang}@gmu.edu
George Mason University, Fairfax, Virginia, U.S.A.

Scientific and Statistical Database Management Conference, July 1-3, 1998, Capri, Italy (IEEE)

Copyright 1998 IEEE. Published in the Proceedings of SSDBM ’98, July 1-3, 1998, in Capri, Italy. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from IEEE. Contact: Manager, Copyrights and Permissions, IEEE Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331, USA. Telephone number: 732-562-3966.

Abstract

Remote sensing from space can provide global and continuous observations. The associated measurement data need to be stored and studied to understand the Earth system processes. The capability for interactive content-based browsing, i.e., browsing or searching the content to narrow down the interesting portions of data sets prior to actually accessing or ordering full data sets, is highly desirable for any Earth science data information system. However, the large volumes of archived and future Earth science remote sensing data are clearly a serious challenge for an interactive browsing process. In this paper, a pyramid data model is introduced to support interactive content-based browsing and knowledge discovery for a wide variety of Earth science remote sensing data sets. By using multi-level precomputation and robust nonparametric approximation procedures, the interactive browsing performance can be enhanced greatly.
An initial implementation and testing of this data model have been carried out through our research prototype system, the Virtual Domain Application Data Center (VDADC). Future implementations are planned for our Seasonal to Interannual Earth Science Information Partner (SIESIP) project.

1. Introduction

Increasingly large volumes (terabytes or more) of Earth science data have been, or will be, collected and archived from remote sensing measurements, such as those of the NASA Earth Observing System (EOS). A great need exists for efficient data information systems and associated tools for data management, access, distribution, analysis, and knowledge discovery.

Based on our experience in closely working with data users and data distributors (Kafatos, et al. [1997]), we have adopted a two-step user access model. First, users “browse” archived data to locate interesting subsets. The browsing step includes querying the contents of the data, discovering data patterns, or performing any on-line preliminary data analysis. The users then download or order these subsets for further, more accurate analysis or other usage. The volume of the data for the second step cannot be too large for most users. An effective first step is more likely to reduce this data volume, which makes the first step important.

In the first data “browsing” step, users are frequently unsatisfied with searching only the static metadata, most of which are independent of the actual data contents, such as data set source, size, etc. They frequently wish to look at the actual data contents as well, and we term this process content-based browsing. Content-based browsing is a process of browsing or searching the content of data sets prior to actually accessing or ordering full data sets; it allows a user to acquire important information contained in the data in order to make better choices about accessing the full data sets (Kafatos, et al. [1997]).
For a large data volume, queries on the statistical properties of the data can be used in a content-based browsing process. By browsing these query results, users may see patterns in the data or gain insights into it. Normally, in an Earth science application, a user first browses large-scale (i.e., low-resolution) data patterns, and then proceeds to narrow down the search to small-scale (i.e., high-resolution) detailed fea-