Assuring Retrievability from Unstructured Databases by Contexts
Amihai Motro
Department of Computer Science
University of Southern California
Los Angeles, CA 90089
Abstract
In an unstructured database the data is a collection of
facts that does not adhere to any schema. Such a
database does not require any initial design and can
therefore evolve freely to accommodate new applications.
It is particularly suitable for information which is diverse
and idiosyncratic, such as when we want to store
everything known on a particular topic. Unfortunately,
this freedom also means that similar information may be
entered in different forms. This may cause severe
problems when retrieval is attempted, as some of the data
may appear to have been "lost" in the database. In this
paper we propose a method to solve this problem. Each
database fact must be supported by a context in the
database, in the form of several other facts. When an
attempt is made to add a fact to the database, the
existence of a suitable context is verified, or is extracted
from the user in a simple dialogue. Thus, the database
still retains the flexibility of unstructured databases, but
problems of multiple representations are usually
prevented.
1. Introduction
Most database management systems employ data models
that are structured (or strictly-typed). The network, the
hierarchical and the relational data models are all
examples of the structured approach. Such models
enforce a database design that is both restrictive and
permanent. Restrictive, because the design relies heavily
on broad categorizations that apply to large classes of
instances. Permanent, because in general these models
require a priori commitment to a particular design.
Consequently, structured models are suitable mostly for
traditional database applications in which the
environment to be modelled lends itself to simple
categorizations and is relatively stable.
For example, a typical data model will record employees
and departments with a fixed number of attributes, such
as EMPLOYEE-NO, EMPLOYEE-NAME and
EMPLOYEE-ADDRESS for employees, and DEPARTMENT-NAME,
DEPARTMENT-HEAD and DEPARTMENT-OFFICE. The
relationship between employees and departments will also
have to be determined and defined; for example,
WORKS-FOR may associate each employee with at most
one department. These few generic attributes, which are
applicable to all employees and all departments, are
limited in their ability to capture the differences between
individual instances of employees or departments. In
addition, if this design later proves to be unsatisfactory,
modifications may require substantial effort. While these
limitations are not always objectionable, structured
models are inadequate in situations where there is a need to
model data that is more diverse and idiosyncratic. An
example is a database in which one wishes to record all
that one knows about a topic. Such databases are quite
impossible to design, as the data does not easily fit into
uniform structures, and the eventual scope of the
database is initially unknown.
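
The employee and department example can be made concrete with a minimal sketch. The class names, field names, and the StructuredDatabase container below are illustrative assumptions, not part of any particular data model; they only mirror the attributes and the WORKS-FOR relationship named above.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Employee:
    # The fixed attribute set from the example; nothing else can be recorded.
    employee_no: int
    employee_name: str
    employee_address: str


@dataclass(frozen=True)
class Department:
    department_name: str
    department_head: str
    department_office: str


class StructuredDatabase:
    """Hypothetical container that enforces a design decided in advance."""

    def __init__(self) -> None:
        self.employees: Dict[int, Employee] = {}
        self.departments: Dict[str, Department] = {}
        # WORKS-FOR is keyed by employee number, so each employee
        # is associated with at most one department.
        self.works_for: Dict[int, str] = {}

    def add_employee(self, employee: Employee) -> None:
        self.employees[employee.employee_no] = employee

    def add_department(self, department: Department) -> None:
        self.departments[department.department_name] = department

    def assign(self, employee_no: int, department_name: str) -> None:
        if employee_no not in self.employees:
            raise KeyError(f"unknown employee {employee_no}")
        if department_name not in self.departments:
            raise KeyError(f"unknown department {department_name}")
        # A second assignment replaces the first instead of adding to it.
        self.works_for[employee_no] = department_name

    def department_of(self, employee_no: int) -> Optional[Department]:
        name = self.works_for.get(employee_no)
        return self.departments[name] if name is not None else None
```

In this sketch, a fact that falls outside the design (say, one employee's hobby, or an employee who works for two departments) has no place to go: constructing an Employee with an extra attribute raises a TypeError, and a second call to assign silently overwrites the first. Accommodating either case means changing the class definitions and every piece of code that depends on them.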
An attractive approach for such situations is a database
that is unstructured (or loosely-typed). The database is
merely a container that can hold diversified information,
into which one can toss information casually. Such an
architecture requires no commitment to a particular
design and can therefore accommodate any evolution in
the contents of the database. As there is no structure, it
can accommodate data with all its complexities and
idiosyncrasies. Flexibility of this sort is available in pile
structures, which are aggregates of records that do not
adhere to any uniform record type and are not organized
in any meaningful way (a detailed discussion of the
applicability and performance of piles can be found in
[17]). However, unstructured databases are not
necessarily unorganized: to facilitate access they may
adopt some internal organization, such as rings or indexes.
Other efforts that can be classified as supporting an
unstructured approach are mostly based on semantic
networks or logic (a good review of the topic can be found
in [15]).
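
As an illustration only, a pile with a simple index might be sketched as follows. The Pile class, its method names, and the sample field names are assumptions made for this sketch; they are not taken from [17] or from any existing system.

```python
from collections import defaultdict
from typing import Any, Dict, List

# A record is any mapping from field names to values; no uniform type is imposed.
Record = Dict[str, Any]


class Pile:
    """Hypothetical unstructured container with an index on field names."""

    def __init__(self) -> None:
        self.records: List[Record] = []
        # Internal organization to facilitate access: field name -> record positions.
        self.index: Dict[str, List[int]] = defaultdict(list)

    def add(self, record: Record) -> None:
        position = len(self.records)
        self.records.append(dict(record))
        for field in record:
            self.index[field].append(position)

    def with_field(self, field: str) -> List[Record]:
        # Returns only the records that use exactly this field name.
        return [self.records[i] for i in self.index.get(field, [])]


pile = Pile()
pile.add({"NAME": "Smith", "WORKS-FOR": "Sales"})
# Similar information, entered in a different form, with an extra field.
pile.add({"NAME": "Jones", "EMPLOYED-BY": "Sales", "HOBBY": "chess"})

found = pile.with_field("WORKS-FOR")
```

Both records are accepted without any prior design, including the field that only one of them carries. The cost is visible in the last line: a retrieval on WORKS-FOR finds the first record but not the second, which records the same kind of information under a different name and so appears to be "lost" in the database.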
CH2261-6/86/0000/0426$01.00 © 1986 IEEE