Managing Text as Data*

Gordana Pavlovic-Lazetic** and Eugene Wong

Department of Electrical Engineering and Computer Sciences and the Electronics Research Laboratory
University of California, Berkeley, CA 94720

1. Introduction

With all their advances, database management systems of the present generation are designed to handle only data of primitive types, namely, numbers and character strings. Several approaches to extending their capabilities to handle data with higher order semantics exist. One is to add general abstract data type support, so that users can define such data types easily. In this approach, the DBMS makes no attempt to understand the semantics of user-defined data types, and evaluation of operators on such data is done in application programs. As a supplement, rather than an alternative, one can also extend the query language and its processor so that certain common non-primitive data types are directly supported by the DBMS. Of these, text and geometric data are probably the two most prominent examples. This paper deals with the case of text. Direct embedding of complex data in a database management system has obvious advantages, the most important one being performance.

To manage text as data, the first step is to handle words satisfactorily. Words are, after all, natural atoms of text. Whereas representing texts as strings of characters captures none of their meaning, representing them as sequences of words is a reasonable first order semantic representation. Our first step, then, is to introduce "words" as a data type.

Important operations on words are lexical operators, not string operators. They deal with how words are related to each other and how they are used. For example, "went" is a verb in past tense with "go" as its root. "Verb", "past tense", and "go" are values returned by three distinct operators on the word "went". We refer to "words" together with a class of operators on words as the lexical data type.
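As an illustration of the "went" example, the following is a minimal Python sketch of three lexical operators backed by a small lookup table. The names LexicalEntry, TOY_DICTIONARY, category, tense, and root are hypothetical and chosen only for this illustration; the table entries are hand-written toy data, not the dictionary or the operator set used in this paper.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class LexicalEntry:
    """Lexical information for one word (hypothetical structure)."""
    category: str  # part of speech, e.g. "verb"
    tense: str     # grammatical tense or verb form
    root: str      # root word


# Toy dictionary, hand-written for illustration only (an assumption).
TOY_DICTIONARY: Dict[str, LexicalEntry] = {
    "went": LexicalEntry(category="verb", tense="past", root="go"),
    "going": LexicalEntry(category="verb", tense="present participle", root="go"),
}


def category(word: str) -> str:
    """Lexical operator: return the part of speech of a word."""
    return TOY_DICTIONARY[word].category


def tense(word: str) -> str:
    """Lexical operator: return the tense of a word."""
    return TOY_DICTIONARY[word].tense


def root(word: str) -> str:
    """Lexical operator: return the root of a word."""
    return TOY_DICTIONARY[word].root


# The three operators return three different values for the same word.
print(category("went"), tense("went"), root("went"))  # prints: verb past go
```

None of the three values can be computed from the characters w-e-n-t by string operators alone; each one comes from information supplied alongside the word.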
The principal objective of this paper is to deal with issues that arise in implementing the lexical data type. The specific issues that we shall consider are the following:

* efficient storage of words in a relational database
* implementation of lexical operators
* resolving ambiguous words represented by the same character strings.

The principal application that we envisage for textual databases is automatic extraction of facts. We shall consider some simple examples of this using lexical operators.

*Research supported by the National Science Foundation under Grant ECS-8300463.
**On leave from the University of Belgrade, Yugoslavia.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the Twelfth International Conference on Very Large Data Bases, Kyoto, August 1986

2. Encoding Words

A natural way of storing texts in a relational database is to represent text by a relation:

textname(seqno, word)

where "seqno" denotes the order of appearance and "word" stands for words, punctuation and special symbols such as "new paragraph".

As character strings, words have greatly varying lengths. For storage in a fixed-length field, character strings are grossly inefficient. A solution to this problem is to encode words into a fixed-length representation. Great compression can be achieved. For example, a 4-byte integer suffices to represent a vocabulary of 2^32 ≈ 4 × 10^9 words.

There is a second and equally compelling reason to encode. Very little of the lexical information is contained in the character-string representation of a word. Clearly, the fact that "went" has "go" as its root cannot be deduced from the string w-e-n-t alone. If the goal is to implement lexical operators, then words need to be represented in a form whereby the values returned by the operators are explicit in the representation. Basically, the coded form of a word should be a composite of the values returned by the set of all admissible operators on the word.

There is yet a third reason to encode, namely, removing ambiguity. The same character string often has several meanings. In effect, it represents several different words, or more precisely, different "lexical units". For example, "well" has at least two unrelated meanings: "good and proper" and "a hole in the ground".

For these reasons we believe that encoding words is a must in storing text in a database system, if its meaning is to be exploited. The question is: how can this encoding be done? For compression alone, some kind of automatic encoding can probably be devised. However, no automatic encoding using only the character strings as input can achieve the other two goals, since additional information must be supplied. To provide the lexical information, we shall use a dictionary. To resolve ambiguities, we shall use an expert system.

The amount of lexical information that has to be supplied depends on the lexical operators to be supported. Thus, the first step is to define the lexical data type.

3. Lexical Data Type

We adopt the following terminology: a lexical unit is the image of a word under encoding, a lexical data set is a set of lexical units together with certain default values,