On Embedding Machine-Processable Semantics into Documents

Krishnaprasad Thirunarayan
Department of Computer Science and Engineering
Wright State University, Dayton, OH 45435, USA.
tkprasad@cs.wright.edu
http://www.cs.wright.edu/~tkprasad

Abstract. Most Web and legacy paper-based documents are available in human-comprehensible text form, not readily accessible to or understood by computer programs. Here we investigate an approach to amalgamate XML technology with programming languages for representational purposes. Specifically, we propose a modular technique to embed machine-processable semantics into a text document with tabular data via annotations, and evaluate it vis-à-vis document querying, manipulation, and integration. The ultimate aim is to be able to author and extract the human-readable and machine-comprehensible parts of a document “hand in hand”, and keep them “side by side”.

1 Introduction

The World Wide Web currently contains about 16 million web sites hosting more than 3 billion pages, which are accessed by over 600 million users internationally. Most of the information available on the web, including that obtained from legacy paper-based documents, is in human-comprehensible text form, not readily accessible to or understood by computer programs. (Quoting from the SHOE FAQ, “Web is not only written in a human-readable language (usually English) but in a human-vision-oriented layout (HTML with tables, frames, etc.) and with human-only-readable graphics” [8].) The enormity and the machine incomprehensibility of the available information have made it very difficult to accurately search, present, summarize, and maintain it for a variety of users [1]. The Semantic Web initiative attempts to enrich the available information with machine-processable semantics, enabling both computers and humans to complement each other cooperatively [5, 9].
Automated (web) services enabled by semantic web technology promise to improve assimilation of web content, providing accurate filtering, classification, location, manipulation, and summarization.

Virtually every programming language provides syntax to embed documentation in the code. Typically, a comment appears as clearly delimited text. In contrast, in Orwell, the documentation text is interspersed with cleanly delimited code that yields executable instructions [10]. Donald E. Knuth popularized the approach of combining a programming language with a documentation language under