Semantic Vectors: A Scalable Open Source Package and Online Technology Management Application

Dominic Widdows, Kathleen Ferraro
MAYA Design, University of Pittsburgh
dwiddows@gmail.com, kaf1@pitt.edu

Abstract

This paper describes the open source SemanticVectors package, which efficiently creates semantic vectors for words and documents from a corpus of free text articles. We believe that this package can play an important role in furthering research in distributional semantics, and (perhaps more importantly) can help to narrow the current gap between good research results and valuable applications in production software. Two clear principles have guided the creation of the package so far: ease of use and scalability. The basic package installs and runs easily on any Java-enabled platform, and depends only on Apache Lucene. Dimension reduction is performed using Random Projection, which enables the system to scale much more effectively than other algorithms used for the same purpose. This paper also describes a trial application in the Technology Management domain, which highlights some user-centred design challenges that we believe are also key to successful deployment of this technology.

1. Introduction

This paper describes the open source SemanticVectors software package, which can be freely downloaded from http://semanticvectors.googlecode.com. The software can be used to create semantic vector models from a corpus of free text, and to search such models using a variety of mathematical operations, including projections and algebraic product operations. It is hoped that the availability of this software will be of benefit to both academic and commercial users, due to its simplicity, ease of use, and scalability.
Instead of spending considerable time on basic text processing and search operations, researchers and developers will be able to focus their efforts on new experiments that investigate the relationship between the mathematical properties of the model and the linguistic properties of the source texts, and on integrating semantic matching and search features into larger systems that support users with increasingly complex information needs.

The core idea behind semantic vector models is that words and concepts are represented by points in a mathematical space, and this representation is learned from text in such a way that concepts with similar or related meanings are near to one another in that space. This opens up a range of possible applications: the most immediate is perhaps the end-user ‘semantic search engine’; semantic vector models can also be used in resource-building applications such as ontology learning and lexical acquisition, as part of data-gathering for decision-support systems, detailed research, and so on. The SemanticVectors package itself was developed for a demonstration of such an application, to help Technology Management professionals at the University of Pittsburgh.

The rest of this paper proceeds as follows. In Section 2, we review some of the basics of semantic vector models. Section 3 gives a summary of Random Projection, the dimension reduction technique used by the SemanticVectors package, chosen particularly for its scalability and computational tractability. Section 4 describes the SemanticVectors package itself, including a detailed description of its design, implementation, and current and intended features. Section 5 describes the University of Pittsburgh's Technology Management Application, for which SemanticVectors was initially built, and discusses other potential applications of this technology.
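As a minimal sketch of the "nearness in space" idea described above, the following Java fragment compares word vectors using cosine similarity, the standard similarity measure in vector space models. The vectors and the class name CosineDemo are invented for illustration; real models learn their coordinates from corpus statistics, and this is not code from the SemanticVectors package itself.

```java
// Sketch: similarity of concepts as nearness of points in a vector space.
// The three-dimensional vectors below are made up for illustration only.
public class CosineDemo {
    // Cosine of the angle between two vectors: 1.0 for identical
    // directions, near 0.0 for unrelated ones.
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        float[] doctor = {0.9f, 0.1f, 0.2f};
        float[] nurse  = {0.8f, 0.2f, 0.3f};
        float[] guitar = {0.1f, 0.9f, 0.1f};
        // "doctor" scores much higher against "nurse" than against
        // "guitar", since its vector points in a similar direction.
        System.out.printf("doctor~nurse  %.3f%n", cosine(doctor, nurse));
        System.out.printf("doctor~guitar %.3f%n", cosine(doctor, guitar));
    }
}
```

In a real model the dimensions number in the hundreds rather than three, but the similarity computation is exactly this simple.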
2. Semantic Vector or WORDSPACE Models

Semantic vector models have received considerable attention from researchers in natural language processing over the past 15 years, though their invention can be traced at least to Salton's introduction of the Vector Space Model for information retrieval [Salton, 1971, Salton and McGill, 1983]. Semantic vector models include a family of related models for representing concepts with vectors in a high-dimensional vector space, such as Latent Semantic Analysis [Landauer and Dumais, 1997], Hyperspace Analogue to Language [Lund and Burgess, 1996], and WORDSPACE [Schütze, 1998, Widdows, 2004, Sahlgren, 2006]. The main attractions of semantic vector models include:

• They can be built using entirely unsupervised distributional analysis of free text.

• While they involve some nontrivial mathematical machinery, they make very few language-specific assumptions (e.g., it is possible to build a semantic vector model provided only that you have reliably tokenized text).

• Similar techniques have been used in other areas, e.g., for image processing [Bingham and Mannila, 2001].

• The ease with which very simple distributed memory units can collaboratively learn and represent semantic vector models has been noted for its potential cognitive significance [Kanerva, 1988]. This has led to some overlap in interests between semantic vector researchers in computational linguistics and compositional connectionist researchers in cognitive science.
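The first attraction listed above, unsupervised distributional analysis, can be sketched very simply: term vectors fall out of nothing more than counting which words occur near which. The toy example below counts co-occurrences within a one-word window; the class name CooccurrenceDemo and the two-sentence corpus are invented for illustration (the SemanticVectors package itself derives its vectors from an Apache Lucene index rather than raw counts like these).

```java
import java.util.*;

// Toy unsupervised distributional analysis: each term's vector is its
// map of co-occurrence counts with neighbouring words (window of +/-1).
public class CooccurrenceDemo {
    public static Map<String, Map<String, Integer>> countCooccurrences(
            List<String[]> sentences) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (String[] tokens : sentences) {
            for (int i = 0; i < tokens.length; i++) {
                for (int j = Math.max(0, i - 1);
                     j <= Math.min(tokens.length - 1, i + 1); j++) {
                    if (i == j) continue;  // skip the word itself
                    counts.computeIfAbsent(tokens[i], k -> new HashMap<>())
                          .merge(tokens[j], 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> corpus = Arrays.asList(
            new String[]{"the", "cat", "sat"},
            new String[]{"the", "dog", "sat"});
        Map<String, Map<String, Integer>> counts = countCooccurrences(corpus);
        // "cat" and "dog" share the contexts "the" and "sat", so their
        // count vectors are identical in this toy corpus: the
        // distributional signal that similarity measures exploit.
        System.out.println(counts.get("cat"));
        System.out.println(counts.get("dog"));
    }
}
```

Note that nothing here required annotated data or a lexicon, only tokenized text, which is exactly the second attraction listed above.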