Collecting Spatial Information for Locations in Text-to-Scene Systems

Masoud Rouhizadeh¹, Richard Sproat¹, Bob Coyne²

¹ Center for Spoken Language Understanding, Oregon Health and Science University
² Department of Computer Science, Columbia University

1 Introduction

WordsEye [4, 5] is a text-to-scene conversion system that receives a textual description of a picture from the user via its online interface and converts it into a 3D scene. The core of WordsEye is VigNet, a unified knowledge base and representational system for expressing the lexical and real-world knowledge needed to depict scenes from text [3]. In particular, VigNet contains the knowledge needed to map the objects and locations specified in a text to actual 3D objects [8, 7]. Individual objects (e.g. a chair) typically correspond to single 3D models, but locations (e.g. a living room) are typically composed of several individual objects that stand in typical spatial relations to one another. Prototypical mappings from locations to objects, together with the spatial relations of those objects, are called location vignettes [9, 10].

Existing lexical and common-sense knowledge resources such as WordNet [6], FrameNet [1], and OpenMind [11] do not contain the spatial and semantic information required to construct location vignettes (for a discussion see [7]), so we need to build our own lexical resource. One well-known approach to building lexical resources is to extract lexical relations from large text corpora. Directly relevant to this paper is the work of Sproat [12], which attempts to extract the canonical locations of actions from text corpora. This approach provides useful information, but the extracted data is noisy and requires hand editing. Furthermore, much of the information we are looking for is common-sense knowledge that human beings take for granted and that is therefore rarely stated explicitly in corpora. Although structured corpora such as Wikipedia do mention associated objects, their coverage is often incomplete [7]. In this paper we investigate using Amazon Mechanical Turk (AMT) to build our own domain-specific corpus and location vignettes.

2 Using AMT to build location vignettes

AMT is an online marketplace that coordinates the use of human intelligence to perform small tasks, such as image annotation, that are difficult for computers but easy for humans [2]. The inputs to our AMT tasks are typical photos of different rooms, each showing the large objects characteristic of that particular room. We carefully selected the pictures from the results of image searches on Google and Bing. Turkers for each task had to be located in the US and have a prior approval rating of at least 99%. Restricting the location of the Turkers increases the chance that they are native speakers of English, or at least have a good command of the language.

Phase 1: Collecting the functionally and visually important objects of rooms

The functionally important objects of a room are those that are required for the room to be recognized or to function properly. The visually important objects are those that help define the basic structural layout of the room.
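
To make the notion of a location vignette (Section 1) concrete, the following Python sketch shows one possible encoding of a vignette for a living room: a set of constituent objects together with prototypical spatial relations among them. The schema, field names, object names, and relation vocabulary here are our own illustrative assumptions, not VigNet's actual representation.

```python
# A minimal, illustrative sketch of a location vignette: a location name,
# the 3D-renderable objects that compose it, and prototypical spatial
# relations among those objects. The schema and vocabulary are
# hypothetical, not VigNet's actual representation.

living_room_vignette = {
    "location": "living room",
    "objects": ["sofa", "coffee table", "television", "rug", "floor lamp"],
    "relations": [
        ("coffee table", "in front of", "sofa"),
        ("rug", "under", "coffee table"),
        ("television", "facing", "sofa"),
        ("floor lamp", "next to", "sofa"),
    ],
}

def objects_for(vignette):
    """Return the 3D objects a scene generator would need to place."""
    return vignette["objects"]

if __name__ == "__main__":
    print(objects_for(living_room_vignette))
```

Each relation triple pairs two objects with a spatial preposition; a scene generator could instantiate the listed 3D models and then apply the relations as layout constraints when composing the room.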
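
The worker restrictions above (US location, at least a 99% approval rating) correspond to standard AMT qualification requirements. Purely as an illustration, here is how such a task could be posted with the current boto3 MTurk client; the original 2012 study predates this API, and the title, reward, task URL, and other parameters below are placeholders.

```python
import boto3

# Sketch only: posts one HIT restricted to US-based workers whose
# assignment-approval rate is at least 99%. All task parameters are
# placeholders, not the settings used in the original study.
mturk = boto3.client("mturk", region_name="us-east-1")

# ExternalQuestion pointing at a hypothetical page that shows the room
# photo and collects the worker's list of objects.
question_xml = """<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.com/room-object-task</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>"""

response = mturk.create_hit(
    Title="List the important objects in this room",
    Description="Look at the photo of a room and list its important objects.",
    Reward="0.10",
    MaxAssignments=5,
    AssignmentDurationInSeconds=600,
    LifetimeInSeconds=86400,
    Question=question_xml,
    QualificationRequirements=[
        {
            # System qualification: worker locale must be the US.
            "QualificationTypeId": "00000000000000000071",
            "Comparator": "EqualTo",
            "LocaleValues": [{"Country": "US"}],
        },
        {
            # System qualification: percent of assignments approved >= 99.
            "QualificationTypeId": "000000000000000000L0",
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [99],
        },
    ],
)
print(response["HIT"]["HITId"])
```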