Towards a Biologically Inspired Question-Answering Neural Architecture

Derek Monner a,1   James A. Reggia a,b
a Department of Computer Science, University of Maryland, College Park, USA
b Institute for Advanced Computer Studies, University of Maryland, College Park, USA

Abstract. Though question-answering systems like IBM's Watson are undoubtedly impressive, their errors are often baffling and inscrutable to onlookers, suggesting that the strategies they use are far different from those that humans employ. Desiring a more biologically inspired approach, we investigate the extent to which a neural network can develop a functional grasp of language by observing question/answer pairs. We present a neural network model that takes questions, as speech-sound sequences, about a visual environment, and learns to answer them with grounded predicate-based meanings. The model must learn to 1) segment morphemes, words, and phrases from the speech stream, 2) map the intended referents from the speech signal onto objects in the environment, 3) comprehend simple questions, recognizing what information the question is asking for, and 4) find and supply that information. Model evaluations suggest that the grounding and question-answering parts of the problem are significantly more demanding than interpreting the speech input.

Keywords. question answering, grounded language comprehension, recurrent neural network, long short-term memory

1. Introduction

While question-answering systems such as IBM's recent Jeopardy! winner Watson [1] have been well studied in natural language processing domains, little research has been done into how the question/answer style of interaction might influence the way humans acquire language. This is an interesting question in light of the fact that, when listening to language, learners are constantly confronted with request/response and question/answer pairs.
In this paper we investigate the extent to which a pure neural network model of a human learner can learn a micro-language by listening to question/answer pairs. The model is situated in a simulated micro-world along with two speakers whom we will call Watson and Sherlock. Watson asks questions about the shared environment in a subset of English, and Sherlock responds to these questions with the information Watson was seeking. The model's task is to learn to emulate Sherlock. To do this effectively, the model must listen to the speech sounds of Watson's questions and learn to segment them into morphemes, words, and phrases, which it must then interpret with respect to the common surroundings, thereby grounding them in visual experience. The model must then recognize what information Watson is asking for and provide that information as a predicate-based "meaning" that is grounded in the environment.

1 Corresponding Author: dmonner@cs.umd.edu
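To make the task setup concrete, the following is a minimal illustrative sketch of the question-to-answer mapping the model must learn. It is not the authors' implementation (the actual model is a recurrent LSTM network trained on unsegmented speech-sound sequences); the object names, question format, and lookup logic here are all hypothetical, chosen only to show the shape of the input and the grounded predicate-based output.

```python
# Hypothetical shared visual environment: objects with grounded properties.
# (Invented for illustration; not the paper's actual micro-world encoding.)
environment = {
    "obj1": {"type": "ball", "color": "red"},
    "obj2": {"type": "block", "color": "blue"},
}

def sherlock(question_words):
    """Toy stand-in for Sherlock: map an (already segmented) question onto a
    predicate-based meaning grounded in the environment.

    Note: the real model receives raw speech sounds and must learn this
    segmentation itself; here we start from words for clarity.
    """
    # e.g. question_words = ["what", "color", "is", "the", "ball"]
    if question_words[:2] == ["what", "color"]:
        target_type = question_words[-1]
        for obj_id, props in environment.items():
            if props["type"] == target_type:
                # Grounded predicate answer, e.g. color(obj1, red)
                return ("color", obj_id, props["color"])
    return None

print(sherlock(["what", "color", "is", "the", "ball"]))
# -> ('color', 'obj1', 'red')
```

The key point of the sketch is the output format: the answer is not a string but a predicate tied to a specific object in the environment, which is what makes the model's answers "grounded" in visual experience.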