Under review as a conference paper at ICLR 2016

SHERLOCK: MODELING STRUCTURED KNOWLEDGE IN IMAGES

Mohamed Elhoseiny 1,2, Scott Cohen 1, Walter Chang 1, Brian Price 1, Ahmed Elgammal 2
1 Adobe Research
2 Department of Computer Science, Rutgers University

ABSTRACT

How can we build a machine learning method that continuously gains structured visual knowledge by learning structured facts? Our goal in this paper is to address this question by proposing a problem setting where training data comes as structured facts in images of different types, including (1) objects (e.g., <boy>), (2) attributes (e.g., <boy, tall>), (3) actions (e.g., <boy, playing>), and (4) interactions (e.g., <boy, riding, a horse>). Each structured fact has a semantic language view (e.g., <boy, playing>) and a visual view (an image containing this fact). A human is able to efficiently gain visual knowledge by learning facts in a never-ending process, and, we believe, in a structured way (e.g., understanding that "playing" is the action part of <boy, playing>, and hence generalizing to recognize <girl, playing> after additionally learning just <girl>). Inspired by human visual perception, we propose a model that (1) learns a representation, which we name wild-card, covering the different types of structured facts, (2) can be flexibly fed with structured fact language-visual view pairs in a never-ending way to gain more structured knowledge, (3) generalizes to unseen facts, and (4) allows retrieval of the fact language view given the visual view (i.e., the image) and vice versa. We also propose a novel method to generate hundreds of thousands of structured fact pairs from image caption data, which are necessary to train our model and can be useful for other applications.

1 INTRODUCTION

It is a capital mistake to theorize in advance of the facts.
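The four fact types above can be read as tuples of increasing arity, where empty slots act as wild-cards. The following is a minimal illustrative sketch of that reading (the `Fact` encoding and names are our own for illustration, not the paper's exact formulation):

```python
from collections import namedtuple

# A structured fact as a (subject, predicate, object) triple;
# unfilled slots (None) act as wild-cards. Illustrative encoding only.
Fact = namedtuple("Fact", ["s", "p", "o"])

def fact_type(f):
    """Classify a fact by which of its slots are filled."""
    if f.p is None and f.o is None:
        return "object"            # e.g., <boy>
    if f.o is None:
        # <boy, tall> (attribute) and <boy, playing> (action)
        # share the same two-slot shape in this sketch
        return "attribute/action"
    return "interaction"           # e.g., <boy, riding, horse>

facts = [
    Fact("boy", None, None),
    Fact("boy", "tall", None),
    Fact("boy", "playing", None),
    Fact("boy", "riding", "horse"),
]
for f in facts:
    print(f, "->", fact_type(f))
```

The shared tuple shape is what lets a single representation cover all fact types: learning "playing" as the predicate slot of <boy, playing> can transfer to <girl, playing> once <girl> is known as a subject.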
-Sherlock Holmes (A Scandal in Bohemia)

Let us imagine a scene with the following facts: <man>, <baby>, <toy>, <man, smiling>, <baby, smiling>, <baby, sitting on, chair>, <man, sitting on, chair>, <baby, holding, toy>, <man, feeding, baby>. We might expect the imagined scene to be very close to the image in Fig. 1, thanks to the precise structured description. On the other hand, if we were given the same image and asked to describe it, we might produce only a short title such as "man feeding a baby". Extracting the given structured facts from this image requires a detective's eye that looks for the structured details we aim to model.

State-of-the-art captioning methods (e.g., Karpathy & Fei-Fei (2015); Vinyals et al. (2015); Xu et al. (2015); Mao et al. (2015)) rely on generating a sequence of words given an image of a scene, inspired by the success of sequence-to-sequence training of neural networks in machine translation systems (e.g., Cho et al. (2014)). While this is an impressive step, the mechanism of these captioning systems makes them incapable of conveying all the structured information in an image or of providing a confidence for the generated caption given the facts in the image. They may also produce a limited description like "man feeding a baby", which makes image search in the other direction difficult due to the lack of a richer representation.

Captions and unstructured tags are mainly a vehicle for communicating facts to humans. However, they may not be the best way to represent that knowledge so that it is searchable by the machine. There are advantages to having explicit, structured knowledge for image search. If one searches for images of a "red flower", a bag-of-words approach that considers "red" and "flower" separately may return images of flowers that are not red but have red elsewhere in the image. It is important to know that the user is looking for the fact <flower, red>.
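The "red flower" failure mode can be made concrete with a toy comparison (the image data below is invented for this sketch): a bag-of-words tag search matches any image whose tags contain both words, while matching the structured fact <flower, red> requires "red" to actually describe the flower.

```python
# Toy index: each image has unordered tags and structured facts.
# Data invented purely for illustration.
images = {
    "img1": {"tags": {"red", "car", "flower"},          # a red car beside a white flower
             "facts": {("car", "red"), ("flower", "white")}},
    "img2": {"tags": {"red", "flower", "garden"},       # a red flower in a garden
             "facts": {("flower", "red")}},
}

def bow_search(query_words):
    """Bag-of-words: match if all query words appear among the tags."""
    return [k for k, v in images.items() if query_words <= v["tags"]]

def fact_search(fact):
    """Structured search: match only if the exact fact holds."""
    return [k for k, v in images.items() if fact in v["facts"]]

print(bow_search({"red", "flower"}))    # both images match the words
print(fact_search(("flower", "red")))   # only img2 truly shows a red flower
```

The bag-of-words query returns both images, including the red-car/white-flower one; the fact query returns only the image where the attribute is bound to the right object.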
Modeling the connection between a provided structured fact in its language form and its visual view (i.e., an image containing it) facilitates gaining richer visual knowledge, which is our focus in this paper. Several applications can make use of modeling

arXiv:1511.04891v1 [cs.CV] 16 Nov 2015