Every Picture Tells a Story: Generating Sentences from Images

Ali Farhadi 1, Mohsen Hejrati 2, Mohammad Amin Sadeghi 2, Peter Young 1, Cyrus Rashtchian 1, Julia Hockenmaier 1, David Forsyth 1

1 Computer Science Department, University of Illinois at Urbana-Champaign
{afarhad2,pyoung2,crashtc2,juliahmr,daf}@illinois.edu
2 Computer Vision Group, School of Mathematics, Institute for Studies in Theoretical Physics and Mathematics (IPM)
{m.a.sadeghi,mhejrati}@gmail.com

Abstract. Humans can prepare concise descriptions of pictures, focusing on what they find important. We demonstrate that automatic methods can do so too. We describe a system that can compute a score linking an image to a sentence. This score can be used to attach a descriptive sentence to a given image, or to obtain images that illustrate a given sentence. The score is obtained by comparing an estimate of meaning obtained from the image to one obtained from the sentence. Each estimate of meaning comes from a discriminative procedure that is learned using data. We evaluate on a novel dataset consisting of human-annotated images. While our underlying estimate of meaning is impoverished, it is sufficient to produce very good quantitative results, evaluated with a novel score that can account for synecdoche.

1 Introduction

For most pictures, humans can prepare a concise description in the form of a sentence relatively easily. Such descriptions might identify the most interesting objects, what they are doing, and where this is happening. These descriptions are rich because they are in sentence form. They are accurate, with good agreement between annotators. They are concise: much is omitted, because humans tend not to mention objects or events that they judge to be less significant. Finally, they are consistent: in our data, annotators tend to agree on what is mentioned. Barnard et al.
name two applications for methods that link text and images: illustration, where one finds pictures suggested by text (perhaps to suggest illustrations from a collection); and annotation, where one finds text annotations for images (perhaps to allow keyword search to find more images) [1]. This paper investigates methods to generate short descriptive sentences from images. Our contributions include:

- We introduce a dataset to study this problem (section 3.1).
- We introduce a novel representation intermediate between images and sentences (section 2.1).
- We describe a novel, discriminative approach that produces very good results at sentence annotation (section 2.4).
- For illustration, out-of-vocabulary words pose serious difficulties, and we show methods to use distributional semantics to cope with these issues (section 3.4).

Evaluating sentence generation is very difficult, because sentences are fluid, and quite different
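The abstract's central idea, mapping both the image and the sentence into a common meaning representation and scoring the pair by how well the two estimates agree, can be sketched as follows. This is a minimal illustration under stated assumptions: the identity "predictors" and the squared-distance comparison are hypothetical stand-ins, not the paper's learned discriminative procedures.

```python
# Hedged sketch: score an (image, sentence) pair by comparing meaning
# estimates in a shared feature space. Real predictors would be learned
# from data; here both are hypothetical pass-through functions.

def image_meaning(image_features):
    # Stand-in for a learned predictor mapping image features
    # to a meaning estimate.
    return image_features

def sentence_meaning(sentence_features):
    # Stand-in for a learned predictor mapping sentence features
    # to a meaning estimate.
    return sentence_features

def score(image_features, sentence_features):
    """Higher is better: negative squared distance between estimates."""
    a = image_meaning(image_features)
    b = sentence_meaning(sentence_features)
    return -sum((x - y) ** 2 for x, y in zip(a, b))

def best_sentence(image_features, candidates):
    """Annotation: pick the candidate (text, features) pair whose
    meaning estimate best matches the image's."""
    return max(candidates, key=lambda c: score(image_features, c[1]))
```

The same score supports both applications named above: ranking candidate sentences for a fixed image (annotation) or ranking images for a fixed sentence (illustration).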