Predicting Detection Events from Bayesian Scene Recognition ⋆

Georg Ogris and Lucas Paletta

JOANNEUM RESEARCH
Institute of Digital Image Processing
Wastiangasse 6, 8010 Graz, Austria
{georg.ogris,lucas.paletta}@joanneum.at

Abstract. This work is conceptually based on psychological findings in human perception that highlight the utility of scene interpretation in object detection processes. Objects of interest are embedded in their visual context, i.e., in visual events within their spatial neighborhood. The implication for a detection system is that early recognition of this environment might provide information that maps directly to an object event. The original contribution of this work is to outline a detection system that gains prospective information from rapid scene analysis in order to focus attention on estimated object locations. Scene recognition is outlined on the basis of rapid detection of triplet configurations of landmarks, which determine the discriminability of a particular location within the scene. Formulating scene recognition in terms of posterior landmark interpretation enables a recursive integration of target predictions and hence a probabilistic representation for attention-based object detection.

1 Introduction

In computer vision, we face the highly challenging object detection task of performing recognition of relevant events in outdoor environments. Changing illumination, different weather conditions, and noise in the imaging process are the most important issues that require a truly robust detection system. This paper considers prediction schemes that would significantly improve the quality of service in real-time interpretation of image sequences.
Research on video analysis has recently been focussing on object based interpretation, e.g., to refine semantic interpretation for the precise indexing and sparse representation of immense amounts of image data (e.g., [6]). Object detection in real-time, such as for video annotating and interactive television [1], imposes increased challenges on resource management to maintain sufficient quality of service, and requires careful design of the system architecture. Recent work on real-time interpretation therefore considers attentional mechanisms and cascaded systems [10] to coarsely analyze the complete video frame in a first step, reject irrelevant hypotheses, and iteratively apply increasingly complex classifiers with appropriate level of detail [13,9].

⋆ This work is funded by the European Commission's IST project DETECT under grant number IST-2001-32157.

J. Bigun and T. Gustavsson (Eds.): SCIA 2003, LNCS 2749, pp. 1058-1065, 2003. © Springer-Verlag Berlin Heidelberg 2003