A novel mutual nearest neighbor based symmetry for text frame classification in video

Palaiahnakote Shivakumara a,*, Anjan Dutta b, Trung Quy Phan a, Chew Lim Tan a, Umapada Pal c

a School of Computing, National University of Singapore, Singapore
b Computer Vision Center, Universitat Autònoma de Barcelona, Barcelona, Spain
c Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, India

Article history: Received 9 July 2010; Received in revised form 7 January 2011; Accepted 7 February 2011; Available online 12 February 2011

Keywords: Wavelet–median moments; Video image; Mutual nearest neighbor; Frame classification; Text block location

Abstract

In the field of multimedia retrieval in video, text frame classification is essential for text detection, event detection, event boundary detection, etc. We propose a new text frame classification method that introduces a combination of wavelet and median moments with k-means clustering to select probable text blocks among 16 equally sized blocks of a video frame. The same feature combination is used with a new Max–Min clustering at the pixel level to choose probable dominant text pixels in the selected probable text blocks. For the probable text pixels, a so-called mutual nearest neighbor based symmetry is explored with a four-quadrant formation centered at the centroid of the probable dominant text pixels to decide whether a block is a true text block. If a frame produces at least one true text block, it is considered a text frame; otherwise it is a non-text frame. Experimental results on different text and non-text datasets, including two public datasets and our own created data, show that the proposed method gives promising results in terms of recall and precision at the block and frame levels. Further, we also show how existing text detection methods tend to misclassify non-text frames as text frames, in terms of recall and precision at both the block and frame levels.
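As a minimal illustration of the mutual nearest neighbor notion named in the title (a sketch of the generic concept, not the authors' full four-quadrant symmetry test), two points are mutual nearest neighbors when each is the other's nearest neighbor:

```python
# Sketch: mutual nearest neighbor pairs among 2-D points (e.g. text pixel
# coordinates). Function names here are illustrative, not from the paper.
from math import dist

def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: dist(points[i], points[j]))

def mutual_nearest_pairs(points):
    """All index pairs (i, j), i < j, that are each other's nearest neighbor."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [(i, nn[i]) for i in range(len(points))
            if nn[i] > i and nn[nn[i]] == i]

pixels = [(0, 0), (0, 1), (5, 5), (5, 6), (9, 0)]
print(mutual_nearest_pairs(pixels))  # → [(0, 1), (2, 3)]
```

The isolated point (9, 0) joins no pair: its nearest neighbor is (5, 5), but (5, 5) is closer to (5, 6). Such pairings give a notion of local symmetry among candidate text pixels.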
© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Text frame classification aims to classify frames among a large collection of video frames into text and non-text frames. It is useful in applications such as video browsing, event detection, event boundary detection, text tracking, and text detection and extraction. Due to the semantic gap between low-level features and high-level events, it is difficult to come up with a generic Content-Based Image Retrieval (CBIR) method or automatic annotation method to achieve a high accuracy of event detection [1]. In addition, the dynamic nature of events such as sports further complicates the analysis and impedes the implementation of such live event detection. In view of this difficulty, event detection is realized by detecting and recognizing the starting texts of the games or events involved. Therefore, to build a computationally efficient and accurate event detection system, accurate text frame classification is required before text detection and recognition [2]. However, no method exists in the literature that solely works on text frame classification. While text frame classification invariably makes use of text detection techniques, it differs from the usual text detection method in the following respects: (1) text frame classification is basically a screening process prior to text detection and recognition; (2) text frame classification should be simple and fast in order to quickly identify a frame as text or non-text; (3) text frame classification helps to reduce the computational burden by avoiding expensive text detection methods on unknown video frames, many of which may turn out to be non-text; and (4) many existing text detection methods assume that the given input is a text frame, and hence false positives may occur when a non-text frame is fed as input.
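The screening idea described above — divide a frame into 16 equally sized blocks, test each block cheaply, and flag the frame as a text frame if at least one block passes — can be sketched as follows. The block test here is a placeholder intensity threshold, not the paper's wavelet–median-moment feature pipeline:

```python
# Hypothetical sketch of block-level screening; only the 16-block split and
# the "at least one true text block" decision rule come from the paper.
def split_into_blocks(frame, rows=4, cols=4):
    """Divide a 2-D frame (list of pixel rows) into rows*cols equal blocks."""
    h, w = len(frame), len(frame[0])
    bh, bw = h // rows, w // cols
    return [[row[c * bw:(c + 1) * bw] for row in frame[r * bh:(r + 1) * bh]]
            for r in range(rows) for c in range(cols)]

def is_text_frame(frame, is_true_text_block):
    """A frame is a text frame if at least one block is a true text block."""
    return any(is_true_text_block(b) for b in split_into_blocks(frame))

# Toy usage: a placeholder test that flags a block when any pixel is bright.
frame = [[0] * 8 for _ in range(8)]
frame[1][6] = 255  # bright pixel lands in the top-right block
print(is_text_frame(frame, lambda b: any(p > 128 for row in b for p in row)))  # → True
```

Because `any` short-circuits, the frame-level verdict can be reached as soon as the first block passes, which matches the requirement that the screening step be fast.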
In this paper, we propose a text frame classification method by dividing a video frame into small windows, which we call "blocks", to look for probable text pixels among these blocks using a mutual nearest neighbor based symmetry concept. Any block in which the presence of text is detected serves as an indication that the frame under testing is a text frame. The rest of the paper is outlined as follows: In the next section, we survey related works. We present our proposed method in detail in Section 3, followed by a series of experiments in Section 4. Section 5 concludes this paper with discussions on future works.

* Corresponding author. E-mail addresses: shiva@comp.nus.edu.sg, hudempsk@yahoo.com (P. Shivakumara), adutta@cvc.uab.es (A. Dutta), phanquyt@comp.nus.edu.sg (T. Quy Phan), tancl@comp.nus.edu.sg (C. Lim Tan), umapada@isical.ac.in (U. Pal). Pattern Recognition 44 (2011) 1671–1683. doi:10.1016/j.patcog.2011.02.008

2. Related work

The closest related work is that of Li et al. [3] for video text tracking. The system includes a component for text frame