Is automated conversion of video to text a reality?

Richard Bowden (a), Stephen Cox (b), Richard Harvey (b), Yuxuan Lan (b), Eng-Jon Ong (a), Gari Owen (c) and Barry-John Theobald (b)

(a) University of Surrey, Guildford, GU2 7XH, UK.
(b) University of East Anglia, Norwich, NR4 7TJ, UK.
(c) Annywyn Solutions, Bromley, Kent, BR1 3DW, UK.

ABSTRACT

A recent trend in law enforcement has been the use of forensic lip-readers. Criminal activities are often recorded on CCTV or other video-gathering systems, and knowledge of what suspects are saying enriches the evidence gathered. However, lip-readers are, by their own admission, fallible. Based on long-term studies of automated lip-reading, we are therefore investigating the possibilities and limitations of applying this technique under realistic conditions. We have adopted a step-by-step approach and are developing a capability for cases where prior video of the suspect of interest is available. We use the terminology video-to-text (V2T) for this technique, by analogy with speech-to-text (S2T), which also has applications in security and law enforcement.

Keywords: Lip-reading, speech recognition, pattern recognition

1. INTRODUCTION

Much of the intelligence associated with the investigation of crime is based on what various people are saying. This ranges from gossip to conversations between those planning criminal or terrorist acts. It has always been common practice in the criminal community to be wary of being overheard, and hence conversations often take place at randomly chosen locations such as street corners. The suspects are often recorded opportunistically by a variety of security CCTV networks and video cameras, but without audio. It would therefore be extremely useful, under certain circumstances, to extract the speech content from the video product. We refer to this as the conversion of video-to-text (V2T), by analogy with the more established technique of speech recognition: speech-to-text (S2T).
Indeed, much of the philosophy and technology of V2T is derived from S2T, which has been established and evolving for about 50 years. Human lip-readers have been used to interpret speech in video product. However, few lip-readers are available, transcription is often very slow, and training and certification are not well developed. A full discussion of the performance of human lip-readers could occupy another paper but, in short, it is difficult to establish confidence intervals on human performance. There is therefore a desire for an automated means of V2T conversion, a process that can be scaled for widespread use.

As with human lip-readers, the information for the conversion of video to text is derived from the movement of the speaker's lips. Other information, such as gestures, could also be used to enhance the level of information accessible but, for the time being, we focus on lip motions. A major difficulty for both human lip-reading and its machine counterpart is that similar lip gestures (sometimes called visemes) may be associated with different phonemes. Context is therefore all-important. Ideally, we would like to be able to perform video-to-text conversion on any subject, but this ambiguity makes that extremely difficult: accent and dialect can lead to multiple interpretations of the same viseme.

Building on a well-established research base of fifteen years, we have adopted an engineering approach to demonstrating the technology of V2T conversion, as follows ∗:

• We limit our attempts to subjects for whom data has already been captured. This is by analogy with S2T, where high performance can be obtained if training data for a subject is available; it is the typical approach taken by the personal speech recognition systems now commercially available.
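The many-to-one relationship between phonemes and visemes described above can be made concrete with a small sketch. The grouping below is purely illustrative (a textbook-style clustering, e.g. the bilabials /p/, /b/ and /m/ are visually indistinguishable), not the viseme classes used in our system:

```python
# Illustrative phoneme-to-viseme map (assumed grouping, for demonstration only).
# Several phonemes collapse onto one viseme, so the inverse map is one-to-many.
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar", "s": "V_alveolar", "z": "V_alveolar",
    "k": "V_velar", "g": "V_velar",
}

def viseme_to_phonemes(viseme):
    """Invert the many-to-one map: one viseme yields several candidate phonemes."""
    return sorted(p for p, v in PHONEME_TO_VISEME.items() if v == viseme)

# "pat", "bat" and "mat" open with the same lip gesture, so lip shape alone
# cannot distinguish them; linguistic context must resolve the ambiguity.
print(viseme_to_phonemes("V_bilabial"))  # ['b', 'm', 'p']
```

This is why a viseme sequence alone under-determines the word sequence, and why a recogniser must bring language context to bear, much as an S2T decoder uses a language model.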
Further author information: (Send correspondence to RWH)
RWH: E-mail: r.w.harvey@uea.ac.uk
∗ An explanation of an early version of our system can be found at http://www.youtube.com/watch?v=Tu2vInqqHX8.