Versatile Search of Scanned Arabic Handwriting

Sargur N. Srihari, Gregory R. Ball and Harish Srinivasan
Center of Excellence for Document Analysis and Recognition (CEDAR)
University at Buffalo, State University of New York
Amherst, New York 14228
{srihari, grball, hs32}@cedar.buffalo.edu

Abstract

Searching scanned handwritten documents is a relatively unexplored frontier for documents in any language. In the general search literature, retrieval methods are described as being either image-based or text-based, with the corresponding algorithms being quite different. Versatile search is defined as a framework where the query can be either a textual string or an image snippet in any language and the retrieval method is a fusion of text- and image-retrieval methods. An end-to-end versatile system known as CEDARABIC is described for searching a repository of scanned handwritten Arabic documents; in addition to being a search engine, it includes several tools for image processing such as line removal, line segmentation, creating ground-truth, etc. In the search process of CEDARABIC the query can be either in English or Arabic. A UNICODE and an image query are maintained throughout the search, with the results being combined by an artificial neural network. The combination results are better than each approach alone. The results can be further improved by refining the component pieces of the framework (text transcription and image search).

1 Introduction

While searching electronic text is now a ubiquitous operation, the searching of scanned printed documents such as books is just beginning to emerge. The searching of scanned handwritten and mixed documents is a virtually unexplored area.

Processing handwritten Arabic language documents is of much current interest. One unsolved problem is a reliable method, given some query, to search for a subset among the many such documents, similarly to searching printed documents.
The problem is challenging because of the unique structural features of Arabic script and the relative infancy of the field of handwriting processing.

Content-based information retrieval (CBIR) is a broad topic in information retrieval and data mining [1]. CBIR algorithms are quite different for the tasks of text retrieval and image retrieval. Correspondingly, there are two approaches to searching scanned documents, stemming from the two different schools of thought. One approach is to use direct content-based image retrieval (word spotting). Another is to convert the document to an electronic textual representation (ASCII for English and UNICODE for Arabic) and search it with the text information retrieval methods used routinely with electronic documents.

Both of these approaches can be successful under ideal circumstances, but such circumstances are difficult to achieve with current handwriting recognition technology. Image-based searches do not always return correct results. Arabic handwriting recognition technology does not come close to allowing full transcriptions of unconstrained documents. However, by combining these two methods, we achieve better performance than either on its own.

The paper describes a framework for versatile search of Arabic handwritten documents. By versatile search, we mean both versatility in the query and versatility in the search strategy, which combines content-based image retrieval and text-based information retrieval. Versatility in the query refers to the query being either in textual form or in image form. Another characteristic of versatile search is that the query can be in multiple languages, such as English and Arabic. In the versatile search process both the original scanned image and the (partial) transcription are maintained at all stages. Searches proceed in parallel on both document representations.
Any query is also split into both an image and a UNICODE representation, which act on the corresponding instance of the document. The results from both parallel searches are combined into a single ranking of candidate documents.

The rest of the paper is organized as follows. Section 2 describes previous work in scanned document retrieval. Section 3 describes the nature of queries for versatile search. Section 4 describes the overall