Challenges of Deep Learning-based Text Detection in the Wild

Zobeir Raisi, Vision and Image Processing Lab, University of Waterloo, ON, N2L 3G1, Canada
Mohamed A. Naiel, Vision and Image Processing Lab, University of Waterloo, ON, N2L 3G1, Canada
Paul Fieguth, Vision and Image Processing Lab, University of Waterloo, ON, N2L 3G1, Canada
Steven Wardell, ATS Automation Tooling Systems Inc., Cambridge, ON, N3H 4R7, Canada
John Zelek, Vision and Image Processing Lab, University of Waterloo, ON, N2L 3G1, Canada
Email: {zraisi, mohamed.naiel, pfieguth, jzelek}@uwaterloo.ca, swardell@atsautomation.com

Abstract

The reported accuracy of recent state-of-the-art text detection methods, mostly deep learning approaches, is on the order of 80% to 90% on standard benchmark datasets. These methods have relaxed some of the restrictions on structured text and environments (i.e., "in the wild") that are usually required for classical OCR to function properly. Even with this relaxation, there are still circumstances where these state-of-the-art methods fail. Several remaining challenges in wild images, such as in-plane rotation, illumination reflection, partial occlusion, complex font styles, and perspective distortion, cause existing methods to perform poorly. In order to evaluate current approaches in a formal way, we standardize the datasets and metrics for comparison, the lack of which had made comparison between these methods difficult in the past. We use three benchmark datasets for our evaluations: ICDAR13, ICDAR15, and COCO-Text V2.0. The objective of this paper is to quantify the current shortcomings and to identify the challenges for future text detection research.

1 Introduction

Detecting and recognizing text in wild images are challenging problems in the field of computer vision [1, 2]. "In the wild" refers to settings where neither the environment nor the text is structured, and both exhibit wide variations.
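The standardized comparison relies on the common IoU-based detection protocol: a detected box counts as correct when its overlap with an unmatched ground-truth box exceeds a threshold, typically 0.5. Purely as an illustration, the sketch below shows such an evaluation for axis-aligned boxes with greedy one-to-one matching; the actual benchmark protocols also handle polygons and "don't care" regions, which are omitted here.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def evaluate(detections, ground_truth, thresh=0.5):
    """Greedy one-to-one matching at an IoU threshold.
    Returns (precision, recall, h-mean)."""
    matched = set()
    tp = 0
    for det in detections:
        for i, gt in enumerate(ground_truth):
            if i not in matched and iou(det, gt) >= thresh:
                matched.add(i)
                tp += 1
                break
    p = tp / len(detections) if detections else 0.0
    r = tp / len(ground_truth) if ground_truth else 0.0
    h = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, h
```

For example, one detection overlapping a ground-truth word at IoU 0.81 and one spurious detection yield precision 0.5, recall 1.0, and h-mean 2/3 at the 0.5 threshold.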
Examples include street signs, store signs, advertisements, or text identifying sport players, to name a few. Reading text from scene images is typically carried out in two fundamental tasks: text detection, which localizes text in the image, and text recognition, which converts the localized text, or a cropped word image, into a text string. Both tasks face common challenges that can be categorized as follows:

Text diversity: images that contain text with different colors, fonts, orientations, and languages.

Scene complexity: images that include scene elements of similar appearance to text, such as signs, bricks, and symbols.

Distortion factors: text images distorted by motion blur, low resolution, surface geometry, perspective distortion, and partial occlusion [1, 3, 4].

This paper focuses on the text detection task, which is more challenging than text recognition due to the large variance of text shapes and complicated backgrounds. Methods from before the deep learning era typically identify character or text-component candidates using connected-component-based approaches or sliding-window-based methods, built on hand-crafted features such as MSER [5] or SWT [6]. However, the detection performance of these classical machine learning-based methods is still far from satisfactory. Recently, deep learning-based methods have been shown to outperform classical approaches in detecting challenging text in scene images. These methods usually adopt general object detection frameworks such as SSD [7], YOLO [8], or Faster R-CNN [9], or segmentation frameworks like FCN [10] and Mask R-CNN [11]. Most deep learning-based text detectors that operate at the word level have difficulty capturing curved, extremely long, or highly deformed words with a single bounding box [12].
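The classical connected-component pipelines mentioned above first group neighboring candidate pixels into components, which are then filtered as character candidates. The following is a minimal, illustrative 4-connected labeling sketch of that grouping step only; it is not MSER or SWT themselves, which add stability and stroke-width criteria on top of such components.

```python
from collections import deque

def connected_components(binary):
    """4-connected component labeling on a binary grid (list of rows of 0/1).
    Returns a list of components, each a list of (row, col) pixels."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if binary[r][c] and not seen[r][c]:
                # Breadth-first flood fill from an unvisited foreground pixel.
                queue = deque([(r, c)])
                seen[r][c] = True
                comp = []
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                comps.append(comp)
    return comps
```

In a real pipeline, each component would then be scored on geometric and appearance cues (aspect ratio, stroke width, color consistency) before being accepted as a character candidate.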
This paper aims to highlight the preceding challenges by reviewing recent advances in deep learning applied to scene text detection, and by evaluating some of the best state-of-the-art methods: EAST [13], PixelLink [14], CRAFT [12], and PMTD [15]. The methods are evaluated on three challenging datasets, including COCO-Text V2.0 [16], using a consistent methodology that covers several important challenges in scene text detection.

2 Literature Review

In this section, a brief literature review of deep learning-based scene text detection techniques is presented. Table 1 offers a comparison among some of the recent state-of-the-art text detection methods.

2.1 Regression-based Text Detection

Several methods [13, 28] adopt a general object detection regression-based (RB) framework, such as SSD [7] or Faster R-CNN [9], for text detection. They regard text regions as objects and predict candidate bounding boxes for text regions directly. For example, TextBoxes [28] modified the single-shot detector (SSD) [7] kernels by applying long default anchors and filters to handle the significant variation in the aspect ratios of text instances and thereby detect various text shapes. Unlike TextBoxes, the deep matching prior network (DMPNet) [29] introduced quadrilateral sliding windows to detect text under multiple orientations. Many regression-based methods [22, 24] have tried to solve the detection challenges of rotated and arbitrarily shaped text; for instance, EAST [13] proposed a fast and accurate text detector that makes dense predictions, processed using locality-aware Non-Maximum Suppression (NMS), to detect multi-oriented text in an image without using manually designed anchors. Liao et al. [24] extended TextBoxes to TextBoxes++ by improving the network structure and the training process; TextBoxes++ replaced the rectangular bounding boxes with quadrilaterals to detect arbitrarily oriented text.
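Dense detectors in the EAST family emit one box prediction per feature-map location, so thousands of heavily overlapping candidates must be pruned. As an illustration, the sketch below shows standard greedy NMS for axis-aligned boxes; note that EAST's locality-aware variant first merges geometrically adjacent predictions row by row to reduce the cost, and operates on rotated quadrilaterals rather than the axis-aligned boxes assumed here.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thresh=0.3):
    """Greedy NMS: keep the highest-scoring boxes, suppressing any
    remaining candidate whose IoU with a kept box reaches `thresh`.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if box_iou(boxes[i], boxes[j]) < thresh]
    return keep
```

With two near-duplicate predictions of one word and a distinct prediction elsewhere, only the higher-scoring duplicate and the distinct box survive.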
RB methods usually have a simple post-processing framework to handle multi-oriented text. However, due to structural limitations, it is not easy for these methods to represent accurate bounding boxes for arbitrary text shapes.

2.2 Segmentation-based Text Detection

Segmentation-based (SB) methods [14, 17, 23] classify text regions at the pixel level, making it possible to perform word-level or character-level detection. They usually modify a segmentation framework such as FCN [10] or Mask R-CNN [11]; for example, Zhang et al. [17] adopted FCN to predict the salient map of text regions, and TextSnake [23] adopts FCN as a base detector and extracts text instances by detecting and assembling local components.

The preceding methods are trained to detect words in images; however, it is challenging to use words as the basic unit for scene text detection, since individual text instances may take arbitrary shapes. Therefore, some recent text detection methods have trained deep learning models to detect text at the character level [12, 18, 19]. For example, in [18] a saliency map of text regions, given by a dedicated segmentation network, uses character-level annotations to generate multi-oriented text bounding boxes. Later, SegLink [19] was trained to search for small text elements (segments) in the image and to link these segments into word boxes using an additional post-processing step. Recently, CRAFT [12] used a weakly-supervised framework to detect individual characters in arbitrarily shaped text, which enables it to achieve state-of-the-art results on benchmark datasets.

Because text may appear in arbitrary shapes, recent methods usually adopt a segmentation framework as their backbone architecture, outperforming regression-based methods on multi-oriented text in several benchmark datasets. However, these methods require complex and time-consuming post-processing steps to produce the final detection result.
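The segment-linking post-processing described above can be pictured as a grouping problem: segments connected by predicted links belong to the same word. The sketch below is an illustrative union-find version of that grouping, under simplifying assumptions; the segment boxes are taken as axis-aligned and the link pairs as already predicted, whereas the actual SegLink method uses oriented segments and learned inter- and cross-layer links.

```python
def merge_segments(boxes, links):
    """Group segment boxes (x1, y1, x2, y2) joined by predicted link pairs
    into words via union-find; returns one enclosing box per group."""
    parent = list(range(len(boxes)))

    def find(i):
        # Path-halving find.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in links:
        parent[find(i)] = find(j)

    # Collect each group's segments, then take the enclosing rectangle.
    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(boxes[i])
    words = []
    for group in groups.values():
        xs1, ys1, xs2, ys2 = zip(*group)
        words.append((min(xs1), min(ys1), max(xs2), max(ys2)))
    return words
```

For instance, two adjacent segments linked together and one isolated segment yield two word boxes, the first spanning both linked segments.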
2.3 Hybrid Methods

Hybrid methods [15, 25, 26] use a combination of both segmentation- and regression-based approaches for improving the perfor-