Stabilize Sequence Learning with Recurrent Neural Networks by Forced Alignment

Marc-Peter Schambach
Siemens AG
Bücklestraße 1-5, 78464 Konstanz, Germany
Email: marc-peter.schambach@siemens.com

Sheikh Faisal Rashid
Siemens AG
University of Kaiserslautern, Germany
Email: rashid@iupr.com

Abstract—Cursive handwriting recognition is still a hot topic of research, especially for non-Latin scripts. One of the techniques yielding the best recognition results is based on recurrent neural networks: neurons are modeled by long short-term memory (LSTM) cells, and the alignment of the label sequence to the output sequence is performed by a connectionist temporal classification (CTC) layer. However, network training is time-consuming, unstable, and prone to over-adaptation. One of the reasons is the bootstrap process, which aligns the label data more or less randomly in early training iterations. As a consequence, the positions of the emission peaks within a character are unpredictable. Positions near the center of a character, however, are more desirable: in theory, they better model the properties of a character. The solution presented here is to guide the back-propagation training in early iterations: character alignment is enforced by replacing the forward-backward alignment with fixed character positions, either pre-segmented or equally distributed. After a number of guided iterations, training may be continued with standard dynamic alignment. A series of experiments is performed to answer the following questions: Can peak positions be controlled in the long run? Can training iterations be reduced, yielding results faster? Is training more stable? And finally: Do defined character positions lead to better recognition performance?

I. INTRODUCTION

Cursive handwriting recognition has drawn continuous attention over the past years.
Competitions measuring recognition performance for various scripts and languages [1], [2] are popular, and the number of public databases is constantly increasing. The focus on recognition performance has made recognition approaches comparable, and has led to professionalism in the field. State-of-the-art systems are either based on hidden Markov models (HMM), using sophisticated classification methods and topologies, or on recurrent neural networks (RNN) with a final alignment layer; the latter will be used here. The amount of training data available and the demand for high recognition performance require systems with millions of parameters. This makes parameter training time-consuming. Training data consists of images containing scanned lines of text, tagged with the textual content, but without character positions. The missing segmentation information makes the bootstrap process unstable: training convergence time varies over a wide range. Moreover, properties of the final system are undefined, especially the exact positions of the characters during recognition. The solution presented here is to guide the training during the first iterations with forced alignment of characters. The standard forward-backward algorithm for character alignment is replaced by either an estimated alignment, or by a fixed segmentation from other sources. After a few iterations of this guided bootstrap training, standard forward-backward training is used again, which best adapts to the training data. The expectation is that training startup time is reduced significantly, while control over the character positions is retained during the final training iterations. The paper is organized as follows: Section II gives an overview of the recognition system, while section III presents the data used for visualization and experiments. Section IV explains the idea of forced alignment and describes its application in the given training context. Section V then describes the experiments and their results.
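As a concrete illustration of the forced-alignment idea introduced above, the following sketch constructs fixed frame-level targets for an equally distributed alignment: each of the K characters is forced to emit at the center of its equal share of the T output frames, with the non-character (blank) class everywhere else. This is a minimal illustration under assumed conventions, not the paper's implementation; the function name and class encoding are hypothetical.

```python
# Illustrative sketch: fixed targets for guided CTC bootstrap training.
# Instead of the forward-backward alignment, each character is assigned
# a fixed emission position, here equally distributed over the frames.

def equally_distributed_targets(labels, num_frames, blank=0):
    """Return one target class per output frame.

    Each of the K characters gets the center frame of its equally
    sized segment of the frame axis; all other frames are the
    non-character (blank) class, mimicking a peaked CTC emission.
    """
    targets = [blank] * num_frames
    k = len(labels)
    for i, label in enumerate(labels):
        # center of the i-th of K equal segments of the frame axis
        center = int((i + 0.5) * num_frames / k)
        targets[center] = label
    return targets

# Example: a word of 3 characters (classes 5, 2, 7) on 12 output frames
print(equally_distributed_targets([5, 2, 7], 12))
# → [0, 0, 5, 0, 0, 0, 2, 0, 0, 0, 7, 0]
```

A pre-segmented alignment would work the same way, except that the center positions come from an external segmentation instead of being equally spaced.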
Conclusions are drawn in the final section.

II. SYSTEM

The system used here is based on recurrent neural networks and has been described by Alex Graves [3], [4]. It is a multi-layer neural network, which basically transforms a two-dimensional pixel plane into a sequence of class probabilities. It does so by sub-sampling the input pixel planes in each layer and finally collapsing the final plane in y-direction, yielding a sequence in x-direction. Classes represent characters including whitespace, complemented by a non-character class which represents everything between characters. Recognition results are derived from the class probability sequence by dynamic programming. Each layer is recurrent; it gets its input not only from the input pixels, but also from the neighboring cells within the layer. To capture the whole context in two dimensions, four layers are implemented, each getting input from neighboring cells in NW, NE, SE, and SW direction, respectively. Cells are long short-term memory (LSTM) cells, which contain input, output and forget gates. These gates give the network a richer structure and, for example, address the vanishing gradient problem. More details can be found in [3]. The topology used in the experiments contains three hidden layers, with 4, 20 and 100 cells each. Sub-sampling layers have dimension 2 × 3 and sizes 6 and 30.

III. DATA

Latin word images have been used for experiments and visualization, because Latin can be read by most people and