IEEE TRANS. SIGNAL PROC., VOL. X, NO. Y, SUBMITTED JUNE 2011

Online Non-Negative Convolutive Pattern Learning for Speech Signals

Dong Wang, Member, IEEE, Ravichander Vipperla, Member, IEEE, Nicholas Evans, Member, IEEE, and Thomas Fang Zheng, Senior Member, IEEE

Abstract—The unsupervised learning of spectro-temporal patterns within speech signals is of interest in a broad range of applications. Where patterns are non-negative and convolutive in nature, relevant learning algorithms include convolutive non-negative matrix factorization (CNMF) and its sparse alternative, convolutive non-negative sparse coding (CNSC). Both algorithms, however, place unrealistic demands on computing power and memory which prohibit their application in large-scale tasks. This paper proposes a new online implementation of CNMF and CNSC which processes input data piece-by-piece and updates learned patterns gradually with accumulated statistics. The proposed approach facilitates pattern learning with huge volumes of training data that are beyond the capability of existing alternatives. We show that, with unlimited data and computing resources, the new online learning algorithm almost surely converges to a local minimum of the objective cost function. In more realistic situations, where the amount of data is large and computing power is limited, online learning tends to obtain a lower empirical cost than conventional batch learning.

Index Terms—Non-negative matrix factorization, convolutive NMF, online pattern learning, sparse coding, speech processing, speech recognition

I. INTRODUCTION

Many signals exhibit clear spectro-temporal patterns; the discovery and learning of such patterns with automatic approaches is often needed for signal interpretation and for the design of suitable algorithms in practical applications. In speech signals, for instance, patterns of interest might be related to the speaker identity or the phonetic content.
Whilst some of these patterns might be readily defined and learned with supervised approaches, e.g. neural networks, more complex patterns are difficult to pre-define and annotate, particularly when they involve large datasets; hence the need for unsupervised approaches.

Various unsupervised learning techniques have been developed for automatic pattern discovery. The general idea behind such learning approaches involves the search for a number of patterns which can be used to reconstruct a set of training signals according to a certain cost function, e.g. minimum reconstruction loss, and an appropriate set of constraints. This can be written formally as:

\tilde{W} = \arg\min_{W} \left\{ \min_{H} \, \ell(X, \tilde{X}(W, H)) \right\} \quad \text{s.t.} \quad \{g_i(W, H)\} \qquad (1)

where X represents a set of training signals and \tilde{X} their reconstructed approximations, \ell(\cdot, \cdot) represents the objective function, and \{g_i(W, H)\} represents the set of constraints. The reconstruction usually takes a linear form: \tilde{X}(W, H) = W \times H, where H represents the projection of \tilde{X} onto a set of patterns W.

Copyright (c) 2012 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.

This work was conducted when Dong Wang was at EURECOM as a post-doctoral research fellow and was completed when he was a visiting researcher at Tsinghua University and a senior research engineer at Nuance. It was partially supported by the French Ministry of Industry (Innovative Web call) under contract 09.2.93.0966, Collaborative Annotation for Video Accessibility (ACAV), and by the Adaptable Ambient Living Assistant (ALIAS) project funded through the joint national Ambient Assisted Living (AAL) programme. Dong Wang and Thomas Fang Zheng are with Tsinghua University; Ravichander Vipperla and Nicholas Evans are with EURECOM.
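As one concrete instance of the generic objective above, the following minimal sketch (not the paper's online algorithm) minimises the l-2 reconstruction loss ||X − WH||² under non-negativity constraints on W and H, using the standard multiplicative updates of Lee and Seung. The matrix sizes, rank, and iteration count are illustrative assumptions.

```python
import numpy as np

def nmf(X, r, n_iter=200, eps=1e-9):
    """Standard multiplicative-update NMF sketch: X (m x n) ~ W (m x r) @ H (r x n)."""
    m, n = X.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, r)) + eps   # patterns (dictionary), kept non-negative
    H = rng.random((r, n)) + eps   # coefficients (code matrix), kept non-negative
    for _ in range(n_iter):
        # Multiplicative updates preserve non-negativity by construction.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy non-negative data standing in for a magnitude spectrogram.
X = np.abs(np.random.default_rng(1).random((40, 100)))
W, H = nmf(X, r=5)
loss = np.linalg.norm(X - W @ H) ** 2
```

Because each update multiplies by a non-negative ratio, neither matrix can leave the feasible region, which is how the constraints g_i are enforced without explicit projection.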
Pattern learning is thus closely related to matrix factorization, a field that has been studied extensively in mathematics and statistics. In signal processing and pattern learning, W is referred to as a dictionary, whereas in statistics W is referred to as a basis. The coefficient matrix H is known as a factor matrix or a code matrix in some literature. In this paper we refer to W and H as 'patterns' and 'coefficients' respectively.

Different cost functions and constraints lead to different learning techniques. An l-2 reconstruction loss or Kullback-Leibler divergence cost function and a non-negative constraint applied to both patterns and coefficients leads to non-negative matrix factorization (NMF) [1]–[6]. In contrast to other pattern learning approaches, NMF is capable of learning partial patterns and has thus proved to be popular in applications such as data analysis, speech processing, image processing and pattern recognition [7]–[11].

A number of extensions have been introduced to improve the basic NMF approach, e.g. [12]–[23]. Convolutive NMF (CNMF) [24], [25] and sparse NMF [26]–[28] are among the most significant. Patterns learned with convolutive NMF span a number of consecutive frames and thus capture spectro-temporal features. With sparse NMF, sparsity constraints imposed on both patterns and coefficients generally lead to improved representation and noise robustness. The two extensions can be combined, resulting in a more powerful learning approach referred to as convolutive non-negative sparse coding (CNSC) [29]–[32].

While promising results have been demonstrated in some tasks, such as speech enhancement [33] and source separation [34], NMF and its variants such as CNMF and CNSC place high demands on both computing resources and memory when the training database is large. The original form of the multiplicative update procedure [4] requires all the signals to be read into memory and processed in each iteration; this is prohibitive
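The convolutive extension can be sketched as follows. In the Smaragdis-style CNMF model, each pattern spans T consecutive frames, and the reconstruction is a sum of T per-frame factorizations in which the coefficient matrix is shifted right by t columns at lag t. The shapes below (pattern length, rank, signal length) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def shift_right(H, t):
    """Shift the columns of H right by t positions, zero-padding on the left."""
    if t == 0:
        return H.copy()
    out = np.zeros_like(H)
    out[:, t:] = H[:, :-t]
    return out

def cnmf_reconstruct(W, H):
    """Convolutive reconstruction: X~ = sum_t W[t] @ shift_right(H, t).

    W has shape (T, m, r): r patterns, each spanning T frames of m bins.
    H has shape (r, n): one activation per pattern per frame.
    """
    T = W.shape[0]
    return sum(W[t] @ shift_right(H, t) for t in range(T))

rng = np.random.default_rng(0)
W = rng.random((3, 20, 4))   # 4 patterns, each spanning 3 consecutive frames
H = rng.random((4, 50))
X_hat = cnmf_reconstruct(W, H)
```

Setting T = 1 recovers ordinary NMF, which is why CNMF is a strict generalisation; the memory burden noted above arises because the multiplicative updates for every lag t must visit the entire training matrix X on each iteration.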