IEEJ TRANSACTIONS ON ELECTRICAL AND ELECTRONIC ENGINEERING
IEEJ Trans 2015; 10: 114–115
Published online in Wiley Online Library (wileyonlinelibrary.com). DOI:10.1002/tee.22072

Letter

Semi-Blind Source Separation using Binary Masking and Independent Vector Analysis

Yuuki Tachioka a, Member
Tomohiro Narita, Non-member
Jun Ishii, Non-member

The recent prevalence of speech recognition systems has increased the opportunities for simultaneous recognition of multiple speakers' utterances. There are two types of source separation methods: physical and statistical. The former is based on physical information such as the direction of arrival of the sound sources. The latter uses only statistical independence. The advantage of the former is fast computation and effectiveness when precise information is available; that of the latter is that it needs no physical information, which makes it robust to measurement errors. In this paper, we propose to combine these approaches effectively. Experiments on a speech recognition task show that the proposed method can achieve the upper-limit performance of the two approaches.

© 2014 Institute of Electrical Engineers of Japan. Published by John Wiley & Sons, Inc.

Keywords: binary masking, independent vector analysis, automatic speech recognition

Received 4 April 2014; Revised XXXX; Accepted XXXX

1. Introduction

Recent progress in speech recognition has widened the target user base. In this scenario, a single system is required to recognize multiple speakers' utterances simultaneously and in real time. Before recognition, a source separation method is applied. The most common approach is based on physical information, such as the direction of arrival of the sound [1]. This method is fast and effective but susceptible to errors in the physical information.
On the other hand, the blind source separation approach based on statistical independence [2] is more time-consuming and may be inferior to the physical method when precise information is available, but it is robust to measurement errors. In this paper, we propose to combine these physical and statistical approaches effectively to improve the robustness of source separation.

a Correspondence to: Yuuki Tachioka. E-mail: Tachioka.Yuki@eb.MitsubishiElectric.co.jp
Information Technology R & D Center, Mitsubishi Electric Corporation, 5–1–1, Ofuna, Kamakura, Kanagawa 247-8501, Japan

2. Binary masking in the time–frequency domain

From now on, the number of microphones is assumed to be 2. Let $x_1$ and $x_2$ be the short-time Fourier transforms of the signals observed at the first and second microphones, respectively. Their cross-spectrum at time frame $t$ ($1 \le t \le T$) and frequency bin $\omega$ is represented as

$$ x_2(\omega,t)/x_1(\omega,t) = A e^{j\omega\tau(\omega,t)}, \tag{1} $$

where $j$ is the imaginary unit, $A$ is a positive amplitude ratio, and $\tau(\omega,t)$ is the time difference between the two microphones. The masking matrix $W$ is composed of two vectors $w_1$ and $w_2$:

$$ W(\omega,t) = (w_1(\omega,t), w_2(\omega,t))^h, \tag{2} $$

where $h$ denotes the Hermitian transpose. If the direction of the sound source $\theta$ is known, binary masking (BM) in the time–frequency domain constructs the masks $W$ as [1]

$$ w_k(\omega,t) = \begin{cases} \epsilon e_k & : \left|\sin^{-1}\!\left(\frac{c}{l_m}\tau(\omega,t)\right) - \theta\right| > \theta_c, \\ e_k & : \text{otherwise}, \end{cases} \tag{3} $$

where $k$ is the microphone ID, $e_k$ is a unit vector whose $k$th element is 1, $\epsilon$ is a small number for smoothing, and $\theta_c$ is the tolerance. $c$ is the sound velocity and $l_m$ is the distance between the microphones. The separated signal $y$ is obtained as

$$ y(\omega,t) = W(\omega,t)\,x(\omega,t), \tag{4} $$

where $x(\omega,t)$ and $y(\omega,t)$ are the vector forms $(x_1(\omega,t), x_2(\omega,t))^\top$ and $(y_1(\omega,t), y_2(\omega,t))^\top$, and $\top$ denotes the transpose. Separation is effective when the physical variables above are all reliable.

3. IVA using auxiliary function

The statistical method uses only the independence between sources and needs none of the physical information above. The most widely used statistical method, independent component analysis (ICA), suffers from the permutation problem among the separated speakers because it separates the sources at each frequency bin independently. To address this problem, independent vector analysis (IVA) minimizes the objective function (5) across frequency bins and determines time-invariant separation matrices $W(\omega)$:

$$ J(\mathcal{W}) = \sum_k E[r_{k,t}] - \sum_\omega \log\left|\det W(\omega)\right|, \tag{5} $$

where $\mathcal{W}$ is the set of $W(\omega)$, and $r_{k,t}$ is the auxiliary variable in (6). This objective can be optimized using an auxiliary function that is an upper bound of $J$ [2]. This method outperforms conventional gradient-descent-based methods. After the update of the auxiliary variables (6), the separation matrices are updated in two steps: the direction update rule (7) and the norm normalization rule (8).

$$ r_{k,t} = \sqrt{\sum_\omega \left|w_k^h(\omega)\,x(\omega,t)\right|^2}, \qquad V_k(\omega) = \sum_{t=1}^{T} \left[\frac{x(\omega,t)\,x^h(\omega,t)}{T\,r_{k,t}}\right]. \tag{6} $$
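As an illustration of the binary masking in Section 2, the following NumPy sketch estimates the time difference from the phase in Eq. (1), converts it to a direction of arrival, and applies the mask of Eqs. (3) and (4). It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `binary_mask_separate`, the array layout, the default sound velocity of 340 m/s, the default `eps`, and the clipping of the arcsine argument are all choices made for this example. It also reads Eq. (3) as applying the same mask value to both microphone channels, and it ignores phase wrapping at high frequencies.

```python
import numpy as np

def binary_mask_separate(x, omega, theta, theta_c, mic_dist, c=340.0, eps=0.1):
    """Hypothetical helper sketching Eqs. (1), (3), and (4).

    x        : complex STFT of the two microphones, shape (2, F, T)
    omega    : angular frequency of each bin in rad/s, shape (F,)
    theta    : known source direction in radians
    theta_c  : tolerance in radians
    mic_dist : microphone spacing l_m in metres
    """
    x1, x2 = x[0], x[1]
    # Eq. (1): the phase of x2/x1 is omega * tau. Using x2 * conj(x1)
    # gives the same phase without dividing by x1.
    phase = np.angle(x2 * np.conj(x1))
    # Guard the DC bin (omega = 0) against division by zero (assumption).
    safe_omega = np.maximum(omega, 1e-12)[:, None]
    tau = phase / safe_omega
    # Direction estimate asin(c * tau / l_m); clipping keeps the argument
    # inside the domain of arcsin when tau is noisy (assumption).
    doa = np.arcsin(np.clip(c * tau / mic_dist, -1.0, 1.0))
    # Eq. (3): attenuate bins whose direction is farther than theta_c from theta.
    mask = np.where(np.abs(doa - theta) > theta_c, eps, 1.0)
    # Eq. (4): with w_k equal to e_k or eps * e_k, W is diagonal, so the
    # product W x reduces to an element-wise scaling of each channel.
    y = mask[None, :, :] * x
    return y, mask
```

The function returns both the masked signal and the mask so that the time–frequency bins attributed to the target direction can be inspected directly.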
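The auxiliary variables of Eq. (6) and the objective of Eq. (5) can be computed as in the sketch below. It covers only these two equations; the update rules (7) and (8) are not reproduced. The function names, the array layout, the convention that the $k$th row of `W[f]` holds $w_k^h(\omega)$, the replacement of the expectation in Eq. (5) by a time average, and the small floor on $r_{k,t}$ are assumptions made for this example.

```python
import numpy as np

def auxiliary_variables(W, x):
    """Hypothetical helper sketching Eq. (6).

    W : separation matrices, shape (F, K, K); row k of W[f] is w_k^h(omega)
    x : observed STFT vectors, shape (F, K, T)
    Returns r with shape (K, T) and V with shape (K, F, K, K).
    """
    T = x.shape[2]
    # y_k(omega, t) = w_k^h(omega) x(omega, t), batched over frequency bins.
    y = W @ x
    # r_{k,t}: norm of source k over all frequency bins at frame t.
    r = np.sqrt(np.sum(np.abs(y) ** 2, axis=0))
    # Floor to avoid division by zero in silent frames (assumption).
    r = np.maximum(r, 1e-12)
    # V_k(omega) = (1/T) * sum_t x x^h / r_{k,t}
    V = np.einsum('fit,kt,fjt->kfij', x, 1.0 / r, x.conj()) / T
    return r, V

def iva_objective(W, r):
    """Eq. (5), with E[.] approximated by the average over time frames."""
    _, logabsdet = np.linalg.slogdet(W)  # log|det W(omega)| for each bin
    return r.mean(axis=1).sum() - logabsdet.sum()
```

Each $V_k(\omega)$ is a weighted covariance of the observations, so it is Hermitian by construction; checking this is a convenient sanity test when experimenting with the sketch.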