International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 3, Issue 3, March 2013)

Speech Separation and Speaker Recognition-Review

Dr. Mamta Sood 1, Krishna Gopal Soni 2, Monika Cheema 3
1 HOD, EC, 2 Student, M.Tech, TIT College Bhopal
3 Asst. Prof., Electronics & Communication Department, St. Francis Institute of Technology (Engineering College)

Abstract- Speech separation and recognition comprises two distinct tasks: speech separation and speech recognition. Speech separation is performed in the time domain and relies on a full, unconstrained decomposition of the speech sample, because a constrained approach becomes practically hard to compute and also limits the performance of the system. The decomposition is done by an appropriate independent component analysis (ICA) algorithm, giving independent components that are grouped into clusters corresponding to the original sources. Speech recognition (SR) aims to recognize speech in a large population. Doing so over a large population is very time-consuming and imposes a bottleneck, so for fast recognition we use a GMM-based k-means algorithm. To speed up the whole process, the clustered signals are used. During the test stage, only a small proportion of the speaker models, those in the selected clusters, are used in the likelihood computations, resulting in a significant speed-up with little to no loss in accuracy.

Keywords- Speech separation, Speech recognition (SR), Clusters, GMM, Independent component analysis (ICA)

I. INTRODUCTION

Blind Signal Separation is the general problem of determining the original sources when only their mixtures are available for observation.
Over the past 5 years, research on this topic has exploded, owing to the emergence of relatively successful separation algorithms as well as the growing sentiment that the technique constitutes a universal panacea capable of everything from de-noising speech to uncovering the laws of the stock market. The process is often termed "blind", with the understanding that both the source signals and the mixing procedure are unknown [1]. Such a statement is, of course, a blatant exaggeration: the assumption of some specific mixing model is the paramount piece of prior information required, and in many scenarios even knowledge of certain source statistics is necessary.

We thus begin with the channel model. The sources may be sounds, images, biomedical or financial data. Our primary interest is in audio source signals, with microphones collecting the output mixed signals. Under this setting, the channel H may generally be construed as a linear time-invariant (LTI) system, though there is some activity on nonlinear mixing models. Three levels of complexity are discerned:

• H is a matrix. We call this the instantaneous mixing model, since only the relative attenuations of sound due to the microphone-source distances are accommodated.
• $H_{ij}(z) = a_{ij} z^{-d_{ij}}$. This is the delayed mixing model, incorporating not only the attenuation $a_{ij}$ between the $i$-th microphone and the $j$-th source, but the travel time $d_{ij}$ as well.
• H is a matrix of FIR filters, $H_{ij}(z) = \sum_{k} h_{ij}(k) z^{-k}$, where $h_{ij}(k)$ denotes the $k$-th filter tap between the $j$-th source and the $i$-th microphone. This is the convolutive mixing model, where room reverberation is accounted for.
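The three mixing models can be illustrated with a minimal Python/NumPy sketch. It is not taken from the paper: the function names, the array layouts (sources stored as rows of `s`, filter taps stored along the last axis of `H`) and the restriction to non-negative integer sample delays are all assumptions made for illustration.

```python
import numpy as np


def mix_instantaneous(A, s):
    """Instantaneous model: x[i, n] = sum_j A[i, j] * s[j, n]."""
    return A @ s


def mix_delayed(A, D, s):
    """Delayed model: x[i, n] = sum_j A[i, j] * s[j, n - D[i, j]].

    D holds non-negative integer delays in samples (an assumption of
    this sketch; physical travel times need not be whole samples).
    """
    n_mics, n_src = A.shape
    n = s.shape[1]
    x = np.zeros((n_mics, n))
    for i in range(n_mics):
        for j in range(n_src):
            d = int(D[i, j])
            if d >= n:
                continue  # the delayed source never arrives within the window
            x[i, d:] += A[i, j] * s[j, :n - d]
    return x


def mix_convolutive(H, s):
    """Convolutive model: x[i, n] = sum_j sum_k H[i, j, k] * s[j, n - k]."""
    n_mics, n_src, _ = H.shape
    n = s.shape[1]
    x = np.zeros((n_mics, n))
    for i in range(n_mics):
        for j in range(n_src):
            # Full convolution, truncated to the length of the input signal.
            x[i] += np.convolve(s[j], H[i, j])[:n]
    return x


# Toy data: two sources, two microphones, four samples.
s = np.array([[1.0, 2.0, 3.0, 4.0],
              [0.0, 1.0, 0.0, 0.0]])
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])
D = np.array([[0, 1],
              [1, 0]])

x_inst = mix_instantaneous(A, s)
x_del = mix_delayed(A, D, s)

# A delayed channel is an FIR filter with a single non-zero tap at index d_ij.
H = np.zeros((2, 2, 2))
for i in range(2):
    for j in range(2):
        H[i, j, D[i, j]] = A[i, j]
x_conv = mix_convolutive(H, s)
```

In this sketch each model is a special case of the next: one-tap filters reduce the convolutive model to the instantaneous one, and filters with a single non-zero tap at index $d_{ij}$ reduce it to the delayed one.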