A SegNet Based Image Enhancement Technique for Air-Tissue Boundary Segmentation in Real-Time Magnetic Resonance Imaging Video

Renuka Mannem, Valliappan CA and Prasanta Kumar Ghosh
Electrical Engineering Department, Indian Institute of Science, Bangalore
mannemrenuka@iisc.ac.in, valliappanc@iisc.ac.in, prasantg@iisc.ac.in

Abstract—In this paper, we propose a new technique for segmenting the Air-Tissue Boundaries (ATBs) in the upper airway of the vocal tract in the midsagittal plane of real-time Magnetic Resonance Imaging (rtMRI) videos. The proposed technique uses the segmentation using Fisher-discriminant measure (SFDM) scheme. The paper introduces an image enhancement technique, based on semantic segmentation, that preprocesses the rtMRI frames before ATB prediction. We use a deep convolutional encoder-decoder architecture (SegNet) for the semantic segmentation of the rtMRI images. The paper examines the significance of preprocessing before ATB prediction by implementing the SFDM approach with different preprocessing techniques. Experiments with 5779 rtMRI video frames from four subjects demonstrate that semantic-segmentation-based image enhancement of the rtMRI frames improves the performance of the SFDM approach compared with the other preprocessing approaches. Experimental results also show that the proposed approach yields 8.6% less error in ATB prediction than a semi-supervised grid-based baseline segmentation approach.

Index Terms: air-tissue boundary segmentation, real-time magnetic resonance imaging video, Fisher discriminant measure, SegNet, image enhancement.

I. INTRODUCTION

Real-time magnetic resonance imaging (rtMRI) video of the vocal tract in the midsagittal plane during speech is an important tool for speech production research. It captures the complete vocal tract in a non-invasive manner [1].
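The abstract names the SFDM scheme but does not describe how a Fisher-discriminant measure locates a boundary. As a hedged illustration only, the following sketch assumes a hypothetical 1-D pixel-intensity profile sampled across an air-tissue transition and applies the textbook two-class Fisher criterion, J(k) = (mu_1 - mu_2)^2 / (sigma_1^2 + sigma_2^2), to every candidate split point k, keeping the split that best separates low-intensity (air) samples from high-intensity (tissue) samples. The function names `fisher_ratio` and `best_boundary`, the `min_side` and `eps` parameters, and the 1-D profile setup are all assumptions; this is not the SFDM implementation of [26].

```python
import numpy as np


def fisher_ratio(a, b, eps=1e-12):
    """Two-class Fisher criterion: (mean_a - mean_b)^2 / (var_a + var_b).

    `eps` (an assumption) avoids division by zero when both classes are constant.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return (a.mean() - b.mean()) ** 2 / (a.var() + b.var() + eps)


def best_boundary(profile, min_side=2):
    """Return (k, score) for the split profile[:k] | profile[k:] with the largest Fisher ratio.

    `min_side` (hypothetical) is the minimum number of samples kept on each side.
    """
    profile = np.asarray(profile, dtype=float)
    if len(profile) < 2 * min_side:
        raise ValueError("profile too short for the requested min_side")
    best_k, best_score = None, -np.inf
    for k in range(min_side, len(profile) - min_side + 1):
        score = fisher_ratio(profile[:k], profile[k:])
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


if __name__ == "__main__":
    # Toy profile: three dark "air" samples followed by three bright "tissue" samples.
    k, score = best_boundary([10, 12, 11, 200, 210, 205])
    print(k)  # -> 3
```

Maximizing the ratio rewards splits whose two sides have well-separated means and small within-side spread, which is why a blurred, low-contrast transition (the situation the paper attributes to raw rtMRI frames) yields a weaker, less decisive peak.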
The non-invasive nature of rtMRI makes it more effective than other existing methods such as X-ray [2], electromagnetic articulography [3] and ultrasound [4]. rtMRI video provides spatio-temporal information about the speech articulators, which helps in modelling speech production. For this purpose, accurate Air-Tissue Boundary (ATB) segmentation of the rtMRI video is essential. For example, Toutios [5] used ATBs predicted from rtMRI video to develop a text-to-speech synthesis system. rtMRI data have also been used to compare the articulatory control of beatboxers in order to understand how the articulators are used to achieve acoustic goals [6]. ATB segmentation serves as a preprocessing step in rtMRI-based studies of the morphological structure of the vocal tract [7] and of vocal tract movement [8]. Accurate ATB segmentation in the upper airway of the vocal tract is needed to study the time evolution of the vocal tract cross-sectional area [9], which forms the basis for most speech processing applications. Thus, accurate ATB segmentation in the upper airway of the vocal tract is essential before rtMRI videos can be used to study the different articulators and the dynamics of the vocal tract [10], [11], [12], [13].

The problem of ATB segmentation in rtMRI images has been addressed by several past works using various approaches. For example, Asadiabadi et al. presented a statistical method using appearance and shape models of the vocal tract [14]. Lammert et al. proposed a region of interest (ROI) based technique [15] and a data-driven approach using pixel intensities for ATB segmentation [16]. A factor analysis approach was used by Toutios et al. [17] and Sorensen et al. [18] to predict the compact outline of the vocal tract. Zhang et al. [19] used multi-directional Sobel operators to construct a boundary intensity map from the rtMRI video frames.
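To make the idea of a multi-directional boundary intensity map concrete, the following NumPy sketch assumes four 3x3 Sobel-type kernels (horizontal, vertical and the two diagonals) and takes, at each pixel, the maximum absolute response across directions. The kernel set, the max-combination rule, the edge-replicated borders and the helper names `correlate3x3` and `boundary_intensity_map` are assumptions made for illustration; they are not taken from Zhang et al. [19].

```python
import numpy as np

# Assumed Sobel-type kernels: horizontal gradient, vertical gradient, two diagonals.
KERNELS = [
    np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float),
    np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float),
    np.array([[-2, -1, 0], [-1, 0, 1], [0, 1, 2]], dtype=float),
    np.array([[0, -1, -2], [1, 0, -1], [2, 1, 0]], dtype=float),
]


def correlate3x3(image, kernel):
    """3x3 cross-correlation with edge-replicated borders; output has the input's shape."""
    padded = np.pad(np.asarray(image, dtype=float), 1, mode="edge")
    h, w = padded.shape[0] - 2, padded.shape[1] - 2
    out = np.zeros((h, w))
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * padded[di:di + h, dj:dj + w]
    return out


def boundary_intensity_map(image):
    """Per-pixel maximum absolute response over all directional kernels."""
    responses = [np.abs(correlate3x3(image, k)) for k in KERNELS]
    return np.max(responses, axis=0)


if __name__ == "__main__":
    # Toy frame: dark "air" on the left, bright "tissue" from column 3 onward.
    frame = np.zeros((5, 6))
    frame[:, 3:] = 100.0
    bmap = boundary_intensity_map(frame)
    # Columns 2 and 3 (either side of the step) score 400; flat regions score 0.
    print(bmap[0])
```

Because the absolute value is taken, it does not matter whether cross-correlation or convolution is used, or which sign convention each kernel follows; only the edge strength along each direction survives.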
A semantic edge detection based algorithm for contour prediction was proposed by Somandepalli et al. [20]. Several robust ATB segmentation techniques have also been proposed that use a composite analysis grid line superimposed on each rtMRI frame [21], [22], [23], [24]. Techniques such as [21], [24], [16], [14] are advantageous over the others because they are unsupervised and semi-automatic. However, more reliable and accurate ATBs can be obtained with a supervised learning approach using enhanced rtMRI images. For example, Advait et al. proposed a supervised approach using the Fisher-discriminant measure (FDM) [26], and Valliappan et al. [25] used fully convolutional network (FCN) based semantic segmentation followed by various post-processing steps.

In this paper, we use the FDM-based approach [26] for ATB segmentation. Rather than predicting the boundaries with an unsupervised approach, segmentation using the Fisher-discriminant measure (SFDM) learns the ATBs from a limited number of training rtMRI frames across different subjects. In the upper airway, the ATBs trace the contours that separate the high-intensity pixel regions corresponding to tissue from the low-intensity pixel regions corresponding to the airway cavity of the vocal tract. In rtMRI images, however, this transition in intensity from the air region to the tissue region is not clearly visible because of the low resolution and blurriness of the images. Hence, the rtMRI images need to be enhanced before any ATB segmentation technique is applied, so that reliable boundaries can be predicted. In this paper, the enhancement of the rtMRI images is achieved by semantically segmenting an image, in which each pixel of the
978-1-5386-9286-8/19/$31.00 ©2019 IEEE