Multimodal Interaction in Modern Automobiles

Ashish Khare, Hiranmay Ghosh, Sujal Subhash Wattamwar
TCS Innovation Labs Delhi, 249 D&E Udyog Vihar Phase 4, Gurgaon, Haryana, India
{ashish16.k | hiranmay.ghosh | sujal.wattamwar}@tcs.com

Aniruddha Sinha, Brojeshwar Bhowmick, K S Chidanand Kumar
TCS Innovation Labs Kolkata, BIPL Bldg, Salt Lake Electronics Complex, Kolkata, West Bengal, India
{aniruddha.s | b.bhowmick | kschidanand.kumar}@tcs.com

Sunil Kumar Kopparapu
TCS Innovation Labs Mumbai, ODC G, SDC-V, Yantra Park, Subhash Nagar, Pokharan Road 2, Thane (West), Maharashtra, India
sunilkumar.kopparapu@tcs.com

ABSTRACT
This paper describes a few innovative solutions for applying multimodal interaction techniques in modern automobiles to ensure driving comfort, safety and security. The solutions are based on computer vision and speech processing techniques.

Keywords
Multimodal interaction, hand gesture recognition, speaker verification, driver fatigue detection.

INTRODUCTION
Modern cars have gone beyond providing a means of commuting and tend to create a personal space for their occupants. The provision of advanced entertainment systems, climate control equipment and navigation aids in modern cars is an example of this transition. Providing such additional equipment in a car, however, brings new challenges. Controlling secondary equipment while the driver needs to concentrate on driving can cause distraction and pose a potential safety threat [1, 2]. Long and lonely drives also contribute to driver fatigue, resulting in fatal accidents [3, 4]. Vehicle security in urban society has become another major issue in recent times [5]. In this context, researchers are exploring multimodal techniques for interacting with the automobile to enhance driving comfort, safety and security. Multimodal interaction refers to the use of several natural modes of communication, such as gesture, gaze and speech, to complement traditional electro-mechanical interaction devices such as control buttons, joysticks and specially designed levers. Moreover, several biometric techniques can be used to secure the vehicle and to enhance driving safety by authenticating the driver and ascertaining his physiological condition. Several centers of TCS Innovation Labs have been working together with leading automobile manufacturers to provide such multimodal interfaces. In this paper, we present a few examples of innovative multimodal interfaces in an automobile that provide driving comfort, safety and security.

MULTIMODAL CONTROL OF IN-VEHICLE EQUIPMENT
The secondary equipment in an automobile, e.g. the entertainment system, climate controls and navigation aids, is traditionally controlled by the driver through buttons, touch-screens and remote devices. Operating a plethora of control buttons for the various pieces of equipment while driving is not only inconvenient but can distract the driver, causing the automobile to go out of control and resulting in serious accidents. Remote devices and placement of buttons at vantage points, such as embedded on the steering wheel, only partially solve the problem. We propose interacting with such equipment through gesture and speech, which requires little distraction. The motivation for using gesture and speech together is to improve the robustness of the system and to provide alternative modes of communication: while gestures require short diversions of visual attention, speech recognition may not be robust in a noisy driving environment.
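The idea can be illustrated with a small data structure in which both modalities map onto one shared set of equipment actions. The sketch below is ours, not the system's actual vocabulary; the gesture labels (e.g. "open_palm") in particular are illustrative assumptions.

    from enum import Enum, auto
    from typing import Optional

    class Action(Enum):
        PLAY = auto()
        STOP = auto()
        EJECT = auto()

    # Spoken-word vocabulary (words taken from the example in the text).
    SPEECH_VOCAB = {
        "start": Action.PLAY,
        "stop": Action.STOP,
        "eject": Action.EJECT,
    }

    # Predefined gesture vocabulary; gesture labels are hypothetical.
    GESTURE_VOCAB = {
        "point_forward": Action.PLAY,
        "open_palm": Action.STOP,
        "swipe_out": Action.EJECT,
    }

    def lookup(modality: str, symbol: str) -> Optional[Action]:
        """Map a recognized word or gesture label to the corresponding action."""
        vocab = SPEECH_VOCAB if modality == "speech" else GESTURE_VOCAB
        return vocab.get(symbol)

    # Either modality resolves to the same action, here Action.STOP:
    assert lookup("speech", "stop") is lookup("gesture", "open_palm")

Because both channels resolve to the same action set, either modality can substitute for the other when one is unreliable.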
Figure 1 depicts the system architecture. The car is equipped with a camera and a microphone to pick up the driver's voice and hand gestures. The driver has two options to control an in-vehicle system, say the music player: he can either speak a word from the vocabulary (start, stop, eject, etc.) to control the car audio system, or he can make a gesture with his hand, drawn from a predefined gesture vocabulary. Speech and gesture recognition technologies are used to interpret the spoken words and the gestures made. The data obtained from the two channels are fused in the context of the previous user interactions and the current instrument status. In the absence of tactile or visual feedback, the interpreted action request is spoken back to the user.
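A minimal sketch of this fusion step, under our own assumptions, is given below: each recognizer emits a hypothesis with a confidence score, the two channels are fused against the current instrument status, and the result is spoken back. The confidence threshold, the player states and the speak() stub are illustrative assumptions, not details of the actual system.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Hypothesis:
        action: str        # e.g. "play", "stop", "eject"
        confidence: float  # recognizer score in [0, 1]

    def fuse(speech: Optional[Hypothesis], gesture: Optional[Hypothesis],
             player_state: str, min_conf: float = 0.6) -> Optional[str]:
        """Combine the two channels, using instrument status as context."""
        # Discard hypotheses the recognizers themselves are unsure about.
        candidates = [h for h in (speech, gesture) if h and h.confidence >= min_conf]
        if not candidates:
            return None  # no trustworthy input; ask the driver to repeat
        # Fall back to the more confident channel when only one survives
        # or when the two disagree.
        best = max(candidates, key=lambda h: h.confidence)
        # Contextual filtering against the current instrument status,
        # e.g. ignore a "play" request while the player is already playing.
        if best.action == "play" and player_state == "playing":
            return None
        return best.action

    def speak(text: str) -> None:
        """Stand-in for the spoken feedback channel (text-to-speech)."""
        print("[TTS]", text)

    # Example: a weak speech hypothesis backed by a strong gesture hypothesis.
    action = fuse(Hypothesis("play", 0.55), Hypothesis("play", 0.80), "stopped")
    speak("Executing: " + action if action else "Please repeat the command")

In this sketch, a command rejected by the context check produces a spoken prompt rather than silence, reflecting the paper's point that auditory feedback replaces tactile and visual confirmation while driving.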