AN H.264-BASED VIDEO ENCODING SCHEME FOR 3D TV

M. T. Pourazad, P. Nasiopoulos, and R. K. Ward
Electrical and Computer Engineering Department, University of British Columbia
Vancouver, BC, V6T 1Z4, Canada
phone: +1 (604) 822-4988, fax: +1 (604) 822-9013, email: {pourazad, panos, rababw}@ece.ubc.ca

ABSTRACT

This paper presents an H.264-based scheme for compressing 3D content captured by 3D depth range cameras. Existing MPEG-2 based schemes take advantage of the correlation between the 2D video sequence and its corresponding depth map sequence, and reuse the 2D motion vectors (MVs) for the depth video sequence as well. This speeds up the encoding of the depth map sequence, but it results in an increased bitrate or a drop in the quality of the reconstructed 3D video. We found this is because the MVs of the 2D video sequence are not the best choice for encoding the parts of the depth map sequence that contain sharp edges or correspond to distant objects. To solve this problem, we propose an H.264-based method that re-estimates the MVs and re-selects the appropriate macroblock modes for these regions. Experimental results show that the proposed method enhances the quality of the encoded depth map sequence by an average of 1.77 dB. Finding the MVs for the regions of the depth map sequence that contain sharp edges requires only 30.64% of the computational effort needed to calculate MVs for the entire depth map sequence.

1. INTRODUCTION

Stereoscopic or three-dimensional television (3D TV) can spectacularly enhance the viewer's experience by allowing images to emerge from the screen and penetrate the spectator's space. It generates a compelling sense of physical space and makes viewers feel as though they are part of the scene. In recent years, researchers have put much effort into bringing full 3D TV applications to the mass consumer market, and significant improvements have been achieved in 3D content generation and 3D display system production.
For 3D TV transmission, however, no compression standard has yet been agreed upon. Recent investigations in this area focus mostly on the efficient compression of 3D content captured by 3D depth range cameras rather than 3D video recorded with a dual-camera configuration [1]. 3D depth range cameras capture 3D content as two video sequences: a conventional two-dimensional (2D) RGB view and its accompanying depth map [2]. This format allows easy capturing, simplifies post-production, and requires lower transmission bandwidth than the dual-camera configuration, which captures stereo pair data from two slightly different perspectives, one for the left eye and the other for the right eye. To perceive the 3D content captured by a 3D depth range camera, the left- and right-eye views must be reconstructed at the receiver end using image-based rendering techniques [3]. The viewer therefore has the choice of watching the content in either 2D or 3D format.

The generated 3D content needs to be compressed and transmitted for consumer use. Experiments on 3D video compression for 3D TV applications show that transmitting the depth information needs about 20% of the bitrate required for MPEG-2 compressed 2D video, at a typical broadcast bitrate of 3 Mbit/s [4]. One method used to compress 3D video takes advantage of the existing relationship between the 2D video sequence and the depth map sequence, and reuses the motion vectors (MVs) obtained for the 2D video sequence to encode the depth map sequence as well [1]. This compression scheme was based on the MPEG-2 standard and improved the encoding speed; the bitrate of the depth map was fixed at 20% of the 2D video bitrate.
The results showed that, with bi-directional temporal prediction, this approach did not hamper the quality of the encoded depth map sequence compared to encoding the depth map sequence separately. With unidirectional temporal prediction, however, the quality of the reconstructed depth map decreases.

MPEG-2 seems to yield adequate results when the MVs are copied from the 2D sequence to the depth sequence, but only because this standard does not take advantage of the differences in the "texture" structure between the two streams. For instance, due to the resolution limitations of 3D depth cameras, some edges in the depth map sequence are sharper than their counterparts in the 2D video sequence. These areas could be compressed much more efficiently if motion estimation were more accurate than that supported by MPEG-2. In addition, the depth map of distant objects contains many zero-valued pixels, a fact that may be exploited to encode fewer macroblocks (MBs) than in the 2D sequence. In other words, the MVs and MB modes used for the 2D sequence are not the best choices for encoding the edges and distant-object regions of the depth map sequence.

We developed a new coding method for 3D video streams based on the H.264/AVC standard. H.264/AVC is the most advanced video coding standard available today, achieving approximately 50% bitrate savings compared to previous standards such as MPEG-4 and MPEG-2 [5]. Our method uses special features of the H.264/AVC standard,

14th European Signal Processing Conference (EUSIPCO 2006), Florence, Italy, September 4-8, 2006, copyright by EURASIP
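The selection logic behind the approach above can be illustrated with a minimal sketch: for each depth-map macroblock, decide whether to reuse the 2D sequence's MV, re-estimate the MV (sharp-edge region), or skip the block (mostly zero-valued, i.e. a distant object). The function name, the gradient-based edge test, and the threshold values are hypothetical illustrations, not the paper's actual criteria.

```python
import numpy as np

def classify_depth_macroblock(mb, edge_thresh=40.0, zero_frac=0.9):
    """Decide how to encode a 16x16 depth-map macroblock.

    Hypothetical sketch of the selective re-estimation idea;
    thresholds are illustrative, not taken from the paper.
    Returns one of:
      'skip'        - mostly zero-valued pixels (distant object)
      're-estimate' - contains sharp edges; find new MVs/modes
      'reuse-2d-mv' - reuse the MV from the 2D video sequence
    """
    mb = np.asarray(mb, dtype=np.float64)
    # Distant objects map to (near-)zero depth values.
    if np.mean(mb == 0) >= zero_frac:
        return 'skip'
    # Simple gradient magnitude as a sharp-edge indicator.
    gy, gx = np.gradient(mb)
    if np.max(np.abs(gx)) > edge_thresh or np.max(np.abs(gy)) > edge_thresh:
        return 're-estimate'
    return 'reuse-2d-mv'
```

In a real encoder this decision would be made per macroblock before motion estimation, so that the costly MV search runs only on the fraction of blocks flagged for re-estimation.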