An Architecture for Centralized SIP-based Audio Conferencing using Application Layer Multicast

José Simões¹, Ravic Costa¹, Paulo Nunes¹˒³, Rui Lopes¹˒², Laurent Mathy⁴

¹ Departamento de Ciências e Tecnologias da Informação - Instituto Superior das Ciências do Trabalho e da Empresa (ISCTE)
² Associação para o Desenvolvimento das Técnicas e Tecnologias de Informação (ADETTI)
³ Instituto de Telecomunicações (IT)
Lisboa, Portugal
{Jose.Simoes, Ravic.Costa, Paulo.Nunes, Rui.Lopes}@iscte.pt
⁴ Computing Department - Lancaster University
Lancaster, United Kingdom
laurent@comp.lancs.ac.uk

Abstract: Audio conferencing is an important aspect of Internet Telephony services. In this article, starting from a centralized conferencing architecture, we propose to employ application layer multicast for media distribution, using "agents" responsible for delivering streaming media to end-clients, with the aim of reducing both the traffic in the network and the server workload. In this model, we use the concepts of active client, which exchanges media directly with the server, and passive client, which only receives media from its agent. In our implementation we use the Real-time Transport Protocol (RTP) for data transmission and the Session Initiation Protocol (SIP) for signaling, hence making it compatible with most existing state-of-the-art hardware and software.

Keywords: Audio Conferencing; SIP; ALM; VoIP

I. INTRODUCTION

With the spread of Internet Telephony services, there is a growing need to bring several traditional telephony functionalities to this new environment. Conferencing is no exception. An audio conference call is a telephone call (IP or not) where the calling party wants to have more than one called party listen in to the audio. Conferencing is widely used these days, in many different applications and scenarios, for entertainment, social, educational or business purposes.
In our proposal, we intend to serve large-scale applications with a large number of participants, where a few of them produce media (active participants) while the majority (passive participants) just listen to what is produced.

Conferencing models can be classified as: centralized; full mesh; unicast receive and multicast send; multicast; and end mixing [1]. This classification is based on the topology of signaling, media delivery and architecture component relationships [1]. In this paper we consider the centralized conference scenario, where a server receives media streams from all participants, mixes them if needed, and redistributes the appropriate media stream(s) back to the participants. This model has the advantage that clients do not need to be modified to perform media mixing and transcoding. In addition, it is relatively easy to support heterogeneous media clients [1].

Since it is difficult for each sender to subtract its own contribution, the server needs to create a customized stream for each of the active callers, e.g., [2]. Assuming that not all users are using the same media format, the server needs to decode the audio streams to a non-compressed audio format to mix them. After that, it encodes the mixed stream in the appropriate media format for each of the participants. This leads to scalability problems in media distribution and server workload, which limit the number of participants in a conference call.

To mitigate some of the limitations of the centralized conferencing model, notably the amount of traffic in the network and the server workload, we decided to study the impact of multicast for media distribution. Because no inter-domain multicast routing protocol is globally deployed at the network layer, our architecture uses Application Layer Multicast (ALM) for media distribution [3]. Another important requirement for our architecture is that it should be SIP-based [4].
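As an illustration of the centralized mixing step described above, the following minimal Python sketch builds one customized stream per active caller (the sum of all contributions minus the caller's own) from already decoded frames. It is not the authors' implementation: the function name `mix_streams`, the participant ids, the use of 16-bit linear PCM, and the extra `"*"` entry holding a single shared mix for passive listeners are all assumptions made for this example. Decoding and re-encoding per media format are left out.

```python
from typing import Dict, List

PCM_MIN, PCM_MAX = -32768, 32767  # assumed 16-bit linear PCM range


def clamp(sample: int) -> int:
    """Saturate a summed sample to the 16-bit PCM range."""
    return max(PCM_MIN, min(PCM_MAX, sample))


def mix_streams(frames: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Return one customized frame per active participant plus a full mix.

    `frames` maps a hypothetical participant id to its decoded PCM frame.
    All frames are assumed to share the same length and sampling rate.
    The "*" key (an assumption of this sketch) holds the full mix that
    passive listeners could share.
    """
    if not frames:
        return {"*": []}
    length = len(next(iter(frames.values())))
    # Sum every contribution once, without clipping yet.
    total = [sum(f[i] for f in frames.values()) for i in range(length)]
    mixes = {
        # Active caller: full sum minus its own contribution.
        pid: [clamp(total[i] - own[i]) for i in range(length)]
        for pid, own in frames.items()
    }
    # Passive listeners: one stream containing every active caller.
    mixes["*"] = [clamp(s) for s in total]
    return mixes
```

Computing the total once and subtracting each caller's own frame keeps the cost linear in the number of active callers, but the server still produces one output stream per active caller, which is the workload issue the text points out.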
Basing the architecture on SIP allows the use of popular software for both terminals and server (such as X-Lite, Kphone, SER, etc.). For media distribution, RTP is used over either unicast or ALM connections, depending on whether a client is active or passive at a given instant.

II. RELATED WORK

Many conference servers on the market today are H.323-based. However, SIP-based conferencing systems are attracting growing interest [1]. As an example, in [1], a centralized SIP-based conferencing system provides a suitable multimedia conferencing platform that allows advanced scenarios and services (e.g., transcoding) without requiring end systems to be conferencing-aware.

Another important aspect of audio conferencing that has to be considered is how the media is delivered. In
ISBN: 1-9025-6013-9 © 2006 PGNet