A Resilient Telco Grid Middleware

C. Lac and S. Ramanathan
France Telecom RD/MAPS/AMS/VVT
2, avenue Pierre Marzin, 22307 Lannion Cedex, France
{chidung.lac, sakkaravarthi.ramanathan}@francetelecom.com

Abstract

Grid computing can exploit distributed resources, whether underutilized or not, to provide massively parallel CPU capacity. Load balancing, application sharing, and geographically dispersed databases are other Grid features of interest to a telecommunications operator (Telco). Building a Grid middleware to implement Telco services is thus a way to assess the validity of this type of architecture for future applications. To achieve a trustworthy platform, the middleware needs to take into account accidental or malicious faults which can impact different resilience aspects. This paper describes a secure and highly available architecture which, besides traditional Grid middleware functionalities (resource broker, job mapping, system monitoring, ...), makes use of fault-tolerant mechanisms (process duplication, failure handling, ...) to guarantee the QoS defined in the service level agreement. Security is addressed by analyzing each node's defense capability and matching it with the appropriate user job.

1. Grid usage for Telcos

Grid computing is distributed computing taken to the next evolutionary level. The goal is to create the illusion of a simple, yet large and powerful, self-managing virtual computer out of a large collection of connected heterogeneous systems sharing various resources (CPU, storage capacity, applications, etc.). Since Telcos have a great deal of experience in managing large complex networks, they should extend this skill set to the Grid by taking control of the nodal IT assets and providing an end-to-end Grid service to customers such as residential clients, small and medium enterprises, and corporations.
Communication networks and Grid computing, if merged, have a great deal of technological potential. This would allow, for instance, users of mobile devices (cell phones, laptops, PDAs, etc.) to submit jobs on the Grid and have its tremendous processing power at their fingertips. It would also allow them to access data stored or generated by the Grid, and to analyze it on their handheld devices.

While classical clustering and distributed computing techniques have mostly been neglected because they were seen as lying outside the Telcos' core business, the ambitious goal of the Grid, namely spreading and managing huge amounts of data across distributed (and distant) sites, is being seriously considered by network providers. Because Grid computing relies heavily on networks, it offers interesting opportunities to Telcos: exploiting the Grid for internal use can greatly improve operations and lower expenses, while offering external services through these networks could be a profitable new market. The driving factor for a Telco Grid network service offering will be to use the assets it already owns effectively in order to realize a fast return on investment.

2. Dependability and fault tolerance in Grid

A preliminary implementation of a "proxy" can combine groups of identical Grid services in various configurations (such as fallbacks and parallel execution), giving the appearance of a single, better service; the aim is to build on this to provide, according to user specifications, dependability (reliability, availability, ...) for arbitrary applications [1]. Adding dependability features to a service-based Grid emphasizes service composition rather than the sharing of low-quality resources. The idea is to build applications out of computational services provided by the different sites of the Grid [2].
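
The two proxy configurations mentioned above, fallback and parallel execution over a group of identical services, can be illustrated with a minimal Python sketch. It is not the implementation described in [1]: the names call_with_fallback, call_in_parallel, AllReplicasFailed, broken_site and healthy_site are hypothetical, and each Grid service is assumed to be reducible to a plain callable that either returns a result or raises an exception.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class AllReplicasFailed(Exception):
    """Raised when no replica in the group produced a result (hypothetical name)."""

    def __init__(self, errors):
        super().__init__(f"{len(errors)} replica(s) failed")
        self.errors = list(errors)


def call_with_fallback(replicas, request):
    """Fallback configuration: try replicas in order, return the first success."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:  # any failure triggers the next replica
            errors.append(exc)
    raise AllReplicasFailed(errors)


def call_in_parallel(replicas, request):
    """Parallel configuration: submit to all replicas, return the first success.

    Leaving the 'with' block waits for the remaining replicas to finish,
    so this sketch trades latency for simplicity.
    """
    if not replicas:
        raise AllReplicasFailed([])
    errors = []
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(replica, request) for replica in replicas]
        for future in as_completed(futures):
            try:
                return future.result()
            except Exception as exc:
                errors.append(exc)
    raise AllReplicasFailed(errors)


# Two stand-in "identical services" hosted on different sites (assumptions).
def broken_site(x):
    raise RuntimeError("site unavailable")


def healthy_site(x):
    return x * x


if __name__ == "__main__":
    group = [broken_site, healthy_site]
    print(call_with_fallback(group, 7))
    print(call_in_parallel(group, 7))
```

In both functions the caller sees a single service: a failure of one site is masked as long as at least one replica in the group answers. The fallback variant consumes fewer resources, whereas the parallel variant avoids waiting for a failed replica before trying the next one.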
Developing both an improved fault model for Grid computing and a method for offering fault-tolerant Grid applications that will provide protection and robustness against both malicious and erroneous faults is a big task [3]. A fault model attempts to

Proceedings of the 11th IEEE Symposium on Computers and Communications (ISCC'06) 0-7695-2588-1/06 $20.00 © 2006 IEEE