J. Parallel Distrib. Comput. 66 (2006) 1181–1188
www.elsevier.com/locate/jpdc

The Trellis security infrastructure for overlay metacomputers and bridged distributed file systems

Paul Lu ∗, Michael Closson, Cam Macdonell, Paul Nalos, Danny Ngo, Morgan Kan, Mark Lee

Department of Computing Science, University of Alberta, Edmonton, Alta., Canada T6G 2E8

Received 17 December 2005; received in revised form 31 March 2006; accepted 10 April 2006
Available online 27 June 2006

Abstract

Researchers often have non-privileged access to a variety of high-performance computer (HPC) systems in different administrative domains, possibly across a wide-area network. Consequently, the security infrastructure becomes an important component of an overlay metacomputer: a user-level aggregation of HPC systems. The Trellis security infrastructure (TSI) is layered on top of the widely-deployed secure shell (SSH) and systems administrators only need to provide unprivileged accounts to the users. The contribution of TSI is in demonstrating that a single sign-on (SSO) system, for a variety of use-case scenarios, can be implemented without requiring a completely new security infrastructure. We describe the use of TSI for a Canada-wide overlay metacomputer, for computational workloads (i.e., CISS-3) that spanned 22 administrative domains, at its peak had over 4000 concurrent jobs, and included a new distributed file system (i.e., Trellis NFS).

© 2006 Elsevier Inc. All rights reserved.

Keywords: Security; Single sign-on; Metacomputing; Computational science; Capacity computing; Global job scheduler; Distributed file system

1. Introduction

Some workloads and experiments in computational science require large amounts of resources, both in terms of capability and capacity. In capacity computing, where high throughput is often the main goal, aggregating different high-performance computing (HPC) systems is a common technique to provide the needed capacity. For example, Researcher A (Fig.
1) has access to his group’s system, a departmental system, and a system at an HPC center. Researcher B has access to her group’s server and (perhaps) a couple of different HPC centers, including one center in common with Researcher A. It would be ideal if all of the systems could be part of one metacomputer. But the different systems may be controlled by different groups who may not run the same security software or may not have negotiated cross-domain security policies. Yet, Researchers A and B would still like to be able to exploit the aggregate power of their systems.

∗ Corresponding author. E-mail address: paullu@cs.ualberta.ca (P. Lu).
doi:10.1016/j.jpdc.2006.04.005

Some of the main requirements of the security infrastructure for a cross-administrative domain situation include:

(1) Single sign-on (SSO) across multiple administrative domains: The user wishes to authenticate (i.e., prove his identity) to the system only once, and not once per domain. The well-known secure shell (SSH) [1] system can support SSO if the user properly sets up his private and public keys and uses the ssh-agent for automated authentication.
(2) SSO support for background jobs, servers, and multiple users: Jobs or servers left in the background need SSO (e.g., to get a unit of work, return a result, move data).
(3) Security and mitigation of attacks: SSH is already considered to be reasonably secure. The challenge for this work is in maintaining that security while not opening up new, significant avenues of attack.

At a high level, the Trellis security infrastructure (TSI) addresses some of the main issues in security as follows:

(1) Basic authentication and authorization: TSI relies on the existing ability to use ssh-agent for automatic, non-interactive authentication. The problems are: How can all