ICAT Job Portal: a generic job submission system built on a scientific data catalog

Stephen M Fisher
Scientific Computing Department
Rutherford Appleton Laboratory
Didcot, OX11 0QX, UK
Email: dr.s.m.fisher@gmail.com

Kevin Phipps
Scientific Computing Department
Rutherford Appleton Laboratory
Didcot, OX11 0QX, UK
Email: kevin.phipps@stfc.ac.uk

Daniel J Rolfe
Central Laser Facility
Research Complex at Harwell
Rutherford Appleton Laboratory
Didcot, OX11 0QX, UK
Email: daniel.rolfe@stfc.ac.uk

Abstract—The value of metadata to the scientist is well known: with the right choice of metadata, data files can be selected very quickly without having to scan through huge volumes of data. The ICAT metadata catalog[1] (which is part of the ICAT project[2]) allows the scientist to store and query information about individual data files and sets of data files, as well as storing provenance information. This paper explains how a generic job management system, exposed as a web portal, has been built on top of ICAT. This gives the scientist easy access to a high-performance computing infrastructure without allowing the complexities of that infrastructure to impede progress. The aim was to build a job and data management portal capable of dealing with batch and interactive work that would be simple to use and that was based on tried and tested, scalable, and preferably open source technologies. For the team operating the portal, it needed to be generic and configurable enough that they could, without too much effort, modify their software to run within the portal, add new software, and create new dataset types and parameters. Modifications to existing software should be limited to saving and loading datasets in a slightly different way so that, instead of just being saved to disk, they are registered within the system and any provenance information is recorded.

I. INTRODUCTION

The ICAT Job Portal (IJP)[3] builds upon the tried and tested ICAT data catalog, an existing component written specifically to catalog datasets produced by scientific facilities. It uses ICAT as the central database component, which also provides authorization via a flexible rule-based system. This means that users will only be shown datasets readable by them, and any datasets produced whilst using the Job Portal will also be protected by relevant permissions.

While developing a prototype portal to meet the needs of one group, it became apparent that it could be made generic and configurable enough to be used by a wide range of teams within the scientific community.

A. ICAT the metadata catalog

ICAT is a data catalog specifically aimed at scientific facilities, into which data are stored based on the following hierarchy of entities: Facility, Investigation, Dataset and Datafile. The “Facility” produces the data for a group of users associated with an “Investigation”. Within the investigation, “Datafiles” are grouped into “Datasets”. Each entity has a small agreed set of attributes. To make the system extensible, parameter types can be defined and associated with one or more of the entity types. Actual parameters of those types can then be associated with the corresponding entities. For example, a parameter type could be defined for current measured in milliamps or elapsed time measured in seconds.

Further entities Application, Job, InputDataset and OutputDataset allow the provenance of datasets to be stored within the catalog, such that it is possible to trace a derived dataset back through a chain of applications and intermediate datasets to the original raw dataset.

ICAT is implemented as a SOAP-based web service using the mechanisms provided by the Java Persistence API (JPA) to connect to a relational database. ICAT has rule-based authorization and a powerful query language which is translated into the Java Persistence Query Language (JPQL).
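
The entity hierarchy, extensible parameters and provenance entities described in this subsection can be illustrated with a minimal Python sketch. It is not the ICAT schema or API: the class fields, the helper functions `attach` and `trace_to_raw`, and the dataset, file and application names are all hypothetical, chosen only to show how a derived dataset might be traced back to a raw one. The sketch also assumes the provenance chain is acyclic and, for brevity, follows only the first input of each job.

```python
from dataclasses import dataclass, field


@dataclass
class ParameterType:
    name: str              # e.g. "current"
    units: str             # e.g. "mA"
    applies_to: frozenset  # names of the entity types it may be attached to


@dataclass
class Parameter:
    type: ParameterType
    value: float


@dataclass
class Datafile:
    name: str
    parameters: list = field(default_factory=list)


@dataclass
class Dataset:
    name: str
    datafiles: list = field(default_factory=list)
    parameters: list = field(default_factory=list)


@dataclass
class Investigation:
    name: str
    datasets: list = field(default_factory=list)


@dataclass
class Facility:
    name: str
    investigations: list = field(default_factory=list)


@dataclass
class Job:
    application: str  # name of the Application that was run
    inputs: list      # datasets read (the InputDataset links)
    outputs: list     # datasets written (the OutputDataset links)


def attach(entity, ptype, value):
    """Attach a parameter, but only to an entity type the parameter type allows."""
    kind = type(entity).__name__
    if kind not in ptype.applies_to:
        raise ValueError(f"{ptype.name} cannot be attached to a {kind}")
    entity.parameters.append(Parameter(ptype, value))


def trace_to_raw(dataset, jobs):
    """Walk the provenance chain back from a derived dataset.

    Returns the originating dataset and the applications met on the way,
    most recent first. Only the first input of each job is followed.
    """
    applications = []
    current = dataset
    while True:
        producer = next(
            (j for j in jobs if any(o is current for o in j.outputs)), None
        )
        if producer is None:
            return current, applications
        applications.append(producer.application)
        if not producer.inputs:
            return current, applications
        current = producer.inputs[0]


# Hypothetical example data: one raw dataset and two processing steps.
current = ParameterType("current", "mA", frozenset({"Dataset", "Datafile"}))
raw = Dataset("raw", [Datafile("frame_001.dat")])
attach(raw, current, 12.5)
tracked = Dataset("tracked")
summary = Dataset("summary")
facility = Facility("Example Facility", [Investigation("inv-1", [raw, tracked, summary])])
jobs = [Job("tracker", [raw], [tracked]), Job("summarise", [tracked], [summary])]

origin, apps = trace_to_raw(summary, jobs)
print(origin.name, apps)  # raw ['summarise', 'tracker']
```

In ICAT itself these relationships live in a relational database and are reached through its query language, so a trace of this kind would be expressed as queries rather than as in-memory traversal.
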
The data files are not stored within ICAT itself, but within an ICAT Data Service (IDS)[4] as explained below.

B. IDS the ICAT Data Service

The IDS is a component, defined by its interface, that is able to store files and register their metadata in ICAT. It makes use of ICAT for authorization: if ICAT allows the file metadata to be written, then the IDS will allow the file to be written. Control of who can read follows the same pattern.

II. BACKGROUND

A. Use Case

This work was motivated by a request from the Lasers for Science Facility (LSF) of the UK Science and Technology Facilities Council (STFC) to help them with their data. The LSF operates the OCTOPUS imaging cluster[5], a central core of lasers coupled to a set of advanced interconnected microscopy stations that can be used to image samples from single molecules to whole cells and tissues. They had accumulated a large number of data files stored in a directory structure. They had both a range of applications to process and visualise that data[6] and an interactive program with an easy-to-use GUI that would scan through a selected part of the file system to collect information in memory about their data, then offer lists of raw datasets and lists of processed datasets