ICAT Job Portal: a generic job submission system built on a scientific data catalog

Stephen M Fisher
Scientific Computing Department
Rutherford Appleton Laboratory
Didcot, OX11 0QX, UK
Email: dr.s.m.fisher@gmail.com

Kevin Phipps
Scientific Computing Department
Rutherford Appleton Laboratory
Didcot, OX11 0QX, UK
Email: kevin.phipps@stfc.ac.uk

Daniel J Rolfe
Central Laser Facility
Research Complex at Harwell
Rutherford Appleton Laboratory
Didcot, OX11 0QX, UK
Email: daniel.rolfe@stfc.ac.uk

Abstract: The value of metadata to the scientist is well known: with the right choice of metadata, data files can be selected very quickly without having to scan through huge volumes of data. The ICAT metadata catalog[1] (which is part of the ICAT project[2]) allows the scientist to store and query information about individual data files and sets of data files, as well as to store provenance information. This paper explains how a generic job management system, exposed as a web portal, has been built on top of ICAT. This gives the scientist easy access to a high performance computing infrastructure without allowing the complexities of that infrastructure to impede progress. The aim was to build a job and data management portal capable of dealing with batch and interactive work that would be simple to use and that was based on tried and tested, scalable, and preferably open source technologies. For the team operating the portal, it needed to be generic and configurable enough that they could, without too much effort, modify their software to run within the portal, add new software, and create new dataset types and parameters. Modifications to existing software should be limited to saving and loading datasets in a slightly different way, so that instead of just being saved to disk they are registered within the system, together with any provenance information.

I. INTRODUCTION

The ICAT Job Portal (IJP)[3] builds upon the tried and tested ICAT data catalog, an existing component written specifically to catalog datasets produced by scientific facilities. The IJP uses ICAT as its central database component, and ICAT also provides authorization via a flexible rule-based system. This means that users will only be shown datasets readable by them, and any datasets produced whilst using the Job Portal will also be protected by relevant permissions.

While developing a prototype portal to meet the needs of one group, it became apparent that it could be made generic and configurable enough to be used by a wide range of teams within the scientific community.

A. ICAT the metadata catalog

ICAT is a data catalog specifically aimed at scientific facilities, into which data are stored based on the following hierarchy of entities: Facility, Investigation, Dataset and Datafile. The “Facility” produces the data for a group of users associated with an “Investigation”. Within the investigation, “Datafiles” are grouped into “Datasets”. Each entity has a small agreed set of attributes. To make the system extensible, parameter types can be defined and associated with one or more of the entity types. Actual parameters of those types can then be associated with the corresponding entities. For example, a parameter type could be defined for current measured in milliamps or elapsed time measured in seconds.

The further entities Application, Job, InputDataset and OutputDataset allow the provenance of datasets to be stored within the catalog, so that a derived dataset can be traced back through a chain of applications and intermediate datasets to the original raw dataset.

ICAT is implemented as a SOAP-based web service using the mechanisms provided by the Java Persistence API (JPA) to connect to a relational database. ICAT has rule-based authorization and a powerful query language, which is translated into the Java Persistence Query Language (JPQL).
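The provenance idea described above, in which jobs link input datasets to output datasets so that a derived dataset can be traced back to raw data, can be illustrated with a minimal Python sketch. This is not the ICAT schema or its API: the classes `Dataset` and `Job`, the helper `provenance_chain`, and the field names are all hypothetical and chosen only to mirror the entities named in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Dataset:
    """Hypothetical stand-in for a dataset: a named group of data files."""
    name: str
    datafiles: List[str] = field(default_factory=list)
    # The job that produced this dataset; None marks a raw dataset.
    produced_by: Optional["Job"] = None


@dataclass(eq=False)
class Job:
    """Hypothetical stand-in for one run of an application."""
    application: str
    inputs: List[Dataset] = field(default_factory=list)
    outputs: List[Dataset] = field(default_factory=list)

    def add_output(self, dataset: Dataset) -> None:
        # Record the link in both directions so it can be walked backwards.
        self.outputs.append(dataset)
        dataset.produced_by = self


def provenance_chain(dataset: Dataset) -> List[str]:
    """Walk back from a dataset through the jobs that produced it."""
    steps: List[str] = []
    seen = set()          # identities of datasets already visited
    stack = [dataset]
    while stack:
        current = stack.pop()
        if id(current) in seen:
            continue
        seen.add(id(current))
        job = current.produced_by
        if job is None:
            steps.append(f"{current.name} (raw)")
        else:
            steps.append(f"{current.name} <- {job.application}")
            stack.extend(job.inputs)
    return steps
```

For instance, if a hypothetical application "align" turns a raw dataset into an intermediate one and a second application "track" turns that into a final dataset, calling `provenance_chain` on the final dataset lists both steps and ends at the raw dataset. In ICAT itself this information lives in the catalog's relational database and is reached through its query language rather than through in-memory objects like these.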
The data files are not stored within ICAT itself, but are stored within an ICAT Data Service (IDS)[4] as explained below.

B. IDS the ICAT Data Service

The IDS is a component, defined by its interface, that is able to store files and register their metadata in ICAT. It makes use of ICAT for authorization: if ICAT allows the file metadata to be written then the IDS will allow the file to be written. Control of who can read follows the same pattern.

II. BACKGROUND

A. Use Case

This work was motivated by a request from the Lasers for Science Facility (LSF) of the UK Science and Technology Facilities Council (STFC) to help them with their data. The LSF operates the OCTOPUS imaging cluster[5], a central core of lasers coupled to a set of advanced interconnected microscopy stations that can be used to image samples from single molecules to whole cells and tissues. They had accumulated a large number of data files stored in a directory structure. They had both a range of applications to process and visualise that data[6] and an interactive program with an easy to use GUI that would scan through a selected part of the file system to collect information in memory about their data then offer lists of raw datasets and lists of processed datasets