Subset Removal On Massive Data with Dash

Jonathan A. Myers
Large Synoptic Survey Telescope
933 N. Cherry Ave
Tucson, AZ 85721 USA
jmyers@lsst.org

Mahidhar Tatineni
San Diego Supercomputing Center
9500 Gilman Dr, MC0505
La Jolla, CA 92092, USA
mahidhar@sdsc.edu

Robert S. Sinkovits
San Diego Supercomputing Center
9500 Gilman Dr., MC0505
La Jolla, CA 92092, USA
sinkovit@sdsc.edu

ABSTRACT

Ongoing efforts by the Large Synoptic Survey Telescope (LSST) involve the study of asteroid search algorithms and their performance on both real and simulated data. Images of the night sky reveal large numbers of events caused by the reflection of sunlight from asteroids. Detections from consecutive nights can then be grouped together into tracks that potentially represent small portions of the asteroids' sky-plane motion. The analysis of these tracks is extremely time-consuming, and there is strong interest in the development of techniques that can eliminate unnecessary tracks, thereby rendering the problem more manageable. One such approach is to collectively examine sets of tracks and discard those that are subsets of others. Our implementation of a subset removal algorithm has proven to be fast and accurate on modest-sized collections of tracks, but unfortunately has extremely large memory requirements for realistic data sets and cannot effectively use conventional high performance computing resources. We report our experience running the subset removal algorithm on the TeraGrid Appro Dash system, which uses the vSMP software developed by ScaleMP to aggregate memory from across multiple compute nodes to provide access to a large, logical shared memory space. Our results show that Dash is ideally suited for this algorithm and has performance comparable or superior to that obtained on specialized, heavily demanded, large-memory systems such as the SGI Altix UV.

Categories and Subject Descriptors

J.2 [PHYSICAL SCIENCES AND ENGINEERING]: Astronomy

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
TeraGrid '11, July 18-21, 2011, Salt Lake City, Utah, USA.
Copyright 2011 ACM 978-1-4503-0888-5/11/07...$10.00

1. MOTIVATION

1.1 Asteroid Search Simulation

In order to better understand the behavior of known asteroid search and discovery systems and algorithms, the Large Synoptic Survey Telescope (LSST) Data Management team has been applying those systems and algorithms to synthetic astronomical observations conducted with a simulated version of the LSST. For asteroid search purposes, these synthetic observations are generated using a realistic model of the solar system [3], a proposed survey cadence from the telescope [1], and realistic limits on the data-gathering abilities of the LSST [6].

The algorithms currently under investigation are based on generating sky-plane tracks [5]. A track is essentially a set of astronomical detections that follow a path in the sky consistent with some model of asteroid motion. Initial phases of processing use a greatly simplified model of asteroid motion in order to find initial sets of tracks with relatively low computational cost [4], and as a result the sets of tracks generated are very large, containing many tracks which incorrectly link sets of detections. Later in processing, more precise, but also more computationally costly, models of asteroid motion are used to filter out these incorrectly linked tracks and derive more precise approximations of underlying motion [2]. At the end of processing, tracks should have highly precise associated orbital paths [7] [9].

In order to better understand the behavior of these algorithms and their relationship to LSST's observational cadence (the schedule of observations of the sky) and imaging systems, we have been attempting to characterize the sets of tracks generated by the various stages of processing. Using this knowledge, we hope to adjust our models, filters, and algorithms to find less computationally costly methods of generating and processing these asteroid tracks.

1.2 Subset Tracks and their Identification

It is a known issue that, given certain patterns in source data, some algorithms generate tracks that are subsets of other tracks; that is, they link together a set of detections already linked by another, higher-cardinality track. This can create an artificial and unnecessary inflation of the set of tracks, leading to needlessly increased downstream cost.

Unfortunately, the prevalence of these subset tracks is not easily predicted. This makes exhaustive subset removal a necessary step in advancing our understanding of track generation algorithms.

2. SUBSET REMOVAL DESCRIPTION AND ALGORITHM

The abstract problem of subset removal is fairly straightforward. Given a set of tracks, we wish to find and remove those items that are subsets of sets already present in the collection. A naïve approach to subset removal would in-