The PETS04 Surveillance Ground-Truth Data Sets

Robert B. Fisher
School of Informatics, University of Edinburgh
rbf@inf.ed.ac.uk

Abstract

This paper summarizes the 28 video sequences available for result comparison in the PETS04 workshop. The sequences are from about 500 to 1400 frames in length, for a total of about 26500 frames. The sequences are annotated with both target position and activities by the CAVIAR research team members.

1. Introduction

This paper describes the video sequences used in the PETS04 workshop competition. The sequences are oriented about a public space surveillance task, and are ground truth labeled frame-by-frame with bounding boxes and also a semantic description of the activity in each frame. Altogether, there are 28 video sequences containing about 26500 labeled frames, grouped into 6 different activity scenaria.

The first group of videos was acquired at INRIA in July 2003. The sequences contained scripted activities by the research team members. The intended test scenaria are:

Scenario          Number of Sequences   Number of Frames
Walking                             3               3045
Browsing                            6               6665
Collapse                            4               4227
Leaving object                      5               5848
Meeting                             6               4135
Fighting                            4               2499
Total                              28              26419

However, almost all sequences also contained both an introductory activity by one of the researchers, as well as unscripted activity (usually walking or meetings by other employees at INRIA).

These sequences are publicly accessible at URL: homepages.inf.ed.ac.uk/rbf/CAVIARDATA1

1.1 Ground Truth Labeling

Based on the CAVIAR activity representation model, each video frame has been labeled with a set of ground truth descriptions. Each individual person was described by a bounding box (id, centre coordinates, width, height, orientation of main axis of individual), plus a description of his/her movement (inactive, active, walking, running). Individuals are only labeled once they start moving; otherwise they are effectively background.
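As a rough illustration of the per-individual record described above, the following Python sketch holds one labeled box and converts its centre/width/height form into corner coordinates. The class and field names (PersonBox, box_id, xc, yc) are hypothetical and are not taken from the published ground-truth files; the pixel and degree units are likewise assumptions, and the main-axis orientation is stored but ignored when computing the axis-aligned corners.

```python
from dataclasses import dataclass
from enum import Enum


class Movement(Enum):
    # The four movement labels the text lists for individuals.
    INACTIVE = "inactive"
    ACTIVE = "active"
    WALKING = "walking"
    RUNNING = "running"


@dataclass(frozen=True)
class PersonBox:
    # Hypothetical field names; the actual ground-truth files may differ.
    box_id: int
    xc: float           # centre x (pixels assumed)
    yc: float           # centre y (pixels assumed)
    width: float
    height: float
    orientation: float  # main-axis orientation (degrees assumed)
    movement: Movement

    def corners(self):
        """Return (left, top, right, bottom) of the axis-aligned box."""
        half_w, half_h = self.width / 2, self.height / 2
        return (self.xc - half_w, self.yc - half_h,
                self.xc + half_w, self.yc + half_h)


box = PersonBox(box_id=0, xc=100.0, yc=50.0, width=20.0, height=60.0,
                orientation=90.0, movement=Movement.WALKING)
print(box.corners())  # (90.0, 20.0, 110.0, 80.0)
```

Storing the centre form mirrors the description in the text; the corner form is often more convenient when comparing a tracker's output box against a ground-truth box.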
Based on the proposed semantics of the activity interpretation, each box is usually labeled with a role (fighter, browser, left victim, leaving group, walker, left object), is a participant in a situation (browsing, moving, inactive), which is a component of a scenario (Walking, Idleness, Browse, Collapse, Leaving object, Meeting, Fighting). Each box is labeled with some of the above labels in each frame.

The semantics of activity labeling were constrained by a finite-state model of the allowable behaviors. These are summarized in Section 2, which shows the allowable sequences of situations in a given scenario. In each scenario, the individual or group is observed in a sequence of situations determined by the finite state model for that scenario. When in a situation, the actor must fulfill a specific role linked to that situation. As well as the role, the ground truth labeling for the box has a qualitative assessment of the motion of the individual or group, i.e. whether they are running, walking, stationary but active (e.g. moving arms), or inactive.

Each video frame contains zero or more labeled individual or group boxes. The boxes are labeled with an identifier, which persists as long as the individual is visible. If a person disappears and then later reappears, then the individual obtains a new identity. If the person is obscured/occluded for only a few frames, then the same identity is maintained.

Similarly, groups of interacting individuals also are described by bounding boxes (id, centre coordinates, width, height, orientation of main axis of individual, list of component individual boxes), plus a description of the group’s movement (inactive, active, moving). Based on the pro-