Unconstrained ego-centric videos with eye-tracking data

Keng-Teck Ma, Rosary Lim, Peilun Dai, Liyuan Li and Joo-Hwee Lim
Institute for Infocomm Research, A*STAR, Singapore
{makt, rosary-lim, daip, lyli, joohwee}@i2r.a-star.edu.sg

Abstract

We present the first eye-tracking dataset for unconstrained ego-centric videos. The dataset captures over 6 hours of subjects performing common daily activities. These activities are manually annotated as socializing, walking, object manipulation, transiting and observing.

1. Introduction

Computer-based scene understanding systems process image sequences frame by frame, and pixel by pixel within each frame, aiming to aggregate pixels into coherent regions (e.g. segmentation) for meaningful interpretation (e.g. object recognition). Is this exhaustive approach a good way to solve ill-posed visual perception and cognition problems? The human visual system is driven by visual attention: eye movements select the relevant areas of the scene image to be focused on (by the fovea) and processed, while the peripheral visual field retains a broad picture as summary statistics. Why can't we develop a saccade-based visual information processing approach that is both more natural and more efficient? Do we have proper datasets and evaluation metrics to study and benchmark this type of research?

Although bottom-up saliency-based attention helps to anchor visual fixation and has been an active area of research for many years, more often than not, task-based top-down visual attention and contextual priming play a more important role in directing our visual computational resources to accomplish our activities [4, 6, 10, 11].

We aim to facilitate the saccade-based visual information processing approach to scene understanding by creating an unconstrained ego-centric video dataset with eye-tracking data for such research purposes. The video and eye-tracking data are recorded while participants are engaged in daily activities (e.g. socializing, commuting and object manipulation) in unconstrained settings. The unconstrained settings are similar to life-logging in ego-centric video research.

This setup differs from existing ego-centric eye-tracking video datasets, which were recorded in controlled environments [1, 5, 12]. To the best of our knowledge, this is the first unconstrained ego-centric video dataset with eye-tracking information.

2. Unconstrained dataset

Six participants, 4 males and 2 females, were recruited from a population of graduate students and office workers. Their ages range from 23 to 31 years. All participants have normal eyesight or wear contact lenses.

We used the SMI Eye-Tracking Glasses (ETG) version 1. It records video at 24 frames per second, and gaze is sampled at 30 Hz. The resolution of the front-facing camera is 1280x960, and its field of view spans 60 degrees of visual angle. The eye-tracker was calibrated with the 3-point calibration for each recording session.

Participants were instructed to wear the mobile eye-trackers whenever it was convenient for them. There was no instruction on the type of activities they should participate in, except that they should avoid sports, due to the risk of damaging the equipment, and avoid driving, due to the limited field of view. They were further instructed to record at least 10 minutes of data for each session.

Fixations were extracted from the gaze samples with the vendor's software (BeGaze).
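The exact fixation filter used by BeGaze is proprietary. For readers who wish to re-extract fixations from the raw gaze samples, a common alternative is a dispersion-threshold (I-DT) detector; the Python sketch below illustrates the idea. It is not the vendor's algorithm, and the thresholds are illustrative assumptions rather than values tuned to this dataset.

# A sketch of the common dispersion-threshold (I-DT) idea; not BeGaze's
# exact algorithm. Thresholds are illustrative, not tuned to this dataset.

def idt_fixations(samples, max_dispersion=0.02, min_duration=0.1):
    """Detect fixations from (t, x, y) gaze samples, sorted by time.

    A window of consecutive samples counts as a fixation if it spans at
    least min_duration seconds and its spatial dispersion
    (x-range + y-range) stays below max_dispersion.
    Returns (t_start, t_end, centroid_x, centroid_y) tuples.
    """
    fixations, i = [], 0
    while i < len(samples):
        j = i
        # Grow the window while its dispersion stays under threshold.
        while j + 1 < len(samples):
            window = samples[i:j + 2]
            xs = [s[1] for s in window]
            ys = [s[2] for s in window]
            if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_dispersion:
                break
            j += 1
        window = samples[i:j + 1]
        if window[-1][0] - window[0][0] >= min_duration:
            xs = [s[1] for s in window]
            ys = [s[2] for s in window]
            fixations.append((window[0][0], window[-1][0],
                              sum(xs) / len(xs), sum(ys) / len(ys)))
            i = j + 1
        else:
            i += 1
    return fixations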
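A related practical detail for users of the dataset: gaze is sampled at 30 Hz while the scene video runs at 24 frames per second, so the two streams must be aligned by timestamp. A minimal Python sketch follows; the CSV layout and the field names ('timestamp', 'x', 'y') are assumptions for illustration, as the actual export format depends on the vendor software's settings.

# A minimal sketch of aligning 30 Hz gaze samples to 24 fps video frames.
# The CSV layout and field names are assumptions for illustration.

import csv

FPS = 24.0  # frame rate of the front-facing scene camera

def gaze_per_frame(gaze_csv_path):
    """Group gaze samples by the video frame shown when they were recorded.

    Expects rows with a 'timestamp' (seconds) and gaze coordinates
    'x', 'y' (pixel or normalized units, depending on the export).
    Returns {frame_index: [(x, y), ...]}.
    """
    frames = {}
    with open(gaze_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            t = float(row["timestamp"])
            frame_idx = int(t * FPS)  # index of the frame shown at time t
            xy = (float(row["x"]), float(row["y"]))
            frames.setdefault(frame_idx, []).append(xy)
    return frames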
3. Annotations

The dataset was annotated by 3 volunteers with the following overall information: time of day, place (e.g. home, office, subway), indoor/outdoor, and a short description.

For each manually selected video segment, the annotators also provided a short description and assigned one or more of the following activities: Social, Walk, Object, Transit, Observe. Social refers to socializing activities such as talking, listening and meetings. Walk refers to self-locomotive activities such as walking and running. Object refers to activities in which the hands are used to manipulate objects, such as packing, holding and assembling. Transit refers to activities on moving platforms such as elevators and escalators. Observe refers to passive viewing, such as scenery viewing.
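We do not prescribe a file format for these annotations. Purely as an illustration of the label set described above, the annotations for one recording could be represented as in the Python sketch below; all class and field names are hypothetical, not part of the released dataset.

# Hypothetical in-memory representation of the annotation scheme above.
# Class and field names are illustrative assumptions.

from dataclasses import dataclass, field
from typing import List

ACTIVITIES = {"Social", "Walk", "Object", "Transit", "Observe"}

@dataclass
class SegmentAnnotation:
    """One manually selected video segment and its labels."""
    start_s: float                 # segment start within the recording (s)
    end_s: float                   # segment end (s)
    description: str               # short free-text description
    activities: List[str] = field(default_factory=list)  # subset of ACTIVITIES

    def __post_init__(self):
        unknown = set(self.activities) - ACTIVITIES
        if unknown:
            raise ValueError(f"unknown activity labels: {unknown}")

@dataclass
class RecordingAnnotation:
    """Overall information for one recording session."""
    time_of_day: str               # e.g. "morning"
    place: str                     # e.g. "home", "office", "subway"
    indoor: bool                   # indoor (True) or outdoor (False)
    description: str               # short free-text description
    segments: List[SegmentAnnotation] = field(default_factory=list)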