PAL: Intelligence Augmentation using Egocentric Visual Context Detection

Mina Khan
MIT Media Lab, 75 Amherst St, Cambridge, MA, USA
minakhan01@gmail.com

Pattie Maes
MIT Media Lab, 75 Amherst St, Cambridge, MA, USA
pattie@media.mit.edu

Abstract

Egocentric visual context detection can support intelligence augmentation applications. We created a wearable system, called PAL, for personalized and privacy-preserving egocentric visual context detection. PAL has a wearable device with a camera, heart-rate sensor, on-device deep learning, and audio input/output, as well as a mobile/web application for personalized context labeling. We used on-device deep learning models for generic object and face detection, low-shot custom face and context recognition (e.g., activities like brushing teeth), and custom context clustering (e.g., indoor locations). The models achieved over 80% accuracy on in-the-wild contexts (~1000 images), and we tested PAL for intelligence augmentation applications like behavior change. We have made PAL open-source to further support intelligence augmentation using personalized and privacy-preserving egocentric visual contexts.

1. Introduction

Egocentric visual contexts have been useful in context-aware intelligence augmentation [9], e.g., memory augmentation and assistive technology [31, 8]. However, visual context tracking raises privacy concerns, especially when user data is sent to the cloud for deep learning, and deep learning models also have to be personalized for each user.

We created a wearable system, called PAL (Figure 1), for personalized and privacy-preserving egocentric visual context recognition using on-device, human-in-the-loop, and low-shot deep learning. PAL uses on-device deep learning for real-time and privacy-preserving processing of user data, and includes multimodal sensing, e.g., camera, heart-rate, physical activity, and geolocation [14].
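The paper does not reproduce its on-device models here, but the low-shot recognition idea (~10 training images per custom context, as described below) can be illustrated with a minimal sketch: embed each image with a pretrained backbone, average the embeddings into one centroid per context, and classify new frames by nearest centroid. All names and dimensions in this sketch are hypothetical, not PAL's actual implementation.

```python
import numpy as np


class LowShotContextRecognizer:
    """Nearest-centroid classifier over image embeddings.

    Illustrative sketch only: a real system would obtain embeddings
    from a pretrained on-device feature extractor (e.g., a
    MobileNet-style backbone); here embeddings are plain vectors.
    """

    def __init__(self, dim=128):
        self.dim = dim
        self.centroids = {}  # context label -> mean embedding

    def train(self, label, embeddings):
        # ~10 example embeddings per custom context suffice here,
        # since training is just averaging into a class centroid.
        self.centroids[label] = np.mean(embeddings, axis=0)

    def predict(self, embedding, threshold=1.0):
        # Return the closest known context, or None if no centroid
        # is within the distance threshold (unknown context).
        best, best_dist = None, float("inf")
        for label, centroid in self.centroids.items():
            dist = np.linalg.norm(embedding - centroid)
            if dist < best_dist:
                best, best_dist = label, dist
        return best if best_dist <= threshold else None
```

Because adding a context only adds a centroid, this formulation also supports continual learning: new contexts can be registered at any time without retraining the backbone.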
PAL also supports user input for human-in-the-loop training of personalized visual contexts. We used on-device models for generic object and face detection, personalized low-shot custom face and context recognition, and semi-supervised custom context clustering [19]. Compared to existing wearable systems, which use at least 100 training images per custom context [21] and do not use privacy-preserving on-device deep learning, PAL's on-device models for low-shot and continual learning use ~10 training images per context. PAL also uses active learning for context clustering so that users do not have to explicitly train each context.

We make three contributions: (i) a wearable device for privacy-preserving and personalized egocentric visual context detection using on-device and human-in-the-loop deep learning; (ii) a system for recognizing custom contexts, faces, and clusters using low-shot, custom-trainable, and active learning; (iii) real-world applications and evaluations.

2. Related Work

PAL is a wearable system for personalized and privacy-preserving egocentric visual context detection using on-device, human-in-the-loop, and low-shot deep learning. PAL also includes multimodal sensors and input/outputs. Existing wearable systems for egocentric visual context detection do not support multimodal sensing or on-device, human-in-the-loop, and low-shot deep learning as PAL does.

2.1. Wearable Cameras and Deep Learning

Wearable cameras are commonly used for intelligence augmentation applications [31, 8] and have even been combined with physiological sensors [11]. Deep learning has also been used with egocentric cameras, e.g., for predicting daily activities [3], eating recognition [2], visual assistance [26, 27], visual guides [30], and face recognition [5]. However, none of these use on-device deep learning, especially for personalized and privacy-preserving egocentric visual contexts.
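The active-learning idea above, where users are not asked to explicitly train each context, can be sketched as follows: cluster unlabeled frame embeddings, then ask the user to label only one representative frame per cluster, so a single label propagates to the whole cluster. This is a hypothetical illustration (simple k-means with deterministic farthest-point initialization), not PAL's actual clustering model [19].

```python
import numpy as np


def cluster_and_query(embeddings, k=3, iters=20):
    """Cluster frame embeddings and pick one query frame per cluster.

    Returns (assignments, representative_indices): each representative
    is the frame nearest its cluster centroid, i.e., the single frame
    the user would be asked to label in an active-learning loop.
    """
    # Deterministic farthest-point initialization of k centroids.
    centroids = [embeddings[0]]
    for _ in range(k - 1):
        dists = np.min(
            np.linalg.norm(embeddings[:, None] - np.array(centroids)[None], axis=2),
            axis=1,
        )
        centroids.append(embeddings[int(dists.argmax())])
    centroids = np.array(centroids, dtype=float)

    for _ in range(iters):
        # Assign each embedding to its nearest centroid.
        dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=2)
        assign = dists.argmin(axis=1)
        # Recompute centroids as cluster means (skip empty clusters).
        for j in range(k):
            members = embeddings[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)

    # One representative index per cluster: the most central frame.
    reps = [
        int(np.argmin(np.linalg.norm(embeddings - c, axis=1)))
        for c in centroids
    ]
    return assign, reps
```

With well-separated contexts (e.g., distinct indoor locations), labeling k representative frames is enough to name every cluster, which matches the paper's goal of minimizing explicit user training.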
On-device deep learning systems have also been used for computer vision [25], but they do not support personalized, low-shot, and human-in-the-loop visual context detection.

2.2. Privacy-preserving Deep Learning

Privacy-preserving approaches include privacy-preserving collaborative deep learning for human activity recognition [24] and image distortion or modification [7, 1]. However, none of these systems use on-device deep learning to avoid sending data to the cloud for processing.

arXiv:2105.10735v1 [cs.CV] 22 May 2021