ORIGINAL RESEARCH PAPER

Robust information fusion in the DOHT paradigm for real-time action detection

Geoffrey Vaquette 1 · Catherine Achard 2 · Laurent Lucat 1

Received: 28 May 2016 / Accepted: 24 November 2016
© Springer-Verlag Berlin Heidelberg 2016

Abstract In the increasingly explored domain of action analysis, our work focuses on action detection, i.e., segmentation and classification, in the context of real applications. The Hough transform paradigm fits such applications well. In this paper, we extend the deeply optimized Hough transform paradigm to handle various feature types and to merge information provided by multiple sensors, e.g., RGB sensors, depth sensors and skeleton data. To this end, we propose and compare three fusion methods applied at different levels of the algorithm, one of which is robust to data losses and, thus, to sensor failure. We thoroughly study the influence of the merged features on the algorithm's accuracy. Finally, since we target real-time applications such as human interaction, we investigate the latency and computation time of the proposed method.

Keywords Hough transform · Action detection · Feature fusion · Activity detection · Deeply optimized Hough transform

1 Introduction

Action recognition is an increasingly explored field in computer science with many applications such as video surveillance, video games, augmented reality, automatic annotation, smart homes and health monitoring.

Our work focuses on real-life applications in the context of human action analysis. In this framework, input videos are not segmented with respect to actions. More and more methods are proposed to analyze human actions or activities, but most of them consider the problem of classifying short, pre-segmented videos, while only a few deal with long unsegmented videos. The latter task is more difficult since the algorithm needs to perform both segmentation and classification.
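For intuition, the Hough transform detection principle underlying this work can be viewed as temporal voting: local features observed at a given time cast weighted votes for candidate action positions, and detections are taken at maxima of the accumulated vote map; per-channel vote maps can then be combined by a weighted sum (late fusion), which naturally tolerates a missing channel. The following minimal sketch illustrates this idea only; the function names, the toy vote model and the fusion weights are assumptions for illustration, not the paper's DOHT implementation.

```python
import numpy as np

def hough_scores(features, displacements, weights, n_frames):
    """Toy temporal Hough voting: a feature of type f seen at frame t
    votes for candidate action centers t + d, for each learned
    displacement d, with a per-feature-type weight."""
    scores = np.zeros(n_frames)
    for t, f in features:
        for d in displacements[f]:
            c = t + d
            if 0 <= c < n_frames:
                scores[c] += weights[f]
    return scores

def late_fusion(channel_scores, channel_weights):
    """Combine per-channel vote maps by a weighted sum; a failed
    sensor's channel can simply be dropped from the lists."""
    return sum(w * s for w, s in zip(channel_weights, channel_scores))

# Toy example: two feature channels voting over a 10-frame sequence.
n = 10
rgb = hough_scores([(2, "a"), (3, "a")], {"a": [1, 2]}, {"a": 1.0}, n)
depth = hough_scores([(4, "b")], {"b": [0]}, {"b": 2.0}, n)
fused = late_fusion([rgb, depth], [0.5, 0.5])
peak = int(np.argmax(fused))  # frame with the strongest fused vote
```

Here both channels reinforce frame 4, so the fused vote map peaks there; dropping either channel still yields a usable (if weaker) vote map, which is the intuition behind fusion schemes that are robust to sensor failure.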
Thus, in this paper, we present an action detection (segmentation and recognition) algorithm.

In the context of a real application, multiple sensors are likely to be used to capture as much information as possible (e.g., Kinect sensors capture both RGB and depth information) or to cover the largest possible area. The number of sensors involved depends on the monitored area. Algorithms aimed at real-life applications have to handle a variable number of sensors and, ideally, to merge information coming from different feature types. Furthermore, since some extracted features can be unavailable at times (e.g., skeleton data extracted from classical depth sensors), algorithms are expected to profit from multi-sensor data when available while remaining robust to data failure.

In applications involving human interaction, such as robotics, low latency is a crucial issue. It involves both low computation time and low reaction time, i.e., the algorithm has to detect an action from as few frames as possible.

Geoffrey Vaquette
geoffrey.vaquette@cea.fr

Catherine Achard
catherine.achard@upmc.fr

Laurent Lucat
laurent.lucat@cea.fr

1 CEA, LIST, Vision and Content Engineering Laboratory, Point Courrier 173, 91191 Gif-sur-Yvette, France
2 UPMC Univ Paris 06, CNRS, UMR 7222, ISIR, Sorbonne University, 75005 Paris, France

J Real-Time Image Proc
DOI 10.1007/s11554-016-0660-5