MP3: A Unified Model to Map, Perceive, Predict and Plan

Sergio Casas*¹,², Abbas Sadat*¹, Raquel Urtasun¹,²
¹Uber ATG, ²University of Toronto
{sergio, urtasun}@cs.toronto.edu, abbas.sadat@gmail.com

Abstract

High-definition maps (HD maps) are a key component of most modern self-driving systems due to their valuable semantic and geometric information. Unfortunately, building HD maps has proven hard to scale due to their cost as well as the requirements they impose on the localization system, which has to work everywhere with centimeter-level accuracy. Being able to drive without an HD map would be very beneficial to scale self-driving solutions as well as to increase the failure tolerance of existing ones (e.g., if localization fails or the map is not up-to-date). Towards this goal, we propose MP3, an end-to-end approach to mapless¹ driving where the input is raw sensor data and a high-level command (e.g., turn left at the intersection). MP3 predicts intermediate representations in the form of an online map and the current and future state of dynamic agents, and exploits them in a novel neural motion planner to make interpretable decisions taking into account uncertainty. We show that our approach is significantly safer, more comfortable, and can follow commands better than the baselines in challenging long-term closed-loop simulations, as well as when compared to an expert driver in a large-scale real-world dataset.

1. Introduction

Most modern self-driving stacks require up-to-date high-definition (HD) maps that contain rich semantic information necessary for driving, such as the topology and location of the lanes, crosswalks, traffic lights, and intersections, as well as the traffic rules for each lane (e.g., unprotected left, right turn on red, maximum speed).
These maps are a great source of knowledge that simplifies the perception and motion forecasting tasks, as the online inference process has to mainly focus on dynamic objects (e.g., vehicles, pedestrians, cyclists). Furthermore, the use of HD maps significantly increases the safety of motion planning, as knowing the lane topology and geometry eases the generation of potential trajectories for the ego-vehicle that adhere to the traffic rules. In addition, progressing towards a specific goal is much simpler when the desired route is defined as a sequence of lanes to traverse.

* Denotes equal contribution.
¹ We note that by mapless we mean without HD maps. A coarse road network like the ones available in off-the-shelf services such as Google Maps or OpenStreetMap is assumed available for routing towards the goal.

Figure 1: Left: a localization error makes the SDV follow a wrong route when using an HD map, driving into traffic. Right: mapless driving can interpret the scene from sensors and achieve a safe plan that follows a high-level command. (Panel titles: "Driving with an HD map", "Mapless driving"; command shown: "Turn Right".)

Unfortunately, building HD maps has proven hard to scale due to the complexity and cost of generating the maps and maintaining them. Furthermore, the heavy reliance on HD maps introduces very demanding requirements for the localization system, which needs to work at all times with centimeter-level accuracy, or else unsafe situations like Fig. 1 (left) might arise. This motivates the development of mapless technology, which can serve as a fail-safe in the case of localization failures or outdated maps, and potentially unlock self-driving at scale at a much lower cost.

Self-driving without HD maps is a very challenging task. Perception can no longer rely on the prior that vehicles are more likely to be found on the road and pedestrians on the sidewalk.
Motion forecasting of dynamic objects becomes even more challenging without having access to the lanes that vehicles typically follow or the location of crosswalks for pedestrians. Most importantly, the search space to plan a safe maneuver for the SDV goes from narrow envelopes around the lane center lines [1, 45, 46, 50] to the full set of dynamically feasible trajectories as depicted in Fig. 1 (right). Moreover, without a well-defined route as a series of lanes to follow, the goal that the SDV is trying to reach needs to be abstracted into high-level behaviors such as going straight at an intersection, turning left or turning right [11], which require taking