MayBMS: A System for Managing Large Uncertain and Probabilistic Databases ∗

Christoph Koch
Department of Computer Science
Cornell University, Ithaca, NY
koch@cs.cornell.edu

Abstract

MayBMS is a state-of-the-art probabilistic database management system that has been built as an extension of Postgres, an open-source relational database management system. MayBMS follows a principled approach to leveraging the strengths of previous database research for achieving scalability. This article describes the main goals of this project, the design of its query and update language, efficient exact and approximate query processing, and algorithmic and systems aspects.

Acknowledgments. My collaborators on the MayBMS project are Dan Olteanu (Oxford University), Lyublena Antova (Cornell), Jiewen Huang (Oxford), and Michaela Goetz (Cornell). Thomas Jansen and Ali Baran Sari are alumni of the MayBMS team. I thank Dan Suciu for the inspirational talk he gave at a Dagstuhl seminar in February of 2005, which triggered my interest in probabilistic databases and the start of the project. I am also indebted to Joseph Halpern for insightful discussions. The project was previously supported by German Science Foundation (DFG) grant KO 3491/1-1 and by funding provided by the Center for Bioinformatics (ZBI) at Saarland University. It is currently supported by NSF grant IIS-0812272, a KDD grant, and a gift from Intel.

1 Introduction

Database systems for uncertain and probabilistic data promise to have many applications. Query processing on uncertain data occurs in the contexts of data warehousing, data integration, and the processing of data extracted from the Web. Data cleaning can be fruitfully approached as a problem of reducing uncertainty in data and requires the management and processing of large amounts of uncertain data. Decision support and diagnosis systems employ hypothetical (what-if) queries.
Scientific databases, which store outcomes of scientific experiments, frequently contain uncertain data such as incomplete observations or imprecise measurements. Sensor and RFID data is inherently uncertain. Applications in the contexts of fighting crime or terrorism, tracking moving objects, surveillance, and plagiarism detection essentially rely on techniques for processing and managing large uncertain datasets. Beyond that, many further potential applications of probabilistic databases exist and will manifest themselves once such systems become available.

∗ This article will appear as Chapter 6 of Charu Aggarwal, ed., Managing and Mining Uncertain Data, Springer-Verlag, 2008/9.