Exact and Approximate Area-proportional Circular Venn and Euler Diagrams

Leland Wilkinson

Abstract— Scientists conducting microarray and other experiments use circular Venn and Euler diagrams to analyze and illustrate their results. As one solution to this problem, this article introduces a statistical model for fitting area-proportional Venn and Euler diagrams to observed data. The statistical model outlined in this report includes a statistical loss function and a minimization procedure that enables formal estimation of the Venn/Euler area-proportional model for the first time. A significance test of the null hypothesis is computed for the solution. Residuals from the model are available for inspection. As a result, this algorithm can be used for both exploration and inference on real datasets. A Java program implementing this algorithm is available under the Mozilla Public License. An R function venneuler() is available as a package in CRAN and a plugin is available in Cytoscape.

Index Terms—visualization, bioinformatics, statistical graphics

1 INTRODUCTION

Venn diagrams are collections of n simple closed curves dividing the plane into 2^n nonempty connected regions uniquely representing all possible intersections of the interiors and exteriors of the curves [51]. The requirement that the curves be simple means that no more than two curves may intersect in a single point. The requirement that the curves be closed means that each curve may have no endpoints and each must completely enclose one or more regions. The requirement that the regions be nonempty means that their area must be greater than zero. The requirement that regions be connected means that there can be only one region resulting from the intersection of any two closed curves and that one curve may enclose only one region. Venn diagrams are most frequently used to represent sets; in these applications, there is a one-to-one mapping from set intersections to connected regions in the diagram.
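To make the 2^n-region definition concrete, the following minimal sketch counts how many elements fall into each region of a small example. It is an illustration only and is not taken from venneuler(): the function name region_counts, the gene identifiers, and the three lists are all hypothetical.

```python
from itertools import product


def region_counts(sets, universe=None):
    """Count the elements in each of the 2**n regions of an n-set Venn diagram.

    Each region is keyed by a tuple of booleans, one per set (in sorted
    name order): True means "inside that curve", False means "outside it".
    """
    names = sorted(sets)
    if universe is None:
        # Assumption: with no explicit universe, use the union of all sets,
        # so the exterior region (all False) always has count zero.
        universe = set().union(*sets.values())
    # Start every one of the 2**n regions at zero so that empty regions
    # are still listed.
    counts = {key: 0 for key in product((False, True), repeat=len(names))}
    for item in universe:
        key = tuple(item in sets[name] for name in names)
        counts[key] += 1
    return names, counts


# Hypothetical gene lists, for illustration only.
gene_lists = {
    "A": {"g1", "g2", "g3", "g4"},
    "B": {"g3", "g4", "g5"},
    "C": {"g4", "g6"},
}
names, counts = region_counts(gene_lists)
for key, count in sorted(counts.items()):
    label = " & ".join(n for n, inside in zip(names, key) if inside) or "(exterior)"
    print(f"{label}: {count}")
```

For three sets this lists 8 regions, including the exterior region outside all curves. Regions whose count is zero are the ones an Euler diagram is allowed to leave out, whereas a Venn diagram must still draw them.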
Although this definition does not restrict Venn diagrams to collections of circles, the popular form of these diagrams displayed in Venn’s original paper and in most applications today involves two or three intersecting circles of constant radius (circles are simple closed curves). Figure 3 shows an example.

Relaxing the restriction that all possible set intersections be represented and the restriction that curves be simple results in an Euler diagram [11]. Figure 7 shows an example. Ruskey [39] discusses various subclasses of the general definitions of Venn and Euler diagrams given here.

This paper involves Venn and Euler diagrams constructed from circles. There are some Venn and Euler diagrams that can be drawn with convex or non-convex polygons that cannot be drawn with circles, so this is a restriction. We add a further restriction in this paper, namely that the areas of polygon intersections be proportional to the cardinalities of intersections among the (finite) sets being represented by the diagram. We call these area-proportional Venn and Euler diagrams [5].

Venn and Euler diagrams have had wide use in teaching logic and probability. In almost all of these applications, their use has been confined to two or three circles of equal size. Venn diagrams based on circles do not exist for more than three circles [39]. Higher-order Venn and Euler diagrams can be drawn on the plane with convex or, in some cases, nonconvex polygons [10, 39].

• Leland Wilkinson is Executive VP of Systat Software Inc., Adjunct Professor of Statistics at Northwestern University and Adjunct Professor of Computer Science at University of Illinois at Chicago. E-mail: leland.wilkinson@systat.com

Recently, the microarray community has discovered a new use for these diagrams [22, 33, 31, 9]. To reveal overlaps in gene lists, researchers use Venn and Euler diagrams to locate genes induced or repressed above a user-defined threshold.
Consistencies across experiments are expected to yield large overlapping areas.

An informal survey of 72 Venn/Euler diagrams published in articles from the 2009 volumes of Science, Nature, and online affiliated journals shows these diagrams have several common features: 1) almost all of them (65/72) use circles instead of other convex or nonconvex curves or polygons, 2) many of them (32/72) make circle areas proportional to counts of elements represented by those areas, 3) most of them (50/72) involve three or more sets, and 4) almost all of them (70/72) represent data collected in a process that involves measurement error. Figure 1 shows examples from this survey (including popular types in the left column and rare types in the right).

This paper is an attempt to provide an algorithm, called venneuler(), that satisfies most of these needs. We use area-proportional circles to construct Venn and Euler diagrams and we build a statistical foundation that accommodates data involving measurement error. As we show through examples and simulations in Section 5,

• The venneuler() algorithm produces a circular Venn diagram when the data can be fit by a circular Venn diagram.
• The venneuler() algorithm produces an area-proportional circular Venn diagram when the data can be fit by an area-proportional circular Venn diagram.
• It produces an area-proportional circular Euler diagram when data can be fit by that model.
• It produces a statistically-justifiable approximation to an area-proportional circular Venn or Euler diagram when the data can be fit approximately by one of these models.

2 RELATED WORK

There have been two primary approaches to the drawing of Venn and Euler diagrams: axiomatic and heuristic. Axiomatic researchers begin with a formal definition (such as the definition of a Venn diagram given in the Introduction) and then devise algorithms for fulfilling the contract of the definition.
These approaches are accompanied by proofs that the algorithm cannot violate the terms of the definition. Heuristic researchers begin with a similar definition, but devise algorithms that produce pleasing diagrams that follow the definition closely, but not provably.

2.1 Axiomatic Approaches

Although axiomatic approaches are distinguished by proofs of correctness, they do vary in their definitions. Fish and Stapleton [13, 14], for example, suggest modifying the definition of an Euler diagram given