Collecting Historical Font Metrics from Google Books

Robert LiVolsi, Richard Zanibbi, and Charles Bigelow
Rochester Institute of Technology
rjl3050@rit.edu, rlaz@cs.rit.edu, cabppr@rit.edu

Abstract

A system is presented for extracting key metrics from fonts used in historical documents. The system identifies important landmarks on a page, such as margins, paragraphs, and lines, and applies frequency analysis techniques to identify relevant sizes. The system was validated by comparing its measurements to those of a human expert on randomly selected samples, and differed from the expert by less than 5% on average for the x-height, body size, and line spacing metrics.

1. Introduction

Character size is a major determinant of the legibility of text and has been studied from several disciplinary perspectives, including psychophysics [3], typographic history [12], and combinations of the two [4]. The current migration of reading from print to digital display raises questions of optimal character size, which analysis of font sizes in historical books may help answer.

Measurement of type sizes in historical and modern printed books has relied mainly on visual determinations made with optical or digital magnifiers, but such studies have usually been limited to a few hundred books or, more rarely, a few thousand [4, 12]. The recent digitization of large numbers of books dating back to the early era of European printing, e.g. Google Books, makes possible the automatic metrical analysis of type sizes in many thousands of books, and potentially many millions.

A pioneering culturomic study of five million books by Michel et al. [6] analyzes centuries of lexical and grammatical usage to identify cultural trends. Reading is a visual activity as well as a symbolic one, and in this preliminary study we focus on form rather than content, analyzing quantitative features of typographical elements to better understand trends in visual size optimization over centuries of typographic literacy.

This study determines metrics of x-height, body size, and line spacing, in accord with standard typographic-historical studies [12] and typical font usage. The x-height is defined as the distance from the text baseline (the imaginary horizontal line on which letters sit) to the x-line (the imaginary horizontal line tangent to the top of the lower-case x), and is the major determinant of the perceived size of text [4]. Lower-case letters with neither ascenders nor descenders, such as a, c, e, n, o, v, and x, are x-height, and are also referred to as minims. Body size is defined as the distance between the descender line (the imaginary line tangent to the bottoms of the descending strokes) and the ascender line (the imaginary line tangent to the tops of the ascending strokes), and is the standard metric for identifying font sizes. Lastly, line spacing is defined as the distance between subsequent baselines. These metrics rank among the most important factors influencing print cost as well as text legibility.
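To make these three definitions concrete, the following sketch computes them from per-line landmark coordinates. It is an illustration only: the TextLineLandmarks structure and its field names are ours, not part of the system described here.

```python
from dataclasses import dataclass

@dataclass
class TextLineLandmarks:
    """Vertical landmarks for one text line, in pixels measured
    downward from the top of the page (illustrative names)."""
    ascender_y: float   # tangent to the tops of ascending strokes
    x_line_y: float     # tangent to the top of the lower-case x
    baseline_y: float   # the line on which the letters sit
    descender_y: float  # tangent to the bottoms of descending strokes

def font_metrics(line, next_line):
    """x-height, body size, and line spacing, per the definitions above."""
    x_height = line.baseline_y - line.x_line_y
    body_size = line.descender_y - line.ascender_y
    line_spacing = next_line.baseline_y - line.baseline_y
    return x_height, body_size, line_spacing

line1 = TextLineLandmarks(100.0, 112.0, 130.0, 138.0)
line2 = TextLineLandmarks(142.0, 154.0, 172.0, 180.0)
print(font_metrics(line1, line2))  # (18.0, 38.0, 42.0)
```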
The main objective of this paper is to present the system developed for this task and to assess whether it can identify dominant font metrics reliably when compared with a human expert. Surprisingly, while estimating font metrics [13] and identifying ascender, descender, and minim characters (e.g., for word shape coding [11]) is pervasive in document image analysis [7], we have been unable to locate references concerned with compiling font metric statistics for their own sake. In our approach, we randomly sample pages from a book and use the largest detected paragraph on each page to estimate font metrics. Our system also needs to be fast: we wish to collect metrics from thousands, even millions, of books.

Collecting font metrics from historical documents is challenging, as pages are often skewed and/or warped, and noisy due to ink spread, bleed-through of ink from the opposite side of a page, and dirt and damage accumulated from use over time. Before estimating metrics for the dominant font on a page, we go through the following steps (steps 1 and 2 are sketched below): 1) deskew the page using a Hough transform, 2) segment text lines through Fourier analysis of vertical pixel projections, 3) merge text lines into paragraphs, selecting the largest paragraph, 4) re-estimate
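The following is a minimal sketch of steps 1 and 2, assuming a binarized page image (text pixels set to 255) and the OpenCV and NumPy libraries; the Hough threshold and the five-degree angle window are our own guesses, not parameters reported for the system.

```python
import numpy as np
import cv2  # OpenCV, assumed available

def deskew(binary):
    """Step 1 (sketch): estimate skew from near-horizontal Hough lines
    and rotate the page to correct it."""
    lines = cv2.HoughLines(binary, 1, np.pi / 180, threshold=200)  # illustrative threshold
    if lines is None:
        return binary
    # OpenCV measures theta from the vertical axis, so horizontal text
    # lines have theta near pi/2; keep angles within +/-5 degrees of that.
    angles = [theta - np.pi / 2 for rho, theta in lines[:, 0]
              if abs(theta - np.pi / 2) < np.deg2rad(5)]
    if not angles:
        return binary
    h, w = binary.shape
    rot = cv2.getRotationMatrix2D((w / 2, h / 2),
                                  np.degrees(np.median(angles)), 1.0)
    return cv2.warpAffine(binary, rot, (w, h))

def dominant_line_spacing(binary):
    """Step 2 (sketch, in part): recover the dominant baseline-to-baseline
    period from the Fourier spectrum of the vertical projection profile."""
    profile = binary.sum(axis=1).astype(float)  # ink per pixel row
    profile -= profile.mean()                   # remove the DC component
    spectrum = np.abs(np.fft.rfft(profile))
    freqs = np.fft.rfftfreq(len(profile))
    k = spectrum[1:].argmax() + 1               # strongest periodic peak
    return 1.0 / freqs[k]                       # line spacing in pixels
```

Estimating the spacing in the frequency domain keys on the dominant periodicity of the whole projection profile rather than on individual inter-line gaps, which helps with the noise and damage typical of historical pages.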