Using electronic corpora in historical dialectology research: the problem of document length variation

Hermann Moisl
University of Newcastle upon Tyne

Introduction

The proliferation of computational technology has generated an explosive production of electronically encoded information of all kinds. In the face of this, traditional philological methods for search and interpretation of data have been overwhelmed by volume, and a variety of computational methods have been developed in an attempt to make the deluge tractable. These developments have clear implications for corpus-based linguistics in general, and for corpus-based study of historical dialectology in particular: as more and larger historical text corpora become available, effective analysis of them will increasingly be tractable only by adapting the interpretative methods developed by the statistical (Hair et al. 2005; Tabachnik & Fidell 2006), information retrieval (Belew 2000; Grossman & Frieder 2004), pattern recognition (Bishop 2006), and related communities.

To use such analytical methods effectively, however, issues that arise with respect to the abstraction of data from corpora have to be understood. This paper addresses an issue that has a fundamental bearing on the validity of analytical results based on such data: variation in document length. The discussion is in four main parts. The first part shows how a particular class of computational methods, exploratory multivariate analysis, can be used in historical dialectology research, the second explains why variation in document length can be a problem in such analysis, the third proposes document length normalization as a solution to that problem, and the fourth points out some difficulties associated with document length normalization.