MD vs. LSA for constructing a register space: Are hand-coded features needed, or is bag-of-words enough?

BROOKE Julian, HIRST Graeme
Department of Computer Science, University of Toronto
{jbrooke, gh}@cs.toronto.edu

The bottom-up approach to defining register variation, typified by the multidimensional (MD) approach of Biber (Biber 1988; Biber & Conrad 2009), usually takes as a starting point a relatively small set of features which are known (or appear) to reflect key differences across registers. Since these features are often tied to the particular morphosyntactic properties of a language, a modified set of features must be derived for cross-linguistic analysis (Biber 1995); an obvious example is the fact that word length may vary greatly in its usefulness across languages (e.g. it might be problematic in agglutinative languages). An alternative approach is suggested by models such as latent semantic analysis (LSA) (Landauer & Dumais 1997) and other kinds of topic models (Blei et al. 2003). Both MD and LSA can be used to find latent dimensions of variation, and in fact are based on similar (though distinct) mathematical models, but LSA takes a bag-of-words approach, i.e. all lexical items are included as features which determine the variation in the document space. Models such as LSA are typically applied to variation in semantics (or topic), but there is no a priori reason they should not also capture register variation. In this paper, we explore the extent to which they do, contrasting the latent space created by (bag-of-words) LSA with one created using more typical register features.

We begin with the two statistical models that underlie the methods we are comparing. Principal components analysis (PCA), which is the basis for LSA, makes somewhat different assumptions than factor analysis (the basis of MD), and there is some argument about which is to be preferred (Fabrigar et al.
1999), but we show that in our case the basic results of the two approaches are similar, and we choose to use PCA for the remainder of our experiments.

Our main comparison is between feature sets. We take a collection of features typically used in multidimensional analysis, including textual statistics, part-of-speech counts, and word types from Quirk et al. (1985); in short, we include as many features from Biber (1988) as is feasible, including those that require part-of-speech tagging. For the LSA approach, we simply take all words, with no filtering and no bucketing of word types whatsoever. We carry out our analysis on two well-known mixed-register corpora: the Brown corpus (Francis & Kučera 1982) and the much larger British National Corpus (Burnard 2000). The texts in these corpora are tagged for genre (register); following Biber, we can learn about the "spaces" created by our models by looking at where the texts of various genres are "found". If the spaces are similar, we expect the genres to be similarly located along the spectrum corresponding to each dimension. However, the linear algebra underlying PCA provides another, even more precise way to determine whether two spaces are the same: any new feature can be transformed so that it appears in the new space (without altering the dimensions of the space itself). Thus, we can quantify the difference (or similarity) of the spaces by transforming the old register features into the new space and directly comparing the resulting vectors with those in the old space (i.e. those created by the variation of the features themselves). Thus in this work
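The bag-of-words construction described above can be sketched in a few lines: LSA amounts to a truncated singular value decomposition of a (here, column-centered) term-document matrix, with the top components serving as the latent dimensions. The toy corpus, vocabulary size, and choice of k below are illustrative assumptions, not the actual setup of our experiments.

```python
# Minimal LSA sketch: truncated SVD of a toy term-document count matrix.
# Rows = documents, columns = word types (all lexical items, no filtering).
import numpy as np

X = np.array([
    [3, 1, 0, 0, 2, 1],
    [2, 2, 1, 0, 1, 0],
    [0, 0, 3, 2, 0, 1],
    [0, 1, 2, 3, 0, 0],
], dtype=float)

# Center each column, as in PCA, then take the SVD
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                         # keep the top-k latent dimensions
doc_space = U[:, :k] * s[:k]  # document coordinates in the latent space
print(doc_space.shape)        # -> (4, 2)
```

Each row of `doc_space` locates one document along the latent dimensions; with real genre-tagged corpora, these are the coordinates one inspects to see where the texts of each genre are "found".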
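The projection step can be made concrete as follows. A feature not used to build the space (a "supplementary" variable, in PCA terminology) can be placed in that space by correlating it with the document scores on each component; comparing the resulting loading vector with the feature's loadings in its original space then quantifies how similar the two spaces are. The random data and the noisy-copy feature below are purely illustrative assumptions for the sketch.

```python
# Sketch: project an external feature into an existing PCA space without
# refitting, by correlating it with the per-component document scores.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))       # 50 documents x 8 original features
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                     # document scores on each component

# A "new" feature measured on the same documents; here, a noisy copy of
# feature 0, standing in for an old register feature
f = X[:, 0] + 0.1 * rng.normal(size=50)
fc = f - f.mean()

# Loadings of the new feature = correlations with component scores
loadings = np.array([np.corrcoef(fc, scores[:, k])[0, 1]
                     for k in range(scores.shape[1])])

# Loadings of the original feature, for direct comparison
orig = np.array([np.corrcoef(Xc[:, 0], scores[:, k])[0, 1]
                 for k in range(scores.shape[1])])

# Cosine similarity of the two loading vectors: near 1 means the feature
# occupies essentially the same position in the space
cos = loadings @ orig / (np.linalg.norm(loadings) * np.linalg.norm(orig))
print(cos)
```

Because the supplementary projection never alters the fitted components, this comparison isolates differences between the spaces themselves rather than differences in how they were estimated.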