Colorless Green Ideas Sleep Furiously Revisited: A Statistical Perspective

Florencia Reali (fr34@cornell.edu)
Rick Dale (rad28@cornell.edu)
Morten H. Christiansen (mhc27@cornell.edu)
Department of Psychology, Cornell University, Ithaca, NY 14853 USA

Abstract

In the present study we provide empirical evidence that human learners succeed in an artificial-grammar learning task that involves recognizing grammatical sequences whose bigram frequencies in the training corpus are zero. This result demands explanation: whatever strategy is being used to perform the task, it cannot rely on the simple co-occurrence of elements in the training corpus. While rule-based mechanisms may offer an account, we propose that a statistical learning mechanism can capture these behavioral results. A simple recurrent network is shown to learn sequences containing null-probability bigram information by relying solely on distributional information in a training corpus. The present results offer a simple but stark challenge to previous objections to statistical learning approaches to language acquisition that are based on the sparseness of the primary linguistic data.

Introduction

The importance of statistical structure in language learning and processing has been a matter of intense debate. Initial data-driven empirical approaches embraced the idea that word co-occurrences are important sources of information in language processing (e.g., Harris, 1951). This approach fell out of favor in the 1950s, in part due to the influential work of Noam Chomsky (1957), who argued that language behavior should be analyzed at a much deeper level than its surface statistics. In one of his most famous examples, he pointed out that it is reasonable to assume that neither sentence (1) nor sentence (2) has ever occurred, and yet (1), though nonsensical, is grammatical, while (2) is not.

(1) Colorless green ideas sleep furiously
(2) Furiously sleep ideas green colorless

A common argument against statistical approaches to language is therefore that there are sentences containing low- or zero-probability sequences of words that can nonetheless be judged as grammatical. As Chomsky remarked, “… we are forced to conclude that … probabilistic models give no particular insight into some of the basic problems of syntactic structure” (Chomsky, 1957, p. 17). Most theoretical linguists have accepted this argument, developing little interest in the role of statistical approaches to language.

Recently there has been a reappraisal of statistical approaches, partly motivated by research indicating that distributional regularities may provide an important source of information for bootstrapping syntax (e.g., Redington, Chater & Finch, 1998; Mintz, 2002), especially when integrated with prosodic or phonological information (e.g., Morgan, Meier & Newport, 1987; Monaghan, Chater & Christiansen, in press). Moreover, statistical approaches have been supported by recent research demonstrating that young infants are sensitive to the statistical information inherent in bigram transitional probabilities (e.g., Saffran, Aslin & Newport, 1996; for a review, see Gómez & Gerken, 2000). These studies demonstrate that at least some learning mechanisms employed by infants are statistical in nature. However, as suggested by the perceived grammaticality of sentences like (1), human learning capacities certainly need to go beyond the information conveyed by item co-occurrences.
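To make the statistical point concrete, consider a minimal sketch of a maximum-likelihood bigram model in Python. The toy corpus and tokenization below are illustrative assumptions, not materials from our study; the sketch simply shows how a single unattested bigram forces the score of a grammatical sentence like (1) to zero.

```python
from collections import Counter

# Toy corpus (purely illustrative; assumed for this sketch).
corpus = [
    "green ideas are popular",
    "colorless liquids sleep in jars",
    "people sleep furiously at deadlines",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def transitional_probability(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1); zero for unseen bigrams."""
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

def bigram_score(sentence):
    """Product of transitional probabilities over adjacent word pairs."""
    words = sentence.split()
    score = 1.0
    for w1, w2 in zip(words, words[1:]):
        score *= transitional_probability(w1, w2)
    return score

# "green ideas" and "sleep furiously" are attested in the toy corpus, but
# "colorless green" and "ideas sleep" are not, so the product collapses to 0.
print(bigram_score("colorless green ideas sleep furiously"))  # -> 0.0
```

A model that scores sentences solely by multiplying raw bigram probabilities is thus bound to reject (1); the question is whether statistical learning more broadly is similarly limited.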
In the present study we explore the extent to which humans are capable of learning the regularities of an artificial grammar and generalizing them to new sentences in which transitional probabilities are completely uninformative. The task involves “discovering” the underlying regularities and using them to recognize sequences whose bigram transitions are completely novel. We find that humans perform well on this task.

Two explanations could account for these results. First, as previously suggested (Marcus, Vijayan, Bandi Rao & Vishton, 1999), humans may possess at least two learning mechanisms: one for learning statistical information and another for learning “algebraic” rules. On this view, regardless of the available statistics, we could rely on open-ended abstract relationships into which arbitrary items can be substituted. In an artificial-grammar learning scenario, we could acquire the structure or rules underlying a grammar and instantiate its variables with specific items by mechanisms independent of the surface statistics. Such a rule-based mechanism could therefore account for our ability to generalize successfully to sequences with uninformative bigram probabilities. We suggest, however, that a second account is equally plausible: this generalization can be explained on the basis of distributional learning alone. In the second part of this paper, we show that a simple connectionist model, trained purely on distributional information, is capable of simulating correct grammaticality judgments of test sentences that comprise bigram transitions absent from the training corpus. These
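To preview the kind of model at issue, the sketch below implements an Elman-style simple recurrent network trained on next-word prediction. The vocabulary, hidden-layer size, learning rate, and depth-1 gradient truncation are illustrative assumptions, not the configuration used in our simulations.

```python
import numpy as np

# Elman-style simple recurrent network (SRN) for next-word prediction.
# All hyperparameters here are illustrative assumptions.
rng = np.random.default_rng(0)

vocab = ["the", "cat", "dog", "sees", "chases", "#"]  # '#' = sentence boundary
word_to_id = {w: i for i, w in enumerate(vocab)}
V, H = len(vocab), 10

W_xh = rng.normal(0.0, 0.1, (H, V))  # input -> hidden
W_hh = rng.normal(0.0, 0.1, (H, H))  # context (previous hidden) -> hidden
W_hy = rng.normal(0.0, 0.1, (V, H))  # hidden -> output

def one_hot(i):
    v = np.zeros(V)
    v[i] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(h_prev, x):
    """One SRN time step: new hidden state and next-word distribution."""
    h = np.tanh(W_xh @ x + W_hh @ h_prev)
    return h, softmax(W_hy @ h)

sentences = [["the", "cat", "sees", "the", "dog", "#"],
             ["the", "dog", "chases", "the", "cat", "#"]]

lr = 0.1
for epoch in range(500):
    for words in sentences:
        h_prev = np.zeros(H)
        for w, w_next in zip(words, words[1:]):
            x = one_hot(word_to_id[w])
            h, y = step(h_prev, x)
            dy = y - one_hot(word_to_id[w_next])  # cross-entropy gradient
            da = (W_hy.T @ dy) * (1.0 - h ** 2)   # backprop through tanh
            W_hy -= lr * np.outer(dy, h)
            W_xh -= lr * np.outer(da, x)
            W_hh -= lr * np.outer(da, h_prev)
            h_prev = h

# After training, predictions reflect the learned distributional structure.
h = np.zeros(H)
for w in ["the", "cat"]:
    h, y = step(h, one_hot(word_to_id[w]))
print({vocab[i]: round(float(p), 2) for i, p in enumerate(y)})
```

Because each word is recoded through a hidden layer that also carries context from preceding words, such a network's predictions are shaped by distributional similarity rather than by raw bigram counts alone.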