Hierarchical Sequential Memory for Music: A Cognitive Model

James B. Maxwell            Philippe Pasquier           Arne Eigenfeldt
Simon Fraser University     Simon Fraser University     Simon Fraser University
jbmaxwel@sfu.ca             pasquier@sfu.ca             arne_e@sfu.ca

ABSTRACT

We propose a new machine-learning framework called the Hierarchical Sequential Memory for Music, or HSMM. The HSMM is an adaptation of the Hierarchical Temporal Memory (HTM) framework, designed to make it better suited to musical applications. The HSMM is an online learner, capable of recognition, generation, continuation, and completion of musical structures.

1. INTRODUCTION

In our previous work on the MusicDB [10] we outlined a system inspired by David Cope's notion of “music recombinance” [1]. The design used Cope's “SPEAC” system of structural analysis [1] to build hierarchies of musical objects. It was similar to existing music representation models [7, 9, 13] in that it emphasized the construction of hierarchies in which the objects at each consecutively higher level demonstrate increasing “temporal invariance” [5]; for example, an “S” phrase in SPEAC analysis and a “head” in the Generative Theory of Tonal Music [9] both use single names at higher levels to represent sequences of musical events at lower levels.

Other approaches to learning musical structure include neural network models [8], recurrent neural network (RNN) models [11], RNNs with Long Short-Term Memory [3], Markov-based models [12, 14], and statistical models [2]. Many of these approaches have achieved a high degree of success, particularly in modeling melodic and/or homophonic music. With the HSMM we hope to extend such approaches by enabling a single system to represent melody, harmony, homophony, and various contrapuntal formations, with little or no explicit a priori modeling of musical “rules”: the HSMM learns only by observing musical input.
Further, because the HSMM is a cognitive model, it can be used to exploit musical knowledge, in real time, in a variety of interesting and interactive ways.

2. BACKGROUND: THE HTM FRAMEWORK

In his book “On Intelligence”, Jeff Hawkins proposes a “top-down” model of the human neocortex, called the “Memory Prediction Framework” (MPF) [6]. The model is founded on the notion that intelligence arises through the interaction of perceptions and predictions: the perception of sensory phenomena leads to the formation of predictions, which in turn guide action. When predictions fail to match learned expectations, new predictions are formed, resulting in revised action. The MPF, as realized computationally in the HTM [4, 5], rests on two fundamental ideas: 1) that memories are hierarchically structured, and 2) that higher levels of this structure show increasing temporal invariance.

The HTM is a type of Bayesian network, and is best described as a memory system that can be used to discover or infer “causes” in the world, to make predictions, and to direct action. Each node has two main processing modules: a Spatial Pooler (SP) for storing unique “spatial patterns” (discrete data representations expressed as single vectors), and a Temporal Pooler (TP) for storing temporal groupings of such patterns.

Processing in an HTM occurs in two phases: a “bottom-up” classification phase, and a “top-down” recognition, prediction, and/or generation phase. Learning is a bottom-up process, involving the storage of discrete vector representations in the SP, and the clustering of such vectors into “temporal groups” [4], or variable-order Markov chains, in the TP. A node's learned Markov chains thus represent temporal structure in the training data.
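The SP/TP division described above can be illustrated with a small sketch. The class below is a toy model, not the HTM reference implementation: the SP is simply a list of unique patterns, the TP a table of first-order transition counts (the framework's temporal groups use variable-order chains), and temporal groups are approximated as connected components of the observed transition graph. All names (`Node`, `learn`, `temporal_groups`) are ours.

```python
from collections import defaultdict

class Node:
    """Toy HTM-style node: a Spatial Pooler (SP) storing unique input
    patterns, and a Temporal Pooler (TP) counting transitions between
    stored patterns to approximate a (first-order) Markov chain."""

    def __init__(self):
        self.patterns = []                   # SP: unique spatial patterns
        self.transitions = defaultdict(int)  # TP: (prev, curr) -> count
        self._prev = None

    def reset(self):
        """Forget the previous pattern, e.g. between training sequences."""
        self._prev = None

    def classify(self, pattern):
        """Return the SP index of the pattern, storing it if new."""
        pattern = tuple(pattern)
        if pattern not in self.patterns:
            self.patterns.append(pattern)
        return self.patterns.index(pattern)

    def learn(self, pattern):
        """Bottom-up learning: classify the input, record the transition."""
        idx = self.classify(pattern)
        if self._prev is not None:
            self.transitions[(self._prev, idx)] += 1
        self._prev = idx
        return idx

    def temporal_groups(self):
        """Cluster SP indices linked by observed transitions (union-find);
        each connected component stands in for one 'temporal group'."""
        parent = list(range(len(self.patterns)))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        for (a, b) in self.transitions:
            parent[find(a)] = find(b)
        groups = defaultdict(set)
        for i in range(len(self.patterns)):
            groups[find(i)].add(i)
        return list(groups.values())


# Train on two short "spatial pattern" sequences (hypothetical data).
node = Node()
for p in [(0, 1), (1, 0), (0, 1), (1, 0)]:
    node.learn(p)
node.reset()
for p in [(2, 2), (3, 3), (2, 2)]:
    node.learn(p)

print(len(node.patterns))                                  # 4 patterns in the SP
print(sorted(sorted(g) for g in node.temporal_groups()))   # [[0, 1], [2, 3]]
```

Patterns that follow one another in time end up in the same group, which is the sense in which the TP captures temporal structure in the training data.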
As information flows up the hierarchy, beliefs about the identity of the discrete input representations are formed in each node's SP, and beliefs about the membership of those representations in each of the stored Markov chains are formed in the TP. Since the model is hierarchical, higher-level nodes store invariant representations of lower-level states, leading to the formation of high-level spatio-temporal abstractions, or “concepts.”

A simplified representation of HTM processing is given in Figure 1. Here we see a 2-level hierarchy with two nodes at L1 and one node at L2. This HTM has already received some training, so that each L1 node has stored four spatial patterns and two Markov chains, while the L2 node has stored three spatial patterns and two Markov chains. There are two input patterns, p1 and p2. It can be seen that p1 corresponds to pattern 4 of Node 1, and that pattern 4 of Node 1 is a member of Markov chain b. When presented with p1, the node identifies pattern 4 as

Figure 1. Simplified HTM processing.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. © 2009 International Society for Music Information Retrieval
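The bottom-up belief formation just described can be sketched in a few lines. This is an illustrative simplification, not the HTM algorithm itself: each node maps an input to a stored-pattern belief (SP), then to a Markov-chain membership belief (TP), and the chain index becomes the “pattern” passed up to the parent node. The class and pattern labels (`HTMNode`, `"p1"`, etc.) are hypothetical.

```python
class HTMNode:
    """Toy node for bottom-up inference: SP = list of stored patterns,
    TP = list of sets of SP indices (one set per Markov chain)."""

    def __init__(self, patterns, chains):
        self.patterns = patterns
        self.chains = chains

    def infer(self, pattern):
        """Return (pattern_index, chain_index) for a bottom-up input;
        chain_index is what this node reports to its parent."""
        idx = self.patterns.index(pattern)
        for c, members in enumerate(self.chains):
            if idx in members:
                return idx, c
        raise ValueError("pattern not in any stored Markov chain")


# Mirroring Figure 1: an L1 node with four stored patterns and two
# Markov chains, 'a' = {patterns 1, 2} and 'b' = {patterns 3, 4}
# (0-based indices {0, 1} and {2, 3}). Input p1 matches pattern 4.
node1 = HTMNode(patterns=["q", "r", "s", "p1"], chains=[{0, 1}, {2, 3}])
idx, chain = node1.infer("p1")
print(idx, chain)   # 3 1 -> pattern 4, chain 'b'; chain id flows up to L2
```

The L2 node would then treat the reported chain indices from its children exactly as the L1 nodes treat raw input, which is how higher levels come to represent increasingly temporally invariant structure.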