Hierarchical Sequential Memory for Music: A Cognitive Model
James B. Maxwell, Philippe Pasquier, Arne Eigenfeldt
Simon Fraser University
jbmaxwel@sfu.ca, pasquier@sfu.ca, arne_e@sfu.ca
ABSTRACT
We propose a new machine-learning framework called the Hierarchical Sequential Memory for Music, or HSMM. The HSMM is an adaptation of the Hierarchical Temporal Memory (HTM) framework, designed to make it better suited to musical applications. The HSMM is an online learner, capable of recognition, generation, continuation, and completion of musical structures.
1. INTRODUCTION
In our previous work on the MusicDB [10] we outlined a system inspired by David Cope's notion of “music recombinance” [1]. The design used Cope's “SPEAC” system of structural analysis [1] to build hierarchies of musical objects. It was similar to existing music representation models [7, 9, 13] in that it emphasized the construction of hierarchies in which the objects at each successively higher level demonstrate increasing “temporal invariance” [5]; for example, an “S” phrase in SPEAC analysis and a “head” in the Generative Theory of Tonal Music [9] both use a single label at a higher level to represent a sequence of musical events at lower levels.
Other approaches to learning musical structure include neural network models [8], recurrent neural network (RNN) models [11], RNNs with Long Short-Term Memory [3], Markov-based models [12, 14], and statistical models [2]. Many of these approaches have achieved considerable success, particularly in modeling melodic and/or homophonic music. With the HSMM we hope to extend such approaches by enabling a single system to represent melody, harmony, homophony, and various contrapuntal formations, with little or no explicit a priori modeling of musical “rules”: the HSMM learns only by observing musical input. Further, because the HSMM is a cognitive model, it can be used to exploit musical knowledge, in real time, in a variety of interesting and interactive ways.
2. BACKGROUND: THE HTM FRAMEWORK
In his book “On Intelligence”, Jeff Hawkins proposes a “top-down” model of the human neocortex, called the “Memory Prediction Framework” (MPF) [6]. The model is founded on the notion that intelligence arises through the interaction of perceptions and predictions: the perception of sensory phenomena leads to the formation of predictions, which in turn guide action. When predictions fail to match learned expectations, new predictions are formed, resulting in revised action. The MPF, as realized computationally in the HTM [4, 5], rests on two fundamental ideas: 1) that memories are hierarchically structured, and 2) that higher levels of this structure show increasing temporal invariance.
The HTM is a type of Bayesian network, and is best described as a memory system that can be used to discover or infer “causes” in the world, to make predictions, and to direct action. Each node has two main processing modules: a Spatial Pooler (SP) for storing unique “spatial patterns” (discrete data representations expressed as single vectors), and a Temporal Pooler (TP) for storing temporal groupings of such patterns.
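The two modules just described can be sketched as a minimal data structure. This is an illustrative sketch only: the class and method names, and the use of plain tuples as “spatial patterns,” are our assumptions, not part of the HTM specification.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Illustrative sketch of one HTM node: a Spatial Pooler (SP) holding
    unique spatial patterns, and a Temporal Pooler (TP) holding temporal
    groupings of those patterns (left unpopulated in this sketch)."""
    spatial_patterns: list = field(default_factory=list)  # SP: unique vectors
    temporal_groups: dict = field(default_factory=dict)   # TP: group id -> pattern indices

    def pool_spatial(self, vector):
        """Return the SP index of `vector`, storing it first if it is novel."""
        vec = tuple(vector)
        for i, stored in enumerate(self.spatial_patterns):
            if stored == vec:
                return i
        self.spatial_patterns.append(vec)
        return len(self.spatial_patterns) - 1

node = Node()
i = node.pool_spatial([0, 1, 0])   # novel -> index 0
j = node.pool_spatial([1, 0, 0])   # novel -> index 1
k = node.pool_spatial([0, 1, 0])   # already stored -> same index as i
```

The key property the sketch shows is that the SP de-duplicates input vectors, so each distinct spatial pattern gets exactly one stored representative.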
The processing in an HTM occurs in two phases: a “bottom-up” classification phase, and a “top-down” recognition, prediction, and/or generation phase. Learning is a bottom-up process, involving the storage of discrete vector representations in the SP, and the clustering of such vectors into “temporal groups” [4], or variable-order Markov chains, in the TP. A node's learned Markov chains thus represent temporal structure in the training data. As information flows up the hierarchy, beliefs about the identity of the discrete input representations are formed in each node's SP, and beliefs about the membership of those representations in each of the stored Markov chains are formed in the TP. Since the model is hierarchical, higher-level nodes store invariant representations of lower-level states, leading to the formation of high-level spatio-temporal abstractions, or “concepts.”
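The bottom-up clustering step can be approximated in a few lines. The sketch below uses first-order transition counts, a deliberate simplification of the variable-order Markov chains the TP actually stores, and groups SP pattern indices by connectivity in the transition graph; all function names are hypothetical.

```python
from collections import defaultdict

def learn_transitions(sequences):
    """Count first-order transitions between SP pattern indices observed
    in training sequences (a simplification of variable-order chains)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return counts

def temporal_groups(counts):
    """Group patterns that transition among one another (connected
    components of the transition graph) -- a crude stand-in for the
    TP's temporal grouping."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, successors in counts.items():
        for b in successors:
            parent[find(a)] = find(b)
    groups = defaultdict(set)
    for x in parent:
        groups[find(x)].add(x)
    return list(groups.values())

# Two training sequences over pattern indices 0..4 yield two groups.
counts = learn_transitions([[0, 1, 2, 0, 1, 2], [3, 4, 3, 4]])
groups = temporal_groups(counts)   # {0, 1, 2} and {3, 4}
```

In a real TP the grouping also reflects transition probabilities and sequence order, not just connectivity, but the sketch shows how repeated co-occurrence in time pulls patterns into shared groups.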
A simplified representation of HTM processing is given in Figure 1. Here we see a 2-level hierarchy with two nodes at L1 and one node at L2. This HTM has already received some training, so that each L1 node has stored four spatial patterns and two Markov chains, while the L2 node has stored three spatial patterns and two Markov chains. There are two input patterns, p1 and p2. It can be seen that p1 corresponds to pattern 4 of Node 1, and that pattern 4 of Node 1 is a member of Markov chain b. When presented with p1, the node identifies pattern 4 as

Figure 1. Simplified HTM processing.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.
© 2009 International Society for Music Information Retrieval
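The Figure 1 walkthrough can be mimicked in code. Since the figure is not reproduced here, the concrete vectors assigned to Node 1's four stored patterns are invented for illustration; only the structure (four patterns, two chains, pattern 4 in chain b) comes from the text.

```python
# Hypothetical recreation of Node 1 from Figure 1: four stored spatial
# patterns and two Markov chains "a" and "b", with pattern 4 in chain b.
patterns = {1: (1, 0, 0), 2: (0, 1, 0), 3: (0, 0, 1), 4: (1, 1, 0)}
chains = {"a": [1, 2], "b": [3, 4]}

def recognize(node_patterns, node_chains, p):
    """Return (pattern id, chain id) for input vector p, mirroring the
    walkthrough in the text; (None, None) if p is unknown to the SP."""
    for pid, vec in node_patterns.items():
        if vec == tuple(p):
            for cid, members in node_chains.items():
                if pid in members:
                    return pid, cid
    return None, None

result = recognize(patterns, chains, (1, 1, 0))   # pattern 4, chain "b"
```

Presenting the (invented) vector for p1 thus recovers both beliefs the text describes: the SP belief about the pattern's identity, and the TP belief about its chain membership.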