Instance-Based Utile Distinctions for Reinforcement Learning with Hidden State

R. Andrew McCallum
Department of Computer Science
University of Rochester
Rochester, NY 14627-0226
mccallum@cs.rochester.edu

Abstract

We present Utile Suffix Memory, a reinforcement learning algorithm that uses short-term memory to overcome the state aliasing that results from hidden state. By combining the advantages of previous work in instance-based (or "memory-based") learning and previous work with statistical tests for separating noise from task structure, the method learns quickly, creates only as much memory as needed for the task at hand, and handles noise well. Utile Suffix Memory uses a tree-structured representation, and is related to work on Prediction Suffix Trees [Ron et al., 1994], Parti-game [Moore, 1993], G-algorithm [Chapman and Kaelbling, 1991], and Variable Resolution Dynamic Programming [Moore, 1991].

1 INTRODUCTION

The sensory systems of embedded agents are inherently limited. When a reinforcement learning agent's sensory limitations hide features of the environment from the agent, we say that the agent suffers from hidden state. There are many reasons why important features can be hidden from a robot's perception: sensors have noise, limited range and limited field of view; occlusions hide areas from sensing; limited funds and space prevent equipping the robot with all desired sensors; an exhaustible power supply deters the robot from using all sensors all the time; and the robot has limited computational resources for turning raw sensor data into usable percepts.

The hidden state problem arises as a case of perceptual aliasing: the mapping between states of the world and sensations of the agent is not one-to-one [Whitehead and Ballard, 1991]. If perceptual limitations allow the agent to perceive only a portion of its world, then many different world states can produce the same percept.
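As a concrete illustration of this many-to-one mapping, consider an agent that senses only the walls immediately adjacent to it. The sketch below is our own hypothetical example, not taken from the paper: the world states, percepts, and corridor layout are invented for illustration.

```python
# Hypothetical illustration of perceptual aliasing: an agent in a corridor
# senses only whether a wall is immediately to its north and to its south,
# so several distinct world states produce the same percept.
from collections import defaultdict

# World states are corridor positions; percepts are (wall_north, wall_south).
percept_of = {
    0: (True, True),   # corridor interior
    1: (True, True),   # corridor interior -- aliased with state 0
    2: (True, False),  # junction opening to the south
    3: (True, True),   # corridor interior -- aliased with states 0 and 1
}

# Group world states by the percept they produce.
states_for = defaultdict(list)
for state, percept in percept_of.items():
    states_for[percept].append(state)

# Any percept produced by more than one world state is aliased.
aliased = {p: s for p, s in states_for.items() if len(s) > 1}
print(aliased)  # {(True, True): [0, 1, 3]}
```

The mapping from world states to percepts is many-to-one, so the agent cannot tell positions 0, 1, and 3 apart from its current percept alone.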
Also, if the agent has an active perceptual system, meaning that it can redirect its sensors to different parts of its surroundings, then the reverse will also be true: many different percepts can result from the same world state.

Perceptual aliasing is both a blessing and a curse. It is a blessing because it can provide useful invariants by representing as equivalent those world states in which the same action is required. It is a curse because it can also confound world states in which different actions are required. Perceptual aliasing provides powerful generalization, but it can also over-generalize. The trick is to selectively remove hidden state, so as to uncover the hidden state that impedes task performance, but leave ambiguous the hidden state that is irrelevant to the agent's task. Distinguishing states whose difference is irrelevant to the current task not only causes the agent to uselessly increase its storage requirements; more damagingly, it also prolongs learning time by requiring that the agent re-learn its policy in each of the needlessly distinguished states.

State identification techniques use history information to uncover hidden state [Bertsekas and Shreve, 1978]. Instead of defining agent internal state by percepts alone, the agent defines its internal state space by a combination of percepts and short-term memory of past percepts and actions. If the agent uses enough short-term memory in the right places, the agent can uncover the non-Markovian dependencies that caused the task-impeding hidden state.

Predefined, fixed-sized memory representations are often undesirable. When memory size (i.e. the number of internal state variables) is more than needed, it exponentially increases the number of agent internal states for which a policy must be learned and stored; when the size of the memory is less than needed, the agent reverts to the disadvantages of undistinguished hidden state.
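Both points can be sketched in a few lines. The example below is our own hypothetical illustration of history-based state identification and of the exponential cost of fixed-size memory; it is not the paper's algorithm, and the percept names and counts are invented.

```python
# Hypothetical sketch: two world states that produce the same percept become
# distinguishable once the internal state includes a one-step memory of the
# preceding action and percept.

# Two trajectories, written as alternating percept, action, percept, ...
traj_a = ["junction", "go-east", "corridor"]   # corridor entered from a junction
traj_b = ["dead-end", "go-west", "corridor"]   # corridor entered from a dead end

def internal_state(trajectory, k):
    """Current percept plus the last k (action, percept) memory steps."""
    current = trajectory[-1]
    memory = tuple(trajectory[-(2 * k + 1):-1]) if k > 0 else ()
    return (memory, current)

# With no memory (k = 0) the two situations are aliased ...
assert internal_state(traj_a, 0) == internal_state(traj_b, 0)
# ... but one step of memory separates them.
assert internal_state(traj_a, 1) != internal_state(traj_b, 1)

# The cost of uniform fixed-size memory: with |P| percepts and |A| actions,
# a k-step memory window multiplies the internal state space by (|A|*|P|)**k.
P, A = 8, 4
print([(k, P * (A * P) ** k) for k in range(3)])  # grows exponentially in k
```

The final line makes the trade-off concrete: every added memory step multiplies the number of internal states for which a policy must be learned, which is why memory applied uniformly everywhere is wasteful.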
Even if the agent designer understands the task well enough to know its maximal memory requirements, the agent is at a disadvantage with fixed-sized memory because, for most tasks, different amounts of memory are needed at different steps of the task. We conclude that the agent should learn on-line how much memory is needed for different parts of its state space.

The work described in this paper addresses the issue of hidden state in conjunction with the following principles: