CHECKERS: TD(λ) LEARNING APPLIED FOR DETERMINISTIC GAME

Halina Kwasnicka, Artur Spirydowicz
Department of Computer Science, Wroclaw University of Technology, Wyb. Wyspianskiego 27, 50-370 Wroclaw, Poland. Tel. (48 71) 320 23 97, Fax: (48 71) 321 10 18, E-mail: kwasnicka@ci.pwr.wroc.pl

Abstract: In this paper we present a game-learning program called CHECKERS. The program contains a neural network that, by playing against itself, is trained on the basis of the rewards it obtains to act as an evaluation function for the game of checkers. This method of training multilayered neural networks is called TD(λ). The main aim of the paper is to explore the possibilities of using reinforcement learning, known from TD-Gammon, for a game without random factors. For this purpose, we chose a popular game, checkers. The developed program has been tested by playing games against people (the authors and their colleagues), a simple heuristic, and another program, NEURODRAUGHTS. It plays checkers at a level above intermediate; it plays astonishingly well.

Keywords: Reinforcement learning, machine learning, computer game playing, checkers.

Introduction

People enjoy playing games of skill, such as chess and checkers, because of the intellectual challenge and the satisfaction derived from playing well. They use knowledge and search to make their decisions at the board. The person with the best “algorithm” for playing the game wins in the long run. Without perfect knowledge, mistakes are made, and even World Champions will lose occasionally. This gives rise to an intriguing question: is it possible to program a computer to play a game perfectly? Recently, some games have been solved, for example Qubic [6] and Go-Moku [1]. The best-known computer program able to learn to play a game and win against champions is TD-Gammon. It plays the game of backgammon.
As Bill Robertie says in [8], TD-Gammon's level of play is significantly better than that of any previous computer program. It plays at a strong master level, close to the world’s best human players. The paradigm of reinforcement learning is intuitive: a pupil (learning agent) observes an input state and produces an ‘action’, an output signal. After this, it receives some ‘reward’ from the environment. The reward indicates how good or bad the output produced by the pupil was. The goal of such learning is to produce the optimal actions (output signals) leading to the maximal reward. Often the reward is delayed, which means that it is known (given) only at the end of a long sequence of inputs and output actions. This problem for the pupil is known as “temporal credit assignment”. The paradigm is intuitive because the learner (pupil) is able to learn to perform its task from its own experience, without any intelligent ‘teacher’. Despite the considerable attention devoted to reinforcement learning with delayed rewards, it is difficult to find many significant practical applications. Multilayered perceptrons seem to be capable of learning complex nonlinear functions of their inputs. Temporal difference learning appears to be a promising general-purpose technique for learning with delayed rewards, not only for prediction learning, but also for combined prediction and control tasks, where control decisions are made by optimizing the predicted output. TD-Gammon made it possible to explore the capability of multilayer neural networks, trained by the TD(λ) method, to learn complex nonlinear functions. The program also allows a comparison of TD learning with the alternative approach of supervised training on expert-labeled exemplars, as in NEUROGAMMON, which was trained by backpropagation using a database of recorded expert games. NEUROGAMMON achieved only an intermediate level of play.
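To make the TD(λ) update concrete, the following is a minimal sketch of one temporal-difference step with eligibility traces. It uses a linear value function rather than the multilayer network of the paper, purely to keep the example short; all names and parameter values (ALPHA, LAM, GAMMA) are illustrative assumptions, not taken from the original program.

```python
# Minimal TD(lambda) sketch with a linear value function and eligibility
# traces. The paper trains a multilayer network; a linear approximator is
# used here only for brevity. Parameter values are illustrative.

ALPHA = 0.1   # learning rate (assumed value)
LAM = 0.7     # trace-decay parameter lambda (assumed value)
GAMMA = 1.0   # no discounting within a single game

def value(w, x):
    """Linear evaluation of a feature vector x under weights w."""
    return sum(wi * xi for wi, xi in zip(w, x))

def td_lambda_update(w, trace, x, reward, x_next, terminal):
    """One TD(lambda) step: compute the TD error, decay and accumulate
    the eligibility traces, then adjust every weight in proportion to
    its trace."""
    v = value(w, x)
    v_next = 0.0 if terminal else value(w, x_next)
    delta = reward + GAMMA * v_next - v          # TD error
    for i, xi in enumerate(x):
        trace[i] = GAMMA * LAM * trace[i] + xi   # accumulate trace
    for i in range(len(w)):
        w[i] += ALPHA * delta * trace[i]         # credit past states
    return w, trace
```

Repeatedly applying this update over self-play games, with the traces reset at the start of each game, is the essence of the learning scheme the paper builds on.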
Backgammon has some features that are absent in checkers, and they probably explain why TD-Gammon plays surprisingly well. One of them is the stochastic nature of the task, which comes from the random dice rolls. This ensures a wide variability in the positions visited during the whole training process: the pupil can explore more of the state space and discover new strategies. In deterministic games (such as checkers), self-play training can stall, exploring only a small part of the state space, because a narrow range of different positions is produced. Problems connected with self-play training have been identified in such deterministic games as checkers and Go [10]. The second significant feature of backgammon is that, for all playing strategies, the sequence of moves will terminate (in a win or a loss), even if play starts with randomly initialized networks. In deterministic games, however, cycles can occur, and in such cases the trained network is unable to learn because the final reward is never produced. This problem must be avoided if the TD(λ) learning method is to be used for deterministic games. Another advantage of non-deterministic games is the smoothness and continuity of the target function that the pupil must learn: small changes in the position cause small changes in the probability of winning. Deterministic games, such as chess, are discrete (a player can win, lose, or draw), so the target functions are more discontinuous and harder to learn. This article presents a game-learning program, called CHECKERS, which uses the TD(λ) learning method for a feedforward neural network. The network learns to play checkers from experience, receiving positive and negative rewards. The following sections present the algorithm used, the developed program, and the obtained results. The program was tested by playing against people (the authors and their colleagues) and against computer programs: a simple heuristic implemented within the program, and a program called NEURODRAUGHTS.
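One common way to work around the non-termination problem described above is to cap the number of moves in a self-play game and score a game that reaches the cap as a draw, so that every episode ends with some reward. The sketch below illustrates that idea only; the cap value, the reward scheme, and all function names are illustrative assumptions, not details of the CHECKERS program.

```python
# Hedged sketch: avoiding non-terminating self-play games in a
# deterministic game by capping the move count. A game that hits the
# cap (a likely cycle) is scored as a draw (reward 0.0), so the learner
# always receives a final reward. All names and values are illustrative.

MAX_MOVES = 150  # assumed cap on game length

def self_play_episode(initial_state, choose_move, is_terminal, outcome):
    """Play one self-play game and return its final reward
    (e.g. +1 win, -1 loss, 0 draw)."""
    state, moves = initial_state, 0
    while not is_terminal(state):
        if moves >= MAX_MOVES:
            return 0.0            # treat a suspected cycle as a draw
        state = choose_move(state)
        moves += 1
    return outcome(state)
```

With this guard in place, every training episode produces a terminal reward that the TD(λ) update can propagate back through the game.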
The obtained results show that our approach is satisfactory: the TD(λ) method used by TD-Gammon for backgammon, suitably modified, can also be successfully applied to deterministic games.