Deep Distributional Temporal Difference Learning for Game Playing

Berglind, Frej

Berglind, Frej (LU) (2019) In Master’s Theses in Mathematical Sciences FMSM01 20192
Mathematical Statistics

Abstract: Temporal difference learning is considered one of the most successful methods in reinforcement learning. Recent developments in deep learning have opened up a new world of opportunities. In this project, we compare classic scalar temporal difference learning with three new distributional algorithms for playing the game of 5-in-a-row using deep neural networks: distributional temporal difference learning with constant learning rate, and two distributional temporal difference algorithms with adaptive learning rate. All these algorithms are applicable to any two-player deterministic zero sum game and can probably be successfully generalized to other settings.

As it turned out, all algorithms performed well and developed strong strategies. The algorithms implementing the adaptive methods learned more quickly in the beginning, but in the long run, they were outperformed by the algorithms using constant learning rate which, without any prior knowledge, learned to play the game at a very high level after 200 000 games of self play.
Popular Abstract (translated from Swedish): With modern techniques in artificial intelligence (AI), computer programs can learn to play games by practicing. In my thesis, a computer program learns to play 5-in-a-row (Swedish: luffarschack) by playing 200 000 games against itself. At first the program plays completely at random, but in the end it is very hard to beat.

Board games have a long history in AI research. With their unique combination of simple rules and complex strategies, they are especially appealing for AI experiments. It may seem silly to spend time and resources on playing board games, but insights from game research can be applied in many other areas that need efficient search methods, smart decisions, or long-term planning. Examples include self-driving cars, DNA analysis, and search engines. Just as a chemist performs experiments in a controlled lab environment, board games offer a good setting for AI research.

In this project, neural networks are used to analyze the board and make decisions in the game. This is the learning technique that has dominated the development of AI in recent years. Neural networks are inspired by the human brain and consist of mathematical neurons (brain cells) connected in an enormous network. During training, the network learns to predict how the game will go by analyzing the board and finding patterns. The predictions are then used to choose the best move.

To improve its strategy and learn to win the game, the program needs to develop more accurate predictions. This is done through reinforcement learning, a learning method inspired by the human ability to learn by practicing. The computer program proceeds by trial and error and learns from its triumphs and mistakes. Reinforcement learning is expected to become a central part of the development of future AI.
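
The idea described above, predicting how a game will go and then sharpening those predictions through practice, can be illustrated with a generic textbook TD(0) update. This is only a minimal sketch under stated assumptions: it uses a lookup table instead of the deep neural networks used in the thesis, it covers only the classic scalar case and none of the distributional variants, and all names (`td0_update`, `greedy_move`, `alpha`, `gamma`) are hypothetical and not taken from the thesis.

```python
def td0_update(values, state, next_state, reward,
               alpha=0.1, gamma=1.0, terminal=False):
    """One tabular TD(0) step (illustrative assumption, not the thesis code).

    Moves the prediction V(state) a fraction `alpha` of the way toward
    the target reward + gamma * V(next_state).
    """
    v = values.get(state, 0.0)
    # A terminal position has no future, so its continuation value is 0.
    v_next = 0.0 if terminal else values.get(next_state, 0.0)
    td_error = reward + gamma * v_next - v
    values[state] = v + alpha * td_error
    return td_error


def greedy_move(values, candidates):
    """Choose the successor position with the highest predicted value."""
    return max(candidates, key=lambda s: values.get(s, 0.0))


# Usage: positions are plain string labels here, purely for illustration.
values = {}
td0_update(values, "s0", "win", reward=1.0, alpha=0.5, terminal=True)
td0_update(values, "a", "s0", reward=0.0, alpha=0.5)
best = greedy_move(values, ["a", "s0", "unseen"])
```

In this toy run the prediction for `s0` moves halfway toward the winning reward, and the earlier position `a` then inherits part of that value, which is the bootstrapping step that gives temporal difference learning its name.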

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/8998975

author

Berglind, Frej (LU)

supervisor

Alexandros Sopasakis (LU)
Jianhua Chen

organization

Mathematical Statistics

course

FMSM01 20192

year

2019

type

H2 - Master's Degree (Two Years)

subject

Mathematics and Statistics

keywords

Reinforcement Learning, Deep Learning, Temporal Difference Learning, Distributional Learning, Game Playing, 5-in-a-row, Artificial Intelligence.

publication/series

Master’s Theses in Mathematical Sciences

report number

LUTFMS-3384-2019

ISSN

1404-6342

other publication id

2019:E66

language

English

id

8998975

date added to LUP

2020-03-09 10:24:16

date last changed

2024-10-22 15:05:30

@misc{8998975,
  abstract     = {{Temporal difference learning is considered one of the most successful methods in reinforcement learning. Recent developments in deep learning have opened up a new world of opportunities. In this project, we compare classic scalar temporal difference learning with three new distributional algorithms for playing the game of 5-in-a-row using deep neural networks: distributional temporal difference learning with constant learning rate, and two distributional temporal difference algorithms with adaptive learning rate. All these algorithms are applicable to any two-player deterministic zero sum game and can probably be successfully generalized to other settings.

As it turned out, all algorithms performed well and developed strong strategies. The algorithms implementing the adaptive methods learned more quickly in the beginning, but in the long run, they were outperformed by the algorithms using constant learning rate which, without any prior knowledge, learned to play the game at a very high level after 200 000 games of self play.}},
  author       = {{Berglind, Frej}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master’s Theses in Mathematical Sciences}},
  title        = {{Deep Distributional Temporal Difference Learning for Game Playing}},
  year         = {{2019}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
