Reinforcement Learning – Intelligent Weighting of Monte Carlo and Temporal Differences
(2019) Department of Automatic Control
 Abstract
 In reinforcement learning, the way the value function is updated determines how information spreads across the state/state-action space, which in turn shapes the value-based control policy. It is therefore important that information propagates across the value domain effectively. Two common ways to update the value function are Monte Carlo updating and temporal-difference updating. They are opposite extremes.

 Monte Carlo updating is episodic: fully played-out episodes are used to collect the environment's responses and rewards, and the value function is updated at the end of every episode. The downside is that Monte Carlo updating needs a large number of episodes and time steps to converge to an accurate result. The upside is that it gives an unbiased approximation of the value function. It can be applied successfully in simulations and small real-world problems, but for larger problems the learning time and computing power required become a problem.

 Temporal-difference updating, on the other hand, can in some cases spread information across the value domain more effectively. In contrast to Monte Carlo updating, it performs an incremental update at every time step, combining the newest information with an approximation of the expected total discounted accumulated reward for the rest of the episode. The agent therefore learns at every time step, which leads to more effective updating of the Q-value function. The downside is that the approximation introduces bias. Another drawback is that the algorithm only passes information one time step backward in time.

 By combining Monte Carlo and temporal-difference updating, the best of the two can be exploited. A popular way to do that is to weight the importance of the two. The method is called TD(λ), where λ is a tuning parameter that expresses how much to trust the long-term update versus the step-wise update. TD(λ = 0) takes one step in the environment, bootstraps the rest, and updates. TD(λ = 1) updates with the received rewards only and hence does not make use of any approximation. A value of λ in between weights the importance of the two. The optimal choice of λ depends on the specific situation and on many factors from both the environment and the control problem itself.

 This thesis proposes an idea for intelligently choosing a proper value of λ dynamically, together with the values of other hyperparameters used in the reinforcement learning strategy. The main idea is to use a dropout technique as an inferential prediction of the uncertainty in the system. High inferential uncertainty reflects a less trustworthy Q-value function, and the tuning parameters can be chosen accordingly. For example, once information has propagated throughout the network and the inferential uncertainty is bounded, lower values of λ and ε (the exploration versus exploitation parameter) can hopefully be used to advantage.
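 The weighting that the abstract describes can be illustrated with the standard λ-return. The following is a minimal sketch and not code from the thesis: the function name `lambda_returns`, its arguments, and the toy numbers are assumptions made for illustration. It uses the recursive form G_t = r_t + γ[(1 − λ)V(s_{t+1}) + λG_{t+1}], with the value of the terminal state taken to be zero.

```python
def lambda_returns(rewards, next_values, gamma, lam):
    """Compute lambda-return targets for one finished episode.

    rewards[t]     : reward received after the action at step t
    next_values[t] : current estimate V(s_{t+1}); 0.0 for the terminal state
    gamma          : discount factor
    lam            : weighting between the one-step target (lam = 0)
                     and the Monte Carlo return (lam = 1)
    (All names are hypothetical; they do not come from the thesis.)
    """
    returns = [0.0] * len(rewards)
    g_next = 0.0  # lambda-return of the step after the last one
    # Work backward so each target can reuse the one that follows it.
    for t in reversed(range(len(rewards))):
        blended = (1.0 - lam) * next_values[t] + lam * g_next
        returns[t] = rewards[t] + gamma * blended
        g_next = returns[t]
    return returns


# Toy episode with three steps (made-up numbers).
rewards = [1.0, 0.0, 2.0]
next_values = [0.5, 1.0, 0.0]

td0_targets = lambda_returns(rewards, next_values, gamma=0.9, lam=0.0)
mc_targets = lambda_returns(rewards, next_values, gamma=0.9, lam=1.0)
mixed_targets = lambda_returns(rewards, next_values, gamma=0.9, lam=0.5)
```

 With `lam=0.0` each target depends only on the immediate reward and the bootstrapped estimate of the next state; with `lam=1.0` the value estimates drop out and each target is the discounted sum of the rewards actually received. Intermediate values blend the two.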
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/studentpapers/record/8969562
 author
 Christiansson, Martin
 supervisor

 Fredrik Bagge Carlson (LU)
 Anders Robertsson (LU)
 organization
 year
 2019
 type
 H3 - Professional qualifications (4 Years - )
 subject
 report number
 TFRT-6072
 ISSN
 0280-5316
 language
 English
 id
 8969562
 date added to LUP
 2019-06-12 11:39:06
 date last changed
 2019-06-12 11:39:06
@misc{8969562,
  author   = {Christiansson, Martin},
  issn     = {0280-5316},
  language = {eng},
  note     = {Student Paper},
  title    = {Reinforcement Learning – Intelligent Weighting of Monte Carlo and Temporal Differences},
  year     = {2019},
}