
Lund University Publications


Time-efficient reinforcement learning with stochastic stateful policies

Al-Hafez, Firas; Zhao, Guoping; Peters, Jan and Tateo, Davide (2024) 12th International Conference on Learning Representations, ICLR 2024
Abstract

Stateful policies play an important role in reinforcement learning, such as handling partially observable environments, enhancing robustness, or imposing an inductive bias directly into the policy structure. The conventional method for training stateful policies is Backpropagation Through Time (BPTT), which comes with significant drawbacks, such as slow training due to sequential gradient propagation and the occurrence of vanishing or exploding gradients. The gradient is often truncated to address these issues, resulting in a biased policy update. We present a novel approach for training stateful policies by decomposing the latter into a stochastic internal state kernel and a stateless policy, jointly optimized by following the stateful policy gradient. We introduce different versions of the stateful policy gradient theorem, enabling us to easily instantiate stateful variants of popular reinforcement learning and imitation learning algorithms. Furthermore, we provide a theoretical analysis of our new gradient estimator and compare it with BPTT. We evaluate our approach on complex continuous control tasks, e.g. humanoid locomotion, and demonstrate that our gradient estimator scales effectively with task complexity while offering a faster and simpler alternative to BPTT.

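The abstract's central idea, splitting a stateful policy into a stochastic internal state kernel and a stateless action policy whose log-probabilities are scored jointly at every time step, can be sketched in a few lines. The snippet below is a minimal, assumed PyTorch rendering for illustration only; the class, network, and variable names are hypothetical and it is not the authors' implementation.

# Minimal sketch (assumed, not the authors' code) of the decomposition described in
# the abstract: a stochastic internal state kernel z' ~ nu(. | o, z) plus a stateless
# action policy a ~ pi(. | o, z). Both distributions are Gaussian purely for illustration.
import torch
import torch.nn as nn


class StochasticStatefulPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, state_dim: int, hidden: int = 64):
        super().__init__()
        # Stateless action policy: maps (observation, internal state) to an action distribution.
        self.policy_net = nn.Sequential(
            nn.Linear(obs_dim + state_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim)
        )
        # Stochastic internal state kernel: maps (observation, internal state) to the
        # distribution over the next internal state.
        self.kernel_net = nn.Sequential(
            nn.Linear(obs_dim + state_dim, hidden), nn.Tanh(), nn.Linear(hidden, state_dim)
        )
        self.log_std_a = nn.Parameter(torch.zeros(act_dim))
        self.log_std_z = nn.Parameter(torch.zeros(state_dim))

    def step(self, obs: torch.Tensor, z: torch.Tensor):
        # Sample an action and the next internal state, and return their joint log-probability.
        inp = torch.cat([obs, z], dim=-1)
        dist_a = torch.distributions.Normal(self.policy_net(inp), self.log_std_a.exp())
        dist_z = torch.distributions.Normal(self.kernel_net(inp), self.log_std_z.exp())
        action = dist_a.sample()
        z_next = dist_z.sample()
        # Each time step yields its own log-likelihood term, so a likelihood-ratio
        # ("stateful") policy gradient can be formed without backpropagation through time.
        log_prob = dist_a.log_prob(action).sum(-1) + dist_z.log_prob(z_next).sum(-1)
        return action, z_next, log_prob

Because the internal state transition is itself sampled and assigned a log-probability, every step contributes an independent likelihood-ratio term; this is what lets the stateful variants of standard algorithms avoid the sequential gradient propagation that makes BPTT slow and prone to vanishing or exploding gradients.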
Please use this url to cite or link to this publication:
author
Al-Hafez, Firas; Zhao, Guoping; Peters, Jan and Tateo, Davide
publishing date
2024
type
Contribution to conference
publication status
published
subject
conference name
12th International Conference on Learning Representations, ICLR 2024
conference location
Hybrid, Vienna, Austria
conference dates
2024-05-07 - 2024-05-11
external identifiers
  • scopus:85200568592
language
English
LU publication?
no
additional info
Publisher Copyright: © 2024 12th International Conference on Learning Representations, ICLR 2024. All rights reserved.
id
08f2f52b-bbf6-4185-9bbf-76116268de92
date added to LUP
2025-10-16 14:06:36
date last changed
2025-10-17 12:07:50
@misc{08f2f52b-bbf6-4185-9bbf-76116268de92,
  abstract     = {{Stateful policies play an important role in reinforcement learning, such as handling partially observable environments, enhancing robustness, or imposing an inductive bias directly into the policy structure. The conventional method for training stateful policies is Backpropagation Through Time (BPTT), which comes with significant drawbacks, such as slow training due to sequential gradient propagation and the occurrence of vanishing or exploding gradients. The gradient is often truncated to address these issues, resulting in a biased policy update. We present a novel approach for training stateful policies by decomposing the latter into a stochastic internal state kernel and a stateless policy, jointly optimized by following the stateful policy gradient. We introduce different versions of the stateful policy gradient theorem, enabling us to easily instantiate stateful variants of popular reinforcement learning and imitation learning algorithms. Furthermore, we provide a theoretical analysis of our new gradient estimator and compare it with BPTT. We evaluate our approach on complex continuous control tasks, e.g. humanoid locomotion, and demonstrate that our gradient estimator scales effectively with task complexity while offering a faster and simpler alternative to BPTT.}},
  author       = {{Al-Hafez, Firas and Zhao, Guoping and Peters, Jan and Tateo, Davide}},
  language     = {{eng}},
  title        = {{Time-efficient reinforcement learning with stochastic stateful policies}},
  year         = {{2024}},
}