Temporal-Difference Learning: Combining Dynamic Programming and Monte Carlo Methods for Reinforcement Learning | by Oliver S | Oct, 2024

Milestones of RL: Q-Learning and Double Q-Learning

We continue our deep dive into Sutton’s book “Reinforcement Learning: An Introduction” [1], and in this post introduce Temporal-Difference (TD) Learning, which is Chapter 6 of said work.

TD learning can be seen as a combination of Dynamic Programming (DP) and Monte Carlo (MC) methods, which we introduced in the previous two posts. It marks an important milestone in the field of Reinforcement Learning (RL) because it combines the strengths of both: like MC, TD learning does not need a model and learns from experience alone, but like DP it also “bootstraps”, meaning it updates its estimates using previously established estimates.
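To make the two ingredients concrete, here is a minimal sketch of tabular TD(0) prediction. The environment is an assumption chosen purely for illustration: a five-state random walk where the agent moves left or right with equal probability and receives a reward of 1 only when it exits on the right. The function name `td0_random_walk`, the step size `alpha`, and the initial values are likewise illustrative and are not taken from the code accompanying this post.

```python
import random


def td0_random_walk(num_episodes=1000, alpha=0.05, gamma=1.0, seed=0):
    """Tabular TD(0) value prediction on a 5-state random walk (illustrative setup)."""
    rng = random.Random(seed)
    num_states = 5
    # States 0..4 are non-terminal; stepping to -1 or 5 ends the episode.
    values = [0.5] * num_states

    for _ in range(num_episodes):
        state = num_states // 2  # every episode starts in the middle state
        while True:
            # Learn from experience alone: sample a transition, no model needed.
            next_state = state + rng.choice((-1, 1))
            terminal = next_state < 0 or next_state >= num_states
            reward = 1.0 if next_state >= num_states else 0.0

            # Bootstrap: the target uses the current estimate of the next state
            # (terminal states have value 0 by definition).
            next_value = 0.0 if terminal else values[next_state]
            td_error = reward + gamma * next_value - values[state]
            values[state] += alpha * td_error

            if terminal:
                break
            state = next_state

    return values


values = td0_random_walk()
print([round(v, 2) for v in values])
```

For this particular walk the true state values are 1/6, 2/6, …, 5/6, so the printed estimates should land near those numbers, with some noise because the step size is constant. Note that the update happens after every single step rather than at the end of the episode, which is what distinguishes it from an MC update.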

Photograph by Brooke Campbell on Unsplash

Here, we’ll introduce this family of methods, both from a theoretical standpoint and by showing relevant practical algorithms, such as Q-learning, accompanied by Python code. As usual, all code can be found on GitHub.

We begin with an introduction and motivation, and then start with the prediction problem, as in the previous posts. Then, we dive deeper into the theory and discuss which solution TD learning finds. Following that, we move to the control problem, and present a…