Update units/en/unitbonus3/model-based.mdx

Co-authored-by: Nathan Lambert <nathan@huggingface.co>
2026-06-15 06:27:24 +08:00 · 2023-02-04 10:26:02 +01:00
parent d6de00f454
commit 5b19f7663b
1 changed files with 24 additions and 1 deletions
--- a/units/en/unitbonus3/model-based.mdx
+++ b/units/en/unitbonus3/model-based.mdx
@@ -1,3 +1,26 @@
 # Model Based Reinforcement Learning

-Nathan can you provide an introduction and good learning resources?
+# Model-based reinforcement learning (MBRL)
+
+Model-based reinforcement learning differs from its model-free counterpart only in that it learns a *dynamics model*, but that difference has substantial downstream effects on how decisions are made.
+The dynamics model most often models the environment transition dynamics, \\( s_{t+1} = f_\theta (s_t, a_t) \\), but other models, such as inverse dynamics models (mapping from states to actions) or reward models (predicting rewards), can also be used in this framework.
+
+**Simple version**:
+
+There is an agent that repeatedly tries to solve a problem, accumulating state and action data. 
+With that data, the agent creates a structured learning tool -- a dynamics model -- to reason about the world. 
+With the dynamics model, the agent decides how to act by predicting into the future. 
+With those actions, the agent collects more data, improves said model, and hopefully improves future actions.
+
+**Academic version**:
+ 
+Model-based reinforcement learning (MBRL) follows the framework of an agent interacting in an environment, learning a model of said environment, and then leveraging the model for control. 
+Specifically, the agent acts in a Markov Decision Process (MDP) governed by a transition function \\( s_{t+1} = f (s_t , a_t) \\) and receives a reward \\( r(s_t, a_t) \\) at each step. With a collected dataset \\( D := \\{ s_i, a_i, s_{i+1}, r_i \\} \\), the agent learns a model, \\( s_{t+1} = f_\theta (s_t , a_t) \\), to minimize the negative log-likelihood of the transitions.
+We employ sample-based model-predictive control (MPC) using the learned dynamics model, which optimizes the expected reward over a finite, recursively predicted horizon, \\( \tau \\), by evaluating a set of action sequences sampled from a uniform distribution \\( U(a) \\) (see [this paper](https://arxiv.org/pdf/2002.04523), [this paper](https://arxiv.org/pdf/2012.09156.pdf), or [this paper](https://arxiv.org/pdf/2009.01221.pdf)).
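
The model-learning step described above can be sketched in a few lines. This is a minimal illustration, not the method used in the cited papers: it assumes a scalar state and action, a hypothetical linear environment (`true_dynamics`), and a linear model \\( f_\theta(s, a) = w_s s + w_a a \\). Under a Gaussian likelihood with fixed variance, minimizing the negative log-likelihood reduces to minimizing squared error, so the fit is done here with ordinary least squares. All function names are illustrative.

```python
import random


def true_dynamics(s, a):
    """Hypothetical ground-truth transition f(s, a); unknown to the agent."""
    return 0.9 * s + 0.5 * a


def collect_data(n, rng):
    """Build a dataset D = [(s_i, a_i, s_{i+1})] from random states and actions."""
    data = []
    for _ in range(n):
        s = rng.uniform(-1.0, 1.0)
        a = rng.uniform(-1.0, 1.0)
        # Small Gaussian noise stands in for unmodeled effects (an assumption).
        s_next = true_dynamics(s, a) + rng.gauss(0.0, 0.01)
        data.append((s, a, s_next))
    return data


def fit_linear_model(data):
    """Fit s' ~ w_s * s + w_a * a by solving the 2x2 normal equations.

    With fixed-variance Gaussian noise, this least-squares solution is also
    the minimizer of the negative log-likelihood of the transitions.
    """
    ss = sum(s * s for s, a, y in data)
    sa = sum(s * a for s, a, y in data)
    aa = sum(a * a for s, a, y in data)
    sy = sum(s * y for s, a, y in data)
    ay = sum(a * y for s, a, y in data)
    det = ss * aa - sa * sa
    w_s = (sy * aa - sa * ay) / det
    w_a = (ss * ay - sa * sy) / det
    return w_s, w_a


rng = random.Random(0)
w_s, w_a = fit_linear_model(collect_data(500, rng))
print(f"learned model: s' = {w_s:.3f} * s + {w_a:.3f} * a")
```

In practice the model is usually a neural network trained by gradient descent, but the structure is the same: collect transitions, then fit \\( f_\theta \\) to predict the next state.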
+
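
Sample-based MPC as described above is often implemented as "random shooting". The sketch below is a minimal illustration under stated assumptions: `model` is a stand-in for a learned dynamics model \\( f_\theta \\) (a hypothetical linear form), `reward` is a hypothetical reward that favors states near the origin, and the horizon and number of candidates are arbitrary choices.

```python
import random


def model(s, a):
    """Stand-in for a learned dynamics model f_theta(s, a) (hypothetical)."""
    return 0.9 * s + 0.5 * a


def reward(s, a):
    """Hypothetical reward r(s, a): stay near the origin, use small actions."""
    return -(s * s) - 0.01 * (a * a)


def random_shooting_mpc(s0, horizon, n_candidates, rng, a_low=-1.0, a_high=1.0):
    """Return the first action of the best uniformly sampled action sequence."""
    best_return = float("-inf")
    best_first_action = 0.0
    for _ in range(n_candidates):
        # Sample one candidate action sequence from U(a_low, a_high).
        actions = [rng.uniform(a_low, a_high) for _ in range(horizon)]
        s, total = s0, 0.0
        for a in actions:
            total += reward(s, a)
            # Recursive prediction: the model's output is fed back as input.
            s = model(s, a)
        if total > best_return:
            best_return = total
            best_first_action = actions[0]
    return best_first_action


rng = random.Random(0)
action = random_shooting_mpc(s0=1.0, horizon=5, n_candidates=1000, rng=rng)
print(f"first action chosen by MPC: {action:.3f}")
```

Only the first action of the best sequence is executed; the agent then observes the new state and re-plans, which limits how far compounding model errors can carry over from one step to the next.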
+## Further reading
+For more information on MBRL, we recommend you check out the following resources.
+
+1. A [recent review paper on MBRL (long)](https://arxiv.org/abs/2006.16712),
+2. A [blog post on debugging MBRL](https://www.natolambert.com/writing/debugging-mbrl).
+