Update units/en/unitbonus3/model-based.mdx

Co-authored-by: Nathan Lambert <nathan@huggingface.co>
Thomas Simonini
2023-02-04 10:26:02 +01:00
committed by GitHub
parent d6de00f454
commit 5b19f7663b

@@ -1,3 +1,26 @@
# Model Based Reinforcement Learning
Nathan can you provide an introduction and good learning resources?
# Model-based reinforcement learning (MBRL)
Model-based reinforcement learning differs from its model-free counterpart only in that it learns a *dynamics model*, but that difference has substantial downstream effects on how decisions are made.
The dynamics model most often represents the environment transition dynamics, \\( s_{t+1} = f_\theta (s_t, a_t) \\), but other learned models, such as inverse dynamics models (mapping from a state and the following state to the action taken between them) or reward models (predicting rewards), can also be used in this framework.
**Simple version**:
There is an agent that repeatedly tries to solve a problem, accumulating state and action data.
With that data, the agent creates a structured learning tool -- a dynamics model -- to reason about the world.
With the dynamics model, the agent decides how to act by predicting into the future.
With those actions, the agent collects more data, improves said model, and hopefully improves future actions.
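The loop above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not a reference implementation: the 1-D environment `env_step`, the one-parameter model fitted by `fit`, and the one-step lookahead in `act` are all hypothetical names and choices made for this example.

```python
import random

def env_step(s, a):
    # Hypothetical true dynamics, hidden from the agent.
    return s + 0.5 * a

def fit(data):
    # One-parameter least squares for the model s' - s ≈ w * a.
    num = sum(a * (s_next - s) for s, a, s_next in data)
    den = sum(a * a for s, a, s_next in data)
    return num / den if den > 0 else 0.0

def act(s, w, goal, rng):
    # Predict one step ahead for a few sampled actions and keep the best one.
    candidates = [rng.uniform(-1.0, 1.0) for _ in range(50)]
    return min(candidates, key=lambda a: abs(s + w * a - goal))

def run(iterations=5, steps=20, goal=2.0, seed=0):
    rng = random.Random(seed)
    data, w = [], 0.0
    final_states = []
    for it in range(iterations):
        s = 0.0
        for _ in range(steps):
            # First episode: random actions. Later episodes: act with the model.
            a = rng.uniform(-1.0, 1.0) if it == 0 else act(s, w, goal, rng)
            s_next = env_step(s, a)
            data.append((s, a, s_next))  # accumulate state and action data
            s = s_next
        w = fit(data)  # improve the model with all data collected so far
        final_states.append(s)
    return w, final_states
```

Calling `run()` returns the learned parameter `w` and the final state of each episode; once the model has been fitted, the episodes should end near `goal`.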
**Academic version**:
Model-based reinforcement learning (MBRL) follows the framework of an agent interacting in an environment, learning a model of said environment, and then leveraging the model for control.
Specifically, the agent acts in a Markov Decision Process (MDP) governed by a transition function \\( s_{t+1} = f (s_t , a_t) \\) and receives a reward \\( r(s_t, a_t) \\) at each step. With a collected dataset \\( D := \lbrace (s_i, a_i, s_{i+1}, r_i) \rbrace \\), the agent learns a model, \\( s_{t+1} = f_\theta (s_t , a_t) \\), by minimizing the negative log-likelihood of the observed transitions.
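As a minimal sketch of the model-learning step, assume a Gaussian likelihood with fixed variance, in which case minimizing the negative log-likelihood reduces to least squares. The linear model form, the toy dynamics in `true_step`, and every function name below are assumptions made for illustration; MBRL in practice typically uses neural networks as \\( f_\theta \\).

```python
import math
import random

def true_step(s, a):
    # Hypothetical ground-truth dynamics f, unknown to the agent.
    return 0.9 * s + 0.5 * a

def collect(n, seed=0):
    # Build a dataset of (s_i, a_i, s_{i+1}) tuples from random actions.
    rng = random.Random(seed)
    data, s = [], 0.0
    for _ in range(n):
        a = rng.uniform(-1.0, 1.0)
        s_next = true_step(s, a) + rng.gauss(0.0, 0.01)  # small observation noise
        data.append((s, a, s_next))
        s = s_next
    return data

def gaussian_nll(data, w_s, w_a, sigma=0.01):
    # Negative log-likelihood of the transitions under s' ~ N(w_s*s + w_a*a, sigma^2).
    sq = sum((s_next - (w_s * s + w_a * a)) ** 2 for s, a, s_next in data)
    return 0.5 * sq / sigma ** 2 + len(data) * math.log(sigma * math.sqrt(2 * math.pi))

def fit_linear_model(data):
    # With sigma fixed, the NLL minimizer is the least-squares solution.
    # Solve the 2x2 normal equations with Cramer's rule.
    sss = sum(s * s for s, a, y in data)
    ssa = sum(s * a for s, a, y in data)
    saa = sum(a * a for s, a, y in data)
    sy = sum(s * y for s, a, y in data)
    ay = sum(a * y for s, a, y in data)
    det = sss * saa - ssa * ssa
    w_s = (sy * saa - ssa * ay) / det
    w_a = (sss * ay - ssa * sy) / det
    return w_s, w_a
```

For example, `fit_linear_model(collect(500))` should recover parameters close to the coefficients used in `true_step`, and `gaussian_nll` is lower at the fitted parameters than at perturbed ones.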
We employ sample-based model-predictive control (MPC) with the learned dynamics model: candidate actions are sampled from a uniform distribution \\( U(a) \\), the model recursively predicts their outcomes over a finite horizon, \\( \tau \\), and the controller selects the actions with the highest expected reward (see [this paper](https://arxiv.org/pdf/2002.04523), [this paper](https://arxiv.org/pdf/2012.09156.pdf), or [this paper](https://arxiv.org/pdf/2009.01221.pdf)).
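One simple form of sample-based MPC is often called random shooting; a hedged sketch follows. The stand-in `model`, the quadratic `reward`, the horizon, and the candidate count are all assumptions chosen for this example, and the convention of executing only the first action of the best sequence before replanning is a common choice rather than something the text above prescribes.

```python
import random

def model(s, a):
    # Stand-in for a learned dynamics model f_theta (hypothetical coefficients).
    return 0.9 * s + 0.5 * a

def reward(s, a, target=1.0):
    # Hypothetical reward r(s_t, a_t): stay near the target with small actions.
    return -(s - target) ** 2 - 0.01 * a ** 2

def plan(s0, rng, horizon=10, n_candidates=300):
    # Sample action sequences from U(-1, 1), roll each out through the model,
    # and return the first action of the sequence with the highest predicted return.
    best_return, best_first_action = float("-inf"), 0.0
    for _ in range(n_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = s0, 0.0
        for a in actions:
            total += reward(s, a)  # score the current predicted state and action
            s = model(s, a)        # recursive one-step prediction
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action

def control(s0=0.0, steps=40, seed=0):
    # Closed loop: replan at every step and execute only the first action.
    rng = random.Random(seed)
    s = s0
    for _ in range(steps):
        a = plan(s, rng)
        s = model(s, a)  # assumption: the real environment matches the model
    return s
```

Running `control()` should drive the state from 0 toward the target of 1. In a real setting the environment would differ from the model, which is why replanning at every step matters.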
## Further reading
For more information on MBRL, we recommend the following resources:
1. A [recent review paper on MBRL (long)](https://arxiv.org/abs/2006.16712)
2. A [blog post on debugging MBRL](https://www.natolambert.com/writing/debugging-mbrl)