mirror of
https://github.com/huggingface/deep-rl-class.git
synced 2026-04-01 17:51:01 +08:00
Add bonus 3 unit
@@ -116,14 +116,16 @@
    title: Introduction
  - local: unitbonus3/model-based
    title: Model-Based Reinforcement Learning
  - local: unitbonus3/decision-transformers
    title: Decision Transformers and Offline RL
  - local: unitbonus3/offline-online
    title: Offline vs. Online Reinforcement Learning
  - local: unitbonus3/rlhf
    title: Reinforcement Learning from Human Feedback
  - local: unitbonus3/minerl
    title: MineRL
  - local: unitbonus3/decision-transformers
    title: Decision Transformers and Offline RL
  - local: unitbonus3/language-models
    title: Language models in RL
  - local: unitbonus3/envs-to-try
    title: Interesting Environments to try
- title: What's next? New Units Publishing Schedule
  sections:
  - local: communication/publishing-schedule
@@ -2,7 +2,8 @@

The Decision Transformer model was introduced by ["Decision Transformer: Reinforcement Learning via Sequence Modeling" by Chen L. et al.](https://arxiv.org/abs/2106.01345). It abstracts Reinforcement Learning as a conditional-sequence modeling problem.

The main idea is that instead of training a policy using RL methods, such as fitting a value function, that will tell us what action to take to maximize the return (cumulative reward), we use a sequence modeling algorithm (Transformer) that, given a desired return, past states, and actions, will generate future actions to achieve this desired return. It’s an autoregressive model conditioned on the desired return, past states, and actions to generate future actions that achieve the desired return.

The main idea is that instead of training a policy using RL methods, such as fitting a value function, that will tell us what action to take to maximize the return (cumulative reward), **we use a sequence modeling algorithm (Transformer) that, given a desired return, past states, and actions, will generate future actions to achieve this desired return**.

It’s an autoregressive model conditioned on the desired return, past states, and actions to generate future actions that achieve the desired return.

This is a complete shift in the Reinforcement Learning paradigm since we use generative trajectory modeling (modeling the joint distribution of the sequence of states, actions, and rewards) to replace conventional RL algorithms. It means that in Decision Transformers, we don’t maximize the return but rather generate a series of future actions that achieve the desired return.
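Concretely, at inference time the model is rolled out autoregressively: we pick the return we *want*, and after each environment step we subtract the reward actually obtained from the remaining return-to-go before predicting the next action. Here is a minimal, runnable sketch of that conditioning loop; the model and environment are hypothetical stand-ins (a random stub and a toy 1-D world), not the real Decision Transformer or half-cheetah setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_model(returns_to_go, states, actions):
    """Stand-in for a trained Decision Transformer: given the history of
    returns-to-go, states, and actions, predict the next action.
    Here it is just a random stub so the loop is runnable."""
    return float(rng.uniform(-1.0, 1.0))

def toy_env_step(state, action):
    """Stand-in environment: reward is higher when the action tracks the state."""
    next_state = state + 0.1 * action
    reward = 1.0 - abs(action - state)
    return next_state, reward

# Autoregressive rollout conditioned on the return we *want* to achieve.
target_return = 10.0
state = 0.5
returns_to_go, states, actions = [target_return], [state], []

for t in range(20):
    action = toy_model(returns_to_go, states, actions)
    state, reward = toy_env_step(state, action)
    actions.append(action)
    states.append(state)
    # Key trick: condition the next prediction on what is *left* to achieve.
    returns_to_go.append(returns_to_go[-1] - reward)

print(f"remaining return-to-go after rollout: {returns_to_go[-1]:.2f}")
```

With a real trained model, a rollout that tracks the environment well drives the return-to-go toward zero, which is exactly the conditioning signal the Transformer was trained on.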
@@ -16,6 +17,11 @@ To learn more about Decision Transformers, you should read the blogpost we wrote

Now that you understand how Decision Transformers work, thanks to [Introducing Decision Transformers on Hugging Face](https://huggingface.co/blog/decision-transformers), you’re ready to learn to train your first Offline Decision Transformer model from scratch to make a half-cheetah run.

TODO: Add half cheetah video

Start the tutorial here 👉 https://huggingface.co/blog/train-decision-transformers

## Further reading

For more information, we recommend you check out the following resources:

- [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345)
- [Online Decision Transformer](https://arxiv.org/abs/2202.05607)
45 units/en/unitbonus3/envs-to-try.mdx Normal file
@@ -0,0 +1,45 @@

# Interesting Environments to try

We provide here a list of interesting environments you can try to train your agents on:

## MineRL

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit12/minerl.jpg" alt="MineRL"/>

MineRL is a Python library that provides a Gym interface for interacting with the video game Minecraft, accompanied by datasets of human gameplay.
Every year, there are challenges with this library. Check the [website](https://minerl.io/).

To start using this environment, check these resources:
- [What is MineRL?](https://www.youtube.com/watch?v=z6PTrGifupU)
- [First steps in MineRL](https://www.youtube.com/watch?v=8yIrWcyWGek)
- [MineRL documentation and tutorials](https://minerl.readthedocs.io/en/latest/)

## DonkeyCar Simulator

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit12/donkeycar.jpg" alt="Donkey Car"/>

Donkey is a self-driving car platform for hobby remote-control cars.
This simulator version is built on the Unity game platform. It uses Unity's internal physics and graphics, and connects to a donkey Python process that uses our trained model to control the simulated Donkey (car).

To start using this environment, check these resources:
- [DonkeyCar Simulator documentation](https://docs.donkeycar.com/guide/deep_learning/simulator/)
- [Learn to Drive Smoothly (Antonin Raffin's tutorial) Part 1](https://www.youtube.com/watch?v=ngK33h00iBE)
- [Learn to Drive Smoothly (Antonin Raffin's tutorial) Part 2](https://www.youtube.com/watch?v=DUqssFvcSOY)
- [Learn to Drive Smoothly (Antonin Raffin's tutorial) Part 3](https://www.youtube.com/watch?v=v8j2bpcE4Rg)
- Pretrained agents:
  - https://huggingface.co/araffin/tqc-donkey-mountain-track-v0
  - https://huggingface.co/araffin/tqc-donkey-avc-sparkfun-v0
  - https://huggingface.co/araffin/tqc-donkey-minimonaco-track-v0

## Starcraft II

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit12/alphastar.jpg" alt="Alphastar"/>

Starcraft II is a famous *real-time strategy game*. This game has been used by DeepMind for their Deep Reinforcement Learning research with [AlphaStar](https://www.deepmind.com/blog/alphastar-mastering-the-real-time-strategy-game-starcraft-ii).

To start using this environment, check these resources:
- [Starcraft gym](http://starcraftgym.com/)
- [A. I. Learns to Play Starcraft 2 (Reinforcement Learning) tutorial](https://www.youtube.com/watch?v=q59wap1ELQ4)
@@ -1,8 +1,9 @@

# Introduction

TODO: Add thumbnail
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit12/thumbnail.png" alt="Unit bonus 3 thumbnail"/>

Thanks to this course, you now have a solid background in Deep Reinforcement Learning. But this is a vast topic.
In this optional unit we **give you some resources to go deeper into multiple concepts and research topics in Reinforcement Learning**.

Sounds fun? Let's get started,
Congratulations on finishing this course! **You now have a solid background in Deep Reinforcement Learning**.
But this course was just the beginning of your Deep Reinforcement Learning journey; there are so many subtopics to discover. In this optional unit, we **give you some resources to go deeper into multiple concepts and research topics in Reinforcement Learning**.

Sounds fun? Let's get started 🔥,
@@ -1,3 +1,7 @@

# Language models in RL

Clément

## Further reading

For more information, we recommend you check out the following resources:

@@ -1 +0,0 @@

# MineRL
@@ -1,26 +1,28 @@

# Model Based Reinforcement Learning
# Model Based Reinforcement Learning (MBRL)

# Model-based reinforcement learning (MBRL)

Model-based reinforcement learning only differs from its model-free counterpart in the learning of a *dynamics model*, but that has substantial downstream effects on how the decisions are made.

Model-based reinforcement learning only differs from its model-free counterpart in the learning of a *dynamics model*, but that has substantial downstream effects on how the decisions are made.
The dynamics models most often model the environment transition dynamics, \\( s_{t+1} = f_\theta (s_t, a_t) \\), but things like inverse dynamics models (mapping from states to actions) or reward models (predicting rewards) can be used in this framework.

**Simple version**:

There is an agent that repeatedly tries to solve a problem, accumulating state and action data.
With that data, the agent creates a structured learning tool -- a dynamics model -- to reason about the world.
With the dynamics model, the agent decides how to act by predicting into the future.
With those actions, the agent collects more data, improves said model, and hopefully improves future actions.

## Simple definition

- There is an agent that repeatedly tries to solve a problem, **accumulating state and action data**.
- With that data, the agent creates a structured learning tool, *a dynamics model*, to reason about the world.
- With the dynamics model, the agent **decides how to act by predicting into the future**.
- With those actions, **the agent collects more data, improves said model, and hopefully improves future actions**.

## Academic definition

Model-based reinforcement learning (MBRL) follows the framework of an agent interacting in an environment, **learning a model of said environment**, and then **leveraging the model for control (making decisions)**.

Specifically, the agent acts in a Markov Decision Process (MDP) governed by a transition function \\( s_{t+1} = f (s_t , a_t) \\) and returns a reward at each step \\( r(s_t, a_t) \\). With a collected dataset \\( D := \\{ s_i, a_i, s_{i+1}, r_i \\} \\), the agent learns a model, \\( s_{t+1} = f_\theta (s_t , a_t) \\), **to minimize the negative log-likelihood of the transitions**.

**Academic version**:

Model-based reinforcement learning (MBRL) follows the framework of an agent interacting in an environment, learning a model of said environment, and then leveraging the model for control.
Specifically, the agent acts in a Markov Decision Process (MDP) governed by a transition function \\( s_{t+1} = f (s_t , a_t) \\) and returns a reward at each step \\( r(s_t, a_t) \\). With a collected dataset \\( D := \\{ s_i, a_i, s_{i+1}, r_i \\} \\), the agent learns a model, \\( s_{t+1} = f_\theta (s_t , a_t) \\), to minimize the negative log-likelihood of the transitions.
We employ sample-based model-predictive control (MPC) using the learned dynamics model, which optimizes the expected reward over a finite, recursively predicted horizon, \\( \tau \\), from a set of actions sampled from a uniform distribution \\( U(a) \\) (see [paper](https://arxiv.org/pdf/2002.04523) or [paper](https://arxiv.org/pdf/2012.09156.pdf) or [paper](https://arxiv.org/pdf/2009.01221.pdf)).
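The sample-based MPC loop described above can be sketched in a few lines: sample candidate action sequences from a uniform distribution, roll each one forward through the dynamics model over a short horizon, sum the predicted rewards, and execute only the first action of the best sequence. This is a toy 1-D illustration with hand-written stand-ins for the learned dynamics and reward models, not the implementation from any of the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamics(state, action):
    # Stand-in for a learned model f_theta: a 1-D point moved by the action.
    return state + 0.1 * action

def reward(state, action):
    # Stand-in reward model: being close to the goal at 1.0 is good.
    return -abs(state - 1.0)

def mpc_action(state, horizon=10, n_samples=256):
    """Sample-based MPC: score random action sequences by predicted
    cumulative reward, return the first action of the best sequence."""
    sequences = rng.uniform(-1.0, 1.0, size=(n_samples, horizon))
    best_score, best_first_action = -np.inf, 0.0
    for seq in sequences:
        s, total = state, 0.0
        for a in seq:
            total += reward(s, a)
            s = dynamics(s, a)  # recursively predict the next state
        if total > best_score:
            best_score, best_first_action = total, seq[0]
    return best_first_action

# Replan at every step ("receding horizon"): only the first action is executed.
state = 0.0
for _ in range(30):
    state = dynamics(state, mpc_action(state))
print(f"final state: {state:.2f}")
```

Replanning from scratch at every step is what makes MPC robust to model error: mistakes in long-horizon predictions only affect which *first* action gets picked, and they are corrected on the next replan.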
## Further reading

For more information on MBRL, we recommend you check out the following resources.

1. A [recent review paper on MBRL (long)](https://arxiv.org/abs/2006.16712)
2. A [blog post on debugging MBRL](https://www.natolambert.com/writing/debugging-mbrl)

For more information on MBRL, we recommend you check out the following resources:

- A [blog post on debugging MBRL](https://www.natolambert.com/writing/debugging-mbrl)
- A [recent review paper on MBRL](https://arxiv.org/abs/2006.16712)
33 units/en/unitbonus3/offline-online.mdx Normal file
@@ -0,0 +1,33 @@

# Offline vs. Online Reinforcement Learning

Deep Reinforcement Learning (RL) is a framework **to build decision-making agents**. These agents aim to learn optimal behavior (policy) by interacting with the environment through **trial and error, receiving rewards as unique feedback**.

The agent’s goal **is to maximize its cumulative reward**, called the return. This is because RL is based on the *reward hypothesis*: all goals can be described as the **maximization of the expected cumulative reward**.

Deep Reinforcement Learning agents **learn with batches of experience**. The question is: how do they collect it?

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit12/offlinevsonlinerl.gif" alt="Unit bonus 3 thumbnail">
<figcaption>A comparison between Reinforcement Learning in an Online and Offline setting, figure taken from <a href="https://offline-rl.github.io/">this post</a></figcaption>
</figure>

- In *online reinforcement learning*, the agent **gathers data directly**: it collects a batch of experience by **interacting with the environment**. Then, it uses this experience immediately (or via some replay buffer) to learn from it (update its policy).

But this implies that you either **train your agent directly in the real world or have a simulator**. If you don’t have one, you need to build it, which can be very complex (how do you reflect the complex reality of the real world in an environment?), expensive, and insecure: if the simulator has flaws, the agent will exploit them whenever they provide a competitive advantage.

- On the other hand, in *offline reinforcement learning*, the agent only **uses data collected from other agents or human demonstrations**. It does **not interact with the environment**.

The process is as follows:
- **Create a dataset** using one or more policies and/or human interactions.
- Run **offline RL on this dataset** to learn a policy.
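Those two steps can be made concrete with a tiny fixed dataset and tabular Q-learning that never touches the environment. The chain MDP and its logged transitions below are hypothetical, just to illustrate the recipe:

```python
import random
from collections import defaultdict

random.seed(0)

# Step 1: a fixed dataset of (state, action, reward, next_state, done)
# transitions, as if logged from other agents on a 4-state chain MDP
# where action +1 moves right and reaching state 3 gives reward 1.
dataset = []
for _ in range(200):
    s = random.randint(0, 2)
    a = random.choice([-1, 1])
    s_next = min(max(s + a, 0), 3)
    r = 1.0 if s_next == 3 else 0.0
    dataset.append((s, a, r, s_next, s_next == 3))

# Step 2: offline Q-learning -- every update comes from the frozen
# dataset; there is no new interaction with the environment.
Q = defaultdict(float)
gamma, lr = 0.9, 0.1
for _ in range(50):  # repeated passes over the same fixed dataset
    for s, a, r, s_next, done in dataset:
        target = r if done else r + gamma * max(Q[(s_next, -1)], Q[(s_next, 1)])
        Q[(s, a)] += lr * (target - Q[(s, a)])

policy = {s: max([-1, 1], key=lambda a: Q[(s, a)]) for s in range(3)}
print(policy)
```

Note that this toy dataset happens to cover every state-action pair; the counterfactual-queries problem described next arises precisely when it does not.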
This method has one drawback: the *counterfactual queries problem*. What do we do if our agent **decides to do something for which we don’t have the data?** For instance, turning right at an intersection when we don’t have this trajectory in the dataset.

Some solutions to this problem already exist, but if you want to know more about offline reinforcement learning, you can [watch this video](https://www.youtube.com/watch?v=k08N5a0gG0A).

## Further reading

For more information, we recommend you check out the following resources:

- [Offline Reinforcement Learning, Talk by Sergey Levine](https://www.youtube.com/watch?v=qgZPZREor5I)
- [Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems](https://arxiv.org/abs/2005.01643)
@@ -1,9 +1,11 @@

# RLHF

Reinforcement learning from human feedback (RLHF) is a methodology for integrating human data labels into an RL-based optimization process.
It is motivated by the challenge of modeling human preferences.
Reinforcement learning from human feedback (RLHF) is a **methodology for integrating human data labels into an RL-based optimization process**.
It is motivated by the **challenge of modeling human preferences**.

For many questions, even if you could try to write down an equation for one ideal, humans differ in their preferences.
Updating models based on measured data is an avenue to try to alleviate these inherently human ML problems.

Updating models **based on measured data is an avenue to try to alleviate these inherently human ML problems**.

## Start Learning about RLHF