mirror of
https://github.com/huggingface/deep-rl-class.git
synced 2026-06-14 22:17:15 +08:00
Removed Unit 2-3 will be published tomorrow
@@ -14,7 +14,7 @@
|
||||
"source": [
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit2/thumbnail.png\" alt=\"Bonus Unit 1Thumbnail\">\n",
|
||||
"\n",
|
||||
"In this notebook, we'll reinforce what we learn in the first Unit by **teaching Huggy the Dog to fetch the stick and then play with it directly in your browser**\n",
|
||||
"In this notebook, we'll reinforce what we learned in the first Unit by **teaching Huggy the Dog to fetch the stick and then play with it directly in your browser**\n",
|
||||
"\n",
|
||||
"⬇️ Here is an example of what **you will achieve at the end of the unit.** ⬇️ (launch ▶ to see)"
|
||||
],
|
||||
@@ -120,6 +120,29 @@
|
||||
"id": "6r7Hl0uywFSO"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"## Set the GPU 💪\n",
|
||||
"- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`\n",
|
||||
"\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step1.jpg\" alt=\"GPU Step 1\">"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "DssdIjk_8vZE"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"- `Hardware Accelerator > GPU`\n",
|
||||
"\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step2.jpg\" alt=\"GPU Step 2\">"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "sTfCXHy68xBv"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
@@ -282,11 +305,11 @@
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"## Check the Huggy config file\n",
|
||||
"## Check the Huggy config file\n",
|
||||
"\n",
|
||||
"- In ML-Agents, you define the **training hyperparameters into config.yaml files.**\n",
|
||||
"\n",
|
||||
"- For the scope of this notebook, we're not going to modify the hyperparameters but if you want to try as an experimentation, you should also try to modify some other hyperparameters, Unity provides a very [good documentation explaining each of them here](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Training-Configuration-File.md)."
|
||||
"- For the scope of this notebook, we're not going to modify the hyperparameters, but if you want to try as an experiment, you should also try to modify some other hyperparameters, Unity provides very [good documentation explaining each of them here](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Training-Configuration-File.md)."
|
||||
],
|
||||
"metadata": {
|
||||
"id": "NAuEq32Mwvtz"
|
||||
@@ -295,7 +318,7 @@
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"- Click here to open the config.yaml: `/content/ml-agents/config/ppo/Huggy.yaml`\n",
|
||||
"- **In the case you want to modify the hyperparameters**, in Google Colab notebook, you can click here to open the config.yaml: `/content/ml-agents/config/ppo/Huggy.yaml`\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"We’re now ready to train our agent 🔥."
|
||||
@@ -316,7 +339,7 @@
|
||||
"\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/mllearn.png\" alt=\"ml learn function\" width=\"100%\">\n",
|
||||
"\n",
|
||||
"We define four parameters:\n",
|
||||
"With ML Agents, we run a training script. We define four parameters:\n",
|
||||
"\n",
|
||||
"1. `mlagents-learn <config>`: the path where the hyperparameter config file is.\n",
|
||||
"2. `--env`: where the environment executable is.\n",
|
||||
@@ -332,7 +355,7 @@
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"The training will take 30 to 45min depending on your machine, go take a ☕️you deserve it 🤗."
|
||||
"The training will take 30 to 45min depending on your machine (don't forget to **set up a GPU**), go take a ☕️you deserve it 🤗."
|
||||
],
|
||||
"metadata": {
|
||||
"id": "lN32oWF8zPjs"
|
||||
@@ -448,7 +471,7 @@
|
||||
"Your model is pushed to the hub. You can view your model here: https://huggingface.co/ThomasSimonini/ppo-Huggy\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"It’s the link to your model, it contains a model card that explains how to use it, your Tensorboard and your config file. **What’s awesome is that it’s a git repository, that means you can have different commits, update your repository with a new push etc.**\n",
|
||||
"It’s the link to your model repository. The repository contains a model card that explains how to use the model, your Tensorboard logs and your config file. **What’s awesome is that it’s a git repository, which means you can have different commits, update your repository with a new push, open Pull Requests, etc.**\n",
|
||||
"\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/modelcard.png\" alt=\"ml learn function\" width=\"100%\">"
|
||||
],
|
||||
@@ -470,7 +493,7 @@
|
||||
"source": [
|
||||
"## Play with your Huggy 🐕\n",
|
||||
"\n",
|
||||
"For this step it’s simple:\n",
|
||||
"This step is the simplest:\n",
|
||||
"\n",
|
||||
"- Open the game Huggy in your browser: https://huggingface.co/spaces/ThomasSimonini/Huggy\n",
|
||||
"\n",
|
||||
@@ -488,10 +511,10 @@
|
||||
"1. In step 1, choose your model repository which is the model id (in my case ThomasSimonini/ppo-Huggy).\n",
|
||||
"\n",
|
||||
"2. In step 2, **choose what model you want to replay**:\n",
|
||||
" - I have multiple one, since we saved a model every 500000 timesteps. \n",
|
||||
" - But if I want the more recent I choose Huggy.onnx\n",
|
||||
" - I have multiple ones, since we saved a model every 500000 timesteps. \n",
|
||||
" - But since I want the more recent, I choose `Huggy.onnx`\n",
|
||||
"\n",
|
||||
"👉 What’s nice **is to try with different models step to see the improvement of the agent.**"
|
||||
"👉 What’s nice **is to try with different models steps to see the improvement of the agent.**"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "Djs8c5rR0Z8a"
|
||||
|
||||
@@ -1,256 +0,0 @@
|
||||
# Bonus Unit 1: Let's train Huggy the Dog 🐶 to fetch a stick
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit2/thumbnail.png" alt="Bonus Unit 1Thumbnail">
|
||||
|
||||
In this notebook, we'll reinforce what we learned in the first Unit by **teaching Huggy the Dog to fetch the stick and then playing with him directly in your browser**.
|
||||
|
||||
⬇️ Here is an example of what **you will achieve at the end of the unit.** ⬇️ (launch ▶ to see)
|
||||
|
||||
```python
|
||||
%%html
|
||||
<video controls autoplay><source src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/huggy.mp4" type="video/mp4"></video>
|
||||
```
|
||||
|
||||
### The environment 🎮
|
||||
|
||||
- Huggy the Dog, an environment created by [Thomas Simonini](https://twitter.com/ThomasSimonini) based on [Puppo The Corgi](https://blog.unity.com/technology/puppo-the-corgi-cuteness-overload-with-the-unity-ml-agents-toolkit)
|
||||
|
||||
### The library used 📚
|
||||
|
||||
- [MLAgents (Hugging Face version)](https://github.com/huggingface/ml-agents)
|
||||
|
||||
We're constantly trying to improve our tutorials, so **if you find any issues in this notebook**, please [open an issue on the GitHub repo](https://github.com/huggingface/deep-rl-class/issues).
|
||||
|
||||
## Objectives of this notebook 🏆
|
||||
|
||||
At the end of the notebook, you will:
|
||||
|
||||
- Understand **the state space, action space and reward function used to train Huggy**.
|
||||
- **Train your own Huggy** to fetch the stick.
|
||||
- Be able to play **with your trained Huggy directly in your browser**.
|
||||
|
||||
|
||||
|
||||
|
||||
## This notebook is from Deep Reinforcement Learning Course
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/deep-rl-course-illustration.jpg" alt="Deep RL Course illustration"/>
|
||||
|
||||
In this free course, you will:
|
||||
|
||||
- 📖 Study Deep Reinforcement Learning in **theory and practice**.
|
||||
- 🧑‍💻 Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.
|
||||
- 🤖 Train **agents in unique environments**
|
||||
|
||||
And more! Check 📚 the syllabus 👉 https://simoninithomas.github.io/deep-rl-course
|
||||
|
||||
Don’t forget to **<a href="http://eepurl.com/ic5ZUD">sign up to the course</a>** (we collect your email so that we can **send you the links when each Unit is published and give you information about the challenges and updates**).
|
||||
|
||||
|
||||
The best way to keep in touch is to join our Discord server to exchange with the community and with us 👉🏻 https://discord.gg/ydHrjt3WP5
|
||||
|
||||
## Prerequisites 🏗️
|
||||
|
||||
Before diving into the notebook, you need to:
|
||||
|
||||
🔲 📚 **Develop an understanding of the foundations of Reinforcement Learning** (MC, TD, the reward hypothesis...) by doing Unit 1
|
||||
|
||||
🔲 📚 **Read the introduction to Huggy** by doing Bonus Unit 1
|
||||
|
||||
## Clone the repository and install the dependencies 🔽
|
||||
|
||||
- We need to clone the repository, which **contains the experimental version of the library that allows you to push your trained agent to the Hub.**
|
||||
|
||||
```python
|
||||
%%capture
|
||||
# Clone this specific repository (can take 3min)
|
||||
!git clone https://github.com/huggingface/ml-agents/
|
||||
```
|
||||
|
||||
```python
|
||||
%%capture
|
||||
# Go inside the repository and install the package (can take 3min)
|
||||
%cd ml-agents
|
||||
!pip3 install -e ./ml-agents-envs
|
||||
!pip3 install -e ./ml-agents
|
||||
```
|
||||
|
||||
## Download and move the environment zip file to `./trained-envs-executables/linux/`
|
||||
|
||||
- Our environment executable is in a zip file.
|
||||
- We need to download it and place it in `./trained-envs-executables/linux/`
|
||||
|
||||
```python
|
||||
# Create the target folder and its parent in one step (-p also avoids an error if they already exist)
!mkdir -p ./trained-envs-executables/linux
|
||||
```
|
||||
|
||||
```python
|
||||
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1zv3M95ZJTWHUVOWT6ckq_cm98nft8gdF' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1zv3M95ZJTWHUVOWT6ckq_cm98nft8gdF" -O ./trained-envs-executables/linux/Huggy.zip && rm -rf /tmp/cookies.txt
|
||||
```
|
||||
|
||||
The cell above downloads the file Huggy.zip from https://drive.google.com/uc?export=download&id=1zv3M95ZJTWHUVOWT6ckq_cm98nft8gdF using `wget`. The full approach for downloading large files from Google Drive is explained [here](https://bcrf.biochem.wisc.edu/2021/02/05/download-google-drive-files-using-wget/).
|
||||
|
||||
```python
|
||||
%%capture
|
||||
!unzip -d ./trained-envs-executables/linux/ ./trained-envs-executables/linux/Huggy.zip
|
||||
```
|
||||
|
||||
Make sure the environment executable is accessible:
|
||||
|
||||
```python
|
||||
!chmod -R 755 ./trained-envs-executables/linux/Huggy
|
||||
```
|
||||
|
||||
## Let's recap how this environment works
|
||||
|
||||
### The State Space: what Huggy "perceives."
|
||||
|
||||
Huggy doesn't "see" his environment. Instead, we provide him information about the environment:
|
||||
|
||||
- The target (stick) position
|
||||
- The relative position between himself and the target
|
||||
- The orientation of his legs.
|
||||
|
||||
Given all this information, Huggy **can decide which action to take next to fulfill his goal**.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/huggy.jpg" alt="Huggy" width="100%">
|
||||
|
||||
|
||||
### The Action Space: what moves Huggy can do
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/huggy-action.jpg" alt="Huggy action" width="100%">
|
||||
|
||||
**Joint motors drive Huggy's legs**. This means that to reach the target, Huggy needs to **learn to rotate the joint motors of each of his legs correctly so he can move**.
|
||||
|
||||
### The Reward Function
|
||||
|
||||
The reward function is designed so that **Huggy will fulfill his goal**: fetching the stick.
|
||||
|
||||
Remember that one of the foundations of Reinforcement Learning is the *reward hypothesis*: a goal can be described as the **maximization of the expected cumulative reward**.
|
||||
|
||||
Here, our goal is that Huggy **goes towards the stick but without spinning too much**. Hence, our reward function must translate this goal.
|
||||
|
||||
Our reward function:
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/reward.jpg" alt="Huggy reward function" width="100%">
|
||||
|
||||
- *Orientation bonus*: we **reward him for getting close to the target**.
|
||||
- *Time penalty*: a fixed-time penalty given at every action to **force him to get to the stick as fast as possible**.
|
||||
- *Rotation penalty*: we penalize Huggy if **he spins too much and turns too quickly**.
|
||||
- *Getting to the target reward*: we reward Huggy for **reaching the target**.
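The four terms listed above can be combined into a single per-step reward. The sketch below is only an illustration of that structure: the function name, its arguments and every coefficient are hypothetical, and they are not the values used in the actual Huggy environment.

```python
def sketch_reward(progress_toward_target, turn_speed, reached_target, *,
                  time_penalty=0.01, turn_speed_limit=1.0,
                  rotation_penalty=0.05, target_reward=1.0):
    """Illustrative per-step reward with the four terms described above.

    All names and default values are assumptions made for this sketch.
    """
    # Bonus: positive when Huggy moved closer to the target during this step.
    reward = progress_toward_target
    # Time penalty: a fixed cost paid at every action.
    reward -= time_penalty
    # Rotation penalty: only applied when Huggy turns faster than the limit.
    if abs(turn_speed) > turn_speed_limit:
        reward -= rotation_penalty
    # Reward for reaching the target.
    if reached_target:
        reward += target_reward
    return reward


# Moving closer without spinning gives a small positive reward,
# while spinning on the spot gives a negative one.
print(sketch_reward(0.2, 0.0, False))
print(sketch_reward(0.0, 2.0, False))
```

Because the time penalty is paid at every step, an agent that maximizes the cumulative reward is pushed to reach the stick in as few actions as possible.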
|
||||
|
||||
## Check the Huggy config file
|
||||
|
||||
- In ML-Agents, you define the **training hyperparameters in config.yaml files.**
|
||||
|
||||
- For the scope of this notebook, we're not going to modify the hyperparameters, but if you want to experiment, you can try changing some of them. Unity provides very [good documentation explaining each of them here](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Training-Configuration-File.md).
|
||||
|
||||
- If you want to modify the hyperparameters in Google Colab, open the config file at `/content/ml-agents/config/ppo/Huggy.yaml`
|
||||
|
||||
|
||||
We’re now ready to train our agent 🔥.
|
||||
|
||||
## Train our agent
|
||||
|
||||
To train our agent, we just need to **launch mlagents-learn and select the executable containing the environment.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/mllearn.png" alt="ml learn function" width="100%">
|
||||
|
||||
We define four parameters:
|
||||
|
||||
1. `mlagents-learn <config>`: the path to the hyperparameter config file.
2. `--env`: the path to the environment executable.
3. `--run-id`: the name you want to give to your training run.
4. `--no-graphics`: do not launch the visualization during training.
|
||||
|
||||
Train the model and use the `--resume` flag to continue training in case of interruption.
|
||||
|
||||
> The first time you use `--resume`, the command will fail; run the block again to bypass the error.
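To make the role of each parameter concrete, here is a minimal sketch that assembles the command line from the four parameters above plus the optional `--resume` flag. The helper `build_mlagents_learn_command` is hypothetical and not part of ML-Agents; it only builds the argument list. In the notebook itself you would simply append `--resume` to the `!mlagents-learn` cell.

```python
import shlex


def build_mlagents_learn_command(config_path, env_path, run_id,
                                 no_graphics=True, resume=False):
    """Hypothetical helper: returns the mlagents-learn arguments as a list."""
    args = [
        "mlagents-learn",
        config_path,              # path to the hyperparameter config file
        f"--env={env_path}",      # path to the environment executable
        f"--run-id={run_id}",     # name of the training run
    ]
    if no_graphics:
        args.append("--no-graphics")  # no visualization during training
    if resume:
        args.append("--resume")       # continue an interrupted run
    return args


command = build_mlagents_learn_command(
    "./config/ppo/Huggy.yaml",
    "./trained-envs-executables/linux/Huggy/Huggy",
    "Huggy",
    resume=True,
)
# Print the command as a single shell-ready string.
print(shlex.join(command))
```

Keeping the same run id is what lets `--resume` pick up the interrupted run.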
|
||||
|
||||
|
||||
|
||||
Training takes 30 to 45 minutes depending on your machine. Go grab a ☕️, you deserve it 🤗.
|
||||
|
||||
```python
|
||||
!mlagents-learn ./config/ppo/Huggy.yaml --env=./trained-envs-executables/linux/Huggy/Huggy --run-id="Huggy" --no-graphics
|
||||
```
|
||||
|
||||
## Push the agent to the 🤗 Hub
|
||||
|
||||
- Now that we've trained our agent, we’re **ready to push it to the Hub so that you can play with Huggy in your browser 🔥.**
|
||||
|
||||
To be able to share your model with the community there are three more steps to follow:
|
||||
|
||||
1️⃣ (If it's not already done) create a Hugging Face account ➡ https://huggingface.co/join
|
||||
|
||||
2️⃣ Sign in, then store your authentication token from the Hugging Face website.
|
||||
- Create a new token (https://huggingface.co/settings/tokens) **with write role**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">
|
||||
|
||||
- Copy the token
|
||||
- Run the cell below and paste the token
|
||||
|
||||
```python
|
||||
from huggingface_hub import notebook_login
|
||||
|
||||
notebook_login()
|
||||
```
|
||||
|
||||
If you aren't using Google Colab or a Jupyter Notebook, use this command instead: `huggingface-cli login`
|
||||
|
||||
Then, we simply need to run `mlagents-push-to-hf`.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/mlpush.png" alt="ml learn function" width="100%">
|
||||
|
||||
And we define 4 parameters:
|
||||
|
||||
1. `--run-id`: the name of the training run id.
|
||||
2. `--local-dir`: where the agent was saved. It’s `results/<run_id name>`, so in my case `results/First Training`.
|
||||
3. `--repo-id`: the name of the Hugging Face repo you want to create or update. It’s always `<your huggingface username>/<the repo name>`. If the repo does not exist, **it will be created automatically**.
|
||||
4. `--commit-message`: since HF repos are git repositories, you need to define a commit message.
|
||||
|
||||
```python
|
||||
!mlagents-push-to-hf --run-id="HuggyTraining" --local-dir="./results/Huggy" --repo-id="ThomasSimonini/ppo-Huggy" --commit-message="Huggy"
|
||||
```
|
||||
|
||||
If everything worked, you should see this at the end of the process (but with a different URL 😆):
|
||||
|
||||
|
||||
|
||||
```
|
||||
Your model is pushed to the hub. You can view your model here: https://huggingface.co/ThomasSimonini/ppo-Huggy
|
||||
```
|
||||
|
||||
It’s the link to your model repository. The repository contains a model card that explains how to use the model, your Tensorboard logs and your config file. **What’s awesome is that it’s a git repository, which means you can have different commits, update your repository with a new push, open Pull Requests, etc.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/modelcard.png" alt="ml learn function" width="100%">
|
||||
|
||||
But now comes the best part: **being able to play with Huggy online 👀.**
|
||||
|
||||
## Play with your Huggy 🐕
|
||||
|
||||
This step is the simplest:
|
||||
|
||||
- Open the game Huggy in your browser: https://huggingface.co/spaces/ThomasSimonini/Huggy
|
||||
|
||||
- Click on Play with my Huggy model
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/load-huggy.jpg" alt="load-huggy" width="100%">
|
||||
|
||||
1. In step 1, choose your model repository, which is the model id (in my case ThomasSimonini/ppo-Huggy).
|
||||
|
||||
2. In step 2, **choose what model you want to replay**:
|
||||
- I have multiple ones, since we saved a model every 500000 timesteps.
|
||||
- But since I want the most recent one, I choose `Huggy.onnx`
|
||||
|
||||
👉 What’s nice **is to try different model checkpoints to see the improvement of the agent.**
|
||||
|
||||
Congrats on finishing this bonus unit!
|
||||
|
||||
You can now sit and enjoy playing with your Huggy 🐶. And don't **forget to spread the love by sharing Huggy with your friends 🤗**. And if you share about it on social media, **please tag us @huggingface and me @simoninithomas**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/huggy-cover.jpeg" alt="Huggy cover" width="100%">
|
||||
|
||||
|
||||
## Keep Learning, Stay awesome 🤗
|
||||
@@ -44,57 +44,3 @@
|
||||
title: Play with Huggy
|
||||
- local: unitbonus1/conclusion
|
||||
title: Conclusion
|
||||
- title: Unit 2. Introduction to Q-Learning
|
||||
sections:
|
||||
- local: unit2/introduction
|
||||
title: Introduction
|
||||
- local: unit2/what-is-rl
|
||||
title: What is RL? A short recap
|
||||
- local: unit2/two-types-value-based-methods
|
||||
title: The two types of value-based methods
|
||||
- local: unit2/bellman-equation
|
||||
title: The Bellman Equation, simplify our value estimation
|
||||
- local: unit2/mc-vs-td
|
||||
title: Monte Carlo vs Temporal Difference Learning
|
||||
- local: unit2/summary1
|
||||
title: Summary
|
||||
- local: unit2/quiz1
|
||||
title: First Quiz
|
||||
- local: unit2/q-learning
|
||||
title: Introducing Q-Learning
|
||||
- local: unit2/q-learning-example
|
||||
title: A Q-Learning example
|
||||
- local: unit2/hands-on
|
||||
title: Hands-on
|
||||
- local: unit2/quiz2
|
||||
title: Second Quiz
|
||||
- local: unit2/conclusion
|
||||
title: Conclusion
|
||||
- local: unit2/additional-readings
|
||||
title: Additional Readings
|
||||
- title: Unit 3. Deep Q-Learning with Atari Games
|
||||
sections:
|
||||
- local: unit3/introduction
|
||||
title: Introduction
|
||||
- local: unit3/from-q-to-dqn
|
||||
title: From Q-Learning to Deep Q-Learning
|
||||
- local: unit3/deep-q-network
|
||||
title: The Deep Q-Network (DQN)
|
||||
- local: unit3/deep-q-algorithm
|
||||
title: The Deep Q Algorithm
|
||||
- local: unit3/hands-on
|
||||
title: Hands-on
|
||||
- local: unit3/quiz
|
||||
title: Quiz
|
||||
- local: unit3/conclusion
|
||||
title: Conclusion
|
||||
- local: unit3/additional-readings
|
||||
title: Additional Readings
|
||||
- title: Unit Bonus 2. Automatic Hyperparameter Tuning with Optuna
|
||||
sections:
|
||||
- local: unitbonus2/introduction
|
||||
title: Introduction
|
||||
- local: unitbonus2/optuna
|
||||
title: Optuna
|
||||
- local: unitbonus2/hands-on
|
||||
title: Hands-on
|
||||
|
||||
@@ -1,15 +0,0 @@
|
||||
# Additional Readings [[additional-readings]]
|
||||
|
||||
These are **optional readings** if you want to go deeper.
|
||||
|
||||
## Monte Carlo and TD Learning [[mc-td]]
|
||||
|
||||
To dive deeper into Monte Carlo and Temporal Difference Learning:
|
||||
|
||||
- <a href="https://stats.stackexchange.com/questions/355820/why-do-temporal-difference-td-methods-have-lower-variance-than-monte-carlo-met">Why do temporal difference (TD) methods have lower variance than Monte Carlo methods?</a>
|
||||
- <a href="https://stats.stackexchange.com/questions/336974/when-are-monte-carlo-methods-preferred-over-temporal-difference-ones"> When are Monte Carlo methods preferred over temporal difference ones?</a>
|
||||
|
||||
## Q-Learning [[q-learning]]
|
||||
|
||||
- <a href="http://incompleteideas.net/book/RLbook2020.pdf">Reinforcement Learning: An Introduction, Richard Sutton and Andrew G. Barto, Chapters 5, 6 and 7</a>
|
||||
- <a href="https://youtu.be/Psrhxy88zww">Foundations of Deep RL Series, L2 Deep Q-Learning by Pieter Abbeel</a>
|
||||
@@ -1,57 +0,0 @@
|
||||
# The Bellman Equation: simplify our value estimation [[bellman-equation]]
|
||||
|
||||
The Bellman equation **simplifies our state value or state-action value calculation.**
|
||||
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman.jpg" alt="Bellman equation"/>
|
||||
|
||||
With what we have learned so far, we know that to calculate \\(V(S_t)\\) (the value of a state), we need to calculate the return obtained by starting at that state and following the policy forever after. **(The policy we defined in the following example is a Greedy Policy; for simplification, we don't discount the reward).**
|
||||
|
||||
So to calculate \\(V(S_t)\\), we need to calculate the sum of the expected rewards. Hence:
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman2.jpg" alt="Bellman equation"/>
|
||||
<figcaption>To calculate the value of State 1: the sum of rewards if the agent started in that state and then followed the greedy policy (taking actions that lead to the best state values) for all the time steps.</figcaption>
|
||||
</figure>
|
||||
|
||||
Then, to calculate \\(V(S_{t+1})\\), we need to calculate the return starting at that state \\(S_{t+1}\\).
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman3.jpg" alt="Bellman equation"/>
|
||||
<figcaption>To calculate the value of State 2: the sum of rewards if the agent started in that state and then followed the policy for all the time steps.</figcaption>
|
||||
</figure>
|
||||
|
||||
So you see, that's a pretty tedious process if you need to do it for each state value or state-action value.
|
||||
|
||||
Instead of calculating the expected return for each state or each state-action pair, **we can use the Bellman equation.**
|
||||
|
||||
The Bellman equation is a recursive equation that works like this: instead of starting for each state from the beginning and calculating the return, we can consider the value of any state as:
|
||||
|
||||
**The immediate reward \\(R_{t+1}\\) + the discounted value of the state that follows ( \\(\gamma * V(S_{t+1}) \\) ).**
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman4.jpg" alt="Bellman equation"/>
|
||||
<figcaption>For simplification, here we don’t discount so gamma = 1.</figcaption>
|
||||
</figure>
|
||||
|
||||
|
||||
If we go back to our example, we can say that the value of State 1 is equal to the expected cumulative return if we start at that state.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman2.jpg" alt="Bellman equation"/>
|
||||
|
||||
|
||||
To calculate the value of State 1: the sum of rewards **if the agent started in that state 1** and then followed the **policy for all the time steps.**
|
||||
|
||||
This is equivalent to \\(V(S_{t})\\) = Immediate reward \\(R_{t+1}\\) + Discounted value of the next state \\(\gamma * V(S_{t+1})\\)
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman6.jpg" alt="Bellman equation"/>
|
||||
|
||||
|
||||
In the interest of simplicity, here we don't discount, so gamma = 1.
|
||||
|
||||
- The value of \\(V(S_{t+1}) \\) = Immediate reward \\(R_{t+2}\\) + Discounted value of the next state ( \\(\gamma * V(S_{t+2})\\) ).
|
||||
- And so on.
|
||||
|
||||
To recap, the idea of the Bellman equation is that instead of calculating each value as the sum of the expected return, **which is a long process**, we calculate the value as **the sum of the immediate reward + the discounted value of the state that follows.**
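As a small worked illustration of this recap, the sketch below compares the two ways of computing state values on a chain of states. It assumes a deterministic environment and a fixed policy, so the expectation reduces to a plain sum, and the reward sequence is made up for the example.

```python
def discounted_return(rewards, start, gamma=1.0):
    """Long way: sum every (discounted) reward collected from state `start` onward."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[start:]))


def bellman_values(rewards, gamma=1.0):
    """Bellman way: V(S_t) = R_{t+1} + gamma * V(S_{t+1}), computed backwards.

    rewards[t] is the reward received when leaving state t; the state reached
    after the last reward is terminal and has a value of 0.
    """
    values = [0.0] * (len(rewards) + 1)
    for t in reversed(range(len(rewards))):
        values[t] = rewards[t] + gamma * values[t + 1]
    return values[:-1]  # drop the terminal state


rewards = [1, 0, 0, 1, 1]  # hypothetical rewards along the chain

# No discounting, as in the figures (gamma = 1).
print(bellman_values(rewards))
print([discounted_return(rewards, t) for t in range(len(rewards))])

# The two computations also agree when rewards are discounted.
print(bellman_values(rewards, gamma=0.5))
```

The backward pass reuses the value already computed for the next state, which is why each state needs only one addition and one multiplication instead of a full sum over the rest of the episode.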
|
||||
|
||||
Before going to the next section, think about the role of gamma in the Bellman equation. What happens if the value of gamma is very low (e.g. 0.1 or even 0)? What happens if the value is 1? What happens if the value is very high, such as a million?
|
||||
@@ -1,19 +0,0 @@
|
||||
# Conclusion [[conclusion]]
|
||||
|
||||
Congrats on finishing this chapter! There was a lot of information. And congrats on finishing the tutorials. You’ve just implemented your first RL agent from scratch and shared it on the Hub 🥳.
|
||||
|
||||
Implementing from scratch when you study a new architecture **is important to understand how it works.**
|
||||
|
||||
It’s **normal if you still feel confused** by all these elements. **It was the same for me and for everyone who has studied RL.**
|
||||
|
||||
Take time to really grasp the material before continuing.
|
||||
|
||||
|
||||
In the next chapter, we’re going to dive deeper by studying our first Deep Reinforcement Learning algorithm based on Q-Learning: Deep Q-Learning. And you'll train a **DQN agent with <a href="https://github.com/DLR-RM/rl-baselines3-zoo">RL-Baselines3 Zoo</a> to play Atari Games**.
|
||||
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Atari environments"/>
|
||||
|
||||
|
||||
|
||||
### Keep Learning, stay awesome 🤗
|
||||
@@ -1,14 +0,0 @@
|
||||
# Hands-on [[hands-on]]
|
||||
|
||||
Now that we've studied the Q-Learning algorithm, let's implement it from scratch and train our Q-Learning agent in two environments:
|
||||
1. [Frozen-Lake-v1 (non-slippery and slippery version)](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/) ☃️ : where our agent will need to **go from the starting state (S) to the goal state (G)** by walking only on frozen tiles (F) and avoiding holes (H).
|
||||
2. [An autonomous taxi](https://www.gymlibrary.dev/environments/toy_text/taxi/) 🚖: where our agent will need **to learn to navigate** a city to **transport its passengers from point A to point B.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/envs.gif" alt="Environments"/>
|
||||
|
||||
Thanks to a [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard), you'll be able to compare your results with other classmates and exchange best practices to improve your agent's scores. Who will win the challenge for Unit 2?
|
||||
|
||||
|
||||
**To start the hands-on, click on the Open In Colab button** 👇 :
|
||||
|
||||
[]()
|
||||
@@ -1,26 +0,0 @@
|
||||
# Introduction to Q-Learning [[introduction-q-learning]]
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/thumbnail.jpg" alt="Unit 2 thumbnail" width="100%">
|
||||
|
||||
|
||||
In the first unit of this class, we learned about Reinforcement Learning (RL), the RL process, and the different methods to solve an RL problem. We also **trained our first agents and uploaded them to the Hugging Face Hub.**
|
||||
|
||||
In this unit, we're going to **dive deeper into one of the Reinforcement Learning methods: value-based methods** and study our first RL algorithm: **Q-Learning.**
|
||||
|
||||
We'll also **implement our first RL agent from scratch**, a Q-Learning agent, and will train it in two environments:
|
||||
|
||||
1. Frozen-Lake-v1 (non-slippery version): where our agent will need to **go from the starting state (S) to the goal state (G)** by walking only on frozen tiles (F) and avoiding holes (H).
|
||||
2. An autonomous taxi: where our agent will need **to learn to navigate** a city to **transport its passengers from point A to point B.**
|
||||
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/envs.gif" alt="Environments"/>
|
||||
|
||||
Concretely, we will:
|
||||
|
||||
- Learn about **value-based methods**.
|
||||
- Learn about the **differences between Monte Carlo and Temporal Difference Learning**.
|
||||
- Study and implement **our first RL algorithm**: Q-Learning.
|
||||
|
||||
This unit is **fundamental if you want to be able to work on Deep Q-Learning**: the first Deep RL algorithm that played Atari games and beat the human level on some of them (Breakout, Space Invaders…).
|
||||
|
||||
So let's get started! 🚀
|
||||
@@ -1,126 +0,0 @@
|
||||
# Monte Carlo vs Temporal Difference Learning [[mc-vs-td]]
|
||||
|
||||
The last thing we need to talk about before diving into Q-Learning is the two ways of learning.
|
||||
|
||||
Remember that an RL agent **learns by interacting with its environment.** The idea is that, **using the experience it collects** and the reward it gets, the agent will **update its value function or policy.**
|
||||
|
||||
Monte Carlo and Temporal Difference Learning are two different **strategies on how to train our value function or our policy function.** Both of them **use experience to solve the RL problem.**
|
||||
|
||||
On one hand, Monte Carlo uses **an entire episode of experience before learning.** On the other hand, Temporal Difference uses **only a step ( \\(S_t, A_t, R_{t+1}, S_{t+1}\\) ) to learn.**
|
||||
|
||||
We'll explain both of them **using a value-based method example.**
|
||||
|
||||
## Monte Carlo: learning at the end of the episode [[monte-carlo]]
|
||||
|
||||
Monte Carlo waits until the end of the episode, calculates \\(G_t\\) (return) and uses it as **a target for updating \\(V(S_t)\\).**
|
||||
|
||||
So it requires a **complete episode of interaction before updating our value function.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/monte-carlo-approach.jpg" alt="Monte Carlo"/>
|
||||
|
||||
|
||||
If we take an example:
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-2.jpg" alt="Monte Carlo"/>
|
||||
|
||||
|
||||
- We always start the episode **at the same starting point.**
|
||||
- **The agent takes actions using the policy**. For instance, using an Epsilon Greedy Strategy, a policy that alternates between exploration (random actions) and exploitation.
|
||||
- We get **the reward and the next state.**
|
||||
- We terminate the episode if the cat eats the mouse or if the mouse takes more than 10 steps.
|
||||
|
||||
- At the end of the episode, **we have a list of States, Actions, Rewards, and Next States**
|
||||
- **The agent will sum the total rewards \\(G_t\\)** (to see how well it did).
|
||||
- It will then **update \\(V(S_t)\\) based on the formula**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-3.jpg" alt="Monte Carlo"/>
|
||||
|
||||
- Then **start a new game with this new knowledge**
|
||||
|
||||
By running more and more episodes, **the agent will learn to play better and better.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-3p.jpg" alt="Monte Carlo"/>
|
||||
|
||||
For instance, if we train a state-value function using Monte Carlo:
|
||||
|
||||
- We just started to train our Value function, **so it returns 0 value for each state**
|
||||
- Our learning rate (lr) is 0.1 and our discount rate is 1 (= no discount)
|
||||
- Our mouse **explores the environment and takes random actions**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-4.jpg" alt="Monte Carlo"/>
|
||||
|
||||
|
||||
- The mouse took more than 10 steps, so the episode ends.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-4p.jpg" alt="Monte Carlo"/>
|
||||
|
||||
|
||||
- We have a list of states, actions, rewards, and next states; **we need to calculate the return \\(G_t\\)**
|
||||
- \\(G_t = R_{t+1} + R_{t+2} + R_{t+3}…\\) (for simplicity we don’t discount the rewards).
|
||||
- \\(G_t = 1 + 0 + 0 + 0+ 0 + 0 + 1 + 1 + 0 + 0\\)
|
||||
- \\(G_t= 3\\)
|
||||
- We can now update \\(V(S_0)\\):
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-5.jpg" alt="Monte Carlo"/>
|
||||
|
||||
- New \\(V(S_0) = V(S_0) + lr * [G_t - V(S_0)]\\)
|
||||
- New \\(V(S_0) = 0 + 0.1 * [3 - 0]\\)
|
||||
- New \\(V(S_0) = 0.3\\)
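The Monte Carlo update above can be reproduced in a few lines of Python. This is a minimal sketch, assuming the reward sequence listed in the example; the helper name `discounted_return` is hypothetical and not part of the course code.

```python
# Rewards collected during the episode in the example above.
rewards = [1, 0, 0, 0, 0, 0, 1, 1, 0, 0]
lr = 0.1      # learning rate
gamma = 1.0   # discount rate (1 = no discount)


def discounted_return(rewards, gamma):
    """Compute G_t for the first step by accumulating rewards backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


g_t = discounted_return(rewards, gamma)

# Monte Carlo update: V(S_0) <- V(S_0) + lr * (G_t - V(S_0))
v_s0 = 0.0
v_s0 = v_s0 + lr * (g_t - v_s0)

print(g_t, round(v_s0, 3))
```

With a `gamma` below 1, the same helper would weight later rewards less than earlier ones.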
|
||||
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-5p.jpg" alt="Monte Carlo"/>
|
||||
|
||||
|
||||
## Temporal Difference Learning: learning at each step [[td-learning]]
|
||||
|
||||
**Temporal Difference, on the other hand, waits for only one interaction (one step) \\(S_{t+1}\\)** to form a TD target and update \\(V(S_t)\\) using \\(R_{t+1}\\) and \\(\gamma * V(S_{t+1})\\).
|
||||
|
||||
The idea with **TD is to update the \\(V(S_t)\\) at each step.**
|
||||
|
||||
But because we didn't play during an entire episode, we don't have \\(G_t\\) (the actual return). Instead, **we estimate \\(G_t\\) by adding \\(R_{t+1}\\) and the discounted value of the next state.**
|
||||
|
||||
This is called bootstrapping. It's called this **because TD bases its update in part on an existing estimate \\(V(S_{t+1})\\) and not a complete sample \\(G_t\\).**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-1.jpg" alt="Temporal Difference"/>
|
||||
|
||||
|
||||
This method is called TD(0) or **one-step TD (update the value function after any individual step).**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-1p.jpg" alt="Temporal Difference"/>
|
||||
|
||||
If we take the same example,
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-2.jpg" alt="Temporal Difference"/>
|
||||
|
||||
- We just started to train our Value function, so it returns 0 value for each state.
|
||||
- Our learning rate (lr) is 0.1, and our discount rate is 1 (no discount).
|
||||
- Our mouse explores the environment and takes a random action: **going to the left**
|
||||
- It gets a reward \\(R_{t+1} = 1\\) since **it eats a piece of cheese**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-2p.jpg" alt="Temporal Difference"/>
|
||||
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-3.jpg" alt="Temporal Difference"/>
|
||||
|
||||
We can now update \\(V(S_0)\\):
|
||||
|
||||
New \\(V(S_0) = V(S_0) + lr * [R_1 + \gamma * V(S_1) - V(S_0)]\\)
|
||||
|
||||
New \\(V(S_0) = 0 + 0.1 * [1 + 1 * 0 - 0]\\)
|
||||
|
||||
New \\(V(S_0) = 0.1\\)
|
||||
|
||||
So we just updated our value function for State 0.
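The same TD(0) update can be written as a small function. This is a minimal sketch under the example's assumptions (all values start at 0, learning rate 0.1, discount rate 1); the state labels `"S0"` and `"S1"` and the function name `td0_update` are hypothetical.

```python
def td0_update(values, state, reward, next_state, lr, gamma):
    """Apply one TD(0) update to values[state] and return the new estimate."""
    td_target = reward + gamma * values[next_state]  # bootstrapped estimate of G_t
    td_error = td_target - values[state]
    values[state] += lr * td_error
    return values[state]


# Value function right after initialization: 0 for every state.
values = {"S0": 0.0, "S1": 0.0}

# The mouse moves left, eats a piece of cheese (reward 1) and lands in S1.
new_v_s0 = td0_update(values, "S0", reward=1.0, next_state="S1", lr=0.1, gamma=1.0)
print(round(new_v_s0, 3))
```

Unlike the Monte Carlo version, this function needs only a single transition, so it can be called after every step of the episode.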
|
||||
|
||||
Now we **continue to interact with this environment with our updated value function.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-3p.jpg" alt="Temporal Difference"/>
|
||||
|
||||
If we summarize:
|
||||
|
||||
- With *Monte Carlo*, we update the value function from a complete episode, and so we **use the actual accurate discounted return of this episode.**
|
||||
- With *TD Learning*, we update the value function from a step, so we replace \\(G_t\\) that we don't have with **an estimated return called TD target.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Summary.jpg" alt="Summary"/>
|
||||
@@ -1,83 +0,0 @@
|
||||
# A Q-Learning example [[q-learning-example]]
|
||||
|
||||
To better understand Q-Learning, let's take a simple example:
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-Example-2.jpg" alt="Maze-Example"/>
|
||||
|
||||
- You're a mouse in this tiny maze. You always **start at the same starting point.**
|
||||
- The goal is **to eat the big pile of cheese at the bottom right-hand corner** and avoid the poison. After all, who doesn't like cheese?
|
||||
- The episode ends if we eat the poison, **eat the big pile of cheese, or if we spend more than five steps.**
|
||||
- The learning rate is 0.1
|
||||
- The gamma (discount rate) is 0.99
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-1.jpg" alt="Maze-Example"/>
|
||||
|
||||
|
||||
The reward function goes like this:
|
||||
|
||||
- **+0:** Going to a state with no cheese in it.
|
||||
- **+1:** Going to a state with a small cheese in it.
|
||||
- **+10:** Going to the state with the big pile of cheese.
|
||||
- **-10:** Going to the state with the poison and thus dying.
|
||||
- **+0** If we spend more than five steps.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-2.jpg" alt="Maze-Example"/>
|
||||
|
||||
To train our agent to have an optimal policy (so a policy that goes right, right, down), **we will use the Q-Learning algorithm**.
|
||||
|
||||
## Step 1: We initialize the Q-Table [[step1]]
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Example-1.jpg" alt="Maze-Example"/>
|
||||
|
||||
So, for now, **our Q-Table is useless**; we need **to train our Q-function using the Q-Learning algorithm.**
|
||||
|
||||
Let's do it for 2 training timesteps:
|
||||
|
||||
Training timestep 1:
|
||||
|
||||
## Step 2: Choose action using Epsilon Greedy Strategy [[step2]]
|
||||
|
||||
Because epsilon is large (ɛ = 1.0), I take a random action. In this case, I go right.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-3.jpg" alt="Maze-Example"/>
|
||||
|
||||
|
||||
## Step 3: Perform action At, get Rt+1 and St+1 [[step3]]
|
||||
|
||||
By going right, I've got a small cheese, so \\(R_{t+1} = 1\\), and I'm in a new state.
|
||||
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-4.jpg" alt="Maze-Example"/>
|
||||
|
||||
|
||||
## Step 4: Update Q(St, At) [[step4]]
|
||||
|
||||
We can now update \\(Q(S_t, A_t)\\) using our formula.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-5.jpg" alt="Maze-Example"/>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Example-4.jpg" alt="Maze-Example"/>
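The arithmetic of this first update can be checked by hand. This is a minimal sketch using the values stated in this example (learning rate 0.1, gamma 0.99, a Q-table initialized to 0, and a reward of 1 for the small cheese); the variable names are illustrative.

```python
lr = 0.1      # learning rate
gamma = 0.99  # discount rate

q_start_right = 0.0  # current Q(initial state, right), still at its initial value
reward = 1.0         # small cheese
max_q_next = 0.0     # best Q-value in the new state (the table is still all zeros)

# TD target: immediate reward plus discounted best next-state value.
td_target = reward + gamma * max_q_next

# Q-Learning update for the visited state-action pair.
q_start_right = q_start_right + lr * (td_target - q_start_right)

print(round(q_start_right, 3))
```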
|
||||
|
||||
Training timestep 2:
|
||||
|
||||
## Step 2: Choose action using Epsilon Greedy Strategy [[step2-2]]
|
||||
|
||||
**I take a random action again, since epsilon is still large (0.99)** (we decay it a little bit because, as the training progresses, we want less and less exploration).
|
||||
|
||||
I took action down. **Not a good action since it leads me to the poison.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-6.jpg" alt="Maze-Example"/>
|
||||
|
||||
|
||||
## Step 3: Perform action At, get Rt+1 and St+1 [[step3-3]]
|
||||
|
||||
Because I go to the poison state, **I get \\(R_{t+1} = -10\\), and I die.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-7.jpg" alt="Maze-Example"/>
|
||||
|
||||
## Step 4: Update Q(St, At) [[step4-4]]
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-8.jpg" alt="Maze-Example"/>
|
||||
|
||||
Because we're dead, we start a new episode. But what we see here is that **with two exploration steps, my agent became smarter.**
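The update for this second timestep can be checked by hand. This is a minimal sketch using the stated learning rate (0.1), gamma (0.99) and poison reward (-10), and assuming that a terminal state contributes no future value, so nothing is bootstrapped once the episode has ended; the variable names are illustrative.

```python
lr = 0.1
gamma = 0.99

q_state_down = 0.0   # current Q(state after going right, down)
reward = -10.0       # poison
terminated = True    # eating the poison ends the episode
max_q_next = 0.0     # unused here, because there is no next state to act from

# Assumption: on a terminal transition the TD target is just the reward.
td_target = reward if terminated else reward + gamma * max_q_next

q_state_down = q_state_down + lr * (td_target - q_state_down)

print(round(q_state_down, 3))
```

The negative value recorded for this state-action pair is what makes the agent less likely to go down from that cell once it starts exploiting.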
|
||||
|
||||
As we continue exploring and exploiting the environment and updating Q-values using TD target, **Q-Table will give us better and better approximations. And thus, at the end of the training, we'll get an estimate of the optimal Q-Function.**
|
||||
@@ -1,153 +0,0 @@
|
||||
# Introducing Q-Learning [[q-learning]]
|
||||
## What is Q-Learning? [[what-is-q-learning]]
|
||||
|
||||
Q-Learning is an **off-policy value-based method that uses a TD approach to train its action-value function:**
|
||||
|
||||
- *Off-policy*: we'll talk about that at the end of this chapter.
|
||||
- *Value-based method*: finds the optimal policy indirectly by training a value or action-value function that will tell us **the value of each state or each state-action pair.**
|
||||
- *Uses a TD approach:* **updates its action-value function at each step instead of at the end of the episode.**
|
||||
|
||||
**Q-Learning is the algorithm we use to train our Q-Function**, an **action-value function** that determines the value of being at a particular state and taking a specific action at that state.
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function.jpg" alt="Q-function"/>
|
||||
<figcaption>Given a state and action, our Q Function outputs a state-action value (also called Q-value)</figcaption>
|
||||
</figure>
|
||||
|
||||
The **Q comes from "the Quality" of that action at that state.**
|
||||
|
||||
Internally, our Q-function has **a Q-table, a table where each cell corresponds to a state-action pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.**
|
||||
|
||||
If we take this maze example:
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-1.jpg" alt="Maze example"/>
|
||||
|
||||
The Q-Table has just been initialized, which is why all values are 0. This table **contains, for each state, the four state-action values.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-2.jpg" alt="Maze example"/>
|
||||
|
||||
Here we see that the **state-action value of the initial state and going up is 0:**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-3.jpg" alt="Maze example"/>
|
||||
|
||||
Therefore, the Q-function contains a Q-table **that has the value of each state-action pair.** And given a state and action, **our Q-Function will search inside its Q-table to output the value.**
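To make the lookup concrete, here is a minimal sketch of a Q-function backed by a Q-table. The number of states (six), the action names and their order, and the function name `q_function` are assumptions made for illustration; they are not taken from the course code.

```python
ACTIONS = ("left", "right", "up", "down")  # assumed action set: four moves
STATES = range(6)                          # assumed: one index per maze cell

# Q-table: one entry per (state, action) pair, all initialized to 0.
q_table = {(state, action): 0.0 for state in STATES for action in ACTIONS}


def q_function(state, action):
    """Look up the value of taking `action` in `state`."""
    return q_table[(state, action)]


# State-action value of the initial state (index 0 here) and going up.
print(q_function(0, "up"))  # 0.0
```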
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q-function"/>
|
||||
<figcaption>Given a state and action pair, our Q-function will search inside its Q-table to output the state-action pair value (the Q value).</figcaption>
|
||||
</figure>
|
||||
|
||||
If we recap, *Q-Learning* **is the RL algorithm that:**
|
||||
|
||||
- Trains *Q-Function* (an **action-value function**) which internally is a *Q-table* **that contains all the state-action pair values.**
|
||||
- Given a state and action, our Q-Function **will search its Q-table for the corresponding value.**
|
||||
- When the training is done, **we have an optimal Q-function, which means we have an optimal Q-Table.**
|
||||
- And if we **have an optimal Q-function**, we **have an optimal policy** since we **know for each state what is the best action to take.**
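The last point can be sketched in code: once the Q-table is trained, the policy is obtained by picking, in each state, the action with the highest Q-value. The table below contains made-up numbers purely for illustration, and the function names are hypothetical.

```python
def greedy_action(q_row):
    """Return the index of the action with the highest Q-value in one state."""
    return max(range(len(q_row)), key=q_row.__getitem__)


def greedy_policy(q_table):
    """Map every state (row index) to its best action (column index)."""
    return [greedy_action(row) for row in q_table]


# Illustrative Q-table: 2 states x 4 actions, values invented for the example.
toy_q_table = [
    [0.0, 0.5, 0.1, 0.0],
    [0.2, 0.0, 0.0, 0.9],
]

policy = greedy_policy(toy_q_table)
print(policy)  # [1, 3]
```

If several actions share the highest value, this sketch picks the one with the lowest index; other tie-breaking rules are equally valid.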
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy"/>
|
||||
|
||||
|
||||
But, in the beginning, **our Q-Table is useless since it gives arbitrary values for each state-action pair** (most of the time, we initialize the Q-Table to 0 values). But as we **explore the environment and update our Q-Table, it will give us better and better approximations.**
|
||||
|
||||
<figure class="image table text-center m-0 w-full">
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-1.jpg" alt="Q-learning"/>
|
||||
<figcaption>We see here that with the training, our Q-Table is better since, thanks to it, we can know the value of each state-action pair.</figcaption>
|
||||
</figure>
|
||||
|
||||
So now that we understand what Q-Learning, Q-Function, and Q-Table are, **let's dive deeper into the Q-Learning algorithm**.
|
||||
|
||||
## The Q-Learning algorithm [[q-learning-algo]]
|
||||
|
||||
This is the Q-Learning pseudocode; let's study each part and **see how it works with a simple example before implementing it.** Don't be intimidated by it, it's simpler than it looks! We'll go over each step.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-learning"/>
|
||||
|
||||
### Step 1: We initialize the Q-Table [[step1]]
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-3.jpg" alt="Q-learning"/>
|
||||
|
||||
|
||||
We need to initialize the Q-Table for each state-action pair. **Most of the time, we initialize with values of 0.**
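A minimal sketch of this initialization step, assuming states and actions are identified by integer indices. The function name `initialize_q_table` and the sizes used below are illustrative.

```python
def initialize_q_table(n_states, n_actions):
    """Build a table with one row per state and one column per action, all zeros."""
    # A fresh list is created for each row so that rows do not share storage.
    return [[0.0] * n_actions for _ in range(n_states)]


# Example sizes (assumed): 6 states and 4 actions.
q_table = initialize_q_table(n_states=6, n_actions=4)
print(len(q_table), len(q_table[0]))  # 6 4
```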
|
||||
|
||||
### Step 2: Choose action using Epsilon Greedy Strategy [[step2]]
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg" alt="Q-learning"/>
|
||||
|
||||
|
||||
Epsilon Greedy Strategy is a policy that handles the exploration/exploitation trade-off.
|
||||
|
||||
The idea is that we start with an initial value of ɛ = 1.0:
|
||||
|
||||
- *With probability 1 - ɛ*: we do **exploitation** (aka our agent selects the action with the highest state-action pair value).
|
||||
- With probability ɛ: **we do exploration** (trying random action).
|
||||
|
||||
At the beginning of the training, **the probability of doing exploration will be huge since ɛ is very high, so most of the time, we'll explore.** But as the training goes on, and consequently our **Q-Table gets better and better in its estimations, we progressively reduce the epsilon value** since we will need less and less exploration and more exploitation.
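The strategy can be sketched as follows. The action selection follows the two rules above; the exponential decay schedule and its parameter values (`eps_min`, `decay_rate`) are one common choice and are assumptions here, not something this section prescribes.

```python
import math
import random


def epsilon_greedy_action(q_row, epsilon, rng=random):
    """Pick an action index for one state, given that state's row of Q-values."""
    if rng.random() < epsilon:
        # Exploration: random action, with probability epsilon.
        return rng.randrange(len(q_row))
    # Exploitation: action with the highest Q-value, with probability 1 - epsilon.
    return max(range(len(q_row)), key=q_row.__getitem__)


def decayed_epsilon(episode, eps_max=1.0, eps_min=0.05, decay_rate=0.0005):
    """Assumed schedule: decay exponentially from eps_max towards eps_min."""
    return eps_min + (eps_max - eps_min) * math.exp(-decay_rate * episode)


q_row = [0.1, 0.9, 0.3]
print(epsilon_greedy_action(q_row, epsilon=0.0))  # 1 (pure exploitation)
print(round(decayed_epsilon(0), 3))               # 1.0 at the first episode
```

Keeping a small floor such as `eps_min` means the agent never stops exploring entirely, which is a design choice rather than a requirement.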
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-5.jpg" alt="Q-learning"/>
|
||||
|
||||
|
||||
### Step 3: Perform action At, get reward Rt+1 and next state St+1 [[step3]]
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-6.jpg" alt="Q-learning"/>
|
||||
|
||||
### Step 4: Update Q(St, At) [[step4]]
|
||||
|
||||
Remember that in TD Learning, we update our policy or value function (depending on the RL method we choose) **after one step of the interaction.**
|
||||
|
||||
To produce our TD target, **we use the immediate reward \\(R_{t+1}\\) plus the discounted value of the best state-action pair at the next state** (we call that bootstrapping).
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-7.jpg" alt="Q-learning"/>
|
||||
|
||||
Therefore, our \\(Q(S_t, A_t)\\) **update formula goes like this:**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-8.jpg" alt="Q-learning"/>
|
||||
|
||||
|
||||
It means that to update our \\(Q(S_t, A_t)\\):
|
||||
|
||||
- We need \\(S_t, A_t, R_{t+1}, S_{t+1}\\).
|
||||
- To update our Q-value at a given state-action pair, we use the TD target.
|
||||
|
||||
How do we form the TD target?
|
||||
1. We obtain the reward after taking the action \\(R_{t+1}\\).
|
||||
2. To get the **best next-state-action pair value**, we use a greedy policy to select the next best action. Note that this is not an epsilon-greedy policy: it will always take the action with the highest state-action value.
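Written out, the standard Q-Learning update is \\(Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]\\), where \\(\alpha\\) is the learning rate. The sketch below applies it to a Q-table stored as a list of rows; the function name, the `terminated` flag and the toy numbers are assumptions for illustration.

```python
def q_learning_update(q_table, state, action, reward, next_state,
                      lr, gamma, terminated=False):
    """Update Q(state, action) in place and return its new value."""
    # Greedy choice over the next state's actions (not epsilon-greedy).
    best_next = 0.0 if terminated else max(q_table[next_state])
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[state][action]
    q_table[state][action] += lr * td_error
    return q_table[state][action]


# Toy table: 2 states x 2 actions, values invented for the example.
toy_table = [
    [0.0, 0.0],
    [0.5, 2.0],
]
new_value = q_learning_update(toy_table, state=0, action=1, reward=1.0,
                              next_state=1, lr=0.5, gamma=0.9)
print(round(new_value, 3))  # 1.4
```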
|
||||
|
||||
Then, when the update of this Q-value is done, we start in a new state and select our action **using our epsilon-greedy policy again.**
|
||||
|
||||
**This is why we say that Q-Learning is an off-policy algorithm.**
|
||||
|
||||
## Off-policy vs On-policy [[off-vs-on]]
|
||||
|
||||
The difference is subtle:
|
||||
|
||||
- *Off-policy*: using **a different policy for acting and updating.**
|
||||
|
||||
For instance, with Q-Learning, the epsilon-greedy policy (acting policy) is different from the greedy policy that is **used to select the best next-state action value to update our Q-value (updating policy).**
|
||||
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-1.jpg" alt="Off-on policy"/>
|
||||
<figcaption>Acting Policy</figcaption>
|
||||
</figure>
|
||||
|
||||
The acting policy is different from the policy we use during the training part:
|
||||
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-2.jpg" alt="Off-on policy"/>
|
||||
<figcaption>Updating policy</figcaption>
|
||||
</figure>
|
||||
|
||||
- *On-policy:* using the **same policy for acting and updating.**
|
||||
|
||||
For instance, with Sarsa, another value-based algorithm, **the Epsilon-Greedy Policy selects the next_state-action pair, not a greedy policy.**
|
||||
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-3.jpg" alt="Off-on policy"/>
|
||||
<figcaption>Sarsa</figcaption>
|
||||
</figure>
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg" alt="Off-on policy"/>
|
||||
</figure>
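The difference shows up in how the TD target is built. This is a minimal sketch with hypothetical function names and made-up Q-values: Q-Learning bootstraps from the greedy (best) next action, while Sarsa bootstraps from the next action actually chosen by the acting policy.

```python
def q_learning_target(q_table, reward, next_state, gamma):
    """Off-policy target: uses the best action in the next state."""
    return reward + gamma * max(q_table[next_state])


def sarsa_target(q_table, reward, next_state, next_action, gamma):
    """On-policy target: uses the action the acting policy really selected."""
    return reward + gamma * q_table[next_state][next_action]


# Toy Q-values, invented for the example.
toy_table = [
    [0.0, 0.0],
    [1.0, 3.0],
]

# Suppose the epsilon-greedy acting policy explored and picked action 0 next.
print(q_learning_target(toy_table, reward=0.0, next_state=1, gamma=0.5))             # 1.5
print(sarsa_target(toy_table, reward=0.0, next_state=1, next_action=0, gamma=0.5))  # 0.5
```

When the acting policy happens to pick the greedy action, both targets coincide; they differ only on exploratory steps.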
|
||||
@@ -1,105 +0,0 @@
|
||||
# First Quiz [[quiz1]]
|
||||
|
||||
The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.
|
||||
|
||||
|
||||
### Q1: What are the two main approaches to find an optimal policy?
|
||||
|
||||
|
||||
<Question
|
||||
choices={[
|
||||
{
|
||||
text: "Policy-based methods",
|
||||
explain: "With Policy-Based methods, we train the policy directly to learn which action to take given a state.",
|
||||
correct: true
|
||||
},
|
||||
{
|
||||
text: "Random-based methods",
|
||||
explain: ""
|
||||
},
|
||||
{
|
||||
text: "Value-based methods",
|
||||
explain: "With Value-based methods, we train a value function to learn which state is more valuable and use this value function to take the action that leads to it.",
|
||||
correct: true
|
||||
},
|
||||
{
|
||||
text: "Evolution-strategies methods",
|
||||
explain: ""
|
||||
}
|
||||
]}
|
||||
/>
|
||||
|
||||
|
||||
### Q2: What is the Bellman Equation?
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
**The Bellman equation is a recursive equation** that works like this: instead of starting for each state from the beginning and calculating the return, we can consider the value of any state as:
|
||||
|
||||
\\(R_{t+1} + \gamma * V(S_{t+1})\\)
|
||||
The immediate reward + the discounted value of the state that follows
|
||||
|
||||
</details>
|
||||
|
||||
### Q3: Define each part of the Bellman Equation
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman4-quiz.jpg" alt="Bellman equation quiz"/>
|
||||
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman4.jpg" alt="Bellman equation solution"/>
|
||||
|
||||
</details>
|
||||
|
||||
### Q4: What is the difference between Monte Carlo and Temporal Difference learning methods?
|
||||
|
||||
<Question
|
||||
choices={[
|
||||
{
|
||||
text: "With Monte Carlo methods, we update the value function from a complete episode",
|
||||
explain: "",
|
||||
correct: true
|
||||
},
|
||||
{
|
||||
text: "With Monte Carlo methods, we update the value function from a step",
|
||||
explain: ""
|
||||
},
|
||||
{
|
||||
text: "With TD learning methods, we update the value function from a complete episode",
|
||||
explain: ""
|
||||
},
|
||||
{
|
||||
text: "With TD learning methods, we update the value function from a step",
|
||||
explain: "",
|
||||
correct: true
|
||||
},
|
||||
]}
|
||||
/>
|
||||
|
||||
### Q5: Define each part of the Temporal Difference learning formula
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/td-ex.jpg" alt="TD Learning exercise"/>
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-1.jpg" alt="TD Exercise"/>
|
||||
|
||||
</details>
|
||||
|
||||
|
||||
### Q6: Define each part of the Monte Carlo learning formula
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/mc-ex.jpg" alt="MC Learning exercise"/>
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/monte-carlo-approach.jpg" alt="MC Exercise"/>
|
||||
|
||||
</details>
|
||||
|
||||
Congrats on finishing this Quiz 🥳! If you missed some elements, take time to read the chapter again to reinforce (😏) your knowledge.
|
||||
@@ -1,97 +0,0 @@
|
||||
# Second Quiz [[quiz2]]
|
||||
|
||||
The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.
|
||||
|
||||
|
||||
### Q1: What is Q-Learning?
|
||||
|
||||
|
||||
<Question
|
||||
choices={[
|
||||
{
|
||||
text: "The algorithm we use to train our Q-Function",
|
||||
explain: "",
|
||||
correct: true
|
||||
},
|
||||
{
|
||||
text: "A value function",
|
||||
explain: "It's an action-value function since it determines the value of being at a particular state and taking a specific action at that state",
|
||||
},
|
||||
{
|
||||
text: "An algorithm that determines the value of being at a particular state and taking a specific action at that state",
|
||||
explain: "",
|
||||
correct: true
|
||||
},
|
||||
{
|
||||
text: "A table",
|
||||
explain: "Q-Function is not a Q-Table. The Q-Function is the algorithm that will feed the Q-Table."
|
||||
}
|
||||
]}
|
||||
/>
|
||||
|
||||
### Q2: What is a Q-Table?
|
||||
|
||||
<Question
|
||||
choices={[
|
||||
{
|
||||
text: "An algorithm we use in Q-Learning",
|
||||
explain: "",
|
||||
},
|
||||
{
|
||||
text: "Q-table is the internal memory of our agent",
|
||||
explain: "",
|
||||
correct: true
|
||||
},
|
||||
{
|
||||
text: "In Q-Table each cell corresponds to a state value",
|
||||
explain: "Each cell corresponds to a state-action pair value. Not a state value.",
|
||||
}
|
||||
]}
|
||||
/>
|
||||
|
||||
### Q3: Why does having an optimal Q-function Q* mean we have an optimal policy?
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
Because if we have an optimal Q-function, we have an optimal policy since we know for each state what is the best action to take.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="link value policy"/>
|
||||
|
||||
</details>
|
||||
|
||||
### Q4: Can you explain what the Epsilon-Greedy Strategy is?
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
Epsilon Greedy Strategy is a policy that handles the exploration/exploitation trade-off.
|
||||
|
||||
The idea is that we start with an initial value of ɛ = 1.0:
|
||||
|
||||
- With *probability 1 - ɛ*: we do exploitation (aka our agent selects the action with the highest state-action pair value).
|
||||
- With *probability ɛ* : we do exploration (trying random action).
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg" alt="Epsilon Greedy"/>
|
||||
|
||||
|
||||
</details>
|
||||
|
||||
### Q5: How do we update the Q-value of a state-action pair?
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-update-ex.jpg" alt="Q Update exercise"/>
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-update-solution.jpg" alt="Q Update exercise"/>
|
||||
|
||||
</details>
|
||||
|
||||
|
||||
|
||||
### Q6: What's the difference between on-policy and off-policy?
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg" alt="On/off policy"/>
|
||||
</details>
|
||||
|
||||
Congrats on finishing this Quiz 🥳! If you missed some elements, take time to read the chapter again to reinforce (😏) your knowledge.
|
||||
@@ -1,17 +0,0 @@
|
||||
# Summary [[summary1]]
|
||||
|
||||
Before diving into Q-Learning, let's summarize what we just learned.
|
||||
|
||||
We have two types of value-based functions:
|
||||
|
||||
- State-Value function: outputs the expected return if **the agent starts at a given state and acts according to the policy forever after.**
|
||||
- Action-Value function: outputs the expected return if **the agent starts in a given state, takes a given action at that state** and then acts according to the policy forever after.
|
||||
- In value-based methods, **we define the policy by hand** because we don't train it, we train a value function. The idea is that if we have an optimal value function, we **will have an optimal policy.**
|
||||
|
||||
There are two types of methods to train a value function:
|
||||
|
||||
- With *the Monte Carlo method*, we update the value function from a complete episode, and so we **use the actual accurate discounted return of this episode.**
|
||||
- With *the TD Learning method,* we update the value function from a step, so we replace \\(G_t\\) that we don't have with **an estimated return called TD target.**
|
||||
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/summary-learning-mtds.jpg" alt="Summary"/>
|
||||
@@ -1,86 +0,0 @@
|
||||
# Two types of value-based methods [[two-types-value-based-methods]]
|
||||
|
||||
In value-based methods, **we learn a value function** that **maps a state to the expected value of being at that state.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/vbm-1.jpg" alt="Value Based Methods"/>
|
||||
|
||||
The value of a state is the **expected discounted return** the agent can get if it **starts at that state and then acts according to our policy.**
|
||||
|
||||
<Tip>
|
||||
But what does it mean to act according to our policy? After all, we don't have a policy in value-based methods, since we train a value function and not a policy.
|
||||
</Tip>
|
||||
|
||||
Remember that the goal of an **RL agent is to have an optimal policy π.**
|
||||
|
||||
To find the optimal policy, we learned about two different methods:
|
||||
|
||||
- *Policy-based methods:* **Directly train the policy** to select what action to take given a state (or a probability distribution over actions at that state). In this case, we **don't have a value function.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/two-approaches-2.jpg" alt="Two RL approaches"/>
|
||||
|
||||
The policy takes a state as input and outputs what action to take at that state (deterministic policy).
|
||||
|
||||
And consequently, **we don't define by hand the behavior of our policy; it's the training that will define it.**
|
||||
|
||||
- *Value-based methods:* **Indirectly, by training a value function** that outputs the value of a state or a state-action pair. Given this value function, our policy **will take an action.**
|
||||
|
||||
Since the policy is not trained/learned, **we need to specify its behavior.** For instance, if we want a policy that, given the value function, will take actions that always lead to the biggest reward, **we'll create a Greedy Policy.**
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/two-approaches-3.jpg" alt="Two RL approaches"/>
|
||||
<figcaption>Given a state, our action-value function (that we train) outputs the value of each action at that state. Then, our pre-defined Greedy Policy selects the action that will yield the highest value given a state or a state action pair.</figcaption>
|
||||
</figure>
|
||||
|
||||
Consequently, whatever method you use to solve your problem, **you will have a policy**. In the case of value-based methods, you don't train the policy: your policy **is just a simple pre-specified function** (for instance, Greedy Policy) that uses the values given by the value-function to select its actions.
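To illustrate how simple such a pre-specified policy can be, here is a minimal sketch of a Greedy Policy in Python. The function name `greedy_policy` and the example action values are hypothetical, chosen only for illustration.

```python
def greedy_policy(q_values):
    """Return the index of the action with the highest estimated value."""
    # q_values holds one estimated action value per possible action
    # for the current state, as given by the (trained) value function.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical action values for one state (e.g. left, right, up, down).
q_values_for_state = [0.1, 0.7, -0.2, 0.4]
action = greedy_policy(q_values_for_state)
```

Nothing in this function is learned: all the training effort goes into producing good values, and the policy only reads them.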
|
||||
|
||||
So the difference is:
|
||||
|
||||
- In policy-based, **the optimal policy is found by training the policy directly.**
|
||||
- In value-based, **finding an optimal value function leads to having an optimal policy.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link between value and policy"/>
|
||||
|
||||
In fact, most of the time, in value-based methods, you'll use **an Epsilon-Greedy Policy** that handles the exploration/exploitation trade-off; we'll talk about it when we talk about Q-Learning in the second part of this unit.
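As a preview, an Epsilon-Greedy Policy could be sketched as follows. This is a minimal sketch, assuming `epsilon` is the probability of picking a random action; the function name and signature are hypothetical.

```python
import random


def epsilon_greedy_policy(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if rng.random() < epsilon:
        # Exploration: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploitation: pick the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0.0` this reduces to the Greedy Policy, and with `epsilon=1.0` it acts purely at random.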
|
||||
|
||||
|
||||
So, we have two types of value-based functions:
|
||||
|
||||
## The State-Value function [[state-value-function]]
|
||||
|
||||
We write the state value function under a policy π like this:
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/state-value-function-1.jpg" alt="State value function"/>
|
||||
|
||||
For each state, the state-value function outputs the expected return if the agent **starts at that state,** and then follows the policy forever afterwards (for all future timesteps, if you prefer).
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/state-value-function-2.jpg" alt="State value function"/>
|
||||
<figcaption>If we take the state with value -7: it's the expected return starting at that state and taking actions according to our policy (greedy policy), so right, right, right, down, down, right, right.</figcaption>
|
||||
</figure>
|
||||
|
||||
## The Action-Value function [[action-value-function]]
|
||||
|
||||
For each state-action pair, the action-value function **outputs the expected return** if the agent starts in that state, takes that action, and then follows the policy forever after.
|
||||
|
||||
The value of taking action \\(a\\) in state \\(s\\) under a policy π is:
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/action-state-value-function-1.jpg" alt="Action State value function"/>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/action-state-value-function-2.jpg" alt="Action State value function"/>
|
||||
|
||||
|
||||
We see that the difference is:
|
||||
|
||||
- For the state-value function, we calculate **the value of a state \\(S_t\\)**.
|
||||
- For the action-value function, we calculate **the value of the state-action pair ( \\(S_t, A_t\\) ), hence the value of taking that action at that state.**
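Written side by side, the two definitions differ only in what is conditioned on. This sketch assumes the usual notation where \\(G_t\\) denotes the discounted return from timestep \\(t\\); that symbol is an assumption made here for compactness.

```latex
V_{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right]
\qquad
Q_{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s, A_t = a \right]
```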
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/two-types.jpg" alt="Two types of value function"/>
|
||||
<figcaption>
|
||||
Note: We didn't fill all the state-action pairs for the example of Action-value function</figcaption>
|
||||
</figure>
|
||||
|
||||
In either case, whatever value function we choose (state-value or action-value function), **the value is the expected return.**
|
||||
|
||||
However, the problem is that it implies that **to calculate EACH value of a state or a state-action pair, we need to sum all the rewards an agent can get if it starts at that state.**
|
||||
|
||||
This can be a tedious process, and that's **where the Bellman equation comes to help us.**
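To make the cost concrete, here is a minimal sketch of the return computed for a single trajectory. The function name, the discount factor `gamma`, and the reward of -1 per step are assumptions for illustration. The value of a state is the expectation of this quantity, so in principle it has to be evaluated over every trajectory that can start from that state.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards of one trajectory, discounting later rewards by gamma."""
    g = 0.0
    # Work backwards so each step only needs the return of the step after it.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical trajectory: seven steps, each with a reward of -1, no discounting.
example_return = discounted_return([-1.0] * 7, gamma=1.0)
```

Repeating this for every state (or every state-action pair) is what becomes expensive as the environment grows.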
|
||||
@@ -1,25 +0,0 @@
|
||||
# What is RL? A short recap [[what-is-rl]]
|
||||
|
||||
In RL, we build an agent that can **make smart decisions**. For instance, an agent that **learns to play a video game.** Or a trading agent that **learns to maximize its benefits** by making smart decisions on **what stocks to buy and when to sell.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/rl-process.jpg" alt="RL process"/>
|
||||
|
||||
|
||||
But, to make intelligent decisions, our agent will learn from the environment by **interacting with it through trial and error** and receiving rewards (positive or negative) **as unique feedback.**
|
||||
|
||||
Its goal **is to maximize its expected cumulative reward** (because of the reward hypothesis).
|
||||
|
||||
**The agent's decision-making process is called the policy π:** given a state, a policy will output an action or a probability distribution over actions. That is, given an observation of the environment, a policy will provide an action (or multiple probabilities for each action) that the agent should take.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/policy.jpg" alt="Policy"/>
|
||||
|
||||
**Our goal is to find an optimal policy π\***, i.e., a policy that leads to the best expected cumulative reward.
|
||||
|
||||
And to find this optimal policy (hence solving the RL problem), there **are two main types of RL methods**:
|
||||
|
||||
- *Policy-based methods*: **Train the policy directly** to learn which action to take given a state.
|
||||
- *Value-based methods*: **Train a value function** to learn **which state is more valuable** and use this value function **to take the action that leads to it.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/two-approaches.jpg" alt="Two RL approaches"/>
|
||||
|
||||
And in this unit, **we'll dive deeper into the value-based methods.**
|
||||
@@ -1,8 +0,0 @@
|
||||
# Additional Readings [[additional-readings]]
|
||||
|
||||
These are **optional readings** if you want to go deeper.
|
||||
|
||||
- [Foundations of Deep RL Series, L2 Deep Q-Learning by Pieter Abbeel](https://youtu.be/Psrhxy88zww)
|
||||
- [Playing Atari with Deep Reinforcement Learning](https://arxiv.org/abs/1312.5602)
|
||||
- [Double Q-Learning](https://papers.nips.cc/paper/2010/hash/091d584fced301b442654dd8c23b3fc9-Abstract.html)
|
||||
- [Prioritized Experience Replay](https://arxiv.org/abs/1511.05952)
|
||||
@@ -1,14 +0,0 @@
|
||||
# Conclusion [[conclusion]]
|
||||
|
||||
Congrats on finishing this chapter! There was a lot of information. And congrats on finishing the tutorial. You’ve just trained your first Deep Q-Learning agent and shared it on the Hub 🥳.
|
||||
|
||||
Take time to really grasp the material before continuing.
|
||||
|
||||
Don't hesitate to train your agent in other environments (Pong, Seaquest, QBert, Ms Pac Man). The **best way to learn is to try things on your own!**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Environments"/>
|
||||
|
||||
|
||||
In the next unit, **we're going to learn about Optuna**. One of the most critical tasks in Deep Reinforcement Learning is finding a good set of training hyperparameters, and Optuna is a library that helps you automate the search.
|
||||
|
||||
### Keep Learning, stay awesome 🤗
|
||||
@@ -1,102 +0,0 @@
|
||||
# The Deep Q-Learning Algorithm [[deep-q-algorithm]]
|
||||
|
||||
We learned that Deep Q-Learning **uses a deep neural network to approximate the different Q-values for each possible action at a state** (value-function estimation).
|
||||
|
||||
The difference is that, during the training phase, instead of updating the Q-value of a state-action pair directly as we have done with Q-Learning:
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-5.jpg" alt="Q Loss"/>
|
||||
|
||||
In Deep Q-Learning, we create a **Loss function between our Q-value prediction and the Q-target and use Gradient Descent to update the weights of our Deep Q-Network to approximate our Q-values better**.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/Q-target.jpg" alt="Q-target"/>
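As a rough sketch of that loss, the snippet below computes the Q-target for a small batch and the mean squared error against the predicted Q-values. All names and numbers are hypothetical, the handling of terminal transitions through a `done` flag is an assumption not spelled out in the text, and the gradient descent step itself is left out.

```python
def td_target(reward, next_q_values, done, gamma):
    """Q-target: the reward plus the discounted highest Q-value of the next state."""
    if done:
        # Assumption: after a terminal state there is no future return to add.
        return reward
    return reward + gamma * max(next_q_values)


def mse_loss(predictions, targets):
    """Mean squared difference between predicted Q-values and Q-targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)


# Hypothetical batch of two transitions.
predictions = [0.5, 1.0]  # Q(s, a) predicted by the network for the actions taken
targets = [
    td_target(1.0, [0.2, 0.4], done=False, gamma=0.5),  # 1.0 + 0.5 * 0.4
    td_target(-1.0, [0.3, 0.1], done=True, gamma=0.5),  # terminal: just the reward
]
loss = mse_loss(predictions, targets)
```

In a real agent, a deep learning framework would differentiate this loss with respect to the network weights.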
|
||||
|
||||
The Deep Q-Learning training algorithm has *two phases*:
|
||||
|
||||
- **Sampling**: we perform actions and **store the observed experience tuples in a replay memory**.
|
||||
- **Training**: we select a **small batch of tuples randomly and learn from it using a gradient descent update step**.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/sampling-training.jpg" alt="Sampling Training"/>
|
||||
|
||||
But, this is not the only change compared with Q-Learning. Deep Q-Learning training **might suffer from instability**, mainly because of combining a non-linear Q-value function (Neural Network) and bootstrapping (when we update targets with existing estimates and not an actual complete return).
|
||||
|
||||
To help us stabilize the training, we implement three different solutions:
|
||||
1. *Experience Replay*, to make more **efficient use of experiences**.
|
||||
2. *Fixed Q-Target* **to stabilize the training**.
|
||||
3. *Double Deep Q-Learning*, to **handle the problem of the overestimation of Q-values**.
|
||||
|
||||
|
||||
## Experience Replay to make more efficient use of experiences [[exp-replay]]
|
||||
|
||||
Why do we create a replay memory?
|
||||
|
||||
Experience Replay in Deep Q-Learning has two functions:
|
||||
|
||||
1. **Make more efficient use of the experiences during the training**.
|
||||
- Experience replay helps us **make more efficient use of the experiences during the training.** Usually, in online reinforcement learning, we interact in the environment, get experiences (state, action, reward, and next state), learn from them (update the neural network) and discard them.
|
||||
- But with experience replay, we create a replay buffer that saves experience samples **that we can reuse during the training.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/experience-replay.jpg" alt="Experience Replay"/>
|
||||
|
||||
⇒ This allows us to **learn from individual experiences multiple times**.
|
||||
|
||||
2. **Avoid forgetting previous experiences and reduce the correlation between experiences**.
|
||||
- The problem we get if we give sequential samples of experiences to our neural network is that it tends to forget **the previous experiences as new experiences overwrite them.** For instance, if we are in the first level and then the second, which is different, our agent can forget how to behave and play in the first level.
|
||||
|
||||
The solution is to create a Replay Buffer that stores experience tuples while interacting with the environment and then sample a small batch of tuples. This prevents **the network from only learning about what it has immediately done.**
|
||||
|
||||
Experience replay also has other benefits. By randomly sampling the experiences, we remove correlation in the observation sequences and avoid **action values from oscillating or diverging catastrophically.**
|
||||
|
||||
In the Deep Q-Learning pseudocode, we see that we **initialize a replay memory buffer D with capacity N** (N is a hyperparameter that you can define). We then store experiences in the memory and sample a minibatch of experiences to feed the Deep Q-Network during the training phase.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/experience-replay-pseudocode.jpg" alt="Experience Replay Pseudocode"/>
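A replay memory with capacity N can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the hands-on; the class name, method names, and the `done` field in the tuple are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity memory of experience tuples."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest experience when full.
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive steps.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)
```

During the sampling phase the agent calls `store` after every step; during the training phase it calls `sample` to get a minibatch.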
|
||||
|
||||
## Fixed Q-Target to stabilize the training [[fixed-q]]
|
||||
|
||||
When we want to calculate the TD error (aka the loss), we calculate the **difference between the TD target (Q-Target) and the current Q-value (estimation of Q)**.
|
||||
|
||||
But we **don’t have any idea of the real TD target**. We need to estimate it. Using the Bellman equation, we saw that the TD target is just the reward of taking that action at that state plus the discounted highest Q value for the next state.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/Q-target.jpg" alt="Q-target"/>
|
||||
|
||||
However, the problem is that we are using the same parameters (weights) for estimating the TD target **and** the Q value. Consequently, there is a significant correlation between the TD target and the parameters we are changing.
|
||||
|
||||
Therefore, at every step of training, **both our Q-values and the target values shift.** We’re getting closer to our target, but the target is also moving. It’s like chasing a moving target! This can lead to significant oscillation in training.
|
||||
|
||||
It’s as if you were a cowboy (the Q estimation) trying to catch a cow (the Q-target): you must get closer (reduce the error).
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/qtarget-1.jpg" alt="Q-target"/>
|
||||
|
||||
At each time step, you’re trying to approach the cow, which also moves at each time step (because you use the same parameters).
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/qtarget-2.jpg" alt="Q-target"/>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/qtarget-3.jpg" alt="Q-target"/>
|
||||
This leads to a bizarre path of chasing (a significant oscillation in training).
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/qtarget-4.jpg" alt="Q-target"/>
|
||||
|
||||
Instead, what we see in the pseudo-code is that we:
|
||||
- Use a **separate network with fixed parameters** for estimating the TD Target
|
||||
- **Copy the parameters from our Deep Q-Network every C steps** to update the target network.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/fixed-q-target-pseudocode.jpg" alt="Fixed Q-target Pseudocode"/>
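The periodic copy can be sketched as follows. This is a minimal sketch in which plain dictionaries stand in for network weights; `maybe_sync_target` and `sync_every` (playing the role of C) are hypothetical names.

```python
import copy


def maybe_sync_target(online_params, target_params, step, sync_every):
    """Return the parameters the target network should use after `step` updates."""
    if step % sync_every == 0:
        # Every C steps, the target network becomes a frozen copy of the online one.
        return copy.deepcopy(online_params)
    # Otherwise the target stays fixed while the online network keeps changing.
    return target_params


online_params = {"w": [0.0]}
target_params = copy.deepcopy(online_params)
for step in range(1, 8):
    online_params["w"][0] += 1.0  # stand-in for one gradient descent update
    target_params = maybe_sync_target(online_params, target_params, step, sync_every=3)
```

After seven updates with `sync_every=3`, the target still holds the weights from step 6, so the TD targets stay put between copies.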
|
||||
|
||||
|
||||
|
||||
## Double DQN [[double-dqn]]
|
||||
|
||||
Double DQN, or Double Deep Q-Learning, builds on Double Q-Learning, introduced [by Hado van Hasselt](https://papers.nips.cc/paper/3964-double-q-learning). This method **handles the problem of the overestimation of Q-values.**
|
||||
|
||||
To understand this problem, remember how we calculate the TD Target:
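Following the earlier description (the reward plus the discounted highest Q-value of the next state), the target can be written as below. The symbol \\(y_t\\) for the target and \\(\gamma\\) for the discount factor are notational assumptions.

```latex
y_t = R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a)
```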
|
||||
|
||||
We face a simple problem by calculating the TD target: how are we sure that **the best action for the next state is the action with the highest Q-value?**
|
||||
|
||||
We know that the accuracy of Q values depends on what action we tried **and** what neighboring states we explored.
|
||||
|
||||
Consequently, at the beginning of training we don’t have enough information about the best action to take. Taking the maximum Q-value (which is noisy) as the best action can therefore lead to false positives. If non-optimal actions are regularly **given a higher Q-value than the optimal action, learning becomes harder.**
|
||||
|
||||
The solution is: when we compute the Q target, we use two networks to decouple the action selection from the target Q value generation. We:
|
||||
- Use our **DQN network** to select the best action to take for the next state (the action with the highest Q value).
|
||||
- Use our **Target network** to calculate the target Q value of taking that action at the next state.
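The two steps above can be sketched as follows, next to the single-network target for comparison. Function names, the `done` flag, and the example values are hypothetical.

```python
def vanilla_dqn_target(reward, next_q_target, done, gamma):
    """Single-network target: select and evaluate the action with the same values."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, done, gamma):
    """Double DQN target: the online network selects, the target network evaluates."""
    if done:
        return reward
    # Action selection uses the online (DQN) network's Q-values for the next state.
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # Evaluation uses the target network's Q-value for that same action.
    return reward + gamma * next_q_target[best_action]


# Hypothetical next-state Q-values where the target network overrates action 0.
next_q_online = [1.0, 2.0]
next_q_target = [3.0, 0.5]
vanilla = vanilla_dqn_target(0.0, next_q_target, done=False, gamma=1.0)
double = double_dqn_target(0.0, next_q_online, next_q_target, done=False, gamma=1.0)
```

In this made-up case the single-network target picks up the noisy high value, while the Double DQN target does not, because the two networks disagree about which action is best.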
|
||||
|
||||
Therefore, Double DQN helps us reduce the overestimation of Q-values and, as a consequence, helps us train faster and have more stable learning.
|
||||
|
||||
Since these three improvements to Deep Q-Learning, many more have been added, such as Prioritized Experience Replay and Dueling Deep Q-Learning. They’re out of the scope of this course, but if you’re interested, check the links we put in the reading list.
|
||||
@@ -1,39 +0,0 @@
|
||||
# The Deep Q-Network (DQN) [[deep-q-network]]
|
||||
This is the architecture of our Deep Q-Learning network:
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/deep-q-network.jpg" alt="Deep Q Network"/>
|
||||
|
||||
As input, we take a **stack of 4 frames** passed through the network as a state and output a **vector of Q-values for each possible action at that state**. Then, like with Q-Learning, we just need to use our epsilon-greedy policy to select which action to take.
|
||||
|
||||
When the Neural Network is initialized, **the Q-value estimation is terrible**. But during training, our Deep Q-Network agent will associate a situation with appropriate action and **learn to play the game well**.
|
||||
|
||||
## Preprocessing the input and temporal limitation [[preprocessing]]
|
||||
|
||||
We mentioned that we preprocess the input. It’s an essential step since we want to **reduce the complexity of our state to reduce the computation time needed for training**.
|
||||
|
||||
So what we do is **reduce the state space to 84x84 and grayscale it** (since the colors in Atari environments don't add important information).
|
||||
This is an essential saving since we **reduce our three color channels (RGB) to 1**.
|
||||
|
||||
We can also **crop a part of the screen in some games** if it does not contain important information.
|
||||
Then we stack four frames together.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/preprocessing.jpg" alt="Preprocessing"/>
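The grayscale and resize steps can be sketched in plain Python on nested lists. This is only an illustration under stated assumptions: the luminance weights are the common ITU-R BT.601 ones, the resize uses nearest-neighbor sampling for simplicity, and all function names are hypothetical. Real pipelines typically use an image library with proper interpolation.

```python
def to_grayscale(frame):
    """Collapse an RGB frame (rows of (r, g, b) pixels) into one luminance channel."""
    # Assumed weights: ITU-R BT.601 luma coefficients.
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in frame]


def resize_nearest(image, out_height, out_width):
    """Downsample by picking the nearest source pixel for each output pixel."""
    in_height, in_width = len(image), len(image[0])
    return [
        [image[y * in_height // out_height][x * in_width // out_width]
         for x in range(out_width)]
        for y in range(out_height)
    ]


def preprocess(frame):
    """Turn a raw RGB frame into a single-channel 84x84 frame."""
    return resize_nearest(to_grayscale(frame), 84, 84)


# Hypothetical all-white Atari-sized frame: 210 rows, 160 columns, 3 channels.
raw_frame = [[(255, 255, 255)] * 160 for _ in range(210)]
state = preprocess(raw_frame)
```

The result has 84 x 84 = 7,056 values instead of the 100,800 values of the raw frame.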
|
||||
|
||||
**Why do we stack four frames together?**
|
||||
We stack frames together because it helps us **handle the problem of temporal limitation**. Let’s take an example with the game of Pong. When you see this frame:
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/temporal-limitation.jpg" alt="Temporal Limitation"/>
|
||||
|
||||
Can you tell me where the ball is going?
|
||||
No, because one frame is not enough to have a sense of motion! But what if I add three more frames? **Here you can see that the ball is going to the right**.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/temporal-limitation-2.jpg" alt="Temporal Limitation"/>
|
||||
That’s why, to capture temporal information, we stack four frames together.
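Frame stacking itself can be sketched with a fixed-length queue. The class name is hypothetical, and filling the stack with copies of the first frame at the start of an episode is an assumption about how the first state is built.

```python
from collections import deque


class FrameStack:
    """Keep the most recent frames so that the state carries motion information."""

    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        # Assumption: at episode start, repeat the first frame to fill the stack.
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(new_frame)
        return list(self.frames)
```

Each call returns the four most recent frames, oldest first, which is what gets passed to the network as one state.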
|
||||
|
||||
Then, the stacked frames are processed by three convolutional layers. These layers **allow us to capture and exploit spatial relationships in images**. But also, because frames are stacked together, **we can exploit some temporal properties across those frames**.
|
||||
|
||||
Finally, we have a couple of fully connected layers that output a Q-value for each possible action at that state.
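To see how the image shrinks before reaching the fully connected layers, here is a small shape calculation. The kernel sizes, strides, and channel counts are assumptions taken from the commonly cited DQN architecture, since the text above only says there are three convolutional layers; no padding is assumed.

```python
def conv_output_size(size, kernel, stride):
    """Spatial size after a convolution without padding."""
    return (size - kernel) // stride + 1


# (out_channels, kernel, stride): assumed values, not stated in the text above.
CONV_LAYERS = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]


def flattened_features(input_size=84, layers=CONV_LAYERS):
    """Number of values fed to the first fully connected layer."""
    size = input_size
    channels = 0
    for channels, kernel, stride in layers:
        size = conv_output_size(size, kernel, stride)
    return channels * size * size


num_features = flattened_features()
```

Under these assumptions the 84x84 input shrinks to 20x20, then 9x9, then 7x7, so the fully connected part receives 64 x 7 x 7 = 3136 values.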
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/deep-q-network.jpg" alt="Deep Q Network"/>
|
||||
|
||||
So, we see that Deep Q-Learning is using a neural network to approximate, given a state, the different Q-values for each possible action at that state. Let’s now study the Deep Q-Learning algorithm.
|
||||
@@ -1,33 +0,0 @@
|
||||
# From Q-Learning to Deep Q-Learning [[from-q-to-dqn]]
|
||||
|
||||
We learned that **Q-Learning is an algorithm we use to train our Q-Function**, an **action-value function** that determines the value of being at a particular state and taking a specific action at that state.
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function.jpg" alt="Q-function"/>
|
||||
<figcaption>Given a state and action, our Q Function outputs a state-action value (also called Q-value)</figcaption>
|
||||
</figure>
|
||||
|
||||
The **Q comes from "the Quality" of that action at that state.**
|
||||
|
||||
Internally, our Q-function has **a Q-table, a table where each cell corresponds to a state-action pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.**
|
||||
|
||||
The problem is that Q-Learning is a *tabular method*. Tabular methods only work when the state and action spaces **are small enough for the value function to be represented as arrays and tables**. This is **not scalable**.
|
||||
Q-Learning worked well with small state space environments like:
|
||||
|
||||
- FrozenLake, we had 16 states.
|
||||
- Taxi-v3, we had 500 states.
|
||||
|
||||
But think of what we're going to do today: we will train an agent to learn to play Space Invaders, a more complex game, using the frames as input.
|
||||
|
||||
As **[Nikita Melkozerov mentioned](https://twitter.com/meln1k), Atari environments** have an observation space with a shape of (210, 160, 3), containing values ranging from 0 to 255, which gives us 256^(210x160x3) = 256^100800 possible observations (for comparison, we have approximately 10^80 atoms in the observable universe).
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari.jpg" alt="Atari State Space"/>
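The size of that number is easier to grasp through its base-10 logarithm. The sketch below is plain arithmetic on the figures quoted above; the variable names are illustrative.

```python
import math

height, width, channels = 210, 160, 3
values_per_entry = 256  # each entry ranges from 0 to 255

# Number of entries in one observation.
num_entries = height * width * channels

# log10 of 256 ** num_entries, computed without building the huge integer.
log10_num_observations = num_entries * math.log10(values_per_entry)
```

The logarithm is about 242,750, so the number of possible observations has roughly 242,750 digits, against 81 digits for 10^80.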
|
||||
|
||||
Therefore, the state space is gigantic; creating and updating a Q-table for that environment would not be efficient. In this case, the best idea is to approximate the Q-values with a parametrized Q-function \\(Q_{\theta}(s,a)\\), here a neural network, instead of storing them in a Q-table.
|
||||
|
||||
This neural network will approximate, given a state, the different Q-values for each possible action at that state. And that's exactly what Deep Q-Learning does.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/deep.jpg" alt="Deep Q Learning"/>
|
||||
|
||||
|
||||
Now that we understand Deep Q-Learning, let's dive deeper into the Deep Q-Network.
|
||||
@@ -1,13 +0,0 @@
|
||||
# Hands-on [[hands-on]]
|
||||
|
||||
Now that you've studied the theory behind Deep Q-Learning, **you’re ready to train your Deep Q-Learning agent to play Atari Games**. We'll start with Space Invaders, but you'll be able to use any Atari game you want 🔥
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Environments"/>
|
||||
|
||||
|
||||
We're using the [RL-Baselines-3 Zoo integration](https://github.com/DLR-RM/rl-baselines3-zoo), a vanilla version of Deep Q-Learning with no extensions such as Double-DQN, Dueling-DQN, and Prioritized Experience Replay.
|
||||
|
||||
|
||||
**To start the hands-on, click on the Open In Colab button** 👇 :
|
||||
|
||||
[]()
|
||||
@@ -1,19 +0,0 @@
|
||||
# Deep Q-Learning [[deep-q-learning]]
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/thumbnail.jpg" alt="Unit 3 thumbnail" width="100%">
|
||||
|
||||
|
||||
|
||||
In the last unit, we learned our first reinforcement learning algorithm: Q-Learning, **implemented it from scratch**, and trained it in two environments, FrozenLake-v1 ☃️ and Taxi-v3 🚕.
|
||||
|
||||
We got excellent results with this simple algorithm. But these environments were relatively simple because the **state space was discrete and small** (16 different states for FrozenLake-v1 and 500 for Taxi-v3).
|
||||
|
||||
But as we'll see, producing and updating a **Q-table can become ineffective in large state space environments.**
|
||||
|
||||
So in this unit, **we'll study our first Deep Reinforcement Learning agent**: Deep Q-Learning. Instead of using a Q-table, Deep Q-Learning uses a Neural Network that takes a state and approximates Q-values for each action based on that state.
|
||||
|
||||
And **we'll train it to play Space Invaders and other Atari environments using [RL-Zoo](https://github.com/DLR-RM/rl-baselines3-zoo)**, a training framework for RL using Stable-Baselines that provides scripts for training, evaluating agents, tuning hyperparameters, plotting results, and recording videos.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Environments"/>
|
||||
|
||||
So let’s get started! 🚀
|
||||
@@ -1,104 +0,0 @@
|
||||
# Quiz [[quiz]]
|
||||
|
||||
The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.
|
||||
|
||||
### Q1: What are tabular methods?
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
*Tabular methods* are methods for problems in which the state and action spaces are small enough for the value function to be **represented as arrays and tables**. For instance, **Q-Learning is a tabular method** since we use a table to store the value of each state-action pair.
|
||||
|
||||
|
||||
</details>
|
||||
|
||||
### Q2: Why can't we use classical Q-Learning to solve an Atari game?
|
||||
|
||||
<Question
|
||||
choices={[
|
||||
{
|
||||
text: "Atari environments are too fast for Q-Learning",
|
||||
explain: ""
|
||||
},
|
||||
{
|
||||
text: "Atari environments have a big observation space. So creating and updating the Q-Table would not be efficient",
|
||||
explain: "",
|
||||
correct: true
|
||||
}
|
||||
]}
|
||||
/>
|
||||
|
||||
|
||||
### Q3: Why do we stack four frames together when we use frames as input in Deep Q-Learning?
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
We stack frames together because it helps us **handle the problem of temporal limitation**: one frame is not enough to capture temporal information.
|
||||
For instance, in Pong, our agent **will be unable to know the ball's direction if it gets only one frame**.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/temporal-limitation.jpg" alt="Temporal limitation"/>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/temporal-limitation-2.jpg" alt="Temporal limitation"/>
|
||||
|
||||
|
||||
</details>
|
||||
|
||||
|
||||
### Q4: What are the two phases of Deep Q-Learning?
|
||||
|
||||
<Question
|
||||
choices={[
|
||||
{
|
||||
text: "Sampling",
|
||||
explain: "We perform actions and store the observed experience tuples in a replay memory.",
|
||||
correct: true,
|
||||
},
|
||||
{
|
||||
text: "Shuffling",
|
||||
explain: "",
|
||||
},
|
||||
{
|
||||
text: "Reranking",
|
||||
explain: "",
|
||||
},
|
||||
{
|
||||
text: "Training",
|
||||
explain: "We select a small batch of tuples randomly and learn from it using a gradient descent update step.",
|
||||
correct: true,
|
||||
}
|
||||
]}
|
||||
/>
|
||||
|
||||
### Q5: Why do we create a replay memory in Deep Q-Learning?
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
**1. Make more efficient use of the experiences during the training**
|
||||
|
||||
Usually, in online reinforcement learning, we interact in the environment, get experiences (state, action, reward, and next state), learn from them (update the neural network) and discard them.
|
||||
But with experience replay, **we create a replay buffer that saves experience samples that we can reuse during the training**.
|
||||
|
||||
**2. Avoid forgetting previous experiences and reduce the correlation between experiences**
|
||||
|
||||
The problem we get if we give sequential samples of experiences to our neural network is that it **tends to forget the previous experiences as new experiences overwrite them**. For instance, if we are in the first level and then the second, which is different, our agent can forget how to behave and play in the first level.
|
||||
|
||||
|
||||
</details>
|
||||
|
||||
### Q6: How do we use Double Deep Q-Learning?
|
||||
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
When we compute the Q target, we use two networks to decouple the action selection from the target Q value generation. We:
|
||||
|
||||
- Use our *DQN network* to **select the best action to take for the next state** (the action with the highest Q value).
|
||||
|
||||
- Use our *Target network* to calculate **the target Q value of taking that action at the next state**.
|
||||
|
||||
</details>
|
||||
|
||||
|
||||
Congrats on finishing this Quiz 🥳! If you missed some elements, take time to read the chapter again to reinforce (😏) your knowledge.
|
||||
@@ -1,3 +0,0 @@
|
||||
# Hands-on [[hands-on]]
|
||||
|
||||
Now that you've learned to use Optuna, **why not go back to our Deep Q-Learning hands-on and use Optuna to find the best training hyperparameters?**
|
||||
@@ -1,7 +0,0 @@
|
||||
# Introduction [[introduction]]
|
||||
|
||||
One of the most critical tasks in Deep Reinforcement Learning is to **find a good set of training hyperparameters**.
|
||||
|
||||
<img src="https://raw.githubusercontent.com/optuna/optuna/master/docs/image/optuna-logo.png" alt="Optuna Logo"/>
|
||||
|
||||
[Optuna](https://optuna.org/) is a library that helps you to automate the search. In this Unit, we'll study a **little bit of the theory behind automatic hyperparameter tuning**. We'll first try to optimize the parameters of the DQN studied in the last unit manually. We'll then **learn how to automate the search using Optuna**.
|
||||
@@ -1,12 +0,0 @@
|
||||
# Optuna Tutorial [[optuna]]
|
||||
|
||||
The content below comes from [Antonin Raffin's ICRA 2022 presentations](https://araffin.github.io/tools-for-robotic-rl-icra2022/). He's one of the founders of Stable-Baselines and RL-Baselines3-Zoo.
|
||||
|
||||
|
||||
## The theory behind Hyperparameter tuning
|
||||
<Youtube id="AidFTOdGNFQ" />
|
||||
|
||||
|
||||
## Optuna Tutorial
|
||||
<Youtube id="ihP7E76KGOI" />
|
||||
The notebook 👉 https://colab.research.google.com/github/araffin/tools-for-robotic-rl-icra2022/blob/main/notebooks/optuna_lab.ipynb
|
||||