Merge pull request #108 from huggingface/ThomasSimonini/Unit3

Adding Unit 3: Deep Q-Learning and Optuna Bonus
This commit is contained in:
Thomas Simonini
2022-12-19 16:03:11 +01:00
committed by GitHub
13 changed files with 746 additions and 39 deletions

File diff suppressed because one or more lines are too long


@@ -76,7 +76,34 @@
title: Conclusion
- local: unit2/additional-readings
title: Additional Readings
- title: Unit 3. Deep Q-Learning with Atari Games
sections:
- local: unit3/introduction
title: Introduction
- local: unit3/from-q-to-dqn
title: From Q-Learning to Deep Q-Learning
- local: unit3/deep-q-network
title: The Deep Q-Network (DQN)
- local: unit3/deep-q-algorithm
title: The Deep Q Algorithm
- local: unit3/hands-on
title: Hands-on
- local: unit3/quiz
title: Quiz
- local: unit3/conclusion
title: Conclusion
- local: unit3/additional-readings
title: Additional Readings
- title: Unit Bonus 2. Automatic Hyperparameter Tuning with Optuna
sections:
- local: unitbonus2/introduction
title: Introduction
- local: unitbonus2/optuna
title: Optuna
- local: unitbonus2/hands-on
title: Hands-on
- title: What's next? New Units Publishing Schedule
sections:
- local: communication/publishing-schedule
title: Publishing Schedule


@@ -0,0 +1,8 @@
# Additional Readings [[additional-readings]]
These are **optional readings** if you want to go deeper.
- [Foundations of Deep RL Series, L2 Deep Q-Learning by Pieter Abbeel](https://youtu.be/Psrhxy88zww)
- [Playing Atari with Deep Reinforcement Learning](https://arxiv.org/abs/1312.5602)
- [Double Deep Q-Learning](https://papers.nips.cc/paper/2010/hash/091d584fced301b442654dd8c23b3fc9-Abstract.html)
- [Prioritized Experience Replay](https://arxiv.org/abs/1511.05952)


@@ -0,0 +1,14 @@
# Conclusion [[conclusion]]
Congrats on finishing this chapter! There was a lot of information. And congrats on finishing the tutorial. You've just trained your first Deep Q-Learning agent and shared it on the Hub 🥳.
Take time to really grasp the material before continuing.
Don't hesitate to train your agent in other environments (Pong, Seaquest, QBert, Ms Pac Man). The **best way to learn is to try things on your own!**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Environments"/>
In the next unit, **we're going to learn about Optuna**. One of the most critical tasks in Deep Reinforcement Learning is finding a good set of training hyperparameters, and Optuna is a library that helps you automate this search.
### Keep Learning, stay awesome 🤗


@@ -0,0 +1,105 @@
# The Deep Q-Learning Algorithm [[deep-q-algorithm]]
We learned that Deep Q-Learning **uses a deep neural network to approximate the different Q-values for each possible action at a state** (value-function estimation).
The difference is that, during the training phase, instead of updating the Q-value of a state-action pair directly as we have done with Q-Learning:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-5.jpg" alt="Q Loss"/>
in Deep Q-Learning, we create a **loss function that compares our Q-value prediction and the Q-target and uses gradient descent to update the weights of our Deep Q-Network to approximate our Q-values better**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/Q-target.jpg" alt="Q-target"/>
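As a plain-Python sketch of this idea (the reward and Q-values below are made up for illustration; a real implementation would backpropagate this loss through the network's weights), the TD target and the loss can be written as:

```python
def td_target(reward, next_q_values, gamma=0.99, done=False):
    # y = r                              if the episode ended at s'
    # y = r + gamma * max_a' Q(s', a')   otherwise
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def dqn_loss(q_pred, target):
    # Squared error between the network's Q-value prediction and the TD target;
    # gradient descent on this loss is what updates the Deep Q-Network
    return (q_pred - target) ** 2

# Made-up example: reward 1.0, Q-values of the 3 possible next actions
y = td_target(1.0, [0.5, 2.0, 1.0], gamma=0.9)  # 1.0 + 0.9 * 2.0 = 2.8
loss = dqn_loss(1.5, y)
```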
The Deep Q-Learning training algorithm has *two phases*:
- **Sampling**: we perform actions and **store the observed experience tuples in a replay memory**.
- **Training**: Select a **small batch of tuples randomly and learn from this batch using a gradient descent update step**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/sampling-training.jpg" alt="Sampling Training"/>
This is not the only difference compared with Q-Learning. Deep Q-Learning training **might suffer from instability**, mainly because of combining a non-linear Q-value function (Neural Network) and bootstrapping (when we update targets with existing estimates and not an actual complete return).
To help us stabilize the training, we implement three different solutions:
1. *Experience Replay* to make more **efficient use of experiences**.
2. *Fixed Q-Target* **to stabilize the training**.
3. *Double Deep Q-Learning*, to **handle the problem of the overestimation of Q-values**.
Let's go through them!
## Experience Replay to make more efficient use of experiences [[exp-replay]]
Why do we create a replay memory?
Experience Replay in Deep Q-Learning has two functions:
1. **Make more efficient use of the experiences during the training**.
Usually, in online reinforcement learning, the agent interacts with the environment, gets experiences (state, action, reward, and next state), learns from them (updates the neural network), and then discards them. This is not efficient.
Experience replay helps us **use the experiences of the training more efficiently**. We use a replay buffer that saves experience samples **that we can reuse during the training.**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/experience-replay.jpg" alt="Experience Replay"/>
⇒ This allows the agent to **learn from the same experiences multiple times**.
2. **Avoid forgetting previous experiences and reduce the correlation between experiences**.
- The problem we get if we give sequential samples of experiences to our neural network is that it tends to forget **the previous experiences as it gets new experiences.** For instance, if the agent is in the first level and then in the second, which is different, it can forget how to behave and play in the first level.
The solution is to create a Replay Buffer that stores experience tuples while interacting with the environment and then sample a small batch of tuples. This prevents **the network from only learning about what it has done immediately before.**
Experience replay also has other benefits. By randomly sampling the experiences, we remove correlation in the observation sequences and avoid **action values from oscillating or diverging catastrophically.**
In the Deep Q-Learning pseudocode, we **initialize a replay memory buffer D with capacity N** (N is a hyperparameter that you can define). We then store experiences in the memory and sample a batch of experiences to feed the Deep Q-Network during the training phase.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/experience-replay-pseudocode.jpg" alt="Experience Replay Pseudocode"/>
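The replay buffer itself can be sketched in a few lines of plain Python (a minimal illustration, not the implementation used later in the hands-on):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores experience tuples (s, a, r, s', done) up to capacity N."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # oldest experiences are dropped first

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive experiences
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=100)
for i in range(10):
    buffer.push(i, 0, 1.0, i + 1, False)  # dummy experiences for illustration
batch = buffer.sample(4)
```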
## Fixed Q-Target to stabilize the training [[fixed-q]]
When we want to calculate the TD error (aka the loss), we calculate the **difference between the TD target (Q-Target) and the current Q-value (estimation of Q)**.
But we **don't have any idea of the real TD target**. We need to estimate it. Using the Bellman equation, we saw that the TD target is just the reward of taking that action at that state plus the discounted highest Q-value for the next state.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/Q-target.jpg" alt="Q-target"/>
However, the problem is that we are using the same parameters (weights) for estimating the TD target **and** the Q-value. Consequently, there is a significant correlation between the TD target and the parameters we are changing.
This means that at every step of training, **both our Q-values and the target values shift.** We're getting closer to our target, but the target is also moving. It's like chasing a moving target! This can lead to significant oscillation in training.
It's as if you were a cowboy (the Q estimation) trying to catch a cow (the Q-target). Your goal is to get closer (reduce the error).
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/qtarget-1.jpg" alt="Q-target"/>
At each time step, youre trying to approach the cow, which also moves at each time step (because you use the same parameters).
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/qtarget-2.jpg" alt="Q-target"/>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/qtarget-3.jpg" alt="Q-target"/>
This leads to a bizarre path of chasing (a significant oscillation in training).
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/qtarget-4.jpg" alt="Q-target"/>
Instead, what we see in the pseudo-code is that we:
- Use a **separate network with fixed parameters** for estimating the TD Target
- **Copy the parameters from our Deep Q-Network every C steps** to update the target network.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/fixed-q-target-pseudocode.jpg" alt="Fixed Q-target Pseudocode"/>
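As a toy illustration of this copy schedule (plain Python, with a dict standing in for a network's weights; real code would use your deep learning framework's parameter-copy utilities):

```python
# Hypothetical parameter dicts standing in for the two networks' weights
online_params = {"w": 0.5}   # updated by gradient descent at every step
target_params = {"w": 0.5}   # stays fixed between copies

C = 4  # copy period (a hyperparameter)
for step in range(1, 9):
    online_params["w"] += 0.1                # pretend gradient step on the online network
    if step % C == 0:                        # every C steps...
        target_params = dict(online_params)  # ...hard-copy into the target network
```

Between copies the target stays put, so the "cow" stops moving while the "cowboy" takes its gradient steps.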
## Double DQN [[double-dqn]]
Double DQNs, or Double Learning, were introduced [by Hado van Hasselt](https://papers.nips.cc/paper/3964-double-q-learning). This method **handles the problem of the overestimation of Q-values.**
To understand this problem, remember how we calculate the TD Target:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-1.jpg" alt="TD target"/>
We face a simple problem when calculating the TD target: how can we be sure that **the best action for the next state is the action with the highest Q-value?**
We know that the accuracy of Q-values depends on what action we tried **and** what neighboring states we explored.
Consequently, we don't have enough information about the best action to take at the beginning of the training. Therefore, taking the maximum Q-value (which is noisy) as the best action to take can lead to false positives. If non-optimal actions are regularly **given a higher Q-value than the optimal best action, learning will be complicated.**
The solution is: when we compute the Q target, we use two networks to decouple the action selection from the target Q-value generation. We:
- Use our **DQN network** to select the best action to take for the next state (the action with the highest Q-value).
- Use our **Target network** to calculate the target Q-value of taking that action at the next state.
Therefore, Double DQN helps us reduce the overestimation of Q-values and, as a consequence, helps us train faster and have more stable learning.
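The decoupled target computation can be sketched as follows (plain Python with made-up Q-values; `next_q_online` and `next_q_target` are hypothetical stand-ins for the two networks' outputs):

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    if done:
        return reward
    # Selection: the online DQN picks the best next action...
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # ...evaluation: the target network provides the Q-value of that action
    return reward + gamma * next_q_target[best_action]

# Made-up values: the online net prefers action 1, which the target net values at 0.5
y = double_dqn_target(1.0, next_q_online=[1.0, 3.0], next_q_target=[2.0, 0.5], gamma=0.9)
# y = 1.0 + 0.9 * 0.5 = 1.45
```

Note that a plain DQN target would instead use `max(next_q_target)` for both selection and evaluation, which is where the overestimation comes from.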
Since these three improvements to Deep Q-Learning, many more have been added, such as Prioritized Experience Replay and Dueling Deep Q-Learning. They're out of the scope of this course, but if you're interested, check the links we put in the reading list.


@@ -0,0 +1,41 @@
# The Deep Q-Network (DQN) [[deep-q-network]]
This is the architecture of our Deep Q-Learning network:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/deep-q-network.jpg" alt="Deep Q Network"/>
As input, we take a **stack of 4 frames** passed through the network as a state and output a **vector of Q-values for each possible action at that state**. Then, like with Q-Learning, we just need to use our epsilon-greedy policy to select which action to take.
When the Neural Network is initialized, **the Q-value estimation is terrible**. But during training, our Deep Q-Network agent will associate a situation with the appropriate action and **learn to play the game well**.
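That epsilon-greedy action selection can be sketched in a few lines (a minimal illustration; in the hands-on, the RL library handles this for you):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon, explore (random action); otherwise exploit (best action)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

# With epsilon=0 the choice is purely greedy: the highest Q-value wins
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)  # -> 1
```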
## Preprocessing the input and temporal limitation [[preprocessing]]
We need to **preprocess the input**. It's an essential step, since we want to **reduce the complexity of our state to reduce the computation time needed for training**.
To achieve this, we **reduce the state space to 84x84 and grayscale it**. We can do this since the colors in Atari environments don't add important information.
This is an essential saving since we **reduce our three color channels (RGB) to 1**.
We can also **crop a part of the screen in some games** if it does not contain important information.
Then we stack four frames together.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/preprocessing.jpg" alt="Preprocessing"/>
**Why do we stack four frames together?**
We stack frames together because it helps us **handle the problem of temporal limitation**. Let's take an example with the game of Pong. When you see this frame:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/temporal-limitation.jpg" alt="Temporal Limitation"/>
Can you tell me where the ball is going?
No, because one frame is not enough to have a sense of motion! But what if I add three more frames? **Here you can see that the ball is going to the right**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/temporal-limitation-2.jpg" alt="Temporal Limitation"/>
That's why, to capture temporal information, we stack four frames together.
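A minimal frame-stacking sketch (with strings standing in for real preprocessed image arrays):

```python
from collections import deque

class FrameStack:
    """Keeps the last 4 preprocessed frames; the stack is the agent's state."""

    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        # At episode start, repeat the first frame to fill the stack
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        self.frames.append(new_frame)  # the oldest frame is dropped automatically
        return list(self.frames)

stack = FrameStack()
state = stack.reset("frame0")
state = stack.step("frame1")  # state is now [frame0, frame0, frame0, frame1]
```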
Then, the stacked frames are processed by three convolutional layers. These layers **allow us to capture and exploit spatial relationships in images**. And because the frames are stacked together, **we can also exploit some temporal properties across those frames**.
If you don't know what convolutional layers are, don't worry. You can check [Lesson 4 of this free Deep Learning course by Udacity](https://www.udacity.com/course/deep-learning-pytorch--ud188).
Finally, we have a couple of fully connected layers that output a Q-value for each possible action at that state.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/deep-q-network.jpg" alt="Deep Q Network"/>
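To get a feel for the shapes involved, we can compute each layer's spatial output size. The kernel sizes and strides below are taken from the classic DQN paper, so treat them as an assumption; the exact architecture in the figure may differ:

```python
def conv_out(size, kernel, stride):
    # Output spatial size of a valid (no-padding) convolution
    return (size - kernel) // stride + 1

# 84x84 input with 4 stacked frames as channels (DQN-paper layer sizes, assumed)
size = conv_out(84, 8, 4)     # conv1: 32 filters, 8x8, stride 4 -> 20x20
size = conv_out(size, 4, 2)   # conv2: 64 filters, 4x4, stride 2 -> 9x9
size = conv_out(size, 3, 1)   # conv3: 64 filters, 3x3, stride 1 -> 7x7
flat = size * size * 64       # 7*7*64 = 3136 features into the fully connected layers
```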
So, we see that Deep Q-Learning uses a neural network to approximate, given a state, the different Q-values for each possible action at that state. Let's now study the Deep Q-Learning algorithm.


@@ -0,0 +1,34 @@
# From Q-Learning to Deep Q-Learning [[from-q-to-dqn]]
We learned that **Q-Learning is an algorithm we use to train our Q-Function**, an **action-value function** that determines the value of being at a particular state and taking a specific action at that state.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function.jpg" alt="Q-function"/>
</figure>
The **Q comes from "the Quality" of that action at that state.**
Internally, our Q-function has **a Q-table, a table where each cell corresponds to a state-action pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.**
The problem is that Q-Learning is a *tabular method*. Tabular methods only work when the state and action spaces **are small enough for the value functions to be represented as arrays and tables**. In other words, they are **not scalable**.
Q-Learning worked well with small state space environments like:
- FrozenLake, where we had 16 states.
- Taxi-v3, where we had 500 states.
But think of what we're going to do today: we will train an agent to learn to play Space Invaders, a more complex game, using the frames as input.
As **[Nikita Melkozerov mentioned](https://twitter.com/meln1k)**, Atari environments have an observation space with a shape of (210, 160, 3)*, containing values ranging from 0 to 255. That gives us \\(256^{210 \times 160 \times 3} = 256^{100800}\\) possible observations (for comparison, we have approximately \\(10^{80}\\) atoms in the observable universe).
*A single frame in Atari is an image of 210x160 pixels. Given that the images are in color (RGB), there are 3 channels. This is why the shape is (210, 160, 3). For each pixel, the value can go from 0 to 255.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari.jpg" alt="Atari State Space"/>
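A quick back-of-the-envelope check of that count (pure Python; only the order of magnitude matters here):

```python
import math

# Each observation has 210 * 160 * 3 entries, each taking one of 256 values
pixels = 210 * 160 * 3                      # 100800 entries per observation
# 256 ** 100800 is far too large to print, so count its decimal digits instead
digits = int(pixels * math.log10(256)) + 1  # roughly 240000 digits
# For comparison, the atom count of the observable universe (~10^80) has 81 digits
```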
Therefore, the state space is gigantic; because of this, creating and updating a Q-table for that environment would not be efficient. In this case, the best idea is to approximate the Q-values using a parametrized Q-function \\(Q_{\theta}(s,a)\\) instead of a Q-table.
This neural network will approximate, given a state, the different Q-values for each possible action at that state. And that's exactly what Deep Q-Learning does.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/deep.jpg" alt="Deep Q Learning"/>
Now that we understand Deep Q-Learning, let's dive deeper into the Deep Q-Network.

units/en/unit3/hands-on.mdx (new file, 314 lines)

@@ -0,0 +1,314 @@
# Hands-on [[hands-on]]
<CourseFloatingBanner classNames="absolute z-10 right-0 top-0"
notebooks={[
{label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/notebooks/unit3/unit3.ipynb"}
]}
askForHelpUrl="http://hf.co/join/discord" />
Now that you've studied the theory behind Deep Q-Learning, **you're ready to train your Deep Q-Learning agent to play Atari games**. We'll start with Space Invaders, but you'll be able to use any Atari game you want 🔥
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Environments"/>
We're using the [RL-Baselines-3 Zoo integration](https://github.com/DLR-RM/rl-baselines3-zoo), a vanilla version of Deep Q-Learning with no extensions such as Double-DQN, Dueling-DQN, or Prioritized Experience Replay.
Also, **if you want to learn to implement Deep Q-Learning by yourself after this hands-on**, you definitely should look at CleanRL implementation: https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn_atari.py
To validate this hands-on for the certification process, you need to push your trained model to the Hub and **get a result of >= 500**.
To find your result, go to the leaderboard and find your model: **result = mean_reward - std of reward**
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
**To start the hands-on click on Open In Colab button** 👇 :
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit3/unit3.ipynb)
# Unit 3: Deep Q-Learning with Atari Games 👾 using RL Baselines3 Zoo
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/thumbnail.jpg" alt="Unit 3 Thumbnail">
In this notebook, **you'll train a Deep Q-Learning agent** playing Space Invaders using [RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo), a training framework based on [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/) that provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.
We're using the [RL-Baselines-3 Zoo integration, a vanilla version of Deep Q-Learning](https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html) with no extensions such as Double-DQN, Dueling-DQN, and Prioritized Experience Replay.
⬇️ Here is an example of what **you will achieve** ⬇️
```python
%%html
<video controls autoplay><source src="https://huggingface.co/ThomasSimonini/ppo-SpaceInvadersNoFrameskip-v4/resolve/main/replay.mp4" type="video/mp4"></video>
```
### 🎮 Environments:
- SpaceInvadersNoFrameskip-v4
### 📚 RL-Library:
- [RL-Baselines3-Zoo](https://github.com/DLR-RM/rl-baselines3-zoo)
## Objectives 🏆
At the end of the notebook, you will:
- Be able to understand deeper **how RL Baselines3 Zoo works**.
- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.
## Prerequisites 🏗️
Before diving into the notebook, you need to:
🔲 📚 **[Study Deep Q-Learning by reading Unit 3](https://huggingface.co/deep-rl-course/unit3/introduction)** 🤗
We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the Github Repo](https://github.com/huggingface/deep-rl-class/issues).
# Let's train a Deep Q-Learning agent playing Atari's Space Invaders 👾 and upload it to the Hub.
To validate this hands-on for the certification process, you need to push your trained model to the Hub and **get a result of >= 500**.
To find your result, go to the leaderboard and find your model: **result = mean_reward - std of reward**
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
## Set the GPU 💪
- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step1.jpg" alt="GPU Step 1">
- `Hardware Accelerator > GPU`
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step2.jpg" alt="GPU Step 2">
## Create a virtual display 🔽
During the notebook, we'll need to generate a replay video. To do so, with Colab, **we need a virtual screen to be able to render the environment** (and thus record the frames).
Hence the following cells will install the libraries and create and run a virtual screen 🖥
```bash
apt install python-opengl
apt install ffmpeg
apt install xvfb
pip3 install pyvirtualdisplay
```
```bash
apt-get install swig cmake freeglut3-dev
```
```bash
pip install pyglet==1.5.1
```
```python
# Virtual display
from pyvirtualdisplay import Display
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()
```
## Clone RL-Baselines3 Zoo Repo 📚
You could directly install from the Python package (`pip install rl_zoo3`), but since we want **the full installation with extra environments and dependencies**, we're going to clone the `RL-Baselines3-Zoo` repository and install from source.
```bash
git clone https://github.com/DLR-RM/rl-baselines3-zoo
```
## Install dependencies 🔽
We can now install the dependencies RL-Baselines3 Zoo needs (this can take 5min ⏲)
```bash
cd /content/rl-baselines3-zoo/
```
```bash
pip install -r requirements.txt
```
## Train our Deep Q-Learning Agent to Play Space Invaders 👾
To train an agent with RL-Baselines3-Zoo, we just need to do two things:
1. We define the hyperparameters in `rl-baselines3-zoo/hyperparams/dqn.yml`
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit3/hyperparameters.png" alt="DQN Hyperparameters">
Here we see that:
- We use the `Atari Wrapper` that does the preprocessing (frame reduction, grayscale, stacking four frames),
- We use `CnnPolicy`, since we use Convolutional layers to process the frames.
- We train the model for 10 million `n_timesteps`.
- Memory (Experience Replay) size is 100000, i.e. the number of experience steps saved that you can reuse to train your agent.
💡 My advice is to **reduce the training timesteps to 1M,** which will take about 90 minutes on a P100. `!nvidia-smi` will tell you what GPU you're using. At 10 million steps, this will take about 9 hours, which could likely result in Colab timing out. I recommend running this on your local computer (or somewhere else). Just click on: `File>Download`.
In terms of hyperparameter optimization, my advice is to focus on these 3 hyperparameters:
- `learning_rate`
- `buffer_size` (Experience Memory size)
- `batch_size`
As a good practice, you should **check the documentation to understand what each hyperparameter does**: https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html#parameters
2. We run `train.py` and save the models in the `logs` folder 📁
```bash
python train.py --algo ________ --env SpaceInvadersNoFrameskip-v4 -f _________
```
#### Solution
```bash
python train.py --algo dqn --env SpaceInvadersNoFrameskip-v4 -f logs/
```
## Let's evaluate our agent 👀
- RL-Baselines3-Zoo provides `enjoy.py`, a Python script to evaluate our agent. In most RL libraries, the evaluation script is called `enjoy.py`.
- Let's evaluate it for 5000 timesteps 🔥
```bash
python enjoy.py --algo dqn --env SpaceInvadersNoFrameskip-v4 --no-render --n-timesteps _________ --folder logs/
```
#### Solution
```bash
python enjoy.py --algo dqn --env SpaceInvadersNoFrameskip-v4 --no-render --n-timesteps 5000 --folder logs/
```
## Publish our trained model on the Hub 🚀
Now that we've seen good results after training, we can publish our trained model on the Hub 🚀 with one line of code.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit3/space-invaders-model.gif" alt="Space Invaders model">
By using `rl_zoo3.push_to_hub`, **you evaluate, record a replay, generate a model card of your agent, and push it to the Hub**.
This way:
- You can **showcase your work** 🔥
- You can **visualize your agent playing** 👀
- You can **share with the community an agent that others can use** 💾
- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
To be able to share your model with the community, there are three more steps to follow:
1⃣ (If it's not already done) create an account on HF ➡ https://huggingface.co/join
2⃣ Sign in, then store your authentication token from the Hugging Face website.
- Create a new token (https://huggingface.co/settings/tokens) **with write role**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">
- Copy the token
- Run the cell below and paste the token
```python
from huggingface_hub import notebook_login

# Log in to our Hugging Face account to be able to upload models to the Hub
notebook_login()

# In a notebook, store the credentials so git remembers them (note the leading "!")
!git config --global credential.helper store
```
If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
3⃣ We're now ready to push our trained agent to the Hub 🔥
Let's run the `push_to_hub` script to upload our trained agent to the Hub. There are two important parameters:
* `--repo-name`: The name of the repo
* `-orga`: Your Hugging Face username
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit3/select-id.png" alt="Select Id">
```bash
python -m rl_zoo3.push_to_hub --algo dqn --env SpaceInvadersNoFrameskip-v4 --repo-name _____________________ -orga _____________________ -f logs/
```
#### Solution
```bash
python -m rl_zoo3.push_to_hub --algo dqn --env SpaceInvadersNoFrameskip-v4 --repo-name dqn-SpaceInvadersNoFrameskip-v4 -orga ThomasSimonini -f logs/
```
Congrats 🥳 you've just trained and uploaded your first Deep Q-Learning agent using RL-Baselines-3 Zoo. The script above should have displayed a link to a model repository such as https://huggingface.co/ThomasSimonini/dqn-SpaceInvadersNoFrameskip-v4. When you go to this link, you can:
- See a **video preview of your agent** at the right.
- Click "Files and versions" to see all the files in the repository.
- Click "Use in stable-baselines3" to get a code snippet that shows how to load the model.
- See the model card (`README.md` file), which gives a description of the model and the hyperparameters you used.
Under the hood, the Hub uses git-based repositories (don't worry if you don't know what git is), which means you can update the model with new versions as you experiment and improve your agent.
**Compare the results of your agents with your classmates** using the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) 🏆
## Load a powerful trained model 🔥
The Stable-Baselines3 team uploaded **more than 150 trained Deep Reinforcement Learning agents on the Hub**. You can download them and use them to see how they perform!
You can find them here: 👉 https://huggingface.co/sb3
Some examples:
- Asteroids: https://huggingface.co/sb3/dqn-AsteroidsNoFrameskip-v4
- Beam Rider: https://huggingface.co/sb3/dqn-BeamRiderNoFrameskip-v4
- Breakout: https://huggingface.co/sb3/dqn-BreakoutNoFrameskip-v4
- Road Runner: https://huggingface.co/sb3/dqn-RoadRunnerNoFrameskip-v4
Let's load an agent playing Beam Rider: https://huggingface.co/sb3/dqn-BeamRiderNoFrameskip-v4
```python
%%html
<video controls autoplay><source src="https://huggingface.co/sb3/dqn-BeamRiderNoFrameskip-v4/resolve/main/replay.mp4" type="video/mp4"></video>
```
1. We download the model using `rl_zoo3.load_from_hub`, and place it in a new folder that we can call `rl_trained`
```bash
# Download the model and save it into the rl_trained/ folder
python -m rl_zoo3.load_from_hub --algo dqn --env BeamRiderNoFrameskip-v4 -orga sb3 -f rl_trained/
```
2. Let's evaluate it for 5000 timesteps
```bash
python enjoy.py --algo dqn --env BeamRiderNoFrameskip-v4 -n 5000 -f rl_trained/
```
Why not try to train your own **Deep Q-Learning agent playing BeamRiderNoFrameskip-v4?** 🏆
If you want to try, check https://huggingface.co/sb3/dqn-BeamRiderNoFrameskip-v4#hyperparameters. There, **in the model card, you have the hyperparameters of the trained agent.**
But finding hyperparameters can be a daunting task. Fortunately, we'll see in the next bonus Unit, how we can **use Optuna for optimizing the Hyperparameters 🔥.**
## Some additional challenges 🏆
The best way to learn **is to try things on your own**!
In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?
Here's a list of environments you can try to train your agent with:
- BeamRiderNoFrameskip-v4
- BreakoutNoFrameskip-v4
- EnduroNoFrameskip-v4
- PongNoFrameskip-v4
Also, **if you want to learn to implement Deep Q-Learning by yourself**, you definitely should look at CleanRL implementation: https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn_atari.py
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Environments"/>
________________________________________________________________________
Congrats on finishing this chapter!
If you still feel confused by all these elements... it's totally normal! **This was the same for me and for everyone who studied RL.**
Take time to really **grasp the material before continuing, and try the additional challenges**. It's important to master these elements and have solid foundations.
In the next unit, **we're going to learn about [Optuna](https://optuna.org/)**. One of the most critical tasks in Deep Reinforcement Learning is finding a good set of training hyperparameters, and Optuna is a library that helps you automate this search.
See you in Bonus Unit 2! 🔥
### Keep Learning, Stay Awesome 🤗


@@ -0,0 +1,19 @@
# Deep Q-Learning [[deep-q-learning]]
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/thumbnail.jpg" alt="Unit 3 thumbnail" width="100%">
In the last unit, we learned our first reinforcement learning algorithm: Q-Learning, **implemented it from scratch**, and trained it in two environments, FrozenLake-v1 ☃️ and Taxi-v3 🚕.
We got excellent results with this simple algorithm, but these environments were relatively simple because the **state space was discrete and small** (16 different states for FrozenLake-v1 and 500 for Taxi-v3). For comparison, the state space in Atari games can **contain \\(10^{9}\\) to \\(10^{11}\\) states**.
But as we'll see, producing and updating a **Q-table can become ineffective in large state space environments.**
So in this unit, **we'll study our first Deep Reinforcement Learning agent**: Deep Q-Learning. Instead of using a Q-table, Deep Q-Learning uses a Neural Network that takes a state and approximates Q-values for each action based on that state.
And **we'll train it to play Space Invaders and other Atari environments using [RL-Zoo](https://github.com/DLR-RM/rl-baselines3-zoo)**, a training framework for RL using Stable-Baselines that provides scripts for training, evaluating agents, tuning hyperparameters, plotting results, and recording videos.
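To make the table-versus-network distinction concrete, here is a toy sketch in plain Python (not the convolutional DQN used in this unit — all names and sizes here are illustrative) contrasting a Q-table with a parametric Q-function:

```python
import random

# A Q-table stores one value per (state, action) pair: fine for
# FrozenLake-v1 (16 states x 4 actions), hopeless for Atari-scale spaces.
q_table = {(s, a): 0.0 for s in range(16) for a in range(4)}

# A parametric Q-function replaces the table with a small model whose
# parameter count depends on the feature size, not on the number of states.
def init_params(n_features, n_actions, seed=0):
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_features)]
            for _ in range(n_actions)]

def q_values(params, features):
    # One Q-value per action: dot product of each weight row with the features.
    return [sum(w * f for w, f in zip(row, features)) for row in params]

params = init_params(n_features=8, n_actions=4)
state_features = [0.5] * 8  # stand-in for a preprocessed observation
print(q_values(params, state_features))  # four Q-values, one per action
```

A real DQN swaps the linear model for a convolutional network, but the interface is the same: a state goes in, one Q-value per action comes out.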
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Environments"/>
So lets get started! 🚀

units/en/unit3/quiz.mdx Normal file
# Quiz [[quiz]]
The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.
### Q1: We mentioned Q Learning is a tabular method. What are tabular methods?
<details>
<summary>Solution</summary>
*Tabular methods* are approaches to problems in which the state and action spaces are small enough for the value functions to be **represented as arrays or tables**. For instance, **Q-Learning is a tabular method**, since we use a table to represent the state-action value pairs.
</details>
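As a concrete illustration of a tabular method, here is a minimal Q-Learning update on a dictionary-backed Q-table (a hypothetical sketch, not the course's implementation):

```python
# A hypothetical dictionary-backed Q-table with the standard Q-Learning update.
q_table = {}  # maps (state, action) -> Q-value, 0.0 when unseen

def q(state, action):
    return q_table.get((state, action), 0.0)

def update(state, action, reward, next_state, actions, alpha=0.1, gamma=0.99):
    # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    best_next = max(q(next_state, a) for a in actions)
    td_target = reward + gamma * best_next
    q_table[(state, action)] = q(state, action) + alpha * (td_target - q(state, action))

update(state=0, action=1, reward=1.0, next_state=2, actions=[0, 1, 2, 3])
print(q_table[(0, 1)])  # 0.1
```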
### Q2: Why can't we use classical Q-Learning to solve an Atari game?
<Question
choices={[
{
text: "Atari environments are too fast for Q-Learning",
explain: ""
},
{
text: "Atari environments have a big observation space, so creating and updating the Q-table would not be efficient",
explain: "",
correct: true
}
]}
/>
### Q3: Why do we stack four frames together when we use frames as input in Deep Q-Learning?
<details>
<summary>Solution</summary>
We stack frames together because it helps us **handle the problem of temporal limitation**: one frame is not enough to capture temporal information.
For instance, in Pong, our agent **would be unable to tell the ball's direction if it got only one frame**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/temporal-limitation.jpg" alt="Temporal limitation"/>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/temporal-limitation-2.jpg" alt="Temporal limitation"/>
</details>
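A frame stack is essentially a fixed-length sliding window over the most recent observations. A minimal sketch (4 frames, as in the DQN Atari setup; the class and frame names are illustrative):

```python
from collections import deque

# Minimal frame-stacking sketch: a sliding window of the last 4 observations.
class FrameStack:
    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        # At episode start, fill the stack by repeating the first frame.
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        # Each new frame evicts the oldest, keeping the most recent 4.
        self.frames.append(new_frame)
        return list(self.frames)

stack = FrameStack()
stack.reset("frame_0")
print(stack.step("frame_1"))  # ['frame_0', 'frame_0', 'frame_0', 'frame_1']
```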
### Q4: What are the two phases of Deep Q-Learning?
<Question
choices={[
{
text: "Sampling",
explain: "We perform actions and store the observed experiences tuples in a replay memory.",
correct: true,
},
{
text: "Shuffling",
explain: "",
},
{
text: "Reranking",
explain: "",
},
{
text: "Training",
explain: "We randomly select a small batch of tuples and learn from it using a gradient descent update step.",
correct: true,
}
]}
/>
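The two phases can be sketched as a pair of functions that alternate during training; the environment and the gradient step are stubbed out here, so every name is illustrative:

```python
import random

# Sketch of the two alternating phases of Deep Q-Learning.
replay_memory = []

def sampling_phase(env_step, n_steps=10):
    # Phase 1 - Sampling: act, then store the experience tuple in memory.
    for _ in range(n_steps):
        replay_memory.append(env_step())  # (state, action, reward, next_state)

def training_phase(learn, batch_size=4):
    # Phase 2 - Training: draw a random mini-batch and do one update step.
    batch = random.sample(replay_memory, batch_size)
    learn(batch)

fake_env_step = lambda: (0, 1, 1.0, 2)  # stand-in for acting in the env
sampling_phase(fake_env_step)
training_phase(lambda batch: None)      # stand-in for a gradient descent step
print(len(replay_memory))  # 10
```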
### Q5: Why do we create a replay memory in Deep Q-Learning?
<details>
<summary>Solution</summary>
**1. Make more efficient use of the experiences during the training**
Usually, in online reinforcement learning, the agent interacts with the environment, gets experiences (state, action, reward, and next state), learns from them (updates the neural network), and discards them. This is not efficient.
But with experience replay, **we create a replay buffer that saves experience samples that we can reuse during the training**.
**2. Avoid forgetting previous experiences and reduce the correlation between experiences**
The problem with giving sequential samples of experiences to our neural network is that it **tends to forget previous experiences as they are overwritten by new ones**. For instance, if we are in the first level and then the second, which is different, our agent can forget how to behave and play in the first level.
</details>
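Both points above can be captured in a few lines: a bounded FIFO that keeps experiences around for reuse, plus uniform random sampling that breaks the correlation between consecutive experiences. A hypothetical sketch:

```python
import random
from collections import deque

# Minimal replay buffer sketch: a bounded FIFO plus uniform random sampling.
class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.memory = deque(maxlen=capacity)  # oldest experiences drop out first

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between
        # consecutive experiences and lets us reuse each one many times.
        return random.sample(list(self.memory), batch_size)

buffer = ReplayBuffer(capacity=100)
for step in range(150):
    buffer.push(step, 0, 0.0, step + 1, False)
print(len(buffer.memory))  # 100: the 50 oldest experiences were evicted
```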
### Q6: How do we use Double Deep Q-Learning?
<details>
<summary>Solution</summary>
When we compute the Q target, we use two networks to decouple the action selection from the target Q value generation. We:
- Use our *DQN network* to **select the best action to take for the next state** (the action with the highest Q value).
- Use our *Target network* to calculate **the target Q value of taking that action at the next state**.
</details>
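The decoupling above can be sketched with toy Q-values stored as plain lists (one entry per action); both "networks" are just lists here, so this is an illustration of the target computation, not a full implementation:

```python
# Double DQN target sketch: the online net picks the action,
# the target net evaluates it.
def double_dqn_target(reward, gamma, online_q_next, target_q_next, done):
    if done:
        return reward  # no bootstrapping on terminal states
    # 1) The online DQN selects the best action for the next state...
    best_action = max(range(len(online_q_next)), key=lambda a: online_q_next[a])
    # 2) ...and the target network evaluates that action.
    return reward + gamma * target_q_next[best_action]

online_q = [1.0, 3.0, 2.0]   # the online net prefers action 1
target_q = [1.5, 0.5, 2.5]   # the target net's estimates for the same actions
print(double_dqn_target(reward=1.0, gamma=0.99,
                        online_q_next=online_q, target_q_next=target_q,
                        done=False))  # reward + gamma * target_q[1], i.e. about 1.495
```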
Congrats on finishing this quiz 🥳! If you missed some elements, take time to read the chapter again to reinforce (😏) your knowledge.

# Hands-on [[hands-on]]
Now that you've learned to use Optuna, we give you some ideas to apply what you've learned:
1⃣ **Beat your LunarLander-v2 agent results** by using Optuna to find a better set of hyperparameters. You can also try another environment, such as MountainCar-v0 or CartPole-v1.
2⃣ **Beat your SpaceInvaders agent results**.
By doing that, you'll see how valuable and powerful Optuna is for training better agents.

Have fun!

# Introduction [[introduction]]
One of the most critical tasks in Deep Reinforcement Learning is to **find a good set of training hyperparameters**.
<img src="https://raw.githubusercontent.com/optuna/optuna/master/docs/image/optuna-logo.png" alt="Optuna Logo"/>
[Optuna](https://optuna.org/) is a library that helps you automate the search. In this unit, we'll study a **little bit of the theory behind automatic hyperparameter tuning**. We'll first try to manually optimize the hyperparameters of the DQN studied in the last unit, and then we'll **learn how to automate the search using Optuna**.
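To build intuition for what "automating the search" means, here is a toy random search over two hyperparameters. This is NOT Optuna's API, just the core idea; the objective function is a made-up stand-in for "train an agent and return its mean reward":

```python
import random

# Toy random search: try random hyperparameter sets, keep the best-scoring one.
def objective(hparams):
    # Made-up objective: pretend the best values are lr=0.001 and gamma=0.99.
    return -abs(hparams["learning_rate"] - 0.001) - abs(hparams["gamma"] - 0.99)

def random_search(n_trials=50, seed=0):
    rng = random.Random(seed)
    best_score, best_hparams = float("-inf"), None
    for _ in range(n_trials):
        hparams = {
            "learning_rate": 10 ** rng.uniform(-5, -2),  # log-uniform sample
            "gamma": rng.uniform(0.9, 0.9999),
        }
        score = objective(hparams)
        if score > best_score:
            best_score, best_hparams = score, hparams
    return best_hparams

print(random_search())  # the best hyperparameter set found in 50 trials
```

Optuna builds on this idea with smarter samplers (such as TPE) and early pruning of unpromising trials.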

# Optuna Tutorial [[optuna]]
The content below comes from [Antonin Raffin's ICRA 2022 presentations](https://araffin.github.io/tools-for-robotic-rl-icra2022/). He's one of the founders of Stable-Baselines and RL-Baselines3-Zoo.
## The theory behind Hyperparameter tuning
<Youtube id="AidFTOdGNFQ" />
## Optuna Tutorial
<Youtube id="ihP7E76KGOI" />
The notebook 👉 [here](https://colab.research.google.com/github/araffin/tools-for-robotic-rl-icra2022/blob/main/notebooks/optuna_lab.ipynb)