mirror of
https://github.com/huggingface/deep-rl-class.git
synced 2026-04-05 11:38:43 +08:00
Update Unit 3 and 4
The difference is that, during the training phase, instead of updating the Q-value of a state-action pair directly as we did with Q-Learning:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-5.jpg" alt="Q Loss"/>

in Deep Q-Learning, we create a **loss function that compares our Q-value prediction and the Q-target and uses gradient descent to update the weights of our Deep Q-Network to approximate our Q-values better**.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/Q-target.jpg" alt="Q-target"/>
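This update can be sketched in a few lines of NumPy (all numbers here are made up for illustration; a real implementation would use a neural network and an optimizer):

```python
import numpy as np

gamma = 0.99                          # discount factor
reward = 1.0
done = 0.0                            # 1.0 if the episode ended at next_state
q_next = np.array([0.2, 0.5, 0.1])   # target-network Q-values for next_state
q_pred = 0.3                          # online-network Q-value for (state, action)

# Q-target: immediate reward + discounted highest Q-value of the next state.
q_target = reward + gamma * (1.0 - done) * q_next.max()

# The loss compares the prediction with the Q-target; gradient descent on the
# network weights would minimize it (here we just compute the scalar).
loss = (q_target - q_pred) ** 2
```

With these toy numbers, the Q-target is \\(1.0 + 0.99 \times 0.5 = 1.495\\) and the squared error is \\((1.495 - 0.3)^2\\).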
Experience Replay in Deep Q-Learning has two functions:

1. **Make more efficient use of the experiences during the training**.

Usually, in online reinforcement learning, the agent interacts in the environment, gets experiences (state, action, reward, and next state), learns from them (updates the neural network), and discards them. This is not efficient.

Experience replay helps us **use the experiences of the training more efficiently**. We use a replay buffer that saves experience samples **that we can reuse during the training.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/experience-replay.jpg" alt="Experience Replay"/>

⇒ This allows the agent to **learn from the same experiences multiple times**.
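A minimal replay buffer can be sketched like this (an illustrative toy, not the actual Stable-Baselines3 implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        # deque with maxlen drops the oldest experience when full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform random sampling reduces correlation between
        # consecutive experiences in a training batch
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(t, 0, 1.0, t + 1, False)
# capacity is 3, so only the 3 most recent experiences remain,
# and the same stored experience can be sampled many times
```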
But we **don't have any idea of the real TD target**. We need to estimate it.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/Q-target.jpg" alt="Q-target"/>

However, the problem is that we are using the same parameters (weights) for estimating the TD target **and** the Q-value. Consequently, there is a significant correlation between the TD target and the parameters we are changing.

Therefore, at every step of training, **our Q-values shift, but the target value also shifts.** We're getting closer to our target, but the target is also moving. It's like chasing a moving target! This can lead to significant oscillation in training.

It's like if you were a cowboy (the Q estimation) and you wanted to catch the cow (the Q-target). Your goal is to get closer (reduce the error).
Double DQNs, or double learning, were introduced by Hado van Hasselt.

To understand this problem, remember how we calculate the TD target:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-1.jpg" alt="TD target"/>

We face a simple problem by calculating the TD target: how are we sure that **the best action for the next state is the action with the highest Q-value?**

We know that the accuracy of Q-values depends on what action we tried **and** what neighboring states we explored.

Consequently, we don't have enough information about the best action to take at the beginning of the training. Therefore, taking the maximum Q-value (which is noisy) as the best action to take can lead to false positives. If non-optimal actions are regularly **given a higher Q-value than the optimal best action, the learning will be complicated.**

The solution is: when we compute the Q-target, we use two networks to decouple the action selection from the target Q-value generation. We:

- Use our **DQN network** to select the best action to take for the next state (the action with the highest Q-value).
- Use our **Target network** to calculate the target Q-value of taking that action at the next state.

Therefore, Double DQN helps us reduce the overestimation of Q-values and, as a consequence, helps us train faster and have more stable learning.

Since these three improvements in Deep Q-Learning, many more have been added, such as Prioritized Experience Replay and Dueling Deep Q-Learning. They're out of the scope of this course, but if you're interested, check the links in the reading list.
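The decoupling of action selection from evaluation can be sketched with NumPy (the Q-value arrays are random placeholders standing in for network outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
n_transitions, n_actions = 5, 4

# Placeholder Q-values for a batch of next states: one row per
# transition, one column per action.
q_online = rng.normal(size=(n_transitions, n_actions))  # online DQN network
q_target = rng.normal(size=(n_transitions, n_actions))  # target network
rewards = np.ones(n_transitions)
dones = np.zeros(n_transitions)  # 1.0 where the episode ended
gamma = 0.99

# Vanilla DQN: the target network both SELECTS and EVALUATES the action.
dqn_target = rewards + gamma * (1 - dones) * q_target.max(axis=1)

# Double DQN: the online network SELECTS the action...
best_actions = q_online.argmax(axis=1)
# ...and the target network EVALUATES that action.
double_target = rewards + gamma * (1 - dones) * q_target[
    np.arange(n_transitions), best_actions
]
```

Note that the Double DQN target can never exceed the vanilla DQN target for the same transition, which is exactly how it curbs overestimation.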
When the Neural Network is initialized, **the Q-value estimation is terrible**.

## Preprocessing the input and temporal limitation [[preprocessing]]

We need to **preprocess the input**. It's an essential step since we want to **reduce the complexity of our state to reduce the computation time needed for training**.

To achieve this, we **reduce the state space to 84x84 and grayscale it**. We can do this since the colors in Atari environments don't add important information.

This is an essential saving since we **reduce our three color channels (RGB) to 1**.

A single frame doesn't give any sense of motion. That's why, to capture temporal information, we stack four frames together.

Then, the stacked frames are processed by three convolutional layers. These layers **allow us to capture and exploit spatial relationships in images**. And because the frames are stacked together, **we can also exploit some temporal properties across those frames**.

If you don't know what convolutional layers are, don't worry. You can check [Lesson 4 of this free Deep Learning course by Udacity](https://www.udacity.com/course/deep-learning-pytorch--ud188).

Finally, we have a couple of fully connected layers that output a Q-value for each possible action at that state.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/deep-q-network.jpg" alt="Deep Q Network"/>
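The preprocessing steps above can be sketched like this (an illustrative toy pipeline, not the actual Atari wrapper code; resizing to 84x84 is assumed to have already happened):

```python
import numpy as np
from collections import deque

def to_grayscale(frame_rgb):
    # Standard luminance weights collapse the 3 RGB channels into 1.
    return frame_rgb @ np.array([0.299, 0.587, 0.114])

frames = deque(maxlen=4)           # only the 4 most recent frames are kept
for _ in range(6):                 # simulate 6 environment steps
    rgb_frame = np.zeros((84, 84, 3))
    frames.append(to_grayscale(rgb_frame))

stacked = np.stack(frames)         # shape (4, 84, 84): the network's input
```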
We learned that **Q-Learning is an algorithm we use to train our Q-Function**, an **action-value function** that determines the value of being at a particular state and taking a specific action at that state.

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function.jpg" alt="Q-function"/>
<figcaption>Given a state and action, our Q Function outputs a state-action value (also called Q-value)</figcaption>
</figure>

The **Q comes from "the Quality" of that action at that state.**

Q-Learning worked well with small state space environments like FrozenLake-v1 and Taxi-v3.

But think of what we're going to do today: we will train an agent to learn to play Space Invaders, a more complex game, using the frames as input.

As **[Nikita Melkozerov mentioned](https://twitter.com/meln1k), Atari environments** have an observation space with a shape of (210, 160, 3)*, containing values ranging from 0 to 255, so that gives us \\(256^{210 \times 160 \times 3} = 256^{100800}\\) possible observations (for comparison, we have approximately \\(10^{80}\\) atoms in the observable universe).

\* A single frame in Atari is composed of an image of 210x160 pixels. Given that the images are in color (RGB), there are 3 channels. This is why the shape is (210, 160, 3). For each pixel, the value can go from 0 to 255.
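To get a feel for how large \\(256^{100800}\\) is, we can count its decimal digits instead of computing it (illustrative arithmetic only):

```python
import math

entries_per_frame = 210 * 160 * 3      # pixels times RGB channels = 100800
# Each entry takes one of 256 values, so there are 256**entries_per_frame
# distinct frames; its number of decimal digits is the exponent times log10(256).
digits = entries_per_frame * math.log10(256)
```

That is roughly 242,751 digits, while \\(10^{80}\\), the number of atoms in the observable universe, has only 81.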
# Hands-on [[hands-on]]

<CourseFloatingBanner classNames="absolute z-10 right-0 top-0"
notebooks={[
  {label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit3/unit3.ipynb"}
  ]}
askForHelpUrl="http://hf.co/join/discord" />

Now that you've studied the theory behind Deep Q-Learning, **you're ready to train your Deep Q-Learning agent to play Atari games**. We'll start with Space Invaders, but you'll be able to use any Atari game you want 🔥

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Environments"/>
We're using the [RL-Baselines-3 Zoo integration](https://github.com/DLR-RM/rl-baselines3-zoo), a vanilla version of Deep Q-Learning with no extensions such as Double-DQN, Dueling-DQN, or Prioritized Experience Replay.

To validate this hands-on for the certification process, you need to push your trained model to the Hub and **get a result of >= 500**.

To find your result, go to the leaderboard and find your model: **the result = mean_reward - std of reward**.

For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process

**To start the hands-on, click on the Open In Colab button** 👇:

[Open In Colab](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit3/unit3.ipynb)
# Unit 3: Deep Q-Learning with Atari Games 👾 using RL Baselines3 Zoo

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/thumbnail.jpg" alt="Unit 3 Thumbnail">

In this notebook, **you'll train a Deep Q-Learning agent** playing Space Invaders using [RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo), a training framework based on [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/) that provides scripts for training and evaluating agents, tuning hyperparameters, plotting results, and recording videos.

We're using the [RL-Baselines-3 Zoo integration, a vanilla version of Deep Q-Learning](https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html) with no extensions such as Double-DQN, Dueling-DQN, or Prioritized Experience Replay.

⬇️ Here is an example of what **you will achieve** ⬇️

```python
%%html
<video controls autoplay><source src="https://huggingface.co/ThomasSimonini/ppo-SpaceInvadersNoFrameskip-v4/resolve/main/replay.mp4" type="video/mp4"></video>
```
### 🎮 Environments:

- SpaceInvadersNoFrameskip-v4

### 📚 RL-Library:

- [RL-Baselines3-Zoo](https://github.com/DLR-RM/rl-baselines3-zoo)

## Objectives 🏆

At the end of the notebook, you will:

- Be able to understand more deeply **how RL Baselines3 Zoo works**.
- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.

## Prerequisites 🏗️

Before diving into the notebook, you need to:

🔲 📚 **[Study Deep Q-Learning by reading Unit 3](https://huggingface.co/deep-rl-course/unit3/introduction)** 🤗

We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub repo](https://github.com/huggingface/deep-rl-class/issues).
# Let's train a Deep Q-Learning agent playing Atari's Space Invaders 👾 and upload it to the Hub

To validate this hands-on for the certification process, you need to push your trained model to the Hub and **get a result of >= 500**.

To find your result, go to the leaderboard and find your model: **the result = mean_reward - std of reward**.

For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
## Set the GPU 💪

- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step1.jpg" alt="GPU Step 1">

- `Hardware Accelerator > GPU`

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step2.jpg" alt="GPU Step 2">
## Create a virtual display 🔽

During the notebook, we'll need to generate a replay video. To do so, in Colab, **we need a virtual screen to be able to render the environment** (and thus record the frames).

Hence, the following cells will install the libraries and create and run a virtual screen 🖥

```bash
apt install python-opengl
apt install ffmpeg
apt install xvfb
pip3 install pyvirtualdisplay
```

```bash
apt-get install swig cmake freeglut3-dev
```

```bash
pip install pyglet==1.5.1
```

```python
# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()
```
## Clone the RL-Baselines3 Zoo Repo 📚

You can now install directly from the Python package with `pip install rl_zoo3`, but since we want **the full installation with extra environments and dependencies**, we're going to clone the `RL-Baselines3-Zoo` repository and install it from source.

```bash
git clone https://github.com/DLR-RM/rl-baselines3-zoo
```
## Install dependencies 🔽

We can now install the dependencies RL-Baselines3 Zoo needs (this can take 5 min ⏲).

We'll also install:

- `huggingface_sb3`: Additional code for Stable-Baselines3 to load and upload models from the Hugging Face 🤗 Hub.

```bash
cd /content/rl-baselines3-zoo/
```

```bash
pip install -r requirements.txt
```

```bash
pip install huggingface_sb3
```
## Train our Deep Q-Learning Agent to Play Space Invaders 👾

To train an agent with RL-Baselines3-Zoo, we just need to do two things:

1. Define the hyperparameters in `rl-baselines3-zoo/hyperparams/dqn.yml`

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit3/hyperparameters.png" alt="DQN Hyperparameters">

Here we see that:

- We use the `Atari Wrapper` that preprocesses the input (frame reduction, grayscale, stacking 4 frames)
- We use `CnnPolicy`, since we use convolutional layers to process the frames
- We train it for 10 million `n_timesteps`
- The memory (Experience Replay) size is 100000, i.e. the number of experience steps saved to train your agent on.

💡 My advice is to **reduce the training timesteps to 1M**, which will take about 90 minutes on a P100. `!nvidia-smi` will tell you what GPU you're using. At 10 million steps, training will take about 9 hours, which will likely result in Colab timing out. I recommend running this on your local computer (or somewhere else). Just click on `File > Download`.

In terms of hyperparameter optimization, my advice is to focus on these 3 hyperparameters:

- `learning_rate`
- `buffer_size` (Experience Memory size)
- `batch_size`

As good practice, you should **check the documentation to understand what each hyperparameter does**: https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html#parameters
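One more knob worth understanding is the exploration schedule: Stable-Baselines3's DQN anneals ε linearly from 1.0 down to `exploration_final_eps` over the first `exploration_fraction` of training, then keeps it constant. A minimal sketch of that schedule (the function and its defaults are illustrative, not the SB3 code):

```python
def epsilon_schedule(step, total_steps, exploration_fraction=0.1, final_eps=0.01):
    """Linearly anneal epsilon from 1.0 to final_eps over the first
    exploration_fraction of training, then keep it constant."""
    progress = min(step / (exploration_fraction * total_steps), 1.0)
    return 1.0 + progress * (final_eps - 1.0)

# With 1M total steps and the defaults above, epsilon starts at 1.0
# and reaches final_eps after the first 100k steps.
eps_start = epsilon_schedule(0, 1_000_000)
eps_mid = epsilon_schedule(50_000, 1_000_000)
eps_end = epsilon_schedule(100_000, 1_000_000)
```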
2. Run `train.py` and save the models in the `logs` folder 📁

```bash
python train.py --algo ________ --env SpaceInvadersNoFrameskip-v4 -f _________
```

#### Solution

```bash
python train.py --algo dqn --env SpaceInvadersNoFrameskip-v4 -f logs/
```
## Let's evaluate our agent 👀

- RL-Baselines3-Zoo provides `enjoy.py` to evaluate our agent.
- Let's evaluate it for 5000 timesteps 🔥

```bash
python enjoy.py --algo dqn --env SpaceInvadersNoFrameskip-v4 --no-render --n-timesteps _________ --folder logs/
```

#### Solution

```bash
python enjoy.py --algo dqn --env SpaceInvadersNoFrameskip-v4 --no-render --n-timesteps 5000 --folder logs/
```
## Publish our trained model on the Hub 🚀

Now that we saw we got good results after the training, we can publish our trained model on the Hub 🤗 with one line of code.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit3/space-invaders-model.gif" alt="Space Invaders model">

By using `rl_zoo3.push_to_hub`, **you evaluate, record a replay, generate a model card of your agent, and push it to the Hub**.

This way:

- You can **showcase your work** 🔥
- You can **visualize your agent playing** 👀
- You can **share with the community an agent that others can use** 💾
- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard

To be able to share your model with the community, there are three more steps to follow:

1️⃣ (If it's not already done) create an account on HF ➡ https://huggingface.co/join

2️⃣ Sign in and store your authentication token from the Hugging Face website.

- Create a new token (https://huggingface.co/settings/tokens) **with write role**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">

- Copy the token
- Run the cell below and paste the token

```python
from huggingface_hub import notebook_login  # To log in to our Hugging Face account to be able to upload models to the Hub.

notebook_login()
!git config --global credential.helper store
```

If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥

Let's run `rl_zoo3.push_to_hub` to upload our trained agent to the Hub:

`--repo-name`: The name of the repo

`-orga`: Your Hugging Face username

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit3/select-id.png" alt="Select Id">

```bash
python -m rl_zoo3.push_to_hub --algo dqn --env SpaceInvadersNoFrameskip-v4 --repo-name _____________________ -orga _____________________ -f logs/
```

#### Solution

```bash
python -m rl_zoo3.push_to_hub --algo dqn --env SpaceInvadersNoFrameskip-v4 --repo-name dqn-SpaceInvadersNoFrameskip-v4 -orga ThomasSimonini -f logs/
```

Congrats 🥳 you've just trained and uploaded your first Deep Q-Learning agent using RL-Baselines-3 Zoo. The script above should have displayed a link to a model repository such as https://huggingface.co/ThomasSimonini/dqn-SpaceInvadersNoFrameskip-v4. When you go to this link, you can:

- See a **video preview of your agent** on the right.
- Click "Files and versions" to see all the files in the repository.
- Click "Use in stable-baselines3" to get a code snippet that shows how to load the model.
- See the model card (`README.md` file), which gives a description of the model and the hyperparameters you used.

Under the hood, the Hub uses git-based repositories (don't worry if you don't know what git is), which means you can update the model with new versions as you experiment and improve your agent.

**Compare the results of your agents with your classmates'** using the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) 🏆
## Load a powerful trained model 🔥

The Stable-Baselines3 team uploaded **more than 150 trained Deep Reinforcement Learning agents on the Hub**.

You can find them here: 👉 https://huggingface.co/sb3

Some examples:

- Asteroids: https://huggingface.co/sb3/dqn-AsteroidsNoFrameskip-v4
- Beam Rider: https://huggingface.co/sb3/dqn-BeamRiderNoFrameskip-v4
- Breakout: https://huggingface.co/sb3/dqn-BreakoutNoFrameskip-v4
- Road Runner: https://huggingface.co/sb3/dqn-RoadRunnerNoFrameskip-v4

Let's load an agent playing Beam Rider: https://huggingface.co/sb3/dqn-BeamRiderNoFrameskip-v4

```python
%%html
<video controls autoplay><source src="https://huggingface.co/sb3/dqn-BeamRiderNoFrameskip-v4/resolve/main/replay.mp4" type="video/mp4"></video>
```

1. We download the model using `rl_zoo3.load_from_hub` and place it in a new folder called `rl_trained`

```bash
# Download model and save it into the rl_trained/ folder
python -m rl_zoo3.load_from_hub --algo dqn --env BeamRiderNoFrameskip-v4 -orga sb3 -f rl_trained/
```

2. Let's evaluate it for 5000 timesteps

```bash
python enjoy.py --algo dqn --env BeamRiderNoFrameskip-v4 -n 5000 -f rl_trained/
```

Why not try to train your own **Deep Q-Learning agent playing BeamRiderNoFrameskip-v4? 🏆**

If you want to try, check https://huggingface.co/sb3/dqn-BeamRiderNoFrameskip-v4#hyperparameters: **in the model card, you have the hyperparameters of the trained agent.**

But finding hyperparameters can be a daunting task. Fortunately, we'll see in the next Unit how we can **use Optuna to optimize the hyperparameters 🔥.**
## Some additional challenges 🏆

The best way to learn **is to try things on your own**!

In the [Leaderboard](https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?

Here's a list of environments you can try to train your agent with:

- BeamRiderNoFrameskip-v4
- BreakoutNoFrameskip-v4
- EnduroNoFrameskip-v4
- PongNoFrameskip-v4

Also, **if you want to learn to implement Deep Q-Learning by yourself**, you definitely should look at the CleanRL implementation: https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn_atari.py

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Environments"/>
________________________________________________________________________

Congrats on finishing this chapter!

If you still feel confused by all these elements... it's totally normal! **This was the same for me and for everyone who has studied RL.**

Take time to really **grasp the material before continuing, and try the additional challenges**. It's important to master these elements and have a solid foundation.

In the next unit, **we're going to learn about [Optuna](https://optuna.org/)**. One of the most critical tasks in Deep Reinforcement Learning is finding a good set of training hyperparameters, and Optuna is a library that helps you automate the search.

See you in Bonus Unit 2! 🔥

### Keep Learning, Stay Awesome 🤗
In the last unit, we learned our first reinforcement learning algorithm: Q-Learning, **implemented it from scratch**, and trained it in two environments, FrozenLake-v1 ☃️ and Taxi-v3 🚕.

We got excellent results with this simple algorithm, but these environments were relatively simple because the **state space was discrete and small** (16 different states for FrozenLake-v1 and 500 for Taxi-v3). For comparison, the state space in Atari games can **contain \\(10^{9}\\) to \\(10^{11}\\) states**.

But as we'll see, producing and updating a **Q-table can become ineffective in large state space environments.**
For instance, in Pong, our agent **will be unable to know the ball direction if it only gets one frame**.

**1. Make more efficient use of the experiences during the training**

Usually, in online reinforcement learning, the agent interacts in the environment, gets experiences (state, action, reward, and next state), learns from them (updates the neural network), and discards them. This is not efficient.

But with experience replay, **we create a replay buffer that saves experience samples that we can reuse during the training**.

**2. Avoid forgetting previous experiences and reduce the correlation between experiences**
The content below comes from Antonin Raffin's ICRA 2022 presentations.

## The theory behind hyperparameter tuning

<Youtube id="AidFTOdGNFQ" />

## Optuna Tutorial

<Youtube id="ihP7E76KGOI" />

The notebook 👉 [here](https://colab.research.google.com/github/araffin/tools-for-robotic-rl-icra2022/blob/main/notebooks/optuna_lab.ipynb)