mirror of
https://github.com/huggingface/deep-rl-class.git
synced 2026-04-13 16:29:42 +08:00
Merge branch 'main' into ThomasSimonini/MLAgents
notebooks/unit1/requirements-unit1.txt (new file, 5 lines)

@@ -0,0 +1,5 @@
+stable-baselines3[extra]
+box2d
+box2d-kengz
+huggingface_sb3
+pyglet==1.5.1
@@ -104,9 +104,9 @@
     "## Prerequisites 🏗️\n",
     "Before diving into the notebook, you need to:\n",
     "\n",
-    "🔲 📝 **Done Unit 0** that gives you all the **information about the course and help you to onboard** 🤗 ADD LINK \n",
+    "🔲 📝 **[Read Unit 0](https://huggingface.co/deep-rl-course/unit0/introduction)** that gives you all the **information about the course and help you to onboard** 🤗\n",
     "\n",
-    "🔲 📚 **Develop an understanding of the foundations of Reinforcement learning** (MC, TD, Rewards hypothesis...) by doing Unit 1 👉 ADD LINK"
+    "🔲 📚 **Develop an understanding of the foundations of Reinforcement learning** (MC, TD, Rewards hypothesis...) by [reading Unit 1](https://huggingface.co/deep-rl-course/unit1/introduction)."
    ]
   },
   {
@@ -166,6 +166,20 @@
     "# Let's train our first Deep Reinforcement Learning agent and upload it to the Hub 🚀\n"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Get a certificate\n",
+    "To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained model to the Hub and **get a result of >= 200**.\n",
+    "\n",
+    "To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**\n",
+    "\n",
+    "For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process"
+   ],
+   "metadata": {
+    "id": "qDploC3jSH99"
+   }
+  },
   {
    "cell_type": "markdown",
    "source": [
@@ -233,7 +247,7 @@
    },
    "outputs": [],
    "source": [
-    "!pip install -r https://huggingface.co/spaces/ThomasSimonini/temp-space-requirements/raw/main/requirements/requirements-unit1.txt"
+    "!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit1/requirements-unit1.txt"
    ]
   },
   {
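The score formula in the added certificate cell (result = mean_reward - std of reward) can be sketched in a few lines; the episode rewards below are made-up numbers for illustration, not real evaluation output:

```python
from statistics import mean, pstdev

# Hypothetical per-episode rewards from evaluating a trained agent.
episode_rewards = [250.0, 180.0, 260.0, 230.0, 210.0]

mean_reward = mean(episode_rewards)
std_reward = pstdev(episode_rewards)  # population standard deviation

# The leaderboard score penalizes inconsistency:
# result = mean_reward - std of reward.
result = mean_reward - std_reward
```

An erratic agent with the same mean but a larger spread of rewards would score lower than a consistent one.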
@@ -3,8 +3,7 @@
   {
    "cell_type": "markdown",
    "metadata": {
-    "id": "view-in-github",
-    "colab_type": "text"
+    "id": "view-in-github"
    },
    "source": [
     "<a href=\"https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/notebooks/unit2/unit2.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
@@ -169,6 +168,20 @@
     "id": "HEtx8Y8MqKfH"
    }
   },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "\n",
+    "To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained Taxi model to the Hub and **get a result of >= 4.5**.\n",
+    "\n",
+    "To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**\n",
+    "\n",
+    "For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process"
+   ],
+   "metadata": {
+    "id": "Kdxb1IhzTn0v"
+   }
+  },
   {
    "cell_type": "markdown",
    "source": [
@@ -200,7 +213,7 @@
    },
    "outputs": [],
    "source": [
-    "!pip install -r pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit2/requirements-unit2.txt"
+    "!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit2/requirements-unit2.txt"
    ]
   },
   {
@@ -1734,8 +1747,7 @@
    "Ji_UrI5l2zzn",
    "67OdoKL63eDD",
    "B2_-8b8z5k54"
   ],
   "include_colab_link": true
   ]
  },
  "gpuClass": "standard",
  "kernelspec": {
File diff suppressed because one or more lines are too long

notebooks/unit4/requirements-unit4.txt (new file, 6 lines)

@@ -0,0 +1,6 @@
+gym
+git+https://github.com/ntasfi/PyGame-Learning-Environment.git
+git+https://github.com/qlan3/gym-games.git
+huggingface_hub
+imageio-ffmpeg
+pyyaml==6.0

notebooks/unit4/unit4.ipynb (new file, 1614 lines)

File diff suppressed because it is too large

unit5/unit5.ipynb (3194 lines)

File diff suppressed because one or more lines are too long
@@ -46,6 +46,10 @@
     title: Play with Huggy
   - local: unitbonus1/conclusion
     title: Conclusion
+- title: Live 1. How the course work, Q&A, and playing with Huggy
+  sections:
+  - local: live1/live1
+    title: Live 1. How the course work, Q&A, and playing with Huggy 🐶
 - title: Unit 2. Introduction to Q-Learning
   sections:
   - local: unit2/introduction
@@ -68,6 +72,8 @@
     title: A Q-Learning example
   - local: unit2/q-learning-recap
     title: Q-Learning Recap
+  - local: unit2/glossary
+    title: Glossary
   - local: unit2/hands-on
     title: Hands-on
   - local: unit2/quiz2
@@ -76,7 +82,55 @@
     title: Conclusion
   - local: unit2/additional-readings
     title: Additional Readings
-- title: Unit 5. Introduction to ML-Agents
+- title: Unit 3. Deep Q-Learning with Atari Games
+  sections:
+  - local: unit3/introduction
+    title: Introduction
+  - local: unit3/from-q-to-dqn
+    title: From Q-Learning to Deep Q-Learning
+  - local: unit3/deep-q-network
+    title: The Deep Q-Network (DQN)
+  - local: unit3/deep-q-algorithm
+    title: The Deep Q Algorithm
+  - local: unit3/glossary
+    title: Glossary
+  - local: unit3/hands-on
+    title: Hands-on
+  - local: unit3/quiz
+    title: Quiz
+  - local: unit3/conclusion
+    title: Conclusion
+  - local: unit3/additional-readings
+    title: Additional Readings
+- title: Bonus Unit 2. Automatic Hyperparameter Tuning with Optuna
+  sections:
+  - local: unitbonus2/introduction
+    title: Introduction
+  - local: unitbonus2/optuna
+    title: Optuna
+  - local: unitbonus2/hands-on
+    title: Hands-on
+- title: Unit 4. Policy Gradient with PyTorch
+  sections:
+  - local: unit4/introduction
+    title: Introduction
+  - local: unit4/what-are-policy-based-methods
+    title: What are the policy-based methods?
+  - local: unit4/advantages-disadvantages
+    title: The advantages and disadvantages of policy-gradient methods
+  - local: unit4/policy-gradient
+    title: Diving deeper into policy-gradient
+  - local: unit4/pg-theorem
+    title: (Optional) the Policy Gradient Theorem
+  - local: unit4/hands-on
+    title: Hands-on
+  - local: unit4/quiz
+    title: Quiz
+  - local: unit4/conclusion
+    title: Conclusion
+  - local: unit4/additional-readings
+    title: Additional Readings
+- title: Unit 5. Introduction to Unity ML-Agents
   sections:
   - local: unit5/introduction
     title: Introduction
@@ -94,3 +148,7 @@
     title: Conclusion
   - local: unit5/bonus
     title: Bonus. Learn to create your own environments with Unity and MLAgents
+- title: What's next? New Units Publishing Schedule
+  sections:
+  - local: communication/publishing-schedule
+    title: Publishing Schedule
units/en/communication/publishing-schedule.mdx (new file, 13 lines)

@@ -0,0 +1,13 @@
+# Publishing Schedule [[publishing-schedule]]
+
+We publish a **new unit every Tuesday**.
+
+If you don't want to miss any of the updates, don't forget to:
+
+1️⃣ [Sign up to the course](http://eepurl.com/ic5ZUD) to receive **update emails**.
+
+2️⃣ [Join our discord server](https://hf.co/join/discord) to **get the last updates and exchange with your classmates**.
+
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/communication/schedule1.png" alt="Schedule 1" width="100%"/>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/communication/schedule2.png" alt="Schedule 2" width="100%"/>
units/en/live1/live1.mdx (new file, 9 lines)

@@ -0,0 +1,9 @@
+# Live 1: How the course work, Q&A, and playing with Huggy
+
+In this first live stream, we explained how the course work (scope, units, challenges, and more) and answered your questions.
+
+And finally, we saw some LunarLander agents you've trained and play with your Huggies 🐶
+
+<Youtube id="JeJIswxyrsM" />
+
+To know when the next live is scheduled **check the discord server**. We will also send **you an email**. If you can't participate, don't worry, we record the live sessions.
@@ -9,7 +9,13 @@ Discord is a free chat platform. If you've used Slack, **it's quite similar**. T
 
 Starting in Discord can be a bit intimidating, so let me take you through it.
 
-When you sign-up to our Discord server, you'll need to specify which topics you're interested in by **clicking #role-assignment at the left**. Here, you can pick different categories. Make sure to **click "Reinforcement Learning"**! :fire:. You'll then get to **introduce yourself in the `#introduction-yourself` channel**.
+When you sign-up to our Discord server, you'll need to specify which topics you're interested in by **clicking #role-assignment at the left**.
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/discord1.jpg" alt="Discord"/>
+
+In #role-assignment, you can pick different categories. Make sure to **click "Reinforcement Learning"**. You'll then get to **introduce yourself in the `#introduction-yourself` channel**.
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/discord2.jpg" alt="Discord"/>
 
 ## So which channels are interesting to me? [[channels]]
@@ -23,7 +23,7 @@ In this course, you will:
 
 - 📖 Study Deep Reinforcement Learning in **theory and practice.**
 - 🧑‍💻 Learn to **use famous Deep RL libraries** such as [Stable Baselines3](https://stable-baselines3.readthedocs.io/en/master/), [RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo), [Sample Factory](https://samplefactory.dev/) and [CleanRL](https://github.com/vwxyzjn/cleanrl).
-- 🤖 **Train agents in unique environments** such as [SnowballFight](https://huggingface.co/spaces/ThomasSimonini/SnowballFight), [Huggy the Doggo 🐶](https://huggingface.co/spaces/ThomasSimonini/Huggy), [MineRL (Minecraft ⛏️)](https://minerl.io/), [VizDoom (Doom)](https://vizdoom.cs.put.edu.pl/) and classical ones such as [Space Invaders](https://www.gymlibrary.dev/environments/atari/) and [PyBullet](https://pybullet.org/wordpress/).
+- 🤖 **Train agents in unique environments** such as [SnowballFight](https://huggingface.co/spaces/ThomasSimonini/SnowballFight), [Huggy the Doggo 🐶](https://huggingface.co/spaces/ThomasSimonini/Huggy), [VizDoom (Doom)](https://vizdoom.cs.put.edu.pl/) and classical ones such as [Space Invaders](https://www.gymlibrary.dev/environments/atari/), [PyBullet](https://pybullet.org/wordpress/) and more.
 - 💾 Share your **trained agents with one line of code to the Hub** and also download powerful agents from the community.
 - 🏆 Participate in challenges where you will **evaluate your agents against other teams. You'll also get to play against the agents you'll train.**
@@ -53,12 +53,22 @@ The course is composed of:
 
 You can choose to follow this course either:
 
 - *To get a certificate of completion*: you need to complete 80% of the assignments before the end of March 2023.
+- *To get a certificate of honors*: you need to complete 100% of the assignments before the end of March 2023.
 - *As a simple audit*: you can participate in all challenges and do assignments if you want, but you have no deadlines.
 
-Both paths **are completely free**.
 Whatever path you choose, we advise you **to follow the recommended pace to enjoy the course and challenges with your fellow classmates.**
-You don't need to tell us which path you choose. At the end of March, when we verify the assignments **if you get more than 80% of the assignments done, you'll get a certificate.**
+
+You don't need to tell us which path you choose. At the end of March, when we will verify the assignments **if you get more than 80% of the assignments done, you'll get a certificate.**
+
+## The Certification Process [[certification-process]]
+
+The certification process is **completely free**:
+
+- *To get a certificate of completion*: you need to complete 80% of the assignments before the end of March 2023.
+- *To get a certificate of honors*: you need to complete 100% of the assignments before the end of March 2023.
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/certification.jpg" alt="Course certification" width="100%"/>
+
 ## How to get most of the course? [[advice]]
@@ -80,6 +90,14 @@ You need only 3 things:
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/tools.jpg" alt="Course tools needed" width="100%"/>
 
+## What is the publishing schedule? [[publishing-schedule]]
+
+We publish **a new unit every Tuesday**.
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/communication/schedule1.png" alt="Schedule 1" width="100%"/>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/communication/schedule2.png" alt="Schedule 2" width="100%"/>
+
+
 ## What is the recommended pace? [[recommended-pace]]
@@ -111,7 +129,7 @@ In this new version of the course, you have two types of challenges:
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/challenges.jpg" alt="Challenges" width="100%"/>
 
-These AI vs.AI challenges will be announced **later in December**.
+These AI vs.AI challenges will be announced **in January**.
 
 
 ## I found a bug, or I want to improve the course [[contribute]]
@@ -21,6 +21,7 @@ We have multiple RL-related channels:
 - `rl-announcements`: where we give the last information about the course.
 - `rl-discussions`: where you can exchange about RL and share information.
 - `rl-study-group`: where you can create and join study groups.
+- `rl-i-made-this`: where you can share your projects and models.
 
 If this is your first time using Discord, we wrote a Discord 101 to get the best practices. Check the next section.
@@ -11,3 +11,4 @@ These are **optional readings** if you want to go deeper.
 ## Gym [[gym]]
 
 - [Getting Started With OpenAI Gym: The Basic Building Blocks](https://blog.paperspace.com/getting-started-with-openai-gym/)
+- [Make your own Gym custom environment](https://www.gymlibrary.dev/content/environment_creation/)
@@ -12,5 +12,10 @@ In the next (bonus) unit, we’re going to reinforce what we just learned by **t
 
 You will be able then to play with him 🤗.
 
-<video src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/huggy.mp4" alt="Huggy" type="video/mp4">
-</video>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/huggy.jpg" alt="Huggy"/>
+
+Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
+
+### Keep Learning, stay awesome 🤗
@@ -26,7 +26,7 @@ If it’s still confusing, **think of a real problem: the choice of picking a re
 
 <figure>
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/exp_2.jpg" alt="Exploration">
-<figcaption>Source: <a href="[http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_exploration.pdf](http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_exploration.pdf)"> Berkley AI Course</a>
+<figcaption>Source: <a href="https://inst.eecs.berkeley.edu/~cs188/sp20/assets/lecture/lec15_6up.pdf"> Berkley AI Course</a>
 </figcaption>
 </figure>
@@ -18,6 +18,12 @@ And finally, you'll **upload this trained agent to the Hugging Face Hub 🤗, a
 
 Thanks to our <a href="https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard">leaderboard</a>, you'll be able to compare your results with other classmates and exchange the best practices to improve your agent's scores. Who will win the challenge for Unit 1 🏆?
 
+To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained model to the Hub and **get a result of >= 200**.
+
+To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**
+
+For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
+
 So let's get started! 🚀
 
 **To start the hands-on click on Open In Colab button** 👇 :
@@ -28,6 +34,7 @@ You can either do this hands-on by reading the notebook or following it with the
 
 <Youtube id="CsuIANBnSq8" />
 
# Unit 1: Train your first Deep Reinforcement Learning Agent 🤖

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/thumbnail.jpg" alt="Unit 1 thumbnail" width="100%">

@@ -36,9 +43,6 @@ In this notebook, you'll train your **first Deep Reinforcement Learning agent**
 
 ⬇️ Here is an example of what **you will achieve in just a couple of minutes.** ⬇️
 
-
-
-
 ```python
 %%html
 <video controls autoplay><source src="https://huggingface.co/ThomasSimonini/ppo-LunarLander-v2/resolve/main/replay.mp4" type="video/mp4"></video>
@@ -65,7 +69,7 @@ At the end of the notebook, you will:
 
 
-## This notebook is from Deep Reinforcement Learning Course
+## This hands-on is from Deep Reinforcement Learning Course
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/deep-rl-course-illustration.jpg" alt="Deep RL Course illustration"/>
 
 In this free course, you will:
@@ -84,7 +88,7 @@ The best way to keep in touch and ask questions is to join our discord server to
 ## Prerequisites 🏗️
 Before diving into the notebook, you need to:
 
-🔲 📝 **Done Unit 0** that gives you all the **information about the course and help you to onboard** 🤗
+🔲 📝 **Read Unit 0** that gives you all the **information about the course and help you to onboard** 🤗
 
 🔲 📚 **Develop an understanding of the foundations of Reinforcement learning** (MC, TD, Rewards hypothesis...) by doing Unit 1
@@ -135,7 +139,7 @@ To make things easier, we created a script to install all these dependencies.
 ```
 
 ```python
-!pip install -r https://huggingface.co/spaces/ThomasSimonini/temp-space-requirements/raw/main/requirements/requirements-unit1.txt
+!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit1/requirements-unit1.txt
 ```
 
 During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames).
@@ -22,7 +22,6 @@ It's essential **to master these elements** before diving into implementing Dee
 
 After this unit, in a bonus unit, you'll be **able to train Huggy the Dog 🐶 to fetch the stick and play with him 🤗**.
 
-<video src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/huggy.mp4" alt="Huggy" type="video/mp4">
-</video>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/huggy.jpg" alt="Huggy"/>
 
 So let's get started! 🚀
@@ -18,7 +18,7 @@ Then, to calculate the \\(V(S_{t+1})\\), we need to calculate the return startin
 
 <figure>
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman3.jpg" alt="Bellman equation"/>
-<figcaption>To calculate the value of State 2: the sum of rewards **if the agent started in that state, and then followed the **policy for all the time steps.</figcaption>
+<figcaption>To calculate the value of State 2: the sum of rewards <b>if the agent started in that state</b>, and then followed the <b>policy for all the time steps.</b></figcaption>
 </figure>
 
 So you may have noticed, we're repeating the computation of the value of different states, which can be tedious if you need to do it for each state value or state-action value.
@@ -58,6 +58,6 @@ But you'll study an example with gamma = 0.99 in the Q-Learning section of this
 
 
-To recap, the idea of the Bellman equation is that instead of calculating each value as the sum of the expected return, **which is a long process.** This is equivalent **to the sum of immediate reward + the discounted value of the state that follows.**
+To recap, the idea of the Bellman equation is that instead of calculating each value as the sum of the expected return, **which is a long process**, we calculate the value as **the sum of immediate reward + the discounted value of the state that follows.**
 
 Before going to the next section, think about the role of gamma in the Bellman equation. What happens if the value of gamma is very low (e.g. 0.1 or even 0)? What happens if the value is 1? What happens if the value is very high, such as a million?
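The rewritten recap can be checked with a tiny worked example; the reward and next-state value below are invented numbers, chosen only to show the arithmetic:

```python
gamma = 0.99             # discount rate from the surrounding text
reward = 1.0             # immediate reward R_{t+1} (made-up)
value_next_state = 10.0  # V(S_{t+1}) (made-up)

# Bellman equation: V(S_t) = R_{t+1} + gamma * V(S_{t+1})
value_state = reward + gamma * value_next_state  # 1.0 + 0.99 * 10.0 = 10.9
```

One addition and one multiplication replace summing the rewards of an entire trajectory.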
@@ -15,5 +15,7 @@ In the next chapter, we’re going to dive deeper by studying our first Deep Rei
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Atari environments"/>
 
 
+Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
+
 ### Keep Learning, stay awesome 🤗
units/en/unit2/glossary.mdx (new file, 34 lines)

@@ -0,0 +1,34 @@
+# Glossary [[glossary]]
+
+This is a community-created glossary. Contributions are welcomed!
+
+
+### Strategies to find the optimal policy
+
+- **Policy-based methods.** The policy is usually trained with a neural network to select what action to take given a state. In this case is the neural network which outputs the action that the agent should take instead of using a value function. Depending on the experience received by the environment, the neural network will be re-adjusted and will provide better actions.
+- **Value-based methods.** In this case, a value function is trained to output the value of a state or a state-action pair that will represent our policy. However, this value doesn't define what action the agent should take. In contrast, we need to specify the behavior of the agent given the output of the value function. For example, we could decide to adopt a policy to take the action that always leads to the biggest reward (Greedy Policy). In summary, the policy is a Greedy Policy (or whatever decision the user takes) that uses the values of the value-function to decide the actions to take.
+
+### Among the value-based methods, we can find two main strategies
+
+- **The state-value function.** For each state, the state-value function is the expected return if the agent starts in that state and follows the policy until the end.
+- **The action-value function.** In contrast to the state-value function, the action-value calculates for each state and action pair the expected return if the agent starts in that state and takes an action. Then it follows the policy forever after.
+
+### Epsilon-greedy strategy:
+- Common exploration strategy used in reinforcement learning that involves balancing exploration and exploitation.
+- Chooses the action with the highest expected reward with a probability of 1-epsilon.
+- Chooses a random action with a probability of epsilon.
+- Epsilon is typically decreased over time to shift focus towards exploitation.
+
+### Greedy strategy:
+- Involves always choosing the action that is expected to lead to the highest reward, based on the current knowledge of the environment. (only exploitation)
+- Always chooses the action with the highest expected reward.
+- Does not include any exploration.
+- Can be disadvantageous in environments with uncertainty or unknown optimal actions.
+
+
+If you want to improve the course, you can [open a Pull Request.](https://github.com/huggingface/deep-rl-class/pulls)
+
+This glossary was made possible thanks to:
+
+- [Ramón Rueda](https://github.com/ramon-rd)
+- [Hasarindu Perera](https://github.com/hasarinduperera/)
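The epsilon-greedy and greedy strategies defined in the added glossary can be sketched in a few lines; this is a minimal illustration, not the course's reference implementation:

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    """With probability epsilon, explore (random action);
    with probability 1 - epsilon, exploit (highest Q-value)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                       # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])      # exploitation

def greedy_action(q_values):
    """Pure exploitation: the special case epsilon = 0."""
    return epsilon_greedy_action(q_values, epsilon=0.0)
```

As the glossary notes, epsilon is typically decayed over training (e.g. from 1.0 toward a small value) so the agent shifts from exploring to exploiting.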
@@ -16,6 +16,12 @@ Now that we studied the Q-Learning algorithm, let's implement it from scratch an
 
 Thanks to a [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard), you'll be able to compare your results with other classmates and exchange the best practices to improve your agent's scores. Who will win the challenge for Unit 2?
 
+To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained Taxi model to the Hub and **get a result of >= 4.5**.
+
+To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**
+
+For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
+
 **To start the hands-on click on Open In Colab button** 👇 :
@@ -76,8 +76,7 @@ For instance, if we train a state-value function using Monte Carlo:
 
 ## Temporal Difference Learning: learning at each step [[td-learning]]
 
-- **Temporal Difference, on the other hand, waits for only one interaction (one step) \\(S_{t+1}\\)**
-- to form a TD target and update \\(V(S_t)\\) using \\(R_{t+1}\\) and \\( \gamma * V(S_{t+1})\\).
+**Temporal Difference, on the other hand, waits for only one interaction (one step) \\(S_{t+1}\\)** to form a TD target and update \\(V(S_t)\\) using \\(R_{t+1}\\) and \\( \gamma * V(S_{t+1})\\).
 
 The idea with **TD is to update the \\(V(S_t)\\) at each step.**
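The merged sentence in this hunk describes the TD(0) update; a minimal sketch, where the learning rate `alpha` and the dict-based value table are illustrative assumptions rather than anything fixed by the course text:

```python
def td0_update(V, state, reward, next_state, gamma=0.99, alpha=0.1):
    """One-step TD update: V(S_t) <- V(S_t) + alpha * (TD target - V(S_t))."""
    td_target = reward + gamma * V[next_state]  # R_{t+1} + gamma * V(S_{t+1})
    V[state] += alpha * (td_target - V[state])
    return V
```

Because the target uses only one observed reward plus the current estimate of the next state, the update can happen after every single step instead of at the end of the episode.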
@@ -25,11 +25,11 @@ The reward function goes like this:
 
 To train our agent to have an optimal policy (so a policy that goes right, right, down), **we will use the Q-Learning algorithm**.
 
-## Step 1: We initialize the Q-Table [[step1]]
+## Step 1: We initialize the Q-table [[step1]]
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Example-1.jpg" alt="Maze-Example"/>
 
-So, for now, **our Q-Table is useless**; we need **to train our Q-function using the Q-Learning algorithm.**
+So, for now, **our Q-table is useless**; we need **to train our Q-function using the Q-Learning algorithm.**
 
 Let's do it for 2 training timesteps:
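Step 1 of the hunk above amounts to allocating a table of zeros, one row per state and one column per action; the 16-state, 4-action sizes below are illustrative (e.g. a small grid world), not taken from the example itself:

```python
n_states, n_actions = 16, 4  # assumed sizes for illustration

# Q-table initialized to 0 for every state-action pair,
# which is why it is "useless" before training.
Q = [[0.0] * n_actions for _ in range(n_states)]
```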
@@ -80,4 +80,4 @@ Because I go to the poison state, **I get \\(R_{t+1} = -10\\), and I die.**
 
 Because we're dead, we start a new episode. But what we see here is that **with two explorations steps, my agent became smarter.**
 
-As we continue exploring and exploiting the environment and updating Q-values using TD target, **Q-Table will give us better and better approximations. And thus, at the end of the training, we'll get an estimate of the optimal Q-Function.**
+As we continue exploring and exploiting the environment and updating Q-values using TD target, **Q-table will give us better and better approximations. And thus, at the end of the training, we'll get an estimate of the optimal Q-function.**
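The "updating Q-values using TD target" loop that this example walks through (act, observe \\(R_{t+1}\\), update toward the TD target) can be sketched as follows; the `alpha` and `gamma` defaults are illustrative assumptions:

```python
def q_learning_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Q(s,a) <- Q(s,a) + alpha * (R_{t+1} + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = reward + gamma * max(Q[next_state])
    Q[state][action] += alpha * (td_target - Q[state][action])
```

Stepping into the poison state (reward -10) would pull that state-action value down, which is how the agent "became smarter" after only two exploration steps.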
@@ -3,20 +3,20 @@
 
 The *Q-Learning* **is the RL algorithm that** :
 
-- Trains *Q-Function*, an **action-value function** that contains, as internal memory, a *Q-table* **that contains all the state-action pair values.**
+- Trains *Q-function*, an **action-value function** that contains, as internal memory, a *Q-table* **that contains all the state-action pair values.**
 
-- Given a state and action, our Q-Function **will search into its Q-table the corresponding value.**
+- Given a state and action, our Q-function **will search into its Q-table the corresponding value.**
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q function" width="100%"/>
 
-- When the training is done,**we have an optimal Q-Function, so an optimal Q-Table.**
+- When the training is done,**we have an optimal Q-function, so an optimal Q-table.**
 
 - And if we **have an optimal Q-function**, we
 have an optimal policy,since we **know for each state, what is the best action to take.**
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" width="100%"/>
 
-But, in the beginning, our **Q-Table is useless since it gives arbitrary value for each state-action pair (most of the time we initialize the Q-Table to 0 values)**. But, as we’ll explore the environment and update our Q-Table it will give us better and better approximations
+But, in the beginning, our **Q-table is useless since it gives arbitrary value for each state-action pair (most of the time we initialize the Q-table to 0 values)**. But, as we’ll explore the environment and update our Q-table it will give us better and better approximations
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/q-learning.jpeg" alt="q-learning.jpeg" width="100%"/>
@@ -7,7 +7,7 @@ Q-Learning is an **off-policy value-based method that uses a TD approach to tra

- *Value-based method*: finds the optimal policy indirectly by training a value or action-value function that will tell us **the value of each state or each state-action pair.**
- *Uses a TD approach:* **updates its action-value function at each step instead of at the end of the episode.**

-**Q-Learning is the algorithm we use to train our Q-Function**, an **action-value function** that determines the value of being at a particular state and taking a specific action at that state.
+**Q-Learning is the algorithm we use to train our Q-function**, an **action-value function** that determines the value of being at a particular state and taking a specific action at that state.

<figure>
  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function.jpg" alt="Q-function"/>
@@ -18,16 +18,16 @@ The **Q comes from "the Quality" (the value) of that action at that state.**

Let's recap the difference between value and reward:

-- The *value of a state*, or a *state-action pair* is the expected cumulative reward our agent gets if it starts at this state (or state action pair) and then acts accordingly to its policy.
+- The *value of a state*, or a *state-action pair*, is the expected cumulative reward our agent gets if it starts at this state (or state-action pair) and then acts according to its policy.
- The *reward* is the **feedback I get from the environment** after performing an action at a state.

-Internally, our Q-function has **a Q-table, a table where each cell corresponds to a state-action value pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.**
+Internally, our Q-function has **a Q-table, a table where each cell corresponds to a state-action pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.**

Let's go through an example of a maze.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-1.jpg" alt="Maze example"/>
-The Q-Table is initialized. That's why all values are = 0. This table **contains, for each state, the four state-action values.**
+The Q-table is initialized. That's why all values are 0. This table **contains, for each state, the four state-action values.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-2.jpg" alt="Maze example"/>
@@ -35,7 +35,7 @@ Here we see that the **state-action value of the initial state and going up is

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-3.jpg" alt="Maze example"/>

-Therefore, Q-function contains a Q-table **that has the value of each-state action pair.** And given a state and action, **our Q-Function will search inside its Q-table to output the value.**
+Therefore, the Q-function contains a Q-table **that has the value of each state-action pair.** And given a state and action, **our Q-function will search inside its Q-table to output the value.**

<figure>
  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q-function"/>
@@ -43,22 +43,22 @@ Therefore, Q-function contains a Q-table **that has the value of each-state act

If we recap, *Q-Learning* **is the RL algorithm that:**

-- Trains a *Q-Function* (an **action-value function**), which internally is a **Q-table that contains all the state-action pair values.**
-- Given a state and action, our Q-Function **will search into its Q-table the corresponding value.**
-- When the training is done, **we have an optimal Q-function, which means we have optimal Q-Table.**
+- Trains a *Q-function* (an **action-value function**), which internally is a **Q-table that contains all the state-action pair values.**
+- Given a state and action, our Q-function **will search its Q-table for the corresponding value.**
+- When the training is done, **we have an optimal Q-function, which means we have an optimal Q-table.**
- And if we **have an optimal Q-function**, we **have an optimal policy** since we **know for each state what is the best action to take.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy"/>

-But, in the beginning, **our Q-Table is useless since it gives arbitrary values for each state-action pair** (most of the time, we initialize the Q-Table to 0). As the agent **explores the environment and we update the Q-Table, it will give us better and better approximations** to the optimal policy.
+But, in the beginning, **our Q-table is useless since it gives arbitrary values for each state-action pair** (most of the time, we initialize the Q-table to 0). As the agent **explores the environment and we update the Q-table, it will give us better and better approximations** to the optimal policy.

<figure class="image table text-center m-0 w-full">
  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-1.jpg" alt="Q-learning"/>
-  <figcaption>We see here that with the training, our Q-Table is better since, thanks to it, we can know the value of each state-action pair.</figcaption>
+  <figcaption>We see here that with training, our Q-table is better since, thanks to it, we can know the value of each state-action pair.</figcaption>
</figure>

-Now that we understand what Q-Learning, Q-Function, and Q-Table are, **let's dive deeper into the Q-Learning algorithm**.
+Now that we understand what Q-Learning, the Q-function, and the Q-table are, **let's dive deeper into the Q-Learning algorithm**.

## The Q-Learning algorithm [[q-learning-algo]]
@@ -66,26 +66,26 @@ This is the Q-Learning pseudocode; let's study each part and **see how it works

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-learning"/>

-### Step 1: We initialize the Q-Table [[step1]]
+### Step 1: We initialize the Q-table [[step1]]

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-3.jpg" alt="Q-learning"/>

-We need to initialize the Q-Table for each state-action pair. **Most of the time, we initialize with values of 0.**
+We need to initialize the Q-table for each state-action pair. **Most of the time, we initialize with values of 0.**
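The initialization in Step 1 can be sketched in a few lines of Python. The 16x4 shape is an illustrative FrozenLake-style choice (16 states, 4 actions), not something the pseudocode prescribes:

```python
import numpy as np

# Illustrative sizes: a FrozenLake-style environment with 16 states and 4 actions.
n_states, n_actions = 16, 4

def initialize_q_table(n_states, n_actions):
    # Most of the time, every state-action value starts at 0.
    return np.zeros((n_states, n_actions))

q_table = initialize_q_table(n_states, n_actions)
```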
-### Step 2: Choose action using epsilon greedy strategy [[step2]]
+### Step 2: Choose an action using the epsilon-greedy strategy [[step2]]

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg" alt="Q-learning"/>

The epsilon-greedy strategy is a policy that handles the exploration/exploitation trade-off.

-The idea is that we define epsilon ɛ = 1.0:
+The idea is that we define the initial epsilon ɛ = 1.0:

- *With probability 1 - ɛ*: we do **exploitation** (i.e., our agent selects the action with the highest state-action pair value).
- With probability ɛ: **we do exploration** (trying a random action).

-At the beginning of the training, **the probability of doing exploration will be huge since ɛ is very high, so most of the time, we'll explore.** But as the training goes on, and consequently our **Q-Table gets better and better in its estimations, we progressively reduce the epsilon value** since we will need less and less exploration and more exploitation.
+At the beginning of training, **the probability of doing exploration will be huge since ɛ is very high, so most of the time, we'll explore.** But as training goes on, and consequently our **Q-table gets better and better in its estimations, we progressively reduce the epsilon value** since we will need less and less exploration and more exploitation.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-5.jpg" alt="Q-learning"/>
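A minimal Python sketch of the epsilon-greedy choice and of one possible decay schedule. The exponential schedule and its constants are illustrative assumptions; linear decay is an equally common choice:

```python
import random

import numpy as np

def epsilon_greedy_action(q_table, state, epsilon, rng=random):
    # With probability epsilon: explore (try a random action);
    # with probability 1 - epsilon: exploit (take the action with the
    # highest value in the Q-table row for this state).
    if rng.random() < epsilon:
        return rng.randrange(q_table.shape[1])
    return int(np.argmax(q_table[state]))

def decayed_epsilon(step, eps_max=1.0, eps_min=0.05, decay_rate=0.005):
    # Progressively reduce epsilon as training goes on
    # (exponential schedule; the constants are illustrative).
    return eps_min + (eps_max - eps_min) * np.exp(-decay_rate * step)

q_table = np.zeros((3, 4))
q_table[0, 2] = 5.0
greedy_choice = epsilon_greedy_action(q_table, state=0, epsilon=0.0)  # pure exploitation
```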
@@ -114,7 +114,7 @@ It means that to update our \\(Q(S_t, A_t)\\):

How do we form the TD target?
1. We obtain the reward \\(R_{t+1}\\) after taking the action.
-2. To get the **best next-state-action pair value**, we use a greedy policy to select the next best action. Note that this is not an epsilon greedy policy, this will always take the action with the highest state-action value.
+2. To get the **best next-state-action pair value**, we use a greedy policy to select the next best action. Note that this is not an epsilon-greedy policy; it will always take the action with the highest state-action value.

Then, when the update of this Q-value is done, we start in a new state and select our action **using an epsilon-greedy policy again.**
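The update step above can be sketched directly; the `alpha` and `gamma` values are illustrative hyperparameters, not prescribed by the pseudocode:

```python
import numpy as np

def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    # TD target: the immediate reward plus the discounted value of the best
    # next-state action (a greedy max, not an epsilon-greedy choice).
    td_target = reward + gamma * np.max(q_table[next_state])
    # Move the current estimate a step of size alpha toward the TD target.
    q_table[state, action] += alpha * (td_target - q_table[state, action])
    return q_table

q = np.zeros((4, 2))
q_update(q, state=0, action=1, reward=10.0, next_state=1)
```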
@@ -126,7 +126,7 @@ The difference is subtle:

- *Off-policy*: using **a different policy for acting (inference) and updating (training).**

-For instance, with Q-Learning, the epsilon greedy policy (acting policy), is different from the greedy policy that is **used to select the best next-state action value to update our Q-value (updating policy).**
+For instance, with Q-Learning, the epsilon-greedy policy (acting policy) is different from the greedy policy that is **used to select the best next-state-action value to update our Q-value (updating policy).**

<figure>
@@ -144,7 +144,7 @@ Is different from the policy we use during the training part:

- *On-policy:* using the **same policy for acting and updating.**

-For instance, with Sarsa, another value-based algorithm, **the epsilon greedy Policy selects the next state-action pair, not a greedy policy.**
+For instance, with Sarsa, another value-based algorithm, **the epsilon-greedy policy selects the next state-action pair, not a greedy policy.**
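The off-policy/on-policy contrast becomes concrete when the two TD targets are put side by side. This is a sketch; `q_learning_target` and `sarsa_target` are hypothetical helper names:

```python
import numpy as np

def q_learning_target(q_table, reward, next_state, gamma=0.99):
    # Off-policy: the updating policy is greedy -- it takes the max over the
    # next state's actions, whatever the acting policy does next.
    return reward + gamma * np.max(q_table[next_state])

def sarsa_target(q_table, reward, next_state, next_action, gamma=0.99):
    # On-policy: it evaluates the action actually chosen by the
    # (epsilon-greedy) acting policy in the next state.
    return reward + gamma * q_table[next_state, next_action]

q = np.array([[1.0, 5.0],
              [0.0, 0.0]])
```

If exploration picks action 0 in the next state, the Sarsa target uses its value (1.0), while the Q-Learning target still uses the max (5.0).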
<figure>
@@ -9,7 +9,7 @@ The best way to learn and [to avoid the illusion of competence](https://www.cour

<Question
  choices={[
    {
-      text: "The algorithm we use to train our Q-Function",
+      text: "The algorithm we use to train our Q-function",
      explain: "",
      correct: true
    },
@@ -24,12 +24,12 @@ The best way to learn and [to avoid the illusion of competence](https://www.cour

    },
    {
      text: "A table",
-      explain: "Q-Function is not a Q-Table. The Q-Function is the algorithm that will feed the Q-Table."
+      explain: "The Q-function is not a Q-table. The Q-function uses a Q-table as its internal memory."
    }
  ]}
/>

-### Q2: What is a Q-Table?
+### Q2: What is a Q-table?

<Question
  choices={[
@@ -43,7 +43,7 @@ The best way to learn and [to avoid the illusion of competence](https://www.cour

      correct: true
    },
    {
-      text: "In Q-Table each cell corresponds a state value",
+      text: "In a Q-table, each cell corresponds to a state value",
      explain: "Each cell corresponds to a state-action pair value, not a state value.",
    }
  ]}
@@ -10,7 +10,7 @@ The value of a state is the **expected discounted return** the agent can get i

But what does it mean to act according to our policy? After all, we don't have a policy in value-based methods, since we train a value function and not a policy.
</Tip>

-Remember that the goal of an **RL agent is to have an optimal policy π.**
+Remember that the goal of an **RL agent is to have an optimal policy π\*.**

To find the optimal policy, we learned about two different methods:
@@ -35,8 +35,8 @@ Consequently, whatever method you use to solve your problem, **you will have a

So the difference is:

-- In policy-based, **the optimal policy (denoted π*) is found by training the policy directly.**
-- In value-based, **finding an optimal value function (denoted Q* or V*, we'll study the difference after) in our leads to having an optimal policy.**
+- In policy-based methods, **the optimal policy (denoted π\*) is found by training the policy directly.**
+- In value-based methods, **finding an optimal value function (denoted Q\* or V\*, we'll study the difference later) leads to having an optimal policy.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link between value and policy"/>
@@ -62,7 +62,7 @@ For each state, the state-value function outputs the expected return if the agen

In the action-value function, for each state-action pair, the action-value function **outputs the expected return** if the agent starts in that state, takes that action, and then follows the policy forever after.

-The value of taking action an in state \\(s\\) under a policy \\(π\\) is:
+The value of taking action \\(a\\) in state \\(s\\) under a policy \\(π\\) is:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/action-state-value-function-1.jpg" alt="Action State value function"/>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/action-state-value-function-2.jpg" alt="Action State value function"/>
8
units/en/unit3/additional-readings.mdx
Normal file
@@ -0,0 +1,8 @@
# Additional Readings [[additional-readings]]

These are **optional readings** if you want to go deeper.

- [Foundations of Deep RL Series, L2 Deep Q-Learning by Pieter Abbeel](https://youtu.be/Psrhxy88zww)
- [Playing Atari with Deep Reinforcement Learning](https://arxiv.org/abs/1312.5602)
- [Double Deep Q-Learning](https://papers.nips.cc/paper/2010/hash/091d584fced301b442654dd8c23b3fc9-Abstract.html)
- [Prioritized Experience Replay](https://arxiv.org/abs/1511.05952)
17
units/en/unit3/conclusion.mdx
Normal file
@@ -0,0 +1,17 @@
# Conclusion [[conclusion]]

Congrats on finishing this chapter! There was a lot of information. And congrats on finishing the tutorial. You’ve just trained your first Deep Q-Learning agent and shared it on the Hub 🥳.

Take time to really grasp the material before continuing.

Don't hesitate to train your agent in other environments (Pong, Seaquest, QBert, Ms Pac Man). The **best way to learn is to try things on your own!**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Environments"/>

In the next unit, **we're going to learn about Optuna**. One of the most critical tasks in Deep Reinforcement Learning is finding a good set of training hyperparameters, and Optuna is a library that helps you automate the search.

Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback, please 👉 [fill out this form](https://forms.gle/BzKXWzLAGZESGNaE9)

### Keep Learning, stay awesome 🤗
105
units/en/unit3/deep-q-algorithm.mdx
Normal file
@@ -0,0 +1,105 @@
# The Deep Q-Learning Algorithm [[deep-q-algorithm]]

We learned that Deep Q-Learning **uses a deep neural network to approximate the different Q-values for each possible action at a state** (value-function estimation).

The difference is that, during the training phase, instead of updating the Q-value of a state-action pair directly as we have done with Q-Learning:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-5.jpg" alt="Q Loss"/>

in Deep Q-Learning, we create a **loss function that compares our Q-value prediction and the Q-target and uses gradient descent to update the weights of our Deep Q-Network to approximate our Q-values better**.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/Q-target.jpg" alt="Q-target"/>

The Deep Q-Learning training algorithm has *two phases*:

- **Sampling**: we perform actions and **store the observed experience tuples in a replay memory**.
- **Training**: we select a **small batch of tuples randomly and learn from this batch using a gradient descent update step**.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/sampling-training.jpg" alt="Sampling Training"/>

This is not the only difference compared with Q-Learning. Deep Q-Learning training **might suffer from instability**, mainly because it combines a non-linear Q-value function (a neural network) and bootstrapping (updating targets with existing estimates and not an actual complete return).

To help us stabilize the training, we implement three different solutions:
1. *Experience Replay* to make more **efficient use of experiences**.
2. *Fixed Q-Target* **to stabilize the training**.
3. *Double Deep Q-Learning* to **handle the problem of the overestimation of Q-values**.

Let's go through them!
## Experience Replay to make more efficient use of experiences [[exp-replay]]

Why do we create a replay memory?

Experience Replay in Deep Q-Learning has two functions:

1. **Make more efficient use of the experiences during the training**.
Usually, in online reinforcement learning, the agent interacts with the environment, gets experiences (state, action, reward, and next state), learns from them (updates the neural network), and discards them. This is not efficient.

Experience replay helps us **use the experiences of the training more efficiently**. We use a replay buffer that saves experience samples **that we can reuse during the training.**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/experience-replay.jpg" alt="Experience Replay"/>

⇒ This allows the agent to **learn from the same experiences multiple times**.

2. **Avoid forgetting previous experiences and reduce the correlation between experiences**.
- The problem we get if we give sequential samples of experiences to our neural network is that it tends to forget **the previous experiences as it gets new experiences.** For instance, if the agent is in the first level and then in the second, which is different, it can forget how to behave and play in the first level.

The solution is to create a replay buffer that stores experience tuples while interacting with the environment and then sample a small batch of tuples. This prevents **the network from only learning about what it has done immediately before.**

Experience replay also has other benefits. By randomly sampling the experiences, we remove correlation in the observation sequences and keep **action values from oscillating or diverging catastrophically.**

In the Deep Q-Learning pseudocode, we **initialize a replay memory buffer D with capacity N** (N is a hyperparameter that you can define). We then store experiences in the memory and sample a batch of experiences to feed the Deep Q-Network during the training phase.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/experience-replay-pseudocode.jpg" alt="Experience Replay Pseudocode"/>
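A minimal replay-buffer sketch along these lines. The class and method names are illustrative, not the API of any specific library; a `deque` with `maxlen` gives the fixed capacity N for free:

```python
import random
from collections import deque

class ReplayBuffer:
    """Replay memory D with capacity N; the oldest experiences drop out first."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between
        # consecutive experiences.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=3)
for i in range(5):                      # store 5 experiences; only the last 3 are kept
    buffer.store(i, 0, 0.0, i + 1, False)
batch = buffer.sample(2)
```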
## Fixed Q-Target to stabilize the training [[fixed-q]]

When we want to calculate the TD error (aka the loss), we calculate the **difference between the TD target (Q-target) and the current Q-value (estimation of Q)**.

But we **don’t have any idea of the real TD target**. We need to estimate it. Using the Bellman equation, we saw that the TD target is just the reward of taking that action at that state plus the discounted highest Q-value for the next state.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/Q-target.jpg" alt="Q-target"/>

However, the problem is that we are using the same parameters (weights) for estimating the TD target **and** the Q-value. Consequently, there is a significant correlation between the TD target and the parameters we are changing.

This means that at every step of training, **our Q-values shift, but the target value also shifts.** We’re getting closer to our target, but the target is also moving. It’s like chasing a moving target! This can lead to significant oscillation in training.

It’s like if you were a cowboy (the Q estimation) and you wanted to catch a cow (the Q-target). Your goal is to get closer (reduce the error).

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/qtarget-1.jpg" alt="Q-target"/>

At each time step, you’re trying to approach the cow, which also moves at each time step (because you use the same parameters).

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/qtarget-2.jpg" alt="Q-target"/>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/qtarget-3.jpg" alt="Q-target"/>
This leads to a bizarre path of chasing (a significant oscillation in training).
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/qtarget-4.jpg" alt="Q-target"/>

Instead, what we see in the pseudo-code is that we:
- Use a **separate network with fixed parameters** for estimating the TD target.
- **Copy the parameters from our Deep Q-Network every C steps** to update the target network.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/fixed-q-target-pseudocode.jpg" alt="Fixed Q-target Pseudocode"/>
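A framework-free sketch of the fixed-target idea, with plain NumPy arrays standing in for network weights (in PyTorch one would typically copy `state_dict`s instead). The `C` value and the parameter names are illustrative assumptions:

```python
import copy

import numpy as np

# Plain arrays stand in for network parameters in this framework-free sketch.
online_params = {"w": np.ones((4, 2))}
target_params = copy.deepcopy(online_params)   # fixed parameters for the TD target

C = 1000  # hyperparameter: number of steps between two copies (illustrative)

def maybe_sync_target(step, online, target, period=C):
    # Every C steps, copy the online Deep Q-Network's parameters into the
    # target network; in between, the target stays fixed.
    if step % period == 0:
        for name, value in online.items():
            target[name] = value.copy()

online_params["w"] += 1.0                      # training moved the online network
maybe_sync_target(step=500, online=online_params, target=target_params)
still_fixed = bool(np.all(target_params["w"] == 1.0))   # no sync yet
maybe_sync_target(step=1000, online=online_params, target=target_params)
synced = bool(np.all(target_params["w"] == 2.0))        # sync at step C
```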
## Double DQN [[double-dqn]]

Double DQN, or double learning, was introduced [by Hado van Hasselt](https://papers.nips.cc/paper/3964-double-q-learning). This method **handles the problem of the overestimation of Q-values.**

To understand this problem, remember how we calculate the TD target:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-1.jpg" alt="TD target"/>

We face a simple problem when calculating the TD target: how are we sure that **the best action for the next state is the action with the highest Q-value?**

We know that the accuracy of Q-values depends on what actions we tried **and** what neighboring states we explored.

Consequently, we don’t have enough information about the best action to take at the beginning of the training. Therefore, taking the maximum Q-value (which is noisy) as the best action to take can lead to false positives. If non-optimal actions are regularly **given a higher Q-value than the optimal best action, the learning will be complicated.**

The solution is: when we compute the Q-target, we use two networks to decouple the action selection from the target Q-value generation. We:
- Use our **DQN network** to select the best action to take for the next state (the action with the highest Q-value).
- Use our **target network** to calculate the target Q-value of taking that action at the next state.

Therefore, Double DQN helps us reduce the overestimation of Q-values and, as a consequence, helps us train faster and have more stable learning.
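The decoupling of selection and evaluation can be sketched as follows. The function name and the toy Q-value arrays are illustrative; in practice `next_q_online` and `next_q_target` would come from forward passes of the two networks:

```python
import numpy as np

def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    # Decouple selection from evaluation:
    # 1) the online DQN picks the best next action,
    # 2) the target network evaluates the Q-value of that action.
    if done:
        return reward
    best_action = int(np.argmax(next_q_online))
    return reward + gamma * float(next_q_target[best_action])

target = double_dqn_target(
    reward=1.0,
    next_q_online=np.array([1.0, 3.0]),   # the online net prefers action 1
    next_q_target=np.array([10.0, 2.0]),  # the target net evaluates action 1
)
```

Note that a plain DQN target would have used `max(next_q_target) = 10.0` here, illustrating how the overestimation is tempered.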
Since these three improvements to Deep Q-Learning, many more have been added, such as Prioritized Experience Replay and Dueling Deep Q-Learning. They’re out of the scope of this course, but if you’re interested, check the links we put in the reading list.
41
units/en/unit3/deep-q-network.mdx
Normal file
@@ -0,0 +1,41 @@
# The Deep Q-Network (DQN) [[deep-q-network]]
This is the architecture of our Deep Q-Learning network:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/deep-q-network.jpg" alt="Deep Q Network"/>

As input, we take a **stack of 4 frames** passed through the network as a state, and output a **vector of Q-values for each possible action at that state**. Then, like with Q-Learning, we just need to use our epsilon-greedy policy to select which action to take.

When the neural network is initialized, **the Q-value estimation is terrible**. But during training, our Deep Q-Network agent will associate a situation with the appropriate action and **learn to play the game well**.

## Preprocessing the input and temporal limitation [[preprocessing]]

We need to **preprocess the input**. It’s an essential step since we want to **reduce the complexity of our state to reduce the computation time needed for training**.

To achieve this, we **reduce the state space to 84x84 and grayscale it**. We can do this since the colors in Atari environments don't add important information.
This is an essential saving since we **reduce our three color channels (RGB) to 1**.

We can also **crop a part of the screen in some games** if it does not contain important information.
Then we stack four frames together.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/preprocessing.jpg" alt="Preprocessing"/>

**Why do we stack four frames together?**
We stack frames together because it helps us **handle the problem of temporal limitation**. Let’s take an example with the game of Pong. When you see this frame:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/temporal-limitation.jpg" alt="Temporal Limitation"/>

Can you tell me where the ball is going?
No, because one frame is not enough to have a sense of motion! But what if I add three more frames? **Here you can see that the ball is going to the right**.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/temporal-limitation-2.jpg" alt="Temporal Limitation"/>
That’s why, to capture temporal information, we stack four frames together.
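A dependency-free sketch of the preprocessing and frame stacking described above. Real pipelines usually rely on environment wrappers and OpenCV-based resizing; the crude nearest-neighbour resize here is an assumption made to keep the example self-contained:

```python
from collections import deque

import numpy as np

def preprocess(frame, size=84):
    # Grayscale (mean over the RGB channels), then a crude nearest-neighbour
    # resize to size x size (cv2.resize would be used in practice).
    gray = frame.mean(axis=2)
    rows = np.linspace(0, gray.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, size).astype(int)
    return gray[np.ix_(rows, cols)]

frames = deque(maxlen=4)   # the 4-frame stack that gives a sense of motion

def push_and_stack(frame):
    frames.append(preprocess(frame))
    while len(frames) < 4:             # at episode start, repeat the first frame
        frames.append(frames[-1])
    return np.stack(frames)            # state shape: (4, 84, 84)

state = push_and_stack(np.zeros((210, 160, 3)))   # one 210x160 RGB Atari frame
```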
Then, the stacked frames are processed by three convolutional layers. These layers **allow us to capture and exploit spatial relationships in images**. But also, because the frames are stacked together, **we can exploit some temporal properties across those frames**.

If you don't know what convolutional layers are, don't worry. You can check [Lesson 4 of this free deep learning course by Udacity](https://www.udacity.com/course/deep-learning-pytorch--ud188).

Finally, we have a couple of fully connected layers that output a Q-value for each possible action at that state.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/deep-q-network.jpg" alt="Deep Q Network"/>

So, we see that Deep Q-Learning uses a neural network to approximate, given a state, the different Q-values for each possible action at that state. Let’s now study the Deep Q-Learning algorithm.
34
units/en/unit3/from-q-to-dqn.mdx
Normal file
@@ -0,0 +1,34 @@
# From Q-Learning to Deep Q-Learning [[from-q-to-dqn]]

We learned that **Q-Learning is an algorithm we use to train our Q-function**, an **action-value function** that determines the value of being at a particular state and taking a specific action at that state.

<figure>
  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function.jpg" alt="Q-function"/>
</figure>

The **Q comes from "the Quality" of that action at that state.**

Internally, our Q-function has **a Q-table, a table where each cell corresponds to a state-action pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.**

The problem is that Q-Learning is a *tabular method*: it only works when the state and action spaces **are small enough for the value functions to be represented as arrays and tables**. And this is **not scalable**.
Q-Learning worked well with small state-space environments like:

- FrozenLake, where we had 16 states.
- Taxi-v3, where we had 500 states.

But think of what we're going to do today: we will train an agent to learn to play Space Invaders, a more complex game, using the frames as input.

As **[Nikita Melkozerov mentioned](https://twitter.com/meln1k), Atari environments** have an observation space with a shape of (210, 160, 3)*, containing values ranging from 0 to 255, so that gives us \\(256^{210 \times 160 \times 3} = 256^{100800}\\) possible observations (for comparison, we have approximately \\(10^{80}\\) atoms in the observable universe).

* A single frame in Atari is composed of an image of 210x160 pixels. Given the images are in color (RGB), there are 3 channels. This is why the shape is (210, 160, 3). For each pixel, the value can go from 0 to 255.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari.jpg" alt="Atari State Space"/>
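The arithmetic behind that exponent, spelled out:

```python
# Shape (210, 160, 3): a 210x160-pixel image with 3 color channels,
# each channel value taking one of 256 possibilities (0..255).
n_channel_values = 210 * 160 * 3
# There are therefore 256 ** n_channel_values possible observations --
# far more than the ~10**80 atoms in the observable universe.
```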
Therefore, the state space is gigantic; due to this, creating and updating a Q-table for that environment would not be efficient. In this case, the best idea is to approximate the Q-values instead of a Q-table using a parametrized Q-function \\(Q_{\theta}(s,a)\\) .
|
||||
|
||||
This neural network will approximate, given a state, the different Q-values for each possible action at that state. And that's exactly what Deep Q-Learning does.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/deep.jpg" alt="Deep Q Learning"/>
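
As an illustration only (not the implementation used later in this unit), here is a minimal NumPy sketch of such a network: a tiny two-layer MLP with randomly initialized weights. All names and sizes are hypothetical; in practice the parameters \\(\theta\\) are learned with gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_q_network(state_dim, n_actions, hidden=64):
    # Hypothetical 2-layer MLP; weights are random here, learned in practice.
    return {
        "W1": rng.normal(0, 0.1, (state_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0, 0.1, (hidden, n_actions)),
        "b2": np.zeros(n_actions),
    }

def q_values(params, state):
    # Given a state, output one Q-value per possible action.
    h = np.maximum(0, state @ params["W1"] + params["b1"])  # ReLU
    return h @ params["W2"] + params["b2"]

params = init_q_network(state_dim=8, n_actions=4)
q = q_values(params, rng.normal(size=8))
best_action = int(np.argmax(q))  # greedy action selection
```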

Now that we understand Deep Q-Learning, let's dive deeper into the Deep Q-Network.
39
units/en/unit3/glossary.mdx
Normal file
@@ -0,0 +1,39 @@

# Glossary

This is a community-created glossary. Contributions are welcome!

- **Tabular Method:** type of problem in which the state and action spaces are small enough for the value function to be represented as arrays and tables.
**Q-learning** is an example of a tabular method, since a table is used to represent the values for different state-action pairs.

- **Deep Q-Learning:** method that trains a neural network to approximate, given a state, the different **Q-values** for each possible action at that state.
It is used to solve problems where the observation space is too big to apply a tabular Q-Learning approach.

- **Temporal Limitation:** a difficulty that arises when the environment state is represented by frames. A frame by itself does not provide temporal information.
In order to obtain temporal information, we need to **stack** a number of frames together.
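
Frame stacking can be sketched in a few lines of standard Python with `collections.deque` (the frame values here are stand-ins; real pipelines stack preprocessed grayscale images):

```python
from collections import deque

NUM_FRAMES = 4

# The stacked state keeps the last 4 frames; the oldest is dropped automatically.
frames = deque(maxlen=NUM_FRAMES)

def push_frame(frame):
    if not frames:                           # at episode start,
        frames.extend([frame] * NUM_FRAMES)  # repeat the first frame
    else:
        frames.append(frame)
    return tuple(frames)                     # the stacked observation

state = push_frame("f0")
for f in ["f1", "f2", "f3", "f4"]:
    state = push_frame(f)
```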

- **Phases of Deep Q-Learning:**
    - **Sampling:** actions are performed, and the observed experience tuples are stored in a **replay memory**.
    - **Training:** batches of tuples are selected randomly, and the neural network updates its weights using gradient descent.

- **Solutions to stabilize Deep Q-Learning:**
    - **Experience Replay:** a replay memory is created to save experience samples that can be reused during training.
    This allows the agent to learn from the same experiences multiple times. It also keeps the agent from forgetting previous experiences as it gets new ones.
    **Random sampling** from the replay buffer helps remove correlation in the observation sequences and prevents action values from oscillating or diverging catastrophically.

    - **Fixed Q-Target:** in order to calculate the **Q-Target**, we need to estimate the discounted optimal **Q-value** of the next state by using the Bellman equation. The problem is that the same network weights are used to calculate the **Q-Target** and the **Q-value**. This means that every time we modify the **Q-value**, the **Q-Target** moves with it.
    To avoid this issue, a separate network with fixed parameters is used to estimate the Temporal Difference Target. The target network is updated by copying the parameters from our Deep Q-Network every **C steps**.

    - **Double DQN:** method to handle **overestimation** of **Q-values**. This solution uses two networks to decouple the action selection from the target **Q-value generation**:
        - **DQN Network** to select the best action to take for the next state (the action with the highest **Q-value**).
        - **Target Network** to calculate the target **Q-value** of taking that action at the next state.
    This approach reduces **Q-value** overestimation, helps train faster, and leads to more stable learning.
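
Assuming NumPy and made-up batch values (the Q-value arrays below stand in for network outputs), the Double DQN target computation can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for network outputs: a batch of 4 next states, 3 possible actions.
q_online = rng.normal(size=(4, 3))   # DQN (online) network
q_target = rng.normal(size=(4, 3))   # target network
rewards = np.array([1.0, 0.0, 0.0, 1.0])
dones = np.array([0.0, 0.0, 1.0, 0.0])
gamma = 0.99

# 1. The online network selects the best next action...
best_actions = q_online.argmax(axis=1)
# 2. ...and the target network evaluates that action.
next_q = q_target[np.arange(4), best_actions]

td_target = rewards + gamma * (1.0 - dones) * next_q
```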

If you want to improve the course, you can [open a Pull Request.](https://github.com/huggingface/deep-rl-class/pulls)

This glossary was made possible thanks to:

- [Dario Paez](https://github.com/dario248)
314
units/en/unit3/hands-on.mdx
Normal file
@@ -0,0 +1,314 @@

# Hands-on [[hands-on]]

<CourseFloatingBanner classNames="absolute z-10 right-0 top-0"
notebooks={[
{label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/notebooks/unit3/unit3.ipynb"}
]}
askForHelpUrl="http://hf.co/join/discord" />

Now that you've studied the theory behind Deep Q-Learning, **you’re ready to train your Deep Q-Learning agent to play Atari Games**. We'll start with Space Invaders, but you'll be able to use any Atari game you want 🔥

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Environments"/>

We're using the [RL-Baselines-3 Zoo integration](https://github.com/DLR-RM/rl-baselines3-zoo), a vanilla version of Deep Q-Learning with no extensions such as Double-DQN, Dueling-DQN, or Prioritized Experience Replay.

Also, **if you want to learn to implement Deep Q-Learning by yourself after this hands-on**, you definitely should look at the CleanRL implementation: https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn_atari.py

To validate this hands-on for the certification process, you need to push your trained model to the Hub and **get a result of >= 500**.

To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model: **the result = mean_reward - std of reward**

For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process

**To start the hands-on, click on the Open In Colab button** 👇 :

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit3/unit3.ipynb)

# Unit 3: Deep Q-Learning with Atari Games 👾 using RL Baselines3 Zoo

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/thumbnail.jpg" alt="Unit 3 Thumbnail">

In this notebook, **you'll train a Deep Q-Learning agent** playing Space Invaders using [RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo), a training framework based on [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/) that provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.

We're using the [RL-Baselines-3 Zoo integration, a vanilla version of Deep Q-Learning](https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html) with no extensions such as Double-DQN, Dueling-DQN, or Prioritized Experience Replay.

⬇️ Here is an example of what **you will achieve** ⬇️

```python
%%html
<video controls autoplay><source src="https://huggingface.co/ThomasSimonini/ppo-SpaceInvadersNoFrameskip-v4/resolve/main/replay.mp4" type="video/mp4"></video>
```

### 🎮 Environments:

- SpaceInvadersNoFrameskip-v4

### 📚 RL-Library:

- [RL-Baselines3-Zoo](https://github.com/DLR-RM/rl-baselines3-zoo)

## Objectives 🏆

At the end of the notebook, you will:

- Be able to understand more deeply **how RL Baselines3 Zoo works**.
- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.

## Prerequisites 🏗️

Before diving into the notebook, you need to:

🔲 📚 **[Study Deep Q-Learning by reading Unit 3](https://huggingface.co/deep-rl-course/unit3/introduction)** 🤗

We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub repo](https://github.com/huggingface/deep-rl-class/issues).

# Let's train a Deep Q-Learning agent playing Atari's Space Invaders 👾 and upload it to the Hub.

To validate this hands-on for the certification process, you need to push your trained model to the Hub and **get a result of >= 500**.

To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model: **the result = mean_reward - std of reward**

For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process

## Set the GPU 💪

- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step1.jpg" alt="GPU Step 1">

- `Hardware Accelerator > GPU`

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step2.jpg" alt="GPU Step 2">

## Create a virtual display 🔽

During the notebook, we'll need to generate a replay video. To do so, within Colab, **we need a virtual screen to be able to render the environment** (and thus record the frames).

Hence, the following cells will install the libraries and create and run a virtual screen 🖥

```bash
apt install python-opengl
apt install ffmpeg
apt install xvfb
pip3 install pyvirtualdisplay
```

```bash
apt-get install swig cmake freeglut3-dev
```

```bash
pip install pyglet==1.5.1
```

```python
# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()
```

## Clone RL-Baselines3 Zoo Repo 📚

You could directly install from the Python package (`pip install rl_zoo3`), but since we want **the full installation with extra environments and dependencies**, we're going to clone the `RL-Baselines3-Zoo` repository and install from source.

```bash
git clone https://github.com/DLR-RM/rl-baselines3-zoo
```

## Install dependencies 🔽

We can now install the dependencies RL-Baselines3 Zoo needs (this can take 5min ⏲)

```bash
cd /content/rl-baselines3-zoo/
```

```bash
pip install -r requirements.txt
```

## Train our Deep Q-Learning Agent to Play Space Invaders 👾

To train an agent with RL-Baselines3-Zoo, we just need to do two things:

1. Define the hyperparameters in `rl-baselines3-zoo/hyperparams/dqn.yml`

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit3/hyperparameters.png" alt="DQN Hyperparameters">

Here we see that:

- We use the `Atari Wrapper` that does the pre-processing (frame reduction, grayscale, stacking four frames).
- We use `CnnPolicy`, since we use convolutional layers to process the frames.
- We train the model for 10 million `n_timesteps`.
- The memory (Experience Replay) size is 100000, i.e. the number of experience steps saved to train your agent with.
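
For reference, an Atari entry in `dqn.yml` looks roughly like the following. This is an illustrative sketch of the format, not the exact file — check the actual values in the repository:

```yaml
atari:
  env_wrapper:
    - stable_baselines3.common.atari_wrappers.AtariWrapper
  frame_stack: 4
  policy: 'CnnPolicy'
  n_timesteps: !!float 1e7
  buffer_size: 100000
  learning_rate: !!float 1e-4
  batch_size: 32
  learning_starts: 100000
  target_update_interval: 1000
  train_freq: 4
  exploration_fraction: 0.1
  exploration_final_eps: 0.01
```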

💡 My advice is to **reduce the training timesteps to 1M,** which will take about 90 minutes on a P100 (`!nvidia-smi` will tell you what GPU you're using). At 10 million steps, training will take about 9 hours, which could likely result in Colab timing out. In that case, I recommend running it on your local computer (or somewhere else). Just click on: `File > Download`.

In terms of hyperparameter optimization, my advice is to focus on these 3 hyperparameters:

- `learning_rate`
- `buffer_size` (Experience Memory size)
- `batch_size`

As a good practice, you should **check the documentation to understand what each hyperparameter does**: https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html#parameters

2. Run `train.py` and save the models in the `logs` folder 📁

```bash
python train.py --algo ________ --env SpaceInvadersNoFrameskip-v4 -f _________
```

#### Solution

```bash
python train.py --algo dqn --env SpaceInvadersNoFrameskip-v4 -f logs/
```

## Let's evaluate our agent 👀

- RL-Baselines3-Zoo provides `enjoy.py`, a Python script to evaluate our agent. In most RL libraries, we call the evaluation script `enjoy.py`.
- Let's evaluate it for 5000 timesteps 🔥

```bash
python enjoy.py --algo dqn --env SpaceInvadersNoFrameskip-v4 --no-render --n-timesteps _________ --folder logs/
```

#### Solution

```bash
python enjoy.py --algo dqn --env SpaceInvadersNoFrameskip-v4 --no-render --n-timesteps 5000 --folder logs/
```

## Publish our trained model on the Hub 🚀

Now that we saw we got good results after the training, we can publish our trained model on the Hub with one line of code.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit3/space-invaders-model.gif" alt="Space Invaders model">

By using `rl_zoo3.push_to_hub`, **you evaluate, record a replay, generate a model card of your agent, and push it to the Hub**.

This way:

- You can **showcase your work** 🔥
- You can **visualize your agent playing** 👀
- You can **share an agent with the community that others can use** 💾
- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates'** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard

To be able to share your model with the community, there are three more steps to follow:

1️⃣ (If it's not already done) create an account on HF ➡ https://huggingface.co/join

2️⃣ Sign in, and store your authentication token from the Hugging Face website.

- Create a new token (https://huggingface.co/settings/tokens) **with write role**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">

- Copy the token
- Run the cell below and paste the token

```python
from huggingface_hub import notebook_login  # To log in to our Hugging Face account and upload models to the Hub.

notebook_login()
!git config --global credential.helper store
```

If you don't want to use Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`

3️⃣ We're now ready to push our trained agent to the Hub 🔥

Let's run `push_to_hub.py` to upload our trained agent to the Hub. There are two important parameters:

* `--repo-name`: The name of the repo
* `-orga`: Your Hugging Face username

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit3/select-id.png" alt="Select Id">

```bash
python -m rl_zoo3.push_to_hub --algo dqn --env SpaceInvadersNoFrameskip-v4 --repo-name _____________________ -orga _____________________ -f logs/
```

#### Solution

```bash
python -m rl_zoo3.push_to_hub --algo dqn --env SpaceInvadersNoFrameskip-v4 --repo-name dqn-SpaceInvadersNoFrameskip-v4 -orga ThomasSimonini -f logs/
```

Congrats 🥳 you've just trained and uploaded your first Deep Q-Learning agent using RL-Baselines-3 Zoo. The script above should have displayed a link to a model repository such as https://huggingface.co/ThomasSimonini/dqn-SpaceInvadersNoFrameskip-v4. When you go to this link, you can:

- See a **video preview of your agent** on the right.
- Click "Files and versions" to see all the files in the repository.
- Click "Use in stable-baselines3" to get a code snippet that shows how to load the model.
- Read the model card (the `README.md` file), which gives a description of the model and the hyperparameters you used.

Under the hood, the Hub uses git-based repositories (don't worry if you don't know what git is), which means you can update the model with new versions as you experiment and improve your agent.

**Compare the results of your agents with your classmates'** using the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) 🏆

## Load a powerful trained model 🔥

The Stable-Baselines3 team uploaded **more than 150 trained Deep Reinforcement Learning agents to the Hub**. You can download them and use them to see how they perform!

You can find them here: 👉 https://huggingface.co/sb3

Some examples:

- Asteroids: https://huggingface.co/sb3/dqn-AsteroidsNoFrameskip-v4
- Beam Rider: https://huggingface.co/sb3/dqn-BeamRiderNoFrameskip-v4
- Breakout: https://huggingface.co/sb3/dqn-BreakoutNoFrameskip-v4
- Road Runner: https://huggingface.co/sb3/dqn-RoadRunnerNoFrameskip-v4

Let's load an agent playing Beam Rider: https://huggingface.co/sb3/dqn-BeamRiderNoFrameskip-v4

```python
%%html
<video controls autoplay><source src="https://huggingface.co/sb3/dqn-BeamRiderNoFrameskip-v4/resolve/main/replay.mp4" type="video/mp4"></video>
```

1. We download the model using `rl_zoo3.load_from_hub`, and place it in a new folder that we can call `rl_trained`

```bash
# Download model and save it into the rl_trained/ folder
python -m rl_zoo3.load_from_hub --algo dqn --env BeamRiderNoFrameskip-v4 -orga sb3 -f rl_trained/
```

2. Let's evaluate it for 5000 timesteps

```bash
python enjoy.py --algo dqn --env BeamRiderNoFrameskip-v4 -n 5000 -f rl_trained/
```

Why not try training your own **Deep Q-Learning agent playing BeamRiderNoFrameskip-v4? 🏆**

If you want to try, check https://huggingface.co/sb3/dqn-BeamRiderNoFrameskip-v4#hyperparameters. There, **in the model card, you have the hyperparameters of the trained agent.**

But finding hyperparameters can be a daunting task. Fortunately, we'll see in the next bonus unit how we can **use Optuna to optimize the hyperparameters 🔥.**

## Some additional challenges 🏆

The best way to learn **is to try things on your own**!

In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?

Here's a list of environments you can try to train your agent with:

- BeamRiderNoFrameskip-v4
- BreakoutNoFrameskip-v4
- EnduroNoFrameskip-v4
- PongNoFrameskip-v4

Also, **if you want to learn to implement Deep Q-Learning by yourself**, you definitely should look at the CleanRL implementation: https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn_atari.py

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Environments"/>

________________________________________________________________________

Congrats on finishing this chapter!

If you still feel confused by all these elements... it's totally normal! **This was the same for me and for everyone who studied RL.**

Take time to really **grasp the material before continuing, and try the additional challenges**. It's important to master these elements and to have solid foundations.

In the next unit, **we're going to learn about [Optuna](https://optuna.org/)**. One of the most critical tasks in Deep Reinforcement Learning is to find a good set of training hyperparameters. Optuna is a library that helps you automate the search.

See you in Bonus Unit 2! 🔥

### Keep Learning, Stay Awesome 🤗
19
units/en/unit3/introduction.mdx
Normal file
@@ -0,0 +1,19 @@

# Deep Q-Learning [[deep-q-learning]]

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/thumbnail.jpg" alt="Unit 3 thumbnail" width="100%">

In the last unit, we learned our first reinforcement learning algorithm: Q-Learning, **implemented it from scratch**, and trained it in two environments, FrozenLake-v1 ☃️ and Taxi-v3 🚕.

We got excellent results with this simple algorithm, but these environments were relatively simple because the **state space was discrete and small** (16 different states for FrozenLake-v1 and 500 for Taxi-v3). For comparison, the state space in Atari games can **contain \\(10^{9}\\) to \\(10^{11}\\) states**.

But as we'll see, producing and updating a **Q-table can become ineffective in large state space environments.**

So in this unit, **we'll study our first Deep Reinforcement Learning agent**: Deep Q-Learning. Instead of using a Q-table, Deep Q-Learning uses a neural network that takes a state and approximates the Q-values for each action based on that state.

And **we'll train it to play Space Invaders and other Atari environments using [RL-Zoo](https://github.com/DLR-RM/rl-baselines3-zoo)**, a training framework for RL using Stable-Baselines that provides scripts for training, evaluating agents, tuning hyperparameters, plotting results, and recording videos.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Environments"/>

So let’s get started! 🚀
104
units/en/unit3/quiz.mdx
Normal file
@@ -0,0 +1,104 @@

# Quiz [[quiz]]

The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you find **where you need to reinforce your knowledge**.

### Q1: We mentioned Q-Learning is a tabular method. What are tabular methods?

<details>
<summary>Solution</summary>

*Tabular methods* are a type of problem in which the state and action spaces are small enough for the value function to be **represented as arrays and tables**. For instance, **Q-Learning is a tabular method**, since we use a table to represent the state-action value pairs.

</details>

### Q2: Why can't we use classical Q-Learning to solve an Atari game?

<Question
choices={[
{
text: "Atari environments are too fast for Q-Learning",
explain: ""
},
{
text: "Atari environments have a big observation space. So creating and updating the Q-table would not be efficient",
explain: "",
correct: true
}
]}
/>
### Q3: Why do we stack four frames together when we use frames as input in Deep Q-Learning?
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
We stack frames together because it helps us **handle the problem of temporal limitation**: one frame is not enough to capture temporal information.
|
||||
For instance, in pong, our agent **will be unable to know the ball direction if it gets only one frame**.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/temporal-limitation.jpg" alt="Temporal limitation"/>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/temporal-limitation-2.jpg" alt="Temporal limitation"/>
|
||||
|
||||
|
||||
</details>
|
||||
|
||||
|

### Q4: What are the two phases of Deep Q-Learning?

<Question
choices={[
{
text: "Sampling",
explain: "We perform actions and store the observed experience tuples in a replay memory.",
correct: true,
},
{
text: "Shuffling",
explain: "",
},
{
text: "Reranking",
explain: "",
},
{
text: "Training",
explain: "We select a small batch of tuples randomly and learn from it using a gradient descent update step.",
correct: true,
}
]}
/>

### Q5: Why do we create a replay memory in Deep Q-Learning?

<details>
<summary>Solution</summary>

**1. Make more efficient use of the experiences during the training**

Usually, in online reinforcement learning, the agent interacts with the environment, gets experiences (state, action, reward, and next state), learns from them (updates the neural network), and discards them. This is not efficient.
But with experience replay, **we create a replay buffer that saves experience samples that we can reuse during the training**.

**2. Avoid forgetting previous experiences and reduce the correlation between experiences**

The problem we get if we give sequential samples of experiences to our neural network is that it **tends to forget previous experiences as it overwrites them with new ones**. For instance, if we are in the first level and then the second, which is different, our agent can forget how to behave and play in the first level.

</details>
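
The replay memory described above can be sketched in a few lines of standard Python (capacity and the fake interaction loop are hypothetical):

```python
import random
from collections import deque

random.seed(0)

buffer = deque(maxlen=10_000)  # replay memory with a hypothetical capacity

def store(state, action, reward, next_state, done):
    buffer.append((state, action, reward, next_state, done))

def sample(batch_size):
    # Random sampling breaks the correlation between consecutive steps.
    return random.sample(buffer, batch_size)

for t in range(100):  # stand-in for the agent-environment interaction loop
    store(t, t % 4, 0.0, t + 1, False)

batch = sample(32)
```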

### Q6: How do we use Double Deep Q-Learning?

<details>
<summary>Solution</summary>

When we compute the Q-target, we use two networks to decouple the action selection from the target Q-value generation. We:

- Use our *DQN network* to **select the best action to take for the next state** (the action with the highest Q-value).

- Use our *Target network* to calculate **the target Q-value of taking that action at the next state**.

</details>

Congrats on finishing this quiz 🥳! If you missed some elements, take time to read the chapter again to reinforce (😏) your knowledge.
20
units/en/unit4/additional-readings.mdx
Normal file
@@ -0,0 +1,20 @@

# Additional Readings

These are **optional readings** if you want to go deeper.

## Introduction to Policy Optimization

- [Part 3: Intro to Policy Optimization - Spinning Up documentation](https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html)

## Policy Gradient

- [https://johnwlambert.github.io/policy-gradients/](https://johnwlambert.github.io/policy-gradients/)
- [RL - Policy Gradient Explained](https://jonathan-hui.medium.com/rl-policy-gradients-explained-9b13b688b146)
- [Chapter 13, Policy Gradient Methods; Reinforcement Learning, an introduction by Richard Sutton and Andrew G. Barto](http://incompleteideas.net/book/RLbook2020.pdf)

## Implementation

- [PyTorch Reinforce implementation](https://github.com/pytorch/examples/blob/main/reinforcement_learning/reinforce.py)
- [Implementations from DDPG to PPO](https://github.com/MrSyee/pg-is-all-you-need)
74
units/en/unit4/advantages-disadvantages.mdx
Normal file
@@ -0,0 +1,74 @@

# The advantages and disadvantages of policy-gradient methods

At this point, you might ask, "but Deep Q-Learning is excellent! Why use policy-gradient methods?". To answer this question, let's study the **advantages and disadvantages of policy-gradient methods**.

## Advantages

There are multiple advantages over value-based methods. Let's see some of them:

### The simplicity of integration

We can estimate the policy directly without storing additional data (action values).

### Policy-gradient methods can learn a stochastic policy

Policy-gradient methods can **learn a stochastic policy while value functions can't**.

This has two consequences:

1. We **don't need to implement an exploration/exploitation trade-off by hand**. Since we output a probability distribution over actions, the agent explores **the state space without always taking the same trajectory.**

2. We also get rid of the problem of **perceptual aliasing**. Perceptual aliasing is when two states seem (or are) the same but need different actions.

Let's take an example: we have an intelligent vacuum cleaner whose goal is to suck up the dust and avoid killing the hamsters.

<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/hamster1.jpg" alt="Hamster 1"/>
</figure>

Our vacuum cleaner can only perceive where the walls are.

The problem is that the **two red cases are aliased states, because the agent perceives an upper and lower wall for each**.

<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/hamster2.jpg" alt="Hamster 2"/>
</figure>

Under a deterministic policy, the policy will either always move right or always move left when in a red state. **Either case will cause our agent to get stuck and never suck up the dust**.

Under a value-based Reinforcement Learning algorithm, we learn a **quasi-deterministic policy** (an "epsilon-greedy strategy"). Consequently, our agent can **spend a lot of time before finding the dust**.

On the other hand, an optimal stochastic policy **will randomly move left or right in red states**. Consequently, **it will not get stuck and will reach the goal state with a high probability**.

<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/hamster3.jpg" alt="Hamster 3"/>
</figure>

### Policy-gradient methods are more effective in high-dimensional action spaces and continuous action spaces

The problem with Deep Q-Learning is that its **predictions assign a score (maximum expected future reward) to each possible action**, at each time step, given the current state.

But what if we have an infinite number of possible actions?

For instance, with a self-driving car, at each state, you can have a (near) infinite choice of actions (turning the wheel at 15°, 17.2°, 19.4°, honking, etc.). **We would need to output a Q-value for each possible action**! And **taking the max action of a continuous output is an optimization problem in itself**!

Instead, with policy-gradient methods, we output a **probability distribution over actions.**
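
A minimal NumPy sketch of this idea (the logits are made-up numbers standing in for a policy network's output):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = logits - logits.max()   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical policy logits for 3 actions in some state
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)          # probability distribution over actions

action = rng.choice(len(probs), p=probs)  # stochastic action selection
```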
|
||||
|
||||
### Policy-gradient methods have better convergence properties

In value-based methods, we use an aggressive operator to **change the value function: we take the maximum over Q-estimates**.

Consequently, the action probabilities may change dramatically for an arbitrarily small change in the estimated action values if that change results in a different action having the maximal value.

For instance, if during training the best action was left (with a Q-value of 0.22) and at the next training step it's right (since the right Q-value becomes 0.23), we have dramatically changed the policy: it will now take right most of the time instead of left.

On the other hand, in policy-gradient methods, stochastic policy action preferences (the probability of taking each action) **change smoothly over time**.
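The 0.22 → 0.23 example above can be checked numerically. The sketch below (with made-up preference values) compares a greedy policy derived from Q-estimates against a softmax policy over the same numbers:

```python
import math

def greedy_probs(q):
    """Deterministic policy: all probability mass on the argmax action."""
    best = max(range(len(q)), key=lambda a: q[a])
    return [1.0 if a == best else 0.0 for a in range(len(q))]

def softmax(prefs):
    """Stochastic policy: probabilities vary smoothly with the preferences."""
    exps = [math.exp(p) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

# The example from the text: left is barely best, then right becomes 0.23.
before, after = [0.22, 0.21], [0.22, 0.23]

print(greedy_probs(before), greedy_probs(after))  # [1.0, 0.0] -> [0.0, 1.0]: dramatic jump
print(softmax(before), softmax(after))            # probabilities barely move
```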
## Disadvantages

Naturally, policy-gradient methods also have some disadvantages:

- **Frequently, policy-gradient methods converge on a local maximum instead of the global optimum.**
- Policy-gradient methods go slower, **step by step: they can take longer to train (inefficient).**
- Policy-gradient methods can have high variance. We'll see in the actor-critic unit why, and how we can solve this problem.

👉 If you want to go deeper into the advantages and disadvantages of policy-gradient methods, [you can check this video](https://youtu.be/y3oqOjHilio).
17
units/en/unit4/conclusion.mdx
Normal file
@@ -0,0 +1,17 @@
# Conclusion

**Congrats on finishing this unit!** There was a lot of information.

And congrats on finishing the tutorial. You've just coded your first Deep Reinforcement Learning agent from scratch using PyTorch and shared it on the Hub 🥳.

Don't hesitate to iterate on this unit **by improving the implementation for more complex environments** (for instance, what about changing the network to a Convolutional Neural Network to handle frames as observations?).

In the next unit, **we're going to learn more about Unity ML-Agents**, by training agents in Unity environments. This way, you will be ready to participate in the **AI vs. AI challenges, where you'll train your agents to compete against other agents in a snowball fight and a soccer game.**

Sound fun? See you next time!

Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback, please 👉 [fill out this form](https://forms.gle/BzKXWzLAGZESGNaE9)

### Keep Learning, stay awesome 🤗
1012
units/en/unit4/hands-on.mdx
Normal file
File diff suppressed because it is too large
24
units/en/unit4/introduction.mdx
Normal file
@@ -0,0 +1,24 @@
# Introduction [[introduction]]

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/thumbnail.png" alt="thumbnail"/>

In the last unit, we learned about Deep Q-Learning. In this value-based deep reinforcement learning algorithm, we **used a deep neural network to approximate the different Q-values for each possible action at a state.**

Since the beginning of the course, we have only studied value-based methods, **where we estimate a value function as an intermediate step towards finding an optimal policy.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" />

In value-based methods, the policy **\\(π\\) only exists because of the action value estimates, since the policy is just a function** (for instance, a greedy policy) that selects the action with the highest value given a state.

But with policy-based methods, we want to optimize the policy directly, **without the intermediate step of learning a value function.**

So today, **we'll learn about policy-based methods and study a subset of these methods called policy gradient**. Then we'll implement our first policy-gradient algorithm, called Monte Carlo **Reinforce**, from scratch using PyTorch. We'll then test its robustness using the CartPole-v1 and PixelCopter environments.

You'll then be able to iterate and improve this implementation for more advanced environments.

<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/envs.gif" alt="Environments"/>
</figure>

Let's get started!
82
units/en/unit4/pg-theorem.mdx
Normal file
@@ -0,0 +1,82 @@
# (Optional) the Policy Gradient Theorem

In this optional section, we're **going to study how we differentiate the objective function that we will use to approximate the policy gradient**.

Let's first recap our different formulas:

1. The objective function

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/expected_reward.png" alt="Return"/>

2. The probability of a trajectory (given that actions come from \\(\pi_\theta\\)):

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/probability.png" alt="Probability"/>

So we have:

\\(\nabla_\theta J(\theta) = \nabla_\theta \sum_{\tau}P(\tau;\theta)R(\tau)\\)

We can rewrite the gradient of the sum as the sum of the gradients:

\\( = \sum_{\tau} \nabla_\theta P(\tau;\theta)R(\tau) \\)

We then multiply every term in the sum by \\(\frac{P(\tau;\theta)}{P(\tau;\theta)}\\) (which is allowed since it equals 1):

\\( = \sum_{\tau} \frac{P(\tau;\theta)}{P(\tau;\theta)}\nabla_\theta P(\tau;\theta)R(\tau) \\)

We can simplify this further since \\( \frac{P(\tau;\theta)}{P(\tau;\theta)}\nabla_\theta P(\tau;\theta) = P(\tau;\theta)\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)} \\):

\\(= \sum_{\tau} P(\tau;\theta) \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}R(\tau) \\)

We can then use the *derivative log trick* (also called the *likelihood ratio trick* or *REINFORCE trick*), a simple rule from calculus that implies that \\( \nabla_x \log f(x) = \frac{\nabla_x f(x)}{f(x)} \\)

So, given \\(\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)} \\), we can rewrite it as \\(\nabla_\theta \log P(\tau;\theta) \\)
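The derivative log trick is easy to check numerically. The sketch below verifies \\( \nabla_x \log f(x) = \frac{\nabla_x f(x)}{f(x)} \\) at an arbitrary point, using an arbitrary positive function as \\(f\\):

```python
import math

# Any function positive at the test point works for the check.
def f(x):
    return x ** 3 + 2 * x

def grad(g, x, h=1e-6):
    """Central finite-difference approximation of dg/dx."""
    return (g(x + h) - g(x - h)) / (2 * h)

x = 1.7
lhs = grad(lambda y: math.log(f(y)), x)  # ∇_x log f(x)
rhs = grad(f, x) / f(x)                  # ∇_x f(x) / f(x)

print(lhs, rhs)  # the two sides agree up to numerical error
```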
So this is our likelihood-ratio policy gradient:

\\( \nabla_\theta J(\theta) = \sum_{\tau} P(\tau;\theta) \nabla_\theta \log P(\tau;\theta) R(\tau) \\)

Thanks to this new formula, we can estimate the gradient using trajectory samples (in other words, we can approximate the likelihood-ratio policy gradient with a sample-based estimate):

\\(\nabla_\theta J(\theta) = \frac{1}{m} \sum^{m}_{i=1} \nabla_\theta \log P(\tau^{(i)};\theta)R(\tau^{(i)})\\) where each \\(\tau^{(i)}\\) is a sampled trajectory.

But we still have some mathematical work to do: we need to simplify \\( \nabla_\theta \log P(\tau;\theta) \\)

We know that:

\\(\nabla_\theta \log P(\tau^{(i)};\theta)= \nabla_\theta \log[ \mu(s_0) \prod_{t=0}^{H} P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})]\\)

Where \\(\mu(s_0)\\) is the initial state distribution and \\( P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) \\) is the state transition dynamics of the MDP.

We know that the log of a product is equal to the sum of the logs:

\\(\nabla_\theta \log P(\tau^{(i)};\theta)= \nabla_\theta \left[ \log \mu(s_0) + \sum_{t=0}^{H} \log P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) + \sum_{t=0}^{H} \log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})\right]\\)

We also know that the gradient of a sum is equal to the sum of the gradients:

\\( \nabla_\theta \log P(\tau^{(i)};\theta)= \nabla_\theta \log \mu(s_0) + \nabla_\theta \sum_{t=0}^{H} \log P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) + \nabla_\theta \sum_{t=0}^{H} \log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)}) \\)

Since neither the initial state distribution nor the state transition dynamics of the MDP depends on \\(\theta\\), the derivative of both terms is 0, so we can remove them:

\\(\nabla_\theta \log \mu(s_0) = 0 \\) and \\(\nabla_\theta \sum_{t=0}^{H} \log P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) = 0 \\)

So:

\\(\nabla_\theta \log P(\tau^{(i)};\theta) = \nabla_\theta \sum_{t=0}^{H} \log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})\\)

We can rewrite the gradient of the sum as the sum of gradients:

\\( \nabla_\theta \log P(\tau^{(i)};\theta)= \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)}) \\)

So, the final formula for estimating the policy gradient is:

\\( \nabla_{\theta} J(\theta) = \hat{g} = \frac{1}{m} \sum^{m}_{i=1} \sum^{H}_{t=0} \nabla_\theta \log \pi_\theta(a^{(i)}_{t} | s_{t}^{(i)})R(\tau^{(i)}) \\)
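To see that this sample-based estimator really approximates the true gradient, here is a sketch on the smallest possible case: a hypothetical one-step problem with two actions and a softmax policy, where \\(J(\theta)\\) can also be computed exactly for comparison:

```python
import math
import random

random.seed(0)

# Hypothetical one-step problem: 2 actions with fixed rewards, softmax policy.
rewards = [1.0, 0.0]
theta = [0.0, 0.0]  # one preference per action

def probs(theta):
    exps = [math.exp(t) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(theta, a):
    """∇_θ log π_θ(a) for a softmax policy: one-hot(a) - π_θ."""
    p = probs(theta)
    return [(1.0 if i == a else 0.0) - p[i] for i in range(len(theta))]

# Sample-based estimate: ĝ = (1/m) Σ_i ∇_θ log π_θ(a_i) R(τ_i)
m = 100_000
g_hat = [0.0, 0.0]
for _ in range(m):
    a = random.choices([0, 1], weights=probs(theta))[0]
    g = grad_log_pi(theta, a)
    for i in range(2):
        g_hat[i] += g[i] * rewards[a] / m

# Exact gradient of J(θ) = Σ_a π_θ(a) R(a), for comparison.
p = probs(theta)
J = sum(pi * r for pi, r in zip(p, rewards))
exact = [p[a] * (rewards[a] - J) for a in range(2)]

print(g_hat)  # ≈ [0.25, -0.25]
print(exact)  # [0.25, -0.25]
```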
120
units/en/unit4/policy-gradient.mdx
Normal file
@@ -0,0 +1,120 @@
# Diving deeper into policy-gradient methods

## Getting the big picture

We just learned that policy-gradient methods aim to find parameters \\( \theta \\) that **maximize the expected return**.

The idea is that we have a *parameterized stochastic policy*. In our case, a neural network outputs a probability distribution over actions. The probability of taking each action is also called the *action preference*.

If we take the example of CartPole-v1:
- As input, we have a state.
- As output, we have a probability distribution over the actions at that state.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_based.png" alt="Policy based" />

Our goal with policy-gradient methods is to **control the probability distribution of actions** by tuning the policy such that **good actions (that maximize the return) are sampled more frequently in the future.** Each time the agent interacts with the environment, we tweak the parameters such that good actions are more likely to be sampled in the future.

But **how are we going to optimize the weights using the expected return**?

The idea is that we're going to **let the agent interact during an episode**. And if we win the episode, we consider that each action taken was good and must be sampled more in the future, since it led to the win.

So for each state-action pair, we want to increase \\(P(a|s)\\): the probability of taking that action at that state. Or decrease it if we lost.

The Policy-gradient algorithm (simplified) looks like this:
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/pg_bigpicture.jpg" alt="Policy Gradient Big Picture"/>
</figure>

Now that we have the big picture, let's dive deeper into policy-gradient methods.
## Diving deeper into policy-gradient methods

We have our stochastic policy \\(\pi\\), which has a parameter \\(\theta\\). This \\(\pi\\), given a state, **outputs a probability distribution over actions**.

<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/stochastic_policy.png" alt="Policy"/>
</figure>

Where \\(\pi_\theta(a_t|s_t)\\) is the probability of the agent selecting action \\(a_t\\) from state \\(s_t\\) given our policy.

**But how do we know if our policy is good?** We need a way to measure it. To do that, we define a score/objective function called \\(J(\theta)\\).

### The objective function

The *objective function* gives us the **performance of the agent** given a trajectory (a state-action sequence, without considering the reward (contrary to an episode)), and it outputs the *expected cumulative reward*.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/objective.jpg" alt="Return"/>

Let's look at this formula in more detail:
- The *expected return* (also called the expected cumulative reward) is the weighted average (where the weights are given by \\(P(\tau;\theta)\\)) of all possible values that the return \\(R(\tau)\\) can take.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/expected_reward.png" alt="Return"/>

- \\(R(\tau)\\) : Return from an arbitrary trajectory. To take this quantity and use it to calculate the expected return, we need to multiply it by the probability of each possible trajectory.
- \\(P(\tau;\theta)\\) : Probability of each possible trajectory \\(\tau\\) (that probability depends on \\( \theta\\), since it defines the policy that is used to select the actions of the trajectory, which has an impact on the states visited).

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/probability.png" alt="Probability"/>

- \\(J(\theta)\\) : Expected return. We calculate it by summing, over all trajectories, the probability of taking that trajectory given \\(\theta\\) multiplied by that trajectory's return.

Our objective is then to maximize the expected cumulative reward by finding the \\(\theta \\) that will output the best action probability distributions:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/max_objective.png" alt="Max objective"/>
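To make the weighted average concrete, here is a tiny sketch (with invented numbers) of \\(J(\theta)\\) computed over three enumerable trajectories:

```python
# Hypothetical example: three possible trajectories under the current θ.
# J(θ) is each trajectory's return weighted by its probability.
trajectories = [
    {"prob": 0.5, "return": 10.0},
    {"prob": 0.3, "return": 4.0},
    {"prob": 0.2, "return": -2.0},
]

# The trajectory probabilities must sum to 1.
assert abs(sum(t["prob"] for t in trajectories) - 1.0) < 1e-9

# J(θ) = 0.5*10 + 0.3*4 + 0.2*(-2) = 5.8
expected_return = sum(t["prob"] * t["return"] for t in trajectories)
print(expected_return)
```

In a real environment, we cannot enumerate trajectories like this, which is exactly why the sample-based estimate discussed below is needed.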
## Gradient Ascent and the Policy-gradient Theorem

Policy-gradient is an optimization problem: we want to find the values of \\(\theta\\) that maximize our objective function \\(J(\theta)\\), so we need to use **gradient ascent**. It's the inverse of *gradient descent*, since it gives the direction of the steepest increase of \\(J(\theta)\\).

(If you need a refresher on the difference between gradient descent and gradient ascent, [check this](https://www.baeldung.com/cs/gradient-descent-vs-ascent) and [this](https://stats.stackexchange.com/questions/258721/gradient-ascent-vs-gradient-descent-in-logistic-regression)).

Our update step for gradient ascent is:

\\( \theta \leftarrow \theta + \alpha * \nabla_\theta J(\theta) \\)

We can repeatedly apply this update step in the hope that \\(\theta \\) converges to the value that maximizes \\(J(\theta)\\).
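The update step above can be sketched on a toy objective where the gradient is known in closed form (a stand-in for the real, intractable \\(J(\theta)\\)):

```python
# Gradient ascent on a simple concave objective J(θ) = -(θ - 3)²,
# whose maximum is at θ = 3.
def grad_J(theta):
    return -2 * (theta - 3)

theta, alpha = 0.0, 0.1
for _ in range(100):
    theta = theta + alpha * grad_J(theta)  # θ ← θ + α ∇J(θ)  (ascent: we add)

print(theta)  # ≈ 3.0
```

With gradient descent we would subtract the gradient instead, and θ would move away from the maximum.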
However, there are two problems with computing the derivative of \\(J(\theta)\\):
1. We can't calculate the true gradient of the objective function, since it would require calculating the probability of each possible trajectory, which is computationally super expensive. So we want to **calculate a gradient estimation with a sample-based estimate (collect some trajectories)**.

2. There is another problem, which I detail in the next optional section. To differentiate this objective function, we would need to differentiate the state distribution, called the Markov Decision Process dynamics. This is attached to the environment: it gives us the probability of the environment going into the next state, given the current state and the action taken by the agent. The problem is that we can't differentiate it, because we might not know it.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/probability.png" alt="Probability"/>

Fortunately, we're going to use a solution called the Policy Gradient Theorem, which will help us reformulate the objective function into a differentiable function that does not involve the differentiation of the state distribution.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_gradient_theorem.png" alt="Policy Gradient"/>

If you want to understand how we derive this formula that we will use to approximate the gradient, check out the next (optional) section.
## The Reinforce algorithm (Monte Carlo Reinforce)

The Reinforce algorithm, also called Monte Carlo policy-gradient, is a policy-gradient algorithm that **uses an estimated return from an entire episode to update the policy parameter** \\(\theta\\).

In a loop:
- Use the policy \\(\pi_\theta\\) to collect an episode \\(\tau\\)
- Use the episode to estimate the gradient \\(\hat{g} = \nabla_\theta J(\theta)\\)

<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_gradient_one.png" alt="Policy Gradient"/>
</figure>

- Update the weights of the policy: \\(\theta \leftarrow \theta + \alpha \hat{g}\\)

We can interpret this as follows:
- \\(\nabla_\theta \log \pi_\theta(a_t|s_t)\\) is the direction of the **steepest increase of the (log) probability** of selecting action \\(a_t\\) from state \\(s_t\\). This tells us **how we should change the weights of the policy** if we want to increase/decrease the log probability of selecting action \\(a_t\\) at state \\(s_t\\).
- \\(R(\tau)\\) is the scoring function:
  - If the return is high, it will **push up the probabilities** of the (state, action) combinations.
  - Otherwise, if the return is low, it will **push down the probabilities** of the (state, action) combinations.

We can also **collect multiple episodes (trajectories)** to estimate the gradient:
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_gradient_multiple.png" alt="Policy Gradient"/>
</figure>
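The loop above can be sketched end to end in a few lines. This is not the PyTorch CartPole implementation from the hands-on, just a minimal dependency-free toy: a hypothetical one-step environment with two actions (action 0 always yields reward 1, action 1 yields 0) and a softmax policy:

```python
import math
import random

random.seed(42)

def probs(theta):
    """Softmax policy: turn preferences into a probability distribution."""
    exps = [math.exp(t) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

theta = [0.0, 0.0]  # policy parameters (softmax preferences)
alpha = 0.1         # learning rate

for episode in range(2000):
    # 1. Use π_θ to collect an "episode" (here: a single action and reward).
    p = probs(theta)
    action = random.choices([0, 1], weights=p)[0]
    ret = 1.0 if action == 0 else 0.0  # R(τ)

    # 2. Estimate the gradient: ĝ = ∇_θ log π_θ(a|s) · R(τ)
    #    (for a softmax policy, ∇_θ log π_θ(a) = one-hot(a) - π_θ).
    g_hat = [((1.0 if i == action else 0.0) - p[i]) * ret for i in range(2)]

    # 3. Gradient ascent: θ ← θ + α ĝ
    theta = [t + alpha * g for t, g in zip(theta, g_hat)]

print(probs(theta))  # probability of the rewarding action climbs toward 1
```

The same three steps — collect, estimate \\(\hat{g}\\), ascend — are what the PyTorch implementation in the hands-on performs, with a neural network in place of the preference table.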
82
units/en/unit4/quiz.mdx
Normal file
@@ -0,0 +1,82 @@
# Quiz

The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.

### Q1: What are the advantages of policy-gradient over value-based methods? (Check all that apply)

<Question
  choices={[
    {
      text: "Policy-gradient methods can learn a stochastic policy",
      explain: "",
      correct: true,
    },
    {
      text: "Policy-gradient methods are more effective in high-dimensional action spaces and continuous action spaces",
      explain: "",
      correct: true,
    },
    {
      text: "Policy-gradient converges most of the time on a global maximum.",
      explain: "No, frequently, policy-gradient converges on a local maximum instead of a global optimum.",
    },
  ]}
/>

### Q2: What is the Policy Gradient Theorem?

<details>
<summary>Solution</summary>

*The Policy Gradient Theorem* is a formula that will help us to reformulate the objective function into a differentiable function that does not involve the differentiation of the state distribution.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_gradient_theorem.png" alt="Policy Gradient"/>

</details>

### Q3: What's the difference between policy-based methods and policy-gradient methods? (Check all that apply)

<Question
  choices={[
    {
      text: "Policy-based methods are a subset of policy-gradient methods.",
      explain: "",
    },
    {
      text: "Policy-gradient methods are a subset of policy-based methods.",
      explain: "",
      correct: true,
    },
    {
      text: "In policy-based methods, we can optimize the parameter θ **indirectly** by maximizing the local approximation of the objective function with techniques like hill climbing, simulated annealing, or evolution strategies.",
      explain: "",
      correct: true,
    },
    {
      text: "In policy-gradient methods, we optimize the parameter θ **directly** by performing the gradient ascent on the performance of the objective function.",
      explain: "",
      correct: true,
    },
  ]}
/>

### Q4: Why do we use gradient ascent instead of gradient descent to optimize J(θ)?

<Question
  choices={[
    {
      text: "We want to minimize J(θ) and gradient ascent gives us the direction of the steepest increase of J(θ)",
      explain: "",
    },
    {
      text: "We want to maximize J(θ) and gradient ascent gives us the direction of the steepest increase of J(θ)",
      explain: "",
      correct: true
    },
  ]}
/>

Congrats on finishing this quiz 🥳! If you missed some elements, take time to read the chapter again to reinforce (😏) your knowledge.
42
units/en/unit4/what-are-policy-based-methods.mdx
Normal file
@@ -0,0 +1,42 @@
# What are the policy-based methods?

The main goal of Reinforcement Learning is to **find the optimal policy \\(\pi^{*}\\) that will maximize the expected cumulative reward**. This is because Reinforcement Learning is based on the *reward hypothesis*: **all goals can be described as the maximization of the expected cumulative reward.**

For instance, in a soccer game (where you're going to train the agents in two units), the goal is to win the game. We can describe this goal in reinforcement learning as **maximizing the number of goals scored** (when the ball crosses the goal line) in your opponent's goal, and **minimizing the number of goals scored in your own goal**.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/soccer.jpg" alt="Soccer" />

## Value-based, Policy-based, and Actor-critic methods

In the first unit, we studied two methods to find (or, most of the time, approximate) this optimal policy \\(\pi^{*}\\).

- In *value-based methods*, we learn a value function.
  - The idea is that an optimal value function leads to an optimal policy \\(\pi^{*}\\).
  - Our objective is to **minimize the loss between the predicted and target value** to approximate the true action-value function.
  - We have a policy, but it's implicit, since it **is generated directly from the value function**. For instance, in Q-Learning, we defined an epsilon-greedy policy.

- On the other hand, in *policy-based methods*, we directly learn to approximate \\(\pi^{*}\\) without having to learn a value function.
  - The idea is **to parameterize the policy**. For instance, using a neural network \\(\pi_\theta\\), this policy will output a probability distribution over actions (a stochastic policy).
  - <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/stochastic_policy.png" alt="stochastic policy" />
  - Our objective is then **to maximize the performance of the parameterized policy using gradient ascent**.
  - To do that, we control the parameter \\(\theta\\), which affects the distribution of actions over a state.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_based.png" alt="Policy based" />

- Finally, next time we'll study *actor-critic*, which is a combination of value-based and policy-based methods.

Consequently, thanks to policy-based methods, we can directly optimize our policy \\(\pi_\theta\\) to output a probability distribution over actions \\(\pi_\theta(a|s)\\) that leads to the best cumulative return. To do that, we define an objective function \\(J(\theta)\\), that is, the expected cumulative reward, and we **want to find the \\(\theta\\) that maximizes this objective function**.

## The difference between policy-based and policy-gradient methods

Policy-gradient methods, which we're going to study in this unit, are a subclass of policy-based methods. In policy-based methods, the optimization is most of the time *on-policy*, since for each update we only use data (trajectories) collected **by our most recent version of** \\(\pi_\theta\\).

The difference between these two methods **lies in how we optimize the parameter** \\(\theta\\):

- In *policy-based methods*, we search directly for the optimal policy. We can optimize the parameter \\(\theta\\) **indirectly** by maximizing a local approximation of the objective function with techniques like hill climbing, simulated annealing, or evolution strategies.
- In *policy-gradient methods*, because they are a subclass of policy-based methods, we also search directly for the optimal policy. But we optimize the parameter \\(\theta\\) **directly** by performing gradient ascent on the performance of the objective function \\(J(\theta)\\).
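To make the "indirect" flavor concrete, here is a minimal hill-climbing sketch: perturb \\(\theta\\) at random and keep the perturbation only if the score improves. The objective here is a made-up stand-in for the expected return \\(J(\theta)\\); notice that no gradient is computed anywhere:

```python
import random

random.seed(0)

# Hypothetical stand-in for J(θ), maximized at θ = [1.0, -2.0].
def score(theta):
    return -(theta[0] - 1.0) ** 2 - (theta[1] + 2.0) ** 2

theta = [0.0, 0.0]
best = score(theta)
for _ in range(5000):
    # Randomly perturb the parameters...
    candidate = [t + random.gauss(0, 0.1) for t in theta]
    # ...and keep the perturbation only if the score improves.
    if score(candidate) > best:
        theta, best = candidate, score(candidate)

print(theta)  # drifts toward the maximizer [1.0, -2.0]
```

Policy-gradient methods replace this blind trial-and-error with an explicit ascent direction, \\(\nabla_\theta J(\theta)\\).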
Before diving deeper into how policy-gradient methods work (the objective function, the policy gradient theorem, gradient ascent, etc.), let's study the advantages and disadvantages of policy-based methods.
@@ -6,5 +6,7 @@ You can now sit and enjoy playing with your Huggy 🐶. And don't **forget to sp

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/huggy-cover.jpeg" alt="Huggy cover" width="100%">

Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)

### Keep Learning, stay awesome 🤗

### Keep Learning, Stay Awesome 🤗
16
units/en/unitbonus2/hands-on.mdx
Normal file
@@ -0,0 +1,16 @@
# Hands-on [[hands-on]]

Now that you've learned to use Optuna, here are some ideas for applying what you've learned:

1️⃣ **Beat your LunarLander-v2 agent results** by using Optuna to find a better set of hyperparameters. You can also try with another environment, such as MountainCar-v0 or CartPole-v1.

2️⃣ **Beat your SpaceInvaders agent results**.

By doing this, you're going to see how valuable and powerful Optuna is for training better agents.

Have fun!

Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback, please 👉 [fill out this form](https://forms.gle/BzKXWzLAGZESGNaE9)

### Keep Learning, stay awesome 🤗
7
units/en/unitbonus2/introduction.mdx
Normal file
@@ -0,0 +1,7 @@
# Introduction [[introduction]]

One of the most critical tasks in Deep Reinforcement Learning is to **find a good set of training hyperparameters**.

<img src="https://raw.githubusercontent.com/optuna/optuna/master/docs/image/optuna-logo.png" alt="Optuna Logo"/>

[Optuna](https://optuna.org/) is a library that helps you automate this search. In this unit, we'll study a **little bit of the theory behind automatic hyperparameter tuning**. We'll first try to optimize the parameters of the DQN studied in the last unit manually. We'll then **learn how to automate the search using Optuna**.
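The core idea Optuna automates can be sketched as a naive random search, using a hypothetical `evaluate_agent` function as a stand-in for a real training run (Optuna adds smarter sampling and pruning on top of this loop):

```python
import random

random.seed(0)

# Hypothetical stand-in for a full training run returning a mean reward;
# here it simply peaks at learning_rate ≈ 3e-4 and gamma ≈ 0.99.
def evaluate_agent(learning_rate, gamma):
    return -(learning_rate - 3e-4) ** 2 * 1e6 - (gamma - 0.99) ** 2 * 1e3

best_trial, best_value = None, float("-inf")
for trial in range(100):
    # Sample a candidate set of hyperparameters...
    params = {
        "learning_rate": 10 ** random.uniform(-5, -2),  # log-uniform sampling
        "gamma": random.uniform(0.9, 0.9999),
    }
    # ...evaluate it, and keep the best trial seen so far.
    value = evaluate_agent(**params)
    if value > best_value:
        best_trial, best_value = params, value

print(best_trial)  # the best hyperparameters found over 100 trials
```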
15
units/en/unitbonus2/optuna.mdx
Normal file
@@ -0,0 +1,15 @@
# Optuna Tutorial [[optuna]]

The content below comes from [Antonin Raffin's ICRA 2022 presentations](https://araffin.github.io/tools-for-robotic-rl-icra2022/); he's one of the founders of Stable-Baselines and RL-Baselines3-Zoo.

## The theory behind Hyperparameter tuning

<Youtube id="AidFTOdGNFQ" />

## Optuna Tutorial

<Youtube id="ihP7E76KGOI" />

The notebook 👉 [here](https://colab.research.google.com/github/araffin/tools-for-robotic-rl-icra2022/blob/main/notebooks/optuna_lab.ipynb)