Merge branch 'huggingface:main' into main
README.md
@@ -23,56 +23,79 @@ This course is **self-paced**: you can start when you want 🥳.
| 📆 Publishing date | 📘 Unit | 👩‍💻 Hands-on |
|---------------|----------------------------------------------------------|----------------------------------------------------------------------------------------------------------|
| [Published 🥳](https://github.com/huggingface/deep-rl-class/tree/main/unit1#unit-1-introduction-to-deep-reinforcement-learning) | [An Introduction to Deep Reinforcement Learning](https://github.com/huggingface/deep-rl-class/tree/main/unit1) | [Train a Deep Reinforcement Learning lander agent to land correctly on the Moon 🌕 using Stable-Baselines3](https://github.com/huggingface/deep-rl-class/blob/main/unit1/unit1.ipynb) |
| May, the 11th | [Bonus](https://discord.com/channels/879548962464493619/968114737655214080/973937495546925056) | |
| May, the 18th | Q-Learning | Train an agent to cross a Frozen lake in this new version of the environment. |
| June, the 1st | Deep Q-Learning and improvements | Train a Deep Q-Learning agent to play Space Invaders |
| | Policy-based methods | 🏗️ |
| | Actor-Critic Methods | 🏗️ |
| | Proximal Policy Optimization (PPO) | 🏗️ |
| | Decision Transformers and offline Reinforcement Learning | 🏗️ |
| | Towards better explorations methods | 🏗️ |
| [Published 🥳](https://github.com/huggingface/deep-rl-class/tree/main/unit1/unit1-bonus) | [Bonus](https://github.com/huggingface/deep-rl-class/tree/main/unit1/unit1-bonus) | |
| [Published 🥳](https://github.com/huggingface/deep-rl-class/blob/main/unit2/README.md) | [Q-Learning](https://github.com/huggingface/deep-rl-class/blob/main/unit2/README.md) | [Train an agent to cross a Frozen lake ⛄ and train an autonomous taxi 🚖](https://github.com/huggingface/deep-rl-class/blob/main/unit2/unit2.ipynb) |
| [Published 🥳](https://github.com/huggingface/deep-rl-class/tree/main/unit3#unit-3-deep-q-learning-with-atari-games-) | [Deep Q-Learning](https://github.com/huggingface/deep-rl-class/tree/main/unit3#unit-3-deep-q-learning-with-atari-games-) | Train a Deep Q-Learning agent to play Space Invaders using [RL-Baselines3-Zoo](https://github.com/DLR-RM/rl-baselines3-zoo) |
| [Published 🥳](https://github.com/huggingface/deep-rl-class/blob/main/unit3/bonus.md) | [Bonus: Automatic Hyperparameter Tuning using Optuna](https://github.com/huggingface/deep-rl-class/blob/main/unit3/bonus.md) | |
| [Published 🥳](https://github.com/huggingface/deep-rl-class/tree/main/unit4#unit-4-an-introduction-to-unity-mlagents-with-hugging-face-) | [🎁 Learn to train your first Unity MLAgent](https://github.com/huggingface/deep-rl-class/tree/main/unit4#unit-4-an-introduction-to-unity-mlagents-with-hugging-face-) | [Train a curious agent to destroy Pyramids 💥](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit4/unit4.ipynb) |
| [Published 🥳](https://github.com/huggingface/deep-rl-class/tree/main/unit5#unit-5-policy-gradient-with-pytorch) | [Policy Gradient with PyTorch](https://huggingface.co/blog/deep-rl-pg) | [Code a Reinforce agent from scratch using PyTorch and train it to play Pong 🎾, CartPole and Pixelcopter 🚁](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit5/unit5.ipynb) |
| [Published 🥳](https://github.com/huggingface/deep-rl-class/tree/main/unit6#towards-better-explorations-methods-with-curiosity) | [Towards better explorations methods with Curiosity](https://github.com/huggingface/deep-rl-class/tree/main/unit6#towards-better-explorations-methods-with-curiosity) | |
| [Published 🥳](https://github.com/huggingface/deep-rl-class/tree/main/unit7#unit-7-advantage-actor-critic-a2c-using-robotics-simulations-with-pybullet-) | [Advantage Actor Critic (A2C)](https://github.com/huggingface/deep-rl-class/tree/main/unit7#unit-7-advantage-actor-critic-a2c-using-robotics-simulations-with-pybullet-) | [Train a bipedal walker and a spider to learn to walk using A2C](https://github.com/huggingface/deep-rl-class/tree/main/unit7#unit-7-advantage-actor-critic-a2c-using-robotics-simulations-with-pybullet-) |
| [Published 🥳](https://github.com/huggingface/deep-rl-class/tree/main/unit8#unit-8-proximal-policy-optimization-ppo-with-pytorch) | [Proximal Policy Optimization (PPO)](https://github.com/huggingface/deep-rl-class/tree/main/unit8#unit-8-proximal-policy-optimization-ppo-with-pytorch) | [Code a PPO agent from scratch using PyTorch and bulletproof it with Classical Control Environments](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit8/unit8.ipynb) |
| TBA | Decision Transformers and offline Reinforcement Learning | 🏗️ |

## The libraries you'll learn during this course

Version 1.0 (current):

- [Stable-Baselines3](https://github.com/DLR-RM/stable-baselines3)
- [RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo)
- [RLlib](https://docs.ray.io/en/latest/rllib/index.html)
- [CleanRL](https://github.com/vwxyzjn/cleanrl)

Version 2.0 (in addition to SB3, RL-Baselines3-Zoo and CleanRL):

- [RLlib](https://docs.ray.io/en/latest/rllib/index.html)
- [Sample Factory](https://github.com/alex-petrenko/sample-factory)
- [Hugging Face Decision Transformers](https://huggingface.co/blog/decision-transformers)
- More to come 🏗️

## The Environments you'll use

### Custom environments made by the Hugging Face Team using Unity and Godot

- Huggy the Doggo 🐶

(Based on [Unity's Puppo the Corgi work](https://blog.unity.com/technology/puppo-the-corgi-cuteness-overload-with-the-unity-ml-agents-toolkit))



| Environment | Screenshot |
|-----------------|--------------------------------------------------|
| Huggy the Doggo 🐶 (Based on [Unity's Puppo the Corgi work](https://blog.unity.com/technology/puppo-the-corgi-cuteness-overload-with-the-unity-ml-agents-toolkit)) |  |
| SnowballFight ☃️ 👉 Play it here: https://huggingface.co/spaces/ThomasSimonini/SnowballFight |  |

- SnowballFight ☃️



👉 Play it here: https://huggingface.co/spaces/ThomasSimonini/SnowballFight

### Gym classic and controls environments 🕹️

| Environment | Screenshot |
|-----------------|--------------------------------------------------|
| Lunar Lander 🚀🌙 |  |
| Frozen Lake ⛄ |  |
| Taxi 🚖 |  |
| Cartpole |  |
| Pong 🎾 |  |
| Pixelcopter 🚁 |  |

- More to come 🚧

### Gym Atari environments 👾

### Gym classic controls environments 🕹️

- Lunar-Lander v2 🚀🌙



| Environment | Screenshot |
|-----------------|--------------------------------------------------|
| Space Invaders 👾 |  |
| Breakout |  |
| Qbert |  |
| Seaquest |  |

### PyBullet 🤖

- More to come 🚧

### Gym Atari environments 👾

- Space Invaders 👾

| Environment | Screenshot |
|-----------------|--------------------------------------------------|
| Ant Bullet |  |
| Walker 2D Bullet |  |



### MLAgents environments 🖌️

- More to come 🚧

## Prerequisites

- Good skills in Python 🐍
- Basics in Deep Learning and PyTorch

@@ -138,7 +161,7 @@ Don’t forget to **introduce yourself when you sign up 🤗**

**I have some feedback**

We want to improve and update the course iteratively with your feedback. If you have any, please send an email to thomas.simonini@huggingface.co
We want to improve and update the course iteratively with your feedback. If you have any, please fill in this form 👉 https://forms.gle/3HgA7bEHwAmmLfwh9

**How much background knowledge is needed?**

@@ -158,3 +181,19 @@ If it's not the case yet, you can check these free resources:

**Is there a certificate?**

Yes 🎉. You'll **need to upload the eight models from the eight hands-on exercises.**

## Citing the project

To cite this repository in publications:

```bibtex
@misc{deep-rl-class,
  author = {Simonini, Thomas and Sanseviero, Omar},
  title = {The Hugging Face Deep Reinforcement Learning Class},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/deep-rl-class}},
}
```

BIN assets/img/antbullet.gif (new file, 900 KiB)
BIN assets/img/breakout.gif (new file, 571 KiB)
BIN assets/img/cartpole.jpg (new file, 181 KiB)
BIN assets/img/frozenlake.gif (new file, 68 KiB)
BIN assets/img/pixelcopter.jpg (new file, 369 KiB)
BIN assets/img/pong.jpg (new file, 215 KiB)
BIN assets/img/qbert.gif (new file, 690 KiB)
BIN assets/img/seaquest.gif (new file, 454 KiB)
BIN assets/img/taxi.gif (new file, 218 KiB)
BIN assets/img/walker2d.gif (new file, 340 KiB)
@@ -1,28 +1,32 @@
# Unit 1: Introduction to Deep Reinforcement Learning
# Unit 1: Introduction to Deep Reinforcement Learning 🚀

In this Unit, you'll learn the foundations of Deep RL. And **you’ll train your first lander agent 🚀 to land correctly on the Moon 🌕** using Stable-Baselines3 and share it with the community.



In this Unit, you'll learn the foundations of Deep Reinforcement Learning. And **you’ll train your first lander agent 🚀 to land correctly on the Moon 🌕** using Stable-Baselines3 and share it with the community.

<img src="assets/img/LunarLander.gif" alt="LunarLander"/>

You'll then be able to **compare your agent’s results with other classmates thanks to a leaderboard** 🔥.
You'll then be able to **[compare your agent’s results with other classmates thanks to the leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard)** 🔥.

This course is **self-paced**; you can start whenever you want.

## Required time ⏱️
The required time for this unit is approximately:
- 2 hours for the theory
- 1 hour for the hands-on.
- **2 hours** for the theory
- **1 hour** for the hands-on.

## Start this Unit 🚀
Here are the steps for this Unit:

1️⃣ Sign up to our Discord Server. This is the place where you **can exchange with the community and with us, create study groups to grow together, and more**.
1️⃣ 📝 **Sign up to the course** to receive updates when each Unit is published.

2️⃣ **Sign up to our Discord Server**. This is the place where you **can exchange with the community and with us, create study groups to grow together, and more**.

👉🏻 [https://discord.gg/aYka4Yhff9](https://discord.gg/aYka4Yhff9).

Are you new to Discord? Check our **Discord 101 to get the best practices** 👉 https://github.com/huggingface/deep-rl-class/blob/main/DISCORD.Md

2️⃣ **Introduce yourself on Discord in the #introduce-yourself channel 🤗 and check the Reinforcement Learning section on the left.**
3️⃣ 👋 **Introduce yourself on Discord in the #introduce-yourself channel 🤗 and check the Reinforcement Learning section on the left.**

- In #rl-announcements we share the latest information about the course.
- #discussions is a place to exchange with others.
@@ -31,17 +35,21 @@ Are you new to Discord? Check our **discord 101 to get the best practices** 👉

<img src="assets/img/discord_channels.jpg" alt="Discord Channels"/>

3️⃣ 📖 **Read [An Introduction to Deep Reinforcement Learning](https://huggingface.co/blog/deep-rl-intro)**, where you’ll learn the foundations of Deep RL. You can also watch the video version attached to the article. 👉 https://huggingface.co/blog/deep-rl-intro
4️⃣ 📖 **Read [An Introduction to Deep Reinforcement Learning](https://huggingface.co/blog/deep-rl-intro)**, where you’ll learn the foundations of Deep RL. You can also watch the video version attached to the article. 👉 https://huggingface.co/blog/deep-rl-intro

4️⃣ 👩‍💻 Then dive into the hands-on, where **you’ll train your first lander agent 🚀 to land correctly on the Moon 🌕 using Stable-Baselines3 and share it with the community.** Thanks to a leaderboard, **you'll be able to compare your results with other classmates** and exchange best practices to improve your agent's scores. Who will win the challenge for Unit 1 🏆?
5️⃣ 📝 Take a piece of paper and **check your knowledge with this series of questions** ❔ 👉 https://github.com/huggingface/deep-rl-class/blob/main/unit1/quiz.md

The hands-on 👉 [](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit1/unit1.ipynb)
6️⃣ 👩‍💻 Then dive into the hands-on, where **you’ll train your first lander agent 🚀 to land correctly on the Moon 🌕 using Stable-Baselines3 and share it with the community.** Thanks to a leaderboard, **you'll be able to compare your results with other classmates** and exchange best practices to improve your agent's scores. Who will win the challenge for Unit 1 🏆?

The leaderboard 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard
👩‍💻 The hands-on 👉 [](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit1/unit1.ipynb)

You can work directly **with the colab notebook, so you don't have to install anything on your machine (and it’s free)**.
🏆 The leaderboard 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard

5️⃣ The best way to learn **is to try things on your own**. That’s why we have a challenges section in the colab where we give you some ideas on how you can go further: using another environment, using another model, etc.

The best way to learn **is to try things on your own**. That’s why we have a challenges section in the colab where we give you some ideas on how you can go further: using another environment, using another model, etc.

7️⃣ (Optional) To **find the best training parameters, you can try this hands-on** made by [Sambit Mukherjee](https://github.com/sambitmukherjee) 👉 https://github.com/huggingface/deep-rl-class/blob/main/unit1/unit1_optuna_guide.ipynb
## Additional readings 📚
- [Reinforcement Learning: An Introduction, Richard Sutton and Andrew G. Barto Chapter 1, 2 and 3](http://incompleteideas.net/book/RLbook2020.pdf)

@@ -55,13 +63,13 @@ You can work directly **with the colab notebook, which allows you not to have to

To make the most of the course, my advice is to:

- **Participate in Discord** and join a study group.
- **Read the theory part multiple times** and take some notes.
- Don’t just do the colab. When you learn something, try to change the environment, change the parameters and read the libraries' documentation. Have fun 🥳.
- Struggling is **a good thing in learning**. It means that you are starting to build new skills. Deep RL is a complex topic and it takes time to understand. Try different approaches, use our additional readings, and exchange with classmates on Discord.

## This is a course built with you 👷🏿‍♀️

We want to improve and update the course iteratively with your feedback. If you have any, please open an issue on the GitHub repo: [https://github.com/huggingface/deep-rl-class/issues](https://github.com/huggingface/deep-rl-class/issues)
We want to improve and update the course iteratively with your feedback. **If you have any, please fill in this form** 👉 https://forms.gle/3HgA7bEHwAmmLfwh9

## Don’t forget to join the Community 📢

@@ -73,4 +81,4 @@ Don’t forget to **introduce yourself when you sign up 🤗**

❓ If you have other questions, [please check our FAQ](https://github.com/huggingface/deep-rl-class#faq)

Keep learning, stay awesome,
## Keep learning, stay awesome 🤗,

BIN unit1/assets/img/expexpltradeoff.jpg (new file, 295 KiB)
BIN unit1/assets/img/obs_space_recap.jpg (new file, 146 KiB)
BIN unit1/assets/img/policy.jpg (new file, 71 KiB)
BIN unit1/assets/img/policy_2.jpg (new file, 64 KiB)
BIN unit1/assets/img/rl-loop-ex.jpg (new file, 93 KiB)
BIN unit1/assets/img/rl-loop-solution.jpg (new file, 88 KiB)
BIN unit1/assets/img/tasks.jpg (new file, 324 KiB)
BIN unit1/assets/img/thumbnail.png (new file, 220 KiB)
BIN unit1/assets/img/value.jpg (new file, 75 KiB)
BIN unit1/assets/img/value_2.jpg (new file, 56 KiB)
unit1/quiz.md (new file, 137 lines)
@@ -0,0 +1,137 @@
# Knowledge Check ✔️

The best way to learn and [to avoid the illusion of competence](https://fr.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you find **where you need to reinforce your knowledge**.

📝 Take a piece of paper and try to answer by writing, **then check the solutions**.

### Q1: What is Reinforcement Learning?

<details>
<summary>Solution</summary>

Reinforcement learning is a **framework for solving control tasks (also called decision problems)** by building agents that learn from the environment by interacting with it through trial and error and **receiving rewards (positive or negative) as unique feedback**.

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-intro#a-formal-definition

</details>

### Q2: Define the RL Loop

<img src="assets/img/rl-loop-ex.jpg" alt="Exercise RL Loop"/>

At every step:
- Our Agent receives ______ from the environment
- Based on that ______ the Agent takes an ______
- Our Agent will move to the right
- The Environment goes to a ______
- The Environment gives ______ to the Agent

<details>
<summary>Solution</summary>

<img src="assets/img/rl-loop-solution.jpg" alt="Exercise RL Solution"/>

At every step:
- Our Agent receives **state s0** from the environment
- Based on that **state s0**, the Agent takes an **action a0**
- Our Agent will move to the right
- The Environment goes to a **new state s1**
- The Environment gives **a reward r1** to the Agent

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-intro#the-rl-process

</details>
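The loop in the solution above maps directly onto code. Here is a minimal sketch with a random agent and a made-up one-dimensional corridor environment (the environment, actions, and rewards are all invented purely for illustration; no Gym required):

```python
import random

class CorridorEnv:
    """Toy environment: start at position 0, reach position 5 to finish.

    Every non-terminal step gives reward -1, so shorter episodes score higher.
    """
    def reset(self):
        self.pos = 0
        return self.pos  # the agent receives the initial state s0

    def step(self, action):  # action: -1 (left) or +1 (right)
        self.pos = max(0, self.pos + action)   # can't walk left of the start
        done = self.pos >= 5                   # terminal state reached?
        reward = 0 if done else -1
        return self.pos, reward, done          # new state, reward, done flag

env = CorridorEnv()
state = env.reset()                      # agent receives state s0
done = False
total_reward = 0
while not done:
    action = random.choice([-1, 1])      # agent picks an action based on the state
    state, reward, done = env.step(action)  # environment returns new state + reward
    total_reward += reward
```

A random agent eventually stumbles to the goal; a learning agent would use the rewards to pick actions that finish faster.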
### Q3: What's the difference between a state and an observation?

<details>
<summary>Solution</summary>

- *The state* is a **complete description of the state of the world** (there is no hidden information), in a fully observed environment. For instance, in a chess game, we receive a state from the environment, since we have access to the whole board information.

- *The observation* is a **partial description of the state**, in a partially observed environment. For instance, in Super Mario Bros, we only see the part of the level close to the player, so we receive an observation.

<img src="assets/img/obs_space_recap.jpg" alt="Observation Space Recap"/>

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-intro#observationsstates-space

</details>

### Q4: A task is an instance of a Reinforcement Learning problem. What are the two types of tasks?

<details>
<summary>Solution</summary>

- *Episodic task*: we have a **starting point and an ending point (a terminal state)**. This creates an episode: a list of States, Actions, Rewards, and new States. For instance, think about Super Mario Bros: an episode begins at the launch of a new Mario level and ends when you're killed or you reach the end of the level.

- *Continuous task*: these are tasks that **continue forever (there is no terminal state)**. In this case, the agent must learn how to choose the best actions while simultaneously interacting with the environment.

<img src="assets/img/tasks.jpg" alt="Task"/>

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-intro#type-of-tasks

</details>

### Q5: What is the exploration/exploitation tradeoff?

<details>
<summary>Solution</summary>

In Reinforcement Learning, we need to **balance how much we explore the environment and how much we exploit what we know about the environment**.

- *Exploration* is exploring the environment by **trying random actions in order to find more information about the environment**.

- *Exploitation* is **exploiting known information to maximize the reward**.

<img src="assets/img/expexpltradeoff.jpg" alt="Exploration/exploitation tradeoff"/>

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-intro#exploration-exploitation-tradeoff
</details>
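One common way to handle this tradeoff (not the only one) is an epsilon-greedy strategy: with probability epsilon we explore with a random action, otherwise we exploit the best-known action. A minimal sketch on a made-up two-armed bandit (the arm names and hidden reward probabilities are invented for illustration):

```python
import random

random.seed(0)

q_values = {"left": 0.0, "right": 0.0}    # estimated value of each action
counts = {"left": 0, "right": 0}
true_means = {"left": 0.2, "right": 0.8}  # hidden reward probabilities (made up)

epsilon = 0.1
for _ in range(5_000):
    if random.random() < epsilon:
        action = random.choice(["left", "right"])  # explore: random action
    else:
        action = max(q_values, key=q_values.get)   # exploit: best-known action
    reward = 1 if random.random() < true_means[action] else 0
    counts[action] += 1
    # incremental mean update of the action-value estimate
    q_values[action] += (reward - q_values[action]) / counts[action]
```

With only exploitation, the agent could get stuck on whichever arm looked good first; the occasional random action lets it discover that the other arm pays better.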
### Q6: What is a policy?

<details>
<summary>Solution</summary>

- The Policy π **is the brain of our Agent**: it's the function that tells us what action to take given the state we are in. So it defines the agent's behavior at a given time.

<img src="assets/img/policy.jpg" alt="Policy"/>

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-intro#the-policy-%CF%80-the-agents-brain
</details>

### Q7: What are value-based methods?

<details>
<summary>Solution</summary>

- Value-based methods are one of the main approaches for solving RL problems.
- In value-based methods, instead of training a policy function, **we train a value function that maps a state to the expected value of being in that state**.

<img src="assets/img/value.jpg" alt="Value illustration"/>

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-intro#value-based-methods
</details>
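As a toy illustration of "mapping a state to an expected value", here is value estimation on a tiny, made-up deterministic chain (the states, rewards, and discount factor are invented for illustration):

```python
# Tiny deterministic chain: state i always moves right to i + 1; state 3 is
# terminal. Each sweep applies V(s) = r(s) + gamma * V(s + 1); for gamma < 1
# repeated sweeps converge to the true state values.
rewards = {0: -1, 1: -1, 2: 10}         # reward for leaving each non-terminal state
gamma = 0.9
V = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0}    # terminal state 3 keeps value 0

for _ in range(50):                      # more sweeps than needed, for safety
    for s in (0, 1, 2):
        V[s] = rewards[s] + gamma * V[s + 1]
```

After convergence, V(2) = 10, V(1) = -1 + 0.9 * 10 = 8, and V(0) = -1 + 0.9 * 8 = 6.2: earlier states are worth less because the big reward is further away and discounted.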
### Q8: What are policy-based methods?

<details>
<summary>Solution</summary>

- In *policy-based methods*, we learn a **policy function directly**.
- This policy function **maps each state to the best corresponding action at that state**, or to a **probability distribution over the set of possible actions at that state**.

<img src="assets/img/policy.jpg" alt="Policy illustration"/>

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-intro#value-based-methods

</details>

---

Congrats on **finishing this Quiz** 🥳! If you missed some elements, take time to [read the chapter again](https://huggingface.co/blog/deep-rl-intro) to reinforce (😏) your knowledge.

**Keep Learning, Stay Awesome**
unit1/unit1-bonus/readme.md (new file, 21 lines)
@@ -0,0 +1,21 @@
# Unit 1: Bonus 🎁
- Our teammate @Chris Emezue published a new leaderboard where you can compare your trained agents in new environments 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard

## Try new environments 🎮
Now that you've played with LunarLander-v2, why not try these environments? 🔥
- 🗻 MountainCar-v0 https://www.gymlibrary.ml/environments/classic_control/mountain_car/
- 🏎️ CarRacing-v1 https://www.gymlibrary.ml/environments/box2d/car_racing/
- 🥶 FrozenLake-v1 https://www.gymlibrary.ml/environments/toy_text/frozen_lake/
## A piece of advice 🧐
The first Unit is a very interesting one, but also **a very complex one, because it's where you learn the fundamentals.**

It's normal if you **still feel confused by all these elements**. It was the same for me and for everyone who has studied RL.

Take time to really grasp the material before continuing. It's important to master these elements and have solid foundations before entering the fun part.

We published additional readings in the syllabus if you want to go deeper 👉 https://github.com/huggingface/deep-rl-class/blob/main/unit1/README.md

The hands-on exercises for the first Unit are fun experiments, but as we go deeper, **you'll understand better how to choose the hyperparameters and which model to use. For now, have fun and try things out: you can't break the simulations 🚀**

### Keep learning, stay awesome.
unit1/unit1_optuna_guide.ipynb (new file, 546 lines)
@@ -0,0 +1,546 @@
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Unit 1 Special Content: Optuna Guide.ipynb",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Unit 1 Special Content: Optuna Guide"
      ],
      "metadata": {
        "id": "RpLWg64qubE9"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "In this notebook, we shall see how to use Optuna to perform hyperparameter tuning of Unit 1's <a href=\"https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html\" target=\"_blank\">`PPO`</a> model (created using Stable-Baselines3 for the `\"LunarLander-v2\"` Gym environment).\n",
        "\n",
        "Optuna is an open-source, automatic hyperparameter optimization framework. You can read more about it <a href=\"https://tech.preferred.jp/en/blog/optuna-release/\" target=\"_blank\">here</a>.\n",
        "\n",
        "**Prerequisite:** Before going through this notebook, you should have completed the <a href=\"https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit1/unit1.ipynb\" target=\"_blank\">Unit 1 hands-on</a>."
      ],
      "metadata": {
        "id": "cWpMuD7jrfze"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Virtual Display Setup"
      ],
      "metadata": {
        "id": "InZX9_sRtfbO"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "We'll need to generate a replay video. To do so in Colab, we need to have a virtual display to be able to render the environment (and thus record the frames).\n",
        "\n",
        "The following cell will install virtual display libraries."
      ],
      "metadata": {
        "id": "QvvCztQ5tZxq"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "oUfu-vzL4g97"
      },
      "outputs": [],
      "source": [
        "!apt install python-opengl\n",
        "!apt install ffmpeg\n",
        "!apt install xvfb\n",
        "!pip install pyvirtualdisplay"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now, let's create & start a virtual display."
      ],
      "metadata": {
        "id": "UVt5deF2qorb"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from pyvirtualdisplay import Display\n",
        "\n",
        "virtual_display = Display(visible=0, size=(1400, 900))\n",
        "virtual_display.start()"
      ],
      "metadata": {
        "id": "6E1inOK_5Jko"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Dependencies, Imports & Gym Environments"
      ],
      "metadata": {
        "id": "B2sWJAj8ti1y"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Let's install all the other dependencies we'll need."
      ],
      "metadata": {
        "id": "_HVFbCW5tnyw"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install gym[box2d]\n",
        "!pip install stable-baselines3[extra]\n",
        "!pip install pyglet\n",
        "!pip install ale-py==0.7.4 # To overcome an issue with Gym (https://github.com/DLR-RM/stable-baselines3/issues/875)\n",
        "!pip install optuna\n",
        "!pip install huggingface_sb3"
      ],
      "metadata": {
        "id": "liGH8sAg5dk9"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Next, let's perform all the necessary imports."
      ],
      "metadata": {
        "id": "WWcb8nUVt-A9"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import gym\n",
        "\n",
        "from stable_baselines3.common.env_util import make_vec_env\n",
        "from stable_baselines3.common.monitor import Monitor\n",
        "from stable_baselines3 import PPO\n",
        "from stable_baselines3.common.evaluation import evaluate_policy\n",
        "from stable_baselines3.common.vec_env import DummyVecEnv\n",
        "\n",
        "import optuna\n",
        "from optuna.samplers import TPESampler\n",
        "\n",
        "from huggingface_hub import notebook_login\n",
        "from huggingface_sb3 import package_to_hub"
      ],
      "metadata": {
        "id": "DivjESMF5i7M"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Finally, let's create our Gym environments. The training environment is a vectorized environment:"
      ],
      "metadata": {
        "id": "uaXEISAluKU2"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "env = make_vec_env(\"LunarLander-v2\", n_envs=16)\n",
        "env"
      ],
      "metadata": {
        "id": "WTHPMs_s7YFc"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "And the evaluation environment is a separate environment:"
      ],
      "metadata": {
        "id": "NTtoXSFiOR6f"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "eval_env = Monitor(gym.make(\"LunarLander-v2\"))"
      ],
      "metadata": {
        "id": "ECDFf_6OOXqO"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "We are now ready to dive into hyperparameter tuning!"
      ],
      "metadata": {
        "id": "sjyVRth8uSVm"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Hyperparameter Tuning"
      ],
      "metadata": {
        "id": "kz0QGAOnust1"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "First, let's define a `run_training()` function that trains a single model (using a particular combination of hyperparameter values) and returns a score.\n",
        "\n",
        "The score tells us how good the particular combination of hyperparameters is. (In our case, the score is `mean_reward - std_reward`, which is what the <a href=\"https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard\" target=\"_blank\">leaderboard</a> uses.)\n",
        "\n",
        "The function takes a very special argument, `params`, which is a dictionary. **The keys of this dictionary are the names of the hyperparameters we're tuning**, and **the values are sampled at each trial by Optuna's sampler** (from ranges that we'll specify soon).\n",
        "\n",
        "For example, in a particular trial, `params` might look like this:\n",
        "\n",
        "```\n",
        "{'n_epochs': 5, 'gamma': 0.9926, 'total_timesteps': 559_621}\n",
        "```\n",
        "\n",
        "And in another trial, `params` might look like this:\n",
        "\n",
        "```\n",
        "{'n_epochs': 3, 'gamma': 0.9974, 'total_timesteps': 1_728_482}\n",
        "```"
      ],
      "metadata": {
        "id": "02BT2bxVXEHj"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def run_training(params, verbose=0, save_model=False):\n",
        "    model = PPO(\n",
        "        policy='MlpPolicy',\n",
        "        env=env,\n",
        "        n_steps=1024,\n",
        "        batch_size=64,\n",
        "        n_epochs=params['n_epochs'],  # We're tuning this.\n",
|
||||
" gamma=params['gamma'], # We're tuning this.\n",
|
||||
" gae_lambda=0.98, \n",
|
||||
" ent_coef=0.01, \n",
|
||||
" verbose=verbose\n",
|
||||
" )\n",
|
||||
" model.learn(total_timesteps=params['total_timesteps']) # We're tuning this.\n",
|
||||
"\n",
|
||||
" mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=50, deterministic=True)\n",
|
||||
" score = mean_reward - std_reward\n",
|
||||
"\n",
|
||||
" if save_model:\n",
|
||||
" model.save(\"PPO-LunarLander-v2\")\n",
|
||||
"\n",
|
||||
" return model, score"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "kpDGvnBS6t57"
|
||||
},
|
||||
"execution_count": null,
|
||||
"outputs": []
|
||||
},
{
"cell_type": "markdown",
"source": [
"Next, we define another function - `objective()`. This function has a single parameter `trial`, which is an object of type `optuna.trial.Trial`. Using this `trial` object, we specify the ranges for the different hyperparameters we want to explore:\n",
"\n",
"- For `n_epochs`: We want to explore integer values between `3` and `5`.\n",
"- For `gamma`: We want to explore floating point values between `0.9900` and `0.9999` (drawn from a uniform distribution).\n",
"- For `total_timesteps`: We want to explore integer values between `500_000` and `2_000_000`.\n",
"\n",
"**Note:** If you have more time available, then you can tune other hyperparameters too. Moreover, you can explore wider ranges for each hyperparameter.\n",
"\n",
"The `trial.suggest_int()` and `trial.suggest_uniform()` methods are used by Optuna to suggest hyperparameter values within the specified ranges. The suggested combination of values is then used to train a model and return the score."
],
"metadata": {
"id": "dORGHcVYdSKp"
}
},
{
"cell_type": "code",
"source": [
"def objective(trial):\n",
"    params = {\n",
"        \"n_epochs\": trial.suggest_int(\"n_epochs\", 3, 5), \n",
"        \"gamma\": trial.suggest_uniform(\"gamma\", 0.9900, 0.9999), \n",
"        \"total_timesteps\": trial.suggest_int(\"total_timesteps\", 500_000, 2_000_000)\n",
"    }\n",
"    model, score = run_training(params)\n",
"    return score"
],
"metadata": {
"id": "Wapg9hTI-AGz"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Finally, we use Optuna's `create_study()` function to create a study, passing in:\n",
"\n",
"- `sampler=TPESampler()`: This specifies that we want to employ a Bayesian optimization algorithm called Tree-structured Parzen Estimator. Other options are `GridSampler()`, `RandomSampler()`, etc. (The full list can be found <a href=\"https://optuna.readthedocs.io/en/stable/reference/samplers.html\" target=\"_blank\">here</a>.)\n",
"- `study_name=\"PPO-LunarLander-v2\"`: This is a name we give to the study (optional).\n",
"- `direction=\"maximize\"`: This is to specify that our objective is to maximize (not minimize) the score.\n",
"\n",
"Once our study is created, we call the `optimize()` method on it, specifying that we want to conduct `10` trials.\n",
"\n",
"**Note:** If you have more time available, then you can conduct more than `10` trials.\n",
"\n",
"**Warning:** The code cell below will take quite a bit of time to run!"
],
"metadata": {
"id": "a2gToNx7gPEG"
}
},
{
"cell_type": "code",
"source": [
"study = optuna.create_study(sampler=TPESampler(), study_name=\"PPO-LunarLander-v2\", direction=\"maximize\")\n",
"study.optimize(objective, n_trials=10)"
],
"metadata": {
"id": "0v9G57g4-gl_"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Now that all the `10` trials have concluded, let's print out the score and hyperparameters of the best trial."
],
"metadata": {
"id": "9qRTUBAExSLt"
}
},
{
"cell_type": "code",
"source": [
"print(\"Best trial score:\", study.best_trial.values)\n",
"print(\"Best trial hyperparameters:\", study.best_trial.params)"
],
"metadata": {
"id": "N-JFso9Lvurh"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Recreating & Saving The Best Model"
],
"metadata": {
"id": "qiSBZmZDux4q"
}
},
{
"cell_type": "markdown",
"source": [
"Let's recreate the best model and save it."
],
"metadata": {
"id": "SuZ4OSayyq36"
}
},
{
"cell_type": "code",
"source": [
"model, score = run_training(study.best_trial.params, verbose=1, save_model=True)"
],
"metadata": {
"id": "8JVUHgljIW1E"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"## Pushing to Hugging Face Hub"
],
"metadata": {
"id": "0eja4YaFu5Vb"
}
},
{
"cell_type": "markdown",
"source": [
"To be able to share your model with the community, there are three more steps to follow:\n",
"\n",
"1. (If not done already) create a Hugging Face account -> https://huggingface.co/join\n",
"\n",
"2. Sign in and then get your authentication token from the Hugging Face website.\n",
"\n",
"- Create a new token (https://huggingface.co/settings/tokens) **with write role**.\n",
"- Copy the token.\n",
"- Run the cell below and paste the token."
],
"metadata": {
"id": "qDjEim2izuFi"
}
},
{
"cell_type": "code",
"source": [
"notebook_login()\n",
"!git config --global credential.helper store"
],
"metadata": {
"id": "msFVH2qhzwpE"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"If you aren't using Google Colab or Jupyter Notebook, you need to use this command instead: `huggingface-cli login`\n",
"\n",
"3. We're now ready to push our trained agent to the Hub using the `package_to_hub()` function.\n",
"\n",
"Let's fill in the arguments of the `package_to_hub` function:\n",
"\n",
"- `model`: our trained model\n",
"\n",
"- `model_name`: the name of the trained model that we defined in `model.save()`\n",
"\n",
"- `model_architecture`: the model architecture we used (in our case `\"PPO\"`)\n",
"\n",
"- `env_id`: the name of the environment (in our case `\"LunarLander-v2\"`)\n",
"\n",
"- `eval_env`: the evaluation environment\n",
"\n",
"- `repo_id`: the name of the Hugging Face Hub repository that will be created/updated `(repo_id=\"{username}/{repo_name}\")` (**Note:** A good `repo_id` is `\"{username}/{model_architecture}-{env_id}\"`.)\n",
"\n",
"- `commit_message`: the commit message"
],
"metadata": {
"id": "lD4ACH160L_5"
}
},
{
"cell_type": "code",
"source": [
"model_name = \"PPO-LunarLander-v2\"\n",
"model_architecture = \"PPO\"\n",
"env_id = \"LunarLander-v2\"\n",
"eval_env = DummyVecEnv([lambda: gym.make(env_id)])\n",
"repo_id = \"Sadhaklal/PPO-LunarLander-v2\"\n",
"commit_message = \"Upload best PPO LunarLander-v2 agent (tuned with Optuna).\""
],
"metadata": {
"id": "8JkhrjAt0O6a"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"The following function call will evaluate the agent, record a replay, generate a model card, and push your agent to the Hub."
],
"metadata": {
"id": "3HIT-M2C1l4E"
}
},
{
"cell_type": "code",
"source": [
"package_to_hub(\n",
"    model=model, \n",
"    model_name=model_name, \n",
"    model_architecture=model_architecture, \n",
"    env_id=env_id, \n",
"    eval_env=eval_env, \n",
"    repo_id=repo_id, \n",
"    commit_message=commit_message\n",
")"
],
"metadata": {
"id": "1D0wQ_PU1mTN"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"That's it! You now know how to perform hyperparameter tuning of Stable-Baselines3 models using Optuna.\n",
"\n",
"To get even better results, try tuning the other hyperparameters of your model."
],
"metadata": {
"id": "7jLc9JDY11EW"
}
},
{
"cell_type": "markdown",
"source": [
"## Final Tips"
],
"metadata": {
"id": "Fa4TtHLAtq7-"
}
},
{
"cell_type": "markdown",
"source": [
"1. Read the <a href=\"https://optuna.readthedocs.io/en/stable/index.html\" target=\"_blank\">Optuna documentation</a> to get more familiar with the library and its features.\n",
"2. You may have noticed that hyperparameter tuning is a time-consuming process. However, it can be sped up significantly using parallelization. Check out <a href=\"https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/004_distributed.html\" target=\"_blank\">this guide</a> on how to do so."
],
"metadata": {
"id": "tEgxhSjf4loa"
}
},
{
"cell_type": "code",
"source": [
""
],
"metadata": {
"id": "C7CJ7x7n4HuA"
},
"execution_count": null,
"outputs": []
}
]
}
97
unit2/README.md
Normal file
@@ -0,0 +1,97 @@
# Unit 2: Introduction to Q-Learning

In this Unit, we're going to dive deeper into one family of Reinforcement Learning methods, value-based methods, and **study our first RL algorithm: Q-Learning**.

We'll also implement our **first RL agent from scratch**, a Q-Learning agent, and train it in two environments:

- [Frozen-Lake-v1 ⛄ (non-slippery version)](https://www.gymlibrary.ml/environments/toy_text/frozen_lake/): where our agent will need to go from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoiding holes (H).
- [An autonomous taxi 🚕](https://www.gymlibrary.ml/environments/toy_text/taxi/?highlight=taxi): where our agent will need to learn to navigate a city to transport its passengers from point A to point B.

<img src="assets/img/envs.gif" alt="unit 2 environments"/>

You'll then be able to **compare your agent’s results with your classmates' thanks to a leaderboard** 🔥.

This Unit is divided into 2 parts:
- [Part 1](https://huggingface.co/blog/deep-rl-q-part1)
- [Part 2](https://huggingface.co/blog/deep-rl-q-part2)

<img src="assets/img/two_parts.jpg" alt="Two parts"/>

This course is **self-paced**: you can start whenever you want.

## Required time ⏱️
The required time for this unit is approximately:
- **2-3 hours** for the theory
- **1 hour** for the hands-on.

## Start this Unit 🚀
Here are the steps for this Unit:

1️⃣ 📝 If it's not already done, sign up for our Discord server. This is the place where you **can exchange with the community and with us, create study groups to grow together, and more**

👉🏻 [https://discord.gg/aYka4Yhff9](https://discord.gg/aYka4Yhff9).

Are you new to Discord? Check our **Discord 101 to get the best practices** 👉 https://github.com/huggingface/deep-rl-class/blob/main/DISCORD.Md

2️⃣ 👋 **Introduce yourself in the #introduce-yourself Discord channel 🤗 and check the Reinforcement Learning section on the left.**

- In #rl-announcements, we share the latest information about the course.
- #discussions is a place to exchange ideas.
- #unity-ml-agents is for exchanging about everything related to this library.
- #study-groups, to create study groups with your classmates.

<img src="assets/img/discord_channels.jpg" alt="Discord Channels"/>

3️⃣ 📖 **Read [An Introduction to Q-Learning Part 1](https://huggingface.co/blog/deep-rl-q-part1)**.

4️⃣ 📝 Take a piece of paper and **check your knowledge with this series of questions** ❔ 👉 https://github.com/huggingface/deep-rl-class/blob/main/unit2/quiz1.md

5️⃣ 📖 **Read [An Introduction to Q-Learning Part 2](https://huggingface.co/blog/deep-rl-q-part2)**.

6️⃣ 📝 Take a piece of paper and **check your knowledge with this series of questions** ❔ 👉 https://github.com/huggingface/deep-rl-class/blob/main/unit2/quiz2.md

7️⃣ 👩‍💻 Then dive into the hands-on, where **you’ll implement your first RL agent from scratch**, a Q-Learning agent, and train it in two environments:
1. Frozen Lake v1 ❄️: where our agent will need to **go from the starting state (S) to the goal state (G)** by walking only on frozen tiles (F) and avoiding holes (H).
2. An autonomous taxi 🚕: where the agent will need **to learn to navigate** a city to **transport its passengers from point A to point B.**

Thanks to a leaderboard, **you'll be able to compare your results with your classmates'** and exchange best practices to improve your agent's scores. Who will win the challenge for Unit 2 🏆?

👩‍💻 The hands-on 👉 https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit2/unit2.ipynb

🏆 The leaderboard 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard

You can work directly **with the colab notebook, which means you don't have to install everything on your machine (and it’s free)**.

8️⃣ The best way to learn **is to try things on your own**. That’s why we have a challenges section in the colab where we give you some ideas on how you can go further: using another environment, using another model, etc.

## Additional readings 📚
- [Reinforcement Learning: An Introduction, Richard Sutton and Andrew G. Barto, Chapters 5, 6 and 7](http://incompleteideas.net/book/RLbook2020.pdf)
- [Foundations of Deep RL Series, L2 Deep Q-Learning by Pieter Abbeel](https://youtu.be/Psrhxy88zww)
- To dive deeper into Monte Carlo and Temporal Difference Learning:
  - [Why do temporal difference (TD) methods have lower variance than Monte Carlo methods?](https://stats.stackexchange.com/questions/355820/why-do-temporal-difference-td-methods-have-lower-variance-than-monte-carlo-met)
  - [When are Monte Carlo methods preferred over temporal difference ones?](https://stats.stackexchange.com/questions/336974/when-are-monte-carlo-methods-preferred-over-temporal-difference-ones)

## How to make the most of this course

To make the most of the course, my advice is to:

- **Participate in Discord** and join a study group.
- **Read the theory part multiple times** and take some notes.
- Don’t just do the colab. When you learn something, try to change the environment, change the parameters, and read the libraries' documentation. Have fun 🥳
- Struggling is **a good thing in learning**. It means that you're starting to build new skills. Deep RL is a complex topic and it takes time to understand. Try different approaches, use our additional readings, and exchange with classmates on Discord.

## This is a course built with you 👷🏿‍♀️

We want to improve and update the course iteratively with your feedback. **If you have some, please fill in this form** 👉 https://forms.gle/3HgA7bEHwAmmLfwh9

## Don’t forget to join the Community 📢

We have a Discord server where you **can exchange with the community and with us, create study groups to grow together, and more**

👉🏻 [https://discord.gg/aYka4Yhff9](https://discord.gg/aYka4Yhff9).

Don’t forget to **introduce yourself when you sign up 🤗**

❓If you have other questions, [please check our FAQ](https://github.com/huggingface/deep-rl-class#faq)

## Keep learning, stay awesome 🤗,
BIN
unit2/assets/img/MC-3.jpg
Normal file
After Width: | Height: | Size: 144 KiB |
BIN
unit2/assets/img/TD-1.jpg
Normal file
After Width: | Height: | Size: 220 KiB |
BIN
unit2/assets/img/bellman4-quiz.jpg
Normal file
After Width: | Height: | Size: 108 KiB |
BIN
unit2/assets/img/bellman4.jpg
Normal file
After Width: | Height: | Size: 324 KiB |
BIN
unit2/assets/img/bonus-unit2.jpg
Normal file
After Width: | Height: | Size: 217 KiB |
BIN
unit2/assets/img/discord_channels.jpg
Normal file
After Width: | Height: | Size: 20 KiB |
BIN
unit2/assets/img/envs.gif
Normal file
After Width: | Height: | Size: 359 KiB |
BIN
unit2/assets/img/mc-ex.jpg
Normal file
After Width: | Height: | Size: 131 KiB |
BIN
unit2/assets/img/monte-carlo-approach.jpg
Normal file
After Width: | Height: | Size: 270 KiB |
BIN
unit2/assets/img/q-update-ex.jpg.jpg
Normal file
After Width: | Height: | Size: 65 KiB |
BIN
unit2/assets/img/q-update-solution.jpg.jpg
Normal file
After Width: | Height: | Size: 93 KiB |
BIN
unit2/assets/img/summary-learning-mtds.jpg
Normal file
After Width: | Height: | Size: 244 KiB |
BIN
unit2/assets/img/td-ex.jpg
Normal file
After Width: | Height: | Size: 119 KiB |
BIN
unit2/assets/img/two-approaches.jpg
Normal file
After Width: | Height: | Size: 445 KiB |
BIN
unit2/assets/img/two_parts.jpg
Normal file
After Width: | Height: | Size: 1.2 MiB |
98
unit2/quiz1.md
Normal file
@@ -0,0 +1,98 @@
# Knowledge Check ✔️

The best way to learn and [avoid the illusion of competence](https://fr.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you find **where you need to reinforce your knowledge**.

📝 Take a piece of paper and try to answer in writing, **then check the solutions**.

### Q1: What are the two main approaches to finding an optimal policy?

<details>
<summary>Solution</summary>

The two main approaches are:
- *Policy-based methods*: **Train the policy directly** to learn which action to take given a state.
- *Value-based methods*: Train a value function to **learn which state is more valuable, and use this value function to take the action that leads to it**.

<img src="assets/img/two-approaches.jpg" alt="Two approaches of Deep RL"/>

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-q-part1#what-is-rl-a-short-recap

</details>


### Q2: What is the Bellman Equation?

<details>
<summary>Solution</summary>

**The Bellman equation is a recursive equation** that works like this: instead of starting from the beginning for each state and calculating the full return, we can consider the value of any state as:

$R_{t+1} + \gamma V(S_{t+1})$

The immediate reward plus the discounted value of the state that follows.

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-q-part1#the-bellman-equation-simplify-our-value-estimation

</details>
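
As a quick numeric illustration (the numbers here are made up, not from the chapter): if the immediate reward is $R_{t+1} = 1$, the discount rate is $\gamma = 0.99$, and the estimated value of the next state is $V(S_{t+1}) = 10$, the Bellman equation gives

$V(S_t) = R_{t+1} + \gamma V(S_{t+1}) = 1 + 0.99 \times 10 = 10.9$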


### Q3: Define each part of the Bellman Equation

<img src="assets/img/bellman4-quiz.jpg" alt="Bellman equation quiz"/>

<details>
<summary>Solution</summary>

<img src="assets/img/bellman4.jpg" alt="Bellman equation solution"/>

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-q-part1#the-bellman-equation-simplify-our-value-estimation

</details>

### Q4: What is the difference between Monte Carlo and Temporal Difference learning methods?

<details>
<summary>Solution</summary>

There are two types of methods to learn a policy or a value function:
- With the *Monte Carlo method*, we update the value function **from a complete episode**, so we use the actual, accurate discounted return of this episode.
- With the *TD Learning method*, we update the value function **from a single step, so we replace $G_t$, which we don't have yet, with an estimated return called the TD target**.

<img src="assets/img/summary-learning-mtds.jpg" alt="summary-learning-mtds"/>

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-q-part1#monte-carlo-vs-temporal-difference-learning

</details>
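
To make the contrast concrete, here is a minimal sketch of the two update rules for a state-value estimate. The function names, learning rate, and numeric values are hypothetical, not from the chapter:

```python
# Hypothetical constants for illustration only.
LEARNING_RATE = 0.1
GAMMA = 0.99

def mc_update(v_s, g_t, lr=LEARNING_RATE):
    # Monte Carlo: wait until the episode ends, then move V(s) towards the
    # actual discounted return G_t observed from state s.
    return v_s + lr * (g_t - v_s)

def td_update(v_s, reward, v_next, lr=LEARNING_RATE, gamma=GAMMA):
    # TD(0): update after a single step, towards the TD target
    # R_{t+1} + gamma * V(S_{t+1}), which is an *estimate* of the return.
    td_target = reward + gamma * v_next
    return v_s + lr * (td_target - v_s)

# MC moves V(s) a fraction of the way towards the full observed return...
new_v_mc = mc_update(0.0, 10.0)
# ...while TD moves it towards the one-step bootstrapped target.
new_v_td = td_update(0.0, 1.0, 5.0)
```

Note that MC needs the whole episode before it can compute `g_t`, whereas TD can update online after every single transition.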

### Q5: Define each part of the Temporal Difference learning formula

<img src="assets/img/td-ex.jpg" alt="TD Learning exercise"/>

<details>
<summary>Solution</summary>

<img src="assets/img/TD-1.jpg" alt="TD Exercise"/>

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-q-part1#temporal-difference-learning-learning-at-each-step
</details>


### Q6: Define each part of the Monte Carlo learning formula

<img src="assets/img/mc-ex.jpg" alt="MC Learning exercise"/>

<details>
<summary>Solution</summary>

<img src="assets/img/monte-carlo-approach.jpg" alt="MC Exercise"/>

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-q-part1#monte-carlo-learning-at-the-end-of-the-episode
</details>

---

Congrats on **finishing this Quiz** 🥳! If you missed some elements, take time to [read the chapter again](https://huggingface.co/blog/deep-rl-q-part1) to reinforce (😏) your knowledge.

**Keep Learning, Stay Awesome**
81
unit2/quiz2.md
Normal file
@@ -0,0 +1,81 @@
# Knowledge Check ✔️

The best way to learn and [avoid the illusion of competence](https://fr.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you find **where you need to reinforce your knowledge**.

📝 Take a piece of paper and try to answer in writing, **then check the solutions**.

### Q1: What is Q-Learning?

<details>
<summary>Solution</summary>

Q-Learning is **the algorithm we use to train our Q-Function**, an action-value function that determines the value of being in a particular state and taking a specific action at that state.

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-q-part2#what-is-q-learning
</details>

### Q2: What is a Q-Table?

<details>
<summary>Solution</summary>

A Q-table is the "internal memory" of our agent, where each cell corresponds to the value of a state-action pair. Think of this Q-table as the memory or cheat sheet of our Q-function.

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-q-part2#what-is-q-learning
</details>

### Q3: Why does having an optimal Q-function Q* give us an optimal policy?

<details>
<summary>Solution</summary>

Because the optimal Q-function tells us, for each state, the value of every action, so we know the best action to take in each state: the optimal policy simply picks the action with the highest Q-value.

<img src="https://huggingface.co/blog/assets/73_deep_rl_q_part2/link-value-policy.jpg" alt="link value policy"/>

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-q-part2#what-is-q-learning
</details>

### Q4: Can you explain what the Epsilon-Greedy Strategy is?

<details>
<summary>Solution</summary>
The Epsilon-Greedy Strategy is a **policy that handles the exploration/exploitation trade-off**.

The idea is that we start by defining epsilon ɛ = 1.0:

- With *probability 1 - ɛ*: we do exploitation (i.e., our agent selects the action with the highest state-action pair value).
- With *probability ɛ*: we do exploration (trying a random action).

<img src="https://huggingface.co/blog/assets/73_deep_rl_q_part2/Q-learning-4.jpg" alt="Epsilon Greedy"/>

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-q-part2#the-q-learning-algorithm

</details>
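
As a minimal illustrative sketch (the function name and values are hypothetical, not the course's implementation), epsilon-greedy action selection over one row of a Q-table might look like:

```python
import random

def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index from one Q-table row (the Q-values of one state)."""
    if rng.random() < epsilon:
        # Exploration: pick a random action.
        return rng.randrange(len(q_row))
    # Exploitation: pick the action with the highest Q-value.
    return max(range(len(q_row)), key=lambda a: q_row[a])

# With epsilon = 0.0 we always exploit, so the greedy action is chosen:
print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))  # 1
```

During training, epsilon is typically decayed from 1.0 towards a small value, so the agent explores a lot at first and exploits more and more as its Q-table improves.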

### Q5: How do we update the Q-value of a state-action pair?
<img src="assets/img/q-update-ex.jpg.jpg" alt="Q Update exercise"/>

<details>
<summary>Solution</summary>
<img src="assets/img/q-update-solution.jpg.jpg" alt="Q Update exercise"/>
📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-q-part2#the-q-learning-algorithm
</details>
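
In code, the update can be sketched as follows (a hypothetical minimal example; the learning rate, discount, and table values are made up):

```python
def q_update(q_table, state, action, reward, next_state, lr=0.1, gamma=0.99):
    """Q-Learning update:
    Q(s,a) <- Q(s,a) + lr * (r + gamma * max_a' Q(s',a') - Q(s,a))
    """
    td_target = reward + gamma * max(q_table[next_state])
    q_table[state][action] += lr * (td_target - q_table[state][action])

q = [[0.0, 0.0], [0.0, 2.0]]  # toy Q-table: 2 states x 2 actions
q_update(q, state=0, action=1, reward=1.0, next_state=1)
# q[0][1] is now 0.0 + 0.1 * (1.0 + 0.99 * 2.0 - 0.0) = 0.298
```

Note the `max` over the next state's actions: the update bootstraps from the greedy action regardless of which action is actually taken next, which is what makes Q-Learning off-policy.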

### Q6: What's the difference between on-policy and off-policy?

<details>
<summary>Solution</summary>
<img src="https://huggingface.co/blog/assets/73_deep_rl_q_part2/off-on-4.jpg" alt="On/off policy"/>
📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-q-part2#off-policy-vs-on-policy
</details>


---

Congrats on **finishing this Quiz** 🥳! If you missed some elements, take time to [read the chapter again](https://huggingface.co/blog/deep-rl-q-part2) to reinforce (😏) your knowledge.

**Keep Learning, Stay Awesome**

14
unit2/unit2-bonus/readme.md
Normal file
@@ -0,0 +1,14 @@
# Unit 2 Bonus 🎁
If you want to go deeper into Stable-Baselines3 before next week's unit on Deep Q-Learning, you can **check out these cool environments 🚀**:

<img src="../assets/img/bonus-unit2.jpg" alt="Illustration unit 2 bonus"/>

- [Minigrid environment](https://github.com/maximecb/gym-minigrid): puzzle environments where your agent needs to find the way out using keys 🔐 and doors 🚪: https://github.com/maximecb/gym-minigrid

- [Procgen Benchmark](https://stable-baselines3.readthedocs.io/en/master/guide/examples.html#sb3-and-procgenenv): 16 simple-to-use, procedurally generated gym environments (platformers, shooters, etc.). You have an example with Stable-Baselines3 here: https://stable-baselines3.readthedocs.io/en/master/guide/examples.html#sb3-and-procgenenv

- [VizDoom: a Doom-like environment 🔥](https://github.com/mwydmuch/ViZDoom): Nicholas Renotte made a very good tutorial on how to train an agent to play it using Stable-Baselines3: https://youtu.be/eBCU-tqLGfQ

Have fun 🥳

### Keep Learning, Stay awesome
1790
unit2/unit2.ipynb
Normal file
66
unit3/README.md
Normal file
@@ -0,0 +1,66 @@
|
||||
# Unit 3: Deep Q-Learning with Atari Games 👾
|
||||
|
||||
In this Unit, **we'll study our first Deep Reinforcement Learning agent**: Deep Q-Learning.
|
||||
|
||||
And **we'll train it to play Space Invaders and other Atari environments using [RL-Zoo](https://github.com/DLR-RM/rl-baselines3-zoo)**, a training framework for RL using Stable-Baselines that provides scripts for training, evaluating agents, tuning hyperparameters, plotting results, and recording videos.
|
||||
|
||||
<img src="assets/img/atari-envs.gif" alt="unit 3 environments"/>
|
||||
|
||||
You'll then be able to **compare your agent’s results with other classmates thanks to a leaderboard** 🔥 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard
|
||||
|
||||
This course is **self-paced**, you can start whenever you want.
|
||||
|
||||
## Required time ⏱️
|
||||
The required time for this unit is, approximately:
|
||||
- 1-2 hours for the theory
|
||||
- 1 hour for the hands-on.
|
||||
|
||||
## Start this Unit 🚀
|
||||
Here are the steps for this Unit:
|
||||
|
||||
1️⃣ 📖 **Read [Deep Q-Learning with Atari chapter](https://huggingface.co/blog/deep-rl-dqn)**.
|
||||
|
||||
2️⃣ 📝 Take a piece of paper and check your knowledge with this series of questions ❔ 👉 https://github.com/huggingface/deep-rl-class/blob/main/unit3/quiz.md
|
||||
|
||||
3️⃣ 👩💻 Then dive on the hands-on, where **you'll train a Deep Q-Learning agent** playing Space Invaders using [RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo), a training framework based on [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/) that provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.
|
||||
|
||||
Thanks to a leaderboard, **you'll be able to compare your results with other classmates** and exchange the best practices to improve your agent's scores Who will win the challenge for Unit 2 🏆?
|
||||
|
||||
The hands-on 👉 [](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit3/unit3.ipynb)
|
||||
|
||||
The leaderboard 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard
|
||||
|
||||
You can work directly **with the colab notebook, which allows you not to have to install everything on your machine (and it’s free)**.
|
||||
|
||||
4️⃣ The best way to learn **is to try things on your own**. That’s why we have a challenges section in the colab where we give you some ideas on how you can go further: using another environment, using another model etc.
|
||||
|
||||
## Additional readings 📚
|
||||
- [Foundations of Deep RL Series, L2 Deep Q-Learning by Pieter Abbeel](https://youtu.be/Psrhxy88zww)
|
||||
- [Playing Atari with Deep Reinforcement Learning](https://arxiv.org/abs/1312.5602)
|
||||
- [Double Deep Q-Learning](https://papers.nips.cc/paper/2010/hash/091d584fced301b442654dd8c23b3fc9-Abstract.html)
|
||||
- [Prioritized Experience Replay](https://arxiv.org/abs/1511.05952)
|
||||
|
||||
## How to make the most of this course
|
||||
|
||||
To make the most of the course, my advice is to:
|
||||
|
||||
- **Participate in Discord** and join a study group.
|
||||
- **Read the theory part multiple times** and take some notes
|
||||
- Don’t just do the colab. When you learn something, try to change the environment, change the parameters and read the libraries' documentation. Have fun 🥳
|
||||
- Struggling is **a good thing in learning**. It means that you start to build new skills. Deep RL is a complex topic and it takes time to understand. Try different approaches, use our additional readings, and exchange with classmates on discord.
|
||||
|
||||
## This is a course built with you 👷🏿♀️
|
||||
|
||||
We want to improve and update the course iteratively with your feedback. **If you have some, please fill this form** 👉 https://forms.gle/3HgA7bEHwAmmLfwh9
|
||||
|
||||
## Don’t forget to join the Community 📢
|
||||
|
||||
We have a discord server where you **can exchange with the community and with us, create study groups to grow together, and more**
|
||||
|
||||
👉🏻 [https://discord.gg/aYka4Yhff9](https://discord.gg/aYka4Yhff9).
|
||||
|
||||
Don’t forget to **introduce yourself when you sign up 🤗**
|
||||
|
||||
❓If you have other questions, [please check our FAQ](https://github.com/huggingface/deep-rl-class#faq)
|
||||
|
||||
### Keep learning, stay awesome,
|
||||
BIN
unit3/assets/img/atari-envs.gif
Normal file
|
After Width: | Height: | Size: 2.6 MiB |
1
unit3/assets/img/test
Normal file
@@ -0,0 +1 @@
|
||||
|
||||
16
unit3/bonus.md
Normal file
@@ -0,0 +1,16 @@
|
||||
# Automatic Hyperparameter Tuning with Optuna
|
||||
|
||||
One of the most critical tasks in Deep Reinforcement Learning is to **find a good set of training hyperparameters**.
|
||||
|
||||
<img src="https://raw.githubusercontent.com/optuna/optuna/master/docs/image/optuna-logo.png" alt="Optuna"/>
|
||||
|
||||
Optuna is a library that **helps you automate the search**. In this Unit, we'll study a little bit of the theory behind automatic hyperparameter tuning, try to optimize the parameters manually, and then see how to automate the search using Optuna.
|
||||
|
||||
The content below comes from [Antonin Raffin's ICRA 2022 presentations](https://twitter.com/araffin2); he's one of the creators of Stable-Baselines and RL-Baselines3-Zoo.
|
||||
|
||||
## The learning steps 📚
|
||||
1️⃣ 📹 First, let's study what [Automatic Hyperparameter Tuning](https://www.youtube.com/watch?v=AidFTOdGNFQ) is. Don't forget to 👍 the video 🤗.
|
||||
|
||||
2️⃣👩💻 Then let's dive into the [hands-on, where we'll first try to optimize the parameters manually before automating the search using Optuna](https://youtu.be/ihP7E76KGOI).
|
||||
|
||||
3️⃣ Now that you've learned to use Optuna, why not go back to our **Deep Q-Learning hands-on and use Optuna to find the best training hyperparameters** 👉 [](https://colab.research.google.com/github/araffin/tools-for-robotic-rl-icra2022/blob/main/notebooks/optuna_lab.ipynb)
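To see why automated search helps, here is a toy random-search tuner in plain Python. The `evaluate` function and its optimum at `lr=0.003` are made-up stand-ins for a real training run; Optuna replaces the random sampler with smarter ones and adds pruning of bad trials:

```python
import random

# Toy stand-in for "train an agent and return its mean reward";
# the optimum at lr=0.003, gamma=0.99 is a made-up assumption for illustration.
def evaluate(lr: float, gamma: float) -> float:
    return -((lr - 0.003) ** 2) * 1e5 - ((gamma - 0.99) ** 2) * 100

def random_search(n_trials: int, seed: int = 0):
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "lr": 10 ** rng.uniform(-5, -2),   # log-uniform, as you would sample a learning rate
            "gamma": rng.uniform(0.9, 0.9999),
        }
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best, score = random_search(200)
print(best, score)  # best params land near the (made-up) optimum
```

With 200 trials the best sample lands close to the optimum; a smarter sampler gets there with far fewer evaluations, which matters when each evaluation is a full training run.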
|
||||
93
unit3/quiz.md
Normal file
@@ -0,0 +1,93 @@
|
||||
# Knowledge Check ✔️
|
||||
|
||||
The best way to learn and [avoid the illusion of competence](https://fr.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.
|
||||
|
||||
📝 Take a piece of paper and try to answer by writing, **then check the solutions**.
|
||||
|
||||
### Q1: What are tabular methods?
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
*Tabular methods* are a class of problems in which the state and action spaces are small enough for the value functions to be **represented as arrays and tables**. For instance, **Q-Learning is a tabular method** since we use a table to represent the state-action value pairs.
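As a minimal sketch (array sizes chosen to match FrozenLake's 4x4 grid; hyperparameters are illustrative), the whole value function fits in one small array:

```python
import numpy as np

n_states, n_actions = 16, 4          # e.g. FrozenLake's 4x4 grid
Q = np.zeros((n_states, n_actions))  # the entire value function is just this table

alpha, gamma = 0.1, 0.99             # learning rate and discount (illustrative values)

def q_update(s, a, r, s_next):
    """One tabular Q-Learning update: move Q(s,a) toward the TD target."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

q_update(s=0, a=1, r=1.0, s_next=2)
print(Q[0, 1])  # 0.1
```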
|
||||
|
||||
📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-dqn#from-q-learning-to-deep-q-learning
|
||||
|
||||
</details>
|
||||
|
||||
|
||||
### Q2: Why can't we use classical Q-Learning to solve an Atari game?
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
Atari environments have an observation space with a shape of (210, 160, 3), where each value ranges from 0 to 255, which gives us 256^(210x160x3) = 256^100800 possible observations (**for comparison, there are approximately 10^80 atoms in the observable universe**).
|
||||
|
||||
Therefore, the state space is gigantic; hence, creating and updating a Q-table for that environment **would not be efficient**. In this case, the best solution is to approximate the Q-values using a parametrized Q-function $Q_{\theta}(s,a)$ instead of a Q-table.
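You can check this arithmetic in a couple of lines:

```python
import math

n_pixels = 210 * 160 * 3                     # 100800 values per observation
digits = int(n_pixels * math.log10(256)) + 1  # number of decimal digits in 256**100800
print(n_pixels, digits)                       # 256**100800 has 242751 digits
```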
|
||||
|
||||
📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-dqn#from-q-learning-to-deep-q-learning
|
||||
</details>
|
||||
|
||||
### Q3: Why do we stack four frames together when we use frames as input in Deep Q-Learning?
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
We stack frames together because it helps us **handle the problem of temporal limitation**: a single frame is not enough to capture temporal information.
|
||||
For instance, in Pong, our agent **will be unable to know the ball's direction if it only gets one frame**.
|
||||
|
||||
<img src="https://huggingface.co/blog/assets/78_deep_rl_dqn/temporal-limitation.jpg" alt="Temporal limitation"/>
|
||||
<img src="https://huggingface.co/blog/assets/78_deep_rl_dqn/temporal-limitation-2.jpg" alt="Temporal limitation"/>
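A minimal sketch of the stacking step (the 84x84 grayscale size follows the standard DQN preprocessing; random arrays stand in for real Atari frames here):

```python
import numpy as np

# Four consecutive preprocessed frames (84x84 grayscale, standard DQN preprocessing;
# random placeholders here instead of real Atari frames).
frames = [np.random.rand(84, 84).astype(np.float32) for _ in range(4)]

# Stacking along a new leading axis gives the network input shape (4, 84, 84),
# so motion (e.g. the ball's direction in Pong) is recoverable across channels.
state = np.stack(frames, axis=0)
print(state.shape)  # (4, 84, 84)
```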
|
||||
|
||||
|
||||
📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-dqn#preprocessing-the-input-and-temporal-limitation
|
||||
</details>
|
||||
|
||||
### Q4: What are the two phases of Deep Q-Learning?
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
The Deep Q-Learning training algorithm has two phases:
|
||||
- *Sampling*: we perform actions and **store the observed experience tuples in a replay memory**.
|
||||
- *Training*: we select a small batch of tuples randomly and **learn from it using a gradient descent update step**.
|
||||
|
||||
📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-dqn#the-deep-q-learning-algorithm
|
||||
</details>
|
||||
|
||||
### Q5: Why do we create a replay memory in Deep Q-Learning?
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
**1. Make more efficient use of the experiences during the training**
|
||||
|
||||
Usually, in online reinforcement learning, we interact in the environment, get experiences (state, action, reward, and next state), learn from them (update the neural network) and discard them.
|
||||
But with experience replay, **we create a replay buffer that saves experience samples that we can reuse during the training**.
|
||||
|
||||
**2. Avoid forgetting previous experiences and reduce the correlation between experiences**
|
||||
|
||||
The problem with giving sequential samples of experiences to our neural network is that it **tends to forget previous experiences as it overwrites them with new ones**. For instance, if we are in the first level and then the second, which is different, our agent can forget how to behave and play in the first level.
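Both motivations show up in a minimal replay buffer sketch (capacity and batch size are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal replay buffer: store experience tuples, sample random minibatches."""
    def __init__(self, capacity: int):
        self.memory = deque(maxlen=capacity)  # oldest experiences are evicted automatically

    def push(self, state, action, reward, next_state, done):
        # Experiences are kept, not discarded after one update -> reused during training.
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        # Random sampling breaks the correlation between consecutive experiences.
        return random.sample(self.memory, batch_size)

buffer = ReplayBuffer(capacity=10_000)
for t in range(100):
    buffer.push(t, 0, 1.0, t + 1, False)
batch = buffer.sample(32)
print(len(batch))  # 32
```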
|
||||
|
||||
📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-dqn#experience-replay-to-make-more-efficient-use-of-experiences
|
||||
|
||||
</details>
|
||||
|
||||
### Q6: How does Double Deep Q-Learning work?
|
||||
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
|
||||
When we compute the Q target, we use two networks to decouple the action selection from the target Q value generation. We:
|
||||
|
||||
- Use our *DQN network* to **select the best action to take for the next state** (the action with the highest Q value).
|
||||
|
||||
- Use our *Target network* to calculate **the target Q value of taking that action at the next state**.
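A sketch of the target computation with made-up Q-value estimates:

```python
import numpy as np

gamma, reward = 0.99, 1.0

# Made-up Q-value estimates for the next state from the two networks.
q_online = np.array([1.0, 3.0, 2.0])  # online DQN: used to SELECT the next action
q_target = np.array([0.5, 2.5, 4.0])  # target network: used to EVALUATE that action

best_action = int(np.argmax(q_online))              # online net picks action 1
td_target = reward + gamma * q_target[best_action]  # 1 + 0.99 * 2.5

# Vanilla DQN would instead use max(q_target) = 4.0, tending to overestimate.
print(best_action, td_target)  # action 1, target ~ 3.475
```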
|
||||
|
||||
</details>
|
||||
|
||||
---
|
||||
|
||||
Congrats on **finishing this Quiz** 🥳! If you missed some elements, take time to [read the chapter again](https://huggingface.co/blog/deep-rl-dqn) to reinforce (😏) your knowledge.
|
||||
|
||||
**Keep Learning, Stay Awesome**
|
||||
735
unit3/unit3.ipynb
Normal file
53
unit4/README.md
Normal file
@@ -0,0 +1,53 @@
|
||||
# Unit 4: An Introduction to Unity ML-Agents with Hugging Face 🤗
|
||||

|
||||
|
||||
In this Unit, we’ll learn about [ML-Agents](https://huggingface.co/docs/hub/ml-agents) and use one of the pre-made environments: Pyramids. In this environment, we’ll train an agent that needs to press a button to spawn a pyramid, then navigate to the pyramid, knock it over, and move to the gold brick at the top.
|
||||
|
||||
To do that, **it will need to explore its environment, and we will use a technique called curiosity**.
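In ML-Agents, curiosity is enabled as an extra intrinsic reward signal in the trainer configuration. A sketch of what that looks like (the hyperparameter values here are illustrative, not the course's exact config):

```yaml
behaviors:
  Pyramids:
    trainer_type: ppo
    reward_signals:
      extrinsic:          # the environment's own reward
        gamma: 0.99
        strength: 1.0
      curiosity:          # intrinsic reward: encourages exploring novel states
        gamma: 0.99
        strength: 0.02
```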
|
||||
|
||||
Then, after training, we’ll push the **trained agent to the Hugging Face Hub, and you’ll be able to visualize it playing directly in your browser without having to use the Unity Editor. You’ll also be able to visualize and download other trained agents from the community**.
|
||||
|
||||

|
||||
|
||||
## Required time ⏱️
|
||||
The required time for this unit is approximately:
|
||||
- 2 hours for the theory and hands-on.
|
||||
|
||||
## Start this Unit 🚀
|
||||
Here are the steps for this Unit:
|
||||
|
||||
1️⃣📖 **Read [An Introduction to Unity ML-Agents with Hugging Face 🤗](https://thomassimonini.medium.com/an-introduction-to-unity-ml-agents-with-hugging-face-efbac62c8c80)**.
|
||||
|
||||
2️⃣👩💻 In the meantime, **you can start the tutorial using Google Colab** 👉 [](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit4/unit4.ipynb)
|
||||
|
||||
You can work directly **with the colab notebook, so you don’t have to install anything on your machine (and it’s free)**.
|
||||
|
||||
3️⃣ The best way to learn **is to try things on your own**. That’s why we have a challenges section in the colab where we **give you some ideas on how you can go further: using another environment, etc**.
|
||||
|
||||
## Additional readings 📚
|
||||
- [ML-Agents Documentation](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Readme.md)
|
||||
|
||||
## How to make the most of this course
|
||||
|
||||
To make the most of the course, my advice is to:
|
||||
|
||||
- **Participate in Discord** and join a study group.
|
||||
- **Read the theory part multiple times** and take some notes
|
||||
- Don’t just do the colab. When you learn something, try to change the environment, change the parameters and read the libraries' documentation. Have fun 🥳
|
||||
- Struggling is **a good thing in learning**. It means that you start to build new skills. Deep RL is a complex topic and it takes time to understand. Try different approaches, use our additional readings, and exchange with classmates on discord.
|
||||
|
||||
## This is a course built with you 👷🏿♀️
|
||||
|
||||
We want to improve and update the course iteratively with your feedback. **If you have some, please fill this form** 👉 https://forms.gle/3HgA7bEHwAmmLfwh9
|
||||
|
||||
## Don’t forget to join the Community 📢
|
||||
|
||||
We have a discord server where you **can exchange with the community and with us, create study groups to grow together, and more**
|
||||
|
||||
👉🏻 [https://discord.gg/aYka4Yhff9](https://discord.gg/aYka4Yhff9).
|
||||
|
||||
Don’t forget to **introduce yourself when you sign up 🤗**
|
||||
|
||||
❓If you have other questions, [please check our FAQ](https://github.com/huggingface/deep-rl-class#faq)
|
||||
|
||||
### Keep learning, stay awesome 🤗,
|
||||
BIN
unit4/img/agents.gif
Normal file
|
After Width: | Height: | Size: 1.5 MiB |
1
unit4/img/img
Normal file
@@ -0,0 +1 @@
|
||||
|
||||
BIN
unit4/img/mlagents.jfif
Normal file
|
After Width: | Height: | Size: 64 KiB |
564
unit4/unit4.ipynb
Normal file
70
unit5/README.md
Normal file
@@ -0,0 +1,70 @@
|
||||
# Unit 5: Policy Gradient with PyTorch
|
||||
|
||||
In this Unit, **we'll study Policy Gradient Methods**.
|
||||
|
||||
And we'll **implement Reinforce (a policy gradient method) from scratch using PyTorch** before testing its robustness on CartPole-v1, PixelCopter, and Pong.
|
||||
|
||||
<img src="assets/img/envs.gif" alt="unit 5 environments"/>
|
||||
|
||||
You'll then be able to **compare your agent’s results with other classmates thanks to a leaderboard** 🔥 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard
|
||||
|
||||
This course is **self-paced**; you can start whenever you want.
|
||||
|
||||
## Required time ⏱️
|
||||
The required time for this unit is approximately:
|
||||
- 1 hour for the theory
|
||||
- 1-2 hours for the hands-on.
|
||||
|
||||
## Start this Unit 🚀
|
||||
Here are the steps for this Unit:
|
||||
|
||||
1️⃣ 📖 **Read [Policy Gradient with PyTorch Chapter](https://huggingface.co/blog/deep-rl-pg)**.
|
||||
|
||||
2️⃣ 👩💻 Then dive into the hands-on where you'll **code your first Deep Reinforcement Learning algorithm from scratch: Reinforce**.
|
||||
|
||||
Reinforce is a *Policy-Based Method*: a Deep Reinforcement Learning algorithm that tries **to optimize the policy directly without using an action-value function**.
|
||||
More precisely, Reinforce is a *Policy-Gradient Method*, a subclass of *Policy-Based Methods* that aims **to optimize the policy directly by estimating the weights of the optimal policy using Gradient Ascent**.
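On a toy two-armed bandit (an illustrative setup, not one of the course environments), the whole idea fits in a few lines of NumPy: sample an action from a softmax policy, then push the parameters along the gradient of the log-probability, weighted by the reward:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-armed bandit: arm 0 always pays 1.0, arm 1 pays 0.0 (made-up setup).
arm_rewards = np.array([1.0, 0.0])
theta = np.zeros(2)   # one policy parameter (logit) per action
lr = 0.1

def policy(theta):
    """Softmax policy over the two arms."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

for _ in range(500):
    probs = policy(theta)
    action = rng.choice(2, p=probs)
    reward = arm_rewards[action]
    # REINFORCE update: theta <- theta + lr * reward * grad log pi(action)
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    theta += lr * reward * grad_log_pi

print(policy(theta))  # the policy now strongly prefers arm 0
```

The gradient ascent step is exactly the "optimize the policy directly" idea: no value function is learned, only the policy parameters.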
|
||||
|
||||
To test its robustness, we're going to train it in 3 different simple environments:
|
||||
- CartPole-v1
|
||||
- PongEnv
|
||||
- PixelcopterEnv
|
||||
|
||||
Thanks to a leaderboard, **you'll be able to compare your results with other classmates** and exchange best practices to improve your agent's scores. Who will win the challenge for Unit 5 🏆?
|
||||
|
||||
The hands-on 👉 [](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit5/unit5.ipynb)
|
||||
|
||||
The leaderboard 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard
|
||||
|
||||
You can work directly **with the colab notebook, so you don’t have to install anything on your machine (and it’s free)**.
|
||||
|
||||
|
||||
## Additional readings 📚
|
||||
- [Foundations of Deep RL Series, L3 Policy Gradients and Advantage Estimation by Pieter Abbeel](https://youtu.be/AKbX1Zvo7r8)
|
||||
- [Policy Gradient Algorithms](https://lilianweng.github.io/posts/2018-04-08-policy-gradient/)
|
||||
- [An Intuitive Explanation of Policy Gradient](https://towardsdatascience.com/an-intuitive-explanation-of-policy-gradient-part-1-reinforce-aa4392cbfd3c)
|
||||
|
||||
## How to make the most of this course
|
||||
|
||||
To make the most of the course, my advice is to:
|
||||
|
||||
- **Participate in Discord** and join a study group.
|
||||
- **Read the theory part multiple times** and take some notes
|
||||
- Don’t just do the colab. When you learn something, try to change the environment, change the parameters and read the libraries' documentation. Have fun 🥳
|
||||
- Struggling is **a good thing in learning**. It means that you start to build new skills. Deep RL is a complex topic and it takes time to understand. Try different approaches, use our additional readings, and exchange with classmates on discord.
|
||||
|
||||
## This is a course built with you 👷🏿♀️
|
||||
|
||||
We want to improve and update the course iteratively with your feedback. **If you have some, please fill this form** 👉 https://forms.gle/3HgA7bEHwAmmLfwh9
|
||||
|
||||
## Don’t forget to join the Community 📢
|
||||
|
||||
We have a discord server where you **can exchange with the community and with us, create study groups to grow together, and more**
|
||||
|
||||
👉🏻 [https://discord.gg/aYka4Yhff9](https://discord.gg/aYka4Yhff9).
|
||||
|
||||
Don’t forget to **introduce yourself when you sign up 🤗**
|
||||
|
||||
❓If you have other questions, [please check our FAQ](https://github.com/huggingface/deep-rl-class#faq)
|
||||
|
||||
### Keep learning, stay awesome,
|
||||
BIN
unit5/assets/img/envs.gif
Normal file
|
After Width: | Height: | Size: 1.8 MiB |
1558
unit5/unit5.ipynb
Normal file
61
unit6/README.md
Normal file
@@ -0,0 +1,61 @@
|
||||
# Unit 6: Towards Better Exploration Methods with Curiosity
|
||||
|
||||
In this Unit, we'll study the theory of **Curiosity in Deep Reinforcement Learning**, a technique used to push our agent to explore its environment better and to address two major flaws in Deep Reinforcement Learning:
|
||||
|
||||
1️⃣ **Sparse-reward environments: environments where most rewards do not contain information and hence are set to zero.**
|
||||
|
||||
For instance, in the [ViZDoom environment](https://github.com/mwydmuch/ViZDoom) “DoomMyWayHome,” your agent is only rewarded if it finds the vest. However, the vest is
|
||||
**far away from your starting point, so most of your rewards will be zero**. Therefore, if our agent does not receive useful feedback (dense rewards), it will take much longer to learn an optimal policy and it **can spend time turning around without finding the goal**.
|
||||
|
||||

|
||||
|
||||
2️⃣ **The extrinsic reward function (the environment's reward function) is handmade; that is, in each environment, a human has to implement a reward function. But how can we scale that to big and complex environments?**
|
||||
|
||||
Therefore, a solution to these problems is to develop a reward function **intrinsic to the agent**, i.e., generated by the agent itself. The agent will act as a self-learner: it is the student, but it also provides its own feedback.
|
||||
|
||||
This intrinsic reward mechanism is known as curiosity because it pushes the agent to explore states that are novel/unfamiliar. **To achieve that, our agent will receive a high reward when exploring new trajectories.**
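One common way to generate such a reward (the next-state-prediction approach covered below) is to train a forward model and use its prediction error as the intrinsic reward: familiar transitions are predicted well (low reward), novel ones poorly (high reward). A toy sketch with a linear forward model and made-up one-hot features:

```python
import numpy as np

# Toy forward model: predicts next-state features from (state, action) one-hot features.
W = np.zeros((4, 2))  # made-up sizes: 2 states + 2 actions -> 2 next-state features

def features(s, a):
    x = np.zeros(4)
    x[s] = 1.0
    x[2 + a] = 1.0
    return x

def intrinsic_reward(s, a, next_feat):
    """Curiosity reward = prediction error of the forward model."""
    pred = features(s, a) @ W
    return 0.5 * np.sum((pred - next_feat) ** 2)

# Train the forward model on one *familiar* transition: (s=0, a=0) -> features [1, 0]
target = np.array([1.0, 0.0])
for _ in range(200):
    x = features(0, 0)
    err = x @ W - target
    W -= 0.1 * np.outer(x, err)  # SGD on the prediction error

print(intrinsic_reward(0, 0, np.array([1.0, 0.0])))  # familiar transition: near 0
print(intrinsic_reward(1, 1, np.array([0.0, 1.0])))  # novel transition: high
```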
|
||||
|
||||
We'll see two techniques to create this curiosity; both are based on a research paper.
|
||||
|
||||
## Required time ⏱️
|
||||
The required time for this unit is approximately:
|
||||
- One hour for the first paper study
|
||||
- One hour for the second paper study
|
||||
|
||||
## Start this Unit 🚀
|
||||
1️⃣ 📖 Read [Curiosity-Driven Learning through Next State Prediction](https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-next-state-prediction-f7f4e2f592fa)
|
||||
|
||||
2️⃣ In addition, you should read the paper 👉 https://pathak22.github.io/noreward-rl/
|
||||
|
||||
3️⃣ 📖 Read [Random Network Distillation: a new take on Curiosity-Driven Learning](https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-random-network-distillation-488ffd8e5938)
|
||||
|
||||
4️⃣ In addition, you should read the paper 👉 https://arxiv.org/pdf/1808.04355.pdf
|
||||
|
||||
## Additional readings 📚
|
||||
- [Curiosity and Procrastination in Reinforcement Learning, Google Brain](https://ai.googleblog.com/2018/10/curiosity-and-procrastination-in.html)
|
||||
- [ML-Agents, Curiosity for Sparse-reward Environments](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/ML-Agents-Overview.md#curiosity-for-sparse-reward-environments)
|
||||
|
||||
## How to make the most of this course
|
||||
|
||||
To make the most of the course, my advice is to:
|
||||
|
||||
- **Participate in Discord** and join a study group.
|
||||
- **Read the theory part multiple times** and take some notes
|
||||
- Don’t just do the colab. When you learn something, try to change the environment, change the parameters and read the libraries' documentation. Have fun 🥳
|
||||
- Struggling is **a good thing in learning**. It means that you start to build new skills. Deep RL is a complex topic and it takes time to understand. Try different approaches, use our additional readings, and exchange with classmates on discord.
|
||||
|
||||
## This is a course built with you 👷🏿♀️
|
||||
|
||||
We want to improve and update the course iteratively with your feedback. **If you have some, please fill this form** 👉 https://forms.gle/3HgA7bEHwAmmLfwh9
|
||||
|
||||
## Don’t forget to join the Community 📢
|
||||
|
||||
We have a discord server where you **can exchange with the community and with us, create study groups to grow together, and more**
|
||||
|
||||
👉🏻 [https://discord.gg/aYka4Yhff9](https://discord.gg/aYka4Yhff9).
|
||||
|
||||
Don’t forget to **introduce yourself when you sign up 🤗**
|
||||
|
||||
❓If you have other questions, [please check our FAQ](https://github.com/huggingface/deep-rl-class#faq)
|
||||
|
||||
### Keep learning, stay awesome,
|
||||
62
unit7/README.md
Normal file
@@ -0,0 +1,62 @@
|
||||
# Unit 7: Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet 🤖
|
||||
|
||||
One of the major industries that use Reinforcement Learning is robotics. Unfortunately, **having access to robot equipment is very expensive**. Fortunately, some simulators exist to train robots:
|
||||
1. PyBullet
|
||||
2. MuJoCo
|
||||
3. Unity Simulations
|
||||
|
||||
We're going to learn about Advantage Actor Critic (A2C), learn how to use PyBullet, and train a spider agent to walk.
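The "advantage" in A2C measures how much better an action turned out than the critic expected. A one-step sketch with made-up numbers (advantage estimated via the TD error):

```python
gamma = 0.99

# Made-up numbers: the critic's value estimates and one sampled transition.
v_s, v_s_next, reward = 1.0, 1.5, 0.5

# Advantage estimated with the TD error: A(s, a) ~ r + gamma * V(s') - V(s)
advantage = reward + gamma * v_s_next - v_s
print(advantage)  # ~ 0.985: the action did better than the critic predicted
```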
|
||||
|
||||
🏆 You'll then be able to **compare your agent’s results with other classmates thanks to a leaderboard** 🔥 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard
|
||||
|
||||

|
||||
|
||||
Let's get started 🥳
|
||||
|
||||
## Required time ⏱️
|
||||
The required time for this unit is approximately:
|
||||
- 1 hour for the theory.
|
||||
- 1 hour for the hands-on.
|
||||
|
||||
## Start this Unit 🚀
|
||||
Here are the steps for this Unit:
|
||||
|
||||
1️⃣ 📖 [Read Advantage Actor Critic Chapter](https://huggingface.co/blog/deep-rl-a2c).
|
||||
|
||||
2️⃣ 👩💻 Then dive into the hands-on where you'll train two robots to walk.
|
||||
|
||||
The hands-on 👉 [](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit7/unit7.ipynb)
|
||||
|
||||
Thanks to a leaderboard, you'll be able to compare your results with other classmates and exchange best practices to improve your agent's scores. Who will win the challenge for Unit 7 🏆?
|
||||
|
||||
The leaderboard 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard
|
||||
|
||||
## Additional readings 📚
|
||||
- [Making Sense of the Bias / Variance Trade-off in (Deep) Reinforcement Learning](https://blog.mlreview.com/making-sense-of-the-bias-variance-trade-off-in-deep-reinforcement-learning-79cf1e83d565)
|
||||
- [Bias-variance Tradeoff in Reinforcement Learning](https://www.endtoend.ai/blog/bias-variance-tradeoff-in-reinforcement-learning/)
|
||||
- [Foundations of Deep RL Series, L3 Policy Gradients and Advantage Estimation by Pieter Abbeel](https://youtu.be/AKbX1Zvo7r8)
|
||||
|
||||
## How to make the most of this course
|
||||
|
||||
To make the most of the course, my advice is to:
|
||||
|
||||
- **Participate in Discord** and join a study group.
|
||||
- **Read the theory part multiple times** and take some notes
|
||||
- Don’t just do the colab. When you learn something, try to change the environment, change the parameters and read the libraries' documentation. Have fun 🥳
|
||||
- Struggling is **a good thing in learning**. It means that you start to build new skills. Deep RL is a complex topic and it takes time to understand. Try different approaches, use our additional readings, and exchange with classmates on discord.
|
||||
|
||||
## This is a course built with you 👷🏿♀️
|
||||
|
||||
We want to improve and update the course iteratively with your feedback. **If you have some, please fill this form** 👉 https://forms.gle/3HgA7bEHwAmmLfwh9
|
||||
|
||||
## Don’t forget to join the Community 📢
|
||||
|
||||
We have a discord server where you **can exchange with the community and with us, create study groups to grow together, and more**
|
||||
|
||||
👉🏻 [https://discord.gg/aYka4Yhff9](https://discord.gg/aYka4Yhff9).
|
||||
|
||||
Don’t forget to **introduce yourself when you sign up 🤗**
|
||||
|
||||
❓If you have other questions, [please check our FAQ](https://github.com/huggingface/deep-rl-class#faq)
|
||||
|
||||
### Keep learning, stay awesome 🤗
|
||||
BIN
unit7/assets/img/pybullet-envs.gif
Normal file
|
After Width: | Height: | Size: 4.1 MiB |
1
unit7/assets/img/test
Normal file
@@ -0,0 +1 @@
|
||||
|
||||
536
unit7/unit7.ipynb
Normal file
64
unit8/README.md
Normal file
@@ -0,0 +1,64 @@
|
||||
# Unit 8: Proximal Policy Optimization (PPO) with PyTorch
|
||||
|
||||
Today we'll learn about Proximal Policy Optimization (PPO), an architecture that improves our agent's training stability by avoiding policy updates that are too large. To do that, we use a ratio that indicates the difference between our current and old policy and clip this ratio to a specific range $[1 - \epsilon, 1 + \epsilon]$. Doing this ensures that our policy update will not be too large and that training is more stable.
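A sketch of that clipping on made-up numbers (`eps` and the advantage values are illustrative):

```python
import numpy as np

eps = 0.2
advantage = 1.0

# Made-up probability ratios pi_new(a|s) / pi_old(a|s) for a few sampled actions.
ratio = np.array([0.5, 1.0, 1.5])

unclipped = ratio * advantage
clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage

# PPO takes the minimum, so a policy that moved too far gets no extra credit.
objective = np.minimum(unclipped, clipped)
print(objective)  # min(unclipped, clipped) = [0.5, 1.0, 1.2]
```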
|
||||
|
||||
And then, after the theory, we'll code a PPO architecture from scratch using PyTorch and bulletproof our implementation with CartPole-v1 and LunarLander-v2.
|
||||
|
||||
🏆 You'll then be able to **compare your agent’s results with other classmates thanks to a leaderboard** 🔥 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard
|
||||
|
||||
<img src="assets/img/LunarLander.gif" alt="LunarLander"/>
|
||||
|
||||
Let's get started 🥳
|
||||
|
||||
## Required time ⏱️
|
||||
The required time for this unit is approximately:
|
||||
- 1 hour for the theory.
|
||||
- 2 hours for the hands-on.
|
||||
|
||||
## Start this Unit 🚀
|
||||
Here are the steps for this Unit:
|
||||
|
||||
1️⃣ 📖 [Read Proximal Policy Optimization Chapter](https://huggingface.co/blog/deep-rl-ppo).
|
||||
|
||||
2️⃣ 👩💻 Then dive into the hands-on:
|
||||
|
||||
The hands-on 👉 [](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit8/unit8.ipynb)
|
||||
|
||||
Thanks to a leaderboard, you'll be able to compare your results with other classmates and exchange best practices to improve your agent's scores. Who will win the challenge for Unit 8 🏆?
|
||||
|
||||
The leaderboard 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard
|
||||
|
||||
## Additional readings 📚
|
||||
- [Towards Delivering a Coherent Self-Contained Explanation of Proximal Policy Optimization by Daniel Bick](https://fse.studenttheses.ub.rug.nl/25709/1/mAI_2021_BickD.pdf)
|
||||
- [What is the way to understand Proximal Policy Optimization Algorithm in RL?](https://stackoverflow.com/questions/46422845/what-is-the-way-to-understand-proximal-policy-optimization-algorithm-in-rl)
|
||||
- [Foundations of Deep RL Series, L4 TRPO and PPO by Pieter Abbeel](https://youtu.be/KjWF8VIMGiY)
|
||||
- [OpenAI PPO Blogpost](https://openai.com/blog/openai-baselines-ppo/)
|
||||
- [Spinning Up RL PPO](https://spinningup.openai.com/en/latest/algorithms/ppo.html)
|
||||
- [Paper Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347)
|
||||
- [The 37 Implementation Details of Proximal Policy Optimization](https://ppo-details.cleanrl.dev//2021/11/05/ppo-implementation-details/)
|
||||
- [Importance Sampling Explained](https://youtu.be/C3p2wI4RAi8)
|
||||
|
||||
## How to make the most of this course
|
||||
|
||||
To make the most of the course, my advice is to:
|
||||
|
||||
- **Participate in Discord** and join a study group.
|
||||
- **Read the theory part multiple times** and take some notes
|
||||
- Don’t just do the colab. When you learn something, try to change the environment, change the parameters and read the libraries' documentation. Have fun 🥳
|
||||
- Struggling is **a good thing in learning**. It means that you start to build new skills. Deep RL is a complex topic and it takes time to understand. Try different approaches, use our additional readings, and exchange with classmates on discord.
|
||||
|
||||
## This is a course built with you 👷🏿♀️
|
||||
|
||||
We want to improve and update the course iteratively with your feedback. **If you have some, please fill this form** 👉 https://forms.gle/3HgA7bEHwAmmLfwh9
|
||||
|
||||
## Don’t forget to join the Community 📢
|
||||
|
||||
We have a discord server where you **can exchange with the community and with us, create study groups to grow together, and more**
|
||||
|
||||
👉🏻 [https://discord.gg/aYka4Yhff9](https://discord.gg/aYka4Yhff9).
|
||||
|
||||
Don’t forget to **introduce yourself when you sign up 🤗**
|
||||
|
||||
❓If you have other questions, [please check our FAQ](https://github.com/huggingface/deep-rl-class#faq)
|
||||
|
||||
### Keep learning, stay awesome 🤗
|
||||