Merge branch 'huggingface:main' into main
# Unit 1: Introduction to Deep Reinforcement Learning 🚀

In this Unit, you'll learn the foundations of Deep Reinforcement Learning. And **you’ll train your first lander agent 🚀 to land correctly on the Moon 🌕** using Stable-Baselines3 and share it with the community.

<img src="assets/img/LunarLander.gif" alt="LunarLander"/>

You'll then be able to **[compare your agent’s results with those of other classmates thanks to the leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard)** 🔥.
This course is **self-paced**, you can start whenever you want.

## Required time ⏱️

The required time for this unit is approximately:

- **2 hours** for the theory
- **1 hour** for the hands-on
## Start this Unit 🚀

Here are the steps for this Unit:

1️⃣ 📝 **Sign up to the course**, to receive updates when each Unit is published.

2️⃣ **Sign up to our Discord Server**. This is the place where you **can exchange with the community and with us, create study groups to grow with each other, and more**.

👉🏻 [https://discord.gg/aYka4Yhff9](https://discord.gg/aYka4Yhff9).

Are you new to Discord? Check our **Discord 101 to get the best practices** 👉 https://github.com/huggingface/deep-rl-class/blob/main/DISCORD.Md

3️⃣ 👋 **Introduce yourself on Discord in the #introduce-yourself channel 🤗 and check the Reinforcement Learning section on the left.**

- In #rl-announcements, we share the latest information about the course.
- #discussions is a place to exchange with the community.
<img src="assets/img/discord_channels.jpg" alt="Discord Channels"/>

4️⃣ 📖 **Read [An Introduction to Deep Reinforcement Learning](https://huggingface.co/blog/deep-rl-intro)**, where you’ll learn the foundations of Deep RL. You can also watch the video version attached to the article. 👉 https://huggingface.co/blog/deep-rl-intro

5️⃣ 📝 Take a piece of paper and **check your knowledge with this series of questions** ❔ 👉 https://github.com/huggingface/deep-rl-class/blob/main/unit1/quiz.md

6️⃣ 👩‍💻 Then dive into the hands-on, where **you’ll train your first lander agent 🚀 to land correctly on the Moon 🌕 using Stable-Baselines3 and share it with the community.** Thanks to a leaderboard, **you'll be able to compare your results with other classmates** and exchange best practices to improve your agent's scores. Who will win the challenge for Unit 1 🏆?

👩‍💻 The hands-on 👉 https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit1/unit1.ipynb

🏆 The leaderboard 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
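If you want a preview of what the hands-on covers, here is a minimal training sketch (assuming `gym[box2d]` and `stable-baselines3` are installed; the hyperparameters are illustrative defaults, not the notebook's exact values):

```python
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

# Create a vectorized LunarLander-v2 environment (16 parallel copies speed up training).
env = make_vec_env("LunarLander-v2", n_envs=16)

# A PPO agent with an MLP policy; hyperparameters here are illustrative, not tuned.
model = PPO(policy="MlpPolicy", env=env, verbose=1)
model.learn(total_timesteps=500_000)

# Evaluate on a separate, non-vectorized environment.
eval_env = gym.make("LunarLander-v2")
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")

model.save("ppo-LunarLander-v2")
```

The notebook then walks you through pushing this saved model to the Hugging Face Hub so it appears on the leaderboard.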
You can work directly **with the colab notebook, so you don’t have to install anything on your machine (and it’s free)**.

The best way to learn **is to try things on your own**. That’s why we have a challenges section in the colab where we give you some ideas on how you can go further: using another environment, using another model, etc.

7️⃣ (Optional) In order to **find the best training parameters, you can try this hands-on** made by [Sambit Mukherjee](https://github.com/sambitmukherjee) 👉 https://github.com/huggingface/deep-rl-class/blob/main/unit1/unit1_optuna_guide.ipynb
## Additional readings 📚

- [Reinforcement Learning: An Introduction, Richard Sutton and Andrew G. Barto, Chapters 1, 2 and 3](http://incompleteideas.net/book/RLbook2020.pdf)
To make the most of the course, my advice is to:

- **Participate in Discord** and join a study group.
- **Read the theory part multiple times** and take some notes.
- Don’t just do the colab. When you learn something, try to change the environment, change the parameters, and read the libraries' documentation. Have fun 🥳.
- Struggling is **a good thing in learning**. It means you are starting to build new skills. Deep RL is a complex topic and it takes time to understand. Try different approaches, use our additional readings, and exchange with classmates on Discord.
## This is a course built with you 👷🏿♀️

We want to improve and update the course iteratively with your feedback. **If you have some, please fill in this form** 👉 https://forms.gle/3HgA7bEHwAmmLfwh9

## Don’t forget to join the Community 📢
Don’t forget to **introduce yourself when you sign up 🤗**

❓If you have other questions, [please check our FAQ](https://github.com/huggingface/deep-rl-class#faq)

## Keep learning, stay awesome 🤗,
New binary image assets (unit1/assets/img/): expexpltradeoff.jpg, obs_space_recap.jpg, policy.jpg, policy_2.jpg, rl-loop-ex.jpg, rl-loop-solution.jpg, tasks.jpg, thumbnail.png, value.jpg, value_2.jpg.

unit1/quiz.md (new file, 137 lines)
# Knowledge Check ✔️

The best way to learn and [to avoid the illusion of competence](https://fr.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.

📝 Take a piece of paper and try to answer by writing, **then check the solutions**.
### Q1: What is Reinforcement Learning?

<details>
<summary>Solution</summary>

Reinforcement learning is a **framework for solving control tasks (also called decision problems)** by building agents that learn from the environment by interacting with it through trial and error and **receiving rewards (positive or negative) as unique feedback**.

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-intro#a-formal-definition

</details>
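One formula from the chapter is worth writing down next to this definition (reproduced here as a reminder; see the article for the full discussion): the agent's objective is to maximize the expected discounted return, where γ is the discount factor.

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \qquad \gamma \in [0, 1]
$$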
### Q2: Define the RL Loop

<img src="assets/img/rl-loop-ex.jpg" alt="Exercise RL Loop"/>

At every step:
- Our Agent receives ______ from the environment
- Based on that ______ the Agent takes an ______
- Our Agent will move to the right
- The Environment goes to a ______
- The Environment gives ______ to the Agent

<details>
<summary>Solution</summary>

<img src="assets/img/rl-loop-solution.jpg" alt="Exercise RL Solution"/>

At every step:
- Our Agent receives **state s0** from the environment
- Based on that **state s0** the Agent takes an **action a0**
- Our Agent will move to the right
- The Environment goes to a **new state s1**
- The Environment gives **a reward r1** to the Agent

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-intro#the-rl-process

</details>
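Written as code, this loop looks roughly like the following minimal Gym sketch (the environment id is just an example, and with newer Gym versions `reset()` and `step()` return extra values):

```python
import gym

env = gym.make("LunarLander-v2")
state = env.reset()                      # the Agent receives the initial state s0

for t in range(1000):
    action = env.action_space.sample()   # the Agent takes an action a_t (random here)
    next_state, reward, done, info = env.step(action)  # the Environment returns s_{t+1} and reward r_{t+1}
    state = next_state
    if done:                             # terminal state reached: start a new episode
        state = env.reset()

env.close()
```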
### Q3: What's the difference between a state and an observation?

<details>
<summary>Solution</summary>

- *The state* is a **complete description of the state of the world** (there is no hidden information), in a fully observed environment. For instance, in a chess game, we receive a state from the environment since we have access to the whole chessboard information.

- *The observation* is a **partial description of the state**, in a partially observed environment. For instance, in Super Mario Bros, we only see a part of the level close to the player, so we receive an observation.

<img src="assets/img/obs_space_recap.jpg" alt="Observation Space Recap"/>

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-intro#observationsstates-space

</details>
### Q4: A task is an instance of a Reinforcement Learning problem. What are the two types of tasks?

<details>
<summary>Solution</summary>

- *Episodic task*: we have a **starting point and an ending point (a terminal state)**. This creates an episode: a list of States, Actions, Rewards, and new States. For instance, think about Super Mario Bros: an episode begins at the launch of a new Mario level and ends when you’re killed or you reach the end of the level.

- *Continuous task*: these are tasks that **continue forever (no terminal state)**. In this case, the agent must learn how to choose the best actions and simultaneously interact with the environment.

<img src="assets/img/tasks.jpg" alt="Task"/>

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-intro#type-of-tasks

</details>
### Q5: What is the exploration/exploitation tradeoff?

<details>
<summary>Solution</summary>

In Reinforcement Learning, we need to **balance how much we explore the environment and how much we exploit what we know about the environment**.

- *Exploration* is exploring the environment by **trying random actions in order to find more information about the environment**.

- *Exploitation* is **exploiting known information to maximize the reward**.

<img src="assets/img/expexpltradeoff.jpg" alt="Exploration/exploitation tradeoff"/>

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-intro#exploration-exploitation-tradeoff
</details>
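A common way to handle this tradeoff in practice is an epsilon-greedy rule: with probability epsilon take a random action (explore), otherwise take the action currently believed to be best (exploit). This is a generic sketch, not something from the Unit 1 notebook; `q_values`, `state` and `action_space` are hypothetical stand-ins for whatever value estimates and Gym spaces you have.

```python
import random

def epsilon_greedy_action(q_values, state, action_space, epsilon=0.1):
    """With probability epsilon, explore (random action);
    otherwise exploit (action with the highest estimated value)."""
    if random.random() < epsilon:
        return action_space.sample()   # exploration: try something random
    # exploitation: argmax over the estimated action values for this state
    return max(range(action_space.n), key=lambda a: q_values[state][a])
```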
### Q6: What is a policy?

<details>
<summary>Solution</summary>

- The Policy π **is the brain of our Agent**: it’s the function that tells us what action to take given the state we are in. So it defines the agent’s behavior at a given time.

<img src="assets/img/policy.jpg" alt="Policy"/>

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-intro#the-policy-%CF%80-the-agents-brain
</details>
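In symbols (a quick reminder rather than an extra question): a deterministic policy maps a state to a single action, while a stochastic policy maps a state to a probability distribution over actions.

$$
a = \pi(s) \quad \text{(deterministic)} \qquad \pi(a \mid s) = P[A_t = a \mid S_t = s] \quad \text{(stochastic)}
$$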
### Q7: What are value-based methods?

<details>
<summary>Solution</summary>

- Value-based methods are one of the main approaches for solving RL problems.
- In value-based methods, instead of training a policy function, **we train a value function that maps a state to the expected value of being at that state**.

<img src="assets/img/value.jpg" alt="Value illustration"/>

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-intro#value-based-methods
</details>
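For reference, the state-value function mentioned above is the expected discounted return when starting in state s and then acting according to the policy π:

$$
V_{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right] = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s \right]
$$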
### Q8: What are policy-based methods?

<details>
<summary>Solution</summary>

- In *Policy-Based Methods*, we learn a **policy function directly**.
- This policy function will **map from each state to the best corresponding action at that state**, or give a **probability distribution over the set of possible actions at that state**.

<img src="assets/img/policy.jpg" alt="Policy illustration"/>

📖 If you don't remember, check 👉 https://huggingface.co/blog/deep-rl-intro#value-based-methods

</details>

---

Congrats on **finishing this Quiz** 🥳, if you missed some elements, take time to [read the chapter again](https://huggingface.co/blog/deep-rl-intro) to reinforce (😏) your knowledge.

**Keep Learning, Stay Awesome**
unit1/unit1-bonus/readme.md (new file, 21 lines)

# Unit 1: Bonus 🎁
- Our teammate @Chris Emezue published a new leaderboard where you can compare your trained agents in new environments 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard

## Try new environments 🎮
Now that you've played with LunarLander-v2, why not try these environments? 🔥 (a quick sketch of how to swap environments follows the list):
- 🗻 MountainCar-v0 https://www.gymlibrary.ml/environments/classic_control/mountain_car/
- 🏎️ CarRacing-v1 https://www.gymlibrary.ml/environments/box2d/car_racing/
- 🥶 FrozenLake-v1 https://www.gymlibrary.ml/environments/toy_text/frozen_lake/
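Compared to the Unit 1 notebook, the main change is the environment id you pass to `make_vec_env`. A rough sketch (hyperparameters are illustrative, and some of these environments need extra packages or a different policy class to train well):

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

env_id = "MountainCar-v0"               # or "FrozenLake-v1", "CarRacing-v1", ...
env = make_vec_env(env_id, n_envs=16)

# "MlpPolicy" suits vector observations; image-based environments such as
# CarRacing-v1 are usually trained with "CnnPolicy" instead.
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=200_000)
model.save(f"ppo-{env_id}")
```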
## A piece of advice 🧐
The first Unit is a very interesting one, but also **a very complex one, because it's where you learn the fundamentals.**

It’s normal if you **still feel confused by all these elements**. This was the same for me and for everyone who studied RL.

Take time to really grasp the material before continuing. It’s important to master these elements and have a solid foundation before entering the fun part.

We published additional readings in the syllabus if you want to go deeper 👉 https://github.com/huggingface/deep-rl-class/blob/main/unit1/README.md

The hands-on for the first Unit is more of a fun experiment, but as we go deeper, **you'll understand better how to choose the hyperparameters and what model to use. For now, have fun and try stuff; you can't break the simulations 🚀**

### Keep learning, stay awesome.
unit1/unit1_optuna_guide.ipynb (new file, 546 lines)
{
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 0,
|
||||
"metadata": {
|
||||
"colab": {
|
||||
"name": "Unit 1 Special Content: Optuna Guide.ipynb",
|
||||
"provenance": [],
|
||||
"collapsed_sections": []
|
||||
},
|
||||
"kernelspec": {
|
||||
"name": "python3",
|
||||
"display_name": "Python 3"
|
||||
},
|
||||
"language_info": {
|
||||
"name": "python"
|
||||
}
|
||||
},
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"# Unit 1 Special Content: Optuna Guide"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "RpLWg64qubE9"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"In this notebook, we shall see how to use Optuna to perform hyperparameter tuning of Unit 1's <a href=\"https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html\" target=\"_blank\">`PPO`</a> model (created using Stable-Baselines3 for the `\"LunarLander-v2\"` Gym environment).\n",
|
||||
"\n",
|
||||
"Optuna is an open-source, automatic hyperparameter optimization framework. You can read more about it <a href=\"https://tech.preferred.jp/en/blog/optuna-release/\" target=\"_blank\">here</a>.\n",
|
||||
"\n",
|
||||
"**Prerequisite:** Before going through this notebook, you should have completed the <a href=\"https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit1/unit1.ipynb\" target=\"_blank\">Unit 1 hands-on</a>."
|
||||
],
|
||||
"metadata": {
|
||||
"id": "cWpMuD7jrfze"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"## Virtual Display Setup"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "InZX9_sRtfbO"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"We'll need to generate a replay video. To do so in Colab, we need to have a virtual display to be able to render the environment (and thus record the frames).\n",
|
||||
"\n",
|
||||
"The following cell will install virtual display libraries."
|
||||
],
|
||||
"metadata": {
|
||||
"id": "QvvCztQ5tZxq"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "oUfu-vzL4g97"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!apt install python-opengl\n",
|
||||
"!apt install ffmpeg\n",
|
||||
"!apt install xvfb\n",
|
||||
"!pip install pyvirtualdisplay"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"Now, let's create & start a virtual display."
|
||||
],
|
||||
"metadata": {
|
||||
"id": "UVt5deF2qorb"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"source": [
|
||||
"from pyvirtualdisplay import Display\n",
|
||||
"\n",
|
||||
"virtual_display = Display(visible=0, size=(1400, 900))\n",
|
||||
"virtual_display.start()"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "6E1inOK_5Jko"
|
||||
},
|
||||
"execution_count": null,
|
||||
"outputs": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"## Dependencies, Imports & Gym Environments"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "B2sWJAj8ti1y"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"Let's install all the other dependencies we'll need."
|
||||
],
|
||||
"metadata": {
|
||||
"id": "_HVFbCW5tnyw"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"source": [
|
||||
"!pip install gym[box2d]\n",
|
||||
"!pip install stable-baselines3[extra]\n",
|
||||
"!pip install pyglet\n",
|
||||
"!pip install ale-py==0.7.4 # To overcome an issue with Gym (https://github.com/DLR-RM/stable-baselines3/issues/875)\n",
|
||||
"!pip install optuna\n",
|
||||
"!pip install huggingface_sb3"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "liGH8sAg5dk9"
|
||||
},
|
||||
"execution_count": null,
|
||||
"outputs": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"Next, let's perform all the necessary imports."
|
||||
],
|
||||
"metadata": {
|
||||
"id": "WWcb8nUVt-A9"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"source": [
|
||||
"import gym\n",
|
||||
"\n",
|
||||
"from stable_baselines3.common.env_util import make_vec_env\n",
|
||||
"from stable_baselines3.common.monitor import Monitor\n",
|
||||
"from stable_baselines3 import PPO\n",
|
||||
"from stable_baselines3.common.evaluation import evaluate_policy\n",
|
||||
"from stable_baselines3.common.vec_env import DummyVecEnv\n",
|
||||
"\n",
|
||||
"import optuna\n",
|
||||
"from optuna.samplers import TPESampler\n",
|
||||
"\n",
|
||||
"from huggingface_hub import notebook_login\n",
|
||||
"from huggingface_sb3 import package_to_hub"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "DivjESMF5i7M"
|
||||
},
|
||||
"execution_count": null,
|
||||
"outputs": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"Finally, let's create our Gym environments. The training environment is a vectorized environment:"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "uaXEISAluKU2"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"source": [
|
||||
"env = make_vec_env(\"LunarLander-v2\", n_envs=16)\n",
|
||||
"env"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "WTHPMs_s7YFc"
|
||||
},
|
||||
"execution_count": null,
|
||||
"outputs": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"And the evaluation environment is a separate environment:"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "NTtoXSFiOR6f"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"source": [
|
||||
"eval_env = Monitor(gym.make(\"LunarLander-v2\"))"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "ECDFf_6OOXqO"
|
||||
},
|
||||
"execution_count": null,
|
||||
"outputs": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"We are now ready to dive into hyperparameter tuning!"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "sjyVRth8uSVm"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"## Hyperparameter Tuning"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "kz0QGAOnust1"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"First, let's define a `run_training()` function that trains a single model (using a particular combination of hyperparameter values), and returns a score. \n",
|
||||
"\n",
|
||||
"The score tells us how good the particular combination of hyperparameters is. (In our case, the score is `mean_reward - std_reward`, which is being used in the <a href=\"https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard\" target=\"_blank\">leaderboard</a>.) \n",
|
||||
"\n",
|
||||
"The function takes a very special argument - `params`, which is a dictionary. **The keys of this dictionary are the names of the hyperparameters we're tuning**, and **the values are sampled at each trial by Optuna's sampler** (from ranges that we'll specify soon).\n",
|
||||
"\n",
|
||||
"For example, in a particular trial, `params` might look like this:\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"{'n_epochs': 5, 'gamma': 0.9926, 'total_timesteps': 559_621}\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"And in another trial, `params` might look like this:\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"{'n_epochs': 3, 'gamma': 0.9974, 'total_timesteps': 1_728_482}\n",
|
||||
"```"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "02BT2bxVXEHj"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"source": [
|
||||
"def run_training(params, verbose=0, save_model=False):\n",
|
||||
" model = PPO(\n",
|
||||
" policy='MlpPolicy', \n",
|
||||
" env=env, \n",
|
||||
" n_steps=1024,\n",
|
||||
" batch_size=64, \n",
|
||||
" n_epochs=params['n_epochs'], # We're tuning this.\n",
|
||||
" gamma=params['gamma'], # We're tuning this.\n",
|
||||
" gae_lambda=0.98, \n",
|
||||
" ent_coef=0.01, \n",
|
||||
" verbose=verbose\n",
|
||||
" )\n",
|
||||
" model.learn(total_timesteps=params['total_timesteps']) # We're tuning this.\n",
|
||||
"\n",
|
||||
" mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=50, deterministic=True)\n",
|
||||
" score = mean_reward - std_reward\n",
|
||||
"\n",
|
||||
" if save_model:\n",
|
||||
" model.save(\"PPO-LunarLander-v2\")\n",
|
||||
"\n",
|
||||
" return model, score"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "kpDGvnBS6t57"
|
||||
},
|
||||
"execution_count": null,
|
||||
"outputs": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"Next, we define another function - `objective()`. This function has a single parameter `trial`, which is an object of type `optuna.trial.Trial`. Using this `trial` object, we specify the ranges for the different hyperparameters we want to explore:\n",
|
||||
"\n",
|
||||
"- For `n_epochs`: We want to explore integer values between `3` and `5`.\n",
|
||||
"- For `gamma`: We want to explore floating point values between `0.9900` and `0.9999` (drawn from a uniform distribution).\n",
|
||||
"- For `total_timesteps`: We want to explore integer values between `500_000` and `2_000_000`.\n",
|
||||
"\n",
|
||||
"**Note:** If you have more time available, then you can tune other hyperparameters too. Moreover, you can explore wider ranges for each hyperparameter.\n",
|
||||
"\n",
|
||||
"The `trial.suggest_int()` and `trial.suggest_uniform()` methods are used by Optuna to suggest hyperparamter values in the ranges specified. The suggested combination of values are then used to train a model and return the score."
|
||||
],
|
||||
"metadata": {
|
||||
"id": "dORGHcVYdSKp"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"source": [
|
||||
"def objective(trial):\n",
|
||||
" params = {\n",
|
||||
" \"n_epochs\": trial.suggest_int(\"n_epochs\", 3, 5), \n",
|
||||
" \"gamma\": trial.suggest_uniform(\"gamma\", 0.9900, 0.9999), \n",
|
||||
" \"total_timesteps\": trial.suggest_int(\"total_timesteps\", 500_000, 2_000_000)\n",
|
||||
" }\n",
|
||||
" model, score = run_training(params)\n",
|
||||
" return score"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "Wapg9hTI-AGz"
|
||||
},
|
||||
"execution_count": null,
|
||||
"outputs": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"Finally, we use Optuna's `create_study()` function to create a study, passing in:\n",
|
||||
"\n",
|
||||
"- `sampler=TPESampler()`: This specifies that we want to employ a Bayesian optimization algorithm called Tree-structured Parzen Estimator. Other options are `GridSampler()`, `RandomSampler()`, etc. (The full list can be found <a href=\"https://optuna.readthedocs.io/en/stable/reference/samplers.html\" target=\"_blank\">here</a>.)\n",
|
||||
"- `study_name=\"PPO-LunarLander-v2\"`: This is a name we give to the study (optional).\n",
|
||||
"- `direction=\"maximize\"`: This is to specify that our objective is to maximize (not minimize) the score.\n",
|
||||
"\n",
|
||||
"Once our study is created, we call the `optimize()` method on it, specifying that we want to conduct `10` trials.\n",
|
||||
"\n",
|
||||
"**Note:** If you have more time available, then you can conduct more than `10` trials.\n",
|
||||
"\n",
|
||||
"**Warning:** The below code cell will take quite a bit of time to run!"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "a2gToNx7gPEG"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"source": [
|
||||
"study = optuna.create_study(sampler=TPESampler(), study_name=\"PPO-LunarLander-v2\", direction=\"maximize\")\n",
|
||||
"study.optimize(objective, n_trials=10)"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "0v9G57g4-gl_"
|
||||
},
|
||||
"execution_count": null,
|
||||
"outputs": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"Now that all the `10` trials have concluded, let's print out the score and hyperparameters of the best trial."
|
||||
],
|
||||
"metadata": {
|
||||
"id": "9qRTUBAExSLt"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"source": [
|
||||
"print(\"Best trial score:\", study.best_trial.values)\n",
|
||||
"print(\"Best trial hyperparameters:\", study.best_trial.params)"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "N-JFso9Lvurh"
|
||||
},
|
||||
"execution_count": null,
|
||||
"outputs": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"## Recreating & Saving The Best Model"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "qiSBZmZDux4q"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"Let's recreate the best model and save it."
|
||||
],
|
||||
"metadata": {
|
||||
"id": "SuZ4OSayyq36"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"source": [
|
||||
"model, score = run_training(study.best_trial.params, verbose=1, save_model=True)"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "8JVUHgljIW1E"
|
||||
},
|
||||
"execution_count": null,
|
||||
"outputs": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"## Pushing to Hugging Face Hub"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "0eja4YaFu5Vb"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"To be able to share your model with the community, there are three more steps to follow:\n",
|
||||
"\n",
|
||||
"1. (If not done already) create a Hugging Face account -> https://huggingface.co/join\n",
|
||||
"\n",
|
||||
"2. Sign in and then, get your authentication token from the Hugging Face website.\n",
|
||||
"\n",
|
||||
"- Create a new token (https://huggingface.co/settings/tokens) **with write role**.\n",
|
||||
"- Copy the token.\n",
|
||||
"- Run the cell below and paste the token."
|
||||
],
|
||||
"metadata": {
|
||||
"id": "qDjEim2izuFi"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"source": [
|
||||
"notebook_login()\n",
|
||||
"!git config --global credential.helper store"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "msFVH2qhzwpE"
|
||||
},
|
||||
"execution_count": null,
|
||||
"outputs": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"If you aren't using Google Colab or Jupyter Notebook, you need to use this command instead: `huggingface-cli login`\n",
|
||||
"\n",
|
||||
"3. We're now ready to push our trained agent to the Hub using the `package_to_hub()` function.\n",
|
||||
"\n",
|
||||
"Let's fill in the arguments of the `package_to_hub` function:\n",
|
||||
"\n",
|
||||
"- `model`: our trained model\n",
|
||||
"\n",
|
||||
"- `model_name`: the name of the trained model that we defined in `model.save()`\n",
|
||||
"\n",
|
||||
"- `model_architecture`: the model architecture we used (in our case `\"PPO\"`)\n",
|
||||
"\n",
|
||||
"- `env_id`: the name of the environment (in our case `\"LunarLander-v2\"`)\n",
|
||||
"\n",
|
||||
"- `eval_env`: the evaluation environment\n",
|
||||
"\n",
|
||||
"- `repo_id`: the name of the Hugging Face Hub repository that will be created/updated `(repo_id=\"{username}/{repo_name}\")` (**Note:** A good `repo_id` is `\"{username}/{model_architecture}-{env_id}\"`.)\n",
|
||||
"\n",
|
||||
"- `commit_message`: the commit message"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "lD4ACH160L_5"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"source": [
|
||||
"model_name = \"PPO-LunarLander-v2\"\n",
|
||||
"model_architecture = \"PPO\"\n",
|
||||
"env_id = \"LunarLander-v2\"\n",
|
||||
"eval_env = DummyVecEnv([lambda: gym.make(env_id)])\n",
|
||||
"repo_id = \"Sadhaklal/PPO-LunarLander-v2\"\n",
|
||||
"commit_message = \"Upload best PPO LunarLander-v2 agent (tuned with Optuna).\""
|
||||
],
|
||||
"metadata": {
|
||||
"id": "8JkhrjAt0O6a"
|
||||
},
|
||||
"execution_count": null,
|
||||
"outputs": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"The following function call will evaluate the agent, record a replay, generate a model card, and push your agent to the Hub."
|
||||
],
|
||||
"metadata": {
|
||||
"id": "3HIT-M2C1l4E"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"source": [
|
||||
"package_to_hub(\n",
|
||||
" model=model, \n",
|
||||
" model_name=model_name, \n",
|
||||
" model_architecture=model_architecture, \n",
|
||||
" env_id=env_id, \n",
|
||||
" eval_env=eval_env, \n",
|
||||
" repo_id=repo_id, \n",
|
||||
" commit_message=commit_message\n",
|
||||
")"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "1D0wQ_PU1mTN"
|
||||
},
|
||||
"execution_count": null,
|
||||
"outputs": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"That's it! You now know how to perform hyperparameter tuning of Stable-Baselines3 models using Optuna.\n",
|
||||
"\n",
|
||||
"To get even better results, try tuning the other hyperparameters of your model."
|
||||
],
|
||||
"metadata": {
|
||||
"id": "7jLc9JDY11EW"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"## Final Tips"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "Fa4TtHLAtq7-"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"1. Read the <a href=\"https://optuna.readthedocs.io/en/stable/index.html\" target=\"_blank\">Optuna documentation</a> to get more familiar with the library and its features.\n",
|
||||
"2. You may have noticed that hyperparameter tuning is a time consuming process. However, it can be sped up significantly using parallelization. Check out <a href=\"https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/004_distributed.html\" target=\"_blank\">this guide</a> on how to do so."
|
||||
],
|
||||
"metadata": {
|
||||
"id": "tEgxhSjf4loa"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"source": [
|
||||
""
|
||||
],
|
||||
"metadata": {
|
||||
"id": "C7CJ7x7n4HuA"
|
||||
},
|
||||
"execution_count": null,
|
||||
"outputs": []
|
||||
}
|
||||
]
|
||||
}