Merge branch 'huggingface:main' into main

This commit is contained in:
heispv
2023-01-06 14:19:23 +03:30
committed by GitHub
34 changed files with 4869 additions and 1584 deletions


@@ -0,0 +1,5 @@
stable-baselines3[extra]
box2d
box2d-kengz
huggingface_sb3
pyglet==1.5.1


@@ -247,7 +247,7 @@
},
"outputs": [],
"source": [
"!pip install -r https://huggingface.co/spaces/ThomasSimonini/temp-space-requirements/raw/main/requirements/requirements-unit1.txt"
"!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit1/requirements-unit1.txt"
]
},
{


@@ -127,7 +127,7 @@
"source": [
"# Let's train a Deep Q-Learning agent playing Atari' Space Invaders 👾 and upload it to the Hub.\n",
"\n",
"To validate this hands-on for the certification process, you need to push your trained model to the Hub and **get a result of >= 500**.\n",
"To validate this hands-on for the certification process, you need to push your trained model to the Hub and **get a result of >= 200**.\n",
"\n",
"To find your result, go to the leaderboard and find your model, **the result = mean_reward - std of reward**\n",
"\n",
@@ -799,4 +799,4 @@
},
"nbformat": 4,
"nbformat_minor": 0
}
}


@@ -0,0 +1,6 @@
gym
git+https://github.com/ntasfi/PyGame-Learning-Environment.git
git+https://github.com/qlan3/gym-games.git
huggingface_hub
imageio-ffmpeg
pyyaml==6.0

notebooks/unit4/unit4.ipynb (new file, 1614 lines)

File diff suppressed because it is too large

File diff suppressed because one or more lines are too long


@@ -46,6 +46,10 @@
title: Play with Huggy
- local: unitbonus1/conclusion
title: Conclusion
- title: Live 1. How the course works, Q&A, and playing with Huggy
sections:
- local: live1/live1
title: Live 1. How the course works, Q&A, and playing with Huggy 🐶
- title: Unit 2. Introduction to Q-Learning
sections:
- local: unit2/introduction
@@ -88,6 +92,8 @@
title: The Deep Q-Network (DQN)
- local: unit3/deep-q-algorithm
title: The Deep Q Algorithm
- local: unit3/glossary
title: Glossary
- local: unit3/hands-on
title: Hands-on
- local: unit3/quiz
@@ -96,7 +102,7 @@
title: Conclusion
- local: unit3/additional-readings
title: Additional Readings
- title: Unit Bonus 2. Automatic Hyperparameter Tuning with Optuna
- title: Bonus Unit 2. Automatic Hyperparameter Tuning with Optuna
sections:
- local: unitbonus2/introduction
title: Introduction
@@ -104,8 +110,27 @@
title: Optuna
- local: unitbonus2/hands-on
title: Hands-on
- title: Unit 4. Policy Gradient with PyTorch
sections:
- local: unit4/introduction
title: Introduction
- local: unit4/what-are-policy-based-methods
title: What are the policy-based methods?
- local: unit4/advantages-disadvantages
title: The advantages and disadvantages of policy-gradient methods
- local: unit4/policy-gradient
title: Diving deeper into policy-gradient
- local: unit4/pg-theorem
title: (Optional) the Policy Gradient Theorem
- local: unit4/hands-on
title: Hands-on
- local: unit4/quiz
title: Quiz
- local: unit4/conclusion
title: Conclusion
- local: unit4/additional-readings
title: Additional Readings
- title: What's next? New Units Publishing Schedule
sections:
- local: communication/publishing-schedule
title: Publishing Schedule


@@ -1,6 +1,6 @@
# Publishing Schedule [[publishing-schedule]]
We publish a **new unit every Monday** (except Monday, the 26th of December).
We publish a **new unit every Tuesday**.
If you don't want to miss any of the updates, don't forget to:

units/en/live1/live1.mdx (new file, 9 lines)

@@ -0,0 +1,9 @@
# Live 1: How the course works, Q&A, and playing with Huggy
In this first live stream, we explained how the course works (scope, units, challenges, and more) and answered your questions.
And finally, we watched some LunarLander agents you've trained and played with your Huggies 🐶
<Youtube id="JeJIswxyrsM" />
To know when the next live is scheduled, **check the Discord server**. We will also send **you an email**. If you can't participate, don't worry, we record the live sessions.


@@ -9,7 +9,13 @@ Discord is a free chat platform. If you've used Slack, **it's quite similar**. T
Starting in Discord can be a bit intimidating, so let me take you through it.
When you sign up to our Discord server, you'll need to specify which topics you're interested in by **clicking #role-assignment at the left**. Here, you can pick different categories. Make sure to **click "Reinforcement Learning"**! :fire:. You'll then get to **introduce yourself in the `#introduction-yourself` channel**.
When you sign up to our Discord server, you'll need to specify which topics you're interested in by **clicking #role-assignment at the left**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/discord1.jpg" alt="Discord"/>
In #role-assignment, you can pick different categories. Make sure to **click "Reinforcement Learning"**. You'll then get to **introduce yourself in the `#introduction-yourself` channel**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/discord2.jpg" alt="Discord"/>
## So which channels are interesting to me? [[channels]]


@@ -23,7 +23,7 @@ In this course, you will:
- 📖 Study Deep Reinforcement Learning in **theory and practice.**
- 🧑‍💻 Learn to **use famous Deep RL libraries** such as [Stable Baselines3](https://stable-baselines3.readthedocs.io/en/master/), [RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo), [Sample Factory](https://samplefactory.dev/) and [CleanRL](https://github.com/vwxyzjn/cleanrl).
- 🤖 **Train agents in unique environments** such as [SnowballFight](https://huggingface.co/spaces/ThomasSimonini/SnowballFight), [Huggy the Doggo 🐶](https://huggingface.co/spaces/ThomasSimonini/Huggy), [MineRL (Minecraft )](https://minerl.io/), [VizDoom (Doom)](https://vizdoom.cs.put.edu.pl/) and classical ones such as [Space Invaders](https://www.gymlibrary.dev/environments/atari/) and [PyBullet](https://pybullet.org/wordpress/).
- 🤖 **Train agents in unique environments** such as [SnowballFight](https://huggingface.co/spaces/ThomasSimonini/SnowballFight), [Huggy the Doggo 🐶](https://huggingface.co/spaces/ThomasSimonini/Huggy), [VizDoom (Doom)](https://vizdoom.cs.put.edu.pl/) and classical ones such as [Space Invaders](https://www.gymlibrary.dev/environments/atari/), [PyBullet](https://pybullet.org/wordpress/) and more.
- 💾 Share your **trained agents with one line of code to the Hub** and also download powerful agents from the community.
- 🏆 Participate in challenges where you will **evaluate your agents against other teams. You'll also get to play against the agents you'll train.**
@@ -58,7 +58,8 @@ You can choose to follow this course either:
Both paths **are completely free**.
Whatever path you choose, we advise you **to follow the recommended pace to enjoy the course and challenges with your fellow classmates.**
You don't need to tell us which path you choose. At the end of March, when we verify the assignments **if you get more than 80% of the assignments done, you'll get a certificate.**
You don't need to tell us which path you choose. At the end of March, when we will verify the assignments **if you get more than 80% of the assignments done, you'll get a certificate.**
## The Certification Process [[certification-process]]
@@ -92,7 +93,7 @@ You need only 3 things:
## What is the publishing schedule? [[publishing-schedule]]
We publish **a new unit every Monday** (except Monday, the 26th of December).
We publish **a new unit every Tuesday**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/communication/schedule1.png" alt="Schedule 1" width="100%"/>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/communication/schedule2.png" alt="Schedule 2" width="100%"/>
@@ -128,7 +129,7 @@ In this new version of the course, you have two types of challenges:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/challenges.jpg" alt="Challenges" width="100%"/>
These AI vs. AI challenges will be announced **later in December**.
These AI vs. AI challenges will be announced **in January**.
## I found a bug, or I want to improve the course [[contribute]]


@@ -18,9 +18,10 @@ You can now sign up for our Discord Server. This is the place where you **can ex
When you join, remember to introduce yourself in #introduce-yourself and sign up for reinforcement channels in #role-assignments.
We have multiple RL-related channels:
- `rl-announcements`: where we give the last information about the course.
- `rl-announcements`: where we give the latest information about the course.
- `rl-discussions`: where you can discuss RL and share information.
- `rl-study-group`: where you can create and join study groups.
- `rl-i-made-this`: where you can share your projects and models.
If this is your first time using Discord, we wrote a Discord 101 to get the best practices. Check the next section.


@@ -12,5 +12,10 @@ In the next (bonus) unit, we're going to reinforce what we just learned by **t
You will be able then to play with him 🤗.
<video src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/huggy.mp4" alt="Huggy" type="video/mp4">
</video>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/huggy.jpg" alt="Huggy"/>
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
### Keep Learning, stay awesome 🤗


@@ -139,7 +139,7 @@ To make things easier, we created a script to install all these dependencies.
```
```python
!pip install -r https://huggingface.co/spaces/ThomasSimonini/temp-space-requirements/raw/main/requirements/requirements-unit1.txt
!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit1/requirements-unit1.txt
```
During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames).


@@ -22,7 +22,6 @@ It's essential **to master these elements** before diving into implementing Dee
After this unit, in a bonus unit, you'll be **able to train Huggy the Dog 🐶 to fetch the stick and play with him 🤗**.
<video src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/huggy.mp4" alt="Huggy" type="video/mp4">
</video>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/huggy.jpg" alt="Huggy"/>
So let's get started! 🚀


@@ -15,5 +15,7 @@ In the next chapter, we're going to dive deeper by studying our first Deep Rei
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Atari environments"/>
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
### Keep Learning, stay awesome 🤗


@@ -13,9 +13,22 @@ This is a community-created glossary. Contributions are welcomed!
- **The state-value function.** For each state, the state-value function is the expected return if the agent starts in that state and follows the policy until the end.
- **The action-value function.** In contrast to the state-value function, the action-value calculates for each state and action pair the expected return if the agent starts in that state and takes an action. Then it follows the policy forever after.
### Epsilon-greedy strategy:
- Common exploration strategy used in reinforcement learning that involves balancing exploration and exploitation.
- Chooses the action with the highest expected reward with a probability of 1-epsilon.
- Chooses a random action with a probability of epsilon.
- Epsilon is typically decreased over time to shift focus towards exploitation.
### Greedy strategy:
- Involves always choosing the action that is expected to lead to the highest reward, based on the current knowledge of the environment. (only exploitation)
- Always chooses the action with the highest expected reward.
- Does not include any exploration.
- Can be disadvantageous in environments with uncertainty or unknown optimal actions.
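To make both strategies concrete, here is a minimal, illustrative sketch in Python (the `q_table` array indexed by `[state, action]` is an assumption for illustration, not code from the course):

```python
import numpy as np

def epsilon_greedy_action(q_table, state, epsilon):
    """Epsilon-greedy action selection.

    With probability epsilon we explore (random action),
    otherwise we exploit (the action with the highest Q-value).
    """
    n_actions = q_table.shape[1]
    if np.random.random() < epsilon:
        return np.random.randint(n_actions)    # explore
    return int(np.argmax(q_table[state]))      # exploit

# The greedy strategy is the special case epsilon = 0:
# greedy_action = epsilon_greedy_action(q_table, state, epsilon=0.0)
```

In practice, epsilon typically starts near 1.0 and is decayed towards a small value during training, shifting the agent from exploration to exploitation.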
If you want to improve the course, you can [open a Pull Request.](https://github.com/huggingface/deep-rl-class/pulls)
This glossary was made possible thanks to:
- [Ramón Rueda](https://github.com/ramon-rd)
- [Hasarindu Perera](https://github.com/hasarinduperera/)


@@ -62,7 +62,7 @@ For each state, the state-value function outputs the expected return if the agen
In the action-value function, for each state and action pair, the action-value function **outputs the expected return** if the agent starts in that state and takes action, and then follows the policy forever after.
The value of taking action an in state \\(s\\) under a policy \\(π\\) is:
The value of taking action \\(a\\) in state \\(s\\) under a policy \\(π\\) is:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/action-state-value-function-1.jpg" alt="Action State value function"/>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/action-state-value-function-2.jpg" alt="Action State value function"/>


@@ -11,4 +11,7 @@ Don't hesitate to train your agent in other environments (Pong, Seaquest, QBert,
In the next unit, **we're going to learn about Optuna**. One of the most critical tasks in Deep Reinforcement Learning is finding a good set of training hyperparameters. And Optuna is a library that helps you automate the search.
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
### Keep Learning, stay awesome 🤗


@@ -30,7 +30,7 @@ No, because one frame is not enough to have a sense of motion! But what if I add
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/temporal-limitation-2.jpg" alt="Temporal Limitation"/>
That's why, to capture temporal information, we stack four frames together.
Then, the stacked frames are processed by three convolutional layers. These layers **allow us to capture and exploit spatial relationships in images**. But also, because frames are stacked together, **you can exploit some spatial properties across those frames**.
Then, the stacked frames are processed by three convolutional layers. These layers **allow us to capture and exploit spatial relationships in images**. But also, because frames are stacked together, **you can exploit some temporal properties across those frames**.
If you don't know what convolutional layers are, don't worry. You can check [Lesson 4 of this free Deep Reinforcement Learning Course by Udacity](https://www.udacity.com/course/deep-learning-pytorch--ud188)
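As a rough, hypothetical sketch of this kind of architecture (inspired by the classic DQN network, not the exact model used in the hands-on), the four stacked 84x84 frames can be processed like this:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 grayscale 84x84 frames to one Q-value per action."""
    def __init__(self, n_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 4 stacked frames in
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),                               # one Q-value per action
        )

    def forward(self, frames):   # frames: (batch, 4, 84, 84)
        return self.head(self.conv(frames))
```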


@@ -13,7 +13,7 @@ Internally, our Q-function has **a Q-table, a table where each cell corresponds
The problem is that Q-Learning is a *tabular method*: it only works when the state and action spaces **are small enough for the value functions to be represented as arrays and tables**. And this is **not scalable**.
Q-Learning worked well with small state space environments like:
- FrozenLake, we had 14 states.
- FrozenLake, we had 16 states.
- Taxi-v3, we had 500 states.
But think of what we're going to do today: we will train an agent to learn to play Space Invaders, a more complex game, using the frames as input.
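As a back-of-the-envelope illustration of why the tabular approach stops scaling (the Atari numbers below just use the standard 210x160 RGB frame size to make the point):

```python
import numpy as np

# Tabular Q-Learning: one table cell per (state, action) pair
frozenlake_q = np.zeros((16, 4))   # 16 states x 4 actions  = 64 values
taxi_q       = np.zeros((500, 6))  # 500 states x 6 actions = 3000 values

# Atari Space Invaders: a single 210x160 RGB frame already has
# 256 ** (210 * 160 * 3) possible configurations -- far too many for a table,
# so we approximate the Q-values with a neural network instead.
```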


@@ -0,0 +1,39 @@
# Glossary
This is a community-created glossary. Contributions are welcomed!
- **Tabular Method:** type of problem in which the state and action spaces are small enough for the value functions to be represented as arrays and tables.
**Q-Learning** is an example of a tabular method, since a table is used to represent the values for the different state-action pairs.
- **Deep Q-Learning:** method that trains a neural network to approximate, given a state, the different **Q-values** for each possible action at that state.
It is used to solve problems where the observation space is too big to apply a tabular Q-Learning approach.
- **Temporal Limitation:** a difficulty that arises when the environment state is represented by frames. A frame by itself does not provide temporal information.
In order to obtain temporal information, we need to **stack** a number of frames together.
- **Phases of Deep Q-Learning:**
- **Sampling:** actions are performed, and observed experience tuples are stored in a **replay memory**.
- **Training:** batches of tuples are selected randomly and the neural network updates its weights using gradient descent.
- **Solutions to stabilize Deep Q-Learning:**
- **Experience Replay:** a replay memory is created to save experience samples that can be reused during training.
This allows the agent to learn from the same experiences multiple times. It also helps the agent avoid forgetting previous experiences as it gets new ones.
**Random sampling** from the replay buffer removes correlation in the observation sequences and prevents action values from oscillating or diverging
catastrophically.
- **Fixed Q-Target:** In order to calculate the **Q-Target**, we need to estimate the discounted optimal **Q-value** of the next state by using the Bellman equation. The problem
is that the same network weights are used to calculate the **Q-Target** and the **Q-value**. This means that every time we modify the **Q-value**, the **Q-Target** also moves with it.
To avoid this issue, a separate network with fixed parameters is used to estimate the Temporal Difference target. The target network is updated by copying the parameters from
our Deep Q-Network every **C steps**.
- **Double DQN:** method to handle **overestimation** of **Q-Values**. This solution uses two networks to decouple the action selection from the target **Q-Value generation**:
- **DQN Network** to select the best action to take for the next state (the action with the highest **Q-Value**).
- **Target Network** to calculate the target **Q-Value** of taking that action at the next state.
This approach reduces **Q-Value** overestimation, helps the agent train faster, and leads to more stable learning.
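To tie these pieces together, here is a minimal, illustrative sketch (the `q_network`, `target_network`, and batch layout are assumptions, not the course's actual implementation) of how the Double DQN target is computed with a fixed target network on a batch sampled from the replay buffer:

```python
import torch
import torch.nn.functional as F

def double_dqn_loss(q_network, target_network, batch, gamma=0.99):
    """Double DQN loss on one batch of (s, a, r, s', done) tuples."""
    states, actions, rewards, next_states, dones = batch

    # Q-values of the actions that were actually taken
    q_values = q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # The online network selects the best next action...
        next_actions = q_network(next_states).argmax(dim=1, keepdim=True)
        # ...and the fixed target network evaluates it
        next_q = target_network(next_states).gather(1, next_actions).squeeze(1)
        q_target = rewards + gamma * (1 - dones) * next_q

    return F.mse_loss(q_values, q_target)

# Every C steps: target_network.load_state_dict(q_network.state_dict())
```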
If you want to improve the course, you can [open a Pull Request.](https://github.com/huggingface/deep-rl-class/pulls)
This glossary was made possible thanks to:
- [Dario Paez](https://github.com/dario248)


@@ -18,7 +18,7 @@ We're using the [RL-Baselines-3 Zoo integration](https://github.com/DLR-RM/rl-ba
Also, **if you want to learn to implement Deep Q-Learning by yourself after this hands-on**, you definitely should look at CleanRL implementation: https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn_atari.py
To validate this hands-on for the certification process, you need to push your trained model to the Hub and **get a result of >= 500**.
To validate this hands-on for the certification process, you need to push your trained model to the Hub and **get a result of >= 200**.
To find your result, go to the leaderboard and find your model, **the result = mean_reward - std of reward**
@@ -68,13 +68,6 @@ Before diving into the notebook, you need to:
We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the Github Repo](https://github.com/huggingface/deep-rl-class/issues).
# Let's train a Deep Q-Learning agent playing Atari's Space Invaders 👾 and upload it to the Hub.
To validate this hands-on for the certification process, you need to push your trained model to the Hub and **get a result of >= 500**.
To find your result, go to the leaderboard and find your model, **the result = mean_reward - std of reward**
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
## Set the GPU 💪
- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`


@@ -0,0 +1,20 @@
# Additional Readings
These are **optional readings** if you want to go deeper.
## Introduction to Policy Optimization
- [Part 3: Intro to Policy Optimization - Spinning Up documentation](https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html)
## Policy Gradient
- [https://johnwlambert.github.io/policy-gradients/](https://johnwlambert.github.io/policy-gradients/)
- [RL - Policy Gradient Explained](https://jonathan-hui.medium.com/rl-policy-gradients-explained-9b13b688b146)
- [Chapter 13, Policy Gradient Methods; Reinforcement Learning, an introduction by Richard Sutton and Andrew G. Barto](http://incompleteideas.net/book/RLbook2020.pdf)
## Implementation
- [PyTorch Reinforce implementation](https://github.com/pytorch/examples/blob/main/reinforcement_learning/reinforce.py)
- [Implementations from DDPG to PPO](https://github.com/MrSyee/pg-is-all-you-need)


@@ -0,0 +1,74 @@
# The advantages and disadvantages of policy-gradient methods
At this point, you might ask, "but Deep Q-Learning is excellent! Why use policy-gradient methods?". To answer this question, let's study the **advantages and disadvantages of policy-gradient methods**.
## Advantages
There are multiple advantages over value-based methods. Let's see some of them:
### The simplicity of integration
We can estimate the policy directly without storing additional data (action values).
### Policy-gradient methods can learn a stochastic policy
Policy-gradient methods can **learn a stochastic policy while value functions can't**.
This has two consequences:
1. We **don't need to implement an exploration/exploitation trade-off by hand**. Since we output a probability distribution over actions, the agent explores **the state space without always taking the same trajectory.**
2. We also get rid of the problem of **perceptual aliasing**. Perceptual aliasing is when two states seem (or are) the same but need different actions.
Let's take an example: we have an intelligent vacuum cleaner whose goal is to suck the dust and avoid killing the hamsters.
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/hamster1.jpg" alt="Hamster 1"/>
</figure>
Our vacuum cleaner can only perceive where the walls are.
The problem is that the **two rose cases are aliased states because the agent perceives an upper and lower wall for each**.
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/hamster2.jpg" alt="Hamster 1"/>
</figure>
Under a deterministic policy, the policy will either always move right when in a red state or always move left. **Either case will cause our agent to get stuck and never suck the dust**.
Under a value-based Reinforcement Learning algorithm, we learn a **quasi-deterministic policy** ("epsilon-greedy strategy"). Consequently, our agent can **spend a lot of time before finding the dust**.
On the other hand, an optimal stochastic policy **will randomly move left or right in rose states**. Consequently, **it will not be stuck and will reach the goal state with a high probability**.
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/hamster3.jpg" alt="Hamster 1"/>
</figure>
### Policy-gradient methods are more effective in high-dimensional action spaces and continuous action spaces
The problem with Deep Q-Learning is that its **predictions assign a score (maximum expected future reward) to each possible action**, at each time step, given the current state.
But what if we have an infinite number of possible actions?
For instance, with a self-driving car, at each state, you can have a (near) infinite choice of actions (turning the wheel at 15°, 17.2°, 19.4°, honking, etc.). **We'd need to output a Q-value for each possible action**! And **taking the max action of a continuous output is an optimization problem in itself**!
Instead, with policy-gradient methods, we output a **probability distribution over actions.**
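For intuition, here is a small, hypothetical sketch of what "outputting a probability distribution over actions" can look like for a single continuous action such as a steering angle (a Gaussian policy; the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Outputs a Normal distribution over one continuous action (e.g., a steering angle)."""
    def __init__(self, state_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 2))

    def forward(self, state):
        mean, log_std = self.net(state).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.exp())

policy = GaussianPolicy(state_dim=8)
dist = policy(torch.randn(1, 8))
action = dist.sample()            # no need to take a max over infinitely many actions
log_prob = dist.log_prob(action)  # used later in the policy-gradient update
```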
### Policy-gradient methods have better convergence properties
In value-based methods, we use an aggressive operator to **change the value function: we take the maximum over Q-estimates**.
Consequently, the action probabilities may change dramatically for an arbitrarily small change in the estimated action values if that change results in a different action having the maximal value.
For instance, if during training the best action was left (with a Q-value of 0.22), and after the next training step it becomes right (since the right action's Q-value becomes 0.23), we have dramatically changed the policy, since it will now take right most of the time instead of left.
On the other hand, in policy-gradient methods, stochastic policy action preferences (probability of taking action) **change smoothly over time**.
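A tiny numeric illustration of this difference (softmax over the Q-values is used here only to mimic smooth action preferences, not how either algorithm is actually parameterized):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

q_before = np.array([0.22, 0.21])   # [left, right]
q_after  = np.array([0.22, 0.23])   # right barely overtakes left

# Greedy (value-based) choice flips completely:
print(np.argmax(q_before), np.argmax(q_after))   # 0 -> 1

# Smooth action preferences barely move:
print(softmax(q_before), softmax(q_after))       # ~[0.50, 0.50] both times
```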
## Disadvantages
Naturally, policy-gradient methods also have some disadvantages:
- **Frequently, policy-gradient converges on a local maximum instead of a global optimum.**
- Policy-gradient goes slower, **step by step: it can take longer to train (inefficient).**
- Policy-gradient can have high variance. We'll see in the Actor-Critic unit why this happens and how we can solve it.
👉 If you want to go deeper into the advantages and disadvantages of policy-gradient methods, [you can check this video](https://youtu.be/y3oqOjHilio).


@@ -0,0 +1,17 @@
# Conclusion
**Congrats on finishing this unit**! There was a lot of information.
And congrats on finishing the tutorial. You've just coded your first Deep Reinforcement Learning agent from scratch using PyTorch and shared it on the Hub 🥳.
Don't hesitate to iterate on this unit **by improving the implementation for more complex environments** (for instance, what about changing the network to a Convolutional Neural Network to handle
frames as observations?).
In the next unit, **we're going to learn more about Unity ML-Agents**, by training agents in Unity environments. This way, you will be ready to participate in the **AI vs AI challenges where you'll train your agents
to compete against other agents in a snowball fight and a soccer game.**
Sounds fun? See you next time!
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
### Keep Learning, stay awesome 🤗

units/en/unit4/hands-on.mdx (new file, 1012 lines)

File diff suppressed because it is too large


@@ -0,0 +1,24 @@
# Introduction [[introduction]]
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/thumbnail.png" alt="thumbnail"/>
In the last unit, we learned about Deep Q-Learning. In this value-based deep reinforcement learning algorithm, we **used a deep neural network to approximate the different Q-values for each possible action at a state.**
Since the beginning of the course, we only studied value-based methods, **where we estimate a value function as an intermediate step towards finding an optimal policy.**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" />
In value-based methods, the policy **\\(π\\) only exists because of the action value estimates, since the policy is just a function** (for instance, a greedy policy) that selects the action with the highest value given a state.
But, with policy-based methods, we want to optimize the policy directly **without having an intermediate step of learning a value function.**
So today, **we'll learn about policy-based methods and study a subset of these methods called policy gradient**. Then we'll implement our first policy gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch.
Then, we'll test its robustness using the CartPole-v1 and PixelCopter environments.
You'll then be able to iterate and improve this implementation for more advanced environments.
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/envs.gif" alt="Environments"/>
</figure>
Let's get started,


@@ -0,0 +1,82 @@
# (Optional) the Policy Gradient Theorem
In this optional section, we're **going to study how we differentiate the objective function that we use to approximate the policy gradient**.
Let's first recap our different formulas:
1. The Objective function
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/expected_reward.png" alt="Return"/>
2. The probability of a trajectory (given that action comes from \\(\pi_\theta\\)):
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/probability.png" alt="Probability"/>
So we have:
\\(\nabla_\theta J(\theta) = \nabla_\theta \sum_{\tau}P(\tau;\theta)R(\tau)\\)
We can rewrite the gradient of the sum as the sum of the gradient:
\\( = \sum_{\tau} \nabla_\theta P(\tau;\theta)R(\tau) \\)
We then multiply every term in the sum by \\(\frac{P(\tau;\theta)}{P(\tau;\theta)}\\) (which is possible since it equals 1)
\\( = \sum_{\tau} \frac{P(\tau;\theta)}{P(\tau;\theta)}\nabla_\theta P(\tau;\theta)R(\tau) \\)
We can simplify this further since \\( \frac{P(\tau;\theta)}{P(\tau;\theta)}\nabla_\theta P(\tau;\theta) = P(\tau;\theta)\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)} \\)
\\(= \sum_{\tau} P(\tau;\theta) \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}R(\tau) \\)
We can then use the *derivative log trick* (also called *likelihood ratio trick* or *REINFORCE trick*), a simple rule in calculus that implies that \\( \nabla_x log f(x) = \frac{\nabla_x f(x)}{f(x)} \\)
So, given that we have \\(\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)} \\), we can rewrite it as \\(\nabla_\theta log P(\tau;\theta) \\)
So this is our likelihood policy gradient:
\\( \nabla_\theta J(\theta) = \sum_{\tau} P(\tau;\theta) \nabla_\theta log P(\tau;\theta) R(\tau) \\)
Thanks to this new formula, we can estimate the gradient using trajectory samples (we can approximate the likelihood-ratio policy gradient with a sample-based estimate, if you prefer).
\\(\nabla_\theta J(\theta) = \frac{1}{m} \sum^{m}_{i=1} \nabla_\theta log P(\tau^{(i)};\theta)R(\tau^{(i)})\\) where each \\(\tau^{(i)}\\) is a sampled trajectory.
But we still have some mathematical work to do: we need to simplify \\( \nabla_\theta log P(\tau;\theta) \\)
We know that:
\\(\nabla_\theta log P(\tau^{(i)};\theta)= \nabla_\theta log[ \mu(s_0) \prod_{t=0}^{H} P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})]\\)
Where \\(\mu(s_0)\\) is the initial state distribution and \\( P(s_{t+1}^{(i)}|s_{t}^{(i)}, a_{t}^{(i)}) \\) is the state transition dynamics of the MDP.
We know that the log of a product is equal to the sum of the logs:
\\(\nabla_\theta log P(\tau^{(i)};\theta)= \nabla_\theta \left[ \sum_{t=0}^{H} log P(s_{t+1}^{(i)}|s_{t}^{(i)} a_{t}^{(i)}) + \sum_{t=0}^{H} log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})\right]\\)
We also know that the gradient of the sum is equal to the sum of gradient:
\\( \nabla_\theta log P(\tau^{(i)};\theta)= \nabla_\theta \sum_{t=0}^{H} log P(s_{t+1}^{(i)}|s_{t}^{(i)} a_{t}^{(i)}) + \nabla_\theta \sum_{t=0}^{H} log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)}) \\)
Since neither the initial state distribution nor the state transition dynamics of the MDP depend on \\(\theta\\), the derivative of both terms is 0. So we can remove them:
Since:
\\(\nabla_\theta \sum_{t=0}^{H} log P(s_{t+1}^{(i)}|s_{t}^{(i)} a_{t}^{(i)}) = 0 \\) and \\( \nabla_\theta \mu(s_0) = 0\\)
\\(\nabla_\theta log P(\tau^{(i)};\theta) = \nabla_\theta \sum_{t=0}^{H} log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)})\\)
We can rewrite the gradient of the sum as the sum of gradients:
\\( \nabla_\theta log P(\tau^{(i)};\theta)= \sum_{t=0}^{H} \nabla_\theta log \pi_\theta(a_{t}^{(i)}|s_{t}^{(i)}) \\)
So, the final formula for estimating the policy gradient is:
\\( \nabla_{\theta} J(\theta) = \hat{g} = \frac{1}{m} \sum^{m}_{i=1} \sum^{H}_{t=0} \nabla_\theta \log \pi_\theta(a^{(i)}_{t} | s_{t}^{(i)})R(\tau^{(i)}) \\)


@@ -0,0 +1,120 @@
# Diving deeper into policy-gradient methods
## Getting the big picture
We just learned that policy-gradient methods aim to find parameters \\( \theta \\) that **maximize the expected return**.
The idea is that we have a *parameterized stochastic policy*. In our case, a neural network outputs a probability distribution over actions. The probability of taking each action is also called *action preference*.
If we take the example of CartPole-v1:
- As input, we have a state.
- As output, we have a probability distribution over actions at that state.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_based.png" alt="Policy based" />
Our goal with policy-gradient is to **control the probability distribution of actions** by tuning the policy such that **good actions (that maximize the return) are sampled more frequently in the future.**
Each time the agent interacts with the environment, we tweak the parameters such that good actions are more likely to be sampled in the future.
But **how are we going to optimize the weights using the expected return**?
The idea is that we're going to **let the agent interact during an episode**. And if we win the episode, we consider that each action taken was good and must be sampled more in the future,
since it led to the win.
So for each state-action pair, we want to increase the \\(P(a|s)\\): the probability of taking that action at that state. Or decrease if we lost.
The Policy-gradient algorithm (simplified) looks like this:
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/pg_bigpicture.jpg" alt="Policy Gradient Big Picture"/>
</figure>
Now that we got the big picture, let's dive deeper into policy-gradient methods.
## Diving deeper into policy-gradient methods
We have our stochastic policy \\(\pi\\) which has a parameter \\(\theta\\). This \\(\pi\\), given a state, **outputs a probability distribution of actions**.
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/stochastic_policy.png" alt="Policy"/>
</figure>
Where \\(\pi_\theta(a_t|s_t)\\) is the probability of the agent selecting action \\(a_t\\) from state \\(s_t\\) given our policy.
**But how do we know if our policy is good?** We need to have a way to measure it. To know that, we define a score/objective function called \\(J(\theta)\\).
### The objective function
The *objective function* gives us the **performance of the agent** given a trajectory (state action sequence without considering reward (contrary to an episode)), and it outputs the *expected cumulative reward*.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/objective.jpg" alt="Return"/>
Let's detail this formula a little bit more:
- The *expected return* (also called the expected cumulative reward) is the weighted average (where the weights are given by \\(P(\tau;\theta)\\)) of all possible values that the return \\(R(\tau)\\) can take.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/expected_reward.png" alt="Return"/>
- \\(R(\tau)\\) : Return from an arbitrary trajectory. To take this quantity and use it to calculate the expected return, we need to multiply it by the probability of each possible trajectory.
- \\(P(\tau;\theta)\\) : Probability of each possible trajectory \\(\tau\\) (that probability depends on \\( \theta\\), since it defines the policy used to select the actions of the trajectory, which has an impact on the states visited).
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/probability.png" alt="Probability"/>
- \\(J(\theta)\\) : Expected return. We calculate it by summing, over all trajectories, the probability of taking that trajectory given \\(\theta\\) multiplied by the return of that trajectory.
Our objective then is to maximize the expected cumulative rewards by finding \\(\theta \\) that will output the best action probability distributions:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/max_objective.png" alt="Max objective"/>
## Gradient Ascent and the Policy-gradient Theorem
Policy-gradient is an optimization problem: we want to find the values of \\(\theta\\) that maximize our objective function \\(J(\theta)\\), so we need to use **gradient-ascent**. It's the inverse of *gradient-descent*, since it gives the direction of the steepest increase of \\(J(\theta)\\).
(If you need a refresher on the difference between gradient descent and gradient ascent [check this](https://www.baeldung.com/cs/gradient-descent-vs-ascent) and [this](https://stats.stackexchange.com/questions/258721/gradient-ascent-vs-gradient-descent-in-logistic-regression)).
Our update step for gradient-ascent is:
\\( \theta \leftarrow \theta + \alpha * \nabla_\theta J(\theta) \\)
We can repeatedly apply this update step in the hope that \\(\theta \\) converges to the value that maximizes \\(J(\theta)\\).
However, we have two problems to obtain the derivative of \\(J(\theta)\\):
1. We can't calculate the true gradient of the objective function since it would imply calculating the probability of each possible trajectory which is computationally super expensive.
We want then to **calculate a gradient estimation with a sample-based estimate (collect some trajectories)**.
2. We have another problem that I detail in the next optional section. To differentiate this objective function, we need to differentiate the state distribution, called Markov Decision Process dynamics. This is attached to the environment. It gives us the probability of the environment going into the next state, given the current state and the action taken by the agent. The problem is that we can't differentiate it because we might not know about it.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/probability.png" alt="Probability"/>
Fortunately, we're going to use a solution called the Policy Gradient Theorem, which will help us reformulate the objective function into a differentiable function that does not involve the differentiation of the state distribution.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_gradient_theorem.png" alt="Policy Gradient"/>
If you want to understand how we derive this formula, which we will use to approximate the gradient, check the next (optional) section.
## The Reinforce algorithm (Monte Carlo Reinforce)
The Reinforce algorithm, also called Monte-Carlo policy-gradient, is a policy-gradient algorithm that **uses an estimated return from an entire episode to update the policy parameter** \\(\theta\\):
In a loop:
- Use the policy \\(\pi_\theta\\) to collect an episode \\(\tau\\)
- Use the episode to estimate the gradient \\(\hat{g} = \nabla_\theta J(\theta)\\)
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_gradient_one.png" alt="Policy Gradient"/>
</figure>
- Update the weights of the policy: \\(\theta \leftarrow \theta + \alpha \hat{g}\\)
The interpretation we can make is this one:
- \\(\nabla_\theta log \pi_\theta(a_t|s_t)\\) is the direction of **steepest increase of the (log) probability** of selecting action \\(a_t\\) from state \\(s_t\\).
This tells us **how we should change the weights of policy** if we want to increase/decrease the log probability of selecting action \\(a_t\\) at state \\(s_t\\).
- \\(R(\tau)\\): is the scoring function:
- If the return is high, it will **push up the probabilities** of the (state, action) combinations.
- Else, if the return is low, it will **push down the probabilities** of the (state, action) combinations.
We can also **collect multiple episodes (trajectories)** to estimate the gradient:
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_gradient_multiple.png" alt="Policy Gradient"/>
</figure>
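As a minimal sketch of one Reinforce update in PyTorch (assuming a `policy` network that returns a `Categorical` distribution over actions and an already-collected episode; this is illustrative, not the full hands-on implementation):

```python
import torch

def reinforce_update(policy, optimizer, states, actions, returns):
    """One Monte Carlo policy-gradient step on a collected episode.

    states:  (T, state_dim) tensor, actions: (T,) tensor,
    returns: (T,) tensor of returns used to weight each log-probability.
    """
    dist = policy(states)                # Categorical distribution over actions
    log_probs = dist.log_prob(actions)   # log pi_theta(a_t | s_t)

    # Gradient ASCENT on J(theta) == gradient DESCENT on -J(theta)
    loss = -(log_probs * returns).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```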

units/en/unit4/quiz.mdx (new file, 82 lines)

@@ -0,0 +1,82 @@
# Quiz
The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.
### Q1: What are the advantages of policy-gradient over value-based methods? (Check all that apply)
<Question
choices={[
{
text: "Policy-gradient methods can learn a stochastic policy",
explain: "",
correct: true,
},
{
text: "Policy-gradient methods are more effective in high-dimensional action spaces and continuous actions spaces",
explain: "",
correct: true,
},
{
text: "Policy-gradient converges most of the time on a global maximum.",
explain: "No, frequently, policy-gradient converges on a local maximum instead of a global optimum.",
},
]}
/>
### Q2: What is the Policy Gradient Theorem?
<details>
<summary>Solution</summary>
*The Policy Gradient Theorem* is a formula that will help us to reformulate the objective function into a differentiable function that does not involve the differentiation of the state distribution.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_gradient_theorem.png" alt="Policy Gradient"/>
</details>
### Q3: What's the difference between policy-based methods and policy-gradient methods? (Check all that apply)
<Question
choices={[
{
text: "Policy-based methods are a subset of policy-gradient methods.",
explain: "",
},
{
text: "Policy-gradient methods are a subset of policy-based methods.",
explain: "",
correct: true,
},
{
text: "In Policy-based methods, we can optimize the parameter θ **indirectly** by maximizing the local approximation of the objective function with techniques like hill climbing, simulated annealing, or evolution strategies.",
explain: "",
correct: true,
},
{
text: "In Policy-gradient methods, we optimize the parameter θ **directly** by performing the gradient ascent on the performance of the objective function.",
explain: "",
correct: true,
},
]}
/>
### Q4: Why do we use gradient ascent instead of gradient descent to optimize J(θ)?
<Question
choices={[
{
text: "We want to minimize J(θ) and gradient ascent gives us the gives the direction of the steepest increase of J(θ)",
explain: "",
},
{
text: "We want to maximize J(θ) and gradient ascent gives us the gives the direction of the steepest increase of J(θ)",
explain: "",
correct: true
},
]}
/>
Congrats on finishing this Quiz 🥳. If you missed some elements, take time to read the chapter again to reinforce (😏) your knowledge.


@@ -0,0 +1,42 @@
# What are the policy-based methods?
The main goal of Reinforcement learning is to **find the optimal policy \\(\pi^{*}\\) that will maximize the expected cumulative reward**.
Because Reinforcement Learning is based on the *reward hypothesis*: **all goals can be described as the maximization of the expected cumulative reward.**
For instance, in a soccer game (where you're going to train the agents in two units), the goal is to win the game. We can describe this goal in reinforcement learning as
**maximizing the number of goals scored** (when the ball crosses the goal line) into your opponent's soccer goals. And **minimizing the number of goals in your soccer goals**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/soccer.jpg" alt="Soccer" />
## Value-based, Policy-based, and Actor-critic methods
In the first unit, we studied two methods to find (or, most of the time, approximate) this optimal policy \\(\pi^{*}\\).
- In *value-based methods*, we learn a value function.
- The idea is that an optimal value function leads to an optimal policy \\(\pi^{*}\\).
- Our objective is to **minimize the loss between the predicted and target value** to approximate the true action-value function.
- We have a policy, but it's implicit since it **was generated directly from the value function**. For instance, in Q-Learning, we defined an epsilon-greedy policy.
- On the other hand, in *policy-based methods*, we directly learn to approximate \\(\pi^{*}\\) without having to learn a value function.
- The idea is **to parameterize the policy**. For instance, using a neural network \\(\pi_\theta\\), this policy will output a probability distribution over actions (stochastic policy).
- <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/stochastic_policy.png" alt="stochastic policy" />
- Our objective then is **to maximize the performance of the parameterized policy using gradient ascent**.
- To do that, we control the parameter \\(\theta\\) that will affect the distribution of actions over a state.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_based.png" alt="Policy based" />
- Finally, next time we'll study *actor-critic*, which is a combination of value-based and policy-based methods.
Consequently, thanks to policy-based methods, we can directly optimize our policy \\(\pi_\theta\\) to output a probability distribution over actions \\(\pi_\theta(a|s)\\) that leads to the best cumulative return.
To do that, we define an objective function \\(J(\theta)\\), that is, the expected cumulative reward, and we **want to find \\(\theta\\) that maximizes this objective function**.
## The difference between policy-based and policy-gradient methods
Policy-gradient methods, which we're going to study in this unit, are a subclass of policy-based methods. In policy-based methods, the optimization is most of the time *on-policy* since for each update, we only use data (trajectories) collected **by our most recent version of** \\(\pi_\theta\\).
The difference between these two methods **lies in how we optimize the parameter** \\(\theta\\):
- In *policy-based methods*, we search directly for the optimal policy. We can optimize the parameter \\(\theta\\) **indirectly** by maximizing the local approximation of the objective function with techniques like hill climbing, simulated annealing, or evolution strategies.
- In *policy-gradient methods*, which are a subclass of policy-based methods, we also search directly for the optimal policy. But we optimize the parameter \\(\theta\\) **directly** by performing gradient ascent on the performance of the objective function \\(J(\theta)\\).
Before diving deeper into how policy-gradient methods work (the objective function, the policy gradient theorem, gradient ascent, etc.), let's study the advantages and disadvantages of policy-based methods.
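For intuition on the "indirect" option, here is a deliberately simplified hill-climbing sketch (the `evaluate_policy` function, which would run a few episodes and return the average return, is an assumed helper):

```python
import numpy as np

def hill_climbing(evaluate_policy, theta_dim, n_iterations=1000, noise_scale=0.1):
    """Maximize J(theta) without gradients: perturb the parameters, keep improvements."""
    theta = np.zeros(theta_dim)
    best_return = evaluate_policy(theta)

    for _ in range(n_iterations):
        candidate = theta + noise_scale * np.random.randn(theta_dim)
        candidate_return = evaluate_policy(candidate)
        if candidate_return >= best_return:   # keep the perturbation only if J improved
            theta, best_return = candidate, candidate_return

    return theta
```

Policy-gradient methods replace this trial-and-error search with an explicit gradient ascent step on \\(J(\theta)\\).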


@@ -6,5 +6,7 @@ You can now sit and enjoy playing with your Huggy 🐶. And don't **forget to sp
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/huggy-cover.jpeg" alt="Huggy cover" width="100%">
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
### Keep Learning, stay awesome 🤗
### Keep Learning, Stay Awesome 🤗


@@ -9,3 +9,8 @@ Now that you've learned to use Optuna, we give you some ideas to apply what you'
By doing that, you're going to see how Optuna is valuable and powerful in training better agents,
Have fun,
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
### Keep Learning, stay awesome 🤗