Mirror of https://github.com/huggingface/deep-rl-class.git
Merge pull request #289 from dylwil3/fix-typos
Typos, grammar, style, etc.
@@ -1,7 +1,7 @@
|
||||
# Discord 101 [[discord-101]]
|
||||
|
||||
Hey there! My name is Huggy, the dog 🐕, and I'm looking forward to train with you during this RL Course!
|
||||
Although I don't know much about bringing sticks (yet), I know one or two things about Discord. So I wrote this guide to help you learn about it!
|
||||
Although I don't know much about fetching sticks (yet), I know one or two things about Discord. So I wrote this guide to help you learn about it!
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/huggy-logo.jpg" alt="Huggy Logo"/>
|
||||
|
||||
|
||||
@@ -9,11 +9,11 @@ This course will **teach you about Deep Reinforcement Learning from beginner to
|
||||
In this introduction unit you’ll:
|
||||
|
||||
- Learn more about the **course content**.
|
||||
- **Define the path** you’re going to take (either self-audit or certification process)
|
||||
- Learn more about the **AI vs. AI challenges** you're going to participate to.
|
||||
- **Define the path** you’re going to take (either self-audit or certification process).
|
||||
- Learn more about the **AI vs. AI challenges** you're going to participate in.
|
||||
- Learn more **about us**.
|
||||
- **Create your Hugging Face account** (it’s free).
|
||||
- **Sign-up our Discord server**, the place where you can exchange with your classmates and us (the Hugging Face team).
|
||||
- **Sign-up to our Discord server**, the place where you can chat with your classmates and us (the Hugging Face team).
|
||||
|
||||
Let’s get started!
|
||||
|
||||
@@ -30,7 +30,7 @@ In this course, you will:
|
||||
|
||||
And more!
|
||||
|
||||
At the end of this course, **you’ll get a solid foundation from the basics to the SOTA (state-of-the-art) methods**.
|
||||
At the end of this course, **you’ll get a solid foundation from the basics to the SOTA (state-of-the-art) of methods**.
|
||||
|
||||
Don’t forget to **<a href="http://eepurl.com/ic5ZUD">sign up to the course</a>** (we are collecting your email to be able to **send you the links when each Unit is published and give you information about the challenges and updates).**
|
||||
|
||||
@@ -43,6 +43,7 @@ The course is composed of:
|
||||
|
||||
- *A theory part*: where you learn a **concept in theory**.
|
||||
- *A hands-on*: where you’ll learn **to use famous Deep RL libraries** to train your agents in unique environments. These hands-on will be **Google Colab notebooks with companion tutorial videos** if you prefer learning with video format!
|
||||
|
||||
- *Challenges*: you'll get to put your agent to compete against other agents in different challenges. There will also be [a leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) for you to compare the agents' performance.
|
||||
|
||||
## What's the syllabus? [[syllabus]]
|
||||
@@ -52,7 +53,6 @@ This is the course's syllabus:
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/syllabus1.jpg" alt="Syllabus Part 1" width="100%"/>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/syllabus2.jpg" alt="Syllabus Part 2" width="100%"/>
|
||||
|
||||
|
||||
## Two paths: choose your own adventure [[two-paths]]
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/two-paths.jpg" alt="Two paths" width="100%"/>
|
||||
@@ -100,7 +100,7 @@ You need only 3 things:
|
||||
|
||||
## What is the recommended pace? [[recommended-pace]]
|
||||
|
||||
We defined a planning that you can follow to keep up the pace of the course.
|
||||
We defined a plan that you can follow to keep up the pace of the course.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/pace1.jpg" alt="Course advice" width="100%"/>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/pace2.jpg" alt="Course advice" width="100%"/>
|
||||
|
||||
@@ -11,7 +11,7 @@ After all this information, it's time to get started. We're going to do two thin
|
||||
|
||||
### Let's join our Discord server
|
||||
|
||||
You can now sign up for our Discord Server. This is the place where you **can exchange with the community and with us, create and join study groups to grow each other and more**
|
||||
You can now sign up for our Discord Server. This is the place where you **can chat with the community and with us, create and join study groups to grow with each other and more**
|
||||
|
||||
👉🏻 Join our discord server <a href="https://discord.gg/ydHrjt3WP5">here.</a>
|
||||
|
||||
@@ -19,7 +19,7 @@ When you join, remember to introduce yourself in #introduce-yourself and sign-up
|
||||
|
||||
We have multiple RL-related channels:
|
||||
- `rl-announcements`: where we give the latest information about the course.
|
||||
- `rl-discussions`: where you can exchange about RL and share information.
|
||||
- `rl-discussions`: where you can chat about RL and share information.
|
||||
- `rl-study-group`: where you can create and join study groups.
|
||||
- `rl-i-made-this`: where you can share your projects and models.
|
||||
|
||||
|
||||
@@ -1,16 +1,16 @@
|
||||
# Conclusion [[conclusion]]
|
||||
|
||||
Congrats on finishing this unit! **That was the biggest one**, and there was a lot of information. And congrats on finishing the tutorial. You’ve just trained your first Deep RL agents and shared it with the community! 🥳
|
||||
Congrats on finishing this unit! **That was the biggest one**, and there was a lot of information. And congrats on finishing the tutorial. You’ve just trained your first Deep RL agents and shared them with the community! 🥳
|
||||
|
||||
It's **normal if you still feel confused with some of these elements**. This was the same for me and for all people who studied RL.
|
||||
It's **normal if you still feel confused by some of these elements**. This was the same for me and for all people who studied RL.
|
||||
|
||||
**Take time to really grasp the material** before continuing. It’s important to master these elements and having a solid foundations before entering the fun part.
|
||||
**Take time to really grasp the material** before continuing. It’s important to master these elements and have a solid foundation before entering the fun part.
|
||||
|
||||
Naturally, during the course, we’re going to use and explain these terms again, but it’s better to understand them before diving into the next units.
|
||||
|
||||
In the next (bonus) unit, we’re going to reinforce what we just learned by **training Huggy the Dog to fetch the stick**.
|
||||
In the next (bonus) unit, we’re going to reinforce what we just learned by **training Huggy the Dog to fetch a stick**.
|
||||
|
||||
You will be able then to play with him 🤗.
|
||||
You will then be able to play with him 🤗.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/huggy.jpg" alt="Huggy"/>
|
||||
|
||||
|
||||
@@ -8,7 +8,7 @@ Deep Reinforcement Learning introduces **deep neural networks to solve Reinforc
|
||||
|
||||
For instance, in the next unit, we’ll learn about two value-based algorithms: Q-Learning (classic Reinforcement Learning) and then Deep Q-Learning.
|
||||
|
||||
You’ll see the difference is that in the first approach, **we use a traditional algorithm** to create a Q table that helps us find what action to take for each state.
|
||||
You’ll see the difference is that, in the first approach, **we use a traditional algorithm** to create a Q table that helps us find what action to take for each state.
|
||||
|
||||
In the second approach, **we will use a Neural Network** (to approximate the Q value).
|
||||
|
||||
@@ -18,4 +18,4 @@ In the second approach, **we will use a Neural Network** (to approximate the Q
|
||||
</figcaption>
|
||||
</figure>
|
||||
|
||||
If you are not familiar with Deep Learning you definitely should watch [the FastAI Practical Deep Learning for Coders](https://course.fast.ai) (Free).
|
||||
If you are not familiar with Deep Learning you should definitely watch [the FastAI Practical Deep Learning for Coders](https://course.fast.ai) (Free).
|
||||
|
||||
@@ -30,7 +30,7 @@ If it’s still confusing, **think of a real problem: the choice of picking a re
|
||||
</figcaption>
|
||||
</figure>
|
||||
|
||||
- *Exploitation*: You go every day to the same one that you know is good and **take the risk to miss another better restaurant.**
|
||||
- *Exploitation*: You go to the same one that you know is good every day and **take the risk to miss another better restaurant.**
|
||||
- *Exploration*: Try restaurants you never went to before, with the risk of having a bad experience **but the probable opportunity of a fantastic experience.**
|
||||
|
||||
To recap:
|
||||
|
||||
File diff suppressed because one or more lines are too long
@@ -12,9 +12,9 @@ To understand the RL process, let’s imagine an agent learning to play a platfo
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process_game.jpg" alt="The RL process" width="100%">
|
||||
|
||||
- Our Agent receives **state \\(S_0\\)** from the **Environment** — we receive the first frame of our game (Environment).
|
||||
- Our Agent receives **state \\(S_0\\)** from the **Environment** — we receive the first frame of our game (Environment).
|
||||
- Based on that **state \\(S_0\\),** the Agent takes **action \\(A_0\\)** — our Agent will move to the right.
|
||||
- Environment goes to a **new** **state \\(S_1\\)** — new frame.
|
||||
- The environment goes to a **new** **state \\(S_1\\)** — new frame.
|
||||
- The environment gives some **reward \\(R_1\\)** to the Agent — we’re not dead *(Positive Reward +1)*.
|
||||
|
||||
This RL loop outputs a sequence of **state, action, reward and next state.**
|
||||
@@ -34,7 +34,7 @@ That’s why in Reinforcement Learning, **to have the best behavior,** we aim
|
||||
|
||||
## Markov Property [[markov-property]]
|
||||
|
||||
In papers, you’ll see that the RL process is called the **Markov Decision Process** (MDP).
|
||||
In papers, you’ll see that the RL process is called a **Markov Decision Process** (MDP).
|
||||
|
||||
We’ll talk again about the Markov Property in the following units. But if you need to remember something today about it, it's this: the Markov Property implies that our agent needs **only the current state to decide** what action to take and **not the history of all the states and actions** they took before.
|
||||
|
||||
@@ -58,10 +58,10 @@ In a chess game, we have access to the whole board information, so we receive a
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario">
|
||||
<figcaption>In Super Mario Bros, we only see a part of the level close to the player, so we receive an observation.</figcaption>
|
||||
<figcaption>In Super Mario Bros, we only see the part of the level close to the player, so we receive an observation.</figcaption>
|
||||
</figure>
|
||||
|
||||
In Super Mario Bros, we only see a part of the level close to the player, so we receive an observation.
|
||||
In Super Mario Bros, we only see the part of the level close to the player, so we receive an observation.
|
||||
|
||||
In Super Mario Bros, we are in a partially observed environment. We receive an observation **since we only see a part of the level.**
|
||||
|
||||
@@ -110,7 +110,7 @@ The cumulative reward at each time step **t** can be written as:
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_1.jpg" alt="Rewards">
|
||||
<figcaption>The cumulative reward equals to the sum of all rewards of the sequence.
|
||||
<figcaption>The cumulative reward equals the sum of all rewards in the sequence.
|
||||
</figcaption>
|
||||
</figure>
|
||||
|
||||
@@ -134,7 +134,7 @@ Consequently, **the reward near the cat, even if it is bigger (more cheese), wi
|
||||
|
||||
To discount the rewards, we proceed like this:
|
||||
|
||||
1. We define a discount rate called gamma. **It must be between 0 and 1.** Most of the time between **0.99 and 0.95**.
|
||||
1. We define a discount rate called gamma. **It must be between 0 and 1.** Most of the time between **0.95 and 0.99**.
|
||||
- The larger the gamma, the smaller the discount. This means our agent **cares more about the long-term reward.**
|
||||
- On the other hand, the smaller the gamma, the bigger the discount. This means our **agent cares more about the short term reward (the nearest cheese).**
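To make the discounting concrete, here is a minimal Python sketch (the reward values and variable names are illustrative, not from the course) that computes a discounted return with a gamma in that range:

```python
# Minimal sketch: discounted return G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
gamma = 0.99  # discount rate, typically between 0.95 and 0.99

rewards = [1, 0, 1, 10]  # hypothetical rewards R_{t+1}, R_{t+2}, ...

discounted_return = sum(gamma**k * r for k, r in enumerate(rewards))
print(discounted_return)  # 1 + 0.99*0 + 0.99**2 * 1 + 0.99**3 * 10 ≈ 11.68
```

Lowering gamma towards 0.95 shrinks the contribution of the later rewards, which is exactly the short-term focus described above.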
@@ -2,7 +2,7 @@
|
||||
|
||||
That was a lot of information! Let's summarize:
|
||||
|
||||
- Reinforcement Learning is a computational approach of learning from action. We build an agent that learns from the environment **by interacting with it through trial and error** and receiving rewards (negative or positive) as feedback.
|
||||
- Reinforcement Learning is a computational approach of learning from actions. We build an agent that learns from the environment **by interacting with it through trial and error** and receiving rewards (negative or positive) as feedback.
|
||||
|
||||
- The goal of any RL agent is to maximize its expected cumulative reward (also called expected return) because RL is based on the **reward hypothesis**, which is that **all goals can be described as the maximization of the expected cumulative reward.**
|
||||
|
||||
|
||||
@@ -6,7 +6,7 @@ A task is an **instance** of a Reinforcement Learning problem. We can have two t
|
||||
|
||||
In this case, we have a starting point and an ending point **(a terminal state). This creates an episode**: a list of States, Actions, Rewards, and new States.
|
||||
|
||||
For instance, think about Super Mario Bros: an episode begin at the launch of a new Mario Level and ending **when you’re killed or you reached the end of the level.**
|
||||
For instance, think about Super Mario Bros: an episode begin at the launch of a new Mario Level and ends **when you’re killed or you reached the end of the level.**
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario">
|
||||
|
||||
@@ -4,7 +4,7 @@
|
||||
Now that we learned the RL framework, how do we solve the RL problem?
|
||||
</Tip>
|
||||
|
||||
In other terms, how to build an RL agent that can **select the actions that maximize its expected cumulative reward?**
|
||||
In other words, how do we build an RL agent that can **select the actions that maximize its expected cumulative reward?**
|
||||
|
||||
## The Policy π: the agent’s brain [[policy]]
|
||||
|
||||
@@ -26,7 +26,7 @@ There are two approaches to train our agent to find this optimal policy π\*:
|
||||
|
||||
In Policy-Based methods, **we learn a policy function directly.**
|
||||
|
||||
This function will define a mapping between each state and the best corresponding action. We can also say that it'll define **a probability distribution over the set of possible actions at that state.**
|
||||
This function will define a mapping from each state to the best corresponding action. Alternatively, it could define **a probability distribution over the set of possible actions at that state.**
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy_2.jpg" alt="Policy" />
|
||||
@@ -69,13 +69,13 @@ If we recap:
|
||||
|
||||
In value-based methods, instead of learning a policy function, we **learn a value function** that maps a state to the expected value **of being at that state.**
|
||||
|
||||
The value of a state is the **expected discounted return** the agent can get if it **starts in that state, and then act according to our policy.**
|
||||
The value of a state is the **expected discounted return** the agent can get if it **starts in that state, and then acts according to our policy.**
|
||||
|
||||
“Act according to our policy” just means that our policy is **“going to the state with the highest value”.**
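In standard notation (a reminder, not a new definition introduced by the course at this point), this value function can be written as:

\\[V_{\pi}(s) = \mathbb{E}_{\pi}\left[R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots \mid S_{t} = s\right]\\]

i.e. the expected discounted return when starting in state \\(s\\) and then following the policy \\(\pi\\).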
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/value_1.jpg" alt="Value based RL" width="100%" />
|
||||
|
||||
Here we see that our value function **defined value for each possible state.**
|
||||
Here we see that our value function **defined values for each possible state.**
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/value_2.jpg" alt="Value based RL"/>
|
||||
|
||||
@@ -17,12 +17,12 @@ Your brother will interact with the environment (the video game) by pressing the
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/Illustration_2.jpg" alt="Illustration_2" width="100%">
|
||||
|
||||
But then, **he presses right again** and he touches an enemy. He just died, so that's a -1 reward.
|
||||
But then, **he presses the right button again** and he touches an enemy. He just died, so that's a -1 reward.
|
||||
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/Illustration_3.jpg" alt="Illustration_3" width="100%">
|
||||
|
||||
By interacting with his environment through trial and error, your little brother understood that **he needed to get coins in this environment but avoid the enemies.**
|
||||
By interacting with his environment through trial and error, your little brother understands that **he needs to get coins in this environment but avoid the enemies.**
|
||||
|
||||
**Without any supervision**, the child will get better and better at playing the game.
|
||||
|
||||
@@ -31,7 +31,7 @@ That’s how humans and animals learn, **through interaction.** Reinforcement
|
||||
|
||||
### A formal definition [[a-formal-definition]]
|
||||
|
||||
If we take now a formal definition:
|
||||
We can now make a formal definition:
|
||||
|
||||
<Tip>
|
||||
Reinforcement learning is a framework for solving control tasks (also called decision problems) by building agents that learn from the environment by interacting with it through trial and error and receiving rewards (positive or negative) as unique feedback.
|
||||
|
||||
@@ -4,7 +4,7 @@ These are **optional readings** if you want to go deeper.
|
||||
|
||||
## Monte Carlo and TD Learning [[mc-td]]
|
||||
|
||||
To dive deeper on Monte Carlo and Temporal Difference Learning:
|
||||
To dive deeper into Monte Carlo and Temporal Difference Learning:
|
||||
|
||||
- <a href="https://stats.stackexchange.com/questions/355820/why-do-temporal-difference-td-methods-have-lower-variance-than-monte-carlo-met">Why do temporal difference (TD) methods have lower variance than Monte Carlo methods?</a>
|
||||
- <a href="https://stats.stackexchange.com/questions/336974/when-are-monte-carlo-methods-preferred-over-temporal-difference-ones"> When are Monte Carlo methods preferred over temporal difference ones?</a>
|
||||
|
||||
@@ -5,7 +5,7 @@ The Bellman equation **simplifies our state value or state-action value calcula
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman.jpg" alt="Bellman equation"/>
|
||||
|
||||
With what we have learned so far, we know that if we calculate the \\(V(S_t)\\) (value of a state), we need to calculate the return starting at that state and then follow the policy forever after. **(The policy we defined in the following example is a Greedy Policy; for simplification, we don't discount the reward).**
|
||||
With what we have learned so far, we know that if we calculate \\(V(S_t)\\) (the value of a state), we need to calculate the return starting at that state and then follow the policy forever after. **(The policy we defined in the following example is a Greedy Policy; for simplification, we don't discount the reward).**
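For reference, the recursive shortcut that the Bellman equation will give us is, in its standard form with a discount factor \\(\gamma\\):

\\[V(s) = \mathbb{E}\left[R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s\right]\\]

Before getting there, let's see why computing the full sum is tedious.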
So to calculate \\(V(S_t)\\), we need to calculate the sum of the expected rewards. Hence:
|
||||
|
||||
|
||||
@@ -4,7 +4,7 @@ Congrats on finishing this chapter! There was a lot of information. And congrat
|
||||
|
||||
Implementing from scratch when you study a new architecture **is important to understand how it works.**
|
||||
|
||||
That’s **normal if you still feel confused** with all these elements. **This was the same for me and for all people who studied RL.**
|
||||
It's **normal if you still feel confused** by all these elements. **This was the same for me and for everyone who studies RL.**
|
||||
|
||||
Take time to really grasp the material before continuing.
|
||||
|
||||
@@ -15,7 +15,6 @@ In the next chapter, we’re going to dive deeper by studying our first Deep Rei
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Atari environments"/>
|
||||
|
||||
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
|
||||
### Keep Learning, stay awesome 🤗
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
|
||||
### Keep Learning, stay awesome 🤗
|
||||
@@ -5,7 +5,7 @@ This is a community-created glossary. Contributions are welcomed!
|
||||
|
||||
### Strategies to find the optimal policy
|
||||
|
||||
- **Policy-based methods.** The policy is usually trained with a neural network to select what action to take given a state. In this case is the neural network which outputs the action that the agent should take instead of using a value function. Depending on the experience received by the environment, the neural network will be re-adjusted and will provide better actions.
|
||||
- **Policy-based methods.** The policy is usually trained with a neural network to select what action to take given a state. In this case it is the neural network which outputs the action that the agent should take instead of using a value function. Depending on the experience received by the environment, the neural network will be re-adjusted and will provide better actions.
|
||||
- **Value-based methods.** In this case, a value function is trained to output the value of a state or a state-action pair that will represent our policy. However, this value doesn't define what action the agent should take. In contrast, we need to specify the behavior of the agent given the output of the value function. For example, we could decide to adopt a policy to take the action that always leads to the biggest reward (Greedy Policy). In summary, the policy is a Greedy Policy (or whatever decision the user takes) that uses the values of the value-function to decide the actions to take.
|
||||
|
||||
### Among the value-based methods, we can find two main strategies
|
||||
@@ -15,14 +15,14 @@ This is a community-created glossary. Contributions are welcomed!
|
||||
|
||||
### Epsilon-greedy strategy:
|
||||
|
||||
- Common exploration strategy used in reinforcement learning that involves balancing exploration and exploitation.
|
||||
- Common strategy used in reinforcement learning that involves balancing exploration and exploitation.
|
||||
- Chooses the action with the highest expected reward with a probability of 1-epsilon.
|
||||
- Chooses a random action with a probability of epsilon.
|
||||
- Epsilon is typically decreased over time to shift focus towards exploitation.
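A minimal Python sketch of this selection rule (function and variable names are illustrative and assume a NumPy Q-table indexed as `qtable[state, action]`):

```python
import random

import numpy as np


def epsilon_greedy_action(qtable, state, epsilon):
    """Epsilon-greedy: explore with probability epsilon, exploit otherwise."""
    if random.random() < epsilon:
        return random.randrange(qtable.shape[1])   # exploration: random action
    return int(np.argmax(qtable[state]))           # exploitation: best known action
```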
### Greedy strategy:
|
||||
|
||||
- Involves always choosing the action that is expected to lead to the highest reward, based on the current knowledge of the environment. (only exploitation)
|
||||
- Involves always choosing the action that is expected to lead to the highest reward, based on the current knowledge of the environment. (Only exploitation)
|
||||
- Always chooses the action with the highest expected reward.
|
||||
- Does not include any exploration.
|
||||
- Can be disadvantageous in environments with uncertainty or unknown optimal actions.
|
||||
|
||||
@@ -27,7 +27,7 @@ For more information about the certification process, check this section 👉 ht
|
||||
And you can check your progress here 👉 https://huggingface.co/spaces/ThomasSimonini/Check-my-progress-Deep-RL-Course
|
||||
|
||||
|
||||
**To start the hands-on click on Open In Colab button** 👇 :
|
||||
**To start the hands-on click on the Open In Colab button** 👇 :
|
||||
|
||||
[](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit2/unit2.ipynb)
|
||||
|
||||
@@ -36,7 +36,7 @@ And you can check your progress here 👉 https://huggingface.co/spaces/ThomasSi
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/thumbnail.jpg" alt="Unit 2 Thumbnail">
|
||||
|
||||
In this notebook, **you'll code from scratch your first Reinforcement Learning agent** playing FrozenLake ❄️ using Q-Learning, share it to the community, and experiment with different configurations.
|
||||
In this notebook, **you'll code your first Reinforcement Learning agent from scratch** to play FrozenLake ❄️ using Q-Learning, share it with the community, and experiment with different configurations.
|
||||
|
||||
|
||||
⬇️ Here is an example of what **you will achieve in just a couple of minutes.** ⬇️
|
||||
@@ -61,7 +61,7 @@ We're constantly trying to improve our tutorials, so **if you find some issues i
|
||||
At the end of the notebook, you will:
|
||||
|
||||
- Be able to use **Gym**, the environment library.
|
||||
- Be able to code from scratch a Q-Learning agent.
|
||||
- Be able to code a Q-Learning agent from scratch.
|
||||
- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.
|
||||
|
||||
|
||||
@@ -72,23 +72,23 @@ Before diving into the notebook, you need to:
|
||||
|
||||
## A small recap of Q-Learning
|
||||
|
||||
- The *Q-Learning* **is the RL algorithm that**
|
||||
- *Q-Learning* **is the RL algorithm that**
|
||||
|
||||
- Trains *Q-Function*, an **action-value function** that contains, as internal memory, a *Q-table* **that contains all the state-action pair values.**
|
||||
- Trains a *Q-Function*, an **action-value function** encoded, in internal memory, by a *Q-table* **that contains all the state-action pair values.**
|
||||
|
||||
- Given a state and action, our Q-Function **will search into its Q-table the corresponding value.**
|
||||
- Given a state and action, our Q-Function **will search the Q-table for the corresponding value.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q function" width="100%"/>
|
||||
|
||||
- When the training is done,**we have an optimal Q-Function, so an optimal Q-Table.**
|
||||
|
||||
- And if we **have an optimal Q-function**, we
|
||||
have an optimal policy,since we **know for each state, what is the best action to take.**
|
||||
have an optimal policy, since we **know, for each state, the best action to take.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" width="100%"/>
|
||||
|
||||
|
||||
But, in the beginning, our **Q-Table is useless since it gives arbitrary value for each state-action pair (most of the time we initialize the Q-Table to 0 values)**. But, as we’ll explore the environment and update our Q-Table it will give us better and better approximations
|
||||
But, in the beginning, our **Q-Table is useless since it gives arbitrary values for each state-action pair (most of the time we initialize the Q-Table to 0 values)**. But, as we explore the environment and update our Q-Table, it will give us better and better approximations.
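As a mental model (a sketch, not the notebook's actual code), you can picture this Q-table as a 2-D array indexed by state and action:

```python
import numpy as np

n_states, n_actions = 16, 4                 # e.g. FrozenLake 4x4: 16 states, 4 actions
qtable = np.zeros((n_states, n_actions))    # arbitrary (here: zero) initial values

# "Given a state and an action, search the Q-table for the corresponding value"
state, action = 0, 2                        # illustrative indices
value = qtable[state, action]
```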
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/q-learning.jpeg" alt="q-learning.jpeg" width="100%"/>
|
||||
|
||||
@@ -113,7 +113,7 @@ We’ll install multiple ones:
|
||||
|
||||
The Hugging Face Hub 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.
|
||||
|
||||
You can see here all the Deep RL models available (if they use Q Learning) 👉 https://huggingface.co/models?other=q-learning
|
||||
You can see all the Deep RL models available here (if they use Q Learning) 👉 https://huggingface.co/models?other=q-learning
|
||||
|
||||
```bash
|
||||
pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit2/requirements-unit2.txt
|
||||
@@ -125,7 +125,7 @@ apt install python-opengl ffmpeg xvfb
|
||||
pip3 install pyvirtualdisplay
|
||||
```
|
||||
|
||||
To make sure the new installed libraries are used, **sometimes it's required to restart the notebook runtime**. The next cell will force the **runtime to crash, so you'll need to connect again and run the code starting from here**. Thanks for this trick, **we will be able to run our virtual screen.**
|
||||
To make sure the new installed libraries are used, **sometimes it's required to restart the notebook runtime**. The next cell will force the **runtime to crash, so you'll need to connect again and run the code starting from here**. Thanks to this trick, **we will be able to run our virtual screen.**
|
||||
|
||||
```python
|
||||
import os
|
||||
@@ -299,7 +299,7 @@ Remember we have two policies since Q-Learning is an **off-policy** algorithm. T
|
||||
- Epsilon-greedy policy (acting policy)
|
||||
- Greedy-policy (updating policy)
|
||||
|
||||
Greedy policy will also be the final policy we'll have when the Q-learning agent will be trained. The greedy policy is used to select an action from the Q-table.
|
||||
The greedy policy will also be the final policy we'll have when the Q-learning agent completes training. The greedy policy is used to select an action using the Q-table.
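As a sketch (not necessarily the notebook's exact code), the greedy policy is just an argmax over the Q-table row for the current state:

```python
import numpy as np


def greedy_policy(qtable, state):
    # Exploitation only: take the action with the highest state-action value
    return int(np.argmax(qtable[state]))
```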
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg" alt="Q-Learning" width="100%"/>
|
||||
|
||||
@@ -330,9 +330,9 @@ The idea with epsilon-greedy:
|
||||
|
||||
- With *probability 1 - ɛ* : **we do exploitation** (i.e. our agent selects the action with the highest state-action pair value).
|
||||
|
||||
- With *probability ɛ*: we do **exploration** (trying random action).
|
||||
- With *probability ɛ*: we do **exploration** (trying a random action).
|
||||
|
||||
And as the training goes, we progressively **reduce the epsilon value since we will need less and less exploration and more exploitation.**
|
||||
As the training continues, we progressively **reduce the epsilon value since we will need less and less exploration and more exploitation.**
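One common way to implement this reduction is an exponential decay schedule; the sketch below uses illustrative hyperparameter values (the notebook defines its own):

```python
import numpy as np

max_epsilon = 1.0      # exploration probability at the start
min_epsilon = 0.05     # minimum exploration probability
decay_rate = 0.0005    # how fast epsilon decays (illustrative value)


def epsilon_for_episode(episode):
    # Exponentially decay epsilon from max_epsilon towards min_epsilon
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
```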
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg" alt="Q-Learning" width="100%"/>
|
||||
|
||||
@@ -726,7 +726,7 @@ By using `push_to_hub` **you evaluate, record a replay, generate a model card of
|
||||
This way:
|
||||
- You can **showcase our work** 🔥
|
||||
- You can **visualize your agent playing** 👀
|
||||
- You can **share with the community an agent that others can use** 💾
|
||||
- You can **share an agent with the community that others can use** 💾
|
||||
- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
|
||||
|
||||
|
||||
@@ -788,8 +788,8 @@ repo_name = "q-FrozenLake-v1-4x4-noSlippery"
|
||||
push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)
|
||||
```
|
||||
|
||||
Congrats 🥳 you've just implemented from scratch, trained and uploaded your first Reinforcement Learning agent.
|
||||
FrozenLake-v1 no_slippery is very simple environment, let's try an harder one 🔥.
|
||||
Congrats 🥳 you've just implemented from scratch, trained, and uploaded your first Reinforcement Learning agent.
|
||||
FrozenLake-v1 no_slippery is a very simple environment, let's try a harder one 🔥.
|
||||
|
||||
# Part 2: Taxi-v3 🚖
|
||||
|
||||
@@ -1009,7 +1009,7 @@ repo_name = ""
|
||||
push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)
|
||||
```
|
||||
|
||||
Now that's on the Hub, you can compare the results of your Taxi-v3 with your classmates using the leaderboard 🏆 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
|
||||
Now that it's on the Hub, you can compare the results of your Taxi-v3 with your classmates using the leaderboard 🏆 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
|
||||
|
||||
⚠ To see your entry, you need to go to the bottom of the leaderboard page and **click on refresh** ⚠
|
||||
|
||||
@@ -1075,24 +1075,24 @@ evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"
|
||||
```
|
||||
|
||||
## Some additional challenges 🏆
|
||||
The best way to learn **is to try things by your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. With 1,000,000 steps, we saw some great results!
|
||||
The best way to learn **is to try things on your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. With 1,000,000 steps, we saw some great results!
|
||||
|
||||
In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?
|
||||
|
||||
Here are some ideas to achieve so:
|
||||
Here are some ideas to climb up the leaderboard:
|
||||
|
||||
* Train more steps
|
||||
* Try different hyperparameters by looking at what your classmates have done.
|
||||
* **Push your new trained model** on the Hub 🔥
|
||||
|
||||
Are walking on ice and driving taxis too boring to you? Try to **change the environment**, why not using FrozenLake-v1 slippery version? Check how they work [using the gym documentation](https://www.gymlibrary.dev/) and have fun 🎉.
|
||||
Are walking on ice and driving taxis too boring to you? Try to **change the environment**, why not use the FrozenLake-v1 slippery version? Check how they work [using the gym documentation](https://www.gymlibrary.dev/) and have fun 🎉.
|
||||
|
||||
_____________________________________________________________________
|
||||
Congrats 🥳, you've just implemented, trained, and uploaded your first Reinforcement Learning agent.
|
||||
|
||||
Understanding Q-Learning is an **important step to understanding value-based methods.**
|
||||
|
||||
In the next Unit with Deep Q-Learning, we'll see that creating and updating a Q-table was a good strategy — **however, this is not scalable.**
|
||||
In the next Unit, with Deep Q-Learning, we'll see that while creating and updating a Q-table was a good strategy, **it is not scalable.**
|
||||
|
||||
For instance, imagine you create an agent that learns to play Doom.
|
||||
|
||||
@@ -1100,11 +1100,11 @@ For instance, imagine you create an agent that learns to play Doom.
|
||||
|
||||
Doom is a large environment with a huge state space (millions of different states). Creating and updating a Q-table for that environment would not be efficient.
|
||||
|
||||
That's why we'll study, in the next unit, Deep Q-Learning, an algorithm **where we use a neural network that approximates, given a state, the different Q-values for each action.**
|
||||
That's why we'll study Deep Q-Learning in the next unit, an algorithm **where we use a neural network that approximates, given a state, the different Q-values for each action.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Environments"/>
|
||||
|
||||
|
||||
See you on Unit 3! 🔥
|
||||
See you in Unit 3! 🔥
|
||||
|
||||
## Keep learning, stay awesome 🤗
|
||||
|
||||
@@ -45,7 +45,7 @@ By running more and more episodes, **the agent will learn to play better and be
|
||||
|
||||
For instance, if we train a state-value function using Monte Carlo:
|
||||
|
||||
- We just started to train our value function, **so it returns 0 value for each state**
|
||||
- We initialize our value function **so that it returns 0 value for each state**
|
||||
- Our learning rate (lr) is 0.1 and our discount rate is 1 (= no discount)
|
||||
- Our mouse **explores the environment and takes random actions**
|
|
||||
|
||||
But because we didn't experience an entire episode, we don't have \\(G_t\\) (expected return). Instead, **we estimate \\(G_t\\) by adding \\(R_{t+1}\\) and the discounted value of the next state.**
|
||||
|
||||
This is called bootstrapping. It's called this **because TD bases its update part on an existing estimate \\(V(S_{t+1})\\) and not a complete sample \\(G_t\\).**
|
||||
This is called bootstrapping. It's called this **because TD bases its update in part on an existing estimate \\(V(S_{t+1})\\) and not a complete sample \\(G_t\\).**
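Written out, the TD(0) update described here is, in standard notation with learning rate \\(\alpha\\):

\\[V(S_t) \leftarrow V(S_t) + \alpha \left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]\\]

where \\(R_{t+1} + \gamma V(S_{t+1})\\) is the TD target that replaces the missing \\(G_t\\).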
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-1.jpg" alt="Temporal Difference"/>
|
||||
|
||||
@@ -95,9 +95,9 @@ If we take the same example,
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-2.jpg" alt="Temporal Difference"/>
|
||||
|
||||
- We just started to train our value function, so it returns 0 value for each state.
|
||||
- We initialize our value function so that it returns 0 value for each state.
|
||||
- Our learning rate (lr) is 0.1, and our discount rate is 1 (no discount).
|
||||
- Our mouse explore the environment and take a random action: **going to the left**
|
||||
- Our mouse begins to explore the environment and takes a random action: **going to the left**
|
||||
- It gets a reward \\(R_{t+1} = 1\\) since **it eats a piece of cheese**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-2p.jpg" alt="Temporal Difference"/>
|
||||
@@ -119,9 +119,9 @@ Now we **continue to interact with this environment with our updated value func
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-3p.jpg" alt="Temporal Difference"/>
|
||||
|
||||
If we summarize:
|
||||
To summarize:
|
||||
|
||||
- With *Monte Carlo*, we update the value function from a complete episode, and so we **use the actual accurate discounted return of this episode.**
|
||||
- With *TD Learning*, we update the value function from a step, so we replace \\(G_t\\) that we don't have with **an estimated return called TD target.**
|
||||
- With *TD Learning*, we update the value function from a step, and we replace \\(G_t\\), which we don't know, with **an estimated return called the TD target.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Summary.jpg" alt="Summary"/>
|
||||
|
||||
@@ -1,17 +1,17 @@
|
||||
# Mid-way Recap [[mid-way-recap]]
|
||||
|
||||
Before diving into Q-Learning, let's summarize what we just learned.
|
||||
Before diving into Q-Learning, let's summarize what we've just learned.
|
||||
|
||||
We have two types of value-based functions:
|
||||
|
||||
- State-value function: outputs the expected return if **the agent starts at a given state and acts accordingly to the policy forever after.**
|
||||
- State-value function: outputs the expected return if **the agent starts at a given state and acts according to the policy forever after.**
|
||||
- Action-value function: outputs the expected return if **the agent starts in a given state, takes a given action at that state** and then acts accordingly to the policy forever after.
|
||||
- In value-based methods, rather than learning the policy, **we define the policy by hand** and we learn a value function. If we have an optimal value function, we **will have an optimal policy.**
|
||||
|
||||
There are two types of methods to learn a policy for a value function:
|
||||
|
||||
- With *the Monte Carlo method*, we update the value function from a complete episode, and so we **use the actual accurate discounted return of this episode.**
|
||||
- With *the TD Learning method,* we update the value function from a step, so we replace \\(G_t\\) that we don't have with **an estimated return called TD target.**
|
||||
- With *the Monte Carlo method*, we update the value function from a complete episode, and so we **use the actual discounted return of this episode.**
|
||||
- With *the TD Learning method,* we update the value function from a step, replacing the unknown \\(G_t\\) with **an estimated return called the TD target.**
|
||||
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/summary-learning-mtds.jpg" alt="Summary"/>
|
||||
|
||||
@@ -6,9 +6,9 @@ To better understand Q-Learning, let's take a simple example:
|
||||
|
||||
- You're a mouse in this tiny maze. You always **start at the same starting point.**
|
||||
- The goal is **to eat the big pile of cheese at the bottom right-hand corner** and avoid the poison. After all, who doesn't like cheese?
|
||||
- The episode ends if we eat the poison, **eat the big pile of cheese or if we spent more than five steps.**
|
||||
- The episode ends if we eat the poison, **eat the big pile of cheese**, or if we take more than five steps.
|
||||
- The learning rate is 0.1
|
||||
- The gamma (discount rate) is 0.99
|
||||
- The discount rate (gamma) is 0.99
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-1.jpg" alt="Maze-Example"/>
|
||||
|
||||
@@ -18,14 +18,14 @@ The reward function goes like this:
|
||||
- **+0:** Going to a state with no cheese in it.
|
||||
- **+1:** Going to a state with a small cheese in it.
|
||||
- **+10:** Going to the state with the big pile of cheese.
|
||||
- **-10:** Going to the state with the poison and thus die.
|
||||
- **+0** If we spend more than five steps.
|
||||
- **-10:** Going to the state with the poison and thus dying.
|
||||
- **+0** If we take more than five steps.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-2.jpg" alt="Maze-Example"/>
|
||||
|
||||
To train our agent to have an optimal policy (so a policy that goes right, right, down), **we will use the Q-Learning algorithm**.
|
||||
|
||||
## Step 1: We initialize the Q-table [[step1]]
|
||||
## Step 1: Initialize the Q-table [[step1]]
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Example-1.jpg" alt="Maze-Example"/>
|
||||
|
||||
@@ -35,16 +35,16 @@ Let's do it for 2 training timesteps:
|
||||
|
||||
Training timestep 1:
|
||||
|
||||
## Step 2: Choose action using Epsilon Greedy Strategy [[step2]]
|
||||
## Step 2: Choose an action using the Epsilon Greedy Strategy [[step2]]
|
||||
|
||||
Because epsilon is big = 1.0, I take a random action, in this case, I go right.
|
||||
Because epsilon is big (= 1.0), I take a random action. In this case, I go right.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-3.jpg" alt="Maze-Example"/>
|
||||
|
||||
|
||||
## Step 3: Perform action At, gets Rt+1 and St+1 [[step3]]
|
||||
## Step 3: Perform action At, get Rt+1 and St+1 [[step3]]
|
||||
|
||||
By going right, I've got a small cheese, so \\(R_{t+1} = 1\\), and I'm in a new state.
|
||||
By going right, I get a small cheese, so \\(R_{t+1} = 1\\) and I'm in a new state.
|
||||
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-4.jpg" alt="Maze-Example"/>
|
||||
@@ -59,18 +59,18 @@ We can now update \\(Q(S_t, A_t)\\) using our formula.
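As a quick sanity check, plugging in the numbers stated in this example (lr = 0.1, gamma = 0.99, reward \\(R_{t+1} = 1\\), and a Q-table that still holds its initial zeros), the update gives:

\\[Q(S_0, \text{right}) \leftarrow 0 + 0.1\,\big[1 + 0.99 \times 0 - 0\big] = 0.1\\]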
Training timestep 2:
|
||||
|
||||
## Step 2: Choose action using Epsilon Greedy Strategy [[step2-2]]
|
||||
## Step 2: Choose an action using the Epsilon Greedy Strategy [[step2-2]]
|
||||
|
||||
**I take a random action again, since epsilon is big 0.99** (since we decay it a little bit because as the training progress, we want less and less exploration).
|
||||
**I take a random action again, since epsilon=0.99 is big**. (Notice we decay epsilon a little bit because, as the training progresses, we want less and less exploration).
|
||||
|
||||
I took action down. **Not a good action since it leads me to the poison.**
|
||||
I took the action 'down'. **This is not a good action since it leads me to the poison.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-6.jpg" alt="Maze-Example"/>
|
||||
|
||||
|
||||
## Step 3: Perform action At, gets Rt+1 and St+1 [[step3-3]]
|
||||
## Step 3: Perform action At, get Rt+1 and St+1 [[step3-3]]
|
||||
|
||||
Because I go to the poison state, **I get \\(R_{t+1} = -10\\), and I die.**
|
||||
Because I ate poison, **I get \\(R_{t+1} = -10\\), and I die.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-7.jpg" alt="Maze-Example"/>
|
||||
|
||||
@@ -78,6 +78,6 @@ Because I go to the poison state, **I get \\(R_{t+1} = -10\\), and I die.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-8.jpg" alt="Maze-Example"/>
|
||||
|
||||
Because we're dead, we start a new episode. But what we see here is that **with two explorations steps, my agent became smarter.**
|
||||
Because we're dead, we start a new episode. But what we see here is that, **with two explorations steps, my agent became smarter.**
|
||||
|
||||
As we continue exploring and exploiting the environment and updating Q-values using TD target, **Q-table will give us better and better approximations. And thus, at the end of the training, we'll get an estimate of the optimal Q-function.**
|
||||
As we continue exploring and exploiting the environment and updating Q-values using the TD target, the **Q-table will give us a better and better approximation. At the end of the training, we'll get an estimate of the optimal Q-function.**
|
||||
|
||||
@@ -1,22 +1,22 @@
|
||||
# Q-Learning Recap [[q-learning-recap]]
|
||||
|
||||
|
||||
The *Q-Learning* **is the RL algorithm that** :
|
||||
*Q-Learning* **is the RL algorithm that** :
|
||||
|
||||
- Trains *Q-function*, an **action-value function** that contains, as internal memory, a *Q-table* **that contains all the state-action pair values.**
|
||||
- Trains a *Q-function*, an **action-value function** encoded, in internal memory, by a *Q-table* **containing all the state-action pair values.**
|
||||
|
||||
- Given a state and action, our Q-function **will search into its Q-table the corresponding value.**
|
||||
- Given a state and action, our Q-function **will search its Q-table for the corresponding value.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q function" width="100%"/>
|
||||
|
||||
- When the training is done,**we have an optimal Q-function, so an optimal Q-table.**
|
||||
- When the training is done, **we have an optimal Q-function, or, equivalently, an optimal Q-table.**
|
||||
|
||||
- And if we **have an optimal Q-function**, we
|
||||
have an optimal policy,since we **know for each state, what is the best action to take.**
|
||||
have an optimal policy, since we **know, for each state, the best action to take.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" width="100%"/>
|
||||
|
||||
But, in the beginning, our **Q-table is useless since it gives arbitrary value for each state-action pair (most of the time we initialize the Q-table to 0 values)**. But, as we’ll explore the environment and update our Q-table it will give us better and better approximations
|
||||
But, in the beginning, our **Q-table is useless since it gives arbitrary values for each state-action pair (most of the time we initialize the Q-table to 0 values)**. But, as we explore the environment and update our Q-table it will give us a better and better approximation.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/q-learning.jpeg" alt="q-learning.jpeg" width="100%"/>
|
||||
|
||||
|
||||
@@ -5,7 +5,7 @@ Q-Learning is an **off-policy value-based method that uses a TD approach to tra
|
||||
|
||||
- *Off-policy*: we'll talk about that at the end of this unit.
|
||||
- *Value-based method*: finds the optimal policy indirectly by training a value or action-value function that will tell us **the value of each state or each state-action pair.**
|
||||
- *Uses a TD approach:* **updates its action-value function at each step instead of at the end of the episode.**
|
||||
- *TD approach:* **updates its action-value function at each step instead of at the end of the episode.**
|
||||
|
||||
**Q-Learning is the algorithm we use to train our Q-function**, an **action-value function** that determines the value of being at a particular state and taking a specific action at that state.
|
||||
|
||||
@@ -21,13 +21,13 @@ Let's recap the difference between value and reward:
|
||||
- The *value of a state*, or a *state-action pair* is the expected cumulative reward our agent gets if it starts at this state (or state-action pair) and then acts accordingly to its policy.
|
||||
- The *reward* is the **feedback I get from the environment** after performing an action at a state.
|
||||
|
||||
Internally, our Q-function has **a Q-table, a table where each cell corresponds to a state-action pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.**
|
||||
Internally, our Q-function is encoded by **a Q-table, a table where each cell corresponds to a state-action pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.**
|
||||
|
||||
Let's go through an example of a maze.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-1.jpg" alt="Maze example"/>
|
||||
|
||||
The Q-table is initialized. That's why all values are = 0. This table **contains, for each state, the four state-action values.**
|
||||
The Q-table is initialized. That's why all values are = 0. This table **contains, for each state and action, the corresponding state-action values.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-2.jpg" alt="Maze example"/>
|
||||
|
||||
@@ -35,7 +35,7 @@ Here we see that the **state-action value of the initial state and going up is
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-3.jpg" alt="Maze example"/>
|
||||
|
||||
Therefore, Q-function contains a Q-table **that has the value of each-state action pair.** And given a state and action, **our Q-function will search inside its Q-table to output the value.**
|
||||
So: the Q-function uses a Q-table **that has the value of each state-action pair.** Given a state and action, **our Q-function will search inside its Q-table to output the value.**
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q-function"/>
|
||||
@@ -44,21 +44,21 @@ Therefore, Q-function contains a Q-table **that has the value of each-state act
|
||||
If we recap, *Q-Learning* **is the RL algorithm that:**
|
||||
|
||||
- Trains a *Q-function* (an **action-value function**), which internally is a **Q-table that contains all the state-action pair values.**
|
||||
- Given a state and action, our Q-function **will search into its Q-table the corresponding value.**
|
||||
- Given a state and action, our Q-function **will search its Q-table for the corresponding value.**
|
||||
- When the training is done, **we have an optimal Q-function, which means we have an optimal Q-table.**
|
||||
- And if we **have an optimal Q-function**, we **have an optimal policy** since we **know for each state what is the best action to take.**
|
||||
- And if we **have an optimal Q-function**, we **have an optimal policy** since we **know the best action to take at each state.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy"/>
|
||||
|
||||
|
||||
But, in the beginning, **our Q-table is useless since it gives arbitrary values for each state-action pair** (most of the time, we initialize the Q-table to 0). As the agent **explores the environment and we update the Q-table, it will give us better and better approximations** to the optimal policy.
|
||||
In the beginning, **our Q-table is useless since it gives arbitrary values for each state-action pair** (most of the time, we initialize the Q-table to 0). As the agent **explores the environment and we update the Q-table, it will give us a better and better approximation** to the optimal policy.
|
||||
|
||||
<figure class="image table text-center m-0 w-full">
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-1.jpg" alt="Q-learning"/>
|
||||
<figcaption>We see here that with the training, our Q-table is better since, thanks to it, we can know the value of each state-action pair.</figcaption>
|
||||
</figure>
|
||||
|
||||
Now that we understand what Q-Learning, Q-function, and Q-table are, **let's dive deeper into the Q-Learning algorithm**.
|
||||
Now that we understand what Q-Learning, Q-functions, and Q-tables are, **let's dive deeper into the Q-Learning algorithm**.
|
||||
|
||||
## The Q-Learning algorithm [[q-learning-algo]]
|
||||
|
||||
@@ -73,14 +73,14 @@ This is the Q-Learning pseudocode; let's study each part and **see how it works
|
||||
|
||||
We need to initialize the Q-table for each state-action pair. **Most of the time, we initialize with values of 0.**
|
||||
|
||||
### Step 2: Choose action using epsilon-greedy strategy [[step2]]
|
||||
### Step 2: Choose an action using the epsilon-greedy strategy [[step2]]
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg" alt="Q-learning"/>
|
||||
|
||||
|
||||
Epsilon greedy strategy is a policy that handles the exploration/exploitation trade-off.
|
||||
The epsilon-greedy strategy is a policy that handles the exploration/exploitation trade-off.
|
||||
|
||||
The idea is that we define the initial epsilon ɛ = 1.0:
|
||||
The idea is that, with an initial value of ɛ = 1.0:
|
||||
|
||||
- *With probability 1 — ɛ* : we do **exploitation** (aka our agent selects the action with the highest state-action pair value).
|
||||
- With probability ɛ: **we do exploration** (trying random action).
|
||||
@@ -90,7 +90,7 @@ At the beginning of the training, **the probability of doing exploration will b
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-5.jpg" alt="Q-learning"/>
|
||||
|
||||
|
||||
### Step 3: Perform action At, gets reward Rt+1 and next state St+1 [[step3]]
|
||||
### Step 3: Perform action At, get reward Rt+1 and next state St+1 [[step3]]
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-6.jpg" alt="Q-learning"/>
|
||||
|
||||
@@ -98,7 +98,7 @@ At the beginning of the training, **the probability of doing exploration will b
|
||||
|
||||
Remember that in TD Learning, we update our policy or value function (depending on the RL method we choose) **after one step of the interaction.**
|
||||
|
||||
To produce our TD target, **we used the immediate reward \\(R_{t+1}\\) plus the discounted value of the next state best state-action pair** (we call that bootstrap).
|
||||
To produce our TD target, **we use the immediate reward \\(R_{t+1}\\) plus the discounted value of the next state**, computed by finding the action that maximizes the current Q-function at the next state. (We call this bootstrapping.)
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-7.jpg" alt="Q-learning"/>
|
||||
|
||||
@@ -107,14 +107,14 @@ Therefore, our \\(Q(S_t, A_t)\\) **update formula goes like this:**
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-8.jpg" alt="Q-learning"/>
|
||||
|
||||
|
||||
It means that to update our \\(Q(S_t, A_t)\\):
|
||||
This means that to update our \\(Q(S_t, A_t)\\):
|
||||
|
||||
- We need \\(S_t, A_t, R_{t+1}, S_{t+1}\\).
|
||||
- To update our Q-value at a given state-action pair, we use the TD target.
|
||||
|
||||
How do we form the TD target?
|
||||
1. We obtain the reward \\(R_{t+1}\\) after taking the action.
|
||||
2. To get the **best next-state-action pair value**, we use a greedy policy to select the next best action. Note that this is not an epsilon-greedy policy, this will always take the action with the highest state-action value.
|
||||
2. To get the **best state-action pair value** for the next state, we use a greedy policy to select the next best action. Note that this is not an epsilon-greedy policy, this will always take the action with the highest state-action value.
|
||||
|
||||
Then, when the update of this Q-value is done, we start in a new state and select our action **using an epsilon-greedy policy again.**
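Putting steps 2 to 4 together, one update could be sketched like this (a toy helper, reusing the NumPy Q-table from the sketch above):

```python
import numpy as np

def q_learning_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Q(St, At) <- Q(St, At) + alpha * [R_{t+1} + gamma * max_a Q(S_{t+1}, a) - Q(St, At)]"""
    td_target = reward + gamma * np.max(q_table[next_state])   # greedy bootstrap on the next state
    td_error = td_target - q_table[state, action]
    q_table[state, action] += alpha * td_error
```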
|
||||
|
||||
|
||||
@@ -31,19 +31,19 @@ Since the policy is not trained/learned, **we need to specify its behavior.**
|
||||
<figcaption>Given a state, our action-value function (that we train) outputs the value of each action at that state. Then, our pre-defined Greedy Policy selects the action that will yield the highest value given a state or a state-action pair.</figcaption>
|
||||
</figure>
|
||||
|
||||
Consequently, whatever method you use to solve your problem, **you will have a policy**. In the case of value-based methods, you don't train the policy: your policy **is just a simple pre-specified function** (for instance, Greedy Policy) that uses the values given by the value-function to select its actions.
|
||||
Consequently, whatever method you use to solve your problem, **you will have a policy**. In the case of value-based methods, you don't train the policy: your policy **is just a simple pre-specified function** (for instance, the Greedy Policy) that uses the values given by the value-function to select its actions.
|
||||
|
||||
So the difference is:
|
||||
|
||||
- In policy-based, **the optimal policy (denoted π\*) is found by training the policy directly.**
|
||||
- In value-based, **finding an optimal value function (denoted Q\* or V\*, we'll study the difference after) leads to having an optimal policy.**
|
||||
- In policy-based training, **the optimal policy (denoted π\*) is found by training the policy directly.**
|
||||
- In value-based training, **finding an optimal value function (denoted Q\* or V\*, we'll study the difference below) leads to having an optimal policy.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link between value and policy"/>
|
||||
|
||||
In fact, most of the time, in value-based methods, you'll use **an Epsilon-Greedy Policy** that handles the exploration/exploitation trade-off; we'll talk about it when we talk about Q-Learning in the second part of this unit.
|
||||
In fact, most of the time, in value-based methods, you'll use **an Epsilon-Greedy Policy** that handles the exploration/exploitation trade-off; we'll talk about this when we talk about Q-Learning in the second part of this unit.
|
||||
|
||||
|
||||
So, we have two types of value-based functions:
|
||||
As we mentioned above, we have two types of value-based functions:
|
||||
|
||||
## The state-value function [[state-value-function]]
|
||||
|
||||
@@ -60,7 +60,7 @@ For each state, the state-value function outputs the expected return if the agen
|
||||
|
||||
## The action-value function [[action-value-function]]
|
||||
|
||||
In the action-value function, for each state and action pair, the action-value function **outputs the expected return** if the agent starts in that state and takes action, and then follows the policy forever after.
|
||||
In the action-value function, for each state and action pair, the action-value function **outputs the expected return** if the agent starts in that state, takes that action, and then follows the policy forever after.
|
||||
|
||||
The value of taking action \\(a\\) in state \\(s\\) under a policy \\(π\\) is:
|
||||
|
||||
@@ -70,8 +70,8 @@ The value of taking action \\(a\\) in state \\(s\\) under a policy \\(π\\) is:
|
||||
|
||||
We see that the difference is:
|
||||
|
||||
- In state-value function, we calculate **the value of a state \\(S_t\\)**
|
||||
- In action-value function, we calculate **the value of the state-action pair ( \\(S_t, A_t\\) ) hence the value of taking that action at that state.**
|
||||
- For the state-value function, we calculate **the value of a state \\(S_t\\)**
|
||||
- For the action-value function, we calculate **the value of the state-action pair ( \\(S_t, A_t\\) ) hence the value of taking that action at that state.**
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/two-types.jpg" alt="Two types of value function"/>
|
||||
@@ -79,8 +79,8 @@ We see that the difference is:
|
||||
Note: We didn't fill all the state-action pairs for this example of the action-value function</figcaption>
|
||||
</figure>
|
||||
|
||||
In either case, whatever value function we choose (state-value or action-value function), **the returned value is the expected return.**
|
||||
In either case, whichever value function we choose (state-value or action-value function), **the returned value is the expected return.**
|
||||
|
||||
However, the problem is that it implies that **to calculate EACH value of a state or a state-action pair, we need to sum all the rewards an agent can get if it starts at that state.**
|
||||
However, the problem is that **to calculate EACH value of a state or a state-action pair, we need to sum all the rewards an agent can get if it starts at that state.**
|
||||
|
||||
This can be a computationally expensive process, and that's **where the Bellman equation comes to help us.**
|
||||
This can be a computationally expensive process, and that's **where the Bellman equation comes in to help us.**
|
||||
|
||||
@@ -5,7 +5,7 @@ In RL, we build an agent that can **make smart decisions**. For instance, an ag
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/rl-process.jpg" alt="RL process"/>
|
||||
|
||||
|
||||
But, to make intelligent decisions, our agent will learn from the environment by **interacting with it through trial and error** and receiving rewards (positive or negative) **as unique feedback.**
|
||||
To make intelligent decisions, our agent will learn from the environment by **interacting with it through trial and error** and receiving rewards (positive or negative) **as unique feedback.**
|
||||
|
||||
Its goal **is to maximize its expected cumulative reward** (because of the reward hypothesis).
|
||||
|
||||
|
||||
@@ -9,9 +9,9 @@ Don't hesitate to train your agent in other environments (Pong, Seaquest, QBert,
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Environments"/>
|
||||
|
||||
|
||||
In the next unit, **we're going to learn about Optuna**. One of the most critical task in Deep Reinforcement Learning is to find a good set of training hyperparameters. And Optuna is a library that helps you to automate the search.
|
||||
In the next unit, **we're going to learn about Optuna**. One of the most critical tasks in Deep Reinforcement Learning is to find a good set of training hyperparameters. Optuna is a library that helps you to automate the search.
|
||||
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
|
||||
### Keep Learning, stay awesome 🤗
|
||||
|
||||
|
||||
@@ -33,9 +33,9 @@ Why do we create a replay memory?
|
||||
Experience Replay in Deep Q-Learning has two functions:
|
||||
|
||||
1. **Make more efficient use of the experiences during the training**.
|
||||
Usually, in online reinforcement learning, the agent interacts in the environment, gets experiences (state, action, reward, and next state), learns from them (updates the neural network), and discards them. This is not efficient
|
||||
Usually, in online reinforcement learning, the agent interacts with the environment, gets experiences (state, action, reward, and next state), learns from them (updates the neural network), and discards them. This is not efficient.
|
||||
|
||||
Experience replay helps **using the experiences of the training more efficiently**. We use a replay buffer that saves experience samples **that we can reuse during the training.**
|
||||
Experience replay helps by **using the experiences of the training more efficiently**. We use a replay buffer that saves experience samples **that we can reuse during the training.**
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/experience-replay.jpg" alt="Experience Replay"/>
|
||||
|
||||
⇒ This allows the agent to **learn from the same experiences multiple times**.
|
||||
@@ -47,7 +47,7 @@ The solution is to create a Replay Buffer that stores experience tuples while in
|
||||
|
||||
Experience replay also has other benefits. By randomly sampling the experiences, we remove correlation in the observation sequences and prevent **action values from oscillating or diverging catastrophically.**
|
||||
|
||||
In the Deep Q-Learning pseudocode, we **initialize a replay memory buffer D from capacity N** (N is a hyperparameter that you can define). We then store experiences in the memory and sample a batch of experiences to feed the Deep Q-Network during the training phase.
|
||||
In the Deep Q-Learning pseudocode, we **initialize a replay memory buffer D with capacity N** (N is a hyperparameter that you can define). We then store experiences in the memory and sample a batch of experiences to feed the Deep Q-Network during the training phase.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/experience-replay-pseudocode.jpg" alt="Experience Replay Pseudocode"/>
|
||||
|
||||
@@ -61,9 +61,9 @@ But we **don’t have any idea of the real TD target**. We need to estimate it
|
||||
|
||||
However, the problem is that we are using the same parameters (weights) for estimating the TD target **and** the Q-value. Consequently, there is a significant correlation between the TD target and the parameters we are changing.
|
||||
|
||||
Therefore, it means that at every step of training, **our Q-values shift but also the target value shifts.** We’re getting closer to our target, but the target is also moving. It’s like chasing a moving target! This can lead to a significant oscillation in training.
|
||||
Therefore, at every step of training, **both our Q-values and the target values shift.** We’re getting closer to our target, but the target is also moving. It’s like chasing a moving target! This can lead to significant oscillation in training.
|
||||
|
||||
It’s like if you were a cowboy (the Q estimation) and you want to catch the cow (the Q-target). Your goal is to get closer (reduce the error).
|
||||
It’s like if you were a cowboy (the Q estimation) and you wanted to catch a cow (the Q-target). Your goal is to get closer (reduce the error).
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/qtarget-1.jpg" alt="Q-target"/>
|
||||
|
||||
@@ -76,7 +76,7 @@ This leads to a bizarre path of chasing (a significant oscillation in training).
|
||||
|
||||
Instead, what we see in the pseudo-code is that we:
|
||||
- Use a **separate network with fixed parameters** for estimating the TD Target
|
||||
- **Copy the parameters from our Deep Q-Network at every C step** to update the target network.
|
||||
- **Copy the parameters from our Deep Q-Network every C steps** to update the target network.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/fixed-q-target-pseudocode.jpg" alt="Fixed Q-target Pseudocode"/>
|
||||
|
||||
@@ -84,7 +84,7 @@ Instead, what we see in the pseudo-code is that we:
|
||||
|
||||
## Double DQN [[double-dqn]]
|
||||
|
||||
Double DQNs, or Double Learning, were introduced [by Hado van Hasselt](https://papers.nips.cc/paper/3964-double-q-learning). This method **handles the problem of the overestimation of Q-values.**
|
||||
Double DQNs, or Double Deep Q-Learning neural networks, were introduced [by Hado van Hasselt](https://papers.nips.cc/paper/3964-double-q-learning). This method **handles the problem of the overestimation of Q-values.**
|
||||
|
||||
To understand this problem, remember how we calculate the TD Target:
|
||||
|
||||
@@ -100,6 +100,6 @@ The solution is: when we compute the Q target, we use two networks to decouple t
|
||||
- Use our **DQN network** to select the best action to take for the next state (the action with the highest Q-value).
|
||||
- Use our **Target network** to calculate the target Q-value of taking that action at the next state.
|
||||
|
||||
Therefore, Double DQN helps us reduce the overestimation of Q-values and, as a consequence, helps us train faster and have more stable learning.
|
||||
Therefore, Double DQN helps us reduce the overestimation of Q-values and, as a consequence, helps us train faster and with more stable learning.
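As a sketch, and assuming PyTorch tensors plus the online/target networks from the previous section, the decoupled target could look like this:

```python
import torch

def double_dqn_target(rewards, next_states, dones, gamma, q_network, target_network):
    """The online network selects the next action; the target network evaluates it."""
    with torch.no_grad():
        best_actions = q_network(next_states).argmax(dim=1, keepdim=True)         # selection
        next_q = target_network(next_states).gather(1, best_actions).squeeze(1)   # evaluation
        return rewards + gamma * next_q * (1 - dones)
```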
|
||||
|
||||
Since these three improvements in Deep Q-Learning, many have been added such as Prioritized Experience Replay, Dueling Deep Q-Learning. They’re out of the scope of this course but if you’re interested, check the links we put in the reading list.
|
||||
Since these three improvements to Deep Q-Learning were introduced, many more have been added, such as Prioritized Experience Replay and Dueling Deep Q-Learning. They’re out of the scope of this course but if you’re interested, check the links we put in the reading list.
|
||||
|
||||
@@ -5,14 +5,14 @@ This is the architecture of our Deep Q-Learning network:
|
||||
|
||||
As input, we take a **stack of 4 frames** passed through the network as a state and output a **vector of Q-values for each possible action at that state**. Then, like with Q-Learning, we just need to use our epsilon-greedy policy to select which action to take.
|
||||
|
||||
When the Neural Network is initialized, **the Q-value estimation is terrible**. But during training, our Deep Q-Network agent will associate a situation with appropriate action and **learn to play the game well**.
|
||||
When the Neural Network is initialized, **the Q-value estimation is terrible**. But during training, our Deep Q-Network agent will associate a situation with the appropriate action and **learn to play the game well**.
|
||||
|
||||
## Preprocessing the input and temporal limitation [[preprocessing]]
|
||||
|
||||
We need to **preprocess the input**. It’s an essential step since we want to **reduce the complexity of our state to reduce the computation time needed for training**.
|
||||
|
||||
To achieve this, we **reduce the state space to 84x84 and grayscale it**. We can do this since the colors in Atari environments don't add important information.
|
||||
This is an essential saving since we **reduce our three color channels (RGB) to 1**.
|
||||
This is a big improvement since we **reduce our three color channels (RGB) to 1**.
|
||||
|
||||
We can also **crop a part of the screen in some games** if it does not contain important information.
|
||||
Then we stack four frames together.
|
||||
@@ -30,12 +30,12 @@ No, because one frame is not enough to have a sense of motion! But what if I add
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/temporal-limitation-2.jpg" alt="Temporal Limitation"/>
|
||||
That’s why, to capture temporal information, we stack four frames together.
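In the hands-on, the Atari wrappers do all of this for us, but a hand-rolled sketch of the preprocessing (grayscale, resize to 84x84, keep the last four frames) could look like the following, assuming OpenCV (`cv2`) is available:

```python
from collections import deque

import cv2
import numpy as np

def preprocess_frame(frame):
    """Turn a (210, 160, 3) RGB Atari frame into an (84, 84) grayscale image."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)

frame_stack = deque(maxlen=4)                 # the state is the last 4 preprocessed frames

def stack_frame(frame):
    frame_stack.append(preprocess_frame(frame))
    return np.stack(frame_stack, axis=0)      # shape (4, 84, 84) once the deque is full
```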
|
||||
|
||||
Then, the stacked frames are processed by three convolutional layers. These layers **allow us to capture and exploit spatial relationships in images**. But also, because frames are stacked together, **you can exploit some temporal properties across those frames**.
|
||||
Then the stacked frames are processed by three convolutional layers. These layers **allow us to capture and exploit spatial relationships in images**. But also, because the frames are stacked together, **we can exploit some temporal properties across those frames**.
|
||||
|
||||
If you don't know what are convolutional layers, don't worry. You can check the [Lesson 4 of this free Deep Reinforcement Learning Course by Udacity](https://www.udacity.com/course/deep-learning-pytorch--ud188)
|
||||
If you don't know what convolutional layers are, don't worry. You can check out [Lesson 4 of this free Deep Reinforcement Learning Course by Udacity](https://www.udacity.com/course/deep-learning-pytorch--ud188)
|
||||
|
||||
Finally, we have a couple of fully connected layers that output a Q-value for each possible action at that state.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/deep-q-network.jpg" alt="Deep Q Network"/>
|
||||
|
||||
So, we see that Deep Q-Learning is using a neural network to approximate, given a state, the different Q-values for each possible action at that state. Let’s now study the Deep Q-Learning algorithm.
|
||||
So, we see that Deep Q-Learning uses a neural network to approximate, given a state, the different Q-values for each possible action at that state. Now let's study the Deep Q-Learning algorithm.
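To make the architecture concrete, here is a sketch in PyTorch using the layer sizes from the original DQN paper; the exact network used in the hands-on may differ:

```python
import torch.nn as nn

class DQN(nn.Module):
    """Three convolutional layers followed by two fully connected layers, one Q-value per action."""

    def __init__(self, n_actions):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # input: 4 stacked 84x84 frames
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),                               # one Q-value per possible action
        )

    def forward(self, x):
        return self.head(self.conv(x / 255.0))                      # normalize pixels before the convs
```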
|
||||
|
||||
@@ -8,23 +8,23 @@ We learned that **Q-Learning is an algorithm we use to train our Q-Function**,
|
||||
|
||||
The **Q comes from "the Quality" of that action at that state.**
|
||||
|
||||
Internally, our Q-function has **a Q-table, a table where each cell corresponds to a state-action pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.**
|
||||
Internally, our Q-function is encoded by **a Q-table, a table where each cell corresponds to a state-action pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.**
|
||||
|
||||
The problem is that Q-Learning is a *tabular method*. This raises a problem in which the states and actions spaces **are small enough to approximate value functions to be represented as arrays and tables**. Also, this is **not scalable**.
|
||||
The problem is that Q-Learning is a *tabular method*. This becomes a problem if the state and action spaces **are not small enough to be represented efficiently by arrays and tables**. In other words: it is **not scalable**.
|
||||
Q-Learning worked well with small state space environments like:
|
||||
|
||||
- FrozenLake, where we had 16 states.
|
||||
- Taxi-v3, where we had 500 states.
|
||||
|
||||
But think of what we're going to do today: we will train an agent to learn to play Space Invaders a more complex game, using the frames as input.
|
||||
But think of what we're going to do today: we will train an agent to learn to play Space Invaders, a more complex game, using the frames as input.
|
||||
|
||||
As **[Nikita Melkozerov mentioned](https://twitter.com/meln1k), Atari environments** have an observation space with a shape of (210, 160, 3)*, containing values ranging from 0 to 255 so that gives us \\(256^{210 \times 160 \times 3} = 256^{100800}\\) (for comparison, we have approximately \\(10^{80}\\) atoms in the observable universe).
|
||||
As **[Nikita Melkozerov mentioned](https://twitter.com/meln1k), Atari environments** have an observation space with a shape of (210, 160, 3)*, containing values ranging from 0 to 255, so that gives us \\(256^{210 \times 160 \times 3} = 256^{100800}\\) possible observations (for comparison, we have approximately \\(10^{80}\\) atoms in the observable universe).
|
||||
|
||||
* A single frame in Atari is composed of an image of 210x160 pixels. Given the images are in color (RGB), there are 3 channels. This is why the shape is (210, 160, 3). For each pixel, the value can go from 0 to 255.
|
||||
* A single frame in Atari is composed of an image of 210x160 pixels. Given that the images are in color (RGB), there are 3 channels. This is why the shape is (210, 160, 3). For each pixel, the value can go from 0 to 255.
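If you want to convince yourself of the scale, two lines of Python reproduce the arithmetic above:

```python
import math

n_values = 210 * 160 * 3                       # 100800 pixel values per observation
print(int(n_values * math.log10(256)) + 1)     # ≈ 242751 decimal digits in 256**100800 (10^80 has only 81)
```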
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari.jpg" alt="Atari State Space"/>
|
||||
|
||||
Therefore, the state space is gigantic; due to this, creating and updating a Q-table for that environment would not be efficient. In this case, the best idea is to approximate the Q-values instead of a Q-table using a parametrized Q-function \\(Q_{\theta}(s,a)\\) .
|
||||
Therefore, the state space is gigantic; due to this, creating and updating a Q-table for that environment would not be efficient. In this case, the best idea is to approximate the Q-values using a parametrized Q-function \\(Q_{\theta}(s,a)\\) .
|
||||
|
||||
This neural network will approximate, given a state, the different Q-values for each possible action at that state. And that's exactly what Deep Q-Learning does.
|
||||
|
||||
|
||||
@@ -16,7 +16,7 @@ Now that you've studied the theory behind Deep Q-Learning, **you’re ready to t
|
||||
|
||||
We're using the [RL-Baselines-3 Zoo integration](https://github.com/DLR-RM/rl-baselines3-zoo), a vanilla version of Deep Q-Learning with no extensions such as Double-DQN, Dueling-DQN, or Prioritized Experience Replay.
|
||||
|
||||
Also, **if you want to learn to implement Deep Q-Learning by yourself after this hands-on**, you definitely should look at CleanRL implementation: https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn_atari.py
|
||||
Also, **if you want to learn to implement Deep Q-Learning by yourself after this hands-on**, you definitely should look at the CleanRL implementation: https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn_atari.py
|
||||
|
||||
To validate this hands-on for the certification process, you need to push your trained model to the Hub and **get a result of >= 200**.
|
||||
|
||||
@@ -87,7 +87,7 @@ We're constantly trying to improve our tutorials, so **if you find some issues i
|
||||
|
||||
During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames).
|
||||
|
||||
Hence the following cell will install the librairies and create and run a virtual screen 🖥
|
||||
The following cell will install the libraries and create and run a virtual screen 🖥
|
||||
|
||||
```bash
|
||||
apt install python-opengl
|
||||
@@ -143,7 +143,7 @@ To train an agent with RL-Baselines3-Zoo, we just need to do two things:
|
||||
|
||||
Here we see that:
|
||||
- We use the `Atari Wrapper` that does the pre-processing (frame reduction, grayscale, stacking four frames).
|
||||
- We use `CnnPolicy`, since we use Convolutional layers to process the frames.
|
||||
- We use the `CnnPolicy`, since we use Convolutional layers to process the frames.
|
||||
- We train the model for 10 million `n_timesteps`.
|
||||
- Memory (Experience Replay) size is 100000, i.e. the number of experience steps saved to train your agent with.
|
||||
|
||||
@@ -154,7 +154,7 @@ In terms of hyperparameters optimization, my advice is to focus on these 3 hyper
|
||||
- `buffer_size (Experience Memory size)`
|
||||
- `batch_size`
|
||||
|
||||
As a good practice, you need to **check the documentation to understand what each hyperparameters does**: https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html#parameters
|
||||
As a good practice, you need to **check the documentation to understand what each hyperparameter does**: https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html#parameters
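To see what these hyperparameters control outside of the zoo's YAML configs, here is a hedged sketch of how the same settings map onto Stable-Baselines3's `DQN` class directly (the values are illustrative, and the zoo additionally applies the Atari wrappers and frame stacking for you):

```python
from stable_baselines3 import DQN

model = DQN(
    "CnnPolicy",                       # convolutional policy, since observations are frames
    "SpaceInvadersNoFrameskip-v4",
    learning_rate=1e-4,
    buffer_size=100_000,               # Experience Replay memory size
    batch_size=32,
    verbose=1,
)
# model.learn(total_timesteps=10_000_000)
```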
|
||||
|
||||
|
||||
|
||||
@@ -185,16 +185,16 @@ python enjoy.py --algo dqn --env SpaceInvadersNoFrameskip-v4 --no-render --n
|
||||
```
|
||||
|
||||
## Publish our trained model on the Hub 🚀
|
||||
Now that we saw we got good results after the training, we can publish our trained model on the Hub with one line of code.
|
||||
Now that we saw we got good results after the training, we can publish our trained model to the Hub with one line of code.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit3/space-invaders-model.gif" alt="Space Invaders model">
|
||||
|
||||
By using `rl_zoo3.push_to_hub.py`, **you evaluate, record a replay, generate a model card of your agent, and push it to the Hub**.
|
||||
|
||||
This way:
|
||||
- You can **showcase our work** 🔥
|
||||
- You can **showcase your work** 🔥
|
||||
- You can **visualize your agent playing** 👀
|
||||
- You can **share with the community an agent that others can use** 💾
|
||||
- You can **share an agent with the community that others can use** 💾
|
||||
- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
|
||||
|
||||
To be able to share your model with the community, there are three more steps to follow:
|
||||
@@ -215,11 +215,11 @@ notebook_login()
|
||||
git config --global credential.helper store
|
||||
```
|
||||
|
||||
If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
|
||||
If you don't want to use Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
|
||||
|
||||
3️⃣ We're now ready to push our trained agent to the Hub 🔥
|
||||
|
||||
Let's run `push_to_hub.py` file to upload our trained agent to the Hub. There are two important parameters:
|
||||
Let's run the `push_to_hub.py` file to upload our trained agent to the Hub. There are two important parameters:
|
||||
|
||||
* `--repo-name `: The name of the repo
|
||||
* `-orga`: Your Hugging Face username
|
||||
@@ -278,16 +278,16 @@ python -m rl_zoo3.load_from_hub --algo dqn --env BeamRiderNoFrameskip-v4 -orga s
|
||||
python enjoy.py --algo dqn --env BeamRiderNoFrameskip-v4 -n 5000 -f rl_trained/
|
||||
```
|
||||
|
||||
Why not trying to train your own **Deep Q-Learning Agent playing BeamRiderNoFrameskip-v4? 🏆.**
|
||||
Why not try training your own **Deep Q-Learning Agent playing BeamRiderNoFrameskip-v4**? 🏆
|
||||
|
||||
If you want to try, check https://huggingface.co/sb3/dqn-BeamRiderNoFrameskip-v4#hyperparameters. There, **in the model card, you have the hyperparameters of the trained agent.**
|
||||
If you want to try, check out https://huggingface.co/sb3/dqn-BeamRiderNoFrameskip-v4#hyperparameters. There, **in the model card, you'll find the hyperparameters of the trained agent.**
|
||||
|
||||
But finding hyperparameters can be a daunting task. Fortunately, we'll see in the next bonus Unit, how we can **use Optuna for optimizing the Hyperparameters 🔥.**
|
||||
Finding hyperparameters in general can be a daunting task. Fortunately, we'll see in the next bonus Unit how we can **use Optuna for optimizing the Hyperparameters 🔥.**
|
||||
|
||||
|
||||
## Some additional challenges 🏆
|
||||
|
||||
The best way to learn **is to try things by your own**!
|
||||
The best way to learn **is to try things on your own**!
|
||||
|
||||
In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?
|
||||
|
||||
@@ -297,18 +297,18 @@ Here's a list of environments you can try to train your agent with:
|
||||
- EnduroNoFrameskip-v4
|
||||
- PongNoFrameskip-v4
|
||||
|
||||
Also, **if you want to learn to implement Deep Q-Learning by yourself**, you definitely should look at CleanRL implementation: https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn_atari.py
|
||||
Also, **if you want to learn to implement Deep Q-Learning by yourself**, you definitely should look at the CleanRL implementation: https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/dqn_atari.py
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Environments"/>
|
||||
|
||||
________________________________________________________________________
|
||||
Congrats on finishing this chapter!
|
||||
|
||||
If you’re still feel confused with all these elements...it's totally normal! **This was the same for me and for all people who studied RL.**
|
||||
If you still feel confused by all these elements...it's totally normal! **This was the same for me and for all people who study RL.**
|
||||
|
||||
Take time to really **grasp the material before continuing and try the additional challenges**. It’s important to master these elements and having a solid foundations.
|
||||
Take time to really **grasp the material before continuing and try the additional challenges**. It’s important to master these elements and have a solid foundation.
|
||||
|
||||
In the next unit, **we’re going to learn about [Optuna](https://optuna.org/)**. One of the most critical task in Deep Reinforcement Learning is to find a good set of training hyperparameters. And Optuna is a library that helps you to automate the search.
|
||||
In the next unit, **we’re going to learn about [Optuna](https://optuna.org/)**. One of the most critical tasks in Deep Reinforcement Learning is to find a good set of training hyperparameters. And Optuna is a library that helps you to automate the search.
|
||||
|
||||
See you on Bonus unit 2! 🔥
|
||||
|
||||
|
||||
@@ -34,7 +34,7 @@ The problem is that the **two rose cases are aliased states because the agent pe
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/hamster2.jpg" alt="Hamster 1"/>
|
||||
</figure>
|
||||
|
||||
Under a deterministic policy, the policy either will move right when in a red state or move left. **Either case will cause our agent to get stuck and never suck the dust**.
|
||||
Under a deterministic policy, the policy will either always move right when in a red state or always move left. **Either case will cause our agent to get stuck and never suck the dust**.
|
||||
|
||||
Under a value-based Reinforcement Learning algorithm, we learn a **quasi-deterministic policy** (the "epsilon-greedy strategy"). Consequently, our agent can **spend a lot of time before finding the dust**.
|
||||
|
||||
@@ -67,8 +67,8 @@ On the other hand, in policy-gradient methods, stochastic policy action preferen
|
||||
|
||||
Naturally, policy-gradient methods also have some disadvantages:
|
||||
|
||||
- **Frequently, policy-gradient converges on a local maximum instead of a global optimum.**
|
||||
- **Frequently, policy-gradient methods converge to a local maximum instead of a global optimum.**
|
||||
- Policy-gradient methods go slower, **step by step: they can take longer to train (inefficient).**
|
||||
- Policy-gradient can have high variance. We'll see in actor-critic unit why and how we can solve this problem.
|
||||
- Policy-gradient can have high variance. We'll see in the actor-critic unit why, and how we can solve this problem.
|
||||
|
||||
👉 If you want to go deeper into the advantages and disadvantages of policy-gradient methods, [you can check this video](https://youtu.be/y3oqOjHilio).
|
||||
|
||||
@@ -10,8 +10,8 @@ frames as observation)?
|
||||
In the next unit, **we're going to learn more about Unity MLAgents**, by training agents in Unity environments. This way, you will be ready to participate in the **AI vs AI challenges where you'll train your agents
|
||||
to compete against other agents in a snowball fight and a soccer game.**
|
||||
|
||||
Sounds fun? See you next time!
|
||||
Sound fun? See you next time!
|
||||
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
|
||||
### Keep Learning, stay awesome 🤗
|
||||
|
||||
@@ -10,7 +10,7 @@
|
||||
|
||||
|
||||
|
||||
Now that we studied the theory behind Reinforce, **you’re ready to code your Reinforce agent with PyTorch**. And you'll test its robustness using CartPole-v1 and PixelCopter,.
|
||||
Now that we've studied the theory behind Reinforce, **you’re ready to code your Reinforce agent with PyTorch**. And you'll test its robustness using CartPole-v1 and PixelCopter.
|
||||
|
||||
You'll then be able to iterate and improve this implementation for more advanced environments.
|
||||
|
||||
@@ -19,9 +19,9 @@ You'll then be able to iterate and improve this implementation for more advanced
|
||||
</figure>
|
||||
|
||||
|
||||
To validate this hands-on for the certification process, you need to push your trained models to the Hub.
|
||||
To validate this hands-on for the certification process, you need to push your trained models to the Hub and:
|
||||
|
||||
- Get a result of >= 350 for `Cartpole-v1`.
|
||||
- Get a result of >= 350 for `CartPole-v1`
|
||||
- Get a result of >= 5 for `PixelCopter`.
|
||||
|
||||
To find your result, go to the leaderboard and find your model; **the result = mean_reward - std of reward**. **If you don't see your model on the leaderboard, go to the bottom of the leaderboard page and click on the refresh button**.
|
||||
@@ -75,7 +75,7 @@ We're constantly trying to improve our tutorials, so **if you find some issues i
|
||||
|
||||
At the end of the notebook, you will:
|
||||
|
||||
- Be able to **code from scratch a Reinforce algorithm using PyTorch.**
|
||||
- Be able to **code a Reinforce algorithm from scratch using PyTorch.**
|
||||
- Be able to **test the robustness of your agent using simple environments.**
|
||||
- Be able to **push your trained agent to the Hub** with a nice video replay and an evaluation score 🔥.
|
||||
|
||||
@@ -87,9 +87,9 @@ Before diving into the notebook, you need to:
|
||||
|
||||
# Let's code Reinforce algorithm from scratch 🔥
|
||||
|
||||
## An advice 💡
|
||||
## Some advice 💡
|
||||
|
||||
It's better to run this colab in a copy on your Google Drive, so that **if it timeouts** you still have the saved notebook on your Google Drive and do not need to fill everything from scratch.
|
||||
It's better to run this colab in a copy on your Google Drive, so that **if it times out** you still have the saved notebook on your Google Drive and do not need to fill everything in from scratch.
|
||||
|
||||
To do that you can either do `Ctrl + S` or `File > Save a copy in Google Drive.`
|
||||
|
||||
@@ -107,7 +107,7 @@ To do that you can either do `Ctrl + S` or `File > Save a copy in Google Drive.`
|
||||
|
||||
During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames).
|
||||
|
||||
Hence the following cell will install the librairies and create and run a virtual screen 🖥
|
||||
The following cell will install the libraries and create and run a virtual screen 🖥
|
||||
|
||||
```python
|
||||
%%capture
|
||||
@@ -145,7 +145,7 @@ And you can find all the Deep Reinforcement Learning models here 👉 https://hu
|
||||
|
||||
## Import the packages 📦
|
||||
|
||||
In addition to import the installed libraries, we also import:
|
||||
In addition to importing the installed libraries, we also import:
|
||||
|
||||
- `imageio`: A library that will help us to generate a replay video
|
||||
|
||||
@@ -217,10 +217,10 @@ So, we start with CartPole-v1. The goal is to push the cart left or right **so t
|
||||
|
||||
The episode ends if:
|
||||
- The pole angle is greater than ±12°
|
||||
- Cart Position is greater than ±2.4
|
||||
- Episode length is greater than 500
|
||||
- The Cart Position is greater than ±2.4
|
||||
- The episode length is greater than 500
|
||||
|
||||
We get a reward 💰 of +1 every timestep the Pole stays in the equilibrium.
|
||||
We get a reward 💰 of +1 every timestep that the Pole stays in equilibrium.
|
||||
|
||||
```python
|
||||
env_id = "CartPole-v1"
|
||||
@@ -258,8 +258,8 @@ This implementation is based on three implementations:
|
||||
|
||||
So we want:
|
||||
- Two fully connected layers (fc1 and fc2).
|
||||
- Using ReLU as activation function of fc1
|
||||
- Using Softmax to output a probability distribution over actions
|
||||
- To use ReLU as the activation function of fc1
|
||||
- To use Softmax to output a probability distribution over actions
|
||||
|
||||
```python
|
||||
class Policy(nn.Module):
|
||||
@@ -310,7 +310,7 @@ class Policy(nn.Module):
|
||||
return action.item(), m.log_prob(action)
|
||||
```
|
||||
|
||||
I make a mistake, can you guess where?
|
||||
I made a mistake, can you guess where?
|
||||
|
||||
- To find out let's make a forward pass:
|
||||
|
||||
@@ -325,7 +325,7 @@ debug_policy.act(env.reset())
|
||||
|
||||
- Do you know why? Check the act function and try to see why it does not work.
|
||||
|
||||
Advice 💡: Something is wrong in this implementation. Remember that we act function **we want to sample an action from the probability distribution over actions**.
|
||||
Advice 💡: Something is wrong in this implementation. Remember that for the act function **we want to sample an action from the probability distribution over actions**.
|
||||
|
||||
|
||||
### (Real) Solution
|
||||
@@ -352,9 +352,9 @@ class Policy(nn.Module):
|
||||
|
||||
By using CartPole, it was easier to debug since **we know that the bug comes from our integration and not from our simple environment**.
|
||||
|
||||
- Since **we want to sample an action from the probability distribution over actions**, we can't use `action = np.argmax(m)` since it will always output the action that have the highest probability.
|
||||
- Since **we want to sample an action from the probability distribution over actions**, we can't use `action = np.argmax(m)` since it will always output the action that has the highest probability.
|
||||
|
||||
- We need to replace with `action = m.sample()` that will sample an action from the probability distribution P(.|s)
|
||||
- We need to replace this with `action = m.sample()` which will sample an action from the probability distribution P(.|s)
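Putting the fix in context, the corrected method (as it would sit inside the `Policy` class) looks roughly like this; it mirrors the notebook's structure rather than copying it verbatim:

```python
import torch
from torch.distributions import Categorical

def act(self, state):
    """Sample an action from the policy's distribution instead of taking the argmax."""
    state = torch.from_numpy(state).float().unsqueeze(0)
    probs = self.forward(state)       # probability distribution over actions
    m = Categorical(probs)
    action = m.sample()               # sample from P(.|s)
    return action.item(), m.log_prob(action)
```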
|
||||
|
||||
### Let's build the Reinforce Training Algorithm
|
||||
This is the Reinforce algorithm pseudocode:
|
||||
@@ -371,7 +371,7 @@ This is the Reinforce algorithm pseudocode:
|
||||
We use an interesting technique coded by [Chris1nexus](https://github.com/Chris1nexus) to **compute the return at each timestep efficiently**. The comments explain the procedure. Don't hesitate to also [check the PR explanation](https://github.com/huggingface/deep-rl-class/pull/95)
|
||||
But overall the idea is to **compute the return at each timestep efficiently**.
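The gist of the trick is a single backward pass over the episode's rewards. A simplified stand-alone version (not the notebook's exact code) looks like this:

```python
from collections import deque

def compute_returns(rewards, gamma):
    """Discounted return G_t for every timestep, computed backwards: G_t = r_t + gamma * G_{t+1}."""
    returns = deque()
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.appendleft(g)
    return list(returns)

# compute_returns([1, 1, 1], gamma=0.99) -> [2.9701, 1.99, 1.0]
```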
|
||||
|
||||
The second question you may ask is **why do we minimize the loss**? Did you talk about Gradient Ascent, not Gradient Descent?
|
||||
The second question you may ask is **why do we minimize the loss**? Didn't we talk about Gradient Ascent, not Gradient Descent earlier?
|
||||
|
||||
- We want to maximize our utility function $J(\theta)$, but in PyTorch and TensorFlow, it's better to **minimize an objective function.**
|
||||
- So let's say we want to reinforce action 3 at a certain timestep. Before training, the probability P of this action is 0.25.
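In code, this is why the training loop builds a "loss" that is simply the negative of the quantity we want to maximize. A minimal sketch (a hypothetical helper, assuming the log-probabilities and returns collected during an episode):

```python
import torch

def reinforce_loss(log_probs, returns):
    """Minimizing -(sum of log_prob * return) performs gradient ascent on the objective."""
    log_probs = torch.stack(log_probs)                      # list of log pi(a_t | s_t) tensors
    returns = torch.as_tensor(returns, dtype=torch.float32)
    return -(log_probs * returns).sum()
```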
|
||||
@@ -797,12 +797,12 @@ def push_to_hub(repo_id,
|
||||
print(f"Your model is pushed to the Hub. You can view your model here: {repo_url}")
|
||||
```
|
||||
|
||||
By using `push_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the Hub**.
|
||||
By using `push_to_hub`, **you evaluate, record a replay, generate a model card of your agent, and push it to the Hub**.
|
||||
|
||||
This way:
|
||||
- You can **showcase your work** 🔥
|
||||
- You can **visualize your agent playing** 👀
|
||||
- You can **share with the community an agent that others can use** 💾
|
||||
- You can **share an agent with the community that others can use** 💾
|
||||
- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
|
||||
|
||||
|
||||
@@ -821,7 +821,7 @@ To be able to share your model with the community there are three more steps to
|
||||
notebook_login()
|
||||
```
|
||||
|
||||
If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login` (or `login`)
|
||||
If you don't want to use Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login` (or `login`)
|
||||
|
||||
3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using the `package_to_hub()` function
|
||||
|
||||
@@ -836,7 +836,7 @@ push_to_hub(
|
||||
)
|
||||
```
|
||||
|
||||
Now that we try the robustness of our implementation, let's try a more complex environment: PixelCopter 🚁
|
||||
Now that we've tested the robustness of our implementation, let's try a more complex environment: PixelCopter 🚁
|
||||
|
||||
|
||||
|
||||
@@ -881,7 +881,7 @@ The action space(2) 🎮:
|
||||
- Down
|
||||
|
||||
The reward function 💰:
|
||||
- For each vertical block it passes through it gains a positive reward of +1. Each time a terminal state reached it receives a negative reward of -1.
|
||||
- For each vertical block it passes, it gains a positive reward of +1. Each time a terminal state is reached it receives a negative reward of -1.
|
||||
|
||||
### Define the new Policy 🧠
|
||||
- We need to have a deeper neural network since the environment is more complex
|
||||
@@ -986,11 +986,11 @@ push_to_hub(
|
||||
|
||||
## Some additional challenges 🏆
|
||||
|
||||
The best way to learn **is to try things on your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. But also trying to find better parameters.
|
||||
The best way to learn **is to try things on your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. But also try to find better parameters.
|
||||
|
||||
In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?
|
||||
|
||||
Here are some ideas to achieve so:
|
||||
Here are some ideas to climb up the leaderboard:
|
||||
* Train more steps
|
||||
* Try different hyperparameters by looking at what your classmates have done 👉 https://huggingface.co/models?other=reinforce
|
||||
* **Push your new trained model** on the Hub 🔥
|
||||
@@ -1008,9 +1008,9 @@ frames as observation)?
|
||||
In the next unit, **we're going to learn more about Unity MLAgents**, by training agents in Unity environments. This way, you will be ready to participate in the **AI vs AI challenges where you'll train your agents
|
||||
to compete against other agents in a snowball fight and a soccer game.**
|
||||
|
||||
Sounds fun? See you next time!
|
||||
Sound fun? See you next time!
|
||||
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
|
||||
See you in Unit 5! 🔥
|
||||
|
||||
|
||||
@@ -4,13 +4,13 @@
|
||||
|
||||
In the last unit, we learned about Deep Q-Learning. In this value-based deep reinforcement learning algorithm, we **used a deep neural network to approximate the different Q-values for each possible action at a state.**
|
||||
|
||||
Since the beginning of the course, we only studied value-based methods, **where we estimate a value function as an intermediate step towards finding an optimal policy.**
|
||||
Since the beginning of the course, we have only studied value-based methods, **where we estimate a value function as an intermediate step towards finding an optimal policy.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" />
|
||||
|
||||
In value-based methods, the policy ** \\(π\\) only exists because of the action value estimates since the policy is just a function** (for instance, greedy-policy) that will select the action with the highest value given a state.
|
||||
In value-based methods, the policy **\\(π\\) only exists because of the action value estimates, since the policy is just a function** (for instance, a greedy policy) that will select the action with the highest value given a state.
|
||||
|
||||
But, with policy-based methods, we want to optimize the policy directly **without having an intermediate step of learning a value function.**
|
||||
With policy-based methods, we want to optimize the policy directly **without having an intermediate step of learning a value function.**
|
||||
|
||||
So today, **we'll learn about policy-based methods and study a subset of these methods called policy gradient**. Then we'll implement our first policy gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch.
|
||||
Then, we'll test its robustness using the CartPole-v1 and PixelCopter environments.
|
||||
@@ -21,4 +21,4 @@ You'll then be able to iterate and improve this implementation for more advanced
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/envs.gif" alt="Environments"/>
|
||||
</figure>
|
||||
|
||||
Let's get started,
|
||||
Let's get started!
|
||||
|
||||
@@ -4,7 +4,7 @@
|
||||
|
||||
We just learned that policy-gradient methods aim to find parameters \\( \theta \\) that **maximize the expected return**.
|
||||
|
||||
The idea is that we have a *parameterized stochastic policy*. In our case, a neural network outputs a probability distribution over actions. The probability of taking each action is also called *action preference*.
|
||||
The idea is that we have a *parameterized stochastic policy*. In our case, a neural network outputs a probability distribution over actions. The probability of taking each action is also called the *action preference*.
|
||||
|
||||
If we take the example of CartPole-v1:
|
||||
- As input, we have a state.
|
||||
@@ -47,20 +47,20 @@ The *objective function* gives us the **performance of the agent** given a traje
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/objective.jpg" alt="Return"/>
|
||||
|
||||
Let's detail a little bit more this formula:
|
||||
- The *expected return* (also called expected cumulative reward), is the weighted average (where the weights are given by \\(P(\tau;\theta)\\) of all possible values that the return \\(R(\tau)\\) can take.
|
||||
Let's give some more details on this formula:
|
||||
- The *expected return* (also called the expected cumulative reward) is the weighted average (where the weights are given by \\(P(\tau;\theta)\\)) of all possible values that the return \\(R(\tau)\\) can take.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/expected_reward.png" alt="Return"/>
|
||||
|
||||
|
||||
- \\(R(\tau)\\) : Return from an arbitrary trajectory. To take this quantity and use it to calculate the expected return, we need to multiply it by the probability of each possible trajectory.
|
||||
- \\(P(\tau;\theta)\\) : Probability of each possible trajectory \\(\tau\\) (that probability depends on \\( \theta\\) since it defines the policy that it uses to select the actions of the trajectory which as an impact of the states visited).
|
||||
- \\(P(\tau;\theta)\\) : Probability of each possible trajectory \\(\tau\\) (that probability depends on \\( \theta\\), since it defines the policy that it uses to select the actions of the trajectory, which has an impact on the states visited).
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/probability.png" alt="Probability"/>
|
||||
|
||||
- \\(J(\theta)\\) : Expected return, we calculate it by summing for all trajectories, the probability of taking that trajectory given \\(\theta \\), and the return of this trajectory.
|
||||
- \\(J(\theta)\\) : The expected return. We calculate it by summing, over all trajectories, the probability of taking that trajectory given \\(\theta \\) multiplied by the return of that trajectory.
|
||||
|
||||
Our objective then is to maximize the expected cumulative reward by finding \\(\theta \\) that will output the best action probability distributions:
|
||||
Our objective then is to maximize the expected cumulative reward by finding the \\(\theta \\) that will output the best action probability distributions:
|
||||
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/max_objective.png" alt="Max objective"/>
|
||||
@@ -68,7 +68,7 @@ Our objective then is to maximize the expected cumulative reward by finding \\(\
|
||||
|
||||
## Gradient Ascent and the Policy-gradient Theorem
|
||||
|
||||
Policy-gradient is an optimization problem: we want to find the values of \\(\theta\\) that maximize our objective function \\(J(\theta)\\), we need to use **gradient-ascent**. It's the inverse of *gradient-descent* since it gives the direction of the steepest increase of \\(J(\theta)\\).
|
||||
Policy-gradient is an optimization problem: we want to find the values of \\(\theta\\) that maximize our objective function \\(J(\theta)\\), so we need to use **gradient-ascent**. It's the inverse of *gradient-descent* since it gives the direction of the steepest increase of \\(J(\theta)\\).
|
||||
|
||||
(If you need a refresher on the difference between gradient descent and gradient ascent [check this](https://www.baeldung.com/cs/gradient-descent-vs-ascent) and [this](https://stats.stackexchange.com/questions/258721/gradient-ascent-vs-gradient-descent-in-logistic-regression)).
|
||||
|
||||
@@ -76,13 +76,13 @@ Our update step for gradient-ascent is:
|
||||
|
||||
\\( \theta \leftarrow \theta + \alpha * \nabla_\theta J(\theta) \\)
|
||||
|
||||
We can repeatedly apply this update state in the hope that \\(\theta \\) converges to the value that maximizes \\(J(\theta)\\).
|
||||
We can repeatedly apply this update in the hopes that \\(\theta \\) converges to the value that maximizes \\(J(\theta)\\).
|
||||
|
||||
However, we have two problems to obtain the derivative of \\(J(\theta)\\):
|
||||
1. We can't calculate the true gradient of the objective function since it would imply calculating the probability of each possible trajectory which is computationally super expensive.
|
||||
We want then to **calculate a gradient estimation with a sample-based estimate (collect some trajectories)**.
|
||||
However, there are two problems with computing the derivative of \\(J(\theta)\\):
|
||||
1. We can't calculate the true gradient of the objective function since it requires calculating the probability of each possible trajectory, which is computationally super expensive.
|
||||
So we want to **calculate a gradient estimation with a sample-based estimate (collect some trajectories)**.
|
||||
|
||||
2. We have another problem that I detail in the next optional section. To differentiate this objective function, we need to differentiate the state distribution, called Markov Decision Process dynamics. This is attached to the environment. It gives us the probability of the environment going into the next state, given the current state and the action taken by the agent. The problem is that we can't differentiate it because we might not know about it.
|
||||
2. We have another problem that I explain in the next optional section. To differentiate this objective function, we need to differentiate the state distribution, called the Markov Decision Process dynamics. This is attached to the environment. It gives us the probability of the environment going into the next state, given the current state and the action taken by the agent. The problem is that we can't differentiate it because we might not know about it.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/probability.png" alt="Probability"/>
|
||||
|
||||
@@ -90,7 +90,7 @@ Fortunately we're going to use a solution called the Policy Gradient Theorem tha
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_gradient_theorem.png" alt="Policy Gradient"/>
|
||||
|
||||
If you want to understand how we derivate this formula that we will use to approximate the gradient, check the next (optional) section.
|
||||
If you want to understand how we derive this formula for approximating the gradient, check out the next (optional) section.
|
||||
|
||||
## The Reinforce algorithm (Monte Carlo Reinforce)
|
||||
|
||||
@@ -106,12 +106,12 @@ In a loop:
|
||||
|
||||
- Update the weights of the policy: \\(\theta \leftarrow \theta + \alpha \hat{g}\\)
|
||||
|
||||
The interpretation we can make is this one:
|
||||
We can interpret this update as follows:
|
||||
- \\(\nabla_\theta \log \pi_\theta(a_t|s_t)\\) is the direction of the **steepest increase of the (log) probability** of selecting action \\(a_t\\) from state \\(s_t\\).
|
||||
This tells us **how we should change the weights of the policy** if we want to increase/decrease the log probability of selecting action \\(a_t\\) at state \\(s_t\\).
|
||||
- \\(R(\tau)\\) is the scoring function:
|
||||
- If the return is high, it will **push up the probabilities** of the (state, action) combinations.
|
||||
- Else, if the return is low, it will **push down the probabilities** of the (state, action) combinations.
|
||||
- Otherwise, if the return is low, it will **push down the probabilities** of the (state, action) combinations.
|
||||
|
||||
|
||||
We can also **collect multiple episodes (trajectories)** to estimate the gradient:
|
||||
|
||||
@@ -10,12 +10,12 @@ For instance, in a soccer game (where you're going to train the agents in two un
|
||||
|
||||
## Value-based, Policy-based, and Actor-critic methods
|
||||
|
||||
We studied in the first unit, that we had two methods to find (most of the time approximate) this optimal policy \\(\pi^{*}\\).
|
||||
In the first unit, we saw two methods to find (or, most of the time, approximate) this optimal policy \\(\pi^{*}\\).
|
||||
|
||||
- In *value-based methods*, we learn a value function.
|
||||
- The idea is that an optimal value function leads to an optimal policy \\(\pi^{*}\\).
|
||||
- Our objective is to **minimize the loss between the predicted and target value** to approximate the true action-value function.
|
||||
- We have a policy, but it's implicit since it **was generated directly from the value function**. For instance, in Q-Learning, we defined an epsilon-greedy policy.
|
||||
- We have a policy, but it's implicit since it **is generated directly from the value function**. For instance, in Q-Learning, we used an (epsilon-)greedy policy.
|
||||
|
||||
- On the other hand, in *policy-based methods*, we directly learn to approximate \\(\pi^{*}\\) without having to learn a value function.
|
||||
- The idea is **to parameterize the policy**. For instance, using a neural network \\(\pi_\theta\\), this policy will output a probability distribution over actions (stochastic policy).
|
||||
@@ -25,10 +25,10 @@ We studied in the first unit, that we had two methods to find (most of the time
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_based.png" alt="Policy based" />
|
||||
|
||||
- Finally, we'll study the next time *actor-critic* which is a combination of value-based and policy-based methods.
|
||||
- Next time, we'll study the *actor-critic* method, which is a combination of value-based and policy-based methods.
|
||||
|
||||
Consequently, thanks to policy-based methods, we can directly optimize our policy \\(\pi_\theta\\) to output a probability distribution over actions \\(\pi_\theta(a|s)\\) that leads to the best cumulative return.
|
||||
To do that, we define an objective function \\(J(\theta)\\), that is, the expected cumulative reward, and we **want to find \\(\theta\\) that maximizes this objective function**.
|
||||
To do that, we define an objective function \\(J(\theta)\\), that is, the expected cumulative reward, and we **want to find the value \\(\theta\\) that maximizes this objective function**.
|
||||
|
||||
## The difference between policy-based and policy-gradient methods
|
||||
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
# Bonus: Learn to create your own environments with Unity and MLAgents
|
||||
|
||||
**You can create your own reinforcement learning environments with Unity and MLAgents**. But, using a game engine such as Unity, can be intimidating at first but here are the steps you can do to learn smoothly.
|
||||
**You can create your own reinforcement learning environments with Unity and MLAgents**. Using a game engine such as Unity can be intimidating at first, but here are the steps you can take to learn smoothly.
|
||||
|
||||
## Step 1: Know how to use Unity
|
||||
|
||||
@@ -12,8 +12,8 @@
|
||||
|
||||
## Step 3: Iterate and create nice environments
|
||||
|
||||
- Now that you've created a first simple environment you can iterate in more complex one using the [MLAgents documentation (especially Designing Agents and Agent part)](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/)
|
||||
- In addition, you can follow this free course ["Create a hummingbird environment"](https://learn.unity.com/course/ml-agents-hummingbirds) by [Adam Kelly](https://twitter.com/aktwelve)
|
||||
- Now that you've created your first simple environment you can iterate to more complex ones using the [MLAgents documentation (especially Designing Agents and Agent part)](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/)
|
||||
- In addition, you can take this free course ["Create a hummingbird environment"](https://learn.unity.com/course/ml-agents-hummingbirds) by [Adam Kelly](https://twitter.com/aktwelve)
|
||||
|
||||
|
||||
Have fun! And if you create custom environments don't hesitate to share them to `#rl-i-made-this` discord channel.
|
||||
Have fun! And if you create custom environments don't hesitate to share them to the `#rl-i-made-this` discord channel.
|
||||
|
||||
@@ -6,17 +6,17 @@ The best way to learn is to **practice and try stuff**. Why not try another envi
|
||||
|
||||
For instance:
|
||||
- [Worm](https://singularite.itch.io/worm), where you teach a worm to crawl.
|
||||
- [Walker](https://singularite.itch.io/walker): teach an agent to walk towards a goal.
|
||||
- [Walker](https://singularite.itch.io/walker), where you teach an agent to walk towards a goal.
|
||||
|
||||
Check the documentation to find how to train them and the list of already integrated MLAgents environments on the Hub: https://github.com/huggingface/ml-agents#getting-started
|
||||
Check the documentation to find out how to train them and to see the list of already integrated MLAgents environments on the Hub: https://github.com/huggingface/ml-agents#getting-started
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/envs-unity.jpeg" alt="Example envs"/>
|
||||
|
||||
|
||||
In the next unit, we're going to learn about multi-agents. And you're going to train your first multi-agents to compete in Soccer and Snowball fight against other classmate's agents.
|
||||
In the next unit, we're going to learn about multi-agents. You're going to train your first multi-agents to compete in Soccer and Snowball fight against other classmate's agents.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballfight.gif" alt="Snownball fight"/>
|
||||
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
|
||||
### Keep Learning, stay awesome 🤗
|
||||
|
||||
@@ -7,7 +7,7 @@ This is an (optional) introduction to Curiosity. If you want to learn more, you
|
||||
|
||||
## Two Major Problems in Modern RL
|
||||
|
||||
To understand what is Curiosity, we need first to understand the two major problems with RL:
|
||||
To understand what Curiosity is, we first need to understand the two major problems with RL:
|
||||
|
||||
First, the *sparse rewards problem:* that is, **most rewards do not contain information, and hence are set to zero**.
|
||||
|
||||
@@ -33,7 +33,7 @@ A solution to these problems is **to develop a reward function intrinsic to the
|
||||
|
||||
**This intrinsic reward mechanism is known as Curiosity** because this reward pushes the agent to explore states that are novel/unfamiliar. To achieve that, our agent will receive a high reward when exploring new trajectories.
|
||||
|
||||
This reward is inspired by how human acts. ** we naturally have an intrinsic desire to explore environments and discover new things**.
|
||||
This reward is inspired by how humans act: **we naturally have an intrinsic desire to explore environments and discover new things**.
|
||||
|
||||
There are different ways to calculate this intrinsic reward. The classical approach (Curiosity through next-state prediction) is to calculate Curiosity **as the error of our agent in predicting the next state, given the current state and action taken**.
|
||||
|
||||
@@ -41,7 +41,7 @@ There are different ways to calculate this intrinsic reward. The classical appro
|
||||
|
||||
Because the idea of Curiosity is to **encourage our agent to perform actions that reduce the uncertainty in the agent’s ability to predict the consequences of its actions** (uncertainty will be higher in areas where the agent has spent less time or in areas with complex dynamics).
|
||||
|
||||
If the agent spends a lot of time on these states, it will be good to predict the next state (low Curiosity). On the other hand, if it’s a new state unexplored, it will be hard to predict the following state (high Curiosity).
|
||||
If the agent spends a lot of time on these states, it will be good at predicting the next state (low Curiosity). On the other hand, if it’s in a new, unexplored state, it will be hard to predict the following state (high Curiosity).
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/curiosity4.png" alt="Curiosity"/>
|
||||
|
||||
|
||||
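As a toy sketch of this idea (assuming PyTorch; the dimensions are arbitrary placeholders), a small forward-dynamics model predicts the next state and the prediction error is used as the intrinsic reward. In practice, this model is trained on the transitions the agent actually observes, so the error, and therefore the Curiosity bonus, shrinks in states the agent visits often:

```python
import torch
from torch import nn

state_dim, action_dim = 8, 2  # illustrative sizes

# Forward dynamics model: predicts the next state from (current state, action taken)
forward_model = nn.Sequential(
    nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, state_dim)
)

def intrinsic_reward(state, action, next_state):
    """Curiosity bonus = error of the prediction of the next state."""
    predicted_next_state = forward_model(torch.cat([state, action], dim=-1))
    return ((predicted_next_state - next_state) ** 2).mean(dim=-1)

# One dummy transition: the bonus would be added to the extrinsic reward from the environment
s, a, s_next = torch.randn(state_dim), torch.randn(action_dim), torch.randn(state_dim)
print(intrinsic_reward(s, a, s_next))
```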
@@ -11,8 +11,8 @@ We learned what ML-Agents is and how it works. We also studied the two environme
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/envs.png" alt="Environments" />
|
||||
|
||||
The ML-Agents integration on the Hub **is still experimental**. Some features will be added in the future. But for now, to validate this hands-on for the certification process, you just need to push your trained models to the Hub.
|
||||
There are no minimum results to attain to validate this Hands On. But if you want to get nice results, you can try to reach the following:
|
||||
The ML-Agents integration on the Hub **is still experimental**. Some features will be added in the future. But, for now, to validate this hands-on for the certification process, you just need to push your trained models to the Hub.
|
||||
There are no minimum results to attain in order to validate this Hands On. But if you want to get nice results, you can try to reach the following:
|
||||
|
||||
- For [Pyramids](https://singularite.itch.io/pyramids): Mean Reward = 1.75
|
||||
- For [SnowballTarget](https://singularite.itch.io/snowballtarget): Mean Reward = 15 or 30 targets shot in an episode.
|
||||
@@ -30,7 +30,7 @@ For more information about the certification process, check this section 👉 ht
|
||||
In this notebook, you'll learn about ML-Agents and train two agents.
|
||||
|
||||
- The first one will learn to **shoot snowballs onto spawning targets**.
|
||||
- The second need to press a button to spawn a pyramid, then navigate to the pyramid, knock it over, **and move to the gold brick at the top**. To do that, it will need to explore its environment, and we will use a technique called curiosity.
|
||||
- The second needs to press a button to spawn a pyramid, then navigate to the pyramid, knock it over, **and move to the gold brick at the top**. To do that, it will need to explore its environment, and we will use a technique called curiosity.
|
||||
|
||||
After that, you'll be able **to watch your agents playing directly on your browser**.
|
||||
|
||||
@@ -59,13 +59,13 @@ We're constantly trying to improve our tutorials, so **if you find some issues i
|
||||
|
||||
At the end of the notebook, you will:
|
||||
|
||||
- Understand how works **ML-Agents**, the environment library.
|
||||
- Understand how **ML-Agents**, the environment library, works.
|
||||
- Be able to **train agents in Unity Environments**.
|
||||
|
||||
## Prerequisites 🏗️
|
||||
Before diving into the notebook, you need to:
|
||||
|
||||
🔲 📚 **Study [what is ML-Agents and how it works by reading Unit 5](https://huggingface.co/deep-rl-course/unit5/introduction)** 🤗
|
||||
🔲 📚 **Study [what ML-Agents is and how it works by reading Unit 5](https://huggingface.co/deep-rl-course/unit5/introduction)** 🤗
|
||||
|
||||
# Let's train our agents 🚀
|
||||
|
||||
@@ -101,7 +101,7 @@ Before diving into the notebook, you need to:
|
||||
If you need a refresher on how this environment works check this section 👉
|
||||
https://huggingface.co/deep-rl-course/unit5/snowball-target
|
||||
|
||||
### Download and move the environm ent zip file in `./training-envs-executables/linux/`
|
||||
### Download and move the environment zip file in `./training-envs-executables/linux/`
|
||||
- Our environment executable is in a zip file.
|
||||
- We need to download it and place it to `./training-envs-executables/linux/`
|
||||
- We use a linux executable because we use colab, and colab machines OS is Ubuntu (linux)
|
||||
@@ -134,14 +134,14 @@ Make sure your file is accessible
|
||||
```
|
||||
|
||||
### Define the SnowballTarget config file
|
||||
- In ML-Agents, you define the **training hyperparameters into config.yaml files.**
|
||||
- In ML-Agents, you define the **training hyperparameters in config.yaml files.**
|
||||
|
||||
There are multiple hyperparameters. To know them better, you should check for each explanation with [the documentation](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/Training-Configuration-File.md)
|
||||
There are multiple hyperparameters. To understand them better, you should read the explanation for each one in [the documentation](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/Training-Configuration-File.md)
|
||||
|
||||
|
||||
You need to create a `SnowballTarget.yaml` config file in ./content/ml-agents/config/ppo/
|
||||
|
||||
We'll give you here a first version of this config (to copy and paste into your `SnowballTarget.yaml file`), **but you should modify it**.
|
||||
We'll give you a preliminary version of this config (to copy and paste into your `SnowballTarget.yaml file`), **but you should modify it**.
|
||||
|
||||
```yaml
|
||||
behaviors:
|
||||
@@ -197,7 +197,7 @@ Train the model and use the `--resume` flag to continue training in case of inte
|
||||
|
||||
> It will fail the first time if and when you use `--resume`. Try rerunning the block to bypass the error.
|
||||
|
||||
The training will take 10 to 35min depending on your config. Go take a ☕️you deserve it 🤗.
|
||||
The training will take 10 to 35min depending on your config. Go take a ☕️ you deserve it 🤗.
|
||||
|
||||
```bash
|
||||
!mlagents-learn ./config/ppo/SnowballTarget.yaml --env=./training-envs-executables/linux/SnowballTarget/SnowballTarget --run-id="SnowballTarget1" --no-graphics
|
||||
@@ -205,7 +205,7 @@ The training will take 10 to 35min depending on your config. Go take a ☕️you
|
||||
|
||||
### Push the agent to the Hugging Face Hub
|
||||
|
||||
- Now that we trained our agent, we’re **ready to push it to the Hub to be able to visualize it playing on your browser🔥.**
|
||||
- Now that we've trained our agent, we’re **ready to push it to the Hub and visualize it playing on your browser🔥.**
|
||||
|
||||
To be able to share your model with the community, there are three more steps to follow:
|
||||
|
||||
@@ -225,9 +225,9 @@ from huggingface_hub import notebook_login
|
||||
notebook_login()
|
||||
```
|
||||
|
||||
If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
|
||||
If you don't want to use Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
|
||||
|
||||
Then, we need to run `mlagents-push-to-hf`.
|
||||
Then we need to run `mlagents-push-to-hf`.
|
||||
|
||||
And we define four parameters:
|
||||
|
||||
@@ -235,7 +235,7 @@ And we define four parameters:
|
||||
2. `--local-dir`: where the agent was saved, it’s results/<run_id name>, so in my case results/First Training.
|
||||
3. `--repo-id`: the name of the Hugging Face repo you want to create or update. It’s always <your huggingface username>/<the repo name>
|
||||
If the repo does not exist **it will be created automatically**
|
||||
4. `--commit-message`: since HF repos are git repository you need to define a commit message.
|
||||
4. `--commit-message`: since HF repos are git repositories you need to give a commit message.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/mlagentspushtohub.png" alt="Push to Hub"/>
|
||||
|
||||
@@ -247,7 +247,7 @@ For instance:
|
||||
!mlagents-push-to-hf --run-id= # Add your run id --local-dir= # Your local dir --repo-id= # Your repo id --commit-message= # Your commit message
|
||||
```
|
||||
|
||||
Else, if everything worked you should have this at the end of the process(but with a different url 😆) :
|
||||
If everything worked you should see this at the end of the process (but with a different url 😆) :
|
||||
|
||||
|
||||
|
||||
@@ -277,18 +277,18 @@ This step it's simple:
|
||||
- I have multiple ones since we saved a model every 500000 timesteps.
|
||||
- But if I want the most recent one, I choose `SnowballTarget.onnx`
|
||||
|
||||
👉 What's nice **is to try different models steps to see the improvement of the agent.**
|
||||
👉 It's nice to **try different model stages to see the improvement of the agent.**
|
||||
|
||||
And don't hesitate to share the best score your agent gets on discord in #rl-i-made-this channel 🔥
|
||||
And don't hesitate to share the best score your agent gets on discord in the #rl-i-made-this channel 🔥
|
||||
|
||||
Let's now try a more challenging environment called Pyramids.
|
||||
Now let's try a more challenging environment called Pyramids.
|
||||
|
||||
## Pyramids 🏆
|
||||
|
||||
### Download and move the environment zip file in `./training-envs-executables/linux/`
|
||||
- Our environment executable is in a zip file.
|
||||
- We need to download it and place it to `./training-envs-executables/linux/`
|
||||
- We use a linux executable because we use colab, and colab machines OS is Ubuntu (linux)
|
||||
- We need to download it and place it into `./training-envs-executables/linux/`
|
||||
- We use a linux executable because we're using colab, and the colab machine's OS is Ubuntu (linux)
|
||||
|
||||
Download the file Pyramids.zip from https://drive.google.com/uc?export=download&id=1UiFNdKlsH0NTu32xV-giYUEVKV4-vc7H using `wget`. Check out the full solution to download large files from GDrive [here](https://bcrf.biochem.wisc.edu/2021/02/05/download-google-drive-files-using-wget/)
|
||||
|
||||
@@ -316,7 +316,7 @@ Make sure your file is accessible
|
||||
|
||||
For this training, we’ll modify one thing:
|
||||
- The total training steps hyperparameter is too high since we can hit the benchmark (mean reward = 1.75) in only 1M training steps.
|
||||
👉 To do that, we go to config/ppo/PyramidsRND.yaml,**and modify these to max_steps to 1000000.**
|
||||
👉 To do that, we go to config/ppo/PyramidsRND.yaml, **and change max_steps to 1000000.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-config.png" alt="Pyramids config"/>
|
||||
|
||||
@@ -326,7 +326,7 @@ We’re now ready to train our agent 🔥.
|
||||
|
||||
### Train the agent
|
||||
|
||||
The training will take 30 to 45min depending on your machine, go take a ☕️you deserve it 🤗.
|
||||
The training will take 30 to 45min depending on your machine, go take a ☕️ you deserve it 🤗.
|
||||
|
||||
```python
|
||||
!mlagents-learn ./config/ppo/PyramidsRND.yaml --env=./training-envs-executables/linux/Pyramids/Pyramids --run-id="Pyramids Training" --no-graphics
|
||||
@@ -347,13 +347,13 @@ The temporary link for the Pyramids demo is: https://singularite.itch.io/pyramid
|
||||
### 🎁 Bonus: Why not train on another environment?
|
||||
Now that you know how to train an agent using MLAgents, **why not try another environment?**
|
||||
|
||||
MLAgents provides 18 different and we’re building some custom ones. The best way to learn is to try things of your own, have fun.
|
||||
MLAgents provides 18 different environments and we’re building some custom ones. The best way to learn is to try things on your own, have fun.
|
||||
|
||||

|
||||
|
||||
You have the full list of the one currently available on Hugging Face here 👉 https://github.com/huggingface/ml-agents#the-environments
|
||||
You have the full list of the currently available environments on Hugging Face here 👉 https://github.com/huggingface/ml-agents#the-environments
|
||||
|
||||
For the demos to visualize your agent, the temporary link is: https://singularite.itch.io (temporary because we'll also put the demos on Hugging Face Space)
|
||||
For the demos to visualize your agent, the temporary link is: https://singularite.itch.io (temporary because we'll also put the demos on Hugging Face Spaces)
|
||||
|
||||
For now we have integrated:
|
||||
- [Worm](https://singularite.itch.io/worm) demo where you teach a **worm to crawl**.
|
||||
@@ -363,7 +363,7 @@ If you want new demos to be added, please open an issue: https://github.com/hugg
|
||||
|
||||
That’s all for today. Congrats on finishing this tutorial!
|
||||
|
||||
The best way to learn is to practice and try stuff. Why not try another environment? ML-Agents has 18 different environments, but you can also create your own? Check the documentation and have fun!
|
||||
The best way to learn is to practice and try stuff. Why not try another environment? ML-Agents has 18 different environments, but you can also create your own. Check the documentation and have fun!
|
||||
|
||||
See you on Unit 6 🔥,
|
||||
|
||||
|
||||
@@ -26,14 +26,14 @@ With Unity ML-Agents, you have six essential components:
|
||||
- The second is the *Python Low-level API*, which contains **the low-level Python interface for interacting and manipulating the environment**. It’s the API we use to launch the training.
|
||||
- Then, we have the *External Communicator* that **connects the Learning Environment (made with C#) with the low level Python API (Python)**.
|
||||
- The *Python trainers*: the **Reinforcement algorithms made with PyTorch (PPO, SAC…)**.
|
||||
- The *Gym wrapper*: to encapsulate RL environment in a gym wrapper.
|
||||
- The *PettingZoo wrapper*: PettingZoo is the multi-agents of gym wrapper.
|
||||
- The *Gym wrapper*: to encapsulate the RL environment in a gym wrapper.
|
||||
- The *PettingZoo wrapper*: PettingZoo is the multi-agents version of the gym wrapper.
|
||||
|
||||
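To give a feel for the *Python Low-level API* mentioned in this list, here is a minimal sketch of connecting to an executable and querying it. It reuses the SnowballTarget executable path from the hands-on as an example; treat the exact calls as an illustration, since details can change between ML-Agents releases:

```python
from mlagents_envs.environment import UnityEnvironment

# Path to a Learning Environment executable (here, the SnowballTarget one used in the hands-on)
env = UnityEnvironment(
    file_name="./training-envs-executables/linux/SnowballTarget/SnowballTarget",
    no_graphics=True,
)
env.reset()

behavior_name = list(env.behavior_specs)[0]
decision_steps, terminal_steps = env.get_steps(behavior_name)
print(behavior_name, "-", len(decision_steps), "agent(s) requesting a decision")

env.close()
```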
## Inside the Learning Component [[inside-learning-component]]
|
||||
|
||||
Inside the Learning Component, we have **three important elements**:
|
||||
|
||||
- The first is the *agent component*, the actor of the scene. We’ll **train the agent by optimizing its policy** (which will tell us what action to take in each state). The policy is called *Brain*.
|
||||
- The first is the *agent component*, the actor of the scene. We’ll **train the agent by optimizing its policy** (which will tell us what action to take in each state). The policy is called the *Brain*.
|
||||
- Finally, there is the *Academy*. This component **orchestrates agents and their decision-making processes**. Think of this Academy as a teacher who handles Python API requests.
|
||||
|
||||
To better understand its role, let’s remember the RL process. This can be modeled as a loop that works like this:
|
||||
@@ -50,7 +50,7 @@ Now, let’s imagine an agent learning to play a platform game. The RL process l
|
||||
|
||||
- Our Agent receives **state \\(S_0\\)** from the **Environment** — we receive the first frame of our game (Environment).
|
||||
- Based on that **state \\(S_0\\),** the Agent takes **action \\(A_0\\)** — our Agent will move to the right.
|
||||
- Environment goes to a **new** **state \\(S_1\\)** — new frame.
|
||||
- The environment goes to a **new** **state \\(S_1\\)** — new frame.
|
||||
- The environment gives some **reward \\(R_1\\)** to the Agent — we’re not dead *(Positive Reward +1)*.
|
||||
|
||||
This RL loop outputs a sequence of **state, action, reward and next state.** The goal of the agent is to **maximize the expected cumulative reward**.
|
||||
|
||||
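To make this loop concrete, here is the same interaction pattern written with the familiar gym-style API (a generic sketch with a random policy, assuming the classic `gym` package; it is not ML-Agents-specific code):

```python
import gym

env = gym.make("CartPole-v1")
state = env.reset()
done, cumulative_reward = False, 0.0

while not done:
    action = env.action_space.sample()                  # the trained policy would pick the action here
    next_state, reward, done, info = env.step(action)   # the environment returns S_{t+1} and R_{t+1}
    cumulative_reward += reward
    state = next_state

print("Cumulative reward for this episode:", cumulative_reward)
```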
@@ -2,7 +2,7 @@
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/thumbnail.png" alt="thumbnail"/>
|
||||
|
||||
One of the challenges in Reinforcement Learning is **creating environments**. Fortunately for us, we can use game engines to achieve so.
|
||||
One of the challenges in Reinforcement Learning is **creating environments**. Fortunately for us, we can use game engines to do so.
|
||||
These engines, such as [Unity](https://unity.com/), [Godot](https://godotengine.org/) or [Unreal Engine](https://www.unrealengine.com/), are programs made to create video games. They are perfectly suited
|
||||
for creating environments: they provide physics systems, 2D/3D rendering, and more.
|
||||
|
||||
@@ -14,18 +14,18 @@ One of them, [Unity](https://unity.com/), created the [Unity ML-Agents Toolkit](
|
||||
<figcaption>Source: <a href="https://github.com/Unity-Technologies/ml-agents">ML-Agents documentation</a></figcaption>
|
||||
</figure>
|
||||
|
||||
Unity ML-Agents Toolkit provides many exceptional pre-made environments, from playing football (soccer), learning to walk, and jumping big walls.
|
||||
Unity ML-Agents Toolkit provides many exceptional pre-made environments, such as playing football (soccer), learning to walk, and jumping over big walls.
|
||||
|
||||
In this Unit, we'll learn to use ML-Agents, but **don't worry if you don't know how to use the Unity Game Engine**: you don't need to use it to train your agents.
|
||||
|
||||
So, today, we're going to train two agents:
|
||||
- The first one will learn to **shoot snowballs onto spawning target**.
|
||||
- The second needs to **press a button to spawn a pyramid, then navigate to the pyramid, knock it over, and move to the gold brick at the top**. To do that, it will need to explore its environment, which will be achieved using a technique called curiosity.
|
||||
- The first one will learn to **shoot snowballs onto a spawning target**.
|
||||
- The second needs to **press a button to spawn a pyramid, then navigate to the pyramid, knock it over, and move to the gold brick at the top**. To do that, it will need to explore its environment, which will be done using a technique called curiosity.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/envs.png" alt="Environments" />
|
||||
|
||||
Then, after training, **you'll push the trained agents to the Hugging Face Hub**, and you'll be able to **visualize it playing directly on your browser without having to use the Unity Editor**.
|
||||
Then, after training, **you'll push the trained agents to the Hugging Face Hub**, and you'll be able to **visualize them playing directly on your browser without having to use the Unity Editor**.
|
||||
|
||||
Doing this Unit will **prepare you for the next challenge: AI vs. AI where you will train agents in multi-agents environments and compete against your classmates' agents**.
|
||||
|
||||
Sounds exciting? Let's get started!
|
||||
Sound exciting? Let's get started!
|
||||
|
||||
@@ -8,7 +8,7 @@ SnowballTarget is an environment we created at Hugging Face using assets from [K
|
||||
|
||||
The first agent you're going to train is called Julien the bear 🐻. Julien is trained **to hit targets with snowballs**.
|
||||
|
||||
The Goal in this environment is that Julien **hits as many targets as possible in the limited time** (1000 timesteps). It will need **to place itself correctly from the target and shoot**to do that.
|
||||
The Goal in this environment is that Julien **hits as many targets as possible in the limited time** (1000 timesteps). It will need **to place itself correctly in relation to the target and shoot** to do that.
|
||||
|
||||
In addition, to avoid "snowball spamming" (aka shooting a snowball every timestep), **Julien has a "cool off" system** (it needs to wait 0.5 seconds after a shoot to be able to shoot again).
|
||||
|
||||
|
||||
@@ -2,7 +2,7 @@
|
||||
|
||||
## Bias-variance tradeoff in Reinforcement Learning
|
||||
|
||||
If you want to dive deeper into the question of variance and bias tradeoff in Deep Reinforcement Learning, you can check these two articles:
|
||||
If you want to dive deeper into the question of variance and bias tradeoff in Deep Reinforcement Learning, you can check out these two articles:
|
||||
|
||||
- [Making Sense of the Bias / Variance Trade-off in (Deep) Reinforcement Learning](https://blog.mlreview.com/making-sense-of-the-bias-variance-trade-off-in-deep-reinforcement-learning-79cf1e83d565)
|
||||
- [Bias-variance Tradeoff in Reinforcement Learning](https://www.endtoend.ai/blog/bias-variance-tradeoff-in-reinforcement-learning/)
|
||||
|
||||
@@ -4,7 +4,7 @@
|
||||
|
||||
The solution to reducing the variance of the Reinforce algorithm and training our agent faster and better is to use a combination of Policy-Based and Value-Based methods: *the Actor-Critic method*.
|
||||
|
||||
To understand the Actor-Critic, imagine you play a video game. You can play with a friend that will provide you with some feedback. You're the Actor and your friend is the Critic.
|
||||
To understand the Actor-Critic, imagine you're playing a video game. You can play with a friend that will provide you with some feedback. You're the Actor and your friend is the Critic.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/ac.jpg" alt="Actor Critic"/>
|
||||
|
||||
@@ -21,13 +21,13 @@ This is the idea behind Actor-Critic. We learn two function approximations:
|
||||
- *A value function* to assist the policy update by measuring how good the action taken is: \\( \hat{q}_{w}(s,a) \\)
|
||||
|
||||
## The Actor-Critic Process
|
||||
Now that we have seen the Actor Critic's big picture, let's dive deeper to understand how Actor and Critic improve together during the training.
|
||||
Now that we have seen the Actor Critic's big picture, let's dive deeper to understand how the Actor and Critic improve together during the training.
|
||||
|
||||
As we saw, with Actor-Critic methods, there are two function approximations (two neural networks):
|
||||
- *Actor*, a **policy function** parameterized by theta: \\( \pi_{\theta}(s) \\)
|
||||
- *Critic*, a **value function** parameterized by w: \\( \hat{q}_{w}(s,a) \\)
|
||||
|
||||
Let's see the training process to understand how Actor and Critic are optimized:
|
||||
Let's see the training process to understand how the Actor and Critic are optimized:
|
||||
- At each timestep, t, we get the current state \\( S_t\\) from the environment and **pass it as input through our Actor and Critic**.
|
||||
|
||||
- Our Policy takes the state and **outputs an action** \\( A_t \\).
|
||||
|
||||
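As a rough, illustrative sketch of one such update step (assuming PyTorch; the networks, sizes and learning rates are placeholders rather than the course's exact implementation), the Critic is moved toward a TD target while the Actor pushes up the log probability of the action in proportion to the Critic's estimate:

```python
import torch
from torch import nn
from torch.distributions import Categorical

state_dim, n_actions, gamma = 4, 2, 0.99
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))       # pi_theta(s)
critic = nn.Sequential(nn.Linear(state_dim + n_actions, 64), nn.Tanh(), nn.Linear(64, 1))  # q_w(s, a)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def one_hot(action):
    return torch.nn.functional.one_hot(action, n_actions).float()

def update(s, a, r, s_next, a_next):
    q = critic(torch.cat([s, one_hot(a)]))                         # q_w(S_t, A_t)
    with torch.no_grad():
        td_target = r + gamma * critic(torch.cat([s_next, one_hot(a_next)]))

    # Critic: move q_w(S_t, A_t) toward the TD target
    critic_loss = (q - td_target).pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: increase log pi_theta(A_t | S_t) in proportion to the Critic's estimate
    log_prob = Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -(log_prob * q.detach())
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

# One dummy transition (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})
update(torch.randn(state_dim), torch.tensor(0), torch.tensor(1.0),
       torch.randn(state_dim), torch.tensor(1))
```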
@@ -4,8 +4,8 @@ Congrats on finishing this unit and the tutorial. You've just trained your first
|
||||
|
||||
**Take time to grasp the material before continuing**. You can also look at the additional reading materials we provided in the *additional reading* section.
|
||||
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then please 👉 [fill out this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
|
||||
See you in next unit,
|
||||
See you in the next unit!
|
||||
|
||||
### Keep learning, stay awesome 🤗,
|
||||
### Keep learning, stay awesome 🤗
|
||||
|
||||
@@ -11,7 +11,7 @@
|
||||
Now that you've studied the theory behind Advantage Actor Critic (A2C), **you're ready to train your A2C agent** using Stable-Baselines3 in robotic environments. And train two robots:
|
||||
|
||||
- A spider 🕷️ to learn to move.
|
||||
- A robotic arm 🦾 to move in the correct position.
|
||||
- A robotic arm 🦾 to move to the correct position.
|
||||
|
||||
We're going to use two Robotics environments:
|
||||
|
||||
@@ -54,7 +54,7 @@ We're constantly trying to improve our tutorials, so **if you find some issues i
|
||||
|
||||
At the end of the notebook, you will:
|
||||
|
||||
- Be able to use **PyBullet** and **Panda-Gym**, the environment libraries.
|
||||
- Be able to use the environment libraries **PyBullet** and **Panda-Gym**.
|
||||
- Be able to **train robots using A2C**.
|
||||
- Understand why **we need to normalize the input**.
|
||||
- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.
|
||||
@@ -80,7 +80,7 @@ Before diving into the notebook, you need to:
|
||||
|
||||
During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames).
|
||||
|
||||
Hence the following cell will install the librairies and create and run a virtual screen 🖥
|
||||
The following cell will install the libraries and create and run a virtual screen 🖥
|
||||
|
||||
```python
|
||||
%%capture
|
||||
@@ -135,7 +135,7 @@ from huggingface_hub import notebook_login
|
||||
### Create the AntBulletEnv-v0
|
||||
#### The environment 🎮
|
||||
|
||||
In this environment, the agent needs to use correctly its different joints to walk correctly.
|
||||
In this environment, the agent needs to use its different joints correctly in order to walk.
|
||||
You can find a detailed explanation of this environment here: https://hackmd.io/@jeffreymo/SJJrSJh5_#PyBullet
|
||||
|
||||
```python
|
||||
@@ -231,7 +231,7 @@ model = A2C(
|
||||
|
||||
### Train the A2C agent 🏃
|
||||
|
||||
- Let's train our agent for 2,000,000 timesteps, don't forget to use GPU on Colab. It will take approximately ~25-40min
|
||||
- Let's train our agent for 2,000,000 timesteps. Don't forget to use GPU on Colab. It will take approximately ~25-40min
|
||||
|
||||
```python
|
||||
model.learn(2_000_000)
|
||||
@@ -244,7 +244,7 @@ env.save("vec_normalize.pkl")
|
||||
```
|
||||
|
||||
### Evaluate the agent 📈
|
||||
- Now that's our agent is trained, we need to **check its performance**.
|
||||
- Now that our agent is trained, we need to **check its performance**.
|
||||
- Stable-Baselines3 provides a method to do that: `evaluate_policy`
|
||||
- In my case, I got a mean reward of `2371.90 +/- 16.50`
|
||||
|
||||
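As a sketch, this evaluation step could look like the cell below. It assumes the trained `model` and the VecNormalize-wrapped `env` from the previous cells; SB3 also recommends not updating the normalization statistics (and not normalizing the reward) while evaluating:

```python
from stable_baselines3.common.evaluation import evaluate_policy

# Do not update the VecNormalize statistics and do not normalize the reward during evaluation
env.training = False
env.norm_reward = False

mean_reward, std_reward = evaluate_policy(model, env)
print(f"Mean reward = {mean_reward:.2f} +/- {std_reward:.2f}")
```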
@@ -282,7 +282,7 @@ By using `package_to_hub`, as we already mentionned in the former units, **you e
|
||||
This way:
|
||||
- You can **showcase your work** 🔥
|
||||
- You can **visualize your agent playing** 👀
|
||||
- You can **share with the community an agent that others can use** 💾
|
||||
- You can **share an agent with the community that others can use** 💾
|
||||
- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
|
||||
|
||||
|
||||
@@ -290,7 +290,7 @@ To be able to share your model with the community there are three more steps to
|
||||
|
||||
1️⃣ (If it's not already done) create an account to HF ➡ https://huggingface.co/join
|
||||
|
||||
2️⃣ Sign in and then, you need to store your authentication token from the Hugging Face website.
|
||||
2️⃣ Sign in and then you need to get your authentication token from the Hugging Face website.
|
||||
- Create a new token (https://huggingface.co/settings/tokens) **with write role**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">
|
||||
@@ -303,7 +303,7 @@ notebook_login()
|
||||
!git config --global credential.helper store
|
||||
```
|
||||
|
||||
If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
|
||||
If you don't want to use Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
|
||||
|
||||
3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using `package_to_hub()` function
|
||||
|
||||
@@ -332,9 +332,9 @@ In robotics, the *end-effector* is the device at the end of a robotic arm design
|
||||
|
||||
In `PandaReach`, the robot must place its end-effector at a target position (green ball).
|
||||
|
||||
We're going to use the dense version of this environment. It means we'll get a *dense reward function* that **will provide a reward at each timestep** (the closer the agent is to completing the task, the higher the reward). Contrary to a *sparse reward function* where the environment **return a reward if and only if the task is completed**.
|
||||
We're going to use the dense version of this environment. This means we'll get a *dense reward function* that **will provide a reward at each timestep** (the closer the agent is to completing the task, the higher the reward). This is in contrast to a *sparse reward function* where the environment **returns a reward if and only if the task is completed**.
|
||||
|
||||
Also, we're going to use the *End-effector displacement control*, it means the **action corresponds to the displacement of the end-effector**. We don't control the individual motion of each joint (joint control).
|
||||
Also, we're going to use the *End-effector displacement control*, which means the **action corresponds to the displacement of the end-effector**. We don't control the individual motion of each joint (joint control).
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/robotics.jpg" alt="Robotics"/>
|
||||
|
||||
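As a minimal sketch (assuming `panda-gym` and the classic `gym` API, as used in this unit's notebook), creating the dense-reward environment and inspecting its spaces could look like this:

```python
import gym
import panda_gym  # importing panda_gym registers the Panda environments with gym

env = gym.make("PandaReachDense-v2")
observation = env.reset()

print(env.observation_space)  # goal-conditioned dictionary observation
print(env.action_space)       # 3 continuous values: the displacement of the end-effector
```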
@@ -384,14 +384,14 @@ The action space is a vector with 3 values:
|
||||
|
||||
Now it's your turn:
|
||||
|
||||
1. Define the environment called "PandaReachDense-v2"
|
||||
2. Make a vectorized environment
|
||||
1. Define the environment called "PandaReachDense-v2".
|
||||
2. Make a vectorized environment.
|
||||
3. Add a wrapper to normalize the observations and rewards. [Check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/vec_envs.html#vecnormalize)
|
||||
4. Create the A2C Model (don't forget verbose=1 to print the training logs).
|
||||
5. Train it for 1M Timesteps
|
||||
6. Save the model and VecNormalize statistics when saving the agent
|
||||
7. Evaluate your agent
|
||||
8. Publish your trained model on the Hub 🔥 with `package_to_hub`
|
||||
5. Train it for 1M Timesteps.
|
||||
6. Save the model and VecNormalize statistics when saving the agent.
|
||||
7. Evaluate your agent.
|
||||
8. Publish your trained model on the Hub 🔥 with `package_to_hub`.
|
||||
|
||||
### Solution (fill the todo)
|
||||
|
||||
@@ -448,7 +448,7 @@ package_to_hub(
|
||||
|
||||
## Some additional challenges 🏆
|
||||
|
||||
The best way to learn **is to try things by your own**! Why not trying `HalfCheetahBulletEnv-v0` for PyBullet and `PandaPickAndPlace-v1` for Panda-Gym?
|
||||
The best way to learn **is to try things on your own**! Why not try `HalfCheetahBulletEnv-v0` for PyBullet and `PandaPickAndPlace-v1` for Panda-Gym?
|
||||
|
||||
If you want to try more advanced tasks for panda-gym, you need to check what was done using **TQC or SAC** (a more sample-efficient algorithm suited for robotics tasks). In real robotics, you'll use a more sample-efficient algorithm for a simple reason: contrary to a simulation **if you move your robotic arm too much, you have a risk of breaking it**.
|
||||
|
||||
@@ -456,7 +456,7 @@ PandaPickAndPlace-v1: https://huggingface.co/sb3/tqc-PandaPickAndPlace-v1
|
||||
|
||||
And don't hesitate to check panda-gym documentation here: https://panda-gym.readthedocs.io/en/latest/usage/train_with_sb3.html
|
||||
|
||||
Here are some ideas to achieve so:
|
||||
Here are some ideas to go further:
|
||||
* Train more steps
|
||||
* Try different hyperparameters by looking at what your classmates have done 👉 https://huggingface.co/models?other=AntBulletEnv-v0
|
||||
* **Push your new trained model** on the Hub 🔥
|
||||
|
||||
@@ -11,15 +11,15 @@ We saw that Reinforce worked well. However, because we use Monte-Carlo sampling
|
||||
|
||||
Remember that the policy gradient estimation is **the direction of the steepest increase in return**. In other words, how to update our policy weights so that actions that lead to good returns have a higher probability of being taken. The Monte Carlo variance, which we will further study in this unit, **leads to slower training since we need a lot of samples to mitigate it**.
|
||||
|
||||
So, today we'll study **Actor-Critic methods**, a hybrid architecture combining value-based and Policy-Based methods that help to stabilize the training by reducing the variance:
|
||||
So today we'll study **Actor-Critic methods**, a hybrid architecture combining value-based and Policy-Based methods that helps to stabilize the training by reducing the variance using:
|
||||
- *An Actor* that controls **how our agent behaves** (Policy-Based method)
|
||||
- *A Critic* that measures **how good the taken action is** (Value-Based method)
|
||||
|
||||
|
||||
We'll study one of these hybrid methods, Advantage Actor Critic (A2C), **and train our agent using Stable-Baselines3 in robotic environments**. We'll train two robots:
|
||||
- A spider 🕷️ to learn to move.
|
||||
- A robotic arm 🦾 to move in the correct position.
|
||||
- A robotic arm 🦾 to move to the correct position.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/environments.gif" alt="Environments"/>
|
||||
|
||||
Sounds exciting? Let's get started!
|
||||
Sound exciting? Let's get started!
|
||||
|
||||
@@ -1,12 +1,12 @@
|
||||
# The Problem of Variance in Reinforce [[the-problem-of-variance-in-reinforce]]
|
||||
|
||||
In Reinforce, we want to **increase the probability of actions in a trajectory proportional to how high the return is**.
|
||||
In Reinforce, we want to **increase the probability of actions in a trajectory proportionally to how high the return is**.
|
||||
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/pg.jpg" alt="Reinforce"/>
|
||||
|
||||
- If the **return is high**, we will **push up** the probabilities of the (state, action) combinations.
|
||||
- Else, if the **return is low**, it will **push down** the probabilities of the (state, action) combinations.
|
||||
- Otherwise, if the **return is low**, it will **push down** the probabilities of the (state, action) combinations.
|
||||
|
||||
This return \\(R(\tau)\\) is calculated using a *Monte-Carlo sampling*. We collect a trajectory and calculate the discounted return, **and use this score to increase or decrease the probability of every action taken in that trajectory**. If the return is good, all actions will be “reinforced” by increasing their likelihood of being taken.
|
||||
|
||||
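As a tiny illustration of this Monte-Carlo quantity, here is a plain-Python helper that turns one collected list of rewards into the discounted return used as the score:

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = r_1 + gamma * r_2 + gamma^2 * r_3 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# One sampled trajectory gives one (potentially high-variance) estimate of the return
print(discounted_return([1.0, 1.0, 0.0, 5.0]))
```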
@@ -24,7 +24,7 @@ The solution is to mitigate the variance by **using a large number of trajectori
|
||||
However, increasing the batch size significantly **reduces sample efficiency**. So we need to find additional mechanisms to reduce the variance.
|
||||
|
||||
---
|
||||
If you want to dive deeper into the question of variance and bias tradeoff in Deep Reinforcement Learning, you can check these two articles:
|
||||
If you want to dive deeper into the question of variance and bias tradeoff in Deep Reinforcement Learning, you can check out these two articles:
|
||||
- [Making Sense of the Bias / Variance Trade-off in (Deep) Reinforcement Learning](https://blog.mlreview.com/making-sense-of-the-bias-variance-trade-off-in-deep-reinforcement-learning-79cf1e83d565)
|
||||
- [Bias-variance Tradeoff in Reinforcement Learning](https://www.endtoend.ai/blog/bias-variance-tradeoff-in-reinforcement-learning/)
|
||||
---
|
||||
|
||||
@@ -2,10 +2,10 @@
|
||||
|
||||
That’s all for today. Congrats on finishing this unit and the tutorial!
|
||||
|
||||
The best way to learn is to practice and try stuff. **Why not training another agent with a different configuration?**
|
||||
The best way to learn is to practice and try stuff. **Why not train another agent with a different configuration?**
|
||||
|
||||
And don’t hesitate from time to time to check the [leaderboard](https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos)
|
||||
|
||||
See you on Unit 8 🔥,
|
||||
See you in Unit 8 🔥
|
||||
|
||||
## Keep Learning, Stay awesome 🤗
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
# Hands-on
|
||||
|
||||
Now that you learned the bases of multi-agents. You're ready to train our first agents in a multi-agent system: **a 2vs2 soccer team that needs to beat the opponent team**.
|
||||
Now that you learned the basics of multi-agents, you're ready to train your first agents in a multi-agent system: **a 2vs2 soccer team that needs to beat the opponent team**.
|
||||
|
||||
And you’re going to participate in AI vs. AI challenges where your trained agent will compete against other classmates’ **agents every day and be ranked on a new leaderboard.**
|
||||
|
||||
@@ -38,11 +38,11 @@ We're going to write a blog post to explain this AI vs. AI tool in detail, but t
|
||||
This first AI vs. AI competition **is an experiment**: the goal is to improve the tool in the future with your feedback. So some **breakdowns can happen during the challenge**. But don't worry:
|
||||
**all the results are saved in a dataset so we can always restart the calculation correctly without losing information**.
|
||||
|
||||
In order that your model to get correctly evaluated against others you need to follow these rules:
|
||||
In order for your model to get correctly evaluated against others you need to follow these rules:
|
||||
|
||||
1. **You can't change the observation space or action space of the agent.** By doing that your model will not work during evaluation.
|
||||
2. You **can't use a custom trainer for now,** you need to use Unity MLAgents ones.
|
||||
3. We provide executables to train your agents. You can also use the Unity Editor if you prefer **, but to avoid bugs, we advise you to use our executables**.
|
||||
2. You **can't use a custom trainer for now,** you need to use the Unity MLAgents ones.
|
||||
3. We provide executables to train your agents. You can also use the Unity Editor if you prefer **, but to avoid bugs, we advise that you use our executables**.
|
||||
|
||||
What will make the difference during this challenge are **the hyperparameters you choose**.
|
||||
|
||||
@@ -50,16 +50,16 @@ The AI vs AI algorithm will run until April the 30th, 2023.
|
||||
|
||||
We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues).
|
||||
|
||||
### Exchange with your classmates, share advice and ask questions on Discord
|
||||
### Chat with your classmates, share advice and ask questions on Discord
|
||||
|
||||
- We created a new channel called `ai-vs-ai-challenge` to exchange advice and ask questions.
|
||||
- If you didn’t joined yet the discord server you can [join here](https://discord.gg/ydHrjt3WP5)
|
||||
- If you didn’t join the discord server yet, you can [join here](https://discord.gg/ydHrjt3WP5)
|
||||
|
||||
## Step 0: Install MLAgents and download the correct executable
|
||||
|
||||
⚠ We're going to use an experimental version of ML-Agents which allows you to push and load your models to/from the Hub. **You need to install the same version.**
|
||||
|
||||
⚠ ⚠ ⚠ We’re not going to use the same version than for the Unit 5: Introduction to ML-Agents ⚠ ⚠ ⚠
|
||||
⚠ ⚠ ⚠ We’re not going to use the same version as in Unit 5: Introduction to ML-Agents ⚠ ⚠ ⚠
|
||||
|
||||
We advise you to use [conda](https://docs.conda.io/en/latest/) as a package manager and create a new environment.
|
||||
|
||||
@@ -70,7 +70,7 @@ conda create --name rl python=3.9
|
||||
conda activate rl
|
||||
```
|
||||
|
||||
To be able to train correctly our agents and push to the Hub, we need to install an experimental version of ML-Agents (the branch aivsai from Hugging Face ML-Agents fork)
|
||||
To be able to train our agents correctly and push to the Hub, we need to install an experimental version of ML-Agents (the branch aivsai from Hugging Face ML-Agents fork)
|
||||
|
||||
```bash
|
||||
git clone --branch aivsai https://github.com/huggingface/ml-agents
|
||||
@@ -107,7 +107,7 @@ Mac: Download [this executable](https://drive.google.com/drive/folders/1h7YB0qwj
|
||||
|
||||
The environment is called `SoccerTwos`. The Unity MLAgents Team made it. You can find its documentation [here](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Learning-Environment-Examples.md#soccer-twos)
|
||||
|
||||
The goal in this environment **is to get the ball into the opponent's goal while preventing the ball from entering its own goal.**
|
||||
The goal in this environment **is to get the ball into the opponent's goal while preventing the ball from entering your own goal.**
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/soccertwos.gif" alt="SoccerTwos"/>
|
||||
@@ -124,7 +124,7 @@ The reward function is:
|
||||
|
||||
### The observation space
|
||||
|
||||
The observation space is composed vector size of 336:
|
||||
The observation space is composed of vectors of size 336:
|
||||
|
||||
- 11 ray-casts forward distributed over 120 degrees (264 state dimensions)
|
||||
- 3 ray-casts backward distributed over 90 degrees (72 state dimensions)
|
||||
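(Note that these two components account for the whole vector: 264 + 72 = 336.)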
@@ -146,7 +146,7 @@ The action space is three discrete branches:
|
||||
|
||||
We know how to train agents to play against others: **we can use self-play.** This is a perfect technique for a 1vs1.
|
||||
|
||||
But in our case we’re 2vs2, and each team has 2 agents. How then we can **train cooperative behavior for groups of agents?**
|
||||
But in our case we’re 2vs2, and each team has 2 agents. How then can we **train cooperative behavior for groups of agents?**
|
||||
|
||||
As explained in the [Unity Blog](https://blog.unity.com/technology/ml-agents-v20-release-now-supports-training-complex-cooperative-behaviors), agents typically receive a reward as a group (+1 - penalty) when the team scores a goal. This implies that **every agent on the team is rewarded even if each agent didn’t contribute the same to the win**, which makes it difficult to learn what to do independently.
|
||||
|
||||
@@ -166,17 +166,17 @@ This allows each agent to **make decisions based only on what it perceives local
|
||||
</figure>
|
||||
|
||||
|
||||
The solution then is to use Self-Play with an MA-POCA trainer (called poca). The poca trainer will help us to train cooperative behavior and self-play to get an opponent team.
|
||||
The solution then is to use Self-Play with an MA-POCA trainer (called poca). The poca trainer will help us to train cooperative behavior and self-play to win against an opponent team.
|
||||
|
||||
If you want to dive deeper into this MA-POCA algorithm, you need to read the paper they published [here](https://arxiv.org/pdf/2111.05992.pdf) and the sources we put on the additional readings section.
|
||||
|
||||
## Step 3: Define the config file
|
||||
|
||||
We already learned in [Unit 5](https://huggingface.co/deep-rl-course/unit5/introduction) that in ML-Agents, you define **the training hyperparameters into `config.yaml` files.**
|
||||
We already learned in [Unit 5](https://huggingface.co/deep-rl-course/unit5/introduction) that in ML-Agents, you define **the training hyperparameters in `config.yaml` files.**
|
||||
|
||||
There are multiple hyperparameters. To know them better, you should check for each explanation with **[the documentation](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/Training-Configuration-File.md)**
|
||||
There are multiple hyperparameters. To understand them better, you should read the explanations for each of them in **[the documentation](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/Training-Configuration-File.md)**
|
||||
|
||||
The config file we’re going to use here is in `./config/poca/SoccerTwos.yaml` it looks like this:
|
||||
The config file we’re going to use here is in `./config/poca/SoccerTwos.yaml`. It looks like this:
|
||||
|
||||
```yaml
|
||||
behaviors:
|
||||
@@ -215,7 +215,7 @@ behaviors:
|
||||
|
||||
Compared to Pyramids or SnowballTarget, we have new hyperparameters with a self-play part. How you modify them can be critical in getting good results.
|
||||
|
||||
The advice I can give you here is to check the explanation and recommended value for each parameters (especially self-play ones) with **[the documentation](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/Training-Configuration-File.md).**
|
||||
The advice I can give you here is to check the explanation and recommended value for each parameter (especially the self-play ones) in **[the documentation](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/Training-Configuration-File.md).**
|
||||
|
||||
Now that you’ve modified our config file, you’re ready to train your agents.
|
||||
|
||||
@@ -230,7 +230,7 @@ We define four parameters:
|
||||
3. `-run_id`: the name you want to give to your training run id.
|
||||
4. `-no-graphics`: to not launch the visualization during the training.
|
||||
|
||||
Depending on your hardware, 5M timesteps (the recommended value but you can also try 10M) will take 5 to 8 hours of training. You can continue using your computer in the meantime, but I advise deactivating the computer standby mode to prevent the training from being stopped.
|
||||
Depending on your hardware, 5M timesteps (the recommended value, but you can also try 10M) will take 5 to 8 hours of training. You can continue using your computer in the meantime, but I advise deactivating the computer standby mode to prevent the training from being stopped.
|
||||
|
||||
Depending on the executable you use (windows, ubuntu, mac) the training command will look like this (your executable path can be different so don’t hesitate to check before running).
|
||||
|
||||
@@ -242,7 +242,7 @@ The executable contains 8 copies of SoccerTwos.
|
||||
|
||||
⚠️ It’s normal if you don’t see a big increase of ELO score (and even a decrease below 1200) before 2M timesteps, since your agents will spend most of their time moving randomly on the field before being able to score.
|
||||
|
||||
⚠️ You can stop the training with Ctrl + C but beware of typing only once this command to stop the training since MLAgents needs to generate a final .onnx file before closing the run.
|
||||
⚠️ You can stop the training with Ctrl + C but beware of typing this command only once to stop the training since MLAgents needs to generate a final .onnx file before closing the run.
|
||||
|
||||
## Step 5: **Push the agent to the Hugging Face Hub**
|
||||
|
||||
@@ -250,11 +250,11 @@ Now that we trained our agents, we’re **ready to push them to the Hub to be a
|
||||
|
||||
To be able to share your model with the community, there are three more steps to follow:
|
||||
|
||||
1️⃣ (If it’s not already done) create an account to HF ➡ https://huggingface.co/join](https://huggingface.co/join
|
||||
1️⃣ (If it’s not already done) create an account to HF ➡ [https://huggingface.co/join](https://huggingface.co/join)
|
||||
|
||||
2️⃣ Sign in and store your authentication token from the Hugging Face website.
|
||||
|
||||
Create a new token (https://huggingface.co/settings/tokens)) **with write role**
|
||||
Create a new token (https://huggingface.co/settings/tokens) **with write role**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">
|
||||
|
||||
@@ -272,7 +272,7 @@ And we define four parameters:
|
||||
2. `-local-dir`: where the agent was saved, it’s results/<run_id name>, so in my case results/First Training.
|
||||
3. `-repo-id`: the name of the Hugging Face repo you want to create or update. It’s always <your huggingface username>/<the repo name>
|
||||
If the repo does not exist **it will be created automatically**
|
||||
4. `--commit-message`: since HF repos are git repository you need to define a commit message.
|
||||
4. `--commit-message`: since HF repos are git repositories you need to give a commit message.
|
||||
|
||||
In my case
|
||||
|
||||
@@ -284,7 +284,7 @@ mlagents-push-to-hf --run-id="SoccerTwos" --local-dir="./results/SoccerTwos" --
|
||||
mlagents-push-to-hf --run-id= # Add your run id --local-dir= # Your local dir --repo-id= # Your repo id --commit-message="First Push"
|
||||
```
|
||||
|
||||
If everything worked you should have this at the end of the process(but with a different url 😆) :
|
||||
If everything worked you should see this at the end of the process (but with a different url 😆) :
|
||||
|
||||
Your model is pushed to the Hub. You can view your model here: https://huggingface.co/ThomasSimonini/poca-SoccerTwos
|
||||
|
||||
@@ -294,14 +294,14 @@ It's the link to your model. It contains a model card that explains how to use i
|
||||
|
||||
Now that your model is pushed to the Hub, **it’s going to be added automatically to the AI vs AI Challenge model pool.** It can take a little bit of time before your model is added to the leaderboard given we do a run of matches every 4h.
|
||||
|
||||
But in order that everything works perfectly you need to check:
|
||||
But to ensure that everything works perfectly you need to check:
|
||||
|
||||
1. That you have this tag in your model: ML-Agents-SoccerTwos. This is the tag we use to select models to be added to the challenge pool. To do that, go to your model and check the tags.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/verify1.png" alt="Verify"/>
|
||||
|
||||
|
||||
If it’s not the case you just need to modify readme and add it
|
||||
If it’s not the case you just need to modify the readme and add it
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/verify2.png" alt="Verify"/>
|
||||
|
||||
@@ -315,10 +315,10 @@ We strongly suggest that you create a new model when you push to the Hub if you
|
||||
|
||||
Now that your model is part of AI vs AI Challenge, **you can visualize how good it is compared to others**: https://huggingface.co/spaces/unity/ML-Agents-SoccerTwos
|
||||
|
||||
In order to do that, you just need to go on this demo:
|
||||
In order to do that, you just need to go to this demo:
|
||||
|
||||
- Select your model as team blue (or team purple if you prefer) and another. The best to compare your model is either with the one who’s on top of the leaderboard. Or use the [baseline model as opponent](https://huggingface.co/unity/MLAgents-SoccerTwos)
|
||||
- Select your model as team blue (or team purple if you prefer) and another model to compete against. The best opponents to compare your model to are either whoever is on top of the leaderboard or the [baseline model](https://huggingface.co/unity/MLAgents-SoccerTwos)
|
||||
|
||||
This matches you see live are not used to the calculation of your result **but are good way to visualize how good your agent is**.
|
||||
The matches you see live are not used in the calculation of your result **but they are a good way to visualize how good your agent is**.
|
||||
|
||||
And don't hesitate to share the best score your agent gets on discord in #rl-i-made-this channel 🔥
|
||||
And don't hesitate to share the best score your agent gets on discord in the #rl-i-made-this channel 🔥
|
||||
|
||||
@@ -2,7 +2,7 @@
|
||||
|
||||
## From single agent to multiple agents
|
||||
|
||||
In the first unit, we learned to train agents in a single-agent system. Where our agent was alone in its environment: **it was not cooperating or collaborating with other agents**.
|
||||
In the first unit, we learned to train agents in a single-agent system, where our agent was alone in its environment: **it was not cooperating or collaborating with other agents**.
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/patchwork.jpg" alt="Patchwork"/>
|
||||
@@ -33,13 +33,13 @@ In these examples, we have **multiple agents interacting in the environment and
|
||||
|
||||
## Different types of multi-agent environments
|
||||
|
||||
Given that in a multi-agent system, agents interact with other agents, we can have different types of environments:
|
||||
Given that, in a multi-agent system, agents interact with other agents, we can have different types of environments:
|
||||
|
||||
- *Cooperative environments*: where your agents needs **to maximize the common benefits**.
|
||||
- *Cooperative environments*: where your agents need **to maximize the common benefits**.
|
||||
|
||||
For instance, in a warehouse, **robots must collaborate to load and unload the packages as efficiently (as fast as possible)**.
|
||||
For instance, in a warehouse, **robots must collaborate to load and unload the packages efficiently (as fast as possible)**.
|
||||
|
||||
- *Competitive/Adversarial environments*: in that case, your agent **want to maximize its benefits by minimizing the opponent ones**.
|
||||
- *Competitive/Adversarial environments*: in this case, your agent **wants to maximize its benefits by minimizing the opponent's**.
|
||||
|
||||
For example, in a game of tennis, **each agent wants to beat the other agent**.
|
||||
|
||||
|
||||
@@ -20,9 +20,9 @@ But, as humans, **we live in a multi-agent world**. Our intelligence comes from
|
||||
|
||||
Consequently, we must study how to train deep reinforcement learning agents in a *multi-agent system* to build robust agents that can adapt, collaborate, or compete.
|
||||
|
||||
So today, we’re going to **learn the basics of this fascinating topic of multi-agents reinforcement learning (MARL)**.
|
||||
So today we’re going to **learn the basics of the fascinating topic of multi-agent reinforcement learning (MARL)**.
|
||||
|
||||
And the most exciting part is that during this unit, you’re going to train your first agents in a multi-agents system: **a 2vs2 soccer team that needs to beat the opponent team**.
|
||||
And the most exciting part is that, during this unit, you’re going to train your first agents in a multi-agent system: **a 2vs2 soccer team that needs to beat the opponent team**.
|
||||
|
||||
And you’re going to participate in the **AI vs. AI challenge**, where your trained agent will compete against other classmates’ agents every day and be ranked on a [new leaderboard](https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos).
|
||||
|
||||
|
||||
@@ -5,7 +5,7 @@ For this section, you're going to watch this excellent introduction to multi-age
|
||||
<Youtube id="qgb0gyrpiGk" />
|
||||
|
||||
|
||||
In this video, Brian talked about how to design multi-agent systems. He specifically took a vacuum cleaner multi-agents setting and asked how they **can cooperate with each other**?
|
||||
In this video, Brian talked about how to design multi-agent systems. He specifically took a multi-agent system of vacuum cleaners and asked: **how can they cooperate with each other**?
|
||||
|
||||
We have two solutions to design this multi-agent reinforcement learning system (MARL).
|
||||
|
||||
@@ -18,7 +18,7 @@ Source: <a href="https://www.youtube.com/watch?v=qgb0gyrpiGk"> Introduction to M
|
||||
</figcaption>
|
||||
</figure>
|
||||
|
||||
In decentralized learning, **each agent is trained independently from others**. In the example given, each vacuum learns to clean as many places as it can **without caring about what other vacuums (agents) are doing**.
|
||||
In decentralized learning, **each agent is trained independently from the others**. In the example given, each vacuum learns to clean as many places as it can **without caring about what other vacuums (agents) are doing**.
|
||||
|
||||
The benefit is that **since no information is shared between agents, these vacuums can be designed and trained like we train single agents**.
|
||||
|
||||
@@ -36,22 +36,22 @@ Source: <a href="https://www.youtube.com/watch?v=qgb0gyrpiGk"> Introduction to M
|
||||
</figcaption>
|
||||
</figure>
|
||||
|
||||
In this architecture, **we have a high-level process that collects agents' experiences**: experience buffer. And we'll use these experiences **to learn a common policy**.
|
||||
In this architecture, **we have a high-level process that collects agents' experiences**: the experience buffer. And we'll use these experiences **to learn a common policy**.
|
||||
|
||||
For instance, in the vacuum cleaner, the observation will be:
|
||||
For instance, in the vacuum cleaner example, the observation will be:
|
||||
- The coverage map of the vacuums.
|
||||
- The position of all the vacuums.
|
||||
|
||||
We use that collective experience **to train a policy that will move all three robots in the most beneficial way as a whole**. So each robot is learning from the common experience.
|
||||
And we have a stationary environment since all the agents are treated as a larger entity, and they know the change of other agents' policies (since it's the same as theirs).
|
||||
We use that collective experience **to train a policy that will move all three robots in the most beneficial way as a whole**. So each robot is learning from their common experience.
|
||||
We now have a stationary environment since all the agents are treated as a larger entity, and each one knows how the other agents' policies change (since it's the same policy as its own).
|
||||
|
||||
If we recap:
|
||||
|
||||
- In *decentralized approach*, we **treat all agents independently without considering the existence of the other agents.**
|
||||
- In a *decentralized approach*, we **treat all agents independently without considering the existence of the other agents.**
|
||||
- In this case, all agents **consider the other agents as part of the environment**.
|
||||
- **It’s a non-stationarity environment condition**, so non-guaranty of convergence.
|
||||
- **It’s a non-stationary environment condition**, so there is no guarantee of convergence.
|
||||
|
||||
- In centralized approach:
|
||||
- In a *centralized approach*:
|
||||
- A **single policy is learned from all the agents**.
|
||||
- Takes as input the present state of an environment and the policy output joint actions.
|
||||
- The policy takes as input the present state of the environment and outputs joint actions.
|
||||
- The reward is global.
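
To make this recap concrete, here is a minimal, self-contained sketch of the centralized setup: one shared experience buffer, one common policy, and a single global reward. It is only an illustration, not course code; the toy policy and the made-up "environment" values are placeholders.

```python
# Illustrative sketch of the centralized setup: a high-level process collects
# every agent's experience into one buffer and a single policy acts for all
# of them. Names and the toy "environment" are placeholders, not course code.
import random

n_agents, n_steps = 3, 10
experience_buffer = []  # the shared, high-level experience buffer


def shared_policy(joint_observation):
    """One common policy takes the joint observation and outputs a joint action."""
    return [random.choice(["up", "down", "left", "right"]) for _ in joint_observation]


# A joint observation could be, e.g., the coverage map plus every vacuum's position.
joint_observation = [[0.0, 0.0] for _ in range(n_agents)]

for step in range(n_steps):
    joint_action = shared_policy(joint_observation)
    global_reward = random.random()  # a single reward for the whole team
    experience_buffer.append((joint_observation, joint_action, global_reward))
    # A real implementation would now update the common policy on this buffer.

print(len(experience_buffer), "joint transitions collected for the common policy")
```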
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
# Self-Play: a classic technique to train competitive agents in adversarial games
|
||||
|
||||
Now that we studied the basics of multi-agents. We're ready to go deeper. As mentioned in the introduction, we're going **to train agents in an adversarial game with SoccerTwos, a 2vs2 game**.
|
||||
Now that we've studied the basics of multi-agent systems, we're ready to go deeper. As mentioned in the introduction, we're going **to train agents in an adversarial game with SoccerTwos, a 2vs2 game**.
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/soccertwos.gif" alt="SoccerTwos"/>
|
||||
@@ -13,7 +13,7 @@ Now that we studied the basics of multi-agents. We're ready to go deeper. As men
|
||||
|
||||
Training agents correctly in an adversarial game can be **quite complex**.
|
||||
|
||||
On the one hand, we need to find how to get a well-trained opponent to play against your training agent. And on the other hand, even if you have a very good trained opponent, it's not a good solution since how your agent is going to improve its policy when the opponent is too strong?
|
||||
On the one hand, we need to find how to get a well-trained opponent to play against your training agent. And on the other hand, if you find a very good trained opponent, how will your agent improve its policy when the opponent is too strong?
|
||||
|
||||
Think of a child that just started to learn soccer. Playing against a very good soccer player will be useless since it will be too hard to win or at least get the ball from time to time. So the child will continuously lose without having time to learn a good policy.
|
||||
|
||||
@@ -29,27 +29,27 @@ It’s the same way humans learn in competition:
|
||||
We do the same with self-play:
|
||||
|
||||
- We **start with a copy of our agent as an opponent**; this way, this opponent is on a similar level.
|
||||
- We **learn from it**, and when we acquire some skills, we **update our opponent with a more recent copy of our training policy**.
|
||||
- We **learn from it** and, when we acquire some skills, we **update our opponent with a more recent copy of our training policy**.
|
||||
|
||||
The theory behind self-play is not something new. It was already used by Arthur Samuel’s checker player system in the fifties and by Gerald Tesauro’s TD-Gammon in 1995. If you want to learn more about the history of self-play [check this very good blogpost by Andrew Cohen](https://blog.unity.com/technology/training-intelligent-adversaries-using-self-play-with-ml-agents)
|
||||
The theory behind self-play is not something new. It was already used by Arthur Samuel’s checker player system in the fifties and by Gerald Tesauro’s TD-Gammon in 1995. If you want to learn more about the history of self-play [check out this very good blogpost by Andrew Cohen](https://blog.unity.com/technology/training-intelligent-adversaries-using-self-play-with-ml-agents)
|
||||
|
||||
## Self-Play in MLAgents
|
||||
|
||||
Self-Play is integrated into the MLAgents library and is managed by multiple hyperparameters that we’re going to study. But the main focus as explained in the documentation is the **tradeoff between the skill level and generality of the final policy and the stability of learning**.
|
||||
Self-Play is integrated into the MLAgents library and is managed by multiple hyperparameters that we’re going to study. But the main focus, as explained in the documentation, is the **tradeoff between the skill level and generality of the final policy and the stability of learning**.
|
||||
|
||||
Training against a set of slowly changing or unchanging adversaries with low diversity **results in more stable training, but there is a risk of overfitting if the change is too slow.**
|
||||
|
||||
We need then to control:
|
||||
So we need to control:
|
||||
|
||||
- How **often do we change opponents** with `swap_steps` and `team_change` parameters.
|
||||
- The **number of opponents saved** with `window` parameter. A larger value of `window`
|
||||
- How **often we change opponents** with the `swap_steps` and `team_change` parameters.
|
||||
- The **number of opponents saved** with the `window` parameter. A larger value of `window`
|
||||
means that an agent's pool of opponents will contain a larger diversity of behaviors since it will contain policies from earlier in the training run.
|
||||
- **Probability of playing against the current self vs opponent** sampled in the pool with `play_against_latest_model_ratio`. A larger value of `play_against_latest_model_ratio`
|
||||
- The **probability of playing against the current self vs opponent** sampled from the pool with `play_against_latest_model_ratio`. A larger value of `play_against_latest_model_ratio`
|
||||
indicates that an agent will be playing against the current opponent more often.
|
||||
- The **number of training steps before saving a new opponent** with the `save_steps` parameter. A larger value of `save_steps`
|
||||
will yield a set of opponents that cover a wider range of skill levels and possibly play styles since the policy receives more training.
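
To see how these hyperparameters fit together, here is a sketch of the kind of `self_play` block you would find in an ML-Agents trainer configuration. The real config is a YAML file; it is shown here as a Python dict, and the values are only illustrative, not recommendations, so check them against the documentation linked below.

```python
# Illustrative only: the self-play block of an ML-Agents trainer config,
# shown as a Python dict (the actual file is YAML). Values are placeholders.
self_play_config = {
    "self_play": {
        "save_steps": 50000,                     # steps between saving a new opponent snapshot
        "team_change": 200000,                   # steps before switching the learning team
        "swap_steps": 2000,                      # steps between swapping the opponent's policy
        "window": 10,                            # size of the pool of saved opponents
        "play_against_latest_model_ratio": 0.5,  # probability of facing the current policy vs. a saved snapshot
        "initial_elo": 1200.0,                   # starting ELO used to track progress
    }
}

print(self_play_config["self_play"]["window"])
```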
|
||||
|
||||
To get more details about these hyperparameters, you definitely need [to check this part of the documentation](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Training-Configuration-File.md#self-play)
|
||||
To get more details about these hyperparameters, you definitely need [to check out this part of the documentation](https://github.com/Unity-Technologies/ml-agents/blob/develop/docs/Training-Configuration-File.md#self-play)
|
||||
|
||||
|
||||
## The ELO Score to evaluate our agent
|
||||
@@ -124,14 +124,14 @@ Player B has a rating of 2300
|
||||
|
||||
### The Advantages
|
||||
|
||||
Using ELO score has multiple advantages:
|
||||
Using the ELO score has multiple advantages:
|
||||
|
||||
- Points are **always balanced** (more points are exchanged when there is an unexpected outcome, but the sum is always the same).
|
||||
- It is a **self-corrected system** since if a player wins against a weak player, you will only win a few points.
|
||||
- If **works with team games**: we calculate the average for each team and use it in Elo.
|
||||
- It is a **self-corrected system** since if a player wins against a weak player, they will only win a few points.
|
||||
- It **works with team games**: we calculate the average for each team and use it in Elo.
|
||||
|
||||
### The Disadvantages
|
||||
|
||||
- ELO **does not take the individual contribution** of each people in the team.
|
||||
- Rating deflation: **good rating require skill over time to get the same rating**.
|
||||
- ELO **does not take into account the individual contribution** of each player in the team.
|
||||
- Rating deflation: **a good rating requires skill over time to keep the same rating**.
|
||||
- **Ratings can’t be compared across different periods in history**.
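
To make these properties concrete, here is a small, self-contained sketch of the standard ELO update; the K-factor of 16 and the ratings are only illustrative values, not the exact constants used by the leaderboard.

```python
# Standard ELO update, as a minimal sketch (illustrative constants).
def elo_update(rating_a, rating_b, score_a, k=16):
    """score_a is 1 if A wins, 0 if A loses, 0.5 for a draw."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    expected_b = 1 - expected_a
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - expected_b)
    return round(new_a, 1), round(new_b, 1)

# Expected outcome: Player A (2600) beats Player B (2300), few points exchanged.
print(elo_update(2600, 2300, score_a=1))  # (2602.4, 2297.6)
# Upset: B beats A, so many more points are exchanged, and the sum stays balanced.
print(elo_update(2600, 2300, score_a=0))  # (2586.4, 2313.6)
```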
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
# Introducing the Clipped Surrogate Objective Function
|
||||
## Recap: The Policy Objective Function
|
||||
|
||||
Let’s remember what is the objective to optimize in Reinforce:
|
||||
Let’s recall the objective we optimize in Reinforce:
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/lpg.jpg" alt="Reinforce"/>
|
||||
|
||||
The idea was that by taking a gradient ascent step on this function (equivalent to taking gradient descent of the negative of this function), we would **push our agent to take actions that lead to higher rewards and avoid harmful actions.**
|
||||
@@ -10,9 +10,9 @@ However, the problem comes from the step size:
|
||||
- Too small, **the training process was too slow**
|
||||
- Too high, **there was too much variability in the training**
|
||||
|
||||
Here with PPO, the idea is to constrain our policy update with a new objective function called the *Clipped surrogate objective function* that **will constrain the policy change in a small range using a clip.**
|
||||
With PPO, the idea is to constrain our policy update with a new objective function called the *Clipped surrogate objective function* that **will constrain the policy change in a small range using a clip.**
|
||||
|
||||
This new function **is designed to avoid destructive large weights updates** :
|
||||
This new function **is designed to avoid destructively large weight updates**:
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/ppo-surrogate.jpg" alt="PPO surrogate function"/>
|
||||
|
||||
@@ -21,11 +21,11 @@ Let’s study each part to understand how it works.
|
||||
## The Ratio Function
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/ratio1.jpg" alt="Ratio"/>
|
||||
|
||||
This ratio is calculated this way:
|
||||
This ratio is calculated as follows:
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/ratio2.jpg" alt="Ratio"/>
|
||||
|
||||
It’s the probability of taking action \\( a_t \\) at state \\( s_t \\) in the current policy divided by the previous one.
|
||||
It’s the probability of taking action \\( a_t \\) at state \\( s_t \\) in the current policy, divided by the same for the previous policy.
|
||||
|
||||
As we can see, \\( r_t(\theta) \\) denotes the probability ratio between the current and old policy:
|
||||
|
||||
@@ -49,7 +49,7 @@ However, without a constraint, if the action taken is much more probable in our
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/clipped.jpg" alt="PPO"/>
|
||||
|
||||
Consequently, we need to constrain this objective function by penalizing changes that lead to a ratio away from 1 (in the paper, the ratio can only vary from 0.8 to 1.2).
|
||||
Consequently, we need to constrain this objective function by penalizing changes that lead to a ratio far away from 1 (in the paper, the ratio can only vary from 0.8 to 1.2).
|
||||
|
||||
**By clipping the ratio, we ensure that we do not have a too large policy update because the current policy can't be too different from the older one.**
|
||||
|
||||
@@ -62,7 +62,7 @@ To do that, we have two solutions:
|
||||
|
||||
This clipped part is a version where \\( r_t(\theta) \\) is clipped between \\( [1 - \epsilon, 1 + \epsilon] \\).
|
||||
|
||||
With the Clipped Surrogate Objective function, we have two probability ratios, one non-clipped and one clipped in a range (between \\( [1 - \epsilon, 1 + \epsilon] \\), epsilon is a hyperparameter that helps us to define this clip range (in the paper \\( \epsilon = 0.2 \\).).
|
||||
With the Clipped Surrogate Objective function, we have two probability ratios: one non-clipped and one clipped to the range \\( [1 - \epsilon, 1 + \epsilon] \\), where epsilon is a hyperparameter that defines this clip range (in the paper, \\( \epsilon = 0.2 \\)).
|
||||
|
||||
Then, we take the minimum of the clipped and non-clipped objective, **so the final objective is a lower bound (pessimistic bound) of the unclipped objective.**
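
If it helps to see this as code, here is a short PyTorch sketch of the clipped surrogate objective with dummy values; it is an illustration, not the paper's or CleanRL's exact implementation.

```python
# Illustrative PyTorch sketch of the clipped surrogate objective (dummy values).
import torch

epsilon = 0.2
advantages = torch.tensor([1.5, -0.7, 0.3])
old_log_probs = torch.tensor([-1.2, -0.8, -2.0])  # log prob of a_t under the old policy
new_log_probs = torch.tensor([-1.0, -1.1, -1.9])  # log prob of a_t under the current policy

# r_t(theta) = pi_theta(a_t|s_t) / pi_old(a_t|s_t), computed in log-space for stability
ratio = torch.exp(new_log_probs - old_log_probs)

unclipped = ratio * advantages
clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages

# Element-wise minimum: a pessimistic (lower) bound on the unclipped objective.
surrogate_objective = torch.min(unclipped, clipped).mean()

# We maximize the objective, so the loss we would backpropagate is its negative.
policy_loss = -surrogate_objective
print(policy_loss)
```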
|
||||
|
||||
|
||||
@@ -6,8 +6,8 @@ Now that you've successfully trained your Doom agent, why not try deathmatch? Re
|
||||
|
||||
If you do it, don't hesitate to share your model in the `#rl-i-made-this` channel in our [discord server](https://www.hf.co/join/discord).
|
||||
|
||||
This concludes the last unit, but we are not finished yet! 🤗 The following **bonus unit includes some of the most interesting, advanced and cutting edge work in Deep Reinforcement Learning**.
|
||||
This concludes the last unit, but we are not finished yet! 🤗 The following **bonus unit includes some of the most interesting, advanced, and cutting edge work in Deep Reinforcement Learning**.
|
||||
|
||||
See you next time 🔥,
|
||||
See you next time 🔥
|
||||
|
||||
## Keep Learning, Stay awesome 🤗
|
||||
|
||||
@@ -2,8 +2,8 @@
|
||||
|
||||
That’s all for today. Congrats on finishing this unit and the tutorial!
|
||||
|
||||
The best way to learn is to practice and try stuff. **Why not improving the implementation to handle frames as input?**.
|
||||
The best way to learn is to practice and try stuff. **Why not improve the implementation to handle frames as input?**
|
||||
|
||||
See you on second part of this Unit 🔥,
|
||||
See you in the second part of this Unit 🔥
|
||||
|
||||
## Keep Learning, Stay awesome 🤗
|
||||
|
||||
@@ -31,9 +31,9 @@ Then, to test its robustness, we're going to train it in:
|
||||
</video>
|
||||
</figure>
|
||||
|
||||
And finally, we will be push the trained model to the Hub to evaluate and visualize your agent playing.
|
||||
And finally, we will push the trained model to the Hub to evaluate and visualize your agent playing.
|
||||
|
||||
LunarLander-v2 is the first environment you used when you started this course. At that time, you didn't know how it worked, and now, you can code it from scratch and train it. **How incredible is that 🤩.**
|
||||
LunarLander-v2 is the first environment you used when you started this course. At that time, you didn't know how it worked, and now you can code it from scratch and train it. **How incredible is that 🤩.**
|
||||
|
||||
<iframe src="https://giphy.com/embed/pynZagVcYxVUk" width="480" height="480" frameBorder="0" class="giphy-embed" allowFullScreen></iframe><p><a href="https://giphy.com/gifs/the-office-michael-heartbreak-pynZagVcYxVUk">via GIPHY</a></p>
|
||||
|
||||
@@ -118,7 +118,7 @@ pip install huggingface_hub
|
||||
pip install box2d
|
||||
```
|
||||
|
||||
## Let's code PPO from scratch with Costa Huang tutorial
|
||||
## Let's code PPO from scratch with Costa Huang's tutorial
|
||||
- For the core implementation of PPO we're going to use the excellent [Costa Huang](https://costa.sh/) tutorial.
|
||||
- In addition to the tutorial, to go deeper you can read the 37 core implementation details: https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
|
||||
|
||||
@@ -429,7 +429,7 @@ package_to_hub(
|
||||
)
|
||||
```
|
||||
|
||||
- Here's what look the ppo.py final file
|
||||
- Here's what the final ppo.py file looks like:
|
||||
|
||||
```python
|
||||
# docs and experiment results can be found at https://docs.cleanrl.dev/rl-algorithms/ppo/#ppopy
|
||||
@@ -1034,7 +1034,7 @@ To be able to share your model with the community there are three more steps to
|
||||
|
||||
1️⃣ (If it's not already done) create an account on HF ➡ https://huggingface.co/join
|
||||
|
||||
2️⃣ Sign in and then, you need to store your authentication token from the Hugging Face website.
|
||||
2️⃣ Sign in and get your authentication token from the Hugging Face website.
|
||||
- Create a new token (https://huggingface.co/settings/tokens) **with write role**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">
|
||||
@@ -1048,11 +1048,11 @@ notebook_login()
|
||||
!git config --global credential.helper store
|
||||
```
|
||||
|
||||
If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
|
||||
If you don't want to use Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
|
||||
|
||||
## Let's start the training 🔥
|
||||
|
||||
- Now that you've coded from scratch PPO and added the Hugging Face Integration, we're ready to start the training 🔥
|
||||
- Now that you've coded PPO from scratch and added the Hugging Face Integration, we're ready to start the training 🔥
|
||||
|
||||
- First, you need to copy all your code to a file you create called `ppo.py`
|
||||
|
||||
@@ -1060,7 +1060,7 @@ If you don't want to use a Google Colab or a Jupyter Notebook, you need to use t
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/step2.png" alt="PPO"/>
|
||||
|
||||
- Now we just need to run this python script using `python <name-of-python-script>.py` with the additional parameters we defined with `argparse`
|
||||
- Now we just need to run this python script using `python <name-of-python-script>.py` with the additional parameters we defined using `argparse`
|
||||
|
||||
- You should modify more hyperparameters otherwise the training will not be super stable.
|
||||
|
||||
@@ -1070,8 +1070,8 @@ If you don't want to use a Google Colab or a Jupyter Notebook, you need to use t
|
||||
|
||||
## Some additional challenges 🏆
|
||||
|
||||
The best way to learn **is to try things by your own**! Why not trying another environment?
|
||||
The best way to learn **is to try things on your own**! Why not try another environment?
|
||||
|
||||
See you on Unit 8, part 2 where we going to train agents to play Doom 🔥
|
||||
See you in Unit 8, part 2 where we're going to train agents to play Doom 🔥
|
||||
|
||||
## Keep learning, stay awesome 🤗
|
||||
|
||||
@@ -214,9 +214,9 @@ Now that the setup if complete, we can train the agent. We have chosen here to l
|
||||
|
||||
|
||||
|
||||
The objective of this scenario is to **teach the agent how to survive without knowing what makes him survive**. Agent know only that **life is precious** and death is bad so **it must learn what prolongs his existence and that his health is connected with it**.
|
||||
The objective of this scenario is to **teach the agent how to survive without knowing what makes it survive**. The agent knows only that **life is precious** and death is bad, so **it must learn what prolongs its existence and that its health is connected with survival**.
|
||||
|
||||
Map is a rectangle containing walls and with a green, acidic floor which **hurts the player periodically**. Initially there are some medkits spread uniformly over the map. A new medkit falls from the skies every now and then. **Medkits heal some portions of player's health** - to survive agent needs to pick them up. Episode finishes after player's death or on timeout.
|
||||
The map is a rectangle with walls and a green, acidic floor which **hurts the player periodically**. Initially there are some medkits spread uniformly over the map. A new medkit falls from the skies every now and then. **Medkits heal some portion of the player's health** - to survive, the agent needs to pick them up. The episode finishes after the player's death or on timeout.
|
||||
|
||||
Further configuration:
|
||||
- Living_reward = 1
|
||||
@@ -232,7 +232,7 @@ There are also a number of more complex scenarios that have been create for ViZD
|
||||
|
||||
## Training the agent
|
||||
|
||||
- We're going to train the agent for 4000000 steps it will take approximately 20min
|
||||
- We're going to train the agent for 4000000 steps. It will take approximately 20min
|
||||
|
||||
```python
|
||||
## Start the training, this should take around 15 minutes
|
||||
@@ -288,7 +288,7 @@ To be able to share your model with the community there are three more steps to
|
||||
|
||||
1️⃣ (If it's not already done) create an account on HF ➡ https://huggingface.co/join
|
||||
|
||||
2️⃣ Sign in and then, you need to store your authentication token from the Hugging Face website.
|
||||
2️⃣ Sign in and get your authentication token from the Hugging Face website.
|
||||
- Create a new token (https://huggingface.co/settings/tokens) **with write role**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">
|
||||
@@ -296,7 +296,7 @@ To be able to share your model with the community there are three more steps to
|
||||
- Copy the token
|
||||
- Run the cell below and paste the token
|
||||
|
||||
If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
|
||||
If you don't want to use Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
|
||||
|
||||
```python
|
||||
from huggingface_hub import notebook_login
|
||||
@@ -330,7 +330,7 @@ status = enjoy(cfg)
|
||||
|
||||
|
||||
|
||||
This agent's performance was good, but can do better! Let's download and visualize an agent trained for 10B timesteps from the hub.
|
||||
This agent's performance was good, but we can do better! Let's download and visualize an agent trained for 10B timesteps from the hub.
|
||||
|
||||
```bash
|
||||
#download the agent from the hub
|
||||
@@ -425,6 +425,6 @@ If you prefer an easier scenario, **why not try training in another ViZDoom scen
|
||||
---
|
||||
|
||||
|
||||
This concludes the last unit. But we are not finished yet! 🤗 The following **bonus section include some of the most interesting, advanced and cutting edge work in Deep Reinforcement Learning**.
|
||||
This concludes the last unit. But we are not finished yet! 🤗 The following **bonus section includes some of the most interesting, advanced, and cutting edge work in Deep Reinforcement Learning**.
|
||||
|
||||
## Keep learning, stay awesome 🤗
|
||||
|
||||
@@ -2,12 +2,12 @@
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/thumbnail2.png" alt="thumbnail"/>
|
||||
|
||||
In this second part of Unit 8, we'll get deeper into PPO optimization by using [Sample-Factory](https://samplefactory.dev/), an **asynchronous implementation of the PPO algorithm**, to train our agent playing [vizdoom](https://vizdoom.cs.put.edu.pl/) (an open source version of Doom).
|
||||
In this second part of Unit 8, we'll get deeper into PPO optimization by using [Sample-Factory](https://samplefactory.dev/), an **asynchronous implementation of the PPO algorithm**, to train our agent to play [vizdoom](https://vizdoom.cs.put.edu.pl/) (an open source version of Doom).
|
||||
|
||||
In the notebook, **you'll train your agent to play the Health Gathering level**, where the agent must collect health packs to avoid dying. After that, you can **train your agent to play more complex levels, such as Deathmatch**.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit10/environments.png" alt="Environment"/>
|
||||
|
||||
Sounds exciting? Let's get started! 🚀
|
||||
Sound exciting? Let's get started! 🚀
|
||||
|
||||
The hands-on is made by [Edward Beeching](https://twitter.com/edwardbeeching), a Machine Learning Research Scientist at Hugging Face. He worked on Godot Reinforcement Learning Agents, an open-source interface for developing environments and agents in the Godot Game Engine.
|
||||
|
||||
@@ -2,17 +2,17 @@
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/thumbnail.png" alt="Unit 8"/>
|
||||
|
||||
In Unit 6, we learned about Advantage Actor Critic (A2C), a hybrid architecture combining value-based and policy-based methods that help to stabilize the training by reducing the variance with:
|
||||
In Unit 6, we learned about Advantage Actor Critic (A2C), a hybrid architecture combining value-based and policy-based methods that helps to stabilize the training by reducing the variance with:
|
||||
|
||||
- *An Actor* that controls **how our agent behaves** (policy-based method).
|
||||
- *A Critic* that measures **how good the action taken is** (value-based method).
|
||||
|
||||
Today we'll learn about Proximal Policy Optimization (PPO), an architecture that **improves our agent's training stability by avoiding too large policy updates**. To do that, we use a ratio that indicates the difference between our current and old policy and clip this ratio from a specific range \\( [1 - \epsilon, 1 + \epsilon] \\) .
|
||||
Today we'll learn about Proximal Policy Optimization (PPO), an architecture that **improves our agent's training stability by avoiding policy updates that are too large**. To do that, we use a ratio that indicates the difference between our current and old policy and clip this ratio to a specific range \\( [1 - \epsilon, 1 + \epsilon] \\) .
|
||||
|
||||
Doing this will ensure **that our policy update will not be too large and that the training is more stable.**
|
||||
|
||||
This Unit is in two parts:
|
||||
- In this first part, you'll learn the theory behind PPO and code your PPO agent from scratch using [CleanRL](https://github.com/vwxyzjn/cleanrl) implementation. To test its robustness with LunarLander-v2. LunarLander-v2 **is the first environment you used when you started this course**. At that time, you didn't know how PPO worked, and now, **you can code it from scratch and train it. How incredible is that 🤩**.
|
||||
- In this first part, you'll learn the theory behind PPO and code your PPO agent from scratch using the [CleanRL](https://github.com/vwxyzjn/cleanrl) implementation. To test its robustness you'll use LunarLander-v2. LunarLander-v2 **is the first environment you used when you started this course**. At that time, you didn't know how PPO worked, and now, **you can code it from scratch and train it. How incredible is that 🤩**.
|
||||
- In the second part, we'll get deeper into PPO optimization by using [Sample-Factory](https://samplefactory.dev/) and train an agent playing vizdoom (an open source version of Doom).
|
||||
|
||||
<figure>
|
||||
@@ -20,4 +20,4 @@ This Unit is in two parts:
|
||||
<figcaption>These are the environments you're going to use to train your agents: VizDoom environments</figcaption>
|
||||
</figure>
|
||||
|
||||
Sounds exciting? Let's get started! 🚀
|
||||
Sound exciting? Let's get started! 🚀
|
||||
|
||||
@@ -1,11 +1,11 @@
|
||||
# The intuition behind PPO [[the-intuition-behind-ppo]]
|
||||
|
||||
|
||||
The idea with Proximal Policy Optimization (PPO) is that we want to improve the training stability of the policy by limiting the change you make to the policy at each training epoch: **we want to avoid having too large policy updates.**
|
||||
The idea with Proximal Policy Optimization (PPO) is that we want to improve the training stability of the policy by limiting the change you make to the policy at each training epoch: **we want to avoid having too large of a policy update.**
|
||||
|
||||
For two reasons:
|
||||
- We know empirically that smaller policy updates during training are **more likely to converge to an optimal solution.**
|
||||
- A too-big step in a policy update can result in falling “off the cliff” (getting a bad policy) **and having a long time or even no possibility to recover.**
|
||||
- A too-big step in a policy update can result in falling “off the cliff” (getting a bad policy) **and taking a long time or even having no possibility to recover.**
|
||||
|
||||
<figure class="image table text-center m-0 w-full">
|
||||
<img class="center" src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit9/cliff.jpg" alt="Policy Update cliff"/>
|
||||
|
||||
@@ -59,7 +59,7 @@ So we update our policy only if:
|
||||
**You might wonder why, when the minimum is the clipped ratio, the gradient is 0.** When the ratio is clipped, the derivative will not be the derivative of \\( r_t(\theta) * A_t \\) but the derivative of either \\( (1 - \epsilon)* A_t\\) or of \\( (1 + \epsilon)* A_t\\), both of which equal 0.
|
||||
|
||||
|
||||
To summarize, thanks to this clipped surrogate objective, **we restrict the range that the current policy can vary from the old one.** Because we remove the incentive for the probability ratio to move outside of the interval since, the clip have the effect to gradient. If the ratio is > \\( 1 + \epsilon \\) or < \\( 1 - \epsilon \\) the gradient will be equal to 0.
|
||||
To summarize, thanks to this clipped surrogate objective, **we restrict the range that the current policy can vary from the old one**, because we remove the incentive for the probability ratio to move outside of the interval: the clip forces the gradient to be zero. If the ratio is > \\( 1 + \epsilon \\) or < \\( 1 - \epsilon \\), the gradient will be equal to 0.
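
You can check this claim numerically. In the illustrative PyTorch snippet below (not course code), the ratio lands above \\( 1 + \epsilon \\) with a positive advantage, so the clipped branch is selected and the gradient with respect to the new log-probability comes out as zero:

```python
# Tiny illustrative check: when the clipped branch is selected, the gradient is 0.
import torch

epsilon = 0.2
advantage = torch.tensor(1.0)                          # positive advantage
old_log_prob = torch.tensor(-1.0)
new_log_prob = torch.tensor(-0.4, requires_grad=True)  # action became much more probable

ratio = torch.exp(new_log_prob - old_log_prob)         # ~1.82 > 1 + epsilon
objective = torch.min(ratio * advantage,
                      torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantage)
objective.backward()

print(ratio.item(), new_log_prob.grad)                 # gradient is 0: the update is switched off
```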
|
||||
|
||||
The final Clipped Surrogate Objective Loss for PPO Actor-Critic style looks like this; it's a combination of the Clipped Surrogate Objective function, the Value Loss Function, and an Entropy bonus:
|
||||
|
||||
|
||||
@@ -6,7 +6,7 @@ You can now sit and enjoy playing with your Huggy 🐶. And don't **forget to sp
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/huggy-cover.jpeg" alt="Huggy cover" width="100%">
|
||||
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then please 👉 [fill out this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
|
||||
### Keep Learning, stay awesome 🤗
|
||||
|
||||
|
||||
@@ -5,7 +5,7 @@ This environment was created using the [Unity game engine](https://unity.com/) a
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/huggy.jpg" alt="Huggy" width="100%">
|
||||
|
||||
In this environment we aim to train Huggy to **fetch the stick we throw. It means he needs to move correctly toward the stick**.
|
||||
In this environment we aim to train Huggy to **fetch the stick we throw. This means he needs to move correctly toward the stick**.
|
||||
|
||||
## The State Space, what Huggy perceives. [[state-space]]
|
||||
Huggy doesn't "see" his environment. Instead, we provide him information about the environment:
|
||||
@@ -19,7 +19,7 @@ Given all this information, Huggy can **use his policy to determine which action
|
||||
## The Action Space, what moves Huggy can perform [[action-space]]
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/huggy-action.jpg" alt="Huggy action" width="100%">
|
||||
|
||||
**Joint motors drive Huggy legs**. It means that to get the target, Huggy needs to **learn to rotate the joint motors of each of his legs correctly so he can move**.
|
||||
**Joint motors drive Huggy's legs**. This means that to get the target, Huggy needs to **learn to rotate the joint motors of each of his legs correctly so he can move**.
|
||||
|
||||
## The Reward Function [[reward-function]]
|
||||
|
||||
@@ -63,4 +63,4 @@ Now that you have the big picture of the environment, you're ready to train Hugg
|
||||
|
||||
To do that, we're going to use [MLAgents](https://github.com/Unity-Technologies/ml-agents). Don't worry if you have never used it before. In this unit we'll use Google Colab to train Huggy, and then you'll be able to load your trained Huggy and play with him directly in the browser.
|
||||
|
||||
In a future unit, we will study more in-depth MLAgents and how it works. But for now, we keep things simple by just using the provided implementation.
|
||||
In a future unit, we will study MLAgents more in-depth and see how it works. But for now, we keep things simple by just using the provided implementation.
|
||||
|
||||
@@ -4,15 +4,15 @@ Now that you've trained Huggy and pushed it to the Hub. **You will be able to pl
|
||||
|
||||
For this step it’s simple:
|
||||
|
||||
- Open the game Huggy in your browser: https://huggingface.co/spaces/ThomasSimonini/Huggy
|
||||
- Open the Huggy game in your browser: https://huggingface.co/spaces/ThomasSimonini/Huggy
|
||||
- Click on Play with my Huggy model
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/load-huggy.jpg" alt="load-huggy" width="100%">
|
||||
|
||||
1. In step 1, choose your model repository which is the model id (in my case ThomasSimonini/ppo-Huggy).
|
||||
|
||||
2. In step 2, **choose what model you want to replay**:
|
||||
- I have multiple one, since we saved a model every 500000 timesteps.
|
||||
- But if I want the more recent I choose Huggy.onnx
|
||||
2. In step 2, **choose which model you want to replay**:
|
||||
- I have multiple ones, since we saved a model every 500000 timesteps.
|
||||
- But if I want the most recent one I choose Huggy.onnx
|
||||
|
||||
👉 What’s nice **is to try with different models step to see the improvement of the agent.**
|
||||
👉 It's good to **try with different model checkpoints to see the improvement of the agent.**
|
||||
|
||||
@@ -40,7 +40,7 @@ We're constantly trying to improve our tutorials, so **if you find some issues i
|
||||
|
||||
At the end of the notebook, you will:
|
||||
|
||||
- Understand **the state space, action space and reward function used to train Huggy**.
|
||||
- Understand **the state space, action space, and reward function used to train Huggy**.
|
||||
- **Train your own Huggy** to fetch the stick.
|
||||
- Be able to play **with your trained Huggy directly in your browser**.
|
||||
|
||||
@@ -123,7 +123,7 @@ Given all this information, Huggy **can decide which action to take next to fulf
|
||||
### The Action Space: what moves Huggy can do
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/huggy-action.jpg" alt="Huggy action" width="100%">
|
||||
|
||||
**Joint motors drive huggy legs**. It means that to get the target, Huggy needs to **learn to rotate the joint motors of each of his legs correctly so he can move**.
|
||||
**Joint motors drive Huggy's legs**. This means that to get the target, Huggy needs to **learn to rotate the joint motors of each of his legs correctly so he can move**.
|
||||
|
||||
### The Reward Function
|
||||
|
||||
@@ -144,9 +144,9 @@ Our reward function:
|
||||
|
||||
## Check the Huggy config file
|
||||
|
||||
- In ML-Agents, you define the **training hyperparameters into config.yaml files.**
|
||||
- In ML-Agents, you define the **training hyperparameters in config.yaml files.**
|
||||
|
||||
- For the scope of this notebook, we're not going to modify the hyperparameters, but if you want to try as an experiment, you should also try to modify some other hyperparameters, Unity provides very [good documentation explaining each of them here](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Training-Configuration-File.md).
|
||||
- For the scope of this notebook, we're not going to modify the hyperparameters, but if you want to experiment, Unity provides very [good documentation explaining each of them here](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Training-Configuration-File.md).
|
||||
|
||||
- **In case you want to modify the hyperparameters**, in the Google Colab notebook, you can click here to open the config.yaml: `/content/ml-agents/config/ppo/Huggy.yaml`
|
||||
|
||||
@@ -172,7 +172,7 @@ Train the model and use the `--resume` flag to continue training in case of inte
|
||||
|
||||
|
||||
|
||||
The training will take 30 to 45min depending on your machine (don't forget to **set up a GPU**), go take a ☕️you deserve it 🤗.
|
||||
The training will take 30 to 45min depending on your machine (don't forget to **set up a GPU**), go take a ☕️ you deserve it 🤗.
|
||||
|
||||
```bash
|
||||
mlagents-learn ./config/ppo/Huggy.yaml --env=./trained-envs-executables/linux/Huggy/Huggy --run-id="Huggy" --no-graphics
|
||||
@@ -186,7 +186,7 @@ To be able to share your model with the community there are three more steps to
|
||||
|
||||
1️⃣ (If it's not already done) create an account on HF ➡ https://huggingface.co/join
|
||||
|
||||
2️⃣ Sign in and then, you need to store your authentication token from the Hugging Face website.
|
||||
2️⃣ Sign in and then get your token from the Hugging Face website.
|
||||
- Create a new token (https://huggingface.co/settings/tokens) **with write role**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">
|
||||
@@ -200,7 +200,7 @@ from huggingface_hub import notebook_login
|
||||
notebook_login()
|
||||
```
|
||||
|
||||
If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
|
||||
If you don't want to use Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
|
||||
|
||||
Then, we simply need to run `mlagents-push-to-hf`.
|
||||
|
||||
@@ -212,13 +212,13 @@ And we define 4 parameters:
|
||||
2. `--local-dir`: where the agent was saved, it’s results/<run_id name>, so in my case results/First Training.
|
||||
3. `--repo-id`: the name of the Hugging Face repo you want to create or update. It’s always <your huggingface username>/<the repo name>
|
||||
If the repo does not exist **it will be created automatically**
|
||||
4. `--commit-message`: since HF repos are git repository you need to define a commit message.
|
||||
4. `--commit-message`: since HF repos are git repositories you need to give a commit message.
|
||||
|
||||
```bash
|
||||
mlagents-push-to-hf --run-id="HuggyTraining" --local-dir="./results/Huggy" --repo-id="ThomasSimonini/ppo-Huggy" --commit-message="Huggy"
|
||||
```
|
||||
|
||||
Else, if everything worked you should have this at the end of the process(but with a different url 😆) :
|
||||
If everything worked you should see this at the end of the process (but with a different url 😆) :
|
||||
|
||||
|
||||
|
||||
@@ -230,25 +230,25 @@ It’s the link to your model repository. The repository contains a model card t
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/modelcard.png" alt="ml learn function" width="100%">
|
||||
|
||||
But now comes the best: **being able to play with Huggy online 👀.**
|
||||
But now comes the best part: **being able to play with Huggy online 👀.**
|
||||
|
||||
## Play with your Huggy 🐕
|
||||
|
||||
This step is the simplest:
|
||||
|
||||
- Open the game Huggy in your browser: https://huggingface.co/spaces/ThomasSimonini/Huggy
|
||||
- Open the Huggy game in your browser: https://huggingface.co/spaces/ThomasSimonini/Huggy
|
||||
|
||||
- Click on Play with my Huggy model
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/load-huggy.jpg" alt="load-huggy" width="100%">
|
||||
|
||||
1. In step 1, choose your model repository which is the model id (in my case ThomasSimonini/ppo-Huggy).
|
||||
1. In step 1, choose your model repository, which is the model id (in my case ThomasSimonini/ppo-Huggy).
|
||||
|
||||
2. In step 2, **choose what model you want to replay**:
|
||||
2. In step 2, **choose which model you want to replay**:
|
||||
- I have multiple ones, since we saved a model every 500000 timesteps.
|
||||
- But since I want the more recent, I choose `Huggy.onnx`
|
||||
- But since I want the most recent one, I choose `Huggy.onnx`
|
||||
|
||||
👉 What’s nice **is to try with different models steps to see the improvement of the agent.**
|
||||
👉 It's good **to try with different model checkpoints to see the improvement of the agent.**
|
||||
|
||||
Congrats on finishing this bonus unit!
|
||||
|
||||
@@ -257,4 +257,4 @@ You can now sit and enjoy playing with your Huggy 🐶. And don't **forget to sp
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/huggy-cover.jpeg" alt="Huggy cover" width="100%">
|
||||
|
||||
|
||||
## Keep Learning, Stay awesome 🤗
|
||||
## Keep Learning, Stay awesome 🤗
|
||||
|
||||
@@ -1,16 +1,16 @@
|
||||
# Hands-on [[hands-on]]
|
||||
|
||||
Now that you've learned to use Optuna, we give you some ideas to apply what you've learned:
|
||||
Now that you've learned to use Optuna, here are some ideas to apply what you've learned:
|
||||
|
||||
1️⃣ **Beat your LunarLander-v2 agent results**, by using Optuna to find a better set of hyperparameters. You can also try with another environment, such as MountainCar-v0 and CartPole-v1.
|
||||
|
||||
2️⃣ **Beat your SpaceInvaders agent results**.
|
||||
|
||||
By doing that, you're going to see how Optuna is valuable and powerful in training better agents,
|
||||
By doing this, you'll see how valuable and powerful Optuna can be in training better agents.
|
||||
|
||||
Have fun,
|
||||
Have fun!
|
||||
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then please 👉 [fill out this form](https://forms.gle/BzKXWzLAGZESGNaE9)
|
||||
|
||||
### Keep Learning, stay awesome 🤗
|
||||
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
# Introduction [[introduction]]
|
||||
|
||||
One of the most critical task in Deep Reinforcement Learning is to **find a good set of training hyperparameters**.
|
||||
One of the most critical tasks in Deep Reinforcement Learning is to **find a good set of training hyperparameters**.
|
||||
|
||||
<img src="https://raw.githubusercontent.com/optuna/optuna/master/docs/image/optuna-logo.png" alt="Optuna Logo"/>
|
||||
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
# (Automatic) Curriculum Learning for RL
|
||||
|
||||
While most of the RL methods seen in this course work well in practice, there are some cases where using them alone fails. It is for instance the case where:
|
||||
While most of the RL methods seen in this course work well in practice, there are some cases where using them alone fails. This can happen, for instance, when:
|
||||
|
||||
- the task to learn is hard and requires an **incremental acquisition of skills** (for instance when one wants to make a bipedal agent learn to go through hard obstacles, it must first learn to stand, then walk, then maybe jump…)
|
||||
- there are variations in the environment (that affect the difficulty) and one wants its agent to be **robust** to them
|
||||
@@ -11,9 +11,9 @@ While most of the RL methods seen in this course work well in practice, there ar
|
||||
<figcaption> <a href="https://developmentalsystems.org/TeachMyAgent/">TeachMyAgent</a> </figcaption>
|
||||
</figure>
|
||||
|
||||
In such cases, it seems needed to propose different tasks to our RL agent and organize them such that it allows the agent to progressively acquire skills. This approach is called **Curriculum Learning** and usually implies a hand-designed curriculum (or set of tasks organized in a specific order). In practice, one can for instance control the generation of the environment, the initial states, or use Self-Play an control the level of opponents proposed to the RL agent.
|
||||
In such cases, it seems needed to propose different tasks to our RL agent and organize them such that the agent progressively acquires skills. This approach is called **Curriculum Learning** and usually implies a hand-designed curriculum (or set of tasks organized in a specific order). In practice, one can, for instance, control the generation of the environment, the initial states, or use Self-Play and control the level of opponents proposed to the RL agent.
|
||||
|
||||
As designing such a curriculum is not always trivial, the field of **Automatic Curriculum Learning (ACL) proposes to design approaches that learn to create such and organization of tasks in order to maximize the RL agent’s performances**. Portelas et al. proposed to define ACL as:
|
||||
As designing such a curriculum is not always trivial, the field of **Automatic Curriculum Learning (ACL) proposes to design approaches that learn to create such an organization of tasks in order to maximize the RL agent’s performances**. Portelas et al. proposed to define ACL as:
|
||||
|
||||
> … a family of mechanisms that automatically adapt the distribution of training data by learning to adjust the selection of learning situations to the capabilities of RL agents.
|
||||
>
|
||||
@@ -36,7 +36,7 @@ Finally, you can play with the robustness of agents trained in the <a href="http
|
||||
|
||||
## Further reading
|
||||
|
||||
For more information, we recommend you check out the following resources:
|
||||
For more information, we recommend that you check out the following resources:
|
||||
|
||||
### Overview of the field
|
||||
|
||||
|
||||
@@ -5,7 +5,7 @@ The Decision Transformer model was introduced by ["Decision Transformer: Reinfor
|
||||
The main idea is that instead of training a policy using RL methods, such as fitting a value function, that will tell us what action to take to maximize the return (cumulative reward), **we use a sequence modeling algorithm (Transformer) that, given a desired return, past states, and actions, will generate future actions to achieve this desired return**.
|
||||
It’s an autoregressive model conditioned on the desired return, past states, and actions to generate future actions that achieve the desired return.
|
||||
|
||||
This is a complete shift in the Reinforcement Learning paradigm since we use generative trajectory modeling (modeling the joint distribution of the sequence of states, actions, and rewards) to replace conventional RL algorithms. It means that in Decision Transformers, we don’t maximize the return but rather generate a series of future actions that achieve the desired return.
|
||||
This is a complete shift in the Reinforcement Learning paradigm since we use generative trajectory modeling (modeling the joint distribution of the sequence of states, actions, and rewards) to replace conventional RL algorithms. This means that in Decision Transformers, we don’t maximize the return but rather generate a series of future actions that achieve the desired return.
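
As a rough illustration of this return-conditioned, autoregressive generation, here is a minimal sketch of the inference loop. Everything in it is a placeholder: `predict_action` stands in for a trained Decision Transformer and the environment step is random, so this is not the actual 🤗 Transformers implementation.

```python
# Illustrative sketch of Decision Transformer inference: condition on a desired
# return, generate an action, then decrement the return-to-go by the reward
# received. `predict_action` and the environment are placeholders.
import random

def predict_action(returns_to_go, states, actions):
    # A real Decision Transformer would attend over the whole conditioning
    # sequence (returns-to-go, states, actions) and output the next action.
    return random.uniform(-1.0, 1.0)

target_return = 100.0                     # the desired return we condition on
returns_to_go, states, actions = [target_return], [0.0], []

for t in range(5):
    action = predict_action(returns_to_go, states, actions)
    actions.append(action)
    reward, next_state = random.random(), random.random()  # toy environment step
    # Key idea: keep conditioning on what is still left to achieve.
    returns_to_go.append(returns_to_go[-1] - reward)
    states.append(next_state)

print(f"Remaining return-to-go after 5 steps: {returns_to_go[-1]:.2f}")
```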
|
||||
|
||||
The 🤗 Transformers team integrated the Decision Transformer, an Offline Reinforcement Learning method, into the library as well as the Hugging Face Hub.
|
||||
|
||||
@@ -15,13 +15,13 @@ To learn more about Decision Transformers, you should read the blogpost we wrote
|
||||
|
||||
## Train your first Decision Transformers
|
||||
|
||||
Now that you understand how Decision Transformers work thanks to [Introducing Decision Transformers on Hugging Face](https://huggingface.co/blog/decision-transformers). You’re ready to learn to train your first Offline Decision Transformer model from scratch to make a half-cheetah run.
|
||||
Now that you understand how Decision Transformers work thanks to [Introducing Decision Transformers on Hugging Face](https://huggingface.co/blog/decision-transformers), you’re ready to learn to train your first Offline Decision Transformer model from scratch to make a half-cheetah run.
|
||||
|
||||
Start the tutorial here 👉 https://huggingface.co/blog/train-decision-transformers
|
||||
|
||||
## Further reading
|
||||
|
||||
For more information, we recommend you check out the following resources:
|
||||
For more information, we recommend that you check out the following resources:
|
||||
|
||||
- [Decision Transformer: Reinforcement Learning via Sequence Modeling](https://arxiv.org/abs/2106.01345)
|
||||
- [Online Decision Transformer](https://arxiv.org/abs/2202.05607)
|
||||
|
||||
@@ -1,6 +1,6 @@
# Interesting Environments to try
We provide here a list of interesting environments you can try to train your agents on:
Here we provide a list of interesting environments you can try to train your agents on:
## MineRL
@@ -8,7 +8,7 @@ We provide here a list of interesting environments you can try to train your age
MineRL is a Python library that provides a Gym interface for interacting with the video game Minecraft, accompanied by datasets of human gameplay.
Every year, there are challenges with this library. Check the [website](https://minerl.io/)
Every year there are challenges with this library. Check the [website](https://minerl.io/)
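If you want a feel for the interface, a minimal random-agent loop looks roughly like the following. This assumes MineRL 0.4.x and its Gym-style API; the environment id is one of the standard MineRL tasks, and the installation and Java setup are covered in the MineRL docs.

```python
import gym
import minerl  # noqa: F401  (importing minerl registers the MineRL environments with Gym)

env = gym.make("MineRLNavigateDense-v0")
obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # replace with your trained policy
    obs, reward, done, info = env.step(action)
env.close()
```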
To start using this environment, check these resources:
- [What is MineRL?](https://www.youtube.com/watch?v=z6PTrGifupU)
@@ -1,6 +1,6 @@
# Godot RL Agents
[Godot RL Agents](https://github.com/edbeeching/godot_rl_agents) is an Open Source package that allows video game creators, AI researchers and hobbyists the opportunity **to learn complex behaviors for their Non Player Characters or agents**.
[Godot RL Agents](https://github.com/edbeeching/godot_rl_agents) is an Open Source package that allows video game creators, AI researchers, and hobbyists the opportunity **to learn complex behaviors for their Non Player Characters or agents**.
The library provides:
@@ -19,7 +19,7 @@ Installation of the library is simple: `pip install godot-rl`
In this section, you will **learn how to create a custom environment in the Godot Game Engine** and then implement an AI controller that learns to play with Deep Reinforcement Learning.
The example game we create today is simple, **but shows off many of the features of the Godot Engine and the Godot RL Agents library**.You can then dive into the examples for more complex environments and behaviors.
The example game we create today is simple, **but shows off many of the features of the Godot Engine and the Godot RL Agents library**. You can then dive into the examples for more complex environments and behaviors.
The environment we will be building today is called Ring Pong, the game of pong but the pitch is a ring and the paddle moves around the ring. The **objective is to keep the ball bouncing inside the ring**.
@@ -31,7 +31,7 @@ The [Godot game engine](https://godotengine.org/) is an open source tool for the
Godot Engine is a feature-packed, cross-platform game engine designed to create 2D and 3D games from a unified interface. It provides a comprehensive set of common tools, so users **can focus on making games without having to reinvent the wheel**. Games can be exported in one click to a number of platforms, including the major desktop platforms (Linux, macOS, Windows) as well as mobile (Android, iOS) and web-based (HTML5) platforms.
While we will guide you through the steps to implement your agent, you may wish to learn more about the Godot Game Engine. Their [documentation](https://docs.godotengine.org/en/latest/index.html) is thorough, there are many tutorials on YouTube we would also recommend [GDQuest](https://www.gdquest.com/), [KidsCanCode](https://kidscancode.org/godot_recipes/4.x/) and [Bramwell](https://www.youtube.com/channel/UCczi7Aq_dTKrQPF5ZV5J3gg) as sources of information.
While we will guide you through the steps to implement your agent, you may wish to learn more about the Godot Game Engine. Their [documentation](https://docs.godotengine.org/en/latest/index.html) is thorough, and there are many tutorials on YouTube we would also recommend [GDQuest](https://www.gdquest.com/), [KidsCanCode](https://kidscancode.org/godot_recipes/4.x/) and [Bramwell](https://www.youtube.com/channel/UCczi7Aq_dTKrQPF5ZV5J3gg) as sources of information.
In order to create games in Godot, **you must first download the editor**. Godot RL Agents supports the latest version of Godot, Godot 4.0.
@@ -125,7 +125,7 @@ func _process(delta):
pass
```
We will now implement the 4 missing methods, delete this code and replace it with the following:
We will now implement the 4 missing methods, delete this code, and replace it with the following:
```python
extends AIController3D
@@ -191,7 +191,7 @@ func _on_area_3d_body_entered(body):
ai_controller.reward += 1.0
```
We now need to synchronize between the game running in Godot and the neural network being trained in Python. Godot RL agents provides a node that does just that. Open the train.tscn scene, right click on the root node and click “Add child node”. Then, search for “sync” and add a Godot RL Agents Sync node. This node handles the communication between Python and Godot over TCP.
We now need to synchronize between the game running in Godot and the neural network being trained in Python. Godot RL agents provides a node that does just that. Open the train.tscn scene, right click on the root node, and click “Add child node”. Then, search for “sync” and add a Godot RL Agents Sync node. This node handles the communication between Python and Godot over TCP.
You can run training live in the editor, by first launching the python training with `gdrl`
@@ -8,4 +8,4 @@ But this course was just the beginning of your Deep Reinforcement Learning journ
Contrary to other units, this unit is a collective work of multiple people from Hugging Face. We mention the author for each unit.
Sounds fun? Let's get started 🔥,
Sound fun? Let's get started 🔥,
@@ -1,9 +1,9 @@
# Language models in RL
## LMs encode useful knowledge for agents
**Language models** (LMs) can exhibit impressive abilities when manipulating text such as question-answering or even step-by-step reasoning. Additionally, their training on massive text corpora allowed them to **encode various knowledge including abstract ones about the physical rules of our world** (for instance what is possible to do with an object, what happens when one rotates an object…).
**Language models** (LMs) can exhibit impressive abilities when manipulating text such as question-answering or even step-by-step reasoning. Additionally, their training on massive text corpora allowed them to **encode various types of knowledge including abstract ones about the physical rules of our world** (for instance what is possible to do with an object, what happens when one rotates an object…).
A natural question recently studied was could such knowledge benefit agents such as robots when trying to solve everyday tasks. And while these works showed interesting results, the proposed agents lacked of any learning method. **This limitation prevents these agent from adapting to the environment (e.g. fixing wrong knowledge) or learning new skills.**
A natural question recently studied was whether such knowledge could benefit agents such as robots when trying to solve everyday tasks. And while these works showed interesting results, the proposed agents lacked any learning method. **This limitation prevents these agent from adapting to the environment (e.g. fixing wrong knowledge) or learning new skills.**
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit12/language.png" alt="Language">
@@ -12,17 +12,17 @@ A natural question recently studied was could such knowledge benefit agents such
## LMs and RL
There is therefore a potential synergy between LMs which can bring knowledge about the world, and RL which can align and correct these knowledge by interacting with an environment. It is especially interesting from a RL point-of-view as the RL field mostly relies on the **Tabula-rasa** setup where everything is learned from scratch by agent leading to:
There is therefore a potential synergy between LMs which can bring knowledge about the world, and RL which can align and correct this knowledge by interacting with an environment. It is especially interesting from a RL point-of-view as the RL field mostly relies on the **Tabula-rasa** setup where everything is learned from scratch by the agent leading to:
1) Sample inefficiency
2) Unexpected behaviors from humans’ eyes
As a first attempt, the paper [“Grounding Large Language Models with Online Reinforcement Learning”](https://arxiv.org/abs/2302.02662v1) tackled the problem of **adapting or aligning a LM to a textual environment using PPO**. They showed that the knowledge encoded in the LM lead to a fast adaptation to the environment (opening avenue for sample efficiency RL agents) but also that such knowledge allowed the LM to better generalize to new tasks once aligned.
As a first attempt, the paper [“Grounding Large Language Models with Online Reinforcement Learning”](https://arxiv.org/abs/2302.02662v1) tackled the problem of **adapting or aligning a LM to a textual environment using PPO**. They showed that the knowledge encoded in the LM lead to a fast adaptation to the environment (opening avenues for sample efficient RL agents) but also that such knowledge allowed the LM to better generalize to new tasks once aligned.
<video src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit12/papier_v4.mp4" type="video/mp4" controls />
Another direction studied in [“Guiding Pretraining in Reinforcement Learning with Large Language Models”](https://arxiv.org/abs/2302.06692) was to keep the LM frozen but leverage its knowledge to **guide an RL agent’s exploration**. Such method allows the RL agent to be guided towards human-meaningful and plausibly useful behaviors without requiring a human in the loop during training.
Another direction studied in [“Guiding Pretraining in Reinforcement Learning with Large Language Models”](https://arxiv.org/abs/2302.06692) was to keep the LM frozen but leverage its knowledge to **guide an RL agent’s exploration**. Such a method allows the RL agent to be guided towards human-meaningful and plausibly useful behaviors without requiring a human in the loop during training.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit12/language2.png" alt="Language">
@@ -2,7 +2,7 @@
Model-based reinforcement learning only differs from its model-free counterpart in learning a *dynamics model*, but that has substantial downstream effects on how the decisions are made.
The dynamics models usually model the environment transition dynamics, \\( s_{t+1} = f_\theta (s_t, a_t) \\), but things like inverse dynamics models (mapping from states to actions) or reward models (predicting rewards) can be used in this framework.
The dynamics model usually models the environment transition dynamics, \\( s_{t+1} = f_\theta (s_t, a_t) \\), but things like inverse dynamics models (mapping from states to actions) or reward models (predicting rewards) can be used in this framework.
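In its simplest form, such a forward model is just a regressor trained on transitions \\( (s_t, a_t, s_{t+1}) \\) collected by the agent. A minimal PyTorch-style sketch (the network sizes and state/action dimensions below are arbitrary):

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Forward model f_theta: predicts s_{t+1} from (s_t, a_t)."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Training reduces to regression on observed transitions:
model = DynamicsModel(state_dim=17, action_dim=6)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
s_t, a_t, s_next = torch.randn(32, 17), torch.randn(32, 6), torch.randn(32, 17)  # dummy batch
loss = nn.functional.mse_loss(model(s_t, a_t), s_next)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```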
## Simple definition
@@ -3,7 +3,7 @@
In this advanced topic, we address the question: **how should we monitor and keep track of powerful reinforcement learning agents that we are training in the real world and
interfacing with humans?**
As machine learning systems have increasingly impacted modern life, **call for documentation of these systems has grown**.
As machine learning systems have increasingly impacted modern life, the **call for the documentation of these systems has grown**.
Such documentation can cover aspects such as the training data used — where it is stored, when it was collected, who was involved, etc.
— or the model optimization framework — the architecture, evaluation metrics, relevant papers, etc. — and more.
@@ -19,7 +19,7 @@ These model and data specific logs are designed to be completed when the model o
Reinforcement learning systems are fundamentally designed to optimize based on measurements of reward and time.
While the notion of a reward function can be mapped nicely to many well-understood fields of supervised learning (via a loss function),
understanding how machine learning systems evolve over time is limited.
understanding of how machine learning systems evolve over time is limited.
To that end, the authors introduce [*Reward Reports for Reinforcement Learning*](https://www.notion.so/Brief-introduction-to-RL-documentation-b8cbda5a6f5242338e0756e6bef72af4) (the pithy naming is designed to mirror the popular papers *Model Cards for Model Reporting* and *Datasheets for Datasets*).
The goal is to propose a type of documentation focused on the **human factors of reward** and **time-varying feedback systems**.
@@ -42,7 +42,7 @@ The change log is accompanied by update triggers that encourage monitoring these
## Contributing
Some of the most impactful RL-driven systems are multi-stakeholder in nature and behind closed doors of private corporations.
Some of the most impactful RL-driven systems are multi-stakeholder in nature and behind the closed doors of private corporations.
These corporations are largely without regulation, so the burden of documentation falls on the public.
If you are interested in contributing, we are building Reward Reports for popular machine learning systems on a public
@@ -14,7 +14,7 @@ To start learning about RLHF:
1. Read this introduction: [Illustrating Reinforcement Learning from Human Feedback (RLHF)](https://huggingface.co/blog/rlhf).
2. Watch the recorded live we did some weeks ago, where Nathan covered the basics of Reinforcement Learning from Human Feedback (RLHF) and how this technology is being used to enable state-of-the-art ML tools like ChatGPT.
Most of the talk is an overview of the interconnected ML models. It covers the basics of Natural Language Processing and RL and how RLHF is used on large language models. We then conclude with the open question in RLHF.
Most of the talk is an overview of the interconnected ML models. It covers the basics of Natural Language Processing and RL and how RLHF is used on large language models. We then conclude with open questions in RLHF.
<Youtube id="2MBJOuVq380" />
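To make the RL step of RLHF more tangible, here is a compressed sketch using the TRL library. Treat it as an illustration only: the exact `PPOConfig`/`PPOTrainer` API differs between TRL versions, the prompt is a toy placeholder, and in a real pipeline the reward would come from a preference (reward) model trained on human comparisons rather than a hard-coded value.

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="gpt2", batch_size=1, mini_batch_size=1)
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(config, model, ref_model=None, tokenizer=tokenizer)

query = tokenizer("The movie was", return_tensors="pt").input_ids[0]
gen = ppo_trainer.generate(query, max_new_tokens=16, pad_token_id=tokenizer.eos_token_id)
response = gen[0][query.shape[0]:]  # keep only the newly generated tokens

# The reward would normally be produced by a learned reward model scoring (query, response).
reward = [torch.tensor(1.0)]
stats = ppo_trainer.step([query], [response], reward)
```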