Apply suggestions from code review

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
This commit is contained in:
Thomas Simonini
2022-12-05 01:11:18 +01:00
committed by GitHub
parent 07e8da8672
commit 1c08eb4ae5
7 changed files with 26 additions and 23 deletions

View File

@@ -1,14 +1,14 @@
# Conclusion [[conclusion]]
Congrats on finishing this chapter! **That was the biggest one**, and there was a lot of information. And congrats on finishing the tutorial. You've just trained your first Deep RL agents and shared it on the Hub 🥳.
Congrats on finishing this unit! **That was the biggest one**, and there was a lot of information. And congrats on finishing the tutorial. You've just trained your first Deep RL agents and shared it with the community! 🥳
That's **normal if you still feel confused with all these elements**. This was the same for me and for all people who studied RL.
It's **normal if you still feel confused with some of these elements**. This was the same for me and for all people who studied RL.
**Take time to really grasp the material** before continuing. It's important to master these elements and have a solid foundation before entering the fun part.
Naturally, during the course, we're going to use and explain these terms again, but it's better to understand them before diving into the next chapters.
Naturally, during the course, we're going to use and explain these terms again, but it's better to understand them before diving into the next units.
In the next chapter, we're going to reinforce what we just learn by **training Huggy the Dog to fetch the stick**.
In the next (bonus) unit, we're going to reinforce what we just learned by **training Huggy the Dog to fetch the stick**.
You will be able then to play with him 🤗.

View File

@@ -6,11 +6,11 @@ What we've talked about so far is Reinforcement Learning. But where does the "De
Deep Reinforcement Learning introduces **deep neural networks to solve Reinforcement Learning problems** — hence the name “deep”.
For instance, in the next article, we'll work on Q-Learning (classic Reinforcement Learning) and then Deep Q-Learning; both are value-based RL algorithms.
For instance, in the next unit, we'll learn about two value-based algorithms: Q-Learning (classic Reinforcement Learning) and then Deep Q-Learning.
Youll see the difference is that in the first approach, **we use a traditional algorithm** to create a Q table that helps us find what action to take for each state.
In the second approach, **we will use a Neural Network** (to approximate the q value).
In the second approach, **we will use a Neural Network** (to approximate the Q value).
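To make the contrast concrete, here is a minimal Python sketch (all sizes and weights are made up for illustration): tabular Q-Learning stores one Q-value per (state, action) pair, while Deep Q-Learning replaces the table with a function approximator — represented here by a single random linear layer standing in for a real neural network.

```python
import random

# Tabular Q-Learning: the Q-value is a lookup table indexed by (state, action).
n_states, n_actions = 16, 4
q_table = [[0.0] * n_actions for _ in range(n_states)]

def best_action_table(state):
    # Pick the action with the highest Q-value for this state.
    row = q_table[state]
    return max(range(n_actions), key=lambda a: row[a])

# Deep Q-Learning: the Q-values are *approximated* by a neural network.
# Here a single linear layer with random weights stands in for the network.
weights = [[random.uniform(-1, 1) for _ in range(n_actions)] for _ in range(n_states)]

def q_network(state_one_hot):
    # One linear layer: Q(s, a) = sum_i s_i * W[i][a]
    return [sum(s_i * w_row[a] for s_i, w_row in zip(state_one_hot, weights))
            for a in range(n_actions)]

state = 3
one_hot = [1.0 if i == state else 0.0 for i in range(n_states)]
print(best_action_table(state))  # action index read from the table
print(q_network(one_hot))        # approximated Q-values from the "network"
```

With a real state space (e.g. raw game frames), the table would be far too large to store, which is exactly why the second approach approximates Q with a network instead.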
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/deep.jpg" alt="Value based RL"/>
@@ -18,4 +18,4 @@ In the second approach, **we will use a Neural Network** (to approximate the q
</figcaption>
</figure>
If you are not familiar with Deep Learning you definitely should watch <a href="[https://course.fast.ai/](https://course.fast.ai/)">the fastai Practical Deep Learning for Coders (Free)</a>
If you are not familiar with Deep Learning, you definitely should watch [the FastAI Practical Deep Learning for Coders](https://course.fast.ai/) (Free).

View File

@@ -1,4 +1,4 @@
# The Exploration/ Exploitation tradeoff [[exp-exp-tradeoff]]
# The Exploration/Exploitation trade-off [[exp-exp-tradeoff]]
Finally, before looking at the different methods to solve Reinforcement Learning problems, we must cover one more very important topic: *the exploration/exploitation trade-off.*
@@ -19,9 +19,10 @@ But if our agent does a little bit of exploration, it can **discover the big re
This is what we call the exploration/exploitation trade-off. We need to balance how much we **explore the environment** and how much we **exploit what we know about the environment.**
Therefore, we must **define a rule that helps to handle this trade-off**. We'll see the different ways to handle it in the future chapters.
Therefore, we must **define a rule that helps to handle this trade-off**. We'll see the different ways to handle it in the future units.
If it's still confusing, **think of a real problem: the choice of picking a restaurant:**
If it's still confusing, **think of a real problem: the choice of a restaurant:**
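One simple rule for handling this trade-off, which later units cover in detail, is epsilon-greedy. A minimal sketch using the restaurant analogy (the restaurants and their estimated rewards are made up for illustration):

```python
import random

# Estimated value of each known "restaurant" (action) -- made-up numbers.
estimated_rewards = {"pizza": 7.0, "sushi": 8.5, "tacos": 6.0}

def epsilon_greedy(epsilon=0.1):
    """With probability epsilon, explore a random option;
    otherwise exploit the best-known one."""
    if random.random() < epsilon:
        return random.choice(list(estimated_rewards))          # explore
    return max(estimated_rewards, key=estimated_rewards.get)   # exploit

# Exploit most of the time, but occasionally try something new.
choices = [epsilon_greedy(epsilon=0.1) for _ in range(1000)]
print(choices.count("sushi"))  # the best-known option dominates
```

With epsilon = 0.1, the agent mostly goes to its favorite restaurant but still tries the others about 10% of the time, so it can discover a better one.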
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/exp_2.jpg" alt="Exploration">

View File

@@ -8,6 +8,6 @@ A Lunar Lander agent that will learn to land correctly on the Moon 🌕
And finally, you'll **upload this trained agent to the Hugging Face Hub 🤗, a free, open platform where people can share ML models, datasets, and demos.**
Thanks to our <a href="https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard">leaderboard</a>, you'll be able to compare your results with other classmates and exchange the best practices to improve your agent's scores Who will win the challenge for Unit 1 🏆?
Thanks to our <a href="https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard">leaderboard</a>, you'll be able to compare your results with other classmates and exchange the best practices to improve your agent's scores. Who will win the challenge for Unit 1 🏆?
So let's get started! 🚀

View File

@@ -7,9 +7,11 @@ Welcome to the most fascinating topic in Artificial Intelligence: **Deep Reinfo
Deep RL is a type of Machine Learning where an agent learns **how to behave** in an environment **by performing actions** and **seeing the results.**
So in this first chapter, **you'll learn the foundations of Deep Reinforcement Learning.**
In this first unit, **you'll learn the foundations of Deep Reinforcement Learning.**
Then, you'll **train your Deep Reinforcement Learning agent, a lunar lander to land correctly on the Moon** using <a href="https://stable-baselines3.readthedocs.io/en/master/"> Stable-Baselines3 </a>, a Deep Reinforcement Learning library.
Then, you'll **train your Deep Reinforcement Learning agent, a lunar lander to land correctly on the Moon** using <a href="https://stable-baselines3.readthedocs.io/en/master/"> Stable-Baselines3 </a> a Deep Reinforcement Learning library.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/lunarLander.gif" alt="LunarLander">

View File

@@ -132,7 +132,7 @@ In Reinforcement Learning, we need to **balance how much we explore the environm
<details>
<summary>Solution</summary>
- The Policy π **is the brain of our Agent**, it's the function that tell us what action to take given the state we are. So it defines the agent's behavior at a given time.
- The Policy π **is the brain of our Agent**. It's the function that tells us what action to take given the state we are in. So it defines the agent's behavior at a given time.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy_1.jpg" alt="Policy">
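To make the idea concrete, a deterministic policy is just a mapping from state to action. A toy sketch (the 1-D world and action names are invented for illustration):

```python
# A deterministic policy: pi(state) -> action.
def pi(state):
    # Toy rule: move toward the goal at position 5 on a 1-D line.
    goal = 5
    if state < goal:
        return "right"
    elif state > goal:
        return "left"
    return "stay"

print(pi(2))  # "right"
print(pi(7))  # "left"
```

Note that the policy only looks at the current state, which is exactly what the Markov Property allows.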

View File

@@ -23,7 +23,7 @@ This RL loop outputs a sequence of **state, action, reward and next state.**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/sars.jpg" alt="State, Action, Reward, Next State" width="100%">
The agent's goal is to maximize its cumulative reward, **called the expected return.**
The agent's goal is to _maximize_ its cumulative reward, **called the expected return.**
## The reward hypothesis: the central idea of Reinforcement Learning [[reward-hypothesis]]
@@ -31,19 +31,20 @@ The agent's goal is to maximize its cumulative reward, **called the expected re
Because RL is based on the **reward hypothesis**, which is that all goals can be described as the **maximization of the expected return** (expected cumulative reward).
That's why in Reinforcement Learning, **to have the best behavior,** we need to **maximize the expected cumulative reward.**
That's why in Reinforcement Learning, **to have the best behavior,** we aim to learn to take actions that **maximize the expected cumulative reward.**
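Concretely, the return sums the rewards collected along a trajectory, usually with a discount factor gamma. A small sketch (the reward values are made up):

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward: R = r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# Three steps of reward along one episode (made-up values).
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0 + 0.81*2 = approx. 2.62
```

Since the environment can be stochastic, the agent maximizes the *expected* value of this quantity over trajectories, not the return of any single episode.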
## Markov Property [[markov-property]]
In papers, youll see that the RL process is called the **Markov Decision Process** (MDP).
We'll talk again about the Markov Property in the following units. But if you need to remember something today about it, Markov Property implies that our agent needs **only the current state to decide** what action to take and **not the history of all the states** **and actions** they took before.
We'll talk again about the Markov Property in the following units. But if you need to remember something today about it, it's this: the Markov Property implies that our agent needs **only the current state to decide** what action to take and **not the history of all the states and actions** they took before.
## Observations/States Space [[obs-space]]
Observations/States are the **information our agent gets from the environment.** In the case of a video game, it can be a frame (a screenshot). In the case of the trading agent, it can be the value of a certain stock, etc.
There is a differentiation to make between *observation* and *state*:
There is a differentiation to make between *observation* and *state*, however:
- *State s*: is **a complete description of the state of the world** (there is no hidden information). In a fully observed environment.
@@ -53,9 +54,7 @@ There is a differentiation to make between *observation* and *state*:
<figcaption>In a chess game, we receive a state from the environment since we have access to the whole board information.</figcaption>
</figure>
In chess game, we receive a state from the environment since we have access to the whole check board information.
With a chess game, we are in a fully observed environment, since we have access to the whole check board information.
In a chess game, we have access to the whole board information, so we receive a state from the environment. In other words, the environment is fully observed.
- *Observation o*: is a **partial description of the state.** In a partially observed environment.
@@ -69,7 +68,7 @@ In Super Mario Bros, we only see a part of the level close to the player, so we
In Super Mario Bros, we are in a partially observed environment. We receive an observation **since we only see a part of the level.**
<Tip>
In reality, we use the term state in this course but we will make the distinction in implementations.
In this course, we use the term "state" to denote both state and observation, but we will make the distinction in implementations.
</Tip>
To recap:
@@ -86,7 +85,8 @@ The actions can come from a *discrete* or *continuous space*:
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario">
<figcaption>Again, in Super Mario Bros, we have only 4 directions and jump possible</figcaption>
<figcaption>Again, in Super Mario Bros, we have only 5 possible actions: 4 directions and jumping</figcaption>
</figure>
In Super Mario Bros, we have a finite set of actions since we have only 4 directions and jump.
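The distinction between the two kinds of action spaces can be sketched in a few lines of Python (the action names and the steering-angle range are invented for illustration):

```python
import random

# Discrete action space: a finite set of possible actions.
mario_actions = ["left", "right", "up", "down", "jump"]

# Continuous action space: any value in a range, e.g. a steering angle
# for a self-driving car, sampled here just for illustration.
steering_angle = random.uniform(-30.0, 30.0)  # degrees

print(len(mario_actions))              # 5 discrete actions
print(-30.0 <= steering_angle <= 30.0) # the continuous action lies in a range
```

Which kind of space an environment has matters later in the course, since some algorithms (like Q-Learning) only handle discrete action spaces.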