Typos Unit1

This commit is contained in:
Dylan Wilson
2023-04-18 14:01:03 -05:00
parent 3d13100f02
commit 70aadc9565
9 changed files with 58 additions and 58 deletions

View File

@@ -1,16 +1,16 @@
 # Conclusion [[conclusion]]
-Congrats on finishing this unit! **That was the biggest one**, and there was a lot of information. And congrats on finishing the tutorial. You've just trained your first Deep RL agents and shared it with the community! 🥳
+Congrats on finishing this unit! **That was the biggest one**, and there was a lot of information. And congrats on finishing the tutorial. You've just trained your first Deep RL agents and shared them with the community! 🥳
-It's **normal if you still feel confused with some of these elements**. This was the same for me and for all people who studied RL.
+It's **normal if you still feel confused by some of these elements**. This was the same for me and for all people who studied RL.
-**Take time to really grasp the material** before continuing. It's important to master these elements and having a solid foundations before entering the fun part.
+**Take time to really grasp the material** before continuing. It's important to master these elements and have a solid foundation before entering the fun part.
 Naturally, during the course, we're going to use and explain these terms again, but it's better to understand them before diving into the next units.
-In the next (bonus) unit, we're going to reinforce what we just learned by **training Huggy the Dog to fetch the stick**.
+In the next (bonus) unit, we're going to reinforce what we just learned by **training Huggy the Dog to fetch a stick**.
-You will be able then to play with him 🤗.
+You will then be able to play with him 🤗.
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/huggy.jpg" alt="Huggy"/>

View File

@@ -8,7 +8,7 @@ Deep Reinforcement Learning introduces **deep neural networks to solve Reinforc
 For instance, in the next unit, we'll learn about two value-based algorithms: Q-Learning (classic Reinforcement Learning) and then Deep Q-Learning.
-You'll see the difference is that in the first approach, **we use a traditional algorithm** to create a Q table that helps us find what action to take for each state.
+You'll see the difference is that, in the first approach, **we use a traditional algorithm** to create a Q table that helps us find what action to take for each state.
 In the second approach, **we will use a Neural Network** (to approximate the Q value).
@@ -18,4 +18,4 @@ In the second approach, **we will use a Neural Network** (to approximate the Q
 </figcaption>
 </figure>
-If you are not familiar with Deep Learning you definitely should watch [the FastAI Practical Deep Learning for Coders](https://course.fast.ai) (Free).
+If you are not familiar with Deep Learning you should definitely watch [the FastAI Practical Deep Learning for Coders](https://course.fast.ai) (Free).

View File

@@ -30,7 +30,7 @@ If it's still confusing, **think of a real problem: the choice of picking a re
 </figcaption>
 </figure>
-- *Exploitation*: You go every day to the same one that you know is good and **take the risk to miss another better restaurant.**
+- *Exploitation*: You go to the same one that you know is good every day and **take the risk to miss another better restaurant.**
 - *Exploration*: Try restaurants you never went to before, with the risk of having a bad experience **but the probable opportunity of a fantastic experience.**
 To recap:
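The exploitation/exploration trade-off described in this hunk can be sketched as a simple ε-greedy choice. This is a hypothetical illustration, not part of the course files; the restaurant names and ratings are made up:

```python
# Epsilon-greedy sketch of the exploration/exploitation trade-off:
# with probability epsilon we explore a random restaurant,
# otherwise we exploit the best-rated one we know so far.
import random

ratings = {"usual spot": 4.5, "new ramen place": 0.0, "corner cafe": 0.0}  # made-up data

def choose_restaurant(epsilon=0.1, rng=random):
    if rng.random() < epsilon:
        return rng.choice(list(ratings))   # exploration: try anything
    return max(ratings, key=ratings.get)   # exploitation: best known so far

# With epsilon = 0 we always exploit the known-good option.
print(choose_restaurant(epsilon=0.0))  # usual spot
```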

File diff suppressed because one or more lines are too long

View File

@@ -12,9 +12,9 @@ To understand the RL process, let's imagine an agent learning to play a platfo
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process_game.jpg" alt="The RL process" width="100%">
-- Our Agent receives **state \\(S_0\\)** from the **Environment** — we receive the first frame of our game (Environment).
+- Our Agent receives **state \\(S_0\\)** from the **Environment** — we receive the first frame of our game (Environment).
 - Based on that **state \\(S_0\\),** the Agent takes **action \\(A_0\\)** — our Agent will move to the right.
-- Environment goes to a **new** **state \\(S_1\\)** — new frame.
+- The environment goes to a **new** **state \\(S_1\\)** — new frame.
 - The environment gives some **reward \\(R_1\\)** to the Agent — we're not dead *(Positive Reward +1)*.
 This RL loop outputs a sequence of **state, action, reward and next state.**
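The state → action → reward → next-state loop in this hunk can be sketched with a toy environment. This is a hypothetical illustration only (the `ToyPlatformer` class and its trivial "always move right" policy are invented for this sketch, not taken from the course):

```python
# Minimal sketch of the RL loop: the agent observes a state, takes an
# action, and receives a reward and the next state from the environment.
class ToyPlatformer:
    """Toy environment: the agent moves right; reaching position 3 ends the episode."""
    def reset(self):
        self.pos = 0
        return self.pos  # initial state S_0

    def step(self, action):
        # action: +1 moves right; reward +1 for surviving the step
        self.pos += action
        reward = 1
        done = self.pos >= 3
        return self.pos, reward, done  # next state, reward, episode-over flag

env = ToyPlatformer()
state = env.reset()
trajectory = []  # the sequence of (state, action, reward, next_state)
done = False
while not done:
    action = 1  # a trivial policy: always move right
    next_state, reward, done = env.step(action)
    trajectory.append((state, action, reward, next_state))
    state = next_state

print(trajectory)  # [(0, 1, 1, 1), (1, 1, 1, 2), (2, 1, 1, 3)]
```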
@@ -34,7 +34,7 @@ That's why in Reinforcement Learning, **to have the best behavior,** we aim
 ## Markov Property [[markov-property]]
-In papers, you'll see that the RL process is called the **Markov Decision Process** (MDP).
+In papers, you'll see that the RL process is called a **Markov Decision Process** (MDP).
 We'll talk again about the Markov Property in the following units. But if you need to remember something today about it, it's this: the Markov Property implies that our agent needs **only the current state to decide** what action to take and **not the history of all the states and actions** they took before.
@@ -58,10 +58,10 @@ In a chess game, we have access to the whole board information, so we receive a
 <figure>
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario">
-<figcaption>In Super Mario Bros, we only see a part of the level close to the player, so we receive an observation.</figcaption>
+<figcaption>In Super Mario Bros, we only see the part of the level close to the player, so we receive an observation.</figcaption>
 </figure>
-In Super Mario Bros, we only see a part of the level close to the player, so we receive an observation.
+In Super Mario Bros, we only see the part of the level close to the player, so we receive an observation.
 In Super Mario Bros, we are in a partially observed environment. We receive an observation **since we only see a part of the level.**
@@ -110,7 +110,7 @@ The cumulative reward at each time step **t** can be written as:
 <figure>
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_1.jpg" alt="Rewards">
-<figcaption>The cumulative reward equals to the sum of all rewards of the sequence.
+<figcaption>The cumulative reward equals the sum of all rewards in the sequence.
 </figcaption>
 </figure>
@@ -134,7 +134,7 @@ Consequently, **the reward near the cat, even if it is bigger (more cheese), wi
 To discount the rewards, we proceed like this:
-1. We define a discount rate called gamma. **It must be between 0 and 1.** Most of the time between **0.99 and 0.95**.
+1. We define a discount rate called gamma. **It must be between 0 and 1.** Most of the time between **0.95 and 0.99**.
 - The larger the gamma, the smaller the discount. This means our agent **cares more about the long-term reward.**
 - On the other hand, the smaller the gamma, the bigger the discount. This means our **agent cares more about the short-term reward (the nearest cheese).**
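The discounting rule in this hunk amounts to summing each reward weighted by gamma raised to how far in the future it arrives. A minimal sketch (the function name and reward values are invented for illustration, not from the course):

```python
# Discounted return: G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
def discounted_return(rewards, gamma=0.99):
    """Sum rewards, shrinking each step further into the future by a factor of gamma."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0]
# A larger gamma keeps future rewards almost intact (long-term focus);
# a smaller gamma shrinks them quickly (short-term focus).
print(round(discounted_return(rewards, gamma=0.99), 4))  # 2.9701
print(discounted_return(rewards, gamma=0.5))             # 1.75
```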

View File

@@ -2,7 +2,7 @@
 That was a lot of information! Let's summarize:
-- Reinforcement Learning is a computational approach of learning from action. We build an agent that learns from the environment **by interacting with it through trial and error** and receiving rewards (negative or positive) as feedback.
+- Reinforcement Learning is a computational approach of learning from actions. We build an agent that learns from the environment **by interacting with it through trial and error** and receiving rewards (negative or positive) as feedback.
 - The goal of any RL agent is to maximize its expected cumulative reward (also called expected return) because RL is based on the **reward hypothesis**, which is that **all goals can be described as the maximization of the expected cumulative reward.**

View File

@@ -6,7 +6,7 @@ A task is an **instance** of a Reinforcement Learning problem. We can have two t
 In this case, we have a starting point and an ending point **(a terminal state). This creates an episode**: a list of States, Actions, Rewards, and new States.
-For instance, think about Super Mario Bros: an episode begin at the launch of a new Mario Level and ending **when you're killed or you reached the end of the level.**
+For instance, think about Super Mario Bros: an episode begin at the launch of a new Mario Level and ends **when you're killed or you reached the end of the level.**
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario">

View File

@@ -4,7 +4,7 @@
 Now that we learned the RL framework, how do we solve the RL problem?
 </Tip>
-In other terms, how to build an RL agent that can **select the actions that maximize its expected cumulative reward?**
+In other words, how do we build an RL agent that can **select the actions that maximize its expected cumulative reward?**
 ## The Policy π: the agent's brain [[policy]]
@@ -26,7 +26,7 @@ There are two approaches to train our agent to find this optimal policy π\*:
 In Policy-Based methods, **we learn a policy function directly.**
-This function will define a mapping between each state and the best corresponding action. We can also say that it'll define **a probability distribution over the set of possible actions at that state.**
+This function will define a mapping from each state to the best corresponding action. Alternatively, it could define **a probability distribution over the set of possible actions at that state.**
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy_2.jpg" alt="Policy" />
@@ -69,13 +69,13 @@ If we recap:
 In value-based methods, instead of learning a policy function, we **learn a value function** that maps a state to the expected value **of being at that state.**
-The value of a state is the **expected discounted return** the agent can get if it **starts in that state, and then act according to our policy.**
+The value of a state is the **expected discounted return** the agent can get if it **starts in that state, and then acts according to our policy.**
 “Act according to our policy” just means that our policy is **“going to the state with the highest value”.**
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/value_1.jpg" alt="Value based RL" width="100%" />
-Here we see that our value function **defined value for each possible state.**
+Here we see that our value function **defined values for each possible state.**
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/value_2.jpg" alt="Value based RL"/>
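The contrast in this file's hunks can be sketched in a few lines: a policy maps states to actions directly, while a value function scores states and the derived policy is "go to the state with the highest value". The states, values, and neighbor structure below are invented for illustration, not taken from the course:

```python
# Policy-based vs value-based, on a tiny made-up state graph.
state_values = {"A": 0.5, "B": 0.8, "C": 1.0}        # hypothetical learned values
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}  # which states are reachable

# Policy-based view: an explicit state -> action mapping, learned directly.
policy = {"A": "right", "B": "right", "C": "stay"}

# Value-based view: no explicit policy; act by moving toward the
# reachable state with the highest value.
def greedy_next_state(state):
    return max(neighbors[state], key=lambda s: state_values[s])

print(policy["A"])             # right
print(greedy_next_state("B"))  # C
```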

View File

@@ -17,12 +17,12 @@ Your brother will interact with the environment (the video game) by pressing the
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/Illustration_2.jpg" alt="Illustration_2" width="100%">
-But then, **he presses right again** and he touches an enemy. He just died, so that's a -1 reward.
+But then, **he presses the right button again** and he touches an enemy. He just died, so that's a -1 reward.
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/Illustration_3.jpg" alt="Illustration_3" width="100%">
-By interacting with his environment through trial and error, your little brother understood that **he needed to get coins in this environment but avoid the enemies.**
+By interacting with his environment through trial and error, your little brother understands that **he needs to get coins in this environment but avoid the enemies.**
**Without any supervision**, the child will get better and better at playing the game.
@@ -31,7 +31,7 @@ That's how humans and animals learn, **through interaction.** Reinforcement
 ### A formal definition [[a-formal-definition]]
-If we take now a formal definition:
+We can now make a formal definition:
<Tip>
Reinforcement learning is a framework for solving control tasks (also called decision problems) by building agents that learn from the environment by interacting with it through trial and error and receiving rewards (positive or negative) as unique feedback.