mirror of
https://github.com/huggingface/deep-rl-class.git
synced 2026-04-13 18:00:45 +08:00
Typos Unit2
@@ -4,7 +4,7 @@ These are **optional readings** if you want to go deeper.
 
 ## Monte Carlo and TD Learning [[mc-td]]
 
-To dive deeper on Monte Carlo and Temporal Difference Learning:
+To dive deeper into Monte Carlo and Temporal Difference Learning:
 
 - <a href="https://stats.stackexchange.com/questions/355820/why-do-temporal-difference-td-methods-have-lower-variance-than-monte-carlo-met">Why do temporal difference (TD) methods have lower variance than Monte Carlo methods?</a>
 - <a href="https://stats.stackexchange.com/questions/336974/when-are-monte-carlo-methods-preferred-over-temporal-difference-ones"> When are Monte Carlo methods preferred over temporal difference ones?</a>
@@ -5,7 +5,7 @@ The Bellman equation **simplifies our state value or state-action value calcula
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman.jpg" alt="Bellman equation"/>
 
-With what we have learned so far, we know that if we calculate the \\(V(S_t)\\) (value of a state), we need to calculate the return starting at that state and then follow the policy forever after. **(The policy we defined in the following example is a Greedy Policy; for simplification, we don't discount the reward).**
+With what we have learned so far, we know that if we calculate \\(V(S_t)\\) (the value of a state), we need to calculate the return starting at that state and then follow the policy forever after. **(The policy we defined in the following example is a Greedy Policy; for simplification, we don't discount the reward).**
 
 So to calculate \\(V(S_t)\\), we need to calculate the sum of the expected rewards. Hence:
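The recursion this hunk describes can be written out as a math sketch, in the Bellman expectation form the course's image shows (symbols follow the course's own notation):

```latex
V_{\pi}(s) = \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma\, V_{\pi}(S_{t+1}) \mid S_t = s \right]
```

That is, the value of a state is the expected immediate reward plus the discounted value of the next state; in the example above the reward is not discounted, i.e. \\(\gamma = 1\\).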
@@ -4,7 +4,7 @@ Congrats on finishing this chapter! There was a lot of information. And congrat
 
 Implementing from scratch when you study a new architecture **is important to understand how it works.**
 
-That’s **normal if you still feel confused** with all these elements. **This was the same for me and for all people who studied RL.**
+It's **normal if you still feel confused** by all these elements. **This was the same for me and for everyone who studies RL.**
 
 Take time to really grasp the material before continuing.
@@ -15,7 +15,6 @@ In the next chapter, we’re going to dive deeper by studying our first Deep Rei
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Atari environments"/>
 
-
-Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
+Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)
 
 ### Keep Learning, stay awesome 🤗
@@ -5,7 +5,7 @@ This is a community-created glossary. Contributions are welcomed!
 
 ### Strategies to find the optimal policy
 
-- **Policy-based methods.** The policy is usually trained with a neural network to select what action to take given a state. In this case is the neural network which outputs the action that the agent should take instead of using a value function. Depending on the experience received by the environment, the neural network will be re-adjusted and will provide better actions.
+- **Policy-based methods.** The policy is usually trained with a neural network to select what action to take given a state. In this case it is the neural network which outputs the action that the agent should take instead of using a value function. Depending on the experience received by the environment, the neural network will be re-adjusted and will provide better actions.
 - **Value-based methods.** In this case, a value function is trained to output the value of a state or a state-action pair that will represent our policy. However, this value doesn't define what action the agent should take. In contrast, we need to specify the behavior of the agent given the output of the value function. For example, we could decide to adopt a policy to take the action that always leads to the biggest reward (Greedy Policy). In summary, the policy is a Greedy Policy (or whatever decision the user takes) that uses the values of the value-function to decide the actions to take.
 
 ### Among the value-based methods, we can find two main strategies
@@ -15,14 +15,14 @@ This is a community-created glossary. Contributions are welcomed!
 
 ### Epsilon-greedy strategy:
 
-- Common exploration strategy used in reinforcement learning that involves balancing exploration and exploitation.
+- Common strategy used in reinforcement learning that involves balancing exploration and exploitation.
 - Chooses the action with the highest expected reward with a probability of 1-epsilon.
 - Chooses a random action with a probability of epsilon.
 - Epsilon is typically decreased over time to shift focus towards exploitation.
 
 ### Greedy strategy:
 
-- Involves always choosing the action that is expected to lead to the highest reward, based on the current knowledge of the environment. (only exploitation)
+- Involves always choosing the action that is expected to lead to the highest reward, based on the current knowledge of the environment. (Only exploitation)
 - Always chooses the action with the highest expected reward.
 - Does not include any exploration.
 - Can be disadvantageous in environments with uncertainty or unknown optimal actions.
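The two glossary entries in the hunk above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's code; it assumes a NumPy Q-table indexed as `q_table[state, action]`, and the names are hypothetical:

```python
import numpy as np

def greedy_policy(q_table, state):
    # Greedy strategy: always exploit current knowledge of the environment.
    return int(np.argmax(q_table[state]))

def epsilon_greedy_policy(q_table, state, epsilon, rng):
    # With probability epsilon: explore (pick a random action).
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))
    # With probability 1 - epsilon: exploit (pick the highest-value action).
    return greedy_policy(q_table, state)

rng = np.random.default_rng(0)
q_table = np.zeros((16, 4))   # e.g. 16 states, 4 actions
q_table[0, 2] = 1.0           # pretend action 2 currently looks best in state 0
print(greedy_policy(q_table, 0))                    # 2
print(epsilon_greedy_policy(q_table, 0, 0.0, rng))  # epsilon = 0 behaves greedily: 2
```

With `epsilon = 1.0` the same function would always return a random action, which is exactly the "only exploitation" vs. "balancing" distinction the glossary draws.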
@@ -27,7 +27,7 @@ For more information about the certification process, check this section 👉 ht
 
 And you can check your progress here 👉 https://huggingface.co/spaces/ThomasSimonini/Check-my-progress-Deep-RL-Course
 
-**To start the hands-on click on Open In Colab button** 👇 :
+**To start the hands-on click on the Open In Colab button** 👇 :
 
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit2/unit2.ipynb)
@@ -36,7 +36,7 @@ And you can check your progress here 👉 https://huggingface.co/spaces/ThomasSi
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/thumbnail.jpg" alt="Unit 2 Thumbnail">
 
-In this notebook, **you'll code from scratch your first Reinforcement Learning agent** playing FrozenLake ❄️ using Q-Learning, share it to the community, and experiment with different configurations.
+In this notebook, **you'll code your first Reinforcement Learning agent from scratch** to play FrozenLake ❄️ using Q-Learning, share it with the community, and experiment with different configurations.
 
 ⬇️ Here is an example of what **you will achieve in just a couple of minutes.** ⬇️
@@ -61,7 +61,7 @@ We're constantly trying to improve our tutorials, so **if you find some issues i
 
 At the end of the notebook, you will:
 
 - Be able to use **Gym**, the environment library.
-- Be able to code from scratch a Q-Learning agent.
+- Be able to code a Q-Learning agent from scratch.
 - Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.
@@ -72,23 +72,23 @@ Before diving into the notebook, you need to:
 
 ## A small recap of Q-Learning
 
-- The *Q-Learning* **is the RL algorithm that**
+- *Q-Learning* **is the RL algorithm that**
 
-- Trains *Q-Function*, an **action-value function** that contains, as internal memory, a *Q-table* **that contains all the state-action pair values.**
+- Trains a *Q-Function*, an **action-value function** encoded, in internal memory, by a *Q-table* **that contains all the state-action pair values.**
 
-- Given a state and action, our Q-Function **will search into its Q-table the corresponding value.**
+- Given a state and action, our Q-Function **will search the Q-table for the corresponding value.**
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q function" width="100%"/>
 
-- When the training is done,**we have an optimal Q-Function, so an optimal Q-Table.**
+- When the training is done, **we have an optimal Q-Function, so an optimal Q-Table.**
 
-- And if we **have an optimal Q-function**, we have an optimal policy,since we **know for each state, what is the best action to take.**
+- And if we **have an optimal Q-function**, we have an optimal policy, since we **know, for each state, the best action to take.**
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" width="100%"/>
 
-But, in the beginning, our **Q-Table is useless since it gives arbitrary value for each state-action pair (most of the time we initialize the Q-Table to 0 values)**. But, as we’ll explore the environment and update our Q-Table it will give us better and better approximations
+But, in the beginning, our **Q-Table is useless since it gives arbitrary values for each state-action pair (most of the time we initialize the Q-Table to 0 values)**. But, as we explore the environment and update our Q-Table, it will give us better and better approximations.
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/q-learning.jpeg" alt="q-learning.jpeg" width="100%"/>
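The recap in the hunk above can be made concrete with a short sketch: a Q-table as the Q-function's internal memory, initialized to zeros, with a lookup given a state and an action (the state/action counts and names are purely illustrative):

```python
import numpy as np

n_states, n_actions = 16, 4                 # e.g. a 4x4 FrozenLake grid
q_table = np.zeros((n_states, n_actions))   # arbitrary start: every value is 0

def q_value(state, action):
    # Given a state and an action, the Q-function looks the value up in its Q-table.
    return q_table[state, action]

print(q_table.shape)   # (16, 4)
print(q_value(0, 1))   # 0.0 — before training, no state-action pair has a value
```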
@@ -113,7 +113,7 @@ We’ll install multiple ones:
 
 The Hugging Face Hub 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.
 
-You can see here all the Deep RL models available (if they use Q Learning) 👉 https://huggingface.co/models?other=q-learning
+You can see all the Deep RL models available here (if they use Q Learning) 👉 https://huggingface.co/models?other=q-learning
 
 ```bash
 pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit2/requirements-unit2.txt
@@ -125,7 +125,7 @@ apt install python-opengl ffmpeg xvfb
 pip3 install pyvirtualdisplay
 ```
 
-To make sure the new installed libraries are used, **sometimes it's required to restart the notebook runtime**. The next cell will force the **runtime to crash, so you'll need to connect again and run the code starting from here**. Thanks for this trick, **we will be able to run our virtual screen.**
+To make sure the newly installed libraries are used, **sometimes it's required to restart the notebook runtime**. The next cell will force the **runtime to crash, so you'll need to connect again and run the code starting from here**. Thanks to this trick, **we will be able to run our virtual screen.**
 
 ```python
 import os
@@ -299,7 +299,7 @@ Remember we have two policies since Q-Learning is an **off-policy** algorithm. T
 
 - Epsilon-greedy policy (acting policy)
 - Greedy-policy (updating policy)
 
-Greedy policy will also be the final policy we'll have when the Q-learning agent will be trained. The greedy policy is used to select an action from the Q-table.
+The greedy policy will also be the final policy we'll have when the Q-learning agent completes training. The greedy policy is used to select an action using the Q-table.
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg" alt="Q-Learning" width="100%"/>
@@ -330,9 +330,9 @@ The idea with epsilon-greedy:
 
 - With *probability 1 - ɛ* : **we do exploitation** (i.e. our agent selects the action with the highest state-action pair value).
 
-- With *probability ɛ*: we do **exploration** (trying random action).
+- With *probability ɛ*: we do **exploration** (trying a random action).
 
-And as the training goes, we progressively **reduce the epsilon value since we will need less and less exploration and more exploitation.**
+As the training continues, we progressively **reduce the epsilon value since we will need less and less exploration and more exploitation.**
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg" alt="Q-Learning" width="100%"/>
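The epsilon reduction mentioned in the hunk above is usually implemented as a decay schedule. A sketch of one common choice, exponential decay towards a floor (the function name and default values are illustrative, not the notebook's exact hyperparameters):

```python
import numpy as np

def decayed_epsilon(episode, min_epsilon=0.05, max_epsilon=1.0, decay_rate=0.005):
    # Epsilon shrinks exponentially from max_epsilon towards min_epsilon:
    # lots of exploration early on, mostly exploitation later.
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)

print(round(decayed_epsilon(0), 3))   # 1.0 — full exploration at the start
print(round(decayed_epsilon(1000), 3))  # close to min_epsilon — mostly exploitation
```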
@@ -726,7 +726,7 @@ By using `push_to_hub` **you evaluate, record a replay, generate a model card of
 
 This way:
 
 - You can **showcase your work** 🔥
 - You can **visualize your agent playing** 👀
-- You can **share with the community an agent that others can use** 💾
+- You can **share an agent with the community that others can use** 💾
 - You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
@@ -788,8 +788,8 @@ repo_name = "q-FrozenLake-v1-4x4-noSlippery"
 push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)
 ```
 
-Congrats 🥳 you've just implemented from scratch, trained and uploaded your first Reinforcement Learning agent.
-FrozenLake-v1 no_slippery is very simple environment, let's try an harder one 🔥.
+Congrats 🥳 you've just implemented from scratch, trained, and uploaded your first Reinforcement Learning agent.
+FrozenLake-v1 no_slippery is a very simple environment, let's try a harder one 🔥.
 
 # Part 2: Taxi-v3 🚖
@@ -1009,7 +1009,7 @@ repo_name = ""
 push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)
 ```
 
-Now that's on the Hub, you can compare the results of your Taxi-v3 with your classmates using the leaderboard 🏆 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
+Now that it's on the Hub, you can compare the results of your Taxi-v3 with your classmates using the leaderboard 🏆 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
 
 ⚠ To see your entry, you need to go to the bottom of the leaderboard page and **click on refresh** ⚠
@@ -1075,24 +1075,24 @@ evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"
 ```
 
 ## Some additional challenges 🏆
-The best way to learn **is to try things by your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. With 1,000,000 steps, we saw some great results!
+The best way to learn **is to try things on your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. With 1,000,000 steps, we saw some great results!
 
 In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?
 
-Here are some ideas to achieve so:
+Here are some ideas to climb up the leaderboard:
 
 * Train more steps
 * Try different hyperparameters by looking at what your classmates have done.
 * **Push your new trained model** on the Hub 🔥
 
-Are walking on ice and driving taxis too boring to you? Try to **change the environment**, why not using FrozenLake-v1 slippery version? Check how they work [using the gym documentation](https://www.gymlibrary.dev/) and have fun 🎉.
+Are walking on ice and driving taxis too boring for you? Try to **change the environment**, why not use the FrozenLake-v1 slippery version? Check how they work [using the gym documentation](https://www.gymlibrary.dev/) and have fun 🎉.
 
 _____________________________________________________________________
 Congrats 🥳, you've just implemented, trained, and uploaded your first Reinforcement Learning agent.
 
 Understanding Q-Learning is an **important step to understanding value-based methods.**
 
-In the next Unit with Deep Q-Learning, we'll see that creating and updating a Q-table was a good strategy — **however, this is not scalable.**
+In the next Unit, with Deep Q-Learning, we'll see that creating and updating a Q-table was a good strategy, **however, it is not scalable.**
 
 For instance, imagine you create an agent that learns to play Doom.
@@ -1100,11 +1100,11 @@ For instance, imagine you create an agent that learns to play Doom.
 
 Doom is a large environment with a huge state space (millions of different states). Creating and updating a Q-table for that environment would not be efficient.
 
-That's why we'll study, in the next unit, Deep Q-Learning, an algorithm **where we use a neural network that approximates, given a state, the different Q-values for each action.**
+That's why we'll study Deep Q-Learning in the next unit, an algorithm **where we use a neural network that approximates, given a state, the different Q-values for each action.**
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Environments"/>
 
-See you on Unit 3! 🔥
+See you in Unit 3! 🔥
 
 ## Keep learning, stay awesome 🤗
@@ -45,7 +45,7 @@ By running more and more episodes, **the agent will learn to play better and be
 
 For instance, if we train a state-value function using Monte Carlo:
 
-- We just started to train our value function, **so it returns 0 value for each state**
+- We initialize our value function **so that it returns 0 value for each state**
 - Our learning rate (lr) is 0.1 and our discount rate is 1 (= no discount)
 - Our mouse **explores the environment and takes random actions**
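The Monte Carlo setup in the list above can be checked numerically. A toy sketch of the Monte Carlo value update (the return value of 3 is an assumption for illustration; substitute whatever return the episode actually produced):

```python
# Monte Carlo update: after a FULL episode we know the actual return G_t,
# and we move V(S_t) a little towards it:
#   V(S_t) <- V(S_t) + lr * (G_t - V(S_t))
lr = 0.1      # learning rate from the example
v_s0 = 0.0    # the value function starts at 0 for every state
g_t = 3.0     # suppose the episode's (undiscounted) return from the start state was 3
v_s0 = v_s0 + lr * (g_t - v_s0)
print(round(v_s0, 3))  # 0.3
```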
@@ -82,7 +82,7 @@ The idea with **TD is to update the \\(V(S_t)\\) at each step.**
 
 But because we didn't experience an entire episode, we don't have \\(G_t\\) (expected return). Instead, **we estimate \\(G_t\\) by adding \\(R_{t+1}\\) and the discounted value of the next state.**
 
-This is called bootstrapping. It's called this **because TD bases its update part on an existing estimate \\(V(S_{t+1})\\) and not a complete sample \\(G_t\\).**
+This is called bootstrapping. It's called this **because TD bases its update in part on an existing estimate \\(V(S_{t+1})\\) and not a complete sample \\(G_t\\).**
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-1.jpg" alt="Temporal Difference"/>
@@ -95,9 +95,9 @@ If we take the same example,
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-2.jpg" alt="Temporal Difference"/>
 
-- We just started to train our value function, so it returns 0 value for each state.
+- We initialize our value function so that it returns 0 value for each state.
 - Our learning rate (lr) is 0.1, and our discount rate is 1 (no discount).
-- Our mouse explore the environment and take a random action: **going to the left**
+- Our mouse begins to explore the environment and takes a random action: **going to the left**
 - It gets a reward \\(R_{t+1} = 1\\) since **it eats a piece of cheese**
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-2p.jpg" alt="Temporal Difference"/>
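With the numbers in the list above, the TD(0) update can be verified by hand. A sketch (the variable names are illustrative):

```python
# TD(0) update: V(S_t) <- V(S_t) + lr * (R_{t+1} + gamma * V(S_{t+1}) - V(S_t))
lr, gamma = 0.1, 1.0     # learning rate 0.1, no discounting
v_st, v_st1 = 0.0, 0.0   # value function initialized to 0 everywhere
r_t1 = 1.0               # reward for eating the piece of cheese

td_target = r_t1 + gamma * v_st1   # bootstrapped estimate of G_t
v_st = v_st + lr * (td_target - v_st)
print(v_st)  # 0.1
```

Note that, unlike the Monte Carlo case, this update happened after a single step, using the estimate `v_st1` in place of the full return.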
@@ -119,9 +119,9 @@ Now we **continue to interact with this environment with our updated value func
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-3p.jpg" alt="Temporal Difference"/>
 
-If we summarize:
+To summarize:
 
 - With *Monte Carlo*, we update the value function from a complete episode, and so we **use the actual accurate discounted return of this episode.**
-- With *TD Learning*, we update the value function from a step, so we replace \\(G_t\\) that we don't have with **an estimated return called TD target.**
+- With *TD Learning*, we update the value function from a step, and we replace \\(G_t\\), which we don't know, with **an estimated return called the TD target.**
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Summary.jpg" alt="Summary"/>
@@ -1,17 +1,17 @@
 # Mid-way Recap [[mid-way-recap]]
 
-Before diving into Q-Learning, let's summarize what we just learned.
+Before diving into Q-Learning, let's summarize what we've just learned.
 
 We have two types of value-based functions:
 
-- State-value function: outputs the expected return if **the agent starts at a given state and acts accordingly to the policy forever after.**
+- State-value function: outputs the expected return if **the agent starts at a given state and acts according to the policy forever after.**
 - Action-value function: outputs the expected return if **the agent starts in a given state, takes a given action at that state** and then acts according to the policy forever after.
 - In value-based methods, rather than learning the policy, **we define the policy by hand** and we learn a value function. If we have an optimal value function, we **will have an optimal policy.**
 
 There are two types of methods to learn a policy for a value function:
 
-- With *the Monte Carlo method*, we update the value function from a complete episode, and so we **use the actual accurate discounted return of this episode.**
-- With *the TD Learning method,* we update the value function from a step, so we replace \\(G_t\\) that we don't have with **an estimated return called TD target.**
+- With *the Monte Carlo method*, we update the value function from a complete episode, and so we **use the actual discounted return of this episode.**
+- With *the TD Learning method,* we update the value function from a step, replacing the unknown \\(G_t\\) with **an estimated return called the TD target.**
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/summary-learning-mtds.jpg" alt="Summary"/>
@@ -6,9 +6,9 @@ To better understand Q-Learning, let's take a simple example:
 
 - You're a mouse in this tiny maze. You always **start at the same starting point.**
 - The goal is **to eat the big pile of cheese at the bottom right-hand corner** and avoid the poison. After all, who doesn't like cheese?
-- The episode ends if we eat the poison, **eat the big pile of cheese or if we spent more than five steps.**
+- The episode ends if we eat the poison, **eat the big pile of cheese**, or if we take more than five steps.
 - The learning rate is 0.1
-- The gamma (discount rate) is 0.99
+- The discount rate (gamma) is 0.99
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-1.jpg" alt="Maze-Example"/>
@@ -18,14 +18,14 @@ The reward function goes like this:
 
 - **+0:** Going to a state with no cheese in it.
 - **+1:** Going to a state with a small cheese in it.
 - **+10:** Going to the state with the big pile of cheese.
-- **-10:** Going to the state with the poison and thus die.
-- **+0** If we spend more than five steps.
+- **-10:** Going to the state with the poison and thus dying.
+- **+0:** If we take more than five steps.
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-2.jpg" alt="Maze-Example"/>
 
 To train our agent to have an optimal policy (so a policy that goes right, right, down), **we will use the Q-Learning algorithm**.
 
-## Step 1: We initialize the Q-table [[step1]]
+## Step 1: Initialize the Q-table [[step1]]
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Example-1.jpg" alt="Maze-Example"/>
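For a maze this small, the Step 1 initialization is tiny. A sketch (the 6-cell state count and the action ordering are assumptions for illustration, not the exact layout of the course's maze):

```python
import numpy as np

n_states = 6                                  # one state per maze cell (illustrative)
actions = ["left", "right", "up", "down"]
q_table = np.zeros((n_states, len(actions)))  # Step 1: initialize everything to 0

print(q_table.shape)  # (6, 4)
print(q_table.sum())  # 0.0 — no state-action pair has a learned value yet
```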
@@ -35,16 +35,16 @@ Let's do it for 2 training timesteps:
 
 Training timestep 1:
 
-## Step 2: Choose action using Epsilon Greedy Strategy [[step2]]
+## Step 2: Choose an action using the Epsilon Greedy Strategy [[step2]]
 
-Because epsilon is big = 1.0, I take a random action, in this case, I go right.
+Because epsilon is big (= 1.0), I take a random action. In this case, I go right.
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-3.jpg" alt="Maze-Example"/>
 
-## Step 3: Perform action At, gets Rt+1 and St+1 [[step3]]
+## Step 3: Perform action At, get Rt+1 and St+1 [[step3]]
 
-By going right, I've got a small cheese, so \\(R_{t+1} = 1\\), and I'm in a new state.
+By going right, I get a small cheese, so \\(R_{t+1} = 1\\) and I'm in a new state.
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-4.jpg" alt="Maze-Example"/>
@@ -59,18 +59,18 @@ We can now update \\(Q(S_t, A_t)\\) using our formula.
 
 Training timestep 2:
 
-## Step 2: Choose action using Epsilon Greedy Strategy [[step2-2]]
+## Step 2: Choose an action using the Epsilon Greedy Strategy [[step2-2]]
 
-**I take a random action again, since epsilon is big 0.99** (since we decay it a little bit because as the training progress, we want less and less exploration).
+**I take a random action again, since epsilon=0.99 is big.** (Notice we decay epsilon a little bit because, as the training progresses, we want less and less exploration.)
 
-I took action down. **Not a good action since it leads me to the poison.**
+I took the action 'down'. **This is not a good action since it leads me to the poison.**
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-6.jpg" alt="Maze-Example"/>
 
-## Step 3: Perform action At, gets Rt+1 and St+1 [[step3-3]]
+## Step 3: Perform action At, get Rt+1 and St+1 [[step3-3]]
 
-Because I go to the poison state, **I get \\(R_{t+1} = -10\\), and I die.**
+Because I ate poison, **I get \\(R_{t+1} = -10\\), and I die.**
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-7.jpg" alt="Maze-Example"/>
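The Q-value update at the end of this timestep can be checked by hand. A sketch of the Q-Learning update rule with the example's numbers (lr = 0.1, gamma = 0.99; dying ends the episode, so the bootstrapped value of the next state is taken as 0):

```python
# Q-Learning update:
#   Q(S_t, A_t) <- Q(S_t, A_t) + lr * (R_{t+1} + gamma * max_a Q(S_{t+1}, a) - Q(S_t, A_t))
lr, gamma = 0.1, 0.99
q_sa = 0.0         # current estimate for (current state, action 'down')
r_t1 = -10.0       # reward for stepping into the poison
max_q_next = 0.0   # terminal state: no future value to bootstrap from

q_sa = q_sa + lr * (r_t1 + gamma * max_q_next - q_sa)
print(q_sa)  # -1.0
```

The value for taking 'down' in that state drops below zero, which is exactly why the agent "became smarter" after only two exploration steps.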
@@ -78,6 +78,6 @@ Because I go to the poison state, **I get \\(R_{t+1} = -10\\), and I die.**
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-8.jpg" alt="Maze-Example"/>
 
-Because we're dead, we start a new episode. But what we see here is that **with two explorations steps, my agent became smarter.**
+Because we're dead, we start a new episode. But what we see here is that, **with two exploration steps, my agent became smarter.**
 
-As we continue exploring and exploiting the environment and updating Q-values using TD target, **Q-table will give us better and better approximations. And thus, at the end of the training, we'll get an estimate of the optimal Q-function.**
+As we continue exploring and exploiting the environment and updating Q-values using the TD target, the **Q-table will give us a better and better approximation. At the end of the training, we'll get an estimate of the optimal Q-function.**
@@ -1,22 +1,22 @@
 # Q-Learning Recap [[q-learning-recap]]
 
-The *Q-Learning* **is the RL algorithm that** :
+*Q-Learning* **is the RL algorithm that**:
 
-- Trains *Q-function*, an **action-value function** that contains, as internal memory, a *Q-table* **that contains all the state-action pair values.**
+- Trains a *Q-function*, an **action-value function** encoded, in internal memory, by a *Q-table* **containing all the state-action pair values.**
 
-- Given a state and action, our Q-function **will search into its Q-table the corresponding value.**
+- Given a state and action, our Q-function **will search its Q-table for the corresponding value.**
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q function" width="100%"/>
 
-- When the training is done,**we have an optimal Q-function, so an optimal Q-table.**
+- When the training is done, **we have an optimal Q-function, or, equivalently, an optimal Q-table.**
 
-- And if we **have an optimal Q-function**, we have an optimal policy,since we **know for each state, what is the best action to take.**
+- And if we **have an optimal Q-function**, we have an optimal policy, since we **know, for each state, the best action to take.**
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" width="100%"/>
 
-But, in the beginning, our **Q-table is useless since it gives arbitrary value for each state-action pair (most of the time we initialize the Q-table to 0 values)**. But, as we’ll explore the environment and update our Q-table it will give us better and better approximations
+But, in the beginning, our **Q-table is useless since it gives arbitrary values for each state-action pair (most of the time we initialize the Q-table to 0 values)**. But, as we explore the environment and update our Q-table, it will give us a better and better approximation.
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/q-learning.jpeg" alt="q-learning.jpeg" width="100%"/>
@@ -5,7 +5,7 @@ Q-Learning is an **off-policy value-based method that uses a TD approach to tra
 
 - *Off-policy*: we'll talk about that at the end of this unit.
 - *Value-based method*: finds the optimal policy indirectly by training a value or action-value function that will tell us **the value of each state or each state-action pair.**
-- *Uses a TD approach:* **updates its action-value function at each step instead of at the end of the episode.**
+- *TD approach:* **updates its action-value function at each step instead of at the end of the episode.**
 
 **Q-Learning is the algorithm we use to train our Q-function**, an **action-value function** that determines the value of being at a particular state and taking a specific action at that state.
@@ -21,13 +21,13 @@ Let's recap the difference between value and reward:
 
 - The *value of a state*, or a *state-action pair*, is the expected cumulative reward our agent gets if it starts at this state (or state-action pair) and then acts according to its policy.
 - The *reward* is the **feedback I get from the environment** after performing an action at a state.
 
-Internally, our Q-function has **a Q-table, a table where each cell corresponds to a state-action pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.**
+Internally, our Q-function is encoded by **a Q-table, a table where each cell corresponds to a state-action pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.**
 
 Let's go through an example of a maze.
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-1.jpg" alt="Maze example"/>
 
-The Q-table is initialized. That's why all values are = 0. This table **contains, for each state, the four state-action values.**
+The Q-table is initialized. That's why all values are = 0. This table **contains, for each state and action, the corresponding state-action values.**
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-2.jpg" alt="Maze example"/>
||||
@@ -35,7 +35,7 @@ Here we see that the **state-action value of the initial state and going up is

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-3.jpg" alt="Maze example"/>

Therefore, Q-function contains a Q-table **that has the value of each-state action pair.** And given a state and action, **our Q-function will search inside its Q-table to output the value.**
So: the Q-function uses a Q-table **that has the value of each state-action pair.** Given a state and action, **our Q-function will search inside its Q-table to output the value.**

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q-function"/>
@@ -44,21 +44,21 @@ Therefore, Q-function contains a Q-table **that has the value of each-state act

If we recap, *Q-Learning* **is the RL algorithm that:**

- Trains a *Q-function* (an **action-value function**), which internally is a **Q-table that contains all the state-action pair values.**
- Given a state and action, our Q-function **will search into its Q-table the corresponding value.**
- Given a state and action, our Q-function **will search its Q-table for the corresponding value.**
- When the training is done, **we have an optimal Q-function, which means we have an optimal Q-table.**
- And if we **have an optimal Q-function**, we **have an optimal policy** since we **know for each state what is the best action to take.**
- And if we **have an optimal Q-function**, we **have an optimal policy** since we **know the best action to take at each state.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy"/>

But, in the beginning, **our Q-table is useless since it gives arbitrary values for each state-action pair** (most of the time, we initialize the Q-table to 0). As the agent **explores the environment and we update the Q-table, it will give us better and better approximations** to the optimal policy.
In the beginning, **our Q-table is useless since it gives arbitrary values for each state-action pair** (most of the time, we initialize the Q-table to 0). As the agent **explores the environment and we update the Q-table, it will give us a better and better approximation** to the optimal policy.

<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-1.jpg" alt="Q-learning"/>
<figcaption>We see here that with the training, our Q-table is better since, thanks to it, we can know the value of each state-action pair.</figcaption>
</figure>

Now that we understand what Q-Learning, Q-function, and Q-table are, **let's dive deeper into the Q-Learning algorithm**.
Now that we understand what Q-Learning, Q-functions, and Q-tables are, **let's dive deeper into the Q-Learning algorithm**.
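The Q-table-as-lookup idea can be sketched in a few lines. This is a minimal sketch, assuming a hypothetical maze with 6 states and 4 actions (illustrative sizes, not the course's actual example):

```python
import numpy as np

# Hypothetical maze: 6 states, 4 actions (up, down, left, right) -- illustrative sizes
n_states, n_actions = 6, 4
q_table = np.zeros((n_states, n_actions))  # initialized to 0: arbitrary, hence "useless" at first

def q_function(state: int, action: int) -> float:
    """Given a state and an action, look up the corresponding value in the Q-table."""
    return float(q_table[state, action])
```

Before any training, every lookup returns the initial 0; it is the training updates that turn this table into a better and better approximation.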
|
||||
|
||||
## The Q-Learning algorithm [[q-learning-algo]]
@@ -73,14 +73,14 @@ This is the Q-Learning pseudocode; let's study each part and **see how it works

We need to initialize the Q-table for each state-action pair. **Most of the time, we initialize with values of 0.**

### Step 2: Choose action using epsilon-greedy strategy [[step2]]
### Step 2: Choose an action using the epsilon-greedy strategy [[step2]]

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg" alt="Q-learning"/>

Epsilon greedy strategy is a policy that handles the exploration/exploitation trade-off.
The epsilon-greedy strategy is a policy that handles the exploration/exploitation trade-off.

The idea is that we define the initial epsilon ɛ = 1.0:
The idea is that, with an initial value of ɛ = 1.0:

- *With probability 1 — ɛ* : we do **exploitation** (aka our agent selects the action with the highest state-action pair value).
- With probability ɛ: **we do exploration** (trying random action).
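The rule above can be sketched as follows (the Q-table shape and random seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility (illustrative)
q_table = np.zeros((6, 4))      # hypothetical 6 states x 4 actions

def epsilon_greedy(state: int, epsilon: float) -> int:
    """Pick an action: explore with probability epsilon, exploit otherwise."""
    if rng.random() < epsilon:
        # exploration: try a random action
        return int(rng.integers(q_table.shape[1]))
    # exploitation: take the action with the highest state-action value
    return int(np.argmax(q_table[state]))
```

With ɛ = 1.0 the agent always explores; decaying ɛ toward 0 over training shifts it toward exploitation.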
@@ -90,7 +90,7 @@ At the beginning of the training, **the probability of doing exploration will b

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-5.jpg" alt="Q-learning"/>

### Step 3: Perform action At, gets reward Rt+1 and next state St+1 [[step3]]
### Step 3: Perform action At, get reward Rt+1 and next state St+1 [[step3]]

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-6.jpg" alt="Q-learning"/>

@@ -98,7 +98,7 @@ At the beginning of the training, **the probability of doing exploration will b

Remember that in TD Learning, we update our policy or value function (depending on the RL method we choose) **after one step of the interaction.**

To produce our TD target, **we used the immediate reward \\(R_{t+1}\\) plus the discounted value of the next state best state-action pair** (we call that bootstrap).
To produce our TD target, **we used the immediate reward \\(R_{t+1}\\) plus the discounted value of the next state**, computed by finding the action that maximizes the current Q-function at the next state. (We call that bootstrap).

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-7.jpg" alt="Q-learning"/>
@@ -107,14 +107,14 @@ Therefore, our \\(Q(S_t, A_t)\\) **update formula goes like this:**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-8.jpg" alt="Q-learning"/>

It means that to update our \\(Q(S_t, A_t)\\):
This means that to update our \\(Q(S_t, A_t)\\):

- We need \\(S_t, A_t, R_{t+1}, S_{t+1}\\).
- To update our Q-value at a given state-action pair, we use the TD target.

How do we form the TD target?
1. We obtain the reward after taking the action \\(R_{t+1}\\).
2. To get the **best next-state-action pair value**, we use a greedy policy to select the next best action. Note that this is not an epsilon-greedy policy, this will always take the action with the highest state-action value.
2. To get the **best state-action pair value** for the next state, we use a greedy policy to select the next best action. Note that this is not an epsilon-greedy policy, this will always take the action with the highest state-action value.

Then when the update of this Q-value is done, we start in a new state and select our action **using an epsilon-greedy policy again.**
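As a sketch, the update step looks like this (the learning-rate and discount values are illustrative, not prescribed by the course):

```python
import numpy as np

q_table = np.zeros((6, 4))   # hypothetical 6 states x 4 actions
alpha, gamma = 0.1, 0.99     # learning rate and discount factor (illustrative values)

def q_update(s: int, a: int, reward: float, s_next: int) -> None:
    """One Q-Learning step: move Q(s, a) toward the TD target."""
    # TD target: immediate reward plus the discounted value of the best
    # action at the next state (greedy, not epsilon-greedy): the bootstrap.
    td_target = reward + gamma * np.max(q_table[s_next])
    td_error = td_target - q_table[s, a]
    q_table[s, a] += alpha * td_error
```

After the update, the agent picks its next action with the epsilon-greedy policy again, and the loop repeats.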
@@ -31,19 +31,19 @@ Since the policy is not trained/learned, **we need to specify its behavior.**

<figcaption>Given a state, our action-value function (that we train) outputs the value of each action at that state. Then, our pre-defined Greedy Policy selects the action that will yield the highest value given a state or a state-action pair.</figcaption>
</figure>

Consequently, whatever method you use to solve your problem, **you will have a policy**. In the case of value-based methods, you don't train the policy: your policy **is just a simple pre-specified function** (for instance, Greedy Policy) that uses the values given by the value-function to select its actions.
Consequently, whatever method you use to solve your problem, **you will have a policy**. In the case of value-based methods, you don't train the policy: your policy **is just a simple pre-specified function** (for instance, the Greedy Policy) that uses the values given by the value-function to select its actions.

So the difference is:

- In policy-based, **the optimal policy (denoted π\*) is found by training the policy directly.**
- In value-based, **finding an optimal value function (denoted Q\* or V\*, we'll study the difference after) leads to having an optimal policy.**
- In policy-based training, **the optimal policy (denoted π\*) is found by training the policy directly.**
- In value-based training, **finding an optimal value function (denoted Q\* or V\*, we'll study the difference below) leads to having an optimal policy.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link between value and policy"/>

In fact, most of the time, in value-based methods, you'll use **an Epsilon-Greedy Policy** that handles the exploration/exploitation trade-off; we'll talk about it when we talk about Q-Learning in the second part of this unit.
In fact, most of the time, in value-based methods, you'll use **an Epsilon-Greedy Policy** that handles the exploration/exploitation trade-off; we'll talk about this when we talk about Q-Learning in the second part of this unit.

So, we have two types of value-based functions:
As we mentioned above, we have two types of value-based functions:

## The state-value function [[state-value-function]]
@@ -60,7 +60,7 @@ For each state, the state-value function outputs the expected return if the agen

## The action-value function [[action-value-function]]

In the action-value function, for each state and action pair, the action-value function **outputs the expected return** if the agent starts in that state and takes action, and then follows the policy forever after.
In the action-value function, for each state and action pair, the action-value function **outputs the expected return** if the agent starts in that state, takes that action, and then follows the policy forever after.

The value of taking action \\(a\\) in state \\(s\\) under a policy \\(π\\) is:

@@ -70,8 +70,8 @@ The value of taking action \\(a\\) in state \\(s\\) under a policy \\(π\\) is:

We see that the difference is:

- In state-value function, we calculate **the value of a state \\(S_t\\)**
- In action-value function, we calculate **the value of the state-action pair ( \\(S_t, A_t\\) ) hence the value of taking that action at that state.**
- For the state-value function, we calculate **the value of a state \\(S_t\\)**
- For the action-value function, we calculate **the value of the state-action pair ( \\(S_t, A_t\\) ) hence the value of taking that action at that state.**
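One concrete way to see the two side by side (a sketch with made-up action values, not the course's example): under a policy that acts greedily with respect to the action-value function, the value of a state is simply the value of its best action.

```python
import numpy as np

# Hypothetical action values Q(s, a) for 2 states and 4 actions (made-up numbers)
q = np.array([[1.0, 3.0, 0.0, 2.0],
              [0.5, 0.5, 4.0, 1.0]])

# State values under the greedy policy: the best state-action value at each state
v_greedy = q.max(axis=1)  # [3., 4.]
```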
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/two-types.jpg" alt="Two types of value function"/>

@@ -79,8 +79,8 @@ We see that the difference is:

Note: We didn't fill all the state-action pairs for the example of Action-value function</figcaption>
</figure>

In either case, whatever value function we choose (state-value or action-value function), **the returned value is the expected return.**
In either case, whichever value function we choose (state-value or action-value function), **the returned value is the expected return.**

However, the problem is that it implies that **to calculate EACH value of a state or a state-action pair, we need to sum all the rewards an agent can get if it starts at that state.**
However, the problem is that **to calculate EACH value of a state or a state-action pair, we need to sum all the rewards an agent can get if it starts at that state.**

This can be a computationally expensive process, and that's **where the Bellman equation comes to help us.**
This can be a computationally expensive process, and that's **where the Bellman equation comes in to help us.**
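For reference, the recursion the Bellman equation provides for the state-value function is:

\\(V_{\pi}(s) = \mathbb{E}_{\pi}\left[R_{t+1} + \gamma V_{\pi}(S_{t+1}) \mid S_t = s\right]\\)

that is, the value of a state is the expected immediate reward plus the discounted value of the state that follows, instead of a sum over the whole remaining episode.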
@@ -5,7 +5,7 @@ In RL, we build an agent that can **make smart decisions**. For instance, an ag

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/rl-process.jpg" alt="RL process"/>

But, to make intelligent decisions, our agent will learn from the environment by **interacting with it through trial and error** and receiving rewards (positive or negative) **as unique feedback.**
To make intelligent decisions, our agent will learn from the environment by **interacting with it through trial and error** and receiving rewards (positive or negative) **as unique feedback.**

Its goal **is to maximize its expected cumulative reward** (because of the reward hypothesis).