mirror of
https://github.com/huggingface/deep-rl-class.git
synced 2026-04-03 02:28:50 +08:00
Updated Unit 2 + added notebook
10
notebooks/unit2/requirements-unit2.txt
Normal file
@@ -0,0 +1,10 @@
gym==0.24
pygame
numpy

huggingface_hub
pickle5
pyyaml==6.0
imageio
imageio_ffmpeg
pyglet==1.5.1
@@ -25,32 +25,16 @@
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/envs.gif\" alt=\"Environments\"/>"
]
},
{
"cell_type": "markdown",
"source": [
"TODO: ADD TEXT LIVE INFO"
],
"metadata": {
"id": "yaBKcncmYku4"
}
},
{
"cell_type": "markdown",
"source": [
"TODO: ADD IF YOU HAVE QUESTIONS\n"
],
"metadata": {
"id": "hz5KE5HjYlRh"
}
},
{
"cell_type": "markdown",
"source": [
"###🎮 Environments: \n",
"\n",
"- [FrozenLake-v1](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)\n",
"- [Taxi-v3](https://www.gymlibrary.dev/environments/toy_text/taxi/)\n",
"\n",
"###📚 RL-Library: \n",
"\n",
"- Python and Numpy"
],
"metadata": {
@@ -73,7 +57,9 @@
},
"source": [
"## Objectives of this notebook 🏆\n",
"\n",
"At the end of the notebook, you will:\n",
"\n",
"- Be able to use **Gym**, the environment library.\n",
"- Be able to code a Q-Learning agent from scratch.\n",
"- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.\n",
@@ -120,7 +106,7 @@
"## Prerequisites 🏗️\n",
"Before diving into the notebook, you need to:\n",
"\n",
"🔲 📚 **Study Q-Learning by reading Unit 2** 🤗 ADD LINK "
"🔲 📚 **Study [Q-Learning by reading Unit 2](https://huggingface.co/deep-rl-course/unit2/introduction)** 🤗 "
]
},
{
@@ -139,6 +125,7 @@
},
"source": [
"- *Q-Learning* **is the RL algorithm that**\n",
"\n",
" - Trains a *Q-Function*, an **action-value function** whose internal memory is a *Q-table* **containing all the state-action pair values.**\n",
" \n",
" - Given a state and action, our Q-Function **will look up the corresponding value in its Q-table.**\n",
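The lookup described in that cell can be sketched in a few lines of plain Python (an illustrative sketch only; the dict layout and names here are assumptions for clarity, not the notebook's exact code):

```python
# A minimal Q-table as a dict mapping (state, action) -> value.
q_table = {}

def q_value(state, action):
    # Given a state and action, the Q-function looks up the
    # corresponding value in its Q-table (0.0 for unseen pairs).
    return q_table.get((state, action), 0.0)

q_table[(0, 1)] = 0.5
print(q_value(0, 1))  # 0.5
print(q_value(3, 2))  # 0.0 for an unvisited state-action pair
```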
@@ -194,15 +181,6 @@
"id": "4gpxC1_kqUYe"
}
},
{
"cell_type": "markdown",
"source": [
"TODO CHANGE LINK OF THE REQUIREMENTS"
],
"metadata": {
"id": "32e3NPYgH5ET"
}
},
{
"cell_type": "code",
"execution_count": null,
@@ -211,7 +189,7 @@
},
"outputs": [],
"source": [
"!pip install -r https://huggingface.co/spaces/ThomasSimonini/temp-space-requirements/raw/main/requirements/requirements-unit2.txt"
"!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit2/requirements-unit2.txt"
]
},
{
@@ -230,6 +208,27 @@
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"To make sure the newly installed libraries are used, **it's sometimes necessary to restart the notebook runtime**. The next cell will force the **runtime to crash, so you'll need to connect again and run the code starting from here**. Thanks to this trick, **we will be able to run our virtual screen.**"
],
"metadata": {
"id": "K6XC13pTfFiD"
}
},
{
"cell_type": "code",
"source": [
"import os\n",
"os.kill(os.getpid(), 9)"
],
"metadata": {
"id": "3kuZbWAkfHdg"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
@@ -317,11 +316,13 @@
"We're going to train our Q-Learning agent **to navigate from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoiding holes (H)**.\n",
"\n",
"We can have two sizes of environment:\n",
"\n",
"- `map_name=\"4x4\"`: a 4x4 grid version\n",
"- `map_name=\"8x8\"`: an 8x8 grid version\n",
"\n",
"\n",
"The environment has two modes:\n",
"\n",
"- `is_slippery=False`: The agent always moves in the intended direction due to the non-slippery nature of the frozen lake (deterministic).\n",
"- `is_slippery=True`: The agent may not always move in the intended direction due to the slippery nature of the frozen lake (stochastic)."
]
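To build intuition for the non-slippery dynamics, here is a toy step function for the 4x4 map in plain Python (a sketch for intuition only, not Gym's implementation; the action numbering 0=left, 1=down, 2=right, 3=up follows the FrozenLake documentation):

```python
MAP_4X4 = ["SFFF",
           "FHFH",
           "FFFH",
           "HFFG"]

def step(state, action):
    # state is an index 0..15; actions: 0=left, 1=down, 2=right, 3=up.
    # With is_slippery=False the agent always moves as intended.
    row, col = divmod(state, 4)
    if action == 0:
        col = max(col - 1, 0)
    elif action == 1:
        row = min(row + 1, 3)
    elif action == 2:
        col = min(col + 1, 3)
    elif action == 3:
        row = max(row - 1, 0)
    new_state = row * 4 + col
    tile = MAP_4X4[row][col]
    reward = 1.0 if tile == "G" else 0.0
    done = tile in "GH"  # episode ends on the goal or in a hole
    return new_state, reward, done

print(step(14, 2))  # (15, 1.0, True): stepping right onto the goal
```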
@@ -931,6 +932,7 @@
},
"source": [
"## Evaluate our Q-Learning agent 📈\n",
"\n",
"- Normally, you should get a mean reward of 1.0\n",
"- It's relatively easy, since the state space is really small (16). You can try [to replace it with the slippery version](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)."
]
@@ -955,6 +957,7 @@
},
"source": [
"## Publish our trained model on the Hub 🔥\n",
"\n",
"Now that we have good results after training, we can publish our trained model on the Hub 🤗 with one line of code.\n",
"\n",
"Here's an example of a Model Card:\n",
@@ -1173,6 +1176,7 @@
},
"source": [
"### .\n",
"\n",
"By using `package_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the hub**.\n",
"\n",
"This way:\n",
@@ -1264,9 +1268,10 @@
},
"source": [
"Let's fill the `package_to_hub` function:\n",
"\n",
"- `repo_id`: the name of the Hugging Face Hub Repository that will be created/updated `\n",
"(repo_id = {username}/{repo_name})`\n",
"💡 **A good name is {username}/q-{env_id}**\n",
"💡 A good `repo_id` is `{username}/q-{env_id}`\n",
"- `model`: our model dictionary containing the hyperparameters and the Q-table.\n",
"- `env`: the environment.\n",
"- `commit_message`: the message of the commit"
@@ -1326,7 +1331,9 @@
"\n",
"---\n",
"\n",
"In Taxi-v3 🚕, there are four designated locations in the grid world indicated by R(ed), G(reen), Y(ellow), and B(lue). When the episode starts, the taxi starts off at a random square and the passenger is at a random location. The taxi drives to the passenger’s location, picks up the passenger, drives to the passenger’s destination (another one of the four specified locations), and then drops off the passenger. Once the passenger is dropped off, the episode ends.\n",
"In `Taxi-v3` 🚕, there are four designated locations in the grid world indicated by R(ed), G(reen), Y(ellow), and B(lue). \n",
"\n",
"When the episode starts, **the taxi starts off at a random square** and the passenger is at a random location. The taxi drives to the passenger’s location, **picks up the passenger**, drives to the passenger’s destination (another one of the four specified locations), and then **drops off the passenger**. Once the passenger is dropped off, the episode ends.\n",
"\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/taxi.png\" alt=\"Taxi\">\n"
@@ -1383,6 +1390,7 @@
},
"source": [
"The action space (the set of possible actions the agent can take) is discrete with **6 actions available 🎮**:\n",
"\n",
"- 0: move south\n",
"- 1: move north\n",
"- 2: move east\n",
@@ -1391,6 +1399,7 @@
"- 5: drop off passenger\n",
"\n",
"Reward function 💰:\n",
"\n",
"- -1 per step unless another reward is triggered.\n",
"- +20 for delivering the passenger.\n",
"- -10 for executing “pickup” and “drop-off” actions illegally."
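With 6 actions and Taxi-v3's 500 discrete states (per the Taxi documentation), the zero-initialized Q-table from Step 1 looks like this (a plain-Python sketch; the notebook may use numpy instead):

```python
n_states, n_actions = 500, 6  # Taxi-v3 sizes

# Step 1: initialize the Q-table with zeros for every state-action pair.
q_table = [[0.0] * n_actions for _ in range(n_states)]

print(len(q_table), len(q_table[0]))  # 500 6
```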
@@ -1556,7 +1565,8 @@
"\n",
"What's amazing with the Hugging Face Hub 🤗 is that you can easily load powerful models from the community.\n",
"\n",
"Loading a saved model from the Hub is really easy.\n",
"Loading a saved model from the Hub is really easy:\n",
"\n",
"1. You go to https://huggingface.co/models?other=q-learning to see the list of all the q-learning saved models.\n",
"2. You select one and copy its repo_id\n",
"\n",
@@ -1671,9 +1681,10 @@
"## Some additional challenges 🏆\n",
"The best way to learn **is to try things on your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. With 1,000,000 steps, we saw some great results! \n",
"\n",
"In the [Leaderboard](https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?\n",
"In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?\n",
"\n",
"Here are some ideas to get there:\n",
"\n",
"* Train for more steps\n",
"* Try different hyperparameters by looking at what your classmates have done.\n",
"* **Push your newly trained model** on the Hub 🔥\n",
@@ -1711,8 +1722,8 @@
"id": "BjLhT70TEZIn"
},
"source": [
"See you on [Unit 3](https://github.com/huggingface/deep-rl-class/tree/main/unit2#unit-2-introduction-to-q-learning)! 🔥\n",
"TODO CHANGE LINK\n",
"See you on Unit 3! 🔥\n",
"\n",
"## Keep learning, stay awesome 🤗"
]
}
1089
notebooks/unit2/unit2.mdx
Normal file
File diff suppressed because it is too large
@@ -57,13 +57,15 @@
- local: unit2/mc-vs-td
  title: Monte Carlo vs Temporal Difference Learning
- local: unit2/summary1
  title: Summary
  title: Mid-way Recap
- local: unit2/quiz1
  title: First Quiz
  title: Mid-way Quiz
- local: unit2/q-learning
  title: Introducing Q-Learning
- local: unit2/q-learning-example
  title: A Q-Learning example
- local: unit2/summary2
  title: Q-Learning Recap
- local: unit2/hands-on
  title: Hands-on
- local: unit2/quiz2
@@ -31,7 +31,6 @@ The Bellman equation is a recursive equation that works like this: instead of st

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman4.jpg" alt="Bellman equation"/>
<figcaption>For simplification, here we don’t discount so gamma = 1.</figcaption>
</figure>

@@ -44,14 +43,20 @@ To calculate the value of State 1: the sum of rewards **if the agent started in

This is equivalent to \\(V(S_{t})\\) = Immediate reward \\(R_{t+1}\\) + Discounted value of the next state \\( \gamma * V(S_{t+1})\\)

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman6.jpg" alt="Bellman equation"/>

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman6.jpg" alt="Bellman equation"/>
<figcaption>For simplification, here we don’t discount so gamma = 1.</figcaption>
</figure>

In the interest of simplicity, here we don't discount, so gamma = 1.

- The value of \\(V(S_{t+1}) \\) = Immediate reward \\(R_{t+2}\\) + Discounted value of the next state ( \\( \gamma * V(S_{t+2})\\) ).
- And so on.


To recap: instead of calculating each value as the sum of the expected return, **which is a long process**, the Bellman equation lets us calculate it **as the sum of the immediate reward plus the discounted value of the state that follows.**

Before going to the next section, think about the role of gamma in the Bellman equation. What happens if the value of gamma is very low (e.g. 0.1 or even 0)? What happens if the value is 1? What happens if the value is very high, such as a million?

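The recursion above is easy to check numerically (a small sketch with gamma = 1, matching the no-discount simplification used in the figures):

```python
def bellman_value(reward_next, v_next, gamma=1.0):
    # V(S_t) = R_{t+1} + gamma * V(S_{t+1})
    return reward_next + gamma * v_next

# A two-step chain: reward 1 now, and the next state is worth 2.
v_s1 = bellman_value(1.0, 2.0)
print(v_s1)  # 3.0
```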
File diff suppressed because it is too large
@@ -43,7 +43,7 @@ By running more and more episodes, **the agent will learn to play better and be

For instance, if we train a state-value function using Monte Carlo:

- We just started to train our Value function, **so it returns 0 value for each state**
- We just started to train our value function, **so it returns a value of 0 for each state**
- Our learning rate (lr) is 0.1 and our discount rate is 1 (= no discount)
- Our mouse **explores the environment and takes random actions**

@@ -75,7 +75,7 @@ For instance, if we train a state-value function using Monte Carlo:

## Temporal Difference Learning: learning at each step [[td-learning]]

- **Temporal Difference, on the other hand, waits for only one interaction (one step) \\(S_{t+1}\\)**
- to form a TD target and update \\(V(S_t)\\) using \\(R_{t+1}\\) and \\(gamma * V(S_{t+1})\\).
- to form a TD target and update \\(V(S_t)\\) using \\(R_{t+1}\\) and \\( \gamma * V(S_{t+1})\\).

The idea with **TD is to update the \\(V(S_t)\\) at each step.**

@@ -94,7 +94,7 @@ If we take the same example,

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-2.jpg" alt="Temporal Difference"/>

- We just started to train our Value function, so it returns 0 value for each state.
- We just started to train our value function, so it returns a value of 0 for each state.
- Our learning rate (lr) is 0.1, and our discount rate is 1 (no discount).
- Our mouse explores the environment and takes a random action: **going to the left**
- It gets a reward \\(R_{t+1} = 1\\) since **it eats a piece of cheese**
@@ -106,7 +106,7 @@ If we take the same example,

We can now update \\(V(S_0)\\):

New \\(V(S_0) = V(S_0) + lr * [R_1 + gamma * V(S_1) - V(S_0)]\\)
New \\(V(S_0) = V(S_0) + lr * [R_1 + \gamma * V(S_1) - V(S_0)]\\)

New \\(V(S_0) = 0 + 0.1 * [1 + 1 * 0 - 0]\\)

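Written as code, the TD(0) update above (values from the example: lr = 0.1, gamma = 1, R_1 = 1, and both state values initialized to 0):

```python
def td_update(v_s, reward, v_next, lr=0.1, gamma=1.0):
    # TD(0): V(S_t) <- V(S_t) + lr * [R_{t+1} + gamma * V(S_{t+1}) - V(S_t)]
    return v_s + lr * (reward + gamma * v_next - v_s)

# New V(S_0) = 0 + 0.1 * [1 + 1 * 0 - 0]
print(td_update(0.0, 1.0, 0.0))  # 0.1
```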
@@ -14,7 +14,11 @@ Q-Learning is an **off-policy value-based method that uses a TD approach to tra
<figcaption>Given a state and action, our Q Function outputs a state-action value (also called Q-value)</figcaption>
</figure>

The **Q comes from "the Quality" of that action at that state.**
The **Q comes from "the Quality" (the value) of that action at that state.**

Let's recap the difference between value and reward:
- The *value of a state*, or a *state-action pair*, is the expected cumulative reward our agent gets if it starts at this state (or state-action pair) and then acts according to its policy.
- The *reward* is the **feedback I get from the environment** after performing an action at a state.

Internally, our Q-function has **a Q-table, a table where each cell corresponds to a state-action pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.**

@@ -34,7 +38,6 @@ Therefore, Q-function contains a Q-table **that has the value of each-state act

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q-function"/>
<figcaption>Given a state and action pair, our Q-function will search inside its Q-table to output the state-action pair value (the Q value).</figcaption>
</figure>

If we recap, *Q-Learning* **is the RL algorithm that:**
@@ -69,12 +72,12 @@ This is the Q-Learning pseudocode; let's study each part and **see how it works

We need to initialize the Q-Table for each state-action pair. **Most of the time, we initialize with values of 0.**

### Step 2: Choose action using Epsilon Greedy Strategy [[step2]]
### Step 2: Choose action using epsilon greedy strategy [[step2]]

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg" alt="Q-learning"/>


Epsilon Greedy Strategy is a policy that handles the exploration/exploitation trade-off.
The epsilon greedy strategy is a policy that handles the exploration/exploitation trade-off.

The idea is that we define epsilon ɛ = 1.0:

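The strategy above can be sketched in a few lines (an illustrative sketch; the dict-based Q-table and the function name are assumptions, not the course's exact code):

```python
import random

def epsilon_greedy(q_table, state, n_actions, epsilon):
    # With probability epsilon: explore (pick a random action).
    # Otherwise: exploit (pick the greedy action from the Q-table).
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions),
               key=lambda a: q_table.get((state, a), 0.0))

q = {(0, 2): 1.0}
print(epsilon_greedy(q, 0, 4, epsilon=0.0))  # 2: the greedy action
```

At the start of training, epsilon is close to 1.0 (mostly exploring); it is then decayed so the agent exploits its Q-table more and more.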
@@ -122,7 +125,7 @@ The difference is subtle:

- *Off-policy*: using **a different policy for acting (inference) and updating (training).**

For instance, with Q-Learning, the Epsilon greedy policy (acting policy), is different from the greedy policy that is **used to select the best next-state action value to update our Q-value (updating policy).**
For instance, with Q-Learning, the epsilon greedy policy (acting policy) is different from the greedy policy that is **used to select the best next-state action value to update our Q-value (updating policy).**


<figure>
@@ -140,7 +143,7 @@ Is different from the policy we use during the training part:

- *On-policy:* using the **same policy for acting and updating.**

For instance, with Sarsa, another value-based algorithm, **the Epsilon-Greedy Policy selects the next state-action pair, not a greedy policy.**
For instance, with Sarsa, another value-based algorithm, **the epsilon greedy policy selects the next state-action pair, not a greedy policy.**


<figure>
@@ -1,4 +1,4 @@
# First Quiz [[quiz1]]
# Mid-way Quiz [[quiz1]]

The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.

@@ -19,7 +19,7 @@ The best way to learn and [to avoid the illusion of competence](https://www.cour
},
{
text: "Value-based methods",
explain: "With Value-based methods, we train a value function to learn which state is more valuable and use this value function to take the action that leads to it.",
explain: "With value-based methods, we train a value function to learn which state is more valuable and use this value function to take the action that leads to it.",
correct: true
},
{
@@ -37,7 +37,7 @@ The best way to learn and [to avoid the illusion of competence](https://www.cour

**The Bellman equation is a recursive equation** that works like this: instead of starting for each state from the beginning and calculating the return, we can consider the value of any state as:

Rt+1 + (gamma * V(St+1))
\\(R_{t+1} + \gamma * V(S_{t+1})\\)
The immediate reward + the discounted value of the state that follows

</details>

@@ -1,17 +1,17 @@
# Summary [[summary1]]
# Mid-way Recap [[summary1]]

Before diving into Q-Learning, let's summarize what we just learned.

We have two types of value-based functions:

- State-Value function: outputs the expected return if **the agent starts at a given state and acts accordingly to the policy forever after.**
- Action-Value function: outputs the expected return if **the agent starts in a given state, takes a given action at that state** and then acts accordingly to the policy forever after.
- State-value function: outputs the expected return if **the agent starts at a given state and acts according to the policy forever after.**
- Action-value function: outputs the expected return if **the agent starts in a given state, takes a given action at that state** and then acts according to the policy forever after.
- In value-based methods, rather than learning the policy, **we define the policy by hand** and we learn a value function. If we have an optimal value function, we **will have an optimal policy.**

There are two types of methods to learn a policy for a value function:

- With *the Monte Carlo method*, we update the value function from a complete episode, and so we **use the actual accurate discounted return of this episode.**
- With *the TD Learning method,* we update the value function from a step, so we replace Gt that we don't have with **an estimated return called TD target.**
- With *the TD Learning method,* we update the value function from a step, so we replace \\(G_t\\), which we don't have, with **an estimated return called the TD target.**


<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/summary-learning-mtds.jpg" alt="Summary"/>

25
units/en/unit2/summary2.mdx
Normal file
@@ -0,0 +1,25 @@
# Q-Learning Recap [[summary2]]


*Q-Learning* **is the RL algorithm that**:

- Trains a *Q-Function*, an **action-value function** whose internal memory is a *Q-table* **containing all the state-action pair values.**

- Given a state and action, our Q-Function **will look up the corresponding value in its Q-table.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q function" width="100%"/>

- When the training is done, **we have an optimal Q-Function, and therefore an optimal Q-table.**

- And if we **have an optimal Q-function**, we have an optimal policy, since we **know the best action to take in each state.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" width="100%"/>

But, in the beginning, our **Q-table is useless, since it gives arbitrary values for each state-action pair (most of the time, we initialize the Q-table to 0)**. As we explore the environment and update the Q-table, it will give us better and better approximations.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/q-learning.jpeg" alt="q-learning.jpeg" width="100%"/>

This is the Q-Learning pseudocode:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="100%"/>
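The update step of that pseudocode can be sketched in plain Python (an illustrative sketch; the dict-based Q-table and the hyperparameter values are assumptions, not the course's exact code):

```python
def q_learning_update(q, state, action, reward, next_state, n_actions,
                      lr=0.1, gamma=0.99):
    # Q(s,a) <- Q(s,a) + lr * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    # The max over next actions is what makes Q-Learning off-policy:
    # it updates greedily, regardless of how the action was chosen.
    best_next = max(q.get((next_state, a), 0.0) for a in range(n_actions))
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + lr * (reward + gamma * best_next - old)

q = {}
q_learning_update(q, 0, 1, 1.0, 1, 4, lr=0.5, gamma=1.0)
print(q[(0, 1)])  # 0.5
```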
@@ -18,7 +18,7 @@ To find the optimal policy, we learned about two different methods:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/two-approaches-2.jpg" alt="Two RL approaches"/>

The policy takes a state as input and outputs what action to take at that state (deterministic policy).
The policy takes a state as input and outputs what action to take at that state (a deterministic policy: a policy that outputs one action given a state, contrary to a stochastic policy that outputs a probability distribution over actions).

And consequently, **we don't define by hand the behavior of our policy; it's the training that will define it.**

@@ -35,8 +35,8 @@ Consequently, whatever method you use to solve your problem, **you will have a

So the difference is:

- In policy-based, **the optimal policy is found by training the policy directly.**
- In value-based, **finding an optimal value function leads to having an optimal policy.**
- In policy-based methods, **the optimal policy (denoted π*) is found by training the policy directly.**
- In value-based methods, **finding an optimal value function (denoted Q* or V*; we'll study the difference later) leads to having an optimal policy.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link between value and policy"/>

@@ -45,7 +45,7 @@ In fact, most of the time, in value-based methods, you'll use **an Epsilon-Gree

So, we have two types of value-based functions:

## The State-Value function [[state-value-function]]
## The state-value function [[state-value-function]]

We write the state-value function under a policy π like this:

@@ -58,11 +58,11 @@ For each state, the state-value function outputs the expected return if the agen
<figcaption>If we take the state with value -7: it's the expected return starting at that state and taking actions according to our policy (greedy policy), so right, right, right, down, down, right, right.</figcaption>
</figure>

## The Action-Value function [[action-value-function]]
## The action-value function [[action-value-function]]

In the Action-value function, for each state and action pair, the action-value function **outputs the expected return** if the agent starts in that state and takes action, and then follows the policy forever after.
In the action-value function, for each state-action pair, the action-value function **outputs the expected return** if the agent starts in that state, takes that action, and then follows the policy forever after.

The value of taking action an in state s under a policy π is:
The value of taking action \\(a\\) in state \\(s\\) under a policy \\(π\\) is:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/action-state-value-function-1.jpg" alt="Action State value function"/>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/action-state-value-function-2.jpg" alt="Action State value function"/>
@@ -83,4 +83,4 @@ In either case, whatever value function we choose (state-value or action-value f

However, the problem is that it implies that **to calculate EACH value of a state or a state-action pair, we need to sum all the rewards an agent can get if it starts at that state.**

This can be a tedious process, and that's **where the Bellman equation comes to help us.**
This can be a computationally expensive process, and that's **where the Bellman equation comes to help us.**
