mirror of
https://github.com/huggingface/deep-rl-class.git
synced 2026-04-05 11:38:43 +08:00
Cases consistency
This commit is contained in:
@@ -25,11 +25,11 @@ The reward function goes like this:
|
||||
|
||||
To train our agent to have an optimal policy (so a policy that goes right, right, down), **we will use the Q-Learning algorithm**.
|
||||
|
||||
## Step 1: We initialize the Q-Table [[step1]]
|
||||
## Step 1: We initialize the Q-table [[step1]]
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Example-1.jpg" alt="Maze-Example"/>
|
||||
|
||||
So, for now, **our Q-Table is useless**; we need **to train our Q-function using the Q-Learning algorithm.**
|
||||
So, for now, **our Q-table is useless**; we need **to train our Q-function using the Q-Learning algorithm.**
|
||||
|
||||
Let's do it for 2 training timesteps:
|
||||
|
||||
@@ -80,4 +80,4 @@ Because I go to the poison state, **I get \\(R_{t+1} = -10\\), and I die.**
|
||||
|
||||
Because we're dead, we start a new episode. But what we see here is that **with two explorations steps, my agent became smarter.**
|
||||
|
||||
As we continue exploring and exploiting the environment and updating Q-values using TD target, **Q-Table will give us better and better approximations. And thus, at the end of the training, we'll get an estimate of the optimal Q-Function.**
|
||||
As we continue exploring and exploiting the environment and updating Q-values using TD target, **Q-table will give us better and better approximations. And thus, at the end of the training, we'll get an estimate of the optimal Q-function.**
|
||||
|
||||
@@ -3,20 +3,20 @@
|
||||
|
||||
The *Q-Learning* **is the RL algorithm that** :
|
||||
|
||||
- Trains *Q-Function*, an **action-value function** that contains, as internal memory, a *Q-table* **that contains all the state-action pair values.**
|
||||
- Trains *Q-function*, an **action-value function** that contains, as internal memory, a *Q-table* **that contains all the state-action pair values.**
|
||||
|
||||
- Given a state and action, our Q-Function **will search into its Q-table the corresponding value.**
|
||||
- Given a state and action, our Q-function **will search into its Q-table the corresponding value.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q function" width="100%"/>
|
||||
|
||||
- When the training is done,**we have an optimal Q-Function, so an optimal Q-Table.**
|
||||
- When the training is done,**we have an optimal Q-function, so an optimal Q-table.**
|
||||
|
||||
- And if we **have an optimal Q-function**, we
|
||||
have an optimal policy,since we **know for each state, what is the best action to take.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" width="100%"/>
|
||||
|
||||
But, in the beginning, our **Q-Table is useless since it gives arbitrary value for each state-action pair (most of the time we initialize the Q-Table to 0 values)**. But, as we’ll explore the environment and update our Q-Table it will give us better and better approximations
|
||||
But, in the beginning, our **Q-table is useless since it gives arbitrary value for each state-action pair (most of the time we initialize the Q-table to 0 values)**. But, as we’ll explore the environment and update our Q-table it will give us better and better approximations
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/q-learning.jpeg" alt="q-learning.jpeg" width="100%"/>
|
||||
|
||||
|
||||
@@ -9,7 +9,7 @@ The best way to learn and [to avoid the illusion of competence](https://www.cour
|
||||
<Question
|
||||
choices={[
|
||||
{
|
||||
text: "The algorithm we use to train our Q-Function",
|
||||
text: "The algorithm we use to train our Q-function",
|
||||
explain: "",
|
||||
correct: true
|
||||
},
|
||||
@@ -24,12 +24,12 @@ The best way to learn and [to avoid the illusion of competence](https://www.cour
|
||||
},
|
||||
{
|
||||
text: "A table",
|
||||
explain: "Q-Function is not a Q-Table. The Q-Function is the algorithm that will feed the Q-Table."
|
||||
explain: "Q-function is not a Q-table. The Q-function is the algorithm that will feed the Q-table."
|
||||
}
|
||||
]}
|
||||
/>
|
||||
|
||||
### Q2: What is a Q-Table?
|
||||
### Q2: What is a Q-table?
|
||||
|
||||
<Question
|
||||
choices={[
|
||||
@@ -43,7 +43,7 @@ The best way to learn and [to avoid the illusion of competence](https://www.cour
|
||||
correct: true
|
||||
},
|
||||
{
|
||||
text: "In Q-Table each cell corresponds a state value",
|
||||
text: "In Q-table each cell corresponds a state value",
|
||||
explain: "Each cell corresponds to a state-action value pair value. Not a state value.",
|
||||
}
|
||||
]}
|
||||
|
||||
Reference in New Issue
Block a user