<figcaption>To calculate the value of State 2: the sum of rewards **if the agent started in that state**, and then followed the **policy for all the time steps.**</figcaption>
<figcaption>To calculate the value of State 2: the sum of rewards <b>if the agent started in that state</b>, and then followed the <b>policy for all the time steps.</b></figcaption>
</figure>
So you may have noticed, we're repeating the computation of the value of different states, which can be tedious if you need to do it for each state value or state-action value.
So, for now, **our Q-Table is useless**; we need **to train our Q-function using the Q-Learning algorithm.**
So, for now, **our Q-table is useless**; we need **to train our Q-function using the Q-Learning algorithm.**
Let's do it for 2 training timesteps:
@@ -80,4 +80,4 @@ Because I go to the poison state,**I get \\(R_{t+1} = -10\\), and I die.**
Because we're dead, we start a new episode. But what we see here is that **with two explorations steps, my agent became smarter.**
As we continue exploring and exploiting the environment and updating Q-values using TD target, **Q-Table will give us better and better approximations. And thus, at the end of the training, we'll get an estimate of the optimal Q-Function.**
As we continue exploring and exploiting the environment and updating Q-values using TD target, **Q-table will give us better and better approximations. And thus, at the end of the training, we'll get an estimate of the optimal Q-function.**
- When the training is done, **we have an optimal Q-Function, so an optimal Q-Table.**
- When the training is done, **we have an optimal Q-function, so an optimal Q-table.**
- And if we **have an optimal Q-function**, we
have an optimal policy, since we **know for each state, what is the best action to take.**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" width="100%"/>
But, in the beginning, our **Q-Table is useless since it gives arbitrary value for each state-action pair (most of the time we initialize the Q-Table to 0 values)**. But, as we’ll explore the environment and update our Q-Table it will give us better and better approximations
But, in the beginning, our **Q-table is useless since it gives arbitrary value for each state-action pair (most of the time we initialize the Q-table to 0 values)**. But, as we’ll explore the environment and update our Q-table it will give us better and better approximations
@@ -7,7 +7,7 @@ Q-Learning is an**off-policy value-based method that uses a TD approach to tra
- *Value-based method*: finds the optimal policy indirectly by training a value or action-value function that will tell us **the value of each state or each state-action pair.**
- *Uses a TD approach:* **updates its action-value function at each step instead of at the end of the episode.**
**Q-Learning is the algorithm we use to train our Q-Function**, an **action-value function** that determines the value of being at a particular state and taking a specific action at that state.
**Q-Learning is the algorithm we use to train our Q-function**, an **action-value function** that determines the value of being at a particular state and taking a specific action at that state.
@@ -18,16 +18,16 @@ The**Q comes from "the Quality" (the value) of that action at that state.**
Let's recap the difference between value and reward:
- The *value of a state*, or a *state-action pair* is the expected cumulative reward our agent gets if it starts at this state (or stateaction pair) and then acts accordingly to its policy.
- The *value of a state*, or a *state-action pair* is the expected cumulative reward our agent gets if it starts at this state (or state-action pair) and then acts accordingly to its policy.
- The *reward* is the **feedback I get from the environment** after performing an action at a state.
Internally, our Q-function has **a Q-table, a table where each cell corresponds to a state-action value pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.**
Internally, our Q-function has **a Q-table, a table where each cell corresponds to a state-action pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.**
Therefore, Q-function contains a Q-table **that has the value of each-state action pair.** And given a state and action, **our Q-Function will search inside its Q-table to output the value.**
Therefore, Q-function contains a Q-table **that has the value of each-state action pair.** And given a state and action, **our Q-function will search inside its Q-table to output the value.**
@@ -43,22 +43,22 @@ Therefore, Q-function contains a Q-table**that has the value of each-state act
If we recap, *Q-Learning* **is the RL algorithm that:**
- Trains a *Q-Function* (an **action-value function**), which internally is a **Q-table that contains all the state-action pair values.**
- Given a state and action, our Q-Function **will search into its Q-table the corresponding value.**
- When the training is done, **we have an optimal Q-function, which means we have optimal Q-Table.**
- Trains a *Q-function* (an **action-value function**), which internally is a **Q-table that contains all the state-action pair values.**
- Given a state and action, our Q-function **will search into its Q-table the corresponding value.**
- When the training is done, **we have an optimal Q-function, which means we have optimal Q-table.**
- And if we **have an optimal Q-function**, we **have an optimal policy** since we **know for each state what is the best action to take.**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy"/>
But, in the beginning, **our Q-Table is useless since it gives arbitrary values for each state-action pair** (most of the time, we initialize the Q-Table to 0). As the agent **explores the environment and we update the Q-Table, it will give us better and better approximations** to the optimal policy.
But, in the beginning, **our Q-table is useless since it gives arbitrary values for each state-action pair** (most of the time, we initialize the Q-table to 0). As the agent **explores the environment and we update the Q-table, it will give us better and better approximations** to the optimal policy.
<figcaption>We see here that with the training, our Q-Table is better since, thanks to it, we can know the value of each state-action pair.</figcaption>
<figcaption>We see here that with the training, our Q-table is better since, thanks to it, we can know the value of each state-action pair.</figcaption>
</figure>
Now that we understand what Q-Learning, Q-Function, and Q-Table are, **let's dive deeper into the Q-Learning algorithm**.
Now that we understand what Q-Learning, Q-function, and Q-table are, **let's dive deeper into the Q-Learning algorithm**.
## The Q-Learning algorithm [[q-learning-algo]]
@@ -66,26 +66,26 @@ This is the Q-Learning pseudocode; let's study each part and**see how it works
Epsilon greedy strategy is a policy that handles the exploration/exploitation trade-off.
The idea is that we define epsilon ɛ = 1.0:
The idea is that we define the initial epsilon ɛ = 1.0:
- *With probability 1 - ɛ*: we do **exploitation** (aka our agent selects the action with the highest state-action pair value).
- With probability ɛ: **we do exploration** (trying random action).
At the beginning of the training, **the probability of doing exploration will be huge since ɛ is very high, so most of the time, we'll explore.** But as the training goes on, and consequently our **Q-Table gets better and better in its estimations, we progressively reduce the epsilon value** since we will need less and less exploration and more exploitation.
At the beginning of the training, **the probability of doing exploration will be huge since ɛ is very high, so most of the time, we'll explore.** But as the training goes on, and consequently our **Q-table gets better and better in its estimations, we progressively reduce the epsilon value** since we will need less and less exploration and more exploitation.
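To make the epsilon-greedy idea concrete, here is a minimal sketch in plain Python. The function names, the exponential decay schedule, and the values `min_epsilon=0.05` and `decay_rate=0.0005` are assumptions chosen for illustration; the text above only says that ɛ starts at 1.0 and is progressively reduced.

```python
import math
import random


def epsilon_greedy_action(q_row, epsilon, rng=random):
    """Pick an action index from one row of a Q-table.

    q_row   : list of Q-values for the current state, one per action
    epsilon : probability of exploring, between 0 and 1
    """
    if rng.random() < epsilon:
        # Exploration: try a random action.
        return rng.randrange(len(q_row))
    # Exploitation: take the action with the highest state-action value.
    return max(range(len(q_row)), key=lambda a: q_row[a])


def decayed_epsilon(episode, max_epsilon=1.0, min_epsilon=0.05, decay_rate=0.0005):
    """Shrink epsilon from max_epsilon toward min_epsilon as episodes go by.

    The exponential schedule and its constants are illustrative assumptions.
    """
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)
```

With `epsilon=1.0` every action is random, and with `epsilon=0.0` the agent always exploits, so calling `decayed_epsilon(episode)` before each episode moves the agent gradually from the first behavior toward the second.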
@@ -114,7 +114,7 @@ It means that to update our \\(Q(S_t, A_t)\\):
How do we form the TD target?
1. We obtain the reward after taking the action \\(R_{t+1}\\).
2. To get the **best next-state-action pair value**, we use a greedy policy to select the next best action. Note that this is not an epsilongreedy policy, this will always take the action with the highest state-action value.
2. To get the **best next-state-action pair value**, we use a greedy policy to select the next best action. Note that this is not an epsilon-greedy policy, this will always take the action with the highest state-action value.
Then when the update of this Q-value is done, we start in a new state and select our action **using a epsilon-greedy policy again.**
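The two steps for forming the TD target can be sketched as a single update function. This is a minimal illustration, assuming the Q-table is a list of rows (one row per state, one entry per action); the names `q_learning_update`, `learning_rate`, `gamma`, and `done`, as well as the default values, are assumptions made for this example.

```python
def q_learning_update(q_table, state, action, reward, next_state, done,
                      learning_rate=0.1, gamma=0.99):
    """Apply one Q-Learning update to q_table[state][action] in place."""
    # Step 2: greedy choice over the next state's actions (no epsilon here).
    # If the episode ended, there is no next state to bootstrap from.
    best_next_value = 0.0 if done else max(q_table[next_state])

    # TD target = immediate reward + discounted best next state-action value.
    td_target = reward + gamma * best_next_value

    # Move the current estimate a fraction of the way toward the TD target.
    td_error = td_target - q_table[state][action]
    q_table[state][action] += learning_rate * td_error
    return q_table[state][action]
```

For example, starting from a Q-table of zeros, a step that ends the episode with a reward of -10 moves that Q-value from 0 to -1 when `learning_rate=0.1`.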
@@ -126,7 +126,7 @@ The difference is subtle:
- *Off-policy*: using **a different policy for acting (inference) and updating (training).**
For instance, with Q-Learning, the epsilongreedy policy (acting policy), is different from the greedy policy that is **used to select the best next-state action value to update our Q-value (updating policy).**
For instance, with Q-Learning, the epsilon-greedy policy (acting policy), is different from the greedy policy that is **used to select the best next-state action value to update our Q-value (updating policy).**
<figure>
@@ -144,7 +144,7 @@ Is different from the policy we use during the training part:
- *On-policy:* using the **same policy for acting and updating.**
For instance, with Sarsa, another value-based algorithm, **the epsilongreedy Policy selects the next state-action pair, not a greedy policy.**
For instance, with Sarsa, another value-based algorithm, **the epsilon-greedy policy selects the next state-action pair, not a greedy policy.**
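The off-policy versus on-policy difference shows up only in how the target is computed. The sketch below assumes a Q-table stored as a list of rows; the function names and the `gamma` default are hypothetical and used only for this comparison.

```python
def q_learning_target(q_table, reward, next_state, gamma=0.99):
    """Off-policy target: bootstrap from the greedy (best) next action."""
    return reward + gamma * max(q_table[next_state])


def sarsa_target(q_table, reward, next_state, next_action, gamma=0.99):
    """On-policy target: bootstrap from the next action the acting
    (epsilon-greedy) policy actually selected."""
    return reward + gamma * q_table[next_state][next_action]
```

When the epsilon-greedy policy happens to pick the best next action, both targets are equal; they differ only on exploratory steps.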
@@ -10,7 +10,7 @@ The value of a state is the**expected discounted return**the agent can get i
But what does it mean to act according to our policy? After all, we don't have a policy in value-based methods since we train a value function and not a policy.
</Tip>
Remember that the goal of an **RL agent is to have an optimal policy π.**
Remember that the goal of an **RL agent is to have an optimal policy π\*.**
To find the optimal policy, we learned about two different methods:
@@ -35,8 +35,8 @@ Consequently, whatever method you use to solve your problem,**you will have a
So the difference is:
- In policy-based, **the optimal policy (denoted π*) is found by training the policy directly.**
- In value-based, **finding an optimal value function (denoted Q* or V*, we'll study the difference after) in our leads to having an optimal policy.**
- In policy-based, **the optimal policy (denoted π\*) is found by training the policy directly.**
- In value-based, **finding an optimal value function (denoted Q\* or V\*, we'll study the difference after) in our leads to having an optimal policy.**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link between value and policy"/>