Update Unit 2

simoninithomas
2022-12-12 02:45:16 +01:00
parent 11c8f87460
commit 723f75223e
6 changed files with 9 additions and 6 deletions


@@ -58,7 +58,7 @@
   title: Monte Carlo vs Temporal Difference Learning
 - local: unit2/mid-way-recap
   title: Mid-way Recap
-- local: unit2/quiz1
+- local: unit2/mid-way-quiz
   title: Mid-way Quiz
 - local: unit2/q-learning
   title: Introducing Q-Learning
@@ -69,7 +69,7 @@
 - local: unit2/hands-on
   title: Hands-on
 - local: unit2/quiz2
-  title: Second Quiz
+  title: Q-Learning Quiz
 - local: unit2/conclusion
   title: Conclusion
 - local: unit2/additional-readings


@@ -30,6 +30,8 @@ If we take an example:
 - We terminate the episode if the cat eats the mouse or if the mouse moves > 10 steps.
 - At the end of the episode, **we have a list of State, Actions, Rewards, and Next States tuples**
 For instance [[State tile 3 bottom, Go Left, +1, State tile 2 bottom], [State tile 2 bottom, Go Left, +0, State tile 1 bottom]...]
+- **The agent will sum the total rewards \\(G_t\\)** (to see how well it did).
+- It will then **update \\(V(s_t)\\) based on the formula**
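The two added bullet points describe the tabular Monte Carlo value update. A minimal sketch of that procedure is below; the function name, the learning rate `alpha`, and the tile states are illustrative assumptions, not part of the course's code:

```python
# Sketch of the Monte Carlo value update described in the hunk above.
# mc_update, alpha=0.1, and the tile names are illustrative assumptions.

def mc_update(V, episode, alpha=0.1, gamma=1.0):
    """episode: list of (state, action, reward, next_state) tuples."""
    G = 0.0  # the return G_t, accumulated backwards over the episode
    for state, _action, reward, _next_state in reversed(episode):
        G = reward + gamma * G
        # Nudge V(s_t) toward the observed return: V <- V + alpha * (G - V)
        V[state] += alpha * (G - V[state])
    return V

# The cat-and-mouse episode from the text, as (state, action, reward, next_state):
V = {"tile1": 0.0, "tile2": 0.0, "tile3": 0.0}
episode = [("tile3", "left", 1.0, "tile2"), ("tile2", "left", 0.0, "tile1")]
mc_update(V, episode)
```

With these numbers the mouse's starting tile gets credit for the later +1 reward (`V["tile3"]` moves from 0 to 0.1), which is the point of summing \\(G_t\\) before updating \\(V(s_t)\\).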


@@ -1,4 +1,4 @@
-# Mid-way Quiz [[quiz1]]
+# Mid-way Quiz [[mid-way-quiz]]
 The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.


@@ -1,4 +1,4 @@
-# Mid-way Recap [[summary1]]
+# Mid-way Recap [[mid-way-recap]]
 Before diving into Q-Learning, let's summarize what we just learned.


@@ -1,4 +1,4 @@
-# Q-Learning Recap [[summary2]]
+# Q-Learning Recap [[q-learning-recap]]
 The *Q-Learning* **is the RL algorithm that** :


@@ -17,6 +17,7 @@ Q-Learning is an **off-policy value-based method that uses a TD approach to train its action-value function:**
 The **Q comes from "the Quality" (the value) of that action at that state.**
+Let's recap the difference between value and reward:
 - The *value of a state*, or a *state-action pair*, is the expected cumulative reward our agent gets if it starts at this state (or state-action pair) and then acts according to its policy.
 - The *reward* is the **feedback I get from the environment** after performing an action at a state.
@@ -42,7 +43,7 @@ Therefore, Q-function contains a Q-table **that has the value of each state-action pair.**
 If we recap, *Q-Learning* **is the RL algorithm that:**
-- Trains a *Q-Function* (an **action-value function**), which internally is a *Q-table that contains all the state-action pair values.**
+- Trains a *Q-Function* (an **action-value function**), which internally is a **Q-table that contains all the state-action pair values.**
 - Given a state and action, our Q-Function **will search into its Q-table the corresponding value.**
 - When the training is done, **we have an optimal Q-function, which means we have an optimal Q-table.**
 - And if we **have an optimal Q-function**, we **have an optimal policy**, since we **know for each state what is the best action to take.**
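The recap bullets above can be sketched as a tiny tabular Q-Learning setup. This is a hedged illustration, not the course's hands-on code: the state/action sizes, `alpha`, `gamma`, and the helper names are assumptions.

```python
import numpy as np

# Illustrative Q-table: rows are states, columns are actions.
# Sizes, alpha, and gamma are assumptions for demonstration only.
n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy TD target: reward plus the best possible next-state value.
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

def greedy_action(Q, s):
    # Given a state, search the Q-table row for the best action.
    return int(Q[s].argmax())

# One update: in state 0, action 1 yielded reward 1.0 and led to state 1.
q_update(Q, s=0, a=1, r=1.0, s_next=1)
```

After enough such updates the table approximates the optimal Q-function, and `greedy_action` then reads the optimal policy directly out of each row, which is the "optimal Q-function implies optimal policy" point above.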