diff --git a/units/en/_toctree.yml b/units/en/_toctree.yml
index 2615a89..5483f9c 100644
--- a/units/en/_toctree.yml
+++ b/units/en/_toctree.yml
@@ -58,7 +58,7 @@
     title: Monte Carlo vs Temporal Difference Learning
   - local: unit2/mid-way-recap
     title: Mid-way Recap
-  - local: unit2/quiz1
+  - local: unit2/mid-way-quiz
     title: Mid-way Quiz
   - local: unit2/q-learning
     title: Introducing Q-Learning
@@ -69,7 +69,7 @@
   - local: unit2/hands-on
     title: Hands-on
   - local: unit2/quiz2
-    title: Second Quiz
+    title: Q-Learning Quiz
   - local: unit2/conclusion
     title: Conclusion
   - local: unit2/additional-readings
diff --git a/units/en/unit2/mc-vs-td.mdx b/units/en/unit2/mc-vs-td.mdx
index da47dc5..1d3517f 100644
--- a/units/en/unit2/mc-vs-td.mdx
+++ b/units/en/unit2/mc-vs-td.mdx
@@ -30,6 +30,8 @@ If we take an example:
 
 - We terminate the episode if the cat eats the mouse or if the mouse moves > 10 steps.
 - At the end of the episode, **we have a list of State, Actions, Rewards, and Next States tuples**
+
+For instance [[State tile 3 bottom, Go Left, +1, State tile 2 bottom], [State tile 2 bottom, Go Left, +0, State tile 1 bottom]...]
 
 - **The agent will sum the total rewards \\(G_t\\)** (to see how well it did).
 - It will then **update \\(V(s_t)\\) based on the formula**
diff --git a/units/en/unit2/quiz1.mdx b/units/en/unit2/mid-way-quiz.mdx
similarity index 99%
rename from units/en/unit2/quiz1.mdx
rename to units/en/unit2/mid-way-quiz.mdx
index 80bc321..b1ffe3a 100644
--- a/units/en/unit2/quiz1.mdx
+++ b/units/en/unit2/mid-way-quiz.mdx
@@ -1,4 +1,4 @@
-# Mid-way Quiz [[quiz1]]
+# Mid-way Quiz [[mid-way-quiz]]
 
 The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.
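The Monte Carlo bookkeeping the mc-vs-td.mdx hunk describes (collect an episode of tuples, sum the rewards into \\(G_t\\), then nudge \\(V(s_t)\\) toward it) can be sketched in a few lines of Python. This is a minimal illustration, not course code: the learning rate, the undiscounted return, and updating only the starting state are simplifying assumptions.

```python
# Monte Carlo value update: run a full episode, then move V(s_t)
# toward the observed return G_t. All names and numbers are illustrative.
learning_rate = 0.1

# Episode as (state, action, reward, next_state) tuples, mirroring the
# example in the hunk: the mouse goes left twice, rewards +1 then 0.
episode = [
    ("tile 3 bottom", "Go Left", 1, "tile 2 bottom"),
    ("tile 2 bottom", "Go Left", 0, "tile 1 bottom"),
]

V = {"tile 3 bottom": 0.0, "tile 2 bottom": 0.0, "tile 1 bottom": 0.0}

# G_t: the total reward collected during the episode (no discounting here)
G_t = sum(reward for _, _, reward, _ in episode)

# V(s_t) <- V(s_t) + lr * (G_t - V(s_t)), applied to the starting state
start_state = episode[0][0]
V[start_state] = V[start_state] + learning_rate * (G_t - V[start_state])
print(V[start_state])  # 0.0 + 0.1 * (1 - 0.0) = 0.1
```

Note that the update only happens once the episode terminates, which is the key contrast with the TD approach discussed in the same file.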
diff --git a/units/en/unit2/summary1.mdx b/units/en/unit2/mid-way-recap.mdx
similarity index 97%
rename from units/en/unit2/summary1.mdx
rename to units/en/unit2/mid-way-recap.mdx
index 496c5aa..0bae566 100644
--- a/units/en/unit2/summary1.mdx
+++ b/units/en/unit2/mid-way-recap.mdx
@@ -1,4 +1,4 @@
-# Mid-way Recap [[summary1]]
+# Mid-way Recap [[mid-way-recap]]
 
 Before diving into Q-Learning, let's summarize what we just learned.
diff --git a/units/en/unit2/summary2.mdx b/units/en/unit2/q-learning-recap.mdx
similarity index 97%
rename from units/en/unit2/summary2.mdx
rename to units/en/unit2/q-learning-recap.mdx
index a5653ef..55c66bf 100644
--- a/units/en/unit2/summary2.mdx
+++ b/units/en/unit2/q-learning-recap.mdx
@@ -1,4 +1,4 @@
-# Q-Learning Recap [[summary2]]
+# Q-Learning Recap [[q-learning-recap]]
 
 The *Q-Learning* **is the RL algorithm that** :
diff --git a/units/en/unit2/q-learning.mdx b/units/en/unit2/q-learning.mdx
index 7a52cc4..52e744a 100644
--- a/units/en/unit2/q-learning.mdx
+++ b/units/en/unit2/q-learning.mdx
@@ -17,6 +17,7 @@ Q-Learning is an **off-policy value-based method that uses a TD approach to tra
 
 The **Q comes from "the Quality" (the value) of that action at that state.**
 
 Let's recap the difference between value and reward:
+
 - The *value of a state*, or a *state-action pair* is the expected cumulative reward our agent gets if it starts at this state (or state action pair) and then acts accordingly to its policy.
 - The *reward* is the **feedback I get from the environment** after performing an action at a state.
@@ -42,7 +43,7 @@ Therefore, Q-function contains a Q-table **that has the value of each-state act
 
 If we recap, *Q-Learning* **is the RL algorithm that:**
 
-- Trains a *Q-Function* (an **action-value function**), which internally is a *Q-table that contains all the state-action pair values.**
+- Trains a *Q-Function* (an **action-value function**), which internally is a **Q-table that contains all the state-action pair values.**
 - Given a state and action, our Q-Function **will search into its Q-table the corresponding value.**
 - When the training is done, **we have an optimal Q-function, which means we have optimal Q-Table.**
 - And if we **have an optimal Q-function**, we **have an optimal policy** since we **know for each state what is the best action to take.**
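The recap bullets that the q-learning.mdx hunks touch (a Q-table holding one value per state-action pair, looking up a value for a given state and action, and reading the greedy policy off an optimal table via the off-policy TD update) can be sketched as a toy example. Everything below is illustrative: the states, actions, table values, and hyperparameters are made up and are not taken from the course notebooks.

```python
# Toy Q-Learning pieces matching the recap bullets. All names and numbers
# are illustrative, not from the course code.
lr, gamma = 0.1, 0.9
ACTIONS = ("left", "right")

# The Q-function's internal Q-table: one value per state-action pair.
Q = {
    ("s0", "left"): 0.0, ("s0", "right"): 0.0,
    ("s1", "left"): 0.5, ("s1", "right"): 1.0,
}

def q_value(state, action):
    """Given a state and an action, search the Q-table for the value."""
    return Q[(state, action)]

def greedy_action(state):
    """With an optimal Q-table, the best action is the argmax over actions."""
    return max(ACTIONS, key=lambda a: q_value(state, a))

# One off-policy TD update for an observed transition (s, a, r, s'):
# Q(s,a) <- Q(s,a) + lr * (r + gamma * max_a' Q(s',a') - Q(s,a))
# "Off-policy" because the target uses the greedy action in s',
# regardless of which action the behaviour policy will actually take.
s, a, r, s_next = "s0", "right", 1, "s1"
td_target = r + gamma * q_value(s_next, greedy_action(s_next))
Q[(s, a)] += lr * (td_target - q_value(s, a))

print(q_value("s0", "right"))  # one step toward the target: ~0.19
print(greedy_action("s1"))     # 'right'
```

Unlike the Monte Carlo sketch, this update fires after every single transition, which is the TD aspect the unit emphasizes.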