mirror of
https://github.com/huggingface/deep-rl-class.git
synced 2026-04-05 11:38:43 +08:00
Update Unit 2
@@ -58,7 +58,7 @@
   title: Monte Carlo vs Temporal Difference Learning
 - local: unit2/mid-way-recap
   title: Mid-way Recap
-- local: unit2/quiz1
+- local: unit2/mid-way-quiz
   title: Mid-way Quiz
 - local: unit2/q-learning
   title: Introducing Q-Learning
@@ -69,7 +69,7 @@
 - local: unit2/hands-on
   title: Hands-on
 - local: unit2/quiz2
-  title: Second Quiz
+  title: Q-Learning Quiz
 - local: unit2/conclusion
   title: Conclusion
 - local: unit2/additional-readings
@@ -30,6 +30,8 @@ If we take an example:

- We terminate the episode if the cat eats the mouse or if the mouse moves > 10 steps.
- At the end of the episode, **we have a list of State, Actions, Rewards, and Next States tuples**

For instance [[State tile 3 bottom, Go Left, +1, State tile 2 bottom], [State tile 2 bottom, Go Left, +0, State tile 1 bottom]...]

- **The agent will sum the total rewards \\(G_t\\)** (to see how well it did).
- It will then **update \\(V(s_t)\\) based on the formula**
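The Monte Carlo update this hunk describes — play a full episode, sum the return \\(G_t\\), then nudge \\(V(s_t)\\) toward it — can be sketched in a few lines of Python. This is an illustrative sketch, not the course's actual code; the helper name `mc_update` and the learning rate `alpha=0.1` are assumptions:

```python
# Monte Carlo value update: after a complete episode, move V(s) toward
# the observed return G_t, i.e. V(s) <- V(s) + alpha * (G_t - V(s)).
def mc_update(V, episode, alpha=0.1, gamma=1.0):
    """episode: list of (state, action, reward, next_state) tuples."""
    G = 0.0
    # Walk the episode backwards so G accumulates the (discounted)
    # return from each step onward.
    for state, action, reward, next_state in reversed(episode):
        G = reward + gamma * G
        v = V.get(state, 0.0)
        V[state] = v + alpha * (G - v)
    return V

# The example episode from the text: the mouse goes left twice.
episode = [("tile 3 bottom", "Go Left", 1, "tile 2 bottom"),
           ("tile 2 bottom", "Go Left", 0, "tile 1 bottom")]
V = mc_update({}, episode)
```

With `gamma=1.0`, the return from "tile 3 bottom" is 1 and from "tile 2 bottom" is 0, so one update moves each value a tenth of the way toward its observed return.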
@@ -1,4 +1,4 @@
-# Mid-way Quiz [[quiz1]]
+# Mid-way Quiz [[mid-way-quiz]]
 
 The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.
@@ -1,4 +1,4 @@
-# Mid-way Recap [[summary1]]
+# Mid-way Recap [[mid-way-recap]]
 
 Before diving into Q-Learning, let's summarize what we just learned.
@@ -1,4 +1,4 @@
-# Q-Learning Recap [[summary2]]
+# Q-Learning Recap [[q-learning-recap]]
 
 *Q-Learning* **is the RL algorithm that**:
@@ -17,6 +17,7 @@ Q-Learning is an **off-policy value-based method that uses a TD approach to train its action-value function:**

The **Q comes from "the Quality" (the value) of that action at that state.**

Let's recap the difference between value and reward:

- The *value of a state*, or a *state-action pair*, is the expected cumulative reward our agent gets if it starts at this state (or state-action pair) and then acts according to its policy.
- The *reward* is the **feedback I get from the environment** after performing an action at a state.
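The value/reward distinction above can also be written compactly in standard RL notation: the reward \\(R_{t+1}\\) is the one-step feedback received after taking an action in state \\(S_t\\), while the value

\\(V_{\pi}(s) = \mathbb{E}_{\pi}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s\right]\\)

is the expected discounted sum of all future rewards when starting from \\(s\\) and then following the policy \\(\pi\\).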
@@ -42,7 +43,7 @@ Therefore, Q-function contains a Q-table **that has the value of each state-action pair.**

 If we recap, *Q-Learning* **is the RL algorithm that:**
 
-- Trains a *Q-Function* (an **action-value function**), which internally is a *Q-table that contains all the state-action pair values.**
+- Trains a *Q-Function* (an **action-value function**), which internally is a **Q-table that contains all the state-action pair values.**
 - Given a state and action, our Q-Function **will search into its Q-table the corresponding value.**
 - When the training is done, **we have an optimal Q-function, which means we have optimal Q-Table.**
 - And if we **have an optimal Q-function**, we **have an optimal policy** since we **know for each state what is the best action to take.**
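The recap in this hunk — a Q-table holding a value for every state-action pair, with the optimal policy picking the best action per state — can be sketched as follows. The states, actions, and values here are made up for illustration; this is not the course's hands-on code:

```python
from collections import defaultdict

# A Q-table maps (state, action) pairs to estimated values.
# With defaultdict, unseen pairs default to 0.0.
Q = defaultdict(float)
Q[("tile 2 bottom", "Go Left")] = 1.5    # illustrative values
Q[("tile 2 bottom", "Go Right")] = 0.2

def greedy_action(Q, state, actions):
    # Given an (optimal) Q-table, the (optimal) policy just picks,
    # for each state, the action with the highest Q-value.
    return max(actions, key=lambda a: Q[(state, a)])
```

For a state absent from the table, every action scores 0.0 and `max` falls back to the first action listed, so in practice exploration is needed before the greedy policy is meaningful.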