From 0744d542ada257bfe7def5a3cd635b5ef8f67322 Mon Sep 17 00:00:00 2001
From: Artagon <florent.vaucher@gmail.com>
Date: Fri, 16 Dec 2022 20:31:49 +0100
Subject: [PATCH 1/8] =?UTF-8?q?Properly=20display=20=CF=80*?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 units/en/unit2/two-types-value-based-methods.mdx | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/units/en/unit2/two-types-value-based-methods.mdx b/units/en/unit2/two-types-value-based-methods.mdx
index 47a17e2..3422e7d 100644
--- a/units/en/unit2/two-types-value-based-methods.mdx
+++ b/units/en/unit2/two-types-value-based-methods.mdx
@@ -10,7 +10,7 @@ The value of a state is the **expected discounted return** the agent can get i
 But what does it mean to act according to our policy? After all, we don't have a policy in value-based methods since we train a value function and not a policy.
 </Tip>
 
-Remember that the goal of an **RL agent is to have an optimal policy π.**
+Remember that the goal of an **RL agent is to have an optimal policy π\*.**
 
 To find the optimal policy, we learned about two different methods:
 
@@ -35,8 +35,8 @@ Consequently, whatever method you use to solve your problem, **you will have a
 
 So the difference is:
 
-- In policy-based, **the optimal policy (denoted π*) is found by training the policy directly.**
-- In value-based, **finding an optimal value function (denoted Q* or V*, we'll study the difference after) in our leads to having an optimal policy.**
+- In policy-based, **the optimal policy (denoted π\*) is found by training the policy directly.**
+- In value-based, **finding an optimal value function (denoted Q\* or V\*, we'll study the difference after) leads to having an optimal policy.**
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link between value and policy"/>
 

From 0c3616c03ffcf8735a59ec495f08db0c73540c42 Mon Sep 17 00:00:00 2001
From: Artagon <florent.vaucher@gmail.com>
Date: Fri, 16 Dec 2022 20:34:24 +0100
Subject: [PATCH 2/8] Replace ** by <b> tags in figcaption

---
 units/en/unit2/bellman-equation.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/units/en/unit2/bellman-equation.mdx b/units/en/unit2/bellman-equation.mdx
index 99d753a..6979d23 100644
--- a/units/en/unit2/bellman-equation.mdx
+++ b/units/en/unit2/bellman-equation.mdx
@@ -18,7 +18,7 @@ Then, to calculate the \\(V(S_{t+1})\\), we need to calculate the return startin
 
 <figure>
   <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman3.jpg" alt="Bellman equation"/>
-  <figcaption>To calculate the value of State 2: the sum of rewards **if the agent started in that state**, and then followed the **policy for all the time steps.**</figcaption>
+  <figcaption>To calculate the value of State 2: the sum of rewards <b>if the agent started in that state</b>, and then followed the <b>policy for all the time steps.</b></figcaption>
 </figure>
 
 So you may have noticed, we're repeating the computation of the value of different states, which can be tedious if you need to do it for each state value or state-action value.

From 0a4c6c6f2ce41a7a6450ee8d603b78bcc2f4033b Mon Sep 17 00:00:00 2001
From: Artagon <florent.vaucher@gmail.com>
Date: Sat, 17 Dec 2022 14:30:19 +0100
Subject: [PATCH 3/8] fix redundant 'pair' and inconsistent Case.

---
 units/en/unit2/q-learning.mdx | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/units/en/unit2/q-learning.mdx b/units/en/unit2/q-learning.mdx
index 52e744a..e259363 100644
--- a/units/en/unit2/q-learning.mdx
+++ b/units/en/unit2/q-learning.mdx
@@ -7,7 +7,7 @@ Q-Learning is an **off-policy value-based method that uses a TD approach to tra
 - *Value-based method*: finds the optimal policy indirectly by training a value or action-value function that will tell us **the value of each state or each state-action pair.**
 - *Uses a TD approach:* **updates its action-value function at each step instead of at the end of the episode.**
 
-**Q-Learning is the algorithm we use to train our Q-Function**, an **action-value function** that determines the value of being at a particular state and taking a specific action at that state.
+**Q-Learning is the algorithm we use to train our Q-function**, an **action-value function** that determines the value of being at a particular state and taking a specific action at that state.
 
 <figure>
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function.jpg" alt="Q-function"/>
@@ -18,16 +18,16 @@ The **Q comes from "the Quality" (the value) of that action at that state.**
 
 Let's recap the difference between value and reward:
 
-- The *value of a state*, or a *state-action pair* is the expected cumulative reward our agent gets if it starts at this state (or state action pair) and then acts accordingly to its policy.
+- The *value of a state*, or of a *state-action pair*, is the expected cumulative reward our agent gets if it starts at this state (or state-action pair) and then acts according to its policy.
 - The *reward* is the **feedback I get from the environment** after performing an action at a state.
 
-Internally, our Q-function has **a Q-table, a table where each cell corresponds to a state-action value pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.**
+Internally, our Q-function has **a Q-table, a table where each cell corresponds to a state-action pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.**
 
 Let's go through an example of a maze.
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-1.jpg" alt="Maze example"/>
 
-The Q-Table is initialized. That's why all values are = 0. This table **contains, for each state, the four state-action values.**
+The Q-table is initialized, which is why all values are 0. This table **contains, for each state, the four state-action values.**
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-2.jpg" alt="Maze example"/>
 
@@ -35,7 +35,7 @@ Here we see that the **state-action value of the initial state and going up is
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-3.jpg" alt="Maze example"/>
 
-Therefore, Q-function contains a Q-table **that has the value of each-state action pair.** And given a state and action, **our Q-Function will search inside its Q-table to output the value.**
+Therefore, the Q-function contains a Q-table **that has the value of each state-action pair.** And given a state and action, **our Q-function will search inside its Q-table to output the value.**
 
 <figure>
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q-function"/>
@@ -43,22 +43,22 @@ Therefore, Q-function contains a Q-table **that has the value of each-state act
 
 If we recap, *Q-Learning* **is the RL algorithm that:**
 
-- Trains a *Q-Function* (an **action-value function**), which internally is a **Q-table that contains all the state-action pair values.**
-- Given a state and action, our Q-Function **will search into its Q-table the corresponding value.**
-- When the training is done, **we have an optimal Q-function, which means we have optimal Q-Table.**
+- Trains a *Q-function* (an **action-value function**), which internally is a **Q-table that contains all the state-action pair values.**
+- Given a state and action, our Q-function **will search its Q-table for the corresponding value.**
+- When the training is done, **we have an optimal Q-function, which means we have an optimal Q-table.**
 - And if we **have an optimal Q-function**, we **have an optimal policy** since we **know for each state what is the best action to take.**
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy"/>
 
 
-But, in the beginning, **our Q-Table is useless since it gives arbitrary values for each state-action pair** (most of the time, we initialize the Q-Table to 0). As the agent **explores the environment and we update the Q-Table, it will give us better and better approximations** to the optimal policy.
+But, in the beginning, **our Q-table is useless since it gives arbitrary values for each state-action pair** (most of the time, we initialize the Q-table to 0). As the agent **explores the environment and we update the Q-table, it will give us better and better approximations** to the optimal policy.
 
 <figure class="image table text-center m-0 w-full">
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-1.jpg" alt="Q-learning"/>
-  <figcaption>We see here that with the training, our Q-Table is better since, thanks to it, we can know the value of each state-action pair.</figcaption>
+  <figcaption>We see here that with the training, our Q-table is better since, thanks to it, we can know the value of each state-action pair.</figcaption>
 </figure>
 
-Now that we understand what Q-Learning, Q-Function, and Q-Table are, **let's dive deeper into the Q-Learning algorithm**.
+Now that we understand what Q-Learning, Q-function, and Q-table are, **let's dive deeper into the Q-Learning algorithm**.
 
 ## The Q-Learning algorithm [[q-learning-algo]]
 
@@ -66,12 +66,12 @@ This is the Q-Learning pseudocode; let's study each part and **see how it works
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-learning"/>
 
-### Step 1: We initialize the Q-Table [[step1]]
+### Step 1: We initialize the Q-table [[step1]]
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-3.jpg" alt="Q-learning"/>
 
 
-We need to initialize the Q-Table for each state-action pair. **Most of the time, we initialize with values of 0.**
+We need to initialize the Q-table for each state-action pair. **Most of the time, we initialize with values of 0.**
 
 ### Step 2: Choose action using epsilon greedy strategy [[step2]]
 
@@ -85,7 +85,7 @@ The idea is that we define epsilon ɛ = 1.0:
 - *With probability 1 — ɛ* : we do **exploitation** (aka our agent selects the action with the highest state-action pair value).
 - With probability ɛ: **we do exploration** (trying random action).
 
-At the beginning of the training, **the probability of doing exploration will be huge since ɛ is very high, so most of the time, we'll explore.** But as the training goes on, and consequently our **Q-Table gets better and better in its estimations, we progressively reduce the epsilon value** since we will need less and less exploration and more exploitation.
+At the beginning of the training, **the probability of doing exploration will be huge since ɛ is very high, so most of the time, we'll explore.** But as the training goes on, and consequently our **Q-table gets better and better in its estimations, we progressively reduce the epsilon value** since we will need less and less exploration and more exploitation.
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-5.jpg" alt="Q-learning"/>
 

From f913af7300f6303b1b014a98499033749696c0b7 Mon Sep 17 00:00:00 2001
From: Artagon <florent.vaucher@gmail.com>
Date: Sat, 17 Dec 2022 14:39:40 +0100
Subject: [PATCH 4/8] epsilon smaller or equal to 1.0

---
 units/en/unit2/q-learning.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/units/en/unit2/q-learning.mdx b/units/en/unit2/q-learning.mdx
index e259363..605f506 100644
--- a/units/en/unit2/q-learning.mdx
+++ b/units/en/unit2/q-learning.mdx
@@ -80,7 +80,7 @@ We need to initialize the Q-table for each state-action pair. **Most of the tim
 
 Epsilon greedy strategy is a policy that handles the exploration/exploitation trade-off.
 
-The idea is that we define epsilon ɛ = 1.0:
+The idea is that we define epsilon ɛ ≤ 1.0:
 
 - *With probability 1 — ɛ* : we do **exploitation** (aka our agent selects the action with the highest state-action pair value).
 - With probability ɛ: **we do exploration** (trying random action).

From 753ef67eae0507c70121a594464127a0fedaa951 Mon Sep 17 00:00:00 2001
From: Artagon <florent.vaucher@gmail.com>
Date: Sat, 17 Dec 2022 14:45:08 +0100
Subject: [PATCH 5/8] epsilon-greedy instead of epsilon greedy

---
 units/en/unit2/q-learning.mdx | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/units/en/unit2/q-learning.mdx b/units/en/unit2/q-learning.mdx
index 605f506..48f01d2 100644
--- a/units/en/unit2/q-learning.mdx
+++ b/units/en/unit2/q-learning.mdx
@@ -73,7 +73,7 @@ This is the Q-Learning pseudocode; let's study each part and **see how it works
 
 We need to initialize the Q-table for each state-action pair. **Most of the time, we initialize with values of 0.**
 
-### Step 2: Choose action using epsilon greedy strategy [[step2]]
+### Step 2: Choose action using epsilon-greedy strategy [[step2]]
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg" alt="Q-learning"/>
 
@@ -114,7 +114,7 @@ It means that to update our \\(Q(S_t, A_t)\\):
 
 How do we form the TD target?
 1. We obtain the reward after taking the action \\(R_{t+1}\\).
-2. To get the **best next-state-action pair value**, we use a greedy policy to select the next best action. Note that this is not an epsilon greedy policy, this will always take the action with the highest state-action value.
+2. To get the **best next-state-action pair value**, we use a greedy policy to select the next best action. Note that this is not an epsilon-greedy policy: it always takes the action with the highest state-action value.
 
 Then when the update of this Q-value is done, we start in a new state and select our action **using a epsilon-greedy policy again.**
 
@@ -126,7 +126,7 @@ The difference is subtle:
 
 - *Off-policy*: using **a different policy for acting (inference) and updating (training).**
 
-For instance, with Q-Learning, the epsilon greedy policy (acting policy), is different from the greedy policy that is **used to select the best next-state action value to update our Q-value (updating policy).**
+For instance, with Q-Learning, the epsilon-greedy policy (acting policy) is different from the greedy policy that is **used to select the best next-state-action value to update our Q-value (updating policy).**
 
 
 <figure>
@@ -144,7 +144,7 @@ Is different from the policy we use during the training part:
 
 - *On-policy:* using the **same policy for acting and updating.**
 
-For instance, with Sarsa, another value-based algorithm, **the epsilon greedy Policy selects the next state-action pair, not a greedy policy.**
+For instance, with Sarsa, another value-based algorithm, **the epsilon-greedy Policy selects the next state-action pair, not a greedy policy.**
 
 
 <figure>

From a7d74befb03c03ed92039bbe6464400aefba6584 Mon Sep 17 00:00:00 2001
From: Artagon <florent.vaucher@gmail.com>
Date: Sat, 17 Dec 2022 14:47:18 +0100
Subject: [PATCH 6/8] Fix midsentence uppercase 'Policy'

---
 units/en/unit2/q-learning.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/units/en/unit2/q-learning.mdx b/units/en/unit2/q-learning.mdx
index 48f01d2..e78a598 100644
--- a/units/en/unit2/q-learning.mdx
+++ b/units/en/unit2/q-learning.mdx
@@ -144,7 +144,7 @@ Is different from the policy we use during the training part:
 
 - *On-policy:* using the **same policy for acting and updating.**
 
-For instance, with Sarsa, another value-based algorithm, **the epsilon-greedy Policy selects the next state-action pair, not a greedy policy.**
+For instance, with Sarsa, another value-based algorithm, **the epsilon-greedy policy selects the next state-action pair, not a greedy policy.**
 
 
 <figure>

From 96714cdb107297f166fa7f2708ba707d97c9deac Mon Sep 17 00:00:00 2001
From: Artagon <florent.vaucher@gmail.com>
Date: Sat, 17 Dec 2022 22:23:08 +0100
Subject: [PATCH 7/8] Cases consistency

---
 units/en/unit2/q-learning-example.mdx | 6 +++---
 units/en/unit2/q-learning-recap.mdx   | 8 ++++----
 units/en/unit2/quiz2.mdx              | 8 ++++----
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/units/en/unit2/q-learning-example.mdx b/units/en/unit2/q-learning-example.mdx
index d6ccbda..e771af9 100644
--- a/units/en/unit2/q-learning-example.mdx
+++ b/units/en/unit2/q-learning-example.mdx
@@ -25,11 +25,11 @@ The reward function goes like this:
 
 To train our agent to have an optimal policy (so a policy that goes right, right, down), **we will use the Q-Learning algorithm**.
 
-## Step 1: We initialize the Q-Table [[step1]]
+## Step 1: We initialize the Q-table [[step1]]
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Example-1.jpg" alt="Maze-Example"/>
 
-So, for now, **our Q-Table is useless**; we need **to train our Q-function using the Q-Learning algorithm.**
+So, for now, **our Q-table is useless**; we need **to train our Q-function using the Q-Learning algorithm.**
 
 Let's do it for 2 training timesteps:
 
@@ -80,4 +80,4 @@ Because I go to the poison state, **I get \\(R_{t+1} = -10\\), and I die.**
 
 Because we're dead, we start a new episode. But what we see here is that **with two explorations steps, my agent became smarter.**
 
-As we continue exploring and exploiting the environment and updating Q-values using TD target, **Q-Table will give us better and better approximations. And thus, at the end of the training, we'll get an estimate of the optimal Q-Function.**
+As we continue exploring and exploiting the environment and updating Q-values using the TD target, **the Q-table will give us better and better approximations. And thus, at the end of the training, we'll get an estimate of the optimal Q-function.**
diff --git a/units/en/unit2/q-learning-recap.mdx b/units/en/unit2/q-learning-recap.mdx
index 55c66bf..ab3b974 100644
--- a/units/en/unit2/q-learning-recap.mdx
+++ b/units/en/unit2/q-learning-recap.mdx
@@ -3,20 +3,20 @@
 
 The *Q-Learning* **is the RL algorithm that** :
 
-- Trains *Q-Function*, an **action-value function** that contains, as internal memory, a *Q-table* **that contains all the state-action pair values.**
+- Trains a *Q-function*, an **action-value function** that contains, as internal memory, a *Q-table* **that contains all the state-action pair values.**
 
-- Given a state and action, our Q-Function **will search into its Q-table the corresponding value.**
+- Given a state and action, our Q-function **will search its Q-table for the corresponding value.**
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q function"  width="100%"/>
 
-- When the training is done,**we have an optimal Q-Function, so an optimal Q-Table.**
+- When the training is done, **we have an optimal Q-function, so an optimal Q-table.**
 
 - And if we **have an optimal Q-function**, we
 have an optimal policy,since we **know for each state, what is the best action to take.**
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy"  width="100%"/>
 
-But, in the beginning, our **Q-Table is useless since it gives arbitrary value for each state-action pair (most of the time we initialize the Q-Table to 0 values)**. But, as we’ll explore the environment and update our Q-Table it will give us better and better approximations
+But, in the beginning, our **Q-table is useless since it gives arbitrary values for each state-action pair (most of the time we initialize the Q-table to 0 values)**. But as we explore the environment and update our Q-table, it will give us better and better approximations.
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/q-learning.jpeg" alt="q-learning.jpeg" width="100%"/>
 
diff --git a/units/en/unit2/quiz2.mdx b/units/en/unit2/quiz2.mdx
index 9d96d74..961d477 100644
--- a/units/en/unit2/quiz2.mdx
+++ b/units/en/unit2/quiz2.mdx
@@ -9,7 +9,7 @@ The best way to learn and [to avoid the illusion of competence](https://www.cour
 <Question
 	choices={[
 		{
-			text: "The algorithm we use to train our Q-Function",
+			text: "The algorithm we use to train our Q-function",
 			explain: "",
       correct: true
 		},
@@ -24,12 +24,12 @@ The best way to learn and [to avoid the illusion of competence](https://www.cour
 		},
 		{
 			text: "A table",
-      explain: "Q-Function is not a Q-Table. The Q-Function is the algorithm that will feed the Q-Table."
+      explain: "Q-function is not a Q-table. The Q-function is the algorithm that will feed the Q-table."
 		}
 	]}
 />
 
-### Q2: What is a Q-Table?
+### Q2: What is a Q-table?
 
 <Question
 	choices={[
@@ -43,7 +43,7 @@ The best way to learn and [to avoid the illusion of competence](https://www.cour
       correct: true
 		},
     {
-			text: "In Q-Table each cell corresponds a state value",
+			text: "In a Q-table, each cell corresponds to a state value",
 			explain: "Each cell corresponds to a state-action value pair value. Not a state value.",
 		}
 	]}

From fc66ea7e4aa2b53c761367d55154566477a98c17 Mon Sep 17 00:00:00 2001
From: Artagon <florent.vaucher@gmail.com>
Date: Sat, 17 Dec 2022 22:33:02 +0100
Subject: [PATCH 8/8] Rephrasing for initial epsilon value

---
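Reviewer note (not part of the commit): the reworded sentence is meant to say that ɛ starts at 1.0 and is then reduced as training goes on. Below is a minimal Python sketch of that reading. It assumes the Q-table is stored as a list of rows (one row per state, one entry per action) and uses an exponential decay schedule chosen only for illustration. The names `epsilon_greedy_action`, `decayed_epsilon`, `max_epsilon`, `min_epsilon` and `decay_rate` are hypothetical and do not appear in the course text touched by this patch.

```python
import math
import random


def epsilon_greedy_action(q_table, state, epsilon, rng=random):
    """Pick an action for `state` from a Q-table stored as a list of rows.

    With probability epsilon we explore (uniform random action);
    otherwise we exploit (action with the highest state-action value).
    """
    row = q_table[state]
    if rng.random() < epsilon:
        return rng.randrange(len(row))  # exploration
    return max(range(len(row)), key=row.__getitem__)  # exploitation


def decayed_epsilon(episode, max_epsilon=1.0, min_epsilon=0.05, decay_rate=0.005):
    """Hypothetical schedule: starts at max_epsilon and decays toward min_epsilon."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)
```

With `epsilon=1.0` every action is exploratory, which matches "the initial epsilon ɛ = 1.0"; as `decayed_epsilon` shrinks, the greedy branch is taken more often.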
 units/en/unit2/q-learning.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/units/en/unit2/q-learning.mdx b/units/en/unit2/q-learning.mdx
index e78a598..2dd7190 100644
--- a/units/en/unit2/q-learning.mdx
+++ b/units/en/unit2/q-learning.mdx
@@ -80,7 +80,7 @@ We need to initialize the Q-table for each state-action pair. **Most of the tim
 
 Epsilon greedy strategy is a policy that handles the exploration/exploitation trade-off.
 
-The idea is that we define epsilon ɛ ≤ 1.0:
+The idea is that we define the initial epsilon ɛ = 1.0:
 
 - *With probability 1 — ɛ* : we do **exploitation** (aka our agent selects the action with the highest state-action pair value).
 - With probability ɛ: **we do exploration** (trying random action).