From 0744d542ada257bfe7def5a3cd635b5ef8f67322 Mon Sep 17 00:00:00 2001 From: Artagon Date: Fri, 16 Dec 2022 20:31:49 +0100 Subject: [PATCH 1/8] =?UTF-8?q?Properly=20display=20=CF=80*?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- units/en/unit2/two-types-value-based-methods.mdx | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/units/en/unit2/two-types-value-based-methods.mdx b/units/en/unit2/two-types-value-based-methods.mdx index 47a17e2..3422e7d 100644 --- a/units/en/unit2/two-types-value-based-methods.mdx +++ b/units/en/unit2/two-types-value-based-methods.mdx @@ -10,7 +10,7 @@ The value of a state is the **expected discounted return** the agent can get i But what does it mean to act according to our policy? After all, we don't have a policy in value-based methods since we train a value function and not a policy. -Remember that the goal of an **RL agent is to have an optimal policy π.** +Remember that the goal of an **RL agent is to have an optimal policy π\*.** To find the optimal policy, we learned about two different methods: @@ -35,8 +35,8 @@ Consequently, whatever method you use to solve your problem, **you will have a So the difference is: -- In policy-based, **the optimal policy (denoted π*) is found by training the policy directly.** -- In value-based, **finding an optimal value function (denoted Q* or V*, we'll study the difference after) in our leads to having an optimal policy.** +- In policy-based, **the optimal policy (denoted π\*) is found by training the policy directly.** +- In value-based, **finding an optimal value function (denoted Q\* or V\*, we'll study the difference after) in our leads to having an optimal policy.** Link between value and policy From 0c3616c03ffcf8735a59ec495f08db0c73540c42 Mon Sep 17 00:00:00 2001 From: Artagon Date: Fri, 16 Dec 2022 20:34:24 +0100 Subject: [PATCH 2/8] Replace ** by tags in figcaption --- units/en/unit2/bellman-equation.mdx | 2 +- 1 
file changed, 1 insertion(+), 1 deletion(-) diff --git a/units/en/unit2/bellman-equation.mdx b/units/en/unit2/bellman-equation.mdx index 99d753a..6979d23 100644 --- a/units/en/unit2/bellman-equation.mdx +++ b/units/en/unit2/bellman-equation.mdx @@ -18,7 +18,7 @@ Then, to calculate the \\(V(S_{t+1})\\), we need to calculate the return startin
Bellman equation -
To calculate the value of State 2: the sum of rewards **if the agent started in that state**, and then followed the **policy for all the time steps.**
+
To calculate the value of State 2: the sum of rewards if the agent started in that state, and then followed the policy for all the time steps.
So you may have noticed, we're repeating the computation of the value of different states, which can be tedious if you need to do it for each state value or state-action value. From 0a4c6c6f2ce41a7a6450ee8d603b78bcc2f4033b Mon Sep 17 00:00:00 2001 From: Artagon Date: Sat, 17 Dec 2022 14:30:19 +0100 Subject: [PATCH 3/8] fix redundant 'pair' and inconsistent Case. --- units/en/unit2/q-learning.mdx | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/units/en/unit2/q-learning.mdx b/units/en/unit2/q-learning.mdx index 52e744a..e259363 100644 --- a/units/en/unit2/q-learning.mdx +++ b/units/en/unit2/q-learning.mdx @@ -7,7 +7,7 @@ Q-Learning is an **off-policy value-based method that uses a TD approach to tra - *Value-based method*: finds the optimal policy indirectly by training a value or action-value function that will tell us **the value of each state or each state-action pair.** - *Uses a TD approach:* **updates its action-value function at each step instead of at the end of the episode.** -**Q-Learning is the algorithm we use to train our Q-Function**, an **action-value function** that determines the value of being at a particular state and taking a specific action at that state. +**Q-Learning is the algorithm we use to train our Q-function**, an **action-value function** that determines the value of being at a particular state and taking a specific action at that state.
Q-function @@ -18,16 +18,16 @@ The **Q comes from "the Quality" (the value) of that action at that state.** Let's recap the difference between value and reward: -- The *value of a state*, or a *state-action pair* is the expected cumulative reward our agent gets if it starts at this state (or state action pair) and then acts accordingly to its policy. +- The *value of a state*, or a *state-action pair* is the expected cumulative reward our agent gets if it starts at this state (or state-action pair) and then acts accordingly to its policy. - The *reward* is the **feedback I get from the environment** after performing an action at a state. -Internally, our Q-function has **a Q-table, a table where each cell corresponds to a state-action value pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.** +Internally, our Q-function has **a Q-table, a table where each cell corresponds to a state-action pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.** Let's go through an example of a maze. Maze example -The Q-Table is initialized. That's why all values are = 0. This table **contains, for each state, the four state-action values.** +The Q-table is initialized. That's why all values are = 0. This table **contains, for each state, the four state-action values.** Maze example @@ -35,7 +35,7 @@ Here we see that the **state-action value of the initial state and going up is Maze example -Therefore, Q-function contains a Q-table **that has the value of each-state action pair.** And given a state and action, **our Q-Function will search inside its Q-table to output the value.** +Therefore, Q-function contains a Q-table **that has the value of each-state action pair.** And given a state and action, **our Q-function will search inside its Q-table to output the value.**
Q-function @@ -43,22 +43,22 @@ Therefore, Q-function contains a Q-table **that has the value of each-state act If we recap, *Q-Learning* **is the RL algorithm that:** -- Trains a *Q-Function* (an **action-value function**), which internally is a **Q-table that contains all the state-action pair values.** -- Given a state and action, our Q-Function **will search into its Q-table the corresponding value.** -- When the training is done, **we have an optimal Q-function, which means we have optimal Q-Table.** +- Trains a *Q-function* (an **action-value function**), which internally is a **Q-table that contains all the state-action pair values.** +- Given a state and action, our Q-function **will search into its Q-table the corresponding value.** +- When the training is done, **we have an optimal Q-function, which means we have optimal Q-table.** - And if we **have an optimal Q-function**, we **have an optimal policy** since we **know for each state what is the best action to take.** Link value policy -But, in the beginning, **our Q-Table is useless since it gives arbitrary values for each state-action pair** (most of the time, we initialize the Q-Table to 0). As the agent **explores the environment and we update the Q-Table, it will give us better and better approximations** to the optimal policy. +But, in the beginning, **our Q-table is useless since it gives arbitrary values for each state-action pair** (most of the time, we initialize the Q-table to 0). As the agent **explores the environment and we update the Q-table, it will give us better and better approximations** to the optimal policy.
Q-learning -
We see here that with the training, our Q-Table is better since, thanks to it, we can know the value of each state-action pair.
+
We see here that with the training, our Q-table is better since, thanks to it, we can know the value of each state-action pair.
-Now that we understand what Q-Learning, Q-Function, and Q-Table are, **let's dive deeper into the Q-Learning algorithm**. +Now that we understand what Q-Learning, Q-function, and Q-table are, **let's dive deeper into the Q-Learning algorithm**. ## The Q-Learning algorithm [[q-learning-algo]] @@ -66,12 +66,12 @@ This is the Q-Learning pseudocode; let's study each part and **see how it works Q-learning -### Step 1: We initialize the Q-Table [[step1]] +### Step 1: We initialize the Q-table [[step1]] Q-learning -We need to initialize the Q-Table for each state-action pair. **Most of the time, we initialize with values of 0.** +We need to initialize the Q-table for each state-action pair. **Most of the time, we initialize with values of 0.** ### Step 2: Choose action using epsilon greedy strategy [[step2]] @@ -85,7 +85,7 @@ The idea is that we define epsilon ɛ = 1.0: - *With probability 1 — ɛ* : we do **exploitation** (aka our agent selects the action with the highest state-action pair value). - With probability ɛ: **we do exploration** (trying random action). -At the beginning of the training, **the probability of doing exploration will be huge since ɛ is very high, so most of the time, we'll explore.** But as the training goes on, and consequently our **Q-Table gets better and better in its estimations, we progressively reduce the epsilon value** since we will need less and less exploration and more exploitation. +At the beginning of the training, **the probability of doing exploration will be huge since ɛ is very high, so most of the time, we'll explore.** But as the training goes on, and consequently our **Q-table gets better and better in its estimations, we progressively reduce the epsilon value** since we will need less and less exploration and more exploitation. 
Q-learning From f913af7300f6303b1b014a98499033749696c0b7 Mon Sep 17 00:00:00 2001 From: Artagon Date: Sat, 17 Dec 2022 14:39:40 +0100 Subject: [PATCH 4/8] epsilon smaller or equal to 1.0 --- units/en/unit2/q-learning.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/units/en/unit2/q-learning.mdx b/units/en/unit2/q-learning.mdx index e259363..605f506 100644 --- a/units/en/unit2/q-learning.mdx +++ b/units/en/unit2/q-learning.mdx @@ -80,7 +80,7 @@ We need to initialize the Q-table for each state-action pair. **Most of the tim Epsilon greedy strategy is a policy that handles the exploration/exploitation trade-off. -The idea is that we define epsilon ɛ = 1.0: +The idea is that we define epsilon ɛ ≤ 1.0: - *With probability 1 — ɛ* : we do **exploitation** (aka our agent selects the action with the highest state-action pair value). - With probability ɛ: **we do exploration** (trying random action). From 753ef67eae0507c70121a594464127a0fedaa951 Mon Sep 17 00:00:00 2001 From: Artagon Date: Sat, 17 Dec 2022 14:45:08 +0100 Subject: [PATCH 5/8] epsilon-greedy instead of epsilon greedy --- units/en/unit2/q-learning.mdx | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/units/en/unit2/q-learning.mdx b/units/en/unit2/q-learning.mdx index 605f506..48f01d2 100644 --- a/units/en/unit2/q-learning.mdx +++ b/units/en/unit2/q-learning.mdx @@ -73,7 +73,7 @@ This is the Q-Learning pseudocode; let's study each part and **see how it works We need to initialize the Q-table for each state-action pair. **Most of the time, we initialize with values of 0.** -### Step 2: Choose action using epsilon greedy strategy [[step2]] +### Step 2: Choose action using epsilon-greedy strategy [[step2]] Q-learning @@ -114,7 +114,7 @@ It means that to update our \\(Q(S_t, A_t)\\): How do we form the TD target? 1. We obtain the reward after taking the action \\(R_{t+1}\\). -2. 
To get the **best next-state-action pair value**, we use a greedy policy to select the next best action. Note that this is not an epsilon greedy policy, this will always take the action with the highest state-action value. +2. To get the **best next-state-action pair value**, we use a greedy policy to select the next best action. Note that this is not an epsilon-greedy policy, this will always take the action with the highest state-action value. Then when the update of this Q-value is done, we start in a new state and select our action **using a epsilon-greedy policy again.** @@ -126,7 +126,7 @@ The difference is subtle: - *Off-policy*: using **a different policy for acting (inference) and updating (training).** -For instance, with Q-Learning, the epsilon greedy policy (acting policy), is different from the greedy policy that is **used to select the best next-state action value to update our Q-value (updating policy).** +For instance, with Q-Learning, the epsilon-greedy policy (acting policy), is different from the greedy policy that is **used to select the best next-state action value to update our Q-value (updating policy).**
@@ -144,7 +144,7 @@ Is different from the policy we use during the training part: - *On-policy:* using the **same policy for acting and updating.** -For instance, with Sarsa, another value-based algorithm, **the epsilon greedy Policy selects the next state-action pair, not a greedy policy.** +For instance, with Sarsa, another value-based algorithm, **the epsilon-greedy Policy selects the next state-action pair, not a greedy policy.**
From a7d74befb03c03ed92039bbe6464400aefba6584 Mon Sep 17 00:00:00 2001 From: Artagon Date: Sat, 17 Dec 2022 14:47:18 +0100 Subject: [PATCH 6/8] Fix midsentence uppercase 'Policy' --- units/en/unit2/q-learning.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/units/en/unit2/q-learning.mdx b/units/en/unit2/q-learning.mdx index 48f01d2..e78a598 100644 --- a/units/en/unit2/q-learning.mdx +++ b/units/en/unit2/q-learning.mdx @@ -144,7 +144,7 @@ Is different from the policy we use during the training part: - *On-policy:* using the **same policy for acting and updating.** -For instance, with Sarsa, another value-based algorithm, **the epsilon-greedy Policy selects the next state-action pair, not a greedy policy.** +For instance, with Sarsa, another value-based algorithm, **the epsilon-greedy policy selects the next state-action pair, not a greedy policy.**
From 96714cdb107297f166fa7f2708ba707d97c9deac Mon Sep 17 00:00:00 2001 From: Artagon Date: Sat, 17 Dec 2022 22:23:08 +0100 Subject: [PATCH 7/8] Cases consistency --- units/en/unit2/q-learning-example.mdx | 6 +++--- units/en/unit2/q-learning-recap.mdx | 8 ++++---- units/en/unit2/quiz2.mdx | 8 ++++---- 3 files changed, 11 insertions(+), 11 deletions(-) diff --git a/units/en/unit2/q-learning-example.mdx b/units/en/unit2/q-learning-example.mdx index d6ccbda..e771af9 100644 --- a/units/en/unit2/q-learning-example.mdx +++ b/units/en/unit2/q-learning-example.mdx @@ -25,11 +25,11 @@ The reward function goes like this: To train our agent to have an optimal policy (so a policy that goes right, right, down), **we will use the Q-Learning algorithm**. -## Step 1: We initialize the Q-Table [[step1]] +## Step 1: We initialize the Q-table [[step1]] Maze-Example -So, for now, **our Q-Table is useless**; we need **to train our Q-function using the Q-Learning algorithm.** +So, for now, **our Q-table is useless**; we need **to train our Q-function using the Q-Learning algorithm.** Let's do it for 2 training timesteps: @@ -80,4 +80,4 @@ Because I go to the poison state, **I get \\(R_{t+1} = -10\\), and I die.** Because we're dead, we start a new episode. But what we see here is that **with two explorations steps, my agent became smarter.** -As we continue exploring and exploiting the environment and updating Q-values using TD target, **Q-Table will give us better and better approximations. And thus, at the end of the training, we'll get an estimate of the optimal Q-Function.** +As we continue exploring and exploiting the environment and updating Q-values using TD target, **Q-table will give us better and better approximations. 
And thus, at the end of the training, we'll get an estimate of the optimal Q-function.** diff --git a/units/en/unit2/q-learning-recap.mdx b/units/en/unit2/q-learning-recap.mdx index 55c66bf..ab3b974 100644 --- a/units/en/unit2/q-learning-recap.mdx +++ b/units/en/unit2/q-learning-recap.mdx @@ -3,20 +3,20 @@ The *Q-Learning* **is the RL algorithm that** : -- Trains *Q-Function*, an **action-value function** that contains, as internal memory, a *Q-table* **that contains all the state-action pair values.** +- Trains *Q-function*, an **action-value function** that contains, as internal memory, a *Q-table* **that contains all the state-action pair values.** -- Given a state and action, our Q-Function **will search into its Q-table the corresponding value.** +- Given a state and action, our Q-function **will search into its Q-table the corresponding value.** Q function -- When the training is done,**we have an optimal Q-Function, so an optimal Q-Table.** +- When the training is done,**we have an optimal Q-function, so an optimal Q-table.** - And if we **have an optimal Q-function**, we have an optimal policy,since we **know for each state, what is the best action to take.** Link value policy -But, in the beginning, our **Q-Table is useless since it gives arbitrary value for each state-action pair (most of the time we initialize the Q-Table to 0 values)**. But, as we’ll explore the environment and update our Q-Table it will give us better and better approximations +But, in the beginning, our **Q-table is useless since it gives arbitrary value for each state-action pair (most of the time we initialize the Q-table to 0 values)**. 
But, as we’ll explore the environment and update our Q-table it will give us better and better approximations q-learning.jpeg diff --git a/units/en/unit2/quiz2.mdx b/units/en/unit2/quiz2.mdx index 9d96d74..961d477 100644 --- a/units/en/unit2/quiz2.mdx +++ b/units/en/unit2/quiz2.mdx @@ -9,7 +9,7 @@ The best way to learn and [to avoid the illusion of competence](https://www.cour -### Q2: What is a Q-Table? +### Q2: What is a Q-table? Date: Sat, 17 Dec 2022 22:33:02 +0100 Subject: [PATCH 8/8] Rephrasing for initial epsilon value --- units/en/unit2/q-learning.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/units/en/unit2/q-learning.mdx b/units/en/unit2/q-learning.mdx index e78a598..2dd7190 100644 --- a/units/en/unit2/q-learning.mdx +++ b/units/en/unit2/q-learning.mdx @@ -80,7 +80,7 @@ We need to initialize the Q-table for each state-action pair. **Most of the tim Epsilon greedy strategy is a policy that handles the exploration/exploitation trade-off. -The idea is that we define epsilon ɛ ≤ 1.0: +The idea is that we define the initial epsilon ɛ = 1.0: - *With probability 1 — ɛ* : we do **exploitation** (aka our agent selects the action with the highest state-action pair value). - With probability ɛ: **we do exploration** (trying random action).
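Reviewer note on the epsilon wording settled in PATCH 8/8: the steps the touched text describes (Q-table initialized to 0, an initial ɛ = 1.0 that is progressively reduced, an epsilon-greedy acting policy, and a greedy choice inside the TD target) can be sketched as below. This is a minimal illustration and not code from the course. The function names, the decay schedule, and the `alpha`, `gamma`, `eps_min`, `decay`, and `done` parameters are assumptions introduced here.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """Acting policy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        # Exploration: try a random action.
        return rng.randrange(len(q_row))
    # Exploitation: take the action with the highest state-action value.
    return max(range(len(q_row)), key=lambda a: q_row[a])


def decayed_epsilon(step, eps_start=1.0, eps_min=0.05, decay=0.01):
    """Hypothetical schedule: start at eps_start and shrink toward eps_min."""
    return max(eps_min, eps_start * (1.0 - decay) ** step)


def q_update(q_table, state, action, reward, next_state,
             alpha=0.1, gamma=0.99, done=False):
    """Updating policy: the TD target uses the greedy (max) next value."""
    best_next = 0.0 if done else max(q_table[next_state])
    td_target = reward + gamma * best_next
    q_table[state][action] += alpha * (td_target - q_table[state][action])


# Q-table for 3 states and 4 actions, initialized with values of 0.
q_table = [[0.0] * 4 for _ in range(3)]
rng = random.Random(0)
action = epsilon_greedy(q_table[0], decayed_epsilon(0), rng)
q_update(q_table, state=0, action=action, reward=1.0, next_state=1)
```

With ɛ = 1.0 at step 0 the first action is always a random one, which matches the "initial epsilon" phrasing. The acting function samples with ɛ while the update always takes the maximum, which is the off-policy distinction the surrounding text draws against Sarsa.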