From 753ef67eae0507c70121a594464127a0fedaa951 Mon Sep 17 00:00:00 2001
From: Artagon
Date: Sat, 17 Dec 2022 14:45:08 +0100
Subject: [PATCH] epsilon-greedy instead of epsilon greedy

---
 units/en/unit2/q-learning.mdx | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/units/en/unit2/q-learning.mdx b/units/en/unit2/q-learning.mdx
index 605f506..48f01d2 100644
--- a/units/en/unit2/q-learning.mdx
+++ b/units/en/unit2/q-learning.mdx
@@ -73,7 +73,7 @@ This is the Q-Learning pseudocode; let's study each part and **see how it works
 
 We need to initialize the Q-table for each state-action pair. **Most of the time, we initialize with values of 0.**
 
-### Step 2: Choose action using epsilon greedy strategy [[step2]]
+### Step 2: Choose action using epsilon-greedy strategy [[step2]]
 
 Q-learning
 
@@ -114,7 +114,7 @@ It means that to update our \\(Q(S_t, A_t)\\):
 How do we form the TD target?
 
 1. We obtain the reward after taking the action \\(R_{t+1}\\).
-2. To get the **best next-state-action pair value**, we use a greedy policy to select the next best action. Note that this is not an epsilon greedy policy, this will always take the action with the highest state-action value.
+2. To get the **best next-state-action pair value**, we use a greedy policy to select the next best action. Note that this is not an epsilon-greedy policy, this will always take the action with the highest state-action value.
 
 Then when the update of this Q-value is done, we start in a new state and select our action **using a epsilon-greedy policy again.**
 
@@ -126,7 +126,7 @@ The difference is subtle:
 
 - *Off-policy*: using **a different policy for acting (inference) and updating (training).**
 
-For instance, with Q-Learning, the epsilon greedy policy (acting policy), is different from the greedy policy that is **used to select the best next-state action value to update our Q-value (updating policy).**
+For instance, with Q-Learning, the epsilon-greedy policy (acting policy), is different from the greedy policy that is **used to select the best next-state action value to update our Q-value (updating policy).**
 
@@ -144,7 +144,7 @@ Is different from the policy we use during the training part:
 
 - *On-policy:* using the **same policy for acting and updating.**
 
-For instance, with Sarsa, another value-based algorithm, **the epsilon greedy Policy selects the next state-action pair, not a greedy policy.**
+For instance, with Sarsa, another value-based algorithm, **the epsilon-greedy Policy selects the next state-action pair, not a greedy policy.**
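
The change above is purely editorial (hyphenating "epsilon-greedy"), but the passages it touches describe the distinction between acting with an epsilon-greedy policy and bootstrapping with a greedy one. As a rough illustration of that distinction, the sketch below contrasts the off-policy Q-Learning update (TD target uses the greedy max over next-state actions) with the on-policy Sarsa update (TD target uses the epsilon-greedy-chosen next action). The table sizes and hyperparameters are made up for the example:

```python
import random

# Hypothetical sizes and hyperparameters, just for illustration.
N_STATES, N_ACTIONS = 5, 2
ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1

# Step 1: initialize the Q-table with zeros for every state-action pair.
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def greedy_action(state):
    """Greedy policy: always take the action with the highest Q-value."""
    return max(range(N_ACTIONS), key=lambda a: Q[state][a])

def epsilon_greedy_action(state):
    """Epsilon-greedy policy: explore with probability epsilon, else act greedily."""
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    return greedy_action(state)

def q_learning_update(s, a, r, s_next):
    # Off-policy: the TD target bootstraps with the *greedy* value max_a Q(s', a),
    # even though actions during interaction are chosen epsilon-greedily.
    td_target = r + GAMMA * Q[s_next][greedy_action(s_next)]
    Q[s][a] += ALPHA * (td_target - Q[s][a])

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy: the TD target uses Q(s', a') where a' was itself selected by
    # the epsilon-greedy policy -- the same policy used for acting.
    td_target = r + GAMMA * Q[s_next][a_next]
    Q[s][a] += ALPHA * (td_target - Q[s][a])
```

The only difference between the two updates is how the next-state value in the TD target is chosen, which is exactly the off-policy/on-policy contrast the patched text draws.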