From 630b80a00ffb780a0f576f6c9a0eddf0f5990e3e Mon Sep 17 00:00:00 2001
From: Thomas Simonini
Date: Tue, 20 Dec 2022 13:54:08 +0100
Subject: [PATCH 1/6] Update hands-on.mdx

---
 units/en/unit1/hands-on.mdx | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/units/en/unit1/hands-on.mdx b/units/en/unit1/hands-on.mdx
index 419aefd..d3645de 100644
--- a/units/en/unit1/hands-on.mdx
+++ b/units/en/unit1/hands-on.mdx
@@ -34,6 +34,7 @@ You can either do this hands-on by reading the notebook or following it with the
 
 
+
 # Unit 1: Train your first Deep Reinforcement Learning Agent 🤖
 
 Unit 1 thumbnail
 
@@ -42,9 +43,6 @@ In this notebook, you'll train your **first Deep Reinforcement Learning agent**
 
 ⬇️ Here is an example of what **you will achieve in just a couple of minutes.** ⬇️
 
 
-
-
-
 ```python
 %%html
@@ -71,7 +69,7 @@ At the end of the notebook, you will:
 
 
-## This notebook is from Deep Reinforcement Learning Course
+## This hands-on is from the Deep Reinforcement Learning Course
 
 Deep RL Course illustration
 
 In this free course, you will:
@@ -90,7 +88,7 @@ The best way to keep in touch and ask questions is to join our discord server to
 
 ## Prerequisites 🏗️
 Before diving into the notebook, you need to:
 
-🔲 📝 **Done Unit 0** that gives you all the **information about the course and help you to onboard** 🤗
+🔲 📝 **Read Unit 0**, which gives you all the **information about the course and helps you onboard** 🤗
 
 🔲 📚 **Develop an understanding of the foundations of Reinforcement learning** (MC, TD, Rewards hypothesis...) by doing Unit 1

From 31dc00a52bf795aa583acf389c3321480329c7a1 Mon Sep 17 00:00:00 2001
From: Thomas Simonini
Date: Tue, 20 Dec 2022 13:59:12 +0100
Subject: [PATCH 2/6] Update additional-readings.mdx

Add make your own gym custom env
---
 units/en/unit1/additional-readings.mdx | 1 +
 1 file changed, 1 insertion(+)

diff --git a/units/en/unit1/additional-readings.mdx b/units/en/unit1/additional-readings.mdx
index b881244..d1f1820 100644
--- a/units/en/unit1/additional-readings.mdx
+++ b/units/en/unit1/additional-readings.mdx
@@ -11,3 +11,4 @@ These are **optional readings** if you want to go deeper.
 ## Gym [[gym]]
 
 - [Getting Started With OpenAI Gym: The Basic Building Blocks](https://blog.paperspace.com/getting-started-with-openai-gym/)
+- [Make your own Gym custom environment](https://www.gymlibrary.dev/content/environment_creation/)

From beaef9b0a44cb5a5b618d1c4a5668fd7143eec06 Mon Sep 17 00:00:00 2001
From: Thomas Simonini
Date: Tue, 20 Dec 2022 14:02:46 +0100
Subject: [PATCH 3/6] Update two-types-value-based-methods.mdx

---
 units/en/unit2/two-types-value-based-methods.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/units/en/unit2/two-types-value-based-methods.mdx b/units/en/unit2/two-types-value-based-methods.mdx
index 3422e7d..df83311 100644
--- a/units/en/unit2/two-types-value-based-methods.mdx
+++ b/units/en/unit2/two-types-value-based-methods.mdx
@@ -36,7 +36,7 @@ Consequently, whatever method you use to solve your problem, **you will have a
 So the difference is:
 
 - In policy-based, **the optimal policy (denoted π\*) is found by training the policy directly.**
-- In value-based, **finding an optimal value function (denoted Q\* or V\*, we'll study the difference after) in our leads to having an optimal policy.**
+- In value-based, **finding an optimal value function (denoted Q\* or V\*, we'll study the difference later) leads to having an optimal policy.**
 
 Link between value and policy

From 3bdc44cd354cc2434caab99a41475c586ff3c7d9 Mon Sep 17 00:00:00 2001
From: Thomas Simonini
Date: Tue, 20 Dec 2022 14:05:29 +0100
Subject: [PATCH 4/6] Update bellman-equation.mdx

---
 units/en/unit2/bellman-equation.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/units/en/unit2/bellman-equation.mdx b/units/en/unit2/bellman-equation.mdx
index 6979d23..f8f99f7 100644
--- a/units/en/unit2/bellman-equation.mdx
+++ b/units/en/unit2/bellman-equation.mdx
@@ -58,6 +58,6 @@ But you'll study an example with gamma = 0.99 in the Q-Learning section of this
 
 
 
-To recap, the idea of the Bellman equation is that instead of calculating each value as the sum of the expected return, **which is a long process.** This is equivalent **to the sum of immediate reward + the discounted value of the state that follows.**
+To recap, the idea of the Bellman equation is that instead of calculating each value as the sum of the expected return, **which is a long process.**, we calculate the value as **the sum of immediate reward + the discounted value of the state that follows.**
 
 Before going to the next section, think about the role of gamma in the Bellman equation. What happens if the value of gamma is very low (e.g. 0.1 or even 0)? What happens if the value is 1? What happens if the value is very high, such as a million?

From 5f66e674196a99cb724159af521e0997375233dd Mon Sep 17 00:00:00 2001
From: Thomas Simonini
Date: Tue, 20 Dec 2022 14:06:10 +0100
Subject: [PATCH 5/6] Update mc-vs-td.mdx

---
 units/en/unit2/mc-vs-td.mdx | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/units/en/unit2/mc-vs-td.mdx b/units/en/unit2/mc-vs-td.mdx
index 1d3517f..bac1639 100644
--- a/units/en/unit2/mc-vs-td.mdx
+++ b/units/en/unit2/mc-vs-td.mdx
@@ -76,8 +76,7 @@ For instance, if we train a state-value function using Monte Carlo:
 
 ## Temporal Difference Learning: learning at each step [[td-learning]]
 
-- **Temporal Difference, on the other hand, waits for only one interaction (one step) \\(S_{t+1}\\)**
-- to form a TD target and update \\(V(S_t)\\) using \\(R_{t+1}\\) and \\( \gamma * V(S_{t+1})\\).
+**Temporal Difference, on the other hand, waits for only one interaction (one step) \\(S_{t+1}\\)** to form a TD target and update \\(V(S_t)\\) using \\(R_{t+1}\\) and \\( \gamma * V(S_{t+1})\\).
 
 The idea with **TD is to update the \\(V(S_t)\\) at each step.**
 

From 7b61d9f813e60241982ce67bedf2cc751741fa5a Mon Sep 17 00:00:00 2001
From: Thomas Simonini
Date: Tue, 20 Dec 2022 14:20:40 +0100
Subject: [PATCH 6/6] Update bellman-equation.mdx

---
 units/en/unit2/bellman-equation.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/units/en/unit2/bellman-equation.mdx b/units/en/unit2/bellman-equation.mdx
index f8f99f7..4c801ae 100644
--- a/units/en/unit2/bellman-equation.mdx
+++ b/units/en/unit2/bellman-equation.mdx
@@ -58,6 +58,6 @@ But you'll study an example with gamma = 0.99 in the Q-Learning section of this
 
 
 
-To recap, the idea of the Bellman equation is that instead of calculating each value as the sum of the expected return, **which is a long process.**, we calculate the value as **the sum of immediate reward + the discounted value of the state that follows.**
+To recap, the idea of the Bellman equation is that instead of calculating each value as the sum of the expected return, **which is a long process**, we calculate the value as **the sum of immediate reward + the discounted value of the state that follows.**
 
 Before going to the next section, think about the role of gamma in the Bellman equation. What happens if the value of gamma is very low (e.g. 0.1 or even 0)? What happens if the value is 1? What happens if the value is very high, such as a million?
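
Reviewer's note: the update rules these patches edit — the Bellman/TD target in `bellman-equation.mdx` and the Monte Carlo vs. TD contrast in `mc-vs-td.mdx` — can be sketched in a few lines of Python. This is an illustrative sketch only: the toy episode, state names, and constants below are invented for this note and are not part of the course notebooks.

```python
# Illustrative sketch of the two update rules: Monte Carlo waits for the full
# return G_t at the end of the episode, while TD(0) bootstraps from the very
# next state. The toy episode and constants are made up for illustration.

gamma = 0.99  # discount factor, as in the Bellman equation section
alpha = 0.1   # learning rate

# A tiny hand-made episode: (state S_t, reward R_{t+1} received on leaving it)
episode = [("s0", 1.0), ("s1", 0.0), ("s2", 2.0)]


def mc_update(V, episode):
    """Monte Carlo: wait until the episode ends, then move each V(S_t)
    toward the full discounted return G_t."""
    G = 0.0
    for state, reward in reversed(episode):  # accumulate G_t backwards
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])


def td_update(V, state, reward, next_state):
    """TD(0): update V(S_t) after a single step, using the TD target
    R_{t+1} + gamma * V(S_{t+1})."""
    td_target = reward + gamma * V[next_state]
    V[state] += alpha * (td_target - V[state])
```

Note that `td_update` only needs the current estimate of the next state's value, which is why TD can learn before an episode finishes, while `mc_update` cannot run until the whole episode is collected.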