mirror of
https://github.com/huggingface/deep-rl-class.git
synced 2026-04-05 11:38:43 +08:00
Merge pull request #231 from huggingface/ThomasSimonini/SundayUpdate
Sunday Update of the Course
@@ -1099,7 +1099,7 @@
 "\n",
 "Take time to really **grasp the material before continuing and try the additional challenges**. It’s important to master these elements and having a solid foundations.\n",
 "\n",
-"Naturally, during the course, we’re going to use and deeper explain again these terms but **it’s better to have a good understanding of them now before diving into the next chapters.**\n"
+"Naturally, during the course, we’re going to dive deeper into these concepts but **it’s better to have a good understanding of them now before diving into the next chapters.**\n\n"
 ]
 },
 {
@@ -511,7 +511,7 @@
 },
 "outputs": [],
 "source": [
-"# Let's create our Qtable of size (state_space, action_space) and initialized each values at 0 using np.zeros\n",
+"# Let's create our Qtable of size (state_space, action_space) and initialized each values at 0 using np.zeros. np.zeros needs a tuple (a,b)\n",
 "def initialize_q_table(state_space, action_space):\n",
 "  Qtable = \n",
 "  return Qtable"
@@ -301,7 +301,7 @@
 "## Train our Deep Q-Learning Agent to Play Space Invaders 👾\n",
 "\n",
 "To train an agent with RL-Baselines3-Zoo, we just need to do two things:\n",
-"1. We define the hyperparameters in `rl-baselines3-zoo/hyperparams/dqn.yml`\n",
+"1. We define the hyperparameters in `/content/rl-baselines3-zoo/hyperparams/dqn.yml`\n",
 "\n",
 "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit3/hyperparameters.png\" alt=\"DQN Hyperparameters\">\n"
 ]
@@ -22,6 +22,8 @@ To validate this hands-on for the [certification process](https://huggingface.co
 
 To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**
 
+**If you don't find your model, go to the bottom of the page and click on the refresh button.**
+
 For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
 
 And you can check your progress here 👉 https://huggingface.co/spaces/ThomasSimonini/Check-my-progress-Deep-RL-Course
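The leaderboard score described above (`result = mean_reward - std of reward`) is easy to reproduce locally; a minimal sketch, where the episode rewards are made-up values for illustration:

```python
import numpy as np

# hypothetical evaluation rewards from a handful of episodes
episode_rewards = np.array([500.0, 650.0, 420.0, 580.0, 510.0])

mean_reward = episode_rewards.mean()
std_reward = episode_rewards.std()

# the leaderboard score: result = mean_reward - std of reward
result = mean_reward - std_reward
print(mean_reward, std_reward, result)
```

Penalizing by the standard deviation rewards agents that score consistently, not just ones with a few lucky episodes.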
@@ -657,8 +659,7 @@ If you’re still feel confused with all these elements...it's totally normal! **
 
 Take time to really **grasp the material before continuing and try the additional challenges**. It’s important to master these elements and having a solid foundations.
 
-Naturally, during the course, we’re going to use and deeper explain again these terms but **it’s better to have a good understanding of them now before diving into the next chapters.**
-
+Naturally, during the course, we’re going to dive deeper into these concepts but **it’s better to have a good understanding of them now before diving into the next chapters.**
 
 Next time, in the bonus unit 1, you'll train Huggy the Dog to fetch the stick.
@@ -16,6 +16,8 @@ Now that we studied the Q-Learning algorithm, let's implement it from scratch an
 
 Thanks to a [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard), you'll be able to compare your results with other classmates and exchange the best practices to improve your agent's scores. Who will win the challenge for Unit 2?
 
+**If you don't find your model, go to the bottom of the page and click on the refresh button.**
+
 To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained Taxi model to the Hub and **get a result of >= 4.5**.
 
 To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**
@@ -259,7 +261,8 @@ print("There are ", action_space, " possible actions")
 ```
 
 ```python
-# Let's create our Qtable of size (state_space, action_space) and initialized each values at 0 using np.zeros
+# Let's create our Qtable of size (state_space, action_space) and initialized each values at 0 using np.zeros. np.zeros needs a tuple (a,b)
+
 def initialize_q_table(state_space, action_space):
   Qtable = 
   return Qtable
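The exercise cell in the hunk above leaves `Qtable` blank on purpose. A minimal completion, following the diff's own hint that `np.zeros` needs a shape tuple `(a,b)` — the `16, 4` sizes below are illustrative (FrozenLake-style), not part of the commit:

```python
import numpy as np

def initialize_q_table(state_space, action_space):
    # np.zeros needs a tuple (a,b): one row per state, one column per action
    Qtable = np.zeros((state_space, action_space))
    return Qtable

# e.g. a 16-state, 4-action environment
Qtable = initialize_q_table(16, 4)
print(Qtable.shape)  # (16, 4)
```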
@@ -22,6 +22,8 @@ To validate this hands-on for the certification process, you need to push your t
 
 To find your result, go to the leaderboard and find your model, **the result = mean_reward - std of reward**
 
+**If you don't find your model, go to the bottom of the page and click on the refresh button.**
+
 For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
 
 And you can check your progress here 👉 https://huggingface.co/spaces/ThomasSimonini/Check-my-progress-Deep-RL-Course
@@ -36,7 +38,7 @@ And you can check your progress here 👉 https://huggingface.co/spaces/ThomasSi
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/thumbnail.jpg" alt="Unit 3 Thumbnail">
 
 In this notebook, **you'll train a Deep Q-Learning agent** playing Space Invaders using [RL Baselines3 Zoo](https://github.com/DLR-RM/rl-baselines3-zoo), a training framework based on [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/) that provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.
 
 We're using the [RL-Baselines-3 Zoo integration, a vanilla version of Deep Q-Learning](https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html) with no extensions such as Double-DQN, Dueling-DQN, and Prioritized Experience Replay.
@@ -131,7 +133,7 @@ pip install -r requirements.txt
 ## Train our Deep Q-Learning Agent to Play Space Invaders 👾
 
 To train an agent with RL-Baselines3-Zoo, we just need to do two things:
-1. We define the hyperparameters in `rl-baselines3-zoo/hyperparams/dqn.yml`
+1. We define the hyperparameters in `/content/rl-baselines3-zoo/hyperparams/dqn.yml`
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit3/hyperparameters.png" alt="DQN Hyperparameters">
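For orientation, an Atari entry in `dqn.yml` looks roughly like the following. This is an illustrative sketch, not a copy of the Zoo's shipped config: the keys mirror common Stable-Baselines3 DQN constructor arguments, and the values are plausible placeholders.

```yaml
SpaceInvadersNoFrameskip-v4:
  env_wrapper:
    - stable_baselines3.common.atari_wrappers.AtariWrapper
  frame_stack: 4
  policy: 'CnnPolicy'
  n_timesteps: !!float 1e7
  buffer_size: 100000
  learning_rate: !!float 1e-4
  batch_size: 32
  learning_starts: 100000
  target_update_interval: 1000
  train_freq: 4
  exploration_fraction: 0.1
  exploration_final_eps: 0.01
```

With the entry in place, training is launched from the zoo folder with something like `python train.py --algo dqn --env SpaceInvadersNoFrameskip-v4 -f logs/` (the `-f logs/` output folder is an assumption here).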
@@ -26,6 +26,8 @@ To validate this hands-on for the certification process, you need to push your t
 
 To find your result, go to the leaderboard and find your model, **the result = mean_reward - std of reward**. **If you don't see your model on the leaderboard, go at the bottom of the leaderboard page and click on the refresh button**.
 
+**If you don't find your model, go to the bottom of the page and click on the refresh button.**
+
 For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
 
 And you can check your progress here 👉 https://huggingface.co/spaces/ThomasSimonini/Check-my-progress-Deep-RL-Course
@@ -373,11 +375,11 @@ The second question you may ask is **why do we minimize the loss**? Did you talk
 
 - We want to maximize our utility function $J(\theta)$, but in PyTorch and TensorFlow, it's better to **minimize an objective function.**
 - So let's say we want to reinforce action 3 at a certain timestep. Before training this action P is 0.25.
-- So we want to modify $\theta$ such that $\pi_\theta(a_3|s; \theta) > 0.25$
-- Because all P must sum to 1, max $\pi_\theta(a_3|s; \theta)$ will **minimize other action probability.**
-- So we should tell PyTorch **to min $1 - \pi_\theta(a_3|s; \theta)$.**
-- This loss function approaches 0 as $\pi_\theta(a_3|s; \theta)$ nears 1.
-- So we are encouraging the gradient to max $\pi_\theta(a_3|s; \theta)$
+- So we want to modify \\( \theta \\) such that \\( \pi_\theta(a_3|s; \theta) > 0.25 \\)
+- Because all P must sum to 1, max \\( \pi_\theta(a_3|s; \theta) \\) will **minimize other action probability.**
+- So we should tell PyTorch **to min \\( 1 - \pi_\theta(a_3|s; \theta) \\).**
+- This loss function approaches 0 as \\( \pi_\theta(a_3|s; \theta) \\) nears 1.
+- So we are encouraging the gradient to max \\( \pi_\theta(a_3|s; \theta) \\)
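The bullets above can be checked with a toy numpy sketch (illustrative only, not the notebook's PyTorch code): descending the loss \\( 1 - \pi_\theta(a_3|s; \theta) \\) pushes the probability of action 3 up and, since probabilities sum to 1, the others down.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.zeros(4)  # logits for 4 actions: P is uniform, so P(a_3) = 0.25
a = 2                # index of "action 3"

for _ in range(10):
    pi = softmax(theta)
    # loss = 1 - pi[a]; via the softmax Jacobian,
    # d pi[a] / d theta[j] = pi[a] * ((j == a) - pi[j])
    grad_loss = -(pi[a] * ((np.arange(4) == a).astype(float) - pi))
    theta -= 1.0 * grad_loss  # descending the loss ascends pi[a]

pi = softmax(theta)
print(pi[a] > 0.25)  # True: action 3 is now reinforced
```

In practice the course's notebooks minimize \\( -\log \pi_\theta(a|s) \\) weighted by the return, but the mechanism — descend a loss to ascend a probability — is the same.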
@@ -16,7 +16,7 @@ On the other hand, your friend (Critic) will also update their way to provide fe
 
 This is the idea behind Actor-Critic. We learn two function approximations:
 
-- *A policy* that **controls how our agent acts**: \\( \pi_{\theta}(s,a) \\)
+- *A policy* that **controls how our agent acts**: \\( \pi_{\theta}(s) \\)
 
 - *A value function* to assist the policy update by measuring how good the action taken is: \\( \hat{q}_{w}(s,a) \\)
@@ -24,7 +24,7 @@ This is the idea behind Actor-Critic. We learn two function approximations:
 Now that we have seen the Actor Critic's big picture, let's dive deeper to understand how Actor and Critic improve together during the training.
 
 As we saw, with Actor-Critic methods, there are two function approximations (two neural networks):
-- *Actor*, a **policy function** parameterized by theta: \\( \pi_{\theta}(s,a) \\)
+- *Actor*, a **policy function** parameterized by theta: \\( \pi_{\theta}(s) \\)
 - *Critic*, a **value function** parameterized by w: \\( \hat{q}_{w}(s,a) \\)
 
 Let's see the training process to understand how Actor and Critic are optimized:
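A tabular toy sketch of the two approximations and one joint update — all names, sizes, and step sizes here are made up for illustration (the course's actual Actor and Critic are neural networks, not tables):

```python
import numpy as np

n_states, n_actions = 4, 2
alpha_actor, alpha_critic, gamma = 0.1, 0.1, 0.99

theta = np.zeros((n_states, n_actions))  # Actor parameters for pi_theta(s)
w = np.zeros((n_states, n_actions))      # Critic parameters for q_w(s, a)

def pi(s):
    # softmax policy over the Actor's parameters for state s
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

def update(s, a, r, s_next, a_next):
    # Critic: TD update of q_w(s, a) toward r + gamma * q_w(s', a')
    td_error = r + gamma * w[s_next, a_next] - w[s, a]
    w[s, a] += alpha_critic * td_error
    # Actor: policy-gradient step on log pi_theta(a|s),
    # scaled by the Critic's estimate of how good the action was
    grad_log = -pi(s)
    grad_log[a] += 1.0
    theta[s] += alpha_actor * w[s, a] * grad_log

update(0, 1, 1.0, 1, 0)  # a rewarded action becomes more probable
```

The Critic's estimate feeds the Actor's step size, which is exactly the "feedback from your friend" analogy in the text.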
@@ -28,6 +28,8 @@ To validate this hands-on for the certification process, you need to push your t
 
 To find your result, [go to the leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**
 
+**If you don't find your model, go to the bottom of the page and click on the refresh button.**
+
 For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
 
 **To start the hands-on click on Open In Colab button** 👇 :