From ff6e547e49dcbe8d4cba7117cffad90d8ea198cd Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Tue, 6 Dec 2022 10:14:12 +0100
Subject: [PATCH 01/49] Add Unit 2

---
 units/en/unit2/additional-readings.mdx        |  15 ++
 units/en/unit2/bellman-equation.mdx           |  57 +++++++
 units/en/unit2/conclusion.mdx                 |  19 +++
 units/en/unit2/hands-on.mdx                   |  14 ++
 units/en/unit2/introduction.mdx               |  26 +++
 units/en/unit2/mc-vs-td.mdx                   | 126 +++++++++++++++
 units/en/unit2/q-learning-example.mdx         |  83 ++++++++++
 units/en/unit2/q-learning.mdx                 | 153 ++++++++++++++++++
 units/en/unit2/quiz1.mdx                      | 105 ++++++++++++
 units/en/unit2/quiz2.mdx                      |  97 +++++++++++
 units/en/unit2/summary1.mdx                   |  17 ++
 .../unit2/two-types-value-based-methods.mdx   |  86 ++++++++++
 units/en/unit2/what-is-rl.mdx                 |  25 +++
 13 files changed, 823 insertions(+)
 create mode 100644 units/en/unit2/additional-readings.mdx
 create mode 100644 units/en/unit2/bellman-equation.mdx
 create mode 100644 units/en/unit2/conclusion.mdx
 create mode 100644 units/en/unit2/hands-on.mdx
 create mode 100644 units/en/unit2/introduction.mdx
 create mode 100644 units/en/unit2/mc-vs-td.mdx
 create mode 100644 units/en/unit2/q-learning-example.mdx
 create mode 100644 units/en/unit2/q-learning.mdx
 create mode 100644 units/en/unit2/quiz1.mdx
 create mode 100644 units/en/unit2/quiz2.mdx
 create mode 100644 units/en/unit2/summary1.mdx
 create mode 100644 units/en/unit2/two-types-value-based-methods.mdx
 create mode 100644 units/en/unit2/what-is-rl.mdx
diff --git a/units/en/unit2/additional-readings.mdx b/units/en/unit2/additional-readings.mdx
new file mode 100644
index 0000000..9a14724
--- /dev/null
+++ b/units/en/unit2/additional-readings.mdx
@@ -0,0 +1,15 @@
+# Additional Readings [[additional-readings]]
+
+These are **optional readings** if you want to go deeper.
+
+## Monte Carlo and TD Learning [[mc-td]]
+
+To dive deeper into Monte Carlo and Temporal Difference Learning:
+
+- <a href="https://stats.stackexchange.com/questions/355820/why-do-temporal-difference-td-methods-have-lower-variance-than-monte-carlo-met">Why do temporal difference (TD) methods have lower variance than Monte Carlo methods?</a>
+- <a href="https://stats.stackexchange.com/questions/336974/when-are-monte-carlo-methods-preferred-over-temporal-difference-ones">When are Monte Carlo methods preferred over temporal difference ones?</a>
+
+## Q-Learning [[q-learning]]
+
+- <a href="http://incompleteideas.net/book/RLbook2020.pdf">Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, Chapters 5, 6 and 7</a>
+- <a href="https://youtu.be/Psrhxy88zww">Foundations of Deep RL Series, L2 Deep Q-Learning by Pieter Abbeel</a>
diff --git a/units/en/unit2/bellman-equation.mdx b/units/en/unit2/bellman-equation.mdx
new file mode 100644
index 0000000..6d224f0
--- /dev/null
+++ b/units/en/unit2/bellman-equation.mdx
@@ -0,0 +1,57 @@
+# The Bellman Equation: simplify our value estimation [[bellman-equation]]
+
+The Bellman equation **simplifies our state value or state-action value calculation.**
+
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman.jpg" alt="Bellman equation"/>
+
+With what we have learned so far, we know that to calculate \\(V(S_t)\\) (the value of a state), we need to calculate the return obtained by starting at that state and then following the policy forever after. **(The policy we defined in the following example is a Greedy Policy; for simplification, we don't discount the reward).**
+
+So to calculate \\(V(S_t)\\), we need to calculate the sum of the expected rewards. Hence:
+
+<figure>
+  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman2.jpg" alt="Bellman equation"/>
+  <figcaption>To calculate the value of State 1: the sum of rewards if the agent started in that state and then followed the greedy policy (taking actions that lead to the best state values) for all the time steps.</figcaption>
+</figure>
+
+Then, to calculate the \\(V(S_{t+1})\\), we need to calculate the return starting at that state \\(S_{t+1}\\).
+
+<figure>
+  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman3.jpg" alt="Bellman equation"/>
+  <figcaption>To calculate the value of State 2: the sum of rewards if the agent started in that state and then followed the policy for all the time steps.</figcaption>
+</figure>
+
+So you see, that's a pretty tedious process if you need to do it for each state value or state-action value.
+
+Instead of calculating the expected return for each state or each state-action pair, **we can use the Bellman equation.**
+
+The Bellman equation is a recursive equation that works like this: instead of starting for each state from the beginning and calculating the return, we can consider the value of any state as:
+
+**The immediate reward \\(R_{t+1}\\) + the discounted value of the state that follows (\\(gamma * V(S_{t+1})\\)).**
+
+<figure>
+  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman4.jpg" alt="Bellman equation"/>
+  <figcaption>For simplification, here we don’t discount so gamma = 1.</figcaption>
+</figure>
+
+
+If we go back to our example, we can say that the value of State 1 is equal to the expected cumulative return if we start at that state.
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman2.jpg" alt="Bellman equation"/>
+
+
+To calculate the value of State 1: the sum of rewards **if the agent started in State 1** and then followed the **policy for all the time steps.**
+
+This is equivalent to  \\(V(S_{t})\\)  = Immediate reward  \\(R_{t+1}\\)  + Discounted value of the next state  \\(gamma * V(S_{t+1})\\)
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman6.jpg" alt="Bellman equation"/>
+
+
+In the interest of simplicity, here we don't discount, so gamma = 1.
+
+- The value of  \\(V(S_{t+1}) \\)  = Immediate reward  \\(R_{t+2}\\)  + Discounted value of the next state ( \\(gamma * V(S_{t+2})\\) ).
+- And so on.
+
+To recap, the idea of the Bellman equation is that instead of calculating each value as the sum of the expected return, **which is a long process**, we calculate the value as **the sum of the immediate reward + the discounted value of the state that follows.**
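+
+As a hedged sketch of why this works (standard notation, assuming a fixed policy and writing \\(G_t\\) for the return collected from time step \\(t\\) onward), the return itself is recursive:
+
+\\(G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ... = R_{t+1} + gamma * G_{t+1}\\)
+
+Taking the expectation over everything that can happen after starting in a state \\(s\\) then gives the recursive form of the value:
+
+\\(V(s) = E[R_{t+1} + gamma * V(S_{t+1}) | S_t = s]\\)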
+
+Before going to the next section, think about the role of gamma in the Bellman equation. What happens if the value of gamma is very low (e.g. 0.1 or even 0)? What happens if the value is 1? What happens if the value is very high, such as a million?
diff --git a/units/en/unit2/conclusion.mdx b/units/en/unit2/conclusion.mdx
new file mode 100644
index 0000000..f271ce0
--- /dev/null
+++ b/units/en/unit2/conclusion.mdx
@@ -0,0 +1,19 @@
+# Conclusion [[conclusion]]
+
+Congrats on finishing this chapter! There was a lot of information. And congrats on finishing the tutorials. You’ve just implemented your first RL agent from scratch and shared it on the Hub 🥳.
+
+Implementing from scratch when you study a new architecture **is important to understand how it works.**
+
+It’s **normal if you still feel confused** by all these elements. **It was the same for me and for everyone who has studied RL.**
+
+Take time to really grasp the material before continuing.
+
+
+In the next chapter, we’re going to dive deeper by studying our first Deep Reinforcement Learning algorithm based on Q-Learning: Deep Q-Learning. And you'll train a **DQN agent with <a href="https://github.com/DLR-RM/rl-baselines3-zoo">RL-Baselines3 Zoo</a> to play Atari Games**.
+
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Atari environments"/>
+
+
+
+### Keep Learning, stay awesome 🤗
diff --git a/units/en/unit2/hands-on.mdx b/units/en/unit2/hands-on.mdx
new file mode 100644
index 0000000..d683cac
--- /dev/null
+++ b/units/en/unit2/hands-on.mdx
@@ -0,0 +1,14 @@
+# Hands-on [[hands-on]]
+
+Now that we studied the Q-Learning algorithm, let's implement it from scratch and train our Q-Learning agent in two environments:
+1. [Frozen-Lake-v1  (non-slippery and slippery version)](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/) ☃️ : where our agent will need to **go from the starting state (S) to the goal state (G)** by walking only on frozen tiles (F) and avoiding holes (H).
+2. [An autonomous taxi](https://www.gymlibrary.dev/environments/toy_text/taxi/) 🚖 will need **to learn to navigate** a city to **transport its passengers from point A to point B.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/envs.gif" alt="Environments"/>
+
+Thanks to a [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard), you'll be able to compare your results with other classmates and exchange best practices to improve your agent's scores. Who will win the challenge for Unit 2?
+
+
+**To start the hands-on, click on the Open In Colab button** 👇 :
+
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]()
diff --git a/units/en/unit2/introduction.mdx b/units/en/unit2/introduction.mdx
new file mode 100644
index 0000000..409f025
--- /dev/null
+++ b/units/en/unit2/introduction.mdx
@@ -0,0 +1,26 @@
+# Introduction to Q-Learning [[introduction-q-learning]]
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/thumbnail.jpg" alt="Unit 2 thumbnail" width="100%">
+
+
+In the first unit of this class, we learned about Reinforcement Learning (RL), the RL process, and the different methods to solve an RL problem. We also **trained our first agents and uploaded them to the Hugging Face Hub.**
+
+In this unit, we're going to **dive deeper into one of the Reinforcement Learning methods: value-based methods** and study our first RL algorithm: **Q-Learning.**
+
+We'll also **implement our first RL agent from scratch**, a Q-Learning agent, and will train it in two environments:
+
+1. Frozen-Lake-v1 (non-slippery version): where our agent will need to **go from the starting state (S) to the goal state (G)** by walking only on frozen tiles (F) and avoiding holes (H).
+2. An autonomous taxi: where our agent will need **to learn to navigate** a city to **transport its passengers from point A to point B.**
+
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/envs.gif" alt="Environments"/>
+
+Concretely, we will:
+
+- Learn about **value-based methods**.
+- Learn about the **differences between Monte Carlo and Temporal Difference Learning**.
+- Study and implement **our first RL algorithm**: Q-Learning.
+
+This unit is **fundamental if you want to be able to work on Deep Q-Learning**: the first Deep RL algorithm that played Atari games and beat the human level on some of them (Breakout, Space Invaders…).
+
+So let's get started! 🚀
diff --git a/units/en/unit2/mc-vs-td.mdx b/units/en/unit2/mc-vs-td.mdx
new file mode 100644
index 0000000..e78ee78
--- /dev/null
+++ b/units/en/unit2/mc-vs-td.mdx
@@ -0,0 +1,126 @@
+# Monte Carlo vs Temporal Difference Learning [[mc-vs-td]]
+
+The last thing we need to talk about before diving into Q-Learning is the two ways of learning.
+
+Remember that an RL agent **learns by interacting with its environment.** The idea is that, **given the experience it collects** and the reward it gets, the agent will **update its value function or policy.**
+
+Monte Carlo and Temporal Difference Learning are two different **strategies on how to train our value function or our policy function.** Both of them **use experience to solve the RL problem.**
+
+On one hand, Monte Carlo uses **an entire episode of experience before learning.** On the other hand, Temporal Difference uses **only a step ( \\(S_t, A_t, R_{t+1}, S_{t+1}\\) ) to learn.**
+
+We'll explain both of them **using a value-based method example.**
+
+## Monte Carlo: learning at the end of the episode [[monte-carlo]]
+
+Monte Carlo waits until the end of the episode, calculates  \\(G_t\\) (return) and uses it as **a target for updating  \\(V(S_t)\\).**
+
+So it requires a **complete episode of interaction before updating our value function.**
+
+  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/monte-carlo-approach.jpg" alt="Monte Carlo"/>
+
+
+If we take an example:
+
+  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-2.jpg" alt="Monte Carlo"/>
+
+
+- We always start the episode **at the same starting point.**
+- **The agent takes actions using the policy**. For instance, using an Epsilon Greedy Strategy, a policy that alternates between exploration (random actions) and exploitation.
+- We get **the reward and the next state.**
+- We terminate the episode if the cat eats the mouse or if the mouse moves more than 10 steps.
+
+- At the end of the episode, **we have a list of States, Actions, Rewards, and Next States.**
+- **The agent will sum the total rewards \\(G_t\\)** (to see how well it did).
+- It will then **update \\(V(S_t)\\) based on the formula**
+
+  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-3.jpg" alt="Monte Carlo"/>
+
+- Then **start a new game with this new knowledge**
+
+By running more and more episodes, **the agent will learn to play better and better.**
+
+  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-3p.jpg" alt="Monte Carlo"/>
+
+For instance, if we train a state-value function using Monte Carlo:
+
+- We just started to train our value function, **so it returns a value of 0 for each state**
+- Our learning rate (lr) is 0.1 and our discount rate is 1 (= no discount)
+- Our mouse **explores the environment and takes random actions**
+
+  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-4.jpg" alt="Monte Carlo"/>
+
+
+- The mouse made more than 10 steps, so the episode ends.
+
+  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-4p.jpg" alt="Monte Carlo"/>
+
+
+- We have a list of states, actions, rewards, and next states; **we need to calculate the return \\(G_t\\)**
+- \\(G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...\\)
+- \\(G_t = R_{t+1} + R_{t+2} + R_{t+3} + ...\\) (for simplicity we don’t discount the rewards, so gamma = 1).
+- \\(G_t = 1 + 0 + 0 + 0 + 0 + 0 + 1 + 1 + 0 + 0\\)
+- \\(G_t= 3\\)
+- We can now update \\(V(S_0)\\):
+
+  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-5.jpg" alt="Monte Carlo"/>
+
+- New \\(V(S_0) = V(S_0) + lr * [G_t - V(S_0)]\\)
+- New \\(V(S_0) = 0 + 0.1 * [3 - 0]\\)
+- New \\(V(S_0) = 0.3\\)
+
+
+  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-5p.jpg" alt="Monte Carlo"/>
+
+
+## Temporal Difference Learning: learning at each step [[td-learning]]
+
+**Temporal Difference, on the other hand, waits for only one interaction (one step) \\(S_{t+1}\\)**
+to form a TD target and update \\(V(S_t)\\) using \\(R_{t+1}\\) and \\(gamma * V(S_{t+1})\\).
+
+The idea with **TD is to update the \\(V(S_t)\\) at each step.**
+
+But because we didn't play during an entire episode, we don't have \\(G_t\\) (the return). Instead, **we estimate \\(G_t\\) by adding \\(R_{t+1}\\) and the discounted value of the next state.**
+
+This is called bootstrapping. It's called this **because TD bases its update in part on an existing estimate \\(V(S_{t+1})\\) and not a complete sample \\(G_t\\).**
+
+  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-1.jpg" alt="Temporal Difference"/>
+
+
+This method is called TD(0) or **one-step TD (update the value function after any individual step).**
+
+  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-1p.jpg" alt="Temporal Difference"/>
+
+If we take the same example,
+
+  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-2.jpg" alt="Temporal Difference"/>
+
+- We just started to train our value function, so it returns a value of 0 for each state.
+- Our learning rate (lr) is 0.1, and our discount rate is 1 (no discount).
+- Our mouse explores the environment and takes a random action: **going to the left**
+- It gets a reward  \\(R_{t+1} = 1\\) since **it eats a piece of cheese**
+
+  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-2p.jpg" alt="Temporal Difference"/>
+
+
+  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-3.jpg" alt="Temporal Difference"/>
+
+We can now update  \\(V(S_0)\\):
+
+New  \\(V(S_0) = V(S_0) + lr * [R_1 + gamma * V(S_1) - V(S_0)]\\)
+
+New \\(V(S_0) = 0 + 0.1 * [1 + 1 * 0 - 0]\\)
+
+New \\(V(S_0) = 0.1\\)
+
+So we just updated our value function for State 0.
+
+Now we **continue to interact with this environment with our updated value function.**
+
+  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-3p.jpg" alt="Temporal Difference"/>
+
+  If we summarize:
+
+  - With *Monte Carlo*, we update the value function from a complete episode, and so we **use the actual accurate discounted return of this episode.**
+  - With *TD Learning*, we update the value function from a step, so we replace \\(G_t\\) that we don't have with **an estimated return called TD target.**
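+
+  Written side by side for a generic state \\(S_t\\) (a sketch that simply generalizes the two \\(V(S_0)\\) updates worked out above, with `lr` the learning rate and `gamma` the discount rate):
+
+  - Monte Carlo: New \\(V(S_t) = V(S_t) + lr * [G_t - V(S_t)]\\)
+  - TD Learning: New \\(V(S_t) = V(S_t) + lr * [R_{t+1} + gamma * V(S_{t+1}) - V(S_t)]\\)
+
+  The only difference is the target inside the brackets: the full return \\(G_t\\) for Monte Carlo, the one-step estimate \\(R_{t+1} + gamma * V(S_{t+1})\\) for TD.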
+
+  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Summary.jpg" alt="Summary"/>
diff --git a/units/en/unit2/q-learning-example.mdx b/units/en/unit2/q-learning-example.mdx
new file mode 100644
index 0000000..62e9be3
--- /dev/null
+++ b/units/en/unit2/q-learning-example.mdx
@@ -0,0 +1,83 @@
+# A Q-Learning example [[q-learning-example]]
+
+To better understand Q-Learning, let's take a simple example:
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-Example-2.jpg" alt="Maze-Example"/>
+
+- You're a mouse in this tiny maze. You always **start at the same starting point.**
+- The goal is **to eat the big pile of cheese at the bottom right-hand corner** and avoid the poison. After all, who doesn't like cheese?
+- The episode ends if we eat the poison, **eat the big pile of cheese or if we spend more than five steps.**
+- The learning rate is 0.1
+- The gamma (discount rate) is 0.99
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-1.jpg" alt="Maze-Example"/>
+
+
+The reward function goes like this:
+
+- **+0:** Going to a state with no cheese in it.
+- **+1:** Going to a state with a small cheese in it.
+- **+10:** Going to the state with the big pile of cheese.
+- **-10:** Going to the state with the poison and thus dying.
+- **+0:** If we spend more than five steps.
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-2.jpg" alt="Maze-Example"/>
+
+To train our agent to have an optimal policy (so a policy that goes right, right, down), **we will use the Q-Learning algorithm**.
+
+## Step 1: We initialize the Q-Table [[step1]]
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Example-1.jpg" alt="Maze-Example"/>
+
+So, for now, **our Q-Table is useless**; we need **to train our Q-function using the Q-Learning algorithm.**
+
+Let's do it for 2 training timesteps:
+
+Training timestep 1:
+
+## Step 2: Choose action using Epsilon Greedy Strategy [[step2]]
+
+Because epsilon is big (= 1.0), I take a random action; in this case, I go right.
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-3.jpg" alt="Maze-Example"/>
+
+
+## Step 3: Perform action At, get Rt+1 and St+1 [[step3]]
+
+By going right, I've got a small cheese, so \\(R_{t+1} = 1\\), and I'm in a new state.
+
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-4.jpg" alt="Maze-Example"/>
+
+
+## Step 4: Update Q(St, At) [[step4]]
+
+We can now update \\(Q(S_t, A_t)\\) using our formula.
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-5.jpg" alt="Maze-Example"/>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Example-4.jpg" alt="Maze-Example"/>
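+
+Plugging in the numbers given above, as a sketch that assumes the standard Q-Learning update and that every Q-value of the new state is still 0 (\\(S_0\\) is used here only as a label for the starting state):
+
+New \\(Q(S_0, right) = Q(S_0, right) + lr * [R_{t+1} + gamma * max_a Q(S_{t+1}, a) - Q(S_0, right)]\\)
+
+New \\(Q(S_0, right) = 0 + 0.1 * [1 + 0.99 * 0 - 0]\\)
+
+New \\(Q(S_0, right) = 0.1\\)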
+
+Training timestep 2:
+
+## Step 2: Choose action using Epsilon Greedy Strategy [[step2-2]]
+
+**I take a random action again, since epsilon is still big (0.99)** (we decay it a little because, as the training progresses, we want less and less exploration).
+
+I took action down. **Not a good action since it leads me to the poison.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-6.jpg" alt="Maze-Example"/>
+
+
+## Step 3: Perform action At, get Rt+1 and St+1 [[step3-3]]
+
+Because I go to the poison state, **I get \\(R_{t+1} = -10\\), and I die.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-7.jpg" alt="Maze-Example"/>
+
+## Step 4: Update Q(St, At) [[step4-4]]
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-8.jpg" alt="Maze-Example"/>
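+
+With the same assumptions as before (standard Q-Learning update, learning rate 0.1, gamma 0.99), and taking the value of the terminal poison state to be 0, the update for the state reached after the first step (labelled \\(S_1\\) here for convenience) would be:
+
+New \\(Q(S_1, down) = 0 + 0.1 * [-10 + 0.99 * 0 - 0]\\)
+
+New \\(Q(S_1, down) = -1\\)
+
+The negative value is what will steer the agent away from this action once it starts exploiting.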
+
+Because we're dead, we start a new episode. But what we see here is that **with two exploration steps, my agent became smarter.**
+
+As we continue exploring and exploiting the environment and updating Q-values using TD target, **Q-Table will give us better and better approximations. And thus, at the end of the training, we'll get an estimate of the optimal Q-Function.**
diff --git a/units/en/unit2/q-learning.mdx b/units/en/unit2/q-learning.mdx
new file mode 100644
index 0000000..8447e4c
--- /dev/null
+++ b/units/en/unit2/q-learning.mdx
@@ -0,0 +1,153 @@
+# Introducing Q-Learning [[q-learning]]
+## What is Q-Learning? [[what-is-q-learning]]
+
+Q-Learning is an **off-policy value-based method that uses a TD approach to train its action-value function:**
+
+- *Off-policy*: we'll talk about that at the end of this chapter.
+- *Value-based method*: finds the optimal policy indirectly by training a value or action-value function that will tell us **the value of each state or each state-action pair.**
+- *Uses a TD approach:* **updates its action-value function at each step instead of at the end of the episode.**
+
+**Q-Learning is the algorithm we use to train our Q-Function**, an **action-value function** that determines the value of being at a particular state and taking a specific action at that state.
+
+<figure>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function.jpg" alt="Q-function"/>
+  <figcaption>Given a state and action, our Q Function outputs a state-action value (also called Q-value)</figcaption>
+</figure>
+
+The **Q comes from "the Quality" of that action at that state.**
+
+Internally, our Q-function has **a Q-table, a table where each cell corresponds to a state-action pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.**
+
+If we take this maze example:
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-1.jpg" alt="Maze example"/>
+
+The Q-Table has just been initialized, which is why all values are 0. This table **contains, for each state, the four state-action values.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-2.jpg" alt="Maze example"/>
+
+Here we see that the **state-action value of the initial state and going up is 0:**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-3.jpg" alt="Maze example"/>
+
+Therefore, the Q-function contains a Q-table **that has the value of each state-action pair.** And given a state and action, **our Q-Function will search inside its Q-table to output the value.**
+
+<figure>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q-function"/>
+  <figcaption>Given a state and action pair, our Q-function will search inside its Q-table to output the state-action pair value (the Q value).</figcaption>
+</figure>
+
+If we recap, *Q-Learning* **is the RL algorithm that:**
+
+- Trains *Q-Function* (an **action-value function**) which internally is a *Q-table* **that contains all the state-action pair values.**
+- Given a state and action, our Q-Function **will search its Q-table for the corresponding value.**
+- When the training is done, **we have an optimal Q-function, which means we have an optimal Q-Table.**
+- And if we **have an optimal Q-function**, we **have an optimal policy** since we **know for each state what is the best action to take.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy"/>
+
+
+But, in the beginning, **our Q-Table is useless since it gives arbitrary values for each state-action pair** (most of the time, we initialize the Q-Table to 0 values). But, as we **explore the environment and update our Q-Table, it will give us better and better approximations.**
+
+<figure class="image table text-center m-0 w-full">
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-1.jpg" alt="Q-learning"/>
+  <figcaption>We see here that with the training, our Q-Table is better since, thanks to it, we can know the value of each state-action pair.</figcaption>
+</figure>
+
+So now that we understand what Q-Learning, Q-Function, and Q-Table are, **let's dive deeper into the Q-Learning algorithm**.
+
+## The Q-Learning algorithm [[q-learning-algo]]
+
+This is the Q-Learning pseudocode; let's study each part and **see how it works with a simple example before implementing it.** Don't be intimidated by it, it's simpler than it looks! We'll go over each step.
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-learning"/>
+
+### Step 1: We initialize the Q-Table [[step1]]
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-3.jpg" alt="Q-learning"/>
+
+
+We need to initialize the Q-Table for each state-action pair. **Most of the time, we initialize with values of 0.**
+
+### Step 2: Choose action using Epsilon Greedy Strategy [[step2]]
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg" alt="Q-learning"/>
+
+
+Epsilon Greedy Strategy is a policy that handles the exploration/exploitation trade-off.
+
+The idea is that, starting with an initial value of ɛ = 1.0:
+
+- *With probability 1 - ɛ*: we do **exploitation** (aka our agent selects the action with the highest state-action pair value).
+- *With probability ɛ*: **we do exploration** (trying a random action).
+
+At the beginning of the training, **the probability of doing exploration will be huge since ɛ is very high, so most of the time, we'll explore.** But as the training goes on, and consequently our **Q-Table gets better and better in its estimations, we progressively reduce the epsilon value** since we will need less and less exploration and more exploitation.
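+
+As a small worked example, take a hypothetical value ɛ = 0.4 partway through training, and assume (as is common, though not stated above) that the random action is drawn uniformly from all four actions of the maze, so the random draw can also land on the best action:
+
+- Probability of taking the currently best action: \\((1 - 0.4) + 0.4 / 4 = 0.7\\)
+- Probability of taking each of the three other actions: \\(0.4 / 4 = 0.1\\)
+
+With ɛ = 1.0, the same computation gives 0.25 for every action, which is pure exploration.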
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-5.jpg" alt="Q-learning"/>
+
+
+### Step 3: Perform action At, get reward Rt+1 and next state St+1 [[step3]]
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-6.jpg" alt="Q-learning"/>
+
+### Step 4: Update Q(St, At) [[step4]]
+
+Remember that in TD Learning, we update our policy or value function (depending on the RL method we choose) **after one step of the interaction.**
+
+To produce our TD target, **we use the immediate reward \\(R_{t+1}\\) plus the discounted value of the best state-action pair of the next state** (we call that bootstrapping).
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-7.jpg" alt="Q-learning"/>
+
+Therefore, our \\(Q(S_t, A_t)\\) **update formula goes like this:**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-8.jpg" alt="Q-learning"/>
+
+
+It means that to update our \\(Q(S_t, A_t)\\):
+
+- We need \\(S_t, A_t, R_{t+1}, S_{t+1}\\).
+- To update our Q-value at a given state-action pair, we use the TD target.
+
+How do we form the TD target?
+1. We obtain the reward after taking the action \\(R_{t+1}\\).
+2. To get the **best next-state-action pair value**, we use a greedy policy to select the next best action. Note that this is not an epsilon-greedy policy; it will always take the action with the highest state-action value.
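+
+In symbols, a sketch of the standard Q-Learning TD target and update (with `lr` the learning rate and `gamma` the discount rate, as in the earlier examples, and \\(a\\) ranging over the actions available in \\(S_{t+1}\\)):
+
+TD target \\(= R_{t+1} + gamma * max_a Q(S_{t+1}, a)\\)
+
+New \\(Q(S_t, A_t) = Q(S_t, A_t) + lr * [R_{t+1} + gamma * max_a Q(S_{t+1}, a) - Q(S_t, A_t)]\\)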
+
+Then, when the update of this Q-value is done, we start in the new state and select our action **using our epsilon-greedy policy again.**
+
+**This is why we say that this is an off-policy algorithm.**
+
+## Off-policy vs On-policy [[off-vs-on]]
+
+The difference is subtle:
+
+- *Off-policy*: using **a different policy for acting and updating.**
+
+For instance, with Q-Learning, the epsilon-greedy policy (acting policy) is different from the greedy policy that is **used to select the best next-state action value to update our Q-value (updating policy).**
+
+
+<figure>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-1.jpg" alt="Off-on policy"/>
+  <figcaption>Acting Policy</figcaption>
+</figure>
+
+Is different from the policy we use during the training part:
+
+
+<figure>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-2.jpg" alt="Off-on policy"/>
+  <figcaption>Updating policy</figcaption>
+</figure>
+
+- *On-policy:* using the **same policy for acting and updating.**
+
+For instance, with Sarsa, another value-based algorithm, **the Epsilon-Greedy Policy selects the next_state-action pair, not a greedy policy.**
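+
+For comparison, a sketch of the standard Sarsa update in the same notation: it uses the value of the action \\(A_{t+1}\\) actually chosen by the epsilon-greedy policy in the next state, where Q-Learning takes the maximum over all actions:
+
+New \\(Q(S_t, A_t) = Q(S_t, A_t) + lr * [R_{t+1} + gamma * Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]\\)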
+
+
+<figure>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-3.jpg" alt="Off-on policy"/>
+    <figcaption>Sarsa</figcaption>
+</figure>
+
+<figure>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg" alt="Off-on policy"/>
+</figure>
diff --git a/units/en/unit2/quiz1.mdx b/units/en/unit2/quiz1.mdx
new file mode 100644
index 0000000..cc5692d
--- /dev/null
+++ b/units/en/unit2/quiz1.mdx
@@ -0,0 +1,105 @@
+# First Quiz [[quiz1]]
+
+The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.
+
+
+### Q1: What are the two main approaches to find an optimal policy?
+
+
+<Question
+	choices={[
+		{
+			text: "Policy-based methods",
+			explain: "With Policy-Based methods, we train the policy directly to learn which action to take given a state.",
+      correct: true
+		},
+		{
+			text: "Random-based methods",
+			explain: ""
+		},
+    {
+			text: "Value-based methods",
+			explain: "With Value-based methods, we train a value function to learn which state is more valuable and use this value function to take the action that leads to it.",
+      correct: true
+		},
+		{
+			text: "Evolution-strategies methods",
+      explain: ""
+		}
+	]}
+/>
+
+
+### Q2: What is the Bellman Equation?
+
+<details>
+<summary>Solution</summary>
+
+**The Bellman equation is a recursive equation** that works like this: instead of starting for each state from the beginning and calculating the return, we can consider the value of any state as:
+
+\\(R_{t+1} + \gamma V(S_{t+1})\\),
+that is, the immediate reward plus the discounted value of the state that follows.
+
+</details>
+
+### Q3: Define each part of the Bellman Equation
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman4-quiz.jpg" alt="Bellman equation quiz"/>
+
+
+<details>
+<summary>Solution</summary>
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman4.jpg" alt="Bellman equation solution"/>
+
+</details>
+
+### Q4: What is the difference between Monte Carlo and Temporal Difference learning methods?
+
+<Question
+	choices={[
+		{
+			text: "With Monte Carlo methods, we update the value function from a complete episode",
+			explain: "",
+      correct: true
+		},
+    {
+			text: "With Monte Carlo methods, we update the value function from a step",
+			explain: ""
+		},
+    {
+			text: "With TD learning methods, we update the value function from a complete episode",
+			explain: ""
+		},
+    {
+			text: "With TD learning methods, we update the value function from a step",
+			explain: "",
+      correct: true
+		},
+	]}
+/>
+
+### Q5: Define each part of the Temporal Difference learning formula
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/td-ex.jpg" alt="TD Learning exercise"/>
+
+<details>
+<summary>Solution</summary>
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-1.jpg" alt="TD Exercise"/>
+
+</details>
+
+
+### Q6: Define each part of the Monte Carlo learning formula
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/mc-ex.jpg" alt="MC Learning exercise"/>
+
+<details>
+<summary>Solution</summary>
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/monte-carlo-approach.jpg" alt="MC Exercise"/>
+
+</details>
+
+Congrats on finishing this Quiz 🥳! If you missed some elements, take time to read the chapter again to reinforce (😏) your knowledge.
diff --git a/units/en/unit2/quiz2.mdx b/units/en/unit2/quiz2.mdx
new file mode 100644
index 0000000..9d96d74
--- /dev/null
+++ b/units/en/unit2/quiz2.mdx
@@ -0,0 +1,97 @@
+# Second Quiz [[quiz2]]
+
+The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.
+
+
+### Q1: What is Q-Learning?
+
+
+<Question
+	choices={[
+		{
+			text: "The algorithm we use to train our Q-Function",
+			explain: "",
+      correct: true
+		},
+		{
+			text: "A value function",
+			explain: "Q-Learning is the algorithm; what it trains is an action-value function (the Q-function), which gives the value of being in a particular state and taking a specific action in that state",
+		},
+    {
+			text: "An algorithm that determines the value of being at a particular state and taking a specific action at that state",
+			explain: "",
+      correct: true
+		},
+		{
+			text: "A table",
+      explain: "Q-Learning is not a Q-Table. It is the algorithm that updates the Q-function, whose values are stored in the Q-Table."
+		}
+	]}
+/>
+
+### Q2: What is a Q-Table?
+
+<Question
+	choices={[
+		{
+			text: "An algorithm we use in Q-Learning",
+			explain: "",
+		},
+		{
+			text: "Q-table is the internal memory of our agent",
+			explain: "",
+      correct: true
+		},
+    {
+			text: "In a Q-Table, each cell corresponds to a state value",
+			explain: "Each cell corresponds to a state-action pair value, not a state value.",
+		}
+	]}
+/>
+
+### Q3: Why does having an optimal Q-function Q* give us an optimal policy?
+
+<details>
+<summary>Solution</summary>
+
+Because if we have an optimal Q-function, we know for each state what the best action to take is, so we have an optimal policy.
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="link value policy"/>
+
+</details>
+
+### Q4: Can you explain what the Epsilon-Greedy Strategy is?
+
+<details>
+<summary>Solution</summary>
+The Epsilon-Greedy Strategy is a policy that handles the exploration/exploitation trade-off.
+
+The idea is that we define an initial value of epsilon, ɛ = 1.0:
+
+- With *probability 1 - ɛ*: we do exploitation (aka our agent selects the action with the highest state-action pair value).
+- With *probability ɛ*: we do exploration (trying a random action).
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg" alt="Epsilon Greedy"/>
+
+
+</details>
+
+### Q5: How do we update the Q-value of a state-action pair?
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-update-ex.jpg" alt="Q Update exercise"/>
+
+<details>
+<summary>Solution</summary>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-update-solution.jpg" alt="Q Update exercise"/>
+
+</details>
+
+
+
+### Q6: What's the difference between on-policy and off-policy?
+
+<details>
+<summary>Solution</summary>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg" alt="On/off policy"/>
+</details>
+
+Congrats on finishing this Quiz 🥳! If you missed some elements, take time to read the chapter again to reinforce (😏) your knowledge.
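The answers to Q4 and Q5 above can be sketched in a few lines of Python. This is a minimal sketch, assuming a tabular Q-function stored as a list of lists and the standard Q-Learning update rule; the names `epsilon_greedy_action`, `q_update`, `alpha`, and `gamma`, as well as the toy values, are illustrative assumptions rather than the course's reference implementation.

```python
import random


def epsilon_greedy_action(q_table, state, epsilon, rng):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    n_actions = len(q_table[state])
    if rng.random() < epsilon:
        return rng.randrange(n_actions)  # exploration: random action
    return max(range(n_actions), key=lambda a: q_table[state][a])  # exploitation


def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Move Q(state, action) a step of size alpha toward the TD target."""
    td_target = reward + gamma * max(q_table[next_state])
    td_error = td_target - q_table[state][action]
    q_table[state][action] += alpha * td_error


# Hypothetical toy problem: 2 states, 2 actions, Q-table initialised to zeros.
q_table = [[0.0, 0.0], [0.0, 0.0]]
rng = random.Random(0)

# With epsilon = 1.0 the agent always explores, as at the start of training.
action = epsilon_greedy_action(q_table, state=0, epsilon=1.0, rng=rng)
q_update(q_table, state=0, action=action, reward=1.0, next_state=1, alpha=0.1, gamma=0.99)
print(q_table)
```

In practice epsilon is usually reduced over the course of training so that the agent explores less as its Q-table becomes more reliable; the decay schedule is left out of this sketch.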
diff --git a/units/en/unit2/summary1.mdx b/units/en/unit2/summary1.mdx
new file mode 100644
index 0000000..3a19d86
--- /dev/null
+++ b/units/en/unit2/summary1.mdx
@@ -0,0 +1,17 @@
+# Summary [[summary1]]
+
+Before diving into Q-Learning, let's summarize what we just learned.
+
+We have two types of value functions:
+
+- State-Value function: outputs the expected return if **the agent starts at a given state and acts according to the policy forever after.**
+- Action-Value function: outputs the expected return if **the agent starts in a given state, takes a given action at that state** and then acts according to the policy forever after.
+- In value-based methods, **we define the policy by hand** because we don't train it; we train a value function. The idea is that if we have an optimal value function, we **will have an optimal policy.**
+
+There are two types of methods to learn a value function:
+
+- With *the Monte Carlo method*, we update the value function from a complete episode, and so we **use the actual discounted return of this episode.**
+- With *the TD Learning method,* we update the value function from a step, so we replace \\(G_t\\), which we don't have, with **an estimated return called the TD target.**
+
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/summary-learning-mtds.jpg" alt="Summary"/>
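To make the contrast in the summary above concrete, here is a minimal sketch of the two state-value updates. It assumes the usual incremental form with a learning rate `alpha` and a discount factor `gamma`; the function names, the state labels, and the numbers are hypothetical and chosen only for illustration.

```python
def mc_update(values, state, episode_return, alpha):
    """Monte Carlo: update V(state) toward the actual return G_t of a finished episode."""
    values[state] += alpha * (episode_return - values[state])


def td_update(values, state, reward, next_state, alpha, gamma):
    """TD(0): update V(state) toward the TD target, an estimate available after one step."""
    td_target = reward + gamma * values[next_state]
    values[state] += alpha * (td_target - values[state])


# Hypothetical value table with two states, initialised to zero.
values_mc = {"s0": 0.0, "s1": 0.0}
values_td = {"s0": 0.0, "s1": 0.0}

# Monte Carlo needs the whole episode first: here we assume its return was 2.0.
mc_update(values_mc, "s0", episode_return=2.0, alpha=0.1)

# TD only needs one transition: reward 1.0 and the current estimate of the next state.
td_update(values_td, "s0", reward=1.0, next_state="s1", alpha=0.1, gamma=1.0)

print(values_mc, values_td)
```

The only difference between the two functions is the target: a sampled return in one case, a bootstrapped estimate in the other.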
diff --git a/units/en/unit2/two-types-value-based-methods.mdx b/units/en/unit2/two-types-value-based-methods.mdx
new file mode 100644
index 0000000..47da6ef
--- /dev/null
+++ b/units/en/unit2/two-types-value-based-methods.mdx
@@ -0,0 +1,86 @@
+# Two types of value-based methods [[two-types-value-based-methods]]
+
+In value-based methods, **we learn a value function** that **maps a state to the expected value of being at that state.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/vbm-1.jpg" alt="Value Based Methods"/>
+
+The value of a state is the **expected discounted return** the agent can get if it **starts at that state and then acts according to our policy.**
+
+<Tip>
+But what does it mean to act according to our policy? After all, we don't have a policy in value-based methods, since we train a value function and not a policy.
+</Tip>
+
+Remember that the goal of an **RL agent is to have an optimal policy π\*.**
+
+To find the optimal policy, we learned about two different methods:
+
+- *Policy-based methods:* **Directly train the policy** to select what action to take given a state (or a probability distribution over actions at that state). In this case, we **don't have a value function.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/two-approaches-2.jpg" alt="Two RL approaches"/>
+
+The policy takes a state as input and outputs what action to take at that state (deterministic policy).
+
+And consequently, **we don't define by hand the behavior of our policy; it's the training that will define it.**
+
+- *Value-based methods:* **Indirectly, by training a value function** that outputs the value of a state or a state-action pair. Given this value function, our policy **will take an action.**
+
+Since the policy is not trained/learned, **we need to specify its behavior.** For instance, if we want a policy that, given the value function, will take actions that always lead to the biggest reward, **we'll create a Greedy Policy.**
+
+<figure>
+  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/two-approaches-3.jpg" alt="Two RL approaches"/>
+  <figcaption>Given a state, our action-value function (that we train) outputs the value of each action at that state. Then, our pre-defined Greedy Policy selects the action that will yield the highest value given a state or a state action pair.</figcaption>
+</figure>
+
+Consequently, whatever method you use to solve your problem, **you will have a policy**. In the case of value-based methods, you don't train the policy: your policy **is just a simple pre-specified function** (for instance, Greedy Policy) that uses the values given by the value-function to select its actions.
+
+So the difference is:
+
+- In policy-based, **the optimal policy is found by training the policy directly.**
+- In value-based, **finding an optimal value function leads to having an optimal policy.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link between value and policy"/>
+
+In fact, most of the time, in value-based methods, you'll use **an Epsilon-Greedy Policy** that handles the exploration/exploitation trade-off; we'll talk about it when we talk about Q-Learning in the second part of this unit.
+
+
+So, we have two types of value functions:
+
+## The State-Value function [[state-value-function]]
+
+We write the state value function under a policy π like this:
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/state-value-function-1.jpg" alt="State value function"/>
+
+For each state, the state-value function outputs the expected return if the agent **starts at that state,** and then follows the policy forever afterwards (for all future timesteps, if you prefer).
+
+<figure>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/state-value-function-2.jpg" alt="State value function"/>
+  <figcaption>If we take the state with value -7: it's the expected return starting at that state and taking actions according to our policy (greedy policy), so right, right, right, down, down, right, right.</figcaption>
+</figure>
+
+## The Action-Value function [[action-value-function]]
+
+For each state-action pair, the action-value function **outputs the expected return** if the agent starts in that state, takes that action, and then follows the policy forever after.
+
+The value of taking action \\(a\\) in state \\(s\\) under a policy π is:
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/action-state-value-function-1.jpg" alt="Action State value function"/>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/action-state-value-function-2.jpg" alt="Action State value function"/>
+
+
+We see that the difference is:
+
+- With the state-value function, we calculate **the value of a state \\(S_t\\)**
+- With the action-value function, we calculate **the value of the state-action pair ( \\(S_t, A_t\\) ), hence the value of taking that action at that state.**
+
+<figure>
+  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/two-types.jpg" alt="Two types of value function"/>
+  <figcaption>
+Note: We didn't fill all the state-action pairs for the example of Action-value function</figcaption>
+</figure>
+
+In either case, whatever value function we choose (state-value or action-value function), **the value is the expected return.**
+
+However, the problem is that it implies that **to calculate EACH value of a state or a state-action pair, we need to sum all the rewards an agent can get if it starts at that state.**
+
+This can be a tedious process, and that's **where the Bellman equation comes to help us.**
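The cost described above can be seen in a small sketch that computes each state's value by summing rewards from scratch. It assumes a single deterministic trajectory with a reward of -1 per step and a discount factor of 1, loosely echoing the figure whose state has value -7; the function name `discounted_return` and these numbers are assumptions for illustration only.

```python
def discounted_return(rewards, gamma=1.0):
    """Sum the rewards collected from a state onward, discounting each step by gamma."""
    total = 0.0
    for step, reward in enumerate(rewards):
        total += (gamma ** step) * reward
    return total


# Hypothetical trajectory: seven moves, each with a reward of -1.
rewards = [-1.0] * 7

# Naive approach: for every state along the path, re-sum all the remaining rewards.
values = [discounted_return(rewards[t:]) for t in range(len(rewards))]
print(values)
```

Each entry re-adds almost the same rewards as its neighbour, which is the repeated work that a recursive formulation avoids.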
diff --git a/units/en/unit2/what-is-rl.mdx b/units/en/unit2/what-is-rl.mdx
new file mode 100644
index 0000000..2c31486
--- /dev/null
+++ b/units/en/unit2/what-is-rl.mdx
@@ -0,0 +1,25 @@
+# What is RL? A short recap [[what-is-rl]]
+
+In RL, we build an agent that can **make smart decisions**. For instance, an agent that **learns to play a video game.** Or a trading agent that **learns to maximize its benefits** by making smart decisions on **what stocks to buy and when to sell.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/rl-process.jpg" alt="RL process"/>
+
+
+But, to make intelligent decisions, our agent will learn from the environment by **interacting with it through trial and error** and receiving rewards (positive or negative) **as unique feedback.**
+
+Its goal **is to maximize its expected cumulative reward** (because of the reward hypothesis).
+
+**The agent's decision-making process is called the policy π:** given a state, a policy will output an action or a probability distribution over actions. That is, given an observation of the environment, a policy will provide an action (or a probability for each action) that the agent should take.
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/policy.jpg" alt="Policy"/>
+
+**Our goal is to find an optimal policy π\***, i.e., a policy that leads to the best expected cumulative reward.
+
+And to find this optimal policy (hence solving the RL problem), there **are two main types of RL methods**:
+
+- *Policy-based methods*: **Train the policy directly** to learn which action to take given a state.
+- *Value-based methods*: **Train a value function** to learn **which state is more valuable** and use this value function **to take the action that leads to it.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/two-approaches.jpg" alt="Two RL approaches"/>
+
+And in this unit, **we'll dive deeper into the value-based methods.**

From 6e356bf1accf3aeff19f28663454fc4fea58ea4f Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Tue, 6 Dec 2022 11:24:11 +0100
Subject: [PATCH 02/49] Add unit to _toc_tree

---
 units/en/_toctree.yml | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/units/en/_toctree.yml b/units/en/_toctree.yml
index 45f4925..1f006b2 100644
--- a/units/en/_toctree.yml
+++ b/units/en/_toctree.yml
@@ -44,3 +44,31 @@
     title: Play with Huggy
   - local: unitbonus1/conclusion
     title: Conclusion
+- title: Unit 2. Introduction to Q-Learning
+  sections:
+  - local: unit2/introduction
+    title: Introduction
+  - local: unit2/what-is-rl
+    title: What is RL? A short recap
+  - local: unit2/two-types-value-based-methods
+    title: The two types of value-based methods
+  - local: unit2/bellman-equation
+    title: The Bellman Equation, simplify our value estimation
+  - local: unit2/mc-vs-td
+    title: Monte Carlo vs Temporal Difference Learning
+  - local: unit2/summary1
+    title: Summary
+  - local: unit2/quiz1
+    title: First Quiz
+  - local: unit2/q-learning
+    title: Introducing Q-Learning
+  - local: unit2/q-learning-example
+    title: A Q-Learning example
+  - local: unit2/hands-on
+    title: Hands-on
+  - local: unit2/quiz2
+    title: Second Quiz
+  - local: unit2/conclusion
+    title: Conclusion
+  - local: unit2/additional-readings
+    title: Additional Readings

From d3a8e6ea348673bfbce7c73369fc0bbaa1e5d8fd Mon Sep 17 00:00:00 2001
From: lucifermorningstart1305 <adityam.ghosh@gmail.com>
Date: Thu, 8 Dec 2022 15:21:32 +1300
Subject: [PATCH 03/49] Adding::Glossary for Unit-1

---
 .units/unit1/glossary.mdx | 40 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)
 create mode 100644 .units/unit1/glossary.mdx

diff --git a/.units/unit1/glossary.mdx b/.units/unit1/glossary.mdx
new file mode 100644
index 0000000..92396ac
--- /dev/null
+++ b/.units/unit1/glossary.mdx
@@ -0,0 +1,40 @@
+# Glossary [[glossary]]
+
+### Markov Property
+It implies that the action taken by our agent is conditional solely on the present state and independent of the past states and actions.
+
+### Observations/State
+- **State**:  Complete description of the state of the world.
+- **Observation**: Partial description of the state of the environment/world.
+
+### Actions
+- **Discrete Actions**: Finite number of actions, such as left, right, up, and down.
+- **Continuous Actions**: Infinite possibility of actions; for example, in the case of self-driving cars, the driving scenario has an infinite possibility of actions occurring.
+
+### Rewards and Discounting
+- **Rewards**: Fundamental factor in RL. Tells the agent whether the action taken is good/bad.
+- RL algorithms are focused on maximizing the **cumulative reward**.
+- **Reward Hypothesis**: RL problems can be formulated as a maximisation of (cumulative) return.
+- **Discounting** is performed because rewards obtained at the start are more likely to happen as they are more predictable than long-term rewards.
+
+### Tasks
+- **Episodic**: Has a starting point and an ending point.
+- **Continuous**: Has a starting point but no ending point.
+
+### Exploration v/s Exploitation Trade-Off
+- **Exploration**: It's all about exploring the environment by trying random actions and receiving feedback/returns/rewards from the environment.
+- **Exploitation**: It's about exploiting what we know about the environment to gain maximum rewards.
+- **Exploration-Exploitation Trade-Off**: It balances how much we want to **explore** the environment and how much we want to **exploit** what we know about the environment.
+
+### Policy
+- **Policy**: It is called the agent's brain. It tells us what action to take, given the state.
+- **Optimal Policy**: Policy that **maximizes** the **expected return** when an agent acts according to it. It is learned through *training*.
+
+### Policy-based Methods:
+- An approach to solving RL problems.
+- In this method, the Policy is learned directly. 
+- Will map each state to the best corresponding action at that state. Or a probability distribution over the set of possible actions at that state.
+
+### Value-based Methods:
+- Another approach to solving RL problems.
+- Here, instead of training a policy, we train a **value function** that maps each state to the expected value of being in that state.

From c800ddfa01f5a8de046d27b316957aa369da2535 Mon Sep 17 00:00:00 2001
From: lucifermorningstart1305 <adityam.ghosh@gmail.com>
Date: Thu, 8 Dec 2022 23:52:01 +1300
Subject: [PATCH 04/49] Added: Contributions

---
 .units/unit1/glossary.mdx | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/.units/unit1/glossary.mdx b/.units/unit1/glossary.mdx
index 92396ac..03d440e 100644
--- a/.units/unit1/glossary.mdx
+++ b/.units/unit1/glossary.mdx
@@ -38,3 +38,14 @@ It implies that the action taken by our agent is conditional solely on the prese
 ### Value-based Methods:
 - Another approach to solving RL problems.
 - Here, instead of training a policy, we train a **value function** that maps each state to the expected value of being in that state.
+
+Contributions are welcomed :hugs:
+
+If you want to improve the course, you can [open a Pull Request.](https://github.com/huggingface/deep-rl-class/pulls)
+
+This glossary was made possible thanks to:
+
+- [@lucifermorningstar1305](https://github.com/lucifermorningstar1305)
+- [@daspartho](https://github.com/daspartho)
+- [@misza222](https://github.com/misza222)
+

From 729f314eb11e8e855cbda66ef1ec520099c20b24 Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Thu, 8 Dec 2022 13:33:38 +0100
Subject: [PATCH 05/49] Update .units/unit1/glossary.mdx

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
---
 .units/unit1/glossary.mdx | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/.units/unit1/glossary.mdx b/.units/unit1/glossary.mdx
index 03d440e..2871af5 100644
--- a/.units/unit1/glossary.mdx
+++ b/.units/unit1/glossary.mdx
@@ -1,5 +1,7 @@
 # Glossary [[glossary]]
 
+This is a community-created glossary. Contributions are welcomed!
+
 ### Markov Property
 It implies that the action taken by our agent is conditional solely on the present state and independent of the past states and actions.
 

From 9c2b9282a4a20fa437bd41067f1326532446bc17 Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Thu, 8 Dec 2022 13:37:48 +0100
Subject: [PATCH 06/49] Move glossary to correct folder

---
 units/en/_toctree.yml                   | 2 ++
 {.units => units/en}/unit1/glossary.mdx | 0
 2 files changed, 2 insertions(+)
 rename {.units => units/en}/unit1/glossary.mdx (100%)

diff --git a/units/en/_toctree.yml b/units/en/_toctree.yml
index 45f4925..07eecd3 100644
--- a/units/en/_toctree.yml
+++ b/units/en/_toctree.yml
@@ -24,6 +24,8 @@
     title: The “Deep” in Deep Reinforcement Learning
   - local: unit1/summary
     title: Summary
+  - local: unit1/glossary
+    title: Glossary
   - local: unit1/hands-on
     title: Hands-on
   - local: unit1/quiz
diff --git a/.units/unit1/glossary.mdx b/units/en/unit1/glossary.mdx
similarity index 100%
rename from .units/unit1/glossary.mdx
rename to units/en/unit1/glossary.mdx

From 4ba43cf05fa4122da5de1b7d953f287b896d25b8 Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Thu, 8 Dec 2022 13:47:01 +0100
Subject: [PATCH 07/49] Update glossary.mdx

---
 units/en/unit1/glossary.mdx | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/units/en/unit1/glossary.mdx b/units/en/unit1/glossary.mdx
index 2871af5..1dc9164 100644
--- a/units/en/unit1/glossary.mdx
+++ b/units/en/unit1/glossary.mdx
@@ -3,41 +3,50 @@
 This is a community-created glossary. Contributions are welcomed!
 
 ### Markov Property
-It implies that the action taken by our agent is conditional solely on the present state and independent of the past states and actions.
+
+It implies that the action taken by our agent is **conditional solely on the present state and independent of the past states and actions**.
 
 ### Observations/State
+
 - **State**:  Complete description of the state of the world.
 - **Observation**: Partial description of the state of the environment/world.
 
 ### Actions
+
 - **Discrete Actions**: Finite number of actions, such as left, right, up, and down.
 - **Continuous Actions**: Infinite possibility of actions; for example, in the case of self-driving cars, the driving scenario has an infinite possibility of actions occurring.
 
 ### Rewards and Discounting
+
 - **Rewards**: Fundamental factor in RL. Tells the agent whether the action taken is good/bad.
 - RL algorithms are focused on maximizing the **cumulative reward**.
 - **Reward Hypothesis**: RL problems can be formulated as a maximisation of (cumulative) return.
 - **Discounting** is performed because rewards obtained at the start are more likely to happen as they are more predictable than long-term rewards.
 
 ### Tasks
+
 - **Episodic**: Has a starting point and an ending point.
 - **Continuous**: Has a starting point but no ending point.
 
 ### Exploration v/s Exploitation Trade-Off
+
 - **Exploration**: It's all about exploring the environment by trying random actions and receiving feedback/returns/rewards from the environment.
 - **Exploitation**: It's about exploiting what we know about the environment to gain maximum rewards.
 - **Exploration-Exploitation Trade-Off**: It balances how much we want to **explore** the environment and how much we want to **exploit** what we know about the environment.
 
 ### Policy
+
 - **Policy**: It is called the agent's brain. It tells us what action to take, given the state.
 - **Optimal Policy**: Policy that **maximizes** the **expected return** when an agent acts according to it. It is learned through *training*.
 
 ### Policy-based Methods:
+
 - An approach to solving RL problems.
 - In this method, the Policy is learned directly. 
 - Will map each state to the best corresponding action at that state. Or a probability distribution over the set of possible actions at that state.
 
 ### Value-based Methods:
+
 - Another approach to solving RL problems.
 - Here, instead of training a policy, we train a **value function** that maps each state to the expected value of being in that state.
 

From a2ee1fc636b2b266220754e0f6faec3da42ae2e4 Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Thu, 8 Dec 2022 13:51:16 +0100
Subject: [PATCH 08/49] Update glossary

Emoji problem
---
 units/en/unit1/glossary.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/units/en/unit1/glossary.mdx b/units/en/unit1/glossary.mdx
index 1dc9164..6ab3b73 100644
--- a/units/en/unit1/glossary.mdx
+++ b/units/en/unit1/glossary.mdx
@@ -50,7 +50,7 @@ It implies that the action taken by our agent is **conditional solely on the pre
 - Another approach to solving RL problems.
 - Here, instead of training a policy, we train a **value function** that maps each state to the expected value of being in that state.
 
-Contributions are welcomed :hugs:
+Contributions are welcomed 🤗
 
 If you want to improve the course, you can [open a Pull Request.](https://github.com/huggingface/deep-rl-class/pulls)
 

From 20a12074bc400e91925857555abe7f60416474af Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Thu, 8 Dec 2022 15:50:46 +0100
Subject: [PATCH 09/49] Apply suggestions from code review from Omar

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
---
 units/en/unit2/bellman-equation.mdx            |  4 ++--
 units/en/unit2/introduction.mdx                |  4 ++--
 units/en/unit2/mc-vs-td.mdx                    | 12 ++++++------
 units/en/unit2/q-learning-example.mdx          |  2 +-
 units/en/unit2/q-learning.mdx                  | 18 +++++++++---------
 units/en/unit2/quiz1.mdx                       |  2 +-
 units/en/unit2/summary1.mdx                    |  4 ++--
 .../en/unit2/two-types-value-based-methods.mdx |  8 ++++----
 8 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/units/en/unit2/bellman-equation.mdx b/units/en/unit2/bellman-equation.mdx
index 6d224f0..b284c44 100644
--- a/units/en/unit2/bellman-equation.mdx
+++ b/units/en/unit2/bellman-equation.mdx
@@ -21,9 +21,9 @@ Then, to calculate the \\(V(S_{t+1})\\), we need to calculate the return startin
   <figcaption>To calculate the value of State 2: the sum of rewards **if the agent started in that state, and then followed the **policy for all the time steps.</figcaption>
 </figure>
 
-So you see, that's a pretty tedious process if you need to do it for each state value or state-action value.
+As you may have noticed, we're repeating the computation of the value of different states, which can be tedious if you need to do it for each state value or state-action value.
 
-Instead of calculating the expected return for each state or each state-action pair, **we can use the Bellman equation.**
+Instead of calculating the expected return for each state or each state-action pair, **we can use the Bellman equation.** (Hint: if you know what Dynamic Programming is, this is very similar! If you don't know what it is, no worries!)
 
 The Bellman equation is a recursive equation that works like this: instead of starting for each state from the beginning and calculating the return, we can consider the value of any state as:
 
diff --git a/units/en/unit2/introduction.mdx b/units/en/unit2/introduction.mdx
index 409f025..e465f45 100644
--- a/units/en/unit2/introduction.mdx
+++ b/units/en/unit2/introduction.mdx
@@ -19,8 +19,8 @@ Concretely, we will:
 
 - Learn about **value-based methods**.
 - Learn about the **differences between Monte Carlo and Temporal Difference Learning**.
-- Study and implement **our first RL algorithm**: Q-Learning.s
+- Study and implement **our first RL algorithm**: Q-Learning.
 
-This unit is **fundamental if you want to be able to work on Deep Q-Learning**: the first Deep RL algorithm that played Atari games and beat the human level on some of them (breakout, space invaders…).
+This unit is **fundamental if you want to be able to work on Deep Q-Learning**: the first Deep RL algorithm that played Atari games and beat the human level on some of them (Breakout, Space Invaders, etc.).
 
 So let's get started! 🚀
diff --git a/units/en/unit2/mc-vs-td.mdx b/units/en/unit2/mc-vs-td.mdx
index e78ee78..030ee62 100644
--- a/units/en/unit2/mc-vs-td.mdx
+++ b/units/en/unit2/mc-vs-td.mdx
@@ -1,8 +1,8 @@
 # Monte Carlo vs Temporal Difference Learning [[mc-vs-td]]
 
-The last thing we need to talk about before diving into Q-Learning is the two ways of learning.
+The last thing we need to discuss before diving into Q-Learning is the two learning strategies.
 
-Remember that an RL agent **learns by interacting with its environment.** The idea is that **using the experience taken**, given the reward it gets, will **update its value or policy.**
+Remember that an RL agent **learns by interacting with its environment.** The idea is that **given the experience and the received reward, the agent will update its value function or policy.**
 
 Monte Carlo and Temporal Difference Learning are two different **strategies on how to train our value function or our policy function.** Both of them **use experience to solve the RL problem.**
 
@@ -14,7 +14,7 @@ We'll explain both of them **using a value-based method example.**
 
 Monte Carlo waits until the end of the episode, calculates  \\(G_t\\) (return) and uses it as **a target for updating  \\(V(S_t)\\).**
 
-So it requires a **complete entire episode of interaction before updating our value function.**
+So it requires a **complete episode of interaction before updating our value function.**
 
   <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/monte-carlo-approach.jpg" alt="Monte Carlo"/>
 
@@ -29,7 +29,7 @@ If we take an example:
 - We get **the reward and the next state.**
 - We terminate the episode if the cat eats the mouse or if the mouse moves > 10 steps.
 
-- At the end of the episode, **we have a list of State, Actions, Rewards, and Next States**
+- At the end of the episode, **we have a list of State, Action, Reward, and Next State tuples**
 - **The agent will sum the total rewards \\(G_t\\)** (to see how well it did).
 - It will then **update \\(V(s_t)\\) based on the formula**
 
@@ -74,12 +74,12 @@ For instance, if we train a state-value function using Monte Carlo:
 
 ## Temporal Difference Learning: learning at each step [[td-learning]]
 
-- **Temporal difference, on the other hand, waits for only one interaction (one step) \\(S_{t+1}\\)**
+- **Temporal Difference, on the other hand, waits for only one interaction (one step) \\(S_{t+1}\\)**
 - to form a TD target and update \\(V(S_t)\\) using \\(R_{t+1}\\) and \\(gamma * V(S_{t+1})\\).
 
 The idea with **TD is to update the \\(V(S_t)\\) at each step.**
 
-But because we didn't play during an entire episode, we don't have \\(G_t\\) (expected return). Instead, **we estimate \\(G_t\\) by adding \\(R_{t+1}\\) and the discounted value of the next state.**
+But because we didn't experience an entire episode, we don't have \\(G_t\\) (expected return). Instead, **we estimate \\(G_t\\) by adding \\(R_{t+1}\\) and the discounted value of the next state.**
 
 This is called bootstrapping. It's called this **because TD bases its update part on an existing estimate \\(V(S_{t+1})\\) and not a complete sample \\(G_t\\).**
 
diff --git a/units/en/unit2/q-learning-example.mdx b/units/en/unit2/q-learning-example.mdx
index 62e9be3..d6ccbda 100644
--- a/units/en/unit2/q-learning-example.mdx
+++ b/units/en/unit2/q-learning-example.mdx
@@ -68,7 +68,7 @@ I took action down. **Not a good action since it leads me to the poison.**
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-6.jpg" alt="Maze-Example"/>
 
 
-## Step 3: Perform action At, gets \Rt+1 and St+1 [[step3-3]]
+## Step 3: Perform action At, gets Rt+1 and St+1 [[step3-3]]
 
 Because I go to the poison state, **I get \\(R_{t+1} = -10\\), and I die.**
 
diff --git a/units/en/unit2/q-learning.mdx b/units/en/unit2/q-learning.mdx
index 8447e4c..d2e8aa4 100644
--- a/units/en/unit2/q-learning.mdx
+++ b/units/en/unit2/q-learning.mdx
@@ -3,7 +3,7 @@
 
 Q-Learning is an **off-policy value-based method that uses a TD approach to train its action-value function:**
 
-- *Off-policy*: we'll talk about that at the end of this chapter.
+- *Off-policy*: we'll talk about that at the end of this unit.
 - *Value-based method*: finds the optimal policy indirectly by training a value or action-value function that will tell us **the value of each state or each state-action pair.**
 - *Uses a TD approach:* **updates its action-value function at each step instead of at the end of the episode.**
 
@@ -18,7 +18,7 @@ The **Q comes from "the Quality" of that action at that state.**
 
 Internally, our Q-function has **a Q-table, a table where each cell corresponds to a state-action value pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.**
 
-If we take this maze example:
+Let's go through an example of a maze.
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-1.jpg" alt="Maze example"/>
 
@@ -39,7 +39,7 @@ Therefore, Q-function contains a Q-table **that has the value of each-state act
 
 If we recap, *Q-Learning* **is the RL algorithm that:**
 
-- Trains *Q-Function* (an **action-value function**) which internally is a *Q-table* **that contains all the state-action pair values.**
+- Trains a *Q-Function* (an **action-value function**), which internally is a *Q-table* **that contains all the state-action pair values.**
 - Given a state and action, our Q-Function **will search into its Q-table the corresponding value.**
 - When the training is done, **we have an optimal Q-function, which means we have optimal Q-Table.**
 - And if we **have an optimal Q-function**, we **have an optimal policy** since we **know for each state what is the best action to take.**
@@ -47,14 +47,14 @@ If we recap, *Q-Learning* **is the RL algorithm that:**
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy"/>
 
 
-But, in the beginning, **our Q-Table is useless since it gives arbitrary values for each state-action pair** (most of the time, we initialize the Q-Table to 0 values). But, as we'll **explore the environment and update our Q-Table, it will give us better and better approximations.**
+But, in the beginning, **our Q-Table is useless since it gives arbitrary values for each state-action pair** (most of the time, we initialize the Q-Table to 0). As the agent **explores the environment and we update the Q-Table, it will give us better and better approximations** of the optimal Q-values.
 
 <figure class="image table text-center m-0 w-full">
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-1.jpg" alt="Q-learning"/>
   <figcaption>We see here that with the training, our Q-Table is better since, thanks to it, we can know the value of each state-action pair.</figcaption>
 </figure>
 
-So now that we understand what Q-Learning, Q-Function, and Q-Table are, **let's dive deeper into the Q-Learning algorithm**.
+Now that we understand what Q-Learning, Q-Function, and Q-Table are, **let's dive deeper into the Q-Learning algorithm**.
 
 ## The Q-Learning algorithm [[q-learning-algo]]
 
@@ -112,15 +112,15 @@ How do we form the TD target?
 1. We obtain the reward after taking the action \\(R_{t+1}\\).
 2. To get the **best next-state-action pair value**, we use a greedy policy to select the next best action. Note that this is not an epsilon greedy policy, this will always take the action with the highest state-action value.
 
-Then when the update of this Q-value is done. We start in a new_state and select our action **using our epsilon-greedy policy again.**
+Then, when the update of this Q-value is done, we start in a new state and select our action **using an epsilon-greedy policy again.**
 
-**It's why we say that this is an off-policy algorithm.**
+**This is why we say that Q-Learning is an off-policy algorithm.**
 
 ## Off-policy vs On-policy [[off-vs-on]]
 
 The difference is subtle:
 
-- *Off-policy*: using **a different policy for acting and updating.**
+- *Off-policy*: using **a different policy for acting (inference) and updating (training).**
 
 For instance, with Q-Learning, the Epsilon greedy policy (acting policy), is different from the greedy policy that is **used to select the best next-state action value to update our Q-value (updating policy).**
 
@@ -140,7 +140,7 @@ Is different from the policy we use during the training part:
 
 - *On-policy:* using the **same policy for acting and updating.**
 
-For instance, with Sarsa, another value-based algorithm, **the Epsilon-Greedy Policy selects the next_state-action pair, not a greedy policy.**
+For instance, with Sarsa, another value-based algorithm, **the Epsilon-Greedy Policy selects the next state-action pair, not a greedy policy.**
 
 
 <figure>
diff --git a/units/en/unit2/quiz1.mdx b/units/en/unit2/quiz1.mdx
index cc5692d..2372fdd 100644
--- a/units/en/unit2/quiz1.mdx
+++ b/units/en/unit2/quiz1.mdx
@@ -102,4 +102,4 @@ The immediate reward + the discounted value of the state that follows
 
 </details>
 
-Congrats on finishing this Quiz 🥳, if you missed some elements, take time to read again the chapter to reinforce (😏) your knowledge.
+Congrats on finishing this Quiz 🥳! If you missed some elements, take time to reread the previous sections to reinforce (😏) your knowledge.
diff --git a/units/en/unit2/summary1.mdx b/units/en/unit2/summary1.mdx
index 3a19d86..ee3c202 100644
--- a/units/en/unit2/summary1.mdx
+++ b/units/en/unit2/summary1.mdx
@@ -1,12 +1,12 @@
 # Summary [[summary1]]
 
-Before diving on Q-Learning, let's summarize what we just learned.
+Before diving into Q-Learning, let's summarize what we just learned.
 
 We have two types of value-based functions:
 
 - State-Value function: outputs the expected return if **the agent starts at a given state and acts accordingly to the policy forever after.**
 - Action-Value function: outputs the expected return if **the agent starts in a given state, takes a given action at that state** and then acts accordingly to the policy forever after.
-- In value-based methods, **we define the policy by hand** because we don't train it, we train a value function. The idea is that if we have an optimal value function, we **will have an optimal policy.**
+- In value-based methods, rather than learning the policy, **we define the policy by hand** and we learn a value function. If we have an optimal value function, we **will have an optimal policy.**
 
 There are two types of methods to learn a policy for a value function:
 
diff --git a/units/en/unit2/two-types-value-based-methods.mdx b/units/en/unit2/two-types-value-based-methods.mdx
index 47da6ef..3ea7591 100644
--- a/units/en/unit2/two-types-value-based-methods.mdx
+++ b/units/en/unit2/two-types-value-based-methods.mdx
@@ -7,7 +7,7 @@ In value-based methods, **we learn a value function** that **maps a state to
 The value of a state is the **expected discounted return** the agent can get if it **starts at that state and then acts according to our policy.**
 
 <Tip>
-But what does it mean to act according to our policy? After all, we don't have a policy in value-based methods, since we train a value function and not a policy.
+But what does it mean to act according to our policy? After all, we don't have a policy in value-based methods since we train a value function and not a policy.
 </Tip>
 
 Remember that the goal of an **RL agent is to have an optimal policy π.**
@@ -22,7 +22,7 @@ The policy takes a state as input and outputs what action to take at that state
 
 And consequently, **we don't define by hand the behavior of our policy; it's the training that will define it.**
 
-- *Value-based methods:* **Indirectly, by training a value function** that outputs the value of a state or a state-action pair. Given this value function, our policy **will take action.**
+- *Value-based methods:* **Indirectly, by training a value function** that outputs the value of a state or a state-action pair. Given this value function, our policy **will take an action.**
 
 Since the policy is not trained/learned, **we need to specify its behavior.** For instance, if we want a policy that, given the value function, will take actions that always lead to the biggest reward, **we'll create a Greedy Policy.**
 
@@ -51,7 +51,7 @@ We write the state value function under a policy π like this:
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/state-value-function-1.jpg" alt="State value function"/>
 
-For each state, the state-value function outputs the expected return if the agent **starts at that state,** and then follows the policy forever afterwards (for all future timesteps, if you prefer).
+For each state, the state-value function outputs the expected return if the agent **starts at that state** and then follows the policy forever afterward (for all future timesteps, if you prefer).
 
 <figure>
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/state-value-function-2.jpg" alt="State value function"/>
@@ -79,7 +79,7 @@ We see that the difference is:
 Note: We didn't fill all the state-action pairs for the example of Action-value function</figcaption>
 </figure>
 
-In either case, whatever value function we choose (state-value or action-value function), **the value is the expected return.**
+In either case, whatever value function we choose (state-value or action-value function), **the returned value is the expected return.**
 
 However, the problem is that it implies that **to calculate EACH value of a state or a state-action pair, we need to sum all the rewards an agent can get if it starts at that state.**
 

From 32e54d7e000860d3b4ec06130848323e04d36f99 Mon Sep 17 00:00:00 2001
From: simoninithomas <simonini_thomas@outlook.fr>
Date: Fri, 9 Dec 2022 08:01:28 +0100
Subject: [PATCH 10/49] Updated discord section

---
 README.md                     | 2 +-
 units/en/unit0/discord101.mdx | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index ac2a07f..b2e6427 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# The Hugging Face Deep Reinforcement Learning Course 🤗 (v2.0)
+# [The Hugging Face Deep Reinforcement Learning Course 🤗 (v2.0)](https://huggingface.co/deep-rl-course/unit0/introduction)
 
 This repository contains the Deep Reinforcement Learning Course mdx files and notebooks. The website is here: https://huggingface.co/deep-rl-course/unit0/introduction?fw=pt
 
diff --git a/units/en/unit0/discord101.mdx b/units/en/unit0/discord101.mdx
index f46667f..f970432 100644
--- a/units/en/unit0/discord101.mdx
+++ b/units/en/unit0/discord101.mdx
@@ -17,6 +17,7 @@ They are in the reinforcement learning lounge. **Don't forget to sign up to thes
 - `rl-announcements`: where we give the **lastest information about the course**.
 - `rl-discussions`: where you can **exchange about RL and share information**.
 - `rl-study-group`: where you can **create and join study groups**.
+- `rl-i-made-this`: where you can **share your projects and models**.
 
 The HF Community Server has a thriving community of human beings interested in many areas, so you can also learn from those. There are paper discussions, events, and many other things.
 

From fbe55c2063c38ef7484d623b005622aff2170791 Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Fri, 9 Dec 2022 09:14:33 +0100
Subject: [PATCH 11/49] Update units/en/unit2/what-is-rl.mdx

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
---
 units/en/unit2/what-is-rl.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/units/en/unit2/what-is-rl.mdx b/units/en/unit2/what-is-rl.mdx
index 2c31486..e939c21 100644
--- a/units/en/unit2/what-is-rl.mdx
+++ b/units/en/unit2/what-is-rl.mdx
@@ -1,6 +1,6 @@
 # What is RL? A short recap [[what-is-rl]]
 
-In RL, we build an agent that can **make smart decisions**. For instance, an agent that **learns to play a video game.** Or a trading agent that **learns to maximize its benefits** by making smart decisions on **what stocks to buy and when to sell.**
+In RL, we build an agent that can **make smart decisions**. For instance, an agent that **learns to play a video game.** Or a trading agent that **learns to maximize its benefits** by deciding on **what stocks to buy and when to sell.**
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/rl-process.jpg" alt="RL process"/>
 

From 2ee9fcfb327d3807f36dd34ede319afb5eb5bfa6 Mon Sep 17 00:00:00 2001
From: simoninithomas <simonini_thomas@outlook.fr>
Date: Fri, 9 Dec 2022 12:21:26 +0100
Subject: [PATCH 12/49] Updated Unit 2 + added notebook

---
 notebooks/unit2/requirements-unit2.txt        |   10 +
 notebooks/unit2/unit2.ipynb                   |   81 +-
 notebooks/unit2/unit2.mdx                     | 1089 +++++++++++++++++
 units/en/_toctree.yml                         |    6 +-
 units/en/unit2/bellman-equation.mdx           |   11 +-
 units/en/unit2/hands-on.mdx                   | 1062 +++++++++++++++-
 units/en/unit2/mc-vs-td.mdx                   |    8 +-
 units/en/unit2/q-learning.mdx                 |   15 +-
 units/en/unit2/quiz1.mdx                      |    6 +-
 units/en/unit2/summary1.mdx                   |    8 +-
 units/en/unit2/summary2.mdx                   |   25 +
 .../unit2/two-types-value-based-methods.mdx   |   16 +-
 12 files changed, 2271 insertions(+), 66 deletions(-)
 create mode 100644 notebooks/unit2/requirements-unit2.txt
 create mode 100644 notebooks/unit2/unit2.mdx
 create mode 100644 units/en/unit2/summary2.mdx

diff --git a/notebooks/unit2/requirements-unit2.txt b/notebooks/unit2/requirements-unit2.txt
new file mode 100644
index 0000000..733afc8
--- /dev/null
+++ b/notebooks/unit2/requirements-unit2.txt
@@ -0,0 +1,10 @@
+gym==0.24
+pygame
+numpy
+
+huggingface_hub
+pickle5
+pyyaml==6.0
+imageio
+imageio_ffmpeg
+pyglet==1.5.1
diff --git a/notebooks/unit2/unit2.ipynb b/notebooks/unit2/unit2.ipynb
index cf97cdc..90de5a6 100644
--- a/notebooks/unit2/unit2.ipynb
+++ b/notebooks/unit2/unit2.ipynb
@@ -25,32 +25,16 @@
         "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/envs.gif\" alt=\"Environments\"/>"
       ]
     },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "TODO: ADD TEXT LIVE INFO"
-      ],
-      "metadata": {
-        "id": "yaBKcncmYku4"
-      }
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "TODO: ADD IF YOU HAVE QUESTIONS\n"
-      ],
-      "metadata": {
-        "id": "hz5KE5HjYlRh"
-      }
-    },
     {
       "cell_type": "markdown",
       "source": [
         "###🎮 Environments: \n",
+        "\n",
         "- [FrozenLake-v1](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)\n",
         "- [Taxi-v3](https://www.gymlibrary.dev/environments/toy_text/taxi/)\n",
         "\n",
         "###📚 RL-Library: \n",
+        "\n",
         "- Python and Numpy"
       ],
       "metadata": {
@@ -73,7 +57,9 @@
       },
       "source": [
         "## Objectives of this notebook 🏆\n",
+        "\n",
         "At the end of the notebook, you will:\n",
+        "\n",
         "- Be able to use **Gym**, the environment library.\n",
         "- Be able to code from scratch a Q-Learning agent.\n",
         "- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.\n",
@@ -120,7 +106,7 @@
         "## Prerequisites 🏗️\n",
         "Before diving into the notebook, you need to:\n",
         "\n",
-        "🔲 📚 **Study Q-Learning by reading Unit 2**  🤗 ADD LINK "
+        "🔲 📚 **Study [Q-Learning by reading Unit 2](https://huggingface.co/deep-rl-course/unit2/introduction)**  🤗  "
       ]
     },
     {
@@ -139,6 +125,7 @@
       },
       "source": [
         "- The *Q-Learning* **is the RL algorithm that**  \n",
+        "\n",
         "  - Trains *Q-Function*, an **action-value function** that contains, as internal memory, a *Q-table* **that contains all the state-action pair values.**\n",
         "    \n",
         "  - Given a state and action, our Q-Function **will search into its Q-table the corresponding value.**\n",
@@ -194,15 +181,6 @@
         "id": "4gpxC1_kqUYe"
       }
     },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "TODO CHANGE LINK OF THE REQUIREMENTS"
-      ],
-      "metadata": {
-        "id": "32e3NPYgH5ET"
-      }
-    },
     {
       "cell_type": "code",
       "execution_count": null,
@@ -211,7 +189,7 @@
       },
       "outputs": [],
       "source": [
-        "!pip install -r https://huggingface.co/spaces/ThomasSimonini/temp-space-requirements/raw/main/requirements/requirements-unit2.txt"
+        "!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit2/requirements-unit2.txt"
       ]
     },
     {
@@ -230,6 +208,27 @@
       "execution_count": null,
       "outputs": []
     },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "To make sure the newly installed libraries are used, **sometimes it's required to restart the notebook runtime**. The next cell will force the **runtime to crash, so you'll need to connect again and run the code starting from here**. Thanks to this trick, **we will be able to run our virtual screen.**"
+      ],
+      "metadata": {
+        "id": "K6XC13pTfFiD"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import os\n",
+        "os.kill(os.getpid(), 9)"
+      ],
+      "metadata": {
+        "id": "3kuZbWAkfHdg"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
     {
       "cell_type": "code",
       "source": [
@@ -317,11 +316,13 @@
         "We're going to train our Q-Learning agent **to navigate from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoid holes (H)**.\n",
         "\n",
         "We can have two sizes of environment:\n",
+        "\n",
         "- `map_name=\"4x4\"`: a 4x4 grid version\n",
         "- `map_name=\"8x8\"`: a 8x8 grid version\n",
         "\n",
         "\n",
         "The environment has two modes:\n",
+        "\n",
         "- `is_slippery=False`: The agent always move in the intended direction due to the non-slippery nature of the frozen lake.\n",
         "- `is_slippery=True`: The agent may not always move in the intended direction due to the slippery nature of the frozen lake (stochastic)."
       ]
@@ -931,6 +932,7 @@
       },
       "source": [
         "## Evaluate our Q-Learning agent 📈\n",
+        "\n",
         "- Normally you should have mean reward of 1.0\n",
         "- It's relatively easy since the state space is really small (16). What you can try to do is [to replace with the slippery version](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)."
       ]
@@ -955,6 +957,7 @@
       },
       "source": [
         "## Publish our trained model on the Hub 🔥\n",
+        "\n",
         "Now that we saw we got good results after the training, we can publish our trained model on the hub 🤗 with one line of code.\n",
         "\n",
         "Here's an example of a Model Card:\n",
@@ -1173,6 +1176,7 @@
       },
       "source": [
         "### .\n",
+        "\n",
         "By using `package_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the hub**.\n",
         "\n",
         "This way:\n",
@@ -1264,9 +1268,10 @@
       },
       "source": [
         "Let's fill the `package_to_hub` function:\n",
+        "\n",
         "- `repo_id`: the name of the Hugging Face Hub Repository that will be created/updated `\n",
         "(repo_id = {username}/{repo_name})`\n",
-        "💡 **A good name is {username}/q-{env_id}**\n",
+        "💡 A good `repo_id` is `{username}/q-{env_id}`\n",
         "- `model`: our model dictionary containing the hyperparameters and the Qtable.\n",
         "- `env`: the environment.\n",
         "- `commit_message`: message of the commit"
@@ -1326,7 +1331,9 @@
         "\n",
         "---\n",
         "\n",
-        "In Taxi-v3 🚕, there are four designated locations in the grid world indicated by R(ed), G(reen), Y(ellow), and B(lue). When the episode starts, the taxi starts off at a random square and the passenger is at a random location. The taxi drives to the passenger’s location, picks up the passenger, drives to the passenger’s destination (another one of the four specified locations), and then drops off the passenger. Once the passenger is dropped off, the episode ends.\n",
+        "In `Taxi-v3` 🚕, there are four designated locations in the grid world indicated by R(ed), G(reen), Y(ellow), and B(lue). \n",
+        "\n",
+        "When the episode starts, **the taxi starts off at a random square** and the passenger is at a random location. The taxi drives to the passenger’s location, **picks up the passenger**, drives to the passenger’s destination (another one of the four specified locations), and then **drops off the passenger**. Once the passenger is dropped off, the episode ends.\n",
         "\n",
         "\n",
         "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/taxi.png\" alt=\"Taxi\">\n"
@@ -1383,6 +1390,7 @@
       },
       "source": [
         "The action space (the set of possible actions the agent can take) is discrete with **6 actions available 🎮**:\n",
+        "\n",
         "- 0: move south\n",
         "- 1: move north\n",
         "- 2: move east\n",
@@ -1391,6 +1399,7 @@
         "- 5: drop off passenger\n",
         "\n",
         "Reward function 💰:\n",
+        "\n",
         "- -1 per step unless other reward is triggered.\n",
         "- +20 delivering passenger.\n",
         "- -10 executing “pickup” and “drop-off” actions illegally."
@@ -1556,7 +1565,8 @@
         "\n",
         "What's amazing with Hugging Face Hub 🤗 is that you can easily load powerful models from the community.\n",
         "\n",
-        "Loading a saved model from the Hub is really easy.\n",
+        "Loading a saved model from the Hub is really easy:\n",
+        "\n",
         "1. You go https://huggingface.co/models?other=q-learning to see the list of all the q-learning saved models.\n",
         "2. You select one and copy its repo_id\n",
         "\n",
@@ -1671,9 +1681,10 @@
         "## Some additional challenges 🏆\n",
         "The best way to learn **is to try things by your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. With 1,000,000 steps, we saw some great results! \n",
         "\n",
-        "In the [Leaderboard](https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?\n",
+        "In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?\n",
         "\n",
         "Here are some ideas to achieve so:\n",
+        "\n",
         "* Train more steps\n",
         "* Try different hyperparameters by looking at what your classmates have done.\n",
         "* **Push your new trained model** on the Hub 🔥\n",
@@ -1711,8 +1722,8 @@
         "id": "BjLhT70TEZIn"
       },
       "source": [
-        "See you on [Unit 3](https://github.com/huggingface/deep-rl-class/tree/main/unit2#unit-2-introduction-to-q-learning)! 🔥\n",
-        "TODO CHANGE LINK\n",
+        "See you on Unit 3! 🔥\n",
+        "\n",
         "## Keep learning, stay awesome 🤗"
       ]
     }
diff --git a/notebooks/unit2/unit2.mdx b/notebooks/unit2/unit2.mdx
new file mode 100644
index 0000000..cfa8618
--- /dev/null
+++ b/notebooks/unit2/unit2.mdx
@@ -0,0 +1,1089 @@
+# Unit 2: Q-Learning with FrozenLake-v1 ⛄ and Taxi-v3 🚕
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/thumbnail.jpg" alt="Unit 2 Thumbnail">
+
+In this notebook, **you'll code from scratch your first Reinforcement Learning agent** playing FrozenLake ❄️ using Q-Learning, share it with the community, and experiment with different configurations.
+
+
+⬇️ Here is an example of what **you will achieve in just a couple of minutes.** ⬇️
+
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/envs.gif" alt="Environments"/>
+
+### 🎮 Environments:
+
+- [FrozenLake-v1](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)
+- [Taxi-v3](https://www.gymlibrary.dev/environments/toy_text/taxi/)
+
+### 📚 RL-Library:
+
+- Python and Numpy
+
+We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub repo](https://github.com/huggingface/deep-rl-class/issues).
+
+## Objectives of this notebook 🏆
+
+At the end of the notebook, you will:
+
+- Be able to use **Gym**, the environment library.
+- Be able to code from scratch a Q-Learning agent.
+- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.
+
+
+
+
+## This notebook is from the Deep Reinforcement Learning Course
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/deep-rl-course-illustration.jpg" alt="Deep RL Course illustration"/>
+
+In this free course, you will:
+
+- 📖 Study Deep Reinforcement Learning in **theory and practice**.
+- 🧑‍💻 Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.
+- 🤖 Train **agents in unique environments** 
+
+And more! Check 📚 the syllabus 👉 https://simoninithomas.github.io/deep-rl-course
+
+Don’t forget to **<a href="http://eepurl.com/ic5ZUD">sign up to the course</a>** (we are collecting your email to be able to **send you the links when each Unit is published and give you information about the challenges and updates).**
+
+
+The best way to keep in touch is to join our Discord server to exchange with the community and with us 👉🏻 https://discord.gg/ydHrjt3WP5
+
+## Prerequisites 🏗️
+Before diving into the notebook, you need to:
+
+🔲 📚 **Study [Q-Learning by reading Unit 2](https://huggingface.co/deep-rl-course/unit2/introduction)**  🤗  
+
+## A small recap of Q-Learning
+
+- *Q-Learning* **is the RL algorithm that:**
+
+  - Trains a *Q-Function*, an **action-value function** that contains, as internal memory, a *Q-table* **that contains all the state-action pair values.**
+    
+  - Given a state and action, our Q-Function **will search into its Q-table the corresponding value.**
+    
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q function"  width="100%"/>
+
+- When the training is done, **we have an optimal Q-Function, so an optimal Q-Table.**
+    
+- And if we **have an optimal Q-function**, we
+have an optimal policy, since we **know, for each state, what is the best action to take.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy"  width="100%"/>
+
+
+But, in the beginning, our **Q-Table is useless since it gives arbitrary values for each state-action pair (most of the time we initialize the Q-Table to 0 values)**. But, as we explore the environment and update our Q-Table, it will give us better and better approximations.
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/q-learning.jpeg" alt="q-learning.jpeg" width="100%"/>
+
+This is the Q-Learning pseudocode:
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="100%"/>
+
+
+# Let's code our first Reinforcement Learning algorithm 🚀
+
+## Install dependencies and create a virtual display 🔽
+
+During the notebook, we'll need to generate a replay video. To do so, with Colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames).
+
+Hence the following cell will install the libraries and create and run a virtual screen 🖥
+
+We’ll install multiple ones:
+
+- `gym`: Contains the FrozenLake-v1 ⛄ and Taxi-v3 🚕 environments. We use `gym==0.24` since it contains a nice Taxi-v3 UI version.
+- `pygame`: Used for the FrozenLake-v1 and Taxi-v3 UI.
+- `numpy`: Used for handling our Q-table.
+
+The Hugging Face Hub 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.
+
+You can see here all the Deep Reinforcement Learning models available 👉 https://huggingface.co/models?other=q-learning
+
+
+```python
+!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit2/requirements-unit2.txt
+```
+
+```python
+%capture
+!sudo apt-get update
+!apt install python-opengl
+!apt install ffmpeg
+!apt install xvfb
+!pip3 install pyvirtualdisplay
+```
+
+To make sure the newly installed libraries are used, **sometimes it's required to restart the notebook runtime**. The next cell will force the **runtime to crash, so you'll need to connect again and run the code starting from here**. Thanks to this trick, **we will be able to run our virtual screen.**
+
+```python
+import os
+
+os.kill(os.getpid(), 9)
+```
+
+```python
+# Virtual display
+from pyvirtualdisplay import Display
+
+virtual_display = Display(visible=0, size=(1400, 900))
+virtual_display.start()
+```
+
+## Import the packages 📦
+
+In addition to the installed libraries, we also use:
+
+- `random`: To generate random numbers (that will be useful for Epsilon-Greedy Policy).
+- `imageio`: To generate a replay video
+
+
+
+
+
+
+```python
+import numpy as np
+import gym
+import random
+import imageio
+import os
+
+import pickle5 as pickle
+from tqdm.notebook import tqdm
+```
+
+We're now ready to code our Q-Learning algorithm 🔥
+
+# Part 1: Frozen Lake ⛄ (non slippery version)
+
+## Create and understand [FrozenLake environment ⛄](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)
+---
+
+💡 A good habit when you start to use an environment is to check its documentation 
+
+👉 https://www.gymlibrary.dev/environments/toy_text/frozen_lake/
+
+---
+
+We're going to train our Q-Learning agent **to navigate from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoiding holes (H)**.
+
+We can have two sizes of environment:
+
+- `map_name="4x4"`: a 4x4 grid version
+- `map_name="8x8"`: an 8x8 grid version
+
+
+The environment has two modes:
+
+- `is_slippery=False`: The agent always moves in the intended direction due to the non-slippery nature of the frozen lake (deterministic).
+- `is_slippery=True`: The agent may not always move in the intended direction due to the slippery nature of the frozen lake (stochastic).
+
+For now, let's keep it simple with the 4x4 map and the non-slippery version.
+
+```python
+# Create the FrozenLake-v1 environment using 4x4 map and non-slippery version
+env = gym.make()  # TODO use the correct parameters
+```
+
+### Solution
+
+```python
+env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
+```
+
+You can create your own custom grid like this:
+
+```python
+desc=["SFFF", "FHFH", "FFFH", "HFFG"]
+gym.make('FrozenLake-v1', desc=desc, is_slippery=True)
+```
+
+but we'll use the default environment for now.
+
+### Let's see what the Environment looks like:
+
+
+```python
+# We create our environment with gym.make("<name_of_the_environment>")
+env.reset()
+print("_____OBSERVATION SPACE_____ \n")
+print("Observation Space", env.observation_space)
+print("Sample observation", env.observation_space.sample())  # Get a random observation
+```
+
+We see with `Observation Space Discrete(16)` that the observation is a value representing the **agent’s current position as current_row * nrows + current_col (where both the row and col start at 0)**.
+
+For example, the goal position in the 4x4 map can be calculated as follows: 3 * 4 + 3 = 15. The number of possible observations is dependent on the size of the map. **For example, the 4x4 map has 16 possible observations.**
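+
+As a minimal sketch of that formula, the conversion from a grid position to a state index could look like the snippet below. The helper name `position_to_state` is hypothetical and not part of Gym, and it multiplies by the number of columns, which equals the number of rows for the square maps used here.
+
+```python
+def position_to_state(row, col, ncols=4):
+    # Row-major index: full rows are counted first, then the column offset
+    # (both row and col start at 0)
+    return row * ncols + col
+
+
+print(position_to_state(0, 0))  # start tile (S), top-left corner
+print(position_to_state(3, 3))  # goal tile (G), bottom-right corner
+```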
+
+
+For instance, this is what state = 0 looks like:
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/frozenlake.png" alt="FrozenLake">
+
+```python
+print("\n _____ACTION SPACE_____ \n")
+print("Action Space Shape", env.action_space.n)
+print("Action Space Sample", env.action_space.sample())  # Take a random action
+```
+
+The action space (the set of possible actions the agent can take) is discrete with 4 actions available 🎮:
+- 0: GO LEFT
+- 1: GO DOWN
+- 2: GO RIGHT
+- 3: GO UP
+
+Reward function 💰:
+- Reach goal: +1
+- Reach hole: 0
+- Reach frozen: 0
+
+## Create and Initialize the Q-table 🗄️
+(👀 Step 1 of the pseudocode)
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="100%"/>
+
+
+It's time to initialize our Q-table! To know how many rows (states) and columns (actions) to use, we need to know the size of the action and observation spaces. OpenAI Gym provides a way to get them: `env.action_space.n` and `env.observation_space.n`
+
+
+```python
+state_space = 
+print("There are ", state_space, " possible states")
+
+action_space = 
+print("There are ", action_space, " possible actions")
+```
+
+```python
+# Let's create our Qtable of size (state_space, action_space) and initialize each value to 0 using np.zeros
+def initialize_q_table(state_space, action_space):
+  Qtable = 
+  return Qtable
+```
+
+```python
+Qtable_frozenlake = initialize_q_table(state_space, action_space)
+```
+
+### Solution
+
+```python
+state_space = env.observation_space.n
+print("There are ", state_space, " possible states")
+
+action_space = env.action_space.n
+print("There are ", action_space, " possible actions")
+```
+
+```python
+# Let's create our Qtable of size (state_space, action_space) and initialize each value to 0 using np.zeros
+def initialize_q_table(state_space, action_space):
+    Qtable = np.zeros((state_space, action_space))
+    return Qtable
+```
+
+```python
+Qtable_frozenlake = initialize_q_table(state_space, action_space)
+```
+
+## Define the epsilon-greedy policy 🤖
+
+Epsilon-Greedy is the training policy that handles the exploration/exploitation trade-off.
+
+The idea with Epsilon Greedy:
+
+- With *probability 1 - ɛ* : **we do exploitation** (aka our agent selects the action with the highest state-action pair value).
+
+- With *probability ɛ*: we do **exploration** (trying random action).
+
+As training progresses, we **reduce the epsilon value since we will need less and less exploration and more exploitation.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg" alt="Q-Learning" width="100%"/>
+
+
+Thanks to Sambit for finding a bug on the epsilon function 🤗
+
+```python
+def epsilon_greedy_policy(Qtable, state, epsilon):
+  # Randomly generate a number between 0 and 1
+  random_num = 
+  # if random_num is greater than epsilon --> exploitation
+  if random_num > epsilon:
+    # Take the action with the highest value given a state
+    # np.argmax can be useful here
+    action = 
+  # else --> exploration
+  else:
+    action = # Take a random action
+  
+  return action
+```
+
+#### Solution
+
+```python
+def epsilon_greedy_policy(Qtable, state, epsilon):
+    # Randomly generate a number between 0 and 1
+    random_num = random.uniform(0, 1)
+    # if random_num is greater than epsilon --> exploitation
+    if random_num > epsilon:
+        # Take the action with the highest value given a state
+        # np.argmax can be useful here
+        action = np.argmax(Qtable[state])
+    # else --> exploration
+    else:
+        action = env.action_space.sample()
+
+    return action
+```
+
+## Define the greedy policy 🤖
+Remember we have two policies since Q-Learning is an **off-policy** algorithm. This means we're using a **different policy for acting and updating the value function**.
+
+- Epsilon greedy policy (acting policy)
+- Greedy policy (updating policy)
+
+The greedy policy will also be the final policy we'll have once the Q-Learning agent is trained. It selects, for a given state, the action with the highest value in the Q-table.
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg" alt="Q-Learning" width="100%"/>
+
+
+```python
+def greedy_policy(Qtable, state):
+  # Exploitation: take the action with the highest state, action value
+  action = 
+  
+  return action
+```
+
+#### Solution
+
+```python
+def greedy_policy(Qtable, state):
+    # Exploitation: take the action with the highest state, action value
+    action = np.argmax(Qtable[state])
+
+    return action
+```
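+
+One detail worth keeping in mind is how `np.argmax` behaves when several actions share the highest value: it returns the first matching index. The snippet below is a small illustration, assuming a freshly initialized Q-table of zeros with the FrozenLake dimensions (16 states, 4 actions).
+
+```python
+import numpy as np
+
+Qtable = np.zeros((16, 4))
+
+# All four values tie at 0, so np.argmax returns the first index
+print(np.argmax(Qtable[0]))
+
+# Once one action has a higher value, that action is selected
+Qtable[0][2] = 0.5
+print(np.argmax(Qtable[0]))
+```
+
+This means that before any training, the greedy policy always picks action 0 (GO LEFT), which is one reason why the acting policy needs exploration.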
+
+## Define the hyperparameters ⚙️
+The exploration-related hyperparameters are some of the most important ones.
+
+- We need to make sure that our agent **explores enough of the state space** to learn a good value approximation. To do that, we need a progressive decay of epsilon.
+- If you decrease epsilon too fast (too high a decay_rate), **you take the risk that your agent gets stuck**, since it didn't explore enough of the state space and hence can't solve the problem.
+
+```python
+# Training parameters
+n_training_episodes = 10000  # Total training episodes
+learning_rate = 0.7  # Learning rate
+
+# Evaluation parameters
+n_eval_episodes = 100  # Total number of test episodes
+
+# Environment parameters
+env_id = "FrozenLake-v1"  # Name of the environment
+max_steps = 99  # Max steps per episode
+gamma = 0.95  # Discounting rate
+eval_seed = []  # The evaluation seed of the environment
+
+# Exploration parameters
+max_epsilon = 1.0  # Exploration probability at start
+min_epsilon = 0.05  # Minimum exploration probability
+decay_rate = 0.0005  # Exponential decay rate for exploration prob
+```
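+
+To get a feel for these exploration parameters, here is a small sketch that evaluates the exponential decay formula used by the training loop in this notebook at a few episodes. The helper name `epsilon_at` is hypothetical and only serves the illustration.
+
+```python
+import numpy as np
+
+max_epsilon = 1.0
+min_epsilon = 0.05
+decay_rate = 0.0005
+
+
+def epsilon_at(episode):
+    # Exponential decay from max_epsilon towards min_epsilon
+    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
+
+
+for episode in [0, 1000, 5000, 9999]:
+    print(f"episode {episode:>4}: epsilon = {epsilon_at(episode):.3f}")
+```
+
+Epsilon starts at `max_epsilon` and gets closer to `min_epsilon` without ever going below it. A larger `decay_rate` makes the drop happen in fewer episodes.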
+
+## Create the training loop method
+
+```python
+def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
+  for episode in range(n_training_episodes):
+    # Reduce epsilon (because we need less and less exploration)
+    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)
+    # Reset the environment
+    state = env.reset()
+    step = 0
+    done = False
+
+    # repeat
+    for step in range(max_steps):
+      # Choose the action At using epsilon greedy policy
+      action = 
+
+      # Take action At and observe Rt+1 and St+1
+      # Take the action (a) and observe the outcome state(s') and reward (r)
+      new_state, reward, done, info = 
+
+      # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
+      Qtable[state][action] = 
+
+      # If done, finish the episode
+      if done:
+        break
+      
+      # Our state is the new state
+      state = new_state
+  return Qtable
+```
+
+#### Solution
+
+```python
+def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
+    for episode in tqdm(range(n_training_episodes)):
+        # Reduce epsilon (because we need less and less exploration)
+        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
+        # Reset the environment
+        state = env.reset()
+        step = 0
+        done = False
+
+        # repeat
+        for step in range(max_steps):
+            # Choose the action At using epsilon greedy policy
+            action = epsilon_greedy_policy(Qtable, state, epsilon)
+
+            # Take action At and observe Rt+1 and St+1
+            # Take the action (a) and observe the outcome state(s') and reward (r)
+            new_state, reward, done, info = env.step(action)
+
+            # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
+            Qtable[state][action] = Qtable[state][action] + learning_rate * (
+                reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action]
+            )
+
+            # If done, finish the episode
+            if done:
+                break
+
+            # Our state is the new state
+            state = new_state
+    return Qtable
+```
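+
+To make the update rule concrete, here is a worked single step under illustrative assumptions: an all-zero Q-table, and the agent standing on tile 14 and moving right (action 2) onto the goal (state 15), which yields a reward of 1. The names `td_target` and `td_error` are introduced only for this sketch.
+
+```python
+import numpy as np
+
+learning_rate = 0.7
+gamma = 0.95
+
+Qtable = np.zeros((16, 4))
+state, action, reward, new_state = 14, 2, 1.0, 15
+
+# TD target: immediate reward + discounted best value of the next state
+td_target = reward + gamma * np.max(Qtable[new_state])
+# TD error: how far the current estimate is from the target
+td_error = td_target - Qtable[state][action]
+# Move the estimate a fraction (learning_rate) of the way towards the target
+Qtable[state][action] = Qtable[state][action] + learning_rate * td_error
+
+print(Qtable[state][action])  # prints 0.7
+```
+
+Since the goal state is terminal, its row in the Q-table is never updated and stays at 0, so repeated visits push `Qtable[14][2]` towards 1.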
+
+## Train the Q-Learning agent 🏃
+
+```python
+Qtable_frozenlake = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_frozenlake)
+```
+
+## Let's see what our Q-Learning table looks like now 👀
+
+```python
+Qtable_frozenlake
+```
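+
+As a rough sanity check on the trained table, assuming the default 4x4 map and the non-slippery setting, the shortest path from S to G takes 6 moves and only the last one is rewarded. With deterministic transitions, the largest value in the first row of the Q-table should therefore approach `gamma ** 5`. The variable names below are hypothetical.
+
+```python
+gamma = 0.95
+shortest_path_length = 6  # moves from S to G on the default 4x4 map (assumption)
+
+# Only the final move yields +1, so it is discounted once per preceding move
+expected_start_value = gamma ** (shortest_path_length - 1)
+print(round(expected_start_value, 4))
+```
+
+If the best value in `Qtable_frozenlake[0]` is far from this number, the agent probably did not explore enough or was not trained long enough.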
+
+## Define the evaluation method 📝
+
+```python
+def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
+    """
+    Evaluate the agent for ``n_eval_episodes`` episodes and return the mean and std of the episode rewards.
+    :param env: The evaluation environment
+    :param max_steps: Maximum number of steps per episode
+    :param n_eval_episodes: Number of episodes to evaluate the agent
+    :param Q: The Q-table
+    :param seed: The evaluation seed array (for taxi-v3)
+    """
+    episode_rewards = []
+    for episode in tqdm(range(n_eval_episodes)):
+        if seed:
+            state = env.reset(seed=seed[episode])
+        else:
+            state = env.reset()
+        step = 0
+        done = False
+        total_rewards_ep = 0
+
+        for step in range(max_steps):
+            # Take the action (index) that have the maximum expected future reward given that state
+            action = np.argmax(Q[state][:])
+            new_state, reward, done, info = env.step(action)
+            total_rewards_ep += reward
+
+            if done:
+                break
+            state = new_state
+        episode_rewards.append(total_rewards_ep)
+    mean_reward = np.mean(episode_rewards)
+    std_reward = np.std(episode_rewards)
+
+    return mean_reward, std_reward
+```
+
+## Evaluate our Q-Learning agent 📈
+
+- Normally, you should get a mean reward of 1.0.
+- It's relatively easy since the state space is really small (16). You can also try [the slippery version](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/) instead.
+
+```python
+# Evaluate our Agent
+mean_reward, std_reward = evaluate_agent(env, max_steps, n_eval_episodes, Qtable_frozenlake, eval_seed)
+print(f"Mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
+```
+
+## Publish our trained model on the Hub 🔥
+
+Now that we saw we got good results after the training, we can publish our trained model on the hub 🤗 with one line of code.
+
+Here's an example of a Model Card:
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/modelcard.png" alt="Model card" width="100%"/>
+
+
+Under the hood, the Hub uses git-based repositories (don't worry if you don't know what git is), which means you can update the model with new versions as you experiment and improve your agent.
+
+#### Do not modify this code
+
+```python
+%%capture
+from huggingface_hub import HfApi, HfFolder, Repository
+from huggingface_hub.repocard import metadata_eval_result, metadata_save
+
+from pathlib import Path
+import datetime
+import json
+```
+
+```python
+def record_video(env, Qtable, out_directory, fps=1):
+    images = []
+    done = False
+    state = env.reset(seed=random.randint(0, 500))
+    img = env.render(mode="rgb_array")
+    images.append(img)
+    while not done:
+        # Take the action (index) that have the maximum expected future reward given that state
+        action = np.argmax(Qtable[state][:])
+        state, reward, done, info = env.step(action)  # We directly put next_state = state for recording logic
+        img = env.render(mode="rgb_array")
+        images.append(img)
+    imageio.mimsave(out_directory, [np.array(img) for i, img in enumerate(images)], fps=fps)
+```
+
+```python
+def push_to_hub(
+    repo_id, model, env, video_fps=1, local_repo_path="hub", commit_message="Push Q-Learning agent to Hub", token=None
+):
+    _, repo_name = repo_id.split("/")
+
+    eval_env = env
+
+    # Create the repo (or clone its content if it's nonempty)
+    api = HfApi()
+
+    repo_url = api.create_repo(
+        repo_id=repo_id,
+        token=token,
+        private=False,
+        exist_ok=True,
+    )
+
+    # Git pull
+    repo_local_path = Path(local_repo_path) / repo_name
+    repo = Repository(repo_local_path, clone_from=repo_url, use_auth_token=True)
+    repo.git_pull()
+
+    repo.lfs_track(["*.mp4"])
+
+    # Step 1: Save the model
+    if env.spec.kwargs.get("map_name"):
+        model["map_name"] = env.spec.kwargs.get("map_name")
+        if env.spec.kwargs.get("is_slippery", "") == False:
+            model["slippery"] = False
+
+    print(model)
+
+    # Pickle the model
+    with open(Path(repo_local_path) / "q-learning.pkl", "wb") as f:
+        pickle.dump(model, f)
+
+    # Step 2: Evaluate the model and build JSON
+    mean_reward, std_reward = evaluate_agent(
+        eval_env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"]
+    )
+
+    # First get datetime
+    eval_datetime = datetime.datetime.now()
+    eval_form_datetime = eval_datetime.isoformat()
+
+    evaluate_data = {
+        "env_id": model["env_id"],
+        "mean_reward": mean_reward,
+        "n_eval_episodes": model["n_eval_episodes"],
+        "eval_datetime": eval_form_datetime,
+    }
+    # Write a JSON file
+    with open(Path(repo_local_path) / "results.json", "w") as outfile:
+        json.dump(evaluate_data, outfile)
+
+    # Step 3: Create the model card
+    # Env id
+    env_name = model["env_id"]
+    if env.spec.kwargs.get("map_name"):
+        env_name += "-" + env.spec.kwargs.get("map_name")
+
+    if env.spec.kwargs.get("is_slippery", "") == False:
+        env_name += "-" + "no_slippery"
+
+    metadata = {}
+    metadata["tags"] = [env_name, "q-learning", "reinforcement-learning", "custom-implementation"]
+
+    # Add metrics
+    eval = metadata_eval_result(
+        model_pretty_name=repo_name,
+        task_pretty_name="reinforcement-learning",
+        task_id="reinforcement-learning",
+        metrics_pretty_name="mean_reward",
+        metrics_id="mean_reward",
+        metrics_value=f"{mean_reward:.2f} +/- {std_reward:.2f}",
+        dataset_pretty_name=env_name,
+        dataset_id=env_name,
+    )
+
+    # Merges both dictionaries
+    metadata = {**metadata, **eval}
+
+    model_card = f"""
+  # **Q-Learning** Agent playing **{model['env_id']}**
+  This is a trained model of a **Q-Learning** agent playing **{model['env_id']}** .
+  """
+
+    model_card += """
+  ## Usage
+  ```python
+  """
+
+    model_card += f"""model = load_from_hub(repo_id="{repo_id}", filename="q-learning.pkl")
+
+  # Don't forget to check if you need to add additional attributes (is_slippery=False etc)
+  env = gym.make(model["env_id"])
+
+  evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
+  """
+
+    model_card += """
+  ```
+  """
+
+    readme_path = repo_local_path / "README.md"
+    readme = ""
+    if readme_path.exists():
+        with readme_path.open("r", encoding="utf8") as f:
+            readme = f.read()
+    else:
+        readme = model_card
+
+    with readme_path.open("w", encoding="utf-8") as f:
+        f.write(readme)
+
+    # Save our metrics to Readme metadata
+    metadata_save(readme_path, metadata)
+
+    # Step 4: Record a video
+    video_path = repo_local_path / "replay.mp4"
+    record_video(env, model["qtable"], video_path, video_fps)
+
+    # Push everything to hub
+    print(f"Pushing repo {repo_name} to the Hugging Face Hub")
+    repo.push_to_hub(commit_message=commit_message)
+
+    print(f"Your model is pushed to the hub. You can view your model here: {repo_url}")
+```
+
+
+By using `push_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the Hub**.
+
+This way:
+- You can **showcase your work** 🔥
+- You can **visualize your agent playing** 👀
+- You can **share with the community an agent that others can use** 💾
+- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
+
+
+To be able to share your model with the community there are three more steps to follow:
+
+1️⃣ (If it's not already done) create a Hugging Face account ➡ https://huggingface.co/join
+
+2️⃣ Sign in, then store your authentication token from the Hugging Face website.
+- Create a new token (https://huggingface.co/settings/tokens) **with write role**
+
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">
+
+
+```python
+from huggingface_hub import notebook_login
+
+notebook_login()
+```
+
+If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
+
+3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using the `push_to_hub()` function
+
+- Let's create **the model dictionary that contains the hyperparameters and the Q-table**.
+
+```python
+model = {
+    "env_id": env_id,
+    "max_steps": max_steps,
+    "n_training_episodes": n_training_episodes,
+    "n_eval_episodes": n_eval_episodes,
+    "eval_seed": eval_seed,
+    "learning_rate": learning_rate,
+    "gamma": gamma,
+    "max_epsilon": max_epsilon,
+    "min_epsilon": min_epsilon,
+    "decay_rate": decay_rate,
+    "qtable": Qtable_frozenlake,
+}
+```
+
+Let's fill in the arguments of the `push_to_hub` function:
+
+- `repo_id`: the name of the Hugging Face Hub repository that will be created/updated (`repo_id = {username}/{repo_name}`).
+💡 A good `repo_id` is `{username}/q-{env_id}`
+- `model`: our model dictionary containing the hyperparameters and the Qtable.
+- `env`: the environment.
+- `commit_message`: message of the commit
+
+```python
+model
+```
+
+```python
+username = ""  # FILL THIS
+repo_name = "q-FrozenLake-v1-4x4-noSlippery"
+push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)
+```
+
+Congrats 🥳 you've just implemented from scratch, trained and uploaded your first Reinforcement Learning agent.
+FrozenLake-v1 no_slippery is a very simple environment, so let's try a harder one 🔥.
+
+# Part 2: Taxi-v3 🚖
+
+## Create and understand [Taxi-v3 🚕](https://www.gymlibrary.dev/environments/toy_text/taxi/)
+---
+
+💡 A good habit when you start to use an environment is to check its documentation 
+
+👉 https://www.gymlibrary.dev/environments/toy_text/taxi/
+
+---
+
+In `Taxi-v3` 🚕, there are four designated locations in the grid world indicated by R(ed), G(reen), Y(ellow), and B(lue). 
+
+When the episode starts, **the taxi starts off at a random square** and the passenger is at a random location. The taxi drives to the passenger’s location, **picks up the passenger**, drives to the passenger’s destination (another one of the four specified locations), and then **drops off the passenger**. Once the passenger is dropped off, the episode ends.
+
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/taxi.png" alt="Taxi">
+
+
+```python
+env = gym.make("Taxi-v3")
+```
+
+There are **500 discrete states since there are 25 taxi positions, 5 possible locations of the passenger** (including the case when the passenger is in the taxi), and **4 destination locations.**
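+
+The count above is simply 25 × 5 × 4 = 500. As a sketch of how such a triple can be packed into a single integer, assuming the encoding described in the Gym Taxi documentation, the state index could be computed as follows. The helper name `encode_taxi_state` is hypothetical and stands in for the environment's own encoder.
+
+```python
+def encode_taxi_state(taxi_row, taxi_col, passenger_location, destination):
+    # 5 rows x 5 columns x 5 passenger locations x 4 destinations
+    return ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination
+
+
+n_states = 25 * 5 * 4
+print(n_states)
+
+# Smallest and largest possible state indices
+print(encode_taxi_state(0, 0, 0, 0))
+print(encode_taxi_state(4, 4, 4, 3))
+```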
+
+
+```python
+state_space = env.observation_space.n
+print("There are ", state_space, " possible states")
+```
+
+```python
+action_space = env.action_space.n
+print("There are ", action_space, " possible actions")
+```
+
+The action space (the set of possible actions the agent can take) is discrete with **6 actions available 🎮**:
+
+- 0: move south
+- 1: move north
+- 2: move east
+- 3: move west
+- 4: pickup passenger
+- 5: drop off passenger
+
+Reward function 💰:
+
+- -1 per step unless other reward is triggered.
+- +20 delivering passenger.
+- -10 executing “pickup” and “drop-off” actions illegally.
+
+```python
+# Create our Q table with state_size rows and action_size columns (500x6)
+Qtable_taxi = initialize_q_table(state_space, action_space)
+print(Qtable_taxi)
+print("Q-table shape: ", Qtable_taxi.shape)
+```
+
+## Define the hyperparameters ⚙️
+⚠ DO NOT MODIFY EVAL_SEED: the eval_seed array **allows us to evaluate your agent with the same taxi starting positions for every classmate**
+
+```python
+# Training parameters
+n_training_episodes = 25000  # Total training episodes
+learning_rate = 0.7  # Learning rate
+
+# Evaluation parameters
+n_eval_episodes = 100  # Total number of test episodes
+
+# DO NOT MODIFY EVAL_SEED
+eval_seed = [
+    16,
+    54,
+    165,
+    177,
+    191,
+    191,
+    120,
+    80,
+    149,
+    178,
+    48,
+    38,
+    6,
+    125,
+    174,
+    73,
+    50,
+    172,
+    100,
+    148,
+    146,
+    6,
+    25,
+    40,
+    68,
+    148,
+    49,
+    167,
+    9,
+    97,
+    164,
+    176,
+    61,
+    7,
+    54,
+    55,
+    161,
+    131,
+    184,
+    51,
+    170,
+    12,
+    120,
+    113,
+    95,
+    126,
+    51,
+    98,
+    36,
+    135,
+    54,
+    82,
+    45,
+    95,
+    89,
+    59,
+    95,
+    124,
+    9,
+    113,
+    58,
+    85,
+    51,
+    134,
+    121,
+    169,
+    105,
+    21,
+    30,
+    11,
+    50,
+    65,
+    12,
+    43,
+    82,
+    145,
+    152,
+    97,
+    106,
+    55,
+    31,
+    85,
+    38,
+    112,
+    102,
+    168,
+    123,
+    97,
+    21,
+    83,
+    158,
+    26,
+    80,
+    63,
+    5,
+    81,
+    32,
+    11,
+    28,
+    148,
+]  # Evaluation seed, this ensures that all classmates' agents are evaluated on the same taxi starting positions
+# Each seed has a specific starting state
+
+# Environment parameters
+env_id = "Taxi-v3"  # Name of the environment
+max_steps = 99  # Max steps per episode
+gamma = 0.95  # Discounting rate
+
+# Exploration parameters
+max_epsilon = 1.0  # Exploration probability at start
+min_epsilon = 0.05  # Minimum exploration probability
+decay_rate = 0.005  # Exponential decay rate for exploration prob
+```
+
+## Train our Q-Learning agent 🏃
+
+```python
+Qtable_taxi = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_taxi)
+```
+
+```python
+Qtable_taxi
+```
+
+## Create a model dictionary 💾 and publish our trained model on the Hub 🔥
+- We create a model dictionary that will contain all the training hyperparameters for reproducibility and the Q-Table.
+
+
+```python
+model = {
+    "env_id": env_id,
+    "max_steps": max_steps,
+    "n_training_episodes": n_training_episodes,
+    "n_eval_episodes": n_eval_episodes,
+    "eval_seed": eval_seed,
+    "learning_rate": learning_rate,
+    "gamma": gamma,
+    "max_epsilon": max_epsilon,
+    "min_epsilon": min_epsilon,
+    "decay_rate": decay_rate,
+    "qtable": Qtable_taxi,
+}
+```
+
+```python
+username = ""  # FILL THIS
+repo_name = "q-Taxi-v3"
+push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)
+```
+
+Now that it's on the Hub, you can compare the results of your Taxi-v3 agent with your classmates using the leaderboard 🏆 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/taxi-leaderboard.png" alt="Taxi Leaderboard">
+
+# Part 3: Load from Hub 🔽
+
+What's amazing with Hugging Face Hub 🤗 is that you can easily load powerful models from the community.
+
+Loading a saved model from the Hub is really easy:
+
+1. Go to https://huggingface.co/models?other=q-learning to see the list of all the saved Q-Learning models.
+2. You select one and copy its repo_id
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/copy-id.png" alt="Copy id">
+
+3. Then we just need to use `load_from_hub` with:
+- The repo_id
+- The filename: the saved model inside the repo.
+
+#### Do not modify this code
+
+```python
+from huggingface_hub import hf_hub_download
+
+
+def load_from_hub(repo_id: str, filename: str) -> dict:
+    """
+    Download a pickled model from the Hugging Face Hub and load it.
+    :param repo_id: id of the model repository from the Hugging Face Hub
+    :param filename: name of the pickled model file in the repository
+    :return: the unpickled model dictionary
+    """
+
+    # Get the model from the Hub, download and cache the model on your local disk
+    pickle_model = hf_hub_download(repo_id=repo_id, filename=filename)
+
+    with open(pickle_model, "rb") as f:
+        downloaded_model_file = pickle.load(f)
+
+    return downloaded_model_file
+```
+
+
+```python
+model = load_from_hub(repo_id="ThomasSimonini/q-Taxi-v3", filename="q-learning.pkl")  # Try to use another model
+
+print(model)
+env = gym.make(model["env_id"])
+
+evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
+```
+
+```python
+model = load_from_hub(
+    repo_id="ThomasSimonini/q-FrozenLake-v1-no-slippery", filename="q-learning.pkl"
+)  # Try to use another model
+
+env = gym.make(model["env_id"], is_slippery=False)
+
+evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
+```
+
+## Some additional challenges 🏆
+The best way to learn **is to try things on your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. With 1,000,000 steps, we saw some great results!
+
+In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?
+
+Here are some ideas to get there:
+
+* Train for more steps
+* Try different hyperparameters by looking at what your classmates have done.
+* **Push your new trained model** on the Hub 🔥
+
+Are walking on ice and driving taxis too boring for you? Try to **change the environment**: why not use the slippery version of FrozenLake-v1? Check how the environments work [using the gym documentation](https://www.gymlibrary.dev/) and have fun 🎉.
+
+_____________________________________________________________________
+Congrats 🥳, you've just implemented, trained, and uploaded your first Reinforcement Learning agent.
+
+Understanding Q-Learning is an **important step to understanding value-based methods.**
+
+In the next Unit with Deep Q-Learning, we'll see that creating and updating a Q-table was a good strategy — **however, this is not scalable.**
+
+For instance, imagine you create an agent that learns to play Doom. 
+
+<img src="https://vizdoom.cs.put.edu.pl/user/pages/01.tutorial/basic.png" alt="Doom"/>
+
+Doom is a large environment with a huge state space (millions of different states). Creating and updating a Q-table for that environment would not be efficient. 
+
+That's why we'll study, in the next unit, Deep Q-Learning, an algorithm **where we use a neural network that approximates, given a state, the different Q-values for each action.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Environments"/>
+
+
+See you in Unit 3! 🔥
+
+## Keep learning, stay awesome 🤗
\ No newline at end of file
diff --git a/units/en/_toctree.yml b/units/en/_toctree.yml
index 1f006b2..6e7658f 100644
--- a/units/en/_toctree.yml
+++ b/units/en/_toctree.yml
@@ -57,13 +57,15 @@
   - local: unit2/mc-vs-td
     title: Monte Carlo vs Temporal Difference Learning
   - local: unit2/summary1
-    title: Summary
+    title: Mid-way Recap
   - local: unit2/quiz1
-    title: First Quiz
+    title: Mid-way Quiz
   - local: unit2/q-learning
     title: Introducing Q-Learning
   - local: unit2/q-learning-example
     title: A Q-Learning example
+  - local: unit2/summary2
+    title: Q-Learning Recap
   - local: unit2/hands-on
     title: Hands-on
   - local: unit2/quiz2
diff --git a/units/en/unit2/bellman-equation.mdx b/units/en/unit2/bellman-equation.mdx
index b284c44..03cab20 100644
--- a/units/en/unit2/bellman-equation.mdx
+++ b/units/en/unit2/bellman-equation.mdx
@@ -31,7 +31,6 @@ The Bellman equation is a recursive equation that works like this: instead of st
 
 <figure>
   <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman4.jpg" alt="Bellman equation"/>
-  <figcaption>For simplification, here we don’t discount so gamma = 1.</figcaption>
 </figure>
 
 
@@ -44,14 +43,20 @@ To calculate the value of State 1: the sum of rewards **if the agent started in
 
 This is equivalent to  \\(V(S_{t})\\)  = Immediate reward  \\(R_{t+1}\\)  + Discounted value of the next state  \\(gamma * V(S_{t+1})\\)
 
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman6.jpg" alt="Bellman equation"/>
-
+<figure>
+  <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman6.jpg" alt="Bellman equation"/>
+  <figcaption>For simplification, here we don’t discount so gamma = 1.</figcaption>
+</figure>
 
 In the interest of simplicity, here we don't discount, so gamma = 1.
 
 - The value of  \\(V(S_{t+1}) \\)  = Immediate reward  \\(R_{t+2}\\)  + Discounted value of the next state ( \\(gamma * V(S_{t+2})\\) ).
 - And so on.
 
+
+
+
+
 To recap, the idea of the Bellman equation is that instead of calculating each value as the sum of the expected return, **which is a long process.** This is equivalent **to the sum of immediate reward + the discounted value of the state that follows.**
 
 Before going to the next section, think about the role of gamma in the Bellman equation. What happens if the value of gamma is very low (e.g. 0.1 or even 0)? What happens if the value is 1? What happens if the value is very high, such as a million?
diff --git a/units/en/unit2/hands-on.mdx b/units/en/unit2/hands-on.mdx
index d683cac..a3dfdb1 100644
--- a/units/en/unit2/hands-on.mdx
+++ b/units/en/unit2/hands-on.mdx
@@ -1,5 +1,13 @@
 # Hands-on [[hands-on]]
 
+<CourseFloatingBanner classNames="absolute z-10 right-0 top-0"
+notebooks={[
+  {label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit2/unit2.ipynb"}
+  ]}
+  askForHelpUrl="http://hf.co/join/discord" />
+
+
+
 Now that we studied the Q-Learning algorithm, let's implement it from scratch and train our Q-Learning agent in two environments:
 1. [Frozen-Lake-v1  (non-slippery and slippery version)](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/) ☃️ : where our agent will need to **go from the starting state (S) to the goal state (G)** by walking only on frozen tiles (F) and avoiding holes (H).
 2. [An autonomous taxi](https://www.gymlibrary.dev/environments/toy_text/taxi/) 🚖 will need **to learn to navigate** a city to **transport its passengers from point A to point B.**
@@ -11,4 +19,1056 @@ Thanks to a [leaderboard](https://huggingface.co/spaces/huggingface-projects/Dee
 
 **To start the hands-on click on Open In Colab button** 👇 :
 
-[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]()
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit2/unit2.ipynb)
+
+
+# Unit 2: Q-Learning with FrozenLake-v1 ⛄ and Taxi-v3 🚕
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/thumbnail.jpg" alt="Unit 2 Thumbnail">
+
+In this notebook, **you'll code your first Reinforcement Learning agent from scratch**, playing FrozenLake ❄️ using Q-Learning, share it with the community, and experiment with different configurations.
+
+
+⬇️ Here is an example of what **you will achieve in just a couple of minutes.** ⬇️
+
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/envs.gif" alt="Environments"/>
+
+### 🎮 Environments:
+
+- [FrozenLake-v1](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)
+- [Taxi-v3](https://www.gymlibrary.dev/environments/toy_text/taxi/)
+
+### 📚 RL-Library:
+
+- Python and NumPy
+
+We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues).
+
+## Objectives of this notebook 🏆
+
+At the end of the notebook, you will:
+
+- Be able to use **Gym**, the environment library.
+- Be able to code from scratch a Q-Learning agent.
+- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.
+
+
+## Prerequisites 🏗️
+
+Before diving into the notebook, you need to:
+
+🔲 📚 **Study [Q-Learning by reading Unit 2](https://huggingface.co/deep-rl-course/unit2/introduction)**  🤗
+
+
+## Install dependencies and create a virtual display 🔽
+
+During the notebook, we'll need to generate a replay video. To do so in Colab, **we need a virtual screen to be able to render the environment** (and thus record the frames).
+
+Hence the following cell will install the libraries and create and run a virtual screen 🖥
+
+We’ll install multiple ones:
+
+- `gym`: Contains the FrozenLake-v1 ⛄ and Taxi-v3 🚕 environments. We use `gym==0.24` since it contains a nice Taxi-v3 UI version.
+- `pygame`: Used for the FrozenLake-v1 and Taxi-v3 UI.
+- `numpy`: Used for handling our Q-table.
+
+The Hugging Face Hub 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.
+
+
+You can see here all the Deep reinforcement Learning models available 👉 https://huggingface.co/models?other=q-learning
+
+
+```bash
+pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit2/requirements-unit2.txt
+```
+
+```bash
+sudo apt-get update
+apt install python-opengl
+apt install ffmpeg
+apt install xvfb
+pip3 install pyvirtualdisplay
+```
+
+To make sure the newly installed libraries are used, **sometimes it's required to restart the notebook runtime**. The next cell will force the **runtime to crash, so you'll need to connect again and run the code starting from here**. Thanks to this trick, **we will be able to run our virtual screen.**
+
+```python
+import os
+
+os.kill(os.getpid(), 9)
+```
+
+```python
+# Virtual display
+from pyvirtualdisplay import Display
+
+virtual_display = Display(visible=0, size=(1400, 900))
+virtual_display.start()
+```
+
+## Import the packages 📦
+
+In addition to the installed libraries, we also use:
+
+- `random`: To generate random numbers (that will be useful for Epsilon-Greedy Policy).
+- `imageio`: To generate a replay video
+
+
+
+
+
+
+```python
+import numpy as np
+import gym
+import random
+import imageio
+import os
+
+import pickle5 as pickle
+from tqdm.notebook import tqdm
+```
+
+We're now ready to code our Q-Learning algorithm 🔥
+
+# Part 1: Frozen Lake ⛄ (non slippery version)
+
+## Create and understand [FrozenLake environment ⛄](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)
+---
+
+💡 A good habit when you start to use an environment is to check its documentation
+
+👉 https://www.gymlibrary.dev/environments/toy_text/frozen_lake/
+
+---
+
+We're going to train our Q-Learning agent **to navigate from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoiding holes (H)**.
+
+We can have two sizes of environment:
+
+- `map_name="4x4"`: a 4x4 grid version
+- `map_name="8x8"`: an 8x8 grid version
+
+
+The environment has two modes:
+
+- `is_slippery=False`: The agent always moves in the intended direction due to the non-slippery nature of the frozen lake (deterministic).
+- `is_slippery=True`: The agent may not always move in the intended direction due to the slippery nature of the frozen lake (stochastic).
+
+For now, let's keep it simple with the 4x4 map and the non-slippery version.
+
+```python
+# Create the FrozenLake-v1 environment using 4x4 map and non-slippery version
+env = gym.make()  # TODO use the correct parameters
+```
+
+### Solution
+
+```python
+env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
+```
+
+You can create your own custom grid like this:
+
+```python
+desc=["SFFF", "FHFH", "FFFH", "HFFG"]
+gym.make('FrozenLake-v1', desc=desc, is_slippery=True)
+```
+
+but we'll use the default environment for now.
+
+### Let's see what the Environment looks like:
+
+
+```python
+# We create our environment with gym.make("<name_of_the_environment>")
+env.reset()
+print("_____OBSERVATION SPACE_____ \n")
+print("Observation Space", env.observation_space)
+print("Sample observation", env.observation_space.sample())  # Get a random observation
+```
+
+We see with `Observation Space Discrete(16)` that the observation is an integer representing the **agent’s current position as current_row * nrows + current_col (where both the row and col start at 0)**.
+
+For example, the goal position in the 4x4 map can be calculated as follows: 3 * 4 + 3 = 15. The number of possible observations is dependent on the size of the map. **For example, the 4x4 map has 16 possible observations.**
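+
+Going the other way, a state index can be decoded back into a grid position. As a small worked sketch (assuming the 4x4 map, so 4 columns per row, and writing \\(s\\) for the state index, a symbol introduced here only for illustration):
+
+\\( \text{row} = \lfloor s / 4 \rfloor, \qquad \text{col} = s \bmod 4 \\)
+
+For \\(s = 6\\), this gives \\(\text{row} = \lfloor 6 / 4 \rfloor = 1\\) and \\(\text{col} = 6 \bmod 4 = 2\\), and indeed \\(1 \times 4 + 2 = 6\\).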
+
+
+For instance, this is what state = 0 looks like:
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/frozenlake.png" alt="FrozenLake">
+
+```python
+print("\n _____ACTION SPACE_____ \n")
+print("Action Space Shape", env.action_space.n)
+print("Action Space Sample", env.action_space.sample())  # Take a random action
+```
+
+The action space (the set of possible actions the agent can take) is discrete with 4 actions available 🎮:
+- 0: GO LEFT
+- 1: GO DOWN
+- 2: GO RIGHT
+- 3: GO UP
+
+Reward function 💰:
+- Reach goal: +1
+- Reach hole: 0
+- Reach frozen: 0
+
+
+## Create and Initialize the Q-table 🗄️
+(👀 Step 1 of the pseudocode)
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="100%"/>
+
+
+It's time to initialize our Q-table! To know how many rows (states) and columns (actions) to use, we need to know the action and observation space. OpenAI Gym provides us a way to do that: `env.action_space.n` and `env.observation_space.n`
+
+
+```python
+state_space =
+print("There are ", state_space, " possible states")
+
+action_space =
+print("There are ", action_space, " possible actions")
+```
+
+```python
+# Let's create our Qtable of size (state_space, action_space) and initialize each value at 0 using np.zeros
+def initialize_q_table(state_space, action_space):
+  Qtable =
+  return Qtable
+```
+
+```python
+Qtable_frozenlake = initialize_q_table(state_space, action_space)
+```
+
+
+### Solution
+
+```python
+state_space = env.observation_space.n
+print("There are ", state_space, " possible states")
+
+action_space = env.action_space.n
+print("There are ", action_space, " possible actions")
+```
+
+```python
+# Let's create our Qtable of size (state_space, action_space) and initialize each value at 0 using np.zeros
+def initialize_q_table(state_space, action_space):
+    Qtable = np.zeros((state_space, action_space))
+    return Qtable
+```
+
+```python
+Qtable_frozenlake = initialize_q_table(state_space, action_space)
+```
+
+## Define the epsilon-greedy policy 🤖
+
+Epsilon-Greedy is the training policy that handles the exploration/exploitation trade-off.
+
+The idea with Epsilon Greedy:
+
+- With *probability 1 - ɛ* : **we do exploitation** (aka our agent selects the action with the highest state-action pair value).
+
+- With *probability ɛ*: we do **exploration** (trying a random action).
+
+And as training progresses, we progressively **reduce the epsilon value since we will need less and less exploration and more exploitation.**
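+
+To make the trade-off concrete, here is a small worked example. It assumes \\(|A|\\) equally likely random actions during exploration and a single best action in the current state (both are assumptions made for illustration). Because a random draw can also land on the best action:
+
+\\( P(\text{best action}) = (1 - \epsilon) + \frac{\epsilon}{|A|}, \qquad P(\text{each other action}) = \frac{\epsilon}{|A|} \\)
+
+With \\(\epsilon = 0.2\\) and \\(|A| = 4\\): \\(P(\text{best action}) = 0.8 + 0.05 = 0.85\\), and each other action has probability \\(0.05\\). The four probabilities sum to \\(0.85 + 3 \times 0.05 = 1\\).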
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg" alt="Q-Learning" width="100%"/>
+
+
+Thanks to Sambit for finding a bug on the epsilon function 🤗
+
+```python
+def epsilon_greedy_policy(Qtable, state, epsilon):
+  # Randomly generate a number between 0 and 1
+  random_num =
+  # if random_num is greater than epsilon --> exploitation
+  if random_num > epsilon:
+    # Take the action with the highest value given a state
+    # np.argmax can be useful here
+    action =
+  # else --> exploration
+  else:
+    action = # Take a random action
+
+  return action
+```
+
+#### Solution
+
+```python
+def epsilon_greedy_policy(Qtable, state, epsilon):
+    # Randomly generate a number between 0 and 1
+    random_num = random.uniform(0, 1)
+    # if random_num is greater than epsilon --> exploitation
+    if random_num > epsilon:
+        # Take the action with the highest value given a state
+        # np.argmax can be useful here
+        action = np.argmax(Qtable[state])
+    # else --> exploration
+    else:
+        action = env.action_space.sample()
+
+    return action
+```
+
+## Define the greedy policy 🤖
+
+Remember we have two policies since Q-Learning is an **off-policy** algorithm. This means we're using a **different policy for acting and updating the value function**.
+
+- Epsilon greedy policy (acting policy)
+- Greedy policy (updating policy)
+
+The greedy policy will also be the final policy we'll have once the Q-Learning agent is trained. The greedy policy is used to select an action from the Q-table.
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg" alt="Q-Learning" width="100%"/>
+
+
+```python
+def greedy_policy(Qtable, state):
+  # Exploitation: take the action with the highest state, action value
+  action =
+
+  return action
+```
+
+#### Solution
+
+```python
+def greedy_policy(Qtable, state):
+    # Exploitation: take the action with the highest state, action value
+    action = np.argmax(Qtable[state])
+
+    return action
+```
+
+## Define the hyperparameters ⚙️
+The exploration-related hyperparameters are some of the most important ones.
+
+- We need to make sure that our agent **explores enough of the state space** to learn a good value approximation. To do that, we need a progressive decay of epsilon.
+- If you decrease epsilon too fast (too high decay_rate), **you take the risk that your agent gets stuck**, since it didn't explore enough of the state space and hence can't solve the problem.
+
+```python
+# Training parameters
+n_training_episodes = 10000  # Total training episodes
+learning_rate = 0.7  # Learning rate
+
+# Evaluation parameters
+n_eval_episodes = 100  # Total number of test episodes
+
+# Environment parameters
+env_id = "FrozenLake-v1"  # Name of the environment
+max_steps = 99  # Max steps per episode
+gamma = 0.95  # Discounting rate
+eval_seed = []  # The evaluation seed of the environment
+
+# Exploration parameters
+max_epsilon = 1.0  # Exploration probability at start
+min_epsilon = 0.05  # Minimum exploration probability
+decay_rate = 0.0005  # Exponential decay rate for exploration prob
+```
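+
+To get a feel for these exploration values, here is a rough worked example. It assumes epsilon follows the exponential schedule that the training loop uses, written here with \\(k\\) for `decay_rate` and \\(n\\) for the episode index (both symbols are introduced only for illustration):
+
+\\( \epsilon_n = \epsilon_{min} + (\epsilon_{max} - \epsilon_{min}) \, e^{-k n} \\)
+
+With the values above (\\(\epsilon_{min} = 0.05\\), \\(\epsilon_{max} = 1.0\\), \\(k = 0.0005\\)) and rounded results:
+
+- Episode 0: \\(0.05 + 0.95 \times e^{0} = 1.0\\)
+- Episode 1,000: \\(0.05 + 0.95 \times e^{-0.5} \approx 0.626\\)
+- Episode 5,000: \\(0.05 + 0.95 \times e^{-2.5} \approx 0.128\\)
+- Episode 10,000: \\(0.05 + 0.95 \times e^{-5} \approx 0.056\\)
+
+So with these settings the agent still takes a random action in roughly 13% of its steps halfway through training, and ends close to `min_epsilon`.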
+
+## Step 6: Create the training loop method
+
+
+```python
+def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
+  for episode in range(n_training_episodes):
+    # Reduce epsilon (because we need less and less exploration)
+    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)
+    # Reset the environment
+    state = env.reset()
+    step = 0
+    done = False
+
+    # repeat
+    for step in range(max_steps):
+      # Choose the action At using epsilon greedy policy
+      action =
+
+      # Take action At and observe Rt+1 and St+1
+      # Take the action (a) and observe the outcome state(s') and reward (r)
+      new_state, reward, done, info =
+
+      # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
+      Qtable[state][action] =
+
+      # If done, finish the episode
+      if done:
+        break
+
+      # Our state is the new state
+      state = new_state
+  return Qtable
+```
+
+#### Solution
+
+```python
+def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
+    for episode in tqdm(range(n_training_episodes)):
+        # Reduce epsilon (because we need less and less exploration)
+        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
+        # Reset the environment
+        state = env.reset()
+        step = 0
+        done = False
+
+        # repeat
+        for step in range(max_steps):
+            # Choose the action At using epsilon greedy policy
+            action = epsilon_greedy_policy(Qtable, state, epsilon)
+
+            # Take action At and observe Rt+1 and St+1
+            # Take the action (a) and observe the outcome state(s') and reward (r)
+            new_state, reward, done, info = env.step(action)
+
+            # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
+            Qtable[state][action] = Qtable[state][action] + learning_rate * (
+                reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action]
+            )
+
+            # If done, finish the episode
+            if done:
+                break
+
+            # Our state is the new state
+            state = new_state
+    return Qtable
+```
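+
+As a sanity check on the update rule, here is a worked example by hand. It assumes the FrozenLake hyperparameters defined earlier (`learning_rate = 0.7`, `gamma = 0.95`) and a Q-table that still contains only zeros; the specific transitions are made up for illustration.
+
+If the agent steps onto the goal from state \\(s\\) (reward 1, and the terminal state's Q-values are still 0):
+
+\\( Q(s, a) \leftarrow 0 + 0.7 \times (1 + 0.95 \times 0 - 0) = 0.7 \\)
+
+If a later episode moves from a neighboring state \\(s_{prev}\\) into \\(s\\) with reward 0, the update bootstraps from the 0.7 learned above:
+
+\\( Q(s_{prev}, a_{prev}) \leftarrow 0 + 0.7 \times (0 + 0.95 \times 0.7 - 0) = 0.4655 \\)
+
+This is how the single reward at the goal gradually propagates backward through the table.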
+
+## Train the Q-Learning agent 🏃
+
+```python
+Qtable_frozenlake = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_frozenlake)
+```
+
+## Let's see what our Q-Learning table looks like now 👀
+
+```python
+Qtable_frozenlake
+```
+
+## Define the evaluation method 📝
+
+```python
+def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
+    """
+    Evaluate the agent for ``n_eval_episodes`` episodes and return the mean and std of the episode rewards.
+    :param env: The evaluation environment
+    :param max_steps: Maximum number of steps per episode
+    :param n_eval_episodes: Number of episodes to evaluate the agent
+    :param Q: The Q-table
+    :param seed: The evaluation seed array (for taxi-v3)
+    """
+    episode_rewards = []
+    for episode in tqdm(range(n_eval_episodes)):
+        if seed:
+            state = env.reset(seed=seed[episode])
+        else:
+            state = env.reset()
+        step = 0
+        done = False
+        total_rewards_ep = 0
+
+        for step in range(max_steps):
+            # Take the action (index) that has the maximum expected future reward given that state
+            action = np.argmax(Q[state][:])
+            new_state, reward, done, info = env.step(action)
+            total_rewards_ep += reward
+
+            if done:
+                break
+            state = new_state
+        episode_rewards.append(total_rewards_ep)
+    mean_reward = np.mean(episode_rewards)
+    std_reward = np.std(episode_rewards)
+
+    return mean_reward, std_reward
+```
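+
+For reference, the two returned statistics can be written out as follows. This is a sketch that writes \\(G_i\\) for the total reward of episode \\(i\\) and \\(N\\) for `n_eval_episodes` (symbols introduced here for illustration). Note that `np.std` computes the population standard deviation by default, dividing by \\(N\\):
+
+\\( \text{mean} = \frac{1}{N} \sum_{i=1}^{N} G_i, \qquad \text{std} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (G_i - \text{mean})^2} \\)
+
+For example, with four hypothetical episode rewards 1, 1, 0, 1: \\(\text{mean} = 3/4 = 0.75\\) and \\(\text{std} = \sqrt{(3 \times 0.25^2 + 0.75^2)/4} = \sqrt{0.1875} \approx 0.433\\).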
+
+## Evaluate our Q-Learning agent 📈
+
+- Normally you should have mean reward of 1.0
+- It's relatively easy since the state space is really small (16). What you can try to do is [to replace with the slippery version](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/).
+
+```python
+# Evaluate our Agent
+mean_reward, std_reward = evaluate_agent(env, max_steps, n_eval_episodes, Qtable_frozenlake, eval_seed)
+print(f"Mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
+```
+
+
+## Publish our trained model on the Hub 🔥
+
+Now that we saw we got good results after the training, we can publish our trained model on the hub 🤗 with one line of code.
+
+Here's an example of a Model Card:
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/modelcard.png" alt="Model card" width="100%"/>
+
+
+Under the hood, the Hub uses git-based repositories (don't worry if you don't know what git is), which means you can update the model with new versions as you experiment and improve your agent.
+
+#### Do not modify this code
+
+```python
+%%capture
+from huggingface_hub import HfApi, HfFolder, Repository
+from huggingface_hub.repocard import metadata_eval_result, metadata_save
+
+from pathlib import Path
+import datetime
+import json
+```
+
+```python
+def record_video(env, Qtable, out_directory, fps=1):
+    images = []
+    done = False
+    state = env.reset(seed=random.randint(0, 500))
+    img = env.render(mode="rgb_array")
+    images.append(img)
+    while not done:
+        # Take the action (index) that has the maximum expected future reward given that state
+        action = np.argmax(Qtable[state][:])
+        state, reward, done, info = env.step(action)  # We directly put next_state = state for recording logic
+        img = env.render(mode="rgb_array")
+        images.append(img)
+    imageio.mimsave(out_directory, [np.array(img) for i, img in enumerate(images)], fps=fps)
+```
+
+```python
+def push_to_hub(
+    repo_id, model, env, video_fps=1, local_repo_path="hub", commit_message="Push Q-Learning agent to Hub", token=None
+):
+    _, repo_name = repo_id.split("/")
+
+    eval_env = env
+
+    # Step 1: Clone or create the repo
+    # Create the repo (or clone its content if it's nonempty)
+    api = HfApi()
+
+    repo_url = api.create_repo(
+        repo_id=repo_id,
+        token=token,
+        private=False,
+        exist_ok=True,
+    )
+
+    # Git pull
+    repo_local_path = Path(local_repo_path) / repo_name
+    repo = Repository(repo_local_path, clone_from=repo_url, use_auth_token=True)
+    repo.git_pull()
+
+    repo.lfs_track(["*.mp4"])
+
+    # Step 1: Save the model
+    if env.spec.kwargs.get("map_name"):
+        model["map_name"] = env.spec.kwargs.get("map_name")
+        if env.spec.kwargs.get("is_slippery", "") == False:
+            model["slippery"] = False
+
+    print(model)
+
+    # Pickle the model
+    with open(Path(repo_local_path) / "q-learning.pkl", "wb") as f:
+        pickle.dump(model, f)
+
+    # Step 2: Evaluate the model and build JSON
+    mean_reward, std_reward = evaluate_agent(
+        eval_env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"]
+    )
+
+    # First get datetime
+    eval_datetime = datetime.datetime.now()
+    eval_form_datetime = eval_datetime.isoformat()
+
+    evaluate_data = {
+        "env_id": model["env_id"],
+        "mean_reward": mean_reward,
+        "n_eval_episodes": model["n_eval_episodes"],
+        "eval_datetime": eval_form_datetime,
+    }
+    # Write a JSON file
+    with open(Path(repo_local_path) / "results.json", "w") as outfile:
+        json.dump(evaluate_data, outfile)
+
+    # Step 3: Create the model card
+    # Env id
+    env_name = model["env_id"]
+    if env.spec.kwargs.get("map_name"):
+        env_name += "-" + env.spec.kwargs.get("map_name")
+
+    if env.spec.kwargs.get("is_slippery", "") == False:
+        env_name += "-" + "no_slippery"
+
+    metadata = {}
+    metadata["tags"] = [env_name, "q-learning", "reinforcement-learning", "custom-implementation"]
+
+    # Add metrics
+    eval = metadata_eval_result(
+        model_pretty_name=repo_name,
+        task_pretty_name="reinforcement-learning",
+        task_id="reinforcement-learning",
+        metrics_pretty_name="mean_reward",
+        metrics_id="mean_reward",
+        metrics_value=f"{mean_reward:.2f} +/- {std_reward:.2f}",
+        dataset_pretty_name=env_name,
+        dataset_id=env_name,
+    )
+
+    # Merges both dictionaries
+    metadata = {**metadata, **eval}
+
+    model_card = f"""
+  # **Q-Learning** Agent playing **{env_id}**
+  This is a trained model of a **Q-Learning** agent playing **{env_id}** .
+  """
+
+    model_card += """
+  ## Usage
+  ```python
+  """
+
+    model_card += f"""model = load_from_hub(repo_id="{repo_id}", filename="q-learning.pkl")
+
+  # Don't forget to check if you need to add additional attributes (is_slippery=False etc)
+  env = gym.make(model["env_id"])
+
+  evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
+  """
+
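+    # Close the usage code fence in the model card (built from single backticks to keep this block's own fence intact)
+    model_card += "\n  " + "`" * 3 + "\n  "
+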
+    readme_path = repo_local_path / "README.md"
+    readme = ""
+    if readme_path.exists():
+        with readme_path.open("r", encoding="utf8") as f:
+            readme = f.read()
+    else:
+        readme = model_card
+
+    with readme_path.open("w", encoding="utf-8") as f:
+        f.write(readme)
+
+    # Save our metrics to Readme metadata
+    metadata_save(readme_path, metadata)
+
+    # Step 4: Record a video
+    video_path = repo_local_path / "replay.mp4"
+    record_video(env, model["qtable"], video_path, video_fps)
+
+    # Push everything to hub
+    print(f"Pushing the repo to the Hugging Face Hub")
+    repo.push_to_hub(commit_message=commit_message)
+
+    print("Your model is pushed to the hub. You can view your model here: ", repo_url)
+```
+
+### .
+
+By using `push_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the Hub**.
+
+This way:
+- You can **showcase your work** 🔥
+- You can **visualize your agent playing** 👀
+- You can **share with the community an agent that others can use** 💾
+- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
+
+
+To be able to share your model with the community there are three more steps to follow:
+
+1️⃣ (If it's not already done) create a Hugging Face account ➡ https://huggingface.co/join
+
+2️⃣ Sign in and then, you need to store your authentication token from the Hugging Face website.
+- Create a new token (https://huggingface.co/settings/tokens) **with write role**
+
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">
+
+
+```python
+from huggingface_hub import notebook_login
+
+notebook_login()
+```
+
+If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
+
+3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using the `push_to_hub()` function
+
+- Let's create **the model dictionary that contains the hyperparameters and the Q_table**.
+
+```python
+model = {
+    "env_id": env_id,
+    "max_steps": max_steps,
+    "n_training_episodes": n_training_episodes,
+    "n_eval_episodes": n_eval_episodes,
+    "eval_seed": eval_seed,
+    "learning_rate": learning_rate,
+    "gamma": gamma,
+    "max_epsilon": max_epsilon,
+    "min_epsilon": min_epsilon,
+    "decay_rate": decay_rate,
+    "qtable": Qtable_frozenlake,
+}
+```
+
+Let's fill in the arguments of the `push_to_hub` function:
+
+- `repo_id`: the name of the Hugging Face Hub repository that will be created/updated (`repo_id = {username}/{repo_name}`).
+💡 A good `repo_id` is `{username}/q-{env_id}`
+- `model`: our model dictionary containing the hyperparameters and the Qtable.
+- `env`: the environment.
+- `commit_message`: message of the commit
+
+```python
+model
+```
+
+```python
+username = ""  # FILL THIS
+repo_name = "q-FrozenLake-v1-4x4-noSlippery"
+push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)
+```
+
+Congrats 🥳 you've just implemented from scratch, trained and uploaded your first Reinforcement Learning agent.
+FrozenLake-v1 no_slippery is a very simple environment, so let's try a harder one 🔥.
+
+# Part 2: Taxi-v3 🚖
+
+## Create and understand [Taxi-v3 🚕](https://www.gymlibrary.dev/environments/toy_text/taxi/)
+---
+
+💡 A good habit when you start to use an environment is to check its documentation
+
+👉 https://www.gymlibrary.dev/environments/toy_text/taxi/
+
+---
+
+In `Taxi-v3` 🚕, there are four designated locations in the grid world indicated by R(ed), G(reen), Y(ellow), and B(lue).
+
+When the episode starts, **the taxi starts off at a random square** and the passenger is at a random location. The taxi drives to the passenger’s location, **picks up the passenger**, drives to the passenger’s destination (another one of the four specified locations), and then **drops off the passenger**. Once the passenger is dropped off, the episode ends.
+
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/taxi.png" alt="Taxi">
+
+
+```python
+env = gym.make("Taxi-v3")
+```
+
+There are **500 discrete states since there are 25 taxi positions, 5 possible locations of the passenger** (including the case when the passenger is in the taxi), and **4 destination locations.**
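+
+Written out, the count is \\(25 \times 5 \times 4 = 500\\). As a sketch of how one index can combine these three pieces, assuming the encoding described in the Gym Taxi documentation (check the documentation of your installed version):
+
+\\( s = ((\text{row} \times 5 + \text{col}) \times 5 + \text{passenger}) \times 4 + \text{destination} \\)
+
+For example, a taxi at row 2, column 3, with passenger location 1 and destination 0 would give \\(((2 \times 5 + 3) \times 5 + 1) \times 4 + 0 = 264\\).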
+
+
+```python
+state_space = env.observation_space.n
+print("There are ", state_space, " possible states")
+```
+
+```python
+action_space = env.action_space.n
+print("There are ", action_space, " possible actions")
+```
+
+The action space (the set of possible actions the agent can take) is discrete with **6 actions available 🎮**:
+
+- 0: move south
+- 1: move north
+- 2: move east
+- 3: move west
+- 4: pickup passenger
+- 5: drop off passenger
+
+Reward function 💰:
+
+- -1 per step unless other reward is triggered.
+- +20 delivering passenger.
+- -10 executing “pickup” and “drop-off” actions illegally.
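+
+As a quick worked example of these rewards (assuming no discounting and no illegal pickup or drop-off, and writing \\(n\\) for the number of steps in the episode, a symbol used here only for illustration): every step except the final, successful drop-off costs 1, so
+
+\\( \text{total reward} = -(n - 1) + 20 \\)
+
+An episode that delivers the passenger in 13 steps therefore scores \\(-12 + 20 = 8\\), which is why shorter routes earn higher totals.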
+
+```python
+# Create our Q table with state_size rows and action_size columns (500x6)
+Qtable_taxi = initialize_q_table(state_space, action_space)
+print(Qtable_taxi)
+print("Q-table shape: ", Qtable_taxi.shape)
+```
+
+## Define the hyperparameters ⚙️
+⚠ DO NOT MODIFY EVAL_SEED: the eval_seed array **allows us to evaluate your agent with the same taxi starting positions for every classmate**
+
+```python
+# Training parameters
+n_training_episodes = 25000  # Total training episodes
+learning_rate = 0.7  # Learning rate
+
+# Evaluation parameters
+n_eval_episodes = 100  # Total number of test episodes
+
+
+
+# DO NOT MODIFY EVAL_SEED
+eval_seed = [
+    16,
+    54,
+    165,
+    177,
+    191,
+    191,
+    120,
+    80,
+    149,
+    178,
+    48,
+    38,
+    6,
+    125,
+    174,
+    73,
+    50,
+    172,
+    100,
+    148,
+    146,
+    6,
+    25,
+    40,
+    68,
+    148,
+    49,
+    167,
+    9,
+    97,
+    164,
+    176,
+    61,
+    7,
+    54,
+    55,
+    161,
+    131,
+    184,
+    51,
+    170,
+    12,
+    120,
+    113,
+    95,
+    126,
+    51,
+    98,
+    36,
+    135,
+    54,
+    82,
+    45,
+    95,
+    89,
+    59,
+    95,
+    124,
+    9,
+    113,
+    58,
+    85,
+    51,
+    134,
+    121,
+    169,
+    105,
+    21,
+    30,
+    11,
+    50,
+    65,
+    12,
+    43,
+    82,
+    145,
+    152,
+    97,
+    106,
+    55,
+    31,
+    85,
+    38,
+    112,
+    102,
+    168,
+    123,
+    97,
+    21,
+    83,
+    158,
+    26,
+    80,
+    63,
+    5,
+    81,
+    32,
+    11,
+    28,
+    148,
+]  # Evaluation seeds: these ensure that all classmates' agents are evaluated on the same taxi starting positions
+# Each seed has a specific starting state
+
+# Environment parameters
+env_id = "Taxi-v3"  # Name of the environment
+max_steps = 99  # Max steps per episode
+gamma = 0.95  # Discounting rate
+
+# Exploration parameters
+max_epsilon = 1.0  # Exploration probability at start
+min_epsilon = 0.05  # Minimum exploration probability
+decay_rate = 0.005  # Exponential decay rate for exploration prob
+```
+
+## Train our Q-Learning agent 🏃
+
+```python
+Qtable_taxi = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_taxi)
+```
+
+```python
+Qtable_taxi
+```
+
+## Create a model dictionary 💾 and publish our trained model on the Hub 🔥
+- We create a model dictionary that will contain all the training hyperparameters for reproducibility and the Q-Table.
+
+
+```python
+model = {
+    "env_id": env_id,
+    "max_steps": max_steps,
+    "n_training_episodes": n_training_episodes,
+    "n_eval_episodes": n_eval_episodes,
+    "eval_seed": eval_seed,
+    "learning_rate": learning_rate,
+    "gamma": gamma,
+    "max_epsilon": max_epsilon,
+    "min_epsilon": min_epsilon,
+    "decay_rate": decay_rate,
+    "qtable": Qtable_taxi,
+}
+```
+
+```python
+username = ""  # FILL THIS
+repo_name = "q-Taxi-v3"
+push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)
+```
+
+Now that it's on the Hub, you can compare the results of your Taxi-v3 agent with your classmates using the leaderboard 🏆 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/taxi-leaderboard.png" alt="Taxi Leaderboard">
+
+# Part 3: Load from Hub 🔽
+
+What's amazing with Hugging Face Hub 🤗 is that you can easily load powerful models from the community.
+
+Loading a saved model from the Hub is really easy:
+
+1. Go to https://huggingface.co/models?other=q-learning to see the list of all the saved Q-Learning models.
+2. You select one and copy its repo_id
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/copy-id.png" alt="Copy id">
+
+3. Then we just need to use `load_from_hub` with:
+- The repo_id
+- The filename: the saved model inside the repo.
+
+#### Do not modify this code
+
+```python
+from urllib.error import HTTPError
+
+from huggingface_hub import hf_hub_download
+
+
+def load_from_hub(repo_id: str, filename: str) -> dict:
+    """
+    Download a pickled model from the Hugging Face Hub and load it.
+    :param repo_id: id of the model repository from the Hugging Face Hub
+    :param filename: name of the pickled model file in the repository
+    :return: the unpickled model dictionary (hyperparameters and Q-table)
+    """
+
+    # Get the model from the Hub, download and cache the model on your local disk
+    pickle_model = hf_hub_download(repo_id=repo_id, filename=filename)
+
+    with open(pickle_model, "rb") as f:
+        downloaded_model_file = pickle.load(f)
+
+    return downloaded_model_file
+```
+
+### .
+
+```python
+model = load_from_hub(repo_id="ThomasSimonini/q-Taxi-v3", filename="q-learning.pkl")  # Try to use another model
+
+print(model)
+env = gym.make(model["env_id"])
+
+evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
+```
+
+```python
+model = load_from_hub(
+    repo_id="ThomasSimonini/q-FrozenLake-v1-no-slippery", filename="q-learning.pkl"
+)  # Try to use another model
+
+env = gym.make(model["env_id"], is_slippery=False)
+
+evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
+```
+
+## Some additional challenges 🏆
+The best way to learn **is to try things on your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. With 1,000,000 steps, we saw some great results!
+
+In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?
+
+Here are some ideas to get there:
+
+* Train for more steps
+* Try different hyperparameters by looking at what your classmates have done.
+* **Push your newly trained model** to the Hub 🔥
+
+Are walking on ice and driving taxis too boring for you? Try to **change the environment**: why not use the slippery version of FrozenLake-v1? Check how they work [using the gym documentation](https://www.gymlibrary.dev/) and have fun 🎉.
+
+_____________________________________________________________________
+Congrats 🥳, you've just implemented, trained, and uploaded your first Reinforcement Learning agent.
+
+Understanding Q-Learning is an **important step to understanding value-based methods.**
+
+In the next Unit with Deep Q-Learning, we'll see that creating and updating a Q-table was a good strategy — **however, this is not scalable.**
+
+For instance, imagine you create an agent that learns to play Doom.
+
+<img src="https://vizdoom.cs.put.edu.pl/user/pages/01.tutorial/basic.png" alt="Doom"/>
+
+Doom is a large environment with a huge state space (millions of different states). Creating and updating a Q-table for that environment would not be efficient.
+
+That's why we'll study, in the next unit, Deep Q-Learning, an algorithm **where we use a neural network that approximates, given a state, the different Q-values for each action.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Environments"/>
+
+
+See you in Unit 3! 🔥
+
+## Keep learning, stay awesome 🤗
diff --git a/units/en/unit2/mc-vs-td.mdx b/units/en/unit2/mc-vs-td.mdx
index 030ee62..da47dc5 100644
--- a/units/en/unit2/mc-vs-td.mdx
+++ b/units/en/unit2/mc-vs-td.mdx
@@ -43,7 +43,7 @@ By running more and more episodes, **the agent will learn to play better and be
 
 For instance, if we train a state-value function using Monte Carlo:
 
-- We just started to train our Value function, **so it returns 0 value for each state**
+- We just started to train our value function, **so it returns 0 value for each state**
 - Our learning rate (lr) is 0.1 and our discount rate is 1 (= no discount)
 - Our mouse **explores the environment and takes random actions**
 
@@ -75,7 +75,7 @@ For instance, if we train a state-value function using Monte Carlo:
 ## Temporal Difference Learning: learning at each step [[td-learning]]
 
 - **Temporal Difference, on the other hand, waits for only one interaction (one step) \\(S_{t+1}\\)**
-- to form a TD target and update \\(V(S_t)\\) using \\(R_{t+1}\\) and \\(gamma * V(S_{t+1})\\).
+- to form a TD target and update \\(V(S_t)\\) using \\(R_{t+1}\\) and \\( \gamma * V(S_{t+1})\\).
 
 The idea with **TD is to update the \\(V(S_t)\\) at each step.**
 
@@ -94,7 +94,7 @@ If we take the same example,
 
   <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-2.jpg" alt="Temporal Difference"/>
 
-- We just started to train our Value function, so it returns 0 value for each state.
+- We just started to train our value function, so it returns 0 value for each state.
 - Our learning rate (lr) is 0.1, and our discount rate is 1 (no discount).
 - Our mouse explore the environment and take a random action: **going to the left**
 - It gets a reward  \\(R_{t+1} = 1\\) since **it eats a piece of cheese**
@@ -106,7 +106,7 @@ If we take the same example,
 
 We can now update  \\(V(S_0)\\):
 
-New  \\(V(S_0) = V(S_0) + lr * [R_1 + gamma * V(S_1) - V(S_0)]\\)
+New  \\(V(S_0) = V(S_0) + lr * [R_1 + \gamma * V(S_1) - V(S_0)]\\)
 
 New \\(V(S_0) = 0 + 0.1 * [1 + 1 * 0–0]\\)
 
diff --git a/units/en/unit2/q-learning.mdx b/units/en/unit2/q-learning.mdx
index d2e8aa4..7a52cc4 100644
--- a/units/en/unit2/q-learning.mdx
+++ b/units/en/unit2/q-learning.mdx
@@ -14,7 +14,11 @@ Q-Learning is an **off-policy value-based method that uses a TD approach to tra
   <figcaption>Given a state and action, our Q Function outputs a state-action value (also called Q-value)</figcaption>
 </figure>
 
-The **Q comes from "the Quality" of that action at that state.**
+The **Q comes from "the Quality" (the value) of that action at that state.**
+
+Let's recap the difference between value and reward:
+- The *value of a state*, or of a *state-action pair*, is the expected cumulative reward our agent gets if it starts at this state (or state-action pair) and then acts according to its policy.
+- The *reward* is the **feedback the agent gets from the environment** after performing an action at a state.
 
 Internally, our Q-function has **a Q-table, a table where each cell corresponds to a state-action value pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.**
 
@@ -34,7 +38,6 @@ Therefore, Q-function contains a Q-table **that has the value of each-state act
 
 <figure>
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q-function"/>
-  <figcaption>Given a state and action pair, our Q-function will search inside its Q-table to output the state-action pair value (the Q value).</figcaption>
 </figure>
 
 If we recap, *Q-Learning* **is the RL algorithm that:**
@@ -69,12 +72,12 @@ This is the Q-Learning pseudocode; let's study each part and **see how it works
 
 We need to initialize the Q-Table for each state-action pair. **Most of the time, we initialize with values of 0.**
 
-### Step 2: Choose action using Epsilon Greedy Strategy [[step2]]
+### Step 2: Choose action using epsilon greedy strategy [[step2]]
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg" alt="Q-learning"/>
 
 
-Epsilon Greedy Strategy is a policy that handles the exploration/exploitation trade-off.
+Epsilon greedy strategy is a policy that handles the exploration/exploitation trade-off.
 
 The idea is that we define epsilon ɛ = 1.0:
 
@@ -122,7 +125,7 @@ The difference is subtle:
 
 - *Off-policy*: using **a different policy for acting (inference) and updating (training).**
 
-For instance, with Q-Learning, the Epsilon greedy policy (acting policy), is different from the greedy policy that is **used to select the best next-state action value to update our Q-value (updating policy).**
+For instance, with Q-Learning, the epsilon greedy policy (acting policy), is different from the greedy policy that is **used to select the best next-state action value to update our Q-value (updating policy).**
 
 
 <figure>
@@ -140,7 +143,7 @@ Is different from the policy we use during the training part:
 
 - *On-policy:* using the **same policy for acting and updating.**
 
-For instance, with Sarsa, another value-based algorithm, **the Epsilon-Greedy Policy selects the next state-action pair, not a greedy policy.**
+For instance, with Sarsa, another value-based algorithm, **the epsilon greedy policy selects the next state-action pair, not a greedy policy.**
 
 
 <figure>
diff --git a/units/en/unit2/quiz1.mdx b/units/en/unit2/quiz1.mdx
index 2372fdd..80bc321 100644
--- a/units/en/unit2/quiz1.mdx
+++ b/units/en/unit2/quiz1.mdx
@@ -1,4 +1,4 @@
-# First Quiz [[quiz1]]
+# Mid-way Quiz [[quiz1]]
 
 The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.
 
@@ -19,7 +19,7 @@ The best way to learn and [to avoid the illusion of competence](https://www.cour
 		},
     {
 			text: "Value-based methods",
-			explain: "With Value-based methods, we train a value function to learn which state is more valuable and use this value function to take the action that leads to it.",
+			explain: "With value-based methods, we train a value function to learn which state is more valuable and use this value function to take the action that leads to it.",
       correct: true
 		},
 		{
@@ -37,7 +37,7 @@ The best way to learn and [to avoid the illusion of competence](https://www.cour
 
 **The Bellman equation is a recursive equation** that works like this: instead of starting for each state from the beginning and calculating the return, we can consider the value of any state as:
 
-Rt+1 + (gamma * V(St+1))
+\\(R_{t+1} + (\gamma * V(S_{t+1}))\\)
 The immediate reward + the discounted value of the state that follows
 
 </details>
diff --git a/units/en/unit2/summary1.mdx b/units/en/unit2/summary1.mdx
index ee3c202..496c5aa 100644
--- a/units/en/unit2/summary1.mdx
+++ b/units/en/unit2/summary1.mdx
@@ -1,17 +1,17 @@
-# Summary [[summary1]]
+# Mid-way Recap [[summary1]]
 
 Before diving into Q-Learning, let's summarize what we just learned.
 
 We have two types of value-based functions:
 
-- State-Value function: outputs the expected return if **the agent starts at a given state and acts accordingly to the policy forever after.**
-- Action-Value function: outputs the expected return if **the agent starts in a given state, takes a given action at that state** and then acts accordingly to the policy forever after.
+- State-value function: outputs the expected return if **the agent starts at a given state and acts accordingly to the policy forever after.**
+- Action-value function: outputs the expected return if **the agent starts in a given state, takes a given action at that state** and then acts accordingly to the policy forever after.
 - In value-based methods, rather than learning the policy, **we define the policy by hand** and we learn a value function. If we have an optimal value function, we **will have an optimal policy.**
 
 There are two types of methods to learn a policy for a value function:
 
 - With *the Monte Carlo method*, we update the value function from a complete episode, and so we **use the actual accurate discounted return of this episode.**
-- With *the TD Learning method,* we update the value function from a step, so we replace Gt that we don't have with **an estimated return called TD target.**
+- With *the TD Learning method,* we update the value function from a step, so we replace \\(G_t\\) that we don't have with **an estimated return called TD target.**
 
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/summary-learning-mtds.jpg" alt="Summary"/>
diff --git a/units/en/unit2/summary2.mdx b/units/en/unit2/summary2.mdx
new file mode 100644
index 0000000..a5653ef
--- /dev/null
+++ b/units/en/unit2/summary2.mdx
@@ -0,0 +1,25 @@
+# Q-Learning Recap [[summary2]]
+
+
+*Q-Learning* **is the RL algorithm that:**
+
+- Trains *Q-Function*, an **action-value function** that contains, as internal memory, a *Q-table* **that contains all the state-action pair values.**
+
+- Given a state and action, our Q-Function **will search its Q-table for the corresponding value.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q function"  width="100%"/>
+
+- When the training is done, **we have an optimal Q-Function, so an optimal Q-Table.**
+
+- And if we **have an optimal Q-function**, we have an optimal policy, since we **know for each state, what is the best action to take.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy"  width="100%"/>
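+
+As a small sketch of this link between values and policy, reading the greedy policy out of a Q-table takes one argmax per state. The 3-state, 2-action table and the helper name `greedy_policy` below are made up for illustration.
+
+```python
+# Toy Q-table: one row per state, one column per action (values are made up)
+q_table = [
+    [0.1, 0.7],
+    [0.4, 0.2],
+    [0.0, 0.9],
+]
+
+
+def greedy_policy(q_table, state):
+    """Return the index of the action with the highest Q-value in `state`."""
+    row = q_table[state]
+    return max(range(len(row)), key=lambda action: row[action])
+
+
+# One best action per state
+policy = [greedy_policy(q_table, state) for state in range(len(q_table))]
+print(policy)  # [1, 0, 1]
+```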
+
+But, in the beginning, our **Q-Table is useless since it gives arbitrary values for each state-action pair (most of the time we initialize the Q-Table to 0 values)**. But, as we explore the environment and update our Q-Table, it will give us better and better approximations.
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/q-learning.jpeg" alt="q-learning.jpeg" width="100%"/>
+
+This is the Q-Learning pseudocode:
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="100%"/>
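+
+The pseudocode can be sketched in plain Python. The 4-state corridor environment, the `step` function, the fixed epsilon, and the hyperparameter values are all assumptions chosen for illustration; the hands-on uses Gym environments, NumPy, and a decaying epsilon instead.
+
+```python
+import random
+
+N_STATES, N_ACTIONS = 4, 2  # actions: 0 = left, 1 = right
+GOAL = N_STATES - 1
+
+
+def step(state, action):
+    """Toy corridor: reward 1 for reaching the rightmost state, else 0."""
+    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
+    done = next_state == GOAL
+    return next_state, (1.0 if done else 0.0), done
+
+
+def epsilon_greedy(q_table, state, epsilon, rng):
+    if rng.random() < epsilon:
+        return rng.randrange(N_ACTIONS)  # explore: random action
+    row = q_table[state]
+    return max(range(N_ACTIONS), key=lambda a: row[a])  # exploit: best known action
+
+
+def train(n_episodes=500, lr=0.7, gamma=0.95, epsilon=0.3, max_steps=20, seed=0):
+    rng = random.Random(seed)
+    # Step 1: initialize the Q-table with zeros
+    q_table = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
+    for _ in range(n_episodes):
+        state = 0
+        for _ in range(max_steps):
+            # Step 2: choose an action with the epsilon greedy policy
+            action = epsilon_greedy(q_table, state, epsilon, rng)
+            # Step 3: perform the action, observe reward and next state
+            next_state, reward, done = step(state, action)
+            # Step 4: update Q(s, a) towards reward + gamma * max_a' Q(s', a')
+            td_target = reward + gamma * max(q_table[next_state])
+            q_table[state][action] += lr * (td_target - q_table[state][action])
+            state = next_state
+            if done:
+                break
+    return q_table
+
+
+q_table = train()
+# Greedy action for each non-terminal state
+greedy_actions = [max(range(N_ACTIONS), key=lambda a: q_table[s][a]) for s in range(GOAL)]
+print(greedy_actions)
+```
+
+Note that the update uses the greedy `max` over the next state's values while the agent acts with epsilon greedy: that difference is what makes Q-Learning off-policy.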
diff --git a/units/en/unit2/two-types-value-based-methods.mdx b/units/en/unit2/two-types-value-based-methods.mdx
index 3ea7591..47a17e2 100644
--- a/units/en/unit2/two-types-value-based-methods.mdx
+++ b/units/en/unit2/two-types-value-based-methods.mdx
@@ -18,7 +18,7 @@ To find the optimal policy, we learned about two different methods:
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/two-approaches-2.jpg" alt="Two RL approaches"/>
 
-The policy takes a state as input and outputs what action to take at that state (deterministic policy).
+The policy takes a state as input and outputs what action to take at that state (deterministic policy: a policy that outputs one action given a state, contrary to a stochastic policy, which outputs a probability distribution over actions).
 
 And consequently, **we don't define by hand the behavior of our policy; it's the training that will define it.**
 
@@ -35,8 +35,8 @@ Consequently, whatever method you use to solve your problem, **you will have a
 
 So the difference is:
 
-- In policy-based, **the optimal policy is found by training the policy directly.**
-- In value-based, **finding an optimal value function leads to having an optimal policy.**
+- In policy-based, **the optimal policy (denoted π*) is found by training the policy directly.**
+- In value-based, **finding an optimal value function (denoted Q* or V*; we'll study the difference later) leads to having an optimal policy.**
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link between value and policy"/>
 
@@ -45,7 +45,7 @@ In fact, most of the time, in value-based methods, you'll use **an Epsilon-Gree
 
 So, we have two types of value-based functions:
 
-## The State-Value function [[state-value-function]]
+## The state-value function [[state-value-function]]
 
 We write the state value function under a policy π like this:
 
@@ -58,11 +58,11 @@ For each state, the state-value function outputs the expected return if the agen
   <figcaption>If we take the state with value -7: it's the expected return starting at that state and taking actions according to our policy (greedy policy), so right, right, right, down, down, right, right.</figcaption>
 </figure>
 
-## The Action-Value function [[action-value-function]]
+## The action-value function [[action-value-function]]
 
-In the Action-value function, for each state and action pair, the action-value function **outputs the expected return** if the agent starts in that state and takes action, and then follows the policy forever after.
+In the action-value function, for each state and action pair, the action-value function **outputs the expected return** if the agent starts in that state and takes action, and then follows the policy forever after.
 
-The value of taking action an in state s under a policy π is:
+The value of taking action \\(a\\) in state \\(s\\) under a policy \\(π\\) is:
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/action-state-value-function-1.jpg" alt="Action State value function"/>
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/action-state-value-function-2.jpg" alt="Action State value function"/>
@@ -83,4 +83,4 @@ In either case, whatever value function we choose (state-value or action-value f
 
 However, the problem is that it implies that **to calculate EACH value of a state or a state-action pair, we need to sum all the rewards an agent can get if it starts at that state.**
 
-This can be a tedious process, and that's **where the Bellman equation comes to help us.**
+This can be a computationally expensive process, and that's **where the Bellman equation comes to help us.**

From 53c98279c6d1c482ebb260b2a91efd6eb355355d Mon Sep 17 00:00:00 2001
From: ankandrew <61120139+ankandrew@users.noreply.github.com>
Date: Fri, 9 Dec 2022 22:12:55 -0300
Subject: [PATCH 13/49] Fix bad-formatted bp list

---
 units/en/unitbonus1/how-huggy-works.mdx | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/units/en/unitbonus1/how-huggy-works.mdx b/units/en/unitbonus1/how-huggy-works.mdx
index bc1ac4d..53d4d95 100644
--- a/units/en/unitbonus1/how-huggy-works.mdx
+++ b/units/en/unitbonus1/how-huggy-works.mdx
@@ -9,9 +9,10 @@ In this environment we aim to train Huggy to **fetch the stick we throw at him.
 
 ## The State Space, what Huggy perceives. [[state-space]]
 Huggy doesn't "see" his environment. Instead, we provide him information about the environment:
-* The target (stick) position
-* The relative position between himself and the target
-* The orientation of his legs.
+
+- The target (stick) position
+- The relative position between himself and the target
+- The orientation of his legs.
 
 Given all this information, Huggy can **use his policy to determine which action to take next to fulfill his goal**.
 

From c56cbbf4dcc2de82dcf38c89e29d264dd81b46ad Mon Sep 17 00:00:00 2001
From: Artagon <florent.vaucher@gmail.com>
Date: Sat, 10 Dec 2022 17:41:03 +0100
Subject: [PATCH 14/49] format and redundancy fixes

---
 units/en/unit1/two-methods.mdx | 39 +++++++++++++---------------------
 1 file changed, 15 insertions(+), 24 deletions(-)

diff --git a/units/en/unit1/two-methods.mdx b/units/en/unit1/two-methods.mdx
index 4f3d8a8..e6459c2 100644
--- a/units/en/unit1/two-methods.mdx
+++ b/units/en/unit1/two-methods.mdx
@@ -11,17 +11,13 @@ In other terms, how to build an RL agent that can **select the actions that ma
 The Policy **π** is the **brain of our Agent**, it’s the function that tells us what **action to take given the state we are.** So it **defines the agent’s behavior** at a given time.
 
 <figure>
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy_1.jpg" alt="Policy">
-<figcaption>Think of policy as the brain of our agent, the function that will tell us the action to take given a state
-
-</figcaption>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy_1.jpg" alt="Policy" />
+<figcaption>Think of policy as the brain of our agent, the function that will tell us the action to take given a state</figcaption>
 </figure>
 
-Think of policy as the brain of our agent, the function that will tells us the action to take given a state
+This Policy **is the function we want to learn**, our goal is to find the optimal policy π\*, the policy that **maximizes expected return** when the agent acts according to it. We find this π\* **through training.**
 
-This Policy **is the function we want to learn**, our goal is to find the optimal policy π*, the policy that** maximizes **expected return** when the agent acts according to it. We find this *π through training.**
-
-There are two approaches to train our agent to find this optimal policy π*:
+There are two approaches to train our agent to find this optimal policy π\*:
 
 - **Directly,** by teaching the agent to learn which **action to take,** given the current state: **Policy-Based Methods.**
 - Indirectly, **teach the agent to learn which state is more valuable** and then take the action that **leads to the more valuable states**: Value-Based Methods.
@@ -33,9 +29,8 @@ In Policy-Based methods, **we learn a policy function directly.**
 This function will define a mapping between each state and the best corresponding action. We can also say that it'll define **a probability distribution over the set of possible actions at that state.**
 
 <figure>
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy_2.jpg" alt="Policy">
-<figcaption>As we can see here, the policy (deterministic) <b>directly indicates the action to take for each step.</b>
-</figcaption>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy_2.jpg" alt="Policy" />
+<figcaption>As we can see here, the policy (deterministic) <b>directly indicates the action to take for each step.</b></figcaption>
 </figure>
 
 
@@ -46,8 +41,7 @@ We have two types of policies:
 
 <figure>
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy_3.jpg" alt="Policy"/>
-<figcaption>action = policy(state)
-</figcaption>
+<figcaption>action = policy(state)</figcaption>
 </figure>
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy_4.jpg" alt="Policy" width="100%"/>
@@ -56,21 +50,19 @@ We have two types of policies:
 
 <figure>
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy_5.jpg" alt="Policy"/>
-<figcaption>policy(actions | state) = probability distribution over the set of actions given the current state
-</figcaption>
+<figcaption>policy(actions | state) = probability distribution over the set of actions given the current state</figcaption>
 </figure>
 
 <figure>
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario"/>
-<figcaption>Given an initial state, our stochastic policy will output probability distributions over the possible actions at that state.
-</figcaption>
+<figcaption>Given an initial state, our stochastic policy will output probability distributions over the possible actions at that state.</figcaption>
 </figure>
 
 
 If we recap:
 
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/pbm_1.jpg" alt="Pbm recap" width="100%">
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/pbm_2.jpg" alt="Pbm recap" width="100%">
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/pbm_1.jpg" alt="Pbm recap" width="100%" />
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/pbm_2.jpg" alt="Pbm recap" width="100%" />
 
 
 ## Value-based methods [[value-based]]
@@ -81,19 +73,18 @@ The value of a state is the **expected discounted return** the agent can get i
 
 “Act according to our policy” just means that our policy is **“going to the state with the highest value”.**
 
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/value_1.jpg" alt="Value based RL" width="100%">
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/value_1.jpg" alt="Value based RL" width="100%" />
 
 Here we see that our value function **defined value for each possible state.**
 
 <figure>
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/value_2.jpg" alt="Value based RL"/>
-<figcaption>Thanks to our value function, at each step our policy will select the state with the biggest value defined by the value function: -7, then -6, then -5 (and so on) to attain the goal.
-</figcaption>
+<figcaption>Thanks to our value function, at each step our policy will select the state with the biggest value defined by the value function: -7, then -6, then -5 (and so on) to attain the goal.</figcaption>
 </figure>
 
 Thanks to our value function, at each step our policy will select the state with the biggest value defined by the value function: -7, then -6, then -5 (and so on) to attain the goal.
 
 If we recap:
 
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/vbm_1.jpg" alt="Vbm recap" width="100%">
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/vbm_2.jpg" alt="Vbm recap" width="100%">
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/vbm_1.jpg" alt="Vbm recap" width="100%" />
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/vbm_2.jpg" alt="Vbm recap" width="100%" />

From 39346c0cc3bd278606c9dfde19a2c209afdfffc5 Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Sun, 11 Dec 2022 22:03:06 +0100
Subject: [PATCH 15/49] Small updates unit 2

---
 notebooks/unit2/unit2.mdx | 1089 -------------------------------------
 units/en/_toctree.yml     |    4 +-
 2 files changed, 2 insertions(+), 1091 deletions(-)
 delete mode 100644 notebooks/unit2/unit2.mdx

diff --git a/notebooks/unit2/unit2.mdx b/notebooks/unit2/unit2.mdx
deleted file mode 100644
index cfa8618..0000000
--- a/notebooks/unit2/unit2.mdx
+++ /dev/null
@@ -1,1089 +0,0 @@
-# Unit 2: Q-Learning with FrozenLake-v1 ⛄ and Taxi-v3 🚕
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/thumbnail.jpg" alt="Unit 2 Thumbnail">
-
-In this notebook, **you'll code from scratch your first Reinforcement Learning agent** playing FrozenLake ❄️ using Q-Learning, share it to the community, and experiment with different configurations.
-
-
-⬇️ Here is an example of what **you will achieve in just a couple of minutes.** ⬇️
-
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/envs.gif" alt="Environments"/>
-
-###🎮 Environments: 
-
-- [FrozenLake-v1](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)
-- [Taxi-v3](https://www.gymlibrary.dev/environments/toy_text/taxi/)
-
-###📚 RL-Library: 
-
-- Python and Numpy
-
-We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the Github Repo](https://github.com/huggingface/deep-rl-class/issues).
-
-## Objectives of this notebook 🏆
-
-At the end of the notebook, you will:
-
-- Be able to use **Gym**, the environment library.
-- Be able to code from scratch a Q-Learning agent.
-- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.
-
-
-
-
-## This notebook is from Deep Reinforcement Learning Course
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/deep-rl-course-illustration.jpg" alt="Deep RL Course illustration"/>
-
-In this free course, you will:
-
-- 📖 Study Deep Reinforcement Learning in **theory and practice**.
-- 🧑‍💻 Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.
-- 🤖 Train **agents in unique environments** 
-
-And more check 📚 the syllabus 👉 https://simoninithomas.github.io/deep-rl-course
-
-Don’t forget to **<a href="http://eepurl.com/ic5ZUD">sign up to the course</a>** (we are collecting your email to be able to **send you the links when each Unit is published and give you information about the challenges and updates).**
-
-
-The best way to keep in touch is to join our discord server to exchange with the community and with us 👉🏻 https://discord.gg/ydHrjt3WP5
-
-## Prerequisites 🏗️
-Before diving into the notebook, you need to:
-
-🔲 📚 **Study [Q-Learning by reading Unit 2](https://huggingface.co/deep-rl-course/unit2/introduction)**  🤗  
-
-## A small recap of Q-Learning
-
-- The *Q-Learning* **is the RL algorithm that**  
-
-  - Trains *Q-Function*, an **action-value function** that contains, as internal memory, a *Q-table* **that contains all the state-action pair values.**
-    
-  - Given a state and action, our Q-Function **will search into its Q-table the corresponding value.**
-    
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q function"  width="100%"/>
-
-- When the training is done,**we have an optimal Q-Function, so an optimal Q-Table.**
-    
-- And if we **have an optimal Q-function**, we
-have an optimal policy,since we **know for each state, what is the best action to take.**
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy"  width="100%"/>
-
-
-But, in the beginning, our **Q-Table is useless since it gives arbitrary value for each state-action pair (most of the time we initialize the Q-Table to 0 values)**. But, as we’ll explore the environment and update our Q-Table it will give us better and better approximations
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/q-learning.jpeg" alt="q-learning.jpeg" width="100%"/>
-
-This is the Q-Learning pseudocode:
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="100%"/>
-
-
-# Let's code our first Reinforcement Learning algorithm 🚀
-
-## Install dependencies and create a virtual display 🔽
-
-During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames). 
-
-Hence the following cell will install the librairies and create and run a virtual screen 🖥
-
-We’ll install multiple ones:
-
-- `gym`: Contains the FrozenLake-v1 ⛄ and Taxi-v3 🚕 environments. We use `gym==0.24` since it contains a nice Taxi-v3 UI version.
-- `pygame`: Used for the FrozenLake-v1 and Taxi-v3 UI.
-- `numpy`: Used for handling our Q-table.
-
-The Hugging Face Hub 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.
-
-You can see all the Q-Learning models available here 👉 https://huggingface.co/models?other=q-learning
-
-
-```python
-!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit2/requirements-unit2.txt
-```
-
-```python
-%%capture
-!sudo apt-get update
-!apt install python-opengl
-!apt install ffmpeg
-!apt install xvfb
-!pip3 install pyvirtualdisplay
-```
-
-To make sure the newly installed libraries are used, **sometimes it's required to restart the notebook runtime**. The next cell will force the **runtime to crash, so you'll need to connect again and run the code starting from here**. Thanks to this trick, **we will be able to run our virtual screen.**
-
-```python
-import os
-
-os.kill(os.getpid(), 9)
-```
-
-```python
-# Virtual display
-from pyvirtualdisplay import Display
-
-virtual_display = Display(visible=0, size=(1400, 900))
-virtual_display.start()
-```
-
-## Import the packages 📦
-
-In addition to the installed libraries, we also use:
-
-- `random`: To generate random numbers (that will be useful for Epsilon-Greedy Policy).
-- `imageio`: To generate a replay video
-
-
-
-
-
-
-```python
-import numpy as np
-import gym
-import random
-import imageio
-import os
-
-import pickle5 as pickle
-from tqdm.notebook import tqdm
-```
-
-We're now ready to code our Q-Learning algorithm 🔥
-
-# Part 1: Frozen Lake ⛄ (non slippery version)
-
-## Create and understand [FrozenLake environment ⛄](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)
----
-
-💡 A good habit when you start to use an environment is to check its documentation 
-
-👉 https://www.gymlibrary.dev/environments/toy_text/frozen_lake/
-
----
-
-We're going to train our Q-Learning agent **to navigate from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoiding holes (H)**.
-
-We can have two sizes of environment:
-
-- `map_name="4x4"`: a 4x4 grid version
-- `map_name="8x8"`: an 8x8 grid version
-
-
-The environment has two modes:
-
-- `is_slippery=False`: The agent always moves in the intended direction due to the non-slippery nature of the frozen lake (deterministic).
-- `is_slippery=True`: The agent may not always move in the intended direction due to the slippery nature of the frozen lake (stochastic).
-
-For now, let's keep it simple with the 4x4 map and the non-slippery version.
-
-```python
-# Create the FrozenLake-v1 environment using 4x4 map and non-slippery version
-env = gym.make()  # TODO use the correct parameters
-```
-
-### Solution
-
-```python
-env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
-```
-
-You can create your own custom grid like this:
-
-```python
-desc=["SFFF", "FHFH", "FFFH", "HFFG"]
-gym.make('FrozenLake-v1', desc=desc, is_slippery=True)
-```
-
-but we'll use the default environment for now.
-
-### Let's see what the Environment looks like:
-
-
-```python
-# We create our environment with gym.make("<name_of_the_environment>")
-env.reset()
-print("_____OBSERVATION SPACE_____ \n")
-print("Observation Space", env.observation_space)
-print("Sample observation", env.observation_space.sample())  # Get a random observation
-```
-
-We see with `Observation Space Discrete(16)` that the observation is an integer representing the **agent’s current position as current_row * ncols + current_col (where both the row and col start at 0)**.
-
-For example, the goal position in the 4x4 map can be calculated as follows: 3 * 4 + 3 = 15. The number of possible observations is dependent on the size of the map. **For example, the 4x4 map has 16 possible observations.**
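-
-The index arithmetic above can be sketched in a few lines. The helpers `to_state` and `to_position` are hypothetical names introduced only for this illustration; they are not part of Gym.
-
-```python
-def to_state(row, col, ncols=4):
-    """Convert a (row, col) grid position into a FrozenLake state index."""
-    return row * ncols + col
-
-
-def to_position(state, ncols=4):
-    """Convert a state index back into its (row, col) grid position."""
-    return divmod(state, ncols)
-
-
-print(to_state(0, 0))  # start tile (S), top-left corner
-print(to_state(3, 3))  # goal tile (G), bottom-right corner of the 4x4 map
-print(to_position(15))
-```
-
-Going back from an index to a grid position is handy when you want to read a trained Q-table row by row and relate it to the map.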
-
-
-For instance, this is what state = 0 looks like:
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/frozenlake.png" alt="FrozenLake">
-
-```python
-print("\n _____ACTION SPACE_____ \n")
-print("Action Space Shape", env.action_space.n)
-print("Action Space Sample", env.action_space.sample())  # Take a random action
-```
-
-The action space (the set of possible actions the agent can take) is discrete with 4 actions available 🎮:
-- 0: GO LEFT
-- 1: GO DOWN
-- 2: GO RIGHT
-- 3: GO UP
-
-Reward function 💰:
-- Reach goal: +1
-- Reach hole: 0
-- Reach frozen: 0
-
-## Create and Initialize the Q-table 🗄️
-(👀 Step 1 of the pseudocode)
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="100%"/>
-
-
-It's time to initialize our Q-table! To know how many rows (states) and columns (actions) to use, we need to know the action and observation space. OpenAI Gym provides us a way to do that: `env.action_space.n` and `env.observation_space.n`
-
-
-```python
-state_space = 
-print("There are ", state_space, " possible states")
-
-action_space = 
-print("There are ", action_space, " possible actions")
-```
-
-```python
-# Let's create our Qtable of size (state_space, action_space) and initialize each value to 0 using np.zeros
-def initialize_q_table(state_space, action_space):
-  Qtable = 
-  return Qtable
-```
-
-```python
-Qtable_frozenlake = initialize_q_table(state_space, action_space)
-```
-
-### Solution
-
-```python
-state_space = env.observation_space.n
-print("There are ", state_space, " possible states")
-
-action_space = env.action_space.n
-print("There are ", action_space, " possible actions")
-```
-
-```python
-# Let's create our Qtable of size (state_space, action_space) and initialize each value to 0 using np.zeros
-def initialize_q_table(state_space, action_space):
-    Qtable = np.zeros((state_space, action_space))
-    return Qtable
-```
-
-```python
-Qtable_frozenlake = initialize_q_table(state_space, action_space)
-```
-
-## Define the epsilon-greedy policy 🤖
-
-Epsilon-Greedy is the training policy that handles the exploration/exploitation trade-off.
-
-The idea with Epsilon Greedy:
-
-- With *probability 1 - ɛ* : **we do exploitation** (aka our agent selects the action with the highest state-action pair value).
-
-- With *probability ɛ*: we do **exploration** (trying a random action).
-
-As training goes on, we progressively **reduce the epsilon value since we will need less and less exploration and more exploitation.**
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg" alt="Q-Learning" width="100%"/>
-
-
-Thanks to Sambit for finding a bug on the epsilon function 🤗
-
-```python
-def epsilon_greedy_policy(Qtable, state, epsilon):
-  # Randomly generate a number between 0 and 1
-  random_num = 
-  # if random_num is greater than epsilon --> exploitation
-  if random_num > epsilon:
-    # Take the action with the highest value given a state
-    # np.argmax can be useful here
-    action = 
-  # else --> exploration
-  else:
-    action = # Take a random action
-  
-  return action
-```
-
-#### Solution
-
-```python
-def epsilon_greedy_policy(Qtable, state, epsilon):
-    # Randomly generate a number between 0 and 1
-    random_num = random.uniform(0, 1)
-    # if random_num is greater than epsilon --> exploitation
-    if random_num > epsilon:
-        # Take the action with the highest value given a state
-        # np.argmax can be useful here
-        action = np.argmax(Qtable[state])
-    # else --> exploration
-    else:
-        action = env.action_space.sample()
-
-    return action
-```
-
-## Define the greedy policy 🤖
-Remember we have two policies since Q-Learning is an **off-policy** algorithm. This means we're using a **different policy for acting and updating the value function**.
-
-- Epsilon greedy policy (acting policy)
-- Greedy policy (updating policy)
-
-The greedy policy will also be the final policy we'll have once the Q-Learning agent is trained. It is used to select an action from the Q-table.
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg" alt="Q-Learning" width="100%"/>
-
-
-```python
-def greedy_policy(Qtable, state):
-  # Exploitation: take the action with the highest state, action value
-  action = 
-  
-  return action
-```
-
-#### Solution
-
-```python
-def greedy_policy(Qtable, state):
-    # Exploitation: take the action with the highest state, action value
-    action = np.argmax(Qtable[state])
-
-    return action
-```
-
-## Define the hyperparameters ⚙️
-The exploration-related hyperparameters are some of the most important ones.
-
-- We need to make sure that our agent **explores enough of the state space** to learn a good value approximation. To do that, we need a progressive decay of epsilon.
-- If you decrease epsilon too fast (too high a decay_rate), **you take the risk that your agent gets stuck**, since it didn't explore enough of the state space and hence can't solve the problem.
-
-```python
-# Training parameters
-n_training_episodes = 10000  # Total training episodes
-learning_rate = 0.7  # Learning rate
-
-# Evaluation parameters
-n_eval_episodes = 100  # Total number of test episodes
-
-# Environment parameters
-env_id = "FrozenLake-v1"  # Name of the environment
-max_steps = 99  # Max steps per episode
-gamma = 0.95  # Discounting rate
-eval_seed = []  # The evaluation seed of the environment
-
-# Exploration parameters
-max_epsilon = 1.0  # Exploration probability at start
-min_epsilon = 0.05  # Minimum exploration probability
-decay_rate = 0.0005  # Exponential decay rate for exploration prob
-```
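-
-To get a feel for how quickly exploration fades with these values, here is a small sketch of the exponential schedule that the training loop applies at the start of each episode. The helper name `epsilon_at` is hypothetical and exists only for this illustration.
-
-```python
-import numpy as np
-
-# Same exploration values as in the hyperparameters cell
-max_epsilon = 1.0
-min_epsilon = 0.05
-decay_rate = 0.0005
-
-
-def epsilon_at(episode):
-    """Exploration probability used at a given training episode."""
-    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
-
-
-for episode in [0, 1000, 5000, 10000]:
-    print(f"episode {episode:>5}: epsilon = {epsilon_at(episode):.3f}")
-```
-
-Epsilon starts at `max_epsilon` and approaches `min_epsilon` without ever going below it. A larger `decay_rate` makes the curve drop sooner, which is the "too fast" case described above.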
-
-## Create the training loop method
-
-```python
-def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
-  for episode in range(n_training_episodes):
-    # Reduce epsilon (because we need less and less exploration)
-    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)
-    # Reset the environment
-    state = env.reset()
-    step = 0
-    done = False
-
-    # repeat
-    for step in range(max_steps):
-      # Choose the action At using epsilon greedy policy
-      action = 
-
-      # Take action At and observe Rt+1 and St+1
-      # Take the action (a) and observe the outcome state(s') and reward (r)
-      new_state, reward, done, info = 
-
-      # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
-      Qtable[state][action] = 
-
-      # If done, finish the episode
-      if done:
-        break
-      
-      # Our state is the new state
-      state = new_state
-  return Qtable
-```
-
-#### Solution
-
-```python
-def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
-    for episode in tqdm(range(n_training_episodes)):
-        # Reduce epsilon (because we need less and less exploration)
-        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
-        # Reset the environment
-        state = env.reset()
-        step = 0
-        done = False
-
-        # repeat
-        for step in range(max_steps):
-            # Choose the action At using epsilon greedy policy
-            action = epsilon_greedy_policy(Qtable, state, epsilon)
-
-            # Take action At and observe Rt+1 and St+1
-            # Take the action (a) and observe the outcome state(s') and reward (r)
-            new_state, reward, done, info = env.step(action)
-
-            # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
-            Qtable[state][action] = Qtable[state][action] + learning_rate * (
-                reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action]
-            )
-
-            # If done, finish the episode
-            if done:
-                break
-
-            # Our state is the new state
-            state = new_state
-    return Qtable
-```
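-
-As a sanity check of the update rule, here is one update worked out by hand on a fresh table. This is a sketch under stated assumptions: the transition (state 14, action 2, reward 1, next state 15) is picked by hand to match the last move before the goal on the default 4x4 map, and `toy_qtable`, `td_target` and `td_error` are illustrative names.
-
-```python
-import numpy as np
-
-learning_rate = 0.7
-gamma = 0.95
-
-# Fresh Q-table for the 4x4 FrozenLake: 16 states x 4 actions
-toy_qtable = np.zeros((16, 4))
-
-# Hand-picked transition: from state 14, action 2 (GO RIGHT) reaches the goal (state 15)
-state, action, reward, new_state = 14, 2, 1.0, 15
-
-# TD target: immediate reward plus the discounted best value of the next state
-td_target = reward + gamma * np.max(toy_qtable[new_state])
-# TD error: how far the current estimate is from the target
-td_error = td_target - toy_qtable[state][action]
-toy_qtable[state][action] += learning_rate * td_error
-
-print(toy_qtable[state][action])
-```
-
-Since every entry starts at 0, the target is just the reward, and the entry for (14, 2) moves 70% of the way toward it, ending at 0.7. Later episodes propagate that value backward to the states leading to 14.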
-
-## Train the Q-Learning agent 🏃
-
-```python
-Qtable_frozenlake = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_frozenlake)
-```
-
-## Let's see what our Q-Learning table looks like now 👀
-
-```python
-Qtable_frozenlake
-```
-
-## Define the evaluation method 📝
-
-```python
-def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
-    """
-    Evaluate the agent for ``n_eval_episodes`` episodes and return the mean and standard deviation of the episode rewards.
-    :param env: The evaluation environment
-    :param max_steps: Maximum number of steps per episode
-    :param n_eval_episodes: Number of episodes to evaluate the agent
-    :param Q: The Q-table
-    :param seed: The evaluation seed array (for taxi-v3)
-    """
-    episode_rewards = []
-    for episode in tqdm(range(n_eval_episodes)):
-        if seed:
-            state = env.reset(seed=seed[episode])
-        else:
-            state = env.reset()
-        step = 0
-        done = False
-        total_rewards_ep = 0
-
-        for step in range(max_steps):
-            # Take the action (index) that has the maximum expected future reward given that state
-            action = np.argmax(Q[state][:])
-            new_state, reward, done, info = env.step(action)
-            total_rewards_ep += reward
-
-            if done:
-                break
-            state = new_state
-        episode_rewards.append(total_rewards_ep)
-    mean_reward = np.mean(episode_rewards)
-    std_reward = np.std(episode_rewards)
-
-    return mean_reward, std_reward
-```
-
-## Evaluate our Q-Learning agent 📈
-
-- Normally you should have a mean reward of 1.0
-- It's relatively easy since the state space is really small (16). You can try [replacing it with the slippery version](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/).
-
-```python
-# Evaluate our Agent
-mean_reward, std_reward = evaluate_agent(env, max_steps, n_eval_episodes, Qtable_frozenlake, eval_seed)
-print(f"Mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
-```
-
-## Publish our trained model on the Hub 🔥
-
-Now that we saw we got good results after the training, we can publish our trained model on the hub 🤗 with one line of code.
-
-Here's an example of a Model Card:
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/modelcard.png" alt="Model card" width="100%"/>
-
-
-Under the hood, the Hub uses git-based repositories (don't worry if you don't know what git is), which means you can update the model with new versions as you experiment and improve your agent.
-
-#### Do not modify this code
-
-```python
-%%capture
-from huggingface_hub import HfApi, HfFolder, Repository
-from huggingface_hub.repocard import metadata_eval_result, metadata_save
-
-from pathlib import Path
-import datetime
-import json
-```
-
-```python
-def record_video(env, Qtable, out_directory, fps=1):
-    images = []
-    done = False
-    state = env.reset(seed=random.randint(0, 500))
-    img = env.render(mode="rgb_array")
-    images.append(img)
-    while not done:
-        # Take the action (index) that has the maximum expected future reward given that state
-        action = np.argmax(Qtable[state][:])
-        state, reward, done, info = env.step(action)  # We directly put next_state = state for recording logic
-        img = env.render(mode="rgb_array")
-        images.append(img)
-    imageio.mimsave(out_directory, [np.array(img) for img in images], fps=fps)
-```
-
-```python
-def push_to_hub(
-    repo_id, model, env, video_fps=1, local_repo_path="hub", commit_message="Push Q-Learning agent to Hub", token=None
-):
-    _, repo_name = repo_id.split("/")
-
-    eval_env = env
-
-    # Step 1: Clone or create the repo
-    # Create the repo (or clone its content if it's nonempty)
-    api = HfApi()
-
-    repo_url = api.create_repo(
-        repo_id=repo_id,
-        token=token,
-        private=False,
-        exist_ok=True,
-    )
-
-    # Git pull
-    repo_local_path = Path(local_repo_path) / repo_name
-    repo = Repository(repo_local_path, clone_from=repo_url, use_auth_token=True)
-    repo.git_pull()
-
-    repo.lfs_track(["*.mp4"])
-
-    # Step 1: Save the model
-    if env.spec.kwargs.get("map_name"):
-        model["map_name"] = env.spec.kwargs.get("map_name")
-        if env.spec.kwargs.get("is_slippery", "") == False:
-            model["slippery"] = False
-
-    print(model)
-
-    # Pickle the model
-    with open(Path(repo_local_path) / "q-learning.pkl", "wb") as f:
-        pickle.dump(model, f)
-
-    # Step 2: Evaluate the model and build JSON
-    mean_reward, std_reward = evaluate_agent(
-        eval_env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"]
-    )
-
-    # First get datetime
-    eval_datetime = datetime.datetime.now()
-    eval_form_datetime = eval_datetime.isoformat()
-
-    evaluate_data = {
-        "env_id": model["env_id"],
-        "mean_reward": mean_reward,
-        "n_eval_episodes": model["n_eval_episodes"],
-        "eval_datetime": eval_form_datetime,
-    }
-    # Write a JSON file
-    with open(Path(repo_local_path) / "results.json", "w") as outfile:
-        json.dump(evaluate_data, outfile)
-
-    # Step 3: Create the model card
-    # Env id
-    env_name = model["env_id"]
-    if env.spec.kwargs.get("map_name"):
-        env_name += "-" + env.spec.kwargs.get("map_name")
-
-    if env.spec.kwargs.get("is_slippery", "") == False:
-        env_name += "-" + "no_slippery"
-
-    metadata = {}
-    metadata["tags"] = [env_name, "q-learning", "reinforcement-learning", "custom-implementation"]
-
-    # Add metrics
-    eval = metadata_eval_result(
-        model_pretty_name=repo_name,
-        task_pretty_name="reinforcement-learning",
-        task_id="reinforcement-learning",
-        metrics_pretty_name="mean_reward",
-        metrics_id="mean_reward",
-        metrics_value=f"{mean_reward:.2f} +/- {std_reward:.2f}",
-        dataset_pretty_name=env_name,
-        dataset_id=env_name,
-    )
-
-    # Merges both dictionaries
-    metadata = {**metadata, **eval}
-
-    model_card = f"""
-  # **Q-Learning** Agent playing **{model["env_id"]}**
-  This is a trained model of a **Q-Learning** agent playing **{model["env_id"]}**.
-  """
-
-    model_card += """
-  ## Usage
-  ```python
-  """
-
-    model_card += f"""model = load_from_hub(repo_id="{repo_id}", filename="q-learning.pkl")
-
-  # Don't forget to check if you need to add additional attributes (is_slippery=False etc)
-  env = gym.make(model["env_id"])
-
-  evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
-  """
-
-    model_card += """
-  ```
-  """
-
-    readme_path = repo_local_path / "README.md"
-    readme = ""
-    if readme_path.exists():
-        with readme_path.open("r", encoding="utf8") as f:
-            readme = f.read()
-    else:
-        readme = model_card
-
-    with readme_path.open("w", encoding="utf-8") as f:
-        f.write(readme)
-
-    # Save our metrics to Readme metadata
-    metadata_save(readme_path, metadata)
-
-    # Step 4: Record a video
-    video_path = repo_local_path / "replay.mp4"
-    record_video(env, model["qtable"], video_path, video_fps)
-
-    # Push everything to hub
-    print(f"Pushing repo {repo_name} to the Hugging Face Hub")
-    repo.push_to_hub(commit_message=commit_message)
-
-    print(f"Your model is pushed to the hub. You can view your model here: {repo_url}")
-```
-
-By using `push_to_hub`, **you evaluate, record a replay, generate a model card of your agent and push it to the hub**.
-
-This way:
-- You can **showcase your work** 🔥
-- You can **visualize your agent playing** 👀
-- You can **share with the community an agent that others can use** 💾
-- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
-
-
-To be able to share your model with the community there are three more steps to follow:
-
-1️⃣ (If it's not already done) create an account on Hugging Face ➡ https://huggingface.co/join
-
-2️⃣ Sign in, then store your authentication token from the Hugging Face website.
-- Create a new token (https://huggingface.co/settings/tokens) **with write role**
-
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">
-
-
-```python
-from huggingface_hub import notebook_login
-
-notebook_login()
-```
-
-If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
-
-3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using the `push_to_hub()` function
-
-- Let's create **the model dictionary that contains the hyperparameters and the Q-table**.
-
-```python
-model = {
-    "env_id": env_id,
-    "max_steps": max_steps,
-    "n_training_episodes": n_training_episodes,
-    "n_eval_episodes": n_eval_episodes,
-    "eval_seed": eval_seed,
-    "learning_rate": learning_rate,
-    "gamma": gamma,
-    "max_epsilon": max_epsilon,
-    "min_epsilon": min_epsilon,
-    "decay_rate": decay_rate,
-    "qtable": Qtable_frozenlake,
-}
-```
-
-Let's fill the `push_to_hub` function:
-
-- `repo_id`: the name of the Hugging Face Hub Repository that will be created/updated (`repo_id = {username}/{repo_name}`)
-💡 A good `repo_id` is `{username}/q-{env_id}`
-- `model`: our model dictionary containing the hyperparameters and the Qtable.
-- `env`: the environment.
-- `commit_message`: message of the commit
-
-```python
-model
-```
-
-```python
-username = ""  # FILL THIS
-repo_name = "q-FrozenLake-v1-4x4-noSlippery"
-push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)
-```
-
-Congrats 🥳 you've just implemented from scratch, trained and uploaded your first Reinforcement Learning agent.
-FrozenLake-v1 no_slippery is a very simple environment, let's try a harder one 🔥.
-
-# Part 2: Taxi-v3 🚖
-
-## Create and understand [Taxi-v3 🚕](https://www.gymlibrary.dev/environments/toy_text/taxi/)
----
-
-💡 A good habit when you start to use an environment is to check its documentation 
-
-👉 https://www.gymlibrary.dev/environments/toy_text/taxi/
-
----
-
-In `Taxi-v3` 🚕, there are four designated locations in the grid world indicated by R(ed), G(reen), Y(ellow), and B(lue). 
-
-When the episode starts, **the taxi starts off at a random square** and the passenger is at a random location. The taxi drives to the passenger’s location, **picks up the passenger**, drives to the passenger’s destination (another one of the four specified locations), and then **drops off the passenger**. Once the passenger is dropped off, the episode ends.
-
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/taxi.png" alt="Taxi">
-
-
-```python
-env = gym.make("Taxi-v3")
-```
-
-There are **500 discrete states since there are 25 taxi positions, 5 possible locations of the passenger** (including the case when the passenger is in the taxi), and **4 destination locations.**
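-
-The count of 500 can be checked with a short sketch that packs the three components into one index. The ordering below follows the encoding described in the Taxi documentation linked above, as far as we can tell; the helper names `encode_taxi_state` and `decode_taxi_state` are hypothetical and are not Gym functions.
-
-```python
-def encode_taxi_state(taxi_row, taxi_col, passenger_location, destination):
-    """Pack the Taxi-v3 state components into a single index in [0, 499]."""
-    return ((taxi_row * 5 + taxi_col) * 5 + passenger_location) * 4 + destination
-
-
-def decode_taxi_state(state):
-    """Unpack a state index into (taxi_row, taxi_col, passenger_location, destination)."""
-    state, destination = divmod(state, 4)
-    state, passenger_location = divmod(state, 5)
-    taxi_row, taxi_col = divmod(state, 5)
-    return taxi_row, taxi_col, passenger_location, destination
-
-
-print(25 * 5 * 4)  # number of distinct indices
-print(encode_taxi_state(4, 4, 4, 3))  # largest index
-print(decode_taxi_state(499))
-```
-
-Decoding a state this way makes it easier to interpret a row of the Taxi Q-table, for example to see where the taxi and the passenger were for a given entry.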
-
-
-```python
-state_space = env.observation_space.n
-print("There are ", state_space, " possible states")
-```
-
-```python
-action_space = env.action_space.n
-print("There are ", action_space, " possible actions")
-```
-
-The action space (the set of possible actions the agent can take) is discrete with **6 actions available 🎮**:
-
-- 0: move south
-- 1: move north
-- 2: move east
-- 3: move west
-- 4: pickup passenger
-- 5: drop off passenger
-
-Reward function 💰:
-
-- -1 per step unless other reward is triggered.
-- +20 delivering passenger.
-- -10 executing “pickup” and “drop-off” actions illegally.
-
-```python
-# Create our Q table with state_size rows and action_size columns (500x6)
-Qtable_taxi = initialize_q_table(state_space, action_space)
-print(Qtable_taxi)
-print("Q-table shape: ", Qtable_taxi.shape)
-```
-
-## Define the hyperparameters ⚙️
-⚠ DO NOT MODIFY EVAL_SEED: the eval_seed array **allows us to evaluate your agent with the same taxi starting positions for every classmate**
-
-```python
-# Training parameters
-n_training_episodes = 25000  # Total training episodes
-learning_rate = 0.7  # Learning rate
-
-# Evaluation parameters
-n_eval_episodes = 100  # Total number of test episodes
-
-# DO NOT MODIFY EVAL_SEED
-eval_seed = [
-    16,
-    54,
-    165,
-    177,
-    191,
-    191,
-    120,
-    80,
-    149,
-    178,
-    48,
-    38,
-    6,
-    125,
-    174,
-    73,
-    50,
-    172,
-    100,
-    148,
-    146,
-    6,
-    25,
-    40,
-    68,
-    148,
-    49,
-    167,
-    9,
-    97,
-    164,
-    176,
-    61,
-    7,
-    54,
-    55,
-    161,
-    131,
-    184,
-    51,
-    170,
-    12,
-    120,
-    113,
-    95,
-    126,
-    51,
-    98,
-    36,
-    135,
-    54,
-    82,
-    45,
-    95,
-    89,
-    59,
-    95,
-    124,
-    9,
-    113,
-    58,
-    85,
-    51,
-    134,
-    121,
-    169,
-    105,
-    21,
-    30,
-    11,
-    50,
-    65,
-    12,
-    43,
-    82,
-    145,
-    152,
-    97,
-    106,
-    55,
-    31,
-    85,
-    38,
-    112,
-    102,
-    168,
-    123,
-    97,
-    21,
-    83,
-    158,
-    26,
-    80,
-    63,
-    5,
-    81,
-    32,
-    11,
-    28,
-    148,
-]  # Evaluation seed, this ensures that all classmates' agents are evaluated on the same taxi starting positions
-# Each seed has a specific starting state
-
-# Environment parameters
-env_id = "Taxi-v3"  # Name of the environment
-max_steps = 99  # Max steps per episode
-gamma = 0.95  # Discounting rate
-
-# Exploration parameters
-max_epsilon = 1.0  # Exploration probability at start
-min_epsilon = 0.05  # Minimum exploration probability
-decay_rate = 0.005  # Exponential decay rate for exploration prob
-```
-
-## Train our Q-Learning agent 🏃
-
-```python
-Qtable_taxi = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_taxi)
-```
-
-```python
-Qtable_taxi
-```
-
-## Create a model dictionary 💾 and publish our trained model on the Hub 🔥
-- We create a model dictionary that will contain all the training hyperparameters for reproducibility and the Q-Table.
-
-
-```python
-model = {
-    "env_id": env_id,
-    "max_steps": max_steps,
-    "n_training_episodes": n_training_episodes,
-    "n_eval_episodes": n_eval_episodes,
-    "eval_seed": eval_seed,
-    "learning_rate": learning_rate,
-    "gamma": gamma,
-    "max_epsilon": max_epsilon,
-    "min_epsilon": min_epsilon,
-    "decay_rate": decay_rate,
-    "qtable": Qtable_taxi,
-}
-```
-
-```python
-username = ""  # FILL THIS
-repo_name = "q-Taxi-v3"
-push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)
-```
-
-Now that it's on the Hub, you can compare the results of your Taxi-v3 with your classmates using the leaderboard 🏆 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/taxi-leaderboard.png" alt="Taxi Leaderboard">
-
-# Part 3: Load from Hub 🔽
-
-What's amazing with Hugging Face Hub 🤗 is that you can easily load powerful models from the community.
-
-Loading a saved model from the Hub is really easy:
-
-1. Go to https://huggingface.co/models?other=q-learning to see the list of all the saved Q-Learning models.
-2. Select one and copy its repo_id
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/copy-id.png" alt="Copy id">
-
-3. Then we just need to use `load_from_hub` with:
-- The repo_id
-- The filename: the saved model inside the repo.
-
-#### Do not modify this code
-
-```python
-from urllib.error import HTTPError
-
-from huggingface_hub import hf_hub_download
-
-
-def load_from_hub(repo_id: str, filename: str) -> dict:
-    """
-    Download a model from Hugging Face Hub.
-    :param repo_id: id of the model repository from the Hugging Face Hub
-    :param filename: name of the pickled model file from the repository
-    :return: the unpickled model dictionary
-    """
-
-    # Get the model from the Hub, download and cache the model on your local disk
-    pickle_model = hf_hub_download(repo_id=repo_id, filename=filename)
-
-    with open(pickle_model, "rb") as f:
-        downloaded_model_file = pickle.load(f)
-
-    return downloaded_model_file
-```
-
-```python
-model = load_from_hub(repo_id="ThomasSimonini/q-Taxi-v3", filename="q-learning.pkl")  # Try to use another model
-
-print(model)
-env = gym.make(model["env_id"])
-
-evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
-```
-
-```python
-model = load_from_hub(
-    repo_id="ThomasSimonini/q-FrozenLake-v1-no-slippery", filename="q-learning.pkl"
-)  # Try to use another model
-
-env = gym.make(model["env_id"], is_slippery=False)
-
-evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
-```
-
-## Some additional challenges 🏆
-The best way to learn **is to try things on your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. With 1,000,000 steps, we saw some great results!
-
-In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?
-
-Here are some ideas to get there:
-
-* Train for more steps
-* Try different hyperparameters by looking at what your classmates have done.
-* **Push your new trained model** on the Hub 🔥
-
-Is walking on ice and driving taxis too boring for you? Try to **change the environment**. Why not use the slippery version of FrozenLake-v1? Check how they work [using the gym documentation](https://www.gymlibrary.dev/) and have fun 🎉.
-
-_____________________________________________________________________
-Congrats 🥳, you've just implemented, trained, and uploaded your first Reinforcement Learning agent.
-
-Understanding Q-Learning is an **important step to understanding value-based methods.**
-
-In the next unit, on Deep Q-Learning, we'll see that while creating and updating a Q-table was a good strategy, **it is not scalable.**
-
-For instance, imagine you create an agent that learns to play Doom. 
-
-<img src="https://vizdoom.cs.put.edu.pl/user/pages/01.tutorial/basic.png" alt="Doom"/>
-
-Doom is a large environment with a huge state space (millions of different states). Creating and updating a Q-table for that environment would not be efficient. 
-
-That's why we'll study, in the next unit, Deep Q-Learning, an algorithm **where we use a neural network that approximates, given a state, the different Q-values for each action.**
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Environments"/>
-
-
-See you in Unit 3! 🔥
-
-## Keep learning, stay awesome 🤗
\ No newline at end of file
diff --git a/units/en/_toctree.yml b/units/en/_toctree.yml
index 6e7658f..2615a89 100644
--- a/units/en/_toctree.yml
+++ b/units/en/_toctree.yml
@@ -56,7 +56,7 @@
     title: The Bellman Equation, simplify our value estimation
   - local: unit2/mc-vs-td
     title: Monte Carlo vs Temporal Difference Learning
-  - local: unit2/summary1
+  - local: unit2/mid-way-recap
     title: Mid-way Recap
   - local: unit2/quiz1
     title: Mid-way Quiz
@@ -64,7 +64,7 @@
     title: Introducing Q-Learning
   - local: unit2/q-learning-example
     title: A Q-Learning example
-  - local: unit2/summary2
+  - local: unit2/q-learning-recap
     title: Q-Learning Recap
   - local: unit2/hands-on
     title: Hands-on

From 4000745ba5979e3acfe15b1e1838b36173daf7f7 Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Sun, 11 Dec 2022 22:13:20 +0100
Subject: [PATCH 16/49] Update units/en/unit2/hands-on.mdx

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
---
 units/en/unit2/hands-on.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/units/en/unit2/hands-on.mdx b/units/en/unit2/hands-on.mdx
index a3dfdb1..f6ad122 100644
--- a/units/en/unit2/hands-on.mdx
+++ b/units/en/unit2/hands-on.mdx
@@ -4,7 +4,7 @@
 notebooks={[
   {label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit2/unit2.ipynb"}
   ]}
-  askForHelpUrl="http://hf.co/join/discord" />
+askForHelpUrl="http://hf.co/join/discord" />
 
 
 

From 11c8f8746008aa8a7021cd0ce9a61f81b58248be Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Sun, 11 Dec 2022 22:19:53 +0100
Subject: [PATCH 17/49] Update units/en/unit2/hands-on.mdx

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
---
 units/en/unit2/hands-on.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/units/en/unit2/hands-on.mdx b/units/en/unit2/hands-on.mdx
index f6ad122..b621dde 100644
--- a/units/en/unit2/hands-on.mdx
+++ b/units/en/unit2/hands-on.mdx
@@ -14,7 +14,7 @@ Now that we studied the Q-Learning algorithm, let's implement it from scratch an
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/envs.gif" alt="Environments"/>
 
-Thanks to a [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard), you'll be able to compare your results with other classmates and exchange the best practices to improve your agent's scores Who will win the challenge for Unit 2?
+Thanks to a [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard), you'll be able to compare your results with other classmates and exchange the best practices to improve your agent's scores. Who will win the challenge for Unit 2?
 
 
 **To start the hands-on click on Open In Colab button** 👇 :

From 723f75223e06d06e8c16decf0b4b9a86d7ba95be Mon Sep 17 00:00:00 2001
From: simoninithomas <simonini_thomas@outlook.fr>
Date: Mon, 12 Dec 2022 02:45:16 +0100
Subject: [PATCH 18/49] Update Unit 2

---
 units/en/_toctree.yml                                 | 4 ++--
 units/en/unit2/mc-vs-td.mdx                           | 2 ++
 units/en/unit2/{quiz1.mdx => mid-way-quiz.mdx}        | 2 +-
 units/en/unit2/{summary1.mdx => mid-way-recap.mdx}    | 2 +-
 units/en/unit2/{summary2.mdx => q-learning-recap.mdx} | 2 +-
 units/en/unit2/q-learning.mdx                         | 3 ++-
 6 files changed, 9 insertions(+), 6 deletions(-)
 rename units/en/unit2/{quiz1.mdx => mid-way-quiz.mdx} (99%)
 rename units/en/unit2/{summary1.mdx => mid-way-recap.mdx} (97%)
 rename units/en/unit2/{summary2.mdx => q-learning-recap.mdx} (97%)

diff --git a/units/en/_toctree.yml b/units/en/_toctree.yml
index 2615a89..5483f9c 100644
--- a/units/en/_toctree.yml
+++ b/units/en/_toctree.yml
@@ -58,7 +58,7 @@
     title: Monte Carlo vs Temporal Difference Learning
   - local: unit2/mid-way-recap
     title: Mid-way Recap
-  - local: unit2/quiz1
+  - local: unit2/mid-way-quiz
     title: Mid-way Quiz
   - local: unit2/q-learning
     title: Introducing Q-Learning
@@ -69,7 +69,7 @@
   - local: unit2/hands-on
     title: Hands-on
   - local: unit2/quiz2
-    title: Second Quiz
+    title: Q-Learning Quiz
   - local: unit2/conclusion
     title: Conclusion
   - local: unit2/additional-readings
diff --git a/units/en/unit2/mc-vs-td.mdx b/units/en/unit2/mc-vs-td.mdx
index da47dc5..1d3517f 100644
--- a/units/en/unit2/mc-vs-td.mdx
+++ b/units/en/unit2/mc-vs-td.mdx
@@ -30,6 +30,8 @@ If we take an example:
 - We terminate the episode if the cat eats the mouse or if the mouse moves > 10 steps.
 
 - At the end of the episode, **we have a list of State, Actions, Rewards, and Next States tuples**
+For instance: [[State tile 3 bottom, Go Left, +1, State tile 2 bottom], [State tile 2 bottom, Go Left, +0, State tile 1 bottom]...]
+
 - **The agent will sum the total rewards \\(G_t\\)** (to see how well it did).
 - It will then **update \\(V(s_t)\\) based on the formula**
 
diff --git a/units/en/unit2/quiz1.mdx b/units/en/unit2/mid-way-quiz.mdx
similarity index 99%
rename from units/en/unit2/quiz1.mdx
rename to units/en/unit2/mid-way-quiz.mdx
index 80bc321..b1ffe3a 100644
--- a/units/en/unit2/quiz1.mdx
+++ b/units/en/unit2/mid-way-quiz.mdx
@@ -1,4 +1,4 @@
-# Mid-way Quiz [[quiz1]]
+# Mid-way Quiz [[mid-way-quiz]]
 
 The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.
 
diff --git a/units/en/unit2/summary1.mdx b/units/en/unit2/mid-way-recap.mdx
similarity index 97%
rename from units/en/unit2/summary1.mdx
rename to units/en/unit2/mid-way-recap.mdx
index 496c5aa..0bae566 100644
--- a/units/en/unit2/summary1.mdx
+++ b/units/en/unit2/mid-way-recap.mdx
@@ -1,4 +1,4 @@
-# Mid-way Recap [[summary1]]
+# Mid-way Recap [[mid-way-recap]]
 
 Before diving into Q-Learning, let's summarize what we just learned.
 
diff --git a/units/en/unit2/summary2.mdx b/units/en/unit2/q-learning-recap.mdx
similarity index 97%
rename from units/en/unit2/summary2.mdx
rename to units/en/unit2/q-learning-recap.mdx
index a5653ef..55c66bf 100644
--- a/units/en/unit2/summary2.mdx
+++ b/units/en/unit2/q-learning-recap.mdx
@@ -1,4 +1,4 @@
-# Q-Learning Recap [[summary2]]
+# Q-Learning Recap [[q-learning-recap]]
 
 
 The *Q-Learning* **is the RL algorithm that** :
diff --git a/units/en/unit2/q-learning.mdx b/units/en/unit2/q-learning.mdx
index 7a52cc4..52e744a 100644
--- a/units/en/unit2/q-learning.mdx
+++ b/units/en/unit2/q-learning.mdx
@@ -17,6 +17,7 @@ Q-Learning is an **off-policy value-based method that uses a TD approach to tra
 The **Q comes from "the Quality" (the value) of that action at that state.**
 
 Let's recap the difference between value and reward:
+
 - The *value of a state*, or a *state-action pair* is the expected cumulative reward our agent gets if it starts at this state (or state action pair) and then acts accordingly to its policy.
 - The *reward* is the **feedback I get from the environment** after performing an action at a state.
 
@@ -42,7 +43,7 @@ Therefore, Q-function contains a Q-table **that has the value of each-state act
 
 If we recap, *Q-Learning* **is the RL algorithm that:**
 
-- Trains a *Q-Function* (an **action-value function**), which internally is a *Q-table that contains all the state-action pair values.**
+- Trains a *Q-Function* (an **action-value function**), which internally is a **Q-table that contains all the state-action pair values.**
 - Given a state and action, our Q-Function **will search into its Q-table the corresponding value.**
 - When the training is done, **we have an optimal Q-function, which means we have optimal Q-Table.**
 - And if we **have an optimal Q-function**, we **have an optimal policy** since we **know for each state what is the best action to take.**

From c945fbcd6fddd1d751f7423bc02e860b7f4c0de3 Mon Sep 17 00:00:00 2001
From: simoninithomas <simonini_thomas@outlook.fr>
Date: Mon, 12 Dec 2022 03:47:55 +0100
Subject: [PATCH 19/49] Finalize Unit 2

---
 notebooks/unit2/unit2.ipynb         |  511 ++++++-------
 notebooks/unit2/unit2.mdx           | 1096 +++++++++++++++++++++++++++
 units/en/unit2/bellman-equation.mdx |    1 +
 units/en/unit2/hands-on.mdx         |  312 ++++----
 4 files changed, 1516 insertions(+), 404 deletions(-)
 create mode 100644 notebooks/unit2/unit2.mdx

diff --git a/notebooks/unit2/unit2.ipynb b/notebooks/unit2/unit2.ipynb
index 90de5a6..81f3652 100644
--- a/notebooks/unit2/unit2.ipynb
+++ b/notebooks/unit2/unit2.ipynb
@@ -35,7 +35,8 @@
         "\n",
         "###📚 RL-Library: \n",
         "\n",
-        "- Python and Numpy"
+        "- Python and NumPy\n",
+        "- [Gym](https://www.gymlibrary.dev/)"
       ],
       "metadata": {
         "id": "DPTBOv9HYLZ2"
@@ -44,7 +45,7 @@
     {
       "cell_type": "markdown",
       "source": [
-        "We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the Github Repo](https://github.com/huggingface/deep-rl-class/issues)."
+        "We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues)."
       ],
       "metadata": {
         "id": "3iaIxM_TwklQ"
@@ -163,19 +164,19 @@
       "source": [
         "## Install dependencies and create a virtual display 🔽\n",
         "\n",
-        "During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames). \n",
+        "In the notebook, we'll need to generate a replay video. To do so, with Colab, **we need to have a virtual screen to render the environment** (and thus record the frames).\n",
         "\n",
-        "Hence the following cell will install the librairies and create and run a virtual screen 🖥\n",
+        "Hence the following cell will install the libraries and create and run a virtual screen 🖥\n",
         "\n",
         "We’ll install multiple ones:\n",
         "\n",
         "- `gym`: Contains the FrozenLake-v1 ⛄ and Taxi-v3 🚕 environments. We use `gym==0.24` since it contains a nice Taxi-v3 UI version.\n",
         "- `pygame`: Used for the FrozenLake-v1 and Taxi-v3 UI.\n",
-        "- `numPy`: Used for handling our Q-table.\n",
+        "- `numpy`: Used for handling our Q-table.\n",
         "\n",
         "The Hugging Face Hub 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.\n",
         "\n",
-        "You can see here all the Deep reinforcement Learning models available 👉 https://huggingface.co/models?other=q-learning\n"
+        "You can see all the Deep RL models that use Q-Learning here 👉 https://huggingface.co/models?other=q-learning"
       ],
       "metadata": {
         "id": "4gpxC1_kqUYe"
@@ -195,11 +196,9 @@
     {
       "cell_type": "code",
       "source": [
-        "%capture\n",
+        "%%capture\n",
         "!sudo apt-get update\n",
-        "!apt install python-opengl\n",
-        "!apt install ffmpeg\n",
-        "!apt install xvfb\n",
+        "!apt install python-opengl ffmpeg xvfb\n",
         "!pip3 install pyvirtualdisplay"
       ],
       "metadata": {
@@ -254,12 +253,8 @@
         "\n",
         "In addition to the installed libraries, we also use:\n",
         "\n",
-        "- `random`: To generate random numbers (that will be useful for Epsilon-Greedy Policy).\n",
-        "- `imageio`: To generate a replay video\n",
-        "\n",
-        "\n",
-        "\n",
-        "\n"
+        "- `random`: To generate random numbers (useful for the epsilon-greedy policy).\n",
+        "- `imageio`: To generate a replay video."
       ]
     },
     {
@@ -323,8 +318,8 @@
         "\n",
         "The environment has two modes:\n",
         "\n",
-        "- `is_slippery=False`: The agent always move in the intended direction due to the non-slippery nature of the frozen lake.\n",
-        "- `is_slippery=True`: The agent may not always move in the intended direction due to the slippery nature of the frozen lake (stochastic)."
+        "- `is_slippery=False`: The agent always moves **in the intended direction** due to the non-slippery nature of the frozen lake (deterministic).\n",
+        "- `is_slippery=True`: The agent **may not always move in the intended direction** due to the slippery nature of the frozen lake (stochastic)."
       ]
     },
     {
@@ -401,8 +396,7 @@
       },
       "outputs": [],
       "source": [
-        "# We create our environment with gym.make(\"<name_of_the_environment>\")\n",
-        "env.reset()\n",
+        "# We create our environment with gym.make(\"<name_of_the_environment>\")\n",
         "print(\"_____OBSERVATION SPACE_____ \\n\")\n",
         "print(\"Observation Space\", env.observation_space)\n",
         "print(\"Sample observation\", env.observation_space.sample()) # Get a random observation"
@@ -414,7 +408,7 @@
         "id": "2MXc15qFE0M9"
       },
       "source": [
-        "We see with `Observation Space Shape Discrete(16)` that the observation is a value representing the **agent’s current position as current_row * nrows + current_col (where both the row and col start at 0)**. \n",
+        "We see with `Observation Space Shape Discrete(16)` that the observation is an integer representing the **agent’s current position as current_row * nrows + current_col (where both the row and col start at 0)**. \n",
         "\n",
         "For example, the goal position in the 4x4 map can be calculated as follows: 3 * 4 + 3 = 15. The number of possible observations is dependent on the size of the map. **For example, the 4x4 map has 16 possible observations.**\n",
         "\n",
@@ -467,7 +461,7 @@
         "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg\" alt=\"Q-Learning\" width=\"100%\"/>\n",
         "\n",
         "\n",
-        "It's time to initialize our Q-table! To know how many rows (states) and columns (actions) to use, we need to know the action and observation space. OpenAI Gym provides us a way to do that: `env.action_space.n` and `env.observation_space.n`\n"
+        "It's time to initialize our Q-table! To know how many rows (states) and columns (actions) to use, we need to know the action and observation space. We already know their values from before, but we'll want to obtain them programmatically so that our algorithm generalizes to different environments. Gym provides us with a way to do that: `env.action_space.n` and `env.observation_space.n`\n"
       ]
     },
     {
@@ -559,6 +553,62 @@
         "Qtable_frozenlake = initialize_q_table(state_space, action_space)"
       ]
     },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "Atll4Z774gri"
+      },
+      "source": [
+        "## Define the greedy policy 🤖\n",
+        "Remember we have two policies since Q-Learning is an **off-policy** algorithm. This means we're using a **different policy for acting and updating the value function**.\n",
+        "\n",
+        "- Epsilon-greedy policy (acting policy)\n",
+        "- Greedy policy (updating policy)\n",
+        "\n",
+        "The greedy policy will also be the final policy we'll have once the Q-Learning agent is trained. The greedy policy is used to select an action from the Q-table.\n",
+        "\n",
+        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg\" alt=\"Q-Learning\" width=\"100%\"/>\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "E3SCLmLX5bWG"
+      },
+      "outputs": [],
+      "source": [
+        "def greedy_policy(Qtable, state):\n",
+        "  # Exploitation: take the action with the highest state-action value\n",
+        "  action = \n",
+        "  \n",
+        "  return action"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "B2_-8b8z5k54"
+      },
+      "source": [
+        "#### Solution"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "se2OzWGW5kYJ"
+      },
+      "outputs": [],
+      "source": [
+        "def greedy_policy(Qtable, state):\n",
+        "  # Exploitation: take the action with the highest state-action value\n",
+        "  action = np.argmax(Qtable[state][:])\n",
+        "  \n",
+        "  return action"
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {
@@ -567,11 +617,11 @@
       "source": [
         "##Define the epsilon-greedy policy 🤖\n",
         "\n",
-        "Epsilon-Greedy is the training policy that handles the exploration/exploitation trade-off.\n",
+        "Epsilon-greedy is the training policy that handles the exploration/exploitation trade-off.\n",
         "\n",
-        "The idea with Epsilon Greedy:\n",
+        "The idea with epsilon-greedy:\n",
         "\n",
-        "- With *probability 1 - ɛ* : **we do exploitation** (aka our agent selects the action with the highest state-action pair value).\n",
+        "- With *probability 1 - ɛ* : **we do exploitation** (i.e. our agent selects the action with the highest state-action pair value).\n",
         "\n",
         "- With *probability ɛ*: we do **exploration** (trying random action).\n",
         "\n",
@@ -580,15 +630,6 @@
         "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg\" alt=\"Q-Learning\" width=\"100%\"/>\n"
       ]
     },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "LjZSvhsD7_52"
-      },
-      "source": [
-        "Thanks to Sambit for finding a bug on the epsilon function 🤗"
-      ]
-    },
     {
       "cell_type": "code",
       "execution_count": null,
@@ -636,7 +677,7 @@
         "  if random_int > epsilon:\n",
         "    # Take the action with the highest value given a state\n",
         "    # np.argmax can be useful here\n",
-        "    action = np.argmax(Qtable[state])\n",
+        "    action = greedy_policy(Qtable, state)\n",
         "  # else --> exploration\n",
         "  else:\n",
         "    action = env.action_space.sample()\n",
@@ -644,62 +685,6 @@
         "  return action"
       ]
     },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "Atll4Z774gri"
-      },
-      "source": [
-        "## Define the greedy policy 🤖\n",
-        "Remember we have two policies since Q-Learning is an **off-policy** algorithm. This means we're using a **different policy for acting and updating the value function**.\n",
-        "\n",
-        "- Epsilon greedy policy (acting policy)\n",
-        "- Greedy policy (updating policy)\n",
-        "\n",
-        "Greedy policy will also be the final policy we'll have when the Q-learning agent will be trained. The greedy policy is used to select an action from the Q-table.\n",
-        "\n",
-        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg\" alt=\"Q-Learning\" width=\"100%\"/>\n"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {
-        "id": "E3SCLmLX5bWG"
-      },
-      "outputs": [],
-      "source": [
-        "def greedy_policy(Qtable, state):\n",
-        "  # Exploitation: take the action with the highest state, action value\n",
-        "  action = \n",
-        "  \n",
-        "  return action"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "B2_-8b8z5k54"
-      },
-      "source": [
-        "#### Solution"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {
-        "id": "se2OzWGW5kYJ"
-      },
-      "outputs": [],
-      "source": [
-        "def greedy_policy(Qtable, state):\n",
-        "  # Exploitation: take the action with the highest state, action value\n",
-        "  action = np.argmax(Qtable[state])\n",
-        "  \n",
-        "  return action"
-      ]
-    },
     {
       "cell_type": "markdown",
       "metadata": {
@@ -709,8 +694,8 @@
         "## Define the hyperparameters ⚙️\n",
         "The exploration related hyperparamters are some of the most important ones. \n",
         "\n",
-        "- We need to make sure that our agent **explores enough the state space** in order to learn a good value approximation, in order to do that we need to have progressive decay of the epsilon.\n",
-        "- If you decrease too fast epsilon (too high decay_rate), **you take the risk that your agent is stuck**, since your agent didn't explore enough the state space and hence can't solve the problem."
+        "- We need to make sure that our agent **explores enough of the state space** to learn a good value approximation. To do that, we need a progressive decay of epsilon.\n",
+        "- If you decrease epsilon too fast (decay_rate too high), **you risk your agent getting stuck**, since it didn't explore enough of the state space and hence can't solve the problem."
       ]
     },
     {
@@ -746,7 +731,25 @@
         "id": "cDb7Tdx8atfL"
       },
       "source": [
-        "## Step 6: Create the training loop method"
+        "## Create the training loop method\n",
+        "\n",
+        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg\" alt=\"Q-Learning\" width=\"100%\"/>\n",
+        "\n",
+        "The training loop goes like this:\n",
+        "\n",
+        "```\n",
+        "For episode in the total of training episodes:\n",
+        "\n",
+        "  Reduce epsilon (since we need less and less exploration)\n",
+        "  Reset the environment\n",
+        "\n",
+        "  For step in max timesteps:    \n",
+        "    Choose the action At using the epsilon-greedy policy\n",
+        "    Take the action (a) and observe the outcome state (s') and reward (r)\n",
+        "    Update the Q-value using the Bellman equation: Q(s,a) <- Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]\n",
+        "    If done, finish the episode\n",
+        "    Our next state is the new state\n",
+        "```"
       ]
     },
     {
@@ -782,7 +785,7 @@
         "      if done:\n",
         "        break\n",
         "      \n",
-        "      # Our state is the new state\n",
+        "      # Our next state is the new state\n",
         "      state = new_state\n",
         "  return Qtable"
       ]
@@ -829,7 +832,7 @@
         "      if done:\n",
         "        break\n",
         "      \n",
-        "      # Our state is the new state\n",
+        "      # Our next state is the new state\n",
         "      state = new_state\n",
         "  return Qtable"
       ]
@@ -880,7 +883,9 @@
         "id": "pUrWkxsHccXD"
       },
       "source": [
-        "## Define the evaluation method 📝"
+        "## The evaluation method 📝\n",
+        "\n",
+        "- We defined the evaluation method that we're going to use to test our Q-Learning agent."
       ]
     },
     {
@@ -911,7 +916,7 @@
         "    \n",
         "    for step in range(max_steps):\n",
         "      # Take the action (index) that have the maximum expected future reward given that state\n",
-        "      action = np.argmax(Q[state][:])\n",
+        "      action = greedy_policy(Q, state)\n",
         "      new_state, reward, done, info = env.step(action)\n",
         "      total_rewards_ep += reward\n",
         "        \n",
@@ -933,8 +938,8 @@
       "source": [
         "## Evaluate our Q-Learning agent 📈\n",
         "\n",
-        "- Normally you should have mean reward of 1.0\n",
-        "- It's relatively easy since the state space is really small (16). What you can try to do is [to replace with the slippery version](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)."
+        "- Usually, you should have a mean reward of 1.0\n",
+        "- The **environment is relatively easy** since the state space is really small (16). What you can try to do is [to replace it with the slippery version](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/), which introduces stochasticity, making the environment more complex."
       ]
     },
     {
@@ -956,9 +961,9 @@
         "id": "yxaP3bPdg1DV"
       },
       "source": [
-        "## Publish our trained model on the Hub 🔥\n",
+        "## Publish our trained model to the Hub 🔥\n",
         "\n",
-        "Now that we saw we got good results after the training, we can publish our trained model on the hub 🤗 with one line of code.\n",
+        "Now that we saw good results after the training, **we can publish our trained model to the Hub 🤗 with one line of code**.\n",
         "\n",
         "Here's an example of a Model Card:\n",
         "\n",
@@ -991,8 +996,7 @@
       },
       "outputs": [],
       "source": [
-        "%%capture\n",
-        "from huggingface_hub import HfApi, HfFolder, Repository\n",
+        "from huggingface_hub import HfApi, HfFolder, Repository, snapshot_download\n",
         "from huggingface_hub.repocard import metadata_eval_result, metadata_save\n",
         "\n",
         "from pathlib import Path\n",
@@ -1009,6 +1013,13 @@
       "outputs": [],
       "source": [
         "def record_video(env, Qtable, out_directory, fps=1):\n",
+        "  \"\"\"\n",
+        "  Generate a replay video of the agent\n",
+        "  :param env\n",
+        "  :param Qtable: Qtable of our agent\n",
+        "  :param out_directory\n",
+        "  :param fps: how many frames per second (with taxi-v3 and frozenlake-v1 we use 1)\n",
+        "  \"\"\"\n",
         "  images = []  \n",
         "  done = False\n",
         "  state = env.reset(seed=random.randint(0,500))\n",
@@ -1025,149 +1036,144 @@
     },
     {
       "cell_type": "code",
-      "execution_count": null,
-      "metadata": {
-        "id": "pwsNrzB339aF"
-      },
-      "outputs": [],
       "source": [
-        "def push_to_hub(repo_id, \n",
-        "                model,\n",
-        "                env,\n",
-        "                video_fps=1,\n",
-        "                local_repo_path=\"hub\",\n",
-        "                commit_message=\"Push Q-Learning agent to Hub\",\n",
-        "                token= None\n",
-        "                ):\n",
-        "  _, repo_name = repo_id.split(\"/\")\n",
+        "def push_to_hub(\n",
+        "    repo_id, model, env, video_fps=1, local_repo_path=\"hub\"\n",
+        "):\n",
+        "    \"\"\"\n",
+        "    Evaluate, generate a video, and upload a model to the Hugging Face Hub.\n",
+        "    This method does the complete pipeline:\n",
+        "    - It evaluates the model\n",
+        "    - It generates the model card\n",
+        "    - It generates a replay video of the agent\n",
+        "    - It pushes everything to the Hub\n",
         "\n",
-        "  eval_env = env\n",
-        "  \n",
-        "  # Step 1: Clone or create the repo\n",
-        "  # Create the repo (or clone its content if it's nonempty)\n",
-        "  api = HfApi()\n",
-        "  \n",
-        "  repo_url = api.create_repo(\n",
+        "    :param repo_id: id of the model repository from the Hugging Face Hub\n",
+        "    :param env\n",
+        "    :param video_fps: how many frames per second to record our video replay\n",
+        "    (with taxi-v3 and frozenlake-v1 we use 1)\n",
+        "    :param local_repo_path: where the local repository is\n",
+        "    \"\"\"\n",
+        "    _, repo_name = repo_id.split(\"/\")\n",
+        "\n",
+        "    eval_env = env\n",
+        "    api = HfApi()\n",
+        "\n",
+        "    # Step 1: Create the repo\n",
+        "    repo_url = api.create_repo(\n",
         "        repo_id=repo_id,\n",
-        "        token=token,\n",
-        "        private=False,\n",
-        "        exist_ok=True,)\n",
-        "  \n",
-        "  # Git pull\n",
-        "  repo_local_path = Path(local_repo_path) / repo_name\n",
-        "  repo = Repository(repo_local_path, clone_from=repo_url, use_auth_token=True)\n",
-        "  repo.git_pull()\n",
-        "  \n",
-        "  repo.lfs_track([\"*.mp4\"])\n",
-        "\n",
-        "  # Step 1: Save the model\n",
-        "  if env.spec.kwargs.get(\"map_name\"):\n",
-        "    model[\"map_name\"] = env.spec.kwargs.get(\"map_name\")\n",
-        "    if env.spec.kwargs.get(\"is_slippery\", \"\") == False:\n",
-        "      model[\"slippery\"] = False\n",
-        "\n",
-        "  print(model)\n",
-        "  \n",
-        "    \n",
-        "  # Pickle the model\n",
-        "  with open(Path(repo_local_path)/'q-learning.pkl', 'wb') as f:\n",
-        "    pickle.dump(model, f)\n",
-        "  \n",
-        "  # Step 2: Evaluate the model and build JSON\n",
-        "  mean_reward, std_reward = evaluate_agent(eval_env, model[\"max_steps\"], model[\"n_eval_episodes\"], model[\"qtable\"], model[\"eval_seed\"])\n",
-        "\n",
-        "  # First get datetime\n",
-        "  eval_datetime = datetime.datetime.now()\n",
-        "  eval_form_datetime = eval_datetime.isoformat()\n",
-        "\n",
-        "  evaluate_data = {\n",
-        "        \"env_id\": model[\"env_id\"], \n",
-        "        \"mean_reward\": mean_reward,\n",
-        "        \"n_eval_episodes\": model[\"n_eval_episodes\"],\n",
-        "        \"eval_datetime\": eval_form_datetime,\n",
-        "  }\n",
-        "  # Write a JSON file\n",
-        "  with open(Path(repo_local_path) / \"results.json\", \"w\") as outfile:\n",
-        "      json.dump(evaluate_data, outfile)\n",
-        "\n",
-        "  # Step 3: Create the model card\n",
-        "  # Env id\n",
-        "  env_name = model[\"env_id\"]\n",
-        "  if env.spec.kwargs.get(\"map_name\"):\n",
-        "    env_name += \"-\" + env.spec.kwargs.get(\"map_name\")\n",
-        "\n",
-        "  if env.spec.kwargs.get(\"is_slippery\", \"\") == False:\n",
-        "    env_name += \"-\" + \"no_slippery\"\n",
-        "\n",
-        "  metadata = {}\n",
-        "  metadata[\"tags\"] = [\n",
-        "        env_name,\n",
-        "        \"q-learning\",\n",
-        "        \"reinforcement-learning\",\n",
-        "        \"custom-implementation\"\n",
-        "    ]\n",
-        "\n",
-        "  # Add metrics\n",
-        "  eval = metadata_eval_result(\n",
-        "      model_pretty_name=repo_name,\n",
-        "      task_pretty_name=\"reinforcement-learning\",\n",
-        "      task_id=\"reinforcement-learning\",\n",
-        "      metrics_pretty_name=\"mean_reward\",\n",
-        "      metrics_id=\"mean_reward\",\n",
-        "      metrics_value=f\"{mean_reward:.2f} +/- {std_reward:.2f}\",\n",
-        "      dataset_pretty_name=env_name,\n",
-        "      dataset_id=env_name,\n",
+        "        exist_ok=True,\n",
         "    )\n",
         "\n",
-        "  # Merges both dictionaries\n",
-        "  metadata = {**metadata, **eval}\n",
+        "    # Step 2: Download files\n",
+        "    repo_local_path = Path(snapshot_download(repo_id=repo_id))\n",
         "\n",
-        "  model_card = f\"\"\"\n",
-        "  # **Q-Learning** Agent playing **{env_id}**\n",
+        "    # Step 3: Save the model\n",
+        "    if env.spec.kwargs.get(\"map_name\"):\n",
+        "        model[\"map_name\"] = env.spec.kwargs.get(\"map_name\")\n",
+        "        if env.spec.kwargs.get(\"is_slippery\", \"\") == False:\n",
+        "            model[\"slippery\"] = False\n",
+        "\n",
+        "    print(model)\n",
+        "\n",
+        "    # Pickle the model\n",
+        "    with open((repo_local_path) / \"q-learning.pkl\", \"wb\") as f:\n",
+        "        pickle.dump(model, f)\n",
+        "\n",
+        "    # Step 4: Evaluate the model and build JSON with evaluation metrics\n",
+        "    mean_reward, std_reward = evaluate_agent(\n",
+        "        eval_env, model[\"max_steps\"], model[\"n_eval_episodes\"], model[\"qtable\"], model[\"eval_seed\"]\n",
+        "    )\n",
+        "\n",
+        "    evaluate_data = {\n",
+        "        \"env_id\": model[\"env_id\"],\n",
+        "        \"mean_reward\": mean_reward,\n",
+        "        \"n_eval_episodes\": model[\"n_eval_episodes\"],\n",
+        "        \"eval_datetime\": datetime.datetime.now().isoformat()\n",
+        "    }\n",
+        "\n",
+        "    # Write a JSON file\n",
+        "    with open(repo_local_path / \"results.json\", \"w\") as outfile:\n",
+        "        json.dump(evaluate_data, outfile)\n",
+        "\n",
+        "    # Step 5: Create the model card\n",
+        "    env_name = model[\"env_id\"]\n",
+        "    if env.spec.kwargs.get(\"map_name\"):\n",
+        "        env_name += \"-\" + env.spec.kwargs.get(\"map_name\")\n",
+        "\n",
+        "    if env.spec.kwargs.get(\"is_slippery\", \"\") == False:\n",
+        "        env_name += \"-\" + \"no_slippery\"\n",
+        "\n",
+        "    metadata = {}\n",
+        "    metadata[\"tags\"] = [env_name, \"q-learning\", \"reinforcement-learning\", \"custom-implementation\"]\n",
+        "\n",
+        "    # Add metrics\n",
+        "    eval = metadata_eval_result(\n",
+        "        model_pretty_name=repo_name,\n",
+        "        task_pretty_name=\"reinforcement-learning\",\n",
+        "        task_id=\"reinforcement-learning\",\n",
+        "        metrics_pretty_name=\"mean_reward\",\n",
+        "        metrics_id=\"mean_reward\",\n",
+        "        metrics_value=f\"{mean_reward:.2f} +/- {std_reward:.2f}\",\n",
+        "        dataset_pretty_name=env_name,\n",
+        "        dataset_id=env_name,\n",
+        "    )\n",
+        "\n",
+        "    # Merges both dictionaries\n",
+        "    metadata = {**metadata, **eval}\n",
+        "\n",
+        "    model_card = f\"\"\"\n",
+        "  # **Q-Learning** Agent playing **{env_id}**\n",
         "  This is a trained model of a **Q-Learning** agent playing **{env_id}** .\n",
-        "  \"\"\"\n",
         "\n",
-        "  model_card += \"\"\"\n",
         "  ## Usage\n",
-        "  ```python\n",
-        "  \"\"\"\n",
         "\n",
-        "  model_card += f\"\"\"model = load_from_hub(repo_id=\"{repo_id}\", filename=\"q-learning.pkl\")\n",
+        "  ```python\n",
+        "  \n",
+        "  model = load_from_hub(repo_id=\"{repo_id}\", filename=\"q-learning.pkl\")\n",
         "\n",
         "  # Don't forget to check if you need to add additional attributes (is_slippery=False etc)\n",
         "  env = gym.make(model[\"env_id\"])\n",
-        "\n",
-        "  evaluate_agent(env, model[\"max_steps\"], model[\"n_eval_episodes\"], model[\"qtable\"], model[\"eval_seed\"])\n",
-        "  \"\"\"\n",
-        "\n",
-        "  model_card +=\"\"\"\n",
         "  ```\n",
         "  \"\"\"\n",
         "\n",
-        "  readme_path = repo_local_path / \"README.md\"\n",
-        "  readme = \"\"\n",
-        "  if readme_path.exists():\n",
-        "      with readme_path.open(\"r\", encoding=\"utf8\") as f:\n",
-        "        readme = f.read()\n",
-        "  else:\n",
-        "    readme = model_card\n",
-        "\n",
-        "  with readme_path.open(\"w\", encoding=\"utf-8\") as f:\n",
-        "    f.write(readme)\n",
-        "\n",
-        "  # Save our metrics to Readme metadata\n",
-        "  metadata_save(readme_path, metadata)\n",
-        "\n",
-        "  # Step 4: Record a video\n",
-        "  video_path =  repo_local_path / \"replay.mp4\"\n",
-        "  record_video(env, model[\"qtable\"], video_path, video_fps)\n",
+        "    evaluate_agent(env, model[\"max_steps\"], model[\"n_eval_episodes\"], model[\"qtable\"], model[\"eval_seed\"])\n",
         "  \n",
-        "  # Push everything to hub\n",
-        "  print(f\"Pushing repo {repo_name} to the Hugging Face Hub\")\n",
-        "  repo.push_to_hub(commit_message=commit_message)\n",
         "\n",
-        "  print(f\"Your model is pushed to the hub. You can view your model here: {repo_url}\")"
-      ]
+        "    readme_path = repo_local_path / \"README.md\"\n",
+        "    readme = \"\"\n",
+        "    print(readme_path.exists())\n",
+        "    if readme_path.exists():\n",
+        "        with readme_path.open(\"r\", encoding=\"utf8\") as f:\n",
+        "            readme = f.read()\n",
+        "    else:\n",
+        "        readme = model_card\n",
+        "    print(readme)\n",
+        "\n",
+        "    with readme_path.open(\"w\", encoding=\"utf-8\") as f:\n",
+        "        f.write(readme)\n",
+        "\n",
+        "    # Save our metrics to Readme metadata\n",
+        "    metadata_save(readme_path, metadata)\n",
+        "\n",
+        "    # Step 6: Record a video\n",
+        "    video_path = repo_local_path / \"replay.mp4\"\n",
+        "    record_video(env, model[\"qtable\"], video_path, video_fps)\n",
+        "\n",
+        "    # Step 7. Push everything to the Hub\n",
+        "    api.upload_folder(\n",
+        "        repo_id=repo_id,\n",
+        "        folder_path=repo_local_path,\n",
+        "        path_in_repo=\".\",\n",
+        "    )\n",
+        "\n",
+        "    print(\"Your model is pushed to the Hub. You can view your model here: \", repo_url)"
+      ],
+      "metadata": {
+        "id": "U4mdUTKkGnUd"
+      },
+      "execution_count": null,
+      "outputs": []
     },
     {
       "cell_type": "markdown",
@@ -1177,7 +1183,7 @@
       "source": [
         "### .\n",
         "\n",
-        "By using `package_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the hub**.\n",
+        "By using `push_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the Hub**.\n",
         "\n",
         "This way:\n",
         "- You can **showcase our work** 🔥\n",
@@ -1221,7 +1227,7 @@
         "id": "GyWc1x3-o3xG"
       },
       "source": [
-        "If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`"
+        "If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login` (or `login`)"
       ]
     },
     {
@@ -1230,7 +1236,7 @@
         "id": "Gc5AfUeFo3xH"
       },
       "source": [
-        "3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using `package_to_hub()` function\n",
+        "3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using `push_to_hub()` function\n",
         "\n",
         "- Let's create **the model dictionary that contains the hyperparameters and the Q_table**."
       ]
@@ -1267,7 +1273,7 @@
         "id": "9kld-AEso3xH"
       },
       "source": [
-        "Let's fill the `package_to_hub` function:\n",
+        "Let's fill the `push_to_hub` function:\n",
         "\n",
         "- `repo_id`: the name of the Hugging Face Hub Repository that will be created/updated `\n",
         "(repo_id = {username}/{repo_name})`\n",
@@ -1470,17 +1476,6 @@
         "## Train our Q-Learning agent 🏃"
       ]
     },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {
-        "id": "MLNwkNDb14h2"
-      },
-      "outputs": [],
-      "source": [
-        "Qtable_taxi = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_taxi)"
-      ]
-    },
     {
       "cell_type": "code",
       "execution_count": null,
@@ -1489,6 +1484,7 @@
       },
       "outputs": [],
       "source": [
+        "Qtable_taxi = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_taxi)\n",
         "Qtable_taxi"
       ]
     },
@@ -1498,7 +1494,7 @@
         "id": "wPdu0SueLVl2"
       },
       "source": [
-        "## Create a model dictionary 💾 and publish our trained model on the Hub 🔥\n",
+        "## Create a model dictionary 💾 and publish our trained model to the Hub 🔥\n",
         "- We create a model dictionary that will contain all the training hyperparameters for reproducibility and the Q-Table.\n"
       ]
     },
@@ -1537,7 +1533,7 @@
       "outputs": [],
       "source": [
         "username = \"\" # FILL THIS\n",
-        "repo_name = \"q-Taxi-v3\"\n",
+        "repo_name = \"\"\n",
         "push_to_hub(\n",
         "    repo_id=f\"{username}/{repo_name}\",\n",
         "    model=model,\n",
@@ -1552,6 +1548,8 @@
       "source": [
         "Now that's on the Hub, you can compare the results of your Taxi-v3 with your classmates using the leaderboard 🏆 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard\n",
         "\n",
+        "⚠ To see your entry, you need to go to the bottom of the leaderboard page and **click on refresh** ⚠\n",
+        "\n",
         "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/taxi-leaderboard.png\" alt=\"Taxi Leaderboard\">"
       ]
     },
@@ -1612,14 +1610,6 @@
         "    :param repo_id: id of the model repository from the Hugging Face Hub\n",
         "    :param filename: name of the model zip file from the repository\n",
         "    \"\"\"\n",
-        "    try:\n",
-        "        from huggingface_hub import cached_download, hf_hub_url\n",
-        "    except ImportError:\n",
-        "        raise ImportError(\n",
-        "            \"You need to install huggingface_hub to use `load_from_hub`. \"\n",
-        "            \"See https://pypi.org/project/huggingface-hub/ for installation.\"\n",
-        "        )\n",
-        "\n",
         "    # Get the model from the Hub, download and cache the model on your local disk\n",
         "    pickle_model = hf_hub_download(\n",
         "        repo_id=repo_id,\n",
@@ -1731,14 +1721,13 @@
   "metadata": {
     "accelerator": "GPU",
     "colab": {
-      "collapsed_sections": [
-        "4i6tjI2tHQ8j",
-        "Y-mo_6rXIjRi",
-        "EtrfoTaBoNrd",
-        "BjLhT70TEZIn"
-      ],
       "private_outputs": true,
-      "provenance": []
+      "provenance": [],
+      "collapsed_sections": [
+        "Ji_UrI5l2zzn",
+        "67OdoKL63eDD",
+        "B2_-8b8z5k54"
+      ]
     },
     "gpuClass": "standard",
     "kernelspec": {
diff --git a/notebooks/unit2/unit2.mdx b/notebooks/unit2/unit2.mdx
new file mode 100644
index 0000000..fefa3f0
--- /dev/null
+++ b/notebooks/unit2/unit2.mdx
@@ -0,0 +1,1096 @@
+# Unit 2: Q-Learning with FrozenLake-v1 ⛄ and Taxi-v3 🚕
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/thumbnail.jpg" alt="Unit 2 Thumbnail">
+
+In this notebook, **you'll code from scratch your first Reinforcement Learning agent** playing FrozenLake ❄️ using Q-Learning, share it to the community, and experiment with different configurations.
+
+
+⬇️ Here is an example of what **you will achieve in just a couple of minutes.** ⬇️
+
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/envs.gif" alt="Environments"/>
+
+### 🎮 Environments:
+
+- [FrozenLake-v1](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)
+- [Taxi-v3](https://www.gymlibrary.dev/environments/toy_text/taxi/)
+
+### 📚 RL-Library:
+
+- Python and NumPy
+- [Gym](https://www.gymlibrary.dev/)
+
+We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues).
+
+## Objectives of this notebook 🏆
+
+At the end of the notebook, you will:
+
+- Be able to use **Gym**, the environment library.
+- Be able to code from scratch a Q-Learning agent.
+- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.
+
+
+
+
+## This notebook is from Deep Reinforcement Learning Course
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/deep-rl-course-illustration.jpg" alt="Deep RL Course illustration"/>
+
+In this free course, you will:
+
+- 📖 Study Deep Reinforcement Learning in **theory and practice**.
+- 🧑‍💻 Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.
+- 🤖 Train **agents in unique environments** 
+
+And more! Check 📚 the syllabus 👉 https://simoninithomas.github.io/deep-rl-course
+
+Don’t forget to **<a href="http://eepurl.com/ic5ZUD">sign up to the course</a>** (we are collecting your email to be able to **send you the links when each Unit is published and give you information about the challenges and updates**).
+
+
+The best way to keep in touch is to join our Discord server, where you can chat with the community and with us 👉🏻 https://discord.gg/ydHrjt3WP5
+
+## Prerequisites 🏗️
+Before diving into the notebook, you need to:
+
+🔲 📚 **Study [Q-Learning by reading Unit 2](https://huggingface.co/deep-rl-course/unit2/introduction)**  🤗  
+
+## A small recap of Q-Learning
+
+- *Q-Learning* **is the RL algorithm that**:
+
+  - Trains a *Q-Function*, an **action-value function** that contains, as internal memory, a *Q-table* **that stores all the state-action pair values.**
+    
+  - Given a state and an action, our Q-Function **will search its Q-table for the corresponding value.**
+    
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q function"  width="100%"/>
+
+- When the training is done, **we have an optimal Q-Function, so an optimal Q-Table.**
+    
+- And if we **have an optimal Q-function**, we have an optimal policy, since we **know, for each state, the best action to take.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy"  width="100%"/>
+
+
+But, in the beginning, our **Q-Table is useless since it gives arbitrary values for each state-action pair (most of the time we initialize the Q-Table to 0 values)**. But, as we explore the environment and update our Q-Table, it will give us better and better approximations.
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/q-learning.jpeg" alt="q-learning.jpeg" width="100%"/>
+
+This is the Q-Learning pseudocode:
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="100%"/>
+
+
+# Let's code our first Reinforcement Learning algorithm 🚀
+
+## Install dependencies and create a virtual display 🔽
+
+In the notebook, we'll need to generate a replay video. To do so, with Colab, **we need to have a virtual screen to render the environment** (and thus record the frames).
+
+Hence the following cell will install the libraries and create and run a virtual screen 🖥
+
+We’ll install multiple ones:
+
+- `gym`: Contains the FrozenLake-v1 ⛄ and Taxi-v3 🚕 environments. We use `gym==0.24` since it contains a nice Taxi-v3 UI version.
+- `pygame`: Used for the FrozenLake-v1 and Taxi-v3 UI.
+- `numpy`: Used for handling our Q-table.
+
+The Hugging Face Hub 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.
+
+You can see here all the Deep RL models available (if they use Q Learning) 👉 https://huggingface.co/models?other=q-learning
+
+```python
+!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit2/requirements-unit2.txt
+```
+
+```python
+%%capture
+!sudo apt-get update
+!apt install python-opengl ffmpeg xvfb
+!pip3 install pyvirtualdisplay
+```
+
+To make sure the newly installed libraries are used, **sometimes it's required to restart the notebook runtime**. The next cell will force the **runtime to crash, so you'll need to connect again and run the code starting from here**. Thanks to this trick, **we will be able to run our virtual screen.**
+
+```python
+import os
+
+os.kill(os.getpid(), 9)
+```
+
+```python
+# Virtual display
+from pyvirtualdisplay import Display
+
+virtual_display = Display(visible=0, size=(1400, 900))
+virtual_display.start()
+```
+
+## Import the packages 📦
+
+In addition to the installed libraries, we also use:
+
+- `random`: To generate random numbers (that will be useful for epsilon-greedy policy).
+- `imageio`: To generate a replay video.
+
+```python
+import numpy as np
+import gym
+import random
+import imageio
+import os
+
+import pickle5 as pickle
+from tqdm.notebook import tqdm
+```
+
+We're now ready to code our Q-Learning algorithm 🔥
+
+# Part 1: Frozen Lake ⛄ (non slippery version)
+
+## Create and understand [FrozenLake environment ⛄](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)
+---
+
+💡 A good habit when you start to use an environment is to check its documentation 
+
+👉 https://www.gymlibrary.dev/environments/toy_text/frozen_lake/
+
+---
+
+We're going to train our Q-Learning agent **to navigate from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoid holes (H)**.
+
+We can have two sizes of environment:
+
+- `map_name="4x4"`: a 4x4 grid version
+- `map_name="8x8"`: an 8x8 grid version
+
+
+The environment has two modes:
+
+- `is_slippery=False`: The agent always moves **in the intended direction** due to the non-slippery nature of the frozen lake (deterministic).
+- `is_slippery=True`: The agent **may not always move in the intended direction** due to the slippery nature of the frozen lake (stochastic).
+
+For now, let's keep it simple with the 4x4 map and the non-slippery version.
+
+```python
+# Create the FrozenLake-v1 environment using 4x4 map and non-slippery version
+env = gym.make()  # TODO use the correct parameters
+```
+
+### Solution
+
+```python
+env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
+```
+
+You can create your own custom grid like this:
+
+```python
+desc=["SFFF", "FHFH", "FFFH", "HFFG"]
+gym.make('FrozenLake-v1', desc=desc, is_slippery=True)
+```
+
+but we'll use the default environment for now.
+
+### Let's see what the Environment looks like:
+
+
+```python
+# We create our environment with gym.make("<name_of_the_environment>")
+print("_____OBSERVATION SPACE_____ \n")
+print("Observation Space", env.observation_space)
+print("Sample observation", env.observation_space.sample())  # Get a random observation
+```
+
+We see with `Observation Space Discrete(16)` that the observation is an integer representing the **agent’s current position as current_row * nrows + current_col (where both the row and col start at 0)**.
+
+For example, the goal position in the 4x4 map can be calculated as follows: 3 * 4 + 3 = 15. The number of possible observations is dependent on the size of the map. **For example, the 4x4 map has 16 possible observations.**
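+
+As a quick check of the formula above, here is a minimal sketch that converts between grid coordinates and the observation index. The helper names `to_state` and `to_row_col` are hypothetical (they are not part of Gym), and the sketch assumes a square map, where the number of rows equals the number of columns:
+
+```python
+# Hypothetical helpers for illustration only; they are not part of Gym.
+def to_state(row, col, ncols=4):
+    """Map a (row, col) grid position to the integer observation."""
+    return row * ncols + col
+
+
+def to_row_col(state, ncols=4):
+    """Map an integer observation back to its (row, col) grid position."""
+    return divmod(state, ncols)
+
+
+print(to_state(0, 0))  # start tile (S) of the 4x4 map -> 0
+print(to_state(3, 3))  # goal tile (G) of the 4x4 map -> 15
+print(to_row_col(15))  # -> (3, 3)
+```
+
+Passing `ncols=8` gives the same mapping for the 8x8 map, where the goal tile becomes observation 63.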
+
+
+For instance, this is what state = 0 looks like:
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/frozenlake.png" alt="FrozenLake">
+
+```python
+print("\n _____ACTION SPACE_____ \n")
+print("Action Space Shape", env.action_space.n)
+print("Action Space Sample", env.action_space.sample())  # Take a random action
+```
+
+The action space (the set of possible actions the agent can take) is discrete with 4 actions available 🎮:
+- 0: GO LEFT
+- 1: GO DOWN
+- 2: GO RIGHT
+- 3: GO UP
+
+Reward function 💰:
+- Reach goal: +1
+- Reach hole: 0
+- Reach frozen: 0
+
+## Create and Initialize the Q-table 🗄️
+(👀 Step 1 of the pseudocode)
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="100%"/>
+
+
+It's time to initialize our Q-table! To know how many rows (states) and columns (actions) to use, we need to know the action and observation space. We already know their values from before, but we'll want to obtain them programmatically so that our algorithm generalizes for different environments. Gym provides us a way to do that: `env.action_space.n` and `env.observation_space.n`
+
+
+```python
+state_space = 
+print("There are ", state_space, " possible states")
+
+action_space = 
+print("There are ", action_space, " possible actions")
+```
+
+```python
+# Let's create our Qtable of size (state_space, action_space) and initialize each value to 0 using np.zeros
+def initialize_q_table(state_space, action_space):
+  Qtable = 
+  return Qtable
+```
+
+```python
+Qtable_frozenlake = initialize_q_table(state_space, action_space)
+```
+
+### Solution
+
+```python
+state_space = env.observation_space.n
+print("There are ", state_space, " possible states")
+
+action_space = env.action_space.n
+print("There are ", action_space, " possible actions")
+```
+
+```python
+# Let's create our Qtable of size (state_space, action_space) and initialize each value to 0 using np.zeros
+def initialize_q_table(state_space, action_space):
+    Qtable = np.zeros((state_space, action_space))
+    return Qtable
+```
+
+```python
+Qtable_frozenlake = initialize_q_table(state_space, action_space)
+```
+
+## Define the greedy policy 🤖
+Remember we have two policies since Q-Learning is an **off-policy** algorithm. This means we're using a **different policy for acting and updating the value function**.
+
+- Epsilon-greedy policy (acting policy)
+- Greedy-policy (updating policy)
+
+The greedy policy will also be the final policy we'll have once the Q-Learning agent is trained. The greedy policy is used to select an action from the Q-table.
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg" alt="Q-Learning" width="100%"/>
+
+
+```python
+def greedy_policy(Qtable, state):
+  # Exploitation: take the action with the highest state, action value
+  action = 
+  
+  return action
+```
+
+#### Solution
+
+```python
+def greedy_policy(Qtable, state):
+    # Exploitation: take the action with the highest state, action value
+    action = np.argmax(Qtable[state][:])
+
+    return action
+```
+
+## Define the epsilon-greedy policy 🤖
+
+Epsilon-greedy is the training policy that handles the exploration/exploitation trade-off.
+
+The idea with epsilon-greedy:
+
+- With *probability 1 - ɛ* : **we do exploitation** (i.e. our agent selects the action with the highest state-action pair value).
+
+- With *probability ɛ*: we do **exploration** (trying a random action).
+
+And as the training goes on, we progressively **reduce the epsilon value since we will need less and less exploration and more exploitation.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg" alt="Q-Learning" width="100%"/>
+
+
+```python
+def epsilon_greedy_policy(Qtable, state, epsilon):
+  # Randomly generate a number between 0 and 1
+  random_num = 
+  # If random_num is greater than epsilon --> exploitation
+  if random_num > epsilon:
+    # Take the action with the highest value given a state
+    # np.argmax can be useful here
+    action = 
+  # else --> exploration
+  else:
+    action = # Take a random action
+  
+  return action
+```
+
+#### Solution
+
+```python
+def epsilon_greedy_policy(Qtable, state, epsilon):
+    # Randomly generate a number between 0 and 1
+    random_num = random.uniform(0, 1)
+    # If random_num is greater than epsilon --> exploitation
+    if random_num > epsilon:
+        # Take the action with the highest value given a state
+        # np.argmax can be useful here
+        action = greedy_policy(Qtable, state)
+    # else --> exploration
+    else:
+        action = env.action_space.sample()
+
+    return action
+```
+
+## Define the hyperparameters ⚙️
+The exploration-related hyperparameters are some of the most important ones.
+
+- We need to make sure that our agent **explores enough of the state space** to learn a good value approximation. To do that, we need to have progressive decay of the epsilon.
+- If you decrease epsilon too fast (a `decay_rate` that is too high), **you take the risk that your agent will be stuck**, since your agent didn't explore enough of the state space and hence can't solve the problem.
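+
+To get a feel for how fast exploration fades, the sketch below evaluates the exponential schedule `min_epsilon + (max_epsilon - min_epsilon) * exp(-decay_rate * episode)` that the training loop in this notebook uses. The helper name `epsilon_at` is hypothetical, and its default values simply mirror the exploration parameters chosen in this notebook:
+
+```python
+import math
+
+
+# Hypothetical helper: the defaults mirror the exploration parameters of this notebook.
+def epsilon_at(episode, min_epsilon=0.05, max_epsilon=1.0, decay_rate=0.0005):
+    """Exploration probability used at a given training episode."""
+    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)
+
+
+# Print the schedule at a few points of a 10,000-episode training run
+for episode in (0, 1000, 5000, 10000):
+    print(episode, round(epsilon_at(episode), 3))
+```
+
+With these values, epsilon starts at 1.0, is still roughly 0.13 after 5,000 episodes, and ends close to `min_epsilon`. Calling `epsilon_at` with a larger `decay_rate` shows how quickly the agent would stop exploring.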
+
+```python
+# Training parameters
+n_training_episodes = 10000  # Total training episodes
+learning_rate = 0.7  # Learning rate
+
+# Evaluation parameters
+n_eval_episodes = 100  # Total number of test episodes
+
+# Environment parameters
+env_id = "FrozenLake-v1"  # Name of the environment
+max_steps = 99  # Max steps per episode
+gamma = 0.95  # Discounting rate
+eval_seed = []  # The evaluation seed of the environment
+
+# Exploration parameters
+max_epsilon = 1.0  # Exploration probability at start
+min_epsilon = 0.05  # Minimum exploration probability
+decay_rate = 0.0005  # Exponential decay rate for exploration prob
+```
+
+## Create the training loop method
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="100%"/>
+
+The training loop goes like this:
+
+```
+For episode in the total of training episodes:
+
+Reduce epsilon (since we need less and less exploration)
+Reset the environment
+
+  For step in max timesteps:    
+    Choose the action At using epsilon greedy policy
+    Take the action (a) and observe the outcome state(s') and reward (r)
+    Update the Q-value Q(s,a) using Bellman equation Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
+    If done, finish the episode
+    Our next state is the new state
+```
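+
+To make the update rule concrete, here is a small worked example in plain Python. It is only a sketch under stated assumptions: the helper `q_update` is hypothetical, the learning rate (0.7) and gamma (0.95) are the values used in this notebook, and the two transitions are made up for illustration.
+
+```python
+# Hypothetical helper: one Q-Learning update for a single (state, action) pair.
+def q_update(q_sa, reward, max_q_next, learning_rate=0.7, gamma=0.95):
+    """Return Q(s,a) + lr * [reward + gamma * max Q(s',a') - Q(s,a)]."""
+    td_target = reward + gamma * max_q_next
+    return q_sa + learning_rate * (td_target - q_sa)
+
+
+# Illustrative transition 1: the agent steps onto the goal (reward 1) with a fresh Q-table.
+q_goal_step = q_update(q_sa=0.0, reward=1.0, max_q_next=0.0)
+
+# Illustrative transition 2: one tile earlier the reward is 0,
+# but the next state now has a non-zero value to bootstrap from.
+q_previous_step = q_update(q_sa=0.0, reward=0.0, max_q_next=q_goal_step)
+
+print(q_goal_step, q_previous_step)
+```
+
+The first update yields 0.7 and the second roughly 0.4655: the goal reward propagates backwards one state-action pair at a time, discounted by gamma at each step.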
+
+```python
+def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
+  for episode in range(n_training_episodes):
+    # Reduce epsilon (because we need less and less exploration)
+    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)
+    # Reset the environment
+    state = env.reset()
+    step = 0
+    done = False
+
+    # repeat
+    for step in range(max_steps):
+      # Choose the action At using epsilon greedy policy
+      action = 
+
+      # Take action At and observe Rt+1 and St+1
+      # Take the action (a) and observe the outcome state(s') and reward (r)
+      new_state, reward, done, info = 
+
+      # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
+      Qtable[state][action] = 
+
+      # If done, finish the episode
+      if done:
+        break
+      
+      # Our next state is the new state
+      state = new_state
+  return Qtable
+```
+
+#### Solution
+
+```python
+def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
+    for episode in tqdm(range(n_training_episodes)):
+        # Reduce epsilon (because we need less and less exploration)
+        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
+        # Reset the environment
+        state = env.reset()
+        step = 0
+        done = False
+
+        # repeat
+        for step in range(max_steps):
+            # Choose the action At using epsilon greedy policy
+            action = epsilon_greedy_policy(Qtable, state, epsilon)
+
+            # Take action At and observe Rt+1 and St+1
+            # Take the action (a) and observe the outcome state(s') and reward (r)
+            new_state, reward, done, info = env.step(action)
+
+            # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
+            Qtable[state][action] = Qtable[state][action] + learning_rate * (
+                reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action]
+            )
+
+            # If done, finish the episode
+            if done:
+                break
+
+            # Our next state is the new state
+            state = new_state
+    return Qtable
+```
+
+## Train the Q-Learning agent 🏃
+
+```python
+Qtable_frozenlake = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_frozenlake)
+```
+
+## Let's see what our Q-Learning table looks like now 👀
+
+```python
+Qtable_frozenlake
+```
+
+## The evaluation method 📝
+
+- We defined the evaluation method that we're going to use to test our Q-Learning agent.
+
+```python
+def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
+    """
+    Evaluate the agent for ``n_eval_episodes`` episodes and return the mean and standard deviation of the episode rewards.
+    :param env: The evaluation environment
+    :param max_steps: Maximum number of steps per episode
+    :param n_eval_episodes: Number of episodes to evaluate the agent
+    :param Q: The Q-table
+    :param seed: The evaluation seed array (for taxi-v3)
+    """
+    episode_rewards = []
+    for episode in tqdm(range(n_eval_episodes)):
+        if seed:
+            state = env.reset(seed=seed[episode])
+        else:
+            state = env.reset()
+        step = 0
+        done = False
+        total_rewards_ep = 0
+
+        for step in range(max_steps):
+            # Take the action (index) that has the maximum expected future reward given that state
+            action = greedy_policy(Q, state)
+            new_state, reward, done, info = env.step(action)
+            total_rewards_ep += reward
+
+            if done:
+                break
+            state = new_state
+        episode_rewards.append(total_rewards_ep)
+    mean_reward = np.mean(episode_rewards)
+    std_reward = np.std(episode_rewards)
+
+    return mean_reward, std_reward
+```
+
+## Evaluate our Q-Learning agent 📈
+
+- Usually, you should have a mean reward of 1.0
+- The **environment is relatively easy** since the state space is really small (16). What you can try to do is [to replace it with the slippery version](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/), which introduces stochasticity, making the environment more complex.
+
+```python
+# Evaluate our Agent
+mean_reward, std_reward = evaluate_agent(env, max_steps, n_eval_episodes, Qtable_frozenlake, eval_seed)
+print(f"Mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
+```
+
+## Publish our trained model to the Hub 🔥
+
+Now that we saw good results after the training, **we can publish our trained model to the Hub 🤗 with one line of code**.
+
+Here's an example of a Model Card:
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/modelcard.png" alt="Model card" width="100%"/>
+
+
+Under the hood, the Hub uses git-based repositories (don't worry if you don't know what git is), which means you can update the model with new versions as you experiment and improve your agent.
+
+#### Do not modify this code
+
+```python
+from huggingface_hub import HfApi, HfFolder, Repository, snapshot_download
+from huggingface_hub.repocard import metadata_eval_result, metadata_save
+
+from pathlib import Path
+import datetime
+import json
+```
+
+```python
+def record_video(env, Qtable, out_directory, fps=1):
+    """
+    Generate a replay video of the agent
+    :param env
+    :param Qtable: Qtable of our agent
+    :param out_directory
+    :param fps: how many frames per second (with taxi-v3 and frozenlake-v1 we use 1)
+    """
+    images = []
+    done = False
+    state = env.reset(seed=random.randint(0, 500))
+    img = env.render(mode="rgb_array")
+    images.append(img)
+    while not done:
+        # Take the action (index) that has the maximum expected future reward given that state
+        action = np.argmax(Qtable[state][:])
+        state, reward, done, info = env.step(action)  # We directly put next_state = state for recording logic
+        img = env.render(mode="rgb_array")
+        images.append(img)
+    imageio.mimsave(out_directory, [np.array(img) for img in images], fps=fps)
+```
+
+```python
+def push_to_hub(repo_id, model, env, video_fps=1, local_repo_path="hub"):
+    """
+    Evaluate, Generate a video and Upload a model to Hugging Face Hub.
+    This method does the complete pipeline:
+    - It evaluates the model
+    - It generates the model card
+    - It generates a replay video of the agent
+    - It pushes everything to the Hub
+
+    :param repo_id: id of the model repository from the Hugging Face Hub
+    :param model: dictionary containing the hyperparameters and the Q-table
+    :param env: environment used for evaluation and for recording the replay
+    :param video_fps: frames per second of the replay video
+    (with taxi-v3 and frozenlake-v1 we use 1)
+    :param local_repo_path: where the local repository is
+    """
+    _, repo_name = repo_id.split("/")
+
+    eval_env = env
+    api = HfApi()
+
+    # Step 1: Create the repo
+    repo_url = api.create_repo(
+        repo_id=repo_id,
+        exist_ok=True,
+    )
+
+    # Step 2: Download files
+    repo_local_path = Path(snapshot_download(repo_id=repo_id))
+
+    # Step 3: Save the model
+    if env.spec.kwargs.get("map_name"):
+        model["map_name"] = env.spec.kwargs.get("map_name")
+        if env.spec.kwargs.get("is_slippery", "") == False:
+            model["slippery"] = False
+
+    print(model)
+
+    # Pickle the model
+    with open((repo_local_path) / "q-learning.pkl", "wb") as f:
+        pickle.dump(model, f)
+
+    # Step 4: Evaluate the model and build JSON with evaluation metrics
+    mean_reward, std_reward = evaluate_agent(
+        eval_env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"]
+    )
+
+    evaluate_data = {
+        "env_id": model["env_id"],
+        "mean_reward": mean_reward,
+        "n_eval_episodes": model["n_eval_episodes"],
+        "eval_datetime": datetime.datetime.now().isoformat(),
+    }
+
+    # Write a JSON file
+    with open(repo_local_path / "results.json", "w") as outfile:
+        json.dump(evaluate_data, outfile)
+
+    # Step 5: Create the model card
+    env_name = model["env_id"]
+    if env.spec.kwargs.get("map_name"):
+        env_name += "-" + env.spec.kwargs.get("map_name")
+
+    if env.spec.kwargs.get("is_slippery", "") == False:
+        env_name += "-" + "no_slippery"
+
+    metadata = {}
+    metadata["tags"] = [env_name, "q-learning", "reinforcement-learning", "custom-implementation"]
+
+    # Add metrics
+    eval = metadata_eval_result(
+        model_pretty_name=repo_name,
+        task_pretty_name="reinforcement-learning",
+        task_id="reinforcement-learning",
+        metrics_pretty_name="mean_reward",
+        metrics_id="mean_reward",
+        metrics_value=f"{mean_reward:.2f} +/- {std_reward:.2f}",
+        dataset_pretty_name=env_name,
+        dataset_id=env_name,
+    )
+
+    # Merges both dictionaries
+    metadata = {**metadata, **eval}
+
+    model_card = f"""
+  # **Q-Learning** Agent playing **{model['env_id']}**
+  This is a trained model of a **Q-Learning** agent playing **{model['env_id']}**.
+
+  ## Usage
+
+  ```python
+  
+  model = load_from_hub(repo_id="{repo_id}", filename="q-learning.pkl")
+
+  # Don't forget to check if you need to add additional attributes (is_slippery=False etc)
+  env = gym.make(model["env_id"])
+  ```
+  """
+
+    evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
+
+    readme_path = repo_local_path / "README.md"
+    readme = ""
+    print(readme_path.exists())
+    if readme_path.exists():
+        with readme_path.open("r", encoding="utf8") as f:
+            readme = f.read()
+    else:
+        readme = model_card
+    print(readme)
+
+    with readme_path.open("w", encoding="utf-8") as f:
+        f.write(readme)
+
+    # Save our metrics to Readme metadata
+    metadata_save(readme_path, metadata)
+
+    # Step 6: Record a video
+    video_path = repo_local_path / "replay.mp4"
+    record_video(env, model["qtable"], video_path, video_fps)
+
+    # Step 7. Push everything to the Hub
+    api.upload_folder(
+        repo_id=repo_id,
+        folder_path=repo_local_path,
+        path_in_repo=".",
+    )
+
+    print("Your model is pushed to the Hub. You can view your model here: ", repo_url)
+```
+
+
+By using `push_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the Hub**.
+
+This way:
+- You can **showcase your work** 🔥
+- You can **visualize your agent playing** 👀
+- You can **share with the community an agent that others can use** 💾
+- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
+
+
+To be able to share your model with the community there are three more steps to follow:
+
+1️⃣ (If it's not already done) create a Hugging Face account ➡ https://huggingface.co/join
+
+2️⃣ Sign in, then store your authentication token from the Hugging Face website.
+- Create a new token (https://huggingface.co/settings/tokens) **with write role**
+
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">
+
+
+```python
+from huggingface_hub import notebook_login
+
+notebook_login()
+```
+
+If you are not using Google Colab or a Jupyter Notebook, use this command instead: `huggingface-cli login` (or `login`)
+
+3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using the `push_to_hub()` function
+
+- Let's create **the model dictionary that contains the hyperparameters and the Q-table**.
+
+```python
+model = {
+    "env_id": env_id,
+    "max_steps": max_steps,
+    "n_training_episodes": n_training_episodes,
+    "n_eval_episodes": n_eval_episodes,
+    "eval_seed": eval_seed,
+    "learning_rate": learning_rate,
+    "gamma": gamma,
+    "max_epsilon": max_epsilon,
+    "min_epsilon": min_epsilon,
+    "decay_rate": decay_rate,
+    "qtable": Qtable_frozenlake,
+}
+```
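+
+As a rough illustration of what happens to this dictionary, `push_to_hub` pickles it to `q-learning.pkl` and `load_from_hub` unpickles it again. The sketch below reproduces that round trip locally with a stand-in dictionary; `toy_model` and its values are made up for the example and are not trained results.
+
+```python
+import pickle
+import tempfile
+from pathlib import Path
+
+import numpy as np
+
+# Stand-in model dictionary (illustrative values only, not a trained agent)
+toy_model = {
+    "env_id": "FrozenLake-v1",
+    "max_steps": 99,
+    "qtable": np.zeros((16, 4)),
+}
+
+with tempfile.TemporaryDirectory() as tmp_dir:
+    model_path = Path(tmp_dir) / "q-learning.pkl"
+
+    # Same serialization step as in push_to_hub
+    with open(model_path, "wb") as f:
+        pickle.dump(toy_model, f)
+
+    # Same deserialization step as in load_from_hub
+    with open(model_path, "rb") as f:
+        restored = pickle.load(f)
+
+print(restored["env_id"], restored["qtable"].shape)
+```
+
+Keep in mind that `pickle.load` can execute arbitrary code, so only unpickle files from repositories you trust.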
+
+Let's fill in the `push_to_hub` arguments:
+
+- `repo_id`: the name of the Hugging Face Hub repository that will be created/updated
+(`repo_id = {username}/{repo_name}`)
+💡 A good `repo_id` is `{username}/q-{env_id}`
+- `model`: our model dictionary containing the hyperparameters and the Q-table.
+- `env`: the environment.
+- `video_fps`: (optional) frames per second of the replay video; defaults to 1.
+
+```python
+model
+```
+
+```python
+username = ""  # FILL THIS
+repo_name = "q-FrozenLake-v1-4x4-noSlippery"
+push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)
+```
+
+Congrats 🥳 you've just implemented from scratch, trained, and uploaded your first Reinforcement Learning agent.
+FrozenLake-v1 no_slippery is a very simple environment, so let's try a harder one 🔥.
+
+# Part 2: Taxi-v3 🚖
+
+## Create and understand [Taxi-v3 🚕](https://www.gymlibrary.dev/environments/toy_text/taxi/)
+---
+
+💡 A good habit when you start to use an environment is to check its documentation 
+
+👉 https://www.gymlibrary.dev/environments/toy_text/taxi/
+
+---
+
+In `Taxi-v3` 🚕, there are four designated locations in the grid world indicated by R(ed), G(reen), Y(ellow), and B(lue). 
+
+When the episode starts, **the taxi starts off at a random square** and the passenger is at a random location. The taxi drives to the passenger’s location, **picks up the passenger**, drives to the passenger’s destination (another one of the four specified locations), and then **drops off the passenger**. Once the passenger is dropped off, the episode ends.
+
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/taxi.png" alt="Taxi">
+
+
+```python
+env = gym.make("Taxi-v3")
+```
+
+There are **500 discrete states since there are 25 taxi positions, 5 possible locations of the passenger** (including the case when the passenger is in the taxi), and **4 destination locations.**
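+
+To make the 25 × 5 × 4 = 500 count concrete, here is a minimal sketch of how the four components can be packed into a single integer. The helper names `encode_state` and `decode_state` are hypothetical, and the packing order (taxi row, taxi column, passenger location, destination) is an assumption chosen to mirror the description above.
+
+```python
+# Sizes of each component of a Taxi state, as described above
+N_ROWS, N_COLS, N_PASSENGER_LOCS, N_DESTINATIONS = 5, 5, 5, 4
+
+
+def encode_state(taxi_row, taxi_col, passenger_loc, destination):
+    """Pack the four components into one integer in [0, 500) (assumed order)."""
+    state = taxi_row
+    state = state * N_COLS + taxi_col
+    state = state * N_PASSENGER_LOCS + passenger_loc
+    state = state * N_DESTINATIONS + destination
+    return state
+
+
+def decode_state(state):
+    """Inverse of encode_state: recover the four components."""
+    state, destination = divmod(state, N_DESTINATIONS)
+    state, passenger_loc = divmod(state, N_PASSENGER_LOCS)
+    taxi_row, taxi_col = divmod(state, N_COLS)
+    return taxi_row, taxi_col, passenger_loc, destination
+
+
+n_states = N_ROWS * N_COLS * N_PASSENGER_LOCS * N_DESTINATIONS
+print(n_states)  # 500
+print(encode_state(3, 1, 2, 0))  # 328
+print(decode_state(328))  # (3, 1, 2, 0)
+```
+
+Each state index is therefore one row of the Q-table, which is why the table for Taxi-v3 has 500 rows.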
+
+
+```python
+state_space = env.observation_space.n
+print("There are ", state_space, " possible states")
+```
+
+```python
+action_space = env.action_space.n
+print("There are ", action_space, " possible actions")
+```
+
+The action space (the set of possible actions the agent can take) is discrete with **6 actions available 🎮**:
+
+- 0: move south
+- 1: move north
+- 2: move east
+- 3: move west
+- 4: pickup passenger
+- 5: drop off passenger
+
+Reward function 💰:
+
+- -1 per step unless other reward is triggered.
+- +20 delivering passenger.
+- -10 executing “pickup” and “drop-off” actions illegally.
+
+```python
+# Create our Q table with state_size rows and action_size columns (500x6)
+Qtable_taxi = initialize_q_table(state_space, action_space)
+print(Qtable_taxi)
+print("Q-table shape: ", Qtable_taxi.shape)
+```
+
+## Define the hyperparameters ⚙️
+⚠ DO NOT MODIFY EVAL_SEED: the eval_seed array **allows us to evaluate your agent with the same taxi starting positions for every classmate**
+
+```python
+# Training parameters
+n_training_episodes = 25000  # Total training episodes
+learning_rate = 0.7  # Learning rate
+
+# Evaluation parameters
+n_eval_episodes = 100  # Total number of test episodes
+
+# DO NOT MODIFY EVAL_SEED
+eval_seed = [
+    16,
+    54,
+    165,
+    177,
+    191,
+    191,
+    120,
+    80,
+    149,
+    178,
+    48,
+    38,
+    6,
+    125,
+    174,
+    73,
+    50,
+    172,
+    100,
+    148,
+    146,
+    6,
+    25,
+    40,
+    68,
+    148,
+    49,
+    167,
+    9,
+    97,
+    164,
+    176,
+    61,
+    7,
+    54,
+    55,
+    161,
+    131,
+    184,
+    51,
+    170,
+    12,
+    120,
+    113,
+    95,
+    126,
+    51,
+    98,
+    36,
+    135,
+    54,
+    82,
+    45,
+    95,
+    89,
+    59,
+    95,
+    124,
+    9,
+    113,
+    58,
+    85,
+    51,
+    134,
+    121,
+    169,
+    105,
+    21,
+    30,
+    11,
+    50,
+    65,
+    12,
+    43,
+    82,
+    145,
+    152,
+    97,
+    106,
+    55,
+    31,
+    85,
+    38,
+    112,
+    102,
+    168,
+    123,
+    97,
+    21,
+    83,
+    158,
+    26,
+    80,
+    63,
+    5,
+    81,
+    32,
+    11,
+    28,
+    148,
+]  # Evaluation seeds: they ensure that all classmates' agents are evaluated on the same taxi starting positions
+# Each seed has a specific starting state
+
+# Environment parameters
+env_id = "Taxi-v3"  # Name of the environment
+max_steps = 99  # Max steps per episode
+gamma = 0.95  # Discounting rate
+
+# Exploration parameters
+max_epsilon = 1.0  # Exploration probability at start
+min_epsilon = 0.05  # Minimum exploration probability
+decay_rate = 0.005  # Exponential decay rate for exploration prob
+```
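+
+To get a feel for the exploration parameters, the sketch below prints epsilon at a few episodes. It assumes that `train` uses the usual exponential schedule `min_epsilon + (max_epsilon - min_epsilon) * exp(-decay_rate * episode)`; the helper `epsilon_at` is hypothetical and only serves this illustration.
+
+```python
+import math
+
+max_epsilon = 1.0  # Exploration probability at start
+min_epsilon = 0.05  # Minimum exploration probability
+decay_rate = 0.005  # Exponential decay rate for exploration prob
+
+
+def epsilon_at(episode):
+    """Assumed exponential decay of epsilon from max_epsilon toward min_epsilon."""
+    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)
+
+
+for episode in (0, 100, 500, 1000, 25000):
+    print(episode, round(epsilon_at(episode), 3))
+```
+
+Under this assumption, epsilon is already close to `min_epsilon` after about 1,000 of the 25,000 episodes, so most of the training is spent exploiting with a small amount of residual exploration.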
+
+## Train our Q-Learning agent 🏃
+
+```python
+Qtable_taxi = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_taxi)
+Qtable_taxi
+```
+
+## Create a model dictionary 💾 and publish our trained model to the Hub 🔥
+- We create a model dictionary that will contain all the training hyperparameters for reproducibility and the Q-Table.
+
+
+```python
+model = {
+    "env_id": env_id,
+    "max_steps": max_steps,
+    "n_training_episodes": n_training_episodes,
+    "n_eval_episodes": n_eval_episodes,
+    "eval_seed": eval_seed,
+    "learning_rate": learning_rate,
+    "gamma": gamma,
+    "max_epsilon": max_epsilon,
+    "min_epsilon": min_epsilon,
+    "decay_rate": decay_rate,
+    "qtable": Qtable_taxi,
+}
+```
+
+```python
+username = ""  # FILL THIS
+repo_name = ""
+push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)
+```
+
+Now that it's on the Hub, you can compare the results of your Taxi-v3 agent with your classmates' using the leaderboard 🏆 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
+
+⚠ To see your entry, you need to go to the bottom of the leaderboard page and **click on refresh** ⚠
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/taxi-leaderboard.png" alt="Taxi Leaderboard">
+
+# Part 3: Load from Hub 🔽
+
+What's amazing with Hugging Face Hub 🤗 is that you can easily load powerful models from the community.
+
+Loading a saved model from the Hub is really easy:
+
+1. Go to https://huggingface.co/models?other=q-learning to see the list of all the saved Q-Learning models.
+2. You select one and copy its repo_id
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/copy-id.png" alt="Copy id">
+
+3. Then we just need to use `load_from_hub` with:
+- The repo_id
+- The filename: the saved model inside the repo.
+
+#### Do not modify this code
+
+```python
+from urllib.error import HTTPError
+
+from huggingface_hub import hf_hub_download
+
+
+def load_from_hub(repo_id: str, filename: str) -> dict:
+    """
+    Download a model from Hugging Face Hub.
+    :param repo_id: id of the model repository from the Hugging Face Hub
+    :param filename: name of the pickled model file in the repository
+    :return: the unpickled model dictionary
+    """
+    # Get the model from the Hub, download and cache the model on your local disk
+    pickle_model = hf_hub_download(repo_id=repo_id, filename=filename)
+
+    with open(pickle_model, "rb") as f:
+        downloaded_model_file = pickle.load(f)
+
+    return downloaded_model_file
+```
+
+
+```python
+model = load_from_hub(repo_id="ThomasSimonini/q-Taxi-v3", filename="q-learning.pkl")  # Try to use another model
+
+print(model)
+env = gym.make(model["env_id"])
+
+evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
+```
+
+```python
+model = load_from_hub(
+    repo_id="ThomasSimonini/q-FrozenLake-v1-no-slippery", filename="q-learning.pkl"
+)  # Try to use another model
+
+env = gym.make(model["env_id"], is_slippery=False)
+
+evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
+```
+
+## Some additional challenges 🏆
+The best way to learn **is to try things on your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. With 1,000,000 steps, we saw some great results!
+
+In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?
+
+Here are some ideas to get there:
+
+* Train for more steps
+* Try different hyperparameters by looking at what your classmates have done.
+* **Push your new trained model** on the Hub 🔥
+
+Are walking on ice and driving taxis too boring for you? Try to **change the environment**: why not use the slippery version of FrozenLake-v1? Check how the environments work [using the gym documentation](https://www.gymlibrary.dev/) and have fun 🎉.
+
+_____________________________________________________________________
+Congrats 🥳, you've just implemented, trained, and uploaded your first Reinforcement Learning agent.
+
+Understanding Q-Learning is an **important step to understanding value-based methods.**
+
+In the next Unit with Deep Q-Learning, we'll see that creating and updating a Q-table was a good strategy — **however, this is not scalable.**
+
+For instance, imagine you create an agent that learns to play Doom. 
+
+<img src="https://vizdoom.cs.put.edu.pl/user/pages/01.tutorial/basic.png" alt="Doom"/>
+
+Doom is a large environment with a huge state space (millions of different states). Creating and updating a Q-table for that environment would not be efficient. 
+
+That's why we'll study, in the next unit, Deep Q-Learning, an algorithm **where we use a neural network that approximates, given a state, the different Q-values for each action.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Environments"/>
+
+
+See you in Unit 3! 🔥
+
+## Keep learning, stay awesome 🤗
\ No newline at end of file
diff --git a/units/en/unit2/bellman-equation.mdx b/units/en/unit2/bellman-equation.mdx
index 03cab20..e819f5a 100644
--- a/units/en/unit2/bellman-equation.mdx
+++ b/units/en/unit2/bellman-equation.mdx
@@ -49,6 +49,7 @@ This is equivalent to  \\(V(S_{t})\\)  = Immediate reward  \\(R_{t+1}\\)  + Disc
 </figure>
 
 In the interest of simplicity, here we don't discount, so gamma = 1.
+But you'll study an example with gamma = 0.99 in the Q-Learning section of this unit.
 
 - The value of  \\(V(S_{t+1}) \\)  = Immediate reward  \\(R_{t+2}\\)  + Discounted value of the next state ( \\(gamma * V(S_{t+2})\\) ).
 - And so on.
diff --git a/units/en/unit2/hands-on.mdx b/units/en/unit2/hands-on.mdx
index b621dde..cc36d00 100644
--- a/units/en/unit2/hands-on.mdx
+++ b/units/en/unit2/hands-on.mdx
@@ -21,7 +21,6 @@ Thanks to a [leaderboard](https://huggingface.co/spaces/huggingface-projects/Dee
 
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit2/unit2.ipynb)
 
-
 # Unit 2: Q-Learning with FrozenLake-v1 ⛄ and Taxi-v3 🚕
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/thumbnail.jpg" alt="Unit 2 Thumbnail">
@@ -41,9 +40,10 @@ In this notebook, **you'll code from scratch your first Reinforcement Learning a
 
 ### 📚 RL-Library:
 
-- Python and Numpy
+- Python and NumPy
+- [Gym](https://www.gymlibrary.dev/)
 
-We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the Github Repo](https://github.com/huggingface/deep-rl-class/issues).
+We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues).
 
 ## Objectives of this notebook 🏆
 
@@ -55,29 +55,54 @@ At the end of the notebook, you will:
 
 
 ## Prerequisites 🏗️
-
 Before diving into the notebook, you need to:
 
 🔲 📚 **Study [Q-Learning by reading Unit 2](https://huggingface.co/deep-rl-course/unit2/introduction)**  🤗
 
+## A small recap of Q-Learning
+
+- *Q-Learning* **is the RL algorithm that**
+
+  - Trains a *Q-Function*, an **action-value function** that contains, as internal memory, a *Q-table* **that contains all the state-action pair values.**
+
+  - Given a state and action, our Q-Function **will search its Q-table for the corresponding value.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q function"  width="100%"/>
+
+- When the training is done, **we have an optimal Q-Function, so an optimal Q-Table.**
+
+- And if we **have an optimal Q-function**, we
+have an optimal policy, since we **know for each state, what is the best action to take.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy"  width="100%"/>
+
+
+But, in the beginning, our **Q-Table is useless since it gives an arbitrary value for each state-action pair (most of the time we initialize the Q-Table to 0 values)**. But as we explore the environment and update our Q-Table, it will give us better and better approximations.
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/q-learning.jpeg" alt="q-learning.jpeg" width="100%"/>
+
+This is the Q-Learning pseudocode:
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="100%"/>
+
+
+# Let's code our first Reinforcement Learning algorithm 🚀
 
 ## Install dependencies and create a virtual display 🔽
 
-During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames).
+In the notebook, we'll need to generate a replay video. To do so, with Colab, **we need to have a virtual screen to render the environment** (and thus record the frames).
 
-Hence the following cell will install the librairies and create and run a virtual screen 🖥
+Hence the following cell will install the libraries and create and run a virtual screen 🖥
 
 We’ll install multiple ones:
 
 - `gym`: Contains the FrozenLake-v1 ⛄ and Taxi-v3 🚕 environments. We use `gym==0.24` since it contains a nice Taxi-v3 UI version.
 - `pygame`: Used for the FrozenLake-v1 and Taxi-v3 UI.
-- `numPy`: Used for handling our Q-table.
+- `numpy`: Used for handling our Q-table.
 
 The Hugging Face Hub 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.
 
-
-You can see here all the Deep reinforcement Learning models available 👉 https://huggingface.co/models?other=q-learning
-
+You can see here all the Deep RL models available (if they use Q Learning) 👉 https://huggingface.co/models?other=q-learning
 
 ```bash
 pip install -r https://github.com/huggingface/deep-rl-class/tree/main/notebooks/unit2/requirements-unit2.txt
@@ -85,9 +110,7 @@ pip install -r https://github.com/huggingface/deep-rl-class/tree/main/notebooks/
 
 ```bash
 sudo apt-get update
-apt install python-opengl
-apt install ffmpeg
-apt install xvfb
+apt install python-opengl ffmpeg xvfb
 pip3 install pyvirtualdisplay
 ```
 
@@ -111,13 +134,8 @@ virtual_display.start()
 
 In addition to the installed libraries, we also use:
 
-- `random`: To generate random numbers (that will be useful for Epsilon-Greedy Policy).
-- `imageio`: To generate a replay video
-
-
-
-
-
+- `random`: To generate random numbers (that will be useful for epsilon-greedy policy).
+- `imageio`: To generate a replay video.
 
 ```python
 import numpy as np
@@ -153,8 +171,8 @@ We can have two sizes of environment:
 
 The environment has two modes:
 
-- `is_slippery=False`: The agent always move in the intended direction due to the non-slippery nature of the frozen lake.
-- `is_slippery=True`: The agent may not always move in the intended direction due to the slippery nature of the frozen lake (stochastic).
+- `is_slippery=False`: The agent always moves **in the intended direction** due to the non-slippery nature of the frozen lake (deterministic).
+- `is_slippery=True`: The agent **may not always move in the intended direction** due to the slippery nature of the frozen lake (stochastic).
 
 For now let's keep it simple with the 4x4 map and non-slippery
 
@@ -182,14 +200,13 @@ but we'll use the default environment for now.
 
 
 ```python
-# We create our environment with gym.make("<name_of_the_environment>")
-env.reset()
+# We create our environment with gym.make("<name_of_the_environment>")
 print("_____OBSERVATION SPACE_____ \n")
 print("Observation Space", env.observation_space)
 print("Sample observation", env.observation_space.sample())  # Get a random observation
 ```
 
-We see with `Observation Space Shape Discrete(16)` that the observation is a value representing the **agent’s current position as current_row * nrows + current_col (where both the row and col start at 0)**.
+We see with `Observation Space Shape Discrete(16)` that the observation is an integer representing the **agent’s current position as current_row * nrows + current_col (where both the row and col start at 0)**.
 
 For example, the goal position in the 4x4 map can be calculated as follows: 3 * 4 + 3 = 15. The number of possible observations is dependent on the size of the map. **For example, the 4x4 map has 16 possible observations.**
 
@@ -215,14 +232,13 @@ Reward function 💰:
 - Reach hole: 0
 - Reach frozen: 0
 
-
 ## Create and Initialize the Q-table 🗄️
 (👀 Step 1 of the pseudocode)
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="100%"/>
 
 
-It's time to initialize our Q-table! To know how many rows (states) and columns (actions) to use, we need to know the action and observation space. OpenAI Gym provides us a way to do that: `env.action_space.n` and `env.observation_space.n`
+It's time to initialize our Q-table! To know how many rows (states) and columns (actions) to use, we need to know the action and observation space. We already know their values from before, but we'll want to obtain them programmatically so that our algorithm generalizes for different environments. Gym provides us a way to do that: `env.action_space.n` and `env.observation_space.n`
 
 
 ```python
@@ -244,7 +260,6 @@ def initialize_q_table(state_space, action_space):
 Qtable_frozenlake = initialize_q_table(state_space, action_space)
 ```
 
-
 ### Solution
 
 ```python
@@ -266,13 +281,42 @@ def initialize_q_table(state_space, action_space):
 Qtable_frozenlake = initialize_q_table(state_space, action_space)
 ```
 
-## Define the epsilon-greedy policy 🤖
+## Define the greedy policy 🤖
+Remember we have two policies since Q-Learning is an **off-policy** algorithm. This means we're using a **different policy for acting and updating the value function**.
 
-Epsilon-Greedy is the training policy that handles the exploration/exploitation trade-off.
+- Epsilon-greedy policy (acting policy)
+- Greedy policy (updating policy)
 
-The idea with Epsilon Greedy:
+The greedy policy will also be the final policy we'll have once the Q-Learning agent is trained. The greedy policy is used to select an action from the Q-table.
 
-- With *probability 1 - ɛ* : **we do exploitation** (aka our agent selects the action with the highest state-action pair value).
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg" alt="Q-Learning" width="100%"/>
+
+
+```python
+def greedy_policy(Qtable, state):
+  # Exploitation: take the action with the highest state, action value
+  action =
+
+  return action
+```
+
+#### Solution
+
+```python
+def greedy_policy(Qtable, state):
+    # Exploitation: take the action with the highest state, action value
+    action = np.argmax(Qtable[state][:])
+
+    return action
+```
+
+## Define the epsilon-greedy policy 🤖
+
+Epsilon-greedy is the training policy that handles the exploration/exploitation trade-off.
+
+The idea with epsilon-greedy:
+
+- With *probability 1 - ɛ* : **we do exploitation** (i.e. our agent selects the action with the highest state-action pair value).
 
 - With *probability ɛ*: we do **exploration** (trying random action).
 
@@ -281,8 +325,6 @@ And as the training goes, we progressively **reduce the epsilon value since we w
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg" alt="Q-Learning" width="100%"/>
 
 
-Thanks to Sambit for finding a bug on the epsilon function 🤗
-
 ```python
 def epsilon_greedy_policy(Qtable, state, epsilon):
   # Randomly generate a number between 0 and 1
@@ -309,7 +351,7 @@ def epsilon_greedy_policy(Qtable, state, epsilon):
     if random_int > epsilon:
         # Take the action with the highest value given a state
         # np.argmax can be useful here
-        action = np.argmax(Qtable[state])
+        action = greedy_policy(Qtable, state)
     # else --> exploration
     else:
         action = env.action_space.sample()
@@ -317,41 +359,11 @@ def epsilon_greedy_policy(Qtable, state, epsilon):
     return action
 ```
 
-## Define the greedy policy 🤖
-
-Remember we have two policies since Q-Learning is an **off-policy** algorithm. This means we're using a **different policy for acting and updating the value function**.
-
-- Epsilon greedy policy (acting policy)
-- Greedy policy (updating policy)
-
-Greedy policy will also be the final policy we'll have when the Q-learning agent will be trained. The greedy policy is used to select an action from the Q-table.
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg" alt="Q-Learning" width="100%"/>
-
-
-```python
-def greedy_policy(Qtable, state):
-  # Exploitation: take the action with the highest state, action value
-  action =
-
-  return action
-```
-
-#### Solution
-
-```python
-def greedy_policy(Qtable, state):
-    # Exploitation: take the action with the highest state, action value
-    action = np.argmax(Qtable[state])
-
-    return action
-```
-
 ## Define the hyperparameters ⚙️
 The exploration related hyperparamters are some of the most important ones.
 
-- We need to make sure that our agent **explores enough the state space** in order to learn a good value approximation, in order to do that we need to have progressive decay of the epsilon.
-- If you decrease too fast epsilon (too high decay_rate), **you take the risk that your agent is stuck**, since your agent didn't explore enough the state space and hence can't solve the problem.
+- We need to make sure that our agent **explores enough of the state space** to learn a good value approximation. To do that, we need to have progressive decay of the epsilon.
+- If you decrease epsilon too fast (too high decay_rate), **you take the risk that your agent will be stuck**, since your agent didn't explore enough of the state space and hence can't solve the problem.
 
 ```python
 # Training parameters
@@ -373,8 +385,25 @@ min_epsilon = 0.05  # Minimum exploration probability
 decay_rate = 0.0005  # Exponential decay rate for exploration prob
 ```
 
-## Step 6: Create the training loop method
+## Create the training loop method
 
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="100%"/>
+
+The training loop goes like this:
+
+```
+For episode in the total of training episodes:
+
+  Reduce epsilon (since we need less and less exploration)
+  Reset the environment
+
+  For step in max timesteps:
+    Choose the action (a) using the epsilon-greedy policy
+    Take the action (a) and observe the outcome state (s') and reward (r)
+    Update the Q-value using the Bellman equation: Q(s,a) <- Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
+    If done, finish the episode
+    Our next state is the new state
+```
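+
+To see what a single update does numerically, here is a small worked sketch on an all-zero 16x4 Q-table. The transition is made up for illustration: we assume the agent is in state 14 (one cell left of the goal), takes action 2 (assumed to mean "right"), receives a reward of 1 and lands in state 15. The values `learning_rate = 0.7` and `gamma = 0.95` are example settings.
+
+```python
+import numpy as np
+
+learning_rate = 0.7  # example value
+gamma = 0.95  # example value
+
+# Fresh Q-table: 16 states x 4 actions, all zeros
+Qtable = np.zeros((16, 4))
+
+# Hypothetical transition next to the goal
+state, action = 14, 2
+reward, new_state = 1.0, 15
+
+# TD target: immediate reward + discounted best value of the next state
+td_target = reward + gamma * np.max(Qtable[new_state])
+# TD error: how far the current estimate is from the target
+td_error = td_target - Qtable[state][action]
+# Move the estimate a fraction (learning_rate) of the way toward the target
+Qtable[state][action] = Qtable[state][action] + learning_rate * td_error
+
+print(Qtable[state][action])  # 0.7
+```
+
+Because every other entry is still zero, only `Qtable[14][2]` changes; later episodes propagate this value backward to the states that lead to state 14.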
 
 ```python
 def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
@@ -402,7 +431,7 @@ def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_st
       if done:
         break
 
-      # Our state is the new state
+      # Our next state is the new state
       state = new_state
   return Qtable
 ```
@@ -437,7 +466,7 @@ def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_st
             if done:
                 break
 
-            # Our state is the new state
+            # Our next state is the new state
             state = new_state
     return Qtable
 ```
@@ -454,7 +483,9 @@ Qtable_frozenlake = train(n_training_episodes, min_epsilon, max_epsilon, decay_r
 Qtable_frozenlake
 ```
 
-## Define the evaluation method 📝
+## The evaluation method 📝
+
+- We defined the evaluation method that we're going to use to test our Q-Learning agent.
 
 ```python
 def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
@@ -477,7 +508,7 @@ def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
 
         for step in range(max_steps):
             # Take the action (index) that have the maximum expected future reward given that state
-            action = np.argmax(Q[state][:])
+            action = greedy_policy(Q, state)
             new_state, reward, done, info = env.step(action)
             total_rewards_ep += reward
 
@@ -493,8 +524,8 @@ def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
 
 ## Evaluate our Q-Learning agent 📈
 
-- Normally you should have mean reward of 1.0
-- It's relatively easy since the state space is really small (16). What you can try to do is [to replace with the slippery version](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/).
+- Usually, you should have a mean reward of 1.0
+- The **environment is relatively easy** since the state space is really small (16). What you can try to do is [to replace it with the slippery version](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/), which introduces stochasticity, making the environment more complex.
 
 ```python
 # Evaluate our Agent
@@ -502,10 +533,9 @@ mean_reward, std_reward = evaluate_agent(env, max_steps, n_eval_episodes, Qtable
 print(f"Mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
 ```
 
+## Publish our trained model to the Hub 🔥
 
-## Publish our trained model on the Hub 🔥
-
-Now that we saw we got good results after the training, we can publish our trained model on the hub 🤗 with one line of code.
+Now that we saw good results after the training, **we can publish our trained model to the Hub 🤗 with one line of code**.
 
 Here's an example of a Model Card:
 
@@ -517,8 +547,7 @@ Under the hood, the Hub uses git-based repositories (don't worry if you don't kn
 #### Do not modify this code
 
 ```python
-%%capture
-from huggingface_hub import HfApi, HfFolder, Repository
+from huggingface_hub import HfApi, HfFolder, Repository, snapshot_download
 from huggingface_hub.repocard import metadata_eval_result, metadata_save
 
 from pathlib import Path
@@ -528,6 +557,13 @@ import json
 
 ```python
 def record_video(env, Qtable, out_directory, fps=1):
+    """
+    Generate a replay video of the agent
+    :param env
+    :param Qtable: Qtable of our agent
+    :param out_directory
+    :param fps: how many frames per second (with taxi-v3 and frozenlake-v1 we use 1)
+    """
     images = []
     done = False
     state = env.reset(seed=random.randint(0, 500))
@@ -543,32 +579,36 @@ def record_video(env, Qtable, out_directory, fps=1):
 ```
 
 ```python
-def push_to_hub(
-    repo_id, model, env, video_fps=1, local_repo_path="hub", commit_message="Push Q-Learning agent to Hub", token=None
-):
+def push_to_hub(repo_id, model, env, video_fps=1, local_repo_path="hub"):
+    """
+    Evaluate, Generate a video and Upload a model to Hugging Face Hub.
+    This method does the complete pipeline:
+    - It evaluates the model
+    - It generates the model card
+    - It generates a replay video of the agent
+    - It pushes everything to the Hub
+
+    :param repo_id: id of the model repository from the Hugging Face Hub
+    :param env
+    :param video_fps: how many frames per second to record our video replay
+    (with taxi-v3 and frozenlake-v1 we use 1)
+    :param local_repo_path: where the local repository is
+    """
     _, repo_name = repo_id.split("/")
 
     eval_env = env
-
-    # Step 1: Clone or create the repo
-    # Create the repo (or clone its content if it's nonempty)
     api = HfApi()
 
+    # Step 1: Create the repo
     repo_url = api.create_repo(
         repo_id=repo_id,
-        token=token,
-        private=False,
         exist_ok=True,
     )
 
-    # Git pull
-    repo_local_path = Path(local_repo_path) / repo_name
-    repo = Repository(repo_local_path, clone_from=repo_url, use_auth_token=True)
-    repo.git_pull()
+    # Step 2: Download files
+    repo_local_path = Path(snapshot_download(repo_id=repo_id))
 
-    repo.lfs_track(["*.mp4"])
-
-    # Step 1: Save the model
+    # Step 3: Save the model
     if env.spec.kwargs.get("map_name"):
         model["map_name"] = env.spec.kwargs.get("map_name")
         if env.spec.kwargs.get("is_slippery", "") == False:
@@ -577,30 +617,26 @@ def push_to_hub(
     print(model)
 
     # Pickle the model
-    with open(Path(repo_local_path) / "q-learning.pkl", "wb") as f:
+    with open(repo_local_path / "q-learning.pkl", "wb") as f:
         pickle.dump(model, f)
 
-    # Step 2: Evaluate the model and build JSON
+    # Step 4: Evaluate the model and build JSON with evaluation metrics
     mean_reward, std_reward = evaluate_agent(
         eval_env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"]
     )
 
-    # First get datetime
-    eval_datetime = datetime.datetime.now()
-    eval_form_datetime = eval_datetime.isoformat()
-
     evaluate_data = {
         "env_id": model["env_id"],
         "mean_reward": mean_reward,
         "n_eval_episodes": model["n_eval_episodes"],
-        "eval_datetime": eval_form_datetime,
+        "eval_datetime": datetime.datetime.now().isoformat(),
     }
+
     # Write a JSON file
-    with open(Path(repo_local_path) / "results.json", "w") as outfile:
+    with open(repo_local_path / "results.json", "w") as outfile:
         json.dump(evaluate_data, outfile)
 
-    # Step 3: Create the model card
-    # Env id
+    # Step 5: Create the model card
     env_name = model["env_id"]
     if env.spec.kwargs.get("map_name"):
         env_name += "-" + env.spec.kwargs.get("map_name")
@@ -627,33 +663,31 @@ def push_to_hub(
     metadata = {**metadata, **eval}
 
     model_card = f"""
-  # **Q-Learning** Agent playing **{env_id}**
-  This is a trained model of a **Q-Learning** agent playing **{env_id}** .
-  """
+    # **Q-Learning** Agent playing **{env_id}**
+    This is a trained model of a **Q-Learning** agent playing **{env_id}** .
 
-    model_card += """
-  ## Usage
-  ```python
-  """
+    ## Usage
 
-    model_card += f"""model = load_from_hub(repo_id="{repo_id}", filename="q-learning.pkl")
+    ```python
 
-  # Don't forget to check if you need to add additional attributes (is_slippery=False etc)
-  env = gym.make(model["env_id"])
+    model = load_from_hub(repo_id="{repo_id}", filename="q-learning.pkl")
 
-  evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
-  """
-
-    model_card += """
+    # Don't forget to check if you need to add additional attributes (is_slippery=False etc)
+    env = gym.make(model["env_id"])
+    evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
+    ```
 
+    """
 
     readme_path = repo_local_path / "README.md"
     readme = ""
+    print(readme_path.exists())
     if readme_path.exists():
         with readme_path.open("r", encoding="utf8") as f:
             readme = f.read()
     else:
         readme = model_card
+    print(readme)
 
     with readme_path.open("w", encoding="utf-8") as f:
         f.write(readme)
@@ -661,20 +695,23 @@ def push_to_hub(
     # Save our metrics to Readme metadata
     metadata_save(readme_path, metadata)
 
-    # Step 4: Record a video
+    # Step 6: Record a video
     video_path = repo_local_path / "replay.mp4"
     record_video(env, model["qtable"], video_path, video_fps)
 
-    # Push everything to hub
-    print(f"Pushing the repo to the Hugging Face Hub")
-    repo.push_to_hub(commit_message=commit_message)
+    # Step 7: Push everything to the Hub
+    api.upload_folder(
+        repo_id=repo_id,
+        folder_path=repo_local_path,
+        path_in_repo=".",
+    )
 
-    print("Your model is pushed to the hub. You can view your model here: ", repo_url)
+    print("Your model is pushed to the Hub. You can view your model here: ", repo_url)
 ```
 
 ### .
 
-By using `package_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the hub**.
+By using `push_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the Hub**.
 
 This way:
 - You can **showcase our work** 🔥
@@ -700,9 +737,9 @@ from huggingface_hub import notebook_login
 notebook_login()
 ```
 
-If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
+If you don't want to use Google Colab or a Jupyter Notebook, use this command instead: `huggingface-cli login` (or call `login()` from `huggingface_hub` in a Python script)
 
-3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using `package_to_hub()` function
+3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using the `push_to_hub()` function
 
 - Let's create **the model dictionary that contains the hyperparameters and the Q_table**.
 
@@ -722,7 +759,7 @@ model = {
 }
 ```
 
-Let's fill the `package_to_hub` function:
+Let's fill in the arguments of the `push_to_hub` function:
 
 - `repo_id`: the name of the Hugging Face Hub Repository that will be created/updated `
 (repo_id = {username}/{repo_name})`
@@ -738,7 +775,7 @@ model
 ```python
 username = ""  # FILL THIS
 repo_name = "q-FrozenLake-v1-4x4-noSlippery"
-push_to_hub(repo_id=f"username}/{repo_name}", model=model, env=env)
+push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)
 ```
 
 Congrats 🥳 you've just implemented from scratch, trained and uploaded your first Reinforcement Learning agent.
@@ -813,8 +850,6 @@ learning_rate = 0.7  # Learning rate
 # Evaluation parameters
 n_eval_episodes = 100  # Total number of test episodes
 
-
-
 # DO NOT MODIFY EVAL_SEED
 eval_seed = [
     16,
@@ -935,13 +970,10 @@ decay_rate = 0.005  # Exponential decay rate for exploration prob
 
 ```python
 Qtable_taxi = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_taxi)
-```
-
-```python
 Qtable_taxi
 ```
 
-## Create a model dictionary 💾 and publish our trained model on the Hub 🔥
+## Create a model dictionary 💾 and publish our trained model to the Hub 🔥
 - We create a model dictionary that will contain all the training hyperparameters for reproducibility and the Q-Table.
 
 
@@ -963,12 +995,14 @@ model = {
 
 ```python
 username = ""  # FILL THIS
-repo_name = "q-Taxi-v3"
+repo_name = ""  # FILL THIS
 push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)
 ```
 
 Now that's on the Hub, you can compare the results of your Taxi-v3 with your classmates using the leaderboard 🏆 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
 
+⚠ To see your entry, you need to go to the bottom of the leaderboard page and **click on refresh** ⚠
+
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/taxi-leaderboard.png" alt="Taxi Leaderboard">
 
 # Part 3: Load from Hub 🔽
@@ -1000,14 +1034,6 @@ def load_from_hub(repo_id: str, filename: str) -> str:
     :param repo_id: id of the model repository from the Hugging Face Hub
     :param filename: name of the model zip file from the repository
     """
-    try:
-        from huggingface_hub import cached_download, hf_hub_url
-    except ImportError:
-        raise ImportError(
-            "You need to install huggingface_hub to use `load_from_hub`. "
-            "See https://pypi.org/project/huggingface-hub/ for installation."
-        )
-
     # Get the model from the Hub, download and cache the model on your local disk
     pickle_model = hf_hub_download(repo_id=repo_id, filename=filename)
 
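Note for readers of this patch: the update rule and epsilon schedule named in the training-loop pseudocode above can be sketched in isolation. This is a minimal illustration, assuming plain Python lists in place of the notebook's NumPy Q-table. `decayed_epsilon` and `q_update` are hypothetical helper names, not functions from the notebook. The default values mirror the FrozenLake hyperparameters used in the notebook.

```python
import math


def decayed_epsilon(episode, min_epsilon=0.05, max_epsilon=1.0, decay_rate=0.0005):
    """Exponentially decay epsilon from max_epsilon toward min_epsilon."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


def q_update(qtable, state, action, reward, new_state, learning_rate=0.7, gamma=0.95):
    """Apply Q(s,a) <- Q(s,a) + lr * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = reward + gamma * max(qtable[new_state])
    qtable[state][action] += learning_rate * (td_target - qtable[state][action])
    return qtable[state][action]


# Hypothetical toy table: 2 states x 2 actions, initialized to zero
qtable = [[0.0, 0.0], [0.0, 0.0]]
updated = q_update(qtable, state=0, action=1, reward=1.0, new_state=1)
print(updated)             # 0.7, since every other entry is still zero
print(decayed_epsilon(0))  # starts at max_epsilon
```

With an all-zero table, the first rewarded transition moves `Q(s,a)` by `learning_rate * reward`, which is why a high learning rate such as 0.7 propagates the goal reward quickly in a small deterministic grid.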

From 5c8432379e4a3b7a22a119b16091849c750be39a Mon Sep 17 00:00:00 2001
From: simoninithomas <simonini_thomas@outlook.fr>
Date: Mon, 12 Dec 2022 03:48:17 +0100
Subject: [PATCH 20/49] Remove unit2.mdx

---
 notebooks/unit2/unit2.mdx | 1096 -------------------------------------
 1 file changed, 1096 deletions(-)
 delete mode 100644 notebooks/unit2/unit2.mdx

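Note for reviewers: the notebook removed below defines two policies over the Q-table, a greedy one for updating and evaluation and an epsilon-greedy one for acting. For reference, here is a dependency-free sketch of both. It assumes the Q-table is a list of lists and draws the exploratory action with `randrange` rather than the notebook's `env.action_space.sample()`. The `rng` parameter is an addition for reproducibility and is not part of the notebook.

```python
import random


def greedy_policy(qtable, state):
    """Exploitation: return the index of the highest-valued action for this state."""
    row = qtable[state]
    return max(range(len(row)), key=row.__getitem__)


def epsilon_greedy_policy(qtable, state, epsilon, rng=random):
    """With probability 1 - epsilon exploit, otherwise pick a random action."""
    if rng.uniform(0, 1) > epsilon:
        return greedy_policy(qtable, state)
    return rng.randrange(len(qtable[state]))


# Hypothetical single-state table with 4 actions (LEFT, DOWN, RIGHT, UP)
qtable = [[0.1, 0.9, 0.3, 0.0]]
print(greedy_policy(qtable, 0))  # index of the largest value in the row
print(epsilon_greedy_policy(qtable, 0, epsilon=1.0, rng=random.Random(0)))
```

Passing a seeded `random.Random` instance makes the exploration sequence repeatable, which helps when comparing runs.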
diff --git a/notebooks/unit2/unit2.mdx b/notebooks/unit2/unit2.mdx
deleted file mode 100644
index fefa3f0..0000000
--- a/notebooks/unit2/unit2.mdx
+++ /dev/null
@@ -1,1096 +0,0 @@
-# Unit 2: Q-Learning with FrozenLake-v1 ⛄ and Taxi-v3 🚕
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/thumbnail.jpg" alt="Unit 2 Thumbnail">
-
-In this notebook, **you'll code from scratch your first Reinforcement Learning agent** playing FrozenLake ❄️ using Q-Learning, share it to the community, and experiment with different configurations.
-
-
-⬇️ Here is an example of what **you will achieve in just a couple of minutes.** ⬇️
-
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/envs.gif" alt="Environments"/>
-
-### 🎮 Environments:
-
-- [FrozenLake-v1](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)
-- [Taxi-v3](https://www.gymlibrary.dev/environments/toy_text/taxi/)
-
-### 📚 RL-Library:
-
-- Python and NumPy
-- [Gym](https://www.gymlibrary.dev/)
-
-We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues).
-
-## Objectives of this notebook 🏆
-
-At the end of the notebook, you will:
-
-- Be able to use **Gym**, the environment library.
-- Be able to code from scratch a Q-Learning agent.
-- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.
-
-
-
-
-## This notebook is from Deep Reinforcement Learning Course
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/deep-rl-course-illustration.jpg" alt="Deep RL Course illustration"/>
-
-In this free course, you will:
-
-- 📖 Study Deep Reinforcement Learning in **theory and practice**.
-- 🧑‍💻 Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.
-- 🤖 Train **agents in unique environments** 
-
-And more! Check 📚 the syllabus 👉 https://simoninithomas.github.io/deep-rl-course
-
-Don’t forget to **<a href="http://eepurl.com/ic5ZUD">sign up to the course</a>** (we are collecting your email to be able to **send you the links when each Unit is published and give you information about the challenges and updates**).
-
-
-The best way to keep in touch is to join our discord server to exchange with the community and with us 👉🏻 https://discord.gg/ydHrjt3WP5
-
-## Prerequisites 🏗️
-Before diving into the notebook, you need to:
-
-🔲 📚 **Study [Q-Learning by reading Unit 2](https://huggingface.co/deep-rl-course/unit2/introduction)**  🤗  
-
-## A small recap of Q-Learning
-
-- *Q-Learning* **is the RL algorithm that**:
-
-  - Trains a *Q-Function*, an **action-value function** that contains, as internal memory, a *Q-table* **that contains all the state-action pair values.**
-    
-  - Given a state and action, our Q-Function **will search into its Q-table the corresponding value.**
-    
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q function"  width="100%"/>
-
-- When the training is done, **we have an optimal Q-Function, so an optimal Q-Table.**
-    
-- And if we **have an optimal Q-function**, we
-have an optimal policy, since we **know, for each state, the best action to take.**
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy"  width="100%"/>
-
-
-But, in the beginning, our **Q-Table is useless since it gives arbitrary values for each state-action pair (most of the time we initialize the Q-Table to 0 values)**. As we explore the environment and update our Q-Table, it will give us better and better approximations.
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/q-learning.jpeg" alt="q-learning.jpeg" width="100%"/>
-
-This is the Q-Learning pseudocode:
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="100%"/>
-
-
-# Let's code our first Reinforcement Learning algorithm 🚀
-
-## Install dependencies and create a virtual display 🔽
-
-In the notebook, we'll need to generate a replay video. To do so, with Colab, **we need to have a virtual screen to render the environment** (and thus record the frames).
-
-Hence the following cell will install the libraries and create and run a virtual screen 🖥
-
-We’ll install several libraries:
-
-- `gym`: Contains the FrozenLake-v1 ⛄ and Taxi-v3 🚕 environments. We use `gym==0.24` since it contains a nice Taxi-v3 UI version.
-- `pygame`: Used for the FrozenLake-v1 and Taxi-v3 UI.
-- `numpy`: Used for handling our Q-table.
-
-The Hugging Face Hub 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.
-
-You can see here all the Deep RL models available (if they use Q Learning) 👉 https://huggingface.co/models?other=q-learning
-
-```python
-!pip install -r https://github.com/huggingface/deep-rl-class/tree/main/notebooks/unit2/requirements-unit2.txt
-```
-
-```python
-%%capture
-!sudo apt-get update
-!apt install python-opengl ffmpeg xvfb
-!pip3 install pyvirtualdisplay
-```
-
-To make sure the newly installed libraries are used, **sometimes it's required to restart the notebook runtime**. The next cell will force the **runtime to crash, so you'll need to connect again and run the code starting from here**. Thanks to this trick, **we will be able to run our virtual screen.**
-
-```python
-import os
-
-os.kill(os.getpid(), 9)
-```
-
-```python
-# Virtual display
-from pyvirtualdisplay import Display
-
-virtual_display = Display(visible=0, size=(1400, 900))
-virtual_display.start()
-```
-
-## Import the packages 📦
-
-In addition to the installed libraries, we also use:
-
-- `random`: To generate random numbers (that will be useful for epsilon-greedy policy).
-- `imageio`: To generate a replay video.
-
-```python
-import numpy as np
-import gym
-import random
-import imageio
-import os
-
-import pickle5 as pickle
-from tqdm.notebook import tqdm
-```
-
-We're now ready to code our Q-Learning algorithm 🔥
-
-# Part 1: Frozen Lake ⛄ (non slippery version)
-
-## Create and understand [FrozenLake environment ⛄](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)
----
-
-💡 A good habit when you start to use an environment is to check its documentation 
-
-👉 https://www.gymlibrary.dev/environments/toy_text/frozen_lake/
-
----
-
-We're going to train our Q-Learning agent **to navigate from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoid holes (H)**.
-
-We can have two sizes of environment:
-
-- `map_name="4x4"`: a 4x4 grid version
-- `map_name="8x8"`: an 8x8 grid version
-
-
-The environment has two modes:
-
-- `is_slippery=False`: The agent always moves **in the intended direction** due to the non-slippery nature of the frozen lake (deterministic).
-- `is_slippery=True`: The agent **may not always move in the intended direction** due to the slippery nature of the frozen lake (stochastic).
-
-For now, let's keep it simple with the 4x4 map and the non-slippery mode.
-
-```python
-# Create the FrozenLake-v1 environment using 4x4 map and non-slippery version
-env = gym.make()  # TODO use the correct parameters
-```
-
-### Solution
-
-```python
-env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
-```
-
-You can create your own custom grid like this:
-
-```python
-desc=["SFFF", "FHFH", "FFFH", "HFFG"]
-gym.make('FrozenLake-v1', desc=desc, is_slippery=True)
-```
-
-but we'll use the default environment for now.
-
-### Let's see what the Environment looks like:
-
-
-```python
-# We create our environment with gym.make("<name_of_the_environment>")
-print("_____OBSERVATION SPACE_____ \n")
-print("Observation Space", env.observation_space)
-print("Sample observation", env.observation_space.sample())  # Get a random observation
-```
-
-We see with `Observation Space Discrete(16)` that the observation is an integer representing the **agent’s current position as current_row * nrows + current_col (where both the row and col start at 0)**.
-
-For example, the goal position in the 4x4 map can be calculated as follows: 3 * 4 + 3 = 15. The number of possible observations is dependent on the size of the map. **For example, the 4x4 map has 16 possible observations.**
-
-
-For instance, this is what state = 0 looks like:
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/frozenlake.png" alt="FrozenLake">
-
-```python
-print("\n _____ACTION SPACE_____ \n")
-print("Action Space Shape", env.action_space.n)
-print("Action Space Sample", env.action_space.sample())  # Take a random action
-```
-
-The action space (the set of possible actions the agent can take) is discrete with 4 actions available 🎮:
-- 0: GO LEFT
-- 1: GO DOWN
-- 2: GO RIGHT
-- 3: GO UP
-
-Reward function 💰:
-- Reach goal: +1
-- Reach hole: 0
-- Reach frozen: 0
-
-## Create and Initialize the Q-table 🗄️
-(👀 Step 1 of the pseudocode)
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="100%"/>
-
-
-It's time to initialize our Q-table! To know how many rows (states) and columns (actions) to use, we need to know the action and observation space. We already know their values from before, but we'll want to obtain them programmatically so that our algorithm generalizes for different environments. Gym provides us a way to do that: `env.action_space.n` and `env.observation_space.n`
-
-
-```python
-state_space = 
-print("There are ", state_space, " possible states")
-
-action_space = 
-print("There are ", action_space, " possible actions")
-```
-
-```python
-# Let's create our Qtable of size (state_space, action_space) and initialize each value to 0 using np.zeros
-def initialize_q_table(state_space, action_space):
-  Qtable = 
-  return Qtable
-```
-
-```python
-Qtable_frozenlake = initialize_q_table(state_space, action_space)
-```
-
-### Solution
-
-```python
-state_space = env.observation_space.n
-print("There are ", state_space, " possible states")
-
-action_space = env.action_space.n
-print("There are ", action_space, " possible actions")
-```
-
-```python
-# Let's create our Qtable of size (state_space, action_space) and initialize each value to 0 using np.zeros
-def initialize_q_table(state_space, action_space):
-    Qtable = np.zeros((state_space, action_space))
-    return Qtable
-```
-
-```python
-Qtable_frozenlake = initialize_q_table(state_space, action_space)
-```
-
-## Define the greedy policy 🤖
-Remember we have two policies since Q-Learning is an **off-policy** algorithm. This means we're using a **different policy for acting and updating the value function**.
-
-- Epsilon-greedy policy (acting policy)
-- Greedy-policy (updating policy)
-
-The greedy policy will also be the final policy once the Q-Learning agent is trained. The greedy policy is used to select an action from the Q-table.
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg" alt="Q-Learning" width="100%"/>
-
-
-```python
-def greedy_policy(Qtable, state):
-  # Exploitation: take the action with the highest state-action value
-  action = 
-  
-  return action
-```
-
-#### Solution
-
-```python
-def greedy_policy(Qtable, state):
-    # Exploitation: take the action with the highest state-action value
-    action = np.argmax(Qtable[state][:])
-
-    return action
-```
-
-## Define the epsilon-greedy policy 🤖
-
-Epsilon-greedy is the training policy that handles the exploration/exploitation trade-off.
-
-The idea with epsilon-greedy:
-
-- With *probability 1 - ɛ* : **we do exploitation** (i.e. our agent selects the action with the highest state-action pair value).
-
-- With *probability ɛ*: we do **exploration** (trying random action).
-
-And as the training goes, we progressively **reduce the epsilon value since we will need less and less exploration and more exploitation.**
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg" alt="Q-Learning" width="100%"/>
-
-
-```python
-def epsilon_greedy_policy(Qtable, state, epsilon):
-  # Randomly generate a number between 0 and 1
-  random_num = 
-  # if random_num is greater than epsilon --> exploitation
-  if random_num > epsilon:
-    # Take the action with the highest value given a state
-    # np.argmax can be useful here
-    action = 
-  # else --> exploration
-  else:
-    action = # Take a random action
-  
-  return action
-```
-
-#### Solution
-
-```python
-def epsilon_greedy_policy(Qtable, state, epsilon):
-    # Randomly generate a number between 0 and 1
-    random_num = random.uniform(0, 1)
-    # if random_num is greater than epsilon --> exploitation
-    if random_num > epsilon:
-        # Take the action with the highest value given a state
-        # np.argmax can be useful here
-        action = greedy_policy(Qtable, state)
-    # else --> exploration
-    else:
-        action = env.action_space.sample()
-
-    return action
-```
-
-## Define the hyperparameters ⚙️
-The exploration-related hyperparameters are some of the most important ones.
-
-- We need to make sure that our agent **explores enough of the state space** to learn a good value approximation. To do that, we need to have progressive decay of the epsilon.
-- If you decrease epsilon too fast (too high decay_rate), **you take the risk that your agent will be stuck**, since your agent didn't explore enough of the state space and hence can't solve the problem.
-
-```python
-# Training parameters
-n_training_episodes = 10000  # Total training episodes
-learning_rate = 0.7  # Learning rate
-
-# Evaluation parameters
-n_eval_episodes = 100  # Total number of test episodes
-
-# Environment parameters
-env_id = "FrozenLake-v1"  # Name of the environment
-max_steps = 99  # Max steps per episode
-gamma = 0.95  # Discounting rate
-eval_seed = []  # The evaluation seed of the environment
-
-# Exploration parameters
-max_epsilon = 1.0  # Exploration probability at start
-min_epsilon = 0.05  # Minimum exploration probability
-decay_rate = 0.0005  # Exponential decay rate for exploration prob
-```
-
-## Create the training loop method
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="100%"/>
-
-The training loop goes like this:
-
-```
-For episode in the total of training episodes:
-
-  Reduce epsilon (since we need less and less exploration)
-  Reset the environment
-
-  For step in max timesteps:    
-    Choose the action At using epsilon greedy policy
-    Take the action (a) and observe the outcome state(s') and reward (r)
-    Update the Q-value Q(s,a) using Bellman equation Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
-    If done, finish the episode
-    Our next state is the new state
-```
-
-```python
-def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
-  for episode in range(n_training_episodes):
-    # Reduce epsilon (because we need less and less exploration)
-    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)
-    # Reset the environment
-    state = env.reset()
-    step = 0
-    done = False
-
-    # repeat
-    for step in range(max_steps):
-      # Choose the action At using epsilon greedy policy
-      action = 
-
-      # Take action At and observe Rt+1 and St+1
-      # Take the action (a) and observe the outcome state(s') and reward (r)
-      new_state, reward, done, info = 
-
-      # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
-      Qtable[state][action] = 
-
-      # If done, finish the episode
-      if done:
-        break
-      
-      # Our next state is the new state
-      state = new_state
-  return Qtable
-```
-
-#### Solution
-
-```python
-def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
-    for episode in tqdm(range(n_training_episodes)):
-        # Reduce epsilon (because we need less and less exploration)
-        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
-        # Reset the environment
-        state = env.reset()
-        step = 0
-        done = False
-
-        # repeat
-        for step in range(max_steps):
-            # Choose the action At using epsilon greedy policy
-            action = epsilon_greedy_policy(Qtable, state, epsilon)
-
-            # Take action At and observe Rt+1 and St+1
-            # Take the action (a) and observe the outcome state(s') and reward (r)
-            new_state, reward, done, info = env.step(action)
-
-            # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
-            Qtable[state][action] = Qtable[state][action] + learning_rate * (
-                reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action]
-            )
-
-            # If done, finish the episode
-            if done:
-                break
-
-            # Our next state is the new state
-            state = new_state
-    return Qtable
-```
-
-## Train the Q-Learning agent 🏃
-
-```python
-Qtable_frozenlake = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_frozenlake)
-```
-
-## Let's see what our Q-Learning table looks like now 👀
-
-```python
-Qtable_frozenlake
-```
-
-## The evaluation method 📝
-
-- We define the evaluation method that we'll use to test our Q-Learning agent.
-
-```python
-def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
-    """
-    Evaluate the agent for ``n_eval_episodes`` episodes and return the mean and standard deviation of the reward.
-    :param env: The evaluation environment
-    :param n_eval_episodes: Number of episodes to evaluate the agent
-    :param Q: The Q-table
-    :param seed: The evaluation seed array (for taxi-v3)
-    """
-    episode_rewards = []
-    for episode in tqdm(range(n_eval_episodes)):
-        if seed:
-            state = env.reset(seed=seed[episode])
-        else:
-            state = env.reset()
-        step = 0
-        done = False
-        total_rewards_ep = 0
-
-        for step in range(max_steps):
-            # Take the action (index) that has the maximum expected future reward given that state
-            action = greedy_policy(Q, state)
-            new_state, reward, done, info = env.step(action)
-            total_rewards_ep += reward
-
-            if done:
-                break
-            state = new_state
-        episode_rewards.append(total_rewards_ep)
-    mean_reward = np.mean(episode_rewards)
-    std_reward = np.std(episode_rewards)
-
-    return mean_reward, std_reward
-```
-
-## Evaluate our Q-Learning agent 📈
-
-- Usually, you should have a mean reward of 1.0
-- The **environment is relatively easy** since the state space is really small (16). What you can try to do is [to replace it with the slippery version](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/), which introduces stochasticity, making the environment more complex.
-
-```python
-# Evaluate our Agent
-mean_reward, std_reward = evaluate_agent(env, max_steps, n_eval_episodes, Qtable_frozenlake, eval_seed)
-print(f"Mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
-```
-
-## Publish our trained model to the Hub 🔥
-
-Now that we saw good results after the training, **we can publish our trained model to the Hub 🤗 with one line of code**.
-
-Here's an example of a Model Card:
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/modelcard.png" alt="Model card" width="100%"/>
-
-
-Under the hood, the Hub uses git-based repositories (don't worry if you don't know what git is), which means you can update the model with new versions as you experiment and improve your agent.
-
-#### Do not modify this code
-
-```python
-from huggingface_hub import HfApi, HfFolder, Repository, snapshot_download
-from huggingface_hub.repocard import metadata_eval_result, metadata_save
-
-from pathlib import Path
-import datetime
-import json
-```
-
-```python
-def record_video(env, Qtable, out_directory, fps=1):
-    """
-    Generate a replay video of the agent
-    :param env
-    :param Qtable: Qtable of our agent
-    :param out_directory
-    :param fps: how many frames per second (with taxi-v3 and frozenlake-v1 we use 1)
-    """
-    images = []
-    done = False
-    state = env.reset(seed=random.randint(0, 500))
-    img = env.render(mode="rgb_array")
-    images.append(img)
-    while not done:
-        # Take the action (index) that has the maximum expected future reward given that state
-        action = np.argmax(Qtable[state][:])
-        state, reward, done, info = env.step(action)  # We directly put next_state = state for recording logic
-        img = env.render(mode="rgb_array")
-        images.append(img)
-    imageio.mimsave(out_directory, [np.array(img) for img in images], fps=fps)
-```
-
-```python
-def push_to_hub(repo_id, model, env, video_fps=1, local_repo_path="hub"):
-    """
-    Evaluate the model, generate a video, and upload everything to the Hugging Face Hub.
-    This method does the complete pipeline:
-    - It evaluates the model
-    - It generates the model card
-    - It generates a replay video of the agent
-    - It pushes everything to the Hub
-
-    :param repo_id: id of the model repository from the Hugging Face Hub
-    :param env
-    :param video_fps: how many frames per second to record our video replay
-    (with taxi-v3 and frozenlake-v1 we use 1)
-    :param local_repo_path: where the local repository is
-    """
-    _, repo_name = repo_id.split("/")
-
-    eval_env = env
-    api = HfApi()
-
-    # Step 1: Create the repo
-    repo_url = api.create_repo(
-        repo_id=repo_id,
-        exist_ok=True,
-    )
-
-    # Step 2: Download files
-    repo_local_path = Path(snapshot_download(repo_id=repo_id))
-
-    # Step 3: Save the model
-    if env.spec.kwargs.get("map_name"):
-        model["map_name"] = env.spec.kwargs.get("map_name")
-        if env.spec.kwargs.get("is_slippery", "") == False:
-            model["slippery"] = False
-
-    print(model)
-
-    # Pickle the model
-    with open((repo_local_path) / "q-learning.pkl", "wb") as f:
-        pickle.dump(model, f)
-
-    # Step 4: Evaluate the model and build JSON with evaluation metrics
-    mean_reward, std_reward = evaluate_agent(
-        eval_env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"]
-    )
-
-    evaluate_data = {
-        "env_id": model["env_id"],
-        "mean_reward": mean_reward,
-        "n_eval_episodes": model["n_eval_episodes"],
-        "eval_datetime": datetime.datetime.now().isoformat(),
-    }
-
-    # Write a JSON file
-    with open(repo_local_path / "results.json", "w") as outfile:
-        json.dump(evaluate_data, outfile)
-
-    # Step 5: Create the model card
-    env_name = model["env_id"]
-    if env.spec.kwargs.get("map_name"):
-        env_name += "-" + env.spec.kwargs.get("map_name")
-
-    if env.spec.kwargs.get("is_slippery", "") == False:
-        env_name += "-" + "no_slippery"
-
-    metadata = {}
-    metadata["tags"] = [env_name, "q-learning", "reinforcement-learning", "custom-implementation"]
-
-    # Add metrics
-    eval = metadata_eval_result(
-        model_pretty_name=repo_name,
-        task_pretty_name="reinforcement-learning",
-        task_id="reinforcement-learning",
-        metrics_pretty_name="mean_reward",
-        metrics_id="mean_reward",
-        metrics_value=f"{mean_reward:.2f} +/- {std_reward:.2f}",
-        dataset_pretty_name=env_name,
-        dataset_id=env_name,
-    )
-
-    # Merges both dictionaries
-    metadata = {**metadata, **eval}
-
-    model_card = f"""
-  # **Q-Learning** Agent playing **{env_id}**
-  This is a trained model of a **Q-Learning** agent playing **{env_id}** .
-
-  ## Usage
-
-  ```python
-  
-  model = load_from_hub(repo_id="{repo_id}", filename="q-learning.pkl")
-
-  # Don't forget to check if you need to add additional attributes (is_slippery=False etc)
-  env = gym.make(model["env_id"])
-  ```
-  """
-
-    evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
-
-    readme_path = repo_local_path / "README.md"
-    readme = ""
-    print(readme_path.exists())
-    if readme_path.exists():
-        with readme_path.open("r", encoding="utf8") as f:
-            readme = f.read()
-    else:
-        readme = model_card
-    print(readme)
-
-    with readme_path.open("w", encoding="utf-8") as f:
-        f.write(readme)
-
-    # Save our metrics to Readme metadata
-    metadata_save(readme_path, metadata)
-
-    # Step 6: Record a video
-    video_path = repo_local_path / "replay.mp4"
-    record_video(env, model["qtable"], video_path, video_fps)
-
-    # Step 7. Push everything to the Hub
-    api.upload_folder(
-        repo_id=repo_id,
-        folder_path=repo_local_path,
-        path_in_repo=".",
-    )
-
-    print("Your model is pushed to the Hub. You can view your model here: ", repo_url)
-```
-
-### .
-
-By using `push_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the Hub**.
-
-This way:
-- You can **showcase your work** 🔥
-- You can **visualize your agent playing** 👀
-- You can **share with the community an agent that others can use** 💾
-- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
-
-
-To be able to share your model with the community there are three more steps to follow:
-
-1️⃣ (If it's not already done) create a Hugging Face account ➡ https://huggingface.co/join
-
-2️⃣ Sign in, then store your authentication token from the Hugging Face website.
-- Create a new token (https://huggingface.co/settings/tokens) **with write role**
-
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">
-
-
-```python
-from huggingface_hub import notebook_login
-
-notebook_login()
-```
-
-If you're not using Google Colab or a Jupyter Notebook, use this command instead: `huggingface-cli login` (or `login`)
-
-3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using the `push_to_hub()` function
-
-- Let's create **the model dictionary that contains the hyperparameters and the Q_table**.
-
-```python
-model = {
-    "env_id": env_id,
-    "max_steps": max_steps,
-    "n_training_episodes": n_training_episodes,
-    "n_eval_episodes": n_eval_episodes,
-    "eval_seed": eval_seed,
-    "learning_rate": learning_rate,
-    "gamma": gamma,
-    "max_epsilon": max_epsilon,
-    "min_epsilon": min_epsilon,
-    "decay_rate": decay_rate,
-    "qtable": Qtable_frozenlake,
-}
-```
-
-Let's fill in the arguments of the `push_to_hub` function:
-
-- `repo_id`: the name of the Hugging Face Hub Repository that will be created/updated
-(`repo_id = {username}/{repo_name}`)
-💡 A good `repo_id` is `{username}/q-{env_id}`
-- `model`: our model dictionary containing the hyperparameters and the Qtable.
-- `env`: the environment.
-- `video_fps`: the number of frames per second of the replay video (defaults to 1).
-
-```python
-model
-```
-
-```python
-username = ""  # FILL THIS
-repo_name = "q-FrozenLake-v1-4x4-noSlippery"
-push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)
-```
-
-Congrats 🥳 you've just implemented from scratch, trained, and uploaded your first Reinforcement Learning agent.
-FrozenLake-v1 no_slippery is a very simple environment, so let's try a harder one 🔥.
-
-# Part 2: Taxi-v3 🚖
-
-## Create and understand [Taxi-v3 🚕](https://www.gymlibrary.dev/environments/toy_text/taxi/)
----
-
-💡 A good habit when you start to use an environment is to check its documentation 
-
-👉 https://www.gymlibrary.dev/environments/toy_text/taxi/
-
----
-
-In `Taxi-v3` 🚕, there are four designated locations in the grid world indicated by R(ed), G(reen), Y(ellow), and B(lue). 
-
-When the episode starts, **the taxi starts off at a random square** and the passenger is at a random location. The taxi drives to the passenger’s location, **picks up the passenger**, drives to the passenger’s destination (another one of the four specified locations), and then **drops off the passenger**. Once the passenger is dropped off, the episode ends.
-
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/taxi.png" alt="Taxi">
-
-
-```python
-env = gym.make("Taxi-v3")
-```
-
-There are **500 discrete states since there are 25 taxi positions, 5 possible locations of the passenger** (including the case when the passenger is in the taxi), and **4 destination locations.**
-
-
-```python
-state_space = env.observation_space.n
-print("There are ", state_space, " possible states")
-```
-
-```python
-action_space = env.action_space.n
-print("There are ", action_space, " possible actions")
-```
-
-The action space (the set of possible actions the agent can take) is discrete with **6 actions available 🎮**:
-
-- 0: move south
-- 1: move north
-- 2: move east
-- 3: move west
-- 4: pickup passenger
-- 5: drop off passenger
-
-Reward function 💰:
-
-- -1 per step unless another reward is triggered.
-- +20 for delivering the passenger.
-- -10 for executing “pickup” and “drop-off” actions illegally.
-
-```python
-# Create our Q table with state_size rows and action_size columns (500x6)
-Qtable_taxi = initialize_q_table(state_space, action_space)
-print(Qtable_taxi)
-print("Q-table shape: ", Qtable_taxi.shape)
-```
-
-## Define the hyperparameters ⚙️
-⚠ DO NOT MODIFY EVAL_SEED: the eval_seed array **allows us to evaluate your agent with the same taxi starting positions for every classmate**
-
-```python
-# Training parameters
-n_training_episodes = 25000  # Total training episodes
-learning_rate = 0.7  # Learning rate
-
-# Evaluation parameters
-n_eval_episodes = 100  # Total number of test episodes
-
-# DO NOT MODIFY EVAL_SEED
-eval_seed = [
-    16,
-    54,
-    165,
-    177,
-    191,
-    191,
-    120,
-    80,
-    149,
-    178,
-    48,
-    38,
-    6,
-    125,
-    174,
-    73,
-    50,
-    172,
-    100,
-    148,
-    146,
-    6,
-    25,
-    40,
-    68,
-    148,
-    49,
-    167,
-    9,
-    97,
-    164,
-    176,
-    61,
-    7,
-    54,
-    55,
-    161,
-    131,
-    184,
-    51,
-    170,
-    12,
-    120,
-    113,
-    95,
-    126,
-    51,
-    98,
-    36,
-    135,
-    54,
-    82,
-    45,
-    95,
-    89,
-    59,
-    95,
-    124,
-    9,
-    113,
-    58,
-    85,
-    51,
-    134,
-    121,
-    169,
-    105,
-    21,
-    30,
-    11,
-    50,
-    65,
-    12,
-    43,
-    82,
-    145,
-    152,
-    97,
-    106,
-    55,
-    31,
-    85,
-    38,
-    112,
-    102,
-    168,
-    123,
-    97,
-    21,
-    83,
-    158,
-    26,
-    80,
-    63,
-    5,
-    81,
-    32,
-    11,
-    28,
-    148,
-]  # Evaluation seeds: they ensure that every classmate's agent is evaluated on the same taxi starting positions
-# Each seed has a specific starting state
-
-# Environment parameters
-env_id = "Taxi-v3"  # Name of the environment
-max_steps = 99  # Max steps per episode
-gamma = 0.95  # Discounting rate
-
-# Exploration parameters
-max_epsilon = 1.0  # Exploration probability at start
-min_epsilon = 0.05  # Minimum exploration probability
-decay_rate = 0.005  # Exponential decay rate for exploration prob
-```
-
-## Train our Q-Learning agent 🏃
-
-```python
-Qtable_taxi = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_taxi)
-Qtable_taxi
-```
-
-## Create a model dictionary 💾 and publish our trained model to the Hub 🔥
-- We create a model dictionary that will contain all the training hyperparameters for reproducibility and the Q-Table.
-
-
-```python
-model = {
-    "env_id": env_id,
-    "max_steps": max_steps,
-    "n_training_episodes": n_training_episodes,
-    "n_eval_episodes": n_eval_episodes,
-    "eval_seed": eval_seed,
-    "learning_rate": learning_rate,
-    "gamma": gamma,
-    "max_epsilon": max_epsilon,
-    "min_epsilon": min_epsilon,
-    "decay_rate": decay_rate,
-    "qtable": Qtable_taxi,
-}
-```
-
-```python
-username = ""  # FILL THIS
-repo_name = ""
-push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)
-```
-
-Now that it's on the Hub, you can compare the results of your Taxi-v3 agent with your classmates' using the leaderboard 🏆 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
-
-⚠ To see your entry, you need to go to the bottom of the leaderboard page and **click on refresh** ⚠
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/taxi-leaderboard.png" alt="Taxi Leaderboard">
-
-# Part 3: Load from Hub 🔽
-
-What's amazing with Hugging Face Hub 🤗 is that you can easily load powerful models from the community.
-
-Loading a saved model from the Hub is really easy:
-
-1. Go to https://huggingface.co/models?other=q-learning to see the list of all the saved Q-learning models.
-2. Select one and copy its `repo_id`.
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/copy-id.png" alt="Copy id">
-
-3. Then we just need to use `load_from_hub` with:
-- The repo_id
-- The filename: the saved model inside the repo.
-
-#### Do not modify this code
-
-```python
-import pickle
-
-from huggingface_hub import hf_hub_download
-
-
-def load_from_hub(repo_id: str, filename: str) -> dict:
-    """
-    Download a pickled model dictionary from the Hugging Face Hub and load it.
-    :param repo_id: id of the model repository from the Hugging Face Hub
-    :param filename: name of the pickled model file in the repository
-    """
-    # Get the model from the Hub, download and cache the model on your local disk
-    pickle_model = hf_hub_download(repo_id=repo_id, filename=filename)
-
-    with open(pickle_model, "rb") as f:
-        downloaded_model_file = pickle.load(f)
-
-    return downloaded_model_file
-```
-
-### .
-
-```python
-model = load_from_hub(repo_id="ThomasSimonini/q-Taxi-v3", filename="q-learning.pkl")  # Try to use another model
-
-print(model)
-env = gym.make(model["env_id"])
-
-evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
-```
-
-```python
-model = load_from_hub(
-    repo_id="ThomasSimonini/q-FrozenLake-v1-no-slippery", filename="q-learning.pkl"
-)  # Try to use another model
-
-env = gym.make(model["env_id"], is_slippery=False)
-
-evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
-```
-
-## Some additional challenges 🏆
-The best way to learn **is to try things on your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. With 1,000,000 steps, we saw some great results!
-
-In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?
-
-Here are some ideas to get there:
-
-* Train for more steps
-* Try different hyperparameters by looking at what your classmates have done.
-* **Push your new trained model** on the Hub 🔥
-
-Are walking on ice and driving taxis too boring for you? Try to **change the environment**: why not use the slippery version of FrozenLake-v1? Check how the environments work [using the gym documentation](https://www.gymlibrary.dev/) and have fun 🎉.
-
-_____________________________________________________________________
-Congrats 🥳, you've just implemented, trained, and uploaded your first Reinforcement Learning agent.
-
-Understanding Q-Learning is an **important step to understanding value-based methods.**
-
-In the next Unit with Deep Q-Learning, we'll see that creating and updating a Q-table was a good strategy — **however, this is not scalable.**
-
-For instance, imagine you create an agent that learns to play Doom. 
-
-<img src="https://vizdoom.cs.put.edu.pl/user/pages/01.tutorial/basic.png" alt="Doom"/>
-
-Doom is a large environment with a huge state space (millions of different states). Creating and updating a Q-table for that environment would not be efficient. 
-
-That's why we'll study, in the next unit, Deep Q-Learning, an algorithm **where we use a neural network that approximates, given a state, the different Q-values for each action.**
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Environments"/>
-
-
-See you on Unit 3! 🔥
-
-## Keep learning, stay awesome 🤗
\ No newline at end of file
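
Note on the Taxi-v3 state count in the hands-on text removed above: the text gives 500 discrete states as 25 taxi positions times 5 passenger locations times 4 destinations. The sketch below checks that count by enumerating every combination. The `encode` helper and its row-major packing order are assumptions made for illustration (they mirror how Gym documents the encoding, but nothing in this patch depends on them).

```python
from itertools import product

N_ROWS, N_COLS = 5, 5        # 25 taxi positions on the grid
N_PASSENGER_LOCATIONS = 5    # R, G, Y, B, or inside the taxi
N_DESTINATIONS = 4           # R, G, Y, B


def encode(row, col, passenger, destination):
    """Pack the four state components into one integer (assumed ordering)."""
    return ((row * N_COLS + col) * N_PASSENGER_LOCATIONS + passenger) * N_DESTINATIONS + destination


# Enumerate every combination and collect the resulting state indices.
states = {
    encode(r, c, p, d)
    for r, c, p, d in product(
        range(N_ROWS), range(N_COLS), range(N_PASSENGER_LOCATIONS), range(N_DESTINATIONS)
    )
}
print(len(states), min(states), max(states))
```

Each combination maps to a distinct index, which is why a Q-table with one row per index needs `env.observation_space.n` rows.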

From 7b9c1cf0a4896695f01283ebd2b19eb11f864c43 Mon Sep 17 00:00:00 2001
From: simoninithomas <simonini_thomas@outlook.fr>
Date: Mon, 12 Dec 2022 03:57:37 +0100
Subject: [PATCH 21/49] Small updates Unit 2

---
 units/en/unit2/bellman-equation.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
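
Note on the formula touched by this patch: the corrected line states that V(S_t) equals the immediate reward R_{t+1} plus the discounted value of the next state, gamma * V(S_{t+1}). Below is a minimal sketch of that recursion on a single episode. The reward sequence, the discount value, and the function name `state_values` are made up for illustration and are not part of the course material.

```python
def state_values(rewards, gamma):
    """Return V(S_t) for every step of one episode.

    Works backwards with V(S_t) = R_{t+1} + gamma * V(S_{t+1}),
    taking the value of the terminal state to be 0.
    """
    values = [0.0] * (len(rewards) + 1)  # last entry is the terminal state
    for t in reversed(range(len(rewards))):
        values[t] = rewards[t] + gamma * values[t + 1]
    return values[:-1]


# Hypothetical episode: a reward of 1 at each of three steps, gamma = 0.5
print(state_values([1.0, 1.0, 1.0], gamma=0.5))  # prints [1.75, 1.5, 1.0]
```

Each value reuses the one computed for the following state, so no return has to be summed from scratch.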

diff --git a/units/en/unit2/bellman-equation.mdx b/units/en/unit2/bellman-equation.mdx
index e819f5a..577c6bb 100644
--- a/units/en/unit2/bellman-equation.mdx
+++ b/units/en/unit2/bellman-equation.mdx
@@ -41,7 +41,7 @@ If we go back to our example, we can say that the value of State 1 is equal to t
 
 To calculate the value of State 1: the sum of rewards **if the agent started in that state 1** and then followed the **policy for all the time steps.**
 
-This is equivalent to  \\(V(S_{t})\\)  = Immediate reward  \\(R_{t+1}\\)  + Discounted value of the next state  \\(gamma * V(S_{t+1})\\)
+This is equivalent to  \\(V(S_{t})\\)  = Immediate reward  \\(R_{t+1}\\)  + Discounted value of the next state  \\(\gamma * V(S_{t+1})\\)
 
 <figure>
   <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman6.jpg" alt="Bellman equation"/>

From f20ccf4cc1aabe149cdd088f098d201c8a821fb9 Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Mon, 12 Dec 2022 13:36:37 +0100
Subject: [PATCH 22/49] Update units/en/unit2/hands-on.mdx

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
---
 units/en/unit2/hands-on.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
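
Note on the URL change: `pip install -r` needs the plain-text requirements file. The `github.com/.../tree/...` address serves an HTML page, whereas `raw.githubusercontent.com` serves the file itself. The helper below is a hypothetical sketch of the rewrite this patch applies by hand. The name `to_raw_url` is invented here, and it assumes the usual `github.com/<owner>/<repo>/<tree|blob>/<ref>/<path>` layout with a ref that contains no slash.

```python
from urllib.parse import urlsplit


def to_raw_url(url: str) -> str:
    """Rewrite a github.com tree/blob URL to its raw.githubusercontent.com form."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] not in ("tree", "blob"):
        raise ValueError(f"not a github.com tree/blob URL: {url}")
    owner, repo, _, ref, *path = segments
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{'/'.join(path)}"


old = "https://github.com/huggingface/deep-rl-class/tree/main/notebooks/unit2/requirements-unit2.txt"
print(to_raw_url(old))
```

For the URL in this patch, the result matches the added line of the hunk below.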

diff --git a/units/en/unit2/hands-on.mdx b/units/en/unit2/hands-on.mdx
index cc36d00..18af8b2 100644
--- a/units/en/unit2/hands-on.mdx
+++ b/units/en/unit2/hands-on.mdx
@@ -105,7 +105,7 @@ The Hugging Face Hub 🤗 works as a central place where anyone can share and ex
 You can see here all the Deep RL models available (if they use Q Learning) 👉 https://huggingface.co/models?other=q-learning
 
 ```bash
-pip install -r https://github.com/huggingface/deep-rl-class/tree/main/notebooks/unit2/requirements-unit2.txt
+pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit2/requirements-unit2.txt
 ```
 
 ```bash

From bea56eb1159357a08bd8b86ccfa0a6ad3ef175ce Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Mon, 12 Dec 2022 13:38:52 +0100
Subject: [PATCH 23/49] Update units/en/unit2/hands-on.mdx

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
---
 units/en/unit2/hands-on.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/units/en/unit2/hands-on.mdx b/units/en/unit2/hands-on.mdx
index 18af8b2..3a4183a 100644
--- a/units/en/unit2/hands-on.mdx
+++ b/units/en/unit2/hands-on.mdx
@@ -547,7 +547,7 @@ Under the hood, the Hub uses git-based repositories (don't worry if you don't kn
 #### Do not modify this code
 
 ```python
-from huggingface_hub import HfApi, HfFolder, Repository, snapshot_download
+from huggingface_hub import HfApi, snapshot_download
 from huggingface_hub.repocard import metadata_eval_result, metadata_save
 
 from pathlib import Path

From 751a562f3a8f4a140344846006402b8ab4d21e4a Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Mon, 12 Dec 2022 14:05:58 +0100
Subject: [PATCH 24/49] Update Unit2 hands on

---
 units/en/unit2/hands-on.mdx | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/units/en/unit2/hands-on.mdx b/units/en/unit2/hands-on.mdx
index 3a4183a..58a2a57 100644
--- a/units/en/unit2/hands-on.mdx
+++ b/units/en/unit2/hands-on.mdx
@@ -614,8 +614,6 @@ def push_to_hub(repo_id, model, env, video_fps=1, local_repo_path="hub"):
         if env.spec.kwargs.get("is_slippery", "") == False:
             model["slippery"] = False
 
-    print(model)
-
     # Pickle the model
     with open((repo_local_path) / "q-learning.pkl", "wb") as f:
         pickle.dump(model, f)
@@ -632,7 +630,8 @@ def push_to_hub(repo_id, model, env, video_fps=1, local_repo_path="hub"):
         "eval_datetime": datetime.datetime.now().isoformat(),
     }
 
-    # Write a JSON file
+    # Write a JSON file called "results.json" that will contain the
+    # evaluation results
     with open(repo_local_path / "results.json", "w") as outfile:
         json.dump(evaluate_data, outfile)
 
@@ -687,7 +686,6 @@ def push_to_hub(repo_id, model, env, video_fps=1, local_repo_path="hub"):
             readme = f.read()
     else:
         readme = model_card
-    print(readme)
 
     with readme_path.open("w", encoding="utf-8") as f:
         f.write(readme)

From 1dc62782bef1d9e3f96bf9cebdb4b4f67c46898d Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Mon, 12 Dec 2022 14:07:04 +0100
Subject: [PATCH 25/49] Update Unit 2 notebook

---
 notebooks/unit2/unit2.ipynb | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/notebooks/unit2/unit2.ipynb b/notebooks/unit2/unit2.ipynb
index 81f3652..a4f9b74 100644
--- a/notebooks/unit2/unit2.ipynb
+++ b/notebooks/unit2/unit2.ipynb
@@ -71,7 +71,7 @@
     {
       "cell_type": "markdown",
       "source": [
-        "## This notebook is from Deep Reinforcement Learning Course\n",
+        "## This notebook is from the Deep Reinforcement Learning Course\n",
         "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/deep-rl-course-illustration.jpg\" alt=\"Deep RL Course illustration\"/>"
       ],
       "metadata": {
@@ -190,7 +190,7 @@
       },
       "outputs": [],
       "source": [
-        "!pip install -r https://github.com/huggingface/deep-rl-class/tree/main/notebooks/unit2/requirements-unit2.txt"
+        "!pip install -r pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit2/requirements-unit2.txt"
       ]
     },
     {
@@ -996,7 +996,7 @@
       },
       "outputs": [],
       "source": [
-        "from huggingface_hub import HfApi, HfFolder, Repository, snapshot_download\n",
+        "from huggingface_hub import HfApi, snapshot_download\n",
         "from huggingface_hub.repocard import metadata_eval_result, metadata_save\n",
         "\n",
         "from pathlib import Path\n",
@@ -1074,8 +1074,6 @@
         "        if env.spec.kwargs.get(\"is_slippery\", \"\") == False:\n",
         "            model[\"slippery\"] = False\n",
         "\n",
-        "    print(model)\n",
-        "\n",
         "    # Pickle the model\n",
         "    with open((repo_local_path) / \"q-learning.pkl\", \"wb\") as f:\n",
         "        pickle.dump(model, f)\n",
@@ -1092,7 +1090,8 @@
         "        \"eval_datetime\": datetime.datetime.now().isoformat()\n",
         "    }\n",
         "\n",
-        "    # Write a JSON file\n",
+        "    # Write a JSON file called \"results.json\" that will contain the\n",
+        "    # evaluation results\n",
         "    with open(repo_local_path / \"results.json\", \"w\") as outfile:\n",
         "        json.dump(evaluate_data, outfile)\n",
         "\n",
@@ -1139,7 +1138,6 @@
         "\n",
         "    evaluate_agent(env, model[\"max_steps\"], model[\"n_eval_episodes\"], model[\"qtable\"], model[\"eval_seed\"])\n",
         "  \n",
-        "\n",
         "    readme_path = repo_local_path / \"README.md\"\n",
         "    readme = \"\"\n",
         "    print(readme_path.exists())\n",
@@ -1148,7 +1146,6 @@
         "            readme = f.read()\n",
         "    else:\n",
         "        readme = model_card\n",
-        "    print(readme)\n",
         "\n",
         "    with readme_path.open(\"w\", encoding=\"utf-8\") as f:\n",
         "        f.write(readme)\n",

From 7db3e276ded6e26655f7106b9f71cc877e9fcdab Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Mon, 12 Dec 2022 14:12:06 +0100
Subject: [PATCH 26/49] Update README.md

Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index b2e6427..afca08c 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# [The Hugging Face Deep Reinforcement Learning Course 🤗 (v2.0)](ttps://huggingface.co/deep-rl-course/unit0/introduction)
+# [The Hugging Face Deep Reinforcement Learning Course 🤗 (v2.0)](https://huggingface.co/deep-rl-course/unit0/introduction)
 
 This repository contains the Deep Reinforcement Learning Course mdx files and notebooks. The website is here: https://huggingface.co/deep-rl-course/unit0/introduction?fw=pt
 

From cd23f6a7258a2f4c474ea2d6813d66cc00a84f15 Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Mon, 12 Dec 2022 20:26:53 +0100
Subject: [PATCH 27/49] Update Bellman Latex equation quiz

---
 units/en/unit2/mid-way-quiz.mdx | 1 +
 1 file changed, 1 insertion(+)

diff --git a/units/en/unit2/mid-way-quiz.mdx b/units/en/unit2/mid-way-quiz.mdx
index b1ffe3a..ded2617 100644
--- a/units/en/unit2/mid-way-quiz.mdx
+++ b/units/en/unit2/mid-way-quiz.mdx
@@ -38,6 +38,7 @@ The best way to learn and [to avoid the illusion of competence](https://www.cour
 **The Bellman equation is a recursive equation** that works like this: instead of starting for each state from the beginning and calculating the return, we can consider the value of any state as:
 
 \\(Rt+1 + (\gamma * V(St+1)))\\
+
 The immediate reward + the discounted value of the state that follows
 
 </details>

From c54bb4605e81d01b1b59c5dc3e437d83abb6ac2a Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Mon, 12 Dec 2022 20:38:45 +0100
Subject: [PATCH 28/49] Updated Bellman equation (latex not working)

---
 units/en/unit2/mid-way-quiz.mdx | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/units/en/unit2/mid-way-quiz.mdx b/units/en/unit2/mid-way-quiz.mdx
index ded2617..c00a726 100644
--- a/units/en/unit2/mid-way-quiz.mdx
+++ b/units/en/unit2/mid-way-quiz.mdx
@@ -37,7 +37,8 @@ The best way to learn and [to avoid the illusion of competence](https://www.cour
 
 **The Bellman equation is a recursive equation** that works like this: instead of starting for each state from the beginning and calculating the return, we can consider the value of any state as:
 
-\\(Rt+1 + (\gamma * V(St+1)))\\
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman4.jpg" alt="Bellman equation quiz"/>
+
 
 The immediate reward + the discounted value of the state that follows
 

From 746b3e0a2d03593610f44215d217fa9e994020b2 Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Mon, 12 Dec 2022 20:44:55 +0100
Subject: [PATCH 29/49] Update Bellman Latex equation quiz

---
 units/en/unit2/mid-way-quiz.mdx | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/units/en/unit2/mid-way-quiz.mdx b/units/en/unit2/mid-way-quiz.mdx
index c00a726..86584bf 100644
--- a/units/en/unit2/mid-way-quiz.mdx
+++ b/units/en/unit2/mid-way-quiz.mdx
@@ -37,9 +37,7 @@ The best way to learn and [to avoid the illusion of competence](https://www.cour
 
 **The Bellman equation is a recursive equation** that works like this: instead of starting for each state from the beginning and calculating the return, we can consider the value of any state as:
 
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman4.jpg" alt="Bellman equation quiz"/>
-
-
+Rt+1 + gamma * V(St+1)
 The immediate reward + the discounted value of the state that follows
 
 </details>

From 35088e1f598325300a025b5d90c35b3f5fb4b621 Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Mon, 12 Dec 2022 20:53:14 +0100
Subject: [PATCH 30/49] Update Bellman Latex equation quiz

---
 units/en/unit2/mid-way-quiz.mdx | 1 +
 1 file changed, 1 insertion(+)

diff --git a/units/en/unit2/mid-way-quiz.mdx b/units/en/unit2/mid-way-quiz.mdx
index 86584bf..abb4b8b 100644
--- a/units/en/unit2/mid-way-quiz.mdx
+++ b/units/en/unit2/mid-way-quiz.mdx
@@ -38,6 +38,7 @@ The best way to learn and [to avoid the illusion of competence](https://www.cour
 **The Bellman equation is a recursive equation** that works like this: instead of starting for each state from the beginning and calculating the return, we can consider the value of any state as:
 
 Rt+1 + gamma * V(St+1)
+
 The immediate reward + the discounted value of the state that follows
 
 </details>

From 2ca9a92002483586475ca48b196db4d0a79671b7 Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Tue, 13 Dec 2022 10:59:34 +0100
Subject: [PATCH 31/49] Update hands-on.mdx

---
 units/en/unit2/hands-on.mdx | 1 +
 1 file changed, 1 insertion(+)

diff --git a/units/en/unit2/hands-on.mdx b/units/en/unit2/hands-on.mdx
index 58a2a57..4b30bb0 100644
--- a/units/en/unit2/hands-on.mdx
+++ b/units/en/unit2/hands-on.mdx
@@ -21,6 +21,7 @@ Thanks to a [leaderboard](https://huggingface.co/spaces/huggingface-projects/Dee
 
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit2/unit2.ipynb)
 
+
 # Unit 2: Q-Learning with FrozenLake-v1 ⛄ and Taxi-v3 🚕
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/thumbnail.jpg" alt="Unit 2 Thumbnail">

From 95b869b3e6edb133dfca5a13248483674a332eab Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Tue, 13 Dec 2022 11:07:58 +0100
Subject: [PATCH 32/49] Update hands-on.mdx

---
 units/en/unit2/hands-on.mdx | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/units/en/unit2/hands-on.mdx b/units/en/unit2/hands-on.mdx
index 4b30bb0..08c63d7 100644
--- a/units/en/unit2/hands-on.mdx
+++ b/units/en/unit2/hands-on.mdx
@@ -1,10 +1,10 @@
 # Hands-on [[hands-on]]
 
-<CourseFloatingBanner classNames="absolute z-10 right-0 top-0"
-notebooks={[
-  {label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit2/unit2.ipynb"}
-  ]}
-askForHelpUrl="http://hf.co/join/discord" />
+      <CourseFloatingBanner classNames="absolute z-10 right-0 top-0"
+      notebooks={[
+        {label: "Google Colab", value: "https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit2/unit2.ipynb"}
+        ]}
+        askForHelpUrl="http://hf.co/join/discord" />
 
 
 

From 7e77ee2215af4982bddde98c00bcad7d354e73ed Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Tue, 13 Dec 2022 15:38:36 +0100
Subject: [PATCH 33/49] Update pip install

---
 notebooks/unit2/unit2.ipynb | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/notebooks/unit2/unit2.ipynb b/notebooks/unit2/unit2.ipynb
index a4f9b74..ea78857 100644
--- a/notebooks/unit2/unit2.ipynb
+++ b/notebooks/unit2/unit2.ipynb
@@ -1,5 +1,15 @@
 {
   "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "view-in-github",
+        "colab_type": "text"
+      },
+      "source": [
+        "<a href=\"https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/notebooks/unit2/unit2.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {
@@ -190,7 +200,7 @@
       },
       "outputs": [],
       "source": [
-        "!pip install -r pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit2/requirements-unit2.txt"
+        "!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit2/requirements-unit2.txt"
       ]
     },
     {
@@ -1724,7 +1734,8 @@
         "Ji_UrI5l2zzn",
         "67OdoKL63eDD",
         "B2_-8b8z5k54"
-      ]
+      ],
+      "include_colab_link": true
     },
     "gpuClass": "standard",
     "kernelspec": {

From 3080ad3fc10b550cd74fdd9f93113252938931d5 Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Tue, 13 Dec 2022 16:02:56 +0100
Subject: [PATCH 34/49] Update hands-on.mdx

---
 units/en/unit1/hands-on.mdx | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/units/en/unit1/hands-on.mdx b/units/en/unit1/hands-on.mdx
index 078b3ec..2c65154 100644
--- a/units/en/unit1/hands-on.mdx
+++ b/units/en/unit1/hands-on.mdx
@@ -1,4 +1,5 @@
-# Hands on [[hands-on]]
+# Train your first Deep Reinforcement Learning Agent 🤖 [[hands-on]]
+
 
 
 

From f653685a953b9df8a4e03d3bd2f39a04e5df25ad Mon Sep 17 00:00:00 2001
From: andraxin <github@andraxin.se>
Date: Tue, 13 Dec 2022 20:32:22 +0100
Subject: [PATCH 35/49] Update how-huggy-works.mdx

Surely, we're not throwing stick *at* poor Huggy...
---
 units/en/unitbonus1/how-huggy-works.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/units/en/unitbonus1/how-huggy-works.mdx b/units/en/unitbonus1/how-huggy-works.mdx
index 53d4d95..b00310f 100644
--- a/units/en/unitbonus1/how-huggy-works.mdx
+++ b/units/en/unitbonus1/how-huggy-works.mdx
@@ -5,7 +5,7 @@ This environment was created using the [Unity game engine](https://unity.com/) a
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/huggy.jpg" alt="Huggy" width="100%">
 
-In this environment we aim to train Huggy to **fetch the stick we throw at him. It means he needs to move correctly toward the stick**.
+In this environment we aim to train Huggy to **fetch the stick we throw. It means he needs to move correctly toward the stick**.
 
 ## The State Space, what Huggy perceives. [[state-space]]
 Huggy doesn't "see" his environment. Instead, we provide him information about the environment:

From d4657698bec1acdcc64c8ec997f960216d904813 Mon Sep 17 00:00:00 2001
From: Johannes 'fish' Ziemke <github@5pi.de>
Date: Wed, 14 Dec 2022 17:22:42 +0100
Subject: [PATCH 36/49] unit2: Use tqdm in train template

---
 notebooks/unit2/unit2.ipynb | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/notebooks/unit2/unit2.ipynb b/notebooks/unit2/unit2.ipynb
index ea78857..a10b711 100644
--- a/notebooks/unit2/unit2.ipynb
+++ b/notebooks/unit2/unit2.ipynb
@@ -771,7 +771,7 @@
       "outputs": [],
       "source": [
         "def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):\n",
-        "  for episode in range(n_training_episodes):\n",
+        "  for episode in tqdm(range(n_training_episodes))\n",
         "    # Reduce epsilon (because we need less and less exploration)\n",
         "    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)\n",
         "    # Reset the environment\n",
@@ -1748,4 +1748,4 @@
   },
   "nbformat": 4,
   "nbformat_minor": 0
-}
\ No newline at end of file
+}

From 626f69c7c7e2a42250e5ec4d9ffb4e24b1b1b903 Mon Sep 17 00:00:00 2001
From: Johannes 'fish' Ziemke <github@5pi.de>
Date: Wed, 14 Dec 2022 17:26:36 +0100
Subject: [PATCH 37/49] Fix typo

---
 notebooks/unit2/unit2.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/notebooks/unit2/unit2.ipynb b/notebooks/unit2/unit2.ipynb
index a10b711..ea808f6 100644
--- a/notebooks/unit2/unit2.ipynb
+++ b/notebooks/unit2/unit2.ipynb
@@ -771,7 +771,7 @@
       "outputs": [],
       "source": [
         "def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):\n",
-        "  for episode in tqdm(range(n_training_episodes))\n",
+        "  for episode in tqdm(range(n_training_episodes)):\n",
         "    # Reduce epsilon (because we need less and less exploration)\n",
         "    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)\n",
         "    # Reset the environment\n",

From 3f6516d4493cb3bb72c4d78c4a0f243d984209a4 Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Thu, 15 Dec 2022 08:23:57 +0100
Subject: [PATCH 38/49] Remove GPU

---
 notebooks/unit2/unit2.ipynb | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/notebooks/unit2/unit2.ipynb b/notebooks/unit2/unit2.ipynb
index ea78857..1a075ae 100644
--- a/notebooks/unit2/unit2.ipynb
+++ b/notebooks/unit2/unit2.ipynb
@@ -200,7 +200,7 @@
       },
       "outputs": [],
       "source": [
-        "!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit2/requirements-unit2.txt"
+        "!pip install -r pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit2/requirements-unit2.txt"
       ]
     },
     {
@@ -1726,7 +1726,6 @@
     }
   ],
   "metadata": {
-    "accelerator": "GPU",
     "colab": {
       "private_outputs": true,
       "provenance": [],

From 72da15ef782dd3432dae6deae3a99cdff5c8be8a Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Thu, 15 Dec 2022 08:40:28 +0100
Subject: [PATCH 39/49] Add tqdm to requirements unit2

---
 notebooks/unit2/requirements-unit2.txt | 1 +
 1 file changed, 1 insertion(+)

diff --git a/notebooks/unit2/requirements-unit2.txt b/notebooks/unit2/requirements-unit2.txt
index 733afc8..995b24b 100644
--- a/notebooks/unit2/requirements-unit2.txt
+++ b/notebooks/unit2/requirements-unit2.txt
@@ -8,3 +8,4 @@ pyyaml==6.0
 imageio
 imageio_ffmpeg
 pyglet==1.5.1
+tqdm

From c92f5be6cf711bdf32d0eb4e76b24dced6307bff Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Thu, 15 Dec 2022 08:45:30 +0100
Subject: [PATCH 40/49] Add tqdm to import

---
 notebooks/unit2/unit2.ipynb | 1 +
 1 file changed, 1 insertion(+)

diff --git a/notebooks/unit2/unit2.ipynb b/notebooks/unit2/unit2.ipynb
index ea808f6..7f3c3da 100644
--- a/notebooks/unit2/unit2.ipynb
+++ b/notebooks/unit2/unit2.ipynb
@@ -280,6 +280,7 @@
         "import random\n",
         "import imageio\n",
         "import os\n",
+        "import tqdm\n",
         "\n",
         "import pickle5 as pickle\n",
         "from tqdm.notebook import tqdm"

From 13fea298ccdbb26f49e3b5ba1907ccb31d576bdf Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Thu, 15 Dec 2022 09:52:23 +0100
Subject: [PATCH 41/49] Update keep_checkpoints

---
 bonus-unit1/bonus-unit1.ipynb | 572 ++++++++++++++++++++++++++++++++++
 1 file changed, 572 insertions(+)
 create mode 100644 bonus-unit1/bonus-unit1.ipynb

diff --git a/bonus-unit1/bonus-unit1.ipynb b/bonus-unit1/bonus-unit1.ipynb
new file mode 100644
index 0000000..7d1c5c7
--- /dev/null
+++ b/bonus-unit1/bonus-unit1.ipynb
@@ -0,0 +1,572 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "view-in-github",
+        "colab_type": "text"
+      },
+      "source": [
+        "<a href=\"https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/bonus-unit1/bonus-unit1.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "2D3NL_e4crQv"
+      },
+      "source": [
+        "# Bonus Unit 1: Let's train Huggy the Dog 🐶 to fetch a stick"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit2/thumbnail.png\" alt=\"Bonus Unit 1Thumbnail\">\n",
+        "\n",
+        "In this notebook, we'll reinforce what we learned in the first Unit by **teaching Huggy the Dog to fetch the stick and then play with it directly in your browser**\n",
+        "\n",
+        "⬇️ Here is an example of what **you will achieve at the end of the unit.** ⬇️ (launch ▶ to see)"
+      ],
+      "metadata": {
+        "id": "FMYrDriDujzX"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "%%html\n",
+        "<video controls autoplay><source src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/huggy.mp4\" type=\"video/mp4\"></video>"
+      ],
+      "metadata": {
+        "id": "PnVhs1yYNyUF"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "### The environment 🎮\n",
+        "\n",
+        "- Huggy the Dog, an environment created by [Thomas Simonini](https://twitter.com/ThomasSimonini) based on [Puppo The Corgi](https://blog.unity.com/technology/puppo-the-corgi-cuteness-overload-with-the-unity-ml-agents-toolkit)\n",
+        "\n",
+        "### The library used 📚\n",
+        "\n",
+        "- [MLAgents (Hugging Face version)](https://github.com/huggingface/ml-agents)"
+      ],
+      "metadata": {
+        "id": "x7oR6R-ZIbeS"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the Github Repo](https://github.com/huggingface/deep-rl-class/issues)."
+      ],
+      "metadata": {
+        "id": "60yACvZwO0Cy"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Objectives of this notebook 🏆\n",
+        "\n",
+        "At the end of the notebook, you will:\n",
+        "\n",
+        "- Understand **the state space, action space and reward function used to train Huggy**.\n",
+        "- **Train your own Huggy** to fetch the stick.\n",
+        "- Be able to play **with your trained Huggy directly in your browser**.\n",
+        "\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "Oks-ETYdO2Dc"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## This notebook is from Deep Reinforcement Learning Course\n",
+        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/deep-rl-course-illustration.jpg\" alt=\"Deep RL Course illustration\"/>"
+      ],
+      "metadata": {
+        "id": "mUlVrqnBv2o1"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "In this free course, you will:\n",
+        "\n",
+        "- 📖 Study Deep Reinforcement Learning in **theory and practice**.\n",
+        "- 🧑‍💻 Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.\n",
+        "- 🤖 Train **agents in unique environments** \n",
+        "\n",
+        "And more! Check 📚 the syllabus 👉 https://simoninithomas.github.io/deep-rl-course\n",
+        "\n",
+        "Don’t forget to **<a href=\"http://eepurl.com/ic5ZUD\">sign up to the course</a>** (we are collecting your email to be able to **send you the links when each Unit is published and give you information about the challenges and updates).**\n",
+        "\n",
+        "\n",
+        "The best way to keep in touch is to join our Discord server to exchange with the community and with us 👉🏻 https://discord.gg/ydHrjt3WP5"
+      ],
+      "metadata": {
+        "id": "pAMjaQpHwB_s"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Prerequisites 🏗️\n",
+        "\n",
+        "Before diving into the notebook, you need to:\n",
+        "\n",
+        "🔲 📚 **Develop an understanding of the foundations of Reinforcement learning** (MC, TD, Rewards hypothesis...) by doing Unit 1\n",
+        "\n",
+        "🔲 📚 **Read the introduction to Huggy** by doing Bonus Unit 1"
+      ],
+      "metadata": {
+        "id": "6r7Hl0uywFSO"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Set the GPU 💪\n",
+        "- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`\n",
+        "\n",
+        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step1.jpg\" alt=\"GPU Step 1\">"
+      ],
+      "metadata": {
+        "id": "DssdIjk_8vZE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "- `Hardware Accelerator > GPU`\n",
+        "\n",
+        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step2.jpg\" alt=\"GPU Step 2\">"
+      ],
+      "metadata": {
+        "id": "sTfCXHy68xBv"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "an3ByrXYQ4iK"
+      },
+      "source": [
+        "## Clone the repository and install the dependencies 🔽\n",
+        "\n",
+        "- We need to clone the repository, which **contains the experimental version of the library that allows you to push your trained agent to the Hub.**"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "6WNoL04M7rTa"
+      },
+      "outputs": [],
+      "source": [
+        "%%capture\n",
+        "# Clone this specific repository (can take 3min)\n",
+        "!git clone https://github.com/huggingface/ml-agents/"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "d8wmVcMk7xKo"
+      },
+      "outputs": [],
+      "source": [
+        "%%capture\n",
+        "# Go inside the repository and install the package (can take 3min)\n",
+        "%cd ml-agents\n",
+        "!pip3 install -e ./ml-agents-envs\n",
+        "!pip3 install -e ./ml-agents"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "HRY5ufKUKfhI"
+      },
+      "source": [
+        "## Download and move the environment zip file in `./trained-envs-executables/linux/`\n",
+        "\n",
+        "- Our environment executable is in a zip file.\n",
+        "- We need to download it and place it in `./trained-envs-executables/linux/`"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "C9Ls6_6eOKiA"
+      },
+      "outputs": [],
+      "source": [
+        "!mkdir ./trained-envs-executables\n",
+        "!mkdir ./trained-envs-executables/linux"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "!wget --load-cookies /tmp/cookies.txt \"https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1zv3M95ZJTWHUVOWT6ckq_cm98nft8gdF' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\\1\\n/p')&id=1zv3M95ZJTWHUVOWT6ckq_cm98nft8gdF\" -O ./trained-envs-executables/linux/Huggy.zip && rm -rf /tmp/cookies.txt"
+      ],
+      "metadata": {
+        "id": "EB-G-80GsxYN"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "jsoZGxr1MIXY"
+      },
+      "source": [
+        "Download the file Huggy.zip from https://drive.google.com/uc?export=download&id=1zv3M95ZJTWHUVOWT6ckq_cm98nft8gdF using `wget`. Check out the full solution to download large files from GDrive [here](https://bcrf.biochem.wisc.edu/2021/02/05/download-google-drive-files-using-wget/)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "8FPx0an9IAwO"
+      },
+      "outputs": [],
+      "source": [
+        "%%capture\n",
+        "!unzip -d ./trained-envs-executables/linux/ ./trained-envs-executables/linux/Huggy.zip"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "nyumV5XfPKzu"
+      },
+      "source": [
+        "Make sure your file is accessible "
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "EdFsLJ11JvQf"
+      },
+      "outputs": [],
+      "source": [
+        "!chmod -R 755 ./trained-envs-executables/linux/Huggy"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Let's recap how this environment works\n",
+        "\n",
+        "### The State Space: what Huggy \"perceives.\"\n",
+        "\n",
+        "Huggy doesn't \"see\" his environment. Instead, we provide him information about the environment:\n",
+        "\n",
+        "- The target (stick) position\n",
+        "- The relative position between himself and the target\n",
+        "- The orientation of his legs.\n",
+        "\n",
+        "Given all this information, Huggy **can decide which action to take next to fulfill his goal**.\n",
+        "\n",
+        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/huggy.jpg\" alt=\"Huggy\" width=\"100%\">\n",
+        "\n",
+        "\n",
+        "### The Action Space: what moves Huggy can do\n",
+        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/huggy-action.jpg\" alt=\"Huggy action\" width=\"100%\">\n",
+        "\n",
+        "**Joint motors drive Huggy's legs**. It means that to get to the target, Huggy needs to **learn to rotate the joint motors of each of his legs correctly so he can move**.\n",
+        "\n",
+        "### The Reward Function\n",
+        "\n",
+        "The reward function is designed so that **Huggy will fulfill his goal** : fetch the stick.\n",
+        "\n",
+        "Remember that one of the foundations of Reinforcement Learning is the *reward hypothesis*: a goal can be described as the **maximization of the expected cumulative reward**.\n",
+        "\n",
+        "Here, our goal is that Huggy **goes towards the stick but without spinning too much**. Hence, our reward function must translate this goal.\n",
+        "\n",
+        "Our reward function:\n",
+        "\n",
+        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/reward.jpg\" alt=\"Huggy reward function\" width=\"100%\">\n",
+        "\n",
+        "- *Orientation bonus*: we **reward him for getting close to the target**.\n",
+        "- *Time penalty*: a fixed-time penalty given at every action to **force him to get to the stick as fast as possible**.\n",
+        "- *Rotation penalty*: we penalize Huggy if **he spins too much and turns too quickly**.\n",
+        "- *Getting to the target reward*: we reward Huggy for **reaching the target**."
+      ],
+      "metadata": {
+        "id": "dYKVj8yUvj55"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Check the Huggy config file\n",
+        "\n",
+        "- In ML-Agents, you define the **training hyperparameters in config.yaml files.**\n",
+        "\n",
+        "- For the scope of this notebook, we're not going to modify the hyperparameters. If you want to experiment with them, Unity provides very [good documentation explaining each of them here](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Training-Configuration-File.md)."
+      ],
+      "metadata": {
+        "id": "NAuEq32Mwvtz"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "- **If you want to modify the hyperparameters**, in a Google Colab notebook you can click here to open the config.yaml: `/content/ml-agents/config/ppo/Huggy.yaml`\n",
+        "\n",
+        "- For instance, **if you want to save more models during the training** (for now, we save every 200,000 training timesteps), you need to modify:\n",
+        "  - `checkpoint_interval`: The number of training timesteps collected between each checkpoint.\n",
+        "  - `keep_checkpoints`: The maximum number of model checkpoints to keep. \n",
+        "\n",
+        "=> Just keep in mind that **decreasing the `checkpoint_interval` means more models to upload to the Hub and so a longer uploading time** \n",
+        "We’re now ready to train our agent 🔥."
+      ],
+      "metadata": {
+        "id": "r9wv5NYGw-05"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "f9fI555bO12v"
+      },
+      "source": [
+        "## Train our agent\n",
+        "\n",
+        "To train our agent, we just need to **launch mlagents-learn and select the executable containing the environment.**\n",
+        "\n",
+        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/mllearn.png\" alt=\"ml learn function\" width=\"100%\">\n",
+        "\n",
+        "With ML Agents, we run a training script. We define four parameters:\n",
+        "\n",
+        "1. `mlagents-learn <config>`: the path where the hyperparameter config file is.\n",
+        "2. `--env`: where the environment executable is.\n",
+        "3. `--run-id`: the name you want to give to your training run.\n",
+        "4. `--no-graphics`: to not launch the visualization during the training.\n",
+        "\n",
+        "Train the model and use the `--resume` flag to continue training in case of interruption. \n",
+        "\n",
+        "> It will fail the first time you use `--resume`; try running the block again to bypass the error.\n",
+        "\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "The training will take 30 to 45 minutes depending on your machine (don't forget to **set up a GPU**). Go take a ☕️, you deserve it 🤗."
+      ],
+      "metadata": {
+        "id": "lN32oWF8zPjs"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "bS-Yh1UdHfzy"
+      },
+      "outputs": [],
+      "source": [
+        "!mlagents-learn ./config/ppo/Huggy.yaml --env=./trained-envs-executables/linux/Huggy/Huggy --run-id=\"Huggy\" --no-graphics"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "5Vue94AzPy1t"
+      },
+      "source": [
+        "## Push the agent to the 🤗 Hub\n",
+        "\n",
+        "- Now that we've trained our agent, we’re **ready to push it to the Hub to be able to play with Huggy in your browser🔥.**"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "To be able to share your model with the community there are three more steps to follow:\n",
+        "\n",
+        "1️⃣ (If it's not already done) create an account on HF ➡ https://huggingface.co/join\n",
+        "\n",
+        "2️⃣ Sign in, then store your authentication token from the Hugging Face website.\n",
+        "- Create a new token (https://huggingface.co/settings/tokens) **with write role**\n",
+        "\n",
+        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg\" alt=\"Create HF Token\">\n",
+        "\n",
+        "- Copy the token \n",
+        "- Run the cell below and paste the token"
+      ],
+      "metadata": {
+        "id": "izT6FpgNzZ6R"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "rKt2vsYoK56o"
+      },
+      "outputs": [],
+      "source": [
+        "from huggingface_hub import notebook_login\n",
+        "notebook_login()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`"
+      ],
+      "metadata": {
+        "id": "ew59mK19zjtN"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "Xi0y_VASRzJU"
+      },
+      "source": [
+        "Then, we simply need to run `mlagents-push-to-hf`.\n",
+        "\n",
+        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/mlpush.png\" alt=\"ml learn function\" width=\"100%\">"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "And we define 4 parameters:\n",
+        "\n",
+        "1. `--run-id`: the name of the training run id.\n",
+        "2. `--local-dir`: where the agent was saved, it’s results/<run_id name>, so in my case results/Huggy.\n",
+        "3. `--repo-id`: the name of the Hugging Face repo you want to create or update. It’s always <your huggingface username>/<the repo name>\n",
+        "If the repo does not exist **it will be created automatically**\n",
+        "4. `--commit-message`: since HF repos are git repositories, you need to define a commit message."
+      ],
+      "metadata": {
+        "id": "KK4fPfnczunT"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "dGEFAIboLVc6"
+      },
+      "outputs": [],
+      "source": [
+        "!mlagents-push-to-hf --run-id=\"HuggyTraining\" --local-dir=\"./results/Huggy\" --repo-id=\"ThomasSimonini/ppo-Huggy\" --commit-message=\"Huggy\""
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "If everything worked, you should see this at the end of the process (but with a different URL 😆):\n",
+        "\n",
+        "\n",
+        "\n",
+        "```\n",
+        "Your model is pushed to the hub. You can view your model here: https://huggingface.co/ThomasSimonini/ppo-Huggy\n",
+        "```\n",
+        "\n",
+        "It’s the link to your model repository. The repository contains a model card that explains how to use the model, your Tensorboard logs and your config file. **What’s awesome is that it’s a git repository, which means you can have different commits, update your repository with a new push, open Pull Requests, etc.**\n",
+        "\n",
+        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/modelcard.png\" alt=\"ml learn function\" width=\"100%\">"
+      ],
+      "metadata": {
+        "id": "yborB0850FTM"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "But now comes the best: **being able to play with Huggy online 👀.**"
+      ],
+      "metadata": {
+        "id": "5Uaon2cg0NrL"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Play with your Huggy 🐕\n",
+        "\n",
+        "This step is the simplest:\n",
+        "\n",
+        "- Open the game Huggy in your browser: https://huggingface.co/spaces/ThomasSimonini/Huggy\n",
+        "\n",
+        "- Click on Play with my Huggy model\n",
+        "\n",
+        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/load-huggy.jpg\" alt=\"load-huggy\" width=\"100%\">"
+      ],
+      "metadata": {
+        "id": "VMc4oOsE0QiZ"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "1. In step 1, choose your model repository which is the model id (in my case ThomasSimonini/ppo-Huggy).\n",
+        "\n",
+        "2. In step 2, **choose what model you want to replay**:\n",
+        "  - I have multiple ones, since we saved a model every 500000 timesteps. \n",
+        "  - But since I want the most recent one, I choose `Huggy.onnx`\n",
+        "\n",
+        "👉 What’s nice **is to try different model steps to see the improvement of the agent.**"
+      ],
+      "metadata": {
+        "id": "Djs8c5rR0Z8a"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Congrats on finishing this bonus unit!\n",
+        "\n",
+        "You can now sit and enjoy playing with your Huggy 🐶. And don't **forget to spread the love by sharing Huggy with your friends 🤗**. And if you share about it on social media, **please tag us @huggingface and me @simoninithomas**\n",
+        "\n",
+        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/huggy-cover.jpeg\" alt=\"Huggy cover\" width=\"100%\">\n",
+        "\n",
+        "\n",
+        "## Keep Learning, Stay awesome 🤗"
+      ],
+      "metadata": {
+        "id": "PI6dPWmh064H"
+      }
+    }
+  ],
+  "metadata": {
+    "accelerator": "GPU",
+    "colab": {
+      "provenance": [],
+      "private_outputs": true,
+      "include_colab_link": true
+    },
+    "gpuClass": "standard",
+    "kernelspec": {
+      "display_name": "Python 3",
+      "name": "python3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
\ No newline at end of file

From 05fe8dd33fdb9363639a770e498b18210399e8cd Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Thu, 15 Dec 2022 09:52:54 +0100
Subject: [PATCH 42/49] Delete bonus-unit1.ipynb

---
 bonus-unit1/bonus-unit1.ipynb | 572 ----------------------------------
 1 file changed, 572 deletions(-)
 delete mode 100644 bonus-unit1/bonus-unit1.ipynb

diff --git a/bonus-unit1/bonus-unit1.ipynb b/bonus-unit1/bonus-unit1.ipynb
deleted file mode 100644
index 7d1c5c7..0000000
--- a/bonus-unit1/bonus-unit1.ipynb
+++ /dev/null
@@ -1,572 +0,0 @@
-{
-  "cells": [
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "view-in-github",
-        "colab_type": "text"
-      },
-      "source": [
-        "<a href=\"https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/bonus-unit1/bonus-unit1.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "2D3NL_e4crQv"
-      },
-      "source": [
-        "# Bonus Unit 1: Let's train Huggy the Dog 🐶 to fetch a stick"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit2/thumbnail.png\" alt=\"Bonus Unit 1Thumbnail\">\n",
-        "\n",
-        "In this notebook, we'll reinforce what we learned in the first Unit by **teaching Huggy the Dog to fetch the stick and then play with it directly in your browser**\n",
-        "\n",
-        "⬇️ Here is an example of what **you will achieve at the end of the unit.** ⬇️ (launch ▶ to see)"
-      ],
-      "metadata": {
-        "id": "FMYrDriDujzX"
-      }
-    },
-    {
-      "cell_type": "code",
-      "source": [
-        "%%html\n",
-        "<video controls autoplay><source src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/huggy.mp4\" type=\"video/mp4\"></video>"
-      ],
-      "metadata": {
-        "id": "PnVhs1yYNyUF"
-      },
-      "execution_count": null,
-      "outputs": []
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "### The environment 🎮\n",
-        "\n",
-        "- Huggy the Dog, an environment created by [Thomas Simonini](https://twitter.com/ThomasSimonini) based on [Puppo The Corgi](https://blog.unity.com/technology/puppo-the-corgi-cuteness-overload-with-the-unity-ml-agents-toolkit)\n",
-        "\n",
-        "### The library used 📚\n",
-        "\n",
-        "- [MLAgents (Hugging Face version)](https://github.com/huggingface/ml-agents)"
-      ],
-      "metadata": {
-        "id": "x7oR6R-ZIbeS"
-      }
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the Github Repo](https://github.com/huggingface/deep-rl-class/issues)."
-      ],
-      "metadata": {
-        "id": "60yACvZwO0Cy"
-      }
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "## Objectives of this notebook 🏆\n",
-        "\n",
-        "At the end of the notebook, you will:\n",
-        "\n",
-        "- Understand **the state space, action space and reward function used to train Huggy**.\n",
-        "- **Train your own Huggy** to fetch the stick.\n",
-        "- Be able to play **with your trained Huggy directly in your browser**.\n",
-        "\n",
-        "\n"
-      ],
-      "metadata": {
-        "id": "Oks-ETYdO2Dc"
-      }
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "## This notebook is from Deep Reinforcement Learning Course\n",
-        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/deep-rl-course-illustration.jpg\" alt=\"Deep RL Course illustration\"/>"
-      ],
-      "metadata": {
-        "id": "mUlVrqnBv2o1"
-      }
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "In this free course, you will:\n",
-        "\n",
-        "- 📖 Study Deep Reinforcement Learning in **theory and practice**.\n",
-        "- 🧑‍💻 Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.\n",
-        "- 🤖 Train **agents in unique environments** \n",
-        "\n",
-        "And more check 📚 the syllabus 👉 https://simoninithomas.github.io/deep-rl-course\n",
-        "\n",
-        "Don’t forget to **<a href=\"http://eepurl.com/ic5ZUD\">sign up to the course</a>** (we are collecting your email to be able to **send you the links when each Unit is published and give you information about the challenges and updates).**\n",
-        "\n",
-        "\n",
-        "The best way to keep in touch is to join our discord server to exchange with the community and with us 👉🏻 https://discord.gg/ydHrjt3WP5"
-      ],
-      "metadata": {
-        "id": "pAMjaQpHwB_s"
-      }
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "## Prerequisites 🏗️\n",
-        "\n",
-        "Before diving into the notebook, you need to:\n",
-        "\n",
-        "🔲 📚 **Develop an understanding of the foundations of Reinforcement learning** (MC, TD, Rewards hypothesis...) by doing Unit 1\n",
-        "\n",
-        "🔲 📚 **Read the introduction to Huggy** by doing Bonus Unit 1"
-      ],
-      "metadata": {
-        "id": "6r7Hl0uywFSO"
-      }
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "## Set the GPU 💪\n",
-        "- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`\n",
-        "\n",
-        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step1.jpg\" alt=\"GPU Step 1\">"
-      ],
-      "metadata": {
-        "id": "DssdIjk_8vZE"
-      }
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "- `Hardware Accelerator > GPU`\n",
-        "\n",
-        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step2.jpg\" alt=\"GPU Step 2\">"
-      ],
-      "metadata": {
-        "id": "sTfCXHy68xBv"
-      }
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "an3ByrXYQ4iK"
-      },
-      "source": [
-        "## Clone the repository and install the dependencies 🔽\n",
-        "\n",
-        "- We need to clone the repository, that **contains the experimental version of the library that allows you to push your trained agent to the Hub.**"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {
-        "id": "6WNoL04M7rTa"
-      },
-      "outputs": [],
-      "source": [
-        "%%capture\n",
-        "# Clone this specific repository (can take 3min)\n",
-        "!git clone https://github.com/huggingface/ml-agents/"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {
-        "id": "d8wmVcMk7xKo"
-      },
-      "outputs": [],
-      "source": [
-        "%%capture\n",
-        "# Go inside the repository and install the package (can take 3min)\n",
-        "%cd ml-agents\n",
-        "!pip3 install -e ./ml-agents-envs\n",
-        "!pip3 install -e ./ml-agents"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "HRY5ufKUKfhI"
-      },
-      "source": [
-        "## Download and move the environment zip file in `./trained-envs-executables/linux/`\n",
-        "\n",
-        "- Our environment executable is in a zip file.\n",
-        "- We need to download it and place it to `./trained-envs-executables/linux/`"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {
-        "id": "C9Ls6_6eOKiA"
-      },
-      "outputs": [],
-      "source": [
-        "!mkdir ./trained-envs-executables\n",
-        "!mkdir ./trained-envs-executables/linux"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "source": [
-        "!wget --load-cookies /tmp/cookies.txt \"https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1zv3M95ZJTWHUVOWT6ckq_cm98nft8gdF' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\\1\\n/p')&id=1zv3M95ZJTWHUVOWT6ckq_cm98nft8gdF\" -O ./trained-envs-executables/linux/Huggy.zip && rm -rf /tmp/cookies.txt"
-      ],
-      "metadata": {
-        "id": "EB-G-80GsxYN"
-      },
-      "execution_count": null,
-      "outputs": []
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "jsoZGxr1MIXY"
-      },
-      "source": [
-        "Download the file Huggy.zip from https://drive.google.com/uc?export=download&id=1zv3M95ZJTWHUVOWT6ckq_cm98nft8gdF using `wget`. Check out the full solution to download large files from GDrive [here](https://bcrf.biochem.wisc.edu/2021/02/05/download-google-drive-files-using-wget/)"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {
-        "id": "8FPx0an9IAwO"
-      },
-      "outputs": [],
-      "source": [
-        "%%capture\n",
-        "!unzip -d ./trained-envs-executables/linux/ ./trained-envs-executables/linux/Huggy.zip"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "nyumV5XfPKzu"
-      },
-      "source": [
-        "Make sure your file is accessible "
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {
-        "id": "EdFsLJ11JvQf"
-      },
-      "outputs": [],
-      "source": [
-        "!chmod -R 755 ./trained-envs-executables/linux/Huggy"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "## Let's recap how this environment works\n",
-        "\n",
-        "### The State Space: what Huggy \"perceives.\"\n",
-        "\n",
-        "Huggy doesn't \"see\" his environment. Instead, we provide him information about the environment:\n",
-        "\n",
-        "- The target (stick) position\n",
-        "- The relative position between himself and the target\n",
-        "- The orientation of his legs.\n",
-        "\n",
-        "Given all this information, Huggy **can decide which action to take next to fulfill his goal**.\n",
-        "\n",
-        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/huggy.jpg\" alt=\"Huggy\" width=\"100%\">\n",
-        "\n",
-        "\n",
-        "### The Action Space: what moves Huggy can do\n",
-        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/huggy-action.jpg\" alt=\"Huggy action\" width=\"100%\">\n",
-        "\n",
-        "**Joint motors drive huggy legs**. It means that to get the target, Huggy needs to **learn to rotate the joint motors of each of his legs correctly so he can move**.\n",
-        "\n",
-        "### The Reward Function\n",
-        "\n",
-        "The reward function is designed so that **Huggy will fulfill his goal** : fetch the stick.\n",
-        "\n",
-        "Remember that one of the foundations of Reinforcement Learning is the *reward hypothesis*: a goal can be described as the **maximization of the expected cumulative reward**.\n",
-        "\n",
-        "Here, our goal is that Huggy **goes towards the stick but without spinning too much**. Hence, our reward function must translate this goal.\n",
-        "\n",
-        "Our reward function:\n",
-        "\n",
-        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/reward.jpg\" alt=\"Huggy reward function\" width=\"100%\">\n",
-        "\n",
-        "- *Orientation bonus*: we **reward him for getting close to the target**.\n",
-        "- *Time penalty*: a fixed-time penalty given at every action to **force him to get to the stick as fast as possible**.\n",
-        "- *Rotation penalty*: we penalize Huggy if **he spins too much and turns too quickly**.\n",
-        "- *Getting to the target reward*: we reward Huggy for **reaching the target**."
-      ],
-      "metadata": {
-        "id": "dYKVj8yUvj55"
-      }
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "## Check the Huggy config file\n",
-        "\n",
-        "- In ML-Agents, you define the **training hyperparameters into config.yaml files.**\n",
-        "\n",
-        "- For the scope of this notebook, we're not going to modify the hyperparameters, but if you want to try as an experiment, you should also try to modify some other hyperparameters, Unity provides very [good documentation explaining each of them here](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Training-Configuration-File.md)."
-      ],
-      "metadata": {
-        "id": "NAuEq32Mwvtz"
-      }
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "- **In the case you want to modify the hyperparameters**, in Google Colab notebook, you can click here to open the config.yaml: `/content/ml-agents/config/ppo/Huggy.yaml`\n",
-        "\n",
-        "- For instance **if you want to save more models during the training** (for now, we save every 200,000 training timesteps). You need to modify:\n",
-        "  - `checkpoint_interval`: The number of training timesteps collected between each checkpoint.\n",
-        "  - `keep_checkpoints`: The maximum number of model checkpoints to keep. \n",
-        "\n",
-        "=> Just keep in mind that **decreasing the `checkpoint_interval` means more models to upload to the Hub and so a longer uploading time** \n",
-        "We’re now ready to train our agent 🔥."
-      ],
-      "metadata": {
-        "id": "r9wv5NYGw-05"
-      }
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "f9fI555bO12v"
-      },
-      "source": [
-        "## Train our agent\n",
-        "\n",
-        "To train our agent, we just need to **launch mlagents-learn and select the executable containing the environment.**\n",
-        "\n",
-        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/mllearn.png\" alt=\"ml learn function\" width=\"100%\">\n",
-        "\n",
-        "With ML Agents, we run a training script. We define four parameters:\n",
-        "\n",
-        "1. `mlagents-learn <config>`: the path where the hyperparameter config file is.\n",
-        "2. `--env`: where the environment executable is.\n",
-        "3. `--run_id`: the name you want to give to your training run id.\n",
-        "4. `--no-graphics`: to not launch the visualization during the training.\n",
-        "\n",
-        "Train the model and use the `--resume` flag to continue training in case of interruption. \n",
-        "\n",
-        "> It will fail first time when you use `--resume`, try running the block again to bypass the error. \n",
-        "\n"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "The training will take 30 to 45min depending on your machine (don't forget to **set up a GPU**), go take a ☕️you deserve it 🤗."
-      ],
-      "metadata": {
-        "id": "lN32oWF8zPjs"
-      }
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {
-        "id": "bS-Yh1UdHfzy"
-      },
-      "outputs": [],
-      "source": [
-        "!mlagents-learn ./config/ppo/Huggy.yaml --env=./trained-envs-executables/linux/Huggy/Huggy --run-id=\"Huggy\" --no-graphics"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "5Vue94AzPy1t"
-      },
-      "source": [
-        "## Push the agent to the 🤗 Hub\n",
-        "\n",
-        "- Now that we trained our agent, we’re **ready to push it to the Hub to be able to play with Huggy on your browser🔥.**"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "To be able to share your model with the community there are three more steps to follow:\n",
-        "\n",
-        "1️⃣ (If it's not already done) create an account to HF ➡ https://huggingface.co/join\n",
-        "\n",
-        "2️⃣ Sign in and then, you need to store your authentication token from the Hugging Face website.\n",
-        "- Create a new token (https://huggingface.co/settings/tokens) **with write role**\n",
-        "\n",
-        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg\" alt=\"Create HF Token\">\n",
-        "\n",
-        "- Copy the token \n",
-        "- Run the cell below and paste the token"
-      ],
-      "metadata": {
-        "id": "izT6FpgNzZ6R"
-      }
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {
-        "id": "rKt2vsYoK56o"
-      },
-      "outputs": [],
-      "source": [
-        "from huggingface_hub import notebook_login\n",
-        "notebook_login()"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`"
-      ],
-      "metadata": {
-        "id": "ew59mK19zjtN"
-      }
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "Xi0y_VASRzJU"
-      },
-      "source": [
-        "Then, we simply need to run `mlagents-push-to-hf`.\n",
-        "\n",
-        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/mlpush.png\" alt=\"ml learn function\" width=\"100%\">"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "And we define 4 parameters:\n",
-        "\n",
-        "1. `--run-id`: the name of the training run id.\n",
-        "2. `--local-dir`: where the agent was saved, it’s results/<run_id name>, so in my case results/First Training.\n",
-        "3. `--repo-id`: the name of the Hugging Face repo you want to create or update. It’s always <your huggingface username>/<the repo name>\n",
-        "If the repo does not exist **it will be created automatically**\n",
-        "4. `--commit-message`: since HF repos are git repository you need to define a commit message."
-      ],
-      "metadata": {
-        "id": "KK4fPfnczunT"
-      }
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {
-        "id": "dGEFAIboLVc6"
-      },
-      "outputs": [],
-      "source": [
-        "!mlagents-push-to-hf --run-id=\"HuggyTraining\" --local-dir=\"./results/Huggy\" --repo-id=\"ThomasSimonini/ppo-Huggy\" --commit-message=\"Huggy\""
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "Else, if everything worked you should have this at the end of the process(but with a different url 😆) :\n",
-        "\n",
-        "\n",
-        "\n",
-        "```\n",
-        "Your model is pushed to the hub. You can view your model here: https://huggingface.co/ThomasSimonini/ppo-Huggy\n",
-        "```\n",
-        "\n",
-        "It’s the link to your model repository. The repository contains a model card that explains how to use the model, your Tensorboard logs and your config file. **What’s awesome is that it’s a git repository, which means you can have different commits, update your repository with a new push, open Pull Requests, etc.**\n",
-        "\n",
-        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/modelcard.png\" alt=\"ml learn function\" width=\"100%\">"
-      ],
-      "metadata": {
-        "id": "yborB0850FTM"
-      }
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "But now comes the best: **being able to play with Huggy online 👀.**"
-      ],
-      "metadata": {
-        "id": "5Uaon2cg0NrL"
-      }
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "## Play with your Huggy 🐕\n",
-        "\n",
-        "This step is the simplest:\n",
-        "\n",
-        "- Open the game Huggy in your browser: https://huggingface.co/spaces/ThomasSimonini/Huggy\n",
-        "\n",
-        "- Click on Play with my Huggy model\n",
-        "\n",
-        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/load-huggy.jpg\" alt=\"load-huggy\" width=\"100%\">"
-      ],
-      "metadata": {
-        "id": "VMc4oOsE0QiZ"
-      }
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "1. In step 1, choose your model repository which is the model id (in my case ThomasSimonini/ppo-Huggy).\n",
-        "\n",
-        "2. In step 2, **choose what model you want to replay**:\n",
-        "  - I have multiple ones, since we saved a model every 500000 timesteps. \n",
-        "  - But since I want the more recent, I choose `Huggy.onnx`\n",
-        "\n",
-        "👉 What’s nice **is to try with different models steps to see the improvement of the agent.**"
-      ],
-      "metadata": {
-        "id": "Djs8c5rR0Z8a"
-      }
-    },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "Congrats on finishing this bonus unit!\n",
-        "\n",
-        "You can now sit and enjoy playing with your Huggy 🐶. And don't **forget to spread the love by sharing Huggy with your friends 🤗**. And if you share about it on social media, **please tag us @huggingface and me @simoninithomas**\n",
-        "\n",
-        "<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit-bonus1/huggy-cover.jpeg\" alt=\"Huggy cover\" width=\"100%\">\n",
-        "\n",
-        "\n",
-        "## Keep Learning, Stay  awesome 🤗"
-      ],
-      "metadata": {
-        "id": "PI6dPWmh064H"
-      }
-    }
-  ],
-  "metadata": {
-    "accelerator": "GPU",
-    "colab": {
-      "provenance": [],
-      "private_outputs": true,
-      "include_colab_link": true
-    },
-    "gpuClass": "standard",
-    "kernelspec": {
-      "display_name": "Python 3",
-      "name": "python3"
-    },
-    "language_info": {
-      "name": "python"
-    }
-  },
-  "nbformat": 4,
-  "nbformat_minor": 0
-}
\ No newline at end of file

From fa7eb7b0e4ebc7de2a3331ed9aa4f99030836bb3 Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Thu, 15 Dec 2022 09:53:37 +0100
Subject: [PATCH 43/49] Update keep_checkpoints

---
 notebooks/bonus-unit1/bonus-unit1.ipynb | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)
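
Note on the two settings this patch documents: the text added below says that decreasing `checkpoint_interval` means more models to upload. The following is a rough Python sketch of that arithmetic. The function name and the step counts are illustrative assumptions, not values read from `Huggy.yaml`, and the real trainer may differ in details such as the final model export.

```python
def checkpoint_plan(max_steps, checkpoint_interval, keep_checkpoints):
    """Return (checkpoints written during training, checkpoints kept at the end)."""
    if checkpoint_interval <= 0:
        raise ValueError("checkpoint_interval must be positive")
    # One checkpoint is written each time checkpoint_interval steps are collected.
    written = max_steps // checkpoint_interval
    # Only the most recent keep_checkpoints files are retained.
    kept = min(written, keep_checkpoints)
    return written, kept


# Hypothetical run of 2,000,000 steps, saving every 200,000 steps.
print(checkpoint_plan(2_000_000, 200_000, keep_checkpoints=15))  # (10, 10)
# Halving the interval doubles the checkpoints written; the cap now applies.
print(checkpoint_plan(2_000_000, 100_000, keep_checkpoints=15))  # (20, 15)
```

Under these assumptions, a smaller interval only increases the upload size up to the `keep_checkpoints` cap.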

diff --git a/notebooks/bonus-unit1/bonus-unit1.ipynb b/notebooks/bonus-unit1/bonus-unit1.ipynb
index a96ade7..13deda7 100644
--- a/notebooks/bonus-unit1/bonus-unit1.ipynb
+++ b/notebooks/bonus-unit1/bonus-unit1.ipynb
@@ -1,5 +1,15 @@
 {
   "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "view-in-github",
+        "colab_type": "text"
+      },
+      "source": [
+        "<a href=\"https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/notebooks/bonus-unit1/bonus-unit1.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+      ]
+    },
     {
       "cell_type": "markdown",
       "metadata": {
@@ -320,7 +330,11 @@
       "source": [
         "- **In the case you want to modify the hyperparameters**, in Google Colab notebook, you can click here to open the config.yaml: `/content/ml-agents/config/ppo/Huggy.yaml`\n",
         "\n",
+        "- For instance, **if you want to save more models during training** (for now, we save one every 200,000 training timesteps), you need to modify:\n",
+        "  - `checkpoint_interval`: The number of training timesteps collected between each checkpoint.\n",
+        "  - `keep_checkpoints`: The maximum number of model checkpoints to keep. \n",
         "\n",
+        "=> Keep in mind that **decreasing `checkpoint_interval` means more models to upload to the Hub, and so a longer upload time**. \n",
         "We’re now ready to train our agent 🔥."
       ],
       "metadata": {
@@ -541,7 +555,8 @@
     "accelerator": "GPU",
     "colab": {
       "provenance": [],
-      "private_outputs": true
+      "private_outputs": true,
+      "include_colab_link": true
     },
     "gpuClass": "standard",
     "kernelspec": {

From 6d66d9146bb3a1a7f95b8bd9a053fa722cada3dd Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Thu, 15 Dec 2022 14:25:41 +0100
Subject: [PATCH 44/49] Add publishing schedule

---
 units/en/_toctree.yml | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/units/en/_toctree.yml b/units/en/_toctree.yml
index 1ce98b5..9621222 100644
--- a/units/en/_toctree.yml
+++ b/units/en/_toctree.yml
@@ -76,3 +76,7 @@
     title: Conclusion
   - local: unit2/additional-readings
     title: Additional Readings
+- title: What's next? New Units Publishing Schedule
+  sections:
+  - local: communication/publishing-schedule
+    title: Publishing Schedule

From 03fc439ac09de310edc101ce898e450c11d4fd0f Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Thu, 15 Dec 2022 14:40:18 +0100
Subject: [PATCH 45/49] Create publishing-schedule.mdx

---
 units/en/communication/publishing-schedule.mdx | 13 +++++++++++++
 1 file changed, 13 insertions(+)
 create mode 100644 units/en/communication/publishing-schedule.mdx

diff --git a/units/en/communication/publishing-schedule.mdx b/units/en/communication/publishing-schedule.mdx
new file mode 100644
index 0000000..d1570f4
--- /dev/null
+++ b/units/en/communication/publishing-schedule.mdx
@@ -0,0 +1,13 @@
+# Publishing Schedule [[publishing-schedule]]
+
+We publish a new unit every Monday (except Monday, the 26th of December).
+
+If you don't want to miss any of the updates, don't forget to:
+
+1️⃣ [Sign up to the course](http://eepurl.com/ic5ZUD) to receive update emails.
+
+2️⃣ [Join our discord server](https://hf.co/join/discord) to get the last updates and exchange with your classmates.
+
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/communication/schedule1.png" alt="Schedule 1" width="100%"/>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/communication/schedule2.png" alt="Schedule 2" width="100%"/>

From 5e730739579018bd07e894a5dd43418041f24e01 Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Thu, 15 Dec 2022 14:42:58 +0100
Subject: [PATCH 46/49] Add publishing schedule

---
 units/en/unit0/introduction.mdx | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/units/en/unit0/introduction.mdx b/units/en/unit0/introduction.mdx
index e48f0c4..4ab31b0 100644
--- a/units/en/unit0/introduction.mdx
+++ b/units/en/unit0/introduction.mdx
@@ -80,6 +80,14 @@ You need only 3 things:
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/tools.jpg" alt="Course tools needed" width="100%"/>
 
+## What is the publishing schedule? [[publishing-schedule]]
+
+
+We publish **a new unit every Monday** (except Monday, the 26th of December).
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/communication/schedule1.png" alt="Schedule 1" width="100%"/>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/communication/schedule2.png" alt="Schedule 2" width="100%"/>
+
 
 ## What is the recommended pace? [[recommended-pace]]
 

From e06c7ab2bb93e20b88170ba2522c485a49d11c36 Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Thu, 15 Dec 2022 14:56:20 +0100
Subject: [PATCH 47/49] Update publishing-schedule.mdx

---
 units/en/communication/publishing-schedule.mdx | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/units/en/communication/publishing-schedule.mdx b/units/en/communication/publishing-schedule.mdx
index d1570f4..fe24045 100644
--- a/units/en/communication/publishing-schedule.mdx
+++ b/units/en/communication/publishing-schedule.mdx
@@ -1,12 +1,12 @@
 # Publishing Schedule [[publishing-schedule]]
 
-We publish a new unit every Monday (except Monday, the 26th of December).
+We publish a **new unit every Monday** (except Monday, the 26th of December).
 
 If you don't want to miss any of the updates, don't forget to:
 
-1️⃣ [Sign up to the course](http://eepurl.com/ic5ZUD) to receive update emails.
+1️⃣ [Sign up to the course](http://eepurl.com/ic5ZUD) to receive **update emails**.
 
-2️⃣ [Join our discord server](https://hf.co/join/discord) to get the last updates and exchange with your classmates.
+2️⃣ [Join our Discord server](https://hf.co/join/discord) to **get the latest updates and exchange ideas with your classmates**.
 
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/communication/schedule1.png" alt="Schedule 1" width="100%"/>

From f46ce5d5c97696880ea3b26dada78d6a5e203b9d Mon Sep 17 00:00:00 2001
From: ankandrew <61120139+ankandrew@users.noreply.github.com>
Date: Thu, 15 Dec 2022 11:01:56 -0300
Subject: [PATCH 48/49] Fix minor bold text issue

---
 units/en/unit2/bellman-equation.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/units/en/unit2/bellman-equation.mdx b/units/en/unit2/bellman-equation.mdx
index 577c6bb..99d753a 100644
--- a/units/en/unit2/bellman-equation.mdx
+++ b/units/en/unit2/bellman-equation.mdx
@@ -18,7 +18,7 @@ Then, to calculate the \\(V(S_{t+1})\\), we need to calculate the return startin
 
 <figure>
   <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman3.jpg" alt="Bellman equation"/>
-  <figcaption>To calculate the value of State 2: the sum of rewards **if the agent started in that state, and then followed the **policy for all the time steps.</figcaption>
+  <figcaption>To calculate the value of State 2: the sum of rewards **if the agent started in that state**, and then followed the **policy for all the time steps.**</figcaption>
 </figure>
 
 So you may have noticed, we're repeating the computation of the value of different states, which can be tedious if you need to do it for each state value or state-action value.

From 55deaa576c3d79a35c09e7331554acb301c1a112 Mon Sep 17 00:00:00 2001
From: Thomas Simonini <simonini.thomas.pro@gmail.com>
Date: Thu, 15 Dec 2022 16:03:34 +0100
Subject: [PATCH 49/49] Add certification info in hands-on and introduction

---
 notebooks/unit1/unit1.ipynb     | 14 ++++++++++++++
 notebooks/unit2/unit2.ipynb     | 22 +++++++++++++++++-----
 units/en/unit0/introduction.mdx |  9 +++++++++
 units/en/unit1/hands-on.mdx     |  6 ++++++
 units/en/unit2/hands-on.mdx     |  6 ++++++
 5 files changed, 52 insertions(+), 5 deletions(-)
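
Note on the score used in the cells added below: they state that the leaderboard result is `mean_reward - std of reward`. The following is a minimal Python sketch of that formula. The function name and the reward values are made up for illustration, and using the population standard deviation is an assumption about how the leaderboard computes the spread.

```python
import statistics


def leaderboard_result(episode_rewards):
    """Score as described in the notebook: mean reward minus its standard deviation."""
    mean_reward = statistics.fmean(episode_rewards)
    # Assumption: population standard deviation (NumPy's default for np.std).
    std_reward = statistics.pstdev(episode_rewards)
    return mean_reward - std_reward


# Made-up evaluation rewards for a Taxi agent.
rewards = [8.0, 7.0, 9.0, 8.0]
result = leaderboard_result(rewards)
print(f"result = {result:.2f}, passes the 4.5 threshold: {result >= 4.5}")
```

Subtracting the standard deviation means an agent with a high but erratic mean reward can still fall below the threshold.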

diff --git a/notebooks/unit1/unit1.ipynb b/notebooks/unit1/unit1.ipynb
index 3b58d09..aee6b51 100644
--- a/notebooks/unit1/unit1.ipynb
+++ b/notebooks/unit1/unit1.ipynb
@@ -166,6 +166,20 @@
         "# Let's train our first Deep Reinforcement Learning agent and upload it to the Hub 🚀\n"
       ]
     },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Get a certificate\n",
+        "To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained model to the Hub and **get a result of >= 200**.\n",
+        "\n",
+        "To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model: **the result = mean_reward - std of reward**.\n",
+        "\n",
+        "For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process"
+      ],
+      "metadata": {
+        "id": "qDploC3jSH99"
+      }
+    },
     {
       "cell_type": "markdown",
       "source": [
diff --git a/notebooks/unit2/unit2.ipynb b/notebooks/unit2/unit2.ipynb
index 5555554..f1ff2cd 100644
--- a/notebooks/unit2/unit2.ipynb
+++ b/notebooks/unit2/unit2.ipynb
@@ -3,8 +3,7 @@
     {
       "cell_type": "markdown",
       "metadata": {
-        "id": "view-in-github",
-        "colab_type": "text"
+        "id": "view-in-github"
       },
       "source": [
         "<a href=\"https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/notebooks/unit2/unit2.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
@@ -169,6 +168,20 @@
         "id": "HEtx8Y8MqKfH"
       }
     },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained Taxi model to the Hub and **get a result of >= 4.5**.\n",
+        "\n",
+        "To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model: **the result = mean_reward - std of reward**.\n",
+        "\n",
+        "For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process"
+      ],
+      "metadata": {
+        "id": "Kdxb1IhzTn0v"
+      }
+    },
     {
       "cell_type": "markdown",
       "source": [
@@ -1734,8 +1747,7 @@
         "Ji_UrI5l2zzn",
         "67OdoKL63eDD",
         "B2_-8b8z5k54"
-      ],
-      "include_colab_link": true
+      ]
     },
     "gpuClass": "standard",
     "kernelspec": {
@@ -1748,4 +1760,4 @@
   },
   "nbformat": 4,
   "nbformat_minor": 0
-}
+}
\ No newline at end of file
diff --git a/units/en/unit0/introduction.mdx b/units/en/unit0/introduction.mdx
index 4ab31b0..3118d0d 100644
--- a/units/en/unit0/introduction.mdx
+++ b/units/en/unit0/introduction.mdx
@@ -53,12 +53,21 @@ The course is composed of:
 You can choose to follow this course either:
 
 - *To get a certificate of completion*: you need to complete 80% of the assignments before the end of March 2023.
+- *To get a certificate of honors*: you need to complete 100% of the assignments before the end of March 2023.
 - *As a simple audit*: you can participate in all challenges and do assignments if you want, but you have no deadlines.
 
 Both paths **are completely free**.
 Whatever path you choose, we advise you **to follow the recommended pace to enjoy the course and challenges with your fellow classmates.**
 You don't need to tell us which path you choose. At the end of March, when we verify the assignments **if you get more than 80% of the assignments done, you'll get a certificate.**
 
+## The Certification Process [[certification-process]]
+
+The certification process is **completely free**:
+
+- *To get a certificate of completion*: you need to complete 80% of the assignments before the end of March 2023.
+- *To get a certificate of honors*: you need to complete 100% of the assignments before the end of March 2023.
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/certification.jpg" alt="Course certification" width="100%"/>
 
 ## How to get most of the course? [[advice]]
 
diff --git a/units/en/unit1/hands-on.mdx b/units/en/unit1/hands-on.mdx
index 2c65154..419aefd 100644
--- a/units/en/unit1/hands-on.mdx
+++ b/units/en/unit1/hands-on.mdx
@@ -18,6 +18,12 @@ And finally, you'll **upload this trained agent to the Hugging Face Hub 🤗, a
 
 Thanks to our <a href="https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard">leaderboard</a>, you'll be able to compare your results with other classmates and exchange the best practices to improve your agent's scores. Who will win the challenge for Unit 1 🏆?
 
+To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained model to the Hub and **get a result of >= 200**.
+
+To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and look up your model: **result = mean_reward - std of reward**.
+
+For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
+
 So let's get started! 🚀
 
 **To start the hands-on click on Open In Colab button** 👇 :
diff --git a/units/en/unit2/hands-on.mdx b/units/en/unit2/hands-on.mdx
index 08c63d7..71c0151 100644
--- a/units/en/unit2/hands-on.mdx
+++ b/units/en/unit2/hands-on.mdx
@@ -16,6 +16,12 @@ Now that we studied the Q-Learning algorithm, let's implement it from scratch an
 
 Thanks to a [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard), you'll be able to compare your results with other classmates and exchange the best practices to improve your agent's scores. Who will win the challenge for Unit 2?
 
+To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained Taxi model to the Hub and **get a result of >= 4.5**.
+
+To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and look up your model: **result = mean_reward - std of reward**.
+
+For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
+
 
 **To start the hands-on click on Open In Colab button** 👇 :