First draft unfinished

This commit is contained in:
simoninithomas
2023-01-01 13:23:38 +01:00
parent ab8598b772
commit c71422e59c
5 changed files with 299 additions and 1 deletions


@@ -110,8 +110,18 @@
title: Optuna
- local: unitbonus2/hands-on
title: Hands-on
- title: Unit 4. Policy Gradient with Robotics
sections:
- local: unit4/introduction
title: Introduction
- local: unit4/what-are-policy-based-methods
title: What are the Policy-based methods?
- local: unit4/advantages-disadvantages
title: The advantages and disadvantages of Policy-based methods
- local: unit4/policy-gradient
title: Diving deeper into Policy-gradient methods
- title: What's next? New Units Publishing Schedule
sections:
- local: communication/publishing-schedule
title: Publishing Schedule


@@ -0,0 +1,72 @@
# The advantages and disadvantages of Policy-based methods
At this point, you might ask, "But Deep Q-Learning is excellent! Why use policy-gradient methods?" Let's study the advantages and disadvantages of Policy-gradient methods.
## Advantages
There are multiple advantages over Value-based methods. Let's see some of them:
### The simplicity of integration
Indeed, **we can estimate the policy directly without storing additional data (action values).**
### Policy-gradient methods can learn a stochastic policy
Policy gradient methods can **learn a stochastic policy while value functions can't**.
This has two consequences:
1. We **don't need to implement an exploration/exploitation trade-off by hand**. Since we output a probability distribution over actions, the agent explores **the state space without always taking the same trajectory.**
2. We also get rid of the problem of **perceptual aliasing**. Perceptual aliasing is when two states seem (or are) the same but need different actions.
Let's take an example: we have an intelligent vacuum cleaner whose goal is to suck the dust and avoid killing the hamsters.
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/hamster1.jpg" alt="Hamster 1"/>
</figure>
Our vacuum cleaner can only perceive where the walls are.
The problem is that the two red squares are aliased states: in each of them, the agent perceives an upper and a lower wall.
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/hamster2.jpg" alt="Hamster 1"/>
</figure>
Under a deterministic policy, the agent will either always move right or always move left when in a red state. **Either case will cause our agent to get stuck and never suck the dust**.
Under a value-based RL algorithm, we learn a quasi-deterministic policy (epsilon-greedy strategy). Consequently, our agent can spend a lot of time before finding the dust.
On the other hand, an optimal stochastic policy will randomly move left or right in grey states. Consequently, **it will not be stuck and will reach the goal state with a high probability**.
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/hamster3.jpg" alt="Hamster 1"/>
</figure>
### Policy-gradient methods are more effective in high-dimensional action spaces and continuous action spaces
Indeed, the problem with Deep Q-learning is that its **predictions assign a score (maximum expected future reward) for each possible action**, at each time step, given the current state.
But what if we have an infinite number of possible actions?
For instance, with a self-driving car, at each state you can have a (near) infinite choice of actions (turning the wheel at 15°, 17.2°, 19.4°, honking, etc.). We would need to output a Q-value for each possible action! And taking the max action over a continuous output is an optimization problem in itself!
Instead, with a policy gradient, we output a **probability distribution over actions.**
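To make the contrast concrete, here is a minimal sketch (the class name, layer sizes, and state/action dimensions are illustrative assumptions, not from the course) of a policy network for a continuous action space: instead of one Q-value per action, it outputs the parameters of a distribution we can sample from, so no max over actions is ever needed:

```python
import torch
import torch.nn as nn

class ContinuousPolicy(nn.Module):
    """Illustrative policy for a continuous action space (e.g. a steering angle).

    Instead of one Q-value per discrete action, it outputs the mean and
    standard deviation of a Gaussian over the action."""

    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        h = self.backbone(state)
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())

policy = ContinuousPolicy(state_dim=4, action_dim=1)
dist = policy(torch.randn(4))
action = dist.sample()           # a real-valued action: no max over actions needed
log_prob = dist.log_prob(action)
```

Sampling from the distribution replaces the argmax over Q-values that becomes intractable when actions are continuous.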
### Policy-gradient methods have better convergence properties
In Value-based methods, we use an aggressive operator to **change the value function: we take the maximum over Q-estimates**.
Consequently, the action probabilities may change dramatically for an arbitrary small change in the estimated action-values if that change results in a different action having the maximal value.
For instance, if during training the best action was left (with a Q-value of 0.22) and after the next training step it becomes right (since the right Q-value becomes 0.23), we have dramatically changed the policy: it will now take right most of the time instead of left.
On the other hand, in Policy-based methods, the stochastic policy's action preferences (the probability of taking each action) **change smoothly over time**.
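A small, made-up numeric illustration of this contrast (the Q-values 0.22 and 0.23 follow the example above):

```python
import numpy as np

def greedy(q):
    """Greedy selection: all probability mass on the argmax action."""
    probs = np.zeros_like(q)
    probs[np.argmax(q)] = 1.0
    return probs

def softmax(prefs):
    """Stochastic policy derived from action preferences."""
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

q_before = np.array([0.22, 0.21])   # left slightly better
q_after  = np.array([0.22, 0.23])   # right now slightly better

print(greedy(q_before), greedy(q_after))    # [1. 0.] [0. 1.]
print(softmax(q_before), softmax(q_after))  # probabilities shift only slightly
```

The greedy probabilities flip completely on a 0.01 change in the estimates, while the softmax preferences barely move.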
## Disadvantages
Naturally, Policy-Gradient methods also have some disadvantages:
- **Policy gradients often converge to a local maximum instead of the global optimum.**
- Policy gradients go slower, **step by step: they can take longer to train (inefficiency).**
- Policy gradients can have high variance; we'll see in the Actor-Critic unit why, and how we can solve this problem.
👉 If you want to go deeper into the advantages and disadvantages of Policy-Gradient methods, [you can check this video](https://youtu.be/y3oqOjHilio).


@@ -0,0 +1,67 @@
# Introduction [[introduction]]
In the last unit, we learned about Deep Q-Learning. In this value-based Deep Reinforcement Learning algorithm, we **used a deep neural network to approximate the different Q-values for each possible action at a state.**
Indeed, since the beginning of the course, we only studied value-based methods, **where we estimate a value function as an intermediate step towards finding an optimal policy.**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" />
In value-based methods, the policy \\(\pi\\) **exists only because of the action value estimates, since the policy is just a function** (for instance, a greedy policy) that selects the action with the highest value given a state.
But, with policy-based methods, we want to optimize the policy directly **without having an intermediate step of learning a value function.**
So today, **we'll learn about Policy-based methods, and we'll study a subset of these methods called Policy Gradients**. Then we'll implement our first Policy-Gradient algorithm, called Monte Carlo **Reinforce**, from scratch using PyTorch, and test its robustness using CartPole-v1, PixelCopter, and Pong.
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/envs.gif" alt="Environments"/>
</figure>
Let's get started!
- [What are Policy-Gradient Methods?](#what-are-policy-gradient-methods)
- [An Overview of Policy Gradients](#an-overview-of-policy-gradients)
- [The Advantages of Policy-Gradient Methods](#the-advantages-of-policy-gradient-methods)
- [The Disadvantages of Policy-Gradient Methods](#the-disadvantages-of-policy-gradient-methods)
- [Reinforce (Monte Carlo Policy Gradient)](#reinforce-monte-carlo-policy-gradient)
Now that we have seen the big picture of Policy-Gradient and its advantages and disadvantages, **let's study and implement one of them**: Reinforce.
## Reinforce (Monte Carlo Policy Gradient)
Reinforce, also called Monte-Carlo Policy Gradient, **uses an estimated return from an entire episode to update the policy parameter** \\(\theta\\).
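As a hedged sketch (the function name and the γ values are illustrative, not from the course), the "estimated return from an entire episode" is the discounted sum of the rewards collected from each step onward:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * r_{t+1} + ... for every step of one episode."""
    returns = []
    g = 0.0
    for r in reversed(rewards):   # work backwards so each G_t reuses G_{t+1}
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

# One toy episode of three rewards:
print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

These per-step returns are what Reinforce uses to weight its policy update after the episode ends.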
Now that we've studied the theory behind Reinforce, **you're ready to code your Reinforce agent with PyTorch**, and you'll test its robustness using CartPole-v1, PixelCopter, and Pong.
Start the tutorial here 👉 https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit5/unit5.ipynb
The leaderboard to compare your results with your classmates 🏆 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/envs.gif" alt="Environments"/>
</figure>
---
Congrats on finishing this chapter! There was a lot of information. And congrats on finishing the tutorial. You've just coded your first Deep Reinforcement Learning agent from scratch using PyTorch and shared it on the Hub 🥳.
It's **normal if you still feel confused** by all these elements. **This was the same for me and for everyone who studied RL.**
Take time to really grasp the material before continuing.
Don't hesitate to train your agent in other environments. The **best way to learn is to try things on your own!**
We published additional readings in the syllabus if you want to go deeper 👉 **[https://github.com/huggingface/deep-rl-class/blob/main/unit5/README.md](https://github.com/huggingface/deep-rl-class/blob/main/unit5/README.md)**
In the next unit, we're going to learn about a combination of Policy-based and Value-based methods called Actor-Critic methods.
And don't forget to share with your friends who want to learn 🤗!
Finally, we want **to improve and update the course iteratively with your feedback**. If you have some, please fill this form 👉 **[https://forms.gle/3HgA7bEHwAmmLfwh9](https://forms.gle/3HgA7bEHwAmmLfwh9)**
### **Keep learning, stay awesome 🤗,**


@@ -0,0 +1,109 @@
# Diving deeper into Policy-gradient methods
## Getting the big picture
We just learned that the goal of Policy-gradient methods is to find the parameters \\(\theta\\) that maximize the expected return.
The idea is that we have a parameterized stochastic policy, in our case a neural network that outputs a probability distribution over actions. The probability of taking each action is also called the *action preference*.
If we take the example of CartPole-v1:
- As input, we have a state.
- As output, we have a probability distribution over actions at that state.
TODO: IMAGE NEURAL NETWORK CARTPOLE
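A minimal sketch of such a parameterized stochastic policy for CartPole-v1 (the layer sizes are an assumption; CartPole-v1's observation has 4 dimensions and there are 2 actions):

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Maps a CartPole state (4 numbers) to a probability distribution
    over the 2 actions (push left / push right)."""

    def __init__(self, state_dim=4, n_actions=2, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
            nn.Softmax(dim=-1),   # turn raw scores into action preferences
        )

    def forward(self, state):
        return self.net(state)

policy = Policy()
probs = policy(torch.randn(4))   # a probability distribution over the 2 actions
```

The output always sums to 1, so the agent can sample its action from it instead of taking an argmax.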
Our goal with Policy-Gradients is to control the probability distribution of actions by tuning the policy such that **good actions (that maximize the return) are sampled more frequently in the future.**
Each time the agent interacts with the environment, we tweak the parameters such that good actions are more likely to be sampled in the future.
But how are we going to optimize the weights using the expected return?
The idea is that we let the agent interact during an episode. If we win the episode, we consider that each action taken was good and should be sampled more often in the future, since it led to the win.
So for each state-action pair, we want to increase \\(P(a|s)\\), the probability of taking that action at that state (or decrease it if we lost).
TODO: IMAGE
- Collect an episode
- Change the weights of the policy network:
  - If we got a good return: increase \\(P(a|s)\\) for each (state, action) combination taken
  - If we got a bad return: decrease \\(P(a|s)\\)
The Policy Gradient algorithm (simplified) looks like this:
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/pg_bigpicture.jpg" alt="Policy Gradient Big Picture"/>
</figure>
## Diving deeper into Policy-gradient
We have our policy \\(\pi\\) which has a parameter \\(\theta\\). This \\(\pi\\), given a state, **outputs a probability distribution of actions**.
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy.jpg" alt="Policy"/>
</figure>
Where \\(\pi_\theta(a_t|s_t)\\) is the probability of the agent selecting action \\(a_t\\) from state \\(s_t\\), given our policy.
**But how do we know if our policy is good?** We need a way to measure it. To do that, we define a score/objective function called \\(J(\theta)\\).
### The Objective Function
The objective function gives us the **performance of the agent** given a trajectory: it outputs the *expected cumulative reward*.
TODO: Illustration Expected reward
Our objective then is to maximize the expected cumulative rewards by finding \\(\theta \\) that will output the best action probability distributions.
## Gradient Ascent
Since we want to find the values of \\(\theta\\) that maximize our objective function \\(J(\theta)\\), we use **gradient-ascent**. It's the opposite of *gradient-descent*: it steps in the direction of the steepest increase of \\(J(\theta)\\).
Our update step for gradient-ascent is:
\\(\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)\\)
The score function \\(J(\theta)\\) is the expected return:
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/objective.jpg" alt="Return"/>
</figure>
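As a tiny, self-contained illustration of the update rule (the objective function and learning rate are made up for the example), gradient ascent repeatedly steps in the gradient's direction until it reaches a maximizer:

```python
def grad_ascent_step(theta, grad, alpha=0.1):
    """One gradient-ascent update: theta <- theta + alpha * grad J(theta)."""
    return theta + alpha * grad

# Maximize J(theta) = -(theta - 3)^2, whose gradient is -2 * (theta - 3):
theta = 0.0
for _ in range(100):
    theta = grad_ascent_step(theta, grad=-2.0 * (theta - 3.0))
print(theta)  # approaches 3.0, the maximizer
```

In Reinforce, the same step is applied to the policy's weights, with the gradient of \\(J(\theta)\\) estimated from collected episodes.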
Remember that policy gradient can be seen as an optimization problem: we must find the best parameters \\(\theta\\) to maximize the score function \\(J(\theta)\\).
To do that, we're going to use the [Policy Gradient Theorem](https://www.youtube.com/watch?v=AKbX1Zvo7r8). I'm not going to dive into the mathematical details, but if you're interested, check [this video](https://www.youtube.com/watch?v=AKbX1Zvo7r8).
The Reinforce algorithm works like this:
Loop:
- Use the policy \\(\pi_\theta\\) to collect an episode \\(\tau\\)
- Use the episode to estimate the gradient \\(\hat{g} = \nabla_\theta J(\theta)\\)
<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/pg.jpg" alt="Policy Gradient"/>
</figure>
- Update the weights of the policy: \\(\theta \leftarrow \theta + \alpha \hat{g}\\)
We can interpret this update as follows:
- \\(\nabla_\theta \log \pi_\theta(a_t|s_t)\\) is the direction of the **steepest increase of the (log) probability** of selecting action \\(a_t\\) from state \\(s_t\\).
=> This tells us **how we should change the weights of the policy** if we want to increase/decrease the log probability of selecting action \\(a_t\\) at state \\(s_t\\).
- \\(R(\tau)\\) is the scoring function:
- If the return is high, it will push up the probabilities of the (state, action) combinations.
- Else, if the return is low, it will push down the probabilities of the (state, action) combinations.
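As a hedged sketch of how this update is typically implemented in PyTorch (the probabilities and the return are made-up numbers): we minimize \\(-R(\tau) \sum_t \log \pi_\theta(a_t|s_t)\\), so a gradient-descent step on this loss is a gradient-ascent step on the objective:

```python
import torch

# pi(a_t|s_t) for the actions actually taken in one collected episode
# (in practice these would come from the policy network):
probs = torch.tensor([0.6, 0.3, 0.8], requires_grad=True)
episode_return = 2.0   # R(tau) for that episode

# REINFORCE loss: -R(tau) * sum_t log pi(a_t|s_t)
loss = -episode_return * torch.log(probs).sum()
loss.backward()

# With a positive return, every gradient entry is negative, so a descent
# step (p -= lr * grad) pushes the taken-action probabilities UP:
print(probs.grad)
```

With a negative return, the signs flip and the same step pushes those probabilities down, exactly the push-up/push-down behaviour described above.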
## The Policy Gradient Theorem


@@ -0,0 +1,40 @@
# What are the Policy-based methods?
The main goal of Reinforcement Learning is to find the optimal policy \\(\pi^*\\) that will maximize the expected cumulative reward.
This is because Reinforcement Learning is based on the *reward hypothesis*: **all goals can be described as the maximization of the expected cumulative reward.**
For instance, in a soccer game (which you're going to train agents to play in two units), the goal is to win the game. We can describe this goal in Reinforcement Learning as maximizing the number of goals scored (when the ball crosses the goal line) against your opponent, and minimizing the number of goals scored against you.
TODO ADD IMAGE SOCCER
## Value-based, Policy-based and Actor-Critic methods
We studied in Unit 1 that we have two methods to find (or, most of the time, approximate) this optimal policy \\(\pi^*\\).
- In *Value-Based methods*, we learn a value function.
- The idea then is that an optimal value function leads to an optimal policy \\(\pi^*\\).
- Our objective is to **minimize the loss between the predicted and target values** to match the true action-value function.
- We have a policy, but it's implicit since it **is generated directly from the value function**. For instance, in Q-Learning we defined an epsilon-greedy policy.
- On the other hand, in *Policy-based methods*, we directly learn to approximate \\(\pi^*\\) without having to learn a value function.
- The idea is to parameterize the policy, for instance with a neural network \\(\pi_\theta\\); this policy will output a probability distribution over actions (a stochastic policy).
- Our objective then is *to maximize the performance of the parameterized policy using gradient ascent*.
- To do that, we control the parameter \\(\theta\\), which affects the distribution of actions at a state.
- Finally, we'll study next time *Actor-Critic* which is a combination of value-based and policy based methods.
Consequently, thanks to Policy-based methods, we can directly optimize our policy \\(\pi_\theta\\) to output a probability distribution over actions \\(\pi_\theta(a|s)\\) that leads to the best cumulative return.
To do that, we define an objective function \\(J(\theta)\\), the expected cumulative reward, and we **want to find the \\(\theta\\) that maximizes this objective function**.
## The difference between Policy-based and Policy-gradient methods
Policy-gradient methods, which we're going to study in this unit, are a subclass of Policy-based methods.
The difference between the two **lies in how we optimize the parameter** \\(\theta\\):
- In *Policy-based methods*, we search directly for the optimal policy. We can optimize the parameter \\(\theta\\) **indirectly** by maximizing a local approximation of the objective function, with techniques like hill climbing, simulated annealing, or evolution strategies.
- In *Policy-Gradient methods*, because they are a subclass of Policy-based methods, we also search directly for the optimal policy. But we optimize the parameter \\(\theta\\) **directly** by performing gradient ascent on the performance of the objective function \\(J(\theta)\\).
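As a minimal, hypothetical sketch of the "indirect" flavour: hill climbing perturbs \\(\theta\\) randomly and keeps a perturbation only when the estimated objective improves, never computing a gradient:

```python
import random

def hill_climb(evaluate, theta, steps=200, noise=0.1):
    """Gradient-free policy search: keep a random perturbation of the
    parameters only if it improves the estimated objective J(theta)."""
    best_score = evaluate(theta)
    for _ in range(steps):
        candidate = [t + random.gauss(0, noise) for t in theta]
        score = evaluate(candidate)
        if score > best_score:        # no gradient of J is ever used
            theta, best_score = candidate, score
    return theta, best_score

# Toy objective standing in for an episode-return estimate of J(theta):
J = lambda th: -sum(t * t for t in th)   # maximized at theta = 0
theta, score = hill_climb(J, [1.0, -2.0])
```

Policy-Gradient methods replace this trial-and-error loop with an explicit gradient-ascent step on \\(J(\theta)\\).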
In Policy-based methods, the optimization is most of the time *on-policy* since for each update we only use data (trajectories) collected **by our most recent version of** \\(\pi_\theta\\).
Before diving deeper into how Policy-gradient methods work (the objective function, the policy gradient theorem, gradient ascent, etc.), let's study the advantages and disadvantages of Policy-based methods.