From bebb6fed17c98c01f1d54e993e506eaa7fa86b1b Mon Sep 17 00:00:00 2001 From: simoninithomas Date: Mon, 2 Jan 2023 21:22:44 +0100 Subject: [PATCH] Adding conclusion --- units/en/_toctree.yml | 11 ++- units/en/unit4/additional-readings.mdx | 20 +++++ units/en/unit4/conclusion.mdx | 1 + units/en/unit4/hands-on.mdx | 1 + units/en/unit4/introduction.mdx | 46 +---------- units/en/unit4/policy-gradient.mdx | 110 ------------------------- units/en/unit4/quiz.mdx | 1 + units/en/unit4/reinforce.mdx | 39 +++++++++ 8 files changed, 75 insertions(+), 154 deletions(-) create mode 100644 units/en/unit4/additional-readings.mdx create mode 100644 units/en/unit4/conclusion.mdx create mode 100644 units/en/unit4/hands-on.mdx create mode 100644 units/en/unit4/quiz.mdx create mode 100644 units/en/unit4/reinforce.mdx diff --git a/units/en/_toctree.yml b/units/en/_toctree.yml index fd3dbea..35bfa4e 100644 --- a/units/en/_toctree.yml +++ b/units/en/_toctree.yml @@ -120,7 +120,16 @@ title: The advantages and disadvantages of Policy-based methods - local: unit4/policy-gradient title: Diving deeper into Policy-gradient methods - + - local: unit4/reinforce + title: The Reinforce algorithm + - local: unit4/hands-on + title: Hands-on + - local: unit4/quiz + title: Quiz + - local: unit4/conclusion + title: Conclusion + - local: unit4/additional-readings + title: Additional Readings - title: What's next? New Units Publishing Schedule sections: - local: communication/publishing-schedule diff --git a/units/en/unit4/additional-readings.mdx b/units/en/unit4/additional-readings.mdx new file mode 100644 index 0000000..6d246ae --- /dev/null +++ b/units/en/unit4/additional-readings.mdx @@ -0,0 +1,20 @@ +# Additional Readings + +These are **optional readings** if you want to go deeper. 
+ + +## Introduction to Policy Optimization + +- [Part 3: Intro to Policy Optimization - Spinning Up documentation](https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html) + + +## Policy Gradient + +- [https://johnwlambert.github.io/policy-gradients/](https://johnwlambert.github.io/policy-gradients/) +- [RL - Policy Gradient Explained](https://jonathan-hui.medium.com/rl-policy-gradients-explained-9b13b688b146) +- [Chapter 13, Policy Gradient Methods; Reinforcement Learning, an introduction by Richard Sutton and Andrew G. Barto](http://incompleteideas.net/book/RLbook2020.pdf) + +## Implementation + +- [PyTorch Reinforce implementation](https://github.com/pytorch/examples/blob/main/reinforcement_learning/reinforce.py) +- [Implementations from DDPG to PPO](https://github.com/MrSyee/pg-is-all-you-need) diff --git a/units/en/unit4/conclusion.mdx b/units/en/unit4/conclusion.mdx new file mode 100644 index 0000000..2f6b6e3 --- /dev/null +++ b/units/en/unit4/conclusion.mdx @@ -0,0 +1 @@ +# Conclusion diff --git a/units/en/unit4/hands-on.mdx b/units/en/unit4/hands-on.mdx new file mode 100644 index 0000000..3654eda --- /dev/null +++ b/units/en/unit4/hands-on.mdx @@ -0,0 +1 @@ +# Hands-on diff --git a/units/en/unit4/introduction.mdx b/units/en/unit4/introduction.mdx index 8f63e49..aa96750 100644 --- a/units/en/unit4/introduction.mdx +++ b/units/en/unit4/introduction.mdx @@ -12,52 +12,12 @@ Because, in value-based, ** \\(π\\) exists only because of the action value es But, with policy-based methods, we want to optimize the policy directly **without having an intermediate step of learning a value function.** -So today, **we'll learn about policy-based methods, and we'll study a subset of these methods called Policy Gradients**. Then we'll implement our first policy gradient algorithm called Monte Carlo **Reinforce** from scratch using PyTorch. Before testing its robustness using CartPole-v1, PixelCopter, and Pong. 
+So today, **we'll learn about policy-based methods, and we'll study a subset of these methods called Policy Gradients**. Then we'll implement our first policy gradient algorithm, Monte Carlo **Reinforce**, from scratch using PyTorch, before testing its robustness using CartPole-v1 and PixelCopter.
+
+You'll then be able to iterate on and improve this implementation for more advanced environments.
Environments
Let's get started, - - - - - - - -Now that we have seen the big picture of Policy-Gradient and its advantages and disadvantages, **let's study and implement one of them**: Reinforce. - -## Reinforce (Monte Carlo Policy Gradient) - -Reinforce, also called Monte-Carlo Policy Gradient, **uses an estimated return from an entire episode to update the policy parameter** \\(\theta\\). - - -Now that we studied the theory behind Reinforce, **you’re ready to code your Reinforce agent with PyTorch**. And you'll test its robustness using CartPole-v1, PixelCopter, and Pong. - -Start the tutorial here 👉 https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit5/unit5.ipynb - -The leaderboard to compare your results with your classmates 🏆 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard - -
- Environments -
----
-
-Congrats on finishing this chapter! There was a lot of information. And congrats on finishing the tutorial. You’ve just coded your first Deep Reinforcement Learning agent from scratch using PyTorch and shared it on the Hub 🥳.
-
-It's **normal if you still feel confused** with all these elements. **This was the same for me and for all people who studied RL.**
-
-Take time to really grasp the material before continuing.
-
-Don't hesitate to train your agent in other environments. The **best way to learn is to try things on your own!**
-
-We published additional readings in the syllabus if you want to go deeper 👉 **[https://github.com/huggingface/deep-rl-class/blob/main/unit5/README.md](https://github.com/huggingface/deep-rl-class/blob/main/unit5/README.md)**
-
-In the next unit, we’re going to learn about a combination of Policy-Based and value-based methods called Actor Critic Methods.
-
-And don't forget to share with your friends who want to learn 🤗!
-
-Finally, we want **to improve and update the course iteratively with your feedback**. If you have some, please fill this form 👉 **[https://forms.gle/3HgA7bEHwAmmLfwh9](https://forms.gle/3HgA7bEHwAmmLfwh9)**
-
-### **Keep learning, stay awesome 🤗,**
diff --git a/units/en/unit4/policy-gradient.mdx b/units/en/unit4/policy-gradient.mdx
index b9ea7db..e69de29 100644
--- a/units/en/unit4/policy-gradient.mdx
+++ b/units/en/unit4/policy-gradient.mdx
@@ -1,110 +0,0 @@
-# Diving deeper into policy-gradient methods
-
-## Getting the big picture
-
-We just learned that the goal of policy-gradient methods is to find parameters \\(\theta\\) that maximize the expected return.
-
-The idea is that we have a *parameterized stochastic policy*. In our case, a neural network that outputs a probability distribution over actions. The probability of taking each action is also called the *action preference*.
-
-If we take the example of CartPole-v1:
-- As input, we have a state.
-- As output, we have a probability distribution over actions at that state.
-
-Policy based
-
-Our goal with policy-gradient is to **control the probability distribution of actions** by tuning the policy such that **good actions (that maximize the return) are sampled more frequently in the future.**
-Each time the agent interacts with the environment, we tweak the parameters such that good actions are more likely to be sampled in the future.
-
-But how are we going to optimize the weights using the expected return?
-
-The idea is that we **let the agent interact during an episode**. If we win the episode, we consider that each action taken was good and should be sampled more often in the future, since it led to the win.
-
-So for each (state, action) pair, we want to increase \\(P(a|s)\\): the probability of taking that action at that state. Or decrease it if we lost.
-
-The Policy-gradient algorithm (simplified) looks like this:
- Policy Gradient Big Picture -
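To make the idea of a parameterized stochastic policy concrete, here is a minimal sketch (a hypothetical NumPy illustration; the course notebook uses a PyTorch neural network) of such a policy for CartPole-v1, which maps a 4-dimensional state to a probability distribution over its 2 actions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minimal linear policy for CartPole-v1:
# 4 state dimensions in, 2 action probabilities out.
theta = rng.normal(scale=0.1, size=(4, 2))  # the policy parameters

def softmax(logits):
    z = np.exp(logits - logits.max())  # shift for numerical stability
    return z / z.sum()

def policy(state, theta):
    """Return a probability distribution over the 2 actions for this state."""
    return softmax(state @ theta)

state = np.array([0.01, -0.02, 0.03, 0.04])  # an example CartPole state
probs = policy(state, theta)                 # e.g. the "action preferences"
action = rng.choice(2, p=probs)              # sample an action from the distribution
print(probs, action)
```

Tuning `theta` changes `probs`, which is exactly what the simplified algorithm above does after each interaction.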
- -Now that we got the big picture, let's dive deeper into policy-gradient. - -## Diving deeper into policy-gradient - -We have our stochastic policy \\(\pi\\) which has a parameter \\(\theta\\). This \\(\pi\\), given a state, **outputs a probability distribution of actions**. - -
- Policy -
-
-Where \\(\pi_\theta(a_t|s_t)\\) is the probability of the agent selecting action \\(a_t\\) from state \\(s_t\\), given our policy.
-
-**But how do we know if our policy is good?** We need a way to measure it. For that, we define a score/objective function called \\(J(\theta)\\).
-
-### The Objective Function
-
-The objective function gives us the **performance of the agent** given a trajectory (a state-action sequence, without taking the rewards into account, contrary to an episode): it outputs the *expected cumulative reward*.
-
-Return
-
-Let's detail this formula a little more:
-- The *expected return* (also called the expected cumulative reward) is the weighted average (where the weights are given by \\(P(\tau;\theta)\\)) of all possible values that the return \\(R(\tau)\\) can take.
-
-Return
-
-- \\(R(\tau)\\) : Return from an arbitrary trajectory. To take this quantity and use it to calculate the expected return, we need to multiply it by the probability of each possible trajectory.
-- \\(P(\tau;\theta)\\) : Probability of each possible trajectory \\(\tau\\) (that probability depends on \\(\theta\\), since it defines the policy used to select the actions of the trajectory, which has an impact on the states visited).
-- \\(J(\theta)\\) : Expected return; we calculate it by summing, over all trajectories, the probability of taking that trajectory given \\(\theta\\) multiplied by the return of that trajectory.
-
-Our objective then is to maximize the expected cumulative reward by finding the \\(\theta\\) that outputs the best action probability distributions:
-
-Max objective
-
-## Gradient Ascent and the Policy-gradient Theorem
-
-Policy-gradient is an optimization problem: we want to find the values of \\(\theta\\) that maximize our objective function \\(J(\theta)\\), so we need to use **gradient ascent**. It's the inverse of *gradient descent*, since it moves in the direction of the steepest increase of \\(J(\theta)\\).
-Our update step for gradient ascent is:
-
-\\(\theta \leftarrow \theta + \alpha * \nabla_\theta J(\theta)\\)
-
-We can repeatedly apply this update step in the hope that \\(\theta\\) converges to the value that maximizes \\(J(\theta)\\).
-
-However, there are two problems with computing the derivative of \\(J(\theta)\\):
-1. We can't calculate the true gradient of the objective function, since that would require calculating the probability of each possible trajectory, which is computationally super expensive.
-So instead, we **calculate an estimation of the gradient with a sample-based estimate (collect some trajectories)**.
-
-2. There is another problem, detailed in the optional next section: to differentiate this objective function, we need to differentiate the state distribution (attached to the environment, it gives us the probability of the environment going into the next state, given the current state and the action taken), which we might not know.
-
-Fortunately, we're going to use a solution called the Policy Gradient Theorem, which will help us reformulate the objective function into a differentiable function that does not involve differentiating the state distribution.
-
-Policy Gradient
-
-## The policy-gradient algorithm
-
-The Reinforce algorithm works like this:
-Loop:
-- Use the policy \\(\pi_\theta\\) to collect an episode \\(\tau\\)
-- Use the episode to estimate the gradient \\(\hat{g} = \nabla_\theta J(\theta)\\)
- Policy Gradient -
-
-- Update the weights of the policy: \\(\theta \leftarrow \theta + \alpha \hat{g}\\)
-
-We can interpret this as follows:
-- \\(\nabla_\theta \log \pi_\theta(a_t|s_t)\\) is the direction of the **steepest increase of the (log) probability** of selecting action \\(a_t\\) from state \\(s_t\\).
-=> This tells us **how we should change the weights of the policy** if we want to increase/decrease the log probability of selecting action \\(a_t\\) at state \\(s_t\\).
-- \\(R(\tau)\\) is the scoring function:
-  - If the return is high, it will push up the probabilities of the (state, action) combinations.
-  - Else, if the return is low, it will push down the probabilities of the (state, action) combinations.
-
-We can also collect multiple episodes to estimate the gradient:
- Policy Gradient -
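As a sketch of how this sample-based gradient estimate can be computed (a hypothetical NumPy illustration using a linear-softmax policy, for which \\(\nabla_\theta \log \pi_\theta(a|s)\\) has a simple closed form; the hands-on notebook relies on PyTorch autograd instead):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def grad_log_pi(theta, state, action):
    # For a linear-softmax policy pi(a|s) = softmax(s @ theta),
    # grad_theta log pi(a|s) = outer(s, one_hot(a) - pi(.|s)).
    probs = softmax(state @ theta)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    return np.outer(state, one_hot - probs)

theta = np.zeros((4, 2))  # policy parameters: 4 state features, 2 actions

# m collected episodes: each a list of (state, action) pairs plus its return R(tau).
# Here they are randomly generated stand-ins for real rollouts.
episodes = [
    ([(rng.normal(size=4), int(rng.integers(2))) for _ in range(10)], 1.0)
    for _ in range(3)
]

# g_hat = (1/m) * sum over episodes of [ sum_t grad log pi(a_t|s_t) ] * R(tau)
g_hat = np.mean(
    [sum(grad_log_pi(theta, s, a) for s, a in steps) * R for steps, R in episodes],
    axis=0,
)

theta = theta + 0.01 * g_hat  # gradient-ascent update: theta <- theta + alpha * g_hat
print(g_hat.shape)
```

Averaging over several episodes reduces the variance of the estimate, which is the point of collecting multiple trajectories.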
diff --git a/units/en/unit4/quiz.mdx b/units/en/unit4/quiz.mdx
new file mode 100644
index 0000000..ccbc2ec
--- /dev/null
+++ b/units/en/unit4/quiz.mdx
@@ -0,0 +1 @@
+# Quiz
diff --git a/units/en/unit4/reinforce.mdx b/units/en/unit4/reinforce.mdx
new file mode 100644
index 0000000..16a5db9
--- /dev/null
+++ b/units/en/unit4/reinforce.mdx
@@ -0,0 +1,39 @@
+# Monte Carlo Policy Gradient (Reinforce)
+
+
+
+Now that we have seen the big picture of Policy-Gradient and its advantages and disadvantages, **let's study and implement one of them**: Reinforce.
+
+## Reinforce (Monte Carlo Policy Gradient)
+
+Reinforce, also called Monte-Carlo Policy Gradient, **uses an estimated return from an entire episode to update the policy parameter** \\(\theta\\).
+
+
+Now that we've studied the theory behind Reinforce, **you’re ready to code your Reinforce agent with PyTorch**, and you'll test its robustness using CartPole-v1 and PixelCopter.
+
+Start the tutorial here 👉 https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit5/unit5.ipynb
+
+The leaderboard to compare your results with your classmates 🏆 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard
+
+
+ Environments +
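Concretely, the "estimated return from an entire episode" is the discounted sum of the rewards collected during that episode. A minimal sketch (a hypothetical helper for illustration, not the notebook's exact code):

```python
def discounted_return(rewards, gamma=0.99):
    """Monte Carlo return of an episode: G = r_1 + gamma*r_2 + gamma^2*r_3 + ..."""
    G = 0.0
    for r in reversed(rewards):  # accumulate backwards so each reward is discounted once
        G = r + gamma * G
    return G

# Example: three rewards of 1 with gamma = 0.5 -> 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

This whole-episode return is what Reinforce multiplies the log-probability gradients by when updating \\(\theta\\).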
+---
+
+Congrats on finishing this chapter! There was a lot of information. And congrats on finishing the tutorial. You’ve just coded your first Deep Reinforcement Learning agent from scratch using PyTorch and shared it on the Hub 🥳.
+
+It's **normal if you still feel confused** by all these elements. **It was the same for me and for everyone who has studied RL.**
+
+Take time to really grasp the material before continuing.
+
+Don't hesitate to train your agent in other environments. The **best way to learn is to try things on your own!**
+
+We published additional readings in the syllabus if you want to go deeper 👉 **[https://github.com/huggingface/deep-rl-class/blob/main/unit5/README.md](https://github.com/huggingface/deep-rl-class/blob/main/unit5/README.md)**
+
+In the next unit, we’re going to learn about a combination of policy-based and value-based methods called Actor-Critic methods.
+
+And don't forget to share with your friends who want to learn 🤗!
+
+Finally, we want **to improve and update the course iteratively with your feedback**. If you have some, please fill out this form 👉 **[https://forms.gle/3HgA7bEHwAmmLfwh9](https://forms.gle/3HgA7bEHwAmmLfwh9)**
+
+### **Keep learning, stay awesome 🤗,**