Mirror of https://github.com/huggingface/deep-rl-class.git (synced 2026-04-05 19:48:04 +08:00)
Adding from Unit 0 to Unit 5
90
chapters/en/_toctree.yml
Normal file
@@ -0,0 +1,90 @@
- title: Unit 0. Welcome to the course
  sections:
  - local: unit0/introduction
    title: Welcome to the course 🤗
  - local: unit0/setup
    title: Setup
  - local: unit0/discord101
    title: Discord 101
- title: Unit 1. Introduction to Deep Reinforcement Learning
  sections:
  - local: unit1/introduction
    title: Introduction
  - local: unit1/what-is-rl
    title: What is Reinforcement Learning?
  - local: unit1/rl-framework
    title: The Reinforcement Learning Framework
  - local: unit1/tasks
    title: The types of tasks
  - local: unit1/exp-exp-tradeoff
    title: The Exploration/Exploitation tradeoff
  - local: unit1/two-methods
    title: The two main approaches for solving RL problems
  - local: unit1/deep-rl
    title: The “Deep” in Deep Reinforcement Learning
  - local: unit1/summary
    title: Summary
  - local: unit1/hands-on
    title: Hands-on
  - local: unit1/quiz
    title: Quiz
  - local: unit1/conclusion
    title: Conclusion
- title: Bonus Unit 1. Introduction to Deep Reinforcement Learning with Huggy
  sections:
  - local: unitbonus1/introduction
    title: Introduction
- title: Unit 2. Introduction to Q-Learning
  sections:
  - local: unit2/introduction
    title: Introduction
  - local: unit2/what-is-rl
    title: What is RL? A short recap
  - local: unit2/two-types-value-based-methods
    title: The two types of value-based methods
  - local: unit2/bellman-equation
    title: "The Bellman Equation: simplify our value estimation"
  - local: unit2/mc-vs-td
    title: Monte Carlo vs Temporal Difference Learning
  - local: unit2/summary1
    title: Summary
  - local: unit2/quiz1
    title: First Quiz
  - local: unit2/q-learning
    title: Introducing Q-Learning
  - local: unit2/q-learning-example
    title: A Q-Learning example
  - local: unit2/hands-on
    title: Hands-on
  - local: unit2/quiz2
    title: Second Quiz
  - local: unit2/conclusion
    title: Conclusion
  - local: unit2/additional-reading
    title: Additional Reading
- title: Unit 3. Deep Q-Learning with Atari Games
  sections:
  - local: unit3/introduction
    title: Introduction
  - local: unit3/from-q-to-dqn
    title: From Q-Learning to Deep Q-Learning
  - local: unit3/deep-q-network
    title: The Deep Q-Network (DQN)
  - local: unit3/deep-q-algorithm
    title: The Deep Q Algorithm
  - local: unit3/hands-on
    title: Hands-on
  - local: unit3/quiz
    title: Quiz
  - local: unit3/conclusion
    title: Conclusion
  - local: unit3/additional-reading
    title: Additional Reading
- title: Bonus Unit 2. Automatic Hyperparameter Tuning with Optuna
  sections:
  - local: unitbonus2/introduction
    title: Introduction
  - local: unitbonus2/optuna
    title: Optuna
  - local: unitbonus2/hands-on
    title: Hands-on
33
chapters/en/unit0/discord101.mdx
Normal file
@@ -0,0 +1,33 @@
# Discord 101 [[discord-101]]

Hey there! My name is Huggy, the dog 🐕, and I'm looking forward to training with you during this RL Course!

Although I don't know much about fetching sticks (yet), I know one or two things about Discord. So I wrote this guide to help you learn about it!

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/huggy-logo.jpg" alt="Huggy Logo"/>

Discord is a free chat platform. If you've used Slack, **it's quite similar**. There is a Hugging Face Community Discord server with 8000 members that you can <a href="https://discord.gg/ydHrjt3WP5">join with a single click here</a>. So many humans to play with!

Starting out on Discord can be a bit intimidating, so let me take you through it.

When you sign up to our Discord server, you need to **introduce yourself in the `introduce-yourself` channel**. Then, sign up for the channel groups that interest you in `role-assignment`.

## So which channels are interesting to me? [[channels]]

They are in the reinforcement learning lounge. **Don't forget to sign up to these channels** by clicking on 🤖 Reinforcement Learning in `role-assignment`.

- `rl-announcements`: where we share the **latest information about the course**.
- `rl-discussions`: where you can **discuss RL and share information**.
- `rl-study-group`: where you can **create and join study groups**.

The HF Community Server has a thriving community of humans interested in many areas, so you can also learn from them. There are paper discussions, events, and many other things happening in the other community channels.

Before you go, there are a couple of tips I can share with you:

- There are **voice channels** you can use as well, although most people prefer text chat.
- You can **use markdown style**. So if you're writing code, you can use that style. Sadly this does not work as well for links.
- You can open threads as well! It's a good idea when **it's a long conversation**.

I hope this is useful! And if you have questions, just ask!

See you later!

Huggy
123
chapters/en/unit0/introduction.mdx
Normal file
@@ -0,0 +1,123 @@
# Welcome to the 🤗 Deep Reinforcement Learning Course [[introduction]]

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/thumbnail.jpg" alt="Deep RL Course thumbnail" width="100%"/>

Welcome to the most fascinating topic in Artificial Intelligence: Deep Reinforcement Learning.

This course will **teach you about Deep Reinforcement Learning from beginner to expert**. It’s completely free.

In this unit you’ll:

- Learn more about the **course content**.
- **Define the path** you’re going to take (either self-audit or certification process).
- Learn more about the **AI vs. AI challenges** you're going to participate in.
- Learn more **about us**.
- **Create your Hugging Face account** (it’s free).
- **Sign up for our Discord server**, the place where you can exchange with your classmates and with us.

Let’s get started!

## What to expect? [[expect]]

In this course, you will:

- 📖 Study Deep Reinforcement Learning in **theory and practice.**
- 🧑‍💻 Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, Sample Factory and CleanRL.
- 🤖 **Train agents in unique environments** such as SnowballFight, Huggy the Doggo 🐶, MineRL (Minecraft ⛏️), VizDoom (Doom), and classical ones such as Space Invaders and PyBullet.
- 💾 Publish your **trained agents in one line of code to the Hub**, and download powerful agents from the community.
- 🏆 Participate in challenges where you will **evaluate your agents against other teams, and play against the AI you'll train.**

And more!

At the end of this course, **you’ll get a solid foundation, from the basics to the SOTA (state-of-the-art) methods**.

You can find the syllabus on our website 👉 <a href="https://simoninithomas.github.io/deep-rl-course/">here</a>

Don’t forget to **<a href="http://eepurl.com/ic5ZUD">sign up to the course</a>** (we collect your email to be able to **send you the links when each Unit is published and to give you information about the challenges and updates).**

Sign up 👉 <a href="http://eepurl.com/ic5ZUD">here</a>

## What does the course look like? [[course-look-like]]

The course is composed of:

- *A theory part*: where you learn a **concept in theory (article)**.
- *A hands-on part*: with a **weekly live hands-on session** on ADD DATE every week at ADD TIME, where you'll learn to use famous Deep RL libraries such as Stable Baselines3, RL Baselines3 Zoo, and RLlib to train your agents in unique environments such as SnowballFight, Huggy the Doggo dog, and classical ones such as Space Invaders and PyBullet.
We strongly advise you to participate in the live sessions so that you can ask questions, but if you can't, the sessions are recorded and will be posted.
- *Challenges*, such as AI vs. AI and the leaderboard.

## Two paths: choose your own adventure [[two-paths]]

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/two-paths.jpg" alt="Two paths" width="100%"/>

You can choose to follow this course either:

- *To get a certificate of completion*: you need to complete 80% of the assignments before the end of March 2023.
- *As a simple audit*: you can participate in all challenges and do assignments if you want, but you have no deadlines.

Whatever path you choose, we advise you **to follow the recommended pace to enjoy the course and challenges with the most classmates.**
You don't need to tell us which path you choose. At the end of March, when we verify the assignments, **if you got more than 80% of the assignments done, you'll get a certificate.**

## How to get the most out of the course? [[advice]]

To get the most out of the course, we have some advice:

1. <a href="https://discord.gg/ydHrjt3WP5">Join or create study groups in Discord</a>: studying in groups is always easier. To do that, you need to join our Discord server.
2. **Do the quizzes and assignments**: the best way to learn is to do and to test yourself.
3. **Define a schedule to stay in sync: you can use our recommended pace schedule below or create your own.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/advice.jpg" alt="Course advice" width="100%"/>

## What tools do I need? [[tools]]

You only need 3 things:

- A computer with an internet connection.
- Google Colab (free version): most of our hands-on will use Google Colab; the **free version is enough.**
- A Hugging Face account: to push and load models. If you don’t have an account yet, you can create one here (it’s free).

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/tools.jpg" alt="Course tools needed" width="100%"/>

## What is the recommended pace? [[recommended-pace]]

We defined a schedule that you can follow to keep pace with the course.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/pace1.jpg" alt="Course pace" width="100%"/>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/pace2.jpg" alt="Course pace" width="100%"/>

Each chapter in this course is designed **to be completed in 1 week, with approximately 3-4 hours of work per week**. However, you can take as much time as you need to complete the course.

## Who are we [[who-are-we]]

About the authors:

<a href="https://twitter.com/ThomasSimonini">Thomas Simonini</a> is a Developer Advocate at Hugging Face 🤗 specializing in Deep Reinforcement Learning. He founded the Deep Reinforcement Learning Course in 2018, which became one of the most used courses in Deep RL.

ADD OMAR

## When do the challenges start? [[challenges]]

In this new version of the course, you have two types of challenges:

- A leaderboard to compare your agent's performance to other classmates'.
- AI vs. AI challenges where you can train your agent and compete against other classmates' agents.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/challenges.jpg" alt="Challenges" width="100%"/>

These AI vs. AI challenges will be announced **later in December**.

## I found a bug, or I want to improve the course [[contribute]]

Contributions are welcome 🤗

- If you *found a bug 🐛 in a notebook*, please <a href="https://github.com/huggingface/deep-rl-class/issues">open an issue</a> and **describe the problem**.
- If you *want to improve the course*, you can <a href="https://github.com/huggingface/deep-rl-class/pulls">open a Pull Request</a>.

## I still have questions [[questions]]

In that case, <a href="https://simoninithomas.github.io/deep-rl-course/#faq">check our FAQ</a>. And if the question is not in it, ask it in our <a href="https://discord.gg/ydHrjt3WP5">Discord server #rl-discussions</a>.
30
chapters/en/unit0/setup.mdx
Normal file
@@ -0,0 +1,30 @@
# Setup [[setup]]

After all this information, it's time to get started. We're going to do two things:

1. Create your Hugging Face account, if you haven't already.
2. Sign up to Discord and introduce yourself (don't be shy 🤗).

### Let's create my Hugging Face account

(If you haven't already) create a Hugging Face account <a href="https://huggingface.co/join">here</a>.

### Let's join our Discord server

You can now sign up for our Discord server. This is the place where you **can exchange with the community and with us, create and join study groups to help each other grow, and more**.

👉🏻 Join our Discord server <a href="https://discord.gg/ydHrjt3WP5">here</a>.

When you join, remember to introduce yourself in #introduce-yourself and sign up for the reinforcement learning channels in #role-assignments.

We have multiple RL-related channels:
- `rl-announcements`: where we share the latest information about the course.
- `rl-discussions`: where you can discuss RL and share information.
- `rl-study-group`: where you can create and join study groups.

If this is your first time using Discord, we wrote a Discord 101 guide covering the best practices. Check the next section.

Congratulations! **You've just finished the onboarding**. You're now ready to start learning Deep Reinforcement Learning. Have fun!

### Keep Learning, stay awesome 🤗
15
chapters/en/unit1/conclusion.mdx
Normal file
@@ -0,0 +1,15 @@
# Conclusion [[conclusion]]

Congrats on finishing this chapter! **That was the biggest one**, and there was a lot of information. And congrats on finishing the tutorial. You’ve just trained your first Deep RL agents and shared them on the Hub 🥳.

It’s **normal if you still feel confused by all these elements**. This was the same for me and for everyone who has studied RL.

**Take time to really grasp the material** before continuing. It’s important to master these elements and to have a solid foundation before entering the fun part.

Naturally, during the course, we’re going to use and explain these terms again, but it’s better to understand them before diving into the next chapters.

In the next chapter, we’re going to reinforce what we just learned by **training Huggy the Dog to fetch the stick**.

You will then be able to play with him 🤗.

ADD GIF HUGGY
21
chapters/en/unit1/deep-rl.mdx
Normal file
@@ -0,0 +1,21 @@
# The “Deep” in Reinforcement Learning [[deep-rl]]

<Tip>
What we've talked about so far is Reinforcement Learning. But where does the "Deep" come into play?
</Tip>

Deep Reinforcement Learning introduces **deep neural networks to solve Reinforcement Learning problems** — hence the name “deep”.

For instance, in the next article, we’ll work on Q-Learning (classic Reinforcement Learning) and then Deep Q-Learning; both are value-based RL algorithms.

You’ll see the difference is that, in the first approach, **we use a traditional algorithm** to create a Q-table that helps us find what action to take for each state.

In the second approach, **we use a Neural Network** (to approximate the Q-value).

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/deep.jpg" alt="Value based RL"/>
<figcaption>Schema inspired by the Q-learning notebook by Udacity
</figcaption>
</figure>

If you are not familiar with Deep Learning, you should definitely check out <a href="https://course.fast.ai/">the fastai Practical Deep Learning for Coders course (free)</a>.
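To make the contrast concrete, here is a minimal sketch (not from the course; the environment sizes and the single untrained linear layer standing in for a "network" are illustrative assumptions) of the two approaches:

```python
import numpy as np

# Tabular Q-Learning: the Q-table maps each (state, action) pair to a value.
n_states, n_actions = 16, 4
q_table = np.zeros((n_states, n_actions))
q_table[3, 1] = 0.8  # suppose training taught us that action 1 is good in state 3

def act_tabular(state):
    # Greedy action selection is just a table lookup
    return int(np.argmax(q_table[state]))

# Deep Q-Learning: a network approximates state -> Q-values for every action.
# A single (untrained) linear layer stands in for the network here.
rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=(n_states, n_actions))

def act_deep(state_index):
    state_one_hot = np.eye(n_states)[state_index]
    q_values = state_one_hot @ weights  # forward pass approximating the Q-values
    return int(np.argmax(q_values))

print(act_tabular(3))  # -> 1: read directly from the table
```

The table works when states are few and enumerable; the approximator is what lets Deep Q-Learning scale to state spaces (like game frames) far too large for a table.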
36
chapters/en/unit1/exp-exp-tradeoff.mdx
Normal file
@@ -0,0 +1,36 @@
# The Exploration/Exploitation tradeoff [[exp-exp-tradeoff]]

Finally, before looking at the different methods to solve Reinforcement Learning problems, we must cover one more very important topic: *the exploration/exploitation trade-off.*

- *Exploration* is exploring the environment by trying random actions in order to **find more information about the environment.**
- *Exploitation* is **exploiting known information to maximize the reward.**

Remember, the goal of our RL agent is to maximize the expected cumulative reward. However, **we can fall into a common trap**.

Let’s take an example:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/exp_1.jpg" alt="Exploration" width="100%">

In this game, our mouse can have an **infinite amount of small cheese** (+1 each). But at the top of the maze, there is a gigantic sum of cheese (+1000).

However, if we only focus on exploitation, our agent will never reach the gigantic sum of cheese. Instead, it will only exploit **the nearest source of rewards,** even if this source is small.

But if our agent does a little bit of exploration, it can **discover the big reward** (the pile of big cheese).

This is what we call the exploration/exploitation trade-off. We need to balance how much we **explore the environment** and how much we **exploit what we know about the environment.**

Therefore, we must **define a rule that helps to handle this trade-off**. We’ll see different ways to handle it in future chapters.

If it’s still confusing, **think of a real problem: the choice of a restaurant:**

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/exp_2.jpg" alt="Exploration">
<figcaption>Source: <a href="http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_exploration.pdf">Berkeley AI Course</a>
</figcaption>
</figure>

- *Exploitation*: You go to the same restaurant every day, one you know is good, and **risk missing out on a better one.**
- *Exploration*: You try restaurants you’ve never been to before, with the risk of a bad experience **but the chance of a fantastic one.**

To recap:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/expexpltradeoff.jpg" alt="Exploration Exploitation Tradeoff" width="100%">
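One simple rule for handling this trade-off is epsilon-greedy, shown here as a minimal sketch (the course introduces concrete strategies in later units; the function name and values below are illustrative assumptions):

```python
import random

# Epsilon-greedy: with probability epsilon, explore (pick a random action);
# otherwise exploit the best-known action.
def epsilon_greedy(q_values, epsilon):
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

q_values = [0.1, 0.9, 0.3]  # estimated value of each action
print(epsilon_greedy(q_values, epsilon=0.0))  # -> 1: epsilon=0 means pure exploitation
```

A high epsilon early in training favors exploration (finding the big cheese); decaying it over time shifts the agent toward exploiting what it has learned.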
13
chapters/en/unit1/hands-on.mdx
Normal file
@@ -0,0 +1,13 @@
# Hands on [[hands-on]]

Now that you've studied the basics of Reinforcement Learning, you’re ready to train your first two agents and share them with the community through the Hub 🔥:

- A Lunar Lander agent that will learn to land correctly on the Moon 🌕
- A car that needs to reach the top of the mountain ⛰️.

TODO: Add illustration MountainCar and MoonLanding

Thanks to our <a href="https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard">leaderboard</a>, you'll be able to compare your results with other classmates and exchange best practices to improve your agents' scores. Who will win the challenge for Unit 1 🏆?

So let's get started! 🚀
26
chapters/en/unit1/introduction.mdx
Normal file
@@ -0,0 +1,26 @@
# Introduction to Deep Reinforcement Learning [[introduction-to-deep-reinforcement-learning]]

TODO: ADD IMAGE THUMBNAIL

Welcome to the most fascinating topic in Artificial Intelligence: **Deep Reinforcement Learning.**

Deep RL is a type of Machine Learning where an agent learns **how to behave** in an environment **by performing actions** and **seeing the results.**

So in this first chapter, **you'll learn the foundations of Deep Reinforcement Learning.**

Then, you'll **train your first two Deep Reinforcement Learning agents** using <a href="https://stable-baselines3.readthedocs.io/en/master/">Stable-Baselines3</a>, a Deep Reinforcement Learning library:

1. A Lunar Lander agent that will learn to **land correctly on the Moon 🌕**
2. A car that needs **to reach the top of the mountain ⛰️**.

TODO: Add illustration MountainCar and MoonLanding

And finally, you'll **upload them to the Hugging Face Hub 🤗, a free, open platform where people can share ML models, datasets, and demos.**

TODO: ADD model card illustration

It's essential **to master these elements** before diving into implementing Deep Reinforcement Learning agents. The goal of this chapter is to give you solid foundations.

So let's get started! 🚀
168
chapters/en/unit1/quiz.mdx
Normal file
@@ -0,0 +1,168 @@
# Quiz [[quiz]]

The best way to learn and [to avoid the illusion of competence](https://fr.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you find **where you need to reinforce your knowledge**.

### Q1: What is Reinforcement Learning?

<details>
<summary>Solution</summary>

Reinforcement learning is a **framework for solving control tasks (also called decision problems)** by building agents that learn from the environment by interacting with it through trial and error and **receiving rewards (positive or negative) as unique feedback**.

</details>

### Q2: Define the RL Loop

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rl-loop-ex.jpg" alt="Exercise RL Loop"/>

At every step:
- Our Agent receives ______ from the environment
- Based on that ______ the Agent takes an ______
- Our Agent will move to the right
- The Environment goes to a ______
- The Environment gives a ______ to the Agent

<Question
choices={[
{
text: "an action a0, action a0, state s0, state s1, reward r1",
explain: "At every step: Our Agent receives **state s0** from the environment. Based on that **state s0** the Agent takes an **action a0**. Our Agent will move to the right. The Environment goes to a **new state s1**. The Environment gives **a reward r1** to the Agent."
},
{
text: "state s0, state s0, action a0, new state s1, reward r1",
explain: "At every step: Our Agent receives **state s0** from the environment. Based on that **state s0** the Agent takes an **action a0**. Our Agent will move to the right. The Environment goes to a **new state s1**. The Environment gives **a reward r1** to the Agent.",
correct: true
},
{
text: "a state s0, state s0, action a0, state s1, action a1",
explain: "At every step: Our Agent receives **state s0** from the environment. Based on that **state s0** the Agent takes an **action a0**. Our Agent will move to the right. The Environment goes to a **new state s1**. The Environment gives **a reward r1** to the Agent."
}
]}
/>

### Q3: What's the difference between a state and an observation?

<Question
choices={[
{
text: "The state is a complete description of the state of the world (there is no hidden information)",
explain: "",
correct: true
},
{
text: "The state is a partial description of the state",
explain: ""
},
{
text: "The observation is a complete description of the state of the world (there is no hidden information)",
explain: ""
},
{
text: "The observation is a partial description of the state",
explain: "",
correct: true
},
{
text: "We receive a state when we play with the chess environment",
explain: "Since we have access to the whole chessboard information.",
correct: true
},
{
text: "We receive an observation when we play with the chess environment",
explain: "Since we have access to the whole chessboard information, we receive a state."
},
{
text: "We receive a state when we play Super Mario Bros",
explain: "We only see a part of the level close to the player, so we receive an observation."
},
{
text: "We receive an observation when we play Super Mario Bros",
explain: "We only see a part of the level close to the player.",
correct: true
}
]}
/>

### Q4: A task is an instance of a Reinforcement Learning problem. What are the two types of tasks?

<Question
choices={[
{
text: "Episodic",
explain: "In an episodic task, we have a starting point and an ending point (a terminal state). This creates an episode: a list of States, Actions, Rewards, and new States. For instance, think about Super Mario Bros: an episode begins at the launch of a new Mario level and ends when you're killed or you reach the end of the level.",
correct: true
},
{
text: "Recursive",
explain: ""
},
{
text: "Adversarial",
explain: ""
},
{
text: "Continuing",
explain: "Continuing tasks are tasks that continue forever (no terminal state). In this case, the agent must learn how to choose the best actions and simultaneously interact with the environment.",
correct: true
}
]}
/>

### Q5: What is the exploration/exploitation tradeoff?

<details>
<summary>Solution</summary>

In Reinforcement Learning, we need to **balance how much we explore the environment and how much we exploit what we know about the environment**.

- *Exploration* is exploring the environment by **trying random actions in order to find more information about the environment**.

- *Exploitation* is **exploiting known information to maximize the reward**.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/expexpltradeoff.jpg" alt="Exploration Exploitation Tradeoff" width="100%">

</details>

### Q6: What is a policy?

<details>
<summary>Solution</summary>

- The Policy π **is the brain of our Agent**: it’s the function that tells us what action to take given the state we are in. So it defines the agent’s behavior at a given time.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy_1.jpg" alt="Policy">

</details>

### Q7: What are value-based methods?

<details>
<summary>Solution</summary>

- Value-based methods are one of the main approaches for solving RL problems.
- In value-based methods, instead of training a policy function, **we train a value function that maps a state to the expected value of being in that state**.

</details>

### Q8: What are policy-based methods?

<details>
<summary>Solution</summary>

- In *Policy-Based Methods*, we learn a **policy function directly**.
- This policy function **maps each state to the best corresponding action at that state**, or to a **probability distribution over the set of possible actions at that state**.

</details>

Congrats on finishing this Quiz 🥳! If you missed some elements, take time to read the chapter again to reinforce (😏) your knowledge.
146
chapters/en/unit1/rl-framework.mdx
Normal file
@@ -0,0 +1,146 @@
# The Reinforcement Learning Framework [[the-reinforcement-learning-framework]]

## The RL Process [[the-rl-process]]

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process.jpg" alt="The RL process" width="100%">
<figcaption>The RL Process: a loop of state, action, reward and next state</figcaption>
<figcaption>Source: <a href="http://incompleteideas.net/book/RLbook2020.pdf">Reinforcement Learning: An Introduction, Richard Sutton and Andrew G. Barto</a></figcaption>
</figure>

To understand the RL process, let’s imagine an agent learning to play a platform game:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process_game.jpg" alt="The RL process" width="100%">

- Our Agent receives **state \\(S_0\\)** from the **Environment** — we receive the first frame of our game (Environment).
- Based on that **state \\(S_0\\),** the Agent takes **action \\(A_0\\)** — our Agent will move to the right.
- The Environment goes to a **new state \\(S_1\\)** — new frame.
- The Environment gives some **reward \\(R_1\\)** to the Agent — we’re not dead *(Positive Reward +1)*.

This RL loop outputs a sequence of **state, action, reward and next state.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/sars.jpg" alt="State, Action, Reward, Next State" width="100%">

The agent's goal is to maximize its cumulative reward, **called the expected return.**
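The loop above can be sketched in a few lines with a toy environment. The names here (`ToyEnv`, `reset`, `step`) mirror common RL library conventions but are assumptions for illustration, not an API from this course:

```python
import random

# A toy environment: reset() returns the initial state S_0, and step(action)
# returns the next state, a reward, and whether the episode is over.
class ToyEnv:
    def reset(self):
        self.t = 0
        return 0  # initial state S_0

    def step(self, action):
        self.t += 1
        next_state = self.t          # the environment moves to a new state
        reward = 1                   # "we're not dead": positive reward +1
        done = self.t >= 5           # this toy episode ends after 5 steps
        return next_state, reward, done

env = ToyEnv()
state = env.reset()
total_reward, done = 0, False
while not done:
    action = random.choice([0, 1])            # the agent takes an action
    state, reward, done = env.step(action)    # gets the next state and reward
    total_reward += reward                    # accumulating toward the return
print(total_reward)  # -> 5
```

Every RL algorithm in this course runs some version of this loop; they differ only in how the agent picks `action` and what it learns from `(state, action, reward, next_state)`.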
## The reward hypothesis: the central idea of Reinforcement Learning [[reward-hypothesis]]

⇒ Why is the goal of the agent to maximize the expected return?

Because RL is based on the **reward hypothesis**, which states that all goals can be described as the **maximization of the expected return** (expected cumulative reward).

That’s why, in Reinforcement Learning, **to have the best behavior,** we need to **maximize the expected cumulative reward.**
## Markov Property [[markov-property]]
|
||||
|
||||
In papers, you’ll see that the RL process is called the **Markov Decision Process** (MDP).
|
||||
|
||||
We’ll talk again about the Markov Property in the following units. But if you need to remember something today about it, Markov Property implies that our agent needs **only the current state to decide** what action to take and **not the history of all the states** **and actions** they took before.
|
||||
|
||||
## Observations/States Space [[obs-space]]

Observations/States are the **information our agent gets from the environment.** In the case of a video game, it can be a frame (a screenshot). In the case of a trading agent, it can be the value of a certain stock, etc.

There is a differentiation to make between *observation* and *state*:

- *State s*: a **complete description of the state of the world** (there is no hidden information), received in a fully observed environment.

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/chess.jpg" alt="Chess">
<figcaption>In a chess game, we receive a state from the environment since we have access to the whole chessboard information.</figcaption>
</figure>

With a chess game, we are in a fully observed environment, since we have access to the whole chessboard information.

- *Observation o*: a **partial description of the state**, received in a partially observed environment.

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario">
<figcaption>In Super Mario Bros, we only see a part of the level close to the player, so we receive an observation.</figcaption>
</figure>

In Super Mario Bros, we are in a partially observed environment. We receive an observation **since we only see a part of the level.**

<Tip>
In reality, we use the term state in this course but we will make the distinction in implementations.
</Tip>

To recap:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/obs_space_recap.jpg" alt="Obs space recap" width="100%">
## Action Space [[action-space]]

The Action space is the set of **all possible actions in an environment.**

The actions can come from a *discrete* or *continuous* space:

- *Discrete space*: the number of possible actions is **finite**.

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario">
<figcaption>Again, in Super Mario Bros, only 4 directions and jump are possible</figcaption>
</figure>

In Super Mario Bros, we have a finite set of actions since we have only 4 directions and jump.

- *Continuous space*: the number of possible actions is **infinite**.

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/self_driving_car.jpg" alt="Self Driving Car">
<figcaption>A Self Driving Car agent has an infinite number of possible actions since it can turn left 20°, 21.1°, 21.2°, honk, turn right 20°…
</figcaption>
</figure>

To recap:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/action_space.jpg" alt="Action space recap" width="100%">

Taking this information into consideration is crucial because it will **have importance when choosing the RL algorithm in the future.**
## Rewards and the discounting [[rewards]]

The reward is fundamental in RL because it’s **the only feedback** for the agent. Thanks to it, our agent knows **if the action taken was good or not.**

The cumulative reward at each time step t can be written as:

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_1.jpg" alt="Rewards">
<figcaption>The cumulative reward equals the sum of all rewards of the sequence.
</figcaption>
</figure>

Which is equivalent to:

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_2.jpg" alt="Rewards">
<figcaption>The cumulative reward = r(t+1) + r(t+2) + ..., since r(t+k+1) with k=0 gives r(t+1), with k=1 gives r(t+2), and so on.
</figcaption>
</figure>

However, in reality, **we can’t just add them like that.** The rewards that come sooner (at the beginning of the game) **are more likely to happen** since they are more predictable than the long-term future reward.

Let’s say your agent is this tiny mouse that can move one tile each time step, and your opponent is the cat (that can move too). Your goal is **to eat the maximum amount of cheese before being eaten by the cat.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_3.jpg" alt="Rewards" width="100%">

As we can see in the diagram, **it’s more probable to eat the cheese near us than the cheese close to the cat** (the closer we are to the cat, the more dangerous it is).

Consequently, **the reward near the cat, even if it is bigger (more cheese), will be more discounted** since we’re not really sure we’ll be able to eat it.

To discount the rewards, we proceed like this:

1. We define a discount rate called gamma. **It must be between 0 and 1.** Most of the time it is between **0.95 and 0.99**.
- The larger the gamma, the smaller the discount. This means our agent **cares more about the long-term reward.**
- On the other hand, the smaller the gamma, the bigger the discount. This means our **agent cares more about the short-term reward (the nearest cheese).**

2. Then, each reward will be discounted by gamma to the exponent of the time step. As the time step increases, the cat gets closer to us, **so the future reward is less and less likely to happen.**

Our discounted expected cumulative reward is:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_4.jpg" alt="Rewards" width="100%">
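The discounting above can be sketched in a few lines of Python. This is a minimal illustration, not code from the course:

```python
def discounted_return(rewards, gamma=0.95):
    """Sum of gamma**k * r_{t+k+1} over a sequence of rewards."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [1, 0, 1, 1]
# With gamma = 1 there is no discount: the return is just the plain sum.
print(discounted_return(rewards, gamma=1.0))   # 3
# With a smaller gamma, later rewards count less and less.
print(discounted_return(rewards, gamma=0.5))   # 1 + 0 + 0.25 + 0.125 = 1.375
```

Note how the second call weights the cheese far in the future much less than the cheese right next to us.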
chapters/en/unit1/summary.mdx
# Summary [[summary]]

That was a lot of information. If we summarize:

- Reinforcement Learning is a computational approach to learning from actions. We build an agent that learns from the environment **by interacting with it through trial and error** and receiving rewards (negative or positive) as feedback.

- The goal of any RL agent is to maximize its expected cumulative reward (also called expected return) because RL is based on the **reward hypothesis**, which is that **all goals can be described as the maximization of the expected cumulative reward.**

- The RL process is a loop that outputs a sequence of **state, action, reward and next state.**

- To calculate the expected cumulative reward (expected return), we discount the rewards: the rewards that come sooner (at the beginning of the game) **are more likely to happen since they are more predictable than the long-term future reward.**

- To solve an RL problem, you want to **find an optimal policy**. The policy is the “brain” of your AI that will tell us **what action to take given a state.** The optimal one is the one that **gives you the actions that maximize the expected return.**

- There are two ways to find your optimal policy:
  1. By training your policy directly: **policy-based methods.**
  2. By training a value function that tells us the expected return the agent will get at each state, and using this function to define our policy: **value-based methods.**

- Finally, we speak about Deep RL because we introduce **deep neural networks to estimate the action to take (policy-based) or to estimate the value of a state (value-based)**, hence the name “deep”.
chapters/en/unit1/tasks.mdx
# Type of tasks [[tasks]]

A task is an **instance** of a Reinforcement Learning problem. We can have two types of tasks: episodic and continuing.

## Episodic task [[episodic-task]]

In this case, we have a starting point and an ending point **(a terminal state). This creates an episode**: a list of States, Actions, Rewards, and new States.

For instance, think about Super Mario Bros: an episode begins at the launch of a new Mario level and ends **when you’re killed or you reach the end of the level.**

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario">
<figcaption>Beginning of a new episode.
</figcaption>
</figure>

## Continuing tasks [[continuing-tasks]]

These are tasks that continue forever (no terminal state). In this case, the agent must **learn how to choose the best actions and simultaneously interact with the environment.**

For instance, an agent that does automated stock trading. For this task, there is no starting point or terminal state. **The agent keeps running until we decide to stop it.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/stock.jpg" alt="Stock Market" width="100%">

To recap:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/tasks.jpg" alt="Tasks recap" width="100%">
chapters/en/unit1/two-methods.mdx
# The two main approaches for solving RL problems [[two-methods]]

<Tip>
Now that we have learned the RL framework, how do we solve the RL problem?
</Tip>

In other words, how do we build an RL agent that can **select the actions that maximize its expected cumulative reward?**

## The Policy π: the agent’s brain [[policy]]

The Policy **π** is the **brain of our Agent**: it’s the function that tells us **what action to take given the state we are in.** So it **defines the agent’s behavior** at a given time.

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy_1.jpg" alt="Policy">
<figcaption>Think of the policy as the brain of our agent, the function that tells us the action to take given a state
</figcaption>
</figure>

This Policy **is the function we want to learn.** Our goal is to find the optimal policy π*, the policy that **maximizes the expected return** when the agent acts according to it. We find this π* **through training.**

There are two approaches to train our agent to find this optimal policy π*:

- **Directly,** by teaching the agent to learn which **action to take** given the state it is in: **Policy-Based Methods.**
- **Indirectly,** by teaching the agent to learn **which state is more valuable** and then take the action that **leads to the more valuable states**: **Value-Based Methods.**
## Policy-Based Methods [[policy-based]]

In Policy-Based Methods, **we learn a policy function directly.**

This function will map each state to the best corresponding action at that state, **or to a probability distribution over the set of possible actions at that state.**

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy_2.jpg" alt="Policy">
<figcaption>As we can see here, the policy (deterministic) <b>directly indicates the action to take for each step.</b>
</figcaption>
</figure>

We have two types of policy:

- *Deterministic*: a policy at a given state **will always return the same action.**

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy_3.jpg" alt="Policy"/>
<figcaption>action = policy(state)
</figcaption>
</figure>

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy_4.jpg" alt="Policy" width="100%"/>

- *Stochastic*: outputs **a probability distribution over actions.**

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy_5.jpg" alt="Policy"/>
<figcaption>policy(actions | state) = probability distribution over the set of actions given the current state
</figcaption>
</figure>

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario"/>
<figcaption>Given an initial state, our stochastic policy will output probability distributions over the possible actions at that state.
</figcaption>
</figure>
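The two policy types can be sketched in Python. The states, actions, and probabilities below are invented for illustration; they are not part of any environment or library:

```python
import random

ACTIONS = ["left", "right", "jump"]

def deterministic_policy(state):
    # action = policy(state): always the same action for a given state
    return "right" if state == "start" else "jump"

def stochastic_policy(state):
    # policy(action | state): a probability distribution over the actions
    return {"left": 0.1, "right": 0.7, "jump": 0.2}

def sample_action(state):
    # To act with a stochastic policy, we sample from its distribution
    probs = stochastic_policy(state)
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights, k=1)[0]

print(deterministic_policy("start"))  # always "right"
print(sample_action("start"))         # "right" about 70% of the time
```

The deterministic policy returns a single action; the stochastic one returns probabilities that must sum to 1, from which we sample.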
If we recap:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/pbm_1.jpg" alt="Pbm recap" width="100%">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/pbm_2.jpg" alt="Pbm recap" width="100%">
## Value-based methods [[value-based]]

In Value-based methods, instead of training a policy function, we **train a value function** that maps a state to the expected value **of being at that state.**

The value of a state is the **expected discounted return** the agent can get if it **starts in that state and then acts according to our policy.**

“Act according to our policy” just means that our policy is **“going to the state with the highest value”.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/value_1.jpg" alt="Value based RL" width="100%">

Here we see that our value function **defines a value for each possible state.**

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/value_2.jpg" alt="Value based RL"/>
<figcaption>Thanks to our value function, at each step our policy will select the state with the biggest value defined by the value function: -7, then -6, then -5 (and so on) to attain the goal.
</figcaption>
</figure>

If we recap:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/vbm_1.jpg" alt="Vbm recap" width="100%">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/vbm_2.jpg" alt="Vbm recap" width="100%">
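Acting greedily with respect to a value function can be sketched as follows. The state names, values, and transitions are made up for illustration (loosely mirroring the maze figure, where values grow toward the goal):

```python
# Hypothetical state values learned by a value function
values = {"A": -7, "B": -6, "C": -5, "goal": 0}
# Hypothetical transitions: which states are reachable from each state
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "goal"]}

def greedy_step(state):
    # "Act according to our policy" = move to the reachable state
    # with the highest value
    return max(neighbors[state], key=lambda s: values[s])

state = "A"
path = [state]
while state != "goal":
    state = greedy_step(state)
    path.append(state)
print(path)  # ['A', 'B', 'C', 'goal']
```

At each step the agent climbs the value gradient (-7, -6, -5, 0), which is exactly the behavior described in the caption above.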
chapters/en/unit1/what-is-rl.mdx
# What is Reinforcement Learning? [[what-is-reinforcement-learning]]

To understand Reinforcement Learning, let’s start with the big picture.

## The big picture [[the-big-picture]]

The idea behind Reinforcement Learning is that an agent (an AI) will learn from the environment by **interacting with it** (through trial and error) and **receiving rewards** (negative or positive) as feedback for performing actions.

Learning from interaction with the environment **comes from our natural experiences.**

For instance, imagine putting your little brother in front of a video game he has never played, putting a controller in his hands, and leaving him alone.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/Illustration_1.jpg" alt="Illustration_1" width="100%">

Your brother will interact with the environment (the video game) by pressing the right button (action). He got a coin: that’s a +1 reward. It’s positive, so he just understood that in this game **he must get the coins.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/Illustration_2.jpg" alt="Illustration_2" width="100%">

But then, **he presses right again** and he touches an enemy. He just died: -1 reward.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/Illustration_3.jpg" alt="Illustration_3" width="100%">

By interacting with his environment through trial and error, your little brother understood that **he needed to get coins in this environment but avoid the enemies.**

**Without any supervision**, the child will get better and better at playing the game.

That’s how humans and animals learn, **through interaction.** Reinforcement Learning is just a **computational approach to learning from actions.**

### A formal definition [[a-formal-definition]]

Let’s now take a formal definition:

<Tip>
Reinforcement learning is a framework for solving control tasks (also called decision problems) by building agents that learn from the environment by interacting with it through trial and error and receiving rewards (positive or negative) as unique feedback.
</Tip>

But how does Reinforcement Learning work?
chapters/en/unit2/additional-reading.mdx
# Additional Reading [[additional-reading]]
chapters/en/unit2/bellman-equation.mdx
# The Bellman Equation: simplify our value estimation [[bellman-equation]]

The Bellman equation **simplifies our state value or state-action value calculation.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman.jpg" alt="Bellman equation"/>

With what we have learned so far, we know that if we calculate \\(V(S_t)\\) (the value of a state), we need to calculate the return starting at that state and then follow the policy forever after. **(The policy we defined in the following example is a Greedy Policy, and for simplification, we don't discount the reward.)**

So to calculate \\(V(S_t)\\), we need to sum the expected rewards. Hence:

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman2.jpg" alt="Bellman equation"/>
<figcaption>To calculate the value of State 1: the sum of rewards if the agent started in that state and then followed the greedy policy (taking actions that lead to the best state values) for all the time steps.</figcaption>
</figure>

Then, to calculate \\(V(S_{t+1})\\), we need to calculate the return starting at that state \\(S_{t+1}\\).

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman3.jpg" alt="Bellman equation"/>
<figcaption>To calculate the value of State 2: the sum of rewards if the agent started in that state and then followed the policy for all the time steps.</figcaption>
</figure>

So you see, that's a pretty tedious process if you need to do it for each state value or state-action value.

Instead of calculating the expected return for each state or each state-action pair, **we can use the Bellman equation.**

The Bellman equation is a recursive equation that works like this: instead of starting for each state from the beginning and calculating the return, we can consider the value of any state as:

**The immediate reward \\(R_{t+1}\\) + the discounted value of the state that follows ( \\(\gamma * V(S_{t+1})\\) ).**

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman4.jpg" alt="Bellman equation"/>
<figcaption>For simplification, here we don’t discount, so gamma = 1.</figcaption>
</figure>

If we go back to our example, the value of State 1 = the expected cumulative return if we start at that state.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman2.jpg" alt="Bellman equation"/>

To calculate the value of State 1: the sum of rewards **if the agent started in that state** and then followed the **policy for all the time steps.**

This is equivalent to \\(V(S_{t})\\) = Immediate reward \\(R_{t+1}\\) + Discounted value of the next state \\(\gamma * V(S_{t+1})\\)

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman6.jpg" alt="Bellman equation"/>

For simplification, here we don't discount, so gamma = 1.

- The value of \\(V(S_{t+1})\\) = Immediate reward \\(R_{t+2}\\) + Discounted value of the next state ( \\(\gamma * V(S_{t+2})\\) ).
- And so on.

To recap, the idea of the Bellman equation is that instead of calculating each value as the sum of the expected return, **which is a long process**, we calculate the value as **the sum of the immediate reward + the discounted value of the state that follows.**
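As a sanity check, we can sketch both ways of computing a state value on a tiny deterministic chain, assuming each step gives a reward of 1 and gamma = 1 as in the example:

```python
rewards = [1, 1, 1, 1]  # R_{t+1}, R_{t+2}, ... along a fixed trajectory

def full_return(t, gamma=1.0):
    """V(S_t) computed the long way: discounted sum of all rewards from t."""
    return sum(gamma**k * r for k, r in enumerate(rewards[t:]))

def bellman(t, gamma=1.0):
    """V(S_t) via the Bellman equation: R_{t+1} + gamma * V(S_{t+1})."""
    if t >= len(rewards):
        return 0.0
    return rewards[t] + gamma * full_return(t + 1, gamma)

# Both ways agree: 1 + 1 + 1 + 1 = 4
print(full_return(0), bellman(0))  # 4 4
```

The recursion matters in practice because it lets us update one state from its neighbor instead of replaying whole trajectories.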
chapters/en/unit2/conclusion.mdx
# Conclusion [[conclusion]]

Congrats on finishing this chapter! There was a lot of information. And congrats on finishing the tutorials. You’ve just implemented your first RL agent from scratch and shared it on the Hub 🥳.

Implementing from scratch when you study a new architecture **is important to understand how it works.**

It’s **normal if you still feel confused** by all these elements. **This was the same for me and for everyone who has studied RL.**

Take time to really grasp the material before continuing.

In the next chapter, we’re going to dive deeper by studying our first Deep Reinforcement Learning algorithm based on Q-Learning: Deep Q-Learning. And you'll train a **DQN agent with <a href="https://github.com/DLR-RM/rl-baselines3-zoo">RL-Baselines3 Zoo</a> to play Atari games**.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Atari environments"/>

### Keep Learning, stay awesome 🤗
chapters/en/unit2/hands-on.mdx
# Hands-on [[hands-on]]
chapters/en/unit2/introduction.mdx
# Introduction to Q-Learning [[introduction-q-learning]]

ADD THUMBNAIL

In the first chapter of this class, we learned about Reinforcement Learning (RL), the RL process, and the different methods to solve an RL problem. We also **trained our first agents and uploaded them to the Hugging Face Hub.**

In this chapter, we're going to **dive deeper into one of the Reinforcement Learning methods: value-based methods**, and study our first RL algorithm: **Q-Learning.**

We'll also **implement our first RL agent from scratch**: a Q-Learning agent, and train it in two environments:

1. Frozen-Lake-v1 (non-slippery version): where our agent will need to **go from the starting state (S) to the goal state (G)** by walking only on frozen tiles (F) and avoiding holes (H).
2. An autonomous taxi: where the agent will need **to learn to navigate** a city to **transport its passengers from point A to point B.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/envs.gif" alt="Environments"/>

We'll learn about value-based methods and the difference between Monte Carlo and Temporal Difference Learning. Then **we'll study and code our first RL algorithm**: Q-Learning, and implement our first RL Agent.

This unit is **fundamental if you want to be able to work on Deep Q-Learning**: the first Deep RL algorithm that was able to play Atari games and beat the human level on some of them (Breakout, Space Invaders…).

So let's get started! 🚀
chapters/en/unit2/mc-vs-td.mdx
# Monte Carlo vs Temporal Difference Learning [[mc-vs-td]]

The last thing we need to talk about before diving into Q-Learning is the two ways of learning.

Remember that an RL agent **learns by interacting with its environment.** The idea is that, **given the experience and the rewards it gets**, the agent will **update its value function or its policy.**

Monte Carlo and Temporal Difference Learning are two different **strategies for training our value function or our policy function.** Both of them **use experience to solve the RL problem.**

On one hand, Monte Carlo uses **an entire episode of experience before learning.** On the other hand, Temporal Difference uses **only a step ( \\(S_t, A_t, R_{t+1}, S_{t+1}\\) ) to learn.**

We'll explain both of them **using a value-based method example.**

## Monte Carlo: learning at the end of the episode [[monte-carlo]]

Monte Carlo waits until the end of the episode, calculates \\(G_t\\) (the return) and uses it as **a target for updating \\(V(S_t)\\).**

So it requires a **complete episode of interaction before updating our value function.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/monte-carlo-approach.jpg" alt="Monte Carlo"/>

If we take an example:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-2.jpg" alt="Monte Carlo"/>

- We always start the episode **at the same starting point.**
- **The agent takes actions using the policy**. For instance, using an Epsilon Greedy Strategy, a policy that alternates between exploration (random actions) and exploitation.
- We get **the reward and the next state.**
- We terminate the episode if the cat eats the mouse or if the mouse moves more than 10 steps.
- At the end of the episode, **we have a list of States, Actions, Rewards, and Next States.**
- **The agent will sum the total rewards \\(G_t\\)** (to see how well it did).
- It will then **update \\(V(S_t)\\) based on the formula**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-3.jpg" alt="Monte Carlo"/>

- Then **start a new game with this new knowledge.**

By running more and more episodes, **the agent will learn to play better and better.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-3p.jpg" alt="Monte Carlo"/>

For instance, if we train a state-value function using Monte Carlo:

- We just started to train our value function, **so it returns a value of 0 for each state.**
- Our learning rate (lr) is 0.1 and our discount rate is 1 (= no discount).
- Our mouse **explores the environment and takes random actions.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-4.jpg" alt="Monte Carlo"/>

- The mouse made more than 10 steps, so the episode ends.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-4p.jpg" alt="Monte Carlo"/>

- We have a list of states, actions, rewards, and next states, so **we need to calculate the return \\(G_t\\)**
- \\(G_t = R_{t+1} + R_{t+2} + R_{t+3} ...\\) (for simplicity, we don’t discount the rewards)
- \\(G_t = 1 + 0 + 0 + 0 + 0 + 0 + 1 + 1 + 0 + 0\\)
- \\(G_t = 3\\)
- We can now update \\(V(S_0)\\):

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-5.jpg" alt="Monte Carlo"/>

- New \\(V(S_0) = V(S_0) + lr * [G_t - V(S_0)]\\)
- New \\(V(S_0) = 0 + 0.1 * [3 - 0]\\)
- New \\(V(S_0) = 0.3\\)

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-5p.jpg" alt="Monte Carlo"/>
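The Monte Carlo update above can be sketched in a few lines (the reward list is the one from the example):

```python
lr = 0.1                                    # learning rate
episode_rewards = [1, 0, 0, 0, 0, 0, 1, 1, 0, 0]

G_t = sum(episode_rewards)                  # return with no discounting: 3
V_S0 = 0.0                                  # the value function starts at 0

# New V(S_0) = V(S_0) + lr * [G_t - V(S_0)]
V_S0 = V_S0 + lr * (G_t - V_S0)
print(V_S0)  # 0.3 (up to float rounding)
```

Note that the update can only happen after the episode is over, because it needs the full reward list to compute \\(G_t\\).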
## Temporal Difference Learning: learning at each step [[td-learning]]

**Temporal Difference, on the other hand, waits for only one interaction (one step) \\(S_{t+1}\\)** to form a TD target and update \\(V(S_t)\\) using \\(R_{t+1}\\) and \\(\gamma * V(S_{t+1})\\).

The idea with **TD is to update \\(V(S_t)\\) at each step.**

But because we didn't play through an entire episode, we don't have \\(G_t\\) (the expected return). Instead, **we estimate \\(G_t\\) by adding \\(R_{t+1}\\) and the discounted value of the next state.**

This is called bootstrapping, **because TD bases its update in part on an existing estimate \\(V(S_{t+1})\\) and not a complete sample \\(G_t\\).**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-1.jpg" alt="Temporal Difference"/>

This method is called TD(0) or **one-step TD (update the value function after any individual step).**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-1p.jpg" alt="Temporal Difference"/>

If we take the same example:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-2.jpg" alt="Temporal Difference"/>

- We just started to train our value function, so it returns a value of 0 for each state.
- Our learning rate (lr) is 0.1, and our discount rate is 1 (no discount).
- Our mouse explores the environment and takes a random action: **going to the left**
- It gets a reward \\(R_{t+1} = 1\\) since **it eats a piece of cheese**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-2p.jpg" alt="Temporal Difference"/>

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-3.jpg" alt="Temporal Difference"/>

We can now update \\(V(S_0)\\):

New \\(V(S_0) = V(S_0) + lr * [R_1 + \gamma * V(S_1) - V(S_0)]\\)

New \\(V(S_0) = 0 + 0.1 * [1 + 1 * 0 - 0]\\)

New \\(V(S_0) = 0.1\\)

So we just updated our value function for State 0.

Now we **continue to interact with this environment with our updated value function.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-3p.jpg" alt="Temporal Difference"/>
|
||||
|
||||
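The one-step TD update for this example can be sketched like this (hypothetical state names; lr = 0.1 and gamma = 1 as in the walkthrough):

```python
# One-step TD (TD(0)) update of V(S_0), following the example above
lr = 0.1      # learning rate
gamma = 1.0   # no discounting in this example
V = {"S0": 0.0, "S1": 0.0}  # value function initialized to 0

R_1 = 1  # reward for eating the piece of cheese

# TD target = R_{t+1} + gamma * V(S_{t+1}); we bootstrap on V(S_1)
td_target = R_1 + gamma * V["S1"]
V["S0"] = V["S0"] + lr * (td_target - V["S0"])
print(V["S0"])  # 0.1
```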
If we summarize:

- With *Monte Carlo*, we update the value function from a complete episode, and so we **use the actual, accurate discounted return of this episode.**
- With *TD learning*, we update the value function from a step, and so we replace \\(G_t\\), which we don't have, with **an estimated return called the TD target.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Summary.jpg" alt="Summary"/>
83
chapters/en/unit2/q-learning-example.mdx
Normal file
@@ -0,0 +1,83 @@
# A Q-Learning example [[q-learning-example]]

To better understand Q-Learning, let's take a simple example:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-Example-2.jpg" alt="Maze-Example"/>

- You're a mouse in this tiny maze. You always **start at the same starting point.**
- The goal is **to eat the big pile of cheese at the bottom right-hand corner** and avoid the poison. After all, who doesn't like cheese?
- The episode ends if we eat the poison, **eat the big pile of cheese, or take more than five steps.**
- The learning rate is 0.1.
- The gamma (discount rate) is 0.99.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-1.jpg" alt="Maze-Example"/>

The reward function goes like this:

- **+0:** Going to a state with no cheese in it.
- **+1:** Going to a state with a small cheese in it.
- **+10:** Going to the state with the big pile of cheese.
- **-10:** Going to the state with the poison and thus dying.
- **+0:** If we take more than five steps.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-2.jpg" alt="Maze-Example"/>

To train our agent to have an optimal policy (a policy that goes right, right, down), **we will use the Q-Learning algorithm**.

## Step 1: We initialize the Q-Table [[step1]]

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Example-1.jpg" alt="Maze-Example"/>

So, for now, **our Q-Table is useless**; we need **to train our Q-function using the Q-Learning algorithm.**

Let's do it for 2 training timesteps:

Training timestep 1:

## Step 2: Choose an action using the Epsilon-Greedy Strategy [[step2]]

Because epsilon is big (= 1.0), we take a random action; in this case, we go right.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-3.jpg" alt="Maze-Example"/>

## Step 3: Perform action At, get Rt+1 and St+1 [[step3]]

By going right, we get a small cheese, so \\(R_{t+1} = 1\\), and we're in a new state.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-4.jpg" alt="Maze-Example"/>

## Step 4: Update Q(St, At) [[step4]]

We can now update \\(Q(S_t, A_t)\\) using our formula.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-5.jpg" alt="Maze-Example"/>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Example-4.jpg" alt="Maze-Example"/>
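The arithmetic of this first update can be sketched as follows (our own variable names; lr = 0.1, gamma = 0.99 as set at the start of the example):

```python
# Q-Learning update for training timestep 1
lr, gamma = 0.1, 0.99
q_current = 0.0    # Q(S_t, A_t): the Q-Table is initialized to 0
reward = 1         # small cheese
max_next_q = 0.0   # max_a Q(S_{t+1}, a): still 0 at this point in training

td_target = reward + gamma * max_next_q
q_new = q_current + lr * (td_target - q_current)
print(q_new)  # 0.1
```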
Training timestep 2:

## Step 2: Choose an action using the Epsilon-Greedy Strategy [[step2-2]]

**We take a random action again, since epsilon is still big (0.99)** (we decay it a little bit because, as the training progresses, we want less and less exploration).

We take the action "down". **Not a good action, since it leads us to the poison.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-6.jpg" alt="Maze-Example"/>

## Step 3: Perform action At, get Rt+1 and St+1 [[step3-3]]

Because we go to the poison state, **we get \\(R_{t+1} = -10\\), and we die.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-7.jpg" alt="Maze-Example"/>

## Step 4: Update Q(St, At) [[step4-4]]

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-8.jpg" alt="Maze-Example"/>

Because we're dead, we start a new episode. But what we see here is that, **with two exploration steps, our agent became smarter.**

As we continue exploring and exploiting the environment and updating Q-values using the TD target, the **Q-Table will give us better and better approximations. Thus, by the end of the training, we'll get an estimate of the optimal Q-function.**
153
chapters/en/unit2/q-learning.mdx
Normal file
@@ -0,0 +1,153 @@
# Introducing Q-Learning [[q-learning]]

## What is Q-Learning? [[what-is-q-learning]]

Q-Learning is an **off-policy, value-based method that uses a TD approach to train its action-value function:**

- *Off-policy*: we'll talk about that at the end of this chapter.
- *Value-based method*: it finds the optimal policy indirectly by training a value function or an action-value function that will tell us **the value of each state or each state-action pair.**
- *Uses a TD approach*: it **updates its action-value function at each step instead of at the end of the episode.**

**Q-Learning is the algorithm we use to train our Q-function**, an **action-value function** that determines the value of being at a particular state and taking a specific action at that state.

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function.jpg" alt="Q-function"/>
<figcaption>Given a state and action, our Q-function outputs a state-action value (also called a Q-value)</figcaption>
</figure>

The **Q comes from "the Quality" of that action at that state.**

Internally, our Q-function has **a Q-table, a table where each cell corresponds to a state-action pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.**

If we take this maze example:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-1.jpg" alt="Maze example"/>

The Q-table is initialized, which is why all values are 0. This table **contains, for each state, the four state-action values.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-2.jpg" alt="Maze example"/>

Here we see that the **state-action value of the initial state and going up is 0:**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-3.jpg" alt="Maze example"/>

Therefore, the Q-function contains a Q-table **that has the value of each state-action pair.** And given a state and action, **our Q-function will search inside its Q-table to output the value.**

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q-function"/>
<figcaption>Given a state and action pair, our Q-function will search inside its Q-table to output the state-action pair value (the Q-value).</figcaption>
</figure>

If we recap, *Q-Learning* **is the RL algorithm that:**

- Trains a *Q-function* (an **action-value function**), which internally is a *Q-table* **that contains all the state-action pair values.**
- Given a state and action, our Q-function **will search its Q-table for the corresponding value.**
- When the training is done, **we have an optimal Q-function, which means we have an optimal Q-table.**
- And if we **have an optimal Q-function**, we **have an optimal policy**, since we **know, for each state, the best action to take.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy"/>
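As a rough sketch (with hypothetical sizes, not the course's actual code), a Q-table can be represented as a table of state rows and action columns:

```python
# A minimal Q-table: one row per state, one column per action,
# initialized to 0 like in the figure above
n_states, n_actions = 6, 4  # hypothetical: 6 maze cells; 4 moves (up, down, left, right)
q_table = [[0.0] * n_actions for _ in range(n_states)]

def q_value(state, action):
    """Given a state and an action, look the state-action value up in the Q-table."""
    return q_table[state][action]

print(q_value(0, 0))  # 0.0, the state-action value of the initial state and going up
```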
But in the beginning, **our Q-table is useless, since it gives arbitrary values for each state-action pair** (most of the time, we initialize the Q-table to 0). As we **explore the environment and update our Q-table, it will give us better and better approximations.**

<figure class="image table text-center m-0 w-full">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-1.jpg" alt="Q-learning"/>
<figcaption>We see here that with training, our Q-table is better since, thanks to it, we can know the value of each state-action pair.</figcaption>
</figure>

Now that we understand what Q-Learning, the Q-function, and the Q-table are, **let's dive deeper into the Q-Learning algorithm**.

## The Q-Learning algorithm [[q-learning-algo]]

This is the Q-Learning pseudocode; let's study each part and **see how it works with a simple example before implementing it.** Don't be intimidated by it, it's simpler than it looks! We'll go over each step.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-learning"/>
### Step 1: We initialize the Q-Table [[step1]]

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-3.jpg" alt="Q-learning"/>

We need to initialize the Q-table for each state-action pair. **Most of the time, we initialize with values of 0.**

### Step 2: Choose an action using the Epsilon-Greedy Strategy [[step2]]

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg" alt="Q-learning"/>

The Epsilon-Greedy Strategy is a policy that handles the exploration/exploitation trade-off.

The idea is that we define epsilon ɛ = 1.0:

- *With probability 1 − ɛ*: we do **exploitation** (i.e., our agent selects the action with the highest state-action pair value).
- *With probability ɛ*: we do **exploration** (trying a random action).

At the beginning of the training, **the probability of doing exploration will be huge since ɛ is very high, so most of the time, we'll explore.** But as the training goes on, and consequently our **Q-table gets better and better in its estimations, we progressively reduce the epsilon value**, since we will need less and less exploration and more exploitation.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-5.jpg" alt="Q-learning"/>
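The strategy described above can be sketched in a few lines (a minimal sketch with a list-based Q-table, not the course's notebook code):

```python
import random

def epsilon_greedy(q_table, state, epsilon):
    """With probability epsilon, explore (random action); otherwise exploit
    (take the action with the highest state-action value)."""
    n_actions = len(q_table[state])
    if random.random() < epsilon:
        return random.randrange(n_actions)  # exploration
    return max(range(n_actions), key=lambda a: q_table[state][a])  # exploitation

q_table = [[0.0, 0.0, 1.0, 0.0]]  # one state where action 2 currently looks best
print(epsilon_greedy(q_table, state=0, epsilon=0.0))  # 2: pure exploitation
```

With ɛ = 1.0 every call returns a random action; decaying ɛ toward 0 shifts the agent from exploration to exploitation.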
### Step 3: Perform action At, get reward Rt+1 and next state St+1 [[step3]]

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-6.jpg" alt="Q-learning"/>

### Step 4: Update Q(St, At) [[step4]]

Remember that in TD Learning, we update our policy or value function (depending on the RL method we choose) **after one step of the interaction.**

To produce our TD target, **we use the immediate reward \\(R_{t+1}\\) plus the discounted value of the best state-action pair of the next state** (we call that bootstrapping).

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-7.jpg" alt="Q-learning"/>

Therefore, our \\(Q(S_t, A_t)\\) **update formula goes like this:**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-8.jpg" alt="Q-learning"/>
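The update formula can be sketched as a small function (a sketch using a list-based Q-table and hypothetical names, assuming lr = 0.1 and gamma = 0.99 as in the maze example):

```python
def update_q(q_table, state, action, reward, next_state, lr=0.1, gamma=0.99):
    """Q-Learning update: move Q(S_t, A_t) toward the TD target
    R_{t+1} + gamma * max_a Q(S_{t+1}, a)."""
    td_target = reward + gamma * max(q_table[next_state])  # bootstrap on the best next action
    q_table[state][action] += lr * (td_target - q_table[state][action])

q_table = [[0.0] * 4 for _ in range(6)]  # hypothetical 6 states x 4 actions
update_q(q_table, state=0, action=1, reward=1, next_state=1)
print(q_table[0][1])  # 0.1
```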
It means that to update our \\(Q(S_t, A_t)\\):

- We need \\(S_t, A_t, R_{t+1}, S_{t+1}\\).
- To update our Q-value at a given state-action pair, we use the TD target.

How do we form the TD target?
1. We obtain the reward \\(R_{t+1}\\) after taking the action.
2. To get the **best next-state-action pair value**, we use a greedy policy to select the next best action. Note that this is not an epsilon-greedy policy: it always takes the action with the highest state-action value.

Then, when the update of this Q-value is done, we start in a new state and select our action **using our epsilon-greedy policy again.**

**This is why we say that Q-Learning is an off-policy algorithm.**
## Off-policy vs On-policy [[off-vs-on]]

The difference is subtle:

- *Off-policy*: using **a different policy for acting and updating.**

For instance, with Q-Learning, the epsilon-greedy policy (acting policy) is different from the greedy policy that is **used to select the best next-state-action value to update our Q-value (updating policy).**

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-1.jpg" alt="Off-on policy"/>
<figcaption>Acting Policy</figcaption>
</figure>

It is different from the policy we use during the training part:

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-2.jpg" alt="Off-on policy"/>
<figcaption>Updating policy</figcaption>
</figure>

- *On-policy*: using the **same policy for acting and updating.**

For instance, with Sarsa, another value-based algorithm, it's **the epsilon-greedy policy that selects the next state-action pair, not a greedy policy.**

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-3.jpg" alt="Off-on policy"/>
<figcaption>Sarsa</figcaption>
</figure>

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg" alt="Off-on policy"/>
</figure>
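The two targets can be put side by side in code (a sketch with hypothetical helper names and a toy list-based Q-table):

```python
def q_learning_target(q_table, reward, next_state, gamma):
    # Off-policy: greedy over the next state's actions,
    # regardless of the action the acting policy will actually take
    return reward + gamma * max(q_table[next_state])

def sarsa_target(q_table, reward, next_state, next_action, gamma):
    # On-policy: uses the next action actually selected by the
    # (epsilon-greedy) acting policy
    return reward + gamma * q_table[next_state][next_action]

q_table = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.5, 0.2, 0.0]]
print(q_learning_target(q_table, reward=1, next_state=1, gamma=1.0))            # 1.5
print(sarsa_target(q_table, reward=1, next_state=1, next_action=2, gamma=1.0))  # 1.2
```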
105
chapters/en/unit2/quiz1.mdx
Normal file
@@ -0,0 +1,105 @@
# First Quiz [[quiz1]]

The best way to learn and [to avoid the illusion of competence](https://fr.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you find **where you need to reinforce your knowledge**.

### Q1: What are the two main approaches to find an optimal policy?

<Question
  choices={[
    {
      text: "Policy-based methods",
      explain: "With policy-based methods, we train the policy directly to learn which action to take given a state.",
      correct: true
    },
    {
      text: "Random-based methods",
      explain: ""
    },
    {
      text: "Value-based methods",
      explain: "With value-based methods, we train a value function to learn which state is more valuable and use this value function to take the action that leads to it.",
      correct: true
    },
    {
      text: "Evolution-strategies methods",
      explain: ""
    }
  ]}
/>

### Q2: What is the Bellman Equation?

<details>
<summary>Solution</summary>

**The Bellman equation is a recursive equation** that works like this: instead of starting from the beginning for each state and calculating the return, we can consider the value of any state as:

\\(R_{t+1} + \gamma * V(S_{t+1})\\)

The immediate reward plus the discounted value of the state that follows.

</details>

### Q3: Define each part of the Bellman Equation

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman4-quiz.jpg" alt="Bellman equation quiz"/>

<details>
<summary>Solution</summary>

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman4.jpg" alt="Bellman equation solution"/>

</details>

### Q4: What is the difference between Monte Carlo and Temporal Difference learning methods?

<Question
  choices={[
    {
      text: "With Monte Carlo methods, we update the value function from a complete episode",
      explain: "",
      correct: true
    },
    {
      text: "With Monte Carlo methods, we update the value function from a step",
      explain: ""
    },
    {
      text: "With TD learning methods, we update the value function from a complete episode",
      explain: ""
    },
    {
      text: "With TD learning methods, we update the value function from a step",
      explain: "",
      correct: true
    }
  ]}
/>

### Q5: Define each part of the Temporal Difference learning formula

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/td-ex.jpg" alt="TD Learning exercise"/>

<details>
<summary>Solution</summary>

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-1.jpg" alt="TD Exercise"/>

</details>

### Q6: Define each part of the Monte Carlo learning formula

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/mc-ex.jpg" alt="MC Learning exercise"/>

<details>
<summary>Solution</summary>

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/monte-carlo-approach.jpg" alt="MC Exercise"/>

</details>

Congrats on finishing this quiz 🥳. If you missed some elements, take time to read the chapter again to reinforce (😏) your knowledge.
97
chapters/en/unit2/quiz2.mdx
Normal file
@@ -0,0 +1,97 @@
# Second Quiz [[quiz2]]

The best way to learn and [to avoid the illusion of competence](https://fr.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you find **where you need to reinforce your knowledge**.

### Q1: What is Q-Learning?

<Question
  choices={[
    {
      text: "The algorithm we use to train our Q-Function",
      explain: "",
      correct: true
    },
    {
      text: "A value function",
      explain: "It's an action-value function, since it determines the value of being at a particular state and taking a specific action at that state.",
    },
    {
      text: "An algorithm that determines the value of being at a particular state and taking a specific action at that state",
      explain: "",
      correct: true
    },
    {
      text: "A table",
      explain: "The Q-Function is not a Q-Table. The Q-Function is the algorithm that will feed the Q-Table."
    }
  ]}
/>

### Q2: What is a Q-Table?

<Question
  choices={[
    {
      text: "An algorithm we use in Q-Learning",
      explain: "",
    },
    {
      text: "The Q-Table is the internal memory of our agent",
      explain: "",
      correct: true
    },
    {
      text: "In a Q-Table, each cell corresponds to a state value",
      explain: "Each cell corresponds to a state-action pair value, not a state value.",
    }
  ]}
/>

### Q3: Why do we have an optimal policy if we have an optimal Q-function Q*?

<details>
<summary>Solution</summary>

Because if we have an optimal Q-function, we have an optimal policy, since we know, for each state, the best action to take.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="link value policy"/>

</details>

### Q4: Can you explain what the Epsilon-Greedy Strategy is?

<details>
<summary>Solution</summary>

The Epsilon-Greedy Strategy is a policy that handles the exploration/exploitation trade-off.

The idea is that we define epsilon ɛ = 1.0:

- With *probability 1 − ɛ*: we do exploitation (i.e., our agent selects the action with the highest state-action pair value).
- With *probability ɛ*: we do exploration (trying a random action).

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg" alt="Epsilon Greedy"/>

</details>

### Q5: How do we update the Q-value of a state-action pair?

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-update-ex.jpg" alt="Q Update exercise"/>

<details>
<summary>Solution</summary>

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-update-solution.jpg" alt="Q Update exercise"/>

</details>

### Q6: What's the difference between on-policy and off-policy?

<details>
<summary>Solution</summary>

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg" alt="On/off policy"/>

</details>

Congrats on finishing this quiz 🥳. If you missed some elements, take time to read the chapter again to reinforce (😏) your knowledge.
17
chapters/en/unit2/summary1.mdx
Normal file
@@ -0,0 +1,17 @@
# Summary [[summary1]]

Before diving into Q-Learning, let's summarize what we just learned.

We have two types of value-based functions:

- The state-value function: outputs the expected return if **the agent starts at a given state and acts according to the policy forever after.**
- The action-value function: outputs the expected return if **the agent starts in a given state, takes a given action at that state**, and then acts according to the policy forever after.
- In value-based methods, **we define the policy by hand** because we don't train it; we train a value function. The idea is that if we have an optimal value function, we **will have an optimal policy.**

There are two types of methods for learning this value function:

- With *the Monte Carlo method*, we update the value function from a complete episode, and so we **use the actual, accurate discounted return of this episode.**
- With *the TD Learning method*, we update the value function from a step, and so we replace \\(G_t\\), which we don't have, with **an estimated return called the TD target.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/summary-learning-mtds.jpg" alt="Summary"/>
86
chapters/en/unit2/two-types-value-based-methods.mdx
Normal file
@@ -0,0 +1,86 @@
# The two types of value-based methods [[two-types-value-based-methods]]

In value-based methods, **we learn a value function** that **maps a state to the expected value of being at that state.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/vbm-1.jpg" alt="Value Based Methods"/>

The value of a state is the **expected discounted return** the agent can get if it **starts at that state and then acts according to our policy.**

<Tip>
But what does it mean to act according to our policy? After all, we don't have a policy in value-based methods, since we train a value function and not a policy.
</Tip>

Remember that the goal of an **RL agent is to have an optimal policy π.**

To find it, we learned that there are two different methods:

- *Policy-based methods:* **Directly train the policy** to select what action to take given a state (or a probability distribution over actions at that state). In this case, we **don't have a value function.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/two-approaches-2.jpg" alt="Two RL approaches"/>

The policy takes a state as input and outputs what action to take at that state (deterministic policy).

And consequently, **we don't define the behavior of our policy by hand; it's the training that will define it.**

- *Value-based methods:* **Indirectly, by training a value function** that outputs the value of a state or a state-action pair. Given this value function, our policy **will know what action to take.**

But, because we didn't train our policy, **we need to specify its behavior.** For instance, if we want a policy that, given the value function, takes actions that always lead to the biggest reward, **we'll create a Greedy Policy.**

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/two-approaches-3.jpg" alt="Two RL approaches"/>
<figcaption>Given a state, our action-value function (that we train) outputs the value of each action at that state, then our greedy policy (that we defined) selects the action with the biggest state-action pair value.</figcaption>
</figure>

Consequently, whatever method you use to solve your problem, **you will have a policy**. But in the case of value-based methods you don't train it: your policy **is just a simple function that you specify** (for instance, a greedy policy), and this policy **uses the values given by the value function to select its actions.**
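A greedy policy like the one just described can be sketched in a couple of lines (hypothetical action values, not from the course materials):

```python
def greedy_policy(q_values):
    """A policy we specify by hand: pick the action with the biggest
    state-action value for the current state."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical action values for one state: up, down, left, right
q_values_at_state = [0.1, 0.7, 0.3, 0.2]
print(greedy_policy(q_values_at_state))  # 1, i.e. "down" has the biggest value
```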
So the difference is:

- In policy-based methods, **the optimal policy is found by training the policy directly.**
- In value-based methods, **finding an optimal value function leads to having an optimal policy.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link between value and policy"/>

In fact, most of the time, in value-based methods, you'll use **an Epsilon-Greedy Policy** that handles the exploration/exploitation trade-off; we'll talk about it when we talk about Q-Learning in the second part of this unit.

So, we have two types of value-based functions:

## The State-Value function [[state-value-function]]

We write the state-value function under a policy π like this:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/state-value-function-1.jpg" alt="State value function"/>

For each state, the state-value function outputs the expected return if the agent **starts at that state** and then follows the policy forever after (for all future timesteps, if you prefer).

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/state-value-function-2.jpg" alt="State value function"/>
<figcaption>If we take the state with value -7: it's the expected return starting at that state and taking actions according to our policy (greedy policy), so right, right, right, down, down, right, right.</figcaption>
</figure>

## The Action-Value function [[action-value-function]]

For each state-action pair, the action-value function **outputs the expected return** if the agent starts in that state, takes that action, and then follows the policy forever after.

The value of taking action \\(a\\) in state \\(s\\) under a policy π is:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/action-state-value-function-1.jpg" alt="Action State value function"/>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/action-state-value-function-2.jpg" alt="Action State value function"/>

We see that the difference is:

- With the state-value function, we calculate **the value of a state \\(S_t\\)**.
- With the action-value function, we calculate **the value of the state-action pair ( \\(S_t, A_t\\) ), hence the value of taking that action at that state.**

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/two-types.jpg" alt="Two types of value function"/>
<figcaption>
Note: We didn't fill all the state-action pairs for the example of the action-value function</figcaption>
</figure>

In either case, whatever value function we choose (state-value or action-value function), **the value is the expected return.**

However, the problem is that calculating **EACH value of a state or a state-action pair implies summing all the rewards an agent can get if it starts at that state.**

This can be a tedious process, and that's **where the Bellman equation comes to help us.**
25
chapters/en/unit2/what-is-rl.mdx
Normal file
@@ -0,0 +1,25 @@
# What is RL? A short recap [[what-is-rl]]

In RL, we build an agent that can **make smart decisions**. For instance, an agent that **learns to play a video game**, or a trading agent that **learns to maximize its profits** by making smart decisions on **what stocks to buy and when to sell.**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/rl-process.jpg" alt="RL process"/>
But to make intelligent decisions, our agent will learn from the environment by **interacting with it through trial and error** and receiving rewards (positive or negative) **as unique feedback.**

Its goal **is to maximize its expected cumulative reward** (because of the reward hypothesis).

**The agent's decision-making process is called the policy π:** given a state, a policy will output an action or a probability distribution over actions. That is, given an observation of the environment, a policy provides the action (or a probability for each action) that the agent should take.
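The two flavors of policy can be sketched as plain functions (our own illustration, not course code; the states, actions, and probabilities are made up):

```python
import random

# A deterministic policy maps each state to one action; a stochastic
# policy maps each state to a probability distribution over actions.

def deterministic_policy(state):
    # Always returns the same action for the same state.
    return "right" if state == "start" else "left"

def stochastic_policy(state):
    # Returns a probability per action for this state.
    return {"left": 0.1, "right": 0.9}

def sample_action(distribution):
    actions, probs = zip(*distribution.items())
    return random.choices(actions, weights=probs, k=1)[0]

action = sample_action(stochastic_policy("start"))  # "left" or "right"
```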
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/policy.jpg" alt="Policy"/>
**Our goal is to find an optimal policy π\*,** i.e., a policy that leads to the best expected cumulative reward.

To find this optimal policy (hence solving the RL problem), there **are two main types of RL methods**:

- *Policy-based methods*: **Train the policy directly** to learn which action to take given a state.
- *Value-based methods*: **Train a value function** to learn **which state is more valuable** and use this value function **to take the action that leads to it.**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/two-approaches.jpg" alt="Two RL approaches"/>

And in this unit, **we'll dive deeper into the Value-based methods.**
1
chapters/en/unit3/additional-reading.mdx
Normal file
@@ -0,0 +1 @@
# Additional Reading [[additional-reading]]
11
chapters/en/unit3/conclusion.mdx
Normal file
@@ -0,0 +1,11 @@
# Conclusion [[conclusion]]

Congrats on finishing this chapter! There was a lot of information. And congrats on finishing the tutorial: you’ve just trained your first Deep Q-Learning agent and shared it on the Hub 🥳.

Take time to really grasp the material before continuing.

Don't hesitate to train your agent in other environments (Pong, Seaquest, QBert, Ms Pac Man). The **best way to learn is to try things on your own!**

In the next unit, **we're going to learn about Optuna**. One of the most critical tasks in Deep Reinforcement Learning is finding a good set of training hyperparameters, and Optuna is a library that helps you automate that search.

### Keep Learning, stay awesome 🤗
102
chapters/en/unit3/deep-q-algorithm.mdx
Normal file
@@ -0,0 +1,102 @@
# The Deep Q-Learning Algorithm [[deep-q-algorithm]]

We learned that Deep Q-Learning **uses a deep neural network to approximate the different Q-values for each possible action at a state** (value-function estimation).

The difference is that, during the training phase, instead of updating the Q-value of a state-action pair directly as we did with Q-Learning:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-5.jpg" alt="Q Loss"/>
In Deep Q-Learning, we create a **loss function between our Q-value prediction and the Q-target and use Gradient Descent to update the weights of our Deep Q-Network to approximate our Q-values better**.
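As a rough sketch of that update (our own illustration, using a simple linear Q-function `q(s, a) = w[a] @ s` instead of the course's deep network; all values are made up), one gradient descent step on the squared TD error looks like:

```python
import numpy as np

# One gradient step on the Deep Q-Learning loss, with a linear
# Q-function standing in for the Deep Q-Network.

n_actions, n_features = 2, 4
w = np.zeros((n_actions, n_features))      # Q-"network" weights
state = np.array([1.0, 0.0, -1.0, 0.5])
action = 0
q_target = 1.0   # r + gamma * max_a' Q(s', a'), assumed precomputed
lr = 0.1

q_pred = w[action] @ state
loss_before = (q_target - q_pred) ** 2

# d(loss)/d(w[action]) = -2 * (q_target - q_pred) * state,
# so gradient DESCENT moves w[action] the opposite way:
w[action] += lr * 2 * (q_target - q_pred) * state

loss_after = (q_target - w[action] @ state) ** 2  # smaller than before
```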
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/Q-target.jpg" alt="Q-target"/>
The Deep Q-Learning training algorithm has *two phases*:

- **Sampling**: we perform actions and **store the observed experience tuples in a replay memory**.
- **Training**: we select a **small batch of tuples randomly and learn from it using a gradient descent update step**.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/sampling-training.jpg" alt="Sampling Training"/>
But this is not the only change compared with Q-Learning. Deep Q-Learning training **might suffer from instability**, mainly because of combining a non-linear Q-value function (Neural Network) and bootstrapping (when we update targets with existing estimates and not an actual complete return).

To help us stabilize the training, we implement three different solutions:

1. *Experience Replay*, to make more **efficient use of experiences**.
2. *Fixed Q-Target*, **to stabilize the training**.
3. *Double Deep Q-Learning*, to **handle the problem of the overestimation of Q-values**.
## Experience Replay to make more efficient use of experiences [[exp-replay]]

Why do we create a replay memory?
Experience Replay in Deep Q-Learning has two functions:

1. **Make more efficient use of the experiences during training**.
- Usually, in online reinforcement learning, we interact with the environment, get experiences (state, action, reward, and next state), learn from them (update the neural network), and discard them. Experience replay helps us **make more efficient use of these experiences.**
- With experience replay, we create a replay buffer that saves experience samples **that we can reuse during training.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/experience-replay.jpg" alt="Experience Replay"/>

⇒ This allows us to **learn from individual experiences multiple times**.
2. **Avoid forgetting previous experiences and reduce the correlation between experiences**.
- If we feed sequential samples of experiences to our neural network, it tends to forget **previous experiences as it overwrites them with new ones.** For instance, if the agent is on the first level and then the second, which is different, it can forget how to behave and play on the first level.

The solution is to create a replay buffer that stores experience tuples while interacting with the environment, and then sample a small batch of tuples from it. This prevents **the network from only learning about what it has done immediately before.**

Experience replay also has other benefits. By randomly sampling the experiences, we remove correlation in the observation sequences and avoid **action values oscillating or diverging catastrophically.**
In the Deep Q-Learning pseudocode, we see that we **initialize a replay memory buffer D with capacity N** (N is a hyperparameter that you can define). We then store experiences in the memory and sample a minibatch of experiences to feed the Deep Q-Network during the training phase.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/experience-replay-pseudocode.jpg" alt="Experience Replay Pseudocode"/>
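A replay buffer like the one in the pseudocode can be sketched in a few lines (our own illustration; the capacity, batch size, and dummy transitions are arbitrary, and real implementations also handle tensors and episode boundaries):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        # Once full, the oldest experiences are dropped automatically.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive steps.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=100)
for t in range(10):
    buffer.push(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
batch = buffer.sample(batch_size=4)  # 4 random transitions
```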
## Fixed Q-Target to stabilize the training [[fixed-q]]

When we want to calculate the TD error (aka the loss), we calculate the **difference between the TD target (Q-Target) and the current Q-value (our estimation of Q)**.

But we **don’t know the real TD target**; we need to estimate it. Using the Bellman equation, we saw that the TD target is just the reward of taking that action at that state plus the discounted highest Q-value for the next state.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/Q-target.jpg" alt="Q-target"/>
However, the problem is that we are using the same parameters (weights) for estimating both the TD target **and** the Q-value. Consequently, there is a significant correlation between the TD target and the parameters we are changing.

This means that at every step of training, **our Q-values shift, but so does the target value.** We’re getting closer to our target, but the target is also moving. It’s like chasing a moving target! This leads to significant oscillation in training.

It’s as if you were a cowboy (the Q estimation) trying to catch a cow (the Q-target): you must get closer to it (reduce the error).
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/qtarget-1.jpg" alt="Q-target"/>

At each time step, you’re trying to approach the cow, but the cow also moves at each time step (because you use the same parameters).

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/qtarget-2.jpg" alt="Q-target"/>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/qtarget-3.jpg" alt="Q-target"/>

This leads to a bizarre path of chasing (a significant oscillation in training).

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/qtarget-4.jpg" alt="Q-target"/>
Instead, what we see in the pseudo-code is that we:

- Use a **separate network with fixed parameters** for estimating the TD target.
- **Copy the parameters from our Deep Q-Network every C steps** to update the target network.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/fixed-q-target-pseudocode.jpg" alt="Fixed Q-target Pseudocode"/>
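The "copy every C steps" mechanism can be sketched as follows (our own illustration: plain numpy arrays stand in for network weights, the `+= 0.1` fakes a gradient update, and C = 3 is an arbitrary hyperparameter choice):

```python
import copy

import numpy as np

online_weights = np.zeros(4)
target_weights = copy.deepcopy(online_weights)   # frozen copy
C = 3                                            # sync period

for step in range(1, 9):
    # The TD target would be computed with target_weights here, while
    # gradient descent updates only online_weights (faked as += 0.1):
    online_weights += 0.1

    if step % C == 0:
        # Every C steps, refresh the frozen target network.
        target_weights = copy.deepcopy(online_weights)

# target_weights now lags behind online_weights (last sync at step 6),
# giving the loss a target that stays put between syncs.
```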
## Double DQN [[double-dqn]]

Double DQN, or double learning, was introduced [by Hado van Hasselt](https://papers.nips.cc/paper/3964-double-q-learning). This method **handles the problem of the overestimation of Q-values.**

To understand this problem, remember how we calculate the TD target:
We face a simple problem when calculating the TD target: how can we be sure that **the best action for the next state is the action with the highest Q-value?**

We know that the accuracy of Q-values depends on what actions we tried **and** what neighboring states we explored.

Consequently, we don’t have enough information about the best action to take at the beginning of training. Therefore, taking the maximum Q-value (which is noisy) as the best action can lead to false positives. If non-optimal actions are regularly **given a higher Q-value than the optimal best action, learning will be complicated.**
The solution is: when we compute the Q-target, we use two networks to decouple the action selection from the target Q-value generation. We:

- Use our **DQN network** to select the best action to take for the next state (the action with the highest Q-value).
- Use our **Target network** to calculate the target Q-value of taking that action at the next state.
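The decoupling in the two steps above can be sketched as follows (our own illustration; the Q-value arrays simply stand in for the outputs of the two networks, and the reward and gamma are made up):

```python
import numpy as np

def double_dqn_target(q_online_next, q_target_next, reward, gamma, done):
    # 1) The online DQN network SELECTS the action for the next state...
    best_action = int(np.argmax(q_online_next))
    # 2) ...and the target network EVALUATES that action.
    next_value = q_target_next[best_action]
    return reward + gamma * next_value * (1.0 - done)

q_online_next = np.array([1.0, 3.0, 2.0])   # online net picks action 1
q_target_next = np.array([0.5, 1.5, 4.0])   # ...which it values at 1.5
target = double_dqn_target(q_online_next, q_target_next,
                           reward=1.0, gamma=0.9, done=0.0)
# Vanilla DQN would instead have used max(q_target_next) = 4.0,
# a much larger (and possibly overestimated) value.
```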
Therefore, Double DQN helps us reduce the overestimation of Q-values and, as a consequence, helps us train faster and have more stable learning.

Since these three improvements in Deep Q-Learning, many more have been added, such as Prioritized Experience Replay and Dueling Deep Q-Learning. They’re out of the scope of this course, but if you’re interested, check the links in the reading list. TODO Add reading list
39
chapters/en/unit3/deep-q-network.mdx
Normal file
@@ -0,0 +1,39 @@
# The Deep Q-Network (DQN) [[deep-q-network]]

This is the architecture of our Deep Q-Learning network:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/deep-q-network.jpg" alt="Deep Q Network"/>

As input, we take a **stack of 4 frames** passed through the network as a state, and output a **vector of Q-values for each possible action at that state**. Then, as with Q-Learning, we just need to use our epsilon-greedy policy to select which action to take.
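Epsilon-greedy action selection over that Q-value vector can be sketched as follows (our own illustration; the Q-values are made up):

```python
import random

import numpy as np

def epsilon_greedy(q_values, epsilon):
    # With probability epsilon, explore a random action...
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    # ...otherwise exploit the highest predicted Q-value.
    return int(np.argmax(q_values))

q_values = np.array([0.2, 1.4, -0.3])
greedy_action = epsilon_greedy(q_values, epsilon=0.0)  # always action 1
```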
When the neural network is initialized, **the Q-value estimations are terrible**. But during training, our Deep Q-Network agent will associate situations with appropriate actions and **learn to play the game well**.
## Preprocessing the input and temporal limitation [[preprocessing]]

We mentioned that we preprocess the input. It’s an essential step since we want to **reduce the complexity of our state to reduce the computation time needed for training**.

So what we do is **reduce the state space to 84x84 and grayscale it** (since the colors in Atari environments don't add important information).
This is an essential saving since we **reduce our three color channels (RGB) to 1**.

We can also **crop a part of the screen in some games** if it does not contain important information.
Then we stack four frames together.
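A rough version of this pipeline can be sketched as follows (our own illustration; real pipelines use proper image resizing, e.g. via OpenCV, whereas here we grayscale by averaging the RGB channels and "resize" a fake 168x168 frame to 84x84 by 2x2 block averaging):

```python
from collections import deque

import numpy as np

def preprocess(frame_rgb):
    gray = frame_rgb.mean(axis=-1)                 # drop RGB -> (168, 168)
    h, w = gray.shape
    # 2x2 block averaging as a stand-in for a real resize -> (84, 84)
    return gray.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

frames = deque(maxlen=4)            # keeps only the last 4 frames
for _ in range(4):
    raw = np.zeros((168, 168, 3))   # fake RGB frame
    frames.append(preprocess(raw))

state = np.stack(frames)            # shape (4, 84, 84): the DQN input
```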
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/preprocessing.jpg" alt="Preprocessing"/>

Why do we stack four frames together?
We stack frames together because it helps us **handle the problem of temporal limitation**. Let’s take an example with the game of Pong. When you see this frame:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/temporal-limitation.jpg" alt="Temporal Limitation"/>

Can you tell me where the ball is going?
No, because one frame is not enough to have a sense of motion! But what if I add three more frames? **Here you can see that the ball is going to the right**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/temporal-limitation-2.jpg" alt="Temporal Limitation"/>

That’s why, to capture temporal information, we stack four frames together.

The stacked frames are then processed by three convolutional layers. These layers **allow us to capture and exploit spatial relationships in images**. And because the frames are stacked together, **they can also exploit some temporal properties across those frames**.

Finally, we have a couple of fully connected layers that output a Q-value for each possible action at that state.
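As a sanity check on those layers (our own sketch, assuming the classic DQN layer sizes of 8x8 stride 4, 4x4 stride 2, and 3x3 stride 1 with no padding; the course network may differ), we can trace the shapes flowing through the convolutional stack:

```python
def conv_out(size, kernel, stride):
    # Output size of a conv layer with no padding.
    return (size - kernel) // stride + 1

size = 84                         # input frames are 84x84
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_out(size, kernel, stride)

# 84 -> 20 -> 9 -> 7: assuming 64 output channels in the last layer,
# the flattened vector fed to the fully connected layers has
# 7 * 7 * 64 = 3136 features.
flat_features = size * size * 64
```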
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/deep-q-network.jpg" alt="Deep Q Network"/>

So, we see that Deep Q-Learning uses a neural network to approximate, given a state, the different Q-values for each possible action at that state. Let’s now study the Deep Q-Learning algorithm.
34
chapters/en/unit3/from-q-to-dqn.mdx
Normal file
@@ -0,0 +1,34 @@
# From Q-Learning to Deep Q-Learning [[from-q-to-dqn]]

We learned that **Q-Learning is an algorithm we use to train our Q-Function**, an **action-value function** that determines the value of being at a particular state and taking a specific action at that state.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function.jpg" alt="Q-function"/>
<figcaption>Given a state and action, our Q Function outputs a state-action value (also called Q-value)</figcaption>
</figure>
The **Q comes from "the Quality" of that action at that state.**

Internally, our Q-function has **a Q-table, a table where each cell corresponds to a state-action pair value.** Think of this Q-table as **the memory or cheat sheet of our Q-function.**

The problem is that Q-Learning is a *tabular method*: it applies to problems in which the state and action spaces **are small enough for the value function to be represented as arrays and tables**. And this is **not scalable**.
Q-Learning worked well with small state space environments like:

- FrozenLake, where we had 16 states.
- Taxi-v3, where we had 500 states.

But think of what we're going to do today: we will train an agent to learn to play Space Invaders using the frames as input.
As **[Nikita Melkozerov mentioned](https://twitter.com/meln1k), Atari environments** have an observation space with a shape of (210, 160, 3), containing values ranging from 0 to 255. That gives us 256^(210x160x3) = 256^100800 possible observations (for comparison, there are approximately 10^80 atoms in the observable universe).

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari.jpg" alt="Atari State Space"/>

Therefore, the state space is gigantic; creating and updating a Q-table for that environment would not be efficient. In this case, the best idea is to approximate the Q-values using a parametrized Q-function \\(Q_{\theta}(s,a)\\) instead of a Q-table.
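We can sanity-check the size of that observation space with a few lines of arithmetic (our own sketch):

```python
import math

# 256 possible values per pixel component, 210 * 160 * 3 components
# per frame.
n_components = 210 * 160 * 3                 # 100800
digits = n_components * math.log10(256)      # digits of 256**100800

# Over 240,000 decimal digits, versus roughly 80 digits (10**80) for
# the number of atoms in the observable universe.
```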
This neural network will approximate, given a state, the different Q-values for each possible action at that state. And that's exactly what Deep Q-Learning does.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/deep.jpg" alt="Deep Q Learning"/>

Now that we understand Deep Q-Learning, let's dive deeper into the Deep Q-Network.
1
chapters/en/unit3/hands-on.mdx
Normal file
@@ -0,0 +1 @@
# Hands-on [[hands-on]]
15
chapters/en/unit3/introduction.mdx
Normal file
@@ -0,0 +1,15 @@
# Deep Q-Learning [[deep-q-learning]]

In the last unit, we learned our first reinforcement learning algorithm: Q-Learning. We **implemented it from scratch** and trained it in two environments, FrozenLake-v1 ☃️ and Taxi-v3 🚕.

We got excellent results with this simple algorithm. But these environments were relatively simple because their **state spaces were discrete and small** (16 different states for FrozenLake-v1 and 500 for Taxi-v3).

But as we'll see, producing and updating a **Q-table can become ineffective in large state space environments.**

So in this unit, **we'll study our first Deep Reinforcement Learning agent**: Deep Q-Learning. Instead of using a Q-table, Deep Q-Learning uses a neural network that takes a state and approximates the Q-values for each action based on that state.

And **we'll train it to play Space Invaders and other Atari environments using [RL-Zoo](https://github.com/DLR-RM/rl-baselines3-zoo)**, a training framework for RL based on Stable-Baselines that provides scripts for training, evaluating agents, tuning hyperparameters, plotting results, and recording videos.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif" alt="Environments"/>

So let’s get started! 🚀
104
chapters/en/unit3/quiz.mdx
Normal file
@@ -0,0 +1,104 @@
# Quiz [[quiz]]

The best way to learn and [to avoid the illusion of competence](https://fr.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.

### Q1: What are tabular methods?
<details>
<summary>Solution</summary>

*Tabular methods* apply to problems in which the state and action spaces are small enough for the value function to be **represented as arrays and tables**. For instance, **Q-Learning is a tabular method** since we use a table to represent the state-action value pairs.

</details>

### Q2: Why can't we use classical Q-Learning to solve an Atari game?
<Question
  choices={[
    {
      text: "Atari environments are too fast for Q-Learning",
      explain: "Speed is not the issue; the size of the observation space is."
    },
    {
      text: "Atari environments have a big observation space, so creating and updating the Q-table would not be efficient",
      explain: "",
      correct: true
    }
  ]}
/>
### Q3: Why do we stack four frames together when we use frames as input in Deep Q-Learning?

<details>
<summary>Solution</summary>

We stack frames together because it helps us **handle the problem of temporal limitation**: one frame is not enough to capture temporal information.
For instance, in Pong, our agent **will be unable to know the ball's direction if it gets only one frame**.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/temporal-limitation.jpg" alt="Temporal limitation"/>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/temporal-limitation-2.jpg" alt="Temporal limitation"/>

</details>
### Q4: What are the two phases of Deep Q-Learning?

<Question
  choices={[
    {
      text: "Sampling",
      explain: "We perform actions and store the observed experience tuples in a replay memory.",
      correct: true,
    },
    {
      text: "Shuffling",
      explain: "",
    },
    {
      text: "Reranking",
      explain: "",
    },
    {
      text: "Training",
      explain: "We select a small batch of tuples randomly and learn from it using a gradient descent update step.",
      correct: true,
    }
  ]}
/>
### Q5: Why do we create a replay memory in Deep Q-Learning?

<details>
<summary>Solution</summary>

**1. Make more efficient use of the experiences during training**

Usually, in online reinforcement learning, we interact with the environment, get experiences (state, action, reward, and next state), learn from them (update the neural network), and discard them.
But with experience replay, **we create a replay buffer that saves experience samples that we can reuse during training**.

**2. Avoid forgetting previous experiences and reduce the correlation between experiences**

If we feed sequential samples of experiences to our neural network, it **tends to forget previous experiences as it overwrites them with new ones**. For instance, if the agent is on the first level and then the second, which is different, it can forget how to behave and play on the first level.

</details>
### Q6: How do we use Double Deep Q-Learning?

<details>
<summary>Solution</summary>

When we compute the Q-target, we use two networks to decouple the action selection from the target Q-value generation. We:

- Use our *DQN network* to **select the best action to take for the next state** (the action with the highest Q-value).
- Use our *Target network* to calculate **the target Q-value of taking that action at the next state**.

</details>

Congrats on finishing this quiz 🥳! If you missed some elements, take time to read the chapter again to reinforce (😏) your knowledge.
1
chapters/en/unitbonus1/introduction.mdx
Normal file
@@ -0,0 +1 @@
# Introduction [[introduction]]
1
chapters/en/unitbonus2/hands-on.mdx
Normal file
@@ -0,0 +1 @@
# Hands-on [[hands-on]]
7
chapters/en/unitbonus2/introduction.mdx
Normal file
@@ -0,0 +1,7 @@
# Introduction [[introduction]]

One of the most critical tasks in Deep Reinforcement Learning is finding a good set of training hyperparameters.

<img src="https://raw.githubusercontent.com/optuna/optuna/master/docs/image/optuna-logo.png" alt="Optuna Logo"/>

Optuna is a library that helps you automate this search. In this unit, we'll study a bit of the theory behind automatic hyperparameter tuning, then try to optimize the parameters manually, and finally see how to automate the search using Optuna.
5
chapters/en/unitbonus2/optuna.mdx
Normal file
@@ -0,0 +1,5 @@
# Optuna Tutorial [[optuna]]

<Youtube id="AidFTOdGNFQ" />

The content below comes from Antonin Raffin's ICRA 2022 presentation; he's one of the founders of Stable-Baselines and RL-Baselines3-Zoo.
@@ -1,6 +0,0 @@
- title: 0. Setup
  sections:
  - local: index
    title: Index Page
  - local: unit1/1
    title: Introduction
@@ -1 +0,0 @@
# Index Page
@@ -1,487 +0,0 @@
# An Introduction to Deep Reinforcement Learning [[introduction-to-deep-rl]]
TODO: ADD IMAGE THUMBNAIL

Welcome to the most fascinating topic in Artificial Intelligence: **Deep Reinforcement Learning.**
Deep RL is a type of Machine Learning where an agent learns **how to behave** in an environment **by performing actions** and **seeing the results.**

So in this first chapter, **you’ll learn the foundations of Deep Reinforcement Learning.**

Then, you'll train your first two Deep Reinforcement Learning agents:

1. A Lunar Lander agent that will learn to **land correctly on the Moon 🌕**
2. A car that needs **to reach the top of the mountain ⛰️**.
TODO: Add illustration MountainCar and MoonLanding

And finally, you’ll **upload them to the Hugging Face Hub 🤗, a free, open platform where people can share ML models, datasets and demos.**

TODO: ADD model card illustration

Here’s what you’re going to accomplish by the end of this unit.

It’s essential **to master these elements** before diving into implementing Deep Reinforcement Learning agents. The goal of this chapter is to give you solid foundations.

So let’s get started! 🚀
- [What is Reinforcement Learning?](notion://www.notion.so/8b87232a66e34c58a27683ff77fc7d0f#what-is-reinforcement-learning)
- [The big picture](notion://www.notion.so/8b87232a66e34c58a27683ff77fc7d0f#the-big-picture)
- [A formal definition](notion://www.notion.so/8b87232a66e34c58a27683ff77fc7d0f#a-formal-definition)
- [The Reinforcement Learning Framework](notion://www.notion.so/8b87232a66e34c58a27683ff77fc7d0f#the-reinforcement-learning-framework)
- [The RL Process](notion://www.notion.so/8b87232a66e34c58a27683ff77fc7d0f#the-rl-process)
- [The reward hypothesis: the central idea of Reinforcement Learning](notion://www.notion.so/8b87232a66e34c58a27683ff77fc7d0f#the-reward-hypothesis-the-central-idea-of-reinforcement-learning)
- [Markov Property](notion://www.notion.so/8b87232a66e34c58a27683ff77fc7d0f#markov-property)
- [Observations/States Space](notion://www.notion.so/8b87232a66e34c58a27683ff77fc7d0f#observationsstates-space)
- [Action Space](notion://www.notion.so/8b87232a66e34c58a27683ff77fc7d0f#action-space)
- [Rewards and the discounting](notion://www.notion.so/8b87232a66e34c58a27683ff77fc7d0f#rewards-and-the-discounting)
- [Type of tasks](notion://www.notion.so/8b87232a66e34c58a27683ff77fc7d0f#type-of-tasks)
- [Exploration/ Exploitation tradeoff](notion://www.notion.so/8b87232a66e34c58a27683ff77fc7d0f#exploration-exploitation-tradeoff)
- [The two main approaches for solving RL problems](notion://www.notion.so/8b87232a66e34c58a27683ff77fc7d0f#the-two-main-approaches-for-solving-rl-problems)
- [The Policy π: the agent’s brain](notion://www.notion.so/8b87232a66e34c58a27683ff77fc7d0f#the-policy-%CF%80-the-agents-brain)
- [Policy-Based Methods](notion://www.notion.so/8b87232a66e34c58a27683ff77fc7d0f#policy-based-methods)
- [Value-based methods](notion://www.notion.so/8b87232a66e34c58a27683ff77fc7d0f#value-based-methods)
- [The “Deep” in Reinforcement Learning](notion://www.notion.so/8b87232a66e34c58a27683ff77fc7d0f#the-deep-in-reinforcement-learning)
## What is Reinforcement Learning? [[what-is-rl]]

To understand Reinforcement Learning, let’s start with the big picture.

### The big picture [[the-big-picture]]

The idea behind Reinforcement Learning is that an agent (an AI) will learn from the environment by **interacting with it** (through trial and error) and **receiving rewards** (negative or positive) as feedback for performing actions.
Learning from interaction with the environment **comes from our natural experiences.**

For instance, imagine putting your little brother in front of a video game he has never played, with a controller in his hands, and leaving him alone.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/Illustration_1.jpg" alt="Illustration_1" width="100%">

Your brother will interact with the environment (the video game) by pressing the right button (action). He got a coin: that’s a +1 reward. It’s positive, so he just understood that in this game **he must get the coins.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/Illustration_2.jpg" alt="Illustration_2" width="100%">
But then, **he presses right again** and touches an enemy. He just died: -1 reward.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/Illustration_3.jpg" alt="Illustration_3" width="100%">

By interacting with his environment through trial and error, your little brother understood that **he needs to get coins in this environment but avoid the enemies.**

**Without any supervision**, the child will get better and better at playing the game.

That’s how humans and animals learn: **through interaction.** Reinforcement Learning is just a **computational approach to learning from actions.**
### A formal definition [[formal-definition]]

Let's now take a formal definition:

<Tip>
Reinforcement learning is a framework for solving control tasks (also called decision problems) by building agents that learn from the environment by interacting with it through trial and error and receiving rewards (positive or negative) as unique feedback.
</Tip>

⇒ But how does Reinforcement Learning work?

## The Reinforcement Learning Framework [[rl-framework]]
### The RL Process [[rl-process]]

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process.jpg" alt="The RL process" width="100%">
<figcaption>The RL Process: a loop of state, action, reward and next state</figcaption>
<figcaption>Source: <a href="http://incompleteideas.net/book/RLbook2020.pdf">Reinforcement Learning: An Introduction, Richard Sutton and Andrew G. Barto</a></figcaption>
To understand the RL process, let’s imagine an agent learning to play a platform game:
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process_game.jpg" alt="The RL process" width="100%">
|
||||
|
||||
<!--<figure class="image table text-center m-0 w-full">
|
||||
<img src="assets/63_deep_rl_intro/" alt="The RL process"/>
|
||||
</figure>-->
|
||||
|
||||
- Our Agent receives **state \\(S_0\\)** from the **Environment** — we receive the first frame of our game (Environment).
|
||||
- Based on that **state \\(S_0\\),** the Agent takes **action \\(A_0\\)** — our Agent will move to the right.
|
||||
- Environment goes to a **new** **state \\(S_1\\)** — new frame.
|
||||
- The environment gives some **reward \\(R_1\\)** to the Agent — we’re not dead *(Positive Reward +1)*.
|
||||
|
||||
This RL loop outputs a sequence of **state, action, reward and next state.**
|
||||
|
||||
<figure class="image table text-center m-0 w-full">
|
||||
<img src="assets/63_deep_rl_intro/sars.jpg" alt="State, Action, Reward, Next State"/>
|
||||
</figure>
|
||||
|
||||
The agent's goal is to maximize its cumulative reward, **called the expected return.**
|
||||
|
||||
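Before going further, the loop above can be sketched in a few lines of code. This is a toy, hypothetical environment written purely for illustration (the environment class, positions, and reward values are made up and are not part of the course notebooks):

```python
import random

class ToyPlatformEnv:
    """A made-up 1-D “platform game”: reach position 5 to finish the level."""
    def __init__(self):
        self.position = 0          # the state is just the agent’s position

    def reset(self):
        self.position = 0
        return self.position       # initial state S0

    def step(self, action):
        # action: +1 (move right) or -1 (move left)
        self.position = max(0, self.position + action)
        done = self.position >= 5  # terminal state: end of the level
        reward = 1 if done else 0  # reward given by the environment
        return self.position, reward, done

env = ToyPlatformEnv()
state = env.reset()                         # the Agent receives state S0
done, total_reward = False, 0
while not done:
    action = random.choice([-1, 1])         # the Agent takes an action (random for now)
    state, reward, done = env.step(action)  # the Environment returns next state and reward
    total_reward += reward                  # accumulate the return
print(total_reward)                         # 1: the only reward comes at the terminal state
```

A real agent would of course replace the random action choice with a learned policy; the point here is only the shape of the state → action → reward → next state loop.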
### The reward hypothesis: the central idea of Reinforcement Learning [[reward-hypothesis]]

⇒ Why is the goal of the agent to maximize the expected return?

Because RL is based on the **reward hypothesis**, which is that all goals can be described as the **maximization of the expected return** (expected cumulative reward).

That’s why in Reinforcement Learning, **to have the best behavior,** we need to **maximize the expected cumulative reward.**

### Markov Property [[markov-property]]

In papers, you’ll see that the RL process is called the **Markov Decision Process** (MDP).

We’ll talk again about the Markov Property in the following units. But if you need to remember one thing about it today, it’s this: the Markov Property implies that our agent needs **only the current state to decide** what action to take, and **not the history of all the states and actions** it took before.
### Observations/States Space [[obs-space]]

Observations/States are the **information our agent gets from the environment.** In the case of a video game, it can be a frame (a screenshot). In the case of a trading agent, it can be the value of a certain stock, etc.

There is a distinction to make between *observation* and *state*:

- *State s*: a **complete description of the state of the world** (there is no hidden information), in a fully observed environment.

<figure class="image table text-center m-0 w-full">
  <img class="center" src="assets/63_deep_rl_intro/chess.jpg" alt="Chess"/>
  <figcaption>In a chess game, we receive a state from the environment since we have access to the whole chessboard information.</figcaption>
</figure>

With a chess game, we are in a fully observed environment, since we have access to the whole chessboard information.

- *Observation o*: a **partial description of the state,** in a partially observed environment.

<figure class="image table text-center m-0 w-full">
  <img class="center" src="assets/63_deep_rl_intro/mario.jpg" alt="Mario"/>
  <figcaption>In Super Mario Bros, we only see a part of the level close to the player, so we receive an observation.</figcaption>
</figure>

In Super Mario Bros, we are in a partially observed environment. We receive an observation **since we only see a part of the level.**

> In reality, we use the term state in this course but we will make the distinction in implementations.

To recap:
<figure class="image table text-center m-0 w-full">
  <img src="assets/63_deep_rl_intro/obs_space_recap.jpg" alt="Obs space recap"/>
</figure>
### Action Space [[action-space]]

The Action space is the set of **all possible actions in an environment.**

The actions can come from a *discrete* or *continuous* space:

- *Discrete space*: the number of possible actions is **finite**.

<figure class="image table image-center text-center m-0 w-full">
  <img class="center" src="assets/63_deep_rl_intro/mario.jpg" alt="Mario"/>
  <figcaption>Again, in Super Mario Bros, only 4 directions and jump are possible</figcaption>
</figure>

In Super Mario Bros, we have a finite set of actions since we have only 4 directions and jump.

- *Continuous space*: the number of possible actions is **infinite**.

<figure class="image table text-center m-0 w-full">
  <img src="assets/63_deep_rl_intro/self_driving_car.jpg" alt="Self Driving Car"/>
  <figcaption>A Self Driving Car agent has an infinite number of possible actions since it can turn left 20°, 21.1°, 21.2°, honk, turn right 20°…</figcaption>
</figure>

To recap:
<figure class="image table text-center m-0 w-full">
  <img src="assets/63_deep_rl_intro/action_space.jpg" alt="Recap action space"/>
</figure>

Taking this information into consideration is crucial because it will **have importance when choosing the RL algorithm in the future.**
### Rewards and the discounting [[rewards]]

The reward is fundamental in RL because it’s **the only feedback** for the agent. Thanks to it, our agent knows **if the action taken was good or not.**

The cumulative reward at each time step t can be written as:

<figure class="image table text-center m-0 w-full">
  <img src="assets/63_deep_rl_intro/rewards_1.jpg" alt="Rewards"/>
  <figcaption>The cumulative reward equals the sum of all rewards of the sequence.</figcaption>
</figure>

Which is equivalent to:

<figure class="image table text-center m-0 w-full">
  <img src="assets/63_deep_rl_intro/rewards_2.jpg" alt="Rewards"/>
  <figcaption>The cumulative reward = rt+1 (rt+k+1 = rt+0+1 = rt+1) + rt+2 (rt+k+1 = rt+1+1 = rt+2) + ...</figcaption>
</figure>
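In symbols, what the two images above show is simply:

\\[R(\tau) = r_{t+1} + r_{t+2} + r_{t+3} + \ldots = \sum_{k=0}^{\infty} r_{t+k+1}\\]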
However, in reality, **we can’t just add them like that.** The rewards that come sooner (at the beginning of the game) **are more likely to happen** since they are more predictable than the long-term future reward.

Let’s say your agent is this tiny mouse that can move one tile each time step, and your opponent is the cat (that can move too). Your goal is **to eat the maximum amount of cheese before being eaten by the cat.**

<figure class="image table text-center m-0 w-full">
  <img src="assets/63_deep_rl_intro/rewards_3.jpg" alt="Rewards"/>
</figure>

As we can see in the diagram, **it’s more probable to eat the cheese near us than the cheese close to the cat** (the closer we are to the cat, the more dangerous it is).

Consequently, **the reward near the cat, even if it is bigger (more cheese), will be more discounted** since we’re not really sure we’ll be able to eat it.

To discount the rewards, we proceed like this:

1. We define a discount rate called gamma. **It must be between 0 and 1.** Most of the time between **0.95 and 0.99**.
- The larger the gamma, the smaller the discount. This means our agent **cares more about the long-term reward.**
- On the other hand, the smaller the gamma, the bigger the discount. This means our **agent cares more about the short-term reward (the nearest cheese).**

2. Then, each reward will be discounted by gamma to the exponent of the time step. As the time step increases, the cat gets closer to us, **so the future reward is less and less likely to happen.**

Our discounted expected cumulative reward is:

<figure class="image table text-center m-0 w-full">
  <img src="assets/63_deep_rl_intro/rewards_4.jpg" alt="Rewards"/>
</figure>
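As a small sketch of the discounted return in the image above (the helper name and the reward sequences are illustrative, not from the course), you can compute \\(\sum_{k} \gamma^k r_{t+k+1}\\) like this:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^k * r_{t+k+1} over a reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Three small cheeses eaten immediately vs. one big cheese five steps away:
near = discounted_return([1, 1, 1], gamma=0.9)             # 1 + 0.9 + 0.81 = 2.71
far = discounted_return([0, 0, 0, 0, 0, 10], gamma=0.9)    # 0.9**5 * 10 ≈ 5.90
```

With gamma = 0.9, the distant +10 is still worth more than the three nearby +1s; with a smaller gamma, the balance tips toward the nearest cheese.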
### Type of tasks [[tasks]]

A task is an **instance** of a Reinforcement Learning problem. We can have two types of tasks: episodic and continuing.

### Episodic task [[episodic-task]]

In this case, we have a starting point and an ending point **(a terminal state). This creates an episode**: a list of States, Actions, Rewards, and new States.

For instance, think about Super Mario Bros: an episode begins at the launch of a new Mario level and ends **when you’re killed or you reach the end of the level.**

<figure class="image table text-center m-0 w-full">
  <img class="center" src="assets/63_deep_rl_intro/mario.jpg" alt="Mario"/>
  <figcaption>Beginning of a new episode.</figcaption>
</figure>

### Continuing tasks [[continuing-tasks]]

These are tasks that continue forever (no terminal state). In this case, the agent must **learn how to choose the best actions and simultaneously interact with the environment.**

For instance, an agent that does automated stock trading. For this task, there is no starting point and no terminal state. **The agent keeps running until we decide to stop it.**

<figure class="image table text-center m-0 w-full">
  <img src="assets/63_deep_rl_intro/stock.jpg" alt="Stock Market"/>
</figure>

<figure class="image table text-center m-0 w-full">
  <img src="assets/63_deep_rl_intro/tasks.jpg" alt="Tasks recap"/>
</figure>
## The Exploration/Exploitation tradeoff [[exp-exp-tradeoff]]

Finally, before looking at the different methods to solve Reinforcement Learning problems, we must cover one more very important topic: *the exploration/exploitation trade-off.*

- Exploration is exploring the environment by trying random actions in order to **find more information about the environment.**
- Exploitation is **exploiting known information to maximize the reward.**

Remember, the goal of our RL agent is to maximize the expected cumulative reward. However, **we can fall into a common trap**.

Let’s take an example:

<figure class="image table text-center m-0 w-full">
  <img src="assets/63_deep_rl_intro/exp_1.jpg" alt="Exploration"/>
</figure>

In this game, our mouse can have an **infinite amount of small cheese** (+1 each). But at the top of the maze, there is a gigantic sum of cheese (+1000).

However, if we only focus on exploitation, our agent will never reach the gigantic sum of cheese. Instead, it will only exploit **the nearest source of rewards,** even if this source is small (exploitation).

But if our agent does a little bit of exploration, it can **discover the big reward** (the pile of big cheese).

This is what we call the exploration/exploitation trade-off. We need to balance how much we **explore the environment** and how much we **exploit what we know about the environment.**

Therefore, we must **define a rule that helps to handle this trade-off**. We’ll see different ways to handle it in future chapters.
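One such rule, which we’ll meet again when we study Q-Learning, is *epsilon-greedy*: exploit the best-known action most of the time, but explore with a small probability epsilon. A minimal sketch (the action names and value estimates below are invented for illustration):

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """With probability epsilon, explore (random action); otherwise exploit (best known action)."""
    if random.random() < epsilon:
        return random.choice(list(action_values))        # exploration
    return max(action_values, key=action_values.get)     # exploitation

# Hypothetical estimated values for each action in the current state:
values = {"left": 0.2, "right": 1.0, "jump": 0.5}
action = epsilon_greedy(values, epsilon=0.1)  # usually "right", occasionally a random action
```

Setting epsilon = 0 gives a purely exploiting agent (the mouse stuck on the small cheese), while epsilon = 1 gives a purely exploring one.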
If it’s still confusing, **think of a real problem: the choice of a restaurant:**

<figure class="image table text-center m-0 w-full">
  <img src="assets/63_deep_rl_intro/exp_2.jpg" alt="Exploration"/>
  <figcaption>Source: <a href="http://rail.eecs.berkeley.edu/deeprlcourse-fa17/f17docs/lecture_13_exploration.pdf">Berkeley AI Course</a></figcaption>
</figure>

- *Exploitation*: You go to the same restaurant every day. You know it’s good, but you **take the risk of missing another, better restaurant.**
- *Exploration*: You try restaurants you’ve never been to before, with the risk of having a bad experience **but the probable opportunity of a fantastic experience.**

To recap:
<figure class="image table text-center m-0 w-full">
  <img src="assets/63_deep_rl_intro/expexpltradeoff.jpg" alt="Exploration Exploitation Tradeoff"/>
</figure>
## **The two main approaches for solving RL problems**

⇒ Now that we have learned the RL framework, how do we solve the RL problem?

In other words, how do we build an RL agent that can **select the actions that maximize its expected cumulative reward?**

### **The Policy π: the agent’s brain**

The Policy **π** is the **brain of our Agent**: it’s the function that tells us what **action to take given the state we are in.** So it **defines the agent’s behavior** at a given time.

<figure class="image table text-center m-0 w-full">
  <img src="assets/63_deep_rl_intro/policy_1.jpg" alt="Policy"/>
  <figcaption>Think of the policy as the brain of our agent, the function that tells us the action to take given a state</figcaption>
</figure>

This Policy **is the function we want to learn.** Our goal is to find the optimal policy **π\***, the policy that **maximizes the expected return** when the agent acts according to it. We find this **π\*** through **training.**

There are two approaches to train our agent to find this optimal policy π*:

- **Directly,** by teaching the agent to learn which **action to take,** given the state it is in: **Policy-Based Methods.**
- **Indirectly,** by teaching the agent to learn **which state is more valuable** and then take the action that **leads to the more valuable states**: **Value-Based Methods.**
### **Policy-Based Methods**

In Policy-Based Methods, **we learn a policy function directly.**

This function will map each state to the best corresponding action at that state, **or to a probability distribution over the set of possible actions at that state.**

<figure class="image table text-center m-0 w-full">
  <img src="assets/63_deep_rl_intro/policy_2.jpg" alt="Policy"/>
  <figcaption>As we can see here, the policy (deterministic) <b>directly indicates the action to take for each step.</b></figcaption>
</figure>

We have two types of policy:

- *Deterministic*: a policy at a given state **will always return the same action.**

<figure class="image table text-center m-0 w-full">
  <img src="assets/63_deep_rl_intro/policy_3.jpg" alt="Policy"/>
  <figcaption>action = policy(state)</figcaption>
</figure>

<figure class="image table text-center m-0 w-full">
  <img src="assets/63_deep_rl_intro/policy_4.jpg" alt="Policy"/>
</figure>

- *Stochastic*: a policy that outputs **a probability distribution over actions.**

<figure class="image table text-center m-0 w-full">
  <img src="assets/63_deep_rl_intro/policy_5.jpg" alt="Policy"/>
  <figcaption>policy(actions | state) = probability distribution over the set of actions given the current state</figcaption>
</figure>

<figure class="image table text-center m-0 w-full">
  <img class="center" src="assets/63_deep_rl_intro/mario.jpg" alt="Mario"/>
  <figcaption>Given an initial state, our stochastic policy will output probability distributions over the possible actions at that state.</figcaption>
</figure>
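The two types of policy can be sketched as functions. The states, actions, and probabilities below are invented for illustration:

```python
import random

# Deterministic: action = policy(state) — the same state always yields the same action.
deterministic_policy = {"s0": "right", "s1": "jump"}

def act_deterministic(state):
    return deterministic_policy[state]

# Stochastic: policy(. | state) is a probability distribution over actions,
# and the agent samples an action from it.
stochastic_policy = {"s0": {"right": 0.7, "left": 0.1, "jump": 0.2}}

def act_stochastic(state):
    dist = stochastic_policy[state]
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs)[0]  # sample one action

print(act_deterministic("s0"))  # always "right"
```

Calling `act_stochastic("s0")` repeatedly returns "right" about 70% of the time, "jump" about 20%, and "left" about 10%.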
If we recap:

<figure class="image table text-center m-0 w-full">
  <img src="assets/63_deep_rl_intro/pbm_1.jpg" alt="Pbm recap"/>
</figure>
<figure class="image table text-center m-0 w-full">
  <img src="assets/63_deep_rl_intro/pbm_2.jpg" alt="Pbm recap"/>
</figure>
### **Value-based methods**

In Value-based methods, instead of training a policy function, we **train a value function** that maps a state to the expected value **of being in that state.**

The value of a state is the **expected discounted return** the agent can get if it **starts in that state and then acts according to our policy.**

“Act according to our policy” just means that our policy is **“going to the state with the highest value.”**

<figure class="image table text-center m-0 w-full">
  <img src="assets/63_deep_rl_intro/value_1.jpg" alt="Value based RL"/>
</figure>

Here we see that our value function **defines a value for each possible state.**

<figure class="image table text-center m-0 w-full">
  <img src="assets/63_deep_rl_intro/value_2.jpg" alt="Value based RL"/>
  <figcaption>Thanks to our value function, at each step our policy will select the state with the biggest value defined by the value function: -7, then -6, then -5 (and so on) to attain the goal.</figcaption>
</figure>
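A tiny sketch of this idea in code, with values chosen to mimic the -7, -6, -5 progression in the image (the state names, neighbor structure, and values are invented for illustration):

```python
# Hypothetical state values produced by a learned value function:
V = {"A": -7, "B": -6, "C": -5, "goal": 0}

# Which states are reachable from each state:
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "goal"]}

def greedy_policy(state):
    """Act by moving to the reachable state with the highest value."""
    return max(neighbors[state], key=V.get)

# Following the value function greedily walks us to the goal:
path = ["A"]
while path[-1] != "goal":
    path.append(greedy_policy(path[-1]))
print(path)  # ['A', 'B', 'C', 'goal']
```

This is exactly the indirect approach: the value function never outputs an action itself; the policy is derived from it by always moving toward higher value.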
If we recap:

<figure class="image table text-center m-0 w-full">
  <img src="assets/63_deep_rl_intro/vbm_1.jpg" alt="Vbm recap"/>
</figure>
<figure class="image table text-center m-0 w-full">
  <img src="assets/63_deep_rl_intro/vbm_2.jpg" alt="Vbm recap"/>
</figure>
## **The “Deep” in Reinforcement Learning**

⇒ What we've talked about so far is Reinforcement Learning. But where does the "Deep" come into play?

Deep Reinforcement Learning introduces **deep neural networks to solve Reinforcement Learning problems** — hence the name “deep”.

For instance, in the next article, we’ll work on Q-Learning (classic Reinforcement Learning) and then Deep Q-Learning; both are value-based RL algorithms.

You’ll see the difference is that in the first approach, **we use a traditional algorithm** to create a Q table that helps us find what action to take for each state.

In the second approach, **we will use a Neural Network** (to approximate the Q-value).

<figure class="image table text-center m-0 w-full">
  <img src="assets/63_deep_rl_intro/deep.jpg" alt="Value based RL"/>
  <figcaption>Schema inspired by the Q learning notebook by Udacity</figcaption>
</figure>

If you are not familiar with Deep Learning, you should definitely watch <a href="https://course.fast.ai/">the fastai Practical Deep Learning for Coders course (free)</a>.
That was a lot of information! Let’s summarize:

- Reinforcement Learning is a computational approach to learning from actions. We build an agent that learns from the environment **by interacting with it through trial and error** and receiving rewards (negative or positive) as feedback.
- The goal of any RL agent is to maximize its expected cumulative reward (also called expected return) because RL is based on the **reward hypothesis**, which is that **all goals can be described as the maximization of the expected cumulative reward.**
- The RL process is a loop that outputs a sequence of **state, action, reward and next state.**
- To calculate the expected cumulative reward (expected return), we discount the rewards: the rewards that come sooner (at the beginning of the game) **are more probable to happen since they are more predictable than the long-term future reward.**
- To solve an RL problem, you want to **find an optimal policy.** The policy is the “brain” of your agent, which tells us **what action to take given a state.** The optimal policy is the one that **gives you the actions that maximize the expected return.**
- There are two ways to find your optimal policy:
  1. By training your policy directly: **policy-based methods.**
  2. By training a value function that tells us the expected return the agent will get at each state, and using this function to define our policy: **value-based methods.**
- Finally, we speak about Deep RL because we introduce **deep neural networks to estimate the action to take (policy-based) or to estimate the value of a state (value-based)**, hence the name “deep.”
---

Now that you've studied the bases of Reinforcement Learning, you’re ready to train your first lander agent to **land correctly on the Moon 🌕 and share it with the community through the Hub** 🔥

<figure class="image table text-center m-0 w-full">
  <video
    alt="LunarLander"
    style="max-width: 70%; margin: auto;"
    autoplay loop autobuffer muted playsinline
  >
  <source src="assets/63_deep_rl_intro/lunarlander.mp4" type="video/mp4">
  </video>
</figure>
Congrats on finishing this chapter! **That was the biggest one**, and there was a lot of information. And congrats on finishing the tutorial. You’ve just trained your first Deep RL agent and shared it on the Hub 🥳.

It’s **normal if you still feel confused** by all these elements. **This was the same for me and for everyone who studied RL.**

Take time to really grasp the material before continuing. It’s important to master these elements and have solid foundations before entering the **fun part.**

We published additional readings in the syllabus if you want to go deeper 👉 [https://github.com/huggingface/deep-rl-class/blob/main/unit1/README.md](https://github.com/huggingface/deep-rl-class/blob/main/unit1/README.md)

Naturally, during the course, **we’re going to use and explain these terms again**, but it’s better to understand them before diving into the next chapters.

In the next chapter, [we’re going to learn about Q-Learning and dive deeper **into the value-based methods.**](https://huggingface.co/blog/deep-rl-q-part1)

And don't forget to share with your friends who want to learn 🤗!

Finally, we want **to improve and update the course iteratively with your feedback**. If you have some, please fill in this form 👉 [https://forms.gle/3HgA7bEHwAmmLfwh9](https://forms.gle/3HgA7bEHwAmmLfwh9)

### Keep learning, stay awesome,