Finalized Unit 2
@@ -30,6 +30,8 @@
     title: Quiz
   - local: unit1/conclusion
     title: Conclusion
+  - local: unit1/additional-readings
+    title: Additional Readings
 - title: Bonus Unit 1. Introduction to Deep Reinforcement Learning with Huggy
   sections:
   - local: unitbonus1/introduction
@@ -60,8 +62,8 @@
     title: Second Quiz
   - local: unit2/conclusion
     title: Conclusion
-  - local: unit2/additional-reading
-    title: Additional Reading
+  - local: unit2/additional-readings
+    title: Additional Readings
 - title: Unit 3. Deep Q-Learning with Atari Games
   sections:
   - local: unit3/introduction
@@ -78,8 +80,8 @@
     title: Quiz
   - local: unit3/conclusion
     title: Conclusion
-  - local: unit3/additional-reading
-    title: Additional Reading
+  - local: unit3/additional-readings
+    title: Additional Readings
 - title: Unit Bonus 2. Automatic Hyperparameter Tuning with Optuna
   sections:
   - local: unitbonus2/introduction
units/en/unit1/additional-readings.mdx (new file, 11 lines)
@@ -0,0 +1,11 @@
+# Additional Readings [[additional-readings]]
+
+## Deep Reinforcement Learning [[deep-rl]]
+
+- [Reinforcement Learning: An Introduction, Richard Sutton and Andrew G. Barto, Chapter 1, 2 and 3](http://incompleteideas.net/book/RLbook2020.pdf)
+- [Foundations of Deep RL Series, L1 MDPs, Exact Solution Methods, Max-ent RL by Pieter Abbeel](https://youtu.be/2GwBez0D20A)
+- [Spinning Up RL by OpenAI Part 1: Key concepts of RL](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html)
+
+## Gym [[gym]]
+
+- [Getting Started With OpenAI Gym: The Basic Building Blocks](https://blog.paperspace.com/getting-started-with-openai-gym/)
@@ -1 +0,0 @@
-# Additional Reading [[additional-reading]]
units/en/unit2/additional-readings.mdx (new file, 13 lines)
@@ -0,0 +1,13 @@
+# Additional Readings [[additional-readings]]
+
+## Monte Carlo and TD Learning [[mc-td]]
+
+To dive deeper into Monte Carlo and Temporal Difference Learning:
+
+- <a href="https://stats.stackexchange.com/questions/355820/why-do-temporal-difference-td-methods-have-lower-variance-than-monte-carlo-met">Why do temporal difference (TD) methods have lower variance than Monte Carlo methods?</a>
+- <a href="https://stats.stackexchange.com/questions/336974/when-are-monte-carlo-methods-preferred-over-temporal-difference-ones">When are Monte Carlo methods preferred over temporal difference ones?</a>
+
+## Q-Learning [[q-learning]]
+
+- <a href="http://incompleteideas.net/book/RLbook2020.pdf">Reinforcement Learning: An Introduction, Richard Sutton and Andrew G. Barto, Chapter 5, 6 and 7</a>
+- <a href="https://youtu.be/Psrhxy88zww">Foundations of Deep RL Series, L2 Deep Q-Learning by Pieter Abbeel</a>
@@ -5,9 +5,9 @@ The Bellman equation **simplifies our state value or state-action value calcula
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman.jpg" alt="Bellman equation"/>
 
-With what we learned from now, we know that if we calculate the \\(V(S_t)\\) (value of a state), we need to calculate the return starting at that state and then follow the policy forever after. **(Our policy that we defined in the following example is a Greedy Policy, and for simplification, we don't discount the reward).**
+With what we learned so far, we know that if we calculate \\(V(S_t)\\) (the value of a state), we need to calculate the return starting at that state and then follow the policy forever after. **(The policy we define in the following example is a Greedy Policy, and for simplicity, we don't discount the reward.)**
 
-So to calculate \\(V(S_t)\\), we need to make the sum of the expected rewards. Hence:
+So to calculate \\(V(S_t)\\), we need to calculate the sum of the expected rewards. Hence:
 
 <figure>
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman2.jpg" alt="Bellman equation"/>
@@ -35,7 +35,7 @@ The Bellman equation is a recursive equation that works like this: instead of st
 </figure>
 
-If we go back to our example, the value of State 1= expected cumulative return if we start at that state.
+If we go back to our example, we can say that the value of State 1 is equal to the expected cumulative return if we start at that state.
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman2.jpg" alt="Bellman equation"/>
 
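To make this recursion concrete, here is a minimal worked form of the equation under the example's simplifications (greedy policy, no discounting): the value of a state is the immediate reward plus the value of the state that follows,

\\(V(S_t) = R_{t+1} + V(S_{t+1})\\)

and, with the discount factor \\(\gamma\\) restored, the general Bellman expectation form reads:

\\(V(S_t) = \mathbb{E}[R_{t+1} + \gamma V(S_{t+1})]\\)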
@@ -1,2 +1,9 @@
 # Hands-on [[hands-on]]
 
+Now that we studied the Q-Learning algorithm, let's implement it from scratch and train our Q-Learning agent in two environments:
+1. [Frozen-Lake-v1 (non-slippery and slippery version)](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/) ☃️: where our agent will need to **go from the starting state (S) to the goal state (G)** by walking only on frozen tiles (F) and avoiding holes (H).
+2. [An autonomous taxi](https://www.gymlibrary.dev/environments/toy_text/taxi/) 🚖 will need **to learn to navigate** a city to **transport its passengers from point A to point B.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/envs.gif" alt="Environments"/>
+
+Thanks to a [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard), you'll be able to compare your results with other classmates and exchange best practices to improve your agent's scores. Who will win the challenge for Unit 2?
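As a rough starting point for that implementation, here is a minimal sketch (not the course notebook itself) that creates the two environments and sizes a Q-table from their discrete spaces. It assumes `gym` and `numpy` are installed, and the exact `gym` API can vary slightly between versions:

```python
import numpy as np
import gym

# Create the two environments used in this hands-on.
# is_slippery=False gives the deterministic (non-slippery) FrozenLake variant.
frozen_lake = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
taxi = gym.make("Taxi-v3")

# Both have discrete state and action spaces, so a simple table of Q-values works:
# FrozenLake 4x4 has 16 states and 4 actions, Taxi-v3 has 500 states and 6 actions.
q_table = np.zeros((frozen_lake.observation_space.n, frozen_lake.action_space.n))
print(q_table.shape)  # (16, 4)
```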
@@ -1,6 +1,7 @@
 # Introduction to Q-Learning [[introduction-q-learning]]
 
-ADD THUMBNAIL
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/thumbnail.jpg" alt="Unit 2 thumbnail" width="100%">
 
 In the first chapter of this class, we learned about Reinforcement Learning (RL), the RL process, and the different methods to solve an RL problem. We also **trained our first agents and uploaded them to the Hugging Face Hub.**
@@ -14,13 +15,11 @@ We'll also **implement our first RL agent from scratch**: a Q-Learning agent an
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/envs.gif" alt="Environments"/>
 
-Concretely, we will:
+Concretely, we'll:
 
-* learn about value-based methods
-* learn about the differences between Monte Carlo and Temporal Difference Learning
-* study and implement our first RL algorithm: Q-Learning
-* implement our first RL agent
+- Learn about **value-based methods**.
+- Learn about the **differences between Monte Carlo and Temporal Difference Learning**.
+- Study and implement **our first RL algorithm**: Q-Learning.
 
 This unit is **fundamental if you want to be able to work on Deep Q-Learning**: the first Deep RL algorithm that played Atari games and beat the human level on some of them (breakout, space invaders…).
@@ -121,6 +121,6 @@ Now we **continue to interact with this environment with our updated value func
 If we summarize:
 
 - With *Monte Carlo*, we update the value function from a complete episode, and so we **use the actual accurate discounted return of this episode.**
-- With *TD learning*, we update the value function from a step, so we replace \\(G_t\\) that we don't have with **an estimated return called TD target.**
+- With *TD Learning*, we update the value function from a step, so we replace \\(G_t\\), which we don't have, with **an estimated return called the TD target.**
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Summary.jpg" alt="Summary"/>
 
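To make the contrast concrete, here is a minimal sketch of the two tabular value updates; the names `V`, `lr` and `gamma` are illustrative placeholders, not the course's own code:

```python
# Monte Carlo: wait until the episode ends, then move V(S_t) toward the actual return G_t.
def mc_update(V, state, G_t, lr=0.1):
    V[state] = V[state] + lr * (G_t - V[state])

# TD(0): update after a single step, replacing G_t with the TD target R_{t+1} + gamma * V(S_{t+1}).
def td_update(V, state, reward, next_state, gamma=0.99, lr=0.1):
    td_target = reward + gamma * V[next_state]
    V[state] = V[state] + lr * (td_target - V[state])
```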
@@ -22,4 +22,4 @@ And to find this optimal policy (hence solving the RL problem), there **are two
 
 <img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/two-approaches.jpg" alt="Two RL approaches"/>
 
-And in this unit, **we'll dive deeper into the Value-based methods.**
+And in this unit, **we'll dive deeper into value-based methods.**
@@ -1 +0,0 @@
-# Additional Reading [[additional-reading]]
units/en/unit3/additional-readings.mdx (new file, 6 lines)
@@ -0,0 +1,6 @@
+# Additional Readings [[additional-readings]]
+
+- [Foundations of Deep RL Series, L2 Deep Q-Learning by Pieter Abbeel](https://youtu.be/Psrhxy88zww)
+- [Playing Atari with Deep Reinforcement Learning](https://arxiv.org/abs/1312.5602)
+- [Double Deep Q-Learning](https://papers.nips.cc/paper/2010/hash/091d584fced301b442654dd8c23b3fc9-Abstract.html)
+- [Prioritized Experience Replay](https://arxiv.org/abs/1511.05952)