mirror of
https://github.com/huggingface/deep-rl-class.git
synced 2026-04-05 11:38:43 +08:00
Update Unit 2
@@ -58,7 +58,7 @@
   title: Monte Carlo vs Temporal Difference Learning
 - local: unit2/mid-way-recap
   title: Mid-way Recap
-- local: unit2/quiz1
+- local: unit2/mid-way-quiz
   title: Mid-way Quiz
 - local: unit2/q-learning
   title: Introducing Q-Learning
@@ -69,7 +69,7 @@
 - local: unit2/hands-on
   title: Hands-on
 - local: unit2/quiz2
-  title: Second Quiz
+  title: Q-Learning Quiz
 - local: unit2/conclusion
   title: Conclusion
 - local: unit2/additional-readings
@@ -30,6 +30,8 @@ If we take an example:

- We terminate the episode if the cat eats the mouse or if the mouse moves > 10 steps.
- At the end of the episode, **we have a list of State, Actions, Rewards, and Next States tuples**

For instance [[State tile 3 bottom, Go Left, +1, State tile 2 bottom], [State tile 2 bottom, Go Left, +0, State tile 1 bottom]...]

- **The agent will sum the total rewards \\(G_t\\)** (to see how well it did).
- It will then **update \\(V(s_t)\\) based on the formula**
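The Monte Carlo update this hunk describes — play a full episode, sum the return \\(G_t\\), then nudge \\(V(s_t)\\) toward it — can be sketched in a few lines of Python. This is an illustrative sketch, not the course's actual code; the helper name `mc_update` and the learning rate `alpha=0.1` are assumptions:

```python
# Monte Carlo value update: after a complete episode, move V(s) toward
# the observed return G_t, i.e. V(s) <- V(s) + alpha * (G_t - V(s)).
def mc_update(V, episode, alpha=0.1, gamma=1.0):
    """episode: list of (state, action, reward, next_state) tuples."""
    G = 0.0
    # Walk the episode backwards so G accumulates the (discounted)
    # return from each step onward.
    for state, action, reward, next_state in reversed(episode):
        G = reward + gamma * G
        v = V.get(state, 0.0)
        V[state] = v + alpha * (G - v)
    return V

# The example episode from the text: the mouse goes left twice.
episode = [("tile 3 bottom", "Go Left", 1, "tile 2 bottom"),
           ("tile 2 bottom", "Go Left", 0, "tile 1 bottom")]
V = mc_update({}, episode)
```

With `gamma=1.0`, the return from "tile 3 bottom" is 1 and from "tile 2 bottom" is 0, so one update moves each value a tenth of the way toward its observed return.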
@@ -1,4 +1,4 @@
-# Mid-way Quiz [[quiz1]]
+# Mid-way Quiz [[mid-way-quiz]]
 
 The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you to find **where you need to reinforce your knowledge**.
@@ -1,4 +1,4 @@
-# Mid-way Recap [[summary1]]
+# Mid-way Recap [[mid-way-recap]]
 
 Before diving into Q-Learning, let's summarize what we just learned.
@@ -1,4 +1,4 @@
-# Q-Learning Recap [[summary2]]
+# Q-Learning Recap [[q-learning-recap]]
 
 *Q-Learning* **is the RL algorithm that**:
@@ -17,6 +17,7 @@ Q-Learning is an **off-policy value-based method that uses a TD approach to train its action-value function:**

The **Q comes from "the Quality" (the value) of that action at that state.**

Let's recap the difference between value and reward:

- The *value of a state*, or a *state-action pair*, is the expected cumulative reward our agent gets if it starts at this state (or state-action pair) and then acts according to its policy.
- The *reward* is the **feedback I get from the environment** after performing an action at a state.
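The value/reward distinction above can also be written compactly in standard RL notation: the reward \\(R_{t+1}\\) is the one-step feedback received after taking an action in state \\(S_t\\), while the value

\\(V_{\pi}(s) = \mathbb{E}_{\pi}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s\right]\\)

is the expected discounted sum of all future rewards when starting from \\(s\\) and then following the policy \\(\pi\\).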
@@ -42,7 +43,7 @@ Therefore, Q-function contains a Q-table **that has the value of each state-action pair.**

 If we recap, *Q-Learning* **is the RL algorithm that:**
 
-- Trains a *Q-Function* (an **action-value function**), which internally is a *Q-table that contains all the state-action pair values.**
+- Trains a *Q-Function* (an **action-value function**), which internally is a **Q-table that contains all the state-action pair values.**
 - Given a state and action, our Q-Function **will search into its Q-table the corresponding value.**
 - When the training is done, **we have an optimal Q-function, which means we have optimal Q-Table.**
 - And if we **have an optimal Q-function**, we **have an optimal policy** since we **know for each state what is the best action to take.**
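The recap in this hunk — a Q-table holding a value for every state-action pair, with the optimal policy picking the best action per state — can be sketched as follows. The states, actions, and values here are made up for illustration; this is not the course's hands-on code:

```python
from collections import defaultdict

# A Q-table maps (state, action) pairs to estimated values.
# With defaultdict, unseen pairs default to 0.0.
Q = defaultdict(float)
Q[("tile 2 bottom", "Go Left")] = 1.5    # illustrative values
Q[("tile 2 bottom", "Go Right")] = 0.2

def greedy_action(Q, state, actions):
    # Given an (optimal) Q-table, the (optimal) policy just picks,
    # for each state, the action with the highest Q-value.
    return max(actions, key=lambda a: Q[(state, a)])
```

For a state absent from the table, every action scores 0.0 and `max` falls back to the first action listed, so in practice exploration is needed before the greedy policy is meaningful.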