# Mid-way Recap [[mid-way-recap]]
Before diving into Q-Learning, let's summarize what we've just learned.
We have two types of value-based functions:
- State-value function: outputs the expected return if **the agent starts at a given state and acts according to the policy forever after.**
- Action-value function: outputs the expected return if **the agent starts in a given state, takes a given action at that state** and then acts according to the policy forever after (both definitions are restated formally just after this list).
- In value-based methods, rather than learning the policy, **we define the policy by hand** and we learn a value function. If we have an optimal value function, we **will have an optimal policy.**
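Restated in symbols, with \\(G_t\\) denoting the discounted return (the same notation used below) and the expectation taken over trajectories generated by the policy \\(\pi\\):

\\(V_{\pi}(s) = E_{\pi}[G_t | S_t = s]\\)

\\(Q_{\pi}(s,a) = E_{\pi}[G_t | S_t = s, A_t = a]\\)

The hand-defined policy is typically a greedy policy that simply takes the action with the highest value: \\(\pi^{*}(s) = \arg\max_{a} Q^{*}(s,a)\\).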
There are two types of methods to update the value function:
- With *the Monte Carlo method*, we update the value function from a complete episode, and so we **use the actual discounted return of this episode.**
- With *the TD Learning method*, we update the value function from a step, replacing the unknown \\(G_t\\) with **an estimated return called the TD target** (both update rules are sketched in code right after this list).
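To make the contrast concrete, here is a minimal Python sketch of the two update rules (the learning rate, discount factor, rewards, and value estimates are made-up illustrative numbers, not values from the course notebooks):

```python
alpha = 0.1   # learning rate (illustrative)
gamma = 0.99  # discount factor (illustrative)
V_s = 0.5     # current estimate of V(S_t), shared starting point for both rules

# --- Monte Carlo: wait for the episode to end, then use the actual return ---
rewards = [1.0, 0.0, 0.0, 2.0]  # rewards collected from S_t to the end of the episode
G_t = sum(gamma**k * r for k, r in enumerate(rewards))  # actual discounted return
V_s_mc = V_s + alpha * (G_t - V_s)  # V(S_t) <- V(S_t) + alpha * [G_t - V(S_t)]

# --- TD Learning: update after a single step, bootstrapping from V(S_{t+1}) ---
r = 1.0         # reward R_{t+1} observed after one step
V_s_next = 0.7  # current estimate of V(S_{t+1})
td_target = r + gamma * V_s_next  # estimated return that replaces the unknown G_t
V_s_td = V_s + alpha * (td_target - V_s)  # V(S_t) <- V(S_t) + alpha * [TD target - V(S_t)]

print(f"MC update: {V_s} -> {V_s_mc:.3f}")
print(f"TD update: {V_s} -> {V_s_td:.3f}")
```

Note how the TD update does not have to wait for the episode to finish: it bootstraps from the current estimate of the next state's value.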
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/summary-learning-mtds.jpg" alt="Summary"/>