mirror of
https://github.com/huggingface/deep-rl-class.git
synced 2026-04-09 05:40:29 +08:00
Apply suggestions from code review
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Co-authored-by: Omar Sanseviero <osanseviero@gmail.com>
@@ -126,7 +126,7 @@ Which is equivalent to:
However, in reality, **we can’t just add them like that.** The rewards that come sooner (at the beginning of the game) **are more likely to happen** since they are more predictable than the long-term future reward.
-Let’s say your agent is this tiny mouse that can move one tile each time step, and your opponent is the cat (that can move too). Your goal is **to eat the maximum amount of cheese before being eaten by the cat.**
+Let’s say your agent is this tiny mouse that can move one tile each time step, and your opponent is the cat (that can move too). The mouse's goal is **to eat the maximum amount of cheese before being eaten by the cat.**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_3.jpg" alt="Rewards" width="100%">
@@ -142,5 +142,5 @@ To discount the rewards, we proceed like this:
2. Then, each reward will be discounted by gamma to the exponent of the time step. As the time step increases, the cat gets closer to us, **so the future reward is less and less likely to happen.**
-Our discounted cumulative expected rewards is:
+Our discounted expected cumulative reward is:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_4.jpg" alt="Rewards" width="100%">
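The discounting step described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the course repo; the function name `discounted_return` and the example values are assumptions:

```python
# Minimal sketch: compute the discounted cumulative reward (the return).
# Each reward at time step t is weighted by gamma**t, so rewards further
# in the future (when the cat is closer) contribute less.
def discounted_return(rewards, gamma=0.99):
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Hypothetical example: three pieces of cheese worth 1 each, gamma = 0.9
# G = 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1, 1, 1], gamma=0.9))
```

Note that with `gamma = 1` this reduces to the plain sum of rewards, which is exactly the naive addition the text says we cannot use.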
@@ -1,6 +1,6 @@
# Summary [[summary]]
-That was a lot of information, if we summarize:
+That was a lot of information! Let's summarize:
- Reinforcement Learning is a computational approach to learning from actions. We build an agent that learns from the environment **by interacting with it through trial and error** and receiving rewards (negative or positive) as feedback.