Merge pull request #282 from spolisar/patch-1

Corrected spelling and grammar in Unit 3 Glossary
This commit is contained in:
Thomas Simonini
2023-04-14 09:13:27 +02:00
committed by GitHub


@@ -17,19 +17,19 @@ In order to obtain temporal information, we need to **stack** a number of frames
- **Solutions to stabilize Deep Q-Learning:**
- **Experience Replay:** A replay memory is created to save experience samples that can be reused during training.
This allows the agent to learn from the same experiences multiple times. Also, it helps the agent avoid forgetting previous experiences as it gets new ones.
- **Random sampling** from the replay buffer removes correlation in the observation sequences and prevents action values from oscillating or diverging
catastrophically.
- **Fixed Q-Target:** In order to calculate the **Q-Target** we need to estimate the discounted optimal **Q-value** of the next state by using the Bellman equation. The problem
is that the same network weights are used to calculate the **Q-Target** and the **Q-value**. This means that every time we modify the **Q-value**, the **Q-Target** also moves with it.
To avoid this issue, a separate network with fixed parameters is used for estimating the Temporal Difference Target. The target network is updated by copying parameters from
our Deep Q-Network every **C steps**.
- **Double DQN:** Method to handle **overestimation** of **Q-Values**. This solution uses two networks to decouple the action selection from the target **Value generation**:
- **DQN Network** to select the best action to take for the next state (the action with the highest **Q-Value**)
- **Target Network** to calculate the target **Q-Value** of taking that action at the next state.
This approach reduces **Q-Values** overestimation, helps the agent train faster, and leads to more stable learning.
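The three stabilization ideas above can be sketched in a few lines. This is a minimal illustration, not the course's actual implementation: the class and function names (`ReplayBuffer`, `double_dqn_targets`) are hypothetical, and the Q-values are plain NumPy arrays standing in for network outputs.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # old experiences are dropped first

    def push(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive observations.
        return random.sample(self.memory, batch_size)


def double_dqn_targets(q_online_next, q_target_next, rewards, dones, gamma=0.99):
    """Double DQN TD targets: action *selection* and *evaluation* are decoupled.

    q_online_next: online-network Q-values for next states, shape (B, n_actions)
    q_target_next: target-network Q-values for next states, shape (B, n_actions)
    """
    # The DQN (online) network selects the best next action...
    best_actions = np.argmax(q_online_next, axis=1)
    # ...and the fixed target network evaluates that action.
    evaluated = q_target_next[np.arange(len(best_actions)), best_actions]
    # Terminal transitions (done == 1) have no bootstrapped future value.
    return rewards + gamma * (1.0 - dones) * evaluated


# Fixed Q-Target: every C steps, copy the online parameters into the target
# network (with real networks this would be e.g. a state_dict copy).
buffer = ReplayBuffer(capacity=10_000)
for i in range(5):
    buffer.push((f"s{i}", 0, 1.0, f"s{i + 1}", False))
batch = buffer.sample(batch_size=3)

q_online_next = np.array([[1.0, 2.0], [3.0, 0.5]])
q_target_next = np.array([[0.2, 0.4], [0.6, 0.1]])
targets = double_dqn_targets(
    q_online_next, q_target_next,
    rewards=np.array([1.0, 0.0]),
    dones=np.array([0.0, 1.0]),
)
```

Note that the online network only *chooses* the action (`argmax`); its possibly overestimated value is never used as the target, which is what tempers the overestimation bias.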
If you want to improve the course, you can [open a Pull Request.](https://github.com/huggingface/deep-rl-class/pulls)