# Glossary
This is a community-created glossary. Contributions are welcomed!
- **Tabular Method:** Type of problem in which the state and action spaces are small enough for the approximate value functions to be represented as arrays and tables.
**Q-learning** is an example of a tabular method, since a table is used to represent the values of the different state-action pairs.
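
To make the table idea concrete, here is a minimal sketch of a Q-table and a single Q-learning update in plain Python. The table sizes, the hyperparameter values, and the helper name `q_learning_update` are assumptions chosen for illustration, not something the course prescribes.

```python
# Hypothetical sizes and hyperparameters, chosen only for illustration.
N_STATES, N_ACTIONS = 16, 4
ALPHA, GAMMA = 0.1, 0.99  # learning rate and discount factor

# The Q-table: one row per state, one entry per action, initialised to zero.
q_table = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def q_learning_update(q_table, state, action, reward, next_state, done):
    """Apply one tabular Q-learning update in place."""
    # Bootstrap from the best next action unless the episode has ended.
    best_next = 0.0 if done else max(q_table[next_state])
    td_target = reward + GAMMA * best_next
    q_table[state][action] += ALPHA * (td_target - q_table[state][action])

q_learning_update(q_table, state=0, action=2, reward=1.0, next_state=1, done=False)
print(q_table[0])
```

Only the entry for the visited state-action pair changes, which is what makes the approach practical only when every pair fits in memory.
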
- **Deep Q-Learning:** Method that trains a neural network to approximate, given a state, the **Q-values** of each possible action at that state.
It is used to solve problems where the observation space is too big to apply a tabular Q-Learning approach.
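
The sketch below illustrates only the input/output shape of such an approximator: a state goes in, one Q-value per action comes out. It uses a single linear layer in plain Python as a stand-in for a real neural network; the sizes, the random weights, and the function names are all assumptions.

```python
import random

N_FEATURES, N_ACTIONS = 4, 2  # hypothetical observation size and action count

rng = random.Random(0)
# One weight row per action; a real DQN would use several nonlinear layers.
weights = [[rng.uniform(-1, 1) for _ in range(N_FEATURES)] for _ in range(N_ACTIONS)]
biases = [0.0] * N_ACTIONS

def q_values(state):
    """Return one estimated Q-value per action for the given state."""
    return [sum(w * x for w, x in zip(row, state)) + b
            for row, b in zip(weights, biases)]

def greedy_action(state):
    """Pick the action whose estimated Q-value is highest."""
    values = q_values(state)
    return max(range(N_ACTIONS), key=lambda a: values[a])

state = [0.1, -0.2, 0.05, 0.3]
print(q_values(state), greedy_action(state))
```
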
- **Temporal Limitation:** A difficulty that arises when the environment state is represented by frames. A single frame does not provide temporal information, such as the direction or speed of a moving object.
To obtain temporal information, we need to **stack** a number of frames together.
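
Frame stacking can be sketched with a fixed-length queue. In this example the `FrameStack` class, the stack size of 4, and the string placeholders standing in for image frames are assumptions made for illustration.

```python
from collections import deque

STACK_SIZE = 4  # assumed number of frames per stacked state

class FrameStack:
    """Keep the most recent frames so a state carries motion information."""

    def __init__(self, size=STACK_SIZE):
        self.frames = deque(maxlen=size)

    def reset(self, first_frame):
        # At episode start, fill the stack with copies of the first frame.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        # Appending to a full deque silently drops the oldest frame.
        self.frames.append(new_frame)
        return list(self.frames)

stack = FrameStack()
state = stack.reset("frame0")  # strings stand in for image arrays
state = stack.step("frame1")
print(state)
```
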
- **Phases of Deep Q-Learning:**
    - **Sampling:** Actions are performed, and observed experience tuples are stored in a **replay memory**.
    - **Training:** Batches of tuples are selected randomly and the neural network updates its weights using gradient descent.
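
The two phases can be sketched as follows. The environment here is a dummy counter, the policy is random, and the gradient step is left out, so this only shows how tuples flow into the replay memory and back out as random batches. The capacity, batch size, and function names are assumptions.

```python
import random
from collections import deque

replay_memory = deque(maxlen=1000)  # assumed capacity
BATCH_SIZE = 4                      # assumed batch size

def sampling_phase(n_steps, rng):
    """Act in a dummy environment and store experience tuples."""
    state = 0
    for _ in range(n_steps):
        action = rng.randrange(2)             # random policy stand-in
        next_state = state + 1                # dummy transition
        reward = 1.0 if action == 1 else 0.0  # dummy reward
        done = next_state >= 10
        replay_memory.append((state, action, reward, next_state, done))
        state = 0 if done else next_state

def training_phase(rng):
    """Draw a random batch; a real agent would run a gradient step on it."""
    if len(replay_memory) < BATCH_SIZE:
        return None
    return rng.sample(list(replay_memory), BATCH_SIZE)

rng = random.Random(0)
sampling_phase(20, rng)
batch = training_phase(rng)
print(len(replay_memory), len(batch))
```
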
- **Solutions to stabilize Deep Q-Learning:**
    - **Experience Replay:** A replay memory is created to save experience samples that can be reused during training. This allows the agent to learn from the same experiences multiple times. It also helps the agent avoid forgetting previous experiences as it gets new ones.
    - **Random sampling** from the replay buffer removes correlation in the observation sequences and prevents action values from oscillating or diverging catastrophically.
    - **Fixed Q-Target:** To calculate the **Q-Target**, we need to estimate the discounted optimal **Q-value** of the next state using the Bellman equation. The problem is that the same network weights are used to calculate both the **Q-Target** and the **Q-value**. This means that every time we modify the **Q-value**, the **Q-Target** moves with it. To avoid this issue, a separate network with fixed parameters is used for estimating the Temporal Difference Target. The target network is updated by copying the parameters from our Deep Q-Network every **C steps**.
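
The fixed Q-target idea can be sketched with two copies of the parameters, where only one is updated at every step. Here the "networks" are plain dictionaries of Q-values rather than neural networks, and the values of `GAMMA`, `C_STEPS`, and `LR` are assumptions.

```python
import copy

GAMMA = 0.99  # assumed discount factor
C_STEPS = 3   # assumed number of updates between target-network syncs
LR = 0.5      # assumed learning rate for the toy update

# Stand-in "networks": dicts mapping a state to a list of Q-values per action.
online_net = {0: [0.0, 0.0], 1: [0.5, 1.0]}
target_net = copy.deepcopy(online_net)

def td_target(reward, next_state, done):
    """Compute the TD target from the frozen target network."""
    if done:
        return reward
    return reward + GAMMA * max(target_net[next_state])

def train_step(step, state, action, reward, next_state, done):
    """Update the online network and sync the target every C_STEPS steps."""
    target = td_target(reward, next_state, done)
    # Only the online network changes here; the target network stays fixed.
    online_net[state][action] += LR * (target - online_net[state][action])
    if step % C_STEPS == 0:
        # Copy the online parameters into the target network.
        target_net.clear()
        target_net.update(copy.deepcopy(online_net))
    return target

for step in range(1, 4):
    train_step(step, state=0, action=1, reward=1.0, next_state=1, done=False)
```

Between syncs the target stays put while the online estimates move, so the update is not chasing a target that shifts with every step.
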
- **Double DQN:** Method to handle the **overestimation** of **Q-Values**, which helps training run faster and makes learning more stable. It uses two networks to decouple action selection from target **Q-Value** generation:
    - **DQN Network** to select the best action to take for the next state (the action with the highest **Q-Value**).
    - **Target Network** to calculate the target **Q-Value** of taking that action at the next state.
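
The difference between the two target computations can be sketched as below. The Q-value lists are made-up numbers for a single next state, and the function names and `GAMMA` value are assumptions.

```python
GAMMA = 0.99  # assumed discount factor

def dqn_target(reward, next_q_target, done):
    """Vanilla DQN: the target network both selects and evaluates the action."""
    return reward if done else reward + GAMMA * max(next_q_target)

def double_dqn_target(reward, next_q_online, next_q_target, done):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + GAMMA * next_q_target[best_action]

# Hypothetical Q-value estimates for the next state from each network.
next_q_online = [1.0, 2.0, 0.5]
next_q_target = [1.2, 1.5, 3.0]  # suppose the value for action 2 is inflated

print(dqn_target(0.0, next_q_target, False))
print(double_dqn_target(0.0, next_q_online, next_q_target, False))
```

In this made-up case the online network prefers action 1, so the Double DQN target uses the target network's estimate for action 1 instead of its largest (and possibly inflated) entry.
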

If you want to improve the course, you can [open a Pull Request.](https://github.com/huggingface/deep-rl-class/pulls)

This glossary was made possible thanks to:

- [Dario Paez](https://github.com/dario248)