diff --git a/units/en/unit3/glossary.mdx b/units/en/unit3/glossary.mdx
index f0adaa9..208aaa3 100644
--- a/units/en/unit3/glossary.mdx
+++ b/units/en/unit3/glossary.mdx
@@ -2,23 +2,23 @@

 This is a community-created glossary. Contributions are welcomed!

-- **Tabular Method:** type of problem in which the state and action spaces are small enough to approximate value functions to be represented as arrays and tables.
+- **Tabular Method:** Type of problem in which the state and action spaces are small enough to approximate value functions to be represented as arrays and tables.
 **Q-learning** is an example of tabular method since a table is used to represent the value for different state-action pairs.

-- **Deep Q-Learning:** method that trains a neural network to approximate, given a state, the different **Q-values** for each possible action at that state.
-Is used to solve problems when observational space is too big to apply a tabular Q-Learning approach.
+- **Deep Q-Learning:** Method that trains a neural network to approximate, given a state, the different **Q-values** for each possible action at that state.
+It is used to solve problems when observational space is too big to apply a tabular Q-Learning approach.

-- **Temporal Limitation:** is a difficulty presented when the environment state is represented by frames. A frame by itself does not provide temporal information.
+- **Temporal Limitation** is a difficulty presented when the environment state is represented by frames. A frame by itself does not provide temporal information.
 In order to obtain temporal information, we need to **stack** a number of frames together.

 - **Phases of Deep Q-Learning:**
-  - **Sampling:** actions are performed, and observed experience tuples are stored in a **replay memory**.
-  - **Training:** batches of tuples are selected randomly and the neural network updates its weights using gradient descent.
+  - **Sampling:** Actions are performed, and observed experience tuples are stored in a **replay memory**.
+  - **Training:** Batches of tuples are selected randomly and the neural network updates its weights using gradient descent.

 - **Solutions to stabilize Deep Q-Learning:**
-  - **Experience Replay:** a replay memory is created to save experiences samples that can be reused during training.
+  - **Experience Replay:** A replay memory is created to save experiences samples that can be reused during training.
   This allows the agent to learn from the same experiences multiple times. Also, it makes the agent avoid to forget previous experiences as it get new ones.
-  **Random sampling** from replay buffer allows to remove correlation in the observation sequences and prevents action values from oscillating or diverging
+  - **Random sampling** from replay buffer allows to remove correlation in the observation sequences and prevents action values from oscillating or diverging
   catastrophically.

   - **Fixed Q-Target:** In order to calculate the **Q-Target** we need to estimate the discounted optimal **Q-value** of the next state by using Bellman equation. The problem
@@ -26,11 +26,10 @@ In order to obtain temporal information, we need to **stack** a number of frames
   To avoid this issue, a separate network with fixed parameters is used for estimating the Temporal Difference Target. The target network is updated by copying parameters from
   our Deep Q-Network after certain **C steps**.

-  - **Double DQN:** method to handle **overstimation** of **Q-Values**. This solution uses two networks to decouple the action selection from the target **-Value generation**:
-    -**DQN Network** to select the best action to take for the next state (the action with the highest **Q-Value**)
-    -**Target Network** to calculate the target **Q-Value** of taking that action at the next state.
-  This approach reduce the **Q-Values** overstimation, it helps to train faster and have more stable learning.
-
+  - **Double DQN:** Method to handle **overestimation** of **Q-Values**. This solution uses two networks to decouple the action selection from the target **Value generation**:
+    - **DQN Network** to select the best action to take for the next state (the action with the highest **Q-Value**)
+    - **Target Network** to calculate the target **Q-Value** of taking that action at the next state.
+This approach reduce the **Q-Values** overestimation, it helps to train faster and have more stable learning.

 If you want to improve the course, you can [open a Pull Request.](https://github.com/huggingface/deep-rl-class/pulls)
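
The decoupling described in the Double DQN entry can be illustrated with a minimal sketch, assuming the Q-values for the next state are available as plain Python lists. The function names `dqn_target` and `double_dqn_target`, the `gamma` and `done` parameters, and the sample numbers are illustrative assumptions, not part of the glossary or the course code.

```python
from typing import Sequence


def dqn_target(reward: float,
               next_q_target: Sequence[float],
               gamma: float = 0.99,
               done: bool = False) -> float:
    """Standard DQN target: the target network both selects and evaluates
    the next action by taking the max over its own Q-values."""
    if done:
        # No future return after a terminal transition.
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward: float,
                      next_q_online: Sequence[float],
                      next_q_target: Sequence[float],
                      gamma: float = 0.99,
                      done: bool = False) -> float:
    """Double DQN target: the online (DQN) network selects the action,
    and the target network evaluates that action."""
    if done:
        return reward
    # Action selection: index of the highest online Q-value.
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # Action evaluation: the target network's value for that same action.
    return reward + gamma * next_q_target[best_action]


# Hypothetical Q-values for the next state (three possible actions).
next_q_online = [1.0, 2.5, 0.3]   # online network prefers action 1
next_q_target = [1.2, 1.8, 3.0]   # target network's largest value is for action 2

standard = dqn_target(1.0, next_q_target, gamma=0.9)
double = double_dqn_target(1.0, next_q_online, next_q_target, gamma=0.9)
print(standard, double)
```

With these sample values, the standard target bootstraps from the target network's largest value (3.0), whereas the Double DQN target bootstraps from the target network's value for the action the online network picked (1.8). The second target is therefore smaller, which is the intended effect when the max is an overestimate.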