mirror of
https://github.com/huggingface/deep-rl-class.git
synced 2026-04-04 02:57:58 +08:00
fix typos and formatting issues
This is a community-created glossary. Contributions are welcome!
- **Tabular Method:** Type of problem in which the state and action spaces are small enough to approximate value functions to be represented as arrays and tables.
**Q-learning** is an example of a tabular method, since a table is used to represent the values of different state-action pairs.
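For illustration, such a table can be a plain 2-D array indexed by (state, action); the sizes and numbers below are made up:

```python
import numpy as np

# Hypothetical tabular setting: 5 states, 2 actions.
n_states, n_actions = 5, 2
q_table = np.zeros((n_states, n_actions))  # one cell per (state, action) pair

q_table[3, 1] = 0.7                 # Q(s=3, a=1) after some imaginary updates
best = int(np.argmax(q_table[3]))   # greedy action in state 3
```

This only works while the table fits in memory, which is exactly the limitation Deep Q-Learning addresses.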
- **Deep Q-Learning:** Method that trains a neural network to approximate, given a state, the different **Q-values** for each possible action at that state.
It is used to solve problems when the observation space is too big to apply a tabular Q-Learning approach.
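A minimal stand-in for such a network, using a single random linear layer purely for illustration (a real Deep Q-Network would be a trained multi-layer network):

```python
import numpy as np

# Illustrative "network": one linear layer mapping a state vector
# to one Q-value per action. The weights would normally be learned.
rng = np.random.default_rng(0)
state_dim, n_actions = 4, 2
W = rng.normal(size=(n_actions, state_dim))
b = np.zeros(n_actions)

def q_values(state):
    """Approximate Q(s, a) for every action a at state s."""
    return W @ state + b

state = np.array([0.1, -0.2, 0.05, 0.0])
q = q_values(state)                 # one Q-value estimate per action
best_action = int(np.argmax(q))     # greedy action under the current estimate
```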
- **Temporal Limitation** is a difficulty that arises when the environment state is represented by frames. A frame by itself does not provide temporal information.
In order to obtain temporal information, we need to **stack** a number of frames together.
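A minimal sketch of the idea, assuming grayscale 84×84 frames and a stack size of 4 (both illustrative choices):

```python
from collections import deque
import numpy as np

STACK_SIZE = 4
frames = deque(maxlen=STACK_SIZE)   # oldest frame is dropped automatically

def push_frame(frame):
    """Add a frame and return the stacked state (shape: STACK_SIZE, H, W)."""
    if not frames:
        # First frame of an episode: fill the whole stack with it.
        for _ in range(STACK_SIZE):
            frames.append(frame)
    else:
        frames.append(frame)
    return np.stack(frames, axis=0)

state = push_frame(np.zeros((84, 84)))  # stacked state carries motion info
```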
- **Phases of Deep Q-Learning:**
- **Sampling:** Actions are performed, and observed experience tuples are stored in a **replay memory**.
- **Training:** Batches of tuples are selected randomly and the neural network updates its weights using gradient descent.
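The two phases can be sketched with an illustrative replay memory (all names here are hypothetical, not the course's implementation):

```python
import random
from collections import deque

memory = deque(maxlen=10_000)  # replay memory with a fixed capacity

def store(state, action, reward, next_state, done):
    """Sampling phase: save the observed experience tuple."""
    memory.append((state, action, reward, next_state, done))

def sample_batch(batch_size):
    """Training phase: draw a random batch of tuples for a gradient step."""
    return random.sample(list(memory), batch_size)

# Fake interaction loop standing in for acting in an environment.
for t in range(100):
    store(t, t % 2, 1.0, t + 1, False)
batch = sample_batch(32)
```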
- **Solutions to stabilize Deep Q-Learning:**
- **Experience Replay:** A replay memory is created to save experience samples that can be reused during training.
This allows the agent to learn from the same experiences multiple times, and helps it avoid forgetting previous experiences as it gets new ones.
- **Random sampling** from the replay buffer removes correlation in the observation sequences and prevents action values from oscillating or diverging catastrophically.
- **Fixed Q-Target:** In order to calculate the **Q-Target** we need to estimate the discounted optimal **Q-value** of the next state by using the Bellman equation. The problem is that the same network parameters are used to estimate both the **Q-value** and the **Q-Target**, so the target shifts at every training step while we chase it.
To avoid this issue, a separate network with fixed parameters is used for estimating the Temporal Difference Target. The target network is updated by copying parameters from
our Deep Q-Network every **C** steps.
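A toy sketch of this synchronization schedule, using plain arrays as stand-ins for network parameters (the value of `C` and the "gradient update" are illustrative):

```python
import numpy as np

online_w = np.zeros(4)        # parameters of the learning network
target_w = online_w.copy()    # frozen copy used for the TD target
C = 10                        # copy interval (a hyperparameter)

for step in range(1, 101):
    online_w = online_w + 0.01    # stand-in for a gradient update
    if step % C == 0:
        target_w = online_w.copy()  # sync the target network
```

Between syncs the target parameters stay fixed, which keeps the TD target stable while the online network learns.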
- **Double DQN:** Method to handle **overestimation** of **Q-Values**. This solution uses two networks to decouple the action selection from the target **Q-Value generation**:
- **DQN Network** to select the best action to take for the next state (the action with the highest **Q-Value**)
- **Target Network** to calculate the target **Q-Value** of taking that action at the next state.
This approach reduces **Q-Value** overestimation, helps train faster, and makes learning more stable.
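A sketch of how the Double DQN target is computed for one transition; the Q-values below are made up for illustration:

```python
import numpy as np

# Hypothetical Q-value estimates for the next state s'.
q_online_next = np.array([1.0, 3.0, 2.0])  # online DQN's Q-values at s'
q_target_next = np.array([0.5, 1.5, 4.0])  # target network's Q-values at s'
reward, gamma, done = 1.0, 0.99, False

a_star = int(np.argmax(q_online_next))      # action selection: online network
td_target = reward + gamma * q_target_next[a_star] * (1 - done)
# Plain DQN would instead use max(q_target_next) = 4.0 here,
# which tends to overestimate the target.
```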
If you want to improve the course, you can [open a Pull Request.](https://github.com/huggingface/deep-rl-class/pulls)