# A Q-Learning example [[q-learning-example]]

To better understand Q-Learning, let's take a simple example:

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Maze-Example-2.jpg" alt="Maze-Example"/>

- You're a mouse in this tiny maze. You always **start at the same starting point.**
- The goal is **to eat the big pile of cheese at the bottom right-hand corner** and avoid the poison. After all, who doesn't like cheese?
- The episode ends if we eat the poison, **eat the big pile of cheese**, or if we take more than five steps.
- The learning rate is 0.1
- The discount rate (gamma) is 0.99

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-1.jpg" alt="Maze-Example"/>

The reward function goes like this:
- **+0:** Going to a state with no cheese in it.
- **+1:** Going to a state with a small cheese in it.
- **+10:** Going to the state with the big pile of cheese.
- **-10:** Going to the state with the poison and thus dying.
- **+0:** If we take more than five steps.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-2.jpg" alt="Maze-Example"/>

To train our agent to have an optimal policy (so a policy that goes right, right, down), **we will use the Q-Learning algorithm**.
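Before walking through the algorithm, here is a minimal sketch of this setup in Python. The learning rate, discount rate, and step limit come from the example above; the grid size, state/action indexing, and variable names are assumptions made only for illustration.

```python
import numpy as np

# Hyperparameters from the example above
learning_rate = 0.1   # alpha
gamma = 0.99          # discount rate
max_steps = 5         # the episode also ends after five steps

# Assumptions for illustration only
n_states = 6          # e.g. a small 2x3 grid of cells (assumed layout)
n_actions = 4         # up, down, left, right (assumed indexing)
```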
## Step 1: Initialize the Q-table [[step1]]

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Example-1.jpg" alt="Maze-Example"/>

So, for now, **our Q-table is useless**; we need **to train our Q-function using the Q-Learning algorithm.**
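A minimal sketch of that initialization, using the `n_states` and `n_actions` assumed above: the Q-table is just an array of zeros with one row per state and one column per action, so every state-action pair starts with the same (useless) value.

```python
import numpy as np

# Step 1: initialize Q(s, a) = 0 for every state-action pair
Qtable = np.zeros((n_states, n_actions))
print(Qtable)  # all zeros: the table tells us nothing about the maze yet
```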
Let's do it for 2 training timesteps:

Training timestep 1:

## Step 2: Choose an action using the Epsilon Greedy Strategy [[step2]]

Because epsilon is big (= 1.0), I take a random action. In this case, I go right.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-3.jpg" alt="Maze-Example"/>
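A sketch of what this choice looks like in code (the action indexing is an assumption): with probability epsilon we explore by picking a random action, otherwise we exploit the action with the highest Q-value for the current state.

```python
import random
import numpy as np

def epsilon_greedy_policy(Qtable, state, epsilon):
    # Explore: with probability epsilon, pick a random action
    if random.uniform(0, 1) < epsilon:
        return random.randint(0, Qtable.shape[1] - 1)
    # Exploit: otherwise, pick the action with the highest Q-value in this state
    return int(np.argmax(Qtable[state]))

# With epsilon = 1.0 we always explore, which is how we end up going right here
action = epsilon_greedy_policy(Qtable, state=0, epsilon=1.0)
```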
## Step 3: Perform action At, get Rt+1 and St+1 [[step3]]

By going right, I get a small cheese, so \\(R_{t+1} = 1\\) and I'm in a new state.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-4.jpg" alt="Maze-Example"/>
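In code, this step is just "act and observe". The snippet below hard-codes only the single transition shown here; the state and action indices are assumptions for illustration, not part of the course environment.

```python
RIGHT = 1  # assumed action index for "right"

def step_example(state, action):
    """Hard-coded transition for this one move: from the start state, going
    right lands on a small cheese. Returns (new_state, reward, done)."""
    if state == 0 and action == RIGHT:
        return 1, 1.0, False  # S_{t+1} = next cell, R_{t+1} = +1, episode continues
    raise NotImplementedError("Only the transition used in this example is sketched.")

new_state, reward, done = step_example(state=0, action=RIGHT)
```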
## Step 4: Update Q(St, At) [[step4]]

We can now update \\(Q(S_t, A_t)\\) using our formula.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-5.jpg" alt="Maze-Example"/>

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Example-4.jpg" alt="Maze-Example"/>
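This is the standard Q-Learning update, \\(Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)]\\). A minimal sketch of it (the helper name and arguments are assumptions):

```python
import numpy as np

def update_q(Qtable, state, action, reward, new_state, learning_rate, gamma):
    # TD target: immediate reward plus discounted best value of the next state
    td_target = reward + gamma * np.max(Qtable[new_state])
    td_error = td_target - Qtable[state][action]
    Qtable[state][action] = Qtable[state][action] + learning_rate * td_error
    return Qtable

# With every Q-value initialized to 0, this first move gives:
# Q = 0 + 0.1 * (1 + 0.99 * 0 - 0) = 0.1
```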
Training timestep 2:

## Step 2: Choose an action using the Epsilon Greedy Strategy [[step2-2]]

**I take a random action again, since epsilon = 0.99 is big**. (Notice that we decay epsilon a little bit because, as training progresses, we want less and less exploration.)
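As a sketch, the decay can be as simple as subtracting a small amount after each step, with a floor so we never stop exploring entirely (the exact schedule and floor value here are assumptions, not the course's):

```python
def decay_epsilon(epsilon, decay=0.01, min_epsilon=0.05):
    # Reduce exploration a little each time, but keep a minimum amount of it
    return max(min_epsilon, epsilon - decay)

epsilon = decay_epsilon(1.0)  # 0.99, the value used in this second timestep
```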
I took the action 'down'. **This is not a good action since it leads me to the poison.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-6.jpg" alt="Maze-Example"/>

## Step 3: Perform action At, get Rt+1 and St+1 [[step3-3]]

Because I ate poison, **I get \\(R_{t+1} = -10\\), and I die.**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-7.jpg" alt="Maze-Example"/>

## Step 4: Update Q(St, At) [[step4-4]]

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-8.jpg" alt="Maze-Example"/>

Because we're dead, we start a new episode. But what we see here is that, **with two exploration steps, my agent became smarter.**

As we continue exploring and exploiting the environment and updating Q-values using the TD target, the **Q-table will give us a better and better approximation. At the end of the training, we'll get an estimate of the optimal Q-function.**
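Putting the four steps together, a minimal training loop might look like the sketch below. It assumes a Gymnasium-style environment (`env.reset()` / `env.step()`) and the `epsilon_greedy_policy` and `decay_epsilon` helpers sketched earlier; it illustrates the idea rather than reproducing the course notebook exactly.

```python
import numpy as np

def train(env, Qtable, n_episodes, max_steps, learning_rate, gamma):
    epsilon = 1.0
    for episode in range(n_episodes):
        state, _ = env.reset()
        for _ in range(max_steps):
            # Step 2: choose an action with the epsilon-greedy strategy
            action = epsilon_greedy_policy(Qtable, state, epsilon)

            # Step 3: perform the action, observe R_{t+1} and S_{t+1}
            new_state, reward, terminated, truncated, _ = env.step(action)

            # Step 4: update Q(S_t, A_t) towards the TD target
            td_target = reward + gamma * np.max(Qtable[new_state])
            Qtable[state][action] += learning_rate * (td_target - Qtable[state][action])

            if terminated or truncated:
                break  # poison, big cheese, or step limit: start a new episode
            state = new_state

        epsilon = decay_epsilon(epsilon)  # explore less and less as training progresses
    return Qtable
```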