mirror of
https://github.com/huggingface/deep-rl-class.git
synced 2026-04-04 19:18:46 +08:00
Merge pull request #195 from huggingface/ml-agents-review
General pass of unit 5
@@ -1,17 +1,17 @@
# (Optional) What is curiosity in Deep Reinforcement Learning?
# (Optional) What is Curiosity in Deep Reinforcement Learning?
This is an (optional) introduction about curiosity. If you want to learn more you can read my two articles where I dive into the mathematical details:
This is an (optional) introduction to Curiosity. If you want to learn more, you can read two additional articles where we dive into the mathematical details:
- [Curiosity-Driven Learning through Next State Prediction](https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-next-state-prediction-f7f4e2f592fa)
- [Random Network Distillation: a new take on Curiosity-Driven Learning](https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-random-network-distillation-488ffd8e5938)
## Two Major Problems in Modern RL
To understand what is curiosity, we need first to understand the two major problems with RL:
To understand what Curiosity is, we first need to understand the two major problems with RL:
First, the *sparse rewards problem:* that is, **most rewards do not contain information, and hence are set to zero**.
Remember that RL is based on the *reward hypothesis*, which is the idea that each goal can be described as the maximization of the rewards. Therefore, rewards act as feedback for RL agents, **if they don’t receive any, their knowledge of which action is appropriate (or not) cannot change**.
Remember that RL is based on the *reward hypothesis*, which is the idea that each goal can be described as the maximization of the rewards. Therefore, rewards act as feedback for RL agents; **if they don’t receive any, their knowledge of which action is appropriate (or not) cannot change**.
<figure>
@@ -21,30 +21,30 @@ Remember that RL is based on the *reward hypothesis*, which is the idea that eac
For instance, in “DoomMyWayHome,” one of the [ViZDoom](https://vizdoom.cs.put.edu.pl/) environments based on the game Doom, your agent is only rewarded **if it finds the vest**.
However, the vest is far away from your starting point, so most of your rewards will be zero. Therefore, if our agent does not receive useful feedback (dense rewards), it will take much longer to learn an optimal policy and **it can spend time turning around without finding the goal**.
However, the vest is far away from your starting point, so most of your rewards will be zero. Therefore, if our agent does not receive useful feedback (dense rewards), it will take much longer to learn an optimal policy, and **it can spend time turning around without finding the goal**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/curiosity2.png" alt="Curiosity"/>
The second big problem is that **the extrinsic reward function is handmade, that is in each environment, a human has to implement a reward function**. But how we can scale that in big and complex environments?
The second big problem is that **the extrinsic reward function is handmade; in each environment, a human has to implement a reward function**. But how can we scale that to big and complex environments?
## So what is curiosity?
## So what is Curiosity?
A solution to these problems is **to develop a reward function that is intrinsic to the agent, i.e., generated by the agent itself**. The agent will act as a self-learner since it will be the student, but also its own feedback master.
A solution to these problems is **to develop a reward function intrinsic to the agent, i.e., generated by the agent itself**. The agent will act as a self-learner since it will be the student and its own feedback master.
**This intrinsic reward mechanism is known as curiosity** because this reward push to explore states that are novel/unfamiliar. In order to achieve that, our agent will receive a high reward when exploring new trajectories.
**This intrinsic reward mechanism is known as Curiosity** because this reward pushes the agent to explore states that are novel/unfamiliar. To achieve that, our agent will receive a high reward when exploring new trajectories.
This reward is in fact designed on how human acts, **we have naturally an intrinsic desire to explore environments and discover new things**.
This reward is inspired by how humans act: **we naturally have an intrinsic desire to explore environments and discover new things**.
There are different ways to calculate this intrinsic reward, the classical one (curiosity through next-state prediction) was to calculate curiosity **as the error of our agent of predicting the next state, given the current state and action taken**.
There are different ways to calculate this intrinsic reward. The classical approach (Curiosity through next-state prediction) is to calculate Curiosity **as the error of our agent in predicting the next state, given the current state and action taken**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/curiosity3.png" alt="Curiosity"/>
Because the idea of curiosity is to **encourage our agent to perform actions that reduce the uncertainty in the agent’s ability to predict the consequences of its own actions** (uncertainty will be higher in areas where the agent has spent less time, or in areas with complex dynamics).
This works because the idea of Curiosity is to **encourage our agent to perform actions that reduce the uncertainty in the agent’s ability to predict the consequences of its actions** (uncertainty will be higher in areas where the agent has spent less time or in areas with complex dynamics).
If the agent spend a lot of times on these states, it will be good to predict the next state (low curiosity), on the other hand, if it’s a new state unexplored, it will be bad to predict the next state (high curiosity).
If the agent spends a lot of time in these states, it will be good at predicting the next state (low Curiosity). On the other hand, if it’s a new, unexplored state, it will be bad at predicting the next state (high Curiosity).
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/curiosity4.png" alt="Curiosity"/>
Using curiosity will push our agent to favor transitions with high prediction error (which will be higher in areas where the agent has spent less time, or in areas with complex dynamics) and **consequently better explore our environment**.
Using Curiosity will push our agent to favor transitions with high prediction error (which will be higher in areas where the agent has spent less time, or in areas with complex dynamics) and **consequently better explore our environment**.
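To make the next-state prediction idea concrete, here is a minimal sketch in plain Python. It is not the ML-Agents implementation: the `ForwardModel` class, its linear model, and the toy transition are all assumptions made for illustration. The intrinsic reward is the squared error between the predicted and the actual next state, so it shrinks as the agent keeps seeing the same transition.

```python
import random


class ForwardModel:
    """Hypothetical linear model that predicts next_state from (state, action)."""

    def __init__(self, state_dim, action_dim, lr=0.1, seed=0):
        rng = random.Random(seed)
        in_dim = state_dim + action_dim
        # One row of weights per next-state component, initialized near zero.
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(state_dim)
        ]
        self.lr = lr

    def predict(self, state, action):
        x = list(state) + list(action)
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weights]

    def intrinsic_reward(self, state, action, next_state):
        # Curiosity = prediction error on the next state.
        pred = self.predict(state, action)
        return 0.5 * sum((p - t) ** 2 for p, t in zip(pred, next_state))

    def update(self, state, action, next_state):
        # One gradient step on the squared prediction error.
        x = list(state) + list(action)
        pred = self.predict(state, action)
        for row, p, t in zip(self.weights, pred, next_state):
            err = p - t
            for j, xi in enumerate(x):
                row[j] -= self.lr * err * xi


model = ForwardModel(state_dim=2, action_dim=1)
state, action, next_state = [1.0, 0.0], [1.0], [0.0, 1.0]

# First visit: the transition is novel, so the prediction error is high.
reward_when_novel = model.intrinsic_reward(state, action, next_state)

# After many visits the model predicts this transition well.
for _ in range(50):
    model.update(state, action, next_state)
reward_when_familiar = model.intrinsic_reward(state, action, next_state)

print(reward_when_novel > reward_when_familiar)
```

In practice the forward model is usually a neural network working on learned features rather than raw states, but the reward signal follows the same principle.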
There are also **other curiosity calculation methods**. ML-Agents uses a more advanced one called Curiosity through random network distillation. This is out of the scope of the tutorial, but if you’re interested, [I wrote an article explaining it in detail](https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-random-network-distillation-488ffd8e5938).
@@ -7,31 +7,24 @@ notebooks={[
askForHelpUrl="http://hf.co/join/discord" />
Now that we learned what is ML-Agents, how it works and that we studied the two environments we're going to use. We're ready to train our agents.
- The first one will learn to **shoot snowballs onto spawning target**.
- The second need to **press a button to spawn a pyramid, then navigate to the pyramid, knock it over, and move to the gold brick at the top**. To do that, it will need to explore its environment, and we will use a technique called curiosity.
We learned what ML-Agents is and how it works. We also studied the two environments we're going to use. Now we're ready to train our agents!
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/envs.png" alt="Environments" />
After that, you'll be able to watch your agents playing directly on your browser.
The ML-Agents integration on the Hub **is still experimental**, some features will be added in the future. But for now, to validate this hands-on for the certification process, you just need to push your trained models to the Hub.
There's no results to attain to validate this one. But if you want to get nice results you can try to reach:
The ML-Agents integration on the Hub **is still experimental**. Some features will be added in the future. But for now, to validate this hands-on for the certification process, you just need to push your trained models to the Hub.
There are no minimum results to attain to validate this Hands On. But if you want to get nice results, you can try to reach the following:
- For [Pyramids](https://singularite.itch.io/pyramids): Mean Reward = 1.75
- For [SnowballTarget](https://singularite.itch.io/snowballtarget): Mean Reward = 15 or 30 targets hit in an episode.
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
**To start the hands-on click on Open In Colab button** 👇 :
**To start the hands-on, click on the Open In Colab button** 👇 :
[Open In Colab](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit5/unit5.ipynb)
# Unit 5: An Introduction to ML-Agents
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/thumbnail.png" alt="Thumbnail"/>
In this notebook, you'll learn about ML-Agents and train two agents.
@@ -45,7 +38,6 @@ For more information about the certification process, check this section 👉 ht
⬇️ Here is an example of what **you will achieve at the end of this unit.** ⬇️
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids.gif" alt="Pyramids"/>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballtarget.gif" alt="SnowballTarget"/>
@@ -59,7 +51,7 @@ For more information about the certification process, check this section 👉 ht
- [ML-Agents (HuggingFace Experimental Version)](https://github.com/huggingface/ml-agents)
⚠ We're going to use an experimental version of ML-Agents were you can push to hub and load from hub Unity ML-Agents Models **you need to install the same version**
⚠ We're going to use an experimental version of ML-Agents where you can push Unity ML-Agents models to the Hub and load them from the Hub, so **you need to install the same version**.
We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues).
@@ -77,14 +69,6 @@ Before diving into the notebook, you need to:
# Let's train our agents 🚀
The ML-Agents integration on the Hub is **still experimental**, some features will be added in the future.
But for now, **to validate this hands-on for the certification process, you just need to push your trained models to the Hub**. There’s no results to attain to validate this one. But if you want to get nice results you can try to attain:
- For `Pyramids` : Mean Reward = 1.75
- For `SnowballTarget` : Mean Reward = 15 or 30 targets hit in an episode.
## Set the GPU 💪
- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`
@@ -96,7 +80,7 @@ But for now, **to validate this hands-on for the certification process, you just
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/gpu-step2.jpg" alt="GPU Step 2">
## Clone the repository and install the dependencies 🔽
- We need to clone the repository, that **contains the experimental version of the library that allows you to push your trained agent to the Hub.**
- We need to clone the repository that **contains the experimental version of the library that allows you to push your trained agent to the Hub.**
```python
%%capture
@@ -114,10 +98,10 @@ But for now, **to validate this hands-on for the certification process, you just
## SnowballTarget ⛄
If you need a refresher on how this environments work check this section 👉
If you need a refresher on how this environment works, check this section 👉
https://huggingface.co/deep-rl-course/unit5/snowball-target
### Download and move the environment zip file in `./training-envs-executables/linux/`
### Download and move the environment zip file in `./training-envs-executables/linux/`
- Our environment executable is in a zip file.
- We need to download it and place it in `./training-envs-executables/linux/`
- We use a Linux executable because we use Colab, and Colab machines run Ubuntu (Linux).
@@ -155,11 +139,11 @@ Make sure your file is accessible
There are multiple hyperparameters. To understand them better, read the explanation of each one in [the documentation](https://github.com/Unity-Technologies/ml-agents/blob/release_20_docs/docs/Training-Configuration-File.md).
So you need to create a `SnowballTarget.yaml` config file in ./content/ml-agents/config/ppo/
You need to create a `SnowballTarget.yaml` config file in `./content/ml-agents/config/ppo/`.
Here is a first version of this config (to copy and paste into your `SnowballTarget.yaml` file), **but you should modify it**.
```
```yaml
behaviors:
SnowballTarget:
trainer_type: ppo
@@ -192,13 +176,13 @@ behaviors:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballfight_config1.png" alt="Config SnowballTarget"/>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballfight_config2.png" alt="Config SnowballTarget"/>
As an experimentation, you should also try to modify some other hyperparameters. Unity provides very [good documentation explaining each of them here](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Training-Configuration-File.md).
As an experiment, try to modify some other hyperparameters. Unity provides very [good documentation explaining each of them here](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Training-Configuration-File.md).
Now that you've created the config file and understand what most hyperparameters do, we're ready to train our agent 🔥.
### Train the agent
To train our agent, we just need to **launch mlagents-learn and select the executable containing the environment.**
To train our agent, we need to **launch mlagents-learn and select the executable containing the environment.**
We define four parameters:
@@ -211,25 +195,23 @@ We define four parameters:
Train the model and use the `--resume` flag to continue training in case of interruption.
> It will fail first time if and when you use `--resume`, try running the block again to bypass the error.
> It will fail the first time if and when you use `--resume`. Try rerunning the block to bypass the error.
The training will take 10 to 35 min depending on your config. Go take a ☕️, you deserve it 🤗.
The training will take 10 to 35min depending on your config, go take a ☕️you deserve it 🤗.
```python
```bash
!mlagents-learn ./config/ppo/SnowballTarget.yaml --env=./training-envs-executables/linux/SnowballTarget/SnowballTarget --run-id="SnowballTarget1" --no-graphics
```
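If you launch training from a Python script instead of a notebook cell, the same command can be assembled as an argument list. The sketch below is only an illustration: `build_train_command` is a hypothetical helper, not part of ML-Agents, and it uses only the flags that appear in this unit (`--env`, `--run-id`, `--no-graphics`, `--resume`).

```python
import shlex


def build_train_command(config_path, env_path, run_id,
                        no_graphics=True, resume=False):
    """Hypothetical helper: builds the mlagents-learn argument list."""
    args = [
        "mlagents-learn",
        config_path,
        f"--env={env_path}",
        f"--run-id={run_id}",
    ]
    if no_graphics:
        args.append("--no-graphics")
    if resume:
        # Continue an interrupted run that used the same run id.
        args.append("--resume")
    return args


command = build_train_command(
    "./config/ppo/SnowballTarget.yaml",
    "./training-envs-executables/linux/SnowballTarget/SnowballTarget",
    "SnowballTarget1",
)

# Show the command as it would be typed in a shell.
print(shlex.join(command))

# To actually run it (requires ML-Agents to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Keeping the arguments in a list makes it easy to add `--resume` only when you restart an interrupted run.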
### Push the agent to the 🤗 Hub
### Push the agent to the Hugging Face Hub
- Now that we trained our agent, we’re **ready to push it to the Hub to be able to visualize it playing on your browser🔥.**
To be able to share your model with the community there are three more steps to follow:
To be able to share your model with the community, there are three more steps to follow:
1️⃣ (If it's not already done) create an account to HF ➡ https://huggingface.co/join
2️⃣ Sign in and then, you need to store your authentication token from the Hugging Face website.
2️⃣ Sign in and store your authentication token from the Hugging Face website.
- Create a new token (https://huggingface.co/settings/tokens) **with write role**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/create-token.jpg" alt="Create HF Token">
@@ -245,9 +227,9 @@ notebook_login()
If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
Then, we simply need to run `mlagents-push-to-hf`.
Then, we need to run `mlagents-push-to-hf`.
And we define 4 parameters:
And we define four parameters:
1. `--run-id`: the name of the training run id.
2. `--local-dir`: where the agent was saved. It’s `results/<run_id name>`, so in my case `results/First Training`.
@@ -273,13 +255,13 @@ Else, if everything worked you should have this at the end of the process(but wi
Your model is pushed to the hub. You can view your model here: https://huggingface.co/ThomasSimonini/ppo-SnowballTarget
```
It’s the link to your model, it contains a model card that explains how to use it, your Tensorboard and your config file. **What’s awesome is that it’s a git repository, that means you can have different commits, update your repository with a new push etc.**
It's the link to your model. It contains a model card that explains how to use it, your Tensorboard, and your config file. **What's awesome is that it's a git repository, which means you can have different commits, update your repository with a new push, etc.**
But now comes the best: **being able to visualize your agent online 👀.**
### Watch your agent playing 👀
For this step it’s simple:
This step is simple:
1. Remember your repo-id
@@ -289,17 +271,17 @@ For this step it’s simple:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballtarget_load.png" alt="Snowballtarget load"/>
1. In step 1, choose your model repository which is the model id (in my case ThomasSimonini/ppo-SnowballTarget).
1. In step 1, choose your model repository, which is the model id (in my case ThomasSimonini/ppo-SnowballTarget).
2. In step 2, **choose what model you want to replay**:
- I have multiple one, since we saved a model every 500000 timesteps.
- I have multiple ones since we saved a model every 500000 timesteps.
- But if I want the most recent one, I choose `SnowballTarget.onnx`
👉 What’s nice **is to try with different models step to see the improvement of the agent.**
👉 What's nice **is to try different model steps to see the improvement of the agent.**
And don't hesitate to share the best score your agent gets on Discord in the #rl-i-made-this channel 🔥
Let's now try a harder environment called Pyramids...
Let's now try a more challenging environment called Pyramids.
## Pyramids 🏆
@@ -328,9 +310,9 @@ Make sure your file is accessible
```
### Modify the PyramidsRND config file
- Contrary to the first environment which was a custom one, **Pyramids was made by the Unity team**.
- Contrary to the first environment, which was a custom one, **Pyramids was made by the Unity team**.
- So the PyramidsRND config file already exists and is in ./content/ml-agents/config/ppo/PyramidsRND.yaml
- You might asked why "RND" in PyramidsRND. RND stands for *random network distillation* it's a way to generate curiosity rewards. If you want to know more on that we wrote an article explaning this technique: https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-random-network-distillation-488ffd8e5938
- You might ask why "RND" is in PyramidsRND. RND stands for *random network distillation*; it's a way to generate curiosity rewards. If you want to know more about that, we wrote an article explaining this technique: https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-random-network-distillation-488ffd8e5938
For this training, we’ll modify one thing:
- The total training steps hyperparameter is too high since we can hit the benchmark (mean reward = 1.75) in only 1M training steps.
@@ -338,7 +320,7 @@ For this training, we’ll modify one thing:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-config.png" alt="Pyramids config"/>
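As a rough illustration of the change described above, the sketch below lowers the step budget on an in-memory stand-in for the parsed config. Several things here are assumptions: the `with_max_steps` helper is hypothetical, the dictionary shows only a fragment of a config, and the behavior name `Pyramids`, the `max_steps` key, and the starting value of 3,000,000 should be checked against your own `PyramidsRND.yaml`.

```python
import copy


def with_max_steps(config, behavior_name, max_steps):
    """Hypothetical helper: returns a copy of the config with a new step budget."""
    updated = copy.deepcopy(config)
    updated["behaviors"][behavior_name]["max_steps"] = max_steps
    return updated


# Stand-in for the parsed YAML file (assumed structure, not the full config).
config = {
    "behaviors": {
        "Pyramids": {
            "trainer_type": "ppo",
            "max_steps": 3_000_000,  # assumed default value
        }
    }
}

# 1M steps is enough to hit the benchmark mentioned above (mean reward = 1.75).
shorter_run = with_max_steps(config, "Pyramids", 1_000_000)
print(shorter_run["behaviors"]["Pyramids"]["max_steps"])
```

Working on a copy keeps the original config untouched, which is handy when you want to compare several step budgets.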
As an experimentation, you should also try to modify some other hyperparameters, Unity provides a very [good documentation explaining each of them here](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Training-Configuration-File.md).
As an experiment, you should also try to modify some other hyperparameters. Unity provides very [good documentation explaining each of them here](https://github.com/Unity-Technologies/ml-agents/blob/main/docs/Training-Configuration-File.md).
We’re now ready to train our agent 🔥.
@@ -350,29 +332,23 @@ The training will take 30 to 45min depending on your machine, go take a ☕️yo
!mlagents-learn ./config/ppo/PyramidsRND.yaml --env=./training-envs-executables/linux/Pyramids/Pyramids --run-id="Pyramids Training" --no-graphics
```
### Push the agent to the 🤗 Hub
### Push the agent to the Hugging Face Hub
- Now that we trained our agent, we’re **ready to push it to the Hub to be able to visualize it playing on your browser🔥.**
```python
```
```python
```bash
!mlagents-push-to-hf --run-id="<your run id>" --local-dir="<your local dir>" --repo-id="<your repo id>" --commit-message="<your commit message>"
```
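Since the local directory is `results/<run_id name>`, the push arguments can be derived from the run id. The sketch below assumes a hypothetical `build_push_command` helper and a placeholder repo id. Passing the arguments as a list also avoids quoting problems when the run id contains a space, as in "Pyramids Training".

```python
def build_push_command(run_id, repo_id, commit_message, results_dir="./results"):
    """Hypothetical helper: builds the mlagents-push-to-hf argument list."""
    # The trained agent is saved under results/<run_id>.
    local_dir = f"{results_dir}/{run_id}"
    return [
        "mlagents-push-to-hf",
        f"--run-id={run_id}",
        f"--local-dir={local_dir}",
        f"--repo-id={repo_id}",
        f"--commit-message={commit_message}",
    ]


push_command = build_push_command(
    run_id="Pyramids Training",
    repo_id="your-username/ppo-Pyramids",  # placeholder, use your own repo id
    commit_message="First Pyramids push",
)

for argument in push_command:
    print(argument)

# To actually push (requires ML-Agents and a Hugging Face login):
# import subprocess
# subprocess.run(push_command, check=True)
```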
### Watch your agent playing 👀
The temporary link for Pyramids demo is: https://singularite.itch.io/pyramids
The temporary link for the Pyramids demo is: https://singularite.itch.io/pyramids
### 🎁 Bonus: Why not train on another environment?
Now that you know how to train an agent using MLAgents, **why not try another environment?**
MLAgents provides 18 different environments, and we’re building some custom ones. The best way to learn is to try things on your own, so have fun.
You can find the full list of the ones currently available on Hugging Face here 👉 https://github.com/huggingface/ml-agents#the-environments
@@ -391,4 +367,4 @@ The best way to learn is to practice and try stuff. Why not try another environm
See you on Unit 6 🔥,
## Keep Learning, Stay awesome 🤗
## Keep Learning, Stay awesome 🤗
@@ -1,6 +1,6 @@
# How do Unity ML-Agents work? [[how-mlagents-works]]
Before training our agent, we need to understand **what is ML-Agents and how it works**.
Before training our agent, we need to understand **what ML-Agents is and how it works**.
## What is Unity ML-Agents? [[what-is-mlagents]]
@@ -23,7 +23,7 @@ With Unity ML-Agents, you have six essential components:
</figure>
- The first is the *Learning Environment*, which contains **the Unity scene (the environment) and the environment elements** (game characters).
- The second is the *Python Low-level API* which contains **the low-level Python interface for interacting and manipulating the environment**. It’s the API we use to launch the training.
- The second is the *Python Low-level API*, which contains **the low-level Python interface for interacting and manipulating the environment**. It’s the API we use to launch the training.
- Then, we have the *External Communicator* that **connects the Learning Environment (made with C#) with the low-level Python API (Python)**.
- The *Python trainers*: the **Reinforcement Learning algorithms made with PyTorch (PPO, SAC…)**.
- The *Gym wrapper*: to encapsulate the RL environment in a Gym wrapper.
@@ -34,7 +34,7 @@ With Unity ML-Agents, you have six essential components:
Inside the Learning Component, we have **three important elements**:
- The first is the *agent component*, the actor of the scene. We’ll **train the agent by optimizing its policy** (which will tell us what action to take in each state). The policy is called *Brain*.
- Finally, there is the *Academy*. This component **orchestrates agents and their decision-making processes**. Think of this Academy as a teacher that handles the requests from the Python API.
- Finally, there is the *Academy*. This component **orchestrates agents and their decision-making processes**. Think of this Academy as a teacher who handles Python API requests.
To better understand its role, let’s remember the RL process. This can be modeled as a loop that works like this:
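As a reminder of that loop, here is a minimal sketch with a toy environment. Nothing in it comes from ML-Agents: `CorridorEnv`, `run_episode`, and the policy are hypothetical names chosen for illustration. At each step, the agent observes a state, picks an action, and the environment answers with a reward and the next state.

```python
class CorridorEnv:
    """Toy environment: walk right along a corridor to reach the goal cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (left) or +1 (right); the position stays in the corridor.
        self.position = max(0, min(self.length - 1, self.position + action))
        done = self.position == self.length - 1
        # Sparse reward: only the goal cell gives a reward.
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Runs the state -> action -> reward -> next state loop for one episode."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the agent chooses an action
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(state):
    """Trivial policy that ignores the state and always moves right."""
    return 1


episode_return = run_episode(CorridorEnv(), always_right)
print(episode_return)
```

In ML-Agents, the Academy plays the role of the code that drives this loop on the Unity side, while the Python API sends the actions.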
@@ -2,30 +2,30 @@
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/thumbnail.png" alt="thumbnail"/>
One of the challenges in Reinforcement Learning is to **create environments**. Fortunately for us, we can use game engines.
Indeed, these engines like [Unity](https://unity.com/), [Godot](https://godotengine.org/) or [Unreal Engine](https://www.unrealengine.com/), are programs made to create video games. They are perfectly suited
One of the challenges in Reinforcement Learning is **creating environments**. Fortunately for us, we can use game engines to do so.
These engines, such as [Unity](https://unity.com/), [Godot](https://godotengine.org/) or [Unreal Engine](https://www.unrealengine.com/), are programs made to create video games. They are perfectly suited
for creating environments: they provide physics systems, 2D/3D rendering, and more.
One of them, [Unity](https://unity.com/), created the [Unity ML-Agents Toolkit](https://github.com/Unity-Technologies/ml-agents), a plugin based on the game engine Unity that allows us **to use the Unity Game Engine as an environment builder to train agents**.
One of them, [Unity](https://unity.com/), created the [Unity ML-Agents Toolkit](https://github.com/Unity-Technologies/ml-agents), a plugin based on the game engine Unity that allows us **to use the Unity Game Engine as an environment builder to train agents**. In the first bonus unit, this is what we used to train Huggy to catch a stick!
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/example-envs.png" alt="MLAgents environments"/>
<figcaption>Source: <a href="https://github.com/Unity-Technologies/ml-agents">ML-Agents documentation</a></figcaption>
</figure>
Unity ML-Agents Toolkit provides a ton of exceptional pre-made environments, from playing football (soccer), learning to walk, and jumping big walls.
The Unity ML-Agents Toolkit provides many exceptional pre-made environments, from playing football (soccer) to learning to walk and jumping over big walls.
In this Unit, we'll learn to use ML-Agents, but **don't worry if you don't know how to use the Unity Game Engine**, you'll don't need to use it to train your agents.
In this Unit, we'll learn to use ML-Agents, but **don't worry if you don't know how to use the Unity Game Engine**: you don't need to use it to train your agents.
And so, today, we're going to train two agents:
So, today, we're going to train two agents:
- The first one will learn to **shoot snowballs at spawning targets**.
- The second need to **press a button to spawn a pyramid, then navigate to the pyramid, knock it over, and move to the gold brick at the top**. To do that, it will need to explore its environment, and we will use a technique called curiosity.
- The second needs to **press a button to spawn a pyramid, then navigate to the pyramid, knock it over, and move to the gold brick at the top**. To do that, it will need to explore its environment, which will be achieved using a technique called curiosity.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/envs.png" alt="Environments" />
Then, after training **, you'll push the trained agents to the Hugging Face Hub**, and you'll be able to **visualize it playing directly on your browser without having to use the Unity Editor**.
Then, after training, **you'll push the trained agents to the Hugging Face Hub**, and you'll be able to **visualize them playing directly in your browser without having to use the Unity Editor**.
Doing this Unit will **prepare you for the next challenge: AI vs. AI, where you will train agents in multi-agent environments and compete against your classmates' agents**.
Sounds exciting? Let's get started,
Sounds exciting? Let's get started!
@@ -1,6 +1,6 @@
# The Pyramid environment
The goal in this environment is to train our agent to **get the gold brick on the top of the Pyramid. In order to do that, it needs to press a button to spawn a pyramid, navigate to the Pyramid, knock it over, and move to the gold brick at the top**.
The goal in this environment is to train our agent to **get the gold brick on the top of the Pyramid. To do that, it needs to press a button to spawn a Pyramid, navigate to the Pyramid, knock it over, and move to the gold brick at the top**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids.png" alt="Pyramids Environment"/>
@@ -11,7 +11,7 @@ The reward function is:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-reward.png" alt="Pyramids Environment"/>
In terms of code it looks like this
In terms of code, it looks like this:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-reward-code.png" alt="Pyramids Reward"/>
To train this new agent, which must find the button and then the Pyramid to knock over, we’ll use a combination of two types of rewards:
@@ -27,7 +27,7 @@ In terms of observation, we **use 148 raycasts that can each detect objects** (s
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids_raycasts.png"/>
We also use a **boolean variable indicating the switch state** (did we turn on or not the switch to spawn the Pyramid) and a vector that **contains the agent’s speed**.
We also use a **boolean variable indicating the switch state** (whether or not we turned on the switch that spawns the Pyramid) and a vector that **contains the agent’s speed**.
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/pyramids-obs-code.png" alt="Pyramids obs code"/>
@@ -2,13 +2,13 @@
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballtarget.gif" alt="SnowballTarget"/>
SnowballTarget is the environment we created at Hugging Face, and the assets are from [Kay Lousberg](https://kaylousberg.com/). We have an optional section at the end of this Unit **if you want to learn to use Unity and create your environments**.
SnowballTarget is an environment we created at Hugging Face using assets from [Kay Lousberg](https://kaylousberg.com/). We have an optional section at the end of this Unit **if you want to learn to use Unity and create your own environments**.
## The agent's Goal
The first agent you're going to train is Julien the bear 🐻 (the name is based after our [CTO Julien Chaumond](https://twitter.com/julien_c)) **to hit targets with snowballs**.
The first agent you're going to train is called Julien the bear 🐻. Julien is trained **to hit targets with snowballs**.
The goal in this environment is that Julien **hits as many targets as possible in the limited time** (1000 timesteps). To do that, it will need **to place itself correctly from the target and shoot**.
The goal in this environment is for Julien to **hit as many targets as possible in the limited time** (1000 timesteps). To do that, it will need **to position itself correctly relative to the target and shoot**.
In addition, to avoid "snowball spamming" (i.e., shooting a snowball every timestep), **Julien has a "cool off" system** (it needs to wait 0.5 seconds after a shot before it can shoot again).
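
To make the cool-off rule concrete, here is a minimal Python sketch. It is not the environment's actual Unity code: the `Shooter` class, its field names, and the `step` method are hypothetical, and only the 0.5-second wait comes from the text.

```python
from dataclasses import dataclass

COOL_OFF_SECONDS = 0.5  # from the text: wait 0.5 s after a shot


@dataclass
class Shooter:
    """Hypothetical stand-in for the agent's shooting logic (not the Unity code)."""

    # Start with the cool-off already elapsed, so the first shot is allowed.
    time_since_last_shot: float = COOL_OFF_SECONDS

    def can_shoot(self) -> bool:
        """True once at least COOL_OFF_SECONDS have passed since the last shot."""
        return self.time_since_last_shot >= COOL_OFF_SECONDS

    def step(self, wants_to_shoot: bool, dt: float) -> bool:
        """Advance time by dt seconds; return True if a snowball was fired."""
        self.time_since_last_shot += dt
        if wants_to_shoot and self.can_shoot():
            self.time_since_last_shot = 0.0  # restart the cool-off timer
            return True
        return False


if __name__ == "__main__":
    shooter = Shooter()
    # The agent asks to shoot on every step, but only some requests go through.
    print([shooter.step(True, 0.25) for _ in range(5)])
```

Even if the policy requests a shot at every timestep, the timer throttles how often a snowball is actually fired, which is what prevents spamming.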
@@ -19,14 +19,14 @@ In addition, to avoid "snowball spamming" (aka shooting a snowball every timeste
## The reward function and the reward engineering problem
The reward function is simple. **The environment gives a +1 reward every time the agent's snowball hits a target** and because the agent's goal is to maximize the expected cumulative reward, **it will try to hit as many targets as possible**.
The reward function is simple. **The environment gives a +1 reward every time the agent's snowball hits a target**. Because the agent's goal is to maximize the expected cumulative reward, **it will try to hit as many targets as possible**.
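
As a rough illustration of why maximizing cumulative reward amounts to maximizing hits, here is a small Python sketch. The function name `episode_return` and the idea of passing in the set of timesteps where a hit occurred are assumptions made for this example; only the +1 reward and the 1000-timestep limit come from the text.

```python
MAX_STEPS = 1000   # from the text: episodes are limited to 1000 timesteps
HIT_REWARD = 1.0   # from the text: +1 each time a snowball hits a target


def episode_return(hit_steps):
    """Sum the rewards of one episode.

    hit_steps is a (hypothetical) set of timestep indices at which a
    snowball hit a target; every other timestep yields a reward of 0.
    """
    total = 0.0
    for t in range(MAX_STEPS):
        reward = HIT_REWARD if t in hit_steps else 0.0
        total += reward
    return total


if __name__ == "__main__":
    # Three hits inside the time limit give a return of 3.0.
    print(episode_return({10, 250, 999}))
```

With this reward, the undiscounted return is simply the number of hits within the time limit, so there is nothing else for the agent to optimize.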
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballtarget_reward.png" alt="Reward system"/>
We could have a more complex reward function (with a penalty to push the agent to go faster, etc.). But when you design an environment, you need to avoid the *reward engineering problem*, which is having a too complex reward function to force your agent to behave as you want it to do.
We could have a more complex reward function (with a penalty to push the agent to go faster, for example). But when you design an environment, you need to avoid the *reward engineering problem*, which is having an overly complex reward function to force your agent to behave as you want it to.
Why? Because by doing that, **you might miss interesting strategies that the agent will find with a simpler reward function**.
In terms of code it looks like this:
In terms of code, it looks like this:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowballtarget-reward-code.png" alt="Reward"/>
@@ -35,7 +35,7 @@ In terms of code it looks like this:
Regarding observations, we don't use normal vision (frame), but **we use raycasts**.
Think of raycasts as lasers that will detect if it passes through an object.
Think of raycasts as lasers that will detect if they pass through an object.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit5/raycasts.png" alt="Raycasts"/>
@@ -43,7 +43,7 @@ Think of raycasts as lasers that will detect if it passes through an object.
</figure>
In this environment our agent have multiple set of raycasts:
In this environment, our agent has multiple sets of raycasts:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit7/snowball_target_raycasts.png" alt="Raycasts"/>
In addition to raycasts, the agent gets a "can I shoot" bool as observation.
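
To give an idea of how such an observation could be assembled, here is a simplified Python sketch. The per-ray encoding (a hit flag plus a normalized distance) and the function name `build_observation` are assumptions made for illustration; the actual ray sensor encoding used by the environment is likely richer.

```python
from typing import List, Optional


def build_observation(ray_hits: List[Optional[float]], can_shoot: bool) -> List[float]:
    """Flatten raycast readings and the "can I shoot" flag into one vector.

    ray_hits holds, for each ray, either None (the ray hit nothing) or the
    normalized distance in [0, 1] to the object it hit. This encoding is a
    hypothetical simplification, not the environment's real sensor output.
    """
    obs: List[float] = []
    for distance in ray_hits:
        if distance is None:
            obs.extend([0.0, 1.0])       # no hit: flag 0, distance at maximum
        else:
            obs.extend([1.0, distance])  # hit: flag 1, normalized distance
    obs.append(1.0 if can_shoot else 0.0)  # the "can I shoot" bool as a float
    return obs


if __name__ == "__main__":
    # Two rays (one miss, one hit at a quarter of the ray length), cool-off elapsed.
    print(build_observation([None, 0.25], can_shoot=True))
```

The point of the sketch is only that the policy receives a flat vector of numbers: ray readings first, then the shooting flag.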