mirror of
https://github.com/huggingface/deep-rl-class.git
synced 2026-06-14 22:17:15 +08:00
Merge branch 'main' into pierrecounathe/unit-4-propositions
13
.github/workflows/delete_doc_comment.yml
vendored
@@ -1,13 +0,0 @@
|
||||
name: Delete doc comment
|
||||
|
||||
on:
|
||||
workflow_run:
|
||||
workflows: ["Delete doc comment trigger"]
|
||||
types:
|
||||
- completed
|
||||
|
||||
jobs:
|
||||
delete:
|
||||
uses: huggingface/doc-builder/.github/workflows/delete_doc_comment.yml@main
|
||||
secrets:
|
||||
comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}
|
||||
12
.github/workflows/delete_doc_comment_trigger.yml
vendored
@@ -1,12 +0,0 @@
|
||||
name: Delete doc comment trigger
|
||||
|
||||
on:
|
||||
pull_request:
|
||||
types: [ closed ]
|
||||
|
||||
|
||||
jobs:
|
||||
delete:
|
||||
uses: huggingface/doc-builder/.github/workflows/delete_doc_comment_trigger.yml@main
|
||||
with:
|
||||
pr_number: ${{ github.event.number }}
|
||||
@@ -217,25 +217,25 @@
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"!wget --load-cookies /tmp/cookies.txt \"https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1zv3M95ZJTWHUVOWT6ckq_cm98nft8gdF' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\\1\\n/p')&id=1zv3M95ZJTWHUVOWT6ckq_cm98nft8gdF\" -O ./trained-envs-executables/linux/Huggy.zip && rm -rf /tmp/cookies.txt"
|
||||
"We downloaded the file Huggy.zip from https://github.com/huggingface/Huggy using `wget`"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "EB-G-80GsxYN"
|
||||
"id": "IHh_LXsRrrbM"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"source": [
|
||||
"!wget \"https://github.com/huggingface/Huggy/raw/main/Huggy.zip\" -O ./trained-envs-executables/linux/Huggy.zip"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "8xNAD1tRpy0_"
|
||||
},
|
||||
"execution_count": null,
|
||||
"outputs": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "jsoZGxr1MIXY"
|
||||
},
|
||||
"source": [
|
||||
"Download the file Huggy.zip from https://drive.google.com/uc?export=download&id=1zv3M95ZJTWHUVOWT6ckq_cm98nft8gdF using `wget`. Check out the full solution to download large files from GDrive [here](https://bcrf.biochem.wisc.edu/2021/02/05/download-google-drive-files-using-wget/)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
@@ -441,7 +441,7 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!mlagents-learn ./config/ppo/Huggy.yaml --env=./trained-envs-executables/linux/Huggy/Huggy --run-id=\"Huggy\" --no-graphics"
|
||||
"!mlagents-learn ./config/ppo/Huggy.yaml --env=./trained-envs-executables/linux/Huggy/Huggy --run-id=\"Huggy2\" --no-graphics"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
||||
@@ -115,7 +115,7 @@
|
||||
"\n",
|
||||
"🔲 📝 **[Read Unit 0](https://huggingface.co/deep-rl-course/unit0/introduction)** that gives you all the **information about the course and helps you to onboard** 🤗\n",
|
||||
"\n",
|
||||
"🔲 📚 **Develop an understanding of the foundations of Reinforcement learning** (MC, TD, Rewards hypothesis...) by [reading Unit 1](https://huggingface.co/deep-rl-course/unit1/introduction)."
|
||||
"🔲 📚 **Develop an understanding of the foundations of Reinforcement learning** (RL process, Rewards hypothesis...) by [reading Unit 1](https://huggingface.co/deep-rl-course/unit1/introduction)."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
||||
@@ -206,7 +206,7 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"%%capture\n",
|
||||
"\n",
|
||||
"# Go inside the repository and install the package\n",
|
||||
"%cd ml-agents\n",
|
||||
"!pip3 install -e ./ml-agents-envs\n",
|
||||
@@ -600,58 +600,38 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "NyqYYkLyAVMK"
|
||||
},
|
||||
"source": [
|
||||
"Download the file Pyramids.zip from https://drive.google.com/uc?export=download&id=1UiFNdKlsH0NTu32xV-giYUEVKV4-vc7H using `wget`. Check out the full solution to download large files from GDrive [here](https://bcrf.biochem.wisc.edu/2021/02/05/download-google-drive-files-using-wget/)"
|
||||
]
|
||||
"We downloaded the file Pyramids.zip from from https://huggingface.co/spaces/unity/ML-Agents-Pyramids/resolve/main/Pyramids.zip using `wget`"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "x2C48SGZjZYw"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "AxojCsSVAVMP"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!wget --load-cookies /tmp/cookies.txt \"https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UiFNdKlsH0NTu32xV-giYUEVKV4-vc7H' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\\1\\n/p')&id=1UiFNdKlsH0NTu32xV-giYUEVKV4-vc7H\" -O ./training-envs-executables/linux/Pyramids.zip && rm -rf /tmp/cookies.txt"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "bfs6CTJ1AVMP"
|
||||
},
|
||||
"source": [
|
||||
"**OR** Download directly to local machine and then drag and drop the file from local machine to `./training-envs-executables/linux`"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "H7JmgOwcSSmF"
|
||||
},
|
||||
"source": [
|
||||
"Wait for the upload to finish and then run the command below.\n",
|
||||
"\n",
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"Unzip it"
|
||||
"!wget \"https://huggingface.co/spaces/unity/ML-Agents-Pyramids/resolve/main/Pyramids.zip\" -O ./training-envs-executables/linux/Pyramids.zip"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "iWUUcs0_794U"
|
||||
"id": "eWh8Pl3sjZY2"
|
||||
},
|
||||
"execution_count": null,
|
||||
"outputs": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"We unzip the executable.zip file"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "V5LXPOPujZY3"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "i2E3K4V2AVMP"
|
||||
"id": "SmNgFdXhjZY3"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@@ -662,7 +642,7 @@
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "KmKYBgHTAVMP"
|
||||
"id": "T1jxwhrJjZY3"
|
||||
},
|
||||
"source": [
|
||||
"Make sure your file is accessible"
|
||||
@@ -672,7 +652,7 @@
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "Im-nwvLPAVMP"
|
||||
"id": "6fDd03btjZY3"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@@ -834,4 +814,4 @@
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 0
|
||||
}
|
||||
}
|
||||
|
||||
@@ -17,7 +17,7 @@ Then click next, you'll then get to **introduce yourself in the `#introduce-your
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit0/discord2.jpg" alt="Discord"/>
|
||||
|
||||
They are in the reinforcement learning category. **Don't forget to sign up to these channels** by clicking on 🤖 Reinforcement Learning in `role-assigment`.
|
||||
- `rl-announcements`: where we give the **lastest information about the course**.
|
||||
- `rl-announcements`: where we give the **latest information about the course**.
|
||||
- `rl-discussions`: where you can **exchange about RL and share information**.
|
||||
- `rl-study-group`: where you can **ask questions and exchange with your classmates**.
|
||||
- `rl-i-made-this`: where you can **share your projects and models**.
|
||||
|
||||
@@ -27,7 +27,7 @@ Instead of calculating the expected return for each state or each state-action p
|
||||
|
||||
The Bellman equation is a recursive equation that works like this: instead of starting for each state from the beginning and calculating the return, we can consider the value of any state as:
|
||||
|
||||
**The immediate reward \\(R_{t+1}\\) + the discounted value of the state that follows ( \\(gamma * V(S_{t+1}) \\) ) .**
|
||||
**The immediate reward \\(R_{t+1}\\) + the discounted value of the state that follows ( \\(\gamma * V(S_{t+1}) \\) ) .**
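As a small illustration of this recursion, the sketch below evaluates a hypothetical three-step chain that ends in a terminal state. The rewards and the value of `gamma` are made up for the example and do not come from the course. Each state's value is the immediate reward plus the discounted value of the next state, computed backwards from the terminal state.

```python
# Hypothetical deterministic chain: s0 -> s1 -> s2 -> terminal.
# rewards[t] is the reward received when leaving state t (assumed values).
rewards = [1.0, 1.0, 1.0]
gamma = 0.9  # discount factor (assumed value)


def state_values(rewards, gamma):
    """Apply V(S_t) = R_{t+1} + gamma * V(S_{t+1}), with V(terminal) = 0."""
    values = [0.0] * (len(rewards) + 1)  # last entry is the terminal state
    for t in reversed(range(len(rewards))):
        values[t] = rewards[t] + gamma * values[t + 1]
    return values[:-1]  # drop the terminal state


values = state_values(rewards, gamma)
```

Working backwards means every state's value is computed once, instead of re-summing the whole return from each state.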
|
||||
|
||||
<figure>
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman4.jpg" alt="Bellman equation"/>
|
||||
|
||||
@@ -19,12 +19,11 @@ The best way to learn and [to avoid the illusion of competence](https://www.cour
|
||||
},
|
||||
{
|
||||
text: "An algorithm that determines the value of being at a particular state and taking a specific action at that state",
|
||||
explain: "",
|
||||
correct: true
|
||||
explain: "Q-function is the function that determines the value of being at a particular state and taking a specific action at that state.",
|
||||
},
|
||||
{
|
||||
text: "A table",
|
||||
explain: "Q-function is not a Q-table. The Q-function is the algorithm that will feed the Q-table."
|
||||
explain: "Q-learning is not a Q-table. The Q-function is the algorithm that will feed the Q-table."
|
||||
}
|
||||
]}
|
||||
/>
|
||||
|
||||
@@ -27,9 +27,12 @@ We then multiply every term in the sum by \\(\frac{P(\tau;\theta)}{P(\tau;\theta
|
||||
|
||||
\\( = \sum_{\tau} \frac{P(\tau;\theta)}{P(\tau;\theta)}\nabla_\theta P(\tau;\theta)R(\tau) \\)
|
||||
|
||||
We can simplify further this since \\( \frac{P(\tau;\theta)}{P(\tau;\theta)}\nabla_\theta P(\tau;\theta)\\). Thus we can rewrite the sum as \\( = P(\tau;\theta)\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)} \\)
|
||||
|
||||
\\(= \sum_{\tau} P(\tau;\theta) \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}R(\tau) \\)
|
||||
We can simplify this further, since \\( \frac{P(\tau;\theta)}{P(\tau;\theta)}\nabla_\theta P(\tau;\theta) = P(\tau;\theta)\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)} \\).

Thus we can rewrite the sum as:

\\( = \sum_{\tau} P(\tau;\theta) \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}R(\tau) \\)
|
||||
|
||||
We can then use the *derivative log trick* (also called *likelihood ratio trick* or *REINFORCE trick*), a simple rule in calculus that implies that \\( \nabla_x \log f(x) = \frac{\nabla_x f(x)}{f(x)} \\)
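As a short worked step, taking \\( f = P(\tau;\theta) \\) and differentiating with respect to \\( \theta \\), the trick turns the ratio inside the sum into the gradient of a log-probability. This uses only the symbols already defined above:

```latex
\begin{aligned}
\nabla_\theta J(\theta)
  &= \sum_{\tau} P(\tau;\theta)\,\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}\,R(\tau) \\
  &= \sum_{\tau} P(\tau;\theta)\,\nabla_\theta \log P(\tau;\theta)\,R(\tau)
\end{aligned}
```

The result is again a sum weighted by \\( P(\tau;\theta) \\), that is, an expectation over trajectories, which is what makes a sample-based estimate possible.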
|
||||
|
||||
|
||||
@@ -2,7 +2,7 @@
|
||||
|
||||
## Getting the big picture
|
||||
|
||||
We just learned that policy-gradient methods aim to find parameters \\( \theta \\) that **maximize the expected return**.
|
||||
We just learned that policy-gradient methods aim to find parameters \\( \theta \\) that **maximize the expected return**.
|
||||
|
||||
The idea is that we have a *parameterized stochastic policy*. In our case, a neural network outputs a probability distribution over actions. The probability of taking each action is also called the *action preference*.
|
||||
|
||||
@@ -20,7 +20,7 @@ But **how are we going to optimize the weights using the expected return**?
|
||||
The idea is that we're going to **let the agent interact during an episode**. If we win the episode, we consider that each action taken was good and should be sampled more often in the future, since it led to the win.
|
||||
|
||||
So for each state-action pair, we want to increase the \\(P(a|s)\\): the probability of taking that action at that state. Or decrease if we lost.
|
||||
So for each state-action pair, we want to increase the \\(P(a|s)\\): the probability of taking that action at that state. Or decrease if we lost.
|
||||
|
||||
The Policy-gradient algorithm (simplified) looks like this:
|
||||
<figure class="image table text-center m-0 w-full">
|
||||
@@ -31,15 +31,15 @@ Now that we got the big picture, let's dive deeper into policy-gradient methods.
|
||||
|
||||
## Diving deeper into policy-gradient methods
|
||||
|
||||
We have our stochastic policy \\(\pi\\) which has a parameter \\(\theta\\). This \\(\pi\\), given a state, **outputs a probability distribution of actions**.
|
||||
We have our stochastic policy \\(\pi\\) which has a parameter \\(\theta\\). This \\(\pi\\), given a state, **outputs a probability distribution of actions**.
|
||||
|
||||
<figure class="image table text-center m-0 w-full">
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/stochastic_policy.png" alt="Policy"/>
|
||||
</figure>
|
||||
|
||||
Where \\(\pi_\theta(a_t|s_t)\\) is the probability of the agent selecting action \\(a_t\\) from state \\(s_t\\) given our policy.
|
||||
Where \\(\pi_\theta(a_t|s_t)\\) is the probability of the agent selecting action \\(a_t\\) from state \\(s_t\\) given our policy.
|
||||
|
||||
**But how do we know if our policy is good?** We need to have a way to measure it. To know that, we define a score/objective function called \\(J(\theta)\\).
|
||||
**But how do we know if our policy is good?** We need to have a way to measure it. To know that, we define a score/objective function called \\(J(\theta)\\).
|
||||
|
||||
### The objective function
|
||||
|
||||
@@ -48,19 +48,20 @@ The *objective function* gives us the **performance of the agent** given a traje
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/objective.jpg" alt="Return"/>
|
||||
|
||||
Let's give some more details on this formula:
|
||||
- The *expected return* (also called expected cumulative reward), is the weighted average (where the weights are given by \\(P(\tau;\theta)\\) of all possible values that the return \\(R(\tau)\\) can take).
|
||||
- The *expected return* (also called expected cumulative reward) is the weighted average (where the weights are given by \\(P(\tau;\theta)\\)) of all possible values that the return \\(R(\tau)\\) can take.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/expected_reward.png" alt="Return"/>
|
||||
|
||||
|
||||
- \\(R(\tau)\\) : Return from an arbitrary trajectory. To take this quantity and use it to calculate the expected return, we need to multiply it by the probability of each possible trajectory.
|
||||
- \\(P(\tau;\theta)\\) : Probability of each possible trajectory \\(\tau\\) (that probability depends on \\( \theta\\) since it defines the policy that it uses to select the actions of the trajectory which has an impact of the states visited).
|
||||
|
||||
- \\(P(\tau;\theta)\\) : Probability of each possible trajectory \\(\tau\\) (that probability depends on \\(\theta\\), since \\(\theta\\) defines the policy used to select the actions of the trajectory, which in turn affects the states visited).
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/probability.png" alt="Probability"/>
|
||||
|
||||
- \\(J(\theta)\\) : Expected return, we calculate it by summing for all trajectories, the probability of taking that trajectory given \\(\theta \\) multiplied by the return of this trajectory.
|
||||
- \\(J(\theta)\\) : Expected return. We calculate it by summing, over all trajectories, the probability of taking that trajectory given \\(\theta \\) multiplied by the return of that trajectory.
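To make the weighted average concrete, here is a minimal sketch with three made-up trajectories. The probabilities and returns are illustrative assumptions, not values from the course; in practice the set of trajectories is far too large to enumerate like this.

```python
# Hypothetical trajectories as (P(tau; theta), R(tau)) pairs.
# For a fixed theta, the probabilities sum to 1.
trajectories = [
    (0.5, 10.0),
    (0.25, 0.0),
    (0.25, -4.0),
]


def expected_return(trajectories):
    """J(theta) = sum over tau of P(tau; theta) * R(tau)."""
    return sum(prob * ret for prob, ret in trajectories)


j_theta = expected_return(trajectories)
```

Changing \\(\theta\\) changes the probabilities in the first column, not the returns, which is why maximizing \\(J(\theta)\\) amounts to shifting probability mass toward high-return trajectories.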
|
||||
|
||||
Our objective then is to maximize the expected cumulative reward by finding the \\(\theta \\) that will output the best action probability distributions:
|
||||
Our objective then is to maximize the expected cumulative reward by finding the \\(\theta \\) that will output the best action probability distributions:
|
||||
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/max_objective.png" alt="Max objective"/>
|
||||
@@ -68,7 +69,7 @@ Our objective then is to maximize the expected cumulative reward by finding the
|
||||
|
||||
## Gradient Ascent and the Policy-gradient Theorem
|
||||
|
||||
Policy-gradient is an optimization problem: we want to find the values of \\(\theta\\) that maximize our objective function \\(J(\theta)\\), so we need to use **gradient-ascent**. It's the inverse of *gradient-descent* since it gives the direction of the steepest increase of \\(J(\theta)\\).
|
||||
Policy-gradient is an optimization problem: we want to find the values of \\(\theta\\) that maximize our objective function \\(J(\theta)\\), so we need to use **gradient-ascent**. It's the inverse of *gradient-descent* since it gives the direction of the steepest increase of \\(J(\theta)\\).
|
||||
|
||||
(If you need a refresher on the difference between gradient descent and gradient ascent [check this](https://www.baeldung.com/cs/gradient-descent-vs-ascent) and [this](https://stats.stackexchange.com/questions/258721/gradient-ascent-vs-gradient-descent-in-logistic-regression)).
|
||||
|
||||
@@ -76,9 +77,9 @@ Our update step for gradient-ascent is:
|
||||
|
||||
\\( \theta \leftarrow \theta + \alpha * \nabla_\theta J(\theta) \\)
|
||||
|
||||
We can repeatedly apply this update in the hopes that \\(\theta \\) converges to the value that maximizes \\(J(\theta)\\).
|
||||
We can repeatedly apply this update in the hopes that \\(\theta \\) converges to the value that maximizes \\(J(\theta)\\).
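The update rule can be sketched on a toy problem. The objective \\(J(\theta) = -(\theta - 3)^2\\), the starting point, and the learning rate below are assumptions chosen only so that the maximum is easy to check by hand; this is not the policy-gradient objective itself.

```python
def grad_j(theta):
    """Gradient of the toy objective J(theta) = -(theta - 3)^2."""
    return -2.0 * (theta - 3.0)


theta = 0.0  # initial parameter (arbitrary)
alpha = 0.1  # learning rate (assumed value)

for _ in range(100):
    # Gradient ascent: step in the direction that increases J(theta).
    theta = theta + alpha * grad_j(theta)
```

Here \\(\theta\\) moves toward 3, the maximizer of the toy objective. Flipping the sign of the step would give gradient descent and move away from it.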
|
||||
|
||||
However, there are two problems with computing the derivative of \\(J(\theta)\\):
|
||||
However, there are two problems with computing the derivative of \\(J(\theta)\\):
|
||||
1. We can't calculate the true gradient of the objective function since it requires calculating the probability of each possible trajectory, which is computationally super expensive.
|
||||
So we want to **calculate a gradient estimation with a sample-based estimate (collect some trajectories)**.
|
||||
|
||||
@@ -97,18 +98,20 @@ If you want to understand how we derive this formula for approximating the gradi
|
||||
The Reinforce algorithm, also called Monte-Carlo policy-gradient, is a policy-gradient algorithm that **uses an estimated return from an entire episode to update the policy parameter** \\(\theta\\):
|
||||
|
||||
In a loop:
|
||||
- Use the policy \\(\pi_\theta\\) to collect an episode \\(\tau\\)
|
||||
- Use the episode to estimate the gradient \\(\hat{g} = \nabla_\theta J(\theta)\\)
|
||||
- Use the policy \\(\pi_\theta\\) to collect an episode \\(\tau\\)
|
||||
- Use the episode to estimate the gradient \\(\hat{g} = \nabla_\theta J(\theta)\\)
|
||||
|
||||
<figure class="image table text-center m-0 w-full">
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit6/policy_gradient_one.png" alt="Policy Gradient"/>
|
||||
</figure>
|
||||
|
||||
- Update the weights of the policy: \\(\theta \leftarrow \theta + \alpha \hat{g}\\)
|
||||
- Update the weights of the policy: \\(\theta \leftarrow \theta + \alpha \hat{g}\\)
|
||||
|
||||
We can interpret this update as follows:
|
||||
|
||||
- \\(\nabla_\theta log \pi_\theta(a_t|s_t)\\) is the direction of **steepest increase of the (log) probability** of selecting action \\(a_t\\) from state \\(s_t\\).
|
||||
This tells us **how we should change the weights of policy** if we want to increase/decrease the log probability of selecting action \\(a_t\\) at state \\(s_t\\).
|
||||
|
||||
- \\(R(\tau)\\): is the scoring function:
|
||||
- If the return is high, it will **push up the probabilities** of the (state, action) combinations.
|
||||
- Otherwise, if the return is low, it will **push down the probabilities** of the (state, action) combinations.
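The loop above can be sketched on a deliberately tiny problem. Everything specific here is an assumption made for illustration: episodes last a single step, there are two actions with fixed returns, and the policy is a softmax over one preference per action rather than a neural network.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards, alpha=0.1, episodes=2000, seed=0):
    """REINFORCE on a hypothetical one-step problem.

    theta holds one preference per action; pi_theta is softmax(theta).
    """
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)
    for _ in range(episodes):
        probs = softmax(theta)
        # Collect an episode: a single action and its return R(tau).
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        ret = rewards[action]
        # For a softmax policy, the derivative of log pi(action) with
        # respect to theta[k] is 1[k == action] - pi(k).
        for k in range(len(theta)):
            grad_log = (1.0 if k == action else 0.0) - probs[k]
            # Gradient ascent step: theta <- theta + alpha * g_hat
            theta[k] += alpha * grad_log * ret
    return softmax(theta)


probs = reinforce_bandit(rewards=[1.0, 0.0])
```

With these assumed returns, the first action's probability is pushed up each time it is sampled, while sampling the zero-return action leaves the weights unchanged.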
|
||||
|
||||
@@ -288,17 +288,16 @@ Now let's try a more challenging environment called Pyramids.
|
||||
- We need to download it and place it into `./training-envs-executables/linux/`
|
||||
- We use a linux executable because we're using colab, and the colab machine's OS is Ubuntu (linux)
|
||||
|
||||
Download the file Pyramids.zip from https://drive.google.com/uc?export=download&id=1UiFNdKlsH0NTu32xV-giYUEVKV4-vc7H using `wget`. Check out the full solution to download large files from GDrive [here](https://bcrf.biochem.wisc.edu/2021/02/05/download-google-drive-files-using-wget/)
|
||||
We downloaded the file Pyramids.zip from https://huggingface.co/spaces/unity/ML-Agents-Pyramids/resolve/main/Pyramids.zip using `wget`
|
||||
|
||||
```python
|
||||
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UiFNdKlsH0NTu32xV-giYUEVKV4-vc7H' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1UiFNdKlsH0NTu32xV-giYUEVKV4-vc7H" -O ./training-envs-executables/linux/Pyramids.zip && rm -rf /tmp/cookies.txt
|
||||
wget "https://huggingface.co/spaces/unity/ML-Agents-Pyramids/resolve/main/Pyramids.zip" -O ./training-envs-executables/linux/Pyramids.zip
|
||||
```
|
||||
|
||||
Unzip it
|
||||
|
||||
```python
|
||||
%%capture
|
||||
!unzip -d ./training-envs-executables/linux/ ./training-envs-executables/linux/Pyramids.zip
|
||||
unzip -d ./training-envs-executables/linux/ ./training-envs-executables/linux/Pyramids.zip
|
||||
```
|
||||
|
||||
Make sure your file is accessible
|
||||
|
||||
@@ -78,13 +78,15 @@ The best way to learn and [to avoid the illusion of competence](https://www.cour
|
||||
|
||||
### Q3: Fill the missing letters
|
||||
|
||||
- In Unity ML-Agents, the Policy of an Agent is called a b _ _ _ n
|
||||
- The component in charge of orchestrating the agents is called the _ c _ _ _ m _
|
||||
- In Unity ML-Agents, the Policy of an Agent is called a b \_ \_ \_ n
|
||||
- The component in charge of orchestrating the agents is called the \_ c \_ \_ \_ m \_
|
||||
|
||||
<details>
|
||||
<summary>Solution</summary>
|
||||
- b r a i n
|
||||
- a c a d e m y
|
||||
<ul>
|
||||
<li>b r a i n</li>
|
||||
<li>a c a d e m y</li>
|
||||
</ul>
|
||||
</details>
|
||||
|
||||
### Q4: Define with your own words what is a `raycast`
|
||||
|
||||
@@ -27,4 +27,5 @@ However, increasing the batch size significantly **reduces sample efficiency**.
|
||||
If you want to dive deeper into the question of variance and bias tradeoff in Deep Reinforcement Learning, you can check out these two articles:
|
||||
- [Making Sense of the Bias / Variance Trade-off in (Deep) Reinforcement Learning](https://blog.mlreview.com/making-sense-of-the-bias-variance-trade-off-in-deep-reinforcement-learning-79cf1e83d565)
|
||||
- [Bias-variance Tradeoff in Reinforcement Learning](https://www.endtoend.ai/blog/bias-variance-tradeoff-in-reinforcement-learning/)
|
||||
- [High Variance in Policy gradients](https://balajiai.github.io/high_variance_in_policy_gradients)
|
||||
---
|
||||
|
||||
@@ -2,6 +2,45 @@
|
||||
|
||||
Here we provide a list of interesting environments you can try to train your agents on:
|
||||
|
||||
## DIAMBRA Arena
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit12/diambraarena.png" alt="diambraArena"/>
|
||||
|
||||
|
||||
DIAMBRA Arena is a software package featuring a collection of high-quality environments for Reinforcement Learning research and experimentation. It provides a standard interface to popular arcade emulated video games, offering a Python API fully compliant with the OpenAI Gym/Gymnasium format, which makes its adoption smooth and straightforward.
|
||||
|
||||
It supports all major operating systems (Linux, Windows and macOS) and can be easily installed via [Python PIP](https://pypi.org/project/diambra-arena/). It is completely free to use; the user only needs to register on the [official website](https://diambra.ai/register/).
|
||||
|
||||
In addition, its [GitHub repository](https://github.com/diambra/) provides a collection of examples covering main use cases of interest that can be run in just a few steps.
|
||||
|
||||
#### Main Features
|
||||
|
||||
All environments are episodic Reinforcement Learning tasks, with discrete actions (gamepad buttons) and observations composed of screen pixels plus additional numerical data (RAM values such as characters' health bars or characters' stage side).
|
||||
|
||||
They all support both single-player (1P) and two-player (2P) modes, making them the perfect resource to explore Standard RL, Competitive Multi-Agent, Competitive Human-Agent, Self-Play, Imitation Learning and Human-in-the-Loop.
|
||||
|
||||
[Interfaced games](https://docs.diambra.ai/envs/games/) have been selected among the most popular fighting retro-games. While sharing the same fundamental mechanics, they provide different challenges, with specific features such as different types and numbers of characters, how to perform combos, health bar recharging, etc.
|
||||
|
||||
DIAMBRA Arena is built to maximize compatibility with all major Reinforcement Learning libraries. It natively provides interfaces with the two most important packages: [Stable Baselines 3](https://stable-baselines3.readthedocs.io/en/master/) and [Ray RLlib](https://docs.ray.io/en/latest/rllib/index.html), while Stable Baselines is also available but deprecated. Their usage is illustrated in the [official documentation](https://docs.diambra.ai/) and in the [DIAMBRA Agents examples repository](https://github.com/diambra/agents). It can easily be interfaced with any other package in a similar way.
|
||||
|
||||
### Competition Platform
|
||||
|
||||
DIAMBRA also provides a competition platform fully integrated with the Hugging Face Hub, on which you can submit your trained agents and compete with other coders around the globe in epic video game tournaments!
|
||||
|
||||
It features a public leaderboard where users are ranked by the best score achieved by their agents in our different environments.
|
||||
|
||||
It also offers the possibility to unlock cool achievements depending on the performance of your agent.
|
||||
|
||||
Submitted agents are evaluated and their episodes are streamed on the [DIAMBRA Twitch channel](https://www.twitch.tv/diambra_ai).
|
||||
|
||||
#### References
|
||||
|
||||
To start using this environment, check these resources:
|
||||
- [Official Docs](https://docs.diambra.ai/)
|
||||
- [Competition Platform](https://diambra.ai)
|
||||
- [GitHub](https://github.com/diambra/)
|
||||
- [Discord](https://diambra.ai/discord)
|
||||
|
||||
## MineRL
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit12/minerl.jpg" alt="MineRL"/>
|
||||
|
||||