mirror of
https://github.com/huggingface/deep-rl-class.git
synced 2026-04-13 18:00:45 +08:00
Update notebook
@@ -223,11 +223,11 @@
},
"source": [
"## Install the dependencies 🔽\n",
"The first step is to install the dependencies, we’ll install multiple ones:\n",
"The first step is to install the dependencies. We’ll install multiple ones:\n",
"\n",
"- `gym`\n",
"- `gym-games`: Extra gym environments made with PyGame.\n",
"- `huggingface_hub`: 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.\n",
"- `huggingface_hub`: 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations, and other features that will allow you to easily collaborate with others.\n",
"\n",
"You can see all the Reinforce models available here 👉 https://huggingface.co/models?other=reinforce\n",
"\n",
@@ -236,20 +236,8 @@
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "kgxMH5wMXME8"
},
"outputs": [],
"source": [
"!pip install -r https://huggingface.co/spaces/ThomasSimonini/temp-space-requirements/resolve/main/requirements/requirements-unit4.txt"
]
},
{
"cell_type": "code",
"source": [
"# TODO UNCOMMENT BEFORE MERGING\n",
"# !pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit4/requirements-unit4.txt"
"!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit4/requirements-unit4.txt"
],
"metadata": {
"id": "e8ZVi-uydpgL"
@@ -304,22 +292,15 @@
{
"cell_type": "markdown",
"source": [
"## Check if we have a GPU"
"## Check if we have a GPU\n",
"\n",
"- Let's check if we have a GPU\n",
"- If so, you should see `device:cuda0`"
],
"metadata": {
"id": "RfxJYdMeeVgv"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "hn2Emlm9bXmc"
},
"source": [
"- Let's check if we have a GPU\n",
"- If it's the case you should see `device:cuda0`"
]
},
{
"cell_type": "code",
"execution_count": null,
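As an aside to this hunk: the GPU check described in the cell above boils down to a one-liner. A minimal sketch, assuming PyTorch is installed (the notebook uses it throughout); note that `str(torch.device("cuda:0"))` actually renders as `cuda:0`, slightly different from the `device:cuda0` the text shows:

```python
import torch

# Pick the GPU if one is visible, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"device:{device}")  # "device:cuda:0" on a GPU runtime, "device:cpu" otherwise
```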
@@ -677,11 +658,14 @@
{
"cell_type": "markdown",
"source": [
"- When we calculate the return Gt we see that we calculate the sum of discounted rewards **starting at timestep t**.\n",
"- When we calculate the return Gt (line 6), we sum the discounted rewards **starting at timestep t**.\n",
"\n",
"- Why? Because our policy should only **reinforce actions on the basis of their consequences**: rewards obtained before taking an action are useless (since they were not caused by the action); **only the ones that come after the action matter**.\n",
"\n",
"- Before coding this you should read this section [don't let the past distract you](https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#don-t-let-the-past-distract-you) that explains why we use reward-to-go policy gradient."
"- Before coding this, you should read the section [don't let the past distract you](https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#don-t-let-the-past-distract-you), which explains why we use the reward-to-go policy gradient.\n",
"\n",
"We use an interesting technique coded by [Chris1nexus](https://github.com/Chris1nexus) to **compute the return at each timestep efficiently**. The comments explain the procedure; don't hesitate to also [check the PR explanation](https://github.com/huggingface/deep-rl-class/pull/95).\n",
"But overall, the idea is to **compute the return at each timestep efficiently**."
],
"metadata": {
"id": "QmcXG-9i2Qu2"
@@ -701,10 +685,7 @@
" - Because all P must sum to 1, max $\\pi_\\theta(a_3|s; \\theta)$ will **minimize other action probability.**\n",
" - So we should tell PyTorch **to min $1 - \\pi_\\theta(a_3|s; \\theta)$.**\n",
" - This loss function approaches 0 as $\\pi_\\theta(a_3|s; \\theta)$ nears 1.\n",
" - So we are encouraging the gradient to max $\\pi_\\theta(a_3|s; \\theta)$\n",
"\n",
"Line 6 is an interesting technique coded by [Chris1nexus](https://github.com/Chris1nexus) to **compute the return at each timestep efficiently**. The comments explained the procedure. Don't hesitate also [to check the PR explanation](https://github.com/huggingface/deep-rl-class/pull/95)\n",
"But overall the idea is to **compute the return at each timestep efficiently**."
" - So we are encouraging the gradient to max $\\pi_\\theta(a_3|s; \\theta)$\n"
]
},
{
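For context on the reward-to-go trick these hunks keep referring to: the returns for a whole episode can be computed in a single backward pass over the rewards. This is a standalone sketch of the idea, not the notebook's exact code (for that, see the PR linked above):

```python
def compute_returns(rewards, gamma):
    """Reward-to-go returns: G_t = r_t + gamma * G_{t+1}, one backward pass."""
    returns = []
    g = 0.0
    for r in reversed(rewards):  # walk from the last timestep backwards
        g = r + gamma * g        # discounted return starting at this timestep
        returns.append(g)
    returns.reverse()            # restore chronological order
    return returns

print(compute_returns([1.0, 0.0, 1.0], gamma=0.5))  # → [1.25, 0.5, 1.0]
```

Each return is built from the one after it, so the whole episode costs O(n) instead of the naive O(n²) of re-summing from every timestep.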
@@ -1032,7 +1013,7 @@
"id": "7CoeLkQ7TpO8"
},
"source": [
"## Publish our trained model on the Hub 🔥\n",
"### Publish our trained model on the Hub 🔥\n",
"Now that we've seen good results after training, we can publish our trained model on the Hub 🤗 with one line of code.\n",
"\n",
"Here's an example of a Model Card:\n",
@@ -1319,7 +1300,7 @@
"id": "jrnuKH1gYZSz"
},
"source": [
"Now that we try the robustness of our implementation, let's try with more complex environments with PixelCopter 🚁\n",
"Now that we've tested the robustness of our implementation, let's try a more complex environment: PixelCopter 🚁\n",
"\n",
"\n"
]
@@ -1329,7 +1310,7 @@
"source": [
"## Second agent: PixelCopter 🚁\n",
"\n",
"### Step 1: Study the PixelCopter environment 👀\n",
"### Study the PixelCopter environment 👀\n",
"- [The Environment documentation](https://pygame-learning-environment.readthedocs.io/en/latest/user/games/pixelcopter.html)\n"
],
"metadata": {
@@ -1400,15 +1381,88 @@
"- For each vertical block it passes through, it gains a positive reward of +1. Each time a terminal state is reached, it receives a negative reward of -1."
]
},
{
"cell_type": "markdown",
"source": [
"### Define the new Policy 🧠\n",
"- We need a deeper neural network since the environment is more complex"
],
"metadata": {
"id": "aV1466QP8crz"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "I1eBkCiX2X_S"
},
"outputs": [],
"source": [
"class Policy(nn.Module):\n",
"    def __init__(self, s_size, a_size, h_size):\n",
"        super(Policy, self).__init__()\n",
"        # Define the three layers here\n",
"\n",
"    def forward(self, x):\n",
"        # Define the forward process here\n",
"        return F.softmax(x, dim=1)\n",
"\n",
"    def act(self, state):\n",
"        state = torch.from_numpy(state).float().unsqueeze(0).to(device)\n",
"        probs = self.forward(state).cpu()\n",
"        m = Categorical(probs)\n",
"        action = m.sample()\n",
"        return action.item(), m.log_prob(action)"
]
},
{
"cell_type": "markdown",
"source": [
"#### Solution"
],
"metadata": {
"id": "47iuAFqV8Ws-"
}
},
{
"cell_type": "code",
"source": [
"class Policy(nn.Module):\n",
"    def __init__(self, s_size, a_size, h_size):\n",
"        super(Policy, self).__init__()\n",
"        self.fc1 = nn.Linear(s_size, h_size)\n",
"        self.fc2 = nn.Linear(h_size, h_size*2)\n",
"        self.fc3 = nn.Linear(h_size*2, a_size)\n",
"\n",
"    def forward(self, x):\n",
"        x = F.relu(self.fc1(x))\n",
"        x = F.relu(self.fc2(x))\n",
"        x = self.fc3(x)\n",
"        return F.softmax(x, dim=1)\n",
"\n",
"    def act(self, state):\n",
"        state = torch.from_numpy(state).float().unsqueeze(0).to(device)\n",
"        probs = self.forward(state).cpu()\n",
"        m = Categorical(probs)\n",
"        action = m.sample()\n",
"        return action.item(), m.log_prob(action)"
],
"metadata": {
"id": "wrNuVcHC8Xu7"
},
"execution_count": null,
"outputs": []
},
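To sanity-check the solution cell above outside the notebook, here is a self-contained usage sketch. It assumes PyTorch is installed and runs on CPU; the state and action sizes are illustrative, not the real PixelCopter dimensions:

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

device = torch.device("cpu")  # no GPU needed for a quick check

class Policy(nn.Module):
    def __init__(self, s_size, a_size, h_size):
        super().__init__()
        self.fc1 = nn.Linear(s_size, h_size)
        self.fc2 = nn.Linear(h_size, h_size * 2)
        self.fc3 = nn.Linear(h_size * 2, a_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.softmax(self.fc3(x), dim=1)

    def act(self, state):
        # state: 1-D numpy array -> batch of size 1 on the chosen device
        state = torch.from_numpy(state).float().unsqueeze(0).to(device)
        probs = self.forward(state).cpu()
        m = Categorical(probs)       # distribution over actions
        action = m.sample()          # stochastic action selection
        return action.item(), m.log_prob(action)

# Illustrative sizes: a 5-feature state and 2 actions
policy = Policy(s_size=5, a_size=2, h_size=16)
action, log_prob = policy.act(np.zeros(5, dtype=np.float32))
print(action in (0, 1), log_prob.shape)  # True torch.Size([1])
```

`act` returns both the sampled action and its log-probability, which is exactly what the REINFORCE loss needs later.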
{
"cell_type": "markdown",
"metadata": {
"id": "SM1QiGCSbBkM"
},
"source": [
"### Step 2: Define the hyperparameters ⚙️\n",
"- Because this environment is more complex, we need to change the hyperparameters\n",
"- Especially the hidden size, we need more neurons."
"### Define the hyperparameters ⚙️\n",
"- Because this environment is more complex, we need to change the hyperparameters.\n",
"- Especially for the hidden size: we need more neurons."
]
},
{
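For a concrete picture of what 'change the hyperparameters' might look like, here is a hypothetical dict in the style the notebook uses for CartPole. Every value below is an illustrative assumption, not the course's tuned settings:

```python
# Hypothetical PixelCopter hyperparameters; values are illustrative guesses.
pixelcopter_hyperparameters = {
    "h_size": 64,                  # bigger hidden size for a more complex environment
    "n_training_episodes": 50000,  # train longer than on CartPole
    "n_evaluation_episodes": 10,
    "max_t": 10000,
    "gamma": 0.99,
    "lr": 1e-4,
    "env_id": "Pixelcopter-PLE-v0",
}
print(pixelcopter_hyperparameters["h_size"])  # → 64
```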
@@ -1433,33 +1487,14 @@
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "I1eBkCiX2X_S"
},
"outputs": [],
"cell_type": "markdown",
"source": [
"class Policy(nn.Module):\n",
"    def __init__(self, s_size, a_size, h_size):\n",
"        super(Policy, self).__init__()\n",
"        self.fc1 = nn.Linear(s_size, h_size)\n",
"        self.fc2 = nn.Linear(h_size, h_size*2)\n",
"        self.fc3 = nn.Linear(h_size*2, a_size)\n",
"\n",
"    def forward(self, x):\n",
"        x = F.relu(self.fc1(x))\n",
"        x = F.relu(self.fc2(x))\n",
"        x = self.fc3(x)\n",
"        return F.softmax(x, dim=1)\n",
"\n",
"    def act(self, state):\n",
"        state = torch.from_numpy(state).float().unsqueeze(0).to(device)\n",
"        probs = self.forward(state).cpu()\n",
"        m = Categorical(probs)\n",
"        action = m.sample()\n",
"        return action.item(), m.log_prob(action)"
]
"### Train it\n",
"- We're now ready to train our agent 🔥."
],
"metadata": {
"id": "wyvXTJWm9GJG"
}
},
{
"cell_type": "code",
@@ -1491,17 +1526,26 @@
" 1000)"
]
},
{
"cell_type": "markdown",
"source": [
"### Publish our trained model on the Hub 🔥"
],
"metadata": {
"id": "8kwFQ-Ip85BE"
}
},
{
"cell_type": "code",
"source": [
"repo_id = \"ThomasSimonini/Secondtestpx\" #TODO Define your repo id {username/Reinforce-{model-id}}\n",
"repo_id = \"\" #TODO Define your repo id {username/Reinforce-{model-id}}\n",
"push_to_hub(repo_id,\n",
" pixelcopter_policy, # The model we want to save\n",
" pixelcopter_hyperparameters, # Hyperparameters\n",
" eval_env, # Evaluation environment\n",
" video_fps=30,\n",
" local_repo_path=\"hub\",\n",
" )\n"
" )"
],
"metadata": {
"id": "6PtB7LRbTKWK"
@@ -1516,7 +1560,7 @@
},
"source": [
"## Some additional challenges 🏆\n",
"The best way to learn **is to try things by your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. But also trying to find better parameters.\n",
"The best way to learn **is to try things on your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. You can also try to find better parameters.\n",
"\n",
"In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?\n",
"\n",
@@ -1549,7 +1593,7 @@
"\n",
"Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)\n",
"\n",
"See you on Unit 5! 🔥\n",
"See you in Unit 5! 🔥\n",
"\n",
"### Keep Learning, stay awesome 🤗\n",
"\n"
@@ -1566,9 +1610,10 @@
"L_WSo0VUV99t",
"mjY-eq3eWh9O",
"JoTC9o2SczNn",
"rOMrdwSYOWSC",
"gfGJNZBUP7Vn",
"YB0Cxrw1StrP",
"Jmhs1k-cftIq",
"47iuAFqV8Ws-",
"x62pP0PHdA-y"
],
"include_colab_link": true