diff --git a/notebooks/unit4/unit4.ipynb b/notebooks/unit4/unit4.ipynb index a8d2d4c..7a6842e 100644 --- a/notebooks/unit4/unit4.ipynb +++ b/notebooks/unit4/unit4.ipynb @@ -223,11 +223,11 @@ }, "source": [ "## Install the dependencies 🔽\n", - "The first step is to install the dependencies, we’ll install multiple ones:\n", + "The first step is to install the dependencies. We’ll install multiple ones:\n", "\n", "- `gym`\n", "- `gym-games`: Extra gym environments made with PyGame.\n", - "- `huggingface_hub`: 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.\n", + "- `huggingface_hub`: 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations, and other features that will allow you to easily collaborate with others.\n", "\n", "You can see here all the Reinforce models available 👉 https://huggingface.co/models?other=reinforce\n", "\n", @@ -236,20 +236,8 @@ }, { "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "kgxMH5wMXME8" - }, - "outputs": [], "source": [ - "!pip install -r https://huggingface.co/spaces/ThomasSimonini/temp-space-requirements/resolve/main/requirements/requirements-unit4.txt" - ] - }, - { - "cell_type": "code", - "source": [ - "# TODO UNCOMMENT BEFORE MERGING\n", - "# !pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit4/requirements-unit4.txt" + "!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit4/requirements-unit4.txt" ], "metadata": { "id": "e8ZVi-uydpgL" }, @@ -304,22 +292,15 @@ { "cell_type": "markdown", "source": [ - "## Check if we have a GPU" + "## Check if we have a GPU\n", + "\n", + "- Let's check if we have a GPU\n", + "- If so, you should see `device:cuda0`" ], "metadata": { "id": "RfxJYdMeeVgv" } }, - { - 
"cell_type": "markdown", - "metadata": { - "id": "hn2Emlm9bXmc" - }, - "source": [ - "- Let's check if we have a GPU\n", - "- If it's the case you should see `device:cuda0`" - ] - }, { "cell_type": "code", "execution_count": null, @@ -677,11 +658,14 @@ { "cell_type": "markdown", "source": [ - "- When we calculate the return Gt we see that we calculate the sum of discounted rewards **starting at timestep t**.\n", + "- When we calculate the return Gt (line 6), we compute the sum of discounted rewards **starting at timestep t**.\n", "\n", "- Why? Because our policy should only **reinforce actions on the basis of the consequences**: so rewards obtained before taking an action are useless (since they were not because of the action), **only the ones that come after the action matters**.\n", "\n", - "- Before coding this you should read this section [don't let the past distract you](https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#don-t-let-the-past-distract-you) that explains why we use reward-to-go policy gradient." + "- Before coding this you should read this section [don't let the past distract you](https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#don-t-let-the-past-distract-you) that explains why we use reward-to-go policy gradient.\n", + "\n", + "We use an interesting technique coded by [Chris1nexus](https://github.com/Chris1nexus) to **compute the return at each timestep efficiently**. The comments explain the procedure. Don't hesitate [to check the PR explanation](https://github.com/huggingface/deep-rl-class/pull/95).\n", + "The overall idea is to **compute each return in a single backward pass**, reusing the return already computed for the following timestep instead of re-summing the rewards." 
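The backward-pass trick referenced in the added cell above can be sketched as a standalone helper (the function name `compute_returns` is ours for illustration; the notebook inlines this logic inside its training loop):

```python
from collections import deque


def compute_returns(rewards, gamma):
    """Discounted return G_t for every timestep t, in one backward pass.

    Uses G_t = r_t + gamma * G_{t+1}: each return reuses the return already
    computed for the following timestep instead of re-summing the rewards.
    """
    returns = deque(maxlen=len(rewards))
    for t in range(len(rewards))[::-1]:
        # Return already computed for timestep t+1 (0 after the last timestep)
        disc_return_t = returns[0] if len(returns) > 0 else 0
        returns.appendleft(gamma * disc_return_t + rewards[t])
    return list(returns)


# With rewards [1, 1, 1] and gamma = 0.5: G2 = 1, G1 = 1.5, G0 = 1.75
print(compute_returns([1, 1, 1], gamma=0.5))
```

`appendleft` on a `deque` keeps the returns in timestep order while we iterate backward, which is why the PR uses it instead of appending and reversing.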
], "metadata": { "id": "QmcXG-9i2Qu2" } }, { @@ -701,10 +685,7 @@ " - Because all P must sum to 1, max $\pi_\theta(a_3|s; \theta)$ will **minimize other action probability.**\n", " - So we should tell PyTorch **to min $1 - \pi_\theta(a_3|s; \theta)$.**\n", " - This loss function approaches 0 as $\pi_\theta(a_3|s; \theta)$ nears 1.\n", - " - So we are encouraging the gradient to max $\pi_\theta(a_3|s; \theta)$\n", - "\n", - "Line 6 is an interesting technique coded by [Chris1nexus](https://github.com/Chris1nexus) to **compute the return at each timestep efficiently**. The comments explained the procedure. Don't hesitate also [to check the PR explanation](https://github.com/huggingface/deep-rl-class/pull/95)\n", - "But overall the idea is to **compute the return at each timestep efficiently**." + " - So we are encouraging the gradient to max $\pi_\theta(a_3|s; \theta)$\n" ] }, { @@ -1032,7 +1013,7 @@ "id": "7CoeLkQ7TpO8" }, "source": [ - "## Publish our trained model on the Hub 🔥\n", + "### Publish our trained model on the Hub 🔥\n", "Now that we saw we got good results after the training, we can publish our trained model on the hub 🤗 with one line of code.\n", "\n", "Here's an example of a Model Card:\n", @@ -1319,7 +1300,7 @@ "id": "jrnuKH1gYZSz" }, "source": [ - "Now that we try the robustness of our implementation, let's try with more complex environments with PixelCopter 🚁\n", + "Now that we've tested the robustness of our implementation, let's try a more complex environment: PixelCopter 🚁\n", "\n", "\n" ] }, { "cell_type": "markdown", "source": [ "## Second agent: PixelCopter 🚁\n", "\n", - "### Step 1: Study the PixelCopter environment 👀\n", + "### Study the PixelCopter environment 👀\n", "- [The Environment documentation](https://pygame-learning-environment.readthedocs.io/en/latest/user/games/pixelcopter.html)\n" ], "metadata": { @@ -1400,15 +1381,88 @@ "- For each vertical block it passes through it gains a positive reward of +1. 
Each time a terminal state reached it receives a negative reward of -1." ] }, + { + "cell_type": "markdown", + "source": [ + "### Define the new Policy 🧠\n", + "- We need to have a deeper neural network since the environment is more complex" + ], + "metadata": { + "id": "aV1466QP8crz" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "I1eBkCiX2X_S" + }, + "outputs": [], + "source": [ + "class Policy(nn.Module):\n", + " def __init__(self, s_size, a_size, h_size):\n", + " super(Policy, self).__init__()\n", + " # Define the three layers here\n", + "\n", + " def forward(self, x):\n", + " # Define the forward process here\n", + " return F.softmax(x, dim=1)\n", + " \n", + " def act(self, state):\n", + " state = torch.from_numpy(state).float().unsqueeze(0).to(device)\n", + " probs = self.forward(state).cpu()\n", + " m = Categorical(probs)\n", + " action = m.sample()\n", + " return action.item(), m.log_prob(action)" + ] + }, + { + "cell_type": "markdown", + "source": [ + "#### Solution" + ], + "metadata": { + "id": "47iuAFqV8Ws-" + } + }, + { + "cell_type": "code", + "source": [ + "class Policy(nn.Module):\n", + " def __init__(self, s_size, a_size, h_size):\n", + " super(Policy, self).__init__()\n", + " self.fc1 = nn.Linear(s_size, h_size)\n", + " self.fc2 = nn.Linear(h_size, h_size*2)\n", + " self.fc3 = nn.Linear(h_size*2, a_size)\n", + "\n", + " def forward(self, x):\n", + " x = F.relu(self.fc1(x))\n", + " x = F.relu(self.fc2(x))\n", + " x = self.fc3(x)\n", + " return F.softmax(x, dim=1)\n", + " \n", + " def act(self, state):\n", + " state = torch.from_numpy(state).float().unsqueeze(0).to(device)\n", + " probs = self.forward(state).cpu()\n", + " m = Categorical(probs)\n", + " action = m.sample()\n", + " return action.item(), m.log_prob(action)" + ], + "metadata": { + "id": "wrNuVcHC8Xu7" + }, + "execution_count": null, + "outputs": [] + }, { "cell_type": "markdown", "metadata": { "id": "SM1QiGCSbBkM" }, "source": [ - "### Step 2: 
Define the hyperparameters ⚙️\n", - "- Because this environment is more complex, we need to change the hyperparameters\n", - "- Especially the hidden size, we need more neurons." + "### Define the hyperparameters ⚙️\n", + "- Because this environment is more complex, we need to change the hyperparameters.\n", + "- Especially for the hidden size, we need more neurons." ] }, { @@ -1433,33 +1487,14 @@ ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "I1eBkCiX2X_S" - }, - "outputs": [], + "cell_type": "markdown", "source": [ - "class Policy(nn.Module):\n", - " def __init__(self, s_size, a_size, h_size):\n", - " super(Policy, self).__init__()\n", - " self.fc1 = nn.Linear(s_size, h_size)\n", - " self.fc2 = nn.Linear(h_size, h_size*2)\n", - " self.fc3 = nn.Linear(h_size*2, a_size)\n", - "\n", - " def forward(self, x):\n", - " x = F.relu(self.fc1(x))\n", - " x = F.relu(self.fc2(x))\n", - " x = self.fc3(x)\n", - " return F.softmax(x, dim=1)\n", - " \n", - " def act(self, state):\n", - " state = torch.from_numpy(state).float().unsqueeze(0).to(device)\n", - " probs = self.forward(state).cpu()\n", - " m = Categorical(probs)\n", - " action = m.sample()\n", - " return action.item(), m.log_prob(action)" - ] + "### Train it\n", + "- We're now ready to train our agent 🔥." 
+ ], + "metadata": { + "id": "wyvXTJWm9GJG" + } }, { "cell_type": "code", @@ -1491,17 +1526,26 @@ " 1000)" ] }, + { + "cell_type": "markdown", + "source": [ + "### Publish our trained model on the Hub 🔥" + ], + "metadata": { + "id": "8kwFQ-Ip85BE" + } + }, { "cell_type": "code", "source": [ - "repo_id = \"ThomasSimonini/Secondtestpx\" #TODO Define your repo id {username/Reinforce-{model-id}}\n", + "repo_id = \"\" #TODO Define your repo id {username/Reinforce-{model-id}}\n", "push_to_hub(repo_id,\n", " pixelcopter_policy, # The model we want to save\n", " pixelcopter_hyperparameters, # Hyperparameters\n", " eval_env, # Evaluation environment\n", " video_fps=30,\n", " local_repo_path=\"hub\",\n", - " )\n" + " )" ], "metadata": { "id": "6PtB7LRbTKWK" }, @@ -1516,7 +1560,7 @@ }, "source": [ "## Some additional challenges 🏆\n", - "The best way to learn **is to try things by your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. But also trying to find better parameters.\n", + "The best way to learn **is to try things on your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. You can also try to find better hyperparameters.\n", "\n", "In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?\n", "\n", @@ -1549,7 +1593,7 @@ "\n", "Finally, we would love **to hear what you think of the course and how we can improve it**. If you have some feedback then, please 👉 [fill this form](https://forms.gle/BzKXWzLAGZESGNaE9)\n", "\n", - "See you on Unit 5! 🔥\n", + "See you in Unit 5! 🔥\n", "\n", "### Keep Learning, stay awesome 🤗\n", "\n" ] }, @@ -1566,9 +1610,10 @@ "L_WSo0VUV99t", "mjY-eq3eWh9O", "JoTC9o2SczNn", - "rOMrdwSYOWSC", "gfGJNZBUP7Vn", "YB0Cxrw1StrP", + "Jmhs1k-cftIq", + "47iuAFqV8Ws-", "x62pP0PHdA-y" ], "include_colab_link": true
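For reference, the loss the notebook builds around these returns (minimize $-\log \pi_\theta(a_t|s_t) \cdot G_t$, summed over timesteps) can be sketched in plain Python. The helper name `policy_gradient_loss` and the example probabilities are illustrative assumptions, not code from the notebook:

```python
import math


def policy_gradient_loss(action_probs, returns):
    # REINFORCE objective as a loss: sum_t of -log(pi(a_t|s_t)) * G_t.
    # Minimizing it raises the probability of actions followed by high returns;
    # since action probabilities sum to 1, the other actions are pushed down.
    return sum(-math.log(p) * g for p, g in zip(action_probs, returns))


# An action taken with probability near 1 and a positive return contributes
# almost no loss; a low-probability action with the same return contributes a lot.
print(policy_gradient_loss([0.9, 0.2], [1.0, 1.0]))
```

In the notebook itself the `log_prob` values come from `Categorical(probs).log_prob(action)` and the sum is a torch tensor so that `backward()` can compute gradients; this sketch only mirrors the arithmetic.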