mirror of
https://github.com/huggingface/deep-rl-class.git
synced 2026-06-15 06:27:24 +08:00
Merge pull request #310 from huggingface/GymnasiumUpdate/Unit2
Update Unit 2 (Gymnasium)
1792
notebooks/unit2.ipynb
Normal file
File diff suppressed because it is too large
@@ -1,4 +1,4 @@
|
||||
gym==0.24
|
||||
gymnasium
|
||||
pygame
|
||||
numpy
|
||||
|
||||
@@ -8,4 +8,4 @@ pyyaml==6.0
|
||||
imageio
|
||||
imageio_ffmpeg
|
||||
pyglet==1.5.1
|
||||
tqdm
|
||||
tqdm
|
||||
@@ -3,10 +3,11 @@
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "view-in-github"
|
||||
"id": "view-in-github",
|
||||
"colab_type": "text"
|
||||
},
|
||||
"source": [
|
||||
"<a href=\"https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/notebooks/unit2/unit2.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
|
||||
"<a href=\"https://colab.research.google.com/github/huggingface/deep-rl-class/blob/GymnasiumUpdate%2FUnit2/notebooks/unit2/unit2.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -19,8 +20,7 @@
|
||||
"\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/thumbnail.jpg\" alt=\"Unit 2 Thumbnail\">\n",
|
||||
"\n",
|
||||
"In this notebook, **you'll code from scratch your first Reinforcement Learning agent** playing FrozenLake ❄️ using Q-Learning, share it to the community, and experiment with different configurations.\n",
|
||||
"\n",
|
||||
"In this notebook, **you'll code your first Reinforcement Learning agent from scratch** to play FrozenLake ❄️ using Q-Learning, share it with the community, and experiment with different configurations.\n",
|
||||
"\n",
|
||||
"⬇️ Here is an example of what **you will achieve in just a couple of minutes.** ⬇️\n"
|
||||
]
|
||||
@@ -39,25 +39,18 @@
|
||||
"source": [
|
||||
"###🎮 Environments: \n",
|
||||
"\n",
|
||||
"- [FrozenLake-v1](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)\n",
|
||||
"- [Taxi-v3](https://www.gymlibrary.dev/environments/toy_text/taxi/)\n",
|
||||
"- [FrozenLake-v1](https://gymnasium.farama.org/environments/toy_text/frozen_lake/)\n",
|
||||
"- [Taxi-v3](https://gymnasium.farama.org/environments/toy_text/taxi/)\n",
|
||||
"\n",
|
||||
"###📚 RL-Library: \n",
|
||||
"\n",
|
||||
"- Python and NumPy\n",
|
||||
"- [Gym](https://www.gymlibrary.dev/)"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "DPTBOv9HYLZ2"
|
||||
}
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"- [Gymnasium](https://gymnasium.farama.org/)\n",
|
||||
"\n",
|
||||
"We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues)."
|
||||
],
|
||||
"metadata": {
|
||||
"id": "3iaIxM_TwklQ"
|
||||
"id": "DPTBOv9HYLZ2"
|
||||
}
|
||||
},
|
||||
{
|
||||
@@ -70,8 +63,8 @@
|
||||
"\n",
|
||||
"At the end of the notebook, you will:\n",
|
||||
"\n",
|
||||
"- Be able to use **Gym**, the environment library.\n",
|
||||
"- Be able to code from scratch a Q-Learning agent.\n",
|
||||
"- Be able to use **Gymnasium**, the environment library.\n",
|
||||
"- Be able to code a Q-Learning agent from scratch.\n",
|
||||
"- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.\n",
|
||||
"\n",
|
||||
"\n"
|
||||
@@ -81,6 +74,7 @@
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"## This notebook is from the Deep Reinforcement Learning Course\n",
|
||||
"\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/deep-rl-course-illustration.jpg\" alt=\"Deep RL Course illustration\"/>"
|
||||
],
|
||||
"metadata": {
|
||||
@@ -114,6 +108,7 @@
|
||||
},
|
||||
"source": [
|
||||
"## Prerequisites 🏗️\n",
|
||||
"\n",
|
||||
"Before diving into the notebook, you need to:\n",
|
||||
"\n",
|
||||
"🔲 📚 **Study [Q-Learning by reading Unit 2](https://huggingface.co/deep-rl-course/unit2/introduction)** 🤗 "
|
||||
@@ -134,18 +129,18 @@
|
||||
"id": "V68VveLacfxJ"
|
||||
},
|
||||
"source": [
|
||||
"- The *Q-Learning* **is the RL algorithm that** \n",
|
||||
"*Q-Learning* **is the RL algorithm that**:\n",
|
||||
"\n",
|
||||
" - Trains *Q-Function*, an **action-value function** that contains, as internal memory, a *Q-table* **that contains all the state-action pair values.**\n",
|
||||
" \n",
|
||||
" - Given a state and action, our Q-Function **will search into its Q-table the corresponding value.**\n",
|
||||
"- Trains *Q-Function*, an **action-value function** that encoded, in internal memory, by a *Q-table* **that contains all the state-action pair values.**\n",
|
||||
"\n",
|
||||
"- Given a state and action, our Q-Function **will search the Q-table for the corresponding value.**\n",
|
||||
" \n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg\" alt=\"Q function\" width=\"100%\"/>\n",
|
||||
"\n",
|
||||
"- When the training is done,**we have an optimal Q-Function, so an optimal Q-Table.**\n",
|
||||
" \n",
|
||||
"- And if we **have an optimal Q-function**, we\n",
|
||||
"have an optimal policy,since we **know for each state, what is the best action to take.**\n",
|
||||
"have an optimal policy, since we **know for, each state, the best action to take.**\n",
|
||||
"\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg\" alt=\"Link value policy\" width=\"100%\"/>\n",
|
||||
"\n",
|
||||
@@ -171,7 +166,6 @@
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"\n",
|
||||
"To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained Taxi model to the Hub and **get a result of >= 4.5**.\n",
|
||||
"\n",
|
||||
"To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**\n",
|
||||
@@ -193,13 +187,13 @@
|
||||
"\n",
|
||||
"We’ll install multiple ones:\n",
|
||||
"\n",
|
||||
"- `gym`: Contains the FrozenLake-v1 ⛄ and Taxi-v3 🚕 environments. We use `gym==0.24` since it contains a nice Taxi-v3 UI version.\n",
|
||||
"- `gymnasium`: Contains the FrozenLake-v1 ⛄ and Taxi-v3 🚕 environments. \n",
|
||||
"- `pygame`: Used for the FrozenLake-v1 and Taxi-v3 UI.\n",
|
||||
"- `numpy`: Used for handling our Q-table.\n",
|
||||
"\n",
|
||||
"The Hugging Face Hub 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.\n",
|
||||
"\n",
|
||||
"You can see here all the Deep RL models available (if they use Q Learning) 👉 https://huggingface.co/models?other=q-learning"
|
||||
"You can see here all the Deep RL models available (if they use Q Learning) here 👉 https://huggingface.co/models?other=q-learning"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "4gpxC1_kqUYe"
|
||||
@@ -233,7 +227,7 @@
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"To make sure the new installed libraries are used, **sometimes it's required to restart the notebook runtime**. The next cell will force the **runtime to crash, so you'll need to connect again and run the code starting from here**. Thanks for this trick, **we will be able to run our virtual screen.**"
|
||||
"To make sure the new installed libraries are used, **sometimes it's required to restart the notebook runtime**. The next cell will force the **runtime to crash, so you'll need to connect again and run the code starting from here**. Thanks to this trick, **we will be able to run our virtual screen.**"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "K6XC13pTfFiD"
|
||||
@@ -289,7 +283,7 @@
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import numpy as np\n",
|
||||
"import gym\n",
|
||||
"import gymnasium as gym\n",
|
||||
"import random\n",
|
||||
"import imageio\n",
|
||||
"import os\n",
|
||||
@@ -323,12 +317,12 @@
|
||||
"id": "NAvihuHdy9tw"
|
||||
},
|
||||
"source": [
|
||||
"## Create and understand [FrozenLake environment ⛄]((https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)\n",
|
||||
"## Create and understand [FrozenLake environment ⛄]((https://gymnasium.farama.org/environments/toy_text/frozen_lake/)\n",
|
||||
"---\n",
|
||||
"\n",
|
||||
"💡 A good habit when you start to use an environment is to check its documentation \n",
|
||||
"\n",
|
||||
"👉 https://www.gymlibrary.dev/environments/toy_text/frozen_lake/\n",
|
||||
"👉 https://gymnasium.farama.org/environments/toy_text/frozen_lake/\n",
|
||||
"\n",
|
||||
"---\n",
|
||||
"\n",
|
||||
@@ -352,7 +346,10 @@
|
||||
"id": "UaW_LHfS0PY2"
|
||||
},
|
||||
"source": [
|
||||
"For now let's keep it simple with the 4x4 map and non-slippery"
|
||||
"For now let's keep it simple with the 4x4 map and non-slippery.\n",
|
||||
"We add a parameter called `render_mode` that specifies how the environment should be visualised. In our case because we **want to record a video of the environment at the end, we need to set render_mode to rgb_array**.\n",
|
||||
"\n",
|
||||
"As [explained in the documentation](https://gymnasium.farama.org/api/env/#gymnasium.Env.render) “rgb_array”: Return a single frame representing the current state of the environment. A frame is a np.ndarray with shape (x, y, 3) representing RGB values for an x-by-y pixel image."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -363,7 +360,7 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Create the FrozenLake-v1 environment using 4x4 map and non-slippery version\n",
|
||||
"# Create the FrozenLake-v1 environment using 4x4 map and non-slippery version and render_mode=\"rgb_array\"\n",
|
||||
"env = gym.make() # TODO use the correct parameters"
|
||||
]
|
||||
},
|
||||
@@ -384,7 +381,7 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"env = gym.make(\"FrozenLake-v1\", map_name=\"4x4\", is_slippery=False)"
|
||||
"env = gym.make(\"FrozenLake-v1\", map_name=\"4x4\", is_slippery=False, render_mode=\"rgb_array\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -480,6 +477,7 @@
|
||||
},
|
||||
"source": [
|
||||
"## Create and Initialize the Q-table 🗄️\n",
|
||||
"\n",
|
||||
"(👀 Step 1 of the pseudocode)\n",
|
||||
"\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg\" alt=\"Q-Learning\" width=\"100%\"/>\n",
|
||||
@@ -584,12 +582,13 @@
|
||||
},
|
||||
"source": [
|
||||
"## Define the greedy policy 🤖\n",
|
||||
"\n",
|
||||
"Remember we have two policies since Q-Learning is an **off-policy** algorithm. This means we're using a **different policy for acting and updating the value function**.\n",
|
||||
"\n",
|
||||
"- Epsilon-greedy policy (acting policy)\n",
|
||||
"- Greedy-policy (updating policy)\n",
|
||||
"\n",
|
||||
"Greedy policy will also be the final policy we'll have when the Q-learning agent will be trained. The greedy policy is used to select an action from the Q-table.\n",
|
||||
"The greedy policy will also be the final policy we'll have when the Q-learning agent completes training. The greedy policy is used to select an action using the Q-table.\n",
|
||||
"\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg\" alt=\"Q-Learning\" width=\"100%\"/>\n"
|
||||
]
|
||||
@@ -647,9 +646,9 @@
|
||||
"\n",
|
||||
"- With *probability 1 - ɛ* : **we do exploitation** (i.e. our agent selects the action with the highest state-action pair value).\n",
|
||||
"\n",
|
||||
"- With *probability ɛ*: we do **exploration** (trying random action).\n",
|
||||
"- With *probability ɛ*: we do **exploration** (trying a random action).\n",
|
||||
"\n",
|
||||
"And as the training goes, we progressively **reduce the epsilon value since we will need less and less exploration and more exploitation.**\n",
|
||||
"As the training continues, we progressively **reduce the epsilon value since we will need less and less exploration and more exploitation.**\n",
|
||||
"\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg\" alt=\"Q-Learning\" width=\"100%\"/>\n"
|
||||
]
|
||||
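The two cases above can be sketched as a small function. This is a minimal, dependency-free sketch rather than the notebook's own cell: the name `epsilon_greedy_action`, the `rng` parameter, and the sample Q-values are assumptions made for illustration.

```python
import random


def epsilon_greedy_action(q_row, epsilon, rng=random):
    """Pick an action index from one Q-table row (hypothetical helper).

    q_row   -- list of Q-values, one per action, for the current state
    epsilon -- exploration probability in [0, 1]
    rng     -- any object with random() and randrange(), e.g. random.Random(seed)
    """
    if rng.random() > epsilon:
        # Exploitation (probability 1 - epsilon): highest-valued action
        return max(range(len(q_row)), key=lambda a: q_row[a])
    # Exploration (probability epsilon): uniformly random action
    return rng.randrange(len(q_row))


# Illustrative values only
rng = random.Random(0)
print(epsilon_greedy_action([0.1, 0.9, 0.3], epsilon=0.0, rng=rng))  # always the greedy action
print(epsilon_greedy_action([0.1, 0.9, 0.3], epsilon=1.0, rng=rng))  # always a random action
```

Passing a seeded `random.Random` instance keeps the choice reproducible, which is convenient when checking the function by hand.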
@@ -716,6 +715,7 @@
|
||||
},
|
||||
"source": [
|
||||
"## Define the hyperparameters ⚙️\n",
|
||||
"\n",
|
||||
"The exploration related hyperparamters are some of the most important ones. \n",
|
||||
"\n",
|
||||
"- We need to make sure that our agent **explores enough of the state space** to learn a good value approximation. To do that, we need to have progressive decay of the epsilon.\n",
|
||||
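The progressive decay mentioned above can be illustrated with the exponential schedule that the training loop applies at the start of each episode. The sketch below uses `math.exp` instead of NumPy to stay dependency-free, and the hyperparameter values are hypothetical placeholders, not the notebook's settings.

```python
import math


def decayed_epsilon(episode, min_epsilon, max_epsilon, decay_rate):
    """Exponentially decay epsilon from max_epsilon toward min_epsilon."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


# Hypothetical values chosen only to show the shape of the schedule
MIN_EPSILON = 0.05
MAX_EPSILON = 1.0
DECAY_RATE = 0.0005

for episode in (0, 1000, 5000, 10000):
    eps = decayed_epsilon(episode, MIN_EPSILON, MAX_EPSILON, DECAY_RATE)
    print(f"episode {episode:>5}: epsilon = {eps:.3f}")
```

Epsilon starts at `max_epsilon`, shrinks every episode, and never drops below `min_epsilon`, so the agent keeps a small amount of exploration even late in training.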
@@ -789,9 +789,10 @@
|
||||
" # Reduce epsilon (because we need less and less exploration)\n",
|
||||
" epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)\n",
|
||||
" # Reset the environment\n",
|
||||
" state = env.reset()\n",
|
||||
" state, info = env.reset()\n",
|
||||
" step = 0\n",
|
||||
" done = False\n",
|
||||
" terminated = False\n",
|
||||
" truncated = False\n",
|
||||
"\n",
|
||||
" # repeat\n",
|
||||
" for step in range(max_steps):\n",
|
||||
@@ -800,13 +801,13 @@
|
||||
"\n",
|
||||
" # Take action At and observe Rt+1 and St+1\n",
|
||||
" # Take the action (a) and observe the outcome state(s') and reward (r)\n",
|
||||
" new_state, reward, done, info = \n",
|
||||
" new_state, reward, terminated, truncated, info =\n",
|
||||
"\n",
|
||||
" # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]\n",
|
||||
" Qtable[state][action] = \n",
|
||||
"\n",
|
||||
" # If done, finish the episode\n",
|
||||
" if done:\n",
|
||||
" # If terminated or truncated finish the episode\n",
|
||||
" if terminated or truncated:\n",
|
||||
" break\n",
|
||||
" \n",
|
||||
" # Our next state is the new state\n",
|
||||
@@ -836,9 +837,10 @@
|
||||
" # Reduce epsilon (because we need less and less exploration)\n",
|
||||
" epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)\n",
|
||||
" # Reset the environment\n",
|
||||
" state = env.reset()\n",
|
||||
" state, info = env.reset()\n",
|
||||
" step = 0\n",
|
||||
" done = False\n",
|
||||
" terminated = False\n",
|
||||
" truncated = False\n",
|
||||
"\n",
|
||||
" # repeat\n",
|
||||
" for step in range(max_steps):\n",
|
||||
@@ -847,13 +849,13 @@
|
||||
"\n",
|
||||
" # Take action At and observe Rt+1 and St+1\n",
|
||||
" # Take the action (a) and observe the outcome state(s') and reward (r)\n",
|
||||
" new_state, reward, done, info = env.step(action)\n",
|
||||
" new_state, reward, terminated, truncated, info = env.step(action)\n",
|
||||
"\n",
|
||||
" # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]\n",
|
||||
" Qtable[state][action] = Qtable[state][action] + learning_rate * (reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action]) \n",
|
||||
"\n",
|
||||
" # If done, finish the episode\n",
|
||||
" if done:\n",
|
||||
" # If terminated or truncated finish the episode\n",
|
||||
" if terminated or truncated:\n",
|
||||
" break\n",
|
||||
" \n",
|
||||
" # Our next state is the new state\n",
|
||||
@@ -931,20 +933,21 @@
|
||||
" episode_rewards = []\n",
|
||||
" for episode in tqdm(range(n_eval_episodes)):\n",
|
||||
" if seed:\n",
|
||||
" state = env.reset(seed=seed[episode])\n",
|
||||
" state, info = env.reset(seed=seed[episode])\n",
|
||||
" else:\n",
|
||||
" state = env.reset()\n",
|
||||
" state, info = env.reset()\n",
|
||||
" step = 0\n",
|
||||
" done = False\n",
|
||||
" truncated = False\n",
|
||||
" terminated = False\n",
|
||||
" total_rewards_ep = 0\n",
|
||||
" \n",
|
||||
" for step in range(max_steps):\n",
|
||||
" # Take the action (index) that have the maximum expected future reward given that state\n",
|
||||
" action = greedy_policy(Q, state)\n",
|
||||
" new_state, reward, done, info = env.step(action)\n",
|
||||
" new_state, reward, terminated, truncated, info = env.step(action)\n",
|
||||
" total_rewards_ep += reward\n",
|
||||
" \n",
|
||||
" if done:\n",
|
||||
" if terminated or truncated:\n",
|
||||
" break\n",
|
||||
" state = new_state\n",
|
||||
" episode_rewards.append(total_rewards_ep)\n",
|
||||
@@ -1045,15 +1048,16 @@
|
||||
" :param fps: how many frame per seconds (with taxi-v3 and frozenlake-v1 we use 1)\n",
|
||||
" \"\"\"\n",
|
||||
" images = [] \n",
|
||||
" done = False\n",
|
||||
" state = env.reset(seed=random.randint(0,500))\n",
|
||||
" img = env.render(mode='rgb_array')\n",
|
||||
" terminated = False\n",
|
||||
" truncated = False\n",
|
||||
" state, info = env.reset(seed=random.randint(0,500))\n",
|
||||
" img = env.render()\n",
|
||||
" images.append(img)\n",
|
||||
" while not done:\n",
|
||||
" while not terminated or truncated:\n",
|
||||
" # Take the action (index) that have the maximum expected future reward given that state\n",
|
||||
" action = np.argmax(Qtable[state][:])\n",
|
||||
" state, reward, done, info = env.step(action) # We directly put next_state = state for recording logic\n",
|
||||
" img = env.render(mode='rgb_array')\n",
|
||||
" state, reward, terminated, truncated, info = env.step(action) # We directly put next_state = state for recording logic\n",
|
||||
" img = env.render()\n",
|
||||
" images.append(img)\n",
|
||||
" imageio.mimsave(out_directory, [np.array(img) for i, img in enumerate(images)], fps=fps)"
|
||||
]
|
||||
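With the Gymnasium API used in these hunks, `env.reset()` returns `(observation, info)` and `env.step()` returns five values, with the old `done` flag split into `terminated` and `truncated`. The sketch below shows an episode loop built on those return shapes. `TinyCorridorEnv` is a hypothetical stand-in written only for this example so that it runs without installing Gymnasium; it is not part of the library. Note that `not terminated or truncated` parses as `(not terminated) or truncated`, so a loop written that way keeps running once `truncated` becomes true; the sketch uses parentheses.

```python
class TinyCorridorEnv:
    """Hypothetical stand-in that mimics the Gymnasium reset/step return shapes.

    The agent starts at position 0 and the goal is position 3.
    Action 1 moves right; any other action stays in place.
    """

    def __init__(self, max_episode_steps=10):
        self.max_episode_steps = max_episode_steps
        self.position = 0
        self.steps = 0

    def reset(self, seed=None):
        self.position = 0
        self.steps = 0
        return self.position, {}  # (observation, info)

    def step(self, action):
        self.steps += 1
        if action == 1:
            self.position += 1
        terminated = self.position == 3                   # goal reached
        truncated = self.steps >= self.max_episode_steps  # time limit hit
        reward = 1.0 if terminated else 0.0
        return self.position, reward, terminated, truncated, {}


def run_episode(env, policy):
    """Run one episode and return (total_reward, number_of_steps)."""
    state, info = env.reset()
    terminated = False
    truncated = False
    total_reward = 0.0
    steps = 0
    # `not` binds tighter than `or`, so the parentheses are required
    while not (terminated or truncated):
        action = policy(state)
        state, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward, steps


print(run_episode(TinyCorridorEnv(), lambda state: 1))  # (1.0, 3): reaches the goal
print(run_episode(TinyCorridorEnv(), lambda state: 0))  # (0.0, 10): ends by truncation
```

The second call is the case that matters: the agent never reaches the goal, so only `truncated` ends the episode.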
@@ -1209,7 +1213,7 @@
|
||||
"This way:\n",
|
||||
"- You can **showcase our work** 🔥\n",
|
||||
"- You can **visualize your agent playing** 👀\n",
|
||||
"- You can **share with the community an agent that others can use** 💾\n",
|
||||
"- You can **share an agent with the community that others can use** 💾\n",
|
||||
"- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard\n"
|
||||
]
|
||||
},
|
||||
@@ -1337,8 +1341,8 @@
|
||||
"id": "E2875IGsprzq"
|
||||
},
|
||||
"source": [
|
||||
"Congrats 🥳 you've just implemented from scratch, trained and uploaded your first Reinforcement Learning agent. \n",
|
||||
"FrozenLake-v1 no_slippery is very simple environment, let's try an harder one 🔥."
|
||||
"Congrats 🥳 you've just implemented from scratch, trained, and uploaded your first Reinforcement Learning agent.\n",
|
||||
"FrozenLake-v1 no_slippery is very simple environment, let's try a harder one 🔥."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -1349,12 +1353,12 @@
|
||||
"source": [
|
||||
"# Part 2: Taxi-v3 🚖\n",
|
||||
"\n",
|
||||
"## Create and understand [Taxi-v3 🚕](https://www.gymlibrary.dev/environments/toy_text/taxi/)\n",
|
||||
"## Create and understand [Taxi-v3 🚕](https://gymnasium.farama.org/environments/toy_text/taxi/)\n",
|
||||
"---\n",
|
||||
"\n",
|
||||
"💡 A good habit when you start to use an environment is to check its documentation \n",
|
||||
"\n",
|
||||
"👉 https://www.gymlibrary.dev/environments/toy_text/taxi/\n",
|
||||
"👉 https://gymnasium.farama.org/environments/toy_text/taxi/\n",
|
||||
"\n",
|
||||
"---\n",
|
||||
"\n",
|
||||
@@ -1374,7 +1378,7 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"env = gym.make(\"Taxi-v3\")"
|
||||
"env = gym.make(\"Taxi-v3\", render_mode=\"rgb_array\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -1453,6 +1457,7 @@
|
||||
},
|
||||
"source": [
|
||||
"## Define the hyperparameters ⚙️\n",
|
||||
"\n",
|
||||
"⚠ DO NOT MODIFY EVAL_SEED: the eval_seed array **allows us to evaluate your agent with the same taxi starting positions for every classmate**"
|
||||
]
|
||||
},
|
||||
@@ -1516,6 +1521,7 @@
|
||||
},
|
||||
"source": [
|
||||
"## Create a model dictionary 💾 and publish our trained model to the Hub 🔥\n",
|
||||
"\n",
|
||||
"- We create a model dictionary that will contain all the training hyperparameters for reproducibility and the Q-Table.\n"
|
||||
]
|
||||
},
|
||||
@@ -1554,7 +1560,7 @@
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"username = \"\" # FILL THIS\n",
|
||||
"repo_name = \"\"\n",
|
||||
"repo_name = \"\" # FILL THIS\n",
|
||||
"push_to_hub(\n",
|
||||
" repo_id=f\"{username}/{repo_name}\",\n",
|
||||
" model=model,\n",
|
||||
@@ -1567,9 +1573,8 @@
|
||||
"id": "ZgSdjgbIpRti"
|
||||
},
|
||||
"source": [
|
||||
"Now that's on the Hub, you can compare the results of your Taxi-v3 with your classmates using the leaderboard 🏆 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard\n",
|
||||
"Now that it's on the Hub, you can compare the results of your Taxi-v3 with your classmates using the leaderboard 🏆 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard\n",
|
||||
"\n",
|
||||
"⚠ To see your entry, you need to go to the bottom of the leaderboard page and **click on refresh** ⚠\n",
|
||||
"\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/taxi-leaderboard.png\" alt=\"Taxi Leaderboard\">"
|
||||
]
|
||||
@@ -1690,17 +1695,18 @@
|
||||
},
|
||||
"source": [
|
||||
"## Some additional challenges 🏆\n",
|
||||
"The best way to learn **is to try things by your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. With 1,000,000 steps, we saw some great results! \n",
|
||||
"\n",
|
||||
"The best way to learn **is to try things on your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. With 1,000,000 steps, we saw some great results!\n",
|
||||
"\n",
|
||||
"In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?\n",
|
||||
"\n",
|
||||
"Here are some ideas to achieve so:\n",
|
||||
"Here are some ideas to climb up the leaderboard:\n",
|
||||
"\n",
|
||||
"* Train more steps\n",
|
||||
"* Try different hyperparameters by looking at what your classmates have done.\n",
|
||||
"* **Push your new trained model** on the Hub 🔥\n",
|
||||
"\n",
|
||||
"Are walking on ice and driving taxis too boring to you? Try to **change the environment**, why not using FrozenLake-v1 slippery version? Check how they work [using the gym documentation](https://www.gymlibrary.dev/) and have fun 🎉."
|
||||
"Are walking on ice and driving taxis too boring to you? Try to **change the environment**, why not use FrozenLake-v1 slippery version? Check how they work [using the gymnasium documentation](https://gymnasium.farama.org/) and have fun 🎉."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -1714,7 +1720,7 @@
|
||||
"\n",
|
||||
"Understanding Q-Learning is an **important step to understanding value-based methods.**\n",
|
||||
"\n",
|
||||
"In the next Unit with Deep Q-Learning, we'll see that creating and updating a Q-table was a good strategy — **however, this is not scalable.**\n",
|
||||
"In the next Unit with Deep Q-Learning, we'll see that while creating and updating a Q-table was a good strategy — **however, it is not scalable.**\n",
|
||||
"\n",
|
||||
"For instance, imagine you create an agent that learns to play Doom. \n",
|
||||
"\n",
|
||||
@@ -1722,7 +1728,7 @@
|
||||
"\n",
|
||||
"Doom is a large environment with a huge state space (millions of different states). Creating and updating a Q-table for that environment would not be efficient. \n",
|
||||
"\n",
|
||||
"That's why we'll study, in the next unit, Deep Q-Learning, an algorithm **where we use a neural network that approximates, given a state, the different Q-values for each action.**\n",
|
||||
"That's why we'll study Deep Q-Learning in the next unit, an algorithm **where we use a neural network that approximates, given a state, the different Q-values for each action.**\n",
|
||||
"\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif\" alt=\"Environments\"/>\n"
|
||||
]
|
||||
@@ -1733,7 +1739,7 @@
|
||||
"id": "BjLhT70TEZIn"
|
||||
},
|
||||
"source": [
|
||||
"See you on Unit 3! 🔥\n",
|
||||
"See you in Unit 3! 🔥\n",
|
||||
"\n",
|
||||
"## Keep learning, stay awesome 🤗"
|
||||
]
|
||||
@@ -1744,10 +1750,12 @@
|
||||
"private_outputs": true,
|
||||
"provenance": [],
|
||||
"collapsed_sections": [
|
||||
"Ji_UrI5l2zzn",
|
||||
"67OdoKL63eDD",
|
||||
"B2_-8b8z5k54"
|
||||
]
|
||||
"B2_-8b8z5k54",
|
||||
"8R5ej1fS4P2V",
|
||||
"Pnpk2ePoem3r"
|
||||
],
|
||||
"include_colab_link": true
|
||||
},
|
||||
"gpuClass": "standard",
|
||||
"kernelspec": {
|
||||
@@ -1760,4 +1768,4 @@
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 0
|
||||
}
|
||||
}
|
||||
@@ -9,15 +9,13 @@
|
||||
|
||||
|
||||
Now that we've studied the Q-Learning algorithm, let's implement it from scratch and train our Q-Learning agent in two environments:
|
||||
1. [Frozen-Lake-v1 (non-slippery and slippery version)](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/) ☃️ : where our agent will need to **go from the starting state (S) to the goal state (G)** by walking only on frozen tiles (F) and avoiding holes (H).
|
||||
2. [An autonomous taxi](https://www.gymlibrary.dev/environments/toy_text/taxi/) 🚖 will need **to learn to navigate** a city to **transport its passengers from point A to point B.**
|
||||
1. [Frozen-Lake-v1 (non-slippery and slippery version)](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) ☃️ : where our agent will need to **go from the starting state (S) to the goal state (G)** by walking only on frozen tiles (F) and avoiding holes (H).
|
||||
2. [An autonomous taxi](https://gymnasium.farama.org/environments/toy_text/taxi/) 🚖 will need **to learn to navigate** a city to **transport its passengers from point A to point B.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/envs.gif" alt="Environments"/>
|
||||
|
||||
Thanks to a [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard), you'll be able to compare your results with other classmates and exchange the best practices to improve your agent's scores. Who will win the challenge for Unit 2?
|
||||
|
||||
**If you don't find your model, go to the bottom of the page and click on the refresh button.**
|
||||
|
||||
To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained Taxi model to the Hub and **get a result of >= 4.5**.
|
||||
|
||||
To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**
|
||||
@@ -32,13 +30,17 @@ And you can check your progress here 👉 https://huggingface.co/spaces/ThomasSi
|
||||
[](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit2/unit2.ipynb)
|
||||
|
||||
|
||||
We strongly **recommend students use Google Colab for the hands-on exercises** instead of running them on their personal computers.
|
||||
|
||||
By using Google Colab, **you can focus on learning and experimenting without worrying about the technical aspects** of setting up your environments.
|
||||
|
||||
|
||||
# Unit 2: Q-Learning with FrozenLake-v1 ⛄ and Taxi-v3 🚕
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/thumbnail.jpg" alt="Unit 2 Thumbnail">
|
||||
|
||||
In this notebook, **you'll code your first Reinforcement Learning agent from scratch** to play FrozenLake ❄️ using Q-Learning, share it with the community, and experiment with different configurations.
|
||||
|
||||
|
||||
⬇️ Here is an example of what **you will achieve in just a couple of minutes.** ⬇️
|
||||
|
||||
|
||||
@@ -46,13 +48,13 @@ In this notebook, **you'll code your first Reinforcement Learning agent from scr
|
||||
|
||||
### 🎮 Environments:
|
||||
|
||||
- [FrozenLake-v1](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)
|
||||
- [Taxi-v3](https://www.gymlibrary.dev/environments/toy_text/taxi/)
|
||||
- [FrozenLake-v1](https://gymnasium.farama.org/environments/toy_text/frozen_lake/)
|
||||
- [Taxi-v3](https://gymnasium.farama.org/environments/toy_text/taxi/)
|
||||
|
||||
### 📚 RL-Library:
|
||||
|
||||
- Python and NumPy
|
||||
- [Gym](https://www.gymlibrary.dev/)
|
||||
- [Gymnasium](https://gymnasium.farama.org/)
|
||||
|
||||
We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues).
|
||||
|
||||
@@ -60,23 +62,40 @@ We're constantly trying to improve our tutorials, so **if you find some issues i
|
||||
|
||||
At the end of the notebook, you will:
|
||||
|
||||
- Be able to use **Gym**, the environment library.
|
||||
- Be able to use **Gymnasium**, the environment library.
|
||||
- Be able to code a Q-Learning agent from scratch.
|
||||
- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.
|
||||
|
||||
## This notebook is from the Deep Reinforcement Learning Course
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/deep-rl-course-illustration.jpg" alt="Deep RL Course illustration"/>
|
||||
|
||||
In this free course, you will:
|
||||
|
||||
- 📖 Study Deep Reinforcement Learning in **theory and practice**.
|
||||
- 🧑💻 Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.
|
||||
- 🤖 Train **agents in unique environments**
|
||||
|
||||
And more! Check 📚 the syllabus 👉 https://simoninithomas.github.io/deep-rl-course
|
||||
|
||||
Don’t forget to **<a href="http://eepurl.com/ic5ZUD">sign up to the course</a>** (we are collecting your email to be able to **send you the links when each Unit is published and give you information about the challenges and updates).**
|
||||
|
||||
|
||||
The best way to keep in touch is to join our Discord server to exchange with the community and with us 👉🏻 https://discord.gg/ydHrjt3WP5
|
||||
|
||||
## Prerequisites 🏗️
|
||||
|
||||
Before diving into the notebook, you need to:
|
||||
|
||||
🔲 📚 **Study [Q-Learning by reading Unit 2](https://huggingface.co/deep-rl-course/unit2/introduction)** 🤗
|
||||
|
||||
## A small recap of Q-Learning
|
||||
|
||||
- *Q-Learning* **is the RL algorithm that**
|
||||
*Q-Learning* **is the RL algorithm that**:
|
||||
|
||||
- Trains *Q-Function*, an **action-value function** that is encoded, in internal memory, by a *Q-table* **that contains all the state-action pair values.**
|
||||
- Trains *Q-Function*, an **action-value function** that is encoded, in internal memory, by a *Q-table* **that contains all the state-action pair values.**
|
||||
|
||||
- Given a state and action, our Q-Function **will search the Q-table for the corresponding value.**
|
||||
- Given a state and action, our Q-Function **will search the Q-table for the corresponding value.**
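As a minimal sketch of that lookup, the Q-table can be pictured as a grid indexed first by state and then by action. The notebook stores the table in a NumPy array; plain nested lists are used here only to keep the sketch dependency-free. The table size, its values, and the helper names `q_value` and `best_action` are illustrative assumptions.

```python
# Toy Q-table: one row per state, one column per action.
# The numbers are made up for illustration.
q_table = [
    [0.0, 0.5],  # state 0
    [0.2, 0.1],  # state 1
    [0.0, 0.0],  # state 2
]


def q_value(table, state, action):
    """Return the stored value for a (state, action) pair."""
    return table[state][action]


def best_action(table, state):
    """Return the index of the highest-valued action in this state."""
    row = table[state]
    return max(range(len(row)), key=lambda a: row[a])


print(q_value(q_table, 0, 1))   # 0.5: value stored for state 0, action 1
print(best_action(q_table, 1))  # 0: action 0 has the larger value in state 1
```

Reading the best action per state out of the table is exactly what turns a learned Q-function into a policy.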
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q function" width="100%"/>
|
||||
|
||||
@@ -88,7 +107,7 @@ have an optimal policy, since we **know for, each state, the best action to take
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" width="100%"/>
|
||||
|
||||
|
||||
But, in the beginning, our **Q-Table is useless since it gives arbitrary values for each state-action pair (most of the time we initialize the Q-Table to 0 values)**. But, as we explore the environment and update our Q-Table it will give us better and better approximations
|
||||
But, in the beginning, our **Q-Table is useless since it gives arbitrary values for each state-action pair (most of the time we initialize the Q-Table to 0 values)**. But, as we explore the environment and update our Q-Table, it will give us better and better approximations.
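To make this concrete, here is a minimal sketch of a single Q-Learning update applied twice to a tiny Q-table. The 2-state, 2-action table, the reward, and the `learning_rate` and `gamma` values are made up for illustration and are not the values used later in this notebook.

```python
# Hypothetical toy problem: 2 states, 2 actions, Q-table initialized to 0.
q_table = [[0.0, 0.0],
           [0.0, 0.0]]

learning_rate = 0.5  # assumed value, for illustration only
gamma = 0.9          # assumed discount factor


def q_update(q_table, state, action, reward, new_state):
    """Apply one update: Q(s,a) += lr * (r + gamma * max Q(s',.) - Q(s,a))."""
    td_target = reward + gamma * max(q_table[new_state])
    td_error = td_target - q_table[state][action]
    q_table[state][action] += learning_rate * td_error
    return q_table[state][action]


# The agent takes action 1 in state 0, receives reward 1 and lands in state 1.
first = q_update(q_table, state=0, action=1, reward=1.0, new_state=1)
# Repeating the same transition moves the estimate closer to the target.
second = q_update(q_table, state=0, action=1, reward=1.0, new_state=1)
print(first, second)
```

Each update only moves the estimate part of the way toward the target, which is why the table needs many visits to each state-action pair before its values become useful.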
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/q-learning.jpeg" alt="q-learning.jpeg" width="100%"/>
|
||||
|
||||
@@ -99,6 +118,12 @@ This is the Q-Learning pseudocode:
|
||||
|
||||
# Let's code our first Reinforcement Learning algorithm 🚀
|
||||
|
||||
To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained Taxi model to the Hub and **get a result of >= 4.5**.
|
||||
|
||||
To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**
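As a quick illustration of how this score is computed, here is a small sketch using made-up episode rewards. The reward values are hypothetical, and we assume the standard deviation is the population one (which is what `np.std` computes by default).

```python
import statistics

# Hypothetical per-episode rewards from an evaluation run (made-up numbers).
episode_rewards = [8, 7, 9, 6, 10]

mean_reward = statistics.mean(episode_rewards)
# Population standard deviation (assumed to match np.std's default behaviour).
std_reward = statistics.pstdev(episode_rewards)

result = mean_reward - std_reward
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f} -> result={result:.2f}")
```

An agent with a high mean but very noisy rewards can therefore score lower than a slightly weaker but more consistent one.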
|
||||
|
||||
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
|
||||
|
||||
## Install dependencies and create a virtual display 🔽
|
||||
|
||||
In the notebook, we'll need to generate a replay video. To do so, with Colab, **we need to have a virtual screen to render the environment** (and thus record the frames).
|
||||
@@ -107,13 +132,13 @@ Hence the following cell will install the libraries and create and run a virtual
|
||||
|
||||
We’ll install multiple ones:
|
||||
|
||||
- `gym`: Contains the FrozenLake-v1 ⛄ and Taxi-v3 🚕 environments. We use `gym==0.24` since it contains a nice Taxi-v3 UI version.
|
||||
- `gymnasium`: Contains the FrozenLake-v1 ⛄ and Taxi-v3 🚕 environments.
|
||||
- `pygame`: Used for the FrozenLake-v1 and Taxi-v3 UI.
|
||||
- `numpy`: Used for handling our Q-table.
|
||||
|
||||
The Hugging Face Hub 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.
|
||||
|
||||
You can see all the Deep RL models available here (if they use Q Learning) 👉 https://huggingface.co/models?other=q-learning
|
||||
You can see all the Deep RL models available here (if they use Q-Learning) 👉 https://huggingface.co/models?other=q-learning
|
||||
|
||||
```bash
|
||||
pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit2/requirements-unit2.txt
|
||||
@@ -150,10 +175,11 @@ In addition to the installed libraries, we also use:
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
import gym
|
||||
import gymnasium as gym
|
||||
import random
|
||||
import imageio
|
||||
import os
|
||||
import tqdm
|
||||
|
||||
import pickle5 as pickle
|
||||
from tqdm.notebook import tqdm
|
||||
@@ -163,12 +189,12 @@ We're now ready to code our Q-Learning algorithm 🔥
|
||||
|
||||
# Part 1: Frozen Lake ⛄ (non slippery version)
|
||||
|
||||
## Create and understand [FrozenLake environment ⛄](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)
|
||||
## Create and understand [FrozenLake environment ⛄](https://gymnasium.farama.org/environments/toy_text/frozen_lake/)
|
||||
---
|
||||
|
||||
💡 A good habit when you start to use an environment is to check its documentation
|
||||
|
||||
👉 https://www.gymlibrary.dev/environments/toy_text/frozen_lake/
|
||||
👉 https://gymnasium.farama.org/environments/toy_text/frozen_lake/
|
||||
|
||||
---
|
||||
|
||||
@@ -185,17 +211,20 @@ The environment has two modes:
|
||||
- `is_slippery=False`: The agent always moves **in the intended direction** due to the non-slippery nature of the frozen lake (deterministic).
|
||||
- `is_slippery=True`: The agent **may not always move in the intended direction** due to the slippery nature of the frozen lake (stochastic).
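The difference between the two modes can be sketched without the environment itself. The snippet below is a simplified illustration, not the library's code: we assume that on slippery ice the intended direction and the two perpendicular directions are each picked with probability 1/3 (check the documentation for the exact behaviour of your version), and `sample_actual_action` is a hypothetical helper.

```python
import random

# Action indices, following the FrozenLake convention.
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3


def sample_actual_action(intended, is_slippery, rng):
    """Return the action that is actually executed.

    Assumption: on slippery ice, the intended direction and the two
    perpendicular directions are each chosen with probability 1/3.
    """
    if not is_slippery:
        return intended
    return rng.choice([(intended - 1) % 4, intended, (intended + 1) % 4])


rng = random.Random(0)
deterministic = {sample_actual_action(RIGHT, False, rng) for _ in range(1000)}
stochastic = {sample_actual_action(RIGHT, True, rng) for _ in range(1000)}
print(deterministic)  # only RIGHT
print(stochastic)     # RIGHT plus the two perpendicular directions
```

In the stochastic case the same action in the same state can lead to different next states, which is why the slippery version is harder to learn.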
|
||||
|
||||
For now let's keep it simple with the 4x4 map and non-slippery
|
||||
For now let's keep it simple with the 4x4 map and non-slippery.
|
||||
We add a parameter called `render_mode` that specifies how the environment should be visualised. In our case, because we **want to record a video of the environment at the end, we need to set `render_mode` to `rgb_array`**.
|
||||
|
||||
As [explained in the documentation](https://gymnasium.farama.org/api/env/#gymnasium.Env.render), with “rgb_array” the environment returns a single frame representing its current state. A frame is a `np.ndarray` with shape `(x, y, 3)` representing RGB values for an x-by-y pixel image.
|
||||
|
||||
```python
|
||||
# Create the FrozenLake-v1 environment using 4x4 map and non-slippery version
|
||||
# Create the FrozenLake-v1 environment using 4x4 map and non-slippery version and render_mode="rgb_array"
|
||||
env = gym.make() # TODO use the correct parameters
|
||||
```
|
||||
|
||||
### Solution
|
||||
|
||||
```python
|
||||
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
|
||||
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False, render_mode="rgb_array")
|
||||
```
|
||||
|
||||
You can create your own custom grid like this:
|
||||
@@ -244,6 +273,7 @@ Reward function 💰:
|
||||
- Reach frozen: 0
|
||||
|
||||
## Create and Initialize the Q-table 🗄️
|
||||
|
||||
(👀 Step 1 of the pseudocode)
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="100%"/>
|
||||
@@ -262,7 +292,6 @@ print("There are ", action_space, " possible actions")
|
||||
|
||||
```python
|
||||
# Let's create our Qtable of size (state_space, action_space) and initialize each value to 0 using np.zeros. np.zeros needs a tuple (a,b)
|
||||
|
||||
def initialize_q_table(state_space, action_space):
|
||||
Qtable =
|
||||
return Qtable
|
||||
@@ -294,6 +323,7 @@ Qtable_frozenlake = initialize_q_table(state_space, action_space)
|
||||
```
|
||||
|
||||
## Define the greedy policy 🤖
|
||||
|
||||
Remember we have two policies since Q-Learning is an **off-policy** algorithm. This means we're using a **different policy for acting and updating the value function**.
|
||||
|
||||
- Epsilon-greedy policy (acting policy)
|
||||
@@ -372,7 +402,8 @@ def epsilon_greedy_policy(Qtable, state, epsilon):
|
||||
```
|
||||
|
||||
## Define the hyperparameters ⚙️
|
||||
The exploration related hyperparameters are some of the most important ones.
|
||||
|
||||
The exploration-related hyperparameters are some of the most important ones.
|
||||
|
||||
- We need to make sure that our agent **explores enough of the state space** to learn a good value approximation. To do that, we need to have progressive decay of the epsilon.
|
||||
- If you decrease epsilon too fast (too high decay_rate), **you take the risk that your agent will be stuck**, since your agent didn't explore enough of the state space and hence can't solve the problem.
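To get a feel for how `decay_rate` shapes exploration, here is a small sketch that evaluates the exponential decay formula used in the `train` function of this notebook at a few episodes. The `min_epsilon`, `max_epsilon` and `decay_rate` values are assumptions chosen for illustration, and `epsilon_at` is a hypothetical helper.

```python
import math


def epsilon_at(episode, min_epsilon, max_epsilon, decay_rate):
    """Exponentially decayed epsilon for a given episode index."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


# Assumed values for illustration; tune them for your own environment.
min_epsilon, max_epsilon = 0.05, 1.0

for decay_rate in (0.0005, 0.05):  # slow decay vs. fast decay
    schedule = [
        epsilon_at(episode, min_epsilon, max_epsilon, decay_rate)
        for episode in (0, 100, 1000, 10000)
    ]
    print(decay_rate, [round(eps, 3) for eps in schedule])
```

With the faster decay, epsilon is already close to `min_epsilon` after a few hundred episodes, so the agent stops exploring much earlier.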
|
||||
@@ -419,13 +450,14 @@ Reset the environment
|
||||
|
||||
```python
|
||||
def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
|
||||
for episode in range(n_training_episodes):
|
||||
for episode in tqdm(range(n_training_episodes)):
|
||||
# Reduce epsilon (because we need less and less exploration)
|
||||
epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)
|
||||
# Reset the environment
|
||||
state = env.reset()
|
||||
state, info = env.reset()
|
||||
step = 0
|
||||
done = False
|
||||
terminated = False
|
||||
truncated = False
|
||||
|
||||
# repeat
|
||||
for step in range(max_steps):
|
||||
@@ -434,13 +466,13 @@ def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_st
|
||||
|
||||
# Take action At and observe Rt+1 and St+1
|
||||
# Take the action (a) and observe the outcome state(s') and reward (r)
|
||||
new_state, reward, done, info =
|
||||
new_state, reward, terminated, truncated, info =
|
||||
|
||||
# Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
|
||||
Qtable[state][action] =
|
||||
|
||||
# If done, finish the episode
|
||||
if done:
|
||||
# If terminated or truncated finish the episode
|
||||
if terminated or truncated:
|
||||
break
|
||||
|
||||
# Our next state is the new state
|
||||
@@ -456,9 +488,10 @@ def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_st
|
||||
# Reduce epsilon (because we need less and less exploration)
|
||||
epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
|
||||
# Reset the environment
|
||||
state = env.reset()
|
||||
state, info = env.reset()
|
||||
step = 0
|
||||
done = False
|
||||
terminated = False
|
||||
truncated = False
|
||||
|
||||
# repeat
|
||||
for step in range(max_steps):
|
||||
@@ -467,15 +500,15 @@ def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_st
|
||||
|
||||
# Take action At and observe Rt+1 and St+1
|
||||
# Take the action (a) and observe the outcome state(s') and reward (r)
|
||||
new_state, reward, done, info = env.step(action)
|
||||
new_state, reward, terminated, truncated, info = env.step(action)
|
||||
|
||||
# Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
|
||||
Qtable[state][action] = Qtable[state][action] + learning_rate * (
|
||||
reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action]
|
||||
)
|
||||
|
||||
# If done, finish the episode
|
||||
if done:
|
||||
# If terminated or truncated finish the episode
|
||||
if terminated or truncated:
|
||||
break
|
||||
|
||||
# Our next state is the new state
|
||||
@@ -511,20 +544,21 @@ def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
|
||||
episode_rewards = []
|
||||
for episode in tqdm(range(n_eval_episodes)):
|
||||
if seed:
|
||||
state = env.reset(seed=seed[episode])
|
||||
state, info = env.reset(seed=seed[episode])
|
||||
else:
|
||||
state = env.reset()
|
||||
state, info = env.reset()
|
||||
step = 0
|
||||
done = False
|
||||
truncated = False
|
||||
terminated = False
|
||||
total_rewards_ep = 0
|
||||
|
||||
for step in range(max_steps):
|
||||
# Take the action (index) that has the maximum expected future reward given that state
|
||||
action = greedy_policy(Q, state)
|
||||
new_state, reward, done, info = env.step(action)
|
||||
new_state, reward, terminated, truncated, info = env.step(action)
|
||||
total_rewards_ep += reward
|
||||
|
||||
if done:
|
||||
if terminated or truncated:
|
||||
break
|
||||
state = new_state
|
||||
episode_rewards.append(total_rewards_ep)
|
||||
@@ -577,15 +611,18 @@ def record_video(env, Qtable, out_directory, fps=1):
|
||||
:param fps: how many frames per second (with taxi-v3 and frozenlake-v1 we use 1)
|
||||
"""
|
||||
images = []
|
||||
done = False
|
||||
state = env.reset(seed=random.randint(0, 500))
|
||||
img = env.render(mode="rgb_array")
|
||||
terminated = False
|
||||
truncated = False
|
||||
state, info = env.reset(seed=random.randint(0, 500))
|
||||
img = env.render()
|
||||
images.append(img)
|
||||
while not done:
|
||||
while not (terminated or truncated):
|
||||
# Take the action (index) that has the maximum expected future reward given that state
|
||||
action = np.argmax(Qtable[state][:])
|
||||
state, reward, done, info = env.step(action) # We directly put next_state = state for recording logic
|
||||
img = env.render(mode="rgb_array")
|
||||
state, reward, terminated, truncated, info = env.step(
|
||||
action
|
||||
) # We directly put next_state = state for recording logic
|
||||
img = env.render()
|
||||
images.append(img)
|
||||
imageio.mimsave(out_directory, [np.array(img) for img in images], fps=fps)
|
||||
```
|
||||
@@ -674,19 +711,19 @@ def push_to_hub(repo_id, model, env, video_fps=1, local_repo_path="hub"):
|
||||
metadata = {**metadata, **eval}
|
||||
|
||||
model_card = f"""
|
||||
# **Q-Learning** Agent playing **{env_id}**
|
||||
This is a trained model of a **Q-Learning** agent playing **{env_id}**.
|
||||
# **Q-Learning** Agent playing **{env_id}**
|
||||
This is a trained model of a **Q-Learning** agent playing **{env_id}**.
|
||||
|
||||
## Usage
|
||||
## Usage
|
||||
|
||||
```python
|
||||
```python
|
||||
|
||||
model = load_from_hub(repo_id="{repo_id}", filename="q-learning.pkl")
|
||||
model = load_from_hub(repo_id="{repo_id}", filename="q-learning.pkl")
|
||||
|
||||
# Don't forget to check if you need to add additional attributes (is_slippery=False etc)
|
||||
env = gym.make(model["env_id"])
|
||||
```
|
||||
"""
|
||||
# Don't forget to check if you need to add additional attributes (is_slippery=False etc)
|
||||
env = gym.make(model["env_id"])
|
||||
```
|
||||
"""
|
||||
|
||||
evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
|
||||
|
||||
@@ -793,12 +830,12 @@ FrozenLake-v1 no_slippery is very simple environment, let's try a harder one
|
||||
|
||||
# Part 2: Taxi-v3 🚖
|
||||
|
||||
## Create and understand [Taxi-v3 🚕](https://www.gymlibrary.dev/environments/toy_text/taxi/)
|
||||
## Create and understand [Taxi-v3 🚕](https://gymnasium.farama.org/environments/toy_text/taxi/)
|
||||
---
|
||||
|
||||
💡 A good habit when you start to use an environment is to check its documentation
|
||||
|
||||
👉 https://www.gymlibrary.dev/environments/toy_text/taxi/
|
||||
👉 https://gymnasium.farama.org/environments/toy_text/taxi/
|
||||
|
||||
---
|
||||
|
||||
@@ -811,7 +848,7 @@ When the episode starts, **the taxi starts off at a random square** and the pass
|
||||
|
||||
|
||||
```python
|
||||
env = gym.make("Taxi-v3")
|
||||
env = gym.make("Taxi-v3", render_mode="rgb_array")
|
||||
```
|
||||
|
||||
There are **500 discrete states since there are 25 taxi positions, 5 possible locations of the passenger** (including the case when the passenger is in the taxi), and **4 destination locations.**
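The count of 500 can be checked with a short sketch that packs the three components into a single integer. The `encode` and `decode` helpers below are hypothetical and assume a row-major packing of (taxi row, taxi column, passenger location, destination); see the Taxi-v3 documentation for the encoding the environment actually uses.

```python
def encode(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four components into one integer in [0, 500)."""
    index = taxi_row
    index = index * 5 + taxi_col            # 5 columns
    index = index * 5 + passenger_location  # 5 passenger locations (4 stops + in taxi)
    index = index * 4 + destination         # 4 destinations
    return index


def decode(index):
    """Inverse of encode."""
    index, destination = divmod(index, 4)
    index, passenger_location = divmod(index, 5)
    taxi_row, taxi_col = divmod(index, 5)
    return taxi_row, taxi_col, passenger_location, destination


n_states = 25 * 5 * 4
print(n_states)            # 500
print(encode(4, 4, 4, 3))  # 499, the largest index
```

Because every combination maps to a distinct integer, the Q-table for Taxi-v3 needs one row per index, that is 500 rows.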
|
||||
@@ -850,6 +887,7 @@ print("Q-table shape: ", Qtable_taxi.shape)
|
||||
```
|
||||
|
||||
## Define the hyperparameters ⚙️
|
||||
|
||||
⚠ DO NOT MODIFY EVAL_SEED: the eval_seed array **allows us to evaluate your agent with the same taxi starting positions for every classmate**
|
||||
|
||||
```python
|
||||
@@ -984,6 +1022,7 @@ Qtable_taxi
|
||||
```
|
||||
|
||||
## Create a model dictionary 💾 and publish our trained model to the Hub 🔥
|
||||
|
||||
- We create a model dictionary that will contain all the training hyperparameters for reproducibility and the Q-Table.
|
||||
|
||||
|
||||
@@ -1005,13 +1044,12 @@ model = {
|
||||
|
||||
```python
|
||||
username = "" # FILL THIS
|
||||
repo_name = ""
|
||||
repo_name = "" # FILL THIS
|
||||
push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)
|
||||
```
|
||||
|
||||
Now that it's on the Hub, you can compare the results of your Taxi-v3 with your classmates using the leaderboard 🏆 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
|
||||
|
||||
⚠ To see your entry, you need to go to the bottom of the leaderboard page and **click on refresh** ⚠
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/taxi-leaderboard.png" alt="Taxi Leaderboard">
|
||||
|
||||
@@ -1075,6 +1113,7 @@ evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"
|
||||
```
|
||||
|
||||
## Some additional challenges 🏆
|
||||
|
||||
The best way to learn **is to try things on your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. With 1,000,000 steps, we saw some great results!
|
||||
|
||||
In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?
|
||||
@@ -1085,7 +1124,7 @@ Here are some ideas to climb up the leaderboard:
|
||||
* Try different hyperparameters by looking at what your classmates have done.
|
||||
* **Push your new trained model** on the Hub 🔥
|
||||
|
||||
Are walking on ice and driving taxis too boring to you? Try to **change the environment**, why not use the FrozenLake-v1 slippery version? Check how they work [using the gym documentation](https://www.gymlibrary.dev/) and have fun 🎉.
|
||||
Are walking on ice and driving taxis too boring for you? Try to **change the environment**: why not use the FrozenLake-v1 slippery version? Check how they work [using the gymnasium documentation](https://gymnasium.farama.org/) and have fun 🎉.
|
||||
|
||||
_____________________________________________________________________
|
||||
Congrats 🥳, you've just implemented, trained, and uploaded your first Reinforcement Learning agent.
|
||||
|
||||