Merge pull request #310 from huggingface/GymnasiumUpdate/Unit2

Update Unit 2 (Gymnasium)
Thomas Simonini
2023-05-04 07:01:58 +02:00
committed by GitHub
4 changed files with 1977 additions and 138 deletions

1792
notebooks/unit2.ipynb Normal file

File diff suppressed because it is too large


@@ -1,4 +1,4 @@
gym==0.24
gymnasium
pygame
numpy
@@ -8,4 +8,4 @@ pyyaml==6.0
imageio
imageio_ffmpeg
pyglet==1.5.1
tqdm
tqdm


@@ -3,10 +3,11 @@
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github"
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/notebooks/unit2/unit2.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
"<a href=\"https://colab.research.google.com/github/huggingface/deep-rl-class/blob/GymnasiumUpdate%2FUnit2/notebooks/unit2/unit2.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
@@ -19,8 +20,7 @@
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/thumbnail.jpg\" alt=\"Unit 2 Thumbnail\">\n",
"\n",
"In this notebook, **you'll code from scratch your first Reinforcement Learning agent** playing FrozenLake ❄️ using Q-Learning, share it to the community, and experiment with different configurations.\n",
"\n",
"In this notebook, **you'll code your first Reinforcement Learning agent from scratch** to play FrozenLake ❄️ using Q-Learning, share it with the community, and experiment with different configurations.\n",
"\n",
"⬇️ Here is an example of what **you will achieve in just a couple of minutes.** ⬇️\n"
]
@@ -39,25 +39,18 @@
"source": [
"###🎮 Environments: \n",
"\n",
"- [FrozenLake-v1](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)\n",
"- [Taxi-v3](https://www.gymlibrary.dev/environments/toy_text/taxi/)\n",
"- [FrozenLake-v1](https://gymnasium.farama.org/environments/toy_text/frozen_lake/)\n",
"- [Taxi-v3](https://gymnasium.farama.org/environments/toy_text/taxi/)\n",
"\n",
"###📚 RL-Library: \n",
"\n",
"- Python and NumPy\n",
"- [Gym](https://www.gymlibrary.dev/)"
],
"metadata": {
"id": "DPTBOv9HYLZ2"
}
},
{
"cell_type": "markdown",
"source": [
"- [Gymnasium](https://gymnasium.farama.org/)\n",
"\n",
"We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues)."
],
"metadata": {
"id": "3iaIxM_TwklQ"
"id": "DPTBOv9HYLZ2"
}
},
{
@@ -70,8 +63,8 @@
"\n",
"At the end of the notebook, you will:\n",
"\n",
"- Be able to use **Gym**, the environment library.\n",
"- Be able to code from scratch a Q-Learning agent.\n",
"- Be able to use **Gymnasium**, the environment library.\n",
"- Be able to code a Q-Learning agent from scratch.\n",
"- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.\n",
"\n",
"\n"
@@ -81,6 +74,7 @@
"cell_type": "markdown",
"source": [
"## This notebook is from the Deep Reinforcement Learning Course\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/deep-rl-course-illustration.jpg\" alt=\"Deep RL Course illustration\"/>"
],
"metadata": {
@@ -114,6 +108,7 @@
},
"source": [
"## Prerequisites 🏗️\n",
"\n",
"Before diving into the notebook, you need to:\n",
"\n",
"🔲 📚 **Study [Q-Learning by reading Unit 2](https://huggingface.co/deep-rl-course/unit2/introduction)** 🤗 "
@@ -134,18 +129,18 @@
"id": "V68VveLacfxJ"
},
"source": [
"- The *Q-Learning* **is the RL algorithm that** \n",
"*Q-Learning* **is the RL algorithm that**:\n",
"\n",
" - Trains *Q-Function*, an **action-value function** that contains, as internal memory, a *Q-table* **that contains all the state-action pair values.**\n",
" \n",
" - Given a state and action, our Q-Function **will search into its Q-table the corresponding value.**\n",
"- Trains a *Q-Function*, an **action-value function** that is encoded, in internal memory, by a *Q-table* **that contains all the state-action pair values.**\n",
"\n",
"- Given a state and action, our Q-Function **will search the Q-table for the corresponding value.**\n",
" \n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg\" alt=\"Q function\" width=\"100%\"/>\n",
"\n",
"- When the training is done, **we have an optimal Q-Function, so an optimal Q-Table.**\n",
" \n",
"- And if we **have an optimal Q-function**, we\n",
"have an optimal policy,since we **know for each state, what is the best action to take.**\n",
"have an optimal policy, since we **know, for each state, the best action to take.**\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg\" alt=\"Link value policy\" width=\"100%\"/>\n",
"\n",
@@ -171,7 +166,6 @@
{
"cell_type": "markdown",
"source": [
"\n",
"To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained Taxi model to the Hub and **get a result of >= 4.5**.\n",
"\n",
"To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**\n",
@@ -193,13 +187,13 @@
"\n",
"We'll install multiple ones:\n",
"\n",
"- `gym`: Contains the FrozenLake-v1 ⛄ and Taxi-v3 🚕 environments. We use `gym==0.24` since it contains a nice Taxi-v3 UI version.\n",
"- `gymnasium`: Contains the FrozenLake-v1 ⛄ and Taxi-v3 🚕 environments. \n",
"- `pygame`: Used for the FrozenLake-v1 and Taxi-v3 UI.\n",
"- `numpy`: Used for handling our Q-table.\n",
"\n",
"The Hugging Face Hub 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.\n",
"\n",
"You can see here all the Deep RL models available (if they use Q Learning) 👉 https://huggingface.co/models?other=q-learning"
"You can see all the Deep RL models available here (if they use Q-Learning) 👉 https://huggingface.co/models?other=q-learning"
],
"metadata": {
"id": "4gpxC1_kqUYe"
@@ -233,7 +227,7 @@
{
"cell_type": "markdown",
"source": [
"To make sure the new installed libraries are used, **sometimes it's required to restart the notebook runtime**. The next cell will force the **runtime to crash, so you'll need to connect again and run the code starting from here**. Thanks for this trick, **we will be able to run our virtual screen.**"
"To make sure the newly installed libraries are used, **sometimes it's required to restart the notebook runtime**. The next cell will force the **runtime to crash, so you'll need to connect again and run the code starting from here**. Thanks to this trick, **we will be able to run our virtual screen.**"
],
"metadata": {
"id": "K6XC13pTfFiD"
@@ -289,7 +283,7 @@
"outputs": [],
"source": [
"import numpy as np\n",
"import gym\n",
"import gymnasium as gym\n",
"import random\n",
"import imageio\n",
"import os\n",
@@ -323,12 +317,12 @@
"id": "NAvihuHdy9tw"
},
"source": [
"## Create and understand [FrozenLake environment ⛄]((https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)\n",
"## Create and understand [FrozenLake environment ⛄](https://gymnasium.farama.org/environments/toy_text/frozen_lake/)\n",
"---\n",
"\n",
"💡 A good habit when you start to use an environment is to check its documentation \n",
"\n",
"👉 https://www.gymlibrary.dev/environments/toy_text/frozen_lake/\n",
"👉 https://gymnasium.farama.org/environments/toy_text/frozen_lake/\n",
"\n",
"---\n",
"\n",
@@ -352,7 +346,10 @@
"id": "UaW_LHfS0PY2"
},
"source": [
"For now let's keep it simple with the 4x4 map and non-slippery"
"For now let's keep it simple with the 4x4 map and non-slippery.\n",
"We add a parameter called `render_mode` that specifies how the environment should be visualised. In our case, because we **want to record a video of the environment at the end, we need to set `render_mode` to `rgb_array`**.\n",
"\n",
"As [explained in the documentation](https://gymnasium.farama.org/api/env/#gymnasium.Env.render), `rgb_array` mode returns a single frame representing the current state of the environment. A frame is a `np.ndarray` with shape `(x, y, 3)` representing RGB values for an x-by-y pixel image."
]
},
{
@@ -363,7 +360,7 @@
},
"outputs": [],
"source": [
"# Create the FrozenLake-v1 environment using 4x4 map and non-slippery version\n",
"# Create the FrozenLake-v1 environment using 4x4 map and non-slippery version and render_mode=\"rgb_array\"\n",
"env = gym.make() # TODO use the correct parameters"
]
},
@@ -384,7 +381,7 @@
},
"outputs": [],
"source": [
"env = gym.make(\"FrozenLake-v1\", map_name=\"4x4\", is_slippery=False)"
"env = gym.make(\"FrozenLake-v1\", map_name=\"4x4\", is_slippery=False, render_mode=\"rgb_array\")"
]
},
{
@@ -480,6 +477,7 @@
},
"source": [
"## Create and Initialize the Q-table 🗄️\n",
"\n",
"(👀 Step 1 of the pseudocode)\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg\" alt=\"Q-Learning\" width=\"100%\"/>\n",
@@ -584,12 +582,13 @@
},
"source": [
"## Define the greedy policy 🤖\n",
"\n",
"Remember we have two policies since Q-Learning is an **off-policy** algorithm. This means we're using a **different policy for acting and updating the value function**.\n",
"\n",
"- Epsilon-greedy policy (acting policy)\n",
"- Greedy-policy (updating policy)\n",
"\n",
"Greedy policy will also be the final policy we'll have when the Q-learning agent will be trained. The greedy policy is used to select an action from the Q-table.\n",
"The greedy policy will also be the final policy we'll have when the Q-learning agent completes training. The greedy policy is used to select an action using the Q-table.\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg\" alt=\"Q-Learning\" width=\"100%\"/>\n"
]
@@ -647,9 +646,9 @@
"\n",
"- With *probability 1- ɛ* : **we do exploitation** (i.e. our agent selects the action with the highest state-action pair value).\n",
"\n",
"- With *probability ɛ*: we do **exploration** (trying random action).\n",
"- With *probability ɛ*: we do **exploration** (trying a random action).\n",
"\n",
"And as the training goes, we progressively **reduce the epsilon value since we will need less and less exploration and more exploitation.**\n",
"As the training continues, we progressively **reduce the epsilon value since we will need less and less exploration and more exploitation.**\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg\" alt=\"Q-Learning\" width=\"100%\"/>\n"
]
@@ -716,6 +715,7 @@
},
"source": [
"## Define the hyperparameters ⚙️\n",
"\n",
"The exploration-related hyperparameters are some of the most important ones. \n",
"\n",
"- We need to make sure that our agent **explores enough of the state space** to learn a good value approximation. To do that, we need to have progressive decay of the epsilon.\n",
@@ -789,9 +789,10 @@
" # Reduce epsilon (because we need less and less exploration)\n",
" epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)\n",
" # Reset the environment\n",
" state = env.reset()\n",
" state, info = env.reset()\n",
" step = 0\n",
" done = False\n",
" terminated = False\n",
" truncated = False\n",
"\n",
" # repeat\n",
" for step in range(max_steps):\n",
@@ -800,13 +801,13 @@
"\n",
" # Take action At and observe Rt+1 and St+1\n",
" # Take the action (a) and observe the outcome state(s') and reward (r)\n",
" new_state, reward, done, info = \n",
" new_state, reward, terminated, truncated, info =\n",
"\n",
" # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]\n",
" Qtable[state][action] = \n",
"\n",
" # If done, finish the episode\n",
" if done:\n",
" # If terminated or truncated finish the episode\n",
" if terminated or truncated:\n",
" break\n",
" \n",
" # Our next state is the new state\n",
@@ -836,9 +837,10 @@
" # Reduce epsilon (because we need less and less exploration)\n",
" epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)\n",
" # Reset the environment\n",
" state = env.reset()\n",
" state, info = env.reset()\n",
" step = 0\n",
" done = False\n",
" terminated = False\n",
" truncated = False\n",
"\n",
" # repeat\n",
" for step in range(max_steps):\n",
@@ -847,13 +849,13 @@
"\n",
" # Take action At and observe Rt+1 and St+1\n",
" # Take the action (a) and observe the outcome state(s') and reward (r)\n",
" new_state, reward, done, info = env.step(action)\n",
" new_state, reward, terminated, truncated, info = env.step(action)\n",
"\n",
" # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]\n",
" Qtable[state][action] = Qtable[state][action] + learning_rate * (reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action]) \n",
"\n",
" # If done, finish the episode\n",
" if done:\n",
" # If terminated or truncated finish the episode\n",
" if terminated or truncated:\n",
" break\n",
" \n",
" # Our next state is the new state\n",
@@ -931,20 +933,21 @@
" episode_rewards = []\n",
" for episode in tqdm(range(n_eval_episodes)):\n",
" if seed:\n",
" state = env.reset(seed=seed[episode])\n",
" state, info = env.reset(seed=seed[episode])\n",
" else:\n",
" state = env.reset()\n",
" state, info = env.reset()\n",
" step = 0\n",
" done = False\n",
" truncated = False\n",
" terminated = False\n",
" total_rewards_ep = 0\n",
" \n",
" for step in range(max_steps):\n",
" # Take the action (index) that have the maximum expected future reward given that state\n",
" action = greedy_policy(Q, state)\n",
" new_state, reward, done, info = env.step(action)\n",
" new_state, reward, terminated, truncated, info = env.step(action)\n",
" total_rewards_ep += reward\n",
" \n",
" if done:\n",
" if terminated or truncated:\n",
" break\n",
" state = new_state\n",
" episode_rewards.append(total_rewards_ep)\n",
@@ -1045,15 +1048,16 @@
" :param fps: how many frames per second (with taxi-v3 and frozenlake-v1 we use 1)\n",
" \"\"\"\n",
" images = [] \n",
" done = False\n",
" state = env.reset(seed=random.randint(0,500))\n",
" img = env.render(mode='rgb_array')\n",
" terminated = False\n",
" truncated = False\n",
" state, info = env.reset(seed=random.randint(0,500))\n",
" img = env.render()\n",
" images.append(img)\n",
" while not done:\n",
" while not (terminated or truncated):\n",
" # Take the action (index) that has the maximum expected future reward given that state\n",
" action = np.argmax(Qtable[state][:])\n",
" state, reward, done, info = env.step(action) # We directly put next_state = state for recording logic\n",
" img = env.render(mode='rgb_array')\n",
" state, reward, terminated, truncated, info = env.step(action) # We directly put next_state = state for recording logic\n",
" img = env.render()\n",
" images.append(img)\n",
" imageio.mimsave(out_directory, [np.array(img) for i, img in enumerate(images)], fps=fps)"
]
@@ -1209,7 +1213,7 @@
"This way:\n",
"- You can **showcase your work** 🔥\n",
"- You can **visualize your agent playing** 👀\n",
"- You can **share with the community an agent that others can use** 💾\n",
"- You can **share an agent with the community that others can use** 💾\n",
"- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard\n"
]
},
@@ -1337,8 +1341,8 @@
"id": "E2875IGsprzq"
},
"source": [
"Congrats 🥳 you've just implemented from scratch, trained and uploaded your first Reinforcement Learning agent. \n",
"FrozenLake-v1 no_slippery is very simple environment, let's try an harder one 🔥."
"Congrats 🥳 you've just implemented from scratch, trained, and uploaded your first Reinforcement Learning agent.\n",
"FrozenLake-v1 no_slippery is a very simple environment, let's try a harder one 🔥."
]
},
{
@@ -1349,12 +1353,12 @@
"source": [
"# Part 2: Taxi-v3 🚖\n",
"\n",
"## Create and understand [Taxi-v3 🚕](https://www.gymlibrary.dev/environments/toy_text/taxi/)\n",
"## Create and understand [Taxi-v3 🚕](https://gymnasium.farama.org/environments/toy_text/taxi/)\n",
"---\n",
"\n",
"💡 A good habit when you start to use an environment is to check its documentation \n",
"\n",
"👉 https://www.gymlibrary.dev/environments/toy_text/taxi/\n",
"👉 https://gymnasium.farama.org/environments/toy_text/taxi/\n",
"\n",
"---\n",
"\n",
@@ -1374,7 +1378,7 @@
},
"outputs": [],
"source": [
"env = gym.make(\"Taxi-v3\")"
"env = gym.make(\"Taxi-v3\", render_mode=\"rgb_array\")"
]
},
{
@@ -1453,6 +1457,7 @@
},
"source": [
"## Define the hyperparameters ⚙️\n",
"\n",
"⚠ DO NOT MODIFY EVAL_SEED: the eval_seed array **allows us to evaluate your agent with the same taxi starting positions for every classmate**"
]
},
@@ -1516,6 +1521,7 @@
},
"source": [
"## Create a model dictionary 💾 and publish our trained model to the Hub 🔥\n",
"\n",
"- We create a model dictionary that will contain all the training hyperparameters for reproducibility and the Q-Table.\n"
]
},
@@ -1554,7 +1560,7 @@
"outputs": [],
"source": [
"username = \"\" # FILL THIS\n",
"repo_name = \"\"\n",
"repo_name = \"\" # FILL THIS\n",
"push_to_hub(\n",
" repo_id=f\"{username}/{repo_name}\",\n",
" model=model,\n",
@@ -1567,9 +1573,8 @@
"id": "ZgSdjgbIpRti"
},
"source": [
"Now that's on the Hub, you can compare the results of your Taxi-v3 with your classmates using the leaderboard 🏆 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard\n",
"Now that it's on the Hub, you can compare the results of your Taxi-v3 with your classmates using the leaderboard 🏆 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard\n",
"\n",
"⚠ To see your entry, you need to go to the bottom of the leaderboard page and **click on refresh** ⚠\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/taxi-leaderboard.png\" alt=\"Taxi Leaderboard\">"
]
@@ -1690,17 +1695,18 @@
},
"source": [
"## Some additional challenges 🏆\n",
"The best way to learn **is to try things by your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. With 1,000,000 steps, we saw some great results! \n",
"\n",
"The best way to learn **is to try things on your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. With 1,000,000 steps, we saw some great results!\n",
"\n",
"In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?\n",
"\n",
"Here are some ideas to achieve so:\n",
"Here are some ideas to climb up the leaderboard:\n",
"\n",
"* Train more steps\n",
"* Try different hyperparameters by looking at what your classmates have done.\n",
"* **Push your newly trained model** to the Hub 🔥\n",
"\n",
"Are walking on ice and driving taxis too boring to you? Try to **change the environment**, why not using FrozenLake-v1 slippery version? Check how they work [using the gym documentation](https://www.gymlibrary.dev/) and have fun 🎉."
"Are walking on ice and driving taxis too boring for you? Try to **change the environment**. Why not use the slippery version of FrozenLake-v1? Check how it works [using the Gymnasium documentation](https://gymnasium.farama.org/) and have fun 🎉."
]
},
{
@@ -1714,7 +1720,7 @@
"\n",
"Understanding Q-Learning is an **important step to understanding value-based methods.**\n",
"\n",
"In the next Unit with Deep Q-Learning, we'll see that creating and updating a Q-table was a good strategy — **however, this is not scalable.**\n",
"In the next Unit with Deep Q-Learning, we'll see that while creating and updating a Q-table was a good strategy, **it is not scalable.**\n",
"\n",
"For instance, imagine you create an agent that learns to play Doom. \n",
"\n",
@@ -1722,7 +1728,7 @@
"\n",
"Doom is a large environment with a huge state space (millions of different states). Creating and updating a Q-table for that environment would not be efficient. \n",
"\n",
"That's why we'll study, in the next unit, Deep Q-Learning, an algorithm **where we use a neural network that approximates, given a state, the different Q-values for each action.**\n",
"That's why we'll study Deep Q-Learning in the next unit, an algorithm **where we use a neural network that approximates, given a state, the different Q-values for each action.**\n",
"\n",
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit4/atari-envs.gif\" alt=\"Environments\"/>\n"
]
@@ -1733,7 +1739,7 @@
"id": "BjLhT70TEZIn"
},
"source": [
"See you on Unit 3! 🔥\n",
"See you in Unit 3! 🔥\n",
"\n",
"## Keep learning, stay awesome 🤗"
]
@@ -1744,10 +1750,12 @@
"private_outputs": true,
"provenance": [],
"collapsed_sections": [
"Ji_UrI5l2zzn",
"67OdoKL63eDD",
"B2_-8b8z5k54"
]
"B2_-8b8z5k54",
"8R5ej1fS4P2V",
"Pnpk2ePoem3r"
],
"include_colab_link": true
},
"gpuClass": "standard",
"kernelspec": {
@@ -1760,4 +1768,4 @@
},
"nbformat": 4,
"nbformat_minor": 0
}
}


@@ -9,15 +9,13 @@
Now that we studied the Q-Learning algorithm, let's implement it from scratch and train our Q-Learning agent in two environments:
1. [Frozen-Lake-v1 (non-slippery and slippery version)](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/) ☃️ : where our agent will need to **go from the starting state (S) to the goal state (G)** by walking only on frozen tiles (F) and avoiding holes (H).
2. [An autonomous taxi](https://www.gymlibrary.dev/environments/toy_text/taxi/) 🚖 will need **to learn to navigate** a city to **transport its passengers from point A to point B.**
1. [Frozen-Lake-v1 (non-slippery and slippery version)](https://gymnasium.farama.org/environments/toy_text/frozen_lake/) ☃️ : where our agent will need to **go from the starting state (S) to the goal state (G)** by walking only on frozen tiles (F) and avoiding holes (H).
2. [An autonomous taxi](https://gymnasium.farama.org/environments/toy_text/taxi/) 🚖 will need **to learn to navigate** a city to **transport its passengers from point A to point B.**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/envs.gif" alt="Environments"/>
Thanks to a [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard), you'll be able to compare your results with your classmates and exchange best practices to improve your agent's scores. Who will win the challenge for Unit 2?
**If you don't find your model, go to the bottom of the page and click on the refresh button.**
To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained Taxi model to the Hub and **get a result of >= 4.5**.
To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**
@@ -32,13 +30,17 @@ And you can check your progress here 👉 https://huggingface.co/spaces/ThomasSi
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit2/unit2.ipynb)
We strongly **recommend students use Google Colab for the hands-on exercises** instead of running them on their personal computers.
By using Google Colab, **you can focus on learning and experimenting without worrying about the technical aspects** of setting up your environments.
# Unit 2: Q-Learning with FrozenLake-v1 ⛄ and Taxi-v3 🚕
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/thumbnail.jpg" alt="Unit 2 Thumbnail">
In this notebook, **you'll code your first Reinforcement Learning agent from scratch** to play FrozenLake ❄️ using Q-Learning, share it with the community, and experiment with different configurations.
⬇️ Here is an example of what **you will achieve in just a couple of minutes.** ⬇️
@@ -46,13 +48,13 @@ In this notebook, **you'll code your first Reinforcement Learning agent from scr
### 🎮 Environments:
- [FrozenLake-v1](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)
- [Taxi-v3](https://www.gymlibrary.dev/environments/toy_text/taxi/)
- [FrozenLake-v1](https://gymnasium.farama.org/environments/toy_text/frozen_lake/)
- [Taxi-v3](https://gymnasium.farama.org/environments/toy_text/taxi/)
### 📚 RL-Library:
- Python and NumPy
- [Gym](https://www.gymlibrary.dev/)
- [Gymnasium](https://gymnasium.farama.org/)
We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues).
@@ -60,23 +62,40 @@ We're constantly trying to improve our tutorials, so **if you find some issues i
At the end of the notebook, you will:
- Be able to use **Gym**, the environment library.
- Be able to use **Gymnasium**, the environment library.
- Be able to code a Q-Learning agent from scratch.
- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.
## This notebook is from the Deep Reinforcement Learning Course
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/deep-rl-course-illustration.jpg" alt="Deep RL Course illustration"/>
In this free course, you will:
- 📖 Study Deep Reinforcement Learning in **theory and practice**.
- 🧑‍💻 Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.
- 🤖 Train **agents in unique environments**
And more! Check 📚 the syllabus 👉 https://simoninithomas.github.io/deep-rl-course
Don't forget to **<a href="http://eepurl.com/ic5ZUD">sign up to the course</a>** (we are collecting your email to be able to **send you the links when each Unit is published and give you information about the challenges and updates**).
The best way to keep in touch is to join our discord server to exchange with the community and with us 👉🏻 https://discord.gg/ydHrjt3WP5
## Prerequisites 🏗️
Before diving into the notebook, you need to:
🔲 📚 **Study [Q-Learning by reading Unit 2](https://huggingface.co/deep-rl-course/unit2/introduction)** 🤗
## A small recap of Q-Learning
- *Q-Learning* **is the RL algorithm that**
*Q-Learning* **is the RL algorithm that**:
- Trains *Q-Function*, an **action-value function** that encoded, in internal memory, by a *Q-table* **that contains all the state-action pair values.**
- Trains a *Q-Function*, an **action-value function** that is encoded, in internal memory, by a *Q-table* **that contains all the state-action pair values.**
- Given a state and action, our Q-Function **will search the Q-table for the corresponding value.**
- Given a state and action, our Q-Function **will search the Q-table for the corresponding value.**
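As a rough illustration of the lookup described above, the Q-table can be pictured as a 2-D array indexed by state and action. The sketch below uses plain Python lists and made-up values; the table contents and the helper names `q_value` and `best_action` are assumptions for illustration, not part of the notebook.

```python
# Hypothetical 3-state, 2-action Q-table with made-up values (illustration only).
q_table = [
    [0.0, 0.5],  # state 0
    [0.2, 0.1],  # state 1
    [0.0, 0.0],  # state 2
]


def q_value(q_table, state, action):
    """Look up the stored value of a state-action pair."""
    return q_table[state][action]


def best_action(q_table, state):
    """Return the index of the highest-valued action (ties go to the lowest index)."""
    row = q_table[state]
    return max(range(len(row)), key=lambda a: row[a])


print(q_value(q_table, 0, 1))   # 0.5
print(best_action(q_table, 1))  # 0
```

Reading the best action off each row is what lets a good Q-table double as a policy.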
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q function" width="100%"/>
@@ -88,7 +107,7 @@ have an optimal policy, since we **know for, each state, the best action to take
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" width="100%"/>
But, in the beginning, our **Q-Table is useless since it gives arbitrary values for each state-action pair (most of the time we initialize the Q-Table to 0 values)**. But, as we explore the environment and update our Q-Table it will give us better and better approximations
But, in the beginning, our **Q-Table is useless since it gives arbitrary values for each state-action pair (most of the time we initialize the Q-Table to 0 values)**. But, as we explore the environment and update our Q-Table, it will give us better and better approximations.
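To make the "better and better approximations" idea concrete, here is a minimal sketch of one Q-Learning update applied to a zero-initialized table, following the update rule `Q(s,a) := Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]`. The function name `q_update`, the table size, and the values chosen for `learning_rate` and `gamma` are assumptions for illustration only.

```python
def q_update(q_table, state, action, reward, new_state, learning_rate, gamma):
    """Apply one Q-Learning update in place and return the updated value."""
    td_target = reward + gamma * max(q_table[new_state])
    td_error = td_target - q_table[state][action]
    q_table[state][action] += learning_rate * td_error
    return q_table[state][action]


# Zero-initialized table: 3 states, 2 actions (sizes chosen for illustration).
q_table = [[0.0, 0.0] for _ in range(3)]

# Taking action 0 in state 1 yields reward 1.0 and leads to state 2.
new_value = q_update(q_table, state=1, action=0, reward=1.0, new_state=2,
                     learning_rate=0.5, gamma=0.9)
print(new_value)  # 0.5
```

After this single step, the entry for (state 1, action 0) is no longer zero, and later updates that land in state 1 will propagate that value backwards.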
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/q-learning.jpeg" alt="q-learning.jpeg" width="100%"/>
@@ -99,6 +118,12 @@ This is the Q-Learning pseudocode:
# Let's code our first Reinforcement Learning algorithm 🚀
To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained Taxi model to the Hub and **get a result of >= 4.5**.
To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**
For more information about the certification process, check this section 👉 https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
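The leaderboard score above can be sketched in a few lines. This is only an illustration: the function name `leaderboard_result` and the sample rewards are made up, and it assumes the standard deviation is the population standard deviation.

```python
from statistics import mean, pstdev


def leaderboard_result(episode_rewards):
    """Return mean_reward - std_reward (population std is an assumption here)."""
    mean_reward = mean(episode_rewards)
    std_reward = pstdev(episode_rewards)
    return mean_reward - std_reward


# Made-up evaluation rewards for four episodes.
rewards = [8, 6, 8, 6]
print(leaderboard_result(rewards))  # 6.0
```

With these sample rewards the mean is 7 and the standard deviation is 1, so the result of 6.0 would clear the 4.5 threshold.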
## Install dependencies and create a virtual display 🔽
In the notebook, we'll need to generate a replay video. To do so, with Colab, **we need to have a virtual screen to render the environment** (and thus record the frames).
@@ -107,13 +132,13 @@ Hence the following cell will install the libraries and create and run a virtual
We'll install multiple ones:
- `gym`: Contains the FrozenLake-v1 ⛄ and Taxi-v3 🚕 environments. We use `gym==0.24` since it contains a nice Taxi-v3 UI version.
- `gymnasium`: Contains the FrozenLake-v1 ⛄ and Taxi-v3 🚕 environments.
- `pygame`: Used for the FrozenLake-v1 and Taxi-v3 UI.
- `numpy`: Used for handling our Q-table.
The Hugging Face Hub 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.
You can see all the Deep RL models available here (if they use Q Learning) 👉 https://huggingface.co/models?other=q-learning
You can see all the Deep RL models available here (if they use Q-Learning) 👉 https://huggingface.co/models?other=q-learning
```bash
pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit2/requirements-unit2.txt
@@ -150,10 +175,11 @@ In addition to the installed libraries, we also use:
```python
import numpy as np
import gym
import gymnasium as gym
import random
import imageio
import os
import tqdm
import pickle5 as pickle
from tqdm.notebook import tqdm
@@ -163,12 +189,12 @@ We're now ready to code our Q-Learning algorithm 🔥
# Part 1: Frozen Lake ⛄ (non slippery version)
## Create and understand [FrozenLake environment ⛄](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)
## Create and understand [FrozenLake environment ⛄](https://gymnasium.farama.org/environments/toy_text/frozen_lake/)
---
💡 A good habit when you start to use an environment is to check its documentation
👉 https://www.gymlibrary.dev/environments/toy_text/frozen_lake/
👉 https://gymnasium.farama.org/environments/toy_text/frozen_lake/
---
@@ -185,17 +211,20 @@ The environment has two modes:
- `is_slippery=False`: The agent always moves **in the intended direction** due to the non-slippery nature of the frozen lake (deterministic).
- `is_slippery=True`: The agent **may not always move in the intended direction** due to the slippery nature of the frozen lake (stochastic).
For now let's keep it simple with the 4x4 map and non-slippery
For now, let's keep it simple with the 4x4 map and the non-slippery version.
We add a parameter called `render_mode` that specifies how the environment should be visualised. In our case, because we **want to record a video of the environment at the end, we need to set `render_mode` to `"rgb_array"`**.
As [explained in the documentation](https://gymnasium.farama.org/api/env/#gymnasium.Env.render), `"rgb_array"` returns a single frame representing the current state of the environment. A frame is a `np.ndarray` with shape `(x, y, 3)` representing RGB values for an x-by-y pixel image.
```python
# Create the FrozenLake-v1 environment using 4x4 map and non-slippery version
# Create the FrozenLake-v1 environment using 4x4 map and non-slippery version and render_mode="rgb_array"
env = gym.make() # TODO use the correct parameters
```
### Solution
```python
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False, render_mode="rgb_array")
```
You can create your own custom grid like this:
@@ -244,6 +273,7 @@ Reward function 💰:
- Reach frozen: 0
## Create and Initialize the Q-table 🗄️
(👀 Step 1 of the pseudocode)
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="100%"/>
@@ -262,7 +292,6 @@ print("There are ", action_space, " possible actions")
```python
# Let's create our Q-table of size (state_space, action_space) and initialize each value to 0 using np.zeros. np.zeros needs a tuple (a, b)
def initialize_q_table(state_space, action_space):
Qtable =
return Qtable
@@ -294,6 +323,7 @@ Qtable_frozenlake = initialize_q_table(state_space, action_space)
```
## Define the greedy policy 🤖
Remember we have two policies since Q-Learning is an **off-policy** algorithm. This means we're using a **different policy for acting and updating the value function**.
- Epsilon-greedy policy (acting policy)
@@ -372,7 +402,8 @@ def epsilon_greedy_policy(Qtable, state, epsilon):
```
## Define the hyperparameters ⚙️
The exploration related hyperparameters are some of the most important ones.
The exploration-related hyperparameters are some of the most important ones.
- We need to make sure that our agent **explores enough of the state space** to learn a good value approximation. To do that, we need to have progressive decay of the epsilon.
- If you decrease epsilon too fast (too high decay_rate), **you take the risk that your agent will be stuck**, since your agent didn't explore enough of the state space and hence can't solve the problem.
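To get a feel for how the decay rate shapes exploration, here is a minimal sketch in plain Python. It reuses the exponential decay formula from the training loop. The values of `max_epsilon`, `min_epsilon`, and the two decay rates are illustrative assumptions rather than recommended settings, and `epsilon_at` is a hypothetical helper name.

```python
import math


def epsilon_at(episode, min_epsilon, max_epsilon, decay_rate):
    """Exponentially decay epsilon from max_epsilon towards min_epsilon."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


# Illustrative values (assumptions, not tuned settings)
max_epsilon = 1.0
min_epsilon = 0.05

# Compare a slow decay with a fast decay at a few episode indices
for decay_rate in (0.0005, 0.05):
    schedule = [
        epsilon_at(episode, min_epsilon, max_epsilon, decay_rate)
        for episode in (0, 100, 1000, 10000)
    ]
    print(decay_rate, [round(eps, 3) for eps in schedule])
```

With the faster decay, epsilon is already close to `min_epsilon` after about 100 episodes, so the agent mostly exploits from then on. With the slower decay, it is still exploring heavily at that point.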
@@ -419,13 +450,14 @@ Reset the environment
```python
def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
for episode in range(n_training_episodes):
for episode in tqdm(range(n_training_episodes)):
# Reduce epsilon (because we need less and less exploration)
epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)
# Reset the environment
state = env.reset()
state, info = env.reset()
step = 0
done = False
terminated = False
truncated = False
# repeat
for step in range(max_steps):
@@ -434,13 +466,13 @@ def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_st
# Take action At and observe Rt+1 and St+1
# Take the action (a) and observe the outcome state(s') and reward (r)
new_state, reward, done, info =
new_state, reward, terminated, truncated, info =
# Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
Qtable[state][action] =
# If done, finish the episode
if done:
# If terminated or truncated, finish the episode
if terminated or truncated:
break
# Our next state is the new state
@@ -456,9 +488,10 @@ def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_st
# Reduce epsilon (because we need less and less exploration)
epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
# Reset the environment
state = env.reset()
state, info = env.reset()
step = 0
done = False
terminated = False
truncated = False
# repeat
for step in range(max_steps):
@@ -467,15 +500,15 @@ def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_st
# Take action At and observe Rt+1 and St+1
# Take the action (a) and observe the outcome state(s') and reward (r)
new_state, reward, done, info = env.step(action)
new_state, reward, terminated, truncated, info = env.step(action)
# Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
Qtable[state][action] = Qtable[state][action] + learning_rate * (
reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action]
)
# If done, finish the episode
if done:
# If terminated or truncated, finish the episode
if terminated or truncated:
break
# Our next state is the new state
@@ -511,20 +544,21 @@ def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
episode_rewards = []
for episode in tqdm(range(n_eval_episodes)):
if seed:
state = env.reset(seed=seed[episode])
state, info = env.reset(seed=seed[episode])
else:
state = env.reset()
state, info = env.reset()
step = 0
done = False
truncated = False
terminated = False
total_rewards_ep = 0
for step in range(max_steps):
# Take the action (index) that has the maximum expected future reward given that state
action = greedy_policy(Q, state)
new_state, reward, done, info = env.step(action)
new_state, reward, terminated, truncated, info = env.step(action)
total_rewards_ep += reward
if done:
if terminated or truncated:
break
state = new_state
episode_rewards.append(total_rewards_ep)
@@ -577,15 +611,18 @@ def record_video(env, Qtable, out_directory, fps=1):
:param fps: how many frames per second (with taxi-v3 and frozenlake-v1 we use 1)
"""
images = []
done = False
state = env.reset(seed=random.randint(0, 500))
img = env.render(mode="rgb_array")
terminated = False
truncated = False
state, info = env.reset(seed=random.randint(0, 500))
img = env.render()
images.append(img)
while not done:
while not (terminated or truncated):
# Take the action (index) that has the maximum expected future reward given that state
action = np.argmax(Qtable[state][:])
state, reward, done, info = env.step(action) # We directly put next_state = state for recording logic
img = env.render(mode="rgb_array")
state, reward, terminated, truncated, info = env.step(
action
) # We directly put next_state = state for recording logic
img = env.render()
images.append(img)
imageio.mimsave(out_directory, [np.array(img) for img in images], fps=fps)
```
@@ -674,19 +711,19 @@ def push_to_hub(repo_id, model, env, video_fps=1, local_repo_path="hub"):
metadata = {**metadata, **eval}
model_card = f"""
# **Q-Learning** Agent playing **{env_id}**
This is a trained model of a **Q-Learning** agent playing **{env_id}**.
# **Q-Learning** Agent playing **{env_id}**
This is a trained model of a **Q-Learning** agent playing **{env_id}**.
## Usage
## Usage
```python
```python
model = load_from_hub(repo_id="{repo_id}", filename="q-learning.pkl")
model = load_from_hub(repo_id="{repo_id}", filename="q-learning.pkl")
# Don't forget to check if you need to add additional attributes (is_slippery=False etc)
env = gym.make(model["env_id"])
```
"""
# Don't forget to check if you need to add additional attributes (is_slippery=False etc)
env = gym.make(model["env_id"])
```
"""
evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
@@ -793,12 +830,12 @@ FrozenLake-v1 no_slippery is very simple environment, let's try a harder one
# Part 2: Taxi-v3 🚖
## Create and understand [Taxi-v3 🚕](https://www.gymlibrary.dev/environments/toy_text/taxi/)
## Create and understand [Taxi-v3 🚕](https://gymnasium.farama.org/environments/toy_text/taxi/)
---
💡 A good habit when you start to use an environment is to check its documentation
👉 https://www.gymlibrary.dev/environments/toy_text/taxi/
👉 https://gymnasium.farama.org/environments/toy_text/taxi/
---
@@ -811,7 +848,7 @@ When the episode starts, **the taxi starts off at a random square** and the pass
```python
env = gym.make("Taxi-v3")
env = gym.make("Taxi-v3", render_mode="rgb_array")
```
There are **500 discrete states since there are 25 taxi positions, 5 possible locations of the passenger** (including the case when the passenger is in the taxi), and **4 destination locations.**
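The state count is simply the product of the three factors: 25 × 5 × 4 = 500. As a quick check, the sketch below enumerates every combination. The `encode` helper is a hypothetical illustration of how such a tuple can be packed into a single integer index; it is not necessarily the environment's own encoding.

```python
from itertools import product

N_ROWS, N_COLS = 5, 5          # 25 taxi positions on the 5x5 grid
N_PASSENGER_LOCATIONS = 5      # 4 pick-up locations + "in the taxi"
N_DESTINATIONS = 4


def encode(taxi_row, taxi_col, passenger_location, destination):
    """Pack the four components into one integer index (hypothetical layout)."""
    index = taxi_row
    index = index * N_COLS + taxi_col
    index = index * N_PASSENGER_LOCATIONS + passenger_location
    index = index * N_DESTINATIONS + destination
    return index


# Enumerate every (row, col, passenger, destination) combination
states = {
    encode(row, col, passenger, destination)
    for row, col, passenger, destination in product(
        range(N_ROWS),
        range(N_COLS),
        range(N_PASSENGER_LOCATIONS),
        range(N_DESTINATIONS),
    )
}
print(len(states))
```

Each combination maps to a distinct index, which is why a Q-table for this environment needs one row per state.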
@@ -850,6 +887,7 @@ print("Q-table shape: ", Qtable_taxi.shape)
```
## Define the hyperparameters ⚙️
⚠ DO NOT MODIFY EVAL_SEED: the eval_seed array **allows us to evaluate your agent with the same taxi starting positions for every classmate**
```python
@@ -984,6 +1022,7 @@ Qtable_taxi
```
## Create a model dictionary 💾 and publish our trained model to the Hub 🔥
- We create a model dictionary that will contain all the training hyperparameters for reproducibility and the Q-Table.
@@ -1005,13 +1044,12 @@ model = {
```python
username = "" # FILL THIS
repo_name = ""
repo_name = "" # FILL THIS
push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)
```
Now that it's on the Hub, you can compare the results of your Taxi-v3 with your classmates using the leaderboard 🏆 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
⚠ To see your entry, you need to go to the bottom of the leaderboard page and **click on refresh** ⚠
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/taxi-leaderboard.png" alt="Taxi Leaderboard">
@@ -1075,6 +1113,7 @@ evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"
```
## Some additional challenges 🏆
The best way to learn **is to try things on your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. With 1,000,000 steps, we saw some great results!
In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?
@@ -1085,7 +1124,7 @@ Here are some ideas to climb up the leaderboard:
* Try different hyperparameters by looking at what your classmates have done.
* **Push your new trained model** on the Hub 🔥
Are walking on ice and driving taxis too boring to you? Try to **change the environment**, why not use the FrozenLake-v1 slippery version? Check how they work [using the gym documentation](https://www.gymlibrary.dev/) and have fun 🎉.
Are walking on ice and driving taxis too boring for you? Try to **change the environment**. Why not use the FrozenLake-v1 slippery version? Check how they work [using the gymnasium documentation](https://gymnasium.farama.org/) and have fun 🎉.
_____________________________________________________________________
Congrats 🥳, you've just implemented, trained, and uploaded your first Reinforcement Learning agent.