{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"colab_type": "text",
|
||
"id": "view-in-github"
|
||
},
|
||
"source": [
|
||
"<a href=\"https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit2/unit2.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "njb_ProuHiOe"
|
||
},
|
||
"source": [
|
||
"# Unit 2: Q-Learning with FrozenLake-v1 ⛄ and Taxi-v3 🚕\n",
|
||
"\n",
|
||
"In this notebook, **you'll code from scratch your first Reinforcement Learning agent** playing FrozenLake ❄️ using Q-Learning, share it to the community, and experiment with different configurations\n",
|
||
"\n",
|
||
"❓ If you have questions, please post them on #study-group-unit2 discord channel 👉 https://discord.gg/aYka4Yhff9\n",
|
||
"\n",
|
||
"🎮 Environments: \n",
|
||
"- [FrozenLake-v1](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)\n",
|
||
"- [Taxi-v3](https://www.gymlibrary.dev/environments/toy_text/taxi/)\n",
|
||
"\n",
|
||
"📚 RL-Library: Python and Numpy\n",
|
||
"\n",
|
||
"⬇️ Here is an example of what **you will achieve in just a couple of minutes.** ⬇️"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "vRU_vXBrl1Jx"
|
||
},
|
||
"source": [
|
||
"<img src=\"https://huggingface.co/blog/assets/70_deep_rl_q_part1/envs.gif\" alt=\"environments\"/>"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "4i6tjI2tHQ8j"
|
||
},
|
||
"source": [
|
||
"## Objectives of this notebook 🏆\n",
|
||
"At the end of the notebook, you will:\n",
|
||
"- Be able to use **Gym**, the environment library.\n",
|
||
"- Be able to code from scratch a Q-Learning agent.\n",
|
||
"- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score 🔥.\n",
|
||
"\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "ct8eLAabICE-"
|
||
},
|
||
"source": [
|
||
"## This notebook is from Deep Reinforcement Learning Class\n",
|
||
"\n",
|
||
""
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "6p5HnEefISCB"
|
||
},
|
||
"source": [
|
||
"In this free course, you will:\n",
|
||
"\n",
|
||
"- 📖 Study Deep Reinforcement Learning in **theory and practice**.\n",
|
||
"- 🧑💻 Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, and RLlib.\n",
|
||
"- 🤖 Train **agents in unique environments** \n",
|
||
"\n",
|
||
"And more check 📚 the syllabus 👉 https://github.com/huggingface/deep-rl-class\n",
|
||
"\n",
|
||
"The best way to keep in touch is to join our discord server to exchange with the community and with us 👉🏻 https://discord.gg/aYka4Yhff9"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "Y-mo_6rXIjRi"
|
||
},
|
||
"source": [
|
||
"## Prerequisites 🏗️\n",
|
||
"Before diving into the notebook, you need to:\n",
|
||
"\n",
|
||
"🔲 📚 [Read the Unit 2 Readme](https://github.com/huggingface/deep-rl-class/blob/main/unit2/README.md) that contains all the information.\n",
|
||
"\n",
|
||
"🔲 📚 [Read **An Introduction to Q-Learning Part 1**](https://huggingface.co/blog/deep-rl-q-part1) \n",
|
||
"\n",
|
||
"🔲 📚 [Read **An Introduction to Q-Learning Part 2**](https://huggingface.co/blog/deep-rl-q-part2) \n",
|
||
"\n",
|
||
"🔲 📢 Sign up to [our Discord Server](https://discord.gg/aYka4Yhff9) and **introduce yourself to #introduce-yourself channel 🥳**\n",
|
||
"\n",
|
||
"🔲 🐕 Are you new to Discord? Check our **discord 101 to get the best practices** 👉 https://github.com/huggingface/deep-rl-class/blob/main/DISCORD.Md\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "f2ONOODsyrMU"
|
||
},
|
||
"source": [
|
||
"## A small recap of Q-Learning"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "V68VveLacfxJ"
|
||
},
|
||
"source": [
|
||
"- The *Q-Learning* **is the RL algorithm that** \n",
|
||
" - Trains *Q-Function*, an **action-value function** that contains, as internal memory, a *Q-table* **that contains all the state-action pair values.**\n",
|
||
" \n",
|
||
" - Given a state and action, our Q-Function **will search into its Q-table the corresponding value.**\n",
|
||
" \n",
|
||
"\n",
|
||
"<img src=\"https://huggingface.co/blog/assets/70_deep_rl_q_part1/Q-function-2.jpg\" alt=\"Q function\"/>\n",
|
||
"\n",
|
||
"- When the training is done,**we have an optimal Q-Function, so an optimal Q-Table.**\n",
|
||
" \n",
|
||
"- And if we **have an optimal Q-function**, we\n",
|
||
"have an optimal policy,since we **know for each state, what is the best action to take.**\n",
|
||
"\n",
|
||
"\n",
|
||
"<img src=\"https://huggingface.co/blog/assets/70_deep_rl_q_part1/link-value-policy.jpg\" alt=\"Link value policy\"/>\n",
|
||
"\n",
|
||
"But, in the beginning, our **Q-Table is useless since it gives arbitrary value for each state-action pair (most of the time we initialize the Q-Table to 0 values)**. But, as we’ll explore the environment and update our Q-Table it will give us better and better approximations\n",
|
||
"\n",
|
||
"<img src=\"https://huggingface.co/blog/assets/70_deep_rl_q_part1/Q-learning-1.jpg\" alt=\"Link value policy\"/>\n",
|
||
"\n",
|
||
"This is the Q-Learning pseudocode:\n",
|
||
"\n",
|
||
"<img src=\"https://huggingface.co/blog/assets/70_deep_rl_q_part1/Q-learning-2.jpg\" alt=\"Link value policy\"/>\n"
|
||
]
|
||
},
|
||
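{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the update rule above concrete, here is a minimal NumPy sketch (not part of the exercise) of a single Q-Learning update on a toy 2-state, 2-action Q-table. The transition values are made up, just for illustration:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Toy Q-table: 2 states x 2 actions, initialized to 0\n",
"Q = np.zeros((2, 2))\n",
"\n",
"learning_rate = 0.7  # alpha\n",
"gamma = 0.95         # discount factor\n",
"\n",
"# Made-up transition: in state 0 we took action 1, got reward 1.0 and landed in state 1\n",
"state, action, reward, new_state = 0, 1, 1.0, 1\n",
"\n",
"# Q(s,a) := Q(s,a) + lr * [R + gamma * max Q(s',a') - Q(s,a)]\n",
"td_target = reward + gamma * np.max(Q[new_state])\n",
"Q[state][action] = Q[state][action] + learning_rate * (td_target - Q[state][action])\n",
"\n",
"print(Q)  # only Q[0][1] has been updated (to 0.7)\n",
"```"
]
},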
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "7sUukH74yQ7K"
|
||
},
|
||
"source": [
|
||
"### Step 0: Setup a Virtual Display 💻\n",
|
||
"\n",
|
||
"During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames). \n",
|
||
"\n",
|
||
"Hence the following cell will install virtual screen libraries and create and run a virtual screen 🖥"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "f2wrVUvrySkW"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"%%capture\n",
|
||
"!pip install pyglet==1.5.1 \n",
|
||
"!apt install python-opengl\n",
|
||
"!apt install ffmpeg\n",
|
||
"!apt install xvfb\n",
|
||
"!pip3 install pyvirtualdisplay\n",
|
||
"\n",
|
||
"# Virtual display\n",
|
||
"from pyvirtualdisplay import Display\n",
|
||
"\n",
|
||
"virtual_display = Display(visible=0, size=(1400, 900))\n",
|
||
"virtual_display.start()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "mIkOB4Rpw65k"
|
||
},
|
||
"source": [
|
||
"### Step 1: Install dependencies 🔽\n",
|
||
"The first step is to install the dependencies, we’ll install multiple ones:\n",
|
||
"\n",
|
||
"- `gym`: Contains the FrozenLake-v1 ⛄ and Taxi-v3 🚕 environments.\n",
|
||
"- `pygame`: Used for the FrozenLake-v1 and Taxi-v3 UI.\n",
|
||
"- `numPy`: Used for handling our Q-table.\n",
|
||
"\n",
|
||
"The Hugging Face Hub 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.\n",
|
||
"\n",
|
||
"You can see here all the Deep reinforcement Learning models available 👉 https://huggingface.co/models?other=q-learning\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "1Ac7wW_5ClJC"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"%%capture\n",
|
||
"!pip install gym==0.24 # We install the newest gym version for the Taxi-v3 \"rgb_array version\"\n",
|
||
"!pip install pygame\n",
|
||
"!pip install numpy\n",
|
||
"\n",
|
||
"!pip install huggingface_hub\n",
|
||
"!pip install pickle5\n",
|
||
"!pip install pyyaml==6.0 # avoid key error metadata\n",
|
||
"!pip install imageio imageio_ffmpeg"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "W-7f-Swax_9x"
|
||
},
|
||
"source": [
|
||
"### Step 2: Import the packages 📦\n",
|
||
"\n",
|
||
"In addition to the installed libraries, we also use:\n",
|
||
"\n",
|
||
"- `random`: To generate random numbers (that will be useful for Epsilon-Greedy Policy).\n",
|
||
"- `imageio`: To generate a replay video\n",
|
||
"\n",
|
||
"\n",
|
||
"\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "VcNvOAQlysBJ"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"import numpy as np\n",
|
||
"import gym\n",
|
||
"import random\n",
|
||
"import imageio\n",
|
||
"import os\n",
|
||
"\n",
|
||
"import pickle5 as pickle\n",
|
||
"from tqdm.notebook import tqdm"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "xp4-bXKIy1mQ"
|
||
},
|
||
"source": [
|
||
"We're now ready to code our Q-Learning algorithm 🔥"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "xya49aNJWVvv"
|
||
},
|
||
"source": [
|
||
"# Part 1: Frozen Lake ⛄ (non slippery version)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "NAvihuHdy9tw"
|
||
},
|
||
"source": [
|
||
"### Step 1: Create and understand [FrozenLake environment ⛄]((https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)\n",
|
||
"---\n",
|
||
"\n",
|
||
"💡 A good habit when you start to use an environment is to check its documentation \n",
|
||
"\n",
|
||
"👉 https://www.gymlibrary.dev/environments/toy_text/frozen_lake/\n",
|
||
"\n",
|
||
"---\n",
|
||
"\n",
|
||
"We're going to train our Q-Learning agent **to navigate from the starting state (S) to the goal state (G) by walking only on frozen tiles (F) and avoid holes (H)**.\n",
|
||
"\n",
|
||
"We can have two sizes of environment:\n",
|
||
"- `map_name=\"4x4\"`: a 4x4 grid version\n",
|
||
"- `map_name=\"8x8\"`: a 8x8 grid version\n",
|
||
"\n",
|
||
"\n",
|
||
"The environment has two modes:\n",
|
||
"- `is_slippery=False`: The agent always move in the intended direction due to the non-slippery nature of the frozen lake.\n",
|
||
"- `is_slippery=True`: The agent may not always move in the intended direction due to the slippery nature of the frozen lake (stochastic)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "UaW_LHfS0PY2"
|
||
},
|
||
"source": [
|
||
"For now let's keep it simple with the 4x4 map and non-slippery"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "IzJnb8O3y8up"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Create the FrozenLake-v1 environment using 4x4 map and non-slippery version\n",
|
||
"env = gym.make() # TODO use the correct parameters"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "Ji_UrI5l2zzn"
|
||
},
|
||
"source": [
|
||
"### Solution"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "jNxUbPMP0akP"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"env = gym.make(\"FrozenLake-v1\", map_name=\"4x4\", is_slippery=False)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "KASNViqL4tZn"
|
||
},
|
||
"source": [
|
||
"You can create your own custom grid like this:\n",
|
||
"\n",
|
||
"```python\n",
|
||
"desc=[\"SFFF\", \"FHFH\", \"FFFH\", \"HFFG\"]\n",
|
||
"gym.make('FrozenLake-v1', desc=desc, is_slippery=True)\n",
|
||
"```\n",
|
||
"\n",
|
||
"but we'll use the default environment for now."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "SXbTfdeJ1Xi9"
|
||
},
|
||
"source": [
|
||
"### Let's see what the Environment looks like:\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "ZNPG0g_UGCfh"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# We create our environment with gym.make(\"<name_of_the_environment>\")\n",
|
||
"env.reset()\n",
|
||
"print(\"_____OBSERVATION SPACE_____ \\n\")\n",
|
||
"print(\"Observation Space\", env.observation_space)\n",
|
||
"print(\"Sample observation\", env.observation_space.sample()) # Get a random observation"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "2MXc15qFE0M9"
|
||
},
|
||
"source": [
|
||
"We see with `Observation Space Shape Discrete(16)` that the observation is a value representing the **agent’s current position as current_row * nrows + current_col (where both the row and col start at 0)**. \n",
|
||
"\n",
|
||
"For example, the goal position in the 4x4 map can be calculated as follows: 3 * 4 + 3 = 15. The number of possible observations is dependent on the size of the map. **For example, the 4x4 map has 16 possible observations.**\n",
|
||
"\n",
|
||
"\n",
|
||
"For instance, this is what state = 0 looks like:\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "nW-hW7n22PI_"
|
||
},
|
||
"source": [
|
||
""
|
||
]
|
||
},
|
||
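{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check of that formula, here is a small sketch (not part of the exercise; the helper name `to_state` is just illustrative) that computes the state index of a few positions on the 4x4 map:\n",
"\n",
"```python\n",
"n_cols = 4  # width of the 4x4 map (equal to the number of rows here)\n",
"\n",
"def to_state(row, col):\n",
"  return row * n_cols + col\n",
"\n",
"print(to_state(0, 0))  # 0  -> the starting tile (S)\n",
"print(to_state(1, 2))  # 6\n",
"print(to_state(3, 3))  # 15 -> the goal tile (G)\n",
"```"
]
},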
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "We5WqOBGLoSm"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"print(\"\\n _____ACTION SPACE_____ \\n\")\n",
|
||
"print(\"Action Space Shape\", env.action_space.n)\n",
|
||
"print(\"Action Space Sample\", env.action_space.sample()) # Take a random action"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "MyxXwkI2Magx"
|
||
},
|
||
"source": [
|
||
"The action space (the set of possible actions the agent can take) is discrete with 4 actions available 🎮:\n",
|
||
"- 0: GO LEFT\n",
|
||
"- 1: GO DOWN\n",
|
||
"- 2: GO RIGHT\n",
|
||
"- 3: GO UP\n",
|
||
"\n",
|
||
"Reward function 💰:\n",
|
||
"- Reach goal: +1\n",
|
||
"- Reach hole: 0\n",
|
||
"- Reach frozen: 0"
|
||
]
|
||
},
|
||
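{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see these actions and rewards in practice, here is a small sketch (not part of the exercise), assuming `env` is the non-slippery 4x4 environment created above. It walks a hand-picked action sequence from S to G and prints the reward at each step; only the final step into G gives +1:\n",
"\n",
"```python\n",
"env.reset()\n",
"\n",
"# DOWN, DOWN, RIGHT, RIGHT, DOWN, RIGHT leads from S to G on the default 4x4 map\n",
"for action in [1, 1, 2, 2, 1, 2]:\n",
"  new_state, reward, done, info = env.step(action)\n",
"  print(f\"action={action} -> state={new_state}, reward={reward}, done={done}\")\n",
"\n",
"env.reset()  # start a fresh episode for the rest of the notebook\n",
"```"
]
},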
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "1pFhWblk3Awr"
|
||
},
|
||
"source": [
|
||
"### Step 2: Create and Initialize the Q-table 🗄️\n",
|
||
"(👀 Step 1 of the pseudocode)\n",
|
||
"\n",
|
||
"<img src=\"https://huggingface.co/blog/assets/70_deep_rl_q_part1/Q-learning-2.jpg\" alt=\"Q-Learning table\"/>\n",
|
||
"\n",
|
||
"It's time to initialize our Q-table! To know how many rows (states) and columns (actions) to use, we need to know the action and observation space. OpenAI Gym provides us a way to do that: `env.action_space.n` and `env.observation_space.n`\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "y3ZCdluj3k0l"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"state_space = \n",
|
||
"print(\"There are \", state_space, \" possible states\")\n",
|
||
"\n",
|
||
"action_space = \n",
|
||
"print(\"There are \", action_space, \" possible actions\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "rCddoOXM3UQH"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Let's create our Qtable of size (state_space, action_space) and initialized each values at 0 using np.zeros\n",
|
||
"def initialize_q_table(state_space, action_space):\n",
|
||
" Qtable = \n",
|
||
" return Qtable"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "9YfvrqRt3jdR"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"Qtable_frozenlake = initialize_q_table(state_space, action_space)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "67OdoKL63eDD"
|
||
},
|
||
"source": [
|
||
"### Solution"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "HuTKv3th3ohG"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"state_space = env.observation_space.n\n",
|
||
"print(\"There are \", state_space, \" possible states\")\n",
|
||
"\n",
|
||
"action_space = env.action_space.n\n",
|
||
"print(\"There are \", action_space, \" possible actions\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "lnrb_nX33fJo"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Let's create our Qtable of size (state_space, action_space) and initialized each values at 0 using np.zeros\n",
|
||
"def initialize_q_table(state_space, action_space):\n",
|
||
" Qtable = np.zeros((state_space, action_space))\n",
|
||
" return Qtable"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "Y0WlgkVO3Jf9"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"Qtable_frozenlake = initialize_q_table(state_space, action_space)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "flILKhBU3yZ7"
|
||
},
|
||
"source": [
|
||
"### Step 3: Define the epsilon-greedy policy 🤖\n",
|
||
"\n",
|
||
"Epsilon-Greedy is the training policy that handles the exploration/exploitation trade-off.\n",
|
||
"\n",
|
||
"The idea with Epsilon Greedy:\n",
|
||
"\n",
|
||
"- With *probability 1 - ɛ* : **we do exploitation** (aka our agent selects the action with the highest state-action pair value).\n",
|
||
"\n",
|
||
"- With *probability ɛ*: we do **exploration** (trying random action).\n",
|
||
"\n",
|
||
"And as the training goes, we progressively **reduce the epsilon value since we will need less and less exploration and more exploitation.**"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "3kk8TU3w4Ali"
|
||
},
|
||
"source": [
|
||
""
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "LjZSvhsD7_52"
|
||
},
|
||
"source": [
|
||
"Thanks to Sambit for finding a bug on the epsilon function 🤗"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "6Bj7x3in3_Pq"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def epsilon_greedy_policy(Qtable, state, epsilon):\n",
|
||
" # Randomly generate a number between 0 and 1\n",
|
||
" random_num = \n",
|
||
" # if random_num > greater than epsilon --> exploitation\n",
|
||
" if random_num > epsilon:\n",
|
||
" # Take the action with the highest value given a state\n",
|
||
" # np.argmax can be useful here\n",
|
||
" action = \n",
|
||
" # else --> exploration\n",
|
||
" else:\n",
|
||
" action = # Take a random action\n",
|
||
" \n",
|
||
" return action"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "8R5ej1fS4P2V"
|
||
},
|
||
"source": [
|
||
"#### Solution"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "cYxHuckr4LiG"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def epsilon_greedy_policy(Qtable, state, epsilon):\n",
|
||
" # Randomly generate a number between 0 and 1\n",
|
||
" random_int = random.uniform(0,1)\n",
|
||
" # if random_int > greater than epsilon --> exploitation\n",
|
||
" if random_int > epsilon:\n",
|
||
" # Take the action with the highest value given a state\n",
|
||
" # np.argmax can be useful here\n",
|
||
" action = np.argmax(Qtable[state])\n",
|
||
" # else --> exploration\n",
|
||
" else:\n",
|
||
" action = env.action_space.sample()\n",
|
||
" \n",
|
||
" return action"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "Atll4Z774gri"
|
||
},
|
||
"source": [
|
||
"### Step 4: Define the greedy policy 🤖\n",
|
||
"Remember we have two policies since Q-Learning is an **off-policy** algorithm. This means we're using a **different policy for acting and updating the value function**.\n",
|
||
"\n",
|
||
"- Epsilon greedy policy (acting policy)\n",
|
||
"- Greedy policy (updating policy)\n",
|
||
"\n",
|
||
"Greedy policy will also be the final policy we'll have when the Q-learning agent will be trained. The greedy policy is used to select an action from the Q-table."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "PlmodanK40QA"
|
||
},
|
||
"source": [
|
||
""
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "E3SCLmLX5bWG"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def greedy_policy(Qtable, state):\n",
|
||
" # Exploitation: take the action with the highest state, action value\n",
|
||
" action = \n",
|
||
" \n",
|
||
" return action"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "B2_-8b8z5k54"
|
||
},
|
||
"source": [
|
||
"#### Solution"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "se2OzWGW5kYJ"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def greedy_policy(Qtable, state):\n",
|
||
" # Exploitation: take the action with the highest state, action value\n",
|
||
" action = np.argmax(Qtable[state])\n",
|
||
" \n",
|
||
" return action"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "hW80DealcRtu"
|
||
},
|
||
"source": [
|
||
"### Step 5: Define the hyperparameters ⚙️\n",
|
||
"The exploration related hyperparamters are some of the most important ones. \n",
|
||
"\n",
|
||
"- We need to make sure that our agent **explores enough the state space** in order to learn a good value approximation, in order to do that we need to have progressive decay of the epsilon.\n",
|
||
"- If you decrease too fast epsilon (too high decay_rate), **you take the risk that your agent is stuck**, since your agent didn't explore enough the state space and hence can't solve the problem."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "Y1tWn0tycWZ1"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Training parameters\n",
|
||
"n_training_episodes = 10000 # Total training episodes\n",
|
||
"learning_rate = 0.7 # Learning rate\n",
|
||
"\n",
|
||
"# Evaluation parameters\n",
|
||
"n_eval_episodes = 100 # Total number of test episodes\n",
|
||
"\n",
|
||
"# Environment parameters\n",
|
||
"env_id = \"FrozenLake-v1\" # Name of the environment\n",
|
||
"max_steps = 99 # Max steps per episode\n",
|
||
"gamma = 0.95 # Discounting rate\n",
|
||
"eval_seed = [] # The evaluation seed of the environment\n",
|
||
"\n",
|
||
"# Exploration parameters\n",
|
||
"max_epsilon = 1.0 # Exploration probability at start\n",
|
||
"min_epsilon = 0.05 # Minimum exploration probability \n",
|
||
"decay_rate = 0.0005 # Exponential decay rate for exploration prob"
|
||
]
|
||
},
|
||
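{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a feel for how fast exploration decays with these values, here is a small sketch (not part of the exercise) that prints epsilon at a few episodes, using the same exponential schedule as the training loop below:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# epsilon drops from 1.0 towards min_epsilon (roughly 0.13 by episode 5000 with decay_rate=0.0005)\n",
"for episode in [0, 1000, 2500, 5000, 10000]:\n",
"  epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)\n",
"  print(f\"episode {episode}: epsilon = {epsilon:.3f}\")\n",
"```"
]
},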
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "cDb7Tdx8atfL"
|
||
},
|
||
"source": [
|
||
"### Step 6: Create the training loop method"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "paOynXy3aoJW"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):\n",
|
||
" for episode in range(n_training_episodes):\n",
|
||
" # Reduce epsilon (because we need less and less exploration)\n",
|
||
" epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)\n",
|
||
" # Reset the environment\n",
|
||
" state = env.reset()\n",
|
||
" step = 0\n",
|
||
" done = False\n",
|
||
"\n",
|
||
" # repeat\n",
|
||
" for step in range(max_steps):\n",
|
||
" # Choose the action At using epsilon greedy policy\n",
|
||
" action = \n",
|
||
"\n",
|
||
" # Take action At and observe Rt+1 and St+1\n",
|
||
" # Take the action (a) and observe the outcome state(s') and reward (r)\n",
|
||
" new_state, reward, done, info = \n",
|
||
"\n",
|
||
" # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]\n",
|
||
" Qtable[state][action] = \n",
|
||
"\n",
|
||
" # If done, finish the episode\n",
|
||
" if done:\n",
|
||
" break\n",
|
||
" \n",
|
||
" # Our state is the new state\n",
|
||
" state = new_state\n",
|
||
" return Qtable"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "Pnpk2ePoem3r"
|
||
},
|
||
"source": [
|
||
"#### Solution"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "IyZaYbUAeolw"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):\n",
|
||
" for episode in tqdm(range(n_training_episodes)):\n",
|
||
" # Reduce epsilon (because we need less and less exploration)\n",
|
||
" epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)\n",
|
||
" # Reset the environment\n",
|
||
" state = env.reset()\n",
|
||
" step = 0\n",
|
||
" done = False\n",
|
||
"\n",
|
||
" # repeat\n",
|
||
" for step in range(max_steps):\n",
|
||
" # Choose the action At using epsilon greedy policy\n",
|
||
" action = epsilon_greedy_policy(Qtable, state, epsilon)\n",
|
||
"\n",
|
||
" # Take action At and observe Rt+1 and St+1\n",
|
||
" # Take the action (a) and observe the outcome state(s') and reward (r)\n",
|
||
" new_state, reward, done, info = env.step(action)\n",
|
||
"\n",
|
||
" # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]\n",
|
||
" Qtable[state][action] = Qtable[state][action] + learning_rate * (reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action]) \n",
|
||
"\n",
|
||
" # If done, finish the episode\n",
|
||
" if done:\n",
|
||
" break\n",
|
||
" \n",
|
||
" # Our state is the new state\n",
|
||
" state = new_state\n",
|
||
" return Qtable"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "WLwKQ4tUdhGI"
|
||
},
|
||
"source": [
|
||
"### Step 7: Train the Q-Learning agent 🏃"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "DPBxfjJdTCOH"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"Qtable_frozenlake = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_frozenlake)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "yVeEhUCrc30L"
|
||
},
|
||
"source": [
|
||
"### Step 8: Let's see what our Q-Learning table looks like now 👀"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "nmfchsTITw4q"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"Qtable_frozenlake"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "pUrWkxsHccXD"
|
||
},
|
||
"source": [
|
||
"### Step 9: Define the evaluation method 📝"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "jNl0_JO2cbkm"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):\n",
|
||
" \"\"\"\n",
|
||
" Evaluate the agent for ``n_eval_episodes`` episodes and returns average reward and std of reward.\n",
|
||
" :param env: The evaluation environment\n",
|
||
" :param n_eval_episodes: Number of episode to evaluate the agent\n",
|
||
" :param Q: The Q-table\n",
|
||
" :param seed: The evaluation seed array (for taxi-v3)\n",
|
||
" \"\"\"\n",
|
||
" episode_rewards = []\n",
|
||
" for episode in tqdm(range(n_eval_episodes)):\n",
|
||
" if seed:\n",
|
||
" state = env.reset(seed=seed[episode])\n",
|
||
" else:\n",
|
||
" state = env.reset()\n",
|
||
" step = 0\n",
|
||
" done = False\n",
|
||
" total_rewards_ep = 0\n",
|
||
" \n",
|
||
" for step in range(max_steps):\n",
|
||
" # Take the action (index) that have the maximum expected future reward given that state\n",
|
||
" action = np.argmax(Q[state][:])\n",
|
||
" new_state, reward, done, info = env.step(action)\n",
|
||
" total_rewards_ep += reward\n",
|
||
" \n",
|
||
" if done:\n",
|
||
" break\n",
|
||
" state = new_state\n",
|
||
" episode_rewards.append(total_rewards_ep)\n",
|
||
" mean_reward = np.mean(episode_rewards)\n",
|
||
" std_reward = np.std(episode_rewards)\n",
|
||
"\n",
|
||
" return mean_reward, std_reward"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "0jJqjaoAnxUo"
|
||
},
|
||
"source": [
|
||
"### Step 10: Evaluate our Q-Learning agent 📈\n",
|
||
"- Normally you should have mean reward of 1.0\n",
|
||
"- It's relatively easy since the state space is really small (16). What you can try to do is [to replace with the slippery version](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "fAgB7s0HEFMm"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Evaluate our Agent\n",
|
||
"mean_reward, std_reward = evaluate_agent(env, max_steps, n_eval_episodes, Qtable_frozenlake, eval_seed)\n",
|
||
"print(f\"Mean_reward={mean_reward:.2f} +/- {std_reward:.2f}\")"
|
||
]
|
||
},
|
||
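{
"cell_type": "markdown",
"metadata": {},
"source": [
"As suggested above, if you want to see how much harder the stochastic version is, here is an optional sketch that reuses the `initialize_q_table`, `train`, and `evaluate_agent` functions defined earlier (it uses new variable names, so it won't overwrite your non-slippery results):\n",
"\n",
"```python\n",
"# Optional: the slippery (stochastic) FrozenLake is much harder to solve\n",
"env_slippery = gym.make(\"FrozenLake-v1\", map_name=\"4x4\", is_slippery=True)\n",
"\n",
"Qtable_slippery = initialize_q_table(env_slippery.observation_space.n, env_slippery.action_space.n)\n",
"Qtable_slippery = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env_slippery, max_steps, Qtable_slippery)\n",
"\n",
"mean_reward, std_reward = evaluate_agent(env_slippery, max_steps, n_eval_episodes, Qtable_slippery, eval_seed)\n",
"print(f\"Slippery FrozenLake: mean_reward={mean_reward:.2f} +/- {std_reward:.2f}\")\n",
"```"
]
},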
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "yxaP3bPdg1DV"
|
||
},
|
||
"source": [
|
||
"### Step 11: Publish our trained model on the Hub 🔥\n",
|
||
"Now that we saw we got good results after the training, we can publish our trained model on the hub 🤗 with one line of code.\n",
|
||
"\n",
|
||
"Here's an example of a Model Card:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "nSjyobJ_n5y_"
|
||
},
|
||
"source": [
|
||
""
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "kv0k1JQjpMq3"
|
||
},
|
||
"source": [
|
||
"Under the hood, the Hub uses git-based repositories (don't worry if you don't know what git is), which means you can update the model with new versions as you experiment and improve your agent."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "QZ5LrR-joIHD"
|
||
},
|
||
"source": [
|
||
"#### Do not modify this code"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "Jex3i9lZ8ksX"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"%%capture\n",
|
||
"from huggingface_hub import HfApi, HfFolder, Repository\n",
|
||
"from huggingface_hub.repocard import metadata_eval_result, metadata_save\n",
|
||
"\n",
|
||
"from pathlib import Path\n",
|
||
"import datetime\n",
|
||
"import json"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "Qo57HBn3W74O"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def record_video(env, Qtable, out_directory, fps=1):\n",
|
||
" images = [] \n",
|
||
" done = False\n",
|
||
" state = env.reset(seed=random.randint(0,500))\n",
|
||
" img = env.render(mode='rgb_array')\n",
|
||
" images.append(img)\n",
|
||
" while not done:\n",
|
||
" # Take the action (index) that have the maximum expected future reward given that state\n",
|
||
" action = np.argmax(Qtable[state][:])\n",
|
||
" state, reward, done, info = env.step(action) # We directly put next_state = state for recording logic\n",
|
||
" img = env.render(mode='rgb_array')\n",
|
||
" images.append(img)\n",
|
||
" imageio.mimsave(out_directory, [np.array(img) for i, img in enumerate(images)], fps=fps)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "pwsNrzB339aF"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"def push_to_hub(repo_id, \n",
|
||
" model,\n",
|
||
" env,\n",
|
||
" video_fps=1,\n",
|
||
" local_repo_path=\"hub\",\n",
|
||
" commit_message=\"Push Q-Learning agent to Hub\",\n",
|
||
" token= None\n",
|
||
" ):\n",
|
||
" _, repo_name = repo_id.split(\"/\")\n",
|
||
"\n",
|
||
" eval_env = env\n",
|
||
" \n",
|
||
" # Step 1: Clone or create the repo\n",
|
||
" # Create the repo (or clone its content if it's nonempty)\n",
|
||
" api = HfApi()\n",
|
||
" \n",
|
||
" repo_url = api.create_repo(\n",
|
||
" repo_id=repo_id,\n",
|
||
" token=token,\n",
|
||
" private=False,\n",
|
||
" exist_ok=True,)\n",
|
||
" \n",
|
||
" # Git pull\n",
|
||
" repo_local_path = Path(local_repo_path) / repo_name\n",
|
||
" repo = Repository(repo_local_path, clone_from=repo_url, use_auth_token=True)\n",
|
||
" repo.git_pull()\n",
|
||
" \n",
|
||
" repo.lfs_track([\"*.mp4\"])\n",
|
||
"\n",
|
||
" # Step 1: Save the model\n",
|
||
" if env.spec.kwargs.get(\"map_name\"):\n",
|
||
" model[\"map_name\"] = env.spec.kwargs.get(\"map_name\")\n",
|
||
" if env.spec.kwargs.get(\"is_slippery\", \"\") == False:\n",
|
||
" model[\"slippery\"] = False\n",
|
||
"\n",
|
||
" print(model)\n",
|
||
" \n",
|
||
" \n",
|
||
" # Pickle the model\n",
|
||
" with open(Path(repo_local_path)/'q-learning.pkl', 'wb') as f:\n",
|
||
" pickle.dump(model, f)\n",
|
||
" \n",
|
||
" # Step 2: Evaluate the model and build JSON\n",
|
||
" mean_reward, std_reward = evaluate_agent(eval_env, model[\"max_steps\"], model[\"n_eval_episodes\"], model[\"qtable\"], model[\"eval_seed\"])\n",
|
||
"\n",
|
||
" # First get datetime\n",
|
||
" eval_datetime = datetime.datetime.now()\n",
|
||
" eval_form_datetime = eval_datetime.isoformat()\n",
|
||
"\n",
|
||
" evaluate_data = {\n",
|
||
" \"env_id\": model[\"env_id\"], \n",
|
||
" \"mean_reward\": mean_reward,\n",
|
||
" \"n_eval_episodes\": model[\"n_eval_episodes\"],\n",
|
||
" \"eval_datetime\": eval_form_datetime,\n",
|
||
" }\n",
|
||
" # Write a JSON file\n",
|
||
" with open(Path(repo_local_path) / \"results.json\", \"w\") as outfile:\n",
|
||
" json.dump(evaluate_data, outfile)\n",
|
||
"\n",
|
||
" # Step 3: Create the model card\n",
|
||
" # Env id\n",
|
||
" env_name = model[\"env_id\"]\n",
|
||
" if env.spec.kwargs.get(\"map_name\"):\n",
|
||
" env_name += \"-\" + env.spec.kwargs.get(\"map_name\")\n",
|
||
"\n",
|
||
" if env.spec.kwargs.get(\"is_slippery\", \"\") == False:\n",
|
||
" env_name += \"-\" + \"no_slippery\"\n",
|
||
"\n",
|
||
" metadata = {}\n",
|
||
" metadata[\"tags\"] = [\n",
|
||
" env_name,\n",
|
||
" \"q-learning\",\n",
|
||
" \"reinforcement-learning\",\n",
|
||
" \"custom-implementation\"\n",
|
||
" ]\n",
|
||
"\n",
|
||
" # Add metrics\n",
|
||
" eval = metadata_eval_result(\n",
|
||
" model_pretty_name=repo_name,\n",
|
||
" task_pretty_name=\"reinforcement-learning\",\n",
|
||
" task_id=\"reinforcement-learning\",\n",
|
||
" metrics_pretty_name=\"mean_reward\",\n",
|
||
" metrics_id=\"mean_reward\",\n",
|
||
" metrics_value=f\"{mean_reward:.2f} +/- {std_reward:.2f}\",\n",
|
||
" dataset_pretty_name=env_name,\n",
|
||
" dataset_id=env_name,\n",
|
||
" )\n",
|
||
"\n",
|
||
" # Merges both dictionaries\n",
|
||
" metadata = {**metadata, **eval}\n",
|
||
"\n",
|
||
" model_card = f\"\"\"\n",
|
||
" # **Q-Learning** Agent playing **{env_id}**\n",
|
||
" This is a trained model of a **Q-Learning** agent playing **{env_id}** .\n",
|
||
" \"\"\"\n",
|
||
"\n",
|
||
" model_card += \"\"\"\n",
|
||
" ## Usage\n",
|
||
" ```python\n",
|
||
" \"\"\"\n",
|
||
"\n",
|
||
" model_card += f\"\"\"model = load_from_hub(repo_id=\"{repo_id}\", filename=\"q-learning.pkl\")\n",
|
||
"\n",
|
||
" # Don't forget to check if you need to add additional attributes (is_slippery=False etc)\n",
|
||
" env = gym.make(model[\"env_id\"])\n",
|
||
"\n",
|
||
" evaluate_agent(env, model[\"max_steps\"], model[\"n_eval_episodes\"], model[\"qtable\"], model[\"eval_seed\"])\n",
|
||
" \"\"\"\n",
|
||
"\n",
|
||
" model_card +=\"\"\"\n",
|
||
" ```\n",
|
||
" \"\"\"\n",
|
||
"\n",
|
||
" readme_path = repo_local_path / \"README.md\"\n",
|
||
" readme = \"\"\n",
|
||
" if readme_path.exists():\n",
|
||
" with readme_path.open(\"r\", encoding=\"utf8\") as f:\n",
|
||
" readme = f.read()\n",
|
||
" else:\n",
|
||
" readme = model_card\n",
|
||
"\n",
|
||
" with readme_path.open(\"w\", encoding=\"utf-8\") as f:\n",
|
||
" f.write(readme)\n",
|
||
"\n",
|
||
" # Save our metrics to Readme metadata\n",
|
||
" metadata_save(readme_path, metadata)\n",
|
||
"\n",
|
||
" # Step 4: Record a video\n",
|
||
" video_path = repo_local_path / \"replay.mp4\"\n",
|
||
" record_video(env, model[\"qtable\"], video_path, video_fps)\n",
|
||
" \n",
|
||
" # Push everything to hub\n",
|
||
" print(f\"Pushing repo {repo_name} to the Hugging Face Hub\")\n",
|
||
" repo.push_to_hub(commit_message=commit_message)\n",
|
||
"\n",
|
||
" print(f\"Your model is pushed to the hub. You can view your model here: {repo_url}\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "81J6cet_ogSS"
|
||
},
|
||
"source": [
|
||
"### .\n",
|
||
"By using `package_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the hub**.\n",
|
||
"\n",
|
||
"This way:\n",
|
||
"- You can **showcase our work** 🔥\n",
|
||
"- You can **visualize your agent playing** 👀\n",
|
||
"- You can **share with the community an agent that others can use** 💾\n",
|
||
"- You can **access a leaderboard 🏆 to see how well your agent is performing compared to your classmates** 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "cWnFC0iZooTw"
|
||
},
|
||
"source": [
|
||
"To be able to share your model with the community there are three more steps to follow:\n",
|
||
"\n",
|
||
"1️⃣ (If it's not already done) create an account to HF ➡ https://huggingface.co/join\n",
|
||
"\n",
|
||
"2️⃣ Sign in and then, you need to store your authentication token from the Hugging Face website.\n",
|
||
"- Create a new token (https://huggingface.co/settings/tokens) **with write role**"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "hUExHU_5oqPc"
|
||
},
|
||
"source": [
|
||
""
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "QB5nIcxR8paT"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"from huggingface_hub import notebook_login\n",
|
||
"notebook_login()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "GyWc1x3-o3xG"
|
||
},
|
||
"source": [
|
||
"If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "Gc5AfUeFo3xH"
|
||
},
|
||
"source": [
|
||
"3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using `package_to_hub()` function\n",
|
||
"\n",
|
||
"- Let's create **the model dictionary that contains the hyperparameters and the Q_table**."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "FiMqxqVHg0I4"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"model = {\n",
|
||
" \"env_id\": env_id,\n",
|
||
" \"max_steps\": max_steps,\n",
|
||
" \"n_training_episodes\": n_training_episodes,\n",
|
||
" \"n_eval_episodes\": n_eval_episodes,\n",
|
||
" \"eval_seed\": eval_seed,\n",
|
||
"\n",
|
||
" \"learning_rate\": learning_rate,\n",
|
||
" \"gamma\": gamma,\n",
|
||
"\n",
|
||
" \"max_epsilon\": max_epsilon,\n",
|
||
" \"min_epsilon\": min_epsilon,\n",
|
||
" \"decay_rate\": decay_rate,\n",
|
||
"\n",
|
||
" \"qtable\": Qtable_frozenlake\n",
|
||
"}"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "9kld-AEso3xH"
|
||
},
|
||
"source": [
|
||
"Let's fill the `package_to_hub` function:\n",
|
||
"- `repo_id`: the name of the Hugging Face Hub Repository that will be created/updated `\n",
|
||
"(repo_id = {username}/{repo_name})`\n",
|
||
"💡 **A good name is {username}/q-{env_id}**\n",
|
||
"- `model`: our model dictionary containing the hyperparameters and the Qtable.\n",
|
||
"- `env`: the environment.\n",
|
||
"- `commit_message`: message of the commit"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "5sBo2umnXpPd"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"model"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "RpOTtSt83kPZ"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"username = \"\" # FILL THIS\n",
|
||
"repo_name = \"q-FrozenLake-v1-4x4-noSlippery\"\n",
|
||
"push_to_hub(\n",
|
||
" repo_id=f\"{username}/{repo_name}\",\n",
|
||
" model=model,\n",
|
||
" env=env)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "E2875IGsprzq"
|
||
},
|
||
"source": [
|
||
"Congrats 🥳 you've just implemented from scratch, trained and uploaded your first Reinforcement Learning agent. \n",
|
||
"FrozenLake-v1 no_slippery is very simple environment, let's try an harder one 🔥."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "18lN8Bz7yvLt"
|
||
},
|
||
"source": [
|
||
"# Part 2: Taxi-v3 🚖\n",
|
||
"\n",
|
||
"### Step 1: Create and understand [Taxi-v3 🚕](https://www.gymlibrary.dev/environments/toy_text/taxi/)\n",
|
||
"---\n",
|
||
"\n",
|
||
"💡 A good habit when you start to use an environment is to check its documentation \n",
|
||
"\n",
|
||
"👉 https://www.gymlibrary.dev/environments/toy_text/taxi/\n",
|
||
"\n",
|
||
"---\n",
|
||
"\n",
|
||
"In Taxi-v3 🚕, there are four designated locations in the grid world indicated by R(ed), G(reen), Y(ellow), and B(lue). When the episode starts, the taxi starts off at a random square and the passenger is at a random location. The taxi drives to the passenger’s location, picks up the passenger, drives to the passenger’s destination (another one of the four specified locations), and then drops off the passenger. Once the passenger is dropped off, the episode ends."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "dkGqA8YFmtQf"
|
||
},
|
||
"source": [
|
||
""
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "gL0wpeO8gpej"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"env = gym.make(\"Taxi-v3\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "gBOaXgtsrmtT"
|
||
},
|
||
"source": [
|
||
"There are **500 discrete states since there are 25 taxi positions, 5 possible locations of the passenger** (including the case when the passenger is in the taxi), and **4 destination locations.**\n"
|
||
]
|
||
},
|
||
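{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check of that count (not part of the exercise): 25 taxi positions × 5 passenger locations × 4 destinations gives exactly 500 states:\n",
"\n",
"```python\n",
"n_taxi_positions = 5 * 5     # the taxi moves on a 5x5 grid\n",
"n_passenger_locations = 5    # R, G, Y, B, or inside the taxi\n",
"n_destinations = 4           # R, G, Y, B\n",
"\n",
"print(n_taxi_positions * n_passenger_locations * n_destinations)  # 500\n",
"```"
]
},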
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "_TPNaGSZrgqA"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"state_space = env.observation_space.n\n",
|
||
"print(\"There are \", state_space, \" possible states\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "CdeeZuokrhit"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"action_space = env.action_space.n\n",
|
||
"print(\"There are \", action_space, \" possible actions\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "R1r50Advrh5Q"
|
||
},
|
||
"source": [
|
||
"The action space (the set of possible actions the agent can take) is discrete with **6 actions available 🎮**:\n",
|
||
"- 0: move south\n",
|
||
"- 1: move north\n",
|
||
"- 2: move east\n",
|
||
"- 3: move west\n",
|
||
"- 4: pickup passenger\n",
|
||
"- 5: drop off passenger\n",
|
||
"\n",
|
||
"Reward function 💰:\n",
|
||
"- -1 per step unless other reward is triggered.\n",
|
||
"- +20 delivering passenger.\n",
|
||
"- -10 executing “pickup” and “drop-off” actions illegally."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "US3yDXnEtY9I"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Create our Q table with state_size rows and action_size columns (500x6)\n",
|
||
"Qtable_taxi = initialize_q_table(state_space, action_space)\n",
|
||
"print(Qtable_taxi)\n",
|
||
"print(\"Q-table shape: \", Qtable_taxi .shape)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "gUMKPH0_LJyH"
|
||
},
|
||
"source": [
|
||
"### Step 2: Define the hyperparameters ⚙️\n",
|
||
"⚠ DO NOT MODIFY EVAL_SEED: the eval_seed array **allows us to evaluate your agent with the same taxi starting positions for every classmate**"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "AB6n__hhg7YS"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Training parameters\n",
|
||
"n_training_episodes = 25000 # Total training episodes\n",
|
||
"learning_rate = 0.7 # Learning rate\n",
|
||
"\n",
|
||
"# Evaluation parameters\n",
|
||
"n_eval_episodes = 100 # Total number of test episodes\n",
|
||
"\n",
|
||
"# DO NOT MODIFY EVAL_SEED\n",
|
||
"eval_seed = [16,54,165,177,191,191,120,80,149,178,48,38,6,125,174,73,50,172,100,148,146,6,25,40,68,148,49,167,9,97,164,176,61,7,54,55,\n",
|
||
" 161,131,184,51,170,12,120,113,95,126,51,98,36,135,54,82,45,95,89,59,95,124,9,113,58,85,51,134,121,169,105,21,30,11,50,65,12,43,82,145,152,97,106,55,31,85,38,\n",
|
||
" 112,102,168,123,97,21,83,158,26,80,63,5,81,32,11,28,148] # Evaluation seed, this ensures that all classmates agents are trained on the same taxi starting position\n",
|
||
" # Each seed has a specific starting state\n",
|
||
"\n",
|
||
"# Environment parameters\n",
|
||
"env_id = \"Taxi-v3\" # Name of the environment\n",
|
||
"max_steps = 99 # Max steps per episode\n",
|
||
"gamma = 0.95 # Discounting rate\n",
|
||
"\n",
|
||
"# Exploration parameters\n",
|
||
"max_epsilon = 1.0 # Exploration probability at start\n",
|
||
"min_epsilon = 0.05 # Minimum exploration probability \n",
|
||
"decay_rate = 0.005 # Exponential decay rate for exploration prob\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "1TMORo1VLTsX"
|
||
},
|
||
"source": [
|
||
"### Step 3: Train our Q-Learning agent 🏃"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "MLNwkNDb14h2"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"Qtable_taxi = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_taxi)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "WwP3Y2z2eS-K"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"Qtable_taxi"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "wPdu0SueLVl2"
|
||
},
|
||
"source": [
|
||
"### Step 4: Create a model dictionary 💾 and publish our trained model on the Hub 🔥\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "0a1FpE_3hNYr"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"model = {\n",
|
||
" \"env_id\": env_id,\n",
|
||
" \"max_steps\": max_steps,\n",
|
||
" \"n_training_episodes\": n_training_episodes,\n",
|
||
" \"n_eval_episodes\": n_eval_episodes,\n",
|
||
" \"eval_seed\": eval_seed,\n",
|
||
"\n",
|
||
" \"learning_rate\": learning_rate,\n",
|
||
" \"gamma\": gamma,\n",
|
||
"\n",
|
||
" \"max_epsilon\": max_epsilon,\n",
|
||
" \"min_epsilon\": min_epsilon,\n",
|
||
" \"decay_rate\": decay_rate,\n",
|
||
"\n",
|
||
" \"qtable\": Qtable_taxi\n",
|
||
"}"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "dhQtiQozhOn1"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"username = \"\" # FILL THIS\n",
|
||
"repo_name = \"q-Taxi-v3\"\n",
|
||
"push_to_hub(\n",
|
||
" repo_id=f\"{username}/{repo_name}\",\n",
|
||
" model=model,\n",
|
||
" env=env)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "ZgSdjgbIpRti"
|
||
},
|
||
"source": [
|
||
"Compare the results of your Taxi-v3 with your classmates using the leaderboard 🏆 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "LyGvq3k4pYv_"
|
||
},
|
||
"source": [
|
||
""
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "bzgIO70c0bu2"
|
||
},
|
||
"source": [
|
||
"# Part 3: Load from Hub 🔽\n",
|
||
"\n",
|
||
"What's amazing with Hugging Face Hub 🤗 is that you can easily load powerful models from the community.\n",
|
||
"\n",
|
||
"Loading a saved model from the Hub is really easy.\n",
|
||
"1. You go https://huggingface.co/models?other=q-learning to see the list of all the q-learning saved models.\n",
|
||
"2. You select one and copy its repo_id"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "qUDSEy8Bn3ps"
|
||
},
|
||
"source": [
|
||
""
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "gTth6thRoC6X"
|
||
},
|
||
"source": [
|
||
"3. Then we just need to use `load_from_hub` with:\n",
|
||
"- The repo_id\n",
|
||
"- The filename: the saved model inside the repo."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "EtrfoTaBoNrd"
|
||
},
|
||
"source": [
|
||
"#### Do not modify this code"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "Eo8qEzNtCaVI"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"from urllib.error import HTTPError\n",
|
||
"\n",
|
||
"from huggingface_hub import hf_hub_download\n",
|
||
"\n",
|
||
"\n",
|
||
"def load_from_hub(repo_id: str, filename: str) -> str:\n",
|
||
" \"\"\"\n",
|
||
" Download a model from Hugging Face Hub.\n",
|
||
" :param repo_id: id of the model repository from the Hugging Face Hub\n",
|
||
" :param filename: name of the model zip file from the repository\n",
|
||
" \"\"\"\n",
|
||
" try:\n",
|
||
" from huggingface_hub import cached_download, hf_hub_url\n",
|
||
" except ImportError:\n",
|
||
" raise ImportError(\n",
|
||
" \"You need to install huggingface_hub to use `load_from_hub`. \"\n",
|
||
" \"See https://pypi.org/project/huggingface-hub/ for installation.\"\n",
|
||
" )\n",
|
||
"\n",
|
||
" # Get the model from the Hub, download and cache the model on your local disk\n",
|
||
" pickle_model = hf_hub_download(\n",
|
||
" repo_id=repo_id,\n",
|
||
" filename=filename\n",
|
||
" )\n",
|
||
"\n",
|
||
" with open(pickle_model, 'rb') as f:\n",
|
||
" downloaded_model_file = pickle.load(f)\n",
|
||
" \n",
|
||
" return downloaded_model_file"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "b_sM2gNioPZH"
|
||
},
|
||
"source": [
|
||
"### ."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "JUm9lz2gCQcU"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"model = load_from_hub(repo_id=\"ThomasSimonini/q-Taxi-v3\", filename=\"q-learning.pkl\")\n",
|
||
"\n",
|
||
"print(model)\n",
|
||
"env = gym.make(model[\"env_id\"])\n",
|
||
"\n",
|
||
"evaluate_agent(env, model[\"max_steps\"], model[\"n_eval_episodes\"], model[\"qtable\"], model[\"eval_seed\"])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "O7pL8rg1MulN"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"model = load_from_hub(repo_id=\"ThomasSimonini/q-FrozenLake-v1-no-slippery\", filename=\"q-learning.pkl\")\n",
|
||
"\n",
|
||
"env = gym.make(model[\"env_id\"], is_slippery=False)\n",
|
||
"\n",
|
||
"evaluate_agent(env, model[\"max_steps\"], model[\"n_eval_episodes\"], model[\"qtable\"], model[\"eval_seed\"])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "p-fW-EU5WejJ"
|
||
},
|
||
"source": [
|
||
"Congrats 🥳, you've just implemented, trained, and uploaded your first Reinforcement Learning agent.\n",
|
||
"\n",
|
||
"Understanding Q-Learning is an **important step to understanding value-based methods.**\n",
|
||
"\n",
|
||
"In the next Unit with Deep Q-Learning, we'll see that creating and updating a Q-table was a good strategy — **however, this is not scalable.**\n",
|
||
"\n",
|
||
"For instance, imagine you create an agent that learns to play Doom. Doom is a large environment with a huge state space (millions of different states). Creating and updating a Q-table for that environment would not be efficient. That's why we'll study Deep Q-Learning, an algorithm **where we use a neural network that approximates, given a state, the different Q-values for each action.**"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "BQAwLnYFPk-s"
|
||
},
|
||
"source": [
|
||
"## Some additional challenges 🏆\n",
|
||
"The best way to learn **is to try things by your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. With 1,000,000 steps, we saw some great results! \n",
|
||
"\n",
|
||
"In the [Leaderboard](https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?\n",
|
||
"\n",
|
||
"Here are some ideas to achieve so:\n",
|
||
"* Train more steps\n",
|
||
"* Try different hyperparameters by looking at what your classmates have done.\n",
|
||
"* **Push your new trained model** on the Hub 🔥\n",
|
||
"\n",
|
||
"Are walking on ice and driving taxis too boring to you? Try to **change the environment**, why not using FrozenLake-v1 slippery version? Check how they work [using the gym documentation](https://www.gymlibrary.dev/) and have fun 🎉."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "9lM95-dvmif8"
|
||
},
|
||
"source": [
|
||
"________________________________________________________________________\n",
|
||
"Congrats on finishing this chapter! That was the biggest one, **and there was a lot of information.**\n",
|
||
"\n",
|
||
"If you’re still feel confused with all these elements...it's totally normal! **This was the same for me and for all people who studied RL.**\n",
|
||
"\n",
|
||
"Take time to really **grasp the material before continuing and try the additional challenges**. It’s important to master these elements and having a solid foundations.\n",
|
||
"\n",
|
||
"Naturally, during the course, we’re going to use and deeper explain again these terms but **it’s better to have a good understanding of them now before diving into the next chapters.**\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "feR90OUSEXq9"
|
||
},
|
||
"source": [
|
||
"### This is a course built with you 👷🏿♀️\n",
|
||
"\n",
|
||
"We want to improve and update the course iteratively with your feedback. If you have some, please fill this form 👉 https://forms.gle/3HgA7bEHwAmmLfwh9\n",
|
||
"\n",
|
||
"If you found some issues in this notebook, please [open an issue on the Github Repo](https://github.com/huggingface/deep-rl-class/issues)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "BjLhT70TEZIn"
|
||
},
|
||
"source": [
|
||
"See you on [Unit 3](https://github.com/huggingface/deep-rl-class/tree/main/unit3#unit-3-deep-q-learning-with-atari-games-)! 🔥\n",
|
||
"## Keep learning, stay awesome 🤗"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"accelerator": "GPU",
|
||
"colab": {
|
||
"collapsed_sections": [
|
||
"67OdoKL63eDD",
|
||
"8R5ej1fS4P2V",
|
||
"B2_-8b8z5k54",
|
||
"Pnpk2ePoem3r",
|
||
"QZ5LrR-joIHD",
|
||
"EtrfoTaBoNrd"
|
||
],
|
||
"include_colab_link": true,
|
||
"name": "Copie de Copie de Unit 2: Q-Learning with FrozenLake-v1 and Taxi-v3.ipynb",
|
||
"private_outputs": true,
|
||
"provenance": []
|
||
},
|
||
"gpuClass": "standard",
|
||
"kernelspec": {
|
||
"display_name": "Python 3",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"name": "python"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 0
|
||
}
|