mirror of
https://github.com/huggingface/deep-rl-class.git
synced 2026-04-01 17:51:01 +08:00
Finalize Unit 2
@@ -35,7 +35,8 @@
|
||||
"\n",
|
||||
"###📚 RL-Library: \n",
|
||||
"\n",
|
||||
"- Python and Numpy"
|
||||
"- Python and NumPy\n",
|
||||
"- [Gym](https://www.gymlibrary.dev/)"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "DPTBOv9HYLZ2"
|
||||
@@ -44,7 +45,7 @@
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"source": [
|
||||
"We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the Github Repo](https://github.com/huggingface/deep-rl-class/issues)."
|
||||
"We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues)."
|
||||
],
|
||||
"metadata": {
|
||||
"id": "3iaIxM_TwklQ"
|
||||
@@ -163,19 +164,19 @@
|
||||
"source": [
|
||||
"## Install dependencies and create a virtual display 🔽\n",
|
||||
"\n",
|
||||
"During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames). \n",
|
||||
"In the notebook, we'll need to generate a replay video. To do so, with Colab, **we need to have a virtual screen to render the environment** (and thus record the frames).\n",
|
||||
"\n",
|
||||
"Hence the following cell will install the librairies and create and run a virtual screen 🖥\n",
|
||||
"Hence the following cell will install the libraries and create and run a virtual screen 🖥\n",
|
||||
"\n",
|
||||
"We’ll install multiple ones:\n",
|
||||
"\n",
|
||||
"- `gym`: Contains the FrozenLake-v1 ⛄ and Taxi-v3 🚕 environments. We use `gym==0.24` since it contains a nice Taxi-v3 UI version.\n",
|
||||
"- `pygame`: Used for the FrozenLake-v1 and Taxi-v3 UI.\n",
|
||||
"- `numPy`: Used for handling our Q-table.\n",
|
||||
"- `numpy`: Used for handling our Q-table.\n",
|
||||
"\n",
|
||||
"The Hugging Face Hub 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.\n",
|
||||
"\n",
|
||||
"You can see here all the Deep reinforcement Learning models available 👉 https://huggingface.co/models?other=q-learning\n"
|
||||
"You can see here all the Deep RL models available (if they use Q Learning) 👉 https://huggingface.co/models?other=q-learning"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "4gpxC1_kqUYe"
|
||||
@@ -195,11 +196,9 @@
|
||||
{
|
||||
"cell_type": "code",
|
||||
"source": [
|
||||
"%capture\n",
|
||||
"%%capture\n",
|
||||
"!sudo apt-get update\n",
|
||||
"!apt install python-opengl\n",
|
||||
"!apt install ffmpeg\n",
|
||||
"!apt install xvfb\n",
|
||||
"!apt install python-opengl ffmpeg xvfb\n",
|
||||
"!pip3 install pyvirtualdisplay"
|
||||
],
|
||||
"metadata": {
|
||||
@@ -254,12 +253,8 @@
|
||||
"\n",
|
||||
"In addition to the installed libraries, we also use:\n",
|
||||
"\n",
|
||||
"- `random`: To generate random numbers (that will be useful for Epsilon-Greedy Policy).\n",
|
||||
"- `imageio`: To generate a replay video\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n"
|
||||
"- `random`: To generate random numbers (that will be useful for epsilon-greedy policy).\n",
|
||||
"- `imageio`: To generate a replay video."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -323,8 +318,8 @@
|
||||
"\n",
|
||||
"The environment has two modes:\n",
|
||||
"\n",
|
||||
"- `is_slippery=False`: The agent always move in the intended direction due to the non-slippery nature of the frozen lake.\n",
|
||||
"- `is_slippery=True`: The agent may not always move in the intended direction due to the slippery nature of the frozen lake (stochastic)."
|
||||
"- `is_slippery=False`: The agent always moves **in the intended direction** due to the non-slippery nature of the frozen lake (deterministic).\n",
|
||||
"- `is_slippery=True`: The agent **may not always move in the intended direction** due to the slippery nature of the frozen lake (stochastic)."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -401,8 +396,7 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# We create our environment with gym.make(\"<name_of_the_environment>\")\n",
|
||||
"env.reset()\n",
|
||||
"# We create our environment with gym.make(\"<name_of_the_environment>\")- `is_slippery=False`: The agent always moves in the intended direction due to the non-slippery nature of the frozen lake (deterministic).\n",
|
||||
"print(\"_____OBSERVATION SPACE_____ \\n\")\n",
|
||||
"print(\"Observation Space\", env.observation_space)\n",
|
||||
"print(\"Sample observation\", env.observation_space.sample()) # Get a random observation"
|
||||
@@ -414,7 +408,7 @@
|
||||
"id": "2MXc15qFE0M9"
|
||||
},
|
||||
"source": [
|
||||
"We see with `Observation Space Shape Discrete(16)` that the observation is a value representing the **agent’s current position as current_row * nrows + current_col (where both the row and col start at 0)**. \n",
|
||||
"We see with `Observation Space Shape Discrete(16)` that the observation is an integer representing the **agent’s current position as current_row * nrows + current_col (where both the row and col start at 0)**. \n",
|
||||
"\n",
|
||||
"For example, the goal position in the 4x4 map can be calculated as follows: 3 * 4 + 3 = 15. The number of possible observations is dependent on the size of the map. **For example, the 4x4 map has 16 possible observations.**\n",
|
||||
"\n",
|
||||
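The position-to-observation mapping described in this hunk can be sketched as a tiny helper. This is only an illustration: `state_index` and `n_cols` are hypothetical names, not part of the notebook or of Gym, and the sketch multiplies by the column count, which coincides with `nrows` on the square maps used here.

```python
def state_index(row: int, col: int, n_cols: int = 4) -> int:
    """Flatten a zero-based (row, col) grid position into one integer state."""
    return row * n_cols + col


# Goal tile of the 4x4 map: bottom-right corner, i.e. row 3, column 3.
goal_state = state_index(3, 3)

# The start tile (top-left corner) maps to state 0.
start_state = state_index(0, 0)
```

With the default 4x4 layout this reproduces the arithmetic in the text (3 * 4 + 3 = 15), and every tile gets a distinct integer between 0 and 15.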
@@ -467,7 +461,7 @@
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg\" alt=\"Q-Learning\" width=\"100%\"/>\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"It's time to initialize our Q-table! To know how many rows (states) and columns (actions) to use, we need to know the action and observation space. OpenAI Gym provides us a way to do that: `env.action_space.n` and `env.observation_space.n`\n"
|
||||
"It's time to initialize our Q-table! To know how many rows (states) and columns (actions) to use, we need to know the action and observation space. We already know their values from before, but we'll want to obtain them programmatically so that our algorithm generalizes for different environments. Gym provides us a way to do that: `env.action_space.n` and `env.observation_space.n`\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -559,6 +553,62 @@
|
||||
"Qtable_frozenlake = initialize_q_table(state_space, action_space)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "Atll4Z774gri"
|
||||
},
|
||||
"source": [
|
||||
"## Define the greedy policy 🤖\n",
|
||||
"Remember we have two policies since Q-Learning is an **off-policy** algorithm. This means we're using a **different policy for acting and updating the value function**.\n",
|
||||
"\n",
|
||||
"- Epsilon-greedy policy (acting policy)\n",
|
||||
"- Greedy-policy (updating policy)\n",
|
||||
"\n",
|
||||
"Greedy policy will also be the final policy we'll have when the Q-learning agent will be trained. The greedy policy is used to select an action from the Q-table.\n",
|
||||
"\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg\" alt=\"Q-Learning\" width=\"100%\"/>\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "E3SCLmLX5bWG"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def greedy_policy(Qtable, state):\n",
|
||||
" # Exploitation: take the action with the highest state, action value\n",
|
||||
" action = \n",
|
||||
" \n",
|
||||
" return action"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "B2_-8b8z5k54"
|
||||
},
|
||||
"source": [
|
||||
"#### Solution"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "se2OzWGW5kYJ"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def greedy_policy(Qtable, state):\n",
|
||||
" # Exploitation: take the action with the highest state, action value\n",
|
||||
" action = np.argmax(Qtable[state][:])\n",
|
||||
" \n",
|
||||
" return action"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
@@ -567,11 +617,11 @@
|
||||
"source": [
|
||||
"##Define the epsilon-greedy policy 🤖\n",
|
||||
"\n",
|
||||
"Epsilon-Greedy is the training policy that handles the exploration/exploitation trade-off.\n",
|
||||
"Epsilon-greedy is the training policy that handles the exploration/exploitation trade-off.\n",
|
||||
"\n",
|
||||
"The idea with Epsilon Greedy:\n",
|
||||
"The idea with epsilon-greedy:\n",
|
||||
"\n",
|
||||
"- With *probability 1 - ɛ* : **we do exploitation** (aka our agent selects the action with the highest state-action pair value).\n",
|
||||
"- With *probability 1 - ɛ* : **we do exploitation** (i.e. our agent selects the action with the highest state-action pair value).\n",
|
||||
"\n",
|
||||
"- With *probability ɛ*: we do **exploration** (trying random action).\n",
|
||||
"\n",
|
||||
@@ -580,15 +630,6 @@
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg\" alt=\"Q-Learning\" width=\"100%\"/>\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "LjZSvhsD7_52"
|
||||
},
|
||||
"source": [
|
||||
"Thanks to Sambit for finding a bug on the epsilon function 🤗"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
@@ -636,7 +677,7 @@
|
||||
" if random_int > epsilon:\n",
|
||||
" # Take the action with the highest value given a state\n",
|
||||
" # np.argmax can be useful here\n",
|
||||
" action = np.argmax(Qtable[state])\n",
|
||||
" action = greedy_policy(Qtable, state)\n",
|
||||
" # else --> exploration\n",
|
||||
" else:\n",
|
||||
" action = env.action_space.sample()\n",
|
||||
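To make the relationship between the two policies in these hunks concrete, here is a minimal, dependency-free sketch. It is not the notebook's code: plain nested lists stand in for the NumPy Q-table, `max(...)` stands in for `np.argmax`, and a uniform draw over action indices stands in for `env.action_space.sample()`. The `toy_qtable` values are made up for illustration.

```python
import random


def greedy_policy(Qtable, state):
    # Exploitation: index of the largest Q-value in this state's row
    # (the first index wins on ties, as with np.argmax).
    row = Qtable[state]
    return max(range(len(row)), key=lambda a: row[a])


def epsilon_greedy_policy(Qtable, state, epsilon):
    # Draw a number in [0, 1); above epsilon we exploit, otherwise we explore.
    random_num = random.uniform(0, 1)
    if random_num > epsilon:
        return greedy_policy(Qtable, state)
    # Assumption: a uniform draw over action indices replaces
    # env.action_space.sample() so the sketch runs without Gym.
    return random.randrange(len(Qtable[state]))


# Hypothetical one-state Q-table with four actions.
toy_qtable = [[0.1, 0.9, 0.3, 0.2]]
best_action = greedy_policy(toy_qtable, 0)
```

Having `epsilon_greedy_policy` call `greedy_policy` for its exploitation branch, as the updated notebook does, keeps a single definition of "best action" shared by training and evaluation.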
@@ -644,62 +685,6 @@
|
||||
" return action"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "Atll4Z774gri"
|
||||
},
|
||||
"source": [
|
||||
"## Define the greedy policy 🤖\n",
|
||||
"Remember we have two policies since Q-Learning is an **off-policy** algorithm. This means we're using a **different policy for acting and updating the value function**.\n",
|
||||
"\n",
|
||||
"- Epsilon greedy policy (acting policy)\n",
|
||||
"- Greedy policy (updating policy)\n",
|
||||
"\n",
|
||||
"Greedy policy will also be the final policy we'll have when the Q-learning agent will be trained. The greedy policy is used to select an action from the Q-table.\n",
|
||||
"\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg\" alt=\"Q-Learning\" width=\"100%\"/>\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "E3SCLmLX5bWG"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def greedy_policy(Qtable, state):\n",
|
||||
" # Exploitation: take the action with the highest state, action value\n",
|
||||
" action = \n",
|
||||
" \n",
|
||||
" return action"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"id": "B2_-8b8z5k54"
|
||||
},
|
||||
"source": [
|
||||
"#### Solution"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "se2OzWGW5kYJ"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def greedy_policy(Qtable, state):\n",
|
||||
" # Exploitation: take the action with the highest state, action value\n",
|
||||
" action = np.argmax(Qtable[state])\n",
|
||||
" \n",
|
||||
" return action"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
@@ -709,8 +694,8 @@
|
||||
"## Define the hyperparameters ⚙️\n",
|
||||
"The exploration related hyperparamters are some of the most important ones. \n",
|
||||
"\n",
|
||||
"- We need to make sure that our agent **explores enough the state space** in order to learn a good value approximation, in order to do that we need to have progressive decay of the epsilon.\n",
|
||||
"- If you decrease too fast epsilon (too high decay_rate), **you take the risk that your agent is stuck**, since your agent didn't explore enough the state space and hence can't solve the problem."
|
||||
"- We need to make sure that our agent **explores enough of the state space** to learn a good value approximation. To do that, we need to have progressive decay of the epsilon.\n",
|
||||
"- If you decrease epsilon too fast (too high decay_rate), **you take the risk that your agent will be stuck**, since your agent didn't explore enough of the state space and hence can't solve the problem."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -746,7 +731,25 @@
|
||||
"id": "cDb7Tdx8atfL"
|
||||
},
|
||||
"source": [
|
||||
"## Step 6: Create the training loop method"
|
||||
"## Create the training loop method\n",
|
||||
"\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg\" alt=\"Q-Learning\" width=\"100%\"/>\n",
|
||||
"\n",
|
||||
"The training loop goes like this:\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"For episode in the total of training episodes:\n",
|
||||
"\n",
|
||||
"Reduce epsilon (since we need less and less exploration)\n",
|
||||
"Reset the environment\n",
|
||||
"\n",
|
||||
" For step in max timesteps: \n",
|
||||
" Choose the action At using epsilon greedy policy\n",
|
||||
" Take the action (a) and observe the outcome state(s') and reward (r)\n",
|
||||
" Update the Q-value Q(s,a) using Bellman equation Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]\n",
|
||||
" If done, finish the episode\n",
|
||||
" Our next state is the new state\n",
|
||||
"```"
|
||||
]
|
||||
},
|
||||
{
|
||||
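The two numerical steps of the training loop above (reducing epsilon and updating Q(s,a)) can be sketched in isolation. The update follows the formula in the pseudocode. The exponential decay schedule is an assumption, since the hunk only says "Reduce epsilon". `decayed_epsilon` and `q_update` are hypothetical helper names, and nested lists stand in for the NumPy Q-table.

```python
import math


def decayed_epsilon(episode, min_epsilon, max_epsilon, decay_rate):
    # Assumed schedule: exponential decay from max_epsilon towards min_epsilon.
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


def q_update(Qtable, state, action, reward, new_state, learning_rate, gamma):
    # Q(s,a) <- Q(s,a) + lr * [R(s,a) + gamma * max_a' Q(s',a') - Q(s,a)]
    td_target = reward + gamma * max(Qtable[new_state])
    Qtable[state][action] += learning_rate * (td_target - Qtable[state][action])
    return Qtable


# Hypothetical two-state, two-action table: state 1 already has a known value.
toy_qtable = [[0.0, 0.0], [0.0, 1.0]]
q_update(toy_qtable, state=0, action=1, reward=0.0, new_state=1,
         learning_rate=0.5, gamma=1.0)
```

In this toy case the TD target is 0 + 1.0 * 1.0 = 1.0, so Q(0,1) moves halfway from 0 to 1. A larger `decay_rate` makes epsilon reach `min_epsilon` sooner, which is the risk of too little exploration that the hyperparameter notes warn about.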
@@ -782,7 +785,7 @@
|
||||
" if done:\n",
|
||||
" break\n",
|
||||
" \n",
|
||||
" # Our state is the new state\n",
|
||||
" # Our next state is the new state\n",
|
||||
" state = new_state\n",
|
||||
" return Qtable"
|
||||
]
|
||||
@@ -829,7 +832,7 @@
|
||||
" if done:\n",
|
||||
" break\n",
|
||||
" \n",
|
||||
" # Our state is the new state\n",
|
||||
" # Our next state is the new state\n",
|
||||
" state = new_state\n",
|
||||
" return Qtable"
|
||||
]
|
||||
@@ -880,7 +883,9 @@
|
||||
"id": "pUrWkxsHccXD"
|
||||
},
|
||||
"source": [
|
||||
"## Define the evaluation method 📝"
|
||||
"## The evaluation method 📝\n",
|
||||
"\n",
|
||||
"- We defined the evaluation method that we're going to use to test our Q-Learning agent."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -911,7 +916,7 @@
|
||||
" \n",
|
||||
" for step in range(max_steps):\n",
|
||||
" # Take the action (index) that have the maximum expected future reward given that state\n",
|
||||
" action = np.argmax(Q[state][:])\n",
|
||||
" action = greedy_policy(Q, state)\n",
|
||||
" new_state, reward, done, info = env.step(action)\n",
|
||||
" total_rewards_ep += reward\n",
|
||||
" \n",
|
||||
@@ -933,8 +938,8 @@
|
||||
"source": [
|
||||
"## Evaluate our Q-Learning agent 📈\n",
|
||||
"\n",
|
||||
"- Normally you should have mean reward of 1.0\n",
|
||||
"- It's relatively easy since the state space is really small (16). What you can try to do is [to replace with the slippery version](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)."
|
||||
"- Usually, you should have a mean reward of 1.0\n",
|
||||
"- The **environment is relatively easy** since the state space is really small (16). What you can try to do is [to replace it with the slippery version](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/), which introduces stochasticity, making the environment more complex."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -956,9 +961,9 @@
|
||||
"id": "yxaP3bPdg1DV"
|
||||
},
|
||||
"source": [
|
||||
"## Publish our trained model on the Hub 🔥\n",
|
||||
"## Publish our trained model to the Hub 🔥\n",
|
||||
"\n",
|
||||
"Now that we saw we got good results after the training, we can publish our trained model on the hub 🤗 with one line of code.\n",
|
||||
"Now that we saw good results after the training, **we can publish our trained model to the Hub 🤗 with one line of code**.\n",
|
||||
"\n",
|
||||
"Here's an example of a Model Card:\n",
|
||||
"\n",
|
||||
@@ -991,8 +996,7 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"%%capture\n",
|
||||
"from huggingface_hub import HfApi, HfFolder, Repository\n",
|
||||
"from huggingface_hub import HfApi, HfFolder, Repository, snapshot_download\n",
|
||||
"from huggingface_hub.repocard import metadata_eval_result, metadata_save\n",
|
||||
"\n",
|
||||
"from pathlib import Path\n",
|
||||
@@ -1009,6 +1013,13 @@
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def record_video(env, Qtable, out_directory, fps=1):\n",
|
||||
" \"\"\"\n",
|
||||
" Generate a replay video of the agent\n",
|
||||
" :param env\n",
|
||||
" :param Qtable: Qtable of our agent\n",
|
||||
" :param out_directory\n",
|
||||
" :param fps: how many frame per seconds (with taxi-v3 and frozenlake-v1 we use 1)\n",
|
||||
" \"\"\"\n",
|
||||
" images = [] \n",
|
||||
" done = False\n",
|
||||
" state = env.reset(seed=random.randint(0,500))\n",
|
||||
@@ -1025,149 +1036,144 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "pwsNrzB339aF"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def push_to_hub(repo_id, \n",
|
||||
" model,\n",
|
||||
" env,\n",
|
||||
" video_fps=1,\n",
|
||||
" local_repo_path=\"hub\",\n",
|
||||
" commit_message=\"Push Q-Learning agent to Hub\",\n",
|
||||
" token= None\n",
|
||||
" ):\n",
|
||||
" _, repo_name = repo_id.split(\"/\")\n",
|
||||
"def push_to_hub(\n",
|
||||
" repo_id, model, env, video_fps=1, local_repo_path=\"hub\"\n",
|
||||
"):\n",
|
||||
" \"\"\"\n",
|
||||
" Evaluate, Generate a video and Upload a model to Hugging Face Hub.\n",
|
||||
" This method does the complete pipeline:\n",
|
||||
" - It evaluates the model\n",
|
||||
" - It generates the model card\n",
|
||||
" - It generates a replay video of the agent\n",
|
||||
" - It pushes everything to the Hub\n",
|
||||
"\n",
|
||||
" eval_env = env\n",
|
||||
" \n",
|
||||
" # Step 1: Clone or create the repo\n",
|
||||
" # Create the repo (or clone its content if it's nonempty)\n",
|
||||
" api = HfApi()\n",
|
||||
" \n",
|
||||
" repo_url = api.create_repo(\n",
|
||||
" :param repo_id: repo_id: id of the model repository from the Hugging Face Hub\n",
|
||||
" :param env\n",
|
||||
" :param video_fps: how many frame per seconds to record our video replay \n",
|
||||
" (with taxi-v3 and frozenlake-v1 we use 1)\n",
|
||||
" :param local_repo_path: where the local repository is\n",
|
||||
" \"\"\"\n",
|
||||
" _, repo_name = repo_id.split(\"/\")\n",
|
||||
"\n",
|
||||
" eval_env = env\n",
|
||||
" api = HfApi()\n",
|
||||
"\n",
|
||||
" # Step 1: Create the repo\n",
|
||||
" repo_url = api.create_repo(\n",
|
||||
" repo_id=repo_id,\n",
|
||||
" token=token,\n",
|
||||
" private=False,\n",
|
||||
" exist_ok=True,)\n",
|
||||
" \n",
|
||||
" # Git pull\n",
|
||||
" repo_local_path = Path(local_repo_path) / repo_name\n",
|
||||
" repo = Repository(repo_local_path, clone_from=repo_url, use_auth_token=True)\n",
|
||||
" repo.git_pull()\n",
|
||||
" \n",
|
||||
" repo.lfs_track([\"*.mp4\"])\n",
|
||||
"\n",
|
||||
" # Step 1: Save the model\n",
|
||||
" if env.spec.kwargs.get(\"map_name\"):\n",
|
||||
" model[\"map_name\"] = env.spec.kwargs.get(\"map_name\")\n",
|
||||
" if env.spec.kwargs.get(\"is_slippery\", \"\") == False:\n",
|
||||
" model[\"slippery\"] = False\n",
|
||||
"\n",
|
||||
" print(model)\n",
|
||||
" \n",
|
||||
" \n",
|
||||
" # Pickle the model\n",
|
||||
" with open(Path(repo_local_path)/'q-learning.pkl', 'wb') as f:\n",
|
||||
" pickle.dump(model, f)\n",
|
||||
" \n",
|
||||
" # Step 2: Evaluate the model and build JSON\n",
|
||||
" mean_reward, std_reward = evaluate_agent(eval_env, model[\"max_steps\"], model[\"n_eval_episodes\"], model[\"qtable\"], model[\"eval_seed\"])\n",
|
||||
"\n",
|
||||
" # First get datetime\n",
|
||||
" eval_datetime = datetime.datetime.now()\n",
|
||||
" eval_form_datetime = eval_datetime.isoformat()\n",
|
||||
"\n",
|
||||
" evaluate_data = {\n",
|
||||
" \"env_id\": model[\"env_id\"], \n",
|
||||
" \"mean_reward\": mean_reward,\n",
|
||||
" \"n_eval_episodes\": model[\"n_eval_episodes\"],\n",
|
||||
" \"eval_datetime\": eval_form_datetime,\n",
|
||||
" }\n",
|
||||
" # Write a JSON file\n",
|
||||
" with open(Path(repo_local_path) / \"results.json\", \"w\") as outfile:\n",
|
||||
" json.dump(evaluate_data, outfile)\n",
|
||||
"\n",
|
||||
" # Step 3: Create the model card\n",
|
||||
" # Env id\n",
|
||||
" env_name = model[\"env_id\"]\n",
|
||||
" if env.spec.kwargs.get(\"map_name\"):\n",
|
||||
" env_name += \"-\" + env.spec.kwargs.get(\"map_name\")\n",
|
||||
"\n",
|
||||
" if env.spec.kwargs.get(\"is_slippery\", \"\") == False:\n",
|
||||
" env_name += \"-\" + \"no_slippery\"\n",
|
||||
"\n",
|
||||
" metadata = {}\n",
|
||||
" metadata[\"tags\"] = [\n",
|
||||
" env_name,\n",
|
||||
" \"q-learning\",\n",
|
||||
" \"reinforcement-learning\",\n",
|
||||
" \"custom-implementation\"\n",
|
||||
" ]\n",
|
||||
"\n",
|
||||
" # Add metrics\n",
|
||||
" eval = metadata_eval_result(\n",
|
||||
" model_pretty_name=repo_name,\n",
|
||||
" task_pretty_name=\"reinforcement-learning\",\n",
|
||||
" task_id=\"reinforcement-learning\",\n",
|
||||
" metrics_pretty_name=\"mean_reward\",\n",
|
||||
" metrics_id=\"mean_reward\",\n",
|
||||
" metrics_value=f\"{mean_reward:.2f} +/- {std_reward:.2f}\",\n",
|
||||
" dataset_pretty_name=env_name,\n",
|
||||
" dataset_id=env_name,\n",
|
||||
" exist_ok=True,\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" # Merges both dictionaries\n",
|
||||
" metadata = {**metadata, **eval}\n",
|
||||
" # Step 2: Download files\n",
|
||||
" repo_local_path = Path(snapshot_download(repo_id=repo_id))\n",
|
||||
"\n",
|
||||
" model_card = f\"\"\"\n",
|
||||
" # **Q-Learning** Agent playing **{env_id}**\n",
|
||||
" # Step 3: Save the model\n",
|
||||
" if env.spec.kwargs.get(\"map_name\"):\n",
|
||||
" model[\"map_name\"] = env.spec.kwargs.get(\"map_name\")\n",
|
||||
" if env.spec.kwargs.get(\"is_slippery\", \"\") == False:\n",
|
||||
" model[\"slippery\"] = False\n",
|
||||
"\n",
|
||||
" print(model)\n",
|
||||
"\n",
|
||||
" # Pickle the model\n",
|
||||
" with open((repo_local_path) / \"q-learning.pkl\", \"wb\") as f:\n",
|
||||
" pickle.dump(model, f)\n",
|
||||
"\n",
|
||||
" # Step 4: Evaluate the model and build JSON with evaluation metrics\n",
|
||||
" mean_reward, std_reward = evaluate_agent(\n",
|
||||
" eval_env, model[\"max_steps\"], model[\"n_eval_episodes\"], model[\"qtable\"], model[\"eval_seed\"]\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" evaluate_data = {\n",
|
||||
" \"env_id\": model[\"env_id\"],\n",
|
||||
" \"mean_reward\": mean_reward,\n",
|
||||
" \"n_eval_episodes\": model[\"n_eval_episodes\"],\n",
|
||||
" \"eval_datetime\": datetime.datetime.now().isoformat()\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" # Write a JSON file\n",
|
||||
" with open(repo_local_path / \"results.json\", \"w\") as outfile:\n",
|
||||
" json.dump(evaluate_data, outfile)\n",
|
||||
"\n",
|
||||
" # Step 5: Create the model card\n",
|
||||
" env_name = model[\"env_id\"]\n",
|
||||
" if env.spec.kwargs.get(\"map_name\"):\n",
|
||||
" env_name += \"-\" + env.spec.kwargs.get(\"map_name\")\n",
|
||||
"\n",
|
||||
" if env.spec.kwargs.get(\"is_slippery\", \"\") == False:\n",
|
||||
" env_name += \"-\" + \"no_slippery\"\n",
|
||||
"\n",
|
||||
" metadata = {}\n",
|
||||
" metadata[\"tags\"] = [env_name, \"q-learning\", \"reinforcement-learning\", \"custom-implementation\"]\n",
|
||||
"\n",
|
||||
" # Add metrics\n",
|
||||
" eval = metadata_eval_result(\n",
|
||||
" model_pretty_name=repo_name,\n",
|
||||
" task_pretty_name=\"reinforcement-learning\",\n",
|
||||
" task_id=\"reinforcement-learning\",\n",
|
||||
" metrics_pretty_name=\"mean_reward\",\n",
|
||||
" metrics_id=\"mean_reward\",\n",
|
||||
" metrics_value=f\"{mean_reward:.2f} +/- {std_reward:.2f}\",\n",
|
||||
" dataset_pretty_name=env_name,\n",
|
||||
" dataset_id=env_name,\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" # Merges both dictionaries\n",
|
||||
" metadata = {**metadata, **eval}\n",
|
||||
"\n",
|
||||
" model_card = f\"\"\"\n",
|
||||
" # **Q-Learning** Agent playing1 **{env_id}**\n",
|
||||
" This is a trained model of a **Q-Learning** agent playing **{env_id}** .\n",
|
||||
" \"\"\"\n",
|
||||
"\n",
|
||||
" model_card += \"\"\"\n",
|
||||
" ## Usage\n",
|
||||
" ```python\n",
|
||||
" \"\"\"\n",
|
||||
"\n",
|
||||
" model_card += f\"\"\"model = load_from_hub(repo_id=\"{repo_id}\", filename=\"q-learning.pkl\")\n",
|
||||
" ```python\n",
|
||||
" \n",
|
||||
" model = load_from_hub(repo_id=\"{repo_id}\", filename=\"q-learning.pkl\")\n",
|
||||
"\n",
|
||||
" # Don't forget to check if you need to add additional attributes (is_slippery=False etc)\n",
|
||||
" env = gym.make(model[\"env_id\"])\n",
|
||||
"\n",
|
||||
" evaluate_agent(env, model[\"max_steps\"], model[\"n_eval_episodes\"], model[\"qtable\"], model[\"eval_seed\"])\n",
|
||||
" \"\"\"\n",
|
||||
"\n",
|
||||
" model_card +=\"\"\"\n",
|
||||
" ```\n",
|
||||
" \"\"\"\n",
|
||||
"\n",
|
||||
" readme_path = repo_local_path / \"README.md\"\n",
|
||||
" readme = \"\"\n",
|
||||
" if readme_path.exists():\n",
|
||||
" with readme_path.open(\"r\", encoding=\"utf8\") as f:\n",
|
||||
" readme = f.read()\n",
|
||||
" else:\n",
|
||||
" readme = model_card\n",
|
||||
"\n",
|
||||
" with readme_path.open(\"w\", encoding=\"utf-8\") as f:\n",
|
||||
" f.write(readme)\n",
|
||||
"\n",
|
||||
" # Save our metrics to Readme metadata\n",
|
||||
" metadata_save(readme_path, metadata)\n",
|
||||
"\n",
|
||||
" # Step 4: Record a video\n",
|
||||
" video_path = repo_local_path / \"replay.mp4\"\n",
|
||||
" record_video(env, model[\"qtable\"], video_path, video_fps)\n",
|
||||
" evaluate_agent(env, model[\"max_steps\"], model[\"n_eval_episodes\"], model[\"qtable\"], model[\"eval_seed\"])\n",
|
||||
" \n",
|
||||
" # Push everything to hub\n",
|
||||
" print(f\"Pushing repo {repo_name} to the Hugging Face Hub\")\n",
|
||||
" repo.push_to_hub(commit_message=commit_message)\n",
|
||||
"\n",
|
||||
" print(f\"Your model is pushed to the hub. You can view your model here: {repo_url}\")"
|
||||
]
|
||||
" readme_path = repo_local_path / \"README.md\"\n",
|
||||
" readme = \"\"\n",
|
||||
" print(readme_path.exists())\n",
|
||||
" if readme_path.exists():\n",
|
||||
" with readme_path.open(\"r\", encoding=\"utf8\") as f:\n",
|
||||
" readme = f.read()\n",
|
||||
" else:\n",
|
||||
" readme = model_card\n",
|
||||
" print(readme)\n",
|
||||
"\n",
|
||||
" with readme_path.open(\"w\", encoding=\"utf-8\") as f:\n",
|
||||
" f.write(readme)\n",
|
||||
"\n",
|
||||
" # Save our metrics to Readme metadata\n",
|
||||
" metadata_save(readme_path, metadata)\n",
|
||||
"\n",
|
||||
" # Step 6: Record a video\n",
|
||||
" video_path = repo_local_path / \"replay.mp4\"\n",
|
||||
" record_video(env, model[\"qtable\"], video_path, video_fps)\n",
|
||||
"\n",
|
||||
" # Step 7. Push everything to the Hub\n",
|
||||
" api.upload_folder(\n",
|
||||
" repo_id=repo_id,\n",
|
||||
" folder_path=repo_local_path,\n",
|
||||
" path_in_repo=\".\",\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" print(\"Your model is pushed to the Hub. You can view your model here: \", repo_url)"
|
||||
],
|
||||
"metadata": {
|
||||
"id": "U4mdUTKkGnUd"
|
||||
},
|
||||
"execution_count": null,
|
||||
"outputs": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
@@ -1177,7 +1183,7 @@
|
||||
"source": [
|
||||
"### .\n",
|
||||
"\n",
|
||||
"By using `package_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the hub**.\n",
|
||||
"By using `push_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the Hub**.\n",
|
||||
"\n",
|
||||
"This way:\n",
|
||||
"- You can **showcase our work** 🔥\n",
|
||||
@@ -1221,7 +1227,7 @@
|
||||
"id": "GyWc1x3-o3xG"
|
||||
},
|
||||
"source": [
|
||||
"If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`"
|
||||
"If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login` (or `login`)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -1230,7 +1236,7 @@
|
||||
"id": "Gc5AfUeFo3xH"
|
||||
},
|
||||
"source": [
|
||||
"3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using `package_to_hub()` function\n",
|
||||
"3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using `push_to_hub()` function\n",
|
||||
"\n",
|
||||
"- Let's create **the model dictionary that contains the hyperparameters and the Q_table**."
|
||||
]
|
||||
@@ -1267,7 +1273,7 @@
|
||||
"id": "9kld-AEso3xH"
|
||||
},
|
||||
"source": [
|
||||
"Let's fill the `package_to_hub` function:\n",
|
||||
"Let's fill the `push_to_hub` function:\n",
|
||||
"\n",
|
||||
"- `repo_id`: the name of the Hugging Face Hub Repository that will be created/updated `\n",
|
||||
"(repo_id = {username}/{repo_name})`\n",
|
||||
@@ -1470,17 +1476,6 @@
|
||||
"## Train our Q-Learning agent 🏃"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"id": "MLNwkNDb14h2"
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"Qtable_taxi = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_taxi)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
@@ -1489,6 +1484,7 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"Qtable_taxi = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_taxi)\n",
|
||||
"Qtable_taxi"
|
||||
]
|
||||
},
|
||||
@@ -1498,7 +1494,7 @@
|
||||
"id": "wPdu0SueLVl2"
|
||||
},
|
||||
"source": [
|
||||
"## Create a model dictionary 💾 and publish our trained model on the Hub 🔥\n",
|
||||
"## Create a model dictionary 💾 and publish our trained model to the Hub 🔥\n",
|
||||
"- We create a model dictionary that will contain all the training hyperparameters for reproducibility and the Q-Table.\n"
|
||||
]
|
||||
},
|
||||
@@ -1537,7 +1533,7 @@
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"username = \"\" # FILL THIS\n",
|
||||
"repo_name = \"q-Taxi-v3\"\n",
|
||||
"repo_name = \"\"\n",
|
||||
"push_to_hub(\n",
|
||||
" repo_id=f\"{username}/{repo_name}\",\n",
|
||||
" model=model,\n",
|
||||
@@ -1552,6 +1548,8 @@
|
||||
"source": [
|
||||
"Now that's on the Hub, you can compare the results of your Taxi-v3 with your classmates using the leaderboard 🏆 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard\n",
|
||||
"\n",
|
||||
"⚠ To see your entry, you need to go to the bottom of the leaderboard page and **click on refresh** ⚠\n",
|
||||
"\n",
|
||||
"<img src=\"https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/taxi-leaderboard.png\" alt=\"Taxi Leaderboard\">"
|
||||
]
|
||||
},
|
||||
@@ -1612,14 +1610,6 @@
|
||||
" :param repo_id: id of the model repository from the Hugging Face Hub\n",
|
||||
" :param filename: name of the model zip file from the repository\n",
|
||||
" \"\"\"\n",
|
||||
" try:\n",
|
||||
" from huggingface_hub import cached_download, hf_hub_url\n",
|
||||
" except ImportError:\n",
|
||||
" raise ImportError(\n",
|
||||
" \"You need to install huggingface_hub to use `load_from_hub`. \"\n",
|
||||
" \"See https://pypi.org/project/huggingface-hub/ for installation.\"\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" # Get the model from the Hub, download and cache the model on your local disk\n",
|
||||
" pickle_model = hf_hub_download(\n",
|
||||
" repo_id=repo_id,\n",
|
||||
@@ -1731,14 +1721,13 @@
|
||||
"metadata": {
|
||||
"accelerator": "GPU",
|
||||
"colab": {
|
||||
"collapsed_sections": [
|
||||
"4i6tjI2tHQ8j",
|
||||
"Y-mo_6rXIjRi",
|
||||
"EtrfoTaBoNrd",
|
||||
"BjLhT70TEZIn"
|
||||
],
|
||||
"private_outputs": true,
|
||||
"provenance": []
|
||||
"provenance": [],
|
||||
"collapsed_sections": [
|
||||
"Ji_UrI5l2zzn",
|
||||
"67OdoKL63eDD",
|
||||
"B2_-8b8z5k54"
|
||||
]
|
||||
},
|
||||
"gpuClass": "standard",
|
||||
"kernelspec": {
|
||||
|
||||
1096
notebooks/unit2/unit2.mdx
Normal file
File diff suppressed because it is too large
@@ -49,6 +49,7 @@ This is equivalent to \\(V(S_{t})\\) = Immediate reward \\(R_{t+1}\\) + Disc
|
||||
</figure>
|
||||
|
||||
In the interest of simplicity, here we don't discount, so gamma = 1.
|
||||
But you'll study an example with gamma = 0.99 in the Q-Learning section of this unit.
|
||||
|
||||
- The value of \\(V(S_{t+1}) \\) = Immediate reward \\(R_{t+2}\\) + Discounted value of the next state ( \\(gamma * V(S_{t+2})\\) ).
|
||||
- And so on.
|
||||
|
||||
@@ -21,7 +21,6 @@ Thanks to a [leaderboard](https://huggingface.co/spaces/huggingface-projects/Dee
|
||||
|
||||
[](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit2/unit2.ipynb)
|
||||
|
||||
|
||||
# Unit 2: Q-Learning with FrozenLake-v1 ⛄ and Taxi-v3 🚕
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/thumbnail.jpg" alt="Unit 2 Thumbnail">
|
||||
@@ -41,9 +40,10 @@ In this notebook, **you'll code from scratch your first Reinforcement Learning a
|
||||
|
||||
### 📚 RL-Library:
|
||||
|
||||
- Python and Numpy
|
||||
- Python and NumPy
|
||||
- [Gym](https://www.gymlibrary.dev/)
|
||||
|
||||
We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the Github Repo](https://github.com/huggingface/deep-rl-class/issues).
|
||||
We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the GitHub Repo](https://github.com/huggingface/deep-rl-class/issues).
|
||||
|
||||
## Objectives of this notebook 🏆
|
||||
|
||||
@@ -55,29 +55,54 @@ At the end of the notebook, you will:
|
||||
|
||||
|
||||
## Prerequisites 🏗️
|
||||
|
||||
Before diving into the notebook, you need to:
|
||||
|
||||
🔲 📚 **Study [Q-Learning by reading Unit 2](https://huggingface.co/deep-rl-course/unit2/introduction)** 🤗
|
||||
|
||||
## A small recap of Q-Learning
|
||||
|
||||
- *Q-Learning* **is the RL algorithm that**
|
||||
|
||||
- Trains a *Q-Function*, an **action-value function** that contains, as internal memory, a *Q-table* **that contains all the state-action pair values.**
|
||||
|
||||
- Given a state and action, our Q-Function **will search its Q-table for the corresponding value.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-function-2.jpg" alt="Q function" width="100%"/>
|
||||
|
||||
- When the training is done, **we have an optimal Q-Function, so an optimal Q-Table.**
|
||||
|
||||
- And if we **have an optimal Q-function**, we have an optimal policy, since we **know for each state what the best action to take is.**
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/link-value-policy.jpg" alt="Link value policy" width="100%"/>
|
||||
|
||||
|
||||
But, in the beginning, our **Q-Table is useless since it gives arbitrary values for each state-action pair (most of the time we initialize the Q-Table to 0 values)**. But, as we explore the environment and update our Q-Table, it will give us better and better approximations.
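As a minimal sketch of that starting point, assuming the Q-table is stored as a NumPy array with one row per state and one column per action (the sizes and the names `n_states` and `n_actions` are illustrative, not taken from the notebook):

```python
import numpy as np

n_states, n_actions = 16, 4  # illustrative sizes, chosen for this sketch

# Every state-action value starts at 0, so the table carries no information yet
qtable = np.zeros((n_states, n_actions))

# Looking up Q(s, a) is a plain array index: row = state, column = action
state, action = 5, 2
value = qtable[state][action]
print(value)
```

Training then consists of overwriting these zeros with progressively better estimates.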
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/q-learning.jpeg" alt="q-learning.jpeg" width="100%"/>
|
||||
|
||||
This is the Q-Learning pseudocode:
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="100%"/>
|
||||
|
||||
|
||||
# Let's code our first Reinforcement Learning algorithm 🚀
|
||||
|
||||
## Install dependencies and create a virtual display 🔽
|
||||
|
||||
During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames).
|
||||
In the notebook, we'll need to generate a replay video. To do so, with Colab, **we need to have a virtual screen to render the environment** (and thus record the frames).
|
||||
|
||||
Hence the following cell will install the librairies and create and run a virtual screen 🖥
|
||||
Hence the following cell will install the libraries and create and run a virtual screen 🖥
|
||||
|
||||
We’ll install multiple ones:
|
||||
|
||||
- `gym`: Contains the FrozenLake-v1 ⛄ and Taxi-v3 🚕 environments. We use `gym==0.24` since it contains a nice Taxi-v3 UI version.
|
||||
- `pygame`: Used for the FrozenLake-v1 and Taxi-v3 UI.
|
||||
- `numPy`: Used for handling our Q-table.
|
||||
- `numpy`: Used for handling our Q-table.
|
||||
|
||||
The Hugging Face Hub 🤗 works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.
|
||||
|
||||
|
||||
You can see here all the Deep reinforcement Learning models available 👉 https://huggingface.co/models?other=q-learning
|
||||
|
||||
You can see here all the Deep RL models available (if they use Q Learning) 👉 https://huggingface.co/models?other=q-learning
|
||||
|
||||
```bash
|
||||
pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit2/requirements-unit2.txt
|
||||
@@ -85,9 +110,7 @@ pip install -r https://github.com/huggingface/deep-rl-class/tree/main/notebooks/
|
||||
|
||||
```bash
|
||||
sudo apt-get update
|
||||
apt install python-opengl
|
||||
apt install ffmpeg
|
||||
apt install xvfb
|
||||
apt install python-opengl ffmpeg xvfb
|
||||
pip3 install pyvirtualdisplay
|
||||
```
|
||||
|
||||
@@ -111,13 +134,8 @@ virtual_display.start()
|
||||
|
||||
In addition to the installed libraries, we also use:
|
||||
|
||||
- `random`: To generate random numbers (that will be useful for Epsilon-Greedy Policy).
|
||||
- `imageio`: To generate a replay video
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
- `random`: To generate random numbers (that will be useful for epsilon-greedy policy).
|
||||
- `imageio`: To generate a replay video.
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
@@ -153,8 +171,8 @@ We can have two sizes of environment:
|
||||
|
||||
The environment has two modes:
|
||||
|
||||
- `is_slippery=False`: The agent always move in the intended direction due to the non-slippery nature of the frozen lake.
|
||||
- `is_slippery=True`: The agent may not always move in the intended direction due to the slippery nature of the frozen lake (stochastic).
|
||||
- `is_slippery=False`: The agent always moves **in the intended direction** due to the non-slippery nature of the frozen lake (deterministic).
|
||||
- `is_slippery=True`: The agent **may not always move in the intended direction** due to the slippery nature of the frozen lake (stochastic).
|
||||
|
||||
For now let's keep it simple with the 4x4 map and non-slippery
|
||||
|
||||
@@ -182,14 +200,13 @@ but we'll use the default environment for now.
|
||||
|
||||
|
||||
```python
|
||||
# We create our environment with gym.make("<name_of_the_environment>")
|
||||
env.reset()
|
||||
# We create our environment with gym.make("<name_of_the_environment>")
|
||||
print("_____OBSERVATION SPACE_____ \n")
|
||||
print("Observation Space", env.observation_space)
|
||||
print("Sample observation", env.observation_space.sample()) # Get a random observation
|
||||
```
|
||||
|
||||
We see with `Observation Space Shape Discrete(16)` that the observation is a value representing the **agent’s current position as current_row * nrows + current_col (where both the row and col start at 0)**.
|
||||
We see with `Observation Space Shape Discrete(16)` that the observation is an integer representing the **agent’s current position as current_row * nrows + current_col (where both the row and col start at 0)**.
|
||||
|
||||
For example, the goal position in the 4x4 map can be calculated as follows: 3 * 4 + 3 = 15. The number of possible observations is dependent on the size of the map. **For example, the 4x4 map has 16 possible observations.**
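The mapping between grid positions and integer observations can be sketched as follows. This is an illustration only: the helper names `position_to_state` and `state_to_position` are hypothetical, and the sketch assumes a square map, where the number of rows equals the number of columns.

```python
def position_to_state(row, col, ncols=4):
    """Flatten a (row, col) grid position into a single integer state."""
    return row * ncols + col


def state_to_position(state, ncols=4):
    """Recover the (row, col) grid position from an integer state."""
    return divmod(state, ncols)


goal_state = position_to_state(3, 3)  # bottom-right cell of the 4x4 map
print(goal_state)                     # 15, matching 3 * 4 + 3
print(state_to_position(goal_state))  # (3, 3)
```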
|
||||
|
||||
@@ -215,14 +232,13 @@ Reward function 💰:
|
||||
- Reach hole: 0
|
||||
- Reach frozen: 0
|
||||
|
||||
|
||||
## Create and Initialize the Q-table 🗄️
|
||||
(👀 Step 1 of the pseudocode)
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="100%"/>
|
||||
|
||||
|
||||
It's time to initialize our Q-table! To know how many rows (states) and columns (actions) to use, we need to know the action and observation space. OpenAI Gym provides us a way to do that: `env.action_space.n` and `env.observation_space.n`
|
||||
It's time to initialize our Q-table! To know how many rows (states) and columns (actions) to use, we need to know the action and observation space. We already know their values from before, but we'll want to obtain them programmatically so that our algorithm generalizes for different environments. Gym provides us a way to do that: `env.action_space.n` and `env.observation_space.n`
|
||||
|
||||
|
||||
```python
|
||||
@@ -244,7 +260,6 @@ def initialize_q_table(state_space, action_space):
|
||||
Qtable_frozenlake = initialize_q_table(state_space, action_space)
|
||||
```
|
||||
|
||||
|
||||
### Solution
|
||||
|
||||
```python
|
||||
@@ -266,13 +281,42 @@ def initialize_q_table(state_space, action_space):
|
||||
Qtable_frozenlake = initialize_q_table(state_space, action_space)
|
||||
```
|
||||
|
||||
## Define the epsilon-greedy policy 🤖
|
||||
## Define the greedy policy 🤖
|
||||
Remember we have two policies since Q-Learning is an **off-policy** algorithm. This means we're using a **different policy for acting and updating the value function**.
|
||||
|
||||
Epsilon-Greedy is the training policy that handles the exploration/exploitation trade-off.
|
||||
- Epsilon-greedy policy (acting policy)
|
||||
- Greedy policy (updating policy)
|
||||
|
||||
The idea with Epsilon Greedy:
|
||||
The greedy policy will also be the final policy we'll have once the Q-learning agent is trained. The greedy policy is used to select an action from the Q-table.
|
||||
|
||||
- With *probability 1 - ɛ* : **we do exploitation** (aka our agent selects the action with the highest state-action pair value).
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg" alt="Q-Learning" width="100%"/>
|
||||
|
||||
|
||||
```python
|
||||
def greedy_policy(Qtable, state):
|
||||
# Exploitation: take the action with the highest state, action value
|
||||
action =
|
||||
|
||||
return action
|
||||
```
|
||||
|
||||
#### Solution
|
||||
|
||||
```python
|
||||
def greedy_policy(Qtable, state):
|
||||
# Exploitation: take the action with the highest state, action value
|
||||
action = np.argmax(Qtable[state][:])
|
||||
|
||||
return action
|
||||
```
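To see what the greedy policy returns, here is a small usage sketch on a toy Q-table. The table values are made up for illustration; the function body repeats the solution above so the snippet runs on its own.

```python
import numpy as np


def greedy_policy(Qtable, state):
    # Exploitation: take the action with the highest state, action value
    return np.argmax(Qtable[state][:])


# Toy table: 2 states, 3 actions (values invented for this example)
toy_qtable = np.array([
    [0.1, 0.5, 0.2],  # state 0: action 1 has the highest value
    [0.0, 0.0, 0.0],  # state 1: all values tie
])

print(greedy_policy(toy_qtable, 0))  # 1
print(greedy_policy(toy_qtable, 1))  # 0, np.argmax returns the first index on ties
```

The tie-breaking behavior matters early in training: with an all-zero Q-table, the greedy policy always picks action 0, which is one reason exploration is needed.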
|
||||
|
||||
## Define the epsilon-greedy policy 🤖
|
||||
|
||||
Epsilon-greedy is the training policy that handles the exploration/exploitation trade-off.
|
||||
|
||||
The idea with epsilon-greedy:
|
||||
|
||||
- With *probability 1 - ɛ* : **we do exploitation** (i.e. our agent selects the action with the highest state-action pair value).
|
||||
|
||||
- With *probability ɛ*: we do **exploration** (trying random action).
|
||||
|
||||
@@ -281,8 +325,6 @@ And as the training goes, we progressively **reduce the epsilon value since we w
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-4.jpg" alt="Q-Learning" width="100%"/>
|
||||
|
||||
|
||||
Thanks to Sambit for finding a bug on the epsilon function 🤗
|
||||
|
||||
```python
|
||||
def epsilon_greedy_policy(Qtable, state, epsilon):
|
||||
# Randomly generate a number between 0 and 1
|
||||
@@ -309,7 +351,7 @@ def epsilon_greedy_policy(Qtable, state, epsilon):
|
||||
if random_int > epsilon:
|
||||
# Take the action with the highest value given a state
|
||||
# np.argmax can be useful here
|
||||
action = np.argmax(Qtable[state])
|
||||
action = greedy_policy(Qtable, state)
|
||||
# else --> exploration
|
||||
else:
|
||||
action = env.action_space.sample()
|
||||
@@ -317,41 +359,11 @@ def epsilon_greedy_policy(Qtable, state, epsilon):
|
||||
return action
|
||||
```
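The effect of epsilon can be checked on a toy table. The sketch below is a self-contained variant, not the notebook's function: it assumes a seeded `random.Random` generator passed in as `rng` and samples the random action from the table's column count instead of calling `env.action_space.sample()`.

```python
import random

import numpy as np


def epsilon_greedy_sketch(Qtable, state, epsilon, rng):
    # Draw a number in [0, 1); above epsilon we exploit, otherwise we explore
    if rng.uniform(0, 1) > epsilon:
        return int(np.argmax(Qtable[state]))
    # Exploration: pick any action index uniformly at random
    return rng.randrange(Qtable.shape[1])


rng = random.Random(0)  # seeded only to make the sketch repeatable
toy_qtable = np.array([[0.0, 1.0, 0.0, 0.0]])

# epsilon = 0.0: pure exploitation, always the best-valued action
exploit_picks = [epsilon_greedy_sketch(toy_qtable, 0, 0.0, rng) for _ in range(100)]
# epsilon = 1.0: pure exploration, actions are sampled at random
explore_picks = [epsilon_greedy_sketch(toy_qtable, 0, 1.0, rng) for _ in range(200)]

print(set(exploit_picks))
print(sorted(set(explore_picks)))
```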
|
||||
|
||||
## Define the greedy policy 🤖
|
||||
|
||||
Remember we have two policies since Q-Learning is an **off-policy** algorithm. This means we're using a **different policy for acting and updating the value function**.
|
||||
|
||||
- Epsilon greedy policy (acting policy)
|
||||
- Greedy policy (updating policy)
|
||||
|
||||
Greedy policy will also be the final policy we'll have when the Q-learning agent will be trained. The greedy policy is used to select an action from the Q-table.
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/off-on-4.jpg" alt="Q-Learning" width="100%"/>
|
||||
|
||||
|
||||
```python
|
||||
def greedy_policy(Qtable, state):
|
||||
# Exploitation: take the action with the highest state, action value
|
||||
action =
|
||||
|
||||
return action
|
||||
```
|
||||
|
||||
#### Solution
|
||||
|
||||
```python
|
||||
def greedy_policy(Qtable, state):
|
||||
# Exploitation: take the action with the highest state, action value
|
||||
action = np.argmax(Qtable[state])
|
||||
|
||||
return action
|
||||
```
|
||||
|
||||
## Define the hyperparameters ⚙️
|
||||
The exploration-related hyperparameters are some of the most important ones.
|
||||
|
||||
- We need to make sure that our agent **explores enough the state space** in order to learn a good value approximation, in order to do that we need to have progressive decay of the epsilon.
|
||||
- If you decrease too fast epsilon (too high decay_rate), **you take the risk that your agent is stuck**, since your agent didn't explore enough the state space and hence can't solve the problem.
|
||||
- We need to make sure that our agent **explores enough of the state space** to learn a good value approximation. To do that, we need to have progressive decay of the epsilon.
|
||||
- If you decrease epsilon too fast (too high decay_rate), **you take the risk that your agent will be stuck**, since your agent didn't explore enough of the state space and hence can't solve the problem.
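One common way to get a progressive decay is an exponential schedule. The sketch below assumes that form and a `max_epsilon` of 1.0; `min_epsilon` and `decay_rate` match the values shown in this notebook, and the helper name `epsilon_at` is hypothetical.

```python
import math

max_epsilon = 1.0    # assumed starting exploration probability
min_epsilon = 0.05   # minimum exploration probability
decay_rate = 0.0005  # exponential decay rate for exploration prob


def epsilon_at(episode, decay=decay_rate):
    """Exploration probability after `episode` episodes (exponential decay)."""
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay * episode)


for episode in (0, 1000, 5000, 10000):
    # Compare the chosen rate with a ten times faster decay
    print(episode, round(epsilon_at(episode), 3), round(epsilon_at(episode, decay=0.005), 3))
```

With the faster rate, epsilon is already close to its minimum after about a thousand episodes, so the agent stops exploring much earlier.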
|
||||
|
||||
```python
|
||||
# Training parameters
|
||||
@@ -373,8 +385,25 @@ min_epsilon = 0.05 # Minimum exploration probability
|
||||
decay_rate = 0.0005 # Exponential decay rate for exploration prob
|
||||
```
|
||||
|
||||
## Step 6: Create the training loop method
|
||||
## Create the training loop method
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-2.jpg" alt="Q-Learning" width="100%"/>
|
||||
|
||||
The training loop goes like this:
|
||||
|
||||
```
|
||||
For episode in the total of training episodes:
|
||||
|
||||
Reduce epsilon (since we need less and less exploration)
|
||||
Reset the environment
|
||||
|
||||
For step in max timesteps:
|
||||
Choose the action At using epsilon greedy policy
|
||||
Take the action (a) and observe the outcome state(s') and reward (r)
|
||||
Update the Q-value Q(s,a) using Bellman equation Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
|
||||
If done, finish the episode
|
||||
Our next state is the new state
|
||||
```
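The update line in the pseudocode above can be sketched as a standalone function. The helper name `q_learning_update` and the value gamma = 0.95 are assumptions made for this example; the toy table values are invented.

```python
import numpy as np


def q_learning_update(Qtable, state, action, reward, new_state, learning_rate, gamma):
    # TD target: immediate reward plus discounted best value of the next state
    td_target = reward + gamma * np.max(Qtable[new_state])
    # Move Q(s, a) a fraction (learning_rate) of the way toward the target
    Qtable[state][action] += learning_rate * (td_target - Qtable[state][action])
    return Qtable


# Toy table: 3 states, 2 actions; state 1 already has a known good action
Qtable = np.zeros((3, 2))
Qtable[1] = [0.0, 1.0]

q_learning_update(Qtable, state=0, action=1, reward=0.0, new_state=1,
                  learning_rate=0.7, gamma=0.95)

# Q(0, 1) = 0 + 0.7 * (0 + 0.95 * 1 - 0) = 0.665
print(Qtable[0][1])
```

Note that the target uses the maximum over next-state actions (the greedy policy) even though the action actually taken came from the epsilon-greedy policy, which is what makes Q-Learning off-policy.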
|
||||
|
||||
```python
|
||||
def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
|
||||
@@ -402,7 +431,7 @@ def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_st
|
||||
if done:
|
||||
break
|
||||
|
||||
# Our state is the new state
|
||||
# Our next state is the new state
|
||||
state = new_state
|
||||
return Qtable
|
||||
```
|
||||
@@ -437,7 +466,7 @@ def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_st
|
||||
if done:
|
||||
break
|
||||
|
||||
# Our state is the new state
|
||||
# Our next state is the new state
|
||||
state = new_state
|
||||
return Qtable
|
||||
```
|
||||
@@ -454,7 +483,9 @@ Qtable_frozenlake = train(n_training_episodes, min_epsilon, max_epsilon, decay_r
|
||||
Qtable_frozenlake
|
||||
```
|
||||
|
||||
## Define the evaluation method 📝
|
||||
## The evaluation method 📝
|
||||
|
||||
- We defined the evaluation method that we're going to use to test our Q-Learning agent.
|
||||
|
||||
```python
|
||||
def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
|
||||
@@ -477,7 +508,7 @@ def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
|
||||
|
||||
for step in range(max_steps):
|
||||
# Take the action (index) that has the maximum expected future reward given that state
|
||||
action = np.argmax(Q[state][:])
|
||||
action = greedy_policy(Q, state)
|
||||
new_state, reward, done, info = env.step(action)
|
||||
total_rewards_ep += reward
|
||||
|
||||
@@ -493,8 +524,8 @@ def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
|
||||
|
||||
## Evaluate our Q-Learning agent 📈
|
||||
|
||||
- Normally you should have mean reward of 1.0
|
||||
- It's relatively easy since the state space is really small (16). What you can try to do is [to replace with the slippery version](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/).
|
||||
- Usually, you should have a mean reward of 1.0
|
||||
- The **environment is relatively easy** since the state space is really small (16). What you can try to do is [to replace it with the slippery version](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/), which introduces stochasticity, making the environment more complex.
|
||||
|
||||
```python
|
||||
# Evaluate our Agent
|
||||
@@ -502,10 +533,9 @@ mean_reward, std_reward = evaluate_agent(env, max_steps, n_eval_episodes, Qtable
|
||||
print(f"Mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
|
||||
```
|
||||
|
||||
## Publish our trained model to the Hub 🔥
|
||||
|
||||
## Publish our trained model on the Hub 🔥
|
||||
|
||||
Now that we saw we got good results after the training, we can publish our trained model on the hub 🤗 with one line of code.
|
||||
Now that we saw good results after the training, **we can publish our trained model to the Hub 🤗 with one line of code**.
|
||||
|
||||
Here's an example of a Model Card:
|
||||
|
||||
@@ -517,8 +547,7 @@ Under the hood, the Hub uses git-based repositories (don't worry if you don't kn
|
||||
#### Do not modify this code
|
||||
|
||||
```python
|
||||
%%capture
|
||||
from huggingface_hub import HfApi, HfFolder, Repository
|
||||
from huggingface_hub import HfApi, HfFolder, Repository, snapshot_download
|
||||
from huggingface_hub.repocard import metadata_eval_result, metadata_save
|
||||
|
||||
from pathlib import Path
|
||||
@@ -528,6 +557,13 @@ import json
|
||||
|
||||
```python
|
||||
def record_video(env, Qtable, out_directory, fps=1):
|
||||
"""
|
||||
Generate a replay video of the agent
|
||||
:param env
|
||||
:param Qtable: Qtable of our agent
|
||||
:param out_directory
|
||||
:param fps: how many frames per second (with taxi-v3 and frozenlake-v1 we use 1)
|
||||
"""
|
||||
images = []
|
||||
done = False
|
||||
state = env.reset(seed=random.randint(0, 500))
|
||||
@@ -543,32 +579,36 @@ def record_video(env, Qtable, out_directory, fps=1):
|
||||
```
|
||||
|
||||
```python
|
||||
def push_to_hub(
|
||||
repo_id, model, env, video_fps=1, local_repo_path="hub", commit_message="Push Q-Learning agent to Hub", token=None
|
||||
):
|
||||
def push_to_hub(repo_id, model, env, video_fps=1, local_repo_path="hub"):
|
||||
"""
|
||||
Evaluate, Generate a video and Upload a model to Hugging Face Hub.
|
||||
This method does the complete pipeline:
|
||||
- It evaluates the model
|
||||
- It generates the model card
|
||||
- It generates a replay video of the agent
|
||||
- It pushes everything to the Hub
|
||||
|
||||
:param repo_id: id of the model repository from the Hugging Face Hub
|
||||
:param env
|
||||
:param video_fps: how many frames per second to record our video replay
|
||||
(with taxi-v3 and frozenlake-v1 we use 1)
|
||||
:param local_repo_path: where the local repository is
|
||||
"""
|
||||
_, repo_name = repo_id.split("/")
|
||||
|
||||
eval_env = env
|
||||
|
||||
# Step 1: Clone or create the repo
|
||||
# Create the repo (or clone its content if it's nonempty)
|
||||
api = HfApi()
|
||||
|
||||
# Step 1: Create the repo
|
||||
repo_url = api.create_repo(
|
||||
repo_id=repo_id,
|
||||
token=token,
|
||||
private=False,
|
||||
exist_ok=True,
|
||||
)
|
||||
|
||||
# Git pull
|
||||
repo_local_path = Path(local_repo_path) / repo_name
|
||||
repo = Repository(repo_local_path, clone_from=repo_url, use_auth_token=True)
|
||||
repo.git_pull()
|
||||
# Step 2: Download files
|
||||
repo_local_path = Path(snapshot_download(repo_id=repo_id))
|
||||
|
||||
repo.lfs_track(["*.mp4"])
|
||||
|
||||
# Step 1: Save the model
|
||||
# Step 3: Save the model
|
||||
if env.spec.kwargs.get("map_name"):
|
||||
model["map_name"] = env.spec.kwargs.get("map_name")
|
||||
if env.spec.kwargs.get("is_slippery", "") == False:
|
||||
@@ -577,30 +617,26 @@ def push_to_hub(
|
||||
print(model)
|
||||
|
||||
# Pickle the model
|
||||
with open(Path(repo_local_path) / "q-learning.pkl", "wb") as f:
|
||||
with open(repo_local_path / "q-learning.pkl", "wb") as f:
|
||||
pickle.dump(model, f)
|
||||
|
||||
# Step 2: Evaluate the model and build JSON
|
||||
# Step 4: Evaluate the model and build JSON with evaluation metrics
|
||||
mean_reward, std_reward = evaluate_agent(
|
||||
eval_env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"]
|
||||
)
|
||||
|
||||
# First get datetime
|
||||
eval_datetime = datetime.datetime.now()
|
||||
eval_form_datetime = eval_datetime.isoformat()
|
||||
|
||||
evaluate_data = {
|
||||
"env_id": model["env_id"],
|
||||
"mean_reward": mean_reward,
|
||||
"n_eval_episodes": model["n_eval_episodes"],
|
||||
"eval_datetime": eval_form_datetime,
|
||||
"eval_datetime": datetime.datetime.now().isoformat(),
|
||||
}
|
||||
|
||||
# Write a JSON file
|
||||
with open(Path(repo_local_path) / "results.json", "w") as outfile:
|
||||
with open(repo_local_path / "results.json", "w") as outfile:
|
||||
json.dump(evaluate_data, outfile)
|
||||
|
||||
# Step 3: Create the model card
|
||||
# Env id
|
||||
# Step 5: Create the model card
|
||||
env_name = model["env_id"]
|
||||
if env.spec.kwargs.get("map_name"):
|
||||
env_name += "-" + env.spec.kwargs.get("map_name")
|
||||
@@ -627,33 +663,31 @@ def push_to_hub(
|
||||
metadata = {**metadata, **eval}
|
||||
|
||||
model_card = f"""
|
||||
# **Q-Learning** Agent playing **{env_id}**
|
||||
This is a trained model of a **Q-Learning** agent playing **{env_id}** .
|
||||
"""
|
||||
# **Q-Learning** Agent playing **{env_id}**
|
||||
This is a trained model of a **Q-Learning** agent playing **{env_id}** .
|
||||
|
||||
model_card += """
|
||||
## Usage
|
||||
```python
|
||||
"""
|
||||
## Usage
|
||||
|
||||
model_card += f"""model = load_from_hub(repo_id="{repo_id}", filename="q-learning.pkl")
|
||||
```python
|
||||
|
||||
# Don't forget to check if you need to add additional attributes (is_slippery=False etc)
|
||||
env = gym.make(model["env_id"])
|
||||
model = load_from_hub(repo_id="{repo_id}", filename="q-learning.pkl")
|
||||
|
||||
evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
|
||||
"""
|
||||
|
||||
model_card += """
|
||||
# Don't forget to check if you need to add additional attributes (is_slippery=False etc)
|
||||
env = gym.make(model["env_id"])
|
||||
```
|
||||
"""
|
||||
|
||||
evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
|
||||
|
||||
readme_path = repo_local_path / "README.md"
|
||||
readme = ""
|
||||
print(readme_path.exists())
|
||||
if readme_path.exists():
|
||||
with readme_path.open("r", encoding="utf8") as f:
|
||||
readme = f.read()
|
||||
else:
|
||||
readme = model_card
|
||||
print(readme)
|
||||
|
||||
with readme_path.open("w", encoding="utf-8") as f:
|
||||
f.write(readme)
|
||||
@@ -661,20 +695,23 @@ def push_to_hub(
|
||||
# Save our metrics to Readme metadata
|
||||
metadata_save(readme_path, metadata)
|
||||
|
||||
# Step 4: Record a video
|
||||
# Step 6: Record a video
|
||||
video_path = repo_local_path / "replay.mp4"
|
||||
record_video(env, model["qtable"], video_path, video_fps)
|
||||
|
||||
# Push everything to hub
|
||||
print(f"Pushing the repo to the Hugging Face Hub")
|
||||
repo.push_to_hub(commit_message=commit_message)
|
||||
# Step 7. Push everything to the Hub
|
||||
api.upload_folder(
|
||||
repo_id=repo_id,
|
||||
folder_path=repo_local_path,
|
||||
path_in_repo=".",
|
||||
)
|
||||
|
||||
print("Your model is pushed to the hub. You can view your model here: ", repo_url)
|
||||
print("Your model is pushed to the Hub. You can view your model here: ", repo_url)
|
||||
```
|
||||
|
||||
### .
|
||||
|
||||
By using `package_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the hub**.
|
||||
By using `push_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the Hub**.
|
||||
|
||||
This way:
|
||||
- You can **showcase your work** 🔥
|
||||
@@ -700,9 +737,9 @@ from huggingface_hub import notebook_login
|
||||
notebook_login()
|
||||
```
|
||||
|
||||
If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`
|
||||
If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login` (or `login`)
|
||||
|
||||
3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using `package_to_hub()` function
|
||||
3️⃣ We're now ready to push our trained agent to the 🤗 Hub 🔥 using the `push_to_hub()` function
|
||||
|
||||
- Let's create **the model dictionary that contains the hyperparameters and the Q_table**.
|
||||
|
||||
@@ -722,7 +759,7 @@ model = {
|
||||
}
|
||||
```
|
||||
|
||||
Let's fill the `package_to_hub` function:
|
||||
Let's fill the `push_to_hub` function:
|
||||
|
||||
- `repo_id`: the name of the Hugging Face Hub Repository that will be created/updated `(repo_id = {username}/{repo_name})`
|
||||
@@ -738,7 +775,7 @@ model
|
||||
```python
|
||||
username = "" # FILL THIS
|
||||
repo_name = "q-FrozenLake-v1-4x4-noSlippery"
|
||||
push_to_hub(repo_id=f"username}/{repo_name}", model=model, env=env)
|
||||
push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)
|
||||
```
|
||||
|
||||
Congrats 🥳 you've just implemented from scratch, trained and uploaded your first Reinforcement Learning agent.
|
||||
@@ -813,8 +850,6 @@ learning_rate = 0.7 # Learning rate
|
||||
# Evaluation parameters
|
||||
n_eval_episodes = 100 # Total number of test episodes
|
||||
|
||||
|
||||
|
||||
# DO NOT MODIFY EVAL_SEED
|
||||
eval_seed = [
|
||||
16,
|
||||
@@ -935,13 +970,10 @@ decay_rate = 0.005 # Exponential decay rate for exploration prob
|
||||
|
||||
```python
|
||||
Qtable_taxi = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_taxi)
|
||||
```
|
||||
|
||||
```python
|
||||
Qtable_taxi
|
||||
```
|
||||
|
||||
## Create a model dictionary 💾 and publish our trained model on the Hub 🔥
|
||||
## Create a model dictionary 💾 and publish our trained model to the Hub 🔥
|
||||
- We create a model dictionary that will contain all the training hyperparameters for reproducibility and the Q-Table.
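As a rough sketch of why a plain dictionary is enough here: it can be pickled and restored with the Q-table intact. The keys below are the ones `push_to_hub` reads from the model; the `max_steps` value is a placeholder chosen for this example, and the all-zero table stands in for a trained one (Taxi-v3 has 500 states and 6 actions).

```python
import pickle

import numpy as np

# Stand-in model dictionary (placeholder hyperparameter values, untrained table)
model = {
    "env_id": "Taxi-v3",
    "max_steps": 99,                # placeholder value for this sketch
    "n_eval_episodes": 100,
    "qtable": np.zeros((500, 6)),   # 500 states x 6 actions
}

# Serialize to bytes and restore, as a file-based pickle dump/load would
payload = pickle.dumps(model)
restored = pickle.loads(payload)

print(restored["env_id"], restored["qtable"].shape)
```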
|
||||
|
||||
|
||||
@@ -963,12 +995,14 @@ model = {
|
||||
|
||||
```python
|
||||
username = "" # FILL THIS
|
||||
repo_name = "q-Taxi-v3"
|
||||
repo_name = ""
|
||||
push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)
|
||||
```
|
||||
|
||||
Now that it's on the Hub, you can compare the results of your Taxi-v3 with your classmates using the leaderboard 🏆 👉 https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
|
||||
|
||||
⚠ To see your entry, you need to go to the bottom of the leaderboard page and **click on refresh** ⚠
|
||||
|
||||
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/notebooks/unit2/taxi-leaderboard.png" alt="Taxi Leaderboard">
|
||||
|
||||
# Part 3: Load from Hub 🔽
|
||||
@@ -1000,14 +1034,6 @@ def load_from_hub(repo_id: str, filename: str) -> str:
|
||||
:param repo_id: id of the model repository from the Hugging Face Hub
|
||||
:param filename: name of the model zip file from the repository
|
||||
"""
|
||||
try:
|
||||
from huggingface_hub import cached_download, hf_hub_url
|
||||
except ImportError:
|
||||
raise ImportError(
|
||||
"You need to install huggingface_hub to use `load_from_hub`. "
|
||||
"See https://pypi.org/project/huggingface-hub/ for installation."
|
||||
)
|
||||
|
||||
# Get the model from the Hub, download and cache the model on your local disk
|
||||
pickle_model = hf_hub_download(repo_id=repo_id, filename=filename)
|
||||
|
||||
|
||||