From d46598ff1c447409384f29d6c9100eb0abbb2788 Mon Sep 17 00:00:00 2001 From: Thomas Simonini Date: Thu, 4 May 2023 06:48:15 +0200 Subject: [PATCH] Update hands-on.mdx * Gymnasium Update --- units/en/unit2/hands-on.mdx | 196 +++++++++++++++++------------------- 1 file changed, 95 insertions(+), 101 deletions(-) diff --git a/units/en/unit2/hands-on.mdx b/units/en/unit2/hands-on.mdx index ce16a18..1202432 100644 --- a/units/en/unit2/hands-on.mdx +++ b/units/en/unit2/hands-on.mdx @@ -1,42 +1,10 @@ -# Hands-on [[hands-on]] - - - - - -Now that we studied the Q-Learning algorithm, let's implement it from scratch and train our Q-Learning agent in two environments: -1. [Frozen-Lake-v1 (non-slippery and slippery version)](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/) โ˜ƒ๏ธ : where our agent will need toย **go from the starting state (S) to the goal state (G)**ย by walking only on frozen tiles (F) and avoiding holes (H). -2. [An autonomous taxi](https://www.gymlibrary.dev/environments/toy_text/taxi/) ๐Ÿš– will needย **to learn to navigate**ย a city toย **transport its passengers from point A to point B.** - -Environments - -Thanks to a [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard), you'll be able to compare your results with other classmates and exchange the best practices to improve your agent's scores. Who will win the challenge for Unit 2? - -**If you don't find your model, go to the bottom of the page and click on the refresh button.** - -To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained Taxi model to the Hub and **get a result of >= 4.5**. 
-
-To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**
-
-For more information about the certification process, check this section ๐Ÿ‘‰ https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
-
-And you can check your progress here ๐Ÿ‘‰ https://huggingface.co/spaces/ThomasSimonini/Check-my-progress-Deep-RL-Course
-
-
-**To start the hands-on click on the Open In Colab button** ๐Ÿ‘‡ :
-
-[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/deep-rl-class/blob/master/notebooks/unit2/unit2.ipynb)
-
+Open In Colab

# Unit 2: Q-Learning with FrozenLake-v1 โ›„ and Taxi-v3 ๐Ÿš•

Unit 2 Thumbnail

In this notebook, **you'll code your first Reinforcement Learning agent from scratch** to play FrozenLake โ„๏ธ using Q-Learning, share it with the community, and experiment with different configurations.

โฌ‡๏ธ Here is an example of what **you will achieve in just a couple of minutes.** โฌ‡๏ธ

@@ -44,12 +12,12 @@ In this notebook, **you'll code your first Reinforcement Learning agent from scr

Environments

### ๐ŸŽฎ Environments:

- [FrozenLake-v1](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)
- [Taxi-v3](https://www.gymlibrary.dev/environments/toy_text/taxi/)

### ๐Ÿ“š RL-Library:

- Python and NumPy
- [Gym](https://www.gymlibrary.dev/)

@@ -61,34 +29,52 @@
We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the Github Repo](https://github.com/huggingface/deep-rl-class/issues).

At the end of the notebook, you will:

- Be able to use **Gym**, the environment library.
- Be able to code a Q-Learning agent from scratch.
- Be able to **push your trained agent and the code to the Hub** with a nice video replay and an evaluation score ๐Ÿ”ฅ.
+
+
+## This notebook is from the Deep Reinforcement Learning Course
+Deep RL Course illustration
+
+In this free course, you will:
+
+- ๐Ÿ“– Study Deep Reinforcement Learning in **theory and practice**.
+- ๐Ÿง‘โ€๐Ÿ’ป Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.
+- ๐Ÿค– Train **agents in unique environments**.
+
+And more! Check ๐Ÿ“š the syllabus ๐Ÿ‘‰ https://simoninithomas.github.io/deep-rl-course
+
+Donโ€™t forget to **sign up for the course** (we are collecting your email to be able toย **send you the links when each Unit is published, and to give you information about the challenges and updates**).
+
+The best way to keep in touch is to join our Discord server to exchange with the community and with us ๐Ÿ‘‰๐Ÿป https://discord.gg/ydHrjt3WP5
+
## Prerequisites ๐Ÿ—๏ธ

Before diving into the notebook, you need to:

๐Ÿ”ฒ ๐Ÿ“š **Study [Q-Learning by reading Unit 2](https://huggingface.co/deep-rl-course/unit2/introduction)** ๐Ÿค—

## A small recap of Q-Learning

-- *Q-Learning* **is the RL algorithm that**
-
-  - Trains *Q-Function*, an **action-value function** that encoded, in internal memory, by a *Q-table* **that contains all the state-action pair values.**
-
-  - Given a state and action, our Q-Function **will search the Q-table for the corresponding value.**
+- *Q-Learning* **is the RL algorithm that**
+  - Trains a *Q-Function*, an **action-value function** whose internal memory is a *Q-table* **that contains all the state-action pair values.**
+
+  - Given a state and action, our Q-Function **will search its Q-table for the corresponding value.**
+
Q function

- When the training is done,**we have an optimal Q-Function, so an optimal Q-Table.**
-
+
- And if we **have an optimal Q-function**, we
-have an optimal policy, since we **know for, each state, the best action to take.**
+have an optimal policy, since we **know, for each state, the best action to take.**

Link value policy

-But, in the beginning,ย our **Q-Table is useless since it gives arbitrary values for each state-action pairย (most of the time we initialize the Q-Table to 0 values)**. But, as weย explore the environment and update our Q-Table it will give us better and better approximations
+But, in the beginning,ย our **Q-Table is useless since it gives arbitrary values for each state-action pairย (most of the time we initialize the Q-Table to 0 values)**. But, as we explore the environment and update our Q-Table, it will give us better and better approximations.

q-learning.jpeg

@@ -99,6 +85,13 @@
This is the Q-Learning pseudocode:

# Let's code our first Reinforcement Learning algorithm ๐Ÿš€

+
+To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained Taxi model to the Hub and **get a result of >= 4.5**.
+
+To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**.
+
+For more information about the certification process, check this section ๐Ÿ‘‰ https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process
+
## Install dependencies and create a virtual display ๐Ÿ”ฝ

In the notebook, we'll need to generate a replay video. To do so, with Colab, **we need to have a virtual screen to render the environment** (and thus record the frames).
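To make the update rule in the Q-Learning pseudocode above concrete, here is a minimal standalone NumPy sketch of a single Q-table update. The `q_update` helper, the table shape, and the numbers are illustrative only, not cells from the notebook:

```python
import numpy as np

def q_update(Qtable, state, action, reward, new_state, lr=0.1, gamma=0.99):
    """One Q-Learning update: Q(s,a) += lr * (R + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = reward + gamma * np.max(Qtable[new_state])
    Qtable[state][action] = Qtable[state][action] + lr * (td_target - Qtable[state][action])
    return Qtable

# On a fresh all-zero table, one rewarded transition moves Q(s,a) to lr * reward.
Qtable = np.zeros((16, 4))
Qtable = q_update(Qtable, state=14, action=2, reward=1.0, new_state=15)
print(Qtable[14][2])  # 0.1
```

Repeating this update as the agent explores is exactly how the arbitrary initial values turn into better and better approximations.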
@@ -113,19 +106,20 @@ Weโ€™ll install multiple ones:

The Hugging Face Hub ๐Ÿค— works as a central place where anyone can share and explore models and datasets. It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.

You can see all the Deep RL models available here (if they use Q Learning) ๐Ÿ‘‰ https://huggingface.co/models?other=q-learning

-```bash
-pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit2/requirements-unit2.txt
+```python
+!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit2/requirements-unit2.txt
 ```

-```bash
-sudo apt-get update
-apt install python-opengl ffmpeg xvfb
-pip3 install pyvirtualdisplay
+```python
+%%capture
+!sudo apt-get update
+!apt install python-opengl ffmpeg xvfb
+!pip3 install pyvirtualdisplay
 ```

-To make sure the new installed libraries are used, **sometimes it's required to restart the notebook runtime**. The next cell will force the **runtime to crash, so you'll need to connect again and run the code starting from here**. Thanks to this trick, **we will be able to run our virtual screen.**
+To make sure the newly installed libraries are used, **sometimes it's required to restart the notebook runtime**. The next cell will force the **runtime to crash, so you'll need to connect again and run the code starting from here**. 
Thanks to this trick, **we will be able to run our virtual screen.**

```python
import os

os.kill(os.getpid(), 9)
```

```python
# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()
```
@@ -154,6 +148,7 @@
import gym
import random
import imageio
import os
+import tqdm

import pickle5 as pickle
from tqdm.notebook import tqdm
@@ -163,10 +158,10 @@
We're now ready to code our Q-Learning algorithm ๐Ÿ”ฅ

# Part 1: Frozen Lake โ›„ (non slippery version)

## Create and understand [FrozenLake environment โ›„](https://www.gymlibrary.dev/environments/toy_text/frozen_lake/)
---
-๐Ÿ’ก A good habit when you start to use an environment is to check its documentation 
+๐Ÿ’ก A good habit when you start to use an environment is to check its documentation

๐Ÿ‘‰ https://www.gymlibrary.dev/environments/toy_text/frozen_lake/
---
@@ -217,7 +212,7 @@
print("Observation Space", env.observation_space)
print("Sample observation", env.observation_space.sample()) # Get a random observation
```

-We see with `Observation Space Shape Discrete(16)` that the observation is an integer representing the **agentโ€™s current position as current_row * nrows + current_col (where both the row and col start at 0)**.
+We see with `Observation Space Shape Discrete(16)` that the observation is an integer representing the **agentโ€™s current position as current_row * nrows + current_col (where both the row and col start at 0)**. For example, the goal position in the 4x4 map can be calculated as follows: 3 * 4 + 3 = 15.

The number of possible observations is dependent on the size of the map. **For example, the 4x4 map has 16 possible observations.**
@@ -253,18 +248,17 @@
It's time to initialize our Q-table!
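If you want to check your work on the initialization exercise that follows, one possible completion looks like this. It is a sketch that hard-codes FrozenLake's 4x4 sizes instead of reading them from the environment, so treat the numbers as assumptions:

```python
import numpy as np

# FrozenLake-v1 4x4: 16 states and 4 actions (left, down, right, up).
state_space = 16
action_space = 4

def initialize_q_table(state_space, action_space):
    # np.zeros takes the shape as a tuple (rows, columns)
    return np.zeros((state_space, action_space))

Qtable_frozenlake = initialize_q_table(state_space, action_space)
print(Qtable_frozenlake.shape)  # (16, 4)
```

In the notebook itself, prefer querying the environment for these sizes so the same code generalizes to Taxi-v3.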
To know how many rows (states) and columns

```python
-state_space = 
+state_space = 
print("There are ", state_space, " possible states")

-action_space = 
+action_space = 
print("There are ", action_space, " possible actions")
```

```python
-# Let's create our Qtable of size (state_space, action_space) and initialized each values at 0 using np.zeros. np.zeros needs a tuple (a,b)
+# Let's create our Qtable of size (state_space, action_space) and initialize each value at 0 using np.zeros. np.zeros needs a tuple (a,b)
def initialize_q_table(state_space, action_space):
-  Qtable = 
+  Qtable = 
  return Qtable
```
@@ -299,7 +293,7 @@
Remember we have two policies since Q-Learning is an **off-policy** algorithm. This means we're using a **different policy for acting and updating the value function**.

- Epsilon-greedy policy (acting policy)
- Greedy-policy (updating policy)

The greedy policy will also be the final policy we'll have when the Q-learning agent completes training. The greedy policy is used to select an action using the Q-table.

Q-Learning

@@ -307,8 +301,8 @@
```python
def greedy_policy(Qtable, state):
  # Exploitation: take the action with the highest state, action value
-  action = 
+  action = 

  return action
```
@@ -330,9 +324,9 @@
The idea with epsilon-greedy:

- With *probability 1โ€Š-โ€Šษ›* : **we do exploitation** (i.e. our agent selects the action with the highest state-action pair value).

- With *probability ษ›*: we do **exploration** (trying a random action).
As the training continues, we progressively **reduce the epsilon value since we will need less and less exploration and more exploitation.**

Q-Learning

@@ -340,16 +334,16 @@
```python
def epsilon_greedy_policy(Qtable, state, epsilon):
  # Randomly generate a number between 0 and 1
-  random_num = 
+  random_num = 
  # if random_num > greater than epsilon --> exploitation
  if random_num > epsilon:
    # Take the action with the highest value given a state
    # np.argmax can be useful here
-    action = 
+    action = 
  # else --> exploration
  else:
    action = # Take a random action
-  
+
  return action
```
@@ -372,7 +366,7 @@ def epsilon_greedy_policy(Qtable, state, epsilon):
```

## Define the hyperparameters โš™๏ธ

The exploration related hyperparameters are some of the most important ones.

- We need to make sure that our agent **explores enough of the state space** to learn a good value approximation. To do that, we need to have progressive decay of the epsilon.
- If you decrease epsilon too fast (too high decay_rate), **you take the risk that your agent will be stuck**, since your agent didn't explore enough of the state space and hence can't solve the problem.
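The epsilon decay discussed above can be sketched on its own; the training loop later computes the same `min_epsilon + (max_epsilon - min_epsilon) * exp(-decay_rate * episode)` expression. The constants here are illustrative, not the notebook's values:

```python
import numpy as np

min_epsilon = 0.05   # illustrative hyperparameters, pick your own in the notebook
max_epsilon = 1.0
decay_rate = 0.005

def epsilon_for_episode(episode):
    # Exponential decay: start near max_epsilon, approach min_epsilon.
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)

print(round(epsilon_for_episode(0), 3))      # 1.0
print(round(epsilon_for_episode(10000), 3))  # 0.05
```

Plotting this curve for a few candidate `decay_rate` values is a quick way to sanity-check that exploration lasts long enough before you launch a full training run.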
@@ -409,7 +403,7 @@ For episode in the total of training episodes: Reduce epsilon (since we need less and less exploration) Reset the environment - For step in max timesteps: + For step in max timesteps: Choose the action At using epsilon greedy policy Take the action (a) and observe the outcome state(s') and reward (r) Update the Q-value Q(s,a) using Bellman equation Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)] @@ -419,7 +413,7 @@ Reset the environment ```python def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable): - for episode in range(n_training_episodes): + for episode in tqdm(range(n_training_episodes)): # Reduce epsilon (because we need less and less exploration) epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode) # Reset the environment @@ -430,19 +424,19 @@ def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_st # repeat for step in range(max_steps): # Choose the action At using epsilon greedy policy - action = + action = # Take action At and observe Rt+1 and St+1 # Take the action (a) and observe the outcome state(s') and reward (r) - new_state, reward, done, info = + new_state, reward, done, info = # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)] - Qtable[state][action] = + Qtable[state][action] = # If done, finish the episode if done: break - + # Our next state is the new state state = new_state return Qtable @@ -674,19 +668,19 @@ def push_to_hub(repo_id, model, env, video_fps=1, local_repo_path="hub"): metadata = {**metadata, **eval} model_card = f""" - # **Q-Learning** Agent playing1 **{env_id}** - This is a trained model of a **Q-Learning** agent playing **{env_id}** . + # **Q-Learning** Agent playing1 **{env_id}** + This is a trained model of a **Q-Learning** agent playing **{env_id}** . 
-  ## Usage

-  ```python
+  ```python
+
+  model = load_from_hub(repo_id="{repo_id}", filename="q-learning.pkl")

-  model = load_from_hub(repo_id="{repo_id}", filename="q-learning.pkl")
-
-  # Don't forget to check if you need to add additional attributes (is_slippery=False etc)
-  env = gym.make(model["env_id"])
-  ```
-  """
+  # Don't forget to check if you need to add additional attributes (is_slippery=False etc)
+  env = gym.make(model["env_id"])
+  ```
+  """

  evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
@@ -726,7 +720,7 @@
By using `push_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the Hub**.

This way:
- You can **showcase your work** ๐Ÿ”ฅ
- You can **visualize your agent playing** ๐Ÿ‘€
- You can **share an agent with the community that others can use** ๐Ÿ’พ
- You can **access a leaderboard ๐Ÿ† to see how well your agent is performing compared to your classmates** ๐Ÿ‘‰ https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard
@@ -788,21 +782,21 @@
repo_name = "q-FrozenLake-v1-4x4-noSlippery"
push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)
```

-Congrats ๐Ÿฅณ you've just implemented from scratch, trained, and uploaded your first Reinforcement Learning agent.
-FrozenLake-v1 no_slippery is very simple environment, let's try a harder one ๐Ÿ”ฅ.
+Congrats ๐Ÿฅณ you've just implemented from scratch, trained, and uploaded your first Reinforcement Learning agent.
+FrozenLake-v1 no_slippery is a very simple environment, let's try a harder one ๐Ÿ”ฅ.
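Since the certification note earlier defines the result as `mean_reward - std of reward`, you can sanity-check your evaluation before pushing. This sketch uses made-up per-episode rewards, and it assumes the population standard deviation (NumPy's `np.std` default), which is an assumption about how the leaderboard computes it:

```python
import numpy as np

episode_rewards = [7.0, 8.0, 9.0, 8.0]  # hypothetical evaluation rewards

mean_reward = np.mean(episode_rewards)
std_reward = np.std(episode_rewards)  # population std (ddof=0), np.std's default
result = mean_reward - std_reward

print(mean_reward)  # 8.0
```

A high mean with a large spread can therefore score below a slightly lower but more consistent agent.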
# Part 2: Taxi-v3 ๐Ÿš–

## Create and understand [Taxi-v3 ๐Ÿš•](https://www.gymlibrary.dev/environments/toy_text/taxi/)
---
-๐Ÿ’ก A good habit when you start to use an environment is to check its documentation 
+๐Ÿ’ก A good habit when you start to use an environment is to check its documentation

๐Ÿ‘‰ https://www.gymlibrary.dev/environments/toy_text/taxi/
---
-In `Taxi-v3` ๐Ÿš•, there are four designated locations in the grid world indicated by R(ed), G(reen), Y(ellow), and B(lue). 
+In `Taxi-v3` ๐Ÿš•, there are four designated locations in the grid world indicated by R(ed), G(reen), Y(ellow), and B(lue).

When the episode starts, **the taxi starts off at a random square** and the passenger is at a random location. The taxi drives to the passengerโ€™s location, **picks up the passenger**, drives to the passengerโ€™s destination (another one of the four specified locations), and then **drops off the passenger**. Once the passenger is dropped off, the episode ends.
@@ -1009,7 +1003,7 @@
repo_name = ""
push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)
```

Now that it's on the Hub, you can compare the results of your Taxi-v3 with your classmates using the leaderboard ๐Ÿ† ๐Ÿ‘‰ https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard

โš  To see your entry, you need to go to the bottom of the leaderboard page and **click on refresh** โš 
@@ -1075,36 +1069,36 @@
evaluate_agent(env, model["max_steps"], model["n_eval_episodes"], model["qtable"], model["eval_seed"])
```

## Some additional challenges ๐Ÿ†

-The best way to learn **is to try things on your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. With 1,000,000 steps, we saw some great results!
+The best way to learn **is to try things on your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. With 1,000,000 steps, we saw some great results!

In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?

Here are some ideas to climb up the leaderboard:

* Train more steps
* Try different hyperparameters by looking at what your classmates have done.
* **Push your new trained model** on the Hub ๐Ÿ”ฅ

-Are walking on ice and driving taxis too boring to you? Try to **change the environment**, why not use the FrozenLake-v1 slippery version? Check how they work [using the gym documentation](https://www.gymlibrary.dev/) and have fun ๐ŸŽ‰.
+Are walking on ice and driving taxis too boring for you? Try to **change the environment**; why not use the FrozenLake-v1 slippery version? Check how it works [using the gym documentation](https://www.gymlibrary.dev/) and have fun ๐ŸŽ‰.

_____________________________________________________________________
Congrats ๐Ÿฅณ, you've just implemented, trained, and uploaded your first Reinforcement Learning agent.

Understanding Q-Learning is an **important step to understanding value-based methods.**

-In the next Unit with Deep Q-Learning, we'll see that while creating and updating a Q-table was a good strategy โ€” **however, it is not scalable.**
+In the next Unit, with Deep Q-Learning, we'll see that creating and updating a Q-table was a good strategy; **however, it is not scalable.**

-For instance, imagine you create an agent that learns to play Doom. 
+For instance, imagine you create an agent that learns to play Doom.

Doom

-Doom is a large environment with a huge state space (millions of different states). Creating and updating a Q-table for that environment would not be efficient.
+Doom is a large environment with a huge state space (millions of different states). Creating and updating a Q-table for that environment would not be efficient.

That's why we'll study Deep Q-Learning in the next unit, an algorithm **where we use a neural network that approximates, given a state, the different Q-values for each action.**

Environments

See you in Unit 3! ๐Ÿ”ฅ

-## Keep learning, stay awesome ๐Ÿค— 
+## Keep learning, stay awesome ๐Ÿค—
\ No newline at end of file