Merge pull request #114 from Artachtron/main

Format and redundancy fixes
Thomas Simonini, committed by GitHub
2022-12-12 14:13:47 +01:00

@@ -11,17 +11,13 @@ In other terms, how to build an RL agent that can **select the actions that ma
The Policy **π** is the **brain of our Agent**, it's the function that tells us what **action to take given the state we are in.** So it **defines the agent's behavior** at a given time.
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy_1.jpg" alt="Policy">
<figcaption>Think of policy as the brain of our agent, the function that will tell us the action to take given a state
</figcaption>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy_1.jpg" alt="Policy" />
<figcaption>Think of policy as the brain of our agent, the function that will tell us the action to take given a state</figcaption>
</figure>
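To make this concrete, a policy can be pictured as a plain function from states to actions. A minimal sketch, where the state and action names are invented placeholders rather than a real environment's:

```python
# A policy maps each state to an action.
# The state/action names here are illustrative placeholders.
def policy(state: str) -> str:
    state_to_action = {
        "enemy_ahead": "jump",
        "coin_above": "jump",
        "clear_path": "move_right",
    }
    return state_to_action.get(state, "move_right")

print(policy("enemy_ahead"))  # jump
```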
-Think of policy as the brain of our agent, the function that will tells us the action to take given a state
-This Policy **is the function we want to learn**, our goal is to find the optimal policy π*, the policy that** maximizes **expected return** when the agent acts according to it. We find this *π through training.**
+This Policy **is the function we want to learn**; our goal is to find the optimal policy π\*, the policy that **maximizes expected return** when the agent acts according to it. We find this π\* **through training.**
-There are two approaches to train our agent to find this optimal policy π*:
+There are two approaches to train our agent to find this optimal policy π\*:
- **Directly,** by teaching the agent to learn which **action to take,** given the current state: **Policy-Based Methods.**
- Indirectly, **teach the agent to learn which state is more valuable** and then take the action that **leads to the more valuable states**: Value-Based Methods.
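To make the contrast between these two approaches concrete, here is a toy sketch; the states, actions, and values are invented for illustration:

```python
# Policy-Based: learn the state -> action mapping directly.
learned_policy = {"s0": "right", "s1": "jump"}
action = learned_policy["s0"]  # "right"

# Value-Based: learn how valuable each state is, then act by
# moving toward the reachable state with the highest value.
learned_values = {"s1": -6.0, "s2": -5.0}
reachable_from_s0 = ["s1", "s2"]
best_next_state = max(reachable_from_s0, key=learned_values.get)  # "s2"
```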
@@ -33,9 +29,8 @@ In Policy-Based methods, **we learn a policy function directly.**
This function will define a mapping between each state and the best corresponding action. We can also say that it'll define **a probability distribution over the set of possible actions at that state.**
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy_2.jpg" alt="Policy">
<figcaption>As we can see here, the policy (deterministic) <b>directly indicates the action to take for each step.</b>
</figcaption>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy_2.jpg" alt="Policy" />
<figcaption>As we can see here, the policy (deterministic) <b>directly indicates the action to take for each step.</b></figcaption>
</figure>
@@ -46,8 +41,7 @@ We have two types of policies:
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy_3.jpg" alt="Policy"/>
-<figcaption>action = policy(state)
-</figcaption>
+<figcaption>action = policy(state)</figcaption>
</figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy_4.jpg" alt="Policy" width="100%"/>
@@ -56,21 +50,19 @@ We have two types of policies:
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/policy_5.jpg" alt="Policy"/>
-<figcaption>policy(actions | state) = probability distribution over the set of actions given the current state
-</figcaption>
+<figcaption>policy(actions | state) = probability distribution over the set of actions given the current state</figcaption>
</figure>
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario"/>
-<figcaption>Given an initial state, our stochastic policy will output probability distributions over the possible actions at that state.
-</figcaption>
+<figcaption>Given an initial state, our stochastic policy will output probability distributions over the possible actions at that state.</figcaption>
</figure>
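A stochastic policy, by contrast, returns a distribution over actions and we sample from it. A minimal sketch, where the probabilities are placeholders in the spirit of the Mario example:

```python
import random

# Stochastic: the policy outputs a probability distribution over
# actions for a state, and the action is sampled from it.
def stochastic_policy(state):
    # Placeholder probabilities; a learned policy would compute these.
    return {"left": 0.1, "right": 0.7, "jump": 0.2}

probs = stochastic_policy("s0")
action = random.choices(list(probs), weights=list(probs.values()))[0]
```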
If we recap:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/pbm_1.jpg" alt="Pbm recap" width="100%">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/pbm_2.jpg" alt="Pbm recap" width="100%">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/pbm_1.jpg" alt="Pbm recap" width="100%" />
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/pbm_2.jpg" alt="Pbm recap" width="100%" />
## Value-based methods [[value-based]]
@@ -81,19 +73,18 @@ The value of a state is the **expected discounted return** the agent can get i
“Act according to our policy” just means that our policy is **“going to the state with the highest value”.**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/value_1.jpg" alt="Value based RL" width="100%">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/value_1.jpg" alt="Value based RL" width="100%" />
Here we see that our value function **defined values for each possible state.**
<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/value_2.jpg" alt="Value based RL"/>
-<figcaption>Thanks to our value function, at each step our policy will select the state with the biggest value defined by the value function: -7, then -6, then -5 (and so on) to attain the goal.
-</figcaption>
+<figcaption>Thanks to our value function, at each step our policy will select the state with the biggest value defined by the value function: -7, then -6, then -5 (and so on) to attain the goal.</figcaption>
</figure>
-Thanks to our value function, at each step our policy will select the state with the biggest value defined by the value function: -7, then -6, then -5 (and so on) to attain the goal.
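As a sketch of this greedy behavior, mirroring the -7, -6, -5 values from the figure (the grid coordinates themselves are made up):

```python
# State values as in the maze figure: at each step, move to the
# neighboring state the value function rates highest.
state_values = {(0, 0): -7, (0, 1): -6, (0, 2): -5}

def greedy_step(state, neighbors):
    return max(neighbors, key=lambda s: state_values[s])

next_state = greedy_step((0, 1), [(0, 0), (0, 2)])  # -> (0, 2), value -5
```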
If we recap:
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/vbm_1.jpg" alt="Vbm recap" width="100%">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/vbm_2.jpg" alt="Vbm recap" width="100%">
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/vbm_1.jpg" alt="Vbm recap" width="100%" />
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/vbm_2.jpg" alt="Vbm recap" width="100%" />