mirror of
https://github.com/huggingface/deep-rl-class.git
synced 2026-04-09 05:40:29 +08:00
Merge branch 'huggingface:main' into main
@@ -160,6 +160,8 @@
  title: Advantage Actor Critic (A2C)
- local: unit6/hands-on
  title: Advantage Actor Critic (A2C) using Robotics Simulations with Panda-Gym 🤖
- local: unit6/quiz
  title: Quiz
- local: unit6/conclusion
  title: Conclusion
- local: unit6/additional-readings
123
units/en/unit6/quiz.mdx
Normal file
@@ -0,0 +1,123 @@
# Quiz

The best way to learn and [to avoid the illusion of competence](https://www.coursera.org/lecture/learning-how-to-learn/illusions-of-competence-BuFzf) **is to test yourself.** This will help you find **where you need to reinforce your knowledge**.

### Q1: Which of the following interpretations of the bias-variance tradeoff is the most accurate in the field of Reinforcement Learning?

<Question
  choices={[
    {
      text: "The bias-variance tradeoff reflects how my model is able to generalize the knowledge from previously tagged data we give to the model during training time.",
      explain: "This is the traditional bias-variance tradeoff in Machine Learning. In our specific case of Reinforcement Learning, we don't have previously tagged data, but only a reward signal.",
      correct: false,
    },
    {
      text: "The bias-variance tradeoff reflects how well the reinforcement signal reflects the true reward the agent should get from the environment",
      explain: "",
      correct: true,
    },
  ]}
/>

### Q2: Which of the following statements are true when talking about models with bias and/or variance in RL?

<Question
  choices={[
    {
      text: "An unbiased reward signal returns rewards similar to the real / expected ones from the environment",
      explain: "",
      correct: true,
    },
    {
      text: "A biased reward signal returns rewards similar to the real / expected ones from the environment",
      explain: "If a reward signal is biased, it means the reward signal we get differs from the real reward we should be getting from an environment",
      correct: false,
    },
    {
      text: "A reward signal with high variance has a lot of noise in it and is affected by, for example, stochastic (non-constant) elements in the environment",
      explain: "",
      correct: true,
    },
    {
      text: "A reward signal with low variance has a lot of noise in it and is affected by, for example, stochastic (non-constant) elements in the environment",
      explain: "If a reward signal has low variance, then it's less affected by the noise of the environment and produces similar values regardless of the random elements in the environment",
      correct: false,
    },
  ]}
/>

### Q3: Which of the following statements are true about the Monte Carlo method?

<Question
  choices={[
    {
      text: "It's a sampling mechanism, which means we don't analyze all the possible states, but a sample of them",
      explain: "",
      correct: true,
    },
    {
      text: "It's very resistant to stochasticity (random elements in the trajectory)",
      explain: "Monte Carlo estimates returns from a random sample of trajectories each time. However, even identical trajectories can have different reward values if they contain stochastic elements",
      correct: false,
    },
    {
      text: "To reduce the impact of stochastic elements in Monte Carlo, we can take `n` strategies and average them, reducing their impact in case of noise",
      explain: "",
      correct: true,
    },
  ]}
/>
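
The averaging idea in the last choice can be illustrated with a small experiment. This is only a sketch under stated assumptions: the toy environment (`sample_return`, where each step pays 1 with probability 0.5) and all function names are hypothetical and are not part of the course code. It compares how much a Monte Carlo estimate fluctuates when it is built from 1 sampled episode versus the average of 50.

```python
import random
import statistics


def sample_return(rng, n_steps=10, p_reward=0.5):
    """Return of one episode in a hypothetical toy environment:
    each step pays 1 with probability p_reward, otherwise 0."""
    return sum(1.0 for _ in range(n_steps) if rng.random() < p_reward)


def monte_carlo_estimate(rng, n_episodes):
    """Average the returns of n_episodes independently sampled episodes."""
    returns = [sample_return(rng) for _ in range(n_episodes)]
    return statistics.fmean(returns)


def spread_of_estimates(n_episodes, n_repeats=500, seed=0):
    """Standard deviation of the estimate across n_repeats repeated runs."""
    rng = random.Random(seed)
    estimates = [monte_carlo_estimate(rng, n_episodes) for _ in range(n_repeats)]
    return statistics.stdev(estimates)


single = spread_of_estimates(n_episodes=1)
averaged = spread_of_estimates(n_episodes=50)
print(f"spread with 1 episode:   {single:.3f}")
print(f"spread with 50 episodes: {averaged:.3f}")
```

The estimate built from 50 episodes should fluctuate noticeably less than the single-episode one, which is the variance reduction the quiz answer refers to.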

### Q4: What is the Advantage Actor-Critic method (A2C)?

<details>
<summary>Solution</summary>

The idea behind Actor-Critic is that we learn two function approximations:
1. A `policy` that controls how our agent acts (π)
2. A `value` function to assist the policy update by measuring how good the action taken is (q)

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/step2.jpg" alt="Actor-Critic, step 2"/>
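
The two function approximations can be sketched in tabular form. This is a minimal illustration, not the course implementation: the class `TabularActorCritic`, its learning rates, and the SARSA-style critic target are assumptions chosen to keep the example small. The actor is a softmax policy over per-state preferences, and the critic is a table of action values that weights each policy update.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


class TabularActorCritic:
    """Hypothetical tabular Actor-Critic (illustrative names and defaults)."""

    def __init__(self, n_states, n_actions, actor_lr=0.1, critic_lr=0.1, gamma=0.99):
        # Actor parameters: one preference per (state, action) pair.
        self.theta = [[0.0] * n_actions for _ in range(n_states)]
        # Critic parameters: one action-value estimate per (state, action) pair.
        self.q = [[0.0] * n_actions for _ in range(n_states)]
        self.actor_lr = actor_lr
        self.critic_lr = critic_lr
        self.gamma = gamma

    def policy(self, state):
        """Action probabilities pi(. | state)."""
        return softmax(self.theta[state])

    def update(self, s, a, r, s_next, a_next, done):
        # Actor step: move along grad log pi(a | s), weighted by the critic's q(s, a).
        probs = self.policy(s)
        q_sa = self.q[s][a]
        for b in range(len(probs)):
            grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
            self.theta[s][b] += self.actor_lr * q_sa * grad_log_pi
        # Critic step: temporal-difference update toward r + gamma * q(s', a').
        target = r if done else r + self.gamma * self.q[s_next][a_next]
        self.q[s][a] += self.critic_lr * (target - q_sa)


agent = TabularActorCritic(n_states=1, n_actions=2)
for _ in range(20):
    agent.update(s=0, a=1, r=1.0, s_next=0, a_next=1, done=True)
print(agent.policy(0))
```

Both tables change during training: the critic learns that action 1 is rewarded, and the actor shifts probability toward it.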

</details>

### Q5: Which of the following statements are true about the Actor-Critic method?

<Question
  choices={[
    {
      text: "The Critic does not learn from the training process",
      explain: "Both the Actor and the Critic function parameters are updated during training time",
      correct: false,
    },
    {
      text: "The Actor learns a policy function, while the Critic learns a value function",
      explain: "",
      correct: true,
    },
    {
      text: "It adds resistance to stochasticity and reduces high variance",
      explain: "",
      correct: true,
    },
  ]}
/>

### Q6: What is `Advantage` in the A2C method?

<details>
<summary>Solution</summary>

Instead of directly using the Critic's Action-Value function, we could use an `Advantage` function. The idea behind an `Advantage` function is that we calculate the relative advantage of an action compared to the other actions possible at a state, averaging over them.

In other words: how much better it is to take that action at a state compared to the average value of the state.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit8/advantage1.jpg" alt="Advantage in A2C"/>
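
As a worked illustration, the advantage can be written as A(s, a) = Q(s, a) - V(s), where V(s) is the policy-weighted average of the action values. The sketch below assumes made-up action values and policy probabilities, and the helper names (`state_value`, `advantages`, `td_advantage`) are hypothetical. The last helper shows the common temporal-difference estimate r + γV(s') - V(s), which needs only a state-value function.

```python
def state_value(q_values, policy_probs):
    """V(s): average of Q(s, a) weighted by the policy probabilities pi(a | s)."""
    return sum(p * q for p, q in zip(policy_probs, q_values))


def advantages(q_values, policy_probs):
    """A(s, a) = Q(s, a) - V(s) for every action available in the state."""
    v = state_value(q_values, policy_probs)
    return [q - v for q in q_values]


def td_advantage(reward, v_s, v_next, gamma=0.99, done=False):
    """TD-error estimate of the advantage: r + gamma * V(s') - V(s)."""
    bootstrap = 0.0 if done else gamma * v_next
    return reward + bootstrap - v_s


# Made-up numbers for a state with three actions.
q_values = [1.0, 2.0, 3.0]
policy_probs = [0.25, 0.5, 0.25]

print(state_value(q_values, policy_probs))  # 2.0
print(advantages(q_values, policy_probs))   # [-1.0, 0.0, 1.0]
print(td_advantage(reward=1.0, v_s=2.0, v_next=3.0, gamma=0.5))  # 0.5
```

A positive advantage means the action is better than the state's average, a negative one means it is worse.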

</details>

Congrats on finishing this quiz 🥳! If you missed some elements, take time to read the chapter again to reinforce (😏) your knowledge.