From 4ba43cf05fa4122da5de1b7d953f287b896d25b8 Mon Sep 17 00:00:00 2001
From: Thomas Simonini
Date: Thu, 8 Dec 2022 13:47:01 +0100
Subject: [PATCH] Update glossary.mdx

---
 units/en/unit1/glossary.mdx | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/units/en/unit1/glossary.mdx b/units/en/unit1/glossary.mdx
index 2871af5..1dc9164 100644
--- a/units/en/unit1/glossary.mdx
+++ b/units/en/unit1/glossary.mdx
@@ -3,41 +3,50 @@ This is a community-created glossary. Contributions are welcomed!
 
 ### Markov Property
-It implies that the action taken by our agent is conditional solely on the present state and independent of the past states and actions.
+
+It implies that the action taken by our agent is **conditional solely on the present state and independent of the past states and actions**.
 
 ### Observations/State
+
 - **State**: Complete description of the state of the world.
 - **Observation**: Partial description of the state of the environment/world.
 
 ### Actions
+
 - **Discrete Actions**: Finite number of actions, such as left, right, up, and down.
 - **Continuous Actions**: Infinite possibility of actions; for example, in the case of self-driving cars, the driving scenario has an infinite possibility of actions occurring.
 
 ### Rewards and Discounting
+
 - **Rewards**: Fundamental factor in RL. Tells the agent whether the action taken is good/bad.
 - RL algorithms are focused on maximizing the **cumulative reward**.
 - **Reward Hypothesis**: RL problems can be formulated as a maximisation of (cumulative) return.
 - **Discounting** is performed because rewards obtained at the start are more likely to happen as they are more predictable than long-term rewards.
 
 ### Tasks
+
 - **Episodic**: Has a starting point and an ending point.
 - **Continuous**: Has a starting point but no ending point.
 
 ### Exploration v/s Exploitation Trade-Off
+
 - **Exploration**: It's all about exploring the environment by trying random actions and receiving feedback/returns/rewards from the environment.
 - **Exploitation**: It's about exploiting what we know about the environment to gain maximum rewards.
 - **Exploration-Exploitation Trade-Off**: It balances how much we want to **explore** the environment and how much we want to **exploit** what we know about the environment.
 
 ### Policy
+
 - **Policy**: It is called the agent's brain. It tells us what action to take, given the state.
 - **Optimal Policy**: Policy that **maximizes** the **expected return** when an agent acts according to it. It is learned through *training*.
 
 ### Policy-based Methods:
+
 - An approach to solving RL problems.
 - In this method, the Policy is learned directly.
 - Will map each state to the best corresponding action at that state. Or a probability distribution over the set of possible actions at that state.
 
 ### Value-based Methods:
+
 - Another approach to solving RL problems.
 - Here, instead of training a policy, we train a **value function** that maps each state to the expected value of being in that state.
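
Reviewer note: two of the glossary entries touched by this patch, discounting and the exploration-exploitation trade-off, may be easier to follow with a small sketch. The code below is not part of the patch or of the course material. The discount factor `gamma`, the function names, and the choice of epsilon-greedy (one common way to balance exploring and exploiting, which the glossary does not name) are all assumptions made for illustration.

```python
import random


def discounted_return(rewards, gamma=0.99):
    """Cumulative reward where the reward at step t is weighted by gamma**t.

    `gamma` is a hypothetical discount factor in [0, 1]; smaller values
    make later (less predictable) rewards count less.
    """
    total = 0.0
    # Walk backwards so each earlier step multiplies the running sum by gamma once.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def epsilon_greedy(action_values, epsilon, rng=random):
    """Explore with probability `epsilon`, otherwise exploit.

    `action_values` is an assumed list of estimated values, one per
    discrete action. This is one possible strategy, not the only one.
    """
    if rng.random() < epsilon:
        # Exploration: try a random action.
        return rng.randrange(len(action_values))
    # Exploitation: pick the action currently believed to be best.
    return max(range(len(action_values)), key=lambda a: action_values[a])


if __name__ == "__main__":
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
    print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1))
```

With `epsilon=0.0` the agent always exploits its current estimates, and with `epsilon=1.0` it always explores; values in between trade one off against the other.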