diff --git a/units/en/unit4/pg-theorem.mdx b/units/en/unit4/pg-theorem.mdx
index 55eea08..9bfee23 100644
--- a/units/en/unit4/pg-theorem.mdx
+++ b/units/en/unit4/pg-theorem.mdx
@@ -18,6 +18,7 @@ So we have:
 \\(\nabla_\theta J(\theta) = \nabla_\theta \sum_{\tau}P(\tau;\theta)R(\tau)\\)

+
 We can rewrite the gradient of the sum as the sum of the gradient:

 \\( = \sum_{\tau} \nabla_\theta P(\tau;\theta)R(\tau) \\)

@@ -34,16 +35,20 @@ We can then use the *derivative log trick* (also called *likelihood ratio trick*

 So given we have \\(\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)} \\) we transform it as \\(\nabla_\theta log P(\tau|\theta) \\)

+
+
 So this is our likelihood policy gradient:

 \\( \nabla_\theta J(\theta) = \sum_{\tau} P(\tau;\theta) \nabla_\theta log P(\tau;\theta) R(\tau) \\)

+
+
+
 Thanks for this new formula, we can estimate the gradient using trajectory samples (we can approximate the likelihood ratio policy gradient with sample-based estimate if you prefer).

-\\(\nabla_\theta J(\theta) = \frac{1}{m} \sum^{m}_{i=1} \nabla_\theta log P(\tau^{(i)};\theta)R(\tau^{(i)})\\)
+\\(\nabla_\theta J(\theta) = \frac{1}{m} \sum^{m}_{i=1} \nabla_\theta log P(\tau^{(i)};\theta)R(\tau^{(i)})\\) where each \\(\tau^{(i)}\\) is a sampled trajectory.

-where each \\(\tau(i)}\\) is a sampled trajectory.


 But we still have some mathematics work to do there: we need to simplify \\( \nabla_\theta log P(\tau|\theta) \\)

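For intuition, the sample-based estimator touched by the second hunk maps directly onto autodiff code. Below is a minimal PyTorch sketch (not part of the patch or the course code): the function name and the `log_probs`, `returns`, `features`, and `theta` variables are all illustrative, with `log_probs[i]` standing in for \\(log P(\tau^{(i)};\theta)\\) and `returns[i]` for \\(R(\tau^{(i)})\\).

```python
import torch

# Sketch of the sample-based likelihood-ratio estimate; all names are
# hypothetical. log_probs[i] stands for log P(tau_i; theta), built from
# differentiable policy parameters; returns[i] is the trajectory return
# R(tau_i), treated as a constant.
def policy_gradient_surrogate(log_probs: torch.Tensor,
                              returns: torch.Tensor) -> torch.Tensor:
    # Minimizing this surrogate ascends J(theta): its gradient is minus
    # (1/m) * sum_i grad_theta log P(tau_i; theta) * R(tau_i).
    return -(log_probs * returns.detach()).mean()

# Toy usage with a linear stand-in for log P(tau_i; theta).
theta = torch.zeros(2, requires_grad=True)          # policy parameters
features = torch.tensor([[1.0, 2.0], [0.5, -1.0]])  # one row per trajectory
log_probs = features @ theta                        # stand-in log-probabilities
returns = torch.tensor([1.0, -0.5])                 # R(tau_i) for each sample
policy_gradient_surrogate(log_probs, returns).backward()
print(theta.grad)  # equals -(1/m) * sum_i features[i] * returns[i]
```

In practice the per-trajectory log-probability is produced by the policy network itself, which is exactly why the page goes on to simplify \\( \nabla_\theta log P(\tau|\theta) \\) into per-step policy terms.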