Update maths

2026-04-01 09:40:26 +08:00 · 2023-01-03 09:58:54 +01:00
parent 53ad3d9a09
commit 8e0bbdb82e
1 changed files with 7 additions and 2 deletions
--- a/units/en/unit4/pg-theorem.mdx
+++ b/units/en/unit4/pg-theorem.mdx
@@ -18,6 +18,7 @@ So we have:

 \\(\nabla_\theta J(\theta) =  \nabla_\theta \sum_{\tau}P(\tau;\theta)R(\tau)\\)

+
 We can rewrite the gradient of the sum as the sum of the gradients:

 \\( =  \sum_{\tau} \nabla_\theta P(\tau;\theta)R(\tau) \\)
@@ -34,16 +35,20 @@ We can then use the *derivative log trick* (also called *likelihood ratio trick*

 So given we have \\(\frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)} \\) we transform it as \\(\nabla_\theta log P(\tau|\theta) \\)

+
+
 So this is our likelihood policy gradient:

 \\( \nabla_\theta J(\theta) = \sum_{\tau} P(\tau;\theta)  \nabla_\theta log P(\tau;\theta) R(\tau) \\)


+
+
+
 Thanks to this new formula, we can estimate the gradient using trajectory samples (in other words, we can approximate the likelihood ratio policy gradient with a sample-based estimate).

-\\(\nabla_\theta J(\theta) = \frac{1}{m} \sum^{m}_{i=1} \nabla_\theta log P(\tau^{(i)};\theta)R(\tau^{(i)})\\)
+\\(\nabla_\theta J(\theta) = \frac{1}{m} \sum^{m}_{i=1} \nabla_\theta log P(\tau^{(i)};\theta)R(\tau^{(i)})\\) where each \\(\tau^{(i)}\\) is a sampled trajectory.

-where each \\(\tau(i)}\\) is a sampled trajectory.

 But we still have some mathematical work to do: we need to simplify \\(  \nabla_\theta log P(\tau|\theta) \\)
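
Note on the sample-based estimate touched by this change: the sketch below is one way to sanity-check that averaging \\(\nabla_\theta log P(\tau^{(i)};\theta)R(\tau^{(i)})\\) over sampled trajectories approaches the exact sum over all trajectories. It is not part of the commit. It assumes a toy setting that does not appear in the course text: each "trajectory" is a single draw from a softmax distribution over three outcomes, with made-up rewards. All function names, parameter values, and rewards are hypothetical.

```python
import math
import random


def softmax(theta):
    # Numerically stable softmax over a list of logits.
    mx = max(theta)
    exps = [math.exp(t - mx) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(theta, k):
    # For a softmax distribution: d/dtheta_j log P(k) = 1[j == k] - P(j).
    probs = softmax(theta)
    return [(1.0 if j == k else 0.0) - probs[j] for j in range(len(theta))]


def exact_gradient(theta, rewards):
    # sum_tau P(tau; theta) * grad log P(tau; theta) * R(tau)
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for k, (p, r) in enumerate(zip(probs, rewards)):
        g = grad_log_prob(theta, k)
        for j in range(len(theta)):
            grad[j] += p * g[j] * r
    return grad


def sampled_gradient(theta, rewards, m, rng):
    # (1/m) * sum_i grad log P(tau_i; theta) * R(tau_i), with tau_i ~ P(.; theta)
    probs = softmax(theta)
    samples = rng.choices(range(len(theta)), weights=probs, k=m)
    grad = [0.0] * len(theta)
    for k in samples:
        g = grad_log_prob(theta, k)
        for j in range(len(theta)):
            grad[j] += g[j] * rewards[k] / m
    return grad


# Hypothetical toy values, chosen only for illustration.
theta = [0.2, -0.1, 0.5]
rewards = [1.0, 0.0, 2.0]
rng = random.Random(0)  # fixed seed so the run is reproducible

exact = exact_gradient(theta, rewards)
estimate = sampled_gradient(theta, rewards, m=50_000, rng=rng)
print("exact:   ", exact)
print("estimate:", estimate)
```

With enough samples the two printed vectors should agree closely. The remaining gap is sampling noise, which is why the sample-based form is an approximation of the gradient rather than an exact equality.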