[Reinforcement Learning] Policy Gradient for Stochastic Policies
Contents
- Objective function of the policy
- Theorem 1
- Theorem 2
- Theorem 3
- Theorem 4
- Theorem 5
Objective function of the policy
$$J(\pi)=\mathbb{E}_{\tau|\pi}[G_0]=\mathbb{E}_{\tau|\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\Big]$$
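As a minimal sketch of the objective above, the discounted return $G_0$ of one sampled episode can be accumulated backwards, and $J(\pi)$ can be approximated by averaging $G_0$ over episodes. The function names and the truncation to finite reward lists are assumptions made for illustration; the definition in the text sums over an infinite horizon.

```python
def discounted_return(rewards, gamma):
    """Return G_0 = sum_t gamma**t * r_t for a finite reward sequence."""
    g = 0.0
    # Accumulate backwards using G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def estimate_objective(episodes, gamma):
    """Monte Carlo estimate of J(pi): the mean of G_0 over sampled episodes.

    `episodes` is a list of reward lists, one per trajectory sampled under pi.
    """
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)
```

For example, `discounted_return([1, 1, 1], 0.5)` evaluates $1 + 0.5\cdot 1 + 0.25\cdot 1$.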
Theorem 1
$$\nabla_{\theta}J(\theta)=\mathbb{E}_{\tau|\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}G_{t}\nabla_{\theta}\ln\pi(a_t|s_t;\theta)\Big]$$
where
$$G_{t}=\sum_{k=0}^{\infty}\gamma^{k}r_{t+k}$$
Proof:
Write the objective as a sum over trajectories and differentiate:
$$J(\theta)=\sum_{\tau}G_{0}\,\pi(\tau;\theta)$$
$$\nabla_{\theta}J(\theta)=\sum_{\tau}G_{0}\nabla_{\theta}\pi(\tau;\theta)$$
By the log-derivative identity,
$$\nabla_{\theta}\pi(\tau;\theta)=\pi(\tau;\theta)\nabla_{\theta}\ln\pi(\tau;\theta)$$
The trajectory probability factorizes as
$$\pi(\tau;\theta)=p_{1}(s_{0})\prod_{i=0}^{\infty}\pi(a_{i}|s_{i};\theta)T(s_i,a_i,s_{i+1})$$
so
$$\ln\pi(\tau;\theta)=\sum_{i=0}^{\infty}\ln\pi(a_i|s_i;\theta)+\sum_{i=0}^{\infty}\ln T(s_i,a_i,s_{i+1})+\ln p_{1}(s_0)$$
Only the first sum depends on $\theta$, hence
$$\nabla_{\theta}\ln\pi(\tau;\theta)=\sum_{i=0}^{\infty}\nabla_{\theta}\ln\pi(a_i|s_i;\theta)$$
Substituting,
$$\nabla_{\theta}J(\theta)=\sum_{\tau}G_0\,\pi(\tau;\theta)\sum_{i=0}^{\infty}\nabla_{\theta}\ln\pi(a_i|s_i;\theta)=\mathbb{E}_{\tau|\pi}\Big[G_0\sum_{i=0}^{\infty}\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]$$
The factor $G_0$ can be replaced by $\gamma^{t}G_t$ term by term. Split $G_0=\sum_{k=0}^{t-1}\gamma^{k}r_k+\gamma^{t}G_t$. For $k<t$, the reward $r_k$ is determined before $a_t$ is drawn, and $\mathbb{E}_{a_t|s_t,\pi}[\nabla_{\theta}\ln\pi(a_t|s_t;\theta)]=\sum_{a}\nabla_{\theta}\pi(a|s_t;\theta)=\nabla_{\theta}1=0$, so $\mathbb{E}_{\tau|\pi}[r_k\nabla_{\theta}\ln\pi(a_t|s_t;\theta)]=0$. Therefore
$$\nabla_{\theta}J(\theta)=\mathbb{E}_{\tau|\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}G_{t}\nabla_{\theta}\ln\pi(a_t|s_t;\theta)\Big]$$
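The proof rests on the log-derivative identity $\nabla_{\theta}\pi=\pi\nabla_{\theta}\ln\pi$. The sketch below checks it numerically for a single state. It assumes a tabular softmax policy, which the text does not prescribe; `softmax`, `grad_log_pi`, and `grad_pi_finite_difference` are hypothetical helper names.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def grad_log_pi(theta_s, a):
    """Gradient of ln pi(a|s) w.r.t. the logits theta_s of one state.

    For a softmax policy: d ln pi(a) / d theta_s[b] = 1[a == b] - pi(b).
    """
    g = -softmax(theta_s)
    g[a] += 1.0
    return g


def grad_pi_finite_difference(theta_s, a, eps=1e-6):
    """Central finite-difference gradient of pi(a|s) w.r.t. theta_s."""
    g = np.zeros_like(theta_s)
    for b in range(len(theta_s)):
        step = np.zeros_like(theta_s)
        step[b] = eps
        g[b] = (softmax(theta_s + step)[a] - softmax(theta_s - step)[a]) / (2 * eps)
    return g
```

Comparing `softmax(theta_s)[a] * grad_log_pi(theta_s, a)` with `grad_pi_finite_difference(theta_s, a)` for each action should show agreement up to finite-difference error.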
Theorem 2
$$\nabla_{\theta}J(\theta)=\mathbb{E}_{\tau|\pi}\Big[\sum_{i=0}^{\infty}\gamma^{i}G_{i}\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]$$
Proof:
$$J(\theta)=\mathbb{E}_{\tau|\pi}[G_{0}]=\mathbb{E}_{s_0}\mathbb{E}_{\tau|s_0,\pi}[G_{0}]=\mathbb{E}_{s_0}V^{\theta}(s_0)$$
$$\nabla_{\theta}J(\theta)=\mathbb{E}_{s_0}\nabla_{\theta}V^{\theta}(s_0)$$
$$V^{\theta}(s_0)=\sum_{a}\pi(a|s_0;\theta)Q^{\theta}(s_0,a)$$
$$\nabla_{\theta}V^{\theta}(s_0)=\sum_{a}\big[\nabla_{\theta}\pi(a|s_0;\theta)Q^{\theta}(s_0,a)+\nabla_{\theta}Q^{\theta}(s_0,a)\pi(a|s_0;\theta)\big]$$
For the first term,
$$\begin{aligned}
\sum_{a}\nabla_{\theta}\pi(a|s_0;\theta)Q^{\theta}(s_0,a)
&=\sum_{a}\pi(a|s_0;\theta)\nabla_{\theta}\ln\pi(a|s_0;\theta)\,\mathbb{E}_{\tau|s_0,a_0=a,\pi}[G_0]\\
&=\mathbb{E}_{a_0|s_0,\pi}\{\nabla_{\theta}\ln\pi(a_0|s_0;\theta)\,\mathbb{E}_{\tau|s_0,a_0,\pi}[G_0]\}\\
&=\mathbb{E}_{a_0|s_0,\pi}\mathbb{E}_{\tau|s_0,a_0,\pi}[G_0\nabla_{\theta}\ln\pi(a_0|s_0;\theta)]\\
&=\mathbb{E}_{\tau|s_0,\pi}\{G_0\nabla_{\theta}\ln\pi(a_0|s_0;\theta)\}
\end{aligned}$$
For the second term,
$$\sum_{a}\nabla_{\theta}Q^{\theta}(s_0,a)\pi(a|s_0;\theta)=\mathbb{E}_{a_0|s_0,\pi}\nabla_{\theta}Q^{\theta}(s_0,a_0)$$
where
$$Q^{\theta}(s_0,a)=\sum_{s'}T(s_0,a,s')[r(s_0,a,s')+\gamma V^{\theta}(s')]$$
Since $T$ and $r$ do not depend on $\theta$,
$$\nabla_{\theta}Q^{\theta}(s_0,a_0)=\gamma\sum_{s'}T(s_0,a_0,s')\nabla_{\theta}V^{\theta}(s')=\gamma\,\mathbb{E}_{s_1|s_0,a_0}\nabla_{\theta}V^{\theta}(s_1)$$
Therefore
$$\sum_{a}\nabla_{\theta}Q^{\theta}(s_0,a)\pi(a|s_0;\theta)=\gamma\,\mathbb{E}_{a_0|s_0,\pi}\mathbb{E}_{s_1|s_0,a_0}\nabla_{\theta}V^{\theta}(s_1)=\gamma\,\mathbb{E}_{s_1|s_0,\pi}\nabla_{\theta}V^{\theta}(s_1)$$
$$\nabla_{\theta}V^{\theta}(s_0)=\mathbb{E}_{\tau|s_0,\pi}\{G_0\nabla_{\theta}\ln\pi(a_0|s_0;\theta)\}+\gamma\,\mathbb{E}_{s_1|s_0,\pi}\nabla_{\theta}V^{\theta}(s_1)$$
Similarly:
$$\nabla_{\theta}V^{\theta}(s_i)=\mathbb{E}_{\tau_i|s_i,\pi}\{G_i\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\}+\gamma\,\mathbb{E}_{s_{i+1}|s_{i},\pi}\nabla_{\theta}V^{\theta}(s_{i+1}),\quad i=1,2,\dots$$
Substituting $\nabla_{\theta}V^{\theta}(s_1)=\mathbb{E}_{\tau_1|s_1,\pi}\{G_1\nabla_{\theta}\ln\pi(a_1|s_1;\theta)\}+\gamma\,\mathbb{E}_{s_{2}|s_{1},\pi}\nabla_{\theta}V^{\theta}(s_{2})$ into $\gamma\,\mathbb{E}_{s_1|s_0,\pi}\nabla_{\theta}V^{\theta}(s_1)$ gives
$$\begin{aligned}
\gamma\,\mathbb{E}_{s_1|s_0,\pi}\nabla_{\theta}V^{\theta}(s_1)
&=\gamma\,\mathbb{E}_{s_1|s_0,\pi}\{\mathbb{E}_{\tau_1|s_1,\pi}\{G_1\nabla_{\theta}\ln\pi(a_1|s_1;\theta)\}+\gamma\,\mathbb{E}_{s_{2}|s_{1},\pi}\nabla_{\theta}V^{\theta}(s_{2})\}\\
&=\gamma\,\mathbb{E}_{\tau_1|s_0,\pi}[G_1\nabla_{\theta}\ln\pi(a_1|s_1;\theta)]+\gamma^2\,\mathbb{E}_{s_2|s_0,\pi}\nabla_{\theta}V^{\theta}(s_{2})\\
&=\mathbb{E}_{\tau|s_0,\pi}[\gamma G_1\nabla_{\theta}\ln\pi(a_1|s_1;\theta)]+\gamma^2\,\mathbb{E}_{s_2|s_0,\pi}\nabla_{\theta}V^{\theta}(s_{2})
\end{aligned}$$
Hence
$$\nabla_{\theta}V^{\theta}(s_0)=\mathbb{E}_{\tau|s_0,\pi}\{G_0\nabla_{\theta}\ln\pi(a_0|s_0;\theta)+\gamma G_1\nabla_{\theta}\ln\pi(a_1|s_1;\theta)\}+\gamma^2\,\mathbb{E}_{s_2|s_0,\pi}\nabla_{\theta}V^{\theta}(s_{2})$$
Next, substitute $\nabla_{\theta}V^{\theta}(s_2)=\mathbb{E}_{\tau_2|s_2,\pi}\{G_2\nabla_{\theta}\ln\pi(a_2|s_2;\theta)\}+\gamma\,\mathbb{E}_{s_{3}|s_{2},\pi}\nabla_{\theta}V^{\theta}(s_{3})$ into $\gamma^2\,\mathbb{E}_{s_2|s_0,\pi}\nabla_{\theta}V^{\theta}(s_{2})$, and so on.
Repeating this process (the remainder $\gamma^{n}\,\mathbb{E}_{s_n|s_0,\pi}\nabla_{\theta}V^{\theta}(s_n)$ vanishes as $n\to\infty$ when $\gamma<1$ and $\nabla_{\theta}V^{\theta}$ is bounded) gives
$$\nabla_{\theta}V^{\theta}(s_0)=\mathbb{E}_{\tau|s_0,\pi}\Big[\sum_{i=0}^{\infty}\gamma^{i}G_{i}\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]$$
$$\nabla_{\theta}J(\theta)=\mathbb{E}_{s_0}\mathbb{E}_{\tau|s_0,\pi}\Big[\sum_{i=0}^{\infty}\gamma^{i}G_i\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]=\mathbb{E}_{\tau|\pi}\Big[\sum_{i=0}^{\infty}\gamma^{i}G_i\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]$$
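The expression inside the expectation can be evaluated on a single sampled trajectory, which gives a one-sample estimate of the gradient. The sketch below assumes a tabular softmax policy with logits `theta[s, a]` and a finite (truncated) episode; both are illustrative assumptions, and `reward_to_go` and `trajectory_gradient` are hypothetical names.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def reward_to_go(rewards, gamma):
    """G_i = sum_k gamma**k * r_{i+k} for every step i of a finite episode."""
    G = np.zeros(len(rewards))
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        G[i] = running
    return G


def trajectory_gradient(theta, states, actions, rewards, gamma):
    """One-sample estimate of sum_i gamma**i * G_i * grad ln pi(a_i | s_i).

    theta is a float array of shape (num_states, num_actions) and
    pi(.|s) = softmax(theta[s]). The result has the same shape as theta.
    """
    G = reward_to_go(rewards, gamma)
    grad = np.zeros_like(theta)
    for i, (s, a) in enumerate(zip(states, actions)):
        # Softmax score: d ln pi(a|s) / d theta[s, b] = 1[a == b] - pi(b|s).
        score = -softmax(theta[s])
        score[a] += 1.0
        grad[s] += (gamma ** i) * G[i] * score
    return grad
```

Averaging `trajectory_gradient` over many trajectories sampled under $\pi$ approximates $\nabla_{\theta}J(\theta)$ for this parameterization, up to the truncation of the episode.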
Theorem 3
$$\nabla_{\theta}J(\theta)=\mathbb{E}_{\tau|\pi}\Big[\sum_{i=0}^{\infty}\gamma^{i}Q^{\theta}(s_i,a_i)\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]$$
The opening steps are the same as in the previous proof and are not repeated here.
$$\begin{aligned}
\sum_{a}\nabla_{\theta}\pi(a|s_0;\theta)Q^{\theta}(s_0,a)
&=\sum_{a}\pi(a|s_0;\theta)\nabla_{\theta}\ln\pi(a|s_0;\theta)Q^{\theta}(s_0,a)\\
&=\mathbb{E}_{a_0|s_0,\pi}\{\nabla_{\theta}\ln\pi(a_0|s_0;\theta)Q^{\theta}(s_0,a_0)\}\\
&=\mathbb{E}_{\tau|s_0,\pi}\{Q^{\theta}(s_0,a_0)\nabla_{\theta}\ln\pi(a_0|s_0;\theta)\}
\end{aligned}$$
$$\sum_{a}\nabla_{\theta}Q^{\theta}(s_0,a)\pi(a|s_0;\theta)=\mathbb{E}_{a_0|s_0,\pi}\nabla_{\theta}Q^{\theta}(s_0,a_0)$$
where
$$Q^{\theta}(s_0,a)=\sum_{s'}T(s_0,a,s')[r(s_0,a,s')+\gamma V^{\theta}(s')]$$
Since $T$ and $r$ do not depend on $\theta$,
$$\nabla_{\theta}Q^{\theta}(s_0,a_0)=\gamma\sum_{s'}T(s_0,a_0,s')\nabla_{\theta}V^{\theta}(s')=\gamma\,\mathbb{E}_{s_1|s_0,a_0}\nabla_{\theta}V^{\theta}(s_1)$$
Therefore
$$\sum_{a}\nabla_{\theta}Q^{\theta}(s_0,a)\pi(a|s_0;\theta)=\gamma\,\mathbb{E}_{a_0|s_0,\pi}\mathbb{E}_{s_1|s_0,a_0}\nabla_{\theta}V^{\theta}(s_1)=\gamma\,\mathbb{E}_{s_1|s_0,\pi}\nabla_{\theta}V^{\theta}(s_1)$$
$$\nabla_{\theta}V^{\theta}(s_0)=\mathbb{E}_{\tau|s_0,\pi}\{Q^{\theta}(s_0,a_0)\nabla_{\theta}\ln\pi(a_0|s_0;\theta)\}+\gamma\,\mathbb{E}_{s_1|s_0,\pi}\nabla_{\theta}V^{\theta}(s_1)$$
Similarly:
$$\nabla_{\theta}V^{\theta}(s_i)=\mathbb{E}_{\tau_i|s_i,\pi}\{Q^{\theta}(s_i,a_i)\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\}+\gamma\,\mathbb{E}_{s_{i+1}|s_{i},\pi}\nabla_{\theta}V^{\theta}(s_{i+1}),\quad i=1,2,\dots$$
Substituting $\nabla_{\theta}V^{\theta}(s_1)=\mathbb{E}_{\tau_1|s_1,\pi}\{Q^{\theta}(s_1,a_1)\nabla_{\theta}\ln\pi(a_1|s_1;\theta)\}+\gamma\,\mathbb{E}_{s_{2}|s_{1},\pi}\nabla_{\theta}V^{\theta}(s_{2})$ into $\gamma\,\mathbb{E}_{s_1|s_0,\pi}\nabla_{\theta}V^{\theta}(s_1)$ gives
$$\begin{aligned}
\gamma\,\mathbb{E}_{s_1|s_0,\pi}\nabla_{\theta}V^{\theta}(s_1)
&=\gamma\,\mathbb{E}_{s_1|s_0,\pi}\{\mathbb{E}_{\tau_1|s_1,\pi}\{Q^{\theta}(s_1,a_1)\nabla_{\theta}\ln\pi(a_1|s_1;\theta)\}+\gamma\,\mathbb{E}_{s_{2}|s_{1},\pi}\nabla_{\theta}V^{\theta}(s_{2})\}\\
&=\gamma\,\mathbb{E}_{\tau_1|s_0,\pi}[Q^{\theta}(s_1,a_1)\nabla_{\theta}\ln\pi(a_1|s_1;\theta)]+\gamma^2\,\mathbb{E}_{s_2|s_0,\pi}\nabla_{\theta}V^{\theta}(s_{2})\\
&=\mathbb{E}_{\tau|s_0,\pi}[\gamma Q^{\theta}(s_1,a_1)\nabla_{\theta}\ln\pi(a_1|s_1;\theta)]+\gamma^2\,\mathbb{E}_{s_2|s_0,\pi}\nabla_{\theta}V^{\theta}(s_{2})
\end{aligned}$$
Hence
$$\nabla_{\theta}V^{\theta}(s_0)=\mathbb{E}_{\tau|s_0,\pi}\{Q^{\theta}(s_0,a_0)\nabla_{\theta}\ln\pi(a_0|s_0;\theta)+\gamma Q^{\theta}(s_1,a_1)\nabla_{\theta}\ln\pi(a_1|s_1;\theta)\}+\gamma^2\,\mathbb{E}_{s_2|s_0,\pi}\nabla_{\theta}V^{\theta}(s_{2})$$
Next, substitute $\nabla_{\theta}V^{\theta}(s_2)=\mathbb{E}_{\tau_2|s_2,\pi}\{Q^{\theta}(s_2,a_2)\nabla_{\theta}\ln\pi(a_2|s_2;\theta)\}+\gamma\,\mathbb{E}_{s_{3}|s_{2},\pi}\nabla_{\theta}V^{\theta}(s_{3})$ into $\gamma^2\,\mathbb{E}_{s_2|s_0,\pi}\nabla_{\theta}V^{\theta}(s_{2})$, and so on.
Repeating this process gives
$$\nabla_{\theta}V^{\theta}(s_0)=\mathbb{E}_{\tau|s_0,\pi}\Big[\sum_{i=0}^{\infty}\gamma^{i}Q^{\theta}(s_i,a_i)\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]$$
$$\nabla_{\theta}J(\theta)=\mathbb{E}_{s_0}\mathbb{E}_{\tau|s_0,\pi}\Big[\sum_{i=0}^{\infty}\gamma^{i}Q^{\theta}(s_i,a_i)\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]=\mathbb{E}_{\tau|\pi}\Big[\sum_{i=0}^{\infty}\gamma^{i}Q^{\theta}(s_i,a_i)\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]$$
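This form needs $Q^{\theta}$ rather than sampled returns. For a small finite MDP, $V^{\theta}$ and $Q^{\theta}$ can be obtained exactly from the two relations used in the proof, $V^{\theta}(s)=\sum_a\pi(a|s;\theta)Q^{\theta}(s,a)$ and $Q^{\theta}(s,a)=\sum_{s'}T(s,a,s')[r(s,a,s')+\gamma V^{\theta}(s')]$. The sketch below assumes a tabular softmax policy and array layouts `T[s, a, s2]` and `R[s, a, s2]`; these are illustrative choices, not something the text fixes.

```python
import numpy as np


def softmax_rows(theta):
    """Row-wise softmax: pi[s, a] = pi(a | s; theta) for tabular logits."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)


def evaluate_policy(theta, T, R, gamma):
    """Solve the Bellman equations exactly for V^theta and Q^theta.

    T[s, a, s2] is the transition probability and R[s, a, s2] the reward.
    """
    pi = softmax_rows(theta)
    r_sa = (T * R).sum(axis=2)                 # expected one-step reward of (s, a)
    P_pi = np.einsum("sa,sat->st", pi, T)      # state-to-state kernel under pi
    r_pi = (pi * r_sa).sum(axis=1)             # expected one-step reward of s under pi
    n_states = theta.shape[0]
    # V = r_pi + gamma * P_pi V  is a linear system in V.
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    Q = r_sa + gamma * (T @ V)
    return V, Q
```

The returned arrays satisfy `V[s] == (pi[s] * Q[s]).sum()` up to rounding, which is the identity the proof differentiates.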
Corollary.
$$\nabla_{\theta}J(\theta)=\sum_{s}\sum_{a}\sum_{i=0}^{\infty}\gamma^{i}\Pr[s_i=s,a_i=a|\pi]\,Q^{\theta}(s,a)\nabla_{\theta}\ln\pi(a|s;\theta)$$
Proof: By linearity of expectation,
$$\nabla_{\theta}J(\theta)=\sum_{i=0}^{\infty}\gamma^{i}\,\mathbb{E}_{\tau|\pi}[Q^{\theta}(s_i,a_i)\nabla_{\theta}\ln\pi(a_i|s_i;\theta)]$$
Each expectation depends only on the pair $(s_i,a_i)$:
$$\mathbb{E}_{\tau|\pi}[Q^{\theta}(s_i,a_i)\nabla_{\theta}\ln\pi(a_i|s_i;\theta)]=\sum_{s}\sum_{a}\Pr[s_i=s,a_i=a|\pi]\,Q^{\theta}(s,a)\nabla_{\theta}\ln\pi(a|s;\theta)$$
Hence
$$\nabla_{\theta}J(\theta)=\sum_{i=0}^{\infty}\sum_{s}\sum_{a}\gamma^{i}\Pr[s_i=s,a_i=a|\pi]\,Q^{\theta}(s,a)\nabla_{\theta}\ln\pi(a|s;\theta)=\sum_{s}\sum_{a}\sum_{i=0}^{\infty}\gamma^{i}\Pr[s_i=s,a_i=a|\pi]\,Q^{\theta}(s,a)\nabla_{\theta}\ln\pi(a|s;\theta)$$
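On a small finite MDP the corollary can be checked exactly, because the discounted weights $\sum_i\gamma^{i}\Pr[s_i=s\,|\,\pi]$ have the closed form $p_0^{\top}(I-\gamma P_{\pi})^{-1}$. The sketch below computes $J(\theta)$ and the corollary's right-hand side for a tabular softmax policy, so the result can be compared against a finite-difference gradient of $J$. The softmax parameterization, the array layouts, and the name `objective_and_gradient` are assumptions made for illustration.

```python
import numpy as np


def softmax_rows(theta):
    """Row-wise softmax: pi[s, a] = pi(a | s; theta) for tabular logits."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)


def objective_and_gradient(theta, T, R, p0, gamma):
    """Return J(theta) and the corollary's gradient formula.

    T[s, a, s2]: transition probabilities, R[s, a, s2]: rewards,
    p0[s]: initial state distribution.
    """
    n_s, n_a = theta.shape
    pi = softmax_rows(theta)
    r_sa = (T * R).sum(axis=2)
    P_pi = np.einsum("sa,sat->st", pi, T)
    A = np.eye(n_s) - gamma * P_pi
    V = np.linalg.solve(A, (pi * r_sa).sum(axis=1))
    Q = r_sa + gamma * (T @ V)
    # d[s] = sum_i gamma**i * Pr[s_i = s | pi], from d^T = p0^T (I - gamma P_pi)^{-1}.
    d = np.linalg.solve(A.T, p0)
    # weight[s, a] = sum_i gamma**i * Pr[s_i = s, a_i = a | pi].
    weight = d[:, None] * pi
    grad = np.zeros_like(theta)
    for s in range(n_s):
        for a in range(n_a):
            # Softmax score: d ln pi(a|s) / d theta[s, b] = 1[a == b] - pi(b|s).
            score = -pi[s].copy()
            score[a] += 1.0
            grad[s] += weight[s, a] * Q[s, a] * score
    return p0 @ V, grad
```

Perturbing one entry of `theta` by a small step and differencing the returned `J` should reproduce the matching entry of `grad`.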
Theorem 4
$$\mathbb{E}_{(s_i,a_i)|\pi}\big[\nabla_{\theta}\ln\pi(a_i|s_i;\theta)Q^{\theta}(s_i,a_i)\big]=\sum_{s}\sum_{a}\Pr[s_i=s,a_i=a|\pi]\,\nabla_{\theta}\ln\pi(a|s;\theta)Q^{\theta}(s,a)$$
Proof:
Lemma. From
$$Q^{\theta}(s,a)=\sum_{s'}T(s,a,s')[r(s,a,s')+\gamma V^{\theta}(s')]$$
we get
$$\nabla_{\theta}Q^{\theta}(s,a)=\gamma\sum_{s'}T(s,a,s')\nabla_{\theta}V^{\theta}(s')$$
and
$$\nabla_{\theta}V^{\theta}(s')=\nabla_{\theta}\sum_{a'}\pi(a'|s';\theta)Q^{\theta}(s',a')=\sum_{a'}\big[\nabla_{\theta}\pi(a'|s';\theta)Q^{\theta}(s',a')+\pi(a'|s';\theta)\nabla_{\theta}Q^{\theta}(s',a')\big]$$
Therefore
$$\begin{aligned}
\nabla_{\theta}Q^{\theta}(s,a)
&=\gamma\sum_{s'}\sum_{a'}T(s,a,s')\big[\nabla_{\theta}\pi(a'|s';\theta)Q^{\theta}(s',a')+\pi(a'|s';\theta)\nabla_{\theta}Q^{\theta}(s',a')\big]\\
&=\gamma\sum_{s'}\sum_{a'}T(s,a,s')\pi(a'|s';\theta)\big[\nabla_{\theta}\ln\pi(a'|s';\theta)Q^{\theta}(s',a')+\nabla_{\theta}Q^{\theta}(s',a')\big]
\end{aligned}$$
Written with time indices,
$$\begin{aligned}
\nabla_{\theta}Q^{\theta}(s_i,a_i)
&=\gamma\sum_{s_{i+1}}\sum_{a_{i+1}}T(s_i,a_i,s_{i+1})\pi(a_{i+1}|s_{i+1};\theta)\big[\nabla_{\theta}\ln\pi(a_{i+1}|s_{i+1};\theta)Q^{\theta}(s_{i+1},a_{i+1})+\nabla_{\theta}Q^{\theta}(s_{i+1},a_{i+1})\big]\\
&=\gamma\,\mathbb{E}_{(s_{i+1},a_{i+1})|(s_i,a_i),\pi}\big[\nabla_{\theta}\ln\pi(a_{i+1}|s_{i+1};\theta)Q^{\theta}(s_{i+1},a_{i+1})+\nabla_{\theta}Q^{\theta}(s_{i+1},a_{i+1})\big]
\end{aligned}$$
$$J(\theta)=\mathbb{E}_{s_0}V^{\theta}(s_0)=\mathbb{E}_{s_0}\sum_a\pi(a|s_0;\theta)Q^{\theta}(s_0,a)$$
$$\nabla_{\theta}J(\theta)=\mathbb{E}_{s_0}\sum_a\nabla_{\theta}\pi(a|s_0;\theta)Q^{\theta}(s_0,a)+\mathbb{E}_{s_0}\sum_{a}\pi(a|s_0;\theta)\nabla_{\theta}Q^{\theta}(s_0,a)$$
$$\mathbb{E}_{s_0}\sum_a\nabla_{\theta}\pi(a|s_0;\theta)Q^{\theta}(s_0,a)=\mathbb{E}_{s_0}\sum_a\pi(a|s_0;\theta)\nabla_{\theta}\ln\pi(a|s_0;\theta)Q^{\theta}(s_0,a)=\mathbb{E}_{(s_0,a_0)|\pi}\nabla_{\theta}\ln\pi(a_0|s_0;\theta)Q^{\theta}(s_0,a_0)$$
$$\begin{aligned}
\mathbb{E}_{s_0}\sum_{a}\pi(a|s_0;\theta)\nabla_{\theta}Q^{\theta}(s_0,a)
&=\mathbb{E}_{s_0}\mathbb{E}_{a_0|s_0,\pi}\nabla_{\theta}Q^{\theta}(s_0,a_0)\\
&=\mathbb{E}_{(s_0,a_0)|\pi}\nabla_{\theta}Q^{\theta}(s_0,a_0)
\end{aligned}$$
At this point,
$$\nabla_{\theta}J(\theta)=\mathbb{E}_{(s_0,a_0)|\pi}\nabla_{\theta}\ln\pi(a_0|s_0;\theta)Q^{\theta}(s_0,a_0)+\mathbb{E}_{(s_0,a_0)|\pi}\nabla_{\theta}Q^{\theta}(s_0,a_0)$$
By the lemma:
$$\nabla_{\theta}Q^{\theta}(s_0,a_0)=\gamma\,\mathbb{E}_{(s_1,a_1)|(s_0,a_0),\pi}\big[\nabla_{\theta}\ln\pi(a_1|s_1;\theta)Q^{\theta}(s_1,a_1)+\nabla_{\theta}Q^{\theta}(s_1,a_1)\big]$$
$$\begin{aligned}
\mathbb{E}_{(s_0,a_0)|\pi}\nabla_{\theta}Q^{\theta}(s_0,a_0)
&=\gamma\,\mathbb{E}_{(s_0,a_0)|\pi}\mathbb{E}_{(s_1,a_1)|(s_0,a_0),\pi}\big[\nabla_{\theta}\ln\pi(a_1|s_1;\theta)Q^{\theta}(s_1,a_1)+\nabla_{\theta}Q^{\theta}(s_1,a_1)\big]\\
&=\gamma\,\mathbb{E}_{(s_1,a_1)|\pi}\big[\nabla_{\theta}\ln\pi(a_1|s_1;\theta)Q^{\theta}(s_1,a_1)+\nabla_{\theta}Q^{\theta}(s_1,a_1)\big]
\end{aligned}$$
At this point,
$$\nabla_{\theta}J(\theta)=\mathbb{E}_{(s_0,a_0)|\pi}\nabla_{\theta}\ln\pi(a_0|s_0;\theta)Q^{\theta}(s_0,a_0)+\gamma\,\mathbb{E}_{(s_1,a_1)|\pi}\nabla_{\theta}\ln\pi(a_1|s_1;\theta)Q^{\theta}(s_1,a_1)+\gamma\,\mathbb{E}_{(s_1,a_1)|\pi}\nabla_{\theta}Q^{\theta}(s_1,a_1)$$
Repeating this process gives
$$\nabla_{\theta}J(\theta)=\sum_{i=0}^{\infty}\gamma^{i}\,\mathbb{E}_{(s_i,a_i)|\pi}\nabla_{\theta}\ln\pi(a_i|s_i;\theta)Q^{\theta}(s_i,a_i)$$
where
$$\mathbb{E}_{(s_i,a_i)|\pi}\nabla_{\theta}\ln\pi(a_i|s_i;\theta)Q^{\theta}(s_i,a_i)=\sum_{s}\sum_{a}\Pr[s_i=s,a_i=a|\pi]\,\nabla_{\theta}\ln\pi(a|s;\theta)Q^{\theta}(s,a)$$
and therefore
$$\nabla_{\theta}J(\theta)=\sum_{s}\sum_{a}\sum_{i=0}^{\infty}\gamma^{i}\Pr[s_i=s,a_i=a|\pi]\,\nabla_{\theta}\ln\pi(a|s;\theta)Q^{\theta}(s,a)$$
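The weights $\sum_{i}\gamma^{i}\Pr[s_i=s,a_i=a|\pi]$ in the last formula can also be built up step by step, mirroring how the proof unrolls one time step at a time. The sketch below propagates the state distribution forward and accumulates the discounted joint probabilities up to a finite horizon. The truncation, the array layouts, and the name `discounted_occupancy` are assumptions for illustration.

```python
import numpy as np


def discounted_occupancy(pi, T, p0, gamma, horizon):
    """Truncated sum_{i < horizon} gamma**i * Pr[s_i = s, a_i = a | pi].

    pi[s, a]: policy probabilities, T[s, a, s2]: transition probabilities,
    p0[s]: initial state distribution. Returns an array shaped like pi.
    """
    state_dist = p0.copy()                      # Pr[s_0 = s]
    total = np.zeros_like(pi)
    for i in range(horizon):
        joint = state_dist[:, None] * pi        # Pr[s_i = s, a_i = a | pi]
        total += (gamma ** i) * joint
        # Pr[s_{i+1} = s2] = sum_{s, a} Pr[s_i = s, a_i = a] * T[s, a, s2].
        state_dist = np.einsum("sa,sat->t", joint, T)
    return total
```

Because each joint distribution sums to one, the entries of the result add up to $(1-\gamma^{H})/(1-\gamma)$ for horizon $H$, and for large $H$ the array approaches the infinite sum used in the theorem.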
Theorem 5
For a fixed state $s_t$,
$$\mathbb{E}_{a_t|s_t,\pi}\big[\gamma^{t}G_{t}\nabla_{\theta}\ln\pi(a_t|s_t;\theta)\big]=\mathbb{E}_{a_t|s_t,\pi'}\Big[\frac{1}{\pi'(a_t|s_t)}\gamma^{t}G_{t}\nabla_{\theta}\pi(a_t|s_t;\theta)\Big]$$
where $\pi'$ is any policy with $\pi'(a|s_t)>0$ for every action $a$, and $G_t$ is treated as a function of $(s_t,a_t)$.
Proof:
Since $\nabla_{\theta}\ln\pi(a_{t}|s_{t};\theta)=\nabla_{\theta}\pi(a_t|s_t;\theta)/\pi(a_t|s_t;\theta)$, substituting gives
$$\begin{aligned}
\mathbb{E}_{a_t|s_t,\pi}\big[\gamma^{t}G_{t}\nabla_{\theta}\ln\pi(a_t|s_t;\theta)\big]
&=\sum_{a_t\in\mathcal{A}}\pi(a_t|s_t;\theta)\,\gamma^{t}G_{t}\nabla_{\theta}\pi(a_t|s_t;\theta)/\pi(a_t|s_t;\theta)\\
&=\sum_{a_t\in\mathcal{A}}\gamma^{t}G_{t}\nabla_{\theta}\pi(a_t|s_t;\theta)\\
&=\sum_{a_t\in\mathcal{A}}\pi'(a_t|s_t)\frac{1}{\pi'(a_t|s_t)}\gamma^{t}G_{t}\nabla_{\theta}\pi(a_t|s_t;\theta)\\
&=\mathbb{E}_{a_t|s_t,\pi'}\Big[\frac{1}{\pi'(a_t|s_t)}\gamma^{t}G_{t}\nabla_{\theta}\pi(a_t|s_t;\theta)\Big]
\end{aligned}$$
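The identity can be checked numerically for one state by evaluating both sides as exact sums over actions. The sketch below assumes a softmax policy over the logits `theta_s` of that state, a vector `f` standing in for $\gamma^{t}G_t$ as a function of the action, and a strictly positive behavior distribution `behavior` playing the role of $\pi'$. All of these names are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def on_policy_term(theta_s, f):
    """sum_a pi(a) * f(a) * grad ln pi(a), for one fixed state."""
    pi = softmax(theta_s)
    # Row a of `score` is grad ln pi(a) w.r.t. theta_s: 1[a == b] - pi(b).
    score = np.eye(len(pi)) - pi
    return (pi * f) @ score


def off_policy_term(theta_s, f, behavior):
    """sum_a pi'(a) * (1 / pi'(a)) * f(a) * grad pi(a), with pi' = behavior."""
    pi = softmax(theta_s)
    # Row a of `grad_pi` is grad pi(a) w.r.t. theta_s: pi(a) * (1[a == b] - pi(b)).
    grad_pi = pi[:, None] * (np.eye(len(pi)) - pi)
    weights = f / behavior                      # importance-style factor 1 / pi'(a)
    return (behavior * weights) @ grad_pi
```

In practice the right-hand side would be estimated from actions sampled under $\pi'$; the exact sums here only illustrate that the two expectations coincide when $\pi'$ gives every action positive probability.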