[Reinforcement Learning] Policy Gradients for Stochastic Policies

Contents

  • Objective function of the policy
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5

Objective function of the policy

$$J(\pi)=\mathbb{E}_{\tau|\pi}[G_0]=\mathbb{E}_{\tau|\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\Big]$$
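As a rough illustration of this objective, the sketch below estimates $J(\pi)$ by averaging the discounted return $G_0$ over sampled reward sequences. It assumes finite trajectories given as plain lists of rewards; the function names `discounted_return` and `estimate_objective` are hypothetical and not from the original post.

```python
def discounted_return(rewards, gamma):
    """Return G_0 = sum_t gamma**t * r_t for one finite reward sequence."""
    g = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def estimate_objective(trajectories, gamma):
    """Monte Carlo estimate of J(pi): the mean of G_0 over sampled trajectories."""
    returns = [discounted_return(rewards, gamma) for rewards in trajectories]
    return sum(returns) / len(returns)


if __name__ == "__main__":
    # Two hand-written reward sequences standing in for rollouts of the policy.
    sample = [[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]]
    print(estimate_objective(sample, gamma=0.5))
```

In practice the reward lists would come from rolling out the policy in the environment; here they are hard-coded only to keep the sketch self-contained.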

Theorem 1

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{\tau|\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}G_{t}\nabla_{\theta}\ln\pi(a_t|s_t;\theta)\Big]$$

where the return from step $t$ onward is

$$G_{t}=\sum_{k=0}^{\infty}\gamma^{k}r_{t+k}$$
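The returns $G_t$ for every step of one trajectory can be computed in a single backward pass using $G_t=r_t+\gamma G_{t+1}$. The sketch below does this for a finite reward list and also forms the coefficients $\gamma^{t}G_{t}$ that multiply $\nabla_{\theta}\ln\pi(a_t|s_t;\theta)$ in the theorem. The names `rewards_to_go` and `score_weights` are hypothetical, and truncating to a finite trajectory is an assumption.

```python
def rewards_to_go(rewards, gamma):
    """Return [G_0, G_1, ...] with G_t = sum_k gamma**k * r_{t+k}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running  # G_t = r_t + gamma * G_{t+1}
        returns[t] = running
    return returns


def score_weights(rewards, gamma):
    """Return the coefficients gamma**t * G_t used in the gradient formula."""
    return [gamma ** t * g for t, g in enumerate(rewards_to_go(rewards, gamma))]


if __name__ == "__main__":
    print(rewards_to_go([1.0, 2.0, 3.0], gamma=0.5))
    print(score_weights([1.0, 2.0, 3.0], gamma=0.5))
```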

Proof:

$$J(\theta)=\sum_{\tau}G_{0}\,\pi(\tau;\theta)$$

$$\nabla_{\theta}J(\theta)=\sum_{\tau}G_{0}\nabla_{\theta}\pi(\tau;\theta)$$

$$\nabla_{\theta}\pi(\tau;\theta)=\pi(\tau;\theta)\nabla_{\theta}\ln\pi(\tau;\theta)$$

$$\pi(\tau;\theta)=p_{1}(s_{0})\prod_{i=0}^{\infty}\pi(a_{i}|s_{i};\theta)T(s_i,a_i,s_{i+1})$$

$$\ln\pi(\tau;\theta)=\sum_{i=0}^{\infty}\ln\pi(a_i|s_i;\theta)+\sum_{i=0}^{\infty}\ln T(s_i,a_i,s_{i+1})+\ln p_{1}(s_0)$$

Only the first sum depends on $\theta$, so

$$\nabla_{\theta}\ln\pi(\tau;\theta)=\sum_{i=0}^{\infty}\nabla_{\theta}\ln\pi(a_i|s_i;\theta)$$

$$\nabla_{\theta}J(\theta)=\sum_{\tau}G_0\pi(\tau;\theta)\sum_{i=0}^{\infty}\nabla_{\theta}\ln\pi(a_i|s_i;\theta)=\mathbb{E}_{\tau|\pi}\Big[G_0\sum_{i=0}^{\infty}\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]$$

Finally, write $G_0=\sum_{k<i}\gamma^{k}r_k+\gamma^{i}G_i$. The rewards $r_k$ with $k<i$ are determined before $a_i$ is drawn, and $\mathbb{E}_{a_i|s_i,\pi}[\nabla_{\theta}\ln\pi(a_i|s_i;\theta)]=\nabla_{\theta}\sum_{a}\pi(a|s_i;\theta)=0$, so those terms vanish in expectation. In the $i$-th term $G_0$ can therefore be replaced by $\gamma^{i}G_i$, which gives the stated formula.
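The log-derivative identity $\nabla_{\theta}\pi=\pi\nabla_{\theta}\ln\pi$ used in the proof can be checked numerically. The sketch below assumes a softmax policy over three actions whose parameters are the logits themselves (an assumption for illustration; the post does not fix a parameterization). For that policy, $\nabla_{\theta}\ln\pi(a)$ is the one-hot vector of $a$ minus $\pi$. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(logits, a):
    """Gradient of ln pi(a) with respect to the logits: one_hot(a) - pi."""
    pi = softmax(logits)
    return [(1.0 if b == a else 0.0) - pi[b] for b in range(len(logits))]


def finite_diff_grad_prob(logits, a, eps=1e-6):
    """Central finite-difference gradient of pi(a) with respect to the logits."""
    grads = []
    for b in range(len(logits)):
        up = list(logits)
        dn = list(logits)
        up[b] += eps
        dn[b] -= eps
        grads.append((softmax(up)[a] - softmax(dn)[a]) / (2 * eps))
    return grads


if __name__ == "__main__":
    logits = [0.3, -1.2, 0.8]
    pi = softmax(logits)
    for a in range(3):
        analytic = [pi[a] * g for g in grad_log_softmax(logits, a)]  # pi * grad ln pi
        numeric = finite_diff_grad_prob(logits, a)                   # grad pi
        print(a, max(abs(x - y) for x, y in zip(analytic, numeric)))
```

The printed differences should be at the level of finite-difference noise, which is what the identity predicts.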

Theorem 2

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{\tau|\pi}\Big[\sum_{i=0}^{\infty}\gamma^{i}G_{i}\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]$$

This is the same formula as in Theorem 1; the proof below reaches it through the value function instead of the trajectory distribution.

Proof:

$$J(\theta)=\mathbb{E}_{\tau|\pi}[G_{0}]=\mathbb{E}_{s_0}\mathbb{E}_{\tau|s_0,\pi}[G_{0}]=\mathbb{E}_{s_0}V^{\theta}(s_0)$$

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{s_0}\nabla_{\theta}V^{\theta}(s_0)$$

$$V^{\theta}(s_0)=\sum_{a}\pi(a|s_0;\theta)Q^{\theta}(s_0,a)$$

$$\nabla_{\theta}V^{\theta}(s_0)=\sum_{a}\big[\nabla_{\theta}\pi(a|s_0;\theta)Q^{\theta}(s_0,a)+\nabla_{\theta}Q^{\theta}(s_0,a)\pi(a|s_0;\theta)\big]$$

$$\begin{aligned}
\sum_{a}\nabla_{\theta}\pi(a|s_0;\theta)Q^{\theta}(s_0,a)
&=\sum_{a}\pi(a|s_0;\theta)\nabla_{\theta}\ln\pi(a|s_0;\theta)\,\mathbb{E}_{\tau|s_0,a_0=a,\pi}[G_0]\\
&=\mathbb{E}_{a_0|s_0,\pi}\big\{\nabla_{\theta}\ln\pi(a_0|s_0;\theta)\,\mathbb{E}_{\tau|s_0,a_0,\pi}[G_0]\big\}\\
&=\mathbb{E}_{a_0|s_0,\pi}\mathbb{E}_{\tau|s_0,a_0,\pi}\big[G_0\nabla_{\theta}\ln\pi(a_0|s_0;\theta)\big]\\
&=\mathbb{E}_{\tau|s_0,\pi}\big\{G_0\nabla_{\theta}\ln\pi(a_0|s_0;\theta)\big\}
\end{aligned}$$

$$\sum_{a}\nabla_{\theta}Q^{\theta}(s_0,a)\pi(a|s_0;\theta)=\mathbb{E}_{a_0|s_0,\pi}\nabla_{\theta}Q^{\theta}(s_0,a_0)$$

where $Q^{\theta}(s_0,a)=\sum_{s'}T(s_0,a,s')[r(s_0,a,s')+\gamma V^{\theta}(s')]$. Since $T$ and $r$ do not depend on $\theta$,

$$\nabla_{\theta}Q^{\theta}(s_0,a_0)=\gamma\sum_{s'}T(s_0,a_0,s')\nabla_{\theta}V^{\theta}(s')=\gamma\,\mathbb{E}_{s_1|s_0,a_0}\nabla_{\theta}V^{\theta}(s_1)$$

Therefore

$$\sum_{a}\nabla_{\theta}Q^{\theta}(s_0,a)\pi(a|s_0;\theta)=\gamma\,\mathbb{E}_{a_0|s_0,\pi}\mathbb{E}_{s_1|s_0,a_0}\nabla_{\theta}V^{\theta}(s_1)=\gamma\,\mathbb{E}_{s_1|s_0,\pi}\nabla_{\theta}V^{\theta}(s_1)$$

$$\nabla_{\theta}V^{\theta}(s_0)=\mathbb{E}_{\tau|s_0,\pi}\big\{G_0\nabla_{\theta}\ln\pi(a_0|s_0;\theta)\big\}+\gamma\,\mathbb{E}_{s_1|s_0,\pi}\nabla_{\theta}V^{\theta}(s_1)$$

Similarly:

$$\nabla_{\theta}V^{\theta}(s_i)=\mathbb{E}_{\tau_i|s_i,\pi}\big\{G_i\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\big\}+\gamma\,\mathbb{E}_{s_{i+1}|s_{i},\pi}\nabla_{\theta}V^{\theta}(s_{i+1}),\quad i=1,2,\ldots$$

Substituting $\nabla_{\theta}V^{\theta}(s_1)=\mathbb{E}_{\tau_1|s_1,\pi}\{G_1\nabla_{\theta}\ln\pi(a_1|s_1;\theta)\}+\gamma\,\mathbb{E}_{s_{2}|s_{1},\pi}\nabla_{\theta}V^{\theta}(s_{2})$ into $\gamma\,\mathbb{E}_{s_1|s_0,\pi}\nabla_{\theta}V^{\theta}(s_1)$ gives

$$\begin{aligned}
\gamma\,\mathbb{E}_{s_1|s_0,\pi}\nabla_{\theta}V^{\theta}(s_1)
&=\gamma\,\mathbb{E}_{s_1|s_0,\pi}\big\{\mathbb{E}_{\tau_1|s_1,\pi}\{G_1\nabla_{\theta}\ln\pi(a_1|s_1;\theta)\}+\gamma\,\mathbb{E}_{s_{2}|s_{1},\pi}\nabla_{\theta}V^{\theta}(s_{2})\big\}\\
&=\gamma\,\mathbb{E}_{\tau_1|s_0,\pi}\big[G_1\nabla_{\theta}\ln\pi(a_1|s_1;\theta)\big]+\gamma^{2}\,\mathbb{E}_{s_2|s_0,\pi}\nabla_{\theta}V^{\theta}(s_{2})\\
&=\mathbb{E}_{\tau|s_0,\pi}\big[\gamma G_1\nabla_{\theta}\ln\pi(a_1|s_1;\theta)\big]+\gamma^{2}\,\mathbb{E}_{s_2|s_0,\pi}\nabla_{\theta}V^{\theta}(s_{2})
\end{aligned}$$

Hence

$$\nabla_{\theta}V^{\theta}(s_0)=\mathbb{E}_{\tau|s_0,\pi}\big\{G_0\nabla_{\theta}\ln\pi(a_0|s_0;\theta)+\gamma G_1\nabla_{\theta}\ln\pi(a_1|s_1;\theta)\big\}+\gamma^{2}\,\mathbb{E}_{s_2|s_0,\pi}\nabla_{\theta}V^{\theta}(s_{2})$$

Next, substitute $\nabla_{\theta}V^{\theta}(s_2)=\mathbb{E}_{\tau_2|s_2,\pi}\{G_2\nabla_{\theta}\ln\pi(a_2|s_2;\theta)\}+\gamma\,\mathbb{E}_{s_{3}|s_{2},\pi}\nabla_{\theta}V^{\theta}(s_{3})$ into $\gamma^{2}\,\mathbb{E}_{s_2|s_0,\pi}\nabla_{\theta}V^{\theta}(s_{2})$.

Repeating this process (the remainder $\gamma^{n}\,\mathbb{E}_{s_n|s_0,\pi}\nabla_{\theta}V^{\theta}(s_n)$ vanishes as $n\to\infty$ when $\gamma<1$ and $\nabla_{\theta}V^{\theta}$ is bounded) yields

$$\nabla_{\theta}V^{\theta}(s_0)=\mathbb{E}_{\tau|s_0,\pi}\Big[\sum_{i=0}^{\infty}\gamma^{i}G_{i}\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]$$

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{s_0}\mathbb{E}_{\tau|s_0,\pi}\Big[\sum_{i=0}^{\infty}\gamma^{i}G_i\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]=\mathbb{E}_{\tau|\pi}\Big[\sum_{i=0}^{\infty}\gamma^{i}G_i\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]$$
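The formula can be sanity-checked on a small example by enumerating every trajectory. The sketch below is an illustration under stated assumptions, not part of the original derivation: a made-up MDP with two states and two actions, a finite horizon of three steps (rewards are zero afterwards), rewards of the form $r(s,a)$, and a softmax policy whose parameters `theta[s]` are the logits for state `s`. All names and numbers are hypothetical. The exact expectation of $\sum_i\gamma^{i}G_i\nabla_{\theta}\ln\pi(a_i|s_i;\theta)$ is compared with a finite-difference gradient of $J$.

```python
import itertools
import math

GAMMA = 0.9
HORIZON = 3
P0 = [0.6, 0.4]                              # initial state distribution p_1(s_0)
T = [[[0.7, 0.3], [0.2, 0.8]],               # T[s][a][s_next]
     [[0.5, 0.5], [0.9, 0.1]]]
R = [[1.0, 0.0], [-0.5, 2.0]]                # r(s, a), an assumed simplification


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def trajectories():
    """Yield (states, actions) for every trajectory of length HORIZON."""
    for states in itertools.product(range(2), repeat=HORIZON + 1):
        for actions in itertools.product(range(2), repeat=HORIZON):
            yield states, actions


def trajectory_prob(theta, states, actions):
    prob = P0[states[0]]
    for t in range(HORIZON):
        s, a = states[t], actions[t]
        prob *= softmax(theta[s])[a] * T[s][a][states[t + 1]]
    return prob


def objective(theta):
    """J(theta) = E[G_0], computed exactly by enumeration."""
    total = 0.0
    for states, actions in trajectories():
        g0 = sum(GAMMA ** t * R[states[t]][actions[t]] for t in range(HORIZON))
        total += trajectory_prob(theta, states, actions) * g0
    return total


def policy_gradient(theta):
    """Exact E[sum_t gamma^t G_t grad ln pi(a_t|s_t)] with respect to the logits."""
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for states, actions in trajectories():
        prob = trajectory_prob(theta, states, actions)
        rewards = [R[states[t]][actions[t]] for t in range(HORIZON)]
        for t in range(HORIZON):
            g_t = sum(GAMMA ** k * rewards[t + k] for k in range(HORIZON - t))
            s, a = states[t], actions[t]
            pi = softmax(theta[s])
            for b in range(2):
                score = (1.0 if b == a else 0.0) - pi[b]   # d ln pi(a|s) / d theta[s][b]
                grad[s][b] += prob * GAMMA ** t * g_t * score
    return grad


def finite_difference_gradient(theta, eps=1e-6):
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for s in range(2):
        for b in range(2):
            up = [row[:] for row in theta]
            dn = [row[:] for row in theta]
            up[s][b] += eps
            dn[s][b] -= eps
            grad[s][b] = (objective(up) - objective(dn)) / (2 * eps)
    return grad


if __name__ == "__main__":
    theta = [[0.1, -0.3], [0.5, 0.2]]
    print(policy_gradient(theta))
    print(finite_difference_gradient(theta))
```

The two printed gradients should agree up to finite-difference error. Enumeration is only feasible for toy problems; with sampled trajectories the same sum becomes a Monte Carlo estimator.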

Theorem 3

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{\tau|\pi}\Big[\sum_{i=0}^{\infty}\gamma^{i}Q^{\theta}(s_i,a_i)\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]$$

Proof: The opening steps are identical to those in the proof of Theorem 2 and are not repeated. The first term is now kept in terms of $Q^{\theta}$:

$$\begin{aligned}
\sum_{a}\nabla_{\theta}\pi(a|s_0;\theta)Q^{\theta}(s_0,a)
&=\sum_{a}\pi(a|s_0;\theta)\nabla_{\theta}\ln\pi(a|s_0;\theta)Q^{\theta}(s_0,a)\\
&=\mathbb{E}_{a_0|s_0,\pi}\big\{\nabla_{\theta}\ln\pi(a_0|s_0;\theta)Q^{\theta}(s_0,a_0)\big\}\\
&=\mathbb{E}_{\tau|s_0,\pi}\big\{Q^{\theta}(s_0,a_0)\nabla_{\theta}\ln\pi(a_0|s_0;\theta)\big\}
\end{aligned}$$

$$\sum_{a}\nabla_{\theta}Q^{\theta}(s_0,a)\pi(a|s_0;\theta)=\mathbb{E}_{a_0|s_0,\pi}\nabla_{\theta}Q^{\theta}(s_0,a_0)$$

where $Q^{\theta}(s_0,a)=\sum_{s'}T(s_0,a,s')[r(s_0,a,s')+\gamma V^{\theta}(s')]$. Since $T$ and $r$ do not depend on $\theta$,

$$\nabla_{\theta}Q^{\theta}(s_0,a_0)=\gamma\sum_{s'}T(s_0,a_0,s')\nabla_{\theta}V^{\theta}(s')=\gamma\,\mathbb{E}_{s_1|s_0,a_0}\nabla_{\theta}V^{\theta}(s_1)$$

Therefore

$$\sum_{a}\nabla_{\theta}Q^{\theta}(s_0,a)\pi(a|s_0;\theta)=\gamma\,\mathbb{E}_{a_0|s_0,\pi}\mathbb{E}_{s_1|s_0,a_0}\nabla_{\theta}V^{\theta}(s_1)=\gamma\,\mathbb{E}_{s_1|s_0,\pi}\nabla_{\theta}V^{\theta}(s_1)$$

$$\nabla_{\theta}V^{\theta}(s_0)=\mathbb{E}_{\tau|s_0,\pi}\big\{Q^{\theta}(s_0,a_0)\nabla_{\theta}\ln\pi(a_0|s_0;\theta)\big\}+\gamma\,\mathbb{E}_{s_1|s_0,\pi}\nabla_{\theta}V^{\theta}(s_1)$$

Similarly:

$$\nabla_{\theta}V^{\theta}(s_i)=\mathbb{E}_{\tau_i|s_i,\pi}\big\{Q^{\theta}(s_i,a_i)\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\big\}+\gamma\,\mathbb{E}_{s_{i+1}|s_{i},\pi}\nabla_{\theta}V^{\theta}(s_{i+1}),\quad i=1,2,\ldots$$

Substituting $\nabla_{\theta}V^{\theta}(s_1)=\mathbb{E}_{\tau_1|s_1,\pi}\{Q^{\theta}(s_1,a_1)\nabla_{\theta}\ln\pi(a_1|s_1;\theta)\}+\gamma\,\mathbb{E}_{s_{2}|s_{1},\pi}\nabla_{\theta}V^{\theta}(s_{2})$ into $\gamma\,\mathbb{E}_{s_1|s_0,\pi}\nabla_{\theta}V^{\theta}(s_1)$ gives

$$\begin{aligned}
\gamma\,\mathbb{E}_{s_1|s_0,\pi}\nabla_{\theta}V^{\theta}(s_1)
&=\gamma\,\mathbb{E}_{s_1|s_0,\pi}\big\{\mathbb{E}_{\tau_1|s_1,\pi}\{Q^{\theta}(s_1,a_1)\nabla_{\theta}\ln\pi(a_1|s_1;\theta)\}+\gamma\,\mathbb{E}_{s_{2}|s_{1},\pi}\nabla_{\theta}V^{\theta}(s_{2})\big\}\\
&=\gamma\,\mathbb{E}_{\tau_1|s_0,\pi}\big[Q^{\theta}(s_1,a_1)\nabla_{\theta}\ln\pi(a_1|s_1;\theta)\big]+\gamma^{2}\,\mathbb{E}_{s_2|s_0,\pi}\nabla_{\theta}V^{\theta}(s_{2})\\
&=\mathbb{E}_{\tau|s_0,\pi}\big[\gamma Q^{\theta}(s_1,a_1)\nabla_{\theta}\ln\pi(a_1|s_1;\theta)\big]+\gamma^{2}\,\mathbb{E}_{s_2|s_0,\pi}\nabla_{\theta}V^{\theta}(s_{2})
\end{aligned}$$

Hence

$$\nabla_{\theta}V^{\theta}(s_0)=\mathbb{E}_{\tau|s_0,\pi}\big\{Q^{\theta}(s_0,a_0)\nabla_{\theta}\ln\pi(a_0|s_0;\theta)+\gamma Q^{\theta}(s_1,a_1)\nabla_{\theta}\ln\pi(a_1|s_1;\theta)\big\}+\gamma^{2}\,\mathbb{E}_{s_2|s_0,\pi}\nabla_{\theta}V^{\theta}(s_{2})$$

Next, substitute $\nabla_{\theta}V^{\theta}(s_2)=\mathbb{E}_{\tau_2|s_2,\pi}\{Q^{\theta}(s_2,a_2)\nabla_{\theta}\ln\pi(a_2|s_2;\theta)\}+\gamma\,\mathbb{E}_{s_{3}|s_{2},\pi}\nabla_{\theta}V^{\theta}(s_{3})$ into $\gamma^{2}\,\mathbb{E}_{s_2|s_0,\pi}\nabla_{\theta}V^{\theta}(s_{2})$.

Repeating this process yields

$$\nabla_{\theta}V^{\theta}(s_0)=\mathbb{E}_{\tau|s_0,\pi}\Big[\sum_{i=0}^{\infty}\gamma^{i}Q^{\theta}(s_i,a_i)\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]$$

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{s_0}\mathbb{E}_{\tau|s_0,\pi}\Big[\sum_{i=0}^{\infty}\gamma^{i}Q^{\theta}(s_i,a_i)\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]=\mathbb{E}_{\tau|\pi}\Big[\sum_{i=0}^{\infty}\gamma^{i}Q^{\theta}(s_i,a_i)\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\Big]$$

Corollary.

$$\nabla_{\theta}J(\theta)=\sum_{s}\sum_{a}\sum_{i=0}^{\infty}\gamma^{i}\Pr[s_i=s,a_i=a|\pi]\,Q^{\theta}(s,a)\nabla_{\theta}\ln\pi(a|s;\theta)$$

Proof: By linearity of expectation,

$$\nabla_{\theta}J(\theta)=\sum_{i=0}^{\infty}\gamma^{i}\,\mathbb{E}_{\tau|\pi}\big[Q^{\theta}(s_i,a_i)\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\big]$$

Each expectation depends on the trajectory only through the pair $(s_i,a_i)$, so

$$\mathbb{E}_{\tau|\pi}\big[Q^{\theta}(s_i,a_i)\nabla_{\theta}\ln\pi(a_i|s_i;\theta)\big]=\sum_{s}\sum_{a}\Pr[s_i=s,a_i=a|\pi]\,Q^{\theta}(s,a)\nabla_{\theta}\ln\pi(a|s;\theta)$$

$$\nabla_{\theta}J(\theta)=\sum_{i=0}^{\infty}\sum_{s}\sum_{a}\gamma^{i}\Pr[s_i=s,a_i=a|\pi]\,Q^{\theta}(s,a)\nabla_{\theta}\ln\pi(a|s;\theta)=\sum_{s}\sum_{a}\sum_{i=0}^{\infty}\gamma^{i}\Pr[s_i=s,a_i=a|\pi]\,Q^{\theta}(s,a)\nabla_{\theta}\ln\pi(a|s;\theta)$$
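The corollary expresses the gradient through the discounted occupancy $\sum_i\gamma^{i}\Pr[s_i=s,a_i=a|\pi]$. The sketch below evaluates that expression on a made-up MDP with two states and two actions and compares it with a finite-difference gradient of $J$. The MDP numbers, the softmax parameterization (`theta[s]` holds the logits for state `s`), rewards of the form $r(s,a)$, and the truncation of the infinite sums after `STEPS` terms are all assumptions made for illustration.

```python
import math

GAMMA = 0.9
STEPS = 400                                  # truncation; GAMMA**400 is negligible
P0 = [0.6, 0.4]                              # initial state distribution
T = [[[0.7, 0.3], [0.2, 0.8]],               # T[s][a][s_next]
     [[0.5, 0.5], [0.9, 0.1]]]
R = [[1.0, 0.0], [-0.5, 2.0]]                # r(s, a), an assumed simplification


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def q_values(theta):
    """Q^theta obtained by iterating Q = r + gamma * sum_s' T * V."""
    pi = [softmax(row) for row in theta]
    q = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(STEPS):
        v = [sum(pi[s][a] * q[s][a] for a in range(2)) for s in range(2)]
        q = [[R[s][a] + GAMMA * sum(T[s][a][s2] * v[s2] for s2 in range(2))
              for a in range(2)] for s in range(2)]
    return q


def objective(theta):
    """J(theta) = E_{s_0}[V(s_0)]."""
    pi = [softmax(row) for row in theta]
    q = q_values(theta)
    return sum(P0[s] * pi[s][a] * q[s][a] for s in range(2) for a in range(2))


def discounted_occupancy(theta):
    """occ[s][a] = sum_i gamma^i Pr[s_i = s, a_i = a | pi], truncated."""
    pi = [softmax(row) for row in theta]
    state_prob = list(P0)
    occ = [[0.0, 0.0], [0.0, 0.0]]
    discount = 1.0
    for _ in range(STEPS):
        for s in range(2):
            for a in range(2):
                occ[s][a] += discount * state_prob[s] * pi[s][a]
        # Push the state distribution one step forward under pi and T.
        state_prob = [sum(state_prob[s] * pi[s][a] * T[s][a][s2]
                          for s in range(2) for a in range(2)) for s2 in range(2)]
        discount *= GAMMA
    return occ


def policy_gradient(theta):
    """sum_{s,a} occ(s,a) Q(s,a) grad ln pi(a|s), with respect to the logits."""
    pi = [softmax(row) for row in theta]
    q = q_values(theta)
    occ = discounted_occupancy(theta)
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for s in range(2):
        for a in range(2):
            for b in range(2):
                score = (1.0 if a == b else 0.0) - pi[s][b]
                grad[s][b] += occ[s][a] * q[s][a] * score
    return grad


def finite_difference_gradient(theta, eps=1e-5):
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for s in range(2):
        for b in range(2):
            up = [row[:] for row in theta]
            dn = [row[:] for row in theta]
            up[s][b] += eps
            dn[s][b] -= eps
            grad[s][b] = (objective(up) - objective(dn)) / (2 * eps)
    return grad


if __name__ == "__main__":
    theta = [[0.1, -0.3], [0.5, 0.2]]
    print(policy_gradient(theta))
    print(finite_difference_gradient(theta))
```

Note that the occupancy sums to $1/(1-\gamma)$ rather than 1, so it is an unnormalized weighting of state-action pairs.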

Theorem 4

$$\mathbb{E}_{(s_i,a_i)|\pi}\nabla_{\theta}\ln\pi(a_i|s_i;\theta)Q^{\theta}(s_i,a_i)=\sum_{s}\sum_{a}\Pr[s_i=s,a_i=a|\pi]\,\nabla_{\theta}\ln\pi(a|s;\theta)Q^{\theta}(s,a)$$

Proof:

Lemma. $Q^{\theta}(s,a)=\sum_{s'}T(s,a,s')[r(s,a,s')+\gamma V^{\theta}(s')]$

$$\nabla_{\theta}Q^{\theta}(s,a)=\gamma\sum_{s'}T(s,a,s')\nabla_{\theta}V^{\theta}(s')$$

$$\nabla_{\theta}V^{\theta}(s')=\nabla_{\theta}\sum_{a'}\pi(a'|s';\theta)Q^{\theta}(s',a')=\sum_{a'}\big[\nabla_{\theta}\pi(a'|s';\theta)Q^{\theta}(s',a')+\pi(a'|s';\theta)\nabla_{\theta}Q^{\theta}(s',a')\big]$$

Therefore

$$\begin{aligned}
\nabla_{\theta}Q^{\theta}(s,a)
&=\gamma\sum_{s'}\sum_{a'}T(s,a,s')\big[\nabla_{\theta}\pi(a'|s';\theta)Q^{\theta}(s',a')+\pi(a'|s';\theta)\nabla_{\theta}Q^{\theta}(s',a')\big]\\
&=\gamma\sum_{s'}\sum_{a'}T(s,a,s')\pi(a'|s';\theta)\big[\nabla_{\theta}\ln\pi(a'|s';\theta)Q^{\theta}(s',a')+\nabla_{\theta}Q^{\theta}(s',a')\big]
\end{aligned}$$

$$\begin{aligned}
\nabla_{\theta}Q^{\theta}(s_i,a_i)
&=\gamma\sum_{s_{i+1}}\sum_{a_{i+1}}T(s_i,a_i,s_{i+1})\pi(a_{i+1}|s_{i+1};\theta)\big[\nabla_{\theta}\ln\pi(a_{i+1}|s_{i+1};\theta)Q^{\theta}(s_{i+1},a_{i+1})+\nabla_{\theta}Q^{\theta}(s_{i+1},a_{i+1})\big]\\
&=\gamma\,\mathbb{E}_{(s_{i+1},a_{i+1})|(s_i,a_i),\pi}\big[\nabla_{\theta}\ln\pi(a_{i+1}|s_{i+1};\theta)Q^{\theta}(s_{i+1},a_{i+1})+\nabla_{\theta}Q^{\theta}(s_{i+1},a_{i+1})\big]
\end{aligned}$$

$$J(\theta)=\mathbb{E}_{s_0}V^{\theta}(s_0)=\mathbb{E}_{s_0}\sum_{a}\pi(a|s_0;\theta)Q^{\theta}(s_0,a)$$

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{s_0}\sum_{a}\nabla_{\theta}\pi(a|s_0;\theta)Q^{\theta}(s_0,a)+\mathbb{E}_{s_0}\sum_{a}\pi(a|s_0;\theta)\nabla_{\theta}Q^{\theta}(s_0,a)$$

$$\mathbb{E}_{s_0}\sum_{a}\nabla_{\theta}\pi(a|s_0;\theta)Q^{\theta}(s_0,a)=\mathbb{E}_{s_0}\sum_{a}\pi(a|s_0;\theta)\nabla_{\theta}\ln\pi(a|s_0;\theta)Q^{\theta}(s_0,a)=\mathbb{E}_{(s_0,a_0)|\pi}\nabla_{\theta}\ln\pi(a_0|s_0;\theta)Q^{\theta}(s_0,a_0)$$

$$\begin{aligned}
\mathbb{E}_{s_0}\sum_{a}\pi(a|s_0;\theta)\nabla_{\theta}Q^{\theta}(s_0,a)
&=\mathbb{E}_{s_0}\mathbb{E}_{a_0|s_0,\pi}\nabla_{\theta}Q^{\theta}(s_0,a_0)\\
&=\mathbb{E}_{(s_0,a_0)|\pi}\nabla_{\theta}Q^{\theta}(s_0,a_0)
\end{aligned}$$

So far,

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{(s_0,a_0)|\pi}\nabla_{\theta}\ln\pi(a_0|s_0;\theta)Q^{\theta}(s_0,a_0)+\mathbb{E}_{(s_0,a_0)|\pi}\nabla_{\theta}Q^{\theta}(s_0,a_0)$$

By the lemma:

$$\nabla_{\theta}Q^{\theta}(s_0,a_0)=\gamma\,\mathbb{E}_{(s_1,a_1)|(s_0,a_0),\pi}\big[\nabla_{\theta}\ln\pi(a_1|s_1;\theta)Q^{\theta}(s_1,a_1)+\nabla_{\theta}Q^{\theta}(s_1,a_1)\big]$$

$$\begin{aligned}
\mathbb{E}_{(s_0,a_0)|\pi}\nabla_{\theta}Q^{\theta}(s_0,a_0)
&=\gamma\,\mathbb{E}_{(s_0,a_0)|\pi}\mathbb{E}_{(s_1,a_1)|(s_0,a_0),\pi}\big[\nabla_{\theta}\ln\pi(a_1|s_1;\theta)Q^{\theta}(s_1,a_1)+\nabla_{\theta}Q^{\theta}(s_1,a_1)\big]\\
&=\gamma\,\mathbb{E}_{(s_1,a_1)|\pi}\big[\nabla_{\theta}\ln\pi(a_1|s_1;\theta)Q^{\theta}(s_1,a_1)+\nabla_{\theta}Q^{\theta}(s_1,a_1)\big]
\end{aligned}$$

Hence

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{(s_0,a_0)|\pi}\nabla_{\theta}\ln\pi(a_0|s_0;\theta)Q^{\theta}(s_0,a_0)+\gamma\,\mathbb{E}_{(s_1,a_1)|\pi}\nabla_{\theta}\ln\pi(a_1|s_1;\theta)Q^{\theta}(s_1,a_1)+\gamma\,\mathbb{E}_{(s_1,a_1)|\pi}\nabla_{\theta}Q^{\theta}(s_1,a_1)$$

Repeating this process gives

$$\nabla_{\theta}J(\theta)=\sum_{i=0}^{\infty}\gamma^{i}\,\mathbb{E}_{(s_i,a_i)|\pi}\nabla_{\theta}\ln\pi(a_i|s_i;\theta)Q^{\theta}(s_i,a_i)$$

where

$$\mathbb{E}_{(s_i,a_i)|\pi}\nabla_{\theta}\ln\pi(a_i|s_i;\theta)Q^{\theta}(s_i,a_i)=\sum_{s}\sum_{a}\Pr[s_i=s,a_i=a|\pi]\,\nabla_{\theta}\ln\pi(a|s;\theta)Q^{\theta}(s,a)$$

and therefore

$$\nabla_{\theta}J(\theta)=\sum_{s}\sum_{a}\sum_{i=0}^{\infty}\gamma^{i}\Pr[s_i=s,a_i=a|\pi]\,\nabla_{\theta}\ln\pi(a|s;\theta)Q^{\theta}(s,a)$$

Theorem 5

$$\mathbb{E}_{(s_t,a_t)|\pi}\big[\gamma^{t}G_{t}\nabla_{\theta}\ln\pi(a_t|s_t;\theta)\big]=\mathbb{E}_{(s_t,a_t)|\pi'}\Big[\frac{1}{\pi'(a_t|s_t)}\gamma^{t}G_{t}\nabla_{\theta}\pi(a_t|s_t;\theta)\Big]$$

where $\pi'$ is an arbitrary policy with $\pi'(a_t|s_t)>0$, so that the ratio is well defined.

Proof:

Substituting $\nabla_{\theta}\ln\pi(a_{t}|s_{t};\theta)=\nabla_{\theta}\pi(a_t|s_t;\theta)/\pi(a_t|s_t;\theta)$ gives

$$\begin{aligned}
\mathbb{E}_{(s_t,a_t)|\pi}\big[\gamma^{t}G_{t}\nabla_{\theta}\ln\pi(a_t|s_t;\theta)\big]
&=\sum_{(s_t,a_t)\in\mathcal{S}\times\mathcal{A}}\pi(a_t|s_t;\theta)\gamma^{t}G_{t}\nabla_{\theta}\pi(a_t|s_t;\theta)/\pi(a_t|s_t;\theta)\\
&=\sum_{(s_t,a_t)\in\mathcal{S}\times\mathcal{A}}\gamma^{t}G_{t}\nabla_{\theta}\pi(a_t|s_t;\theta)\\
&=\sum_{(s_t,a_t)\in\mathcal{S}\times\mathcal{A}}\pi'(a_t|s_t)\frac{1}{\pi'(a_t|s_t)}\gamma^{t}G_{t}\nabla_{\theta}\pi(a_t|s_t;\theta)\\
&=\mathbb{E}_{(s_t,a_t)|\pi'}\Big[\frac{1}{\pi'(a_t|s_t)}\gamma^{t}G_{t}\nabla_{\theta}\pi(a_t|s_t;\theta)\Big]
\end{aligned}$$

Note that the sums weight each pair $(s_t,a_t)$ by the action probability only. The identity is therefore to be read for a given state $s_t$, with $G_t$ treated as a function of $(s_t,a_t)$; the state distribution is not reweighted.
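The reweighting in this proof can be illustrated at a single state. The sketch below assumes a softmax target policy $\pi$ over three actions parameterized by its logits, a fixed behaviour distribution standing in for $\pi'$, and a table `f` standing in for $\gamma^{t}G_{t}$ as a function of the action. These choices, the numbers, and the function names are hypothetical. It computes the on-policy side exactly and estimates the other side from actions sampled under $\pi'$.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_prob(logits, a):
    """Gradient of pi(a) with respect to the logits: pi(a) * (one_hot(a) - pi)."""
    pi = softmax(logits)
    return [pi[a] * ((1.0 if b == a else 0.0) - pi[b]) for b in range(len(logits))]


def on_policy_term(logits, f):
    """sum_a pi(a) f(a) grad ln pi(a), computed exactly."""
    pi = softmax(logits)
    n = len(logits)
    out = [0.0] * n
    for a in range(n):
        grad_log = [g / pi[a] for g in grad_prob(logits, a)]  # grad ln pi = grad pi / pi
        for b in range(n):
            out[b] += pi[a] * f[a] * grad_log[b]
    return out


def off_policy_estimate(logits, f, behaviour, num_samples, seed=0):
    """Average of f(a) grad pi(a) / pi'(a) over actions sampled from pi'."""
    rng = random.Random(seed)
    n = len(logits)
    grads = [grad_prob(logits, a) for a in range(n)]
    actions = rng.choices(range(n), weights=behaviour, k=num_samples)
    out = [0.0] * n
    for a in actions:
        for b in range(n):
            out[b] += f[a] * grads[a][b] / behaviour[a]
    return [x / num_samples for x in out]


if __name__ == "__main__":
    logits = [0.2, -0.4, 1.0]
    f = [1.0, -0.5, 2.0]            # stand-in for gamma^t * G_t per action
    behaviour = [0.2, 0.5, 0.3]     # pi'(a|s), strictly positive
    print(on_policy_term(logits, f))
    print(off_policy_estimate(logits, f, behaviour, num_samples=200000))
```

The sampled estimate should approach the exact value as the number of samples grows, provided every action has positive probability under the behaviour distribution.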

