
Understanding the model of OpenAI 5 (1024-unit LSTM reinforcement learning)


Background:

I recently came across OpenAI 5. I was curious to see how their model is built and to understand it. I read on Wikipedia that it "contains a single layer with a 1024-unit LSTM". Then I found this PDF containing a schematic of the architecture.


My Questions

From all this I don't understand a few things:


  • What does it mean to have a 1024-unit LSTM layer? Does this mean we have 1024 time steps with a single LSTM cell, or does it mean we have 1024 cells? Could you show me some kind of graph visualizing this? I'm especially having a hard time visualizing 1024 cells in one layer. (I tried looking at several SO questions such as 1, 2, or the OpenAI 5 blog, but they didn't help much.)


  • How can you do reinforcement learning on such a model? I'm used to RL being done with Q-tables that are updated during training. Does this simply mean that their loss function is the reward?


  • How come such a large model doesn't suffer from vanishing gradients or the like? I haven't seen any kind of normalization in the PDF.


  • In the PDF you can see a blue rectangle; it seems to be a unit, and there are N of those. What does this mean? And please correct me if I'm mistaken: the pink boxes are used to select the best move/item(?)



In general, all of this can be summarized as: "How does the OpenAI 5 model work?"


Answer:

  • It means that the size of the hidden state is 1024 units, which essentially means that your LSTM has 1024 cells at each timestep. We do not know in advance how many timesteps there will be (see the sketch after this list).


  • The state of the LSTM (its hidden state) represents the current state observed by the agent. It is updated at every timestep using the input received. This hidden state can be used to predict the Q-function (as in Deep Q-learning). You don't have an explicit table mapping (state, action) -> q_value; instead you have a 1024-dimensional vector which represents the state and feeds into another dense layer, which outputs the q_values for all possible actions (again, see the sketch after this list).


  • LSTMs are a mechanism that helps prevent vanishing gradients, as the long-range memory (the cell state) also lets gradients flow back more easily.


  • If you are referring to the big blue and pink boxes, then the pink ones seem to be the input values, which are put through a network and pooled over each pickup or modifier. The blue area seems to be the same thing applied over each unit (the sketch after this list includes a toy pooled-over-units example). The terms pickup, modifier, unit, etc. should be meaningful in the context of the game they are playing.

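To make the first two bullets concrete, here is a minimal sketch in PyTorch under the answer's Q-learning framing (OpenAI Five itself was trained with PPO and a far more elaborate observation encoder, so this is only an illustration, not the real system): one LSTM layer whose hidden state has 1024 units, a dense head that turns that hidden state into one value per action, and a max-pool over a variable number of per-unit embeddings of the kind the blue boxes seem to describe. Every size except the 1024-unit hidden state (obs_dim, n_actions, unit_dim, n_units) is a made-up placeholder.

```python
# Minimal sketch (PyTorch). The 1024 comes from the question; obs_dim,
# n_actions, unit_dim and n_units are placeholders for illustration.
import torch
import torch.nn as nn

obs_dim, n_actions, hidden = 512, 8, 1024

# One LSTM layer whose hidden state has 1024 units ("1024-unit LSTM").
lstm = nn.LSTM(input_size=obs_dim, hidden_size=hidden,
               num_layers=1, batch_first=True)

# Dense head mapping the 1024-dim hidden state to one value per action,
# as in Deep Q-learning (the answer's framing).
q_head = nn.Linear(hidden, n_actions)

# A rollout of 30 timesteps for a single environment: the number of
# timesteps is not fixed by the model, only the hidden size is.
obs_seq = torch.randn(1, 30, obs_dim)
out, (h_n, c_n) = lstm(obs_seq)

print(out.shape)                 # (1, 30, 1024): one 1024-dim hidden state per timestep
print(h_n.shape)                 # (1, 1, 1024): final hidden state
print(q_head(out[:, -1]).shape)  # (1, 8): Q-values for the last timestep

# Toy version of the "N units" idea from the last bullet (an assumption
# about what the blue boxes mean): run the same small network over each
# visible unit, then max-pool so the summary does not depend on how many
# units are currently visible.
unit_dim, n_units = 64, 17                      # placeholder sizes
unit_encoder = nn.Linear(unit_dim, obs_dim)
units = torch.randn(1, n_units, unit_dim)       # one embedding per visible unit
pooled = unit_encoder(units).max(dim=1).values  # (1, obs_dim), independent of n_units
print(pooled.shape)
```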

Here is an image of the LSTM; the yellow nodes at each step are the n units (the image is not reproduced in this repost).


The vector h is the hidden state of the LSTM; it is passed on to the next timestep and also used as the output of that timestep.

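The same flow, stepped manually with nn.LSTMCell, makes it explicit that one set of 1024 cells is reused at every timestep, with h both carried forward to the next step and used as that step's output. Again only a sketch; obs_dim and the number of steps are placeholders.

```python
# Sketch of the recurrence: one step at a time with an LSTMCell.
import torch
import torch.nn as nn

obs_dim, hidden = 512, 1024          # 1024-unit hidden state; obs_dim is a placeholder
cell = nn.LSTMCell(obs_dim, hidden)  # the same cell (same weights) is reused every timestep

h = torch.zeros(1, hidden)           # hidden state: the 1024-dim vector h discussed above
c = torch.zeros(1, hidden)           # cell state (the LSTM's long-range memory)

for t in range(5):                   # the number of timesteps is not fixed by the model
    x_t = torch.randn(1, obs_dim)    # observation at timestep t
    h, c = cell(x_t, (h, c))         # h is updated from the input...
    output_t = h                     # ...used as this timestep's output,
                                     # and carried into the next iteration of the loop
```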

