
datasets notes: the two dataset objects

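Datasets provides two kinds of dataset object: Dataset and IterableDataset.

  • Dataset gives fast random access to the rows of a dataset and supports memory mapping, so even a large dataset can be loaded with relatively little memory.
  • IterableDataset suits very large datasets, including ones that cannot be fully downloaded to disk or held in memory. It lets you start accessing and using the data before the dataset has finished downloading.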
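
The difference between the two access patterns can be sketched with the standard library alone. This is only an analogy, not the datasets API: `rows` and `stream_rows` are hypothetical names, with a plain list standing in for a Dataset and a generator standing in for an IterableDataset.

```python
from typing import Dict, Iterator, List

Row = Dict[str, object]

# Map-style stand-in: every row is addressable by position.
rows: List[Row] = [
    {"text": "first review", "label": 1},
    {"text": "second review", "label": 0},
    {"text": "third review", "label": 1},
]


def stream_rows() -> Iterator[Row]:
    """Iterable-style stand-in: rows are produced one at a time, on demand."""
    for row in rows:
        yield row


# Random access (including negative indices) works on the map-style object.
last_row = rows[-1]

# The stream has no index; it can only be consumed in order.
stream = stream_rows()
first_streamed = next(stream)
second_streamed = next(stream)

try:
    stream[0]  # generators do not support indexing
    indexable = True
except TypeError:
    indexable = False
```

The trade-off is the usual one: the list needs all rows to be available up front, while the generator can begin yielding rows before the rest exist.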

0 Loading the data

from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes", split="train")
dataset
'''
Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})
'''

1 Dataset

1.1 Indexing

dataset[0]
'''
{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'label': 1}
'''

dataset[-1]
'''
{'text': 'things really get weird , though not particularly scary : the movie is all portent and no content .',
 'label': 0}
'''


dataset[0]['text']
'''
'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'
'''


dataset['text']  # the whole 'text' column, one value per row

1.2 Slicing

dataset[:3]
'''
{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
  'effective but too-tepid biopic'],
 'label': [1, 1, 1]}
'''
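
The slice above comes back column-oriented: one dict whose values are lists of equal length. When per-row dicts are more convenient, a small helper can transpose it. The sketch below uses only the standard library; `batch_to_rows` is a hypothetical name, not part of datasets, and the sample texts are shortened placeholders.

```python
from typing import Any, Dict, List


def batch_to_rows(batch: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Turn a dict of equal-length columns into a list of per-row dicts."""
    keys = list(batch)
    columns = (batch[key] for key in keys)
    # zip(*columns) walks the columns in lockstep, yielding one tuple per row.
    return [dict(zip(keys, values)) for values in zip(*columns)]


# Same shape as a slice result: column name -> list of values.
batch = {
    "text": ["review a", "review b", "effective but too-tepid biopic"],
    "label": [1, 1, 1],
}
rows = batch_to_rows(batch)
```

Here `rows[2]` is `{'text': 'effective but too-tepid biopic', 'label': 1}`, the same shape as the result of indexing a single row.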

2 IterableDataset

When streaming=True is set, the dataset is loaded as an IterableDataset.

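An IterableDataset behaves differently from a Dataset:

  • It does not support random access.
  • Elements can only be obtained by iterating one at a time, for example with next(iter(...)) or a for loop.

from datasets import load_dataset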

iter_dataset = load_dataset("rotten_tomatoes", split="train", streaming=True)
iter_dataset
'''
IterableDataset({
    features: ['text', 'label'],
    n_shards: 1
})
'''
for i in iter_dataset:
    print(i)
    break
'''
{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}
'''
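
One detail of the next(iter(...)) pattern is easy to miss: for any object whose `__iter__` starts a fresh pass, every call to iter() begins again at the first row. The sketch below shows this with a hypothetical stand-in class, `TinyStream`, written with the standard library only. It assumes, without verifying it here, that a streamed dataset restarts in the same way.

```python
from typing import Dict, Iterable, Iterator


class TinyStream:
    """Hypothetical stand-in for a streamed dataset (not the datasets API)."""

    def __init__(self, texts: Iterable[str]) -> None:
        self._texts = list(texts)

    def __iter__(self) -> Iterator[Dict[str, str]]:
        # A generator method: each call to iter() starts a new pass.
        for text in self._texts:
            yield {"text": text}


stream = TinyStream(["a", "b", "c"])

# Two separate iter() calls both return the first row.
first = next(iter(stream))
first_again = next(iter(stream))

# Holding on to one iterator is what lets you advance through the rows.
it = iter(stream)
row_1 = next(it)
row_2 = next(it)
```

To read several consecutive rows, keep a single iterator, or use a for loop, rather than calling next(iter(...)) repeatedly.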

2.1 Creating an IterableDataset from an existing Dataset

iter_dataset2 = dataset.to_iterable_dataset()
for i in iter_dataset2:
    print(i)
    break
'''
{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}
'''

2.2 Getting a specified number of examples

list(iter_dataset2.take(3))
'''
[{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'label': 1},
 {'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
  'label': 1},
 {'text': 'effective but too-tepid biopic', 'label': 1}]
'''
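
Conceptually, taking the first n examples of a stream resembles `itertools.islice` on any Python iterable. The sketch below is a standard-library analogy, not how datasets implements take(); `counting_source` and `pulled` are hypothetical names used to show that only the requested items are drawn from the source.

```python
from itertools import islice
from typing import Dict, Iterator

pulled = 0  # how many items the source has actually produced


def counting_source() -> Iterator[Dict[str, int]]:
    """An endless stream that records how many items were requested from it."""
    global pulled
    i = 0
    while True:
        pulled += 1
        yield {"id": i}
        i += 1


# islice stops after three items, so the endless source is never exhausted.
first_three = list(islice(counting_source(), 3))
```

After this runs, `first_three` holds the rows with ids 0, 1 and 2, and `pulled` is 3: the source was asked for nothing beyond what was taken.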

