Datasets provides two types of dataset object: Dataset and IterableDataset.

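  • Dataset provides fast random access to the rows of a dataset and supports memory mapping, so even loading a large dataset needs relatively little memory.
  • IterableDataset is suited to very large datasets, including ones that cannot be fully downloaded to disk or loaded into memory. It lets you start accessing and using the data before the dataset has been fully downloaded.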

0 Loading data

from datasets import load_dataset

dataset = load_dataset("rotten_tomatoes", split="train")
dataset

Dataset({
    features: ['text', 'label'],
    num_rows: 8530
})

1 Dataset

1.1 Indexing

{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'label': 1}

{'text': 'things really get weird , though not particularly scary : the movie is all portent and no content .',
 'label': 0}

'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'


1.2 Slicing

{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
  'effective but too-tepid biopic'],
 'label': [1, 1, 1]}

2 IterableDataset

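When streaming=True is set, load_dataset returns an IterableDataset.

An IterableDataset behaves differently from a Dataset:

  • It does not support random access.
  • Elements can only be obtained one at a time by iterating, for example with next(iter(...)) or a for loop.
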
from datasets import load_dataset

iter_dataset = load_dataset("rotten_tomatoes", split="train", streaming=True)
iter_dataset

IterableDataset({
    features: ['text', 'label'],
    n_shards: 1
})

# Print only the first example, then stop.
for i in iter_dataset:
    print(i)
    break

{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}

2.1 Creating an IterableDataset from an existing Dataset

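iter_dataset2 = dataset.to_iterable_dataset()

# Print only the first example, then stop.
for i in iter_dataset2:
    print(i)
    break
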
{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}

2.2 Getting a specified number of examples

[{'text': 'the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
  'label': 1},
 {'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
  'label': 1},
 {'text': 'effective but too-tepid biopic', 'label': 1}]



