
HuggingFace datasets - Downloading Data

Contents

    • Downloading data
    • Changing the default cache location
    • Saving locally & loading from local disk
      • Saving
      • Loading
    • Reading `.arrow` data


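Downloading data

1. Downloading with Python code

from datasets import load_dataset
imdb = load_dataset("imdb")
# `name` selects a dataset configuration (subset). The valid values depend on the dataset;
# some datasets define e.g. "full" (all data) and "mini" (a small sample).
# dataset = load_dataset(model_name, name="full")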

imdb
'''
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
'''

By default, the data is cached under ~/.cache/huggingface (in the datasets subfolder).

The cached files are laid out as follows:

$ cd datasets/imdb/
$ tree
.
└── plain_text
    └── 0.0.0
        ├── e6281661ce1c48d982bc483cf8a173c1bbeb5d31
        │   ├── dataset_info.json
        │   ├── imdb-test.arrow
        │   ├── imdb-train.arrow
        │   └── imdb-unsupervised.arrow
        ├── e6281661ce1c48d982bc483cf8a173c1bbeb5d31.incomplete_info.lock
        └── e6281661ce1c48d982bc483cf8a173c1bbeb5d31_builder.lock

3 directories, 6 files
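
The hash-named directory differs between dataset versions, so it can be easier to search the cache tree than to hard-code the path. A minimal sketch using only the standard library; `find_arrow_files` is a hypothetical helper name, and the layout is assumed to match the tree above.

```python
from pathlib import Path


def find_arrow_files(root):
    """Return every .arrow file below root, sorted by path."""
    root = Path(root).expanduser()
    return sorted(p for p in root.rglob("*.arrow") if p.is_file())


# Example call, assuming the default cache location:
# for path in find_arrow_files("~/.cache/huggingface/datasets/imdb"):
#     print(path)
```

The returned `Path` objects can be passed to any code that reads a single Arrow file.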

2. Downloading with the huggingface-cli command
This download is also stored under ~/.cache/huggingface (in the hub subfolder rather than datasets)

huggingface-cli download --repo-type dataset imdb

3. git
Dataset repositories on the Hugging Face Hub are Git repositories, so they can also be cloned with git.


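Changing the default cache location (HF_DATASETS_CACHE)

The datasets library reads HF_DATASETS_CACHE; TRANSFORMERS_CACHE only affects the transformers model cache.

Set an environment variable

export HF_DATASETS_CACHE='path/'

Or set it in code, before importing datasets

import os
os.environ['HF_DATASETS_CACHE'] = 'path/'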
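
To check where the datasets library would cache data under a given environment, the lookup can be mimicked with the standard library. This is a sketch of the order as I understand it (HF_DATASETS_CACHE, then HF_HOME, then XDG_CACHE_HOME, then ~/.cache), not the library's own code; `resolve_datasets_cache` is a hypothetical helper.

```python
import os
from pathlib import Path


def resolve_datasets_cache(env=None):
    """Return the directory the datasets library is expected to cache into.

    Assumed lookup order: HF_DATASETS_CACHE, then HF_HOME/datasets, then
    XDG_CACHE_HOME/huggingface/datasets, then ~/.cache/huggingface/datasets.
    """
    env = os.environ if env is None else env
    if env.get("HF_DATASETS_CACHE"):
        return Path(env["HF_DATASETS_CACHE"]).expanduser()
    xdg_cache = env.get("XDG_CACHE_HOME", "~/.cache")
    hf_home = env.get("HF_HOME", os.path.join(xdg_cache, "huggingface"))
    return Path(hf_home).expanduser() / "datasets"


print(resolve_datasets_cache())
```

For a one-off override, `load_dataset` also accepts a `cache_dir` argument, which avoids touching the environment at all.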

Saving locally & loading from local disk

Saving

save_path = '/Users/xx/Downloads/imdb'
imdb.save_to_disk(save_path)
'''
Saving the dataset (1/1 shards): 100%|█| 25000/25000 [00:00<00:00, 97903.42 exam
Saving the dataset (1/1 shards): 100%|█| 25000/25000 [00:00<00:00, 251032.07 exa
Saving the dataset (1/1 shards): 100%|█| 50000/50000 [00:00<00:00, 88591.53 exam
'''

from datasets import load_from_disk

imdb2 = load_from_disk(save_path)
imdb2
'''
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})
'''

The saved files are laid out as follows:

$ cd imdb/
$ tree
.
├── dataset_dict.json
├── test
│   ├── data-00000-of-00001.arrow
│   ├── dataset_info.json
│   └── state.json
├── train
│   ├── data-00000-of-00001.arrow
│   ├── dataset_info.json
│   └── state.json
└── unsupervised
    ├── data-00000-of-00001.arrow
    ├── dataset_info.json
    └── state.json

3 directories, 10 files

Loading

# Load only the test split
save_path1 = '/Users/xx/Downloads/imdb/test'
imdb3 = load_from_disk(save_path1)
imdb3
'''
Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})
'''

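imdb4 = load_dataset('imdb')  # reuses the copy in `.cache` if it was already downloaded

# Pointing load_dataset at a save_to_disk directory does not restore the dataset:
# the JSON metadata files are parsed as data (see the output below). Use load_from_disk instead.
imdb4 = load_dataset(path='/Users/xx/Downloads/imdb')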
'''
Generating train split: 1 examples [00:00, 69.32 examples/s]
Generating test split: 1 examples [00:00, 277.31 examples/s]
'''
imdb4
'''
DatasetDict({
    train: Dataset({
        features: ['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split'],
        num_rows: 1
    })
    test: Dataset({
        features: ['_data_files', '_fingerprint', '_format_columns', '_format_kwargs', '_format_type', '_output_all_columns', '_split'],
        num_rows: 1
    })
})
'''

# Loading a single file fails: load_from_disk expects a directory
save_path2 = '/Users/xx/Downloads/imdb/test/data-00000-of-00001.arrow'
imdb4 = load_from_disk(save_path2)
'''
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/xx/miniconda3/lib/python3.11/site-packages/datasets/load.py", line 2215, in load_from_disk
    raise FileNotFoundError(
FileNotFoundError: Directory /Users/xx/Downloads/imdb/test/data-00000-of-00001.arrow is neither a `Dataset` directory nor a `DatasetDict` directory.
'''

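load_from_disk cannot load from .cache/huggingface/datasets

The cache uses a different layout from save_to_disk, so every attempt below fails.

from datasets import load_from_disk

path = '/Users/xx/.cache/huggingface/datasets/imdb'
imdb2 = load_from_disk(path)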
'''
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/xx/miniconda3/lib/python3.11/site-packages/datasets/load.py", line 2215, in load_from_disk
    raise FileNotFoundError(
FileNotFoundError: Directory /Users/xx/.cache/huggingface/datasets/imdb is neither a `Dataset` directory nor a `DatasetDict` directory.
'''

path1 = '/Users/xx/.cache/huggingface/datasets/imdb/plain_text/0.0.0/e6281661ce1c48d982bc483cf8a173c1bbeb5d31/imdb-test.arrow'  

imdb2 = load_from_disk(path1)
'''
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/xx/miniconda3/lib/python3.11/site-packages/datasets/load.py", line 2215, in load_from_disk
    raise FileNotFoundError(
FileNotFoundError: Directory /Users/xx/.cache/huggingface/datasets/imdb/plain_text/0.0.0/e6281661ce1c48d982bc483cf8a173c1bbeb5d31/imdb-test.arrow is neither a `Dataset` directory nor a `DatasetDict` directory.
'''

path1 = '/Users/xx/.cache/huggingface/datasets/imdb/plain_text/0.0.0/e6281661ce1c48d982bc483cf8a173c1bbeb5d31/' 
imdb2 = load_from_disk(path1)
'''
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/xx/miniconda3/lib/python3.11/site-packages/datasets/load.py", line 2215, in load_from_disk
    raise FileNotFoundError(
FileNotFoundError: Directory /Users/xx/.cache/huggingface/datasets/imdb/plain_text/0.0.0/e6281661ce1c48d982bc483cf8a173c1bbeb5d31/ is neither a `Dataset` directory nor a `DatasetDict` directory.
'''

path1 = '/Users/xx/.cache/huggingface/datasets/imdb/plain_text/0.0.0/' 

imdb2 = load_from_disk(path1)
'''
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/xx/miniconda3/lib/python3.11/site-packages/datasets/load.py", line 2215, in load_from_disk
    raise FileNotFoundError(
FileNotFoundError: Directory /Users/xx/.cache/huggingface/datasets/imdb/plain_text/0.0.0/ is neither a `Dataset` directory nor a `DatasetDict` directory.
'''


path1 = '/Users/xx/.cache/huggingface/datasets/imdb/plain_text/' 
imdb2 = load_from_disk(path1)
'''
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/xx/miniconda3/lib/python3.11/site-packages/datasets/load.py", line 2215, in load_from_disk
    raise FileNotFoundError(
FileNotFoundError: Directory /Users/xx/.cache/huggingface/datasets/imdb/plain_text/ is neither a `Dataset` directory nor a `DatasetDict` directory.
'''
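
The error message hints at why all of these fail: `load_from_disk` looks for the marker files that `save_to_disk` writes. The sketch below mimics that check with the standard library. It assumes a `Dataset` directory is recognised by `state.json` plus `dataset_info.json` and a `DatasetDict` directory by `dataset_dict.json`; `classify_saved_dir` is a hypothetical helper, not part of the library.

```python
from pathlib import Path


def classify_saved_dir(path):
    """Guess what load_from_disk would find at path.

    Assumption: a Dataset directory holds both state.json and
    dataset_info.json, and a DatasetDict directory holds dataset_dict.json.
    """
    path = Path(path)
    if (path / "state.json").is_file() and (path / "dataset_info.json").is_file():
        return "Dataset"
    if (path / "dataset_dict.json").is_file():
        return "DatasetDict"
    return "neither"
```

A cache directory contains `dataset_info.json` but no `state.json`, so under these assumptions it is classified as "neither", as is a path to a single `.arrow` file. To reuse cached data, call `load_dataset('imdb')` again or read an `.arrow` file directly.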

Reading .arrow data

An .arrow file cannot be opened by double-clicking it; the code below reads its contents.

import pyarrow.ipc

def read_arrow_to_df_julia_ok(path):
    # Files written by datasets use the Arrow IPC streaming format
    with open(path, "rb") as f:
        r = pyarrow.ipc.RecordBatchStreamReader(f)
        df = r.read_pandas()
        return df

# Either path works; the second assignment overrides the first
path = '/Users/xx/Downloads/imdb/test/data-00000-of-00001.arrow'
path = '/Users/xx/.cache/huggingface/datasets/imdb/plain_text/0.0.0/e6281661ce1c48d982bc483cf8a173c1bbeb5d31/imdb-test.arrow'
table = read_arrow_to_df_julia_ok(path)
# Print the data
print('Data:\n', table)

Result

Data:
                                                     text  label
0      I love sci-fi and am willing to put up with a ...      0
1      Worth the entertainment value of a rental, esp...      0
2      its a totally average film with a few semi-alr...      0
3      STAR RATING: ***** Saturday Night **** Friday ...      0
4      First off let me say, If you haven't enjoyed a...      0
...                                                  ...    ...
24995  Just got around to seeing Monster Man yesterda...      1
24996  I got this as part of a competition prize. I w...      1
24997  I got Monster Man in a box set of three films ...      1
24998  Five minutes in, i started to feel how naff th...      1
24999  I caught this movie on the Sci-Fi channel rece...      1

