
Datasets downloaded from Hugging Face are stored in Arrow format: how to read Arrow data from a local path and print a sample

Loading an Arrow file from a local path and printing it

My custom dataset is saved at this path: ~/.cache/huggingface/datasets/allenai___tulu-3-sft-mixture2/default/0.0.0/55e9fd6d41c3cd1a98270dff07557bc2a1e1ba91/tulu-3-sft-mixture-train-00000-of-00001.arrow. I want to check whether it can be read, and print the first record.


import os

from datasets import load_dataset

# Directory that contains the Arrow files. Expand "~" explicitly, because
# Python's path functions do not expand it on their own.
dataset_path = os.path.expanduser("~/.cache/huggingface/datasets/allenai___tulu-3-sft-mixture2/default/0.0.0/55e9fd6d41c3cd1a98270dff07557bc2a1e1ba91")

# Load the Arrow file; split="train" returns a Dataset instead of a DatasetDict
dataset = load_dataset(dataset_path, data_files="tulu-3-sft-mixture-train-00000-of-00001.arrow", split="train")

# Print the dataset summary
print(dataset)
# Generating train split: 1000 examples [00:00, 32709.99 examples/s]
# Dataset({
#     features: ['id', 'messages', 'source'],
#     num_rows: 1000
# })

# Print the first record
print(dataset[0])

The first record, as printed:

{
    "id": "ai2-adapt-dev/flan_v2_converted_21688",
    "messages": [
      {
        "content": "Given the task definition, example input & output, solve the new input case.\nGiven a text passage as input comprising of dialogue of negotiations between a seller and a buyer about the sale of an item, your task is to classify the item being sold into exactly one of these categories: 'housing', 'furniture', 'bike', 'phone', 'car', 'electronics'. The output should be the name of the category from the stated options and there should be exactly one category for the given text passage.\nExample: Seller: hi\nBuyer: Hello\nSeller: do you care to make an offer?\nBuyer: The place sounds nice, but may be a little more than I can afford\nSeller: well how much can you soend?\nBuyer: I was looking for something in the 1500-1600 range\nSeller: That is really unreasonable considering all the immenities and other going rates, you would need to come up to at least 3000\nBuyer: I have seen some 2 bedrooms for that price, which I could split the cost with a roommate, so even with amenities, this may be out of my range\nSeller: it may be then... the absolute lowest i will go is 2700. that is my final offer.\nBuyer: Ok, I think the most I could spend on this is 2000 - we are a ways apart\nSeller: ya that is far too low like i said 2700\nBuyer: Ok, thanks for your consideration. I will have to keep looking for now.\nSeller: good luck\nOutput: housing\nThe answer 'housing' is correct because a house is being talked about which is indicated by the mention of 'bedrooms' and 'amenities' which are words that are both related to housing.\n\nNew input case for you: Buyer: hi\nSeller: hello. Are you interested in the table for sale?\nBuyer: Yes, can you tell me , is your home smoke free?\nSeller: yes it is. I do , however have pets.\nBuyer: they never peed on it or anything, right?\nSeller: No they have not. 
There is a small broken corner but you cannot tell when it is against the wall\nBuyer: Do you think you would be willing to bring down the price since it is somewhat damaged?\nSeller: Maybe a little, however It is fairly new so I would like to get close to what i paid for\nBuyer: well a used product can rarely fetch the price paid, I was thinking  $59 and I can come pick it up myself.\nSeller: I would consider $75\nBuyer: I feel that is too much since it is damaged already and have to try to repaint and repair it.\nSeller: The paint is very fresh and does not need a new coat. The price I paid last year was $300 so you are getting it for less than 1/3\nBuyer: I need to paint it if there is a chip, either I can do 60 cash and come get it, or I will have to pass.\nSeller: Okay we can do that.\nBuyer: \nSeller: \n\nOutput: ",
        "role": "user"
      },
      {
        "content": "furniture",
        "role": "assistant"
      }
    ],
    "source": "ai2-adapt-dev/flan_v2_converted"
}
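The record above is shown as indented JSON, whereas a plain print(dataset[0]) prints a Python dict on a single line. A minimal sketch of one way to get the indented form, assuming the record contains only JSON-serializable values (strings and lists of dicts, as in this dataset). The helper name format_example and the sample record are hypothetical stand-ins, not part of the datasets library:

```python
import json


def format_example(example: dict) -> str:
    """Render one record as indented JSON, keeping non-ASCII text readable."""
    return json.dumps(example, ensure_ascii=False, indent=2)


# Hypothetical stand-in for dataset[0]; a real record has the same three keys.
sample = {
    "id": "example-0",
    "messages": [
        {"content": "Hello", "role": "user"},
        {"content": "Hi there", "role": "assistant"},
    ],
    "source": "example-source",
}

print(format_example(sample))
```

With a loaded dataset, print(format_example(dataset[0])) would be used the same way.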
  

What if there are many Arrow files to load?

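Take this path as an example: ~/.cache/huggingface/datasets/allenai___tulu-3-sft-mixture/default/0.0.0/55e9fd6d41c3cd1a98270dff07557bc2a1e1ba91. The catch is that using the dataset also produces intermediate cache*.arrow files in the same directory (they speed up later processing), so we have to say exactly which Arrow files we want. The code below shows how.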


The code:

import os

from datasets import load_dataset

# Dataset directory (with "~" expanded) and the shard files to load
dataset_path = os.path.expanduser("~/.cache/huggingface/datasets/allenai___tulu-3-sft-mixture/default/0.0.0/55e9fd6d41c3cd1a98270dff07557bc2a1e1ba91")
selected_files = [
    "tulu-3-sft-mixture-train-00000-of-00006.arrow",
    "tulu-3-sft-mixture-train-00001-of-00006.arrow",
    "tulu-3-sft-mixture-train-00002-of-00006.arrow",
    "tulu-3-sft-mixture-train-00003-of-00006.arrow",
    "tulu-3-sft-mixture-train-00004-of-00006.arrow",
    "tulu-3-sft-mixture-train-00005-of-00006.arrow"
]

# Load the dataset. Without split=..., this returns a DatasetDict,
# so the data is accessed as dataset["train"].
dataset = load_dataset(dataset_path, data_files=selected_files)

This loads many Arrow files from one folder; you only need to list them.
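Typing the shard names by hand gets tedious as the number of shards grows. Below is a minimal sketch that builds the list from the directory contents and skips the cache*.arrow files. The helper name select_shard_files and the naming pattern are assumptions inferred from the file names listed above, not something the datasets library provides. The demo uses a temporary directory with empty placeholder files, so it runs without the real dataset:

```python
import os
import re
import tempfile
from typing import List

# Assumed shard naming scheme: "<name>-train-00000-of-00006.arrow"
SHARD_PATTERN = re.compile(r".+-train-\d{5}-of-\d{5}\.arrow")


def select_shard_files(directory: str) -> List[str]:
    """Return the sorted shard file names in `directory`, skipping cache files."""
    directory = os.path.expanduser(directory)
    return [
        name
        for name in sorted(os.listdir(directory))
        if SHARD_PATTERN.fullmatch(name) and not name.startswith("cache")
    ]


# Demo on a temporary directory containing empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    for name in [
        "tulu-3-sft-mixture-train-00001-of-00002.arrow",
        "tulu-3-sft-mixture-train-00000-of-00002.arrow",
        "cache-0123abcd.arrow",
        "dataset_info.json",
    ]:
        open(os.path.join(tmp, name), "w").close()
    demo_files = select_shard_files(tmp)

print(demo_files)
```

The returned list can then be passed as data_files=select_shard_files(dataset_path) in the load_dataset call shown earlier.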

Afterword

Written in Shanghai at 13:24 on December 29, 2024.


http://www.kler.cn/a/458862.html
