
Datasets downloaded from Hugging Face are stored in Arrow format: how to read Arrow data from a local path and print a sample

article 2025/1/3 6:11:48

Loading an Arrow file from a local path and printing it

My custom dataset is saved at this path: ~/.cache/huggingface/datasets/allenai___tulu-3-sft-mixture2/default/0.0.0/55e9fd6d41c3cd1a98270dff07557bc2a1e1ba91/tulu-3-sft-mixture-train-00000-of-00001.arrow. I want to check that it can be read, and print the first record.


import os

from datasets import load_dataset

# Directory that contains the Arrow file(s). "~" is expanded explicitly,
# because it may not be resolved to the home directory otherwise.
dataset_path = os.path.expanduser("~/.cache/huggingface/datasets/allenai___tulu-3-sft-mixture2/default/0.0.0/55e9fd6d41c3cd1a98270dff07557bc2a1e1ba91")

# Load the dataset from the named Arrow file
dataset = load_dataset(dataset_path, data_files="tulu-3-sft-mixture-train-00000-of-00001.arrow", split="train")
# Generating train split: 1000 examples [00:00, 32709.99 examples/s]

# Print the dataset summary
print(dataset)
# Dataset({
#     features: ['id', 'messages', 'source'],
#     num_rows: 1000
# })

# Print the first record
print(dataset[0])

Printed result:

{
    "id": "ai2-adapt-dev/flan_v2_converted_21688",
    "messages": [
      {
        "content": "Given the task definition, example input & output, solve the new input case.\nGiven a text passage as input comprising of dialogue of negotiations between a seller and a buyer about the sale of an item, your task is to classify the item being sold into exactly one of these categories: 'housing', 'furniture', 'bike', 'phone', 'car', 'electronics'. The output should be the name of the category from the stated options and there should be exactly one category for the given text passage.\nExample: Seller: hi\nBuyer: Hello\nSeller: do you care to make an offer?\nBuyer: The place sounds nice, but may be a little more than I can afford\nSeller: well how much can you soend?\nBuyer: I was looking for something in the 1500-1600 range\nSeller: That is really unreasonable considering all the immenities and other going rates, you would need to come up to at least 3000\nBuyer: I have seen some 2 bedrooms for that price, which I could split the cost with a roommate, so even with amenities, this may be out of my range\nSeller: it may be then... the absolute lowest i will go is 2700. that is my final offer.\nBuyer: Ok, I think the most I could spend on this is 2000 - we are a ways apart\nSeller: ya that is far too low like i said 2700\nBuyer: Ok, thanks for your consideration. I will have to keep looking for now.\nSeller: good luck\nOutput: housing\nThe answer 'housing' is correct because a house is being talked about which is indicated by the mention of 'bedrooms' and 'amenities' which are words that are both related to housing.\n\nNew input case for you: Buyer: hi\nSeller: hello. Are you interested in the table for sale?\nBuyer: Yes, can you tell me , is your home smoke free?\nSeller: yes it is. I do , however have pets.\nBuyer: they never peed on it or anything, right?\nSeller: No they have not. 
There is a small broken corner but you cannot tell when it is against the wall\nBuyer: Do you think you would be willing to bring down the price since it is somewhat damaged?\nSeller: Maybe a little, however It is fairly new so I would like to get close to what i paid for\nBuyer: well a used product can rarely fetch the price paid, I was thinking  $59 and I can come pick it up myself.\nSeller: I would consider $75\nBuyer: I feel that is too much since it is damaged already and have to try to repaint and repair it.\nSeller: The paint is very fresh and does not need a new coat. The price I paid last year was $300 so you are getting it for less than 1/3\nBuyer: I need to paint it if there is a chip, either I can do 60 cash and come get it, or I will have to pass.\nSeller: Okay we can do that.\nBuyer: \nSeller: \n\nOutput: ",
        "role": "user"
      },
      {
        "content": "furniture",
        "role": "assistant"
      }
    ],
    "source": "ai2-adapt-dev/flan_v2_converted"
  }
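
A raw record like the one above is hard to scan because the user turn is so long. The following is a minimal sketch of a more compact printout, assuming each record has the `id`, `messages`, and `source` fields shown in the output. The helper name `summarize_record` and the `sample` dictionary (with its content cut to the first sentence) are hypothetical and only serve the illustration.

```python
import textwrap


def summarize_record(record, width=80):
    """Return a short, human-readable summary of one chat-style record."""
    lines = [f"id: {record['id']}", f"source: {record['source']}"]
    for turn in record["messages"]:
        # shorten() collapses whitespace (including newlines) and truncates
        # the text to at most `width` characters.
        preview = textwrap.shorten(turn["content"], width=width, placeholder=" ...")
        lines.append(f"[{turn['role']}] {preview}")
    return "\n".join(lines)


# Hypothetical record shaped like the printed sample; the user content is
# abbreviated to its first sentence for illustration.
sample = {
    "id": "ai2-adapt-dev/flan_v2_converted_21688",
    "messages": [
        {
            "content": "Given the task definition, example input & output, solve the new input case.",
            "role": "user",
        },
        {"content": "furniture", "role": "assistant"},
    ],
    "source": "ai2-adapt-dev/flan_v2_converted",
}

print(summarize_record(sample, width=40))
```

With a loaded dataset, the same helper could be called as `summarize_record(dataset[0])`, since indexing a dataset by an integer returns a plain dictionary.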

What if there are many Arrow files to load?

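Take this path as an example: ~/.cache/huggingface/datasets/allenai___tulu-3-sft-mixture/default/0.0.0/55e9fd6d41c3cd1a98270dff07557bc2a1e1ba91. What is the problem here? While the dataset is in use, intermediate cache*.arrow files are generated in the same directory (to speed up processing), so the specific Arrow files we actually need have to be named explicitly. The code below shows how.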


The code is as follows:

import os

from datasets import load_dataset

# Dataset directory and the shard files to load (cache*.arrow files are left out)
dataset_path = os.path.expanduser("~/.cache/huggingface/datasets/allenai___tulu-3-sft-mixture/default/0.0.0/55e9fd6d41c3cd1a98270dff07557bc2a1e1ba91")
selected_files = [
    "tulu-3-sft-mixture-train-00000-of-00006.arrow",
    "tulu-3-sft-mixture-train-00001-of-00006.arrow",
    "tulu-3-sft-mixture-train-00002-of-00006.arrow",
    "tulu-3-sft-mixture-train-00003-of-00006.arrow",
    "tulu-3-sft-mixture-train-00004-of-00006.arrow",
    "tulu-3-sft-mixture-train-00005-of-00006.arrow",
]

# Load the dataset. With no split argument, a DatasetDict is returned,
# with the rows under the "train" key.
dataset = load_dataset(dataset_path, data_files=selected_files)
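
Listing every shard by hand gets tedious as the shard count grows. As a rough sketch, the list could instead be built from the directory contents, assuming the shards follow the `<name>-<split>-<index>-of-<total>.arrow` naming seen above and the intermediate files start with `cache`. The helper `find_shards` and the regular expression are hypothetical, not part of the `datasets` library, and the demo uses empty placeholder files in a temporary directory.

```python
import re
import tempfile
from pathlib import Path

# Matches shard names such as "tulu-3-sft-mixture-train-00000-of-00006.arrow".
SHARD_RE = re.compile(
    r"^(?P<name>.+)-(?P<split>[^-]+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.arrow$"
)


def find_shards(dataset_dir, split="train"):
    """Return sorted shard file names for `split`, skipping cache*.arrow files."""
    shards = []
    for path in Path(dataset_dir).expanduser().glob("*.arrow"):
        if path.name.startswith("cache"):
            continue  # intermediate file, not part of the dataset itself
        match = SHARD_RE.match(path.name)
        if match and match.group("split") == split:
            shards.append(path.name)
    return sorted(shards)


if __name__ == "__main__":
    # Demo on empty placeholder files (hypothetical names).
    with tempfile.TemporaryDirectory() as tmp:
        for name in [
            "demo-train-00001-of-00002.arrow",
            "demo-train-00000-of-00002.arrow",
            "cache-0123abcd.arrow",
        ]:
            (Path(tmp) / name).touch()
        print(find_shards(tmp))
```

The returned names are relative to the directory, so they could be passed as `data_files=find_shards(dataset_path)` in the call above. Sorting keeps the shards in index order, because the indices are zero-padded.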