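Datasets downloaded from Hugging Face are stored in Arrow format: how to read Arrow data from a local path and print a sample
Loading a local Arrow file and printing it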
As shown in the figure below, my custom dataset is saved at this path: ~/.cache/huggingface/datasets/allenai___tulu-3-sft-mixture2/default/0.0.0/55e9fd6d41c3cd1a98270dff07557bc2a1e1ba91/tulu-3-sft-mixture-train-00000-of-00001.arrow. I want to check that it can be read and print the first record.
import os
from datasets import load_dataset
# Path to the folder that contains the Arrow files.
# "~" is expanded explicitly so the folder is found as a local directory.
dataset_path = os.path.expanduser("~/.cache/huggingface/datasets/allenai___tulu-3-sft-mixture2/default/0.0.0/55e9fd6d41c3cd1a98270dff07557bc2a1e1ba91")
# Load the single Arrow file as the train split
dataset = load_dataset(dataset_path, data_files="tulu-3-sft-mixture-train-00000-of-00001.arrow", split="train")
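# Print the first record
print(dataset[0])
# Progress log shown while loading, followed by what print(dataset) displays:
# Generating train split: 1000 examples [00:00, 32709.99 examples/s]
# Dataset({
#     features: ['id', 'messages', 'source'],
#     num_rows: 1000
# })
Output of print(dataset[0]):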
{
"id": "ai2-adapt-dev/flan_v2_converted_21688",
"messages": [
{
"content": "Given the task definition, example input & output, solve the new input case.\nGiven a text passage as input comprising of dialogue of negotiations between a seller and a buyer about the sale of an item, your task is to classify the item being sold into exactly one of these categories: 'housing', 'furniture', 'bike', 'phone', 'car', 'electronics'. The output should be the name of the category from the stated options and there should be exactly one category for the given text passage.\nExample: Seller: hi\nBuyer: Hello\nSeller: do you care to make an offer?\nBuyer: The place sounds nice, but may be a little more than I can afford\nSeller: well how much can you soend?\nBuyer: I was looking for something in the 1500-1600 range\nSeller: That is really unreasonable considering all the immenities and other going rates, you would need to come up to at least 3000\nBuyer: I have seen some 2 bedrooms for that price, which I could split the cost with a roommate, so even with amenities, this may be out of my range\nSeller: it may be then... the absolute lowest i will go is 2700. that is my final offer.\nBuyer: Ok, I think the most I could spend on this is 2000 - we are a ways apart\nSeller: ya that is far too low like i said 2700\nBuyer: Ok, thanks for your consideration. I will have to keep looking for now.\nSeller: good luck\nOutput: housing\nThe answer 'housing' is correct because a house is being talked about which is indicated by the mention of 'bedrooms' and 'amenities' which are words that are both related to housing.\n\nNew input case for you: Buyer: hi\nSeller: hello. Are you interested in the table for sale?\nBuyer: Yes, can you tell me , is your home smoke free?\nSeller: yes it is. I do , however have pets.\nBuyer: they never peed on it or anything, right?\nSeller: No they have not. 
There is a small broken corner but you cannot tell when it is against the wall\nBuyer: Do you think you would be willing to bring down the price since it is somewhat damaged?\nSeller: Maybe a little, however It is fairly new so I would like to get close to what i paid for\nBuyer: well a used product can rarely fetch the price paid, I was thinking $59 and I can come pick it up myself.\nSeller: I would consider $75\nBuyer: I feel that is too much since it is damaged already and have to try to repaint and repair it.\nSeller: The paint is very fresh and does not need a new coat. The price I paid last year was $300 so you are getting it for less than 1/3\nBuyer: I need to paint it if there is a chip, either I can do 60 cash and come get it, or I will have to pass.\nSeller: Okay we can do that.\nBuyer: \nSeller: \n\nOutput: ",
"role": "user"
},
{
"content": "furniture",
"role": "assistant"
}
],
"source": "ai2-adapt-dev/flan_v2_converted"
}
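A full record can fill several screens, because `messages` holds whole conversations. For a quick look, a small helper can shorten each message before printing. This is only a sketch: `preview_record` and its `max_chars` parameter are hypothetical names, and it assumes the `id` / `messages` / `source` layout shown above. The `sample` dict is a shortened stand-in for a real record, not actual loader output.

```python
def preview_record(record, max_chars=80):
    """Return a short, readable summary of one record with truncated messages."""
    lines = [f"id: {record['id']}", f"source: {record['source']}"]
    for message in record["messages"]:
        # Collapse newlines and repeated spaces so each message fits on one line
        text = " ".join(message["content"].split())
        if len(text) > max_chars:
            text = text[:max_chars] + "..."
        lines.append(f"[{message['role']}] {text}")
    return "\n".join(lines)


# Stand-in record with the same fields as the output above (content shortened)
sample = {
    "id": "ai2-adapt-dev/flan_v2_converted_21688",
    "messages": [
        {
            "content": "Given the task definition, example input & output, solve the new input case.",
            "role": "user",
        },
        {"content": "furniture", "role": "assistant"},
    ],
    "source": "ai2-adapt-dev/flan_v2_converted",
}

print(preview_record(sample, max_chars=40))
```

With a dataset loaded as above, `print(preview_record(dataset[0]))` would give the same kind of summary for the real first record.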
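What about loading many Arrow files?
Take this path as an example: ~/.cache/huggingface/datasets/allenai___tulu-3-sft-mixture/default/0.0.0/55e9fd6d41c3cd1a98270dff07557bc2a1e1ba91. The catch is that working with the dataset leaves intermediate cache*.arrow files in the same folder (they speed up later processing), so we have to name exactly which Arrow files we want. The code below shows how.
The code: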
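import os
from datasets import load_dataset
# Dataset folder and the files to load ("~" is expanded explicitly)
dataset_path = os.path.expanduser("~/.cache/huggingface/datasets/allenai___tulu-3-sft-mixture/default/0.0.0/55e9fd6d41c3cd1a98270dff07557bc2a1e1ba91")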
selected_files = [
"tulu-3-sft-mixture-train-00000-of-00006.arrow",
"tulu-3-sft-mixture-train-00001-of-00006.arrow",
"tulu-3-sft-mixture-train-00002-of-00006.arrow",
"tulu-3-sft-mixture-train-00003-of-00006.arrow",
"tulu-3-sft-mixture-train-00004-of-00006.arrow",
"tulu-3-sft-mixture-train-00005-of-00006.arrow"
]
# Load the dataset. Without split=..., the result is a DatasetDict keyed by "train".
dataset = load_dataset(dataset_path, data_files=selected_files)
This loads many Arrow files from a single folder; they only need to be listed explicitly.
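Typing every shard name by hand gets tedious as the shard count grows. One option is to build the list by globbing the folder and skipping the intermediate cache files. A minimal sketch using only the standard library: the helper name `select_shard_files` is hypothetical, and it assumes that shards follow the `*-train-XXXXX-of-YYYYY.arrow` pattern seen above and that cache files start with `cache-`, so check both against your own folder.

```python
import os
import tempfile
from pathlib import Path


def select_shard_files(folder, split="train"):
    """Return the sorted names of '*-{split}-*-of-*.arrow' shards in folder.

    Files whose names start with 'cache-' are skipped (assumed naming of the
    intermediate cache files).
    """
    folder = Path(os.path.expanduser(str(folder)))
    names = []
    for path in folder.glob(f"*-{split}-*-of-*.arrow"):
        if path.name.startswith("cache-"):
            continue
        names.append(path.name)
    return sorted(names)


if __name__ == "__main__":
    # Demonstrate on a throwaway folder containing empty placeholder files
    with tempfile.TemporaryDirectory() as tmp:
        for name in [
            "demo-train-00001-of-00002.arrow",
            "demo-train-00000-of-00002.arrow",
            "cache-1a2b3c4d.arrow",
            "dataset_info.json",
        ]:
            Path(tmp, name).touch()
        print(select_shard_files(tmp))
```

The returned list can be passed as `data_files` in place of the hand-written `selected_files`.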
Postscript
Written in Shanghai at 13:24 on December 29, 2024.