
# Deep Learning Notes (6): Hugging Face Transformers

article 2024/11/13 22:31:58


Contents

1. Toolkit
2. Tokenizer
3. Loading the model
4. Output
5. The role of padding
- 5.1 attention_mask
- 5.2 Padding strategies
6. Datasets and models
- 6.1 The dataset
- 6.2 Processing the data
- 6.3 Processing the whole dataset
7. Training

1. Toolkit

Install transformers:

pip install transformers

Once installation finishes, you can verify it by importing transformers in Python and printing its version:

import transformers
print(transformers.__version__)

Then test it with a short snippet:

import warnings
warnings.filterwarnings("ignore")
from transformers import pipeline  # use a ready-made pipeline for simple tasks
classifier = pipeline("sentiment-analysis")
classifier(
    [
        "I've been waiting for a HuggingFace course my whole life.",
        "I hate this so much!",
    ]
)

If you get a result like the following, everything loaded correctly:

[{'label': 'POSITIVE', 'score': 0.9598049521446228},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

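The basic workflow is:
1. Take the input text.
2. Tokenize it. The input IDs are the tokenizer's output: each token is converted to its unique ID and the IDs are returned as a list. You must use the tokenizer that matches the model.
3. Feed the IDs to the model to get predictions.
4. Post-process the predictions.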

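2. Tokenizer

What a tokenizer does:

Splits text into words, subwords, or characters, and adds special tokens (start, end, separator, classification, and so on; these can be customized).
Maps each token to an ID (every token in the vocabulary has its own unique ID).
Returns auxiliary information as well, such as which sentence a token belongs to, plus masks that mark whether a position holds an original token or a special or padding token.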
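The three jobs listed above can be sketched with a toy tokenizer. This is only an illustration and not how the real WordPiece tokenizer works: `TOY_VOCAB`, `toy_encode`, and `toy_decode` are hypothetical names, the splitting rule is a simple regular expression, and the IDs are copied from the real tokenizer output shown in this post.

```python
import re

# Hypothetical mini vocabulary; the IDs are copied from the real tokenizer output in this post.
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
             "i": 1045, "hate": 5223, "this": 2023,
             "so": 2061, "much": 2172, "!": 999}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}

def toy_encode(text):
    """Lowercase, split into words and punctuation, add special tokens, map to IDs."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    return [TOY_VOCAB[t] for t in tokens]  # raises KeyError for words outside the toy vocabulary

def toy_decode(ids):
    """Map IDs back to tokens and join them with spaces."""
    return " ".join(ID_TO_TOKEN[i] for i in ids)

ids = toy_encode("I hate this so much!")
print(ids)
print(toy_decode(ids))
```

A real tokenizer also handles unknown words by breaking them into subword pieces, which this sketch does not attempt.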

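from transformers import AutoTokenizer  # picks the right tokenizer class for the checkpoint below, so this line rarely needs changing
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  # load whatever this model requires; search for models by name on the Hugging Face website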
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Then pass in some input:

raw_inputs = [
    "I've been waiting for a this course my whole life.",  # the two input sentences
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

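What do these parameters mean?
padding: the two sentences have different lengths, so they must be brought to the same length. padding=True pads to the longest sequence in the batch. For example, if one sentence has 5 tokens and the other has 10, the shorter one is padded with 0 up to length 10.
truncation: cuts off sequences that exceed the maximum length, which you can set yourself. For example, with a maximum of 8 words, "I've been waiting for a this course my whole life" would be cut to "I've been waiting for a this course my". In practice the limit is counted in tokens, special tokens included.
return_tensors: selects the framework of the returned tensors, "pt" for PyTorch and "tf" for TensorFlow.

The output is: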

{'input_ids': tensor([[ 101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 2023, 2607, 2026, 2878,
         2166, 1012,  102],
        [ 101, 1045, 5223, 2023, 2061, 2172,  999,  102,    0,    0,    0,    0,
            0,    0,    0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]])}

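The first row, [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 2023, 2607, 2026, 2878, 2166, 1012, 102], is the first sentence.
The second row is [101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0, 0, 0, 0, 0, 0]. The second sentence has only 6 tokens (five words plus "!"), so why are there 8 IDs before the zeros? Because special tokens are added: 101 is [CLS] and 102 is [SEP]. Everything after [SEP] is zero padding, which carries no meaning.
In attention_mask, positions marked 1 take part in the attention computation and positions marked 0 do not. The zeros padded onto the second sentence are meaningless, so they are left out of self-attention.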
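To make the relationship between padding and attention_mask concrete, here is a minimal plain-Python sketch that rebuilds the output above from the two unpadded ID lists. `pad_to_longest` is a hypothetical helper written for this post, not a function from transformers, and it assumes the pad token ID is 0 as in the output above.

```python
def pad_to_longest(batch, pad_id=0):
    """Pad every ID list to the longest one and build the matching attention mask."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # real tokens, then padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

first = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 2023, 2607, 2026, 2878, 2166, 1012, 102]
second = [101, 1045, 5223, 2023, 2061, 2172, 999, 102]

padded = pad_to_longest([first, second])
print(padded["input_ids"][1])
print(padded["attention_mask"][1])
```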

tokenizer.decode([ 101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 2023, 2607, 2026, 2878,2166, 1012,  102])

Decoding the IDs shows what was encoded:

"[CLS] i've been waiting for a this course my whole life. [SEP]"

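3. Loading the model

from transformers import AutoModel  # again, the model class is chosen automatically

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  # model name
model = AutoModel.from_pretrained(checkpoint)  # load the pretrained weights

model  # display the model to inspect its structure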

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Linear(in_features=3072, out_features=768, bias=True)
          (activation): GELUActivation()
        )
        (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      )
    )
  )
)

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

torch.Size([2, 15, 768])

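The first number, 2, is the batch size: the tensor holds 2 samples.
The second number, 15, is the sequence length: each sample is a sequence of 15 tokens (after padding).
The third number, 768, is the feature dimension: every position in the sequence has a feature vector of 768 numbers.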

4. Output

Pick the output head that matches the task. For sequence classification tasks such as sentiment analysis or text classification, use AutoModelForSequenceClassification.


from transformers import AutoModelForSequenceClassification

# Name of the pretrained checkpoint: a DistilBERT model already fine-tuned on SST-2 (sentiment analysis)
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

# Load the pretrained model together with its sequence classification head
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Run the model; inputs is a dict holding everything the model needs
outputs = model(**inputs)

# Print the shape of the logits
# The shape is [batch_size, num_labels]: batch_size is the number of inputs, num_labels the number of classes
# For SST-2 sentiment analysis, num_labels is 2 (negative and positive)
print(outputs.logits.shape)

torch.Size([2, 2])

The model predicted on a batch of 2 samples.
For each sample it outputs 2 values: the raw scores (logits) for the negative and positive classes.

Then apply softmax:

import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

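A function such as softmax should be applied along the class dimension, so that it yields each sample's probability for each class. For logits of shape [batch_size, num_classes] the class dimension is the last one, hence dim=-1.
The code above therefore prints the class probabilities: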

tensor([[1.5446e-02, 9.8455e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)

id2label maps each class index to a label name. Later we can define this mapping ourselves, with whatever label names we like.

model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
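The post-processing step (softmax over the class dimension, then a lookup in id2label) can be sketched in plain Python. This is a minimal illustration: the `softmax` helper and the logits in `batch_logits` are made up for the example and are not taken from the model run above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # same mapping as model.config.id2label above

# Illustrative logits for two samples (hypothetical values).
batch_logits = [[-2.0, 2.0], [3.5, -4.0]]

results = []
for logits in batch_logits:
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the most likely class
    results.append((id2label[best], probs[best]))

print(results)
```

Subtracting the largest logit before exponentiating does not change the result; it only avoids overflow for large values.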

5. The role of padding

The following code is only an illustration and is unrelated to the task above.

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]

batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)

print(model(torch.tensor(batched_ids)).logits)

tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward0>)
tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)
tensor([[ 1.5694, -1.3895],
        [ 1.3374, -1.2163]], grad_fn=<AddmmBackward0>)

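Notice that sequence2_ids = [[200, 200]] on its own and the padded row [200, 200, tokenizer.pad_token_id] give different predictions. No attention_mask was passed, so the model treated the padding position as a real token, which changed the result. An attention_mask is therefore required whenever the inputs are padded.

5.1 attention_mask

batched_ids = [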
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]

attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]

outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
print(outputs.logits)

tensor([[ 1.5694, -1.3895],
        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward0>)

With the attention_mask added, the second row matches the prediction for the unpadded sequence.
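Why the mask restores the original result can be seen in a toy version of attention. The sketch below is a simplification under stated assumptions: a single query, scalar values, and made-up scores. `attend` is a hypothetical helper, not the model's actual implementation.

```python
import math

def attend(scores, values, mask=None):
    """Softmax-weighted average of values; positions with mask 0 get zero weight."""
    if mask is None:
        mask = [1] * len(scores)
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    top = max(masked)
    weights = [math.exp(s - top) for s in masked]  # exp(-inf) is 0.0, so masked positions vanish
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, values))

# Hypothetical attention scores and values for a two-token sequence.
scores, values = [0.2, 1.0], [1.0, 3.0]
# The same sequence with one padding position appended.
pad_scores, pad_values = scores + [0.5], values + [10.0]

unpadded = attend(scores, values)
padded_no_mask = attend(pad_scores, pad_values)
padded_masked = attend(pad_scores, pad_values, mask=[1, 1, 0])
print(unpadded, padded_no_mask, padded_masked)
```

Without the mask the padding position receives a share of the attention weight and shifts the output; with the mask its weight is exactly zero and the output equals the unpadded one.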

5.2 Padding strategies

sequences = ["I've been waiting for a this course my whole life.", "So have I!", "I played basketball yesterday."]

1. Pad to the longest sequence in the batch

model_inputs = tokenizer(sequences, padding="longest")
model_inputs

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 2023, 2607, 2026, 2878, 2166, 1012, 102], [101, 2061, 2031, 1045, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2209, 3455, 7483, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}

2. Pad to the model's maximum length (512 for BERT)

model_inputs = tokenizer(sequences, padding="max_length")
model_inputs

3. Pad to a length you specify: shorter sequences are padded up to it, but longer ones are not truncated

model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
model_inputs

{'input_ids': [[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 2023, 2607, 2026, 2878, 2166, 1012, 102], [101, 2061, 2031, 1045, 999, 102, 0, 0], [101, 1045, 2209, 3455, 7483, 1012, 102, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1, 0]]}

4. Truncate at a given length: truncation=True enables truncation

model_inputs = tokenizer(sequences, max_length=10, truncation=True)
model_inputs
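The difference between options 3 and 4 can be sketched in plain Python. `fit_to_length` is a hypothetical helper that only imitates the behaviour described above for one BERT-style ID list. It assumes the pad ID is 0 and that truncation keeps the final [SEP] (ID 102), which is what the special-token layout in this post suggests.

```python
def fit_to_length(ids, max_length, pad=False, truncate=False, pad_id=0, sep_id=102):
    """Imitate padding="max_length" and truncation=True for a single ID list."""
    if truncate and len(ids) > max_length:
        # Keep the leading tokens and re-attach [SEP] so the sequence still ends properly.
        ids = ids[:max_length - 1] + [sep_id]
    if pad and len(ids) < max_length:
        ids = ids + [pad_id] * (max_length - len(ids))
    return ids

long_ids = [101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 2023, 2607, 2026, 2878, 2166, 1012, 102]
short_ids = [101, 2061, 2031, 1045, 999, 102]

padded_short = fit_to_length(short_ids, 8, pad=True)        # padded up to length 8
padded_long = fit_to_length(long_ids, 8, pad=True)          # unchanged: padding alone never shortens
truncated_long = fit_to_length(long_ids, 10, truncate=True)  # cut to length 10, still ending in [SEP]

print(padded_short)
print(len(padded_long))
print(truncated_long)
```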

6. Datasets and models

6.1 The dataset

First install the datasets library: pip install datasets

import warnings
warnings.filterwarnings("ignore")
from datasets import load_dataset  # datasets is Hugging Face's library for loading and manipulating datasets

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

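import warnings: imports Python's warnings module, which controls warning messages.
warnings.filterwarnings("ignore"): suppresses all warnings. This can be useful when the warnings are irrelevant to the task at hand, or when you want a quieter output while debugging.
These two lines are therefore optional.
load_dataset("glue", "mrpc") loads the "mrpc" subset of the "glue" dataset.

Here is what the data looks like: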

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

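sentence1: the first sentence.
sentence2: the second sentence.
label: the label for the classification task.
idx: the index or ID of the sample.
num_rows: the number of rows, that is, samples, in the split.

raw_train_dataset = raw_datasets["train"]  # select the training split
raw_train_dataset[100]  # look at sample 100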

{'sentence1': 'The Nasdaq composite index inched up 1.28 , or 0.1 percent , to 1,766.60 , following a weekly win of 3.7 percent .',  # first sentence
 'sentence2': 'The technology-laced Nasdaq Composite Index .IXIC was off 24.44 points , or 1.39 percent , at 1,739.87 .',  # second sentence
 'label': 0,  # binary label
 'idx': 114}

label: indicates whether the two sentences are paraphrases of each other. Here the label is 0, meaning they are not.
idx: the index or ID of the sample, a unique identifier for each sample in the dataset. Here it is 114.

raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

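sentence1 and sentence2: the two sentence features. Their values are strings.
label: the label feature, indicating whether the two sentences are paraphrases. Its type is ClassLabel, a special type for categorical features whose names attribute lists the possible label values (here ['not_equivalent', 'equivalent']).
idx: the index feature. Its values are integers (int32).

6.2 Processing the data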

from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

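AutoTokenizer loads the tokenizer that matches the given model name. Using the tokenizer a pretrained model was trained with guarantees that text is split and mapped to IDs the way the model expects, during both training and inference.

For example, for the pretrained BERT model "bert-base-uncased", AutoTokenizer loads the matching tokenizer, which you then use to convert input text into a form the model can process.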

inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

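There is a new field here: token_type_ids.
The bert-base-uncased tokenizer joins the two sentences into a single sequence, so token_type_ids marks which sentence each token belongs to (0 for the first, 1 for the second).
The tokenizer behaves this way for this model and task; other models and tasks may need different settings.
You can map the IDs back with tokenizer.convert_ids_to_tokens(inputs["input_ids"]) to check: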

['[CLS]',
 'this',
 'is',
 'the',
 'first',
 'sentence',
 '.',
 '[SEP]',
 'this',
 'is',
 'the',
 'second',
 'one',
 '.',
 '[SEP]']
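The layout [CLS] sentence 1 [SEP] sentence 2 [SEP] and its token_type_ids can be rebuilt by hand. The sketch below is an illustration only: `encode_pair` is a hypothetical helper, and the per-word IDs are copied from the tokenizer output shown above.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP], as seen in the tokenizer output above

def encode_pair(ids_a, ids_b):
    """Join two already-tokenized sentences the way the output above lays them out."""
    first = [CLS_ID] + ids_a + [SEP_ID]  # segment 0: [CLS] A [SEP]
    second = ids_b + [SEP_ID]            # segment 1: B [SEP]
    return {
        "input_ids": first + second,
        "token_type_ids": [0] * len(first) + [1] * len(second),
        "attention_mask": [1] * (len(first) + len(second)),
    }

sentence_a = [2023, 2003, 1996, 2034, 6251, 1012]  # this is the first sentence .
sentence_b = [2023, 2003, 1996, 2117, 2028, 1012]  # this is the second one .

pair = encode_pair(sentence_a, sentence_b)
print(pair["input_ids"])
print(pair["token_type_ids"])
```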

6.3 Processing the whole dataset

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

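This function is used during preprocessing to convert raw text into the integer sequences the model accepts, which keeps the model inputs consistent.
truncation=True tells the tokenizer to truncate inputs to the model's maximum sequence length: if a sequence is too long, the tokens beyond that length are cut off.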

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

With batched=True, tokenize_function is applied to a batch of samples at a time rather than to one sample at a time, which is more efficient, especially on large datasets.

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

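You can see three new fields: 'input_ids', 'token_type_ids', and 'attention_mask'. The tokenizer's outputs are added as new columns alongside the original data.

Note: tokenization is only preprocessing. The model cannot consume the data directly yet.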

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

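Wrap the data so the model can read it. In practice, the DataCollator is what assembles the samples into batches.

samples = tokenized_datasets["train"][:8]  # all columns of the first 8 samples
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}  # drop the columns the model does not need
[len(x) for x in samples["input_ids"]]  # length of each sample

That is, drop the "idx", "sentence1", and "sentence2" columns and keep the rest, because the model does not use them during training.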

batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

{'attention_mask': torch.Size([8, 67]),
 'input_ids': torch.Size([8, 67]),
 'labels': torch.Size([8]),
 'token_type_ids': torch.Size([8, 67])}

In other words, after data_collator has processed them, all samples in the batch have the same length (67 here).
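What the collator does can be approximated in plain Python: each batch is padded to the longest sample in that batch, and the label column comes out as labels. The sketch below is a simplified stand-in, not the real DataCollatorWithPadding: `collate` and `make_sample` are hypothetical helpers, the samples are fake, and the output is plain lists rather than tensors.

```python
def collate(samples, pad_id=0):
    """Pad every sequence feature to the longest input in this batch; rename label to labels."""
    max_len = max(len(s["input_ids"]) for s in samples)
    batch = {"input_ids": [], "token_type_ids": [], "attention_mask": [], "labels": []}
    for s in samples:
        n_pad = max_len - len(s["input_ids"])
        batch["input_ids"].append(s["input_ids"] + [pad_id] * n_pad)
        batch["token_type_ids"].append(s["token_type_ids"] + [0] * n_pad)
        batch["attention_mask"].append(s["attention_mask"] + [0] * n_pad)
        batch["labels"].append(s["label"])
    return batch

def make_sample(length, label):
    """Hypothetical helper: a fake tokenized sample of the given length."""
    return {"input_ids": [101] + [2023] * (length - 2) + [102],
            "token_type_ids": [0] * length,
            "attention_mask": [1] * length,
            "label": label}

batch_a = collate([make_sample(5, 1), make_sample(9, 0)])
batch_b = collate([make_sample(4, 0), make_sample(6, 1)])
print(len(batch_a["input_ids"][0]), len(batch_b["input_ids"][0]))
```

Because the width is decided per batch, a batch of short sentences is not padded out to the length of the longest sentence in the whole dataset.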

7. Training

from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

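TrainingArguments is the transformers class that defines and configures the parameters for model training.
Once a TrainingArguments object is created, you can use it to set training options such as the learning rate, batch size, and maximum number of training steps.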

training_args

TrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=-1,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=test-trainer\runs\May26_10-08-48_WIN-BM410VRSBIO,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
optim=OptimizerNames.ADAMW_HF,
output_dir=test-trainer,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=8,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
remove_unused_columns=True,
report_to=['tensorboard', 'wandb'],
resume_from_checkpoint=None,
run_name=test-trainer,
save_on_each_node=False,
save_steps=500,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)

With this many parameters, how do you know what to change? The API documentation is here:
https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

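AutoModelForSequenceClassification is the transformers class that loads a pretrained model with a sequence classification head.
checkpoint is a string holding the name of a pretrained model, here "bert-base-uncased".
num_labels=2 sets the number of classes the model predicts. For binary classification it is 2.

The output layer (the classification head) is not pretrained, so we have to train it ourselves.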

from transformers import Trainer

trainer = Trainer(
    model,  # the model defined above
    training_args,  # the training configuration shown above
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],  # the two tokenized splits prepared earlier
    data_collator=data_collator,
    tokenizer=tokenizer,
)

For details on how to use these parameters, see the Trainer section of the transformers documentation.
Then train the model:

trainer.train()


Saving model checkpoint to test-trainer\checkpoint-500
Configuration saved in test-trainer\checkpoint-500\config.json
Model weights saved in test-trainer\checkpoint-500\pytorch_model.bin
tokenizer config file saved in test-trainer\checkpoint-500\tokenizer_config.json
Special tokens file saved in test-trainer\checkpoint-500\special_tokens_map.json
Saving model checkpoint to test-trainer\checkpoint-1000
Configuration saved in test-trainer\checkpoint-1000\config.json
Model weights saved in test-trainer\checkpoint-1000\pytorch_model.bin
tokenizer config file saved in test-trainer\checkpoint-1000\tokenizer_config.json
Special tokens file saved in test-trainer\checkpoint-1000\special_tokens_map.json
Training completed. Do not forget to share your model on huggingface.co/models =)

The log includes the paths where the model was saved, for example:

Model weights saved in test-trainer\checkpoint-500\pytorch_model.bin
Model weights saved in test-trainer\checkpoint-1000\pytorch_model.bin
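Why checkpoints appear at steps 500 and 1000 and nowhere else follows from the numbers already shown in this post. A quick back-of-the-envelope check, assuming training ran on a single device with gradient_accumulation_steps=1, as the TrainingArguments printout suggests:

```python
import math

# Values read from the dataset summary and the TrainingArguments printout in this post.
num_train_rows = 3668
per_device_train_batch_size = 8
num_train_epochs = 3
save_steps = 500

# Assumption: a single device and no gradient accumulation, so one step consumes one batch of 8.
steps_per_epoch = math.ceil(num_train_rows / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))

print(steps_per_epoch, total_steps, checkpoint_steps)
```

Under these assumptions training ends before step 1500 is reached, which is consistent with the two checkpoints in the log.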

Then check the results on the validation set:

predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

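trainer.predict runs the model on the validation set; tokenized_datasets["validation"] is the tokenized validation split.

The print statement shows the shapes of the results:
predictions.predictions is an array holding the model's outputs (logits) for the validation set.
predictions.label_ids is an array holding the true labels of the validation set.
Printing both shapes shows how the dimensions of the predictions and the labels line up.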

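import numpy as np
from datasets import load_metric  # deprecated in newer datasets releases; there, evaluate.load("glue", "mrpc") from the evaluate package is the replacement

preds = np.argmax(predictions.predictions, axis=-1)  # logits -> predicted class indices
metric = load_metric("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)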

load_metric loads a predefined evaluation metric, for tasks such as sentiment analysis or text classification.

"glue", "mrpc" names the metric: the one defined for the "mrpc" task of the "glue" benchmark.
metric.compute() compares the predictions with the true labels and returns the result.
Its predictions argument takes the predicted class indices (the argmax of the logits in predictions.predictions), and references takes the true labels in predictions.label_ids.

{'accuracy': 0.8186274509803921, 'f1': 0.8754208754208753}
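To show what the two reported numbers mean, here is a plain-Python sketch that computes accuracy and F1 from logits. It is an illustration, not the metric's actual implementation: `accuracy_and_f1` is a hypothetical helper, the logits and labels are made up, and F1 is assumed to be computed for the positive class (label 1, "equivalent").

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_and_f1(logits, labels, positive=1):
    """Accuracy plus F1 for the positive class, computed from raw logits."""
    preds = [argmax(row) for row in logits]
    pairs = list(zip(preds, labels))
    correct = sum(p == y for p, y in pairs)
    tp = sum(p == positive and y == positive for p, y in pairs)  # true positives
    fp = sum(p == positive and y != positive for p, y in pairs)  # false positives
    fn = sum(p != positive and y == positive for p, y in pairs)  # false negatives
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}

# Hypothetical logits and true labels for four sentence pairs.
logits = [[0.1, 2.0], [1.5, -0.3], [0.2, 0.9], [-1.0, 0.4]]
labels = [1, 0, 0, 1]

scores = accuracy_and_f1(logits, labels)
print(scores)
```

In this toy batch three of four predictions are correct, and the single mistake is a false positive, so F1 ends up higher than accuracy.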

The predictions and true labels are typically stored under /path/to/your/dataset.

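import numpy as np

def compute_metrics(eval_preds):
    metric = load_metric("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)  # logits -> predicted class indices
    return metric.compute(predictions=predictions, references=labels)

This function usually needs no changes. With it passed to the Trainer, the evaluation output reports more metrics.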

training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)