LLM Fine-Tuning | Fine-tuning Qwen2.5-7B-Instruct with LoRA for Text Classification
Preface: I recently needed an LLM for a baseline run. This post records how to fine-tune Qwen2.5-7B-Instruct with LoRA for a text classification task, along with fixes for the errors I hit along the way.
Contents
- Model Selection
- Dataset
- Code
- Debugging Notes
Model Selection
Compute: the available GPU is an H800 (80 GB of memory), so I fine-tune Qwen2.5-7B-Instruct with LoRA. At a batch size of 32, actual memory usage is about 72 GB. (Memory usage can be reduced by lowering the batch size; at batch size 8 it is about 42 GB.)
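LoRA fits in this budget because only the low-rank adapter matrices receive gradients and optimizer state. As a back-of-the-envelope sketch (the Qwen2.5-7B config values below — hidden size 3584, 28 layers, 4 KV heads of dimension 128 — are my assumptions; verify against the model's `config.json`), the trainable parameter count for `r=8` on `q_proj` and `v_proj` is:

```python
# Rough count of LoRA trainable parameters for r=8 on q_proj/v_proj.
# Assumed Qwen2.5-7B config values -- check the model's config.json.
hidden_size = 3584      # model hidden dimension (assumption)
num_layers = 28         # transformer layers (assumption)
num_kv_heads = 4        # grouped-query attention KV heads (assumption)
head_dim = 128          # per-head dimension (assumption)
r = 8                   # LoRA rank, as in the config below

q_out = hidden_size               # q_proj: hidden -> hidden
v_out = num_kv_heads * head_dim   # v_proj: hidden -> 512 under GQA

# Each adapted weight W (d_out x d_in) gains A (r x d_in) and B (d_out x r),
# i.e. r * (d_in + d_out) trainable parameters.
per_layer = r * (hidden_size + q_out) + r * (hidden_size + v_out)
total = num_layers * per_layer
print(f"trainable LoRA params: {total:,}")  # ~2.5M, vs ~7.6B frozen base weights
```

Under these assumptions the count should roughly match what peft's `model.print_trainable_parameters()` reports, which is why the optimizer adds so little on top of the frozen weights.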
Dataset
Three CSV files: the training, validation, and test sets. Each file has two columns, descriptionE and rank, holding the text and the ground-truth label respectively. In the code below, the load_data_s(·) function reads the data, and the TextDataset class wraps it as a dataset object.
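To make the expected file layout concrete, here is a stdlib-only sketch of the same loading logic (the column names descriptionE/rank match the files above; the two sample rows are made up):

```python
import csv
import io

# A made-up two-row sample in the same shape as trainset.csv.
sample = """descriptionE,rank
"A detailed, well-written description",5
"Sparse, low-effort text",2
"""

def load_rows(f):
    """Read texts and labels; shift rank from [1..5] to [0..4] for the classifier."""
    reader = csv.DictReader(f)
    texts, labels = [], []
    for row in reader:
        texts.append(row["descriptionE"].strip())
        labels.append(int(row["rank"]) - 1)  # 5-way classification expects 0-based labels
    return texts, labels

texts, labels = load_rows(io.StringIO(sample))
print(labels)  # ranks 5 and 2 become labels 4 and 1
```

The rank shift matters: AutoModelForSequenceClassification with num_labels=5 expects labels in [0, 4], so feeding raw ranks 1–5 would crash on label 5.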
Code
The full code is below; replace the dataset and model paths with your own.
```python
import os
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from peft import get_peft_model, LoraConfig, TaskType
import pandas as pd
from torch.utils.data import Dataset
import torchvision

torchvision.disable_beta_transforms_warning()
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # silence the TOKENIZERS_PARALLELISM warning

# Custom dataset class
class TextDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

def load_data_s(file_path):
    '''
    Load the rating data.
    '''
    df = pd.read_csv(file_path)
    texts = [str(t).strip() for t in df['descriptionE']]
    labels = np.array(df['rank'])  # rank in [1,2,3,4,5]
    labels = labels - 1            # shift to [0,1,2,3,4]
    return texts, labels

if __name__ == "__main__":
    # Load data
    train_texts, train_labels = load_data_s("./data/trainset.csv")
    val_texts, val_labels = load_data_s("./data/valset.csv")
    test_texts, test_labels = load_data_s("./data/testset.csv")
    train_labels, val_labels, test_labels = torch.LongTensor(train_labels), torch.LongTensor(val_labels), torch.LongTensor(test_labels)
    train_size = len(train_texts)

    # Load the pretrained Qwen model and tokenizer
    model_name_or_path = "./PLMs/Qwen2.5-7B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path, num_labels=5)
    model.config.pad_token_id = tokenizer.pad_token_id

    # Configure LoRA
    lora_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,           # sequence-classification task
        inference_mode=False,
        r=8,                                  # rank of the low-rank matrices
        lora_alpha=32,                        # LoRA scaling factor
        lora_dropout=0.1,                     # dropout probability
        bias="none",                          # do not train bias terms
        target_modules=["q_proj", "v_proj"],  # modules to adapt
    )
    print(lora_config.target_modules)

    # Wrap the base model as a LoRA model
    model = get_peft_model(model, lora_config)

    # Convert the texts into model inputs
    train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=128)
    val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=128)
    test_encodings = tokenizer(test_texts, truncation=True, padding=True, max_length=128)

    # Build dataset objects
    train_dataset = TextDataset(train_encodings, train_labels)
    val_dataset = TextDataset(val_encodings, val_labels)
    test_dataset = TextDataset(test_encodings, test_labels)

    # Training arguments
    training_args = TrainingArguments(
        output_dir='./results',
        num_train_epochs=3,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        warmup_steps=10,
        weight_decay=0.01,
        logging_dir='./logs',
        logging_steps=10,
        evaluation_strategy="epoch",   # evaluate at the end of each epoch
        save_strategy="epoch",         # save a checkpoint at the end of each epoch
        load_best_model_at_end=True,   # reload the best checkpoint when training finishes
    )

    # Train with Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
    )

    # Start training
    trainer.train()

    # Evaluate on the test set
    trainer.evaluate(test_dataset)

    # Save the fine-tuned model
    model.save_pretrained('./ckpts/lora_qwen_model')
```
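One caveat: since no compute_metrics is passed to the Trainer above, trainer.evaluate() reports only the loss. A minimal accuracy metric could look like the sketch below (pure NumPy; hook it up via `Trainer(..., compute_metrics=compute_metrics)`):

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy from a Trainer-style (logits, labels) evaluation pair."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)  # predicted class per example
    return {"accuracy": float((preds == labels).mean())}

# Quick self-check on fake logits: 3 examples, 5 classes.
logits = np.array([[0.1, 0.2, 3.0, 0.0, 0.0],   # predicts class 2
                   [2.0, 0.1, 0.0, 0.0, 0.0],   # predicts class 0
                   [0.0, 0.0, 0.0, 0.0, 1.5]])  # predicts class 4
labels = np.array([2, 1, 4])
print(compute_metrics((logits, labels)))  # two of three correct
```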
Runtime: about 35 hours.
Debugging Notes
- ValueError: greenlet.greenlet size changed, may indicate binary incompatibility. Expected 152 from C header, got 40 from PyObject
  Fix: pin compatible versions:
  pip install greenlet==1.1.3 gevent==22.8.0
- ValueError: Cannot handle batch sizes > 1 if no padding token is defined.
  Fix: the two lines already included in the code above:
  tokenizer.pad_token = tokenizer.eos_token
  model.config.pad_token_id = tokenizer.pad_token_id
  (This is just one option; there are other ways to assign a pad_token, see the references below.)
Postscript: thanks to Copilot and GPT-4o for their help on this project!
References
- "RuntimeWarning: greenlet.greenlet size changed, may indicate binary incompatibility" - CSDN blog
- "AssertionError: Cannot handle batch sizes > 1 if no padding token is defined" and pad_token = eos_token - Stack Overflow