PEFT Study Notes
Contents
Prompt Tuning
Prefix Tuning
LoRA
QLoRA
Adapter Tuning
P-Tuning
P-Tuning v2
Beyond PEFT
PILL
SSF
Supervised Fine-Tuning (SFT)
Reinforcement Learning from Human Feedback (RLHF)
Reinforcement Learning from AI Feedback (RLAIF)
Prompt Tuning
At the embedding stage, prepend task-specific tokens to the input sequence X, shifting the probability distribution of the generated output toward the desired content.
Concretely, X = [x1, x2, ..., xm] becomes X' = [x'1, x'2, ..., x'k; x1, x2, ..., xm], and Y = WX'.
The virtual-token embeddings are static trainable parameters: fixed after initialization, changing only through training updates.
from transformers import AutoModelForCausalLM, AutoTokenizer, default_data_collator, get_linear_schedule_with_warmup
from peft import get_peft_config, get_peft_model, PromptTuningInit, PromptTuningConfig, TaskType, PeftType
import torch
from datasets import load_dataset
import os
from torch.utils.data import DataLoader
from tqdm import tqdm
device = "mps"
# device = "cuda"
model_name_or_path = "bigscience/bloomz-560m"
tokenizer_name_or_path = "bigscience/bloomz-560m"
peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    num_virtual_tokens=8,
    prompt_tuning_init_text="Classify if the tweet is a complaint or not:",
    tokenizer_name_or_path=tokenizer_name_or_path,
)
dataset_name = "twitter_complaints"
text_column = "Tweet text"
label_column = "text_label"
max_length = 64
learning_rate = 3e-2
num_epochs = 20
batch_size = 8
output_dir = './output'
# 1. load a subset of the RAFT dataset at https://huggingface.co/datasets/ought/raft
dataset = load_dataset("ought/raft", dataset_name)
# get the label's possible values
label_values = [name.replace("_", "") for name in dataset["train"].features["Label"].names]
# append the label value to the dataset to make it more readable
dataset = dataset.map(
    lambda x: {label_column: [label_values[label] for label in x["Label"]]},
    batched=True,
    num_proc=1,
)
# have a look at the data structure
dataset["train"][0]
# 2. dataset
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
if tokenizer.pad_token_id is None:
tokenizer.pad_token_id = tokenizer.eos_token_id
def preprocess_fn(examples):
    tweets = examples[text_column]
    # pad labels with a pad token at the end
    labels = [str(x) + tokenizer.pad_token for x in examples[label_column]]
    # concatenate each tweet with its label
    inputs = [f"{text_column} : {tweet}\nLabel :{label}"
              for tweet, label in zip(tweets, labels)]
    # tokenize inputs
    model_inputs = tokenizer(inputs,
                             padding='max_length',
                             max_length=max_length,
                             truncation=True)
    # tokenize labels; since -100 is not a valid token id, do the padding manually here
    labels_input_ids = []
    for i in range(len(labels)):
        ids = tokenizer(labels[i])["input_ids"]
        padding = [-100] * (max_length - len(ids))
        labels_input_ids.append(padding + ids)
    model_inputs["labels"] = labels_input_ids
    # convert model inputs to tensors
    model_inputs["input_ids"] = [torch.tensor(ids) for ids in model_inputs["input_ids"]]
    model_inputs["attention_mask"] = [torch.tensor(ids) for ids in model_inputs["attention_mask"]]
    model_inputs["labels"] = [torch.tensor(ids) for ids in model_inputs["labels"]]
    return model_inputs
# have a look at the preprocessing result
# print(preprocess_fn(dataset["train"][:2]))
processed_datasets = dataset.map(
    preprocess_fn,
    batched=True,
    num_proc=1,
    remove_columns=dataset["train"].column_names,  # remove unprocessed columns for training
    load_from_cache_file=False,
    desc="Running tokenizer on dataset",
)
test_size = round(len(processed_datasets["train"]) * 0.2)
train_val = processed_datasets["train"].train_test_split(
    test_size=test_size, shuffle=True, seed=42)
train_data = train_val["train"]
val_data = train_val["test"]
# 3. model
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# output: trainable params: 8192 || all params: 559222784 || trainable%: 0.0014648902430985358
# 4. trainer
from transformers import Trainer, TrainingArguments
trainer = Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=val_data,
    data_collator=default_data_collator,
    args=TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=batch_size,
        num_train_epochs=num_epochs,
        learning_rate=learning_rate,
        load_best_model_at_end=True,
        logging_strategy='steps',
        logging_steps=10,
        evaluation_strategy='steps',
        eval_steps=10,
        save_strategy='steps',
        save_steps=10,
    ),
)
trainer.train()
Prefix Tuning
Adds task-specific prefixes to the layers of both the Transformer encoder and decoder.
Concretely, the W in Y = WX becomes W' = [Wp; W], so Y = W'X.
The prefix can be fixed (a manually designed static prompt) or trainable (a dynamic prompt the model learns during training).
An MLP is placed in front of the prefix layer for reparameterization; after training, only the prefix parameters are kept.
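A minimal NumPy sketch of this idea (shapes, initializations, and the single-head attention wiring are illustrative, not from any paper): trainable prefix vectors are prepended to the keys and values of one frozen attention layer, so every position can attend to the learned prefix while only the prefix parameters are tuned.

```python
import numpy as np

rng = np.random.default_rng(0)
d, seq, n_prefix = 16, 6, 4

# frozen layer: input representations and projection weights (not trained)
X = rng.normal(size=(seq, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# trainable prefix key/value vectors, one set per layer (the only tuned parameters)
prefix_k = rng.normal(size=(n_prefix, d))
prefix_v = rng.normal(size=(n_prefix, d))

def attention_with_prefix(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # prepend the learned prefix to keys and values: K' = [Pk; K], V' = [Pv; V]
    K = np.concatenate([prefix_k, K], axis=0)
    V = np.concatenate([prefix_v, V], axis=0)
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

out = attention_with_prefix(X)
print(out.shape)  # (6, 16): output length unchanged, but each position attends to the prefix
```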
LoRA
Turns Y = WX into Y = (W + ΔW)X, where ΔW is what fine-tuning learns.
ΔW is then factored into a low-rank product ΔW = AB (ΔW is m × n, A is m × r, B is r × n, with r the assumed low rank).
from transformers import AutoModelForSeq2SeqLM
from peft import get_peft_config, get_peft_model, LoraConfig, TaskType
model_name_or_path = "bigscience/mt0-large"
tokenizer_name_or_path = "bigscience/mt0-large"
peft_config = LoraConfig(
task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1
)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# output: trainable params: 2359296 || all params: 1231940608 || trainable%: 0.19151053100118282
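The factorization can be made concrete with a NumPy sketch (dimensions are chosen for illustration; the zero init of B makes ΔW start at zero, so the model initially behaves like the frozen pretrained one):

```python
import numpy as np

m, n, r = 1024, 1024, 8
rng = np.random.default_rng(0)

W = rng.normal(size=(m, n))          # frozen pretrained weight, m x n
A = rng.normal(size=(m, r)) * 0.01   # trainable low-rank factor, m x r
B = np.zeros((r, n))                 # trainable low-rank factor, r x n; zero init so dW = AB = 0

x = rng.normal(size=(n,))
y = (W + A @ B) @ x                  # Y = (W + dW)X with dW = AB

# trainable parameter count drops from m*n to r*(m + n)
full, low_rank = m * n, r * (m + n)
print(f"trainable fraction: {low_rank / full:.4f}")  # 0.0156
```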
QLoRA
Represents parameters originally stored in 16 bits with 4 bits instead.
Quantization: with 4 bits per weight, each weight must be mapped to one of 16 possible values. First determine the quantization range (e.g. -1 to 1), split it into 16 intervals, each corresponding to one 4-bit code, then map each original 32-bit float to the nearest interval's value.
Fine-tuning: QLoRA loads the model in 4-bit format, then dequantizes values back to bf16 during training for the actual computation.
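A NumPy sketch of the map-to-nearest-of-16-levels step. Note a simplification: real QLoRA uses the NF4 data type, whose 16 levels follow normal-distribution quantiles; the uniform grid here only illustrates the quantize/dequantize round trip.

```python
import numpy as np

def quantize_4bit(w):
    # map each float weight to one of 16 evenly spaced levels in [-absmax, absmax]
    absmax = np.abs(w).max()
    levels = np.linspace(-absmax, absmax, 16)                  # 2^4 = 16 representable values
    idx = np.abs(w[:, None] - levels[None, :]).argmin(axis=1)  # nearest level = 4-bit code
    return idx.astype(np.uint8), absmax

def dequantize(idx, absmax):
    # recover approximate bf16/fp32 weights from the stored 4-bit codes
    levels = np.linspace(-absmax, absmax, 16)
    return levels[idx]

w = np.array([-1.0, -0.3, 0.0, 0.7, 1.0])
codes, absmax = quantize_4bit(w)
w_hat = dequantize(codes, absmax)
print(np.max(np.abs(w - w_hat)))  # error is at most half a bin width
```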
Adapter Tuning
Insert adapters into every layer (or selected layers) of the pretrained model. Adapters are small neural networks, typically with few layers and relatively few parameters.
Two serial adapter modules are inserted after the output of each sublayer, between the projection layer and the skip connection. These adapter modules are newly added components used to adapt the model to specific downstream tasks.
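A NumPy sketch of one bottleneck adapter (down-project, nonlinearity, up-project, plus the residual path); the sizes and the zero init of the up-projection are illustrative choices, with the zero init making the adapter an identity map before training:

```python
import numpy as np

rng = np.random.default_rng(0)
d, bottleneck = 64, 8

# trainable adapter weights: d -> bottleneck -> d
W_down = rng.normal(size=(d, bottleneck)) * 0.01
W_up = np.zeros((bottleneck, d))  # zero init: adapter output starts at 0

def adapter(h):
    # bottleneck MLP with a residual (skip) connection around it
    z = np.maximum(h @ W_down, 0)   # down-projection + ReLU
    return h + z @ W_up             # up-projection added back onto the sublayer output

h = rng.normal(size=(4, d))         # stand-in sublayer output for 4 positions
out = adapter(h)
print(np.allclose(out, h))          # True at init: pretrained behavior preserved
```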
P-Tuning
Uses a trainable LSTM (the prompt encoder, `prompt_encoder`) to dynamically generate the virtual-token embeddings. This allows different embeddings for different inputs, offering more flexibility and adaptability, and suits tasks that need fine-grained control and understanding of complex context.
The LSTM's parameters can be shared across tasks.
This design has two problems:
- First, it limits the number of tunable parameters. Because the model's input length is fixed (typically 512), the prompt cannot be too long.
- Second, when the model is deep, stability during fine-tuning is hard to guarantee: the deeper the model, the harder it is to predict how prompts inserted at the first layer affect later layers, which hurts stability.
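A minimal sketch of the prompt-encoder idea with a single hand-written LSTM cell in NumPy (the paper uses a bidirectional LSTM plus an MLP head; sizes and initializations here are illustrative): raw virtual-token parameters are run through the LSTM so each final prompt embedding depends on the others.

```python
import numpy as np

rng = np.random.default_rng(0)
n_virtual, d = 8, 32

# trainable raw embeddings for the virtual tokens
raw = rng.normal(size=(n_virtual, d))

# one LSTM cell as the prompt encoder (these weights can be shared across tasks)
Wx = rng.normal(size=(d, 4 * d)) * 0.1
Wh = rng.normal(size=(d, 4 * d)) * 0.1
b = np.zeros(4 * d)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prompt_encoder(raw):
    # run the LSTM over the virtual-token sequence; outputs become the prompt embeddings
    h, c = np.zeros(d), np.zeros(d)
    outputs = []
    for x in raw:
        gates = x @ Wx + h @ Wh + b
        i, f, g, o = np.split(gates, 4)           # input, forget, cell, output gates
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)

prompts = prompt_encoder(raw)
print(prompts.shape)  # (8, 32): one encoded embedding per virtual token
```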
P-Tuning v2
Inserts continuous prompts not only at the first layer but at multiple layers, with the prompts at different layers independent of each other.
Removes the reparameterization encoder: earlier methods used reparameterization to improve training speed and robustness (e.g. the MLP in Prefix Tuning, the LSTM in P-Tuning). The P-Tuning v2 authors found the gains from reparameterization are small, especially for smaller models, and it can even hurt performance.
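The multi-layer, independent prompts can be wired up as a NumPy sketch (frozen Transformer layers are replaced by identity stand-ins purely to show the plumbing): each layer owns its own trainable prompt table, and there is no shared LSTM/MLP encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_prompt, d = 4, 8, 32

# v2: one independent trainable prompt table per layer, no reparameterization encoder
layer_prompts = [rng.normal(size=(n_prompt, d)) for _ in range(n_layers)]

def forward(h, frozen_layers):
    for layer_fn, prompts in zip(frozen_layers, layer_prompts):
        # prepend this layer's prompts, run the frozen layer, then drop the prompt slots
        h = layer_fn(np.concatenate([prompts, h], axis=0))[n_prompt:]
    return h

# identity stand-ins for frozen layers, just to show the wiring
frozen_layers = [lambda x: x for _ in range(n_layers)]
h = rng.normal(size=(6, d))
out = forward(h, frozen_layers)
print(out.shape)  # (6, 32): sequence length restored after each layer
```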
Beyond PEFT
PILL
Focuses on improving task adaptability by inserting trainable modules or plugins into the model.
SSF
Fine-tunes by scaling and shifting the deep features extracted by the pretrained model.
SSF introduces scale and shift parameters, which can be viewed as a variance and a mean, used to modulate the downstream-dataset features extracted by a model pretrained on upstream data, so the modulated features fall into a discriminative space. These scale and shift parameters do not depend on any input, giving a unified learnable parameter space across different tasks.
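In code, SSF reduces to an input-independent affine modulation per feature dimension. A NumPy sketch (the identity initialization is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# trainable SSF parameters: one scale (gamma) and one shift (beta) per feature dimension,
# shared across all inputs (input-independent)
gamma = np.ones(d)   # init to identity scale
beta = np.zeros(d)   # init to zero shift

def ssf(features):
    # modulate frozen-backbone features: y = gamma * x + beta, broadcast over tokens
    return gamma * features + beta

x = rng.normal(size=(10, d))  # features from the frozen pretrained model, 10 tokens
y = ssf(x)
print(np.allclose(y, x))      # True at init; training moves gamma/beta to adapt features
```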