Fine-tuning the OPT Base Language Model with DeepSpeed
Table of Contents
- OPT Base Language Model
- Using OPT with DeepSpeed
- main.py Walkthrough
- 1. Importing libraries and modules
- 2. Parsing command-line arguments
- 3. The main function
- 3.1 Device and distributed initialization
- 3.2 Model and data preparation
- 3.3 Defining the evaluation function
- 3.4 Optimizer and learning-rate scheduler setup
- 3.5 Initializing the model and optimizer with DeepSpeed
- 3.6 Training loop
- 3.7 Saving the model
- 4. dschat/utils functions
- 4.1 get_train_ds_config: build the DeepSpeed config dict for training
- 4.2 print_throughput: print training throughput
- Reproduction Steps
- Using OPT with Colossal-AI
- References
OPT Base Language Model
A base language model is a model that has only been pretrained on large-scale text corpora, without instruction tuning, downstream-task fine-tuning, or any alignment step such as learning from human feedback.
OPT (Open Pre-trained Transformers) is a family of large-scale pretrained language models released by Meta AI researchers. It comes in nine sizes: 125M, 350M, 1.3B, 2.7B, 6.7B, 13B, 30B, 66B, and 175B parameters. Except for the 175B version, which requires filling out an application form, all sizes are fully open for download and free to use. OPT-175B performs comparably to GPT-3 while requiring only about 1/7 of its energy cost. The OPT models were open-sourced to promote academic research and exchange: most large language models are so expensive to train that the majority of researchers cannot afford to train or even use them, and the large pretrained models released by companies are, for commercial reasons, not available with full weights and can only be queried through APIs, which hinders academic study.
Using OPT with DeepSpeed
DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning
Directory structure
├── evaluation_scripts // scripts for running model evaluation
│ └── run_prompt.sh
├── main.py // main training script
├── prompt_eval.py // used together with run_prompt.sh
├── README.md
├── training_log_output // log files from a training run
│ └── opt-1.3b-globalBatchSize128.log
└── training_scripts
│ ├── llama2 // scripts for fine-tuning LLaMA 2
│ ├── run_llama2_7b_lora.sh
│ └── run_llama2_7b.sh
├── opt // scripts for fine-tuning OPT
│ ├── multi_node
│ │ └── run_66b.sh
│ ├── single_gpu
│ │ ├── run_1.3b.sh
│ │ └── run_6.7b_lora.sh
│ └── single_node
│ ├── run_1.3b_lora.sh
│ ├── run_13b.sh
│ ├── run_1.3b.sh
│ ├── run_30b_lora.sh
│ ├── run_6.7b.sh
│ └── sweep // hyperparameter sweep scripts
│ ├── README.md
│ ├── run_single.sh
│ └── run_step1_sweep.sh
├── other_language // training with data in other languages
│ ├── run_chinese.sh
│ └── run_japanese.sh
└── README.md
main.py Walkthrough
Location:
DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py
1. Importing libraries and modules
import argparse
import math
import time
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler
from transformers import (
AutoModelForCausalLM,
SchedulerType,
default_data_collator,
get_scheduler,
)
import deepspeed
from deepspeed.ops.adam import DeepSpeedCPUAdam, FusedAdam
from deepspeed import get_accelerator
from dschat.utils.data.data_utils import create_prompt_dataset
from dschat.utils.utils import print_rank_0, to_device, save_hf_format, set_random_seed, get_all_reduce_mean, get_optimizer_grouped_parameters, save_zero_three_model, load_hf_tokenizer
from dschat.utils.ds_utils import get_train_ds_config
from dschat.utils.module.lora import convert_linear_layer_to_lora, convert_lora_to_linear_layer, only_optimize_lora_parameters, make_model_gradient_checkpointing_compatible
from dschat.utils.model.model_utils import create_hf_model, causal_lm_model_to_fp32_loss
from dschat.utils.perf import print_throughput
2. Parsing command-line arguments
def parse_args():
    parser = argparse.ArgumentParser(
        description=
        "Finetune a transformers model on a causal language modeling task")
    parser.add_argument('--data_path',
                        nargs='*',
                        default=['Dahoas/rm-static'],
                        help='Path to the training dataset. Accepted format:'
                        '1) a single data path, 2) multiple datasets in the'
                        'form: dataset1-path dataset2-path ...')
    parser.add_argument('--data_split',
                        type=str,
                        default='2,4,4',
                        help='Comma-separated list of proportions for training'
                        'phase 1, 2, and 3 data. For example the split `6,2,2`'
                        'will use 60%% of data for phase 1, 20%% for phase 2'
                        'and 20%% for phase 3.')
    parser.add_argument(
        '--sft_only_data_path',
        nargs='*',
        default=[],
        help='Path to the dataset for only using in SFT phase.')
    parser.add_argument(
        '--data_output_path',
        type=str,
        default='/tmp/data_files/',
        help=
        'Where to store the data-related files such as shuffle index. This needs to be on a local storage of a node (not on a shared storage)'
    )
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help=
        "Path to pretrained model or model identifier from huggingface.co/models.",
        required=True,
    )
    parser.add_argument(
        "--per_device_train_batch_size",
        type=int,
        default=16,
        help="Batch size (per device) for the training dataloader.",
    )
    parser.add_argument(
        "--per_device_eval_batch_size",
        type=int,
        default=16,
        help="Batch size (per device) for the evaluation dataloader.",
    )
    parser.add_argument(
        "--max_seq_len",
        type=int,
        default=512,
        help="The maximum sequence length.",
    )
    parser.add_argument(
        "--learning_rate",
        type=float,
        default=1e-3,
        help=
        "Initial learning rate (after the potential warmup period) to use.",
    )
    parser.add_argument("--weight_decay",
                        type=float,
                        default=0.,
                        help="Weight decay to use.")
    parser.add_argument("--num_train_epochs",
                        type=int,
                        default=1,
                        help="Total number of training epochs to perform.")
    parser.add_argument(
        "--gradient_accumulation_steps",
        type=int,
        default=1,
        help=
        "Number of updates steps to accumulate before performing a backward/update pass.",
    )
    parser.add_argument(
        "--lr_scheduler_type",
        type=SchedulerType,
        default="cosine",
        help="The scheduler type to use.",
        choices=[
            "linear", "cosine", "cosine_with_restarts", "polynomial",
            "constant", "constant_with_warmup"
        ],
    )
    parser.add_argument(
        "--num_warmup_steps",
        type=int,
        default=0,
        help="Number of steps for the warmup in the lr scheduler.")
    parser.add_argument("--output_dir",
                        type=str,
                        default=None,
                        help="Where to store the model.")
    parser.add_argument("--seed",
                        type=int,
                        default=1234,
                        help="A seed for reproducible training.")
    parser.add_argument("--local_rank",
                        type=int,
                        default=-1,
                        help="local_rank for distributed training on gpus")
    parser.add_argument('--gradient_checkpointing',
                        action='store_true',
                        help='Enable HF gradient checkpointing for model.')
    parser.add_argument(
        "--dropout",
        type=float,
        default=None,
        help="If dropout configured, use it. "
        "Otherwise, keep the default dropout configuration of the model.")
    # deepspeed features
    parser.add_argument('--offload',
                        action='store_true',
                        help='Enable ZeRO Offload techniques.')
    parser.add_argument('--dtype',
                        type=str,
                        default='fp16',
                        choices=['fp16', 'bf16'],
                        help='Training data type')
    parser.add_argument(
        '--zero_stage',
        type=int,
        default=0,
        help='ZeRO optimization stage for Actor model (and clones).')
    ## LoRA for efficient training setting
    parser.add_argument("--lora_dim",
                        type=int,
                        default=0,
                        help="If > 0, use LoRA for efficient training.")
    parser.add_argument("--lora_module_name",
                        type=str,
                        default="decoder.layers.",
                        help="The scope of LoRA.")
    parser.add_argument('--only_optimize_lora',
                        action='store_true',
                        help='Only optimize the LoRA parameters.')
    parser.add_argument(
        "--lora_learning_rate",
        type=float,
        default=5e-4,
        help=
        "Initial LoRA learning rate (after the potential warmup period) to use."
    )
    ## low precision
    parser.add_argument(
        '--compute_fp32_loss',
        action='store_true',
        help='Relevant for low precision dtypes (fp16, bf16, etc.). '
        'If specified, loss is calculated in fp32.')
    ## Tensorboard logging
    parser.add_argument('--enable_tensorboard',
                        action='store_true',
                        help='Enable tensorboard logging')
    parser.add_argument('--tensorboard_path',
                        type=str,
                        default="step1_tensorboard")
    ## Tokenizer
    parser.add_argument(
        "--add_eot_token",
        action='store_true',
        help="Add `eot_token` as additional special token to tokenizer")
    parser.add_argument(
        "--eot_token",
        type=str,
        default="<|endoftext|>",
        help="Specify the format of the `eot_token`",
    )
    ## Print loss
    parser.add_argument('--print_loss',
                        action='store_true',
                        help='Prints loss at each step.')

    parser = deepspeed.add_config_arguments(parser)
    args = parser.parse_args()

    return args
Meanings of some of the command-line arguments:
Name | Description |
---|---|
data_path | Path(s) to the training dataset; either a single path or several paths. |
data_split | Comma-separated proportions used to split the data across training phases 1, 2, and 3. |
model_name_or_path | Path to (or identifier of) the pretrained model. |
per_device_train_batch_size | Training batch size per device. |
per_device_eval_batch_size | Evaluation batch size per device. |
max_seq_len | Maximum sequence length during training; longer sequences consume more GPU memory. |
learning_rate | Initial learning rate. |
weight_decay | Weight-decay value. |
num_train_epochs | Total number of training epochs. |
gradient_accumulation_steps | Number of steps to accumulate before each backward/update pass, used to simulate a larger batch size. |
lr_scheduler_type | Type of learning-rate scheduler. |
num_warmup_steps | Number of learning-rate warmup steps. |
seed | Random seed. |
zero_stage | DeepSpeed ZeRO optimization stage. |
lora_dim | Rank of the LoRA low-rank matrices; LoRA-based efficient training is enabled when this is greater than 0. |
lora_module_name | Scope of the modules to which LoRA is applied. |
only_optimize_lora | Optimize only the LoRA parameters and freeze the rest of the model. |
gradient_checkpointing | Enable gradient checkpointing to reduce GPU memory usage. |
deepspeed | Enable the DeepSpeed training framework. |
enable_tensorboard | Enable TensorBoard for training monitoring and visualization. |
tensorboard_path | Path where the TensorBoard logs are written. |
output_dir | Path where the training outputs (the model) are saved. |
Some sources claim that `only_optimize_lora` and `gradient_checkpointing` cannot be enabled at the same time. I tried it and training does run, but I do not know whether it hurts the results. If the two options really are incompatible, a guard can be added before `return args` to make the script more robust, for example the sketch below.
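A minimal sketch of such a guard, written as a standalone helper that could be called just before `return args`; the exact error message and the choice to make this a hard error rather than a warning are my own assumptions:

import argparse


def validate_args(args: argparse.Namespace) -> None:
    # Hypothetical guard: gradient checkpointing recomputes activations during
    # the backward pass, and some reports suggest it does not combine well with
    # optimizing only the LoRA parameters, so fail fast instead of training
    # with a possibly broken configuration.
    if args.only_optimize_lora and args.gradient_checkpointing:
        raise ValueError(
            "--only_optimize_lora and --gradient_checkpointing cannot be "
            "enabled at the same time.")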
3. The main function
3.1 Device and distributed initialization
args = parse_args()

# If args.local_rank is -1, this is not a distributed training run.
if args.local_rank == -1:
    device = torch.device(get_accelerator().device_name())
else:
    get_accelerator().set_device(args.local_rank)
    device = torch.device(get_accelerator().device_name(), args.local_rank)
    # Initializes the distributed backend which will take care of sychronizing nodes/GPUs
    # torch.distributed.init_process_group(backend='nccl')
    deepspeed.init_distributed()

args.global_rank = torch.distributed.get_rank()

# from dschat.utils.ds_utils import get_train_ds_config
# Build the DeepSpeed training configuration dictionary.
ds_config = get_train_ds_config(offload=args.offload,
                                dtype=args.dtype,
                                stage=args.zero_stage,
                                enable_tensorboard=args.enable_tensorboard,
                                tb_path=args.tensorboard_path,
                                tb_name="step1_model")
ds_config[
    'train_micro_batch_size_per_gpu'] = args.per_device_train_batch_size
ds_config[
    'train_batch_size'] = args.per_device_train_batch_size * torch.distributed.get_world_size(
    ) * args.gradient_accumulation_steps

# If passed along, set the training seed now.
set_random_seed(args.seed)
torch.distributed.barrier()
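DeepSpeed checks that these batch-size fields are consistent with each other, which is why the script overrides both `train_micro_batch_size_per_gpu` and `train_batch_size` after building the config. A quick sketch of the arithmetic with made-up example values:

# DeepSpeed expects:
#   train_batch_size ==
#       train_micro_batch_size_per_gpu * world_size * gradient_accumulation_steps
per_device_train_batch_size = 4   # example value (--per_device_train_batch_size)
world_size = 8                    # example value (number of GPUs / processes)
gradient_accumulation_steps = 4   # example value (--gradient_accumulation_steps)

global_batch_size = (per_device_train_batch_size * world_size *
                     gradient_accumulation_steps)
print(global_batch_size)  # -> 128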
3.2 Model and data preparation
# load_hf_tokenizer will get the correct tokenizer and set padding tokens based on the model family
# Load the tokenizer.
additional_special_tokens = args.eot_token if args.add_eot_token else None
tokenizer = load_hf_tokenizer(args.model_name_or_path,
                              fast_tokenizer=True,
                              add_special_tokens=additional_special_tokens)
# Build and initialize the model.
model = create_hf_model(AutoModelForCausalLM,
                        args.model_name_or_path,
                        tokenizer,
                        ds_config,
                        dropout=args.dropout)

if args.compute_fp32_loss:
    print_rank_0(
        f"Using model {model.__class__.__name__} with loss in fp32",
        args.global_rank)
    causal_lm_model_to_fp32_loss(model)

# If args.lora_dim > 0, enable LoRA (low-rank adaptation) for efficient training.
if args.lora_dim > 0:
    model = convert_linear_layer_to_lora(model, args.lora_module_name,
                                         args.lora_dim)
    if args.only_optimize_lora:
        model = only_optimize_lora_parameters(model)
        model = make_model_gradient_checkpointing_compatible(model)

# Prepare the data
train_phase = 1
train_dataset, eval_dataset = create_prompt_dataset(
    args.local_rank,
    args.data_path,
    args.data_split,
    args.data_output_path,
    train_phase,
    args.seed,
    tokenizer,
    args.max_seq_len,
    end_of_conversation_token=tokenizer.eos_token,
    sft_only_data_path=args.sft_only_data_path)

# DataLoaders creation:
if args.local_rank == -1:
    train_sampler = RandomSampler(train_dataset)
    eval_sampler = SequentialSampler(eval_dataset)
else:
    train_sampler = DistributedSampler(train_dataset)
    eval_sampler = DistributedSampler(eval_dataset)
train_dataloader = DataLoader(train_dataset,
                              collate_fn=default_data_collator,
                              sampler=train_sampler,
                              batch_size=args.per_device_train_batch_size)
eval_dataloader = DataLoader(eval_dataset,
                             collate_fn=default_data_collator,
                             sampler=eval_sampler,
                             batch_size=args.per_device_eval_batch_size)
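As far as I can tell, the phase-1 (SFT) dataset yields one feature dict per example (input_ids, attention_mask, labels), and `default_data_collator` simply stacks each field into a batch tensor. A toy sketch of that collation step; the token values are made up and only the shapes matter:

from transformers import default_data_collator

features = [
    {"input_ids": [2, 100, 101, 1], "attention_mask": [1, 1, 1, 1], "labels": [2, 100, 101, 1]},
    {"input_ids": [2, 200, 201, 1], "attention_mask": [1, 1, 1, 1], "labels": [2, 200, 201, 1]},
]
batch = default_data_collator(features)
print({k: v.shape for k, v in batch.items()})
# {'input_ids': torch.Size([2, 4]), 'attention_mask': torch.Size([2, 4]), 'labels': torch.Size([2, 4])}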
3.3 Defining the evaluation function
def evaluation(model, eval_dataloader):
    model.eval()
    losses = 0
    for step, batch in enumerate(eval_dataloader):
        batch = to_device(batch, device)
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        losses += loss.float()
    losses = losses / (step + 1)

    try:
        losses = get_all_reduce_mean(losses)
    except:
        pass

    try:
        perplexity = torch.exp(losses).item()
    except OverflowError:
        perplexity = float("inf")

    return perplexity, losses.item()
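The perplexity reported here is simply the exponential of the mean cross-entropy loss, so the two printed numbers always move together. A quick sanity check:

import math

mean_loss = 2.0                  # example value of the averaged eval loss
perplexity = math.exp(mean_loss)
print(f"{perplexity:.2f}")       # 7.39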
3.4 Optimizer and learning-rate scheduler setup
# Split weights in two groups, one with weight decay and the other not.
optimizer_grouped_parameters = get_optimizer_grouped_parameters(
    model, args.weight_decay, args.lora_learning_rate)

AdamOptimizer = DeepSpeedCPUAdam if args.offload else FusedAdam
optimizer = AdamOptimizer(optimizer_grouped_parameters,
                          lr=args.learning_rate,
                          betas=(0.9, 0.95))

num_update_steps_per_epoch = math.ceil(
    len(train_dataloader) / args.gradient_accumulation_steps)
lr_scheduler = get_scheduler(
    name=args.lr_scheduler_type,
    optimizer=optimizer,
    num_warmup_steps=args.num_warmup_steps,
    num_training_steps=args.num_train_epochs * num_update_steps_per_epoch,
)
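get_optimizer_grouped_parameters splits the parameters into groups so that weight decay is only applied to part of them (and LoRA parameters can be given their own learning rate). Below is a simplified sketch of the grouping idea, not the actual implementation in dschat/utils/utils.py; the name-based no-decay heuristic is an assumption:

from typing import Dict, List

import torch


def grouped_parameters_sketch(model: torch.nn.Module,
                              weight_decay: float) -> List[Dict]:
    # Assumed heuristic: parameters whose names look like biases or layer
    # norms get no weight decay; everything else does.
    no_decay_keywords = ("bias", "layer_norm", "layernorm", "ln_")
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if any(k in name.lower() for k in no_decay_keywords):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]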
3.5 Initializing the model and optimizer with DeepSpeed
model, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    args=args,
    config=ds_config,
    lr_scheduler=lr_scheduler,
    dist_init_required=True)

# Enable gradient checkpointing on the model if requested.
if args.gradient_checkpointing:
    model.gradient_checkpointing_enable()
3.6 Training loop
# Train!
print_rank_0("***** Running training *****", args.global_rank)
print_rank_0(
    f"***** Evaluating perplexity, Epoch {0}/{args.num_train_epochs} *****",
    args.global_rank)
# Check how the model performs on the validation set before any training.
perplexity, eval_loss = evaluation(model, eval_dataloader)
print_rank_0(f"ppl: {perplexity}, loss: {eval_loss}", args.global_rank)

# Outer loop: each iteration is one full pass over the training set.
for epoch in range(args.num_train_epochs):
    print_rank_0(
        f"Beginning of Epoch {epoch+1}/{args.num_train_epochs}, Total Micro Batches {len(train_dataloader)}",
        args.global_rank)
    # Switch the model to training mode.
    model.train()
    # Inner loop: iterate over the batches from the training dataloader.
    for step, batch in enumerate(train_dataloader):
        # Record the start time to measure how long this batch takes.
        start = time.time()
        batch = to_device(batch, device)
        # Forward pass.
        outputs = model(**batch, use_cache=False)
        loss = outputs.loss
        if args.print_loss:
            print(
                f"Epoch: {epoch}, Step: {step}, Rank: {torch.distributed.get_rank()}, loss = {loss}"
            )
        # Backward pass.
        model.backward(loss)
        # Update the model parameters.
        model.step()
        # This batch is done; record the end time.
        end = time.time()
        if torch.distributed.get_rank() == 0:
            # Print training throughput.
            print_throughput(model.model, args, end - start,
                             args.global_rank)

    # Evaluate perplexity on the validation set.
    # Track how the model's performance on the validation set changes.
    print_rank_0(
        f"***** Evaluating perplexity, Epoch {epoch+1}/{args.num_train_epochs} *****",
        args.global_rank)
    perplexity, eval_loss = evaluation(model, eval_dataloader)
    print_rank_0(f"ppl: {perplexity}, loss: {eval_loss}", args.global_rank)
    model.tput_timer.update_epoch_count()
3.7 Saving the model
if args.output_dir is not None:
    print_rank_0('saving the final model ...', args.global_rank)
    model = convert_lora_to_linear_layer(model)

    if args.global_rank == 0:
        save_hf_format(model, tokenizer, args)

    if args.zero_stage == 3:
        # For zero stage 3, each gpu only has a part of the model, so we need a special save function
        save_zero_three_model(model,
                              args.global_rank,
                              args.output_dir,
                              zero_stage=args.zero_stage)
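Under ZeRO stage 3 every rank holds only a shard of each parameter, so calling state_dict() on rank 0 alone would not yield a complete checkpoint. Below is a minimal sketch of the gathering idea behind save_zero_three_model, assuming DeepSpeed's deepspeed.zero.GatheredParameters context manager; the real helper in dschat/utils/utils.py may differ in its details:

import os

import deepspeed
import torch


def save_zero3_sketch(ds_engine, global_rank: int, save_dir: str) -> None:
    # Sketch only: gather each ZeRO-3 partitioned parameter (a collective call,
    # so every rank must enter the context), copy it to CPU on rank 0, and
    # finally write a plain pytorch_model.bin from rank 0.
    os.makedirs(save_dir, exist_ok=True)
    module = ds_engine.module if hasattr(ds_engine, "module") else ds_engine
    state_dict = {}
    for name, param in module.named_parameters():
        with deepspeed.zero.GatheredParameters([param],
                                               enabled=hasattr(param, "ds_id")):
            if global_rank == 0:
                state_dict[name] = param.data.cpu().clone()
    if global_rank == 0:
        torch.save(state_dict, os.path.join(save_dir, "pytorch_model.bin"))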
4. dschat/utils functions
4.1 get_train_ds_config: build the DeepSpeed config dict for training
# applications/DeepSpeed-Chat/dschat/utils/ds_utils.py
def get_train_ds_config(offload,
                        dtype,
                        stage=2,
                        enable_hybrid_engine=False,
                        inference_tp_size=1,
                        release_inference_cache=False,
                        pin_parameters=True,
                        tp_gather_partition_size=8,
                        max_out_tokens=512,
                        enable_tensorboard=False,
                        enable_mixed_precision_lora=False,
                        tb_path="",
                        tb_name=""):

    # If offload is True, set device to "cpu" so ZeRO offloads to the CPU.
    device = "cpu" if offload else "none"
    if dtype == "fp16":
        data_type = "fp16"
        dtype_config = {"enabled": True, "loss_scale_window": 100}
    elif dtype == "bf16":
        data_type = "bfloat16"
        dtype_config = {"enabled": True}
    # ZeRO optimization settings.
    zero_opt_dict = {
        "stage": stage,  # ZeRO optimization stage
        "overlap_comm": True,  # overlap communication with computation
        # Offload model parameters and optimizer state to the chosen device.
        "offload_param": {
            "device": device
        },
        "offload_optimizer": {
            "device": device
        },
        "stage3_param_persistence_threshold": 1e4,
        "stage3_max_live_parameters": 3e7,
        "stage3_prefetch_bucket_size": 3e7,
        "memory_efficient_linear": False
    }
    # Mixed-precision LoRA settings.
    if enable_mixed_precision_lora:
        zero_opt_dict["zero_quantized_nontrainable_weights"] = True
        if dist.get_world_size() != get_accelerator().device_count():
            zero_opt_dict["zero_hpz_partition_size"] = get_accelerator(
            ).device_count()
    # Return the full DeepSpeed configuration dict for the training phase.
    return {
        "train_batch_size": GLOBAL_BATCH_SIZE,
        "train_micro_batch_size_per_gpu": MICRO_BATCH_SIZE,
        "steps_per_print": 10,
        "zero_optimization": zero_opt_dict,
        data_type: dtype_config,
        "gradient_clipping": 1.0,
        "prescale_gradients": False,
        "wall_clock_breakdown": False,
        "hybrid_engine": {
            "enabled": enable_hybrid_engine,
            "max_out_tokens": max_out_tokens,
            "inference_tp_size": inference_tp_size,
            "release_inference_cache": release_inference_cache,
            "pin_parameters": pin_parameters,
            "tp_gather_partition_size": tp_gather_partition_size,
        },
        "tensorboard": {
            "enabled": enable_tensorboard,
            "output_path": f"{tb_path}/ds_tensorboard_logs/",
            "job_name": f"{tb_name}_tensorboard"
        }
    }
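A quick way to see what the helper produces (GLOBAL_BATCH_SIZE and MICRO_BATCH_SIZE are module-level placeholder constants in ds_utils.py that main.py immediately overrides). Note that the mixed-precision section is keyed dynamically: "fp16" for --dtype fp16 and "bfloat16" for --dtype bf16. The assertions below follow directly from the code above:

from dschat.utils.ds_utils import get_train_ds_config

cfg = get_train_ds_config(offload=True, dtype="bf16", stage=3)
assert cfg["zero_optimization"]["stage"] == 3
assert cfg["zero_optimization"]["offload_param"]["device"] == "cpu"
assert cfg["bfloat16"] == {"enabled": True}  # would be cfg["fp16"] for fp16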
4.2 print_throughput: print training throughput
# applications/DeepSpeed-Chat/dschat/utils/perf.py
# This function can be used to print throughput for Step 1 and 2 only
def print_throughput(hf_model, args, e2e_time, rank=0):
    if rank <= 0:
        hf_config = hf_model.config
        num_layers, hidden_size, vocab_size = get_hf_configs(hf_config)

        gpus_per_model = torch.distributed.get_world_size()
        seq_length = args.max_seq_len
        batch_size = args.per_device_train_batch_size
        # Samples processed per second.
        samples_per_second = batch_size / e2e_time
        checkpoint_activations_factor = 4 if args.gradient_checkpointing else 3
        if args.lora_dim > 0:
            k = args.lora_dim * 2 / hidden_size
            checkpoint_activations_factor -= (1 - k)

        # Count the model parameters.
        hf_model._num_params = sum([
            p.ds_numel if hasattr(p, "ds_tensor") else p.numel()
            for p in hf_model.parameters()
        ])
        # Convert to billions of parameters.
        params_in_billions = hf_model._num_params / (1e9)

        # Megatron paper's formula to calculate training flops
        train_flops_per_iteration = calculate_flops(
            checkpoint_activations_factor, batch_size, seq_length, hf_config)

        # Training TFLOPs per GPU.
        train_tflops = train_flops_per_iteration / (e2e_time * gpus_per_model *
                                                    (10**12))

        param_string = f"{params_in_billions:.3f} B" if params_in_billions != 0 else "NA"
        # Print the performance metrics:
        # - number of model parameters
        # - end-to-end latency of the iteration
        # - TFLOPs
        # - samples per second
        # - time per sample
        # - batch size
        # - sequence length
        print(
            f"Model Parameters: {param_string}, Latency: {e2e_time:.2f}s, TFLOPs: {train_tflops:.2f}, Samples/sec: {samples_per_second:.2f}, Time/seq {e2e_time/batch_size:.2f}s, Batch Size: {batch_size}, Sequence Length: {seq_length}"
        )
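calculate_flops (defined elsewhere in dschat/utils/perf.py and not shown here) applies the Megatron-LM style FLOPs estimate. The sketch below is my reading of that formula, so treat the exact terms as an assumption to check against the source; the OPT-1.3B-like shapes in the example are only illustrative:

def megatron_train_flops_sketch(checkpoint_activations_factor: float,
                                batch_size: int,
                                seq_length: int,
                                num_layers: int,
                                hidden_size: int,
                                vocab_size: int) -> float:
    # 24 * B * s * l * h^2 covers the transformer matmuls, (1 + s/(6h)) adds the
    # attention term, vocab/(16*l*h) adds the output-logit projection, and the
    # factor of 3 or 4 accounts for forward + backward (+ activation recompute).
    return (24 * checkpoint_activations_factor * batch_size * seq_length *
            num_layers * hidden_size**2 *
            (1.0 + seq_length / (6.0 * hidden_size) +
             vocab_size / (16.0 * num_layers * hidden_size)))


# Example with OPT-1.3B-like shapes (24 layers, hidden size 2048, vocab 50272),
# batch size 4, sequence length 512, no gradient checkpointing (factor 3).
flops = megatron_train_flops_sketch(3, 4, 512, 24, 2048, 50272)
print(f"{flops / 1e12:.1f} TFLOPs per iteration")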
Reproduction Steps
1. Experimental environment.
Command | Output |
---|---|
cat /etc/issue | Ubuntu 22.04.4 LTS |
nvidia-smi | NVIDIA GeForce RTX 2080 Ti |
nvcc -V | cuda_11.7 |
2. Clone the code and enter the directory.
git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning
3. Download the models and save them under the `step1_supervised_finetuning` directory.
Model | Parameters | Pretrained weights | HF-Mirror |
---|---|---|---|
OPT-125M | 125M | facebook/opt-125m | facebook/opt-125m |
OPT-350M | 350M | facebook/opt-350m | facebook/opt-350m |
OPT-1.3B | 1.3B | facebook/opt-1.3b | facebook/opt-1.3b |
Only the model files shown in the directory tree in step 7 need to be downloaded: config.json, generation_config.json, merges.txt, pytorch_model.bin, special_tokens_map.json, tokenizer_config.json, and vocab.json.
4. Download the datasets and save them under the `step1_supervised_finetuning` directory.
Name | Address | HF-Mirror |
---|---|---|
Dahoas/rm-static | Dahoas/rm-static | Dahoas/rm-static |
Dahoas/full-hh-rlhf | Dahoas/full-hh-rlhf | Dahoas/full-hh-rlhf |
Dahoas/synthetic-instruct-gptj-pairwise | Dahoas/synthetic-instruct-gptj-pairwise | Dahoas/synthetic-instruct-gptj-pairwise |
yitingxie/rlhf-reward-datasets | yitingxie/rlhf-reward-datasets | yitingxie/rlhf-reward-datasets |
5. Create a conda virtual environment and install the dependencies.
conda create -n torch_2.1.0 python=3.8
conda activate torch_2.1.0
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install 'datasets>=2.8.0' 'sentencepiece>=0.1.97' 'protobuf==3.20.3' 'accelerate>=0.15.0' 'deepspeed>=0.9.0' 'transformers>=4.31.0,!=4.33.2' tensorboard
(The version specifiers are quoted so that the shell does not interpret `>` as output redirection or `!` as history expansion.)
6. Create a test script.
Path:
applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/my_test/run_opt-125m.sh
#!/bin/bash
OUTPUT_PATH=./output
mkdir -p $OUTPUT_PATH
deepspeed main.py \
--data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets \
--data_split 2,4,4 \
--model_name_or_path facebook/opt-125m \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--max_seq_len 512 \
--learning_rate 1e-3 \
--weight_decay 0. \
--num_train_epochs 16 \
--gradient_accumulation_steps 1 \
--lr_scheduler_type cosine \
--num_warmup_steps 0 \
--seed 1234 \
--gradient_checkpointing \
--zero_stage 3 \
--lora_dim 128 \
--lora_module_name decoder.layers. \
--deepspeed \
--output_dir $OUTPUT_PATH \
&> $OUTPUT_PATH/training.log
Give the newly created script execute permission: `chmod +x run_opt-125m.sh`
7. Move the `dschat` folder from `applications/DeepSpeed-Chat` into `applications/DeepSpeed-Chat/training`.
The directory structure of `applications/DeepSpeed-Chat/training` now looks like this:
├── dschat
│ ├── utils
│ ├── data
│ │ ├── data_utils.py
│ │ └── raw_datasets.py
│ ├── ds_utils.py
│ ├── model
│ │ ├── model_utils.py
│ │ └── reward_model.py
│ ├── module
│ │ └── lora.py
│ ├── perf.py
│ └── utils.py
├── step1_supervised_finetuning
│ ├── Dahoas
│ │ ├── full-hh-rlhf
│ │ │ ├── data
│ │ │ │ ├── test-00000-of-00001-ec71e9262143a91c.parquet
│ │ │ │ └── train-00000-of-00001-8349d0765e6718df.parquet
│ │ │ └── dataset_infos.json
│ │ ├── rm-static
│ │ │ ├── data
│ │ │ │ ├── test-00000-of-00001-8c7c51afc6d45980.parquet
│ │ │ │ └── train-00000-of-00001-2a1df75c6bce91ab.parquet
│ │ │ └── dataset_infos.json
│ │ └── synthetic-instruct-gptj-pairwise
│ │ ├── data
│ │ │ └── train-00000-of-00001-1e5d57b93c448e7a.parquet
│ │ └── dataset_infos.json
│ ├── facebook
│ │ ├── opt-125m
│ │ │ ├── config.json
│ │ │ ├── generation_config.json
│ │ │ ├── merges.txt
│ │ │ ├── pytorch_model.bin
│ │ │ ├── special_tokens_map.json
│ │ │ ├── tokenizer_config.json
│ │ │ └── vocab.json
│ │ ├── opt-1.3b
│ │ │ ├── config.json
│ │ │ ├── generation_config.json
│ │ │ ├── merges.txt
│ │ │ ├── pytorch_model.bin
│ │ │ ├── special_tokens_map.json
│ │ │ ├── tokenizer_config.json
│ │ │ └── vocab.json
│ │ └── opt-350m
│ │ ├── config.json
│ │ ├── generation_config.json
│ │ ├── merges.txt
│ │ ├── pytorch_model.bin
│ │ ├── special_tokens_map.json
│ │ ├── tokenizer_config.json
│ │ └── vocab.json
│ ├── main.py
│ ├── README.md
│ ├── training_log_output
│ │ └── opt-1.3b-globalBatchSize128.log
│ ├── training_scripts
│ │ ├── llama2
│ │ │ ├── run_llama2_7b_lora.sh
│ │ │ └── run_llama2_7b.sh
│ │ ├── my_test
│ │ │ └── run_opt-125m.sh
│ │ ├── opt
│ │ │ ├── single_gpu
│ │ │ │ ├── run_1.3b.sh
│ │ │ │ └── run_6.7b_lora.sh
│ │ │ └── single_node
│ │ │ ├── run_1.3b_lora.sh
│ │ │ ├── run_13b.sh
│ │ │ ├── run_1.3b.sh
│ │ │ ├── run_30b_lora.sh
│ │ │ ├── run_6.7b.sh
│ │ │ └── sweep
│ │ │ ├── README.md
│ │ │ ├── run_single.sh
│ │ │ └── run_step1_sweep.sh
│ │ ├── other_language
│ │ │ ├── run_chinese.sh
│ │ │ └── run_japanese.sh
│ │ └── README.md
│ └── yitingxie
│ └── rlhf-reward-datasets
│ └── data
│ ├── test-00000-of-00001-955c146ec7a10a1e.parquet
│ └── train-00000-of-00001-2ea3039ca4da89f8.parquet
8. Add the following code to `main.py` so that it can find the `dschat` module it imports.
import sys
import os
sys.path.append(
os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir)))
9. Run the script from the `applications/DeepSpeed-Chat/training/step1_supervised_finetuning` directory.
bash training_scripts/my_test/run_opt-125m.sh
I hit an error: `FileNotFoundError: Directory Dahoas/rm-static is neither a dataset directory nor a dataset dict directory.` After some digging, the problem turned out to be in `applications/DeepSpeed-Chat/training/dschat/utils/data/raw_datasets.py`.
class PromptRawDataset(object):

    def __init__(self, output_path, seed, local_rank, dataset_name):
        self.output_path = output_path
        self.seed = seed
        self.local_rank = local_rank
        if os.path.exists(dataset_name):
            # self.raw_datasets = load_from_disk(dataset_name)
            # The original line above is commented out; use load_dataset(dataset_name) here as well.
            self.raw_datasets = load_dataset(dataset_name)
        elif not dataset_name == 'local/jsonfile':
            self.raw_datasets = load_dataset(dataset_name)
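The distinction behind this fix: datasets.load_from_disk only reads datasets previously written with Dataset.save_to_disk (folders containing Arrow files and state metadata), whereas datasets.load_dataset can also build a dataset directly from a local folder of parquet files like the ones downloaded above. A minimal sketch, using the local path from this setup:

from datasets import load_dataset

# Local folder containing data/*.parquet downloaded from the hub or a mirror.
local_dir = "Dahoas/rm-static"

# load_dataset() infers the splits from the parquet file names and works here;
# load_from_disk(local_dir) would fail because the folder was not produced by
# Dataset.save_to_disk().
ds = load_dataset(local_dir)
print(ds)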
See:
- mariosasko's answer 1
- mariosasko's answer 2
- Arunbh Yashaswi's answer
- [huggingface] Downloading datasets and models and saving them locally
Using OPT with Colossal-AI
To be explored: https://github.com/hpcaitech/ColossalAI#OPT
References
- guolipa: A survey and summary of large language models
- 罗小黑: A roundup of 1-7B open-source small models
- facebookresearch: About OPT & Pretrained Model Weights
- Microsoft: DeepSpeed Chat: one-click RLHF training, making your ChatGPT-like 100B-parameter models 15x faster and cheaper
- just_sort: DeepSpeed-Chat: building a ChatGPT-like pipeline end to end, notes part 1
- just_sort: DeepSpeed-Chat: building a ChatGPT-like pipeline end to end, notes part 2, supervised instruction fine-tuning
- AI开发者: Special topic: hands-on introduction to large-model training
- J.P.Liu: DeepSpeed-Chat end-to-end training in practice