
Fine-tuning an OPT Base Language Model with DeepSpeed

Table of Contents

  • OPT Base Language Model
  • Using OPT with DeepSpeed
    • main.py walkthrough
      • 1. Importing libraries and modules
      • 2. Parsing command-line arguments
      • 3. The main function
        • 3.1 Device and distributed initialization
        • 3.2 Model and data preparation
        • 3.3 Defining the evaluation function
        • 3.4 Optimizer and learning-rate scheduler setup
        • 3.5 Initializing the model with deepspeed
        • 3.6 Training loop
        • 3.7 Saving the model
      • 4. dschat/utils functions
        • 4.1 get_train_ds_config: build the DeepSpeed training config dict
        • 4.2 print_throughput: print training throughput
    • Reproduction steps
  • Using OPT with Colossal-AI
  • References

OPT Base Language Model

A base language model (basic language model) is a model that has only been pretrained on large-scale text corpora, without instruction tuning, downstream-task fine-tuning, or any alignment optimization such as learning from human feedback.

OPT is a family of large pretrained language models released by researchers at Meta AI. It comes in nine parameter scales: 125M, 350M, 1.3B, 2.7B, 6.7B, 13B, 30B, 66B, and 175B. Apart from the 175B version, which requires submitting an access request, all sizes are fully open for download and free to obtain. OPT-175B performs comparably to GPT-3 while requiring only about 1/7 of GPT-3's energy cost. The OPT series was open-sourced to promote academic research and exchange: the cost of training large language models puts them out of reach of most researchers, and the large pretrained models released by major companies do not expose their full weights for commercial reasons, offering only API access, which hinders open research.

Using OPT with DeepSpeed

Directory structure of DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning:

├── evaluation_scripts		// scripts for running model evaluation
│   └── run_prompt.sh
├── main.py					// main training script
├── prompt_eval.py			// used together with run_prompt.sh
├── README.md
├── training_log_output		// log files from training runs
│   └── opt-1.3b-globalBatchSize128.log
└── training_scripts
    ├── llama2				// for fine-tuning LLaMA 2 models
    │   ├── run_llama2_7b_lora.sh
    │   └── run_llama2_7b.sh
    ├── opt					// for fine-tuning OPT models
    │   ├── multi_node
    │   │   └── run_66b.sh
    │   ├── single_gpu
    │   │   ├── run_1.3b.sh
    │   │   └── run_6.7b_lora.sh
    │   └── single_node
    │       ├── run_1.3b_lora.sh
    │       ├── run_13b.sh
    │       ├── run_1.3b.sh
    │       ├── run_30b_lora.sh
    │       ├── run_6.7b.sh
    │       └── sweep		// for hyperparameter sweeps
    │           ├── README.md
    │           ├── run_single.sh
    │           └── run_step1_sweep.sh
    ├── other_language		// for training models in other languages
    │   ├── run_chinese.sh
    │   └── run_japanese.sh
    └── README.md

main.py walkthrough

Location: DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py

1. Importing libraries and modules

import argparse
import math
import time

import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler

from transformers import (
    AutoModelForCausalLM,
    SchedulerType,
    default_data_collator,
    get_scheduler,
)

import deepspeed
from deepspeed.ops.adam import DeepSpeedCPUAdam, FusedAdam
from deepspeed import get_accelerator

from dschat.utils.data.data_utils import create_prompt_dataset
from dschat.utils.utils import print_rank_0, to_device, save_hf_format, set_random_seed, get_all_reduce_mean, get_optimizer_grouped_parameters, save_zero_three_model, load_hf_tokenizer
from dschat.utils.ds_utils import get_train_ds_config
from dschat.utils.module.lora import convert_linear_layer_to_lora, convert_lora_to_linear_layer, only_optimize_lora_parameters, make_model_gradient_checkpointing_compatible
from dschat.utils.model.model_utils import create_hf_model, causal_lm_model_to_fp32_loss
from dschat.utils.perf import print_throughput

2. Parsing command-line arguments

def parse_args():
    parser = argparse.ArgumentParser(
        description=
        "Finetune a transformers model on a causal language modeling task")
    parser.add_argument('--data_path',
                        nargs='*',
                        default=['Dahoas/rm-static'],
                        help='Path to the training dataset. Accepted format:'
                        '1) a single data path, 2) multiple datasets in the'
                        'form: dataset1-path dataset2-path ...')
    parser.add_argument('--data_split',
                        type=str,
                        default='2,4,4',
                        help='Comma-separated list of proportions for training'
                        'phase 1, 2, and 3 data. For example the split `6,2,2`'
                        'will use 60%% of data for phase 1, 20%% for phase 2'
                        'and 20%% for phase 3.')
    parser.add_argument(
        '--sft_only_data_path',
        nargs='*',
        default=[],
        help='Path to the dataset for only using in SFT phase.')
    parser.add_argument(
        '--data_output_path',
        type=str,
        default='/tmp/data_files/',
        help=
        'Where to store the data-related files such as shuffle index. This needs to be on a local storage of a node (not on a shared storage)'
    )
    parser.add_argument(
        "--model_name_or_path",
        type=str,
        help=
        "Path to pretrained model or model identifier from huggingface.co/models.",
        required=True,
    )
    parser.add_argument(
        "--per_device_train_batch_size",
        type=int,
        default=16,
        help="Batch size (per device) for the training dataloader.",
    )
    parser.add_argument(
        "--per_device_eval_batch_size",
        type=int,
        default=16,
        help="Batch size (per device) for the evaluation dataloader.",
    )
    parser.add_argument(
        "--max_seq_len",
        type=int,
        default=512,
        help="The maximum sequence length.",
    )
    parser.add_argument(
        "--learning_rate",
        type=float,
        default=1e-3,
        help=
        "Initial learning rate (after the potential warmup period) to use.",
    )
    parser.add_argument("--weight_decay",
                        type=float,
                        default=0.,
                        help="Weight decay to use.")
    parser.add_argument("--num_train_epochs",
                        type=int,
                        default=1,
                        help="Total number of training epochs to perform.")
    parser.add_argument(
        "--gradient_accumulation_steps",
        type=int,
        default=1,
        help=
        "Number of updates steps to accumulate before performing a backward/update pass.",
    )
    parser.add_argument(
        "--lr_scheduler_type",
        type=SchedulerType,
        default="cosine",
        help="The scheduler type to use.",
        choices=[
            "linear", "cosine", "cosine_with_restarts", "polynomial",
            "constant", "constant_with_warmup"
        ],
    )
    parser.add_argument(
        "--num_warmup_steps",
        type=int,
        default=0,
        help="Number of steps for the warmup in the lr scheduler.")
    parser.add_argument("--output_dir",
                        type=str,
                        default=None,
                        help="Where to store the model.")
    parser.add_argument("--seed",
                        type=int,
                        default=1234,
                        help="A seed for reproducible training.")
    parser.add_argument("--local_rank",
                        type=int,
                        default=-1,
                        help="local_rank for distributed training on gpus")
    parser.add_argument('--gradient_checkpointing',
                        action='store_true',
                        help='Enable HF gradient checkpointing for model.')
    parser.add_argument(
        "--dropout",
        type=float,
        default=None,
        help="If dropout configured, use it. "
        "Otherwise, keep the default dropout configuration of the model.")
    # deepspeed features
    parser.add_argument('--offload',
                        action='store_true',
                        help='Enable ZeRO Offload techniques.')
    parser.add_argument('--dtype',
                        type=str,
                        default='fp16',
                        choices=['fp16', 'bf16'],
                        help='Training data type')
    parser.add_argument(
        '--zero_stage',
        type=int,
        default=0,
        help='ZeRO optimization stage for Actor model (and clones).')
    ## LoRA for efficient training setting
    parser.add_argument("--lora_dim",
                        type=int,
                        default=0,
                        help="If > 0, use LoRA for efficient training.")
    parser.add_argument("--lora_module_name",
                        type=str,
                        default="decoder.layers.",
                        help="The scope of LoRA.")
    parser.add_argument('--only_optimize_lora',
                        action='store_true',
                        help='Only optimize the LoRA parameters.')
    parser.add_argument(
        "--lora_learning_rate",
        type=float,
        default=5e-4,
        help=
        "Initial LoRA learning rate (after the potential warmup period) to use."
    )
    ## low precision
    parser.add_argument(
        '--compute_fp32_loss',
        action='store_true',
        help='Relevant for low precision dtypes (fp16, bf16, etc.). '
        'If specified, loss is calculated in fp32.')
    ## Tensorboard logging
    parser.add_argument('--enable_tensorboard',
                        action='store_true',
                        help='Enable tensorboard logging')
    parser.add_argument('--tensorboard_path',
                        type=str,
                        default="step1_tensorboard")
    ## Tokenizer
    parser.add_argument(
        "--add_eot_token",
        action='store_true',
        help="Add `eot_token` as additional special token to tokenizer")
    parser.add_argument(
        "--eot_token",
        type=str,
        default="<|endoftext|>",
        help="Specify the format of the `eot_token`",
    )
    ## Print loss
    parser.add_argument('--print_loss',
                        action='store_true',
                        help='Prints loss at each step.')
    parser = deepspeed.add_config_arguments(parser)
    args = parser.parse_args()

    return args

Meaning of selected command-line arguments:

Name | Description
data_path | Path(s) to the training dataset; a single path or multiple paths.
data_split | Comma-separated proportions for splitting the data into phase 1, 2, and 3 training data.
model_name_or_path | Path to the pretrained model (or a model identifier on huggingface.co/models).
per_device_train_batch_size | Training batch size per device.
per_device_eval_batch_size | Evaluation batch size per device.
max_seq_len | Maximum sequence length during training; longer sequences may consume more GPU memory.
learning_rate | Initial learning rate.
weight_decay | Weight-decay value.
num_train_epochs | Total number of training epochs.
gradient_accumulation_steps | Number of steps to accumulate before a backward/update pass, used to simulate a larger batch size.
lr_scheduler_type | Type of learning-rate scheduler.
num_warmup_steps | Number of warmup steps for the learning-rate scheduler.
seed | Random seed for reproducible training.
zero_stage | DeepSpeed ZeRO optimization stage.
lora_dim | Rank of the LoRA low-rank matrices; if greater than 0, LoRA-based efficient training is enabled.
lora_module_name | Scope of the modules to which LoRA is applied.
only_optimize_lora | Optimize only the LoRA parameters and freeze the rest of the model.
gradient_checkpointing | Enable gradient checkpointing to reduce GPU memory usage.
deepspeed | Enable the DeepSpeed training framework.
enable_tensorboard | Enable TensorBoard for training monitoring and visualization.
tensorboard_path | Path where the TensorBoard logs are saved.
output_dir | Path where the training outputs (the model) are saved.

Some sources claim that only_optimize_lora and gradient_checkpointing cannot be enabled at the same time. I tried enabling both and training does run, although I cannot say whether it affects the final quality. If the two options really are incompatible, a guard can be added before return args to make the script more robust, as sketched below.
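A minimal sketch of such a guard, assuming the argument names defined in parse_args() above (the error message is my own wording); it would sit just before return args:

    # Hypothetical guard: fail fast if the two options are truly incompatible.
    if args.only_optimize_lora and args.gradient_checkpointing:
        raise ValueError(
            "--only_optimize_lora and --gradient_checkpointing cannot be enabled together; "
            "please drop one of the two flags.")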

3. The main function

3.1 Device and distributed initialization
    args = parse_args()
    
    # If args.local_rank is -1, this is not a distributed training run
    if args.local_rank == -1:
        device = torch.device(get_accelerator().device_name())
    else:
        get_accelerator().set_device(args.local_rank)
        device = torch.device(get_accelerator().device_name(), args.local_rank)
        # Initializes the distributed backend which will take care of sychronizing nodes/GPUs
        # torch.distributed.init_process_group(backend='nccl')
        deepspeed.init_distributed()

    args.global_rank = torch.distributed.get_rank()

	# from dschat.utils.ds_utils import get_train_ds_config
    # Build the DeepSpeed training configuration dictionary
    ds_config = get_train_ds_config(offload=args.offload,
                                    dtype=args.dtype,
                                    stage=args.zero_stage,
                                    enable_tensorboard=args.enable_tensorboard,
                                    tb_path=args.tensorboard_path,
                                    tb_name="step1_model")
    ds_config[
        'train_micro_batch_size_per_gpu'] = args.per_device_train_batch_size
    ds_config[
        'train_batch_size'] = args.per_device_train_batch_size * torch.distributed.get_world_size(
        ) * args.gradient_accumulation_steps

    # If passed along, set the training seed now.
    # Set the random seed so training is reproducible
    set_random_seed(args.seed)

    torch.distributed.barrier()
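The two batch-size entries written into ds_config are tied together by a simple identity: the global train_batch_size is the per-GPU micro batch size times the world size times the number of gradient-accumulation steps. A quick sketch of that arithmetic with purely illustrative numbers:

# Illustrative numbers only: 4 samples per GPU, 8 GPUs, 2 accumulation steps.
per_device_train_batch_size = 4
world_size = 8                      # what torch.distributed.get_world_size() would return
gradient_accumulation_steps = 2

train_micro_batch_size_per_gpu = per_device_train_batch_size
train_batch_size = per_device_train_batch_size * world_size * gradient_accumulation_steps
print(train_micro_batch_size_per_gpu, train_batch_size)  # 4 64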
3.2 Model and data preparation
	# load_hf_tokenizer will get the correct tokenizer and set padding tokens based on the model family
    # Load the tokenizer
    additional_special_tokens = args.eot_token if args.add_eot_token else None
    tokenizer = load_hf_tokenizer(args.model_name_or_path,
                                  fast_tokenizer=True,
                                  add_special_tokens=additional_special_tokens)
                                  
    # Build and initialize the model
    model = create_hf_model(AutoModelForCausalLM,
                            args.model_name_or_path,
                            tokenizer,
                            ds_config,
                            dropout=args.dropout)

    if args.compute_fp32_loss:
        print_rank_0(
            f"Using model {model.__class__.__name__} with loss in fp32",
            args.global_rank)
        causal_lm_model_to_fp32_loss(model)
        
    # If args.lora_dim > 0, enable LoRA (low-rank adaptation) for efficient training
    if args.lora_dim > 0:
        model = convert_linear_layer_to_lora(model, args.lora_module_name,
                                             args.lora_dim)
        if args.only_optimize_lora:
            model = only_optimize_lora_parameters(model)
            model = make_model_gradient_checkpointing_compatible(model)

    # Prepare the data
    # Build the prompt datasets for this training phase
    train_phase = 1
    train_dataset, eval_dataset = create_prompt_dataset(
        args.local_rank,
        args.data_path,
        args.data_split,
        args.data_output_path,
        train_phase,
        args.seed,
        tokenizer,
        args.max_seq_len,
        end_of_conversation_token=tokenizer.eos_token,
        sft_only_data_path=args.sft_only_data_path)
        
    # DataLoaders creation:
    # Build the train/eval dataloaders
    if args.local_rank == -1:
        train_sampler = RandomSampler(train_dataset)
        eval_sampler = SequentialSampler(eval_dataset)
    else:
        train_sampler = DistributedSampler(train_dataset)
        eval_sampler = DistributedSampler(eval_dataset)
    train_dataloader = DataLoader(train_dataset,
                                  collate_fn=default_data_collator,
                                  sampler=train_sampler,
                                  batch_size=args.per_device_train_batch_size)
    eval_dataloader = DataLoader(eval_dataset,
                                 collate_fn=default_data_collator,
                                 sampler=eval_sampler,
                                 batch_size=args.per_device_eval_batch_size)
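For reference, here is a tiny self-contained sketch (with made-up token IDs rather than the real output of create_prompt_dataset) of what default_data_collator does with a list of tokenized features: it stacks same-length lists into batched tensors, so it assumes every example already has the same sequence length.

from transformers import default_data_collator

# Toy features shaped like causal-LM training examples (hypothetical IDs).
features = [
    {"input_ids": [2, 100, 101, 1], "attention_mask": [1, 1, 1, 1], "labels": [2, 100, 101, 1]},
    {"input_ids": [2, 200, 201, 1], "attention_mask": [1, 1, 1, 1], "labels": [2, 200, 201, 1]},
]
batch = default_data_collator(features)
print({k: v.shape for k, v in batch.items()})
# {'input_ids': torch.Size([2, 4]), 'attention_mask': torch.Size([2, 4]), 'labels': torch.Size([2, 4])}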
3.3 Defining the evaluation function
    def evaluation(model, eval_dataloader):
        model.eval()
        losses = 0
        for step, batch in enumerate(eval_dataloader):
            batch = to_device(batch, device)
            with torch.no_grad():
                outputs = model(**batch)

            loss = outputs.loss
            losses += loss.float()
        losses = losses / (step + 1)
        try:
            losses = get_all_reduce_mean(losses)
        except:
            pass
        try:
            perplexity = torch.exp(losses).item()
        except OverflowError:
            perplexity = float("inf")
        return perplexity, losses.item()
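The function averages the loss over all evaluation batches (and across ranks via the all-reduce), then reports perplexity, which is simply the exponential of the mean cross-entropy loss. A quick sanity check of that relationship:

import math

mean_loss = 2.0                   # illustrative mean cross-entropy loss
perplexity = math.exp(mean_loss)  # same idea as torch.exp(losses).item() above
print(f"{perplexity:.2f}")        # 7.39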
3.4 Optimizer and learning-rate scheduler setup
    # Split weights in two groups, one with weight decay and the other not.
    optimizer_grouped_parameters = get_optimizer_grouped_parameters(
        model, args.weight_decay, args.lora_learning_rate)

    AdamOptimizer = DeepSpeedCPUAdam if args.offload else FusedAdam
    optimizer = AdamOptimizer(optimizer_grouped_parameters,
                              lr=args.learning_rate,
                              betas=(0.9, 0.95))

    num_update_steps_per_epoch = math.ceil(
        len(train_dataloader) / args.gradient_accumulation_steps)
    lr_scheduler = get_scheduler(
        name=args.lr_scheduler_type,
        optimizer=optimizer,
        num_warmup_steps=args.num_warmup_steps,
        num_training_steps=args.num_train_epochs * num_update_steps_per_epoch,
    )
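get_optimizer_grouped_parameters is not reproduced here; in Hugging Face-style training code such a helper typically splits the parameters into a group that receives weight decay and a group (usually biases and LayerNorm weights) that does not, and the dschat version additionally gives the LoRA parameters their own learning rate. A simplified sketch of the usual pattern, offered as an assumption rather than the actual dschat implementation:

# Simplified sketch of the common weight-decay grouping pattern; the real dschat
# helper also handles a separate learning rate for LoRA parameters.
def simple_grouped_parameters(model, weight_decay, no_decay=("bias", "LayerNorm.weight")):
    decay_params, no_decay_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if any(nd in name for nd in no_decay):
            no_decay_params.append(param)
        else:
            decay_params.append(param)
    return [
        {"params": decay_params, "weight_decay": weight_decay},
        {"params": no_decay_params, "weight_decay": 0.0},
    ]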
3.5 Initializing the model with deepspeed
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
        model=model,
        optimizer=optimizer,
        args=args,
        config=ds_config,
        lr_scheduler=lr_scheduler,
        dist_init_required=True)
        
    # Enable gradient checkpointing on the model
    if args.gradient_checkpointing:
        model.gradient_checkpointing_enable()
3.6 Training loop
    # Train!
    print_rank_0("***** Running training *****", args.global_rank)
    print_rank_0(
        f"***** Evaluating perplexity, Epoch {0}/{args.num_train_epochs} *****",
        args.global_rank)
        
    # Evaluate the model on the validation set before any training
    perplexity, eval_loss = evaluation(model, eval_dataloader)
    print_rank_0(f"ppl: {perplexity}, loss: {eval_loss}", args.global_rank)

    # Outer loop: each epoch is one full pass over the training dataset
    for epoch in range(args.num_train_epochs):
        print_rank_0(
            f"Beginning of Epoch {epoch+1}/{args.num_train_epochs}, Total Micro Batches {len(train_dataloader)}",
            args.global_rank)
            
        # Switch the model to training mode
        model.train()
        
        # Inner loop: iterate over the batches from the training dataloader
        for step, batch in enumerate(train_dataloader):
            # Record the start time so the per-batch processing time can be measured
            start = time.time()
            batch = to_device(batch, device)
            # Forward pass
            outputs = model(**batch, use_cache=False)
            loss = outputs.loss
            if args.print_loss:
                print(
                    f"Epoch: {epoch}, Step: {step}, Rank: {torch.distributed.get_rank()}, loss = {loss}"
                )
            # Backward pass
            model.backward(loss)
            # Update the model parameters
            model.step()
            # This batch is done; record the end time
            end = time.time()
            if torch.distributed.get_rank() == 0:
                # Print training throughput
                print_throughput(model.model, args, end - start,
                                 args.global_rank)

        # Evaluate perplexity on the validation set.
        # Track how the model's performance on the validation set evolves
        print_rank_0(
            f"***** Evaluating perplexity, Epoch {epoch+1}/{args.num_train_epochs} *****",
            args.global_rank)
        perplexity, eval_loss = evaluation(model, eval_dataloader)
        print_rank_0(f"ppl: {perplexity}, loss: {eval_loss}", args.global_rank)
        model.tput_timer.update_epoch_count()
3.7 Saving the model
    if args.output_dir is not None:
        print_rank_0('saving the final model ...', args.global_rank)
        model = convert_lora_to_linear_layer(model)

        if args.global_rank == 0:
            save_hf_format(model, tokenizer, args)

        if args.zero_stage == 3:
            # For zero stage 3, each gpu only has a part of the model, so we need a special save function
            save_zero_three_model(model,
                                  args.global_rank,
                                  args.output_dir,
                                  zero_stage=args.zero_stage)

4. dschat/utils functions

4.1 get_train_ds_config: build the DeepSpeed training config dict
# applications/DeepSpeed-Chat/dschat/utils/ds_utils.py
def get_train_ds_config(offload,
                        dtype,
                        stage=2,
                        enable_hybrid_engine=False,
                        inference_tp_size=1,
                        release_inference_cache=False,
                        pin_parameters=True,
                        tp_gather_partition_size=8,
                        max_out_tokens=512,
                        enable_tensorboard=False,
                        enable_mixed_precision_lora=False,
                        tb_path="",
                        tb_name=""):
                        
    # If offload is True, set device to "cpu" so parameters/optimizer state are offloaded to the CPU
    device = "cpu" if offload else "none"
    if dtype == "fp16":
        data_type = "fp16"
        dtype_config = {"enabled": True, "loss_scale_window": 100}
    elif dtype == "bf16":
        data_type = "bfloat16"
        dtype_config = {"enabled": True}
    # Configuration related to ZeRO optimization
    zero_opt_dict = {
        "stage": stage,			# 设置 ZeRO 优化阶段
        "overlap_comm": True,	# 启用通信重叠
        # 将模型参数和优化器相关的数据卸载到指定的设备
        "offload_param": {
            "device": device
        },
        "offload_optimizer": {
            "device": device
        },
        "stage3_param_persistence_threshold": 1e4,
        "stage3_max_live_parameters": 3e7,
        "stage3_prefetch_bucket_size": 3e7,
        "memory_efficient_linear": False
    }
    
    # Mixed-precision LoRA configuration
    if enable_mixed_precision_lora:
        zero_opt_dict["zero_quantized_nontrainable_weights"] = True
        if dist.get_world_size() != get_accelerator().device_count():
            zero_opt_dict["zero_hpz_partition_size"] = get_accelerator(
            ).device_count()
            
    # Return the dictionary with the full DeepSpeed training configuration
    return {
        "train_batch_size": GLOBAL_BATCH_SIZE,
        "train_micro_batch_size_per_gpu": MICRO_BATCH_SIZE,
        "steps_per_print": 10,
        "zero_optimization": zero_opt_dict,
        data_type: dtype_config,
        "gradient_clipping": 1.0,
        "prescale_gradients": False,
        "wall_clock_breakdown": False,
        "hybrid_engine": {
            "enabled": enable_hybrid_engine,
            "max_out_tokens": max_out_tokens,
            "inference_tp_size": inference_tp_size,
            "release_inference_cache": release_inference_cache,
            "pin_parameters": pin_parameters,
            "tp_gather_partition_size": tp_gather_partition_size,
        },
        "tensorboard": {
            "enabled": enable_tensorboard,
            "output_path": f"{tb_path}/ds_tensorboard_logs/",
            "job_name": f"{tb_name}_tensorboard"
        }
    }
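To make the structure concrete, here is roughly the dictionary returned by a call such as get_train_ds_config(offload=False, dtype="fp16", stage=3). GLOBAL_BATCH_SIZE and MICRO_BATCH_SIZE are module-level placeholder constants in ds_utils.py that main() overwrites, so the concrete numbers below (32 and 4) are for illustration only:

# Approximate return value of get_train_ds_config(offload=False, dtype="fp16", stage=3).
ds_config_example = {
    "train_batch_size": 32,                  # placeholder, overwritten in main()
    "train_micro_batch_size_per_gpu": 4,     # placeholder, overwritten in main()
    "steps_per_print": 10,
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "offload_param": {"device": "none"},
        "offload_optimizer": {"device": "none"},
        "stage3_param_persistence_threshold": 1e4,
        "stage3_max_live_parameters": 3e7,
        "stage3_prefetch_bucket_size": 3e7,
        "memory_efficient_linear": False,
    },
    "fp16": {"enabled": True, "loss_scale_window": 100},
    "gradient_clipping": 1.0,
    "prescale_gradients": False,
    "wall_clock_breakdown": False,
    "hybrid_engine": {
        "enabled": False,
        "max_out_tokens": 512,
        "inference_tp_size": 1,
        "release_inference_cache": False,
        "pin_parameters": True,
        "tp_gather_partition_size": 8,
    },
    "tensorboard": {
        "enabled": False,
        "output_path": "/ds_tensorboard_logs/",
        "job_name": "_tensorboard",
    },
}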
4.2 print_throughput: print training throughput
# applications/DeepSpeed-Chat/dschat/utils/perf.py
# This function can be used to print throughput for Step 1 and 2 only
def print_throughput(hf_model, args, e2e_time, rank=0):
    if rank <= 0:
        hf_config = hf_model.config
        num_layers, hidden_size, vocab_size = get_hf_configs(hf_config)

        gpus_per_model = torch.distributed.get_world_size()
        seq_length = args.max_seq_len
        batch_size = args.per_device_train_batch_size
        # Samples processed per second
        samples_per_second = batch_size / e2e_time
        checkpoint_activations_factor = 4 if args.gradient_checkpointing else 3
        if args.lora_dim > 0:
            k = args.lora_dim * 2 / hidden_size
            checkpoint_activations_factor -= (1 - k)
        # Count the model parameters
        hf_model._num_params = sum([
            p.ds_numel if hasattr(p, "ds_tensor") else p.numel()
            for p in hf_model.parameters()
        ])
        # Convert to billions of parameters
        params_in_billions = hf_model._num_params / (1e9)

        # Megatron paper's formula to calculate training flops
        train_flops_per_iteration = calculate_flops(
            checkpoint_activations_factor, batch_size, seq_length, hf_config)
        # Per-GPU TFLOPs for this training iteration
        train_tflops = train_flops_per_iteration / (e2e_time * gpus_per_model *
                                                    (10**12))

        param_string = f"{params_in_billions:.3f} B" if params_in_billions != 0 else "NA"
        # Print the formatted performance metrics:
        # - number of model parameters
        # - end-to-end latency of the iteration
        # - TFLOPs per GPU
        # - samples processed per second
        # - time per sequence
        # - batch size
        # - sequence length
        print(
            f"Model Parameters: {param_string}, Latency: {e2e_time:.2f}s, TFLOPs: {train_tflops:.2f}, Samples/sec: {samples_per_second:.2f}, Time/seq {e2e_time/batch_size:.2f}s, Batch Size: {batch_size}, Sequence Length: {seq_length}"
        )
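calculate_flops and get_hf_configs live in the same perf.py and are not reproduced here. For orientation, the FLOPs estimate follows the formula from the Megatron-LM paper; a sketch of that computation (based on the published formula, and possibly differing in minor details from the repository's exact implementation):

# Sketch of the Megatron-style estimate of training FLOPs per iteration.
# checkpoint_activations_factor is 4 with activation checkpointing, 3 without.
def estimate_train_flops_per_iteration(checkpoint_activations_factor, batch_size,
                                       seq_length, num_layers, hidden_size, vocab_size):
    return (24 * checkpoint_activations_factor * batch_size * seq_length
            * num_layers * hidden_size ** 2) * (
                1.0
                + seq_length / (6.0 * hidden_size)
                + vocab_size / (16.0 * num_layers * hidden_size))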

Reproduction steps

1. Experimental environment.

cat /etc/issue	| 	Ubuntu 22.04.4 LTS
nvidia-smi 		| 	NVIDIA GeForce RTX 2080 Ti
nvcc -V			| 	cuda_11.7

2. Clone the repository and change into the directory.

git clone https://github.com/microsoft/DeepSpeedExamples.git
cd DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning

3. Download the models and save them under the step1_supervised_finetuning directory.

Model | Parameters | Pretrained weights | HF-Mirror
OPT-125M | 125M | facebook/opt-125m | facebook/opt-125m
OPT-350M | 350M | facebook/opt-350m | facebook/opt-350m
OPT-1.3B | 1.3B | facebook/opt-1.3b | facebook/opt-1.3b

Only the following files need to be downloaded:

(Figure: the model files that need to be downloaded)
4. Download the datasets and save them under the step1_supervised_finetuning directory.

Name | Address | HF-Mirror
Dahoas/rm-static | Dahoas/rm-static | Dahoas/rm-static
Dahoas/full-hh-rlhf | Dahoas/full-hh-rlhf | Dahoas/full-hh-rlhf
Dahoas/synthetic-instruct-gptj-pairwise | Dahoas/synthetic-instruct-gptj-pairwise | Dahoas/synthetic-instruct-gptj-pairwise
yitingxie/rlhf-reward-datasets | yitingxie/rlhf-reward-datasets | yitingxie/rlhf-reward-datasets

5. Create a conda virtual environment and install the dependencies.

conda create -n torch_2.1.0 python=3.8
conda activate torch_2.1.0
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install "datasets>=2.8.0" "sentencepiece>=0.1.97" "protobuf==3.20.3" "accelerate>=0.15.0" "deepspeed>=0.9.0" "transformers>=4.31.0,!=4.33.2" tensorboard  # quote the version specifiers so the shell does not treat > as a redirect

6. Create a test script.

Path: applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/my_test/run_opt-125m.sh

#!/bin/bash

OUTPUT_PATH=./output
mkdir -p $OUTPUT_PATH

deepspeed main.py \
   --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets \
   --data_split 2,4,4 \
   --model_name_or_path facebook/opt-125m \
   --per_device_train_batch_size 1 \
   --per_device_eval_batch_size 1 \
   --max_seq_len 512 \
   --learning_rate 1e-3 \
   --weight_decay 0. \
   --num_train_epochs 16 \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --gradient_checkpointing \
   --zero_stage 3 \
   --lora_dim 128 \
   --lora_module_name decoder.layers. \
   --deepspeed \
   --output_dir $OUTPUT_PATH \
   &> $OUTPUT_PATH/training.log

Make the new script executable: chmod +x run_opt-125m.sh

7. Move the dschat folder from applications/DeepSpeed-Chat into applications/DeepSpeed-Chat/training.

The directory structure under applications/DeepSpeed-Chat/training now looks like this:

├── dschat
│   ├── utils
│       ├── data
│       │   ├── data_utils.py
│       │   └── raw_datasets.py
│       ├── ds_utils.py
│       ├── model
│       │   ├── model_utils.py
│       │   └── reward_model.py
│       ├── module
│       │   └── lora.py
│       ├── perf.py
│       └── utils.py
├── step1_supervised_finetuning
│   ├── Dahoas
│   │   ├── full-hh-rlhf
│   │   │   ├── data
│   │   │   │   ├── test-00000-of-00001-ec71e9262143a91c.parquet
│   │   │   │   └── train-00000-of-00001-8349d0765e6718df.parquet
│   │   │   └── dataset_infos.json
│   │   ├── rm-static
│   │   │   ├── data
│   │   │   │   ├── test-00000-of-00001-8c7c51afc6d45980.parquet
│   │   │   │   └── train-00000-of-00001-2a1df75c6bce91ab.parquet
│   │   │   └── dataset_infos.json
│   │   └── synthetic-instruct-gptj-pairwise
│   │       ├── data
│   │       │   └── train-00000-of-00001-1e5d57b93c448e7a.parquet
│   │       └── dataset_infos.json
│   ├── facebook
│   │   ├── opt-125m
│   │   │   ├── config.json
│   │   │   ├── generation_config.json
│   │   │   ├── merges.txt
│   │   │   ├── pytorch_model.bin
│   │   │   ├── special_tokens_map.json
│   │   │   ├── tokenizer_config.json
│   │   │   └── vocab.json
│   │   ├── opt-1.3b
│   │   │   ├── config.json
│   │   │   ├── generation_config.json
│   │   │   ├── merges.txt
│   │   │   ├── pytorch_model.bin
│   │   │   ├── special_tokens_map.json
│   │   │   ├── tokenizer_config.json
│   │   │   └── vocab.json
│   │   └── opt-350m
│   │       ├── config.json
│   │       ├── generation_config.json
│   │       ├── merges.txt
│   │       ├── pytorch_model.bin
│   │       ├── special_tokens_map.json
│   │       ├── tokenizer_config.json
│   │       └── vocab.json
│   ├── main.py
│   ├── README.md
│   ├── training_log_output
│   │   └── opt-1.3b-globalBatchSize128.log
│   ├── training_scripts
│   │   ├── llama2
│   │   │   ├── run_llama2_7b_lora.sh
│   │   │   └── run_llama2_7b.sh
│   │   ├── my_test
│   │   │   └── run_opt-125m.sh
│   │   ├── opt
│   │   │   ├── single_gpu
│   │   │   │   ├── run_1.3b.sh
│   │   │   │   └── run_6.7b_lora.sh
│   │   │   └── single_node
│   │   │       ├── run_1.3b_lora.sh
│   │   │       ├── run_13b.sh
│   │   │       ├── run_1.3b.sh
│   │   │       ├── run_30b_lora.sh
│   │   │       ├── run_6.7b.sh
│   │   │       └── sweep
│   │   │           ├── README.md
│   │   │           ├── run_single.sh
│   │   │           └── run_step1_sweep.sh
│   │   ├── other_language
│   │   │   ├── run_chinese.sh
│   │   │   └── run_japanese.sh
│   │   └── README.md
│   └── yitingxie
│       └── rlhf-reward-datasets
│           └── data
│               ├── test-00000-of-00001-955c146ec7a10a1e.parquet
│               └── train-00000-of-00001-2ea3039ca4da89f8.parquet

8. Add the following code to main.py so that it can locate and import the dschat module.

import sys
import os

sys.path.append(
    os.path.abspath(os.path.join(os.path.dirname(__file__), os.path.pardir)))

9. Run the script from the applications/DeepSpeed-Chat/training/step1_supervised_finetuning directory.

bash training_scripts/my_test/run_opt-125m.sh 

I ran into one error: FileNotFoundError: Directory Dahoas/rm-static is neither a dataset directory nor a dataset dict directory. After some digging, the cause turned out to be in applications/DeepSpeed-Chat/training/dschat/utils/data/raw_datasets.py.

class PromptRawDataset(object):

    def __init__(self, output_path, seed, local_rank, dataset_name):
        self.output_path = output_path
        self.seed = seed
        self.local_rank = local_rank
        if os.path.exists(dataset_name):
            # self.raw_datasets = load_from_disk(dataset_name)
            # The load_from_disk call above is commented out; use load_dataset(dataset_name) here as well
            self.raw_datasets = load_dataset(dataset_name)
        elif not dataset_name == 'local/jsonfile':
            self.raw_datasets = load_dataset(dataset_name)

References:

  • mariosasko's answer 1
  • mariosasko's answer 2
  • Arunbh Yashaswi's answer
  • [huggingface] Downloading datasets and models and saving them locally

Using OPT with Colossal-AI

To be explored: https://github.com/hpcaitech/ColossalAI#OPT

References

  • guolipa: A survey of large language models
  • 罗小黑: A roundup of open-source 1-7B small models
  • facebookresearch: About OPT & Pretrained Model Weights
  • Microsoft: DeepSpeed Chat: one-click RLHF training, making ChatGPT-like models with hundreds of billions of parameters 15x faster and cheaper
  • just_sort: DeepSpeed-Chat end-to-end ChatGPT-like pipeline, notes part 1
  • just_sort: DeepSpeed-Chat end-to-end ChatGPT-like pipeline, notes part 2: supervised instruction fine-tuning
  • AI开发者: Hands-on introduction to large-model training (topic series)
  • J.P.Liu: DeepSpeed-Chat full-pipeline training in practice

