
DeepSeek GRPO Algorithm Explained Step by Step (Math Principles + Source Code Walkthrough + Hands-on Example)

article 2025/3/17 19:42:43

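Table of contents

What is GRPO
- Group Formation: letting the model produce multiple solutions
- Preference Learning: letting the model learn what a good answer is
- - Group-relative advantage
- Optimization: letting the model learn from experience
- - Objective function
- Pseudocode for the GRPO algorithm
- Limitations and challenges of GRPO
Code implementation
- GRPO training configuration
- The GRPO trainer
- - Per-token log-probability computation
- - Reward matrix computation
- - Group advantage computation
- - Loss computation
Hands-on example: training a text generation model with GRPO using the TRL library
Recommended reading
REF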

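What is GRPO

GRPO (Group Relative Policy Optimization) is an RLHF (reinforcement learning from human feedback) technique proposed by DeepSeek. It was first introduced in DeepSeekMath, where RL applied after the SFT stage was shown to effectively improve the mathematical reasoning ability of LLMs.
Before GRPO, two popular RLHF techniques were widely used for aligning large models: PPO and DPO.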

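Group Formation: letting the model produce multiple solutions

The first step of GRPO is intuitive: it resembles a student trying several approaches to the same problem. For a given prompt, the model generates multiple attempts at the same question (typically 4, 8, or 16 different answers).
[Figure: examples of different attempts generated for the same prompt]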

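Preference Learning: letting the model learn what a good answer is

Unlike other RLHF methods, which need a separate reward model, GRPO can use any function or model to evaluate the quality of the LLM's solutions. For example, a length function can serve as the evaluator to reward shorter answers, or a math solver can be used to reward more accurate mathematical solutions.
The evaluation measures the quality of the generated solutions from several angles, including:
1) whether the final answer is correct; 2) whether the answer follows the required output format (for example, well-formed XML tags); 3) whether the reasoning is consistent with the answer given.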

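Group-relative advantage

The clever part of this approach is the scoring mechanism. Rather than feeding back absolute scores, GRPO normalizes the rewards within each group. It uses a simple but effective formula to compute the group-relative advantage:

$\text{Advantage} = \dfrac{\text{reward} - \text{mean}(\text{group\_rewards})}{\text{std}(\text{group\_rewards})}$

This normalization tells the model which answers are better or worse than the others in the same group, instead of simply reporting an absolute score.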
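To make the normalization concrete, here is a minimal sketch using only the Python standard library. The function name `group_relative_advantages` and the reward values are made up for illustration. The small `eps` term is an assumption borrowed from the TRL excerpt later in this article, where it prevents division by zero when every answer in a group gets the same reward.

```python
from statistics import mean, stdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    # Sample standard deviation; eps guards against groups with identical rewards.
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of 4 answers: 1.0 = correct, 0.0 = wrong
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)
print(advantages)
```

Correct answers end up with a positive advantage and wrong ones with a negative advantage, and the values sum to zero within the group. If all answers receive the same reward, every advantage is zero and the group contributes no learning signal.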

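Optimization: letting the model learn from experience

The last step of GRPO uses the evaluation of the group of solutions to guide the model's improvement. The process rests on two main principles:

- Encourage the model to produce more solutions like the successful ones, while moving away from approaches that performed poorly;
- Use a KL-divergence penalty as a safety mechanism, so that the model does not change too drastically in a single training update.

In practice this approach has proven more stable than traditional methods, because:

- it considers multiple solutions at once instead of being limited to pairwise comparisons;
- group-based normalization helps avoid reward scaling problems;
- the KL-divergence penalty acts as a safety net, so the model keeps what it already knows while learning new things.

In summary, the key innovations of GRPO are:

- learning directly from any function or model, which removes the dependency on a separate reward model;
- group-based learning, which is more stable and efficient than traditional approaches such as pairwise comparison.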

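Objective function

GRPO samples a group of outputs $\{o_1, o_2, \ldots, o_G\}$ from the old policy model and optimizes the policy model with the following objective (Eq. (3) in the DeepSeekMath paper):
$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(O|q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left\{ \min\left[ \frac{\pi_{\theta}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}|q,o_{i,<t})} \hat{A}_{i,t},\ \text{clip}\left( \frac{\pi_{\theta}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}|q,o_{i,<t})},\ 1-\epsilon,\ 1+\epsilon \right) \hat{A}_{i,t} \right] - \beta D_{\text{KL}}(\pi_{\theta}||\pi_{\text{ref}}) \right\} \right]$
The GRPO objective replaces the traditional advantage function with a group-relative advantage estimate, and combines it with PPO's clipping mechanism (which limits the size of policy updates) and KL regularization (which constrains how far the policy drifts from the reference model). This keeps the policy stable while maximizing reward:

- maximizing the group-relative advantage raises the probability of generating good results and lowers the probability of generating poor ones;
- the KL penalty term keeps the policy from drifting too far from the reference policy, which keeps optimization stable.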

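The symbols in the formula are:
$\begin{aligned} &\theta &&: \text{learnable parameters of the current policy model, updated by optimizing the objective} \\ &q \sim P(Q) &&: \text{an input sampled from the question distribution (e.g. a user instruction)} \\ &G &&: \text{number of samples generated per input (the group size)} \\ &o_i \sim \pi_{\theta_{\text{old}}}(O|q) &&: \text{the }i\text{-th output sampled from the old policy} \\ &\pi_{\theta_{\text{old}}} &&: \text{the old policy model (held fixed while sampling)} \\ &\pi_{\theta} &&: \text{the current policy model (updated by optimization)} \\ &O|q &&: \text{the output distribution of the policy conditioned on input }q \\ &|o_i| &&: \text{length of output sequence }o_i\text{ (number of tokens)} \\ &t &&: \text{time step of generation (position of the }t\text{-th token)} \\ &\hat{A}_{i,t} &&: \text{group-relative advantage (computed from reward differences within the group)} \\ &\epsilon &&: \text{clipping range (limits the size of policy updates)} \\ &\beta &&: \text{KL regularization coefficient (controls drift from the reference model)} \\ &D_{\text{KL}}(\pi_{\theta}||\pi_{\text{ref}}) &&: \text{KL divergence between the current policy and the reference model} \\ &\pi_{\text{ref}} &&: \text{the reference model (e.g. the initial SFT model)} \end{aligned}$
There are two hyperparameters: $\epsilon$ is the clipping range from the PPO mechanism, which limits the size of policy updates; $\beta$ is the KL regularization coefficient, which controls how far the current policy may drift from the reference model $\pi_{\text{ref}}$.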

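The KL divergence here is not computed in the usual way. It uses the unbiased estimator proposed by Schulman, which is guaranteed to be non-negative (Eq. (4) in the DeepSeekMath paper):
$D_{\text{KL}}(\pi_{\theta}||\pi_{\text{ref}}) = \frac{\pi_{\text{ref}}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta}(o_{i,t}|q,o_{i,<t})} - \log \frac{\pi_{\text{ref}}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta}(o_{i,t}|q,o_{i,<t})} - 1$
where $o_{i,<t}$ denotes all tokens of output $o_i$ generated before time step $t$.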
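The estimator can be checked numerically with a small sketch. It uses only the standard library; the function name `kl_estimate` and the probabilities are hypothetical. Writing $d = \log \pi_{\text{ref}} - \log \pi_{\theta}$, the per-token estimate is $e^{d} - d - 1$, which is zero when the two models agree and positive otherwise.

```python
import math


def kl_estimate(logp, ref_logp):
    """Per-token estimate exp(d) - d - 1, with d = ref_logp - logp."""
    d = ref_logp - logp
    return math.exp(d) - d - 1.0


# Hypothetical probabilities of one sampled token under the policy and the reference model
same = kl_estimate(math.log(0.5), math.log(0.5))     # both models agree
lower = kl_estimate(math.log(0.5), math.log(0.25))   # reference assigns less probability
higher = kl_estimate(math.log(0.25), math.log(0.5))  # reference assigns more probability
print(same, lower, higher)
```

Unlike the plain log-ratio, which can be negative for an individual sample, this estimate never drops below zero.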

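Pseudocode for the GRPO algorithm

The DeepSeekMath paper gives the steps of iterative GRPO optimization:
[Figure: Algorithm 1, Iterative Group Relative Policy Optimization, from the DeepSeekMath paper]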

Based on the procedure above, GRPO can be written in pseudocode as:

Input: 
- initial_policy: Starting model to be trained
- reward_function: Function that evaluates outputs
- training_prompts: Set of training examples
- group_size: Number of outputs per prompt (typically 4-16)

Algorithm GRPO:
1. Set current_policy = initial_policy
2. For each training iteration:
   a. Set reference_policy = current_policy (snapshot current policy)
   b. For each prompt in batch:
      i. Set old_policy = current_policy and generate group_size different outputs using old_policy
      ii. Compute rewards for each output using reward_function
      iii. Normalize rewards within group:
           normalized_advantage = (reward - mean(rewards)) / std(rewards)
      iv. Update current_policy by maximizing the clipped objective:
          min(prob_ratio * normalized_advantage, 
              clip(prob_ratio, 1-epsilon, 1+epsilon) * normalized_advantage)
          - kl_weight * KL(current_policy || reference_policy)
          
          where prob_ratio is current_prob / old_prob

Output: Optimized policy model
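Step iv of the pseudocode can be tried out on a single token. The following is a minimal pure-Python sketch, not the TRL implementation. The function name `grpo_token_objective` and the probabilities are hypothetical. The defaults `epsilon=0.2` and `beta=0.04` are assumed here because they match the `GRPOConfig` defaults quoted later in this article.

```python
import math


def grpo_token_objective(logp, old_logp, ref_logp, advantage, epsilon=0.2, beta=0.04):
    """Clipped surrogate minus the KL penalty, for a single token."""
    ratio = math.exp(logp - old_logp)  # current_prob / old_prob
    clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    surrogate = min(ratio * advantage, clipped * advantage)
    d = ref_logp - logp
    kl = math.exp(d) - d - 1.0  # non-negative KL estimate
    return surrogate - beta * kl


# The ratio is 0.6 / 0.4 = 1.5 in both cases; the reference equals the current policy, so KL = 0.
good = grpo_token_objective(math.log(0.6), math.log(0.4), math.log(0.6), advantage=1.0)
bad = grpo_token_objective(math.log(0.6), math.log(0.4), math.log(0.6), advantage=-1.0)
print(good, bad)
```

With a positive advantage the gain is capped at `(1 + epsilon) * advantage`, so the policy gets no extra credit for moving further from the old policy. With a negative advantage the `min` keeps the unclipped, more pessimistic value.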

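Limitations and challenges of GRPO

GRPO also faces several challenges in practice:

- Generation cost. Generating several completions (4-16) per prompt significantly increases compute requirements compared with methods that generate only one or two.
- Batch size constraints. A group of completions has to be processed together, which can limit the effective batch size, complicate training, and possibly slow it down.
- Reward function design. Training quality depends heavily on a carefully designed reward function. A poorly designed reward can lead to unintended behavior or to optimizing the wrong objective.
- Group size trade-off. Choosing the group size means balancing solution diversity against compute cost. Too few samples may not provide enough diversity, while too many increase training time and resource requirements.
- KL divergence tuning. Finding the right strength for the KL penalty takes care: too high and the model cannot learn effectively, too low and it may drift too far from its initial capabilities.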

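Code implementation

GRPO training configuration

The TRL library wraps GRPO in a GRPOConfig configuration class and a GRPOTrainer trainer class.
What follows is a simplified walkthrough based on the TRL source code.
The parameters needed for GRPO training are packed into the GRPOConfig data class. The @dataclass decorator defines GRPOConfig as a data class, and the field function specifies default values, default factories, metadata and so on for its fields.
The original code is long, so only the parameters that control GRPO training are shown here: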

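from dataclasses import dataclass, field
from typing import Optional

from transformers import TrainingArguments


@dataclass
class GRPOConfig(TrainingArguments):
    # Parameters that control the training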
    learning_rate: float = field(
        default=1e-6,
        metadata={
            "help": "Initial learning rate for `AdamW` optimizer. The default value replaces that of "
            "`transformers.TrainingArguments`."
        },
    )
    beta: float = field(
        default=0.04,
        metadata={
            "help": "KL coefficient. If `0.0`, the reference model is not loaded, reducing memory usage and improving "
            "training speed, but may be numerically unstable for long training runs."
        },
    )  # hyperparameter beta of the KL term: controls the size of the KL penalty
    num_iterations: int = field(
        default=1,
        metadata={"help": "Number of iterations per batch (denoted as μ in the algorithm)."},
    )
    epsilon: float = field(
        default=0.2,
        metadata={"help": "Epsilon value for clipping."},
    )  # hyperparameter of the clip term: controls the clipping range

The parameter that controls the weight of each reward function:

    reward_weights: Optional[list[float]] = field(
        default=None,
        metadata={
            "help": "Weights for each reward function. Must match the number of reward functions. If `None`, all "
            "rewards are weighted equally with weight `1.0`."
        },
    )  # weight of each reward function

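Three parameters from the TR-DPO paper control how the reference model is updated dynamically: whether synchronization between the reference model and the current policy is enabled, the mixing ratio between the current policy and the previous reference model, and how often the synchronization happens:

    # Three parameters from the TR-DPO paper that control the synchronization between the reference model and the current policy
    # Whether to enable synchronization between the reference model and the current policy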
    sync_ref_model: bool = field(
        default=False,
        metadata={
            "help": "Whether to synchronize the reference model with the active model every `ref_model_sync_steps` "
            "steps, using the `ref_model_mixup_alpha` parameter."
        },
    ) 
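    # Mixing ratio between the current policy and the previous reference model when the reference model is updated.
    # The weight alpha gives a soft update; with alpha=1 it becomes a hard update, i.e. the reference model is replaced by the current policy.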
    ref_model_mixup_alpha: float = field(
        default=0.6,
        metadata={
            "help": "α parameter from the TR-DPO paper, which controls the mix between the current policy and the "
            "previous reference policy during updates. The reference policy is updated according to the equation: "
            "`π_ref = α * π_θ + (1 - α) * π_ref_prev`. To use this parameter, you must set `sync_ref_model=True`."
        },
    )
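    # How often the reference model is synchronized with the current policy.
    # Every ref_model_sync_steps steps, the reference model is updated from the current policy following the ref_model_mixup_alpha rule.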
    ref_model_sync_steps: int = field(
        default=512,
        metadata={
            "help": "τ parameter from the TR-DPO paper, which determines how frequently the current policy is "
            "synchronized with the reference policy. To use this parameter, you must set `sync_ref_model=True`."
        },
    )
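The update rule quoted in the help text, `π_ref = α * π_θ + (1 - α) * π_ref_prev`, can be illustrated on a toy parameter list. This is a sketch only. It assumes the mix is applied element-wise to model weights, and the function name `sync_reference` and the numbers are hypothetical.

```python
def sync_reference(policy_params, ref_params, alpha=0.6):
    """Return updated reference parameters: alpha * policy + (1 - alpha) * previous reference."""
    return [alpha * p + (1.0 - alpha) * r for p, r in zip(policy_params, ref_params)]


policy = [1.0, 2.0]  # hypothetical current-policy weights
ref = [0.0, 0.0]     # hypothetical previous reference weights

soft = sync_reference(policy, ref, alpha=0.6)  # soft update: move 60% of the way to the policy
hard = sync_reference(policy, ref, alpha=1.0)  # hard update: copy the policy
print(soft, hard)
```

A smaller `alpha` keeps the reference model closer to its previous state, so the KL penalty keeps pulling the policy toward older behavior for longer.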

The GRPO trainer

The GRPO training procedure is wrapped in a subclass of Trainer.

class GRPOTrainer(Trainer):
    def __init__(
        self,
        model,
        reward_funcs,
        args,
        train_dataset,
        eval_dataset,
        processing_class,
        reward_processing_classes,
        callbacks,
        optimizers,
        peft_config,
    ):
        # Training arguments
        self.max_prompt_length = args.max_prompt_length
        self.max_completion_length = args.max_completion_length  # = |o_i| in the GRPO paper
        self.num_generations = args.num_generations  # = G in the GRPO paper

        # Multi-step
        self.num_iterations = args.num_iterations  # = 𝜇 in the GRPO paper
        self.epsilon = args.epsilon  # epsilon: clipping range for the probability ratio
        # Tracks the number of iterations (forward + backward passes), including those within a gradient accumulation cycle.
        self._step = 0

        self._buffered_inputs = [None] * args.gradient_accumulation_steps
        # Initialize the metrics
        self._metrics = {"train": defaultdict(list), "eval": defaultdict(list)}
        self.log_completions = args.log_completions

        # No data collation is needed in GRPO: samples are passed through unchanged
        def data_collator(features):
            return features

        super().__init__(
            model=model,
            args=args,
            data_collator=data_collator,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            processing_class=processing_class,
            callbacks=callbacks,
            optimizers=optimizers,
        )

        self.generation_config = GenerationConfig(
            max_new_tokens=self.max_completion_length,
            do_sample=True,
            pad_token_id=processing_class.pad_token_id,
            temperature=args.temperature,
            top_p=args.top_p,
            top_k=args.top_k,
            min_p=args.min_p,
            repetition_penalty=args.repetition_penalty,
        )

Per-token log-probability computation

Compute the log-probability of every token the model generated; these values are what control the policy update during training:

    # Get the per-token log probabilities for the completions for the model and the reference model
    @profiling_decorator  # performance profiling
    def _get_per_token_logps(self, model, input_ids, attention_mask, logits_to_keep):
        # We add 1 to `logits_to_keep` because the last logits of the sequence is later excluded
        logits = model(input_ids=input_ids, attention_mask=attention_mask, logits_to_keep=logits_to_keep + 1).logits
        logits = logits[:, :-1, :]  # (B, L-1, V), drop the last logit: it predicts the token after the sequence

        input_ids = input_ids[:, -logits_to_keep:]
        logits = logits[:, -logits_to_keep:]
        return selective_log_softmax(logits, input_ids)  # log-probability of each input token
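To show what picking the log-probability of each token means, here is a pure-Python stand-in. It mirrors the idea of a log-softmax followed by a lookup at the token id. It is not TRL's `selective_log_softmax`, and the name `selective_log_softmax_py`, the logits and the token ids are made up for illustration.

```python
import math


def selective_log_softmax_py(logits, token_ids):
    """For each position, return log softmax(logits)[token_id]."""
    logps = []
    for row, tok in zip(logits, token_ids):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        logps.append(row[tok] - log_z)
    return logps


# Two positions, vocabulary of three tokens
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
token_ids = [0, 2]  # the tokens that were actually generated
per_token_logps = selective_log_softmax_py(logits, token_ids)
print(per_token_logps)
```

The second position has uniform logits, so its token gets log(1/3). The first position favors token 0, so that token's log-probability is closer to zero.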

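Reward matrix computation

After the reward matrix is initialized, the code iterates over all predefined reward functions (each can be a PyTorch model or a plain Python function), computes the rewards, and writes them into the matrix:

        rewards_per_func = torch.zeros(len(prompts), len(self.reward_funcs), device=device)  # initialize the reward matrix
        for i, (reward_func, reward_processing_class) in enumerate(
            zip(self.reward_funcs, self.reward_processing_classes)  # iterate over all reward functions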
        ):
            if isinstance(reward_func, nn.Module):  # Module instead of PretrainedModel for compat with compiled models
                reward_func_name = f"reward {reward_func.config._name_or_path.split('/')[-1]}"
            else:
                reward_func_name = reward_func.__name__
            with profiling_context(self, reward_func_name):
                if isinstance(  # reward computed by a PyTorch model
                    reward_func, nn.Module 
                ):  # Module instead of PretrainedModel for compat with compiled models
                    if is_conversational(inputs[0]):
                        messages = [{"messages": p + c} for p, c in zip(prompts, completions)]
                        texts = [apply_chat_template(x, reward_processing_class)["text"] for x in messages]
                    else:
                        texts = [p + c for p, c in zip(prompts, completions)]
                    reward_inputs = reward_processing_class(
                        texts, return_tensors="pt", padding=True, padding_side="right", add_special_tokens=False
                    )
                    reward_inputs = super()._prepare_inputs(reward_inputs)
                    with torch.inference_mode():
                        rewards_per_func[:, i] = reward_func(**reward_inputs).logits[:, 0]  # Shape (B*G,)
                else:  # reward computed by a Python function
                    # Repeat all input columns (but "prompt" and "completion") to match the number of generations
                    keys = [key for key in inputs[0] if key not in ["prompt", "completion"]]
                    reward_kwargs = {key: [example[key] for example in inputs] for key in keys}
                    output_reward_func = reward_func(prompts=prompts, completions=completions, **reward_kwargs)
                    rewards_per_func[:, i] = torch.tensor(output_reward_func, dtype=torch.float32, device=device)
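The same bookkeeping can be sketched without PyTorch for the case where every reward function is a plain Python function. The two reward functions, the weights and the sample texts below are hypothetical and only illustrate how the matrix is filled and then collapsed into one reward per completion.

```python
def reward_ends_with_period(completions, **kwargs):
    """Hypothetical reward: 1.0 if the completion ends with a period."""
    return [1.0 if c.endswith(".") else 0.0 for c in completions]


def reward_word_budget(completions, **kwargs):
    """Hypothetical reward: lose one point for every word beyond five."""
    return [float(-max(0, len(c.split()) - 5)) for c in completions]


reward_funcs = [reward_ends_with_period, reward_word_budget]
reward_weights = [1.0, 0.5]  # assumed weights, one per reward function

prompts = ["Summarize:", "Summarize:"]
completions = ["The cat sat.", "one two three four five six seven"]

# rewards_per_func[j][i] = reward of completion j under reward function i
rewards_per_func = [[0.0] * len(reward_funcs) for _ in completions]
for i, func in enumerate(reward_funcs):
    for j, value in enumerate(func(prompts=prompts, completions=completions)):
        rewards_per_func[j][i] = float(value)

# The weighted sum over reward functions gives one scalar reward per completion
rewards = [sum(w * r for w, r in zip(reward_weights, row)) for row in rewards_per_func]
print(rewards_per_func, rewards)
```

Each column of the matrix belongs to one reward function, which makes it easy to log the functions separately and to reweight them without touching the functions themselves.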


Group advantage computation

The group advantages are computed using the number of answers per group (num_generations):

        rewards_per_func = gather(rewards_per_func)

        # Apply weights to each reward function's output and sum
        rewards = (rewards_per_func * self.reward_weights.to(device).unsqueeze(0)).sum(dim=1)

        # Compute grouped-wise rewards
        mean_grouped_rewards = rewards.view(-1, self.num_generations).mean(dim=1)
        std_grouped_rewards = rewards.view(-1, self.num_generations).std(dim=1)

        # Normalize the rewards to compute the advantages
        mean_grouped_rewards = mean_grouped_rewards.repeat_interleave(self.num_generations, dim=0)
        std_grouped_rewards = std_grouped_rewards.repeat_interleave(self.num_generations, dim=0)
        advantages = (rewards - mean_grouped_rewards) / (std_grouped_rewards + 1e-4)

Loss computation

Following Eq. (4) of the DeepSeekMath paper, compute the unbiased estimator of $D_{KL}$:

        # KL divergence between the reference model and the current model
        if self.beta != 0.0:  # only when the KL coefficient beta is non-zero
            ref_per_token_logps = inputs["ref_per_token_logps"]
            per_token_kl = (
                torch.exp(ref_per_token_logps - per_token_logps) - (ref_per_token_logps - per_token_logps) - 1
            )  # unbiased KL estimator, Eq. (4) of the DeepSeekMath paper

Following Eq. (3) of the DeepSeekMath paper, compute the loss:

        # Compute the loss
        advantages = inputs["advantages"]
       
        old_per_token_logps = inputs["old_per_token_logps"] if self.num_iterations > 1 else per_token_logps.detach()
        coef_1 = torch.exp(per_token_logps - old_per_token_logps)  # ratio of new to old token probabilities, computed as exp of the log-prob difference
        coef_2 = torch.clamp(coef_1, 1 - self.epsilon, 1 + self.epsilon)  # clipped ratio
        per_token_loss1 = coef_1 * advantages.unsqueeze(1)  # surrogate with the unclipped ratio
        per_token_loss2 = coef_2 * advantages.unsqueeze(1)  # surrogate with the clipped ratio
        per_token_loss = -torch.min(per_token_loss1, per_token_loss2)  # negate: minimizing the loss maximizes the objective
        if self.beta != 0.0:
            per_token_loss = per_token_loss + self.beta * per_token_kl  # GRPO objective, Eq. (3) of the DeepSeekMath paper
        loss = (per_token_loss * completion_mask).sum() / completion_mask.sum()
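The last line averages the per-token loss over real completion tokens only. A small sketch with made-up numbers shows the effect of the mask. The helper name `masked_mean` is hypothetical, and nested lists stand in for tensors.

```python
def masked_mean(values, mask):
    """Average `values` over positions where mask == 1 (rows = completions, columns = tokens)."""
    total = sum(v * m for row_v, row_m in zip(values, mask) for v, m in zip(row_v, row_m))
    count = sum(m for row in mask for m in row)
    return total / count


# 9.0 marks padding positions that must not influence the loss
per_token_loss = [[0.5, 1.5, 9.0], [1.0, 9.0, 9.0]]
completion_mask = [[1, 1, 0], [1, 0, 0]]

loss = masked_mean(per_token_loss, completion_mask)
print(loss)
```

Only the three unmasked values (0.5, 1.5 and 1.0) enter the average, so padding added to align completions of different lengths has no effect on the result.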

Log the mean KL divergence:

        # Log the mean KL divergence
        if self.beta != 0.0:
            mean_kl = (per_token_kl * completion_mask).sum() / completion_mask.sum()
            self._metrics[mode]["kl"].append(self.accelerator.gather_for_metrics(mean_kl).mean().item())

Compute the clip ratio:

        # Compute the clip ratio
        is_clipped = (per_token_loss1 < per_token_loss2).float()
        clip_ratio = (is_clipped * completion_mask).sum() / completion_mask.sum() 
        self._metrics[mode]["clip_ratio"].append(self.accelerator.gather_for_metrics(clip_ratio).mean().item())

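Hands-on example: training a text generation model with GRPO using the TRL library

This section gives a demo that uses GRPO to train the SmolLM-135M text generation model on the smoltldr dataset, with the trl and peft libraries providing GRPO and LoRA fine-tuning.
About the dataset
mlabonne/smoltldr is a dataset containing a list of short stories, created and maintained on Hugging Face by the user mlabonne. It is mainly used for NLP tasks such as text generation and story writing. Each sample is typically a short story, and the content may cover a range of topics and styles. The stories have been cleaned and formatted, which makes them suitable for training language models. SmolLM-135M is a small model for text generation.

Loading the data and the model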

import torch
import wandb
from datasets import load_dataset
from peft import LoraConfig,get_peft_model
from transformers import AutoModelForCausalLM,AutoTokenizer
from trl import GRPOConfig,GRPOTrainer


wandb.login()  # log in to wandb with your API key so that training results are saved there

# Load the dataset from the Hugging Face Hub
dataset = load_dataset("mlabonne/smoltldr")

# Load the model from the Hugging Face Hub
model_id = "HuggingFaceTB/SmolLM-135M-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    # attn_implementation="flash_attention_2",
    attn_implementation="eager",  # fall back to standard attention when the GPU does not support FlashAttention
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

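LoRA fine-tuning configuration


# LoRA configuration from the PEFT library
lora_config = LoraConfig(
    task_type="CAUSAL_LM",  # causal language modeling (generation)
    r=16,  # rank of the low-rank matrices
    lora_alpha=32,  # scaling factor for the contribution of the low-rank matrices
    target_modules="all-linear",  # apply LoRA to all linear layers of the model
)
model = get_peft_model(model, lora_config)  # freeze the pretrained weights and add low-rank matrices according to lora_config
model.print_trainable_parameters()  # prints the number of trainable parameters and their share of all parameters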

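Defining the reward function


# Reward function: the closer a completion is to 50 characters, the higher (closer to 0) its reward
def reward_len(completions, **kwargs):
    return [-abs(50 - len(completion)) for completion in completions]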
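A quick check of the reward function on a few hand-made strings (the sample completions are invented for illustration) shows how the score behaves:

```python
def reward_len(completions, **kwargs):
    return [-abs(50 - len(completion)) for completion in completions]


samples = ["a" * 50, "too short", "b" * 80]
scores = reward_len(samples)
print(scores)  # [0, -41, -30]
```

Note that `len` counts characters, not tokens, so the target is a 50-character completion. The best possible reward is 0, and both shorter and longer outputs are penalized linearly.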

GRPO training arguments


# GRPO training configuration
training_args = GRPOConfig(
    output_dir="GRPO",
    run_name="GRPO_experi_0308_01",  # experiment name shown in wandb
    learning_rate=2e-5,
    per_device_train_batch_size=4,  # a small batch size reduces GPU memory usage
    gradient_accumulation_steps=2,  # gradient accumulation steps
    max_prompt_length=512,
    max_completion_length=96,
    num_generations=4,  # number of completions generated per group
    optim="adamw_8bit",
    num_train_epochs=1,  # number of passes over the training set
    bf16=True,
    report_to=["wandb"],
    remove_unused_columns=False,  # keep dataset columns that are not used directly
    logging_steps=1,  # log at every step
)
# Set up the trainer
trainer = GRPOTrainer(
    model=model,
    reward_funcs=[reward_len],  # custom reward function
    args=training_args,  # GRPO training configuration
    train_dataset=dataset["train"],
)

Training the model

# Train the model
wandb.init(project="GRPO")  # initialize the wandb logging run
trainer.train()  # start training

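Uploading the trained model weights


# Save the model weights and upload them to the Hugging Face Hub
merged_model = trainer.model.merge_and_unload()  # merge the LoRA weights into the pretrained weights
merged_model.push_to_hub("<your_username/your_modelname>", private=False)  # the model is publicly visible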

Downloading the trained model weights and generating text


prompt = """
# A long document about the Cat

The cat (Felis catus), also referred to as the domestic cat or house cat, is a small
domesticated carnivorous mammal. It is the only domesticated species of the family Felidae.
Advances in archaeology and genetics have shown that the domestication of the cat occurred
in the Near East around 7500 BC. It is commonly kept as a pet and farm cat, but also ranges
freely as a feral cat avoiding human contact. It is valued by humans for companionship and
its ability to kill vermin. Its retractable claws are adapted to killing small prey species
such as mice and rats. It has a strong, flexible body, quick reflexes, and sharp teeth,
and its night vision and sense of smell are well developed. It is a social species,
but a solitary hunter and a crepuscular predator. Cat communication includes
vocalizations—including meowing, purring, trilling, hissing, growling, and grunting—as
well as body language. It can hear sounds too faint or too high in frequency for human ears,
such as those made by small mammals. It secretes and perceives pheromones.
"""
messages = [
    {"role":"user","content":prompt},
]

from transformers import pipeline
generator = pipeline("text-generation", model="<your_username/your_modelname>")
generate_kwargs = {
    "max_new_tokens": 256,
    "do_sample": True,  # enable sampling
    "temperature": 0.5,
    "min_p": 0.1,
}
generated_text = generator(messages, **generate_kwargs)  # generation options are passed as keyword arguments
print(generated_text)

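Analysis of the training results
As the model learns, the reward gets closer and closer to 0, which indicates that the model is learning to generate text of the target length.
[Figure: reward curve during training]
In GRPO the loss starts at zero and then increases during training. The loss in GRPO is proportional to the KL divergence (the cap relative to the original policy). As training progresses, the model learns to generate text that better matches the reward function, so it deviates more and more from the initial policy. That growing deviation shows up as a rising loss, which actually indicates that the model is successfully adapting to optimize the reward function.
[Figure: loss curve during training]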

REF

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models： https://arxiv.org/pdf/2402.03300
https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_trainer.py#L98
https://github.com/huggingface/open-r1/blob/main/src/open_r1/grpo.py
https://open-r1.com/#:~:text=Open%20R1%20is%20an%20open-source%20reproduction%20of%20DeepSeek-R1%2C,MIT%20license%2C%20though%20original%20training%20data%20remains%20proprietary.
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning https://arxiv.org/abs/2501.12948
https://huggingface.co/learn/nlp-course/en/chapter12/1?fw=pt
