Reading Notes on the Qwen Paper

article 2025/1/21 5:59:01

I. Pretraining

Pretraining uses a large amount of data so that the model gains a broader understanding of the world and its complexities.

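1. Data

The scale of the data is crucial to the robustness of the model.

To build a high-quality dataset:

Increase the diversity of the data
Filter out low-quality data
Upsample certain data sources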
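
The three steps above can be sketched as a tiny curation pass. This is only an illustration: the word-count threshold stands in for a real quality filter, exact-match deduplication is an assumed extra step, and the function name `curate`, its parameters, and the upsampling factors are all hypothetical rather than taken from the Qwen report.

```python
import hashlib
import random


def curate(docs, min_words=5, upsample=None, seed=0):
    """Filter, deduplicate and upsample a list of (source, text) pairs.

    min_words and the upsample factors are illustrative assumptions,
    not values from the Qwen report.
    """
    upsample = upsample or {}
    seen = set()
    kept = []
    for source, text in docs:
        # Filter: drop very short documents (stand-in for a quality filter).
        if len(text.split()) < min_words:
            continue
        # Deduplicate (assumed step): exact match on normalized text.
        normalized = " ".join(text.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        # Upsample: repeat documents that come from selected sources.
        kept.extend([text] * upsample.get(source, 1))
    # Shuffle so repeated documents are not adjacent.
    random.Random(seed).shuffle(kept)
    return kept
```

For example, `curate(docs, upsample={"code": 2})` would keep each surviving document from the hypothetical `"code"` source twice.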

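Pretraining a language model on multi-task instructions can improve its zero-shot and few-shot performance. (The likely reason is that this kind of pretraining exposes the model to broader knowledge and richer representations, so it transfers and generalizes better to new tasks.) To improve model performance, high-quality instruction data is therefore included in the pretraining stage.

Finally, the authors report building a dataset of up to 3 trillion tokens.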

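2. Tokenization

Qwen uses byte pair encoding (BPE) for tokenization. It starts from the open-source fast BPE tokenizer tiktoken (Jain, 2022) and takes the cl100k_base vocabulary as its starting point. Numbers are split into single digits.

Qwen's final vocabulary size is about 152K.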
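
To make "numbers are split into single digits" concrete, here is a small pre-tokenization sketch using only Python's `re` module. It is not the Qwen tokenizer: it only shows the digit-splitting idea before any BPE merges would run, and the function name `split_digits` is hypothetical.

```python
import re


def split_digits(text):
    """Split text so that every ASCII digit becomes its own piece.

    Runs of non-digit characters are kept together; a real BPE
    tokenizer would then split those runs further.
    """
    # [0-9] matches exactly one digit; [^0-9]+ matches a run of non-digits.
    return re.findall(r"[0-9]|[^0-9]+", text)


print(split_digits("year 2023!"))
```

With this rule a number such as 2023 can never be merged into one token, so every number is represented as a sequence of digit tokens.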

3. Architecture

Qwen is built on a modified version of the Transformer architecture and follows the LLaMA approach to training large language models.

Embedding and output projection: untied embeddings, i.e. the input embedding and the output projection do not share weights
Positional embedding: RoPE (Rotary Positional Embedding)
Bias: biases are added in the QKV layer of attention
Pre-Norm & RMSNorm: pre-normalization is the most widely used approach and has been shown to improve training stability compared to post-normalization, although recent research has suggested alternative methods for better training stability. In addition, the traditional layer normalization described in (Ba et al., 2016) is replaced with RMSNorm
Activation function: SwiGLU
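
Two of the components listed above, RMSNorm and the SwiGLU feed-forward layer, are short enough to sketch in plain Python. This follows their usual textbook definitions on a single vector; it is not Qwen's implementation, and the helper names (`rms_norm`, `swiglu_ffn`, `matvec`) and the `eps` default are assumptions made for illustration.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide by the root mean square, then apply a learned gain.

    Unlike LayerNorm, no mean is subtracted and no bias is added.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def silu(v):
    """SiLU / Swish activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)
```

In a pre-norm block, `rms_norm` would be applied to the input of each sub-layer (attention or feed-forward) before the sub-layer itself, with the residual added afterwards.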

4. Training

Objective: the standard approach of autoregressive language modeling
Batch construction: documents are shuffled and merged, then truncated to the specified context length
Attention: Flash Attention is used
Optimizer: the standard AdamW optimizer

All models are trained with BFloat16 mixed precision for training stability.
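
The batch-construction step (shuffle, merge, truncate) can be sketched as follows. The separator token between documents and the decision to drop the incomplete tail are assumptions made for this example; the notes do not say how document boundaries or leftovers are handled, and the name `pack_documents` is hypothetical.

```python
import random


def pack_documents(token_docs, context_length, eos_id, seed=0):
    """Shuffle tokenized documents, merge them, and cut fixed-length sequences.

    token_docs: list of token-id lists.
    eos_id: separator appended after each document (an assumption here).
    """
    docs = list(token_docs)
    random.Random(seed).shuffle(docs)

    # Merge all documents into one long token stream.
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)

    # Cut the stream into sequences of exactly context_length tokens.
    # The incomplete tail is dropped (also an assumption).
    n_full = len(stream) // context_length
    return [
        stream[i * context_length:(i + 1) * context_length]
        for i in range(n_full)
    ]
```

Packing this way means no padding tokens are needed, at the cost of some sequences containing the end of one document and the start of another.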

5. Context Length Extension

Transformer limitation: the attention mechanism imposes a significant limit on context length.

Training-free techniques: the authors implement simple techniques that are applied only at inference time to extend the model's context length. The key technique is NTK-aware interpolation.

Further improvement: dynamic NTK-aware interpolation, which changes the scale dynamically by chunks and so avoids severe performance degradation.

According to the authors, these techniques extend the context length of Transformer models without compromising computational efficiency or accuracy.
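
A minimal sketch of the NTK-aware idea as it is commonly formulated: instead of shrinking the positions, the RoPE base is enlarged by `scale ** (dim / (dim - 2))`, which leaves the highest frequency untouched and divides the lowest frequency by the scale. The chunked rule in `dynamic_scale` (round the length ratio up to a power of two) is an assumption chosen to illustrate "changing the scale by chunks"; the notes do not give the exact formula, and all function names here are hypothetical.

```python
import math


def rope_inv_freq(dim, base=10000.0):
    """Inverse frequencies of RoPE for a head dimension `dim` (even)."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def ntk_scaled_base(base, dim, scale):
    """NTK-aware interpolation: enlarge the RoPE base instead of
    compressing positions."""
    return base * scale ** (dim / (dim - 2))


def dynamic_scale(seq_len, train_len):
    """Pick the scale from the current sequence length.

    Assumed rule: the smallest power of two covering seq_len / train_len,
    so the scale only changes when the length crosses a chunk boundary.
    """
    if seq_len <= train_len:
        return 1.0
    return float(2 ** math.ceil(math.log2(seq_len / train_len)))


# Usage: frequencies for an 8192-token input on a model trained at 2048.
scale = dynamic_scale(8192, 2048)
inv_freq = rope_inv_freq(128, ntk_scaled_base(10000.0, 128, scale))
```

Because the scale is 1 whenever the input fits within the training length, short inputs are processed exactly as during training.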

Qwen also incorporates two attention mechanisms: LogN-Scaling and window attention.
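
The notes give no formulas for these two mechanisms, so the following is a sketch of how they are commonly described rather than Qwen's code. LogN-Scaling is taken here to multiply the query by log(n) / log(train_len) once the position n exceeds the training length, and window attention is taken to be a causal mask limited to the most recent `window` tokens. The function names and both formulations are assumptions.

```python
import math


def logn_scale(position, train_len):
    """Scaling factor for the query at a 1-based position.

    Equal to 1 inside the training length, and log base train_len of
    the position beyond it, so the factor grows slowly with length.
    """
    if position <= train_len:
        return 1.0
    return math.log(position) / math.log(train_len)


def window_causal_mask(seq_len, window):
    """mask[i][j] is True when token i may attend to token j.

    Token i sees itself and at most the window - 1 tokens before it.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]
```

For instance, with `train_len=2048` the factor stays at 1.0 up to position 2048 and reaches 2.0 only at position 2048 squared.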

The authors also observe that the model's long-context modeling ability varies across layers.

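II. Alignment

A pretrained model that has not been aligned with human behavior does not interact well with people. Recent research shows that supervised finetuning (SFT) and reinforcement learning from human feedback (RLHF) can significantly improve the natural-conversation ability of LLMs. The following sections look at how these two techniques are applied to Qwen.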