
HuggingFace Applications: Natural Language Processing (1): What Is NLP? What Is a Transformer?

article 2024/10/27 21:32:50

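This article is part of the series "HuggingFace Applications: Natural Language Processing".
It gives a basic introduction to natural language processing tasks.
Reference: HuggingFace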
Contents
- 1 What is NLP?
- 2 Major algorithms and models in chronological order
- 3 Transformer
  - 3.1 What can Transformer models do?
    - Sentiment analysis
    - Zero-shot classification
    - Text generation
    - Mask filling
    - Named entity recognition
    - Question answering
    - Summarization
    - Translation
  - 3.2 How do Transformers work? A high-level understanding
    - Transformers are language models
    - Transfer learning
    - Structure of the Transformer
    - Attention layer
    - The original structure of the Transformer model
    - What is an architecture? What are checkpoints?

1 What is NLP?

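NLP (natural language processing) is an interdisciplinary field, and an area of application, at the intersection of linguistics and machine learning (ML). Its core concern is giving machines the ability to understand human language and building applications on top of that ability.
Common NLP tasks and concrete applications:
- Sentence classification: understanding the meaning (for example, the sentiment) of a sentence, spam detection, checking whether a sentence is grammatically correct, checking whether it is logically consistent, and so on
- Word classification: identifying the role of each word in a sentence (subject, adjective, verb, etc.)
- Text generation: sentence completion (as ChatGPT does), filling in blanks
- Answer extraction: given a context and a question, extract the answer from the context (ChatGPT solves this through text generation, which is a different approach)
- Generating a sentence from an input sentence: translation, text summarization, etc.
Note: with the rapid progress of the field, NLP tasks are no longer limited to processing text. They now extend to multimodal tasks, for example extracting (generating) text from audio or video input.
Why NLP is hard: computers and humans process information differently.
- Understanding meaning: "I'm happy, thank you" can mean very different things in different situations
- Semantic similarity: how are "I'm happy" and "I feel great" related?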
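To make the semantic-similarity difficulty concrete, here is a minimal sketch. The helpers `tokens` and `jaccard` and the naive whitespace tokenization are illustrative assumptions, not part of the referenced course. A purely surface-level word-overlap measure rates two sentences with nearly the same meaning as unrelated, while rating two sentences with opposite meanings as similar.

```python
def tokens(sentence: str) -> set[str]:
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return set(sentence.lower().split())


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity: |A & B| / |A | B|."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


# Nearly the same meaning, but no words in common.
similar_meaning = jaccard("I'm happy", "I feel great")
# Opposite meaning, but most words in common.
similar_surface = jaccard("I'm happy", "I'm not happy")

print(similar_meaning, similar_surface)
```

This is one reason NLP models rely on learned representations rather than string matching.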

2 Major algorithms and models in chronological order

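This section briefly walks through the major NLP algorithms and models along a timeline. Interested readers can look into each one further.
Word2Vec (2013)
- Goal: introduced Word2Vec, one of the most widely used word embedding methods, which represents words as vectors
- Paper: “Efficient Estimation of Word Representations in Vector Space”
GloVe (2014)
- Goal: learn word embeddings by capturing global statistical information from a corpus
- Paper: “GloVe: Global Vectors for Word Representation”
Seq2Seq with Attention (2014)
- Goal: for translation, proposed a sequence-to-sequence modeling approach with an attention mechanism
- Paper: “Neural Machine Translation by Jointly Learning to Align and Translate”
Transformer (2017) ⭐️
- Goal: proposed a new sequence transduction model based on attention mechanisms (a game changer)
- Paper: “Attention is All You Need”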
BERT (Bidirectional Encoder Representations from Transformers) (2018) ⭐️
- Goal: proposed a language representation method that captures context in both directions
- Paper: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”
GPT (Generative Pre-trained Transformer) (2018) ⭐️
- Goal: proposed a generative language model, based on text generation, that applies to many kinds of NLP tasks
- Paper: “Improving Language Understanding by Generative Pre-Training”
GPT-2 (2019)
- Goal: an improved version of GPT that can generate coherent, fluent text
- Paper: “Language Models are Unsupervised Multitask Learners”
XLNet (2019)
- Goal: overcome BERT's limitations by combining Transformer-XL with autoregressive pre-training, improving language understanding
- Paper: “XLNet: Generalized Autoregressive Pretraining for Language Understanding”
RoBERTa (2019)
- Goal: a robustly optimized version of BERT for better language modeling
- Paper: “RoBERTa: A Robustly Optimized BERT Pretraining Approach”
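T5 (Text-to-Text Transfer Transformer) (2019) ⭐️
- Goal: treat all NLP tasks as a unified text-to-text problem
- Paper: “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”
BART (2019)
- Goal: proposed a sequence-to-sequence model for text generation and summarization
- Paper: “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension”
GPT-3 (2020) ⭐️
- Goal: proposed generating coherent, contextually accurate text through few-shot learning
- Paper: “Language Models are Few-Shot Learners”
- Finding: once scaled up, the GPT model acquired the ability to learn from examples given in the prompt (a game changer)
As you can see, before GPT-3 came out, BERT-style work was still fairly mainstream. Everyone has seen how things stand now.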

3 Transformer

3.1 What can Transformer models do?

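This section introduces what Transformers can do, using the pipeline tool from the Transformers library. There is no need to worry here about how the tool is implemented.
- A pipeline defines an end-to-end path from input to output
- Choose the pipeline by name (the names are predefined by HF)
- Different pipelines support different parameters, for example model (used to select the model)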
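Conceptually, an end-to-end pipeline chains three stages: preprocessing text into model inputs, running the model, and post-processing the raw outputs into readable labels. The toy below only mimics that flow. A hand-written lexicon stands in for a trained model, and every name in it (`LEXICON`, `preprocess`, `toy_model`, `postprocess`, `toy_pipeline`) is hypothetical. It is a sketch of the idea, not how the Transformers library implements `pipeline`.

```python
import math

# Hypothetical stand-in for a trained model: a tiny sentiment lexicon.
LEXICON = {"love": 2.0, "great": 1.5, "hate": -2.0, "awful": -1.5}


def preprocess(text: str) -> list[str]:
    """Stage 1: turn raw text into model inputs (here: lowercase words)."""
    return [w.strip(".,!?") for w in text.lower().split()]


def toy_model(words: list[str]) -> list[float]:
    """Stage 2: produce raw scores (logits) for [NEGATIVE, POSITIVE]."""
    total = sum(LEXICON.get(w, 0.0) for w in words)
    return [-total, total]


def postprocess(logits: list[float]) -> dict:
    """Stage 3: softmax the logits and report the most likely label."""
    exps = [math.exp(x) for x in logits]
    probs = [e / sum(exps) for e in exps]
    labels = ["NEGATIVE", "POSITIVE"]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}


def toy_pipeline(text: str) -> dict:
    """Chain the three stages end to end."""
    return postprocess(toy_model(preprocess(text)))


result = toy_pipeline("I love this course!")
print(result)
```

The returned dictionary deliberately has the same shape (`label`, `score`) as the sentiment-analysis output shown in this section, so the score can be read as the probability assigned to the winning label.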

Sentiment analysis

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

Output: [{'label': 'POSITIVE', 'score': 0.9598047137260437}]

Zero-shot classification

from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

Output:

{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.8445963859558105, 0.111976258456707, 0.043427448719739914]}

Text generation

from transformers import pipeline

generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

Output:

[{'generated_text': 'In this course, we will teach you how to understand and use '
                    'data flow and data interchange when handling user data. We '
                    'will be working with one or more of the most commonly used '
                    'data flows — data flows of various types, as seen by the '
                    'HTTP'}]

Mask filling

from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)

Output:

[{'sequence': 'This course will teach you all about mathematical models.',
  'score': 0.19619831442832947,
  'token': 30412,
  'token_str': ' mathematical'},
 {'sequence': 'This course will teach you all about computational models.',
  'score': 0.04052725434303284,
  'token': 38163,
  'token_str': ' computational'}]

Named entity recognition

from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

Output (PER = person, ORG = organization, LOC = location):

[{'entity_group': 'PER', 'score': 0.99816, 'word': 'Sylvain', 'start': 11, 'end': 18}, 
 {'entity_group': 'ORG', 'score': 0.97960, 'word': 'Hugging Face', 'start': 33, 'end': 45}, 
 {'entity_group': 'LOC', 'score': 0.99321, 'word': 'Brooklyn', 'start': 49, 'end': 57}
]

Question answering

from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

Output:

{'score': 0.6385916471481323, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}
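The `start` and `end` fields in this result can be read as character offsets into the context, which shows that this pipeline extracts its answer from the given text rather than generating new text. A minimal check, using the result dictionary copied from the output above (no model is run here):

```python
context = "My name is Sylvain and I work at Hugging Face in Brooklyn"

# Result dictionary copied verbatim from the pipeline output above.
result = {"score": 0.6385916471481323, "start": 33, "end": 45, "answer": "Hugging Face"}

# Slice the context with the reported offsets (end is exclusive).
span = context[result["start"]:result["end"]]
print(span)  # Hugging Face
```

The same reading applies to the `start` and `end` fields in the named entity recognition output.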

Summarization

from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
"""
)

Output:

[{'summary_text': ' America has changed dramatically during recent years . The '
                  'number of engineering graduates in the U.S. has declined in '
                  'traditional engineering disciplines such as mechanical, civil '
                  ', electrical, chemical, and aeronautical engineering . Rapidly '
                  'developing economies such as China and India, as well as other '
                  'industrial countries in Europe and Asia, continue to encourage '
                  'and advance engineering .'}]

Translation

from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

Output:

[{'translation_text': 'This course is produced by Hugging Face.'}]

3.2 How do Transformers work? A high-level understanding

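This section explains how the Transformer works from a model-design perspective. For the finer-grained mechanics and computations, please consult other resources.
One handy Transformer visualization site worth recommending is Transformer Visualization.
Models derived from the Transformer architecture fall into three categories:
- GPT-like models: auto-regressive Transformers (they generate token by token)
- BERT-like models: auto-encoding Transformers
  - What is auto-encoding?
    - Unlike GPT-like models, BERT takes the context on both sides of a token into account when making a prediction, which is known as a contextual representation. In addition, BERT is trained with MLM (masked language modeling: some tokens are replaced by a mask, so during training the model is effectively filling in blanks). This resembles the behavior of an autoencoder, namely reconstructing (predicting) the input.
    - An autoencoder is a type of artificial neural network used mainly for unsupervised learning. It was originally designed to learn efficient encodings of data and is typically used for dimensionality reduction, feature learning, or data reconstruction.
- BART/T5-like models: sequence-to-sequence Transformers (Seq2Seq, which keep the original structure of the Transformer model)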
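One common way to picture the difference between the auto-regressive and auto-encoding families is through which positions each token is allowed to look at. The sketch below builds the two visibility patterns as 0/1 matrices, where row i lists what position i can see. The function names and the example sentence are illustrative assumptions.

```python
def causal_mask(n: int) -> list[list[int]]:
    """Auto-regressive (GPT-like): position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> list[list[int]]:
    """Auto-encoding (BERT-like): every position may attend to every position."""
    return [[1] * n for _ in range(n)]


words = ["I", "like", "this", "course"]
for name, mask in [
    ("causal", causal_mask(len(words))),
    ("bidirectional", bidirectional_mask(len(words))),
]:
    print(name)
    for word, row in zip(words, mask):
        print(f"  {word:>6}: {row}")
```

In the causal case the last word sees the whole sentence while the first word sees only itself. In the bidirectional case every word sees both its left and right context.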

Transformers are language models

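A Transformer is usually first trained on a large corpus without labels (unsupervised, or more precisely self-supervised) so that it learns a particular language. (With rich enough training data, the Transformer can be regarded as a general-purpose "world" language model.)
A model trained this way then goes through transfer learning, which adapts the general language model to a specific task.
- For next-word prediction/generation, the usual training objective is causal language modeling: the model sees the previous $N$ words and predicts the next word
- For fill-in-the-blank tasks, the usual training objective is MLM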
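The two training objectives differ mainly in how training examples are cut out of raw text. The sketch below shows this on a four-word sentence. The function names are hypothetical, the `[MASK]` token string follows BERT's convention, and the masked position is fixed for readability (real MLM training picks positions at random).

```python
def causal_lm_examples(tokens: list[str]) -> list[tuple[list[str], str]]:
    """Causal LM: each example pairs a prefix with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def mlm_example(
    tokens: list[str], position: int, mask_token: str = "[MASK]"
) -> tuple[list[str], str]:
    """MLM: hide one token; the model must recover it from both sides of the context."""
    masked = tokens[:position] + [mask_token] + tokens[position + 1:]
    return masked, tokens[position]


sentence = ["the", "cat", "sat", "down"]

clm = causal_lm_examples(sentence)
for prefix, target in clm:
    print(prefix, "->", target)

masked, target = mlm_example(sentence, position=2)
print(masked, "->", target)
```

Neither objective needs human labels, because the targets come from the text itself.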

Transfer learning

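(Pre-training) Training a base model is extremely costly in time, resources, and effort. For ordinary organizations and individuals it is therefore important to fine-tune and adapt the base models open-sourced by large companies (collectively referred to as post-training), which is one of the reasons HuggingFace exists.
- The techniques for pre-training and post-training are themselves a very active research topic. Taking post-training as an example:
  - Fine-tuning: SFT (supervised fine-tuning), domain-specific fine-tuning
  - Instruction tuning: the training set consists of (instruction, response) pairs
  - RLHF (Reinforcement Learning from Human Feedback): humans rate (rank) the LLM's outputs; a reward model (RM) is trained to imitate those human judgments; a reinforcement learning algorithm such as PPO (Proximal Policy Optimization) then adjusts the LLM, whose objective is to maximize the RM's reward
  - Adapter tuning: an adapter is itself a small neural network. Adapters are added to the pre-trained model, the pre-trained parameters are frozen during training, and only the adapter parameters are trained to adjust the LLM
  - LoRA (Low-Rank Adaptation) ⭐️: adjusts the LLM by adding trainable low-rank matrices to the pre-trained model
    - Cheap and fast
  - Prompting ⭐️: changing the model's output without training the model (many current products work this way)
  - Distillation: train a small model whose goal is to imitate the outputs of a large model
  - Active learning: first let the large model identify the tokens it finds hard to understand, have humans label those tokens, then train the LLM on them. See "Deep Active Learning for Named Entity Recognition"
  - Self-alignment with RL: an RM-based technique for LLM self-alignment; fairly new and in need of further investigation. See "Constitutional AI: Harmlessness from AI Feedback"
  - Combining several of the methods above to fine-tune the model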
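To illustrate why LoRA is described as cheap and fast, the sketch below adds a rank-1 update B·A to a frozen weight matrix and compares parameter counts for one layer. It is a simplified illustration in plain Python: the scaling factor used in the LoRA paper is omitted, and the matrix values and the sizes 4096 and r = 8 are assumptions chosen for the example.

```python
def matmul(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """(full fine-tuning params, LoRA params) for one d x k weight matrix."""
    return d * k, r * (d + k)


# Frozen pre-trained weight W (3 x 3); only B and A would be trained.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
B = [[0.5], [0.0], [-0.5]]  # 3 x 1, trainable
A = [[1.0, 2.0, 0.0]]       # 1 x 3, trainable

delta = matmul(B, A)        # rank-1 update with the same shape as W
W_adapted = [
    [w + dw for w, dw in zip(w_row, dw_row)] for w_row, dw_row in zip(W, delta)
]

full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(W_adapted)
print(f"full: {full:,} params, LoRA: {lora:,} params")
```

With these assumed sizes, the low-rank factors hold 256 times fewer trainable parameters than the full matrix, while the adapted weight still has the full 4096 x 4096 shape.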

Structure of the Transformer

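Broadly speaking, a Transformer model has two parts: an Encoder and a Decoder.
The Encoder and Decoder can be used separately, as in the different types of Transformer models mentioned above.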

Attention Layer

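Attention is the core of the Transformer; see the paper "Attention Is All You Need".
Key idea: when processing the representation of each word, the attention layer tells the model to pay particular attention to certain words in the sentence you pass in (and to more or less ignore the others).
- Example: in "Li Si had an English class this morning, and he said he really likes this class", processing "this" should link it to "English", and processing "he" should link it to "Li Si"
- This matters in almost every natural language processing task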
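The idea of paying more attention to some words than others can be sketched with scaled dot-product attention, the formulation from "Attention Is All You Need": softmax(q·k / sqrt(d)) gives one weight per word, and the output is the weighted sum of the value vectors. The vectors below are hand-picked toy numbers, not learned ones, chosen so that the query for "he" points towards "Li Si" (李四). In a real Transformer the queries, keys, and values come from learned projections.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Toy key/value vectors for three words of the example sentence (assumed values).
words = ["Li Si", "English", "class"]
keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 3.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.5]]
query_he = [4.0, 0.0]  # query vector for "he"

weights, output = attention(query_he, keys, values)
for word, weight in zip(words, weights):
    print(f"{word:>8}: {weight:.4f}")
```

Almost all of the weight lands on "Li Si", so the representation computed for "he" is dominated by the value vector of "Li Si".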