当前位置：首页 > article >正文

【token】【1】零基础token pipline快速实战

article 2025/3/11 9:53:28

前言

大家对transformers的原理大致都了解了，但是论文里常常说到的token到底是什么？要如何控制token？怎么自定义token？由于token机制比较复杂，作为token讲解的开篇，我们先从快速应用现有的pipline开始，后面再一一探究底层的原理以及实战。

开箱即用的 pipelines

Transformers 库将目前的 NLP 任务归纳为几下几类：

**文本分类：**例如情感分析、句子对关系判断等；
**对文本中的词语进行分类：**例如词性标注 (POS)、命名实体识别 (NER) 等；
**文本生成：**例如填充预设的模板 (prompt)、预测文本中被遮掩掉 (masked) 的词语；
**从文本中抽取答案：**例如根据给定的问题从一段文本中抽取出对应的答案；
**根据输入文本生成新的句子：**例如文本翻译、自动摘要等。

Transformers 库最基础的对象就是 pipeline() 函数，它封装了预训练模型和对应的前处理和后处理环节。我们只需输入文本，就能得到预期的答案。目前常用的 pipelines 有：

feature-extraction （获得文本的向量化表示）
fill-mask （填充被遮盖的词、片段）
ner（命名实体识别）
question-answering （自动问答）
sentiment-analysis （情感分析）
summarization （自动摘要）
text-generation （文本生成）
translation （机器翻译）
zero-shot-classification （零训练样本分类）

下面我们以常见的几个 NLP 任务为例，展示如何调用这些 pipeline 模型。

import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier(
    "I've been waiting for a HuggingFace course my whole life.")
print(result)
results = classifier([
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!"
])
print(results)

pipeline 模型会自动完成以下三个步骤：

将文本预处理为模型可以理解的格式；
将预处理好的文本送入模型；
对模型的预测值进行后处理，输出人类可以理解的格式。

pipeline 会自动选择合适的预训练模型来完成任务。例如对于情感分析，默认就会选择微调好的英文情感模型 distilbert-base-uncased-finetuned-sst-2-english。

Transformers 库会在创建对象时下载并且缓存模型，只有在首次加载模型时才会下载，后续会直接调用缓存好的模型。

零训练样本分类

零训练样本分类 pipeline 允许我们在不提供任何标注数据的情况下自定义分类标签。

import os

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
result = classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)
print(result)

可以看到，pipeline 自动选择了预训练好的 facebook/bart-large-mnli 模型来完成任务。

文本生成

我们首先根据任务需要构建一个模板 (prompt)，然后将其送入到模型中来生成后续文本。注意，由于文本生成具有随机性，因此每次运行都会得到不同的结果。

这种模板被称为前缀模板 (Preﬁx Prompt)，了解更多详细信息可以查看《Prompt 方法简介》。

import os

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from transformers import pipeline

generator = pipeline("text-generation")
results = generator("In this course, we will teach you how to")
print(results)
results = generator(
    "In this course, we will teach you how to",
    num_return_sequences=2,
    max_length=50
) 
print(results)

可以看到，pipeline 自动选择了预训练好的 gpt2 模型来完成任务。我们也可以指定要使用的模型。对于文本生成任务，我们可以在 Model Hub 页面左边选择 Text Generation tag 查询支持的模型。例如，我们在相同的 pipeline 中加载 distilgpt2 模型：

import os

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
results = generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)
print(results)

遮盖词填充

给定一段部分词语被遮盖掉 (masked) 的文本，使用预训练模型来预测能够填充这些位置的词语。

与前面介绍的文本生成类似，这个任务其实也是先构建模板然后运用模型来完善模板，称为填充模板 (Cloze Prompt)。了解更多详细信息可以查看《Prompt 方法简介》。

import os

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from transformers import pipeline

unmasker = pipeline("fill-mask")
results = unmasker("This course will teach you all about <mask> models.", top_k=2)
print(results)

命名实体识别

命名实体识别 (NER) pipeline 负责从文本中抽取出指定类型的实体，例如人物、地点、组织等等。

import os

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)
results = ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
print(results)

可以看到，模型正确地识别出了 Sylvain 是一个人物，Hugging Face 是一个组织，Brooklyn 是一个地名。

这里通过设置参数 grouped_entities=True，使得 pipeline 自动合并属于同一个实体的多个子词 (token)，例如这里将“Hugging”和“Face”合并为一个组织实体，实际上 Sylvain 也进行了子词合并，因为分词器会将 Sylvain 切分为 S、##yl 、##va 和 ##in 四个 token。

自动问答

自动问答 pipeline 可以根据给定的上下文回答问题，例如：

import os

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from transformers import pipeline
question_answerer = pipeline("question-answering")
answer = question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)
print(answer)

可以看到，pipeline 自动选择了在 SQuAD 数据集上训练好的 distilbert-base 模型来完成任务。这里的自动问答 pipeline 实际上是一个抽取式问答模型，即从给定的上下文中抽取答案，而不是生成答案。

根据形式的不同，自动问答 (QA) 系统可以分为三种：

**抽取式 QA (extractive QA)：**假设答案就包含在文档中，因此直接从文档中抽取答案；
**多选 QA (multiple-choice QA)：**从多个给定的选项中选择答案，相当于做阅读理解题；
**无约束 QA (free-form QA)：**直接生成答案文本，并且对答案文本格式没有任何限制。

自动摘要

自动摘要 pipeline 旨在将长文本压缩成短文本，并且还要尽可能保留原文的主要信息，例如：

import os

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
from transformers import pipeline

summarizer = pipeline("summarization")
results = summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science. As a result, there 
    are declining offerings in engineering subjects dealing with infrastructure, 
    the environment, and related issues, and greater concentration on high 
    technology subjects, largely supporting increasingly complex scientific 
    developments. While the latter is important, it should not be at the expense 
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other 
    industrial countries in Europe and Asia, continue to encourage and advance 
    the teaching of engineering. Both China and India, respectively, graduate 
    six and eight times as many traditional engineers as does the United States. 
    Other industrial countries at minimum maintain their output, while America 
    suffers an increasingly serious decline in the number of engineering graduates 
    and a lack of well-educated engineers.
    """
)
print(results)