bert-base-chinese Model Tutorial
Vector encoding and vector-similarity demonstration
```python
import torch
from transformers import BertTokenizer, BertModel
import numpy as np

model_name = "C:/Users/Administrator.DESKTOP-TPJL4TC/.cache/modelscope/hub/tiansz/bert-base-chinese"
sentences = ['春眠不觉晓', '大梦谁先觉', '浓睡不消残酒', '东临碣石以观沧海']

tokenizer = BertTokenizer.from_pretrained(model_name)
# print(type(tokenizer)) # <class 'transformers.models.bert.tokenization_bert.BertTokenizer'>
model = BertModel.from_pretrained(model_name)
# print(type(model)) # <class 'transformers.models.bert.modeling_bert.BertModel'>

def test_encode():
    input_ids = tokenizer.encode('春眠不觉晓', return_tensors='pt')  # shape (1, 7)
    output = model(input_ids)
    print(output.last_hidden_state.shape)  # shape (1, 7, 768)
    v = torch.mean(output.last_hidden_state, dim=1)  # shape (1, 768)
    print(v.shape)  # shape (1, 768)
    print(output.pooler_output.shape)  # shape (1, 768)
```
According to the docstring of BaseModelOutputWithPoolingAndCrossAttentions in transformers\modeling_outputs.py:196:
```python
@dataclass
class BaseModelOutputWithPoolingAndCrossAttentions(ModelOutput):
    """
    Base class for model's outputs that also contains a pooling of the last hidden states.

    Args:
        last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
            Sequence of hidden-states at the output of the last layer of the model.
        pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`):
            Last layer hidden-state of the first token of the sequence (classification token) after further processing
            through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns
            the classification token after processing through a linear layer and a tanh activation function. The linear
            layer weights are trained from the next sentence prediction (classification) objective during pretraining.
        hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
            Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
            Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
        attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
            Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
            sequence_length)`.
            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
            heads.
        cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` and `config.add_cross_attention=True` is passed or when `config.output_attentions=True`):
            Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
            sequence_length)`.
            Attentions weights of the decoder's cross-attention layer, after the attention softmax, used to compute the
            weighted average in the cross-attention heads.
        past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
            Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
            `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and optionally if
            `config.is_encoder_decoder=True` 2 additional tensors of shape `(batch_size, num_heads,
            encoder_sequence_length, embed_size_per_head)`.
            Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if
            `config.is_encoder_decoder=True` in the cross-attention blocks) that can be used (see `past_key_values`
            input) to speed up sequential decoding.
    """

    last_hidden_state: torch.FloatTensor = None
    pooler_output: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
    past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
    cross_attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
```
The 1 is batch_size: we passed in a single string, so the batch size is 1.
The 7 is sequence_length: for a single-sentence input, BERT prepends the special token [CLS] and appends [SEP], so the 5 characters become 7 tokens.
The 768 is hidden_size: each token is encoded as a vector of 768 numbers.
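To see where the 7 comes from, you can decode the token ids back into tokens. A quick sketch reusing the tokenizer loaded above; the printed list is what I would expect from bert-base-chinese, which tokenizes Chinese text character by character:

```python
# Sketch: inspect the tokens produced for the 5-character sentence.
ids = tokenizer.encode('春眠不觉晓')             # a plain Python list of 7 token ids
print(tokenizer.convert_ids_to_tokens(ids))
# expected: ['[CLS]', '春', '眠', '不', '觉', '晓', '[SEP]']
```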
According to the documentation, the pooler_output vector is usually not a good summary of the sentence's semantics, so here we instead pool the last_hidden_state.
The torch.mean(dim) function
This is a commonly used pooling method: it reduces the dimensionality of the tensor and makes later computation easier.
Calling torch.mean() with no dim argument returns the arithmetic mean of all elements.
With dim=0 it averages over the first dimension (column means for a 2-D tensor).
With dim=1 it averages over the second dimension (row means for a 2-D tensor); see the small sketch below.
So after torch.mean(output.last_hidden_state, dim=1), the sentence becomes a single 1×768 vector, and that one vector is used to represent the meaning of the whole sentence.
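A minimal sketch of torch.mean() with different dim values on a small 2×3 tensor (not tied to the model, just to illustrate the argument):

```python
import torch

t = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])     # shape (2, 3)
print(torch.mean(t))                 # tensor(3.5000): mean of all elements
print(torch.mean(t, dim=0))          # tensor([2.5000, 3.5000, 4.5000]): averaged over rows -> column means
print(torch.mean(t, dim=1))          # tensor([2., 5.]): averaged over columns -> row means
# For last_hidden_state of shape (1, 7, 768), dim=1 averages over the 7 tokens, giving shape (1, 768).
```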
Next we can compute text-similarity scores; the targets are the four sentences given above.
```python
def test_similarity():
    with torch.no_grad():
        vs = [sentence_embedding(sentence).numpy() for sentence in sentences]
        nvs = [v / np.linalg.norm(v) for v in vs]  # normalize each vector
        m = np.array(nvs).squeeze(1)  # shape (4, 768)
        print(np.around(m @ m.T, decimals=2))  # pairwise cosine similarity

def sentence_embedding(sentence):
    input_ids = tokenizer.encode(sentence, return_tensors='pt')
    output = model(input_ids)
    return torch.mean(output.last_hidden_state, dim=1)
```
The @ symbol performs matrix multiplication; since every row has been normalized, m @ m.T is the matrix of pairwise cosine similarities (see the small check after the table below). The output is the following 4×4 matrix:

```
[[1.   0.75 0.83 0.57]
 [0.75 1.   0.72 0.51]
 [0.83 0.72 1.   0.58]
 [0.57 0.51 0.58 1.  ]]
```
| | 春眠不觉晓 | 大梦谁先觉 | 浓睡不消残酒 | 东临碣石以观沧海 |
|---|---|---|---|---|
| 春眠不觉晓 | 1 | 0.75 | 0.83 | 0.57 |
| 大梦谁先觉 | 0.75 | 1 | 0.72 | 0.51 |
| 浓睡不消残酒 | 0.83 | 0.72 | 1 | 0.58 |
| 东临碣石以观沧海 | 0.57 | 0.51 | 0.58 | 1 |
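Why does m @ m.T give cosine similarity? Because after each row is normalized to unit length, the dot product of two rows equals their cosine similarity. A small standalone check with toy vectors (independent of the model):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 1.0])
cos_ab = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # textbook cosine similarity

na, nb = a / np.linalg.norm(a), b / np.linalg.norm(b)      # normalize to unit length first
m = np.stack([na, nb])                                     # shape (2, 3)
print(np.allclose((m @ m.T)[0, 1], cos_ab))                # True: the off-diagonal entry is the cosine
```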
The with torch.no_grad() context disables gradient tracking for the computation inside it: no gradients are computed and no graph for backpropagation is built, which is what we want for inference.
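A tiny illustration of the effect (independent of BERT): tensors produced inside the block carry no autograd history.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
with torch.no_grad():
    y = x * 2
print(y.requires_grad)   # False: no graph was built, so nothing to backpropagate through
z = x * 2
print(z.requires_grad)   # True: outside the block, autograd tracks the operation
```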
About last_hidden_state and pooler_output
Looking at BERT Chinese text-classification code, you will find that training often uses only the vector of the first token, and the first position of a sentence is the special token [CLS]. How should we understand this?
BERT prepends a [CLS] token to the first sentence. CLS stands for classification: this token's vector is meant to represent the meaning of the whole sentence, so it can be used to classify the entire sentence in downstream classification tasks.
Why choose this token? Compared with the real words in the text, this symbol carries no obvious semantic content of its own, so it fuses the semantics of all the words in the text more "fairly" and therefore represents the whole sentence better. Concretely, self-attention uses the other words in the text to enrich the representation of a target word, but the target word's own semantics still dominate its representation; after BERT's 12 layers, each word's embedding has fused information from all the words, mainly so that it can represent its own meaning better. The [CLS] position has no semantics of its own, so after 12 layers it ends up as an attention-weighted aggregation of all the words, and compared with ordinary words it represents the semantics of the whole sentence better.
Of course, you can also represent the sentence by pooling the last-layer embeddings of all tokens; the most common pooling method is the mean.
From the shapes printed above we can see that pooler_output is also 1×768, the same shape as the torch.mean() result, so it can be understood as the model's own pooled output. pooler_output is produced by taking the hidden state of the first token ([CLS]) from last_hidden_state and passing it through a linear layer and an activation function (Tanh). This output is commonly used for classification tasks because it encodes information about the whole sequence.
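A hedged sketch of that relationship, assuming the BertModel loaded above keeps its default pooling layer (model.pooler): reproduce pooler_output by applying the pooler's linear layer and Tanh to the [CLS] hidden state.

```python
with torch.no_grad():
    ids = tokenizer.encode('春眠不觉晓', return_tensors='pt')
    out = model(ids)
    cls_hidden = out.last_hidden_state[:, 0]                     # hidden state of [CLS], shape (1, 768)
    manual = torch.tanh(model.pooler.dense(cls_hidden))          # same linear + tanh as BertPooler
    print(torch.allclose(manual, out.pooler_output, atol=1e-6))  # expected: True
```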
Fine-tuning BERT for downstream NLP tasks
https://zhuanlan.zhihu.com/p/149779660
In the design of the original BERT paper, downstream NLP tasks fall into the following categories:
- Sentence-pair classification tasks, such as GLUE's MNLI, QQP, and MRPC
- Single-sentence classification tasks, such as GLUE's SST-2 and CoLA
- Question-answering / reading-comprehension tasks, where the answer span must be marked in a candidate text given a question, such as SQuAD
- Single-sentence tagging tasks, which assign a label to every token, such as POS tagging and NER (named entity recognition)
In general, these four categories already cover most downstream NLP tasks.
The SST-2 task we specified in the previous section is exactly a "single-sentence classification" NLP task: with the run_glue.py script we can complete fine-tuning with almost no code of our own. The GLUE benchmark contains 9 datasets of different types, summarized in the table below.
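For reference, a minimal sketch of what single-sentence classification looks like in code (this is not the run_glue.py script; the two example sentences and their 0/1 labels are made up for illustration, and the classification head is randomly initialized, so a warning is expected when loading):

```python
from transformers import BertForSequenceClassification

clf = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
enc = tokenizer(['这部电影真好看', '浪费时间'], padding=True, return_tensors='pt')
labels = torch.tensor([1, 0])            # hypothetical positive/negative labels
out = clf(**enc, labels=labels)
print(out.loss, out.logits.shape)        # cross-entropy loss and logits of shape (2, 2)
```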
Hands-on BERT video series: https://www.bilibili.com/video/av258262103?vd_source=0c75dc193ee55511d0515b3a8c375bd0&spm_id_from=333.788.videopod.sections