Quickly Learn an Algorithm: BERT
Today we introduce a powerful model: BERT.
BERT (Bidirectional Encoder Representations from Transformers) is a deep learning model built on the Transformer architecture and designed primarily for natural language processing (NLP) tasks.
BERT was proposed by the Google AI research team in 2018 and reshaped NLP with its ability to capture context bidirectionally. Unlike earlier models that read text in a single direction, BERT understands each word in a sentence by considering both its left and right context. This greatly deepens its grasp of linguistic nuance and makes it highly effective across a wide range of NLP tasks.
How BERT Works
BERT Architecture
BERT is built on the Transformer architecture, specifically its encoder stack.
BERT consists of multiple layers of self-attention and feed-forward networks. It takes a bidirectional approach, capturing contextual information from the words both before and after each position in a sentence.
Depending on the scale of the architecture, BERT was released in four pre-trained versions (two sizes, each with a cased and an uncased variant):
- BERT-Base (Cased / Uncased): 12 layers, 768 hidden units, 12 attention heads, 110M parameters
- BERT-Large (Cased / Uncased): 24 layers, 1024 hidden units, 16 attention heads, 340M parameters
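As a quick check of these numbers, here is a minimal sketch (assuming the Hugging Face Transformers library, which the case study later in this article also uses) that loads the configuration of the bert-base-uncased checkpoint and prints its layer, hidden-size, and attention-head counts:

from transformers import BertConfig

# load the configuration of the bert-base-uncased checkpoint
config = BertConfig.from_pretrained('bert-base-uncased')

# for BERT-Base this should print 12 layers, 768 hidden units, 12 attention heads
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)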
Text Preprocessing
The developers of BERT defined a specific set of rules for representing the model's input text.
First, each input embedding is a combination of three embeddings:
- Position embeddings: BERT learns and uses position embeddings to express where a word appears in a sentence. They were added to overcome a limitation of the Transformer, which, unlike an RNN, cannot capture "sequence" or "order" information on its own.
- Segment embeddings: BERT can also take sentence pairs as input for certain tasks. It therefore learns a distinct embedding for the first and the second sentence, helping the model tell them apart: every token of the first sentence shares the sentence-A segment embedding, and every token of the second sentence shares the sentence-B segment embedding.
- Token embeddings: These are the embeddings learned for each specific token in the WordPiece token vocabulary.
Note that a [CLS] token is added in front of the input word tokens at the start of the first sentence, and a [SEP] token is appended after each sentence.
For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings.
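As a rough illustration of what this input format looks like in practice, here is a minimal sketch (using the Hugging Face tokenizer that the case study below also loads, with an invented sentence pair) that encodes two sentences and prints the special tokens and segment ids:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# encode a sentence pair; the tokenizer inserts [CLS] and [SEP] automatically
enc = tokenizer("my dog is cute", "he likes playing")

# tokens: roughly ['[CLS]', 'my', 'dog', 'is', 'cute', '[SEP]', 'he', 'likes', 'playing', '[SEP]']
print(tokenizer.convert_ids_to_tokens(enc['input_ids']))
# token_type_ids are 0 for sentence-A tokens and 1 for sentence-B tokens
print(enc['token_type_ids'])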
Pre-training Tasks
BERT is pre-trained on two NLP tasks:
- Masked Language Model (MLM): Some words in the input sequence are randomly replaced with a special [MASK] token, and the model must predict the masked words. This pushes the model toward a deeper understanding of language (a small example of this objective is sketched after this list).
- Next Sentence Prediction (NSP): Given two sentences A and B, the model must predict whether B directly follows A. This helps the model understand relationships between sentences, which is especially important for tasks such as question answering.
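To get a feel for the MLM objective, here is a minimal sketch (an illustration using the Transformers fill-mask pipeline, not part of the original case study) that asks the pre-trained model to fill in a masked word:

from transformers import pipeline

# the fill-mask pipeline exposes BERT's pre-trained masked-language-model head
fill_mask = pipeline('fill-mask', model='bert-base-uncased')

# print the top candidate tokens for the [MASK] position with their scores
for pred in fill_mask("The capital of France is [MASK]."):
    print(pred['token_str'], round(pred['score'], 3))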
Fine-tuning
- Process: After pre-training, BERT can be fine-tuned to fit specific downstream tasks such as sentiment analysis or named entity recognition. During fine-tuning, a task-specific output layer is added on top of the pre-trained encoder; the pre-trained parameters can either be updated along with it or kept frozen, as the case study below does.
- Optimization: Through fine-tuning, BERT leverages the rich language understanding gained during pre-training to adapt quickly and perform well on the target task.
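For reference, a minimal sketch of this setup using the off-the-shelf BertForSequenceClassification head (the case study below instead builds an equivalent classification head by hand):

from transformers import BertForSequenceClassification

# pre-trained encoder plus a freshly initialized classification layer on top;
# `clf` is only for illustration and is separate from the case study's model below
clf = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)

# optionally freeze the encoder and train only the new head,
# mirroring what the case study below does with its custom classifier
for param in clf.bert.parameters():
    param.requires_grad = False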
Case Study
Below is an example of using BERT for sentiment analysis.
1. Load the pre-trained BERT model
Here we use the uncased pre-trained version of the BERT base model.
from transformers import BertModel
from transformers import BertTokenizerFast
bert = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased', do_lower_case=True)
2. Load the dataset
Here we first load the dataset and preprocess it.
import re
import pandas as pd

def preprocessor(text):
    # convert text to lower case
    text = text.lower()
    # remove user mentions
    text = re.sub(r'@[A-Za-z0-9]+', '', text)
    # remove hashtags
    # text = re.sub(r'#[A-Za-z0-9]+', '', text)
    # remove links
    text = re.sub(r'http\S+', '', text)
    # split into tokens to remove extra spaces
    tokens = text.split()
    # join tokens back with a single space
    return " ".join(tokens)

df = pd.read_csv('Sentiment.csv')
# perform text cleaning
df['clean_text'] = df['text'].apply(preprocessor)
# save cleaned text and labels to variables
text = df['clean_text'].values
labels = df['sentiment'].values
print(text[50:55])
Next, prepare the model's input and output data.
# import the label encoder
from sklearn.preprocessing import LabelEncoder

# define the label encoder
le = LabelEncoder()
# fit and transform the target strings to integer labels
labels = le.fit_transform(labels)
print(le.classes_)
print(labels)

# library for progress bars
from tqdm.auto import tqdm

# create an empty list to hold the integer sequences
sent_id = []
# iterate over each tweet
for i in tqdm(range(len(text))):
    encoded_sent = tokenizer.encode(text[i],
                                    add_special_tokens=True,
                                    max_length=25,
                                    truncation=True,
                                    padding='max_length')
    # save the integer sequence to the list
    sent_id.append(encoded_sent)

# build attention masks: 1 for real tokens, 0 for padding
attention_masks = []
for sent in sent_id:
    att_mask = [int(token_id > 0) for token_id in sent]
    attention_masks.append(att_mask)
# Use train_test_split to split our data into train and validation sets
from sklearn.model_selection import train_test_split
# Use 90% for training and 10% for validation.
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(sent_id, labels, random_state=2018, test_size=0.1, stratify=labels)
# Do the same for the masks.
train_masks, validation_masks, _, _ = train_test_split(attention_masks, labels, random_state=2018, test_size=0.1, stratify=labels)
import torch
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
# For fine-tuning BERT on a specific task, the authors recommend a batch size of 16 or 32.
# define a batch size
batch_size = 32
# Create the DataLoader for our training set.
#Dataset wrapping tensors.
train_data = TensorDataset(train_inputs, train_masks, train_labels)
#define a sampler for sampling the data during training
#random sampler samples randomly from a dataset
#sequential sampler samples sequentially, always in the same order
train_sampler = RandomSampler(train_data)
#represents an iterator over the dataset; supports batching and a customized data loading order
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
# Create the DataLoader for our validation set.
#Dataset wrapping tensors.
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
#define a sequential sampler
#This samples data in a sequential order
validation_sampler = SequentialSampler(validation_data)
#create an iterator over the dataset
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)
3. Define the model architecture
# import the nn module
import torch.nn as nn

class classifier(nn.Module):
    # define the layers and wrappers used by the model
    def __init__(self, bert):
        # constructor
        super(classifier, self).__init__()
        # bert model
        self.bert = bert
        # dense layer 1
        self.fc1 = nn.Linear(768, 512)
        # dense layer 2 (output layer)
        self.fc2 = nn.Linear(512, 3)
        # dropout layer
        self.dropout = nn.Dropout(0.1)
        # relu activation function
        self.relu = nn.ReLU()
        # log-softmax activation function (pairs with NLLLoss below)
        self.softmax = nn.LogSoftmax(dim=1)

    # define the forward pass
    def forward(self, sent_id, mask):
        # pass the inputs to the BERT encoder; with return_dict=False it returns
        # (last_hidden_state, pooled [CLS] output)
        all_hidden_states, cls_hidden_state = self.bert(sent_id, attention_mask=mask, return_dict=False)
        # pass the pooled [CLS] representation to the first dense layer
        x = self.fc1(cls_hidden_state)
        # apply the ReLU activation function
        x = self.relu(x)
        # apply dropout
        x = self.dropout(x)
        # pass to the output layer
        x = self.fc2(x)
        # apply the log-softmax activation
        x = self.softmax(x)
        return x

# use the GPU if one is available
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# freeze all BERT parameters; only the classifier head will be trained
for param in bert.parameters():
    param.requires_grad = False

# create the model
model = classifier(bert)
# push the model to the GPU, if available
model = model.to(device)
4. Define the optimizer and the loss function
# Adam optimizer (only the unfrozen classifier-head parameters will be updated)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# compute balanced class weights to compensate for class imbalance
class_weights = compute_class_weight(class_weight="balanced", classes=np.unique(labels), y=labels)
print("Class Weights:", class_weights)

# convert the list of class weights to a tensor
weights = torch.tensor(class_weights, dtype=torch.float)
# transfer to the GPU
weights = weights.to(device)

# define the loss function (NLLLoss pairs with the model's LogSoftmax output)
cross_entropy = nn.NLLLoss(weight=weights)
5. Train and evaluate the model
import time
import datetime

# compute elapsed time in hh:mm:ss
def format_time(elapsed):
    # round to the nearest second
    elapsed_rounded = int(round(elapsed))
    # format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))
# define a function for training the model
def train():
    print("\nTraining.....")
    # put the model in training mode - dropout layers are active
    model.train()
    # record the current time
    t0 = time.time()
    # initialize loss and accuracy to 0
    total_loss, total_accuracy = 0, 0
    # create an empty list to save the model predictions
    total_preds = []

    # for every batch
    for step, batch in enumerate(train_dataloader):
        # progress update after every 40 batches
        if step % 40 == 0 and not step == 0:
            # calculate elapsed time
            elapsed = format_time(time.time() - t0)
            # report progress
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # push the batch to the GPU
        batch = tuple(t.to(device) for t in batch)
        # unpack the batch into separate variables
        # `batch` contains three pytorch tensors:
        #   [0]: input ids
        #   [1]: attention masks
        #   [2]: labels
        sent_id, mask, labels = batch

        # always clear any previously calculated gradients before performing a
        # backward pass; PyTorch doesn't do this automatically
        model.zero_grad()

        # perform a forward pass; this returns the model predictions
        preds = model(sent_id, mask)

        # compute the loss between actual and predicted values
        loss = cross_entropy(preds, labels)

        # accumulate the training loss over all batches so we can calculate the
        # average loss at the end; `loss` is a tensor containing a single value,
        # and `.item()` returns its Python value
        total_loss = total_loss + loss.item()

        # perform a backward pass to calculate the gradients
        loss.backward()

        # update parameters using the computed gradients; the optimizer dictates
        # the "update rule" - how the parameters are modified based on their
        # gradients, the learning rate, etc.
        optimizer.step()

        # the model predictions are stored on the GPU, so push them to the CPU
        preds = preds.detach().cpu().numpy()
        # accumulate the model predictions of each batch
        total_preds.append(preds)

    # compute the average training loss of the epoch
    avg_loss = total_loss / len(train_dataloader)

    # the predictions are in the form (no. of batches, batch size, no. of classes),
    # so reshape them to (number of samples, no. of classes)
    total_preds = np.concatenate(total_preds, axis=0)

    # return the loss and the predictions
    return avg_loss, total_preds
# define a function for evaluating the model
def evaluate():
    print("\nEvaluating.....")
    # put the model in evaluation mode - dropout layers are deactivated
    model.eval()
    # record the current time
    t0 = time.time()
    # initialize the loss and accuracy to 0
    total_loss, total_accuracy = 0, 0
    # create an empty list to save the model predictions
    total_preds = []

    # for each batch
    for step, batch in enumerate(validation_dataloader):
        # progress update every 40 batches
        if step % 40 == 0 and not step == 0:
            # calculate elapsed time
            elapsed = format_time(time.time() - t0)
            # report progress
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(validation_dataloader), elapsed))

        # push the batch to the GPU
        batch = tuple(t.to(device) for t in batch)
        # unpack the batch into separate variables
        # `batch` contains three pytorch tensors:
        #   [0]: input ids
        #   [1]: attention masks
        #   [2]: labels
        sent_id, mask, labels = batch

        # deactivate autograd
        with torch.no_grad():
            # perform a forward pass; this returns the model predictions
            preds = model(sent_id, mask)
            # compute the validation loss between actual and predicted values
            loss = cross_entropy(preds, labels)
            # accumulate the validation loss over all batches so we can
            # calculate the average loss at the end
            total_loss = total_loss + loss.item()
            # the model predictions are stored on the GPU, so push them to the CPU
            preds = preds.detach().cpu().numpy()
            # accumulate the model predictions of each batch
            total_preds.append(preds)

    # compute the average validation loss of the epoch
    avg_loss = total_loss / len(validation_dataloader)

    # the predictions are in the form (no. of batches, batch size, no. of classes),
    # so reshape them to (number of samples, no. of classes)
    total_preds = np.concatenate(total_preds, axis=0)

    return avg_loss, total_preds
Next, train the model.
# assign the initial best validation loss to infinity
best_valid_loss = float('inf')

# create empty lists to store the training and validation loss of each epoch
train_losses = []
valid_losses = []

epochs = 5

# for each epoch
for epoch in range(epochs):
    print('\n....... epoch {:} / {:} .......'.format(epoch + 1, epochs))
    # train the model
    train_loss, _ = train()
    # evaluate the model
    valid_loss, _ = evaluate()
    # save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')
    # accumulate training and validation loss
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)
    print(f'\nTraining Loss: {train_loss:.3f}')
    print(f'Validation Loss: {valid_loss:.3f}')

print("")
print("Training complete!")
Next, evaluate the model.
# load the weights of the best model
path = 'saved_weights.pt'
model.load_state_dict(torch.load(path))

# get the model predictions on the validation data
# returns 2 elements - the validation loss and the predictions
valid_loss, preds = evaluate()
print(valid_loss)

from sklearn.metrics import classification_report

# convert the log-probabilities into classes by
# choosing the index of the maximum value as the class
y_pred = np.argmax(preds, axis=1)
# actual labels
y_true = validation_labels
print(classification_report(y_true, y_pred))
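Finally, here is a minimal sketch of how the fine-tuned model might be used on a new piece of text (predict_sentiment is a hypothetical helper; the label strings come from whatever le.classes_ printed earlier, and the 25-token limit mirrors the encoding step above):

def predict_sentiment(sentence):
    # clean and encode the sentence the same way as the training data
    enc = tokenizer.encode(preprocessor(sentence),
                           add_special_tokens=True,
                           max_length=25,
                           truncation=True,
                           padding='max_length')
    ids = torch.tensor([enc]).to(device)
    mask = torch.tensor([[int(t > 0) for t in enc]]).to(device)
    # forward pass without gradients
    model.eval()
    with torch.no_grad():
        log_probs = model(ids, mask)
    # pick the most likely class and map it back to the original label string
    pred_class = int(np.argmax(log_probs.cpu().numpy(), axis=1)[0])
    return le.inverse_transform([pred_class])[0]

print(predict_sentiment("I really enjoyed the flight, great service!"))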