
Trying Out a Large Model: t5-base

article 2024/11/22 21:01:33

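1. Model Background

T5 (Text-to-Text Transfer Transformer) is a general-purpose text-to-text generation model from Google, described in the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (first posted in 2019, published in 2020). T5 is highly flexible: it casts NLP tasks such as translation, summarization, question answering, and text classification into a single text-in, text-out format.
T5-base is a mid-sized configuration of T5 (about 220M parameters). It suits resource-constrained environments while still performing well.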

2. Purpose of the Trial

To see how well the model works on Chinese text.

3. Test Environment

Python 3.12 with transformers and related libraries installed, on a CPU-only machine.

4. Common Model Parameters

T5-base configuration: the main configuration parameters of T5-base are as follows.
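The published `t5-base` checkpoint config gives a sense of scale. The sketch below estimates the parameter count from values assumed to match that config (`d_model=768`, `d_ff=3072`, `d_kv=64`, `num_heads=12`, `num_layers=12`, `vocab_size=32128`, 32 relative-position buckets). The layer layout in the docstring is also an assumption about the original T5 architecture, and `count_t5_params` is a made-up helper name, not a transformers API.

```python
def count_t5_params(vocab_size=32128, d_model=768, d_kv=64, d_ff=3072,
                    num_layers=12, num_heads=12, num_buckets=32):
    """Estimate the parameter count of an original-style T5 encoder-decoder.

    Assumptions: no bias terms, layer norms with a single scale vector,
    a two-matrix feed-forward block (wi, wo), and tied input/output embeddings.
    """
    inner = num_heads * d_kv
    attention = 4 * d_model * inner       # q, k, v, o projections
    feed_forward = 2 * d_model * d_ff     # wi and wo
    layer_norm = d_model                  # scale only

    embedding = vocab_size * d_model      # shared by encoder, decoder and LM head
    rel_bias = num_buckets * num_heads    # one bias table per stack

    encoder_block = attention + feed_forward + 2 * layer_norm
    decoder_block = 2 * attention + feed_forward + 3 * layer_norm

    # Each stack adds its relative-position table and a final layer norm.
    encoder = num_layers * encoder_block + rel_bias + layer_norm
    decoder = num_layers * decoder_block + rel_bias + layer_norm
    return embedding + encoder + decoder


total = count_t5_params()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
# prints: 222,903,552 parameters (~223M)
```

The result is in line with the roughly 220M parameters usually quoted for T5-base.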

5. Code and Results

5.1 Summarization

The prefix for the summarization task is "summarize: ".

import os

# Turn off TensorFlow's oneDNN optimizations; set this before importing
# transformers in case the import pulls in TensorFlow.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base", legacy=False)
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Input texts: 1, 2 and 4 are Chinese, 3 is English
input_text1 = "summarize: 这是一个关于T5模型的长篇文章,主要内容是讲述:t5-base 是 Google 开发的 T5 (Text-to-Text Transfer Transformer) 模型的中型版本，具备强大的文本到文本任务处理能力。这意味着任何自然语言任务都可以被转换为输入文本和输出文本的形式，因此 t5-base 在多种实际应用中表现出色。"
input_text2 = "总结或者概括: 春天，是大自然最温柔的笔触。万物复苏，绿意盎然，仿佛一夜之间，世界被重新上色。细雨绵绵，滋润着每一寸土地，唤醒了沉睡的花朵，它们竞相绽放，争奇斗艳。柳树轻摇，抽出嫩绿的新芽，随风起舞，宛如少女的秀发。空气中弥漫着泥土与花香的清新，让人心旷神怡。燕子归来，穿梭在蓝天与屋檐之间，忙碌而又欢快。春天，是生命的赞歌，是希望的开始，它以无限的生机与活力，温暖着每一个渴望生长的心灵。"
input_text3 = "summarize: Spring is the gentlest stroke of nature's brush. Everything comes to life, and the world is painted green, as if overnight. Gentle rain nourishes every inch of land, awakening dormant flowers that bloom in a riot of colors. Willow trees sway lightly, sprouting tender green buds and dancing with the wind, resembling a maiden's flowing hair. The air is filled with the freshness of earth and floral fragrances, making one's heart feel light and joyful. Swallows return, weaving between the blue sky and rooftops, busy yet joyful. Spring is a hymn to life, the beginning of hope, warming every heart that longs to grow with its boundless vitality and energy."
input_text4 = "总结: 月球是地球唯一的天然卫星，距离地球大约23.8万公里。它的直径约为2159英里（约3474公里），大约是地球的四分之一大小。月球表面布满了陨石坑、山脉和平原，其中最著名的包括雨海盆地和静海平原。月球没有大气层，因此没有空气和水，表面温度变化极大。自古以来，月球就激发了人类无限的好奇与探索欲望。。"
input_texts = [input_text1, input_text2, input_text3, input_text4]
for text in input_texts:
    # Encode the input text
    input_ids = tokenizer(text, return_tensors="pt").input_ids

    # Generate the output with beam search
    outputs = model.generate(input_ids, max_length=50, num_beams=5, early_stopping=True)

    # Decode the generated token ids back to text
    output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print("-------------------------------------------------------")
    print("Generated Text:", output_text)

The results show almost no support for Chinese, while the English output is acceptable.
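One likely reason for the poor Chinese results is the tokenizer: the T5 SentencePiece vocabulary was built mostly from English text, so most Chinese characters presumably map to the unknown token before the model ever sees them. The sketch below is one way to check this. `unknown_ratio` and `report` are hypothetical helper names, and the claim about the vocabulary is an assumption that running `report` can confirm or refute.

```python
def unknown_ratio(input_ids, unk_token_id):
    """Fraction of token ids equal to the tokenizer's unknown-token id."""
    ids = list(input_ids)
    if not ids:
        return 0.0
    return sum(1 for i in ids if i == unk_token_id) / len(ids)


def report(texts, model_name="t5-base"):
    """Print the unknown-token ratio for each text (needs transformers)."""
    # Imported lazily so that unknown_ratio stays usable without transformers.
    from transformers import T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(model_name, legacy=False)
    for text in texts:
        ids = tokenizer(text).input_ids
        ratio = unknown_ratio(ids, tokenizer.unk_token_id)
        print(f"{ratio:.0%} unknown: {text[:40]}")
```

Calling `report([input_text2, input_text3])` on the Chinese and English spring paragraphs from the script above should make any gap between the two languages visible.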

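5.2 Translation

T5 was pre-trained on C4 (Colossal Clean Crawled Corpus), which consists mainly of English web pages. As a result:

During unsupervised pre-training, T5 is essentially a monolingual (English) model.
Translation ability has to come from additional supervised training. The released t5-base checkpoint was also trained on a few translation tasks (English to German, French, and Romanian), so those directions can work out of the box. Other language pairs, including anything involving Chinese, would need fine-tuning. (This is only a trial, so I did not fine-tune.)

Translation prefixes look like "translate English to German: ".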

import os

# Turn off TensorFlow's oneDNN optimizations; set this before importing
# transformers in case the import pulls in TensorFlow.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base", legacy=False)
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Input texts: one supported direction and three that involve Chinese
input_text1 = "translate English to German: Spring is the gentlest stroke of nature's brush."
input_text2 = "translate Chinese to German: 春天万物复苏，绿意盎然，仿佛一夜之间，世界被重新上色。"
input_text3 = "translate English to Chinese: Spring is the gentlest stroke of nature's brush. "
input_text4 = "translate Chinese to German: 月球是地球唯一的天然卫星，距离地球大约23.8万公里。自古以来，月球就激发了人类无限的好奇与探索欲望。。"
input_texts = [input_text1, input_text2, input_text3, input_text4]
for text in input_texts:
    # Encode the input text
    input_ids = tokenizer(text, return_tensors="pt").input_ids

    # Generate the output with beam search
    outputs = model.generate(input_ids, max_length=50, num_beams=5, early_stopping=True)

    # Decode the generated token ids back to text
    output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print("-------------------------------------------------------")
    print(text)
    print("Generated Text:", output_text)
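The prefixes used in 5.1 and 5.2 can be gathered in one place. The sketch below lists four prefixes assumed to match the `task_specific_params` of the public `t5-base` config. The `TASK_PREFIXES` table and the `build_prompt` helper are illustrative names, not part of transformers.

```python
# Prefixes assumed to match task_specific_params in the public t5-base config.
TASK_PREFIXES = {
    "summarization": "summarize: ",
    "translation_en_to_de": "translate English to German: ",
    "translation_en_to_fr": "translate English to French: ",
    "translation_en_to_ro": "translate English to Romanian: ",
}


def build_prompt(task, text):
    """Prepend the prefix for a known task; fail loudly for anything else."""
    try:
        prefix = TASK_PREFIXES[task]
    except KeyError:
        known = ", ".join(sorted(TASK_PREFIXES))
        raise ValueError(f"no prefix for task {task!r}; known tasks: {known}") from None
    return prefix + text.strip()


print(build_prompt("translation_en_to_de", "Spring is the gentlest stroke of nature's brush."))
```

Failing on an unknown task name, rather than silently inventing a prefix such as "translate Chinese to German: ", makes it obvious when a request falls outside what the checkpoint was trained on.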

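5.3 Text Classification

The model is claimed to support sentiment classification, topic classification, emotion detection, intent classification, and so on.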

import os

# Turn off TensorFlow's oneDNN optimizations; set this before importing
# transformers in case the import pulls in TensorFlow.
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base", legacy=False)
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Input texts: sentiment, topic and emotion classification prompts
input_text1 = "classify sentiment: I love this product very much!.I am very happy."
input_text2 = "classify sentiment: I hate this product very much!.I am very sad."
input_text3 = "classify sentiment: I am not sure about this product.I am not sure.  "
input_text4 = "classify sentiment: 我非常讨厌这个人.I am very angry."
input_text5 = "classify topic: The team won their third championship title this season."
input_text6 = "classify topic: The team lost their first match of the season."
input_text7 = "classify topic: The team is playing their first match of the season."
input_text8 = "classify emotion: I can't believe this happened, I'm so angry!"
input_text9 = "classify emotion: I love this product very much!.I am very happy.!"
input_texts = [input_text1, input_text2, input_text3, input_text4, input_text5,
               input_text6, input_text7, input_text8, input_text9]

for text in input_texts:
    # Encode the input text
    input_ids = tokenizer(text, return_tensors="pt").input_ids

    # Generate the output with beam search
    outputs = model.generate(input_ids, max_length=20, num_beams=2, early_stopping=True)

    # Decode the generated token ids back to text
    output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print("-------------------------------------------------------")
    print(text)
    print("Generated Text:", output_text)

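The results, however, were very poor: the English output is already a mess, to say nothing of Chinese. After trying sentiment classification, topic classification, and emotion detection, I did not bother testing the remaining classification types. The output is shown below.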

-------------------------------------------------------
classify sentiment: I love this product very much!.I am very happy.
Generated Text: sentiment: I love this product very much!.I am very happy.
-------------------------------------------------------
classify sentiment: I hate this product very much!.I am very sad.
Generated Text: sentiment: I hate this product very much!.I am very sad.
-------------------------------------------------------
classify sentiment: I am not sure about this product.I am not sure.
Generated Text: sentiment: I am not sure about this product.I am not sure about this product.
-------------------------------------------------------
classify sentiment: 我非常讨厌这个人.I am very angry.
Generated Text: .I am very angry.
-------------------------------------------------------
classify topic: The team won their third championship title this season.
Generated Text: False
-------------------------------------------------------
classify topic: The team lost their first match of the season.
Generated Text: False
-------------------------------------------------------
classify topic: The team is playing their first match of the season.
Generated Text: True
-------------------------------------------------------
classify emotion: I can't believe this happened, I'm so angry!
Generated Text: emotion: I can't believe this happened, I'm so angry!
-------------------------------------------------------
classify emotion: I love this product very much!.I am very happy.!
Generated Text: emotion: I love this product very much!.I am very happy.!
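The echoed inputs are consistent with the model not recognizing "classify sentiment: " as a task prefix. In the T5 paper, classification tasks are phrased with dataset-specific prefixes, for example "sst2 sentence: " for sentiment, with "positive" or "negative" as the target text. Below is a minimal sketch under that assumption. `sst2_prompt` and `classify_sentiment` are hypothetical helpers, and the generated labels have not been verified here.

```python
def sst2_prompt(sentence):
    """Format a sentence the way the SST-2 task is phrased in the T5 paper."""
    return "sst2 sentence: " + sentence.strip()


def classify_sentiment(sentences, model_name="t5-base"):
    """Return the generated label text for each sentence (needs transformers and torch)."""
    # Imported lazily so that sst2_prompt stays usable without transformers.
    from transformers import T5Tokenizer, T5ForConditionalGeneration

    tokenizer = T5Tokenizer.from_pretrained(model_name, legacy=False)
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    labels = []
    for sentence in sentences:
        input_ids = tokenizer(sst2_prompt(sentence), return_tensors="pt").input_ids
        # The expected targets are single words, so a short output is enough.
        outputs = model.generate(input_ids, max_length=5)
        labels.append(tokenizer.decode(outputs[0], skip_special_tokens=True))
    return labels
```

For example, `classify_sentiment(["I love this product very much!"])` would be expected to return a one-word label rather than an echo of the input, though that depends on the checkpoint. Topic and emotion classification have no comparable prefix that I know of in the original training mixture, so those would still need fine-tuning.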