
Text Preprocessing

article 2025/2/5 2:22:24

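I. Basic Units of Text

1. Token

Definition: the smallest unit of text that a processing step works with, such as a word or a punctuation mark.

Example:

Original sentence: "I love NLP."

Tokenization result: ['I', 'love', 'NLP', '.']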
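The split shown in the example can be reproduced with a small regular expression. This is only a minimal sketch using the standard library; the function name `simple_tokenize` is made up for illustration, and real tokenizers handle contractions, abbreviations, and other edge cases that this pattern ignores.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("I love NLP."))  # ['I', 'love', 'NLP', '.']
```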

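2. Syntax and Semantics

Syntax: the structure of words and the rules for combining them into sentences.

Semantics: the meaning of words and how they are understood in context.

Example:

The sentence "Time flies like an arrow." can be read in more than one way:

Time passes as quickly as an arrow. (Here "flies" is the verb and "like" is a preposition.)

A kind of insect called "time flies" is fond of an arrow. (Here "flies" is a noun and "like" is the verb.)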

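II. Basic Text Preprocessing

1. Tokenization

English tokenization: split on whitespace and punctuation.
Chinese tokenization: Chinese text has no spaces between words, so segmentation relies on statistical and rule-based methods, for example Jieba.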
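To make the idea of rule-based Chinese segmentation concrete, here is a minimal sketch of forward maximum matching, a simple dictionary-based baseline. It is not how Jieba works internally; the function name, the `max_len` parameter, and the toy dictionary are all assumptions made for illustration.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink until a dictionary word is found
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical toy dictionary, for illustration only
vocab = {"我", "喜欢", "自然", "语言", "自然语言", "处理", "自然语言处理"}

print(forward_max_match("我喜欢自然语言处理", vocab, max_len=6))
# ['我', '喜欢', '自然语言处理']
print(forward_max_match("我喜欢自然语言处理", vocab, max_len=4))
# ['我', '喜欢', '自然语言', '处理']
```

The two calls show how the result depends on the dictionary and the maximum word length, which is one reason practical segmenters combine dictionaries with statistical models.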

2. Stop-Word Removal

Stop words: words that carry little meaning on their own or occur very frequently, for example "the", "is", "and".
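Stop-word removal amounts to a membership test against a word list. The sketch below uses a tiny hand-written set, which is an assumption for illustration; real stop-word lists are much longer.

```python
# Tiny hand-written stop-word set (illustrative assumption, not a real list)
STOP_WORDS = {"the", "is", "and", "a", "of"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Lowercase each token and keep only those not in the stop-word set."""
    return [t.lower() for t in tokens if t.lower() not in stop_words]

print(remove_stop_words(["The", "cat", "is", "black", "and", "white"]))
# ['cat', 'black', 'white']
```

Lowercasing before the lookup matters because stop-word lists are usually stored in lowercase, so "The" would otherwise slip through.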

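3. Stemming

Reduce a word to its stem by stripping affixes, for example running → run. The resulting stem is not always a dictionary word.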
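The suffix-stripping idea behind stemming can be sketched in a few lines. This toy function is not the Porter algorithm; the name `toy_stem` and its three rules are assumptions chosen only to show the mechanism on simple inputs.

```python
def toy_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Require at least 3 remaining characters so short words like "is" or "sing" are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            # Undo consonant doubling after "ing"/"ed": "runn" -> "run"
            if (suffix != "s" and len(word) >= 2
                    and word[-1] == word[-2] and word[-1] not in "aeiouls"):
                word = word[:-1]
            break
    return word

print([toy_stem(w) for w in ["running", "jumped", "stars"]])
# ['run', 'jump', 'star']
```

Because the rules look only at spelling, a function like this will produce wrong stems for many words; a production stemmer applies a much larger, ordered rule set.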

4. Lemmatization

Reduce a word to its dictionary base form (lemma) by taking grammar into account, for example mice → mouse.
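Unlike stemming, lemmatization needs a lexicon and the word's part of speech. The sketch below illustrates this with a hypothetical four-entry lookup table and one regular rule; the table, the POS labels, and the function name are assumptions, and a real lemmatizer relies on a full dictionary such as WordNet.

```python
# Hypothetical miniature lexicon keyed by (word, part of speech)
IRREGULAR = {
    ("mice", "noun"): "mouse",
    ("geese", "noun"): "goose",
    ("went", "verb"): "go",
    ("better", "adj"): "good",
}

def toy_lemmatize(word, pos="noun"):
    """Return the base form of a word for the given part of speech."""
    word = word.lower()
    # Irregular forms are resolved by dictionary lookup
    if (word, pos) in IRREGULAR:
        return IRREGULAR[(word, pos)]
    # Regular noun plural: drop a trailing "s" (but not "ss")
    if pos == "noun" and word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

print(toy_lemmatize("mice"))            # mouse
print(toy_lemmatize("better", "adj"))   # good
print(toy_lemmatize("better", "noun"))  # better
```

The last two calls show why the part of speech matters: the same surface form maps to different lemmas depending on how it is used.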

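III. Text Preprocessing with NLTK

NLTK (Natural Language Toolkit) is an open-source Python library for research and development in natural language processing (NLP). It provides tools and resources, such as tokenizers, stemmers, taggers, and corpora, for processing, analyzing, and understanding human-language text.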

1. Preprocessing Components

Tokenization: nltk.tokenize.word_tokenize
Stop-word list: nltk.corpus.stopwords
Stemming: nltk.stem.PorterStemmer
Lemmatization: nltk.stem.WordNetLemmatizer
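Most of these components load data files that have to be downloaded once with `nltk.download` before first use. The helper below is a sketch under assumptions: the resource names are taken to be the standard NLTK data identifiers, which vary between releases (newer versions look for the `punkt_tab` and `averaged_perceptron_tagger_eng` variants), and the function name is made up for illustration.

```python
# Resource identifiers assumed from the standard NLTK data index; names vary by NLTK version
NLTK_RESOURCES = [
    "punkt",                           # tokenizer models for word_tokenize
    "punkt_tab",                       # tokenizer tables used by newer releases
    "stopwords",                       # stop-word lists for nltk.corpus.stopwords
    "wordnet",                         # lexicon used by WordNetLemmatizer
    "averaged_perceptron_tagger",      # model used by pos_tag
    "averaged_perceptron_tagger_eng",  # tagger model name used by newer releases
]

def download_nltk_resources(resources=NLTK_RESOURCES):
    """Download each resource and return a dict mapping its name to a success flag."""
    import nltk  # imported lazily so this module loads even if NLTK is not installed yet
    return {name: nltk.download(name, quiet=True) for name in resources}
```

Calling `download_nltk_resources()` once is enough; a `False` entry in the result usually means that name does not exist in the installed version's data index and can be ignored if the other variant succeeded.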

2. Case Study

Use Python to tokenize your own text data, remove stop words, and count the remaining tokens.

The text is as follows:

"Dr. Smith's favorite movie in 2024 is 'Inception'; he rates it 9/10 stars! Isn't that amazing? Let's analyze this #text with NLP techniques: @homework1.py, line 42."

The code is as follows:

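# Requires the NLTK data packages for tokenization, stop words, WordNet, and POS tagging
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize


def get_wordnet_pos(treebank_tag):
    # Stand-in for the author's util.get_wordnet_pos (from src.common), which is not shown.
    # Assumption: it maps Penn Treebank tags to WordNet POS constants and defaults to noun.
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN


def text_prepare(text):
    # Tokenize
    print(f"Original text: {text}")
    tokens = word_tokenize(text)
    print(f"After tokenization: {tokens}")

    # Remove stop words (a set makes the membership test fast)
    en_stopwords = set(stopwords.words('english'))
    print(f"Token count before stop-word removal: {len(tokens)}")
    filter_stop_words = [t.lower() for t in tokens if t.lower() not in en_stopwords]
    print(f"After stop-word removal: {filter_stop_words}")
    print(f"Token count after stop-word removal: {len(filter_stop_words)}")

    # Stem each remaining token
    porter_stemmer = PorterStemmer()
    prepare_stem = [porter_stemmer.stem(t) for t in filter_stop_words]
    print(f"After stemming: {prepare_stem}")

    # Tag parts of speech (needed to choose the right lemma)
    tagged_pos = pos_tag(filter_stop_words)
    print(f"After POS tagging: {tagged_pos}")

    # Lemmatize using the WordNet POS derived from each tag
    lemmatizer = WordNetLemmatizer()
    prepare_lemma = [lemmatizer.lemmatize(word, get_wordnet_pos(pos))
                     for word, pos in tagged_pos]
    print(f"After lemmatization: {prepare_lemma}")


def main():
    file_path = "example"  # text file containing the sample sentence
    with open(file_path, "r", encoding="utf-8") as file:
        text = file.read()
    text_prepare(text)


if __name__ == '__main__':
    main()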

Output:

Original text: "Dr. Smith's favorite movie in 2024 is 'Inception'; he rates it 9/10 stars! Isn't that amazing? Let's analyze this #text with NLP techniques: @homework1.py, line 42."

After tokenization: ['``', 'Dr.', 'Smith', "'s", 'favorite', 'movie', 'in', '2024', 'is', "'Inception", "'", ';', 'he', 'rates', 'it', '9/10', 'stars', '!', 'Is', "n't", 'that', 'amazing', '?', 'Let', "'s", 'analyze', 'this', '#', 'text', 'with', 'NLP', 'techniques', ':', '@', 'homework1.py', ',', 'line', '42', '.', "''"]

Token count before stop-word removal: 40

After stop-word removal: ['``', 'dr.', 'smith', "'s", 'favorite', 'movie', '2024', "'inception", "'", ';', 'rates', '9/10', 'stars', '!', "n't", 'amazing', '?', 'let', "'s", 'analyze', '#', 'text', 'nlp', 'techniques', ':', '@', 'homework1.py', ',', 'line', '42', '.', "''"]

Token count after stop-word removal: 32

After stemming: ['``', 'dr.', 'smith', "'s", 'favorit', 'movi', '2024', "'incept", "'", ';', 'rate', '9/10', 'star', '!', "n't", 'amaz', '?', 'let', "'s", 'analyz', '#', 'text', 'nlp', 'techniqu', ':', '@', 'homework1.pi', ',', 'line', '42', '.', "''"]

After POS tagging: [('``', '``'), ('dr.', 'NN'), ('smith', 'NN'), ("'s", 'POS'), ('favorite', 'JJ'), ('movie', 'NN'), ('2024', 'CD'), ("'inception", 'NN'), ("'", "''"), (';', ':'), ('rates', 'NNS'), ('9/10', 'CD'), ('stars', 'NNS'), ('!', '.'), ("n't", 'RB'), ('amazing', 'VBG'), ('?', '.'), ('let', 'NN'), ("'s", 'POS'), ('analyze', 'JJ'), ('#', '#'), ('text', 'JJ'), ('nlp', 'NN'), ('techniques', 'NNS'), (':', ':'), ('@', 'NN'), ('homework1.py', 'NN'), (',', ','), ('line', 'NN'), ('42', 'CD'), ('.', '.'), ("''", "''")]

After lemmatization: ['``', 'dr.', 'smith', "'s", 'favorite', 'movie', '2024', "'inception", "'", ';', 'rate', '9/10', 'star', '!', "n't", 'amaze', '?', 'let', "'s", 'analyze', '#', 'text', 'nlp', 'technique', ':', '@', 'homework1.py', ',', 'line', '42', '.', "''"]
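Note that the count of 32 after stop-word removal still includes punctuation tokens such as ';' and '``'. If the goal is to count only word-like tokens, one possible refinement is to keep tokens that contain at least one letter or digit. The sketch below applies that rule to the filtered list from the output; the function name and the rule itself are assumptions, not part of the original code.

```python
def count_word_tokens(tokens):
    """Count tokens that contain at least one letter or digit."""
    return sum(1 for t in tokens if any(ch.isalnum() for ch in t))

# Token list copied from the stop-word removal step of the output
filtered = ['``', 'dr.', 'smith', "'s", 'favorite', 'movie', '2024', "'inception",
            "'", ';', 'rates', '9/10', 'stars', '!', "n't", 'amazing', '?', 'let',
            "'s", 'analyze', '#', 'text', 'nlp', 'techniques', ':', '@',
            'homework1.py', ',', 'line', '42', '.', "''"]

print(len(filtered))                # 32
print(count_word_tokens(filtered))  # 21
```

Under this rule, fragments such as "'s" and "n't" still count as words because they contain letters; whether to drop them as well depends on the task.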
