当前位置：首页 > article >正文

Chapter2 文本规范化

article 2025/3/4 5:00:42

在NLP任务中，第一步往往要做文本规范化，其作用是将字符串表示成更易于计算机处理的规范形式。一般包括三个步骤：1.分词 2.词规范化 3.分句

2.1 分词

分词：将一段字符序列文本转化成词元(token)序列的过程。词元不一定等于词，可以是字符、子词和词等。

2.1.1 基于空格与标点符号的分词

import re

sentence = 'I learn natural language processing with dongshouxueNLP, too.'
tokens = sentence.split(' ')
print(f"输入语句：{sentence}")
print(f"分词结果：{tokens}")
# 分词结果：['I', 'learn', 'natural', 'language', 'processing', 'with', 'dongshouxueNLP,', 'too.']

# 去除句子中的 . 和 ,
sentence = re.sub(r'[,\.]', '', string=sentence)
print(sentence) # I learn natural language processing with dongshouxueNLP too
tokens = sentence.split(' ')
print(f"分词结果：{tokens}")
# 分词结果：['I', 'learn', 'natural', 'language', 'processing', 'with', 'dongshouxueNLP', 'too']

2.1.2 基于正则表达式的分词

import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import regexp_tokenize

sentence = "Did you spend $3.4 on arxiv.org for your pre-print?" +\
           " No, it's free! It's..."
pattern = r"\w+"
print(re.findall(pattern, sentence))
# ['Did', 'you', 'spend', '3', '4', 'on', 'arxiv', 'org', 'for', 'your',
# 'pre', 'print', 'No', 'it', 's', 'free', 'It', 's']

pattern = r"\w+|\S\w*"
print(re.findall(pattern, sentence))
# ['Did', 'you', 'spend', '$3', '.4', 'on', 'arxiv', '.org', 'for', 'your', 'pre', '-print', '?', 'No', ',', 'it',
# "'s", 'free', '!', 'It', "'s", '.', '.', '.']

print("1.匹配可能含有连续字符的词")
pattern = r"\w+(?:[-']\w+)*"
print(re.findall(pattern, sentence))
# ['Did', 'you', 'spend', '3', '4', 'on', 'arxiv', 'org', 'for', 'your', 'pre-print', 'No', "it's", 'free', "It's"]

# 和前面的pattern组合
pattern = r"\w+(?:[-']\w+)*|\S\w*"
print(re.findall(pattern, sentence))
# ['Did', 'you', 'spend', '$3', '.4', 'on', 'arxiv', '.org', 'for', 'your', 'pre-print', '?', 'No', ',',
# "it's", 'free', '!', "It's", '.', '.', '.']

print("2.匹配简写和网址")
new_pattern = r"(?:\w+\.)+\w+(?:\.)*"
pattern = new_pattern + r"|" + pattern
print(re.findall(pattern, sentence))
# ['Did', 'you', 'spend', '$3', '.4', 'on', 'arxiv.org', 'for', 'your', 'pre-print', '?', 'No', ',', "it's", 'free',
# '!', "It's", '.', '.', '.']

print("3.货币和百分比")
new_pattern2 = r"\$?\d+(?:\.\d+)?%?"
pattern = new_pattern2 + r"|" + pattern
print(re.findall(pattern, sentence))
# ['Did', 'you', 'spend', '$3.4', 'on', 'arxiv.org', 'for', 'your', 'pre-print', '?', 'No', ',', "it's", 'free',
# '!', "It's", '.', '.', '.']

print("4.英文省略号")
new_pattern3 = r"\.\.\."
pattern = new_pattern3 + r"|" + pattern
print(re.findall(pattern, sentence))
# ['Did', 'you', 'spend', '$3.4', 'on', 'arxiv.org', 'for', 'your', 'pre-print', '?', 'No', ',', "it's", 'free', '!',
# "It's", '...']

print('----------------')
print("NLTK是基于python的NLP工具包，也可以实现前面基于正则表达式的分词")
tokens = regexp_tokenize(sentence, pattern)
print(tokens)

Tips:

1、\w表示匹配a-z、A-Z、0-9、_这四种类型的字符，等价于[a-zA-Z0-9_]；+ 表示匹配前面的表达式1次或者多次。

2、\s在正则表达式中表示空格字符，\S表示\s的补集。

3、* 表示匹配前面的表达式0次或多次，\S\w*表示先匹配除空格外的1个字符，后面可以包含0个或多个\w字符。

4、- 表示匹配连字符，(?:[-']\w+)*表示匹配0次或多次括号内的模式。(?: ...)表示匹配括号内的模式，可以和 +/*等字符连用，其中，?: 表示不保存匹配到的括号中的内容。

5、. 表示匹配任意字符，因此需要 \. 才能表示 . 本身，其他的像+ ? “” $ 等符号也需要加 \ 。

参考：正则表达式-常见正则表达式符号和特殊字符_正则匹配特殊符号-CSDN博客

2.1.4 基于子词的分词

当前常见三种基于子词的分词方法：

1、字节对编码（BPE）

2、一元语言建模分词

3、词片（WordPiece）

（1）一个词元学习器在大量训练语料上学习，并构建出一个词表（表示词元的集合）。

（2）给定一个新的句子，词元分词器会根据总结出的词表进行分词。

print("step1 根据语料构建初始的词表")
str1 = 'nan ' * 5
str2 = 'nanjing ' * 2
str3 = 'beijing ' * 6
str4 = 'dongbei ' * 3
str5 = 'bei bei'
corpus = str1 + str2 + str3 + str4 + str5
tokens = corpus.split(' ')

# 构建基于字符的初始词表
vocabulary = set(corpus)
vocabulary.remove(' ')
vocabulary.add('_')
vocabulary = sorted(list(vocabulary))

# 根据语料构建词频统计表
corpus_dict = dict()
for token in tokens:
    key = token
    token += '_'
    if key not in corpus_dict:
        corpus_dict[key] = {'split': list(token), 'count': 0}
    corpus_dict[key]['count'] += 1

# 打印语料和词表
print("词频统计表：")
for key in corpus_dict.keys():
    print(corpus_dict[key]['count'], corpus_dict[key]['split'])
print("词表：", vocabulary)

print("step2 词元学习器通过迭代的方式逐步组合新的符号并加入词表中")
for step in range(10):
    print(f"第{step+1}次迭代：")
    split_dict = {}
    for key in corpus_dict:
        splits = corpus_dict[key]['split']
        for i in range(len(splits) - 1):
            current_group = splits[i] + splits[i+1]
            if current_group not in split_dict:
                split_dict[current_group] = 0
            split_dict[current_group] += corpus_dict[key]['count']

    group_hist = [(k, v) for k, v in sorted(split_dict.items(),
                  key=lambda x: x[1], reverse=True)]
    print(f"当前最常出现的前5个符号组合：{group_hist[: 5]}")

    merge_key = group_hist[0][0]
    print("本次迭代组合的符号为：", merge_key)
    vocabulary.append(merge_key)

    for key in corpus_dict:
        splits = corpus_dict[key]['split']
        new_splits = []
        i = 0
        while i < len(splits):
            if i+1 >= len(splits):
                new_splits.append(splits[i])
                i += 1
                break
            if splits[i] + splits[i+1] == merge_key:
                new_splits.append(merge_key)
                i += 2
            else:
                new_splits.append(splits[i])
                i += 1
        corpus_dict[key]['split'] = new_splits

    print()
    print("迭代后的词频统计表为：")
    for key in corpus_dict:
        print(corpus_dict[key]['count'], corpus_dict[key]['split'])
    print("迭代后的词表为：", vocabulary)
    print('-------------------------')

print("基于BPE的词元分词器")
ordered_vocabulary = {key: x for x, key in enumerate(vocabulary)}
sentence = 'nanjing beijing'
print("输入语句：", sentence)
tokens = sentence.split(' ')
tokenized_string = []
for token in tokens:
    splits = list(token)
    key = token + '_'
    flag = 1
    while flag:
        flag = 0
        split_dict = {}
        for i in range(len(splits) - 1):
            current_group = splits[i] + splits[i+1]
            if current_group not in ordered_vocabulary:
                continue
            if current_group not in split_dict:
                split_dict[current_group] = ordered_vocabulary[current_group]
                flag = 1
        if not flag:
            continue

        group_hist = [(k, v) for k, v in sorted(split_dict.items(),
                      key=lambda x: x[1])]

        merge_key = group_hist[0][0]
        new_splits = []
        i = 0

        while i < len(splits):
            if i+1 >= len(splits):
                new_splits.append(splits[i])
                i += 1
                break
            if splits[i] + splits[i+1] == merge_key:
                new_splits.append(merge_key)
                i += 2
            else:
                new_splits.append(splits[i])
                i += 1
        splits = new_splits
    tokenized_string += splits
print("分词结果：", tokenized_string)

2.2 词规范化

将词或词元变成标准形式的过程：标准化缩写、大小写转换、动词目态转化、繁体转简体。

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# 2.2.1 大小写折叠
sentence = "let's study Hands-On-NLP"
print(sentence.lower())

# 2.2.2 词目还原
lemma_dict = {
    'am': 'be', 'is': 'be', 'are': 'be', 'cats': 'cat', "cats'": 'cat', "cat's": 'cat',
    'dogs': 'dog', "dogs'": 'dog', "dog's": 'dog', 'chasing': 'chase'
}

sentence = "Two dogs are chasing three cats"
words = sentence.split(' ')
print("词目还原前：", words)
lemmatized_words = []
for word in words:
    if word in lemma_dict:
        lemmatized_words.append(lemma_dict[word])
    else:
        lemmatized_words.append(word)
print("词目还原后：", lemmatized_words)


# 另外，也可以利用NLTK自带的词典来进行词目还原
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)

lemmatizer = WordNetLemmatizer()
words = word_tokenize(sentence)
print("词目还原前：", words)
lemmatized_words = []
for word in words:
    lemmatized_words.append(lemmatizer.lemmatize(word, wordnet.VERB))
print("词目还原后：", lemmatized_words)

2.3 分句

先做分词，使用基于正则表达式或者基于机器学习的分词方法将文本分解成词元，然后基于标点符号判断句子分界。

import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import regexp_tokenize

pattern = r"\w+(?:[-']\w+)*|\S\w*"
new_pattern = r"(?:\w+\.)+\w+(?:\.)*"
pattern = new_pattern + r"|" + pattern
new_pattern2 = r"\$?\d+(?:\.\d+)?%?"
pattern = new_pattern2 + r"|" + pattern
new_pattern3 = r"\.\.\."
pattern = new_pattern3 + r"|" + pattern

seg_label = [".", "?", '!', '...']
sentence_spliter = set(seg_label)

sentence = "Did you spend $3.4 on arxiv.org for your pre-print?" +\
           " No, it's free! It's..."
tokens = regexp_tokenize(sentence, pattern)
sentences = []
boundary = [0]
for token_id, token in enumerate(tokens):
    if token in sentence_spliter:
        # 如果是句子边界，则把分句结果加入进去
        sentences.append(tokens[boundary[-1]: token_id+1])
        # 将下一句句子起始位置加入boundary
        boundary.append(token_id+1)

if boundary[-1] != len(tokens):
    sentences.append(sentence[boundary[-1]: ])
print("分句结果：")
for seg_sentence in sentences:
    print(seg_sentence)