
The NLTK Toolkit

Installing NLTK

A very practical text-processing toolkit, mainly used for English data, with a long history.

Installation command:
pip install nltk


import nltk
# nltk.download()
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('maxent_ne_chunker')
nltk.download('words')
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\26388\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\words.zip.

True

Tokenization

from nltk.tokenize import word_tokenize
from nltk.text import Text
input_str = "Today's weather is good, very windy and sunny, we have no classes in the afternoon, we have to play basketball tomorrow."
tokens = word_tokenize(input_str)
tokens = [word.lower() for word in tokens]
tokens[:5]
['today', "'s", 'weather', 'is', 'good']
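Notice that `word_tokenize` splits off the clitic `'s` and punctuation as separate tokens. Under the hood it uses NLTK's Punkt/Treebank tokenizers; as a rough illustration of that splitting behavior (not NLTK's actual algorithm), a stdlib regex sketch:

```python
import re

text = "Today's weather is good, very windy and sunny."

# Match runs of letters, clitics like 's, or single punctuation marks,
# then lowercase everything -- a crude approximation of word_tokenize
tokens = [t.lower() for t in re.findall(r"[A-Za-z]+|'[a-z]+|[^\w\s]", text)]
tokens[:5]  # ['today', "'s", 'weather', 'is', 'good']
```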

The Text Object

# view the documentation
help(nltk.text)

Create a Text object to simplify subsequent operations.

t = Text(tokens)
# count occurrences of the word 'good'
t.count('good')
1
# find the position of a word
t.index('good')
4
# plot the 8 most frequent words
t.plot(8)



<AxesSubplot:xlabel='Samples', ylabel='Counts'>
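`t.plot(8)` draws the frequency distribution of the 8 most common tokens. The underlying counts behave like a `collections.Counter`, so the same numbers can be inspected without plotting -- a stdlib sketch using the token list from above:

```python
from collections import Counter

tokens = ['today', "'s", 'weather', 'is', 'good', ',', 'very', 'windy',
          'and', 'sunny', ',', 'we', 'have', 'no', 'classes', 'in', 'the',
          'afternoon', ',', 'we', 'have', 'to', 'play', 'basketball',
          'tomorrow', '.']

# (token, count) pairs, most frequent first -- what t.plot(8) visualizes
top8 = Counter(tokens).most_common(8)
top8[0]  # (',', 3)
```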

Stopword Filtering

from nltk.corpus import stopwords
stopwords.readme().replace('\n', '')
'Stopwords CorpusThis corpus contains lists of stop words for several languages.  Theseare high-frequency grammatical words which are usually ignored in textretrieval applications.They were obtained from:http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/The stop words for the Romanian language were obtained from:http://arlc.ro/resources/The English list has been augmentedhttps://github.com/nltk/nltk_data/issues/22The German list has been correctedhttps://github.com/nltk/nltk_data/pull/49A Kazakh list has been addedhttps://github.com/nltk/nltk_data/pull/52A Nepali list has been addedhttps://github.com/nltk/nltk_data/pull/83An Azerbaijani list has been addedhttps://github.com/nltk/nltk_data/pull/100A Greek list has been addedhttps://github.com/nltk/nltk_data/pull/103An Indonesian list has been addedhttps://github.com/nltk/nltk_data/pull/112'
# check which languages NLTK has stopword lists for
stopwords.fileids()
['arabic',
 'azerbaijani',
 'basque',
 'bengali',
 'catalan',
 'chinese',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hebrew',
 'hinglish',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']
# view the English stopword list
stopwords.raw('english').replace('\n', ' ')
"i me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves he him his himself she she's her hers herself it it's its itself they them their theirs themselves what which who whom this that that'll these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don don't should should've now d ll m o re ve y ain aren aren't couldn couldn't didn didn't doesn doesn't hadn hadn't hasn hasn't haven haven't isn isn't ma mightn mightn't mustn mustn't needn needn't shan shan't shouldn shouldn't wasn wasn't weren weren't won won't wouldn wouldn't "
test_words = [word.lower() for word in tokens]
test_words_set = set(test_words)
# see which stopwords appear in the given text
test_words_set.intersection(set(stopwords.words('english')))
{'and', 'have', 'in', 'is', 'no', 'the', 'to', 'very', 'we'}

Filtering out the stopwords

filtered = [w for w in test_words_set if w not in stopwords.words('english')]
filtered
['afternoon',
 "'s",
 'classes',
 'basketball',
 'today',
 ',',
 'play',
 'weather',
 'windy',
 'sunny',
 'tomorrow',
 '.',
 'good']
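One practical detail: writing `stopwords.words('english')` inside the comprehension re-builds the list for every token, and list membership checks are O(n). For longer texts it is worth converting the stopword list to a set once. A sketch with a small stand-in set (in practice you would use `set(stopwords.words('english'))`):

```python
# Stand-in for set(stopwords.words('english')) -- built once, O(1) lookups
stop_set = {'is', 'and', 'very', 'we', 'have', 'no', 'in', 'the', 'to'}

tokens = ['today', "'s", 'weather', 'is', 'good', ',', 'very', 'windy']
filtered = [w for w in tokens if w not in stop_set]
```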

Part-of-Speech Tagging

from nltk import pos_tag
tags = pos_tag(tokens)
tags
[('today', 'NN'),
 ("'s", 'POS'),
 ('weather', 'NN'),
 ('is', 'VBZ'),
 ('good', 'JJ'),
 (',', ','),
 ('very', 'RB'),
 ('windy', 'JJ'),
 ('and', 'CC'),
 ('sunny', 'JJ'),
 (',', ','),
 ('we', 'PRP'),
 ('have', 'VBP'),
 ('no', 'DT'),
 ('classes', 'NNS'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('afternoon', 'NN'),
 (',', ','),
 ('we', 'PRP'),
 ('have', 'VBP'),
 ('to', 'TO'),
 ('play', 'VB'),
 ('basketball', 'NN'),
 ('tomorrow', 'NN'),
 ('.', '.')]
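The tags follow the Penn Treebank tagset (NN = noun, VBZ = 3rd-person-singular verb, JJ = adjective, PRP = personal pronoun, and so on). A common next step is extracting words by tag; since all noun tags start with `NN` (NN, NNS, NNP, NNPS), a prefix check works:

```python
# A subset of the tagged pairs from above
tags = [('today', 'NN'), ("'s", 'POS'), ('weather', 'NN'), ('is', 'VBZ'),
        ('good', 'JJ'), ('windy', 'JJ'), ('classes', 'NNS'),
        ('play', 'VB'), ('basketball', 'NN')]

# Penn Treebank noun tags all start with 'NN'
nouns = [w for w, t in tags if t.startswith('NN')]
adjectives = [w for w, t in tags if t == 'JJ']
```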

Named Entity Recognition

from nltk import ne_chunk
sentence = "Edison went to Tsinghua University today."
print(ne_chunk(pos_tag(word_tokenize(sentence))))
(S
  (PERSON Edison/NNP)
  went/VBD
  to/TO
  (ORGANIZATION Tsinghua/NNP University/NNP)
  today/NN
  ./.)
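`ne_chunk` returns an `nltk.Tree` in which entity chunks are subtrees with a label (`PERSON`, `ORGANIZATION`, ...) while non-entity tokens remain plain `(word, tag)` tuples. The usual extraction idiom is to keep the nodes that have a `.label()` method. The sketch below uses a minimal `Chunk` stub (a hypothetical stand-in, so it runs without NLTK data) to show the traversal; the same `hasattr(node, 'label')` test works on a real `ne_chunk` result:

```python
class Chunk:
    """Minimal stand-in for an nltk.Tree entity subtree, for illustration."""
    def __init__(self, label, leaves):
        self._label = label
        self._leaves = leaves
    def label(self):
        return self._label
    def leaves(self):
        return self._leaves

# Mirrors the chunked sentence printed above
chunked = [
    Chunk('PERSON', [('Edison', 'NNP')]),
    ('went', 'VBD'),
    ('to', 'TO'),
    Chunk('ORGANIZATION', [('Tsinghua', 'NNP'), ('University', 'NNP')]),
    ('today', 'NN'),
    ('.', '.'),
]

# Entity subtrees have a label() method; plain tuples do not
entities = [(node.label(), ' '.join(word for word, _ in node.leaves()))
            for node in chunked if hasattr(node, 'label')]
# entities == [('PERSON', 'Edison'), ('ORGANIZATION', 'Tsinghua University')]
```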

A Data-Cleaning Example

import re
from nltk.corpus import stopwords
# input data
s = ' RT @Amila #Test\nDurant\'s newly listed Co &amp;Mary\'s unlisted  Group to supply tech for nlTK. \nh $TSLA $AAPL https:// t.co/x34afsfQsh'

# cache the stopword list
cache_english_stopwords = stopwords.words('english')

def text_clean(text):
    print('Raw data:', text, '\n')
    
    # remove HTML entities, hashtags and @-mentions
    text_no_special_entities = re.sub(r'\&\w*;|#\w*|@\w*', '', text)
    print('After removing special tags:', text_no_special_entities, '\n')
    
    # remove ticker symbols
    text_no_tickers = re.sub(r'\$\w*', '', text_no_special_entities)
    print('After removing ticker symbols:', text_no_tickers, '\n')
    
    # remove hyperlinks
    text_no_hyperlinks = re.sub(r'https?:\/\/.*\/\w*', '', text_no_tickers)
    print('After removing hyperlinks:', text_no_hyperlinks, '\n')
    
    # remove very short words (1-2 characters), mostly abbreviations
    text_no_small_words = re.sub(r'\b\w{1,2}\b', '', text_no_hyperlinks)
    print('After removing short words:', text_no_small_words, '\n')
    
    # collapse extra whitespace
    text_no_whitespace = re.sub(r'\s\s+', ' ', text_no_small_words)
    print('After collapsing whitespace:', text_no_whitespace, '\n')
    
    # tokenize
    tokens = word_tokenize(text_no_whitespace)
    print('Tokenization result:', tokens, '\n')
    
    # remove stopwords
    list_no_stopwords = [i for i in tokens if i not in cache_english_stopwords]
    print('After removing stopwords:', list_no_stopwords, '\n')
    # filtered result
    text_filtered = ' '.join(list_no_stopwords)
    print('Filtered:', text_filtered)
    
text_clean(s)
Raw data:  RT @Amila #Test
Durant's newly listed Co &amp;Mary's unlisted  Group to supply tech for nlTK. 
h $TSLA $AAPL https:// t.co/x34afsfQsh 

After removing special tags:  RT  
Durant's newly listed Co Mary's unlisted  Group to supply tech for nlTK. 
h $TSLA $AAPL https:// t.co/x34afsfQsh 

After removing ticker symbols:  RT  
Durant's newly listed Co Mary's unlisted  Group to supply tech for nlTK. 
h   https:// t.co/x34afsfQsh 

After removing hyperlinks:  RT  
Durant's newly listed Co Mary's unlisted  Group to supply tech for nlTK. 
h    

After removing short words:    
Durant' newly listed  Mary' unlisted  Group  supply tech for nlTK. 

After collapsing whitespace:  Durant' newly listed Mary' unlisted Group supply tech for nlTK.  

Tokenization result: ['Durant', "'", 'newly', 'listed', 'Mary', "'", 'unlisted', 'Group', 'supply', 'tech', 'for', 'nlTK', '.'] 

After removing stopwords: ['Durant', "'", 'newly', 'listed', 'Mary', "'", 'unlisted', 'Group', 'supply', 'tech', 'nlTK', '.'] 

Filtered: Durant ' newly listed Mary ' unlisted Group supply tech nlTK .
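The pipeline above prints each intermediate stage but does not return anything. For reuse, the same steps can be condensed into a function that returns the cleaned token list. The sketch below is stdlib-only: it uses a crude regex tokenizer in place of `word_tokenize` and a tiny stand-in stopword set in place of `stopwords.words('english')`, so the exact output can differ slightly from the NLTK version:

```python
import re

# Stand-in for set(stopwords.words('english'))
STOPWORDS = {'for', 'to', 'the', 'and'}

def clean_tweet(text):
    """Condensed version of the text_clean pipeline: returns a token list."""
    text = re.sub(r'&\w*;|#\w*|@\w*', '', text)    # HTML entities, hashtags, mentions
    text = re.sub(r'\$\w*', '', text)              # ticker symbols
    text = re.sub(r'https?://.*/\w*', '', text)    # hyperlinks
    text = re.sub(r'\b\w{1,2}\b', '', text)        # very short words / abbreviations
    text = re.sub(r'\s+', ' ', text).strip()       # collapse whitespace
    tokens = re.findall(r"\w+|[^\w\s]", text)      # crude stdlib tokenizer
    return [t for t in tokens if t not in STOPWORDS]
```

On the sample tweet `s` from above, this yields the same final token list as the step-by-step pipeline.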

