
Word Frequency Counting in Python (Data Cleaning)

article 2025/2/21 3:07:32

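Write a program that, given a passage of English text, counts the number of distinct words in it and finds the top 10% of those words by frequency.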

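Input format:

The input is a non-empty passage of text that ends with the symbol #. The input is guaranteed to contain at least 10 distinct words.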
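
Reading stops at the first `#`, and anything after it is ignored, as the last line of the sample input illustrates. A minimal sketch of that step follows. The helper name `read_until_hash` is hypothetical, and taking an iterable of lines instead of calling `input()` is an assumption made so the function can be tried without standard input.

```python
from typing import Iterable


def read_until_hash(lines: Iterable[str]) -> str:
    """Join lines up to (not including) the first '#'; ignore the rest."""
    parts = []
    for line in lines:
        head, sep, _ = line.partition("#")
        parts.append(head)
        if sep:  # a '#' was found on this line, so the text ends here
            break
    return " ".join(parts)


# Hypothetical sample lines, shortened from the problem's sample input
sample = ["This is a test.", "and this...#", "this line should be ignored."]
print(read_until_hash(sample))
```

For real input, `read_until_hash(sys.stdin)` should work the same way after `import sys`, since line breaks act as word separators anyway.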

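Output format:

On the first line, print the number of distinct words in the text. Note that words are case-insensitive. Then, in order of decreasing frequency, print the top 10% most frequent words in the format <frequency:word> (without the angle brackets). Ties are printed in ascending dictionary order.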
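
The ordering rule (frequency descending, ties in ascending dictionary order) can be expressed as a single sort with a two-part key. This is a minimal sketch rather than the only way to do it; the function name `rank_words` and the tiny word list are made up for illustration.

```python
from collections import Counter


def rank_words(counts):
    """Return (word, count) pairs: highest count first, ties in dictionary order."""
    # Negating the count sorts it in descending order while the word stays ascending.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical data: "is" and "the" tie at 2, so "is" must come first
counts = Counter(["the", "is", "this", "is", "the", "this", "this"])
for word, freq in rank_words(counts):
    print(f"{freq}:{word}")
```

Sorting by count alone with `reverse=True` would leave tied words in insertion order, which is why the word itself is part of the key.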

Sample input:

This is a test.

The word "this" is the word with the highest frequency.

Longlonglonglongword should be cut off, so is considered as the same as longlonglonglonee.  But this_8 is different than this, and this, and this...#
this line should be ignored.
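
The sample hints at how words are delimited: `this_8` counts as its own word, while `"this"`, `this,` and `this...` all count as `this`, and the two long words are treated as equal once cut off. The sketch below assumes a word is a run of ASCII letters, digits, and underscores, and assumes a cut-off length of 15 characters. Neither is stated in the problem text here; 15 is simply the longest prefix the two long sample words share. The name `extract_words` is hypothetical.

```python
import re

MAX_LEN = 15  # assumed cut-off length; not stated in the problem text


def extract_words(text):
    """Lower-case the text and return its words, each truncated to MAX_LEN characters."""
    # Any character outside [a-z0-9_] acts as a separator.
    return [w[:MAX_LEN] for w in re.findall(r"[a-z0-9_]+", text.lower())]


print(extract_words('Longlonglonglongword, longlonglonglonee. But this_8 "this"...'))
```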

Sample output:

23
5:this
4:is

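(Note: although the word "the" also appears 4 times, only the top 10% of the words are printed, that is, the first 2 of the 23 distinct words. In alphabetical order "the" ranks 3rd, so it is not printed.)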
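
The cut-off in this note is taken over the number of distinct words, not over the total number of occurrences: 10% of 23 gives 2 entries. A minimal sketch of that arithmetic, assuming the 10% is rounded down (consistent with the sample, but not stated explicitly). The name `top_tenth` and the filler words `w0` to `w19` are made up for illustration.

```python
def top_tenth(ranked):
    """Keep the first 10% of an already-ranked list, rounding down (assumed)."""
    return ranked[: len(ranked) // 10]


# 23 ranked entries, mirroring the sample: "this", "is", "the", then 20 filler words
ranked = [("this", 5), ("is", 4), ("the", 4)] + [(f"w{i}", 1) for i in range(20)]
for word, freq in top_tenth(ranked):
    print(f"{freq}:{word}")
```

Integer division (`// 10`) avoids floating-point rounding surprises that `* 0.1` can introduce.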

Code example:

import re

# Assumption: the statement above does not define a "word". The sample implies
# that a word is a run of letters, digits, and underscores, and that long words
# are cut off; a cut-off of 15 characters is assumed here.
MAX_LEN = 15

# Read lines until one contains '#'; keep only the text before it
text = ""
while True:
    buffer = input()
    if '#' in buffer:
        text += buffer[:buffer.index('#')]
        break
    text += buffer + ' '

# Count each word, ignoring case
text_dic = {}
for x in re.findall(r'[a-z0-9_]+', text.lower()):
    x = x[:MAX_LEN]
    text_dic[x] = text_dic.get(x, 0) + 1

# Sort by frequency (descending), then by dictionary order (ascending) for ties
sorted_words = sorted(text_dic.items(), key=lambda item: (-item[1], item[0]))

# Print the number of distinct words first
words = len(sorted_words)
print(words)

# Print the top 10% of the distinct words (rounded down)
for word, count in sorted_words[:words // 10]:
    print(f'{count}:{word}')

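I typed all of the code above by hand, so it may contain mistakes or shortcomings. If you have a better approach or any suggestions, feel free to discuss them politely in the comments.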
