27. Deep Learning, A Self-Study Path - NLP (Natural Language Processing): A Simple Project to Classify a Set of Movie Reviews as Positive or Negative
1. To build this project, the first thing we need is a suitable training dataset.
Two files are used here: reviews.txt, the raw review dataset, and labels.txt, the corresponding dataset marking each review as positive or negative. The program below opens both files and assigns their contents to the variables reviews and labels.
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

g = open('reviews.txt', 'r')  # What we know!
reviews = list(map(lambda x: x[:-1], g.readlines()))
g.close()
print(reviews[0])

g = open('labels.txt', 'r')  # What we WANT to know!
labels = list(map(lambda x: x[:-1].upper(), g.readlines()))
g.close()
print(labels[0])
The output is:
This is the review:
bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life such as teachers . my years in the teaching profession lead me to believe that bromwell high s satire is much closer to reality than is teachers . the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn t
And this is its positive/negative label:
POSITIVE
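The helper pretty_print_review_and_label defined above combines these two prints into one line. A quick usage sketch (index 0 is just an example):

pretty_print_review_and_label(0)  # prints "POSITIVE", a tab-separated colon, then the first 80 characters of review 0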
2. In this step we need to do four things:
First, collect all the words that appear in the reviews into a vocabulary list.
For example, reviews.txt contains 25,000 reviews, which together contain a large number of distinct words. We gather these words into a dictionary-like structure that maps each word to a number, e.g. index 1 corresponds to "is", index 2 corresponds to "it", and so on.
Second, once this dictionary is built, look up each distinct word of every review in it to get the word's index number, and use those index numbers as the input to the neural network for training. (A small worked example of this follows after this list.)
Third, encode every review's label as 1 or 0: positive reviews as 1, negative reviews as 0.
Fourth, feed the prepared inputs and target labels into the neural network for training, then use the trained weights to evaluate on the last 1,000 reviews.
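To make the first two steps concrete, here is a minimal sketch on a hypothetical two-review corpus (the toy reviews here are made up for illustration; the real program below applies the same steps to all 25,000 reviews):

# A toy corpus standing in for raw_reviews (hypothetical examples)
toy_reviews = ["this movie is great", "this movie is awful"]

# Step 1: split each review into its set of words and build the vocabulary
tokens = [set(r.split(" ")) for r in toy_reviews]
vocab = list({word for sent in tokens for word in sent})
word2index = {word: i for i, word in enumerate(vocab)}

# Step 2: represent each review as the list of its words' index numbers
input_dataset = [sorted({word2index[w] for w in sent}) for sent in tokens]
print(word2index)     # e.g. {'this': 0, 'movie': 1, ...} (order depends on set iteration)
print(input_dataset)  # each review becomes a short list of integers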
The full program is as follows:
import sys

f = open('reviews.txt')
raw_reviews = f.readlines()
f.close()

f = open('labels.txt')
raw_labels = f.readlines()
f.close()

# Split each review into its set of distinct words
tokens = list(map(lambda x: set(x.split(" ")), raw_reviews))

# Build the vocabulary: every word that appears in any review
vocab = set()
for sent in tokens:
    for word in sent:
        if len(word) > 0:
            vocab.add(word)
vocab = list(vocab)

# Map each word to its index number in the vocabulary
word2index = {}
for i, word in enumerate(vocab):
    word2index[word] = i

# Represent each review as the list of its words' index numbers
input_dataset = list()
for sent in tokens:
    sent_indices = list()
    for word in sent:
        try:
            sent_indices.append(word2index[word])
        except KeyError:
            pass
    input_dataset.append(list(set(sent_indices)))

# Encode the labels: positive -> 1, negative -> 0
target_dataset = list()
for label in raw_labels:
    if label == 'positive\n':
        target_dataset.append(1)
    else:
        target_dataset.append(0)

import numpy as np
np.random.seed(1)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

alpha, iterations = (0.01, 2)
hidden_size = 100

weights_0_1 = 0.2 * np.random.random((len(vocab), hidden_size)) - 0.1
weights_1_2 = 0.2 * np.random.random((hidden_size, 1)) - 0.1

correct, total = (0, 0)
for iter in range(iterations):
    # Train on all but the last 1000 reviews
    for i in range(len(input_dataset) - 1000):
        x, y = (input_dataset[i], target_dataset[i])

        # This line computes the output of layer_1: it selects the rows of
        # weights_0_1 indexed by x, sums them along axis=0, and passes the
        # sum through the sigmoid activation.
        layer_1 = sigmoid(np.sum(weights_0_1[x], axis=0))
        layer_2 = sigmoid(np.dot(layer_1, weights_1_2))

        layer_2_delta = layer_2 - y
        layer_1_delta = layer_2_delta.dot(weights_1_2.T)

        weights_0_1[x] -= layer_1_delta * alpha
        weights_1_2 -= np.outer(layer_1, layer_2_delta) * alpha

        # If the absolute error is below 0.5, layer_2 rounds to the true label.
        if np.abs(layer_2_delta) < 0.5:
            correct += 1
        total += 1
        if i % 10 == 9:
            progress = str(i / float(len(input_dataset)))
            sys.stdout.write('\rIter:' + str(iter)
                             + ' Progress:' + progress[2:4]
                             + '.' + progress[4:6]
                             + '% Training Accuracy:'
                             + str(correct / float(total)) + '%')
    print()

# Evaluate with the trained weights on the last 1000 reviews
correct, total = (0, 0)
for i in range(len(input_dataset) - 1000, len(input_dataset)):
    x = input_dataset[i]
    y = target_dataset[i]
    layer_1 = sigmoid(np.sum(weights_0_1[x], axis=0))
    layer_2 = sigmoid(np.dot(layer_1, weights_1_2))
    if np.abs(layer_2 - y) < 0.5:
        correct += 1
    total += 1
print("Test Accuracy:" + str(correct / float(total)))

# Output:
'''
Iter:0 Progress:95.99% Training Accuracy:0.83275%
Iter:1 Progress:95.99% Training Accuracy:0.8662291666666667%
Test Accuracy:0.846
'''
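One detail worth pausing on is the line layer_1 = sigmoid(np.sum(weights_0_1[x], axis=0)). Because each review is encoded as a list of word indices, summing the selected rows of weights_0_1 gives the same result as multiplying a multi-hot vector (1 at each word index, 0 elsewhere) by the full weight matrix, while skipping all the multiplications by zero. A minimal sketch demonstrating the equivalence (the sizes and indices here are made up for illustration):

import numpy as np

np.random.seed(1)
vocab_size, hidden_size = 8, 4  # toy sizes, for illustration only
weights_0_1 = 0.2 * np.random.random((vocab_size, hidden_size)) - 0.1

x = [1, 3, 6]  # a "review" containing word indices 1, 3, and 6

# Multi-hot encoding: 1 at each word index, 0 elsewhere
multi_hot = np.zeros(vocab_size)
multi_hot[x] = 1

# The two computations produce the same hidden-layer input
print(np.allclose(multi_hot.dot(weights_0_1), np.sum(weights_0_1[x], axis=0)))  # True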
From the run results, the test accuracy has reached 0.846, which is within the normal range for a basic NLP approach like this one.
If you want to download the datasets, I hope you can spend a few CSDN points on them: I have been studying recently and also need to download a lot of other learning materials on CSDN myself, so having some points makes downloading things easier.