pandas Tutorial: MovieLens 1M Dataset

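14.2 MovieLens 1M Dataset

This dataset contains movie ratings, movie metadata (genres and year), and demographic data about the users (age, zip code, gender, occupation, and so on).

The MovieLens 1M dataset contains about 1 million ratings of roughly 4,000 movies from about 6,000 users. It is spread across three tables: ratings, user information, and movie information. Each table is stored as a .dat file and can be loaded into its own pandas DataFrame object with pandas.read_table: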

import pandas as pd
# Make display smaller
pd.options.display.max_rows = 10
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('../datasets/movielens/users.dat', sep='::', 
                      header=None, names=unames)
/Users/xu/anaconda/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:3: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  app.launch_new_instance()

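Because sep='::' is longer than one character, pandas treats it as a regular expression, which the C parser engine does not support. That is what triggers the warning above (it is a warning, not an error). A forum post describes the fix, which is to pass engine='python':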

Looks like on Python 2.7 Pandas just doesn’t handle separators that
look regexish. The initial “error” can be worked around by adding
engine='python' as a named parameter in the call, as suggested in the
warning.

users = pd.read_table('../datasets/movielens/users.dat', sep='::', 
                      header=None, names=unames, engine='python')
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('../datasets/movielens/ratings.dat', sep='::', header=None, names=rnames, engine='python')
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('../datasets/movielens/movies.dat', sep='::', header=None, names=mnames, engine='python')
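
To see the separator handling in isolation, here is a minimal sketch that parses a few '::'-delimited lines from an in-memory string rather than the real users.dat. The names sample and users_sample are hypothetical, and the dtype argument is an addition for this sketch only: with so few rows, pandas would otherwise read the zip column as integers and drop the leading zero. pd.read_csv is used because it accepts the same sep, header, names, and engine arguments as read_table.

```python
import io
import pandas as pd

# Three sample rows in the same '::'-delimited layout as users.dat
sample = io.StringIO(
    "1::F::1::10::48067\n"
    "2::M::56::16::70072\n"
    "4::M::45::7::02460\n"
)

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# engine='python' avoids the ParserWarning for the multi-character separator.
# dtype={'zip': str} (an assumption for this sketch) keeps leading zeros.
users_sample = pd.read_csv(sample, sep='::', header=None, names=unames,
                           engine='python', dtype={'zip': str})
print(users_sample)
```

Passing engine='python' up front gives the same parse the fallback would have produced, just without the warning.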

Look at the first few rows of each DataFrame to verify that the data loaded correctly:

users[:5]
   user_id gender  age  occupation    zip
0        1      F    1          10  48067
1        2      M   56          16  70072
2        3      M   25          15  55117
3        4      M   45           7  02460
4        5      M   25          20  55455
ratings[:5]
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291
movies[:5]
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy

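Note that ages and occupations are given as integer codes; see the dataset's README file for what each code means. Analyzing data spread across three tables is not easy. Suppose we want to compute the mean rating of a particular movie by gender and age. This is much simpler once all the data is merged into a single table. We first use the pandas merge function to merge ratings with users, and then merge that result with movies. pandas infers which columns to use as the merge (or join) keys from the overlapping column names: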

data = pd.merge(pd.merge(ratings, users), movies)
data.head()
   user_id  movie_id  rating  timestamp gender  age  occupation    zip                                   title genres
0        1      1193       5  978300760      F    1          10  48067  One Flew Over the Cuckoo's Nest (1975)  Drama
1        2      1193       5  978298413      M   56          16  70072  One Flew Over the Cuckoo's Nest (1975)  Drama
2       12      1193       4  978220179      M   25          12  32793  One Flew Over the Cuckoo's Nest (1975)  Drama
3       15      1193       4  978199279      M   25           7  22903  One Flew Over the Cuckoo's Nest (1975)  Drama
4       17      1193       5  978158471      M   50           1  95350  One Flew Over the Cuckoo's Nest (1975)  Drama
data.iloc[0]
user_id                                            1
movie_id                                        1193
rating                                             5
timestamp                                  978300760
gender                                             F
age                                                1
occupation                                        10
zip                                            48067
title         One Flew Over the Cuckoo's Nest (1975)
genres                                         Drama
Name: 0, dtype: object
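
The key inference can be made explicit. The sketch below uses three tiny hypothetical frames (ratings_s, users_s, movies_s) and checks that the implicit merge matches a merge with the keys spelled out via on=. The validate='many_to_one' argument is an optional extra, not something the original code uses; it asks pandas to raise an error if a key is duplicated on the right-hand side.

```python
import pandas as pd

# Tiny stand-ins for the three MovieLens tables (made-up values)
ratings_s = pd.DataFrame({'user_id': [1, 2, 2],
                          'movie_id': [10, 10, 20],
                          'rating': [5, 3, 4]})
users_s = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_s = pd.DataFrame({'movie_id': [10, 20],
                         'title': ['A (1990)', 'B (1995)']})

# Implicit keys: pandas joins on the shared column names
implicit = pd.merge(pd.merge(ratings_s, users_s), movies_s)

# Explicit keys, with a check that each key is unique on the right side
explicit = (ratings_s
            .merge(users_s, on='user_id', validate='many_to_one')
            .merge(movies_s, on='movie_id', validate='many_to_one'))

print(implicit.equals(explicit))
```

Naming the keys is a little more verbose, but it guards against an accidental join on a column that merely happens to share a name.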

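Now, with a little familiarity with pandas, it is easy to aggregate the ratings by any combination of user or movie attributes. To get the mean rating of each movie by gender, we can use the pivot_table method: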

mean_ratings = data.pivot_table('rating', index='title',
                                columns='gender', aggfunc='mean')
mean_ratings[:5]
gender                                F         M
title
$1,000,000 Duck (1971)         3.375000  2.761905
'Night Mother (1986)           3.388889  3.352941
'Til There Was You (1997)      2.675676  2.733333
'burbs, The (1989)             2.793478  2.962085
...And Justice for All (1979)  3.828571  3.689024
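
For intuition about what pivot_table does here, the sketch below builds a small hypothetical frame (toy, with made-up ratings) and compares the pivot against the equivalent groupby followed by unstack. It is an illustration of the reshaping, not a replacement for the call above.

```python
import pandas as pd

# Made-up ratings for two movies
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'rating': [4, 2, 5, 3, 1],
})

# One row per title, one column per gender, cells hold mean ratings
by_pivot = toy.pivot_table('rating', index='title',
                           columns='gender', aggfunc='mean')

# Same numbers: group on both keys, then move 'gender' into the columns
by_group = toy.groupby(['title', 'gender'])['rating'].mean().unstack()

print(by_pivot)
print(by_group)
```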

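This produces another DataFrame containing mean ratings, with movie titles as row labels and gender as column labels. Next, we filter out movies that received fewer than 250 ratings (the threshold is arbitrary and can be changed). To do this, we group the data by title and use size() to get a Series of group sizes for each title: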

ratings_by_title = data.groupby('title').size()
ratings_by_title[:10]
title
$1,000,000 Duck (1971)                37
'Night Mother (1986)                  70
'Til There Was You (1997)             52
'burbs, The (1989)                   303
...And Justice for All (1979)        199
1-900 (1994)                           2
10 Things I Hate About You (1999)    700
101 Dalmatians (1961)                565
101 Dalmatians (1996)                364
12 Angry Men (1957)                  616
dtype: int64
active_titles = ratings_by_title.index[ratings_by_title >= 250]
print(active_titles)
Index([''burbs, The (1989)', '10 Things I Hate About You (1999)',
       '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)',
       '13th Warrior, The (1999)', '2 Days in the Valley (1996)',
       '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
       '2010 (1984)',
       ...
       'X-Men (2000)', 'Year of Living Dangerously (1982)',
       'Yellow Submarine (1968)', 'You've Got Mail (1998)',
       'Young Frankenstein (1974)', 'Young Guns (1988)',
       'Young Guns II (1990)', 'Young Sherlock Holmes (1985)',
       'Zero Effect (1998)', 'eXistenZ (1999)'],
      dtype='object', name='title', length=1216)
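
The same count-then-mask pattern on a toy frame, as a minimal sketch. The frame toy, the threshold min_ratings, and the other names are hypothetical and chosen only so the result is easy to check by eye.

```python
import pandas as pd

# Made-up data: A has 3 ratings, B has 1, C has 2
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'C', 'C'],
    'rating': [5, 4, 3, 2, 4, 5],
})

counts = toy.groupby('title').size()   # number of ratings per title
min_ratings = 2                        # hypothetical threshold

# A boolean mask over the counts selects the qualifying index labels
active = counts.index[counts >= min_ratings]
print(list(active))

# The labels can also filter the original rows directly
toy_active = toy[toy['title'].isin(active)]
print(len(toy_active))
```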

Every movie in active_titles has at least 250 ratings. We can use these titles as an index to select the corresponding rows from mean_ratings:

mean_ratings = mean_ratings.loc[active_titles]
mean_ratings
gender                                    F         M
title
'burbs, The (1989)                 2.793478  2.962085
10 Things I Hate About You (1999)  3.646552  3.311966
101 Dalmatians (1961)              3.791444  3.500000
101 Dalmatians (1996)              3.240000  2.911215
12 Angry Men (1957)                4.184397  4.328421
...                                     ...       ...
Young Guns (1988)                  3.371795  3.425620
Young Guns II (1990)               2.934783  2.904025
Young Sherlock Holmes (1985)       3.514706  3.363344
Zero Effect (1998)                 3.864407  3.723140
eXistenZ (1999)                    3.098592  3.289086

1216 rows × 2 columns

To see the top films among female viewers, sort by the F column in descending order:

top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
top_female_ratings[:10]
gender                                                         F         M
title
Close Shave, A (1995)                                   4.644444  4.473795
Wrong Trousers, The (1993)                              4.588235  4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)           4.572650  4.464589
Wallace & Gromit: The Best of Aardman Animation (1996)  4.563107  4.385075
Schindler's List (1993)                                 4.562602  4.491415
Shawshank Redemption, The (1994)                        4.539075  4.560625
Grand Day Out, A (1992)                                 4.537879  4.293255
To Kill a Mockingbird (1962)                            4.536667  4.372611
Creature Comforts (1990)                                4.513889  4.272277
Usual Suspects, The (1995)                              4.513317  4.518248

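1 Measuring Rating Disagreement

Suppose we want to find the movies that are most divisive between male and female viewers. One way is to add a column to mean_ratings containing the difference in means, then sort by that column: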

mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']

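Because diff is the male mean minus the female mean, sorting by 'diff' in ascending order gives the movies with the greatest rating difference that women preferred: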

sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff[:15]
gender                                     F         M       diff
title
Dirty Dancing (1987)                3.790378  2.959596  -0.830782
Jumpin' Jack Flash (1986)           3.254717  2.578358  -0.676359
Grease (1978)                       3.975265  3.367041  -0.608224
Little Women (1994)                 3.870588  3.321739  -0.548849
Steel Magnolias (1989)              3.901734  3.365957  -0.535777
...                                      ...       ...        ...
French Kiss (1995)                  3.535714  3.056962  -0.478752
Little Shop of Horrors, The (1960)  3.650000  3.179688  -0.470312
Guys and Dolls (1955)               4.051724  3.583333  -0.468391
Mary Poppins (1964)                 4.197740  3.730594  -0.467147
Patch Adams (1998)                  3.473282  3.008746  -0.464536

15 rows × 3 columns

Reversing the order of the rows and taking the first 10 gives the movies that men preferred and women rated lower:

# Reverse order of rows, take first 10 rows
sorted_by_diff[::-1][:10]
gender                                         F         M      diff
title
Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
Kentucky Fried Movie, The (1977)        2.878788  3.555147  0.676359
Dumb & Dumber (1994)                    2.697987  3.336595  0.638608
Longest Day, The (1962)                 3.411765  4.031447  0.619682
Cable Guy, The (1996)                   2.250000  2.863787  0.613787
Evil Dead II (Dead By Dawn) (1987)      3.297297  3.909283  0.611985
Hidden, The (1987)                      3.137931  3.745098  0.607167
Rocky III (1982)                        2.361702  2.943503  0.581801
Caddyshack (1980)                       3.396135  3.969737  0.573602
For a Few Dollars More (1965)           3.409091  3.953795  0.544704
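
How the sign of diff maps to the two ends of the sorted table can be checked on a small made-up example. The frame toy_means and its values are hypothetical; only the column names F, M, and diff follow the article.

```python
import pandas as pd

# Made-up mean ratings for three titles
toy_means = pd.DataFrame(
    {'F': [3.8, 3.5, 4.5], 'M': [3.0, 4.2, 4.5]},
    index=pd.Index(['X', 'Y', 'Z'], name='title'),
)

# Negative diff: women rated it higher; positive diff: men rated it higher
toy_means['diff'] = toy_means['M'] - toy_means['F']
sorted_toy = toy_means.sort_values(by='diff')

preferred_by_women = sorted_toy.index[0]       # most negative diff
preferred_by_men = sorted_toy[::-1].index[0]   # most positive diff
print(preferred_by_women, preferred_by_men)
```

Title Z, where both groups agree, ends up in the middle of the sorted frame, which is why both ends of the table need to be inspected.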

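If we instead want the movies that elicited the most disagreement among viewers, regardless of gender, we can measure disagreement by the variance or standard deviation of the ratings:

# Standard deviation of rating grouped by title
rating_std_by_title = data.groupby('title')['rating'].std()
# Filter down to active_titles
rating_std_by_title = rating_std_by_title.loc[active_titles]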
# Order Series by value in descending order
rating_std_by_title.sort_values(ascending=False)[:10]
title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Eyes Wide Shut (1999)                    1.259624
Evita (1996)                             1.253631
Billy Madison (1995)                     1.249970
Fear and Loathing in Las Vegas (1998)    1.246408
Bicentennial Man (1999)                  1.245533
Name: rating, dtype: float64
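
Why standard deviation works as a disagreement measure can be seen on two made-up movies: one whose ratings are split between 1 and 5, and one whose ratings cluster around 3. The frame toy and the titles are hypothetical. Note that pandas computes the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Made-up ratings: one polarizing title, one with broad agreement
toy = pd.DataFrame({
    'title':  ['Polarizing'] * 4 + ['Consensus'] * 4,
    'rating': [1, 5, 1, 5, 3, 3, 4, 3],
})

# Sample standard deviation per title, largest spread first
spread = (toy.groupby('title')['rating']
             .std()
             .sort_values(ascending=False))
print(spread)
```

Both titles have a similar mean rating, so a ranking by mean alone would hide the difference that the standard deviation exposes.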

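Note that movie genres are given as a single string with genres separated by the pipe character (|). To do any analysis by genre, the genre information first has to be transformed into a more usable form.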
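
One possible way to do that transformation, offered as a sketch rather than the only approach, is to split each string into a list and then give every genre its own row with explode (available in pandas 0.25 and later). The frame movies_s holds two rows copied from the movies table shown earlier; the genre column name is an assumption for this example.

```python
import pandas as pd

# Two rows in the same layout as the movies table
movies_s = pd.DataFrame({
    'movie_id': [1, 3],
    'title': ['Toy Story (1995)', 'Grumpier Old Men (1995)'],
    'genres': ["Animation|Children's|Comedy", 'Comedy|Romance'],
})

# Split the pipe-separated string into a list of genres per movie
movies_s['genre'] = movies_s['genres'].str.split('|')

# One row per (movie, genre) pair
exploded = movies_s.explode('genre')
print(exploded[['title', 'genre']])

# Genres can now be counted or grouped like any other column
genre_counts = exploded['genre'].value_counts()
print(genre_counts)
```

Once each genre sits in its own row, the groupby and pivot_table techniques used above for gender apply to genres as well.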
