
Online Sales Dataset Analysis: Hands-On RFM Analysis Practice in Python

article 2025/2/4 19:28:58

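I. Preface

This is a personal practice project. The article records my own learning process, and I am sharing it so we can learn together.
Dataset: online sales dataset
Method: RFM (Recency, Frequency, Monetary) analysis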

II. Process

1.1 Importing libraries and initial setup

import pandas as pd
import datetime
import matplotlib.pyplot as plt

# Use a font that contains Chinese glyphs so Chinese plot titles are not garbled
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']

Loading and previewing the data:

data = pd.read_csv(r"C:\Users\31049\Desktop\在线零售数据集\Online_Retail.csv")
data.head(5)

	InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country
0	536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	2012/1/10 8:26	2.55	17850.0	United Kingdom
1	536365	71053	WHITE METAL LANTERN	6	2012/1/10 8:26	3.39	17850.0	United Kingdom
2	536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	2012/1/10 8:26	2.75	17850.0	United Kingdom
3	536365	84029G	KNITTED UNION FLAG HOT WATER BOTTLE	6	2012/1/10 8:26	3.39	17850.0	United Kingdom
4	536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	2012/1/10 8:26	3.39	17850.0	United Kingdom

1.2 Initial look at the data

data.shape

(541909, 8)

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB

data.isnull().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

# Share of missing values per column
(data.isnull().sum() / data.shape[0]).round(3)

InvoiceNo      0.000
StockCode      0.000
Description    0.003
Quantity       0.000
InvoiceDate    0.000
UnitPrice      0.000
CustomerID     0.249
Country        0.000
dtype: float64

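Findings from the initial look:

~The dataset has 541909 rows and 8 columns.
~InvoiceDate is stored as object and needs to be converted to datetime; CustomerID is stored as float and needs to be converted to str.
~Description has 1454 missing values (about 0.3% of rows), which is negligible. CustomerID has 135080 missing values (about 24.9% of rows), which is substantial; these rows cannot be attributed to a customer, so they are dropped during cleaning.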
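The missing-value share computed above can also be obtained in one step. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration, not rows from the dataset) to show that `isna().mean()` gives the same ratio as dividing the null counts by the row count.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Description": ["A", None, "C", "D"],
    "CustomerID": [17850.0, np.nan, np.nan, 13047.0],
})

# isna() yields a True/False mask; its column mean is the missing share.
missing_share = toy.isna().mean().round(3)

# Same result as the count-divided-by-rows form used in the article.
by_division = (toy.isnull().sum() / toy.shape[0]).round(3)

print(missing_share)
```

Reading the share rather than the raw count makes it easier to see which columns can be ignored and which ones need a decision.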

1.3 Data cleaning

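# Remove cancelled orders: rows whose InvoiceNo starts with the letter 'C'.
# str.startswith is case-sensitive, so the case is normalised before matching.
# Note: the row counts and RFM tables shown in this article are consistent with
# a run in which this step removed no rows (customer 12346 nets out to M = 0.00).
data = data[~data['InvoiceNo'].astype(str).str.upper().str.startswith('C')]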

# Drop duplicate rows
data = data.drop_duplicates()

# Drop rows with a missing CustomerID
data = data.dropna(subset=['CustomerID'])
data.shape

(401604, 8)
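Prefix matching with `str.startswith` is case-sensitive, which matters for the cancellation filter above. The sketch below uses made-up invoice numbers (the `toy` frame is an assumption for illustration) to show how a lowercase-only check and a case-normalised check differ. In the UCI release of this dataset, cancelled invoices are, to my knowledge, prefixed with an uppercase `C`.

```python
import pandas as pd

# Made-up invoice numbers: two regular orders and two cancellations.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536366", "c536380"],
    "Quantity": [6, -1, 2, -3],
})

# Case-sensitive check: only the lowercase 'c' row is flagged.
lower_only = toy["InvoiceNo"].str.startswith("c")

# Case-insensitive check: normalise to uppercase before matching.
any_case = toy["InvoiceNo"].str.upper().str.startswith("C")

# Keep only the rows that are not cancellations.
kept = toy[~any_case]

print(lower_only.sum(), any_case.sum())
print(kept)
```

A quick sanity check after filtering is to compare `data.shape` before and after the step; if the row count does not change, the filter matched nothing.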

# Type conversion
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'], format='mixed', dayfirst=False)
data['CustomerID'] = data['CustomerID'].astype(str)
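Converting a float column straight to `str` keeps the trailing `.0`, which is why the customer IDs in the tables below look like `12346.0`. If cleaner IDs are wanted, one option is to cast to an integer type first. This is a sketch on a made-up series, and it assumes the missing IDs have already been dropped, since a plain `int64` cast fails on NaN.

```python
import pandas as pd

# Made-up customer IDs stored as floats, as in the raw CSV.
ids = pd.Series([17850.0, 13047.0, 12346.0])

# Direct conversion keeps the decimal part of the float representation.
as_str = ids.astype(str)

# Casting to int64 first gives IDs without the trailing '.0'.
clean = ids.astype("int64").astype(str)

print(as_str.tolist())
print(clean.tolist())
```

Either form works as a grouping key; the integer route only changes how the IDs are displayed.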

1.4 Exploratory data analysis (EDA): RFM analysis

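# Add a column with the amount spent on each line item
data['Total'] = data['UnitPrice'] * data['Quantity']

# Use the day after the latest invoice date as the reference point
latest = data['InvoiceDate'].max() + datetime.timedelta(days=1)

# Compute R, F and M per customer
rfm = data.groupby(['CustomerID']).agg({
    'InvoiceDate': lambda x: (latest-x.max()).days,     # R: days between the customer's most recent purchase and latest
    'InvoiceNo': 'count',     # F: purchase frequency, counted as number of line items ('nunique' would count distinct invoices)
    'Total': 'sum'    # M: total amount spent
})

# Rename the aggregated columns
rfm.rename(columns={'InvoiceDate': 'R','InvoiceNo': 'F','Total': 'M'}, inplace=True)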

rfm.head(10)

	R	F	M
CustomerID
12346.0	693	2	0.00
12347.0	153	182	4310.00
12348.0	443	31	1797.24
12349.0	386	73	1757.55
12350.0	3956	17	334.40
12352.0	440	95	1545.41
12353.0	571	4	89.00
12354.0	600	58	1079.40
12355.0	2648	13	459.40
12356.0	390	59	2811.43
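To make the aggregation easier to follow, here is the same idea on a four-row made-up table (the `toy` frame and the column names `F_lines` and `F_invoices` are assumptions for illustration). It uses named aggregation, so the columns get their final names directly, and it shows the difference between counting line items and counting distinct invoices.

```python
import pandas as pd

# Made-up transactions: customer "1" has two invoices (three line items),
# customer "2" has one invoice with one line item.
toy = pd.DataFrame({
    "CustomerID": ["1", "1", "1", "2"],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(
        ["2024-01-01", "2024-01-01", "2024-01-10", "2024-01-05"]
    ),
    "Total": [10.0, 5.0, 20.0, 7.5],
})

# Reference point: one day after the latest invoice in the data.
latest = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm_toy = toy.groupby("CustomerID").agg(
    R=("InvoiceDate", lambda x: (latest - x.max()).days),  # days since last purchase
    F_lines=("InvoiceNo", "count"),        # number of line items
    F_invoices=("InvoiceNo", "nunique"),   # number of distinct invoices
    M=("Total", "sum"),                    # total amount spent
)

print(rfm_toy)
```

Which frequency definition to use is a modelling choice; counting line items rewards large baskets, while counting invoices rewards repeat visits.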

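# Compute the RFM scores. Ranking with rank() first avoids qcut errors caused by duplicate values falling on bin edges.

# R score: R is the odd one out. A smaller R means a better customer, so a smaller R gets a higher score (a larger r_s is better).
rfm['r_s'] = pd.qcut(rfm['R'].rank(method='first'), q=5, labels=[5, 4, 3, 2, 1]).astype(int)

# F score: the larger F is, the higher the score
rfm['f_s'] = pd.qcut(rfm['F'].rank(method='first'), q=5, labels=[1, 2, 3, 4, 5]).astype(int)

# M score: the larger M is, the higher the score
rfm['m_s'] = pd.qcut(rfm['M'].rank(method='first'), q=5, labels=[1, 2, 3, 4, 5]).astype(int)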
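The reason for ranking before binning can be seen on a small made-up series with many tied values (the series `f` below is an assumption for illustration). `pd.qcut` raises a `ValueError` when several quantile edges coincide, while `rank(method='first')` turns the values into unique ranks that always split into equal-sized bins.

```python
import pandas as pd

# Made-up frequency values: six customers are tied at 1.
f = pd.Series([1, 1, 1, 1, 1, 1, 2, 3, 4, 5])

# Binning the raw values fails because several quantile edges equal 1.
try:
    pd.qcut(f, q=5, labels=[1, 2, 3, 4, 5])
    raised = False
except ValueError:
    raised = True

# Ranking first gives unique values 1..10, so five equal bins always exist.
scores = pd.qcut(f.rank(method="first"), q=5, labels=[1, 2, 3, 4, 5]).astype(int)

print(raised)
print(scores.tolist())
```

The trade-off is that ties are broken by row order: the six customers with the same value receive scores 1, 2 and 3 here, so customers with identical behaviour can end up with different scores.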

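# Rate each score as high or low relative to the mean of its column

def low_high(score, mean):
    # '高' = high, '低' = low; a score equal to the mean counts as low
    if score>mean:
        return '高'
    else:
        return '低'

# High or low rating for each score (each mean is computed once, not once per row)
r_mean, f_mean, m_mean = rfm['r_s'].mean(), rfm['f_s'].mean(), rfm['m_s'].mean()
rfm['rs高低'] = rfm['r_s'].apply(lambda x: low_high(x, r_mean))
rfm['fs高低'] = rfm['f_s'].apply(lambda x: low_high(x, f_mean))
rfm['ms高低'] = rfm['m_s'].apply(lambda x: low_high(x, m_mean))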

rfm.head(10)

	R	F	M	r_s	f_s	m_s	rs高低	fs高低	ms高低
CustomerID
12346.0	693	2	0.00	1	1	1	低	低	低
12347.0	153	182	4310.00	5	5	5	高	高	高
12348.0	443	31	1797.24	3	3	4	低	低	高
12349.0	386	73	1757.55	4	4	4	高	高	高
12350.0	3956	17	334.40	1	2	2	低	低	低
12352.0	440	95	1545.41	3	4	4	低	高	高
12353.0	571	4	89.00	2	1	1	低	低	低
12354.0	600	58	1079.40	2	3	4	低	低	高
12355.0	2648	13	459.40	1	1	2	低	低	低
12356.0	390	59	2811.43	4	4	5	高	高	高
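The high/low rating can also be written as a vectorised comparison instead of a row-by-row `apply`. The sketch below runs on a made-up score table; the frame `scores` and the `_level` column names are assumptions for illustration, and the labels '高' (high) and '低' (low) match the ones used in the article.

```python
import numpy as np
import pandas as pd

# Made-up scores for four customers.
scores = pd.DataFrame({
    "r_s": [1, 5, 3, 4],
    "f_s": [1, 5, 3, 4],
    "m_s": [1, 5, 4, 4],
})

for col in ["r_s", "f_s", "m_s"]:
    # Compare the whole column with its mean in one operation;
    # a score equal to the mean is rated low, as in low_high().
    scores[col + "_level"] = np.where(scores[col] > scores[col].mean(), "高", "低")

print(scores)
```

With quintile scores from 1 to 5 the column mean is close to 3, so in practice scores of 4 and 5 are rated high and scores of 1 to 3 are rated low.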

1.5 Customer segmentation



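def classify(r, f, m):
    # r, f, m are the '高' (high) / '低' (low) ratings of the R, F and M scores
    if r=='高' and f=='高' and m=='高':
        return '重要价值用户'    # important: high-value customers
    elif r=='高' and f=='低' and m=='高':
        return '重要发展用户'    # important: customers to develop
    elif r=='低' and f=='高' and m=='高':
        return '重要保持用户'    # important: customers to keep engaged
    elif r=='低' and f=='低' and m=='高':
        return '重要挽留用户'    # important: customers to win back
    elif r=='高' and f=='高' and m=='低':
        return '一般价值用户'    # general: value customers
    elif r=='高' and f=='低' and m=='低':
        return '一般发展用户'    # general: customers to develop
    elif r=='低' and f=='高' and m=='低':
        return '一般保持用户'    # general: customers to keep engaged
    elif r=='低' and f=='低' and m=='低':
        return '一般保留用户'    # general: customers to retain (low on all three)

# Assign a segment to every customer
rfm['用户分类'] = rfm.apply(lambda x: classify(x['rs高低'], x['fs高低'], x['ms高低']), axis=1)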

rfm.head(10)

	R	F	M	r_s	f_s	m_s	rs高低	fs高低	ms高低	用户分类
CustomerID
12346.0	693	2	0.00	1	1	1	低	低	低	一般保留用户
12347.0	153	182	4310.00	5	5	5	高	高	高	重要价值用户
12348.0	443	31	1797.24	3	3	4	低	低	高	重要挽留用户
12349.0	386	73	1757.55	4	4	4	高	高	高	重要价值用户
12350.0	3956	17	334.40	1	2	2	低	低	低	一般保留用户
12352.0	440	95	1545.41	3	4	4	低	高	高	重要保持用户
12353.0	571	4	89.00	2	1	1	低	低	低	一般保留用户
12354.0	600	58	1079.40	2	3	4	低	低	高	重要挽留用户
12355.0	2648	13	459.40	1	1	2	低	低	低	一般保留用户
12356.0	390	59	2811.43	4	4	5	高	高	高	重要价值用户
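Since the eight segments are a fixed mapping from three high/low ratings, the same classification can be expressed as a dictionary lookup. This is an alternative sketch, not the method used above; the `SEGMENTS` dictionary and the made-up `levels` frame are assumptions for illustration, while the labels and column names follow the article.

```python
import pandas as pd

# (R rating, F rating, M rating) -> segment label, mirroring classify().
SEGMENTS = {
    ("高", "高", "高"): "重要价值用户",
    ("高", "低", "高"): "重要发展用户",
    ("低", "高", "高"): "重要保持用户",
    ("低", "低", "高"): "重要挽留用户",
    ("高", "高", "低"): "一般价值用户",
    ("高", "低", "低"): "一般发展用户",
    ("低", "高", "低"): "一般保持用户",
    ("低", "低", "低"): "一般保留用户",
}

# Made-up ratings for three customers.
levels = pd.DataFrame({
    "rs高低": ["高", "低", "低"],
    "fs高低": ["高", "低", "高"],
    "ms高低": ["高", "高", "低"],
})

# Look up each row's rating triple in the mapping.
levels["用户分类"] = [
    SEGMENTS[key]
    for key in zip(levels["rs高低"], levels["fs高低"], levels["ms高低"])
]

print(levels)
```

A lookup table keeps all eight rules in one place, and an unexpected rating raises a `KeyError` instead of silently producing a missing value.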

1.6 Visualization

# Distribution of R (labels: 'R值' = R value, '频数' = frequency)
plt.hist(rfm['R'], bins=30, edgecolor='black')
plt.xticks(range(0, max(rfm['R'])+1, 500))
plt.xlabel('R值')
plt.ylabel('频数')
plt.title('R的分布图')
plt.show()

[Figure: histogram of R]

# Distribution of F (labels: 'F值' = F value, '频数' = frequency)
plt.hist(rfm['F'], bins=30, edgecolor='black')
plt.xlabel('F值')
plt.ylabel('频数')
plt.title('F的分布图')
plt.show()

[Figure: histogram of F]

# Distribution of M (labels: 'M值' = M value, '频数' = frequency)
plt.hist(rfm['M'], bins=30, edgecolor='black')
plt.xlabel('M值')
plt.ylabel('频数')
plt.title('M的分布图')
plt.show()

[Figure: histogram of M]

# Visualize the number of customers in each segment

categories = rfm.groupby('用户分类')['用户分类'].count().sort_values(ascending=False)

# Bar chart (top panel)
fig, ax = plt.subplots(2, 1, figsize=(6, 11))
ax[0].bar(categories.index, categories.values, color='skyblue')
for i in range(len(categories)):
    ax[0].text(categories.index[i], categories.iloc[i], s=str(categories.iloc[i]), ha='center', va='bottom')
ax[0].tick_params(axis='x', rotation=45)
ax[0].set_title('不同种类的用户数')
ax[0].set_ylabel('数量')

# Pie chart (bottom panel); each wedge shows the count and the percentage
ax[1].pie(categories, labels=categories.index, radius=1.1, autopct=lambda pct: f'{int(pct/100*categories.sum())}, {pct: .1f}%')
ax[1].set_title('各种用户占比')

fig.tight_layout()
plt.show()
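The numbers behind the two charts can also be printed as a small table, which is handy for checking the labels. The sketch below uses a made-up list of segment labels (the series `segments` and the column names `count` and `share_pct` are assumptions for illustration) and relies on `value_counts` for both the counts and the shares.

```python
import pandas as pd

# Made-up segment labels for four customers.
segments = pd.Series(["重要价值用户", "一般保留用户", "一般保留用户", "重要挽留用户"])

# Exact counts per segment, largest first.
counts = segments.value_counts()

# Shares in percent, rounded to one decimal place.
shares = segments.value_counts(normalize=True).mul(100).round(1)

# Combine both into one table, aligned on the segment label.
summary = pd.DataFrame({"count": counts, "share_pct": shares})

print(summary)
```

The pie chart's `autopct` callback rebuilds each count from a percentage and truncates it with `int()`, so floating-point rounding could in principle show a count that is one too low; a table like this gives the exact figures to compare against.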