Kaggle Study: eloData Project (2) - Data Cleaning
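Table of Contents
- Kaggle Study: eloData Project (2) - Data Cleaning
- 4.1 Reading the Data
- 4.2 Processing the Test Set
- 4.3 Processing the Transaction Tables
- 4.4 Processing the Merchant Table
- 4.5 Saving the Data
- Summary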
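After this post I will stop updating the eloData project. I do not understand it: if you want to sell a course, just sell it. Why make people follow accounts, add WeChat contacts, and so on? It is far too much hassle. From now on I will look for courses on GitHub instead, since reading text is more efficient than watching videos. Onward. Reference article: Kaggle Competition Case Study: Elo Merchant Category Recommendation (2)
- My folder layout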
4.1 Reading the Data
import gc
import os
import time

import numpy as np
import pandas as pd

start = time.time()
PATH = r'E:\kaggle_study\eloData'
# List every file in the folder. The positional indexes used below depend on
# this folder's exact contents and listing order, so check the printed list first.
filename = os.listdir(PATH)
print(filename)
pd.set_option("display.max_colwidth", 1000)  # show full cell contents
# Data dictionary (merchant sheet)
explain_doc = pd.read_excel(os.path.join(PATH, filename[0]), header=2, sheet_name='merchant').head(5)
# Sample submission file
summit_doc = pd.read_csv(os.path.join(PATH, filename[5]), header=0).head(5)
# Training set
train = pd.read_csv(os.path.join(PATH, filename[7]), header=0)
# Test set
test = pd.read_csv(os.path.join(PATH, filename[6]), header=0)
# Historical transactions table
historical_transactions = pd.read_csv(os.path.join(PATH, filename[1]), header=0)
# New transactions table
new_transactions = pd.read_csv(os.path.join(PATH, filename[3]), header=0)
# Merchants table
merchant = pd.read_csv(os.path.join(PATH, filename[2]), header=0)
exe_time = time.time() - start
print("Time to read the data:", exe_time)
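Indexing into `os.listdir()` by position only works when the folder has exactly the same contents and listing order as the author's. A minimal sketch of looking files up by name instead is shown here. The file names are assumptions based on the usual names in the competition download, and `resolve_files` is a hypothetical helper that does not appear in the original code.

```python
import os

# Assumed file names; adjust them to match your own folder.
EXPECTED_FILES = {
    "train": "train.csv",
    "test": "test.csv",
    "historical_transactions": "historical_transactions.csv",
    "new_transactions": "new_merchant_transactions.csv",
    "merchant": "merchants.csv",
}


def resolve_files(folder, expected=EXPECTED_FILES):
    """Map each logical table name to its full path, failing early if a file is missing."""
    present = set(os.listdir(folder))
    missing = [name for name in expected.values() if name not in present]
    if missing:
        raise FileNotFoundError(f"Missing files in {folder}: {missing}")
    return {key: os.path.join(folder, name) for key, name in expected.items()}
```

With this helper, a read would look like `paths = resolve_files(PATH)` followed by `train = pd.read_csv(paths["train"])`, and a missing file is reported before any large table is loaded.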
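4.2 Processing the Test Set
# Fill the missing value. The fill value is a YYYY-MM string, so it is applied
# to the first_active_month column only instead of to the whole table.
test['first_active_month'] = test['first_active_month'].fillna('2017-03')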
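Before filling, it helps to check which columns actually contain gaps, because a fill value written for one column can end up in others. The sketch below uses a small made-up table; the column names are assumed to follow the competition's test set, and the values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the test set; the values are made up for illustration.
toy_test = pd.DataFrame({
    "first_active_month": ["2017-04", None, "2017-01"],
    "card_id": ["C_ID_a", "C_ID_b", "C_ID_c"],
    "feature_1": [3, 5, 2],
})

# Count missing values per column before deciding how to fill them.
missing_before = toy_test.isnull().sum()

# Fill only the date column, leaving every other column untouched.
toy_test["first_active_month"] = toy_test["first_active_month"].fillna("2017-03")
missing_after = toy_test.isnull().sum()
```

Comparing `missing_before` with `missing_after` confirms that exactly the intended cells were filled.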
4.3 Processing the Transaction Tables
start = time.time()
# 4.3.1 Stack the two transaction tables vertically: historical and new card transactions
transactions = pd.concat([historical_transactions, new_transactions], axis=0)
transactions = transactions.reset_index(drop=True)  # rebuild a clean 0..n-1 index
# Release the source tables now that they have been combined
del historical_transactions
del new_transactions
gc.collect()
# 4.3.2 Fill the missing values feature by feature
transactions['category_2'] = transactions['category_2'].fillna(-1)
transactions['category_3'] = transactions['category_3'].fillna(-1)
transactions['merchant_id'] = transactions['merchant_id'].fillna('no_merchant_id')
# 4.3.3 Process the date feature in the transaction table
# Split "YYYY-MM-DD HH:MM:SS" strings into year, month, and day lists (as strings)
def date_to_ymd(series):
    year = []
    month = []
    day = []
    for val in series:
        l = val.split('-')
        year.append(l[0])
        month.append(l[1])
        day.append(l[2].split(' ')[0])  # drop the time part after the day
    return year, month, day

year, month, day = date_to_ymd(series=transactions['purchase_date'])
transactions['purchase_date_year'] = year
transactions['purchase_date_month'] = month
transactions['purchase_date_day'] = day
transactions = transactions.drop('purchase_date', axis=1)
print("Time to process the transaction table:", time.time() - start)
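A Python-level loop over a large transaction table can be slow. One possible alternative, sketched here on made-up timestamps, is to parse the column once with `pd.to_datetime` and read the parts through the `.dt` accessor. Note that this variant produces integers (for example `6`), whereas the loop above keeps strings (for example `'06'`), so downstream code would need to expect that.

```python
import pandas as pd

# Made-up purchase timestamps in the "YYYY-MM-DD HH:MM:SS" layout.
toy = pd.DataFrame({"purchase_date": ["2017-06-25 15:33:07", "2018-02-01 08:11:45"]})

# Parse the column once, then read the parts through the .dt accessor.
parsed = pd.to_datetime(toy["purchase_date"])
toy["purchase_date_year"] = parsed.dt.year
toy["purchase_date_month"] = parsed.dt.month
toy["purchase_date_day"] = parsed.dt.day
toy = toy.drop("purchase_date", axis=1)
```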
4.4 Processing the Merchant Table
start = time.time()
# 4.4.1 Drop the features that duplicate columns in the transaction table,
# then remove the 63 rows whose merchant_id is duplicated (keep the first occurrence)
merchant = merchant.drop(['merchant_category_id', 'subsector_id', 'category_1',
                          'category_2', 'city_id', 'state_id'], axis=1)
merchant = merchant.drop_duplicates(subset='merchant_id').reset_index(drop=True)
# 4.4.2 Fill missing values with each column's mean
for col in ['avg_sales_lag3', 'avg_sales_lag6', 'avg_sales_lag12']:
    merchant[col] = merchant[col].fillna(merchant[col].mean())
# 4.4.3 Replace inf overflow values with each column's largest finite value
for col in ['avg_purchases_lag3', 'avg_purchases_lag6', 'avg_purchases_lag12']:
    max_value = merchant.loc[merchant[col] != np.inf, col].max()
    merchant[col] = merchant[col].replace(np.inf, max_value)
print("Time to process the merchant table:", time.time() - start)
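To see what the three merchant-table steps do, here is a small sketch on an invented four-row table. Only one sales column and one purchases column are used, and all values are made up so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

# Invented data: M_2 appears twice, with one NaN and one inf in its first row.
toy_merchant = pd.DataFrame({
    "merchant_id": ["M_1", "M_2", "M_2", "M_3"],
    "avg_sales_lag3": [1.0, np.nan, 9.0, 3.0],
    "avg_purchases_lag3": [2.0, np.inf, 7.0, 4.0],
})

# Keep the first row for each merchant_id.
toy_merchant = toy_merchant.drop_duplicates(subset="merchant_id").reset_index(drop=True)

# Fill missing sales with the column mean (NaN is skipped when computing the mean).
sales_mean = toy_merchant["avg_sales_lag3"].mean()
toy_merchant["avg_sales_lag3"] = toy_merchant["avg_sales_lag3"].fillna(sales_mean)

# Replace inf with the largest finite value left in the column.
finite = np.isfinite(toy_merchant["avg_purchases_lag3"])
finite_max = toy_merchant.loc[finite, "avg_purchases_lag3"].max()
toy_merchant["avg_purchases_lag3"] = toy_merchant["avg_purchases_lag3"].replace(np.inf, finite_max)
```

Because deduplication runs first, the second `M_2` row (with `9.0` and `7.0`) never contributes to the mean or the maximum. The order of the steps therefore affects the filled values.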
4.5 Saving the Data
start = time.time()
# The result_pre folder must already exist; to_csv does not create it.
train.to_csv(r'E:\kaggle_study\eloData\result_pre\train_pre.csv', index=False)
test.to_csv(r'E:\kaggle_study\eloData\result_pre\test_pre.csv', index=False)
transactions.to_csv(r'E:\kaggle_study\eloData\result_pre\transactions_pre.csv', index=False)
merchant.to_csv(r'E:\kaggle_study\eloData\result_pre\merchant_pre.csv', index=False)
print("Time to save the processed data:", time.time() - start)
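The four `to_csv` calls repeat the same output folder and fail if that folder is missing. A minimal sketch of wrapping the pattern in a function follows; `save_tables` is a hypothetical helper, not something from the original post, and it follows the `<name>_pre.csv` naming used above.

```python
import os

import pandas as pd


def save_tables(tables, out_dir):
    """Write each DataFrame in `tables` to <out_dir>/<name>_pre.csv, creating the folder if needed."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for name, frame in tables.items():
        path = os.path.join(out_dir, f"{name}_pre.csv")
        frame.to_csv(path, index=False)
        written.append(path)
    return written
```

A call such as `save_tables({"train": train, "test": test}, os.path.join(PATH, "result_pre"))` would then write `train_pre.csv` and `test_pre.csv` into the output folder.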
Summary
Remember to release memory promptly. I did not write out every such step here, so pay attention to it in real use.
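As one possible way to act on that advice, the sketch below combines `del` plus `gc.collect()` with downcasting numeric columns to smaller dtypes. `downcast_numeric` is a hypothetical helper and the data is invented. Downcasting floats to `float32` reduces precision, so treat it as an option to evaluate and not as a default.

```python
import gc

import pandas as pd


def downcast_numeric(frame):
    """Return a copy whose numeric columns use the smallest dtype pandas considers safe."""
    out = frame.copy()
    for col in out.select_dtypes(include="number").columns:
        kind = "integer" if pd.api.types.is_integer_dtype(out[col]) else "float"
        out[col] = pd.to_numeric(out[col], downcast=kind)
    return out


# Invented data: one integer, one float, and one string column.
frame = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5], "c": ["x", "y", "z"]})
small = downcast_numeric(frame)

# Drop the reference to the larger table and ask the collector to reclaim it.
del frame
gc.collect()
```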
Anyway, keep going, and let's encourage each other!