A Summary of Feature-Derivation Techniques for Modeling (with Commonly Used Derivation Functions)
This article summarizes the available feature-derivation methods, the functions that implement them, and the scenarios they suit. The steps are as follows:
Dataset exploration:
1. Check whether the IDs contain duplicates: tcc['customerID'].nunique() == tcc.shape[0]
2. Check for missing values: tcc.isnull().sum()
Also watch out for whitespace placeholders. Inspect the categorical variables with:
for feature in tcc[category_cols]:
    print(f'{feature}: {tcc[feature].unique()}')
For the continuous variables, check whether every column can be cast to a numeric dtype:
tcc[numeric_cols].astype(float)
Replace the whitespace entries with np.nan:
tcc['TotalCharges'] = tcc['TotalCharges'].apply(lambda x: x if x != ' ' else np.nan).astype(float)
Then impute (fillna returns a new Series, so assign it back):
tcc['TotalCharges'] = tcc['TotalCharges'].fillna(tcc['TotalCharges'].mean())
You can also reason about the business meaning of the missing values and impute accordingly.
3. Compute the missing-value count and ratio per column:
def missing(df):
    """
    Compute the missing-value count and ratio of each column
    """
    missing_number = df.isnull().sum().sort_values(ascending=False)  # missing count per column, descending
    missing_percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)  # missing ratio per column
    missing_values = pd.concat([missing_number, missing_percent], axis=1, keys=['Missing_Number', 'Missing_Percent'])  # combine into one DataFrame
    return missing_values
4. Explore the field types:
Fields fall into three kinds: continuous, categorical (object-dtype columns need conversion; extract them with tcc.select_dtypes('object').columns), and time series.
5. Explore outliers in continuous variables:
Three-sigma rule: compare tcc['TotalCharges'].mean() + 3 * tcc['TotalCharges'].std() against tcc[numeric_cols].describe()
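A minimal sketch of this check (assuming tcc is defined as above):
def three_sigma_bounds(s):
    """Return the (lower, upper) bounds of the mean +/- 3*std interval."""
    mu, sigma = s.mean(), s.std()
    return mu - 3 * sigma, mu + 3 * sigma

lower, upper = three_sigma_bounds(tcc['TotalCharges'])
outlier_mask = (tcc['TotalCharges'] < lower) | (tcc['TotalCharges'] > upper)
print(f'outliers: {outlier_mask.sum()}')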
6. Variable correlation analysis
First convert the non-numeric columns to dummy variables (categorical encoding is covered in detail below) while leaving the numeric columns untouched, then inspect the correlations:
df_dummies = pd.get_dummies(df3)
df_dummies.corr()['Churn'].sort_values(ascending=False)
You can of course also inspect the distributions with heatmaps, bar charts, and so on.
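For example, a minimal heatmap sketch with seaborn (an extra dependency; df_dummies as above):
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 8), dpi=150)
sns.heatmap(df_dummies.corr(), cmap='coolwarm', center=0)   # correlation heatmap of the dummy-encoded matrix
plt.show()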
7. Categorical variable encoding
a. Ordinal (integer) encoding: preprocessing.OrdinalEncoder()
b. One-hot encoding: preprocessing.OneHotEncoder() (note that one-hot encoding a binary variable adds nothing, which is why drop='if_binary' is used below)
Also, the one-hot encoder does not generate column names automatically; the following function does:
def cate_colName(Transformer, category_cols, drop='if_binary'):
    """
    Create column names for categorical fields after one-hot encoding
    :param Transformer: fitted one-hot encoder
    :param category_cols: categorical columns fed into the encoder
    :param drop: the encoder's drop parameter
    """
    cate_cols_new = []
    col_value = Transformer.categories_
    for i, j in enumerate(category_cols):
        # With drop='if_binary', a binary column keeps its original name
        if (drop == 'if_binary') & (len(col_value[i]) == 2):
            cate_cols_new.append(j)
        else:
            for f in col_value[i]:
                feature_name = j + '_' + str(f)
                cate_cols_new.append(feature_name)
    return cate_cols_new
8. ColumnTransformer pipelines (convenient for transforming all columns at once)
Each step is a tuple: (transformer name (user-defined), transformer, dataset columns the transformer applies to)
The transformer slot also accepts the string 'passthrough', which lets continuous columns pass unchanged, or an estimator such as preprocessing.OneHotEncoder(drop='if_binary'):
preprocess_col = ColumnTransformer([
('cat', preprocessing.OneHotEncoder(drop='if_binary'), category_cols),
('num', 'passthrough', numeric_cols)
])
This way the two kinds of columns are handled separately in one step.
Fit and apply the pipeline via fit / transform.
Inspect each fitted transformer via preprocess_col.named_transformers_ (accessing this attribute before fitting raises an error).
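A minimal usage sketch (assuming X_train, X_test and the preprocess_col defined above):
preprocess_col.fit(X_train)                      # fit on the training set only
X_train_t = preprocess_col.transform(X_train)
X_test_t = preprocess_col.transform(X_test)      # reuse the fitted transformers
print(preprocess_col.named_transformers_['cat'].categories_)   # the fitted encoder registered as 'cat'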
9. Continuous variable processing: standardization (zero mean, unit variance)
scaler = preprocessing.StandardScaler()
scaler.fit_transform(X_train)   # fit on the training set
scaler.transform(X_test)        # apply the same parameters to the test set
10. Normalization:
# L1-norm normalization: make the absolute values in each row sum to 1
preprocessing.normalize(X, norm='l1')
Or use the sklearn estimator interface:
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
normalizer.fit_transform(X)
The default is the L2 norm; for the L1 norm use
normalizer = Normalizer(norm='l1')
11. Binning
Purpose: reduces the influence of outliers / removes scale effects / introduces non-linearity, which can improve model performance. Drawback: the continuous variable loses some information.
a. Equal-width binning: dis = preprocessing.KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
strategy='uniform' means equal width; encode controls whether the output is one-hot encoded ('ordinal' means it is not)
dis.bin_edges_ shows the bin boundaries (available after fitting)
b. Equal-frequency binning: strategy='quantile'
c. Clustering-based binning: equal-width binning is somewhat affected by outliers, while equal-frequency binning tends to ignore outlier information entirely; both lose some feature information. To better respect the original value distribution, consider clustering-based binning:
from sklearn import cluster
kmeans = cluster.KMeans(n_clusters=3)
kmeans.fit(X)        # X must be 2-D, e.g. tcc[['MonthlyCharges']]
kmeans.labels_       # cluster labels, i.e. the bin of each sample
d. Supervised binning, e.g. fit a tree model and read the splits off the tree:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier().fit(income, y)   # income must be a 2-D array holding the feature to bin
plt.figure(figsize=(6, 2), dpi=150)
tree.plot_tree(clf)
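To read the split thresholds programmatically rather than off the plot (sklearn marks leaf nodes with feature == -2):
# Internal nodes carry the split thresholds, i.e. the learned bin edges
is_split_node = clf.tree_.feature != -2
bin_edges = np.sort(clf.tree_.threshold[is_split_node])
print(bin_edges)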
Likewise, these steps can be folded into a ColumnTransformer:
ColumnTransformer([
('cat', preprocessing.OneHotEncoder(drop='if_binary'), category_cols),
('num', preprocessing.KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans'), numeric_cols)
])
12. Handling label imbalance: if the 0:1 class ratio is below 3:1, the imbalance can basically be ignored; if it exceeds 3:1, the label distribution is considered skewed and needs handling, e.g. oversampling, undersampling, model ensembling, or sample clustering. If the model's ability to identify class 1 is the main concern, f1-Score is the recommended metric.
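A lightweight first response to imbalance, before resampling, is the class_weight parameter available on many sklearn estimators; a minimal sketch (assuming X_train, y_train as above):
from sklearn.linear_model import LogisticRegression

# 'balanced' reweights classes inversely to their frequencies,
# so the minority class contributes more to the loss
clf_balanced = LogisticRegression(max_iter=int(1e6), class_weight='balanced')
clf_balanced.fit(X_train, y_train)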
13. A function that computes several metrics at once:
def result_df(model, X_train, y_train, X_test, y_test, metrics=
              [accuracy_score, recall_score, precision_score, f1_score, roc_auc_score]):
    res_train = []
    res_test = []
    col_name = []
    for fun in metrics:
        # sklearn metrics take (y_true, y_pred) in that order
        # note: roc_auc_score here is computed on hard predictions; pass predict_proba scores for a proper AUC
        res_train.append(fun(y_train, model.predict(X_train)))
        res_test.append(fun(y_test, model.predict(X_test)))
        col_name.append(fun.__name__)
    idx_name = ['train_eval', 'test_eval']
    res = pd.DataFrame([res_train, res_test], columns=col_name, index=idx_name)
    return res
Some modeling tips
Start with logistic regression.
14. Model tuning:
Two groups of hyperparameters affect the results most:
the first is the choice of regularization, together with the empirical-risk coefficient (C) and the solver;
the second is the iteration limits, mainly the max_iter and tol parameters.
15. After grid-search tuning, a more advanced strategy is to tune the decision threshold of logistic regression.
In sklearn, the decision threshold is not a hyperparameter, so to search over it we can write a wrapper estimator around the logistic regression estimator that exposes the threshold as a hyperparameter, and then feed it into the grid-search pipeline (a sketch follows below);
tuning class_weight, by contrast, is straightforward: it is a native parameter of the logistic regression estimator, so setting a sensible parameter space and searching it is all that is needed.
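The following is a minimal sketch of such a wrapper; the name logit_threshold matches the estimator used in the features_test function later in this article, but this implementation is an assumption, not the original author's code:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import LogisticRegression

class logit_threshold(BaseEstimator, ClassifierMixin):
    """Logistic regression with the decision threshold exposed as a hyperparameter."""
    def __init__(self, penalty='l2', C=1.0, solver='lbfgs',
                 max_iter=1000, class_weight=None, thr=0.5):
        self.penalty = penalty
        self.C = C
        self.solver = solver
        self.max_iter = max_iter
        self.class_weight = class_weight
        self.thr = thr

    def fit(self, X, y):
        self.clf_ = LogisticRegression(penalty=self.penalty, C=self.C,
                                       solver=self.solver, max_iter=self.max_iter,
                                       class_weight=self.class_weight)
        self.clf_.fit(X, y)
        self.classes_ = self.clf_.classes_
        return self

    def predict_proba(self, X):
        return self.clf_.predict_proba(X)

    def predict(self, X):
        # Classify as positive when P(y=1) exceeds the custom threshold
        return (self.clf_.predict_proba(X)[:, 1] >= self.thr).astype(int)

# In a grid search one can now add e.g. 'logit_threshold__thr': np.arange(0.3, 0.8, 0.1).tolist()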
16. An insight from the results: what is the point of discretizing continuous variables?
A logistic regression coefficient measures how the log-odds of the target change when the corresponding feature increases by 1, so features with large value ranges end up with small coefficients. For a fairer comparison with the categorical variables, consider discretizing the continuous variables first, then compare the coefficients of the discretized features against those of the original categorical features; the two become far more comparable.
17. Summary of logistic regression tuning tips:
Model performance: from the experiments above we can judge how logistic regression performs on this dataset. Accuracy sits around 80% and offers little room for hyperparameter search, whereas f1-Score responds well to searching over the extra hyperparameter we introduced, the decision threshold.
Search strategy: sklearn's logistic regression has many hyperparameters. Compute budget permitting, set a generous max_iter and a small tol. The basic search space is the regularization type (penalty) + empirical-risk coefficient (C) + solver; with more compute, include the elastic-net penalty and search its l1 weight (l1_ratio). With imbalanced samples, include class_weight in the search; when the target metric is f1-Score or ROC-AUC, move the threshold through a custom estimator; for an even finer search, also include the encoding scheme of the continuous variables.
Threshold moving and sample-weight tuning: from the experiments above,
(1) Threshold moving mostly appears when tuning for f1-Score or ROC-AUC. Because it adjusts recall and precision so effectively, searching this parameter usually beats searching logistic regression's other default parameters; the same applies to other models that output probabilities (decision trees, random forests, and so on);
(2) Sample-weight tuning mostly appears with imbalanced datasets. It makes the model pay more attention to minority-class samples during training, partially rebalancing the classes. Compared with other balancing methods (oversampling, undersampling, SMOTEENN, etc.), it avoids overfitting better, and it is a generic parameter available on many sklearn models. Compute permitting, it is worth searching regardless of the target metric;
(3) When tuning for f1-Score or ROC-AUC, however, threshold moving and sample-weight tuning overlap in function; in that case prefer searching the threshold.
Next, decision trees:
18. A decision tree has no linear-equation-style numeric interpretation, so categorical variables need no one-hot encoding; ordinal (integer) encoding is enough.
Parameters: here we search over max_depth, min_samples_split, min_samples_leaf, max_leaf_nodes and ccp_alpha (see the sketch below).
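A minimal grid-search sketch over those parameters (the ranges are illustrative assumptions):
tree_param = {
    'max_depth': np.arange(2, 16, 2).tolist(),
    'min_samples_split': np.arange(2, 5).tolist(),
    'min_samples_leaf': np.arange(1, 4).tolist(),
    'max_leaf_nodes': np.arange(6, 30, 6).tolist(),
    'ccp_alpha': [0, 0.001, 0.01],
}
tree_search = GridSearchCV(DecisionTreeClassifier(), tree_param, n_jobs=12)
tree_search.fit(X_train, y_train)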
------------- The above covers basic processing; now on to feature derivation and selection
19. Feature derivation 1: manual field synthesis
A function that tests whether a new feature improves model performance:
def features_test(new_features,
                  features=features,
                  labels=labels,
                  category_cols=category_cols,
                  numeric_cols=numeric_cols):
    """
    New-feature testing function
    :param features: dataset features
    :param labels: dataset labels
    :param new_features: newly added feature(s)
    :param category_cols: names of the categorical columns
    :param numeric_cols: names of the continuous columns
    :return: result_df evaluation metrics
    """
    # Data preparation
    if type(new_features) == np.ndarray:
        name = 'new_features'
        new_features = pd.Series(new_features, name=name)
    elif type(new_features) == pd.Series:
        name = new_features.name  # fix: name was otherwise undefined for Series input
    features = features.copy()
    category_cols = category_cols.copy()
    numeric_cols = numeric_cols.copy()
    features = pd.concat([features, new_features], axis=1)
    X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.3, random_state=21)
    # Assign the new columns to continuous / categorical (>= 15 distinct values counts as continuous)
    if type(new_features) == pd.DataFrame:
        for col in new_features:
            if new_features[col].nunique() >= 15:
                numeric_cols.append(col)
            else:
                category_cols.append(col)
    else:
        if new_features.nunique() >= 15:
            numeric_cols.append(name)
        else:
            category_cols.append(name)
    # Check that every column has been assigned
    assert len(category_cols) + len(numeric_cols) == X_train.shape[1]
    # Set up the transformer pipeline
    logistic_pre = ColumnTransformer([
        ('cat', preprocessing.OneHotEncoder(drop='if_binary'), category_cols),
        ('num', 'passthrough', numeric_cols)
    ])
    num_pre = ['passthrough', preprocessing.StandardScaler(), preprocessing.KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans')]
    # Instantiate the logistic regression estimator (the threshold wrapper from point 15)
    logistic_model = logit_threshold(max_iter=int(1e8))
    # Set up the machine-learning pipeline
    logistic_pipe = make_pipeline(logistic_pre, logistic_model)
    # Set up the hyperparameter space
    logistic_param = [
        {'columntransformer__num': num_pre, 'logit_threshold__penalty': ['l1'], 'logit_threshold__C': np.arange(0.1, 1.1, 0.1).tolist(), 'logit_threshold__solver': ['saga']},
        {'columntransformer__num': num_pre, 'logit_threshold__penalty': ['l2'], 'logit_threshold__C': np.arange(0.1, 1.1, 0.1).tolist(), 'logit_threshold__solver': ['lbfgs', 'newton-cg', 'sag', 'saga']},
    ]
    # Instantiate the grid-search estimator
    logistic_search = GridSearchCV(estimator=logistic_pipe,
                                   param_grid=logistic_param,
                                   scoring='accuracy',
                                   n_jobs=12)
    # Time the search
    s = time.time()
    logistic_search.fit(X_train, y_train)
    print(time.time() - s, "s")
    # Return the results
    return (logistic_search.best_score_,
            logistic_search.best_params_,
            result_df(logistic_search.best_estimator_, X_train, y_train, X_test, y_test))
Note that because derived features are built from existing ones, new and old features often exhibit strong collinearity. To keep collinearity from hurting model performance, training is usually run through a grid search to find a parameter set (e.g. stronger regularization) that suppresses it.
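One way to quantify that collinearity is the variance inflation factor; a minimal sketch with statsmodels (an extra dependency; X is assumed to be a purely numeric feature DataFrame):
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X):
    """VIF per column; values above roughly 10 usually signal strong collinearity."""
    vals = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return pd.Series(vals, index=X.columns).sort_values(ascending=False)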
20. Feature-importance evaluation function: the IV (information value)
def IV(new_features, DataFrame=tcc, target=target):
    count_result = DataFrame[target].value_counts().values
    def IV_cal(features_name, target, df_temp):
        IV_l = []
        for i in features_name:
            IV_temp_l = []
            for values in df_temp[i].unique():
                data_temp = df_temp[df_temp[i] == values][target]
                # assumes both label classes appear within every group
                PB, PG = data_temp.value_counts().values / count_result
                IV_temp = (PG - PB) * np.log(PG / PB)
                IV_temp_l.append(IV_temp)
            IV_l.append(np.array(IV_temp_l).sum())
        return IV_l
    if type(new_features) == np.ndarray:
        features_name = ['new_features']
        new_features = pd.Series(new_features, name=features_name[0])
    elif type(new_features) == pd.Series:
        features_name = [new_features.name]
    else:
        features_name = new_features.columns
    df_temp = pd.concat([new_features, DataFrame], axis=1)
    df_temp = df_temp.loc[:, ~df_temp.columns.duplicated()]
    IV_l = IV_cal(features_name=features_name, target=target, df_temp=df_temp)
    res = pd.DataFrame(IV_l, columns=['IV'], index=features_name)
    return res
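A usage sketch (tenure_bin is an illustrative derived feature; IV assumes both label classes appear in every group, so bin a continuous column first):
# IV works on discrete features
tenure_bin = pd.qcut(tcc['tenure'], q=5, labels=False)
tenure_bin.name = 'tenure_bin'
print(IV(tenure_bin))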
21. Batch feature derivation
----------------- Single-variable feature derivation ----------------------
Method 1: polynomial feature derivation (usually weak)
from sklearn.preprocessing import PolynomialFeatures
PolynomialFeatures(degree=5).fit_transform(x1.reshape(-1, 1))
Method 2: the one-hot, binning, etc. methods covered above (usually stronger)
----------------- Two-variable feature derivation ----------------------
Method 3: arithmetic combinations (+, -, *, / between two features; see the sketch below)
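A minimal sketch of arithmetic derivation (assuming a features DataFrame with Telco-style columns; the new column names are illustrative):
# Ratio and product features built from two continuous columns
features['TotalCharges_div_tenure'] = features['TotalCharges'] / (features['tenure'] + 1e-5)
features['MonthlyCharges_x_tenure'] = features['MonthlyCharges'] * features['tenure']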
Method 4: cross combinations
The function is as follows:
def Binary_Cross_Combination(colNames, features, OneHot=True):
    """
    Pairwise cross-combination derivation for categorical variables
    :param colNames: names of the columns participating in the cross derivation
    :param features: original dataset
    :param OneHot: whether to one-hot encode the result
    :return: the derived features and their new column names
    """
    # Empty containers for the results
    colNames_new_l = []
    features_new_l = []
    # Extract the features to combine
    features = features[colNames]
    # Create each new feature and its name
    for col_index, col_name in enumerate(colNames):
        for col_sub_index in range(col_index + 1, len(colNames)):
            newNames = col_name + '&' + colNames[col_sub_index]
            colNames_new_l.append(newNames)
            newDF = pd.Series(features[col_name].astype('str')
                              + '&'
                              + features[colNames[col_sub_index]].astype('str'),
                              name=newNames)
            features_new_l.append(newDF)
    # Assemble the new feature matrix
    features_new = pd.concat(features_new_l, axis=1)
    features_new.columns = colNames_new_l
    colNames_new = colNames_new_l
    # One-hot encode the new feature matrix
    if OneHot == True:
        enc = preprocessing.OneHotEncoder()
        enc.fit_transform(features_new)
        colNames_new = cate_colName(enc, colNames_new_l, drop=None)
        features_new = pd.DataFrame(enc.fit_transform(features_new).toarray(), columns=colNames_new)
    return features_new, colNames_new
Method 5: group-statistics feature derivation
Statistics of feature a are computed within each group defined by the values of feature b:
def Binary_Group_Statistics(keyCol,
                            features,
                            col_num=None,
                            col_cat=None,
                            num_stat=['mean', 'var', 'max', 'min', 'skew', 'median'],
                            cat_stat=['mean', 'var', 'max', 'min', 'median', 'count', 'nunique'],
                            quant=True):
    """
    Two-variable group-statistics feature derivation function
    :param keyCol: key variable used for grouping
    :param features: original dataset
    :param col_num: continuous variables participating in the derivation
    :param col_cat: categorical variables participating in the derivation
    :param num_stat: group statistics for continuous variables
    :param cat_stat: group statistics for categorical variables
    :param quant: whether to compute quartiles
    :return: the derived features and their new column names
    """
    # When continuous features are supplied
    if col_num != None:
        aggs_num = {}
        colNames = col_num
        # Build the dict required by the agg method
        for col in col_num:
            aggs_num[col] = num_stat
        # Build the list of derived feature names
        cols_num = [keyCol]
        for key in aggs_num.keys():
            cols_num.extend([key + '_' + keyCol + '_' + stat for stat in aggs_num[key]])
        # Build the derived-feature DataFrame
        features_num_new = features[col_num + [keyCol]].groupby(keyCol).agg(aggs_num).reset_index()
        features_num_new.columns = cols_num
        # When categorical features are supplied as well
        if col_cat != None:
            aggs_cat = {}
            colNames = col_num + col_cat
            # Build the dict required by the agg method
            for col in col_cat:
                aggs_cat[col] = cat_stat
            # Build the list of derived feature names
            cols_cat = [keyCol]
            for key in aggs_cat.keys():
                cols_cat.extend([key + '_' + keyCol + '_' + stat for stat in aggs_cat[key]])
            # Build the derived-feature DataFrame
            features_cat_new = features[col_cat + [keyCol]].groupby(keyCol).agg(aggs_cat).reset_index()
            features_cat_new.columns = cols_cat
            # Merge the continuous and categorical derivation results
            df_temp = pd.merge(features_num_new, features_cat_new, how='left', on=keyCol)
            features_new = pd.merge(features[keyCol], df_temp, how='left', on=keyCol)
            features_new = features_new.loc[:, ~features_new.columns.duplicated()]  # fix: assign the de-duplicated result
            colNames_new = cols_num + cols_cat
            # keyCol appears in both lists, so remove it twice
            colNames_new.remove(keyCol)
            colNames_new.remove(keyCol)
        # When only continuous variables are supplied
        else:
            # Merge the continuous derivation results back onto the original data, then drop duplicated columns
            features_new = pd.merge(features[keyCol], features_num_new, how='left', on=keyCol)
            features_new = features_new.loc[:, ~features_new.columns.duplicated()]
            colNames_new = cols_num
            colNames_new.remove(keyCol)
    # When no continuous variables are supplied
    else:
        # But categorical variables exist, i.e. only categorical variables
        if col_cat != None:
            aggs_cat = {}
            colNames = col_cat
            for col in col_cat:
                aggs_cat[col] = cat_stat
            cols_cat = [keyCol]
            for key in aggs_cat.keys():
                cols_cat.extend([key + '_' + keyCol + '_' + stat for stat in aggs_cat[key]])
            features_cat_new = features[col_cat + [keyCol]].groupby(keyCol).agg(aggs_cat).reset_index()
            features_cat_new.columns = cols_cat
            features_new = pd.merge(features[keyCol], features_cat_new, how='left', on=keyCol)
            features_new = features_new.loc[:, ~features_new.columns.duplicated()]
            colNames_new = cols_cat
            colNames_new.remove(keyCol)
    if quant:
        # Quartile helpers
        def q1(x):
            """Lower quartile"""
            return x.quantile(0.25)
        def q2(x):
            """Upper quartile"""
            return x.quantile(0.75)
        # First an aggs dict of statistic *names*, used only to build the column names
        aggs = {}
        for col in colNames:
            aggs[col] = ['q1', 'q2']
        cols = [keyCol]
        for key in aggs.keys():
            cols.extend([key + '_' + keyCol + '_' + stat for stat in aggs[key]])
        # Then an aggs dict of the callables actually used for the aggregation
        aggs = {}
        for col in colNames:
            aggs[col] = [q1, q2]
        features_temp = features[colNames + [keyCol]].groupby(keyCol).agg(aggs).reset_index()
        features_temp.columns = cols
        features_new = pd.merge(features_new, features_temp, how='left', on=keyCol)
        features_new = features_new.loc[:, ~features_new.columns.duplicated()]
        colNames_new = colNames_new + cols
        colNames_new.remove(keyCol)
    features_new.drop([keyCol], axis=1, inplace=True)
    return features_new, colNames_new
Method 6: two-variable polynomial feature derivation (the original numbering reused "Method 5"; renumbered here for consistency)
def Binary_PolynomialFeatures(colNames, degree, features):
    """
    Two-variable polynomial derivation for continuous variables
    :param colNames: names of the columns participating in the derivation
    :param degree: highest polynomial degree
    :param features: original dataset
    :return: the derived features and their new column names
    """
    # Empty containers for the results
    colNames_new_l = []
    features_new_l = []
    # Extract the features to derive from
    features = features[colNames]
    # Combine each pair of features polynomially
    for col_index, col_name in enumerate(colNames):
        for col_sub_index in range(col_index + 1, len(colNames)):
            col_temp = [col_name] + [colNames[col_sub_index]]
            array_new_temp = PolynomialFeatures(degree=degree, include_bias=False).fit_transform(features[col_temp])
            # The first two output columns are the original features themselves
            features_new_l.append(pd.DataFrame(array_new_temp[:, 2:]))
            # Build the names of the derived polynomial features
            for deg in range(2, degree + 1):
                for i in range(deg + 1):
                    col_name_temp = col_temp[0] + '**' + str(deg - i) + '*' + col_temp[1] + '**' + str(i)
                    colNames_new_l.append(col_name_temp)
    # Assemble the new feature matrix
    features_new = pd.concat(features_new_l, axis=1)
    features_new.columns = colNames_new_l
    colNames_new = colNames_new_l
    return features_new, colNames_new
Method 7: second-order feature derivation (further arithmetic on the group statistics) / evolved statistical features
e.g. within-group normalization ((value - group mean) / group std) / cross derivation on top of group statistics / gap features (upper quartile - lower quartile) / skew features (difference or ratio of mean and median) /
coefficient of variation (group std / group mean)
A function that bundles these methods:
def Group_Statistics_Extension(colNames, keyCol, features):
    """
    Second-order feature derivation on two-variable group statistics
    :param colNames: features participating in the derivation
    :param keyCol: key variable used for grouping
    :param features: original dataset
    :return: the derived features and their new column names
    """
    # Quartile helpers
    def q1(x):
        """Lower quartile"""
        return x.quantile(0.25)
    def q2(x):
        """Upper quartile"""
        return x.quantile(0.75)
    # First-order derivation
    # First an aggs dict used only to build the column names
    aggs = {}
    for col in colNames:
        aggs[col] = ['mean', 'var', 'median', 'q1', 'q2']
    cols = [keyCol]
    for key in aggs.keys():
        cols.extend([key + '_' + keyCol + '_' + stat for stat in aggs[key]])
    # Then the aggs dict actually used for the group aggregation
    aggs = {}
    for col in colNames:
        aggs[col] = ['mean', 'var', 'median', q1, q2]
    features_new = features[colNames + [keyCol]].groupby(keyCol).agg(aggs).reset_index()
    features_new.columns = cols
    features_new = pd.merge(features[keyCol], features_new, how='left', on=keyCol)
    features_new = features_new.loc[:, ~features_new.columns.duplicated()]  # fix: assign the de-duplicated result
    colNames_new = cols
    colNames_new.remove(keyCol)
    col1 = colNames_new.copy()
    # Second-order derivation
    # Flow-smoothing features (key value divided by group mean / median)
    for col_temp in colNames:
        col = col_temp + '_' + keyCol + '_' + 'mean'
        features_new[col_temp + '_dive1_' + col] = features_new[keyCol] / (features_new[col] + 1e-5)
        colNames_new.append(col_temp + '_dive1_' + col)
        col = col_temp + '_' + keyCol + '_' + 'median'
        features_new[col_temp + '_dive2_' + col] = features_new[keyCol] / (features_new[col] + 1e-5)
        colNames_new.append(col_temp + '_dive2_' + col)
    # Golden-combination features (key value minus group mean / median)
    for col_temp in colNames:
        col = col_temp + '_' + keyCol + '_' + 'mean'
        features_new[col_temp + '_minus1_' + col] = features_new[keyCol] - features_new[col]
        colNames_new.append(col_temp + '_minus1_' + col)
        col = col_temp + '_' + keyCol + '_' + 'median'  # fix: the original repeated the mean difference here
        features_new[col_temp + '_minus2_' + col] = features_new[keyCol] - features_new[col]
        colNames_new.append(col_temp + '_minus2_' + col)
    # Within-group normalization features
    for col_temp in colNames:
        col_mean = col_temp + '_' + keyCol + '_' + 'mean'
        col_var = col_temp + '_' + keyCol + '_' + 'var'
        features_new[col_temp + '_norm_' + keyCol] = (features_new[keyCol] - features_new[col_mean]) / (np.sqrt(features_new[col_var]) + 1e-5)
        colNames_new.append(col_temp + '_norm_' + keyCol)
    # Gap features
    for col_temp in colNames:
        col_q1 = col_temp + '_' + keyCol + '_' + 'q1'
        col_q2 = col_temp + '_' + keyCol + '_' + 'q2'
        features_new[col_temp + '_gap_' + keyCol] = features_new[col_q2] - features_new[col_q1]
        colNames_new.append(col_temp + '_gap_' + keyCol)
    # Skew features
    for col_temp in colNames:
        col_mean = col_temp + '_' + keyCol + '_' + 'mean'
        col_median = col_temp + '_' + keyCol + '_' + 'median'
        features_new[col_temp + '_mag1_' + keyCol] = features_new[col_median] - features_new[col_mean]
        colNames_new.append(col_temp + '_mag1_' + keyCol)
        features_new[col_temp + '_mag2_' + keyCol] = features_new[col_median] / (features_new[col_mean] + 1e-5)
        colNames_new.append(col_temp + '_mag2_' + keyCol)
    # Coefficient of variation
    for col_temp in colNames:
        col_mean = col_temp + '_' + keyCol + '_' + 'mean'
        col_var = col_temp + '_' + keyCol + '_' + 'var'
        features_new[col_temp + '_cv_' + keyCol] = np.sqrt(features_new[col_var]) / (features_new[col_mean] + 1e-5)
        colNames_new.append(col_temp + '_cv_' + keyCol)
    # Keep only the second-order features
    features_new.drop([keyCol], axis=1, inplace=True)
    features_new.drop(col1, axis=1, inplace=True)
    colNames_new = list(features_new.columns)
    return features_new, colNames_new
----------------------------------- Multi-variable feature derivation
Method 8: multi-variable cross-combination derivation
Its two main use cases: first, supplementing specific business fields (e.g. counting the total number of services each user purchased); second, performing second-stage derivation on already-derived fields.
The function is as follows:
def Multi_Cross_Combination(colNames, features, OneHot=True):
    """
    Multi-variable cross-combination derivation
    :param colNames: names of the columns participating in the cross derivation
    :param features: original dataset
    :param OneHot: whether to one-hot encode the result
    :return: the derived features and their new column names
    """
    # Create the combined feature
    colNames_new = '&'.join([str(i) for i in colNames])
    features_new = features[colNames[0]].astype('str')
    for col in colNames[1:]:
        features_new = features_new + '&' + features[col].astype('str')
    # Convert the combined feature into a DataFrame
    features_new = pd.DataFrame(features_new, columns=[colNames_new])
    # One-hot encode the new feature column
    if OneHot == True:
        enc = preprocessing.OneHotEncoder()
        enc.fit_transform(features_new)
        colNames_new = cate_colName(enc, [colNames_new], drop=None)
        features_new = pd.DataFrame(enc.fit_transform(features_new).toarray(), columns=colNames_new)
    return features_new, colNames_new
Method 9: multi-variable group-statistics feature derivation
def Multi_Group_Statistics(keyCol,
                           features,
                           col_num=None,
                           col_cat=None,
                           num_stat=['mean', 'var', 'max', 'min', 'skew', 'median'],
                           cat_stat=['mean', 'var', 'max', 'min', 'median', 'count', 'nunique'],
                           quant=True):
    """
    Multi-variable group-statistics feature derivation function
    :param keyCol: key variables used for grouping (a list)
    :param features: original dataset
    :param col_num: continuous variables participating in the derivation
    :param col_cat: categorical variables participating in the derivation
    :param num_stat: group statistics for continuous variables
    :param cat_stat: group statistics for categorical variables
    :param quant: whether to compute quartiles
    :return: the derived features and their new column names
    """
    # Build the merge key on the original data
    features_key1, col1 = Multi_Cross_Combination(keyCol, features, OneHot=False)
    # When continuous features are supplied
    if col_num != None:
        aggs_num = {}
        colNames = col_num
        # Build the dict required by the agg method
        for col in col_num:
            aggs_num[col] = num_stat
        # Build the list of derived feature names
        cols_num = keyCol.copy()
        for key in aggs_num.keys():
            cols_num.extend([key + '_' + col1 + '_' + stat for stat in aggs_num[key]])
        # Build the derived-feature DataFrame
        features_num_new = features[col_num + keyCol].groupby(keyCol).agg(aggs_num).reset_index()
        features_num_new.columns = cols_num
        # Build its merge key
        features_key2, col2 = Multi_Cross_Combination(keyCol, features_num_new, OneHot=False)
        # Attach the merge key to the derived features
        features_num_new = pd.concat([features_key2, features_num_new], axis=1)
        # When categorical features are supplied as well
        if col_cat != None:
            aggs_cat = {}
            colNames = col_num + col_cat
            # Build the dict required by the agg method
            for col in col_cat:
                aggs_cat[col] = cat_stat
            # Build the list of derived feature names
            cols_cat = keyCol.copy()
            for key in aggs_cat.keys():
                cols_cat.extend([key + '_' + col1 + '_' + stat for stat in aggs_cat[key]])
            # Build the derived-feature DataFrame
            features_cat_new = features[col_cat + keyCol].groupby(keyCol).agg(aggs_cat).reset_index()
            features_cat_new.columns = cols_cat
            # Build its merge key
            features_key3, col3 = Multi_Cross_Combination(keyCol, features_cat_new, OneHot=False)
            # Attach the merge key to the derived features
            features_cat_new = pd.concat([features_key3, features_cat_new], axis=1)
            # Merge the continuous and categorical derivation results
            df_temp = pd.concat([features_num_new, features_cat_new], axis=1)
            df_temp = df_temp.loc[:, ~df_temp.columns.duplicated()]
            # Merge the new feature matrix back onto the original data
            features_new = pd.merge(features_key1, df_temp, how='left', on=col1)
        # When only continuous variables are supplied
        else:
            # Merge the continuous derivation results back, then drop duplicated columns
            features_new = pd.merge(features_key1, features_num_new, how='left', on=col1)
            features_new = features_new.loc[:, ~features_new.columns.duplicated()]
    # When no continuous variables are supplied
    else:
        # But categorical variables exist, i.e. only categorical variables
        if col_cat != None:
            aggs_cat = {}
            colNames = col_cat
            for col in col_cat:
                aggs_cat[col] = cat_stat
            cols_cat = keyCol.copy()
            for key in aggs_cat.keys():
                cols_cat.extend([key + '_' + col1 + '_' + stat for stat in aggs_cat[key]])
            features_cat_new = features[col_cat + keyCol].groupby(keyCol).agg(aggs_cat).reset_index()
            features_cat_new.columns = cols_cat
            features_new = pd.merge(features_key1, features_cat_new, how='left', on=col1)
            features_new = features_new.loc[:, ~features_new.columns.duplicated()]
    if quant:
        # Quartile helpers
        def q1(x):
            """Lower quartile"""
            return x.quantile(0.25)
        def q2(x):
            """Upper quartile"""
            return x.quantile(0.75)
        # Names first, callables second (same trick as above)
        aggs = {}
        for col in colNames:
            aggs[col] = ['q1', 'q2']
        cols = keyCol.copy()
        for key in aggs.keys():
            cols.extend([key + '_' + col1 + '_' + stat for stat in aggs[key]])
        aggs = {}
        for col in colNames:
            aggs[col] = [q1, q2]
        features_temp = features[colNames + keyCol].groupby(keyCol).agg(aggs).reset_index()
        features_temp.columns = cols
        features_new.drop(keyCol, axis=1, inplace=True)
        # Build the merge key for the quartile features
        features_key4, col4 = Multi_Cross_Combination(keyCol, features_temp, OneHot=False)
        # Attach the merge key to the quartile features
        features_temp = pd.concat([features_key4, features_temp], axis=1)
        # Merge the quartile features in
        features_new = pd.merge(features_new, features_temp, how='left', on=col1)
        features_new = features_new.loc[:, ~features_new.columns.duplicated()]
    features_new.drop(keyCol + [col1], axis=1, inplace=True)
    colNames_new = list(features_new.columns)
    return features_new, colNames_new
Method 10: multi-variable polynomial feature derivation
from itertools import product  # needed for the exponent enumeration below

def Multi_PolynomialFeatures(colNames, degree, features):
    """
    Multi-variable polynomial derivation for continuous variables
    :param colNames: names of the columns participating in the derivation
    :param degree: highest polynomial degree
    :param features: original dataset
    :return: the derived features and their new column names
    """
    # Empty container for the new column names
    colNames_new_l = []
    # Number of features entering the polynomial expansion
    n = len(colNames)
    # Extract the features to derive from
    features = features[colNames]
    # Polynomial feature expansion
    array_new_temp = PolynomialFeatures(degree=degree, include_bias=False).fit_transform(features)
    # Keep only the derived columns (the first n are the original features)
    array_new_temp = array_new_temp[:, n:]
    # Build the column names: enumerate all exponent tuples that sum to deg
    deg = 2
    while deg <= degree:
        m = 1
        a1 = range(deg, -1, -1)
        a2 = []
        while m < n:
            a1 = list(product(a1, range(deg, -1, -1)))
            if m > 1:
                for i in a1:
                    i_temp = list(i[0])
                    i_temp.append(i[1])
                    a2.append(i_temp)
                a1 = a2.copy()
                a2 = []
            m += 1
        a1 = np.array(a1)
        # Keep the exponent tuples whose degrees sum exactly to deg
        a3 = a1[a1.sum(1) == deg]
        for i in a3:
            colNames_new_l.append('&'.join(colNames) + '_' + ''.join([str(e) for e in i]))
        deg += 1
    # Assemble the new feature matrix
    features_new = pd.DataFrame(array_new_temp, columns=colNames_new_l)
    colNames_new = colNames_new_l
    return features_new, colNames_new
------------------------ Time-series fields ------------------------------
Core idea: add more grouping dimensions by extracting natural cycles and business cycles, plus time differences from key reference points.
Method 11: time-series feature derivation function
def timeSeriesCreation(timeSeries, timeStamp=None, precision_high=False):
    """
    Feature derivation for a time-series field
    :param timeSeries: the time-series feature, which must be a Series
    :param timeStamp: manually supplied key time points, as a dict whose keys/values are the names and the timestamp strings
    :param precision_high: whether to go down to hour / minute / second precision
    :return features_new, colNames_new: the new feature matrix and the feature names
    """
    # DataFrame holding the derived features
    features_new = pd.DataFrame()
    # Parse the time field and grab its name
    timeSeries = pd.to_datetime(timeSeries)
    colNames = timeSeries.name
    # Extract year / month / day
    features_new[colNames+'_year'] = timeSeries.dt.year
    features_new[colNames+'_month'] = timeSeries.dt.month
    features_new[colNames+'_day'] = timeSeries.dt.day
    if precision_high != False:
        features_new[colNames+'_hour'] = timeSeries.dt.hour
        features_new[colNames+'_minute'] = timeSeries.dt.minute
        features_new[colNames+'_second'] = timeSeries.dt.second
    # Extract natural cycles
    features_new[colNames+'_quarter'] = timeSeries.dt.quarter
    # dt.weekofyear is removed in recent pandas; isocalendar() is the replacement
    features_new[colNames+'_weekofyear'] = timeSeries.dt.isocalendar().week.astype(int)
    features_new[colNames+'_dayofweek'] = timeSeries.dt.dayofweek + 1
    features_new[colNames+'_weekend'] = (features_new[colNames+'_dayofweek'] > 5).astype(int)
    if precision_high != False:
        # Split the day into four 6-hour sections
        features_new['hour_section'] = (features_new[colNames+'_hour'] // 6).astype(int)
    # Time differences from key time points
    # Lists of key-timestamp names and values
    timeStamp_name_l = []
    timeStamp_l = []
    if timeStamp != None:
        timeStamp_name_l = list(timeStamp.keys())
        timeStamp_l = [pd.Timestamp(x) for x in list(timeStamp.values())]
    # Generic key time points: max, min and now
    time_max = timeSeries.max()
    time_min = timeSeries.min()
    time_now = pd.to_datetime(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
    timeStamp_name_l.extend(['time_max', 'time_min', 'time_now'])
    timeStamp_l.extend([time_max, time_min, time_now])
    # Derive the time-difference features
    for ts, timeStampName in zip(timeStamp_l, timeStamp_name_l):
        time_diff = timeSeries - ts
        features_new['time_diff_days'+'_'+timeStampName] = time_diff.dt.days
        features_new['time_diff_months'+'_'+timeStampName] = np.round(features_new['time_diff_days'+'_'+timeStampName] / 30).astype('int')
        if precision_high != False:
            # note: dt.seconds is the seconds component of each timedelta, not the total
            features_new['time_diff_seconds'+'_'+timeStampName] = time_diff.dt.seconds
            features_new['time_diff_h'+'_'+timeStampName] = time_diff.values.astype('timedelta64[h]').astype('int')
            features_new['time_diff_s'+'_'+timeStampName] = time_diff.values.astype('timedelta64[s]').astype('int')
    colNames_new = list(features_new.columns)
    return features_new, colNames_new
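A usage sketch (the Series name and the key time point are illustrative; the function above needs import datetime, pandas as pd and numpy as np in scope):
t = pd.Series(['2022-01-03 02:31:52', '2022-07-01 14:22:01', '2023-03-05 22:14:49'],
              name='registration_time')
features_new, colNames_new = timeSeriesCreation(t, timeStamp={'p1': '2022-03-25'}, precision_high=True)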
Method 12: text (NLP-style) feature derivation
This means applying text-field processing methods and ideas to numeric fields in order to derive more effective features.
1. CountVectorizer for term-frequency counts: from sklearn.feature_extraction.text import CountVectorizer
2. TF-IDF (term frequency - inverse document frequency): from sklearn.feature_extraction.text import TfidfTransformer
Note: sklearn's output is the tf*idf product after smoothing and L2 normalization.
So we can directly count, grouped by the key variable, each variable's frequency within each group and across the groups,
and then run TfidfTransformer on that count matrix to derive features (see the mini sketch below).
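A mini sketch of that pipeline (the count matrix is a made-up grouped sum of three 0/1 service columns):
from sklearn.feature_extraction.text import TfidfTransformer
# Rows = groups, columns = per-group counts of three binary service fields
counts = pd.DataFrame({'OnlineSecurity': [120, 30], 'OnlineBackup': [80, 70], 'TechSupport': [20, 90]})
tfidf = TfidfTransformer().fit_transform(counts).toarray()
print(tfidf)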
More notes:
3. Choosing the binary variables: ideally they are statistics of the same event viewed from different angles. For example, the three fields above all record one user purchase across three different services; the fields complement one another and jointly record the final outcome of the purchase.
4. Unlike other group-statistics derivation, which needs the number of groups kept under control, NLP-style derivation actually suits settings with many groups. In other words, it pairs well with multi-variable group derivation (or two-variable derivation with many groups).
5. Special case: computing row by row.
The function is as follows:
def NLP_Group_Statistics(features,
                         col_cat,
                         keyCol=None,
                         tfidf=True,
                         countVec=True):
    """
    NLP-style group-statistics feature derivation function
    :param features: original dataset
    :param col_cat: categorical variables participating in the derivation; must be a list of columns
    :param keyCol: key variable(s) used for grouping; a string means grouping by a single column, a list means grouping by several columns
    :param tfidf: whether to compute tf-idf
    :param countVec: whether to compute CountVectorizer-style counts
    :return: the NLP-derived features and their names
    """
    # Extract all the columns involved in the computation
    if keyCol != None:
        if type(keyCol) == str:
            keyCol = [keyCol]
        colName_temp = keyCol.copy()
        colName_temp.extend(col_cat)
        features = features[colName_temp]
    else:
        features = features[col_cat]
    # Define the CountVectorizer-style and tf-idf computation
    def NLP_Stat(features=features,
                 col_cat=col_cat,
                 keyCol=keyCol,
                 countVec=countVec,
                 tfidf=tfidf):
        """
        CountVectorizer-style and tf-idf computation
        Parameters are identical to the outer function's
        Note that the returned derived feature matrix still contains keyCol
        """
        n = len(keyCol)
        col_cat = [x + '_' + '&'.join(keyCol) for x in col_cat]
        if tfidf == True:
            # Count stage: group and sum the 0/1 columns
            features_new_cntv = features.groupby(keyCol).sum().reset_index()
            colNames_new_cntv = keyCol.copy()
            colNames_new_cntv.extend([x + '_cntv' for x in col_cat])
            features_new_cntv.columns = colNames_new_cntv
            # tf-idf stage on the count matrix
            transformer = TfidfTransformer()
            tfidf_arr = transformer.fit_transform(features_new_cntv.iloc[:, n:]).toarray()
            colNames_new_tfv = [x + '_tfidf' for x in col_cat]
            features_new_tfv = pd.DataFrame(tfidf_arr, columns=colNames_new_tfv)
            if countVec == True:
                features_new = pd.concat([features_new_cntv, features_new_tfv], axis=1)
                colNames_new_cntv.extend(colNames_new_tfv)
                colNames_new = colNames_new_cntv
            else:
                # fix: the original swapped these two assignments and mixed types
                features_new = pd.concat([features_new_cntv.iloc[:, :n], features_new_tfv], axis=1)
                colNames_new = keyCol.copy()
                colNames_new.extend(colNames_new_tfv)
        # When only the counts are needed
        elif countVec == True:
            features_new_cntv = features.groupby(keyCol).sum().reset_index()
            colNames_new_cntv = keyCol.copy()
            colNames_new_cntv.extend([x + '_cntv' for x in col_cat])
            features_new_cntv.columns = colNames_new_cntv
            colNames_new = colNames_new_cntv
            features_new = features_new_cntv
        return features_new, colNames_new
    # keyCol == None: derive NLP features on the raw rows
    # CountVectorizer-style counting is unnecessary in this case
    if keyCol == None:
        if tfidf == True:
            transformer = TfidfTransformer()
            tfidf_arr = transformer.fit_transform(features).toarray()
            colNames_new = [x + '_tfidf' for x in col_cat]
            features_new = pd.DataFrame(tfidf_arr, columns=colNames_new)
    # keyCol != None: derive NLP features on the grouped data
    else:
        n = len(keyCol)
        # Grouping by a single column
        if n == 1:
            features_new, colNames_new = NLP_Stat()
            # Stitch the group results back onto the original matrix
            features_new = pd.merge(features[keyCol[0]], features_new, how='left', on=keyCol[0])
            features_new = features_new.iloc[:, n:]
            colNames_new = features_new.columns
        # Grouping by several columns (cross grouping)
        else:
            features_new, colNames_new = NLP_Stat()
            # Build the merge key on the original dataset
            features_key1, col1 = Multi_Cross_Combination(keyCol, features, OneHot=False)
            # Build the merge key on the derived dataset
            features_key2, col2 = Multi_Cross_Combination(keyCol, features_new, OneHot=False)
            features_key2 = pd.concat([features_key2, features_new], axis=1)
            # Stitch the group results back onto the original matrix
            features_new = pd.merge(features_key1, features_key2, how='left', on=col1)
            features_new = features_new.iloc[:, n+1:]
            colNames_new = features_new.columns
    return features_new, colNames_new
Summary of the methods:
Finally, we wrap these functions up. The improvements are:
1. Merge the two-variable and multi-variable functions that share the same goal into one;
2. Some functions must distinguish training and test sets, so support separate train/test input and output.
(The test set may be small enough that some derived categories never appear there and must be padded using the training set.)
3. Target encoding (grouping the label by keyCol; powerful but dangerous, as it overfits easily).
Part 1. First, a function that, when the training set has more derived features than the test set, adds the missing columns to the test set filled with 0 (implementing point 2 above), and vice versa.
def Features_Padding(features_train_new,
                     features_test_new,
                     colNames_train_new,
                     colNames_test_new):
    """
    Zero-padding function for derived features
    :param features_train_new: derived features of the training set
    :param features_test_new: derived features of the test set
    :param colNames_train_new: derived column names of the training set
    :param colNames_test_new: derived column names of the test set
    :return: the zero-padded features and column names
    """
    if len(colNames_train_new) > len(colNames_test_new):
        sub_colNames = list(set(colNames_train_new) - set(colNames_test_new))
        for col in sub_colNames:
            features_test_new[col] = 0
        # Align the column order with the training set
        features_test_new = features_test_new[colNames_train_new]
        colNames_test_new = list(features_test_new.columns)
    elif len(colNames_train_new) < len(colNames_test_new):
        sub_colNames = list(set(colNames_test_new) - set(colNames_train_new))
        for col in sub_colNames:
            features_train_new[col] = 0
        features_train_new = features_train_new[colNames_test_new]
        colNames_train_new = list(features_train_new.columns)
    assert colNames_train_new == colNames_test_new
    return features_train_new, features_test_new, colNames_train_new, colNames_test_new
Part 2. Next, the combined cross-combination derivation function:
def Cross_Combination(colNames,
                      X_train,
                      X_test,
                      multi=False,
                      OneHot=True):
    """
    Cross-combination feature derivation function (train/test aware)
    :param colNames: names of the columns participating in the cross derivation
    :param X_train: training-set features
    :param X_test: test-set features
    :param multi: whether to do multi-variable cross combination
    :param OneHot: whether to one-hot encode the result
    :return: the derived features and their names
    """
    # First, derive the cross-combination features separately on the training and test sets
    if multi == False:
        features_train_new, colNames_train_new = Binary_Cross_Combination(colNames=colNames, features=X_train, OneHot=OneHot)
        features_test_new, colNames_test_new = Binary_Cross_Combination(colNames=colNames, features=X_test, OneHot=OneHot)
    else:
        features_train_new, colNames_train_new = Multi_Cross_Combination(colNames=colNames, features=X_train, OneHot=OneHot)
        features_test_new, colNames_test_new = Multi_Cross_Combination(colNames=colNames, features=X_test, OneHot=OneHot)
    # Then check whether the derived features differ between the two sets
    if colNames_train_new != colNames_test_new:
        features_train_new, features_test_new, colNames_train_new, colNames_test_new = Features_Padding(features_train_new=features_train_new,
                                                                                                        features_test_new=features_test_new,
                                                                                                        colNames_train_new=colNames_train_new,
                                                                                                        colNames_test_new=colNames_test_new)
    return features_train_new, features_test_new, colNames_train_new, colNames_test_new
Part 3. The combined polynomial derivation function.
For polynomial derivation there is no "fit on the training set, apply to the test set" step; the expansion is computed independently on each set.
def Polynomial_Features(colNames,
                        degree,
                        X_train,
                        X_test,
                        multi=False):
    """
    Polynomial derivation function (train/test aware)
    :param colNames: names of the columns participating in the derivation
    :param degree: highest polynomial degree
    :param X_train: training-set features
    :param X_test: test-set features
    :param multi: whether to do multi-variable polynomial derivation
    :return: the derived features and their new column names
    """
    if multi == False:
        features_train_new, colNames_train_new = Binary_PolynomialFeatures(colNames=colNames, degree=degree, features=X_train)
        features_test_new, colNames_test_new = Binary_PolynomialFeatures(colNames=colNames, degree=degree, features=X_test)
    else:
        features_train_new, colNames_train_new = Multi_PolynomialFeatures(colNames=colNames, degree=degree, features=X_train)
        features_test_new, colNames_test_new = Multi_PolynomialFeatures(colNames=colNames, degree=degree, features=X_test)
    assert colNames_train_new == colNames_test_new
    return features_train_new, features_test_new, colNames_train_new, colNames_test_new
Part 4. Higher-order group-statistics derivation functions
Issues to handle:
1. The training and test sets can end up with different numbers of derived features;
2. Statistics must be computed on the training set and then applied to the test set.
That gives the following helper:
def test_features(keyCol,
                  X_train,
                  X_test,
                  features_train_new,
                  multi=False):
    """
    Test-set feature filling function
    :param keyCol: key variable(s) used for grouping
    :param X_train: training-set features
    :param X_test: test-set features
    :param features_train_new: derived features of the training set
    :param multi: whether several variables participate in the grouping
    :return: the derived features and their names after group-statistics derivation
    """
    # Build the merge key,
    # attach it to the training-set derived features,
    # and build a test_key holding only the key
    if multi == False:
        keyCol = keyCol
        features_train_new[keyCol] = X_train[keyCol].reset_index()[keyCol]
        test_key = pd.DataFrame(X_test[keyCol])
    else:
        train_key, train_col = Multi_Cross_Combination(colNames=keyCol, features=X_train, OneHot=False)
        test_key, test_col = Multi_Cross_Combination(colNames=keyCol, features=X_test, OneHot=False)
        assert train_col == test_col
        keyCol = train_col
        features_train_new[keyCol] = train_key[train_col].reset_index()[train_col]
    # Deduplicate via groupby (one row of statistics per key value)
    features_test_or = features_train_new.groupby(keyCol).mean().reset_index()
    # Join onto the test set
    features_test_new = pd.merge(test_key, features_test_or, on=keyCol, how='left')
    # Drop the key column, keeping only the derived columns
    features_test_new.drop([keyCol], axis=1, inplace=True)
    features_train_new.drop([keyCol], axis=1, inplace=True)
    # Output the column names
    colNames_train_new = list(features_train_new.columns)
    colNames_test_new = list(features_test_new.columns)
    return features_train_new, features_test_new, colNames_train_new, colNames_test_new
Then we define the high-level group-statistics function:
def Group_Statistics(keyCol,
                     X_train,
                     X_test,
                     col_num=None,
                     col_cat=None,
                     num_stat=['mean', 'var', 'max', 'min', 'skew', 'median'],
                     cat_stat=['mean', 'var', 'max', 'min', 'median', 'count', 'nunique'],
                     quant=True,
                     multi=False):
    """
    Group-statistics feature derivation function (train/test aware)
    :param keyCol: key variable(s) used for grouping
    :param X_train: training-set features
    :param X_test: test-set features
    :param col_num: continuous variables participating in the derivation
    :param col_cat: categorical variables participating in the derivation
    :param num_stat: group statistics for continuous variables
    :param cat_stat: group statistics for categorical variables
    :param quant: whether to compute quartiles
    :param multi: whether to do multi-variable group-statistics derivation
    :return: the derived features and their names
    """
    features_test_new = pd.DataFrame()
    if multi == False:
        features_train_new, colNames_train_new = Binary_Group_Statistics(keyCol=keyCol, features=X_train, col_num=col_num, col_cat=col_cat, num_stat=num_stat, cat_stat=cat_stat, quant=quant)
        features_train_new[keyCol] = X_train[keyCol].reset_index()[keyCol]  # note: reset_index() prepares for the cross-validation calls later
        for col in colNames_train_new:
            # Map the per-key training statistics onto the test set
            order_label = features_train_new.groupby([keyCol]).mean()[col]
            features_test_new[col] = X_test[keyCol].map(order_label)
        features_train_new.drop([keyCol], axis=1, inplace=True)
        colNames_test_new = list(features_test_new.columns)
    else:
        features_train_new, colNames_train_new = Multi_Group_Statistics(keyCol=keyCol, features=X_train, col_num=col_num, col_cat=col_cat, num_stat=num_stat, cat_stat=cat_stat, quant=quant)
        train_key, train_col = Multi_Cross_Combination(colNames=keyCol, features=X_train, OneHot=False)
        test_key, test_col = Multi_Cross_Combination(colNames=keyCol, features=X_test, OneHot=False)
        features_train_new[train_col] = train_key[train_col].reset_index()[train_col]
        for col in colNames_train_new:
            order_label = features_train_new.groupby([train_col]).mean()[col]
            features_test_new[col] = test_key[test_col].map(order_label)
        features_train_new.drop([train_col], axis=1, inplace=True)
        colNames_test_new = list(features_test_new.columns)
    assert colNames_train_new == colNames_test_new
    return features_train_new, features_test_new, colNames_train_new, colNames_test_new
Part 5. Higher-order group-statistics derivation, part two: target encoding
To avoid handing the target values straight to the test set, we run K-fold cross statistics on the training set: for each fold, the group means computed on the other folds serve as that fold's encoded values; the training-set encodings are then group-averaged and applied to the test set.
The function is as follows:
def Target_Encode(keyCol,
                  X_train,
                  y_train,
                  X_test,
                  col_num=None,
                  col_cat=None,
                  num_stat=['mean', 'var', 'max', 'min', 'skew', 'median'],
                  cat_stat=['mean', 'var', 'max', 'min', 'median', 'count', 'nunique'],
                  quant=True,
                  multi=False,
                  extension=False,
                  n_splits=5,
                  random_state=42):
    """
    Target encoding
    :param keyCol: key variable(s) used for grouping
    :param X_train: training-set features
    :param y_train: training-set labels
    :param X_test: test-set features
    :param col_num: continuous variables participating in the derivation
    :param col_cat: categorical variables participating in the derivation
    :param num_stat: group statistics for continuous variables
    :param cat_stat: group statistics for categorical variables
    :param quant: whether to compute quartiles
    :param multi: whether to do multi-variable group-statistics derivation
    :param extension: whether to do second-order derivation (reserved; Group_Statistics above does not accept it)
    :param n_splits: number of cross-statistics folds
    :param random_state: random seed
    :return: the target-encoded features and their names
    """
    # Name of the label
    target = y_train.name
    # Full training set holding both features and label
    train = pd.concat([X_train, y_train], axis=1)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    # Container for each fold's validation-set result
    df_l = []
    # Cross statistics
    for trn_idx, val_idx in folds.split(train):
        trn_temp = train.iloc[trn_idx]
        val_temp = train.iloc[val_idx]
        # fix: the original also forwarded extension=extension, which Group_Statistics does not accept
        trn_temp_new, val_temp_new, colNames_trn_temp_new, colNames_val_temp_new = Group_Statistics(keyCol,
                                                                                                    X_train=trn_temp,
                                                                                                    X_test=val_temp,
                                                                                                    col_num=col_num,
                                                                                                    col_cat=col_cat,
                                                                                                    num_stat=num_stat,
                                                                                                    cat_stat=cat_stat,
                                                                                                    quant=quant,
                                                                                                    multi=multi)
        val_temp_new.index = val_temp.index
        df_l.append(val_temp_new)
    # Assemble the training-set derived features
    features_train_new = pd.concat(df_l).sort_index(ascending=True)
    colNames_train_new = [col + '_kfold' for col in features_train_new.columns]
    features_train_new.columns = colNames_train_new
    # Fill the test-set result
    features_train_new, features_test_new, colNames_train_new, colNames_test_new = test_features(keyCol=keyCol,
                                                                                                 X_train=X_train,
                                                                                                 X_test=X_test,
                                                                                                 features_train_new=features_train_new,
                                                                                                 multi=multi)
    # Zero-pad if the feature sets disagree
    if colNames_train_new != colNames_test_new:
        features_train_new, features_test_new, colNames_train_new, colNames_test_new = Features_Padding(features_train_new=features_train_new,
                                                                                                        features_test_new=features_test_new,
                                                                                                        colNames_train_new=colNames_train_new,
                                                                                                        colNames_test_new=colNames_test_new)
    assert colNames_train_new == colNames_test_new
    return features_train_new, features_test_new, colNames_train_new, colNames_test_new
Part 6. Wrapping the time-series derivation
The function is as follows:
def timeSeries_Creation(timeSeries_train,
                        timeSeries_test,
                        timeStamp=None,
                        precision_high=False):
    """
    Feature derivation for time-series fields (train/test aware)
    :param timeSeries_train: training-set time-series feature, which must be a Series
    :param timeSeries_test: test-set time-series feature, which must be a Series
    :param timeStamp: manually supplied key time points, as a dict whose keys/values are the names and the timestamp strings
    :param precision_high: whether to go down to hour / minute / second precision
    :return: the new train/test feature matrices and their feature names
    """
    features_train_new, colNames_train_new = timeSeriesCreation(timeSeries=timeSeries_train,
                                                                timeStamp=timeStamp,
                                                                precision_high=precision_high)
    features_test_new, colNames_test_new = timeSeriesCreation(timeSeries=timeSeries_test,
                                                              timeStamp=timeStamp,
                                                              precision_high=precision_high)
    assert colNames_train_new == colNames_test_new
    return features_train_new, features_test_new, colNames_train_new, colNames_test_new
Part 7. Wrapping the NLP-style derivation
This also needs separate training/test handling.
Note: col_cat holds the features to be group-counted, and keyCol is the grouping key.
def NLP_Group_Stat(X_train,
                   X_test,
                   col_cat,
                   keyCol=None,
                   tfidf=True,
                   countVec=True):
    """
    NLP-style feature derivation function (train/test aware)
    :param X_train: training-set features
    :param X_test: test-set features
    :param col_cat: categorical variables participating in the derivation; must be a list of columns
    :param keyCol: key variable(s) used for grouping; a string means grouping by a single column, a list means grouping by several columns
    :param tfidf: whether to compute tf-idf
    :param countVec: whether to compute CountVectorizer-style counts
    :return: the NLP-derived features and their names
    """
    # Derive the NLP features on the training set
    features_train_new, colNames_train_new = NLP_Group_Statistics(features=X_train,
                                                                  col_cat=col_cat,
                                                                  keyCol=keyCol,
                                                                  tfidf=tfidf,
                                                                  countVec=countVec)
    # Without grouping, the test set computes its NLP features independently
    if keyCol == None:
        features_test_new, colNames_test_new = NLP_Group_Statistics(features=X_test,
                                                                    col_cat=col_cat,
                                                                    keyCol=keyCol,
                                                                    tfidf=tfidf,
                                                                    countVec=countVec)
    # Otherwise the training-set statistics are applied to the test set
    else:
        if type(keyCol) == str:
            multi = False
        else:
            multi = True
        features_train_new, features_test_new, colNames_train_new, colNames_test_new = test_features(keyCol=keyCol,
                                                                                                     X_train=X_train,
                                                                                                     X_test=X_test,
                                                                                                     features_train_new=features_train_new,
                                                                                                     multi=multi)
    # If the derived features differ between the training and test sets
    if colNames_train_new != colNames_test_new:
        features_train_new, features_test_new, colNames_train_new, colNames_test_new = Features_Padding(features_train_new=features_train_new,
                                                                                                        features_test_new=features_test_new,
                                                                                                        colNames_train_new=colNames_train_new,
                                                                                                        colNames_test_new=colNames_test_new)
    assert colNames_train_new == colNames_test_new
    return features_train_new, features_test_new, colNames_train_new, colNames_test_new
Part 8. The usual order of feature derivation:
Stage 1. Time-series derivation. It depends on no other features, and its outputs can feed into the later cross-combination and group-statistics stages as candidate features.
Stage 2. Polynomial derivation. It usually only suits continuous variables or ordinal variables with many levels; in practice, watch the magnitude of the derived values and standardize if the absolute values grow too large.
Stage 3. Cross-combination derivation. Generally, combine all original categorical variables and some of the time-series-derived fields (those with few levels) pairwise; whether to go on to three-way combinations depends on how the pairwise ones perform.
Stage 4. Group-statistics derivation. Since it often groups by the results of cross combination, it usually comes after Stage 3. It is both the most important derivation step (it can produce a great number of effective features) and the most complex one. The key variable can be a single original variable, a derived time-series field, or a two-variable (or multi-variable) cross-combination field, but choosing keyCol well is far from trivial.
Stage 5. NLP-style derivation. It can be seen as an extension of group-statistics derivation, though not every dataset suits it. It is also essentially independent of the other methods: if a dataset fits the NLP pattern, just run the NLP method on its own, with little interplay with the earlier methods. The crux is judging whether the dataset fits NLP-style derivation at all.
Stage 5.NLP特征衍生。NLP特征衍生也可以看成是分组统计特征衍生的一种拓展形式,当然也并不是所有的数据集都适合进行NLP特征衍生,同时,NLP特征衍生也基本上可以看成是独立于其他方法的单独方法,如果出现了适合NLP特征衍生的情况,单独执行NLP方法即可,并不存在和此前方法过多的交叉,关键在于判定当前数据集是否适合进行NLP特征衍生。