
Data mining notes

article 2024/9/19 8:59:09

1. Writing multiple if/else branches in a lambda expression

adult_data["marital-status"] = adult_data["marital-status"].apply(
    lambda x: 1 if x.strip() == "Married-civ-spouse" else 2 if x.strip() == "Never-married"
    else 3 if x.strip() == "Divorced" else 4 if x.strip() == "Separated"
    else 5 if x.strip() == "Widowed" else 6 if x.strip() == "Married-spouse-absent"
    else 7
)
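
A long conditional chain like this can also be written as a dictionary lookup, which tends to be easier to read and extend. The following is a minimal sketch on a small made-up DataFrame: the sample rows and the name `status_codes` are assumptions for illustration, while the codes 1 to 7 follow the lambda above.

```python
import pandas as pd

# Hypothetical sample rows; the real adult_data has many more rows and columns.
adult_data = pd.DataFrame(
    {"marital-status": [" Married-civ-spouse", " Divorced", " Never-married", " Other"]}
)

# Same codes as the lambda chain; anything not listed becomes 7.
status_codes = {
    "Married-civ-spouse": 1,
    "Never-married": 2,
    "Divorced": 3,
    "Separated": 4,
    "Widowed": 5,
    "Married-spouse-absent": 6,
}

adult_data["marital-status"] = (
    adult_data["marital-status"]
    .str.strip()          # remove surrounding whitespace, like x.strip()
    .map(status_codes)    # values missing from the dict become NaN
    .fillna(7)            # the final "else 7" branch
    .astype(int)
)
```

Whether this is preferable depends on taste; the lambda keeps everything in one expression, the dictionary keeps the mapping as data.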

2. Using a condition when selecting DataFrame rows and columns

sns.distplot(dataset_con.loc[dataset_con["target"]==1,"age"],kde_kws={"label":">50k"})
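
The selection inside the plotting call can be tried on its own, without seaborn. This is a small sketch using a made-up `dataset_con` (the four sample rows and the variable names `mask`, `ages_over_50k`, and `ages_filtered` are assumptions for illustration). It shows a boolean mask passed to `loc` together with a column name, and two conditions combined with `&`.

```python
import pandas as pd

# Hypothetical stand-in for dataset_con with only the two columns used here.
dataset_con = pd.DataFrame({"age": [25, 47, 38, 52], "target": [0, 1, 0, 1]})

# A boolean Series: True for the rows where target equals 1.
mask = dataset_con["target"] == 1

# Rows where the mask is True, column "age" only.
ages_over_50k = dataset_con.loc[mask, "age"]

# Conditions are combined with & (and) or | (or); each one needs parentheses.
ages_filtered = dataset_con.loc[mask & (dataset_con["age"] > 50), "age"]
```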

3. Commonly used plots
3.1 seaborn's countplot, a count plot


sns.countplot(data=diabetes,x="Outcome")

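3.2 seaborn's distplot, a histogram (deprecated since seaborn 0.11; histplot and displot are the replacements)
3.3 The difference between a histogram and a bar chart: in short, a histogram shows the distribution of the data, while a bar chart compares magnitudes. See https://zhuanlan.zhihu.com/p/61433510
3.4 seaborn's barplot, a bar plot; hue splits each bar by a second categorical variable and adds a small legend
3.5 plt's scatter plot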

plt.scatter(x[:,0],x[:,1],c=y,marker="o")

x, y: the input data, e.g. arrays of shape (n,)
c: the color
marker: the marker shape
3.6 plt's box plot, boxplot
3.7 To draw a pie chart, use plt's pie

counts=diabetes["Outcome"].value_counts()
plt.pie(counts,labels=counts.index)  # labels follow the order of value_counts()

3.8 Histograms and box plots can be drawn directly from the loaded DataFrame

Histograms
diabetes.hist(figsize=(16,14))
Box plots
diabetes.plot(kind='box', subplots=True, layout=(4,4), figsize=(16,14))

3.9 Pairwise scatter plots with sns

sns.pairplot(diabetes,vars=diabetes.columns,hue="Outcome")
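
Item 3.3 above contrasts histograms with bar charts. The difference can be seen in the numbers behind each plot, before anything is drawn. The sketch below uses made-up values (`ages` and `outcome` are assumptions, not the diabetes data): a histogram bins a continuous variable, whereas a bar chart has one bar per category.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a continuous feature and a categorical label.
ages = np.array([21, 22, 25, 31, 33, 38, 41, 45, 52, 60])
outcome = pd.Series([0, 1, 0, 0, 1, 0, 0, 1, 0, 0])

# Histogram: the value range is cut into bins and the values in each bin are counted.
bin_counts, bin_edges = np.histogram(ages, bins=4)

# Bar chart: each distinct category gets one bar whose height is a count (or another statistic).
category_counts = outcome.value_counts()
```

`bin_counts` and `bin_edges` are what a histogram draws; `category_counts` is what countplot or a bar chart draws.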

4. Jupyter notebook tips
4.1 Alt+Left: the cursor jumps to the start of the line
4.2 Alt+Right: the cursor jumps to the end of the line
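5. About LabelEncoder in sklearn's preprocessing module
Simply put, LabelEncoder encodes n category values as integers from 0 to n-1, establishing a one-to-one mapping between categories and integer codes.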
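
A short example may make the mapping more concrete. This is a minimal sketch with made-up category values (the list `labels` is an assumption); it relies only on `fit_transform`, `classes_`, and `inverse_transform` from scikit-learn's LabelEncoder.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values with n = 3 distinct classes.
labels = ["Divorced", "Widowed", "Divorced", "Separated"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)  # integer codes in the range 0..n-1

# classes_ holds the sorted distinct values; a value's position is its code.
classes = list(encoder.classes_)

# The mapping is one-to-one, so the codes can be turned back into the labels.
decoded = list(encoder.inverse_transform(codes))
```

Because the classes are sorted before codes are assigned, the integers carry no meaning beyond identity, which matters for models that treat them as ordered.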
6. numpy.ndarray
An ndarray does not support iloc or loc; its data can only be accessed with plain slicing, as follows

dataset_con_enc.values[:,2:]
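
The difference between the two indexing styles can be checked on a tiny example. The DataFrame `df` below is made up for illustration; the sketch shows that `.values` yields an ndarray without `iloc`/`loc`, and that plain slicing on it selects the same cells as `iloc` on the DataFrame.

```python
import pandas as pd

# Hypothetical DataFrame with four columns.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

arr = df.values              # numpy.ndarray: no .iloc or .loc attribute
from_array = arr[:, 2:]      # all rows, columns from position 2 onward
from_frame = df.iloc[:, 2:]  # the same selection on the DataFrame itself
```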

7. Converting string values to numeric values

pd.DataFrame([1 if i == "cc" else 0 for i in xx])
df1 = df[0].apply(lambda x:1 if x == 'AAA' else 2 if x == 'BBB' else 3)

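8. Splitting the data into training features, test features, training labels, and test labels
8.1 After handling outliers and missing values, the data still needs feature scaling (standardization with StandardScaler, min-max normalization with MinMaxScaler), feature encoding (object to float/int, via one-hot or dummy encoding, or label encoding), dimensionality reduction (PCA), and feature selection (corr). With the target column excluded, the remaining features form the feature matrix passed to train_test_split from sklearn.model_selection
8.2 The function used is train_test_split from sklearn.model_selection
8.3 The label set is simply the target column extracted from the original data
8.4 The model predicts on the test features x_test; scoring compares the test labels with the predictions, i.e. (y_test, y_pre)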
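
The steps in 8.1 to 8.4 can be sketched end to end as follows. Everything here is an assumption made for illustration: the data is synthetic, LogisticRegression stands in for whatever model is used, and accuracy_score is one possible scoring function. Only train_test_split, StandardScaler, and the names x_test, y_test, and y_pre come from the notes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 samples, 4 features, a binary target derived from two of them.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))            # features, target column excluded
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # labels, i.e. the target column

# Split features and labels together into training and test parts.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale with statistics from the training part, then fit the model.
scaler = StandardScaler().fit(x_train)
model = LogisticRegression().fit(scaler.transform(x_train), y_train)

# Predict on the test features, then score predictions against the test labels.
y_pre = model.predict(scaler.transform(x_test))
score = accuracy_score(y_test, y_pre)
```

Fitting the scaler on the training part only is one common choice, made so that test-set statistics do not leak into training; scaling before the split, as in 8.1, is simpler but uses those statistics.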
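9. About PCA (principal component analysis) for dimensionality reduction, and interpretability
9.1 Use corr to analyze correlations; strongly correlated features carry overlapping information and tend to be combined into the same principal component
9.2 seaborn's heatmap and pairplot can be used to visualize the correlations
9.3 Reference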

https://mp.weixin.qq.com/s?__biz=MzAxMjUyNDQ5OA==&mid=2653563742&idx=1&sn=0c30499a4680bdf025d41feea6d04c83&chksm=806e05e3b7198cf50c95ed5edff7c2d927d4bcc3427e1fdaa1c897a612ba3cd4130dcb148773&scene=27
9.4 Concretely, running PCA so that the retained explained variance ratio is at least 0.95

from sklearn.decomposition import PCA
# n_components can also be set directly to the target number of dimensions
pca=PCA(n_components=0.95)
pc=pca.fit_transform(corr)
pca.explained_variance_ratio_

Output
array([0.42270748, 0.16685479, 0.13368761, 0.1022182 , 0.05467414,
        0.02581897, 0.02232875, 0.01844716, 0.01551768])
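
To see what a fractional n_components does, the following sketch runs the same call on synthetic data. The data construction and the names `latent`, `mixing`, and `cumulative` are assumptions for illustration. With a value between 0 and 1, scikit-learn keeps enough components for the cumulative explained variance ratio to reach that threshold.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 observed features driven by 3 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 10))

# A float in (0, 1) asks for enough components to explain that share of variance.
pca = PCA(n_components=0.95)
pc = pca.fit_transform(X)

# Running total of explained variance; the last entry is the share actually kept.
cumulative = np.cumsum(pca.explained_variance_ratio_)
```

`pca.n_components_` reports how many components were kept, and `pc` has that many columns.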

10. DataFrame operations
10.1 Converting a Series to a DataFrame

y=diabetes.iloc[:,8]
y=y.to_frame(name="Outcome")  # Series -> single-column DataFrame

11. Models
11.1 xgboost

from xgboost.sklearn import XGBClassifier
model_rf = XGBClassifier()
model_rf.fit(X_train,Y_train)
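Parameters
learning_rate=0.1
n_estimators=20
max_depth=4
objective='binary:logistic'
seed=27 (random_state in current versions of the scikit-learn wrapper)
silent=0 (replaced by verbosity in newer xgboost releases)
Parameter tuning: https://blog.csdn.net/u010657489/article/details/51952785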
