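Machine Learning Fundamentals 03
Contents
1. KNN Algorithm for Classification
1.1 Measuring Distance Between Samples
1.1.1 Euclidean Distance
1.1.2 Manhattan Distance
1.2 How the KNN Algorithm Works
1.3 Drawbacks of KNN
1.4 API
2. Model Selection and Tuning
2.1 Hold-Out Cross-Validation
2.2 K-Fold Cross-Validation
2.3 Stratified K-Fold Cross-Validation
2.4 Other Validation Schemes
2.5 API
3. Saving and Loading Models
3.1 Saving a Model
3.2 Loading a Model
4. Hyperparameter Search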
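1. KNN Algorithm for Classification
1.1 Measuring Distance Between Samples
1.1.1 Euclidean Distance
For two samples a = (a1, ..., an) and b = (b1, ..., bn): d(a, b) = sqrt((a1 - b1)^2 + ... + (an - bn)^2)
1.1.2 Manhattan Distance
For two samples a = (a1, ..., an) and b = (b1, ..., bn): d(a, b) = |a1 - b1| + ... + |an - bn|
In the two-dimensional plane: d = |x1 - x2| + |y1 - y2|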
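The two distance measures named in this section can be sketched in a few lines of NumPy. The function names `euclidean_distance` and `manhattan_distance` are illustrative choices, not part of any library.

```python
import numpy as np


def euclidean_distance(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))


def manhattan_distance(a, b):
    """City-block distance: sum of the absolute coordinate differences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b)))


# Two points in the two-dimensional plane
p, q = [1, 2], [4, 6]
print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7.0
```

For these two points the coordinate differences are 3 and 4, so the Euclidean distance is sqrt(9 + 16) = 5 and the Manhattan distance is 3 + 4 = 7.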
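1.2 How the KNN Algorithm Works
K-Nearest Neighbors (KNN) determines the class of a sample from the classes of its K neighboring samples:
if the majority of the k most similar (nearest) samples in feature space belong to a given class, the sample is assigned to that class as well.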
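The rule above can be sketched from scratch as "compute distances, take the k nearest, vote". This is a minimal illustration that assumes Euclidean distance and a simple majority vote; `knn_predict` is a hypothetical helper and not how scikit-learn implements it.

```python
from collections import Counter

import numpy as np


def knn_predict(x_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    x_train = np.asarray(x_train, dtype=float)
    query = np.asarray(query, dtype=float)
    # Euclidean distance from the query to every training sample
    distances = np.sqrt(np.sum((x_train - query) ** 2, axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data: class "a" clusters near (1, 1), class "b" near (5, 5)
x_train = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_train = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(x_train, y_train, [1.1, 1.0], k=3))  # a
print(knn_predict(x_train, y_train, [5.1, 5.0], k=3))  # b
```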
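1.3 Drawbacks of KNN
(1) On large datasets, the computational cost is high.
(2) On high-dimensional data, distance measures can become less meaningful; this is the so-called "curse of dimensionality".
(3) A suitable value of k and a suitable distance metric have to be chosen.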
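Drawback (2) can be made concrete with a small experiment. The sketch below assumes uniformly random points in the unit hypercube and measures how much farther the farthest point is than the nearest one, relative to the nearest. `distance_contrast` is a hypothetical helper, and the setup is an illustration rather than a formal result.

```python
import numpy as np


def distance_contrast(n_dims, n_points=500, seed=0):
    """Return (max distance - min distance) / min distance from a random query."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()


# As the dimension grows, nearest and farthest points become almost equally far
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(distance_contrast(n_dims), 3))
```

When the contrast shrinks toward zero, "nearest" neighbors are barely nearer than any other sample, which is why KNN is often paired with dimensionality reduction.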
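1.4 API
class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, algorithm='auto')
Parameters:
(1) n_neighbors:
Default: 5. n_neighbors is the K in KNN.
(2) algorithm:
{'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'. The method used to find the nearest neighbors, i.e. the data structure in which the training samples are organized for searching.
Example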
# Import libraries
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA
# Load the dataset
dataSet = load_wine()
# print(dataSet.data)
# print(dataSet.feature_names)
# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(dataSet.data, dataSet.target, train_size=0.7, random_state=44)
# Standardize the features (fit on the training set only, then reuse on the test set)
scaler = StandardScaler()
standard_x_train = scaler.fit_transform(x_train)
standard_x_test = scaler.transform(x_test)
# print(standard_x_train)
# print(standard_x_test)
# Dimensionality reduction
pca = PCA(n_components=5)
pca_x_train = pca.fit_transform(standard_x_train)
pca_x_test = pca.transform(standard_x_test)
# print(pca_x_train)
# print(pca_x_test)
# Train the model
estimator = KNeighborsClassifier(n_neighbors=7)
estimator.fit(pca_x_train, y_train)
# Test and evaluate the model
# Method 1: call predict() and compare the predictions with the true labels
y_predict = estimator.predict(pca_x_test)
result = y_predict == y_test
# print(result)
accuracy_ratio = np.sum(result) / len(result)
print(accuracy_ratio)
# Method 2: call score() to get the accuracy directly
accuracy_ratio2 = estimator.score(pca_x_test, y_test)
print(accuracy_ratio2)
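2. Model Selection and Tuning
2.1 Hold-Out Cross-Validation
HoldOut Cross-validation (Train-Test Split)
In this cross-validation technique, the whole dataset is randomly split into a training set and a validation set. As a rule of thumb, about 70% of the data is used for training and the remaining 30% for validation. This is the direct dataset split we use most often.
Advantage: simple and easy to carry out.
Drawbacks: (1) Not suitable for imbalanced datasets. (2) Some samples never get the chance to train the model.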
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,train_size=0.7,shuffle=True,random_state=44)
Example
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
x, y = load_iris(return_X_y=True)
# Standardizer
scaler = StandardScaler()
# K-nearest-neighbors classifier
classifier = KNeighborsClassifier(n_neighbors=7)
# Hold-out split
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, shuffle=True, random_state=44)
standard_x_train = scaler.fit_transform(x_train)
standard_x_test = scaler.transform(x_test)
# Train the model
classifier.fit(standard_x_train, y_train)
# Print the accuracy on the held-out set
score = classifier.score(standard_x_test, y_test)
print(score)
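One way to soften the imbalance drawback of a hold-out split is the `stratify` argument of `train_test_split`, which keeps the class proportions equal in both parts. The sketch below reuses the iris data and the same split settings as above; it is one possible mitigation, not the only one.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)

# stratify=y asks for the same class proportions in the training and test parts
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.7, shuffle=True, random_state=44, stratify=y
)

# Number of samples per class in each part
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts, test_counts)
```

Iris has 50 samples per class, so a 70/30 stratified split puts 35 samples of each class in the training part and 15 in the test part.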
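2.2 K-Fold Cross-Validation
(K-fold Cross Validation, abbreviated K-CV or K-fold)
In K-fold cross-validation, the whole dataset is divided into K parts of (approximately) equal size. Each part is called a "fold", hence the name K-fold. One fold is used as the validation set and the remaining K-1 folds are used as the training set.
The procedure is repeated K times, until every fold has served as the validation set once while the others were used for training.
The final accuracy of the model is the average of the validation accuracies of the k models.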
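The rotation described above can be observed directly on the indices that `KFold` produces. This is a small sketch on ten dummy samples without shuffling; the array contents are arbitrary.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy samples with two features each
x = np.arange(20).reshape(10, 2)

k_fold = KFold(n_splits=5)  # no shuffling, so the folds are contiguous blocks
validation_counts = np.zeros(len(x), dtype=int)

for fold_id, (train_index, val_index) in enumerate(k_fold.split(x)):
    # Count how often each sample lands in a validation fold
    validation_counts[val_index] += 1
    print(fold_id, train_index, val_index)

# Every sample is used for validation exactly once across the 5 rounds
print(validation_counts)
```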
2.3 Stratified K-Fold Cross-Validation
Stratified k-fold cross validation
A variant of K-fold cross-validation. "Stratified" means that each fold preserves the class proportions of the original data.
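The effect of stratification is easiest to see on imbalanced, sorted labels. The sketch below assumes a toy label vector with 80 samples of class 0 followed by 20 of class 1, and counts the classes in each test fold; `class_counts_per_fold` is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy imbalanced labels: 80% class 0, 20% class 1, sorted by class
y = np.array([0] * 80 + [1] * 20)
x = np.zeros((len(y), 1))  # the features are irrelevant for this illustration


def class_counts_per_fold(splitter):
    """Return [count of class 0, count of class 1] for each test fold."""
    return [
        np.bincount(y[test_index], minlength=2).tolist()
        for _, test_index in splitter.split(x, y)
    ]


plain_counts = class_counts_per_fold(KFold(n_splits=5))
stratified_counts = class_counts_per_fold(StratifiedKFold(n_splits=5))
print(plain_counts)       # four folds hold only class 0, one fold only class 1
print(stratified_counts)  # every fold keeps the 80/20 ratio: [16, 4]
```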
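2.4 Other Validation Schemes
(1) Leave-P-Out cross-validation
(2) Leave-One-Out cross-validation
(3) Monte Carlo cross-validation
(4) Time series cross-validation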
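scikit-learn ships splitters that correspond to these four schemes. The sketch below assumes that `ShuffleSplit` (repeated random train/test splits) is the counterpart of Monte Carlo cross-validation; the six-sample array is dummy data.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut, ShuffleSplit, TimeSeriesSplit

# Six dummy samples with two features each
x = np.arange(12).reshape(6, 2)

# (1) Leave-P-Out: every possible test set of size p, here C(6, 2) splits
n_leave_p_out = LeavePOut(p=2).get_n_splits(x)
# (2) Leave-One-Out: one split per sample
n_leave_one_out = LeaveOneOut().get_n_splits(x)
# (3) Monte Carlo style: a chosen number of independent random splits
monte_carlo = ShuffleSplit(n_splits=4, test_size=0.33, random_state=0)
n_monte_carlo = monte_carlo.get_n_splits(x)
# (4) Time series: training indices always precede the test indices
time_series_folds = list(TimeSeriesSplit(n_splits=3).split(x))

print(n_leave_p_out, n_leave_one_out, n_monte_carlo)
for train_index, test_index in time_series_folds:
    print(train_index, test_index)
```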
2.5 API
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
strat_k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
k_fold = KFold(n_splits=5, shuffle=True, random_state=42)
Parameters:
n_splits: number of folds
shuffle: whether to shuffle (randomize) the data before splitting; if False, the data is split in its original order
random_state: random seed (only has an effect when shuffle=True)
indices1 = strat_k_fold.split(X, y)
indices2 = k_fold.split(X, y)
split() returns an iterable with n_splits items; each item holds the training-set and test-set indices of one fold.
A for loop can then take the indices of each fold and use them to select the corresponding training and test features as well as the training and test targets:
for train_index, test_index in indices1:
    x_train, x_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
for train_index, test_index in indices2:
    x_train, x_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
StratifiedKFold example:
from sklearn.model_selection import StratifiedKFold
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
x, y = load_iris(return_X_y=True)
fold1 = StratifiedKFold(n_splits=5, shuffle=True, random_state=44)
# Standardizer
scaler = StandardScaler()
# K-nearest-neighbors classifier
classifier = KNeighborsClassifier(n_neighbors=7)
# Stratified K-fold cross-validation
accuracies = []
for train_index, test_index in fold1.split(x, y):
    # print(train_index)
    # print(test_index)
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Fit the scaler on this fold's training part only
    standard_x_train = scaler.fit_transform(x_train)
    standard_x_test = scaler.transform(x_test)
    # Train the model
    classifier.fit(standard_x_train, y_train)
    # Accuracy score of this fold
    score = classifier.score(standard_x_test, y_test)
    # print(score)
    accuracies.append(score)
# Mean accuracy over all folds
print(sum(accuracies) / len(accuracies))
KFold example:
from sklearn.model_selection import KFold
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
x, y = load_iris(return_X_y=True)
fold2 = KFold(n_splits=5, shuffle=True, random_state=44)
# Standardizer
scaler = StandardScaler()
# K-nearest-neighbors classifier
classifier = KNeighborsClassifier(n_neighbors=7)
# K-fold cross-validation
accuracies2 = []
for train_index, test_index in fold2.split(x, y):
    # print(train_index)
    # print(test_index)
    x_train, x_test = x[train_index], x[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Fit the scaler on this fold's training part only
    standard_x_train = scaler.fit_transform(x_train)
    standard_x_test = scaler.transform(x_test)
    # Train the model
    classifier.fit(standard_x_train, y_train)
    # Accuracy score of this fold
    score = classifier.score(standard_x_test, y_test)
    # print(score)
    accuracies2.append(score)
# Mean accuracy over all folds
print(sum(accuracies2) / len(accuracies2))
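The manual loops above can be written more compactly. A possible sketch, assuming the same iris data and settings, wraps the scaler and the classifier in a pipeline and lets `cross_val_score` run the folds; the pipeline refits the scaler on each fold's training part, as the loops do.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

x, y = load_iris(return_X_y=True)

# Scaler + classifier as one estimator, so scaling is refit inside every fold
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=44)

# One accuracy score per fold
scores = cross_val_score(model, x, y, cv=cv)
print(scores)
print(scores.mean())
```

Passing `KFold(...)` as `cv` instead gives the non-stratified variant.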
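3. Saving and Loading Models
3.1 Saving a Model
joblib.dump(estimator, model_path)
Parameters:
estimator: the trained estimator object to save
model_path: path where the model is stored
3.2 Loading a Model
estimator = joblib.load(model_path)
Parameters:
model_path: path where the model is stored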
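import os
import joblib
# `estimator` and `dataSet` come from the wine example in section 1.4
# Make sure the target directory exists before saving
os.makedirs('model', exist_ok=True)
# Save the model
joblib.dump(estimator, 'model/wine.pkl')
# Load the model
estimator = joblib.load('model/wine.pkl')
# Feed in data and predict with the model
# (5 values: the model was trained on standardized data reduced to 5 PCA components)
y_predict = estimator.predict([[0.89221896, -1.92495677, -0.24116668, 0.19771249, 0.93709413]])
wine_type = dataSet.target_names[y_predict]
print(wine_type)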
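The snippet above stores only the classifier, so new inputs have to be standardized and PCA-transformed by hand before prediction. A possible alternative, sketched below under the assumption that preprocessing and model are bundled in a scikit-learn pipeline, saves everything as one object. The file name `wine_pipeline.pkl` and the temporary directory are illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_wine()

# Bundle preprocessing and classifier so they are saved and loaded together
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
model.fit(data.data, data.target)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "wine_pipeline.pkl")
    joblib.dump(model, model_path)     # save
    restored = joblib.load(model_path)  # load

# The restored pipeline accepts raw features and reproduces the predictions
same_predictions = np.array_equal(model.predict(data.data), restored.predict(data.data))
print(same_predictions)
```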
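4. Hyperparameter Search
Hyperparameters are parameters that are set by hand rather than learned from the data.
Hyperparameter search is commonly done by grid search (Grid Search).
For example, in the KNN algorithm k is set by hand, so it is a hyperparameter. Grid search tries each candidate value automatically and finds the one that performs best.
class sklearn.model_selection.GridSearchCV(estimator, param_grid, cv=None)
Parameters:
estimator: a scikit-learn estimator instance
param_grid: a dict with parameter names (str) as keys and lists of values to try as values
Example: {"n_neighbors": [1, 3, 5, 7, 9, 11]}
cv: determines the cross-validation splitting strategy. Possible values:
(1) None: 5-fold by default
(2) integer: the number of folds
When cv is None or an integer and the estimator is a classifier, stratified k-fold cross-validation (StratifiedKFold) is used. In all other cases, KFold is used.
Notes:
GridSearchCV performs cross-validation (CV) and grid search (GridSearch) together. GridSearchCV is itself an estimator, and it has several important attributes:
best_params_ | best parameter setting
best_score_ | mean cross-validation score of the best parameter setting
best_estimator_ | best estimator
cv_results_ | description of the cross-validation process
best_index_ | index of the best candidate in cv_results_ (here, the position of the best k in the list)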
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
# Load the dataset
dataSet = load_breast_cancer()
# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(dataSet.data, dataSet.target, train_size=0.7, random_state=8)
# Standardize the features
scaler = StandardScaler()
standard_x_train = scaler.fit_transform(x_train)
standard_x_test = scaler.transform(x_test)
# Dimensionality reduction
pca = PCA(n_components=10)
pca_x_train = pca.fit_transform(standard_x_train)
pca_x_test = pca.transform(standard_x_test)
# KNN estimator; n_neighbors is left unset because the search supplies it
knn = KNeighborsClassifier()
# Hyperparameter searcher
grid_search = GridSearchCV(knn, param_grid={'n_neighbors': [1, 3, 5, 7, 9]}, cv=10)
# Train the model
grid_search.fit(pca_x_train, y_train)
# Test and evaluate the model
best_model = grid_search.best_estimator_
test_score = best_model.score(pca_x_test, y_test)
print(f"Test_score: {test_score}")
print(f"Best_params_: {grid_search.best_params_}")
print(f"Best_score_: {grid_search.best_score_}")
print(f"Best_estimator_: {grid_search.best_estimator_}")
print(f"Cv_results_: {grid_search.cv_results_}")
# print(type(grid_search.cv_results_))
print(f"Best_index_: {grid_search.best_index_}")
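In the example above, the scaler and PCA are fitted on the whole training set before the search, so each validation fold has already influenced the preprocessing. A possible variant, sketched below, puts all three steps in a `Pipeline` so they are refitted inside every fold. The step names `scaler`, `pca` and `knn` are arbitrary labels chosen here; the parameter key follows scikit-learn's `<step>__<parameter>` convention.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, train_size=0.7, random_state=8
)

# Preprocessing and classifier as one estimator
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("knn", KNeighborsClassifier()),
])

candidate_ks = [1, 3, 5, 7, 9]
grid_search = GridSearchCV(pipeline, param_grid={"knn__n_neighbors": candidate_ks}, cv=10)
grid_search.fit(x_train, y_train)

best_k = grid_search.best_params_["knn__n_neighbors"]
# score() uses the best pipeline, refitted on the full training set
test_score = grid_search.score(x_test, y_test)
print(best_k, grid_search.best_score_, test_score)
```

Because the grid has a single parameter, `candidate_ks[grid_search.best_index_]` gives the same value as `best_k`.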