A Detailed Look at the Content and Structure of the Iris Dataset in sklearn, with a Worked Example of Looking Up the load_iris() Function
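- A Detailed Look at the Content and Structure of the Iris Dataset in sklearn, with a Worked Example of Looking Up the load_iris() Function
- 1. Location of the Iris Data
- 2. Loading the Iris Data
- 2.1 Looking up how to use load_iris()
- (1) Step 1: Open the Python Module Docs page in a browser
- (2) Step 2: Open the sklearn (package) documentation in the browser
- (3) Step 3: Open the datasets (package) page inside the sklearn (package) documentation
- (4) Step 4: On the datasets (package) page, search further down for the load_iris function
- 2.2 How to use load_iris
- 2.2.1 Parameters
- 2.2.2 Returns
- 2.2.3 Notes
- 2.2.4 Examples
- 3. Analyzing the Iris Data
- 3.1 Inspecting the data structure
- 3.2 The parts of the iris data
- 3.2.1 The DESCR part
- 3.2.2 The data part
- 3.2.3 The target part
- 3.2.4 The target_names part
- 3.2.5 The feature_names part
- 3.2.6 The filename part
- 3.2.7 The frame part
- 3.2.8 The data_module part
- 4. Plotting the Iris Data
- 4.1 Scatter plot of the sepal data
- 4.2 Scatter plot of the petal data
- 5. Summary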
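The iris dataset is widely used in machine learning with sklearn. Once sklearn is installed, the iris data sits in the datasets package, as shown in Figure 1. The dataset holds petal and sepal measurements for three types of iris (Setosa, Versicolour, and Virginica), stored in a 150x4 numpy.ndarray. Rows are samples, and the columns are sepal length, sepal width, petal length, and petal width. This article looks closely at the content and structure of the iris data and walks through one way of looking up the load_iris function; the same approach carries over to other functions.
1. Location of the Iris Data
Once sklearn is installed, the iris data sits in the datasets package, as shown in Figure 1. For the botanical structure of the iris flower itself, see the author's earlier post: The Structure of the Iris Plant and Installing the scikit-learn Package in Python.
Figure 1. Location of the datasets package
2. Loading the Iris Data
To load the iris data, use the following Python code: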
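## 1. Import the datasets module from sklearn
from sklearn import datasets
## 2. Load the iris data from datasets and assign it to iris
# iris is a dictionary-like object. load_iris() is a function in the datasets
# module of the machine learning library sklearn. Figures 2 to 5 show how to look up its usage.
iris = datasets.load_iris()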
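2.1 Looking up how to use load_iris()
(1) Step 1: Open the Python Module Docs page in a browser
Figure 2. Looking up load_iris, step 1: open the Python Module Docs page in a browser
(2) Step 2: Open the sklearn (package) documentation in the browser
Figure 3. Looking up load_iris, step 2: open the sklearn (package) documentation in the browser
(3) Step 3: Open the datasets (package) page inside the sklearn (package) documentation
Figure 4. Looking up load_iris, step 3: open the datasets (package) page inside the sklearn (package) documentation
(4) Step 4: On the datasets (package) page, search further down for the load_iris function
Figure 5. Looking up load_iris, step 4: on the datasets (package) page, search further down for the load_iris function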
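The same documentation text can also be read without a browser. The sketch below is an alternative to the browser steps, not the method used in the figures. It relies only on the standard-library inspect module and assumes sklearn is installed.

```python
import inspect
from sklearn.datasets import load_iris

# The signature lists the keyword-only parameters and their defaults.
signature = inspect.signature(load_iris)
print(signature)

# The docstring is the text that a documentation page for load_iris renders.
doc = inspect.getdoc(load_iris)
print(doc.splitlines()[0])

# In an interactive session, help(load_iris) pages through the full text.
```

This is handy when only a console is available; the browser route is more comfortable for skimming a whole package.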
2.2 How to use load_iris
load_iris(*, return_X_y=False, as_frame=False)
Load and return the iris dataset (classification).
The iris dataset is a classic and very easy multi-class classification dataset.
=================   ==============
Classes                          3
Samples per class               50
Samples total                  150
Dimensionality                   4
Features            real, positive
=================   ==============
Read more in the :ref:`User Guide <iris_dataset>`.
2.2.1 Parameters
return_X_y : bool, default=False
    If True, returns ``(data, target)`` instead of a Bunch object. See below for more information about the `data` and `target` object.
    .. versionadded:: 0.18
as_frame : bool, default=False
    If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric). The target is a pandas DataFrame or Series depending on the number of target columns. If `return_X_y` is True, then (`data`, `target`) will be pandas DataFrames or Series as described below.
    .. versionadded:: 0.23
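To make the effect of return_X_y concrete, the following minimal sketch calls load_iris both ways. The variable names bunch, X and y are chosen here for illustration only.

```python
from sklearn.datasets import load_iris

# Default call: a Bunch object that bundles data, target and metadata.
bunch = load_iris()

# return_X_y=True: only the feature matrix and the label vector, as a tuple.
X, y = load_iris(return_X_y=True)

print(type(bunch).__name__)   # Bunch
print(X.shape, y.shape)       # (150, 4) (150,)
```

The tuple form is convenient when the data goes straight into an estimator and the metadata is not needed.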
2.2.2 Returns
data : :class:`~sklearn.utils.Bunch`
    Dictionary-like object, with the following attributes.
    data : {ndarray, dataframe} of shape (150, 4)
        The data matrix. If `as_frame=True`, `data` will be a pandas DataFrame.
    target: {ndarray, Series} of shape (150,)
        The classification target. If `as_frame=True`, `target` will be a pandas Series.
    feature_names: list
        The names of the dataset columns.
    target_names: list
        The names of target classes.
    frame: DataFrame of shape (150, 5)
        Only present when `as_frame=True`. DataFrame with `data` and `target`.
        .. versionadded:: 0.23
    DESCR: str
        The full description of the dataset.
    filename: str
        The path to the location of the data.
        .. versionadded:: 0.20
(data, target) : tuple if ``return_X_y`` is True
    A tuple of two ndarray. The first containing a 2D array of shape (n_samples, n_features) with each row representing one sample and each column representing the features. The second ndarray of shape (n_samples,) containing the target samples.
    .. versionadded:: 0.18
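The returned Bunch can be read either like a dictionary or through attributes, and the frame entry only carries data when as_frame=True. The sketch below illustrates both points. It assumes pandas is installed, which as_frame=True requires, and the names iris_df and same_object are illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch exposes each entry both as a dictionary key and as an attribute.
same_object = iris["data"] is iris.data
print(same_object)            # True
print(iris.frame)             # None, because as_frame defaults to False

# With as_frame=True (needs pandas), frame holds the four features plus the target.
iris_df = load_iris(as_frame=True)
print(iris_df.frame.shape)    # (150, 5)
print(list(iris_df.frame.columns))
```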
2.2.3 Notes
.. versionchanged:: 0.20
Fixed two wrong data points according to Fisher's paper.
The new version is the same as in R, but not as in the UCI
Machine Learning Repository.
2.2.4 Examples
Let's say you are interested in the samples 10, 25, and 50, and want to know their class name.
from sklearn.datasets import load_iris
data = load_iris()
data.target[[10, 25, 50]]  # returns array([0, 0, 1])
list(data.target_names)  # returns [np.str_('setosa'), np.str_('versicolor'), np.str_('virginica')]
See :ref:`sphx_glr_auto_examples_datasets_plot_iris_dataset.py` for a more detailed example of how to work with the iris dataset.
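The docstring example stops at the integer codes and the list of names. One possible way to join the two, sketched here with illustrative variable names, is to index target_names with the codes:

```python
from sklearn.datasets import load_iris

data = load_iris()
sample_ids = [10, 25, 50]

# target holds integer class codes; target_names is indexed by those codes.
codes = data.target[sample_ids]
names = data.target_names[codes]

for i, code, name in zip(sample_ids, codes, names):
    print(f"sample {i}: class {code} ({name})")
```

Samples 10 and 25 come out as setosa and sample 50 as versicolor, matching the codes 0, 0, 1 shown above.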
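3. Analyzing the Iris Data
3.1 Inspecting the data structure
## 1. Import the datasets module from sklearn
from sklearn import datasets
## 2. Load the iris data from datasets and assign it to iris
iris = datasets.load_iris()  # iris is a dictionary-like Bunch object
# print("Shape of iris:\n{}".format(iris.shape()))  # left disabled: a Bunch has no shape() method
## 3. Print all key names of iris
print("Keys of iris:\n{}".format(iris.keys()))
Output:
Figure 6. Inspecting the structure of the iris data
Figure 6 shows that iris is a dictionary-like data type with 8 keys. To dig further, run the code in the Python console of PyCharm (marker 1 in Figure 7) and look at the panel on the left of Figure 7, which shows the structure of iris together with the concrete data it contains.
Figure 7. Inspecting the structure of the iris data in the Python console of PyCharm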
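For readers without PyCharm, a rough text equivalent of the variable view in Figure 7 can be produced with a short loop. This is only a sketch; the summary format and the name detail are choices made here, not part of sklearn.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Summarize each entry by type and, where available, shape or length.
for key, value in iris.items():
    if hasattr(value, "shape"):
        detail = f"shape={value.shape}"
    elif hasattr(value, "__len__"):
        detail = f"len={len(value)}"
    else:
        detail = repr(value)
    print(f"{key:15s} {type(value).__name__:10s} {detail}")
```

The exact set of keys depends on the installed sklearn version, so the number of lines printed may differ from the 8 keys in Figure 6.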
3.2 The parts of the iris data
3.2.1 The DESCR part
Call:
print("Values of key 'DESCR' of iris:\n{}".format(iris.get('DESCR')))
Output:
Values of key 'DESCR' of iris:
.. _iris_dataset:
Iris plants dataset
Data Set Characteristics:
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher’s paper. Note that it’s the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher’s paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
.. dropdown:: References
- Fisher, R.A. "The use of multiple measurements in taxonomic problems" Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis. (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System Structure and Classification Rule for Recognition in Partially Exposed Environments". IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II conceptual clustering system finds 3 classes in the data.
- Many, many more ...
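The summary statistics quoted in DESCR can be recomputed from the data array itself. The sketch below does so with NumPy. Using ddof=1 (the sample standard deviation) is an assumption made here, since DESCR does not say how SD was computed, and small rounding differences from the printed table are possible.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data

# Column-wise statistics, one value per feature.
stats = {
    "min": X.min(axis=0),
    "max": X.max(axis=0),
    "mean": X.mean(axis=0),
    "sd": X.std(axis=0, ddof=1),  # sample standard deviation; ddof=1 is an assumption
}

for j, name in enumerate(iris.feature_names):
    row = {k: round(float(v[j]), 2) for k, v in stats.items()}
    print(name, row)
```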
3.2.2 The data part
Call:
print("Values of key 'data' of iris:\n{}".format(iris.get('data')))
Output:
Values of key 'data' of iris:
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]
[5.4 3.7 1.5 0.2]
[4.8 3.4 1.6 0.2]
[4.8 3. 1.4 0.1]
[4.3 3. 1.1 0.1]
[5.8 4. 1.2 0.2]
[5.7 4.4 1.5 0.4]
[5.4 3.9 1.3 0.4]
[5.1 3.5 1.4 0.3]
[5.7 3.8 1.7 0.3]
[5.1 3.8 1.5 0.3]
[5.4 3.4 1.7 0.2]
[5.1 3.7 1.5 0.4]
[4.6 3.6 1. 0.2]
[5.1 3.3 1.7 0.5]
[4.8 3.4 1.9 0.2]
[5. 3. 1.6 0.2]
[5. 3.4 1.6 0.4]
[5.2 3.5 1.5 0.2]
[5.2 3.4 1.4 0.2]
[4.7 3.2 1.6 0.2]
[4.8 3.1 1.6 0.2]
[5.4 3.4 1.5 0.4]
[5.2 4.1 1.5 0.1]
[5.5 4.2 1.4 0.2]
[4.9 3.1 1.5 0.2]
[5. 3.2 1.2 0.2]
[5.5 3.5 1.3 0.2]
[4.9 3.6 1.4 0.1]
[4.4 3. 1.3 0.2]
[5.1 3.4 1.5 0.2]
[5. 3.5 1.3 0.3]
[4.5 2.3 1.3 0.3]
[4.4 3.2 1.3 0.2]
[5. 3.5 1.6 0.6]
[5.1 3.8 1.9 0.4]
[4.8 3. 1.4 0.3]
[5.1 3.8 1.6 0.2]
[4.6 3.2 1.4 0.2]
[5.3 3.7 1.5 0.2]
[5. 3.3 1.4 0.2]
[7. 3.2 4.7 1.4]
[6.4 3.2 4.5 1.5]
[6.9 3.1 4.9 1.5]
[5.5 2.3 4. 1.3]
[6.5 2.8 4.6 1.5]
[5.7 2.8 4.5 1.3]
[6.3 3.3 4.7 1.6]
[4.9 2.4 3.3 1. ]
[6.6 2.9 4.6 1.3]
[5.2 2.7 3.9 1.4]
[5. 2. 3.5 1. ]
[5.9 3. 4.2 1.5]
[6. 2.2 4. 1. ]
[6.1 2.9 4.7 1.4]
[5.6 2.9 3.6 1.3]
[6.7 3.1 4.4 1.4]
[5.6 3. 4.5 1.5]
[5.8 2.7 4.1 1. ]
[6.2 2.2 4.5 1.5]
[5.6 2.5 3.9 1.1]
[5.9 3.2 4.8 1.8]
[6.1 2.8 4. 1.3]
[6.3 2.5 4.9 1.5]
[6.1 2.8 4.7 1.2]
[6.4 2.9 4.3 1.3]
[6.6 3. 4.4 1.4]
[6.8 2.8 4.8 1.4]
[6.7 3. 5. 1.7]
[6. 2.9 4.5 1.5]
[5.7 2.6 3.5 1. ]
[5.5 2.4 3.8 1.1]
[5.5 2.4 3.7 1. ]
[5.8 2.7 3.9 1.2]
[6. 2.7 5.1 1.6]
[5.4 3. 4.5 1.5]
[6. 3.4 4.5 1.6]
[6.7 3.1 4.7 1.5]
[6.3 2.3 4.4 1.3]
[5.6 3. 4.1 1.3]
[5.5 2.5 4. 1.3]
[5.5 2.6 4.4 1.2]
[6.1 3. 4.6 1.4]
[5.8 2.6 4. 1.2]
[5. 2.3 3.3 1. ]
[5.6 2.7 4.2 1.3]
[5.7 3. 4.2 1.2]
[5.7 2.9 4.2 1.3]
[6.2 2.9 4.3 1.3]
[5.1 2.5 3. 1.1]
[5.7 2.8 4.1 1.3]
[6.3 3.3 6. 2.5]
[5.8 2.7 5.1 1.9]
[7.1 3. 5.9 2.1]
[6.3 2.9 5.6 1.8]
[6.5 3. 5.8 2.2]
[7.6 3. 6.6 2.1]
[4.9 2.5 4.5 1.7]
[7.3 2.9 6.3 1.8]
[6.7 2.5 5.8 1.8]
[7.2 3.6 6.1 2.5]
[6.5 3.2 5.1 2. ]
[6.4 2.7 5.3 1.9]
[6.8 3. 5.5 2.1]
[5.7 2.5 5. 2. ]
[5.8 2.8 5.1 2.4]
[6.4 3.2 5.3 2.3]
[6.5 3. 5.5 1.8]
[7.7 3.8 6.7 2.2]
[7.7 2.6 6.9 2.3]
[6. 2.2 5. 1.5]
[6.9 3.2 5.7 2.3]
[5.6 2.8 4.9 2. ]
[7.7 2.8 6.7 2. ]
[6.3 2.7 4.9 1.8]
[6.7 3.3 5.7 2.1]
[7.2 3.2 6. 1.8]
[6.2 2.8 4.8 1.8]
[6.1 3. 4.9 1.8]
[6.4 2.8 5.6 2.1]
[7.2 3. 5.8 1.6]
[7.4 2.8 6.1 1.9]
[7.9 3.8 6.4 2. ]
[6.4 2.8 5.6 2.2]
[6.3 2.8 5.1 1.5]
[6.1 2.6 5.6 1.4]
[7.7 3. 6.1 2.3]
[6.3 3.4 5.6 2.4]
[6.4 3.1 5.5 1.8]
[6. 3. 4.8 1.8]
[6.9 3.1 5.4 2.1]
[6.7 3.1 5.6 2.4]
[6.9 3.1 5.1 2.3]
[5.8 2.7 5.1 1.9]
[6.8 3.2 5.9 2.3]
[6.7 3.3 5.7 2.5]
[6.7 3. 5.2 2.3]
[6.3 2.5 5. 1.9]
[6.5 3. 5.2 2. ]
[6.2 3.4 5.4 2.3]
[5.9 3. 5.1 1.8]]
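Printing all 150 rows is rarely necessary. A lighter check, sketched below, looks at the shape, the dtype and the first few rows only.

```python
from sklearn import datasets

iris = datasets.load_iris()

print(iris.data.shape)   # (150, 4): 150 samples, 4 features
print(iris.data.dtype)   # element type of the array
print(iris.data[:5])     # first five rows, all of class setosa
```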
3.2.3 The target part
print("Values of key 'target' of iris:\n{}".format(iris.get('target')))
Output:
3.2.4 The target_names part
Call:
print("Values of key 'target_names' of iris:\n{}".format(iris.get('target_names')))
3.2.5 The feature_names part
Call:
print("Values of key 'feature_names' of iris:\n{}".format(iris.get('feature_names')))
Output:
3.2.6 The filename part
Call:
print("Values of key 'filename' of iris:\n{}".format(iris.get('filename')))
Output:
3.2.7 The frame part
Call:
print("Values of key 'frame' of iris:\n{}".format(iris.get('frame')))
Output:
3.2.8 The data_module part
Call:
print("Values of key 'data_module' of iris:\n{}".format(iris.get('data_module')))
Output:
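The values behind the keys in 3.2.3 through 3.2.8 can also be checked in one short script. The following is a minimal sketch: the use of np.bincount to count samples per class is an addition made here, and the filename and data_module values depend on the installed sklearn version, so no fixed output is claimed for them.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# target: one integer class code per sample; count the samples in each class.
class_counts = np.bincount(iris.target)
print(class_counts)                 # [50 50 50]

print(list(iris.target_names))      # the three class names
print(iris.feature_names)           # the four column names, with units
print(iris.get("frame"))            # None unless as_frame=True was passed
print(iris.get("filename"))         # file name or path of the bundled CSV
print(iris.get("data_module"))      # None on sklearn versions without this key
```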
4. Plotting the Iris Data
4.1 Scatter plot of the sepal data
Code:
## 1. Import the datasets module from sklearn
from sklearn import datasets
## 2. Load the iris data from datasets and assign it to iris
iris = datasets.load_iris()  # iris is a dictionary-like object
## 3. Print all key names of iris
print("Keys of iris:\n{}".format(iris.keys()))
# Output:
# Keys of iris:
# dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
## 4. Print the value stored under each key
print("Values of key 'DESCR' of iris:\n{}".format(iris.get('DESCR')))
print("Values of key 'data' of iris:\n{}".format(iris.get('data')))
print("Values of key 'target' of iris:\n{}".format(iris.get('target')))
print("Values of key 'target_names' of iris:\n{}".format(iris.get('target_names')))
print("Values of key 'feature_names' of iris:\n{}".format(iris.get('feature_names')))
print("Values of key 'filename' of iris:\n{}".format(iris.get('filename')))
print("Values of key 'data_module' of iris:\n{}".format(iris.get('data_module')))
print("Values of key 'frame' of iris:\n{}".format(iris.get('frame')))
## 5. Draw the sepal scatter plot
import matplotlib.pyplot as plt  # use the short alias plt for matplotlib.pyplot
fig1, ax1 = plt.subplots()  # plt.subplots() returns a figure and an axes, assigned to fig1 and ax1
scatter1 = ax1.scatter(iris.data[:, 0], iris.data[:, 1], c=iris.target)
ax1.set(xlabel=iris.feature_names[0], ylabel=iris.feature_names[1])
## 6. Add the legend
_ = ax1.legend(scatter1.legend_elements()[0], iris.target_names, loc="lower right", title="Classes")
plt.show()  # display the figure
Output:
Figure 8. Scatter plot of sepal length versus sepal width
Figure 9. Data printed while the script runs
4.2 Scatter plot of the petal data
Code:
## 1. Import the datasets module from sklearn
from sklearn import datasets
## 2. Load the iris data from datasets and assign it to iris
iris = datasets.load_iris()  # iris is a dictionary-like object
## 3. Print all key names of iris
print("Keys of iris:\n{}".format(iris.keys()))
# Output:
# Keys of iris:
# dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
## 4. Print the value stored under each key
print("Values of key 'DESCR' of iris:\n{}".format(iris.get('DESCR')))
print("Values of key 'data' of iris:\n{}".format(iris.get('data')))
print("Values of key 'target' of iris:\n{}".format(iris.get('target')))
print("Values of key 'target_names' of iris:\n{}".format(iris.get('target_names')))
print("Values of key 'feature_names' of iris:\n{}".format(iris.get('feature_names')))
print("Values of key 'filename' of iris:\n{}".format(iris.get('filename')))
print("Values of key 'data_module' of iris:\n{}".format(iris.get('data_module')))
print("Values of key 'frame' of iris:\n{}".format(iris.get('frame')))
## 5. Draw the petal scatter plot
import matplotlib.pyplot as plt  # use the short alias plt for matplotlib.pyplot
fig2, ax2 = plt.subplots()  # plt.subplots() returns a figure and an axes, assigned to fig2 and ax2
scatter2 = ax2.scatter(iris.data[:, 2], iris.data[:, 3], c=iris.target)
ax2.set(xlabel=iris.feature_names[2], ylabel=iris.feature_names[3])
## 6. Add the legend
_ = ax2.legend(scatter2.legend_elements()[0], iris.target_names, loc="lower right", title="Classes")
plt.show()  # display the figure
Output:
Figure 10. Scatter plot of petal length versus petal width
Figure 11. Data printed while the script runs
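Since the two scripts differ only in the column indices, the sepal and petal plots can also be drawn side by side in one figure. The sketch below is one possible arrangement, not the layout used for Figures 8 and 10. The list feature_pairs and the output file name iris_sepal_petal.png are hypothetical names chosen for illustration.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()

# (x column, y column) pairs: sepal length/width, then petal length/width.
feature_pairs = [(0, 1), (2, 3)]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (i, j) in zip(axes, feature_pairs):
    scatter = ax.scatter(iris.data[:, i], iris.data[:, j], c=iris.target)
    ax.set(xlabel=iris.feature_names[i], ylabel=iris.feature_names[j])
    ax.legend(scatter.legend_elements()[0], iris.target_names,
              loc="lower right", title="Classes")

fig.tight_layout()
# Hypothetical file name; call plt.show() instead for an interactive window.
fig.savefig("iris_sepal_petal.png")
```

Placing both panels next to each other makes it easier to compare how well each pair of features separates the three classes.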
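5. Summary
The iris dataset is widely used in machine learning with sklearn. This article examined the content and structure of the iris data and walked through one way of looking up the load_iris function. The same approach applies when looking up and learning other functions, and it makes a solid second step toward mastering classification problems and understanding them intuitively.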