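Machine Learning: Decision Trees (DecisionTree: C4.5)
In Machine Learning: Decision Trees (DecisionTree: ID3), we noted that ID3 cannot handle attributes that take continuous values or that have missing values. The C4.5 algorithm addresses both of these limitations.
1. Handling continuous-valued attributes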
For a dataset $D$ and a continuous-valued attribute $A$ that takes $M$ distinct values in $D$, the attribute can be discretized by bi-partition:
- Sort the $M$ distinct values in ascending order and denote the sorted values by $\{a^1, a^2, ..., a^M\}$;
- For each pair of adjacent values $a^{i}$ and $a^{i+1}$, take their mean $t=\frac{a^{i}+a^{i+1}}{2}$ as a split point. It divides $D$ into the subsets $D_t^-$ (samples whose value on $A$ is at most $t$) and $D_t^+$ (samples whose value on $A$ is greater than $t$);
- For the continuous-valued attribute $A$, this yields a set of $M-1$ candidate split points:

$$T_A=\{\frac{a^{i}+a^{i+1}}{2}\mid 1\le i\le M-1\}\tag{1}$$

- Evaluate these candidate split points in the same way as the values of a discrete attribute, and choose the best one to partition the sample set:

$$\begin{aligned} Gain(D, A)&=\max_{t\in T_A}Gain(D, A, t)\\ &=\max_{t\in T_A}\left(Entropy(D)-\sum_{\lambda\in \{-, +\}}\frac{N_t^{\lambda}}{N}Entropy(D_t^{\lambda})\right) \end{aligned}\tag{2}$$

In equation (2), $Gain(D, A, t)$ is the information gain obtained by splitting the sample set $D$ in two at the split point $t$, $D_t^{\lambda}$ denotes a subset after the split, $N_t^{\lambda}$ is the number of samples in that subset, and $N$ is the number of samples in $D$.
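The bi-partition steps above can be sketched in a few lines of Python. This is a minimal illustration, not a full C4.5 implementation: the function names `entropy`, `candidate_thresholds`, and `best_binary_split` and the four-sample toy data are assumptions made for this example.

```python
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum(c / n * log2(c / n) for c in counts.values())

def candidate_thresholds(values):
    """Midpoints of adjacent distinct sorted values: the set T_A of equation (1)."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

def best_binary_split(values, labels):
    """Return (threshold, gain) maximizing equation (2)."""
    base = entropy(labels)
    n = len(labels)
    best_t, best_gain = None, -1.0
    for t in candidate_thresholds(values):
        left = [y for x, y in zip(values, labels) if x <= t]   # D_t^-
        right = [y for x, y in zip(values, labels) if x > t]   # D_t^+
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Hypothetical toy data: the classes separate cleanly between 2.0 and 3.0.
toy_values = [1.0, 2.0, 3.0, 4.0]
toy_labels = ["no", "no", "yes", "yes"]
print(candidate_thresholds(toy_values))
print(best_binary_split(toy_values, toy_labels))
```

Re-splitting the data for every candidate keeps the sketch close to equation (2) but costs $O(M\cdot N)$ time. A sorted sweep with running class counts would be the usual optimization.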
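ID | Color | Root | Sound | Texture | Navel | Touch | Density | Sugar content | Good melon |
---|---|---|---|---|---|---|---|---|---|
1 | green | curly | muffled | clear | hollow | hard | 0.697 | 0.460 | yes |
2 | dark | curly | dull | clear | hollow | hard | 0.774 | 0.376 | yes |
3 | dark | curly | muffled | clear | hollow | hard | 0.634 | 0.264 | yes |
4 | green | curly | dull | clear | hollow | hard | 0.608 | 0.318 | yes |
5 | light | curly | muffled | clear | hollow | hard | 0.556 | 0.215 | yes |
6 | green | slightly curly | muffled | clear | slightly hollow | soft | 0.403 | 0.237 | yes |
7 | dark | slightly curly | muffled | slightly blurry | slightly hollow | soft | 0.481 | 0.149 | yes |
8 | dark | slightly curly | muffled | clear | slightly hollow | hard | 0.437 | 0.211 | yes |
9 | dark | slightly curly | dull | slightly blurry | slightly hollow | hard | 0.666 | 0.091 | no |
10 | green | straight | crisp | clear | flat | soft | 0.243 | 0.267 | no |
11 | light | straight | crisp | blurry | flat | hard | 0.245 | 0.057 | no |
12 | light | curly | muffled | blurry | flat | soft | 0.343 | 0.099 | no |
13 | green | slightly curly | muffled | slightly blurry | hollow | hard | 0.639 | 0.161 | no |
14 | light | slightly curly | dull | slightly blurry | hollow | hard | 0.657 | 0.198 | no |
15 | dark | slightly curly | muffled | clear | slightly hollow | soft | 0.360 | 0.370 | no |
16 | light | curly | muffled | blurry | flat | hard | 0.593 | 0.042 | no |
17 | green | curly | dull | slightly blurry | slightly hollow | hard | 0.719 | 0.103 | no |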
The watermelon dataset in Table 1 contains 17 samples ($n=1,2,3,...,17$). Each sample has 8 attributes ($k=1,2,3,...,8$), and there are 2 classes ($c=\text{yes},\text{no}$). Of the 17 samples, 8 are good melons and 9 are bad melons, so the information entropy of the dataset $D$ (using base-2 logarithms, as in every calculation below) is:

$$Entropy(D)=-(\frac{8}{17}\log\frac{8}{17}+\frac{9}{17}\log\frac{9}{17})=0.9975$$
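As a quick check of the entropy value above, the following sketch computes the base-2 entropy directly from class counts. The helper name `entropy_from_counts` is an assumption made for this example.

```python
from math import log2

def entropy_from_counts(counts):
    """Base-2 entropy of a class distribution given as raw counts.

    Zero counts are skipped, which applies the convention 0 * log(0) = 0.
    """
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# 8 good melons and 9 bad melons, as in the dataset above
h_d = entropy_from_counts([8, 9])
print(round(h_d, 4))
```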
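Take the attribute "sugar content" as an example. Sorting the 17 samples by their value on this attribute in ascending order gives:
ID | Color | Root | Sound | Texture | Navel | Touch | Density | Sugar content | Good melon |
---|---|---|---|---|---|---|---|---|---|
16 | light | curly | muffled | blurry | flat | hard | 0.593 | 0.042 | no |
11 | light | straight | crisp | blurry | flat | hard | 0.245 | 0.057 | no |
9 | dark | slightly curly | dull | slightly blurry | slightly hollow | hard | 0.666 | 0.091 | no |
12 | light | curly | muffled | blurry | flat | soft | 0.343 | 0.099 | no |
17 | green | curly | dull | slightly blurry | slightly hollow | hard | 0.719 | 0.103 | no |
7 | dark | slightly curly | muffled | slightly blurry | slightly hollow | soft | 0.481 | 0.149 | yes |
13 | green | slightly curly | muffled | slightly blurry | hollow | hard | 0.639 | 0.161 | no |
14 | light | slightly curly | dull | slightly blurry | hollow | hard | 0.657 | 0.198 | no |
8 | dark | slightly curly | muffled | clear | slightly hollow | hard | 0.437 | 0.211 | yes |
5 | light | curly | muffled | clear | hollow | hard | 0.556 | 0.215 | yes |
6 | green | slightly curly | muffled | clear | slightly hollow | soft | 0.403 | 0.237 | yes |
3 | dark | curly | muffled | clear | hollow | hard | 0.634 | 0.264 | yes |
10 | green | straight | crisp | clear | flat | soft | 0.243 | 0.267 | no |
4 | green | curly | dull | clear | hollow | hard | 0.608 | 0.318 | yes |
15 | dark | slightly curly | muffled | clear | slightly hollow | soft | 0.360 | 0.370 | no |
2 | dark | curly | dull | clear | hollow | hard | 0.774 | 0.376 | yes |
1 | green | curly | muffled | clear | hollow | hard | 0.697 | 0.460 | yes |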
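The candidate binary split points for this attribute, taken as the midpoints of adjacent sorted values of the 17 samples, are:

$$\{0.0495,\ 0.074,\ 0.095,\ 0.101,\ 0.126,\ 0.155,\ 0.1795,\ 0.2045,\ 0.213,\ 0.226,\ 0.2505,\ 0.2655,\ 0.2925,\ 0.344,\ 0.373,\ 0.418\}$$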
- For the split point 0.0495, the two subsets are $D_{0.0495}^-$: {16} and $D_{0.0495}^+$: {11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}. With the convention $0\log 0=0$:

$$\begin{aligned} Entropy(D_{0.0495}^-)&=-(\frac{0}{1}\log\frac{0}{1}+\frac{1}{1}\log\frac{1}{1})=0\\ Entropy(D_{0.0495}^+)&=-(\frac{8}{16}\log\frac{8}{16}+\frac{8}{16}\log\frac{8}{16})=1.0\\ Gain(D, \text{sugar content}, 0.0495)&= Entropy(D)-\sum_{\lambda\in\{-, +\}}\frac{N_{0.0495}^{\lambda}}{N} Entropy(D_{0.0495}^{\lambda})\\ &= 0.9975-(\frac{1}{17}\times0+\frac{16}{17}\times1.0)\\ &=0.0563 \end{aligned}$$

- For the split point 0.074, the two subsets are $D_{0.074}^-$: {16, 11} and $D_{0.074}^+$: {9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}

$$Gain(D, \text{sugar content}, 0.074)= 0.9975-\{\frac{2}{17}\times[-(\frac{0}{2}\log\frac{0}{2}+\frac{2}{2}\log\frac{2}{2})]+\frac{15}{17}\times[-(\frac{8}{15}\log\frac{8}{15}+\frac{7}{15}\log\frac{7}{15})]\}=0.1179$$

- For the split point 0.095, the two subsets are $D_{0.095}^-$: {16, 11, 9} and $D_{0.095}^+$: {12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}

$$Gain(D, \text{sugar content}, 0.095)= 0.9975-\{\frac{3}{17}\times[-(\frac{0}{3}\log\frac{0}{3}+\frac{3}{3}\log\frac{3}{3})]+\frac{14}{17}\times[-(\frac{8}{14}\log\frac{8}{14}+\frac{6}{14}\log\frac{6}{14})]\}=0.1861$$

- For the split point 0.101, the two subsets are $D_{0.101}^-$: {16, 11, 9, 12} and $D_{0.101}^+$: {17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}

$$Gain(D, \text{sugar content}, 0.101)= 0.9975-\{\frac{4}{17}\times[-(\frac{0}{4}\log\frac{0}{4}+\frac{4}{4}\log\frac{4}{4})]+\frac{13}{17}\times[-(\frac{8}{13}\log\frac{8}{13}+\frac{5}{13}\log\frac{5}{13})]\}=0.2624$$

- For the split point 0.126, the two subsets are $D_{0.126}^-$: {16, 11, 9, 12, 17} and $D_{0.126}^+$: {7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}

$$Gain(D, \text{sugar content}, 0.126)= 0.9975-\{\frac{5}{17}\times[-(\frac{0}{5}\log\frac{0}{5}+\frac{5}{5}\log\frac{5}{5})]+\frac{12}{17}\times[-(\frac{8}{12}\log\frac{8}{12}+\frac{4}{12}\log\frac{4}{12})]\}=0.3492$$

- For the split point 0.155, the two subsets are $D_{0.155}^-$: {16, 11, 9, 12, 17, 7} and $D_{0.155}^+$: {13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}

$$Gain(D, \text{sugar content}, 0.155)= 0.9975-\{\frac{6}{17}\times[-(\frac{1}{6}\log\frac{1}{6}+\frac{5}{6}\log\frac{5}{6})]+\frac{11}{17}\times[-(\frac{7}{11}\log\frac{7}{11}+\frac{4}{11}\log\frac{4}{11})]\}=0.1561$$

- For the split point 0.1795, the two subsets are $D_{0.1795}^-$: {16, 11, 9, 12, 17, 7, 13} and $D_{0.1795}^+$: {14, 8, 5, 6, 3, 10, 4, 15, 2, 1}

$$Gain(D, \text{sugar content}, 0.1795)= 0.9975-\{\frac{7}{17}\times[-(\frac{1}{7}\log\frac{1}{7}+\frac{6}{7}\log\frac{6}{7})]+\frac{10}{17}\times[-(\frac{7}{10}\log\frac{7}{10}+\frac{3}{10}\log\frac{3}{10})]\}=0.2354$$

- For the split point 0.2045, the two subsets are $D_{0.2045}^-$: {16, 11, 9, 12, 17, 7, 13, 14} and $D_{0.2045}^+$: {8, 5, 6, 3, 10, 4, 15, 2, 1}

$$Gain(D, \text{sugar content}, 0.2045)= 0.9975-\{\frac{8}{17}\times[-(\frac{1}{8}\log\frac{1}{8}+\frac{7}{8}\log\frac{7}{8})]+\frac{9}{17}\times[-(\frac{7}{9}\log\frac{7}{9}+\frac{2}{9}\log\frac{2}{9})]\}=0.3371$$

- For the split point 0.213, the two subsets are $D_{0.213}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8} and $D_{0.213}^+$: {5, 6, 3, 10, 4, 15, 2, 1}

$$Gain(D, \text{sugar content}, 0.213)= 0.9975-\{\frac{9}{17}\times[-(\frac{2}{9}\log\frac{2}{9}+\frac{7}{9}\log\frac{7}{9})]+\frac{8}{17}\times[-(\frac{6}{8}\log\frac{6}{8}+\frac{2}{8}\log\frac{2}{8})]\}=0.2111$$

- For the split point 0.226, the two subsets are $D_{0.226}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5} and $D_{0.226}^+$: {6, 3, 10, 4, 15, 2, 1}

$$Gain(D, \text{sugar content}, 0.226)= 0.9975-\{\frac{10}{17}\times[-(\frac{3}{10}\log\frac{3}{10}+\frac{7}{10}\log\frac{7}{10})]+\frac{7}{17}\times[-(\frac{5}{7}\log\frac{5}{7}+\frac{2}{7}\log\frac{2}{7})]\}=0.1237$$

- For the split point 0.2505, the two subsets are $D_{0.2505}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6} and $D_{0.2505}^+$: {3, 10, 4, 15, 2, 1}

$$Gain(D, \text{sugar content}, 0.2505)= 0.9975-\{\frac{11}{17}\times[-(\frac{4}{11}\log\frac{4}{11}+\frac{7}{11}\log\frac{7}{11})]+\frac{6}{17}\times[-(\frac{4}{6}\log\frac{4}{6}+\frac{2}{6}\log\frac{2}{6})]\}=0.0615$$

- For the split point 0.2655, the two subsets are $D_{0.2655}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3} and $D_{0.2655}^+$: {10, 4, 15, 2, 1}

$$Gain(D, \text{sugar content}, 0.2655)= 0.9975-\{\frac{12}{17}\times[-(\frac{5}{12}\log\frac{5}{12}+\frac{7}{12}\log\frac{7}{12})]+\frac{5}{17}\times[-(\frac{3}{5}\log\frac{3}{5}+\frac{2}{5}\log\frac{2}{5})]\}=0.0202$$

- For the split point 0.2925, the two subsets are $D_{0.2925}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10} and $D_{0.2925}^+$: {4, 15, 2, 1}

$$Gain(D, \text{sugar content}, 0.2925)= 0.9975-\{\frac{13}{17}\times[-(\frac{5}{13}\log\frac{5}{13}+\frac{8}{13}\log\frac{8}{13})]+\frac{4}{17}\times[-(\frac{3}{4}\log\frac{3}{4}+\frac{1}{4}\log\frac{1}{4})]\}=0.0715$$

- For the split point 0.344, the two subsets are $D_{0.344}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4} and $D_{0.344}^+$: {15, 2, 1}

$$Gain(D, \text{sugar content}, 0.344)= 0.9975-\{\frac{14}{17}\times[-(\frac{6}{14}\log\frac{6}{14}+\frac{8}{14}\log\frac{8}{14})]+\frac{3}{17}\times[-(\frac{2}{3}\log\frac{2}{3}+\frac{1}{3}\log\frac{1}{3})]\}=0.0241$$

- For the split point 0.373, the two subsets are $D_{0.373}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15} and $D_{0.373}^+$: {2, 1}

$$Gain(D, \text{sugar content}, 0.373)= 0.9975-\{\frac{15}{17}\times[-(\frac{6}{15}\log\frac{6}{15}+\frac{9}{15}\log\frac{9}{15})]+\frac{2}{17}\times[-(\frac{2}{2}\log\frac{2}{2}+\frac{0}{2}\log\frac{0}{2})]\}=0.1408$$

- For the split point 0.418, the two subsets are $D_{0.418}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2} and $D_{0.418}^+$: {1}

$$Gain(D, \text{sugar content}, 0.418)= 0.9975-\{\frac{16}{17}\times[-(\frac{7}{16}\log\frac{7}{16}+\frac{9}{16}\log\frac{9}{16})]+\frac{1}{17}\times[-(\frac{1}{1}\log\frac{1}{1}+\frac{0}{1}\log\frac{0}{1})]\}=0.0669$$
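The sixteen hand calculations above can be reproduced with a short script. This is a sketch for checking the arithmetic, assuming the sugar-content column is transcribed in sample order 1 to 17 from the first table. The names `binary_entropy` and `gain_at` are chosen for this example.

```python
from math import log2

# Sugar content of samples 1..17; samples 1-8 are good melons, 9-17 are not.
sugar = [0.460, 0.376, 0.264, 0.318, 0.215, 0.237, 0.149, 0.211, 0.091,
         0.267, 0.057, 0.099, 0.161, 0.198, 0.370, 0.042, 0.103]
good = [True] * 8 + [False] * 9

def binary_entropy(pos, total):
    """Base-2 entropy of a two-class set with `pos` positives out of `total`."""
    if total == 0 or pos == 0 or pos == total:
        return 0.0
    p = pos / total
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def gain_at(t):
    """Information gain of splitting on sugar content <= t versus > t."""
    n = len(sugar)
    left = [g for s, g in zip(sugar, good) if s <= t]
    right = [g for s, g in zip(sugar, good) if s > t]
    return (binary_entropy(sum(good), n)
            - len(left) / n * binary_entropy(sum(left), len(left))
            - len(right) / n * binary_entropy(sum(right), len(right)))

ordered = sorted(sugar)
# Rounding only tidies floating-point noise; every midpoint has at most 4 decimals.
thresholds = [round((a + b) / 2, 4) for a, b in zip(ordered, ordered[1:])]
gains = {t: gain_at(t) for t in thresholds}
best_t = max(gains, key=gains.get)

for t in thresholds:
    print(f"t={t:<7} gain={gains[t]:.4f}")
print("best split:", best_t)
```

The script rounds to four decimals, whereas the text appears to truncate, so the last printed digit can differ by one.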
Therefore, the maximum information gain for the attribute "sugar content" is 0.3492, attained at the split point 0.126:

$$Gain(D, \text{sugar content})=Gain(D, \text{sugar content}, t=0.126)=0.3492$$

Similarly, the maximum information gain for the attribute "density" is 0.2624, attained at the split point 0.3815:

$$Gain(D, \text{density})=Gain(D, \text{density}, t=0.3815)=0.2624$$

Continuous-valued attributes can be handled in this way.
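The density result can be checked with a slightly different sketch that sorts once and then sweeps the samples while keeping a running count of positives, rather than re-partitioning the data for every candidate. This sweep is a common optimization, not something the text above prescribes. The names `h2` and `best_threshold` are assumptions for this example, and the density column is transcribed in sample order 1 to 17.

```python
from math import log2

def h2(pos, total):
    """Binary entropy (base 2) of `pos` positives among `total` samples."""
    if total == 0 or pos in (0, total):
        return 0.0
    p = pos / total
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def best_threshold(values, is_positive):
    """Sort once, then sweep left to right with a running positive count."""
    pairs = sorted(zip(values, is_positive))
    n, total_pos = len(pairs), sum(is_positive)
    base = h2(total_pos, n)
    best = (None, -1.0)
    left_pos = 0
    for i in range(n - 1):
        left_pos += pairs[i][1]
        lo, hi = pairs[i][0], pairs[i + 1][0]
        if lo == hi:  # equal neighbours: no threshold lies between them
            continue
        n_left = i + 1
        gain = (base
                - n_left / n * h2(left_pos, n_left)
                - (n - n_left) / n * h2(total_pos - left_pos, n - n_left))
        if gain > best[1]:
            best = ((lo + hi) / 2, gain)
    return best

density = [0.697, 0.774, 0.634, 0.608, 0.556, 0.403, 0.481, 0.437, 0.666,
           0.243, 0.245, 0.343, 0.639, 0.657, 0.360, 0.593, 0.719]
good = [True] * 8 + [False] * 9

t, g = best_threshold(density, good)
print(round(t, 4), round(g, 4))
```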
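2. Handling attributes with missing values
ID | Color | Root | Sound | Texture | Navel | Touch | Good melon |
---|---|---|---|---|---|---|---|
1 | - | curly | muffled | clear | hollow | hard | yes |
2 | dark | curly | dull | clear | hollow | - | yes |
3 | dark | curly | - | clear | hollow | hard | yes |
4 | green | curly | dull | clear | hollow | hard | yes |
5 | - | curly | muffled | clear | hollow | hard | yes |
6 | green | slightly curly | muffled | clear | - | soft | yes |
7 | dark | slightly curly | muffled | slightly blurry | slightly hollow | soft | yes |
8 | dark | slightly curly | muffled | - | slightly hollow | hard | yes |
9 | dark | - | dull | slightly blurry | slightly hollow | hard | no |
10 | green | straight | crisp | - | flat | soft | no |
11 | light | straight | crisp | blurry | flat | - | no |
12 | light | curly | - | blurry | flat | soft | no |
13 | - | slightly curly | muffled | slightly blurry | hollow | hard | no |
14 | light | slightly curly | dull | slightly blurry | hollow | hard | no |
15 | dark | slightly curly | muffled | clear | - | soft | no |
16 | light | curly | muffled | blurry | flat | hard | no |
17 | green | - | dull | slightly blurry | slightly hollow | hard | no |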
(1) How should the splitting attribute be chosen when some attribute values are missing?
Given a training set $D$ and an attribute $A$, let $\widetilde{D}$ denote the subset of samples in $D$ that have no missing value on $A$. Suppose $A$ has $m$ possible values $\{a^1, a^2, ..., a^m\}$. Let $\widetilde{D}^i$ denote the subset of $\widetilde{D}$ whose value on $A$ is $a^i$ ($i=1,2,...,m$), and let $\widetilde{D}_k$ denote the subset of $\widetilde{D}$ that belongs to class $k$ ($k=1,2,...,K$), so that $\widetilde{D}=\cup_{k=1}^{K}\widetilde{D}_k=\cup_{i=1}^{m}\widetilde{D}^i$. Assign each sample $x$ a weight $w_x$ and define:

$$\begin{aligned} \rho&=\frac{\sum_{x\in\widetilde{D}}w_x}{\sum_{x\in D}w_x}\\ \widetilde{p}_k&=\frac{\sum_{x\in\widetilde{D}_k}w_x}{\sum_{x\in \widetilde{D}}w_x}\\ \widetilde{r}_i&=\frac{\sum_{x\in\widetilde{D}^i}w_x}{\sum_{x\in \widetilde{D}}w_x} \end{aligned}$$

Here, $\rho$ is the proportion of samples with no missing value, $\widetilde{p}_k$ is the proportion of class $k$ among the samples with no missing value, and $\widetilde{r}_i$ is the proportion of samples taking value $a^i$ on attribute $A$ among the samples with no missing value. It follows that $\sum_{k=1}^K\widetilde{p}_k=\sum_{i=1}^m\widetilde{r}_i=1$.
Based on these definitions, the information gain can be generalized to data with missing values:

$$\begin{aligned} Entropy(\widetilde{D})&=-\sum_{k=1}^{K}\widetilde{p}_k\log\widetilde{p}_k\\ Gain(D, A)&=\rho\times Gain(\widetilde{D}, A)=\rho\times[Entropy(\widetilde{D})-\sum_{i=1}^{m}\widetilde{r}_iEntropy(\widetilde{D}^i)] \end{aligned}$$
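The generalized gain can be sketched as a single function that takes per-sample weights and treats `None` as a missing value. The function name `gain_with_missing`, the use of `None` as the missing marker, and the five-sample toy data are all assumptions made for this illustration.

```python
from math import log2
from collections import defaultdict

def gain_with_missing(values, labels, weights=None):
    """Information gain generalized to missing values (None = missing)."""
    if weights is None:
        weights = [1.0] * len(values)
    total_w = sum(weights)
    known = [(v, y, w) for v, y, w in zip(values, labels, weights) if v is not None]
    known_w = sum(w for _, _, w in known)
    if known_w == 0:
        return 0.0
    rho = known_w / total_w  # weighted share of samples with a known value

    def weighted_entropy(rows):
        by_class = defaultdict(float)
        for _, y, w in rows:
            by_class[y] += w
        s = sum(by_class.values())
        return -sum(w / s * log2(w / s) for w in by_class.values() if w > 0)

    by_value = defaultdict(list)
    for row in known:
        by_value[row[0]].append(row)
    # sum over values of r_i * Entropy(D~^i)
    remainder = sum(sum(w for _, _, w in rows) / known_w * weighted_entropy(rows)
                    for rows in by_value.values())
    return rho * (weighted_entropy(known) - remainder)

# Hypothetical toy data: the attribute separates the classes perfectly
# on the four known samples, and one sample is missing, so rho = 4/5.
toy_values = ["a", "a", "b", "b", None]
toy_labels = ["yes", "yes", "no", "no", "yes"]
print(gain_with_missing(toy_values, toy_labels))
```

With unit weights, $\rho$ reduces to the fraction of samples whose value is known, which is the setting used at the root node.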
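At the root, the full sample set is $D=\{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17\}$. With every sample weight equal to 1, the information gain of each attribute is computed as follows: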
- Attribute "color": the subset without missing values is $\widetilde{D}=\{2,3,4,6,7,8,9,10,11,12,14,15,16,17\}$, with 3 values: "dark", "green", and "light"

$$\begin{aligned} Gain(D, \text{color})&=\rho\times[Entropy(\widetilde{D})-\sum_{i=1}^{m}\widetilde{r}_iEntropy(\widetilde{D}^i)]\\ &=\frac{14}{17}\times\{-(\frac{6}{14}\log\frac{6}{14}+\frac{8}{14}\log\frac{8}{14})-[\frac{6}{14}\times(-(\frac{4}{6}\log\frac{4}{6}+\frac{2}{6}\log\frac{2}{6}))+\frac{4}{14}\times(-(\frac{2}{4}\log\frac{2}{4}+\frac{2}{4}\log\frac{2}{4}))+\frac{4}{14}\times(-(\frac{0}{4}\log\frac{0}{4}+\frac{4}{4}\log\frac{4}{4}))]\}\\ &=0.2519 \end{aligned}$$

- Attribute "root": the subset without missing values is $\widetilde{D}=\{1,2,3,4,5,6,7,8,10,11,12,13,14,15,16\}$, with 3 values: "curly", "slightly curly", and "straight"

$$\begin{aligned} Gain(D, \text{root})&=\frac{15}{17}\times\{-(\frac{8}{15}\log\frac{8}{15}+\frac{7}{15}\log\frac{7}{15})-[\frac{7}{15}\times(-(\frac{5}{7}\log\frac{5}{7}+\frac{2}{7}\log\frac{2}{7}))+\frac{6}{15}\times(-(\frac{3}{6}\log\frac{3}{6}+\frac{3}{6}\log\frac{3}{6}))+\frac{2}{15}\times(-(\frac{0}{2}\log\frac{0}{2}+\frac{2}{2}\log\frac{2}{2}))]\}\\ &=0.1711 \end{aligned}$$

- Attribute "sound": the subset without missing values is $\widetilde{D}=\{1,2,4,5,6,7,8,9,10,11,13,14,15,16,17\}$, with 3 values: "muffled", "dull", and "crisp"

$$\begin{aligned} Gain(D, \text{sound})&=\frac{15}{17}\times\{-(\frac{7}{15}\log\frac{7}{15}+\frac{8}{15}\log\frac{8}{15})-[\frac{8}{15}\times(-(\frac{5}{8}\log\frac{5}{8}+\frac{3}{8}\log\frac{3}{8}))+\frac{5}{15}\times(-(\frac{2}{5}\log\frac{2}{5}+\frac{3}{5}\log\frac{3}{5}))+\frac{2}{15}\times(-(\frac{0}{2}\log\frac{0}{2}+\frac{2}{2}\log\frac{2}{2}))]\}\\ &=0.1448 \end{aligned}$$

- Attribute "texture": the subset without missing values is $\widetilde{D}=\{1,2,3,4,5,6,7,9,11,12,13,14,15,16,17\}$, with 3 values: "clear", "slightly blurry", and "blurry"

$$\begin{aligned} Gain(D, \text{texture})&=\frac{15}{17}\times\{-(\frac{7}{15}\log\frac{7}{15}+\frac{8}{15}\log\frac{8}{15})-[\frac{7}{15}\times(-(\frac{6}{7}\log\frac{6}{7}+\frac{1}{7}\log\frac{1}{7}))+\frac{5}{15}\times(-(\frac{1}{5}\log\frac{1}{5}+\frac{4}{5}\log\frac{4}{5}))+\frac{3}{15}\times(-(\frac{0}{3}\log\frac{0}{3}+\frac{3}{3}\log\frac{3}{3}))]\}\\ &=0.4235 \end{aligned}$$

- Attribute "navel": the subset without missing values is $\widetilde{D}=\{1,2,3,4,5,7,8,9,10,11,12,13,14,16,17\}$, with 3 values: "hollow", "slightly hollow", and "flat"

$$\begin{aligned} Gain(D, \text{navel})&=\frac{15}{17}\times\{-(\frac{7}{15}\log\frac{7}{15}+\frac{8}{15}\log\frac{8}{15})-[\frac{7}{15}\times(-(\frac{5}{7}\log\frac{5}{7}+\frac{2}{7}\log\frac{2}{7}))+\frac{4}{15}\times(-(\frac{2}{4}\log\frac{2}{4}+\frac{2}{4}\log\frac{2}{4}))+\frac{4}{15}\times(-(\frac{0}{4}\log\frac{0}{4}+\frac{4}{4}\log\frac{4}{4}))]\}\\ &=0.2888 \end{aligned}$$

- Attribute "touch": the subset without missing values is $\widetilde{D}=\{1,3,4,5,6,7,8,9,10,12,13,14,15,16,17\}$, with 2 values: "hard" and "soft"

$$\begin{aligned} Gain(D, \text{touch})&=\frac{15}{17}\times\{-(\frac{7}{15}\log\frac{7}{15}+\frac{8}{15}\log\frac{8}{15})-[\frac{10}{15}\times(-(\frac{5}{10}\log\frac{5}{10}+\frac{5}{10}\log\frac{5}{10}))+\frac{5}{15}\times(-(\frac{2}{5}\log\frac{2}{5}+\frac{3}{5}\log\frac{3}{5}))]\}\\ &=0.0057 \end{aligned}$$
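The six gains above can be recomputed from the table with missing values. The sketch below assumes unit weights at the root and encodes the table compactly as text, with `-` for a missing value and underscores inside multi-word values. The value spellings are translations chosen for this example.

```python
from math import log2
from collections import Counter

ATTRS = ["color", "root", "sound", "texture", "navel", "touch"]
# One line per sample (1..17), columns in ATTRS order, "-" = missing.
TABLE = """
-     curly          muffled clear           hollow          hard
dark  curly          dull    clear           hollow          -
dark  curly          -       clear           hollow          hard
green curly          dull    clear           hollow          hard
-     curly          muffled clear           hollow          hard
green slightly_curly muffled clear           -               soft
dark  slightly_curly muffled slightly_blurry slightly_hollow soft
dark  slightly_curly muffled -               slightly_hollow hard
dark  -              dull    slightly_blurry slightly_hollow hard
green straight       crisp   -               flat            soft
light straight       crisp   blurry          flat            -
light curly          -       blurry          flat            soft
-     slightly_curly muffled slightly_blurry hollow          hard
light slightly_curly dull    slightly_blurry hollow          hard
dark  slightly_curly muffled clear           -               soft
light curly          muffled blurry          flat            hard
green -              dull    slightly_blurry slightly_hollow hard
"""
rows = [[None if v == "-" else v for v in line.split()]
        for line in TABLE.strip().splitlines()]
labels = ["yes"] * 8 + ["no"] * 9  # samples 1-8 are good melons

def entropy(ys):
    n = len(ys)
    return -sum(c / n * log2(c / n) for c in Counter(ys).values())

def gain(col):
    """rho * (Entropy(D~) - sum_i r_i * Entropy(D~^i)) with unit weights."""
    known = [(r[col], y) for r, y in zip(rows, labels) if r[col] is not None]
    rho = len(known) / len(rows)
    h = entropy([y for _, y in known])
    for v in {v for v, _ in known}:
        sub = [y for u, y in known if u == v]
        h -= len(sub) / len(known) * entropy(sub)
    return rho * h

gains = {a: gain(i) for i, a in enumerate(ATTRS)}
for a, g in gains.items():
    print(f"{a:8s} {g:.4f}")
print("best attribute:", max(gains, key=gains.get))
```

The printed values are rounded, so the last digit may differ by one from the figures quoted above.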
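(2) Given the splitting attribute, how should a sample be assigned when its value on that attribute is missing?
- If the sample's value on the splitting attribute is known, it is sent to the child node for that value, and its weight in the child node is unchanged (it stays 1 in this example);
- If the sample's value on the splitting attribute is unknown, it is sent to all child nodes at once, and its weight in the child node for value $a^i$ is scaled by that node's share of the known-value samples, $\widetilde{r}_i$.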
The attribute "texture" has the largest information gain and is used for the next split. It has 15 samples with known values (7 clear, 5 slightly blurry, 3 blurry) and 2 samples with unknown values, {8,10}.
Attribute: value | Samples | Good melons | Bad melons | Missing-value samples | Weight of missing-value samples | Total weight |
---|---|---|---|---|---|---|
texture: clear | {1,2,3,4,5,6,15} | {1,2,3,4,5,6} | {15} | {8,10} | $2\times\frac{7}{15}$ | $7+2\times\frac{7}{15}$ |
texture: slightly blurry | {7,9,13,14,17} | {7} | {9,13,14,17} | {8,10} | $2\times\frac{5}{15}$ | $5+2\times\frac{5}{15}$ |
texture: blurry | {11,12,16} | - | {11,12,16} | {8,10} | $2\times\frac{3}{15}$ | $3+2\times\frac{3}{15}$ |
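The weight bookkeeping in the table above can be sketched with exact fractions. The function `split_with_missing` and its `(id, attributes, weight)` tuple layout are assumptions made for this example. The texture values are transcribed from the table with missing values.

```python
from fractions import Fraction

def split_with_missing(samples, attr):
    """Partition weighted samples on `attr`.

    samples: list of (sample id, attribute dict, weight).
    Returns {value: [(sample id, weight in that child), ...]}.
    """
    known = [(sid, a[attr], w) for sid, a, w in samples if a[attr] is not None]
    known_w = sum(w for _, _, w in known)
    children = {}
    for sid, v, w in known:
        children.setdefault(v, []).append((sid, w))  # known value: weight unchanged
    # Share of each value among the known-value samples (computed before
    # any missing-value sample is added to the children).
    share = {v: sum(w for _, w in members) / known_w
             for v, members in children.items()}
    for sid, a, w in samples:
        if a[attr] is None:
            for v in children:  # missing value: goes to every child, scaled
                children[v].append((sid, w * share[v]))
    return children

texture = {1: "clear", 2: "clear", 3: "clear", 4: "clear", 5: "clear", 6: "clear",
           7: "slightly blurry", 8: None, 9: "slightly blurry", 10: None,
           11: "blurry", 12: "blurry", 13: "slightly blurry",
           14: "slightly blurry", 15: "clear", 16: "blurry", 17: "slightly blurry"}
samples = [(sid, {"texture": v}, Fraction(1)) for sid, v in texture.items()]

children = split_with_missing(samples, "texture")
totals = {v: sum(w for _, w in members) for v, members in children.items()}
for v, total in totals.items():
    print(v, total)
```

The child totals add up to the parent's total weight of 17, so the split only redistributes weight and never creates or loses any.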
The child node texture = clear contains 7 samples with known values, {1,2,3,4,5,6,15}, of which 6 are good melons and 1 is a bad melon. Assuming that the class distribution of the missing-value samples matches that of the samples with known values, i.e. $\frac{6}{7}$ and $\frac{1}{7}$, the information entropy of the child node texture = clear is:

$$\begin{aligned} Entropy(D^{\text{texture}=\text{clear}})&=-\sum_{k=1}^{K}p_k\log p_k\\ &=-(\frac{6+\frac{6}{7}\times\frac{7}{15}\times2}{7+\frac{7}{15}\times2}\log\frac{6+\frac{6}{7}\times\frac{7}{15}\times2}{7+\frac{7}{15}\times2}+\frac{1+\frac{1}{7}\times\frac{7}{15}\times2}{7+\frac{7}{15}\times2}\log\frac{1+\frac{1}{7}\times\frac{7}{15}\times2}{7+\frac{7}{15}\times2})\\ &=0.5916 \end{aligned}$$
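The entropy of this child node can be checked with exact fractions. The sketch follows the assumption stated above, that the weight of the missing-value samples is shared between the classes in the same proportions as the known samples. The variable names are chosen for this example.

```python
from fractions import Fraction
from math import log2

known_good, known_bad = 6, 1           # samples {1,...,6} and {15}
known_total = known_good + known_bad
missing_weight = 2 * Fraction(7, 15)   # samples 8 and 10, each entering with weight 7/15

# Assumption from the text: share the missing weight in the known class proportions.
good_w = known_good + Fraction(known_good, known_total) * missing_weight
bad_w = known_bad + Fraction(known_bad, known_total) * missing_weight
total_w = good_w + bad_w

p_good, p_bad = good_w / total_w, bad_w / total_w
h = -sum(float(p) * log2(float(p)) for p in (p_good, p_bad) if p > 0)

print(total_w)        # total weight of the node as an exact fraction
print(p_good, p_bad)  # class proportions as exact fractions
print(round(h, 4))
```

Under this proportional assumption the class proportions stay at exactly $\frac{6}{7}$ and $\frac{1}{7}$, so the node entropy equals the entropy of the seven known samples, about 0.5917, or 0.5916 when truncated to four decimals.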
- In the child node texture = clear, compute the information gain of the attribute color
  - color = dark has 3 samples (2 positive, 1 negative); color = green has 2 samples (2 positive); 2 samples have missing values
  - Weights of the missing-value samples: color = dark receives a share of $\frac{3}{5}$, for a total weight of $2\times\frac{3}{5}=\frac{6}{5}$; color = green receives a share of $\frac{2}{5}$, for a total weight of $2\times\frac{2}{5}=\frac{4}{5}$
  - color = dark: weight of positive samples: $2+\frac{2}{3}\times\frac{3}{5}\times2$; weight of negative samples: $1+\frac{1}{3}\times\frac{3}{5}\times2$; total weight: $2+\frac{2}{3}\times\frac{3}{5}\times2+1+\frac{1}{3}\times\frac{3}{5}\times2=3+\frac{3}{5}\times2$

$$Entropy(D^{\text{texture}=\text{clear},\,\text{color}=\text{dark}})=-(\frac{2+\frac{2}{3}\times\frac{3}{5}\times2}{3+\frac{3}{5}\times2}\log\frac{2+\frac{2}{3}\times\frac{3}{5}\times2}{3+\frac{3}{5}\times2}+\frac{1+\frac{1}{3}\times\frac{3}{5}\times2}{3+\frac{3}{5}\times2}\log\frac{1+\frac{1}{3}\times\frac{3}{5}\times2}{3+\frac{3}{5}\times2})=0.9183$$

  - color = green: weight of positive samples: $2+\frac{2}{2}\times\frac{2}{5}\times2$; weight of negative samples: $0+\frac{0}{2}\times\frac{2}{5}\times2$; total weight: $2+\frac{2}{2}\times\frac{2}{5}\times2+0+\frac{0}{2}\times\frac{2}{5}\times2=2+\frac{2}{5}\times2$

$$Entropy(D^{\text{texture}=\text{clear},\,\text{color}=\text{green}})=-(\frac{2+\frac{2}{2}\times\frac{2}{5}\times2}{2+\frac{2}{5}\times2}\log\frac{2+\frac{2}{2}\times\frac{2}{5}\times2}{2+\frac{2}{5}\times2}+\frac{0+\frac{0}{2}\times\frac{2}{5}\times2}{2+\frac{2}{5}\times2}\log\frac{0+\frac{0}{2}\times\frac{2}{5}\times2}{2+\frac{2}{5}\times2})=0.0$$

$$Gain(D^{\text{texture}=\text{clear}},\text{color})=0.5916-(\frac{3+\frac{3}{5}\times2}{7+\frac{7}{15}\times2}\times0.9183+\frac{2+\frac{2}{5}\times2}{7+\frac{7}{15}\times2}\times0.0)=0.1054$$

- In the child node texture = clear, compute the information gain of the attribute root
  - root = curly has 5 samples (5 positive); root = slightly curly has 2 samples (1 positive, 1 negative); no samples have missing values

$$\begin{aligned} Entropy(D^{\text{texture}=\text{clear},\,\text{root}=\text{curly}})&=-(\frac{5}{5}\log\frac{5}{5}+\frac{0}{5}\log\frac{0}{5})=0.0\\ Entropy(D^{\text{texture}=\text{clear},\,\text{root}=\text{slightly curly}})&=-(\frac{1}{2}\log\frac{1}{2}+\frac{1}{2}\log\frac{1}{2})=1.0\\ Gain(D^{\text{texture}=\text{clear}},\text{root})&=0.5916-(\frac{5}{7}\times0.0+\frac{2}{7}\times1.0)=0.3058 \end{aligned}$$

- In the child node texture = clear, compute the information gain of the attribute sound
  - sound = muffled has 4 samples (3 positive, 1 negative); sound = dull has 2 samples (2 positive); 1 sample has a missing value
  - Weights of the missing-value sample: sound = muffled receives a share of $\frac{4}{6}$, for a total weight of $\frac{4}{6}$; sound = dull receives a share of $\frac{2}{6}$, for a total weight of $\frac{2}{6}$
  - sound = muffled: weight of positive samples: $3+\frac{3}{4}\times\frac{4}{6}$; weight of negative samples: $1+\frac{1}{4}\times\frac{4}{6}$; total weight: $3+\frac{3}{4}\times\frac{4}{6}+1+\frac{1}{4}\times\frac{4}{6}=4+\frac{4}{6}$

$$Entropy(D^{\text{texture}=\text{clear},\,\text{sound}=\text{muffled}})=-(\frac{3+\frac{3}{4}\times\frac{4}{6}}{4+\frac{4}{6}}\log\frac{3+\frac{3}{4}\times\frac{4}{6}}{4+\frac{4}{6}}+\frac{1+\frac{1}{4}\times\frac{4}{6}}{4+\frac{4}{6}}\log\frac{1+\frac{1}{4}\times\frac{4}{6}}{4+\frac{4}{6}})=0.8112$$

  - sound = dull: weight of positive samples: $2+\frac{2}{2}\times\frac{2}{6}$; weight of negative samples: $0+\frac{0}{2}\times\frac{2}{6}$; total weight: $2+\frac{2}{2}\times\frac{2}{6}+0+\frac{0}{2}\times\frac{2}{6}=2+\frac{2}{6}$

$$Entropy(D^{\text{texture}=\text{clear},\,\text{sound}=\text{dull}})=-(\frac{2+\frac{2}{2}\times\frac{2}{6}}{2+\frac{2}{6}}\log\frac{2+\frac{2}{2}\times\frac{2}{6}}{2+\frac{2}{6}}+\frac{0+\frac{0}{2}\times\frac{2}{6}}{2+\frac{2}{6}}\log\frac{0+\frac{0}{2}\times\frac{2}{6}}{2+\frac{2}{6}})=0.0$$

$$Gain(D^{\text{texture}=\text{clear}},\text{sound})=0.5916-(\frac{4+\frac{4}{6}}{7+\frac{7}{15}\times2}\times0.8112+\frac{2+\frac{2}{6}}{7+\frac{7}{15}\times2}\times0.0)=0.1144$$
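The three child-node gain expressions above can be evaluated numerically as written. The sketch below is only an arithmetic check of those expressions, not a general C4.5 routine. It follows the text in sharing missing weight by the known class ratio, in dividing by the node weight $7+\frac{7}{15}\times2$ for color and sound, and in dividing by 7 for root. The helper names `entropy`, `branch`, and `gain` are assumptions for this example.

```python
from fractions import Fraction
from math import log2

def entropy(good_w, bad_w):
    """Base-2 entropy of a two-class node given class weights."""
    total = good_w + bad_w
    return -sum(float(w / total) * log2(float(w / total))
                for w in (good_w, bad_w) if w > 0)

def branch(good, bad, missing_weight):
    """Weight and entropy of a branch when `missing_weight` is shared
    between the classes in the branch's known class ratio."""
    known = good + bad
    good_w = good + Fraction(good, known) * missing_weight
    bad_w = bad + Fraction(bad, known) * missing_weight
    return good_w + bad_w, entropy(good_w, bad_w)

def gain(node_entropy, branches, denominator):
    return node_entropy - sum(float(w / denominator) * h for w, h in branches)

NODE_ENTROPY = 0.5916                  # Entropy of texture = clear, as quoted in the text
NODE_WEIGHT = 7 + 2 * Fraction(7, 15)  # total weight of the texture = clear node

# color: dark (2 good, 1 bad), green (2 good); 2 missing, shared 3/5 and 2/5
color = gain(NODE_ENTROPY,
             [branch(2, 1, 2 * Fraction(3, 5)), branch(2, 0, 2 * Fraction(2, 5))],
             NODE_WEIGHT)
# root: curly (5 good), slightly curly (1 good, 1 bad); none missing, divided by 7
root = gain(NODE_ENTROPY,
            [branch(5, 0, Fraction(0)), branch(1, 1, Fraction(0))],
            Fraction(7))
# sound: muffled (3 good, 1 bad), dull (2 good); 1 missing, shared 4/6 and 2/6
sound = gain(NODE_ENTROPY,
             [branch(3, 1, Fraction(4, 6)), branch(2, 0, Fraction(2, 6))],
             NODE_WEIGHT)

print(round(color, 4), round(root, 4), round(sound, 4))
```

Among these three attributes, root yields the largest gain within the texture = clear node.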