
Machine Learning: Decision Trees (DecisionTree: C4.5)

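The previous article, Machine Learning: Decision Trees (DecisionTree: ID3), noted that ID3 cannot handle attributes that are continuous-valued or that have missing values. The C4.5 algorithm addresses both of these limitations.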

1. Handling continuous-valued attributes

Given a dataset $D$ and a continuous-valued attribute $A$, suppose $A$ takes $M$ distinct values in $D$. The attribute can be discretized by bi-partition (binary splitting), as follows:

  1. Sort the $M$ distinct values in ascending order and denote the sorted values by $\{a^1, a^2, ..., a^M\}$.
  2. For each pair of adjacent values $a^{i}$ and $a^{i+1}$, take their mean $\frac{a^{i}+a^{i+1}}{2}$ as a candidate split point $t$. Splitting at $t$ yields the subsets $D_t^-$ (samples whose value of $A$ is at most $t$) and $D_t^+$ (samples whose value of $A$ is greater than $t$).
  3. For the continuous-valued attribute $A$, this gives a set of $M-1$ candidate split points:
    $$T_A=\left\{\frac{a^{i}+a^{i+1}}{2}\,\middle|\,1\le i\le M-1\right\}\tag{1}$$
  4. Evaluate these candidate split points in the same way as the values of a discrete attribute, and split the sample set at the best one:
    $$\begin{aligned} Gain(D, A)&=\max_{t\in T_A}Gain(D, A, t)\\ &=\max_{t\in T_A}\left(Entropy(D)-\sum_{\lambda\in \{-, +\}}\frac{N_t^{\lambda}}{N}Entropy(D_t^{\lambda})\right) \end{aligned}\tag{2}$$
    In Eq. (2), $Gain(D, A, t)$ is the information gain obtained by splitting the sample set $D$ in two at split point $t$, $D_t^{\lambda}$ denotes a subset produced by the split, $N_t^{\lambda}$ is the number of samples in that subset, and $N$ is the number of samples in $D$.
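The four steps above can be sketched in a few lines of Python. This is a minimal illustration on made-up toy data, not the author's implementation; the function names `entropy` and `best_binary_split` are hypothetical, and base-2 logarithms are assumed.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Return (threshold, gain) with the largest information gain.

    Candidates are the midpoints of adjacent distinct sorted values (Eq. 1).
    Returns (None, -1.0) if the attribute has fewer than two distinct values.
    """
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    base = entropy(labels)
    n = len(labels)
    best_t, best_gain = None, -1.0
    for t in candidates:
        left = [y for x, y in zip(values, labels) if x <= t]   # D_t^-
        right = [y for x, y in zip(values, labels) if x > t]   # D_t^+
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Toy data (an assumption for illustration): the classes separate at 2.5.
t, gain = best_binary_split([1.0, 2.0, 3.0, 4.0], ["no", "no", "yes", "yes"])
print(t, gain)
```

On this toy input the threshold 2.5 separates the two classes perfectly, so the gain equals the full entropy of the parent node (1 bit).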
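Table 1. Watermelon dataset 3.0

| No. | Color | Root | Knock sound | Texture | Navel | Touch | Density | Sugar content | Good melon |
|---|---|---|---|---|---|---|---|---|---|
| 1 | green | curled | dull | clear | sunken | hard-smooth | 0.697 | 0.460 | yes |
| 2 | dark | curled | muffled | clear | sunken | hard-smooth | 0.774 | 0.376 | yes |
| 3 | dark | curled | dull | clear | sunken | hard-smooth | 0.634 | 0.264 | yes |
| 4 | green | curled | muffled | clear | sunken | hard-smooth | 0.608 | 0.318 | yes |
| 5 | light | curled | dull | clear | sunken | hard-smooth | 0.556 | 0.215 | yes |
| 6 | green | slightly curled | dull | clear | slightly sunken | soft-sticky | 0.403 | 0.237 | yes |
| 7 | dark | slightly curled | dull | slightly blurry | slightly sunken | soft-sticky | 0.481 | 0.149 | yes |
| 8 | dark | slightly curled | dull | clear | slightly sunken | hard-smooth | 0.437 | 0.211 | yes |
| 9 | dark | slightly curled | muffled | slightly blurry | slightly sunken | hard-smooth | 0.666 | 0.091 | no |
| 10 | green | stiff | crisp | clear | flat | soft-sticky | 0.243 | 0.267 | no |
| 11 | light | stiff | crisp | blurry | flat | hard-smooth | 0.245 | 0.057 | no |
| 12 | light | curled | dull | blurry | flat | soft-sticky | 0.343 | 0.099 | no |
| 13 | green | slightly curled | dull | slightly blurry | sunken | hard-smooth | 0.639 | 0.161 | no |
| 14 | light | slightly curled | muffled | slightly blurry | sunken | hard-smooth | 0.657 | 0.198 | no |
| 15 | dark | slightly curled | dull | clear | slightly sunken | soft-sticky | 0.360 | 0.370 | no |
| 16 | light | curled | dull | blurry | flat | hard-smooth | 0.593 | 0.042 | no |
| 17 | green | curled | muffled | slightly blurry | slightly sunken | hard-smooth | 0.719 | 0.103 | no |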

The watermelon dataset in Table 1 contains 17 samples ($n=1,2,3,...,17$), each with 8 attributes ($k=1,2,3,...,8$), and there are 2 classes ($c\in\{\text{yes},\text{no}\}$). Of the 17 samples, 8 are good melons and 9 are bad melons, so the information entropy of the dataset $D$ (all logarithms are base 2) is:
$$Entropy(D)=-\left(\frac{8}{17}\log\frac{8}{17}+\frac{9}{17}\log\frac{9}{17}\right)=0.9975$$
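The entropy value can be checked numerically. The snippet below is a small sketch that assumes base-2 logarithms; the helper name `binary_entropy` is introduced here for illustration only.

```python
import math

def binary_entropy(n_pos, n_neg):
    """Entropy in bits of a two-class node, given its class counts.

    Empty classes are skipped, which applies the convention 0 * log 0 = 0.
    """
    total = n_pos + n_neg
    return -sum(c / total * math.log2(c / total) for c in (n_pos, n_neg) if c > 0)

print(round(binary_entropy(8, 9), 4))  # 0.9975
```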

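Take the attribute "sugar content" as an example. Sorting the 17 samples by their value of this attribute in ascending order gives:

Table 2. Watermelon dataset 3.0, sorted by "sugar content"

| No. | Color | Root | Knock sound | Texture | Navel | Touch | Density | Sugar content | Good melon |
|---|---|---|---|---|---|---|---|---|---|
| 16 | light | curled | dull | blurry | flat | hard-smooth | 0.593 | 0.042 | no |
| 11 | light | stiff | crisp | blurry | flat | hard-smooth | 0.245 | 0.057 | no |
| 9 | dark | slightly curled | muffled | slightly blurry | slightly sunken | hard-smooth | 0.666 | 0.091 | no |
| 12 | light | curled | dull | blurry | flat | soft-sticky | 0.343 | 0.099 | no |
| 17 | green | curled | muffled | slightly blurry | slightly sunken | hard-smooth | 0.719 | 0.103 | no |
| 7 | dark | slightly curled | dull | slightly blurry | slightly sunken | soft-sticky | 0.481 | 0.149 | yes |
| 13 | green | slightly curled | dull | slightly blurry | sunken | hard-smooth | 0.639 | 0.161 | no |
| 14 | light | slightly curled | muffled | slightly blurry | sunken | hard-smooth | 0.657 | 0.198 | no |
| 8 | dark | slightly curled | dull | clear | slightly sunken | hard-smooth | 0.437 | 0.211 | yes |
| 5 | light | curled | dull | clear | sunken | hard-smooth | 0.556 | 0.215 | yes |
| 6 | green | slightly curled | dull | clear | slightly sunken | soft-sticky | 0.403 | 0.237 | yes |
| 3 | dark | curled | dull | clear | sunken | hard-smooth | 0.634 | 0.264 | yes |
| 10 | green | stiff | crisp | clear | flat | soft-sticky | 0.243 | 0.267 | no |
| 4 | green | curled | muffled | clear | sunken | hard-smooth | 0.608 | 0.318 | yes |
| 15 | dark | slightly curled | dull | clear | slightly sunken | soft-sticky | 0.360 | 0.370 | no |
| 2 | dark | curled | muffled | clear | sunken | hard-smooth | 0.774 | 0.376 | yes |
| 1 | green | curled | dull | clear | sunken | hard-smooth | 0.697 | 0.460 | yes |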

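The candidate binary split points for this attribute are the midpoints of adjacent sorted values, which gives 16 candidates for the 17 samples:

| $i$ | Sorted value $a^i$ | Candidate split point $\frac{a^{i}+a^{i+1}}{2}$ |
|---|---|---|
| 1 | 0.042 | 0.0495 |
| 2 | 0.057 | 0.074 |
| 3 | 0.091 | 0.095 |
| 4 | 0.099 | 0.101 |
| 5 | 0.103 | 0.126 |
| 6 | 0.149 | 0.155 |
| 7 | 0.161 | 0.1795 |
| 8 | 0.198 | 0.2045 |
| 9 | 0.211 | 0.213 |
| 10 | 0.215 | 0.226 |
| 11 | 0.237 | 0.2505 |
| 12 | 0.264 | 0.2655 |
| 13 | 0.267 | 0.2925 |
| 14 | 0.318 | 0.344 |
| 15 | 0.370 | 0.373 |
| 16 | 0.376 | 0.418 |
| 17 | 0.460 | - |
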
  • For split point 0.0495, the two subsets are $D_{0.0495}^-$: {16} and $D_{0.0495}^+$: {11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}. In each entropy below, the first term is the good-melon proportion and the second is the bad-melon proportion, with the convention $0\log 0=0$:
    $$\begin{aligned} Entropy(D_{0.0495}^-)&=-(\frac{0}{1}\log\frac{0}{1}+\frac{1}{1}\log\frac{1}{1})=0\\ Entropy(D_{0.0495}^+)&=-(\frac{8}{16}\log\frac{8}{16}+\frac{8}{16}\log\frac{8}{16})=1.0\\ Gain(D, \text{sugar content}, 0.0495)&= Entropy(D)-\sum_{\lambda\in\{-, +\}}\frac{N_{0.0495}^{\lambda}}{N} Entropy(D_{0.0495}^{\lambda})\\ &= 0.9975-(\frac{1}{17}\times 0+\frac{16}{17}\times 1.0)\\ &=0.0563 \end{aligned}$$
  • For split point 0.074, the two subsets are $D_{0.074}^-$: {16, 11} and $D_{0.074}^+$: {9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.074)= 0.9975-\{\frac{2}{17}\times[-(\frac{0}{2}\log\frac{0}{2}+\frac{2}{2}\log\frac{2}{2})]+\frac{15}{17}\times[-(\frac{8}{15}\log\frac{8}{15}+\frac{7}{15}\log\frac{7}{15})]\}=0.1179$$
  • For split point 0.095, the two subsets are $D_{0.095}^-$: {16, 11, 9} and $D_{0.095}^+$: {12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.095)= 0.9975-\{\frac{3}{17}\times[-(\frac{0}{3}\log\frac{0}{3}+\frac{3}{3}\log\frac{3}{3})]+\frac{14}{17}\times[-(\frac{8}{14}\log\frac{8}{14}+\frac{6}{14}\log\frac{6}{14})]\}=0.1861$$
  • For split point 0.101, the two subsets are $D_{0.101}^-$: {16, 11, 9, 12} and $D_{0.101}^+$: {17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.101)= 0.9975-\{\frac{4}{17}\times[-(\frac{0}{4}\log\frac{0}{4}+\frac{4}{4}\log\frac{4}{4})]+\frac{13}{17}\times[-(\frac{8}{13}\log\frac{8}{13}+\frac{5}{13}\log\frac{5}{13})]\}=0.2624$$
  • For split point 0.126, the two subsets are $D_{0.126}^-$: {16, 11, 9, 12, 17} and $D_{0.126}^+$: {7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.126)= 0.9975-\{\frac{5}{17}\times[-(\frac{0}{5}\log\frac{0}{5}+\frac{5}{5}\log\frac{5}{5})]+\frac{12}{17}\times[-(\frac{8}{12}\log\frac{8}{12}+\frac{4}{12}\log\frac{4}{12})]\}=0.3492$$
  • For split point 0.155, the two subsets are $D_{0.155}^-$: {16, 11, 9, 12, 17, 7} and $D_{0.155}^+$: {13, 14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.155)= 0.9975-\{\frac{6}{17}\times[-(\frac{1}{6}\log\frac{1}{6}+\frac{5}{6}\log\frac{5}{6})]+\frac{11}{17}\times[-(\frac{7}{11}\log\frac{7}{11}+\frac{4}{11}\log\frac{4}{11})]\}=0.1561$$
  • For split point 0.1795, the two subsets are $D_{0.1795}^-$: {16, 11, 9, 12, 17, 7, 13} and $D_{0.1795}^+$: {14, 8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.1795)= 0.9975-\{\frac{7}{17}\times[-(\frac{1}{7}\log\frac{1}{7}+\frac{6}{7}\log\frac{6}{7})]+\frac{10}{17}\times[-(\frac{7}{10}\log\frac{7}{10}+\frac{3}{10}\log\frac{3}{10})]\}=0.2354$$
  • For split point 0.2045, the two subsets are $D_{0.2045}^-$: {16, 11, 9, 12, 17, 7, 13, 14} and $D_{0.2045}^+$: {8, 5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.2045)= 0.9975-\{\frac{8}{17}\times[-(\frac{1}{8}\log\frac{1}{8}+\frac{7}{8}\log\frac{7}{8})]+\frac{9}{17}\times[-(\frac{7}{9}\log\frac{7}{9}+\frac{2}{9}\log\frac{2}{9})]\}=0.3371$$
  • For split point 0.213, the two subsets are $D_{0.213}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8} and $D_{0.213}^+$: {5, 6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.213)= 0.9975-\{\frac{9}{17}\times[-(\frac{2}{9}\log\frac{2}{9}+\frac{7}{9}\log\frac{7}{9})]+\frac{8}{17}\times[-(\frac{6}{8}\log\frac{6}{8}+\frac{2}{8}\log\frac{2}{8})]\}=0.2111$$
  • For split point 0.226, the two subsets are $D_{0.226}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5} and $D_{0.226}^+$: {6, 3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.226)= 0.9975-\{\frac{10}{17}\times[-(\frac{3}{10}\log\frac{3}{10}+\frac{7}{10}\log\frac{7}{10})]+\frac{7}{17}\times[-(\frac{5}{7}\log\frac{5}{7}+\frac{2}{7}\log\frac{2}{7})]\}=0.1237$$
  • For split point 0.2505, the two subsets are $D_{0.2505}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6} and $D_{0.2505}^+$: {3, 10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.2505)= 0.9975-\{\frac{11}{17}\times[-(\frac{4}{11}\log\frac{4}{11}+\frac{7}{11}\log\frac{7}{11})]+\frac{6}{17}\times[-(\frac{4}{6}\log\frac{4}{6}+\frac{2}{6}\log\frac{2}{6})]\}=0.0615$$
  • For split point 0.2655, the two subsets are $D_{0.2655}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3} and $D_{0.2655}^+$: {10, 4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.2655)= 0.9975-\{\frac{12}{17}\times[-(\frac{5}{12}\log\frac{5}{12}+\frac{7}{12}\log\frac{7}{12})]+\frac{5}{17}\times[-(\frac{3}{5}\log\frac{3}{5}+\frac{2}{5}\log\frac{2}{5})]\}=0.0202$$
  • For split point 0.2925, the two subsets are $D_{0.2925}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10} and $D_{0.2925}^+$: {4, 15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.2925)= 0.9975-\{\frac{13}{17}\times[-(\frac{5}{13}\log\frac{5}{13}+\frac{8}{13}\log\frac{8}{13})]+\frac{4}{17}\times[-(\frac{3}{4}\log\frac{3}{4}+\frac{1}{4}\log\frac{1}{4})]\}=0.0715$$
  • For split point 0.344, the two subsets are $D_{0.344}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4} and $D_{0.344}^+$: {15, 2, 1}
    $$Gain(D, \text{sugar content}, 0.344)= 0.9975-\{\frac{14}{17}\times[-(\frac{6}{14}\log\frac{6}{14}+\frac{8}{14}\log\frac{8}{14})]+\frac{3}{17}\times[-(\frac{2}{3}\log\frac{2}{3}+\frac{1}{3}\log\frac{1}{3})]\}=0.0241$$
  • For split point 0.373, the two subsets are $D_{0.373}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15} and $D_{0.373}^+$: {2, 1}
    $$Gain(D, \text{sugar content}, 0.373)= 0.9975-\{\frac{15}{17}\times[-(\frac{6}{15}\log\frac{6}{15}+\frac{9}{15}\log\frac{9}{15})]+\frac{2}{17}\times[-(\frac{2}{2}\log\frac{2}{2}+\frac{0}{2}\log\frac{0}{2})]\}=0.1408$$
  • For split point 0.418, the two subsets are $D_{0.418}^-$: {16, 11, 9, 12, 17, 7, 13, 14, 8, 5, 6, 3, 10, 4, 15, 2} and $D_{0.418}^+$: {1}
    $$Gain(D, \text{sugar content}, 0.418)= 0.9975-\{\frac{16}{17}\times[-(\frac{7}{16}\log\frac{7}{16}+\frac{9}{16}\log\frac{9}{16})]+\frac{1}{17}\times[-(\frac{1}{1}\log\frac{1}{1}+\frac{0}{1}\log\frac{0}{1})]\}=0.0669$$

Therefore, the maximum information gain for the attribute "sugar content" is 0.3492, obtained at split point 0.126:
$$Gain(D, \text{sugar content})=Gain(D, \text{sugar content}, t=0.126)=0.3492$$
Similarly, the maximum information gain for the attribute "density" is 0.2624, obtained at split point 0.3815:
$$Gain(D, \text{density})=Gain(D, \text{density}, t=0.3815)=0.2624$$

Continuous-valued attributes can be handled in this way.
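The two results can be reproduced with a short script. This is a sketch rather than the author's code: the helper names `H` and `best_split` are hypothetical, the sugar-content and density values are copied from Table 1 in sample order 1 to 17, and the class labels (samples 1 to 8 good, 9 to 17 bad) are assumed from the counts used in the worked calculations.

```python
import math

def H(pos, neg):
    """Entropy in bits from class counts, with the convention 0 * log 0 = 0."""
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c)

def best_split(values, good):
    """Scan the sorted samples once and return (threshold, gain) of the best midpoint."""
    pairs = sorted(zip(values, good))
    n, n_pos = len(pairs), sum(good)
    base = H(n_pos, n - n_pos)
    best_t, best_gain = None, -1.0
    left_pos = 0
    for i in range(1, n):                 # i = size of the lower subset
        left_pos += pairs[i - 1][1]
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:                      # no midpoint between equal values
            continue
        left_neg = i - left_pos
        right_pos = n_pos - left_pos
        right_neg = (n - i) - right_pos
        gain = base - i / n * H(left_pos, left_neg) - (n - i) / n * H(right_pos, right_neg)
        if gain > best_gain:
            best_t, best_gain = (lo + hi) / 2, gain
    return best_t, best_gain

density = [0.697, 0.774, 0.634, 0.608, 0.556, 0.403, 0.481, 0.437, 0.666,
           0.243, 0.245, 0.343, 0.639, 0.657, 0.360, 0.593, 0.719]
sugar = [0.460, 0.376, 0.264, 0.318, 0.215, 0.237, 0.149, 0.211, 0.091,
         0.267, 0.057, 0.099, 0.161, 0.198, 0.370, 0.042, 0.103]
good = [1] * 8 + [0] * 9   # assumed labels: samples 1-8 good, 9-17 bad

for name, values in (("sugar content", sugar), ("density", density)):
    t, gain = best_split(values, good)
    print(name, round(t, 4), round(gain, 4))
```

The selected thresholds should match the split points 0.126 and 0.3815 quoted above; the printed gains may differ from the quoted values in the last digit because of rounding.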

2. Handling attributes with missing values

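Table 3. Watermelon dataset with missing values ("-" marks a missing value)

| No. | Color | Root | Knock sound | Texture | Navel | Touch | Good melon |
|---|---|---|---|---|---|---|---|
| 1 | - | curled | dull | clear | sunken | hard-smooth | yes |
| 2 | dark | curled | muffled | clear | sunken | - | yes |
| 3 | dark | curled | - | clear | sunken | hard-smooth | yes |
| 4 | green | curled | muffled | clear | sunken | hard-smooth | yes |
| 5 | - | curled | dull | clear | sunken | hard-smooth | yes |
| 6 | green | slightly curled | dull | clear | - | soft-sticky | yes |
| 7 | dark | slightly curled | dull | slightly blurry | slightly sunken | soft-sticky | yes |
| 8 | dark | slightly curled | dull | - | slightly sunken | hard-smooth | yes |
| 9 | dark | - | muffled | slightly blurry | slightly sunken | hard-smooth | no |
| 10 | green | stiff | crisp | - | flat | soft-sticky | no |
| 11 | light | stiff | crisp | blurry | flat | - | no |
| 12 | light | curled | - | blurry | flat | soft-sticky | no |
| 13 | - | slightly curled | dull | slightly blurry | sunken | hard-smooth | no |
| 14 | light | slightly curled | muffled | slightly blurry | sunken | hard-smooth | no |
| 15 | dark | slightly curled | dull | clear | - | soft-sticky | no |
| 16 | light | curled | dull | blurry | flat | hard-smooth | no |
| 17 | green | - | muffled | slightly blurry | slightly sunken | hard-smooth | no |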

(1) How should the split attribute be selected when some attribute values are missing?

Given a training set $D$ and an attribute $A$, let $\widetilde{D}$ denote the subset of samples in $D$ that have no missing value on $A$. Suppose $A$ has $M$ possible values $\{a^1, a^2, ..., a^M\}$. Let $\widetilde{D}^m$ denote the subset of $\widetilde{D}$ whose value on $A$ is $a^m$, and let $\widetilde{D}_k$ denote the subset of $\widetilde{D}$ that belongs to class $k$ ($k=1,2,...,K$), so that $\widetilde{D}=\cup_{k=1}^{K}\widetilde{D}_k=\cup_{m=1}^{M}\widetilde{D}^m$. Assign each sample $x$ a weight $w_x$ (initially 1) and define:
$$\rho=\frac{\sum_{x\in\widetilde{D}}w_x}{\sum_{x\in D}w_x},\qquad \widetilde{p}_k=\frac{\sum_{x\in\widetilde{D}_k}w_x}{\sum_{x\in \widetilde{D}}w_x},\qquad \widetilde{r}_m=\frac{\sum_{x\in\widetilde{D}^m}w_x}{\sum_{x\in \widetilde{D}}w_x}$$
Here $\rho$ is the proportion of samples without missing values, $\widetilde{p}_k$ is the proportion of class $k$ among the samples without missing values, and $\widetilde{r}_m$ is the proportion of samples taking value $a^m$ on $A$ among the samples without missing values. It follows that $\sum_{k=1}^K\widetilde{p}_k=\sum_{m=1}^M\widetilde{r}_m=1$.

Based on these definitions, the information gain generalizes to attributes with missing values as follows:
$$\begin{aligned} Entropy(\widetilde{D})&=-\sum_{k=1}^{K}\widetilde{p}_k\log\widetilde{p}_k\\ Gain(D, A)&=\rho\times Gain(\widetilde{D}, A)=\rho\times\left[Entropy(\widetilde{D})-\sum_{m=1}^{M}\widetilde{r}_mEntropy(\widetilde{D}^m)\right] \end{aligned}$$
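The generalized gain can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not the author's code: the names `weighted_entropy` and `gain_with_missing` are hypothetical, a missing value is represented by `None`, all sample weights default to 1, and the class labels (samples 1 to 8 good, 9 to 17 bad) are assumed from the counts used in the worked calculations. The example evaluates the "color" attribute of Table 3.

```python
import math
from collections import defaultdict

def weighted_entropy(pairs):
    """Entropy in bits of (label, weight) pairs, using weight shares as probabilities."""
    totals = defaultdict(float)
    for label, w in pairs:
        totals[label] += w
    total = sum(totals.values())
    return -sum(w / total * math.log2(w / total) for w in totals.values() if w > 0)

def gain_with_missing(values, labels, weights=None):
    """Gain(D, A) = rho * Gain(D~, A); None in `values` marks a missing value."""
    if weights is None:
        weights = [1.0] * len(values)
    known = [(v, y, w) for v, y, w in zip(values, labels, weights) if v is not None]
    known_weight = sum(w for _, _, w in known)
    rho = known_weight / sum(weights)           # share of samples with a known value
    groups = defaultdict(list)
    for v, y, w in known:
        groups[v].append((y, w))
    base = weighted_entropy([(y, w) for _, y, w in known])
    remainder = sum(
        sum(w for _, w in g) / known_weight * weighted_entropy(g)   # r~_m * Entropy(D~^m)
        for g in groups.values()
    )
    return rho * (base - remainder)

# Color column of Table 3: "dark" = 乌黑, "green" = 青绿, "light" = 浅白, None = missing.
color = [None, "dark", "dark", "green", None, "green", "dark", "dark", "dark",
         "green", "light", "light", None, "light", "dark", "light", "green"]
labels = ["yes"] * 8 + ["no"] * 9   # assumed labels: samples 1-8 good, 9-17 bad
print(round(gain_with_missing(color, labels), 3))  # 0.252
```

Because the weights are passed in explicitly, the same function can be reused at deeper nodes, where samples that were distributed across children carry fractional weights.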

  • Attribute "color": the subset without missing values is $\widetilde{D}=\{2,3,4,6,7,8,9,10,11,12,14,15,16,17\}$, and the attribute takes 3 values: "dark", "green", and "light"
    $$\begin{aligned} Gain(D, \text{color})&=\rho\times[Entropy(\widetilde{D})-\sum_{m=1}^{M}\widetilde{r}_mEntropy(\widetilde{D}^m)]\\ &=\frac{14}{17}\times\{-(\frac{6}{14}\log\frac{6}{14}+\frac{8}{14}\log\frac{8}{14})-[\frac{6}{14}\times(-(\frac{4}{6}\log\frac{4}{6}+\frac{2}{6}\log\frac{2}{6}))+\frac{4}{14}\times(-(\frac{2}{4}\log\frac{2}{4}+\frac{2}{4}\log\frac{2}{4}))+\frac{4}{14}\times(-(\frac{0}{4}\log\frac{0}{4}+\frac{4}{4}\log\frac{4}{4}))]\}\\ &=0.2519 \end{aligned}$$
  • Attribute "root": the subset without missing values is $\widetilde{D}=\{1,2,3,4,5,6,7,8,10,11,12,13,14,15,16\}$, and the attribute takes 3 values: "curled", "slightly curled", and "stiff"
    $$\begin{aligned} Gain(D, \text{root})&=\frac{15}{17}\times\{-(\frac{8}{15}\log\frac{8}{15}+\frac{7}{15}\log\frac{7}{15})-[\frac{7}{15}\times(-(\frac{5}{7}\log\frac{5}{7}+\frac{2}{7}\log\frac{2}{7}))+\frac{6}{15}\times(-(\frac{3}{6}\log\frac{3}{6}+\frac{3}{6}\log\frac{3}{6}))+\frac{2}{15}\times(-(\frac{0}{2}\log\frac{0}{2}+\frac{2}{2}\log\frac{2}{2}))]\}\\ &=0.1711 \end{aligned}$$
  • Attribute "knock sound": the subset without missing values is $\widetilde{D}=\{1,2,4,5,6,7,8,9,10,11,13,14,15,16,17\}$, and the attribute takes 3 values: "dull", "muffled", and "crisp"
    $$\begin{aligned} Gain(D, \text{knock sound})&=\frac{15}{17}\times\{-(\frac{7}{15}\log\frac{7}{15}+\frac{8}{15}\log\frac{8}{15})-[\frac{8}{15}\times(-(\frac{5}{8}\log\frac{5}{8}+\frac{3}{8}\log\frac{3}{8}))+\frac{5}{15}\times(-(\frac{2}{5}\log\frac{2}{5}+\frac{3}{5}\log\frac{3}{5}))+\frac{2}{15}\times(-(\frac{0}{2}\log\frac{0}{2}+\frac{2}{2}\log\frac{2}{2}))]\}\\ &=0.1448 \end{aligned}$$
  • Attribute "texture": the subset without missing values is $\widetilde{D}=\{1,2,3,4,5,6,7,9,11,12,13,14,15,16,17\}$, and the attribute takes 3 values: "clear", "slightly blurry", and "blurry"
    $$\begin{aligned} Gain(D, \text{texture})&=\frac{15}{17}\times\{-(\frac{7}{15}\log\frac{7}{15}+\frac{8}{15}\log\frac{8}{15})-[\frac{7}{15}\times(-(\frac{6}{7}\log\frac{6}{7}+\frac{1}{7}\log\frac{1}{7}))+\frac{5}{15}\times(-(\frac{1}{5}\log\frac{1}{5}+\frac{4}{5}\log\frac{4}{5}))+\frac{3}{15}\times(-(\frac{0}{3}\log\frac{0}{3}+\frac{3}{3}\log\frac{3}{3}))]\}\\ &=0.4235 \end{aligned}$$
  • Attribute "navel": the subset without missing values is $\widetilde{D}=\{1,2,3,4,5,7,8,9,10,11,12,13,14,16,17\}$, and the attribute takes 3 values: "sunken", "slightly sunken", and "flat"
    $$\begin{aligned} Gain(D, \text{navel})&=\frac{15}{17}\times\{-(\frac{7}{15}\log\frac{7}{15}+\frac{8}{15}\log\frac{8}{15})-[\frac{7}{15}\times(-(\frac{5}{7}\log\frac{5}{7}+\frac{2}{7}\log\frac{2}{7}))+\frac{4}{15}\times(-(\frac{2}{4}\log\frac{2}{4}+\frac{2}{4}\log\frac{2}{4}))+\frac{4}{15}\times(-(\frac{0}{4}\log\frac{0}{4}+\frac{4}{4}\log\frac{4}{4}))]\}\\ &=0.2888 \end{aligned}$$
  • Attribute "touch": the subset without missing values is $\widetilde{D}=\{1,3,4,5,6,7,8,9,10,12,13,14,15,16,17\}$, and the attribute takes 2 values: "hard-smooth" and "soft-sticky"
    $$\begin{aligned} Gain(D, \text{touch})&=\frac{15}{17}\times\{-(\frac{7}{15}\log\frac{7}{15}+\frac{8}{15}\log\frac{8}{15})-[\frac{10}{15}\times(-(\frac{5}{10}\log\frac{5}{10}+\frac{5}{10}\log\frac{5}{10}))+\frac{5}{15}\times(-(\frac{2}{5}\log\frac{2}{5}+\frac{3}{5}\log\frac{3}{5}))]\}\\ &=0.0057 \end{aligned}$$
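
(2) Given the split attribute, how should a sample be assigned when its value on that attribute is missing?

  • If the sample's value on the split attribute is known, the sample goes to the child node matching that value, and its weight in that child node remains 1;
  • If the sample's value on the split attribute is unknown, the sample goes to all child nodes at once, and its weight in the child node for value $a^m$ is scaled by that child's share $\widetilde{r}_m$ of the known-value samples, becoming $\widetilde{r}_m\cdot w_x$.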

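The attribute "texture" has the largest information gain and is used for the next split. It has 15 samples with known values (7 clear, 5 slightly blurry, 3 blurry) and 2 samples with unknown values, {8, 10}.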

| Attribute: value | Samples | Good melons | Bad melons | Missing-value samples | Weight of missing-value samples | Total weight |
|---|---|---|---|---|---|---|
| texture: clear | {1,2,3,4,5,6,15} | {1,2,3,4,5,6} | {15} | {8,10} | $2\times\frac{7}{15}$ | $7+2\times\frac{7}{15}$ |
| texture: slightly blurry | {7,9,13,14,17} | {7} | {9,13,14,17} | {8,10} | $2\times\frac{5}{15}$ | $5+2\times\frac{5}{15}$ |
| texture: blurry | {11,12,16} | - | {11,12,16} | {8,10} | $2\times\frac{3}{15}$ | $3+2\times\frac{3}{15}$ |
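The weights in the table can be reproduced with a short sketch. The function name `split_with_missing` is hypothetical, a missing value is represented by `None`, and exact fractions are used so that the results match the table without rounding. The texture column is taken from Table 3 in sample order 1 to 17.

```python
from fractions import Fraction

def split_with_missing(values, weights=None):
    """Return {attribute value: {sample number: weight}} for one split.

    A sample with a known value goes to the matching child and keeps its weight.
    A sample with a missing value (None) goes to every child, with its weight
    scaled by that child's share of the known-value weight.
    """
    if weights is None:
        weights = [Fraction(1)] * len(values)
    share = {}                                   # known-value weight per child
    for v, w in zip(values, weights):
        if v is not None:
            share[v] = share.get(v, 0) + w
    known_total = sum(share.values())
    children = {v: {} for v in share}
    for number, (v, w) in enumerate(zip(values, weights), start=1):
        if v is not None:
            children[v][number] = w
        else:
            for child, child_weight in share.items():
                children[child][number] = w * child_weight / known_total
    return children

# Texture column of Table 3; None marks the missing values of samples 8 and 10.
texture = ["clear"] * 6 + ["slightly blurry", None, "slightly blurry", None,
           "blurry", "blurry", "slightly blurry", "slightly blurry", "clear",
           "blurry", "slightly blurry"]
children = split_with_missing(texture)
print(children["clear"][8])              # 7/15
print(sum(children["clear"].values()))   # 119/15, i.e. 7 + 2 * 7/15
```

The total weight across all three child nodes is still 17, so splitting samples fractionally neither creates nor loses weight.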

The child node texture = clear contains 7 samples with known values, {1,2,3,4,5,6,15}: 6 good melons and 1 bad melon. Assuming that the class distribution of the missing-value samples in this node matches that of the known-value samples, i.e. $\frac{6}{7}$ and $\frac{1}{7}$, the information entropy of the child node texture = clear is:
$$\begin{aligned} Entropy(D^{\text{texture}=\text{clear}})&=-\sum_{i=1}^{k}p_i\log p_i\\ &=-(\frac{6+\frac{6}{7}\times\frac{7}{15}\times2}{7+\frac{7}{15}\times2}\log\frac{6+\frac{6}{7}\times\frac{7}{15}\times2}{7+\frac{7}{15}\times2}+\frac{1+\frac{1}{7}\times\frac{7}{15}\times2}{7+\frac{7}{15}\times2}\log\frac{1+\frac{1}{7}\times\frac{7}{15}\times2}{7+\frac{7}{15}\times2})\\ &=0.5916 \end{aligned}$$

  • For the child node texture = clear, compute the information gain of the attribute color
    • color = dark has 3 samples (2 positive and 1 negative); color = green has 2 samples (2 positive); 2 samples have missing values

    • Weights of the missing-value samples: the weight for color = dark is $\frac{3}{5}$, for a total of $2\times\frac{3}{5}=\frac{6}{5}$; the weight for color = green is $\frac{2}{5}$, for a total of $2\times\frac{2}{5}=\frac{4}{5}$

    • color = dark: weight of positive samples: $2+\frac{2}{3}\times\frac{3}{5}\times2$; weight of negative samples: $1+\frac{1}{3}\times\frac{3}{5}\times2$; total weight: $2+\frac{2}{3}\times\frac{3}{5}\times2+1+\frac{1}{3}\times\frac{3}{5}\times2=3+\frac{3}{5}\times2$
      $$Entropy(D^{\text{texture}=\text{clear}},\text{color}=\text{dark})=-(\frac{2+\frac{2}{3}\times\frac{3}{5}\times2}{3+\frac{3}{5}\times2}\log\frac{2+\frac{2}{3}\times\frac{3}{5}\times2}{3+\frac{3}{5}\times2}+\frac{1+\frac{1}{3}\times\frac{3}{5}\times2}{3+\frac{3}{5}\times2}\log\frac{1+\frac{1}{3}\times\frac{3}{5}\times2}{3+\frac{3}{5}\times2})=0.9183$$

    • color = green: weight of positive samples: $2+\frac{2}{2}\times\frac{2}{5}\times2$; weight of negative samples: $0+\frac{0}{2}\times\frac{2}{5}\times2$; total weight: $2+\frac{2}{2}\times\frac{2}{5}\times2+0+\frac{0}{2}\times\frac{2}{5}\times2=2+\frac{2}{5}\times2$
      $$Entropy(D^{\text{texture}=\text{clear}},\text{color}=\text{green})=-(\frac{2+\frac{2}{5}\times2}{2+\frac{2}{5}\times2}\log\frac{2+\frac{2}{5}\times2}{2+\frac{2}{5}\times2}+\frac{0+\frac{0}{2}\times\frac{2}{5}\times2}{2+\frac{2}{5}\times2}\log\frac{0+\frac{0}{2}\times\frac{2}{5}\times2}{2+\frac{2}{5}\times2})=0.0$$

$$Gain(D^{\text{texture}=\text{clear}},\text{color})=0.5916-(\frac{3+\frac{3}{5}\times2}{7+\frac{7}{15}\times2}\times0.9183+\frac{2+\frac{2}{5}\times2}{7+\frac{7}{15}\times2}\times0.0)=0.1054$$

  • For the child node texture = clear, compute the information gain of the attribute root

    • root = curled has 5 samples (5 positive); root = slightly curled has 2 samples (1 positive and 1 negative); no samples have missing values
      $$\begin{aligned} Entropy(D^{\text{texture}=\text{clear}},\text{root}=\text{curled})&=-(\frac{5}{5}\log\frac{5}{5}+\frac{0}{5}\log\frac{0}{5})=0.0\\ Entropy(D^{\text{texture}=\text{clear}},\text{root}=\text{slightly curled})&=-(\frac{1}{2}\log\frac{1}{2}+\frac{1}{2}\log\frac{1}{2})=1.0\\ Gain(D^{\text{texture}=\text{clear}},\text{root})&=0.5916-(\frac{5}{7}\times0.0+\frac{2}{7}\times1.0)=0.3058 \end{aligned}$$
  • For the child node texture = clear, compute the information gain of the attribute knock sound

    • knock sound = dull has 4 samples (3 positive and 1 negative); knock sound = muffled has 2 samples (2 positive); 1 sample has a missing value

    • Weights of the missing-value sample: the weight for knock sound = dull is $\frac{4}{6}$, for a total of $\frac{4}{6}$; the weight for knock sound = muffled is $\frac{2}{6}$, for a total of $\frac{2}{6}$

    • knock sound = dull: weight of positive samples: $3+\frac{3}{4}\times\frac{4}{6}$; weight of negative samples: $1+\frac{1}{4}\times\frac{4}{6}$; total weight: $3+\frac{3}{4}\times\frac{4}{6}+1+\frac{1}{4}\times\frac{4}{6}=4+\frac{4}{6}$
      $$Entropy(D^{\text{texture}=\text{clear}},\text{knock sound}=\text{dull})=-(\frac{3+\frac{3}{4}\times\frac{4}{6}}{4+\frac{4}{6}}\log\frac{3+\frac{3}{4}\times\frac{4}{6}}{4+\frac{4}{6}}+\frac{1+\frac{1}{4}\times\frac{4}{6}}{4+\frac{4}{6}}\log\frac{1+\frac{1}{4}\times\frac{4}{6}}{4+\frac{4}{6}})=0.8112$$

    • knock sound = muffled: weight of positive samples: $2+\frac{2}{2}\times\frac{2}{6}$; weight of negative samples: $0+\frac{0}{2}\times\frac{2}{6}$; total weight: $2+\frac{2}{2}\times\frac{2}{6}+0+\frac{0}{2}\times\frac{2}{6}=2+\frac{2}{6}$
      $$Entropy(D^{\text{texture}=\text{clear}},\text{knock sound}=\text{muffled})=-(\frac{2+\frac{2}{2}\times\frac{2}{6}}{2+\frac{2}{6}}\log\frac{2+\frac{2}{2}\times\frac{2}{6}}{2+\frac{2}{6}}+\frac{0+\frac{0}{2}\times\frac{2}{6}}{2+\frac{2}{6}}\log\frac{0+\frac{0}{2}\times\frac{2}{6}}{2+\frac{2}{6}})=0.0$$
      $$Gain(D^{\text{texture}=\text{clear}},\text{knock sound})=0.5916-(\frac{4+\frac{4}{6}}{7+\frac{7}{15}\times2}\times0.8112+\frac{2+\frac{2}{6}}{7+\frac{7}{15}\times2}\times0.0)=0.1144$$


