Add the contact lens dataset for decision trees; update 3.决策树.md

2026-05-11 00:58:19 +08:00 · 2017-08-22 12:52:03 +08:00
parent f9e1d2e2ff
commit f4510c1a38
2 changed files with 157 additions and 6 deletions
--- a/docs/3.决策树.md
+++ b/docs/3.决策树.md
@@ -102,8 +102,11 @@ def createDataSet():
    labels = ['no surfacing', 'flippers']
    return dataSet, labels
 ```
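+> Prepare data: the tree-construction algorithm only works on nominal data, so numeric data must be discretized.

-* Compute the Shannon entropy of a given dataset
+Here the input data is already discrete, so this step is skipped.
+
+Compute the Shannon entropy of a given dataset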

 ```Python
 def calcShannonEnt(dataSet):
@@ -130,7 +133,7 @@ def calcShannonEnt(dataSet):
    return shannonEnt
 ```

-* Split the dataset on a given feature
+Split the dataset on a given feature

 ```Python
 def splitDataSet(dataSet, axis, value):
@@ -143,15 +146,100 @@ def splitDataSet(dataSet, axis, value):
    return retDataSet
 ```

+Choose the best way to split the dataset
+
+```Python
+def chooseBestFeatureToSplit(dataSet):
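+    """chooseBestFeatureToSplit (choose the best feature)
+
+    Args:
+        dataSet: the dataset
+    Returns:
+        bestFeature: index of the best feature column
+    """
+    # Number of feature columns in the first row; the last column is the label
+    numFeatures = len(dataSet[0]) - 1
+    # Shannon entropy of the labels over the whole dataset
+    baseEntropy = calcShannonEnt(dataSet)
+    # Best information gain so far, and the index of the best feature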
+    bestInfoGain, bestFeature = 0.0, -1
+    # iterate over all the features
+    for i in range(numFeatures):
+        # create a list of all the examples of this feature
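+        # Collect the value of feature i (the (i+1)-th column) from every example
+        featList = [example[i] for example in dataSet]
+        # get a set of unique values
+        # Deduplicate the values with set()
+        uniqueVals = set(featList)
+        # Accumulator for the weighted entropy after splitting on this feature
+        newEntropy = 0.0
+        # Iterate over this column's values to compute its conditional entropy
+        # For each unique value, split the dataset on it, compute the subset's entropy, and sum the entropies weighted by subset size.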
+        for value in uniqueVals:
+            subDataSet = splitDataSet(dataSet, i, value)
+            prob = len(subDataSet)/float(len(dataSet))
+            newEntropy += prob * calcShannonEnt(subDataSet)
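+        # Information gain: the change in entropy before and after the split; keep the feature with the largest gain
+        # Information gain is the reduction in entropy (disorder). After comparing the gain of every feature, return the index of the best one.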
+        infoGain = baseEntropy - newEntropy
+        print 'infoGain=', infoGain, 'bestFeature=', i, baseEntropy, newEntropy
+        if (infoGain > bestInfoGain):
+            bestInfoGain = infoGain
+            bestFeature = i
+    return bestFeature
+```
+
+> Train the algorithm: build the tree data structure
+
+Function that creates the tree
+
+```Python
+def createTree(dataSet, labels):
+    classList = [example[-1] for example in dataSet]
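+    # If the first class label occurs as many times as there are examples, there is only one class, so return it directly
+    # First stopping condition: all class labels are identical, so return that label.
+    # list.count(x) returns the number of times x occurs in the list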
+    if classList.count(classList[0]) == len(classList):
+        return classList[0]
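+    # If only one column (the label) remains, return the most frequent class label
+    # Second stopping condition: all features are used up but the data still cannot be split into single-class groups.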
+    if len(dataSet[0]) == 1:
+        return majorityCnt(classList)
+
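+    # Choose the best column, then look up the label name for it
+    bestFeat = chooseBestFeatureToSplit(dataSet)
+    # Name of the best feature
+    bestFeatLabel = labels[bestFeat]
+    # Initialize myTree
+    myTree = {bestFeatLabel: {}}
+    # Note: labels is a mutable list passed by reference, so changes made here are visible to the caller
+    # The del below therefore removes an element from the caller's list, which makes later example code fail with "'no surfacing' is not in list"
+    del(labels[bestFeat])
+    # Collect the values of the best column; each unique value becomes a branch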
+    featValues = [example[bestFeat] for example in dataSet]
+    uniqueVals = set(featValues)
+    for value in uniqueVals:
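+        # Copy the remaining labels so recursive calls do not modify this list
+        subLabels = labels[:]
+        # For every value of the chosen feature, recursively call createTree() on the corresponding subset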
+        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
+        # print 'myTree', value, myTree
+    return myTree
+```
+
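+> Test the algorithm: compute the error rate using the learned tree
+
+> Use the algorithm: this step applies to any supervised learning algorithm, and a decision tree additionally helps you understand the underlying meaning of the data.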
+
 [Full code](https://github.com/apachecn/MachineLearning/blob/master/src/python/3.DecisionTree/DecisionTree.py): <https://github.com/apachecn/MachineLearning/blob/master/src/python/3.DecisionTree/DecisionTree.py>

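-### Hands-on project 2: Using a decision tree to predict contact lens type
+### Project case 2: Using a decision tree to predict contact lens type

-#### Overview
+#### Project overview

 Contact lens types include hard lenses, soft lenses, and no lenses (not suitable for contact lenses). We use a decision tree to predict which type a patient should wear.

-#### Workflow
+#### Development workflow

 1. Collect data: the provided text file.
 2. Parse data: parse the tab-separated data lines.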
@@ -160,7 +248,46 @@ def splitDataSet(dataSet, axis, value):
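 5. Test the algorithm: write a test function to verify that the decision tree correctly classifies given data instances.
 6. Use the algorithm: store the tree data structure so it does not have to be rebuilt next time.

-* Store the decision tree with the pickle module
+> Collect data: the provided text file
+
+The text file has the following format: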
+
+```
+young	myope	no	reduced	no lenses
+pre	myope	no	reduced	no lenses
+presbyopic	myope	no	reduced	no lenses
+```
+
+> Parse data: parse the tab-separated data lines
+
+```Python
+lenses = [inst.strip().split('\t') for inst in fr.readlines()]
+lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
+```
+
+> Analyze data: quickly check the data to make sure it was parsed correctly, then use createPlot() to draw the final tree diagram.
+
+```Python
+>>> treePlotter.createPlot(lensesTree)
+```
+
+> Train the algorithm: use the createTree() function
+
+```Python
+>>> lensesTree = trees.createTree(lenses, lensesLabels)
+>>> lensesTree
+{'tearRate': {'reduced': 'no lenses', 'normal': {'astigmatic':{'yes':
+{'prescript':{'hyper':{'age':{'pre':'no lenses', 'presbyopic':
+'no lenses', 'young':'hard'}}, 'myope':'hard'}}, 'no':{'age':{'pre':
+'soft', 'presbyopic':{'prescript': {'hyper':'soft', 'myope':
+'no lenses'}}, 'young':'soft'}}}}}
+```
+
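+> Test the algorithm: write a test function to verify that the decision tree correctly classifies given data instances.
+
+> Use the algorithm: store the tree data structure so it does not have to be rebuilt next time.
+
+Store the decision tree with the pickle module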

 ```Python
 def storeTree(inputTree, filename):