Merge pull request #109 from jiangzhonglian/master

Update recommender system code / comments in the AdaBoost section / SVM
2026-05-08 14:52:28 +08:00 · 2017-08-15 18:35:47 +08:00
parent 47fb94a019 af6229cead
commit 0d05f0cdaf
17 changed files with 750 additions and 159 deletions
--- a/docs/16.推荐系统.md
+++ b/docs/16.推荐系统.md
@@ -1,5 +1,42 @@
 # Chapter 16: Recommender Systems

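+## Background and Mining Goals
+
+With the rapid growth of the Internet, users find it hard to quickly locate the information they care about in a sea of content. Two tools emerged in response: search engines and recommender systems.
+
+This chapter covers recommender systems, which:
+
+1. Help users discover information they are, or might become, interested in.
+2. Let a site's valuable content stand out and win recognition from a broad audience.
+3. Increase users' loyalty to and attention on the site, building a stable user base.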
+
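+## Analysis Method and Process
+
+The goal of this case study is to make recommendations to users, that is, to link users with items (here, web pages) in some way.
+
+Users generate a large number of site access records. If all records are fed directly into a recommender system without first being categorized, the following problems arise:
+
+1. A large data volume means many items and many users. Building the sparse user-item matrix can exhaust the machine's memory, and the model computation takes a long time.
+2. Users differ widely and care about different information, so even if recommendations can be produced, their quality will be poor.
+
+To avoid these problems, the data must first be categorized and analyzed.
+
+Ideally, users would be categorized by their interests and needs.
+However, the access records do not capture how long a user spent on each page, so interests are hard to infer.
+This chapter therefore works from the pages that users browsed: records are grouped by the type of page visited, and recommendations are made within each type.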
+
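+The analysis proceeds as follows:
+
+* Obtain the raw records of users' visits to the site from the system.
+* Run a multidimensional analysis of the data, covering the content users visit, churned users, and user categories.
+* Preprocess the data, including deduplication, transformation, and classification.
+* Filter the data, using visits to pages with an .html suffix as the key condition.
+* Compare several recommendation algorithms and use model evaluation to pick a well-performing recommendation model, then apply it to the sample data to produce recommendations.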
+
+
+
+## Mainstream Recommendation Algorithms
+
 | Method | Description |
 | --- | --- |
 | Content-based recommendation |   |
@@ -11,12 +48,31 @@

 ![推荐方法对比](/images/16.RecommendedSystem/推荐方法对比.png)

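-## Knowledge-Based Recommendation
+### Knowledge-Based Recommendation

 Knowledge-based recommendation can, to some extent, be regarded as an inference technique; it does not build its recommendations on the user's needs and preferences. Knowledge-based methods differ markedly in the functional knowledge they use. Functional knowledge is knowledge about how an item satisfies a particular user, so it can explain the relationship between a need and a recommendation. The user profile can therefore be any knowledge structure that supports inference: it may be a query the user has already normalized, or a more detailed representation of the user's needs.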

 ![基于知识的推荐](/images/16.RecommendedSystem/基于知识的推荐.jpg)

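+### Collaborative Filtering Recommendation
+
+* Memory-based recommendation
+    * Item-based methods
+    * User-based methods
+    * Memory-based methods run a nearest-neighbor search: each item or user is treated as a vector, and its similarity to every other item or user is computed. Once the pairwise similarities between items or users are known, they can be used for prediction and recommendation.
+* Model-based recommendation
+    * The most common model-based method is matrix factorization.
+    * Matrix factorization decomposes the original rating matrix R into the product of two matrices. Only entries that have ratings are considered; missing entries are ignored during training. Once R is factored into U and V (R ≈ UV), a missing entry of R can be estimated by multiplying the corresponding row of U by the corresponding column of V.
+    * Objective function of matrix factorization: U and V can be found with gradient descent, alternately updating u and v over many iterations until they converge.
+    * The core idea behind matrix factorization is to find two matrices whose product is as close as possible to the rating matrix R at the positions where R has values. The product then reconstructs R as faithfully as possible, because the difference is as small as possible wherever a value exists, and the missing values computed this way follow the same trend.
+* Collaborative filtering has two main problems: sparsity and cold start. Existing solutions usually address them by bringing in additional data sources or side information. Side information for a user can be basic personal details or a user profile; side information for an item can be the item's content.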
+
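+## Evaluating Results
+
+1. Recall and precision [manual statistical analysis]
+2. F-measure (P-R curve) [emphasis: imbalanced problems]
+3. ROC and AUC [emphasis: comparing different results]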
+
 * * *

 * **作者：[片刻](http://www.apache.wiki/display/~jiangzhonglian)**
@@ -27,3 +83,4 @@

 * [推荐系统中常用算法 以及优点缺点对比](http://www.36dsj.com/archives/9519)
 * [推荐算法的基于知识推荐](https://zhidao.baidu.com/question/2013524494179442228.html)
+* [推荐系统中基于深度学习的混合协同过滤模型](http://www.iteye.com/news/32100)
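
The matrix-factorization bullets added to the chapter above can be sketched in a few lines of NumPy. This is a minimal illustration and not code from the repository: the function name `factorize`, the rank `k`, the learning rate, the regularization weight and the toy rating matrix are all assumptions. It fits `R ≈ U V` with stochastic gradient descent over the observed (non-zero) entries only, then reads estimates for the missing entries off the product.

```python
import numpy as np


def factorize(R, k=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Fit R ~= U @ V using only the observed (non-zero) entries of R."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, k))   # one row per user
    V = rng.normal(scale=0.1, size=(k, n_items))   # one column per item
    observed = list(zip(*R.nonzero()))             # (user, item) pairs that have a rating
    for _ in range(steps):
        for i, j in observed:
            err = R[i, j] - U[i, :] @ V[:, j]      # error on one observed rating
            u_old = U[i, :].copy()
            # gradient step on the squared error with L2 regularization
            U[i, :] += lr * (err * V[:, j] - reg * U[i, :])
            V[:, j] += lr * (err * u_old - reg * V[:, j])
    return U, V


# Toy rating matrix (hypothetical data); 0 marks a missing rating.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
U, V = factorize(R)
R_hat = U @ V                                      # dense matrix of predicted ratings
mask = R > 0
train_rmse = np.sqrt(np.mean((R[mask] - R_hat[mask]) ** 2))
```

`R_hat[i, j]` at a position where `R[i, j] == 0` is the model's estimate for a missing rating. The regularization term is one common choice for keeping `U` and `V` small; the chapter text itself does not specify one.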
--- a/docs/6.支持向量机.md
+++ b/docs/6.支持向量机.md
@@ -15,7 +15,7 @@
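 * Note: `The geometric meaning of SVM is fairly intuitive, but the algorithm is complex to implement and involves deriving many mathematical formulas.`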

 ```
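-Pros: low generalization error, modest computational cost, easy-to-interpret results.
+Pros: low generalization error (generalization means extending from the specific to the general, i.e. how the trained model performs on new samples), modest computational cost, easy-to-interpret results.
 Cons: sensitive to parameter tuning and to the choice of kernel function; without modification the basic classifier only handles binary classification.
 Applicable data types: numeric and nominal data.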
 ```
@@ -47,13 +47,13 @@ This is the simplest kind of SVM (Called an LSVM) Support Vectors are those data

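 1. It is intuitively the safest choice.
 2. If we make a small error in the location of the boundary (it is jolted in its perpendicular direction), this gives us the least chance of a misclassification.
-3. CV is easy, because the model is immune to the removal of any non-support-vector data point.
+3. CV (cross-validation, here leave-one-out; not computer vision) is easy, because the model is immune to the removal of any non-support-vector data point.
 4. There is some theory suggesting that this is a good thing.
 5. Empirically it works very well.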
 ```

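 * Separator D works much better than B or C, for the five reasons above.
-* Think of every point as a land mine: we (the hyperplane) must find the nearest mines and stay as far from them as possible.
+* If we think of every point as a land mine, then we (the hyperplane) must find the nearest mines and stay as far from them as possible.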
 ![线性可分](/images/6.SVM/SVM_3_linearly-separable.jpg)

 ### How to Find the Maximum Margin
@@ -70,7 +70,7 @@ This is the simplest kind of SVM (Called an LSVM) Support Vectors are those data
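 * Class labels are -1 and 1 so that \\(label*(w^Tx+b)\\) can serve both as a correctness indicator and in the distance calculation: \\(label*(w^Tx+b)>0\\) means the prediction is correct; otherwise it is wrong.
 * The goal is now clear: find `w` and `b`. To do that we must find the data points with the smallest margin, i.e. the `support vectors` mentioned earlier.
    * In other words, maximize the minimum distance. (Minimum distance: that of the data points with the smallest margin. Maximize: make that margin as large as possible, which gives the optimal hyperplane and, ultimately, the support vectors.)
-    * How should we read this? For example: think of every point as a land mine; we (the hyperplane) must find the nearest mines and stay as far from them as possible.
+    * How should we read this? For example: if we think of every point as a land mine, then we (the hyperplane) must find the nearest mines and stay as far from them as possible.
    * Objective function: \\(arg: max\ \{min\ [label*(w^Tx+b)/||w||]\}\\)
    * 1. \\(label*(w^Tx+b)\\) is called the `functional margin`; it is positive when the prediction is correct. Dividing by \\(||w||\\) normalizes it and gives the `geometric margin`. Because `w` and `b` can be rescaled together, we can always arrange for \\(label*(w^Tx+b)>=1\\).
    * 2. So we set \\(label*(w^Tx+b)=1\\) for the closest points, and the problem becomes \\(arg: max\{w, b\}\ (1/||w||)\\), subject to the constraint \\(label*(w^Tx+b)>=1\\) for every point.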
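
The functional and geometric margins described in the hunk above can be checked numerically. The snippet below is only a sketch: the helper name `margins`, the three sample points and the hyperplane `w`, `b` are assumptions chosen for illustration, not values from the chapter.

```python
import numpy as np


def margins(X, y, w, b):
    """Return (functional, geometric) margins of each sample for the hyperplane w.x + b = 0."""
    functional = y * (X @ w + b)                   # label * (w^T x + b)
    geometric = functional / np.linalg.norm(w)     # normalized by ||w||
    return functional, geometric


# Hypothetical separable data: two positive points and one negative point.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
w = np.array([0.5, 0.5])
b = -2.0
functional, geometric = margins(X, y, w, b)
```

With this choice the points whose functional margin equals 1 are the support vectors, and the smallest geometric margin is `1 / ||w||`, which is the quantity the objective maximizes.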
--- a/docs/7.1.利用AdaBoost元算法提高分类.md
+++ b/docs/7.1.利用AdaBoost元算法提高分类.md
@@ -1,4 +1,4 @@
-# Chapter 7: Improving Classification with the AdaBoost Meta-Algorithm
+# Chapter 7.1: Improving Classification with the AdaBoost Meta-Algorithm

 ![利用AdaBoost元算法提高分类](/images/7.AdaBoost/adaboost_headPage.jpg "利用AdaBoost元算法提高分类")

@@ -9,50 +9,54 @@
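 * Concept: a way of combining other algorithms.
 * In plain terms: when making an important decision, most people consult several experts rather than just one.
    Why should machine learning be any different? That is the idea behind meta-algorithms.
-* Ensemble methods:  1. Voting   2. Re-learning
+* Ensemble methods:  1. Voting (bagging)   2. Re-learning (boosting)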

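-> bagging: a method for building classifiers (the original line misspelled 分类器 as 分类起) based on random resampling of the data
+> bagging: a method for building classifiers based on random resampling of the data

 * Bootstrap aggregating, also called bagging, is a technique that produces S new datasets by sampling from the original dataset S times.
    1. Each new dataset is the same size as the original.
    2. Each dataset is built by sampling examples from the original dataset at random with replacement (with replacement means the same example can be chosen more than once, so a new dataset may contain duplicates).
    3. Applying the learning algorithm to each of the S datasets yields S classifiers; the class that receives the most votes among them is taken as the final classification.
    4. Example: random forest.
-* Pursuing a girl: when choosing a partner, a woman asks several close friends for advice and picks the candidate with the highest combined score as her boyfriend
+* Picking a guy: when choosing a partner, a woman asks several close friends for advice and picks the candidate with the highest combined score as her boyfriend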

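-> boosting
+> boosting: a method based on the weighted sum of all classifiers

 * Boosting is a technique very similar to bagging.
 * However, a boosting prediction is the weighted sum of all the classifiers' outputs. In both boosting and bagging, the multiple classifiers used are all of the same type.
-* What is the difference?
-    1. boosting: the classifiers are obtained by sequential training; each new classifier is trained according to the performance of the classifiers already trained.
-    2. boosting: new classifiers are obtained by concentrating on the data that the existing classifiers misclassified.
-    3. Because a boosting prediction is the weighted sum of all the classifiers' outputs, boosting differs from bagging.
-    4. In bagging the classifier weights are equal, whereas in boosting they are not: each weight reflects how successful the corresponding classifier was in the previous iteration.
 * The most popular version of boosting today is AdaBoost.
 * Pursuing a girl: suitor 1 fails -> (passes on what he learned: her name, family background) suitor 2 fails -> (passes on what he learned: hobbies, personality) suitor 3 succeeds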

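-## Applying the AdaBoost Algorithm
+> What is the difference between bagging and boosting?
+
+1. boosting: the classifiers are obtained by sequential training; each new classifier is trained according to the performance of the classifiers already trained. In bagging, the classifiers are trained independently on separately resampled datasets.
+2. boosting: new classifiers are obtained by concentrating on the data that the existing classifiers misclassified.
+3. Because a boosting prediction is the weighted sum of all the classifiers' outputs, boosting differs from bagging.
+4. In bagging the classifier weights are equal, whereas in boosting they are not: each weight reflects how successful the corresponding classifier was in the previous iteration.
+
+## Applying the AdaBoost Algorithm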

 > AdaBoost (adaptive boosting)

 ```
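 Can a strong classifier be built from a weak classifier and multiple instances of it? This is a very interesting theoretical question.

-* Pros: low generalization error, easy to code, works with most classifiers, no parameters to tune.
+* Pros: low generalization error (generalization means extending from the specific to the general, i.e. how the trained model performs on new samples), easy to code, works with most classifiers, no parameters to tune.
 * Cons: sensitive to outliers.
 * Applicable data types: numeric and nominal data.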
 ```

-> General approach to AdaBoost
+> General approach to AdaBoost

 ```
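 Collect data: any method.
-Prepare data: depends on the type of weak classifier used. This chapter uses decision stumps, which can handle any data type. Any classifier can serve as the weak classifier, including any of those from chapters 2 through 6. Simple classifiers work better as weak classifiers.
+Prepare data: depends on the type of weak classifier used. This chapter uses decision stumps, which can handle any data type.
+    Any classifier can serve as the weak classifier, including any of those from chapters 2 through 6.
+    Simple classifiers work better as weak classifiers.
 Analyze data: any method.
-Train: AdaBoost spends most of its time on training; the classifier trains weak classifiers on the same dataset many times.
+Train: AdaBoost spends most of its time on training; the classifier trains weak classifiers on the same dataset many times.
 Test: compute the classification error rate.
-Use: like SVM, AdaBoost predicts one of two classes. To apply it to multi-class problems, AdaBoost has to be modified in the same way as multi-class SVM.
+Use: like SVM, AdaBoost predicts one of two classes. To apply it to multi-class problems, AdaBoost has to be modified in the same way as multi-class SVM.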
 ```

 * Training: improving the classifier's performance by focusing on errors
@@ -70,15 +74,15 @@

 ```
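 Observations:
-alpha is mainly used to compute the weight of each classifier instance (the weighted combination is the classification result)
-  Classification weight: the maximum value is the sum of the alphas, the minimum is the negative of the maximum
-D is used to compute the error probability: weightedError = D.T*errArr, which selects the best classifier
-  Sample weight: the less likely a sample is to be misclassified, the smaller its weight in D
+alpha is mainly used to compute the weight of each classifier instance (the weighted combination is the classification result)
+  Classification weight: maximum = the sum of the alphas, minimum = -maximum
+D is used to compute the error probability: weightedError = D.T*errArr, which selects the best classifier
+  Sample weight: the less likely a sample is to be misclassified, the smaller its weight in D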
 ```

 ![AdaBoost算法权重计算公式](/images/7.AdaBoost/adaboost_alpha.png "AdaBoost算法权重计算公式")

-## Implementing the Full AdaBoost Algorithm
+## Implementing the Full AdaBoost Algorithm

 The pseudocode for the full implementation is:

@@ -92,6 +96,8 @@ D的目的是为了计算错误概率： weightedError = D.T*errArr，求最佳
    If the error rate equals 0.0, exit the loop
 ```

+![AdaBoost代码流程图](/images/7.AdaBoost/adaboost_code-flow-chart.jpg "AdaBoost代码流程图")
+
 ## Handling Imbalanced Classification Problems

 > Concept
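
The notes on `alpha` and `D` in the AdaBoost hunks above can be made concrete with one boosting round. This is a hedged sketch that assumes the standard AdaBoost update rules, `alpha = 0.5 * ln((1 - error) / error)` and `D_i <- D_i * exp(-alpha * y_i * h_i) / Z`; the function name `adaboost_round` and the five-sample toy labels are hypothetical.

```python
import numpy as np


def adaboost_round(D, y_true, y_pred):
    """One AdaBoost round: weighted error, classifier weight alpha, updated sample weights."""
    err_mask = (y_true != y_pred).astype(float)
    weighted_error = float(D @ err_mask)           # same idea as weightedError = D.T * errArr
    # max(..., 1e-16) guards against division by zero when the stump is perfect
    alpha = 0.5 * np.log((1.0 - weighted_error) / max(weighted_error, 1e-16))
    # correct samples (y * h = 1) shrink, misclassified samples (y * h = -1) grow
    D_new = D * np.exp(-alpha * y_true * y_pred)
    D_new /= D_new.sum()                           # renormalize so the weights sum to 1
    return alpha, weighted_error, D_new


y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, 1, -1, -1, -1])              # the weak classifier misses the last sample
D = np.full(5, 0.2)                                # start from uniform sample weights
alpha, weighted_error, D_new = adaboost_round(D, y_true, y_pred)
```

After the round, the misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.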
--- a/docs/7.2.随机森林的使用.md
+++ b/docs/7.2.随机森林的使用.md
@@ -1,4 +1,4 @@
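-# Chapter 7: Using Random Forests (personal addition, not from the textbook)
+# Chapter 7.2: Using Random Forests (personal addition, not from the textbook)

 ## Introduction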

--- a/images/7.AdaBoost/adaboost_code-flow-chart.jpg
+++ b/images/7.AdaBoost/adaboost_code-flow-chart.jpg
--- a/input/7.AdaBoost/horseColicTest2.libsvm
+++ b/input/7.AdaBoost/horseColicTest2.libsvm
@@ -0,0 +1,67 @@
+1 0:2 1:1 2:38.5 3:54 4:20 5:0 6:1 7:2 8:2 9:3 10:4 11:1 12:2 13:2 14:5.9 15:0 16:2 17:42 18:6.3 19:0 20:0
+1 0:2 1:1 2:37.6 3:48 4:36 5:0 6:0 7:1 8:1 9:0 10:3 11:0 12:0 13:0 14:0 15:0 16:0 17:44 18:6.3 19:1 20:5
+1 0:1 1:1 2:37.7 3:44 4:28 5:0 6:4 7:3 8:2 9:5 10:4 11:4 12:1 13:1 14:0 15:3 16:5 17:45 18:70 19:3 20:2
+-1 0:1 1:1 2:37 3:56 4:24 5:3 6:1 7:4 8:2 9:4 10:4 11:3 12:1 13:1 14:0 15:0 16:0 17:35 18:61 19:3 20:2
+1 0:2 1:1 2:38 3:42 4:12 5:3 6:0 7:3 8:1 9:1 10:0 11:1 12:0 13:0 14:0 15:0 16:2 17:37 18:5.8 19:0 20:0
+1 0:1 1:1 2:0 3:60 4:40 5:3 6:0 7:1 8:1 9:0 10:4 11:0 12:3 13:2 14:0 15:0 16:5 17:42 18:72 19:0 20:0
+1 0:2 1:1 2:38.4 3:80 4:60 5:3 6:2 7:2 8:1 9:3 10:2 11:1 12:2 13:2 14:0 15:1 16:1 17:54 18:6.9 19:0 20:0
+1 0:2 1:1 2:37.8 3:48 4:12 5:2 6:1 7:2 8:1 9:3 10:0 11:1 12:2 13:0 14:0 15:2 16:0 17:48 18:7.3 19:1 20:0
+1 0:2 1:1 2:37.9 3:45 4:36 5:3 6:3 7:3 8:2 9:2 10:3 11:1 12:2 13:1 14:0 15:3 16:0 17:33 18:5.7 19:3 20:0
+-1 0:2 1:1 2:39 3:84 4:12 5:3 6:1 7:5 8:1 9:2 10:4 11:2 12:1 13:2 14:7 15:0 16:4 17:62 18:5.9 19:2 20:2.2
+1 0:2 1:1 2:38.2 3:60 4:24 5:3 6:1 7:3 8:2 9:3 10:3 11:2 12:3 13:3 14:0 15:4 16:4 17:53 18:7.5 19:2 20:1.4
+-1 0:1 1:1 2:0 3:140 4:0 5:0 6:0 7:4 8:2 9:5 10:4 11:4 12:1 13:1 14:0 15:0 16:5 17:30 18:69 19:0 20:0
+-1 0:1 1:1 2:37.9 3:120 4:60 5:3 6:3 7:3 8:1 9:5 10:4 11:4 12:2 13:2 14:7.5 15:4 16:5 17:52 18:6.6 19:3 20:1.8
+1 0:2 1:1 2:38 3:72 4:36 5:1 6:1 7:3 8:1 9:3 10:0 11:2 12:2 13:1 14:0 15:3 16:5 17:38 18:6.8 19:2 20:2
+1 0:2 1:9 2:38 3:92 4:28 5:1 6:1 7:2 8:1 9:1 10:3 11:2 12:3 13:0 14:7.2 15:0 16:0 17:37 18:6.1 19:1 20:1.1
+1 0:1 1:1 2:38.3 3:66 4:30 5:2 6:3 7:1 8:1 9:2 10:4 11:3 12:3 13:2 14:8.5 15:4 16:5 17:37 18:6 19:0 20:0
+1 0:2 1:1 2:37.5 3:48 4:24 5:3 6:1 7:1 8:1 9:2 10:1 11:0 12:1 13:1 14:0 15:3 16:2 17:43 18:6 19:1 20:2.8
+-1 0:1 1:1 2:37.5 3:88 4:20 5:2 6:3 7:3 8:1 9:4 10:3 11:3 12:0 13:0 14:0 15:0 16:0 17:35 18:6.4 19:1 20:0
+-1 0:2 1:9 2:0 3:150 4:60 5:4 6:4 7:4 8:2 9:5 10:4 11:4 12:0 13:0 14:0 15:0 16:0 17:0 18:0 19:0 20:0
+-1 0:1 1:1 2:39.7 3:100 4:30 5:0 6:0 7:6 8:2 9:4 10:4 11:3 12:1 13:0 14:0 15:4 16:5 17:65 18:75 19:0 20:0
+1 0:1 1:1 2:38.3 3:80 4:0 5:3 6:3 7:4 8:2 9:5 10:4 11:3 12:2 13:1 14:0 15:4 16:4 17:45 18:7.5 19:2 20:4.6
+1 0:2 1:1 2:37.5 3:40 4:32 5:3 6:1 7:3 8:1 9:3 10:2 11:3 12:2 13:1 14:0 15:0 16:5 17:32 18:6.4 19:1 20:1.1
+-1 0:1 1:1 2:38.4 3:84 4:30 5:3 6:1 7:5 8:2 9:4 10:3 11:3 12:2 13:3 14:6.5 15:4 16:4 17:47 18:7.5 19:3 20:0
+-1 0:1 1:1 2:38.1 3:84 4:44 5:4 6:0 7:4 8:2 9:5 10:3 11:1 12:1 13:3 14:5 15:0 16:4 17:60 18:6.8 19:0 20:5.7
+1 0:2 1:1 2:38.7 3:52 4:0 5:1 6:1 7:1 8:1 9:1 10:3 11:1 12:0 13:0 14:0 15:1 16:3 17:4 18:74 19:0 20:0
+1 0:2 1:1 2:38.1 3:44 4:40 5:2 6:1 7:3 8:1 9:3 10:3 11:1 12:0 13:0 14:0 15:1 16:3 17:35 18:6.8 19:0 20:0
+1 0:2 1:1 2:38.4 3:52 4:20 5:2 6:1 7:3 8:1 9:1 10:3 11:2 12:2 13:1 14:0 15:3 16:5 17:41 18:63 19:1 20:1
+1 0:1 1:1 2:38.2 3:60 4:0 5:1 6:0 7:3 8:1 9:2 10:1 11:1 12:1 13:1 14:0 15:4 16:4 17:43 18:6.2 19:2 20:3.9
+1 0:2 1:1 2:37.7 3:40 4:18 5:1 6:1 7:1 8:0 9:3 10:2 11:1 12:1 13:1 14:0 15:3 16:3 17:36 18:3.5 19:0 20:0
+1 0:1 1:1 2:39.1 3:60 4:10 5:0 6:1 7:1 8:0 9:2 10:3 11:0 12:0 13:0 14:0 15:4 16:4 17:0 18:0 19:0 20:0
+1 0:2 1:1 2:37.8 3:48 4:16 5:1 6:1 7:1 8:1 9:0 10:1 11:1 12:2 13:1 14:0 15:4 16:3 17:43 18:7.5 19:0 20:0
+1 0:1 1:1 2:39 3:120 4:0 5:4 6:3 7:5 8:2 9:2 10:4 11:3 12:2 13:3 14:8 15:0 16:0 17:65 18:8.199999999999999 19:3 20:4.6
+1 0:1 1:1 2:38.2 3:76 4:0 5:2 6:3 7:2 8:1 9:5 10:3 11:3 12:1 13:2 14:6 15:1 16:5 17:35 18:6.5 19:2 20:0.9
+-1 0:2 1:1 2:38.3 3:88 4:0 5:0 6:0 7:6 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0 16:0 17:0 18:0 19:0 20:0
+1 0:1 1:1 2:38 3:80 4:30 5:3 6:3 7:3 8:1 9:0 10:0 11:0 12:0 13:0 14:6 15:0 16:0 17:48 18:8.300000000000001 19:0 20:4.3
+-1 0:1 1:1 2:0 3:0 4:0 5:3 6:1 7:1 8:1 9:2 10:3 11:3 12:1 13:3 14:6 15:4 16:4 17:0 18:0 19:2 20:0
+1 0:1 1:1 2:37.6 3:40 4:0 5:1 6:1 7:1 8:1 9:1 10:1 11:1 12:0 13:0 14:0 15:1 16:1 17:0 18:0 19:2 20:2.1
+1 0:2 1:1 2:37.5 3:44 4:0 5:1 6:1 7:1 8:1 9:3 10:3 11:2 12:0 13:0 14:0 15:0 16:0 17:45 18:5.8 19:2 20:1.4
+1 0:2 1:1 2:38.2 3:42 4:16 5:1 6:1 7:3 8:1 9:1 10:3 11:1 12:0 13:0 14:0 15:1 16:0 17:35 18:60 19:1 20:1
+1 0:2 1:1 2:38 3:56 4:44 5:3 6:3 7:3 8:0 9:0 10:1 11:1 12:2 13:1 14:0 15:4 16:0 17:47 18:70 19:2 20:1
+1 0:2 1:1 2:38.3 3:45 4:20 5:3 6:3 7:2 8:2 9:2 10:4 11:1 12:2 13:0 14:0 15:4 16:0 17:0 18:0 19:0 20:0
+1 0:1 1:1 2:0 3:48 4:96 5:1 6:1 7:3 8:1 9:0 10:4 11:1 12:2 13:1 14:0 15:1 16:4 17:42 18:8 19:1 20:0
+1 0:1 1:1 2:37.7 3:55 4:28 5:2 6:1 7:2 8:1 9:2 10:3 11:3 12:0 13:3 14:5 15:4 16:5 17:0 18:0 19:0 20:0
+-1 0:2 1:1 2:36 3:100 4:20 5:4 6:3 7:6 8:2 9:2 10:4 11:3 12:1 13:1 14:0 15:4 16:5 17:74 18:5.7 19:2 20:2.5
+1 0:1 1:1 2:37.1 3:60 4:20 5:2 6:0 7:4 8:1 9:3 10:0 11:3 12:0 13:2 14:5 15:3 16:4 17:64 18:8.5 19:2 20:0
+1 0:2 1:1 2:37.1 3:114 4:40 5:3 6:0 7:3 8:2 9:2 10:2 11:1 12:0 13:0 14:0 15:0 16:3 17:32 18:0 19:3 20:6.5
+1 0:1 1:1 2:38.1 3:72 4:30 5:3 6:3 7:3 8:1 9:4 10:4 11:3 12:2 13:1 14:0 15:3 16:5 17:37 18:56 19:3 20:1
+1 0:1 1:1 2:37 3:44 4:12 5:3 6:1 7:1 8:2 9:1 10:1 11:1 12:0 13:0 14:0 15:4 16:2 17:40 18:6.7 19:3 20:8
+1 0:1 1:1 2:38.6 3:48 4:20 5:3 6:1 7:1 8:1 9:4 10:3 11:1 12:0 13:0 14:0 15:3 16:0 17:37 18:75 19:0 20:0
+-1 0:1 1:1 2:0 3:82 4:72 5:3 6:1 7:4 8:1 9:2 10:3 11:3 12:0 13:3 14:0 15:4 16:4 17:53 18:65 19:3 20:2
+-1 0:1 1:9 2:38.2 3:78 4:60 5:4 6:4 7:6 8:0 9:3 10:3 11:3 12:0 13:0 14:0 15:1 16:0 17:59 18:5.8 19:3 20:3.1
+-1 0:2 1:1 2:37.8 3:60 4:16 5:1 6:1 7:3 8:1 9:2 10:3 11:2 12:1 13:2 14:0 15:3 16:0 17:41 18:73 19:0 20:0
+-1 0:1 1:1 2:38.7 3:34 4:30 5:2 6:0 7:3 8:1 9:2 10:3 11:0 12:0 13:0 14:0 15:0 16:0 17:33 18:69 19:0 20:2
+1 0:1 1:1 2:0 3:36 4:12 5:1 6:1 7:1 8:1 9:1 10:2 11:1 12:1 13:1 14:0 15:1 16:5 17:44 18:0 19:0 20:0
+1 0:2 1:1 2:38.3 3:44 4:60 5:0 6:0 7:1 8:1 9:0 10:0 11:0 12:0 13:0 14:0 15:0 16:0 17:6.4 18:36 19:0 20:0
+1 0:2 1:1 2:37.4 3:54 4:18 5:3 6:0 7:1 8:1 9:3 10:4 11:3 12:2 13:2 14:0 15:4 16:5 17:30 18:7.1 19:2 20:0
+1 0:1 1:1 2:0 3:0 4:0 5:4 6:3 7:0 8:2 9:2 10:4 11:1 12:0 13:0 14:0 15:0 16:0 17:54 18:76 19:3 20:2
+-1 0:1 1:1 2:36.6 3:48 4:16 5:3 6:1 7:3 8:1 9:4 10:1 11:1 12:1 13:1 14:0 15:0 16:0 17:27 18:56 19:0 20:0
+1 0:1 1:1 2:38.5 3:90 4:0 5:1 6:1 7:3 8:1 9:3 10:3 11:3 12:2 13:3 14:2 15:4 16:5 17:47 18:79 19:0 20:0
+1 0:1 1:1 2:0 3:75 4:12 5:1 6:1 7:4 8:1 9:5 10:3 11:3 12:0 13:3 14:5.8 15:0 16:0 17:58 18:8.5 19:1 20:0
+1 0:2 1:1 2:38.2 3:42 4:0 5:3 6:1 7:1 8:1 9:1 10:1 11:2 12:2 13:1 14:0 15:3 16:2 17:35 18:5.9 19:2 20:0
+-1 0:1 1:9 2:38.2 3:78 4:60 5:4 6:4 7:6 8:0 9:3 10:3 11:3 12:0 13:0 14:0 15:1 16:0 17:59 18:5.8 19:3 20:3.1
+1 0:2 1:1 2:38.6 3:60 4:30 5:1 6:1 7:3 8:1 9:4 10:2 11:2 12:1 13:1 14:0 15:0 16:0 17:40 18:6 19:1 20:0
+1 0:2 1:1 2:37.8 3:42 4:40 5:1 6:1 7:1 8:1 9:1 10:3 11:1 12:0 13:0 14:0 15:3 16:3 17:36 18:6.2 19:0 20:0
+-1 0:1 1:1 2:38 3:60 4:12 5:1 6:1 7:2 8:1 9:2 10:1 11:1 12:1 13:1 14:0 15:1 16:4 17:44 18:65 19:3 20:2
+1 0:2 1:1 2:38 3:42 4:12 5:3 6:0 7:3 8:1 9:1 10:1 11:1 12:0 13:0 14:0 15:0 16:1 17:37 18:5.8 19:0 20:0
+-1 0:2 1:1 2:37.6 3:88 4:36 5:3 6:1 7:1 8:1 9:3 10:3 11:2 12:1 13:3 14:1.5 15:0 16:0 17:44 18:6 19:0 20:0
--- a/src/python/16.RecommenderSystems/RS-itemcf.py
+++ b/src/python/16.RecommenderSystems/RS-itemcf.py
@@ -65,7 +65,8 @@ class ItemBasedCF():

        for line in self.loadfile(filename):
             # user ID, movie ID, rating, timestamp
-            user, movie, rating, _ = line.split('::')
+            # user, movie, rating, _ = line.split('::')
+            user, movie, rating, _ = line.split('\t')
             # compare a random number with pivot to decide whether the record goes to the training or the test set
            if (random.random() < pivot):

@@ -89,6 +90,7 @@ class ItemBasedCF():

        print >> sys.stderr, 'counting movies number and popularity...'

+        # count, across all users, how many times each movie appears
        for user, movies in self.trainset.iteritems():
            for movie in movies:
                # count item popularity
@@ -175,6 +177,8 @@ class ItemBasedCF():
        # varables for popularity
        popular_sum = 0

+        # enumerate yields (index, value) pairs, so the index and the user are available together
+        # reference: http://blog.csdn.net/churximi/article/details/51648388
        for i, user in enumerate(self.trainset):
            if i > 0 and i % 500 == 0:
                print >> sys.stderr, 'recommended for %d users' % i
@@ -200,7 +204,8 @@ class ItemBasedCF():


 if __name__ == '__main__':
-    ratingfile = 'input/16.RecommenderSystems/ml-1m/ratings.dat'
+    # ratingfile = 'input/16.RecommenderSystems/ml-1m/ratings.dat'
+    ratingfile = 'input/16.RecommenderSystems/ml-100k/u.data'

    # 创建ItemCF对象
    itemcf = ItemBasedCF()
@@ -209,4 +214,8 @@ if __name__ == '__main__':
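     # compute the similarity between movies (items)
     itemcf.calc_movie_sim()
     # evaluate the recommendations
-    itemcf.evaluate()
+    # itemcf.evaluate()
+    # inspect the recommendations for a single user
+    user = "2"
+    print "recommendations:", itemcf.recommend(user)
+    print "---", itemcf.testset.get(user, {})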
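
The hunk above switches the separator from `'::'` (used by `ml-1m/ratings.dat`) to a tab (used by `ml-100k/u.data`) by commenting one line out. A small helper could accept both layouts instead. This is only a sketch: the name `parse_rating` is hypothetical and the two sample records are illustrative.

```python
def parse_rating(line):
    """Split one MovieLens rating record into (user, movie, rating, timestamp)."""
    # ml-1m separates fields with '::', ml-100k with a tab
    sep = '::' if '::' in line else '\t'
    user, movie, rating, timestamp = line.strip().split(sep)
    return user, movie, float(rating), int(timestamp)


record_1m = parse_rating('1::1193::5::978300760\n')
record_100k = parse_rating('196\t242\t3\t881250949\n')
```

Keeping the user and movie IDs as strings matches how the script looks users up (for example `user = "2"`).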
--- a/src/python/16.RecommenderSystems/RS-sklearn-rating.py
+++ b/src/python/16.RecommenderSystems/RS-sklearn-rating.py
@@ -0,0 +1,175 @@
+#!/usr/bin/python
+# coding:utf8
+
+import sys
+import math
+from operator import itemgetter
+
+import numpy as np
+import pandas as pd
+from scipy.sparse.linalg import svds
+from sklearn import cross_validation as cv
+from sklearn.metrics import mean_squared_error
+from sklearn.metrics.pairwise import pairwise_distances
+
+
+def splitData(dataFile, test_size):
+    # 加载数据集
+    header = ['user_id', 'item_id', 'rating', 'timestamp']
+    df = pd.read_csv(dataFile, sep='\t', names=header)
+
+    n_users = df.user_id.unique().shape[0]
+    n_items = df.item_id.unique().shape[0]
+
+    print 'Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_items)
+    train_data, test_data = cv.train_test_split(df, test_size=test_size)
+    print "数据量：", len(train_data), len(test_data)
+    return df, n_users, n_items, train_data, test_data
+
+
+def calc_similarity(n_users, n_items, train_data, test_data):
+    # build user-item matrices, one for the training data and one for the test data:
+    train_data_matrix = np.zeros((n_users, n_items))
+    for line in train_data.itertuples():
+        train_data_matrix[line[1]-1, line[2]-1] = line[3]
+    test_data_matrix = np.zeros((n_users, n_items))
+    for line in test_data.itertuples():
+        test_data_matrix[line[1]-1, line[2]-1] = line[3]
+
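+    # use sklearn's pairwise_distances with metric="cosine"; note it returns cosine distance (1 - cosine similarity), not similarity
+    print "1:", np.shape(train_data_matrix)    # rows: users, columns: movies
+    print "2:", np.shape(train_data_matrix.T)  # rows: movies, columns: users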
+
+    user_similarity = pairwise_distances(train_data_matrix, metric="cosine")
+    item_similarity = pairwise_distances(train_data_matrix.T, metric="cosine")
+
+    print >> sys.stderr, '开始统计流行item的数量...'
+    item_popular = {}
+    # count, across all users, how many times each movie was rated
+    for i_index in range(n_items):
+        if np.sum(train_data_matrix[:, i_index]) != 0:
+            item_popular[i_index] = np.sum(train_data_matrix[:, i_index]!=0)
+            # print "pop=", i_index, self.item_popular[i_index]
+
+    # save the total number of items
+    item_count = len(item_popular)
+    print >> sys.stderr, '总共流行item数量 = %d' % item_count
+
+    return train_data_matrix, test_data_matrix, user_similarity, item_similarity, item_popular
+
+
+def predict(rating, similarity, type='user'):
+    print type
+    print "rating=", np.shape(rating)
+    print "similarity=", np.shape(similarity)
+    if type == 'user':
+        # mean rating of each user across all movies, unrated zeros included (axis=0 works down columns, axis=1 across rows)
+        # print "rating=", np.shape(rating)
+        mean_user_rating = rating.mean(axis=1)
+        # np.newaxis参考地址: http://blog.csdn.net/xtingjie/article/details/72510834
+        # print "mean_user_rating=", np.shape(mean_user_rating)
+        # print "mean_user_rating.newaxis=", np.shape(mean_user_rating[:, np.newaxis])
+        rating_diff = (rating - mean_user_rating[:, np.newaxis])
+        # print "rating=", rating[:3, :3]
+        # print "mean_user_rating[:, np.newaxis]=", mean_user_rating[:, np.newaxis][:3, :3]
+        # print "rating_diff=", rating_diff[:3, :3]
+
+        # user mean + user-user distance (943, 943) * user-movie rating diff (943, 1682) = user-movie score (943, 1682), divided by each user's total distance to all other users
+        pred = mean_user_rating[:, np.newaxis] + similarity.dot(rating_diff)/np.array([np.abs(similarity).sum(axis=1)]).T
+    elif type == 'item':
+        # combined score: user-movie ratings (943, 1682) * movie-movie distance (1682, 1682) = user-movie score (943, 1682), divided by each movie's total distance to all other movies
+        pred = rating.dot(similarity)/np.array([np.abs(similarity).sum(axis=1)])
+    return pred
+
+
+def rmse(prediction, ground_truth):
+    prediction = prediction[ground_truth.nonzero()].flatten()
+    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
+    return math.sqrt(mean_squared_error(prediction, ground_truth))
+
+
+def evaluate(prediction, item_popular, name):
+    hit = 0
+    rec_count = 0
+    test_count = 0
+    popular_sum = 0
+    all_rec_items = set()
+    for u_index in range(n_users):
+        items = np.where(train_data_matrix[u_index, :] == 0)[0]
+        pre_items = sorted(dict(zip(items, prediction[u_index, items])).items(), key=itemgetter(1), reverse=True)[: 20]
+        test_items = np.where(test_data_matrix[u_index, :] != 0)[0]
+
+        # compare the recommended items with the user's test-set items
+        for item, w in pre_items:
+            if item in test_items:
+                hit += 1
+            all_rec_items.add(item)
+
+            # accumulate log(1 + popularity) of each recommended movie
+            if item in item_popular:
+                popular_sum += math.log(1 + item_popular[item])
+
+        rec_count += len(pre_items)
+        test_count += len(test_items)
+
+    precision = hit / (1.0 * rec_count)
+    recall = hit / (1.0 * test_count)
+    coverage = len(all_rec_items) / (1.0 * len(item_popular))
+    popularity = popular_sum / (1.0 * rec_count)
+    print >> sys.stderr, '%s: precision=%.4f \t recall=%.4f \t coverage=%.4f \t popularity=%.4f' % (name, precision, recall, coverage, popularity)
+
+
+def recommend(u_index, prediction):
+    items = np.where(train_data_matrix[u_index, :] == 0)[0]
+    pre_items = sorted(dict(zip(items, prediction[u_index, items])).items(), key=itemgetter(1), reverse=True)[: 10]
+    test_items = np.where(test_data_matrix[u_index, :] != 0)[0]
+
+    print '原始结果：', test_items
+    print '推荐结果：', [key for key, value in pre_items]
+
+
+if __name__ == "__main__":
+    
+    # memory-based collaborative filtering
+    # ...
+    # split the dataset
+    # http://files.grouplens.org/datasets/movielens/ml-100k.zip
+    dataFile = 'input/16.RecommenderSystems/ml-100k/u.data'
+    df, n_users, n_items, train_data, test_data = splitData(dataFile, test_size=0.25)
+
+    # compute the similarity matrices
+    train_data_matrix, test_data_matrix, user_similarity, item_similarity, item_popular = calc_similarity(n_users, n_items, train_data, test_data)
+
+    item_prediction = predict(train_data_matrix, item_similarity, type='item')
+    user_prediction = predict(train_data_matrix, user_similarity, type='user')
+
+    # evaluation: root mean squared error (RMSE)
+    print 'Item based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix))
+    print 'User based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix))
+
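+    # model-based collaborative filtering
+    # ...
+    # sparsity of the MovieLens dataset (n_users and n_items are fixed, so fewer ratings mean less information; the sparser the matrix, the more room there is for optimization)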
+    sparsity = round(1.0 - len(df)/float(n_users*n_items), 3)
+    print 'The sparsity level of MovieLen100K is ' + str(sparsity * 100) + '%'
+
+    # compute the k largest singular values/vectors of the matrix
+    u, s, vt = svds(train_data_matrix, k=15)
+    s_diag_matrix = np.diag(s)
+    svd_prediction = np.dot(np.dot(u, s_diag_matrix), vt)
+    print "svd-shape:", np.shape(svd_prediction)
+    print 'Model based CF RMSE: ' + str(rmse(svd_prediction, test_data_matrix))
+
+    """
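+    For the same amount of information, the smaller the matrix, the more reliable the information it carries.
+    Hence user-cf recommends better than item-cf here; after SVD, 15 dimensions turn out to capture more than 90% of the effect, so the information is more reliable and the results are better.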
+    item-cf: 1682
+    user-cf: 943
+    svd: 15
+    """
+    evaluate(item_prediction, item_popular, 'item')
+    evaluate(user_prediction, item_popular, 'user')
+    evaluate(svd_prediction, item_popular, 'svd')
+
+    # show the recommendations for one user
+    recommend(1, svd_prediction)
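
One detail of the script above is worth spelling out: `pairwise_distances(..., metric="cosine")` returns cosine distance, which equals `1 - cosine similarity`, and newer scikit-learn releases provide `train_test_split` in `sklearn.model_selection` rather than `sklearn.cross_validation`. The sketch below computes cosine similarity directly with NumPy and feeds it into the same mean-centered user-based prediction formula. The function names and the 3x3 toy matrix are assumptions for illustration; this is not the script's own code.

```python
import numpy as np


def cosine_similarity(M):
    """Row-wise cosine similarity; all-zero rows get similarity 0 to every row."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                        # avoid dividing by zero
    unit = M / norms
    return unit @ unit.T


def predict_user_based(ratings, similarity):
    """User mean plus similarity-weighted average of other users' mean-centered ratings."""
    mean_user = ratings.mean(axis=1, keepdims=True)        # zeros (unrated) count toward the mean
    diff = ratings - mean_user
    weights = np.abs(similarity).sum(axis=1, keepdims=True)
    return mean_user + similarity @ diff / weights


# Hypothetical ratings: rows are users, columns are movies, 0 means unrated.
ratings = np.array([[5.0, 3.0, 0.0],
                    [4.0, 0.0, 1.0],
                    [0.0, 1.0, 5.0]])
sim = cosine_similarity(ratings)
pred = predict_user_based(ratings, sim)
```

With a true similarity matrix, users with similar tastes get larger weights; with a distance matrix the weighting is inverted, which may be worth checking when comparing RMSE figures.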
--- a/src/python/16.RecommenderSystems/RS-usercf.py
+++ b/src/python/16.RecommenderSystems/RS-usercf.py
@@ -65,7 +65,8 @@ class UserBasedCF():

        for line in self.loadfile(filename):
             # user ID, movie ID, rating, timestamp
-            user, movie, rating, timestamp = line.split('::')
+            # user, movie, rating, timestamp = line.split('::')
+            user, movie, rating, timestamp = line.split('\t')
             # compare a random number with pivot to decide whether the record goes to the training or the test set
            if (random.random() < pivot):

@@ -92,6 +93,8 @@ class UserBasedCF():
        print >> sys.stderr, 'building movie-users inverse table...'
        movie2users = dict()

+        # for each movie, collect the set of users who rated it
+        # and count, across all users, how many times each movie appears
        for user, movies in self.trainset.iteritems():
            for movie in movies:
                # inverse table for item-users
@@ -155,16 +158,24 @@ class UserBasedCF():
        watched_movies = self.trainset[user]

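         # take the top K most similar users
-        # v=similar user, wuv=number of co-occurrences between the two users
+        # v=similar user, wuv=number of co-occurrences between the two users; take the K users with the largest wuv, in descending order
         # profiling: 50.4% of the time is spent on line 160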
        for v, wuv in sorted(self.user_sim_mat[user].items(), key=itemgetter(1), reverse=True)[0:K]:
-            for movie in self.trainset[v]:
+            for movie, rating in self.trainset[v].iteritems():
                if movie in watched_movies:
                    continue
                # predict the user's "interest" for each movie
                rank.setdefault(movie, 0)
-                rank[movie] += wuv
+                rank[movie] += wuv * rating
        # return the N best movies
+
+        """
+        wuv
+        precision=0.3766         recall=0.0759   coverage=0.3183         popularity=6.9194
+
+        wuv * rating
+        precision=0.3865         recall=0.0779   coverage=0.2681         popularity=7.0116
+        """
        return sorted(rank.items(), key=itemgetter(1), reverse=True)[0:N]

    def evaluate(self):
@@ -183,6 +194,8 @@ class UserBasedCF():
        # varables for popularity
        popular_sum = 0

+        # enumerate yields (index, value) pairs, so the index and the user are available together
+        # reference: http://blog.csdn.net/churximi/article/details/51648388
        for i, user in enumerate(self.trainset):
            if i > 0 and i % 500 == 0:
                print >> sys.stderr, 'recommended for %d users' % i
@@ -208,7 +221,8 @@ class UserBasedCF():


 if __name__ == '__main__':
-    ratingfile = 'input/16.RecommenderSystems/ml-1m/ratings.dat'
+    # ratingfile = 'input/16.RecommenderSystems/ml-1m/ratings.dat'
+    ratingfile = 'input/16.RecommenderSystems/ml-100k/u.data'

    # 创建UserCF对象
    usercf = UserBasedCF()
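
The change from `rank[movie] += wuv` to `rank[movie] += wuv * rating` in the hunk above weights each neighbour's vote by the rating that neighbour gave. A self-contained sketch of that ranking step, written for Python 3, follows. The standalone `recommend` function, the three toy users and the similarity values are hypothetical and stand in for `self.trainset` and `self.user_sim_mat`.

```python
from operator import itemgetter


def recommend(user, trainset, user_sim, K=2, N=2):
    """Rank unseen movies by the sum of similarity * rating over the K most similar users."""
    watched = trainset[user]
    rank = {}
    neighbours = sorted(user_sim[user].items(), key=itemgetter(1), reverse=True)[:K]
    for v, wuv in neighbours:
        for movie, rating in trainset[v].items():
            if movie in watched:
                continue                           # skip movies the user has already rated
            rank[movie] = rank.get(movie, 0.0) + wuv * rating
    return sorted(rank.items(), key=itemgetter(1), reverse=True)[:N]


# Hypothetical training data: user -> {movie: rating}
trainset = {
    'A': {'m1': 5, 'm2': 3},
    'B': {'m1': 4, 'm3': 5},
    'C': {'m2': 2, 'm3': 1, 'm4': 4},
}
user_sim = {'A': {'B': 0.5, 'C': 0.25}}            # assumed similarities of A to B and C
result = recommend('A', trainset, user_sim)
```

Here `m3` scores `0.5 * 5 + 0.25 * 1` and `m4` scores `0.25 * 4`, so a highly rated movie from a close neighbour outranks one from a more distant neighbour.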
--- a/src/python/16.RecommenderSystems/python/Recommender.py
+++ b/src/python/16.RecommenderSystems/python/Recommender.py
@@ -0,0 +1,28 @@
+import numpy as np
+
+
+# custom Jaccard similarity coefficient; only valid for 0-1 matrices
+def Jaccard(a, b):
+    return 1.0*(a*b).sum()/(a+b-a*b).sum()
+
+
+class Recommender():
+
+    # similarity matrix
+    sim = None
+
+    # compute the pairwise similarity matrix
+    def similarity(self, x, distance):
+        y = np.ones((len(x), len(x)))
+        for i in range(len(x)):
+            for j in range(len(x)):
+                y[i, j] = distance(x[i], x[j])
+        return y
+
+    # fit: build the similarity matrix from the training data
+    def fit(self, x, distance=Jaccard):
+        self.sim = self.similarity(x, distance)
+
+    # recommend: score items and mask out those already present in a
+    def recommend(self, a):
+        return np.dot(self.sim, a)*(1-a)
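
The `Jaccard` helper in `Recommender.py` divides the size of the intersection of two 0-1 vectors by the size of their union. The sketch below shows the same computation on a small example. The lowercase name `jaccard` and the guard for two all-zero vectors are additions made here for illustration; the file above has no such guard.

```python
import numpy as np


def jaccard(a, b):
    """Jaccard coefficient of two 0-1 vectors: |intersection| / |union|."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    union = (a + b - a * b).sum()                  # positions set in a or b
    # assumption: treat two empty vectors as having similarity 0
    return (a * b).sum() / union if union else 0.0


a = np.array([1, 1, 0, 1])
b = np.array([1, 0, 1, 1])
score = jaccard(a, b)                              # 2 shared positions out of 4 in the union
empty_score = jaccard([0, 0], [0, 0])
```

Because `a * b` is 1 only where both vectors are 1, the formula is meaningful only for binary inputs, as the comment in the file notes.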
--- a/src/python/16.RecommenderSystems/sklearn-RS-demo-cf-item-test.py
+++ b/src/python/16.RecommenderSystems/sklearn-RS-demo-cf-item-test.py
@@ -0,0 +1,185 @@
+#!/usr/bin/python
+# coding:utf8
+
+'''
+Created on 2015-06-22
+Update  on 2017-05-16
+@author: Lockvictor/片刻
+Collaborative filtering source code for the book 《推荐系统实践》
+Reference: https://github.com/Lockvictor/MovieLens-RecSys
+Updated at: https://github.com/apachecn/MachineLearning
+'''
+import math
+import random
+import sys
+from operator import itemgetter
+
+import numpy as np
+import pandas as pd
+from sklearn import cross_validation as cv
+from sklearn.metrics.pairwise import pairwise_distances
+
+print(__doc__)
+# seed the random number generator so that runs are reproducible
+random.seed(0)
+
+
+class ItemBasedCF():
+    ''' TopN recommendation - ItemBasedCF '''
+    def __init__(self):
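+        # training and test matrices
+        self.train_mat = {}
+        self.test_mat = {}
+
+        # number of users and number of items
+        self.n_users = 0
+        self.n_items = 0
+
+        # n_sim_item: use the top 20 similar items, n_rec_item: return the top 10 recommendations
+        self.n_sim_item = 20
+        self.n_rec_item = 10
+
+        # item_mat_similarity: similarity between movies, item_popular: how often each movie appears, item_count: total number of movies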
+        self.item_mat_similarity = {}
+        self.item_popular = {}
+        self.item_count = 0
+
+        print >> sys.stderr, 'Similar item number = %d' % self.n_sim_item
+        print >> sys.stderr, 'Recommended item number = %d' % self.n_rec_item
+
+    def splitData(self, dataFile, test_size):
+        # 加载数据集
+        header = ['user_id', 'item_id', 'rating', 'timestamp']
+        df = pd.read_csv(dataFile, sep='\t', names=header)
+
+        self.n_users = df.user_id.unique().shape[0]
+        self.n_items = df.item_id.unique().shape[0]
+
+        print 'Number of users = ' + str(self.n_users) + ' | Number of items = ' + str(self.n_items)
+
+        # 拆分数据集： 用户+电影
+        self.train_data, self.test_data = cv.train_test_split(df, test_size=test_size)
+        print >> sys.stderr, '分离训练集和测试集成功'
+        print >> sys.stderr, 'len(train) = %s' % np.shape(self.train_data)[0]
+        print >> sys.stderr, 'len(test) = %s' % np.shape(self.test_data)[0]
+
+    def calc_similarity(self):
+        # 创建用户产品矩阵，针对测试数据和训练数据，创建两个矩阵：
+        self.train_mat = np.zeros((self.n_users, self.n_items))
+        for line in self.train_data.itertuples():
+            self.train_mat[int(line.user_id)-1, int(line.item_id)-1] = float(line.rating)
+        self.test_mat = np.zeros((self.n_users, self.n_items))
+        for line in self.test_data.itertuples():
+            # print "line", line.user_id-1, line.item_id-1, line.rating
+            self.test_mat[int(line.user_id)-1, int(line.item_id)-1] = float(line.rating)
+
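+        # use sklearn's pairwise_distances with metric='cosine'; note it returns cosine distance (1 - cosine similarity)
+        print "1:", np.shape(np.mat(self.train_mat).T)  # rows: movies, columns: users
+        # movie-movie distance (1682, 1682)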
+        self.item_mat_similarity = pairwise_distances(np.mat(self.train_mat).T, metric='cosine')
+        print >> sys.stderr, 'item_mat_similarity=', np.shape(self.item_mat_similarity)
+
+        print >> sys.stderr, '开始统计流行item的数量...'
+
+        # 统计在所有的用户中，不同电影的总出现次数
+        for i_index in range(self.n_items):
+            if np.sum(self.train_mat[:, i_index]) != 0:
+                self.item_popular[i_index] = np.sum(self.train_mat[:, i_index]!=0)
+                # print "pop=", i_index, self.item_popular[i_index]
+
+        # save the total number of items
+        self.item_count = len(self.item_popular)
+        print >> sys.stderr, '总共流行item数量 = %d' % self.item_count
+
+    # @profile
+    def recommend(self, u_index):
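+        """recommend (find the top K similar movies, rank candidates by summed similarity, return the top N movies)
+
+        Args:
+            u_index   user index, i.e. user ID - 1
+        Returns:
+            rec_item  list of recommended movies, sorted by score in descending order
+        """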
+        ''' Find K similar items and recommend N items. '''
+        K = self.n_sim_item
+        N = self.n_rec_item
+        rank = {}
+        i_items = np.where(self.train_mat[u_index, :] != 0)[0]
+        # print "i_items=", i_items
+        watched_items = dict(zip(i_items, self.train_mat[u_index, i_items]))
+
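+        # for each watched movie, take its top K related movies
+        # rating = the user's rating of the watched movie, w = weight between the two movies from item_mat_similarity
+        # profiling: 98.2% of the time is spent on line 154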
+        for i_item, rating in watched_items.iteritems():
+            i_other_items = np.where(self.item_mat_similarity[i_item, :] != 0)[0]
+            for related_item, w in sorted(dict(zip(i_other_items, self.item_mat_similarity[i_item, i_other_items])).items(), key=itemgetter(1), reverse=True)[0:K]:
+                if related_item in watched_items:
+                    continue
+                rank.setdefault(related_item, 0)
+                rank[related_item] += w * rating
+
+        # return the N best items
+        return sorted(rank.items(), key=itemgetter(1), reverse=True)[0:N]
+
+    def evaluate(self):
+        ''' return precision, recall, coverage and popularity '''
+        print >> sys.stderr, 'Evaluation start...'
+
+        # varables for precision and recall
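+        # hit: number of hits (a recommended item that also appears in the test set adds 1), rec_count: number of recommendations per user, test_count: number of test-set movies per user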
+        hit = 0
+        rec_count = 0
+        test_count = 0
+        # varables for coverage
+        all_rec_items = set()
+        # varables for popularity
+        popular_sum = 0
+
+        # enumerate builds an indexed sequence that yields both the index and the value
+        # Reference: http://blog.csdn.net/churximi/article/details/51648388
+        for u_index in range(50):
+            if u_index > 0 and u_index % 10 == 0:
+                print >> sys.stderr, 'recommended for %d users' % u_index
+            print "u_index", u_index
+
+            # Compare the recommendations against the test set
+            rec_items = self.recommend(u_index)
+            print "rec_items=", rec_items
+            for item, w in rec_items:
+                all_rec_items.add(item)  # record recommended items; otherwise coverage is always 0
+
+                if self.test_mat[u_index, item] != 0:
+                    hit += 1
+                    print "self.test_mat[%d, %d]=%s" % (u_index, item, self.test_mat[u_index, item])
+                # Accumulate log(1 + popularity) of each recommended movie
+                if item in self.item_popular:
+                    popular_sum += math.log(1 + self.item_popular[item])
+
+            rec_count += len(rec_items)
+            test_count += np.sum(self.test_mat[u_index, :] != 0)
+            # print "test_count=", np.sum(self.test_mat[u_index, :] != 0), np.sum(self.train_mat[u_index, :] != 0)
+
+        print("-------", hit, rec_count)
+        precision = hit / (1.0 * rec_count)
+        recall = hit / (1.0 * test_count)
+        coverage = len(all_rec_items) / (1.0 * self.item_count)
+        popularity = popular_sum / (1.0 * rec_count)
+
+        print >> sys.stderr, 'precision=%.4f \t recall=%.4f \t coverage=%.4f \t popularity=%.4f' % (precision, recall, coverage, popularity)
+
+
+if __name__ == '__main__':
+    dataFile = 'input/16.RecommenderSystems/ml-100k/u.data'
+
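+    # Create the ItemCF object
+    itemcf = ItemBasedCF()
+    # Split the data 7:3 into a training set and a test set, stored in itemcf's train_mat and test_mat
+    itemcf.splitData(dataFile, test_size=0.3)
+    # Compute the similarity between items
+    itemcf.calc_similarity()
+    # Evaluate the recommendations
+    # itemcf.evaluate()
+    # Show the recommendations for one user
+    print "recommendations", itemcf.recommend(u_index=1)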
+    print "---", np.where(itemcf.test_mat[1, :] != 0)[0]
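Reviewer sketch, not part of the commit: the ranking step in `recommend` and the metrics in `evaluate` can be illustrated without NumPy. The toy similarity table, the ratings, and the names `recommend_top_n` and `evaluate_recs` are hypothetical, and the code uses `print()` so that it runs on Python 3. It also fills the set of recommended items, which the coverage metric depends on.

```python
from operator import itemgetter


def recommend_top_n(user_ratings, item_similarity, k=2, n=2):
    """Score unseen items by summing similarity * rating over the k most
    similar neighbours of every item the user has rated; return the top n."""
    rank = {}
    for item, rating in user_ratings.items():
        neighbours = sorted(item_similarity.get(item, {}).items(),
                            key=itemgetter(1), reverse=True)[:k]
        for related_item, w in neighbours:
            if related_item in user_ratings:
                continue  # already rated by this user
            rank[related_item] = rank.get(related_item, 0.0) + w * rating
    return sorted(rank.items(), key=itemgetter(1), reverse=True)[:n]


def evaluate_recs(recs_by_user, test_by_user, n_items):
    """Return (precision, recall, coverage) for per-user recommendation lists.
    Assumes at least one recommendation and one test item overall."""
    hit = rec_count = test_count = 0
    all_rec_items = set()
    for user, recs in recs_by_user.items():
        test_items = test_by_user.get(user, set())
        for item in recs:
            all_rec_items.add(item)  # needed for coverage
            if item in test_items:
                hit += 1
        rec_count += len(recs)
        test_count += len(test_items)
    return (hit / float(rec_count), hit / float(test_count),
            len(all_rec_items) / float(n_items))


# Hypothetical toy data: item -> {other item: similarity}
toy_similarity = {
    "A": {"B": 0.9, "C": 0.4, "D": 0.1},
    "B": {"A": 0.9, "C": 0.5, "D": 0.3},
}
toy_ratings = {"A": 5.0, "B": 3.0}
toy_recs = recommend_top_n(toy_ratings, toy_similarity)
print(toy_recs)
print(evaluate_recs({"u1": [item for item, _ in toy_recs]},
                    {"u1": {"C", "D"}}, n_items=4))
```

As in the committed code, the top k neighbours are selected before already-rated items are skipped, so a user can end up with fewer than n recommendations.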
--- a/src/python/16.RecommenderSystems/sklearn-RS-demo-cf.py
+++ b/src/python/16.RecommenderSystems/sklearn-RS-demo-cf.py
@@ -1,68 +0,0 @@
-#!/usr/bin/python
-# coding:utf8
-
-from math import sqrt
-
-import numpy as np
-import pandas as pd
-from scipy.sparse.linalg import svds
-from sklearn import cross_validation as cv
-from sklearn.metrics import mean_squared_error
-from sklearn.metrics.pairwise import pairwise_distances
-
-# Load the dataset
-header = ['user_id', 'item_id', 'rating', 'timestamp']
-# http://files.grouplens.org/datasets/movielens/ml-100k.zip
-dataFile = 'input/16.RecommenderSystems/ml-100k/u.data'
-df = pd.read_csv(dataFile, sep='\t', names=header)
-
-n_users = df.user_id.unique().shape[0]
-n_items = df.item_id.unique().shape[0]
-print 'Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_items)
-
-# Split the dataset
-train_data, test_data = cv.train_test_split(df, test_size=0.25)
-
-# Build the user-item matrices: one for the training data and one for the test data
-train_data_matrix = np.zeros((n_users, n_items))
-for line in train_data.itertuples():
-    train_data_matrix[line[1]-1, line[2]-1] = line[3]
-test_data_matrix = np.zeros((n_users, n_items))
-for line in test_data.itertuples():
-    test_data_matrix[line[1]-1, line[2]-1] = line[3]
-# Use sklearn's pairwise_distances with metric="cosine" (this returns the cosine distance, i.e. 1 - cosine similarity).
-user_similarity = pairwise_distances(train_data_matrix, metric="cosine")
-item_similarity = pairwise_distances(train_data_matrix.T, metric="cosine")
-
-
-def predict(rating, similarity, type='user'):
-    if type == 'user':
-        mean_user_rating = rating.mean(axis=1)
-        rating_diff = (rating - mean_user_rating[:, np.newaxis])
-        pred = mean_user_rating[:, np.newaxis] + similarity.dot(rating_diff)/np.array([np.abs(similarity).sum(axis=1)]).T
-    elif type == 'item':
-        pred = rating.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
-    return pred
-
-
-user_prediction = predict(train_data_matrix, user_similarity, type='user')
-item_prediction = predict(train_data_matrix, item_similarity, type='item')
-
-
-def rmse(prediction, ground_truth):
-    prediction = prediction[ground_truth.nonzero()].flatten()
-    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
-    return sqrt(mean_squared_error(prediction, ground_truth))
-
-
-print 'User based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix))
-print 'Item based CF RMSe: ' + str(rmse(item_prediction, test_data_matrix))
-
-sparsity = round(1.0 - len(df)/float(n_users*n_items), 3)
-print 'The sparsity level of MovieLen100K is ' + str(sparsity * 100) + '%'
-
-
-u, s, vt = svds(train_data_matrix, k=20)
-s_diag_matrix = np.diag(s)
-x_pred = np.dot(np.dot(u, s_diag_matrix), vt)
-print 'User-based CF MSE: ' + str(rmse(x_pred, test_data_matrix))
--- a/src/python/16.RecommenderSystems/test_graph-based.py
+++ b/src/python/16.RecommenderSystems/test_graph-based.py
@@ -14,4 +14,3 @@ def PersonalRank(G, alpha, root):
                    tmp[j] += 1 - alpha
        rank = tmp
    return rank
-
--- a/src/python/6.SVM/sklearn-svm-demo.py
+++ b/src/python/6.SVM/sklearn-svm-demo.py
@@ -7,10 +7,12 @@ Updated on 2017-06-28
 SVM: maximum-margin separating hyperplane
@author: 片刻
 Machine Learning in Action, updated at: https://github.com/apachecn/MachineLearning
+sklearn-SVM translation link: http://cwiki.apachecn.org/pages/viewpage.action?pageId=10031359
 """
-import numpy as np
 import matplotlib.pyplot as plt
+import numpy as np
 from sklearn import svm
+
 print(__doc__)


@@ -52,7 +54,7 @@ clf.fit(X, Y)
 # Get the separating hyperplane
 w = clf.coef_[0]
 # Slope
-a = -w[0] / w[1]
+a = -w[0]/w[1]
 # Sample 50 evenly spaced points from -5 to 5; num=50 is the default
 # xx = np.linspace(-5, 5)  # , num=50)
 xx = np.linspace(-2, 10)  # , num=50)
@@ -74,7 +76,7 @@ plt.plot(xx, yy_down, 'k--')
 plt.plot(xx, yy_up, 'k--')

 plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=80, facecolors='none')
-plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=plt.cm.Paired)
+plt.scatter([X[:, 0]], [X[:, 1]], c=Y, cmap=plt.cm.Paired)

 plt.axis('tight')
 plt.show()
--- a/src/python/7.AdaBoost/adaboost.py
+++ b/src/python/7.AdaBoost/adaboost.py
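@@ -43,9 +43,9 @@ def stumpClassify(dataMat, dimen, threshVal, threshIneq):
    """stumpClassify(classify the dataset by comparing the values of one feature column against a threshold)

    Args:
-        dataMat  Matrix dataset
-        dimen feature column
-        threshVal value to compare the feature column against
+        dataMat    Matrix dataset
+        dimen      feature column
+        threshVal  value to compare the feature column against
    Returns:
        retArray result array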
    """
@@ -110,11 +110,11 @@ def buildStump(dataArr, labelArr, D):
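                # e.g. with none wrong the error rate is 0.2*0 = 0; with all 5 wrong it is 0.2*5 = 1; with only 3 wrong it is 0.2*3 = 0.6
                weightedError = D.T*errArr
                '''
-                dim       the feature column
-                threshVal the split threshold of the stump
-                inequal   which side of the threshold is tested, so the flipped stump's error rate is also computed
-                weightedError the overall weighted error rate
-                bestClasEst   the best predicted classification
+                dim            the feature column
+                threshVal      the split threshold of the stump
+                inequal        which side of the threshold is tested, so the flipped stump's error rate is also computed
+                weightedError  the overall weighted error rate
+                bestClasEst    the best predicted classification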
                '''
                # print "split: dim %d, thresh %.2f, thresh ineqal: %s, the weighted error is %.3f" % (i, threshVal, inequal, weightedError)
                if weightedError < minError:
@@ -155,7 +155,7 @@ def adaBoostTrainDS(dataArr, labelArr, numIt=40):
        # store Stump Params in Array
        weakClassArr.append(bestStump)

-        # print "alpha=%s, classEst=%s, bestStump=%s, error=%s " % (alpha, classEst.T, bestStump, error)
+        print "alpha=%s, classEst=%s, bestStump=%s, error=%s " % (alpha, classEst.T, bestStump, error)
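        # The -1 gives e to the power of -alpha below; the product of label and prediction is 1 when the prediction is correct and -1 otherwise, which separates correctly and incorrectly classified samples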
        expon = multiply(-1*alpha*mat(labelArr).T, classEst)
        print '\n'
@@ -207,8 +207,11 @@ def plotROC(predStrengths, classLabels):

    Args:
        predStrengths  final weighted prediction strengths
-        classLabels class labels of the original data
+        classLabels    class labels of the original data
    """
+    print 'predStrengths=', predStrengths
+    print 'classLabels=', classLabels
+
    import matplotlib.pyplot as plt
    # variable to calculate AUC
    ySum = 0.0
@@ -221,6 +224,8 @@ def plotROC(predStrengths, classLabels):
    # argsort returns the indices that sort the array values in ascending order
    # get sorted index, it's reverse
    sortedIndicies = predStrengths.argsort()
+    # Check that the result is sorted in ascending order
+    print 'sortedIndicies=', sortedIndicies, predStrengths[0, 176], predStrengths.min(), predStrengths[0, 293], predStrengths.max()

    # Create the figure object
    fig = plt.figure()
@@ -239,7 +244,7 @@ def plotROC(predStrengths, classLabels):
            ySum += cur[1]
        # draw line from cur to (cur[0]-delX, cur[1]-delY)
        # Draw a line segment between the points (x1, x2, y1, y2)
-        # print cur[0], cur[0]-delX, cur[1], cur[1]-delY
+        print cur[0], cur[0]-delX, cur[1], cur[1]-delY
        ax.plot([cur[0], cur[0]-delX], [cur[1], cur[1]-delY], c='b')
        cur = (cur[0]-delX, cur[1]-delY)
    # Draw the dashed diagonal line
@@ -252,55 +257,54 @@ def plotROC(predStrengths, classLabels):
    plt.show()
    '''
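    Reference: http://blog.csdn.net/wenyusuran/article/details/39056013
-    To compute the AUC, we need to sum the areas of many small rectangles. Each rectangle has width xStep, so
-    we can first sum the heights of all rectangles and then multiply by xStep to get the total area. The sum of heights (ySum)
-    grows with each step along the x axis.
+    To compute the AUC, we need to sum the areas of many small rectangles.
+    Each rectangle has width xStep, so we can first sum the heights of all rectangles and then multiply by xStep to get the total area.
+    The sum of heights (ySum) grows with each step along the x axis.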
    '''
    print "the Area Under the Curve is: ", ySum*xStep


 if __name__ == "__main__":
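-    # We want to classify 5 points
-    dataArr, labelArr = loadSimpData()
-    print 'dataArr', dataArr, 'labelArr', labelArr
+    # # We want to classify 5 points
+    # dataArr, labelArr = loadSimpData()
+    # print 'dataArr', dataArr, 'labelArr', labelArr

-    # D holds the initial weights: 1 is split evenly into 5 parts, so each sample starts with probability 0.2
-    # D is used to compute the weighted error: weightedError = D.T*errArr
-    D = mat(ones((5, 1))/5)
-    print 'D=', D.T
+    # # D holds the initial weights: 1 is split evenly into 5 parts, so each sample starts with probability 0.2
+    # # D is used to compute the weighted error: weightedError = D.T*errArr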
+    # D = mat(ones((5, 1))/5)
+    # print 'D=', D.T

-    # bestStump, minError, bestClasEst = buildStump(dataArr, labelArr, D)
-    # print 'bestStump=', bestStump
-    # print 'minError=', minError
-    # print 'bestClasEst=', bestClasEst.T
+    # # bestStump, minError, bestClasEst = buildStump(dataArr, labelArr, D)
+    # # print 'bestStump=', bestStump
+    # # print 'minError=', minError
+    # # print 'bestClasEst=', bestClasEst.T

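+    # # Classifiers: weakClassArr
+    # # Accumulated classification estimates: aggClassEst
+    # weakClassArr, aggClassEst = adaBoostTrainDS(dataArr, labelArr, 9)
+    # print '\nweakClassArr=', weakClassArr, '\naggClassEst=', aggClassEst.T

-    # Classifiers: weakClassArr
-    # Accumulated classification estimates: aggClassEst
-    weakClassArr, aggClassEst = adaBoostTrainDS(dataArr, labelArr, 9)
-    print '\nweakClassArr=', weakClassArr, '\naggClassEst=', aggClassEst.T
+    # """
+    # Findings:
+    # Classification weights: the maximum value is the sum of the alphas, and the minimum is its negative
+    # Sample weights: the less likely a sample is to be misclassified, the smaller its weight in D
+    # """

-    """
-    Findings:
-    Classification weights: the maximum value is the sum of the alphas, and the minimum is its negative
-    Sample weights: the less likely a sample is to be misclassified, the smaller its weight in D
-    """
+    # # Classify test points; observe the final aggregated weights in aggClassEst
+    # print adaClassify([0, 0], weakClassArr).T
+    # print adaClassify([[5, 5], [0, 0]], weakClassArr).T

-    # Classify test points; observe the final aggregated weights in aggClassEst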
-    print adaClassify([0, 0], weakClassArr).T
-    print adaClassify([[5, 5], [0, 0]], weakClassArr).T
-
-    # # Horse colic dataset
-    # # Training set
-    # dataArr, labelArr = loadDataSet("input/7.AdaBoost/horseColicTraining2.txt")
-    # weakClassArr, aggClassEst = adaBoostTrainDS(dataArr, labelArr, 40)
-    # print weakClassArr, '\n-----\n', aggClassEst.T
-    # # Compute the area under the ROC curve (AUC)
-    # plotROC(aggClassEst.T, labelArr)
-    # # Test set
-    # dataArrTest, labelArrTest = loadDataSet("input/7.AdaBoost/horseColicTest2.txt")
-    # m = shape(dataArrTest)[0]
-    # predicting10 = adaClassify(dataArrTest, weakClassArr)
-    # errArr = mat(ones((m, 1)))
-    # # Test: print the total sample count, the number of misclassified samples, and the error rate
-    # print m, errArr[predicting10 != mat(labelArrTest).T].sum(), errArr[predicting10 != mat(labelArrTest).T].sum()/m
+    # Horse colic dataset
+    # Training set
+    dataArr, labelArr = loadDataSet("input/7.AdaBoost/horseColicTraining2.txt")
+    weakClassArr, aggClassEst = adaBoostTrainDS(dataArr, labelArr, 40)
+    print weakClassArr, '\n-----\n', aggClassEst.T
+    # Compute the area under the ROC curve (AUC)
+    plotROC(aggClassEst.T, labelArr)
+    # Test set
+    dataArrTest, labelArrTest = loadDataSet("input/7.AdaBoost/horseColicTest2.txt")
+    m = shape(dataArrTest)[0]
+    predicting10 = adaClassify(dataArrTest, weakClassArr)
+    errArr = mat(ones((m, 1)))
+    # Test: print the total sample count, the number of misclassified samples, and the error rate
+    print m, errArr[predicting10 != mat(labelArrTest).T].sum(), errArr[predicting10 != mat(labelArrTest).T].sum()/m
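Reviewer sketch, not part of the commit: the `expon` line in `adaBoostTrainDS` is the sample re-weighting step of AdaBoost. The plain-Python version below assumes the standard formulation, with alpha = 0.5 * ln((1 - error) / error) and labels in {-1, +1}; the alpha formula is not visible in this hunk, so treat it as an assumption, and `update_weights` is a hypothetical name.

```python
import math


def update_weights(weights, labels, predictions):
    """One AdaBoost re-weighting step for labels and predictions in {-1, +1}."""
    # Weighted error of the weak classifier: total weight of misclassified samples
    error = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
    # max(...) guards against division by zero when the error is 0
    alpha = 0.5 * math.log((1.0 - error) / max(error, 1e-16))
    # y * p is +1 when correct (weight shrinks) and -1 when wrong (weight grows)
    new_weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, labels, predictions)]
    total = sum(new_weights)
    return alpha, [w / total for w in new_weights]


# Five samples with equal initial weight 0.2; only the first one is misclassified
alpha, new_d = update_weights([0.2] * 5, [1, 1, -1, -1, 1], [-1, 1, -1, -1, 1])
print(alpha, new_d)
```

After normalization the single misclassified sample carries half of the total weight, which is why the next stump concentrates on it.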
--- a/src/python/7.AdaBoost/sklearn-adaboost-demo.py
+++ b/src/python/7.AdaBoost/sklearn-adaboost-demo.py
@@ -0,0 +1,61 @@
+#!/usr/bin/python
+# coding:utf8
+"""
+Created on 2017-07-10
+Updated on 2017-07-10
+@author: 片刻／Noel Dawe
+Machine Learning in Action, updated at: https://github.com/apachecn/MachineLearning
+sklearn-AdaBoost translation link: http://cwiki.apachecn.org/pages/viewpage.action?pageId=10813457
+"""
+
+import matplotlib.pyplot as plt
+# importing necessary libraries
+import numpy as np
+from sklearn import metrics
+from sklearn.ensemble import AdaBoostRegressor
+from sklearn.tree import DecisionTreeRegressor
+
+print(__doc__)
+
+
+# Create the dataset
+rng = np.random.RandomState(1)
+X = np.linspace(0, 6, 100)[:, np.newaxis]
+y = np.sin(X).ravel() + np.sin(6 * X).ravel() + rng.normal(0, 0.1, X.shape[0])
+# dataArr, labelArr = loadDataSet("input/7.AdaBoost/horseColicTraining2.txt")
+
+
+# Fit regression model
+regr_1 = DecisionTreeRegressor(max_depth=4)
+regr_2 = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4), n_estimators=300, random_state=rng)
+
+regr_1.fit(X, y)
+regr_2.fit(X, y)
+
+# Predict
+y_1 = regr_1.predict(X)
+y_2 = regr_2.predict(X)
+
+# Plot the results
+plt.figure()
+plt.scatter(X, y, c="k", label="training samples")
+plt.plot(X, y_1, c="g", label="n_estimators=1", linewidth=2)
+plt.plot(X, y_2, c="r", label="n_estimators=300", linewidth=2)
+plt.xlabel("data")
+plt.ylabel("target")
+plt.title("Boosted Decision Tree Regression")
+plt.legend()
+plt.show()
+
+print 'y---', type(y[0]), len(y), y[:4]
+print 'y_1---', type(y_1[0]), len(y_1), y_1[:4]
+print 'y_2---', type(y_2[0]), len(y_2), y_2[:4]
+
+# Suitable for binary classification
+y_true = np.array([0, 0, 1, 1])
+y_scores = np.array([0.1, 0.4, 0.35, 0.8])
+print 'y_scores---', type(y_scores[0]), len(y_scores), y_scores
+print metrics.roc_auc_score(y_true, y_scores)
+
+# print "-" * 100
+# print metrics.roc_auc_score(y[:1], y_2[:1])
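Reviewer sketch, not part of the commit: the rectangle-summing AUC described in `plotROC` can be checked against the `y_true` / `y_scores` toy input used in this demo. The function name `auc_by_rectangles` is hypothetical, the positive class is assumed to be labeled 1, and tied scores are not handled specially.

```python
def auc_by_rectangles(scores, labels):
    """Approximate the ROC AUC by summing rectangle heights, walking the
    curve from (1, 1) down to (0, 0) in order of increasing score."""
    n_pos = sum(1 for label in labels if label == 1)
    n_neg = len(labels) - n_pos
    y_step = 1.0 / n_pos  # drop in true-positive rate per positive sample
    x_step = 1.0 / n_neg  # drop in false-positive rate per negative sample
    height = 1.0
    height_sum = 0.0
    for index in sorted(range(len(scores)), key=lambda i: scores[i]):
        if labels[index] == 1:
            height -= y_step      # move down: one positive falls below the threshold
        else:
            height_sum += height  # move left: add one rectangle of width x_step
    return height_sum * x_step


print(auc_by_rectangles([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

Summing the heights first and multiplying by the common width once mirrors the `ySum*xStep` expression in `plotROC`.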
--- a/tools/python2libsvm.py
+++ b/tools/python2libsvm.py
@@ -0,0 +1,52 @@
+#!/usr/bin/python
+# coding:utf8
+
+import os
+import sklearn.datasets as datasets
+
+
+def get_data(file_input, separator='\t'):
+    if 'libsvm' not in file_input:
+        file_input = other2libsvm(file_input, separator)
+    data = datasets.load_svmlight_file(file_input)
+    return data[0], data[1]
+
+
+def other2libsvm(file_name, separator='\t'):
+
+    libsvm_name = file_name.replace('.txt', '.libsvm_tmp')
+    libsvm_data = open(libsvm_name, 'w')
+
+    file_data = open(file_name, 'r')
+    for line in file_data.readlines():
+        features = line.strip().split(separator)
+        # print len(features)
+        class_data = features[-1]
+        svm_format = ''
+        for i in range(len(features)-1):
+            svm_format += " %d:%s" % (i+1, features[i])
+            # print svm_format
+        svm_format = "%s%s\n" % (class_data, svm_format)
+        # print svm_format
+        libsvm_data.write(svm_format)
+    file_data.close()
+
+    libsvm_data.close()
+    return libsvm_name
+
+
+def dump_data(x, y, file_output):
+    datasets.dump_svmlight_file(x, y, file_output)
+    os.remove("%s_tmp" % file_output)
+
+
+if __name__ == "__main__":
+    file_input = "input/7.AdaBoost/horseColicTest2.txt"
+    file_output = "input/7.AdaBoost/horseColicTest2.libsvm"
+
+    # Load the dataset
+    x, y = get_data(file_input, separator='\t')
+    print x[3, :]
+    print y
+    # Export the data in libsvm format
+    dump_data(x, y, file_output)
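Reviewer sketch, not part of the commit: the per-line conversion performed inside `other2libsvm` can be isolated into a small function, which makes the output format easy to check. The name `to_libsvm_line` and the sample row are hypothetical; as in the committed code, the last field is taken as the label and feature indices start at 1.

```python
def to_libsvm_line(line, separator="\t"):
    """Convert 'f1<sep>f2<sep>...<sep>label' into 'label 1:f1 2:f2 ...'."""
    fields = line.strip().split(separator)
    label, features = fields[-1], fields[:-1]
    pairs = ["%d:%s" % (i + 1, value) for i, value in enumerate(features)]
    return " ".join([label] + pairs)


print(to_libsvm_line("2.0\t1.0\t38.5\t1.0\n"))
```

Building the pairs in a list and joining them avoids the repeated string concatenation in the loop; a blank input line would still need to be skipped by the caller.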