Merge pull request #88 from jiangzhonglian/master

Complete Python code for the recommendation system
This commit is contained in:
片刻
2017-05-18 12:03:13 +08:00
committed by GitHub
30 changed files with 1011057 additions and 20 deletions

View File

@@ -0,0 +1,159 @@
SUMMARY
================================================================================
These files contain 1,000,209 anonymous ratings of approximately 3,900 movies
made by 6,040 MovieLens users who joined MovieLens in 2000.
USAGE LICENSE
================================================================================
Neither the University of Minnesota nor any of the researchers
involved can guarantee the correctness of the data, its suitability
for any particular purpose, or the validity of results based on the
use of the data set. The data set may be used for any research
purposes under the following conditions:
* The user may not state or imply any endorsement from the
University of Minnesota or the GroupLens Research Group.
* The user must acknowledge the use of the data set in
publications resulting from the use of the data set, and must
send us an electronic or paper copy of those publications.
* The user may not redistribute the data without separate
permission.
* The user may not use this information for any commercial or
revenue-bearing purposes without first obtaining permission
from a faculty member of the GroupLens Research Project at the
University of Minnesota.
If you have any further questions or comments, please contact GroupLens
<grouplens-info@cs.umn.edu>.
ACKNOWLEDGEMENTS
================================================================================
Thanks to Shyong Lam and Jon Herlocker for cleaning up and generating the data
set.
FURTHER INFORMATION ABOUT THE GROUPLENS RESEARCH PROJECT
================================================================================
The GroupLens Research Project is a research group in the Department of
Computer Science and Engineering at the University of Minnesota. Members of
the GroupLens Research Project are involved in many research projects related
to the fields of information filtering, collaborative filtering, and
recommender systems. The project is led by professors John Riedl and Joseph
Konstan. The project began to explore automated collaborative filtering in
1992, but is most well known for its worldwide trial of an automated
collaborative filtering system for Usenet news in 1996. Since then the project
has expanded its scope to research overall information filtering solutions,
integrating content-based methods as well as improving current collaborative
filtering technology.
Further information on the GroupLens Research project, including research
publications, can be found at the following web site:
http://www.grouplens.org/
GroupLens Research currently operates a movie recommender based on
collaborative filtering:
http://www.movielens.org/
RATINGS FILE DESCRIPTION
================================================================================
All ratings are contained in the file "ratings.dat" and are in the
following format:
UserID::MovieID::Rating::Timestamp
- UserIDs range between 1 and 6040
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings
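For readers loading the data by hand, the record layout above maps directly onto a one-line parser. A minimal sketch (the function name and the sample record are illustrative):

```python
def parse_rating(line):
    """Split one ratings.dat record of the form UserID::MovieID::Rating::Timestamp."""
    user_id, movie_id, rating, timestamp = line.strip().split('::')
    return int(user_id), int(movie_id), int(rating), int(timestamp)

# Example record in the documented format
print(parse_rating('1::1193::5::978300760'))
```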
USERS FILE DESCRIPTION
================================================================================
User information is in the file "users.dat" and is in the following
format:
UserID::Gender::Age::Occupation::Zip-code
All demographic information is provided voluntarily by the users and is
not checked for accuracy. Only users who have provided some demographic
information are included in this data set.
- Gender is denoted by a "M" for male and "F" for female
- Age is chosen from the following ranges:
* 1: "Under 18"
* 18: "18-24"
* 25: "25-34"
* 35: "35-44"
* 45: "45-49"
* 50: "50-55"
* 56: "56+"
- Occupation is chosen from the following choices:
* 0: "other" or not specified
* 1: "academic/educator"
* 2: "artist"
* 3: "clerical/admin"
* 4: "college/grad student"
* 5: "customer service"
* 6: "doctor/health care"
* 7: "executive/managerial"
* 8: "farmer"
* 9: "homemaker"
* 10: "K-12 student"
* 11: "lawyer"
* 12: "programmer"
* 13: "retired"
* 14: "sales/marketing"
* 15: "scientist"
* 16: "self-employed"
* 17: "technician/engineer"
* 18: "tradesman/craftsman"
* 19: "unemployed"
* 20: "writer"
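The age buckets above are codes, not literal ages, so decoding a users.dat record needs a lookup table. A minimal sketch (the `AGE_RANGES` dict mirrors the list above; the function name and sample line are illustrative):

```python
AGE_RANGES = {1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
              45: "45-49", 50: "50-55", 56: "56+"}

def parse_user(line):
    """Split one users.dat record: UserID::Gender::Age::Occupation::Zip-code."""
    user_id, gender, age, occupation, zip_code = line.strip().split('::')
    return int(user_id), gender, AGE_RANGES[int(age)], int(occupation), zip_code

print(parse_user('1::F::1::10::48067'))
```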
MOVIES FILE DESCRIPTION
================================================================================
Movie information is in the file "movies.dat" and is in the following
format:
MovieID::Title::Genres
- Titles are identical to titles provided by the IMDB (including
year of release)
- Genres are pipe-separated and are selected from the following genres:
* Action
* Adventure
* Animation
* Children's
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
- Some MovieIDs do not correspond to a movie due to accidental duplicate
entries and/or test entries
- Movies are mostly entered by hand, so errors and inconsistencies may exist
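Since genres are pipe-separated within the `::`-delimited record, a movies.dat line splits in two passes. A minimal sketch (names are illustrative):

```python
def parse_movie(line):
    """Split one movies.dat record: MovieID::Title::Genres (genres pipe-separated)."""
    movie_id, title, genres = line.strip().split('::')
    return int(movie_id), title, genres.split('|')

print(parse_movie("1::Toy Story (1995)::Animation|Children's|Comedy"))
```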

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@@ -1,6 +1,11 @@
#!/usr/bin/python
# coding:utf8
'''
Created on 2017-05-18
Update on 2017-05-18
@author: Peter Harrington/山上有课树
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
from numpy import random, mat, eye
'''

View File

@@ -1,6 +1,12 @@
#!/usr/bin/python
# coding:utf8
'''
Created on Feb 16, 2011
Update on 2017-05-18
k Means Clustering for Ch10 of Machine Learning in Action
@author: Peter Harrington/那伊抹微笑
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
from numpy import *

View File

@@ -3,9 +3,10 @@
'''
Created on Mar 24, 2011
Update on 2017-03-16
Update on 2017-05-18
Ch 11 code
@author: Peter/片刻
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
print(__doc__)
from numpy import *

View File

@@ -3,12 +3,14 @@
'''
Created on Jun 14, 2011
Update on 2017-05-18
FP-Growth FP means frequent pattern
the FP-Growth algorithm needs:
1. FP-tree (class treeNode)
2. header table (use dict)
This finds frequent itemsets similar to apriori but does not find association rules.
@author: Peter/片刻
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
print(__doc__)

View File

@@ -3,8 +3,9 @@
'''
Created on Jun 1, 2011
Update on 2017-04-06
Update on 2017-05-18
@author: Peter Harrington/片刻
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
print(__doc__)
from numpy import *

View File

@@ -1,7 +1,11 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-
'''
Created on Mar 8, 2011
Update on 2017-05-18
@author: Peter Harrington/山上有课树
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
from numpy import linalg as la
from numpy import *

View File

@@ -0,0 +1,212 @@
#!/usr/bin/python
# coding:utf8
'''
Created on 2015-06-22
Update on 2017-05-16
@author: Lockvictor/片刻
Collaborative filtering source code from "Recommender System Practice" (《推荐系统实践》)
Reference: https://github.com/Lockvictor/MovieLens-RecSys
Update repo: https://github.com/apachecn/MachineLearning
'''
import sys
import math
import random
from operator import itemgetter
print(__doc__)
# make the random split reproducible
random.seed(0)
class ItemBasedCF():
''' TopN recommendation - ItemBasedCF '''
def __init__(self):
self.trainset = {}
self.testset = {}
# n_sim_movie: top 20 similar movies; n_rec_movie: top 10 recommendations
self.n_sim_movie = 20
self.n_rec_movie = 10
# movie_sim_mat: similarity between movies; movie_popular: rating count per movie; movie_count: total number of movies
self.movie_sim_mat = {}
self.movie_popular = {}
self.movie_count = 0
print >> sys.stderr, 'Similar movie number = %d' % self.n_sim_movie
print >> sys.stderr, 'Recommended movie number = %d' % self.n_rec_movie
@staticmethod
def loadfile(filename):
"""loadfile (load the file and return a generator)
Args:
filename: file name
Returns:
line: one line of the file, stripped of line endings
"""
fp = open(filename, 'r')
for i, line in enumerate(fp):
yield line.strip('\r\n')
if i > 0 and i % 100000 == 0:
print >> sys.stderr, 'loading %s(%s)' % (filename, i)
fp.close()
print >> sys.stderr, 'load %s success' % filename
def generate_dataset(self, filename, pivot=0.7):
"""generate_dataset (load the file and randomly split the data set 7:3)
Args:
filename: file name
pivot: split ratio
"""
trainset_len = 0
testset_len = 0
for line in self.loadfile(filename):
# UserID::MovieID::Rating::Timestamp
user, movie, rating, _ = line.split('::')
# compare a random draw with pivot and initialise the user's entry
if (random.random() < pivot):
# dict.setdefault(key, default=None)
# key -- the key to look up
# default -- the value stored when the key is absent
self.trainset.setdefault(user, {})
self.trainset[user][movie] = int(rating)
trainset_len += 1
else:
self.testset.setdefault(user, {})
self.testset[user][movie] = int(rating)
testset_len += 1
print >> sys.stderr, 'split training set and test set success'
print >> sys.stderr, 'train set = %s' % trainset_len
print >> sys.stderr, 'test set = %s' % testset_len
def calc_movie_sim(self):
"""calc_movie_sim (compute the similarity between movies)"""
print >> sys.stderr, 'counting movies number and popularity...'
for user, movies in self.trainset.iteritems():
for movie in movies:
# count item popularity
if movie not in self.movie_popular:
self.movie_popular[movie] = 0
self.movie_popular[movie] += 1
print >> sys.stderr, 'count movies number and popularity success'
# save the total number of movies
self.movie_count = len(self.movie_popular)
print >> sys.stderr, 'total movie number = %d' % self.movie_count
# count how often two movies are rated by the same user
itemsim_mat = self.movie_sim_mat
print >> sys.stderr, 'building co-rated users matrix...'
for user, movies in self.trainset.iteritems():
for m1 in movies:
for m2 in movies:
if m1 == m2:
continue
itemsim_mat.setdefault(m1, {})
itemsim_mat[m1].setdefault(m2, 0)
itemsim_mat[m1][m2] += 1
print >> sys.stderr, 'build co-rated users matrix success'
# calculate similarity matrix
print >> sys.stderr, 'calculating movie similarity matrix...'
simfactor_count = 0
PRINT_STEP = 2000000
for m1, related_movies in itemsim_mat.iteritems():
for m2, count in related_movies.iteritems():
# cosine similarity
itemsim_mat[m1][m2] = count / math.sqrt(self.movie_popular[m1] * self.movie_popular[m2])
simfactor_count += 1
# print progress
if simfactor_count % PRINT_STEP == 0:
print >> sys.stderr, 'calculating movie similarity factor(%d)' % simfactor_count
print >> sys.stderr, 'calculate movie similarity matrix(similarity factor) success'
print >> sys.stderr, 'Total similarity factor number = %d' % simfactor_count
# @profile
def recommend(self, user):
"""recommend (take the top-K similar movies, sum similarities per candidate, return the top-N movies)
Args:
user: the user
Returns:
rec_movie: recommended movies, sorted by score in descending order
"""
''' Find K similar movies and recommend N movies. '''
K = self.n_sim_movie
N = self.n_rec_movie
rank = {}
watched_movies = self.trainset[user]
# accumulate scores over the top-K most similar movies
# rating = the user's rating; w = similarity between the two movies
# profiling: 98.2% of the time is spent at line 154 (the sorted call below)
for movie, rating in watched_movies.iteritems():
for related_movie, w in sorted(self.movie_sim_mat[movie].items(), key=itemgetter(1), reverse=True)[0:K]:
if related_movie in watched_movies:
continue
rank.setdefault(related_movie, 0)
rank[related_movie] += w * rating
# return the N best movies
return sorted(rank.items(), key=itemgetter(1), reverse=True)[0:N]
def evaluate(self):
''' return precision, recall, coverage and popularity '''
print >> sys.stderr, 'Evaluation start...'
# top-N recommendations per user
N = self.n_rec_movie
# variables for precision and recall
# hit: hits (a recommended movie also in the test set); rec_count: recommendations per user; test_count: test-set movies per user
hit = 0
rec_count = 0
test_count = 0
# variables for coverage
all_rec_movies = set()
# variables for popularity
popular_sum = 0
for i, user in enumerate(self.trainset):
if i > 0 and i % 500 == 0:
print >> sys.stderr, 'recommended for %d users' % i
test_movies = self.testset.get(user, {})
rec_movies = self.recommend(user)
# compare the recommendations with the test set
for movie, w in rec_movies:
if movie in test_movies:
hit += 1
all_rec_movies.add(movie)
# accumulate log(1 + popularity) of each recommended movie
popular_sum += math.log(1 + self.movie_popular[movie])
rec_count += N
test_count += len(test_movies)
precision = hit / (1.0 * rec_count)
recall = hit / (1.0 * test_count)
coverage = len(all_rec_movies) / (1.0 * self.movie_count)
popularity = popular_sum / (1.0 * rec_count)
print >> sys.stderr, 'precision=%.4f \t recall=%.4f \t coverage=%.4f \t popularity=%.4f' % (precision, recall, coverage, popularity)
if __name__ == '__main__':
ratingfile = 'input/16.RecommendedSystem/ml-1m/ratings.dat'
# create the ItemCF object
itemcf = ItemBasedCF()
# split the data 7:3 into a training set and a test set, stored in itemcf's trainset/testset
itemcf.generate_dataset(ratingfile, pivot=0.7)
# compute the similarity between movies
itemcf.calc_movie_sim()
# evaluate the recommendations
itemcf.evaluate()

View File

@@ -0,0 +1,72 @@
import math
import random
# GetRecommendation(user, N) is assumed to be provided by the recommender under test
def SplitData(data, M, k, seed):
test = []
train = []
random.seed(seed)
for user, item in data:
# randint is inclusive at both ends, so draw from 0..M-1 for an exact 1/M test share
if random.randint(0, M - 1) == k:
test.append([user, item])
else:
train.append([user, item])
return train, test
# Precision
def Precision(train, test, N):
hit = 0
n_rec = 0
for user in train.keys():
tu = test.get(user, {})
rank = GetRecommendation(user, N)
for item, pui in rank:
if item in tu:
hit += 1
n_rec += N
return hit / (n_rec * 1.0)
# Recall
def Recall(train, test, N):
hit = 0
n_test = 0
for user in train.keys():
tu = test.get(user, {})
rank = GetRecommendation(user, N)
for item, pui in rank:
if item in tu:
hit += 1
n_test += len(tu)
return hit / (n_test * 1.0)
# Coverage
def Coverage(train, test, N):
recommend_items = set()
all_items = set()
for user in train.keys():
for item in train[user].keys():
all_items.add(item)
rank = GetRecommendation(user, N)
for item, pui in rank:
recommend_items.add(item)
return len(recommend_items) / (len(all_items) * 1.0)
# Novelty (average log-popularity of recommended items)
def Popularity(train, test, N):
item_popularity = dict()
for user, items in train.items():
for item in items.keys():
if item not in item_popularity:
item_popularity[item] = 0
item_popularity[item] += 1
ret = 0
n = 0
for user in train.keys():
rank = GetRecommendation(user, N)
for item, pui in rank:
ret += math.log(1 + item_popularity[item])
n += 1
ret /= n * 1.0
return ret
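The precision and recall functions above depend on an external GetRecommendation; the same bookkeeping can be checked standalone against fixed recommendation lists. A toy sketch (`precision_recall`, the users, and the item names are all hypothetical):

```python
def precision_recall(test, recommendations):
    """Precision and recall of fixed top-N lists against held-out test items."""
    hit = rec_total = test_total = 0
    for user, rec in recommendations.items():
        tu = test.get(user, set())            # held-out items for this user
        hit += sum(1 for item in rec if item in tu)
        rec_total += len(rec)                 # total recommendations made
        test_total += len(tu)                 # total held-out items
    return hit / rec_total, hit / test_total

# hypothetical held-out items and recommendation lists
test = {'u1': {'a', 'b'}, 'u2': {'c'}}
recs = {'u1': ['a', 'x'], 'u2': ['c', 'y']}
p, r = precision_recall(test, recs)
print(p, r)
```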

View File

@@ -0,0 +1,17 @@
def PersonalRank(G, alpha, root):
# rank[x]: probability that a random walk restarted at root visits node x
rank = {x: 0 for x in G.keys()}
rank[root] = 1
for k in range(20):
tmp = {x: 0 for x in G.keys()}
# with probability alpha the walker follows a random out-edge
for i, ri in G.items():
for j in ri:
tmp[j] += alpha * rank[i] / (1.0 * len(ri))
# with probability 1 - alpha it restarts at root (added once per iteration,
# not once per edge; the hard-coded 0.6 is replaced by alpha)
tmp[root] += 1 - alpha
rank = tmp
return rank
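PersonalRank can be sanity-checked on a tiny user-item bipartite graph. Everything below is illustrative; `personal_rank` is a Python 3 restatement of the function above, with the restart term applied once per iteration so the scores stay a probability distribution:

```python
def personal_rank(G, alpha, root, iters=20):
    """Random walk with restart on a bipartite graph G (adjacency dict)."""
    rank = {x: 0.0 for x in G}
    rank[root] = 1.0
    for _ in range(iters):
        tmp = {x: 0.0 for x in G}
        for i, ri in G.items():
            for j in ri:                      # follow an out-edge w.p. alpha
                tmp[j] += alpha * rank[i] / len(ri)
        tmp[root] += 1 - alpha                # restart at root w.p. 1 - alpha
        rank = tmp
    return rank

# toy bipartite graph: users A, B and items a, b, c (hypothetical)
G = {'A': {'a': 1, 'b': 1}, 'B': {'b': 1, 'c': 1},
     'a': {'A': 1}, 'b': {'A': 1, 'B': 1}, 'c': {'B': 1}}
scores = personal_rank(G, 0.8, 'A')
print(max(scores, key=scores.get))
```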

View File

@@ -0,0 +1,41 @@
# negative-sample selection: keep the user's positives (label 1) and sample
# roughly as many unseen items as negatives (label 0)
# items_pool: global list of all items, assumed to be defined by the caller
def RandomSelectNegativeSample(items):
ret = dict()
for i in items.keys():
ret[i] = 1
n = 0
for i in range(0, len(items) * 3):
item = items_pool[random.randint(0, len(items_pool) - 1)]
if item in ret:
continue
ret[item] = 0
n += 1
if n > len(items):
break
return ret
# InitModel (random initialisation of P and Q) and Predict are assumed to be
# defined elsewhere; lambda_ replaces the reserved word `lambda`
def LatentFactorModel(user_items, F, N, alpha, lambda_):
[P, Q] = InitModel(user_items, F)
for step in range(0, N):
for user, items in user_items.items():
samples = RandomSelectNegativeSample(items)
for item, rui in samples.items():
eui = rui - Predict(user, item)
for f in range(0, F):
P[user][f] += alpha * (eui * Q[item][f] - lambda_ * P[user][f])
Q[item][f] += alpha * (eui * P[user][f] - lambda_ * Q[item][f])
alpha *= 0.9
return P, Q
def Recommend(user, P, Q):
rank = dict()
# score every item: dot product of the user's and the item's latent factors
# (Q is indexed Q[item][f], consistent with the training loop above)
for item, qi in Q.items():
rank[item] = sum(puf * qi[f] for f, puf in P[user].items())
return rank
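The training loop above relies on external InitModel/Predict helpers; the same SGD updates can be exercised end to end in a self-contained toy version. A sketch (all names, data, and hyperparameters are illustrative, and the 0.9 decay from the book is softened to 0.99 so the toy run converges):

```python
import random

def lfm_train(user_items, F=2, n_iter=200, alpha=0.05, lam=0.01, seed=0):
    """Toy latent-factor training on 0/1 labels."""
    rng = random.Random(seed)
    items = sorted({i for its in user_items.values() for i in its})
    P = {u: [rng.random() for _ in range(F)] for u in user_items}
    Q = {i: [rng.random() for _ in range(F)] for i in items}
    for _ in range(n_iter):
        for u, its in user_items.items():
            for i, rui in its.items():
                # prediction error, then one gradient step per factor
                eui = rui - sum(P[u][f] * Q[i][f] for f in range(F))
                for f in range(F):
                    P[u][f] += alpha * (eui * Q[i][f] - lam * P[u][f])
                    Q[i][f] += alpha * (eui * P[u][f] - lam * Q[i][f])
        alpha *= 0.99  # learning-rate decay, as in the loop above
    return P, Q

# labels: 1 = observed interaction, 0 = sampled negative (toy data)
data = {'u1': {'a': 1, 'b': 0}, 'u2': {'a': 1, 'c': 0}}
P, Q = lfm_train(data)

def score(u, i):
    return sum(P[u][f] * Q[i][f] for f in range(2))

print(score('u1', 'a') > score('u1', 'b'))
```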

View File

@@ -0,0 +1,64 @@
import math
from operator import itemgetter
def ItemSimilarity1(train):
# count co-rated users between item pairs
C = dict()
N = dict()
for u, items in train.items():
for i in items:
N.setdefault(i, 0)
N[i] += 1
for j in items:
if i == j:
continue
C.setdefault(i, {}).setdefault(j, 0)
C[i][j] += 1
# calculate the final similarity matrix W
W = dict()
for i, related_items in C.items():
W[i] = dict()
for j, cij in related_items.items():
W[i][j] = cij / math.sqrt(N[i] * N[j])
return W
def ItemSimilarity2(train):
# same as ItemSimilarity1, but active users contribute less to each pair
C = dict()
N = dict()
for u, items in train.items():
for i in items:
N.setdefault(i, 0)
N[i] += 1
for j in items:
if i == j:
continue
C.setdefault(i, {}).setdefault(j, 0)
C[i][j] += 1 / math.log(1 + len(items) * 1.0)
# calculate the final similarity matrix W
W = dict()
for i, related_items in C.items():
W[i] = dict()
for j, cij in related_items.items():
W[i][j] = cij / math.sqrt(N[i] * N[j])
return W
def Recommendation1(train, user_id, W, K):
rank = dict()
ru = train[user_id]
for i, pi in ru.items():
for j, wj in sorted(W[i].items(), key=itemgetter(1), reverse=True)[0:K]:
if j in ru:
continue
rank.setdefault(j, 0)
rank[j] += pi * wj
return rank
def Recommendation2(train, user_id, W, K):
# like Recommendation1, but also record which seed item produced each score
rank = dict()
ru = train[user_id]
for i, pi in ru.items():
for j, wj in sorted(W[i].items(), key=itemgetter(1), reverse=True)[0:K]:
if j in ru:
continue
rank.setdefault(j, {'weight': 0, 'reason': {}})
rank[j]['weight'] += pi * wj
rank[j]['reason'][i] = pi * wj
return rank
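ItemSimilarity1 reduces to cosine similarity on co-rating counts. A compact Python 3 restatement on toy data (function and data names are illustrative):

```python
import math

def item_similarity(train):
    """Cosine item-item similarity from implicit feedback."""
    C, N = {}, {}                       # co-rating counts, per-item popularity
    for u, items in train.items():
        for i in items:
            N[i] = N.get(i, 0) + 1
            for j in items:
                if i != j:
                    C.setdefault(i, {})
                    C[i][j] = C[i].get(j, 0) + 1
    return {i: {j: cij / math.sqrt(N[i] * N[j]) for j, cij in row.items()}
            for i, row in C.items()}

# a and b are co-rated by two of a's three raters
train = {'u1': {'a': 1, 'b': 1}, 'u2': {'a': 1, 'b': 1}, 'u3': {'a': 1, 'c': 1}}
W = item_similarity(train)
print(round(W['a']['b'], 4))
```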

View File

@@ -0,0 +1,78 @@
import math
from operator import itemgetter
def UserSimilarity1(train):
# O(|U|^2) baseline; train[u] holds the items user u interacted with
W = dict()
for u in train.keys():
W[u] = dict()
for v in train.keys():
if u == v:
continue
W[u][v] = len(set(train[u]) & set(train[v]))
W[u][v] /= math.sqrt(len(train[u]) * len(train[v]) * 1.0)
return W
def UserSimilarity2(train):
# build inverse table for item_users
item_users = dict()
for u, items in train.items():
for i in items.keys():
if i not in item_users:
item_users[i] = set()
item_users[i].add(u)
# calculate co-rated items between users
C = dict()
N = dict()
for i, users in item_users.items():
for u in users:
N.setdefault(u, 0)
N[u] += 1
for v in users:
if u == v:
continue
C.setdefault(u, {}).setdefault(v, 0)
C[u][v] += 1
# calculate the final similarity matrix W
W = dict()
for u, related_users in C.items():
W[u] = dict()
for v, cuv in related_users.items():
W[u][v] = cuv / math.sqrt(N[u] * N[v])
return W
def UserSimilarity3(train):
# same as UserSimilarity2, but popular items contribute less to each pair
item_users = dict()
for u, items in train.items():
for i in items.keys():
if i not in item_users:
item_users[i] = set()
item_users[i].add(u)
C = dict()
N = dict()
for i, users in item_users.items():
for u in users:
N.setdefault(u, 0)
N[u] += 1
for v in users:
if u == v:
continue
C.setdefault(u, {}).setdefault(v, 0)
C[u][v] += 1 / math.log(1 + len(users))
# calculate the final similarity matrix W
W = dict()
for u, related_users in C.items():
W[u] = dict()
for v, cuv in related_users.items():
W[u][v] = cuv / math.sqrt(N[u] * N[v])
return W
def Recommend(user, train, W, K):
rank = dict()
interacted_items = train[user]
for v, wuv in sorted(W[user].items(), key=itemgetter(1), reverse=True)[0:K]:
for i, rvi in train[v].items():
if i in interacted_items:
# filter items the user has interacted with before
continue
rank.setdefault(i, 0)
rank[i] += wuv * rvi
return rank
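The user-user case works the same way, via the item-to-users inverted index. A compact Python 3 restatement of the UserSimilarity2 approach on toy data (names are illustrative):

```python
import math

def user_similarity(train):
    """Cosine user-user similarity via an item -> users inverted index."""
    item_users = {}
    for u, items in train.items():
        for i in items:
            item_users.setdefault(i, set()).add(u)
    C, N = {}, {}                       # co-rated counts, per-user activity
    for i, users in item_users.items():
        for u in users:
            N[u] = N.get(u, 0) + 1
            for v in users:
                if u != v:
                    C.setdefault(u, {})
                    C[u][v] = C[u].get(v, 0) + 1
    return {u: {v: cuv / math.sqrt(N[u] * N[v]) for v, cuv in row.items()}
            for u, row in C.items()}

# u1 and u2 share items a and b; u2 rated three items in total
train = {'u1': {'a': 1, 'b': 1}, 'u2': {'a': 1, 'b': 1, 'c': 1}, 'u3': {'c': 1}}
W = user_similarity(train)
print(round(W['u1']['u2'], 4))
```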

View File

@@ -0,0 +1,220 @@
#!/usr/bin/python
# coding:utf8
'''
Created on 2015-06-22
Update on 2017-05-16
@author: Lockvictor/片刻
Collaborative filtering source code from "Recommender System Practice" (《推荐系统实践》)
Reference: https://github.com/Lockvictor/MovieLens-RecSys
Update repo: https://github.com/apachecn/MachineLearning
'''
import sys
import math
import random
from operator import itemgetter
print(__doc__)
# make the random split reproducible
random.seed(0)
class UserBasedCF():
''' TopN recommendation - UserBasedCF '''
def __init__(self):
self.trainset = {}
self.testset = {}
# n_sim_user: top 20 similar users; n_rec_movie: top 10 recommendations
self.n_sim_user = 20
self.n_rec_movie = 10
# user_sim_mat: similarity between users; movie_popular: rating count per movie; movie_count: total number of movies
self.user_sim_mat = {}
self.movie_popular = {}
self.movie_count = 0
print >> sys.stderr, 'similar user number = %d' % self.n_sim_user
print >> sys.stderr, 'recommended movie number = %d' % self.n_rec_movie
@staticmethod
def loadfile(filename):
"""loadfile (load the file and return a generator)
Args:
filename: file name
Returns:
line: one line of the file, stripped of line endings
"""
fp = open(filename, 'r')
for i, line in enumerate(fp):
yield line.strip('\r\n')
if i > 0 and i % 100000 == 0:
print >> sys.stderr, 'loading %s(%s)' % (filename, i)
fp.close()
print >> sys.stderr, 'load %s success' % filename
def generate_dataset(self, filename, pivot=0.7):
"""generate_dataset (load the file and randomly split the data set 7:3)
Args:
filename: file name
pivot: split ratio
"""
trainset_len = 0
testset_len = 0
for line in self.loadfile(filename):
# UserID::MovieID::Rating::Timestamp
user, movie, rating, timestamp = line.split('::')
# compare a random draw with pivot and initialise the user's entry
if (random.random() < pivot):
# dict.setdefault(key, default=None)
# key -- the key to look up
# default -- the value stored when the key is absent
self.trainset.setdefault(user, {})
self.trainset[user][movie] = int(rating)
trainset_len += 1
else:
self.testset.setdefault(user, {})
self.testset[user][movie] = int(rating)
testset_len += 1
print >> sys.stderr, 'split training set and test set success'
print >> sys.stderr, 'train set = %s' % trainset_len
print >> sys.stderr, 'test set = %s' % testset_len
def calc_user_sim(self):
"""calc_user_sim (compute the similarity between users)"""
# build inverse table for item-users
# key=movieID, value=list of userIDs who have seen this movie
print >> sys.stderr, 'building movie-users inverse table...'
movie2users = dict()
for user, movies in self.trainset.iteritems():
for movie in movies:
# inverse table for item-users
if movie not in movie2users:
movie2users[movie] = set()
movie2users[movie].add(user)
# count item popularity at the same time
if movie not in self.movie_popular:
self.movie_popular[movie] = 0
self.movie_popular[movie] += 1
print >> sys.stderr, 'build movie-users inverse table success'
# save the total movie number, which will be used in evaluation
self.movie_count = len(movie2users)
print >> sys.stderr, 'total movie number = %d' % self.movie_count
usersim_mat = self.user_sim_mat
# count how often two users appear under the same movie
print >> sys.stderr, 'building user co-rated movies matrix...'
for movie, users in movie2users.iteritems():
for u in users:
for v in users:
if u == v:
continue
usersim_mat.setdefault(u, {})
usersim_mat[u].setdefault(v, 0)
usersim_mat[u][v] += 1
print >> sys.stderr, 'build user co-rated movies matrix success'
# calculate similarity matrix
print >> sys.stderr, 'calculating user similarity matrix...'
simfactor_count = 0
PRINT_STEP = 2000000
for u, related_users in usersim_mat.iteritems():
for v, count in related_users.iteritems():
# cosine similarity
usersim_mat[u][v] = count / math.sqrt(len(self.trainset[u]) * len(self.trainset[v]))
simfactor_count += 1
# print progress
if simfactor_count % PRINT_STEP == 0:
print >> sys.stderr, 'calculating user similarity factor(%d)' % simfactor_count
print >> sys.stderr, 'calculate user similarity matrix(similarity factor) success'
print >> sys.stderr, 'Total similarity factor number = %d' % simfactor_count
# @profile
def recommend(self, user):
"""recommend (take the movies watched by the top-K similar users, sum similarities per movie, return the top-N movies)
Args:
user: the user
Returns:
rec_movie: recommended movies, sorted by score in descending order
"""
''' Find K similar users and recommend N movies. '''
K = self.n_sim_user
N = self.n_rec_movie
rank = dict()
watched_movies = self.trainset[user]
# accumulate scores over the top-K most similar users
# v = similar user; wuv = similarity between user and v
# profiling: 50.4% of the time is spent at line 160 (the loop below)
for v, wuv in sorted(self.user_sim_mat[user].items(), key=itemgetter(1), reverse=True)[0:K]:
for movie in self.trainset[v]:
if movie in watched_movies:
continue
# predict the user's "interest" for each movie
rank.setdefault(movie, 0)
rank[movie] += wuv
# return the N best movies
return sorted(rank.items(), key=itemgetter(1), reverse=True)[0:N]
def evaluate(self):
''' return precision, recall, coverage and popularity '''
print >> sys.stderr, 'Evaluation start...'
# top-N recommendations per user
N = self.n_rec_movie
# variables for precision and recall
# hit: hits (a recommended movie also in the test set); rec_count: recommendations per user; test_count: test-set movies per user
hit = 0
rec_count = 0
test_count = 0
# variables for coverage
all_rec_movies = set()
# variables for popularity
popular_sum = 0
for i, user in enumerate(self.trainset):
if i > 0 and i % 500 == 0:
print >> sys.stderr, 'recommended for %d users' % i
test_movies = self.testset.get(user, {})
rec_movies = self.recommend(user)
# compare the recommendations with the test set
for movie, w in rec_movies:
if movie in test_movies:
hit += 1
all_rec_movies.add(movie)
# accumulate log(1 + popularity) of each recommended movie
popular_sum += math.log(1 + self.movie_popular[movie])
rec_count += N
test_count += len(test_movies)
precision = hit / (1.0*rec_count)
recall = hit / (1.0*test_count)
coverage = len(all_rec_movies) / (1.0*self.movie_count)
popularity = popular_sum / (1.0*rec_count)
print >> sys.stderr, 'precision=%.4f \t recall=%.4f \t coverage=%.4f \t popularity=%.4f' % (precision, recall, coverage, popularity)
if __name__ == '__main__':
ratingfile = 'input/16.RecommendedSystem/ml-1m/ratings.dat'
# create the UserCF object
usercf = UserBasedCF()
# split the data 7:3 into a training set and a test set, stored in usercf's trainset/testset
usercf.generate_dataset(ratingfile, pivot=0.7)
# compute the similarity between users
usercf.calc_user_sim()
# evaluate the recommendations
usercf.evaluate()

View File

@@ -1,9 +1,13 @@
#!/usr/bin/env python
# encoding: utf-8
'''
Import the scientific computing package numpy and the operator module
Created on Sep 16, 2010
Update on 2017-05-18
@author: Peter Harrington/羊山
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
from numpy import *
# import the scientific computing package numpy and the operator module
import operator
from os import listdir

View File

@@ -1,6 +1,7 @@
#!/usr/bin/python
# coding: utf8
# original link: http://blog.csdn.net/lsldd/article/details/41223147
# "Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
import numpy as np
from sklearn import tree
from sklearn.metrics import precision_recall_curve

View File

@@ -1,11 +1,11 @@
#!/usr/bin/python
# coding:utf8
'''
Created on Oct 12, 2010
Update on 2017-02-27
Update on 2017-05-18
Decision Tree Source Code for Machine Learning in Action Ch. 3
@author: Peter Harrington/片刻
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
print(__doc__)
import operator

View File

@@ -1,5 +1,11 @@
#!/usr/bin/env python
# -*- coding:utf-8 -*-
'''
Created on Oct 19, 2010
Update on 2017-05-18
@author: Peter Harrington/羊山
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
from numpy import *
"""
p(xy)=p(x|y)p(y)=p(y|x)p(x)

View File

@@ -3,8 +3,10 @@
'''
Created on Oct 27, 2010
Update on 2017-05-18
Logistic Regression Working Module
@author: Peter
@author: Peter Harrington/羊山
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
from numpy import *
import matplotlib.pyplot as plt

View File

@@ -3,9 +3,10 @@
"""
Created on Nov 4, 2010
Update on 2017-03-21
Update on 2017-05-18
Chapter 5 source file for Machine Learning in Action
@author: Peter/geekidentity/片刻
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
"""
from numpy import *
import matplotlib.pyplot as plt

View File

@@ -3,9 +3,10 @@
"""
Created on Nov 4, 2010
Update on 2017-03-21
Update on 2017-05-18
Chapter 5 source file for Machine Learning in Action
@author: Peter/geekidentity/片刻
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
"""
from numpy import *
import matplotlib.pyplot as plt

View File

@@ -3,9 +3,10 @@
"""
Created on Nov 4, 2010
Update on 2017-03-21
Update on 2017-05-18
Chapter 5 source file for Machine Learning in Action
@author: Peter/geekidentity/片刻
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
"""
from numpy import *
import matplotlib.pyplot as plt

View File

@@ -3,8 +3,10 @@
'''
Created on Nov 28, 2010
Update on 2017-05-18
Adaboost is short for Adaptive Boosting
@author: Peter/jiangzhonglian
@author: Peter/片刻
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
from numpy import *

View File

@@ -3,8 +3,10 @@
'''
Created 2017-04-25
Update on 2017-05-18
Random Forest Algorithm on Sonar Dataset
@author: Flying_sfeng/片刻
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
---
Original source: http://www.tuicool.com/articles/iiUfeim
Flying_sfeng's blog: http://blog.csdn.net/flying_sfeng/article/details/64133822

View File

@@ -2,8 +2,10 @@
# coding:utf8
'''
Created by ApacheCN-小瑶
Date: 2017-02-27
Created on Jan 8, 2011
Update on 2017-05-18
@author: Peter Harrington/ApacheCN-小瑶
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''

View File

@@ -1,11 +1,11 @@
#!/usr/bin/python
# coding:utf8
'''
Created on Feb 4, 2011
Update on 2017-03-02
Update on 2017-05-18
Tree-Based Regression Methods Source Code for Machine Learning in Action Ch. 9
@author: Peter Harrington/片刻
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
print(__doc__)
from numpy import *

View File

@@ -3,9 +3,10 @@
'''
Created on 2017-03-08
Update on 2017-03-08
Update on 2017-05-18
Tree-Based Regression Methods Source Code for Machine Learning in Action Ch. 9
@author: Peter/片刻
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
import regTrees
from Tkinter import *