Merge pull request #88 from jiangzhonglian/master

Complete Python code for the recommendation system
This commit is contained in:
片刻
2017-05-18 12:03:13 +08:00
committed by GitHub
30 changed files with 1011057 additions and 20 deletions

View File

@@ -0,0 +1,159 @@
SUMMARY
================================================================================
These files contain 1,000,209 anonymous ratings of approximately 3,900 movies
made by 6,040 MovieLens users who joined MovieLens in 2000.
USAGE LICENSE
================================================================================
Neither the University of Minnesota nor any of the researchers
involved can guarantee the correctness of the data, its suitability
for any particular purpose, or the validity of results based on the
use of the data set. The data set may be used for any research
purposes under the following conditions:
* The user may not state or imply any endorsement from the
University of Minnesota or the GroupLens Research Group.
* The user must acknowledge the use of the data set in
publications resulting from the use of the data set, and must
send us an electronic or paper copy of those publications.
* The user may not redistribute the data without separate
permission.
* The user may not use this information for any commercial or
revenue-bearing purposes without first obtaining permission
from a faculty member of the GroupLens Research Project at the
University of Minnesota.
If you have any further questions or comments, please contact GroupLens
<grouplens-info@cs.umn.edu>.
ACKNOWLEDGEMENTS
================================================================================
Thanks to Shyong Lam and Jon Herlocker for cleaning up and generating the data
set.
FURTHER INFORMATION ABOUT THE GROUPLENS RESEARCH PROJECT
================================================================================
The GroupLens Research Project is a research group in the Department of
Computer Science and Engineering at the University of Minnesota. Members of
the GroupLens Research Project are involved in many research projects related
to the fields of information filtering, collaborative filtering, and
recommender systems. The project is led by professors John Riedl and Joseph
Konstan. The project began to explore automated collaborative filtering in
1992, but is most well known for its worldwide trial of an automated
collaborative filtering system for Usenet news in 1996. Since then the project
has expanded its scope to research overall information filtering solutions,
integrating content-based methods as well as improving current collaborative
filtering technology.
Further information on the GroupLens Research project, including research
publications, can be found at the following web site:
http://www.grouplens.org/
GroupLens Research currently operates a movie recommender based on
collaborative filtering:
http://www.movielens.org/
RATINGS FILE DESCRIPTION
================================================================================
All ratings are contained in the file "ratings.dat" and are in the
following format:
UserID::MovieID::Rating::Timestamp
- UserIDs range between 1 and 6040
- MovieIDs range between 1 and 3952
- Ratings are made on a 5-star scale (whole-star ratings only)
- Timestamp is represented in seconds since the epoch as returned by time(2)
- Each user has at least 20 ratings
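For readers loading the data by hand, the record layout above maps directly onto a one-line parser. A minimal sketch (the function name and the sample record are illustrative):

```python
def parse_rating(line):
    """Split one ratings.dat record of the form UserID::MovieID::Rating::Timestamp."""
    user_id, movie_id, rating, timestamp = line.strip().split('::')
    return int(user_id), int(movie_id), int(rating), int(timestamp)

# Example record in the documented format
print(parse_rating('1::1193::5::978300760'))
```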
USERS FILE DESCRIPTION
================================================================================
User information is in the file "users.dat" and is in the following
format:
UserID::Gender::Age::Occupation::Zip-code
All demographic information is provided voluntarily by the users and is
not checked for accuracy. Only users who have provided some demographic
information are included in this data set.
- Gender is denoted by a "M" for male and "F" for female
- Age is chosen from the following ranges:
* 1: "Under 18"
* 18: "18-24"
* 25: "25-34"
* 35: "35-44"
* 45: "45-49"
* 50: "50-55"
* 56: "56+"
- Occupation is chosen from the following choices:
* 0: "other" or not specified
* 1: "academic/educator"
* 2: "artist"
* 3: "clerical/admin"
* 4: "college/grad student"
* 5: "customer service"
* 6: "doctor/health care"
* 7: "executive/managerial"
* 8: "farmer"
* 9: "homemaker"
* 10: "K-12 student"
* 11: "lawyer"
* 12: "programmer"
* 13: "retired"
* 14: "sales/marketing"
* 15: "scientist"
* 16: "self-employed"
* 17: "technician/engineer"
* 18: "tradesman/craftsman"
* 19: "unemployed"
* 20: "writer"
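The age buckets above are codes, not literal ages, so decoding a users.dat record needs a lookup table. A minimal sketch (the `AGE_RANGES` dict mirrors the list above; the function name and sample line are illustrative):

```python
AGE_RANGES = {1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
              45: "45-49", 50: "50-55", 56: "56+"}

def parse_user(line):
    """Split one users.dat record: UserID::Gender::Age::Occupation::Zip-code."""
    user_id, gender, age, occupation, zip_code = line.strip().split('::')
    return int(user_id), gender, AGE_RANGES[int(age)], int(occupation), zip_code

print(parse_user('1::F::1::10::48067'))
```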
MOVIES FILE DESCRIPTION
================================================================================
Movie information is in the file "movies.dat" and is in the following
format:
MovieID::Title::Genres
- Titles are identical to titles provided by the IMDB (including
year of release)
- Genres are pipe-separated and are selected from the following genres:
* Action
* Adventure
* Animation
* Children's
* Comedy
* Crime
* Documentary
* Drama
* Fantasy
* Film-Noir
* Horror
* Musical
* Mystery
* Romance
* Sci-Fi
* Thriller
* War
* Western
- Some MovieIDs do not correspond to a movie due to accidental duplicate
entries and/or test entries
- Movies are mostly entered by hand, so errors and inconsistencies may exist
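Since genres are pipe-separated within the `::`-delimited record, a movies.dat line splits in two passes. A minimal sketch (names are illustrative):

```python
def parse_movie(line):
    """Split one movies.dat record: MovieID::Title::Genres (genres pipe-separated)."""
    movie_id, title, genres = line.strip().split('::')
    return int(movie_id), title, genres.split('|')

print(parse_movie("1::Toy Story (1995)::Animation|Children's|Comedy"))
```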

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@@ -1,6 +1,11 @@
#!/usr/bin/python
# coding:utf8
'''
Created on 2017-05-18
Update on 2017-05-18
@author: Peter Harrington/山上有课树
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
from numpy import random, mat, eye
'''

View File

@@ -1,6 +1,12 @@
#!/usr/bin/python
# coding:utf8
'''
Created on Feb 16, 2011
Update on 2017-05-18
k Means Clustering for Ch10 of Machine Learning in Action
@author: Peter Harrington/那伊抹微笑
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
from numpy import *

View File

@@ -3,9 +3,10 @@
'''
Created on Mar 24, 2011
Update on 2017-03-16
Update on 2017-05-18
Ch 11 code
@author: Peter/片刻
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
print(__doc__)
from numpy import *

View File

@@ -3,12 +3,14 @@
'''
Created on Jun 14, 2011
Update on 2017-05-18
FP-Growth FP means frequent pattern
the FP-Growth algorithm needs:
1. FP-tree (class treeNode)
2. header table (use dict)
This finds frequent itemsets similar to apriori but does not find association rules.
@author: Peter/片刻
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
print(__doc__)

View File

@@ -3,8 +3,9 @@
'''
Created on Jun 1, 2011
Update on 2017-04-06
Update on 2017-05-18
@author: Peter Harrington/片刻
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
print(__doc__)
from numpy import *

View File

@@ -1,7 +1,11 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-
'''
Created on Mar 8, 2011
Update on 2017-05-18
@author: Peter Harrington/山上有课树
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
from numpy import linalg as la
from numpy import *

View File

@@ -0,0 +1,212 @@
#!/usr/bin/python
# coding:utf8
'''
Created on 2015-06-22
Update on 2017-05-16
@author: Lockvictor/片刻
Collaborative filtering source code from "Recommender System Practice" (《推荐系统实践》)
Reference: https://github.com/Lockvictor/MovieLens-RecSys
Update repo: https://github.com/apachecn/MachineLearning
'''
import sys
import math
import random
from operator import itemgetter
print(__doc__)
# make the random split reproducible
random.seed(0)
class ItemBasedCF():
''' TopN recommendation - ItemBasedCF '''
def __init__(self):
self.trainset = {}
self.testset = {}
# n_sim_movie: top 20 similar movies; n_rec_movie: top 10 recommendations
self.n_sim_movie = 20
self.n_rec_movie = 10
# movie_sim_mat: similarity between movies; movie_popular: rating count per movie; movie_count: total number of movies
self.movie_sim_mat = {}
self.movie_popular = {}
self.movie_count = 0
print >> sys.stderr, 'Similar movie number = %d' % self.n_sim_movie
print >> sys.stderr, 'Recommended movie number = %d' % self.n_rec_movie
@staticmethod
def loadfile(filename):
"""loadfile (load the file and return a generator)
Args:
filename: file name
Returns:
line: one line of the file, stripped of line endings
"""
fp = open(filename, 'r')
for i, line in enumerate(fp):
yield line.strip('\r\n')
if i > 0 and i % 100000 == 0:
print >> sys.stderr, 'loading %s(%s)' % (filename, i)
fp.close()
print >> sys.stderr, 'load %s success' % filename
def generate_dataset(self, filename, pivot=0.7):
"""generate_dataset (load the file and randomly split the data set 7:3)
Args:
filename: file name
pivot: split ratio
"""
trainset_len = 0
testset_len = 0
for line in self.loadfile(filename):
# UserID::MovieID::Rating::Timestamp
user, movie, rating, _ = line.split('::')
# compare a random draw with pivot and initialise the user's entry
if (random.random() < pivot):
# dict.setdefault(key, default=None)
# key -- the key to look up
# default -- the value stored when the key is absent
self.trainset.setdefault(user, {})
self.trainset[user][movie] = int(rating)
trainset_len += 1
else:
self.testset.setdefault(user, {})
self.testset[user][movie] = int(rating)
testset_len += 1
print >> sys.stderr, 'split training set and test set success'
print >> sys.stderr, 'train set = %s' % trainset_len
print >> sys.stderr, 'test set = %s' % testset_len
def calc_movie_sim(self):
"""calc_movie_sim (compute the similarity between movies)"""
print >> sys.stderr, 'counting movies number and popularity...'
for user, movies in self.trainset.iteritems():
for movie in movies:
# count item popularity
if movie not in self.movie_popular:
self.movie_popular[movie] = 0
self.movie_popular[movie] += 1
print >> sys.stderr, 'count movies number and popularity success'
# save the total number of movies
self.movie_count = len(self.movie_popular)
print >> sys.stderr, 'total movie number = %d' % self.movie_count
# count how often two movies are rated by the same user
itemsim_mat = self.movie_sim_mat
print >> sys.stderr, 'building co-rated users matrix...'
for user, movies in self.trainset.iteritems():
for m1 in movies:
for m2 in movies:
if m1 == m2:
continue
itemsim_mat.setdefault(m1, {})
itemsim_mat[m1].setdefault(m2, 0)
itemsim_mat[m1][m2] += 1
print >> sys.stderr, 'build co-rated users matrix success'
# calculate similarity matrix
print >> sys.stderr, 'calculating movie similarity matrix...'
simfactor_count = 0
PRINT_STEP = 2000000
for m1, related_movies in itemsim_mat.iteritems():
for m2, count in related_movies.iteritems():
# cosine similarity
itemsim_mat[m1][m2] = count / math.sqrt(self.movie_popular[m1] * self.movie_popular[m2])
simfactor_count += 1
# print progress
if simfactor_count % PRINT_STEP == 0:
print >> sys.stderr, 'calculating movie similarity factor(%d)' % simfactor_count
print >> sys.stderr, 'calculate movie similarity matrix(similarity factor) success'
print >> sys.stderr, 'Total similarity factor number = %d' % simfactor_count
# @profile
def recommend(self, user):
"""recommend (take the top-K similar movies, sum similarities per candidate, return the top-N movies)
Args:
user: the user
Returns:
rec_movie: recommended movies, sorted by score in descending order
"""
''' Find K similar movies and recommend N movies. '''
K = self.n_sim_movie
N = self.n_rec_movie
rank = {}
watched_movies = self.trainset[user]
# accumulate scores over the top-K most similar movies
# rating = the user's rating; w = similarity between the two movies
# profiling: 98.2% of the time is spent at line 154 (the sorted call below)
for movie, rating in watched_movies.iteritems():
for related_movie, w in sorted(self.movie_sim_mat[movie].items(), key=itemgetter(1), reverse=True)[0:K]:
if related_movie in watched_movies:
continue
rank.setdefault(related_movie, 0)
rank[related_movie] += w * rating
# return the N best movies
return sorted(rank.items(), key=itemgetter(1), reverse=True)[0:N]
def evaluate(self):
''' return precision, recall, coverage and popularity '''
print >> sys.stderr, 'Evaluation start...'
# top-N recommendations per user
N = self.n_rec_movie
# variables for precision and recall
# hit: hits (a recommended movie also in the test set); rec_count: recommendations per user; test_count: test-set movies per user
hit = 0
rec_count = 0
test_count = 0
# variables for coverage
all_rec_movies = set()
# variables for popularity
popular_sum = 0
for i, user in enumerate(self.trainset):
if i > 0 and i % 500 == 0:
print >> sys.stderr, 'recommended for %d users' % i
test_movies = self.testset.get(user, {})
rec_movies = self.recommend(user)
# compare the recommendations with the test set
for movie, w in rec_movies:
if movie in test_movies:
hit += 1
all_rec_movies.add(movie)
# accumulate log(1 + popularity) of each recommended movie
popular_sum += math.log(1 + self.movie_popular[movie])
rec_count += N
test_count += len(test_movies)
precision = hit / (1.0 * rec_count)
recall = hit / (1.0 * test_count)
coverage = len(all_rec_movies) / (1.0 * self.movie_count)
popularity = popular_sum / (1.0 * rec_count)
print >> sys.stderr, 'precision=%.4f \t recall=%.4f \t coverage=%.4f \t popularity=%.4f' % (precision, recall, coverage, popularity)
if __name__ == '__main__':
ratingfile = 'input/16.RecommendedSystem/ml-1m/ratings.dat'
# create the ItemCF object
itemcf = ItemBasedCF()
# split the data 7:3 into a training set and a test set, stored in itemcf's trainset/testset
itemcf.generate_dataset(ratingfile, pivot=0.7)
# compute the similarity between movies
itemcf.calc_movie_sim()
# evaluate the recommendations
itemcf.evaluate()

View File

@@ -0,0 +1,72 @@
import math
import random
# GetRecommendation(user, N) is assumed to be provided by the recommender under test
def SplitData(data, M, k, seed):
test = []
train = []
random.seed(seed)
for user, item in data:
# randint is inclusive at both ends, so draw from 0..M-1 for an exact 1/M test share
if random.randint(0, M - 1) == k:
test.append([user, item])
else:
train.append([user, item])
return train, test
# Precision
def Precision(train, test, N):
hit = 0
n_rec = 0
for user in train.keys():
tu = test.get(user, {})
rank = GetRecommendation(user, N)
for item, pui in rank:
if item in tu:
hit += 1
n_rec += N
return hit / (n_rec * 1.0)
# Recall
def Recall(train, test, N):
hit = 0
n_test = 0
for user in train.keys():
tu = test.get(user, {})
rank = GetRecommendation(user, N)
for item, pui in rank:
if item in tu:
hit += 1
n_test += len(tu)
return hit / (n_test * 1.0)
# Coverage
def Coverage(train, test, N):
recommend_items = set()
all_items = set()
for user in train.keys():
for item in train[user].keys():
all_items.add(item)
rank = GetRecommendation(user, N)
for item, pui in rank:
recommend_items.add(item)
return len(recommend_items) / (len(all_items) * 1.0)
# Novelty (average log-popularity of recommended items)
def Popularity(train, test, N):
item_popularity = dict()
for user, items in train.items():
for item in items.keys():
if item not in item_popularity:
item_popularity[item] = 0
item_popularity[item] += 1
ret = 0
n = 0
for user in train.keys():
rank = GetRecommendation(user, N)
for item, pui in rank:
ret += math.log(1 + item_popularity[item])
n += 1
ret /= n * 1.0
return ret
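The precision and recall functions above depend on an external GetRecommendation; the same bookkeeping can be checked standalone against fixed recommendation lists. A toy sketch (`precision_recall`, the users, and the item names are all hypothetical):

```python
def precision_recall(test, recommendations):
    """Precision and recall of fixed top-N lists against held-out test items."""
    hit = rec_total = test_total = 0
    for user, rec in recommendations.items():
        tu = test.get(user, set())            # held-out items for this user
        hit += sum(1 for item in rec if item in tu)
        rec_total += len(rec)                 # total recommendations made
        test_total += len(tu)                 # total held-out items
    return hit / rec_total, hit / test_total

# hypothetical held-out items and recommendation lists
test = {'u1': {'a', 'b'}, 'u2': {'c'}}
recs = {'u1': ['a', 'x'], 'u2': ['c', 'y']}
p, r = precision_recall(test, recs)
print(p, r)
```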

View File

@@ -0,0 +1,17 @@
def PersonalRank(G, alpha, root):
# rank[x]: probability that a random walk restarted at root visits node x
rank = {x: 0 for x in G.keys()}
rank[root] = 1
for k in range(20):
tmp = {x: 0 for x in G.keys()}
# with probability alpha the walker follows a random out-edge
for i, ri in G.items():
for j in ri:
tmp[j] += alpha * rank[i] / (1.0 * len(ri))
# with probability 1 - alpha it restarts at root (added once per iteration,
# not once per edge; the hard-coded 0.6 is replaced by alpha)
tmp[root] += 1 - alpha
rank = tmp
return rank
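PersonalRank can be sanity-checked on a tiny user-item bipartite graph. Everything below is illustrative; `personal_rank` is a Python 3 restatement of the function above, with the restart term applied once per iteration so the scores stay a probability distribution:

```python
def personal_rank(G, alpha, root, iters=20):
    """Random walk with restart on a bipartite graph G (adjacency dict)."""
    rank = {x: 0.0 for x in G}
    rank[root] = 1.0
    for _ in range(iters):
        tmp = {x: 0.0 for x in G}
        for i, ri in G.items():
            for j in ri:                      # follow an out-edge w.p. alpha
                tmp[j] += alpha * rank[i] / len(ri)
        tmp[root] += 1 - alpha                # restart at root w.p. 1 - alpha
        rank = tmp
    return rank

# toy bipartite graph: users A, B and items a, b, c (hypothetical)
G = {'A': {'a': 1, 'b': 1}, 'B': {'b': 1, 'c': 1},
     'a': {'A': 1}, 'b': {'A': 1, 'B': 1}, 'c': {'B': 1}}
scores = personal_rank(G, 0.8, 'A')
print(max(scores, key=scores.get))
```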

View File

@@ -0,0 +1,41 @@
# negative-sample selection: keep the user's positives (label 1) and sample
# roughly as many unseen items as negatives (label 0)
# items_pool: global list of all items, assumed to be defined by the caller
def RandomSelectNegativeSample(items):
ret = dict()
for i in items.keys():
ret[i] = 1
n = 0
for i in range(0, len(items) * 3):
item = items_pool[random.randint(0, len(items_pool) - 1)]
if item in ret:
continue
ret[item] = 0
n += 1
if n > len(items):
break
return ret
# InitModel (random initialisation of P and Q) and Predict are assumed to be
# defined elsewhere; lambda_ replaces the reserved word `lambda`
def LatentFactorModel(user_items, F, N, alpha, lambda_):
[P, Q] = InitModel(user_items, F)
for step in range(0, N):
for user, items in user_items.items():
samples = RandomSelectNegativeSample(items)
for item, rui in samples.items():
eui = rui - Predict(user, item)
for f in range(0, F):
P[user][f] += alpha * (eui * Q[item][f] - lambda_ * P[user][f])
Q[item][f] += alpha * (eui * P[user][f] - lambda_ * Q[item][f])
alpha *= 0.9
return P, Q
def Recommend(user, P, Q):
rank = dict()
# score every item: dot product of the user's and the item's latent factors
# (Q is indexed Q[item][f], consistent with the training loop above)
for item, qi in Q.items():
rank[item] = sum(puf * qi[f] for f, puf in P[user].items())
return rank
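The training loop above relies on external InitModel/Predict helpers; the same SGD updates can be exercised end to end in a self-contained toy version. A sketch (all names, data, and hyperparameters are illustrative, and the 0.9 decay from the book is softened to 0.99 so the toy run converges):

```python
import random

def lfm_train(user_items, F=2, n_iter=200, alpha=0.05, lam=0.01, seed=0):
    """Toy latent-factor training on 0/1 labels."""
    rng = random.Random(seed)
    items = sorted({i for its in user_items.values() for i in its})
    P = {u: [rng.random() for _ in range(F)] for u in user_items}
    Q = {i: [rng.random() for _ in range(F)] for i in items}
    for _ in range(n_iter):
        for u, its in user_items.items():
            for i, rui in its.items():
                # prediction error, then one gradient step per factor
                eui = rui - sum(P[u][f] * Q[i][f] for f in range(F))
                for f in range(F):
                    P[u][f] += alpha * (eui * Q[i][f] - lam * P[u][f])
                    Q[i][f] += alpha * (eui * P[u][f] - lam * Q[i][f])
        alpha *= 0.99  # learning-rate decay, as in the loop above
    return P, Q

# labels: 1 = observed interaction, 0 = sampled negative (toy data)
data = {'u1': {'a': 1, 'b': 0}, 'u2': {'a': 1, 'c': 0}}
P, Q = lfm_train(data)

def score(u, i):
    return sum(P[u][f] * Q[i][f] for f in range(2))

print(score('u1', 'a') > score('u1', 'b'))
```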

View File

@@ -0,0 +1,64 @@
import math
from operator import itemgetter
def ItemSimilarity1(train):
# count co-rated users between item pairs
C = dict()
N = dict()
for u, items in train.items():
for i in items:
N.setdefault(i, 0)
N[i] += 1
for j in items:
if i == j:
continue
C.setdefault(i, {}).setdefault(j, 0)
C[i][j] += 1
# calculate the final similarity matrix W
W = dict()
for i, related_items in C.items():
W[i] = dict()
for j, cij in related_items.items():
W[i][j] = cij / math.sqrt(N[i] * N[j])
return W
def ItemSimilarity2(train):
# same as ItemSimilarity1, but active users contribute less to each pair
C = dict()
N = dict()
for u, items in train.items():
for i in items:
N.setdefault(i, 0)
N[i] += 1
for j in items:
if i == j:
continue
C.setdefault(i, {}).setdefault(j, 0)
C[i][j] += 1 / math.log(1 + len(items) * 1.0)
# calculate the final similarity matrix W
W = dict()
for i, related_items in C.items():
W[i] = dict()
for j, cij in related_items.items():
W[i][j] = cij / math.sqrt(N[i] * N[j])
return W
def Recommendation1(train, user_id, W, K):
rank = dict()
ru = train[user_id]
for i, pi in ru.items():
for j, wj in sorted(W[i].items(), key=itemgetter(1), reverse=True)[0:K]:
if j in ru:
continue
rank.setdefault(j, 0)
rank[j] += pi * wj
return rank
def Recommendation2(train, user_id, W, K):
# like Recommendation1, but also record which seed item produced each score
rank = dict()
ru = train[user_id]
for i, pi in ru.items():
for j, wj in sorted(W[i].items(), key=itemgetter(1), reverse=True)[0:K]:
if j in ru:
continue
rank.setdefault(j, {'weight': 0, 'reason': {}})
rank[j]['weight'] += pi * wj
rank[j]['reason'][i] = pi * wj
return rank
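ItemSimilarity1 reduces to cosine similarity on co-rating counts. A compact Python 3 restatement on toy data (function and data names are illustrative):

```python
import math

def item_similarity(train):
    """Cosine item-item similarity from implicit feedback."""
    C, N = {}, {}                       # co-rating counts, per-item popularity
    for u, items in train.items():
        for i in items:
            N[i] = N.get(i, 0) + 1
            for j in items:
                if i != j:
                    C.setdefault(i, {})
                    C[i][j] = C[i].get(j, 0) + 1
    return {i: {j: cij / math.sqrt(N[i] * N[j]) for j, cij in row.items()}
            for i, row in C.items()}

# a and b are co-rated by two of a's three raters
train = {'u1': {'a': 1, 'b': 1}, 'u2': {'a': 1, 'b': 1}, 'u3': {'a': 1, 'c': 1}}
W = item_similarity(train)
print(round(W['a']['b'], 4))
```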

View File

@@ -0,0 +1,78 @@
import math
from operator import itemgetter
def UserSimilarity1(train):
# O(|U|^2) baseline; train[u] holds the items user u interacted with
W = dict()
for u in train.keys():
W[u] = dict()
for v in train.keys():
if u == v:
continue
W[u][v] = len(set(train[u]) & set(train[v]))
W[u][v] /= math.sqrt(len(train[u]) * len(train[v]) * 1.0)
return W
def UserSimilarity2(train):
# build inverse table for item_users
item_users = dict()
for u, items in train.items():
for i in items.keys():
if i not in item_users:
item_users[i] = set()
item_users[i].add(u)
# calculate co-rated items between users
C = dict()
N = dict()
for i, users in item_users.items():
for u in users:
N.setdefault(u, 0)
N[u] += 1
for v in users:
if u == v:
continue
C.setdefault(u, {}).setdefault(v, 0)
C[u][v] += 1
# calculate the final similarity matrix W
W = dict()
for u, related_users in C.items():
W[u] = dict()
for v, cuv in related_users.items():
W[u][v] = cuv / math.sqrt(N[u] * N[v])
return W
def UserSimilarity3(train):
# same as UserSimilarity2, but popular items contribute less to each pair
item_users = dict()
for u, items in train.items():
for i in items.keys():
if i not in item_users:
item_users[i] = set()
item_users[i].add(u)
C = dict()
N = dict()
for i, users in item_users.items():
for u in users:
N.setdefault(u, 0)
N[u] += 1
for v in users:
if u == v:
continue
C.setdefault(u, {}).setdefault(v, 0)
C[u][v] += 1 / math.log(1 + len(users))
# calculate the final similarity matrix W
W = dict()
for u, related_users in C.items():
W[u] = dict()
for v, cuv in related_users.items():
W[u][v] = cuv / math.sqrt(N[u] * N[v])
return W
def Recommend(user, train, W, K):
rank = dict()
interacted_items = train[user]
for v, wuv in sorted(W[user].items(), key=itemgetter(1), reverse=True)[0:K]:
for i, rvi in train[v].items():
if i in interacted_items:
# filter items the user has interacted with before
continue
rank.setdefault(i, 0)
rank[i] += wuv * rvi
return rank
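The user-user case works the same way, via the item-to-users inverted index. A compact Python 3 restatement of the UserSimilarity2 approach on toy data (names are illustrative):

```python
import math

def user_similarity(train):
    """Cosine user-user similarity via an item -> users inverted index."""
    item_users = {}
    for u, items in train.items():
        for i in items:
            item_users.setdefault(i, set()).add(u)
    C, N = {}, {}                       # co-rated counts, per-user activity
    for i, users in item_users.items():
        for u in users:
            N[u] = N.get(u, 0) + 1
            for v in users:
                if u != v:
                    C.setdefault(u, {})
                    C[u][v] = C[u].get(v, 0) + 1
    return {u: {v: cuv / math.sqrt(N[u] * N[v]) for v, cuv in row.items()}
            for u, row in C.items()}

# u1 and u2 share items a and b; u2 rated three items in total
train = {'u1': {'a': 1, 'b': 1}, 'u2': {'a': 1, 'b': 1, 'c': 1}, 'u3': {'c': 1}}
W = user_similarity(train)
print(round(W['u1']['u2'], 4))
```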

View File

@@ -0,0 +1,220 @@
#!/usr/bin/python
# coding:utf8
'''
Created on 2015-06-22
Update on 2017-05-16
@author: Lockvictor/片刻
Collaborative filtering source code from "Recommender System Practice" (《推荐系统实践》)
Reference: https://github.com/Lockvictor/MovieLens-RecSys
Update repo: https://github.com/apachecn/MachineLearning
'''
import sys
import math
import random
from operator import itemgetter
print(__doc__)
# make the random split reproducible
random.seed(0)
class UserBasedCF():
''' TopN recommendation - UserBasedCF '''
def __init__(self):
self.trainset = {}
self.testset = {}
# n_sim_user: top 20 similar users; n_rec_movie: top 10 recommendations
self.n_sim_user = 20
self.n_rec_movie = 10
# user_sim_mat: similarity between users; movie_popular: rating count per movie; movie_count: total number of movies
self.user_sim_mat = {}
self.movie_popular = {}
self.movie_count = 0
print >> sys.stderr, 'similar user number = %d' % self.n_sim_user
print >> sys.stderr, 'recommended movie number = %d' % self.n_rec_movie
@staticmethod
def loadfile(filename):
"""loadfile (load the file and return a generator)
Args:
filename: file name
Returns:
line: one line of the file, stripped of line endings
"""
fp = open(filename, 'r')
for i, line in enumerate(fp):
yield line.strip('\r\n')
if i > 0 and i % 100000 == 0:
print >> sys.stderr, 'loading %s(%s)' % (filename, i)
fp.close()
print >> sys.stderr, 'load %s success' % filename
def generate_dataset(self, filename, pivot=0.7):
"""generate_dataset (load the file and randomly split the data set 7:3)
Args:
filename: file name
pivot: split ratio
"""
trainset_len = 0
testset_len = 0
for line in self.loadfile(filename):
# UserID::MovieID::Rating::Timestamp
user, movie, rating, timestamp = line.split('::')
# compare a random draw with pivot and initialise the user's entry
if (random.random() < pivot):
# dict.setdefault(key, default=None)
# key -- the key to look up
# default -- the value stored when the key is absent
self.trainset.setdefault(user, {})
self.trainset[user][movie] = int(rating)
trainset_len += 1
else:
self.testset.setdefault(user, {})
self.testset[user][movie] = int(rating)
testset_len += 1
print >> sys.stderr, 'split training set and test set success'
print >> sys.stderr, 'train set = %s' % trainset_len
print >> sys.stderr, 'test set = %s' % testset_len
def calc_user_sim(self):
"""calc_user_sim (compute the similarity between users)"""
# build inverse table for item-users
# key=movieID, value=list of userIDs who have seen this movie
print >> sys.stderr, 'building movie-users inverse table...'
movie2users = dict()
for user, movies in self.trainset.iteritems():
for movie in movies:
# inverse table for item-users
if movie not in movie2users:
movie2users[movie] = set()
movie2users[movie].add(user)
# count item popularity at the same time
if movie not in self.movie_popular:
self.movie_popular[movie] = 0
self.movie_popular[movie] += 1
print >> sys.stderr, 'build movie-users inverse table success'
# save the total movie number, which will be used in evaluation
self.movie_count = len(movie2users)
print >> sys.stderr, 'total movie number = %d' % self.movie_count
usersim_mat = self.user_sim_mat
# count how often two users appear under the same movie
print >> sys.stderr, 'building user co-rated movies matrix...'
for movie, users in movie2users.iteritems():
for u in users:
for v in users:
if u == v:
continue
usersim_mat.setdefault(u, {})
usersim_mat[u].setdefault(v, 0)
usersim_mat[u][v] += 1
print >> sys.stderr, 'build user co-rated movies matrix success'
# calculate similarity matrix
print >> sys.stderr, 'calculating user similarity matrix...'
simfactor_count = 0
PRINT_STEP = 2000000
for u, related_users in usersim_mat.iteritems():
for v, count in related_users.iteritems():
# cosine similarity
usersim_mat[u][v] = count / math.sqrt(len(self.trainset[u]) * len(self.trainset[v]))
simfactor_count += 1
# print progress
if simfactor_count % PRINT_STEP == 0:
print >> sys.stderr, 'calculating user similarity factor(%d)' % simfactor_count
print >> sys.stderr, 'calculate user similarity matrix(similarity factor) success'
print >> sys.stderr, 'Total similarity factor number = %d' % simfactor_count
# @profile
def recommend(self, user):
"""recommend (take the movies watched by the top-K similar users, sum similarities per movie, return the top-N movies)
Args:
user: the user
Returns:
rec_movie: recommended movies, sorted by score in descending order
"""
''' Find K similar users and recommend N movies. '''
K = self.n_sim_user
N = self.n_rec_movie
rank = dict()
watched_movies = self.trainset[user]
# accumulate scores over the top-K most similar users
# v = similar user; wuv = similarity between user and v
# profiling: 50.4% of the time is spent at line 160 (the loop below)
for v, wuv in sorted(self.user_sim_mat[user].items(), key=itemgetter(1), reverse=True)[0:K]:
for movie in self.trainset[v]:
if movie in watched_movies:
continue
# predict the user's "interest" for each movie
rank.setdefault(movie, 0)
rank[movie] += wuv
# return the N best movies
return sorted(rank.items(), key=itemgetter(1), reverse=True)[0:N]
def evaluate(self):
''' return precision, recall, coverage and popularity '''
print >> sys.stderr, 'Evaluation start...'
# top-N recommendations per user
N = self.n_rec_movie
# variables for precision and recall
# hit: hits (a recommended movie also in the test set); rec_count: recommendations per user; test_count: test-set movies per user
hit = 0
rec_count = 0
test_count = 0
# variables for coverage
all_rec_movies = set()
# variables for popularity
popular_sum = 0
for i, user in enumerate(self.trainset):
if i > 0 and i % 500 == 0:
print >> sys.stderr, 'recommended for %d users' % i
test_movies = self.testset.get(user, {})
rec_movies = self.recommend(user)
# compare the recommendations with the test set
for movie, w in rec_movies:
if movie in test_movies:
hit += 1
all_rec_movies.add(movie)
# accumulate log(1 + popularity) of each recommended movie
popular_sum += math.log(1 + self.movie_popular[movie])
rec_count += N
test_count += len(test_movies)
precision = hit / (1.0*rec_count)
recall = hit / (1.0*test_count)
coverage = len(all_rec_movies) / (1.0*self.movie_count)
popularity = popular_sum / (1.0*rec_count)
print >> sys.stderr, 'precision=%.4f \t recall=%.4f \t coverage=%.4f \t popularity=%.4f' % (precision, recall, coverage, popularity)
if __name__ == '__main__':
ratingfile = 'input/16.RecommendedSystem/ml-1m/ratings.dat'
# create the UserCF object
usercf = UserBasedCF()
# split the data 7:3 into a training set and a test set, stored in usercf's trainset/testset
usercf.generate_dataset(ratingfile, pivot=0.7)
# compute the similarity between users
usercf.calc_user_sim()
# evaluate the recommendations
usercf.evaluate()

View File

@@ -1,9 +1,13 @@
#!/usr/bin/env python
# encoding: utf-8
'''
Import the scientific computing package numpy and the operator module
Created on Sep 16, 2010
Update on 2017-05-18
@author: Peter Harrington/羊山
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
from numpy import *
# import the scientific computing package numpy and the operator module
import operator
from os import listdir

View File

@@ -1,6 +1,7 @@
#!/usr/bin/python
# coding: utf8
# original link: http://blog.csdn.net/lsldd/article/details/41223147
# "Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
import numpy as np
from sklearn import tree
from sklearn.metrics import precision_recall_curve

View File

@@ -1,11 +1,11 @@
#!/usr/bin/python
# coding:utf8
'''
Created on Oct 12, 2010
Update on 2017-02-27
Update on 2017-05-18
Decision Tree Source Code for Machine Learning in Action Ch. 3
@author: Peter Harrington/片刻
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
print(__doc__)
import operator

View File

@@ -1,5 +1,11 @@
#!/usr/bin/env python
# -*- coding:utf-8 -*-
'''
Created on Oct 19, 2010
Update on 2017-05-18
@author: Peter Harrington/羊山
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
from numpy import *
"""
p(xy)=p(x|y)p(y)=p(y|x)p(x)

View File

@@ -3,8 +3,10 @@
'''
Created on Oct 27, 2010
Update on 2017-05-18
Logistic Regression Working Module
@author: Peter
@author: Peter Harrington/羊山
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
from numpy import *
import matplotlib.pyplot as plt

View File

@@ -3,9 +3,10 @@
"""
Created on Nov 4, 2010
Update on 2017-03-21
Update on 2017-05-18
Chapter 5 source file for Machine Learning in Action
@author: Peter/geekidentity/片刻
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
"""
from numpy import *
import matplotlib.pyplot as plt

View File

@@ -3,9 +3,10 @@
"""
Created on Nov 4, 2010
Update on 2017-03-21
Update on 2017-05-18
Chapter 5 source file for Machine Learning in Action
@author: Peter/geekidentity/片刻
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
"""
from numpy import *
import matplotlib.pyplot as plt

View File

@@ -3,9 +3,10 @@
"""
Created on Nov 4, 2010
Update on 2017-03-21
Update on 2017-05-18
Chapter 5 source file for Machine Learning in Action
@author: Peter/geekidentity/片刻
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
"""
from numpy import *
import matplotlib.pyplot as plt

View File

@@ -3,8 +3,10 @@
'''
Created on Nov 28, 2010
Update on 2017-05-18
Adaboost is short for Adaptive Boosting
@author: Peter/jiangzhonglian
@author: Peter/片刻
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
from numpy import *

View File

@@ -3,8 +3,10 @@
'''
Created 2017-04-25
Update on 2017-05-18
Random Forest Algorithm on Sonar Dataset
@author: Flying_sfeng/片刻
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
---
Original source: http://www.tuicool.com/articles/iiUfeim
Flying_sfeng's blog: http://blog.csdn.net/flying_sfeng/article/details/64133822

View File

@@ -2,8 +2,10 @@
# coding:utf8
'''
Created by ApacheCN-小瑶
Date: 2017-02-27
Created on Jan 8, 2011
Update on 2017-05-18
@author: Peter Harrington/ApacheCN-小瑶
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''

View File

@@ -1,11 +1,11 @@
#!/usr/bin/python
# coding:utf8
'''
Created on Feb 4, 2011
Update on 2017-03-02
Update on 2017-05-18
Tree-Based Regression Methods Source Code for Machine Learning in Action Ch. 9
@author: Peter Harrington/片刻
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
print(__doc__)
from numpy import *

View File

@@ -3,9 +3,10 @@
'''
Created on 2017-03-08
Update on 2017-03-08
Update on 2017-05-18
Tree-Based Regression Methods Source Code for Machine Learning in Action Ch. 9
@author: Peter/片刻
"Machine Learning in Action" update repo: https://github.com/apachecn/MachineLearning
'''
import regTrees
from Tkinter import *