Update documentation data

jiangzhonglian
2017-07-04 13:25:17 +08:00
parent a01f5a12fd
commit 08ecfc4086
31 changed files with 803046 additions and 10 deletions

View File

@@ -14,7 +14,7 @@
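Machine learning: machine learning is the most fundamental of these fields (and currently one of the hottest areas among startups and research labs).
In the early 1990s, people began to realize that there was a more effective way to build pattern recognition algorithms: replace the experts (people with extensive knowledge of images) with data (which can be collected using cheap labor).
"Machine learning" emphasizes that, after some data is fed into a computer program (or machine), the program must do something with it, namely learn from that data, and this learning step is explicit.
Machine learning is the discipline that studies how computers can simulate or implement human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to keep improving their own performance.
Deep learning: deep learning is a very new and influential frontier field, and we do not even contemplate a post-deep-learning era yet.
Deep learning is a newer area of machine learning research. Its motivation is to build neural networks that simulate how the human brain analyzes and learns; it mimics the mechanisms of the human brain to interpret data such as images, sound, and text.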
@@ -36,18 +36,18 @@ http://baike.baidu.com/link?url=76P-uA4EBrC3G-I__P1tqeO7eoDS709Kp4wYuHxc7GNkz_xn
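## A Brief Overview of Machine Learning
`Machine learning` turns unordered data into useful information; it helps us cut through the haze of data and extract the useful information from it.
* 1. Massive amounts of data need to be collected
* 2. Only then can useful information be extracted from that massive data
* 1. Collect massive amounts of data
* 2. Extract useful information from the massive data
## Main Tasks of Machine Learning
> The main tasks of machine learning are classification and regression
* Classification: assign instance data to the appropriate category.
* Regression: mainly used to predict numeric data. (Example: curve fitting, i.e. the best-fit curve through the given data points)
* Regression: mainly used to predict numeric data. (Example: fitting the optimal curve to the given data points)
* Target variable
* The target variable is the result that a machine learning algorithm predicts.
* In classification algorithms the target variable is usually nominal, while in regression algorithms it is usually continuous
* In classification algorithms the target variable is usually nominal (e.g. true or false), while in regression algorithms it is usually continuous (e.g. 1~100)
* The machine learning training process
* ![Machine learning training process diagram](/images/1.MLFoundation/机器学习基础训练过程.png)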
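The difference between the two tasks comes down to the type of the target variable. The following is a minimal sketch in plain Python, using made-up data; the helper names `majority_class` and `fit_line` are hypothetical and are not taken from this repository. The first predicts a nominal label, the second fits a least-squares line to numeric targets.

```python
from collections import Counter


def majority_class(labels):
    """Classification baseline: predict the most common nominal label."""
    return Counter(labels).most_common(1)[0][0]


def fit_line(xs, ys):
    """Regression: least-squares line y = a * x + b through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Nominal target (true / false) versus continuous target (numbers).
label = majority_class([True, False, True, True])
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

A real classifier would of course use the features rather than only the label counts; the baseline is here only to contrast the two output types.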
@@ -57,7 +57,7 @@ http://baike.baidu.com/link?url=76P-uA4EBrC3G-I__P1tqeO7eoDS709Kp4wYuHxc7GNkz_xn
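* You must know what you are predicting, that is, you must know the class information of the target variable. Classification and regression are both supervised learning.
* Sample set: training data + test data
* Training sample = feature + target variable (label)
* The collection of training samples is called the training sample set. The value of the target variable must be known with certainty for the training sample set, so that the machine learning algorithm can discover the relationship between the features and the target variable.
* The collection of training samples is called the training sample set. The training sample set must have the value of the target variable determined, so that the machine learning algorithm can discover the relationship between the features and the target variable.
* Feature (are there missing values?) + target variable (classification: discrete values <A/B/C, yes/no>; regression: continuous values <0~100, -999999>)
* Features, or attributes, are usually the columns of the training sample set. They are independently measured results, and multiple features together make up one training sample.
* `Knowledge representation` (for example, the process by which the machine has learned to recognize birds)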
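The vocabulary above (a sample is features plus a target variable, and a sample set is training data plus test data) can be sketched as follows. The `Sample` class, the `split_samples` helper, and the 80/20 ratio are illustrative assumptions, not part of this repository.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    features: List[float]  # independently measured attributes (one row's columns)
    label: str             # target variable; nominal here, so this is classification


def split_samples(samples, test_ratio=0.2, seed=0):
    """Shuffle a copy of the sample set and split it into training and test data."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Ten made-up samples with two features each and a label of "A" or "B".
samples = [Sample([float(i), float(i % 3)], "A" if i % 2 else "B") for i in range(10)]
train, test = split_samples(samples)
```

Every sample keeps its label in both parts; the test labels are simply withheld from the algorithm until evaluation.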
@@ -76,7 +76,7 @@ http://baike.baidu.com/link?url=76P-uA4EBrC3G-I__P1tqeO7eoDS709Kp4wYuHxc7GNkz_xn
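![Algorithm summary](/images/1.MLFoundation/ml_algorithm.jpg)
## Reasons to Learn Machine Learning
## Learning Machine Learning
* Two questions to consider when choosing an algorithm
* The purpose of using a machine learning algorithm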

View File

@@ -24,8 +24,8 @@
> Introduction to the Sigmoid function
```
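We want a function that accepts any input and predicts the class. For example, in the two-class case, such a function outputs 0 or 1. Functions of this kind are called Heaviside step functions, or simply unit step functions.
However, the problem with the Heaviside step function is that it jumps from 0 to 1 instantaneously at the jump point, and this instantaneous jump is sometimes hard to handle. Fortunately, another function has a similar property (here meaning that its output lies between 0 and 1)
and is easier to handle mathematically: the Sigmoid function, introduced below.
The formula for the Sigmoid function is as follows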

View File

@@ -0,0 +1,157 @@
SUMMARY & USAGE LICENSE
=============================================
MovieLens data sets were collected by the GroupLens Research Project
at the University of Minnesota.
This data set consists of:
* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)
The data was collected through the MovieLens web site
(movielens.umn.edu) during the seven-month period from September 19th,
1997 through April 22nd, 1998. This data has been cleaned up - users
who had fewer than 20 ratings or did not have complete demographic
information were removed from this data set. Detailed descriptions of
the data file can be found at the end of this file.
Neither the University of Minnesota nor any of the researchers
involved can guarantee the correctness of the data, its suitability
for any particular purpose, or the validity of results based on the
use of the data set. The data set may be used for any research
purposes under the following conditions:
* The user may not state or imply any endorsement from the
University of Minnesota or the GroupLens Research Group.
* The user must acknowledge the use of the data set in
publications resulting from the use of the data set
(see below for citation information).
* The user may not redistribute the data without separate
permission.
* The user may not use this information for any commercial or
revenue-bearing purposes without first obtaining permission
from a faculty member of the GroupLens Research Project at the
University of Minnesota.
If you have any further questions or comments, please contact GroupLens
<grouplens-info@cs.umn.edu>.
CITATION
==============================================
To acknowledge use of the dataset in publications, please cite the
following paper:
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets:
History and Context. ACM Transactions on Interactive Intelligent
Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages.
DOI=http://dx.doi.org/10.1145/2827872
ACKNOWLEDGEMENTS
==============================================
Thanks to Al Borchers for cleaning up this data and writing the
accompanying scripts.
PUBLISHED WORK THAT HAS USED THIS DATASET
==============================================
Herlocker, J., Konstan, J., Borchers, A., Riedl, J.. An Algorithmic
Framework for Performing Collaborative Filtering. Proceedings of the
1999 Conference on Research and Development in Information
Retrieval. Aug. 1999.
FURTHER INFORMATION ABOUT THE GROUPLENS RESEARCH PROJECT
==============================================
The GroupLens Research Project is a research group in the Department
of Computer Science and Engineering at the University of Minnesota.
Members of the GroupLens Research Project are involved in many
research projects related to the fields of information filtering,
collaborative filtering, and recommender systems. The project is led
by professors John Riedl and Joseph Konstan. The project began to
explore automated collaborative filtering in 1992, but is best
known for its worldwide trial of an automated collaborative filtering
system for Usenet news in 1996. The technology developed in the
Usenet trial formed the base for the formation of Net Perceptions,
Inc., which was founded by members of GroupLens Research. Since then
the project has expanded its scope to research overall information
filtering solutions, integrating in content-based methods as well as
improving current collaborative filtering technology.
Further information on the GroupLens Research project, including
research publications, can be found at the following web site:
http://www.grouplens.org/
GroupLens Research currently operates a movie recommender based on
collaborative filtering:
http://www.movielens.org/
DETAILED DESCRIPTIONS OF DATA FILES
==============================================
Here are brief descriptions of the data.
ml-data.tar.gz -- Compressed tar file. To rebuild the u data files do this:
gunzip ml-data.tar.gz
tar xvf ml-data.tar
mku.sh
u.data -- The full u data set, 100000 ratings by 943 users on 1682 items.
Each user has rated at least 20 movies. Users and items are
numbered consecutively from 1. The data is randomly
ordered. This is a tab separated list of
user id | item id | rating | timestamp.
The time stamps are unix seconds since 1/1/1970 UTC
u.info -- The number of users, items, and ratings in the u data set.
u.item -- Information about the items (movies); this is a pipe (|) separated
list of
movie id | movie title | release date | video release date |
IMDb URL | unknown | Action | Adventure | Animation |
Children's | Comedy | Crime | Documentary | Drama | Fantasy |
Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
Thriller | War | Western |
The last 19 fields are the genres, a 1 indicates the movie
is of that genre, a 0 indicates it is not; movies can be in
several genres at once.
The movie ids are the ones used in the u.data data set.
u.genre -- A list of the genres.
u.user -- Demographic information about the users; this is a pipe (|)
separated list of
user id | age | gender | occupation | zip code
The user ids are the ones used in the u.data data set.
u.occupation -- A list of the occupations.
u1.base, u1.test, u2.base, u2.test, u3.base, u3.test, u4.base, u4.test,
u5.base, u5.test -- The data sets u1.base and u1.test through u5.base and
u5.test are 80%/20% splits of the u data into training and test data.
The test sets of u1, ..., u5 are pairwise disjoint; this is for
5-fold cross validation (where you repeat your experiment
with each training and test set and average the results).
These data sets can be generated from u.data by mku.sh.
ua.base, ua.test, ub.base, ub.test -- The data sets ua.base, ua.test,
ub.base, and ub.test split the u data into a training set and a test
set with exactly 10 ratings per user in the test set. The sets
ua.test and ub.test are disjoint. These data sets can
be generated from u.data by mku.sh.
allbut.pl -- The script that generates training and test sets where
all but n of a user's ratings are in the training data.
mku.sh -- A shell script to generate all the u data sets from u.data.
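As an illustration of the u.data layout described above (tab separated user id, item id, rating, and a Unix timestamp in UTC), here is a minimal Python sketch. The `Rating` type and `parse_rating` function are hypothetical names introduced for this example, and the sample line is made up rather than taken from the data set.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    item_id: int
    rating: int          # 1-5
    timestamp: datetime  # converted from Unix seconds, UTC


def parse_rating(line: str) -> Rating:
    """Parse one tab-separated u.data line into a Rating."""
    user_id, item_id, rating, ts = line.rstrip("\n").split("\t")
    return Rating(
        int(user_id),
        int(item_id),
        int(rating),
        datetime.fromtimestamp(int(ts), tz=timezone.utc),
    )


# Made-up record: user 1 gave item 2 a rating of 3 at timestamp 0.
r = parse_rating("1\t2\t3\t0\n")
```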

View File

@@ -0,0 +1,34 @@
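#!/usr/local/bin/perl
# Split ratings into <base_name>.test and <base_name>.base.
# For each user, the start-th through stop-th ratings (counted in input
# order) go to the test file, up to max_test lines in total; every other
# rating goes to the base file.
# get args
if (@ARGV < 4) {
    print STDERR "Usage: $0 base_name start stop max_test [ratings ...]\n";
    exit 1;
}
$basename = shift;
$start = shift;
$stop = shift;
$maxtest = shift;
# open files
open( TESTFILE, ">$basename.test" ) or die "Cannot open $basename.test for writing\n";
open( BASEFILE, ">$basename.base" ) or die "Cannot open $basename.base for writing\n";
# init variables
$testcnt = 0;
# read ratings from the remaining file arguments (or standard input)
while (<>) {
    ($user) = split;    # first whitespace-separated field is the user id
    if (! defined $ratingcnt{$user}) {
        $ratingcnt{$user} = 0;
    }
    ++$ratingcnt{$user};
    # a max_test of 0 or less means no limit on the number of test lines
    if (($testcnt < $maxtest || $maxtest <= 0)
        && $ratingcnt{$user} >= $start && $ratingcnt{$user} <= $stop) {
        ++$testcnt;
        print TESTFILE;
    }
    else {
        print BASEFILE;
    }
}
close(TESTFILE);
close(BASEFILE);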
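For readers less familiar with Perl, the selection rule of the script above can be restated as a short Python sketch. The function name `allbut_split` is hypothetical, and the sketch works on an in-memory list of non-empty lines instead of writing the `.test` and `.base` files.

```python
from collections import defaultdict


def allbut_split(lines, start, stop, max_test=0):
    """Send each user's start-th..stop-th ratings (1-based, input order) to test.

    A max_test of 0 or less means no cap on the number of test lines.
    Returns (base, test) as two lists of the original lines.
    """
    seen = defaultdict(int)  # ratings seen so far, per user
    base, test = [], []
    for line in lines:
        user = line.split()[0]  # first whitespace-separated field
        seen[user] += 1
        under_cap = max_test <= 0 or len(test) < max_test
        if under_cap and start <= seen[user] <= stop:
            test.append(line)
        else:
            base.append(line)
    return base, test
```

With `start=1` and `stop=10` this mirrors how the ua split is requested, and `start=11`, `stop=20` mirrors the ub split, which is why the two test sets never share a rating.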

View File

@@ -0,0 +1,25 @@
#!/bin/sh
# Remove the temporary file if the script is interrupted.
trap 'rm -f tmp.$$; exit 1' 1 2 15
# u.data is tab separated, so sort on a literal tab.
TAB=`printf '\t'`
for i in 1 2 3 4 5
do
    head -n `expr $i \* 20000` u.data | tail -n 20000 > tmp.$$
    sort -t"$TAB" -k 1,1n -k 2,2n tmp.$$ > u$i.test
    head -n `expr \( $i - 1 \) \* 20000` u.data > tmp.$$
    tail -n `expr \( 5 - $i \) \* 20000` u.data >> tmp.$$
    sort -t"$TAB" -k 1,1n -k 2,2n tmp.$$ > u$i.base
done
allbut.pl ua 1 10 100000 u.data
sort -t"$TAB" -k 1,1n -k 2,2n ua.base > tmp.$$
mv tmp.$$ ua.base
sort -t"$TAB" -k 1,1n -k 2,2n ua.test > tmp.$$
mv tmp.$$ ua.test
allbut.pl ub 11 20 100000 u.data
sort -t"$TAB" -k 1,1n -k 2,2n ub.base > tmp.$$
mv tmp.$$ ub.base
sort -t"$TAB" -k 1,1n -k 2,2n ub.test > tmp.$$
mv tmp.$$ ub.test
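The five-fold part of the script above takes the i-th contiguous block of u.data as the test set, uses the remaining lines as the base set, and sorts both numerically by user id and then item id. A rough Python equivalent is sketched below; `five_fold` is a hypothetical name, and the sketch assumes the input holds exactly five times `fold_size` tab-separated lines.

```python
def five_fold(lines, fold_size):
    """Yield (base, test) pairs; fold i's test set is the i-th contiguous block."""

    def sort_key(line):
        # Numeric sort on the first two tab-separated fields: user id, item id.
        user, item = line.split("\t")[:2]
        return int(user), int(item)

    for i in range(5):
        lo, hi = i * fold_size, (i + 1) * fold_size
        test = sorted(lines[lo:hi], key=sort_key)
        base = sorted(lines[:lo] + lines[hi:], key=sort_key)
        yield base, test
```

Because u.data is randomly ordered, contiguous blocks already behave like a random partition, so no extra shuffling is needed before slicing.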

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,20 @@
unknown|0
Action|1
Adventure|2
Animation|3
Children's|4
Comedy|5
Crime|6
Documentary|7
Drama|8
Fantasy|9
Film-Noir|10
Horror|11
Musical|12
Mystery|13
Romance|14
Sci-Fi|15
Thriller|16
War|17
Western|18
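The genre list above pairs each genre name with the position of its 0/1 flag in an item record. The sketch below shows one way to decode those flags in Python. It assumes item fields are separated by `|`, as the genre and user listings are; `load_genres` and `decode_genres` are hypothetical names, and the item record is made up and shortened to three genres.

```python
def load_genres(lines):
    """Map genre index -> name from 'name|index' lines such as those above."""
    genres = {}
    for line in lines:
        line = line.strip()
        if line:
            name, index = line.rsplit("|", 1)
            genres[int(index)] = name
    return genres


def decode_genres(item_fields, genres):
    """Return the names whose flag is "1" in the last len(genres) fields."""
    flags = item_fields[-len(genres):]
    return [genres[i] for i, flag in enumerate(flags) if flag == "1"]


# Shortened genre list and a made-up item record, for illustration only.
genres = load_genres(["unknown|0", "Action|1", "Adventure|2"])
fields = "7|Example Movie (1990)|01-Jan-1990||http://example.invalid|0|1|1".split("|")
names = decode_genres(fields, genres)
```

A movie can carry several flags at once, so the result is a list rather than a single genre.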

View File

@@ -0,0 +1,3 @@
943 users
1682 items
100000 ratings

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,21 @@
administrator
artist
doctor
educator
engineer
entertainment
executive
healthcare
homemaker
lawyer
librarian
marketing
none
other
programmer
retired
salesman
scientist
student
technician
writer

View File

@@ -0,0 +1,943 @@
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
6|42|M|executive|98101
7|57|M|administrator|91344
8|36|M|administrator|05201
9|29|M|student|01002
10|53|M|lawyer|90703
11|39|F|other|30329
12|28|F|other|06405
13|47|M|educator|29206
14|45|M|scientist|55106
15|49|F|educator|97301
16|21|M|entertainment|10309
17|30|M|programmer|06355
18|35|F|other|37212
19|40|M|librarian|02138
20|42|F|homemaker|95660
21|26|M|writer|30068
22|25|M|writer|40206
23|30|F|artist|48197
24|21|F|artist|94533
25|39|M|engineer|55107
26|49|M|engineer|21044
27|40|F|librarian|30030
28|32|M|writer|55369
29|41|M|programmer|94043
30|7|M|student|55436
31|24|M|artist|10003
32|28|F|student|78741
33|23|M|student|27510
34|38|F|administrator|42141
35|20|F|homemaker|42459
36|19|F|student|93117
37|23|M|student|55105
38|28|F|other|54467
39|41|M|entertainment|01040
40|38|M|scientist|27514
41|33|M|engineer|80525
42|30|M|administrator|17870
43|29|F|librarian|20854
44|26|M|technician|46260
45|29|M|programmer|50233
46|27|F|marketing|46538
47|53|M|marketing|07102
48|45|M|administrator|12550
49|23|F|student|76111
50|21|M|writer|52245
51|28|M|educator|16509
52|18|F|student|55105
53|26|M|programmer|55414
54|22|M|executive|66315
55|37|M|programmer|01331
56|25|M|librarian|46260
57|16|M|none|84010
58|27|M|programmer|52246
59|49|M|educator|08403
60|50|M|healthcare|06472
61|36|M|engineer|30040
62|27|F|administrator|97214
63|31|M|marketing|75240
64|32|M|educator|43202
65|51|F|educator|48118
66|23|M|student|80521
67|17|M|student|60402
68|19|M|student|22904
69|24|M|engineer|55337
70|27|M|engineer|60067
71|39|M|scientist|98034
72|48|F|administrator|73034
73|24|M|student|41850
74|39|M|scientist|T8H1N
75|24|M|entertainment|08816
76|20|M|student|02215
77|30|M|technician|29379
78|26|M|administrator|61801
79|39|F|administrator|03755
80|34|F|administrator|52241
81|21|M|student|21218
82|50|M|programmer|22902
83|40|M|other|44133
84|32|M|executive|55369
85|51|M|educator|20003
86|26|M|administrator|46005
87|47|M|administrator|89503
88|49|F|librarian|11701
89|43|F|administrator|68106
90|60|M|educator|78155
91|55|M|marketing|01913
92|32|M|entertainment|80525
93|48|M|executive|23112
94|26|M|student|71457
95|31|M|administrator|10707
96|25|F|artist|75206
97|43|M|artist|98006
98|49|F|executive|90291
99|20|M|student|63129
100|36|M|executive|90254
101|15|M|student|05146
102|38|M|programmer|30220
103|26|M|student|55108
104|27|M|student|55108
105|24|M|engineer|94043
106|61|M|retired|55125
107|39|M|scientist|60466
108|44|M|educator|63130
109|29|M|other|55423
110|19|M|student|77840
111|57|M|engineer|90630
112|30|M|salesman|60613
113|47|M|executive|95032
114|27|M|programmer|75013
115|31|M|engineer|17110
116|40|M|healthcare|97232
117|20|M|student|16125
118|21|M|administrator|90210
119|32|M|programmer|67401
120|47|F|other|06260
121|54|M|librarian|99603
122|32|F|writer|22206
123|48|F|artist|20008
124|34|M|student|60615
125|30|M|lawyer|22202
126|28|F|lawyer|20015
127|33|M|none|73439
128|24|F|marketing|20009
129|36|F|marketing|07039
130|20|M|none|60115
131|59|F|administrator|15237
132|24|M|other|94612
133|53|M|engineer|78602
134|31|M|programmer|80236
135|23|M|student|38401
136|51|M|other|97365
137|50|M|educator|84408
138|46|M|doctor|53211
139|20|M|student|08904
140|30|F|student|32250
141|49|M|programmer|36117
142|13|M|other|48118
143|42|M|technician|08832
144|53|M|programmer|20910
145|31|M|entertainment|V3N4P
146|45|M|artist|83814
147|40|F|librarian|02143
148|33|M|engineer|97006
149|35|F|marketing|17325
150|20|F|artist|02139
151|38|F|administrator|48103
152|33|F|educator|68767
153|25|M|student|60641
154|25|M|student|53703
155|32|F|other|11217
156|25|M|educator|08360
157|57|M|engineer|70808
158|50|M|educator|27606
159|23|F|student|55346
160|27|M|programmer|66215
161|50|M|lawyer|55104
162|25|M|artist|15610
163|49|M|administrator|97212
164|47|M|healthcare|80123
165|20|F|other|53715
166|47|M|educator|55113
167|37|M|other|L9G2B
168|48|M|other|80127
169|52|F|other|53705
170|53|F|healthcare|30067
171|48|F|educator|78750
172|55|M|marketing|22207
173|56|M|other|22306
174|30|F|administrator|52302
175|26|F|scientist|21911
176|28|M|scientist|07030
177|20|M|programmer|19104
178|26|M|other|49512
179|15|M|entertainment|20755
180|22|F|administrator|60202
181|26|M|executive|21218
182|36|M|programmer|33884
183|33|M|scientist|27708
184|37|M|librarian|76013
185|53|F|librarian|97403
186|39|F|executive|00000
187|26|M|educator|16801
188|42|M|student|29440
189|32|M|artist|95014
190|30|M|administrator|95938
191|33|M|administrator|95161
192|42|M|educator|90840
193|29|M|student|49931
194|38|M|administrator|02154
195|42|M|scientist|93555
196|49|M|writer|55105
197|55|M|technician|75094
198|21|F|student|55414
199|30|M|writer|17604
200|40|M|programmer|93402
201|27|M|writer|E2A4H
202|41|F|educator|60201
203|25|F|student|32301
204|52|F|librarian|10960
205|47|M|lawyer|06371
206|14|F|student|53115
207|39|M|marketing|92037
208|43|M|engineer|01720
209|33|F|educator|85710
210|39|M|engineer|03060
211|66|M|salesman|32605
212|49|F|educator|61401
213|33|M|executive|55345
214|26|F|librarian|11231
215|35|M|programmer|63033
216|22|M|engineer|02215
217|22|M|other|11727
218|37|M|administrator|06513
219|32|M|programmer|43212
220|30|M|librarian|78205
221|19|M|student|20685
222|29|M|programmer|27502
223|19|F|student|47906
224|31|F|educator|43512
225|51|F|administrator|58202
226|28|M|student|92103
227|46|M|executive|60659
228|21|F|student|22003
229|29|F|librarian|22903
230|28|F|student|14476
231|48|M|librarian|01080
232|45|M|scientist|99709
233|38|M|engineer|98682
234|60|M|retired|94702
235|37|M|educator|22973
236|44|F|writer|53214
237|49|M|administrator|63146
238|42|F|administrator|44124
239|39|M|artist|95628
240|23|F|educator|20784
241|26|F|student|20001
242|33|M|educator|31404
243|33|M|educator|60201
244|28|M|technician|80525
245|22|M|student|55109
246|19|M|student|28734
247|28|M|engineer|20770
248|25|M|student|37235
249|25|M|student|84103
250|29|M|executive|95110
251|28|M|doctor|85032
252|42|M|engineer|07733
253|26|F|librarian|22903
254|44|M|educator|42647
255|23|M|entertainment|07029
256|35|F|none|39042
257|17|M|student|77005
258|19|F|student|77801
259|21|M|student|48823
260|40|F|artist|89801
261|28|M|administrator|85202
262|19|F|student|78264
263|41|M|programmer|55346
264|36|F|writer|90064
265|26|M|executive|84601
266|62|F|administrator|78756
267|23|M|engineer|83716
268|24|M|engineer|19422
269|31|F|librarian|43201
270|18|F|student|63119
271|51|M|engineer|22932
272|33|M|scientist|53706
273|50|F|other|10016
274|20|F|student|55414
275|38|M|engineer|92064
276|21|M|student|95064
277|35|F|administrator|55406
278|37|F|librarian|30033
279|33|M|programmer|85251
280|30|F|librarian|22903
281|15|F|student|06059
282|22|M|administrator|20057
283|28|M|programmer|55305
284|40|M|executive|92629
285|25|M|programmer|53713
286|27|M|student|15217
287|21|M|salesman|31211
288|34|M|marketing|23226
289|11|M|none|94619
290|40|M|engineer|93550
291|19|M|student|44106
292|35|F|programmer|94703
293|24|M|writer|60804
294|34|M|technician|92110
295|31|M|educator|50325
296|43|F|administrator|16803
297|29|F|educator|98103
298|44|M|executive|01581
299|29|M|doctor|63108
300|26|F|programmer|55106
301|24|M|student|55439
302|42|M|educator|77904
303|19|M|student|14853
304|22|F|student|71701
305|23|M|programmer|94086
306|45|M|other|73132
307|25|M|student|55454
308|60|M|retired|95076
309|40|M|scientist|70802
310|37|M|educator|91711
311|32|M|technician|73071
312|48|M|other|02110
313|41|M|marketing|60035
314|20|F|student|08043
315|31|M|educator|18301
316|43|F|other|77009
317|22|M|administrator|13210
318|65|M|retired|06518
319|38|M|programmer|22030
320|19|M|student|24060
321|49|F|educator|55413
322|20|M|student|50613
323|21|M|student|19149
324|21|F|student|02176
325|48|M|technician|02139
326|41|M|administrator|15235
327|22|M|student|11101
328|51|M|administrator|06779
329|48|M|educator|01720
330|35|F|educator|33884
331|33|M|entertainment|91344
332|20|M|student|40504
333|47|M|other|V0R2M
334|32|M|librarian|30002
335|45|M|executive|33775
336|23|M|salesman|42101
337|37|M|scientist|10522
338|39|F|librarian|59717
339|35|M|lawyer|37901
340|46|M|engineer|80123
341|17|F|student|44405
342|25|F|other|98006
343|43|M|engineer|30093
344|30|F|librarian|94117
345|28|F|librarian|94143
346|34|M|other|76059
347|18|M|student|90210
348|24|F|student|45660
349|68|M|retired|61455
350|32|M|student|97301
351|61|M|educator|49938
352|37|F|programmer|55105
353|25|M|scientist|28480
354|29|F|librarian|48197
355|25|M|student|60135
356|32|F|homemaker|92688
357|26|M|executive|98133
358|40|M|educator|10022
359|22|M|student|61801
360|51|M|other|98027
361|22|M|student|44074
362|35|F|homemaker|85233
363|20|M|student|87501
364|63|M|engineer|01810
365|29|M|lawyer|20009
366|20|F|student|50670
367|17|M|student|37411
368|18|M|student|92113
369|24|M|student|91335
370|52|M|writer|08534
371|36|M|engineer|99206
372|25|F|student|66046
373|24|F|other|55116
374|36|M|executive|78746
375|17|M|entertainment|37777
376|28|F|other|10010
377|22|M|student|18015
378|35|M|student|02859
379|44|M|programmer|98117
380|32|M|engineer|55117
381|33|M|artist|94608
382|45|M|engineer|01824
383|42|M|administrator|75204
384|52|M|programmer|45218
385|36|M|writer|10003
386|36|M|salesman|43221
387|33|M|entertainment|37412
388|31|M|other|36106
389|44|F|writer|83702
390|42|F|writer|85016
391|23|M|student|84604
392|52|M|writer|59801
393|19|M|student|83686
394|25|M|administrator|96819
395|43|M|other|44092
396|57|M|engineer|94551
397|17|M|student|27514
398|40|M|other|60008
399|25|M|other|92374
400|33|F|administrator|78213
401|46|F|healthcare|84107
402|30|M|engineer|95129
403|37|M|other|06811
404|29|F|programmer|55108
405|22|F|healthcare|10019
406|52|M|educator|93109
407|29|M|engineer|03261
408|23|M|student|61755
409|48|M|administrator|98225
410|30|F|artist|94025
411|34|M|educator|44691
412|25|M|educator|15222
413|55|M|educator|78212
414|24|M|programmer|38115
415|39|M|educator|85711
416|20|F|student|92626
417|27|F|other|48103
418|55|F|none|21206
419|37|M|lawyer|43215
420|53|M|educator|02140
421|38|F|programmer|55105
422|26|M|entertainment|94533
423|64|M|other|91606
424|36|F|marketing|55422
425|19|M|student|58644
426|55|M|educator|01602
427|51|M|doctor|85258
428|28|M|student|55414
429|27|M|student|29205
430|38|M|scientist|98199
431|24|M|marketing|92629
432|22|M|entertainment|50311
433|27|M|artist|11211
434|16|F|student|49705
435|24|M|engineer|60007
436|30|F|administrator|17345
437|27|F|other|20009
438|51|F|administrator|43204
439|23|F|administrator|20817
440|30|M|other|48076
441|50|M|technician|55013
442|22|M|student|85282
443|35|M|salesman|33308
444|51|F|lawyer|53202
445|21|M|writer|92653
446|57|M|educator|60201
447|30|M|administrator|55113
448|23|M|entertainment|10021
449|23|M|librarian|55021
450|35|F|educator|11758
451|16|M|student|48446
452|35|M|administrator|28018
453|18|M|student|06333
454|57|M|other|97330
455|48|M|administrator|83709
456|24|M|technician|31820
457|33|F|salesman|30011
458|47|M|technician|Y1A6B
459|22|M|student|29201
460|44|F|other|60630
461|15|M|student|98102
462|19|F|student|02918
463|48|F|healthcare|75218
464|60|M|writer|94583
465|32|M|other|05001
466|22|M|student|90804
467|29|M|engineer|91201
468|28|M|engineer|02341
469|60|M|educator|78628
470|24|M|programmer|10021
471|10|M|student|77459
472|24|M|student|87544
473|29|M|student|94708
474|51|M|executive|93711
475|30|M|programmer|75230
476|28|M|student|60440
477|23|F|student|02125
478|29|M|other|10019
479|30|M|educator|55409
480|57|M|retired|98257
481|73|M|retired|37771
482|18|F|student|40256
483|29|M|scientist|43212
484|27|M|student|21208
485|44|F|educator|95821
486|39|M|educator|93101
487|22|M|engineer|92121
488|48|M|technician|21012
489|55|M|other|45218
490|29|F|artist|V5A2B
491|43|F|writer|53711
492|57|M|educator|94618
493|22|M|engineer|60090
494|38|F|administrator|49428
495|29|M|engineer|03052
496|21|F|student|55414
497|20|M|student|50112
498|26|M|writer|55408
499|42|M|programmer|75006
500|28|M|administrator|94305
501|22|M|student|10025
502|22|M|student|23092
503|50|F|writer|27514
504|40|F|writer|92115
505|27|F|other|20657
506|46|M|programmer|03869
507|18|F|writer|28450
508|27|M|marketing|19382
509|23|M|administrator|10011
510|34|M|other|98038
511|22|M|student|21250
512|29|M|other|20090
513|43|M|administrator|26241
514|27|M|programmer|20707
515|53|M|marketing|49508
516|53|F|librarian|10021
517|24|M|student|55454
518|49|F|writer|99709
519|22|M|other|55320
520|62|M|healthcare|12603
521|19|M|student|02146
522|36|M|engineer|55443
523|50|F|administrator|04102
524|56|M|educator|02159
525|27|F|administrator|19711
526|30|M|marketing|97124
527|33|M|librarian|12180
528|18|M|student|55104
529|47|F|administrator|44224
530|29|M|engineer|94040
531|30|F|salesman|97408
532|20|M|student|92705
533|43|M|librarian|02324
534|20|M|student|05464
535|45|F|educator|80302
536|38|M|engineer|30078
537|36|M|engineer|22902
538|31|M|scientist|21010
539|53|F|administrator|80303
540|28|M|engineer|91201
541|19|F|student|84302
542|21|M|student|60515
543|33|M|scientist|95123
544|44|F|other|29464
545|27|M|technician|08052
546|36|M|executive|22911
547|50|M|educator|14534
548|51|M|writer|95468
549|42|M|scientist|45680
550|16|F|student|95453
551|25|M|programmer|55414
552|45|M|other|68147
553|58|M|educator|62901
554|32|M|scientist|62901
555|29|F|educator|23227
556|35|F|educator|30606
557|30|F|writer|11217
558|56|F|writer|63132
559|69|M|executive|10022
560|32|M|student|10003
561|23|M|engineer|60005
562|54|F|administrator|20879
563|39|F|librarian|32707
564|65|M|retired|94591
565|40|M|student|55422
566|20|M|student|14627
567|24|M|entertainment|10003
568|39|M|educator|01915
569|34|M|educator|91903
570|26|M|educator|14627
571|34|M|artist|01945
572|51|M|educator|20003
573|68|M|retired|48911
574|56|M|educator|53188
575|33|M|marketing|46032
576|48|M|executive|98281
577|36|F|student|77845
578|31|M|administrator|M7A1A
579|32|M|educator|48103
580|16|M|student|17961
581|37|M|other|94131
582|17|M|student|93003
583|44|M|engineer|29631
584|25|M|student|27511
585|69|M|librarian|98501
586|20|M|student|79508
587|26|M|other|14216
588|18|F|student|93063
589|21|M|lawyer|90034
590|50|M|educator|82435
591|57|F|librarian|92093
592|18|M|student|97520
593|31|F|educator|68767
594|46|M|educator|M4J2K
595|25|M|programmer|31909
596|20|M|artist|77073
597|23|M|other|84116
598|40|F|marketing|43085
599|22|F|student|R3T5K
600|34|M|programmer|02320
601|19|F|artist|99687
602|47|F|other|34656
603|21|M|programmer|47905
604|39|M|educator|11787
605|33|M|engineer|33716
606|28|M|programmer|63044
607|49|F|healthcare|02154
608|22|M|other|10003
609|13|F|student|55106
610|22|M|student|21227
611|46|M|librarian|77008
612|36|M|educator|79070
613|37|F|marketing|29678
614|54|M|educator|80227
615|38|M|educator|27705
616|55|M|scientist|50613
617|27|F|writer|11201
618|15|F|student|44212
619|17|M|student|44134
620|18|F|writer|81648
621|17|M|student|60402
622|25|M|programmer|14850
623|50|F|educator|60187
624|19|M|student|30067
625|27|M|programmer|20723
626|23|M|scientist|19807
627|24|M|engineer|08034
628|13|M|none|94306
629|46|F|other|44224
630|26|F|healthcare|55408
631|18|F|student|38866
632|18|M|student|55454
633|35|M|programmer|55414
634|39|M|engineer|T8H1N
635|22|M|other|23237
636|47|M|educator|48043
637|30|M|other|74101
638|45|M|engineer|01940
639|42|F|librarian|12065
640|20|M|student|61801
641|24|M|student|60626
642|18|F|student|95521
643|39|M|scientist|55122
644|51|M|retired|63645
645|27|M|programmer|53211
646|17|F|student|51250
647|40|M|educator|45810
648|43|M|engineer|91351
649|20|M|student|39762
650|42|M|engineer|83814
651|65|M|retired|02903
652|35|M|other|22911
653|31|M|executive|55105
654|27|F|student|78739
655|50|F|healthcare|60657
656|48|M|educator|10314
657|26|F|none|78704
658|33|M|programmer|92626
659|31|M|educator|54248
660|26|M|student|77380
661|28|M|programmer|98121
662|55|M|librarian|19102
663|26|M|other|19341
664|30|M|engineer|94115
665|25|M|administrator|55412
666|44|M|administrator|61820
667|35|M|librarian|01970
668|29|F|writer|10016
669|37|M|other|20009
670|30|M|technician|21114
671|21|M|programmer|91919
672|54|F|administrator|90095
673|51|M|educator|22906
674|13|F|student|55337
675|34|M|other|28814
676|30|M|programmer|32712
677|20|M|other|99835
678|50|M|educator|61462
679|20|F|student|54302
680|33|M|lawyer|90405
681|44|F|marketing|97208
682|23|M|programmer|55128
683|42|M|librarian|23509
684|28|M|student|55414
685|32|F|librarian|55409
686|32|M|educator|26506
687|31|F|healthcare|27713
688|37|F|administrator|60476
689|25|M|other|45439
690|35|M|salesman|63304
691|34|M|educator|60089
692|34|M|engineer|18053
693|43|F|healthcare|85210
694|60|M|programmer|06365
695|26|M|writer|38115
696|55|M|other|94920
697|25|M|other|77042
698|28|F|programmer|06906
699|44|M|other|96754
700|17|M|student|76309
701|51|F|librarian|56321
702|37|M|other|89104
703|26|M|educator|49512
704|51|F|librarian|91105
705|21|F|student|54494
706|23|M|student|55454
707|56|F|librarian|19146
708|26|F|homemaker|96349
709|21|M|other|N4T1A
710|19|M|student|92020
711|22|F|student|15203
712|22|F|student|54901
713|42|F|other|07204
714|26|M|engineer|55343
715|21|M|technician|91206
716|36|F|administrator|44265
717|24|M|technician|84105
718|42|M|technician|64118
719|37|F|other|V0R2H
720|49|F|administrator|16506
721|24|F|entertainment|11238
722|50|F|homemaker|17331
723|26|M|executive|94403
724|31|M|executive|40243
725|21|M|student|91711
726|25|F|administrator|80538
727|25|M|student|78741
728|58|M|executive|94306
729|19|M|student|56567
730|31|F|scientist|32114
731|41|F|educator|70403
732|28|F|other|98405
733|44|F|other|60630
734|25|F|other|63108
735|29|F|healthcare|85719
736|48|F|writer|94618
737|30|M|programmer|98072
738|35|M|technician|95403
739|35|M|technician|73162
740|25|F|educator|22206
741|25|M|writer|63108
742|35|M|student|29210
743|31|M|programmer|92660
744|35|M|marketing|47024
745|42|M|writer|55113
746|25|M|engineer|19047
747|19|M|other|93612
748|28|M|administrator|94720
749|33|M|other|80919
750|28|M|administrator|32303
751|24|F|other|90034
752|60|M|retired|21201
753|56|M|salesman|91206
754|59|F|librarian|62901
755|44|F|educator|97007
756|30|F|none|90247
757|26|M|student|55104
758|27|M|student|53706
759|20|F|student|68503
760|35|F|other|14211
761|17|M|student|97302
762|32|M|administrator|95050
763|27|M|scientist|02113
764|27|F|educator|62903
765|31|M|student|33066
766|42|M|other|10960
767|70|M|engineer|00000
768|29|M|administrator|12866
769|39|M|executive|06927
770|28|M|student|14216
771|26|M|student|15232
772|50|M|writer|27105
773|20|M|student|55414
774|30|M|student|80027
775|46|M|executive|90036
776|30|M|librarian|51157
777|63|M|programmer|01810
778|34|M|student|01960
779|31|M|student|K7L5J
780|49|M|programmer|94560
781|20|M|student|48825
782|21|F|artist|33205
783|30|M|marketing|77081
784|47|M|administrator|91040
785|32|M|engineer|23322
786|36|F|engineer|01754
787|18|F|student|98620
788|51|M|administrator|05779
789|29|M|other|55420
790|27|M|technician|80913
791|31|M|educator|20064
792|40|M|programmer|12205
793|22|M|student|85281
794|32|M|educator|57197
795|30|M|programmer|08610
796|32|F|writer|33755
797|44|F|other|62522
798|40|F|writer|64131
799|49|F|administrator|19716
800|25|M|programmer|55337
801|22|M|writer|92154
802|35|M|administrator|34105
803|70|M|administrator|78212
804|39|M|educator|61820
805|27|F|other|20009
806|27|M|marketing|11217
807|41|F|healthcare|93555
808|45|M|salesman|90016
809|50|F|marketing|30803
810|55|F|other|80526
811|40|F|educator|73013
812|22|M|technician|76234
813|14|F|student|02136
814|30|M|other|12345
815|32|M|other|28806
816|34|M|other|20755
817|19|M|student|60152
818|28|M|librarian|27514
819|59|M|administrator|40205
820|22|M|student|37725
821|37|M|engineer|77845
822|29|F|librarian|53144
823|27|M|artist|50322
824|31|M|other|15017
825|44|M|engineer|05452
826|28|M|artist|77048
827|23|F|engineer|80228
828|28|M|librarian|85282
829|48|M|writer|80209
830|46|M|programmer|53066
831|21|M|other|33765
832|24|M|technician|77042
833|34|M|writer|90019
834|26|M|other|64153
835|44|F|executive|11577
836|44|M|artist|10018
837|36|F|artist|55409
838|23|M|student|01375
839|38|F|entertainment|90814
840|39|M|artist|55406
841|45|M|doctor|47401
842|40|M|writer|93055
843|35|M|librarian|44212
844|22|M|engineer|95662
845|64|M|doctor|97405
846|27|M|lawyer|47130
847|29|M|student|55417
848|46|M|engineer|02146
849|15|F|student|25652
850|34|M|technician|78390
851|18|M|other|29646
852|46|M|administrator|94086
853|49|M|writer|40515
854|29|F|student|55408
855|53|M|librarian|04988
856|43|F|marketing|97215
857|35|F|administrator|V1G4L
858|63|M|educator|09645
859|18|F|other|06492
860|70|F|retired|48322
861|38|F|student|14085
862|25|M|executive|13820
863|17|M|student|60089
864|27|M|programmer|63021
865|25|M|artist|11231
866|45|M|other|60302
867|24|M|scientist|92507
868|21|M|programmer|55303
869|30|M|student|10025
870|22|M|student|65203
871|31|M|executive|44648
872|19|F|student|74078
873|48|F|administrator|33763
874|36|M|scientist|37076
875|24|F|student|35802
876|41|M|other|20902
877|30|M|other|77504
878|50|F|educator|98027
879|33|F|administrator|55337
880|13|M|student|83702
881|39|M|marketing|43017
882|35|M|engineer|40503
883|49|M|librarian|50266
884|44|M|engineer|55337
885|30|F|other|95316
886|20|M|student|61820
887|14|F|student|27249
888|41|M|scientist|17036
889|24|M|technician|78704
890|32|M|student|97301
891|51|F|administrator|03062
892|36|M|other|45243
893|25|M|student|95823
894|47|M|educator|74075
895|31|F|librarian|32301
896|28|M|writer|91505
897|30|M|other|33484
898|23|M|homemaker|61755
899|32|M|other|55116
900|60|M|retired|18505
901|38|M|executive|L1V3W
902|45|F|artist|97203
903|28|M|educator|20850
904|17|F|student|61073
905|27|M|other|30350
906|45|M|librarian|70124
907|25|F|other|80526
908|44|F|librarian|68504
909|50|F|educator|53171
910|28|M|healthcare|29301
911|37|F|writer|53210
912|51|M|other|06512
913|27|M|student|76201
914|44|F|other|08105
915|50|M|entertainment|60614
916|27|M|engineer|N2L5N
917|22|F|student|20006
918|40|M|scientist|70116
919|25|M|other|14216
920|30|F|artist|90008
921|20|F|student|98801
922|29|F|administrator|21114
923|21|M|student|E2E3R
924|29|M|other|11753
925|18|F|salesman|49036
926|49|M|entertainment|01701
927|23|M|programmer|55428
928|21|M|student|55408
929|44|M|scientist|53711
930|28|F|scientist|07310
931|60|M|educator|33556
932|58|M|educator|06437
933|28|M|student|48105
934|61|M|engineer|22902
935|42|M|doctor|66221
936|24|M|other|32789
937|48|M|educator|98072
938|38|F|technician|55038
939|26|F|student|33319
940|32|M|administrator|02215
941|20|M|student|97229
942|48|F|librarian|78209
943|22|M|student|77841

File diff suppressed because it is too large Load Diff

View File

@@ -3,7 +3,7 @@
'''
Created on 2017-05-18
Update on 2017-05-18
@author: Peter Harrington/山上有课树
@author: Peter Harrington/1988/片刻
Machine Learning in Action (《机器学习实战》) updates: https://github.com/apachecn/MachineLearning
'''
from numpy import random, mat, eye

View File

@@ -0,0 +1,68 @@
#!/usr/bin/python
# coding:utf8
from math import sqrt

import numpy as np
import pandas as pd
from scipy.sparse.linalg import svds
# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split is available from sklearn.model_selection (0.18+).
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import pairwise_distances

# Load the dataset
header = ['user_id', 'item_id', 'rating', 'timestamp']
# http://files.grouplens.org/datasets/movielens/ml-100k.zip
dataFile = 'input/16.RecommenderSystems/ml-100k/u.data'
df = pd.read_csv(dataFile, sep='\t', names=header)

n_users = df.user_id.unique().shape[0]
n_items = df.item_id.unique().shape[0]
print('Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_items))

# Split the dataset into training and test sets
train_data, test_data = train_test_split(df, test_size=0.25)

# Build two user-item matrices, one for the training data and one for the test data
# (ids in the file are 1-based, hence the -1; column 0 of each tuple is the index)
train_data_matrix = np.zeros((n_users, n_items))
for line in train_data.itertuples():
    train_data_matrix[line[1]-1, line[2]-1] = line[3]
test_data_matrix = np.zeros((n_users, n_items))
for line in test_data.itertuples():
    test_data_matrix[line[1]-1, line[2]-1] = line[3]

# Pairwise cosine measure between users and between items.
# Note: pairwise_distances(metric="cosine") returns cosine *distance*,
# i.e. 1 - cosine similarity; subtract it from 1 if true similarities are wanted as weights.
user_similarity = pairwise_distances(train_data_matrix, metric="cosine")
item_similarity = pairwise_distances(train_data_matrix.T, metric="cosine")


def predict(rating, similarity, type='user'):
    if type == 'user':
        # Mean-center each user's ratings, then take a similarity-weighted average
        mean_user_rating = rating.mean(axis=1)
        rating_diff = (rating - mean_user_rating[:, np.newaxis])
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(rating_diff)/np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        pred = rating.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred


user_prediction = predict(train_data_matrix, user_similarity, type='user')
item_prediction = predict(train_data_matrix, item_similarity, type='item')


def rmse(prediction, ground_truth):
    # Only score the entries that are actually rated (non-zero) in ground_truth
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(ground_truth, prediction))


print('User based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))
print('Item based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix)))

sparsity = round(1.0 - len(df)/float(n_users*n_items), 3)
print('The sparsity level of MovieLens 100K is ' + str(sparsity * 100) + '%')

# Model-based CF: truncated SVD with k=20 latent factors
u, s, vt = svds(train_data_matrix, k=20)
s_diag_matrix = np.diag(s)
x_pred = np.dot(np.dot(u, s_diag_matrix), vt)
print('SVD based CF RMSE: ' + str(rmse(x_pred, test_data_matrix)))
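The user-based prediction and the masked RMSE used in the script above can be tried on a tiny matrix without downloading MovieLens. The following is a minimal sketch, not the script's own code: it depends on NumPy only, `cosine_similarity`, `predict_user_based`, `masked_rmse` and `toy_ratings` are hypothetical names introduced for illustration, and it uses true cosine similarity as the weight rather than the cosine distance returned by `pairwise_distances`.

```python
import numpy as np


def cosine_similarity(matrix):
    """Row-wise cosine similarity; all-zero rows get similarity 0 to every row."""
    norms = np.linalg.norm(matrix, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    unit = matrix / norms[:, np.newaxis]
    return unit.dot(unit.T)


def predict_user_based(ratings, similarity):
    """Mean-centered, similarity-weighted prediction (same formula as predict(type='user'))."""
    mean_user_rating = ratings.mean(axis=1)
    rating_diff = ratings - mean_user_rating[:, np.newaxis]
    weights = np.abs(similarity).sum(axis=1)[:, np.newaxis]
    return mean_user_rating[:, np.newaxis] + similarity.dot(rating_diff) / weights


def masked_rmse(prediction, ground_truth):
    """RMSE over the entries that are rated (non-zero) in ground_truth."""
    mask = ground_truth.nonzero()
    return np.sqrt(np.mean((prediction[mask] - ground_truth[mask]) ** 2))


# Hypothetical toy ratings: 3 users x 4 items, 0 means "not rated".
toy_ratings = np.array([[5.0, 3.0, 0.0, 1.0],
                        [4.0, 0.0, 0.0, 1.0],
                        [1.0, 1.0, 0.0, 5.0]])

toy_similarity = cosine_similarity(toy_ratings)
toy_prediction = predict_user_based(toy_ratings, toy_similarity)

print(toy_similarity)
print(toy_prediction)
print(masked_rmse(toy_prediction, toy_ratings))
```

Users 0 and 1 rate the same items highly, so their similarity comes out larger than that between users 0 and 2. The sketch assumes every user has at least one rating; an all-zero row would make the weight sum zero and needs separate handling.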

View File

@@ -0,0 +1,30 @@
#!/usr/bin/python
# coding:utf8
import numpy as np
from sklearn.decomposition import NMF
import matplotlib.pyplot as plt

RATE_MATRIX = np.array(
    [[5, 5, 3, 0, 5, 5],
     [5, 0, 4, 0, 4, 4],
     [0, 3, 0, 5, 4, 5],
     [5, 4, 3, 3, 5, 5]]
)

nmf = NMF(n_components=2)
user_distribution = nmf.fit_transform(RATE_MATRIX)
item_distribution = nmf.components_

item_distribution = item_distribution.T
plt.plot(item_distribution[:, 0], item_distribution[:, 1], "b*")
plt.xlim((-1, 3))
plt.ylim((-1, 3))

plt.title(u'the distribution of items (NMF)')
count = 1
for item in item_distribution:
    plt.text(item[0], item[1], 'item '+str(count), bbox=dict(facecolor='red', alpha=0.2),)
    count += 1

plt.show()

View File

@@ -0,0 +1,31 @@
#!/usr/bin/python
# coding:utf8
import numpy as np
from sklearn.decomposition import NMF
import matplotlib.pyplot as plt

RATE_MATRIX = np.array(
    [[5, 5, 3, 0, 5, 5],
     [5, 0, 4, 0, 4, 4],
     [0, 3, 0, 5, 4, 5],
     [5, 4, 3, 3, 5, 5]]
)

nmf = NMF(n_components=2)
user_distribution = nmf.fit_transform(RATE_MATRIX)
item_distribution = nmf.components_

users = ['Ben', 'Tom', 'John', 'Fred']
zip_data = zip(users, user_distribution)

plt.title(u'the distribution of users (NMF)')
plt.xlim((-1, 3))
plt.ylim((-1, 4))
for item in zip_data:
    user_name = item[0]
    data = item[1]
    plt.plot(data[0], data[1], "b*")
    plt.text(data[0], data[1], user_name, bbox=dict(facecolor='red', alpha=0.2),)

plt.show()

View File

@@ -0,0 +1,22 @@
#!/usr/bin/python
# coding:utf8
import numpy as np
from sklearn.decomposition import NMF
import matplotlib.pyplot as plt

RATE_MATRIX = np.array(
    [[5, 5, 3, 0, 5, 5],
     [5, 0, 4, 0, 4, 4],
     [0, 3, 0, 5, 4, 5],
     [5, 4, 3, 3, 5, 5]]
)

nmf = NMF(n_components=2)  # assume 2 latent topics
user_distribution = nmf.fit_transform(RATE_MATRIX)
item_distribution = nmf.components_

print('Topic distribution of users:')
print(user_distribution)
print('Topic distribution of items:')
print(item_distribution)
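To see what the two factor matrices mean, it can help to multiply them back together and compare the product with the original ratings. The sketch below is an assumption-laden illustration, not scikit-learn's implementation: it factorizes the same `RATE_MATRIX` with the classic multiplicative update rules for the Frobenius objective, using NumPy only. The function name `nmf_multiplicative` and its parameters `n_iter`, `eps` and `seed` are hypothetical, and the resulting factors will generally differ from what `sklearn.decomposition.NMF` returns.

```python
import numpy as np


def nmf_multiplicative(V, n_components, n_iter=500, eps=1e-9, seed=0):
    """Factorize V ~= W.dot(H) with non-negative W and H via multiplicative updates."""
    rng = np.random.RandomState(seed)
    n_rows, n_cols = V.shape
    # Strictly positive random start; the updates keep every entry non-negative
    W = rng.rand(n_rows, n_components) + eps
    H = rng.rand(n_components, n_cols) + eps
    for _ in range(n_iter):
        H *= W.T.dot(V) / (W.T.dot(W).dot(H) + eps)
        W *= V.dot(H.T) / (W.dot(H).dot(H.T) + eps)
    return W, H


RATE_MATRIX = np.array(
    [[5, 5, 3, 0, 5, 5],
     [5, 0, 4, 0, 4, 4],
     [0, 3, 0, 5, 4, 5],
     [5, 4, 3, 3, 5, 5]], dtype=float
)

# W plays the role of user_distribution, H the role of item_distribution
W, H = nmf_multiplicative(RATE_MATRIX, n_components=2)
reconstruction = W.dot(H)
error = np.linalg.norm(RATE_MATRIX - reconstruction)

print(np.round(reconstruction, 2))
print(error)
```

Entries that are 0 (unrated) in `RATE_MATRIX` receive non-zero values in `reconstruction`; those values are what a factorization-based recommender would read as predicted ratings. Note that this sketch, like the script above, treats unrated entries as actual zeros during fitting.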