mirror of
https://github.com/apachecn/ailearning.git
synced 2026-05-07 14:13:14 +08:00
修改朴素贝叶斯代码和文档,添加一些数据集
This commit is contained in:
@@ -2,7 +2,7 @@
|
||||
# 第3章 决策树
|
||||
<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=default"></script>
|
||||
|
||||

|
||||

|
||||
|
||||
## 决策树 概述
|
||||
|
||||
|
||||
483
docs/4.朴素贝叶斯.md
483
docs/4.朴素贝叶斯.md
@@ -4,91 +4,446 @@
|
||||
|
||||

|
||||
|
||||
## 使用概率分布进行分类
|
||||
## 朴素贝叶斯 概述
|
||||
|
||||
> 朴素贝叶斯简介
|
||||
`贝叶斯分类是一类分类算法的总称,这类算法均以贝叶斯定理为基础,故统称为贝叶斯分类。本章首先介绍贝叶斯分类算法的基础——贝叶斯定理。最后,我们通过实例来讨论贝叶斯分类的中最简单的一种: 朴素贝叶斯分类。`
|
||||
|
||||
```
|
||||
前两章我们要求分类器做出艰难决策,给出“该数据实例属于哪一类”这类问题的明确答案。不过,分类器有时候会产生错误的结果,
|
||||
这时可以要求分类器给出一个最优的类别猜测结果,同时给出这个猜测的概率估计值。
|
||||
在这里我们先统计特征在数据集中取某个特定值的次数,然后除以数据集的实例总数,就得到了特征取该值的概率。
|
||||
下面我们会给出一些使用概率论进行分类的方法。首先从一个最简单的概率分类器开始,然后给出一些假设来学习朴素贝叶斯分类器。
|
||||
我们称之为“朴素”,是因为整个形式化过程只做最原始、最简答的假设。
|
||||
```
|
||||
## 贝叶斯理论 & 条件概率
|
||||
|
||||
> 贝叶斯决策理论
|
||||
|
||||
```
|
||||
朴素贝叶斯是贝叶斯决策理论的一部分,所以讲述朴素贝叶斯之前有必要快速了解一下贝叶斯决策理论。
|
||||
假设我们有一个数据集,它由两类数据组成,数据分布如下图所示:
|
||||
```
|
||||
### 贝叶斯理论
|
||||
|
||||
我们现在有一个数据集,它由两类数据组成,数据分布如下图所示:
|
||||

|
||||
|
||||
```
|
||||
我们现在用p1(x,y)表示数据点(x,y)属于类别1(图中用圆点表示的类别)的概率,用p2(x,y)表示数据点(x,y)属于类别2(图中三角形表示的类别)的概率,
|
||||
那么对于一个新数据点(x,y),可以用下面的规则来判断它的类别:
|
||||
* 如果 p1(x,y) > p2(x,y) ,那么类别为1
|
||||
* 如果 p2(x,y) > p1(x,y) ,那么类别为2
|
||||
也就是说,我们会选择高概率对应的类别。这就是贝叶斯决策理论的核心思想,即选择具有最高概率的决策。
|
||||
```
|
||||
我们现在用 p1(x,y) 表示数据点 (x,y) 属于类别 1(图中用圆点表示的类别)的概率,用 p2(x,y) 表示数据点 (x,y) 属于类别 2(图中三角形表示的类别)的概率,那么对于一个新数据点 (x,y),可以用下面的规则来判断它的类别:
|
||||
* 如果 p1(x,y) > p2(x,y) ,那么类别为1
|
||||
* 如果 p2(x,y) > p1(x,y) ,那么类别为2
|
||||
|
||||
> 朴素贝叶斯特点
|
||||
也就是说,我们会选择高概率对应的类别。这就是贝叶斯决策理论的核心思想,即选择具有最高概率的决策。
|
||||
|
||||
### 条件概率
|
||||
|
||||
如果你对 p(x,y|c1) 符号很熟悉,那么可以跳过本小节。
|
||||
|
||||
有一个装了 7 块石头的罐子,其中 3 块是灰色的,4 块是黑色的。如果从罐子中随机去除一块石头,那么是灰色石头的可能性是多少?由于取石头有 7 中可能,其中 3 种为灰色,所以取出灰色石头的概率为 3/7 。那么取到黑色石头的概率又是多少呢?很显然,是 4/7 。我们使用 P(gray) 来表示取到灰色石头的概率,其概率值可以通过灰色石头数目除以总的石头数目来得到。
|
||||
|
||||

|
||||
|
||||
如果这 7 块石头如下图所示,放在两个桶中,那么上述概率应该如何计算?
|
||||
|
||||

|
||||
|
||||
计算 P(gray) 或者 P(black) ,如果事先我们知道石头所在桶的信息是会改变结果的。这就是所谓的条件概率(conditional probablity)。假定计算的是从 B 桶取到灰色石头的概率,这个概率可以记作 P(gray|bucketB) ,我们称之为“在已知石头出自 B 桶的条件下,取出灰色石头的概率”。不难得到,P(gray|bucketA) 值为 2/4 ,P(gray|bucketB) 的值为 1/3 。
|
||||
|
||||
条件概率的计算公式如下:
|
||||
|
||||
P(gray|bucketB) = P(gray and bucketB) / P(bucketB)
|
||||
|
||||
首先,我们用 B 桶中灰色石头的个数除以两个桶中总的石头数,得到 P(gray and bucketB) = 1/7 .其次,由于 B 桶中有 3 块石头,而总石头数为 7 ,于是 P(bucketB) 就等于 3/7 。于是又 P(gray|bucketB) = P(gray and bucketB) / P(bucketB) = (1/7) / (3/7) = 1/3 。
|
||||
|
||||
另外一种有效计算条件概率的方法称为贝叶斯准则。贝叶斯准则告诉我们如何交换条件概率中的条件与结果,即如果已知 P(x|c),要求 P(c|x),那么可以使用下面的计算方法:
|
||||
|
||||

|
||||
|
||||
### 使用条件概率来分类
|
||||
|
||||
上面我们提到贝叶斯决策理论要求计算两个概率 p1(x, y) 和 p2(x, y):
|
||||
* 如果 p1(x, y) > p2(x, y), 那么属于类别 1;
|
||||
* 如果 p2(x, y) > p1(X, y), 那么属于类别 2.
|
||||
|
||||
这并不是贝叶斯决策理论的所有内容。使用 p1() 和 p2() 只是为了尽可能简化描述,而真正需要计算和比较的是 p(c1|x, y) 和 p(c2|x, y) .这些符号所代表的具体意义是: 给定某个由 x、y 表示的数据点,那么该数据点来自类别 c1 的概率是多少?数据点来自类别 c2 的概率又是多少?注意这些概率与刚才给出的概率 p(x, y|c1) 并不一样,不过可以使用贝叶斯准则来交换概率中条件与结果。具体地,应用贝叶斯准则得到:
|
||||
|
||||

|
||||
|
||||
使用上面这些定义,可以定义贝叶斯分类准则为:
|
||||
* 如果 P(c1|x, y) > P(c2|x, y), 那么属于类别 c1;
|
||||
* 如果 P(c1|x, y) > P(c2|x, y), 那么属于类别 c2.
|
||||
|
||||
在文档分类中,整个文档(如一封电子邮件)是实例,而电子邮件中的某些元素则构成特征。我们可以观察文档中出现的词,并把每个词的出现或者不出现作为一个特征,这样得到的特征数目就会跟词汇表中的数目一样多。
|
||||
|
||||
我们假设特征之间<b>相互独立</b>。所谓 <b>独立(independence)</b> 指的是统计意义上的独立,即一个特征或者单词出现的可能性与它和其他单词相邻没有关系。这个假设正是朴素贝叶斯分类器中 朴素(naive) 一词的含义。朴素贝叶斯分类器中的另一个假设是,<b>每个特征同等重要</b>。
|
||||
|
||||
<b>Note:</b> 朴素贝叶斯分类器通常有两种实现方式: 一种基于伯努利模型实现,一种基于多项式模型实现。这里采用前一种实现方式。该实现方式中并不考虑词在文档中出现的次数,只考虑出不出现,因此在这个意义上相当于假设词是等权重的。
|
||||
|
||||
## 朴素贝叶斯 场景
|
||||
|
||||
机器学习的一个重要应用就是文档的自动分类。
|
||||
|
||||
在文档分类中,整个文档(如一封电子邮件)是实例,而电子邮件中的某些元素则构成特征。例如,我们可以观察文档中出现的词,并把每个词的出现或者不出现作为一个特征,这样得到的特征数目就会跟词汇表中的词目一样多。
|
||||
|
||||
朴素贝叶斯是上面介绍的贝叶斯分类器的一个扩展,是用于文档分类的常用算法。下面我们会进行一些朴素贝叶斯分类的实践项目。
|
||||
|
||||
## 朴素贝叶斯 原理
|
||||
|
||||
### 朴素贝叶斯 工作原理
|
||||
|
||||
```
|
||||
优点:在数据较少的情况下仍然有效,可以处理多类别问题。
|
||||
缺点:对于输入数据的准备方式较为敏感。
|
||||
适用于数据类型:标称型数据。
|
||||
计算每个类别中的文档数目
|
||||
对每篇训练文档:
|
||||
对每个类别:
|
||||
如果词条出现在文档中-->增加该词条的计数值
|
||||
增加所有词条的计数值
|
||||
对每个类别:
|
||||
对每个词条:
|
||||
将该词条的数目除以总词条数目得到的条件概率
|
||||
返回每个类别的条件概率
|
||||
```
|
||||
|
||||
> 朴素贝叶斯的一般过程
|
||||
### 朴素贝叶斯 开发流程
|
||||
|
||||
```
|
||||
(1)收集数据:可以使用任何方法。本章使用RSS源。
|
||||
(2)准备数据:需要数值型或者布尔型数据。
|
||||
(3)分析数据:有大量特征时,绘制特征作用不大,此时使用直方图效果更好。
|
||||
(4)训练算法:计算不同的独立特征的条件概率。
|
||||
(5)测试算法:计算错误率。
|
||||
(6)使用算法:一个常见的朴素贝叶斯应用是文档分类。可以在任意的分类场景中使用朴素贝叶斯分类器,不一定非要是文本。
|
||||
收集数据: 可以使用任何方法。
|
||||
准备数据: 需要数值型或者布尔型数据。
|
||||
分析数据: 有大量特征时,绘制特征作用不大,此时使用直方图效果更好。
|
||||
训练算法: 计算不同的独立特征的条件概率。
|
||||
测试算法: 计算错误率。
|
||||
使用算法: 一个常见的朴素贝叶斯应用是文档分类。可以在任意的分类场景中使用朴素贝叶斯分类器,不一定非要是文本。
|
||||
```
|
||||
|
||||
## 学习朴素贝叶斯分类器
|
||||
|
||||
> 条件概率
|
||||
|
||||

|
||||
|
||||
> 计算上述概率
|
||||
### 朴素贝叶斯 算法特点
|
||||
|
||||
```
|
||||
要计算 P(gray) 或者 P(black) ,事先得知道石头所在桶的信息会不会改变结果?你有可能已经想到计算从 B 桶中取到灰色石头的概率的办法,
|
||||
这就是所谓的条件概率 (conditional probability)。假定计算的是从 B 桶取到灰色石头的概率,这个概率可以记作 P(gray|bucketB),
|
||||
我们称之为“在已知石头出自B桶的条件下,取出灰色石头的概率”。不难得到,P(gray|bucketA)值为2/4,P(gray|bucketB)的值为1/3。
|
||||
条件概率的计算公式如下所示:
|
||||
P(gray|bucketB) = P(gray and bucketB) / P(bucketB)
|
||||
```
|
||||
> 贝叶斯准则
|
||||
|
||||
```
|
||||
贝叶斯准则告诉我们如何交换条件概率中的条件和结果,即如果已知 P(x|c) ,要求 P(c|x) ,那么可以使用下面的计算方法:
|
||||
p(c|x) = p(x|c)·p(c)/p(x)
|
||||
优点: 在数据较少的情况下仍然有效,可以处理多类别问题。
|
||||
缺点: 对于输入数据的准备方式较为敏感。
|
||||
适用数据类型: 标称型数据。
|
||||
```
|
||||
|
||||
> 使用条件概率来分类
|
||||
## 朴素贝叶斯 项目案例
|
||||
|
||||
上面提到的贝叶斯决策理论要求计算两个概率 p1(x,y) 和 p2(x,y):
|
||||
* 如果 p1(x,y) > p2(x,y) ,那么属于类型1;
|
||||
* 如果 p2(x,y) > p1(x,y) ,那么属于类型2;
|
||||
但这两个准则并不是贝叶斯决策理论的所有内容。使用 p1() 和 p2() 只是为了尽可能简化描述,而真正需要计算和比较的是 p(c1|x,y) 和 p(c2|x,y)。
|
||||
这些符号代表的具体意义是:给定某个由x,y表示的数据点,那么该数据点来自类别c1的概率是多少?数据点来自类别c2的概率又是多少?
|
||||
注意这些概率与刚才给出的概率p(x,y|c1)并不一样,不过可以使用贝叶斯准则来交换概率中条件和结果。具体地,应用贝叶斯准则得到:
|
||||
p(ci|x,y) = p(x,y|ci)·p(ci)/p(x,y)
|
||||
### 项目案例1: 屏蔽社区留言板的侮辱性言论
|
||||
|
||||
#### 项目概述
|
||||
|
||||
构建一个快读过滤器来屏蔽在线社区留言板上的侮辱性言论。如果某条留言使用了负面或者侮辱性的语言,那么就将该留言标识为内容不当。对此问题建立两个类别: 侮辱类和非侮辱类,使用 1 和 0 分别表示。
|
||||
|
||||
#### 开发流程
|
||||
|
||||
```
|
||||
收集数据: 可以使用任何方法
|
||||
准备数据: 从文本中构建词向量
|
||||
分析数据: 检查词条确保解析的正确性
|
||||
训练算法: 从词向量计算概率
|
||||
测试算法: 根据现实情况修改分类器
|
||||
使用算法: 对社区留言板言论进行分类
|
||||
```
|
||||
|
||||
> 收集数据: 可以使用任何方法
|
||||
|
||||
本例是我们自己构造的词表:
|
||||
|
||||
```python
|
||||
def loadDataSet():
|
||||
"""
|
||||
创建数据集
|
||||
:return: 单词列表postingList, 所属类别classVec
|
||||
"""
|
||||
postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'], #[0,0,1,1,1......]
|
||||
['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
|
||||
['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
|
||||
['stop', 'posting', 'stupid', 'worthless', 'garbage'],
|
||||
['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
|
||||
['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
|
||||
classVec = [0, 1, 0, 1, 0, 1] # 1 is abusive, 0 not
|
||||
return postingList, classVec
|
||||
```
|
||||
|
||||
> 准备数据: 从文本中构建词向量
|
||||
|
||||
```python
|
||||
def createVocabList(dataSet):
|
||||
"""
|
||||
获取所有单词的集合
|
||||
:param dataSet: 数据集
|
||||
:return: 所有单词的集合(即不含重复元素的单词列表)
|
||||
"""
|
||||
vocabSet = set([]) # create empty set
|
||||
for document in dataSet:
|
||||
# 操作符 | 用于求两个集合的并集
|
||||
vocabSet = vocabSet | set(document) # union of the two sets
|
||||
return list(vocabSet)
|
||||
|
||||
|
||||
def setOfWords2Vec(vocabList, inputSet):
|
||||
"""
|
||||
遍历查看该单词是否出现,出现该单词则将该单词置1
|
||||
:param vocabList: 所有单词集合列表
|
||||
:param inputSet: 输入数据集
|
||||
:return: 匹配列表[0,1,0,1...],其中 1与0 表示词汇表中的单词是否出现在输入的数据集中
|
||||
"""
|
||||
# 创建一个和词汇表等长的向量,并将其元素都设置为0
|
||||
returnVec = [0] * len(vocabList)# [0,0......]
|
||||
# 遍历文档中的所有单词,如果出现了词汇表中的单词,则将输出的文档向量中的对应值设为1
|
||||
for word in inputSet:
|
||||
if word in vocabList:
|
||||
returnVec[vocabList.index(word)] = 1
|
||||
else:
|
||||
print "the word: %s is not in my Vocabulary!" % word
|
||||
return returnVec
|
||||
```
|
||||
|
||||
> 分析数据: 检查词条确保解析的正确性
|
||||
|
||||
检查函数执行情况,检查词表,不出现重复单词,需要的话,可以对其进行排序。
|
||||
|
||||
```python
|
||||
>>> listOPosts, listClasses = bayes.loadDataSet()
|
||||
>>> myVocabList = bayes.createVocabList(listOPosts)
|
||||
>>> myVocabList
|
||||
['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park',
|
||||
'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying', 'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog', 'how',
|
||||
'stupid', 'so', 'take', 'mr', 'steak', 'my']
|
||||
```
|
||||
|
||||
检查函数有效性。例如:myVocabList 中索引为 2 的元素是什么单词?应该是是 help 。该单词在第一篇文档中出现了,现在检查一下看看它是否出现在第四篇文档中。
|
||||
|
||||
```python
|
||||
>>> bayes.setOfWords2Vec(myVocabList, listOPosts[0])
|
||||
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]
|
||||
|
||||
>>> bayes.setOfWords2Vec(myVocabList, listOPosts[3])
|
||||
[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
|
||||
```
|
||||
|
||||
> 训练算法: 从词向量计算概率
|
||||
|
||||
现在已经知道了一个词是否出现在一篇文档中,也知道该文档所属的类别。接下来我们重写贝叶斯准则,将之前的 x, y 替换为 <b>w</b>. 粗体的 <b>w</b> 表示这是一个向量,即它由多个值组成。在这个例子中,数值个数与词汇表中的词个数相同。
|
||||
|
||||

|
||||
|
||||
我们使用上述公式,对每个类计算该值,然后比较这两个概率值的大小。
|
||||
|
||||
首先可以通过类别 i (侮辱性留言或者非侮辱性留言)中文档数除以总的文档数来计算概率 p(ci) 。接下来计算 p(<b>w</b> | ci) ,这里就要用到朴素贝叶斯假设。如果将 w 展开为一个个独立特征,那么就可以将上述概率写作 p(w0, w1, w2...wn | ci) 。这里假设所有词都互相独立,该假设也称作条件独立性假设,它意味着可以使用 p(w0 | ci)p(w1 | ci)p(w2 | ci)...p(wn | ci) 来计算上述概率,这样就极大地简化了计算的过程。
|
||||
|
||||
朴素贝叶斯分类器训练函数
|
||||
|
||||
```python
|
||||
def trainNB0(trainMatrix, trainCategory):
|
||||
"""
|
||||
训练数据优化版本
|
||||
:param trainMatrix: 文件单词矩阵
|
||||
:param trainCategory: 文件对应的类别
|
||||
:return:
|
||||
"""
|
||||
# 总文件数
|
||||
numTrainDocs = len(trainMatrix)
|
||||
# 总单词数
|
||||
numWords = len(trainMatrix[0])
|
||||
# 侮辱性文件的出现概率
|
||||
pAbusive = sum(trainCategory) / float(numTrainDocs)
|
||||
# 构造单词出现次数列表
|
||||
# p0Num 正常的统计
|
||||
# p1Num 侮辱的统计
|
||||
p0Num = ones(numWords)#[0,0......]->[1,1,1,1,1.....]
|
||||
p1Num = ones(numWords)
|
||||
|
||||
# 整个数据集单词出现总数,2.0根据样本/实际调查结果调整分母的值(2主要是避免分母为0,当然值可以调整)
|
||||
# p0Denom 正常的统计
|
||||
# p1Denom 侮辱的统计
|
||||
p0Denom = 2.0
|
||||
p1Denom = 2.0
|
||||
for i in range(numTrainDocs):
|
||||
if trainCategory[i] == 1:
|
||||
# 累加辱骂词的频次
|
||||
p1Num += trainMatrix[i]
|
||||
# 对每篇文章的辱骂的频次 进行统计汇总
|
||||
p1Denom += sum(trainMatrix[i])
|
||||
else:
|
||||
p0Num += trainMatrix[i]
|
||||
p0Denom += sum(trainMatrix[i])
|
||||
# 类别1,即侮辱性文档的[log(P(F1|C1)),log(P(F2|C1)),log(P(F3|C1)),log(P(F4|C1)),log(P(F5|C1))....]列表
|
||||
p1Vect = log(p1Num / p1Denom)
|
||||
# 类别0,即正常文档的[log(P(F1|C0)),log(P(F2|C0)),log(P(F3|C0)),log(P(F4|C0)),log(P(F5|C0))....]列表
|
||||
p0Vect = log(p0Num / p0Denom)
|
||||
return p0Vect, p1Vect, pAbusive
|
||||
```
|
||||
|
||||
> 测试算法: 根据现实情况修改分类器
|
||||
|
||||
在利用贝叶斯分类器对文档进行分类时,要计算多个概率的乘积以获得文档属于某个类别的概率,即计算 p(w0 | 1)p(w1 | 1)p(w2 | 1)。如果其中一个概率值为 0,那么最后的乘积也为 0。为降低这种影响,可以将所有词的出现数初始化为 1,并将分母初始化为 2 。
|
||||
|
||||
另一个遇到的问题是下溢出,这是由于太多很小的数相乘造成的。当计算乘积 p(w0 | ci)p(w1 | ci)p(w2 | ci)...p(wn | ci) 时,由于大部分因子都非常小,所以程序会下溢出或者得到不正确的答案。(用 Python 尝试相乘许多很小的数,最后四舍五入后会得到 0)。一种解决办法是对乘积取自然对数。在代数中有 ln(a * b) = ln(a) + ln(b), 于是通过求对数可以避免下溢出或者浮点数舍入导致的错误。同时,采用自然对数进行处理不会有任何损失。
|
||||
|
||||
下图给出了函数 f(x) 与 ln(f(x)) 的曲线。可以看出,它们在相同区域内同时增加或者减少,并且在相同点上取到极值。它们的取值虽然不同,但不影响最终结果。通过修改 return 前的两行代码,将上述做法用到分类器中:
|
||||
|
||||
```python
|
||||
p1Vect = log(p1Num /p1Denom)
|
||||
p0Vect = log(p0Num / p0Denom)
|
||||
```
|
||||
|
||||

|
||||
|
||||
> 使用算法: 对社区留言板言论进行分类
|
||||
|
||||
朴素贝叶斯分类函数
|
||||
|
||||
```python
|
||||
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
|
||||
"""
|
||||
使用算法:
|
||||
# 将乘法转换为加法
|
||||
乘法:P(C|F1F2...Fn) = P(F1F2...Fn|C)P(C)/P(F1F2...Fn)
|
||||
加法:P(F1|C)*P(F2|C)....P(Fn|C)P(C) -> log(P(F1|C))+log(P(F2|C))+....+log(P(Fn|C))+log(P(C))
|
||||
:param vec2Classify: 待测数据[0,1,1,1,1...],即要分类的向量
|
||||
:param p0Vec: 类别0,即正常文档的[log(P(F1|C0)),log(P(F2|C0)),log(P(F3|C0)),log(P(F4|C0)),log(P(F5|C0))....]列表
|
||||
:param p1Vec: 类别1,即侮辱性文档的[log(P(F1|C1)),log(P(F2|C1)),log(P(F3|C1)),log(P(F4|C1)),log(P(F5|C1))....]列表
|
||||
:param pClass1: 类别1,侮辱性文件的出现概率
|
||||
:return: 类别1 or 0
|
||||
"""
|
||||
# 计算公式 log(P(F1|C))+log(P(F2|C))+....+log(P(Fn|C))+log(P(C))
|
||||
# 使用 NumPy 数组来计算两个向量相乘的结果,这里的相乘是指对应元素相乘,即先将两个向量中的第一个元素相乘,然后将第2个元素相乘,以此类推。
|
||||
# 我的理解是:这里的 vec2Classify * p1Vec 的意思就是将每个词与其对应的概率相关联起来
|
||||
p1 = sum(vec2Classify * p1Vec) + log(pClass1)
|
||||
p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
|
||||
if p1 > p0:
|
||||
return 1
|
||||
else:
|
||||
return 0
|
||||
|
||||
|
||||
def testingNB():
|
||||
"""
|
||||
测试朴素贝叶斯算法
|
||||
"""
|
||||
# 1. 加载数据集
|
||||
listOPosts, listClasses = loadDataSet()
|
||||
# 2. 创建单词集合
|
||||
myVocabList = createVocabList(listOPosts)
|
||||
# 3. 计算单词是否出现并创建数据矩阵
|
||||
trainMat = []
|
||||
for postinDoc in listOPosts:
|
||||
# 返回m*len(myVocabList)的矩阵, 记录的都是0,1信息
|
||||
trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
|
||||
# 4. 训练数据
|
||||
p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))
|
||||
# 5. 测试数据
|
||||
testEntry = ['love', 'my', 'dalmation']
|
||||
thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
|
||||
print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)
|
||||
testEntry = ['stupid', 'garbage']
|
||||
thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
|
||||
print testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)
|
||||
```
|
||||
|
||||
### 项目案例2: 使用朴素贝叶斯过滤垃圾邮件
|
||||
|
||||
#### 项目概述
|
||||
|
||||
完成朴素贝叶斯的一个最著名的应用: 电子邮件垃圾过滤。
|
||||
|
||||
#### 开发流程
|
||||
|
||||
使用朴素贝叶斯对电子邮件进行分类
|
||||
|
||||
```
|
||||
收集数据: 提供文本文件
|
||||
准备数据: 将文本文件解析成词条向量
|
||||
分析数据: 检查词条确保解析的正确性
|
||||
训练算法: 使用我们之前建立的 trainNB() 函数
|
||||
测试算法: 使用朴素贝叶斯进行交叉验证
|
||||
使用算法: 构建一个完整的程序对一组文档进行分类,将错分的文档输出到屏幕上
|
||||
```
|
||||
|
||||
> 收集数据: 提供文本文件
|
||||
|
||||
文本文件内容如下:
|
||||
|
||||
```
|
||||
Hi Peter,
|
||||
|
||||
With Jose out of town, do you want to
|
||||
meet once in a while to keep things
|
||||
going and do some interesting stuff?
|
||||
|
||||
Let me know
|
||||
Eugene
|
||||
```
|
||||
|
||||
> 准备数据: 将文本文件解析成词条向量
|
||||
|
||||
文档词袋模型
|
||||
|
||||
我们将每个词的出现与否作为一个特征,这可以被描述为 <b>词集模型(set-of-words model)</b>。如果一个词在文档中出现不止一次,这可能意味着包含该词是否出现在文档中所不能表达的某种信息,这种方法被称为 <b>词袋模型(bag-of-words model)</b>。在词袋中,每个单词可以出现多次,而在词集中,每个词只能出现一次。为适应词袋模型,需要对函数 setOfWords2Vec() 稍加修改,修改后的函数为 bagOfWords2Vec() 。
|
||||
|
||||
如下给出了基于词袋模型的朴素贝叶斯代码。它与函数 setOfWords2Vec() 几乎完全相同,唯一不同的是每当遇到一个单词时,它会增加词向量中的对应值,而不只是将对应的数值设为 1 。
|
||||
|
||||
```python
|
||||
def bagOfWords2VecMN(vocaList, inputSet):
|
||||
returnVec = [0] * len(vocabList)
|
||||
for word in inputSet:
|
||||
if word in inputSet:
|
||||
returnVec[vocabList.index(word)] += 1
|
||||
return returnVec
|
||||
```
|
||||
|
||||
使用正则表达式来切分文本
|
||||
|
||||
```python
|
||||
>>> mySent = 'This book is the best book on Python or M.L. I have ever laid eyes upon.'
|
||||
>>> import re
|
||||
>>> regEx = re.compile('\\W*')
|
||||
>>> listOfTokens = regEx.split(mySent)
|
||||
>>> listOfTokens
|
||||
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M.L.', 'I', 'have', 'ever', 'laid', 'eyes', 'upon', '']
|
||||
```
|
||||
|
||||
> 分析数据: 检查词条确保解析的正确性
|
||||
|
||||
> 训练算法: 使用我们之前建立的 trainNB() 函数
|
||||
|
||||
```python
|
||||
def trainNB0(trainMatrix, trainCategory):
|
||||
"""
|
||||
训练数据优化版本
|
||||
:param trainMatrix: 文件单词矩阵
|
||||
:param trainCategory: 文件对应的类别
|
||||
:return:
|
||||
"""
|
||||
# 总文件数
|
||||
numTrainDocs = len(trainMatrix)
|
||||
# 总单词数
|
||||
numWords = len(trainMatrix[0])
|
||||
# 侮辱性文件的出现概率
|
||||
pAbusive = sum(trainCategory) / float(numTrainDocs)
|
||||
# 构造单词出现次数列表
|
||||
# p0Num 正常的统计
|
||||
# p1Num 侮辱的统计
|
||||
p0Num = ones(numWords)#[0,0......]->[1,1,1,1,1.....]
|
||||
p1Num = ones(numWords)
|
||||
|
||||
# 整个数据集单词出现总数,2.0根据样本/实际调查结果调整分母的值(2主要是避免分母为0,当然值可以调整)
|
||||
# p0Denom 正常的统计
|
||||
# p1Denom 侮辱的统计
|
||||
p0Denom = 2.0
|
||||
p1Denom = 2.0
|
||||
for i in range(numTrainDocs):
|
||||
if trainCategory[i] == 1:
|
||||
# 累加辱骂词的频次
|
||||
p1Num += trainMatrix[i]
|
||||
# 对每篇文章的辱骂的频次 进行统计汇总
|
||||
p1Denom += sum(trainMatrix[i])
|
||||
else:
|
||||
p0Num += trainMatrix[i]
|
||||
p0Denom += sum(trainMatrix[i])
|
||||
# 类别1,即侮辱性文档的[log(P(F1|C1)),log(P(F2|C1)),log(P(F3|C1)),log(P(F4|C1)),log(P(F5|C1))....]列表
|
||||
p1Vect = log(p1Num / p1Denom)
|
||||
# 类别0,即正常文档的[log(P(F1|C0)),log(P(F2|C0)),log(P(F3|C0)),log(P(F4|C0)),log(P(F5|C0))....]列表
|
||||
p0Vect = log(p0Num / p0Denom)
|
||||
return p0Vect, p1Vect, pAbusive
|
||||
```
|
||||
|
||||
> 测试算法: 使用朴素贝叶斯进行交叉验证
|
||||
|
||||
```python
|
||||
|
||||
```
|
||||
|
||||
> 使用算法: 构建一个完整的程序对一组文档进行分类,将错分的文档输出到屏幕上
|
||||
|
||||
|
||||
### 项目案例3: 使用朴素贝叶斯分类器从个人广告中获取区域倾向
|
||||
|
||||
#### 项目概述
|
||||
|
||||
使用这些定义,可以定义贝叶斯分类准则为:
|
||||
* 如果 P(c1|x,y) > P(c2|x,y) ,那么属于类别c1;
|
||||
* 如果 P(c1|x,y) < P(c2|x,y) ,那么属于类别c2;
|
||||
使用贝叶斯准则,可以通过已知的三个概率值来计算未知的概率值。后面就会给出利用贝叶斯准则来计算概率并对数据进行分类的代码。现在介绍了一些概率理论,
|
||||
你也了解了基于这些理论构建分类器的方法,接下来就要将它们付诸实践。
|
||||
|
||||
## 解析RSS源数据
|
||||
|
||||
|
||||
BIN
images/4.NaiveBayesian/NB_2.png
Normal file
BIN
images/4.NaiveBayesian/NB_2.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 15 KiB |
BIN
images/4.NaiveBayesian/NB_3.png
Normal file
BIN
images/4.NaiveBayesian/NB_3.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 11 KiB |
BIN
images/4.NaiveBayesian/NB_4.png
Normal file
BIN
images/4.NaiveBayesian/NB_4.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 2.9 KiB |
BIN
images/4.NaiveBayesian/NB_5.png
Normal file
BIN
images/4.NaiveBayesian/NB_5.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 3.7 KiB |
BIN
images/4.NaiveBayesian/NB_6.png
Normal file
BIN
images/4.NaiveBayesian/NB_6.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 3.6 KiB |
BIN
images/4.NaiveBayesian/NB_7.png
Normal file
BIN
images/4.NaiveBayesian/NB_7.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 40 KiB |
8
input/4.NaiveBayes/email/ham/1.txt
Normal file
8
input/4.NaiveBayes/email/ham/1.txt
Normal file
@@ -0,0 +1,8 @@
|
||||
Hi Peter,
|
||||
|
||||
With Jose out of town, do you want to
|
||||
meet once in a while to keep things
|
||||
going and do some interesting stuff?
|
||||
|
||||
Let me know
|
||||
Eugene
|
||||
4
input/4.NaiveBayes/email/ham/10.txt
Normal file
4
input/4.NaiveBayes/email/ham/10.txt
Normal file
@@ -0,0 +1,4 @@
|
||||
Ryan Whybrew commented on your status.
|
||||
|
||||
Ryan wrote:
|
||||
"turd ferguson or butt horn."
|
||||
8
input/4.NaiveBayes/email/ham/11.txt
Normal file
8
input/4.NaiveBayes/email/ham/11.txt
Normal file
@@ -0,0 +1,8 @@
|
||||
Arvind Thirumalai commented on your status.
|
||||
|
||||
Arvind wrote:
|
||||
""you know""
|
||||
|
||||
|
||||
Reply to this email to comment on this status.
|
||||
|
||||
11
input/4.NaiveBayes/email/ham/12.txt
Normal file
11
input/4.NaiveBayes/email/ham/12.txt
Normal file
@@ -0,0 +1,11 @@
|
||||
Thanks Peter.
|
||||
|
||||
I'll definitely check in on this. How is your book
|
||||
going? I heard chapter 1 came in and it was in
|
||||
good shape. ;-)
|
||||
|
||||
I hope you are doing well.
|
||||
|
||||
Cheers,
|
||||
|
||||
Troy
|
||||
10
input/4.NaiveBayes/email/ham/13.txt
Normal file
10
input/4.NaiveBayes/email/ham/13.txt
Normal file
@@ -0,0 +1,10 @@
|
||||
Jay Stepp commented on your status.
|
||||
|
||||
Jay wrote:
|
||||
""to the" ???"
|
||||
|
||||
|
||||
Reply to this email to comment on this status.
|
||||
|
||||
To see the comment thread, follow the link below:
|
||||
|
||||
10
input/4.NaiveBayes/email/ham/14.txt
Normal file
10
input/4.NaiveBayes/email/ham/14.txt
Normal file
@@ -0,0 +1,10 @@
|
||||
LinkedIn
|
||||
|
||||
Kerry Haloney requested to add you as a connection on LinkedIn:
|
||||
|
||||
Peter,
|
||||
|
||||
I'd like to add you to my professional network on LinkedIn.
|
||||
|
||||
- Kerry Haloney
|
||||
|
||||
9
input/4.NaiveBayes/email/ham/15.txt
Normal file
9
input/4.NaiveBayes/email/ham/15.txt
Normal file
@@ -0,0 +1,9 @@
|
||||
Hi Peter,
|
||||
|
||||
The hotels are the ones that rent out the tent. They are all lined up on the hotel grounds : )) So much for being one with nature, more like being one with a couple dozen tour groups and nature.
|
||||
I have about 100M of pictures from that trip. I can go through them and get you jpgs of my favorite scenic pictures.
|
||||
|
||||
Where are you and Jocelyn now? New York? Will you come to Tokyo for Chinese New Year? Perhaps to see the two of you then. I will go to Thailand for winter holiday to see my mom : )
|
||||
|
||||
Take care,
|
||||
D
|
||||
1
input/4.NaiveBayes/email/ham/16.txt
Normal file
1
input/4.NaiveBayes/email/ham/16.txt
Normal file
@@ -0,0 +1 @@
|
||||
yeah I am ready. I may not be here because Jar Jar has plane tickets to Germany for me.
|
||||
11
input/4.NaiveBayes/email/ham/17.txt
Normal file
11
input/4.NaiveBayes/email/ham/17.txt
Normal file
@@ -0,0 +1,11 @@
|
||||
Benoit Mandelbrot 1924-2010
|
||||
|
||||
Benoit Mandelbrot 1924-2010
|
||||
|
||||
Wilmott Team
|
||||
|
||||
Benoit Mandelbrot, the mathematician, the father of fractal mathematics, and advocate of more sophisticated modelling in quantitative finance, died on 14th October 2010 aged 85.
|
||||
|
||||
Wilmott magazine has often featured Mandelbrot, his ideas, and the work of others inspired by his fundamental insights.
|
||||
|
||||
You must be logged on to view these articles from past issues of Wilmott Magazine.
|
||||
8
input/4.NaiveBayes/email/ham/18.txt
Normal file
8
input/4.NaiveBayes/email/ham/18.txt
Normal file
@@ -0,0 +1,8 @@
|
||||
Hi Peter,
|
||||
|
||||
Sure thing. Sounds good. Let me know what time would be good for you.
|
||||
I will come prepared with some ideas and we can go from there.
|
||||
|
||||
Regards,
|
||||
|
||||
-Vivek.
|
||||
10
input/4.NaiveBayes/email/ham/19.txt
Normal file
10
input/4.NaiveBayes/email/ham/19.txt
Normal file
@@ -0,0 +1,10 @@
|
||||
LinkedIn
|
||||
|
||||
Julius O requested to add you as a connection on LinkedIn:
|
||||
|
||||
Hi Peter.
|
||||
|
||||
Looking forward to the book!
|
||||
|
||||
|
||||
Accept View invitation from Julius O
|
||||
3
input/4.NaiveBayes/email/ham/2.txt
Normal file
3
input/4.NaiveBayes/email/ham/2.txt
Normal file
@@ -0,0 +1,3 @@
|
||||
Yay to you both doing fine!
|
||||
|
||||
I'm working on an MBA in Design Strategy at CCA (top art school.) It's a new program focusing on more of a right-brained creative and strategic approach to management. I'm an 1/8 of the way done today!
|
||||
5
input/4.NaiveBayes/email/ham/20.txt
Normal file
5
input/4.NaiveBayes/email/ham/20.txt
Normal file
@@ -0,0 +1,5 @@
|
||||
I've thought about this and think it's possible. We should get another
|
||||
lunch. I have a car now and could come pick you up this time. Does
|
||||
this wednesday work? 11:50?
|
||||
|
||||
Can I have a signed copy of you book?
|
||||
6
input/4.NaiveBayes/email/ham/21.txt
Normal file
6
input/4.NaiveBayes/email/ham/21.txt
Normal file
@@ -0,0 +1,6 @@
|
||||
we saw this on the way to the coast...thought u might like it
|
||||
|
||||
hangzhou is huge, one day wasn't enough, but we got a glimpse...
|
||||
|
||||
we went inside the china pavilion at expo, it is pretty interesting,
|
||||
each province has an exhibit...
|
||||
7
input/4.NaiveBayes/email/ham/22.txt
Normal file
7
input/4.NaiveBayes/email/ham/22.txt
Normal file
@@ -0,0 +1,7 @@
|
||||
Hi Hommies,
|
||||
|
||||
Just got a phone call from the roofer, they will come and spaying the foaming today. it will be dusty. pls close all the doors and windows.
|
||||
Could you help me to close my bathroom window, cat window and the sliding door behind the TV?
|
||||
I don't know how can those 2 cats survive......
|
||||
|
||||
Sorry for any inconvenience!
|
||||
7
input/4.NaiveBayes/email/ham/23.txt
Normal file
7
input/4.NaiveBayes/email/ham/23.txt
Normal file
@@ -0,0 +1,7 @@
|
||||
|
||||
SciFinance now automatically generates GPU-enabled pricing & risk model source code that runs up to 50-300x faster than serial code using a new NVIDIA Fermi-class Tesla 20-Series GPU.
|
||||
|
||||
SciFinance® is a derivatives pricing and risk model development tool that automatically generates C/C++ and GPU-enabled source code from concise, high-level model specifications. No parallel computing or CUDA programming expertise is required.
|
||||
|
||||
SciFinance's automatic, GPU-enabled Monte Carlo pricing model source code generation capabilities have been significantly extended in the latest release. This includes:
|
||||
|
||||
1
input/4.NaiveBayes/email/ham/24.txt
Normal file
1
input/4.NaiveBayes/email/ham/24.txt
Normal file
@@ -0,0 +1 @@
|
||||
Ok I will be there by 10:00 at the latest.
|
||||
2
input/4.NaiveBayes/email/ham/25.txt
Normal file
2
input/4.NaiveBayes/email/ham/25.txt
Normal file
@@ -0,0 +1,2 @@
|
||||
That is cold. Is there going to be a retirement party?
|
||||
Are the leaves changing color?
|
||||
8
input/4.NaiveBayes/email/ham/3.txt
Normal file
8
input/4.NaiveBayes/email/ham/3.txt
Normal file
@@ -0,0 +1,8 @@
|
||||
WHat is going on there?
|
||||
I talked to John on email. We talked about some computer stuff that's it.
|
||||
|
||||
I went bike riding in the rain, it was not that cold.
|
||||
|
||||
We went to the museum in SF yesterday it was $3 to get in and they had
|
||||
free food. At the same time was a SF Giants game, when we got done we
|
||||
had to take the train with all the Giants fans, they are 1/2 drunk.
|
||||
3
input/4.NaiveBayes/email/ham/4.txt
Normal file
3
input/4.NaiveBayes/email/ham/4.txt
Normal file
@@ -0,0 +1,3 @@
|
||||
Yo. I've been working on my running website. I'm using jquery and the jqplot plugin. I'm not too far away from having a prototype to launch.
|
||||
|
||||
You used jqplot right? If not, I think you would like it.
|
||||
2
input/4.NaiveBayes/email/ham/5.txt
Normal file
2
input/4.NaiveBayes/email/ham/5.txt
Normal file
@@ -0,0 +1,2 @@
|
||||
There was a guy at the gas station who told me that if I knew Mandarin
|
||||
and Python I could get a job with the FBI.
|
||||
7
input/4.NaiveBayes/email/ham/6.txt
Normal file
7
input/4.NaiveBayes/email/ham/6.txt
Normal file
@@ -0,0 +1,7 @@
|
||||
Hello,
|
||||
|
||||
Since you are an owner of at least one Google Groups group that uses the customized welcome message, pages or files, we are writing to inform you that we will no longer be supporting these features starting February 2011. We made this decision so that we can focus on improving the core functionalities of Google Groups -- mailing lists and forum discussions. Instead of these features, we encourage you to use products that are designed specifically for file storage and page creation, such as Google Docs and Google Sites.
|
||||
|
||||
For example, you can easily create your pages on Google Sites and share the site (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=174623) with the members of your group. You can also store your files on the site by attaching files to pages (http://www.google.com/support/sites/bin/answer.py?hl=en&answer=90563) on the site. If you’re just looking for a place to upload your files so that your group members can download them, we suggest you try Google Docs. You can upload files (http://docs.google.com/support/bin/answer.py?hl=en&answer=50092) and share access with either a group (http://docs.google.com/support/bin/answer.py?hl=en&answer=66343) or an individual (http://docs.google.com/support/bin/answer.py?hl=en&answer=86152), assigning either edit or download only access to the files.
|
||||
|
||||
you have received this mandatory email service announcement to update you about important changes to Google Groups.
|
||||
6
input/4.NaiveBayes/email/ham/7.txt
Normal file
6
input/4.NaiveBayes/email/ham/7.txt
Normal file
@@ -0,0 +1,6 @@
|
||||
Zach Hamm commented on your status.
|
||||
|
||||
Zach wrote:
|
||||
"doggy style - enough said, thank you & good night"
|
||||
|
||||
|
||||
5
input/4.NaiveBayes/email/ham/8.txt
Normal file
5
input/4.NaiveBayes/email/ham/8.txt
Normal file
@@ -0,0 +1,5 @@
|
||||
This e-mail was sent from a notification-only address that cannot accept incoming e-mail. Please do not reply to this message.
|
||||
|
||||
Thank you for your online reservation. The store you selected has located the item you requested and has placed it on hold in your name. Please note that all items are held for 1 day. Please note store prices may differ from those online.
|
||||
|
||||
If you have questions or need assistance with your reservation, please contact the store at the phone number listed below. You can also access store information, such as store hours and location, on the web at http://www.borders.com/online/store/StoreDetailView_98.
|
||||
5
input/4.NaiveBayes/email/ham/9.txt
Normal file
5
input/4.NaiveBayes/email/ham/9.txt
Normal file
@@ -0,0 +1,5 @@
|
||||
Hi Peter,
|
||||
|
||||
These are the only good scenic ones and it's too bad there was a girl's back in one of them. Just try to enjoy the blue sky : ))
|
||||
|
||||
D
|
||||
4
input/4.NaiveBayes/email/spam/1.txt
Normal file
4
input/4.NaiveBayes/email/spam/1.txt
Normal file
@@ -0,0 +1,4 @@
|
||||
--- Codeine 15mg -- 30 for $203.70 -- VISA Only!!! --
|
||||
|
||||
-- Codeine (Methylmorphine) is a narcotic (opioid) pain reliever
|
||||
-- We have 15mg & 30mg pills -- 30/15mg for $203.70 - 60/15mg for $385.80 - 90/15mg for $562.50 -- VISA Only!!! ---
|
||||
6
input/4.NaiveBayes/email/spam/10.txt
Normal file
6
input/4.NaiveBayes/email/spam/10.txt
Normal file
@@ -0,0 +1,6 @@
|
||||
OrderCializViagra Online & Save 75-90%
|
||||
|
||||
0nline Pharmacy NoPrescription required
|
||||
Buy Canadian Drugs at Wholesale Prices and Save 75-90%
|
||||
FDA-Approved drugs + Superb Quality Drugs only!
|
||||
Accept all major credit cards
|
||||
13
input/4.NaiveBayes/email/spam/11.txt
Normal file
13
input/4.NaiveBayes/email/spam/11.txt
Normal file
@@ -0,0 +1,13 @@
|
||||
You Have Everything To Gain!
|
||||
|
||||
Incredib1e gains in length of 3-4 inches to yourPenis, PERMANANTLY
|
||||
|
||||
Amazing increase in thickness of yourPenis, up to 30%
|
||||
BetterEjacu1ation control
|
||||
Experience Rock-HardErecetions
|
||||
Explosive, intenseOrgasns
|
||||
Increase volume ofEjacu1ate
|
||||
Doctor designed and endorsed
|
||||
100% herbal, 100% Natural, 100% Safe
|
||||
The proven NaturalPenisEnhancement that works!
|
||||
100% MoneyBack Guaranteeed
|
||||
7
input/4.NaiveBayes/email/spam/12.txt
Normal file
7
input/4.NaiveBayes/email/spam/12.txt
Normal file
@@ -0,0 +1,7 @@
|
||||
Buy Ambiem (Zolpidem) 5mg/10mg @ $2.39/- pill
|
||||
|
||||
30 pills x 5 mg - $129.00
|
||||
60 pills x 5 mg - $199.20
|
||||
180 pills x 5 mg - $430.20
|
||||
30 pills x 10 mg - $ 138.00
|
||||
120 pills x 10 mg - $ 322.80
|
||||
7
input/4.NaiveBayes/email/spam/13.txt
Normal file
7
input/4.NaiveBayes/email/spam/13.txt
Normal file
@@ -0,0 +1,7 @@
|
||||
OrderCializViagra Online & Save 75-90%
|
||||
|
||||
0nline Pharmacy NoPrescription required
|
||||
Buy Canadian Drugs at Wholesale Prices and Save 75-90%
|
||||
FDA-Approved drugs + Superb Quality Drugs only!
|
||||
Accept all major credit cards
|
||||
Order Today! From $1.38
|
||||
7
input/4.NaiveBayes/email/spam/14.txt
Normal file
7
input/4.NaiveBayes/email/spam/14.txt
Normal file
@@ -0,0 +1,7 @@
|
||||
BuyVIAGRA 25mg, 50mg, 100mg,
|
||||
BrandViagra, FemaleViagra from $1.15 per pill
|
||||
|
||||
|
||||
ViagraNoPrescription needed - from Certified Canadian Pharmacy
|
||||
|
||||
Buy Here... We accept VISA, AMEX, E-Check... Worldwide Delivery
|
||||
11
input/4.NaiveBayes/email/spam/15.txt
Normal file
@@ -0,0 +1,11 @@
You Have Everything To Gain!

Incredib1e gains in length of 3-4 inches to yourPenis, PERMANANTLY

Amazing increase in thickness of yourPenis, up to 30%
BetterEjacu1ation control
Experience Rock-HardErecetions
Explosive, intenseOrgasns
Increase volume ofEjacu1ate
Doctor designed and endorsed
100% herbal, 100% Natural, 100% Safe
11
input/4.NaiveBayes/email/spam/16.txt
Normal file
@@ -0,0 +1,11 @@
You Have Everything To Gain!

Incredib1e gains in length of 3-4 inches to yourPenis, PERMANANTLY

Amazing increase in thickness of yourPenis, up to 30%
BetterEjacu1ation control
Experience Rock-HardErecetions
Explosive, intenseOrgasns
Increase volume ofEjacu1ate
Doctor designed and endorsed
100% herbal, 100% Natural, 100% Safe
14
input/4.NaiveBayes/email/spam/17.txt
Normal file
@@ -0,0 +1,14 @@
A home based business opportunity is knocking at your door.

Don’t be rude and let this chance go by.

You can earn a great income and find
your financial life transformed.

Learn more Here.



To Your Success.

Work From Home Finder Experts
6
input/4.NaiveBayes/email/spam/18.txt
Normal file
@@ -0,0 +1,6 @@
Codeine (the most competitive price on NET!)

Codeine (WILSON) 30mg x 30 $156.00
Codeine (WILSON) 30mg x 60 $291.00 (+4 FreeViagra pills)
Codeine (WILSON) 30mg x 90 $396.00 (+4 FreeViagra pills)
Codeine (WILSON) 30mg x 120 $492.00 (+10 FreeViagra pills)
13
input/4.NaiveBayes/email/spam/19.txt
Normal file
@@ -0,0 +1,13 @@
Get Up to 75% OFF at Online WatchesStore

Discount Watches for All Famous Brands

* Watches: aRolexBvlgari, Dior, Hermes, Oris, Cartier, AP and more brands
* Louis Vuitton Bags & Wallets
* Gucci Bags
* Tiffany & Co Jewerly

Enjoy a full 1 year WARRANTY
Shipment via reputable courier: FEDEX, UPS, DHL and EMS Speedpost
You will 100% recieve your order
Save Up to 75% OFF Quality Watches
8
input/4.NaiveBayes/email/spam/2.txt
Normal file
@@ -0,0 +1,8 @@
Hydrocodone/Vicodin ES/Brand Watson

Vicodin ES - 7.5/750 mg: 30 - $195 / 120 $570
Brand Watson - 7.5/750 mg: 30 - $195 / 120 $570
Brand Watson - 10/325 mg: 30 - $199 / 120 - $588
NoPrescription Required
FREE Express FedEx (3-5 days Delivery) for over $200 order
Major Credit Cards + E-CHECK
12
input/4.NaiveBayes/email/spam/20.txt
Normal file
@@ -0,0 +1,12 @@
Get Up to 75% OFF at Online WatchesStore

Discount Watches for All Famous Brands

* Watches: aRolexBvlgari, Dior, Hermes, Oris, Cartier, AP and more brands
* Louis Vuitton Bags & Wallets
* Gucci Bags
* Tiffany & Co Jewerly

Enjoy a full 1 year WARRANTY
Shipment via reputable courier: FEDEX, UPS, DHL and EMS Speedpost
You will 100% recieve your order
4
input/4.NaiveBayes/email/spam/21.txt
Normal file
@@ -0,0 +1,4 @@
Percocet 10/625 mg withoutPrescription 30 tabs - $225!
Percocet, a narcotic analgesic, is used to treat moderate to moderately SeverePain
Top Quality, EXPRESS Shipping, 100% Safe & Discreet & Private.
Buy Cheap Percocet Online
12
input/4.NaiveBayes/email/spam/22.txt
Normal file
@@ -0,0 +1,12 @@
Get Up to 75% OFF at Online WatchesStore

Discount Watches for All Famous Brands

* Watches: aRolexBvlgari, Dior, Hermes, Oris, Cartier, AP and more brands
* Louis Vuitton Bags & Wallets
* Gucci Bags
* Tiffany & Co Jewerly

Enjoy a full 1 year WARRANTY
Shipment via reputable courier: FEDEX, UPS, DHL and EMS Speedpost
You will 100% recieve your order
11
input/4.NaiveBayes/email/spam/23.txt
Normal file
@@ -0,0 +1,11 @@
You Have Everything To Gain!

Incredib1e gains in length of 3-4 inches to yourPenis, PERMANANTLY

Amazing increase in thickness of yourPenis, up to 30%
BetterEjacu1ation control
Experience Rock-HardErecetions
Explosive, intenseOrgasns
Increase volume ofEjacu1ate
Doctor designed and endorsed
100% herbal, 100% Natural, 100% Safe
11
input/4.NaiveBayes/email/spam/24.txt
Normal file
@@ -0,0 +1,11 @@
You Have Everything To Gain!

Incredib1e gains in length of 3-4 inches to yourPenis, PERMANANTLY

Amazing increase in thickness of yourPenis, up to 30%
BetterEjacu1ation control
Experience Rock-HardErecetions
Explosive, intenseOrgasns
Increase volume ofEjacu1ate
Doctor designed and endorsed
100% herbal, 100% Natural, 100% Safe
7
input/4.NaiveBayes/email/spam/25.txt
Normal file
@@ -0,0 +1,7 @@
Experience with BiggerPenis Today! Grow 3-inches more

The Safest & Most Effective Methods Of_PenisEn1argement.
Save your time and money!
BetterErections with effective Ma1eEnhancement products.

#1 Ma1eEnhancement Supplement. Trusted by Millions. Buy Today!
13
input/4.NaiveBayes/email/spam/3.txt
Normal file
@@ -0,0 +1,13 @@
You Have Everything To Gain!

Incredib1e gains in length of 3-4 inches to yourPenis, PERMANANTLY

Amazing increase in thickness of yourPenis, up to 30%
BetterEjacu1ation control
Experience Rock-HardErecetions
Explosive, intenseOrgasns
Increase volume ofEjacu1ate
Doctor designed and endorsed
100% herbal, 100% Natural, 100% Safe
The proven NaturalPenisEnhancement that works!
100% MoneyBack Guaranteeed
4
input/4.NaiveBayes/email/spam/4.txt
Normal file
@@ -0,0 +1,4 @@
Percocet 10/625 mg withoutPrescription 30 tabs - $225!
Percocet, a narcotic analgesic, is used to treat moderate to moderately SeverePain
Top Quality, EXPRESS Shipping, 100% Safe & Discreet & Private.
Buy Cheap Percocet Online
4
input/4.NaiveBayes/email/spam/5.txt
Normal file
@@ -0,0 +1,4 @@
--- Codeine 15mg -- 30 for $203.70 -- VISA Only!!! --

-- Codeine (Methylmorphine) is a narcotic (opioid) pain reliever
-- We have 15mg & 30mg pills -- 30/15mg for $203.70 - 60/15mg for $385.80 - 90/15mg for $562.50 -- VISA Only!!! ---
8
input/4.NaiveBayes/email/spam/6.txt
Normal file
@@ -0,0 +1,8 @@
OEM Adobe & Microsoft softwares
Fast order and download

Microsoft Office Professional Plus 2007/2010 $129
Microsoft Windows 7 Ultimate $119
Adobe Photoshop CS5 Extended
Adobe Acrobat 9 Pro Extended
Windows XP Professional & thousand more titles
9
input/4.NaiveBayes/email/spam/7.txt
Normal file
@@ -0,0 +1,9 @@
Bargains Here! Buy Phentermin 37.5 mg (K-25)

Buy Genuine Phentermin at Low Cost
VISA Accepted
30 - $130.50
60 - $219.00
90 - $292.50
120 - $366.00
180 - $513.00
11
input/4.NaiveBayes/email/spam/8.txt
Normal file
@@ -0,0 +1,11 @@
You Have Everything To Gain!

Incredib1e gains in length of 3-4 inches to yourPenis, PERMANANTLY

Amazing increase in thickness of yourPenis, up to 30%
BetterEjacu1ation control
Experience Rock-HardErecetions
Explosive, intenseOrgasns
Increase volume ofEjacu1ate
Doctor designed and endorsed
100% herbal, 100% Natural, 100% Safe
9
input/4.NaiveBayes/email/spam/9.txt
Normal file
@@ -0,0 +1,9 @@
Bargains Here! Buy Phentermin 37.5 mg (K-25)

Buy Genuine Phentermin at Low Cost
VISA Accepted
30 - $130.50
60 - $219.00
90 - $292.50
120 - $366.00
180 - $513.00