commit karp_rabin.md.
@@ -112,6 +112,7 @@ int match(char* text, char* pattern){
j = m - 1;
}
delete [] bc;
delete [] gs;
return i;
}
```
87
thu_dsa/chp11/karp_rabin.md
Normal file
@@ -0,0 +1,87 @@
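String Matching: The Karp-Rabin Algorithm
=====================

## Everything is a number

Comparing two machine-word integers takes `O(1)` time, and every piece of data stored in a computer, strings included, is ultimately a sequence of bytes that can be read as a binary integer. So why not carry the efficient integer comparison over to string matching? That is the basic idea behind `karp-rabin`.

In general, for a string over an alphabet of size `d`, each character can be represented as one digit of a base-`d+1` integer. Note that the base is `d+1` rather than `d`: no character may be mapped to the digit `0`, because otherwise a prefix made up of that character, however long, would not change the integer that the string maps to.

Under this scheme every string can be written as an integer, and the string and the integer correspond one to one. The mapping is therefore a `perfect hash`, and the integer is called the `fingerprint` of the string. Converting the `fingerprint` to binary gives a byte sequence that uniquely represents the string inside the computer.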
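
The mapping described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the alphabet is taken to be the lowercase letters `a` to `z` (so `d = 26` and the base is `27`), and the names `BASE`, `digit_of` and `fingerprint` are hypothetical, not part of the original notes.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for this sketch: the alphabet is 'a'..'z', so d = 26 and the base is d + 1 = 27. */
#define BASE 27ULL

/* Hypothetical helper: map 'a'..'z' to the digits 1..26, so that no character becomes 0. */
static unsigned long long digit_of(char c) {
    assert(c >= 'a' && c <= 'z');
    return (unsigned long long)(c - 'a') + 1;
}

/* Exact (unhashed) fingerprint of s[0..len).
 * The result can overflow 64 bits once the string is longer than 13 characters. */
unsigned long long fingerprint(const char *s, size_t len) {
    unsigned long long value = 0;
    for (size_t i = 0; i < len; ++i)
        value = value * BASE + digit_of(s[i]);  /* shift by one digit, append the next */
    return value;
}
```

Under these assumptions `fingerprint("abc", 3)` is `1*27^2 + 2*27 + 3 = 786`, while `"ab"` and `"b"` map to `29` and `2`. Had `a` been mapped to the digit `0`, `"ab"` and `"b"` would have received the same value, which is exactly why the base is `d+1`.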
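
## The Karp-Rabin algorithm

The analysis above seems to be enough to build a new string matching algorithm. At every alignment position, represent the pattern and the `m` text characters aligned with it by their `fingerprint`s, then compare the two integers in `O(1)` time. The whole algorithm would then run in `O(n)` time, on par with `kmp`. But is it really that simple?

The answer is no, because the process has other costs. Converting a string of length `m` into its `fingerprint` already takes `O(m)` time, so the total running time is `O(mn)`, no better than brute force. There are further problems: when the alphabet is large or the string is long, the `fingerprint` has many digits. With the `ASCII` character set, for example, the alphabet size is `d = 128`; for a pattern of length `m = 10` the `fingerprint` takes about `7 x 10 = 70` bits, which already exceeds the integer width that computers usually support. As the string grows, comparing fingerprints this long can no longer be done in `O(1)` time and costs `O(m)` instead, and storing such integers becomes a problem as well.

The rest of this section deals with these problems one by one.

> Compressing the fingerprint

Storing larger data in a smaller space is exactly the problem raised in [Basic concepts of hashing](../chp9/hash.mg). To compress a `fingerprint` of `70` bits or more into the range of a `32`-bit integer, it is enough to hash the `fingerprint`. For simplicity, take the remainder modulo `M`:

```c
hash(fingerprint) = fingerprint % M;
```

This solves the storage problem and the comparison-time problem at once: the hashed fingerprint fits within the bit width that computers usually support, and comparing two `fingerprint`s again takes only `O(1)` time.

However, hashing has an inherent weakness and inevitably brings a new problem: collisions. Two strings that do not match may still have the same compressed `fingerprint`, which would cause a false positive. To handle this, treat equal `fingerprint`s only as a necessary condition for a match: whenever two `fingerprint`s agree, run a character-by-character comparison to confirm that the strings really match. As long as the hash is long enough (that is, `M` is large enough), two non-matching strings rarely share a fingerprint, so the extra character comparisons do not noticeably increase the running time.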
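
A small sketch may make the collision problem and its fix concrete. It assumes toy parameters that are not from the original notes (strings of decimal digits, radix `10`, and a deliberately tiny modulus `97` so that collisions are easy to find); `toy_hash` and `windows_match` are hypothetical names.

```c
#include <assert.h>
#include <string.h>

/* Assumed toy parameters: decimal digits and a tiny modulus, so collisions are easy to find. */
enum { TOY_R = 10, TOY_M = 97 };

/* Hashed fingerprint of a string made of the characters '0'..'9'. */
int toy_hash(const char *s, size_t len) {
    int h = 0;
    for (size_t i = 0; i < len; ++i) {
        assert(s[i] >= '0' && s[i] <= '9');
        h = (h * TOY_R + (s[i] - '0')) % TOY_M;
    }
    return h;
}

/* Equal fingerprints are only a necessary condition for a match,
 * so a fingerprint hit is confirmed by comparing the characters. */
int windows_match(const char *a, const char *b, size_t len) {
    if (toy_hash(a, len) != toy_hash(b, len))
        return 0;                      /* different fingerprints: certainly no match */
    return memcmp(a, b, len) == 0;     /* same fingerprint: verify character by character */
}
```

With these parameters `"197"` and `"294"` both hash to `3` (since `197 = 2*97 + 3` and `294 = 3*97 + 3`), yet `windows_match("197", "294", 3)` returns `0` because the character comparison rejects the false positive.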
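
> Fast fingerprint update

With hashing in place, two fingerprints can be compared in `O(1)` time, but computing a fingerprint still takes `O(m)` time. The overall running time of `karp-rabin` is therefore still `O(mn)`, with no real improvement, so a faster way to compute fingerprints is needed.

For the pattern there is nothing to gain: all `m` characters must be visited once to compute its fingerprint, so the `O(m)` cost cannot be reduced.

The text is a different matter. Computing the fingerprint of an arbitrary string of length `m` does cost `O(m)` time, but the fingerprints of adjacent substrings of the text are related to each other, as the figure below shows:

![Fingerprints of adjacent substrings](update_fingerprint.png)

Specifically, two adjacent substrings differ only in the first character of one and the last character of the other. Using the rules of modular arithmetic, the fingerprint of the next substring can therefore be computed from the fingerprint of the previous one in `O(1)` time. Let `a` and `b` be integers with `a > b > 0`. The rules used are:

```
(a + b) % M = ((a % M) + (b % M)) % M = ((a % M) + b) % M = ((b % M) + a) % M;
(a - b) % M = ((a % M) - (b % M) + M) % M;
(a * b) % M = ((a % M) * (b % M)) % M;
```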
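
The three rules can be checked numerically with a short C sketch. The helper names `mod_add`, `mod_sub` and `mod_mul` are hypothetical, and the sketch assumes `M` fits in 32 bits so that the intermediate values cannot overflow a 64-bit integer.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 0 < M < 2^32, so sums and products of reduced operands fit in 64 bits. */
static void check_modulus(uint64_t M) {
    assert(M > 0 && M < (1ULL << 32));
}

/* (a + b) % M computed from the reduced operands. */
uint64_t mod_add(uint64_t a, uint64_t b, uint64_t M) {
    check_modulus(M);
    return ((a % M) + (b % M)) % M;
}

/* (a - b) % M for a > b; adding M first keeps the intermediate value non-negative. */
uint64_t mod_sub(uint64_t a, uint64_t b, uint64_t M) {
    check_modulus(M);
    return ((a % M) + M - (b % M)) % M;
}

/* (a * b) % M computed from the reduced operands. */
uint64_t mod_mul(uint64_t a, uint64_t b, uint64_t M) {
    check_modulus(M);
    return ((a % M) * (b % M)) % M;
}
```

For example, with `a = 200`, `b = 150` and `M = 97`, the reduced operands are `6` and `53`; subtracting them directly would give a negative number, while `mod_sub` returns `50`, the same as `(200 - 150) % 97`.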
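
These rules extend to any number of positive integers. With them, the initial `fingerprint`s of the pattern and of the text can be computed as follows (here `R` is the radix, `M` is the modulus, and `DIGIT(S, i)` is taken to return the digit for character `S[i]`):

```c
m = strlen(P);
HashCode hashP = 0, hashT = 0;
for(int i = 0; i < m; ++i){
    // shift the hash up by one base-R digit, append the next digit, reduce modulo M
    hashP = (hashP * R + DIGIT(P, i)) % M;
    hashT = (hashT * R + DIGIT(T, i)) % M;
}
```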
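
To update the `fingerprint` quickly when moving between adjacent length-`m` substrings of the text, first subtract the contribution of the highest digit from the old fingerprint, then multiply by `R` to shift the remaining digits up by one position, and finally add the new lowest digit. Computing the contribution of the highest digit takes `m - 1` successive multiplications by `R`, that is,

```
fingerprint(P[0]) = P[0] * R^(m - 1)
```

To avoid repeating this computation, `R^(m - 1)` can be computed once in advance and saved, which gives the code below: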

```c
HashCode prepareDm(int m){
    HashCode Dm = 1;
    // exactly m - 1 multiplications, so that Dm = R^(m - 1) % M
    for(int i = 1; i < m; ++i)
        Dm = (Dm * R) % M;
    return Dm;
}
```
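
Note that the `Dm` computed above is exactly `R^(m - 1)` modulo `M`; this uses the third rule of modular arithmetic. The fast `fingerprint` update can then be written as:

```c
void updateHash(HashCode &hashT, char* T, int m, int k, HashCode Dm){
    // drop the leading digit T[k - 1]; reduce the product modulo M first so the difference stays non-negative
    hashT = (hashT - (DIGIT(T, k - 1) * Dm) % M + M) % M;
    // shift up by one digit and append the new trailing digit T[k - 1 + m]
    hashT = (hashT * R + DIGIT(T, k - 1 + m)) % M;
}
```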

This algorithm is really just the three modular arithmetic rules above applied repeatedly.
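
Putting the pieces together, a complete matcher could look roughly like the sketch below. It is not the author's `match` function: the name `kr_match`, the radix `256` (one digit per byte), the prime modulus `1000000007`, and the convention of returning `-1` on failure are all assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed parameters for this sketch, not taken from the original notes. */
#define R 256ULL                 /* radix: one digit per byte */
#define M 1000000007ULL          /* prime modulus; M * R still fits in 64 bits */
#define DIGIT(S, i) ((uint64_t)(unsigned char)(S)[i])

typedef uint64_t HashCode;

/* Returns the index of the first occurrence of P in T, or -1 if there is none. */
long kr_match(const char *T, const char *P) {
    assert(T != NULL && P != NULL);
    size_t n = strlen(T), m = strlen(P);
    if (m == 0) return 0;
    if (m > n) return -1;

    /* Dm = R^(m - 1) % M, the weight of the leading digit of a window. */
    HashCode Dm = 1;
    for (size_t i = 1; i < m; ++i) Dm = (Dm * R) % M;

    /* Initial fingerprints of the pattern and of the first text window. */
    HashCode hashP = 0, hashT = 0;
    for (size_t i = 0; i < m; ++i) {
        hashP = (hashP * R + DIGIT(P, i)) % M;
        hashT = (hashT * R + DIGIT(T, i)) % M;
    }

    for (size_t k = 0; ; ++k) {
        /* Equal fingerprints are necessary but not sufficient: confirm with memcmp. */
        if (hashT == hashP && memcmp(T + k, P, m) == 0) return (long)k;
        if (k + m >= n) return -1;   /* the current window was the last one */
        /* Slide the window: drop T[k], shift by one digit, append T[k + m]. */
        hashT = (hashT + M - (DIGIT(T, k) * Dm) % M) % M;
        hashT = (hashT * R + DIGIT(T, k + m)) % M;
    }
}
```

For instance, `kr_match("abracadabra", "cad")` returns `4`. Each window costs `O(1)` for the fingerprint update plus an `O(m)` verification only when the fingerprints agree, so the expected running time is `O(n + m)` as long as collisions are rare.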
BIN
thu_dsa/chp11/update_fingerprint.png
Normal file
Binary file not shown.
Size: 17 KiB
8
words.md
@@ -1001,7 +1001,7 @@ Some Words
 + pledge
 > (n)s serious promise or agreement, especially one made publicly or officially.</br>
-> (v)to make a formal, usually public, promise that you will do something, promise.
+> (v)to make a formal, usually public promise that you will do something, promise.

 - the government's pledge to make no deals with terrorists
 - Eisenhower fulfilled his election pledge to end the war.
@@ -1034,14 +1034,14 @@ Some Words
 - The beauty of the scene defies description

 + stake
-> at stake: if someething is at stake, it is being risked and might be lost or damaged if you are not successful.</br>
+> at stake: if something is at stake, it is being risked and might be lost or damaged if you are not successful.</br>
 > (n)the stakes involved in in a contest or a risky action are the things that can be gained or lost</br>
-> (v)if you stake something such as your money or your reputation on the result of something, you risk your money or reputationo on it.</br>
+> (v)if you stake something such as your money or your reputation on the result of something, you risk your money or reputation on it.</br>
 > (n)if you have a stake in a business, you have invested money in it.</br>
 > have a stake in sth: if you have a stake in something, you will get advantages if it's successful, and you feel you have an important connection with it.

 - The tension was naturally high for a game with so much at stake.
-- The game was usually play for high stakes between two large groups.
+- The game was usually played for high stakes between two large groups.
 - He has staked his political future on an election victory.
 - He holds a 51% stake in the firm.
 - Young people don't feel they have a stake in the country's future.