Algorithms_in_C++ 1.0.0
Set of algorithms implemented in C++.
boyer_moore.cpp File Reference

The Boyer–Moore algorithm searches for occurrences of pattern P in text T by performing explicit character comparisons at different alignments. Instead of a brute-force search of all alignments (of which there are n - m + 1), Boyer–Moore uses information gained by preprocessing P to skip as many alignments as possible.

#include <cassert>
#include <climits>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>

Classes

struct  strings::boyer_moore::pattern
 A structure representing all the data needed to search for the preprocessed pattern in a text.
 

Namespaces

namespace  strings
 Algorithms with strings.
 

Macros

#define APLHABET_SIZE   CHAR_MAX
 number of symbols in the alphabet we use
 

Functions

void strings::boyer_moore::init_good_suffix (const std::string &str, std::vector< size_t > &arg)
 A function that preprocesses the good suffix table.
 
void strings::boyer_moore::init_bad_char (const std::string &str, std::vector< size_t > &arg)
 A function that preprocesses the bad char table.
 
void strings::boyer_moore::init_pattern (const std::string &str, pattern &arg)
 A function that initializes pattern.
 
std::vector< size_t > strings::boyer_moore::search (const std::string &str, const pattern &arg)
 A function that implements Boyer-Moore's algorithm.
 
bool strings::boyer_moore::is_prefix (const char *str, const char *pat, size_t len)
 Checks whether pat is a prefix of str.
 
void and_test (const char *text)
 A test case in which we search for every appearance of the word 'and'.
 
void pat_test (const char *text)
 A test case in which we search for every appearance of the word 'pat'.
 
static void tests ()
 Self-test implementations.
 
int main ()
 Main function.
 

Detailed Description

The Boyer–Moore algorithm searches for occurrences of pattern P in text T by performing explicit character comparisons at different alignments. Instead of a brute-force search of all alignments (of which there are n - m + 1), Boyer–Moore uses information gained by preprocessing P to skip as many alignments as possible.
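
For comparison, the brute-force baseline mentioned above can be sketched as follows. This is a minimal illustration, not part of boyer_moore.cpp; the names `naive_search` and `naive_search_example` are hypothetical. It tries each of the n - m + 1 alignments in turn, which is the work Boyer–Moore tries to avoid.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical baseline (not in boyer_moore.cpp): test every alignment.
std::vector<std::size_t> naive_search(const std::string& text,
                                      const std::string& pat) {
    std::vector<std::size_t> hits;
    const std::size_t n = text.size();
    const std::size_t m = pat.size();
    if (m == 0 || m > n) {
        return hits;  // nothing to search for, or pattern longer than text
    }
    // Alignments 0 .. n - m inclusive, i.e. n - m + 1 of them.
    for (std::size_t start = 0; start + m <= n; ++start) {
        if (text.compare(start, m, pat) == 0) {
            hits.push_back(start);
        }
    }
    return hits;
}

// Usage in the style of the file's own self-tests.
void naive_search_example() {
    const std::vector<std::size_t> expected = {0, 2, 4};
    assert(naive_search("abababa", "aba") == expected);
}
```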

The key insight in this algorithm is that comparing the pattern against the text starting from the end of the pattern allows jumps along the text, rather than a check of every character of the text. This works because, when the pattern is lined up against the text, the last character of the pattern is the first one compared with the text character under it.

If the characters do not match, there is no need to continue searching backwards along the text. This leaves us with two cases.

Case 1: If the character in the text does not match any of the characters in the pattern, then the next character in the text to check is located m characters farther along the text, where m is the length of the pattern.

Case 2: If the character in the text is in the pattern, then a partial shift of the pattern along the text is done to line up along the matching character and the process is repeated.
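
The two cases can be sketched with the Horspool simplification of Boyer–Moore, which uses only a shift table keyed on the text character under the last pattern position. This is a different, simpler variant than the functions documented on this page, and the names `horspool_search` and `horspool_example` are hypothetical. The table is sized for every `unsigned char` value, which is an assumption of this sketch.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch of the Horspool variant (not in boyer_moore.cpp).
std::vector<std::size_t> horspool_search(const std::string& text,
                                         const std::string& pat) {
    std::vector<std::size_t> hits;
    const std::size_t n = text.size();
    const std::size_t m = pat.size();
    if (m == 0 || m > n) {
        return hits;
    }

    // Case 1: a character that is absent from the pattern allows a jump of m.
    std::vector<std::size_t> shift(static_cast<std::size_t>(UCHAR_MAX) + 1, m);
    // Case 2: a character found in pat[0 .. m-2] allows a partial shift that
    // lines up its last such occurrence with the text character.
    for (std::size_t i = 0; i + 1 < m; ++i) {
        shift[static_cast<unsigned char>(pat[i])] = m - 1 - i;
    }

    std::size_t start = 0;
    while (start + m <= n) {
        // The comparison order does not affect correctness in this variant.
        if (text.compare(start, m, pat) == 0) {
            hits.push_back(start);
        }
        // Shift by the entry of the text character under the pattern's end.
        start += shift[static_cast<unsigned char>(text[start + m - 1])];
    }
    return hits;
}

void horspool_example() {
    const std::vector<std::size_t> expected = {17};
    assert(horspool_search("here is a simple example", "example") == expected);
}
```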

There are two shift rules:

[The bad character rule](https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm#The_bad_character_rule)

[The good suffix rule](https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm#The_good_suffix_rule)

The shift rules are implemented as constant-time table lookups, using tables generated during the preprocessing of P.

Author
Stoycho Kyosev

Macro Definition Documentation

◆ APLHABET_SIZE

#define APLHABET_SIZE   CHAR_MAX

number of symbols in the alphabet we use

Header usage: <cassert> for assert, <climits> for the CHAR_MAX macro, <cstring> for strlen, <iostream> for IO operations, <string> for std::string, <vector> for std::vector.

Function Documentation

◆ and_test()

void and_test (const char *text)

A test case in which we search for every appearance of the word 'and'.

Parameters
text: The text in which we search for appearances of the word 'and'
Returns
void
{
    // The statements that build `indexes` (via init_pattern and search) are
    // not shown in this listing.

    assert(indexes.size() == 2);
    assert(strings::boyer_moore::is_prefix(text + indexes[0], "and", 3));
    assert(strings::boyer_moore::is_prefix(text + indexes[1], "and", 3));
}

◆ init_bad_char()

void strings::boyer_moore::init_bad_char (const std::string &str, std::vector< size_t > &arg)

A function that preprocesses the bad char table.

Parameters
str: The string being preprocessed
arg: The bad char table
Returns
void
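{
    // Default entry for characters that do not occur in str.
    arg.resize(APLHABET_SIZE, str.length());

    // Distance from the last occurrence of each character to the end of str.
    // Assumes every character value lies in [0, APLHABET_SIZE).
    for (size_t i = 0; i < str.length(); i++) {
        arg[str[i]] = str.length() - i - 1;
    }
}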
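
As an illustration of the values this table holds, here is a minimal sketch that rebuilds it for a short pattern. The function name `bad_char_table` is hypothetical. Unlike the listing above, the sketch sizes the table for every `unsigned char` value and casts before indexing, which is an assumption made here so that any byte value is a valid index.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: same table contents, returned by value.
std::vector<std::size_t> bad_char_table(const std::string& pat) {
    // Characters absent from pat keep the default entry pat.size().
    std::vector<std::size_t> table(static_cast<std::size_t>(UCHAR_MAX) + 1,
                                   pat.size());
    for (std::size_t i = 0; i < pat.size(); ++i) {
        // Later occurrences overwrite earlier ones, so the entry ends up as
        // the distance from the LAST occurrence to the end of pat.
        table[static_cast<unsigned char>(pat[i])] = pat.size() - i - 1;
    }
    return table;
}

void bad_char_table_example() {
    const std::vector<std::size_t> table = bad_char_table("and");
    assert(table[static_cast<unsigned char>('a')] == 2);
    assert(table[static_cast<unsigned char>('n')] == 1);
    assert(table[static_cast<unsigned char>('d')] == 0);
    assert(table[static_cast<unsigned char>('x')] == 3);  // not in "and"
}
```

Note that the last character of the pattern always gets the entry 0, so this table alone cannot guarantee forward progress.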

◆ init_good_suffix()

void strings::boyer_moore::init_good_suffix (const std::string &str, std::vector< size_t > &arg)

A function that preprocesses the good suffix table.

Parameters
str: The string being preprocessed
arg: The good suffix table
Returns
void
{
    arg.resize(str.size() + 1, 0);

    // border_pos[i] - the start index of the widest border of the suffix
    // str[i..], i.e. of its longest proper suffix that is also its prefix.
    std::vector<size_t> border_pos(str.size() + 1, 0);

    size_t current_char = str.length();

    size_t border_index = str.length() + 1;

    border_pos[current_char] = border_index;

    while (current_char > 0) {
        while (border_index <= str.length() &&
               str[current_char - 1] != str[border_index - 1]) {
            if (arg[border_index] == 0) {
                arg[border_index] = border_index - current_char;
            }

            border_index = border_pos[border_index];
        }

        current_char--;
        border_index--;
        border_pos[current_char] = border_index;
    }

    size_t largest_border_index = border_pos[0];

    for (size_t i = 0; i < str.size(); i++) {
        if (arg[i] == 0) {
            arg[i] = largest_border_index;
        }

        // Once we pass the largest border, continue with the next one
        if (i == largest_border_index) {
            largest_border_index = border_pos[largest_border_index];
        }
    }
}
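
A worked example may help when reading the table. The sketch below repeats the same two passes with shorter variable names and returns the table by value; the function name `good_suffix_table` is hypothetical. Entry j + 1 is the shift used after a mismatch at pattern index j, and entry 0 is the shift used after a full match.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper mirroring the two passes of the listing above.
std::vector<std::size_t> good_suffix_table(const std::string& pat) {
    const std::size_t m = pat.size();
    std::vector<std::size_t> shift(m + 1, 0);
    std::vector<std::size_t> border_pos(m + 1, 0);

    std::size_t i = m;
    std::size_t j = m + 1;
    border_pos[i] = j;

    // Pass 1: the matched suffix occurs elsewhere in the pattern.
    while (i > 0) {
        while (j <= m && pat[i - 1] != pat[j - 1]) {
            if (shift[j] == 0) {
                shift[j] = j - i;
            }
            j = border_pos[j];
        }
        --i;
        --j;
        border_pos[i] = j;
    }

    // Pass 2: only a border of the pattern can be reused.
    std::size_t widest = border_pos[0];
    for (std::size_t k = 0; k < m; ++k) {
        if (shift[k] == 0) {
            shift[k] = widest;
        }
        if (k == widest) {
            widest = border_pos[widest];
        }
    }
    return shift;
}

void good_suffix_table_example() {
    const std::vector<std::size_t> abab = {2, 2, 2, 4, 1};
    assert(good_suffix_table("abab") == abab);
    const std::vector<std::size_t> and_table = {3, 3, 3, 1};
    assert(good_suffix_table("and") == and_table);
}
```

For "abab", entry 0 is 2 because the pattern repeats with period 2, so the next possible match after a full match starts two positions later. Because the second pass stops before the last index, the last entry can stay 0 (for "aa" the table is {1, 1, 0}), so a caller should not rely on that entry alone to move forward.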

◆ init_pattern()

void strings::boyer_moore::init_pattern (const std::string &str, pattern &arg)

A function that initializes pattern.

Parameters
str: Text used for initialization
arg: Initialized structure
Returns
void
{
    arg.pat = str;
    init_bad_char(str, arg.bad_char);
    init_good_suffix(str, arg.good_suffix);
}

◆ is_prefix()

bool strings::boyer_moore::is_prefix (const char *str, const char *pat, size_t len)

Checks whether pat is a prefix of str.

Parameters
str: pointer to some part of the input text
pat: the searched pattern
len: length of the searched pattern
Returns
true if pat is a prefix of str.
false if pat is not a prefix of str.
{
    if (strlen(str) < len) {
        return false;
    }

    for (size_t i = 0; i < len; i++) {
        if (str[i] != pat[i]) {
            return false;
        }
    }

    return true;
}
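
The same check can be written more compactly with `std::string_view` (C++17). This is a sketch of an equivalent formulation, not the file's implementation; the name `is_prefix_sv` is hypothetical. Like the listing above, it assumes that `pat` points to at least `len` readable characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical alternative to is_prefix built on std::string_view.
bool is_prefix_sv(const char* str, const char* pat, std::size_t len) {
    const std::string_view text(str);  // length found via the terminating '\0'
    // Too-short text cannot start with pat; otherwise compare the first
    // len characters of text with the first len characters of pat.
    return text.size() >= len && text.compare(0, len, pat, len) == 0;
}

void is_prefix_sv_example() {
    assert(is_prefix_sv("and Mrs. pat", "and", 3));
    assert(!is_prefix_sv("an", "and", 3));   // text shorter than pattern
    assert(!is_prefix_sv("ant", "and", 3));  // mismatch in the last character
}
```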

◆ main()

int main (void)

Main function.

Returns
0 on exit
{
    tests();  // run self-test implementations
    return 0;
}

◆ pat_test()

void pat_test (const char *text)

A test case in which we search for every appearance of the word 'pat'.

Parameters
text: The text in which we search for appearances of the word 'pat'
Returns
void
{
    // The statements that build `indexes` are not shown in this listing.

    assert(indexes.size() == 6);

    for (const auto& currentIndex : indexes) {
        assert(strings::boyer_moore::is_prefix(text + currentIndex, "pat", 3));
    }
}

◆ search()

std::vector< size_t > strings::boyer_moore::search (const std::string &str, const pattern &arg)

A function that implements Boyer-Moore's algorithm.

Parameters
str: Text we are searching in
arg: pattern structure containing the preprocessed pattern
Returns
Vector of indexes of the occurrences of pattern in text
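{
    size_t index_position = arg.pat.size() - 1;
    std::vector<size_t> index_storage;

    while (index_position < str.length()) {
        size_t index_string = index_position;
        int index_pattern = static_cast<int>(arg.pat.size()) - 1;

        // Compare right to left until a mismatch or the start of the pattern.
        while (index_pattern >= 0 &&
               str[index_string] == arg.pat[index_pattern]) {
            --index_pattern;
            --index_string;
        }

        if (index_pattern < 0) {
            // Full match: record where it starts, then shift the alignment.
            index_storage.push_back(index_position - arg.pat.length() + 1);
            index_position += arg.good_suffix[0];
        } else {
            // bad_char entries are distances from the end of the pattern, so
            // subtract the characters already matched to get an alignment
            // shift; using the raw entry can skip a match ("abb" in "xabb").
            const size_t matched =
                arg.pat.size() - 1 - static_cast<size_t>(index_pattern);
            const size_t bad_char_entry = arg.bad_char[str[index_string]];
            const size_t bad_char_shift =
                bad_char_entry > matched ? bad_char_entry - matched : 1;
            index_position += std::max(bad_char_shift,
                                       arg.good_suffix[index_pattern + 1]);
        }
    }

    return index_storage;
}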
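
The bad char table stores distances measured from the last pattern position, while index_position moves the whole alignment. The sketch below shows one way to reconcile the two before taking the maximum with the good suffix entry. The helper name `alignment_shift` and its parameters are hypothetical, and the numbers in the example are the table entries for the pattern "abb".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: combine both rules into one shift of the alignment.
// bad_char_entry    - table entry for the mismatched text character
// good_suffix_entry - good suffix entry for the mismatch position
// matched           - pattern characters already matched from the right
std::size_t alignment_shift(std::size_t bad_char_entry,
                            std::size_t good_suffix_entry,
                            std::size_t matched) {
    // Convert the end-relative distance into a shift of the alignment and
    // never move by less than one position.
    const std::size_t bad_char_shift =
        bad_char_entry > matched ? bad_char_entry - matched : 1;
    return std::max(bad_char_shift, good_suffix_entry);
}

void alignment_shift_example() {
    // Pattern "abb" against text "xabb": the last 'b' matches, then text
    // 'a' mismatches pattern 'b'. The bad char entry for 'a' is 2 and the
    // good suffix entry for this position is 1.
    assert(alignment_shift(2, 1, 1) == 1);  // lands on the match at index 1
    // Taking the raw maximum instead would shift by 2 and skip that match.
    assert(std::max<std::size_t>(2, 1) == 2);
}
```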

◆ tests()

static void tests ()

Self-test implementations.

Returns
void
{
    const char* text =
        "When pat Mr. and Mrs. pat Dursley woke up on the dull, gray \
    Tuesday our story starts, \
    there was nothing about pat the cloudy sky outside to pat suggest that\
    strange and \
    mysterious things would pat soon be happening all pat over the \
    country.";

    and_test(text);
    pat_test(text);

    std::cout << "All tests have successfully passed!\n";
}