String matching

The Mastery of Computer Programming: Primary Algorithms - Sykalo Eugene 2023

String matching
Dynamic Programming

Brute Force String Matching Algorithm

This algorithm compares the pattern to the text character by character, and returns the start position of the first occurrence of the pattern in the text.

The algorithm works by first comparing the first character of the pattern to the first character of the text. If they match, the algorithm proceeds to compare the second character of the pattern to the second character of the text, and so on, until either a mismatch is found or the entire pattern has been matched. If a mismatch is found, the algorithm moves on to the next character in the text and starts the comparison over again.

One of the main drawbacks of this algorithm is its time complexity, which is O(mn) on average, where m is the length of the pattern and n is the length of the text. This makes it unsuitable for large texts or patterns. However, it is a simple algorithm to implement and can be useful for small inputs or as a baseline for comparison with other algorithms.

Knuth-Morris-Pratt Algorithm

The Knuth-Morris-Pratt (KMP) algorithm improves upon the brute force algorithm by avoiding unnecessary character comparisons. The algorithm does this by constructing a prefix function, which is used to determine the number of characters to skip in the text when a mismatch occurs.

The prefix function is constructed by comparing the pattern to itself, and finding the longest proper prefix of the pattern that is also a suffix. This prefix function is then used to determine the number of characters to skip in the text. Specifically, if a mismatch occurs at position j in the pattern, and the prefix function value for position j is k, then the algorithm skips k characters in the text and resumes the comparison at position j-k+1 in the pattern.

The time complexity of the KMP algorithm is O(m+n), where m is the length of the pattern and n is the length of the text. This makes it more efficient than the brute force algorithm for larger texts or patterns. However, the KMP algorithm requires additional memory to store the prefix function, which can be a drawback for very large patterns.

Boyer-Moore Algorithm

The Boyer-Moore algorithm is another string matching algorithm that is based on the idea of skipping over large blocks of text, rather than individual characters. The algorithm is also unique in that it uses a heuristic to determine where to start looking for the pattern.

The heuristic used by the Boyer-Moore algorithm is based on two rules: the bad character rule and the good suffix rule. The bad character rule states that when a mismatch occurs at position j in the pattern, the algorithm should shift the pattern to the right until the rightmost occurrence of the character in the pattern that is to the left of position j is aligned with the mismatched character in the text. The good suffix rule states that when a mismatch occurs at position j in the pattern and the suffix of the pattern to the right of position j matches a prefix of the pattern, the algorithm should shift the pattern to the right until the rightmost occurrence of the prefix in the pattern that is to the left of position j is aligned with the mismatched character in the text.

The Boyer-Moore algorithm can be more efficient than the brute force algorithm and the KMP algorithm for many types of patterns and texts. However, it can be less efficient for certain types of patterns and texts, and its performance can depend heavily on the quality of the heuristic used. Additionally, the Boyer-Moore algorithm requires preprocessing of the pattern, which can be expensive for very large patterns.

Rabin-Karp Algorithm

The Rabin-Karp algorithm is another string matching algorithm that uses a hash function to compare the pattern to the text. The algorithm works by first computing the hash value of the pattern and the first m characters of the text, where m is the length of the pattern. If the hash value of the pattern and the first m characters of the text match, the algorithm proceeds to compare the pattern to the text character by character to verify the match. If the hash values do not match, the algorithm computes the hash value of the next m characters of the text and checks again for a match.

The hash function used by the Rabin-Karp algorithm should be chosen carefully to avoid hash collisions. One common hash function for this algorithm is the rolling hash function, which updates the hash value efficiently as characters are added or removed from the text.

The time complexity of the Rabin-Karp algorithm is O(mn), where m is the length of the pattern and n is the length of the text. However, in practice, the Rabin-Karp algorithm can be more efficient than the brute force algorithm for certain types of patterns and texts, particularly when the pattern is relatively short compared to the text.

Finite Automata Algorithm

The Finite Automata algorithm constructs a finite state machine to represent the pattern, which allows for fast pattern matching. The algorithm works by processing the pattern and constructing a state machine that represents all possible matches of the pattern. The state machine has a set of states, each of which represents a possible match of a prefix of the pattern. Each state is labeled with the characters that can follow the prefix, and transitions between states are labeled with the characters that cause the transition.

Once the state machine is constructed, the algorithm processes the text character by character, starting from the beginning. At each step, the algorithm follows the transitions in the state machine based on the current character in the text, until it reaches a final state. If the final state is reached, the algorithm has found a match of the pattern in the text.

The time complexity of the Finite Automata algorithm is O(m+n), where m is the length of the pattern and n is the length of the text. This makes it more efficient than the brute force algorithm for larger texts or patterns. However, the algorithm requires additional preprocessing time to construct the state machine, which can be a drawback for very large patterns.