Design and Analysis of Algorithms: The Knuth-Morris-Pratt algorithm

The Knuth-Morris-Pratt algorithm

This algorithm avoids computing the transition function δ altogether, and its matching time is ‚θ(n) using just an auxiliary function π which we precompute from the pattern in time θ(m) and store in an array π[1. . m].

For any state q = 0, 1, . . . ,m and any character the value π[q] contains the information we need to compute θ(q, a) but that does not depend on a. Since the array π has only m entries, whereas δ has θ(m| |) entries, we save a factor of |Ʃ| in the preprocessing time by computing rather than δ.

The prefix function for a pattern

The prefix function π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself.

Searches for occurrences of a pattern x within a main text string y by employing the simple observation: after a mismatch, the word itself allows us to determine where to begin the next match to bypass re-examination of previously matched characters.

Consider the operation of the naive string matcher.

Figure shows a particular shift s of a template containing the pattern P = ababaca against a text T. For this example, q = 5 of the characters have matched successfully, but the 6th pattern character fails to match the corresponding text character. The information that q characters have matched successfully determines the corresponding text characters. Knowing these q text characters allows us to determine immediately that certain shifts are invalid. In the example of the figure, the shift is necessarily invalid, since the first pattern character (a) would be aligned with a text character that we know does not match the first pattern character, but does match the second pattern character (b). The shift s'=s+2 shown in part (b) of the figure, however, aligns the first three pattern characters with three text characters that must necessarily match.

We formalize the information that we precompute as follows. Given a pattern P[1. .m], the prefix function for the pattern P is the function π : f{1,2, . . . ,m} à{0, 1, . . . ,m -1} such that

The pseudocode below gives the Knuth-Morris-Pratt matching algorithm as the procedure KMP-MATCHER. KMP-MATCHER calls the auxiliary procedure COMPUTE-PREFIX-FUNCTION to compute π .

These two procedures have much in common, because both match a string against the pattern

P: KMP-MATCHER matches the text T against P, and COMPUTEPREFIX-FUNCTION matches P against itself.

Run-Time Analysis

- Computing the prefix function takes time Θ(𝑚),- outer for loop takes time Θ(𝑚 )

- String-matching takes time Θ(𝑛 ).

Compared with FINITE-AUTOMATON-MATCHER, by using π rather than δ, we have reduced the time for preprocessing the pattern from O(m|Ʃ|) to Θ(𝑚 ), while keeping the actual matching time bounded by Θ(𝑛 ).

Design and Analysis of Algorithms

Sunday, 23 February 2020

The Knuth-Morris-Pratt algorithm

No comments:

Post a Comment

Updates