The Knuth-Morris-Pratt
algorithm
Consider the operation of
the naive string matcher.
This algorithm avoids computing the transition function δ
altogether, and its matching time is ‚θ(n) using just an
auxiliary function π which we precompute from
the pattern in time θ(m) and store in an array
π[1.
. m].
For any state q = 0, 1, .
. . ,m and any character
the value π[q] contains the
information we need to compute θ(q, a) but that does not
depend on a. Since the array π has only m entries, whereas
δ
has θ(m|
|) entries, we save a factor of |Ʃ| in the preprocessing
time by computing rather than δ.
The prefix
function for a pattern
The prefix function π
for a pattern encapsulates knowledge about how the pattern matches against
shifts of itself.
Searches for occurrences
of a pattern x within a main text string y by employing the simple observation:
after a mismatch, the word itself allows us to determine where to begin the
next match to bypass re-examination of previously matched characters.
Figure shows a particular shift s of a template
containing the pattern P = ababaca against a text T. For this example, q = 5 of
the characters have matched successfully, but the 6th pattern character fails
to match the corresponding text character. The information that q characters have
matched successfully determines the corresponding text characters. Knowing
these q text characters allows us to determine immediately that certain shifts
are invalid. In the example of the figure, the shift
is necessarily invalid, since the
first pattern character (a) would be aligned with a text character that we know
does not match the first pattern character, but does match the second pattern
character (b). The shift s'=s+2 shown
in part (b) of the figure, however, aligns the first three pattern characters
with three text characters that must necessarily match.
We
formalize the information that we precompute as follows. Given a pattern P[1.
.m], the prefix function for the pattern P is the function π : f{1,2,
. . . ,m} à{0,
1, . . . ,m -1} such that
The
pseudocode below gives the Knuth-Morris-Pratt matching algorithm as the
procedure KMP-MATCHER. KMP-MATCHER calls the auxiliary procedure
COMPUTE-PREFIX-FUNCTION to compute π .
These
two procedures have much in common, because both match a string against the
pattern
P:
KMP-MATCHER matches the text T against P, and COMPUTEPREFIX-FUNCTION matches P
against itself.
Run-Time Analysis
-
Computing the prefix function takes time
Θ(𝑚),- outer for
loop takes time Θ(𝑚
)
-
String-matching takes time Θ(𝑛 ).
Compared
with FINITE-AUTOMATON-MATCHER, by using π
rather than δ,
we have reduced the time for preprocessing the pattern from O(m|Ʃ|) to Θ(𝑚 ), while keeping
the actual matching time bounded by Θ(𝑛
).