Design and Analysis of Algorithms: February 2020

The Knuth-Morris-Pratt algorithm

This algorithm avoids computing the transition function δ altogether, and its matching time is ‚θ(n) using just an auxiliary function π which we precompute from the pattern in time θ(m) and store in an array π[1. . m].

For any state q = 0, 1, . . . ,m and any character the value π[q] contains the information we need to compute θ(q, a) but that does not depend on a. Since the array π has only m entries, whereas δ has θ(m| |) entries, we save a factor of |Ʃ| in the preprocessing time by computing rather than δ.

The prefix function for a pattern

The prefix function π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself.

Searches for occurrences of a pattern x within a main text string y by employing the simple observation: after a mismatch, the word itself allows us to determine where to begin the next match to bypass re-examination of previously matched characters.

Consider the operation of the naive string matcher.

Figure shows a particular shift s of a template containing the pattern P = ababaca against a text T. For this example, q = 5 of the characters have matched successfully, but the 6th pattern character fails to match the corresponding text character. The information that q characters have matched successfully determines the corresponding text characters. Knowing these q text characters allows us to determine immediately that certain shifts are invalid. In the example of the figure, the shift is necessarily invalid, since the first pattern character (a) would be aligned with a text character that we know does not match the first pattern character, but does match the second pattern character (b). The shift s'=s+2 shown in part (b) of the figure, however, aligns the first three pattern characters with three text characters that must necessarily match.

We formalize the information that we precompute as follows. Given a pattern P[1. .m], the prefix function for the pattern P is the function π : f{1,2, . . . ,m} à{0, 1, . . . ,m -1} such that

The pseudocode below gives the Knuth-Morris-Pratt matching algorithm as the procedure KMP-MATCHER. KMP-MATCHER calls the auxiliary procedure COMPUTE-PREFIX-FUNCTION to compute π .

These two procedures have much in common, because both match a string against the pattern

P: KMP-MATCHER matches the text T against P, and COMPUTEPREFIX-FUNCTION matches P against itself.

Run-Time Analysis

- Computing the prefix function takes time Θ(𝑚),- outer for loop takes time Θ(𝑚 )

- String-matching takes time Θ(𝑛 ).

Compared with FINITE-AUTOMATON-MATCHER, by using π rather than δ, we have reduced the time for preprocessing the pattern from O(m|Ʃ|) to Θ(𝑚 ), while keeping the actual matching time bounded by Θ(𝑛 ).

String matching with finite automata

Many string-matching algorithms build a finite automaton—a simple machine for processing information—that scans the text string T for all occurrences of the pattern P.

Finite automata

A finite automaton M, is a 5-tuple (Q, q₀,A,Ʃ,δ),

Where

- Q is a finite set of states,

- is the start state,

- is a distinguished set of accepting states,

- Ʃ is a finite input alphabet,

- δ is a function from Q X Ʃ into Q, called the transition function of M.

The finite automaton begins in state q₀ and reads the characters of its input string one at a time. If the automaton is in state q and reads input character a, it moves (“makes a transition”) from state q to state δ(q,a). Whenever its current state q is a member of A, the machine M has accepted the string read so far. An input that is not accepted is rejected.

For example, on input abaaa, including the start state, this automaton enters the sequence of states < 0, 1, 0, 1, 0, 1 >, and so it accepts this input. For input abbaa, it enters the sequence of states < 0, 1, 0, 0, 1, 0 >, and so it rejects this input.

In order to specify the string-matching automaton corresponding to a given pattern P [1. .m], we first define function σ, called the suffix function corresponding to P. The function σ maps Ʃ^* to {0, 1, . . . ,m} such that σ(x) is the length of the longest prefix of P that is also a suffix of x:

We define δ(q, a) = σ(P_qa) because to keep track of the longest prefix of the pattern P that has matched the text string T so far.

With this improved procedure for computing δ, we can find all occurrences of a length-m pattern in a length-n text over an alphabet Ʃ with O(m|Ʃ|) preprocessing time and θ(n) matching time.

Design and Analysis of Algorithms

Sunday, 23 February 2020

The Knuth-Morris-Pratt algorithm

Saturday, 22 February 2020

String matching with finite automata

Updates