Design and Analysis of Algorithms: string matching

Showing posts with label string matching. Show all posts

Sunday, 23 February 2020

The Knuth-Morris-Pratt algorithm

The Knuth-Morris-Pratt algorithm

This algorithm avoids computing the transition function δ altogether, and its matching time is ‚θ(n) using just an auxiliary function π which we precompute from the pattern in time θ(m) and store in an array π[1. . m].

For any state q = 0, 1, . . . ,m and any character the value π[q] contains the information we need to compute θ(q, a) but that does not depend on a. Since the array π has only m entries, whereas δ has θ(m| |) entries, we save a factor of |Ʃ| in the preprocessing time by computing rather than δ.

The prefix function for a pattern

The prefix function π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself.

Searches for occurrences of a pattern x within a main text string y by employing the simple observation: after a mismatch, the word itself allows us to determine where to begin the next match to bypass re-examination of previously matched characters.

Consider the operation of the naive string matcher.

Figure shows a particular shift s of a template containing the pattern P = ababaca against a text T. For this example, q = 5 of the characters have matched successfully, but the 6th pattern character fails to match the corresponding text character. The information that q characters have matched successfully determines the corresponding text characters. Knowing these q text characters allows us to determine immediately that certain shifts are invalid. In the example of the figure, the shift is necessarily invalid, since the first pattern character (a) would be aligned with a text character that we know does not match the first pattern character, but does match the second pattern character (b). The shift s'=s+2 shown in part (b) of the figure, however, aligns the first three pattern characters with three text characters that must necessarily match.

We formalize the information that we precompute as follows. Given a pattern P[1. .m], the prefix function for the pattern P is the function π : f{1,2, . . . ,m} à{0, 1, . . . ,m -1} such that

The pseudocode below gives the Knuth-Morris-Pratt matching algorithm as the procedure KMP-MATCHER. KMP-MATCHER calls the auxiliary procedure COMPUTE-PREFIX-FUNCTION to compute π .

These two procedures have much in common, because both match a string against the pattern

P: KMP-MATCHER matches the text T against P, and COMPUTEPREFIX-FUNCTION matches P against itself.

Run-Time Analysis

- Computing the prefix function takes time Θ(𝑚),- outer for loop takes time Θ(𝑚 )

- String-matching takes time Θ(𝑛 ).

Compared with FINITE-AUTOMATON-MATCHER, by using π rather than δ, we have reduced the time for preprocessing the pattern from O(m|Ʃ|) to Θ(𝑚 ), while keeping the actual matching time bounded by Θ(𝑛 ).

Saturday, 22 February 2020

String matching with finite automata

String matching with finite automata

Many string-matching algorithms build a finite automaton—a simple machine for processing information—that scans the text string T for all occurrences of the pattern P.

Finite automata

A finite automaton M, is a 5-tuple (Q, q₀,A,Ʃ,δ),

Where

- Q is a finite set of states,

- is the start state,

- is a distinguished set of accepting states,

- Ʃ is a finite input alphabet,

- δ is a function from Q X Ʃ into Q, called the transition function of M.

The finite automaton begins in state q₀ and reads the characters of its input string one at a time. If the automaton is in state q and reads input character a, it moves (“makes a transition”) from state q to state δ(q,a). Whenever its current state q is a member of A, the machine M has accepted the string read so far. An input that is not accepted is rejected.

For example, on input abaaa, including the start state, this automaton enters the sequence of states < 0, 1, 0, 1, 0, 1 >, and so it accepts this input. For input abbaa, it enters the sequence of states < 0, 1, 0, 0, 1, 0 >, and so it rejects this input.

In order to specify the string-matching automaton corresponding to a given pattern P [1. .m], we first define function σ, called the suffix function corresponding to P. The function σ maps Ʃ^* to {0, 1, . . . ,m} such that σ(x) is the length of the longest prefix of P that is also a suffix of x:

We define δ(q, a) = σ(P_qa) because to keep track of the longest prefix of the pattern P that has matched the text string T so far.

With this improved procedure for computing δ, we can find all occurrences of a length-m pattern in a length-n text over an alphabet Ʃ with O(m|Ʃ|) preprocessing time and θ(n) matching time.

Tuesday, 10 December 2019

The Rabin-Karp algorithm

The Rabin-Karp algorithm

Rabin and Karp proposed a string-matching algorithm that performs well in practice. The Rabin-Karp algorithm uses θ(m) preprocessing time, and its worst-case running time is θ((n-m+1)m). Based on certain assumptions, however, its average-case running time is better.

This algorithm makes use of elementary number-theoretic notions.

Given a pattern P [1. .m], let p denote its corresponding decimal value. Similarly, given a text T[1. . n], let t_s denote the decimal value of the length-m sub string T [s +1. . s + m], for s = 0, 1, . . . , n - m.

Now, t_s = p if and only if T [s +1. . s +m] = P[1. .m]; thus, s is a valid shift if and only if t_s=p.

If p could be computed in time O(m) and all of the t_i values in a total of O(n) time, then we could determine all valid shifts s in time O(n) by comparing p with each of the t_s values.

The only problem arises when numbers p and t_s can be so large that comparisons cannot be said to take constant time. This problem can be solved easily by computing p and t_s modulo a suitable number q. Generally q is chosen as a prime so that d.q fits in a computer word, which allows all computation to be done in single precision.

In general, with a d-ary alphabet {0, 1, . . . , d – 1}, choose q so that d.q fits within a computer word.

Working with numbers modulo q has the drawback that p approximately equal to t_s(mod q) does not imply that p = t_s. This is called Spurious hit. Pattern P[1..m] with T[s + 1..s + m] has to be compared character by character to see if the strings really are identical.

The inputs to the procedure are the text T , the pattern P, the radix d to use (which is typically taken to be | Ʃ | ), and the prime q to use

RABIN-KARP-MATCHER takes θ(m) preprocessing time, and its matching time is

θ((n - m + 1)m) in the worst case. Rabin-Karp is slow because of multiplications and mod operations, but becomes competitive for long patterns.

String matching

String matching

String matching or searching algorithms try to find places where one or several strings (also called patterns) are found within a larger string (searched text).

Text-editing programs frequently need to find all occurrences of a pattern in the text. Efficient algorithms for this problem are called “string matching” algorithms that can greatly help the responsiveness of the text-editing program.

The string-matching problem is formalized as follows:

Assume that the text is an array T[1. . n] of length n and that the pattern is an array P[1. .m] of length m ≤ n. Assume that the elements of P and T are characters drawn from a finite alphabet Ʃ. The character arrays P and T are often called strings of characters.

The string-matching problem is the problem of finding all valid shifts with which a given pattern P occurs in a given text T.

If the pattern P occurs with shift s in T, then s is called valid shift; otherwise, s is an invalid shift.

Pattern P occurs with shift s in text T if 0 ≤ s ≤ n-m and T [s+1. . s +m] = P [1. .m].