The
Rabin-Karp algorithm
Rabin and Karp proposed a string-matching algorithm that performs well
in practice. The Rabin-Karp algorithm uses θ(m) preprocessing time, and its worst-case
running time is θ((n-m+1)m). Based on certain assumptions, however, its
average-case running time is better.
This algorithm makes use of elementary number-theoretic notions.
Given a pattern P [1. .m], let p denote its
corresponding decimal value. Similarly, given a text T[1. . n], let ts
denote the decimal value of the length-m sub string T [s +1. . s + m], for s = 0,
1, . . . , n - m.
Now, ts = p if and only if T
[s +1. . s +m] = P[1. .m]; thus, s is a valid shift if and only if ts=p.
If p could be computed in
time O(m) and all of the ti values
in a total of O(n) time, then we could determine all valid
shifts s in time O(n) by comparing p with
each of the ts values.
The only problem
arises when numbers p and ts
can be so large that comparisons cannot be said to take constant time. This problem
can be solved easily by computing p and ts modulo a suitable number
q. Generally q is chosen as a prime so that d.q fits in a computer word, which
allows all computation to be done in single precision.
In general, with a d-ary alphabet {0, 1, . . . , d – 1}, choose q so that
d.q fits within a computer word.
Working with numbers modulo q has the drawback that p approximately equal to ts(mod q) does not
imply that p = ts. This is called Spurious hit. Pattern P[1..m] with
T[s + 1..s + m] has to be compared character by character to see if the strings
really are identical.
The inputs to the procedure are the text T , the pattern P, the radix d
to use (which is typically taken to be | Ʃ | ), and the prime q to use
θ((n - m + 1)m) in the worst case. Rabin-Karp is slow
because of multiplications and mod operations, but becomes competitive for long
patterns.