The **Rabin-Karp algorithm** solves the string matching problem I described in an earlier post. Its worst-case running time matches that of the naive string matching algorithm, but it performs significantly better on real-world inputs on average. (For the naive algorithm, the worst case occurs when the text *T* consists of repetitions of the pattern *P* — as we will see later, the same is true for Rabin-Karp.)

The algorithm first performs a preprocessing step, which is then used to speed up the actual matching.

**The ideas behind Rabin-Karp**

The Rabin-Karp algorithm uses a sliding window approach like the naive method, but it is based on the following ideas:

**Strings can be treated as numbers.** We can map the characters of an alphabet *sigma* with |sigma| = *d* different characters to numbers in radix-*d* notation. For example, if the alphabet has 10 letters *a, b, .., j*, we can map them to the decimal digits 0..9 (d=10 for decimal numbers). We could use a different radix to map arbitrary alphabets.

So we could map the 3-character string *bda* from the alphabet *sigma = {a,b,c,d}* to the number *241* (using the mapping *a→1, b→2, c→3, d→4* and reading the digits as a decimal number). So for each shift *s*, instead of comparing the pattern to the current part of the text character by character, we quickly compute a (multi-digit) number representation of the pattern (once) and of the part of the text we are interested in. Then we compare these number representations:
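As a small sketch of this idea (the function name and the digit assignment *a→1, b→2, …* are my own choices for illustration), a string can be turned into a number with one pass over its characters:

```python
def string_to_number(s, alphabet, radix):
    """Map each character to a digit and read the digit sequence in the given radix."""
    digit = {ch: i + 1 for i, ch in enumerate(alphabet)}  # a->1, b->2, c->3, d->4
    value = 0
    for ch in s:
        value = value * radix + digit[ch]
    return value

# "bda" over {a,b,c,d}, digits read as a decimal number:
print(string_to_number("bda", "abcd", 10))  # -> 241
```

Note that radix 10 works here only because the alphabet has at most 10 characters; for larger alphabets you would pick a larger radix.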

**Computers can compare numbers very quickly.** We can assume that a computer can compare two numbers of *reasonable* length in **constant** time. What exactly *reasonable* means in this context depends on the computer architecture, of course. But what happens if the numbers get too large (i.e., the pattern is very long)? In that case, we will need to **hash** the numbers to keep them short.

**Use Horner’s rule to quickly compute mod values.** For hashing, we can use the modulo function: choose some value *q* (usually a prime, e.g., 13) and for each sliding window with shift *s*, compute and save the window’s value *mod q*. Compute the mod values instead of the real numbers.

During the sliding window phase, it is possible to compute the number representation of the text part that is considered in the next shift *s+1* in constant time if you use Horner’s rule to compute the next *mod q* value.

**Explanation**

Since the algorithm treats strings as numbers, let us introduce some **new notation** to make the following easier:

Let *p* denote the corresponding decimal value of a pattern *P[1..m]*. And let *t_{s}* denote the decimal value of the length-*m* substring *T[s+1 .. s+m]* of the text *T* (for *s = 0, 1, .., n-m*). Note that this implies that *s* is a valid shift iff *t_{s} = p*.

So here is what the algorithm does:

**Preprocessing:** Compute the digit representation *p* of the pattern P in time *Θ(m)* using Horner’s rule, which is: *p = P[m] + 10 (P[m-1] + 10(P[m-2] + … + 10(P[2] + 10P[1])…))*. Example:

For the input pattern *0100* (and thus *m=4*), using radix 4 instead of 10 here, we compute *p* like this:

*p = Hash(0100) = 0 + 4(0 + 4(1 + 4 ∙ 0)) = 16*

Use the same method to compute the hash for the first chunk of the text, *t _{0} = hash(T[1..m])* in the same time Θ(m). This is nothing special.
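This preprocessing can be sketched in a few lines (the function name is my own; it computes the radix value of a digit sequence with Horner’s rule in Θ(m)):

```python
def horner_hash(digits, radix):
    """Evaluate the digit sequence as a radix number using Horner's rule: Theta(m)."""
    h = 0
    for d in digits:
        h = h * radix + d  # shift left by one digit, then add the next one
    return h

# The pattern 0100 in radix 4, as in the example above:
print(horner_hash([0, 1, 0, 0], 4))  # -> 16
```

The same call with the first *m* digits of the text yields *t_{0}*.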

**Matching:** But now, we slide the window. Observe that we can compute the remaining values t_{1} ,t_{2}, … , t_{n-m} in time Θ(n-m) because **we can compute t _{s+1} from t_{s} in constant time**! This is due to Horner’s rule:

t_{s+1} = 10 (t_{s} – 10^{m-1} T[s+1]) + T[s+m+1]

This equation is easy to compute and takes only a **constant number of operations** (note that the 10^{m-1} is a constant and can be precomputed once): all that is done in the equation are these 3 easy steps:

From the current value t_{s}, to get *t_{s+1}*, do:

- remove the left-most digit from *t_{s}* (the part *(t_{s} – 10^{m-1} T[s+1])* does this)
- multiply the result by 10 (a simple shift operation for decimal numbers; the *10* at the beginning of the equation does this)
- add the next digit from T to the result (see the *+ T[s+m+1]* part).
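The steps above can be sketched as a single constant-time update function (the names are my own; radix 10 as in the equations):

```python
def roll(t_s, left_digit, next_digit, radix, high):
    """Compute t_{s+1} from t_s in O(1).

    high = radix**(m-1), precomputed once outside the loop."""
    return radix * (t_s - high * left_digit) + next_digit

# Text "31415" with m = 4: the first window is 3141 (high = 10**3).
# Dropping the leading 3 and appending 5 yields the next window, 1415.
print(roll(3141, 3, 5, 10, 1000))  # -> 1415
```

Each call performs one subtraction, two multiplications, and one addition, so sliding the window across the whole text takes Θ(n – m) time in total.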

**Keeping the numbers reasonably small (hashing + collision handling)**

In practice, if *m* is very large, the comparison gets slower. Therefore, we could apply a final step (after the 3 steps listed above):

- let *t_{s+1} = t_{s+1} mod q* (where q is a prime, e.g., 13)

In practice, we would choose a larger prime (such that the length of the resulting number *t_{s+1}* fits the computer architecture). This keeps the numbers small and thus ensures that comparing them is fast. It may lead to collisions, of course, because *t_{s} ≡ p (mod q)* does not imply *t_{s} = p*. So we will have to double-check the matches (valid shifts) to filter out false positives or spurious hits. This is done by comparing the original text instead of the hash values.
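Putting all of the above together, here is a compact sketch of the whole algorithm over decimal digit strings (all names are my own; q = 13 as in the examples). The rolling update is applied mod q, and every hash hit is verified against the actual text to filter out spurious hits:

```python
def rabin_karp(text, pattern, radix=10, q=13):
    """Return all valid shifts s with text[s:s+m] == pattern.

    text and pattern are digit strings; all hashes are kept mod q."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(radix, m - 1, q)               # radix^(m-1) mod q, precomputed once
    p = t = 0
    for i in range(m):                        # Horner's rule: Theta(m) preprocessing
        p = (radix * p + int(pattern[i])) % q
        t = (radix * t + int(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        if p == t and text[s:s + m] == pattern:   # verify to rule out collisions
            shifts.append(s)
        if s < n - m:                         # roll the window in O(1)
            t = (radix * (t - high * int(text[s])) + int(text[s + m])) % q
    return shifts

print(rabin_karp("3141592653589793", "26"))  # -> [6]
```

Python’s `%` operator always returns a non-negative result here, so the subtraction inside the rolling update needs no extra correction step.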

**How fast is the Rabin-Karp algorithm?**

The **preprocessing** takes time Θ(m) for computing the *mod q*-value of the pattern and the first text window.

The **matching** time is a bit more complicated.

**Worst case.** The matching time becomes Θ((n – m + 1)m) in the worst case: we have to double-check the matches for false positives, and for the worst possible input (the text is a concatenation of valid matches), Rabin-Karp degrades to the naive method.

**Average case on real-world data.** Usually, we expect very few hits within the text T. For some constant number *c* of hits, the expected matching time becomes *O((n – m + 1) + cm)*, which is *O(n + m)*. Since *m ≤ n*, this becomes *O(n)*.

Note: We have not considered the time to process false positives here. For a small number of valid shifts and q > m, this time becomes O(m) and thus can be ignored. See Cormen et al. for a more detailed explanation if you are interested.