The Rabin-Karp string matching algorithm

The Rabin-Karp algorithm solves the string matching problem I described in an earlier post. Its worst-case running time is the same as for the naive string matching algorithm, but it performs significantly better on real-word inputs on average. (Note that the worst case is that the text T consists of repetitions of the pattern P for the naive algorithm — this is also true for Rabin-Karp as we will see later.)

The algorithm performs a preprocessing first, this is then used to speed up the actual matching.

The ideas behind Rabin-Karp

The Rabin-Karp algorithm uses a sliding window approach like the naive method, but it is based on the following ideas:

Strings can be treated as numbers. We can map the characters of an alphabet sigma with |sigma| = d different characters to numbers in radix-d notation. Example: this means, if the alphabet has 10 letters a, b, .., j we can map it to the decimal numbers 0..9 (d=10 for decimal numbers). We could use a different radix to map arbitrary alphabets.

So we could map the 3-character string bda from the alphabet sigma = {a,b,c,d} to the number 241. So for each shift s, instead of comparing the pattern to the current part of the text character-by-character, we quickly compute a (multi-digit) number representation of the pattern (once) and the part of the text we are interested in. Then we compare these number representations:

Computers can compare numbers very quickly. We can assume that a computer can compare two digits of reasonable length in constant time. The question of what exactly reasonable means in this context depends on the computer architecture, of course. But what happens if the numbers get too large (i.e., the pattern is very long)? In that case, we will need to hash the numbers to ensure they are short.

Use Horner’s rule to quickly compute mod values. For hashing, we can use the modulo function: choose some arbitrary value q (usually a prime, e.g., 13) and for each sliding window with shift s , compute and save s mod q. Compute the mod values instead of the real numbers.

During the sliding window phase, it is possible to compute the number representation of the text part that is considered in the next shift s+1 in constant time if you use Horner’s rule to compute the next mod q value.

Explanation

Since the algorithm treats strings as numbers, let us introduce some new notations to make the following stuff easier:

Let p denote the corresponding decimal value of a pattern P[1..m]. And let ts denote the decimal value of the substring T[s+1 .. s+m] with length m of the text T (for s=0,1, .. n-m). Note that this implies that s is a valid shift iff ts = p.

So here is what the algorithm does:

Preprocessing: Compute the digit representation p of the pattern P in time Θ(m) using Horner’s rule, which is: p = P[m] + 10 (P[m-1] + 10(P[m-2] + … + 10(P[2] + 10P[1])…)). Example:

For the input pattern 0100 (and thus m=4), we compute p like this:
p = Hash(0100) = 0 + 4(0 + 4(1 + 4 ∙ 0)) = 16

Use the same method to compute the hash for the first chunk of the text, t0 = hash(T[1..m]) in the same time Θ(m). This is nothing special.

Matching: But now, we slide the window. Observe that we can compute the remaining values t1 ,t2, … , tn-m in time Θ(n-m) because we can compute ts+1 from ts in constant time! This is due to Horner’s rule:

ts+1 = 10 (ts – 10m-1 T[s+1]) + T[s+m+1]

This equation is easy to compute and takes only a constant number of operations (note that the 10m-1 is a constant and can be precomputed once): all that is done in the equation are these 3 easy steps:

From the current value ts, to to get ts+1, do:

  • remove the left-most digit from ts (the part (ts – 10m-1 T[s+1]) does this)
  • multiply the result by 10 (a simple shift operation for decimal numbers, the 10 at the beginning of the equation does this)
  • compute the sum of that result and the next digit from T  (see the + T[s+m+1] part).

Keeping the numbers reasonably small (hashing + collision handling)

In practice, if m is very large, the comparison gets slower. Therefore, we could apply a final step (after the 3 steps listed above):

  • let ts+1 = ts+1 mod q (where q is a prime, e.g., 13)

In practice, we would choose a larger prime (such that the length of the resulting number ts+1 fits the computer architecture). This keeps the numbers small and thus ensure that comparing them is fast. It may lead to collisions of course, because  ts ≡ p (mod q) does not imply ts = p. So we will have to double-check the matches (valid shifts) to filter out false positives or spurious hits. This is done by comparing the original text instead of the hash values.

How fast is the Rabin-Karp algorithm?

The preprocessing takes time Θ(m) for computing the mod q-value of the pattern and the first text window.

The matching time is a bit more complicated.

Worst case. The matching time becomes Θ((n – m + 1)m) in the worst case: we have to double-check the matches for false positives and for the worst possible input (text is a concatination of valid matches), Rabin-Karp becomes the naive method.

Average case on real world  data. Usually, we expect very few hits within the text T. For some constant number c of hits, the expected matching time becomes O((n – m +1) + cm), which is O(n+m). Since m <= n, this becomes O(n).


Note: We have not considered the time to process false positives here. For a small number of valid shifts  and q > m, this time becomes O(m) and thus can be ignored. See Cormen et al. for a more detailed explanation if you are interested.

Advertisements

About dfspspirit

PhD student in bioinformatics, interested in photography, level design, digital image manipulation, architecture and, of course, bioinformatics.
This entry was posted in algorithms, Bioinformatics, Science and tagged , , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s