Naive string matching

Note: This is an article in the string matching series. You can read the introduction to string matching here.

All elaborated and fast string matching algorithms perform some kind of preprocessing to gain information on the pattern. This information is used later during matching in order to take shortcuts, i.e., skip comparisons. The naive method described here does no preprocessing and is very easy to implement, but it is also very slow.

How it works

This method slides the pattern P along the text T (you may know the term sliding window from bioinformatics). At each of the n – m + 1 possible shifts, it starts comparing the pattern to the text character by character, aborting if a mismatch is found. When the last character is hit and matches, it has found a match and prints it with the current shift. Here, we search for P=bda in the text T = bacbcbdac :

At each shift, I show 3 lines:

  • line 1: The text T
  • line 2: match / mismatch indicators (match = “|” , mismatch = “x”)
  • line 3: The pattern P

s=0:
bacbcbdac
|x
bd

s=1:
bacbcbdac
.x
.b

s=2:
bacbcbdac
..x
..b

s=3:
bacbcbdac
…|x
…bd

s=4:
bacbcbdac
….x
….b

s=5:
bacbcbdac
…..|||
…..bda
Match as shift s=5!

 

So the pseudo code boils down to:


FOR shift s = 0 TO n -m:
..IF P[1 .. m] = T[s+1 .. s+m]
....PRINT Pattern occurs with shift s.

How much time does it take?

A lot. We need to compare the pattern of length m with the text at all possible shifts s. There are roughly n such shifts (exactly there are  n – m + 1 such shifts because trying to match a pattern of length 5 at the last character of the text does not make any sense). So this easy but rather stupid algorithm takes time O((n – m + 1) m ).

Note that this algorithm is really stupid: If it found the pattern house in a text at s = 5, it will try the next position, s = 6, and check whether another match exists there. This obviously cannot be the case, but the algorithm does not analyze the pattern in any way, so it cannot know it.

FAQ

Q #1: This is asymptotic notation, so who cares for the 1 in O((n – m + 1) m )?

A #1: The 1 is required for the special case n = m (look at the formula again to see why).

Summary & Outlook

We need to get better! And we will, using the Rabin-Karp algorithm described in the next post.

Advertisements

About dfspspirit

PhD student in bioinformatics, interested in photography, level design, digital image manipulation, architecture and, of course, bioinformatics.
This entry was posted in algorithms, Bioinformatics, Science and tagged , , , , . Bookmark the permalink.

One Response to Naive string matching

  1. Pingback: The Rabin-Karp string matching algorithm | spirit's spinney

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s