Note: This is an article in the string matching series. You can read the introduction to string matching here.
All elaborated and fast string matching algorithms perform some kind of preprocessing to gain information on the pattern. This information is used later during matching in order to take shortcuts, i.e., skip comparisons. The naive method described here does no preprocessing and is very easy to implement, but it is also very slow.
How it works
This method slides the pattern P along the text T (you may know the term sliding window from bioinformatics). At each of the n – m + 1 possible shifts, it starts comparing the pattern to the text character by character, aborting if a mismatch is found. When the last character is hit and matches, it has found a match and prints it with the current shift. Here, we search for P=bda in the text T = bacbcbdac :
At each shift, I show 3 lines:
- line 1: The text T
- line 2: match / mismatch indicators (match = “|” , mismatch = “x”)
- line 3: The pattern P
Match as shift s=5!
So the pseudo code boils down to:
FOR shift s = 0 TO n -m:
..IF P[1 .. m] = T[s+1 .. s+m]
....PRINT Pattern occurs with shift s.
How much time does it take?
A lot. We need to compare the pattern of length m with the text at all possible shifts s. There are roughly n such shifts (exactly there are n – m + 1 such shifts because trying to match a pattern of length 5 at the last character of the text does not make any sense). So this easy but rather stupid algorithm takes time O((n – m + 1) m ).
Note that this algorithm is really stupid: If it found the pattern house in a text at s = 5, it will try the next position, s = 6, and check whether another match exists there. This obviously cannot be the case, but the algorithm does not analyze the pattern in any way, so it cannot know it.
Q #1: This is asymptotic notation, so who cares for the 1 in O((n – m + 1) m )?
A #1: The 1 is required for the special case n = m (look at the formula again to see why).
Summary & Outlook
We need to get better! And we will, using the Rabin-Karp algorithm described in the next post.