Algorithms for string matching

String matching

I am responsible for a tutorial on Algorithms and Models in Computational Biology this semester, so I decided to write a bit about some of the algorithms here when ever I find the time.

The posts here intentionally are not gonna be very formal, the focus is on giving easy and good explanations of the algorithms — the formal stuff is available from 1000s of books in each university library. (I suggest the first one you try is Introduction to Algorithms by Cormen et al.)

In this post, I will explain the string matching problem and give some definitions. In the next posts, I will describe some of the algorithms which can be used to solve it.

The task

Basically, we want to find all occurrences of a short text — the pattern — in a longer text. This is obviously required in all text processing software and you can most likely press CTRL + F now to use it in your browser right now.

An example would be the question: How often and at which positions does the pattern ‘mouse’ appear in the text ‘I once had a mouse called Klaus that lived in a mouse house’? We would expect something like this as an answer: ‘2 times total: 1st occurrence at position 14, 2nd occurrence at position 49’ (assuming the first letter of the string is at position 1).

It gets way more interesting (and more important to do it fast) if you are searching a longer pattern in a very long text, of course. Think of searching in a book or a whole library. In bioinformatics, people may even have the idea to search for parts of a gene in a whole genome*.

Some definitions

Now that we know about the problem and why it matters, let us give a definition of the problem that is a bit more formal and define some abbreviations that will come in handy later (here, I follow Cormen_2001, page 906):

  • A pattern P is a string of length m, and we search for it in a text T of length n. The characters of both P and T are from the finite alphabet sigma (Example: for DNA sequences, sigma = { A, T, G, C} and the cardinality of sigma is |sigma| = 4).
  • We say that a pattern P occurs with a shift s in T if T[s+1 .. s+m] = P[1..m]. Obviously, this can only happen for 0 <= s <= n-m because negative indexes make no sense and a pattern of length m cannot match at a position in a text where the text has less than m characters left.

We will use these definitions when talking about the algorithms in the next posts.

Sources & References

Cormen_2001: Cormen et al., Introduction to Algorithms, second edition, MIT Press, 2001

*But please do not confuse string matching with the problem of finding evolutionary related sequences to a given source sequence in a database, a more complicated problem which is usually solved by alignment methods which use dynamic programming techniques, like the well-known BLAST tool.

Advertisements

About dfspspirit

PhD student in bioinformatics, interested in photography, level design, digital image manipulation, architecture and, of course, bioinformatics.
This entry was posted in algorithms, Bioinformatics, Science and tagged , , , . Bookmark the permalink.

2 Responses to Algorithms for string matching

  1. Pingback: Naive string matching | spirit's spinney

  2. Pingback: The Rabin-Karp string matching algorithm | spirit's spinney

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s