**String matching**

I am responsible for a tutorial on *Algorithms and Models in Computational Biology* this semester, so I decided to write a bit about some of the algorithms here when ever I find the time.

The posts here intentionally are not gonna be very formal, the focus is on giving easy and good explanations of the algorithms — the formal stuff is available from 1000s of books in each university library. (I suggest the first one you try is *Introduction to Algorithms* by *Cormen et al.*)

In this post, I will explain the string matching problem and give some definitions. In the next posts, I will describe some of the algorithms which can be used to solve it.

**The task**

Basically, we want to find all occurrences of a short text — the **pattern** — in a longer **text**. This is obviously required in all text processing software and you can most likely press *CTRL + F* now to use it in your browser right now.

An example would be the question: How often and at which positions does the pattern ‘mouse’ appear in the text ‘I once had a mouse called Klaus that lived in a mouse house’? We would expect something like this as an answer: ‘2 times total: 1st occurrence at position 14, 2nd occurrence at position 49’ (assuming the first letter of the string is at position 1).

It gets way more interesting (and more important to do it *fast*) if you are searching a longer pattern in a very long text, of course. Think of searching in a book or a whole library. In bioinformatics, people may even have the idea to search for parts of a gene in a whole genome*.

**Some definitions**

Now that we know about the problem and why it matters, let us give a definition of the problem that is a bit more formal and define some abbreviations that will come in handy later (here, I follow Cormen_2001, page 906):

- A
*pattern*is a string of length**P****m**, and we search for it in a*text*of length*T***n**. The characters of both*P*and*T*are from the finite alphabet*sigma*(Example: for DNA sequences,*sigma*= { A, T, G, C} and the cardinality of*sigma*is |*sigma*| = 4). - We say that a pattern P occurs with a
*shift*in T if T[s+1 .. s+m] = P[1..m]. Obviously, this can only happen for 0 <= s <= n-m because negative indexes make no sense and a pattern of length**s***m*cannot match at a position in a text where the text has less than*m*characters left.

We will use these definitions when talking about the algorithms in the next posts.

**Sources & References**

Cormen_2001: Cormen *et al.*, Introduction to Algorithms, second edition, MIT Press, 2001

*But please do not confuse string matching with the problem of finding *evolutionary related* sequences to a given source sequence in a database, a more complicated problem which is usually solved by alignment methods which use dynamic programming techniques, like the well-known BLAST tool.

Pingback: Naive string matching | spirit's spinney

Pingback: The Rabin-Karp string matching algorithm | spirit's spinney