Pattern-Matching Algorithms on Indeterminate Strings
Loading...
Date
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
<p>In computing, strings are sequences of simple elements (letters) drawn from some
alphabet. A string is one of the most basic and important data structures in computer
science, it exists everywhere in computer systems. Disk files, contents of memory,
source and object code of computer programs, e-mail messages are all examples of
strings. Also in bioinformatics, nucleotides in DNA can be viewed as strings drawn
from an alphabet of four basic symbols- A, C, G and T. Algorithms that deal with
strings are correspondingly very important in the field of computer science as well as
in bioinformatics, data compression, cryptanalysis and other scientific areas.</p> <p>A pattern-matching algorithm is one of the most fundamental string algorithms.
Simply speaking, it outputs the occurrences of a string (called the pattern) within
another string (called the text). Exact pattern-matching algorithms have been extensively
studied in the last three decades and many algorithms have been proposed.
Among them are two well known algorithms, the Knuth-Morris-Pratt (KMP) algorithm
[KMP77], with good worst-case performance, and the Boyer-Moore (BM) algorithm
[BM77], with good average-case performance. A recent attempt to combine
the virtues of the two algorithms results in the Franek-Jennings-Smyth (FJS) algorithm
[Jen02, FJS05b, FJS05a], which has both excellent average-case and worst-case
performance.</p> <p>Indeterminate strings are a new class of strings proposed in [HS03] driven by the
increasingly common application of string algorithms on biological (DNA) sequences.
Unlike letters in a normal alphabet, an indeterminate letter matches a set of letters
(under certain constraints) rather than just one during pattern-matching. Perhaps the most straightforward example of an indeterminate letter is the don't care letter
'*', that matches all letters in the alphabet. The main purpose of an indeterminate
string is to increase the flexibility of pattern-matching. It can also be seen as a
model of a DNA sequence with a polymorphism property, which is a commonplace in
molecular biology.</p> <p>An indeterminate string is a relatively new idea and not too many known algorithms
work on it, except for a particular version of the ShiftOr algorithm [WM92].
However as we will see, even this version can only handle a special case of indeterminate
pattern-matching. Faster and more general indeterminate pattern-matching
algorithms are desired. In this thesis we first give rigorous definitions of indeterminate
strings and then develop our new indeterminate pattern-matching algorithms,
adapted and modified from existing exact pattern-matching algorithms. There are
many different constraints and also different models of patten-matching. We present
our solution with all these variations in order of increasing complexity. We present
our rationale of development and investigate the possibility of developing new algorithms
from different categories of existing determinate algorithms. After describing
our algorithms we conduct comprehensive experiments and compare the results carefully.
In the last chapter we present our conclusions and point out the directions of
future research.</p>
Description
Title: Pattern-Matching Algorithms on Indeterminate Strings, Author: Shu Wang, Location: Thode