Pattern-Matching Algorithms on Indeterminate Strings

Wang, Shu

Pattern-Matching Algorithms on Indeterminate Strings

Files

Wang_Shu_2005_01_master.pdf (15.76 MB)

Date

2005-01

Authors

Wang, Shu

Abstract

In computing, strings are sequences of simple elements (letters) drawn from some alphabet. A string is one of the most basic and important data structures in computer science, it exists everywhere in computer systems. Disk files, contents of memory, source and object code of computer programs, e-mail messages are all examples of strings. Also in bioinformatics, nucleotides in DNA can be viewed as strings drawn from an alphabet of four basic symbols- A, C, G and T. Algorithms that deal with strings are correspondingly very important in the field of computer science as well as in bioinformatics, data compression, cryptanalysis and other scientific areas. A pattern-matching algorithm is one of the most fundamental string algorithms. Simply speaking, it outputs the occurrences of a string (called the pattern) within another string (called the text). Exact pattern-matching algorithms have been extensively studied in the last three decades and many algorithms have been proposed. Among them are two well known algorithms, the Knuth-Morris-Pratt (KMP) algorithm [KMP77], with good worst-case performance, and the Boyer-Moore (BM) algorithm [BM77], with good average-case performance. A recent attempt to combine the virtues of the two algorithms results in the Franek-Jennings-Smyth (FJS) algorithm [Jen02, FJS05b, FJS05a], which has both excellent average-case and worst-case performance. Indeterminate strings are a new class of strings proposed in [HS03] driven by the increasingly common application of string algorithms on biological (DNA) sequences. Unlike letters in a normal alphabet, an indeterminate letter matches a set of letters (under certain constraints) rather than just one during pattern-matching. Perhaps the most straightforward example of an indeterminate letter is the don't care letter '*', that matches all letters in the alphabet. The main purpose of an indeterminate string is to increase the flexibility of pattern-matching. It can also be seen as a model of a DNA sequence with a polymorphism property, which is a commonplace in molecular biology. An indeterminate string is a relatively new idea and not too many known algorithms work on it, except for a particular version of the ShiftOr algorithm [WM92]. However as we will see, even this version can only handle a special case of indeterminate pattern-matching. Faster and more general indeterminate pattern-matching algorithms are desired. In this thesis we first give rigorous definitions of indeterminate strings and then develop our new indeterminate pattern-matching algorithms, adapted and modified from existing exact pattern-matching algorithms. There are many different constraints and also different models of patten-matching. We present our solution with all these variations in order of increasing complexity. We present our rationale of development and investigate the possibility of developing new algorithms from different categories of existing determinate algorithms. After describing our algorithms we conduct comprehensive experiments and compare the results carefully. In the last chapter we present our conclusions and point out the directions of future research.

Description

Title: Pattern-Matching Algorithms on Indeterminate Strings, Author: Shu Wang, Location: Thode

URI

http://hdl.handle.net/11375/21222

Collections

Digitized Open Access Dissertations and Theses

Full item page

Pattern-Matching Algorithms on Indeterminate Strings

Files

Date

Authors

Journal Title

Journal ISSN

Volume Title

Publisher

Abstract

Description

Keywords

Citation

URI

Collections

Endorsement

Review

Supplemented By

Referenced By