Finding Approximate Repeats in DNA Sequences Using Multiple Spaced Seeds

Banyassady, Sarah

Files

Banyassady_Sarah_201509_MSc.pdf (2.49 MB)

Date

2015-11

Authors

Banyassady, Sarah

Abstract

In computational biology, genome sequences are represented as strings of characters defined over a small alphabet. These sequences contain many repeated subsequences, most of which are not exact copies but similarities, or approximate repeats. Sequence similarity search is a powerful way of analyzing genome sequences, with many applications such as inferring genomic evolutionary events and relationships. Detecting approximate repeats between two sequences is not a trivial problem, and solutions generally require large amounts of memory and long processing times. Furthermore, the number of available genome sequences is growing rapidly as sequencing technologies advance. Hence, designing efficient methods for approximate repeat detection in large sequences is of great importance. In this study, we propose a new method for finding approximate repeats in DNA sequences and develop the corresponding software. A common strategy is to index the locations of short substrings, or seeds, of one sequence and store them in an efficiently searchable structure, then scan the other sequence and look up that structure for matches with the stored seeds. A novel feature of our method is its efficient use of spaced seeds, substrings with gaps, to generate approximate repeats. We have designed a new space-efficient hash table for indexing sequences with multiple spaced seeds. The resulting seed matches are then extended into longer approximate repeats using dynamic programming. Our results indicate that our hash table implementation requires less memory than previously proposed hash table methods, especially when higher similarities between approximate repeats are desired. Moreover, increasing the length of the seeds does not significantly increase the space requirement of the hash table, while allowing the same similarities to be computed faster.
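
The general seed-and-lookup strategy described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's space-efficient hash table; it uses an ordinary dictionary, and the function names (`seed_key`, `build_index`, `find_seed_matches`), the mask notation (`1` = position must match, `0` = position is ignored), and the toy sequences are all assumptions made for illustration. The extension of seed matches by dynamic programming is omitted.

```python
from collections import defaultdict


def seed_key(seq, pos, mask):
    """Return the characters of seq at the '1' positions of mask, starting at pos."""
    return "".join(seq[pos + i] for i, m in enumerate(mask) if m == "1")


def build_index(seq, masks):
    """Map (mask number, key) to the list of start positions in seq.

    A plain dict is used here for clarity; it is NOT the space-efficient
    hash table proposed in the thesis.
    """
    index = defaultdict(list)
    for m, mask in enumerate(masks):
        for pos in range(len(seq) - len(mask) + 1):
            index[(m, seed_key(seq, pos, mask))].append(pos)
    return index


def find_seed_matches(index, query, masks):
    """Return sorted (indexed position, query position) pairs hit by any seed."""
    hits = set()
    for m, mask in enumerate(masks):
        for qpos in range(len(query) - len(mask) + 1):
            for spos in index.get((m, seed_key(query, qpos, mask)), ()):
                hits.add((spos, qpos))
    return sorted(hits)


# Hypothetical toy input: the two sequences differ only at offset 2.
reference = "ACGTAC"
query = "ACTTAC"

contiguous = ["1111"]          # ordinary seed: all four positions must match
spaced = ["1101"]              # spaced seed: the third position is ignored

print(find_seed_matches(build_index(reference, contiguous), query, contiguous))
print(find_seed_matches(build_index(reference, spaced), query, spaced))
```

In this toy input the contiguous seed finds no match, because every window of length four covers the mismatch or differs elsewhere, while the spaced seed reports a hit at position 0 of both sequences. Using several masks at once, as the abstract describes, simply adds more keys per position to the same index.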

URI

http://hdl.handle.net/11375/18041

Collections

Open Access Dissertations and Theses
