Finding Approximate Repeats in DNA Sequences Using Multiple Spaced Seeds

Banyassady, Sarah

Please use this identifier to cite or link to this item: http://hdl.handle.net/11375/18041

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Smyth, William F.	-
dc.contributor.author	Banyassady, Sarah	-
dc.date.accessioned	2015-09-24T14:19:34Z	-
dc.date.available	2015-09-24T14:19:34Z	-
dc.date.issued	2015-11	-
dc.identifier.uri	http://hdl.handle.net/11375/18041	-
dc.description.abstract	In computational biology, genome sequences are represented as strings of characters over a small alphabet. These sequences contain many repeated subsequences, most of which are not exact copies but similarities, or approximate repeats. Sequence similarity search is a powerful way of analyzing genome sequences, with many applications such as inferring genomic evolutionary events and relationships. Detecting approximate repeats between two sequences is not a trivial problem, and solutions generally need large memory space and long processing time. Furthermore, the number of available genome sequences is growing fast as sequencing technologies advance. Hence, designing efficient methods for approximate repeat detection in large sequences is of great importance. In this study, we propose a new method for finding approximate repeats in DNA sequences and develop the corresponding software. A common strategy is to index the locations of short substrings, or seeds, of one sequence and store them in an efficiently searchable structure, and then to scan the other sequence and look up the structure for matches with the stored seeds. A novel feature of our method is its efficient use of spaced seeds, substrings with gaps, to generate approximate repeats. We have designed a new space-efficient hash table for indexing sequences with multiple spaced seeds. The resulting seed matches are then extended into longer approximate repeats using dynamic programming. Our results indicate that our hash table implementation requires less memory than previously proposed hash table methods, especially when higher similarities between approximate repeats are desired. Moreover, increasing the length of the seeds does not significantly increase the space requirement of the hash table, while allowing the same similarities to be computed faster.	en_US
dc.language.iso	en	en_US
dc.title	Finding Approximate Repeats in DNA Sequences Using Multiple Spaced Seeds	en_US
dc.title.alternative	Finding Approximate Repeats with Multiple Spaced Seeds	en_US
dc.type	Thesis	en_US
dc.contributor.department	Computational Engineering and Science	en_US
dc.description.degreetype	Thesis	en_US
dc.description.degree	Master of Science (MSc)	en_US
Appears in Collections:	Open Access Dissertations and Theses
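
The seed-and-lookup strategy described in the abstract can be illustrated with a minimal sketch. It assumes a spaced seed is written as a string of '1' (position must match) and '0' (position is ignored), and it uses an ordinary Python dictionary as the index. It is not the space-efficient hash table designed in the thesis, and it omits the dynamic-programming extension step. The function names spaced_key, build_index and find_seed_hits are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def spaced_key(seq, pos, seed):
    """Return the characters of seq at the seed's '1' positions, starting at pos."""
    return "".join(seq[pos + i] for i, c in enumerate(seed) if c == "1")


def build_index(seq, seeds):
    """Map (seed number, spaced key) to the list of start positions in seq.

    Assumption: a plain dict stands in for the thesis's hash table.
    """
    index = defaultdict(list)
    for s, seed in enumerate(seeds):
        for pos in range(len(seq) - len(seed) + 1):
            index[(s, spaced_key(seq, pos, seed))].append(pos)
    return index


def find_seed_hits(index, query, seeds):
    """Scan query and return sorted (indexed_pos, query_pos) pairs that share a key."""
    hits = set()
    for s, seed in enumerate(seeds):
        for qpos in range(len(query) - len(seed) + 1):
            key = (s, spaced_key(query, qpos, seed))
            # .get avoids creating empty entries in the defaultdict
            for tpos in index.get(key, ()):
                hits.add((tpos, qpos))
    return sorted(hits)


if __name__ == "__main__":
    seeds = ["1101", "111"]  # one spaced seed and one contiguous seed
    index = build_index("ACGTAC", seeds)
    print(find_seed_hits(index, "ACTTAC", seeds))
```

In this toy input, the spaced seed 1101 matches ACGT against ACTT at position 0 even though the third characters differ, which a contiguous seed of length 4 would miss. Each resulting hit would then be a starting point for extension into a longer approximate repeat.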

Files in This Item:

File	Description	Size	Format
Banyassady_Sarah_201509_MSc.pdf Open Access	Main article	2.55 MB	Adobe PDF
