Please use this identifier to cite or link to this item:
http://hdl.handle.net/11375/18041
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Smyth, William F. | - |
dc.contributor.author | Banyassady, Sarah | - |
dc.date.accessioned | 2015-09-24T14:19:34Z | - |
dc.date.available | 2015-09-24T14:19:34Z | - |
dc.date.issued | 2015-11 | - |
dc.identifier.uri | http://hdl.handle.net/11375/18041 | - |
dc.description.abstract | In computational biology, genome sequences are represented as strings of characters defined over a small alphabet. These sequences contain many repeated subsequences, most of which are not exact copies but similarities, or approximate repeats. Sequence similarity search is a powerful way of analyzing genome sequences, with many applications such as inferring genomic evolutionary events and relationships. Detecting approximate repeats between two sequences is not a trivial problem, and solutions generally need large memory space and long processing time. Furthermore, the number of available genome sequences is growing rapidly as sequencing technologies advance. Hence, designing efficient methods for approximate repeat detection in large sequences is of great importance. In this study, we propose a new method for finding approximate repeats in DNA sequences and develop the corresponding software. A common strategy is to index the locations of short substrings, or seeds, of one sequence and store them in an efficiently searchable structure, then scan the other sequence and look up the structure for matches with the stored seeds. A novel feature of our method is its efficient use of spaced seeds, substrings with gaps, to generate approximate repeats. We have designed a new space-efficient hash table for indexing sequences with multiple spaced seeds. The resulting seed matches are then extended into longer approximate repeats using dynamic programming. Our results indicate that our hash table implementation requires less memory than previously proposed hash table methods, especially when higher similarities between approximate repeats are desired. Moreover, increasing the length of seeds does not significantly increase the space requirement of the hash table, while allowing the same similarities to be computed faster. | en_US |
dc.language.iso | en | en_US |
dc.title | Finding Approximate Repeats in DNA Sequences Using Multiple Spaced Seeds | en_US |
dc.title.alternative | Finding Approximate Repeats with Multiple Spaced Seeds | en_US |
dc.type | Thesis | en_US |
dc.contributor.department | Computational Engineering and Science | en_US |
dc.description.degreetype | Thesis | en_US |
dc.description.degree | Master of Science (MSc) | en_US |
Appears in Collections: | Open Access Dissertations and Theses |
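
The seed-and-lookup strategy described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: it uses an ordinary Python dictionary rather than the space-efficient hash table the author designed, and it omits the dynamic-programming extension step. The function names (`seed_key`, `build_index`, `find_seed_matches`) and the encoding of a spaced seed as a string of `1` (position must match) and `0` (position ignored) are assumptions made for illustration only.

```python
from collections import defaultdict


def seed_key(text, start, mask):
    """Characters of `text` under the '1' positions of `mask`, starting at `start`.

    Assumes the window text[start:start + len(mask)] lies inside `text`.
    """
    return "".join(text[start + i] for i, bit in enumerate(mask) if bit == "1")


def build_index(reference, masks):
    """Map (mask id, seed key) to every position in `reference` that yields it."""
    index = defaultdict(list)
    for mask_id, mask in enumerate(masks):
        for pos in range(len(reference) - len(mask) + 1):
            index[(mask_id, seed_key(reference, pos, mask))].append(pos)
    return index


def find_seed_matches(index, query, masks):
    """Scan `query` and return sorted (reference pos, query pos, mask id) hits."""
    hits = set()
    for mask_id, mask in enumerate(masks):
        for qpos in range(len(query) - len(mask) + 1):
            key = (mask_id, seed_key(query, qpos, mask))
            # .get avoids inserting empty entries into the defaultdict
            for rpos in index.get(key, ()):
                hits.add((rpos, qpos, mask_id))
    return sorted(hits)


if __name__ == "__main__":
    reference = "ACGTACGTTT"
    query = "ACCTAC"  # differs from "ACGTAC" at the third base
    spaced = ["1101"]  # hypothetical seed: the third position is a gap
    print(find_seed_matches(build_index(reference, spaced), query, spaced))
```

In this toy input the query differs from the reference at one base, so a contiguous seed of the same span (`1111`) finds no match, while the spaced seed `1101` skips the mismatching position and reports a hit. Each hit would then be a starting point for the extension step mentioned in the abstract.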
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
Banyassady_Sarah_201509_MSc.pdf | Main article | 2.55 MB | Adobe PDF |
Items in MacSphere are protected by copyright, with all rights reserved, unless otherwise indicated.