Please use this identifier to cite or link to this item:
http://hdl.handle.net/11375/27555
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Smyth, William F. | - |
dc.contributor.advisor | Mhaskar, Neerja | - |
dc.contributor.author | Koponen, Holly | - |
dc.date.accessioned | 2022-05-12T20:20:13Z | - |
dc.date.available | 2022-05-12T20:20:13Z | - |
dc.date.issued | 2022 | - |
dc.identifier.uri | http://hdl.handle.net/11375/27555 | - |
dc.description.abstract | This thesis describes the development and application of the new software MAXCOVER that computes maximal covers and non-extendible repeats (a.k.a. “maximal repeats”). A string is a finite array x[1..n] of elements chosen from a set of totally ordered symbols called an alphabet. A repeat is a substring that occurs at least twice in x. A repeat is left/right extendible if every occurrence is preceded/followed by the same symbol; otherwise, it is non-left/non-right extendible (NLE/NRE). A non-extendible (NE) repeat is both NLE and NRE. A repeat covers a position i if x[i] lies within the repeat. A maximal cover (a.k.a. “optimal cover”) is a repeat that covers the most positions in x. For simplicity, we first describe a quadratic O(n2) implementation of MAXCOVER to compute all maximal covers of a given string based on the pseudocode given in [1]. Then, we consider the logarithmic O(n log n) pseudocode in [1], in which we identify several errors. We leave a complete correction and implementation for future work. Instead, we propose two improved quadratic algorithms that, shown through experiments, will execute in linear time for the average case. We perform a benchmark evaluation of MAXCOVER’s performance and demonstrate its value to biologists in the protein context [2]. To do so, we develop an extension of MAXCOVER for the closely related task of computing NE repeats. Then, we compare MAXCOVER to the repeat-match feature of the well-known MUMmer software [3] (600+ citations). We determine that MAXCOVER is an order-of-magnitude faster than MUMmer with much lower space requirements. We also show that MAXCOVER produces a more compact, exact, and user-friendly output that specifies the repeats. Availability: Open source code, binaries, and test data are available on Github at https://github.com/hollykoponen/MAXCOVER. Currently runs on Linux, untested on other OS. | en_US |
dc.language.iso | en | en_US |
dc.subject | String | en_US |
dc.subject | Maximal Cover | en_US |
dc.subject | Repeats | en_US |
dc.subject | MAXCOVER | en_US |
dc.subject | protein sequence | en_US |
dc.subject | pattern matching | en_US |
dc.subject | NE repeat | en_US |
dc.subject | MUMmer | en_US |
dc.title | Efficient Implementation & Application of Maximal String Covering Algorithm | en_US |
dc.title.alternative | MAXIMAL COVER ALGORITHM IMPLEMENTATION | en_US |
dc.type | Thesis | en_US |
dc.contributor.department | Computing and Software | en_US |
dc.description.degreetype | Thesis | en_US |
dc.description.degree | Master of Science (MSc) | en_US |
dc.description.layabstract | This thesis deals with a simple yet essential data structure called a string, a sequence of symbols drawn from an alphabet. For example, a DNA sequence is a string comprised of four letters. We describe a new software called MAXCOVER that identifies maximal covers of a given string x (a repeating substring that ‘covers’ the most positions in x). This software is based on the algorithms in [1]. We propose two new algorithms that perform faster in practice. We also extended MAXCOVER for the closely related task of computing non-extendible repeats. We compare this extension to the well-known MUMmer software (600+ citations). We find that MAXCOVER is many times faster than MUMmer with much lower space requirements and produces a more compact, exact and user-friendly output. | en_US |
Appears in Collections: | Open Access Dissertations and Theses |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
Koponen_Holly_K_2022April_MSc.pdf | 871.22 kB | Adobe PDF | View/Open |
Items in MacSphere are protected by copyright, with all rights reserved, unless otherwise indicated.