Please use this identifier to cite or link to this item:
http://hdl.handle.net/11375/27555
Title: | Efficient Implementation & Application of Maximal String Covering Algorithm |
Other Titles: | MAXIMAL COVER ALGORITHM IMPLEMENTATION |
Authors: | Koponen, Holly |
Advisor: | Smyth, William F. Mhaskar, Neerja |
Department: | Computing and Software |
Keywords: | String;Maximal Cover;Repeats;MAXCOVER;protein sequence;pattern matching;NE repeat;MUMmer |
Publication Date: | 2022 |
Abstract: | This thesis describes the development and application of the new software MAXCOVER that computes maximal covers and non-extendible repeats (a.k.a. “maximal repeats”). A string is a finite array x[1..n] of elements chosen from a set of totally ordered symbols called an alphabet. A repeat is a substring that occurs at least twice in x. A repeat is left/right extendible if every occurrence is preceded/followed by the same symbol; otherwise, it is non-left/non-right extendible (NLE/NRE). A non-extendible (NE) repeat is both NLE and NRE. A repeat covers a position i if x[i] lies within the repeat. A maximal cover (a.k.a. “optimal cover”) is a repeat that covers the most positions in x. For simplicity, we first describe a quadratic O(n2) implementation of MAXCOVER to compute all maximal covers of a given string based on the pseudocode given in [1]. Then, we consider the logarithmic O(n log n) pseudocode in [1], in which we identify several errors. We leave a complete correction and implementation for future work. Instead, we propose two improved quadratic algorithms that, shown through experiments, will execute in linear time for the average case. We perform a benchmark evaluation of MAXCOVER’s performance and demonstrate its value to biologists in the protein context [2]. To do so, we develop an extension of MAXCOVER for the closely related task of computing NE repeats. Then, we compare MAXCOVER to the repeat-match feature of the well-known MUMmer software [3] (600+ citations). We determine that MAXCOVER is an order-of-magnitude faster than MUMmer with much lower space requirements. We also show that MAXCOVER produces a more compact, exact, and user-friendly output that specifies the repeats. Availability: Open source code, binaries, and test data are available on Github at https://github.com/hollykoponen/MAXCOVER. Currently runs on Linux, untested on other OS. |
URI: | http://hdl.handle.net/11375/27555 |
Appears in Collections: | Open Access Dissertations and Theses |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
Koponen_Holly_K_2022April_MSc.pdf | 871.22 kB | Adobe PDF | View/Open |
Items in MacSphere are protected by copyright, with all rights reserved, unless otherwise indicated.