Welcome to the upgraded MacSphere! We're putting the finishing touches on it; if you notice anything amiss, email macsphere@mcmaster.ca

Efficient Implementation & Application of Maximal String Covering Algorithm

dc.contributor.advisorSmyth, William F.
dc.contributor.advisorMhaskar, Neerja
dc.contributor.authorKoponen, Holly
dc.contributor.departmentComputing and Softwareen_US
dc.date.accessioned2022-05-12T20:20:13Z
dc.date.available2022-05-12T20:20:13Z
dc.date.issued2022
dc.description.abstractThis thesis describes the development and application of the new software MAXCOVER that computes maximal covers and non-extendible repeats (a.k.a. “maximal repeats”). A string is a finite array x[1..n] of elements chosen from a set of totally ordered symbols called an alphabet. A repeat is a substring that occurs at least twice in x. A repeat is left/right extendible if every occurrence is preceded/followed by the same symbol; otherwise, it is non-left/non-right extendible (NLE/NRE). A non-extendible (NE) repeat is both NLE and NRE. A repeat covers a position i if x[i] lies within the repeat. A maximal cover (a.k.a. “optimal cover”) is a repeat that covers the most positions in x. For simplicity, we first describe a quadratic O(n2) implementation of MAXCOVER to compute all maximal covers of a given string based on the pseudocode given in [1]. Then, we consider the logarithmic O(n log n) pseudocode in [1], in which we identify several errors. We leave a complete correction and implementation for future work. Instead, we propose two improved quadratic algorithms that, shown through experiments, will execute in linear time for the average case. We perform a benchmark evaluation of MAXCOVER’s performance and demonstrate its value to biologists in the protein context [2]. To do so, we develop an extension of MAXCOVER for the closely related task of computing NE repeats. Then, we compare MAXCOVER to the repeat-match feature of the well-known MUMmer software [3] (600+ citations). We determine that MAXCOVER is an order-of-magnitude faster than MUMmer with much lower space requirements. We also show that MAXCOVER produces a more compact, exact, and user-friendly output that specifies the repeats. Availability: Open source code, binaries, and test data are available on Github at https://github.com/hollykoponen/MAXCOVER. Currently runs on Linux, untested on other OS.en_US
dc.description.degreeMaster of Science (MSc)en_US
dc.description.degreetypeThesisen_US
dc.description.layabstractThis thesis deals with a simple yet essential data structure called a string, a sequence of symbols drawn from an alphabet. For example, a DNA sequence is a string comprised of four letters. We describe a new software called MAXCOVER that identifies maximal covers of a given string x (a repeating substring that ‘covers’ the most positions in x). This software is based on the algorithms in [1]. We propose two new algorithms that perform faster in practice. We also extended MAXCOVER for the closely related task of computing non-extendible repeats. We compare this extension to the well-known MUMmer software (600+ citations). We find that MAXCOVER is many times faster than MUMmer with much lower space requirements and produces a more compact, exact and user-friendly output.en_US
dc.identifier.urihttp://hdl.handle.net/11375/27555
dc.language.isoenen_US
dc.subjectStringen_US
dc.subjectMaximal Coveren_US
dc.subjectRepeatsen_US
dc.subjectMAXCOVERen_US
dc.subjectprotein sequenceen_US
dc.subjectpattern matchingen_US
dc.subjectNE repeaten_US
dc.subjectMUMmeren_US
dc.titleEfficient Implementation & Application of Maximal String Covering Algorithmen_US
dc.title.alternativeMAXIMAL COVER ALGORITHM IMPLEMENTATIONen_US
dc.typeThesisen_US

Files

Original bundle

Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
Koponen_Holly_K_2022April_MSc.pdf
Size:
871.22 KB
Format:
Adobe Portable Document Format

License bundle

Now showing 1 - 1 of 1
Loading...
Thumbnail Image
Name:
license.txt
Size:
1.68 KB
Format:
Item-specific license agreed upon to submission
Description: