Contributions to Sparse Statistical Methods for Data Integration

Bonner, Ashley

Please use this identifier to cite or link to this item: http://hdl.handle.net/11375/24009

Title:	Contributions to Sparse Statistical Methods for Data Integration
Authors:	Bonner, Ashley
Advisor:	Beyene, Joseph; Hamid, Jemila; Canty, Angelo
Department:	Health Research Methodology
Keywords:	biostatistics;statistics;genetics;genomics;sparse methods;data integration
Publication Date:	2018
Abstract:	Background: Scientists are collecting massive, complex, and diverse data from multiple sources in the hope of better understanding the principles underpinning complex phenomena. Sophisticated statistical and computational methods that reduce data complexity, harness variability, and integrate multiple sources of information are required. The ‘sparse’ class of multivariate statistical methods is emerging as a promising solution to these data-driven challenges, but it still lacks application, testing, and development. Methods: In this thesis, efforts are three-fold. First, sparse principal component analysis (sparse PCA) and sparse canonical correlation analysis (sparse CCA) are applied to a large toxicogenomic database to uncover candidate genes associated with drug toxicity. Second, extensive simulations are conducted to test and compare the performance of many sparse CCA methods, determining which methods are most accurate under a variety of realistic, large-data scenarios. Finally, the performance of the non-parametric bootstrap is examined to determine its ability to generate inferential measures for sparse CCA. Results: Through applications, several groups of candidate genes are obtained to point researchers towards promising genetic profiles of drug toxicity. Simulations expose one sparse CCA method that outperforms the rest in the majority of data scenarios, while suggesting the use of a combination of complementary sparse CCA methods for specific data conditions. The bootstrap simulations show that the bootstrap is a suitable means of inference for the canonical correlation coefficient in sparse CCA, but only when the sample size approaches the number of variables. It is also shown that aggregating sparse CCA results from many bootstrap samples can improve the accuracy of detecting truly cross-correlated features. Conclusions: Sparse multivariate methods can flexibly handle challenging integrative analysis tasks. Work in this thesis has demonstrated their much-needed utility in the field of toxicogenomics and strengthened our knowledge of how they perform within a complex, massive data framework, while promoting the use of bootstrapped inferential measures.
URI:	http://hdl.handle.net/11375/24009
Appears in Collections:	Open Access Dissertations and Theses

Files in This Item:

File	Description	Size	Format
Bonner_Ashley_J_201812_PhD.pdf Open Access		3.51 MB	Adobe PDF
