Bayesian methods for identifying non-protein coding genomic regions contributing to diseases

Algama Appuhamilage Dona, Manjula Dilhani

doi:10.4225/03/58b79d0591a09

thesis

posted on 2017-03-02, 04:18 authored by Algama Appuhamilage Dona, Manjula Dilhani

Identifying and discerning the function of non-coding RNAs (ncRNAs) is an important goal of genetic research. Much evidence suggests that ncRNAs play an important role in the aetiology of many complex genetic diseases. Therefore, the task of developing methods to identify these elements in genomes has become increasingly urgent. In this research my focus was to use a Bayesian approach to identify putative functional non-coding genomic sequences contributing to various diseases. The analysis was mainly carried out using a Bayesian segmentation model, implemented in the software package changept, designed to segment discrete genomic data. In the first phase of the research, I developed methods to expand the capabilities of changept. One simple but powerful innovation was to develop several ways of encoding an alignment of sequences using a D-character representation (where D is a positive integer). This enables sequence alignments to be segmented based on multiple data types, specifically conservation, GC content and transition/transversion ratio, and significantly generalizes the capacity of changept, which previously could segment on the basis of only one of these characteristics at a time. Incorporating multiple data types helped greatly to identify complex segmentation patterns and functional signatures among species, especially between closely related species. A second methodological innovation was a new model selection procedure to decide the optimal model for the data. The third, and most important, methodological innovation was to build a process for systematically discovering genome-wide putative ncRNAs, including data selection, cleaning, encoding, analysis and post-processing. To validate these findings, both experimental methods and currently available bioinformatics resources were used.
In the second phase of the research, my focus turned to applying changept, together with the new methods developed, to identify genome-wide putative non-coding elements that may be associated with diseases. I was able to discover more than a thousand highly conserved non-coding sequences in human, mouse and zebrafish genomes. A complementary analysis focused on a set of genes involved in muscle development. Some of the elements identified may contribute to muscle diseases. Discovery of putative small ncRNAs in the bacterium Wolbachia pipientis is another successful application of the new methods; this work was undertaken as part of the eradicate dengue project. Application to malaria genomes revealed genetic mechanisms important in infecting multiple hosts. I also identified putative regulatory sequences in 3' UTRs in three closely related Drosophila species. Although this work focussed on Drosophila rather than human diseases, mutations in 3' UTRs have been shown to play a crucial role in human health and diseases.
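
The D-character encoding mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the encodings actually used with changept, so everything below is an assumption made for illustration: the function names, the choice of D = 5, and the symbol assignments are hypothetical. The sketch maps each column of a pairwise alignment to one symbol that combines conservation, GC content and the transition/transversion distinction, producing a discrete sequence of the kind a segmentation model could take as input.

```python
# Hypothetical 5-character encoding of a pairwise alignment (an assumption,
# not the encoding defined in the thesis or used by changept).
#   a = conserved column, base is G or C
#   b = conserved column, base is A or T
#   c = mismatch that is a transition (purine<->purine, pyrimidine<->pyrimidine)
#   d = mismatch that is a transversion (purine<->pyrimidine)
#   e = gap or ambiguous base in either sequence

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def encode_column(ref: str, other: str) -> str:
    """Map one alignment column to a single symbol."""
    ref, other = ref.upper(), other.upper()
    if ref not in BASES or other not in BASES:
        return "e"  # gap ('-') or ambiguity code such as 'N'
    if ref == other:
        return "a" if ref in {"G", "C"} else "b"
    pair = {ref, other}
    is_transition = pair <= PURINES or pair <= PYRIMIDINES
    return "c" if is_transition else "d"


def encode_alignment(ref_seq: str, other_seq: str) -> str:
    """Encode two aligned sequences of equal length, column by column."""
    if len(ref_seq) != len(other_seq):
        raise ValueError("aligned sequences must have the same length")
    return "".join(encode_column(r, o) for r, o in zip(ref_seq, other_seq))


if __name__ == "__main__":
    print(encode_alignment("GATTC-A", "GACTAGA"))  # prints "abcbdeb"
```

Because each column carries several properties in a single symbol, a segmentation over this sequence can respond to conservation and base composition at the same time, which is the generalization the abstract describes.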

Campus location

Australia

Principal supervisor

Jonathan Macgregor Keith

Additional supervisor 1

Robert Bryson-Richardson

Year of Award

2016

Department, School or Centre

Mathematics

Degree Type

DOCTORATE

Faculty

Faculty of Science

Keywords

Bayesian methods; Non-coding RNAs; thesis(doctorate); Sequence segmentation; monash:165943; Open access; ethesis-20160202-114333; 2016; 1959.1/1240873

Licence

In Copyright

