Bayesian methods for identifying non-protein coding genomic regions contributing to diseases
2017-03-02T04:18:11Z (GMT) by
Identifying and discerning the function of non-coding RNAs (ncRNAs) is an important goal of genetic research. Much evidence suggests that ncRNAs play an important role in the aetiology of many complex genetic diseases. The task of developing methods to identify these elements in genomes has therefore become increasingly urgent. In this research, my focus was to use a Bayesian approach to identify putative functional non-coding genomic sequences contributing to various diseases. The analysis was mainly carried out using a Bayesian segmentation model, implemented in the software package changept, designed to segment discrete genomic data. In the first phase of the research, I developed methods to expand the capabilities of changept. One simple but powerful innovation was to develop several ways of encoding an alignment of sequences using a D-character representation (where D is a positive integer). This enables sequence alignments to be segmented based on multiple data types at once, specifically conservation, GC content and transition/transversion ratio, and significantly generalizes the capacity of changept, which previously could only segment on the basis of one of these characteristics at a time. Incorporating multiple data types greatly helped to identify complex segmentation patterns and functional signatures among species, especially between closely related species. A second methodological innovation was a new model selection procedure to decide the optimal model for the data. A third, and most important, methodological innovation was to build a process for systematically discovering genome-wide putative ncRNAs, including data selection, cleaning, encoding, analysis and post-processing. To validate these findings, both experimental methods and currently available bioinformatics resources were used.
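To make the idea of a D-character representation concrete, the following is a minimal sketch of one possible encoding of a pairwise alignment, here with D = 5. It is an illustration only and is not the encoding used in the thesis or the input format of changept: the class names, the letters assigned to them, and the functions `classify_column` and `encode_alignment` are all hypothetical. Each alignment column is mapped to a single character that jointly records conservation (match or mismatch), GC content (for matches) and substitution type (transition or transversion).

```python
# Hypothetical 5-character encoding of a pairwise alignment (assumption:
# this is NOT the scheme used by changept, only an illustration).
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES

# Assumed mapping from column class to output character (D = 5).
ALPHABET = {
    "match_AT": "a",      # conserved column, A or T
    "match_GC": "b",      # conserved column, G or C
    "transition": "c",    # purine<->purine or pyrimidine<->pyrimidine
    "transversion": "d",  # purine<->pyrimidine
    "gap": "e",           # gap or ambiguous base in either sequence
}


def classify_column(x: str, y: str) -> str:
    """Classify one alignment column given the two aligned characters."""
    x, y = x.upper(), y.upper()
    if x not in BASES or y not in BASES:
        return "gap"
    if x == y:
        return "match_GC" if x in {"G", "C"} else "match_AT"
    both_purines = x in PURINES and y in PURINES
    both_pyrimidines = x in PYRIMIDINES and y in PYRIMIDINES
    return "transition" if (both_purines or both_pyrimidines) else "transversion"


def encode_alignment(seq1: str, seq2: str) -> str:
    """Encode two aligned sequences of equal length as one D-character string."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    return "".join(ALPHABET[classify_column(x, y)] for x, y in zip(seq1, seq2))


if __name__ == "__main__":
    print(encode_alignment("ACGA-A", "ATGCAG"))
```

A segmentation model applied to such a string could then, in principle, separate regions that differ in any combination of the three properties, rather than in one of them alone.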
In the second phase of the research, my focus turned to applying changept, and the new methods developed, to identify genome-wide putative non-coding elements that may be associated with diseases. I was able to discover more than a thousand highly conserved non-coding sequences in the human, mouse and zebrafish genomes. A complementary analysis focused on a set of genes involved in muscle development; some of the elements identified may contribute to muscle diseases. The discovery of putative small ncRNAs in the bacterium Wolbachia pipientis is another successful application of the new methods; this work was undertaken as part of the eradicate dengue project. Application to malaria genomes revealed genetic mechanisms important in infecting multiple hosts. I also identified putative regulatory sequences in the 3' UTRs of three closely related Drosophila species. Although this work focussed on Drosophila rather than human diseases, mutations in 3' UTRs have been shown to play a crucial role in human health and disease.