Using signal processing, evolutionary computation, and machine learning to identify transposable elements in genomes

Ashlock, Wendy Cole

dc.contributor.advisor	Datta, Suprakash	en_US
dc.creator	Ashlock, Wendy Cole
dc.date.accessioned	2016-06-23T17:45:58Z
dc.date.available	2016-06-23T17:45:58Z
dc.date.copyright	2013-08	en_US
dc.date.issued	2016-06-23
dc.degree.discipline	Computer Science and Engineering	en_US
dc.degree.level	Doctoral	en_US
dc.degree.name	PhD - Doctor of Philosophy	en_US
dc.description.abstract	About half of the human genome consists of transposable elements (TE's), sequences that have many copies of themselves distributed throughout the genome. All genomes, from bacterial to human, contain TE's. TE's affect genome function either by creating proteins directly or by affecting genome regulation. They also serve as molecular fossils, giving clues to the evolutionary history of the organism. TE's are often challenging to identify because they are fragmentary or heavily mutated. In this thesis, novel features for the detection and study of TE's are developed. These features are of two types. The first type consists of statistical features, based on the Fourier transform, that assess reading frame use. These features measure how much the reading frame use differs from that of a random sequence, which reading frames the sequence uses, and the proportion of use of each active reading frame. The second type, called side effect machine (SEM) features, is generated by finite state machines augmented with counters that track the number of times each state is visited. These counters then become features of the sequence. The number of possible SEM features is super-exponential in the number of states. New methods for selecting useful feature subsets, which incorporate a genetic algorithm and a novel clustering method, are introduced. The features produced reveal structural characteristics of the sequences that are of potential interest to biologists. A detailed analysis of the genetic algorithm, its fitness functions, and its fitness landscapes is performed. The features are used, together with features from existing exon-finding algorithms, to build classifiers that distinguish TE's from other genomic sequences in humans, fruit flies, and ciliates. The classifiers achieve high accuracy (> 85%) on a variety of TE classification problems and are used to scan large genomes for TE's.
In addition, the features are used to describe the TE's in the newly sequenced ciliate Tetrahymena thermophila, giving biologists information for forming experimentally testable hypotheses about the role of these TE's and the mechanisms that govern them.	en_US
dc.identifier.uri	http://hdl.handle.net/10315/31432
dc.rights	Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests.	en_US
dc.subject.keywords	Human genome	en_US
dc.subject.keywords	Transposable elements	en_US
dc.subject.keywords	Fourier transform	en_US
dc.subject.keywords	Side effect machine	en_US
dc.subject.keywords	Genetic algorithm	en_US
dc.title	Using signal processing, evolutionary computation, and machine learning to identify transposable elements in genomes	en_US
dc.type	Electronic Thesis or Dissertation	en_US
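
Illustrative note: the two feature types named in the abstract can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the thesis's implementation. The first function is a generic period-3 spectral measure of the kind commonly used to detect codon structure, standing in for the Fourier-based reading frame features. The second counts state visits of a finite state machine, as the abstract describes for SEM features. The function names, the choice to count a state when a transition enters it, and the two-state machine GC_MACHINE are all hypothetical and chosen only for illustration.

```python
import cmath


def period3_power(seq):
    """Length-normalised spectral power at period 3, summed over A, C, G, T.

    Each base gets a 0/1 indicator sequence. Its Fourier coefficient at
    frequency 1/3 is the sum of exp(-2*pi*i*k/3) over the positions k that
    hold that base. (Generic measure; an assumption, not the thesis's own.)
    """
    seq = seq.upper()
    if not seq:
        return 0.0
    # The three cube roots of unity, one per codon position (k mod 3).
    roots = [cmath.exp(-2j * cmath.pi * r / 3) for r in range(3)]
    total = 0.0
    for base in "ACGT":
        coeff = sum(roots[k % 3] for k, ch in enumerate(seq) if ch == base)
        total += abs(coeff) ** 2
    return total / len(seq)


def sem_features(seq, transitions, start=0):
    """Visit counts of a finite state machine driven by the sequence.

    transitions[state][base] gives the next state. A state's counter is
    incremented each time a transition enters it (a choice made here).
    """
    counts = [0] * len(transitions)
    state = start
    for ch in seq.upper():
        if ch not in "ACGT":
            continue  # skip ambiguous symbols such as N
        state = transitions[state][ch]
        counts[state] += 1
    return counts


# Hypothetical two-state machine: state 1 after G or C, state 0 after A or T.
GC_MACHINE = [
    {"A": 0, "T": 0, "G": 1, "C": 1},
    {"A": 0, "T": 0, "G": 1, "C": 1},
]
```

With these definitions, sem_features("ACGTGG", GC_MACHINE) returns [2, 4], which for this toy machine is just the A/T and G/C counts. A perfectly periodic sequence such as "ATG" repeated ten times gives a period3_power of 10 (up to floating-point rounding), while a homopolymer gives a value near zero. Machines with more states and less regular transitions yield counters that capture richer structure.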

Files

Original bundle

Name: Ashlock_Wendy_C_2013_PhD.pdf
Size: 9.72 MB
Format: Adobe Portable Document Format

License bundle

Name: YorkU_ETDlicense.txt
Size: 3.38 KB
Format: Plain Text

Name: license.txt
Size: 1.83 KB
Format: Plain Text

Collections

Computer Science and Engineering