Using signal processing, evolutionary computation, and machine learning to identify transposable elements in genomes

dc.contributor.advisor: Datta, Suprakash (en_US)
dc.creator: Ashlock, Wendy Cole
dc.date.accessioned: 2016-06-23T17:45:58Z
dc.date.available: 2016-06-23T17:45:58Z
dc.date.copyright: 2013-08 (en_US)
dc.date.issued: 2016-06-23
dc.degree.discipline: Computer Science and Engineering (en_US)
dc.degree.level: Doctoral (en_US)
dc.degree.name: PhD - Doctor of Philosophy (en_US)
dc.description.abstract: About half of the human genome consists of transposable elements (TEs), sequences that have many copies of themselves distributed throughout the genome. All genomes, from bacterial to human, contain TEs. TEs affect genome function either by creating proteins directly or by affecting genome regulation. They serve as molecular fossils, giving clues to the evolutionary history of the organism. TEs are often challenging to identify because they are fragmentary or heavily mutated. In this thesis, novel features for the detection and study of TEs are developed. These features are of two types. The first type consists of statistical features, based on the Fourier transform, that assess reading frame use. These features measure how different the reading frame use is from that of a random sequence, which reading frames the sequence is using, and the proportion of use of the active reading frames. The second type, called side effect machine (SEM) features, is generated by finite state machines augmented with counters that track the number of times each state is visited. These counters then become features of the sequence. The number of possible SEM features is super-exponential in the number of states. New methods for selecting useful feature subsets that incorporate a genetic algorithm and a novel clustering method are introduced. The features produced reveal structural characteristics of the sequences of potential interest to biologists. A detailed analysis of the genetic algorithm, its fitness functions, and its fitness landscapes is performed. The features are used, together with features used in existing exon-finding algorithms, to build classifiers that distinguish TEs from other genomic sequences in humans, fruit flies, and ciliates. The classifiers achieve high accuracy (> 85%) on a variety of TE classification problems. The classifiers are used to scan large genomes for TEs.
In addition, the features are used to describe the TEs in the newly sequenced ciliate Tetrahymena thermophila, giving biologists information they can use to form experimentally testable hypotheses about the role of these TEs and the mechanisms that govern them. (en_US)
dc.identifier.uri: http://hdl.handle.net/10315/31432
dc.rights: Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests. (en_US)
dc.subject.keywords: Human genome (en_US)
dc.subject.keywords: Transposable elements (en_US)
dc.subject.keywords: Fourier transform (en_US)
dc.subject.keywords: Side effect machine (en_US)
dc.subject.keywords: Genetic algorithm (en_US)
dc.title: Using signal processing, evolutionary computation, and machine learning to identify transposable elements in genomes (en_US)
dc.type: Electronic Thesis or Dissertation (en_US)
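
The side effect machine (SEM) features described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the three-state transition table, the function name `sem_features`, the choice to skip ambiguous bases, the choice not to count the start state as a visit, and the optional normalization are all assumptions made for illustration. The sketch only shows the general idea of running a finite state machine over a DNA sequence and reporting per-state visit counts as a feature vector.

```python
from typing import Dict, List

# Hypothetical 3-state machine over the DNA alphabet.
# TRANSITIONS[state][base] gives the next state. In the thesis such
# tables are searched for by a genetic algorithm; this one is arbitrary.
TRANSITIONS: List[Dict[str, int]] = [
    {"A": 1, "C": 0, "G": 2, "T": 0},
    {"A": 1, "C": 2, "G": 0, "T": 1},
    {"A": 0, "C": 2, "G": 2, "T": 1},
]


def sem_features(
    seq: str,
    transitions: List[Dict[str, int]] = TRANSITIONS,
    start: int = 0,
    normalize: bool = True,
) -> List[float]:
    """Run the machine over seq and return one visit counter per state."""
    counts = [0] * len(transitions)
    state = start
    for base in seq.upper():
        if base not in transitions[state]:
            continue  # assumption: skip ambiguous bases such as N
        state = transitions[state][base]
        counts[state] += 1  # the "side effect": count each state entered
    total = sum(counts)
    if normalize and total:
        # Dividing by the number of transitions makes sequences of
        # different lengths comparable (an assumption, not from the source).
        return [c / total for c in counts]
    return [float(c) for c in counts]


if __name__ == "__main__":
    print(sem_features("ACGTTGCANACGT"))
```

Each sequence maps to a vector with one entry per state, which could then be passed to any standard classifier alongside other features.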

Files

Original bundle
Name: Ashlock_Wendy_C_2013_PhD.pdf
Size: 9.72 MB
Format: Adobe Portable Document Format
License bundle
Name: YorkU_ETDlicense.txt
Size: 3.38 KB
Format: Plain Text

Name: license.txt
Size: 1.83 KB
Format: Plain Text