Computer Science and Engineering
Browsing Computer Science and Engineering by Issue Date
Now showing 1 - 20 of 110
Item Open Access
ACT-R Based Models For Learning Interactive Layouts (2015-01-26)
Das, Arindam; Stuerzlinger, Wolfgang
This dissertation presents research on learning of interactive layouts. I develop two models based on a theory of cognition known as ACT-R (Adaptive Control of Thought–Rational). I validate them against experimental data collected by other researchers. The first model is a simulation model that emulates the transition from novice to expert level in text entry. The model transcribes the presented English letters on a traditional phone keypad. It predicts the non-movement time to copy a pre-cued letter. It explains the visual exploration strategy that a user may employ along the novice-to-expert continuum. The second model is a closed-form model that accounts for the combined effect of practice, decay, proactive interference, and mental effort on task completion time while practicing target acquisition on an interactive layout. The model can quantitatively compare a set of layouts in terms of the mental effort expended to learn them. My first model provides insight into how much practice is needed by a learner to progress from novice to expert level for an interactive layout. My second model provides insight into how effortful it is to learn a layout relative to other layouts.

Item Open Access
Predicting and Reducing the Impact of Errors in Character-Based Text Entry (2015-01-26)
Arif, Ahmed Sabbir; Stuerzlinger, Wolfgang
This dissertation focuses on the effect of errors in character-based text entry techniques. The effect of errors is targeted from theoretical, behavioral, and practical standpoints. This document starts with a review of the existing literature. It then presents results of a user study that investigated the effect of different error correction conditions on popular text entry performance metrics. Results showed that the way errors are handled has a significant effect on all frequently used error metrics. The outcomes also provided an understanding of how users notice and correct errors. Building on this, the dissertation then presents a new high-level and method-agnostic model for predicting the cost of error correction with a given text entry technique. Unlike the existing models, it accounts for both human and system factors and is general enough to be used with most character-based techniques. A user study verified the model through measuring the effects of a faulty keyboard on text entry performance. Subsequently, the work explores the potential user adaptation to a gesture recognizer's misrecognitions in two user studies. Results revealed that users gradually adapt to misrecognition errors by replacing the erroneous gestures with alternative ones, if available. Also, users adapt to a frequently misrecognized gesture faster if it occurs more frequently than the other error-prone gestures. Finally, this work presents a new hybrid approach to simulate pressure detection on standard touchscreens. The new approach combines the existing touch-point- and time-based methods. Results of two user studies showed that it can simulate pressure detection more reliably for at least two pressure levels: regular (~1 N) and extra (~3 N). Then, a new pressure-based text entry technique is presented that does not require tapping outside the virtual keyboard to reject an incorrect or unwanted prediction. Instead, the technique requires users to apply extra pressure for the tap on the next target key. The performance of the new technique was compared with the conventional technique in a user study. Results showed that for inputting short English phrases with 10% non-dictionary words, the new technique increases entry speed by 9% and decreases error rates by 25%. Also, most users (83%) favor the new technique over the conventional one. Together, the research presented in this dissertation gives more insight into how errors affect text entry and also presents improved text entry methods.

Item Open Access
Model Checking of Distributed Multi-Threaded Java Applications (2015-01-26)
Shafiei, Nastaran; Breugel, Franck van
In this dissertation, we focus on the verification of distributed Java applications composed of communicating multithreaded processes. We choose model checking as the verification technique. We propose an instance of the so-called centralization approach, which allows for model checking multiple communicating processes. The main challenge of applying centralization is keeping data separated between different processes. In our approach, this issue is addressed through a new class-loading model. As one of our contributions, we implement our approach within an existing model checker, Java PathFinder (JPF). To account for interactions between processes, our approach provides the model checker with a model of interprocess communication. Moreover, our model allows for systematically exploring potential exceptional control flows caused by network failures. We also apply a partial order reduction (POR) algorithm to reduce the state space of distributed applications, and we prove that our POR algorithm preserves deadlocks. Furthermore, we propose an automatic approach to capture interactions between the system being verified and external resources, such as cloud computing services. The dissertation also discusses how our approach is superior to existing approaches. Our approach exhibits better performance, which is mainly due to the POR technique. Furthermore, our approach allows for verifying a considerably larger class of applications without the need for any manual modeling, and it has been successfully used to detect bugs that cannot be found using previous work.

Item Open Access
Optimizing Simulated Crowd Behaviour (2015-01-26)
Berseth, Glen Paul; Faloutsos, Petros
In the context of crowd simulation, there is a diverse set of algorithms that model steering, the ability of an agent to navigate between spatial locations while avoiding static and dynamic obstacles. The performance of steering approaches, both in terms of quality of results and computational efficiency, depends on internal parameters that are manually tuned to satisfy application-specific requirements. This work investigates the effect that these parameters have on an algorithm's performance. Using three representative steering algorithms and a set of established performance criteria, we perform a number of large-scale optimization experiments that optimize an algorithm's parameters for a range of objectives. For example, our method automatically finds optimal parameters to minimize turbulence at bottlenecks, reduce building evacuation times, produce emergent patterns, and increase the computational efficiency of an algorithm. Our study includes a statistical analysis of the correlations between algorithmic parameters and performance criteria. We also propose using the Pareto Optimal Front as an efficient way of modelling optimal relationships between multiple objectives, and demonstrate its effectiveness by estimating optimal parameters for interactively defined combinations of the associated objectives. The proposed methodologies are general and can be applied to any steering algorithm using any set of performance criteria.
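Since the abstract above leans on the notion of a Pareto Optimal Front, here is a minimal sketch of the underlying dominance filter. It is not the dissertation's implementation: the Candidate record, the two example objectives (evacuation time and turbulence, both minimized), and the sample values are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of extracting a Pareto Optimal Front from evaluated
 *  parameter settings. Objective names and values are hypothetical. */
public class ParetoFrontSketch {

    /** One evaluated parameter setting with two objectives to minimize. */
    record Candidate(String params, double evacuationTime, double turbulence) {}

    /** a dominates b if a is no worse on every objective and strictly better on at least one. */
    static boolean dominates(Candidate a, Candidate b) {
        boolean noWorse = a.evacuationTime() <= b.evacuationTime()
                       && a.turbulence() <= b.turbulence();
        boolean strictlyBetter = a.evacuationTime() < b.evacuationTime()
                              || a.turbulence() < b.turbulence();
        return noWorse && strictlyBetter;
    }

    /** Keep only candidates that no other candidate dominates. */
    static List<Candidate> paretoFront(List<Candidate> all) {
        List<Candidate> front = new ArrayList<>();
        for (Candidate c : all) {
            boolean dominated = false;
            for (Candidate other : all) {
                if (other != c && dominates(other, c)) { dominated = true; break; }
            }
            if (!dominated) front.add(c);
        }
        return front;
    }

    public static void main(String[] args) {
        List<Candidate> evaluated = List.of(
            new Candidate("A", 42.0, 0.30),
            new Candidate("B", 55.0, 0.10),
            new Candidate("C", 60.0, 0.35)); // C is dominated by both A and B
        System.out.println(paretoFront(evaluated)); // prints candidates A and B
    }
}
```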
Item Open Access
In-Air Un-Instrumented Pointing Performance (2015-01-26)
Brown, Michelle Andree; Stuerzlinger, Wolfgang
I present an analysis of in-air un-instrumented pointing and selection. I look at the performance of these systems and how this performance can be improved, with the eventual goal that their throughput reaches that of the mouse. Many potential limiting factors were explored, such as latency, selection reliability, and elbow stabilization. I found that un-instrumented in-air pointing as currently implemented performed significantly worse, at less than 75% of mouse throughput. Yet, my research shows that this value can potentially reach mouse-like levels with lower system latencies, user training, and potentially improved finger tracking. Even without these improvements, the large range of applications for un-instrumented 3D hand tracking still makes this technology an attractive option for user interfaces.
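The throughput figure quoted above is conventionally the Fitts' law throughput used in ISO 9241-9 style pointing evaluations. The sketch below shows one common way to compute it, assuming the standard effective-width formulation; the amplitude, endpoint spread, and movement-time numbers are invented, not data from the thesis.

```java
/** Minimal sketch of Fitts' law throughput (bits/s), as commonly computed in
 *  ISO 9241-9 style pointing evaluations. Sample numbers are invented. */
public class ThroughputSketch {

    /**
     * @param amplitude       mean movement amplitude (e.g., mm)
     * @param sdEndpoints     standard deviation of selection endpoints along the task axis
     * @param movementTimeSec mean movement time in seconds
     */
    static double throughput(double amplitude, double sdEndpoints, double movementTimeSec) {
        double effectiveWidth = 4.133 * sdEndpoints;                                      // We
        double effectiveId = Math.log(amplitude / effectiveWidth + 1.0) / Math.log(2.0);  // IDe, in bits
        return effectiveId / movementTimeSec;                                             // bits per second
    }

    public static void main(String[] args) {
        double mouse = throughput(200.0, 8.0, 0.80);   // hypothetical mouse condition
        double inAir = throughput(200.0, 11.0, 0.95);  // hypothetical in-air condition
        System.out.printf("mouse %.2f bit/s, in-air %.2f bit/s (%.0f%% of mouse)%n",
                          mouse, inAir, 100.0 * inAir / mouse);
    }
}
```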
Item Open Access
Planning Practical Paths in High-Dimensional Space (2015-08-28)
Yang, Jing; Jenkin, Michael R.
Unlike traditional manipulator robots, which tend to have a small number of degrees of freedom (DOF), tentacle robots utilize redundant DOFs in order to enhance their ability to deal with complex environments and tasks. However, this also makes the planning and control of such devices extremely difficult. One of the fundamental tasks robots have to perform is planning their motions while avoiding collisions with obstacles in the environment, which is known to be PSPACE-complete in the robot's DOF. As a consequence, heuristic sampling-based approaches have been developed to solve high-dimensional real-world path planning problems. A shortcoming of the current sampling-based algorithms is that they can obtain highly non-optimal solutions, since they rely upon randomization to explore the search space. Although these planners may find a valid solution, the solutions found are often not practical in that they do not take into account soft application-specific constraints. This thesis integrates soft constraints in addition to the basic geometric or hard constraints in the general path planning process for high-DOF robots. The practicality of paths is formulated based on the notion of soft constraints found in the Planning Domain Definition Language 3 (PDDL3). A range of optimization strategies targeted towards user-preferred qualities is developed by integrating soft constraints in the pre-processing (i.e., sampling), planning, and post-processing phases of the sampling-based path planners. An auction-based resource allocation approach coordinates the competing optimization strategies. This approach uses an adaptive bidding strategy for each optimizer, and in each round the optimizer with the best predicted performance is selected. This general coordination system allows for flexibility in both the number and types of the optimizers to be used. Experimental validation with real and simulated tentacle robots demonstrates the effectiveness of the approach.

Item Open Access
Achieving Adaptation Through Live Virtual Machine Migration in Two-Tier Clouds (2015-08-28)
Lu, Hong Bin; Litoiu, Marin
This thesis presents a model-driven approach for application deployment and management in two-tier heterogeneous cloud environments. For application deployment, we introduce the architecture, the services, and the domain-specific language that abstract common features of multi-cloud deployments. By leveraging the architecture and the language, application deployers author a deployment model that captures the high-level structure of the application. The deployment model is then translated into deployment workflows on specific clouds. As a use case, we introduce a live VM migration framework that maintains application quality of service through VM migrations across two-tier clouds. The proposed framework can monitor the performance of the applications and their underlying infrastructure, and plan and execute VM migrations to eliminate hotspots in a datacenter. We evaluate both the application deployment architecture and the live migration framework on public clouds.

Item Open Access
Comparing Measured and Theoretical Target Registration Error of an Optical Tracking System (2015-08-28)
Banivaheb, Niloofar; Ma, Burton
The goal of this thesis is to experimentally measure the accuracy of an optical tracking system used in commercial surgical navigation systems. We measure accuracy by constructing a mechanism that allows a tracked target to move with spherical motion (i.e., there exists a single point on the mechanism, the center of the sphere, that does not change position when the tracked target is moved). We imagine that the center of the sphere is the tip of a surgical tool rigidly attached to the tracked target. The location of the tool tip cannot be measured directly by the tracking system (because it is impossible to attach a tracking marker to the tool tip) and must be calculated using the measured location and orientation of the tracking target. Any measurement error in the tracking system will cause the calculated position of the tool tip to change as the target is moved; the spread of the calculated tool tip positions is a measurement of tracking error called the target registration error (TRE). The observed TRE will be compared to an analytic model of TRE to assess the predictions of the analytic model.
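As a rough illustration of the TRE measurement described above (not the thesis's exact protocol or error model), the following sketch applies each tracked pose to a fixed tool-tip offset and reports the RMS spread of the computed tip positions about their mean. The poses, the 150 mm offset, and the error magnitudes are hypothetical.

```java
/** Rough sketch of a target registration error (TRE) style computation:
 *  apply each measured pose to a fixed tip offset and report the RMS spread
 *  of the resulting tip positions about their mean. Poses are hypothetical. */
public class TreSketch {

    /** tip = R * offset + t, with R a row-major 3x3 rotation and t a translation. */
    static double[] tipPosition(double[][] r, double[] t, double[] offset) {
        double[] tip = new double[3];
        for (int i = 0; i < 3; i++) {
            tip[i] = t[i];
            for (int j = 0; j < 3; j++) tip[i] += r[i][j] * offset[j];
        }
        return tip;
    }

    /** RMS distance of the computed tip positions from their centroid. */
    static double rmsSpread(double[][] tips) {
        double[] mean = new double[3];
        for (double[] p : tips) for (int i = 0; i < 3; i++) mean[i] += p[i] / tips.length;
        double sumSq = 0.0;
        for (double[] p : tips)
            for (int i = 0; i < 3; i++) sumSq += (p[i] - mean[i]) * (p[i] - mean[i]);
        return Math.sqrt(sumSq / tips.length);
    }

    public static void main(String[] args) {
        double[] offset = {0.0, 0.0, 150.0};  // tool tip 150 mm from the marker origin
        double[][] identity = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}};
        double[][] slightYaw = {{0.9998, -0.02, 0}, {0.02, 0.9998, 0}, {0, 0, 1}};
        double[][] tips = {
            tipPosition(identity, new double[]{0, 0, 0}, offset),
            tipPosition(slightYaw, new double[]{0.1, -0.05, 0.02}, offset)  // pose with small measurement error
        };
        System.out.printf("TRE (RMS spread) = %.3f mm%n", rmsSpread(tips));
    }
}
```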
Item Open Access
Optimizing Human Performance in Mobile Text Entry (2015-08-28)
Castellucci, Steven John; MacKenzie, I. Scott
Although text entry on mobile phones is abundant, research strives to achieve desktop typing performance "on the go". But how can researchers evaluate new and existing mobile text entry techniques? How can they ensure that evaluations are conducted in a consistent manner that facilitates comparison? What forms of input are possible on a mobile device? Do the audio and haptic feedback options with most touchscreen keyboards affect performance? What influences users' preference for one feedback or another? Can rearranging the characters and keys of a keyboard improve performance? This dissertation answers these questions and more. The developed TEMA software allows researchers to evaluate mobile text entry methods in an easy, detailed, and consistent manner. Many in academia and industry have adopted it. TEMA was used to evaluate a typical QWERTY keyboard with multiple options for audio and haptic feedback. Though feedback did not have a significant effect on performance, a survey revealed that users' choice of feedback is influenced by social and technical factors. Another study using TEMA showed that novice users entered text faster using a tapping technique than with a gesture or handwriting technique. This motivated rearranging the keys and characters to create a new keyboard, MIME, that would provide better performance for expert users. Data on character frequency and key selection times were gathered and used to design MIME. A longitudinal user study using TEMA revealed an entry speed of 17 wpm and a total error rate of 1.7% for MIME, compared to 23 wpm and 5.2% for QWERTY. Although MIME's entry speed did not surpass QWERTY's during the study, it is projected to do so after twelve hours of practice. MIME's error rate was consistently low and significantly lower than QWERTY's. In addition, participants found MIME more comfortable to use, with some reporting hand soreness after using QWERTY for extended periods.
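The wpm and total error rate figures quoted above follow the text entry metrics commonly computed by evaluation tools such as TEMA. The sketch below shows the usual formulas; the phrase length, timing, and keystroke counts are invented sample values chosen only to land near the figures mentioned, and the thesis's exact counting rules may differ.

```java
/** Minimal sketch of the standard text-entry metrics behind figures such as
 *  "17 wpm" and a "1.7% total error rate". Sample values are invented. */
public class TextEntryMetricsSketch {

    /** Words per minute: (|T| - 1) characters over S seconds, with 5 characters per word. */
    static double wordsPerMinute(int transcribedLength, double seconds) {
        return ((transcribedLength - 1) / seconds) * 60.0 / 5.0;
    }

    /**
     * Total error rate from keystroke classes:
     * c = correct characters, inf = incorrect and not fixed, ifFixed = incorrect but fixed.
     */
    static double totalErrorRate(int c, int inf, int ifFixed) {
        return 100.0 * (inf + ifFixed) / (double) (c + inf + ifFixed);
    }

    public static void main(String[] args) {
        // e.g., a 26-character phrase transcribed in 17.6 seconds -> about 17 wpm
        System.out.printf("entry speed: %.1f wpm%n", wordsPerMinute(26, 17.6));
        // e.g., 115 correct characters, 1 uncorrected error, 1 corrected error -> about 1.7%
        System.out.printf("total error rate: %.1f%%%n", totalErrorRate(115, 1, 1));
    }
}
```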
Item Open Access
Exploring Topological Environments (2015-08-28)
Wang, Hui Victor; Jenkin, Michael
Simultaneous localization and mapping (SLAM) addresses the task of incrementally building a map of the environment with a robot while simultaneously localizing the robot relative to that map. SLAM is generally regarded as one of the most important problems in the pursuit of building truly autonomous mobile robots. This thesis considers the SLAM problem within a topological framework, in which the world and its representation are modelled as a graph. A topological framework provides a useful model within which to explore fundamental limits to exploration and mapping. Given a topological world, it is not, in general, possible to map the world deterministically without resorting to some type of marking aid. Early work demonstrated that a single movable marker was sufficient, but is this necessary? This thesis shows that deterministic mapping is possible if both explicit place and back-link information exist in one vertex. Such 'directional lighthouse' information can be established in a number of ways, including through the addition of a simple directional immovable marker to the environment. This thesis also explores non-deterministic approaches that map the world with less marking information. The algorithms are evaluated through performance analysis and experimental validation. Furthermore, the basic sensing and locomotion assumptions that underlie these algorithms are evaluated using a differential drive robot and an autonomous visual sensor.

Item Open Access
Exploiting Structure Information for Network Dissimilarity Characterization - Application to Disease Network Analysis and Treatment Prediction (2015-08-28)
Wong, Serene Wing Hang; Cercone, Nick; Jurisica, Igor
Most cancers lack effective early disease markers and prognostic and predictive signatures, primarily due to tumor heterogeneity. As a result, we fail to treat heterogeneous cancers because of the multiple ways in which cancer initiates and develops treatment resistance. Models that represent these differences and the underlying molecular mechanisms in cancer enhance the possibility of characterizing, and in turn treating, cancer successfully. We introduce novel graph-based methods for exploiting network structure information in the comparison of graphs, and validate them on non-small cell lung cancer datasets. We generate normal and tumor graphs using normal and tumor samples from gene expression datasets, where vertices are genes and edges connect co-expressed genes. In the first part of this dissertation, we propose a systems approach that aims to revert disease conditions to healthy ones through treatments. To achieve this objective, we propose three novel methods to 1) systematically identify network structure differences between normal and tumor graphs, 2) identify and prioritize drug combinations based on the extracted network structure differences, and 3) computationally estimate the potential of the proposed drug combination to "repair" deregulated subgraphs. Biological validation of the predictions highlights that our systems approach is a promising method for providing treatment options for non-small cell lung cancer through the rewiring of disease networks. In the second part of this dissertation, we introduce the notion of a differential graphlet community to detect deregulated subgraphs between graphs in a way that exploits network structure information. We observed that, between genes that are in differential graphlet communities, shortest path lengths tend to be shorter in tumor graphs than in normal graphs, suggesting that cancer creates shortcuts between biological processes that may not be present in normal conditions. In the third part of this dissertation, we propose a heuristic, the differential correlation graph approach, that identifies areas that differ between the normal and tumor graphs and performs graphlet enumeration on the identified areas. Results showed that our approach achieves an accurate estimate of the difference between normal and tumor states by performing network comparisons in important areas only.
Item Open Access
Automatic Speech Recognition Using Deep Neural Networks: New Possibilities (2015-08-28)
Abdel-Hamid, Ossama Abdel-Hamid Mohamed; Jiang, Hui
Recently, automatic speech recognition (ASR) systems that use deep neural networks (DNNs) for acoustic modeling have attracted huge research interest. This is due to recent results that have significantly raised the state-of-the-art performance of ASR systems. This dissertation proposes a number of new methods to improve state-of-the-art ASR performance by exploiting the power of DNNs. The first method exploits domain knowledge in designing a special neural network (NN) structure called a convolutional neural network (CNN). This dissertation proposes to use the CNN in a way that applies convolution and pooling operations along frequency to handle frequency variations that commonly happen due to speaker and pronunciation differences in speech signals. Moreover, a new CNN structure called limited weight sharing is proposed to better suit the special spectral characteristics of speech signals. Our experimental results have shown that the use of a CNN leads to a 6-9% relative reduction in error rate. The second proposed method deals with speaker variations in a more explicit way through a new speaker-code-based adaptation. This method adapts the speech acoustic model to a new speaker by learning a suitable speaker representation based on a small amount of adaptation data from each target speaker. This method alleviates the need to modify any model parameters, as is done with other commonly used adaptation methods for neural networks. This greatly reduces the number of parameters to estimate during adaptation; hence, it allows rapid speaker adaptation. The third proposed method aims to handle the temporal structure within speech segments by using a deep segmental neural network (DSNN). The DSNN model alleviates the need to use an HMM model, as it directly models the posterior probability of the label sequence. Moreover, a segment-aware NN structure has been proposed. It is able to model the dependency among speech frames within each segment and performs better than conventional frame-based DNNs. Experimental results show that the proposed DSNN can significantly improve recognition performance as compared with the conventional frame-based models.

Item Open Access
Automatic Extraction of Closed Contours Bounding Salient Objects: New Algorithms and Evaluation Methods (2015-08-28)
Movahedi, Vida; Elder, James
The problem under consideration in this dissertation is achieving salient object segmentation of natural images by means of probabilistic contour grouping. The goal is to extract the simple closed contour bounding the salient object in a given image. The method proposed here falls in the Contour Grouping category, searching for the optimal grouping of boundary entities to form an object contour. Our first contribution is to provide both a ground truth dataset and a performance measure for empirical evaluation of salient object segmentation methods. Our Salient Object Dataset (SOD) provides ground truth boundaries of salient objects perceived by humans in natural images. We also psychophysically evaluated 5 distinct performance measures that have been used in the literature and showed that a measure based upon minimal contour mappings is most sensitive to shape irregularities and most consistent with human judgements. In fact, the Contour Mapping measure is as predictive of human judgements as human subjects are of each other. Contour grouping methods often rely on Gestalt cues locally defined on pairs of oriented features. Accurate integration of these local cues with global cues is a challenge. A second major contribution of this dissertation is a novel, effective method for combining local and global cues. A third major contribution in this dissertation is a novel method based on Principal Component Analysis for promoting diversity among contour hypotheses, leading to substantial improvements in grouping performance. To further improve the performance, a multiscale implementation of this method has been studied. A fourth contribution in this dissertation is studying the effect of the multiscale prior on the performance and analysing the method for combining the results obtained at different resolutions. Our final contribution is comparing the performance of univariate distribution models for the local cues used by our method with the use of a multivariate mixture model for their joint distribution. We obtain a slight improvement with the mixture models. The proposed method has been evaluated and compared with four other state-of-the-art grouping methods, showing considerably better performance on the SOD ground truth dataset.
Item Open Access
Model-Based Dynamic Resource Management for Service Oriented Clouds (2015-08-28)
Ghanbari, Hamoun; Litoiu, Marin
Cloud computing is a flexible platform for software as a service, as more and more applications are deployed on the cloud. Major challenges in the cloud include how to characterize the workload of the applications and how to manage the cloud resources efficiently by sharing them among many applications. The current state of the art considers a simplified model of the system, either ignoring the software components altogether or ignoring the relationship between individual software services. This thesis considers the following resource management problems for cloud-based service providers: (i) how to estimate the parameters of the current workload, (ii) how to meet Quality of Service (QoS) targets while minimizing infrastructure cost, and (iii) how to allocate resources considering the performance costs of virtual machine reconfigurations. To address the above problems, we propose a model-based feedback loop approach. The cloud infrastructure, the services, and the applications are modelled using Layered Queuing Models (LQM). These models are then optimized. Mathematical techniques are used to reduce the complexity of the models and address the scalability issues. The main contributions of this thesis are: (i) Extended Kalman Filter (EKF) based techniques, improved by dynamic clustering, for scalable estimation of workload parameters, (ii) a combination of adaptive empirical models (tuned during runtime) and stepwise optimizations for improving the overall allocation performance, and (iii) dynamic service placement algorithms that consider the cost of virtual machine reconfiguration.
Item Open Access
Term Association Modelling in Information Retrieval (2015-08-28)
Zhao, Jiashu; Huang, Xiangji
Many traditional Information Retrieval (IR) models assume that query terms are independent of each other. For those models, a document is normally represented as a bag of words/terms and their frequencies. Although traditional retrieval models can achieve reasonably good performance in many applications, the corresponding independence assumption has limitations. There are some recent studies that investigate how to model term associations/dependencies by proximity measures. However, the modeling of term associations theoretically under the probabilistic retrieval framework is still largely unexplored. In this thesis, I propose a new concept named Cross Term to model term proximity, with the aim of boosting retrieval performance. With Cross Terms, the association of multiple query terms can be modeled in the same way as a simple unigram term. In particular, an occurrence of a query term is assumed to have an impact on its neighboring text. The degree of the query term impact gradually weakens with increasing distance from the place of occurrence. Shape functions are used to characterize such impacts. Based on this assumption, I first propose a bigram CRoss TErm Retrieval (CRTER2) model for probabilistic IR and a language-model-based model, CRTER2LM. Specifically, a bigram Cross Term occurs when the corresponding query terms appear close to each other, and its impact can be modeled by the intersection of the respective shape functions of the query terms. Second, I propose a generalized n-gram CRoss TErm Retrieval (CRTERn) model recursively for n query terms where n>2. For the n-gram Cross Term, I develop several distance metrics with different properties and employ them in the proposed models for ranking. Third, an enhanced context-sensitive proximity model is proposed to boost the CRTER models, where the contextual relevance of term proximity is studied. The models are validated on several large standard data sets and show improved performance over other state-of-the-art approaches. I also discuss the practical impact of the proposed models. The approaches in this thesis can also benefit term association modelling in other domains.
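The Cross Term idea above lends itself to a small illustration. The sketch below assumes a triangular shape function of width 5 around each query-term occurrence and scores a bigram Cross Term by the overlap (pointwise minimum) of the two impacts; both the kernel and the min() combination are illustrative assumptions, not the definitions used in the CRTER models.

```java
/** Illustrative sketch of the Cross Term idea: each query-term occurrence spreads
 *  a decaying "impact" over nearby positions, and a bigram Cross Term is scored by
 *  the overlap of the two impacts. The triangular kernel, the width of 5 positions,
 *  and the min() combination are assumptions, not the thesis's actual definitions. */
public class CrossTermSketch {

    /** Triangular shape function: impact 1 at the occurrence, 0 beyond `width` positions. */
    static double impact(int position, int occurrence, int width) {
        double d = Math.abs(position - occurrence);
        return Math.max(0.0, 1.0 - d / width);
    }

    /** Overlap (intersection) of the impacts of two term occurrences over the document. */
    static double crossTermScore(int occA, int occB, int docLength, int width) {
        double score = 0.0;
        for (int pos = 0; pos < docLength; pos++)
            score += Math.min(impact(pos, occA, width), impact(pos, occB, width));
        return score;
    }

    public static void main(String[] args) {
        int width = 5, docLength = 50;
        // Adjacent occurrences overlap strongly; distant ones barely at all.
        System.out.printf("adjacent (10, 11): %.2f%n", crossTermScore(10, 11, docLength, width));
        System.out.printf("distant  (10, 40): %.2f%n", crossTermScore(10, 40, docLength, width));
    }
}
```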
Item Open Access
Novel Medium Access Control (MAC) Protocols for Wireless Sensor and Ad Hoc Networks (WSANs) and Vehicular Ad Hoc Networks (VANETs) (2015-08-28)
Hossain, Md Kowsar; Edmonds, Jeffrey A.
Efficient medium access control (MAC) is a key part of any wireless network communication architecture. MAC protocols are needed for nodes to access the shared wireless medium efficiently. Providing high throughput is one of the primary goals of the MAC protocols designed for wireless networks. MAC protocols for Wireless Sensor and Ad hoc Networks (WSANs) must also conserve energy, as sensor nodes have limited battery power. On the other hand, MAC protocols for Vehicular Ad hoc Networks (VANETs) must also adapt to the highly dynamic nature of the network. Because communication link failures are very common in VANETs due to the fast movement of vehicles, quick reservation of packet transmission slots by vehicles is important. In this thesis we propose two new distributed MAC algorithms, one for WSANs and the other for VANETs. We demonstrate using simulations that our algorithms outperform the state-of-the-art algorithms.

Item Open Access
Non-Blocking Data Structures Handling Multiple Changes Atomically (2015-12-16)
Shafiei, Niloufar; Ruppert, Eric
Here, we propose a new approach to designing non-blocking algorithms that can apply multiple changes to a shared data structure atomically using Compare&Swap (CAS) instructions. We applied our approach to two data structures, doubly-linked lists and Patricia tries. In our implementations, only update operations perform CAS instructions; operations other than updates perform only reads of shared memory. Our doubly-linked list implements a novel specification that is designed to make it easy to use as a black box in a concurrent setting. In our doubly-linked list implementation, each process accesses the list via a cursor, which is an object in the process's local memory that is located at an item in the list. Our specification describes how updates affect cursors and how a process gets feedback about other processes' updates at the location of its cursor. We provide a detailed proof of correctness for our list implementation. We also give an amortized analysis for our list implementation, which is the first upper bound on amortized time complexity that has been proved for a concurrent doubly-linked list. In addition, we evaluate our list algorithms empirically on a multi-core system to show that they are scalable in practice. Our non-blocking Patricia trie implementation stores a set of keys, represented as bit strings, and allows processes to concurrently insert, delete, and find keys. In addition, our implementation supports the replace operation, which deletes a key from the trie and adds a new key to the trie simultaneously. Since the correctness proof of our trie is similar to the correctness proof of our list implementation, we only provide a sketch of a correctness proof of our trie implementation here. We empirically evaluate our trie and compare it to some existing set implementations. Our empirical results show that our Patricia trie implementation consistently performs well under different scenarios.
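For readers unfamiliar with Compare&Swap, the sketch below shows the basic CAS retry pattern (here, a Treiber-style lock-free stack) that non-blocking structures build on. It is only a primer; the thesis's doubly-linked list and Patricia trie algorithms, which coordinate multiple changes atomically, are considerably more involved.

```java
import java.util.concurrent.atomic.AtomicReference;

/** The basic compare-and-swap retry pattern that non-blocking structures build on.
 *  This Treiber-style stack is a primer on CAS, not the thesis's algorithm. */
public class CasPrimer<T> {

    private static final class Node<T> {
        final T value;
        final Node<T> next;
        Node(T value, Node<T> next) { this.value = value; this.next = next; }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    /** Lock-free push: read the head, build a new node, and retry the CAS until it succeeds. */
    public void push(T value) {
        while (true) {
            Node<T> current = head.get();
            Node<T> updated = new Node<>(value, current);
            // If another thread changed head since we read it, the CAS fails and we retry.
            if (head.compareAndSet(current, updated)) return;
        }
    }

    /** Lock-free pop; returns null when the stack is empty. */
    public T pop() {
        while (true) {
            Node<T> current = head.get();
            if (current == null) return null;
            if (head.compareAndSet(current, current.next)) return current.value;
        }
    }

    public static void main(String[] args) {
        CasPrimer<String> stack = new CasPrimer<>();
        stack.push("a");
        stack.push("b");
        System.out.println(stack.pop() + " " + stack.pop());  // prints "b a"
    }
}
```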
Item Open Access
Molecular Communication: From Theory to Practice (2015-12-16)
Farsad, Nariman; Eckford, Andrew W.
Always-on, always-available digital communication has changed the world, allowing us to collaborate and share information in ways unimaginable not long ago. Yet many of the physical principles used in everyday digital communication break down as the size of the devices approaches micro- or nano-scale dimensions. As a result, tiny devices, with dimensions of microns or less, need to do something different in order to communicate. Moreover, at meter scales there are areas where the use of radio signals is not possible or desirable. An emerging biomimetic technique called molecular communication, which relies on chemical signaling, is a promising solution to these problems. Although biologists have studied molecular communication extensively, it is very poorly understood from a telecommunication engineering perspective. Engineering molecular communication systems is important since micro- and nano-scale systems are the key to unlocking a realm of futuristic possibilities such as self-repairing machines, micro- and nano-scale robotics, synthetic biological devices, nanomedicine, and artificial immune systems that detect and kill cancer cells and pathogens. All these transformative applications have one feature in common: they involve not just single devices working independently, but swarms of devices working in concert. Besides solving the communication problem at small scales, the use of molecular communication in areas such as robotics and infrastructure monitoring can unlock new applications in smart cities and disaster search and rescue. In this dissertation, after providing a comprehensive survey of the field, two areas of study with high potential impact are identified: on-chip molecular communication, and experimental platforms for molecular communication. First, on-chip molecular communication is investigated towards the goal of networking components within lab-on-chip devices and point-of-care diagnostic devices. This has numerous applications in medicine, environmental monitoring systems, and the food industry. Then, in the second part of the dissertation, a tabletop demonstrator for molecular communication is designed and built that could be used for research and experimentation. In particular, no macroscale or microscale molecular communication platform capable of reliably transporting sequential data had existed in the past, and this platform is used to send the world's first text message ("O Canada") using chemical signals.

Item Open Access
Extending Topic Models With Syntax and Semantics Relationships (2015-12-16)
Delpisheh, Elnaz; An, Aijun
Probabilistic topic modeling is a powerful tool to uncover the hidden thematic structure of documents. These hidden structures are useful for extracting concepts of documents and for other data mining tasks, such as information retrieval. Latent Dirichlet allocation (LDA) is a generative probabilistic topic model for collections of discrete data such as text corpora. LDA represents documents as a bag-of-words, where the important structure of documents is neglected. In this work, we propose three extended LDA models that incorporate syntactic and semantic structures of text documents into probabilistic topic models. Our first proposed topic model enriches text documents with collapsed typed dependency relations to effectively acquire syntactic and semantic dependencies between consecutive and nonconsecutive words of text documents. This representation has several benefits. It captures relations between consecutive and nonconsecutive words of text documents. In addition, the labels of the collapsed typed dependency relations help to eliminate less important relations, i.e., relations involving prepositions. Moreover, in this thesis, we introduce a method to enforce topic similarity for conceptually similar words. As a result, this algorithm leads to more coherent topic distributions over words. Our second and third proposed generative topic models incorporate term importance into latent topic variables by boosting the probability of important terms and consequently decreasing the probability of less important terms to better reflect the themes of documents. In essence, we assign weights to terms by employing corpus-level and document-level approaches. We incorporate term importance using a nonuniform base measure for an asymmetric prior over topic-term distributions in the LDA framework. This leads to better estimates for important terms that occur less frequently in documents. Experimental studies have been conducted to show the effectiveness of our work across a variety of text mining applications. Furthermore, we employ our topic models to build a personalized content-based news recommender system. Our proposed recommender system eases reading and navigation through online newspapers. In essence, the recommender system acts as a filter, delivering only news articles that can be considered relevant to a user. This recommender system has been used by The Globe and Mail, a company that offers some of the most authoritative news in Canada, featuring national and international news.
Item Open Access
Using signal processing, evolutionary computation, and machine learning to identify transposable elements in genomes (2016-06-23)
Ashlock, Wendy Cole; Datta, Suprakash
About half of the human genome consists of transposable elements (TE's), sequences that have many copies of themselves distributed throughout the genome. All genomes, from bacterial to human, contain TE's. TE's affect genome function by either creating proteins directly or affecting genome regulation. They serve as molecular fossils, giving clues to the evolutionary history of the organism. TE's are often challenging to identify because they are fragmentary or heavily mutated. In this thesis, novel features for the detection and study of TE's are developed. These features are of two types. The first type consists of statistical features based on the Fourier transform that assess reading frame use. These features measure how different the reading frame use is from that of a random sequence, which reading frames the sequence is using, and the proportion of use of the active reading frames. The second type, called side effect machine (SEM) features, is generated by finite state machines augmented with counters that track the number of times each state is visited. These counters then become features of the sequence. The number of possible SEM features is super-exponential in the number of states. New methods for selecting useful feature subsets that incorporate a genetic algorithm and a novel clustering method are introduced. The features produced reveal structural characteristics of the sequences of potential interest to biologists. A detailed analysis of the genetic algorithm, its fitness functions, and its fitness landscapes is performed. The features are used, together with features used in existing exon finding algorithms, to build classifiers that distinguish TE's from other genomic sequences in humans, fruit flies, and ciliates. The classifiers achieve high accuracy (> 85%) on a variety of TE classification problems. The classifiers are used to scan large genomes for TE's. In addition, the features are used to describe the TE's in the newly sequenced ciliate Tetrahymena thermophila, to provide information useful to biologists in forming experimentally testable hypotheses concerning the role of these TE's and the mechanisms that govern them.
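One classic Fourier-based signal for reading-frame analysis is the period-3 peak that coding DNA tends to show in the spectra of its nucleotide indicator sequences. The sketch below computes only that basic quantity; the thesis's actual features are more elaborate, and the example sequences are made up.

```java
/** Illustrative sketch of a Fourier-based reading-frame signal: coding DNA tends to show
 *  a peak at period 3 in the spectra of its nucleotide indicator sequences. The thesis's
 *  actual features are more elaborate; this only computes the classic period-3 power. */
public class PeriodThreeSketch {

    /** Power of the DFT of nucleotide `base`'s indicator sequence at frequency k = N/3. */
    static double periodThreePower(String dna, char base) {
        int n = dna.length();
        double re = 0.0, im = 0.0;
        for (int j = 0; j < n; j++) {
            if (dna.charAt(j) != base) continue;                 // indicator sequence: 1 where base occurs
            double angle = -2.0 * Math.PI * (n / 3.0) * j / n;   // DFT bin k = N/3
            re += Math.cos(angle);
            im += Math.sin(angle);
        }
        return re * re + im * im;
    }

    public static void main(String[] args) {
        String periodic = "ATGATGATGATGATGATGATGATG";   // strong 3-periodicity
        String shuffled = "ATAGTGTAGATGATTGAGGTCAAT";   // little 3-periodicity
        for (char base : new char[]{'A', 'T', 'G', 'C'}) {
            System.out.printf("%c: periodic %.1f vs shuffled %.1f%n",
                base, periodThreePower(periodic, base), periodThreePower(shuffled, base));
        }
    }
}
```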