Extending Topic Models With Syntax and Semantics Relationships

dc.contributor.advisor: An, Aijun
dc.creator: Delpisheh, Elnaz
dc.date.accessioned: 2015-12-16T19:09:47Z
dc.date.available: 2015-12-16T19:09:47Z
dc.date.copyright: 2015-05-06
dc.date.issued: 2015-12-16
dc.date.updated: 2015-12-16T19:09:47Z
dc.degree.discipline: Computer Science
dc.degree.level: Doctoral
dc.degree.name: PhD - Doctor of Philosophy
dc.description.abstract: Probabilistic topic modeling is a powerful tool for uncovering the hidden thematic structure of documents. These hidden structures are useful for extracting document concepts and for other data mining tasks, such as information retrieval. Latent Dirichlet allocation (LDA) is a generative probabilistic topic model for collections of discrete data such as text corpora. LDA represents documents as bags of words, neglecting the structure of documents. In this work, we propose three extended LDA models that incorporate syntactic and semantic structure of text documents into probabilistic topic models. Our first topic model enriches text documents with collapsed typed dependency relations to effectively capture syntactic and semantic dependencies between both consecutive and nonconsecutive words. This representation has several benefits: it captures relations between consecutive and nonconsecutive words, and the labels of the collapsed typed dependency relations help eliminate less important relations, e.g., relations involving prepositions. Moreover, in this thesis, we introduce a method that enforces topic similarity among conceptually similar words, leading to more coherent topic distributions over words. Our second and third generative topic models incorporate term importance into the latent topic variables by boosting the probability of important terms and correspondingly decreasing the probability of less important terms, to better reflect the themes of documents. In essence, we assign weights to terms using corpus-level and document-level approaches, and we incorporate term importance through a nonuniform base measure for an asymmetric prior over topic-term distributions in the LDA framework. This yields better estimates for important terms that occur infrequently in documents.
Experimental studies have been conducted to show the effectiveness of our work across a variety of text mining applications. Furthermore, we employ our topic models to build a personalized content-based news recommender system that eases reading and navigation through online newspapers. In essence, the recommender system acts as a filter, delivering only news articles considered relevant to a user. This recommender system has been used by The Globe and Mail, one of Canada's most authoritative news organizations, featuring national and international news.
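The nonuniform base measure described in the abstract can be illustrated with a minimal NumPy sketch. This is not the thesis's implementation; the vocabulary, importance weights, and scaling are hypothetical, and it only shows how scaling a Dirichlet prior by term importance shifts prior mass toward important terms in a topic-term distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary with corpus-level importance weights
# (e.g., tf-idf-like scores); names and values are illustrative only.
vocab = ["dirichlet", "topic", "model", "the", "of"]
importance = np.array([0.9, 0.8, 0.7, 0.1, 0.1])

# Symmetric prior: every term shares the same concentration beta.
beta = 0.01
symmetric_prior = np.full(len(vocab), beta)

# Nonuniform base measure: normalize the importance weights, then
# scale so total concentration matches the symmetric prior's.
base_measure = importance / importance.sum()
asymmetric_prior = beta * len(vocab) * base_measure

# Sample a topic-term distribution under each prior; under the
# asymmetric prior, expected mass follows the base measure, so
# stopword-like terms ("the", "of") are down-weighted a priori.
phi_sym = rng.dirichlet(symmetric_prior)
phi_asym = rng.dirichlet(asymmetric_prior)

print(np.round(asymmetric_prior, 4))
```

Both priors carry the same total concentration; only its allocation across the vocabulary differs, which is what lets rare-but-important terms receive better prior support.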
dc.identifier.uri: http://hdl.handle.net/10315/30620
dc.language.iso: en
dc.rights: Author owns copyright, except where explicitly noted. Please contact the author directly with licensing requests.
dc.subject: Computer science
dc.subject: Computer engineering
dc.subject.keywords: Data mining
dc.subject.keywords: Text mining
dc.subject.keywords: Topic modeling
dc.subject.keywords: LDA
dc.subject.keywords: Latent Dirichlet allocation
dc.subject.keywords: Recommender system
dc.subject.keywords: Computational linguistics
dc.title: Extending Topic Models With Syntax and Semantics Relationships
dc.type: Electronic Thesis or Dissertation

Files

Original bundle
- Delpisheh_Elnaz_2015_PhD.pdf (1.37 MB, Adobe Portable Document Format)

License bundle
- license.txt (1.83 KB, Plain Text)
- YorkU_ETDlicense.txt (3.38 KB, Plain Text)