
On Document Clustering using Core Semantic Features

MSU-CSE-09-31

Samah Fodeh, William F Punch, Pang-Ning Tan
December, 2009

Incorporating background knowledge in the form of a semantic ontology into document clustering is an important but challenging problem. While numerous methods have been developed, the value of using such an ontology is still not clear. In this study, we investigate the interplay of various factors that affect the performance of clustering algorithms that use ontology background knowledge. We show that polysemous and synonymous nouns are both relatively prevalent and fundamentally important for cluster formation. However, the procedure used to map document words to ontological concepts may both introduce inaccurate semantic features and increase the dimensionality of the feature space. To address these problems, we develop an information-theoretic approach for extracting a core set of semantic features to represent a text corpus. Empirical results show that by using core semantic features for clustering, one can reduce the number of features by 80% or more and still produce clusters comparable to, and occasionally better than, those produced using either all the document words or all the derived semantic features as the feature set.



No online versions of this document are available.

For more information on this report, contact ptan@cse.msu.edu.


You are granted permission for the non-commercial reproduction, distribution, display, and performance of this technical report in any format.

