Samah Fodeh, William F Punch, Pang-Ning Tan
Incorporating background knowledge in the form of a semantic ontology into document clustering is an important but challenging problem. While numerous methods have been developed, the value of using such an ontology is still not clear. In this study, we investigate the interplay of various factors that affect the performance of clustering algorithms that use ontology-based background knowledge. We show that polysemous and synonymous nouns are both relatively prevalent and fundamentally important for cluster formation. However, the procedure used to map document words to ontological concepts may both introduce inaccurate semantic features and increase the dimensionality of the feature space. To address these problems, we develop an information-theoretic approach for extracting a core set of semantic features to represent a text corpus. Empirical results show that by using core semantic features for clustering, one can reduce the number of features by 80% or more and still produce clusters comparable to, and occasionally better than, those produced using either all the document words or all the derived semantic features as the feature set.
You are granted permission for the non-commercial reproduction, distribution, display, and performance of this technical report in any format.