Incorporating Background Knowledge into Data Mining

Team Members:

  • Pang-Ning Tan (Faculty advisor)
  • Haibin Cheng (PhD student)
  • Samah Fodeh (PhD student)

Overview:

Data mining is a large-scale data analysis endeavor that has found applications in diverse areas such as business intelligence, computer security, bioinformatics, and geo-sciences. Despite its tremendous promise, data mining is often plagued by a fundamental problem: the models and patterns derived from data tend to be inferior to human expertise. High false alarm rates, clusters that are hard to interpret, and spurious patterns are among the typical grievances voiced by users who apply data mining techniques in a practical setting. These problems arise because many data mining techniques begin from tabula rasa, or the blank slate, where the underlying algorithms have no innate knowledge of the particular domain. Significant advances must therefore be made if data mining is to be successfully deployed in critical applications such as computer security and medical diagnostic systems. This project seeks to improve the effectiveness of data mining by automatically incorporating background knowledge into the mining process. Specifically, it pursues this goal by developing innovative algorithms for combining information from multiple sources and by exploring new problem areas that may benefit from background knowledge.

Publications:

  1. Haibin Cheng and Pang-Ning Tan, Semi-supervised Learning with Data Calibration for Long-Term Time Series Forecasting. To appear in Proc of ACM SIGKDD Int'l Conf on Knowledge Discovery and Data Mining (KDD-2008), Las Vegas, Nevada, August 24-27 (2008).
  2. Samah Fodeh and Pang-Ning Tan, Incorporating Background Knowledge for Subjective Rule Evaluation. In Proc of IEEE Int'l Conf on Tools with Artificial Intelligence (ICTAI-07), Patras, Greece, October 29-31 (2007).
  3. Jing Gao, Pang-Ning Tan, and Haibin Cheng, Semi-supervised Clustering with Partial Background Information. In Proc of SDM'06: SIAM Int'l Conf. on Data Mining, Bethesda, MD, Apr 20-22 (2006).
  4. Jing Gao, Haibin Cheng, and Pang-Ning Tan, A Novel Framework for Incorporating Labeled Examples into Anomaly Detection. In Proc of SDM'06: SIAM Int'l Conf. on Data Mining, Bethesda, MD, Apr 20-22 (2006).
  5. Jerry Scripps and Pang-Ning Tan, Clustering in the Presence of Bridge-Nodes. In Proc of SDM'06: SIAM Int'l Conf. on Data Mining, Bethesda, MD, Apr 20-22 (2006).
  6. Pang-Ning Tan and Rong Jin, Ordering Patterns by Combining Opinions from Multiple Sources. In Proc of the Tenth ACM SIGKDD Int'l Conf on Knowledge Discovery and Data Mining (KDD-2004), Seattle, WA, Aug 22-25 (2004).