Title: Pattern Extraction and Clustering for High-Dimensional Discrete Data
Alice Peng Jiang
University of Illinois
Wednesday, March 12, 10am in EB3105
With the advent of big data, data analysis has replaced data acquisition as the new bottleneck. Pattern extraction and clustering are two essential methods for data analysis. In many application areas, big data has high dimensionality and discrete features, where data values are restricted, for example, to be binary (i.e., only values 0 and 1 are permitted). Such problems pose significant challenges in data analysis. Methods for analyzing continuous data are often ineffective for discrete datasets, as such methods usually produce non-discrete values, which may be difficult to interpret and may require excessive memory to store. Thus we need algorithms designed especially for discrete data.
This talk focuses on compressing data with minimum loss of accuracy via low-rank matrix factorizations, which greatly reduces the time for solving problems in data mining and machine learning. Specifically, matrix factorization produces two matrix factors, one finding the dominant patterns and the other indicating the cluster assignment of the original patterns. We have developed a framework in which matrix factorization problems are reformulated as variants of clustering problems, for which we proposed novel algorithms and proved their constant approximation bounds. These are very nice properties, as the original discrete problems are NP-hard and approximation algorithms that provide close bounds to exact solutions are highly desired. In addition to theoretical verification, we also perform experimental evaluation on various applications in pattern extraction, transaction data mining, recommender systems, bi-cluster discovery in gene expression data, document clustering, and social network mining.