Research in Data Clustering
Welcome to the data clustering page at Michigan State University!
For our general research in Pattern
Recognition and Image Processing, please visit the
PRIP page
For our research in Biometric Authentication, please visit the
Biometrics page
Overview
The goal of data clustering, or unsupervised learning, is to discover
"natural" groupings in a set of patterns, points, or objects, without
prior knowledge of any class labels. There are many applications of cluster
analysis, including vector quantization, image segmentation, constructing the
prototypes of classifiers, understanding genomic data, market segmentation,
etc. Despite its long history, clustering still poses a number of open research
problems. Two surveys on clustering are:
-
A. K. Jain, M. N. Murty and P. J. Flynn,
Data Clustering: A Review, ACM Computing Surveys, Sept 1999.
-
A. K. Jain and R. C. Dubes.
Algorithms for Clustering Data, Prentice Hall, 1988.
(This book is out of print, but it can be downloaded for free by following the
hyperlink above.)
Below are some recent publications of our group in this area.
Fitting a Mixture of Gaussians
The standard algorithm for fitting a mixture of Gaussians to a data set is the
classic EM algorithm. However, the EM algorithm has several known weaknesses: the
number of components needs to be fixed beforehand, EM can converge to a poor
local optimum, and EM can converge towards a singular estimate at the boundary
of the parameter space. These issues are addressed by the algorithm described
in the following papers. The Matlab code for our algorithm is available for
download.
-
M. Figueiredo,
A.K. Jain, "Unsupervised Learning of Finite
Mixture Models", IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. 24, No. 3, March 2002, pp. 381-396. (Matlab
code) (abstract
at IEEE Xplore)
-
M. Figueiredo,
A.K. Jain, "Unsupervised Selection and Estimation of Finite
Mixture Models"; in Proceedings of the International Conference on Pattern
Recognition - ICPR'2000, vol. 2, pp. 87-90, Barcelona, September 2000.
(ps.gz, pdf)
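For readers unfamiliar with the baseline, the following is a minimal sketch of the classic EM algorithm for a one-dimensional Gaussian mixture. It is not the algorithm of the papers above, only the standard procedure whose weaknesses they address. The function names (`gaussian_pdf`, `em_gmm_1d`), the quantile-based initialization, and the `var_floor` parameter are assumptions made for this illustration. Note that the number of components `k` must be supplied by the caller, and that the variance floor is a crude guard against the singular estimates mentioned in the overview.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, k, n_iter=100, var_floor=1e-6):
    """Fit a k-component 1-D Gaussian mixture with plain EM.

    Assumptions of this sketch: means start at evenly spaced quantiles,
    variances start at the overall data variance, and variances are
    clipped at var_floor so a component cannot collapse onto one point.
    """
    n = len(data)
    ordered = sorted(data)
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n
    variances = [max(grand_var, var_floor)] * k
    weights = [1.0 / k] * k

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            p = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            if total == 0.0:
                # Numerical underflow: fall back to uniform responsibilities.
                resp.append([1.0 / k] * k)
            else:
                resp.append([pj / total for pj in p])

        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            if nj > 1e-12:
                means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
                var = sum(r[j] * (x - means[j]) ** 2
                          for r, x in zip(resp, data)) / nj
                variances[j] = max(var, var_floor)

    return weights, means, variances


# Usage: two well-separated synthetic clusters (illustrative data only).
rng = random.Random(0)
sample = ([rng.gauss(0.0, 1.0) for _ in range(200)]
          + [rng.gauss(10.0, 1.0) for _ in range(200)])
print(em_gmm_1d(sample, k=2))
```

With a different initialization or a poorly chosen `k`, the same loop can settle in a poor local optimum, which is the behavior the papers above set out to avoid.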
Feature Selection in Unsupervised Learning
Given a large number of features, feature selection finds a subset of the
available features that is appropriate for the task at hand. Feature
selection can be of tremendous help when one faces the "curse of
dimensionality". Most previous work on feature selection is for supervised
classification. We consider feature selection in unsupervised learning in the
following papers.
-
M. Law,
M. A. T. Figueiredo, A. K. Jain.
"Simultaneous
Feature Selection and Clustering Using Mixture Models", IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no.
9, pp. 1154-1166, September 2004. (IEEE
Xplore)
(Matlab
code)
-
M. Figueiredo,
A.K. Jain, M. Law.
"A Feature selection wrapper for mixtures", in Proceedings of the
First Iberian Conference on Pattern Recognition and Image Analysis,
Puerto de Andratx, Spain, June 2003.
-
M. Law,
M. Figueiredo, A. K. Jain.
"Feature selection in mixture-based clustering",
in Advances in Neural Information Processing Systems 15 (NIPS 2002), pp.
609-616, Vancouver, Dec 2002.
-
A. K. Jain and D. Zongker.
"Feature-Selection: Evaluation, Application, and Small Sample Performance",
IEEE Transactions on Pattern Analysis and Machine Intelligence
vol. 19, no. 2, pp. 153-158, February 1997.
(IEEE Xplore)
Combination of Clustering Algorithms
Combining multiple classifiers in supervised classification has achieved
great success and is becoming one of the standard techniques in
pattern recognition. However, little has been done to explore how to combine
data partitions generated by different clustering algorithms. The following
papers investigate different issues in combining the outputs of multiple
clustering algorithms.
-
A. Fred,
A.K. Jain. Combining Multiple Clustering
Using Evidence Accumulation. IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 27, no. 6, pp. 835-850, 2005. (Abstract
in IEEE Xplore)
-
A. Topchy,
A.K. Jain, W. Punch.
Clustering Ensembles: Models of Consensus and Weak Partitions. To
appear in IEEE Transactions on Pattern Analysis and Machine Intelligence
(under review).
-
A. Topchy,
M. H. Law, A.K. Jain,
A. Fred. Analysis of Consensus Partition
in Cluster Ensemble. In Proceedings of The Fourth IEEE International
Conference on Data Mining, pp. 225-232, Brighton, UK, November 01-04,
2004.
-
A. Topchy,
B. Minaei, A.K. Jain,
W. Punch. "Adaptive clustering Ensembles",
in Proceedings of the International Conference on Pattern Recognition,
Cambridge, United Kingdom, August 23-26, 2004.
-
A. Topchy,
A.K. Jain, W. Punch,
"A Mixture Model of Clustering
Ensembles", in Proceedings of the SIAM International Conference on
Data Mining, Lake Buena Vista, Florida, April 22-24, 2004.
-
B. Minaei,
A. Topchy, and W. Punch,
Ensembles of Partitions via Data Resampling, in Proceedings of the
International Conference on Information Technology: Coding and Computing, ITCC
2004, Las Vegas, April 2004.
-
A. Topchy,
A.K. Jain, W. Punch,
"Combining Multiple Weak Clusterings",
in Proceedings of the IEEE International Conf. Data Mining, pp. 331-338,
Melbourne, Florida, USA, November 19-22 2003.
-
A. Fred,
A.K. Jain, "Data Clustering Using
Evidence Accumulation", in Proceedings of the International
Conference on Pattern Recognition (ICPR), Quebec City, August 11-15
2002.
-
A. Fred,
A.K. Jain, "Evidence
Accumulation Clustering based on the K-means algorithm", in Proceedings
of the International Workshops on Structural and Syntactic Pattern Recognition
(SSPR), Windsor, Canada, August 6-9 2002.
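As a rough illustration of what combining data partitions can look like, the sketch below builds a co-association matrix, in which each entry is the fraction of input partitions that place two points in the same cluster, and then links points whose co-association exceeds a threshold. This is offered in the spirit of the evidence accumulation idea named in the paper titles above; the function names, the 0.5 threshold, and the connected-components step are assumptions of this sketch and should not be read as the exact procedure of any listed paper.

```python
def co_association(partitions):
    """Fraction of partitions in which points i and j share a cluster.

    `partitions` is a list of label lists, one per clustering run, all
    of the same length. Label values need not agree across runs.
    """
    n = len(partitions[0])
    m = len(partitions)
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]


def consensus_clusters(co, threshold=0.5):
    """Group points connected by co-association above `threshold`.

    This is a connected-components pass over the thresholded matrix
    (an assumed, simple way to extract a consensus partition).
    """
    n = len(co)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Usage: three partitions of six points that disagree on the details.
runs = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2, 2],
    [1, 1, 1, 0, 0, 0],
]
co = co_association(runs)
print(consensus_clusters(co))
```

Because the matrix depends only on whether two points share a label, the input partitions may use different label names and different numbers of clusters.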
Semi-supervised Learning
-
Pavan K. Mallapragada, Rong Jin, A. K. Jain, Yi Liu.
"SemiBoost: Boosting for Semi-supervised Learning", Technical Report MSU-CSE-07-197, Department of Computer Science and Engineering, Michigan State University.
-
Yi Liu, Rong Jin, A. K. Jain.
"BoostCluster: Boosting Clustering by Pairwise Constraints", In Proceedings of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 450-459, 2007.
-
A. K. Jain, Pavan K. Mallapragada,
M. Law.
"Bayesian Feedback in Data Clustering", In Proceedings of the
18th International Conference on Pattern Recognition, Vol. 3, pp. 374-378, Hong
Kong, August 20-24, 2006.
-
T. Lange, M. H. Law,
A. K. Jain, J. Buhmann.
Learning With Constrained and Unlabelled Data. In Proceedings of the
IEEE Computer Society Conference on Computer Vision and Pattern Recognition,
vol.1, pp. 730-737, June 2005.
-
M. H. Law,
A. Topchy, A. K. Jain.
Model-based Clustering With Probabilistic Constraints. In Proceedings of
SIAM Data Mining, pp. 641-645, 2005.
-
M. H. Law,
A. Topchy, A. K. Jain,
"Clustering with Soft and Group Constraints", In Proceedings of the
Joint IAPR International Workshop on Structural, Syntactic, And Statistical
Pattern Recognition (S+SSPR 2004), pp. 662-670, 2004.
Multiobjective Data Clustering
Most clustering algorithms generate the output partition by explicitly or
implicitly minimizing a single objective function. Unfortunately, clusters in
real-world data sets are "heterogeneous" (of diverse shapes and data
densities), and it is difficult for a single clustering algorithm to detect
different types of clusters. We explore how to use multiple clustering criteria
simultaneously in the following paper.
Nonlinear Dimensionality Reduction
-
M. H. Law, A. K. Jain.
"Incremental Nonlinear Dimensionality Reduction By Manifold Learning",
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 3, pp. 377-391, March 2006.
-
M. Law,
N. Zhang, A. K. Jain.
"Nonlinear Manifold Learning for Data Stream", In Proceedings of SIAM Data
Mining, pp. 33-44, Orlando, Florida, 2004. This paper received the Best
Student Paper award. The
web site (under construction)
Other Papers on Clustering and Dimensionality Reduction
-
A. Fred,
A.K. Jain, "Learning Pairwise Similarity
for Data Clustering", Proceedings of the 18th International Conference
on Pattern Recognition (ICPR), Vol. 1, pp. 925-928, Hong Kong, August 20-24,
2006.
-
A. K. Jain,
A. Topchy, M. Law,
J. Buhmann. "Landscape of
Clustering Algorithms", In Proceedings of the 17th International
Conference on Pattern Recognition, pp. I-260--I-263, Cambridge UK,
August 23-26, 2004.
-
A. K. Jain,
S. Raudys. "Small sample size effects
in statistical pattern recognition: recommendations for practitioners",
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13,
no. 3, pp. 252-264, March 1991.
-
K. Pettis and T. Bailey and A. K. Jain and R. Dubes. "An
Intrinsic Dimensionality Estimator from Near-Neighbor Information",
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1,
no. 1, pp. 25-36, 1979.
Recent Theses
Comments and suggestions are welcome. Please direct them to
Pavan.