Clustering

Concepts

These components perform clustering: they produce "homogeneous" groups by optimizing some criterion. They are also known as unsupervised methods.

Attributes status

"Input" attributes only, they are generally continuous.

Clustering components

Each component below is listed with its description, references, parameters, and notes.

K-Means
K-Means clustering (Forgy and MacQueen algorithms). Several trials are performed.

- T. Hastie, R. Tibshirani, J. Friedman, "The Elements of Statistical Learning: Data Mining, Inference, and Prediction", Springer, pp.461-463, 2001.
- E. Diday, "La méthode des nuées dynamiques", Revue de Statistique Appliquée, vol. 19, n°2, pp.19-34, 1971.
- M.R. Anderberg, "Cluster analysis for applications", Academic Press, 1973.
- J.L. Chandon, S. Pinson, "Analyse typologique : théorie et applications", Masson, pp.132-160, 1981.

- Number of clusters
- Number of iterations
- Number of trials
- Data standardization
- Average computation during optimization process
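The parameters above can be illustrated with a minimal numpy sketch of K-Means with Forgy initialization and several random trials, keeping the best solution by within-cluster sum of squares. Function and parameter names here are illustrative, not the component's actual interface.

```python
import numpy as np

def kmeans(X, k, n_iter=20, n_trials=5, seed=0):
    """K-Means with Forgy initialization and several random trials."""
    rng = np.random.default_rng(seed)
    best_inertia, best_labels, best_centers = np.inf, None, None
    for _ in range(n_trials):
        # Forgy initialization: pick k random examples as initial centers
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assign each example to its nearest center
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # Recompute each center as the mean of its cluster
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
            if np.allclose(new, centers):
                break
            centers = new
        # Keep the trial with the lowest within-cluster sum of squares
        inertia = ((X - centers[labels]) ** 2).sum()
        if inertia < best_inertia:
            best_inertia, best_labels, best_centers = inertia, labels, centers
    return best_labels, best_centers, best_inertia
```

Data standardization (centering and scaling each attribute) would be applied to X before calling this function, so that no attribute dominates the distance computation.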

Kohonen's SOM
Kohonen's Self-Organizing Map.

- T. Kohonen, "Self-organization and associative memory", Springer-Verlag, 1988.
- K. Mehrotra, C. Mohan, S. Ranka, "Elements of artificial neural network", MIT Press, pp.187-201, 1997.
- T. Hastie, R. Tibshirani, J. Friedman, "The Elements of Statistical Learning: Data Mining, Inference, and Prediction", Springer, pp.480-485, 2001.

- Map: number of rows
- Map: number of columns
- Data standardization
- Learning rate
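A minimal numpy sketch of a rectangular SOM shows how the map size (rows × columns) and the learning rate interact: units near the best matching unit on the grid are pulled toward each presented example, with the learning rate and neighborhood radius decaying over time. Names and the decay schedule are illustrative assumptions, not the component's actual settings.

```python
import numpy as np

def train_som(X, n_rows, n_cols, n_iter=100, lr0=0.5, seed=0):
    """Online training of a rectangular Kohonen SOM."""
    rng = np.random.default_rng(seed)
    # Grid coordinates of each map unit (row, col)
    grid = np.array([(r, c) for r in range(n_rows) for c in range(n_cols)])
    # One weight (prototype) vector per map unit
    W = rng.normal(size=(n_rows * n_cols, X.shape[1]))
    radius0 = max(n_rows, n_cols) / 2.0
    for t in range(n_iter):
        lr = lr0 * (1 - t / n_iter)             # decaying learning rate
        radius = 1 + radius0 * (1 - t / n_iter)  # shrinking neighborhood
        x = X[rng.integers(len(X))]              # pick one random example
        # Best matching unit: the closest prototype in input space
        bmu = np.argmin(((W - x) ** 2).sum(axis=1))
        # Gaussian neighborhood computed on the map grid, not in input space
        d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
        h = np.exp(-d2 / (2 * radius ** 2))
        W += lr * h[:, None] * (x - W)
    return W
```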

LVQ
Kohonen's Learning Vector Quantization, a "supervised" clustering algorithm.

- T. Kohonen, "Self-organization and associative memory", Springer-Verlag, 1988.
- K. Mehrotra, C. Mohan, S. Ranka, "Elements of artificial neural network", MIT Press, pp.173-176, 1997.
- T. Hastie, R. Tibshirani, J. Friedman, "The Elements of Statistical Learning: Data Mining, Inference, and Prediction", Springer, pp.414-415, 2001.

- Number of clusters per class
- Learning rate
- Number of iterations
- Data standardization
"Target" discrete attribute must be defined.

HAC
Hierarchical agglomerative clustering.

The tree is built in two steps:
(1) run a "fast" clustering method (K-Means, SOM) that produces many low-level clusters;
(2) use these clusters to construct the dendrogram.

This approach makes it possible to apply HAC to datasets with many examples.

How should the number of low-level clusters be set? It must be much lower than the size of the dataset; 15 to 20 seems to be a good compromise.

M.A. Wong, "A hybrid clustering method for identifying high density clusters", JASA, 77, pp.841-847, 1982.

- Clusters detection strategy
- Number of clusters
- Standardization or not of attributes for distance evaluation
A discrete TARGET attribute must be defined; it comes from a preceding clustering component such as K-Means or SOM.
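The two-step construction can be sketched in numpy: K-Means first produces the low-level clusters, then their centroids are merged agglomeratively (closest pair first, size-weighted) down to the desired number of clusters. The linkage used here is a simplification for illustration; the component's actual merging criterion may differ.

```python
import numpy as np

def two_step_hac(X, n_low=15, n_final=2, seed=0):
    """Two-step HAC: fast K-Means, then agglomeration of the centroids."""
    rng = np.random.default_rng(seed)
    # Step 1: a few Lloyd iterations of K-Means to get low-level clusters
    centers = X[rng.choice(len(X), n_low, replace=False)]
    for _ in range(20):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(axis=2), axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(n_low)])
    # Step 2: agglomerative merging of centroids, weighted by cluster size
    groups = [[j] for j in range(n_low)]   # low-level clusters in each group
    sizes = np.array([np.sum(labels == j) for j in range(n_low)], float)
    C = centers.copy()
    active = list(range(n_low))
    while len(active) > n_final:
        # Find and merge the closest pair of active centroids
        best = None
        for a in range(len(active)):
            for b in range(a + 1, len(active)):
                d = ((C[active[a]] - C[active[b]]) ** 2).sum()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        ia, ib = active[a], active[b]
        w = sizes[ia] + sizes[ib]
        C[ia] = (sizes[ia] * C[ia] + sizes[ib] * C[ib]) / w  # merged centroid
        sizes[ia] = w
        groups[ia] += groups[ib]
        active.pop(b)
    # Map every example to its final cluster via its low-level cluster
    low_to_final = {j: f for f, ia in enumerate(active) for j in groups[ia]}
    return np.array([low_to_final[j] for j in labels])
```

Only the n_low centroids enter the pairwise-distance loop, which is what keeps the dendrogram construction affordable on large datasets.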

Last modification : April 18th, 2004.