## Code: K-means Clustering Using Euclidean Distance


| First Edition | 04/2006 |
|---|---|
| Last Modified | 06/2006 |
| Language | Matlab |

## Abstract

An implementation of the K-means algorithm, together with experiments in clustering, vector quantization, supervised classification, and speaker identification using K-means vector quantization.

## Algorithm [1]

1. Initialize the centroids of the K clusters randomly.

2. Assign each sample to the nearest centroid.

3. Recalculate the centroids (means) of the K clusters.

4. If the centroids are unchanged, stop. Otherwise, go to step 2.

## Source Code

### cvEucdist.m
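The original listing is not preserved here. Judging from the name, `cvEucdist` computes Euclidean distances between column vectors; a minimal sketch (the exact signature is an assumption) could look like:

```matlab
function d = cvEucdist(X, Y)
%CVEUCDIST Squared Euclidean distances between two sets of column vectors.
%   X: D x N matrix (N samples), Y: D x M matrix (M samples)
%   d: N x M matrix with d(i,j) = ||X(:,i) - Y(:,j)||^2
N = size(X, 2); M = size(Y, 2);
U = sum(X.^2, 1).';                 % N x 1 squared norms of X columns
V = sum(Y.^2, 1);                   % 1 x M squared norms of Y columns
% ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x'y, expanded for all pairs
d = repmat(U, 1, M) + repmat(V, N, 1) - 2 * (X.' * Y);
end
```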

### cvKmeans.m
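The original listing is likewise not preserved. A minimal self-contained sketch implementing steps 1–4 of the algorithm above (the signature and the column-vector convention are assumptions) might be:

```matlab
function [label, centroid] = cvKmeans(X, K)
%CVKMEANS K-means clustering with squared Euclidean distance (sketch).
%   X: D x N matrix of N column-vector samples, K: number of clusters
%   label: 1 x N cluster index per sample, centroid: D x K centroids
N = size(X, 2);
% Step 1: initialize centroids with K distinct random samples.
perm = randperm(N);
centroid = X(:, perm(1:K));
label = zeros(1, N);
while true
    % Step 2: assign each sample to the nearest centroid
    % (this pairwise squared-distance matrix is what cvEucdist computes).
    d = repmat(sum(X.^2, 1).', 1, K) + repmat(sum(centroid.^2, 1), N, 1) ...
        - 2 * (X.' * centroid);        % N x K
    [~, newlabel] = min(d, [], 2);
    newlabel = newlabel.';
    % Step 4: if assignments (hence centroids) are unchanged, stop.
    if isequal(newlabel, label)
        break;
    end
    label = newlabel;
    % Step 3: recompute each centroid as the mean of its members.
    for k = 1:K
        members = X(:, label == k);
        if ~isempty(members)
            centroid(:, k) = mean(members, 2);
        end
    end
end
end
```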

### cvAutolabel.m
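This listing is also missing; the behavior below is inferred purely from the function name (automatically labeling clusters by majority vote over known classes), so treat it as a guess rather than the original code:

```matlab
function clusterClass = cvAutolabel(label, trueClass, K)
%CVAUTOLABEL Assign a class to each cluster by majority vote (sketch;
%   inferred from the function name, not from the original code).
%   label: 1 x N cluster index per training sample
%   trueClass: 1 x N known class per training sample
%   clusterClass: 1 x K class assigned to each cluster
clusterClass = zeros(1, K);
for k = 1:K
    members = trueClass(label == k);
    if ~isempty(members)
        clusterClass(k) = mode(members); % most frequent class in cluster k
    end
end
end
```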

## Demo

### Clustering

**Results**

### Vector Quantization using Kmeans

A color image with 256×256×256 possible RGB colors (24-bit color) is quantized. The return value `codebook` of `kmeans_light` contains the quantized vectors, i.e., the representative colors.

>> cvKmeansDemoVq(16)

**Results**

Input image

Quantized into 16 colors

codebook =

| R | G | B |
|---|---|---|
| 102.9687 | 150.2048 | 209.9537 |
| 104.1191 | 113.4578 | 39.0459 |
| 69.3265 | 119.3945 | 186.1745 |
| 45.5851 | 84.7983 | 142.6698 |
| 127.7494 | 150.0138 | 165.5314 |
| 135.5958 | 137.5865 | 123.5271 |
| 183.1342 | 197.3741 | 212.2263 |
| 82.2467 | 112.0260 | 135.9893 |
| 76.7181 | 89.7075 | 32.5789 |
| 146.2566 | 149.1479 | 59.9476 |
| 134.0863 | 139.7082 | 47.4650 |
| 137.9857 | 173.9007 | 216.7912 |
| 92.0399 | 101.3639 | 66.3207 |
| 237.8315 | 240.4457 | 244.8124 |
| 120.4648 | 127.5330 | 45.8403 |
| 42.7258 | 63.0666 | 44.3238 |
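The quantization step itself, mapping every pixel to its nearest codeword, can be sketched as follows. The helper name is hypothetical, and `img` stands for any H×W×3 uint8 image; only the nearest-codeword lookup is shown, not the demo's plotting or file handling:

```matlab
function out = vqQuantize(img, codebook)
%VQQUANTIZE Replace each pixel with its nearest codeword (sketch).
%   img: H x W x 3 uint8 image, codebook: K x 3 matrix of RGB codewords
%   out: H x W x 3 uint8 image using only the K codebook colors
[h, w, c] = size(img);
X = double(reshape(img, h * w, c));       % one RGB row vector per pixel
K = size(codebook, 1);
% squared Euclidean distance from every pixel to every codeword
d = repmat(sum(X.^2, 2), 1, K) + repmat(sum(codebook.^2, 2).', h * w, 1) ...
    - 2 * X * codebook.';
[~, idx] = min(d, [], 2);                 % index of the nearest codeword
out = uint8(reshape(codebook(idx, :), h, w, c));
end
```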

### Classification (Supervised) using Kmeans Vector Quantization

This serves as a faster alternative to the K-NN method: distances are compared only against the class representatives (codewords), not against every training sample.

cvpr:demo/cvKmeansDemoClassifi.m
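The core of such a classifier, comparing a test sample only against the class-labeled representatives, might look like this (the function name and signature are hypothetical, not taken from the demo file):

```matlab
function class = cvVqClassify(x, centroid, centroidClass)
%CVVQCLASSIFY Classify x by the class of its nearest representative
%   (hypothetical helper illustrating the demo's idea).
%   x: D x 1 test sample, centroid: D x K representatives,
%   centroidClass: 1 x K class label of each representative
d = sum((centroid - repmat(x, 1, size(centroid, 2))).^2, 1);
[~, k] = min(d);           % only K comparisons, vs. N for plain K-NN
class = centroidClass(k);
end
```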

**Results**

**Observation**

The number of prototypes was drastically reduced, yet classification still succeeded.

For this example, computing one centroid per class (supervised training) should also work with the same recognition rate. However, a single centroid per class can cause poor classification because it does not take within-class variance into account at all.

The K-means algorithm can be used to capture this variance. If we run K-means with k = 3 × (number of classes), it will hopefully create three representatives for each class, which accounts for some of the within-class variance.

Furthermore, if samples from a class are rare, K-means might create only one representative for that class instead of three. This is arguably better than assigning three representatives to every class equally, because it reflects the frequency of occurrence (the prior probability of samples in each class). Of course, it also carries the risk that K-means creates no representative at all for some class.

## Future work

- LBG algorithm

## References

- [1] Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon, "Spoken Language Processing: A Guide to Theory, Algorithm, and System Development," Prentice Hall, January 2001, Chapter 4.4, p. 169.
- [2] TIMIT Speech Database, http://www.mpi.nl/world/tg/corpora/timit/timit.html
- [3] HTK (Hidden Markov Model Toolkit) Speech Recognition Toolkit, http://htk.eng.cam.ac.uk/
- [4] Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon, "Spoken Language Processing: A Guide to Theory, Algorithm, and System Development," Prentice Hall, January 2001, Chapter 9.3.3, pp. 424-426.