Research


Clustering

Clustering problems arise in data mining, image segmentation (for example, of MRI scans in biology), and bioinformatics. Many of these problems require clustering a set of sequences over a large alphabet after observing only a subset of the sequences. Typically, the alphabet is so large that any subset of practical size is very small by comparison.

Example: In disease diagnosis using protein profiling [23], relevant cell samples are first collected from several subjects. Mass spectrometric techniques are then used to determine the proteins present in each sample, which could number several hundred or more. The samples are then clustered to distinguish diseased ones from normal ones. Since the number of subjects and samples available rarely exceeds the number of proteins in most experiments, we face the familiar large alphabet issue: the alphabet (the set of proteins) renders the available input (the samples from various subjects) effectively small.

See [24] for several other large alphabet clustering problems.

Theoretically, we treat this as a constrained distribution estimation problem. In previous applications, we were concerned with learning a distribution on our sample space. Here we impose an additional constraint on the distributions: for example, the distribution has to be a mixture of Gaussians, a mixture of binomials, and so on. The challenge is to incorporate this constraint into our large alphabet approaches.
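
To make the notion of a constrained family concrete, here is a minimal sketch of fitting a mixture of binomials by the standard EM algorithm and reading cluster labels off the fitted components. It only illustrates what the constraint looks like as a clustering procedure; it is textbook EM, not the large alphabet approach discussed here, and it does nothing about the small-sample issue. The function names, the toy counts, and the initial parameters are all hypothetical.

```python
import math

def binom_logpmf(x, n, theta):
    """Log-probability of x successes in n trials with success rate theta."""
    log_choose = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return log_choose + x * math.log(theta) + (n - x) * math.log(1.0 - theta)

def fit_binomial_mixture(counts, n, thetas, weights, iters=100, eps=1e-9):
    """Standard EM for a k-component binomial mixture (illustrative only).

    counts  : observed counts, each in 0..n
    thetas  : initial success rates, one per component
    weights : initial mixing weights, one per component
    Returns (weights, thetas, responsibilities).
    """
    thetas, weights = list(thetas), list(weights)
    k = len(thetas)
    resp = []
    for _ in range(iters):
        # E-step: posterior probability of each component for each count.
        resp = []
        for x in counts:
            logs = [math.log(weights[j]) + binom_logpmf(x, n, thetas[j])
                    for j in range(k)]
            top = max(logs)  # subtract the max for numerical stability
            probs = [math.exp(v - top) for v in logs]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate weights and success rates from the posteriors.
        for j in range(k):
            mass = sum(r[j] for r in resp)
            weights[j] = max(mass / len(counts), eps)
            rate = sum(r[j] * x for r, x in zip(resp, counts)) / (n * max(mass, eps))
            thetas[j] = min(max(rate, eps), 1.0 - eps)  # keep logs finite
        norm = sum(weights)
        weights = [w / norm for w in weights]
    return weights, thetas, resp

# Toy data (hypothetical): six low counts and six high counts out of n = 20.
toy_counts = [1, 2, 1, 0, 2, 1, 18, 19, 17, 20, 18, 19]
w, t, r = fit_binomial_mixture(toy_counts, 20, thetas=[0.3, 0.7], weights=[0.5, 0.5])
labels = [row.index(max(row)) for row in r]  # hard cluster assignment
print(labels)
```

Restricting the estimate to this family is what turns distribution estimation into clustering: each observation is assigned to the component with the largest posterior probability.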


Prasad Santhanam 2007-12-28