Research


We live amidst complex systems such as the Internet as well as advances like speech recognition, Mars missions and genome mapping, made possible by the unprecedented communication, computation, and storage resources available today. Modern research attempts to further these advances, and simultaneously harness available knowledge and data better.

Two aspects of this effort stand out.

First, while much statistical work has addressed sequences of symbols that are far longer than the alphabet size (the number of possible symbols), a number of current research problems require solutions for the opposite scenario, the large alphabet setting.

For instance, language models for speech recognition estimate distributions over English words from text samples that are small relative to the vocabulary. In modern medicine, thousands of genes are clustered by their expression levels for applications in diagnosis and drug response prediction.

Adapting classical results to large alphabets is unsatisfactory in many cases, and we are therefore forced to rework these problems altogether.

Second, problems posed by different systems are interconnected. But progress on interconnected problems does not always match up, typically due to a lack of communication between the relevant fields.

For example, consider text compression on the one hand and, on the other, language modeling for speech recognition, which requires estimating word probabilities from a text sample. It is folklore that compression and estimation are closely linked, but some very commonly used probability estimators in the machine learning community had not even been considered from a compression perspective until recently [1].
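As a rough illustration of the link between estimation and compression, the sketch below computes the ideal code length, in bits, that results when each symbol of a sequence is coded using a probability estimate formed from the symbols before it. The add-one (Laplace) estimator is used here only because it is familiar; it is an assumption of this example and not necessarily one of the estimators studied in [1]. The function name `add_one_code_length` is hypothetical.

```python
import math
from collections import Counter


def add_one_code_length(sequence, alphabet_size):
    """Ideal code length in bits when each symbol is coded with the
    add-one (Laplace) estimate built from the symbols seen before it.

    Illustrative assumption: the estimator and this function name are
    chosen for the example and are not taken from the source.
    """
    counts = Counter()
    total_bits = 0.0
    for n, symbol in enumerate(sequence):
        # Add-one estimate of the next symbol's probability after n symbols.
        p = (counts[symbol] + 1) / (n + alphabet_size)
        # A symbol of probability p costs -log2(p) bits under an ideal code.
        total_bits += -math.log2(p)
        counts[symbol] += 1
    return total_bits


# Same sequence, coded under a small and a large assumed alphabet.
small = add_one_code_length("abababab", alphabet_size=2)
large = add_one_code_length("abababab", alphabet_size=10_000)
print(f"alphabet of 2: {small:.2f} bits, alphabet of 10000: {large:.2f} bits")
```

A better estimator yields a shorter code and vice versa. In this toy case the same eight symbols cost far more bits when the assumed alphabet is much larger than the sequence, which is the large alphabet difficulty described earlier.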

The theoretical focus of this statement will be the intersection of statistical learning and information theory. In particular, we concentrate on probability estimation, inference, classification and clustering in large alphabet settings. Practical applications include text classification and compression, image segmentation, analysis of gene expression data, and risk analysis for network insurance policies.

Many principles behind the approaches we take for the above problems are not just new and theoretically interesting, but have also proved to be general in nature. As a demonstration, text classifiers based on these principles are comparable to state-of-the-art results (and faster) on several real-life benchmarks, using nothing but simple-minded implementations of our algorithms, with no tuning and no tweaking.




Prasad Santhanam 2007-12-28