Research

next up previous
Next: New theory Up: Background Previous: Universal data compression and


Estimation with insufficient data

Probability estimation over large alphabets is complicated owing to the large number of rare and unseen events whose probabilities cannot be reliably estimated using conventional maximum likelihood approaches. To resolve this difficulty, Good and Turing [4] derived an unintuitive, yet effective, formula for estimating the probability of every symbol appearing in a sequence, as well as the probability of the missing mass. Note that Good and Turing's formula does not attempt to assign a probability to every missing element, it only estimates the probability of the set of all missing elements.

Several estimators based on this formula have since been employed in a variety of applications such as information retrieval, spelling correction, and language modeling. These Good-Turing estimators are known to work very well for rare events. However, while there were justifications [5] for why the estimator should work well, until recently [6] there was not much theoretical analysis on these estimators.

Not only are these estimators a little bit of black magic in the sense that their underlying principles are not always completely understood, they do not seem to always work better in every application. In fact, certain heuristics outperform these estimators in tasks like text classification (naive bayes approaches).


next up previous
Next: New theory Up: Background Previous: Universal data compression and
Prasad Santhanam 2007-12-28