Probability estimation over large alphabets is complicated owing to
the large number of rare and unseen events whose probabilities cannot
be reliably estimated using conventional maximum likelihood
approaches. To resolve this difficulty, Good and Turing [4]
derived an unintuitive, yet effective, formula for estimating the
probability of every symbol appearing in a sequence, as well as the
probability of the missing mass. Note that Good and Turing's formula
does not attempt to assign a probability to every missing element,
it only estimates the probability of the set of all missing elements.
Several estimators based on this formula have since been employed in a
variety of applications such as information retrieval, spelling
correction, and language modeling. These Good-Turing estimators
are known to work very well for rare events. However, while there were
justifications [5] for why the estimator should work well,
until recently [6] there was not much theoretical analysis on
these estimators.
Not only are these estimators a little bit of black magic in the sense
that their underlying principles are not always completely understood,
they do not seem to always work better in every application. In fact,
certain heuristics outperform these estimators in tasks like text
classification (naive bayes approaches).