Research

next up previous
Next: New directions Up: RESEARCH STATEMENT Previous: Estimation with insufficient data

New theory

We therefore need new solutions in both problems above. To proceed further we observe that estimation of the probability distribution in the large alphabet setting is a futile task--Kieffer's result mentioned in Section 1.1 can be used to show this formally. Therefore, what we try to estimate is the probability of observing just those events already in the data sample.

In the compression framework, this can be seen to be equivalent to separating [7] the description of the data strings over large alphabets into two parts: description of the symbols appearing in the sequence, and of their pattern, the order in which the symbols appear. For instance, to describe the sequence ``Sam I am that Sam I am I do not like that Sam I am'' over the symbol space of English words, this approach separates the pattern, 123412325674123 from the dictionary,

1 2 3 4 5 6 7
Sam I am that do not like
.
Patterns have an interesting combinatorial structure [3], and using Hardy and Ramanujan's celebrated results on the number of integer partitions [8,9], we show that patterns of i.i.d. sequences can be universally compressed with vanishing redundancy [7], see also [10]. These results have since been extended to more complex distributions as well [11].

These results have their analog in estimation as well. We also analyze a few variants of the Good Turing estimator, showing that while they outperform other commonly used estimators, none is optimal in general. We subsequently develop two provably asymptotically optimal estimators. These results appeared in Science [12].


next up previous
Next: New directions Up: RESEARCH STATEMENT Previous: Estimation with insufficient data
Prasad Santhanam 2007-12-28