Research

next up previous
Next: Estimation with insufficient data Up: Background Previous: Background


Universal data compression and large alphabets

While Huffman or Shannon codes achieve compression of data by using the underlying distribution to assign variable length codewords, in most applications, we do not know the underlying distribution. In such situations, the source is usually modeled as an unknown distribution in a collection of distributions, the collection of i.i.d., markov, context tree sources, depending on the context of the problem.

The question is: how much penalty is incurred if we use one (albeit carefully chosen) encoder for the whole class instead of a separate one for each distribution? This penalty is measured by the redundancy of encoding.

However, Kieffer showed that universal compression of even i.i.d. sources over infinite alphabets incurs infinite redundancy [2]. (We analyze this problem in more detail in [3].)


Prasad Santhanam 2007-12-28