# Shannon's Source Coding Theorem (Foundations of information theory: Part 3)

*by* Phil Tadros

*The mathematical field of information theory attempts to mathematically describe the concept of "information". In the first two posts, we discussed the concepts of self-information and information entropy. In this post, we step through Shannon's Source Coding Theorem to see how the information entropy of a probability distribution describes the best-achievable efficiency required to communicate samples from the distribution.*

## Introduction

So far, we have discussed how, intuitively, according to information theory, "information" can be interpreted as the amount of surprise experienced when we learn the outcome of some random process. Information theory expresses this "surprise" by counting the minimum number of symbols that would be required to communicate the outcome of this random process to another person/agent.

This idea of measuring "surprise" by some number of "symbols" is made concrete by Shannon's Source Coding Theorem. The theorem tells us that if we wish to communicate samples drawn from some distribution, then on average, we will require at least as many symbols as the entropy of that distribution in order to unambiguously communicate those samples. Said differently, the theorem tells us that the entropy provides a lower bound on how much we can compress our description of the samples from the distribution before we inevitably lose information ("information" used here in the colloquial sense).

In this post, we will walk through Shannon's theorem. My understanding of this material came, in part, from watching this excellent series of videos by mathematicalmonk on YouTube. This post attempts to distill much of the information presented in those videos in my own words, covering only the parts of the explanation necessary to get through the theorem.

## Encoding and communicating samples from a distribution

Recall from the previous post, we discussed how the information entropy of a random variable tells you, on average, the minimum number of symbols that you will need to use to communicate the outcome of that random variable. That is, let us say we have two people, Person A and Person B, who are attempting to communicate with one another. Specifically, Person A is observing samples, drawn one at a time, from some distribution \(X\). Person A then wishes to communicate each sample to Person B. For example, \(X\) might be a die, and Person A wishes to communicate to Person B the outcomes of repeated die rolls.

This framework is depicted below:

Let us carefully describe the mathematical framework in which Person A is communicating messages to Person B. To do so, we will introduce a number of fundamental concepts from Coding Theory.

First, instead of calling each draw from \(X\) a "sample", we will instead call each draw a **symbol**. The idea here is that Person A is going to come up with some random sequence of symbols, each drawn from a distribution \(X\), and attempt to communicate this random message, composed of the random sequence of symbols, to Person B. We will call this random message the **sequence of source symbols**:

\[X_1, X_2, X_3, \dots, X_m\]

The idea of each \(X_i\) being a "symbol" intuitively implies that the number of outcomes that each \(X_i\) can take on is finite. To make this concrete, we will assert that \(X\) is a categorical distribution. That is,

\[X_1, X_2, X_3, \dots, X_m \overset{\text{i.i.d.}}{\sim} \text{Cat}(\boldsymbol{\theta})\]

such that

\[\forall i, \ X_i \in \mathcal{X}\]

where \(\mathcal{X}\) is a finite set of symbols called the **source alphabet** and \(\boldsymbol{\theta}\) is a vector describing the probabilities that a given \(X_i\) will take on a given symbol in \(\mathcal{X}\).
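To make this setup tangible, here is a minimal sketch of Person A's side of the framework. The alphabet and probability vector below are hypothetical, chosen just for illustration:

```python
import random

random.seed(42)

source_alphabet = ["a", "b", "c"]  # the source alphabet (script X)
theta = [0.5, 0.3, 0.2]            # theta: P(X_i = x) for each symbol x

# Person A draws m i.i.d. source symbols X_1, ..., X_m ~ Cat(theta).
m = 10
source_symbols = random.choices(source_alphabet, weights=theta, k=m)
print(source_symbols)
```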

Before communicating each source symbol \(X_i\) to Person B, Person A will first *encode* each symbol using a **code function** \(C\). The code function \(C\) takes as input a source symbol and outputs a sequence of symbols from a new set of symbols called the **code alphabet**, denoted \(\mathcal{A}\). The code is called a **\(b\)-ary code** if the size of the code alphabet is \(b\). That is, if \(\vert\mathcal{A}\vert = b\). For example, if \(\mathcal{A}\) consists of two symbols, we call the code 2-ary (a.k.a. binary).

To make this concrete, let's look at a simplified, 2-ary version of Morse Code for encoding the English alphanumeric symbols into two symbols: "dots" and "dashes". In Morse Code, the source alphabet \(\mathcal{X}\) consists of the alphanumeric symbols, the code alphabet \(\mathcal{A}\) consists of only a dot and a dash, and the code function \(C\) maps each alphanumeric symbol to a sequence of dots and dashes. More specifically:

For example, the name "Morse" would be encoded as follows:

Stated more rigorously, if we denote \(\mathcal{A}^*\) to be the *set of all finite sequences* of code symbols,

\[\mathcal{A}^* := \bigcup_{n=1}^{\infty} \mathcal{A}^n\]

then \(C\) is a function

\[C : \mathcal{X} \rightarrow \mathcal{A}^*\]

We call each element \(\alpha \in C(\mathcal{X})\) a **code word**, where \(C(\mathcal{X})\) is the image of \(C\). We denote the length of a code word \(\alpha \in C(\mathcal{X})\) as \(\vert\alpha\vert\). In the above example, the length of the code word "\(\cdot\) – \(\cdot\)" (i.e., \(C(X_3)\)) is simply 3.
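The Morse example above can be sketched in a few lines of code, using the standard International Morse encodings for just the five letters of "Morse":

```python
# Code function C for a tiny fragment of Morse Code. "." is a dot and
# "-" is a dash, so the code alphabet A has size b = 2.
MORSE = {
    "M": "--",
    "O": "---",
    "R": ".-.",
    "S": "...",
    "E": ".",
}

def encode(source_symbols):
    """Apply C to each source symbol, yielding one code word per symbol."""
    return [MORSE[s] for s in source_symbols]

code_words = encode("MORSE")
print(code_words)          # ['--', '---', '.-.', '...', '.']
print(len(code_words[2]))  # |C(X_3)| = 3: the length of R's code word
```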

For the purposes of our discussion, we will focus only on **uniquely decodable** code functions. A code function is uniquely decodable if its extension to sequences of source symbols (mapping each sequence to the concatenation of its code words) is an invertible function. Stated plainly, if a code \(C\) is uniquely decodable, then we can always decode an encoded message unambiguously into the original sequence of source symbols. Most codes used in practice are uniquely decodable. A non-uniquely decodable code would not be very useful, since Person B, who receives the encoded message from Person A, would be unable to unambiguously decode Person A's message.
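To see why unique decodability matters, here is a small, hypothetical counterexample: a code in which two different source sequences encode to the same string, so Person B cannot invert the encoding.

```python
# A code that is NOT uniquely decodable: "c" encodes to the same string
# as the sequence "a" followed by "b".
C = {"a": "0", "b": "1", "c": "01"}

def encode(msg):
    return "".join(C[s] for s in msg)

print(encode("ab"))  # 01
print(encode("c"))   # 01 -- ambiguous: Person B cannot tell which was sent

# By contrast, a prefix-free code (no code word is a prefix of another)
# is always uniquely decodable.
C_prefix = {"a": "0", "b": "10", "c": "11"}

def encode_prefix(msg):
    return "".join(C_prefix[s] for s in msg)

print(encode_prefix("ab"))  # 010
print(encode_prefix("c"))   # 11
```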

## Shannon's Source Coding Theorem

Given some categorical distribution \(X\), Shannon's Source Coding Theorem tells us that no matter what \(C\) you choose, the smallest achievable *expected code word length* is the entropy of \(X\). That is, for every uniquely decodable code \(C\),

\[E[\vert C(X) \vert] \geq H(X)\]

More formally:

**Theorem 1 (Shannon's Source Coding Theorem):** Given a categorical random variable \(X\) over a finite source alphabet \(\mathcal{X}\) and a code alphabet \(\mathcal{A}\), for all uniquely decodable \(C : \mathcal{X} \rightarrow \mathcal{A}^*\), it holds that \(E[\vert C(X)\vert] \geq H(X)\), where the entropy is computed using logarithms of base \(\vert\mathcal{A}\vert\).

The expected code word length of \(X\) under some code \(C\) tells us how efficiently one can "compress" the information in \(X\). If, on average, \(C\) is able to produce short code words for each symbol drawn from \(X\), then we are able to communicate those symbols more efficiently. Shannon's Source Coding Theorem tells us that the entropy of \(X\) is, in some sense, the true "information content" of the random variable, because there is no \(C\) that will let you compress \(X\) past \(X\)'s entropy.
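A quick numerical sanity check of the bound (a sketch; the distributions and code word lengths below are made up for illustration):

```python
import math

def entropy(p, b=2):
    """H(X) in base b: the lower bound on the expected code word length."""
    return sum(pi * math.log(1.0 / pi, b) for pi in p if pi > 0)

def expected_length(p, lengths):
    """E[|C(X)|] for a code whose i-th code word has length lengths[i]."""
    return sum(pi * li for pi, li in zip(p, lengths))

# Uniform distribution over 4 symbols: the 2-bit code 00, 01, 10, 11
# meets the entropy bound exactly.
p_uniform = [0.25, 0.25, 0.25, 0.25]
assert math.isclose(entropy(p_uniform), 2.0)
assert math.isclose(expected_length(p_uniform, [2, 2, 2, 2]), 2.0)

# A skewed distribution: the code 0, 10, 110, 111 (lengths 1, 2, 3, 3)
# beats the 2-bit code but still cannot beat the entropy.
p_skewed = [0.5, 0.25, 0.125, 0.125]
assert math.isclose(entropy(p_skewed), 1.75)
assert math.isclose(expected_length(p_skewed, [1, 2, 3, 3]), 1.75)
```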

## Proof

To prove the theorem, we will make use of another result (which we will prove in a separate blog post): the converse of the Kraft-McMillan Inequality. This theorem goes as follows:

**Theorem 2 (Converse of the Kraft-McMillan Inequality):** Given a finite source alphabet \(\mathcal{X} := \{x_1, x_2, \dots, x_m\}\), an integer \(B\), and a function

\[\ell : \mathcal{X} \rightarrow \mathbb{Z}^+\]

(where \(\mathbb{Z}^+\) is the set of strictly positive integers) such that

\[\sum_{i=1}^m \frac{1}{B^{\ell(x_i)}} \leq 1\]

then there exists a \(B\)-ary uniquely decodable code \(C\) for which \(\vert C(x) \vert = \ell(x)\) for every \(x \in \mathcal{X}\).

Basically, this says that if you have a set of integers \(\{\ell(x) \mid x \in \mathcal{X}\}\) that satisfy a certain inequality, then there exists a uniquely decodable code \(C\) that maps each source symbol \(x \in \mathcal{X}\) to a code word of length \(\ell(x)\).
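The constructive direction can be sketched concretely for the binary case. The function below is a standard "canonical code" construction (not from this post's proof): given integer lengths that satisfy the Kraft inequality, it produces a prefix-free, and therefore uniquely decodable, code with exactly those lengths.

```python
def canonical_code(lengths, B=2):
    """Given code word lengths (sorted ascending) that satisfy the Kraft
    inequality, construct a prefix-free code with exactly those lengths.
    Sketch for the binary case (B = 2) only."""
    assert B == 2 and lengths == sorted(lengths)
    assert sum(B ** (-l) for l in lengths) <= 1
    words, code, prev_len = [], 0, 0
    for l in lengths:
        code <<= (l - prev_len)  # extend the running code to the new length
        words.append(format(code, "b").zfill(l))
        code += 1
        prev_len = l
    return words

words = canonical_code([1, 2, 3, 3])
print(words)  # ['0', '10', '110', '111']

# No code word is a prefix of another, so decoding is unambiguous.
assert not any(
    w1 != w2 and w2.startswith(w1) for w1 in words for w2 in words
)
```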

The Kraft-McMillan Inequality lets us formulate an optimization problem that minimizes the expected code word length over hypothetical code functions \(C\) that produce code words of length \(\ell(x)\). That is, where \(\ell(x) = \vert C(x) \vert\). This optimization problem is as follows:

\[\underset{\ell}{\text{min}} \ E[\ell(X)]\]

subject to

\[\sum_{x \in \mathcal{X}} \frac{1}{B^{\ell(x)}} \leq 1\]

The constraint in this optimization problem ensures that for any set of code word lengths we consider, according to the Kraft-McMillan Inequality, there will exist a valid uniquely decodable code!

To make the notation a bit easier to deal with, let us order the elements of \(\mathcal{X}\) and let \(x_1, x_2, \dots, x_m\) denote each element of \(\mathcal{X}\). Let \(\ell_i := \ell(x_i)\). Finally, let \(p_i := P(X = x_i)\). Now the optimization problem becomes:

\[\underset{\ell_1, \ell_2, \dots, \ell_m \in \mathbb{Z}^+}{\text{min}} \sum_{i=1}^m \ell_i p_i\]

subject to

\[\sum_{i=1}^m \frac{1}{B^{\ell_i}} \leq 1\]

Before we solve this optimization problem, we note that because we require the \(\ell_1, \ell_2, \dots, \ell_m\) to be integers, this problem, called an integer program, is hard to solve due to the discrete nature of the feasible set. We will thus *relax* the problem by allowing the \(\ell_1, \ell_2, \dots, \ell_m\) to be any real numbers instead of requiring them to be integers. Thus, the optimization problem becomes:

\[\underset{\ell_1, \ell_2, \dots, \ell_m \in \mathbb{R}}{\text{min}} \sum_{i=1}^m \ell_i p_i\]

subject to

\[\sum_{i=1}^m \frac{1}{B^{\ell_i}} \leq 1\]

You might notice a problem with this relaxation. What if the solution to this problem has fractional or negative values for any of the \(\ell_1, \ell_2, \dots, \ell_m\)? What does it mean to have a negative code word length? Indeed, this is concerning; however, we note that our relaxation has *expanded* the feasible set from only positive integers to *all* real numbers. Thus, the objective function evaluated at the solution to this relaxed problem will be *at least as small* as the objective function evaluated at the solution to the non-relaxed problem. Thus, this solution will still give us a lower bound on the expected code word length!

Okay, so our goal right now is to solve this relaxed optimization problem. How do we do it? We are now going to show that this optimization problem can be re-formulated into an equivalent one in which the inequality in the constraint becomes an equality:

\[\underset{\ell_1, \ell_2, \dots, \ell_m \in \mathbb{R}}{\text{min}} \sum_{i=1}^m \ell_i p_i\]

subject to

\[\sum_{i=1}^m \frac{1}{B^{\ell_i}} = 1\]

To show that this is indeed equivalent, we will use a quick proof by contradiction. Let \(\ell_1^*, \ell_2^*, \dots, \ell_m^*\) be the values of \(\ell_1, \ell_2, \dots, \ell_m\) that solve the relaxed optimization problem. Now let's assume, for the sake of contradiction, that this solution is such that the summation in the constraint is strictly less than one:

\[\sum_{i=1}^m \frac{1}{B^{\ell_i^*}} < 1\]

What would this assumption imply? First, it implies that every \(\ell_i^*\) *must* be strictly greater than 0. To see why, assume that for some \(i\), \(\ell_i^* \leq 0\). Under this scenario,

\[\frac{1}{B^{\ell_i^*}} \geq 1 \implies \sum_{j=1}^m \frac{1}{B^{\ell_j^*}} \geq 1\]

which breaks our assumption. So under this assumption, each \(\ell_i^*\) is strictly positive.

Now, let's look at the objective function. Because we assume that \(\ell_1^*, \ell_2^*, \dots, \ell_m^*\) is a solution, it minimizes the objective function \(\sum_{i=1}^m \ell_i p_i\). However, because the constraint sum is strictly less than one, we have some slack to work with: we can choose new values for each \(\ell_i\), which we denote \(\ell_i^{**}\), such that \(0 < \ell_i^{**} < \ell_i^*\), keeping each \(\ell_i^{**}\) close enough to \(\ell_i^*\) that the constraint still holds. If we do so, then the objective function will be smaller for these new values. That is,

\[0 < \ell_i^{**} < \ell_i^* \implies \sum_{i=1}^m \ell_i^{**} p_i < \sum_{i=1}^m \ell_i^* p_i\]

Because the \(\ell_i^{**}\) values achieve a smaller objective, \(\ell_1^*, \dots, \ell_m^*\) cannot be the true solution, which is a contradiction! Thus, our original assumption that the solution leads to \(\sum_{i=1}^m \frac{1}{B^{\ell_i^*}} < 1\) must be wrong. Indeed, it must be the case that

\[\sum_{i=1}^m \frac{1}{B^{\ell_i^*}} = 1\]

So far, we have made a series of changes to this optimization problem to make it ever easier to solve. Can we go further? Let's do a quick change of variables and let

\[q_i := \frac{1}{B^{\ell_i}}\]

This then implies that

\[\ell_i = \log_B \frac{1}{q_i}\]

which leads to a new form of the optimization problem:

\[\underset{q_1, q_2, \dots, q_m \in \mathbb{R}^+}{\text{min}} \sum_{i=1}^m p_i \log_B \frac{1}{q_i}\]

subject to

\[\sum_{i=1}^m q_i = 1\]

where \(\mathbb{R}^+\) is the set of strictly positive real numbers. This optimization problem can now easily be solved using Lagrange multipliers!

To do so, we form the Lagrangian:

\[\mathcal{L}(q_1, \dots, q_m, \lambda) := \sum_{i=1}^m p_i \log_B \frac{1}{q_i} + \lambda\left(\sum_{i=1}^m q_i - 1\right)\]

We now compute the partial derivative with respect to each \(q_i\) and set it to 0:

\[\begin{align*} \frac{\partial}{\partial q_i} \left[ \sum_{j=1}^m p_j \log_B \frac{1}{q_j} + \lambda\left(\sum_{j=1}^m q_j - 1\right) \right] &= 0 \\ \implies -\frac{p_i}{q_i \log B} + \lambda &= 0 \\ \implies q_i &= \frac{p_i}{\lambda \log B} \end{align*}\]

Now, with the equation \(\sum_{i=1}^m q_i = 1\) and each partial derivative set to zero, we solve for the \(q_i\) and \(\lambda\). Plugging \(q_i = \frac{p_i}{\lambda \log B}\) into the equation \(\sum_{i=1}^m q_i = 1\), we get

\[\sum_{i=1}^m \frac{p_i}{\lambda \log B} = 1 \implies \lambda \log B = \sum_{i=1}^m p_i = 1 \implies \lambda = \frac{1}{\log B}\]

Finally, we plug this back into \(q_i = \frac{p_i}{\lambda \log B}\) and see that

\[\forall i, \ q_i = p_i\]

Recall that each \(q_i\) is a variable that we substituted for \(\frac{1}{B^{\ell_i}}\), which means that our final solution is given by

\[\ell_i^* = \log_B \frac{1}{p_i}\]

Plugging this solution into the objective function, we arrive at our final lower bound on the expected code word length! It is simply

\[\sum_{i=1}^m p_i \log_B \frac{1}{p_i}\]

This is precisely the entropy of \(X\)!
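As a closing sanity check, we can verify the solution numerically: the lengths \(\ell_i^* = \log_B \frac{1}{p_i}\) satisfy the Kraft-McMillan constraint with equality, the objective evaluates to the entropy, and (by Gibbs' inequality) no other choice of \(q\) on the probability simplex does better. The distribution below is made up for illustration.

```python
import math
import random

B = 2
p = [0.5, 0.25, 0.125, 0.125]

# The optimal (relaxed) lengths from the proof: l_i = log_B(1 / p_i).
lengths = [math.log(1.0 / pi, B) for pi in p]

# They satisfy the Kraft-McMillan constraint with equality...
assert math.isclose(sum(B ** (-l) for l in lengths), 1.0)

# ...and the objective sum_i p_i * l_i equals the entropy of X.
objective = sum(pi * li for pi, li in zip(p, lengths))
entropy = sum(pi * math.log(1.0 / pi, B) for pi in p)
assert math.isclose(objective, entropy)

# Gibbs' inequality: any other q on the simplex gives an objective at
# least as large as the entropy, so q = p really is the minimizer.
random.seed(0)
for _ in range(1000):
    q = [random.random() + 1e-9 for _ in p]
    total = sum(q)
    q = [qi / total for qi in q]
    obj_q = sum(pi * math.log(1.0 / qi, B) for pi, qi in zip(p, q))
    assert obj_q >= entropy - 1e-9
```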