**1. Gibbs inequality**

Let $P = \{p_x\}_{x \in X}$ and $Q = \{q_x\}_{x \in X}$ be two probability distributions on a finite set $X$. I simply assume that the $q_x$ are all positive, $q_x > 0$. The Gibbs inequality states that

$$

- \sum_{x \in X} p_x \cdot \log p_x \le - \sum_{x \in X} p_x \cdot \log q_x

$$

with equality iff the two distributions are identical. We use the convention that $x\cdot \log x = 0$ if $x = 0$. This result follows from the classic technique of convex inequality (a.k.a Jensen's inequality). Indeed, the function $f(x) = x\cdot \log x$ is convex on the positive reals, so

$$

\begin{align*}

\sum_x p_x \cdot \log \frac{p_x}{q_x} &= \sum_x q_x \cdot \frac{p_x}{q_x} \cdot \log \frac{p_x}{q_x} \\

&= \sum_x q_x \cdot f\left( \frac{p_x}{q_x}\right) \\

&\ge f\left(\sum_x q_x \frac{p_x}{q_x}\right) \\

&\ge f(1) = 0

\end{align*}

$$

**2. Binomial and entropy**

Varying the probability distributions in the Gibbs inequality can yield quite surprising result. For instance, let $n$ be a positive integer, and $0 \le p \le 1$ a real number such that $n\cdot p$ is an integer. Then

$$

\begin{equation*}

\binom{n}{n p} \le 2^{n\cdot h(p)}

\end{equation*}

$$

where $h(p) = - p\cdot \log - (1-p)\cdot \log (1-p)$. The function $h$ represents the entropy of a coin with bias $p$. The intuition behind this inequality is the following: to identify a subset of $np$ elements in a set of $n$ elements $\{1,2,\dots,n\}$, you need at least $\log \binom{n}{np}$ bits. Now, there is an approximate procedure to identify such a subset: for each element $i = 1,2,\dots,n$, you toss the bias coin, and add the element $i$ to your subset iff the coin turns, e.g., head. With such a procedure, on average, you identify a subset of size $np$, and you could associate with this subset the corresponding sequence of coin tosses as an identifier. By Shannon's source coding theorem, you can store such sequences of coin tosses using about $n\cdot h(p)$ bits. The inequality states that the latter procedure uses (slightly) more bits than necessary.

The proof goes as follows. Let $X_1,\dots,X_n$ be independent random variables following the same distribution

$$

\begin{align*}

\mathbb{P}(X_i = 1) &= p \\

\mathbb{P}(X_i = 0) &= 1-p

\end{align*}

$$

Let $X = (X_1,\dots,X_n)$ be the joint variable, with values in $\{0,1\}^n$. Consider also the uniform random variable $U$ with values in the subset of vectors $u \in \{0,1\}^n$ having exactly $k = np$ entries set to $1$ (and the others to $0$). We denote by $|u|$ the number of entries set to $1$ in $u$. The probability distribution of $U$ is given by, for any $u \in \{0,1\}^n$,

$$

\begin{equation*}

\mathbb{P}(U = u) = \left\{ \begin{array}{cc}

0 & \text{ if } |u| \ne k\\

\binom{n}{k}^{-1} & \text{ otherwise}

\end{array}

\right.

\end{equation*}

$$

Applying the Gibbs inequality to the distributions of $X$ and $U$, we get

$$

\begin{align*}

\sum_{u \in \{0,1\}^n} - \mathbb{P}(U=u) \cdot \log \mathbb{P}(U = u)

&\le

\sum_{u \in \{0,1\}^n} - \mathbb{P}(U=u) \cdot \log \mathbb{P}(X = u)\\

\end{align*}

$$

$$

\begin{align*}

\log \binom{n}{k}

&\le

\sum_{u,~ |u|=k} - \binom{n}{k}^{-1} \cdot \log p^k(1-p)^{n-k}\\

&\le -k \cdot \log p -(n-k)\cdot\log (1-p) \\

&\le n \cdot h(p) ~~~~\text{ (since } k = np \text{)}.

\end{align*}

$$

**3. The bound is quite tight**

The bound on $\binom{n}{np}$ is quite tight. Indeed, Stirling's formula states that:

$$

n! \sim \sqrt{2\pi n}\cdot n^n \cdot e^{-n}

$$

Applying this formula to the binomial, we get (with $q = 1-p$)

$$

\begin{align*}

\binom{n}{np} &= \frac{n!}{(np)!(nq)!} \\

&= \frac{\sqrt{2\pi n}\cdot n^n \cdot e^{-n}}{\sqrt{2\pi n p}\cdot (np)^{np} \cdot e^{-np}\cdot \sqrt{2\pi nq}\cdot (nq)^{nq} \cdot e^{-nq}}\\

&= \frac{2^{n\cdot h(p)}}{\sqrt{2\pi \cdot n \cdot p\cdot (1-p)}}.

\end{align*}

$$

Taking the logarithm of both sides, we see that

$$

\frac{1}{n} \log \binom{n}{np} \rightarrow h(p) \text{ as } n \rightarrow \infty

$$

This means that, as $n$ gets larger and larger, the ``coin-tossing procedure'' to identify a subset of size $np$ in $n$ gets more and more precise, in the sense that it uses the optimal number of bits $\log \binom{n}{np} \simeq n \cdot h(p)$.

pb

## No comments:

## Post a Comment