Forward and Reverse KL Divergence

2020/06/26

Reading time: 3 minutes.

KL-divergence is given by: $$ D_{KL}(P||Q) = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right] $$

It is not symmetric in general, i.e. $D_{KL}(P||Q) \neq D_{KL}(Q||P)$, so it cannot be used as a distance metric.

It is always non-negative, i.e. $D_{KL} \geq 0$. Also, if there is a point $x$ where $Q(x) = 0$ but $P(x) > 0$, then $D_{KL}(P||Q) = \infty$. Thus, we need the support of $P$ to lie completely within the support of $Q$ for the KL-divergence to be finite.
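As a quick sanity check of these properties, here is a small numerical sketch with two made-up discrete distributions (the arrays `P` and `Q` and the `kl` helper below are purely illustrative):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions; inf if q = 0 somewhere p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

P = np.array([0.5, 0.4, 0.1])
Q = np.array([0.3, 0.3, 0.4])

print(kl(P, Q), kl(Q, P))                    # two different non-negative values
print(kl(P, P))                              # 0.0 for identical distributions
print(kl([0.5, 0.5, 0.0], [0.5, 0.0, 0.5]))  # inf: support of P not contained in support of Q
```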

If $P$ and $Q$ are equal almost everywhere, then $D_{KL}(P||Q) = D_{KL}(Q||P) = 0$; conversely, if $D_{KL}(P||Q) = 0$ or $D_{KL}(Q||P) = 0$, then $P$ and $Q$ are equal almost everywhere.

There is a relation between KL-divergence and cross entropy: $D_{KL}(P||Q) = \mathbb{E}_{x \sim P}[-\log Q(x)] - H(P)$, where the first term is the well-known cross entropy and the second term is the entropy of $P$. Since $H(P)$ does not depend on $Q$, if we fix $P$, minimizing the KL-divergence over $Q$ is the same as minimizing the cross entropy.
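Reusing `P`, `Q`, and `kl` from the snippet above, this identity can be verified numerically (again, just an illustrative check):

```python
cross_entropy = -np.sum(P * np.log(Q))   # E_{x~P}[-log Q(x)]
entropy = -np.sum(P * np.log(P))         # H(P)
print(np.isclose(kl(P, Q), cross_entropy - entropy))  # True
```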

Say we are minimizing the KL-divergence between a fixed unknown distribution $P$ and a parametric distribution $Q_{\theta}$. We aim to find parameters $\theta$ that minimize the KL-divergence. There are two natural objectives:

  1. Forward KL: $D_{KL}(P||Q_\theta)$
  2. Reverse KL: $D_{KL}(Q_\theta||P)$

Forward KL

Simplifying the forward KL objective gives,

$$D_{KL}(P||Q_\theta) = \mathbb{E}_{x \sim P}[-\log Q_{\theta}(x)] - H(P)$$

Thus minimizing the forward KL objective w.r.t. $\theta$ is the same as minimizing the cross entropy term. In other words, for points sampled from $P$, we want to maximize the likelihood of those points under $Q_\theta$.

This is simply maximum likelihood estimation, and it results in mean-seeking behavior: $Q_\theta$ tries to cover all the high-probability regions of $P$, and can therefore spread itself across the modes of $P$.
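A rough way to see the mean-seeking behavior is to take $P$ to be a bimodal Gaussian mixture, let $Q_\theta$ be a single Gaussian with parameters $(\mu, \sigma)$, and minimize the forward KL by brute-force grid search. All specific numbers below are illustrative, and `scipy` is assumed to be available:

```python
import numpy as np
from scipy.stats import norm

xs = np.linspace(-10, 10, 2001)
dx = xs[1] - xs[0]
p = 0.5 * norm.pdf(xs, -4, 1) + 0.5 * norm.pdf(xs, 4, 1)   # two well-separated modes

def forward_kl(mu, sigma):
    q = norm.pdf(xs, mu, sigma)
    return np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))) * dx

params = [(mu, sigma) for mu in np.linspace(-6, 6, 61) for sigma in np.linspace(0.5, 6, 56)]
best = min(params, key=lambda t: forward_kl(*t))
print(best)   # mu near 0 with a large sigma: Q spreads itself across both modes
```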

Reverse KL

Simplifying the reverse KL objective gives,

$$D_{KL}(Q_\theta||P) = \mathbb{E}_{x \sim Q_\theta}[-\log P(x)] - H(Q_\theta)$$

Thus minimizing the reverse KL objective w.r.t. $\theta$ tries to maximize the likelihood of points sampled from $Q$ under $P$ while maximizing the entropy of $Q$.

Thus, we are maximizing the likelihood of samples from $Q_\theta$ under $P$ while also maximizing the entropy of $Q_\theta$. The entropy term is crucial, as it prevents $Q_\theta$ from collapsing into a delta distribution at a mode of $P$. This results in mode-seeking behavior, where $Q_\theta$ finds a mode of $P$ with high probability mass and concentrates around it.
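The same grid-search sketch as in the forward KL section, but now minimizing the reverse KL (this reuses `xs`, `dx`, `p`, `params`, and `norm` from the previous snippet):

```python
def reverse_kl(mu, sigma):
    q = norm.pdf(xs, mu, sigma)
    return np.sum(q * (np.log(q + 1e-12) - np.log(p + 1e-12))) * dx

best = min(params, key=lambda t: reverse_kl(*t))
print(best)   # mu near -4 or +4 with sigma near 1: Q concentrates on a single mode
```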

Other things of note

Forward KL requires only samples from the unknown $P$ to optimize the objective, while reverse KL requires the ability to sample from the parametric distribution $Q_\theta$ (which is typically easy) and a way to evaluate the likelihood under the unknown distribution $P$ (which is often difficult).
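Below is a minimal sketch of how the two objectives are typically estimated by Monte Carlo: forward KL only needs samples from $P$ (the $H(P)$ term is a constant w.r.t. $\theta$), whereas reverse KL needs samples from $Q_\theta$ plus evaluations of $\log P$. Everything here (the standard-normal stand-in for $P$, the Gaussian $Q_\theta$, the sample size) is made up for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
log_p = lambda x: norm.logpdf(x, 0.0, 1.0)    # stand-in for the "unknown" P
mu, sigma = 0.5, 1.5                          # parameters of Q_theta
log_q = lambda x: norm.logpdf(x, mu, sigma)

# Forward KL (up to the constant H(P)): average -log Q_theta over samples from P.
x_p = rng.normal(0.0, 1.0, size=100_000)
forward_objective = np.mean(-log_q(x_p))

# Reverse KL: average log Q_theta - log P over samples from Q_theta.
x_q = rng.normal(mu, sigma, size=100_000)
reverse_objective = np.mean(log_q(x_q) - log_p(x_q))

print(forward_objective, reverse_objective)
```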

References

Mostly followed this great blog post and this one.
