Reading Note: Variational Information Bottleneck (VIB)

1 minute read

Published:


Introduction

Let’s think about a Markov chain. First, we have some targets $Y$ (for an example : labels of each image), we have some input source $X$ (for an example : images), let $X$ be encoded by a stochastic encoder parameterized by $\theta$ get the latent variable $Z$. Based on the description, we can get the Markov chain:

$\begin{align} Y\rightarrow X \rightarrow Z \end{align}$

Information bottlececk tries to solve the following problem:

How should we design $\theta$ such that : 1) $Z$ can be forgettable to $X$ (Z is compressive w.r.t $X$); 2)On the other hand, $Z$ preserves information of $X$ as much as possible ($Z$ is informative w.r.t $Y$).

In other words, information bottleneck tries to find a $\theta$ which maximize the following term:

$\begin{align} I(Y;Z) - \beta I(X;Z), \end{align}$

where $\beta$ is a tradeoff parameter. It’s computation challenging to directly compute $L$, so in VIB, a lower bound is derived and used to find $\theta$. Actually, a lot of papers related to this area use the idea similar to this — derive different lower bounds.


Variational Information Bottleneck

A lower bound $L$ of (2) is : $\begin{align} L = \int dxdydzp(x)p(y|x)p(z|x)\log q(y|z)-\beta\int dxdzp(x)p(z|x)\log \frac{p(z|x)}{r(z)}, \end{align}$

By using sampling to approximate $L$, we have: $\begin{align} L\simeq \frac{1}{N} \sum_{n=1}^N \left[\mathbb{E}_ {p(z\mid x_ n)}\left[ \log q(y_n\mid z) - \beta \frac{p(z|x_n)}{r(z)}\right]\right]. \end{align}$

The above equation can be put together and the following term is minimized: $\begin{align} J_{IB} = \frac{1}{N} \sum_{n=1}^N \left[ - \mathbb{E}_ {p(z|x_n)}\left[ \log q(y_n\mid z)\right] +\beta KL\left[ p(Z\mid x_n) \mid\mid p(Z))\right] \right] \end{align}$

And Finally, using reparametertization tirck: $\begin{align}J_ {IB} \simeq \frac{1}{N} \sum_{n=1}^N \left[ - \frac{1}{L}\sum_{l=1}^L\left[ \log q(y_n\mid z^{(n,l)})\right] +\beta KL\left[ p(Z\mid x_n),\mid\mid p(Z))\right] \right]\end{align}$

where $z^{(n,l)} = f(x^n,\epsilon^l)$ and $\epsilon^l\sim p(\epsilon)$.


A simple implementation

  1. $x^n$ goes through encoder network parameterized by $\theta$ and generate $\mu_\theta$ and $\sigma_\theta$
  2. use reparameterization trick to generate $z^{(n,l)}$
  3. $x^{(n,l)}$ goes through a network paramterized by $\phi$ and output by a softmax, then $q(y_n\mid z^{(n,l)})$ can be calculated.