Reading Note: Variational Auto-Encoder (VAE)
Auto-Encoding Variational Bayes
The Variational Auto-Encoder (VAE) is an unsupervised learning method. This post briefly goes over the derivation and implementation of the VAE.
Model Setup
Let \(\mathbf{X} = \{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}\}\) be a dataset with $N$ points. Each $\mathbf{x}^{(i)}$ is generated as follows:
- a continuous random value $\mathbf{z}^{(i)}$ is generated from some prior distribution $p_{\theta^*}(\mathbf{z}^{(i)})$;
- $\mathbf{x}^{(i)}$ is generated from some conditional distribution $p_{\theta^*}(\mathbf{x}^{(i)}\mid\mathbf{z}^{(i)})$.
We could therefore use $p_ {\theta^* }(\mathbf{z}^{(i)})$ and $p_ {\theta^* }(\mathbf{x}^{(i)}\mid\mathbf{z}^{(i)})$ to optimize the marginalized log-likelihood $\log p(\mathbf{X})$. However, since $\theta^*$ and $\mathbf{z}^{(i)}$ are unknown, this optimization is intractable in general.
Objective
A parameterized approximation of the intractable posterior $p_{\theta}(\mathbf{z}\mid\mathbf{x})$ is introduced and denoted by $q_{\phi}(\mathbf{z}\mid\mathbf{x})$. Assume all points are i.i.d.; we want to optimize the marginal log-likelihood: $\begin{align} \log p_\theta(\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)}) = \sum_{i=1}^N\log p_\theta(\mathbf{x}^{(i)}) \end{align}$
$\log p_\theta(\mathbf{x}^{(i)})$ can be rewritten as: $\begin{align} \log p_\theta(\mathbf{x}^{(i)}) = D_{KL}(q_{\phi}(\mathbf{z}\mid\mathbf{x}^{(i)})\mid\mid p_{\theta}(\mathbf{z}\mid\mathbf{x}^{(i)}))+\mathcal{L}(\theta,\phi;\mathbf{x}^{(i)}) \end{align}$
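To see where this comes from, note that $\log p_\theta(\mathbf{x}^{(i)})$ does not depend on $\mathbf{z}$, so we can take its expectation under $q_ \phi(\mathbf{z}\mid\mathbf{x}^{(i)})$ and split the log-ratio: $\begin{align*} \log p_\theta(\mathbf{x}^{(i)}) &= E_ {q_ \phi(\mathbf{z}\mid\mathbf{x}^{(i)})}\left[\log \frac{p_ \theta(\mathbf{x}^{(i)},\mathbf{z})}{p_ \theta(\mathbf{z}\mid\mathbf{x}^{(i)})}\right] \newline &= E_ {q_ \phi(\mathbf{z}\mid\mathbf{x}^{(i)})}\left[\log \frac{q_ \phi(\mathbf{z}\mid\mathbf{x}^{(i)})}{p_ \theta(\mathbf{z}\mid\mathbf{x}^{(i)})}\right] + E_ {q_ \phi(\mathbf{z}\mid\mathbf{x}^{(i)})}\left[\log \frac{p_ \theta(\mathbf{x}^{(i)},\mathbf{z})}{q_ \phi(\mathbf{z}\mid\mathbf{x}^{(i)})}\right], \end{align*}$ where the first term is the KL divergence and the second is $\mathcal{L}(\theta,\phi;\mathbf{x}^{(i)})$.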
Due to the non-negativity of the KL divergence, we know that $\mathcal{L}$ is a lower bound, i.e.:
$\begin{align} \log p_\theta(\mathbf{x}^{(i)}) \geq \mathcal{L}(\theta,\phi;\mathbf{x}^{(i)}). \end{align}$
Hence, the goal is to optimize $\mathcal{L}$ with respect to the parameters $\theta$ and $\phi$. $\mathcal{L}(\theta,\phi;\mathbf{x}^{(i)})$ can be written in the following two forms:
$\begin{align} \mathcal{L}(\theta,\phi;\mathbf{x}^{(i)})&=E_ {q_ \phi(\mathbf{z}\mid\mathbf{x}^{(i)})}[ -\log q_ \phi(\mathbf{z}\mid\mathbf{x}^{(i)}) + \log p_ {\theta}(\mathbf{x}^{(i)},\mathbf{z})] \newline &=-D_{KL}(q_ \phi(\mathbf{z}\mid\mathbf{x}^{(i)})\mid\mid p_ \theta(\mathbf{z})) + E_ {q_ \phi(\mathbf{z}\mid\mathbf{x}^{(i)})} [\log p_ \theta(\mathbf{x}^{(i)}\mid \mathbf{z})] \end{align}$
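The second form follows from the first by factoring the joint as $p_ \theta(\mathbf{x},\mathbf{z}) = p_ \theta(\mathbf{z})\, p_ \theta(\mathbf{x}\mid\mathbf{z})$ and grouping the prior with the entropy term: $\begin{align*} E_ {q_ \phi(\mathbf{z}\mid\mathbf{x}^{(i)})}\left[\log \frac{p_ \theta(\mathbf{z})}{q_ \phi(\mathbf{z}\mid\mathbf{x}^{(i)})}\right] = -D_{KL}(q_ \phi(\mathbf{z}\mid\mathbf{x}^{(i)})\mid\mid p_ \theta(\mathbf{z})). \end{align*}$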
Reparameterization Trick
The reparameterization trick is used to approximate the expectation term and make (4) or (5) differentiable. Let $\mathbf{z}$ be a continuous variable distributed according to $q_ \phi(\mathbf{z}\mid\mathbf{x})$, and let $\bar{\mathbf{z}}$ be a realization, i.e. $\bar{\mathbf{z}}\sim q_ \phi(\mathbf{z}\mid\mathbf{x})$. $\bar{\mathbf{z}}$ can be written as a deterministic function of an auxiliary noise variable: $\begin{align} \bar{\mathbf{z}} = g_ \phi(\epsilon, \mathbf{x}), \end{align}$ where $\epsilon \sim p(\epsilon)$. Then we have the following trick: $\begin{align} E_{q_\phi(\mathbf{z}\mid\mathbf{x})}[ f(\mathbf{z})]&= E_{p(\epsilon)}[ f(g_ \phi(\epsilon, \mathbf{x})) ] \newline &\simeq \frac{1}{L}\sum_ {l=1} ^Lf(g_ \phi(\epsilon^{(l)}, \mathbf{x})), \end{align}$ where $\epsilon^{(l)} \sim p(\epsilon)$.
We see that (8) is differentiable w.r.t. $\phi$.
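As a concrete illustration (not part of the original derivation), here is a minimal PyTorch sketch of the trick for a diagonal-Gaussian $q_ \phi(\mathbf{z}\mid\mathbf{x})$ with $g_ \phi(\epsilon,\mathbf{x}) = \mathbf{\mu} + \mathbf{\sigma}\odot\epsilon$; the dimensions and the test function $f$ are arbitrary choices:

```python
import torch

# Sketch of the reparameterization trick for q_phi(z | x) = N(z; mu, sigma^2 I).
# mu and log_var would normally come from an encoder network; here they are
# fixed tensors purely for demonstration.
mu = torch.zeros(4, requires_grad=True)        # mean of q_phi(z | x)
log_var = torch.zeros(4, requires_grad=True)   # log sigma^2 of q_phi(z | x)

L = 10                                         # number of Monte Carlo samples

# Draw noise from the fixed base distribution p(eps) = N(0, I).
eps = torch.randn(L, 4)

# Deterministic transform g_phi(eps, x) = mu + sigma * eps.
z = mu + torch.exp(0.5 * log_var) * eps

# Monte Carlo estimate of E_q[f(z)] for some differentiable f, here f(z) = ||z||^2.
f_estimate = (z ** 2).sum(dim=1).mean()

# Because z is a deterministic, differentiable function of (mu, log_var, eps),
# gradients flow back to the variational parameters.
f_estimate.backward()
print(mu.grad, log_var.grad)
```

If we instead sampled $\mathbf{z}$ directly from $q_ \phi(\mathbf{z}\mid\mathbf{x})$, the sampling step would block gradients from reaching $\phi$; routing the randomness through the fixed noise $\epsilon$ is what makes the estimator in (8) differentiable.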
We can combine this trick with (4) to get an approximation $\mathcal{L}^A$:
$\begin{align} \mathcal{L} \simeq \mathcal{L}^A = \frac{1}{L}\sum_{l=1}^L (\log p_ \theta (\mathbf{x}^{(i)},\mathbf{z}^{(i,l)})-\log q_ \phi (\mathbf{z}^{(i,l)}\mid \mathbf{x}^{(i)})), \end{align}$ where $\mathbf{z}^{(i,l)} = g_ \phi(\epsilon^{(l)}, \mathbf{x}^{(i)})$.
We obtain another version of the approximation, $\mathcal{L}^B$, by applying the trick to (5):
$\begin{align} \mathcal{L} \simeq \mathcal{L}^B = -D_{KL}(q_ \phi(\mathbf{z}\mid\mathbf{x}^{(i)})\mid\mid p_ \theta (\mathbf{z}) ) + \frac{1}{L} \sum_{l=1}^{L}\log p_ \theta(\mathbf{x}^{(i)}\mid\mathbf{z}^{(i,l)}). \end{align}$
Finally, we use a minibatch \(\mathbf{X}^M = \{ \mathbf{x}^{(i)}\}_{i=1} ^M\) to approximate the lower bound over the full dataset:
$\begin{align} \mathcal{L}(\theta, \phi; \mathbf{X}) \simeq \tilde{\mathcal{L}}^M (\theta, \phi; \mathbf{X}^M) = \frac{N}{M} \sum_{i=1}^M \mathcal{L}(\theta, \phi; \mathbf{x}^{(i)}). \end{align}$
Variational Auto-Encoder
Assume that the likelihood $p_ \theta(\mathbf{x}\mid\mathbf{z})$ is a multivariate Gaussian, that the prior $p_ \theta (\mathbf{z})$ is a standard normal $\mathcal{N} (\mathbf{z}; \mathbf{0}, \mathbf{I})$, and that the approximate posterior $q_ \phi(\mathbf{z}\mid\mathbf{x})$ is a multivariate Gaussian with diagonal covariance. For simplicity and notational clarity, we denote: $\begin{align} p_ \theta(\mathbf{x}\mid\mathbf{z}) &= \mathcal{N}(\mathbf{x}; \mathbf{\mu}_ \theta, \mathbf{\sigma}_ \theta^2\mathbf{I}) \newline q_ \phi(\mathbf{z}\mid\mathbf{x}) & = \mathcal{N}(\mathbf{z}; \mathbf{\mu}_ \phi, \mathbf{\sigma}_ \phi^2\mathbf{I}) \end{align}$
Let us assume the dimensions of $\mathbf{x}$ and $\mathbf{z}$ are $K$ and $J$, respectively, so $\mathbf{\mu}_ \theta$, $\mathbf{\sigma}_ \theta$ are $K$-dimensional vectors and $\mathbf{\mu}_ \phi$, $\mathbf{\sigma}_ \phi$ are $J$-dimensional vectors. We denote the MLPs parameterized by $\phi$ and $\theta$ by $MLP_ \phi$ and $MLP_ \theta$, respectively. For each point $\mathbf{x}^{(i)}$, $\mathcal{L}^B (\theta, \phi; \mathbf{x}^{(i)})$ is computed with the following steps:
- pass $\mathbf{x}^{(i)}$ through $MLP_ \phi$ and obtain $\mathbf{\mu}_ \phi ^ {(i)}$ and $\mathbf{\sigma}_ \phi^{(i)}$;
- apply the reparameterization trick to obtain $\mathbf{z}^{(i,l)}$: $\begin{align} \mathbf{z}^{(i,l)}=\mathbf{\mu}_ \phi^{(i)} + \mathbf{\sigma}_ \phi^{(i)} \odot \mathbf{\epsilon}^{(l)}, \end{align}$ where $\mathbf{\epsilon}^{(l)} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$;
- pass $\mathbf{z} ^{(i,l)}$ through $MLP_ \theta$ and obtain $\mathbf{\mu}_ \theta ^{(i,l)}$ and $\mathbf{\sigma}_ \theta ^{(i,l)}$; now we can calculate $\log p_ \theta (\mathbf{x}^{(i)}\mid \mathbf{z} ^{(i,l)})$;
- finally, obtain $\mathcal{L}^B (\theta, \phi; \mathbf{x}^{(i)})$ by calculating: $\begin{align}\frac{1}{2}\sum_{j=1}^J\left(1+\log((\sigma_ {\phi,j}^{(i)})^2)-(\mu_ {\phi,j} ^{(i)})^2-(\sigma_ {\phi,j}^{(i)})^2\right)+ \frac{1}{L}\sum_{l=1}^{L}\log p_ \theta (\mathbf{x}^{(i)}\mid \mathbf{z} ^{(i,l)})\end{align}$
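The first summation is the closed-form $-D_{KL}(q_ \phi(\mathbf{z}\mid\mathbf{x}^{(i)})\mid\mid p_ \theta(\mathbf{z}))$ between two Gaussians. It comes from evaluating the two expectations $\begin{align*} \int q_ \phi(\mathbf{z}\mid\mathbf{x}^{(i)})\log p_ \theta(\mathbf{z})\, d\mathbf{z} &= -\frac{J}{2}\log(2\pi)-\frac{1}{2}\sum_ {j=1}^J\left((\mu_ {\phi,j}^{(i)})^2+(\sigma_ {\phi,j}^{(i)})^2\right) \newline \int q_ \phi(\mathbf{z}\mid\mathbf{x}^{(i)})\log q_ \phi(\mathbf{z}\mid\mathbf{x}^{(i)})\, d\mathbf{z} &= -\frac{J}{2}\log(2\pi)-\frac{1}{2}\sum_ {j=1}^J\left(1+\log((\sigma_ {\phi,j}^{(i)})^2)\right) \end{align*}$ and subtracting the second from the first.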
It’s interesting that this has a connection with coding theory: $MLP_ \phi$ is called the encoding (recognition) network and $MLP_ \theta$ the decoding (generative) network. The interpretation is straightforward: the encoder maps $\mathbf{x}^{(i)}$ to the latent variable $\mathbf{z}^{(i,l)}$, and the decoder produces the reconstruction $\bar{\mathbf{x}}$ that best matches the latent code.
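To make the steps above concrete, here is a minimal PyTorch sketch of the Gaussian-encoder / Gaussian-decoder VAE with a single-sample estimate ($L=1$); the layer sizes, Tanh activations, optimizer settings, and the random stand-in minibatch are illustrative assumptions, not part of the recipe itself:

```python
import math
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Sketch of a VAE with Gaussian encoder q_phi(z|x) and Gaussian decoder p_theta(x|z)."""

    def __init__(self, x_dim=784, z_dim=20, hidden=400):
        super().__init__()
        # Encoder MLP_phi: x -> (mu_phi, log sigma_phi^2), both J-dimensional.
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.Tanh())
        self.enc_mu = nn.Linear(hidden, z_dim)
        self.enc_logvar = nn.Linear(hidden, z_dim)
        # Decoder MLP_theta: z -> (mu_theta, log sigma_theta^2), both K-dimensional.
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.Tanh())
        self.dec_mu = nn.Linear(hidden, x_dim)
        self.dec_logvar = nn.Linear(hidden, x_dim)

    def forward(self, x):
        # Step 1: encode x^(i) into the parameters of q_phi(z | x^(i)).
        h = self.enc(x)
        mu_phi, logvar_phi = self.enc_mu(h), self.enc_logvar(h)

        # Step 2: reparameterization trick, z = mu_phi + sigma_phi * eps (L = 1).
        eps = torch.randn_like(mu_phi)
        z = mu_phi + torch.exp(0.5 * logvar_phi) * eps

        # Step 3: decode z into the parameters of p_theta(x | z) and evaluate
        # the diagonal-Gaussian log-likelihood log p_theta(x^(i) | z^(i,l)).
        h = self.dec(z)
        mu_theta, logvar_theta = self.dec_mu(h), self.dec_logvar(h)
        log_px = -0.5 * (logvar_theta
                         + (x - mu_theta) ** 2 / torch.exp(logvar_theta)
                         + math.log(2 * math.pi)).sum(dim=1)

        # Step 4: closed-form -D_KL(q_phi(z|x) || p_theta(z)), summed over j = 1..J.
        neg_kl = 0.5 * (1 + logvar_phi - mu_phi ** 2
                        - torch.exp(logvar_phi)).sum(dim=1)

        elbo = neg_kl + log_px   # L^B for each example in the minibatch
        return -elbo.mean()      # minimize the negative lower bound

# Usage sketch: one gradient step on a random stand-in minibatch.
model = VAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)
loss = model(x)
loss.backward()
optimizer.step()
```

Note that the loss here is averaged over the minibatch rather than scaled by $N/M$ as in (11); the two differ only by a constant factor that can be absorbed into the learning rate.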