Variational Inference
Main Idea
The main idea behind variational methods is to pick a family of distributions over the latent variables, indexed by its own variational parameters, \[ q(z_{1:m}| \alpha) \] and then find the setting of the parameters that makes $q$ close to the posterior of interest.
Use $q$ with the fitted parameters as a proxy for the posterior, e.g., to form predictions about future data or to investigate the posterior distribution of the hidden variables.
Using KL to measure the closeness
- The KL divergence for variational inference is \[ KL(q || p)= \int q(z) \ln \frac{q(z)}{p(z|x)} \, dz = E_q \left[ \ln \frac{q(Z)}{p(Z|x)} \right] \]
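As a quick numerical illustration of this definition (a toy example of my own, not part of the derivation), the sketch below estimates $E_q[\ln q(Z) - \ln p(Z)]$ by sampling from $q$ and compares it with the closed-form KL between two Gaussians standing in for $q$ and the posterior:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Arbitrary stand-ins: q = N(0, 1), and p = N(1, 2^2) playing the role of p(z|x).
q = norm(loc=0.0, scale=1.0)
p = norm(loc=1.0, scale=2.0)

# Monte Carlo estimate of KL(q || p) = E_q[ln q(Z) - ln p(Z)].
z = rng.normal(0.0, 1.0, size=200_000)
kl_mc = np.mean(q.logpdf(z) - p.logpdf(z))

# Closed form for two Gaussians: ln(s_p/s_q) + (s_q^2 + (m_q - m_p)^2) / (2 s_p^2) - 1/2.
kl_exact = np.log(2.0) + (1.0 + 1.0) / (2 * 4.0) - 0.5

print(kl_mc, kl_exact)  # both approx 0.443
```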
The Evidence Lower Bound (ELBO)
- We can't minimize the KL divergence directly, because it involves the intractable posterior. But we can maximize a function that equals its negative up to a constant. This function is the evidence lower bound (ELBO):
\[
\begin{split}
\ln p(x) &= \ln \int_z p(x,z) \\
&= \ln \int_z p(x,z) \frac{q(z)}{q(z)} \\
&= \ln \left( E_q \left[ \frac{p(x,Z)}{q(Z)} \right] \right) \\
&\geq E_q[ \ln p(x,Z)] - E_q[ \ln q(Z)]
\end{split}
\]
The right-hand side of the last line is the ELBO; the inequality follows from Jensen's inequality. We choose a family of variational distributions (i.e., a parameterization of a distribution over the latent variables) such that these expectations are computable.
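To see the Jensen step numerically (again a toy setup of my own: $\mu \sim N(0,1)$, $x | \mu \sim N(\mu,1)$, a single observation, and an arbitrary Gaussian $q$), the quantity inside the log averages to $p(x)$ exactly, while averaging the log gives the strictly smaller ELBO:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Toy conjugate model: mu ~ N(0, 1), x | mu ~ N(mu, 1), one observation x.
x = 1.0
# Arbitrary variational distribution q(mu) = N(0.3, 0.9^2).
m_q, s_q = 0.3, 0.9

mu = rng.normal(m_q, s_q, size=500_000)
log_joint = norm.logpdf(x, loc=mu, scale=1.0) + norm.logpdf(mu, loc=0.0, scale=1.0)
log_q = norm.logpdf(mu, loc=m_q, scale=s_q)
log_ratio = log_joint - log_q

# ln E_q[p(x,Z)/q(Z)] -- equals ln p(x); computed with log-sum-exp for stability.
log_evidence_mc = np.logaddexp.reduce(log_ratio) - np.log(len(mu))
# ELBO = E_q[ln p(x,Z)] - E_q[ln q(Z)]; by Jensen it never exceeds ln p(x).
elbo_mc = log_ratio.mean()

exact_log_evidence = norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0))  # marginal: x ~ N(0, 2)
print(exact_log_evidence, log_evidence_mc, elbo_mc)  # approx -1.52, -1.52, -1.62
```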
Putting the KL and the ELBO together
- First, \[ p(z|x)= \frac{p(z,x)}{p(x)} \]
- Second,
\[
\begin{split}
KL(q(z)||p(z|x)) &= E_q \left[ \ln \frac{q(Z)}{p(Z|x)} \right] \\
&= E_q[ \ln q(Z)] - E_q[ \ln p(Z|x)] \\
&= E_q[ \ln q(Z)] - E_q[ \ln p(x,Z)] + \ln p(x) \\
&= \ln p(x) - (E_q[ \ln p(x,Z)] - E_q[ \ln q(Z)])
\end{split}
\]
Notice that $\ln p(x)$ does not depend on $q$. So, as a function of the variational distribution, minimizing the KL divergence is the same as maximizing the ELBO, denoted $\mathcal{L}(q)$:
\[ \ln p(x) = KL(q||p) + \mathcal{L}(q) \]
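This decomposition can be checked in closed form for the same kind of toy conjugate Gaussian model (all numbers below are arbitrary choices of mine):

```python
import numpy as np

# Toy model: mu ~ N(0, 1), x | mu ~ N(mu, 1), observation x = 1.
# Posterior: mu | x ~ N(x/2, 1/2); evidence: x ~ N(0, 2).
# Variational family: q(mu) = N(m, s^2) with arbitrary m, s.
x, m, s = 1.0, 0.3, 0.9

log_px = -0.5 * np.log(2 * np.pi * 2.0) - x**2 / (2 * 2.0)

# KL(q || posterior), closed form for two Gaussians.
mp, sp2 = x / 2.0, 0.5
kl = np.log(np.sqrt(sp2) / s) + (s**2 + (m - mp) ** 2) / (2 * sp2) - 0.5

# ELBO = E_q[ln p(x | mu)] + E_q[ln p(mu)] + H[q], every term available in closed form.
e_loglik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s**2)
e_logprior = -0.5 * np.log(2 * np.pi) - 0.5 * (m**2 + s**2)
entropy = 0.5 * np.log(2 * np.pi * np.e * s**2)
elbo = e_loglik + e_logprior + entropy

print(log_px, kl + elbo)  # both approx -1.516
```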
Mean Field Variational Inference
In mean field variational inference, we assume that the variational family factorizes, \[ q(z_1, \ldots ,z_m)= \prod_{j=1}^m q(z_j) \] i.e., each latent variable is independent under $q$.
Then, decomposing the joint with the chain rule (the form used for coordinate ascent inference), the ELBO can be written as \[ \mathcal{L}(q)= \ln p(x_{1:n}) + \sum_{j=1}^m \left( E[ \ln p(z_j | z_{1:(j-1)}, x_{1:n})] - E_j[ \ln q(z_j)] \right) \]
Consider the ELBO as a function of $q(z_k)$, employing the chain rule with the variable $z_k$ as the last variable in the list. Then, \[ \mathcal{L}(q)=E[ \ln p(z_k |z_{-k},x)] - E_k[ \ln q(z_k)] + \text{const} \]
Write this objective as a function of $q(z_k)$: \[ \mathcal{L} = \int q(z_k)E_{-k}[ \ln p(z_k| z_{-k},x)] dz_{k} - \int q(z_k) \ln q(z_k) dz_k \]
The derivative w.r.t. $q(z_k)$, set to zero, is \[ \frac{ d \mathcal{L} }{ dq(z_k)}= E_{-k}[ \ln p(z_k|z_{-k},x) ]- \ln q(z_k) - 1 = 0 \]
This leads to the coordinate ascent update for $q(z_k)$ \[ q^*(z_k) \propto \exp{ E_{-k} [ \ln p(z_k | Z_{-k},x)] } \]
But the denominator of the conditional, $p(z_{-k},x)$, does not depend on $z_k$, so \[ q^*(z_k) \propto \exp{ E_{-k} [ \ln p(z_k , Z_{-k},x)] } \]
The coordinate ascent algorithm iteratively updates each $q(z_k)$ in turn. The ELBO converges to a local maximum. The resulting $q$ is then used as a proxy for the true posterior.
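A minimal, self-contained instance of this loop (the classic mean-field-on-a-correlated-Gaussian example, not taken from this text): approximate $p(z_1,z_2)=N(m,\Lambda^{-1})$ by $q(z_1)q(z_2)$. Each optimal factor is Gaussian with variance $\Lambda_{kk}^{-1}$, and its mean depends on the current mean of the other factor, so only the means need to be cycled:

```python
import numpy as np

# Target (arbitrary numbers): p(z1, z2) = N(m, Lambda^{-1}) with correlated components.
m = np.array([1.0, -1.0])
Lam = np.array([[2.0, 1.2],
                [1.2, 2.0]])  # precision matrix

# Mean-field CAVI: q(z1) = N(m1_q, 1/Lam[0,0]), q(z2) = N(m2_q, 1/Lam[1,1]).
m1_q, m2_q = 0.0, 0.0
for _ in range(50):
    # Update q(z1) holding q(z2) fixed, then q(z2) holding q(z1) fixed.
    m1_q = m[0] - (Lam[0, 1] / Lam[0, 0]) * (m2_q - m[1])
    m2_q = m[1] - (Lam[1, 0] / Lam[1, 1]) * (m1_q - m[0])

print(m1_q, m2_q)  # converges to the true means (1, -1); the variances are underestimated
```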
Working with exponential family conditionals
- Compute the log of the conditional
\[ \ln p(z_j | z_{-j},x) = \ln h(z_j) + \eta (z_{-j},x)^T t(z_j) - a( \eta (z_{-j}, x)) \]
- Compute the expectation with respect to $q(z_{-j})$
\[ E[ \ln p(z_j | z_{-j},x)]= \ln h(z_j) + E[ \eta(z_{-j},x)]^Tt(z_j) - E[a( \eta (z_{-j}, x)) ] \]
- Noting that the last term does not depend on $z_j$, it can be absorbed into the constant of proportionality, which means that
\[ q^*(z_j) \propto h(z_j) \exp E[ \eta(z_{-j},x)]^Tt(z_j) \]
i.e., the optimal $q(z_j)$ is in the same exponential family as the conditional, with natural parameter $E[ \eta(z_{-j},x)]$ and log-normalizer $a(E[ \eta(z_{-j},x)])$.
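As a concrete (illustrative) instance of this recipe: suppose the complete conditional of $z_j$ is Gaussian with a mean $m(z_{-j},x)$ that depends on the other variables and a fixed, known variance $s^2$. Then
\[
\ln p(z_j | z_{-j}, x) = \frac{m(z_{-j},x)}{s^2} z_j - \frac{1}{2s^2} z_j^2 - \left( \frac{m(z_{-j},x)^2}{2s^2} + \frac{1}{2} \ln (2 \pi s^2) \right)
\]
so $t(z_j) = (z_j, z_j^2)$, $\eta(z_{-j},x) = ( m(z_{-j},x)/s^2,\, -1/(2s^2))$, $h(z_j)=1$, and the bracketed term is $a( \eta(z_{-j},x))$. Since $s^2$ is fixed, $E[ \eta(z_{-j},x)]$ only involves $E[m(z_{-j},x)]$ under the other factors, and the update gives
\[
q^*(z_j) = N \left( E[m(z_{-j},x)],\; s^2 \right)
\]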
An example from Wikipedia
Given $N$ data points $X=(x_1, \ldots ,x_N)$, our goal is to approximate the posterior distribution $p( \mu, \tau | x_1, \ldots ,x_N)$ by $q( \mu , \tau)$. We place conjugate prior distributions on the unknown mean $\mu$ and precision $\tau$. That is,
\[
\begin{split}
\mu | \tau &\sim N( \mu_0, (\lambda_0 \tau)^{-1}) \\
\tau &\sim Ga(a_0, b_0) \\
x_i | \mu, \tau &\sim N( \mu , \tau^{-1})
\end{split}
\]
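For concreteness, the sketch below samples synthetic data from this generative process (hyperparameter values and $N$ are arbitrary choices; note that numpy's gamma sampler takes a scale, so the rate $b_0$ enters as $1/b_0$):

```python
import numpy as np

rng = np.random.default_rng(42)

# Arbitrary hyperparameters for illustration.
mu0, lambda0, a0, b0 = 0.0, 1.0, 2.0, 2.0
N = 100

tau = rng.gamma(shape=a0, scale=1.0 / b0)            # tau ~ Ga(a0, b0), rate parameterization
mu = rng.normal(mu0, 1.0 / np.sqrt(lambda0 * tau))   # mu | tau ~ N(mu0, (lambda0 * tau)^{-1})
x = rng.normal(mu, 1.0 / np.sqrt(tau), size=N)       # x_i | mu, tau ~ N(mu, tau^{-1})
```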
- The joint probability
The joint probability of all variables can be rewritten as
\[
p( \mathcal{D}, \mu , \tau)=p( \mathcal{D}| \mu, \tau)p( \mu | \tau)p( \tau)
\]
where the individual factors are,
\[
\begin{split}
p( \mathcal{D}| \mu, \tau) &= \prod_{i=1}^{N} N( x_i | \mu, \tau^{-1}) \\
p( \mu | \tau ) &= N( \mu | \mu_0, ( \lambda_0 \tau)^{-1}) \\
p( \tau) &= Gamma( \tau |a_0, b_0)
\end{split}
\]
where,
\[
\begin{split}
N(x| \mu, \sigma^2) &= \frac{1}{ \sqrt{2 \pi} \sigma} \exp \left[ - \frac{(x- \mu)^2}{2 \sigma^2} \right] \\
Gamma( \tau | a,b) &= \frac{b^a \tau^{a-1} e^{-b \tau}}{ \Gamma(a)}
\end{split}
\]
- Mean field factorized approximation: assume that $q( \mu, \tau) = q( \mu) q( \tau)$, i.e., that the posterior distribution factorizes into independent factors for $\mu$ and $\tau$.
- Variational derivation of $q( \mu)$:
\[
\begin{split}
\ln q_{ \mu}^* ( \mu) &= E_{ \tau} \left[ \ln p( \mathcal{D} | \mu, \tau) + \ln p( \mu | \tau) + \ln p( \tau) \right] + C \\
&= - \frac{E_{ \tau}[ \tau]}{2} \left[ \sum_{i=1}^{N}(x_i - \mu)^2 + \lambda_0 ( \mu - \mu_0)^2 \right] + C \\
&= - \frac{1}{2}( \lambda_0 + N)E_{ \tau}[ \tau] \left[ \mu - \frac{ \lambda_0 \mu_0 + \sum_{i=1}^N x_i}{ \lambda_0 + N} \right]^2 + C
\end{split}
\]
That is to say,
\[ q_{ \mu}^* ( \mu) = N \left( \mu \,\middle|\, \mu_N= \frac{ \lambda_0 \mu_0 + N \bar{x}}{ \lambda_0 + N}, \; \lambda_N^{-1} \right), \qquad \lambda_N = ( \lambda_0 + N) E_{ \tau}[ \tau] \]
Similarly,
\[ \ln q_{ \tau}^* ( \tau) = (a_0 -1) \ln \tau - b_0 \tau + \frac{1}{2} \ln \tau + \frac{N}{2} \ln \tau - \frac{ \tau}{2} E_{ \mu} \left[ \sum_{i=1}^N (x_i - \mu)^2 + \lambda_0( \mu - \mu_0)^2 \right] + C \]
and so
\[ q_{ \tau}^* ( \tau) = Ga \left( \tau \,\middle|\, a_N= a_0 + \frac{N+1}{2}, \; b_N=b_0 + \frac{1}{2}E_{ \mu} \left[ \sum_{i=1}^N (x_i - \mu)^2 + \lambda_0( \mu - \mu_0)^2 \right] \right) \]
After evaluating the expectations (using $E_{ \tau}[ \tau]=a_N/b_N$, $E_{ \mu}[ \mu]= \mu_N$, and $E_{ \mu}[ \mu^2]= \lambda_N^{-1} + \mu_N^2$), we can write the parameter equations as follows,
\[
\begin{split}
\mu_N &= \frac{ \lambda_0 \mu_0 + N \bar{x}}{ \lambda_0 + N} \\
\lambda_N &= ( \lambda_0 + N) \frac{ a_N }{b_N} \\
a_N &= a_0 + \frac{N+1}{2} \\
b_N &= b_0 + \frac{1}{2} \left[ ( \lambda_0 + N)( \lambda_N^{-1} + \mu_N^2) -2 \left[ \lambda_0 \mu_0 + \sum_{i=1}^N x_i \right] \mu_N + \left[ \sum_{i=1}^N x_i^2 \right] + \lambda_0 \mu_0^2 \right]
\end{split}
\]
Note that there are circular dependencies among these formulas: $\mu_N$ and $a_N$ can be computed once, but $\lambda_N$ depends on $b_N$ and $b_N$ in turn depends on $\lambda_N$. This naturally suggests an EM-like iterative algorithm: initialize $b_N$, then alternate the updates for $\lambda_N$ and $b_N$ until convergence.
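A minimal sketch of that iteration in code, reusing the synthetic data and hyperparameters from the sampling snippet above (function and variable names are my own):

```python
import numpy as np

def cavi_normal_gamma(x, mu0=0.0, lambda0=1.0, a0=2.0, b0=2.0, n_iter=100, tol=1e-10):
    """Coordinate ascent updates for q(mu) = N(mu_N, 1/lambda_N), q(tau) = Ga(a_N, b_N)."""
    N = len(x)
    x_bar, sum_x, sum_x2 = x.mean(), x.sum(), (x ** 2).sum()

    # mu_N and a_N have no circular dependency, so they are computed once.
    mu_N = (lambda0 * mu0 + N * x_bar) / (lambda0 + N)
    a_N = a0 + (N + 1) / 2.0

    # lambda_N and b_N depend on each other: initialize b_N and iterate.
    b_N = b0
    for _ in range(n_iter):
        lambda_N = (lambda0 + N) * a_N / b_N
        b_N_new = b0 + 0.5 * (
            (lambda0 + N) * (1.0 / lambda_N + mu_N ** 2)
            - 2.0 * (lambda0 * mu0 + sum_x) * mu_N
            + sum_x2 + lambda0 * mu0 ** 2
        )
        if abs(b_N_new - b_N) < tol:
            b_N = b_N_new
            break
        b_N = b_N_new

    lambda_N = (lambda0 + N) * a_N / b_N
    return mu_N, lambda_N, a_N, b_N

# Usage with the synthetic data x from earlier:
# mu_N, lambda_N, a_N, b_N = cavi_normal_gamma(x)
# print("E[mu] =", mu_N, "  E[tau] =", a_N / b_N)
```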