Examples of the Exponential Family

  • normal \[ f(x|\mu,\sigma^2)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) \] [$E(X)=\mu; Var(X)=\sigma^2$]

  • exponential
    \[ f(x|\lambda)=\lambda \exp(-\lambda x) \] [$E(X)=\lambda^{-1}; Var(X)=\lambda^{-2}$]

  • gamma
    \[ f(x|\alpha,\beta)=\frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1}\exp(-\beta x) \] [$E(X)=\frac{\alpha}{\beta}; Var(X)=\frac{\alpha}{\beta^2}$]
  • chi-squared
    \[ f(x|k)=\frac{1}{2^{k/2}\Gamma(k/2)}x^{k/2-1}\exp\left(-\frac{x}{2}\right) \] [$E(X)=k; Var(X)=2k$]
  • beta \[ f(x|\alpha,\beta)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\cdot x^{\alpha-1}(1-x)^{\beta-1} \] [$E(X)=\frac{\alpha}{\alpha+\beta}; Var(X)=\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$]
  • dirichlet
    \[ f(x|\alpha)=\frac{\Gamma(\sum_{i=1}^k \alpha_i)}{\prod_{i=1}^k \Gamma(\alpha_i)}\prod_{i=1}^k x_i^{\alpha_i-1} \] where $k\geq 2$ is the number of categories and $x=(x_1,…,x_k)$ with $x_i\in(0,1), \sum_{i=1}^k x_i =1$. [$E(X_i)=\frac{\alpha_i}{\alpha_0};Var(X_i)=\frac{\alpha_i(\alpha_0-\alpha_i)}{\alpha_0^2(\alpha_0 + 1)}$, where $\alpha_0=\sum_{j=1}^k \alpha_j$; $Cov(X_i,X_j)=\frac{-\alpha_i\alpha_j}{\alpha_0^2(\alpha_0+1)}$]
  • bernoulli \[ f(x|p)= p^x(1-p)^{1-x}, \quad x\in\{0,1\} \] [$E(X)=p; Var(X)=p(1-p)$]
  • categorical \[ p(x)=[x=1]p_1 +…+[x=k]p_k \] where $[x=i]$ is the Iverson bracket. [$E([x=i])=p_i; Var([x=i])=p_i(1-p_i); Cov([x=i],[x=j])=-p_ip_j$. Note these are the moments of the indicator $[x=i]$, not of $x$ itself.]
  • poisson \[ f(x=k|\lambda)=\frac{\lambda^k}{k!} \exp(-\lambda) \] [$E(X)=\lambda; Var(X)=\lambda$]
  • wishart
    If $S$ is $W_p(\Sigma, r)$, then the density function of $S$ is \[ p(S)=\frac{ |S|^{\frac{r-p-1}{2}} \cdot \exp(-\frac{1}{2} \mathrm{Tr}(\Sigma^{-1}S)) }{ 2^{rp/2} \cdot \pi^{\frac{p(p-1)}{4}} \cdot |\Sigma |^{r/2} \cdot \prod_{i=1}^p \Gamma (\frac{r-i+1}{2}) } \] $E(S)=r\Sigma$, $Var(S_{ij})=r(\sigma_{ij}^2+\sigma_{ii}\sigma_{jj})$, where $r$ is the degrees of freedom.
  • inverse wishart
    $M=S^{-1}$ is said to have an inverse wishart distribution $W_p^{-1}(\Sigma,r)$ if its pdf is \[ p(M)=\frac{|M|^{-\frac{r+p+1}{2}} \cdot \exp(-\frac{1}{2} \mathrm{Tr}(\Sigma^{-1} M^{-1}))}{2^{rp/2} \cdot \pi^{p(p-1)/4} \cdot |\Sigma |^{r/2} \cdot \prod_{i=1}^p \Gamma (\frac{r+1-i}{2}) } \] Let $\Psi=\Sigma^{-1}$. Then $E(M)=\frac{\Psi}{r-p-1}$, $Var(M_{ii})=\frac{2\psi_{ii}^2}{(r-p-1)^2(r-p-3)}$, and $Cov(M_{ij},M_{kl})=\frac{2\psi_{ij}\psi_{kl}+(r-p-1)(\psi_{ik}\psi_{jl}+\psi_{il}\psi_{kj})}{(r-p)(r-p-1)^2(r-p-3)}$. A quick numerical check of some of the moments in this list follows.
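
A minimal sanity check of several of the closed-form moments listed above, compared against library implementations. The use of scipy.stats is our assumption, not part of the notes:

```python
# Minimal sanity check (assumes scipy): compare the closed-form moments
# listed above with scipy.stats' implementations.
from scipy import stats

alpha, beta = 2.0, 5.0

g = stats.gamma(a=alpha, scale=1.0/beta)   # gamma(alpha, rate=beta)
assert abs(g.mean() - alpha/beta) < 1e-12
assert abs(g.var() - alpha/beta**2) < 1e-12

b = stats.beta(alpha, beta)
assert abs(b.mean() - alpha/(alpha+beta)) < 1e-12
assert abs(b.var() - alpha*beta/((alpha+beta)**2*(alpha+beta+1))) < 1e-12

lam = 3.0
p = stats.poisson(lam)
assert p.mean() == lam and p.var() == lam  # E(X) = Var(X) = lambda
```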

Statistic

  • [definition]
    Given random variables (vectors) $X_1,…,X_n$ taking values in the sets $\mathcal{X}_1,…,\mathcal{X}_n$, respectively, a random vector $t_n:\mathcal{X}_1 \times \mathcal{X}_2 \times … \times \mathcal{X}_n \rightarrow R^{k(n)}$ is called a $k(n)$-dimensional statistic.

  • [example]
    • $t_n = \frac{1}{n}(X_1+X_2+…+X_n)$, $k(n)=1$
    • $t_n = [n,(X_1+…+X_n),(X_1^2+…+X_n^2)]$, $k(n)=3$.
  • [sufficient statistic]
    • [from wiki]
      A statistic $t=T(x)$ is sufficient for the underlying parameter $\theta$ precisely if the conditional probability distribution of the data $X$, given $t=T(x)$, does not depend on $\theta$, i.e., \[ p(x|\theta,t)=p(x|t) \]
    • [from Prof. Zhihua Zhang’s manuscript]
      The sequence $t_1,…,t_n$ is a sufficient statistic for $X_1,…,X_n$ if, for $n\geq 1$, the joint density of $X_1,…,X_n$ given $\theta$ has the form \[ p(x_1,…,x_n|\theta)=h_n(t_n,\theta)g(x_1,…,x_n)=h_{\theta}(t_n)g(x_1,…,x_n) \] for some functions $h_n\geq 0$, $g\geq 0$.
    • [theorem]
      The sequence $t_1,…,t_n$ is sufficient for infinitely exchangeable $X_1,…,X_n,…$ if and only if, for any $n\geq 1$, the density $p(x_1,…,x_n|\theta,t_n)$ is independent of $\theta$.
      proof. For any $t_n=t_n(X_1,…,X_n)$, \[ \begin{split} p(x_1,…,x_n|\theta)&=p(x_1,…,x_n,t_n|\theta)\\
      &=p(x_1,…,x_n|t_n,\theta)\cdot p(t_n|\theta)
      \end{split} \] If $p(x_1,…,x_n|\theta,t_n)$ is independent of $\theta$, then $p(x_1,…,x_n|t_n)$ plays the role of $g$ and $p(t_n|\theta)$ the role of $h_n$. So $t_n$ is a sufficient statistic.
      Conversely, if $t_n$ is sufficient, then $p(x_1,…,x_n|\theta)=h_n(t_n,\theta)g(x_1,…,x_n)$ with $h_n\geq 0$, $g\geq 0$. Integrating both sides over the set $\{(x_1,…,x_n): t_n(x_1,…,x_n)=t_n\}$, we have \[ \int_{t_n(x_1,…,x_n)=t_n} p(x_1,…,x_n|\theta)\,dx_1\cdots dx_n = \int_{t_n(x_1,…,x_n)=t_n} h_n(t_n,\theta)g(x_1,…,x_n)\,dx_1\cdots dx_n \] On the right-hand side, $h_n(t_n,\theta)$ does not depend on the variables of integration and can be pulled out; $\int g(x)\,dx$ can be viewed as a function of $t_n$, denoted $G(t_n)$, and $\int p(x|\theta)\,dx$ is $p(t_n|\theta)$. Hence \[ p(t_n|\theta)=h_n(t_n,\theta)G(t_n) \] and \[ h_n(t_n,\theta)=\frac{ p(t_n|\theta)}{G(t_n)} \] So \[ p(x_1,…,x_n|\theta)=\frac{ p(t_n|\theta)}{G(t_n)} g(x_1,…,x_n) \] and \[ p(x_1,…,x_n|t_n,\theta)=\frac{p(x_1,…,x_n|\theta)}{p(t_n|\theta)}=\frac{g(x_1,…,x_n)}{G(t_n)} \] Thus $p(x_1,…,x_n|\theta,t_n)$ is independent of $\theta$.
    • [example 1]
      For a Bernoulli sequence $X_1,…,X_n$, \[ \begin{split} p(x_1,…,x_n)&=\int_0^1 p(x_1,…,x_n|\theta)dF(\theta)\\
      &=\int_0^1 \prod_{i=1}^n B(x_i|\theta)dF(\theta)\\
      &=\int_0^1 \theta^{S_n}(1-\theta)^{n-S_n}dF(\theta) \end{split} \] where $S_n=x_1+…+x_n$. So \[ p(x_1,…,x_n|\theta)=\theta^{S_n}(1-\theta)^{n-S_n} \] Let $t_n=[n,S_n]$; then $p(x_1,…,x_n|\theta)$ factorizes with $h_n=\theta^{S_n}(1-\theta)^{n-S_n}$ and $g=1$. So $t_n$ is a sufficient statistic for the Bernoulli sequence.
    • [example 2: Normal Distribution]
      \[ \begin{split} p(x_1,…,x_n|\mu,\lambda)&=\prod_{i=1}^n \left(\frac{\lambda}{2\pi}\right)^{\frac{1}{2}}\exp\left(-\frac{\lambda}{2}(x_i -\mu)^2 \right)\\
      &=\left(\frac{\lambda}{2\pi}\right)^{\frac{n}{2}}\exp\left(-\frac{\lambda}{2}\sum_{i=1}^n(x_i -\mu)^2 \right)\\
      &=\left(\frac{\lambda}{2\pi}\right)^{\frac{n}{2}}\exp\left(-\frac{\lambda}{2}[n(\bar{x}-\mu)^2+nS_n^2]\right)
      \end{split} \] where $\bar{x}=\frac{1}{n}\sum_{i=1}^nx_i$ and $S_n^2=\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2$. So a sufficient statistic is $[n,\bar{x},S_n^2]$; note that the sufficient statistic is not unique. A numerical check of this factorization follows this section.
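
A minimal numerical illustration of the factorization above: the normal log-likelihood computed from the raw sample agrees with the one computed from $[n,\bar{x},S_n^2]$ alone. The helper names (`loglik_raw`, `loglik_suff`) are ours, not from the manuscript:

```python
# Check that the normal log-likelihood depends on the data only through
# t_n = [n, x_bar, S_n^2] (with S_n^2 the 1/n variance, as in the notes).
import numpy as np

def loglik_raw(x, mu, lam):
    # log of prod_i N(x_i | mu, 1/lam)
    return 0.5*len(x)*np.log(lam/(2*np.pi)) - 0.5*lam*np.sum((x - mu)**2)

def loglik_suff(n, xbar, S2, mu, lam):
    # the same quantity, written via the sufficient statistic only
    return 0.5*n*np.log(lam/(2*np.pi)) - 0.5*lam*(n*(xbar - mu)**2 + n*S2)

x = np.random.default_rng(0).normal(1.0, 2.0, size=50)
n, xbar, S2 = len(x), x.mean(), x.var()    # np.var defaults to the 1/n form
assert np.isclose(loglik_raw(x, 0.3, 0.7), loglik_suff(n, xbar, S2, 0.3, 0.7))
```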

One-Parameter Exponential Family

  • [one-parameter]
    A pdf is said to belong to the one-parameter exponential family if it is of the form \[ p(x|\theta)=f(x)g(\theta)\exp(c\cdot \phi(\theta)h(x)) \] where $g^{-1}(\theta)=\int f(x)\exp(c\cdot \phi(\theta)h(x))dx < \infty$ is a normalization factor. This family is denoted by $E_f(x|f,g,h,\phi,c,\theta)$.

  • [sufficient statistic for $E_f$]
    If $X_1,…,X_n\in \mathcal{X}$ is an exchangeable sequence such that, for a regular $E_f$, \[ p(x_1,…,x_n)=\int_{\theta}\prod_{i=1}^n E_f(x_i|f,g,h,\phi,c)dF(\theta) \] for some $dF(\theta)$, then $t_n=t_n(x_1,…,x_n)=[n,h(x_1)+…+h(x_n)]$ is a sufficient statistic.

    • [example: bernoulli]
      \[ \begin{split} p(x|\theta)&=\theta^x (1-\theta)^{1-x}\\
      &=(1-\theta)\left(\frac{\theta}{1-\theta}\right)^x\\
      &=(1-\theta)\cdot\exp\left(x\log\frac{\theta}{1-\theta}\right) \end{split} \] Then we have $f(x)=1,g(\theta)=(1-\theta),c=1,\phi(\theta)=\log\frac{\theta}{1-\theta},h(x)=x$. (A numerical check of this and the Poisson factorization appears after this list.)

    • [poisson]
      \[ p(x|\theta)=\frac{\theta^x\cdot e^{-\theta}}{x!}=\frac{1}{x!}\exp(-\theta)\cdot\exp(x\log\theta) \] where, $f(x)=\frac{1}{x!}, g(\theta)=\exp(-\theta),c=1,h(x)=x,\phi(\theta)=\log(\theta)$.

    • [normal with unknown variance]
      Let $\theta=\sigma^2$. \[ \begin{split} p(x|\theta)&=N(x|0,\sigma^2)\\
      &=\left( \frac{1}{2\pi\sigma^2}\right)^{\frac{1}{2}}\exp\left(-\frac{x^2}{2\sigma^2}\right)\\
      &=\left( \frac{1}{2\pi}\right)^{\frac{1}{2}}\theta^{-\frac{1}{2}}\exp\left(-\frac{x^2}{2\theta}\right)
      \end{split} \] where $f(x)=\left( \frac{1}{2\pi}\right)^{\frac{1}{2}}, g(\theta)=\theta^{-\frac{1}{2}},c=-1/2,h(x)=x^2,\phi(\theta)=\theta^{-1}$.
    • [uniform]
      \[ p(x|\theta)=U(x|[0,\theta])=\frac{1}{\theta} \] So $f(x)=1,g(\theta)=\theta^{-1},c=1,h(x)=\phi(\theta)=0$. Since the support $\mathcal{X}=[0,\theta]$ depends on $\theta$, this family is not regular.
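
As referenced in the Bernoulli example, here is a minimal sketch checking the $E_f$ factorization numerically for the Bernoulli and Poisson cases (the helper `ef_pdf` and all names are ours, not from the notes):

```python
# Verify p(x|theta) = f(x) g(theta) exp(c * phi(theta) * h(x)) numerically
# for the Bernoulli and Poisson factorizations derived above.
import math

def ef_pdf(x, theta, f, g, h, phi, c):
    return f(x) * g(theta) * math.exp(c * phi(theta) * h(x))

theta = 0.3
for x in (0, 1):  # Bernoulli: f=1, g=1-theta, c=1, phi=log(theta/(1-theta)), h=x
    direct = theta**x * (1 - theta)**(1 - x)
    factored = ef_pdf(x, theta, lambda x: 1.0, lambda t: 1 - t,
                      lambda x: x, lambda t: math.log(t/(1 - t)), 1.0)
    assert math.isclose(direct, factored)

lam = 2.5
for x in range(6):  # Poisson: f=1/x!, g=exp(-lam), c=1, phi=log(lam), h=x
    direct = lam**x * math.exp(-lam) / math.factorial(x)
    factored = ef_pdf(x, lam, lambda x: 1/math.factorial(x),
                      lambda t: math.exp(-t), lambda x: x, math.log, 1.0)
    assert math.isclose(direct, factored)
```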

K-Parameter Exponential Family

  • [k-parameter exponential family]
    A pdf (pmf) $p(x|\theta), x\in \mathcal{X}$, labelled by $\theta\in\Theta$, is said to belong to the k-parameter exponential family if it is of the form \[ p(x|\theta)=f(x)g(\theta)\exp\left(\sum_{j=1}^k c_j\cdot\phi_j(\theta)h_j(x) \right) \] denoted by $E_{f_k}(x|f,g,h,\phi,c,\theta)$.

  • [sufficient statistic for $E_{f_k}$]
    If $X_1,…,X_n\in \mathcal{X}$ is an exchangeable sequence such that, given a regular $E_{f_k}(X|f,g,h,\phi,c,\theta)$, \[ p(x_1,…,x_n|\theta)=\prod_{i=1}^n E_{f_k}(x_i|f,g,h,\phi,c,\theta) \] then $t_n=t_n(X_1,…,X_n)=[n,\sum_{i=1}^n h_1(X_i),…,\sum_{i=1}^n h_k(X_i)]$ is a sufficient statistic for $X_1,…,X_n$.

    • [example: normal]
      Let $\theta=[\mu,\lambda]$, \[ \begin{split} p(x|\theta)&=N(x|\mu,\lambda)\\
      &=\left( \frac{\lambda}{2\pi}\right)^{\frac{1}{2}}\exp\left(-\frac{\lambda}{2}(x-\mu)^2 \right)\\
      &=\left( \frac{1}{2\pi}\right)^{\frac{1}{2}}\lambda^{\frac{1}{2}}\exp\left( -\frac{\lambda}{2}\mu^2\right)\exp\left(\lambda\mu x-\frac{1}{2}\lambda x^2\right) \end{split} \] So $f(x)=\left(\frac{1}{2\pi}\right)^{\frac{1}{2}},g(\theta)=\lambda^{\frac{1}{2}}\exp\left( -\frac{\lambda}{2}\mu^2\right),c_1=1,c_2=-\frac{1}{2},\phi_1(\theta)=\lambda\mu,\phi_2(\theta)=\lambda,h_1(x)=x,h_2(x)=x^2$. Sufficient statistic: $t_n=[n,\sum_{i=1}^n x_i,\sum_{i=1}^n x_i^2]$. A numerical check of this factorization follows.
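
A minimal sketch checking this two-parameter factorization against the normal density directly (helper names are ours):

```python
# Check N(x | mu, 1/lam) against the k-parameter factorization
# f(x) g(theta) exp(c1*phi1*h1(x) + c2*phi2*h2(x)) derived above.
import math

def normal_pdf(x, mu, lam):
    return math.sqrt(lam/(2*math.pi)) * math.exp(-0.5*lam*(x - mu)**2)

def efk_pdf(x, mu, lam):
    f = 1/math.sqrt(2*math.pi)                     # f(x)
    g = math.sqrt(lam) * math.exp(-0.5*lam*mu**2)  # g(theta)
    # c1=1, phi1=lam*mu, h1=x; c2=-1/2, phi2=lam, h2=x^2
    return f * g * math.exp(lam*mu*x - 0.5*lam*x**2)

for x in (-1.3, 0.0, 2.7):
    assert math.isclose(normal_pdf(x, 0.4, 1.7), efk_pdf(x, 0.4, 1.7))
```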

Natural Exponential Family

  • [definition from Prof. Zhang’s manuscript]
    The pdf of the natural exponential family is \[ p(y|\varphi)=a(y)\exp(y^T \varphi-b(\varphi)) \] where $y=(y_1,…,y_k)$ and $\varphi=(\varphi_1,…,\varphi_k)$. Comparing with the previous form, we can see $y_i=h_i(x),\varphi_i=c_i\phi_i(\theta)$.
  • [wiki definition]
    The pdf of exponential family can be rewritten into another form: \[ f(x|\theta)=h(x)\exp(\eta(\theta)^TT(x)-A(\theta)) \]

  • [properties from wiki] The mean vector and covariance matrix are \[ E[Y]=\nabla_{\varphi}b(\varphi);\quad Cov[Y]=\nabla\nabla^T b(\varphi) \] where $\nabla$ is the gradient and $\nabla\nabla^T$ is the Hessian matrix.
    proof. Since $\int a(y)\exp(y^T \varphi-b(\varphi))dy=1$, differentiate both sides with respect to $\varphi$: \[ \begin{split} &\int a(y)\exp(y^T \varphi-b(\varphi))\cdot (y-\nabla_\varphi b(\varphi))dy=0 \\
    \Longrightarrow &\int a(y)\exp(y^T \varphi-b(\varphi))\cdot y\, dy=\int a(y)\exp(y^T \varphi-b(\varphi))\cdot \nabla_\varphi b(\varphi)dy\\
    \Longrightarrow &E[y]=\nabla_\varphi b(\varphi)
    \end{split} \] Differentiating both sides once more gives \[ \begin{split} &E[yy^T]-E[y]E[y]^T=\nabla\nabla^T b(\varphi)\\
    \Longrightarrow &Cov[y]=\nabla\nabla^T b(\varphi)
    \end{split} \]
  • [example: poisson distribution]
    \[ e^{-\lambda}\frac{\lambda^x}{x!}=\frac{1}{x!}\exp(x\log\lambda-\lambda)=\frac{1}{x!}\exp(x\varphi-e^{\varphi}) \] with $\varphi=\log\lambda$ and $b(\varphi)=e^\varphi$. So $E[y]=\nabla_\varphi b(\varphi)=\nabla_\varphi e^\varphi=e^\varphi=\lambda$; see the numerical check below.
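
A minimal finite-difference check of $E[y]=\nabla_\varphi b(\varphi)$ and $Cov[y]=\nabla\nabla^T b(\varphi)$ in the Poisson case, where $b(\varphi)=e^\varphi$ (an illustration under our own naming, not from the manuscript):

```python
# Poisson natural parameterization: b(phi) = exp(phi), phi = log(lambda).
# Check E[X] = b'(phi) = lambda and Var(X) = b''(phi) = lambda numerically.
import math

lam = 2.0
phi, eps = math.log(lam), 1e-5
b = math.exp

db  = (b(phi + eps) - b(phi - eps)) / (2*eps)            # ~ b'(phi)
d2b = (b(phi + eps) - 2*b(phi) + b(phi - eps)) / eps**2  # ~ b''(phi)

assert math.isclose(db, lam, rel_tol=1e-6)   # E[X] = lambda
assert math.isclose(d2b, lam, rel_tol=1e-4)  # Var(X) = lambda
```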

  • [theorem]
    If $X=(X_1,…,X_n)$ is a random sample from a regular exponential family distribution such that \[ p(x|\theta)=\left(\prod_{i=1}^n f(x_i)\right)[g(\theta)]^n\exp\left( \sum_{j=1}^kc_j\phi_j(\theta)\sum_{i=1}^nh_j(x_i) \right) \] then the conjugate family for $\theta$ has the form \[ p(\theta|\tau)=[K(\tau)]^{-1}[g(\theta)]^{\tau_0}\exp\left( \sum_{j=1}^kc_j\phi_j(\theta)\tau_j\right) \] where $\tau=(\tau_0,\tau_1,…,\tau_k)$ and $K(\tau)=\int_\theta[g(\theta)]^{\tau_0} \exp\left(\sum_{j=1}^k c_j\phi_j(\theta)\tau_j \right)d\theta <\infty$.
    • [explain from wiki]
      Assume that the probability of a single observation follows an exponential family, parameterized using its natural parameter: \[ p(x|\eta)=h(x)g(\eta)\exp(\eta^TT(x)) \] Then, for data $X=(X_1,…,X_n)$, the likelihood is computed as follows: \[ p(X|\eta)=\left(\prod_{i=1}^n h(x_i)\right)g(\eta)^n \exp\left(\eta^T\sum_{i=1}^n T(x_i) \right) \] Then, for the above conjugate prior: \[ p(\eta|\chi,\nu)=f(\chi,\nu)g(\eta)^\nu\exp(\eta^T \chi) \] We can then compute the posterior as follows: \[ \begin{split} p(\eta|X,\chi,\nu)&\propto p(X|\eta)p(\eta|\chi,\nu)\\
      &=\left(\prod_{i=1}^n h(x_i)\right)g(\eta)^n \exp\left(\eta^T\sum_{i=1}^n T(x_i) \right)f(\chi,\nu)g(\eta)^\nu\exp(\eta^T \chi)\\
      &\propto g(\eta)^n \exp\left(\eta^T\sum_{i=1}^n T(x_i) \right) g(\eta)^\nu\exp(\eta^T \chi)\\
      &\propto g(\eta)^{n+\nu}\exp\left(\eta^T \left(\chi+\sum_{i=1}^nT(x_i)\right)\right) \end{split} \]
    • [example: bernoulli]
      \[ \begin{split} p(x|\theta)&=\prod_{i=1}^n \theta^{x_i}(1-\theta)^{(1-x_i)}\\
      &=(1-\theta)^n \exp\left(\log\frac{\theta}{1-\theta}\sum_{i=1}^nx_i \right) \end{split} \] So the conjugate prior can be \[ \begin{split} p(\theta|\tau)&\propto (1-\theta)^{\tau_0} \exp\left(\log\frac{\theta}{1-\theta}\tau_1 \right) \\
      &\propto (1-\theta)^{\tau_0}\left(\frac{\theta}{1-\theta}\right)^{\tau_1}\\
      &\propto \theta^{\tau_1}(1-\theta)^{\tau_0-\tau_1} \end{split} \] which is a beta distribution. In the $(\chi,\nu)$ notation above, the update rule is \[ \begin{split} \chi’ &= \chi + \sum_{i=1}^nT(x_i)\\
      \nu’ &=\nu + n
      \end{split} \] Then the posterior has the form \[ p(\theta|X,\tau)\propto \theta^{\sum_{i=1}^n x_i +\tau_1}(1-\theta)^{n-\sum_{i=1}^n x_i +\tau_0-\tau_1} \] which is also a beta distribution. A numerical sketch of this update follows.
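
As referenced above, a minimal sketch of the Beta–Bernoulli update; the Beta$(a,b)$ parameters relate to the notes' $\tau$ via $a=\tau_1+1$, $b=\tau_0-\tau_1+1$, and the use of numpy/scipy is our assumption:

```python
# Conjugate update for Bernoulli data: a Beta(a0, b0) prior becomes
# Beta(a0 + sum(x), b0 + n - sum(x)) after observing x_1, ..., x_n.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=200)   # Bernoulli(theta=0.3) data

a0, b0 = 2.0, 2.0                    # prior Beta(a0, b0)
a_n = a0 + x.sum()                   # chi' = chi + sum_i T(x_i)
b_n = b0 + len(x) - x.sum()          # nu'  = nu + n
posterior = stats.beta(a_n, b_n)
print(posterior.mean())              # concentrates near the true theta = 0.3
```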