Guy has a note on the EM algorithm, and I wanted to show how one can both explain and generalize EM in a very straightforward way. (You should probably read his note before continuing, or see the Miscellanea at the bottom to jog your memory.)

EM is a special case of a family of algorithms known as majorization-minimization procedures (MMPs), which can also be viewed through a coordinate-descent lens. A majorization-minimization procedure operates by iteratively (nearly) minimizing an upper bound of the objective which is tight at the previous iterate. In other words, to guarantee the procedure minimizes our objective f(\theta), we require the following three properties:

  1. f(\theta^{(t)})=g(\theta^{(t)}|\theta^{(t)})
  2. f(\theta) \le g(\theta|\theta^{(t)})
  3. \theta^{(t+1)} \in \{ \theta\in\Theta : g(\theta|\theta^{(t)}) \le g(\theta^{(t)}|\theta^{(t)})\}.

That these conditions guarantee monotone descent (and, under additional regularity conditions, convergence of the iterates \theta^{(t)} to a stationary point of f) follows from the so-called descent property, i.e.,
  f(\theta^{(t)})\overset{1.}{=}g(\theta^{(t)}|\theta^{(t)})\overset{3.}{\ge} g(\theta^{(t+1)}|\theta^{(t)}) \overset{2.}{\ge} f(\theta^{(t+1)}).
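
To make the three conditions concrete, here is a tiny Python sketch of a generic MM loop on a toy objective of my own choosing (not from Guy's note): we minimize f(\theta)=|\theta-1|+\theta^2 using the standard quadratic bound |u|\le u^2/(2c)+c/2 for any c>0, which holds since (|u|-c)^2\ge 0 and is tight at c=|u|.

```python
# Generic MM loop on a toy objective (illustrative example, not from the post):
# f(t) = |t - 1| + t^2, majorized via |u| <= u^2/(2c) + c/2 with c = |theta_t - 1|.
# The bound is tight at theta_t (condition 1) and holds everywhere (condition 2).

def f(theta):
    return abs(theta - 1.0) + theta ** 2

def mm_step(theta_t):
    # g(theta | theta_t) = (theta - 1)^2/(2c) + c/2 + theta^2 is a quadratic;
    # its exact minimizer (condition 3) solves (theta - 1)/c + 2*theta = 0.
    c = abs(theta_t - 1.0)
    return 1.0 / (1.0 + 2.0 * c)

theta = 0.0
values = [f(theta)]
for _ in range(60):
    theta = mm_step(theta)
    values.append(f(theta))

# Descent property: f(theta^(t)) is nonincreasing, and the iterates approach
# the true minimizer theta* = 1/2, where f(theta*) = 3/4.
assert all(a >= b - 1e-12 for a, b in zip(values, values[1:]))
assert abs(theta - 0.5) < 1e-9
```

Note that each step minimizes the majorizer exactly; a mere decrease of g (condition 3) would suffice.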

From this view we see that EM is a special case, i.e.,
  \begin{array}{rl}  f(\theta) &\stackrel{def}{=} -\log p_\theta(x)\\  &=-\log p_\theta(x) \pm D(p_{\theta'}(Y|x)||p_\theta(Y|x))\\  &\le -\log p_\theta(x) + D(p_{\theta'}(Y|x)||p_\theta(Y|x))\\  &= D(p_{\theta'}(Y|x)||p_\theta(Y,x))\\  &\stackrel{def}{=} g(\theta|\theta').  \end{array}
This result (and that the MM conditions are met) follows from Gibbs’ Inequality, i.e., D(q||p)\ge 0 for all distributions q,p with equality iff q is identically p.
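
As a sanity check that EM really does inherit the descent property, here is a minimal sketch (my own toy example, not from the post) fitting the two means of a unit-variance, equal-weight Gaussian mixture; the negative log-likelihood f(\theta)=-\log p_\theta(x) should be nonincreasing across iterations:

```python
# Minimal EM for a two-component Gaussian mixture (unit variances, weights 1/2),
# checking the MM descent property f(theta^(t+1)) <= f(theta^(t)).
import math
import random

random.seed(0)
# Synthetic data drawn from means -2 and +2.
data = [random.gauss(-2.0, 1.0) for _ in range(200)] + \
       [random.gauss(2.0, 1.0) for _ in range(200)]

def nll(mu1, mu2):
    # f(theta) = -log p_theta(x): negative log-likelihood of the mixture.
    total = 0.0
    for x in data:
        p = 0.5 * math.exp(-0.5 * (x - mu1) ** 2) + \
            0.5 * math.exp(-0.5 * (x - mu2) ** 2)
        total -= math.log(p / math.sqrt(2 * math.pi))
    return total

def em_step(mu1, mu2):
    # E-step: responsibilities p_theta(Z|x); M-step: minimize the majorizer.
    s1 = s2 = w1 = w2 = 0.0
    for x in data:
        a = math.exp(-0.5 * (x - mu1) ** 2)
        b = math.exp(-0.5 * (x - mu2) ** 2)
        r = a / (a + b)
        w1 += r;     s1 += r * x
        w2 += 1 - r; s2 += (1 - r) * x
    return s1 / w1, s2 / w2

mu1, mu2 = -0.5, 0.5
vals = [nll(mu1, mu2)]
for _ in range(30):
    mu1, mu2 = em_step(mu1, mu2)
    vals.append(nll(mu1, mu2))

assert all(a >= b - 1e-6 for a, b in zip(vals, vals[1:]))  # descent property
```

The E-step builds the majorizer (it makes the bound tight at the current \theta), and the M-step minimizes it, so the chain of inequalities above applies verbatim.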

For more information and a handful of alternative majorization recipes, see the excellent 2004 tutorial by David Hunter and Kenneth Lange, A Tutorial on MM Algorithms. For a clever application of MM, check out this paper by Andrew Ng and pals.

Most (all?) of the majorizers listed by Lange exploit convexity of the objective. When the objective is nonconvex, it can often be rewritten as a difference of convex functions (this is always possible for, e.g., twice continuously differentiable functions on a compact set, by adding and subtracting a sufficiently large quadratic); linearizing the subtracted convex function at the current iterate then yields a majorizer, an approach known as the convex-concave procedure. In practice, however, finding a usable difference-of-convex decomposition of f can be quite challenging.
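
Here is a sketch of that difference-of-convex idea on a toy objective of my own (not one of Lange's recipes): f(\theta)=\theta^4-2\theta^2 is the difference of the convex functions \theta^4 and 2\theta^2, and replacing the subtracted term by its tangent at \theta^{(t)} gives a majorizer with a closed-form minimizer.

```python
# Convex-concave procedure on f(t) = t^4 - 2 t^2 (illustrative toy example).
# Since 2 t^2 is convex, it lies above its tangent, so
#   g(t | theta_t) = t^4 - 2*theta_t**2 - 4*theta_t*(t - theta_t)
# majorizes f and is tight at theta_t. Its minimizer solves 4 t^3 = 4 theta_t,
# i.e., t = theta_t ** (1/3) for theta_t > 0.

def f(t):
    return t ** 4 - 2.0 * t ** 2

theta = 2.0
vals = [f(theta)]
for _ in range(100):
    theta = theta ** (1.0 / 3.0)  # exact minimizer of the majorizer
    vals.append(f(theta))

assert all(a >= b - 1e-12 for a, b in zip(vals, vals[1:]))  # descent property
assert abs(theta - 1.0) < 1e-9  # theta = 1 is a stationary point, f(1) = -1
```

The majorizer here never touches the convex part \theta^4, only the subtracted term, which is exactly what makes the recipe so convenient when it applies.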

Alternative (equivalent) ways to derive EM:

Alt1: (KL-Div)
  \begin{array}{rl}  \log p_\theta(x) &= \sum_z q(z)\log p_\theta(x)\\  &= \sum_z q(z)\log\left(\frac{p_\theta(x,z)}{q(z)}\cdot\frac{q(z)}{p_\theta(z|x)}\right)\\  &= -D(q(Z)||p_\theta(x,Z)) + D(q(Z)||p_\theta(Z|x))\\  &\ge -D(q(Z)||p_\theta(x,Z))  \end{array}
where the bound follows from Gibbs’ inequality.

Alt2: (Jensen)
  \begin{array}{rl}  \log p_\theta(x) &= \log\sum_z p_\theta(x,z)\\  &= \log\sum_z q(z)\frac{p_\theta(x,z)}{q(z)}\\  &\ge \sum_z q(z)\log\frac{p_\theta(x,z)}{q(z)}\\  &= -D(q(Z)||p_\theta(x,Z))  \end{array}
where the bound follows from Jensen’s inequality.
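
Both alternatives bottom out in the same bound, \log p_\theta(x)\ge -D(q(Z)||p_\theta(x,Z)), which is easy to verify numerically on a tiny discrete model (the numbers below are made up for illustration); tightness holds exactly when q(Z)=p_\theta(Z|x):

```python
# Numeric sanity check of the bound log p(x) >= -D(q(Z) || p_theta(x, Z))
# on a toy discrete model with a binary latent z (illustrative numbers).
import math

joint = [0.3, 0.1]           # p(x, z=0), p(x, z=1) for a fixed observation x
px = sum(joint)              # p(x) = 0.4
posterior = [j / px for j in joint]

def elbo(q):
    # -D(q(Z) || p(x, Z)) = sum_z q(z) log( p(x, z) / q(z) )
    return sum(qz * math.log(j / qz) for qz, j in zip(q, joint) if qz > 0)

assert elbo([0.5, 0.5]) <= math.log(px) + 1e-12       # the bound holds
assert abs(elbo(posterior) - math.log(px)) < 1e-12    # tight at q = p(Z|x)
```

This is why the E-step sets q to the current posterior: it closes the gap D(q(Z)||p_\theta(Z|x)) and makes the bound tight, as MM condition 1 requires.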

Note that all of these derivations essentially just use the chain rule p(x)=p(x,z)/p(z|x) and the convexity of -\log. For a great (but dated) overview of different incarnations of EM, see this survey.