Key Ideas

We want to learn the probability distributions that generate real-world data. Sometimes it is easier to learn the generative distribution from a representation of the data than from the raw data itself. For example, such a representation may discard the noise while retaining enough information to explain the variance in the real-world data.

We want to find a "good representation" ($h \in \mathbb{R}^m$) of data ($x \in \mathbb{R}^n$) by applying transformations ($h = f(x)$). The authors define a good representation as "one in which the distribution of the data is easy to model" and in this paper they propose a transformation such that the distribution of transformed data completely factorizes (i.e. $p_H(h) = \prod_dp_{H_d}(h_d)$).

Design of a Good Transformation

After applying the transformation $f$ (assume it is invertible, and assume $m=n$), the distribution of the original random variable can be written as (see Appendix 1 for proof): $$p_X(x) = p_H\left(f(x)\right)\left\lvert\det(J)\right\rvert, \quad J = \frac{\partial f(x)}{\partial x}$$

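Now say we observe $N$ samples of $X$. We then estimate the transformation $f$ (a parametrized neural network) by maximizing the log-likelihood of these samples, while assuming some prior distribution $p_H$. For a single sample $x$, with $D = n$ denoting the dimension of the data and with a prior that factorizes across components, the log-likelihood can be written as: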

$$\log{p_X(x)} = \sum_{d=1}^D\log{p_{H_d}(f_d(x))} + \log{\lvert det\left(\frac{\partial f(x)}{\partial x}\right)\rvert}$$
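To make the two terms of this expression concrete, here is a minimal sketch in plain Python. It is not the network from the paper: it assumes a hypothetical element-wise affine transformation $f_d(x) = s_d x_d + b_d$ (so the Jacobian is diagonal) and a standard normal prior on each component. The names `log_likelihood`, `scale` and `shift` are illustrative only.

```python
import math

def standard_normal_logpdf(h):
    # log of the N(0, 1) density evaluated at h
    return -0.5 * (h * h + math.log(2.0 * math.pi))

def log_likelihood(x, scale, shift):
    # Assumed toy transformation: f_d(x) = scale[d] * x[d] + shift[d].
    h = [s * xi + b for xi, s, b in zip(x, scale, shift)]
    # First term: sum over components of log p_{H_d}(f_d(x)).
    log_prior = sum(standard_normal_logpdf(hd) for hd in h)
    # Second term: log |det J|; J is diagonal here, so it is a sum of log |scale[d]|.
    log_abs_det = sum(math.log(abs(s)) for s in scale)
    return log_prior + log_abs_det

example = log_likelihood([0.5, -1.2], scale=[2.0, 0.5], shift=[0.1, -0.3])
```

With this toy choice, each $x_d$ is Gaussian with mean $-b_d/s_d$ and standard deviation $1/\lvert s_d \rvert$, which gives an independent way to check the value.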

The authors observe that a good transformation should be able to capture a complex distribution, and at the same time the following should be easy to compute:

• Its inverse (so that we can easily sample from $p_X$ in two steps: (1) $h \sim p_H$, (2) $x = f^{-1}(h)$).
• The determinant of its Jacobian matrix (because that appears in the log-likelihood expression we are trying to maximize).
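The two-step sampling procedure from the first bullet can be sketched as follows. This again assumes a hypothetical element-wise affine $f$ with a standard normal prior, chosen only because its inverse is trivial to write down; `f`, `f_inverse` and `sample_x` are illustrative names, not anything defined in the paper.

```python
import random

def f(x, scale, shift):
    # Assumed toy transformation: h_d = scale[d] * x[d] + shift[d]
    return [s * xi + b for xi, s, b in zip(x, scale, shift)]

def f_inverse(h, scale, shift):
    # Exact inverse of the toy transformation above
    return [(hd - b) / s for hd, s, b in zip(h, scale, shift)]

def sample_x(scale, shift, rng):
    # Step 1: draw h from the factorized prior p_H (standard normal here).
    h = [rng.gauss(0.0, 1.0) for _ in scale]
    # Step 2: map it back to data space with x = f^{-1}(h).
    return f_inverse(h, scale, shift)

rng = random.Random(0)
x_sample = sample_x(scale=[2.0, 0.5], shift=[0.1, -0.3], rng=rng)
```

Sampling is only as cheap as `f_inverse`, which is why an easily invertible $f$ matters.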

Architecture of the Neural Network used for the Transformation

Let's first see what makes for an easily computable Jacobian determinant. The determinant of a triangular matrix is just the product of its diagonal elements [3]. With this in mind, the authors introduce a family of invertible functions with triangular Jacobians, called "Couplings".

General Coupling Layer

If $x \in \mathbb{R}^D$, then we define $y \in \mathbb{R}^D$ such that the first $d$ components of $y$ are the same as those of $x$, and the other $D-d$ components are determined by a function $g: \mathbb{R}^{D-d} \times m\left(\mathbb{R}^{d}\right) \to \mathbb{R}^{D-d}$ called the "Coupling Law", where $m$ is a function defined on $\mathbb{R}^d$ (e.g. a neural network) and $g$ is invertible with respect to its first argument given the second:

$$y_{1\cdots d} = x_{1\cdots d} \\ y_{d+1\cdots D} = g\left(x_{d+1\cdots D}, m\left(x_{1\cdots d}\right)\right)$$

We see (from the way we defined this transformation above) that the Jacobian for this transformation of $x$ into $y$ is block lower triangular, so its determinant reduces to the determinant of the lower-right block:

$$\frac{\partial y}{\partial x} = \begin{bmatrix} I_d & 0 \\ \frac{\partial y_{d+1\cdots D}}{\partial x_{1\cdots d}} & \frac{\partial y_{d+1\cdots D}}{\partial x_{d+1\cdots D}} \end{bmatrix}$$
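The structure of this Jacobian can be checked numerically. The sketch below assumes one particularly simple coupling law, $g(a, b) = a + b$, with $D = 4$ and $d = 2$, and uses a hypothetical toy function `m` in place of a neural network. With this choice the lower-right block is the identity, so the determinant is 1.

```python
import math

def m(x_first):
    # Hypothetical stand-in for a neural network acting on the first d components.
    s = sum(x_first)
    return [math.tanh(s), s * s]

def coupling_forward(x, d):
    x1, x2 = x[:d], x[d:]
    # y_{1..d} = x_{1..d};  y_{d+1..D} = x_{d+1..D} + m(x_{1..d})
    return x1 + [a + b for a, b in zip(x2, m(x1))]

def coupling_inverse(y, d):
    y1, y2 = y[:d], y[d:]
    # The first block is unchanged, so m(y1) = m(x1) can be recomputed and subtracted.
    return y1 + [a - b for a, b in zip(y2, m(y1))]

def numerical_jacobian(func, x, eps=1e-6):
    # Central finite differences: J[i][j] approximates d y_i / d x_j.
    n = len(x)
    J = [[0.0] * n for _ in range(len(func(x)))]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += eps
        xm[j] -= eps
        yp, ym = func(xp), func(xm)
        for i in range(len(J)):
            J[i][j] = (yp[i] - ym[i]) / (2 * eps)
    return J

x = [0.3, -0.8, 1.5, 0.2]
J = numerical_jacobian(lambda v: coupling_forward(v, 2), x)
```

Note that the inverse never has to invert `m` itself, only the coupling law, which is what lets `m` be arbitrarily complex.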

Appendix 1

A proof for Equation (1) from the paper.

Here is a proof for 2D based on notes from Prof. Ash [1] and Prof. Dobelman [2]:

Let $f: \mathbb{R}^2 \to \mathbb{R}^2$ be the transformation. Let $h = (h_1, h_2)$ and $x = (x_1, x_2)$ be realizations of $H$ and $X$ respectively, such that $f(x) = h$. If $x_1$ changes by $dx_1$, then $h_1$ and $h_2$ change by $(\partial h_1/\partial x_1)dx_1$ and $(\partial h_2/\partial x_1)dx_1$ respectively. Similarly, if $x_2$ changes by $dx_2$, then $h_1$ and $h_2$ change by $(\partial h_1/\partial x_2)dx_2$ and $(\partial h_2/\partial x_2)dx_2$ respectively. Now consider the small rectangle with side lengths $dx_1$ and $dx_2$ in the $x_1-x_2$ plane. It corresponds to a parallelogram in the $h_1-h_2$ plane with sides:

$$\vec{A} = \begin{bmatrix}\frac{\partial h_1}{\partial x_1}dx_1 \\ \frac{\partial h_2}{\partial x_1}dx_1\end{bmatrix} \\ \vec{B} = \begin{bmatrix}\frac{\partial h_1}{\partial x_2}dx_2 \\ \frac{\partial h_2}{\partial x_2}dx_2\end{bmatrix}$$

The area of this parallelogram is given by the magnitude of the cross product $\vec{A} \times \vec{B}$, which equals $\lvert\det(J)\rvert dx_1dx_2$, where $J$ is the Jacobian matrix, written as: $$J = \begin{bmatrix} \frac{\partial h_1}{\partial x_1} & \frac{\partial h_1}{\partial x_2} \\ \frac{\partial h_2}{\partial x_1} & \frac{\partial h_2}{\partial x_2} \end{bmatrix}$$ Because $f$ is assumed to be invertible, it maps the rectangle one-to-one onto the parallelogram, so the probability mass of the rectangle in the $x_1-x_2$ plane is the same as that of the parallelogram in the $h_1-h_2$ plane. So we can write: $$p_H(h)\lvert\det(J)\rvert dx_1dx_2 = p_X(x)dx_1dx_2 \\ \implies p_X(x) = p_H(f(x))\lvert\det(J)\rvert$$
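The area argument can be checked with a small numeric sketch. It builds the two side vectors $\vec{A}$ and $\vec{B}$ from a Jacobian and compares the parallelogram area with the absolute value of the determinant times $dx_1dx_2$. The matrix entries are arbitrary illustrative values with a negative determinant, which shows why the area involves the absolute value.

```python
def det2(J):
    # Determinant of a 2x2 matrix given as nested lists
    return J[0][0] * J[1][1] - J[0][1] * J[1][0]

def parallelogram_area(J, dx1, dx2):
    # Sides of the parallelogram in the h1-h2 plane
    A = (J[0][0] * dx1, J[1][0] * dx1)  # image of the step dx1 along x1
    B = (J[0][1] * dx2, J[1][1] * dx2)  # image of the step dx2 along x2
    # Magnitude of the 2D cross product A x B
    return abs(A[0] * B[1] - A[1] * B[0])

J = [[2.0, 1.0], [0.5, -3.0]]  # illustrative values; det(J) is negative
dx1, dx2 = 0.1, 0.2
area = parallelogram_area(J, dx1, dx2)
expected = abs(det2(J)) * dx1 * dx2
```

Here `area` and `expected` agree up to floating-point rounding, while `det2(J) * dx1 * dx2` without the absolute value would be negative and could not be an area.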