\(\def\mathbi#1{\boldsymbol{\it #1}}\)

This is the first part of a three-part introduction to the Restricted Boltzmann Machine (RBM), one of the fundamental building blocks of deep learning. RBMs were first introduced by Smolensky (1986) under the name Harmoniums. I assume basic knowledge of probability and graph theory in the discussion. We use the notation \( A \perp B \ | \ C \) to denote that random variables \(A\) and \(B\) are independent given the random variable \(C\). That is, \(Pr\ (A \ | \ B,C)=Pr \ (A|C)\).

Consider a finite undirected graph \(G=(V,E)\) where \(V\) denotes the set of nodes and \(E\) the set of edges. A set of nodes \(\mathcal{V} \subset V\) separates two nodes \(v\) and \(w\) not in that set if every path from \(v\) to \(w\) contains a node from \(\mathcal{V}\). The set \(\mathcal{V} \subset V\) is said to separate two sets \(\mathcal{U} \subset V\) and \(\mathcal{W} \subset V\) if \(\mathcal{V}\) separates every pair of nodes \(u \in \mathcal{U}\) and \(w \in \mathcal{W}\).

We now associate a random variable \( X_v \) with each node \(v \in V\). We assume each such random variable takes values in the same state space, say \(\Lambda\). Then we say the joint random variable \(\mathbf{X}=(X_v)_{v \in V}\) forms a Markov Random Field (MRF) if it satisfies the Global Markov Property. The Global Markov Property, or simply the Markov Property, states that whenever disjoint subsets \(\mathcal{A}, \mathcal{B}, \mathcal{S}\) of \(V\) are such that \(\mathcal{S}\) separates \(\mathcal{A}\) and \(\mathcal{B}\), the random variables defined by the nodes in \(\mathcal{A}\) are independent of the random variables defined by the nodes in \(\mathcal{B}\) given the random variables defined by the nodes in \(\mathcal{S}\). That is, \((X_a)_{a \in \mathcal{A}} \perp (X_b)_{b \in \mathcal{B}} \ | \ (X_s)_{s \in \mathcal{S}} \). (Thus, roughly speaking, in an MRF the “separating property” of a set of nodes corresponds to a conditional independence property regarding that set.)
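For a concrete example, consider the path graph \(v_1 - v_2 - v_3\). The set \(\{v_2\}\) separates \(\{v_1\}\) and \(\{v_3\}\), since the only path between \(v_1\) and \(v_3\) passes through \(v_2\), so the Global Markov Property demands \(X_{v_1} \perp X_{v_3} \ | \ X_{v_2}\), that is, \(Pr\ (X_{v_1} \ | \ X_{v_3}, X_{v_2}) = Pr\ (X_{v_1} \ | \ X_{v_2})\).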

In general, in a finite undirected graph whose nodes are associated with random variables, we define the Markov Blanket \(MB(v)\) of a node \(v\) as a set of nodes which, once conditioned on, fully shields \(X_v\) from the rest of the graph. Formally speaking, the set \(MB(v)\) is a Markov Blanket of \(v\) if for any set of nodes \(\mathcal{B}\) containing neither \(v\) nor any node of \(MB(v)\), \(X_v\) is independent of the random variables \( (X_b)_{b \in \mathcal{B} }\) given \((X_m)_{m \in MB(v)}\). In an MRF, the Markov Blanket of a node is its neighbourhood (because the neighbourhood trivially separates that node from any other node outside the neighbourhood).
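For example, in the path graph \(v_1 - v_2 - v_3 - v_4\), the Markov Blanket of \(v_2\) is its neighbourhood \(\{v_1, v_3\}\), and accordingly \(X_{v_2} \perp X_{v_4} \ | \ X_{v_1}, X_{v_3}\).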

An important question is which distributions satisfy the Global Markov Property and hence form an MRF. If \( \mathbi{x} = (x_v)_{v \in V}\) denotes a realisation of \(\mathbf{X}\) taking values in the state space \(\Lambda^{|V|}\), then the question translates to: which probability distributions \(p\) of \(\mathbi{x}\) define a Markov Random Field? This is partially answered by the important Hammersley-Clifford Theorem, stated by Hammersley and Clifford in an unpublished paper of 1971.

Theorem 1 (Hammersley-Clifford Theorem): A strictly positive distribution \(p\) satisfies the Global Markov Property with respect to an undirected graph \(G\) if and only if \(p\) factorises over \(G\).

A distribution \(p(\mathbi{x})\) is said to factorise over an undirected graph \(G\) if it can be written as a product of functions, called potential functions, defined on the maximal cliques of \(G\). Let \(\mathcal{C}\) be the set of maximal cliques \(C\) of \(G\). Then we say \(p\) factorises over \(G\) if there are non-negative functions \({\psi_C}\), one for each clique \(C \in \mathcal{C}\) and depending on \(\mathbi{x}\) only through the nodes in \(C\), which satisfy:

\[ p(\mathbi{x}) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(\mathbi{x}) \]

and

\[ Z = \sum_{\mathbi{x} \in \Lambda^{|V|}} \ \prod_{C \in \mathcal{C}} \psi_C(\mathbi{x}). \]

These non-negative functions are called potential functions, and the normalisation constant \(Z\) is called the partition function. If \(p\) is strictly positive, so are the potential functions, and we can write:

\[ p(\mathbi{x}) = \frac{1}{Z} \ e^{ -E(\mathbi{x}) } \]

where \( E(\mathbi{x}) = -\sum_{C \in \mathcal{C}} \ln \psi_C(\mathbi{x}) \) is called the energy function. The above formulation of \(p(\mathbi{x})\) is called a Gibbs distribution. Thus a strictly positive joint probability distribution of an MRF always corresponds to a Gibbs distribution. (Historically, the theory of Markov Random Fields derives from generalisations of the Ising Model of Ferromagnetism (1926, 1971), and thus terms relating to energy are used.)
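To make these definitions concrete, here is a minimal Python sketch (assuming a toy three-node path graph with binary spins in \(\{-1,+1\}\) and a hypothetical coupling strength \(J\), neither of which is part of the original discussion) that builds a Gibbs distribution from edge potentials and computes the partition function by brute-force enumeration:

```python
import itertools
import math

# Toy MRF (an assumed example): path graph 1 - 2 - 3 with spins in {-1, +1}.
# Its maximal cliques are the edges {1,2} and {2,3}, so the distribution
# factorises as p(x) = (1/Z) * psi_12(x) * psi_23(x).

J = 0.8  # hypothetical coupling strength; neighbouring spins prefer to agree

def psi(xi, xj):
    """Strictly positive potential function for one maximal clique (an edge)."""
    return math.exp(J * xi * xj)

def energy(x):
    """E(x) = -sum_C ln psi_C(x), so that p(x) = exp(-E(x)) / Z."""
    x1, x2, x3 = x
    return -(math.log(psi(x1, x2)) + math.log(psi(x2, x3)))

states = list(itertools.product([-1, 1], repeat=3))
Z = sum(math.exp(-energy(x)) for x in states)  # partition function

for x in states:
    print(x, round(math.exp(-energy(x)) / Z, 4))  # Gibbs probability of x
```

Aligned configurations such as \((+1,+1,+1)\) come out with the highest probability, exactly as the energy formulation suggests: lower energy means higher probability.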

RBMs are nothing but bipartite MRFs. But before diving into Restricted Boltzmann Machines, we discuss Markov Chains, as they will let us sample from the Gibbs distribution of an MRF by repeatedly simulating random transitions.

A semi-infinite path graph (ray) with nodes \(v_1, v_2, \dots\) and edges between \(v_i\) and \(v_{i+1}\) for \(i \geq 1\) forms a Markov Chain if the corresponding family of random variables \(\mathbf{X}=\{ X_k \ | \ k \geq 1 \}\), taking values in the finite space \(\Omega\), satisfies the Markov Property. That is, for all \(k\) we have \(X_{k+1} \perp X_{k-1}, X_{k-2}, \dots, X_{1} \ | \ X_k\). Usually \(k\) is viewed as a discrete time index. Thus the distribution of \(X_{k+1}\) depends only on \(X_k\), the immediately preceding state of the discrete-time Markov process. This dependence is formulated in terms of a transition matrix, defined as the matrix of transition probabilities \( \mathbb{P}_k = (p^k_{ij})_{i,j \in \Omega} \) where \(p^k_{ij} = Pr \ (X_{k+1} = j \ | \ X_k = i) \). The chain is homogeneous if \(\mathbb{P}_k = \mathbb{P} \) for all \(k\), in which case the Markov process is completely defined by the initial distribution and the transition matrix. The chain is said to have reached a stationary distribution if, from some point on, the random variables corresponding to all further nodes have the same probability distribution. A chain with starting state \(X_1\) is said to be irreducible if for every pair of points \(i,j \in \Omega\) there exists a \(k>0\) such that \(Pr \ ( X_k = j \ | \ X_1 = i )>0\). The chain is said to be aperiodic if \( \forall \ i \in \Omega\) we have \( \gcd \ \{ \ k-1 \ | \ Pr \ ( X_k =i \ | \ X_1 = i )>0 , \ k > 1 \ \} = 1 \).

Theorem 2: An irreducible, aperiodic homogeneous Markov Chain with finite state space converges to a unique stationary distribution.
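As a quick numerical illustration of Theorem 2 (a sketch with an arbitrarily chosen transition matrix, not taken from the text), repeatedly applying an irreducible, aperiodic, homogeneous transition matrix drives very different initial distributions to the same limit:

```python
import numpy as np

# Hypothetical 3-state chain: P is row-stochastic, irreducible (every state
# can reach every other) and aperiodic (positive diagonal entries), so by
# Theorem 2 it has a unique stationary distribution.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

for mu in (np.array([1.0, 0.0, 0.0]), np.array([0.1, 0.1, 0.8])):
    for _ in range(100):
        mu = mu @ P  # one step of the chain: mu_{k+1} = mu_k P
    print(mu)        # both initial distributions converge to the same limit
```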

This has important practical uses for sampling from a distribution that cannot be sampled directly, for instance because its normalisation constant is intractable. We construct a Markov Chain that converges to the desired probability distribution and simulate a large number of state transitions on it. Reading off the state of the chain then gives an approximate sample of the desired distribution, once the early “burn-in” samples, drawn before the chain has converged, are discarded. This method is popularly known as Gibbs Sampling. In general, methods which sample from a distribution by constructing a Markov Chain are called Markov Chain Monte Carlo (MCMC) methods. The speed of convergence is an important question in these methods.

We now show how to set up the Gibbs Sampling method for the desired distribution of an MRF, using the same conventions as before. \(\mathbi{X} = (X_1, \dots, X_n)\) is an MRF with respect to an undirected graph \(G=(V,E)\) with \(V=\{1,2,\dots,n\}\). The \(X_i\)'s take values in the finite state space \(\Lambda\). For \(\mathbi{x} \in \Lambda^{|V|}\), let \(\pi(\mathbi{x}) = \frac{1}{Z} e^{ -E(\mathbi{x}) }\) be the Gibbs distribution of the given MRF. From here on, we will refer to a family of random variables associated with a graph \(G\) and defined on \(\Lambda^{|V|}\) simply as a Random Field (RF), in contrast to a Markov RF, which has the additional property of factorising over \(G\).

We construct a Markov Chain \( (\mathbi{X}^1 , \mathbi{X}^2, \dots , \mathbi{X}^k, \dots ) \) of RFs with distributions \((\pi^1, \pi^2, \dots, \pi^k, \dots )\) respectively, such that the chain converges to the desired stationary Gibbs distribution \(\pi\) of the MRF \(\mathbi{X}\) as \(k \rightarrow \infty \). Note that each RF \(\mathbi{X}^k \) is a family of random variables \((X^k_1, \dots , X^k_n)\), which can also be interpreted as a single joint random variable. \((\mathbi{X}^1 , \mathbi{X}^2, \dots , \mathbi{X}^k, \dots ) \) are the states of the Markov Chain and \((\pi^1, \pi^2, \dots, \pi^k, \dots )\) are the respective probability distributions, each defined on the space \(\Lambda^{|V|} \).

The rules of transition are as follows. The transition from state \(\mathbi{X}^k\) to \(\mathbi{X}^{k+1}\) of the Markov Chain is made by choosing a node \(i \in V\) with probability \(q(i)\) and resampling the value of the \(i\)-th component while keeping all other components fixed. Thus, for a given \(k\), the distribution \(\pi ^ {k+1}\) of \(\mathbi{X}^{k+1}\) is related to the distribution \(\pi ^ {k}\) of \(\mathbi{X}^{k}\) by the transition matrix \( \mathbb{P} = (p_{ \mathbi{x}, \mathbi{y} })_{\mathbi{x}, \mathbi{y} \in \Lambda ^ {|V|}} \) where \(p_{ \mathbi{x}, \mathbi{y} } = Pr\ ( \ \mathbi{X}^{k+1} = \mathbi{y} \ | \ \mathbi{X}^{k} = \mathbi{x} \ ) \) for given \( \mathbi{x} \) and \( \mathbi{y} \) in \( \Lambda ^ {|V|} \). If the tuples \( \mathbi{x}= (x_v)_{v \in V} \) and \( \mathbi{y}=(y_v)_{v \in V} \) differ at exactly one node \(i \in V\), this probability is given by \(p_{ \mathbi{x}, \mathbi{y} } = q(i) \ Pr\ ( \ X^{k+1}_i = y_i \ | \ \mathbi{X}^{k} = \mathbi{x} \ ) \); if they differ at more than one node it is zero, since only one component changes per step. Further, we assume (or rather, we define the transition process such that) \( X^{k+1}_i \perp X^{k}_i \ | \ (X^{k}_j)_{j \in V \setminus \{i\}}\). From this definition of the transition rule we obtain the following formulas for the transition probabilities. For \( \mathbi{x} \neq \mathbi{y} \):

\begin{equation}
p_{ \mathbi{x}, \mathbi{y} } =
\begin{cases}
q(i) \ Pr\ ( \ X^{k+1}_i = y_i \ | \ \mathbi{X}^{k} = \mathbi{x} \ ) & \text{if } \mathbi{x} \text{ and } \mathbi{y} \text{ differ exactly at node } i, \\
0 & \text{if they differ at more than one node,}
\end{cases}
\label{eq1}
\end{equation}

and for \(\mathbi{x} = \mathbi{y} \):

\begin{equation}
p_{ \mathbi{x}, \mathbi{x} } = \sum_{i \in V} q(i) \ Pr\ ( \ X^{k+1}_i = x_i \ | \ \mathbi{X}^{k} = \mathbi{x} \ ).
\label{eq2}
\end{equation}
For a given \(k\), the transition rule of the Markov Chain of RFs is thus defined by the transition probabilities given in equations \((\ref{eq1})\) and \((\ref{eq2})\). It would of course be convenient to define the transition rule such that the transition matrix \(\mathbb{P}\) is independent of \(k\), that is, such that the Markov Chain is homogeneous. If the chain is furthermore irreducible and aperiodic, we can conclude using \(\textbf{Theorem 2}\) that it converges to a unique stationary distribution. And if this stationary distribution is our desired Gibbs distribution \(\pi\), we are done. To find such a transition rule we use the following theorem, built around the Detailed Balance Equation, a condition much stronger than merely saying that a Markov Chain has a stationary distribution.

Theorem 3 (Detailed Balance Equation): Given a homogeneous Markov Chain with transition probabilities \(p_{ \mathbi{x}, \mathbi{y} }\), a probability distribution \(\pi\) defined on the state space which satisfies the Detailed Balance Equation \( \pi(\mathbi{x}) \ p_{ \mathbi{x}, \mathbi{y} } = \pi(\mathbi{y}) \ p_{ \mathbi{y}, \mathbi{x} } \) for all states \(\mathbi{x}, \mathbi{y}\) is necessarily a stationary distribution of the Markov Chain. (The converse may not necessarily be true.)
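The forward direction is a one-line computation: summing the Detailed Balance Equation over all states \(\mathbi{x}\) and using the fact that each row of the transition matrix sums to one gives

\[ \sum_{\mathbi{x}} \pi(\mathbi{x}) \ p_{ \mathbi{x}, \mathbi{y} } = \sum_{\mathbi{x}} \pi(\mathbi{y}) \ p_{ \mathbi{y}, \mathbi{x} } = \pi(\mathbi{y}), \]

which is precisely the statement that \(\pi\) is stationary.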

For our Markov Chain, if the transition rule defined by \((\ref{eq1})\) and \((\ref{eq2})\) satisfies the Detailed Balance Equation for the desired Gibbs distribution \(\pi\) defined on the space \(\Lambda^{|V|}\), then we can conclude that the Markov Chain indeed has \(\pi\) as a stationary distribution. It can be shown that for this to hold we must sample the new value of node \(i\) from the conditional Gibbs distribution, \(Pr\ ( \ X^{k+1}_i = y_i \ | \ \mathbi{X}^{k} = \mathbi{x} \ ) = \pi( \ y_i \ | \ (x_v)_{v \in V \setminus \{i\}} \ )\), so that for \(\mathbi{x}\) and \(\mathbi{y}\) differing exactly at node \(i\) our transition rule becomes:

\begin{equation}
p_{ \mathbi{x}, \mathbi{y} } = q(i) \ \pi( \ y_i \ | \ (x_v)_{v \in V \setminus \{i\}} \ )
\label{eq3}
\end{equation}
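To verify that \((\ref{eq3})\) satisfies the Detailed Balance Equation, note that for \(\mathbi{x}\) and \(\mathbi{y}\) differing exactly at node \(i\),

\[ \pi(\mathbi{x}) \ p_{ \mathbi{x}, \mathbi{y} } = q(i) \ \frac{ \pi(\mathbi{x}) \ \pi(\mathbi{y}) }{ \sum_{\lambda \in \Lambda} \pi(x_1, \dots, x_{i-1}, \lambda, x_{i+1}, \dots, x_n) }. \]

Since \(\mathbi{x}\) and \(\mathbi{y}\) agree on all components except the \(i\)-th, the marginal in the denominator is the same for both, and the whole expression is symmetric in \(\mathbi{x}\) and \(\mathbi{y}\). Hence \(\pi(\mathbi{x}) \ p_{ \mathbi{x}, \mathbi{y} } = \pi(\mathbi{y}) \ p_{ \mathbi{y}, \mathbi{x} }\); for \(\mathbi{x} = \mathbi{y}\), and for tuples differing at more than one node, the equation holds trivially.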
The rule \((\ref{eq3})\) also makes clear that \(p_{ \mathbi{x}, \mathbi{y} }\) is independent of \(k\), and thus the chain is homogeneous. We conclude by \(\textbf{Theorem 3}\) that \(\pi\) is a stationary distribution of the Markov Chain defined by \((\ref{eq3})\). Further, if \(\pi\) and \(q\) are both strictly positive, it is easily proved that the Markov Chain is also irreducible and aperiodic, which guarantees that the chain converges to \(\pi\), its unique stationary distribution. We have thus constructed a Markov Chain which converges to our desired MRF with Gibbs distribution \(\pi\) starting from any initial distribution. Questions pertaining to how fast the chain converges, what the initial distribution should be, and how \(q\) should be chosen are often analysed to obtain better and more efficient results. Note that the intermediary RFs of the chain need not themselves be MRFs.
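Putting everything together, here is a minimal sketch of the resulting single-site Gibbs sampler, run on the same assumed three-spin toy model as before, with \(q\) uniform over \(V\). By the Markov Property of the MRF, the conditional \(\pi( \ y_i \ | \ (x_v)_{v \in V \setminus \{i\}} \ )\) depends only on the neighbours of \(i\), which is what makes each transition cheap to sample:

```python
import itertools
import math
import random
from collections import Counter

# Single-site Gibbs sampler for the assumed toy MRF: path graph 0 - 1 - 2
# with spins in {-1, +1} and pairwise potentials exp(J * x_i * x_j).
J = 0.8
NEIGHBOURS = {0: [1], 1: [0, 2], 2: [1]}

def prob_plus(i, x):
    """pi(X_i = +1 | rest): by the Markov Property this depends only on the
    neighbours of i, so only the cliques containing i need to be evaluated."""
    field = sum(x[j] for j in NEIGHBOURS[i])
    p_plus, p_minus = math.exp(J * field), math.exp(-J * field)
    return p_plus / (p_plus + p_minus)

random.seed(0)
x = [random.choice([-1, 1]) for _ in range(3)]  # arbitrary initial state
counts, burn_in, n_steps = Counter(), 1000, 200_000

for k in range(n_steps):
    i = random.randrange(3)                      # choose node i: q(i) = 1/3
    x[i] = 1 if random.random() < prob_plus(i, x) else -1
    if k >= burn_in:                             # discard burn-in samples
        counts[tuple(x)] += 1

# Compare the empirical distribution of visited states with the exact pi.
def energy(s):
    return -J * (s[0] * s[1] + s[1] * s[2])

Z = sum(math.exp(-energy(s)) for s in itertools.product([-1, 1], repeat=3))
for s in itertools.product([-1, 1], repeat=3):
    print(s, round(counts[s] / (n_steps - burn_in), 3),
          round(math.exp(-energy(s)) / Z, 3))
```

The two printed columns should agree to within sampling error, confirming that the chain has converged to \(\pi\).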

We dive fully into Restricted Boltzmann Machines in the next part (2/3).