Data often comes as high-dimensional points. The Support Vector Machine (SVM) is one of the best-known methods for classifying such data points into a pre-defined finite number of classes. This article is intended as a reasonably detailed introduction to the subject.

SVMs were originally developed for two-class classification problems, most prominently by Vladimir N. Vapnik and his collaborators. The formulation now in common use, together with the famous kernel trick, was published in 1995 (Cortes and Vapnik), although the underlying ideas are older.

  • We discuss the case of 2 classes initially.

Consider a finite sample of labelled training data points \( \{ (x_i, y_i), i= 1, …, l \} \) where \(x_i \in \mathbb{R}^n\) and \(y_i \in \{-1,1\} \). The \(n\)-dimensional vectors \(x_i\) are called feature vectors and their components are called the features of that data point/vector.

We consider initially the case of linearly separable points. Two sets of points are linearly separable if there exists a hyperplane that can exactly separate them. The SVM model then tries to find a suitable ‘separating hyperplane’ such that the test data is also well separated by that hyperplane. That is, there exist \(w \in \mathbb{R}^n \) and \( b \in \mathbb{R} \) such that:
\[ w^T x_i + b > 0 \ \text{ if } y_i = 1, \qquad w^T x_i + b < 0 \ \text{ if } y_i = -1, \qquad i = 1, \dots, l. \]
Since there are only finitely many points \(x_i\) and none lies on the hyperplane, we can thus find \(w, b \) such that:
\begin{equation} \label{eq2} w^T x_i + b \geq 1 \ \text{ if } y_i = 1, \qquad w^T x_i + b \leq -1 \ \text{ if } y_i = -1, \qquad i = 1, \dots, l. \end{equation}
Any plane that completely lies between the two planes \(w^Tx + b = \pm 1\) in equation \( ( \ref{eq2} ) \) is a separating hyperplane, and there are infinitely many such planes. SVM chooses an optimal hyperplane according to a suitable criterion, namely that of maximum margin. That is, \(w , b\) are chosen such that the distance, or margin, \(\frac{2}{||w||}\) between the two planes in \( ( \ref{eq2} ) \) is maximum. The optimal hyperplane is then taken to be \(w^T x +b =0 \). Note that \( ( \ref{eq2} ) \) is equivalent to saying \(y_i(w^Tx_i + b) \geq 1 \ \forall \) \(i\). The problem can thus be rewritten as the following constrained optimization problem:
\begin{equation} \label{eq3} \min_{w, b} \ \frac{1}{2} w^T w \quad \text{subject to} \quad y_i(w^T x_i + b) \geq 1, \quad i = 1, \dots, l. \end{equation}
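
As a concrete illustration, the following Python sketch checks whether a made-up candidate \((w, b)\) satisfies the constraints of \( ( \ref{eq3} ) \) on a toy data set and computes the corresponding margin \(\frac{2}{||w||}\).

```python
import numpy as np

# Toy, linearly separable data: two points per class (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Candidate hyperplane parameters (an assumption, not the computed optimum).
w = np.array([0.5, 0.5])
b = 0.0

# Constraints of the hard-margin problem: y_i (w^T x_i + b) >= 1 for all i.
feasible = np.all(y * (X @ w + b) >= 1)

# Margin between the planes w^T x + b = +1 and w^T x + b = -1.
margin = 2.0 / np.linalg.norm(w)

print(feasible, margin)   # True, margin ≈ 2.83 for this particular choice
```
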
A well-known method to solve constrained optimization problems is the Method of Lagrange Multipliers.

Theorem 1 (The Method of Lagrange Multipliers): Consider the optimization problem: minimise \(f(x)\) subject to \(g_i(x) \leq 0, \ i = 1, \dots, m\), where \(f\) and \(g_i\) are convex and continuously differentiable functions (\(C^1\) functions). The set of \(x\) which satisfy the constraint equations are called feasible solutions. We define the associated Lagrangian as \(L(x, \mu) = f(x) + \sum_{i=1}^m \mu_i g_i(x)\), where \(\mu_i\) are called Lagrange multipliers. Then \(x^*\) is a global solution of the optimization problem iff the following so-called Kuhn-Tucker Conditions hold:
  1. \(x^*\) is a feasible solution.
  2. There exists \(\mu^* \geq 0\) such that \(\nabla_x L(x^*, \mu^*) = 0\) and \(\mu_i^* g_i(x^*) = 0 \ \forall \ i\).
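
As a toy illustration of the theorem, consider minimising \(f(x) = x^2\) subject to \(g(x) = 1 - x \leq 0\). The Kuhn-Tucker conditions read
\[ \frac{\partial L}{\partial x}(x^*, \mu^*) = 2x^* - \mu^* = 0, \qquad \mu^* (1 - x^*) = 0, \qquad \mu^* \geq 0, \qquad 1 - x^* \leq 0, \]
and since \(\mu^* = 0\) would force the infeasible \(x^* = 0\), we must have \(x^* = 1\) and \(\mu^* = 2\), which is indeed the minimiser of \(x^2\) on \(x \geq 1\).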

The optimization variables in our problem are \(w, b\) and the constraint equations are linear in them. The Lagrangian then becomes \(L(w,b,\mu) = \frac{1}{2}w^T w + \sum_{i=1}^l \mu_i [1-y_i(w^Tx_i + b)]\). Solving the Kuhn-Tucker conditions, we see that \( (w^* , b^* ) \) is a solution to \( ( \ref{eq3} ) \) if and only if there exist \( \mu_i^* , \ i=1,\dots,l\) such that:
\begin{equation} \label{eq4}
\begin{aligned}
&\text{(a)} \quad 1 - y_i(w^{*T} x_i + b^*) \leq 0, \quad i = 1, \dots, l, \\
&\text{(b)} \quad \mu_i^* \geq 0, \quad i = 1, \dots, l, \\
&\text{(c)} \quad \mu_i^* \,[\, 1 - y_i(w^{*T} x_i + b^*) \,] = 0, \quad i = 1, \dots, l, \\
&\text{(d)} \quad \sum_{i=1}^l \mu_i^* y_i = 0, \\
&\text{(e)} \quad w^* = \sum_{i=1}^l \mu_i^* y_i x_i.
\end{aligned}
\end{equation}
A few things are worth noting. The training points \(x_i\) for which the left-hand side of (a) in \( ( \ref{eq4} ) \) equals \(0\) are called support vectors. These points are the ones closest to the optimal hyperplane \(w^{ * T } x + b^* = 0\). Relation (e) shows that \( w^* \) is a linear combination of the training points \(x_i\). Moreover, by (c), the multipliers \( \mu_i^* \) corresponding to points which are not support vectors are \(0\), so \( w^* \) is in fact a linear combination of the support vectors alone; that is, the separating hyperplane is essentially determined by the support vectors, the points nearest to it, which are the most difficult to classify.
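
The following sketch illustrates this in code: given (hypothetical) multiplier values \(\mu_i^*\), relation (e) of \( ( \ref{eq4} ) \) recovers \(w^*\), the support vectors are exactly the points with \(\mu_i^* > 0\), and \(b^*\) follows from (c) applied to any support vector.

```python
import numpy as np

# Toy training set with hypothetical optimal multipliers mu* (chosen for illustration).
X = np.array([[1.0, 1.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])
mu = np.array([0.25, 0.0, 0.25, 0.0])    # non-support vectors have mu_i* = 0

# (e): w* is a linear combination of the training points weighted by mu_i* y_i.
w_star = (mu * y) @ X

# The support vectors are exactly the training points with strictly positive multipliers.
support = X[mu > 0]

# (c): for any support vector x_s, y_s (w*^T x_s + b*) = 1, hence b* = y_s - w*^T x_s.
b_star = y[mu > 0][0] - w_star @ support[0]

print(w_star, b_star)    # [0.5 0.5] 0.0 for these particular values
```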

To exactly determine the optimal \( ( w^* , b^* ) \) we must thus find \( \mu_i^* \) which satisfy \( ( \ref{eq4} ) \). This is done by forming a certain dual problem of \( ( \ref{eq3} ) \) in the variable \(\mu\) and then solving using algorithms such as Sequential Minimal Optimization (SMO) which we discuss later on. The dual problem can be formulated using Theorem 2.

Theorem 2 (Lagrangian Duality): Call the optimization problem stated in Theorem 1 the primal problem. The dual problem is then the following optimization problem:

maximise \( \theta(\mu) \)
subject to \( \mu \geq 0 \)

where \( \theta(\mu) = \inf_{x} L(x, \mu) \) is defined as the dual function and \(L\) is the Lagrangian of the primal problem as defined in Theorem 1.
The duality theorem states that:
  1. The primal problem has an optimal solution if and only if the dual problem has an optimal solution and the corresponding optimal values are equal.
  2. For \(x^*\) and \(\mu^*\) to be optimal solutions of the primal and dual problems respectively, it is necessary and sufficient that they are feasible solutions and \( f(x^*) = \theta(\mu^*) = L(x^*, \mu^*) \).
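
For our problem \( ( \ref{eq3} ) \), eliminating \(w\) and \(b\) from the Lagrangian via the stationarity conditions and applying Theorem 2 yields the standard explicit form of the dual:
\[
\max_{\mu \geq 0} \ \sum_{i=1}^l \mu_i - \frac{1}{2} \sum_{i=1}^l \sum_{j=1}^l \mu_i \mu_j y_i y_j \, x_i^T x_j
\quad \text{subject to} \quad \sum_{i=1}^l \mu_i y_i = 0 .
\]
Note that the training points enter only through the dot products \(x_i^T x_j\); this observation becomes important for the kernel trick later on.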


The classification method introduced above handles the case of linearly separable training data points. When there is no hyperplane which linearly separates the data points, we introduce slack variables to form a so-called soft-margin classifier. The objective is to find a hyperplane which separates the training set with a minimal number of errors.

Formally, let \(\xi_i \geq 0\), \(i=1,\dots,l\), be training errors. We replace \(1\) by \(1-\xi_i\) for each \(i\) in equation \( ( \ref{eq2} ) \), which gives \(y_i(w^Tx_i + b) \geq 1-\xi_i \ \forall \ i=1, \dots, l\). For sufficiently small \(\sigma > 0 \), the function \(\phi(\xi)= \sum_{i=1}^l \xi_i^{\sigma}\) counts the number of strictly positive training errors. The objective is to minimise the number of training errors \(\phi(\xi)\), i.e. to find some minimal subset of training points such that, after removing it from the training set, the remaining points are linearly separable. The separating hyperplane should also be optimal in some sense. (Note that simply constructing an optimal plane for the remaining linearly separable points may increase the number of training errors when all the points are reconsidered.)
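
A quick numerical sketch of this counting behaviour (with made-up slack values):

```python
import numpy as np

# Hypothetical slack values: three points are misclassified or inside the margin.
xi = np.array([0.0, 0.3, 2.5, 0.0, 1.2])

sigma = 0.01
# For xi > 0, xi**sigma is close to 1; for xi = 0 it is 0, so the sum counts the errors.
print((xi ** sigma).sum())   # approximately 3
```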

The problem can be expressed as:
\begin{equation} \label{eq5}
\min_{w, b, \xi} \ \frac{1}{2} w^T w + C f\!\left( \sum_{i=1}^l \xi_i^{\sigma} \right) \quad \text{subject to} \quad y_i(w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \dots, l.
\end{equation}
Here \(C\) is a sufficiently large constant, \(\sigma \) is a sufficiently small constant and \(f\) is a monotonic convex function with \(f(0)=0\). For \(\xi_i =0 \ \forall \ i\), \( ( \ref{eq5} ) \) reduces to \( ( \ref{eq3} ) \), as expected. We use \(\sigma =1 \), as this is the smallest value for which \( ( \ref{eq5} ) \) has a unique solution that can be computed efficiently. The optimal plane is given by \(w^{ * T} x + b^* =0\), where \((w^* , b^* )\) is the optimal solution of the problem above.

Assume \(f(u) = u \) to simplify the problem; it then remains a quadratic optimization problem. Using Theorem 1 on the Kuhn-Tucker conditions, the Lagrangian is given by \(L(w,b,\xi,\mu, \lambda) = \frac{1}{2}w^T w + C\sum_{i=1}^l \xi_i + \sum_{i=1}^l \mu_i [1- \xi_i - y_i(w^Tx_i + b)] - \sum_{i=1}^l \lambda_i \xi_i \), and consequently \((w^* , b^* , \xi^* )\) is a solution to \( ( \ref{eq5} ) \) if and only if there exist \( \mu_i^* \) and \( \lambda_i^* , \ i= 1,\dots,l \) such that:
\begin{equation} \label{eq6}
\begin{aligned}
&\text{(a)} \quad y_i(w^{*T} x_i + b^*) \geq 1 - \xi_i^*, \quad \xi_i^* \geq 0, \quad i = 1, \dots, l, \\
&\text{(b)} \quad \mu_i^* \geq 0, \quad \lambda_i^* \geq 0, \quad i = 1, \dots, l, \\
&\text{(c)} \quad \mu_i^* \,[\, 1 - \xi_i^* - y_i(w^{*T} x_i + b^*) \,] = 0, \quad i = 1, \dots, l, \\
&\text{(d)} \quad \lambda_i^* \xi_i^* = 0, \quad i = 1, \dots, l, \\
&\text{(e)} \quad w^* = \sum_{i=1}^l \mu_i^* y_i x_i, \\
&\text{(f)} \quad \sum_{i=1}^l \mu_i^* y_i = 0, \\
&\text{(g)} \quad C - \mu_i^* - \lambda_i^* = 0, \quad i = 1, \dots, l.
\end{aligned}
\end{equation}
Note that \(w^* \) is still a linear combination of the training points, or more specifically of the support vectors. The value of \(b^* \) can be calculated using equation (c) of \((\ref{eq6})\) for some \(\mu^*_i > 0\). Again, the dual problem in the variables \(\mu, \lambda\) can be formulated using Theorem 2.
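
It is also worth noting that, with \(f(u)=u\) and \(\sigma =1\), the optimal slacks satisfy \(\xi_i^* = \max(0,\ 1 - y_i(w^{*T}x_i + b^*))\), so \( ( \ref{eq5} ) \) can equivalently be viewed as the unconstrained problem
\[ \min_{w,b} \ \frac{1}{2} w^T w + C \sum_{i=1}^l \max\bigl(0,\ 1 - y_i(w^T x_i + b)\bigr) . \]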


Another way to separate the training points is to use a non-linear SVM. We map the input training points into a different space and then use the usual SVM optimization to find an optimal hyperplane there. However, the hyperplane in that space may not look like a hyperplane in the original space. The transformation is done by choosing a vector function \(\phi\):
\[ \phi : \mathbb{R}^n \to \mathcal{H}, \qquad x \mapsto \phi(x), \]
where \(\mathcal{H}\) is a suitable (typically higher dimensional) inner product space.
The optimization problem \( ( \ref{eq5} ) \) then becomes:
\begin{equation} \label{eq8}
\min_{w, b, \xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^l \xi_i \quad \text{subject to} \quad y_i(w^T \phi(x_i) + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \dots, l,
\end{equation}
where \(w \in \mathcal{H}, b \in \mathbb{R}\), and the optimal separating surface is given by \(w^{ * T} \phi (x) + b^* = 0\), which is a hyperplane in the space \(\mathcal{H}\) but may not be a hyperplane in \(\mathbb{R}^n\). \(\mathcal{H}\) is usually a higher-dimensional space, and computing \( w^* \) and \( \phi(x)\) can be computationally quite expensive. However, using the Kuhn-Tucker equations and solving for \( w^* , b^* \), we see that \(w^{ * T} \phi (x) + b^* \) is just a linear combination of dot products of the form \(\phi (x_i) \cdot \phi (x) \) and \(\phi (x_i) \cdot \phi (x_j) \). The idea is thus to directly use functions \(K(x_i, x_j) = \phi (x_i) \cdot \phi (x_j) \), called kernel functions, instead of computing \(\phi (x)\) individually for points \(x\). Moreover, we may omit specifying \(\phi\) and \(\mathcal{H}\) altogether and instead just specify the associated kernel function. \(\mathcal{H}\) may then even be an infinite dimensional space, without our ever having to compute values in that space (this is famously known as the ‘kernel trick’). The following results from Hilbert-Schmidt theory enable us to do so.

Theorem 3 (Mercer's Theorem): Given a symmetric continuous function \(K(u,v)\) defined on \(\mathcal{C} \times \mathcal{C}\), where \(\mathcal{C}\) is a compact subset of \(\mathbb{R}^n\), there exist a countable sequence of functions \(\phi_i\) and a sequence of positive real numbers \(\lambda_i\) such that
\[ K(u,v) = \sum_{i=1}^{\infty} \lambda_i \phi_i(u) \phi_i(v), \]
if and only if \(K\) is non-negative definite. That is,
\[ \int_{\mathcal{C}} \int_{\mathcal{C}} K(u,v)\, g(u)\, g(v)\, du\, dv \ \geq \ 0 \]
is satisfied for all \(g\) defined on \(\mathcal{C}\) such that \( \int_{\mathcal{C}} g(u)^2\, du < \infty \).

The above result enables us to decide which symmetric continuous functions \(K(x,y)\) can be written as a dot product \(\phi (x) \cdot \phi (y) = \sum_{i=1}^{\infty} \lambda_i \phi_i (x) \phi_i (y) \) of functions \(\phi\) in some space \(\mathcal{H}\).
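
As a small sanity check of this idea, the following Python sketch compares the degree-2 polynomial kernel \(K(x,y) = (x \cdot y + 1)^2\) on \(\mathbb{R}^2\) with one explicit choice of feature map \(\phi\) realising it:

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel (x.y + 1)^2 in R^2.
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def K(x, y):
    # The same quantity computed directly, without ever forming phi(x).
    return (x @ y + 1.0) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(phi(x) @ phi(y))   # 4.0
print(K(x, y))           # 4.0 -- identical, but much cheaper in high dimensions
```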

Using Theorem 1, the solution to problem \((\ref{eq8})\) can be written as a linear function of the Lagrange multipliers of the corresponding Lagrangian. As discussed before, the Lagrange multipliers are calculated by forming a dual problem using Theorem 2. We now state the dual problem of \((\ref{eq8})\):
\begin{equation} \label{eq9}
\max_{\mu} \ \sum_{i=1}^l \mu_i - \frac{1}{2} \sum_{i=1}^l \sum_{j=1}^l \mu_i \mu_j y_i y_j K(x_i, x_j) \quad \text{subject to} \quad \sum_{i=1}^l \mu_i y_i = 0, \quad 0 \leq \mu_i \leq C, \quad i = 1, \dots, l,
\end{equation}
where \(\mu \in \mathbb{R}^l \). The separating hyperplane in \(\mathcal{H}\) specifically can be written as \(\sum_{i=1}^l \mu_i^* y_i K(x_i, x) + b^* = 0\). The following kernels are frequently used:
  • Linear kernel: \(K(x,y) = x^T y\).
  • Polynomial kernel: \(K(x,y) = (x^T y + 1)^d\) for some degree \(d \in \mathbb{N}\).
  • Radial basis function (Gaussian) kernel: \(K(x,y) = \exp(-\gamma\, ||x-y||^2)\), \(\gamma > 0\).
  • Sigmoid kernel: \(K(x,y) = \tanh(\kappa\, x^T y + c)\) for suitable constants \(\kappa, c\).
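
Once the \(\mu_i^*\) and \(b^*\) solving \( ( \ref{eq9} ) \) are available, classifying a new point needs only kernel evaluations against the support vectors. A minimal Python sketch (the RBF kernel and the multiplier values here are placeholders chosen for illustration):

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    # Radial basis function kernel K(x, y) = exp(-gamma * ||x - y||^2).
    return np.exp(-gamma * np.sum((x - y) ** 2))

def decision(x, support_X, support_y, mu, b, kernel=rbf):
    # f(x) = sum_i mu_i* y_i K(x_i, x) + b*, summed over the support vectors only.
    return sum(m * yi * kernel(xi, x) for m, yi, xi in zip(mu, support_y, support_X)) + b

# Hypothetical support vectors, multipliers and bias (not computed here).
support_X = np.array([[1.0, 1.0], [-1.0, -1.0]])
support_y = np.array([1, -1])
mu = np.array([0.8, 0.8])
b = 0.0

x_new = np.array([0.5, 1.5])
print(np.sign(decision(x_new, support_X, support_y, mu, b)))   # predicted label (+1 here)
```
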
Optimisation problem \((\ref{eq9})\) can be solved using the Sequential Minimal Optimisation (SMO) algorithm or its variants. The algorithm was first published by John Platt in 1998 and was among the first algorithms for this problem that were both easy to implement and computationally inexpensive. It was inspired by Osuna’s 1997 algorithm, which was complex to implement and lacked a convergence proof. Most implementations now use SMO or its subsequent refinements. LIBSVM, the best-known SVM implementation, uses the working-set selection method of Fan et al. (2005).
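
In practice one rarely implements SMO by hand. For instance, scikit-learn's SVC class wraps LIBSVM; a minimal usage sketch on a made-up data set:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny toy data set, two classes in R^2 (made up for the example).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

# C is the soft-margin penalty constant; the RBF kernel is scikit-learn's default.
clf = SVC(C=1.0, kernel="rbf", gamma=0.5)
clf.fit(X, y)

print(clf.support_vectors_)          # the support vectors found by LIBSVM/SMO
print(clf.dual_coef_)                # mu_i* y_i for each support vector
print(clf.predict([[1.0, 0.5]]))     # classify a new point
```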

The training complexity depends on the kernel evaluation cost, the number of data points, the number of features and the implementation. As a rule of thumb, most implementations have complexity between \(O(l^2)\) and \(O(l^3)\) (where \(l\) is the number of training data points). While training is expensive, prediction is comparatively cheap: classifying a new point requires only the kernel evaluations between that point and the support vectors.

  • The case of more than 2 classes.

For multiclass-classification into \( k>2 \) classes, there are essentially two methods:

One-versus-rest: Construct \( k \) SVMs, each with one of the classes as the positive class and all the remaining classes together as the negative class. However, this approach has several problems, such as more than one class being assigned to a test data point, imbalance in the training data, etc.

One-versus-one: Construct \( \frac{k(k-1)}{2} \) binary classifiers, one for each pair of classes, and then use a voting scheme to find the predicted class of a data point (a sketch of this scheme is given at the end of the section).

Another method is to construct a single multiclass objective function. In practice, however, the one-versus-one approach is the one usually used. The third method increases the computational complexity and leads to much slower training.
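
A minimal sketch of the one-versus-one voting scheme, using scikit-learn's SVC for the pairwise binary classifiers on a made-up three-class data set:

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

# Toy 3-class data in R^2 (made up for illustration).
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]], dtype=float)
y = np.array([0, 0, 1, 1, 2, 2])

classes = np.unique(y)
pairwise = {}

# Train k(k-1)/2 binary SVMs, one per pair of classes.
for a, b in combinations(classes, 2):
    mask = np.isin(y, [a, b])
    pairwise[(a, b)] = SVC(kernel="linear", C=1.0).fit(X[mask], y[mask])

def predict(x):
    # Each pairwise classifier casts one vote; the class with the most votes wins.
    votes = np.zeros(len(classes), dtype=int)
    for clf in pairwise.values():
        winner = clf.predict(x.reshape(1, -1))[0]
        votes[winner] += 1
    return classes[np.argmax(votes)]

print(predict(np.array([4.5, 5.5])))   # expected to vote for class 1
```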