In this post I cover the fundamentals of Support Vector Regression (SVR).
A high-level summary is that an SVR model is regularised regression using the epsilon-insensitive loss function. The SVR objective can then be reformulated to use kernels.
Main references:
- PRML by Christopher Bishop - Section 7
Epsilon-Insensitive Loss Function
In standard linear regression we use the squared error loss, which can be motivated by the probabilistic assumption that the outcome variable is conditionally normally distributed given the predictors.
A way to view support vector regression (SVR) is by introducing the epsilon-insensitive loss function. This is defined below:

$$E_\epsilon(y(\mathbf{x}) - t) = \begin{cases} 0 & \text{if } |y(\mathbf{x}) - t| < \epsilon \\ |y(\mathbf{x}) - t| - \epsilon & \text{otherwise} \end{cases}$$

In words, this loss function only punishes incorrect predictions when the discrepancy between the actual value $t$ and the predicted value $y(\mathbf{x})$ is larger than some choice of threshold $\epsilon > 0$.
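To make the definition concrete, here is a minimal NumPy sketch of this loss (the function name, the vectorised form and the example values are my own illustration):

```python
import numpy as np

def eps_insensitive_loss(y_pred, t, eps=0.1):
    """Epsilon-insensitive loss: zero inside the epsilon-tube,
    absolute error minus epsilon outside it."""
    return np.maximum(0.0, np.abs(y_pred - t) - eps)

# Errors smaller than eps contribute nothing to the loss.
print(eps_insensitive_loss(np.array([1.0, 2.0, 3.0]),
                           np.array([1.05, 2.5, 1.0]), eps=0.1))
# [0.  0.4 1.9]
```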
Regularised Regression
The objective function for SVR can be defined as regularised regression using the epsilon-insensitive loss function:

$$\sum_{n=1}^{N} E_\epsilon(y(\mathbf{x}_n) - t_n) + \frac{1}{2}\lVert\mathbf{w}\rVert^2$$

The aim is to minimise this objective, with respect to the vector $\mathbf{w}$ and the bias $b$ of the model $y(\mathbf{x})$.
Introducing Slack Variables
Minimising the objective function given above can be rewritten as a constrained optimisation problem by introducing slack variables $\xi_n \geq 0$ and $\hat{\xi}_n \geq 0$ for each data point, where $\xi_n$ measures how far $t_n$ lies above the $\epsilon$-tube around the prediction and $\hat{\xi}_n$ how far it lies below. This is given by:

$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi},\, \hat{\boldsymbol{\xi}}} \; \frac{1}{2}\lVert\mathbf{w}\rVert^2 + \sum_{n=1}^{N}(\xi_n + \hat{\xi}_n) \quad \text{subject to} \quad t_n \leq y(\mathbf{x}_n) + \epsilon + \xi_n, \quad t_n \geq y(\mathbf{x}_n) - \epsilon - \hat{\xi}_n, \quad \xi_n, \hat{\xi}_n \geq 0.$$
Often a hyperparameter $C > 0$ is placed in front of the error term to control the trade-off between the regularisation term and the penalty for deviations outside the $\epsilon$-tube. Including the hyperparameter $C$, the full optimisation problem is given below.
SVR Standard Formulation:

$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi},\, \hat{\boldsymbol{\xi}}} \; C\sum_{n=1}^{N}(\xi_n + \hat{\xi}_n) + \frac{1}{2}\lVert\mathbf{w}\rVert^2$$

subject to, for $n = 1, \dots, N$:

$$t_n \leq y(\mathbf{x}_n) + \epsilon + \xi_n, \qquad t_n \geq y(\mathbf{x}_n) - \epsilon - \hat{\xi}_n, \qquad \xi_n \geq 0, \qquad \hat{\xi}_n \geq 0.$$
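As a sanity check, this standard formulation can be handed directly to a generic convex solver. The sketch below uses cvxpy on synthetic data with a plain linear model; the data, variable names and hyperparameter values are illustrative choices of mine, not part of the derivation.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
t = 0.8 * X.ravel() + 0.1 * rng.standard_normal(50)

N, d = X.shape
C, eps = 1.0, 0.1

w = cp.Variable(d)
b = cp.Variable()
xi = cp.Variable(N, nonneg=True)      # slack above the tube
xi_hat = cp.Variable(N, nonneg=True)  # slack below the tube

y = X @ w + b
objective = cp.Minimize(C * cp.sum(xi + xi_hat) + 0.5 * cp.sum_squares(w))
constraints = [y + eps + xi >= t,      # t_n <= y(x_n) + eps + xi_n
               y - eps - xi_hat <= t]  # t_n >= y(x_n) - eps - xi_hat_n
cp.Problem(objective, constraints).solve()
print(w.value, b.value)
```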
Dual Formulation
In the SVM post, I gave the dual form of the soft margin classifier and then showed that this allows us to implement the kernel trick. The kernel trick is a way of applying a transformation to our features while avoiding its direct computation, therefore allowing us to transform to very high-dimensional spaces, even infinite-dimensional ones.
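For example, the RBF kernel $k(\mathbf{x}, \mathbf{x}') = \exp(-\gamma\lVert\mathbf{x} - \mathbf{x}'\rVert^2)$ corresponds to an infinite-dimensional feature space, yet its Gram matrix can be computed directly; a small NumPy sketch (the function name is mine):

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gram matrix K[n, m] = exp(-gamma * ||x_n - x_m||^2).
    The infinite-dimensional feature map is never formed explicitly."""
    sq = np.sum(X**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * sq_dists)
```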
We can carry out the analogous process for SVR.
The primal problem is given above (SVR Standard Formulation); the dual formulation of SVR is then given by:

$$\max_{\mathbf{a},\, \hat{\mathbf{a}}} \; -\frac{1}{2}(\mathbf{a} - \hat{\mathbf{a}})^T \mathbf{K} (\mathbf{a} - \hat{\mathbf{a}}) - \epsilon\, \mathbf{1}^T(\mathbf{a} + \hat{\mathbf{a}}) + \mathbf{t}^T(\mathbf{a} - \hat{\mathbf{a}})$$

$$\text{subject to} \quad 0 \leq a_n \leq C, \quad 0 \leq \hat{a}_n \leq C, \quad \sum_{n=1}^{N}(a_n - \hat{a}_n) = 0.$$

Where the matrix $\mathbf{K}$ is the Gram matrix with entries $K_{nm} = k(\mathbf{x}_n, \mathbf{x}_m)$, for a kernel function $k$.
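To illustrate, this dual can also be solved with a generic solver. Below is a cvxpy sketch on synthetic data (my own variable names and hyperparameter values); the quadratic term is written through a Cholesky factor of the Gram matrix so that concavity is recognised by the solver.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
t = np.sin(X).ravel() + 0.1 * rng.standard_normal(40)

C, eps, gamma = 1.0, 0.1, 0.5
sq = np.sum(X**2, axis=1)
K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))  # Gram matrix
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(t)))               # K (plus jitter) = L L^T

a = cp.Variable(len(t), nonneg=True)
a_hat = cp.Variable(len(t), nonneg=True)
diff = a - a_hat

# (a - a_hat)^T K (a - a_hat) == ||L^T (a - a_hat)||^2
dual_obj = (-0.5 * cp.sum_squares(L.T @ diff)
            - eps * cp.sum(a + a_hat) + t @ diff)
cp.Problem(cp.Maximize(dual_obj),
           [a <= C, a_hat <= C, cp.sum(diff) == 0]).solve()
```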
Proof
The primal problem is given as:

$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi},\, \hat{\boldsymbol{\xi}}} \; C\sum_{n=1}^{N}(\xi_n + \hat{\xi}_n) + \frac{1}{2}\lVert\mathbf{w}\rVert^2 \quad \text{subject to} \quad t_n \leq y(\mathbf{x}_n) + \epsilon + \xi_n, \quad t_n \geq y(\mathbf{x}_n) - \epsilon - \hat{\xi}_n, \quad \xi_n, \hat{\xi}_n \geq 0.$$
This primal problem is a convex (quadratic) problem, hence we will have strong duality if Slater's condition holds. Slater's condition requires that there exists a feasible point at which all the inequality constraints are strictly satisfied. In fact, there is a weaker condition that only requires the existence of a feasible point when all the inequality constraints are affine.
Slater's condition holds for the primal problem, as for any choice of $\mathbf{w}$ and $b$ we can always pick the slack variables $\xi_n$ and $\hat{\xi}_n$ large enough that every inequality constraint is strictly satisfied (and in any case all of the inequality constraints are affine, so the weaker condition applies).
Introducing the Lagrange multipliers $a_n \geq 0$, $\hat{a}_n \geq 0$, $\mu_n \geq 0$ and $\hat{\mu}_n \geq 0$, the Lagrangian is:

$$L = C\sum_{n=1}^{N}(\xi_n + \hat{\xi}_n) + \frac{1}{2}\lVert\mathbf{w}\rVert^2 - \sum_{n=1}^{N}(\mu_n\xi_n + \hat{\mu}_n\hat{\xi}_n) - \sum_{n=1}^{N} a_n(\epsilon + \xi_n + y_n - t_n) - \sum_{n=1}^{N} \hat{a}_n(\epsilon + \hat{\xi}_n - y_n + t_n)$$

where $y_n = y(\mathbf{x}_n)$.
We now let the linear model take the form $y(\mathbf{x}) = \mathbf{w}^T\phi(\mathbf{x}) + b$, where $\phi(\mathbf{x})$ is a fixed feature transformation. Setting the derivatives of the Lagrangian with respect to $\mathbf{w}$, $b$, $\xi_n$ and $\hat{\xi}_n$ to zero gives:

$$\frac{\partial L}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{n=1}^{N}(a_n - \hat{a}_n)\phi(\mathbf{x}_n)$$

$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{n=1}^{N}(a_n - \hat{a}_n) = 0$$

$$\frac{\partial L}{\partial \xi_n} = 0 \;\Rightarrow\; a_n + \mu_n = C$$

$$\frac{\partial L}{\partial \hat{\xi}_n} = 0 \;\Rightarrow\; \hat{a}_n + \hat{\mu}_n = C$$
Using these equations and substituting into the Lagrangian, we get that the dual problem is given by maximising the Lagrange dual function below.

$$\tilde{L}(\mathbf{a}, \hat{\mathbf{a}}) = -\frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}(a_n - \hat{a}_n)(a_m - \hat{a}_m)\,k(\mathbf{x}_n, \mathbf{x}_m) - \epsilon\sum_{n=1}^{N}(a_n + \hat{a}_n) + \sum_{n=1}^{N}(a_n - \hat{a}_n)t_n$$

Where $k(\mathbf{x}_n, \mathbf{x}_m) = \phi(\mathbf{x}_n)^T\phi(\mathbf{x}_m)$ is the kernel function, so the feature map only ever appears through inner products.
We have the constraints on the Lagrange multipliers that $a_n \geq 0$, $\hat{a}_n \geq 0$, $\mu_n \geq 0$ and $\hat{\mu}_n \geq 0$; combined with $a_n + \mu_n = C$ and $\hat{a}_n + \hat{\mu}_n = C$ these give the box constraints $0 \leq a_n \leq C$ and $0 \leq \hat{a}_n \leq C$, while the derivative with respect to $b$ gives the constraint $\sum_{n}(a_n - \hat{a}_n) = 0$.
Also note that substituting the equation $\mathbf{w} = \sum_{n}(a_n - \hat{a}_n)\phi(\mathbf{x}_n)$ into the linear model, predictions for a new input $\mathbf{x}$ can be made using:

$$y(\mathbf{x}) = \sum_{n=1}^{N}(a_n - \hat{a}_n)\,k(\mathbf{x}, \mathbf{x}_n) + b$$
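A small helper (naming mine) that evaluates this prediction from the dual coefficients; only the support vectors, the points with $a_n - \hat{a}_n \neq 0$, contribute to the sum:

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svr_predict(X_new, X_train, coef, b, kernel=rbf_kernel):
    """y(x) = sum_n (a_n - a_hat_n) * k(x, x_n) + b, with coef[n] = a_n - a_hat_n."""
    K = np.array([[kernel(x, xn) for xn in X_train] for x in X_new])
    return K @ coef + b
```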
Interpreting the KKT Conditions
If we have primal and dual optimal points they must satisfy the KKT conditions. The KKT conditions often allow us to interpret how the optimal solution behaves; in particular, for SVR they tell us which data points end up as support vectors.
The KKT conditions for the SVR problem are:

$$a_n(\epsilon + \xi_n + y_n - t_n) = 0$$

$$\hat{a}_n(\epsilon + \hat{\xi}_n - y_n + t_n) = 0$$

$$(C - a_n)\xi_n = 0$$

$$(C - \hat{a}_n)\hat{\xi}_n = 0$$
For the first two equations, if $a_n \neq 0$ then $\epsilon + \xi_n + y_n - t_n = 0$, so the data point lies either on the upper boundary of the $\epsilon$-tube ($\xi_n = 0$) or above it ($\xi_n > 0$); similarly, if $\hat{a}_n \neq 0$ the point lies on or below the lower boundary. The reason is that if a point lies strictly inside the tube then $\epsilon + \xi_n + y_n - t_n > 0$ and $\epsilon + \hat{\xi}_n - y_n + t_n > 0$, which forces $a_n = \hat{a}_n = 0$; such points make no contribution to the prediction, and the support vectors are exactly the points on or outside the tube.
Another observation is that if we add the equations $\epsilon + \xi_n + y_n - t_n = 0$ and $\epsilon + \hat{\xi}_n - y_n + t_n = 0$, we get $2\epsilon + \xi_n + \hat{\xi}_n = 0$, which is impossible since $\epsilon > 0$ and the slack variables are non-negative. Hence $a_n$ and $\hat{a}_n$ cannot both be non-zero for the same data point.
For the last two equations, we can restate them as the observations that $\xi_n > 0$ implies $a_n = C$ and $\hat{\xi}_n > 0$ implies $\hat{a}_n = C$; in other words, a point lying strictly outside the $\epsilon$-tube has its corresponding Lagrange multiplier at the upper bound $C$. Conversely, a point with $0 < a_n < C$ must have $\xi_n = 0$ and so lies exactly on the boundary of the tube.
Solving in Practice
The dual problem is a convex quadratic programme, and there exist many libraries to solve this type of problem. For SVMs and SVRs there is the LIBSVM library, written mainly in C/C++; this document describes the algorithm used by LIBSVM. LIBSVM is also available from other open source languages, for example the R package e1071 is an interface to LIBSVM.
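For instance, scikit-learn's SVR class is built on LIBSVM and exposes $C$, $\epsilon$ and the kernel hyperparameters directly; a minimal usage sketch on synthetic data (values chosen purely for illustration):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
t = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)

model = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=0.5)
model.fit(X, t)

print(model.predict(np.array([[0.5]])))  # prediction at a new input
print(len(model.support_))               # number of support vectors
```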
Pros and Cons
- The objective function has hyperparameters $\epsilon$ and $C$, which require tuning in order to choose good values. Additionally, if we use kernels this will introduce further hyperparameters, e.g. the width parameter $\gamma$ for the RBF kernel.