In this post I cover the fundamentals of Support Vector Regression.
A high-level summary is that an SVR model is regularised regression using the epsilon-insensitive loss function. The SVR objective can then be reformulated to use kernels.
Main references:
- PRML by Christopher Bishop - Section 7
Epsilon-Insensitive Loss Function
In standard linear regression we use the squared error loss, which can be motivated by the probabilistic assumption that the outcome variable is conditionally normally distributed given the predictors.
A way to view support vector regression (SVR) is through the $\boldsymbol{\epsilon}$-insensitive loss function.
This is defined below:
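\[E_{\epsilon}(y - g(\bx;\bw)) := \begin{cases} 0, & \text{if } |y - g(\bx;\bw)| < \epsilon \\ |y - g(\bx;\bw)| - \epsilon, & \text{otherwise} \end{cases}\]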
In words, this loss function only penalises a prediction when the discrepancy between the actual value and the predicted value is larger than some chosen threshold $\epsilon$.
$g(\bx;\bw)$ represents the affine function $g(\bx;\bw) := \langle \bw',\bx \rangle + w_0$ that is used to construct our decision rule.
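As a quick illustration, here is a minimal sketch of this loss in Python (assuming NumPy; the function name is my own):

```python
# A minimal sketch of the epsilon-insensitive loss, assuming NumPy;
# it returns zero inside the tube and the excess absolute error outside it.
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    residual = np.abs(y_true - y_pred)
    return np.maximum(0.0, residual - epsilon)

# Errors of 0.05 and 0.3 with epsilon = 0.1 give losses 0 and 0.2.
print(epsilon_insensitive_loss(np.array([1.0, 1.0]), np.array([1.05, 0.7])))
```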
Regularised Regression
The objective function for SVR can be defined as regularised regression using the $\boldsymbol{\epsilon}$-insensitive loss function, where the penalty term uses the squared $L_2$ norm.
The aim is to minimise this objective with respect to the vector $\bw := (\bw', w_0)$:
\[\sum_{i = 1}^{n}{E_{\epsilon}(y_i - g(\bx_i;\bw))} + \lambda||\bw'||^2\]

Introducing Slack Variables
The minimisation of the objective function given above can be rewritten as a constrained optimisation problem by introducing slack variables $\xi_i$ and $\hat{\xi}_i$.
This is given by:
\[\begin{align} \min_{\boldsymbol{\xi},\boldsymbol{\hat{\xi}},\bw} \; {\sum_{i = 1}^{n}}{(\xi_i + \hat{\xi}_i)} + \lambda||\bw'||^2 \nonumber\\ {\text{subject to:}} \nonumber\\ \quad y_i - g(\bx_i;\bw) \leq \epsilon + \xi_i \nonumber\\ g(\bx_i;\bw) - y_i \leq \epsilon + \hat{\xi}_i \nonumber\\ \xi_i, \hat{\xi}_i \geq 0 \nonumber \end{align}\]

Often a hyperparameter $C$, called the inverse regularisation parameter, is added in front of the $\sum_{i = 1}^{n}{(\xi_i + \hat{\xi}_i)}$ term; smaller values of $C$ put less weight on the minimisation of the errors.
Including the hyperparameter $C$ and setting the coefficient of the regularisation term to $\frac{1}{2}$ gives the standard formulation of the SVR objective function:
SVR Standard Formulation:
\[\begin{align} \min_{\boldsymbol{\xi},\hat{\boldsymbol{\xi}},\bw} \; C{\sum_{i = 1}^{n}}{(\xi_i + \hat{\xi}_i)} + \frac{1}{2}||\bw'||^2 \nonumber\\ {\text{subject to:}} \nonumber\\ \quad y_i - g(\bx_i;\bw) \leq \epsilon + \xi_i \label{eq:SVR formulation}\\ g(\bx_i;\bw) - y_i \leq \epsilon + \hat{\xi}_i \nonumber\\ \xi_i, \hat{\xi}_i \geq 0 \nonumber \end{align}\]

Dual Formulation
In the SVM post, I gave the dual form of the soft margin classifier and then showed that this allows us to implement the Kernel trick. The Kernel trick is a way of applying a transformation to our features while avoiding its direct computation, which allows us to work in very high-dimensional, even infinite-dimensional, feature spaces.
We can carry out the analogous process for SVR.
The primal problem is given above (SVR Standard Formulation); the dual formulation of SVR is then given by:
\[\begin{align*} \min_{\mathbf{a}, \hat{\mathbf{a}}} \quad &\frac{1}{2} (\mathbf{a} - \hat{\mathbf{a}})^{T} K (\mathbf{a} - \hat{\mathbf{a}}) + \epsilon \sum_{i=1}^{N} (a_{i} + \hat{a}_{i}) - \sum_{i=1}^{N} y_{i} (a_{i} - \hat{a}_{i}) \nonumber\\ &{\text{subject to:}}\nonumber\\ \quad &0 \leq a_{i}, \hat{a}_{i} \leq C \quad \text{for} \; i=1, \dots,N \label{dual}\\ &\sum_{i=1}^{N} (a_{i} - \hat{a}_{i}) = 0 \nonumber\\ \end{align*}\]

where the matrix $K$ is constructed from the kernel function: $K_{ij} := k(\bx_{i}, \bx_{j}) := \phi(\bx_i)^T\phi(\bx_j)$.
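To make the dual concrete, below is a minimal sketch that hands this quadratic programme to a generic solver. It assumes the cvxopt package and an RBF kernel; the function name, ridge term and default values are my own illustrative choices, and dedicated implementations such as LIBSVM (discussed later) use specialised algorithms instead.

```python
# A sketch of solving the SVR dual with a generic QP solver. It assumes the
# cvxopt package and an RBF kernel; the function name, ridge term and defaults
# are illustrative. The variables are stacked as z = [a, a_hat], so that
# (a - a_hat) = [I, -I] z and the objective matches the dual given above.
import numpy as np
from cvxopt import matrix, solvers

def svr_dual_qp(X, y, C=1.0, epsilon=0.1, gamma=1.0):
    N = len(y)
    # RBF kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq_dists)

    # Quadratic term 1/2 z^T P z with P = [[K, -K], [-K, K]] (small ridge for stability)
    P = np.block([[K, -K], [-K, K]]) + 1e-8 * np.eye(2 * N)
    # Linear term: epsilon * sum(a + a_hat) - sum(y * (a - a_hat))
    q = np.concatenate([epsilon - y, epsilon + y])

    # Box constraints 0 <= a_i, a_hat_i <= C, written as G z <= h
    G = np.vstack([-np.eye(2 * N), np.eye(2 * N)])
    h = np.concatenate([np.zeros(2 * N), C * np.ones(2 * N)])
    # Equality constraint sum_i (a_i - a_hat_i) = 0
    A = np.concatenate([np.ones(N), -np.ones(N)]).reshape(1, -1)

    solvers.options["show_progress"] = False
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h), matrix(A), matrix(0.0))
    z = np.array(sol["x"]).ravel()
    return z[:N], z[N:], K  # a, a_hat, and the kernel matrix
```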
Proof
The primal problem is given as:
\[\begin{align*} \min_{\boldsymbol{\xi},\hat{\boldsymbol{\xi}},\bw} \; C{\sum_{i = 1}^{n}}{(\xi_i + \hat{\xi}_i)} + \frac{1}{2}||\bw'||^2 \nonumber\\ {\text{subject to:}} \quad y_i - g(\bx_i;\bw) \leq \epsilon + \xi_i\\ g(\bx_i;\bw) - y_i \leq \epsilon + \hat{\xi}_i \nonumber\\ \xi_i, \hat{\xi}_i \geq 0 \nonumber \end{align*}\]

This primal problem is a convex (quadratic) problem, hence we have strong duality if Slater's condition holds. Slater's condition requires the existence of a feasible point at which the inequality constraints are strictly satisfied. In fact, when all of the inequality constraints are affine, a weaker condition suffices: we only need the existence of a feasible point.
Slater's condition holds for the primal problem: for any choice of $\epsilon$ we can choose large enough values of $\xi_i$ and $\hat{\xi}_i$ such that the inequality constraints are strictly satisfied at every data point. Therefore, the optimal value of the primal objective is equal to the optimal value of the dual.
Introducing the Lagrange multipliers $a_i\geqslant0$, $\hat{a}_i\geqslant0$, $\mu_i\geqslant0$, and $\hat{\mu}_i\geqslant0$, the Lagrangian for the primal problem is given by:
\[\begin{align*} &L(\boldsymbol{\xi}, \boldsymbol{\hat{\xi}}, \bw, \mathbf{a}, \mathbf{\hat{a}}, \boldsymbol{\mu}, \boldsymbol{\hat{\mu}}) := \label{lagrangian} \\ &C\sum^N_{i=1}(\xi_i+\hat{\xi}_i)+ \frac{1}{2}||\bw'||^2-\sum^N_{i=1}(\mu_i\xi_i+\hat{\mu}_i\hat{\xi}_i) -\sum^N_{i=1}a_i(\epsilon+\xi_i + g(\bx_i;\bw) -y_i) -\sum^N_{i=1}\hat{a}_i(\epsilon+\hat{\xi}_i +y_i -g(\bx_i;\bw)) \nonumber \end{align*}\]

We now let the linear model take the form $g(\bx;\bw) = \langle \bw',\phi(\bx) \rangle + w_0$, where we have included a feature transform $\phi$. Now to find the minimum with respect to $(\boldsymbol{\xi}, \boldsymbol{\hat{\xi}}, \bw)$ for fixed Lagrange multipliers, we set the derivatives of the Lagrangian with respect to $\bw := (\bw', w_0)$, $\xi_i$, and $\hat{\xi}_i$ to zero:
\[\begin{align} &\frac{\partial L}{\partial \bw'}=0 \Rightarrow \bw'=\sum^N_{i=1}(a_i-\hat{a}_i)\phi(\bx_i) \label{w_dash}\\ &\frac{\partial L}{\partial w_0}=0 \Rightarrow \sum^N_{i=1}(a_i-\hat{a}_i) = 0 \\ &\frac{\partial L}{\partial \xi_i}=0 \Rightarrow a_i + \mu_i = C \label{xi}\\ &\frac{\partial L}{\partial \hat{\xi}_i}=0 \Rightarrow \hat{a}_i + \hat{\mu}_i = C \label{xi_hat} \end{align}\]

Substituting these equations back into the Lagrangian (the terms involving $\xi_i$ and $\hat{\xi}_i$ vanish because $a_i + \mu_i = C$ and $\hat{a}_i + \hat{\mu}_i = C$, and the $w_0$ terms cancel since $\sum^N_{i=1}(a_i-\hat{a}_i) = 0$), we get that the dual problem is given by maximising the Lagrange dual function below.
\[\begin{align*} \tilde{L}(\mathbf{a},\mathbf{\hat{a}})=-\frac{1}{2}\sum^N_{i=1}\sum^N_{j=1}(a_i-\hat{a}_i)(a_j-\hat{a}_j)k(\bx_i,\bx_j)-\epsilon\sum^N_{i=1}(a_i+\hat{a}_i)+\sum^N_{i=1}(a_i-\hat{a}_i)y_i \label{eq:dual} \end{align*}\]

where $k(\bx,\bx')=\phi(\bx)^T\phi(\bx')$ is the kernel function.
We have the constraints on the Lagrange multipliers that $a_i\geqslant0$, $\hat{a}_i\geqslant0$, $\mu_i\geqslant0$, and $\hat{\mu}_i\geqslant0$. The multipliers $\mu_i$ and $\hat{\mu}_i$ do not appear in the dual objective, but from equations \eqref{xi} and \eqref{xi_hat} we have the following constraints (often called box constraints):
\[\begin{align} 0 \leqslant a_i \leqslant C \\ 0 \leqslant \hat{a}_i \leqslant C \end{align}\]

Also note that substituting the equation $\bw'=\sum^N_{i=1}(a_i-\hat{a}_i)\phi(\bx_i)$ $\eqref{w_dash}$ into $g(\bx;\bw) = \langle \bw',\phi(\bx) \rangle + w_0$, we get that the decision function can be rewritten in terms of the kernel function:
\[\begin{equation} g(\bx) = \sum_{i=1}^{N}{(a_i - \hat{a}_i)k(\bx,\bx_i)} + w_0 \label{pred} \end{equation}\]

Interpreting the KKT Conditions
If we have primal and dual optimal points, they must satisfy the KKT conditions. The KKT conditions often allow us to interpret how the optimal solution behaves; in this case, which data points act as support vectors.
The KKT conditions for the SVR problem are:
\[\begin{align*} a_i(y_i - g(\bx_i;\bw) - \epsilon - \xi_i) = 0 \\ \hat{a}_i(g(\bx_i;\bw) - y_i - \epsilon - \hat{\xi}_i) = 0 \\ \mu_i \cdot \xi_i = 0 \\ \hat{\mu}_i \cdot \hat{\xi}_i = 0 \\ \end{align*}\]

For the first two equations, if $a_i$ is non-zero this implies that the corresponding data point $(\bx_i, y_i)$ lies either on the upper boundary of the $\epsilon$-tube around $g(\bx_i;\bw)$ or above it.
The reason is that if $a_i$ is non-zero, this implies $y_i - g(\bx_i;\bw) - \epsilon - \xi_i = 0$. There are then two scenarios in which this holds: either $y_i - g(\bx_i;\bw) = \epsilon$, which implies $\xi_i = 0$, or $y_i - g(\bx_i;\bw) > \epsilon$, in which case $\xi_i = y_i - g(\bx_i;\bw) - \epsilon > 0$.
Another observation is that if we add the equations $y_i - g(\bx_i;\bw) - \epsilon - \xi_i = 0$ and $g(\bx_i;\bw) - y_i - \epsilon - \hat{\xi}_i = 0$, we get $\xi_i + \hat{\xi}_i = -2\epsilon$, which is impossible for $\epsilon > 0$ since the slack variables are non-negative. Therefore it is not possible for $a_i$ and $\hat{a}_i$ to be non-zero simultaneously. Note that from the prediction function \eqref{pred}, only data points where either $a_i$ or $\hat{a}_i$ is non-zero contribute to the prediction. As mentioned previously, these are the data points that lie on the boundary of, or outside, the $\epsilon$-tube, known as the support vectors.
For the last two equations, using equations $\eqref{xi}$ and $\eqref{xi_hat}$ we can replace them with $(C - a_i)\xi_i = 0$ and $(C - \hat{a}_i)\hat{\xi}_i = 0$ respectively.
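As an illustration of how these conditions are used, the sketch below recovers $w_0$ from a dual solution such as the one returned by the earlier QP sketch: any point with $0 < a_i < C$ has $\mu_i > 0$, hence $\xi_i = 0$ and $y_i - g(\bx_i;\bw) = \epsilon$ exactly (and analogously for $\hat{a}_i$). The function name and tolerance are illustrative choices of mine.

```python
# A sketch of recovering w_0 from the KKT conditions, assuming the outputs
# (a, a_hat, K) of the dual QP sketch above and NumPy arrays throughout;
# the function name and tolerance are illustrative.
import numpy as np

def recover_intercept(a, a_hat, K, y, C, epsilon, tol=1e-6):
    coef = a - a_hat                            # the weights (a_i - a_hat_i)
    g_no_intercept = K @ coef                   # sum_j (a_j - a_hat_j) k(x_i, x_j)
    upper = (a > tol) & (a < C - tol)           # 0 < a_i < C  =>  y_i - g(x_i) = +epsilon
    lower = (a_hat > tol) & (a_hat < C - tol)   # 0 < a_hat_i < C  =>  y_i - g(x_i) = -epsilon
    # Assumes at least one point lies strictly inside the box constraints;
    # averaging over all such points improves numerical robustness.
    w0_estimates = np.concatenate([
        y[upper] - epsilon - g_no_intercept[upper],
        y[lower] + epsilon - g_no_intercept[lower],
    ])
    return w0_estimates.mean()
```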
Solving in Practice
The dual problem is a convex quadratic problem, and there exist many libraries to solve this type of problem. For SVMs and SVRs there is the LIBSVM library, written mainly in C/C++; this document describes the algorithm used for $\epsilon$-SVR. LIBSVM is also available through interfaces in other open source languages, for example the R package e1071 is an interface to LIBSVM.
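As a further illustration, scikit-learn's SVR class in Python is also built on LIBSVM; a minimal sketch (with toy data and illustrative hyperparameter values of my own) might look like:

```python
# A minimal sketch using scikit-learn's SVR class, which wraps LIBSVM;
# the toy data and hyperparameter values below are illustrative only.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(80, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(80)

# epsilon sets the width of the insensitive tube, C the penalty on the slack variables.
model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X, y)

# Only the support vectors (points with non-zero a_i or a_hat_i) define the fit.
print("number of support vectors:", len(model.support_))
print("first few predictions:", model.predict(X[:5]))
```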
Pros and Cons
- The objective function has hyperparameters $\epsilon$ and $C$, which require tuning in order to choose a good model. Additionally, if we use kernels this introduces further hyperparameters, e.g. $\sigma$ for the RBF kernel; a quick sketch of tuning these by cross-validation is given below.
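A minimal sketch of such tuning, assuming scikit-learn; the grid values and toy data are illustrative, not recommendations:

```python
# A sketch of tuning C, epsilon and the RBF width via cross-validated grid search,
# assuming scikit-learn; grid values and toy data are illustrative only.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(80, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(80)

param_grid = {
    "C": [0.1, 1.0, 10.0],
    "epsilon": [0.01, 0.1, 0.5],
    "gamma": [0.1, 1.0, 10.0],  # RBF kernel width (related to sigma)
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print("best hyperparameters:", search.best_params_)
```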