Introduction to the theory of the Support Vector Machine
The support vector machine (SVM) approach for solving
classification tasks was introduced by Vapnik (Vapnik
1995). It has been successfully applied in various areas of research. Currently
its growing popularity in the bio
Several standard learning techniques are routinely used in bioinformatics. For simplicity we start with a binary classification. But before any of these methods can be applied to the biological data, it should be preprocessed to be suitable for analysis by computational learning techniques. For that data should be represented as labeled vectors in a high dimensional space. This representation is usually constructed to preserve as much information as possible about features which are responsible for correct classification of samples. Type of labels depends on the particular considered tasks. For instance, if task is to correctly predict drug versus non drug molecules, labels might be chosen to be plus one for drugs and minus one for non-drugs. Features in this case might be various chemical descriptors that map chemical properties of molecules to real numbers.
The classifier is then a surface in this high dimensional space that separates it two parts for class and non-class. Contrary to many standard approaches SVM is not trying to construct classifying surface directly in the given target space. First sample points are projected to a significantly higher dimensional space, where separating surface can be found in a form of hyperplane. Corresponding surface in the original space is then presented as a result of SVM training.
The separating hyperplane is defined as
where x is a sample vector mapped to a high dimensional space and w and w0 are parameters of the hyperplane that SVM will estimate. Then the width of the margin can be expressed as a minimal τ for which for all vectors holds:
; here are +1 for class and -1 for non-class variable.
Without loss of generality we can apply a constraint to w. In this case, maximizing τ is equivalent to minimizing and SVM training is becoming the problem of finding the minimum of a function with the following constraints:
subject to constraints
This problem is solved by introducing Lagrange multipliers and minimizing the following function:
Here ai are Lagrange multipliers. Differentiating over w and wi and substituting we obtain:
subject to constraints ,
When perfect separation is not possible slack variables are introduced for sample vectors violating the edges of the margin, and the optimization problem can be reformulated:
minimize . (2)
subject to constraints .
Here are slack variables. C is a tradeoff between maximizing the width of the margin and minimizing slack variables. These variables are not equal to zero only for those vectors which are within the margin. Introducing Lagrange multipliers again we finally obtain:
subject to constraints ,
This is a quadratic programming (QP) problem and several efficient standard methods are known to solve it (Coleman and Li 1996). Due to the very high dimensionality of the QP problem, which typically arises during SVM training, an extension of the algorithm for solving QP is used in SVM applications (Joachims 1999).
A geometrical illustration of the meaning of slack variables and Lagrange multipliers is given in Figure 1. The main goal of any classifiyer is to construct a hyperplane that correctly predict class of a vector. It is logically to assume that vectors that are laying close to the border that separates two classes should play more important role in determining exact position of the plane that separates two classes. The main advantage of the SVM-classifier versus others is that only vectors that are laying sufficiently close to the border that separates two classes determine exact position of the classifying hyperplane. These vectors called support vectors. Their name assumes that only they “support” the constructed hyperplane. The exact distance to the separating hyperplane that determines which vectors will be considered to be “support” vectors, is a paramter to maximize during SVM training. The larger is this distance, the more distinct is the separation by the hyperplane between two classes.
All other vectors are non-support vectors, for them , they are classified correctly by the hyperplane and are located outside the separating margin. Slack variables and Lagrange multipliers for them are equal to zero, which is obviously reflects the fact that their positions do not influence location of the separarating hyperplane. Parameters of the hyperplane do not depend on them, and even if their positions are changed the separating hyperplane and margin will remain the same, provided that these points will stay outside the margin.
Support vectors can be tentatively divided into two groups: vectors laying on the border of separating hyperplane and vectors laying within the separating hyperplane. Origin of this tentative division is the existence of hard-margin (Equation 1) and soft-margin (Equation 2) SVM classifiers. In hard-margin SVM classifier no support vectors are allowed within the separating margin and SVM is trained to maximize this separating margin. In soft-margin SVM classifier, which is our main focus here, a trade off is introduced between having a large separating margin and minimum number of errors. Some of the vectors lying within the separating margin can be classified incorectly by hyperplane, which is a price for the tradeoff between having a large separating margin and minimum number of errors. This trade off is introduced by C parameter, which shoud be optimized to achieve maximum classification accuracy.
For all support vectors the absolute values of the slack variables are equal to the distances from these points to the edge of the separating margin: for support vectors lying on the border of separating hyperplane slack variables are equal to zero. For other support vectors these distances determine to which extent they violate the margin. They are defined in the units of half of the width of the separating margin. For correctly classified vectors within the separating margin, slack variable values are between zero and one. For misclassified vectors within the margin the values of the slack variables are between one and two. For other misclassified points they are greater than two.
For vectors lying on the edge of margin, Lagrange multipliers are between zero and C, slack variables for them are still equal to zero. For all support vectors, for which the values of slack variables are larger than zero, Lagrange multipliers equal to C.
It was shown that classification accuracy usually depends only weakly on the specific projection, provided that the target space is sufficiently high-dimensional (Cortes and Vapnik 1995). Sometimes it is not possible to find a separating hyperplane even in a very high dimensional space. In this case a tradeoff is introduced between the size of the separating margin and penalties for every vector which is within the margin (Cortes and Vapnik 1995).
Additionally, explicit mapping of the original data to a high-dimensional space is not required if calculation of the scalar product for each pair of vectors is feasible in the high-dimensional space. This scalar product can then be defined by introducing a kernel function . Every kernel function corresponds to a scalar product in a certain high-dimensional space if it obeys the two following conditions (Cristianini and Shawe-Taylor 2000):
1. K(x,x’) takes its maximum value when x’ = x.
2. |K(x,x’)| decreases with |x-x’|.
Here x and x’ are vectors in a low-dimensional space for which a kernel function is defined that corresponds to a scalar product in a high-dimensional space.
Various kernels may be applied (Burges 1998). For many applications a polynomial kernel functions has been shown to be sufficient, e.g. a fifth-order polynomial-based kernel:
This kernel corresponds to the decision function in a form of fifth order polynomial:
Here ai are Lagrange multipliers determined during training of SVM. The sum is only over support vectors xsv. Lagrange multipliers for all other points are equal to zero. Parameter b determines the shift of the hyperplane, which is also determined during SVM training.
The kernel function can often be simplified without loss of generality. For instance, for a polynomial kernel it is easy to see that simultaneous scaling of s, r and b parameters will not change the decision function. Then we can simplify the kernel by setting r equal to one:
In this case, only the kernel parameter s and error tradeoff C should be tuned. Parameter C is not explicitly present in this equation, it is used as a penalty for the classification error at the formulation of the QP problem for SVM. For appropriate tuning of parameters s and C, multiple cross-validation of training data can be applied, and values for s and C are chosen by maximizing accuracy. Basically, the following procedure is applied. The data set is divided into two parts training and test data subsets. The test subset is put aside and is used only for estimation of the performance of the trained classifier. As it was mentioned above typically two parameters must be set prior to starting of the training of SVM. Optimal values for these parameters are set based on standard procedure of k-fold cross validation. Training data is divided into k non-overlapping subsets. First parameters to be determined are set to certain reasonable values. Second, SVM is trained on the whole training data excluded k subset. Then performance of the obtained SVM classifier is estimated on the excluded k subset. This procedure is repeated for every subset and average performance of the SVM classifier is estimated. Free parameters are then tuned to achieve maximum average classification performance at cross validatzion. Optimization can be performed, e.g. by simple heuristics based on gradient descent methods (Bishop 1995). Resulted parameters are then used for the final training with the complete training data. It should be noted that test data was not used at any stage of the training procedure, it allows to avoid artifacts of overestimated accuracy of classification resulted from the use of information from the labeling of test data for training.
Bishop CM. 1995. Neural Networks for Pattern Recognition. Oxford, Oxford University Press.
Burges CJC. 1998. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2(2):121-167.
Coleman TF, and Li Y. 1996. A Reflective Newton Method for Minimizing a Quadratic Function Subject to Bounds on some of the Variables. SIAM Journal on Optimization 6(4):1040-1058.
Cortes C, and Vapnik V. 1995. Support-Vector Networks. Machine Learning 20(3):273-297.
Cristianini N, and Shawe-Taylor J. 2000. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge, Cambridge University Press.
Joachims T. 1999. Making large-Scale SVM Learning Practical. In B. Schölkopf, C. Burges and A. Smola, eds. Advances in Kernel Methods - Support Vector Learning. Cambridge, MIT-Press. p 41-56.
Vapnik V. 1995. The
Nature of Statistical Lerning Theory. Berlin, Springer.
The Nature of Statistical Lerning Theory. Berlin, Springer.