Introduction to the theory of the Support Vector Machine
The support vector machine (SVM) approach for solving
classification tasks was introduced by Vapnik (Vapnik
1995). It has been successfully applied in various areas of research. Currently
we
witness
its growing popularity in the bio Several standard learning techniques are routinely used in bioinformatics.
For simplicity we start with a binary classification. But before any of these
methods can be applied to the biological data, it should be preprocessed to be
suitable for analysis by computational learning techniques. For that data should
be represented as labeled vectors in a high dimensional space. This
representation is usually constructed to preserve as much information as
possible about features which are responsible for correct classification of
samples. Type of labels depends on the particular considered tasks. For
instance, if task is to correctly predict drug versus non drug molecules,
labels might be chosen to be plus one for drugs and minus one for non-drugs.
Features in this case might be various chemical descriptors that map chemical
properties of molecules to real numbers. The classifier is then a surface in this high dimensional
space that separates it two parts for class and non-class. Contrary to many
standard approaches SVM is not trying to construct classifying surface directly
in the given target space. First sample points are projected to a significantly
higher dimensional space, where separating surface can be found in a form of hyperplane. Corresponding surface in the original space is
then presented as a result of SVM training.
The separating hyperplane is
defined as
where x is a sample vector mapped to a high dimensional space and w and w0
are parameters of the hyperplane that SVM will
estimate. Then the width of the margin can be expressed as a minimal τ for
which for all vectors holds:
Without loss of generality we can apply a constraint minimize subject to constraints This problem is solved by introducing Lagrange multipliers
and minimizing the following function:
Here ai are Lagrange multipliers. Differentiating over w and wi
and substituting we obtain:
subject to constraints When perfect separation is not possible slack variables are
introduced for sample vectors violating the edges of the margin, and the
optimization problem can be reformulated: minimize subject to constraints Here
subject to constraints This is a quadratic programming (QP) problem and several
efficient standard methods are known to solve it (Coleman and Li 1996). Due to the very high
dimensionality of the QP problem, which typically arises during SVM training, an
extension of the algorithm for solving QP is used in SVM applications (Joachims
1999). A geometrical illustration of the meaning of slack variables
and Lagrange multipliers is given in Figure 1. The main goal of any classifiyer is to construct a hyperplane
that correctly predict class of a vector. It is logically to assume that
vectors that are laying close to the border that
separates two classes should play more important role in determining exact position
of the plane that separates two classes. The main advantage of the
SVM-classifier versus others is that only vectors that are laying
sufficiently close to the border that separates two classes determine exact
position of the classifying hyperplane. These vectors
called support vectors. Their name assumes that only they “support” the
constructed hyperplane. The exact distance to the
separating hyperplane that determines which vectors
will be considered to be “support” vectors, is a paramter
to maximize during SVM training. The larger is this distance, the more distinct
is the separation by the hyperplane between two
classes. All other vectors are non-support vectors, for them Support vectors can be tentatively divided into two groups:
vectors laying on the border of separating hyperplane and vectors laying within the separating hyperplane. Origin of this tentative division is the
existence of hard-margin (Equation 1) and soft-margin (Equation 2) SVM
classifiers. In hard-margin SVM classifier no support vectors are allowed
within the separating margin and SVM is trained to maximize this separating
margin. In soft-margin SVM classifier, which is our main focus here, a trade
off is introduced between having a large separating margin and minimum number
of errors. Some of the vectors lying within the separating margin can be
classified incorectly by hyperplane,
which is a price for the tradeoff between having a large separating margin and
minimum number of errors. This trade off is introduced by C parameter, which shoud be optimized to
achieve maximum classification accuracy. For all support vectors the absolute values of the slack
variables are equal to the distances from these points to the edge of the separating
margin: for support vectors lying on the border of separating hyperplane slack variables are equal to zero. For other
support vectors these distances determine to which extent they violate the
margin. They are defined in the units of half of the width of the separating
margin. For correctly classified vectors within the separating margin, slack
variable values are between zero and one. For misclassified vectors within the
margin the values of the slack variables are between one and two. For other
misclassified points they are greater than two. For vectors lying on the edge of margin, Lagrange multipliers
are between zero and C, slack
variables for them are still equal to zero. For all support vectors, for which the
values of slack variables are larger than zero, Lagrange multipliers equal to C. It was shown that classification accuracy usually depends
only weakly on the specific projection, provided that the target space is sufficiently
high-dimensional (Cortes and Vapnik 1995). Sometimes it is not possible to find a separating
hyperplane even in a very high dimensional space. In
this case a tradeoff is introduced between the size of the separating margin
and penalties for every vector which is within the margin (Cortes and Vapnik 1995). Additionally, explicit mapping of the original data to a
high-dimensional space is not required if calculation of the scalar product for
each pair of vectors is feasible in the high-dimensional space. This scalar
product can then be defined by introducing a kernel function 1. K(x,x’) takes its maximum value when x’ = x. 2. |K(x,x’)| decreases with |x-x’|. Here x and x’ are vectors in a low-dimensional
space for which a kernel function is defined that corresponds to a scalar
product in a high-dimensional space. Various kernels may be applied (Burges
1998).
For many applications a polynomial kernel functions has been shown to be
sufficient, e.g. a fifth-order polynomial-based kernel:
This kernel corresponds to the decision function in a form of
fifth order polynomial:
Here ai are Lagrange multipliers determined during training of SVM.
The sum is only over support vectors xsv. Lagrange multipliers for all other points
are equal to zero. Parameter b determines
the shift of the hyperplane, which is also determined
during SVM training. The kernel function can often be simplified without loss of
generality. For instance, for a polynomial kernel it is easy to see that
simultaneous scaling of s, r and b parameters will not change the
decision function. Then we can simplify the kernel by setting r equal to one:
In this case, only the kernel parameter s and error tradeoff C should
be tuned. Parameter C is not explicitly present in this equation,
it is used as a penalty for the classification error at the formulation of the QP
problem for SVM. For appropriate tuning of parameters s and C, multiple cross-validation of training
data can be applied, and values for s and
C are chosen by maximizing accuracy. Basically,
the following procedure is applied. The data set is divided into two parts
training and test data subsets. The test subset is put aside and is used only
for estimation of the performance of the trained classifier. As it was
mentioned above typically two parameters must be set prior to starting of the
training of SVM. Optimal values for these parameters are set based on standard
procedure of k-fold cross validation. Training data is divided into k non-overlapping
subsets. First parameters to be determined are set to certain reasonable
values. Second, SVM is trained on the whole training data excluded k subset. Then performance of the
obtained SVM classifier is estimated on the excluded k subset. This procedure is repeated for every subset and average
performance of the SVM classifier is estimated. Free parameters are then tuned
to achieve maximum average classification performance at cross validatzion. Optimization can be performed, e.g. by simple
heuristics based on gradient descent methods (Bishop 1995). Resulted parameters are then
used for the final training with the complete training data. It should be noted
that test data was not used at any stage of the training procedure, it allows
to avoid artifacts of overestimated accuracy of classification resulted from
the use of information from the labeling of test data for training. References
Bishop CM. 1995. Neural Networks for Pattern Recognition. Oxford, Oxford University Press.
Burges CJC. 1998. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2(2):121-167.
Coleman TF, and Li Y. 1996. A Reflective Newton Method for Minimizing a Quadratic Function Subject to Bounds on some of the Variables. SIAM Journal on Optimization 6(4):1040-1058.
Cortes C, and Vapnik V. 1995. Support-Vector Networks. Machine Learning 20(3):273-297.
Cristianini N, and Shawe-Taylor J. 2000. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge, Cambridge University Press.
Joachims T. 1999. Making large-Scale SVM Learning Practical. In B. Schölkopf, C. Burges and A. Smola, eds. Advances in Kernel Methods - Support Vector Learning. Cambridge, MIT-Press. p 41-56.
Vapnik V. 1995. |