
CS 540: Machine Learning
Lecture 9: Support Vector Machines

AD & KPM

February 2008


Linear classification

Recall that the perceptron algorithm learns a decision boundary of the form $y(x) = w^T \phi(x) + b$. But which boundary should it pick? In a Bayesian approach, we compute

$$\Pr(y \mid x, \mathcal{D}) = \int \Pr(y \mid x, w)\, p(w \mid \mathcal{D})\, dw.$$


Maximum margin: what?

The margin is the smallest distance between the decision boundary and the closest of the data points. Intuitively, it is a "robust" solution. Maximizing the margin leads to a particular choice of decision boundary. The points with the smallest margin are called "support vectors".


Maximum margin: why?

The concept of max margin is usually justified using Vapnik's statistical learning theory (SLT).

SLT establishes models/algorithms for which one can bound the generalization error:

$$\Pr\big(\,|\text{test error rate} - \text{train error rate}| > \epsilon\,\big) < \delta$$

where the probabilities are over random train/test sets of a given size. This is usually interpreted as implying statements like

$$\Pr\big(\,|\text{test error rate} - 0.02| > \epsilon \;\big|\; \text{train error rate} = 0.02\big) < \delta$$

However, this is conceptually flawed (cf. frequentist statistics): one cannot predict the test error conditioned on the training error for a specific data set without making prior assumptions about the data-generating mechanism.


Maximum margin: how?

Recall that the perpendicular distance of a point $x$ from the hyperplane is $|y(x)| / \|w\|$, where $y(x) = w^T \phi(x) + b$. If we assume for now that all data points are linearly separable, then $t_n y(x_n) > 0$, so the margin of point $n$ is (since $|t_n| = 1$)

$$\frac{t_n \big(w^T \phi(x_n) + b\big)}{\|w\|}.$$


Maximizing the margin

We want to solve

$$\arg\max_{w,b} \left\{ \min_n \frac{t_n \big(w^T \phi(x_n) + b\big)}{\|w\|} \right\} = \arg\max_{w,b} \left\{ \frac{1}{\|w\|} \min_n \left[ t_n \big(w^T \phi(x_n) + b\big) \right] \right\}$$

We can rescale $w \to \kappa w$, $b \to \kappa b$ without changing the distance of any point to the boundary. Hence we can set $t_n (w^T \phi(x_n) + b) = 1$ for the point which is closest to the boundary. Hence for all $x_n$ we will have

$$t_n \big(w^T \phi(x_n) + b\big) \ge 1$$

There will always be one point that achieves this bound with equality (and at least two when the margin is maximized). Hence we just need to maximize $1/\|w\|$, or equivalently minimize $\|w\|$, i.e. minimize

$$\frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad t_n \big(w^T \phi(x_n) + b\big) \ge 1$$

This can be solved using quadratic programming: minimize a quadratic function subject to a set of linear inequalities.
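As a concrete illustration, here is a minimal sketch of the hard-margin primal QP for a linear kernel ($\phi(x) = x$), written as a generic quadratic program and solved with the cvxopt package. The data X, t, the function name, and the assumption of separable data are illustrative choices for this example, not part of the lecture.

```python
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm_primal(X, t):
    """Hard-margin primal QP for a linear kernel (phi(x) = x).
    X: (N, D) inputs, t: (N,) labels in {-1, +1}; assumes the data are separable."""
    N, D = X.shape
    t = t.astype(float)
    # Optimisation variable z = [w_1, ..., w_D, b]; minimise 1/2 z^T P z subject to G z <= h
    P = np.zeros((D + 1, D + 1))
    P[:D, :D] = np.eye(D)                      # penalise ||w||^2 only, not b
    q = np.zeros(D + 1)
    # t_n (w^T x_n + b) >= 1   <=>   -t_n [x_n, 1] z <= -1
    G = -t[:, None] * np.hstack([X, np.ones((N, 1))])
    h = -np.ones(N)
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
    z = np.array(sol['x']).ravel()
    return z[:D], z[D]                         # w, b
```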


Dual problem

Let us introduce Lagrange multipliers $a_n \ge 0$:

$$L(w, b, a) = \frac{1}{2}\|w\|^2 - \sum_{n=1}^N a_n \left\{ t_n \big(w^T \phi(x_n) + b\big) - 1 \right\}$$

Setting the derivatives with respect to $w$ and $b$ to zero and solving, we get

$$w = \sum_{n=1}^N a_n t_n \phi(x_n), \qquad 0 = \sum_{n=1}^N a_n t_n$$

Substituting into $L(w, b, a)$ we get the dual representation

$$\tilde{L}(a) = \sum_n a_n - \frac{1}{2} \sum_n \sum_m a_n a_m t_n t_m k(x_n, x_m)$$

where $k(x_n, x_m) = \phi(x_n)^T \phi(x_m)$, subject to $a_n \ge 0$ and $\sum_{n=1}^N a_n t_n = 0$.
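The dual is itself a QP in the $a_n$, which only touches the data through the Gram matrix. A minimal sketch with cvxopt, assuming the data are separable in feature space so that the hard-margin dual is bounded (function names and the Gaussian kernel width are illustrative assumptions):

```python
import numpy as np
from cvxopt import matrix, solvers

def gaussian_kernel(X, Y, sigma=1.0):
    """Gram matrix K_nm = exp(-||x_n - y_m||^2 / sigma^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def svm_dual(K, t):
    """Maximise L~(a) = sum(a) - 1/2 a^T Q a, with Q_nm = t_n t_m k(x_n, x_m),
    subject to a_n >= 0 and sum_n a_n t_n = 0."""
    N = len(t)
    t = t.astype(float)
    Q = np.outer(t, t) * K
    P, q = matrix(Q), matrix(-np.ones(N))           # minimise 1/2 a^T Q a - 1^T a
    G, h = matrix(-np.eye(N)), matrix(np.zeros(N))  # -a_n <= 0
    A, b = matrix(t.reshape(1, -1)), matrix(0.0)    # sum_n a_n t_n = 0
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol['x']).ravel()               # the dual variables a_n
```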


Sparse solution

The predicted label for a test point $x$ is based on

$$y(x) = w^T \phi(x) + b = \sum_{n=1}^N a_n t_n k(x, x_n) + b$$

The advantage of the dual is that we can solve it using kernels instead of features.

The disadvantage is that we must solve a QP in $N$ variables, which is $O(N^3)$. However, the final answer is sparse.

It can be shown (using the KKT conditions) that either $t_n y(x_n) = 1$ (the constraint is tight) or $a_n = 0$ (the constraint is loose).

Those $x_n$ for which $a_n = 0$ can be eliminated from the solution.


Solving for the offset

All support vectors must satisfy $t_n y(x_n) = 1$. Hence

$$1 = t_n y(x_n) = t_n \left( \sum_{m \in S} a_m t_m k(x_m, x_n) + b \right)$$

We can solve this for $b$. In practice, it is numerically more stable to compute the average (using the fact that $t_n^2 = 1$):

$$b = \frac{1}{N_S} \sum_{n \in S} \left( t_n - \sum_{m \in S} a_m t_m k(x_m, x_n) \right)$$
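Given a dual solution $a$ (for instance from a QP sketch like the one earlier), this averaged offset is a couple of lines of numpy; the helper name and the support-vector tolerance are illustrative choices:

```python
import numpy as np

def svm_offset(a, t, K, tol=1e-8):
    """b = (1/N_S) * sum_{n in S} ( t_n - sum_{m in S} a_m t_m k(x_m, x_n) )."""
    S = a > tol                        # support vectors: a_n strictly positive
    K_SS = K[np.ix_(S, S)]             # k(x_m, x_n) restricted to the support set
    return np.mean(t[S] - K_SS.T @ (a[S] * t[S]))
```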


Example with Gaussian kernel

Contours of constant $y(x)$ for a Gaussian kernel (decision boundary and margin boundaries) and the support vectors (green).


Soft margin constraints

If the training points are not linearly separable in the feature space φ(x), we use a soft margin constraint

$$t_n y(x_n) \ge 1 - \xi_n$$

where the slack variables $\xi_n \ge 0$. Points with $\xi_n = 0$ are on the margin or correctly classified outside it; points with $0 < \xi_n < 1$ are correctly classified but inside the margin; points with $\xi_n > 1$ are on the wrong side of the decision boundary.

Illustration of the slack variables $\xi_n \ge 0$. Circled data points are support vectors.


Soft margin constraints

We will try to minimize the number of misclassifications and maximize the margin:

$$\min_{w, b, \xi} \; C \sum_{n=1}^N \xi_n + \frac{1}{2}\|w\|^2$$

The parameter $C$ is a regularization parameter that controls the tradeoff between minimizing training errors and model complexity.

KKT theory tells us the dual problem is

$$\max_a \; \tilde{L}(a) = \sum_n a_n - \frac{1}{2} \sum_n \sum_m a_n a_m t_n t_m k(x_n, x_m)$$

subject to the box constraints $0 \le a_n \le C$ and $\sum_n a_n t_n = 0$.
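The effect of the box constraint can be seen directly with an off-the-shelf solver; a small sketch using scikit-learn's SVC, where the dataset and parameter values are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, t = make_moons(n_samples=200, noise=0.2, random_state=0)
for C in (0.1, 1.0, 100.0):
    clf = SVC(kernel='rbf', C=C, gamma=2.0).fit(X, t)
    # Larger C penalises slack more heavily and typically leaves fewer support vectors
    print(f"C = {C:6.1f}   support vectors: {len(clf.support_)} / {len(X)}")
```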


In this case, too, a subset of the data points have $a_n = 0$.

The remaining points constitute the support vectors; these have $a_n > 0$ and

$$t_n y(x_n) = 1 - \xi_n$$

Again we have

$$b = \frac{1}{N_\mathcal{M}} \sum_{n \in \mathcal{M}} \left( t_n - \sum_{m \in \mathcal{M}} a_m t_m k(x_m, x_n) \right)$$

where $\mathcal{M}$ denotes the set of indices of data points having $0 < a_n < C$.


nu-SVMs

It is hard to set $C$ by hand because it is not very intuitive. (In practice one always uses cross validation to pick $C$.)

A ν-SVM is a different but equivalent formulation.

Here ν is both an upper bound on the fraction of margin errors (i.e. points with $\xi_n > 0$, lying on the wrong side of the margin boundary) and a lower bound on the fraction of support vectors.
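scikit-learn exposes this formulation as NuSVC; a short sketch (dataset and kernel settings are illustrative) that checks the lower bound ν places on the fraction of support vectors:

```python
from sklearn.datasets import make_moons
from sklearn.svm import NuSVC

X, t = make_moons(n_samples=200, noise=0.25, random_state=0)
for nu in (0.05, 0.2, 0.5):
    clf = NuSVC(nu=nu, kernel='rbf', gamma=2.0).fit(X, t)
    frac_sv = len(clf.support_) / len(X)
    print(f"nu = {nu:4.2f}   fraction of support vectors = {frac_sv:.2f}  (>= nu)")
```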


Example of nu-SVM with Gaussian kernel

Illustration of the ν-SVM applied to a nonseparable dataset in two dimensions with a Gaussian kernel.


Computational issues when solving the QP

The naive solution takes $O(N^3)$ time.

Chunking exploits the fact that we can remove the $x_n$ for which $a_n = 0$. The most popular algorithm is called sequential minimal optimization (SMO), which solves (in closed form) for just two Lagrange multipliers at a time. Empirically this takes $O(N)$ to $O(N^2)$ time.

There are a large number of software packages available for SVMs. See

http://www.cs.ubc.ca/~murphyk/Software/svm.htm

The most popular seem to be SVMlight and libSVM.


Surrogate loss function

It is instructive to reinterpret the error function minimized when we fit an SVM:

$$C \sum_{n=1}^N \xi_n + \frac{1}{2}\|w\|^2$$

For data points on the correct side of the margin we have $y_n t_n \ge 1$, so $\xi_n = 0$, whereas for the remaining points $\xi_n = 1 - y_n t_n$, so the error function is of the form

$$\sum_{n=1}^N E_{SVM}(y_n t_n) + \lambda \|w\|^2$$

where

$$E_{SVM}(yt) = [1 - yt]_+$$

This is known as the hinge error function and can be viewed as an approximation of the misclassification rate.


Relation to logistic regression

We have

$$p(t \mid y) = \sigma(yt), \qquad t \in \{-1, +1\}.$$

For $N$ training data points, if we assume an L2 penalty, then minimizing the penalized negative log-likelihood gives the following error function:

$$\sum_{n=1}^N E_{LR}(y_n t_n) + \lambda \|w\|^2$$

where

$$E_{LR}(yt) = \log\big(1 + \exp(-yt)\big).$$

Both the logistic loss and the hinge loss are continuous approximations to the misclassification error.
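The losses compared on the next slide can be written in a few lines; a small numpy sketch of the hinge loss, the logistic loss and the 0/1 misclassification loss as functions of the margin $yt$:

```python
import numpy as np

def hinge_loss(yt):
    """E_SVM(yt) = [1 - yt]_+"""
    return np.maximum(0.0, 1.0 - yt)

def logistic_loss(yt):
    """E_LR(yt) = log(1 + exp(-yt))"""
    return np.log1p(np.exp(-yt))

def zero_one_loss(yt):
    """Misclassification loss: 1 if yt <= 0 (wrong side of the boundary), else 0."""
    return (yt <= 0).astype(float)
```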


Loss Functions

Hinge loss, logistic loss and square loss (rescaled)


Probabilistic outputs

The distance of a point from the hyperplane, $y(x)$, can be interpreted as a measure of confidence.

This can be mapped to the $[0, 1]$ scale using a sigmoid. We can train the parameters of the sigmoid using a validation set:

$$P(t = 1 \mid x) \approx \sigma\big(a\, y(x) + b\big)$$

In practice these probabilities are not well calibrated, since the SVM decision boundary was not trained to compute the log-odds.
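A minimal sketch of this recipe (often called Platt scaling): fit an SVM, then fit a one-dimensional logistic regression $\sigma(a\,y(x)+b)$ to held-out decision values. The dataset, split and hyperparameters below are illustrative assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, t = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_val, t_tr, t_val = train_test_split(X, t, test_size=0.5, random_state=0)

svm = SVC(kernel='rbf', C=1.0, gamma=1.0).fit(X_tr, t_tr)

# The sigmoid parameters (a, b) are learned on validation-set scores, not the training set
scores = svm.decision_function(X_val).reshape(-1, 1)
platt = LogisticRegression().fit(scores, t_val)
p_val = platt.predict_proba(scores)[:, 1]   # approximate P(t = 1 | x)
```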


Multiclass SVMs: One versus rest

There are many ad hoc approaches to handling $K > 2$ classes. In the one-versus-the-rest approach, we treat the $K - 1$ other classes as negative examples. However, this can lead to ambiguities. In addition, the classes are imbalanced. In practice, people use

$$y(x) = \max_k y_k(x)$$

i.e. assign $x$ to the class whose $y_k(x)$ is largest, although the different $y_k$ may have incomparable scales.
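A one-versus-rest sketch with scikit-learn, picking the class whose $y_k(x)$ is largest (the data and kernel are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, t = make_blobs(n_samples=300, centers=4, random_state=0)
ovr = OneVsRestClassifier(SVC(kernel='linear')).fit(X, t)

scores = ovr.decision_function(X)   # one y_k(x) per class, on possibly incomparable scales
pred = scores.argmax(axis=1)        # y(x) = max_k y_k(x): pick the highest-scoring class
```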


Multiclass SVMs: All pairs

In the one-versus-one approach, one trains $K(K-1)/2$ classifiers to distinguish all pairs of classes. Then one picks the class with the highest number of votes. However, this can lead to ambiguities. Also, it is slow.
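The corresponding one-versus-one sketch; note that scikit-learn's SVC already uses this pairwise strategy internally for multiclass problems. The data are the same illustrative blobs as above.

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, t = make_blobs(n_samples=300, centers=4, random_state=0)
ovo = OneVsOneClassifier(SVC(kernel='linear')).fit(X, t)   # trains K(K-1)/2 = 6 classifiers
pred = ovo.predict(X)                                      # class with the most pairwise votes
```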


Multiclass SVMs: ECOC

Error-correcting output codes (ECOC) are a way of converting any binary classifier into a $K$-ary classifier.

Instead of trying to distinguish class $i$ from class $j$ for all pairs, one tries to distinguish subsets of classes $S_i$ from $S_j$.

These subsets are chosen so that the resulting codewords are far apart (robust to errors).

Designing such codebooks is hard, but greedy approaches exist.
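scikit-learn's OutputCodeClassifier implements a randomised version of this idea (it draws a random code book rather than designing one); a short sketch with illustrative settings:

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OutputCodeClassifier
from sklearn.svm import SVC

X, t = make_blobs(n_samples=300, centers=4, random_state=0)
# codeword length = code_size * number of classes
ecoc = OutputCodeClassifier(SVC(kernel='linear'), code_size=2.0, random_state=0).fit(X, t)
pred = ecoc.predict(X)
```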


SVM regression

SVM regression minimizes the regularized robust loss function

$$C \sum_n |y(x_n) - t_n|_\epsilon + \frac{1}{2}\|w\|^2$$

where $|y(x_n) - t_n|_\epsilon$ is the ε-insensitive loss.
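The ε-insensitive loss itself is one line, and scikit-learn's SVR minimizes this kind of regularized objective; the helper name, C, ε and the toy data below are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVR

def eps_insensitive(y, t, eps):
    """|y - t|_eps = max(0, |y - t| - eps): zero for points inside the tube."""
    return np.maximum(0.0, np.abs(y - t) - eps)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 50))
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=50)

svr = SVR(kernel='rbf', C=10.0, epsilon=0.1, gamma=10.0).fit(x[:, None], t)
print("support vectors:", len(svr.support_), "of", len(x))   # sparse: points on/outside the tube
```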


SVM regression

To minimize this, we introduce two slack variables per data point, $\xi_n \ge 0$ and $\hat{\xi}_n \ge 0$, where $\xi_n > 0$ means $t_n$ lies above the tube and $\hat{\xi}_n > 0$ means $t_n$ lies below the tube. Points inside the tube have $\xi_n = \hat{\xi}_n = 0$. The final answer is again sparse by the KKT conditions (all the points on the boundary of the tube or outside it are active):

$$y(x) = \sum_{n=1}^N (a_n - \hat{a}_n)\, k(x, x_n) + b$$

Regression function with ε-insensitive "tube". Points above the tube have $\xi > 0$ and $\hat{\xi} = 0$; points below the tube have $\hat{\xi} > 0$ and $\xi = 0$.


SVM regression example

Regression function with ε-insensitive "tube": the predicted regression function is shown in red, the tube is also plotted, and the blue circled points are the support vectors.


RVM regression

The relevance vector machine (RVM) is a Bayesian model that ensures sparsity.

Consider linear regression where the basis functions are kernels (which do not have to be positive definite):

$$y(x) = w^T \phi(x) = \sum_n w_n k(x, x_n) + b$$

$$p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x), \beta^{-1})$$

We do not require $k(x, x')$ to be positive semi-definite here. Ridge regression uses a spherical Gaussian prior $\mathcal{N}(w_i; 0, \alpha^{-1})$ and chooses $\alpha$ by cross validation or by maximizing the marginal likelihood.

Ridge regression does not provide sparse solutions.


RVM regression

The RVM uses a diagonal Gaussian prior on $w$:

$$p(w \mid \alpha) = \prod_{i=1}^N \mathcal{N}(w_i; 0, \alpha_i^{-1})$$

Conditional on $\alpha$, $y(x)$ is a Gaussian process with

$$E[y(x) \mid \alpha] = 0, \qquad \mathrm{cov}\big(y(x), y(x') \mid \alpha\big) = \phi(x')^T \mathrm{diag}(\alpha^{-1})\, \phi(x)$$

For $\alpha_i = \alpha$, we recover standard ridge regression.


Given $\{x_n, t_n\}_{n=1}^N$, we have

$$p(w \mid t, X, \alpha, \beta) = \mathcal{N}(w; m, \Sigma)$$

where

$$m = \beta \Sigma \Phi^T t, \qquad \Sigma = \big(\mathrm{diag}(\alpha) + \beta \Phi^T \Phi\big)^{-1}.$$

The ridge estimate is not sparse.

In the RVM, we estimate $\alpha = (\alpha_1, \ldots, \alpha_N)$ and $\beta$ by maximizing the marginal log-likelihood:

$$(\alpha^*, \beta^*) = \arg\max_{\alpha, \beta}\, \log p(t \mid X, \alpha, \beta)$$


The marginal likelihood is

$$p(t \mid X, \alpha, \beta) = \int p(t \mid X, w, \beta)\, p(w \mid \alpha)\, dw$$

so that

$$\log p(t \mid X, \alpha, \beta) = -\frac{1}{2} \left\{ N \log(2\pi) + \log|C| + t^T C^{-1} t \right\}$$

where

$$C = \beta^{-1} I + \Phi\, \mathrm{diag}(\alpha^{-1})\, \Phi^T$$

The marginal log-likelihood is non-convex.

There are several iterative techniques to optimize $p(t \mid X, \alpha, \beta)$; a sketch of one standard scheme is given below.
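A minimal numerical sketch of type-II ML for the RVM, using the standard fixed-point re-estimation equations ($\gamma_i = 1 - \alpha_i \Sigma_{ii}$, $\alpha_i \leftarrow \gamma_i/m_i^2$, $\beta \leftarrow (N - \sum_i \gamma_i)/\|t - \Phi m\|^2$); the iteration count, pruning threshold and function name are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def rvm_regression(Phi, t, n_iter=200, alpha_prune=1e6):
    """Type-II ML for the RVM: maximise log p(t | X, alpha, beta) by
    fixed-point re-estimation of alpha and beta."""
    N, M = Phi.shape
    alpha = np.ones(M)
    beta = 1.0 / np.var(t)
    for _ in range(n_iter):
        # Posterior p(w | t, X, alpha, beta) = N(w; m, Sigma)
        Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
        m = beta * Sigma @ Phi.T @ t
        # Re-estimation equations
        gamma = 1.0 - alpha * np.diag(Sigma)
        alpha = gamma / (m ** 2 + 1e-12)
        beta = (N - gamma.sum()) / (np.sum((t - Phi @ m) ** 2) + 1e-12)
    relevant = alpha < alpha_prune     # basis functions whose alpha_i has not diverged
    return m, Sigma, alpha, beta, relevant
```

With $\phi_i(x) = k(x, x_i)$ as on the previous slides, many $\alpha_i$ diverge and the corresponding weights are pruned, which is the ARD behaviour described on the next slide.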


ARD in RVM

If we perform type-II ML estimation of $\alpha$ and $\beta$, we find that many $\alpha_i \to \infty$. Hence $p(w_i \mid t, \alpha^*, \beta^*)$ is concentrated near 0, so the corresponding basis function $\phi_i$ can be removed.

This is called automatic relevance determination (ARD).


RVM posterior predictive distribution

$$p(t \mid x, X, t, \alpha^*, \beta^*) = \int p(t \mid x, w, \beta^*)\, p(w \mid X, t, \alpha^*, \beta^*)\, dw = \mathcal{N}\big(t; m^T \phi(x), \sigma^2(x)\big)$$

where $m$ and $\Sigma$ are the posterior mean and covariance given earlier and

$$\sigma^2(x) = (\beta^*)^{-1} + \phi(x)^T \Sigma\, \phi(x)$$
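A direct transcription of these predictive equations, reusing the m, Sigma, beta returned by the sketch above (the helper name is illustrative):

```python
import numpy as np

def rvm_predict(phi_x, m, Sigma, beta):
    """Predictive mean and variance for a single test feature vector phi_x."""
    mean = m @ phi_x
    var = 1.0 / beta + phi_x @ Sigma @ phi_x
    return mean, var
```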

RVM regression: the mean of the predictive distribution is plotted as a red line, and the relevance vectors are shown in blue.


RVMs for classification

For binary classification, we can use $y(x) = \sigma(w^T \phi(x))$. Since the Gaussian ARD prior is no longer conjugate, we cannot integrate out $w$ exactly. However, we can make a Laplace approximation: we use IRLS to find $w_{MAP}$, and the inverse of the Hessian of the negative log posterior at $w_{MAP}$ gives the covariance.

This can easily be extended to the multiclass case using softmax functions.

The "relevance vectors" are not on the decision boundary, but are in locations that give the best overall prediction of the (conditional) density.

If there are $D$ basis functions, $N$ training points and $C$ classes, complexity scales as $O(D^3 C^3)$ per iteration (not sure about this!). If $\phi_i(x) = k(x, x_i)$ then $D = N$. Sparse approximations can reduce this to $O(D^2 C^2)$.


RVM classification example

Example of the RVM for a binary classification problem.


RVMs compared to SVMs

- RVM training is slower than SVM training.

- The RVM cost function is not convex.

+ RVM models are sparser than SVM models, and hence faster at test time.

+ RVM models give probabilistic outputs.

+ The RVM can use any basis functions.

+ RVM kernels do not need to satisfy Mercer's theorem.


Limitations of the RVM

Remember that, conditional on $\alpha$, we have

$$E[y(x) \mid \alpha] = 0, \qquad \mathrm{cov}\big(y(x), y(x') \mid \alpha\big) = \phi(x')^T \mathrm{diag}(\alpha^{-1})\, \phi(x)$$

so in particular

$$V[y(x) \mid \alpha] = \sum_{i=1}^N \frac{\phi_i^2(x)}{\alpha_i}.$$

Hence wherever $\phi(x) \approx 0$ we get $V[y(x) \mid \alpha] \approx 0$. In other words, where we do not have data we have $V[y(x) \mid \alpha] \approx 0$ for kernels of the form $\phi_i(x) = K(x_i, x)$ with, say, $K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{\sigma^2}\right)$.

We do not have this problem if we select for $y(x)$ a GP with mean function 0 and kernel $K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{\sigma^2}\right)$.


RVM vs GP

RVM (left) vs GP (right).


Multi-class (polychotomous) logistic regression

We can use a softmax function:

$$p(t = c \mid x, w) = \frac{\exp\big(w_c^T \phi(x)\big)}{\sum_{c'} \exp\big(w_{c'}^T \phi(x)\big)}$$

If $\phi_i(x) = k(x, x_i)$, this is kernelized logistic regression; in this case $D = N$ and we must take extra care to prevent overfitting.

The log-likelihood is

$$\ell(w) = \sum_n \log p(t_n \mid x_n, w) = \sum_n \left[ w_{t_n}^T \phi(x_n) - \log \sum_{c'} \exp\big(w_{c'}^T \phi(x_n)\big) \right]$$
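A direct numpy transcription of this log-likelihood, using a log-sum-exp for numerical stability (the function name and array shapes are assumptions for the example):

```python
import numpy as np

def softmax_log_likelihood(W, Phi, labels):
    """l(w) = sum_n [ w_{t_n}^T phi(x_n) - log sum_{c'} exp(w_{c'}^T phi(x_n)) ].
    W: (C, D) weight matrix, Phi: (N, D) design matrix, labels: (N,) ints in {0..C-1}."""
    scores = Phi @ W.T                                      # (N, C): w_c^T phi(x_n)
    mx = scores.max(axis=1, keepdims=True)                  # for a stable log-sum-exp
    log_norm = mx[:, 0] + np.log(np.exp(scores - mx).sum(axis=1))
    return np.sum(scores[np.arange(len(labels)), labels] - log_norm)
```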


SMLR/ BMR

Let us use a sparsity-promoting L1 / Laplace / double-exponential prior $p(w_{cd} \mid \lambda_{cd}) = \frac{\lambda_{cd}}{2} \exp(-\lambda_{cd} |w_{cd}|)$. A simple special case is $\lambda_{cd} = \lambda$, so $p(w \mid \lambda) \propto \exp(-\lambda \|w\|_1)$.

There are several methods to find the globally optimal MAP estimate $w_{MAP} = \arg\max_w\, \ell(w) + \log p(w)$ (a convex problem).

SMLR (Sparse Multinomial Logistic Regression) by Krishnapuram, Carin, Figueiredo and Hartemink uses a bound-optimization (EM-like) method. This takes $O(D^3 C^3)$ time per iteration (batch mode) or $O(NDC)$ time per iteration (cyclic mode).

BMR (Bayesian Multinomial Regression) by Madigan, Genkin, Lewis and Franklin uses a cyclic update with a tighter (but more expensive) bound. The complexity is $O(NDC)$ per iteration (?).

Both groups choose $\lambda$ by cross validation.
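The same L1-penalized multinomial MAP objective can be optimized with an off-the-shelf solver; this is not the SMLR/BMR bound-optimization algorithms themselves, just a sketch of the objective with scikit-learn (saga is one solver that supports an L1 penalty with a multinomial likelihood; C plays the role of $1/\lambda$, and the data are illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, t = make_blobs(n_samples=300, centers=4, random_state=0)
clf = LogisticRegression(penalty='l1', solver='saga', C=0.5, max_iter=5000).fit(X, t)
print("nonzero weights per class:", np.count_nonzero(clf.coef_, axis=1))   # sparse w_c
```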


SMLR vs RVM vs SVM

                      SMLR   RVM    SVM
Convex                 1      0      1
Prob. output           1      1      0
Multiclass             1      1      0
Fast to train          1      0      1
Fast to test           1      1      1
Arbitrary φ(x)         1      1      0
Arbitrary k(x, x')     1      1      0
Sparsity              Med    High   Med
Accuracy              High   Med    High


Other approaches to computation with L1 priors

LARS (least angle regression) is a sequential algorithm for L2 regression with an L1 prior (an efficient alternative to QP).

LARS can be extended to the GLM case (logistic likelihood).

If the likelihood is an L1 (least absolute deviations) regression and the prior is L1, the problem can be solved using linear programming.
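A small sketch of the L2-loss / L1-prior case solved by LARS, via scikit-learn's LassoLars (the toy data, regularization strength and sparse ground truth are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LassoLars

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]                 # sparse ground truth
t = X @ w_true + 0.1 * rng.normal(size=100)

model = LassoLars(alpha=0.05).fit(X, t)       # L2 loss + L1 penalty, solved by LARS
print(np.flatnonzero(model.coef_))            # most coefficients are exactly zero
```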
