Data Science - Linear Separators and Support Vector Machines
Hachem Kadri
Aix-Marseille University, CNRS, Laboratoire d'Informatique et des Systèmes (LIS)
QARMA team, https://qarma.lis-lab.fr/
Supervised Learning
Generalize from training to testing
Supervised Learning – Generalization
From a training set consisting of randomly sampled (input, target) pairs, learn a function (a predictor) that predicts well the target of new data.
Supervised learning / Generalization
−→ Given n training examples (x1, y1), . . . , (xn, yn) ∈ (X × Y) and u test data xn+1, . . . , xn+u ∈ X
−→ Learn f : X → Y to generalize from training to testing
Decision Trees reminder
Should we play tennis or not?
Play  Wind   Humidity  Outlook
No    Low    High      Sunny
No    High   Normal    Rain
Yes   Low    High      Overcast
Yes   Weak   Normal    Rain
Yes   Low    Normal    Sunny
Yes   Low    Normal    Overcast
Yes   High   Normal    Sunny
Decision Trees reminder
- breaking down our data by making decisions based on asking a series of questions
K-nearest Neighbors reminder
- Choose the number k and a distance metric
- Find the k-nearest neighbors of the sample to be classified
- Assign the class label by majority vote
Linear Regression reminder
- model the relation between features (explanatory variable x) and a continuous-valued response (target variable y)
- Linear model:

  y = w0 + w1 x^(1) + · · · + wd x^(d) = Σ_{i=0}^d wi x^(i) = 〈w, x〉   (with the convention x^(0) = 1)
Focus: linear classification
Important notions in learning to classify
- a number of training data (pictures, emails, etc.)
- a learning algorithm (how to build the classifier?)
- generalization: the classifier should correctly classify test data
Quick formalization
- X (e.g. R^d, d > 0) is the space of data, called the input space
- Y (e.g. toxic/not toxic, or {−1, +1}) is the target space
- f : X → Y is the classifier

[Figure: a linear classifier f(x) = 0 in input space X = R², separating class +1 from class −1]
Perceptron (Rosenblatt, 1958)
Inspiration: biological neural networks
Motivations:
- Learning system composed by associating simple processing units
- Efficiency, scalability, and adaptability
Perceptron: a linear classifier, X = R^d, Y = {−1, +1}

[Figure: a processing unit computing σ(Σ_{i=1}^d wi xi + w0) from inputs x = (x1, x2, . . .), weights w1, w2, . . ., and a bias w0 (activation fixed to 1)]
Perceptron (Rosenblatt, 1958)
A linear classifier, X = R^d, Y = {−1, +1}
- Classifier weights: w ∈ R^d
- Classifier prediction: f(x) = sign〈w, x〉
- Question: how to learn w from training data?
Algorithm: Perceptron
Input: S = {(xi, yi)}_{i=1}^n
w ← 0
while there exists (xi, yi) such that yi〈w, xi〉 ≤ 0 do
    w ← w + yi xi
end while
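The pseudocode above translates directly into NumPy. A minimal sketch (function name and toy data are illustrative, not from the course; a bias can be handled by appending a constant 1 feature):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Primal perceptron: cycle over the data, updating w on each mistake.

    X: (n, d) array of inputs; y: (n,) array of labels in {-1, +1}.
    Assumes the data are linearly separable; otherwise stops after
    max_epochs passes over the data.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:   # misclassified (or on the boundary)
                w += yi * xi              # perceptron update: w <- w + yi xi
                mistakes += 1
        if mistakes == 0:                 # all points correctly classified
            break
    return w

# Toy linearly separable data
X = np.array([[2.0, 1.0], [1.0, -1.0], [-2.0, 1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
assert np.all(np.sign(X @ w) == y)
```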
Perceptron in action
Perceptron
1. Initialize the weights w to 0 or small random numbers
2. For each training sample xi:
   a. Compute the output value ŷi = sign(w⊤xi)
   b. Update the weights if ŷi ≠ yi
Perceptron: limitations
Theorem (XOR, Minsky & Papert, 1969)
The perceptron algorithm cannot solve the XOR problem.
Perceptron: dual-form
- The solution is written as a linear combination of the training data:

  w = Σ_{j=1}^n αj yj xj
Algorithm: Dual-Form Perceptron
Input: S = {(xi, yi)}_{i=1}^n
α ← 0
while there exists (xi, yi) such that yi (Σ_{j=1}^n αj yj 〈xj, xi〉) ≤ 0 do
    αi ← αi + 1
end while
The kernel trick
Nonlinearly separable dataset S = {(x1, y1), . . . , (xn, yn)}
Idea to learn a nonlinear classifier:
- choose a (nonlinear) mapping φ

  φ : X → H
      x ↦ φ(x)

  where H is an inner product space (inner product 〈·, ·〉_H), called the feature space
- find a linear classifier (i.e. a separating hyperplane) in H to classify {(φ(x1), y1), . . . , (φ(xn), yn)}
- to classify a test point x, consider φ(x)
The kernel trick
Linearly classifying in feature space

[Figure: the mapping φ sends the input space X to the feature space H]

Taking the previous linear algorithm and implementing it in H:

h(x) = Σ_{i=1,...,n} αi 〈φ(xi), φ(x)〉_H + b
The kernel trick
Mercer Kernels
The kernel trick can be applied if there is a function k : X × X → R such that k(u, v) = 〈φ(u), φ(v)〉_H.
If so, all occurrences of 〈φ(xi), φ(x)〉_H are syntactically replaced by k(xi, x).
- Keypoint: the emphasis is sometimes (often) more on k than on φ
- Kernels must verify Mercer's property to be valid kernels
  - this ensures that there exist a space H and a mapping φ : X → H such that k(u, v) = 〈φ(u), φ(v)〉_H
  - however, non-valid kernels have been used with success
  - and research is in progress on using non-semi-definite kernels
- k may be viewed as a similarity measure
The kernel trick
[Figure: no separating hyperplane in input space X; one possible separating hyperplane in feature space H; the corresponding separating surface back in X]

Kernel trick recipe
- choose a linear classification algorithm (expressed in terms of 〈·, ·〉)
- replace all occurrences of 〈·, ·〉 by a kernel k(·, ·)

Obtained classifier:

f(x) = sign( Σ_{i=1,...,n} αi k(xi, x) + b )
Kernel Perceptron
- Replace 〈xj, xi〉 by k(xj, xi)

Algorithm: Kernel Perceptron
Input: S = {(xi, yi)}_{i=1}^n
α ← 0
while there exists (xi, yi) such that yi (Σ_{j=1}^n αj yj k(xj, xi)) ≤ 0 do
    αi ← αi + 1
end while
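The kernelized algorithm can solve problems the primal perceptron cannot, such as XOR. A NumPy sketch (function names, kernel choice, and toy data are ours, for illustration):

```python
import numpy as np

def kernel_perceptron(X, y, kernel, max_epochs=100):
    """Dual/kernel perceptron: learn one coefficient alpha_i per training point.

    kernel(u, v) must return <phi(u), phi(v)>_H. Prediction on x is
    sign(sum_j alpha_j y_j k(x_j, x)).
    """
    n = len(X)
    alpha = np.zeros(n)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])  # Gram matrix
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * np.sum(alpha * y * K[:, i]) <= 0:  # point i misclassified
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

def rbf(u, v, sigma2=1.0):
    # Gaussian/RBF kernel k(u, v) = exp(-||u - v||^2 / (2 sigma^2))
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma2))

# XOR: not linearly separable in input space, but separable with an RBF kernel
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, 1, 1, -1])
alpha = kernel_perceptron(X, y, rbf)
pred = np.sign([np.sum(alpha * y * np.array([rbf(xj, x) for xj in X])) for x in X])
assert np.all(pred == y)
```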
Common kernels
Gaussian/RBF kernel
- k(u, v) = exp(−‖u − v‖² / (2σ²)), σ² > 0
- the corresponding H is of infinite dimension

Polynomial kernel
- k(u, v) = (〈u, v〉 + c)^d, c ∈ R, d ∈ N
- a corresponding analytic φ may be constructed (see below)

Tangent kernel (it is not a Mercer kernel)
- k(u, v) = tanh(a〈u, v〉 + c), a, c ∈ R
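These three kernels are a few lines each in NumPy (function names and default parameter values are our choices, for illustration):

```python
import numpy as np

def gaussian_kernel(u, v, sigma2=1.0):
    # k(u, v) = exp(-||u - v||^2 / (2 sigma^2)), sigma^2 > 0
    return np.exp(-np.sum((np.asarray(u) - np.asarray(v)) ** 2) / (2 * sigma2))

def polynomial_kernel(u, v, c=0.0, d=2):
    # k(u, v) = (<u, v> + c)^d
    return (np.dot(u, v) + c) ** d

def tanh_kernel(u, v, a=1.0, c=0.0):
    # k(u, v) = tanh(a <u, v> + c) -- not a Mercer kernel in general
    return np.tanh(a * np.dot(u, v) + c)

u, v = np.array([1.0, 2.0]), np.array([3.0, 1.0])
print(gaussian_kernel(u, v))    # exp(-5/2), since ||u - v||^2 = 5
print(polynomial_kernel(u, v))  # (<u, v>)^2 = 5^2 = 25
```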
Common kernels
Polynomial kernel k(u, v) = 〈u, v〉²_{R²}
Polynomial kernel with c = 0 and d = 2, defined on R² × R²
- Consider the mapping:

  φ : R² → R³
      x = [x1, x2]⊤ ↦ φ(x) = [x1², √2 x1x2, x2²]⊤

- We have, for u, v ∈ R²:

  〈φ(u), φ(v)〉_{R³} = 〈[u1², √2 u1u2, u2²]⊤, [v1², √2 v1v2, v2²]⊤〉
                    = (u1v1 + u2v2)²
                    = 〈u, v〉²_{R²}
                    = k(u, v)
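The identity above is easy to check numerically (the feature map below is exactly the φ of the slide; the random test points are ours):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel on R^2:
    # phi(x) = [x1^2, sqrt(2) x1 x2, x2^2]
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

rng = np.random.default_rng(0)
for _ in range(5):
    u, v = rng.normal(size=2), rng.normal(size=2)
    # <phi(u), phi(v)>_{R^3} must equal (<u, v>_{R^2})^2
    assert np.isclose(phi(u) @ phi(v), np.dot(u, v) ** 2)
```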
(Kernel) Gram matrices
Gram matrix
Let k : X × X → R be a kernel. For a set of patterns S = {x1, . . . , xn},

K_S = [ k(x1, x1)  k(x1, x2)  · · ·  k(x1, xn)
        k(x1, x2)  k(x2, x2)  · · ·  k(x2, xn)
          · · ·                        · · ·
        k(x1, xn)  k(x2, xn)  · · ·  k(xn, xn) ]

is the Gram matrix of k with respect to S.

Mercer's property
Let k : X × X → R be a symmetric function. k is a Mercer kernel ⇔

∀S = {x1, . . . , xn}, xi ∈ X :  v⊤ K_S v ≥ 0, ∀v ∈ R^n

(and, therefore, there exists φ such that k(u, v) = 〈φ(u), φ(v)〉)
(Kernel) Gram matrices
For any Mercer kernel k and any set of patterns S, the Gram matrix K_S has only nonnegative eigenvalues.

k1 and k2 being Mercer kernels, we have:
- k1^p, p ∈ N is a Mercer kernel
- λk1 + γk2, λ, γ > 0 is a Mercer kernel
- k1 k2 is a Mercer kernel
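The eigenvalue characterization can be checked numerically: build a Gram matrix for a Mercer kernel and verify all its eigenvalues are nonnegative (up to floating-point error). A sketch with the RBF kernel on random data (function names and data are ours):

```python
import numpy as np

def gram_matrix(X, kernel):
    # K_S[i, j] = k(x_i, x_j)
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

def rbf(u, v, sigma2=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma2))

X = np.random.default_rng(1).normal(size=(10, 3))
K = gram_matrix(X, rbf)
eigvals = np.linalg.eigvalsh(K)    # eigenvalues of the symmetric matrix K
assert np.allclose(K, K.T)         # Gram matrix of a symmetric kernel
assert np.all(eigvals >= -1e-10)   # nonnegative, up to numerical error
```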
Support Vector Machines (SVM)
- Maximize the margin, which is equal to 2/‖w‖
- under the constraint that the samples are classified correctly:
  a. w⊤xn ≥ 1 if yn = 1
  b. w⊤xn ≤ −1 if yn = −1
  ∀n = 1, . . . , N
Support Vector Classification
A breakthrough in machine learning
- Positive definite kernels
- Large margin classification
- Convex optimization (quadratic programming)
- Statistical learning theory

[Figure: separating hyperplane w·x + b = 0, margin 2/‖w‖, support vectors]
Hard margin linear SVM
Setting
- Input space X with dot product ·
- Target space Y = {−1, +1}
- S = {(xi, yi)}_{i=1}^n linearly separable training set

Optimal hyperplane
Find the separating hyperplane with maximum margin, i.e. the one with maximal distance to the closest data.
Hard margin linear SVM and convex optimization
Definition (Canonical hyperplane wrt S)
A hyperplane w · x + b = 0 is canonical wrt S if min_{xi∈S} |w · xi + b| = 1.
(Simple geometry:) the margin for a canonical separating hyperplane is equal to 2/‖w‖.
Hard margin linear SVM and convex optimization
Primal problem (recall that margin = 2/‖w‖)

min_{w,b}  ½‖w‖²
s.t.  w · xi + b ≥ +1 if yi = 1
      w · xi + b ≤ −1 otherwise
Hard margin linear SVM and convex optimization

Primal problem (recall that margin = 2/‖w‖)

min_{w,b}  ½‖w‖²
s.t.  w · xi + b ≥ +1 if yi = 1
      w · xi + b ≤ −1 otherwise

Equivalently:

min_{w,b}  ½‖w‖²
s.t.  yi [w · xi + b] ≥ +1, ∀i
Hard margin linear SVM and convex optimization
Introducing Lagrange multipliers
The solution w, b can be found by solving the following problem:

min_{w,b} max_{α≥0} L(w, b, α)
with L(w, b, α) := ½‖w‖² − Σ_{i=1}^n αi [yi (w · xi + b) − 1]

The αi's (≥ 0) are Lagrange multipliers, one per constraint.

Another formulation of the constrained optimization problem
- if a constraint is violated, i.e., yi (w · xi + b) − 1 < 0 for some i, then the value of the objective function max_{α≥0} L(w, b, α) is +∞ (for αi → +∞); so optimal w and b necessarily verify yi (w · xi + b) − 1 ≥ 0
- also, if yi (w · xi + b) − 1 > 0 for some i, then αi = 0: again, just look at the function max_{α≥0} L(w, b, α) → at the solution, αi [yi (w · xi + b) − 1] = 0 (KKT conditions)
Hard margin linear SVM and convex optimization
Switching the min and the max
A theorem of convex optimization (see, e.g. [?]), exploiting the fact that L is convex wrt w and b and concave wrt α, gives

min_{w,b} max_{α≥0} L(w, b, α) = max_{α≥0} min_{w,b} L(w, b, α)

with the same optimal points.

Making the gradient be 0

min_{w,b} L(w, b, α) = min_{w,b} ½‖w‖² − Σ_{i=1}^n αi [yi (w · xi + b) − 1]

is an unconstrained strictly convex (and coercive) optimization problem. It suffices to set the gradient of the functional to 0 to get the solution:
- ∇_w L(w, b, α) = w − Σ_i αi yi xi = 0  ⇒  w = Σ_i αi yi xi
- ∇_b L(w, b, α) = −Σ_i yi αi = 0  ⇒  Σ_i yi αi = 0
Hard margin linear SVM and convex optimization
A dual quadratic program
Plugging in the value of w and the constraint provides the following problem:

max_α  Σ_{i=1}^n αi − ½ Σ_{i,j=1}^n yi yj αi αj xi · xj
s.t.   Σ_{i=1}^n αi yi = 0 and α ≥ 0
On the dual formulation and support vectors
The QP

max_α  Σ_{i=1}^n αi − ½ Σ_{i,j=1}^n yi yj αi αj xi · xj
s.t.   Σ_{i=1}^n αi yi = 0 and α ≥ 0

- As many optimization variables as the number of training data
- Convex quadratic program
- Only dot products appear (the kernel trick will strike soon)
- b* is found through the KKT conditions
- w* = Σ_{i=1}^n αi* yi xi, and

f(x) = Σ_{i=1}^n αi* yi xi · x + b*
On the dual formulation and support vectors

Support vectors

[Figure: support vectors sitting on the margin of the hyperplane w·x + b = 0]

The support vectors are those points for which αi* > 0; they "support" the margin.
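In practice the dual QP is handed to a dedicated solver; as a rough illustration, the sketch below drops the bias b (which removes the equality constraint Σ αi yi = 0) and runs projected gradient ascent on the remaining problem. Function names, step size, and toy data are our assumptions, not the course's method:

```python
import numpy as np

def svm_dual_no_bias(X, y, lr=0.01, n_iter=2000):
    """Projected gradient ascent on the hard-margin dual, bias dropped.

    Maximizes  sum_i alpha_i - 1/2 sum_ij y_i y_j alpha_i alpha_j <x_i, x_j>
    subject to alpha >= 0 (no equality constraint since b is omitted).
    """
    n = len(X)
    Q = (y[:, None] * y[None, :]) * (X @ X.T)       # Q_ij = y_i y_j <x_i, x_j>
    alpha = np.zeros(n)
    for _ in range(n_iter):
        grad = 1.0 - Q @ alpha                      # gradient of the dual objective
        alpha = np.maximum(0.0, alpha + lr * grad)  # ascent step + projection on alpha >= 0
    w = (alpha * y) @ X                             # w* = sum_i alpha_i* y_i x_i
    return alpha, w

# Toy linearly separable data
X = np.array([[2.0, 1.0], [1.5, 2.0], [-2.0, -1.0], [-1.5, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, w = svm_dual_no_bias(X, y)
assert np.all(np.sign(X @ w) == y)   # training points correctly classified
```

Only a subset of the αi end up strictly positive: those are the support vectors.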
Soft margin linear SVM
Presence of outliers

[Figure: outliers lying inside the margin of the hyperplane w·x + b = 0]

Slack variables - 1-norm

min_{w,b,ξ≥0}  ½‖w‖² + C Σ_{i=1}^n ξi
s.t.  yi [w · xi + b] ≥ 1 − ξi

with C > 0. Dual (using the same machinery as before):

max_α  Σ_{i=1}^n αi − ½ Σ_{i,j=1}^n yi yj αi αj xi · xj
s.t.   Σ_{i=1}^n αi yi = 0 and 0 ≤ α ≤ C
Soft margin linear SVM
Presence of outliers

Slack variables - 2-norm

min_{w,b,ξ}  ½‖w‖² + C Σ_{i=1}^n ξi²
s.t.  yi [w · xi + b] ≥ 1 − ξi

with C > 0. Dual (using the same machinery as before):

max_α  Σ_{i=1}^n αi − ½ Σ_{i,j=1}^n yi yj αi αj (xi · xj + δij/C)
s.t.   Σ_{i=1}^n αi yi = 0 and α ≥ 0
Soft margin linear SVM
Presence of outliers

Unconstrained 1-norm primal form

min_{w,b}  ½‖w‖² + C Σ_{i=1}^n |1 − yi (w · xi + b)|_+ , where |θ|_+ = max(θ, 0)

convex, non-differentiable

Unconstrained 2-norm primal form

min_{w,b}  ½‖w‖² + C Σ_{i=1}^n |1 − yi (w · xi + b)|²_+

convex, differentiable
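Because the unconstrained 1-norm form is convex (though non-differentiable at the hinge), a subgradient method applies directly. A minimal NumPy sketch, not an optimized solver (function name, step size, and toy data are ours):

```python
import numpy as np

def svm_subgradient(X, y, C=1.0, lr=0.01, n_iter=2000):
    """Subgradient descent on 1/2 ||w||^2 + C sum_i |1 - y_i (<w, x_i> + b)|_+."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        margins = y * (X @ w + b)
        viol = margins < 1                 # points violating the margin
        # subgradient of the objective wrt w and b
        gw = w - C * (y[viol] @ X[viol])
        gb = -C * np.sum(y[viol])
        w -= lr * gw
        b -= lr * gb
    return w, b

X = np.array([[2.0, 2.0], [1.0, 2.0], [-2.0, -2.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = svm_subgradient(X, y)
assert np.all(np.sign(X @ w + b) == y)
```

The 2-norm form is differentiable, so plain gradient descent (replacing the hinge by its square) works the same way.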
Kernel SVM for linearly inseparable data
- Project the data into a high-dimensional feature space using a kernel function
- Apply a linear classifier in the feature space
Nonlinear SVM: tricking SVMs with kernels
1-norm and 2-norm soft-margin SVM

1-norm:
max_α  Σ_{i=1}^n αi − ½ Σ_{i,j=1}^n yi yj αi αj k(xi, xj)
s.t.   Σ_{i=1}^n αi yi = 0 and 0 ≤ α ≤ C

2-norm:
max_α  Σ_{i=1}^n αi − ½ Σ_{i,j=1}^n yi yj αi αj (k(xi, xj) + δij/C)
s.t.   Σ_{i=1}^n αi yi = 0 and α ≥ 0

Classifier output: f(x) = Σ_{i=1}^n αi* yi k(xi, x) + b*
- The sizes of the problems scale with the number of data
- Efficient methods exist to solve these problems (exploiting the sparsity of the solution)
- k and C are two hyperparameters that need to be chosen adequately
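Putting the pieces together, the kernelized soft-margin dual can be attacked with the same simple projected-ascent idea as before (again dropping the bias b for simplicity, which removes the equality constraint; solver, step size, and data are our illustrative assumptions). With an RBF kernel it separates XOR:

```python
import numpy as np

def kernel_svm_no_bias(K, y, C=10.0, lr=0.01, n_iter=5000):
    """Projected gradient ascent on the kernelized 1-norm soft-margin dual
    (bias dropped): max_alpha sum_i alpha_i - 1/2 sum_ij y_i y_j alpha_i alpha_j K_ij
    subject to 0 <= alpha <= C."""
    n = len(y)
    Q = (y[:, None] * y[None, :]) * K
    alpha = np.zeros(n)
    for _ in range(n_iter):
        # ascent step, then projection onto the box [0, C]
        alpha = np.clip(alpha + lr * (1.0 - Q @ alpha), 0.0, C)
    return alpha

def rbf(u, v, sigma2=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma2))

# XOR: no separating hyperplane in input space
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0, -1.0])
K = np.array([[rbf(a, b) for b in X] for a in X])
alpha = kernel_svm_no_bias(K, y)
f = K @ (alpha * y)           # f(x_i) = sum_j alpha_j y_j k(x_j, x_i)
assert np.all(np.sign(f) == y)
```

In practice one would use an off-the-shelf SVM solver and tune σ² (or the kernel parameters) and C, e.g. by cross-validation.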