Sciences des Données - Séparateurs linéaires et Machines à Vecteurs de Support


Transcript of Sciences des Données - Séparateurs linéaires et Machines à Vecteurs de Support

Page 1: Sciences des Données - Séparateurs linéaires et Machines à ...

Sciences des Données - Séparateurs linéaires et Machines à Vecteurs de Support

Hachem Kadri

Aix-Marseille University, CNRS
Laboratoire d'Informatique et des Systèmes, LIS

QARMA team
https://qarma.lis-lab.fr/

Page 2: Sciences des Données - Séparateurs linéaires et Machines à ...

Supervised Learning

Generalize from training to testing

Page 3: Sciences des Données - Séparateurs linéaires et Machines à ...

Supervised Learning – Generalization

From a training set consisting of randomly sampled (input, target) pairs, learn a function (a predictor) that predicts well the target of new data.

Supervised learning / Generalization

−→ Given n training examples (x_1, y_1), . . . , (x_n, y_n) ∈ (X × Y) and u test data x_{n+1}, . . . , x_{n+u} ∈ X

−→ Learn f : X → Y to generalize from training to testing

Page 4: Sciences des Données - Séparateurs linéaires et Machines à ...

Decision Trees reminder

Should we play tennis or not?

Play  Wind  Humidity  Outlook
No    Low   High      Sunny
No    High  Normal    Rain
Yes   Low   High      Overcast
Yes   Weak  Normal    Rain
Yes   Low   Normal    Sunny
Yes   Low   Normal    Overcast
Yes   High  Normal    Sunny

Page 5: Sciences des Données - Séparateurs linéaires et Machines à ...

Decision Trees reminder

- Breaking down our data by making decisions based on asking a series of questions

Page 6: Sciences des Données - Séparateurs linéaires et Machines à ...

K-nearest Neighbors reminder

I Choose the number of k and a distance metric

I Find the k-nearest neighbors of the sample to be classified

I Assign the class label by majority vote
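
As a reminder in code, here is a minimal k-NN sketch in NumPy; the toy dataset, the value k = 3, and the Euclidean metric are arbitrary choices for the illustration:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Euclidean distances from x to every training sample
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest neighbors
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbors' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data (hypothetical): two classes in R^2
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([-1, -1, +1, +1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))  # -> 1
```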

Page 7: Sciences des Données - Séparateurs linéaires et Machines à ...

Linear Regression reminder

- Model the relation between features (explanatory variable x) and a continuous-valued response (target variable y)

- Linear model:

  y = w_0 + w_1 x^(1) + . . . + w_d x^(d) = Σ_{i=1}^d w_i x^(i) = ⟨w, x⟩
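
For illustration, a minimal least-squares sketch in NumPy that fits w_0 and w_1 on a hypothetical 1-D dataset (the numbers are made up):

```python
import numpy as np

# Hypothetical 1-D example: fit y = w0 + w1 * x by ordinary least squares
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 7.1])

# Design matrix with a constant column so that w0 plays the role of the intercept
X = np.column_stack([np.ones_like(x), x])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)       # [w0, w1]
print(X @ w)   # predictions <w, x> on the training inputs
```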

Page 8: Sciences des Données - Séparateurs linéaires et Machines à ...

Focus: linear classification

Important notions in learning to classify
- a number of training data (pictures, emails, etc.)
- a learning algorithm (how to build the classifier?)
- generalization: the classifier should correctly classify test data

Quick formalization
- X (e.g. R^d, d > 0) is the space of data, called the input space
- Y (e.g. toxic/not toxic, or {−1, +1}) is the target space
- f : X → Y is the classifier

[Figure: points of class +1 and class −1 in X = R², separated by the decision boundary f(x) = 0]

Page 9: Sciences des Données - Séparateurs linéaires et Machines à ...

Perceptron (Rosenblatt, 1958)

Inspiration: biological neural network

Motivations:

- Learning system composed by associating simple processing units

- Efficiency, scalability, and adaptability

Perceptron: a linear classifier, X = R^d, Y = {−1, +1}

[Figure: perceptron diagram - inputs x_1, x_2 (x = [x_1, x_2]) weighted by w_1, w_2, plus a bias unit (activation = 1) with weight w_0, feeding the output σ(Σ_{i=1}^d w_i x_i + w_0)]

Page 10: Sciences des Données - Séparateurs linéaires et Machines à ...

Perceptron (Rosenblatt, 1958)

A linear classifier, X = R^d, Y = {−1, +1}

- Classifier weights: w ∈ R^d

- Classifier prediction: f(x) = sign⟨w, x⟩
- Question: how to learn w from the training data?

Algorithm: Perceptron
  Input: S = {(x_i, y_i)}_{i=1}^n
  w ← 0
  while there exists (x_i, y_i) such that y_i ⟨w, x_i⟩ ≤ 0 do
      w ← w + y_i x_i
  end while
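
A direct NumPy transcription of this loop might look as follows; the max_epochs cap is an addition (not in the pseudo-code) so the sketch terminates even on non-separable data:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron: X is (n, d), y in {-1, +1}. Returns the weight vector w."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        updated = False
        for xi, yi in zip(X, y):
            # A sample is misclassified (or on the boundary) when yi * <w, xi> <= 0
            if yi * np.dot(w, xi) <= 0:
                w = w + yi * xi          # update rule from the algorithm above
                updated = True
        if not updated:                  # no mistake made: all constraints satisfied
            return w
    return w
```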

Page 11: Sciences des Données - Séparateurs linéaires et Machines à ...

Perceptron in action

Page 12: Sciences des Données - Séparateurs linéaires et Machines à ...

Perceptron in action

Page 13: Sciences des Données - Séparateurs linéaires et Machines à ...

Perceptron in action

Page 14: Sciences des Données - Séparateurs linéaires et Machines à ...

Perceptron in action

Page 15: Sciences des Données - Séparateurs linéaires et Machines à ...

Perceptron

1. Initialize the weights w to 0 or small random numbers

2. For each training sample xi :

a. Compute the predicted output ŷ_i = sign(w^T x_i)

b. Update the weights if ŷ_i ≠ y_i

Page 16: Sciences des Données - Séparateurs linéaires et Machines à ...

Perceptron: limitations

Theorem (XOR, Minsky & Papert, 1969)
The perceptron algorithm cannot solve the XOR problem (the XOR data is not linearly separable).

Page 17: Sciences des Données - Séparateurs linéaires et Machines à ...

Perceptron: dual-form

- The solution is written as a linear combination of the training data:

  w = Σ_{j=1}^n α_j y_j x_j

Algorithm: Dual-Form Perceptron
  Input: S = {(x_i, y_i)}_{i=1}^n
  α ← 0
  while there exists (x_i, y_i) such that y_i (Σ_{j=1}^n α_j y_j ⟨x_j, x_i⟩) ≤ 0 do
      α_i ← α_i + 1
  end while
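
A sketch of the same loop in its dual form, keeping one coefficient α_i per training point and touching the data only through dot products (again with an added iteration cap):

```python
import numpy as np

def dual_perceptron(X, y, max_epochs=100):
    """Dual-form perceptron: returns the coefficients alpha (one per sample)."""
    n = X.shape[0]
    G = X @ X.T                      # Gram matrix of dot products <x_j, x_i>
    alpha = np.zeros(n)
    for _ in range(max_epochs):
        updated = False
        for i in range(n):
            # margin of x_i: y_i * sum_j alpha_j y_j <x_j, x_i>
            if y[i] * np.sum(alpha * y * G[:, i]) <= 0:
                alpha[i] += 1
                updated = True
        if not updated:
            return alpha
    return alpha

# The primal weights are recovered as w = sum_j alpha_j y_j x_j, e.g. w = (alpha * y) @ X
```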

Page 18: Sciences des Données - Séparateurs linéaires et Machines à ...

The kernel trick

Nonlinearly separable dataset S = {(x_1, y_1), . . . , (x_n, y_n)}
Idea to learn a nonlinear classifier:

- choose a (nonlinear) mapping φ : X → H, x ↦ φ(x), where H is an inner product space (with inner product ⟨·, ·⟩_H), called the feature space

- find a linear classifier (i.e. a separating hyperplane) in H to classify {(φ(x_1), y_1), . . . , (φ(x_n), y_n)}

- to classify a test point x, consider φ(x)

Page 19: Sciences des Données - Séparateurs linéaires et Machines à ...

The kernel trick

Linearly classifying in feature space

[Figure: mapping φ from the input space X to the feature space H]

Taking the previous linear algorithm and implementing it in H:

  h(x) = Σ_{i=1,...,n} α_i ⟨φ(x_i), φ(x)⟩_H + b

Page 20: Sciences des Données - Séparateurs linéaires et Machines à ...

The kernel trick

Mercer kernels
The kernel trick can be applied if there is a function k : X × X → R such that k(u, v) = ⟨φ(u), φ(v)⟩_H.
If so, all occurrences of ⟨φ(x_i), φ(x)⟩_H are syntactically replaced by k(x_i, x).

- Key point: the emphasis is sometimes (often) more on k than on φ

- Kernels must satisfy Mercer's property to be valid kernels
  - this ensures that there exists a space H and a mapping φ : X → H such that k(u, v) = ⟨φ(u), φ(v)⟩_H
  - however, non-valid kernels have been used with success
  - and research is in progress on using non-positive-semi-definite kernels

- k might be viewed as a similarity measure

Page 21: Sciences des Données - Séparateurs linéaires et Machines à ...

The kernel trick

[Figure: no separating hyperplane in the input space X; one possible separating hyperplane in the feature space H; the corresponding separating surface back in the input space X]

Kernel trick recipe
- choose a linear classification algorithm (expressed in terms of ⟨·, ·⟩)
- replace all occurrences of ⟨·, ·⟩ by a kernel k(·, ·)

Obtained classifier:

  f(x) = sign( Σ_{i=1,...,n} α_i k(x_i, x) + b )

Page 22: Sciences des Données - Séparateurs linéaires et Machines à ...

Kernel Perceptron

- Replace ⟨x_j, x_i⟩ by k(x_j, x_i)

Algorithm: Dual-Form Perceptron
  Input: S = {(x_i, y_i)}_{i=1}^n
  α ← 0
  while there exists (x_i, y_i) such that y_i (Σ_{j=1}^n α_j y_j k(x_j, x_i)) ≤ 0 do
      α_i ← α_i + 1
  end while
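
The kernelized version only changes how the Gram matrix is built; a sketch that takes a precomputed kernel matrix K (any of the kernels on the next slide could be plugged in):

```python
import numpy as np

def kernel_perceptron(K, y, max_epochs=100):
    """Kernel perceptron on a precomputed Gram matrix K[i, j] = k(x_i, x_j)."""
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(max_epochs):
        updated = False
        for i in range(n):
            # margin of x_i: y_i * sum_j alpha_j y_j k(x_j, x_i)
            if y[i] * np.sum(alpha * y * K[:, i]) <= 0:
                alpha[i] += 1
                updated = True
        if not updated:
            break
    return alpha

# Prediction on a new point x: sign(sum_j alpha_j y_j k(x_j, x))
```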

Page 23: Sciences des Données - Séparateurs linéaires et Machines à ...

Common kernels

Gaussian/RBF kernel

- k(u, v) = exp(−‖u − v‖² / (2σ²)), σ² > 0

- the corresponding H is of infinite dimension

Polynomial kernel
- k(u, v) = (⟨u, v⟩ + c)^d, c ∈ R, d ∈ N
- a corresponding analytic φ may be constructed (see below)

Tangent kernel (it is not a Mercer kernel)

- k(u, v) = tanh(a⟨u, v⟩ + c), a, c ∈ R
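
These three kernels can be written down directly; a NumPy sketch with arbitrary default parameter values:

```python
import numpy as np

def gaussian_kernel(u, v, sigma2=1.0):
    """RBF kernel k(u, v) = exp(-||u - v||^2 / (2 sigma^2))."""
    return np.exp(-np.linalg.norm(u - v) ** 2 / (2.0 * sigma2))

def polynomial_kernel(u, v, c=0.0, d=2):
    """Polynomial kernel k(u, v) = (<u, v> + c)^d."""
    return (np.dot(u, v) + c) ** d

def tangent_kernel(u, v, a=1.0, c=0.0):
    """tanh kernel k(u, v) = tanh(a <u, v> + c) -- not a Mercer kernel in general."""
    return np.tanh(a * np.dot(u, v) + c)
```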

Page 24: Sciences des Données - Séparateurs linéaires et Machines à ...

Common kernels

Polynomial kernel k(u, v) = ⟨u, v⟩²_{R²}

Polynomial kernel with c = 0 and d = 2, defined on R² × R²

- Consider the mapping:

  φ : R² → R³
  x = [x_1, x_2]^T ↦ φ(x) = [x_1², √2 x_1 x_2, x_2²]^T

- We have, for u, v ∈ R²:

  ⟨φ(u), φ(v)⟩_{R³} = ⟨[u_1², √2 u_1 u_2, u_2²]^T, [v_1², √2 v_1 v_2, v_2²]^T⟩
                    = (u_1 v_1 + u_2 v_2)²
                    = ⟨u, v⟩²_{R²}
                    = k(u, v)
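
A quick numerical check of this identity on arbitrary vectors u, v (the values are chosen only for the example):

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on R^2."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

lhs = np.dot(phi(u), phi(v))   # <phi(u), phi(v)> in R^3
rhs = np.dot(u, v) ** 2        # k(u, v) = <u, v>^2 in R^2
print(lhs, rhs)                # both equal 1.0 here: (1*3 + 2*(-1))^2 = 1
```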

Page 25: Sciences des Données - Séparateurs linéaires et Machines à ...

(Kernel) Gram matrices

Gram matrix
Let k : X × X → R be a kernel. For a set of patterns S = {x_1, . . . , x_n},

  K_S = [ k(x_1, x_1)  k(x_1, x_2)  · · ·  k(x_1, x_n)
          k(x_2, x_1)  k(x_2, x_2)  · · ·  k(x_2, x_n)
          · · ·
          k(x_n, x_1)  k(x_n, x_2)  · · ·  k(x_n, x_n) ]

is the Gram matrix of k with respect to S.

Mercer's property
Let k : X × X → R be a symmetric function.
k is a Mercer kernel ⇔ for all S = {x_1, . . . , x_n}, x_i ∈ X: v^T K_S v ≥ 0 for all v ∈ R^n
(and, therefore, there exists φ such that k(u, v) = ⟨φ(u), φ(v)⟩)
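
In practice Mercer's property can be checked numerically on a given sample S by verifying that the Gram matrix has no negative eigenvalues; a sketch with a Gaussian kernel and random patterns:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(5, 2))          # arbitrary set of patterns in R^2

def gaussian_kernel(u, v, sigma2=1.0):
    return np.exp(-np.linalg.norm(u - v) ** 2 / (2.0 * sigma2))

# Gram matrix K_S[i, j] = k(x_i, x_j)
K = np.array([[gaussian_kernel(xi, xj) for xj in S] for xi in S])

# Mercer's property in practice: K_S is symmetric positive semi-definite,
# so all its eigenvalues are >= 0 (up to numerical precision)
print(np.linalg.eigvalsh(K))
```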

Page 26: Sciences des Données - Séparateurs linéaires et Machines à ...

(Kernel) Gram matrices

For any Mercer kernel k and any set of patterns S, the Gram matrix K_S has only nonnegative eigenvalues.

If k_1 and k_2 are Mercer kernels, then:
- k_1^p, p ∈ N, is a Mercer kernel
- λ k_1 + γ k_2, λ, γ > 0, is a Mercer kernel
- k_1 k_2 is a Mercer kernel

Page 27: Sciences des Données - Séparateurs linéaires et Machines à ...

Support Vector Machines (SVM)

- Maximize the margin, which is equal to 2/‖w‖

- under the constraint that the samples are classified correctly:

  a. w^T x_n ≥ +1 if y_n = +1
  b. w^T x_n ≤ −1 if y_n = −1

  for all n = 1, . . . , N

Page 28: Sciences des Données - Séparateurs linéaires et Machines à ...

Support Vector Classification

A breakthrough in machine learning
- Positive definite kernels
- Large margin classification
- Convex optimization (quadratic programming)
- Statistical learning theory

[Figure: separating hyperplane w·x + b = 0, support vectors, margin = 2/‖w‖]

Page 29: Sciences des Données - Séparateurs linéaires et Machines à ...

Hard margin linear SVM

Setting
- Input space X with dot product ·
- Target space Y = {−1, +1}
- S = {(x_i, y_i)}_{i=1}^n linearly separable training set

Optimal hyperplane
Find the separating hyperplane with maximum margin, i.e. the one with maximal distance to the closest data points.

[Figure: separating hyperplane w·x + b = 0 and its margin]

Page 30: Sciences des Données - Séparateurs linéaires et Machines à ...

Hard margin linear SVM and convex optimization

[Figure: separating hyperplane w·x + b = 0, support vectors, margin = 2/‖w‖]

Definition (Canonical hyperplane wrt S)
A hyperplane w · x + b = 0 is canonical wrt S if min_{x_i ∈ S} |w · x_i + b| = 1.

(High-school geometry:) the margin of a canonical separating hyperplane is equal to 2/‖w‖

Page 31: Sciences des Données - Séparateurs linéaires et Machines à ...

Hard margin linear SVM and convex optimization

[Figure: separating hyperplane w·x + b = 0, support vectors, margin = 2/‖w‖]

Primal problem (recall that margin = 2/‖w‖)

  min_{w,b}  (1/2)‖w‖²
  s.t.  w · x_i + b ≥ +1 if y_i = +1
        w · x_i + b ≤ −1 otherwise

Page 32: Sciences des Données - Séparateurs linéaires et Machines à ...

Hard margin linear SVM and convex optimization

[Figure: separating hyperplane w·x + b = 0, support vectors, margin = 2/‖w‖]

Primal problem (recall that margin = 2/‖w‖)

  min_{w,b}  (1/2)‖w‖²
  s.t.  w · x_i + b ≥ +1 if y_i = +1
        w · x_i + b ≤ −1 otherwise

or, equivalently,

  min_{w,b}  (1/2)‖w‖²
  s.t.  y_i [w · x_i + b] ≥ +1

Page 33: Sciences des Données - Séparateurs linéaires et Machines à ...

Hard margin linear SVM and convex optimization

Introducing Lagrange multipliers
The solution w, b can be found by solving the following problem

  min_{w,b} max_{α≥0} L(w, b, α)

with

  L(w, b, α) := (1/2)‖w‖² − Σ_{i=1}^n α_i [y_i (w · x_i + b) − 1]

The α_i's (≥ 0) are Lagrange multipliers, one per constraint.

Another formulation of the constrained optimization problem
- if a constraint is violated, i.e. y_i (w · x_i + b) − 1 < 0 for some i, then the value of the objective function max_{α≥0} L(w, b, α) is +∞ (for α_i → +∞), so optimal w and b necessarily satisfy y_i (w · x_i + b) − 1 ≥ 0
- also, if y_i (w · x_i + b) − 1 > 0 for some i, then α_i = 0: again, just look at the function max_{α≥0} L(w, b, α) → at the solution, α_i [y_i (w · x_i + b) − 1] = 0 (KKT conditions)

Page 34: Sciences des Données - Séparateurs linéaires et Machines à ...

Hard margin linear SVM and convex optimization

Switching the min and the max
A theorem of convex optimization (see, e.g., [?]), exploiting the fact that L is convex wrt w and b and concave wrt α, gives

  min_{w,b} max_{α≥0} L(w, b, α) = max_{α≥0} min_{w,b} L(w, b, α)

with the same optimal points.

Making the gradient equal to 0

  min_{w,b} L(w, b, α) = min_{w,b} (1/2)‖w‖² − Σ_{i=1}^n α_i [y_i (w · x_i + b) − 1]

is an unconstrained strictly convex (and coercive) optimization problem. It suffices to set the gradient of the functional to 0 to get the solution.

- ∇_w L(w, b, α) = w − Σ_i α_i y_i x_i = 0  ⇒  w = Σ_i α_i y_i x_i
- ∇_b L(w, b, α) = −Σ_i y_i α_i = 0  ⇒  Σ_i y_i α_i = 0

Page 35: Sciences des Données - Séparateurs linéaires et Machines à ...

Hard margin linear SVM and convex optimization

Switching the min and the max
A theorem of convex optimization (see, e.g., [?]), exploiting the fact that L is convex wrt w and b and concave wrt α, gives

  min_{w,b} max_{α≥0} L(w, b, α) = max_{α≥0} min_{w,b} L(w, b, α)

with the same optimal points.

A dual quadratic program
Plugging in the value of w and the constraint provides the following problem

  max_α  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n y_i y_j α_i α_j x_i · x_j
  s.t.   Σ_{i=1}^n α_i y_i = 0  and  α ≥ 0

Page 36: Sciences des Données - Séparateurs linéaires et Machines à ...

On the dual formulation and support vectors

The QP

  max_α  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n y_i y_j α_i α_j x_i · x_j
  s.t.   Σ_{i=1}^n α_i y_i = 0  and  α ≥ 0

- As many optimization variables as the number of training data
- Convex quadratic program
- Only dot products appear (the kernel trick will strike soon)
- b* is found through the KKT conditions
- w* = Σ_{i=1}^n α_i* y_i x_i, and

  f(x) = Σ_{i=1}^n α_i* y_i x_i · x + b*

Page 37: Sciences des Données - Séparateurs linéaires et Machines à ...

On the dual formulation and support vectors

The QP

  max_α  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n y_i y_j α_i α_j x_i · x_j
  s.t.   Σ_{i=1}^n α_i y_i = 0  and  α ≥ 0

Support vectors

[Figure: separating hyperplane w·x + b = 0, support vectors on the margin, margin = 2/‖w‖]

The support vectors are those points for which α_i* > 0; they "support" the margin.
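
For illustration, a sketch using scikit-learn's SVC on a hypothetical toy dataset: a very large C approximates the hard-margin case, and the fitted model exposes the support vectors, the products α_i* y_i (dual_coef_), and w*, b*:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (hypothetical)
X = np.array([[0.0, 0.0], [0.5, 0.2], [2.0, 2.0], [2.2, 1.8]])
y = np.array([-1, -1, +1, +1])

clf = SVC(kernel="linear", C=1e6)    # very large C ~ hard margin
clf.fit(X, y)

print(clf.support_vectors_)          # the x_i with alpha_i* > 0
print(clf.dual_coef_)                # alpha_i* y_i for the support vectors
print(clf.coef_, clf.intercept_)     # w* = sum_i alpha_i* y_i x_i and b*
```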

Page 38: Sciences des Données - Séparateurs linéaires et Machines à ...

Soft margin linear SVM

Presence of outliers

[Figure: separating hyperplane w·x + b = 0, support vectors, margin = 2/‖w‖, with outliers inside the margin or on the wrong side]

Slack variables - 1-norm

  min_{w,b,ξ≥0}  (1/2)‖w‖² + C Σ_{i=1}^n ξ_i
  s.t.  y_i [w · x_i + b] ≥ 1 − ξ_i

with C > 0

  max_α  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n y_i y_j α_i α_j x_i · x_j
  s.t.   Σ_{i=1}^n α_i y_i = 0  and  0 ≤ α ≤ C

(dual problem, using the same machinery as before)

Page 39: Sciences des Données - Séparateurs linéaires et Machines à ...

Soft margin linear SVM

Presence of outliers

[Figure: separating hyperplane w·x + b = 0, support vectors, margin = 2/‖w‖, with outliers inside the margin or on the wrong side]

Slack variables - 2-norm

  min_{w,b,ξ}  (1/2)‖w‖² + C Σ_{i=1}^n ξ_i²
  s.t.  y_i [w · x_i + b] ≥ 1 − ξ_i

with C > 0

  max_α  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n y_i y_j α_i α_j (x_i · x_j + δ_ij / C)
  s.t.   Σ_{i=1}^n α_i y_i = 0  and  α ≥ 0

(dual problem, using the same machinery as before)

Page 40: Sciences des Données - Séparateurs linéaires et Machines à ...

Soft margin linear SVM

Presence of outliers

[Figure: separating hyperplane w·x + b = 0, support vectors, margin = 2/‖w‖, with outliers]

Unconstrained 1-norm primal form

  min_{w,b}  (1/2)‖w‖² + C Σ_{i=1}^n |1 − y_i (w · x_i + b)|_+ ,  where |θ|_+ = max(θ, 0)

convex, non-differentiable

Unconstrained 2-norm primal form

  min_{w,b}  (1/2)‖w‖² + C Σ_{i=1}^n |1 − y_i (w · x_i + b)|_+²

convex, differentiable
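
Since the 2-norm form is differentiable, it can be minimized by plain gradient descent; a minimal sketch (the learning rate and iteration count are arbitrary choices, not prescribed by the slides):

```python
import numpy as np

def train_l2_svm(X, y, C=1.0, lr=0.01, n_iters=1000):
    """Gradient descent on the unconstrained 2-norm primal:
       0.5 ||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))^2."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        slack = np.maximum(0.0, 1.0 - y * (X @ w + b))   # |1 - y_i (w.x_i + b)|_+
        grad_w = w - 2.0 * C * (slack * y) @ X            # gradient wrt w
        grad_b = -2.0 * C * np.sum(slack * y)             # gradient wrt b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```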

Page 41: Sciences des Données - Séparateurs linéaires et Machines à ...

Kernel SVM for linearly inseparable data

- Project the data into a high-dimensional feature space using a kernel function

- Apply a linear classifier in the feature space

Page 42: Sciences des Données - Séparateurs linéaires et Machines à ...

Nonlinear SVM: tricking SVMs with kernels

1-norm and 2-norm soft-margin SVM

  max_α  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n y_i y_j α_i α_j k(x_i, x_j)
  s.t.   Σ_{i=1}^n α_i y_i = 0  and  0 ≤ α ≤ C

  max_α  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n y_i y_j α_i α_j (k(x_i, x_j) + δ_ij / C)
  s.t.   Σ_{i=1}^n α_i y_i = 0  and  α ≥ 0

Classifier output: f(x) = Σ_{i=1}^n α_i* y_i k(x_i, x) + b*

- The sizes of the problems scale with the number of data
- Efficient methods exist to solve these problems (exploiting the sparsity of the solution)
- k and C are two hyperparameters that need to be chosen adequately (a cross-validation sketch follows)
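
As noted in the last bullet, here is a sketch of selecting C and the kernel hyperparameter (here the RBF width gamma) by cross-validation with scikit-learn; the dataset and grid values are arbitrary choices for the illustration:

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_moons

# Nonlinearly separable toy data
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Cross-validated search over the two hyperparameters: C and the RBF width gamma
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```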