Data Science - Linear Separators and Support Vector Machines
Hachem Kadri
Aix-Marseille University, CNRS, Laboratoire d'Informatique et des Systèmes (LIS)
QARMA team, https://qarma.lis-lab.fr/
Supervised Learning
Generalize from training to testing
Supervised Learning – Generalization
From a training set consisting of randomly sampled (input, target) pairs, learn a function (a predictor) that predicts well the target of new data.
Supervised learning / Generalization
−→ Given n training examples (x1, y1), . . . , (xn, yn) ∈ (X × Y) and u test data xn+1, . . . , xn+u ∈ X
−→ Learn f : X → Y to generalize from training to testing
Decision Trees reminder
Should we play tennis or not?
Play  Wind   Humidity  Outlook
No    Low    High      Sunny
No    High   Normal    Rain
Yes   Low    High      Overcast
Yes   Weak   Normal    Rain
Yes   Low    Normal    Sunny
Yes   Low    Normal    Overcast
Yes   High   Normal    Sunny
Decision Trees reminder
- breaking down our data by making decisions based on asking a series of questions
K-nearest Neighbors reminder
- Choose the number k and a distance metric
- Find the k-nearest neighbors of the sample to be classified
- Assign the class label by majority vote
Linear Regression reminder
- model the relation between features (explanatory variable x) and a continuous-valued response (target variable y)
- Linear model:

  y = w0 + w1 x^(1) + · · · + wd x^(d) = Σ_{i=0}^d wi x^(i) = 〈w, x〉   (with the convention x^(0) = 1)
Focus: linear classification
Important notions in learning to classify
- a number of training data (pictures, emails, etc.)
- a learning algorithm (how to build the classifier?)
- generalization: the classifier should correctly classify test data
Quick formalization
- X (e.g. R^d, d > 0) is the space of data, called the input space
- Y (e.g. toxic/not toxic, or {−1, +1}) is the target space
- f : X → Y is the classifier

[Figure: a linear classifier f(x) = 0 in input space X = R², separating class +1 from class −1]
Perceptron (Rosenblatt, 1958)
Inspiration: biological neural networks
Motivations:
- Learning system composed by associating simple processing units
- Efficiency, scalability, and adaptability
Perceptron: a linear classifier, X = R^d, Y = {−1, +1}

[Figure: a processing unit computing σ(Σ_{i=1}^d wi xi + w0) from inputs x = (x1, x2, . . .), weights w1, w2, . . ., and a bias w0 (activation fixed to 1)]
Perceptron (Rosenblatt, 1958)
A linear classifier, X = R^d, Y = {−1, +1}
- Classifier weights: w ∈ R^d
- Classifier prediction: f(x) = sign〈w, x〉
- Question: how to learn w from training data?
Algorithm: Perceptron
Input: S = {(xi, yi)}_{i=1}^n
w ← 0
while there exists (xi, yi) such that yi〈w, xi〉 ≤ 0 do
    w ← w + yi xi
end while
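The pseudocode above translates directly into NumPy. A minimal sketch (function name and toy data are illustrative, not from the course; a bias can be handled by appending a constant 1 feature):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Primal perceptron: cycle over the data, updating w on each mistake.

    X: (n, d) array of inputs; y: (n,) array of labels in {-1, +1}.
    Assumes the data are linearly separable; otherwise stops after
    max_epochs passes over the data.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:   # misclassified (or on the boundary)
                w += yi * xi              # perceptron update: w <- w + yi xi
                mistakes += 1
        if mistakes == 0:                 # all points correctly classified
            break
    return w

# Toy linearly separable data
X = np.array([[2.0, 1.0], [1.0, -1.0], [-2.0, 1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
assert np.all(np.sign(X @ w) == y)
```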
Perceptron in action
Perceptron
1. Initialize the weights w to 0 or small random numbers
2. For each training sample xi:
   a. Compute the output value ŷi = sign(w⊤xi)
   b. Update the weights if ŷi ≠ yi
Perceptron: limitations
Theorem (XOR, Minsky & Papert, 1969)
The perceptron algorithm cannot solve the XOR problem.
Perceptron: dual-form
- The solution is written as a linear combination of the training data:

  w = Σ_{j=1}^n αj yj xj
Algorithm: Dual-Form Perceptron
Input: S = {(xi, yi)}_{i=1}^n
α ← 0
while there exists (xi, yi) such that yi (Σ_{j=1}^n αj yj 〈xj, xi〉) ≤ 0 do
    αi ← αi + 1
end while
The kernel trick
Nonlinearly separable dataset S = {(x1, y1), . . . , (xn, yn)}
Idea to learn a nonlinear classifier:
- choose a (nonlinear) mapping φ

  φ : X → H
      x ↦ φ(x)

  where H is an inner product space (inner product 〈·, ·〉_H), called the feature space
- find a linear classifier (i.e. a separating hyperplane) in H to classify {(φ(x1), y1), . . . , (φ(xn), yn)}
- to classify a test point x, consider φ(x)
The kernel trick
Linearly classifying in feature space

[Figure: the mapping φ sends the input space X to the feature space H]

Taking the previous linear algorithm and implementing it in H:

h(x) = Σ_{i=1,...,n} αi 〈φ(xi), φ(x)〉_H + b
The kernel trick
Mercer Kernels
The kernel trick can be applied if there is a function k : X × X → R such that k(u, v) = 〈φ(u), φ(v)〉_H.
If so, all occurrences of 〈φ(xi), φ(x)〉_H are syntactically replaced by k(xi, x).
- Keypoint: the emphasis is sometimes (often) more on k than on φ
- Kernels must verify Mercer's property to be valid kernels
  - this ensures that there exist a space H and a mapping φ : X → H such that k(u, v) = 〈φ(u), φ(v)〉_H
  - however, non-valid kernels have been used with success
  - and research is in progress on using non-semi-definite kernels
- k may be viewed as a similarity measure
The kernel trick
[Figure: no separating hyperplane in input space X; one possible separating hyperplane in feature space H; the corresponding separating surface back in X]

Kernel trick recipe
- choose a linear classification algorithm (expressed in terms of 〈·, ·〉)
- replace all occurrences of 〈·, ·〉 by a kernel k(·, ·)

Obtained classifier:

f(x) = sign( Σ_{i=1,...,n} αi k(xi, x) + b )
Kernel Perceptron
- Replace 〈xj, xi〉 by k(xj, xi)

Algorithm: Kernel Perceptron
Input: S = {(xi, yi)}_{i=1}^n
α ← 0
while there exists (xi, yi) such that yi (Σ_{j=1}^n αj yj k(xj, xi)) ≤ 0 do
    αi ← αi + 1
end while
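The kernelized algorithm can solve problems the primal perceptron cannot, such as XOR. A NumPy sketch (function names, kernel choice, and toy data are ours, for illustration):

```python
import numpy as np

def kernel_perceptron(X, y, kernel, max_epochs=100):
    """Dual/kernel perceptron: learn one coefficient alpha_i per training point.

    kernel(u, v) must return <phi(u), phi(v)>_H. Prediction on x is
    sign(sum_j alpha_j y_j k(x_j, x)).
    """
    n = len(X)
    alpha = np.zeros(n)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])  # Gram matrix
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * np.sum(alpha * y * K[:, i]) <= 0:  # point i misclassified
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

def rbf(u, v, sigma2=1.0):
    # Gaussian/RBF kernel k(u, v) = exp(-||u - v||^2 / (2 sigma^2))
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma2))

# XOR: not linearly separable in input space, but separable with an RBF kernel
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, 1, 1, -1])
alpha = kernel_perceptron(X, y, rbf)
pred = np.sign([np.sum(alpha * y * np.array([rbf(xj, x) for xj in X])) for x in X])
assert np.all(pred == y)
```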
Common kernels
Gaussian/RBF kernel
- k(u, v) = exp(−‖u − v‖² / (2σ²)), σ² > 0
- the corresponding H is of infinite dimension

Polynomial kernel
- k(u, v) = (〈u, v〉 + c)^d, c ∈ R, d ∈ N
- a corresponding analytic φ may be constructed (see below)

Tangent kernel (it is not a Mercer kernel)
- k(u, v) = tanh(a〈u, v〉 + c), a, c ∈ R
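These three kernels are a few lines each in NumPy (function names and default parameter values are our choices, for illustration):

```python
import numpy as np

def gaussian_kernel(u, v, sigma2=1.0):
    # k(u, v) = exp(-||u - v||^2 / (2 sigma^2)), sigma^2 > 0
    return np.exp(-np.sum((np.asarray(u) - np.asarray(v)) ** 2) / (2 * sigma2))

def polynomial_kernel(u, v, c=0.0, d=2):
    # k(u, v) = (<u, v> + c)^d
    return (np.dot(u, v) + c) ** d

def tanh_kernel(u, v, a=1.0, c=0.0):
    # k(u, v) = tanh(a <u, v> + c) -- not a Mercer kernel in general
    return np.tanh(a * np.dot(u, v) + c)

u, v = np.array([1.0, 2.0]), np.array([3.0, 1.0])
print(gaussian_kernel(u, v))    # exp(-5/2), since ||u - v||^2 = 5
print(polynomial_kernel(u, v))  # (<u, v>)^2 = 5^2 = 25
```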
Common kernels
Polynomial kernel k(u, v) = 〈u, v〉²_{R²}
Polynomial kernel with c = 0 and d = 2, defined on R² × R²
- Consider the mapping:

  φ : R² → R³
      x = [x1, x2]⊤ ↦ φ(x) = [x1², √2 x1x2, x2²]⊤

- We have, for u, v ∈ R²:

  〈φ(u), φ(v)〉_{R³} = 〈[u1², √2 u1u2, u2²]⊤, [v1², √2 v1v2, v2²]⊤〉
                    = (u1v1 + u2v2)²
                    = 〈u, v〉²_{R²}
                    = k(u, v)
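The identity above is easy to check numerically (the feature map below is exactly the φ of the slide; the random test points are ours):

```python
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 polynomial kernel on R^2:
    # phi(x) = [x1^2, sqrt(2) x1 x2, x2^2]
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

rng = np.random.default_rng(0)
for _ in range(5):
    u, v = rng.normal(size=2), rng.normal(size=2)
    # <phi(u), phi(v)>_{R^3} must equal (<u, v>_{R^2})^2
    assert np.isclose(phi(u) @ phi(v), np.dot(u, v) ** 2)
```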
(Kernel) Gram matrices
Gram matrix
Let k : X × X → R be a kernel. For a set of patterns S = {x1, . . . , xn},

K_S = [ k(x1, x1)  k(x1, x2)  · · ·  k(x1, xn)
        k(x1, x2)  k(x2, x2)  · · ·  k(x2, xn)
          · · ·                        · · ·
        k(x1, xn)  k(x2, xn)  · · ·  k(xn, xn) ]

is the Gram matrix of k with respect to S.

Mercer's property
Let k : X × X → R be a symmetric function. k is a Mercer kernel ⇔

∀S = {x1, . . . , xn}, xi ∈ X :  v⊤ K_S v ≥ 0, ∀v ∈ R^n

(and, therefore, there exists φ such that k(u, v) = 〈φ(u), φ(v)〉)
(Kernel) Gram matrices
For any Mercer kernel k and any set of patterns S, the Gram matrix K_S has only nonnegative eigenvalues.

k1 and k2 being Mercer kernels, we have:
- k1^p, p ∈ N is a Mercer kernel
- λk1 + γk2, λ, γ > 0 is a Mercer kernel
- k1 k2 is a Mercer kernel
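The eigenvalue characterization can be checked numerically: build a Gram matrix for a Mercer kernel and verify all its eigenvalues are nonnegative (up to floating-point error). A sketch with the RBF kernel on random data (function names and data are ours):

```python
import numpy as np

def gram_matrix(X, kernel):
    # K_S[i, j] = k(x_i, x_j)
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

def rbf(u, v, sigma2=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma2))

X = np.random.default_rng(1).normal(size=(10, 3))
K = gram_matrix(X, rbf)
eigvals = np.linalg.eigvalsh(K)    # eigenvalues of the symmetric matrix K
assert np.allclose(K, K.T)         # Gram matrix of a symmetric kernel
assert np.all(eigvals >= -1e-10)   # nonnegative, up to numerical error
```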
Support Vector Machines (SVM)
- Maximize the margin, which is equal to 2/‖w‖
- under the constraint that the samples are classified correctly:
  a. w⊤xn ≥ 1 if yn = 1
  b. w⊤xn ≤ −1 if yn = −1
  ∀n = 1, . . . , N
Support Vector Classification
A breakthrough in machine learning
- Positive definite kernels
- Large margin classification
- Convex optimization (quadratic programming)
- Statistical learning theory

[Figure: separating hyperplane w·x + b = 0, margin 2/‖w‖, support vectors]
Hard margin linear SVM
Setting
- Input space X with dot product ·
- Target space Y = {−1, +1}
- S = {(xi, yi)}_{i=1}^n linearly separable training set

Optimal hyperplane
Find the separating hyperplane with maximum margin, i.e. the one with maximal distance to the closest data.
Hard margin linear SVM and convex optimization
Definition (Canonical hyperplane wrt S)
A hyperplane w · x + b = 0 is canonical wrt S if min_{xi∈S} |w · xi + b| = 1.
(Simple geometry:) the margin for a canonical separating hyperplane is equal to 2/‖w‖.
Hard margin linear SVM and convex optimization
Primal problem (recall that margin = 2/‖w‖)

min_{w,b}  ½‖w‖²
s.t.  w · xi + b ≥ +1 if yi = 1
      w · xi + b ≤ −1 otherwise
Hard margin linear SVM and convex optimization

Primal problem (recall that margin = 2/‖w‖)

min_{w,b}  ½‖w‖²
s.t.  w · xi + b ≥ +1 if yi = 1
      w · xi + b ≤ −1 otherwise

Equivalently:

min_{w,b}  ½‖w‖²
s.t.  yi [w · xi + b] ≥ +1, ∀i
Hard margin linear SVM and convex optimization
Introducing Lagrange multipliers
The solution w, b can be found by solving the following problem:

min_{w,b} max_{α≥0} L(w, b, α)
with L(w, b, α) := ½‖w‖² − Σ_{i=1}^n αi [yi (w · xi + b) − 1]

The αi's (≥ 0) are Lagrange multipliers, one per constraint.

Another formulation of the constrained optimization problem
- if a constraint is violated, i.e., yi (w · xi + b) − 1 < 0 for some i, then the value of the objective function max_{α≥0} L(w, b, α) is +∞ (for αi → +∞); so optimal w and b necessarily verify yi (w · xi + b) − 1 ≥ 0
- also, if yi (w · xi + b) − 1 > 0 for some i, then αi = 0: again, just look at the function max_{α≥0} L(w, b, α) → at the solution, αi [yi (w · xi + b) − 1] = 0 (KKT conditions)
Hard margin linear SVM and convex optimization
Switching the min and the max
A theorem of convex optimization (see, e.g. [?]), exploiting the fact that L is convex wrt w and b and concave wrt α, gives

min_{w,b} max_{α≥0} L(w, b, α) = max_{α≥0} min_{w,b} L(w, b, α)

with the same optimal points.

Making the gradient be 0

min_{w,b} L(w, b, α) = min_{w,b} ½‖w‖² − Σ_{i=1}^n αi [yi (w · xi + b) − 1]

is an unconstrained strictly convex (and coercive) optimization problem. It suffices to set the gradient of the functional to 0 to get the solution:
- ∇_w L(w, b, α) = w − Σ_i αi yi xi = 0  ⇒  w = Σ_i αi yi xi
- ∇_b L(w, b, α) = −Σ_i yi αi = 0  ⇒  Σ_i yi αi = 0
Hard margin linear SVM and convex optimization
A dual quadratic program
Plugging in the value of w and the constraint provides the following problem:

max_α  Σ_{i=1}^n αi − ½ Σ_{i,j=1}^n yi yj αi αj xi · xj
s.t.   Σ_{i=1}^n αi yi = 0 and α ≥ 0
On the dual formulation and support vectors
The QP

max_α  Σ_{i=1}^n αi − ½ Σ_{i,j=1}^n yi yj αi αj xi · xj
s.t.   Σ_{i=1}^n αi yi = 0 and α ≥ 0

- As many optimization variables as the number of training data
- Convex quadratic program
- Only dot products appear (the kernel trick will strike soon)
- b* is found through the KKT conditions
- w* = Σ_{i=1}^n αi* yi xi, and

f(x) = Σ_{i=1}^n αi* yi xi · x + b*
On the dual formulation and support vectors

Support vectors

[Figure: support vectors sitting on the margin of the hyperplane w·x + b = 0]

The support vectors are those points for which αi* > 0; they "support" the margin.
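In practice the dual QP is handed to a dedicated solver; as a rough illustration, the sketch below drops the bias b (which removes the equality constraint Σ αi yi = 0) and runs projected gradient ascent on the remaining problem. Function names, step size, and toy data are our assumptions, not the course's method:

```python
import numpy as np

def svm_dual_no_bias(X, y, lr=0.01, n_iter=2000):
    """Projected gradient ascent on the hard-margin dual, bias dropped.

    Maximizes  sum_i alpha_i - 1/2 sum_ij y_i y_j alpha_i alpha_j <x_i, x_j>
    subject to alpha >= 0 (no equality constraint since b is omitted).
    """
    n = len(X)
    Q = (y[:, None] * y[None, :]) * (X @ X.T)       # Q_ij = y_i y_j <x_i, x_j>
    alpha = np.zeros(n)
    for _ in range(n_iter):
        grad = 1.0 - Q @ alpha                      # gradient of the dual objective
        alpha = np.maximum(0.0, alpha + lr * grad)  # ascent step + projection on alpha >= 0
    w = (alpha * y) @ X                             # w* = sum_i alpha_i* y_i x_i
    return alpha, w

# Toy linearly separable data
X = np.array([[2.0, 1.0], [1.5, 2.0], [-2.0, -1.0], [-1.5, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, w = svm_dual_no_bias(X, y)
assert np.all(np.sign(X @ w) == y)   # training points correctly classified
```

Only a subset of the αi end up strictly positive: those are the support vectors.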
Soft margin linear SVM
Presence of outliers

[Figure: outliers lying inside the margin of the hyperplane w·x + b = 0]

Slack variables - 1-norm

min_{w,b,ξ≥0}  ½‖w‖² + C Σ_{i=1}^n ξi
s.t.  yi [w · xi + b] ≥ 1 − ξi

with C > 0. Dual (using the same machinery as before):

max_α  Σ_{i=1}^n αi − ½ Σ_{i,j=1}^n yi yj αi αj xi · xj
s.t.   Σ_{i=1}^n αi yi = 0 and 0 ≤ α ≤ C
Soft margin linear SVM
Presence of outliers

Slack variables - 2-norm

min_{w,b,ξ}  ½‖w‖² + C Σ_{i=1}^n ξi²
s.t.  yi [w · xi + b] ≥ 1 − ξi

with C > 0. Dual (using the same machinery as before):

max_α  Σ_{i=1}^n αi − ½ Σ_{i,j=1}^n yi yj αi αj (xi · xj + δij/C)
s.t.   Σ_{i=1}^n αi yi = 0 and α ≥ 0
Soft margin linear SVM
Presence of outliers

Unconstrained 1-norm primal form

min_{w,b}  ½‖w‖² + C Σ_{i=1}^n |1 − yi (w · xi + b)|_+ , where |θ|_+ = max(θ, 0)

convex, non-differentiable

Unconstrained 2-norm primal form

min_{w,b}  ½‖w‖² + C Σ_{i=1}^n |1 − yi (w · xi + b)|²_+

convex, differentiable
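Because the unconstrained 1-norm form is convex (though non-differentiable at the hinge), a subgradient method applies directly. A minimal NumPy sketch, not an optimized solver (function name, step size, and toy data are ours):

```python
import numpy as np

def svm_subgradient(X, y, C=1.0, lr=0.01, n_iter=2000):
    """Subgradient descent on 1/2 ||w||^2 + C sum_i |1 - y_i (<w, x_i> + b)|_+."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iter):
        margins = y * (X @ w + b)
        viol = margins < 1                 # points violating the margin
        # subgradient of the objective wrt w and b
        gw = w - C * (y[viol] @ X[viol])
        gb = -C * np.sum(y[viol])
        w -= lr * gw
        b -= lr * gb
    return w, b

X = np.array([[2.0, 2.0], [1.0, 2.0], [-2.0, -2.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = svm_subgradient(X, y)
assert np.all(np.sign(X @ w + b) == y)
```

The 2-norm form is differentiable, so plain gradient descent (replacing the hinge by its square) works the same way.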
Kernel SVM for linearly inseparable data
- Project the data into a high-dimensional feature space using a kernel function
- Apply a linear classifier in the feature space
Nonlinear SVM: tricking SVMs with kernels
1-norm and 2-norm soft-margin SVM

1-norm:
max_α  Σ_{i=1}^n αi − ½ Σ_{i,j=1}^n yi yj αi αj k(xi, xj)
s.t.   Σ_{i=1}^n αi yi = 0 and 0 ≤ α ≤ C

2-norm:
max_α  Σ_{i=1}^n αi − ½ Σ_{i,j=1}^n yi yj αi αj (k(xi, xj) + δij/C)
s.t.   Σ_{i=1}^n αi yi = 0 and α ≥ 0

Classifier output: f(x) = Σ_{i=1}^n αi* yi k(xi, x) + b*
- The sizes of the problems scale with the number of data
- Efficient methods exist to solve these problems (exploiting the sparsity of the solution)
- k and C are two hyperparameters that need to be chosen adequately
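Putting the pieces together, the kernelized soft-margin dual can be attacked with the same simple projected-ascent idea as before (again dropping the bias b for simplicity, which removes the equality constraint; solver, step size, and data are our illustrative assumptions). With an RBF kernel it separates XOR:

```python
import numpy as np

def kernel_svm_no_bias(K, y, C=10.0, lr=0.01, n_iter=5000):
    """Projected gradient ascent on the kernelized 1-norm soft-margin dual
    (bias dropped): max_alpha sum_i alpha_i - 1/2 sum_ij y_i y_j alpha_i alpha_j K_ij
    subject to 0 <= alpha <= C."""
    n = len(y)
    Q = (y[:, None] * y[None, :]) * K
    alpha = np.zeros(n)
    for _ in range(n_iter):
        # ascent step, then projection onto the box [0, C]
        alpha = np.clip(alpha + lr * (1.0 - Q @ alpha), 0.0, C)
    return alpha

def rbf(u, v, sigma2=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma2))

# XOR: no separating hyperplane in input space
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0, -1.0])
K = np.array([[rbf(a, b) for b in X] for a in X])
alpha = kernel_svm_no_bias(K, y)
f = K @ (alpha * y)           # f(x_i) = sum_j alpha_j y_j k(x_j, x_i)
assert np.all(np.sign(f) == y)
```

In practice one would use an off-the-shelf SVM solver and tune σ² (or the kernel parameters) and C, e.g. by cross-validation.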