Les outils de modélisation des Big Data

Transcript
Page 1: Les outils de modélisation des Big Data

Les outils de modélisation des Big Data

SEPIA, 3 Dec 13

Pr Michel Béra, Chaire de Modélisation statistique du Risque

CNAM/SITI/IMATH

1

Page 2: Les outils de modélisation des Big Data

• Outline of the talk

– Vapnik's inequality and the foundations of a new theory of robustness (1971 and 1995)

– Insights into classical methods (NN, decision trees, factor analysis)

– The notion of data geometry and of extended space – the kernel trick – qualitative vs. quantitative: an outdated battle

– Big Data and the Vapnik world, utopias and realities – notions of computational complexity

– Modern modeling: a chain of approaches, from blind Machine Learning to the subtleties of Evidence-Based Policy

2

Page 3: Les outils de modélisation des Big Data

[Diagram: statistical history timeline. 1930: Kolmogorov-Smirnov, Fisher; 1950: Cramér; 1960: mainframes, huge datasets start appearing; 1974: VC dimension; 1980: SRM (Vapnik); 1995: Support Vector Machines (Vapnik); 2001: start of the internet era, millions of records and thousands of variables. Strands: theoretical statistics ("data are as they are"), applied statistics ("modeling data then testing"), the theory of ill-posed problems, empirical methods of conjuration (PCA, NN, Bayes); the curse of high-dimensional problems. Annotations: "STOP!", "Watch out!", "GO!".]

3

Page 4: Les outils de modélisation des Big Data

1. The world of Vapnik - 1995 lecture at Bell Labs (New Jersey)

4

Page 5: Les outils de modélisation des Big Data

Consistency: definition

1) A learning process (model) is said to be consistent if the model error, measured on new data sampled from the same underlying probability law as our original sample, converges to the model error measured on the original sample as the original sample size increases.

2) A model that is consistent is also said to generalize well, or to be robust.

5

Page 6: Les outils de modélisation des Big Data

[Figure: two panels of % error vs. number of training examples, each showing a test-error curve and a training-error curve, asking "Consistent training?"]
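As an illustration (not from the slides), a minimal simulation of such learning curves, assuming scikit-learn and a synthetic classification task: for a consistent learner, training and test error converge as the sample size L grows.

```python
# Minimal sketch (not from the slides): empirical learning curves on synthetic data.
# A consistent learner shows training and test error converging as L grows.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_test, y_test = X[10000:], y[10000:]          # held-out data from the same law P(z)

for L in [50, 100, 500, 1000, 5000, 10000]:
    Xl, yl = X[:L], y[:L]
    model = LogisticRegression(max_iter=1000).fit(Xl, yl)
    train_err = 1 - model.score(Xl, yl)        # empirical risk on the original sample
    test_err = 1 - model.score(X_test, y_test) # risk estimated on new data
    print(f"L={L:6d}  training error={train_err:.3f}  test error={test_err:.3f}")
```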

6

Page 7: Les outils de modélisation des Big Data

Generalization: definition

• The generalization capacity of a model describes how well (e.g. in terms of an error function) the model will perform on data it has never seen before (i.e. data not in its training set).

• Good generalization means that the model error on new, unknown data will be of the same "size" as the known error on its training set. Such a model is also called "robust".

7

Page 8: Les outils de modélisation des Big Data

Overfitting

[Figure: a degree-10 polynomial fit to noisy data, x in [-10, 10], y in [-0.5, 1.5], illustrating overfitting.]

Example: polynomial regression. Target: a 10th-degree polynomial + noise.

Learning machine: y = w0 + w1 x + w2 x^2 + … + w10 x^10

8

Page 9: Les outils de modélisation des Big Data

Overfitting Avoidance

[Figure: the same fit with d = 10 and an increasing ridge parameter r = 0.01, 0.1, 1, 10, 1e+002, 1e+003, 1e+004, 1e+005, 1e+006, 1e+007, 1e+008; one panel per value of r, axes x in [-10, 10], y in [-0.5, 1.5].]

Example: polynomial regression. Target: a 10th-degree polynomial + noise.

Learning machine: y = w0 + w1 x + w2 x^2 + … + w10 x^10
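A minimal sketch of the same kind of experiment (my own illustration, not the original figure's code), assuming scikit-learn: degree-10 polynomial features with a ridge penalty r acting as weight decay; larger r shrinks the weights and smooths the fit.

```python
# Sketch of the assumed setup: degree-10 polynomial features, ridge penalty r as weight decay.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x = np.linspace(-10, 10, 30)[:, None]
y = np.polyval(rng.normal(size=11), x.ravel() / 10) + 0.2 * rng.normal(size=30)  # noisy degree-10 target

for r in [1e-2, 1e-1, 1, 10, 1e2, 1e4, 1e6, 1e8]:
    model = make_pipeline(PolynomialFeatures(degree=10), StandardScaler(), Ridge(alpha=r))
    model.fit(x, y)
    print(f"r={r:g}  ||w||={np.linalg.norm(model[-1].coef_):.3f}")  # weights shrink as r grows
```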

9

Page 10: Les outils de modélisation des Big Data

Vapnik approach to modeling (1)

• Vapnik's approach is based on a family of functions S = {f(X, w), w ∈ W}, in which a model is chosen as a specific function, identified by a specific w.

• For Vapnik, the model function must answer properly, for a given row X, the question described by the target Y, i.e. predict Y, the quality of the answer being measured by a cost function Q.

• Different families of functions may provide the same "quality" of answer.

10

Page 11: Les outils de modélisation des Big Data

Vapnik approach to modeling (2)

• The whole trick is then to find a good family of functions S that not only answers the question described by the target Y in a "good way", but is also easy to understand, i.e. also provides a good description, making it easy to explain what underlies the data behaviour of the problem at hand.

• The VC dimension will be a key to understanding and controlling model robustness.

11

Page 12: Les outils de modélisation des Big Data

VC dimension - definition (1)

• Let us consider a sample (x1, …, xL) from R^n.

• There are 2^L different ways to separate the sample into two sub-samples.

• A set S of functions f(X, w) shatters the sample if all 2^L separations can be realized by different functions f(X, w) from the family S.

12

Page 13: Les outils de modélisation des Big Data

VC dimension - definition (2)

A function family S has VC dimension h (h an integer) if:

1) there is at least one sample of h vectors from R^n that can be shattered by functions from S, and

2) no sample of h+1 vectors from R^n can be shattered by functions from S.

13

Page 14: Les outils de modélisation des Big Data

Example: VC dimension

VC dimension:
- measures the complexity of a solution (function);
- is not directly related to the number of variables.

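A small illustration (not from the slides, assuming scikit-learn's LinearSVC as the linear separator) of what shattering means: linear classifiers in R^2 can realize every labeling of 3 points in general position, but not the XOR labeling of 4 points, consistent with a VC dimension of 3 for lines in the plane.

```python
# Illustration: brute-force check of shattering by linear classifiers sign(w.x + b) in R^2.
import itertools
import numpy as np
from sklearn.svm import LinearSVC

def shatters(points):
    """True if some linear classifier realizes every +/-1 labeling of the points."""
    for labels in itertools.product([-1, 1], repeat=len(points)):
        if len(set(labels)) == 1:
            continue  # constant labelings are trivially realizable
        clf = LinearSVC(C=1e6, max_iter=100000).fit(points, labels)
        if (clf.predict(points) != labels).any():
            return False
    return True

triangle = np.array([[0, 0], [1, 0], [0, 1]])         # 3 points in general position
square = np.array([[0, 0], [1, 1], [1, 0], [0, 1]])   # contains the XOR configuration
print("3 points shattered:", shatters(triangle))      # expected: True
print("4 points shattered:", shatters(square))        # expected: False (XOR labeling fails)
```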

14

Page 15: Les outils de modélisation des Big Data

Other examples

• The VC dimension of the hyperplanes of R^n is n+1.

• The VC dimension of the set of functions f(x, w) = sign(sin(w·x)), c <= x <= 1, c > 0, where w is a free parameter, is infinite.

– Conclusion: the VC dimension is not always equal to the number n of input variables (X1, …, Xn) of a given family S of functions from R^n to {-1, +1}.

15

Page 16: Les outils de modélisation des Big Data

Key example: linear models, y = <w|x> + b

• The VC dimension of a family S of linear models with a constraint on the weights (e.g. ||w||^2 <= C) depends on C and can take any value between 0 and n.

This is the basis for Machine Learning approaches such as SVM (Support Vector Machines) or Ridge Regression.

16

Page 17: Les outils de modélisation des Big Data

VC dimension: interpretation

• The VC dimension of S is an integer that measures the shattering (or separating) power ("complexity") of the function family S.

• We shall now show that the VC dimension (through a major theorem of Vapnik) gives a powerful indication of model consistency, hence of "robustness".

17

Page 18: Les outils de modélisation des Big Data

What is a Risk Functional?

• A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task.

[Figure: risk functional R[f(x, w)] plotted over the parameter space (w), with its minimum at w*.]

18

Page 19: Les outils de modélisation des Big Data

Examples of Risk Functionals

• Classification:

– Error rate

– AUC

• Regression:

– Mean square error
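A short sketch of how such risk functionals are evaluated from predictions (illustration only, assuming scikit-learn's metrics):

```python
# Illustration: common risk functionals computed from model outputs.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, mean_squared_error

y_true = np.array([0, 1, 1, 0, 1])             # classification targets
y_pred = np.array([0, 1, 0, 0, 1])             # hard predictions
scores = np.array([0.1, 0.9, 0.4, 0.3, 0.8])   # ranking scores f(x)

print("error rate:", 1 - accuracy_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, scores))   # area under the ROC curve

y_reg_true = np.array([1.0, 2.0, 3.0])
y_reg_pred = np.array([1.1, 1.9, 3.2])
print("mean square error:", mean_squared_error(y_reg_true, y_reg_pred))
```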

19

Page 20: Les outils de modélisation des Big Data

Lift Curve

OMKI =

O M

Fraction of customers selected

Fra

ctio

n of

goo

d cu

stom

ers

sele

cted

Ideal Lift

100%

100%Customers ordered according to f(x); selection of the top ranking customers.

Gini index 0 ≤≤≤≤ KI ≤≤≤≤ 1
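A hedged sketch of the lift curve and a Gini-type index: the slide defines KI = O/M from areas in the figure without spelling them out, so here O is taken to be the area between the model's lift curve and the diagonal, and M the same area for the ideal lift (an assumption, not the slide's exact definition).

```python
# Illustrative sketch: lift curve and a Gini-type index KI = O/M (assumed reading of the figure).
import numpy as np

def lift_curve(y_true, scores):
    order = np.argsort(-scores)                        # customers ordered by f(x), best first
    y_sorted = np.asarray(y_true)[order]
    frac_selected = np.arange(1, len(y_sorted) + 1) / len(y_sorted)
    frac_good = np.cumsum(y_sorted) / y_sorted.sum()   # fraction of good customers captured
    return frac_selected, frac_good

def area_between(curve, x):
    diff = curve - x                                   # height above the diagonal y = x
    return np.sum((diff[1:] + diff[:-1]) / 2 * np.diff(x))  # trapezoidal area

def gini_index(y_true, scores):
    x, lift = lift_curve(y_true, scores)
    ideal = np.minimum(x * len(y_true) / np.sum(y_true), 1.0)  # ideal lift curve
    return area_between(lift, x) / area_between(ideal, x)      # KI = O / M

y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
f = np.array([0.9, 0.2, 0.8, 0.6, 0.4, 0.1, 0.7, 0.3])
print("KI =", gini_index(y, f))                        # perfect ranking here, so KI = 1.0
```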

20

Page 21: Les outils de modélisation des Big Data

Statistical Learning Theoretical Foundations

• Structural Risk Minimization

• Regularization

• Weight decay

• Feature selection

• Data compression

21

Page 22: Les outils de modélisation des Big Data

Learning Theory Problem (1)

• A model computes a function y = f(X, w)

• Problem: minimize over w the Risk Expectation R(w) = ∫ Q(z, w) dP(z)

– w: a parameter that specifies the chosen model

– z = (X, y): possible values of the attributes (variables) and of the target

– Q: measures (quantifies) the model error cost

– P(z): the underlying (unknown) probability law of the data z

22

Page 23: Les outils de modélisation des Big Data

Learning Theory Problem (2)

• We are given L data points (z1, …, zL) from the learning sample, assumed to be sampled i.i.d. from the law P(z).

• To minimize R(w), we start by minimizing the Empirical Risk over this sample (see the sketch below): E(w) = (1/L) Σi Q(zi, w)

• Examples of classical cost functions:

– classification (e.g. Q can be a cost function based on the cost of misclassified points)

– regression (e.g. Q can be a least-squares-type cost function)
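As an illustration (not from the slides), a minimal least-squares instance of the empirical risk above, assuming NumPy:

```python
# Illustration: empirical risk E(w) = (1/L) * sum_i Q(z_i, w) with the least-squares cost
# Q((x, y), w) = (y - w.x)^2, minimized over w on the learning sample.
import numpy as np

rng = np.random.default_rng(0)
L = 100
X = rng.normal(size=(L, 3))                     # attributes
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=L)       # target

def empirical_risk(w):
    return np.mean((y - X @ w) ** 2)            # E(w) over the L sampled points

w0, *_ = np.linalg.lstsq(X, y, rcond=None)      # w0 minimizes the empirical risk
print("E(w0) =", empirical_risk(w0))
```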

23

Page 24: Les outils de modélisation des Big Data

Learning Theory Problem (3)

• Central problem for Statistical Learning Theory:

What is the relation between the Risk Expectation R(w) and the Empirical Risk E(w)?

• How to define and measure a generalization capacity (“robustness”) for a model ?

24

Page 25: Les outils de modélisation des Big Data

Four Pillars for SLT (1 and 2)

• Consistency (guarantees generalization)

– Under what conditions will a model be consistent ?

• Model convergence speed (a measure for generalization capacity)

– How does generalization capacity improve when sample size L grows?

25

Page 26: Les outils de modélisation des Big Data

Four Pillars for SLT (3 and 4)

• Generalization capacity control

– How can model generalization be controlled efficiently, starting from the only information we have: our sample data?

• A strategy for good learning algorithms

– Is there a strategy that guarantees, measures and controls the generalization capacity of our learning model?

26

Page 27: Les outils de modélisation des Big Data

Vapnik main theorem

• Q: Under which conditions will a learning process (model) be consistent?

• A: A model will be consistent if and only if the function f that defines the model comes from a family of functions S with finite VC dimension h.

• A finite VC dimension h not only guarantees generalization capacity (consistency): picking f in a family S with finite VC dimension h is the only way to build a model that generalizes.

27

Page 28: Les outils de modélisation des Big Data

Model convergence speed (generalization capacity)

• Q: What is the nature of the difference in model risk between learning data (sample: empirical risk) and test data (expected risk), for a sample of finite size L?

• A: This difference is no greater than a bound that depends only on the ratio between the VC dimension h of the model function family S and the sample size L, i.e. on h/L.

This statement is a new theorem in the Kolmogorov-Smirnov tradition, i.e. a theorem that does not depend on the data's underlying probability law.

28

Page 29: Les outils de modélisation des Big Data

Empirical risk minimization in LS case

• With probability 1-q, the following inequality is true:

where w0 is the value of the parameter w that minimizes the Empirical Risk E(w).
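The inequality itself is not reproduced in this transcript. For reference, the most commonly quoted form of the VC bound (Vapnik, for bounded losses such as the classification error; the least-squares case has a different, multiplicative form) reads:

```latex
% With probability at least 1 - q, uniformly over a family S of VC dimension h
% and a sample of size L (bounded loss):
\[
  R(w) \;\le\; E(w) \;+\; \sqrt{\frac{h\left(\ln\frac{2L}{h} + 1\right) - \ln\frac{q}{4}}{L}}
\]
```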

29

Page 30: Les outils de modélisation des Big Data

Model convergence speed

[Figure: % error vs. sample size L, showing the expected risk (test data), the empirical risk (learning sample), and the confidence interval between them.]

30

Page 31: Les outils de modélisation des Big Data

"SRM" methodology: how to control model generalization capacity

Expected Risk = Empirical Risk + Confidence Interval

• Minimizing the Empirical Risk alone will not always give a good generalization capacity: one wants to minimize the sum of the Empirical Risk and the Confidence Interval.

• What matters is not the numerical value of the Vapnik bound, which is most often too large to be of any practical use; it is the fact that this bound is a non-decreasing function of the "richness", i.e. the shattering power, of the model function family.

31

Page 32: Les outils de modélisation des Big Data

SRM strategy (1)

• With probability 1-q,

• When h/L is too large, the second term of the inequality becomes large.

• The basic idea of the SRM strategy is to minimize simultaneously both terms on the right-hand side of this upper bound on R(w).

• To do this, one has to make h a controlled parameter.

32

Page 33: Les outils de modélisation des Big Data

SRM strategy (2)

• Let us consider a sequence S1 ⊂ S2 ⊂ … ⊂ Sn of model function families, with respective growing VC dimensions h1 < h2 < … < hn.

• For each family Si of the sequence, the inequality

is valid.

33

Page 34: Les outils de modélisation des Big Data

SRM strategy (3)

SRM: find the index i such that the expected risk R(w) becomes minimal, for a specific h* = hi corresponding to a specific family Si of the sequence; build the model using an f from Si.

[Figure: risk vs. model complexity; the empirical risk decreases, the confidence interval (a function of h/L) increases, and the total risk reaches its minimum at h*, the best model.]

34

Page 35: Les outils de modélisation des Big Data

How to choose h*: cross-validation

• The learning sample of size L is divided in two: a basic learning set of size L1 and a validation set of size L2.

• For a given meta-parameter that controls the richness of the model family S, hence its h, a model is built on the basic learning set and its actual risk is measured on the validation set.

• The meta-parameter is chosen so that the model's actual risk is minimal on the validation set: this yields the best family, i.e. h*.

• The final model is computed from this optimal family: the best trade-off between fit and robustness is achieved by construction.

35

Page 36: Les outils de modélisation des Big Data

Some Learning Machines

• Linear models

• Polynomial models

• Kernel methods

• Neural networks

• Decision trees

36

Page 37: Les outils de modélisation des Big Data

Learning Process

• Learning machines include:

– Linear discriminant (including Naïve Bayes)

– Kernel methods

– Neural networks

– Decision trees, Random Forests

• Learning is tuning:

– Parameters (weights w or α, threshold b)

– Hyperparameters (basis functions, kernels, number of units, number of features/attributes)

37

Page 38: Les outils de modélisation des Big Data

Industrial Data Mining: implementation example

[Diagram: industrial data mining pipeline. A system with inputs x1, x2, x3, …, xn and outputs y1, y2, …, yp feeds the chain: data preparation; data encoding (descriptors, hyper-parameters κ, σ); class of models (polynomials, parameters w); loss criterion (KI, the Gini index); learning algorithm (ridge regression, ridge γ), made automatic via SRM.]

38

Page 39: Les outils de modélisation des Big Data

Data Encoding/Compression

• Encodes nominal and ordinal variables numerically.

• Encodes continuous variables non-linearly.

• Compresses variables into robust categories.

• Handles missing values and outliers.

• This process includes adjustable hyper-parameters.
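A minimal sketch of such an encoding step (my illustration, assuming scikit-learn; the actual engine's encoders are not specified in the slides), with adjustable hyper-parameters feeding a ridge model; the column names and settings are invented for the example.

```python
# Illustration: nominal variables encoded numerically, continuous variables binned
# non-linearly, missing values imputed, then a ridge model. Columns and hyper-parameter
# values (number of bins, ridge alpha) are made up.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

df = pd.DataFrame({
    "region": ["north", "south", "south", "east", "south"],  # nominal variable
    "age": [25.0, 41.0, 37.0, np.nan, 58.0],                  # continuous, one missing value
})
y = np.array([1.0, 0.0, 1.0, 0.0, 1.0])

encoder = ColumnTransformer([
    ("nominal", OneHotEncoder(handle_unknown="ignore"), ["region"]),
    ("continuous", make_pipeline(SimpleImputer(strategy="median"),
                                 KBinsDiscretizer(n_bins=3, encode="onehot-dense")),
     ["age"]),
])
model = make_pipeline(encoder, Ridge(alpha=1.0))  # alpha plays the role of the ridge gamma
model.fit(df, y)
print(model.predict(df))
```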

39

Page 40: Les outils de modélisation des Big Data

Multiple Structures S1 ⊂ S2 ⊂ … ⊂ SN

• Weight decay / Ridge regression:

Sk = { w | ||w||^2 < ωk }, ω1 < ω2 < … < ωk

γ1 > γ2 > γ3 > … > γk (γ is the ridge)

• Feature selection:

Sk = { w | ||w||_0 < σk }, σ1 < σ2 < … < σk (σ is the number of features)

• Data compression:

κ1 < κ2 < … < κk (κ may be the number of clusters)
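A small sketch (illustration only, assuming scikit-learn) of how each hyper-parameter indexes a nested family: the ridge γ bounds ||w||^2, the number of selected features σ bounds ||w||_0, and κ sets the number of clusters used for compression.

```python
# Illustration of the three structures: each hyper-parameter indexes a nested family S_k.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=20, noise=5.0, random_state=0)

# Weight decay / ridge: larger gamma -> smaller ||w||^2, i.e. a smaller family S_k.
for gamma in [100.0, 1.0, 0.01]:
    w = Ridge(alpha=gamma).fit(X, y).coef_
    print(f"gamma={gamma:g}  ||w||^2={np.sum(w**2):.1f}")

# Feature selection: sigma = number of kept features bounds ||w||_0.
for sigma in [2, 5, 10]:
    Xs = SelectKBest(f_regression, k=sigma).fit_transform(X, y)
    print("sigma =", sigma, "-> model uses", Xs.shape[1], "features")

# Data compression: kappa = number of clusters used to summarize the rows.
for kappa in [2, 4, 8]:
    km = KMeans(n_clusters=kappa, n_init=10, random_state=0).fit(X)
    print("kappa =", kappa, "clusters ->", len(np.unique(km.labels_)), "codes")
```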

40

Page 41: Les outils de modélisation des Big Data

Hyper-parameter selection

• w = parameter vector.

• γ, σ, κ = hyper-parameters.

• Cross-validation with K folds:

• For various values of γ, σ, κ:

– Adjust w on the (K-1)/K training examples.

– Test on the remaining 1/K examples.

– Rotate the examples and average the test results (CV error).

– Select γ, σ, κ to minimize the CV error.

– Re-compute w on all training examples using the optimal γ, σ, κ (sketched in code below).

[Figure: the data matrix (X, y) is split into training data, cut into K folds, and test data held out for a prospective study / "real" validation.]
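A compact sketch of this recipe (illustration only, assuming scikit-learn), doing K-fold cross-validation over the ridge hyper-parameter γ and then refitting on all training examples with the selected value:

```python
# Illustration of the slide's recipe: for several values of the hyper-parameter gamma
# (here Ridge's alpha), rotate over K folds, average the CV error, pick the best value,
# then re-fit w on all training examples with that value.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=30, noise=10.0, random_state=0)
folds = KFold(n_splits=5, shuffle=True, random_state=0)

cv_errors = {}
for gamma in [1e-3, 1e-2, 1e-1, 1, 10, 100]:
    scores = cross_val_score(Ridge(alpha=gamma), X, y,
                             cv=folds, scoring="neg_mean_squared_error")
    cv_errors[gamma] = -scores.mean()            # average test error over the K rotations

best_gamma = min(cv_errors, key=cv_errors.get)   # hyper-parameter minimizing the CV error
final_model = Ridge(alpha=best_gamma).fit(X, y)  # re-compute w on all training examples
print("best gamma:", best_gamma, "CV error:", cv_errors[best_gamma])
```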

41

Page 42: Les outils de modélisation des Big Data

SRM put to work: campaign optimization

[Figure: lift curve for campaign optimization. Customers are ordered according to f(x) and the top-ranking customers are selected; x-axis: fraction of customers selected (up to 100%); y-axis: fraction of good customers selected (up to 100%); the ideal lift and the cross-validated (CV) lift are shown, together with the Gini index KI = O/M and an associated index KR computed from the CV lift.]

42

Page 43: Les outils de modélisation des Big Data

Summary

• Weight decay is a powerful means of avoiding overfitting.

• It is also known as "ridge regression".

• It is grounded in SRM theory.

• Multiple structures are used by most current DM engines: ridge, feature selection, data compression.

43

Page 44: Les outils de modélisation des Big Data

Some concrete examples

• Census: explain what makes someone earn more or less than $50,000 per year

• Biostatistical data: feature reduction

44

Page 45: Les outils de modélisation des Big Data

Ockham’s Razor

• Principle proposed by William of Ockham in the fourteenth century: "Pluralitas non est ponenda sine necessitate".

• Of two theories providing similarly good predictions, prefer the simplest one.

• Shave off unnecessary parameters of your models.

45

Page 46: Les outils de modélisation des Big Data

Vision: the predictive modeling workshop

• Data mining / machine learning intervenes upstream to select, from a large set of variables and for a given problem, the "good" variables likely to support useful inference. This step can be "automated".

• One then sets up the appropriate stratification, randomization and RCTs, based on these "particularly interesting" variables.

• One finishes with tests on the results (a step that can also be automated).

• => an accelerator for producing results, towards an ever more effective Evidence-Based Policy

46