Big Data Modelling Tools

SEPIA3 dec 13
Pr Michel Béra, Chaire de Modélisation statistique du Risque
CNAM/SITI/IMATH
• Outline
– Vapnik's inequality and the foundations of a new theory of robustness (1971 and 1995)
– Perspectives on classical methods (NN, decision trees, factor analysis)
– The notion of data geometry and of extended spaces – the kernel trick – qualitative vs. quantitative: an outdated battle
– Big Data and the Vapnikian world, utopias and realities – notions of computational complexity
– Modern modelling: a chain of approaches, from blind Machine Learning to the subtleties of Evidence-Based Policy
Statistical history

[Slide diagram: a timeline of statistics, annotated with traffic-light warnings (STOP! / Watch out! / GO!):
• 1930 – Kolmogorov-Smirnov, Fisher; theoretical statistics ("data are as they are")
• 1950 – Cramér; applied statistics ("modeling data, then testing")
• 1960 – mainframes; huge datasets start appearing; the curse of high-dimensional problems; theory of ill-posed problems; empirical methods of conjuration (PCA, NN, Bayes)
• 1974 – VC dimension
• 1980 – SRM (Vapnik)
• 1995 – Support Vector Machines (Vapnik)
• 2001 – start of the internet era: millions of records and thousands of variables]
1. Vapnik's world – the 1995 Bell Labs (New Jersey) lecture
Consistency: definition
1) A learning process (model) is said to be consistent if the model error, measured on new data sampled from the same underlying probability law as the original sample, converges, as the original sample size increases, towards the model error measured on the original sample.
2) A model that is consistent is also said to generalize well, or to be robust.
[Figure: two plots of % error vs. number of training examples, each showing the test-error and training-error curves – one illustrating consistent training (curves converge), the other not.]
Generalization: definition
• The generalization capacity of a model describes how well (e.g. in terms of an error function) the model will perform on data it has never seen before (i.e. outside its training set).
• Good generalization means that the model's error on new, unknown data will be of the same "size" as its known error on the training set. Such a model is also called "robust".
Overfitting

[Figure: 10th-degree polynomial fit on noisy data; x ∈ [-10, 10], y ∈ [-0.5, 1.5].]

Example: polynomial regression. Target: a 10th-degree polynomial + noise.
Learning machine: y = w0 + w1·x + w2·x² + … + w10·x¹⁰
Overfitting Avoidance

[Figure: eleven panels showing the same 10th-degree polynomial fit with an increasing ridge parameter r = 0.01, 0.1, 1, 10, 100, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8; x ∈ [-10, 10], y ∈ [-0.5, 1.5].]

Example: polynomial regression. Target: a 10th-degree polynomial + noise.
Learning machine: y = w0 + w1·x + w2·x² + … + w10·x¹⁰
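The slide's experiment can be sketched in a few lines. This is a minimal illustration assuming NumPy; the target function and the values of r are stand-ins, not the slide's exact data: fit a 10th-degree polynomial with a ridge penalty r and watch the weights shrink as r grows.

```python
# Ridge-penalized polynomial regression: minimize ||Xw - y||^2 + r*||w||^2.
# Small r ~ wild overfitting; large r ~ smoother, more robust fit.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-10, 10, 30)
y = np.sin(x / 3) + 0.1 * rng.standard_normal(x.size)  # stand-in noisy target

def ridge_poly_fit(x, y, degree=10, r=1.0):
    """Closed-form ridge solution for y = w0 + w1*x + ... + w_d*x^d."""
    X = np.vander(x / 10.0, degree + 1, increasing=True)  # rescale x to keep powers tame
    w = np.linalg.solve(X.T @ X + r * np.eye(degree + 1), X.T @ y)
    return w, X

for r in [1e-6, 1e-2, 1e2]:
    w, X = ridge_poly_fit(x, y, r=r)
    train_err = np.mean((X @ w - y) ** 2)
    print(f"r={r:g}  ||w||={np.linalg.norm(w):.2f}  train MSE={train_err:.4f}")
```

Larger r shrinks ‖w‖ and raises the training error: exactly the fit-vs-robustness trade-off the panels display.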
Vapnik's approach to modeling (1)
• Vapnik's approach is based on a family of functions S = {f(X, w), w ∈ W}, in which a model is chosen as one specific function, described by a specific w.
• For Vapnik, the model function must properly answer, for a given row X, the question described by the target Y, i.e. predict Y; the quality of the answer is measured by a cost function Q.
• Different families of functions may provide the same "quality" of answer.
Vapnik's approach to modeling (2)
• The whole trick is then to find a good family of functions S that not only answers the question described by target Y in a "good way", but is also easy to understand, i.e. provides a good description, making it easy to explain what underlies the data behaviour of the problem.
• The VC dimension will be the key to understanding and controlling model robustness.
VC dimension – definition (1)
• Let us consider a sample (x1, …, xL) from Rⁿ.
• There are 2^L different ways to separate the sample into two sub-samples.
• A set S of functions f(X, w) shatters the sample if all 2^L separations can be realized by different f(X, w) from the family S.
VC dimension – definition (2)
A function family S has VC dimension h (an integer) if:
1) there exists a sample of h vectors from Rⁿ that can be shattered by functions from S, and
2) no sample of h+1 vectors can be shattered by functions from S.
Example: VC dimension

VC dimension:
– measures the complexity of a solution (function);
– is not directly related to the number of variables.
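To make the definition concrete, here is a minimal sketch (plain Python, helper names invented for illustration) that brute-forces the shattering check for the 1-D threshold family f(x, w) = sign(x − w): any single point can be shattered, but no pair of points can, so this family has VC dimension 1.

```python
# Brute-force shattering check for the 1-D threshold family
# f(x, w) = +1 if x >= w else -1. It shatters any one point
# but never two points, so its VC dimension is 1.
def labelings(points, thresholds):
    """All distinct labelings of `points` realizable by f(x, w) = sign(x - w)."""
    out = set()
    for w in thresholds:
        out.add(tuple(1 if x >= w else -1 for x in points))
    return out

def shatters(points):
    # Candidate thresholds: below, between, and above the sorted points.
    pts = sorted(points)
    thresholds = [pts[0] - 1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1]
    return len(labelings(points, thresholds)) == 2 ** len(points)

print(shatters([0.0]))        # one point: both labelings reachable
print(shatters([0.0, 1.0]))   # two points: (+1, -1) is never produced
```

The missing labeling for two points is "left point positive, right point negative": a single threshold can never achieve it.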
Other examples
• The VC dimension of hyperplanes of Rⁿ is n+1.
• The VC dimension of the set of functions
  f(x, w) = sign(sin(w·x)), c ≤ x ≤ 1, c > 0,
  where w is a free parameter, is infinite.
– Conclusion: the VC dimension is not always equal to the number n of parameters of a given family S of functions from Rⁿ to {-1, +1}.
Key example: linear models y = ⟨w|x⟩ + b
• The VC dimension of the family S of linear models, with ‖w‖ bounded by a constant C, depends on C and can take any value between 0 and n.
• This is the basis for Machine Learning approaches such as SVM (Support Vector Machines) and Ridge Regression.
VC dimension: interpretation
• The VC dimension of S is an integer that measures the shattering (or separating) power – the "complexity" – of the function family S.
• We shall now show that the VC dimension (via a major theorem of Vapnik) gives a powerful handle on model consistency, hence "robustness".
What is a Risk Functional?
• A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task.

[Figure: risk functional R[f(x, w)] over the parameter space (w), with its minimizer w*.]
Examples of Risk Functionals
• Classification:
– Error rate
– AUC
• Regression:
– Mean square error
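As a minimal sketch (plain Python, toy data), the two risk functionals named above, evaluated empirically on a small labeled sample:

```python
# Empirical versions of two risk functionals from the slide:
# classification error rate and regression mean square error.
def error_rate(y_true, y_pred):
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_square_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

print(error_rate([1, -1, 1, 1], [1, 1, 1, -1]))    # 2 of 4 misclassified -> 0.5
print(mean_square_error([0.0, 1.0], [0.5, 1.0]))   # (0.25 + 0) / 2 -> 0.125
```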
Lift Curve

[Figure: lift curve – fraction of good customers selected vs. fraction of customers selected, with the ideal lift; both axes run to 100%. Customers are ordered according to f(x), and the top-ranking customers are selected. KI = O/M; Gini index: 0 ≤ KI ≤ 1.]
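The idea behind the curve can be sketched in plain Python (function name invented for illustration): rank customers by their score f(x), then track the cumulative fraction of "good" customers captured as a growing fraction of the ranked list is selected.

```python
# Lift computation sketch: order customers by score f(x), then measure
# the cumulative fraction of "good" customers captured at each depth.
def cumulative_lift(scores, is_good):
    """Return list of (fraction selected, fraction of good customers captured)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_good = sum(is_good)
    pts, captured = [], 0
    for k, i in enumerate(order, start=1):
        captured += is_good[i]
        pts.append((k / len(scores), captured / total_good))
    return pts

scores = [0.9, 0.1, 0.8, 0.4, 0.7]
is_good = [1, 0, 1, 0, 1]
for frac_sel, frac_good in cumulative_lift(scores, is_good):
    print(f"{frac_sel:.0%} selected -> {frac_good:.0%} of good customers")
```

A well-ranked list captures the good customers early; the area between this curve and the diagonal, normalized by the ideal lift's area, gives a Gini-type index between 0 and 1.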
Statistical Learning Theoretical Foundations
• Structural Risk Minimization
• Regularization
• Weight decay
• Feature selection
• Data compression
Learning Theory Problem (1)
• A model computes a function f(X, w).
• Problem: minimize over w the Risk Expectation
  R(w) = ∫ Q(z, w) dP(z)
– w: a parameter that specifies the chosen model
– z = (X, y): possible values for the attributes (variables) and the target
– Q: measures (quantifies) the model error cost
– P(z): the underlying (unknown) probability law for the data z
Learning Theory Problem (2)
• We get L data points (z1, …, zL) from the learning sample, assumed sampled i.i.d. from the law P(z).
• To minimize R(w), we start by minimizing the Empirical Risk over this sample:
  E(w) = (1/L) Σᵢ Q(zᵢ, w)
• Examples of classical cost functions:
– classification (e.g. Q can be a cost function based on the cost of misclassified points)
– regression (e.g. Q can be a least-squares cost function)
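A minimal sketch of empirical risk minimization in the least-squares case (plain Python, toy sample): E(w) = (1/L) Σᵢ (yᵢ − w·xᵢ)², minimized in closed form for a one-parameter linear model.

```python
# Empirical risk E(w) for least squares, and its closed-form minimizer w0
# for the one-parameter model y ~ w * x.
def empirical_risk(w, sample):
    return sum((y - w * x) ** 2 for x, y in sample) / len(sample)

def erm_least_squares(sample):
    """w0 = argmin_w E(w): set dE/dw = 0, giving w0 = (sum x*y) / (sum x*x)."""
    sxx = sum(x * x for x, _ in sample)
    sxy = sum(x * y for x, y in sample)
    return sxy / sxx

sample = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
w0 = erm_least_squares(sample)
print(w0, empirical_risk(w0, sample))
```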
Learning Theory Problem (3)
• Central problem for Statistical Learning Theory: what is the relation between the Risk Expectation R(w) and the Empirical Risk E(w)?
• How can we define and measure a generalization capacity ("robustness") for a model?
Four Pillars for SLT (1 and 2)
• Consistency (guarantees generalization)
– Under what conditions will a model be consistent ?
• Model convergence speed (a measure for generalization capacity)
– How does generalization capacity improve when sample size L grows?
Four Pillars for SLT (3 and 4)
• Generalization capacity control
– How to control in an efficient way model generalization starting with the only given information we have: our sample data?
• A strategy for good learning algorithms
– Is there a strategy that guarantees, measures and controls our learning model generalization capacity ?
Vapnik's main theorem
• Q: Under which conditions will a learning process (model) be consistent?
• A: A model will be consistent if and only if the function f that defines the model comes from a family of functions S with finite VC dimension h.
• A finite VC dimension h not only guarantees generalization capacity (consistency): picking f in a family S with finite VC dimension h is the only way to build a model that generalizes.
Model convergence speed (generalization capacity)
• Q: What is the nature of the difference in model risk between learning data (empirical risk on the sample) and test data (expected risk), for a sample of finite size L?
• A: This difference is no greater than a bound that depends only on the ratio between the VC dimension h of the model function family S and the sample size L, i.e. on h/L.
This is a new theorem in the Kolmogorov-Smirnov tradition, i.e. a distribution-free result that does not depend on the data's underlying probability law.
Empirical risk minimization in the LS case
• With probability 1−q, the following inequality holds:
  R(w0) ≤ E(w0) + √( ( h (ln(2L/h) + 1) − ln(q/4) ) / L )
where w0 is the parameter value that minimizes the Empirical Risk:
  w0 = argmin_w E(w)
Model convergence speed

[Figure: % error vs. sample size L – the expected risk (test data) and empirical risk (learning sample) curves converge as L grows; the gap between them is the confidence interval.]
"SRM" methodology: how to control model generalization capacity

Expected Risk = Empirical Risk + Confidence Interval

• Minimizing the Empirical Risk alone will not always give good generalization capacity: one wants to minimize the sum of the Empirical Risk and the Confidence Interval.
• What matters is not the numerical value of Vapnik's bound, which is most often too large to be of any practical use; it is the fact that this bound is a non-decreasing function of the model family's "richness", i.e. of its shattering power.
SRM strategy (1)
• With probability 1−q, R(w) is bounded by the Empirical Risk plus a confidence term depending on h and L.
• When h/L is too large, the second term of the inequality becomes large.
• The basic idea of the SRM strategy is to minimize simultaneously both terms on the right-hand side of this bounding inequality for R(w).
• To do this, one has to make h a controlled parameter.
SRM strategy (2)
• Let us consider a sequence S1 ⊂ S2 ⊂ … ⊂ Sn of model function families, with respective growing VC dimensions h1 < h2 < … < hn.
• For each family Si of the sequence, the bounding inequality (with hi in place of h) is valid.
SRM strategy (3)

SRM: find i such that the expected risk R(w) becomes minimum, for a specific h* = hi corresponding to a specific family Si of the sequence; build the model using f from Si.

[Figure: risk vs. model complexity – the empirical risk decreases with complexity while the confidence interval (in h/L) increases; their sum, the total risk, is minimized at the best model, h*.]
How to choose h*: cross-validation
• The learning sample of size L is divided in two: a basic learning set of size L1 and a validation set of size L2.
• For a given meta-parameter that controls the richness of the model family S, hence its h, a model is built on the basic learning set and its actual risk is measured on the validation set.
• The meta-parameter is chosen so that the model's actual risk is minimal on the validation set: this selects the best family, i.e. h*.
• The final model is computed from this optimal family: the best trade-off between fit and robustness is achieved by construction.
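The recipe above can be sketched as follows (assuming NumPy; the data and the choice of polynomial degree as the richness-controlling meta-parameter are illustrative, not from the slide): split the sample into a learning set and a validation set, sweep the meta-parameter, and keep the value whose validation risk is smallest.

```python
# Hold-out selection of a richness meta-parameter: polynomial degree
# (which drives h). Fit on the basic learning set, score on validation.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 60)
y = 1 - 2 * x + 3 * x**2 + 0.1 * rng.standard_normal(60)  # quadratic target + noise

x_learn, y_learn = x[:40], y[:40]   # basic learning set (size L1)
x_val, y_val = x[40:], y[40:]       # validation set (size L2)

def risk(w, x, y):
    """Mean square error of the polynomial with coefficients w."""
    return np.mean((np.polyval(w, x) - y) ** 2)

val_risks = {d: risk(np.polyfit(x_learn, y_learn, d), x_val, y_val)
             for d in range(11)}
best_d = min(val_risks, key=val_risks.get)
print("degree chosen by validation:", best_d)
```

Low degrees underfit (high validation risk), high degrees overfit; the validation minimum picks the family with the right trade-off.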
Some Learning Machines
• Linear models
• Polynomial models
• Kernel methods
• Neural networks
• Decision trees
Learning Process
• Learning machines include:
– linear discriminants (including Naïve Bayes)
– kernel methods
– neural networks
– decision trees, Random Forests
• Learning is tuning:
– parameters (weights w or α, threshold b)
– hyperparameters (basis functions, kernels, number of units, number of features/attributes)
Industrial Data Mining: implementation example

[Figure: modelling pipeline – inputs x1, x2, x3, …, xn and system outputs y1, y2, …, yp feed a data-preparation stage, then: data encoding (descriptors kx, ky; compression parameter κ), a class of models (polynomials; feature-selection parameter σ), a learning algorithm (ridge regression; ridge γ; weights w; tuned automatically via SRM), and a loss criterion (KI, the Gini index).]
Data Encoding/Compression
• Encodes nominal and ordinal variables numerically.
• Encodes continuous variables non-linearly.
• Compresses variables into robust categories.
• Handles missing values and outliers.
• This process includes adjustable hyper-parameters.
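A minimal sketch (plain Python, invented helper names) of the kind of encoding described above: compress a continuous variable into a few robust quantile-based categories, with missing values sent to their own category rather than dropped.

```python
# Quantile-based compression of a continuous variable into robust
# categories; None (missing) gets its own category.
def quantile_bins(values, n_bins=4):
    """Cut points at the empirical quantiles of the non-missing values."""
    clean = sorted(v for v in values if v is not None)
    return [clean[int(len(clean) * k / n_bins)] for k in range(1, n_bins)]

def encode(value, cuts):
    if value is None:
        return "missing"                   # missing values are handled, not dropped
    return sum(value >= c for c in cuts)   # bin index 0..len(cuts)

raw = [3.2, None, 1.1, 7.8, 2.5, 9.0, 4.4, 0.3]
cuts = quantile_bins(raw)
print([encode(v, cuts) for v in raw])
```

Quantile cuts also blunt the effect of outliers: an extreme value simply lands in the top category, which is one reason such encodings are "robust".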
Multiple Structures: S1 ⊂ S2 ⊂ … ⊂ SN
• Weight decay / ridge regression:
  Sk = { w : ‖w‖² < ωk }, ω1 < ω2 < … < ωk,
  with ridges γ1 > γ2 > γ3 > … > γk (γ is the ridge).
• Feature selection:
  Sk = { w : ‖w‖₀ < σk }, σ1 < σ2 < … < σk (σ is the number of features).
• Data compression:
  κ1 < κ2 < … < κk (κ may be the number of clusters).
Hyper-parameter selection
• w = parameter vector; γ, σ, κ = hyper-parameters.
• Cross-validation with K folds. For various values of γ, σ, κ:
– adjust w on (K−1)/K of the training examples;
– test on the remaining 1/K of the examples;
– rotate the folds and average the test results (CV error);
– select γ, σ, κ to minimize the CV error;
– re-compute w on all training examples using the optimal γ, σ, κ.

[Figure: data matrix (X, y) – the training data is split into K folds; separate test data is reserved for a prospective study / "real" validation.]
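The K-fold loop above can be sketched as follows (assuming NumPy; the toy data and candidate γ values are illustrative): for each candidate ridge γ, average the held-out error over the K rotations, then refit on all training examples with the best γ.

```python
# K-fold cross-validation to select the ridge hyper-parameter gamma,
# then a final refit on all training examples with the chosen gamma.
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
y = X @ w_true + 0.2 * rng.standard_normal(40)

def ridge_fit(X, y, gamma):
    return np.linalg.solve(X.T @ X + gamma * np.eye(X.shape[1]), X.T @ y)

def cv_error(X, y, gamma, K=5):
    folds = np.array_split(np.arange(len(y)), K)
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w = ridge_fit(X[train], y[train], gamma)       # adjust w on K-1 folds
        errs.append(np.mean((X[test] @ w - y[test]) ** 2))  # test on the remaining fold
    return np.mean(errs)                               # rotate and average (CV error)

gammas = [1e-3, 1e-1, 1e1, 1e3]
best_gamma = min(gammas, key=lambda g: cv_error(X, y, g))
w_final = ridge_fit(X, y, best_gamma)                  # re-compute w on all examples
print("best gamma:", best_gamma)
```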
SRM put to work: campaign optimization

[Figure: the same lift curve as before – fraction of good customers selected vs. fraction of customers selected, ideal lift, customers ordered according to f(x), KI = O/M – with the cross-validated lift G added and the criterion KR = 1 − G/O.]
Summary
• Weight decay is a powerful means of overfitting avoidance.
• It is also known as "ridge regression".
• It is grounded in SRM theory.
• Multiple structures are used by most current DM engines: ridge, feature selection, data compression.
Some concrete examples
• Census: explain what makes someone earn more or less than $50,000/year.
• Biostatistics data: feature reduction.
Ockham's Razor
• Principle proposed by William of Ockham in the fourteenth century: "Pluralitas non est ponenda sine necessitate".
• Of two theories providing similarly good predictions, prefer the simpler one.
• Shave off unnecessary parameters of your models.
Vision: the predictive modelling workshop
• Data mining / machine learning comes in upstream to select, from a large set of variables and for a given problem, the "good" variables likely to support useful inference. This step can be "automated".
• One then sets up the appropriate stratification, randomization and RCTs, based on these "particularly interesting" variables.
• One finishes with tests on the results (a step that can also be automated).
• => an accelerator of result production, for an ever more effective Evidence-Based Policy.