Les outils de modélisation des Big Data


1. Big Data modeling tools (Les outils de modélisation des Big Data). SEPIA, 3 Dec 2013. Pr Michel Béra, Chaire de Modélisation statistique du Risque (Chair of Statistical Risk Modeling), CNAM/SITI/IMATH

2. Outline of the talk
  • Vapnik's inequality and the foundations of a new theory of robustness (1971 and 1995)
  • Light shed on classical methods (NN, decision trees, factor analysis)
  • The notion of data geometry and of extended space: the kernel trick
  • Qualitative versus quantitative: an outdated fight
  • Big Data and the Vapnik world, utopias and realities: notions of computational complexity
  • Modern modeling: a chain of approaches, from blind Machine Learning to the subtleties of Evidence-Based Policy

3. Statistical history
[Figure: timeline of statistical history]
  • 1930: Kolmogorov-Smirnov, Fisher
  • 1950: Cramér
  • 1960: Mainframe. Huge datasets start appearing.
  • 1974: VC dimension
  • 1980: SRM (Vapnik)
  • 1995: Support Vector Machines (Vapnik)
  • 2001: Start of the internet era; millions of records and thousands of variables
Labels on the figure: theoretical statistics ("data are as they are"); applied statistics (modeling data, then testing); theory of ill-posed problems; empirical methods of conjuration (PCA, NN, Bayes); the curse of high-dimensional problems; signal markers "STOP!", "Watch out!", "GO!".

4. Part 1. Vapnik's world: the 1995 lecture at Bell Labs (New Jersey)

5. Consistency: definition
1) A learning process (model) is said to be consistent if the model error measured on new data, sampled from the same underlying probability law as the original sample, converges, as the original sample size increases, towards the model error measured on the original sample.
2) A consistent model is also said to generalize well, or to be robust.

6. Consistent training?
[Figure: two plots of % error versus number of training examples, each showing a test error curve and a training error curve.]

7. Generalization: definition
  • The generalization capacity of a model describes how the model will perform (as measured, for example, by an error function) on data it has never seen before, that is, data outside its training set.
  • Good generalization means that the model's errors on new, unknown data will be of the same size as its known errors on the training set.
  • Such a model is also called robust.
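As a rough illustration of the consistency definition of slide 5, the sketch below fits a small polynomial model on samples of growing size and compares the training error with the error on fresh data from the same law. Everything in it is an assumption made for the example and not taken from the talk: the target function `sin(2x)`, the noise level, the degree 3 and the helper name `average_errors`.

```python
import numpy as np

def average_errors(sample_size, degree=3, noise=0.3,
                   n_test=2000, n_repeats=200, seed=0):
    """Return (mean training MSE, mean test MSE) over repeated random samples."""
    rng = np.random.default_rng(seed)

    def target(x):
        # Hypothetical "true" law, chosen only for this illustration.
        return np.sin(2.0 * x)

    train_mse, test_mse = [], []
    for _ in range(n_repeats):
        # Learning sample of size L, drawn i.i.d. from the assumed law.
        x = rng.uniform(-1.0, 1.0, sample_size)
        y = target(x) + noise * rng.normal(size=sample_size)
        # New data sampled from the same law, never seen during fitting.
        x_new = rng.uniform(-1.0, 1.0, n_test)
        y_new = target(x_new) + noise * rng.normal(size=n_test)
        # Least-squares polynomial fit on the learning sample only.
        coeffs = np.polyfit(x, y, degree)
        train_mse.append(np.mean((np.polyval(coeffs, x) - y) ** 2))
        test_mse.append(np.mean((np.polyval(coeffs, x_new) - y_new) ** 2))
    return float(np.mean(train_mse)), float(np.mean(test_mse))

if __name__ == "__main__":
    for L in (10, 100, 1000):
        tr, te = average_errors(L)
        print(f"L={L:5d}  training MSE={tr:.3f}  test MSE={te:.3f}  gap={te - tr:.3f}")
```

Under these assumptions, the gap between test and training error should shrink as L grows, which is the behaviour the definition calls consistency.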
8. Overfitting
[Figure: fitted curve; x axis from -10 to 10, y axis from -0.5 to 1.5.]
Example: polynomial regression. Target: a 10th-degree polynomial + noise. Learning machine: $y = w_0 + w_1 x + w_2 x^2 + \dots + w_{10} x^{10}$

9. Overfitting Avoidance
[Figure: eleven panels showing the fit for d = 10 and r = 0.01, 0.1, 1, 10, 1e+002, 1e+003, 1e+004, 1e+005, 1e+006, 1e+007, 1e+008; x axis from -10 to 10, y axis from -0.5 to 1.5.]
Example: polynomial regression. Target: a 10th-degree polynomial + noise. Learning machine: $y = w_0 + w_1 x + w_2 x^2 + \dots + w_{10} x^{10}$

10. Vapnik approach to modeling (1)
  • Vapnik's approach is based on a family of functions $S = \{f(X, w),\ w \in W\}$, in which a model is chosen as a specific function, described by a specific $w$.
  • For Vapnik, the model function must properly answer, for a given row $X$, the question described by the target $Y$, i.e. predict $Y$; the quality of the answer is measured by a cost function $Q$.
  • Different families of functions may provide the same quality of answer.

11. Vapnik approach to modeling (2)
  • The whole trick is then to find a good family of functions $S$ that not only answers the question described by target $Y$ well, but is also easy to understand, i.e. provides a good description that makes it easy to explain what underlies the behaviour of the data.
  • VC dimension will be a key to understanding and controlling model robustness.

12. VC dimension: definition (1)
  • Consider a sample $(x_1, \dots, x_L)$ from $\mathbb{R}^n$.
  • There are $2^L$ different ways to separate the sample into two sub-samples.
  • A set $S$ of functions $f(X, w)$ shatters the sample if all $2^L$ separations can be defined by different $f(X, w)$ from family $S$.

13. VC dimension: definition (2)
A function family $S$ has VC dimension $h$ ($h$ is an integer) if:
1) At least one sample of $h$ vectors from $\mathbb{R}^n$ can be shattered by functions from $S$.
2) No sample of $h+1$ vectors can be shattered by functions from $S$.

14. Example: VC dimension
VC dimension:
  • Measures the complexity of a solution (function).
  • Is not directly related to the number of variables.

15. Other examples
  • The VC dimension of hyperplanes of $\mathbb{R}^n$ is $n+1$.
  • The VC dimension of the set of functions $f(x, w) = \mathrm{sign}(\sin(w \cdot x))$ is infinite.
  • The VC dimension of the family $S$ of linear models $y = w \cdot x + b$, with a constraint on $w$ governed by a constant $C$, depends on $C$ and can take any value between 0 and $n$. This is the basis for Machine Learning approaches such as SVM (Support Vector Machines) or Ridge Regression.

17. VC dimension: interpretation
  • VC dimension of $S$: an integer that measures the shattering (or separating) power, i.e. the complexity, of function family $S$.
  • We shall now show that VC dimension (through a major theorem from Vapnik) gives a powerful indication of model consistency, hence robustness.

18. What is a Risk Functional?
A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task.
[Figure: risk $R[f(x, w)]$ plotted over the parameter space ($w$), with its minimum at $w^*$.]

19. Examples of Risk Functionals
  • Classification: error rate, AUC
  • Regression: mean square error

20. Lift Curve
[Figure: lift curve. x axis: fraction of customers selected (up to 100%); y axis: fraction of good customers selected (up to 100%); the ideal lift is shown for reference. Customers are ordered according to $f(x)$, and the top-ranking customers are selected.]
Gini index: $0 \le KI \le 1$ (the slide's formula for KI, which involves the labels O and M, is not legible in this transcript).
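To make the shattering definition of slides 12 and 13 concrete, here is a minimal sketch that enumerates all $2^L$ labelings of a small sample and checks each one for linear separability. It uses a plain perceptron with a capped number of passes as the separability test, which is a heuristic chosen for brevity (a `False` answer only means that no separating hyperplane was found within `max_epochs`); the function names are hypothetical.

```python
from itertools import product

def linearly_separable(points, labels, max_epochs=1000):
    """Perceptron check: True if a hyperplane w.x + b separating the labels is found.

    Heuristic: False only means none was found within max_epochs passes.
    """
    dim = len(points[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:
                # Misclassified (or on the boundary): standard perceptron update.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False

def shattered_by_hyperplanes(points):
    """True if every one of the 2^L labelings of the sample is linearly separable."""
    return all(linearly_separable(points, labels)
               for labels in product((-1, 1), repeat=len(points)))

if __name__ == "__main__":
    print(shattered_by_hyperplanes([(0, 0), (1, 0), (0, 1)]))          # three points in general position
    print(shattered_by_hyperplanes([(0, 0), (1, 0), (2, 0)]))          # three collinear points
    print(shattered_by_hyperplanes([(0, 0), (1, 1), (1, 0), (0, 1)]))  # four corners of a square
```

For hyperplanes in $\mathbb{R}^2$ the VC dimension is 3 ($n + 1$ with $n = 2$): some sample of three points is shattered, while the four corners of a square are not because of the XOR labeling. The collinear case shows that whether a sample is shattered depends on the sample chosen.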
21. Statistical Learning: Theoretical Foundations
  • Structural Risk Minimization
  • Regularization
  • Weight decay
  • Feature selection
  • Data compression

22. Learning Theory Problem (1)
A model computes a function: $y = f(X, w)$
Problem: minimize in $w$ the Risk Expectation $R(w) = \int Q(z, w)\, dP(z)$
  • $w$: a parameter that specifies the chosen model
  • $z = (X, y)$ are possible values for the attributes (variables)
  • $Q$ measures (quantifies) the model error cost
  • $P(z)$ is the underlying (unknown) probability law of the data $z$

23. Learning Theory Problem (2)
  • We get $L$ data points from the learning sample $(z_1, \dots, z_L)$, and we suppose them i.i.d. sampled from the law $P(z)$.
  • To minimize $R(w)$, we start by minimizing the Empirical Risk over this sample: $E(w) = \frac{1}{L} \sum_{i=1}^{L} Q(z_i, w)$
  • Examples of classical cost functions: classification (e.g. $Q$ can be based on the cost of misclassified points); regression (e.g. $Q$ can be a cost function of least squares type).

24. Learning Theory Problem (3)
Central problem for Statistical Learning Theory:
  • What is the relation between the Risk Expectation $R(w)$ and the Empirical Risk $E(w)$?
  • How can a generalization capacity (robustness) be defined and measured for a model?

25. Four Pillars for SLT (1 and 2)
  • Consistency (guarantees generalization): under what conditions will a model be consistent?
  • Model convergence speed (a measure of generalization capacity): how does generalization capacity improve when sample size $L$ grows?

26. Four Pillars for SLT (3 and 4)
  • Generalization capacity control: how can model generalization be controlled efficiently, starting from the only information we have, our sample data?
  • A strategy for good learning algorithms: is there a strategy that guarantees, measures and controls the generalization capacity of our learning model?

27. Vapnik main theorem
Q: Under which conditions will a learning process (model) be consistent?
A: A model will be consistent if and only if the function $f$ that defines the model comes from a family of functions $S$ with finite VC dimension $h$. A finite VC dimension $h$ not only guarantees generalization capacity (consistency): picking $f$ in a family $S$ with finite VC dimension $h$ is the only way to build a model that generalizes.

28. Model convergence speed (generalization capacity)
Q: What is the nature of the difference in model risk between learning data (sample: empirical risk) and test data (expected risk), for a sample of finite size $L$?
A: This difference is no greater than a bound that depends only on the ratio between the VC dimension $h$ of the model function family $S$ and the sample size $L$, i.e. $h/L$. This statement is a new theorem in the Kolmogorov-Smirnov tradition, i.e. theorems whose results do not depend on the data's underlying probability law.

29. Empirical risk minimization in LS case
With probability $1-q$, the following inequality is true: [inequality missing from the transcript], where $w_0$ is the value of the parameter $w$ that minimizes the Empirical Risk: $w_0 = \arg\min_w E(w)$

30. Model convergence speed
[Figure: % error versus sample size $L$; expected risk (test data) and empirical risk (learning sample) curves, separated by the confidence interval.]

31. SRM methodology: how to control model generalization capacity
  • Expected Risk ≤ Empirical Risk + Confidence Interval
  • Minimizing the Empirical Risk alone will not always give good generalization capacity: one wants to minimize the sum of Empirical Risk and Confidence Interval.
  • What matters is not the numerical value of Vapnik's bound, most often too large to be of any practical use, but the fact that this bound is a non-decreasing function of the richness of the model function family, i.e. its shattering power.

32. SRM strategy (1)
  • With probability $1-q$: [inequality missing from the transcript]
  • When $h/L$ is too large, the second term of the inequality becomes large.
  • The basic idea of the SRM strategy is to minimize simultaneously both terms on the right-hand side of this upper bound on $R(w)$.
  • To do this, one has to make $h$ a controlled parameter.

33. SRM strategy (2)
  • Consider a sequence $S_1 \subset S_2 \subset \dots \subset S_n$ of model function families, with respective growing VC dimensions $h_1 < h_2 < \dots < h_n$.
  • For each family $S_i$ of the sequence, the inequality [missing from the transcript] is valid.

34. SRM strategy (3)
SRM: find $i$ such that the expected risk $R(w)$ becomes minimum, for a specific $h^* = h_i$ relating to a specific family $S_i$ of the sequence; build the model using $f$ from $S_i$.
[Figure: risk versus model complexity ($h/L$); curves for empirical risk, confidence interval and total risk; the best model is at $h^*$.]

35. How to choose h*: cross-validation
  • The learning sample of size $L$ is divided in two: a basic learning set of size $L_1$ and a validation set of size $L_2$.
  • For a given meta-parameter that controls the richness of the model family $S$, hence its $h$, a model is built on the basic learning set, and its actual risk is measured on the validation set.
  • The meta-parameter is chosen so that the model's actual risk is minimum on the validation set: this leads to the best family, i.e. $h^*$.
  • The final model is computed from this optimal family: the best trade-off between fit and robustness is achieved by construction.

36. Some Learning Machines
  • Linear models
  • Polynomial models
  • Kernel methods
  • Neural networks
  • Decision trees

37. Learning Process
Learning machines include:
  • Linear discriminant (including Naïve Bayes)
  • Kernel methods
  • Neural networks
  • Decision trees, Random Forests
Learning is tuning:
  • Parameters (weights $w$ or $\alpha$, threshold $b$)
  • Hyperparameters (basis functions, kernels, number of units, number of features/attributes)

38. Industrial Data Mining: implementation example
[Figure: block diagram. Inputs $x_1, \dots, x_n$ enter a system that produces outputs $y_1, \dots, y_p$. Box labels: Data Preparation, Data Encoding, Descriptors, Class of Models, Learning Algorithm, Loss Criterion. Annotations: Automatic via SRM, Ridge regression, KI (Gini index), Polynomials.]

39. Data Encoding/Compression
  • Encodes nominal and ordinal variables numerically.
  • Encodes continuous variables non-linearly.
  • Compresses variables into robust categories.
  • Handles missing values and outliers.
  • This process includes adjustable hyperparameters.
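The hold-out procedure of slide 35 can be sketched as follows, using the ingredients named on slide 38 (polynomials and ridge regression). This is a minimal sketch under stated assumptions, not the talk's implementation: the ridge value plays the role of the meta-parameter controlling the richness of the family, inputs are assumed to lie in [-1, 1] to keep the monomial basis well conditioned, and the helper names (`poly_features`, `ridge_fit`, `select_ridge`) are hypothetical.

```python
import numpy as np

def poly_features(x, degree):
    """Design matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(A, y, ridge):
    """Solve the ridge normal equations (A^T A + ridge * I) w = A^T y."""
    return np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]), A.T @ y)

def mse(A, y, w):
    return float(np.mean((A @ w - y) ** 2))

def select_ridge(x, y, ridges, degree=10, val_fraction=0.3, seed=0):
    """Pick the ridge value with the lowest risk on a held-out validation set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_val = int(len(x) * val_fraction)
    val, train = idx[:n_val], idx[n_val:]          # L2 and L1 in the slide's notation
    A_tr = poly_features(x[train], degree)
    A_val = poly_features(x[val], degree)
    results = []
    for r in ridges:
        w = ridge_fit(A_tr, y[train], r)           # built on the basic learning set
        results.append((r, mse(A_tr, y[train], w), mse(A_val, y[val], w)))
    best_ridge = min(results, key=lambda t: t[2])[0]
    # Final model: refit on the whole sample with the selected meta-parameter.
    w_final = ridge_fit(poly_features(x, degree), y, best_ridge)
    return best_ridge, w_final, results

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.uniform(-1.0, 1.0, 80)
    y = np.sin(3.0 * x) + 0.2 * rng.normal(size=80)   # hypothetical data
    best, w, table = select_ridge(x, y, [1e-6, 1e-4, 1e-2, 1.0, 100.0])
    for r, tr, va in table:
        print(f"ridge={r:g}  training MSE={tr:.4f}  validation MSE={va:.4f}")
    print("selected ridge:", best)
```

A larger ridge shrinks the weights and so restricts the family, trading a higher empirical risk for a smaller gap between learning and validation error; the validation set is what arbitrates between the two.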
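40. Multiple Structures
[Figure: structures $S_1, S_2, \dots, S_N$]
  • Weight decay/Ridge regression: $S_k = \{ w \mid \|w\|^2 < \omega_k \}$, $\gamma_1 > \dots > \gamma_k$ ($\gamma$ is the ridge)
  • Feature selection: $S_k = \{ w \mid \|w\|_0 < \sigma_k \}$, 1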