Post on 20-Jun-2015
Arthur CHARPENTIER - Analyse des donnees
Analyse des donnees (5)
Discriminant Analysis, or Scoring
Arthur Charpentier
http://perso.univ-rennes1.fr/arthur.charpentier/
blog.univ-rennes1.fr/arthur.charpentier/
Master 2, Universite Rennes 1
Discriminant analysis
We seek here to discriminate between two or more classes, defined by the levels of a qualitative variable Y, using a number of explanatory variables X1, ..., Xk (called predictors), assumed quantitative.
The classes are defined a priori (via the variable Y). Two types of discrimination are carried out in practice:
• descriptive: we look for the explanatory variables (Xj) that discriminate best between the classes;
• predictive: we seek to assign an individual to a class based on its explanatory variables; this is then called scoring.
We then look for the explanatory variables that are most discriminating with respect to the given classes,
and can determine which group an individual belongs to from its characteristics.
Compared with (unsupervised) classification techniques, we intervene here a posteriori: Y is the class (which we seek to explain).
Introductory example: myocardial infarction
Consider the following dataset, taken from Saporta (1990), on victims of myocardial infarction, measured at admission: heart rate (FRCAR), a cardiac index (INCAR), systolic index (INSYS), diastolic pressure (PRDIA), pulmonary arterial pressure (PAPUL), ventricular pressure (PVENT) and pulmonary resistance (REPUL).
> (MYOCARDE=read.table("http://perso.univ-rennes1.fr/arthur.charpentier/
+ saporta.csv",head=TRUE,sep=";"))
FRCAR INCAR INSYS PRDIA PAPUL PVENT REPUL PRONO
1 90 1.71 19.0 16 19.5 16.0 912 SURVIE
2 90 1.68 18.7 24 31.0 14.0 1476 DECES
3 120 1.40 11.7 23 29.0 8.0 1657 DECES
4 82 1.79 21.8 14 17.5 10.0 782 SURVIE
5 80 1.58 19.7 21 28.0 18.5 1418 DECES
6 80 1.13 14.1 18 23.5 9.0 1664 DECES
7 94 2.04 21.7 23 27.0 10.0 1059 SURVIE
8 80 1.19 14.9 16 21.0 16.5 1412 SURVIE
9 78 2.16 27.7 15 20.5 11.5 759 SURVIE
10 100 2.28 22.8 16 23.0 4.0 807 SURVIE
11 90 2.79 31.0 16 25.0 8.0 717 SURVIE
12 86 2.70 31.4 15 23.0 9.5 681 SURVIE
13 80 2.61 32.6 8 15.0 1.0 460 SURVIE
We try to understand who will survive the infarction, and who will die.
We can start with some descriptive statistics on the two subgroups.
> apply(MYOCARDE[MYOCARDE$PRONO=="DECES",1:7],2,mean)
FRCAR INCAR INSYS PRDIA PAPUL PVENT REPUL
91.551724 1.397931 15.531034 21.448276 28.431034 11.844828 1738.689655
> apply(MYOCARDE[MYOCARDE$PRONO=="SURVIE",1:7],2,mean)
FRCAR INCAR INSYS PRDIA PAPUL PVENT REPUL
87.690476 2.318333 27.202381 15.976190 22.202381 8.642857 817.214286
> apply(MYOCARDE[MYOCARDE$PRONO=="DECES",1:7],2,sd)
FRCAR INCAR INSYS PRDIA PAPUL PVENT REPUL
15.2844136 0.3808954 4.4162932 5.0750525 7.1009609 4.4843049 616.3684023
> apply(MYOCARDE[MYOCARDE$PRONO=="SURVIE",1:7],2,sd)
FRCAR INCAR INSYS PRDIA PAPUL PVENT REPUL
14.589485 0.574388 8.484433 5.125204 6.574210 4.219996 313.039508
[Figure: boxplots of the seven explanatory variables, by prognosis (DECES vs SURVIE).]
Assuming that we have Gaussian vectors, we can test the global equality of the group means via a Fisher-type test,
> MYOCARDE.manova<-manova(cbind(FRCAR,INCAR,INSYS,PRDIA,PAPUL,PVENT,REPUL)~PRONO,data=MYOCARDE)
> MYOCARDE.manova
Call:
manova(cbind(FRCAR, INCAR, INSYS, PRDIA, PAPUL, PVENT, REPUL) ~
PRONO, data = MYOCARDE)
Terms:
PRONO Residuals
resp 1 256 15268
resp 2 15 18
resp 3 2337 3498
resp 4 514 1798
resp 5 666 3184
resp 6 176 1293
resp 7 14566540 14655223
Deg. of Freedom 1 69
Residual standard error: 14.8754 0.50489 7.119591 5.104912 6.79289 4.329197 460.8628
Estimated effects may be unbalanced
> summary(MYOCARDE.manova,test="Wilks")
Df Wilks approx F num Df den Df Pr(>F)
PRONO 1 0.4545 10.8034 7 63 7.312e-09 ***
Residuals 69
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
The discriminant variable is obtained as a linear combination of the 7 variables, centered on the overall mean of the 2 groups (using lda from the MASS package).
> lda(PRONO~.,data=MYOCARDE)
Call:
lda(PRONO ~ ., data = MYOCARDE)
Prior probabilities of groups:
DECES SURVIE
0.4084507 0.5915493
Group means:
FRCAR INCAR INSYS PRDIA PAPUL PVENT REPUL
DECES 91.55172 1.397931 15.53103 21.44828 28.43103 11.844828 1738.6897
SURVIE 87.69048 2.318333 27.20238 15.97619 22.20238 8.642857 817.2143
Coefficients of linear discriminants:
LD1
FRCAR -0.012743116
INCAR 1.074534545
INSYS -0.019139867
PRDIA -0.025483955
PAPUL 0.020177505
PVENT -0.037804074
REPUL -0.001353977
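The fitted score is just this linear combination of the predictors. A minimal sketch, in Python rather than R, using the LD1 coefficients above; the centering constant is dropped, so only score differences are meaningful:

```python
# LD1 coefficients from the fitted model above
coef = {"FRCAR": -0.012743116, "INCAR": 1.074534545, "INSYS": -0.019139867,
        "PRDIA": -0.025483955, "PAPUL": 0.020177505, "PVENT": -0.037804074,
        "REPUL": -0.001353977}

def ld1(x):
    """Linear discriminant score (up to an additive centering constant)."""
    return sum(coef[k] * x[k] for k in coef)

# first two observations of the MYOCARDE data
x1 = {"FRCAR": 90, "INCAR": 1.71, "INSYS": 19.0, "PRDIA": 16,
      "PAPUL": 19.5, "PVENT": 16.0, "REPUL": 912}   # SURVIE
x2 = {"FRCAR": 90, "INCAR": 1.68, "INSYS": 18.7, "PRDIA": 24,
      "PAPUL": 31.0, "PVENT": 14.0, "REPUL": 1476}  # DECES

print(ld1(x1) - ld1(x2))  # positive: the survivor scores higher on LD1
```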
We could try a PCA on the first six variables, and look at the cloud of individuals, to see whether we can discriminate "simply".
library(ade4)
mesures=MYOCARDE[,1:6]
acp <- dudi.pca(mesures,scann = FALSE, nf = 3)
s.class(acp$li, fac=MYOCARDE$PRONO,col=c("red","blue"),xax = 1, yax = 2)
[Figure: PCA of the individuals, with the variable loadings (FRCAR, INCAR, INSYS, PRDIA, PAPUL, PVENT) and the 71 observations.]
[Figure: s.class plot of the individuals on the first two principal components, by prognosis (DECES in red, SURVIE in blue).]
The points in the lower-left region are predicted "survival", and those in the upper-right part "death". We can then compare the observed values Y with these predictions Ŷ,
> table(PRONOSTIC,MYOCARDE$PRONO)
PRONOSTIC DECES SURVIE
SURVIE 14 34
DECES 15 8
          Y = 0            Y = 1
Ŷ = 0   true negative    false negative
Ŷ = 1   false positive   true positive

Among the measures of predictive performance,
P(Y = 1 | Ŷ = 1) is called the precision
P(Ŷ = 1 | Y = 1) is called the true positive rate
P(Ŷ = 1 | Y = 0) is called the false positive rate
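These rates can be read off directly from a confusion matrix. A small sketch in Python, taking SURVIE as the "positive" class and using the counts of the PCA-based classification above:

```python
# confusion counts from the table above (positive class: SURVIE)
tp, fp = 34, 14   # predicted SURVIE, observed SURVIE / DECES
fn, tn = 8, 15    # predicted DECES,  observed SURVIE / DECES

precision = tp / (tp + fp)   # P(Y = 1 | Yhat = 1)
tpr       = tp / (tp + fn)   # P(Yhat = 1 | Y = 1), sensitivity
fpr       = fp / (fp + tn)   # P(Yhat = 1 | Y = 0)

print(precision, tpr, fpr)
```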
We can then plot the true positive rate as a function of the false positive rate (the ROC curve).
Since we are trying to explain Y (the prognosis recoded as 0/1, with 1 for SURVIE) with several continuous variables, we could use a logistic, or probit, regression.
> glm(Y~.-PRONO,data=MYOCARDE, family=binomial(link = "logit"))
Call: glm(formula = Y ~ . - PRONO, family = binomial(link = "logit"),
data = MYOCARDE)
Coefficients:
(Intercept) FRCAR INCAR INSYS PRDIA PAPUL PVENT REPUL
-10.187642 0.138178 -5.862429 0.717084 -0.073668 0.016757 -0.106776 -0.003154
Degrees of Freedom: 70 Total (i.e. Null); 63 Residual
Null Deviance: 96.03
Residual Deviance: 41.04 AIC: 57.04
> glm(Y~.-PRONO,data=MYOCARDE, family=binomial(link = "probit"))
Call: glm(formula = Y ~ . - PRONO, family = binomial(link = "probit"),
data = MYOCARDE)
Coefficients:
(Intercept) FRCAR INCAR INSYS PRDIA PAPUL PVENT REPUL
-4.677478 0.072674 -3.071761 0.366205 -0.040006 0.009804 -0.063314 -0.001993
Degrees of Freedom: 70 Total (i.e. Null); 63 Residual
Null Deviance: 96.03
Residual Deviance: 40.97 AIC: 56.97
We can then look at the predictions given by these models.
> r.logit <- glm(Y~.-PRONO,data=MYOCARDE, family=binomial(link = "logit"))
> Y.logit <- predict(r.logit, type=’response’)
> r.probit <- glm(Y~.-PRONO,data=MYOCARDE, family=binomial(link = "probit"))
> Y.probit <- predict(r.probit, type=’response’)
> cbind(MYOCARDE$Y,Y.logit,Y.probit)
[,1] [,2] [,3]
1 1 0.601 0.613
2 0 0.169 0.175
3 0 0.328 0.338
4 1 0.881 0.882
5 0 0.142 0.143
6 0 0.057 0.060
7 1 0.679 0.668
8 1 0.078 0.087
9 1 0.967 0.968
10 1 0.945 0.951
11 1 0.985 0.989
12 1 0.989 0.992
13 1 0.999 0.999
14 1 0.999 0.999
15 1 0.988 0.992
The closer the score is to 0, the more we should predict a death; the closer it is to 1, the more confident we should be in a survival.
"Naturally", we can set this threshold at 50%: if Ŷ < .5 we predict Y = 0, and if Ŷ > .5 we predict Y = 1.
We then obtain the following classification, with the same table for the probit and logit models,
          Y = 0            Y = 1
Ŷ = 0   true negative    false negative
Ŷ = 1   false positive   true positive

          Y = 0   Y = 1
Ŷ = 0      25       3
Ŷ = 1       4      39
But the 50% threshold was set arbitrarily. Taking thresholds of 30% or 70% changes the results,
with a threshold of 30%,

          Y = 0   Y = 1
Ŷ = 0      22       2
Ŷ = 1       7      40

and with a threshold of 70%,

          Y = 0   Y = 1
Ŷ = 0      26       9
Ŷ = 1       3      33
In one case we decrease the number of false negatives while increasing the number of false positives, and the other way around for the other choice.
There is therefore a trade-off in the choice of the threshold: we cannot detect everyone correctly! The same problem already appeared with the PCA, when we cut the plane into 2 regions.
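The trade-off can be made explicit by sweeping the threshold over the scores. A minimal Python sketch, on a small made-up set of scores and labels (illustrative values, not the MYOCARDE ones):

```python
def confusion(scores, labels, threshold):
    """Count (tn, fn, fp, tp) when predicting 1 iff score > threshold."""
    tn = fn = fp = tp = 0
    for s, y in zip(scores, labels):
        yhat = 1 if s > threshold else 0
        if   yhat == 0 and y == 0: tn += 1
        elif yhat == 0 and y == 1: fn += 1
        elif yhat == 1 and y == 0: fp += 1
        else:                      tp += 1
    return tn, fn, fp, tp

scores = [0.05, 0.15, 0.35, 0.45, 0.55, 0.65, 0.85, 0.95]
labels = [0,    0,    0,    1,    0,    1,    1,    1]

for t in (0.3, 0.5, 0.7):
    print(t, confusion(scores, labels, t))
# lowering the threshold trades false negatives for false positives
```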
[Figure: individuals in the principal plane, with different cuts of the plane into two prediction regions.]
Note that we can "test" the relevance of our classification,
> library(ROCR)
> pred=prediction(Y.probit,MYOCARDE$PRONO)
> perf=performance(pred,’tpr’,’fpr’)
> plot(perf)
[Figure: ROC curve, true positive rate against false positive rate.]
Introductory example: Bordeaux wines
We consider tastings of Bordeaux wines, over 34 years between 1924 and 1957.
> BORDEAUX=read.table("http://perso.univ-rennes1.fr/arthur.charpentier/
+ bordeaux_R.txt",head=TRUE)
> BORDEAUX=BORDEAUX[,-1]
> head(BORDEAUX,8)
NUMERO TEMPERAT SOLEIL CHALEUR PLUIE QUALITE
1 1 3064 1201 10 361 2
2 2 3000 1053 11 338 3
3 3 3155 1133 19 393 2
4 4 3085 970 4 467 3
5 5 3245 1258 36 294 1
6 6 3267 1386 35 225 1
7 7 3080 966 13 417 3
8 8 2974 1189 12 488 3
[Figure: boxplots of TEMPERAT, SOLEIL, CHALEUR and PLUIE, by quality class (1, 2, 3).]
We look for a discriminant analysis separating the k classes as well as possible,

Z1 = α0 + Σ_{j=1}^p αj Xj
> lda(QUALITE~.,data=BORDEAUX)
Call:
lda(QUALITE ~ . + 1, data = BORDEAUX)
Prior probabilities of groups:
1 2 3
0.3235294 0.3235294 0.3529412
Group means:
TEMPERAT SOLEIL CHALEUR PLUIE
1 3306.364 1363.636 28.54545 305.0000
2 3140.909 1262.909 16.45455 339.6364
3 3037.333 1126.417 12.08333 430.3333
Coefficients of linear discriminants:
LD1 LD2
TEMPERAT 0.008566046 -4.625059e-05
SOLEIL 0.006773869 -5.329293e-03
CHALEUR -0.027054492 1.276362e-01
PLUIE -0.005865665 6.174556e-03
Proportion of trace:
LD1 LD2
0.9595 0.0405
We can also center and standardize the variables,

X1 = (temperature − 3157.88) / √7668.456, ..., X4 = (pluie − 360) / √5758.039.
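Column-wise centering and scaling (what the BORDEAUX.CR line below builds with matrices of repeated means and standard deviations) can be sketched as follows in Python; the column is illustrative:

```python
from math import sqrt

def standardize(column):
    """Center a column on its mean and divide by its sample standard deviation."""
    n = len(column)
    m = sum(column) / n
    s = sqrt(sum((x - m) ** 2 for x in column) / (n - 1))
    return [(x - m) / s for x in column]

temperat = [3064, 3000, 3155, 3085, 3245, 3267, 3080, 2974]  # first 8 years
z = standardize(temperat)
# the standardized column has mean 0 and sample standard deviation 1
```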
> (M=apply(BORDEAUX,2,mean))
TEMPERAT SOLEIL CHALEUR PLUIE QUALITE
3157.882353 1247.323529 18.823529 360.441176 2.029412
> (S=apply(BORDEAUX,2,sd))
TEMPERAT SOLEIL CHALEUR PLUIE QUALITE
141.1843336 126.6229719 10.0165638 91.4016084 0.8343131
> BORDEAUX.CR=(BORDEAUX-matrix(rep(M,each=34),34,5))/matrix(rep(S,each=34),34,5)
> (LD=lda(QUALITE~.,data=BORDEAUX.CR))
Call:
lda(QUALITE ~ ., data = BORDEAUX.CR)
Prior probabilities of groups:
-1.23384339005595 -0.0352526682873127 1.16333805348132
0.3235294 0.3235294 0.3529412
Group means:
TEMPERAT SOLEIL CHALEUR PLUIE
-1.23384339005595 1.0516838 0.9185761 0.9705849 -0.6065667
-0.0352526682873127 -0.1202206 0.1230864 -0.2365067 -0.2276198
1.16333805348132 -0.8538413 -0.9548573 -0.6729050 0.7646710
Coefficients of linear discriminants:
LD1 LD2
TEMPERAT 1.2093914 -0.006529859
SOLEIL 0.8577274 -0.674810955
CHALEUR -0.2709930 1.278475787
PLUIE -0.5361312 0.564364371
Proportion of trace:
LD1 LD2
0.9595 0.0405
> PLD=predict(LD)$x
> boxplot(PLD~BORDEAUX$QUALITE)
We can also use the second discriminant variable, centered, but uncorrelated with Z1,

Z2 = β0 + Σ_{j=1}^p βj Xj

We obtain the two box-plots below.
[Figure: boxplots of the first and second discriminant scores, by quality class (1, 2, 3).]
> X=predict(LD)$x[,1]; Y=predict(LD)$x[,2]
> plot(X,Y,col=BORDEAUX$QUALITE)
[Figure: individuals plotted in the (LD1, LD2) discriminant plane, labeled 1-34 and colored by quality class.]
> AJUST=cbind(BORDEAUX[,5],predict(LD)$class,
+ BORDEAUX[,5]==as.numeric(predict(LD)$class))
> AJUST
[,1] [,2] [,3]
[1,] 2 2 1
[2,] 3 3 1
[3,] 2 3 0
[4,] 3 3 1
[5,] 1 1 1
[6,] 1 1 1
[7,] 3 3 1
[8,] 3 3 1
[9,] 3 3 1
[10,] 2 1 0
[11,] 1 1 1
[12,] 3 2 0
[13,] 3 3 1
[14,] 1 1 1
[15,] 2 2 1
[16,] 2 2 1
[17,] 2 2 1
[18,] 3 3 1
[19,] 2 2 1
> table(as.factor(as.numeric(AJUST[,1])),as.factor(as.numeric(AJUST[,2])))
1 2 3
1 9 2 0
2 2 8 1
3 0 2 10
Reclassification
> lda(PRONO~.,data=MYOCARDE,prior=c(0.5,0.5),CV=TRUE)
> lda(PRONO~.-Y,data=MYOCARDE,prior=c(0.5,0.5),CV=TRUE)
$class
[1] DECES DECES DECES SURVIE DECES DECES SURVIE DECES SURVIE SURVIE SURVIE SURVIE SURVIE SURVIE SURVIE SURVIE SURVIE SURVIE SURVIE
[20] SURVIE SURVIE DECES SURVIE DECES SURVIE DECES DECES SURVIE SURVIE SURVIE SURVIE DECES DECES SURVIE DECES SURVIE SURVIE SURVIE
[39] DECES SURVIE DECES DECES SURVIE SURVIE DECES SURVIE DECES DECES DECES DECES SURVIE SURVIE DECES DECES SURVIE SURVIE DECES
[58] SURVIE SURVIE SURVIE DECES SURVIE DECES DECES SURVIE SURVIE DECES DECES SURVIE SURVIE SURVIE
Levels: DECES SURVIE
$posterior
DECES SURVIE
1 0.502843989 0.4971560108
2 0.760428401 0.2395715991
3 0.898718532 0.1012814675
4 0.205819247 0.7941807532
5 0.767586744 0.2324132563
6 0.891944506 0.1080554941
[...]
67 0.988907194 0.0110928057
68 0.913385833 0.0866141669
69 0.038344052 0.9616559479
70 0.023091939 0.9769080611
71 0.017904179 0.9820958214
We can change the expected class proportions (the priors),
> lda(PRONO~.-Y,data=MYOCARDE,prior=c(0.3,0.7),CV=TRUE)
$posterior
DECES SURVIE
1 0.3023943986 0.6976056014
2 0.5763315167 0.4236684833
3 0.7917932260 0.2082067740
4 0.0999652630 0.9000347370
5 0.5859958160 0.4140041840
6 0.7796213462 0.2203786538
[...]
67 0.9744940304 0.0255059696
68 0.8188235413 0.1811764587
69 0.0168012965 0.9831987035
70 0.0100288802 0.9899711198
71 0.0077525353 0.9922474647
In the multinomial case (more than 2 levels),
> B.LDA=lda(QUALITE~.,data=BORDEAUX,prior=c(1/3,1/3,1/3),CV=TRUE)
$class
[1] 2 3 3 3 2 1 3 3 3 1 1 2 3 2 2 2 2 3 2 1 2 1 2 1 2 1 1 3 1 2 3 2 2 3
Levels: 1 2 3
$posterior
1 2 3
1 7.037459e-03 6.295202e-01 3.634423e-01
2 7.537421e-05 5.994089e-02 9.399837e-01
3 8.143494e-03 1.822480e-01 8.096085e-01
4 1.134597e-05 2.619176e-02 9.737969e-01
5 2.536909e-01 6.212299e-01 1.250793e-01
6 8.973327e-01 1.025057e-01 1.615276e-04
7 1.127037e-05 9.005366e-03 9.909834e-01
The predicted scores can be visualized,
> barplot(t(B.LDA$posterior),col=c("blue","green","red"))
[Figure: stacked barplot of the posterior probabilities, one bar per observation]
Tests

The most classical statistic is the correct-classification rate, i.e. $P(\hat Y = Y)$.

Note that the classification table (or confusion matrix) is a contingency table, so the significance of the prediction can be tested with a $\chi^2$ test.

Wilks' Lambda test checks whether the mean vectors of the different groups are equal or not (it can be understood as a multidimensional analogue of Fisher's test).

Rao's V test measures the distance between the group centres and the overall mean.

In fact, these tests are only valid for Gaussian vectors, with the additional assumption that the variance-covariance matrices are equal in each group.
A Kullback test can be used for this, noting that
$$\sum_{i=1}^{k} \frac{n_i - 1}{2} \log\left(\frac{\det D^\star}{\det D^\star_i}\right) \sim \chi^2 \qquad \text{under } H_0,$$
where $D^\star$ is the within-group variance-covariance matrix, $D^\star_i$ is the variance-covariance matrix of group i, and $n_i$ denotes the number of observations in group i.
Some formalization
To begin with, assume that Y takes 2 levels, denoted 0 and 1, and that the m variables $X_j$ are continuous.

Let
$$\bar X_0 = (\bar X_1^{Y=0}, \cdots, \bar X_m^{Y=0}), \qquad \bar X_1 = (\bar X_1^{Y=1}, \cdots, \bar X_m^{Y=1}),$$
$$V_0 = [\text{cov}(X_i^{Y=0}, X_j^{Y=0})] \qquad \text{and} \qquad V_1 = [\text{cov}(X_i^{Y=1}, X_j^{Y=1})].$$

Also set $\bar X_\star = (\bar X_1, \cdots, \bar X_m)$ and $V_\star = [\text{cov}(X_i, X_j)]$ (over the whole population).

Finally, let $\omega_0$ and $\omega_1$ denote the weights of the two classes.

The between-class variance matrix is the variance matrix B of the 2 centres of gravity,
$$B = \sum_{k=0}^{1} \omega_k (\bar X_k - \bar X_\star)(\bar X_k - \bar X_\star)',$$
and W the within-class variance matrix, the (weighted) average of the matrices $V_k$, i.e.
$$W = \sum_{k=0}^{1} \omega_k V_k.$$
Note that W is generally invertible, while B is not. The variance decomposition formula gives
$$V = W + B$$
(the total variance is the sum of the average of the variances and the variance of the averages).

The variables will be assumed centred, i.e. $\bar X_\star = 0$, so that
$$B = \sum_{k=0}^{1} \omega_k \bar X_k \bar X_k' \qquad \text{and} \qquad W = \sum_{k=0}^{1} \omega_k V_k, \qquad \text{where } \omega_k = \frac{n_k}{n}.$$
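As an aside, the decomposition V = W + B is easy to check numerically. Below is a minimal Python sketch (the slides use R; the data, seed and group sizes here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 2)), rng.normal(2.0, 1.0, (20, 2))])
y = np.array([0] * 30 + [1] * 20)
n = len(y)

V = np.cov(X.T, bias=True)          # total variance-covariance matrix
B = np.zeros((2, 2))                # between-class matrix
W = np.zeros((2, 2))                # within-class matrix
xbar = X.mean(axis=0)
for k in (0, 1):
    Xk = X[y == k]
    wk = len(Xk) / n                # weight n_k / n
    d = (Xk.mean(axis=0) - xbar).reshape(-1, 1)
    B += wk * (d @ d.T)
    W += wk * np.cov(Xk.T, bias=True)

assert np.allclose(V, W + B)        # Huygens decomposition of the variance
```

Note the `bias=True` in `np.cov`: the decomposition is exact for the population (1/n) covariances, as in the formulas above.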
Consider the table made of the variable Y, or more generally of the associated disjunctive (dummy) table, denoted A, together with the table X of the explanatory variables.
Note that the 2 centres of gravity $\bar X_0$ and $\bar X_1$ are the rows of the matrix $(A'DA)^{-1}(A'DX)$, where D is the matrix of individual weights.
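This matrix identity can be checked directly; a small Python sketch (simulated data and uniform weights D = I/n, both illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))                       # 6 individuals, 3 predictors
y = np.array([0, 0, 1, 1, 1, 0])
A = np.column_stack([(y == k).astype(float) for k in (0, 1)])  # disjunctive table
D = np.eye(6) / 6                                 # uniform individual weights

G = np.linalg.inv(A.T @ D @ A) @ (A.T @ D @ X)    # one row per class barycentre
assert np.allclose(G[0], X[y == 0].mean(axis=0))
assert np.allclose(G[1], X[y == 1].mean(axis=0))
```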
Factorial discriminant analysis (AFD) consists in looking for discriminant variables, corresponding to directions in $\mathbb R^m$ that best split the cloud into the k groups.
[Figure: scatter plot of the two groups]
[Figure: scatter plot of the two groups, with the density of the points projected onto a first axis]
[Figure: scatter plot of the two groups, with the density of the points projected onto a second, more discriminating axis]
We look for an axis with good discriminating power (between the groups), such as the axis in the second case.

In particular, when the centres of gravity of the clouds are projected, their dispersion should be maximal.
The inertia matrix of the cloud $\{\bar X_0, \bar X_1\}$ is MBM (where M is a metric on $\mathbb R^m$), and the inertia of the cloud projected onto an axis a is then $a'MBMa$ (if $\|a\|_M = 1$). We thus seek to maximize $a'MBMa$.

We also want the projected cloud to be gathered around its centre of gravity, which amounts to minimizing $a'MWMa$.

Using V = B + W, we get
$$a'MVMa = a'MBMa + a'MWMa.$$
A natural criterion to maximize is therefore the ratio of the between-class inertia to the total inertia,
$$\max_a \left\{ \frac{a'MBMa}{a'MVMa} \right\}.$$
This maximum is reached when a is an eigenvector of $(MVM)^{-1}MBM$ associated with the largest eigenvalue.

One then performs the PCA of the cloud of the centres of gravity, with the metric $V^{-1}$.
Analysis of variance?
Another interpretation can be given in terms of analysis of variance.

To begin with, one-way analysis of variance (one-factor ANOVA) proceeds as follows: there are k groups, with observations $\{X_{1,i}, \cdots, X_{n_i,i}\}$ for group i. Assuming the $X_{j,i} \sim \mathcal N(\mu_i, \sigma^2)$, i.i.d., we want to test
$$H_0 : \mu_1 = \cdots = \mu_i = \cdots = \mu_k \ (= \mu).$$
The idea of the analysis of variance is to use a Fisher test, noting that
$$F = \frac{S_E^2}{k-1} \cdot \frac{n-k}{S_R^2} \sim \mathcal F(k - 1, n - k),$$
where
$$S^2 = \frac{1}{n}\sum_{i,j}(X_{j,i} - \bar X)^2 = S_E^2 + S_R^2,$$
$$S_E^2 = \frac{1}{n}\sum_i n_i(\bar X_i - \bar X)^2 \qquad \text{and} \qquad S_R^2 = \frac{1}{n}\sum_{i,j}(X_{i,j} - \bar X_i)^2$$
(decomposition of the variance into the between-group variance $S_E^2$ and the within-group variance $S_R^2$).
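The F statistic and the variance decomposition above can be sketched as follows (Python, simulated groups with arbitrary means and sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
groups = [rng.normal(0.0, 1.0, 15), rng.normal(0.5, 1.0, 12), rng.normal(1.0, 1.0, 18)]
x = np.concatenate(groups)
n, k = len(x), len(groups)
xbar = x.mean()

S2 = ((x - xbar) ** 2).sum() / n                                  # total variance
S2E = sum(len(g) * (g.mean() - xbar) ** 2 for g in groups) / n    # between-group
S2R = sum(((g - g.mean()) ** 2).sum() for g in groups) / n        # within-group
F = (S2E / (k - 1)) * ((n - k) / S2R)

assert np.isclose(S2, S2E + S2R)    # variance decomposition holds exactly
assert F > 0
```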
But here, since we have p explanatory variables, we look for the linear combination that maximizes a Fisher-type statistic, i.e. for u maximizing
$$F = \frac{u'Bu}{u'Wu}.$$
The solution is the eigenvector associated with the largest eigenvalue of $W^{-1}B$ (these are also eigenvectors of $V^{-1}B$).

Note that the metric associated with $W^{-1}$ is sometimes called the Mahalanobis metric.
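A Python sketch of this eigenvector computation (simulated two-group data; with two groups the leading eigenvector of W⁻¹B is proportional to W⁻¹(X̄₀ − X̄₁), which the last lines verify):

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (40, 2)),
               rng.normal([2.0, 1.0], 1.0, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
n = len(y)

xbar = X.mean(axis=0)
B = np.zeros((2, 2))
W = np.zeros((2, 2))
for k in (0, 1):
    Xk = X[y == k]
    wk = len(Xk) / n
    d = (Xk.mean(axis=0) - xbar).reshape(-1, 1)
    B += wk * (d @ d.T)
    W += wk * np.cov(Xk.T, bias=True)

vals, vecs = np.linalg.eig(np.linalg.inv(W) @ B)
u = vecs[:, np.argmax(vals.real)].real            # leading eigenvector

# With two groups, this direction is proportional to W^{-1}(xbar0 - xbar1)
u2 = np.linalg.inv(W) @ (X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
cosine = abs(u @ u2) / (np.linalg.norm(u) * np.linalg.norm(u2))
assert np.isclose(cosine, 1.0)
```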
Analysis of variance with 2 groups

Since k − 1 = 1, we look for a single discriminant variable.

The discriminant axis is then the line through the two centres of gravity, $\bar X_0$ and $\bar X_1$. Then
$$u = V^{-1}(\bar X_0 - \bar X_1) \qquad \text{or} \qquad W^{-1}(\bar X_0 - \bar X_1).$$
$W^{-1}(\bar X_0 - \bar X_1)$ is called Fisher's discriminant function. In fact, for normalization purposes, one rather considers
$$\frac{n_0 + n_1 - 2}{n_0 + n_1}\, W^{-1}(\bar X_0 - \bar X_1).$$
Indeed, Fisher was looking for the linear combination of the explanatory variables for which the square of the test statistic is maximal, i.e.
$$\max_u \frac{(\bar Y^\star_0 - \bar Y^\star_1)^2}{\left(\dfrac{n_0 S_0^2 + n_1 S_1^2}{n_0 + n_1 - 2}\right)\left(\dfrac{1}{n_0} + \dfrac{1}{n_1}\right)}, \qquad \text{where } Y^\star = Xu.$$
If we set $\Sigma = \dfrac{n_0 + n_1}{n_0 + n_1 - 2}\, W$, Fisher's problem can be written
$$\max_u \frac{\big(u'(\bar X_0 - \bar X_1)\big)^2}{u'\Sigma u},$$
i.e. u must be proportional to $\Sigma^{-1}(\bar X_0 - \bar X_1)$.
Interpretation in terms of regression

Note that if Y is bluntly regressed on $X_1, \cdots, X_p$, the least-squares estimator can be written
$$\hat\beta = (X'X)^{-1}X'Y = V^{-1}(\bar X_0 - \bar X_1).$$
On the previous example,
> base
y x1 x2
[1,] 0 -0.06842752 1.0664922282
[2,] 0 -0.01273235 -1.8565790136
[3,] 0 -2.24507861 -2.3625561698
[4,] 0 0.62173134 -1.3233327477
[5,] 0 -1.06797642 -0.4757008868
[6,] 0 0.51384396 -0.0561551010
[...]
[395,] 1 1.95266073 2.2221802298
[396,] 1 3.32203741 0.6882211866
[397,] 1 1.35032036 0.7791709815
[398,] 1 1.30084249 2.1642225218
[399,] 1 2.61357210 1.9169049693
[400,] 1 0.31456394 -0.4377148839
> (r=lm(y~x1+x2,data=base))
Call:
lm(formula = y ~ x1 + x2)
Coefficients:
(Intercept) x1 x2
0.3736 0.1370 0.1209
> -coef(r)[2]/coef(r)[3]
x1
-1.133024
The discrimination axis then has slope −1.13, and the intercept reflects the performance of the discrimination. The most classical choice being (since here $n_1 = n_0$)
> (.5-coef(r)[1])/coef(r)[3]
(Intercept)
1.045773
[Figure: scatter plot of the two groups with the discrimination line obtained from the regression]
Assignment rule

Once the direction of the discrimination axis has been determined, it remains to choose where to position it.

A natural rule consists in computing the distance from the observation to the centres of gravity, and assigning it to the closest one. But the distance still has to be chosen... The most usual metric is Mahalanobis', i.e. $W^{-1}$.
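A minimal Python sketch of this rule (simulated data; the test points are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

mu = [X[y == k].mean(axis=0) for k in (0, 1)]
# within-class covariance W, weighted by the group proportions
W = sum((y == k).mean() * np.cov(X[y == k].T, bias=True) for k in (0, 1))
Winv = np.linalg.inv(W)

def assign(x):
    """Assign x to the class whose centre of gravity is closest in the W^{-1} metric."""
    d2 = [(x - m) @ Winv @ (x - m) for m in mu]
    return int(np.argmin(d2))

assert assign(np.array([0.1, -0.2])) == 0
assert assign(np.array([3.2, 2.8])) == 1
```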
Scoring method, Bayesian approach

Here we want to assign an individual to one of the classes, given its characteristics x. It is assigned to the class y for which the probability $P(Y = y \mid X = x)$ is maximal.
Scoring method, Gaussian example

Assume that $X \mid Y = y$ is Gaussian, $\mathcal N(\mu_y, \Sigma_y)$, i.e.
$$f(x \mid Y = y) = \frac{1}{\sqrt{(2\pi)^k \det \Sigma_y}} \exp\left(-\frac{1}{2}(x - \mu_y)'\Sigma_y^{-1}(x - \mu_y)\right).$$
The criterion is then to maximize $p_y\, f(x \mid Y = y)$, or its logarithm; equivalently, to minimize
$$(x - \mu_y)'\Sigma_y^{-1}(x - \mu_y) - 2\log p_y + \log\det\Sigma_y.$$
This is called a quadratic assignment rule.

If the variance-covariance matrices $\Sigma_y$ are assumed equal across classes, the assignment rule becomes linear.
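The quadratic rule can be sketched as follows (Python; the parameters µ_y, Σ_y and p_y are made up for illustration, not estimated from the slides' data):

```python
import numpy as np

def quad_score(x, mu, Sigma, p):
    """log p_y plus the log-density of N(mu, Sigma) at x, up to an additive constant."""
    d = x - mu
    Sinv = np.linalg.inv(Sigma)
    return np.log(p) - 0.5 * (d @ Sinv @ d + np.log(np.linalg.det(Sigma)))

# Illustrative class parameters
mu0, S0, p0 = np.array([0.0, 0.0]), np.eye(2), 0.5
mu1, S1, p1 = np.array([2.0, 2.0]), 2 * np.eye(2), 0.5

x = np.array([0.2, -0.1])
cls = 0 if quad_score(x, mu0, S0, p0) > quad_score(x, mu1, S1, p1) else 1
assert cls == 0   # x is close to mu0, so class 0 wins
```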
Gaussian example, one explanatory variable
[Figure: the two class-conditional densities, with the observations of each group shown as rugs]
Gaussian example, two explanatory variables
[Figure: scatter plot of the two groups with the level curves of the two Gaussian densities]
Gaussian example, interpretation

If the probabilities $p_y$ are equal, the individual is assigned to the class whose centre of gravity is closest to x.

With two groups, x is assigned to class 0 if
$$x'\Sigma^{-1}(\mu_0 - \mu_1) > \frac{1}{2}(\mu_0 + \mu_1)'\Sigma^{-1}(\mu_0 - \mu_1) + \log\frac{p_1}{p_0}.$$
These are called parametric classification methods. Note that it is also possible to use methods of the k-nearest-neighbours type, where one looks for the k nearest neighbours of x, and x is assigned to the majority class among those neighbours.
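A toy k-nearest-neighbours assignment in Python (the six training points and k = 3 are made up for illustration):

```python
import numpy as np

train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def knn(x, k=3):
    d = np.linalg.norm(train - x, axis=1)        # Euclidean distances to x
    nearest = labels[np.argsort(d)[:k]]          # labels of the k closest points
    return np.bincount(nearest).argmax()         # majority vote

assert knn(np.array([0.2, 0.3])) == 0
assert knn(np.array([5.5, 5.2])) == 1
```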
Using regressions
[Figure: the observations of the two groups shown as rugs at 0 and 1, with the fitted regression curve]
Using regressions

Here we look for a model that could estimate Y as a function of one or several explanatory variables X. Y often takes the two values 0 and 1, and will be modelled through a latent variable $Y^\star$, continuous between 0 and 1.

$Y^\star = 0.1$ is then interpreted as "there is a 10% chance that Y = 1".

We then introduce the odds ("odds", or "cote"),
$$p_1 = \frac{P(Y = 1)}{1 - P(Y = 1)}.$$
E.g. if P(Y = 1) = 90%, then $p_1 = 0.9/0.1 = 9$: observing Y = 1 is 9 times more likely than observing Y = 0.

To go from these odds (defined on $\mathbb R_+$) to a variable defined on $\mathbb R$ (so that a linear model can be used), we take the logarithm: this defines the logit transformation
$$\text{logit}(p) = \log\left(\frac{p}{1 - p}\right), \qquad \text{with inverse} \qquad \text{logit}^{-1}(y) = \frac{\exp(y)}{1 + \exp(y)}.$$
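The two transformations, using only the Python standard library:

```python
import math

def logit(p):
    # log-odds: maps (0, 1) to the real line
    return math.log(p / (1 - p))

def inv_logit(y):
    # inverse transformation: maps the real line back to (0, 1)
    return math.exp(y) / (1 + math.exp(y))

assert abs(0.9 / (1 - 0.9) - 9) < 1e-9          # odds of 9 when P(Y=1) = 0.9
assert abs(inv_logit(logit(0.3)) - 0.3) < 1e-12  # round-trip
```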
Logistic regression

Assume here that $X \mid Y = y$ is Gaussian, $\mathcal N(\mu_y, \Sigma_y)$; let $\varphi_0$ denote the density of $X \mid Y = 0$ and $\varphi_1$ that of $X \mid Y = 1$.

Since the posterior probabilities are a logistic function of the score, we have
$$\log\left(\frac{\varphi_1(x)}{\varphi_0(x)}\right) = \beta'x.$$
We deduce that
$$P(Y = 1 \mid X = x) = \frac{p_1\varphi_1(x)}{p_1\varphi_1(x) + p_0\varphi_0(x)} = \frac{\dfrac{p_1\varphi_1(x)}{p_0\varphi_0(x)}}{1 + \dfrac{p_1\varphi_1(x)}{p_0\varphi_0(x)}},$$
and therefore
$$P(Y = 1 \mid X = x) = \frac{\exp(\beta'x + \log(p_1/p_0))}{1 + \exp(\beta'x + \log(p_1/p_0))},$$
and, symmetrically,
$$P(Y = 0 \mid X = x) = \frac{1}{1 + \exp(\beta'x + \log(p_1/p_0))}.$$
The likelihood of β is then
$$\mathcal L(\beta \mid x) = \prod_{i : y_i = 0} \varphi_0(x_i) \prod_{i : y_i = 1} \varphi_1(x_i);$$
now, by Bayes' formula,
$$\varphi_0(x) = \frac{P(Y = 0 \mid X = x)\,[p_0\varphi_0(x) + p_1\varphi_1(x)]}{p_0},$$
and therefore
$$\mathcal L(\beta \mid x) = \frac{1}{p_0^{n_0}\, p_1^{n_1}} \prod_{i : y_i = 0} P(Y = 0 \mid X = x_i) \prod_{i : y_i = 1} P(Y = 1 \mid X = x_i) \prod_i f(x_i),$$
where $f(x_i) = p_0\varphi_0(x_i) + p_1\varphi_1(x_i)$. Since this function is unknown, one uses a
conditional maximum-likelihood method,
$$\max_\beta \prod_{i : y_i = 1} \frac{\exp(\beta'x_i + \log(p_1/p_0))}{1 + \exp(\beta'x_i + \log(p_1/p_0))} \prod_{i : y_i = 0} \frac{1}{1 + \exp(\beta'x_i + \log(p_1/p_0))},$$
which admits no explicit solution.

A simple assignment rule is then used: assign to group 1 if
$$\beta'x + \log\frac{p_1}{p_0} > 0.$$
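Since no closed form exists, the conditional likelihood is maximized numerically; a minimal Python sketch using plain gradient ascent (the simulated data, learning rate and iteration count are ad hoc):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one predictor
true_beta = np.array([-0.5, 1.5])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))

beta = np.zeros(2)
for _ in range(1000):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += 0.5 * X.T @ (y - p) / n    # gradient of the mean log-likelihood

# The estimate should land close to the true coefficients
assert np.linalg.norm(beta - true_beta) < 0.5
```

In practice this maximization is done by the software (e.g. Fisher scoring in R's glm).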
[Figure: the fitted logistic regression curve, with the observations of the two groups shown as rugs at 0 and 1]
Ordered multinomial case

In the case of the Bordeaux wine ratings, the data can be considered as ordered; the variable Y takes the values 1, 2 and 3.

We can then create two dichotomous variables
$$Y_1 = \begin{cases} 0 & \text{if } Y = 1 \\ 1 & \text{if } Y = 2, 3 \end{cases} \qquad \text{and} \qquad Y_2 = \begin{cases} 0 & \text{if } Y = 1, 2 \\ 1 & \text{if } Y = 3, \end{cases}$$
so that $Y = 1 + Y_1 + Y_2$. We then run two regressions, whose predictions will be summed
> BORDEAUX$y1=BORDEAUX$QUALITE>1
> BORDEAUX$y2=BORDEAUX$QUALITE>2
> r1 <- glm(y1~TEMPERAT+SOLEIL+CHALEUR+PLUIE, data=BORDEAUX, family=binomial)
> r2 <- glm(y2~TEMPERAT+SOLEIL+CHALEUR+PLUIE, data=BORDEAUX, family=binomial)
> BORDEAUX$y1p <- predict(r1, type="response")
> BORDEAUX$y2p <- predict(r2, type="response")
> BORDEAUX$yP=1+BORDEAUX$y1p+BORDEAUX$y2p
> BORDEAUX
TEMPERAT SOLEIL CHALEUR PLUIE QUALITE y1 y2 yP y1p y2p
1 3064 1201 10 361 2 TRUE FALSE 2.123215 0.9902598703 1.329547e-01
2 3000 1053 11 338 3 TRUE TRUE 2.978320 0.9988771543 9.794432e-01
3 3155 1133 19 393 2 TRUE FALSE 2.756925 0.9823799308 7.745449e-01
4 3085 970 4 467 3 TRUE TRUE 2.975201 0.9997584698 9.754428e-01
5 3245 1258 36 294 1 FALSE FALSE 1.335511 0.3114261037 2.408500e-02
6 3267 1386 35 225 1 FALSE FALSE 1.025203 0.0252024785 3.309122e-07
7 3080 966 13 417 3 TRUE TRUE 2.998444 0.9994389749 9.990046e-01
8 2974 1189 12 488 3 TRUE TRUE 2.999847 0.9998466254 1.000000e+00
9 3038 1103 14 677 3 TRUE TRUE 2.999992 0.9999924418 1.000000e+00
10 3318 1310 29 427 2 TRUE FALSE 1.485805 0.4513896402 3.441497e-02
11 3317 1362 25 326 1 FALSE FALSE 1.077266 0.0772657691 1.882255e-08
12 3182 1171 28 326 3 TRUE TRUE 2.194081 0.8655663939 3.285148e-01
13 2998 1102 9 349 3 TRUE TRUE 2.954208 0.9986316794 9.555765e-01
14 3221 1424 21 382 1 FALSE FALSE 1.464454 0.4632192585 1.234297e-03
Unordered multinomial case

Otherwise, in R, one more generally uses the following command
> library(nnet)
> (M=multinom(QUALITE~TEMPERAT+SOLEIL+CHALEUR+PLUIE, data=BORDEAUX))
converged
Call:
multinom(formula = QUALITE ~ TEMPERAT + SOLEIL + CHALEUR + PLUIE,
data = BORDEAUX)
Coefficients:
(Intercept) TEMPERAT SOLEIL CHALEUR PLUIE
2 55.84574 -0.01534060 -0.008522957 -0.03456657 0.01639574
3 222.75077 -0.07528596 -0.020627710 0.51944417 0.08425525
Residual Deviance: 22.46474
AIC: 42.46474
> predict(M)
[1] 2 3 3 3 1 1 3 3 3 1 1 2 3 1 2 2 2 3 2 1 1 1 2 1 2 1 1 3 1 2 3 2 3 3
Levels: 1 2 3
> BORDEAUX$QUALITE
[1] 2 3 2 3 1 1 3 3 3 2 1 3 3 1 2 2 2 3 2 1 2 1 2 1 2 1 2 3 1 1 3 1 3 3
Discriminant analysis and PCA

Discriminant analysis can be seen as a particular case of PCA with the Mahalanobis metric.

Let X be the n × k matrix of quantitative data, and Y a variable taking m levels (the simplest case being 2). Let G denote the m × k matrix of the barycentres of the classes.
Discriminant analysis with R

In R, library(ade4) provides the function discrim, and library(MASS) provides the function lda.

Otherwise, probit and logit regressions are particular cases of the function glm, with
glm( ... , family=binomial(link = "logit"))
glm( ... , family=binomial(link = "probit"))