About functional SIR


Transcript of About functional SIR

Page 1: About functional SIR

About functional SIR
Victor Picheny, Rémi Servien & Nathalie Villa-Vialaneix

[email protected]
http://www.nathalievilla.org

Journées “Données fonctionnelles”
Institut de Mathématiques de Toulouse, June 19th 2017

Nathalie Villa-Vialaneix | SISIR 1/34

Page 2: About functional SIR

A joint work of the SFCB team

Victor Picheny, Rémi Servien & Nathalie Villa-Vialaneix

Nathalie Villa-Vialaneix | SISIR 2/34

Page 3: About functional SIR

Outline

1 Background and motivation

2 Presentation of SIR

3 Our proposal

4 Simulations and Real data

Nathalie Villa-Vialaneix | SISIR 3/34


Page 5: About functional SIR

Introduction

X a functional random variable and Y ∈ R

n i.i.d. realizations of (X, Y)

Nathalie Villa-Vialaneix | SISIR 5/34

Page 6: About functional SIR

Objectives

variable selection in functional regression

selection of full intervals made of consecutive points

without any a priori information on the intervals

a fully data-driven procedure

Nathalie Villa-Vialaneix | SISIR 6/34

Page 7: About functional SIR

Question and mathematical framework

A functional regression problem: X a functional random variable and Y a real random variable. What is E(Y | X)?

Data: n i.i.d. observations (x_i, y_i)_{i=1,...,n}. x_i is not perfectly known but sampled at (fixed) points:

x_i = (x_i(t_1), . . . , x_i(t_p))^T ∈ R^p. We denote by X the (n × p) matrix with rows x_1^T, . . . , x_n^T.

Question: Find a model that is easily interpretable and points out intervals, within the definition domain of X, that are relevant for the prediction.

Method: Do not expand X on a functional basis, but use the fact that the entries of the digitized function x_i are ordered in a natural way.

Nathalie Villa-Vialaneix | SISIR 7/34


Page 11: About functional SIR

Related works (variable selection in FDA)

LASSO / L1 regularization in linear models: [Ferraty et al., 2010, Aneiros and Vieu, 2014] (isolated evaluation points), [Matsui and Konishi, 2011] (selects elements of an expansion basis)

[Fraiman et al., 2016]: a blinding approach usable for various problems (PCA, regression, ...)

[Gregorutti et al., 2015]: adaptation of the variable importance in random forests to groups of variables

[Fauvel et al., 2015, Ferraty and Hall, 2015]: cross-validation and a greedy update of the selected evaluation points to select the most relevant ones in a nonparametric framework

However, none of these approaches proposes to automatically design and select contiguous sets of variables.

Nathalie Villa-Vialaneix | SISIR 8/34


Page 13: About functional SIR

Related works (selection of groups of variables)

[James et al., 2009]: L1 regularization in a linear model with sparsity on derivatives, yielding piecewise-constant predictors

[Park et al., 2016]: criterion based on the minimization of the overall correlation during a greedy segmentation

[Grollemund et al., 2017]: Bayesian approach in which a posterior distribution over informative intervals can be obtained

All are proposed in the framework of the linear model, and the second one does not use the target variable to define and select the relevant intervals.

Our proposal: a semi-parametric (not entirely linear) model which selects relevant intervals, combined with an automatic procedure to define the intervals.

Nathalie Villa-Vialaneix | SISIR 9/34


Page 16: About functional SIR

Outline

1 Background and motivation

2 Presentation of SIR

3 Our proposal

4 Simulations and Real data

Nathalie Villa-Vialaneix | SISIR 10/34

Page 17: About functional SIR

SIR in multidimensional framework

SIR: a semi-parametric regression model for X ∈ R^p:

Y = F(a_1^T X, . . . , a_d^T X, ε)

for a_1, . . . , a_d ∈ R^p (to be estimated), F: R^{d+1} → R unknown, and ε an error term independent of X.

Standard assumption for SIR:

Y ⊥ X | P_A(X),

in which A is the so-called EDR space, spanned by (a_k)_{k=1,...,d}.

SIR is the regression extension of Linear Discriminant Analysis.

Nathalie Villa-Vialaneix | SISIR 11/34


Page 19: About functional SIR

Estimation

Equivalence between SIR and an eigendecomposition

A is included in the space spanned by the first d Σ-orthogonal eigenvectors of the generalized eigendecomposition problem Γa = λΣa, where Σ is the covariance matrix of X and Γ is the covariance matrix of E(X|Y).

Estimation (when n > p):

compute X̄ = (1/n) Σ_{i=1}^n x_i and Σ̂ = (1/n) X^T (X − 1_n X̄^T)

split the range of Y into H slices τ_1, . . . , τ_H and estimate E(X|Y) by the slice means X̄_h = (1/n_h) Σ_{i: y_i ∈ τ_h} x_i, h = 1, . . . , H, with n_h = |{i: y_i ∈ τ_h}|, to obtain an estimate Γ̂ of Γ

solve the eigendecomposition problem Γ̂a = λΣ̂a and obtain the eigenvectors a_1, . . . , a_d

Nathalie Villa-Vialaneix | SISIR 12/34
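The estimation steps above can be sketched in a few lines of numpy; this is an illustrative toy implementation (not the SISIR package), assuming n > p so that Σ̂ is invertible, with equal-count slices:

```python
# A minimal numpy sketch of the SIR estimation steps above (illustrative only).
import numpy as np

def sir_directions(X, y, H=10, d=1):
    """Estimate d EDR directions by sliced inverse regression (n > p assumed)."""
    n, p = X.shape
    x_bar = X.mean(axis=0)
    Sigma = (X - x_bar).T @ (X - x_bar) / n           # covariance of X

    # Slice the range of Y into H slices with (roughly) equal counts
    Gamma = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), H):
        m_h = X[idx].mean(axis=0) - x_bar             # centered slice mean
        Gamma += (len(idx) / n) * np.outer(m_h, m_h)  # covariance of E(X|Y)

    # Generalized eigenproblem Gamma a = lambda Sigma a
    vals, vecs = np.linalg.eig(np.linalg.solve(Sigma, Gamma))
    order = np.argsort(vals.real)[::-1]
    return vecs[:, order[:d]].real

# Toy usage: Y depends on X only through a single direction a1
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
a1 = np.array([1.0, -1.0, 0.0, 0.0, 0.0])
y = X @ a1 + 0.1 * rng.normal(size=300)
A = sir_directions(X, y, H=10, d=1)
```

On this toy example the first estimated direction is close (up to sign and scale) to a1.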


Page 24: About functional SIR

SIR in large dimensions: problem

In large dimension (or in Functional Data Analysis), n < p, so Σ is ill-conditioned and does not have an inverse ⇒ Z = (X − 1_n X̄^T) Σ^{−1/2} cannot be computed.

Different solutions have been proposed in the literature, based on:

prior dimension reduction (e.g., PCA) [Ferré and Yao, 2003] (in the framework of FDA)

regularization (ridge, ...) [Li and Yin, 2008, Bernard-Michel et al., 2008]: equivalent to the generalized eigendecomposition problem Γa = λ(Σ + µ_2 I)a

sparse SIR [Li and Yin, 2008, Li and Nachtsheim, 2008, Ni et al., 2005]

QZ-SIR [Coudret et al., 2014]: uses a method similar to the QR algorithm

Nathalie Villa-Vialaneix | SISIR 13/34
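The ridge option above amounts to a one-line change in the eigenproblem: replace Σ by Σ + µ_2 I so that it stays solvable when Σ is singular. A toy sketch (illustrative only, not the cited authors' implementation):

```python
# Ridge-regularized SIR step: Gamma a = lambda (Sigma + mu2 I) a,
# usable when Sigma is singular (n < p). Illustrative sketch only.
import numpy as np

def ridge_sir_directions(Sigma, Gamma, mu2, d=1):
    p = Sigma.shape[0]
    M = np.linalg.solve(Sigma + mu2 * np.eye(p), Gamma)
    vals, vecs = np.linalg.eig(M)
    order = np.argsort(vals.real)[::-1]
    return vecs[:, order[:d]].real

# Toy check with a singular Sigma (rank 3 < p = 6), as happens when n < p
rng = np.random.default_rng(0)
B = rng.normal(size=(3, 6))
Sigma = B.T @ B / 3          # rank-deficient empirical covariance
Gamma = np.outer(B[0], B[0])
A = ridge_sir_directions(Sigma, Gamma, mu2=0.1, d=1)
```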


Page 26: About functional SIR

SIR in large dimensions: sparse versions

Specific issue when introducing sparsity in SIR: sparsity on a multiple-index model. Most authors use shrinkage approaches, or sparsity on a single-index model with deflation (not shown).

First version: Li and Yin (2008), based on the regression formulation
  Pro: sparsity common to all d dimensions
  Con: minimization problem with dependent variables in R^p

Second version: Li and Nachtsheim (2008), based on the correlation formulation
  Pro: minimization problem with independent variables in R^d
  Con: sparsity different in each of the d dimensions

Nathalie Villa-Vialaneix | SISIR 14/34

Page 27: About functional SIR

Equivalent formulations

SIR as a regression problem: [Li and Yin, 2008] shows that SIR is equivalent to the (double) minimization of

E(A, C) = Σ_{h=1}^H p_h ‖(X̄_h − X̄) − Σ A C_h‖²

for X̄_h = (1/n_h) Σ_{i: y_i ∈ τ_h} x_i, A a (p × d) matrix and C_h vectors in R^d.

Rk: Given A, C is obtained as the solution of an ordinary least squares problem.

SIR as a canonical correlation problem: [Li and Nachtsheim, 2008] shows that SIR rewrites as the double optimization problem max_{a_j, φ} Cor(φ(Y), a_j^T X), where φ is any function R → R and the (a_j)_j are Σ-orthonormal.

Rk: The solution is shown to satisfy φ(y) = a_j^T E(X|Y = y), and a_j is also obtained as the solution of the mean squared error problem min_{a_j} E[(φ(Y) − a_j^T X)²].

Nathalie Villa-Vialaneix | SISIR 15/34


Page 31: About functional SIR

SIR in large dimensions: sparse versions

First version: sparse penalization of the ridge solution. If (Â, Ĉ) is the solution of ridge SIR, [Ni et al., 2005, Li and Yin, 2008] propose to shrink this solution by minimizing

E_{s,1}(α) = Σ_{h=1}^H p_h ‖(X̄_h − X̄) − Σ Diag(α) Â Ĉ_h‖² + µ_1 ‖α‖_{L1}

(regression formulation of SIR)

Nathalie Villa-Vialaneix | SISIR 16/34

Page 32: About functional SIR

SIR in large dimensions: sparse versions

Second version: [Li and Nachtsheim, 2008] derive the sparse optimization problem from the correlation formulation of SIR:

min_{a_j^s} Σ_{i=1}^n [P_{a_j}(X̄|y_i) − (a_j^s)^T x_i]² + µ_{1,j} ‖a_j^s‖_{L1},

in which P_{a_j}(X̄|y_i) is the projection of Ê(X|Y = y_i) = X̄_h onto the space spanned by the solution of the ridge problem.

Nathalie Villa-Vialaneix | SISIR 16/34

Page 33: About functional SIR

Characteristics of the different approaches and possible extensions

                        [Li and Yin, 2008]       [Li and Nachtsheim, 2008]
sparsity on             shrinkage coefficients   estimates
nb of optimization pbs  1                        d
sparsity                common to all dims       specific to each dim

Nathalie Villa-Vialaneix | SISIR 17/34

Page 34: About functional SIR

Outline

1 Background and motivation

2 Presentation of SIR

3 Our proposal

4 Simulations and Real data

Nathalie Villa-Vialaneix | SISIR 18/34

Page 35: About functional SIR

SIR in large dimensions: our sparse version

Background: Back in the functional setting, we suppose that t_1, . . . , t_p are split into D intervals I_1, . . . , I_D.

Based on the minimization problem of Li and Nachtsheim (2008).

Our adaptation: sparsity on the intervals using α = (α_1, . . . , α_D):

∀ l = 1, . . . , p, a^s_{jl} = α_k a_{jl} for the k such that t_l ∈ I_k.

The sparsity constraint is put on α and not directly on a_j^s.

The α are made identical for all dimensions j = 1, . . . , d of the projection.

Nathalie Villa-Vialaneix | SISIR 19/34

Page 36: About functional SIR

SIR in large dimensions: our sparse version

Li and Nachtsheim (2008) (LASSO):

min_{a_j^s} Σ_{i=1}^n ‖P_{a_j}(X̄|y_i) − (a_j^s)^T x_i‖² + µ_{1,j} ‖a_j^s‖_{L1},

in which P_{a_j}(X̄|y_i) is the projection of Ê(X|Y = y_i) = X̄_h (for the h such that y_i is in slice h) onto the space spanned by the a_j.

Our adaptation:

α̂ = argmin_{α ∈ R^D} Σ_{j=1}^d Σ_{i=1}^n ‖P_{a_j}(X̄|y_i) − (Λ(α) a_j)^T x_i‖² + µ_1 ‖α‖_{L1}

with ∀ l = 1, . . . , p, a^s_{jl} = α_k a_{jl} for the k such that t_l ∈ I_k, and

Λ(α) = Diag(α_1 I_{|I_1|}, . . . , α_D I_{|I_D|}) ∈ M_{p×p}.

Nathalie Villa-Vialaneix | SISIR 20/34
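The reparameterization above can be checked numerically: since (Λ(α)a_j)^T x_i = Σ_k α_k Σ_{l∈I_k} a_{jl} x_{il}, the α-problem is an ordinary D-dimensional LASSO on the aggregated features z_{ik} = Σ_{l∈I_k} a_{jl} x_{il}. A minimal sketch (the sizes p, D and the contiguous intervals are hypothetical, for illustration):

```python
import numpy as np

# Hypothetical sizes for illustration
p, D = 12, 3
intervals = np.array_split(np.arange(p), D)   # I_1, ..., I_D: contiguous index sets

def Lambda(alpha):
    """Block-diagonal shrinkage matrix Diag(alpha_1 I_|I1|, ..., alpha_D I_|ID|)."""
    diag = np.concatenate([np.full(len(I), a_k) for I, a_k in zip(intervals, alpha)])
    return np.diag(diag)

rng = np.random.default_rng(1)
a = rng.normal(size=p)                 # one EDR direction a_j
x = rng.normal(size=p)                 # one observation x_i
alpha = np.array([0.0, 1.0, 0.3])      # one interval fully shrunk to zero

# (Lambda(alpha) a)^T x  ==  alpha^T z, with z_k = sum_{l in I_k} a_l x_l,
# so the alpha-LASSO reduces to a standard LASSO on D features.
z = np.array([a[I] @ x[I] for I in intervals])
lhs = (Lambda(alpha) @ a) @ x
rhs = alpha @ z
print(np.allclose(lhs, rhs))   # True
```

This is why the second step stays a standard LASSO problem, only in dimension D instead of p.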

Page 37: About functional SIR

Summary. SISIR: a two-step approach

First step: solve the projection problem (using SIR and L2 regularization of Σ), which provides the estimates (a_j)_{j∈{1,...,d}} of the vectors spanning the EDR space.

Second step: sparsity on the D intervals using α = (α_1, . . . , α_D), solving a LASSO problem: this handles the functional setting by penalizing entire intervals and not just isolated points.

Nathalie Villa-Vialaneix | SISIR 21/34

Page 38: About functional SIR

SISIR: Characteristics

uses the approach based on the correlation formulation (because thedimensionality of the optimization problem is smaller);

uses a shrinkage approach and optimizes shrinkage coefficients in asingle optimization problem;

handles functional setting by penalizing entire intervals and not justisolated points.

Nathalie Villa-Vialaneix | SISIR 22/34

Page 39: About functional SIR

An automatic approach to define intervals

1 Initial state: ∀ k = 1, . . . , p, τ_k = {t_k}

2 Iterate:
   along the regularization path, select three values for µ_1: P% of the coefficients are zero, P% of the coefficients are non-zero, and best GCV;
   define D− (“strong zeros”) and D+ (“strong non-zeros”);
   merge consecutive “strong zeros” (resp. “strong non-zeros”), or “strong zeros” (resp. “strong non-zeros”) separated by a small number of intervals of undetermined type;
   until no more iterations can be performed.

3 Output: a collection of models (the first with p intervals, the last with 1), M*_D (optimal for GCV), and the corresponding GCV_D versus D (number of intervals).

Final solution: minimize GCV_D over D.

Nathalie Villa-Vialaneix | SISIR 23/34
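The merging step can be sketched as follows; this is an illustrative toy version (not the SISIR package's code), where the tolerated number of undetermined intervals between two strong ones is a hypothetical parameter `gap`:

```python
# Toy sketch of the interval-merging step. Each interval carries a label:
# '-' (strong zero), '+' (strong non-zero) or '?' (undetermined). Consecutive
# intervals sharing a strong label, possibly separated by at most `gap`
# undetermined intervals, are merged into one interval.
def merge_intervals(labels, gap=1):
    """Return the merged partition as a list of (label, count) runs."""
    merged = []
    i = 0
    while i < len(labels):
        lab = labels[i]
        j = i + 1
        if lab in "+-":
            while j < len(labels):
                # count undetermined intervals separating two strong ones
                k = j
                while k < len(labels) and labels[k] == "?":
                    k += 1
                if k < len(labels) and labels[k] == lab and (k - j) <= gap:
                    j = k + 1      # absorb the gap and the matching interval
                else:
                    break
        merged.append((lab, j - i))
        i = j
    return merged

print(merge_intervals(list("--?-++?++?")))
```

Each pass reduces the number of intervals D, which yields the collection of nested models indexed by D.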

Page 45: About functional SIR

Outline

1 Background and motivation

2 Presentation of SIR

3 Our proposal

4 Simulations and Real data

Nathalie Villa-Vialaneix | SISIR 24/34

Page 46: About functional SIR

Simulation framework

Data generated with:

X(t) a Gaussian process with mean µ(t) = −5 + 4t − 4t² and a Matérn covariance

a_j(t) = sin( t(2+j)π/2 − (j−1)π/3 ) 1_{I_j}(t)

Y = Σ_{j=1}^d log |⟨X, a_j⟩|

One model: (M1), d = 1, I_1 = [0.2, 0.4].

Nathalie Villa-Vialaneix | SISIR 25/34
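A minimal sketch of this simulation design, assuming a Matérn-3/2 covariance with a hypothetical range parameter `rho` (the slide does not specify these) and a Riemann sum for ⟨X, a_j⟩:

```python
import numpy as np

def simulate_m1(n=100, p=200, d=1, rho=0.1, seed=0):
    """Simulate model (M1): d = 1, I_1 = [0.2, 0.4]. Illustrative sketch."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, p)
    mu = -5.0 + 4.0 * t - 4.0 * t**2                  # mean function
    # Matern 3/2 covariance (assumed smoothness), sampled on the grid t
    dist = np.abs(t[:, None] - t[None, :]) / rho
    K = (1.0 + np.sqrt(3.0) * dist) * np.exp(-np.sqrt(3.0) * dist)
    L = np.linalg.cholesky(K + 1e-8 * np.eye(p))      # jitter for stability
    X = mu + rng.normal(size=(n, p)) @ L.T            # n sample paths of X
    y = np.zeros(n)
    for j in range(1, d + 1):
        a_j = np.sin(t * (2 + j) * np.pi / 2 - (j - 1) * np.pi / 3)
        a_j *= (t >= 0.2) & (t <= 0.4)                # indicator of I_j
        y += np.log(np.abs(X @ a_j / p))              # <X, a_j> via Riemann sum
    return t, X, y

t, X, y = simulate_m1()
```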

Page 47: About functional SIR

Definition of the intervals

[Figure: estimated coefficients along the iterative definition of the intervals, for D = p = 200 (initial state = LASSO), D = 142, D = 41 and D = 5.]

Nathalie Villa-Vialaneix | SISIR 26/34

Page 48: About functional SIR

Second model

(M2): d = 3 and I_1 = [0, 0.1], I_2 = [0.5, 0.65] and I_3 = [0.65, 0.78].

Nathalie Villa-Vialaneix | SISIR 27/34

Page 49: About functional SIR

Second model

[Figure: intervals estimated by SISIR, compared with standard and sparse SIR.]

Nathalie Villa-Vialaneix | SISIR 28/34

Page 50: About functional SIR

Tecator dataset

relevant intervals

easily interpretable

good MSE

Nathalie Villa-Vialaneix | SISIR 29/34

Page 51: About functional SIR

Sunflower dataset

climatic time series (between 1975 and 2012, in France)

daily measures from April to October

X = evapotranspiration, Y = yield, n = 111, p = 309

Nathalie Villa-Vialaneix | SISIR 30/34

Page 52: About functional SIR

Sunflower dataset

only two points identified outside the interval

focus on the second half of the interval

matches expert knowledge

Nathalie Villa-Vialaneix | SISIR 31/34

Page 53: About functional SIR

Conclusion

SI-SIR:

sparse dimension reduction model adapted to functional framework

fully automated definition of relevant intervals in the range of the predictors

Package SISIR available on CRAN athttps://cran.r-project.org/package=SISIR.

Perspectives

adaptation to multiple X

application to large-scale real data (agricultural application: X = {temperature, rainfall, ...}, Y = {yield})

replace CV criterion?

Nathalie Villa-Vialaneix | SISIR 32/34

Page 54: About functional SIR

Nathalie Villa-Vialaneix | SISIR 33/34

Page 55: About functional SIR

Aneiros, G. and Vieu, P. (2014). Variable selection in infinite-dimensional problems. Statistics and Probability Letters, 94:12–20.

Bernard-Michel, C., Gardes, L., and Girard, S. (2008). A note on sliced inverse regression with regularizations. Biometrics, 64(3):982–986.

Coudret, R., Liquet, B., and Saracco, J. (2014). Comparison of sliced inverse regression approaches for underdetermined cases. Journal de la Société Française de Statistique, 155(2):72–96.

Fauvel, M., Deschene, C., Zullo, A., and Ferraty, F. (2015). Fast forward feature selection of hyperspectral images for classification with Gaussian mixture models. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 8(6):2824–2831.

Ferraty, F. and Hall, P. (2015). An algorithm for nonlinear, nonparametric model choice and prediction. Journal of Computational and Graphical Statistics, 24(3):695–714.

Ferraty, F., Hall, P., and Vieu, P. (2010). Most-predictive design points for functional data predictors. Biometrika, 97(4):807–824.

Ferré, L. and Yao, A. (2003). Functional sliced inverse regression analysis. Statistics, 37(6):475–488.

Fraiman, R., Gimenez, Y., and Svarc, M. (2016). Feature selection for functional data. Journal of Multivariate Analysis, 146:191–208.

Gregorutti, B., Michel, B., and Saint-Pierre, P. (2015). Grouped variable importance with random forests and application to multiple functional data analysis.

Nathalie Villa-Vialaneix | SISIR 33/34

Page 56: About functional SIR

Computational Statistics and Data Analysis, 90:15–35.

Grollemund, P., Abraham, C., Baragatti, M., and Pudlo, P. (2017). Bayesian functional linear regression with sparse step functions. Preprint.

James, G., Wang, J., and Zhu, J. (2009). Functional linear regression that's interpretable. Annals of Statistics, 37(5A):2083–2108.

Li, L. and Nachtsheim, C. (2008). Sparse sliced inverse regression. Technometrics, 48(4):503–510.

Li, L. and Yin, X. (2008). Sliced inverse regression with regularizations. Biometrics, 64(1):124–131.

Liquet, B. and Saracco, J. (2012). A graphical tool for selecting the number of slices and the dimension of the model in SIR and SAVE approaches. Computational Statistics, 27(1):103–125.

Matsui, H. and Konishi, S. (2011). Variable selection for functional regression models via the L1 regularization. Computational Statistics and Data Analysis, 55(12):3304–3310.

Ni, L., Cook, D., and Tsai, C. (2005). A note on shrinkage sliced inverse regression. Biometrika, 92(1):242–247.

Park, A., Aston, J., and Ferraty, F. (2016). Stable and predictive functional domain selection with application to brain images. Preprint arXiv:1606.02186.

Nathalie Villa-Vialaneix | SISIR 34/34

Page 57: About functional SIR

Parameter estimation

H (number of slices): usually, SIR is known to be not very sensitive to the number of slices (as long as H > d + 1). We took H = 10 (i.e., 10/30 observations per slice).

µ_2 and d (ridge estimate Â):

L-fold CV for µ_2 (for a d_0 large enough). Note that GCV as described in [Li and Yin, 2008] cannot be used, since the current version of the L2 penalty involves the use of an estimate of Σ^{−1}.

Using again L-fold CV, ∀ d = 1, . . . , d_0, an estimate of R(d) = d − E[Tr(Π_d Π̂_d)], in which Π_d and Π̂_d are the projectors onto the first d dimensions of the EDR space and of its estimate, is derived similarly as in [Liquet and Saracco, 2012]. The evolution of R(d) versus d is studied to select a relevant d.

µ_1 (LASSO): glmnet is used, in which µ_1 is selected by CV along the regularization path.

Nathalie Villa-Vialaneix | SISIR 34/34
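The R(d) criterion compares the projectors onto the true and estimated EDR subspaces; a minimal sketch in a simplified Euclidean setting (the slide uses Σ-orthogonal projectors, here plain orthogonal ones for illustration):

```python
import numpy as np

def r_criterion(A_true, A_hat):
    """d - Tr(P P_hat) for the projectors onto span(A_true) and span(A_hat)."""
    Q, _ = np.linalg.qr(A_true)     # orthonormal basis of the true subspace
    Qh, _ = np.linalg.qr(A_hat)     # orthonormal basis of the estimate
    d = A_true.shape[1]
    return d - np.trace((Q @ Q.T) @ (Qh @ Qh.T))

# Identical subspaces give R = 0; orthogonal ones give R = d.
A = np.array([[1.0], [0.0], [0.0]])
print(round(r_criterion(A, A), 6))                                # 0.0
print(round(r_criterion(A, np.array([[0.0], [1.0], [0.0]])), 6))  # 1.0
```

Small values of R(d) thus indicate that the estimated subspace of dimension d captures the true one, which motivates inspecting R(d) versus d.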