

optimisation more congenial to my taste as it relies on our traditional tools and intuition and can be recycled for estimation in some cases.

A final point I would like to stress is that the optimisation methods covered in this survey require a lot of analytic knowledge about the underlying model, either through derivatives or via matrix inversions that may get ungainly in large dimensions. Similarly, both expectation–maximisation and MM algorithms are truly wonderful objects, but they eventually only apply to a highly restricted collection of objects, a collection where an alternative simulation-based approach should also be feasible. In this respect, the survey does not truly address high-dimension and ultra-high-dimension issues, especially those in non-convex or finite spaces that modern classification problems usually face.


[Received June 2013, accepted June 2013]

International Statistical Review (2014), 82, 1, 81–89 doi:10.1111/insr.12030

Rejoinder

Kenneth Lange1,2, Eric C. Chi2 and Hua Zhou3

1 Departments of Biomathematics and Statistics, University of California, Los Angeles, CA 90095-1766, USA. E-mail: [email protected]
2 Department of Human Genetics, University of California, Los Angeles, CA 90095-1766, USA. E-mail: [email protected]
3 Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203, USA. E-mail: [email protected]

International Statistical Review (2014), 82, 1, 81–89. © 2014 The Authors. International Statistical Review © 2014 International Statistical Institute


1 Response to Y. Atchade and G. Michailidis

We are grateful to Profs. Atchade and Michailidis for discussing proximal splitting methods and highlighting their connection to the methods under review. Although proximal splitting methods have been around for decades, they have recently enjoyed a renaissance in handling non-smooth regularisation, not only in statistics but also in signal processing and machine learning. Combettes & Wajs (2005) provided a comprehensive overview of proximal splitting methods, including the proximal gradient method and the alternating direction method of multipliers (ADMM) discussed by Atchade and Michailidis.

Hunter made an interesting point that requires even stronger emphasis in the context of proximal methods. One of the reasons methods such as ADMM have become so popular is that, like MM and block coordinate descent, they decompose challenging optimisation problems into simpler subproblems. These decompositions often lighten the load of coding. Moreover, just as proximal gradient algorithms can be accelerated by Nesterov's method, ADMM and its variants can also be accelerated with modest changes to the underlying algorithms (Deng & Yin, 2012; Goldstein et al., 2012; Goldfarb et al., 2012). Thus, proximal splitting can lead to simpler code with no sacrifice in computational speed.
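The decomposition-plus-acceleration recipe is easy to demonstrate. The following minimal Python sketch (our illustration, not code from any of the cited papers; function names are ours) applies a Nesterov-accelerated proximal gradient method to the lasso, where the proximal mapping of the penalty reduces to coordinatewise soft-thresholding:

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal mapping of t * ||.||_1: shrink each coordinate toward zero.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fista(A, b, lam, n_iter=500):
    # Nesterov-accelerated proximal gradient for the lasso objective
    # 0.5 * ||A x - b||^2 + lam * ||x||_1 (an illustrative sketch).
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    y, t = x.copy(), 1.0
    for _ in range(n_iter):
        grad = A.T @ (A @ y - b)         # gradient of the smooth half at y
        x_new = soft_threshold(y - grad / L, lam / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)   # momentum step
        x, t = x_new, t_new
    return x
```

With an orthogonal design the lasso solution is soft-thresholding of `b` itself, which offers a quick sanity check of the iteration.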

The benefits of this approach can be seen in a recent convex version of cluster analysis (Chi & Lange, 2013; Hocking et al., 2011; Lindsten et al., 2011). Given $p$ points $x_1, \ldots, x_p$ in $\mathbb{R}^q$, the new clustering method operates by minimising the convex criterion
$$
F_\gamma(U) \;=\; \frac{1}{2} \sum_{i=1}^{p} \|x_i - u_i\|_2^2 \;+\; \gamma \sum_{i<j} w_{ij} \|u_i - u_j\|, \qquad (1)
$$
where $\gamma$ is a positive regularisation parameter, $w_{ij}$ is a non-negative weight, and the $i$-th column $u_i$ of the matrix $U$ is the cluster centre attached to the point $x_i$. The norm in the first summation is the Euclidean norm; the penalty norms can be either Euclidean or non-Euclidean. Figure 1 shows the solution path to this convex problem as a function of $\gamma$.

This problem generalises the fused lasso, and as with other fused lasso problems, the penalties make minimisation challenging. The original problem can be reformulated as
$$
\begin{aligned}
\text{minimize} \quad & \frac{1}{2} \sum_{i=1}^{p} \|x_i - u_i\|_2^2 + \gamma \sum_{i<j} w_{ij} \|v_{ij}\| \\
\text{subject to} \quad & u_i - u_j - v_{ij} = 0. \qquad (2)
\end{aligned}
$$

This alternative formulation is ripe for attack by proximal splitting. Our recent paper (Chi & Lange, 2013) presents variants of ADMM and the related alternating minimisation algorithm (AMA) (Tseng, 1991) that solve the equality-constrained version (2). As remarked earlier, both approaches are simple enough to encourage parameter acceleration. As a rule, the proximal splitting framework generates simple modular solutions. Consider our ADMM solution. Let $\lambda_{ij}$ denote the Lagrange multiplier for the $ij$-th equality constraint. We describe a single round of ADMM block updates of the variables $u_i$, $v_{ij}$, and $\lambda_{ij}$. The centroids $u_i$ are updated as follows:
$$
u_i \;=\; \frac{1}{1 + p\nu}\, y_i \;+\; \frac{p\nu}{1 + p\nu}\, \bar{x},
$$


Figure 1. Cluster path assignment: the simulated example shows five well-separated clusters and the assigned clusters identified by the convex clustering algorithm under an $\ell_2$-penalty. The solid lines trace the path of the individual cluster centres as the regularisation parameter $\gamma$ increases. [Plot omitted: cluster centre paths in the $(x, y)$ plane.]

where $\nu$ is the positive quadratic penalty parameter in the augmented Lagrangian, $\bar{x}$ is the average of the $x_i$, and
$$
y_i \;=\; x_i \;+\; \sum_{j} \left[ \lambda_{ij} + \nu v_{ij} \right] \;-\; \sum_{j} \left[ \lambda_{ji} + \nu v_{ji} \right].
$$

The updates for $v_{ij}$ are independent and amount to
$$
v_{ij} \;=\; \operatorname{prox}_{\sigma_{ij} \|\cdot\| / \nu}\!\left( u_i - u_j - \nu^{-1} \lambda_{ij} \right), \qquad (3)
$$
where $\sigma_{ij} = \gamma w_{ij}$. Finally, the Lagrange multipliers are updated by
$$
\lambda_{ij} \;=\; \lambda_{ij} + \nu \,( v_{ij} - u_i + u_j ).
$$

Each update is simple, and the effects of changing the norm in the fusion penalty are isolated to the updates for $v_{ij}$. In other words, only the proximal mapping needs to be changed. The updates for the AMA method exhibit similar simplicity and modularity.
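The three block updates just described can be sketched in a few lines of Python. The code below is an illustrative translation (not code from Chi & Lange, 2013), assuming uniform weights $w_{ij} = 1$ and the Euclidean penalty; `nu` is the augmented-Lagrangian parameter and all names are ours:

```python
import numpy as np

def prox_l2(z, t):
    # Proximal mapping of t * ||.||_2: block soft-thresholding of the vector z.
    norm = np.linalg.norm(z)
    return max(0.0, 1.0 - t / norm) * z if norm > 0 else z

def convex_cluster_admm(X, gamma, nu=1.0, n_iter=200):
    # Repeat one ADMM block round: update centroids U, then the splitting
    # variables v_ij, then the multipliers lambda_ij.
    p, q = X.shape
    xbar = X.mean(axis=0)
    pairs = [(i, j) for i in range(p) for j in range(i + 1, p)]
    V = {pr: np.zeros(q) for pr in pairs}
    Lam = {pr: np.zeros(q) for pr in pairs}
    U = X.copy()
    for _ in range(n_iter):
        Y = X.copy()
        for (i, j) in pairs:
            Y[i] += Lam[(i, j)] + nu * V[(i, j)]
            Y[j] -= Lam[(i, j)] + nu * V[(i, j)]
        U = Y / (1 + p * nu) + (p * nu / (1 + p * nu)) * xbar
        for (i, j) in pairs:
            V[(i, j)] = prox_l2(U[i] - U[j] - Lam[(i, j)] / nu, gamma / nu)
            Lam[(i, j)] += nu * (V[(i, j)] - U[i] + U[j])
    return U
```

For a large value of `gamma` every centroid should fuse to the grand mean, which is an easy way to check the updates on toy data.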

Proximal splitting methods have also proven to be effective when mixed and matched with other optimisation methods. For example, Ramani & Fessler (2013) combined ADMM, MM, and acceleration to concoct an image reconstruction algorithm that outperforms all currently competing algorithms. Proximal methods themselves are undergoing improvement. Application of Newton and quasi-Newton methods to proximal methods is especially promising (Becker & Fadili, 2012; Lee et al., 2012).

We agree with Atchade and Michailidis that stochastic proximal gradient algorithms represent an important frontier requiring further exploration. Recent results on inexact variants of proximal splitting provide important clues for understanding the conditions under which stochastic variants converge as reliably as their deterministic counterparts (Deng & Yin, 2012; Schmidt et al., 2011). Finally, it is noteworthy that some of the most recent refinements on stochastic gradient methods involve generalisation to second-order methods (Byrd et al., 2011; Byrd et al., 2012). It is refreshing to see classical ideas recycled and refurbished for modern purposes. Second-order methods have been developed for lasso regularised optimisation in the deterministic (Byrd et al., 2012) and stochastic settings (Byrd et al., 2012), but it remains to be seen whether deterministic second-order proximal methods can be generalised to the stochastic setting where other non-smooth regularisers come into play. Atchade and Michailidis are surely right in calling for a deeper understanding of the convergence behaviour of such potential hybrids. In practice, it appears that the Hessian approximation need not be as accurate as the gradient approximation. This observation can lead to substantial computational savings (Byrd et al. 2011, 2012). Obtaining a clearer understanding of how to tune the relative accuracies of the gradient and Hessian to obtain the best performance is one of many theoretical challenges begging for resolution.

2 Response to D. Hunter

We are grateful to Prof. Hunter for emphasising that 'there is simply no such thing as a universal "gold standard" when it comes to algorithms'. His concrete treatment of the Bradley–Terry model is particularly apt. Readers may find a fuller discussion of this simple example helpful in understanding that there are multiple ways to skin a statistical cat. The likelihood of the Bradley–Terry model considered by Hunter is

$$
L(\gamma) \;=\; \prod_{i} \prod_{j} \left( \frac{\gamma_i}{\gamma_i + \gamma_j} \right)^{w_{ij}}, \qquad (4)
$$
where $w_{ij}$ is the number of times individual $i$ beats individual $j$ and the $\gamma_i > 0$ are the parameters to be estimated. Here, $w_{ii} = 0$ by convention.
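For readers who want to experiment, here is a small Python sketch (our own illustration, with hypothetical function names) that evaluates the likelihood (4) on the log scale and runs MM sweeps of the standard form for this model, in which $\gamma_i$ is replaced by its total wins divided by $\sum_{j \ne i} (w_{ij} + w_{ji}) / (\gamma_i + \gamma_j)$; we renormalise each sweep because the likelihood is invariant to rescaling $\gamma$:

```python
import numpy as np

def bt_log_likelihood(gamma, W):
    # Log of the likelihood (4): sum_ij w_ij [ln gamma_i - ln(gamma_i + gamma_j)].
    ll = 0.0
    p = len(gamma)
    for i in range(p):
        for j in range(p):
            if i != j and W[i, j] > 0:
                ll += W[i, j] * (np.log(gamma[i]) - np.log(gamma[i] + gamma[j]))
    return ll

def bt_mm(W, n_iter=200):
    # MM sweep: gamma_i <- (wins of i) / sum_{j != i} (w_ij + w_ji)/(gamma_i + gamma_j),
    # followed by renormalisation to pin down the free scale.
    p = W.shape[0]
    gamma = np.ones(p)
    for _ in range(n_iter):
        new = np.empty(p)
        for i in range(p):
            denom = sum((W[i, j] + W[j, i]) / (gamma[i] + gamma[j])
                        for j in range(p) if j != i)
            new[i] = W[i].sum() / denom
        gamma = new / new.sum()
    return gamma
```

The iterates increase the likelihood monotonically, and on any win matrix where one player dominates, that player receives the largest estimated strength.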

2.1 Bradley–Terry as a Geometric Program

Geometric programming (Boyd et al., 2007; Ecker, 1980; Peterson, 1976) deals with posynomials, namely functions of the form
$$
f(x) \;=\; \sum_{\alpha \in S} c_\alpha \prod_{i=1}^{n} x_i^{\alpha_i}.
$$
Here, the index set $S \subset \mathbb{R}^n$ is finite, and all coefficients $c_\alpha$ and all components $x_1, \ldots, x_n$ of the argument $x$ of $f(x)$ are positive. The possibly fractional powers $\alpha_i$ corresponding to a particular $\alpha$ may be positive, negative, or zero. In geometric programming, we minimize a posynomial $f(x)$ subject to posynomial inequality constraints of the form $u_j(x) \le 1$ for $1 \le j \le q$. In some versions of geometric programming, equality constraints of posynomial type are permitted (Boyd et al., 2007).
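The definition can be made concrete with a tiny evaluator (illustrative Python; representing a posynomial as a coefficient vector plus a matrix of exponent rows is our own convention):

```python
import numpy as np

def posynomial(c, alpha):
    # Build f(x) = sum_k c_k * prod_i x_i^alpha[k, i]; the c_k and x_i are
    # positive, while the exponents alpha[k, i] may be positive, negative,
    # zero, or fractional.
    c = np.asarray(c, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    def f(x):
        x = np.asarray(x, dtype=float)
        return float(np.sum(c * np.prod(x ** alpha, axis=1)))
    return f
```

For instance, `posynomial([1.0, 2.0], [[2, 0], [-0.5, 1]])` represents the posynomial $x_1^2 + 2 x_1^{-1/2} x_2$.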

Maximising the positive likelihood function (4) is equivalent to minimising its reciprocal
$$
\prod_{i} \prod_{j} \gamma_i^{-w_{ij}} \,(\gamma_i + \gamma_j)^{w_{ij}}, \qquad (5)
$$
which is a posynomial after expanding the powers $(\gamma_i + \gamma_j)^{w_{ij}}$. Therefore, the Bradley–Terry model is an unconstrained geometric programming problem. Recognising standard convex programming problems such as geometric programming can free statisticians from the often


onerous task of designing and implementing their own optimisation algorithms. For instance, using the open-source convex optimisation software CVX (Grant & Boyd, 2008; 2012) to minimize the criterion (5) requires only six lines of MATLAB code:

p = size(W,1);

[rowidx,colidx,wvec] = find(W);

cvx_begin gp

variable gamma(p)

minimize prod(((gamma(rowidx)+gamma(colidx))./gamma(rowidx)).^wvec)

cvx_end

Because convex program solvers such as CVX implement variants of Newton's method, they tend to falter on high-dimensional problems. For the Bradley–Terry model, CVX handles p < 100 problems very efficiently but struggles for p > 1000. Parameter separation by the MM principle and exploitation of special Hessian structures are two possible remedies.

2.2 Another MM for Bradley–Terry

We now derive another MM algorithm for the Bradley–Terry model. By the arithmetic–geometric mean inequality, the objective (5) is majorised by
$$
\left[ \prod_{i} \prod_{j} \gamma_{ni}^{-w_{ij}} \,(\gamma_{ni} + \gamma_{nj})^{w_{ij}} \right] \sum_{i} \sum_{j} \frac{w_{ij}}{2\tilde{w}} \left[ \left( \frac{\gamma_i}{\gamma_{ni}} \right)^{-2\tilde{w}} + \left( \frac{\gamma_i + \gamma_j}{\gamma_{ni} + \gamma_{nj}} \right)^{2\tilde{w}} \right],
$$

where $\tilde{w} = \sum_i \sum_j w_{ij}$. The parameters $\gamma_i$ and $\gamma_j$ are still entangled in the $(\gamma_i + \gamma_j)^{2\tilde{w}}$ term but can be separated by the further majorisation
$$
(\gamma_i + \gamma_j)^{2\tilde{w}} \;\le\; \frac{1}{2} \left( 2\gamma_i - \gamma_{ni} + \gamma_{nj} \right)^{2\tilde{w}} + \frac{1}{2} \left( 2\gamma_j + \gamma_{ni} - \gamma_{nj} \right)^{2\tilde{w}},
$$

thanks to the convexity of the function $s^{2\tilde{w}}$. The resulting surrogate function is easy to optimise because all of the $\gamma_i$ parameters are separated. The next iterate $\gamma_{n+1,i}$ of $\gamma_i$ is obtained by minimising the univariate function
$$
\frac{\sum_{j \ne i} w_{ij}}{2\tilde{w}} \left( \frac{\gamma_i}{\gamma_{ni}} \right)^{-2\tilde{w}} \;+\; \frac{1}{4\tilde{w}} \sum_{j \ne i} (w_{ij} + w_{ji}) \left( \frac{2\gamma_i - \gamma_{ni} + \gamma_{nj}}{\gamma_{ni} + \gamma_{nj}} \right)^{2\tilde{w}},
$$

which is strictly convex according to the second derivative test. Both bisection and Newton's method locate its minimum quickly.
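The two-stage majorisation is easy to check numerically. The Python sketch below (using a small hypothetical win matrix of our own choosing) verifies that the separated surrogate touches the objective (5) at the current iterate $\gamma_n$ and dominates it at nearby points:

```python
import numpy as np

# A small hypothetical win matrix; wtil is the total number of comparisons.
W = np.array([[0., 1., 2.], [1., 0., 1.], [0., 1., 0.]])
wtil = W.sum()

def objective(g):
    # The reciprocal likelihood (5): prod_ij g_i^(-w_ij) (g_i + g_j)^(w_ij).
    val = 1.0
    p = len(g)
    for i in range(p):
        for j in range(p):
            val *= g[i] ** (-W[i, j]) * (g[i] + g[j]) ** W[i, j]
    return val

def surrogate(g, gn):
    # Separated majoriser: the AM-GM step followed by the convexity split of
    # (g_i + g_j)^(2 wtil); equality should hold at g = gn.
    p, total = len(g), 0.0
    for i in range(p):
        for j in range(p):
            if W[i, j] == 0:
                continue
            r = (2 * g[i] - gn[i] + gn[j]) / (gn[i] + gn[j])
            s = (2 * g[j] + gn[i] - gn[j]) / (gn[i] + gn[j])
            total += W[i, j] / (2 * wtil) * ((g[i] / gn[i]) ** (-2 * wtil)
                                             + 0.5 * r ** (2 * wtil)
                                             + 0.5 * s ** (2 * wtil))
    return objective(gn) * total
```

Sampling random points in a neighbourhood of $\gamma_n$ and comparing `surrogate` with `objective` confirms the tangency and domination properties an MM algorithm requires.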

Although our new MM algorithm achieves parameter separation, the two successive majorisations and the lack of analytic updates probably make it uncompetitive with the simple MM algorithm of Hunter. Nonetheless, this example illustrates the flexibility of the MM principle. Interested readers can refer to our recent paper (Lange & Zhou, 2014) for a general class of MM algorithms for geometric and signomial programming. In signomial programming, some of the coefficients $c_\alpha$ are allowed to be negative.


2.3 Exploiting Structure in High Dimensions

The importance of exploiting Hessian structure in high-dimensional optimisation can also be illustrated by the Bradley–Terry model. By switching to the parametrisation $\theta_i = \ln \gamma_i$, it suffices to minimize the equivalent negative log-likelihood
$$
f(\theta) \;=\; \sum_{i} \sum_{j} w_{ij} \left[ \ln\!\left( e^{\theta_i} + e^{\theta_j} \right) - \theta_i \right].
$$

This is a convex function because the terms $e^{\theta_i}$ are log-convex and the collection of log-convex functions is closed under addition (Boyd & Vandenberghe, 2004). The gradient has entries
$$
\frac{\partial}{\partial \theta_i} f(\theta) \;=\; \sum_{j \ne i} \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}} \,(w_{ij} + w_{ji}) \;-\; \sum_{j \ne i} w_{ij},
$$

and the Hessian has entries
$$
\frac{\partial^2}{\partial \theta_i \,\partial \theta_j} f(\theta) \;=\;
\begin{cases}
\displaystyle \sum_{k \ne i} \frac{e^{\theta_i} e^{\theta_k}}{\left( e^{\theta_i} + e^{\theta_k} \right)^2} \,(w_{ik} + w_{ki}) & i = j \\[2ex]
\displaystyle -\,\frac{e^{\theta_i} e^{\theta_j}}{\left( e^{\theta_i} + e^{\theta_j} \right)^2} \,(w_{ij} + w_{ji}) & i \ne j.
\end{cases}
$$

Computing Newton's direction $-[d^2 f(\theta)]^{-1} \nabla f(\theta)$ requires solving a system of linear equations, an expensive $O(p^3)$ operation that becomes prohibitive when the number of parameters $p$ is large. Fortunately, in large-scale competitions, most individuals or teams play only a small fraction of their possible opponents. This implies that the data matrix $W$ is sparse and consequently that the Hessian $d^2 f(\theta)$ is also sparse. This fact allows fast calculation of $[d^2 f(\theta)] v$ for any vector $v$ and suggests substitution of the conjugate gradient (CG) method for the traditional (Cholesky) method of solving for the Newton direction. The computational cost per iteration drops to $O(p^2)$, where the constant depends on the sparsity level and the number of CG iterations.
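The Hessian-vector recipe is easy to prototype. The following Python sketch (our illustration in dense numpy, not the authors' MATLAB implementation; real savings require a sparse representation of $W$) builds the gradient and a Hessian-vector product from the formulas above and solves for the Newton direction with plain CG, centring $\theta$ each iteration to remove the model's free scale:

```python
import numpy as np

def f_val(theta, W):
    # Negative log-likelihood f(theta) = sum_ij w_ij [ln(e^th_i + e^th_j) - th_i].
    E = np.exp(theta)
    S = E[:, None] + E[None, :]
    return np.sum(W * (np.log(S) - theta[:, None]))

def grad_hessprod(theta, W):
    # Gradient of f and a Hessian-vector product built from the entries above.
    E = np.exp(theta)
    S = E[:, None] + E[None, :]
    N = W + W.T                                    # games played between i and j
    P = E[:, None] / S                             # P_ij = e^th_i / (e^th_i + e^th_j)
    np.fill_diagonal(P, 0.0)
    grad = (P * N).sum(axis=1) - W.sum(axis=1)
    H_off = -(E[:, None] * E[None, :]) / S**2 * N  # off-diagonal Hessian entries
    np.fill_diagonal(H_off, 0.0)
    diag = -H_off.sum(axis=1)                      # diagonal = minus the row sums
    return grad, lambda v: H_off @ v + diag * v

def cg(hv, b, n_iter=100, tol=1e-12):
    # Plain conjugate gradients for hv(x) = b, started from zero.
    x = np.zeros_like(b)
    r, p = b.copy(), b.copy()
    rs = r @ r
    for _ in range(n_iter):
        Ap = hv(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def newton_cg_bt(W, n_iter=50):
    theta = np.zeros(W.shape[0])
    for _ in range(n_iter):
        g, hv = grad_hessprod(theta, W)
        if np.linalg.norm(g) < 1e-10:
            break
        step = cg(hv, -g)
        t = 1.0                                    # crude backtracking safeguard
        while f_val(theta + t * step, W) > f_val(theta, W) and t > 1e-6:
            t /= 2
        theta = theta + t * step
        theta -= theta.mean()                      # remove the free scale in theta
    return theta
```

Because the gradient always sums to zero, CG started from zero stays in the subspace orthogonal to the constant vector, where the Hessian is positive definite for a connected comparison graph.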

Figure 2 compares the progress of the Newton iterations using the Cholesky decomposition (NM) and the CG method on two data sets ($p = 1000$ and $2000$) simulated under the same conditions as Hunter's Figure 1. Our MATLAB code is available at http://www4.stat.ncsu.edu/~hzhou3/softwares/bradleyterry. The norm of the gradient vector serves to measure progress towards convergence; convergence is declared when its change per iteration falls below $10^{-6}$. For this convex problem, all stationary points represent global minima. Under the simulation conditions, each individual chooses $p/20$ opponents. Therefore, nearly 90% of the entries of the Hessian matrix are zero. The NM-CG approach achieves remarkable efficiency at $p = 2000$, demonstrating the importance of exploiting sparsity in high-dimensional data. Of course, the original MM updates mesh with sparsity equally well.

2.4 Acceleration for Non-smooth Problems

Hunter mentioned that, for smooth problems, MM-type algorithms combined with quasi-Newton acceleration can achieve the 'best of both worlds' by offering both stability and efficiency. Let us add that we have also had good results applying a different quasi-Newton acceleration scheme (Zhou et al., 2011) to non-smooth problems such as matrix completion (Chi et al., 2013) and regularised matrix regression (Zhou & Li, 2013). Nesterov acceleration works well for certain types of non-smooth problems (Beck & Teboulle, 2009).


Figure 2. Distance to convergence and timing results for Newton's method implemented via Cholesky decomposition (NM) versus conjugate gradients (NM-CG). [Plot omitted: first-order optimality on a log scale, from $10^{-10}$ to $10^{4}$, against time in seconds, with curves for NM and NM-CG at $p = 1000$ and $p = 2000$.]

3 Response to C. Robert

Prof. Robert has raised a number of objections from a Bayesian perspective, most of which we embrace. For instance, he said that integration comes more naturally to statisticians than optimisation. This is true; the current review is a modest attempt to change this state of affairs. We do not agree that an emphasis on optimisation neglects everything except estimation. For instance, the most convincing modern strides in model selection owe their existence to penalised estimation. Parameter tuning, cross-validation, and stability selection (Meinshausen & Bühlmann, 2010) all operate within the framework of optimisation. Robert may well be correct in asserting that variational Bayes (Wainwright & Jordan, 2008) and approximate Bayesian computation (Marin et al., 2012) will rescue Bayesian applications to big data. In our view, the jury is still out on the scope of these methods. In any event, variational Bayes operates by optimisation, so even fully committed Bayesians stand to gain from fluency in optimisation.

We have little experience with applying Bayesian inference to data summaries. Although this is a worthy suggestion, data summaries run the risk of losing vital information and presuppose knowledge of a good model for how the data are generated.

We are sympathetic to simulated annealing and have employed it in many of our scientific applications. It functions best on problems of intermediate size for which the computational complexity of all known algorithms is high. It is not truly an option on large data. Imagine, for instance, using simulated annealing to solve the travelling salesman problem with a million cities. As a rule, stochastic algorithms cannot compete with the speed of deterministic methods in optimisation. That is why we did not feature stochastic simulation in our review. However, ideas such as annealing do generalise successfully to optimisation. Our recent paper on parameter estimation in the presence of multiple modes advocates deterministic annealing (Zhou & Lange, 2010).

We agree with Prof. Robert that EM and MM algorithms are not panaceas. It takes careful thought to construct fast, stable algorithms. MM, EM, and block descent and ascent are always stable and typically easy to code and debug. The failure of the MM principle in applications to intractable integrals is the current biggest bottleneck. The random-effects logistic regression model discussed by Atchade and Michailidis is a prime example. Statisticians venturing into the terrain of convex optimisation must also exercise special caution. Our review gives a few hints and a brief history of successful techniques. We cannot foresee the future, but it would be surprising if the covered techniques did not prove helpful for many years to come. Finally, let us reiterate our agreement with Robert's contention that statistical inference is more than parameter estimation.

Acknowledgements

We thank our colleagues for their thoughtful commentaries. They raise intriguing points and arguments that deserve equally thoughtful responses.

References

Beck, A. & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1), 183–202. Available at http://epubs.siam.org/doi/pdf/10.1137/080716542.

Becker, S. & Fadili, J. (2012). A quasi-Newton proximal splitting method. In Advances in Neural Information Processing Systems 25, Eds. P. Bartlett, F.C.N. Pereira, C.J.C. Burges, L. Bottou & K.Q. Weinberger, pp. 2627–2635.

Boyd, S. & Vandenberghe, L. (2004). Convex Optimization. Cambridge: Cambridge University Press.

Boyd, S., Kim, S.-J., Vandenberghe, L. & Hassibi, A. (2007). A tutorial on geometric programming. Optim. Eng., 8(1), 67–127.

Byrd, R.H., Chin, G.M., Neveitt, W. & Nocedal, J. (2011). On the use of stochastic Hessian information in optimization methods for machine learning. SIAM J. Optim., 21(3), 977–995.

Byrd, R.H., Chin, G.M., Nocedal, J. & Oztoprak, F. (2012). A family of second-order methods for convex l1-regularized optimization. Technical report, Optimization Center, Northwestern University.

Byrd, R.H., Chin, G.M., Nocedal, J. & Wu, Y. (2012). Sample size selection in optimization methods for machine learning. Math. Program., 134(1), 127–155.

Chi, E.C. & Lange, K. (2013). Splitting methods for convex clustering. arXiv:1304.0499 [stat.ML].

Chi, E.C., Zhou, H., Chen, G.K., Del Vecchyo, D.O. & Lange, K. (2013). Genotype imputation via matrix completion. Genome Res., 23(3), 509–518.

Combettes, P.L. & Wajs, V.R. (2005). Signal recovery by proximal forward–backward splitting. Multiscale Model. Simul., 4(4), 1168–1200.

Deng, W. & Yin, W. (2012). On the global and linear convergence of the generalized alternating direction method of multipliers. CAAM Technical Report TR12-14, Rice University.

Ecker, J.G. (1980). Geometric programming: methods, computations and applications. SIAM Rev., 22(3), 338–362.

Goldfarb, D., Ma, S. & Scheinberg, K. (2012). Fast alternating linearization methods for minimizing the sum of two convex functions. Math. Program., 1–34.

Goldstein, T., O'Donoghue, B. & Setzer, S. (2012). Fast alternating direction optimization methods. Technical report cam12-35, University of California, Los Angeles.

Grant, M. & Boyd, S. (2008). Graph implementations for nonsmooth convex programs. In Recent Advances in Learning and Control, Eds. V.D. Blondel, S.P. Boyd & H. Kimura, pp. 95–110. London: Springer-Verlag. http://stanford.edu/~boyd/graph_dcp.html.

Grant, M. & Boyd, S. (2012). CVX: Matlab Software for Disciplined Convex Programming, version 2.0 beta. http://cvxr.com/cvx.

Hocking, T., Vert, J.-P., Bach, F. & Joulin, A. (2011). Clusterpath: an algorithm for clustering using convex fusion penalties. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), Eds. L. Getoor & T. Scheffer, pp. 745–752. New York, NY: ACM.

Lange, K. & Zhou, H. (2014). MM algorithms for geometric and signomial programming. Math. Program. Ser. A, 143(1–2), 339–356.

Lee, J., Sun, Y. & Saunders, M. (2012). Proximal Newton-type methods for convex optimization. In Advances in Neural Information Processing Systems 25, Eds. P. Bartlett, F.C.N. Pereira, C.J.C. Burges, L. Bottou & K.Q. Weinberger, pp. 836–844.

Lindsten, F., Ohlsson, H. & Ljung, L. (2011). Just relax and come clustering! A convexification of k-means clustering. Technical report, Linköpings universitet.

Marin, J.-M., Pudlo, P., Robert, C.P. & Ryder, R.J. (2012). Approximate Bayesian computational methods. Statist. Comput., 22(6), 1167–1180.

Meinshausen, N. & Bühlmann, P. (2010). Stability selection. J. R. Stat. Soc. Ser. B Stat. Methodol., 72(4), 417–473.

Peterson, E.L. (1976). Geometric programming. SIAM Rev., 18(1), 1–51.

Ramani, S. & Fessler, J. (2013). Accelerated non-Cartesian SENSE reconstruction using a majorize–minimize algorithm combining variable-splitting. In Proceedings IEEE International Symposium on Biomedical Imaging, pp. 700–703.

Schmidt, M., Roux, N.L. & Bach, F. (2011). Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems 24, Eds. J. Shawe-Taylor, R.S. Zemel, P. Bartlett, F.C.N. Pereira & K.Q. Weinberger, pp. 1458–1466.

Tseng, P. (1991). Applications of a splitting algorithm to decomposition in convex programming and variational inequalities. SIAM J. Control Optim., 29(1), 119–138.

Wainwright, M.J. & Jordan, M.I. (2008). Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1(1–2), 1–305.

Zhou, H., Alexander, D. & Lange, K. (2011). A quasi-Newton acceleration for high-dimensional optimization algorithms. Statist. Comput., 21, 261–273.

Zhou, H. & Lange, K.L. (2010). On the bumpy road to the dominant mode. Scand. J. Stat., 37(4), 612–631.

Zhou, H. & Li, L. (2013). Regularized matrix regression. J. R. Stat. Soc. Ser. B Stat. Methodol., DOI: 10.1111/rssb.12031.

[Received July 2013, accepted July 2013]
