Bayesian Quadrature: Model-based Approximate Integration
David Duvenaud, University of Cambridge
December 8, 2012
The Quadrature Problem

• We want to estimate an integral

    Z = \int f(x)\, p(x)\, dx

• Most computational problems in inference correspond to integrals:
  – Expectations
  – Marginal distributions
  – Integrating out nuisance parameters
  – Normalization constants
  – Model comparison

[Figure: an example function f(x) together with an input density p(x).]
Sampling Methods

• Monte Carlo methods: sample from p(x), take the empirical mean:

    \hat{Z} = \frac{1}{N} \sum_{i=1}^{N} f(x_i)

• Possibly sub-optimal for two reasons:
  – Random bunching up of samples
  – Often, nearby function values will be similar
• Model-based and quasi-Monte Carlo methods spread out samples to achieve faster convergence.

[Figure: function f(x), input density p(x), and sample locations.]
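For comparison, a minimal sketch of the plain Monte Carlo estimator above; the integrand and the standard-normal p(x) are illustrative choices, not taken from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices (not from the slides): a smooth test integrand f
# and a standard-normal input density p(x).
f = lambda x: np.sin(3 * x) + 0.5 * x**2

N = 10_000
x = rng.standard_normal(N)     # draw x_i ~ p(x)
Z_hat = f(x).mean()            # empirical mean  (1/N) * sum_i f(x_i)
print(Z_hat)                   # true value is 0.5 for this f and p
```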
Model-based Integration

• Place a prior on f, for example a GP.
• The posterior over f implies a posterior over Z.

[Figure: GP mean and ±2 SD band over f(x), the input density p(x), the sample locations, and the implied posterior p(Z), shown as samples are added one at a time.]

• We'll call quadrature with a GP prior Bayesian Quadrature.
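A minimal sketch of this idea under assumed toy choices (squared-exponential kernel, illustrative integrand, standard-normal p(x)): fit a GP to a few evaluations, draw functions from the posterior, and integrate each draw against p(x) to obtain samples from the implied posterior over Z.

```python
import numpy as np

rng = np.random.default_rng(0)

def k(a, b, ell=0.5):
    """Squared-exponential kernel on 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

# Illustrative choices, not taken from the talk:
f = lambda x: np.sin(3 * x) + 0.5 * x**2                  # toy integrand
p = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)    # standard-normal density

xs = np.array([-1.5, -0.3, 0.4, 1.2])   # a few evaluations of f
fs = f(xs)

grid = np.linspace(-4.0, 4.0, 300)       # dense grid used to represent functions
K = k(xs, xs) + 1e-8 * np.eye(len(xs))
Ks = k(grid, xs)

# GP posterior over f on the grid (zero prior mean, noise-free observations)
post_mean = Ks @ np.linalg.solve(K, fs)
post_cov = k(grid, grid) - Ks @ np.linalg.solve(K, Ks.T)

# Each posterior draw of f yields one draw of Z = int f(x) p(x) dx
draws = rng.multivariate_normal(post_mean, post_cov + 1e-8 * np.eye(len(grid)),
                                size=500)
Z_draws = np.trapz(draws * p(grid), grid, axis=1)
print(Z_draws.mean(), Z_draws.std())      # implied posterior over Z
```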
Bayesian Quadrature Estimator

• The posterior over Z has a mean that is linear in the observed values f(x_s):

    E_{\mathrm{GP}}[Z \mid f(x_s)] = z^\top K^{-1} f(x_s), \quad \text{where } z_n = \int k(x, x_n)\, p(x)\, dx

• It is natural to choose samples to minimize the posterior variance of Z:

    V[Z \mid f(x_s)] = \iint k(x, x')\, p(x)\, p(x')\, dx\, dx' \;-\; z^\top K^{-1} z

• The variance doesn't depend on the function values at all!
• Choosing samples sequentially to minimize this variance gives Sequential Bayesian Quadrature (sketched below).
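A minimal sketch of these two formulas under an assumed squared-exponential kernel and Gaussian input density N(μ, σ²), for which z_n and the double integral have standard closed forms; the kernel, length scale, and density parameters below are illustrative choices, not taken from the talk:

```python
import numpy as np

def bq_posterior(xs, fs, ell=0.5, mu=0.0, sigma=1.0, jitter=1e-8):
    """Posterior mean and variance of Z = int f(x) N(x; mu, sigma^2) dx under
    a zero-mean GP prior on f with k(x, x') = exp(-(x - x')^2 / (2 ell^2))."""
    xs, fs = np.asarray(xs, dtype=float), np.asarray(fs, dtype=float)
    K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2 / ell**2)
    K += jitter * np.eye(len(xs))
    # z_n = int k(x, x_n) p(x) dx   (closed form: SE kernel, Gaussian density)
    z = ell / np.sqrt(ell**2 + sigma**2) * np.exp(
        -0.5 * (xs - mu) ** 2 / (ell**2 + sigma**2))
    # double integral  int int k(x, x') p(x) p(x') dx dx'
    kpp = ell / np.sqrt(ell**2 + 2 * sigma**2)
    mean = z @ np.linalg.solve(K, fs)       # E[Z | f(xs)] = z^T K^{-1} f(xs)
    var = kpp - z @ np.linalg.solve(K, z)   # V[Z | f(xs)]: independent of f(xs)
    return mean, var

# Example: estimate Z for a toy integrand f(x) = exp(-x^2) with p(x) = N(0, 1)
xs = np.linspace(-2.0, 2.0, 7)
m, v = bq_posterior(xs, np.exp(-xs**2))
print(m, np.sqrt(v))    # compare with the exact value 1/sqrt(3) ≈ 0.577
```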
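Because the variance above depends only on the sample locations, the next point can be chosen before f is ever evaluated there. A minimal greedy sketch of Sequential Bayesian Quadrature over a candidate grid, under the same assumed SE-kernel and Gaussian-density setup as the sketch above:

```python
import numpy as np

def sbq_points(n_points, ell=0.5, mu=0.0, sigma=1.0, jitter=1e-8,
               candidates=np.linspace(-3.0, 3.0, 301)):
    """Greedily choose evaluation points that minimize the BQ posterior
    variance of Z (which depends only on the locations, not on f)."""

    def variance(xs):
        xs = np.asarray(xs, dtype=float)
        K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2 / ell**2)
        K += jitter * np.eye(len(xs))
        z = ell / np.sqrt(ell**2 + sigma**2) * np.exp(
            -0.5 * (xs - mu) ** 2 / (ell**2 + sigma**2))
        return ell / np.sqrt(ell**2 + 2 * sigma**2) - z @ np.linalg.solve(K, z)

    chosen = []
    for _ in range(n_points):
        # naive scan over the grid; each step refits the small Gram matrix
        best = min(candidates, key=lambda c: variance(chosen + [c]))
        chosen.append(float(best))
    return chosen

print(sbq_points(5))   # points spread out where p(x) = N(mu, sigma^2) has mass
```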
Things you can do with Bayesian Quadrature

• Can incorporate knowledge about the function, such as symmetries (see the sketch after this slide):

    f(x, y) = f(y, x) \iff k_s(x, y, x', y') = k(x, y, x', y') + k(x, y, y', x') + k(y, x, x', y') + k(y, x, y', x')

• Can condition on gradients.
• The posterior variance is a natural convergence diagnostic.

[Figure: estimate of Z with posterior error bars against the number of samples (100 to 200).]
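A minimal sketch of such a symmetrized kernel, using an assumed product squared-exponential base kernel (an illustrative choice, not necessarily the talk's):

```python
import numpy as np

def k_base(x1, y1, x2, y2, ell=1.0):
    # Assumed base kernel: product of squared-exponentials over the two inputs
    return np.exp(-0.5 * ((x1 - x2) ** 2 + (y1 - y2) ** 2) / ell**2)

def k_sym(x1, y1, x2, y2, ell=1.0):
    # Symmetrized kernel: GP samples with this covariance satisfy f(x, y) = f(y, x)
    return (k_base(x1, y1, x2, y2, ell) + k_base(x1, y1, y2, x2, ell)
            + k_base(y1, x1, x2, y2, ell) + k_base(y1, x1, y2, x2, ell))

# f(0.3, 1.2) and f(1.2, 0.3) get identical covariances with any third point:
print(k_sym(0.3, 1.2, -0.5, 0.8), k_sym(1.2, 0.3, -0.5, 0.8))
```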
More things you can do with Bayesian Quadrature

• Can compute the likelihood of the GP and learn its kernel.
• Can compute marginals with error bars, in two ways:
  – Simply from the GP posterior, or
  – By recomputing f_θ(x) for different θ with the same x.

[Figure: marginal over σ_x with error bars, with the true σ_x and the marginal likelihood overlaid (σ_x from 0.2 to 1.2).]

• Much nicer than histograms!
Rates of Convergence

What is the rate of convergence of SBQ when its assumptions are true?

[Figure: convergence plots of the expected variance / MMD, empirical rates in the RKHS, empirical rates out of the RKHS, and a bound on the Bayesian error.]
GPs vs Log-GPs for Inference

[Figure, left: a likelihood function ℓ(x) (values up to about 120) with its evaluations, the GP posterior, and the Log-GP posterior. Right: log ℓ(x) (values from about −1 to 4) with its evaluations and the GP posterior on the log-likelihood.]
GPs vs Log-GPs

[Figure, left: a likelihood function ℓ(x) (values up to about 200) with its evaluations, the GP posterior, and the Log-GP posterior. Right: log ℓ(x) (values from about −200 to 0) with its evaluations and the GP posterior on the log-likelihood.]
Integrating under Log-GPs

[Figure, left: ℓ(x) with its evaluations, the GP posterior, the Log-GP posterior, the mean of the GP, the approximate Log-GP, and inducing points. Right: log ℓ(x) with its evaluations, the GP posterior, and inducing points.]
Conclusions

• Model-based integration allows active learning about integrals, can require fewer samples than MCMC, and allows us to check our assumptions.
• BQ has nice convergence properties if its assumptions are correct.
• For inference, a GP is not especially appropriate, but other models are intractable.
Limitations and Future Directions

• Right now, BQ really only works in low dimensions (< 10), when the function is fairly smooth, and is only worth using when computing f(x) is expensive.
• How do we extend it to high dimensions? Gradient observations are helpful, but a D-dimensional gradient is D separate observations.
• It seems unlikely that we'll find another tractable nonparametric distribution like GPs. Should we accept that we'll need a second round of approximate integration on a surrogate model?
• How much overhead is worthwhile? Work on bounded rationality seems relevant.

Thanks!