SC7/SM6 Bayes Methods HT19
Lecturer: Geoff Nicholls
Lecture 9: ABC
Notes and Problem sheets are available at
http://www.stats.ox.ac.uk/~nicholls/BayesMethods/
and via the MSc weblearn pages.
ABC - why work hard to fit a model that is wrong anyway?
Approximate Bayesian Computation is a Monte Carlo scheme for generating samples approximately distributed according to the posterior. It is particularly useful when the likelihood is intractable but the model is easy to simulate.
This is actually quite a common scenario. If the observation model p(y|θ) is proportional (as a function of y) to some tractable p(y, θ), then

L(θ; y) = p(y, θ)/c(θ)

with

c(θ) = ∫ p(y, θ) dy,

and c(θ) may be intractable. These problems are called doubly intractable. Standard MCMC won't work, as we can't calculate the acceptance probability α(θ′|θ).
ABC example: the Ising model - a doubly intractable model.
Denote by ΩY = {0, 1}^(n²) the set of all binary images Y = (Y1, Y2, ..., Yn²), Yi ∈ {0, 1}, where i = 1, 2, ..., n² indexes the cells of the square lattice of image cells. Let #y give the number of disagreeing neighbor pairs in the binary image Y = y.

The Ising model is the following distribution over ΩY:

p(y|θ) = exp(−θ#y)/c(θ).

Here θ is a positive smoothing parameter and

c(θ) = ∑_{y∈ΩY} exp(−θ#y)

is a normalizing constant which we can't compute for n large.∗

∗There is a formula for c(θ) in the special case of periodic boundary conditions, but that is not generally our image model of choice.
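To see the intractability concretely, here is a brute-force Python sketch of c(θ) (the helper names n_disagree and c_brute are mine, and free boundary conditions are assumed). It is feasible only for tiny lattices, since the sum has 2^(n²) terms.

```python
import itertools
import math

def n_disagree(y, n):
    """#y: number of disagreeing neighbour pairs on an n x n lattice (free boundaries)."""
    d = 0
    for r in range(n):
        for c in range(n):
            if c + 1 < n and y[r * n + c] != y[r * n + c + 1]:
                d += 1
            if r + 1 < n and y[r * n + c] != y[(r + 1) * n + c]:
                d += 1
    return d

def c_brute(theta, n):
    """c(theta) = sum over all 2^(n^2) binary images of exp(-theta * #y)."""
    return sum(math.exp(-theta * n_disagree(y, n))
               for y in itertools.product((0, 1), repeat=n * n))
```

For n = 3 this is a sum of 512 terms; for a modest 16 x 16 image it would already have 2^256 terms, which is why c(θ) is out of reach in practice.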
Suppose we have image data y and we want to estimate θ. If
π(θ) is a prior for θ then the posterior is
π(θ|y) = p(y|θ)π(θ)/p(y).
Consider doing MCMC targeting π(θ|y). Choose a simple proposal for the scalar parameter θ, say θ′ ∼ U(θ − a, θ + a) with a > 0, so the proposal is symmetric. The acceptance probability is

α(θ′|θ) = min{1, [p(y|θ′)π(θ′)] / [p(y|θ)π(θ)]} = min{1, [c(θ)/c(θ′)] × (easy stuff)},

and although the overall constant p(y) cancels, c(θ)/c(θ′) does not, and so we are left with an acceptance probability we cannot evaluate.
Rejection, ABC-style
The basic ABC algorithm approximates the following variant of
rejection. Suppose y and θ are both discrete so p(y|θ) ≤ 1.
The following algorithm simulates θ ∼ π(θ|y) where (as usual)
π(θ|y) ∝ p(y|θ)π(θ).
(1) Simulate θ ∼ π(θ) and y′ ∼ p(y′|θ).
(2) If y′ = y return θ and stop, otherwise goto (1).
The second line simulates an event which succeeds with probability p(y|θ).
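As a sanity check, here is a minimal Python sketch of this rejection scheme on a toy discrete model (the model and all names are hypothetical, chosen only to make the check concrete): θ ∈ {0.3, 0.7} with a uniform prior, and y|θ ∼ Bernoulli(θ).

```python
import random

def rejection_sample(prior, simulate, y, rng):
    """Step (1): theta ~ pi(theta), y' ~ p(.|theta); step (2): accept iff y' == y."""
    thetas = list(prior)
    weights = [prior[t] for t in thetas]
    while True:
        theta = rng.choices(thetas, weights=weights)[0]
        if simulate(theta, rng) == y:
            return theta

# hypothetical toy model: theta in {0.3, 0.7}, y | theta ~ Bernoulli(theta)
prior = {0.3: 0.5, 0.7: 0.5}
simulate = lambda theta, rng: int(rng.random() < theta)

rng = random.Random(1)
draws = [rejection_sample(prior, simulate, y=1, rng=rng) for _ in range(20000)]
# exact posterior by Bayes: Pr(theta = 0.7 | y = 1) = 0.35 / 0.50 = 0.7
```

The empirical frequency of θ = 0.7 among the returned draws should be close to the exact posterior probability 0.7.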
Let Θrej be the value returned by this algorithm.
To find the probability for Θrej = θ, we sum the probability of
returning Θrej = θ over all sequences of rejections ending in the
return value θ,
Pr(Θrej = θ) = π(θ)p(y|θ) + π(θ)p(y|θ) ∑_{θ′} π(θ′)(1 − p(y|θ′)) + ...
= π(θ)p(y|θ)(1 + (1 − p(y)) + (1 − p(y))² + ...)
= π(θ)p(y|θ)/p(y),

with p(y) = ∑_θ π(θ)p(y|θ).
The ABC algorithm
The idea here is that if y′ is “close” to y then π(θ|y′) should be
close to π(θ|y). To measure “close” we choose some summary
statistics S(y) = (S1(y), ..., Sp(y)) of the data, which inform θ,
and which we want our model to reproduce well.
We define a distance measure (often Euclidean) d(s′, s) between
p-component vectors s′ = S(y′), s = S(y), and a threshold δ,
and modify the algorithm above.
(1) Simulate θ ∼ π(θ) and y′ ∼ p(y′|θ).
(2) If d(S(y′), S(y)) < δ return θ and stop, otherwise goto (1).
If Θabc is the output then
Θabc ∼ π(θ|d(S(Y ), S(y)) < δ).
We have made two (more) approximations: we summarise the data, y → S(y), and we replace the statement "the realised value of the data is Y = y" with "the realised value of the data lies in the ball Y ∈ ∆y", where

∆y = {y′ : d(S(y′), S(y)) < δ}.

If S is a sufficient statistic then

π(θ|Y = y) = π(θ|S(Y) = S(y)),

so Θabc → π(θ|y) in distribution as δ → 0 in that case.
Example
Data model: yi ∼ Poisson(Λ)∗, i = 1, 2, ..., n, with n = 5.
Prior: Λ ∼ Γ(α = 1, β = 1).
Summary statistic: S(y) = ȳ, the sample mean.
Distance measure: d(ȳ′, ȳ) = |ȳ′ − ȳ|, with δ = 0.5 and δ = 1.

ABC algorithm:
(1) Simulate λ ∼ Γ(α, β) and y′i ∼ Poisson(λ), i = 1, 2, ..., n.
(2) If |ȳ′ − ȳ| < δ return λ and stop, otherwise goto (1).

We do the entire Bayesian inference without calculating L or π(λ|y). Modeling is freed up: we just specify how to simulate parameters and data.

∗True value was Λ = 2.
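A minimal Python version of this example (my own sketch, not the lecture's R code; the Poisson sampler uses Knuth's product method since the standard library lacks one). The Gamma prior is conjugate here, so the exact posterior Γ(α + ∑yi, β + n) gives a check on the output.

```python
import math
import random
import statistics

def rpois(lam, rng):
    """Poisson draw via Knuth's product method (fine for small lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def abc_poisson(y, delta, T, alpha=1.0, beta=1.0, rng=None):
    """Rejection ABC: summary = sample mean, distance = |mean(y') - mean(y)|."""
    rng = rng or random.Random()
    n, ybar = len(y), statistics.mean(y)
    accepted = []
    for _ in range(T):
        lam = rng.gammavariate(alpha, 1.0 / beta)    # prior draw (rate beta)
        y_sim = [rpois(lam, rng) for _ in range(n)]  # simulate a dataset
        if abs(statistics.mean(y_sim) - ybar) < delta:
            accepted.append(lam)
    return accepted

y = [2, 1, 3, 2, 2]    # hypothetical data with true Lambda = 2, n = 5
post = abc_poisson(y, delta=0.5, T=20000, rng=random.Random(0))
```

For these (made-up) data the exact posterior is Γ(11, 6), with mean 11/6 ≈ 1.83; the mean of the accepted λ-values should sit close to this, with a small bias from δ > 0.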
[Figure: true posterior and approximate posteriors for LAMBDA (x-axis 1 to 4; y-axis, posterior and approximate posterior density, 0 to 1). Legend: true posterior; rejection, δ = 1; δ = 1, adjusted; δ = 0.5; δ = 0.5, adjusted.]
Regression adjustment of samples
ABC generates samples (θ(t), y(t)), t = 1, 2, ..., T, from the joint distribution (θ, y′) ∼ π(θ, y′). We want pairs (θ, y) ∼ π(θ, y) with y the data.
Assume (1): s = S(y) is sufficient, and let s′ = S(y′).

Assume (2): shifting the data from y to y′ shifts the posterior mean

µ(s) = E(θ|S(Y) = s)

but has no other effect on the distribution of θ|y. If this is true, then for θ ∼ π(·|y) we can alternatively write

θ = µ(s) + ε

with ε ∼ F a mean-zero random variable, where F does not depend on y.

Assume (3): for d(s′, s) small, the linear approximation

µ(s′) ≃ α + (s′ − s)β

is good. Clearly α ≃ µ(s) is the posterior mean at the data.

If we knew µ(s) ≃ α, we could simulate θ|s by simulating ε and setting θ = α + ε.
We have lots of pairs (θ(t), S(y(t))), and we can regress them to estimate the local linear approximation,

θ(t) = α + (S(y(t)) − s)β + ε(t).

We estimate α, β using weighted∗ OLS regression, obtaining α̂, β̂, and set

θ(t)adj = θ(t) − (S(y(t)) − s)β̂
= [α + (S(y(t)) − s)β + ε(t)] − [(S(y(t)) − s)β̂]
≃ α + ε(t),

if β̂ ≃ β is a good approximation. In this case θ(t)adj ∼ π(·|s) approximately, a sample from the posterior, under our assumptions.
The regression correction adjusts the distribution of θ at y′ to move it onto the distribution of θ at y.
Example: We did exactly this for the Poisson example above. It worked well; see the figure and R code for this lecture.

∗The regression is weighted because the weight for each (θ′, y′) pair is I{y′ ∈ ∆y}.
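The adjustment itself is a few lines of code. Below is a minimal unweighted Python version for a scalar summary (a sketch only; weighting by I{y′ ∈ ∆y} corresponds to having already discarded the rejected pairs, as is done here). The synthetic check constructs pairs where θ really is linear in the summary plus noise, so the adjustment should remove the dependence on s′.

```python
import random
import statistics

def regression_adjust(thetas, summaries, s_obs):
    """theta_adj = theta - (S(y') - s) * beta_hat, with beta_hat from OLS of
    theta on (S(y') - s) over the accepted ABC pairs."""
    xs = [s - s_obs for s in summaries]
    xbar, tbar = statistics.mean(xs), statistics.mean(thetas)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxt = sum((x - xbar) * (t - tbar) for x, t in zip(xs, thetas))
    b_hat = sxt / sxx
    return [t - x * b_hat for t, x in zip(thetas, xs)]

# synthetic check (made-up numbers): theta = 2 + 0.5 * (s' - s) + noise, so the
# adjusted draws should centre on alpha = 2 with much smaller spread
rng = random.Random(0)
s_obs = 3.0
xs = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
thetas = [2.0 + 0.5 * x + rng.gauss(0.0, 0.1) for x in xs]
adj = regression_adjust(thetas, [s_obs + x for x in xs], s_obs)
```

The adjusted sample should have mean close to 2 and visibly smaller variance than the raw draws.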
Example: Radiocarbon dating revisited
We will fit our shrinkage model using ABC and see how it compares to MCMC. ABC is much easier here.
Recall the model.
Data model: yi ∼ N(µ(θi), σ² + σc(θi)²) for i = 1, ..., n.
Prior: uniform span v ∼ U(0, U − L); ψ1 ∼ U(L, U − v), ψ2 = ψ1 + v; θi ∼ U(ψ1, ψ2), i = 1, ..., n.
Summary statistic: s(y) = y (works here, as there are just n = 7 dates).
Distance: Euclidean, d(s(y′), s(y)) = |y′ − y|. We chose δ by experimentation (see figure).
for (k in 1:K) {
  span  = runif(1, min=0, max=U-L)          # uniform span
  lower = runif(1, min=L, max=U-span)       # psi[1]
  upper = lower + span                      # psi[2]
  dates = round(runif(nd, min=lower, max=upper))
  y.sim = mu[dates] + sqrt(d^2 + err[dates]^2) * rnorm(nd)
  D     = sqrt(sum((y - y.sim)^2)) / 1000   # arbitrary scale
  if (D < delta) {
    S   = rbind(S, y.sim)
    psi = rbind(psi, c(lower, upper))
  }
}
Implementation detail: it is common practice to save all the simulation output, not just the draws satisfying y′ ∈ ∆y, and to choose δ so that some fixed fraction of draws is retained. This lets us trial different δ-values without rerunning the simulation. The code above is simple rejection-ABC.
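That bookkeeping can be sketched as follows (hypothetical helper, my own names): store every (θ, distance) pair, then recover the accepted set and the implied δ for any retained fraction after the fact.

```python
def retain_fraction(thetas, distances, frac):
    """Keep the fraction frac of draws with the smallest distances;
    the implied delta is the largest retained distance."""
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    k = max(1, int(frac * len(distances)))
    kept = [thetas[i] for i in order[:k]]
    delta = distances[order[k - 1]]
    return kept, delta

# toy usage: retaining the best half of four stored draws
kept, delta = retain_fraction([1, 2, 3, 4], [0.4, 0.1, 0.3, 0.2], 0.5)
# kept == [2, 4], delta == 0.2
```

Re-running with a different frac reuses the same stored simulations, which is the point of saving everything.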
[Figure: four histograms of the posterior span (x-axis: span, 0 to 500; y-axis: frequency), for ABC with δ = 0.2, ABC with δ = 0.4, ABC with δ = 0.8, and MCMC.]
ABC example: the Ising Model
The Ising distribution Y ∼ p(·|θ) is easy to sample for moderate n-values using MCMC. Here are 3 samples Y ∼ p(y|θ):

[Figure: three Ising model samples, for θ = 0.4, θ = 0.8 and θ = 1.2.]

Technically these samples are not exactly distributed according to p(y|θ) (the chain is only approximately converged), but we can make them as good as we need by taking long MCMC runs.
[Skip next two slides on MCMC at first reading]
We can sample Y ∼ p(·|θ) using MCMC: Suppose Y (t) = y.
[Step 1] Choose an update, something simple. Choose a cell
i ∼ U1, 2, ..., n2. Set y′i = 1−yi and y′j = yj for j 6= i. Notice
that q(y′|y) = q(y|y′) = 1/n2 for y′, y differing at exactly one
cell.
[Step 2] Write down the algorithm. Let Y (t) = y. Y (t+1) is
determined in the following way.
[1] Simulate y′ ∼ q(y′|y) as above, and u ∼ U(0, 1).
[2] If u < α(y′|y) set Y (t+1) = y′ and otherwise set Y (t+1) = y.
[Step 3] Calculate α. The q's cancel as usual, so

α(y′|y) = min{1, [p(y′|θ)q(y|y′)] / [p(y|θ)q(y′|y)]} = min{1, exp(−θ(#y′ − #y))}.
It is clear the algorithm is irreducible (q is irreducible and α is never zero) and aperiodic (rejection is possible), so it is ergodic with target p(·|θ).
An implementation of this algorithm in R is available in the code
for this lecture. Some samples produced using this code are
shown above.
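For reference, here is a minimal Python transcription of Steps 1 to 3 (a sketch assuming free boundary conditions, not the lecture's R implementation; the n_disagree helper is included only to check the output).

```python
import math
import random

def n_disagree(y, n):
    """#y: disagreeing neighbour pairs on an n x n lattice (free boundaries)."""
    return (sum(y[r*n+c] != y[r*n+c+1] for r in range(n) for c in range(n - 1))
            + sum(y[r*n+c] != y[(r+1)*n+c] for r in range(n - 1) for c in range(n)))

def ising_mcmc(n, theta, sweeps, rng):
    """Single-site Metropolis for p(y|theta) proportional to exp(-theta * #y)."""
    y = [rng.randint(0, 1) for _ in range(n * n)]      # arbitrary start
    for _ in range(sweeps * n * n):
        i = rng.randrange(n * n)                       # cell i ~ U{1,...,n^2}
        r, c = divmod(i, n)
        nbrs = [y[a*n+b] for a, b in ((r-1, c), (r+1, c), (r, c-1), (r, c+1))
                if 0 <= a < n and 0 <= b < n]
        d_now = sum(v != y[i] for v in nbrs)           # disagreements at cell i now
        d_flip = len(nbrs) - d_now                     # ... and after flipping y_i
        # alpha = min(1, exp(-theta * (#y' - #y))), with #y' - #y = d_flip - d_now
        if rng.random() < math.exp(-theta * (d_flip - d_now)):
            y[i] = 1 - y[i]
    return y
```

Larger θ gives smoother images: at θ = 0 every image is equally likely, while for large θ the chain coarsens towards nearly constant images.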
ABC inference for θ
Recall the doubly intractable inference for θ|y.
We have data Y = y which is an n × n binary matrix and our
observation model for Y , Y ∼ p(y|θ) is the Ising model. Our
goal is to estimate θ. The posterior is
π(θ|y) = L(θ; y)π(θ)/p(y),
and the likelihood
p(y|θ) = exp(−θ#y)/c(θ)
depends on c(θ), an intractable function of θ.
Our ABC algorithm sampling θ ∼ π(θ|y) approximately is as
follows. Suppose we have a prior for θ. Given the scale of θ,
Exp(2) is a natural generic prior. We take S(y) = #y, which is
actually sufficient for θ.
(1) Simulate θ ∼ Exp(2) and y′ ∼ exp(−θ#y′)/c(θ).
(2) If |#y′ −#y| < δ return θ, otherwise, goto (1).
We implemented this (see attached R, 8x8 Ising, θ = 0.8) and
estimated the posterior densities in the figure.
The distribution converges to something stable as δ → 0. The
regression adjustment for δ = 0.1 corrects its distribution to
agree with that for δ = 0.05.
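Putting the pieces together, here is a compact Python sketch of the whole pipeline (my own transcription rather than the attached R; a small 6 x 6 lattice, a coarse δ, and a short inner MCMC run are used to keep it cheap, and the "data" are themselves simulated at θ = 0.8).

```python
import math
import random

def n_disagree(y, n):
    """#y: disagreeing neighbour pairs on an n x n lattice (free boundaries)."""
    return (sum(y[r*n+c] != y[r*n+c+1] for r in range(n) for c in range(n - 1))
            + sum(y[r*n+c] != y[(r+1)*n+c] for r in range(n - 1) for c in range(n)))

def ising_mcmc(n, theta, sweeps, rng):
    """Approximate draw Y ~ p(.|theta) by single-site Metropolis."""
    y = [rng.randint(0, 1) for _ in range(n * n)]
    for _ in range(sweeps * n * n):
        i = rng.randrange(n * n)
        r, c = divmod(i, n)
        nbrs = [y[a*n+b] for a, b in ((r-1, c), (r+1, c), (r, c-1), (r, c+1))
                if 0 <= a < n and 0 <= b < n]
        d_now = sum(v != y[i] for v in nbrs)
        if rng.random() < math.exp(-theta * (len(nbrs) - 2 * d_now)):
            y[i] = 1 - y[i]
    return y

def abc_ising(y_obs, n, T, delta, rng, sweeps=80):
    """Rejection ABC with the sufficient summary S(y) = #y and prior Exp(2)."""
    s_obs = n_disagree(y_obs, n)
    accepted = []
    for _ in range(T):
        theta = rng.expovariate(2.0)                 # (1) prior draw
        y_sim = ising_mcmc(n, theta, sweeps, rng)    #     simulate data (inner MCMC)
        if abs(n_disagree(y_sim, n) - s_obs) < delta:
            accepted.append(theta)                   # (2) accept if summaries close
    return accepted

rng = random.Random(1)
y_obs = ising_mcmc(6, 0.8, 200, rng)                 # stand-in "data", true theta = 0.8
post = abc_ising(y_obs, 6, T=300, delta=6, rng=rng)
```

Note that the likelihood, and in particular c(θ), is never evaluated anywhere: only forward simulation and the summary #y are needed.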
[Figure: approximate posterior densities for THETA (x-axis 0.5 to 1.5; y-axis 0 to 1.5). Legend: δ = 0.1; δ = 0.05; 0.1 adjusted; 0.5 adjusted; true.]