SC7/SM6 Bayes Methods HT19
Lecturer: Geoff Nicholls
Lecture 9: ABC
Notes and Problem sheets are available at
http://www.stats.ox.ac.uk/~nicholls/BayesMethods/
and via the MSc weblearn pages.
ABC - why work hard to fit a model that is wrong anyway?
Approximate Bayesian Computation is a Monte Carlo scheme for generating samples approximately distributed according to the posterior. It is particularly useful when the likelihood is intractable but the model is easy to simulate.
This is actually quite a common scenario. If the observation model p(y|θ) is proportional (as a function of y) to some tractable p(y, θ), then

L(θ; y) = p(y, θ)/c(θ)

with

c(θ) = ∫ p(y, θ) dy,

and c(θ) may be intractable. These problems are called doubly intractable. Standard MCMC won't work, as we can't calculate the acceptance probability α(θ′|θ).
ABC example: the Ising model - a doubly intractable model.
Denote by ΩY = {0, 1}^(n²) the set of all binary images Y = (Y1, Y2, ..., Yn²), Yi ∈ {0, 1}, where i = 1, 2, ..., n² indexes the cells of the square lattice of image cells. Let #y give the number of disagreeing neighbor pairs in the binary image Y = y.

The Ising model is the following distribution over ΩY:

p(y|θ) = exp(−θ#y)/c(θ).

Here θ is a positive smoothing parameter and

c(θ) = ∑_{y∈ΩY} exp(−θ#y)

is a normalizing constant which we can't compute for n large.∗

∗There is a formula for c(θ) in the special case of periodic boundary conditions, but that is not generally our image model of choice.
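To see the intractability concretely, here is a brute-force Python sketch of c(θ) (the helper names n_disagree and c_brute are mine, and free boundary conditions are assumed). It is feasible only for tiny lattices, since the sum has 2^(n²) terms.

```python
import itertools
import math

def n_disagree(y, n):
    """#y: number of disagreeing neighbour pairs on an n x n lattice (free boundaries)."""
    d = 0
    for r in range(n):
        for c in range(n):
            if c + 1 < n and y[r * n + c] != y[r * n + c + 1]:
                d += 1
            if r + 1 < n and y[r * n + c] != y[(r + 1) * n + c]:
                d += 1
    return d

def c_brute(theta, n):
    """c(theta) = sum over all 2^(n^2) binary images of exp(-theta * #y)."""
    return sum(math.exp(-theta * n_disagree(y, n))
               for y in itertools.product((0, 1), repeat=n * n))
```

For n = 3 this is a sum of 512 terms; for a modest 16 x 16 image it would already have 2^256 terms, which is why c(θ) is out of reach in practice.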
Suppose we have image data y and we want to estimate θ. If
π(θ) is a prior for θ then the posterior is
π(θ|y) = p(y|θ)π(θ)/p(y).
Consider doing MCMC targeting π(θ|y). Choose a simple proposal for the scalar parameter θ, say θ′ ∼ U(θ − a, θ + a) with a > 0, so the proposal is symmetric. The acceptance probability is

α(θ′|θ) = min{1, [p(y|θ′)π(θ′)] / [p(y|θ)π(θ)]} = min{1, [c(θ)/c(θ′)] × (easy stuff)},

and although the overall constant p(y) cancels, c(θ)/c(θ′) does not, and so we are left with an acceptance probability we cannot evaluate.
Rejection, ABC-style
The basic ABC algorithm approximates the following variant of
rejection. Suppose y and θ are both discrete so p(y|θ) ≤ 1.
The following algorithm simulates θ ∼ π(θ|y) where (as usual)
π(θ|y) ∝ p(y|θ)π(θ).
(1) Simulate θ ∼ π(θ) and y′ ∼ p(y′|θ).
(2) If y′ = y return θ and stop, otherwise goto (1).
The second line simulates an event which succeeds with probability p(y|θ).
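As a sanity check, here is a minimal Python sketch of this rejection scheme on a toy discrete model (the model and all names are hypothetical, chosen only to make the check concrete): θ ∈ {0.3, 0.7} with a uniform prior, and y|θ ∼ Bernoulli(θ).

```python
import random

def rejection_sample(prior, simulate, y, rng):
    """Step (1): theta ~ pi(theta), y' ~ p(.|theta); step (2): accept iff y' == y."""
    thetas = list(prior)
    weights = [prior[t] for t in thetas]
    while True:
        theta = rng.choices(thetas, weights=weights)[0]
        if simulate(theta, rng) == y:
            return theta

# hypothetical toy model: theta in {0.3, 0.7}, y | theta ~ Bernoulli(theta)
prior = {0.3: 0.5, 0.7: 0.5}
simulate = lambda theta, rng: int(rng.random() < theta)

rng = random.Random(1)
draws = [rejection_sample(prior, simulate, y=1, rng=rng) for _ in range(20000)]
# exact posterior by Bayes: Pr(theta = 0.7 | y = 1) = 0.35 / 0.50 = 0.7
```

The empirical frequency of θ = 0.7 among the returned draws should be close to the exact posterior probability 0.7.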
Let Θrej be the value returned by this algorithm.
To find the probability for Θrej = θ, we sum the probability of
returning Θrej = θ over all sequences of rejections ending in the
return value θ,
Pr(Θrej = θ) = π(θ)p(y|θ) + π(θ)p(y|θ) ∑_{θ′} π(θ′)(1 − p(y|θ′)) + ...
= π(θ)p(y|θ)(1 + (1 − p(y)) + (1 − p(y))² + ...)
= π(θ)p(y|θ)/p(y),

with p(y) = ∑_θ π(θ)p(y|θ).
The ABC algorithm
The idea here is that if y′ is “close” to y then π(θ|y′) should be
close to π(θ|y). To measure “close” we choose some summary
statistics S(y) = (S1(y), ..., Sp(y)) of the data, which inform θ,
and which we want our model to reproduce well.
We define a distance measure (often Euclidean) d(s′, s) between
p-component vectors s′ = S(y′), s = S(y), and a threshold δ,
and modify the algorithm above.
(1) Simulate θ ∼ π(θ) and y′ ∼ p(y′|θ).
(2) If d(S(y′), S(y)) < δ return θ and stop, otherwise goto (1).
If Θabc is the output then
Θabc ∼ π(θ|d(S(Y ), S(y)) < δ).
We have made two (more) approximations: we summarise the data, y → S(y), and we replace the statement "the realised value of the data is Y = y" with "the realised value of the data lies in the ball Y ∈ ∆y", where

∆y = {y′ : d(S(y′), S(y)) < δ}.

If S is a sufficient statistic then

π(θ|Y = y) = π(θ|S(Y) = S(y)),

so Θabc → π(θ|y) in distribution as δ → 0 in that case.
Example
Data model: yi ∼ Poisson(Λ)∗, i = 1, 2, ..., n, with n = 5.
Prior: Λ ∼ Γ(α = 1, β = 1).
Summary statistic: S(y) = ȳ, the sample mean.
Distance measure: d(ȳ′, ȳ) = |ȳ′ − ȳ|, with δ = 0.5 and δ = 1.

ABC algorithm:
(1) Simulate λ ∼ Γ(α, β) and y′i ∼ Poisson(λ), i = 1, 2, ..., n.
(2) If |ȳ′ − ȳ| < δ return λ and stop, otherwise goto (1).

We do the entire Bayesian inference without calculating L or π(λ|y). Modeling is freed up: we just specify how to simulate parameters and data.

∗True value was Λ = 2.
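A minimal Python version of this example (my own sketch, not the lecture's R code; the Poisson sampler uses Knuth's product method since the standard library lacks one). The Gamma prior is conjugate here, so the exact posterior Γ(α + ∑yi, β + n) gives a check on the output.

```python
import math
import random
import statistics

def rpois(lam, rng):
    """Poisson draw via Knuth's product method (fine for small lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def abc_poisson(y, delta, T, alpha=1.0, beta=1.0, rng=None):
    """Rejection ABC: summary = sample mean, distance = |mean(y') - mean(y)|."""
    rng = rng or random.Random()
    n, ybar = len(y), statistics.mean(y)
    accepted = []
    for _ in range(T):
        lam = rng.gammavariate(alpha, 1.0 / beta)    # prior draw (rate beta)
        y_sim = [rpois(lam, rng) for _ in range(n)]  # simulate a dataset
        if abs(statistics.mean(y_sim) - ybar) < delta:
            accepted.append(lam)
    return accepted

y = [2, 1, 3, 2, 2]    # hypothetical data with true Lambda = 2, n = 5
post = abc_poisson(y, delta=0.5, T=20000, rng=random.Random(0))
```

For these (made-up) data the exact posterior is Γ(11, 6), with mean 11/6 ≈ 1.83; the mean of the accepted λ-values should sit close to this, with a small bias from δ > 0.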
[Figure: true posterior and approximate posteriors for LAMBDA (x-axis 1 to 4; y-axis, posterior and approximate posterior density, 0 to 1). Legend: true posterior; rejection, δ = 1; δ = 1, adjusted; δ = 0.5; δ = 0.5, adjusted.]
Regression adjustment of samples
ABC generates samples (θ(t), y(t)), t = 1, 2, ..., T, from the joint distribution (θ, y′) ∼ π(θ, y′). We want pairs (θ, y) ∼ π(θ, y) with y the data.
Assume (1): s = S(y) is sufficient, and let s′ = S(y′).

Assume (2): shifting the data from y to y′ shifts the posterior mean

µ(s) = E(θ|S(Y) = s)

but has no other effect on the distribution of θ|y. If this is true, then for θ ∼ π(·|y) we can alternatively write

θ = µ(s) + ε

with ε ∼ F a mean-zero random variable, where F does not depend on y.

Assume (3): for d(s′, s) small, the linear approximation

µ(s′) ≃ α + (s′ − s)β

is good. Clearly α ≃ µ(s) is the posterior mean at the data.

If we knew µ(s) ≃ α, we could simulate θ|s by simulating ε and setting θ = α + ε.
We have lots of pairs (θ(t), S(y(t))), and we can regress them to estimate the local linear approximation,

θ(t) = α + (S(y(t)) − s)β + ε(t).

We estimate α, β using weighted∗ OLS regression, obtaining α̂, β̂, and set

θ(t)adj = θ(t) − (S(y(t)) − s)β̂
= [α + (S(y(t)) − s)β + ε(t)] − [(S(y(t)) − s)β̂]
≃ α + ε(t),

if β̂ ≃ β is a good approximation. In this case θ(t)adj ∼ π(·|s) approximately, a sample from the posterior, under our assumptions.
The regression correction adjusts the distribution of θ at y′ to move it onto the distribution of θ at y.
Example: We did exactly this for the Poisson example above. It worked well; see the figure and R code for this lecture.

∗The regression is weighted because the weight for each (θ′, y′) pair is I{y′ ∈ ∆y}.
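The adjustment itself is a few lines of code. Below is a minimal unweighted Python version for a scalar summary (a sketch only; weighting by I{y′ ∈ ∆y} corresponds to having already discarded the rejected pairs, as is done here). The synthetic check constructs pairs where θ really is linear in the summary plus noise, so the adjustment should remove the dependence on s′.

```python
import random
import statistics

def regression_adjust(thetas, summaries, s_obs):
    """theta_adj = theta - (S(y') - s) * beta_hat, with beta_hat from OLS of
    theta on (S(y') - s) over the accepted ABC pairs."""
    xs = [s - s_obs for s in summaries]
    xbar, tbar = statistics.mean(xs), statistics.mean(thetas)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxt = sum((x - xbar) * (t - tbar) for x, t in zip(xs, thetas))
    b_hat = sxt / sxx
    return [t - x * b_hat for t, x in zip(thetas, xs)]

# synthetic check (made-up numbers): theta = 2 + 0.5 * (s' - s) + noise, so the
# adjusted draws should centre on alpha = 2 with much smaller spread
rng = random.Random(0)
s_obs = 3.0
xs = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
thetas = [2.0 + 0.5 * x + rng.gauss(0.0, 0.1) for x in xs]
adj = regression_adjust(thetas, [s_obs + x for x in xs], s_obs)
```

The adjusted sample should have mean close to 2 and visibly smaller variance than the raw draws.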
Example: Radiocarbon dating revisited
We will fit our shrinkage model using ABC and see how it compares to MCMC. ABC is much easier here.
Recall the model.
Data model: yi ∼ N(µ(θi), σ² + σc(θi)²) for i = 1, ..., n.
Prior: uniform span v ∼ U(0, U − L); ψ1 ∼ U(L, U − v), ψ2 = ψ1 + v; θi ∼ U(ψ1, ψ2), i = 1, ..., n.
Summary statistic: s(y) = y (works here, as there are just n = 7 dates).
Distance: Euclidean, d(s(y′), s(y)) = |y′ − y|. We chose δ by experimentation (see figure).
for (k in 1:K) {
  span  = runif(1, min=0, max=U-L)          # uniform span
  lower = runif(1, min=L, max=U-span)       # psi[1]
  upper = lower + span                      # psi[2]
  dates = round(runif(nd, min=lower, max=upper))
  y.sim = mu[dates] + sqrt(d^2 + err[dates]^2) * rnorm(nd)
  D     = sqrt(sum((y - y.sim)^2)) / 1000   # arbitrary scale
  if (D < delta) {
    S   = rbind(S, y.sim)
    psi = rbind(psi, c(lower, upper))
  }
}
Implementation detail: it is common practice to save all the simulation output, not just the draws satisfying y′ ∈ ∆y, and to choose δ so that some fixed fraction of draws is retained. This lets us trial different δ-values without rerunning the simulation. The code above is simple rejection-ABC.
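That bookkeeping can be sketched as follows (hypothetical helper, my own names): store every (θ, distance) pair, then recover the accepted set and the implied δ for any retained fraction after the fact.

```python
def retain_fraction(thetas, distances, frac):
    """Keep the fraction frac of draws with the smallest distances;
    the implied delta is the largest retained distance."""
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    k = max(1, int(frac * len(distances)))
    kept = [thetas[i] for i in order[:k]]
    delta = distances[order[k - 1]]
    return kept, delta

# toy usage: retaining the best half of four stored draws
kept, delta = retain_fraction([1, 2, 3, 4], [0.4, 0.1, 0.3, 0.2], 0.5)
# kept == [2, 4], delta == 0.2
```

Re-running with a different frac reuses the same stored simulations, which is the point of saving everything.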
[Figure: four histograms of the posterior span (x-axis: span, 0 to 500; y-axis: frequency), for ABC with δ = 0.2, ABC with δ = 0.4, ABC with δ = 0.8, and MCMC.]
ABC example: the Ising Model
The Ising distribution Y ∼ p(·|θ) is easy to sample for moderate n-values using MCMC. Here are 3 samples Y ∼ p(y|θ):

[Figure: three Ising model samples, for θ = 0.4, θ = 0.8 and θ = 1.2.]

Technically these samples are not exactly distributed according to p(y|θ) (the chain is only approximately converged), but we can make them as good as we need by taking long MCMC runs.
[Skip next two slides on MCMC at first reading]
We can sample Y ∼ p(·|θ) using MCMC: Suppose Y (t) = y.
[Step 1] Choose an update, something simple. Choose a cell
i ∼ U1, 2, ..., n2. Set y′i = 1−yi and y′j = yj for j 6= i. Notice
that q(y′|y) = q(y|y′) = 1/n2 for y′, y differing at exactly one
cell.
[Step 2] Write down the algorithm. Let Y (t) = y. Y (t+1) is
determined in the following way.
[1] Simulate y′ ∼ q(y′|y) as above, and u ∼ U(0, 1).
[2] If u < α(y′|y) set Y (t+1) = y′ and otherwise set Y (t+1) = y.
[Step 3] Calculate α. The q's cancel as usual, so

α(y′|y) = min{1, [p(y′|θ)q(y|y′)] / [p(y|θ)q(y′|y)]} = min{1, exp(−θ(#y′ − #y))}.
It is clear the algorithm is irreducible (q is irreducible and α is never zero) and aperiodic (rejection is possible), so it is ergodic with target p(·|θ).
An implementation of this algorithm in R is available in the code
for this lecture. Some samples produced using this code are
shown above.
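For reference, here is a minimal Python transcription of Steps 1 to 3 (a sketch assuming free boundary conditions, not the lecture's R implementation; the n_disagree helper is included only to check the output).

```python
import math
import random

def n_disagree(y, n):
    """#y: disagreeing neighbour pairs on an n x n lattice (free boundaries)."""
    return (sum(y[r*n+c] != y[r*n+c+1] for r in range(n) for c in range(n - 1))
            + sum(y[r*n+c] != y[(r+1)*n+c] for r in range(n - 1) for c in range(n)))

def ising_mcmc(n, theta, sweeps, rng):
    """Single-site Metropolis for p(y|theta) proportional to exp(-theta * #y)."""
    y = [rng.randint(0, 1) for _ in range(n * n)]      # arbitrary start
    for _ in range(sweeps * n * n):
        i = rng.randrange(n * n)                       # cell i ~ U{1,...,n^2}
        r, c = divmod(i, n)
        nbrs = [y[a*n+b] for a, b in ((r-1, c), (r+1, c), (r, c-1), (r, c+1))
                if 0 <= a < n and 0 <= b < n]
        d_now = sum(v != y[i] for v in nbrs)           # disagreements at cell i now
        d_flip = len(nbrs) - d_now                     # ... and after flipping y_i
        # alpha = min(1, exp(-theta * (#y' - #y))), with #y' - #y = d_flip - d_now
        if rng.random() < math.exp(-theta * (d_flip - d_now)):
            y[i] = 1 - y[i]
    return y
```

Larger θ gives smoother images: at θ = 0 every image is equally likely, while for large θ the chain coarsens towards nearly constant images.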
ABC inference for θ
Recall the doubly intractable inference for θ|y.
We have data Y = y which is an n × n binary matrix and our
observation model for Y , Y ∼ p(y|θ) is the Ising model. Our
goal is to estimate θ. The posterior is
π(θ|y) = L(θ; y)π(θ)/p(y),
and the likelihood
p(y|θ) = exp(−θ#y)/c(θ)
depends on c(θ), an intractable function of θ.
Our ABC algorithm sampling θ ∼ π(θ|y) approximately is as
follows. Suppose we have a prior for θ. Given the scale of θ,
Exp(2) is a natural generic prior. We take S(y) = #y, which is
actually sufficient for θ.
(1) Simulate θ ∼ Exp(2) and y′ ∼ exp(−θ#y′)/c(θ).
(2) If |#y′ −#y| < δ return θ, otherwise, goto (1).
We implemented this (see attached R, 8x8 Ising, θ = 0.8) and
estimated the posterior densities in the figure.
The distribution converges to something stable as δ → 0. The
regression adjustment for δ = 0.1 corrects its distribution to
agree with that for δ = 0.05.
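Putting the pieces together, here is a compact Python sketch of the whole pipeline (my own transcription rather than the attached R; a small 6 x 6 lattice, a coarse δ, and a short inner MCMC run are used to keep it cheap, and the "data" are themselves simulated at θ = 0.8).

```python
import math
import random

def n_disagree(y, n):
    """#y: disagreeing neighbour pairs on an n x n lattice (free boundaries)."""
    return (sum(y[r*n+c] != y[r*n+c+1] for r in range(n) for c in range(n - 1))
            + sum(y[r*n+c] != y[(r+1)*n+c] for r in range(n - 1) for c in range(n)))

def ising_mcmc(n, theta, sweeps, rng):
    """Approximate draw Y ~ p(.|theta) by single-site Metropolis."""
    y = [rng.randint(0, 1) for _ in range(n * n)]
    for _ in range(sweeps * n * n):
        i = rng.randrange(n * n)
        r, c = divmod(i, n)
        nbrs = [y[a*n+b] for a, b in ((r-1, c), (r+1, c), (r, c-1), (r, c+1))
                if 0 <= a < n and 0 <= b < n]
        d_now = sum(v != y[i] for v in nbrs)
        if rng.random() < math.exp(-theta * (len(nbrs) - 2 * d_now)):
            y[i] = 1 - y[i]
    return y

def abc_ising(y_obs, n, T, delta, rng, sweeps=80):
    """Rejection ABC with the sufficient summary S(y) = #y and prior Exp(2)."""
    s_obs = n_disagree(y_obs, n)
    accepted = []
    for _ in range(T):
        theta = rng.expovariate(2.0)                 # (1) prior draw
        y_sim = ising_mcmc(n, theta, sweeps, rng)    #     simulate data (inner MCMC)
        if abs(n_disagree(y_sim, n) - s_obs) < delta:
            accepted.append(theta)                   # (2) accept if summaries close
    return accepted

rng = random.Random(1)
y_obs = ising_mcmc(6, 0.8, 200, rng)                 # stand-in "data", true theta = 0.8
post = abc_ising(y_obs, 6, T=300, delta=6, rng=rng)
```

Note that the likelihood, and in particular c(θ), is never evaluated anywhere: only forward simulation and the summary #y are needed.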
[Figure: approximate posterior densities for THETA (x-axis 0.5 to 1.5; y-axis 0 to 1.5). Legend: δ = 0.1; δ = 0.05; 0.1 adjusted; 0.5 adjusted; true.]