
Transcript of bayes-nets.c19

Page 1: bayes-nets.c19


CMSC 471

Fall 2002

Class #19 – Monday, November 4

Page 2: bayes-nets.c19


Today’s class 

• (Probability theory)

• Bayesian inference

 –  From the joint distribution

 –  Using independence/factoring

 –  From sources of evidence

• Bayesian networks

 –  Network structure

 –  Conditional probability tables

 –  Conditional independence

 –  Inference in Bayesian networks

Page 3: bayes-nets.c19


Bayesian Reasoning / Bayesian Networks

Chapters 14, 15.1-15.2

Page 4: bayes-nets.c19


Why probabilities anyway?

• Kolmogorov showed that three simple axioms lead to the rules of probability theory

 –  De Finetti, Cox, and Carnap have also provided compelling arguments for these axioms

1. All probabilities are between 0 and 1:

• 0 <= P(a) <= 1

2. Valid propositions (tautologies) have probability 1, and unsatisfiable propositions have probability 0:

• P(true) = 1 ; P(false) = 0

3. The probability of a disjunction is given by:

• P(a ∨ b) = P(a) + P(b) – P(a ∧ b)

(Venn diagram: a, b, and their overlap a ∧ b)
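To make the disjunction axiom concrete, here is a tiny numeric check in Python (the probabilities are made-up illustration values, not from the slides):

    # Inclusion-exclusion check for axiom 3, with made-up example probabilities
    p_a, p_b, p_a_and_b = 0.3, 0.4, 0.1

    p_a_or_b = p_a + p_b - p_a_and_b   # axiom 3: P(a v b) = P(a) + P(b) - P(a ^ b)
    assert 0.0 <= p_a_or_b <= 1.0      # axiom 1 still holds
    print(p_a_or_b)                    # 0.6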

Page 5: bayes-nets.c19


Inference from the joint: Example

                      alarm                       ¬alarm
             earthquake   ¬earthquake    earthquake   ¬earthquake
burglary        .001         .008           .0001        .0009
¬burglary       .01          .09            .001         .79

P(Burglary | alarm) = α P(Burglary, alarm)

= α [P(Burglary, alarm, earthquake) + P(Burglary, alarm, ¬earthquake)]

= α [ (.001, .01) + (.008, .09) ]

= α [ (.009, .1) ]

Since P(burglary | alarm) + P(¬burglary | alarm) = 1, α = 1/(.009 + .1) ≈ 9.174

(i.e., P(alarm) = 1/α = .109 – quizlet: how can you verify this?)

P(burglary | alarm) = .009 × 9.174 ≈ .0826

P(¬burglary | alarm) = .1 × 9.174 ≈ .9174
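The same calculation is easy to write out in code. A minimal Python sketch, using the joint values from the table above (the dictionary representation and variable names are mine):

    # Joint distribution P(Burglary, Alarm, Earthquake), values from the table above.
    # Keys are (burglary, alarm, earthquake) truth values.
    joint = {
        (True,  True,  True): 0.001,  (True,  True,  False): 0.008,
        (True,  False, True): 0.0001, (True,  False, False): 0.0009,
        (False, True,  True): 0.01,   (False, True,  False): 0.09,
        (False, False, True): 0.001,  (False, False, False): 0.79,
    }

    # P(Burglary | alarm): sum out Earthquake, then normalize over Burglary
    unnormalized = {b: sum(p for (bb, a, e), p in joint.items() if bb == b and a)
                    for b in (True, False)}
    alpha = 1.0 / sum(unnormalized.values())    # 1 / P(alarm) = 1 / 0.109
    posterior = {b: alpha * p for b, p in unnormalized.items()}
    print(posterior[True], posterior[False])    # ~0.0826, ~0.9174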

Page 6: bayes-nets.c19


Independence

• When two sets of propositions do not affect each other's probabilities, we call them independent, and can easily compute their joint and conditional probabilities:

 –  Independent(A, B) → P(A ∧ B) = P(A) P(B), P(A | B) = P(A)

• For example, {moon-phase, light-level} might be independent of {burglary, alarm, earthquake}

 –  Then again, it might not: burglars might be more likely to burglarize houses when there's a new moon (and hence little light)

 –  But if we know the light level, the moon phase doesn't affect whether we are burglarized

 –  Once we're burglarized, the light level doesn't affect whether the alarm goes off

• We need a more complex notion of independence, andmethods for reasoning about these kinds of relationships

Page 7: bayes-nets.c19


Conditional independence

• Absolute independence:

 –  A and B are independent if P(A ∧ B) = P(A) P(B); equivalently, P(A) = P(A | B) and P(B) = P(B | A)

• A and B are conditionally independent given C if 

 –  P(A ∧ B | C) = P(A | C) P(B | C)

• This lets us decompose the joint distribution:

 –  P(A ∧ B ∧ C) = P(A | C) P(B | C) P(C)

• Moon-Phase and Burglary are conditionally independent given Light-Level

• Conditional independence is weaker than absolute independence, but still useful in decomposing the full joint probability distribution
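A small sketch of what this decomposition buys us (the CPD numbers below are hypothetical, not from the slides): build the joint from P(C), P(A | C), P(B | C) and confirm that P(A ∧ B | C) factors into P(A | C) P(B | C):

    # Hypothetical CPDs for binary A, B, C, with A and B conditionally independent given C
    P_C = {True: 0.3, False: 0.7}
    P_A_given_C = {True: 0.9, False: 0.2}   # P(A=true | C)
    P_B_given_C = {True: 0.5, False: 0.1}   # P(B=true | C)

    def p(value, table, c):
        # P(X=value | C=c) for a binary X stored as P(X=true | C)
        return table[c] if value else 1.0 - table[c]

    # Joint via the decomposition P(A, B, C) = P(A | C) P(B | C) P(C)
    joint = {(a, b, c): p(a, P_A_given_C, c) * p(b, P_B_given_C, c) * P_C[c]
             for a in (True, False) for b in (True, False) for c in (True, False)}

    # Check: P(A ^ B | C=true) equals P(A | C=true) * P(B | C=true)
    p_c = sum(v for (a, b, c), v in joint.items() if c)
    p_ab_given_c = joint[(True, True, True)] / p_c
    assert abs(p_ab_given_c - P_A_given_C[True] * P_B_given_C[True]) < 1e-12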

Page 8: bayes-nets.c19


Bayes' rule

• Bayes' rule is derived from the product rule:

 –  P(Y | X) = P(X | Y) P(Y) / P(X)

• Often useful for diagnosis:

 –  If X are (observed) effects and Y are (hidden) causes,

 –  We may have a model for how causes lead to effects (P(X | Y))

 –  We may also have prior beliefs (based on experience) about the frequency of occurrence of the causes (P(Y))

 –  This allows us to reason abductively from effects to causes (P(Y | X))

Page 9: bayes-nets.c19


Bayesian inference

• In the setting of diagnostic/evidential reasoning

 –  Know the prior probability of the hypothesis, P(Hi), and the conditional probability P(Ej | Hi)

 –  Want to compute the posterior probability P(Hi | Ej)

• Bayes' theorem (formula 1):

   P(Hi | Ej) = P(Ej | Hi) P(Hi) / P(Ej)

   where the Hi are the hypotheses and the Ej are the evidence/manifestations E1, …, Em

Page 10: bayes-nets.c19


Simple Bayesian diagnostic reasoning

• Knowledge base:

 –  Evidence / manifestations: E1, … Em

 –  Hypotheses / disorders: H1, … Hn

• Ej and Hi are binary; hypotheses are mutually exclusive (non-overlapping) and exhaustive (cover all possible cases)

 –  Conditional probabilities: P(Ej | Hi), i = 1, …, n; j = 1, …, m

• Cases (evidence for a particular instance): E1, …, El

• Goal: Find the hypothesis Hi with the highest posterior

 –  max over i of P(Hi | E1, …, El)

Page 11: bayes-nets.c19


Bayesian diagnostic reasoning II

• Bayes’ rule says that 

 –  P(Hi | E1, …, El) = P(E1, …, El | Hi) P(Hi) / P(E1, …, El)

• Assume each piece of evidence Ej is conditionally independent of the others, given a hypothesis Hi; then:

 –  P(E1, …, El | Hi) = ∏ j=1..l P(Ej | Hi)

• If we only care about relative probabilities for the Hi, then we have:

 –  P(Hi | E1, …, El) = α P(Hi) ∏ j=1..l P(Ej | Hi)
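A minimal sketch of this computation in code; the structure follows the formula above, while the knowledge-base entries are placeholders I made up:

    # Relative posteriors P(Hi | E1..El) ∝ P(Hi) * prod_j P(Ej | Hi),
    # assuming conditionally independent evidence; all numbers are made up
    priors = {"H1": 0.6, "H2": 0.3, "H3": 0.1}              # P(Hi)
    likelihood = {                                          # P(Ej=true | Hi)
        "H1": {"E1": 0.8, "E2": 0.1},
        "H2": {"E1": 0.3, "E2": 0.7},
        "H3": {"E1": 0.5, "E2": 0.5},
    }
    observed = ["E1", "E2"]                                 # evidence for this case

    scores = {h: priors[h] for h in priors}
    for h in scores:
        for e in observed:
            scores[h] *= likelihood[h][e]

    alpha = 1.0 / sum(scores.values())                      # normalize
    posterior = {h: alpha * s for h, s in scores.items()}
    best = max(posterior, key=posterior.get)                # arg max_i P(Hi | evidence)
    print(best, posterior[best])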

Page 12: bayes-nets.c19


Limitations of simple Bayesian inference

• Cannot easily handle multi-fault situations, nor cases where intermediate (hidden) causes exist:

 –  Disease D causes syndrome S, which causes correlated manifestations M1 and M2

• Consider a composite hypothesis H1 ∧ H2, where H1 and H2 are independent. What is the relative posterior?

 –  P(H1 ∧ H2 | E1, …, El) = α P(E1, …, El | H1 ∧ H2) P(H1 ∧ H2)
    = α P(E1, …, El | H1 ∧ H2) P(H1) P(H2)
    = α ∏ j=1..l P(Ej | H1 ∧ H2) P(H1) P(H2)

• How do we compute P(Ej | H1 ∧ H2)?

Page 13: bayes-nets.c19


Limitations of simple Bayesian inference II

• Assume H1 and H2 are independent given E1, …, El?

 –  P(H1 ∧ H2 | E1, …, El) = P(H1 | E1, …, El) P(H2 | E1, …, El)

• This is a very unreasonable assumption

 –  Earthquake and Burglar are independent, but not given Alarm:

    • P(burglar | alarm, earthquake) << P(burglar | alarm)

• Another limitation is that simple application of Bayes' rule doesn't allow us to handle causal chaining:

 – A: year’s weather; B: cotton production; C: next year’s cotton price 

 –  A influences C indirectly: A → B → C

 –  P(C | B, A) = P(C | B)

• Need a richer representation to model interacting hypotheses,

conditional independence, and causal chaining

• Next time: conditional independence and Bayesian networks!

Page 14: bayes-nets.c19


Bayesian Belief Networks (BNs)

• Definition: BN = (DAG, CPD)

 –  DAG: directed acyclic graph (the BN's structure)

• Nodes: random variables (typically binary or discrete, but methods also exist to handle continuous variables)

• Arcs: indicate probabilistic dependencies between nodes (lack of a link signifies conditional independence)

 –  CPD: conditional probability distribution (the BN's parameters)

• Conditional probabilities at each node, usually stored as a table (conditional probability table, or CPT)

 –  Root nodes are a special case – no parents, so just use priors in the CPD:

   In general, each node stores P(xi | πi), where πi is the set of all parent nodes of xi

   For a root node, πi is empty, so P(xi | πi) = P(xi)

Page 15: bayes-nets.c19


Example BN

(Network structure: A → B, A → C; B, C → D; C → E)

P(C|A) = 0.2

P(C|~A) = 0.005

P(B|A) = 0.3

P(B|~A) = 0.001

P(A) = 0.001

P(D|B,C) = 0.1
P(D|B,~C) = 0.01
P(D|~B,C) = 0.01
P(D|~B,~C) = 0.00001

P(E|C) = 0.4
P(E|~C) = 0.002

Note that we only specify P(A) etc., not P(¬A), since they have to add to one
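One way to write this network down in code, as a minimal sketch: the numbers are exactly the CPT entries above, while the representation (dictionaries keyed by parent values) is my own.

    # Example BN from this slide: A is a root; B and C depend on A;
    # D depends on B and C; E depends on C.  Each table stores P(node=true | parents).
    P_A = 0.001
    P_B = {True: 0.3, False: 0.001}                  # keyed by the value of A
    P_C = {True: 0.2, False: 0.005}                  # keyed by the value of A
    P_D = {(True, True): 0.1,   (True, False): 0.01, # keyed by (B, C)
           (False, True): 0.01, (False, False): 0.00001}
    P_E = {True: 0.4, False: 0.002}                  # keyed by the value of C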

Page 16: bayes-nets.c19


Topological semantics

• A node is conditionally independent of its non-descendants given its parents

• A node is conditionally independent of all other nodes in the network given its parents, children, and children's parents (also known as its Markov blanket)

• The method called d-separation can be applied to decide whether a set of nodes X is independent of another set Y, given a third set Z

Page 17: bayes-nets.c19


Independence and chaining

• Independence assumption:

   P(xi | πi, q) = P(xi | πi), where q is any set of variables (nodes) other than xi and its successors

 –  πi blocks the influence of other nodes on xi and its successors (q influences xi only through variables in πi)

 –  With this assumption, the complete joint probability distribution of all variables in the network can be represented by (recovered from) the local CPDs, by chaining these CPDs:

   P(x1, …, xn) = ∏ i=1..n P(xi | πi)

Page 18: bayes-nets.c19


Chaining: Example

Computing the joint probability for all variables is easy:

P(a, b, c, d, e)

= P(e | a, b, c, d) P(a, b, c, d)   (by the product rule)

= P(e | c) P(a, b, c, d)   (by the independence assumption)

= P(e | c) P(d | a, b, c) P(a, b, c)

= P(e | c) P(d | b, c) P(c | a, b) P(a, b)

= P(e | c) P(d | b, c) P(c | a) P(b | a) P(a)

(Same network as before: A → B, A → C; B, C → D; C → E)
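Using the CPT dictionaries from the sketch under the example-BN slide, the chained product can be evaluated directly (the helper names are mine):

    def bern(p_true, value):
        # P(X=value) for a binary X stored as P(X=true)
        return p_true if value else 1.0 - p_true

    def joint(a, b, c, d, e):
        # P(a,b,c,d,e) = P(e|c) P(d|b,c) P(c|a) P(b|a) P(a), as derived above;
        # P_A, P_B, P_C, P_D, P_E are the CPT dictionaries defined earlier
        return (bern(P_E[c], e) * bern(P_D[(b, c)], d) *
                bern(P_C[a], c) * bern(P_B[a], b) * bern(P_A, a))

    print(joint(True, True, True, True, True))   # 0.4 * 0.1 * 0.2 * 0.3 * 0.001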

Page 19: bayes-nets.c19


Direct inference with BNs

• Now suppose we just want the probability for one variable

• Belief update method

• Original belief (no variables are instantiated): use the prior probability P(xi)

• If xi is a root, then P(xi) is given directly in the BN (the CPT at Xi)

• Otherwise,

 –  P(xi) = Σπi P(xi | πi) P(πi), summing over all values of the parents πi

• In this equation, P(xi | πi) is given in the CPT, but computing P(πi) is complicated

Page 20: bayes-nets.c19


Computing P(πi): Example

• P(d) = Σ b,c P(d | b, c) P(b, c)   (summing over the values of B and C)

• P(b, c) = P(a, b, c) + P(¬a, b, c)   (marginalizing over A)

   = P(b | a, c) P(a, c) + P(b | ¬a, c) P(¬a, c)   (product rule)

   = P(b | a) P(c | a) P(a) + P(b | ¬a) P(c | ¬a) P(¬a)

• If some variables are instantiated, can "plug that in" and reduce the amount of marginalization

• Still have to marginalize over all values of uninstantiated parents – not computationally feasible with large networks

(Same network as before: A → B, A → C; B, C → D; C → E)
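The same marginalization as a sketch, reusing the CPT dictionaries and the bern helper from the earlier sketches:

    # P(d) = sum over B, C of P(d | B, C) P(B, C),
    # where P(B, C) = sum over A of P(B | A) P(C | A) P(A)
    p_d = 0.0
    for b in (True, False):
        for c in (True, False):
            p_bc = sum(bern(P_B[a], b) * bern(P_C[a], c) * bern(P_A, a)
                       for a in (True, False))
            p_d += P_D[(b, c)] * p_bc
    print(p_d)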

Page 21: bayes-nets.c19


Representational extensions

• Compactly representing CPTs

 –  Noisy-OR

 –  Noisy-MAX

• Adding continuous variables

 –  Discretization

 –  Use density functions (usually mixtures of Gaussians) to build hybrid Bayesian networks (with discrete and continuous variables)
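As an illustration of the noisy-OR idea (the standard leak-free form; the per-cause parameters below are hypothetical): the effect is absent only if every active cause is independently inhibited, so the full CPT collapses to one number per parent.

    # Noisy-OR: P(effect=true | parents) = 1 - prod over active parents i of (1 - q_i),
    # where q_i is the probability that parent i alone produces the effect (hypothetical values)
    q = {"cold": 0.4, "flu": 0.8, "malaria": 0.9}

    def noisy_or(active_parents):
        p_all_inhibited = 1.0
        for parent in active_parents:
            p_all_inhibited *= 1.0 - q[parent]
        return 1.0 - p_all_inhibited

    print(noisy_or(["cold", "flu"]))   # 1 - 0.6*0.2 = 0.88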

Page 22: bayes-nets.c19


Inference tasks

• Simple queries: Compute the posterior marginal P(Xi | E=e)

 –  E.g., P(NoGas | Gauge=empty, Lights=on, Starts=false)

• Conjunctive queries:

 –  P(Xi, Xj | E=e) = P(Xi | E=e) P(Xj | Xi, E=e)

• Optimal decisions: Decision networks include utility information; probabilistic inference is required to find P(outcome | action, evidence)

• Value of information: Which evidence should we seek next?

• Sensitivity analysis: Which probability values are most critical?

• Explanation: Why do I need a new starter motor? 

Page 23: bayes-nets.c19


Approaches to inference

• Exact inference

 –  Enumeration

 –  Variable elimination

 –  Clustering / join tree algorithms

• Approximate inference

 –  Stochastic simulation / sampling methods

 –  Markov chain Monte Carlo methods

 –  Genetic algorithms

 –  Neural networks

 –  Simulated annealing

 –  Mean field theory
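As a taste of the stochastic-simulation family, here is a prior-sampling / rejection-sampling sketch over the example network from earlier (it reuses the CPT dictionaries; this is my own illustration, not an algorithm covered on these slides):

    import random

    def sample_true(p_true):
        # draw a Boolean that is True with probability p_true
        return random.random() < p_true

    def prior_sample():
        # sample (a, b, c, d, e) in topological order from the example network
        a = sample_true(P_A)
        b = sample_true(P_B[a])
        c = sample_true(P_C[a])
        d = sample_true(P_D[(b, c)])
        e = sample_true(P_E[c])
        return a, b, c, d, e

    # Rejection sampling: estimate P(C=true | E=true) by keeping only samples with E=true
    samples = [prior_sample() for _ in range(100_000)]
    kept = [s for s in samples if s[4]]                   # evidence: E = true
    estimate = sum(1 for s in kept if s[2]) / len(kept)   # fraction of kept samples with C = true
    print(estimate)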