Introduction to Machine Learning

Machine Learning: Jordan Boyd-Graber
University of Maryland
RADEMACHER COMPLEXITY

Slides adapted from Rob Schapire


Setup

Nothing new . . .

• Samples: S = ((x_1, y_1), \ldots, (x_m, y_m))

• Labels: y_i \in \{-1, +1\}

• Hypothesis: h : X \to \{-1, +1\}

• Training error: \hat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} 1\left[ h(x_i) \neq y_i \right]


An alternative derivation of training error

\hat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} 1\left[ h(x_i) \neq y_i \right]   (1)

           = \frac{1}{m} \sum_{i=1}^{m} \begin{cases} 1 & \text{if } (h(x_i), y_i) = (1, -1) \text{ or } (-1, 1) \\ 0 & \text{if } (h(x_i), y_i) = (1, 1) \text{ or } (-1, -1) \end{cases}   (2)

           = \frac{1}{m} \sum_{i=1}^{m} \frac{1 - y_i h(x_i)}{2}   (3)

           = \frac{1}{2} - \frac{1}{2m} \sum_{i=1}^{m} y_i h(x_i)   (4)

The last sum is the correlation between predictions and labels, so minimizing training error is equivalent to maximizing correlation:

\operatorname{argmax}_{h} \; \frac{1}{m} \sum_{i=1}^{m} y_i h(x_i)   (5)
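
A quick numerical sanity check of the identity (1)–(4), as a minimal sketch; the sample size and the random labels/predictions below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000
y = rng.choice([-1, +1], size=m)   # labels in {-1, +1}
h = rng.choice([-1, +1], size=m)   # predictions of some hypothesis h on x_1..x_m

error_as_indicator = np.mean(h != y)               # (1/m) sum_i 1[h(x_i) != y_i]
error_as_correlation = 0.5 - 0.5 * np.mean(y * h)  # 1/2 - (1/2m) sum_i y_i h(x_i)

assert np.isclose(error_as_indicator, error_as_correlation)
print(error_as_indicator, error_as_correlation)
```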

Playing with Correlation

Imagine that we replace the true labels with Rademacher random variables:

\sigma_i = \begin{cases} +1 & \text{with prob. } 0.5 \\ -1 & \text{with prob. } 0.5 \end{cases}   (6)

This gives us the Rademacher correlation: what is the best correlation a hypothesis in H can achieve against random labels?

\hat{R}_S(H) \equiv E_{\sigma}\left[ \max_{h \in H} \frac{1}{m} \sum_{i=1}^{m} \sigma_i h(x_i) \right]   (7)

Notation: E_p[f] \equiv \sum_{x} p(x) f(x)

Note: Empirical Rademacher complexity is with respect to a sample S.
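
A Monte Carlo sketch of the empirical Rademacher correlation in equation (7) for a small, finite hypothesis class; the helper name, the three fixed predictors, and the sample size are all invented for illustration:

```python
import numpy as np

def empirical_rademacher(H_preds, n_draws=2000, seed=0):
    """Estimate E_sigma[ max_h (1/m) sum_i sigma_i h(x_i) ] by sampling sigma.

    H_preds: array of shape (|H|, m); each row is one hypothesis's {-1, +1}
             predictions on the fixed sample S.
    """
    rng = np.random.default_rng(seed)
    _, m = H_preds.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1, +1], size=m)   # random Rademacher "labels"
        total += np.max(H_preds @ sigma) / m   # best correlation over H
    return total / n_draws

m = 50
H_preds = np.vstack([
    np.ones(m),                                   # h1: always predict +1
    -np.ones(m),                                  # h2: always predict -1
    np.where(np.arange(m) % 2 == 0, 1.0, -1.0),   # h3: alternate +1 / -1
])
print(empirical_rademacher(H_preds))  # small: this tiny class cannot chase noise
```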

Rademacher Extrema

• What are the extreme values of Rademacher correlation?

  |H| = 1: the max disappears, and by linearity

  E_{\sigma}\left[ \max_{h \in H} \frac{1}{m} \sum_{i=1}^{m} \sigma_i h(x_i) \right] = \frac{1}{m} \sum_{i=1}^{m} h(x_i) \, E_{\sigma}[\sigma_i] = 0

  |H| = 2^m: some hypothesis reproduces every labeling of the sample, so the max always matches \sigma exactly and

  E_{\sigma}\left[ \max_{h \in H} \frac{1}{m} \sum_{i=1}^{m} \sigma_i h(x_i) \right] = \frac{m}{m} = 1

• Rademacher correlation is larger for more complicated hypothesis spaces.

• What if you're right for stupid reasons?
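
A self-contained numerical check of both extremes on a tiny sample (the sample size and number of draws are arbitrary illustration choices):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
m, n_draws = 8, 5000

single_h = np.ones(m)  # |H| = 1: one fixed hypothesis, h(x) = +1 everywhere
all_h = np.array(list(itertools.product([-1, 1], repeat=m)), dtype=float)  # |H| = 2^m

corr_single = 0.0
corr_full = 0.0
for _ in range(n_draws):
    sigma = rng.choice([-1, 1], size=m)
    corr_single += (single_h @ sigma) / m   # no max needed with one hypothesis
    corr_full += np.max(all_h @ sigma) / m  # some row of all_h equals sigma itself
print(corr_single / n_draws)  # close to 0
print(corr_full / n_draws)    # exactly 1.0
```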

Generalizing Rademacher Complexity

We can generalize Rademacher complexity to consider all samples of a particular size:

R_m(H) = E_{S \sim D^m}\left[ \hat{R}_S(H) \right]   (8)

Theorem (Convergence Bounds)
Let F be a family of functions mapping from Z to [0,1], and let the sample S = (z_1, \ldots, z_m), where z_i \sim D for some distribution D over Z. Define E[f] \equiv E_{z \sim D}[f(z)] and \hat{E}_S[f] \equiv \frac{1}{m} \sum_{i=1}^{m} f(z_i). Then with probability at least 1 - \delta, for all f \in F:

E[f] \leq \hat{E}_S[f] + 2 R_m(F) + O\left( \sqrt{\frac{\ln \frac{1}{\delta}}{m}} \right)

Here f is a surrogate for the accuracy of a hypothesis (mathematically convenient).

Aside: McDiarmid's Inequality

If we have a function with bounded differences,

\left| f(x_1, \ldots, x_i, \ldots, x_m) - f(x_1, \ldots, x'_i, \ldots, x_m) \right| \leq c_i,   (9)

then:

\Pr\left[ f(x_1, \ldots, x_m) \geq E\left[ f(X_1, \ldots, X_m) \right] + \epsilon \right] \leq \exp\left( \frac{-2 \epsilon^2}{\sum_{i=1}^{m} c_i^2} \right)   (10)

Proofs are online and in Mohri et al. (they require a martingale construction, V_k = E[V \mid x_1 \ldots x_k] - E[V \mid x_1 \ldots x_{k-1}]).

What function do we care about for Rademacher complexity? Let's define

\Phi(S) = \sup_{f} \left( E[f] - \hat{E}_S[f] \right) = \sup_{f} \left( E[f] - \frac{1}{m} \sum_{i} f(z_i) \right)   (11)
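
A worked special case (not on the slides, but it is exactly the calculation used in Step 1 below): let f(x_1, \ldots, x_m) = \frac{1}{m} \sum_i x_i with every x_i \in [0,1]. Changing one coordinate moves f by at most \frac{1}{m}, so c_i = \frac{1}{m} and \sum_i c_i^2 = \frac{1}{m}, giving

\Pr\left[ f \geq E[f] + \epsilon \right] \leq \exp\left( -2 m \epsilon^2 \right).

Setting the right-hand side equal to \delta and solving for \epsilon gives \epsilon = \sqrt{\frac{\ln \frac{1}{\delta}}{2m}}, the deviation term that appears in Step 1.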

Step 1: Bounding divergence from the true expectation

Lemma (Moving to Expectation)
With probability at least 1 - \delta,

\Phi(S) \leq E_S[\Phi(S)] + \sqrt{\frac{\ln \frac{1}{\delta}}{2m}}

Since each f(z_i) \in [0,1], changing any z_i to z'_i in the training set changes \frac{1}{m} \sum_{i} f(z_i) by at most \frac{1}{m}, so we can apply McDiarmid's inequality with \epsilon = \sqrt{\frac{\ln \frac{1}{\delta}}{2m}} and c_i = \frac{1}{m}.

Step 2: Comparing two different empirical expectations

Define a ghost sample S' = (z'_1, \ldots, z'_m) with z'_i \sim D. How much can two samples from the same distribution vary?

Lemma (Two Different Samples)

E_S[\Phi(S)] = E_S\left[ \sup_{f} \left( E[f] - \hat{E}_S[f] \right) \right]   (12)

            = E_S\left[ \sup_{f \in F} E_{S'}\left[ \hat{E}_{S'}[f] - \hat{E}_S[f] \right] \right]   (13)

            \leq E_{S,S'}\left[ \sup_{f} \left( \hat{E}_{S'}[f] - \hat{E}_S[f] \right) \right]   (14)

(13): the true expectation E[f] equals E_{S'}\left[ \hat{E}_{S'}[f] \right], the expectation of the empirical expectation over ghost samples; and \hat{E}_S[f] does not depend on S', so it can be moved inside that expectation.

(14): the supremum of an expectation is at most the expectation of the supremum.

Step 3: Adding in Rademacher Variables

From S, S' we create T, T' by swapping elements between S and S' independently with probability 0.5. The result is still independent and identically distributed (iid) from D, so the differences have the same distribution:

\hat{E}_{S'}[f] - \hat{E}_S[f] \sim \hat{E}_{T'}[f] - \hat{E}_T[f]   (15)

Let's introduce \sigma_i:

\hat{E}_{T'}[f] - \hat{E}_T[f] = \frac{1}{m} \sum_{i} \begin{cases} f(z_i) - f(z'_i) & \text{with prob. } 0.5 \\ f(z'_i) - f(z_i) & \text{with prob. } 0.5 \end{cases}   (16)

                              = \frac{1}{m} \sum_{i} \sigma_i \left( f(z'_i) - f(z_i) \right)   (17)

Thus:

E_{S,S'}\left[ \sup_{f \in F} \left( \hat{E}_{S'}[f] - \hat{E}_S[f] \right) \right] = E_{S,S',\sigma}\left[ \sup_{f \in F} \frac{1}{m} \sum_{i} \sigma_i \left( f(z'_i) - f(z_i) \right) \right]
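
A Monte Carlo sanity check of this symmetrization identity, as a sketch only; the toy domain Z = {0, ..., 9}, the three functions in F, and the sample size are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_trials = 20, 20000
zs = np.arange(10)                         # toy domain Z = {0,...,9}, D uniform
F = [lambda z: z / 9.0,
     lambda z: (z >= 5).astype(float),
     lambda z: (z % 3) / 2.0]              # small function class mapping Z -> [0,1]

lhs, rhs = 0.0, 0.0
for _ in range(n_trials):
    S = rng.choice(zs, size=m)             # sample S
    Sp = rng.choice(zs, size=m)            # ghost sample S'
    sigma = rng.choice([-1, 1], size=m)    # Rademacher variables
    lhs += max(np.mean(f(Sp)) - np.mean(f(S)) for f in F)
    rhs += max(np.mean(sigma * (f(Sp) - f(S))) for f in F)
print(lhs / n_trials, rhs / n_trials)      # the two averages agree up to MC noise
```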

Step 4: Making These Rademacher Complexities

Before, we had E_{S,S',\sigma}\left[ \sup_{f \in F} \frac{1}{m} \sum_{i} \sigma_i \left( f(z'_i) - f(z_i) \right) \right]:

\leq E_{S,S',\sigma}\left[ \sup_{f \in F} \frac{1}{m} \sum_{i} \sigma_i f(z'_i) + \sup_{f \in F} \frac{1}{m} \sum_{i} (-\sigma_i) f(z_i) \right]   (18)

\leq E_{S,S',\sigma}\left[ \sup_{f \in F} \frac{1}{m} \sum_{i} \sigma_i f(z'_i) \right] + E_{S,S',\sigma}\left[ \sup_{f \in F} \frac{1}{m} \sum_{i} (-\sigma_i) f(z_i) \right]   (19)

= R_m(F) + R_m(F) = 2 R_m(F)   (20)

(18): taking the sup jointly must be less than or equal to the sum of the individual sups.

(19): linearity of expectation.

(20): definition of R_m(F); note that -\sigma_i has the same distribution as \sigma_i.

Putting the Pieces Together

With probability ≥ 1 - \delta:

\Phi(S) \leq E_S[\Phi(S)] + \sqrt{\frac{\ln \frac{1}{\delta}}{2m}}   [Step 1]

\sup_{f} \left( E[f] - \hat{E}_S[f] \right) \leq E_S[\Phi(S)] + \sqrt{\frac{\ln \frac{1}{\delta}}{2m}}   [definition of \Phi]

E[f] - \hat{E}_S[f] \leq E_S[\Phi(S)] + \sqrt{\frac{\ln \frac{1}{\delta}}{2m}}   [drop the sup; still true for every f]

E[f] - \hat{E}_S[f] \leq E_{S,S'}\left[ \sup_{f} \left( \hat{E}_{S'}[f] - \hat{E}_S[f] \right) \right] + \sqrt{\frac{\ln \frac{1}{\delta}}{2m}}   [Step 2]

E[f] - \hat{E}_S[f] \leq E_{S,S',\sigma}\left[ \sup_{f \in F} \frac{1}{m} \sum_{i} \sigma_i \left( f(z'_i) - f(z_i) \right) \right] + \sqrt{\frac{\ln \frac{1}{\delta}}{2m}}   [Step 3]

E[f] - \hat{E}_S[f] \leq 2 R_m(F) + \sqrt{\frac{\ln \frac{1}{\delta}}{2m}}   (21)   [Step 4]

Recall that \hat{R}_S(F) \equiv E_{\sigma}\left[ \sup_{f} \frac{1}{m} \sum_{i} \sigma_i f(z_i) \right], so we can apply McDiarmid's inequality again (because f \in [0,1]):

\hat{R}_S(F) \leq R_m(F) + \sqrt{\frac{\ln \frac{1}{\delta}}{2m}}   (22)

Putting the two together:

E[f] \leq \hat{E}_S[f] + 2 R_m(F) + O\left( \sqrt{\frac{\ln \frac{1}{\delta}}{m}} \right)   (23)

What about hypothesis classes?

Define:

Z \equiv X \times \{-1, +1\}   (24)

f_h(x, y) \equiv 1\left[ h(x) \neq y \right]   (25)

F_H \equiv \{ f_h : h \in H \}   (26)

We can use these to write the generalization error and the empirical error:

R(h) = E_{(x,y) \sim D}\left[ 1\left[ h(x) \neq y \right] \right] = E[f_h]   (27)

\hat{R}(h) = \frac{1}{m} \sum_{i} 1\left[ h(x_i) \neq y_i \right] = \hat{E}_S[f_h]   (28)

We can plug this into our theorem!
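
The next slide uses \hat{R}_S(F_H) = \frac{1}{2} \hat{R}_S(H); a short derivation (not spelled out on the slides) follows from f_h(x_i, y_i) = \frac{1 - y_i h(x_i)}{2}:

\hat{R}_S(F_H) = E_{\sigma}\left[ \sup_{h \in H} \frac{1}{m} \sum_{i} \sigma_i \, \frac{1 - y_i h(x_i)}{2} \right]
             = E_{\sigma}\left[ \frac{1}{2m} \sum_{i} \sigma_i \right] + E_{\sigma}\left[ \sup_{h \in H} \frac{1}{m} \sum_{i} \frac{(-\sigma_i y_i) \, h(x_i)}{2} \right]
             = 0 + \frac{1}{2} E_{\sigma}\left[ \sup_{h \in H} \frac{1}{m} \sum_{i} \sigma_i h(x_i) \right] = \frac{1}{2} \hat{R}_S(H),

since the first term does not depend on h and -\sigma_i y_i is again a Rademacher random variable (the y_i are fixed by the sample).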

Generalization bounds

• We started with expectations:

E[f] \leq \hat{E}_S[f] + 2 \hat{R}_S(F) + O\left( \sqrt{\frac{\ln \frac{1}{\delta}}{m}} \right)   (29)

• We also had our definitions of the generalization and empirical error:

R(h) = E_{(x,y) \sim D}\left[ 1\left[ h(x) \neq y \right] \right] = E[f_h]        \hat{R}(h) = \frac{1}{m} \sum_{i} 1\left[ h(x_i) \neq y_i \right] = \hat{E}_S[f_h]

• Combined with the previous result,

\hat{R}_S(F_H) = \frac{1}{2} \hat{R}_S(H)   (30)

• All together:

R(h) \leq \hat{R}(h) + R_m(H) + O\left( \sqrt{\frac{\log \frac{1}{\delta}}{m}} \right)   (31)
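
To get a feel for the scale of these terms, here is a small numeric sketch; the sample size, confidence level, empirical error, and the assumed value of R_m(H) are made-up illustration numbers, and the explicit constant used for the O(·) term is one possible instantiation, not something the slides specify:

```python
import math

# All numbers below are invented for illustration, not from the slides.
m = 10_000            # sample size
delta = 0.05          # confidence parameter
train_error = 0.08    # empirical error R_hat(h)
rademacher_H = 0.02   # assumed Rademacher complexity R_m(H) of the class

# The slides leave the last term as O(sqrt(log(1/delta)/m)); here we plug in the
# sqrt(ln(1/delta)/(2m)) deviation from Step 1 as one concrete instantiation.
deviation = math.sqrt(math.log(1.0 / delta) / (2.0 * m))

bound = train_error + rademacher_H + deviation
print(f"With probability >= {1 - delta}: R(h) <= {bound:.4f}")
```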

Wrapup

• Interaction of data, complexity, and accuracy

• Still very theoretical

• Next up: how to evaluate the generalizability of specific hypothesis classes