
CSE 559A: Computer Vision

Fall 2020: T-R: 11:30-12:50pm @ Zoom

Instructor: Ayan Chakrabarti (ayan@wustl.edu). Course Staff: Adith Boloor, Patrick Williams

Dec 8, 2020

http://www.cse.wustl.edu/~ayan/courses/cse559a/


LAST TIME

Talked about the importance of initialization to keep activations (i.e., the outputs of layers) balanced.

But initialization only ensures balance at the start of training. As your weights update, the activations can become biased again.

Another option: add normalization in the network itself!

Sergey Ioffe and Christian Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.


BATCH-NORMALIZATION

Batch-Norm Layer

Here, the mean and variance are interpreted per-channel.

So for each channel, you compute the mean of the values of that channel's activations at all spatial locations, across all examples in the training set.

But this would be too hard to do in each iteration. So the BN layer just does this normalization over a batch.

And back-propagates through it.

y = BN(x) = (x − Mean(x)) / √(Var(x) + ϵ)


The BN layer has no parameters.

What about during back-propagation?

μ_c = (1/BHW) Σ_{b,i,j} x_{bijc}

σ²_c = (1/BHW) Σ_{b,i,j} (x_{bijc} − μ_c)²

y_{bijc} = (x_{bijc} − μ_c) / √(σ²_c + ϵ)

Here, b indexes the B examples in the batch, (i, j) the H×W spatial locations, and c the channel.

When you back-propagate through it to x, you also back-propagate through the computation of the mean and variance.

At test time, replace μ_c and σ²_c with the mean and variance computed over the entire training set.
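To make the per-channel computation concrete, here is a minimal NumPy sketch of the training-time forward pass (assuming a B×H×W×C activation layout; the backward pass and the test-time statistics are omitted):

```python
import numpy as np

def batchnorm_forward(x, eps=1e-5):
    """Per-channel batch norm over one batch: x has shape (B, H, W, C),
    and the mean/variance average over the batch and spatial axes,
    matching the (1/BHW) sums above."""
    mu = x.mean(axis=(0, 1, 2), keepdims=True)   # mu_c, shape (1, 1, 1, C)
    var = x.var(axis=(0, 1, 2), keepdims=True)   # sigma^2_c
    return (x - mu) / np.sqrt(var + eps)
```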


Typically apply BN before a ReLU. Typical use: RELU(BN(Conv(x, k)) + b)

Don't add a bias before BN, as it's pointless (the mean subtraction cancels it). Learn the bias post-BN. Can also learn a scale: RELU(a · BN(Conv(x, k)) + b)

Leads to significantly faster training. But you need to make sure you are normalizing over a diverse enough set of samples.
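As an illustration, a hypothetical PyTorch sketch of this pattern (channel sizes are arbitrary; bias=False on the conv because BN's mean subtraction would cancel it, while BatchNorm2d's affine parameters provide the learned scale and post-BN bias):

```python
import torch.nn as nn

# Conv -> BN -> ReLU, with scale and bias learned after the normalization.
block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False),  # no pre-BN bias
    nn.BatchNorm2d(128),  # normalizes per channel; its affine params act as a, b
    nn.ReLU(),
)
```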


REGULARIZATION

Given a limited amount of training data, deep architectures will begin to overfit.

Important: Keep track of training and dev-set errors.

Training errors will keep going down, but dev errors will saturate. Make sure you don't train to a point where dev errors start going up.

So how do we prevent, or delay, overfitting so that our dev performance increases?

Solution 1: Get more data.


Data Augmentation

Think of transforms to your images that would still keep them within the distribution of real images.

Typical transforms:

Scaling the image

Taking random crops

Applying color transformations (randomly changing brightness, hue, saturation)

Horizontal flips (but not vertical)

Rotations up to ±5 degrees

These are a good way of getting more training data for free, and they teach your network to be invariant to these transformations... unless your output isn't. If your output is a bounding box, segmentation map, or another quantity that would change under these augmentation operations, you need to apply them to the outputs too.
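For instance, a minimal NumPy sketch of two of these transforms (a random crop after reflection padding, and a horizontal flip), assuming an H×W×3 image whose label is invariant to both:

```python
import numpy as np

def augment(img, pad=4):
    """Random crop (after reflect-padding) plus random horizontal flip
    for an HxWx3 image; output has the same shape as the input."""
    h, w, _ = img.shape
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode='reflect')
    y = np.random.randint(0, 2 * pad + 1)
    x = np.random.randint(0, 2 * pad + 1)
    img = padded[y:y + h, x:x + w]
    if np.random.rand() < 0.5:   # horizontal flip only, never vertical
        img = img[:, ::-1]
    return img
```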


Weight Decay

Add a squared or absolute-value penalty on all weight values (for example, on each element of every convolutional kernel or matmul matrix) except biases: Σ_i w_i² or Σ_i |w_i|.

So now your effective loss is L̃ = L + λ Σ_i w_i².

How would you train for this?

Let's say you use backprop to compute ∇_{w_i} L.

What gradient would you apply to your weights? What is ∇_{w_i} L̃?

∇_{w_i} L̃ = ∇_{w_i} L + 2λ w_i

So in addition to the standard update, you will also be subtracting a scaled version of the weight itself.

What about for L̃ = L + λ Σ_i |w_i|?

∇_{w_i} L̃ = ∇_{w_i} L + λ Sign(w_i)
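As a sketch of the resulting updates (grad_L stands for the back-propagated ∇_{w_i} L; names are illustrative):

```python
import numpy as np

def sgd_step_l2(w, grad_L, lr, lam):
    """One step on L~ = L + lam * sum(w**2): also subtracts a scaled
    copy of the weight itself (hence the name "weight decay")."""
    return w - lr * (grad_L + 2.0 * lam * w)

def sgd_step_l1(w, grad_L, lr, lam):
    """One step on L~ = L + lam * sum(|w|): the penalty contributes
    lam * Sign(w) to the gradient."""
    return w - lr * (grad_L + lam * np.sign(w))
```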


Regularization: Dropout

Key idea: Prevent a network from depending too much on the presence of a specific activation. So, randomly drop these values during training.

g = Dropout(f, p)

f is the incoming activation, g is the output after dropout. Both will have the same shape.

p is a probability (between 0 and 1) that is a parameter of the layer (chosen manually, not learned).

The Dropout layer behaves differently during training and testing.


Testing: g = f

Training: For each element f_i of f independently,

set g_i = 0 with probability p;

set g_i = α f_i with probability (1 − p), where α is a scalar.

What should α be?

α = (1 − p)^{−1}, so that E[g_i] = f_i, the same as at test time!

Using dropout forces the network to learn to be robust to deviations from the training set. It is forced to learn a fallback even when some activations die.

It is an empirical question which layers to apply dropout to.
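A minimal NumPy sketch of this train/test behavior (the random mask is returned so the backward pass, discussed next, can reuse it):

```python
import numpy as np

def dropout_forward(f, p, train=True):
    """Training: zero each element with probability p and scale the
    survivors by 1/(1-p), so E[g] = f. Testing: g = f."""
    if not train:
        return f, None
    eps = (np.random.rand(*f.shape) >= p) / (1.0 - p)  # entries: 0 or (1-p)^-1
    return f * eps, eps  # keep eps around for the backward pass
```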


Dropout is a layer. You will backpropagate through it! How?

Write the function as g = f ∘ ϵ

Here, ϵ is a random array of the same size as f, with values 0 and (1 − p)^{−1} occurring with probability p and (1 − p) respectively.

∘ denotes element-wise multiplication.

So given ∇_g, what is the expression for ∇_f? It is ∇_f = ∇_g ∘ ϵ.

Even though ϵ is random, you must use the same ϵ in the backward pass that you generated for the forward pass.

Don't backpropagate to ϵ, because it is not a function of the input.

Like ReLU, but this kills gradients based on an external random source: whether or not you dropped that activation in the forward pass. If you didn't drop it, remember to multiply by the (1 − p)^{−1} factor.
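Continuing the sketch above, the backward pass just reuses the stored mask:

```python
def dropout_backward(grad_g, eps):
    """Multiply by the same mask eps as the forward pass: gradients of
    dropped activations are killed; survivors are scaled by (1-p)^-1."""
    return grad_g * eps
```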


Regularization: Early Stopping

Keep track of the dev-set error, and stop optimization when it starts going up. This is a legitimate regularization technique!

Essentially, you are restricting your hypothesis space to functions that are reachable within N iterations of a random initialization.


TRAINING IN PRACTICE

[Slides 13-26 were figure-only slides; no text survives in this transcript.]

DIFFERENT OPTIMIZATION METHODS

Standard SGD

w_i ← w_i − λ ∇_{w_i} L

Momentum

g_i ← γ g_i + ∇_{w_i} L

w_i ← w_i − λ g_i
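In code, both updates are one-liners (a sketch; w, g, and grad are same-shape arrays, with grad the current ∇_w L):

```python
def sgd_step(w, grad, lr):
    """Standard SGD update."""
    return w - lr * grad

def momentum_step(w, g, grad, lr, gamma=0.9):
    """Momentum: g accumulates a decaying sum of past gradients,
    and the weight moves along -g instead of the raw gradient."""
    g = gamma * g + grad
    return w - lr * g, g
```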


But we are still applying the same learning rate for all parameters / weights.

Adaptive Learning Rate Methods

Key idea: Set the learning rate for each parameter based on the magnitude of its gradients.

Adagrad

g²_i ← g²_i + (∇_{w_i} L)²

w_i ← w_i − λ ∇_{w_i} L / √(g²_i + ϵ)

The global learning rate gets divided by the (square root of the) accumulated sum of squared past gradients.

Problem: This will always keep dropping the effective learning rate.

RMSProp

g²_i ← γ g²_i + (1 − γ) (∇_{w_i} L)²

w_i ← w_i − λ ∇_{w_i} L / √(g²_i + ϵ)
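A sketch of both updates side by side (sq is the per-parameter accumulator of squared gradients):

```python
import numpy as np

def adagrad_step(w, sq, grad, lr, eps=1e-8):
    """Adagrad: sq grows monotonically, so the effective rate only shrinks."""
    sq = sq + grad**2
    return w - lr * grad / np.sqrt(sq + eps), sq

def rmsprop_step(w, sq, grad, lr, gamma=0.9, eps=1e-8):
    """RMSProp: a decaying average instead of a running sum, so the
    effective learning rate can recover when gradients shrink."""
    sq = gamma * sq + (1.0 - gamma) * grad**2
    return w - lr * grad / np.sqrt(sq + eps), sq
```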


Adam: RMSProp + Momentum

m_i ← β_1 m_i + (1 − β_1) ∇_{w_i} L

v_i ← β_2 v_i + (1 − β_2) (∇_{w_i} L)²

w_i ← w_i − λ m_i / √(v_i + ϵ)

How do you initialize m_i and v_i? Typically as 0 and 1.

This won't matter once the values of m_i and v_i stabilize. But in initial iterations, they will be biased towards their initial values.


Adam: RMSProp + Momentum + Bias Correction

m_i ← β_1 m_i + (1 − β_1) ∇_{w_i} L

v_i ← β_2 v_i + (1 − β_2) (∇_{w_i} L)²

m̂_i = m_i / (1 − β_1^t)

v̂_i = v_i / (1 − β_2^t)

w_i ← w_i − λ m̂_i / √(v̂_i + ϵ)

Here, t is the iteration number.

As t → ∞, 1 − β^t → 1.
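Putting the full update together (a sketch; t is the 1-indexed iteration count, and the ϵ placement matches the formulas above):

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with bias correction, following the equations above."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad**2
    m_hat = m / (1.0 - beta1**t)  # undo the bias toward the initial value
    v_hat = v / (1.0 - beta2**t)
    w = w - lr * m_hat / np.sqrt(v_hat + eps)
    return w, m, v
```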


DISTRIBUTED TRAINING

Neural network training is slow. But many operations are parallelizable; in particular, operations for different batches are independent. That's why GPUs are great for deep learning! But even so, you will begin to saturate the computation (or worse, the memory) on a GPU.

Solution: Break up computation across multiple GPUs.

Two possibilities:

Model Parallelism

Data Parallelism


Model Parallelism

Less popular; it doesn't help for many networks. Essentially, if you have two independent paths in your network, you can place them on different devices, and sync when they join.

Was used in the Krizhevsky et al., 2012 ImageNet paper.


Data Parallelism

Begin with all devices having the same model weights.

On each device, load a separate batch of data.

Do forward-backward to compute weight gradients on each GPU with its own batch.

Have a single device (one of the GPUs, or a CPU) collect the gradients from all devices.

Average these gradients and apply the update to the weights.

Distribute the new weights to all devices.

Works well in practice, especially for multiple GPUs in the same machine.

There is a communication overhead from transferring gradients and weights back and forth. It can be large if distributing across multiple machines.
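A toy sketch of one such synchronous step, simulated on a single machine (compute_grad is a hypothetical stand-in for the per-device forward-backward pass; a real multi-GPU implementation would use collective operations instead):

```python
import numpy as np

def data_parallel_step(w, batches, compute_grad, lr):
    """One synchronous data-parallel update: every 'device' starts from
    the same w, computes a gradient on its own batch, and a single
    averaged gradient is applied before redistributing the weights."""
    grads = [compute_grad(w, batch) for batch in batches]  # one per device
    avg_grad = np.mean(grads, axis=0)                      # gather + average
    return w - lr * avg_grad                               # new w for everyone
```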

Approximate Distributed Training

Let each worker keep updating its own weights independently for multiple iterations. Then transmit the weights back to a single device, average them, and sync the result to all devices.

Another option: quantize gradients when sending them back and forth (while making sure all workers have the same models).
