
CSE 559A: Computer Vision

Fall 2020: T-R: 11:30-12:50pm @ Zoom

Instructor: Ayan Chakrabarti (ayan@wustl.edu). Course Staff: Adith Boloor, Patrick Williams

Dec 8, 2020

http://www.cse.wustl.edu/~ayan/courses/cse559a/


LAST TIME

Talked about the importance of initialization to keep activations (i.e., the outputs of layers) balanced.

But initialization only ensures balance at the start of training. As your weights update, the activations can become biased again.

Another option: add normalization in the network itself!

Sergey Ioffe and Christian Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.


BATCH-NORMALIZATION

Batch-Norm Layer

Here, the mean and variance are interpreted per-channel.

So for each channel, you compute the mean of the values of that channel's activations at all spatial locations, across all examples in the training set.

But this would be too hard to do in each iteration. So the BN layer just does this normalization over a batch.

And back-propagates through it.

y = BN(x) = (x − Mean(x)) / √(Var(x) + ϵ)


The BN layer has no parameters.

What about during back-propagation?

μ_c = (1/BHW) Σ_{b,i,j} x_{bijc}

σ²_c = (1/BHW) Σ_{b,i,j} (x_{bijc} − μ_c)²

y_{bijc} = (x_{bijc} − μ_c) / √(σ²_c + ϵ)

Here, b indexes the B examples in the batch, (i, j) the H×W spatial locations, and c the channel.

When you back-propagate through it to x, you also back-propagate through the computation of the mean and variance.

At test time, replace μ_c and σ²_c with the mean and variance computed over the entire training set.
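To make the per-channel computation concrete, here is a minimal NumPy sketch of the training-time forward pass (assuming a B×H×W×C activation layout; the backward pass and the test-time statistics are omitted):

```python
import numpy as np

def batchnorm_forward(x, eps=1e-5):
    """Per-channel batch norm over one batch: x has shape (B, H, W, C),
    and the mean/variance average over the batch and spatial axes,
    matching the (1/BHW) sums above."""
    mu = x.mean(axis=(0, 1, 2), keepdims=True)   # mu_c, shape (1, 1, 1, C)
    var = x.var(axis=(0, 1, 2), keepdims=True)   # sigma^2_c
    return (x - mu) / np.sqrt(var + eps)
```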


Typically apply BN before a ReLU. Typical use: RELU(BN(Conv(x, k)) + b)

Don't add a bias before BN, as it's pointless (the mean subtraction cancels it). Learn the bias post-BN. Can also learn a scale: RELU(a · BN(Conv(x, k)) + b)

Leads to significantly faster training. But you need to make sure you are normalizing over a diverse enough set of samples.
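As an illustration, a hypothetical PyTorch sketch of this pattern (channel sizes are arbitrary; bias=False on the conv because BN's mean subtraction would cancel it, while BatchNorm2d's affine parameters provide the learned scale and post-BN bias):

```python
import torch.nn as nn

# Conv -> BN -> ReLU, with scale and bias learned after the normalization.
block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False),  # no pre-BN bias
    nn.BatchNorm2d(128),  # normalizes per channel; its affine params act as a, b
    nn.ReLU(),
)
```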


REGULARIZATION

Given a limited amount of training data, deep architectures will begin to overfit.

Important: Keep track of training and dev-set errors.

Training errors will keep going down, but dev errors will saturate. Make sure you don't train to a point where dev errors start going up.

So how do we prevent, or delay, overfitting so that our dev performance increases?

Solution 1: Get more data.


Data Augmentation

Think of transforms to your images that would still keep them within the distribution of real images.

Typical transforms:

Scaling the image

Taking random crops

Applying color transformations (randomly changing brightness, hue, saturation)

Horizontal flips (but not vertical)

Rotations up to ±5 degrees

These are a good way of getting more training data for free, and they teach your network to be invariant to these transformations... unless your output isn't. If your output is a bounding box, segmentation map, or another quantity that would change under these augmentation operations, you need to apply them to the outputs too.
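For instance, a minimal NumPy sketch of two of these transforms (a random crop after reflection padding, and a horizontal flip), assuming an H×W×3 image whose label is invariant to both:

```python
import numpy as np

def augment(img, pad=4):
    """Random crop (after reflect-padding) plus random horizontal flip
    for an HxWx3 image; output has the same shape as the input."""
    h, w, _ = img.shape
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode='reflect')
    y = np.random.randint(0, 2 * pad + 1)
    x = np.random.randint(0, 2 * pad + 1)
    img = padded[y:y + h, x:x + w]
    if np.random.rand() < 0.5:   # horizontal flip only, never vertical
        img = img[:, ::-1]
    return img
```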


Weight Decay

Add a squared or absolute-value penalty on all weight values (for example, on each element of every convolutional kernel or matmul matrix) except biases: Σ_i w_i² or Σ_i |w_i|.

So now your effective loss is L̃ = L + λ Σ_i w_i².

How would you train for this?

Let's say you use backprop to compute ∇_{w_i} L.

What gradient would you apply to your weights? What is ∇_{w_i} L̃?

∇_{w_i} L̃ = ∇_{w_i} L + 2λ w_i

So in addition to the standard update, you will also be subtracting a scaled version of the weight itself.

What about for L̃ = L + λ Σ_i |w_i|?

∇_{w_i} L̃ = ∇_{w_i} L + λ Sign(w_i)
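As a sketch of the resulting updates (grad_L stands for the back-propagated ∇_{w_i} L; names are illustrative):

```python
import numpy as np

def sgd_step_l2(w, grad_L, lr, lam):
    """One step on L~ = L + lam * sum(w**2): also subtracts a scaled
    copy of the weight itself (hence the name "weight decay")."""
    return w - lr * (grad_L + 2.0 * lam * w)

def sgd_step_l1(w, grad_L, lr, lam):
    """One step on L~ = L + lam * sum(|w|): the penalty contributes
    lam * Sign(w) to the gradient."""
    return w - lr * (grad_L + lam * np.sign(w))
```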


Regularization: Dropout

Key idea: Prevent a network from depending too much on the presence of a specific activation. So, randomly drop these values during training.

g = Dropout(f, p)

f is the incoming activation, g is the output after dropout. Both will have the same shape.

p is a probability (between 0 and 1) that is a parameter of the layer (chosen manually, not learned).

The Dropout layer behaves differently during training and testing.


Testing: g = f

Training: For each element f_i of f independently,

set g_i = 0 with probability p;

set g_i = α f_i with probability (1 − p), where α is a scalar.

What should α be?

α = (1 − p)^{−1}, so that E[g_i] = f_i, the same as at test time!

Using dropout forces the network to learn to be robust to deviations from the training set. It is forced to learn a fallback even when some activations die.

It is an empirical question which layers to apply dropout to.
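A minimal NumPy sketch of this train/test behavior (the random mask is returned so the backward pass, discussed next, can reuse it):

```python
import numpy as np

def dropout_forward(f, p, train=True):
    """Training: zero each element with probability p and scale the
    survivors by 1/(1-p), so E[g] = f. Testing: g = f."""
    if not train:
        return f, None
    eps = (np.random.rand(*f.shape) >= p) / (1.0 - p)  # entries: 0 or (1-p)^-1
    return f * eps, eps  # keep eps around for the backward pass
```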


Dropout is a layer. You will backpropagate through it! How?

Write the function as g = f ∘ ϵ

Here, ϵ is a random array of the same size as f, with values 0 and (1 − p)^{−1} occurring with probability p and (1 − p) respectively.

∘ denotes element-wise multiplication.

So given ∇_g, what is the expression for ∇_f? It is ∇_f = ∇_g ∘ ϵ.

Even though ϵ is random, you must use the same ϵ in the backward pass that you generated for the forward pass.

Don't backpropagate to ϵ, because it is not a function of the input.

Like ReLU, but this kills gradients based on an external random source: whether or not you dropped that activation in the forward pass. If you didn't drop it, remember to multiply by the (1 − p)^{−1} factor.
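Continuing the sketch above, the backward pass just reuses the stored mask:

```python
def dropout_backward(grad_g, eps):
    """Multiply by the same mask eps as the forward pass: gradients of
    dropped activations are killed; survivors are scaled by (1-p)^-1."""
    return grad_g * eps
```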


Regularization: Early Stopping

Keep track of the dev-set error, and stop optimization when it starts going up. This is a legitimate regularization technique!

Essentially, you are restricting your hypothesis space to functions that are reachable within N iterations of a random initialization.


TRAINING IN PRACTICE

[Slides 13-26 were figure-only slides; no text survives in this transcript.]

DIFFERENT OPTIMIZATION METHODS

Standard SGD

w_i ← w_i − λ ∇_{w_i} L

Momentum

g_i ← γ g_i + ∇_{w_i} L

w_i ← w_i − λ g_i
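In code, both updates are one-liners (a sketch; w, g, and grad are same-shape arrays, with grad the current ∇_w L):

```python
def sgd_step(w, grad, lr):
    """Standard SGD update."""
    return w - lr * grad

def momentum_step(w, g, grad, lr, gamma=0.9):
    """Momentum: g accumulates a decaying sum of past gradients,
    and the weight moves along -g instead of the raw gradient."""
    g = gamma * g + grad
    return w - lr * g, g
```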


But we are still applying the same learning rate for all parameters / weights.

Adaptive Learning Rate Methods

Key idea: Set the learning rate for each parameter based on the magnitude of its gradients.

Adagrad

g²_i ← g²_i + (∇_{w_i} L)²

w_i ← w_i − λ ∇_{w_i} L / √(g²_i + ϵ)

The global learning rate gets divided by the (square root of the) accumulated sum of squared past gradients.

Problem: This will always keep dropping the effective learning rate.

RMSProp

g²_i ← γ g²_i + (1 − γ) (∇_{w_i} L)²

w_i ← w_i − λ ∇_{w_i} L / √(g²_i + ϵ)
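A sketch of both updates side by side (sq is the per-parameter accumulator of squared gradients):

```python
import numpy as np

def adagrad_step(w, sq, grad, lr, eps=1e-8):
    """Adagrad: sq grows monotonically, so the effective rate only shrinks."""
    sq = sq + grad**2
    return w - lr * grad / np.sqrt(sq + eps), sq

def rmsprop_step(w, sq, grad, lr, gamma=0.9, eps=1e-8):
    """RMSProp: a decaying average instead of a running sum, so the
    effective learning rate can recover when gradients shrink."""
    sq = gamma * sq + (1.0 - gamma) * grad**2
    return w - lr * grad / np.sqrt(sq + eps), sq
```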


Adam: RMSProp + Momentum

m_i ← β_1 m_i + (1 − β_1) ∇_{w_i} L

v_i ← β_2 v_i + (1 − β_2) (∇_{w_i} L)²

w_i ← w_i − λ m_i / √(v_i + ϵ)

How do you initialize m_i and v_i? Typically as 0 and 1.

This won't matter once the values of m_i and v_i stabilize. But in initial iterations, they will be biased towards their initial values.


Adam: RMSProp + Momentum + Bias Correction

m_i ← β_1 m_i + (1 − β_1) ∇_{w_i} L

v_i ← β_2 v_i + (1 − β_2) (∇_{w_i} L)²

m̂_i = m_i / (1 − β_1^t)

v̂_i = v_i / (1 − β_2^t)

w_i ← w_i − λ m̂_i / √(v̂_i + ϵ)

Here, t is the iteration number.

As t → ∞, 1 − β^t → 1.
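Putting the full update together (a sketch; t is the 1-indexed iteration count, and the ϵ placement matches the formulas above):

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with bias correction, following the equations above."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad**2
    m_hat = m / (1.0 - beta1**t)  # undo the bias toward the initial value
    v_hat = v / (1.0 - beta2**t)
    w = w - lr * m_hat / np.sqrt(v_hat + eps)
    return w, m, v
```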


DISTRIBUTED TRAINING

Neural network training is slow. But many operations are parallelizable; in particular, operations for different batches are independent. That's why GPUs are great for deep learning! But even so, you will begin to saturate the computation (or worse, the memory) on a GPU.

Solution: Break up computation across multiple GPUs.

Two possibilities:

Model Parallelism

Data Parallelism


Model Parallelism

Less popular; it doesn't help for many networks. Essentially, if you have two independent paths in your network, you can place them on different devices, and sync when they join.

Was used in the Krizhevsky et al., 2012 ImageNet paper.


Data Parallelism

Begin with all devices having the same model weights.

On each device, load a separate batch of data.

Do forward-backward to compute weight gradients on each GPU with its own batch.

Have a single device (one of the GPUs, or a CPU) collect the gradients from all devices.

Average these gradients and apply the update to the weights.

Distribute the new weights to all devices.

Works well in practice, especially for multiple GPUs in the same machine.

There is a communication overhead from transferring gradients and weights back and forth. It can be large if distributing across multiple machines.
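A toy sketch of one such synchronous step, simulated on a single machine (compute_grad is a hypothetical stand-in for the per-device forward-backward pass; a real multi-GPU implementation would use collective operations instead):

```python
import numpy as np

def data_parallel_step(w, batches, compute_grad, lr):
    """One synchronous data-parallel update: every 'device' starts from
    the same w, computes a gradient on its own batch, and a single
    averaged gradient is applied before redistributing the weights."""
    grads = [compute_grad(w, batch) for batch in batches]  # one per device
    avg_grad = np.mean(grads, axis=0)                      # gather + average
    return w - lr * avg_grad                               # new w for everyone
```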

Approximate Distributed Training

Let each worker keep updating its own weights independently for multiple iterations. Then transmit the weights back to a single device, average them, and sync the result to all devices.

Another option: quantize gradients when sending them back and forth (while making sure all workers have the same models).
