Transcript of: Speech Recognition, speech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR2 (v6).pdf
Speech Recognition (Hung-yi Lee, 李宏毅)
• LAS: just a seq2seq model
• CTC: a seq2seq model whose decoder is a linear classifier
• RNA: a seq2seq model that outputs one thing for each input
• RNN-T: a seq2seq model that can output multiple things for each input
• Neural Transducer: an RNN-T that takes one window of inputs at a time
• MoChA: a Neural Transducer whose window moves and stretches freely
Last Time
Two Points of View
Source of image: Prof. Lin-shan Lee (李琳山), Introduction to Digital Speech Processing (《數位語音處理概論》)
Seq-to-seq vs. HMM
Hidden Markov Model (HMM)

Speech Recognition: speech X → text Y

Y* = argmax_Y P(Y|X)
   = argmax_Y P(X|Y) P(Y) / P(X)
   = argmax_Y P(X|Y) P(Y)

P(X|Y): Acoustic Model (HMM)
P(Y): Language Model
Decode: search for the Y that maximizes P(X|Y) P(Y).
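The decision rule above can be sketched as a toy exhaustive search. The probability tables below are made-up placeholders standing in for a trained acoustic model P(X|Y) and language model P(Y); a real recognizer scores hypotheses with models, not lookup tables.

```python
import math

# Made-up toy scores for two candidate transcriptions of the same X.
acoustic = {"what do you think": 0.6, "what you think": 0.3}     # P(X|Y)
lm       = {"what do you think": 0.01, "what you think": 0.002}  # P(Y)

def decode(candidates):
    # Y* = argmax_Y P(X|Y) P(Y); P(X) is the same for every Y, so drop it.
    return max(candidates, key=lambda y: math.log(acoustic[y]) + math.log(lm[y]))

print(decode(list(acoustic)))
```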
"what do you think"
Phoneme:   hh w aa t d uw y uw th ih ng k
Tri-phone: …… t-d+uw  d-uw+y  uw-y+uw  y-uw+th ……
State:     t-d+uw1 t-d+uw2 t-d+uw3 / d-uw+y1 d-uw+y2 d-uw+y3
HMM: P(X|Y) → P(X|S)
A token sequence Y corresponds to a sequence of states S.
[Figure: states a, b, c chained from Start to End]
Transition Probability P(b|a): the probability of moving from one state to another.
Emission Probability P(x|"t-d+uw1"), P(x|"d-uw+y3"): the distribution of acoustic features each state generates, modeled with a Gaussian Mixture Model (GMM).
HMM - Emission Probability
• Too many states ……
• Tied-state: let similar states (e.g. "t-d+uw3" and "d-uw+y3") share the same Gaussian, like two pointers to the same address.
• The ultimate form of tied states: Subspace GMM [Povey, et al., ICASSP'10]
  (Geoffrey Hinton also published deep learning for ASR in the same conference [Mohamed, et al., ICASSP'10])
P(X|S) = ?
(transition P(b|a), emission (GMM) P(x|s))

P(X|S) = Σ_{h∈align(S)} P(X|h)

alignment h: which state generates which vector, e.g.
h1 = a a b b c c
h2 = a b b b c c
……
For h = a a b b c c over x1 x2 x3 x4 x5 x6:
P(X|h) = P(x1|a) P(a|a)P(x2|a) P(b|a)P(x3|b) P(b|b)P(x4|b) P(c|b)P(x5|c) P(c|c)P(x6|c)
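This sum over alignments can be written out by brute force on a toy example. The transition and emission numbers below are made up for illustration (the real emission model would be the GMM described above):

```python
# Toy brute-force version of P(X|S) = sum over h in align(S) of P(X|h).
trans = {("a","a"):0.5, ("a","b"):0.5, ("b","b"):0.5, ("b","c"):0.5, ("c","c"):0.5}

def emit(x, s):                      # stand-in for the GMM P(x|s)
    return 0.9 if x == s else 0.05

X = ["a", "a", "b", "b", "c", "c"]   # 6 "frames", labelled by their true state
S, T = ["a", "b", "c"], len(X)

def alignments(T):
    # every h = a^t1 b^t2 c^t3 with t_n > 0 and t1 + t2 + t3 = T
    for t1 in range(1, T - 1):
        for t2 in range(1, T - t1):
            t3 = T - t1 - t2
            yield ["a"] * t1 + ["b"] * t2 + ["c"] * t3

def p_x_given_h(h):
    p = emit(X[0], h[0])
    for i in range(1, T):
        p *= trans[(h[i-1], h[i])] * emit(X[i], h[i])
    return p

p_x_given_s = sum(p_x_given_h(h) for h in alignments(T))
```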
How to use Deep Learning?

Method 1: Tandem
• Train a DNN state classifier: input an acoustic feature x_i, output P(a|x_i), P(b|x_i), P(c|x_i), … (size of output layer = no. of states).
• Use the DNN output as a new acoustic feature for the HMM.
• The last hidden layer or a bottleneck layer are also possible features.
Method 2: DNN-HMM Hybrid
• A DNN (CNN, LSTM, …) computes P(a|x) from the acoustic feature x.
• The HMM needs P(x|a):
  P(x|a) = P(x, a) / P(a) = P(a|x) P(x) / P(a)
• P(a): count from training data; P(x) is the same for all states and can be dropped.
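A minimal sketch of this posterior-to-likelihood conversion, with toy numbers (hypothetical values, not from any trained model):

```python
import numpy as np

# Hybrid trick: the DNN gives P(state|x), but HMM decoding wants P(x|state).
# Since P(x|a) = P(a|x) P(x) / P(a) and P(x) is shared by all states,
# decoding can use the "scaled likelihood" P(a|x) / P(a).
def scaled_likelihood(posterior, prior):
    return posterior / prior

posterior = np.array([0.7, 0.2, 0.1])  # P(a|x), P(b|x), P(c|x) from the DNN (toy)
prior     = np.array([0.5, 0.3, 0.2])  # P(a), P(b), P(c) counted from training data
print(scaled_likelihood(posterior, prior))
```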
How to train a state classifier?
• Training data: utterances + labels without alignment; the label is only the state sequence, e.g. {a, b, c}.
• Train an HMM-GMM model and use it to align the data: acoustic features x1 … x7 → state sequence a a a b b c c.
• Train DNN1 on the aligned data (utterance + label, aligned).
• Use DNN1 to realign the data, then train DNN2 on the new alignment; repeat.
Human Parity!
• Microsoft's speech recognition technology breaks a major milestone: conversational recognition reaches human level! (2016.10)
  https://www.bnext.com.tw/article/41414/bn-2016-10-19-020437-216
• IBM vs Microsoft: 'Human parity' speech recognition record changes hands again (2017.03)
  http://www.zdnet.com/article/ibm-vs-microsoft-human-parity-speech-recognition-record-changes-hands-again/
Machine 5.9% vs. Human 5.9% [Yu, et al., INTERSPEECH'16] (a very deep network)
Machine 5.5% vs. Human 5.1% [Saon, et al., INTERSPEECH'17]
Back to End-to-end

LAS
• LAS directly computes P(Y|X):
  P(Y|X) = P(a|X) P(b|a, X) ……
  At each step the decoder takes the previous token and outputs a distribution z over the vocabulary (size V), until <EOS>.
• Decoding: Y* = argmax_Y log P(Y|X), approximated with Beam Search.
• Training: θ* = argmax_θ log P_θ(Ŷ|X), where Ŷ is the ground-truth transcription.
CTC, RNN-T
• LAS directly computes P(Y|X) = P(a|X) P(b|a, X) ……
• CTC and RNN-T need alignment: the encoder turns x1 x2 x3 x4 into h1 h2 h3 h4, and an alignment h (e.g. h = c φ a t) assigns an output to each step.
  P(Y|X) = Σ_{h∈align(Y)} P(h|X)
• Decoding: Y* = argmax_Y log P(Y|X), with Beam Search.
• Training: θ* = argmax_θ log P_θ(Ŷ|X).
HMM, CTC, RNN-T

HMM:        P(X|S) = Σ_{h∈align(S)} P(X|h)
CTC, RNN-T: P(Y|X) = Σ_{h∈align(Y)} P(h|X)

1. Enumerate all the possible alignments
2. How to sum over all the alignments
3. Training: θ* = argmax_θ log P_θ(Ŷ|X), which needs ∂P_θ(Ŷ|X)/∂θ = ?
4. Testing (Inference, decoding): Y* = argmax_Y log P(Y|X)
All the alignments

LAS needs no alignment (LAS: "what are you doing here?" ☺)
HMM:   duplicate "c a t" to length T, e.g. c c a a a t, c a a a a t
CTC:   duplicate and add φ, e.g. c φ a a t t, φ c a φ t φ
RNN-T: add T φ's, e.g. c φ φ φ a φ φ t φ, c φ φ a φ φ t φ φ
(T = 6)
Speech Recognition: speech X (T = 6 acoustic features) → text Y = c a t (N = 3 tokens)
Enumerate all the possible alignments h ∈ align(Y), e.g. duplicate "c a t" to length T: c a t → c c a a a t, ……
HMM: duplicate "c a t" to length T (c c a a a t, c a a a a t, …)
For n = 1 to N: output the n-th token t_n times
Constraint: t_1 + t_2 + … + t_N = T, t_n > 0
Trellis graph: rows c, a, t; columns x1 … x6. At each step, either "duplicate" (stay on the same token) or move to the "next token"; a path such as c c a a t t is one alignment.
CTC:
Output "φ" c_0 times, then for n = 1 to N: output the n-th token t_n times, then output "φ" c_n times
Constraint: t_n > 0, c_n ≥ 0
t_1 + t_2 + … + t_N + c_0 + c_1 + … + c_N = T
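A minimal sketch that enumerates these CTC alignments for Y = c a t and T = 6 under exactly the constraints above (toy code for illustration; the tokens are distinct here, so the repeated-token exception discussed later does not arise):

```python
def ctc_alignments(tokens, T):
    # Slots alternate φ-block, token-block: φ c0 times, token1 t1 (>0) times,
    # φ c1 times, ..., with all the counts summing to T.
    N, out = len(tokens), []
    def fill(slot, remaining, prefix):
        if slot == 2 * N:                        # last φ block takes the rest
            out.append(prefix + ["φ"] * remaining)
            return
        is_token = slot % 2 == 1                 # slots: φ, tok1, φ, tok2, ...
        sym = tokens[slot // 2] if is_token else "φ"
        for k in range(1 if is_token else 0, remaining + 1):
            fill(slot + 1, remaining - k, prefix + [sym] * k)
    fill(0, T, [])
    return out

aligns = ctc_alignments(["c", "a", "t"], 6)
```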
CTC: duplicate "c a t" to length T and add φ, e.g. c φ a a t t, φ c a φ t φ
Trellis graph: rows φ, c, φ, a, φ, t, φ; columns x1 … x6.
• Start at the leading φ or directly at the first token c (φ can be skipped).
• From a token: "duplicate" (stay), move to the "next token", or "insert φ".
• From a φ: "duplicate" (stay) or move to the "next token"; a φ cannot skip any token.
• Exception: when the next token is the same as the current one (e.g. "see" = s e e), the φ between them cannot be skipped; otherwise … e e … would collapse into a single e.
RNN-T: add T φ's, e.g. c φ φ φ a φ φ t φ, c φ φ a φ φ t φ φ
Output "φ" c_0 times, then for n = 1 to N: output the n-th token once, then output "φ" c_n times
(the φ's before and between tokens are optional; there must be at least one φ at the end)
Constraint: c_0 + c_1 + … + c_N = T, c_N > 0, c_n ≥ 0 for n = 0 to N − 1
Trellis graph: rows c, a, t; columns x1 … x6. Moves: "output token" (emit the next token without reading a new acoustic feature) or "insert φ" (read the next acoustic feature); e.g. the path c φ φ a φ φ t φ φ.
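The same enumeration sketch for RNN-T alignments under the constraints above (toy code, not the course's implementation):

```python
def rnnt_alignments(tokens, T):
    # Each token is emitted exactly once; φ is emitted c_n times after the
    # n-th token (c_0 times before the first), with c_0 + ... + c_N = T and
    # c_N > 0 (the alignment must end with at least one φ).
    N, out = len(tokens), []
    def fill(n, remaining, prefix):
        if n == N:
            if remaining >= 1:                   # final φ block: c_N > 0
                out.append(prefix + ["φ"] * remaining)
            return
        for c in range(remaining + 1):           # c_n φ's, then the token
            fill(n + 1, remaining - c, prefix + ["φ"] * c + [tokens[n]])
    fill(0, T, [])
    return out

aligns = rnnt_alignments(["c", "a", "t"], 6)
```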
[Figure: Start → c → a → t → End graphs for HMM, CTC, and RNN-T. HMM and CTC have self-loops on the tokens (duplication); CTC and RNN-T insert φ between tokens; in RNN-T the φ states do not generate acoustic features ("not gen").]
HMM, CTC, RNN-T
1. Enumerate all the possible alignments ✓
2. How to sum over all the alignments: this part is challenging.
Score Computation

Consider the alignment h = φ c φ φ a φ t φ φ.
P(h|X) = P(φ|X) × P(c|X, φ) × P(φ|X, φ, c) × ……
Each factor is read from a distribution p_{i,j}: the output distribution produced after reading the first i acoustic features and generating the first j tokens:
P(h|X) = p_{1,0}(φ) p_{2,0}(c) p_{2,1}(φ) p_{3,1}(φ) p_{4,1}(a) p_{4,2}(φ) p_{5,2}(t) p_{5,3}(φ) p_{6,3}(φ)
The decoder is fed tokens only (<BOS>, c, a, t → l0, l1, l2, l3); the nine p's above use l0 l0 l1 l1 l1 l2 l2 l3 l3 respectively, because φ is never fed back.
In the trellis (rows c, a, t; columns x1 … x6), h = φ c φ φ a φ t φ φ is a single path, and P(h|X) is the product of the terms along its arrows: p_{1,0}(φ) p_{2,0}(c) p_{2,1}(φ) p_{3,1}(φ) p_{4,1}(a) p_{4,2}(φ) p_{5,2}(t) p_{5,3}(φ) p_{6,3}(φ).
Why is this tractable? Because φ is not considered by the decoder! The distribution p_{4,2} is the same for every alignment that reaches grid point (4, 2): it depends only on the tokens generated so far (<BOS>, c, a → l0, l1, l2) and on having read four acoustic features, not on where the φ's were placed. So each grid point has one well-defined distribution, shared by all paths through it.
Forward variable α_{i,j}: the summation of the scores of all the alignments that have read the first i acoustic features and output the first j tokens.

α_{4,2} = α_{4,1} p_{4,1}(a) + α_{3,2} p_{3,2}(φ)

(from α_{4,1}: generate "a"; from α_{3,2}: generate "φ", i.e. read x4)

With α you can compute the summation of the scores of all the alignments: Σ_{h∈align(Y)} P(h|X).
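The α recursion can be sketched as a small dynamic program and checked against brute-force enumeration. The p_{i,j} values below are random toy numbers standing in for real network outputs (an assumption for illustration only):

```python
import numpy as np

# Toy RNN-T trellis for Y = "c a t" (N = 3) and T = 6 acoustic features.
# p_tok[i][j] / p_phi[i][j] stand in for p_{i,j}(next token) and p_{i,j}(φ).
T, N = 6, 3
rng = np.random.default_rng(0)
p_tok = rng.uniform(0.1, 0.9, size=(T + 1, N + 1))
p_phi = rng.uniform(0.1, 0.9, size=(T + 1, N + 1))

def forward():
    # alpha[i][j]: summed score of all partial alignments that have read
    # i acoustic features and output j tokens
    alpha = np.zeros((T + 1, N + 1))
    alpha[1][0] = 1.0
    for i in range(1, T + 1):
        for j in range(N + 1):
            if (i, j) == (1, 0):
                continue
            a = 0.0
            if j >= 1:                 # came from (i, j-1) by emitting token j
                a += alpha[i][j - 1] * p_tok[i][j - 1]
            if i >= 2:                 # came from (i-1, j) by emitting φ
                a += alpha[i - 1][j] * p_phi[i - 1][j]
            alpha[i][j] = a
    return alpha[T][N] * p_phi[T][N]   # a final φ ends every alignment

def brute_force():
    # the same quantity by explicit path enumeration, for checking
    total = 0.0
    def walk(i, j, score):
        nonlocal total
        if j == N:                     # only φ's remain until the end
            for k in range(i, T + 1):
                score *= p_phi[k][N]
            total += score
            return
        walk(i, j + 1, score * p_tok[i][j])
        if i < T:
            walk(i + 1, j, score * p_phi[i][j])
    walk(1, 0, 1.0)
    return total
```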
HMM, CTC, RNN-T
1. Enumerate all the possible alignments ✓
2. How to sum over all the alignments ✓
3. Training: θ* = argmax_θ log P_θ(Ŷ|X), which needs ∂P_θ(Ŷ|X)/∂θ = ?
4. Testing (Inference, decoding): Y* = argmax_Y log P(Y|X)
Training

θ* = argmax_θ log P_θ(Ŷ|X),  P(Ŷ|X) = Σ_{h∈align(Ŷ)} P(h|X)

For h = φ c φ φ a φ t φ φ:
P(h|X) = p_{1,0}(φ) p_{2,0}(c) p_{2,1}(φ) p_{3,1}(φ) p_{4,1}(a) p_{4,2}(φ) p_{5,2}(t) p_{5,3}(φ) p_{6,3}(φ)

We need ∂P(Ŷ|X)/∂θ = ? Each arrow in the trellis (e.g. p_{4,1}(a), p_{3,2}(φ)) is a component in P(Ŷ|X) = Σ_h P(h|X).
By the chain rule:
∂P(Ŷ|X)/∂θ = Σ_{i,j} [ ∂P(Ŷ|X)/∂p_{i,j}(token) ] × [ ∂p_{i,j}(token)/∂θ ]
           = (∂P(Ŷ|X)/∂p_{4,1}(a)) (∂p_{4,1}(a)/∂θ) + (∂P(Ŷ|X)/∂p_{3,2}(φ)) (∂p_{3,2}(φ)/∂θ) + ……
∂p_{4,1}(a)/∂θ: ordinary backpropagation (through time). p_{4,1} is produced by the network from h1 … h4 and the token history <BOS>, c (l0, l1), so the gradient flows back through the decoder and into the encoder.
∂P(Ŷ|X)/∂p_{4,1}(a):
P(Ŷ|X) = Σ_{h with p_{4,1}(a)} P(h|X) + Σ_{h without p_{4,1}(a)} P(h|X)
For an alignment h that contains p_{4,1}(a), P(h|X) = p_{4,1}(a) × (other terms), so ∂P(h|X)/∂p_{4,1}(a) = P(h|X) / p_{4,1}(a); the second sum does not depend on p_{4,1}(a) at all.
Therefore:
∂P(Ŷ|X)/∂p_{4,1}(a) = (1 / p_{4,1}(a)) Σ_{h with p_{4,1}(a)} P(h|X)
Backward variable β_{i,j}: the summation of the scores of all the alignments starting from the i-th acoustic feature and the j-th token (i.e., from grid point (i, j) to the end).

β_{4,2} = β_{4,3} p_{4,2}(t) + β_{5,2} p_{4,2}(φ)

(generate "t" to reach (4, 3); generate "φ", i.e. read x5, to reach (5, 2))

The alignments through p_{4,1}(a) factor into a prefix, the arrow itself, and a suffix:
Σ_{h with p_{4,1}(a)} P(h|X) = α_{4,1} p_{4,1}(a) β_{4,2}
∂P(Ŷ|X)/∂p_{4,1}(a) = (1 / p_{4,1}(a)) α_{4,1} p_{4,1}(a) β_{4,2} = α_{4,1} β_{4,2}
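The α·β formula can be checked numerically on the same kind of toy trellis (random p's standing in for real network outputs; nothing here comes from a trained model). Since P(Ŷ|X) is linear in any single p_{i,j}(token), a finite difference recovers the gradient exactly:

```python
import numpy as np

# Toy trellis: T = 6 frames, Y = c a t (N = 3), random scores.
T, N = 6, 3
rng = np.random.default_rng(1)
p_tok = rng.uniform(0.1, 0.9, size=(T + 1, N + 1))
p_phi = rng.uniform(0.1, 0.9, size=(T + 1, N + 1))

def forward():
    alpha = np.zeros((T + 1, N + 1))
    alpha[1][0] = 1.0
    for i in range(1, T + 1):
        for j in range(N + 1):
            if (i, j) != (1, 0):
                alpha[i][j] = (alpha[i][j - 1] * p_tok[i][j - 1] if j >= 1 else 0.0) \
                            + (alpha[i - 1][j] * p_phi[i - 1][j] if i >= 2 else 0.0)
    return alpha

def backward():
    beta = np.zeros((T + 1, N + 1))
    beta[T][N] = p_phi[T][N]            # only the final φ remains
    for i in range(T, 0, -1):
        for j in range(N, -1, -1):
            if (i, j) != (T, N):
                beta[i][j] = (p_tok[i][j] * beta[i][j + 1] if j < N else 0.0) \
                           + (p_phi[i][j] * beta[i + 1][j] if i < T else 0.0)
    return beta

alpha, beta = forward(), backward()
P = alpha[T][N] * p_phi[T][N]           # P(Y|X), summed over all alignments
grad = alpha[4][1] * beta[4][2]         # dP/dp_{4,1}(a) = alpha_{4,1} beta_{4,2}
```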
HMM, CTC, RNN-T
1. Enumerate all the possible alignments ✓
2. How to sum over all the alignments ✓
3. Training ✓
4. Testing (Inference, decoding): Y* = argmax_Y log P(Y|X)
Decoding

Y* = argmax_Y log P(Y|X) = argmax_Y log Σ_{h∈align(Y)} P(h|X)
   ≈ argmax_Y max_{h∈align(Y)} log P(h|X)

Ideal: find the best alignment h* = argmax_h log P(h|X), then Y* = align⁻¹(h*).

For h = φ c φ φ a φ t φ φ ……:
P(h|X) = P(h1|X) P(h2|X, h1) P(h3|X, h1, h2) ……

Reality: walk the trellis greedily, taking the max of each distribution p_{1,0}, p_{2,0}, … in turn to build ĥ; Beam Search gives a better approximation.
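The max-approximation can be sketched by swapping the forward algorithm's sum for a max (a Viterbi-style search for the single best alignment), again on a toy trellis with random scores standing in for real network outputs:

```python
import numpy as np

# Toy trellis: T = 6 frames, Y = c a t (N = 3), random scores.
T, N = 6, 3
rng = np.random.default_rng(2)
p_tok = rng.uniform(0.1, 0.9, size=(T + 1, N + 1))
p_phi = rng.uniform(0.1, 0.9, size=(T + 1, N + 1))

def viterbi():
    # same recursion as the forward algorithm, with max instead of sum:
    # returns the score of the single best alignment h*
    v = np.zeros((T + 1, N + 1))
    v[1][0] = 1.0
    for i in range(1, T + 1):
        for j in range(N + 1):
            if (i, j) == (1, 0):
                continue
            cands = []
            if j >= 1:
                cands.append(v[i][j - 1] * p_tok[i][j - 1])
            if i >= 2:
                cands.append(v[i - 1][j] * p_phi[i - 1][j])
            v[i][j] = max(cands)
    return v[T][N] * p_phi[T][N]

def best_by_enumeration():
    # check: explicit max over every alignment
    best = 0.0
    def walk(i, j, score):
        nonlocal best
        if j == N:
            for k in range(i, T + 1):
                score *= p_phi[k][N]
            best = max(best, score)
            return
        walk(i, j + 1, score * p_tok[i][j])
        if i < T:
            walk(i + 1, j, score * p_phi[i][j])
    walk(1, 0, 1.0)
    return best
```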
Summary

          | LAS                           | CTC                 | RNN-T
Decoder   | dependent                     | independent         | dependent
Alignment | not explicit (soft alignment) | yes (explicit)      | yes (explicit)
Training  | just train it                 | sum over alignments | sum over alignments
On-line   | No                            | Yes                 | Yes
Reference
• [Yu, et al., INTERSPEECH'16] Dong Yu, Wayne Xiong, Jasha Droppo, Andreas Stolcke, Guoli Ye, Jinyu Li, Geoffrey Zweig, Deep Convolutional Neural Networks with Layer-wise Context Expansion and Attention, INTERSPEECH, 2016
• [Saon, et al., INTERSPEECH'17] George Saon, Gakuto Kurata, Tom Sercu, Kartik Audhkhasi, Samuel Thomas, Dimitrios Dimitriadis, Xiaodong Cui, Bhuvana Ramabhadran, Michael Picheny, Lynn-Li Lim, Bergul Roomi, Phil Hall, English Conversational Telephone Speech Recognition by Humans and Machines, INTERSPEECH, 2017
• [Povey, et al., ICASSP'10] Daniel Povey, Lukas Burget, Mohit Agarwal, Pinar Akyazi, Kai Feng, Arnab Ghoshal, Ondrej Glembek, Nagendra Kumar Goel, Martin Karafiat, Ariya Rastrow, Richard C. Rose, Petr Schwarz, Samuel Thomas, Subspace Gaussian Mixture Models for speech recognition, ICASSP, 2010
• [Mohamed, et al., ICASSP'10] Abdel-rahman Mohamed and Geoffrey Hinton, Phone recognition using Restricted Boltzmann Machines, ICASSP, 2010