Transcript of "End-to-end approaches to speech recognition and language processing", Jan Chorowski, University of Wroclaw & Google Brain (MLSLP 2016 slides).

Page 1:

End-to-end approaches to speech recognition and language processing

Jan Chorowski University of Wroclaw & Google Brain

Page 2:

Collaborators

• Dzmitry Bahdanau

• Dmitriy Serdyuk

• Philémon Brakel

• Kyung Hyun Cho

• Yoshua Bengio

• Adrian Łańcucki

• Michał Zapotoczny

• Paweł Rychlikowski

• William Chan

• Navdeep Jaitly

Much of my research was done with friends from Universite de Montreal, Uniwersytet Wrocławski, and Google

Page 3:

Outline

• End-to-end speech recognition systems

• Challenges and promising research directions

• Uses of attention mechanism for NLP

Page 4:

Motivation: End-to-end systems

What is end-to-end:

• "training all the modules to optimize a global performance criterion" ("Gradient-based learning applied to document recognition", LeCun et al., 1998)

• LeCun et al. present a system for recognizing checks in which segmentation and character recognition are trained jointly, with word constraints taken into account (the approach would now be called Conditional Random Fields)

Not end-to-end: hand-crafted feature engineering, manual integration of separately trained modules.

Why end-to-end: better performance, better portability.

Page 5:

End-to-end systems are the future

Recent examples of end-to-end systems

• Convolutional networks for object recognition (Krizhevsky et al., 12)

• Neural Machine Translation: take raw words as the input, all components trained together (Sutskever et al., 14, Bahdanau et al., 15)

• Neural Caption Generation: produce image descriptions from raw images (many recent papers)

Page 6:

Are DNN-HMMs end-to-end trainable?

Without sequence discriminative training: no

– Lexicon and HMM structure are not optimized with the rest of the system

– Acoustic model (DNN) is trained to predict the states of the HMM in isolation from the language model

With sequence discriminative training: more end-to-end, but still no

– Lexicon and HMM structure …

Page 7:

Our Goal

• Directly model p(Y|X), where Y is a sequence of words or characters and X is a recording

• Compare with the classical decomposition p(Y|X) ∝ p(Y) p(X|Y)

Page 8:

RNNs Learn p(Y)

Decompose p(Y) = ∏_t p(y_t | y_{t−1}, y_{t−2}, ..., y_1)

Model the probabilities using a recurrent relation:

p(y_t | y_{t−1}, y_{t−2}, ..., y_1) = g(s_t)
s_t = f(s_{t−1}, y_{t−1})
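As a concrete illustration, here is a minimal numpy sketch of this recurrence; the plain tanh cell, the toy vocabulary, and all sizes are illustrative assumptions, not the model used in the talk.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, hidden = 5, 8
    W_in = rng.normal(scale=0.1, size=(hidden, vocab_size))   # embeds y_{t-1}
    W_rec = rng.normal(scale=0.1, size=(hidden, hidden))      # recurrent weights
    W_out = rng.normal(scale=0.1, size=(vocab_size, hidden))  # readout g

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def step(s_prev, y_prev):
        """One recurrence: the new state and the distribution over the next symbol."""
        y_onehot = np.eye(vocab_size)[y_prev]
        s = np.tanh(W_rec @ s_prev + W_in @ y_onehot)   # s_t = f(s_{t-1}, y_{t-1})
        return s, softmax(W_out @ s)                    # p(y_t | y_{<t}) = g(s_t)

    # log p(Y) decomposes into a sum of per-step log-probabilities
    Y = [1, 3, 2, 4]
    s, log_p, prev = np.zeros(hidden), 0.0, 0           # index 0 plays the role of a start token
    for y in Y:
        s, p = step(s, prev)
        log_p += np.log(p[y])
        prev = y
    print("log p(Y) =", log_p)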

Page 9:

How to condition an RNN?

• Idea #1: condition through the first hidden state

• Idea #2: condition separately on every step

Page 10:

Idea #1: condition through the 1st hidden state

O. Vinyals et al., "Show and Tell: A Neural Image Caption Generator", https://arxiv.org/abs/1411.4555

Page 11:

Idea #2: Attention

1. Choose relevant frames:

   e_f = score(x_f, s_{t−1})
   α_f = SoftMax(e)_f

2. Summarize into a context:

   c = Σ_f α_f x_f

3. Compute the next state:

   s_t = f(s_{t−1}, y_{t−1}, c)

Soft attention originally proposed for the Montreal neural translation project [Bahdanau2014]
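A minimal numpy sketch of these three steps, using an additive (MLP) scorer as one possible choice of score(x_f, s_{t−1}); the dimensions and random weights are illustrative, not the configuration from the talk.

    import numpy as np

    rng = np.random.default_rng(0)
    n_frames, frame_dim, state_dim, att_dim = 6, 4, 8, 5
    X = rng.normal(size=(n_frames, frame_dim))     # encoded speech frames x_f
    s_prev = rng.normal(size=state_dim)            # previous decoder state s_{t-1}
    W_x = rng.normal(size=(att_dim, frame_dim))
    W_s = rng.normal(size=(att_dim, state_dim))
    v = rng.normal(size=att_dim)

    # 1. Score every frame against the previous state: e_f = score(x_f, s_{t-1})
    e = np.array([v @ np.tanh(W_x @ x + W_s @ s_prev) for x in X])

    # alpha = SoftMax(e): how relevant each frame is right now
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()

    # 2. Summarize the selected frames into a context: c = sum_f alpha_f * x_f
    c = alpha @ X

    # 3. The context then enters the recurrent step s_t = f(s_{t-1}, y_{t-1}, c),
    #    where f is the decoder RNN cell (omitted here).
    print("attention weights:", np.round(alpha, 3))
    print("context:", np.round(c, 3))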

Page 12:

Attention mechanism in RNNs

• This is a network to generate handwriting

• At each step the network looks at a context c

• c is a summarization of a small fragment of the input sequence

Graves, A., 2013. Generating Sequences With Recurrent Neural Networks. arXiv:1308.0850 [cs]

Page 13:

Attention mechanism in translation

Page 14:

Attention mechanism for captioning

http://arxiv.org/pdf/1502.03044.pdf

Page 15:

Our System at a Glance

[Diagram: decoder states s_1, s_2, s_3 consume the previous characters y_1 = sil, y_2 = c, y_3 = a and emit y_2 = c, y_3 = a, y_4 = t; the attention mechanism produces contexts c_1, c_2, c_3 over encoder states h_1 ... h_6.]

• Decoder: GRU RNN with 250 units
• Attention mechanism
• Encoder: 3 or 4 GRU BiRNN layers with 250 units each

The network defines p_Θ(Y|X), where Y = y_1 y_2 ... y_T are characters, X = x_1 x_2 ... x_F are speech frames, and Θ are the parameters.

Page 16:

Training

We train the network to maximize the log-likelihood of the correct output:

(1/N) Σ_{i=1}^{N} log p_Θ(Y_i | X_i)

p_Θ is differentiable with respect to Θ, thus we can use gradient-based methods.
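In code, the criterion is just an average of per-utterance log-probabilities; `model_log_prob` below is a hypothetical stand-in for any model that evaluates log p_Θ(Y|X).

    def mean_log_likelihood(model_log_prob, dataset):
        """dataset: iterable of (X_i, Y_i) pairs; returns (1/N) * sum_i log p(Y_i | X_i)."""
        total, n = 0.0, 0
        for X, Y in dataset:
            total += model_log_prob(Y, X)  # log p_Theta(Y_i | X_i)
            n += 1
        return total / n
    # Maximized with stochastic gradient ascent, since log p_Theta(Y|X)
    # is differentiable with respect to Theta.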

Page 17:

Decoding

Use beam search to find

Ŷ = argmax_Y p_Θ(Y|X)

Narrow beams are needed.
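A minimal beam-search sketch over a character-level model; `next_log_probs` is a hypothetical callable returning log p(y_t | prefix, X) for every vocabulary symbol, and `eos` marks a finished hypothesis.

    def beam_search(next_log_probs, X, vocab, eos, beam_width=4, max_len=50):
        beams = [((), 0.0)]                        # (prefix, cumulative log-probability)
        for _ in range(max_len):
            candidates = []
            for prefix, score in beams:
                if prefix and prefix[-1] == eos:   # finished hypotheses carry over unchanged
                    candidates.append((prefix, score))
                    continue
                for y, lp in zip(vocab, next_log_probs(prefix, X)):
                    candidates.append((prefix + (y,), score + lp))
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
            if all(p and p[-1] == eos for p, _ in beams):
                break
        return beams[0]                            # best-scoring hypothesis

The small `beam_width` of just a few hypotheses corresponds to the "narrow beams" remark above.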

Page 18:

Tricks of the Trade: Subsampling

BiRNN encoders with subsampling

• BiRNN = forward RNN + backward RNN (both without outputs)

• Deep BiRNN = the states of layer K are the inputs of layer K + 1 (see the sketch below)
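A minimal numpy sketch of this encoder scheme: each layer concatenates forward and backward states and then drops every second time step before feeding the next layer. The plain tanh RNN and all sizes are illustrative.

    import numpy as np

    def run_rnn(xs, hidden=4, seed=0):
        """A toy unidirectional RNN returning its state at every time step."""
        rng = np.random.default_rng(seed)
        W = rng.normal(scale=0.1, size=(hidden, hidden))
        U = rng.normal(scale=0.1, size=(hidden, xs.shape[1]))
        h, out = np.zeros(hidden), []
        for x in xs:
            h = np.tanh(W @ h + U @ x)
            out.append(h)
        return np.stack(out)

    def birnn_layer(xs, subsample=2):
        fwd = run_rnn(xs, seed=1)                  # forward RNN
        bwd = run_rnn(xs[::-1], seed=2)[::-1]      # backward RNN
        h = np.concatenate([fwd, bwd], axis=1)     # states of this layer
        return h[::subsample]                      # keep every `subsample`-th frame

    frames = np.random.default_rng(0).normal(size=(16, 3))
    layer1 = birnn_layer(frames)                   # 16 -> 8 frames
    layer2 = birnn_layer(layer1)                   # 8 -> 4 frames; states of layer K feed layer K+1
    print(frames.shape, layer1.shape, layer2.shape)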

Page 19:

Tricks of the Trade: Regularization

• RNNs don’t like weight decay too much

• Constraining the norm of the incoming weights of every unit to 1 gave us a ~30% performance improvement (see the sketch after this list)

• Weight noise was crucial on TIMIT!
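A minimal sketch of the norm constraint from the list above: after each update, rescale the incoming weight vector of every unit so its norm does not exceed 1. Treating the rows of the weight matrix as the incoming weights of the units is an illustrative assumption.

    import numpy as np

    def constrain_incoming_norm(W, max_norm=1.0, eps=1e-12):
        """W: (units, inputs). Clip each row's L2 norm (incoming weights of one unit) to max_norm."""
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        return W * np.minimum(1.0, max_norm / (norms + eps))

    W = np.random.default_rng(0).normal(scale=2.0, size=(5, 10))
    W = constrain_incoming_norm(W)
    print(np.linalg.norm(W, axis=1))   # every row norm is now <= 1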

Page 20:

Comparison with other Approaches

• The alignment α is explicitly computed

• Compare with treating the alignment as a latent variable to be marginalized out:

  P_Θ(Y|X) = Σ_α P_Θ(Y, α|X)

Page 21:

Some Results

Results obtained on TIMIT:

Model                                    Dev      Test
Content-based attention                  15.9%    18.7%
Location-aware attention                 15.8%    17.6%
(Graves 2013) RNN Transducer             N/A      17.7%
(Toth 2014) HMM over Conv. Nets          13.9%    16.7%

Results obtained on WSJ, character based (ICASSP 2016):

Model                                              CER    WER
Attention-based model                              6.7    18.9
Attention-based model + extended trigram LM        3.9    9.3
(Graves 2014) CTC + expected transcription loss    8.4    27.3
(Hannun 2014) CTC + bigram LM                      5.7    14.1
(Miao 2015) CTC + extended trigram LM              N/A    7.3

Page 22:

More Results: Listen, Attend and Spell (Chan 2015)

• Trained on Google data (2000h)

• Learned “AAA=Triple A”!

Page 23:

Challenges

• Dealing with long sequences and repetitions.

• What about text-only data?

• Is cross-entropy the best training objective?

• On-line decoding?

Page 24:

A simple experiment

Train a network on TIMIT

Concatenate test utterances a few times

Decode as usual

Performance drops dramatically. On long utterances decoding completely fails.

Page 25:

Investigation of Long Inputs

The setup:
• concatenate utterances
• do forced alignment (feed the correct inputs)

[Figure: attention alignment of phones vs. speech frames. Typical result: the net is lost, producing two alignment strands.]

Our hypothesis: the net learns an implicit location encoder, which is not robust to long utterances.

Chorowski et al., „Attention-based models for speech recognition”, NIPS 2015

Page 26:

Location-aware Attention

• We want to separate repetitions of the same sound

• Use the selection from the last step to make the new selection

• This enables the model to learn concepts like “later than last” or “close to last”.

[Diagram: one decoder step, with input, output, state, and context C.]

1. Choose relevant frames
2. Summarize frames into a context
3. Access the context
4. Make a recurrent step combining state and context

Chorowski et al., „Attention-based models for speech recognition”, NIPS 2015

Page 27:

Location-aware attention helps

• On long (concatenated) utterances the decoding error rate increases only from 18% to 20%, instead of failing completely

• One more "trick": constrain the attention mechanism to select only a few frames (see the sketch below)
  – Keep up to the K frames with the highest scores
  – Limit the selection to the vicinity of the previous one

Chorowski et al., „Attention-based models for speech recognition”, NIPS 2015
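A minimal numpy sketch of the constraints listed above; `top_k`, `window`, and `prev_pos` are illustrative parameter names, not the paper's notation.

    import numpy as np

    def constrain_attention(scores, prev_pos=None, top_k=None, window=None):
        """scores: raw attention scores e_f; returns constrained, renormalized weights."""
        e = scores.astype(float).copy()
        if top_k is not None:                            # keep up to K highest-scoring frames
            kth = np.sort(e)[-top_k]
            e[e < kth] = -np.inf
        if window is not None and prev_pos is not None:  # limit to the vicinity of the previous selection
            idx = np.arange(len(e))
            e[np.abs(idx - prev_pos) > window] = -np.inf
        alpha = np.exp(e - e[np.isfinite(e)].max())      # exp(-inf) = 0 masks the dropped frames
        return alpha / alpha.sum()

    alpha = constrain_attention(np.random.default_rng(0).normal(size=20),
                                prev_pos=8, top_k=5, window=4)
    print(np.round(alpha, 3))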

Page 28:

Using Text-only Data

Text-only data is abundant. How to use it?

Pragmatic solution:

Add the LM cost during beam search; find

Ŷ = argmax_Y [ log p_Θ(Y|X) + α log p_LM(Y) − γ |Y| ]

On WSJ, WER reduction 18.6 -> 9.3

Problems: No clear probabilistic interpretation.

Bahdanau et al., "End-to-end attention-based large vocabulary speech recognition", ICASSP 2016
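A minimal sketch of the combined score used to rank beam-search hypotheses; `am_log_prob` and `lm_log_prob` are hypothetical callables standing in for log p_Θ(Y|X) and log p_LM(Y), and alpha, gamma are tuned constants.

    def combined_score(Y, X, am_log_prob, lm_log_prob, alpha=0.5, gamma=0.1):
        """Rank hypotheses by log p(Y|X) + alpha * log p_LM(Y) - gamma * |Y|."""
        return am_log_prob(Y, X) + alpha * lm_log_prob(Y) - gamma * len(Y)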

Page 29:

Alternatives to Cross-Entropy

We care about WER, not about perplexity.

Yet we train with cross-entropy; the loss is

− Σ_i log p_Θ(y_i | y_{i−1} ... y_1, X)

There are two problems:

1. Teacher forcing: during training the model conditions only on ground-truth prefixes

2. The model does not learn from its mistakes

Page 30:

Better training criteria

For an input 𝑋, let 𝑌∗ be the best output.

Define a loss l(Y, Y*) measuring how bad it is to answer Y instead of Y* (e.g. WER or BLEU).

We want to incorporate our knowledge of 𝑙 into the training criterion.

Page 31:

Idea #1: Bahdanau et al., "Task Loss Estimation for Sequence Prediction", http://arxiv.org/abs/1511.06456

Oversimplifying idea: teach the model to compute f(X, Y) = l(Y*, Y).

A practical algorithm can be formulated for the WER reward. Intuition: teach the network which extensions of a hypothesis will increase the WER (this helps beam search!).

Define a loss for a partial string (an unfinished recognition):

l_p(Y*, Y) = min_Z WER(Y*, YZ)

To train the model:
1. Let Y = y_1 ... y_N be a decoding (ground truth or a beam search result)
2. Feed Y to the network. After seeing character y_i, teach the net to output

   f(y_i | y_{i−1} ... y_1, X) ≈ l_p(Y*, y_1 ... y_i) − l_p(Y*, y_1 ... y_{i−1})
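A minimal sketch of the partial loss l_p(Y*, Y) = min_Z WER(Y*, YZ): because the best completion can append the unmatched remainder of the reference at no extra cost, min_Z d(Y*, YZ) equals the minimum over reference prefixes of d(Y*[:j], Y). The sketch uses character-level edit distance; for WER the tokens would be words.

    def partial_loss(ref, prefix):
        """min over completions Z of edit_distance(ref, prefix + Z)."""
        dp = list(range(len(ref) + 1))          # dp[j] = distance between ref[:j] and the prefix so far
        for y in prefix:
            new = [dp[0] + 1]
            for j in range(1, len(ref) + 1):
                new.append(min(new[j - 1] + 1,                   # unmatched reference token
                               dp[j] + 1,                        # unmatched prefix token
                               dp[j - 1] + (ref[j - 1] != y)))   # match / substitution
            dp = new
        return min(dp)                          # best completion appends the rest of ref for free

    # Per-step targets: f(y_i | y_<i, X) ~ l_p(Y*, y_1..y_i) - l_p(Y*, y_1..y_{i-1})
    ref, hyp = "the cat", "the bat"
    print([partial_loss(ref, hyp[:i + 1]) - partial_loss(ref, hyp[:i])
           for i in range(len(hyp))])           # only the mistaken "b" contributes a +1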

Page 32:

Idea #2: Norouzi et al., "Reward Augmented Maximum Likelihood for Neural Structured Prediction", NIPS 2016

Idea: we want to minimize

−τ H(p_Θ(Y|X)) + E_{Y ~ p_Θ(Y|X)} [l(Y*, Y)] = τ D_KL( p_Θ(Y|X) ∥ q(Y|Y*, τ) ) + const

where q(Y|Y*, τ) = (1/Z) exp(−l(Y*, Y) / τ)

Instead we optimize the reversed divergence

D_KL( q(Y|Y*, τ) ∥ p_Θ(Y|X) ) = −E_{Y ~ q(Y|Y*, τ)} [log p_Θ(Y|X)] + const

The gradient is easy and SGD friendly! Just use:

E_{Y ~ q(Y|Y*, τ)} [∇ log p_Θ(Y|X)]
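A minimal sketch of the sampling side of this idea, under strong simplifying assumptions: the candidate set is just the reference plus all single-character substitutions, and the loss is the number of mismatched positions. A sampled Y then replaces Y* in the ordinary log-likelihood update, giving E_{Y~q}[∇ log p_Θ(Y|X)].

    import numpy as np

    def sample_augmented_target(ref, alphabet, tau=0.5, rng=np.random.default_rng(0)):
        candidates = {ref}
        for i in range(len(ref)):                        # 1-substitution neighbourhood of Y*
            for a in alphabet:
                candidates.add(ref[:i] + a + ref[i + 1:])
        candidates = sorted(candidates)
        losses = np.array([sum(c != r for c, r in zip(y, ref)) for y in candidates])
        q = np.exp(-losses / tau)
        q /= q.sum()                                     # q(Y | Y*, tau) restricted to this set
        return candidates[rng.choice(len(candidates), p=q)]

    print(sample_augmented_target("cat", "abct"))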

Page 33:

On-line decoding

• Two issues:

– Encoder uses BiRNNs, switching to plain LSTMs raises the WER by a few percentage points

– At each step, the attention mechanism scans the whole input sequence

Page 34:

Windowing: Pragmatic on-line decoding

• Limit the attention mechanism to consider a small neighborhood of the mean/median location from the last step (a small sketch follows below):
  – Let p_t = median(α) from the previous step
  – Restrict the new α to be 0 outside of p_t ± w

• Proved to be the best trick in scaling to long utterances

• May require window application during training too

Bahdanau et al., "End-to-end attention-based large vocabulary speech recognition", ICASSP 2016
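A minimal numpy sketch of this windowing rule; taking the median as the position where the previous step's cumulative attention mass first reaches 0.5, and the half-width w, are illustrative choices.

    import numpy as np

    def median_position(alpha):
        """Index where the cumulative attention mass first reaches 0.5."""
        return int(np.searchsorted(np.cumsum(alpha), 0.5))

    def windowed_attention(scores, prev_alpha, w=3):
        p_t = median_position(prev_alpha)          # p_t = median(alpha) from the previous step
        e = scores.astype(float).copy()
        idx = np.arange(len(e))
        e[np.abs(idx - p_t) > w] = -np.inf         # restrict the new alpha to p_t +/- w
        a = np.exp(e - e[np.isfinite(e)].max())
        return a / a.sum()

    prev = np.full(20, 0.05)                       # previous step attended roughly uniformly
    print(np.round(windowed_attention(np.random.default_rng(0).normal(size=20), prev), 3))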

Page 35:

More principled on-line decoding

N. Jaitly et al., "A Neural Transducer", NIPS 2016

Core idea: Process the sequence in blocks

Page 36:

More-principled on-line decoding: signaling symbol emissions

Luo et al, “Learning Online Alignments with Continuous Rewards Policy Gradient”, arxiv.org/abs/1608.01281

• Decoder makes a step for each speech frame

• At every step, it chooses whether to emit a symbol

• Discrete decisions mandate RL training with Reinforce

Page 37:

Attention for NLP

• Better language models

• Attention-based parsers

Page 38:

Building Better Language Models

Big-picture idea:

• Take a recurrent language model

• At each step scan all previous inputs (or network states) using attention


Page 39:

Attention in Language Models

J. Cheng et al., "Long Short-Term Memory-Networks for Machine Reading", https://arxiv.org/abs/1601.06733

Page 40:

Autoregressive Model + Attention

• n words are directly fed into the network
• m additional words are aggregated into a context using attention

Page 41:

Autoregressive Model + Attention

Page 42:

How to parse sentences?

For constituency parsing, treat parsing as a sequence-to-sequence problem:

• Input: the sentence "Go ."

• Output: the linearized parse tree "(S (VP XX )VP . )S END"

O. Vinyals et al, “Grammar as a Foreign Language”, NIPS 2015
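A minimal sketch of producing such a linearized target; the nested-tuple tree encoding and the handling of pre-terminals are illustrative choices, not the paper's exact preprocessing.

    def linearize(node):
        if isinstance(node, str):              # pre-terminal symbols ("XX", ".") are emitted as-is
            return node
        label, children = node
        inner = " ".join(linearize(c) for c in children)
        return f"({label} {inner} ){label}"

    # Reproduces the example above: "Go ." -> "(S (VP XX )VP . )S END"
    tree = ("S", [("VP", ["XX"]), "."])
    print(linearize(tree) + " END")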

Page 43:

Dependency parsing

• Desired output: directed edges between words.

• At each step the attention selects a few words.

• Idea: use the selection weights as pointers.

Page 44:

Dependency parsing

<s> I like cookies </s>

For each word w, perform two operations:
1. Find its head h (using the attention mechanism)
2. Use (w, h) to predict the dependency type
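A minimal numpy sketch of step 1, using attention weights as pointers to candidate heads; the bilinear scorer, the random encoder states, and all sizes are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    words = ["<s>", "I", "like", "cookies", "</s>"]
    H = rng.normal(size=(len(words), 6))          # per-word encoder states
    U = rng.normal(size=(6, 6))                   # bilinear attention parameters

    def pick_head(i):
        scores = H @ (U @ H[i])                   # score every candidate head for word i
        scores[i] = -np.inf                       # a word cannot head itself
        alpha = np.exp(scores - scores[np.isfinite(scores)].max())
        alpha /= alpha.sum()
        return int(alpha.argmax())                # the selection weights act as a pointer

    for i, w in enumerate(words[1:-1], start=1):  # heads are arbitrary here (untrained weights)
        print(w, "-> head:", words[pick_head(i)])

Step 2 would then classify the dependency type from the pair (w, h), for example with a small feed-forward layer over the two encoder states.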

Page 45:

From characters to word embeddings

Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, “Character-Aware Neural Language Models,” arXiv:1508.06615 [cs, stat], Aug. 2015.

Example word: "incomprehensible"

• Character embeddings are concatenated into a matrix
• Convolutional filters of varying widths can react to prefixes, infixes, and suffixes of words
• The best (max-pooled) filter responses are glued together into a word representation
• Highway layers apply a highly nonlinear transformation of the data
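A minimal numpy sketch of that pipeline: embed characters, convolve with filters of several widths, max-pool over time, concatenate, and apply one highway layer. All sizes and the random weights are illustrative, not the configuration from Kim et al.

    import numpy as np

    rng = np.random.default_rng(0)
    chars = "abcdefghijklmnopqrstuvwxyz"
    char_emb = rng.normal(scale=0.1, size=(len(chars), 8))       # character embeddings

    def word_embedding(word, widths=(2, 3, 4), n_filters=5):
        E = np.stack([char_emb[chars.index(c)] for c in word])   # characters as a (len, 8) matrix
        pooled = []
        for w in widths:
            F = rng.normal(scale=0.1, size=(n_filters, w, 8))    # filters of width w
            convs = np.array([[np.tanh(np.sum(F[j] * E[i:i + w]))
                               for i in range(len(word) - w + 1)]
                              for j in range(n_filters)])
            pooled.append(convs.max(axis=1))                     # max over time, one value per filter
        z = np.concatenate(pooled)                               # glue the pooled responses together
        W_h, W_t = rng.normal(scale=0.1, size=(2, z.size, z.size))
        t = 1.0 / (1.0 + np.exp(-(W_t @ z)))                     # highway gate
        return t * np.tanh(W_h @ z) + (1.0 - t) * z              # gated mix of transform and input

    print(word_embedding("incomprehensible").shape)              # (15,) = 3 widths * 5 filters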

Page 46:

Parser results

Results on common dependency treebanks

http://arxiv.org/abs/1609.03441

Page 48:

References & Further Reading

• Good review presentation: T. Sainath, "Towards End-to-End Speech Recognition Using Deep Networks", https://sites.google.com/site/tsainath/
• D. Amodei et al., "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin", arXiv:1512.02595, Dec. 2015
• Y. Luo, C.-C. Chiu, N. Jaitly, I. Sutskever, "Learning Online Alignments with Continuous Rewards Policy Gradient", https://arxiv.org/abs/1608.01281
• J. Chorowski, M. Zapotoczny, P. Rychlikowski, "Read, Tag, and Parse All at Once, or Fully-neural Dependency Parsing", http://arxiv.org/abs/1609.03441
• J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, Y. Bengio, "Attention-Based Models for Speech Recognition", NIPS 2015
• J. Chorowski, D. Bahdanau, K. Cho, Y. Bengio, "End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results", Deep Learning Workshop at NIPS 2014
• D. Bahdanau, D. Serdyuk, P. Brakel, N. Ke, J. Chorowski, A. Courville, Y. Bengio, "Task Loss Estimation for Sequence Prediction", http://arxiv.org/abs/1511.06456
• D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, Y. Bengio, "End-to-End Attention-based Large Vocabulary Speech Recognition", arXiv:1508.04395
• D. Bahdanau, K. Cho, Y. Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015, arXiv:1409.0473
• N. Jaitly et al., "A Neural Transducer", NIPS 2016
• M. Norouzi et al., "Reward Augmented Maximum Likelihood for Neural Structured Prediction", NIPS 2016
• W. Chan, N. Jaitly, Q. V. Le, O. Vinyals, "Listen, Attend and Spell", arXiv:1508.01211, Aug. 2015
• A. Graves, "Practical Variational Inference for Neural Networks", Advances in Neural Information Processing Systems 24, 2011, pp. 2348–2356
• A. Graves, "Generating Sequences with Recurrent Neural Networks", arXiv:1308.0850, 2013
• A. Graves, A.-R. Mohamed, G. Hinton, "Speech Recognition with Deep Recurrent Neural Networks", IEEE ICASSP 2013
• I. Sutskever, O. Vinyals, Q. V. Le, "Sequence to Sequence Learning with Neural Networks", NIPS 2014
• O. Vinyals, A. Toshev, S. Bengio, D. Erhan, "Show and Tell: A Neural Image Caption Generator", https://arxiv.org/abs/1411.4555
• O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, G. Hinton, "Grammar as a Foreign Language", NIPS 2015
• O. Vinyals, M. Fortunato, N. Jaitly, "Pointer Networks", NIPS 2015