Speaker Recognition

G. CHOLLET, G. GRAVIER, J. KHARROUBI, D. PETROVSKA-DELACRETAZ

(chollet, kharroub, petrovsk)@tsi.enst.fr, ggravier@infres.enst.fr

ENST/CNRS-LTCI, 46 rue Barrault, 75634 PARIS cedex 13

http://www.tsi.enst.fr/~chollet
  • date post

    19-Dec-2015
Our affiliations

ENST: Ecole Nationale Supérieure des Télécommunications, http://www.enst.fr

CNRS: Centre National de la Recherche Scientifique, http://www.cnrs.fr

LTCI: Laboratoire de Traitement et Communication de l’Information, http://www.enst.fr/ura/ura.html

What is ENST?

Ecole Nationale Supérieure des Télécommunications

• classed among the ‘Grandes Ecoles d'Ingénieurs’.

• 250 state certified engineers each year.

• part of ‘Groupement des Ecoles de Télécommunications’

Modalities for Identity Verification

• A device you own (key, smart card, …)
• A code you remember (password, …)
  • Could be lost or stolen
• Physiological characteristics:
  • Face, iris, fingerprint, hand shape, …
  • Need special equipment
• Behavioral characteristics:
  • Speech, signature, keystroke, …

Speech is the preferred modality over the telephone (but a ‘voice print’ is much more variable than a fingerprint)

Outline

• Where is the information about the speaker identity in the speech signal?
• How well could humans recognize a speaker?
• Applications of Speaker Recognition
• Prior knowledge on what the speaker said
• Combining Speech Recognition and Speaker Verification
• Some research activities at ENST:
  • Speaker verification:
    • The CAVE-PICASSO projects (text dependent)
    • The ELISA consortium, NIST evaluations (text independent)
    • The EUREKA !2340 MAJORDOME project
  • Multimodal Identity Verification:
    • The M2VTS and BIOMET projects
• Perspectives

Speaker Identity in Speech

• Differences in vocal tract shapes and muscular control
• Fundamental frequency (typical values): 100 Hz (male), 200 Hz (female), 300 Hz (child)
• Glottal waveform
• Phonotactics
• Lexical usage

The difference between the voices of twins is a limiting case.

Voices can also be imitated or disguised.
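The fundamental frequency listed above can be estimated from a short voiced frame. The following is a minimal sketch using the autocorrelation peak, not the method used by the authors; the function name `estimate_f0`, the search range of 60 to 400 Hz and the synthetic 100 Hz test tone are assumptions chosen for illustration.

```python
import math

def estimate_f0(signal, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate the fundamental frequency (Hz) of one voiced frame.

    The lag that maximises the (unnormalised) autocorrelation inside the
    assumed pitch range [f_min, f_max] is taken as the pitch period.
    """
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_value = None, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        value = sum(signal[n] * signal[n + lag]
                    for n in range(len(signal) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag

# Hypothetical test signal: 100 ms of a pure 100 Hz tone at 8 kHz,
# standing in for a typical male voice.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 100.0 * n / sample_rate) for n in range(800)]
f0 = estimate_f0(frame, sample_rate)
```

Real speech would need voicing detection and some smoothing across frames; the sketch only shows where the typical values quoted above come from.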

Speaker Identity

[Figure: spectral envelope of /i:/ (amplitude A versus frequency f) for Speaker A and Speaker B]

• segmental factors (~30 ms)
  • glottal excitation: fundamental frequency, amplitude, voice quality (e.g., breathiness)
  • vocal tract: formant frequencies and bandwidths
• suprasegmental factors
  • speaking speed (timing and rhythm of speech units)
  • intonation patterns
  • dialect, accent, pronunciation habits

Inter-speaker Variability

[Figure: the utterance "We were away a year ago."]

Intra-speaker Variability

[Figure: the utterance "We were away a year ago."]

Vocal Apparatus

Speech production

Glottal Waveform Modeling

[Figure: amplitude A versus time t; original residual in blue, synthetic residual in red]

• Fitting a glottal pulse model to the excitation waveform allows perceptually relevant modifications to voice quality

Applications of Speaker Recognition

• Identification from an open set (unrealistic)
• Identification from a closed set (who is speaking in a videoconference?)
• Verification of claimed identity (risk of deliberate imposture)

The human performance in speaker recognition is far from being perfect (highly dependent on familiarity with the subject)

Speaker Verification

• Typology of approaches (EAGLES Handbook)
  • Text dependent
    • Public password
    • Private password
    • Customized password
    • Text prompted
  • Text independent
• Incremental enrolment
• Evaluation

What are the sources of difficulty?

• Intra-speaker variability of the speech signal (due to stress, pathologies, environmental conditions, …)
• Recording conditions (filtering, noise, …)
• Temporal drift
• Intentional imposture
• Voice disguise

Text-dependent Speaker Verification

Uses Automatic Speech Recognition techniques (DTW, HMM, …)

Client model adaptation from speaker independent HMM (‘World’ model)

Synchronous alignment of client and world models for the computation of a score.

Dynamic Time Warping (DTW)
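As an illustration of the alignment idea, here is a minimal DTW sketch under stated assumptions: frames are plain lists of floats, the local cost is the Euclidean distance, and the three classic step directions are allowed without slope constraints. The function name `dtw_distance` and the toy sequences are hypothetical.

```python
def dtw_distance(seq_a, seq_b):
    """Accumulated cost of the best time alignment between two
    sequences of feature vectors (each vector is a list of floats)."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j]: best accumulated cost aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = sum((x - y) ** 2
                        for x, y in zip(seq_a[i - 1], seq_b[j - 1])) ** 0.5
            cost[i][j] = local + min(cost[i - 1][j],      # stay on frame j
                                     cost[i][j - 1],      # stay on frame i
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

# Toy one-dimensional "utterances": the same pattern spoken more slowly
# aligns at zero cost, while a different pattern does not.
reference = [[0.0], [1.0], [2.0], [1.0]]
slow = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0], [1.0]]
other = [[3.0], [3.0], [3.0], [3.0]]
same_cost = dtw_distance(reference, slow)
diff_cost = dtw_distance(reference, other)
```

In a text-dependent verifier the accumulated cost against the client's enrolled password template would then be compared with a threshold.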

HMM structure depends on the application

Signal detection theory

Score normalisation

World model

Cohort normalisation

Discriminant techniques
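The first two normalisations can be sketched in a few lines. This is an illustrative sketch, not the formulation used in the projects described here: it assumes scores are log-likelihoods, that world-model normalisation subtracts the world log-likelihood, and that cohort normalisation subtracts the mean log-likelihood over a set of cohort speaker models (taking the maximum is another common choice). All function names and numbers are hypothetical.

```python
def world_normalised_score(client_loglik, world_loglik):
    """Log-likelihood ratio against a speaker-independent world model."""
    return client_loglik - world_loglik

def cohort_normalised_score(client_loglik, cohort_logliks):
    """Log-likelihood ratio against the average of a cohort of
    other speakers' models (assumed to be chosen in advance)."""
    return client_loglik - sum(cohort_logliks) / len(cohort_logliks)

# Hypothetical log-likelihoods of one test utterance.
score_world = world_normalised_score(-40.0, -47.0)
score_cohort = cohort_normalised_score(-40.0, [-50.0, -48.0, -52.0])
```

Either normalised score is then compared with a decision threshold that no longer depends as strongly on utterance length or recording conditions.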

Detection Error Tradeoff (DET) Curve
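The points of a DET curve are pairs of false-acceptance and false-rejection rates obtained by sweeping the decision threshold. The sketch below computes these pairs and a simple estimate of the equal error rate (EER); it assumes higher scores mean "more likely the client", and the function names and toy scores are hypothetical. A real DET plot additionally uses a normal-deviate scale on both axes, which is omitted here.

```python
def error_rates(client_scores, impostor_scores, threshold):
    """Return (false acceptance rate, false rejection rate) when
    accepting every trial whose score is >= threshold."""
    false_reject = sum(s < threshold for s in client_scores) / len(client_scores)
    false_accept = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return false_accept, false_reject

def det_points(client_scores, impostor_scores):
    """One (threshold, FA, FR) triple per distinct observed score."""
    thresholds = sorted(set(client_scores) | set(impostor_scores))
    return [(t,) + error_rates(client_scores, impostor_scores, t)
            for t in thresholds]

def equal_error_rate(client_scores, impostor_scores):
    """Approximate EER: the operating point where FA and FR are closest."""
    points = det_points(client_scores, impostor_scores)
    _, fa, fr = min(points, key=lambda p: abs(p[1] - p[2]))
    return (fa + fr) / 2

# Toy scores from hypothetical client and impostor access trials.
clients = [2.0, 3.0, 4.0, 5.0]
impostors = [0.0, 1.0, 2.5, 1.5]
eer = equal_error_rate(clients, impostors)
```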

CAVE – PICASSO

http://www.picasso.ptt-telecom.nl/project/

Incremental enrolment of customised password

The client chooses his password using some feedback from the system.

The system attempts a phonetic transcription of the password.

Incremental enrolment is achieved on further repetitions of that password.

Speaker independent phone HMMs are adapted with the client enrolment data.

Synchronous alignment likelihood ratio scoring is performed on access trials.

Deliberate imposture

The impostor has some recordings of the target client voice. He can record the same sentences and align these speech signals with the recordings of the client.

A transformation (Multiple Linear Regression) is computed from these aligned data.

The impostor has heard the target client password. He records that password and applies the transformation to this recording.

The PICASSO reference system with less than 1 % EER is defeated by this procedure (more than 30 % EER).

Speaker Verification (text independent)

• The ELISA consortium: ENST, LIA, IRISA, ...
  http://www.lia.univ-avignon.fr/equipes/RAL/elisa/index_en.html
• NIST evaluations
  http://www.nist.gov/speech/tests/spk/index.htm
• Ergodic HMM
• Gaussian Mixture Model

Gaussian Mixture Model

Parametric representation of the probability distribution of observations:
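For reference, a Gaussian mixture density is usually written as a weighted sum of Gaussian components. The notation below (M components, weights w_i, means μ_i, covariances Σ_i, model parameters λ) is the conventional one and is assumed here rather than taken from the slide.

```latex
p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x;\, \mu_i, \Sigma_i),
\qquad \sum_{i=1}^{M} w_i = 1, \quad w_i \ge 0
```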

Gaussian Mixture Models

8 Gaussians per mixture

National Institute of Standards & Technology (NIST)

Speaker Verification Evaluations

• Annual evaluation since 1995

• Common paradigm for comparing technologies

GMM speaker modeling

[Diagram: front-end → GMM modeling → world GMM model; front-end → GMM model adaptation → target GMM model]
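The slide does not say how the world model is adapted into a target model. One widely used option is MAP adaptation of the component means, sketched below as an assumption, for scalar features only to keep the code short. The function names, the relevance factor of 16 and the toy data are all hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def map_adapt_means(weights, means, variances, data, relevance=16.0):
    """Shift world-model means toward the enrolment data (MAP, means only).

    Components that explain many enrolment frames move most; components
    that see almost no data keep the world-model mean.
    """
    counts = [0.0] * len(means)
    sums = [0.0] * len(means)
    for x in data:
        likes = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(likes)
        if total == 0.0:        # frame too far from every component
            continue
        for i, like in enumerate(likes):
            post = like / total  # posterior of component i for this frame
            counts[i] += post
            sums[i] += post * x
    adapted = []
    for n_i, s_i, m in zip(counts, sums, means):
        alpha = n_i / (n_i + relevance)          # data-dependent weight
        data_mean = s_i / n_i if n_i > 0 else m
        adapted.append(alpha * data_mean + (1 - alpha) * m)
    return adapted

# Hypothetical two-component world model and 16 enrolment frames at 2.5.
world_weights, world_means, world_vars = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]
target_means = map_adapt_means(world_weights, world_means, world_vars, [2.5] * 16)
```

With these numbers the second mean moves about halfway from 2.0 toward 2.5, while the first mean, which sees almost no data, stays near -2.0.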

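Baseline GMM method

[Diagram: test speech → front-end → hypothesized target GMM model and world GMM model → LLR score]

$\text{LLR score} = \log\left[\, P(x \mid \text{hypothesized target GMM}) \,/\, P(x \mid \text{world GMM}) \,\right]$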
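The log-likelihood ratio between the hypothesized target model and the world model can be sketched as follows. This is an illustrative sketch for scalar features; averaging the ratio over frames is a common convention assumed here, and the function names and model parameters are hypothetical.

```python
import math

def gmm_log_likelihood(x, weights, means, variances):
    """Log density of one scalar observation under a 1-D GMM."""
    density = sum(
        w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
        for w, m, v in zip(weights, means, variances)
    )
    return math.log(density)

def llr_score(frames, target, world):
    """Average per-frame log-likelihood ratio, target versus world.

    `target` and `world` are (weights, means, variances) tuples.
    """
    total = sum(gmm_log_likelihood(x, *target) - gmm_log_likelihood(x, *world)
                for x in frames)
    return total / len(frames)

# Hypothetical models: the target differs from the world in one mean.
world = ([0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
target = ([0.5, 0.5], [-2.0, 2.25], [1.0, 1.0])
client_llr = llr_score([2.3, 2.4, 2.2], target, world)
impostor_llr = llr_score([1.5, 1.7, 1.6], target, world)
```

A positive score favours the claimed identity and a negative one favours the world model; the accept/reject threshold is set on development data.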

Support Vector Machines and Speaker Verification

A hybrid GMM-SVM system is proposed.

The SVM scoring model is trained on development data to classify true-target speaker accesses and impostor accesses, using a new feature representation based on GMMs.

[Diagram: modeling with GMM, scoring with SVM]

SVM principles

[Figure: X is mapped from input space to feature space; separating hyperplanes H, with the optimal hyperplane Ho, determine Class(X)]

Results

Combining Speech Recognition and Speaker Verification

• Speaker independent phone HMMs
• Selection of segments or segment classes which are speaker specific
• Preliminary evaluations are performed on the NIST extended data set (one hour of training data per speaker)

Selection of nasals in words in -ing

being, everything, getting, anything, thing, something, things, going

«MAJORDOME»

Unified Messaging System

Eureka Project no. 2340

EDF, Vecsys, KTH, Mensatec, UPC, Airtel, Software602

D. Bahu-Leyser, G. Chollet, K. Hallouli, J. Kharroubi, L. Likforman, D. Mostefa, D. Petrovska, M. Sigelle, P. Vaillant

Majordome’s Functionalities

• Speaker verification

• Dialogue

• Routing

• Updating the agenda

• Automatic summary

Voice

Fax

E-mail

Voice technology in Majordome

• Server side background tasks:
  • Continuous speech recognition applied to voice messages upon reception
  • Detection of sender’s name and subject
• User interaction:
  • Speaker identification and verification
  • Speech recognition (receiving user commands through voice interaction)
  • Text-to-speech synthesis (reading text summaries, E-mails or faxes)

BIOMET

An extension of the M2VTS and DAVID projects to include such modalities as signature, finger print, hand shape.

Initial support (two years) is provided by GET (Groupement des Ecoles de Télécommunications)

Emphasis will be on fusion of scores obtained from two or more modalities.
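Score fusion can take many forms. The sketch below shows one simple option, a weighted sum of z-normalised scores, and is an assumption for illustration rather than the BIOMET method. The modality names, the impostor score statistics and the equal weights are hypothetical.

```python
def z_normalise(score, mean, std):
    """Put a raw score on a common scale using (assumed known)
    impostor score statistics for that modality."""
    return (score - mean) / std

def fuse_scores(scores, stats, weights):
    """Weighted average of z-normalised scores, one per modality.

    scores:  {modality: raw score}
    stats:   {modality: (impostor mean, impostor std)}
    weights: {modality: non-negative weight}
    """
    total_weight = sum(weights[m] for m in scores)
    return sum(weights[m] * z_normalise(scores[m], *stats[m])
               for m in scores) / total_weight

# Hypothetical scores from a speech expert and a signature expert.
scores = {"speech": 2.0, "signature": 0.5}
stats = {"speech": (0.0, 2.0), "signature": (0.0, 0.5)}
weights = {"speech": 0.5, "signature": 0.5}
fused = fuse_scores(scores, stats, weights)
```

Normalising before summing matters because raw scores from different modalities rarely share a scale; the weights would normally be tuned on development data.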

Conclusions and Perspectives

Evaluation trials (as conducted by NIST) help improve technology.

A strategy combining speech recognition and segmental scoring seems to be a promising approach for speaker verification.

Whenever possible, text independent speaker verification should be confirmed by text dependent verification.

Whenever possible, fusion of multiple experts (preferably multimodal) should be performed.