
Prosody-based automatic segmentation of speech into sentences and topics

Elizabeth Shriberg a,*, Andreas Stolcke a, Dilek Hakkani-Tür b, Gökhan Tür b

a Speech Technology and Research Laboratory, SRI International, Menlo Park, CA, USA
b Department of Computer Engineering, Bilkent University, Ankara 06533, Turkey

Abstract

A crucial step in processing speech audio data for information extraction, topic detection, or browsing/playback is to segment the input into sentence and topic units. Speech segmentation is challenging, since the cues typically present for segmenting text (headers, paragraphs, punctuation) are absent in spoken language. We investigate the use of prosody (information gleaned from the timing and melody of speech) for these tasks. Using decision tree and hidden Markov modeling techniques, we combine prosodic cues with word-based approaches, and evaluate performance on two speech corpora, Broadcast News and Switchboard. Results show that the prosodic model alone performs on par with, or better than, word-based statistical language models – for both true and automatically recognized words in news speech. The prosodic model achieves comparable performance with significantly less training data, and requires no hand-labeling of prosodic events. Across tasks and corpora, we obtain a significant improvement over word-only models using a probabilistic combination of prosodic and lexical information. Inspection reveals that the prosodic models capture language-independent boundary indicators described in the literature. Finally, cue usage is task and corpus dependent. For example, pause and pitch features are highly informative for segmenting news speech, whereas pause, duration and word-based cues dominate for natural conversation. © 2000 Elsevier Science B.V. All rights reserved.

Zusammenfassung

Ein wesentlicher Schritt in der Sprachverarbeitung zum Zweck der Informationsextrahierung, Themenklassifizierung oder Wiedergabe ist die Segmentierung in thematische und Satzeinheiten. Sprachsegmentierung ist schwierig, da die Hinweise, die dafür gewöhnlich in Texten vorzufinden sind (Überschriften, Absätze, Interpunktion), in gesprochener Sprache fehlen. Wir untersuchen die Benutzung von Prosodie (Timing und Melodie der Sprache) zu diesem Zweck. Mithilfe von Entscheidungsbäumen und Hidden-Markov-Modellen kombinieren wir prosodische und wortbasierte Informationen, und prüfen unsere Verfahren anhand von zwei Sprachkorpora, Broadcast News und Switchboard. Sowohl bei korrekten, als auch bei automatisch erkannten Worttranskriptionen von Broadcast News zeigen unsere Ergebnisse, daß Prosodiemodelle alleine eine gleichgute oder bessere Leistung als die wortbasierten statistischen Sprachmodelle erbringen. Dabei erzielt das Prosodiemodell eine vergleichbare Leistung mit wesentlich weniger Trainingsdaten und bedarf keines manuellen Transkribierens prosodischer Eigenschaften. Für beide Segmentierungsarten und Korpora erzielen wir eine signifikante Verbesserung gegenüber rein wortbasierten Modellen, indem wir prosodische und lexikalische Informationsquellen probabilistisch kombinieren. Eine Untersuchung der Prosodiemodelle zeigt, daß diese auf sprachunabhängige, in der Literatur beschriebene Segmentierungsmerkmale ansprechen. Die Auswahl der Merkmale hängt wesentlich von Segmentierungstyp und Korpus ab. Zum Beispiel sind Pausen und F0-Merkmale vor allem für Nachrichtensprache informativ, während zeitdauer- und wortbasierte Merkmale in natürlichen Gesprächen dominieren. © 2000 Elsevier Science B.V. All rights reserved.

www.elsevier.nl/locate/specom
Speech Communication 32 (2000) 127–154

* Corresponding author.

E-mail addresses: [email protected] (E. Shriberg), [email protected] (A. Stolcke), [email protected] (D. Hakkani-Tür), [email protected] (G. Tür).

0167-6393/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved.
PII: S0167-6393(00)00028-5

Résumé

Une étape cruciale dans le traitement de la parole pour l'extraction d'information, la détection du sujet de conversation et la navigation est la segmentation du discours. Celle-ci est difficile car les indices aidant à segmenter un texte (en-têtes, paragraphes, ponctuation) n'apparaissent pas dans le langage parlé. Nous étudions l'usage de la prosodie (l'information extraite du rythme et de la mélodie de la parole) à cet effet. A l'aide d'arbres de décision et de chaînes de Markov cachées, nous combinons les indices prosodiques avec le modèle du langage. Nous évaluons notre algorithme sur deux corpora, Broadcast News et Switchboard. Nos résultats indiquent que le modèle prosodique est équivalent ou supérieur au modèle du langage, et qu'il requiert moins de données d'entraînement. Il ne nécessite pas d'annotations manuelles de la prosodie. De plus, nous obtenons un gain significatif en combinant de manière probabiliste l'information prosodique et lexicale, et ce pour différents corpora et applications. Une inspection plus détaillée des résultats révèle que les modèles prosodiques identifient les indicateurs de début et de fin de segments, tel que décrit dans la littérature. Finalement, l'usage des indices prosodiques dépend de l'application et du corpus. Par exemple, le ton s'avère extrêmement utile pour la segmentation des bulletins télévisés, alors que les caractéristiques de durée et celles extraites du modèle du langage servent davantage pour la segmentation de conversations naturelles. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Sentence segmentation; Topic segmentation; Prosody; Information extraction; Automatic speech recognition; Broadcast news; Switchboard

1. Introduction

1.1. Why process audio data?

Extracting information from audio data allows examination of a much wider range of data sources than does text alone. Many sources (e.g., interviews, conversations, news broadcasts) are available only in audio form. Furthermore, audio data is often a much richer source than text alone, especially if the data was originally meant to be heard rather than read (e.g., news broadcasts).

1.2. Why automatic segmentation?

Past automatic information extraction systems have depended mostly on lexical information for segmentation (Kubala et al., 1998; Allan et al., 1998; Hearst, 1997; Kozima, 1993; Yamron et al., 1998; among others). A problem for the text-based approach, when applied to speech input, is the lack of typographic cues (such as headers, paragraphs, sentence punctuation, and capitalization) in continuous speech.

A crucial step toward robust information extraction from speech is the automatic determination of topic, sentence, and phrase boundaries. Such locations are overt in text (via punctuation, capitalization, formatting) but are absent or "hidden" in speech output. Topic boundaries are an important prerequisite for topic detection, topic tracking, and summarization. They are further helpful for constraining other tasks such as coreference resolution (e.g., since anaphoric references do not cross topic boundaries). Finding sentence boundaries is a necessary first step for topic segmentation. It is also necessary to break up long stretches of audio data prior to parsing. In addition, modeling of sentence boundaries can benefit named entity extraction from automatic speech recognition (ASR) output, for example by preventing proper nouns spanning a sentence boundary from being grouped together.

1.3. Why use prosody?

When spoken language is converted via ASR to a simple stream of words, the timing and pitch patterns are lost. Such patterns (and other related aspects that are independent of the words) are known as speech prosody. In all languages, prosody is used to convey structural, semantic, and functional information.

Prosodic cues are known to be relevant to discourse structure across languages (e.g., Vaissière, 1983) and can therefore be expected to play an important role in various information extraction tasks. Analyses of read or spontaneous monologues in linguistics and related fields have shown that information units, such as sentences and paragraphs, are often demarcated prosodically. In English and related languages, such prosodic indicators include pausing, changes in pitch range and amplitude, global pitch declination, melody and boundary tone distribution, and speaking rate variation. For example, both sentence boundaries and paragraph or topic boundaries are often marked by some combination of a long pause, a preceding final low boundary tone, and a pitch range reset, among other features (Lehiste, 1979, 1980; Brown et al., 1980; Bruce, 1982; Thorsen, 1985; Silverman, 1987; Grosz and Hirschberg, 1992; Sluijter and Terken, 1994; Swerts and Geluykens, 1994; Koopmans-van Beinum and van Donzel, 1996; Hirschberg and Nakatani, 1996; Nakajima and Tsukada, 1997; Swerts, 1997; Swerts and Ostendorf, 1997).

Furthermore, prosodic cues by their nature are relatively unaffected by word identity, and should therefore improve the robustness of lexical information extraction methods based on ASR output. This may be particularly important for spontaneous human–human conversation, since ASR word error rates remain much higher for these corpora than for read, constrained, or computer-directed speech (LVCSR, 1999).

A related reason to use prosodic information is that certain prosodic features can be computed even when ASR is unavailable, for example, for a new language for which no dictionary is available. Here they could be applied, for instance, to audio browsing and playback, or to cutting waveforms prior to recognition so as to limit audio segments to durations feasible for decoding.

Furthermore, unlike spectral features, some prosodic features (e.g., duration and intonation patterns) are largely invariant to changes in channel characteristics (to the extent that they can be adequately extracted from the signal). Thus, the research results are independent of characteristics of the communication channel, implying that the benefits of prosody are significant across multiple applications.

Finally, prosodic feature extraction can be achieved with minimal additional computational load and no additional training data; results can be integrated directly with existing conventional ASR language and acoustic models. Thus, performance gains can be evaluated quickly and cheaply, without requiring additional infrastructure.

1.4. This study

Past studies involving prosodic information have generally relied on hand-coded cues (an exception is (Hirschberg and Nakatani, 1996)). We believe the present work to be the first that combines fully automatic extraction of both lexical and prosodic information for speech segmentation. Our general framework for combining lexical and prosodic cues for tagging speech with various kinds of hidden structural information is a further development of earlier work on detecting sentence boundaries and disfluencies in spontaneous speech (Shriberg et al., 1997; Stolcke et al., 1998; Hakkani-Tür et al., 1999; Stolcke et al., 1999; Tür et al., to appear) and on detecting topic boundaries in Broadcast News (Hakkani-Tür et al., 1999; Stolcke et al., 1999; Tür et al., to appear). In previous work we provided only a high-level summary of the prosody modeling, focusing instead on detailing the language modeling and model combination.

In this paper, we describe the prosodic modeling in detail. In addition we include, for the first time, controlled comparisons for speech data from two corpora differing greatly in style: Broadcast News (Graff, 1997) and Switchboard (Godfrey et al., 1992). The two corpora are compared directly on the task of sentence segmentation, and the two tasks (sentence and topic segmentation) are compared for the Broadcast News data.



Throughout, our paradigm holds the candidate features for prosodic modeling constant across tasks and corpora. That is, we created parallel prosodic databases for both corpora, and used the same machine learning approach for prosodic modeling in all cases. We look at results for both true words, and words as hypothesized by a speech recognizer. Both conditions provide informative data points. True words reflect the inherent additional value of prosodic information above and beyond perfect word information. Using recognized words allows comparison of degradation of the prosodic model to that of a language model, and also allows us to assess realistic performance of the prosodic model when word boundary information must be extracted based on incorrect hypotheses rather than forced alignments.

Section 2 describes the methodology, including the prosodic modeling using decision trees, the language modeling, the model combination approaches, and the data sets. The prosodic modeling section is particularly detailed, outlining the motivation for each of the prosodic features and specifying their extraction, computation, and normalization. Section 3 discusses results for each of our three tasks: sentence segmentation for Broadcast News, sentence segmentation for Switchboard, and topic segmentation for Broadcast News. For each task, we examine results from combining the prosodic information with language model information, using both transcribed and recognized words. We focus on overall performance, and on analysis of which prosodic features prove most useful for each task. The section closes with a general discussion of cross-task comparisons, and issues for further work. Finally, in Section 4 we summarize the main insights gained from the study, concluding with points on the general relevance of prosody for automatic segmentation of spoken audio.

2. Method

2.1. Prosodic modeling

2.1.1. Feature extraction regions

In all cases we used only very local features, for practical reasons (simplicity, computational constraints, extension to other tasks), although in principle one could look at longer regions. As shown in Fig. 1, for each inter-word boundary, we looked at prosodic features of the word immediately preceding and following the boundary, or alternatively within a window of 20 frames (200 ms, a value empirically optimized for this work) before and after the boundary. In boundaries containing a pause, the window extended backward from the pause start, and forward from the pause end. (Of course, it is conceivable that a more effective region could be based on information about syllables and stress patterns, for example, extending backward and forward until a stressed syllable is reached. However, the recognizer used did not model stress, so we preferred the simpler, word-based criterion used here.)
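A minimal sketch of these extraction regions (our own hypothetical helper, not the authors' code; it assumes word and pause times in seconds at a 10 ms frame rate):

```python
# Hypothetical helper illustrating the extraction regions described above.
# For a boundary, the 200 ms (20-frame) analysis windows extend backward
# from the end of the preceding word and forward from the start of the
# following word; when the boundary contains a pause, those two times are
# the pause start and pause end, so the windows hug the pause edges.

WINDOW_S = 0.200  # 20 frames at 10 ms per frame, as in the paper

def extraction_windows(prev_word_end, next_word_start):
    """Return ((pre_start, pre_end), (post_start, post_end)) in seconds."""
    pre = (max(0.0, prev_word_end - WINDOW_S), prev_word_end)
    post = (next_word_start, next_word_start + WINDOW_S)
    return pre, post

# Boundary containing a 0.5 s pause from 3.10 s to 3.60 s:
pre, post = extraction_windows(prev_word_end=3.10, next_word_start=3.60)
```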

We extracted prosodic features reflecting pause durations, phone durations, pitch information, and voice quality information. Pause features were extracted at the inter-word boundaries. Duration, F0, and voice quality features were extracted mainly from the word or window preceding the boundary (which was found to carry more prosodic information for these tasks than the speech following the boundary (Shriberg et al., 1997)). We also included pitch-related features reflecting the difference in pitch range across the boundary.

Fig. 1. Feature extraction regions for each inter-word boundary.

In addition, we included non-prosodic features that are inherently related to the prosodic features, for example, features that make a prosodic feature undefined (such as speaker turn boundaries) or that would show up if we had not normalized appropriately (such as gender, in the case of F0). This allowed us both to better understand feature interactions, and to check for appropriateness of normalization schemes.

We chose not to use amplitude- or energy-based features, since previous work showed these features to be both less reliable than and largely redundant with duration and pitch features. A main reason for the lack of robustness of the energy cues was the high degree of channel variability in both corpora examined, even after application of various normalization techniques based on the signal-to-noise ratio distribution characteristics of, for example, a conversation side (the speech recorded from one speaker in the two-party conversation) in Switchboard. Exploratory work showed that energy measures can correlate with shows (news programs in the Broadcast News corpus), speakers, and so forth, rather than with the structural locations in which we were interested. Duration and pitch, on the other hand, are relatively invariant to channel effects (to the extent that they can be adequately extracted).

In training, word boundaries were obtained from recognizer forced alignments. In testing on recognized words, we used alignments for the 1-best recognition hypothesis. Note that this results in a mismatch between training and test data for the case of testing on recognized words, one that works against us. That is, the prosodic models are trained on better alignments than can be expected in testing; thus, the features selected may be suboptimal in the less robust situation of recognized words. Therefore, we expect that any benefit from the present, suboptimal approach would only be enhanced if the prosodic models were based on recognizer alignments in training as well.

2.1.2. Features

We included features that, based on the descriptive literature, should reflect breaks in the temporal and intonational contour. We developed versions of such features that could be defined at each inter-word boundary, and that could be extracted by completely automatic means, without human labeling. Furthermore, the features were designed to be independent of word identities, for robustness to imperfect recognizer output.

We began with a set of over 100 features, which, after initial investigations, was pared down to a smaller set by eliminating features that were clearly not at all useful (based on decision tree experiments; see also Section 2.1.4). The resulting set of features is described below. Features are grouped into broad feature classes based on the kinds of measurements involved, and the type of prosodic behavior they were designed to capture.

2.1.2.1. Pause features. Important cues to boundaries between semantic units, such as sentences or topics, are breaks in prosodic continuity, including pauses. We extracted pause duration at each boundary based on recognizer output. The pause model used by the recognizer was trained as an individual phone, which during training could occur optionally between words. In the case of no pause at the boundary, this pause duration feature was output as 0.

We also included the duration of the pause preceding the word before the boundary, to reflect whether speech right before the boundary was just starting up or continuous from previous speech. Most inter-word locations contained no pause, and were labeled as zero length. We did not need to distinguish between actual pauses and the short segmental-related pauses (e.g., stop closures) inserted by the speech recognizer, since models easily learned to distinguish the cases based on duration.

We investigated both raw durations and durations normalized for pause duration distributions from the particular speaker. Our models selected the unnormalized feature over the normalized version, possibly because of a lack of sufficient pause data per speaker. The unnormalized measure was apparently sufficient to capture the gross differences in pause duration distributions that separate boundary from non-boundary locations, despite speaker variation within both categories.

For the Broadcast News data, which contained mainly monologues and which was recorded on a single channel, pause durations were undefined at speaker changes. For the Switchboard data there was significant speaker overlap, and a high rate of backchannels (such as "uh-huh") that were uttered by a listener during the speaker's turn. Some of these cases were associated with simultaneous speaker pausing and listener backchanneling. Because the pauses here did not constitute real turn boundaries, and because the Switchboard conversations were recorded on separate channels, we included such speaker pauses in the pause duration measure (i.e., even though a backchannel was uttered on the other channel).
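The two pause features can be sketched as follows. In the paper they come from the recognizer's optional pause phone in the forced alignment; this illustrative stand-in (the names and the word-timestamp representation are our assumptions) derives them from aligned word start and end times instead:

```python
# Illustrative pause features for the boundary after words[i], where each
# word is an aligned (start_time, end_time) pair in seconds. pause_dur is
# the pause at the boundary itself (0.0 when no pause was aligned there);
# prev_pause_dur is the pause preceding the word before the boundary,
# reflecting whether speech was just starting up or continuous.

def pause_features(words, i):
    pause_at = 0.0
    if i + 1 < len(words):
        pause_at = max(0.0, words[i + 1][0] - words[i][1])
    pause_before = 0.0
    if i > 0:
        pause_before = max(0.0, words[i][0] - words[i - 1][1])
    return {"pause_dur": pause_at, "prev_pause_dur": pause_before}

words = [(0.0, 0.4), (0.4, 0.9), (1.5, 1.8)]  # 0.6 s gap before word 2
f = pause_features(words, 1)
```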

2.1.2.2. Phone and rhyme duration features. Another well-known cue to boundaries in speech is a slowing down toward the ends of units, or preboundary lengthening. Preboundary lengthening typically affects the nucleus and coda of syllables, so we included measures here that reflected duration characteristics of the last rhyme (nucleus plus coda) of the syllable preceding the boundary.

Each phone in the rhyme was normalized for inherent duration as follows:

\[ \sum_i \frac{\mathit{phone\_dur}_i - \mathit{mean\_phone\_dur}_i}{\mathit{std\_dev\_phone\_dur}_i}, \tag{1} \]

where mean_phone_dur_i and std_dev_phone_dur_i are the mean and standard deviation of the current phone over all shows or conversations in the training data. [1] Rhyme features included the average normalized phone duration in the rhyme, computed by dividing the measure in Eq. (1) by the number of phones in the rhyme, as well as a variety of other methods for normalization. To roughly capture lengthening of prefinal syllables in a multisyllabic word, we also recorded the longest normalized phone, as well as the longest normalized vowel, found in the preboundary word. [2]

We distinguished phones in filled pauses (such as "um" and "uh") from those elsewhere, since it has been shown in previous work that durations of such fillers (which are very frequent in Switchboard) are considerably longer than those of spectrally similar vowels elsewhere (Shriberg, 1999). We also noted that for some phones, particularly nasals, errors in the recognizer forced alignments in training sometimes produced inordinately long (incorrect) phone durations. This affected the robustness of our standard deviation estimates; to avoid the problem we removed any clear outliers by inspecting the phone-specific duration histograms prior to computing standard deviations.
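The normalization of Eq. (1), together with outlier handling of this kind, can be sketched as follows. The paper removed outliers by inspecting phone-specific histograms by hand; we approximate that here with a fixed trim of the longest 1% of tokens (the cutoff, and all names, are our illustrative assumptions):

```python
import statistics

# Estimate per-phone duration statistics with the longest 1% of tokens
# discarded (a stand-in for the paper's manual histogram inspection),
# then compute the average normalized phone duration over a rhyme,
# i.e., the sum in Eq. (1) divided by the number of phones.

def duration_stats(durations, trim_frac=0.01):
    d = sorted(durations)
    keep = d[: max(2, int(len(d) * (1 - trim_frac)))]
    return statistics.mean(keep), statistics.stdev(keep)

def norm_rhyme_duration(rhyme_phone_durs, stats_by_phone):
    z = [(dur - stats_by_phone[ph][0]) / stats_by_phone[ph][1]
         for ph, dur in rhyme_phone_durs]
    return sum(z) / len(z)

# Toy example: one alignment error (0.90 s) is trimmed before estimation.
stats = {"ae": duration_stats([0.08, 0.10, 0.12, 0.10, 0.10, 0.90])}
score = norm_rhyme_duration([("ae", 0.14)], stats)  # lengthened token
```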

In addition to using phone-specific means and standard deviations over all speakers in a corpus, we investigated the use of speaker-specific values for normalization, backing off to cross-speaker values for cases of low phone-by-speaker counts. However, these features were less useful than the features from data pooled over all speakers (probably due to a lack of robustness in estimating the standard deviations in the smaller, speaker-specific data sets). Alternative normalizations were also computed, including phone_dur_i / mean_phone_dur_i (to avoid noisy estimates of standard deviations), both for speaker-independent and speaker-dependent means.

Interestingly, we found it necessary to bin the normalized duration measures in order to reflect preboundary lengthening, rather than segmental information. Because these duration measures were normalized by phone-specific values (means and standard deviations), our decision trees were able to use certain specific feature values as clues to word identities and, indirectly, to boundaries.

[1] Improvements in future work could include the use of triphone-based normalization (on a sufficiently large corpus to assure robust estimates), or of normalization based on syllable position and stress information (given a dictionary marked for this information).

[2] Using dictionary stress information would probably be a better approach. Nevertheless, one advantage of this simple method is a robustness to pronunciation variation, since the longest observed normalized phone duration is used, rather than some predetermined phone.



For example, the word I in the Switchboard corpus is a strong cue to a sentence onset; normalizing by the constant mean and standard deviation for that particular vowel resulted in specific values that were "learned" by the models. To address this, we binned all duration features to remove the level of precision associated with the phone-level correlations.
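The binning itself can be sketched as below (hypothetical; the paper does not give the bin width, so the half-standard-deviation bins and clipping range here are our illustrative choices):

```python
# Quantize a normalized (z-scored) duration into coarse bins so a decision
# tree sees lengthening levels rather than phone-revealing exact values.

def bin_duration(z, width=0.5, lo=-3.0, hi=3.0):
    z = min(max(z, lo), hi)          # clip extreme values
    return round((z - lo) / width)   # integer bin index

# Two nearby values that could betray a specific phone identity
# fall into the same coarse bin:
same_bin = bin_duration(1.23) == bin_duration(1.20)
```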

2.1.2.3. F0 features. Pitch information is typically less robust and more difficult to model than other prosodic features, such as duration. This is largely attributable to variability in the way pitch is used across speakers and speaking contexts, complexity in representing pitch patterns, segmental effects, and pitch tracking discontinuities (such as doubling errors and pitch halving, the latter of which is also associated with non-modal voicing).

To smooth out microintonation and tracking errors, simplify our F0 feature computation, and identify speaking-range parameters for each speaker, we postprocessed the frame-level F0 output from a standard pitch tracker. We used an autocorrelation-based pitch tracker (the "get_f0" function in ESPS/Waves (Entropic Research Laboratory, 1993), with default parameter settings) to generate estimates of frame-level F0 (Talkin, 1995). Postprocessing steps are outlined in Fig. 2 and are described further in work on prosodic modeling for speaker verification (Sönmez et al., 1998).

The raw pitch tracker output has two main noise sources, which are minimized in the filtering stage. F0 halving and doubling are estimated by a lognormal tied mixture model (LTM) of F0, based on histograms of F0 values collected from all data from the same speaker. [3] For the Broadcast News corpus we pooled data from the same speaker over multiple news shows; for the Switchboard data, we used only the data from one side of a conversation for each histogram.

For each speaker, the F0 distribution was modeled by three lognormal modes spaced log 2 apart in the log frequency domain. The locations of the modes were modeled with one tied parameter (μ − log 2, μ, μ + log 2), variances were scaled to be the same in the log domain, and mixture weights were estimated by an expectation maximization (EM) algorithm. This approach allowed estimation of speaker F0 range parameters that proved useful for F0 normalization.
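A minimal sketch of this tied-mixture estimation (our illustrative reimplementation, not the authors' code; the initialization, iteration count, and variance floor are arbitrary choices):

```python
import math, random

# EM for a tied three-mode mixture on log F0: mode locations are
# mu - log 2, mu, mu + log 2 (one tied parameter mu), with a shared
# variance and free mixture weights, so halved and doubled frames are
# absorbed by the outer modes and mu tracks the true speaking range.

OFFSETS = (-math.log(2), 0.0, math.log(2))

def fit_ltm(logf0, iters=50):
    mu = sum(logf0) / len(logf0)
    var, w = 0.1, [1.0 / 3] * 3
    for _ in range(iters):
        resp = []                      # E-step: mode responsibilities
        for y in logf0:
            p = [w[k] * math.exp(-(y - mu - OFFSETS[k]) ** 2 / (2 * var))
                 for k in range(3)]
            s = sum(p) + 1e-300
            resp.append([pk / s for pk in p])
        n = len(logf0)                 # M-step: tied mean, variance, weights
        mu = sum(r[k] * (y - OFFSETS[k])
                 for y, r in zip(logf0, resp) for k in range(3)) / n
        var = max(1e-4, sum(r[k] * (y - mu - OFFSETS[k]) ** 2
                            for y, r in zip(logf0, resp) for k in range(3)) / n)
        w = [sum(r[k] for r in resp) / n for k in range(3)]
    return mu, var, w

# Synthetic speaker around 200 Hz with 20% of frames halved to ~100 Hz.
random.seed(0)
data = [math.log(200) + random.gauss(0, 0.1) for _ in range(400)]
data += [math.log(100) + random.gauss(0, 0.1) for _ in range(100)]
mu, var, w = fit_ltm(data)
```

The tied spacing means the modes cannot swap labels, which is what makes the recovered μ directly interpretable as the speaker's true log-F0 center.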

Prior to the regularization stage, median filtering smooths voicing onsets during which the tracker is unstable, resulting in local undershoot or overshoot. We applied median filtering to windows of voiced frames with a neighborhood size of 7, plus or minus 3 frames. Next, in the regularization stage, F0 contours are fit by a simple piecewise linear model

eF0 �XK

k�1

�akF0 � bk�I�xkÿ1<F0 6 xk �;

where K is the number of nodes, xk are the nodelocations, and ak and bk are the linear parametersfor a given region. The parameters are estimated

Fig. 2. F0 processing.

³ We settled on a cheating approach here, assuming speaker tracking information was available in testing, since automatic speaker segmentation and tracking was beyond the scope of this work.

E. Shriberg et al. / Speech Communication 32 (2000) 127–154


by minimizing the mean squared error with a greedy node placement algorithm. The smoothness of the fits is fixed by two global parameters: the maximum mean squared error for deviation from a line in a given region, and the minimum length of a region.
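The greedy node placement can be sketched as a recursive split under the two global parameters just described (a simplified reconstruction; the paper's exact algorithm appears in Sönmez et al. (1998), and our thresholds are illustrative):

```python
import numpy as np

def stylize_f0(t, f0, max_mse=4.0, min_len=5):
    """Greedy piecewise linear fit of a voiced F0 contour.
    Recursively splits a region at its worst-fit frame until the
    per-region MSE falls below max_mse or the region would drop
    below the minimum length. Returns (t_start, t_end, slope,
    intercept) tuples, one per linear region."""
    def fit(lo, hi):
        a, b = np.polyfit(t[lo:hi], f0[lo:hi], 1)
        pred = a * t[lo:hi] + b
        return a, b, pred, np.mean((f0[lo:hi] - pred) ** 2)

    def segment(lo, hi):
        a, b, pred, mse = fit(lo, hi)
        if mse <= max_mse or hi - lo <= 2 * min_len:
            return [(t[lo], t[hi - 1], a, b)]
        # split at the largest residual, away from the region edges
        resid = np.abs(f0[lo:hi] - pred)
        split = lo + min_len + int(np.argmax(resid[min_len:hi - lo - min_len]))
        return segment(lo, split) + segment(split, hi)

    return segment(0, len(f0))
```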

The resulting filtered and stylized F0 contour, an example of which is shown in Fig. 3, enables robust extraction of features such as the value of the F0 slope at a particular point, the maximum or minimum stylized F0 within a region, and a simple characterization of whether the F0 trajectory before a word boundary is broken or continued into the next word. In addition, over all data from a particular speaker, statistics such as average slopes can be computed for normalization purposes. These statistics, combined with the speaker range values computed from the speaker histograms, allowed us to easily and robustly compute a large number of F0 features, as outlined in Section 2.1.2. In exploratory work on Switchboard, we found that the stylized F0 features yielded better results than more complex features computed from the raw F0 tracks. Thus, we restricted our input features to those computed from the processed F0 tracks, and did the same for Broadcast News.

We computed four different types of F0 features, all based on values computed from the stylized processing, but each capturing a different aspect of intonational behavior: (1) F0 reset features, (2) F0 range features, (3) F0 slope features, and (4) F0 continuity features. The general characteristics captured can be illustrated with the help of Fig. 4.

Reset features. The first set of features was designed to capture the well-known tendency of speakers to reset pitch at the start of a new major unit, such as a topic or sentence boundary, relative to where they left off. Typically the reset is preceded by a final fall in pitch associated with the ends of such units. Thus, at boundaries we expect a larger reset than at non-boundaries. We took measurements from the stylized F0 contours for the voiced regions of the word preceding and of the word following the boundary. Measurements were taken at the minimum, maximum, mean, starting, or ending stylized F0 value within the region associated with each of the words. Numerous features were computed to compare the previous to the following word; we computed both the log of the ratio between the two values and the log of the difference between them, since it is unclear which measure would be better. Thus, in Fig. 4, the F0 difference between ``at'' and

Fig. 3. F0 contour filtering and regularization.

Fig. 4. Schematic example of stylized F0 for voiced regions of the text. The speaker's estimated baseline F0 (from the lognormal tied mixture modeling) is also indicated.


``eleven'' would not imply a reset, but that between ``night'' and ``at'' would imply a large reset, particularly for the measure comparing the minimum F0 of ``night'' to the maximum F0 of ``at''. Parallel features were also computed based on the 200 ms windows rather than the words.
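In code, the family of reset comparisons might look like the following sketch (a hypothetical helper; the paper's exact feature inventory and normalizations differ, and we compute a raw difference alongside the log ratio):

```python
import numpy as np

def reset_features(prev_vals, next_vals):
    """Reset features for one word boundary: compare each statistic of
    the stylized F0 in the preceding word's voiced region against each
    statistic in the following word's (e.g. min of previous vs. max of
    next), as a log ratio and as a difference."""
    stats = {"min": np.min, "max": np.max, "mean": np.mean,
             "start": lambda v: v[0], "end": lambda v: v[-1]}
    feats = {}
    for pn, pf in stats.items():
        for nn, nf in stats.items():
            p, n = float(pf(prev_vals)), float(nf(next_vals))
            feats[f"log_ratio_{pn}_{nn}"] = np.log(n / p)  # > 0 implies a reset up
            feats[f"diff_{pn}_{nn}"] = n - p
    return feats
```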

Range features. The second set of features reflected the pitch range of a single word (or window), relative to one of the speaker-specific global F0 range parameters computed from the lognormal tied mixture modeling described earlier. We looked both before and after the boundary, but found features of the preboundary word or window to be the most useful for these tasks. For the speaker-specific range parameters, we estimated F0 baselines, toplines, and some intermediate range measures. By far the most useful value in our modeling was the F0 baseline, which we computed as occurring halfway between the first mode and the second mode in each speaker-specific F0 histogram, i.e., roughly at the bottom of the modal (non-halved) speaking range. We also estimated F0 toplines and intermediate values in the range, but these parameters proved much less useful than the baselines across tasks.

Unlike the reset features, which had to be defined as ``missing'' at boundaries containing a speaker change, the range features are defined at all boundaries for which F0 estimates can be made (since they look only at one side of the boundary). Thus, for example, in Fig. 4 the F0 of the word ``night'' falls very close to the speaker's F0 baseline, and can be utilized irrespective of whether or not the speaker changes before the next word.

We were particularly interested in these features for the case of topic segmentation in Broadcast News, since due to the frequent speaker changes at actual topic boundaries we needed a measure that would be defined at such locations. We also expected speakers to be more likely to fall closer to the bottom of their pitch range for topic than for sentence boundaries, since the former implies a greater degree of finality.

Slope features. Our final two sets of F0 features looked at the slopes of the stylized F0 segments, both for a word (or window) on only one side of the boundary, and for continuity across the boundary. The aim was to capture local pitch variation such as the presence of pitch accents and boundary tones. Slope features measured the degree of F0 excursion before or after the boundary, either relative to the particular speaker's average excursion in the pitch range, or simply normalized by the pitch range on the particular word.

Continuity features. Continuity features measured the change in slope across the boundary. Here, we expected that continuous trajectories would correlate with non-boundaries, and broken trajectories would tend to indicate boundaries, regardless of differences in pitch values across words. For example, in Fig. 4 the words ``last'' and ``night'' show a continuous pitch trajectory, so it is highly unlikely that there is a major syntactic or semantic boundary at that location. We computed both scalar (slope difference) and categorical (rise–fall) features for inclusion in the experiments.

2.1.2.4. Estimated voice quality features. Scalar F0 statistics (e.g., those contributing to slopes, or minimum/maximum F0 within a word or region) were computed ignoring any frames associated with F0 halving or doubling (frames whose highest posterior was not that for the modal region). However, regions corresponding to F0 halving as estimated by the lognormal tied mixture model showed high correlation with regions of creaky voice or glottalization that had been independently hand-labeled by a phonetician. Since creak may correlate with our boundaries of interest, we also included some categorical features reflecting the presence or absence of creak.

We used two simple categorical features. One feature reflected whether or not pitch halving (as estimated by the model) was present for at least a few frames anywhere within the word preceding the boundary. The second looked at whether halving was present at the end of that word. As it turned out, while these two features showed up in decision trees for some speakers, and in the patterns we expected, glottalization and creak are highly speaker dependent and thus were not helpful in our overall modeling. However, for


speaker-dependent modeling, such features could potentially be more useful.

2.1.2.5. Other features. We included two types of non-prosodic features: turn-related features and gender features. Both kinds of features were legitimately available for our modeling, in the sense that standard speech recognition evaluations made this information known. Whether or not speaker change markers would actually be available depends on the application. It is not unreasonable, however, to assume this information, since automatic algorithms have been developed for this purpose (e.g., Przybocki and Martin, 1999; Liu and Kubala, 1999; Sönmez et al., 1999). Such non-prosodic features often interact with prosodic features. For example, turn boundaries cause certain prosodic features (such as the F0 difference across the boundary) to be undefined, and speaker gender is highly correlated with F0. Thus, by including these features we could better understand feature interactions and check the appropriateness of our normalization schemes.

Our turn-related features included whether or not the speaker changed at a boundary, the time elapsed from the start of the turn, and the turn count in the conversation. The last measure was included to capture structural information about the data, such as the preponderance of topic changes occurring early in Broadcast News shows, due to short initial summaries of topics at the beginning of certain shows.

We included speaker gender mainly as a check to make sure the F0 processing was normalized properly for gender differences. That is, we initially hoped that this feature would not show up in the trees. However, we learned that there are reasons other than poor normalization for gender to occur in the trees, including potentially true stylistic differences between men and women, and structural differences associated with gender (such as differences in lengths of stories in Broadcast News). Thus, gender revealed some interesting inherent interactions in our data, which are discussed further in Section 3.3. In addition to speaker gender, we included the gender of the listener, to investigate the degree to which features distinguishing boundaries might be affected by sociolinguistic variables.

2.1.3. Decision trees

As in past prosodic modeling work (Shriberg et al., 1997), we chose to use CART-style decision trees (Breiman et al., 1984), as implemented by the IND package (Buntine and Caruana, 1992). The software offers options for handling missing feature values (important since we did not have good pitch estimates for all data points), and is capable of processing large amounts of training data. Decision trees are probabilistic classifiers that can be characterized briefly as follows. Given a set of discrete or continuous features and a labeled training set, the decision tree construction algorithm repeatedly selects the single feature that, according to an information-theoretic criterion (entropy), has the highest predictive value for the classification task in question.⁴ The feature queries are arranged in a hierarchical fashion, yielding a tree of questions to be asked of a given data point. The leaves of the tree store probabilities about the class distribution of all samples falling into the corresponding region of the feature space, which then serve as predictors for unseen test samples. Various smoothing and pruning techniques are commonly employed to avoid overfitting the model to the training data.

Although any of several probabilistic classifiers (such as neural networks, exponential models, or naive Bayes networks) could be used as posterior probability estimators, decision trees allow us to add, and automatically select, other (non-prosodic) features that might be relevant to the task, including categorical features. Furthermore, decision trees make no assumptions about the shape of feature distributions; thus it is not necessary to convert feature values to some standard scale. And perhaps most importantly, decision trees offer the distinct advantage of interpretability. We have found that human inspection of feature interactions in a decision tree fosters an intuitive

⁴ For multivalued or continuous features, the algorithm also determines optimal feature value subsets or thresholds, respectively, to compare the feature to.


understanding of feature behaviors and the phenomena they reflect. This understanding is crucial for progress in developing better features, as well as for debugging the feature extraction process itself.
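A toy stand-in for this setup (assuming NumPy and scikit-learn in place of the IND package, with synthetic features and labels of our own invention) shows how a CART-style tree yields boundary posteriors:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic word boundaries: most have negligible pause; boundaries
# tend to follow long pauses and show a larger F0 reset. Feature
# names and the generating process are illustrative only.
rng = np.random.default_rng(0)
n = 2000
pause_dur = np.where(rng.random(n) < 0.1, rng.exponential(0.5, n), 0.02)
f0_reset = rng.normal(0.0, 0.1, n) + 0.3 * (pause_dur > 0.2)
X = np.column_stack([pause_dur, f0_reset])
y = (pause_dur > 0.2).astype(int)          # synthetic boundary labels

# Train a CART-style tree and read off P(boundary | F_i) at each
# boundary from the class distributions stored at the leaves.
tree = DecisionTreeClassifier(min_samples_leaf=50).fit(X, y)
post = tree.predict_proba(X)[:, 1]
```

Note that, unlike IND, scikit-learn's classic CART trees do not propagate samples fractionally past missing feature values, so real missing-data handling would need imputation or a different toolkit.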

The decision tree served as a prosodic model for estimating the posterior probability of a (sentence or topic) boundary at a given inter-word boundary, based on the automatically extracted prosodic features. We define F_i as the features extracted from a window around the ith potential boundary, and T_i as the boundary type (boundary/no-boundary) at that position. For each task, decision trees were trained to predict the ith boundary type, i.e., to estimate P(T_i | F_i, W). By design, this decision was only weakly conditioned on the word sequence W, insofar as some of the prosodic features depend on the phonetic alignment of the word models. We preferred the weak conditioning for robustness to word errors in speech recognizer output. Missing feature values in F_i occurred mainly for the F0 features (due to lack of robust pitch estimates, for example), but also at locations where features were inherently undefined (e.g., pauses at turn boundaries). Such cases were handled in testing by sending the test sample down each tree branch with the proportion found in the training set at that node, and then averaging the corresponding predictions.

2.1.4. Feature selection algorithm

Our initial feature sets contained a high degree of feature redundancy because, for example, similar features arose from changing only normalization schemes, and others (such as energy and F0) are inherently correlated in speech production. The greedy nature of the decision tree learning algorithm implies that larger initial feature sets can yield suboptimal results: the availability of more features provides greater opportunity for ``greedy'' features to be chosen; such features minimize entropy locally but are suboptimal with respect to entropy minimization over the whole tree. Furthermore, it is desirable to remove redundant features for computational efficiency and to simplify interpretation of results.

To automatically reduce our large initial candidate feature set to an optimal subset, we developed an iterative feature selection algorithm that involved running multiple decision trees in training (sometimes hundreds for each task). The algorithm combines elements of brute-force search with previously determined human-based heuristics for narrowing the feature space to good groupings of features. We used the entropy reduction of the overall tree after cross-validation as a criterion for selecting the best subtree. Entropy reduction is the difference in test-set entropy between the prior class distribution and the posterior distribution estimated by the tree. It is a more fine-grained metric than classification accuracy, and is thus the more appropriate measure for any of the model combination approaches described in Section 2.3.

The algorithm proceeds in two phases. In the first phase, the large number of initial candidate features is reduced by a leave-one-out procedure: features that do not reduce performance when removed are eliminated from further consideration. The second phase begins with the reduced set of features, and performs a beam search over all possible subsets of features. Because our initial feature set contained over 100 features, we split the set into smaller subsets based on our experience with feature behaviors. For each subset we included a set of ``core'' features, which we knew from human analyses of results served as catalysts for other features. For example, pause duration was included in all subsets, since without this feature present, duration and pitch features are much less discriminative for the boundaries of interest.⁵
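The first, leave-one-out phase can be sketched as follows (our own minimal version; `score_fn` stands for the cross-validated entropy reduction of a tree trained on the given feature subset, whose implementation is assumed, and the beam-search second phase is omitted):

```python
def leave_one_out_filter(features, score_fn):
    """Phase 1 of the feature selection: tentatively drop each feature
    in turn, and eliminate any feature whose removal does not lower
    the score of the remaining set."""
    kept = list(features)
    for f in list(features):
        trial = [g for g in kept if g != f]
        if trial and score_fn(trial) >= score_fn(kept):
            kept = trial        # f carried no unique information
    return kept
```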

2.2. Language modeling

The goal of language modeling for our segmentation tasks is to capture information about segment boundaries contained in the word sequences. We denote boundary classifications by T = T_1, ..., T_K and use W = W_1, ..., W_N for the word sequence. Our general approach is to model

⁵ The success of this approach depends on the makeup of the initial feature sets, since highly correlated useful features can cancel each other out during the first phase. This problem can be addressed by forming initial feature subsets that minimize within-set cross-feature correlations.


the joint distribution of boundary types and words in a hidden Markov model (HMM), the hidden variable in this case being the boundaries T_i (or some related variable from which T_i can be inferred). Because we had hand-labeled training data available for all tasks, the HMM parameters could be trained in supervised fashion.

The structure of the HMM is task specific, as described below, but in all cases the Markovian character of the model allows us to efficiently perform the desired probabilistic inferences. For example, for topic segmentation we extract the most likely overall boundary classification,

    argmax_T P(T | W),    (2)

using the Viterbi algorithm (Viterbi, 1967). This optimization criterion is appropriate because the topic segmentation evaluation metric prescribed by the TDT program (Doddington, 1998) rewards overall consistency of the segmentation.⁶

For sentence segmentation, the evaluation metric simply counts the number of correctly labeled boundaries (see Section 2.4.4). Therefore, it is advantageous to use the slightly more complex forward–backward algorithm (Baum et al., 1970) to maximize the posterior probability of each individual boundary classification T_i,

    argmax_{T_i} P(T_i | W).    (3)

This approach minimizes the expected per-boundary classification error rate (Dermatas and Kokkinakis, 1995).

2.2.1. Sentence segmentation

We relied on a hidden-event N-gram language model (LM) (Stolcke and Shriberg, 1996; Stolcke et al., 1998). The states of the HMM consist of the end-of-sentence status of each word (boundary or no-boundary), plus any preceding words and possibly boundary tags to fill up the N-gram context (N = 4 in our experiments). Transition probabilities are given by N-gram probabilities estimated from annotated, boundary-tagged training data using Katz backoff (Katz, 1987). For example, the bigram parameter P(⟨S⟩ | tonight) gives the probability of a sentence boundary following the word ``tonight''. HMM observations consist of only the current word portion of the underlying N-gram state (with emission likelihood 1), constraining the state sequence to be consistent with the observed word sequence.

2.2.2. Topic segmentation

We first constructed 100 individual unigram topic cluster language models, using the multipass k-means algorithm described in (Yamron et al., 1998). We used the pooled topic detection and tracking (TDT) Pilot and TDT-2 training data (Cieri et al., 1999). We removed stories with fewer than 300 or more than 3000 words, leaving 19,916 stories with an average length of 538 words. Then, similar to the Dragon topic segmentation approach (Yamron et al., 1998), we built an HMM in which the states are topic clusters and the observations are sentences. The resulting HMM forms a complete graph, allowing transitions between any two topic clusters. In addition to the basic HMM segmenter, we incorporated two states for modeling the initial and final sentences of a topic segment. We reasoned that this can capture formulaic speech patterns used by broadcast speakers. Likelihoods for the start and end models are obtained as the unigram language model probabilities of the topic-initial and topic-final sentences, respectively, in the training data. Note that single start and end states are shared by all topics, and traversal of the initial and final states is optional in the HMM topology. The topic cluster models work best if whole blocks of words or ``pseudo-sentences'' are evaluated against the topic language models (the likelihoods are otherwise too noisy). We therefore presegment the data stream at pauses exceeding 0.65 seconds (a process we will refer to as ``chopping'').

2.3. Model combination

We expect prosodic and lexical segmentation cues to be partly complementary, so that

⁶ For example, given three sentences s1 s2 s3 and strong evidence that there is a topic boundary between s1 and s3, it is better to output a boundary either before or after s2, but not in both places.


combining both knowledge sources should give superior accuracy over using each source alone. This raises the issue of how the knowledge sources should be integrated. Here, we describe two approaches to model combination that allow the component prosodic and lexical models to be retained without much modification. While this is convenient and computationally efficient, it prevents us from explicitly modeling interactions (i.e., statistical dependence) between the two knowledge sources. Other researchers have proposed model architectures based on decision trees (Heeman and Allen, 1997) or exponential models (Beeferman et al., 1999) that can potentially integrate the prosodic and lexical cues discussed here. In other work (Stolcke et al., 1998; Tür et al., to appear) we have started to study integrated approaches for the segmentation tasks studied here, although preliminary results show that the simple combination techniques are very competitive in practice.

2.3.1. Posterior probability interpolation

Both the prosodic decision tree and the language model (via the forward–backward algorithm) estimate posterior probabilities for each boundary type T_i. We can arrive at a better posterior estimator by linear interpolation,

    P(T_i | W, F) = λ P_LM(T_i | W) + (1 − λ) P_DT(T_i | F_i, W),    (4)

where λ is a weight tuned on held-out data to optimize overall model performance.

2.3.2. Integrated hidden Markov modeling

Our second model combination approach is based on the idea that the HMM used for lexical modeling can be extended to ``emit'' both words and prosodic observations. The goal is to obtain an HMM that models the joint distribution P(W, F, T) of word sequences W, prosodic features F, and hidden boundary types T in a Markov model. With suitable independence assumptions we can then apply the familiar HMM techniques to compute

    argmax_T P(T | W, F)

or

    argmax_{T_i} P(T_i | W, F),

which are now conditioned on both lexical and prosodic cues. We describe this approach for sentence segmentation HMMs; the treatment for topic segmentation HMMs is mostly analogous but somewhat more involved, and is described in detail elsewhere (Tür et al., to appear).

To incorporate the prosodic information into the HMM, we model prosodic features as emissions from the relevant HMM states, with likelihoods P(F_i | T_i, W), where F_i is the feature vector pertaining to potential boundary T_i. For example, an HMM state representing a sentence boundary ⟨S⟩ at the current position would be penalized with the likelihood P(F_i | ⟨S⟩). We do so based on the assumption that prosodic observations are conditionally independent of each other given the boundary types T_i and the words W. Under these assumptions, a complete path through the HMM is associated with the total probability

    P(W, T) Π_i P(F_i | T_i, W) = P(W, F, T),    (5)

as desired.

The remaining problem is to estimate the likelihoods P(F_i | T_i, W). Note that the decision tree estimates posteriors P_DT(T_i | F_i, W). These can be converted to likelihoods using Bayes' rule:

    P(F_i | T_i, W) = P(F_i | W) P_DT(T_i | F_i, W) / P(T_i | W).    (6)

The term P(F_i | W) is a constant for all choices of T_i and can thus be ignored when choosing the most probable one. Next, because our prosodic model is purposely not conditioned on word identities, but only on aspects of W that relate to time alignment, we approximate P(T_i | W) ≈ P(T_i). Instead of explicitly dividing the posteriors, we prefer to downsample the training set to make P(T_i = yes) = P(T_i = no) = 1/2. A beneficial side effect of this approach is that the decision tree models the lower-frequency events (segment boundaries) in greater detail than if presented with the raw, highly skewed class distribution.

When combining probabilistic models of different types, it is advantageous to weight the


contributions of the language models and the prosodic trees relative to each other. We do so by introducing a tunable model combination weight (MCW), and by using P_DT(F_i | T_i, W)^MCW as the effective prosodic likelihoods. The value of MCW is optimized on held-out data.

2.3.3. HMM posteriors as decision tree features

A third approach could be used to combine the language and prosodic models, although for practical reasons we chose not to use it in this work. In this approach, an HMM incorporating only lexical information is used to compute posterior probabilities of boundary types, as described in Section 2.3.1. A prosodic decision tree is then trained, using the HMM posteriors as additional input features. The tree is free to combine the word-based posteriors with prosodic features; it can thus model limited forms of dependence between prosodic and word-based information (as summarized in the posteriors).

A severe drawback of using posteriors in the decision tree, however, is that in our current paradigm the HMM is trained on correct words. In testing, the tree may therefore grossly overestimate the informativeness of word-based posteriors computed from automatic transcriptions. Indeed, we found that on a hidden-event detection task similar to sentence segmentation (Stolcke et al., 1998), this model combination method worked well on true words, but fared worse than the other approaches on recognized words. To remedy the mismatch between training and testing of the combined model, we would have to train, as well as test, on recognized words; this would require computationally intensive processing of a large corpus. For these reasons, we decided not to use HMM posteriors as tree features in the present studies.

2.3.4. Alternative models

A few additional comments are in order regarding our choice of model architectures and possible alternatives. The HMMs used for lexical modeling are likelihood models, i.e., they model the probabilities of observations given the hidden variables (boundary types) to be inferred, while making assumptions about the independence of

the observations given the hidden events. The main virtue of HMMs in our context is that they integrate the local evidence (words and prosodic features) with models of context (the N-gram history) in a very computationally efficient way (for both training and testing). A drawback is that the independence assumptions may be inappropriate and may therefore inherently limit the performance of the model.

The decision trees used for prosodic modeling, on the other hand, are posterior models, i.e., they directly model the probabilities of the unknown variables given the observations. Unlike likelihood-based models, this has the advantage that model training explicitly enhances discrimination between the target classifications (i.e., boundary types), and that input features can be combined easily to model interactions between them. Drawbacks are the sensitivity to skewed class distributions (as pointed out in the previous section), and the fact that it becomes computationally expensive to model interactions between multiple target variables (e.g., adjacent boundaries). Furthermore, input features with large discrete ranges (such as the set of words) present practical problems for many posterior model architectures.

Even for the tasks discussed here, other modeling choices would have been practical, and await comparative study in future work. For example, posterior lexical models (such as decision trees or neural network classifiers) could be used to predict the boundary types from words and prosodic features together, using word-coding techniques developed for tree-based language models (Bahl et al., 1989). Conversely, we could have used prosodic likelihood models, removing the need to convert posteriors to likelihoods. For example, the continuous feature distributions could be modeled with (mixtures of) multidimensional Gaussians (or other types of distributions), as is commonly done for the spectral features in speech recognizers (Digalakis and Murveit, 1994; among others).

2.4. Data

2.4.1. Speech data and annotations

Switchboard data used for sentence segmentation was drawn from a subset of the corpus


(Godfrey et al., 1992) that had been hand-labeled for sentence boundaries (Meteer et al., 1995) by the Linguistic Data Consortium (LDC). Broadcast News data for topic and sentence segmentation was extracted from the LDC's 1997 Broadcast News (BN) release. Sentence boundaries in BN were automatically determined using the MITRE sentence tagger (Palmer and Hearst, 1997), based on capitalization and punctuation in the transcripts. Topic boundaries were derived from the SGML markup of story units in the transcripts. Training of Broadcast News language models for sentence segmentation also used an additional 130 million words of text-only transcripts from the 1996 Hub-4 language model corpus, in which sentence boundaries had been marked by SGML tags.

2.4.2. Training, tuning, and test sets

Table 1 shows the amount of data used for the various tasks. For each task, separate datasets were used for model training, for tuning any free parameters (such as the model combination and posterior interpolation weights), and for final testing. In most cases the language model and the prosodic model components used different amounts of training data.

As is common for speech recognition evaluations on Broadcast News, frequent speakers (such as news anchors) appear in both training and test sets. By contrast, in Switchboard our training and test sets did not share any speakers. In both corpora, the average word count per speaker decreased roughly monotonically with the percentage of speakers included. In particular, the Broadcast News data contained a large number of speakers who contributed very few words. A reasonably meaningful statistic to report for words per speaker is thus a weighted average, or the average number of datapoints by the same speaker. On that measure, the two corpora had similar statistics: 6687.11 for Broadcast News and 7525.67 for Switchboard.

2.4.3. Word recognition

Experiments involving recognized words used the 1-best output from SRI's DECIPHER large-vocabulary speech recognizer. We simplified processing by skipping several of the computationally expensive or cumbersome steps often used for optimum performance, such as acoustic adaptation and multiple-pass decoding. The recognizer performed one bigram decoding pass, followed by a single N-best rescoring pass using a higher-order language model. The Switchboard test set was decoded with a word error rate of 46.7%, using acoustic models developed for the 1997 Hub-5 evaluation (Conversational Speech Recognition Workshop, 1997). The Broadcast News recognizer was based on the 1997 SRI Hub-4 recognizer (Sankar et al., 1998) and had a word error rate of 30.5% on the test set used in our study.

2.4.4. Evaluation metrics

Sentence segmentation performance for true words was measured by boundary classification error, i.e., the percentage of word boundaries labeled with the incorrect class. For recognized

Table 1
Size of speech data sets used for model training and testing for the three segmentation tasks

Task                         Training (LM)                   Training (Prosody)        Tuning                   Test
SWB sentence (transcribed)   1788 sides (1.2M words)         1788 sides (1.2M words)   209 sides (103K words)   209 sides (101K words)
SWB sentence (recognized)    1788 sides (1.2M words)         1788 sides (1.2M words)   12 sides (6K words)      38 sides (18K words)
BN sentence                  103 shows + BN96 (130M words)   93 shows (700K words)     5 shows (24K words)      5 shows (21K words)
BN topic                     TDT + TDT2 (10.7M words)        93 shows (700K words)     10 shows (205K words)    6 shows (44K words)

E. Shriberg et al. / Speech Communication 32 (2000) 127–154


words, we first performed a string alignment of the automatically labeled recognition hypothesis with the reference word string (and its segmentation). Based on this alignment we then counted the number of incorrectly labeled, deleted, and inserted word boundaries, expressed as a percentage of the total number of word boundaries. This metric yields the same result as the boundary classification error rate if the word hypothesis is correct. Otherwise, it includes additional errors from inserted or deleted boundaries, in a manner similar to standard word error scoring in speech recognition. Topic segmentation was evaluated using the metric defined by NIST for the TDT-2 evaluation (Doddington, 1998).
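As a rough illustration of the alignment-based scoring just described, the following sketch is our own simplified reconstruction (not the scoring code used in the study); the `(word, label)` pair format and the use of `difflib` are assumptions made for the example:

```python
# Simplified sketch of alignment-based boundary scoring for recognized
# words. Each transcript is a list of (word, label) pairs, where the
# label marks the boundary class after that word ("S" or "else").
from difflib import SequenceMatcher

def boundary_error(ref, hyp):
    """Percent of reference word boundaries incorrectly labeled,
    deleted, or inserted, after aligning hypothesis to reference."""
    matcher = SequenceMatcher(None,
                              [w for w, _ in ref],
                              [w for w, _ in hyp],
                              autojunk=False)
    errors = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # correctly recognized words: count boundary label mismatches
            errors += sum(ref[i1 + k][1] != hyp[j1 + k][1]
                          for k in range(i2 - i1))
        else:
            # inserted/deleted/substituted words: each affected boundary
            # counts as an error, as in word error scoring
            errors += max(i2 - i1, j2 - j1)
    return 100.0 * errors / len(ref)
```

With a correct word hypothesis this reduces to the plain boundary classification error; recognition errors add insertion and deletion terms, mirroring the description above.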

3. Results and discussion

The following sections describe results from the prosodic modeling approach, for each of our three tasks. The first three sections focus on the tasks individually, detailing the features used in the best-performing tree. For sentence segmentation, we report on trees trained on non-downsampled data, as used in the posterior interpolation approach. For all tasks, including topic segmentation, we also trained downsampled trees for the HMM combination approach. Where both types of trees were used (sentence segmentation), feature usage on downsampled trees was roughly similar to that of the non-downsampled trees, so we describe only the non-downsampled trees. For topic segmentation, the description refers to a downsampled tree.

In each case we then look at results from combining the prosodic information with language model information, for both transcribed and recognized words. Where possible (i.e., in the sentence segmentation tasks), we compare results for the two alternative model integration approaches (combined HMM and interpolation). In the next two sections, we compare results across both tasks and speech corpora. We discuss differences in which types of features are helpful for a task, as well as differences in the relative reduction in error achieved by the different models, using a measure that tries to normalize for the inherent difficulty of each task. Finally, we discuss issues for future work.

3.1. Task 1. Sentence segmentation of Broadcast News data

3.1.1. Prosodic feature usage

The best-performing tree identified six features for this task, which fall into four groups. To summarize the relative importance of the features in the decision tree we use a measure we call ``feature usage'', which is computed as the relative frequency with which that feature or feature class is queried in the decision tree. The measure increments for each sample classified using that feature; features used higher in the tree classify more samples and therefore have higher usage values. The feature usage (by type of feature) was as follows:
· (46%) Pause duration at boundary.
· (42%) Turn/no turn at boundary.
· (11%) F0 difference across boundary.
· (01%) Rhyme duration.
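To make the measure concrete, the computation can be sketched as follows; the `Node` class is a hypothetical stand-in for the actual decision-tree representation, not the authors' implementation:

```python
# Sketch of the "feature usage" measure: each internal node contributes
# the number of samples it classifies to the count for its feature;
# counts are then normalized to relative frequencies.
from collections import Counter

class Node:
    def __init__(self, feature=None, n_samples=0, children=()):
        self.feature = feature        # None for leaf nodes
        self.n_samples = n_samples    # training samples reaching this node
        self.children = children

def feature_usage(root):
    counts = Counter()
    stack = [root]
    while stack:
        node = stack.pop()
        if node.feature is not None:          # internal node: a feature query
            counts[node.feature] += node.n_samples
            stack.extend(node.children)
    total = sum(counts.values())
    return {f: c / total for f, c in counts.items()}
```

Because counts are weighted by the samples reaching each node, features queried near the root classify more samples and receive higher usage values, as described above.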

The main features queried were pause, turn, and F0. To understand whether they behaved in the manner expected based on the descriptive literature, we inspected the decision tree. The tree for this task had 29 leaves; we show the top portion of it in Fig. 5.

The behavior of the features is precisely that expected from the literature. Longer pause durations at the boundary imply a higher probability of a sentence boundary at that location. Speakers exchange turns almost exclusively at sentence boundaries in this corpus, so the presence of a turn boundary implies a sentence boundary. The F0 features all behave in the same way, with more negative values raising the probability of a sentence boundary. These features reflect the log of the ratio of F0 measured within the word (or window) preceding the boundary to the F0 in the word (or window) after the boundary. Thus, more negative values imply a larger pitch reset at the boundary, consistent with what we would expect.
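For illustration, the cross-boundary F0 feature just described amounts to the following (our own sketch; the actual features operate on stylized, speaker-normalized F0):

```python
import math

def f0_log_ratio(f0_before_hz, f0_after_hz):
    """Log ratio of F0 before the boundary to F0 after it.
    A pitch reset (F0 declined before the boundary, jumping back up
    after it) yields a negative value; larger resets are more negative."""
    return math.log(f0_before_hz / f0_after_hz)

f0_log_ratio(140.0, 210.0)   # reset across a sentence boundary: negative
f0_log_ratio(180.0, 175.0)   # smooth continuation: near zero
```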

3.1.2. Error reduction from prosody

Table 2 summarizes the results on both transcribed and recognized words, for various sentence



segmentation models for this corpus. The baseline (or ``chance'') performance for true words in this task is 6.2% error, obtained by labeling all locations as non-boundaries (the most frequent class). For recognized words, chance error is considerably higher; this is due to the non-zero lower bound that results when one accounts for locations in which the 1-best hypothesis boundaries do not coincide with those of the reference alignment. ``Lower bound'' gives the lowest segmentation error rate possible given the word boundary mismatches due to recognition errors.

Results show that the prosodic model alone performs better than a word-based language model, despite the fact that the language model was trained on a much larger data set. Furthermore, the prosodic model is somewhat more robust to errorful recognizer output than the language model, as measured by the absolute increase in error rate in each case. Most importantly, a statistically significant error reduction is achieved by combining the prosodic features with the lexical features, for both integration methods. The relative error reduction is 19% for true words,

Fig. 5. Top levels of decision tree selected for the Broadcast News sentence segmentation task. Nodes contain the percentage of ``else'' and ``S'' (sentence) boundaries, respectively, and are labeled with the majority class. PAU_DUR = pause duration, F0s = stylized F0 feature reflecting the ratio of F0 in the speech before the boundary to that after the boundary, in the log domain.

Table 2
Results for sentence segmentation on Broadcast News (a)

Model                       Transcribed words   Recognized words
LM only (130M words)        4.1                 11.8
Prosody only (700K words)   3.6                 10.9
Interpolated                3.5                 10.8
Combined HMM                3.3                 11.7
Chance                      6.2                 13.3
Lower bound                 0.0                 7.9

(a) Values are word boundary classification error rates (in percent).



and 8.5% for recognized words. This is true even though both models contained turn information, thus violating the independence assumption made in the model combination.

3.1.3. Performance without F0 features

A natural question about the prosodic features is how the model would perform without any F0 features. Unlike pause, turn, and duration information, the F0 features used here are not typically extracted or computed in most ASR systems. We ran comparison experiments on all conditions, removing all F0 features from the input to the feature selection algorithm. Results are shown in Table 3, along with the previous results using all features, for comparison.

As shown, removing F0 features reduces model accuracy for prosody alone, for both true and recognized words. In the case of true words, model integration using the no-F0 prosodic tree actually fares slightly better than integration using all features, despite similar model combination weights in the two cases. The effect is only marginally significant in a Sign test, so it may indicate chance variation. However, it could also indicate a higher degree of correlation between true words and the boundary-indicating prosodic features when F0 is included. For recognized words, in contrast, the model with all prosodic features is superior to that without the F0 features, both alone and after integration with the language model.

3.2. Task 2. Sentence segmentation of Switchboard data

3.2.1. Prosodic feature usage

Switchboard sentence segmentation made use of a markedly different distribution of features than observed for Broadcast News. For Switchboard, the best-performing tree found by the feature selection algorithm had a feature usage as follows:
· (49%) Phone and rhyme duration preceding boundary.
· (18%) Pause duration at boundary.
· (17%) Turn/no turn at boundary.
· (15%) Pause duration at previous word boundary.
· (01%) Time elapsed in turn.

Clearly, the primary feature type used here is preboundary duration, a measure that was used only a scant 1% of the time for the same task in news speech. Pause duration at the boundary was also useful, but not to the degree found for Broadcast News.

Of course, it should be noted in comparing feature usage across corpora and tasks that results here pertain to comparisons of the most parsimonious, best-performing model for each corpus and task. That is, we do not mean to imply that an

Table 3

Results for sentence segmentation on Broadcast News, with and without F0 featuresa

Model Transcribed words Recognized words

LM only (130M words) 4.1 11.8

All prosody features:

Prosody only (700K words) 3.6 10.9

Prosody�LM: combined HMM 3.3

Prosody�LM: interpolation 10.8

No F0 features:

Prosody only (700K words) 3.8 11.3

Prosody�LM: combined HMM 3.2

Prosody�LM: interpolation 11.1

Chance 6.2 13.3

Lower bound 0.0 7.9

a Values are word boundary classi®cation error rates (in percent). For the integrated (``Prosody�LM'') models, results are given for

optimal model only (combined HMM for true words, interpolation of posteriors for recognized words.)



individual feature such as preboundary duration is not useful in Broadcast News, but rather that the minimal and most successful model for that corpus makes little use of that feature (because it can make better use of other features). Thus, it cannot be inferred from these results that some feature not heavily used in the minimal model is not helpful. The feature may be useful on its own; however, it is not as useful as some other feature(s) made available in this study. (7)

The two ``pause'' features are not grouped together, because they represent fundamentally different phenomena. The second pause feature essentially captured the boundaries after one-word utterances such as ``uh-huh'' and ``yeah'', which for this work had been marked as followed by sentence boundaries (``yeah ⟨S⟩ i know what you mean''). (8) The previous pause in this case was time that the speaker had spent listening to the other speaker (channels were recorded separately and recordings were continuous on both sides). Since one-word backchannels (acknowledgments such as ``uh-huh'') and other short dialogue acts make up a large percentage of sentence boundaries in this corpus, the feature is used fairly often. The turn features also capture similar phenomena related to turn-taking. The leaf count for this tree was 236, so we display only the top portion of the tree in Fig. 6.

Pause and turn information, as expected, suggested sentence boundaries. Most interesting about this tree was the consistent behavior of duration features, which gave higher probability to a sentence boundary when lengthening of phones or rhymes was detected in the word preceding the boundary. Although this is in line with descriptive studies of prosody, it was rather remarkable to us that duration would work at all, given the casual style and speaker variation in this corpus, as well as the somewhat noisy forced alignments for the prosodic model training.

3.2.2. Error reduction from prosody

Unlike the previous results for the same task on Broadcast News, we see in Table 4 that for Switchboard data, prosody alone is not a particularly good model. For transcribed words it is considerably worse than the language model; however, this difference is reduced for the case of recognized words (where the prosody shows less degradation than the language model).

Yet, despite the poor performance of prosody alone, combining prosody with the language model resulted in a statistically significant improvement over the language model alone (7.0% and 2.6% relative for true and recognized words, respectively). All differences were statistically significant, including the difference in performance between the two model integration approaches. Furthermore, the pattern of results for model combination approaches observed for Broadcast News holds as well: the combined HMM is superior for the case of transcribed words, but suffers more than the interpolation approach when applied to recognized words.

3.3. Task 3. Topic segmentation of Broadcast News data

3.3.1. Prosodic feature usage

The feature selection algorithm determined five feature types most helpful for this task:
· (43%) Pause duration at boundary.
· (36%) F0 range.
· (09%) Turn/no turn at boundary.
· (07%) Speaker gender.
· (05%) Time elapsed in turn.

The results are somewhat similar to those seen earlier for sentence segmentation in Broadcast News, in that pause, turn, and F0 information are the top features. However, the feature usage here differs considerably from that for the sentence segmentation task, in that here we see a much higher use of F0 information.

(7) One might propose a more thorough investigation by reporting performance for one feature at a time. However, we found in examining such results that typically our features required the presence of one or more additional features in order to be helpful. (For example, pitch features required the presence of the pause feature.) Given the large number of features used, the number of potential combinations becomes too large to report on fully here.

(8) ``Utterance'' boundary is probably a better term, but for consistency we use the term ``sentence'' boundary for these dialogue act boundaries as well.



Fig. 6. Top levels of decision tree selected for the Switchboard sentence segmentation task. Nodes contain the percentage of ``S'' (sentence) and ``else'' boundaries, respectively, and are labeled with the majority class. ``PAU_DUR'' = pause duration, ``RHYM'' = syllable rhyme. VOWEL, PHONE and RHYME features apply to the word before the boundary.

Fig. 7. Top levels of decision tree selected for the Broadcast News topic segmentation task. Nodes contain the percentage of ``else'' and ``TOPIC'' boundaries, respectively, and are labeled with the majority class.



Furthermore, the most important F0 feature was a range feature (log ratio of the preceding word's F0 to the speaker's F0 baseline), which was used 2.5 times more often in the tree than the F0 feature based on the difference across the boundary. The range feature does not require information about F0 on the other side of the boundary; thus, it could be applied regardless of whether there was a speaker change at that location. This was a much more important issue for topic segmentation than for sentence segmentation, since the percentage of speaker changes is higher in the former than in the latter.

It should be noted, however, that the importance of pause duration is underestimated. As explained earlier, pause duration was also used prior to tree building, in the chopping process. The decision tree was applied only to boundaries exceeding a certain duration. Since the duration threshold was found by optimizing for the TDT error criterion, which assigns greater weight to false alarms than to false rejections, the resulting pause threshold is quite high (over half a second). Separate experiments using boundaries below our chopping threshold show that trees distinguish much shorter pause durations for segmentation decisions, implying that prosody could potentially yield an even larger relative advantage for error metrics favoring a shorter chopping threshold.

Inspecting the tree in Fig. 7 (the tree has additional leaves; we show only the top of it), we find that it is easily interpretable and consistent with prosodic descriptions of topic or paragraph boundaries. Boundaries are indicated by longer pauses and by turn information, as expected. Note that the pause thresholds are considerably higher than those used for the sentence tree. This is as expected, because of the larger units used here, and due to the prior chopping at long pause boundaries for this task.

Most of the rest of the tree uses F0 information, in two ways. The most useful F0 range feature, F0s_LR_MEAN_KBASELN, computes the log of the ratio of the mean F0 in the last word to the speaker's estimated F0 baseline. As shown, lower values favor topic boundaries, which is consistent with speakers dropping to the bottom of their pitch ranges at the ends of topic units. The other F0 feature reflects the height of the last word relative to a speaker's estimated F0 range; smaller values thus indicate that a speaker is closer to his or her F0 floor, and, as would be predicted, imply topic boundaries.
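As a sketch of how such a range feature can be computed (our own reconstruction in the spirit of F0s_LR_MEAN_KBASELN; the actual baseline estimation is more involved):

```python
import math

def f0_log_ratio_to_baseline(mean_f0_hz, baseline_f0_hz):
    """Log ratio of the last word's mean F0 to the speaker's estimated
    F0 baseline. Values near or below zero indicate the speaker is at
    the bottom of the pitch range, which favors a topic boundary."""
    return math.log(mean_f0_hz / baseline_f0_hz)

# A speaker with a ~100 Hz baseline ending a word at 102 Hz is near
# the pitch floor; at 140 Hz, well above it.
f0_log_ratio_to_baseline(102.0, 100.0)   # close to zero
f0_log_ratio_to_baseline(140.0, 100.0)   # clearly positive
```

Unlike the cross-boundary ratio, this feature needs no F0 information from the far side of the boundary, which is why it applies even across speaker changes.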

The speaker-gender feature was used in the tree in a pattern that at first suggested to us a potential problem with our normalizations. It was repeatedly used immediately after conditioning on the F0 range feature F0s_LR_MEAN_KBASELN. However, inspection of the feature value distributions by gender and by boundary class suggested that this was not a problem with normalization, as shown in Fig. 8.

As indicated, there was no difference by gender in the distribution of F0 values for the feature in the case of boundaries not containing a topic

Fig. 8. Normalized distribution of F0 range feature (F0s_LR_MEAN_KBASELN) for male and female speakers for topic and non-topic boundaries in Broadcast News.

Table 4
Results for sentence segmentation on Switchboard (a)

Model          Transcribed words   Recognized words
LM only        4.3                 22.8
Prosody only   6.7                 22.9
Interpolated   4.1                 22.2
Combined HMM   4.0                 22.5
Chance         11.0                25.8
Lower bound    0.0                 17.6

(a) Values are word boundary classification error rates (in percent).



change. After normalization, both men and women ended non-topic boundaries in similar regions above their baselines. Since non-topic boundaries are by far the more frequent class (distributions in the histogram are normalized), the majority of boundaries in the data show no difference on this measure by gender. For topic boundaries, however, the women in a sense behave more ``neatly'' than the men. As a group, the women have a tighter distribution, ending topics at F0 values that are centered closely around their F0 baselines. Men, on the other hand, are as a group somewhat less ``well-behaved'' in this regard. They often end topics below their F0 baselines, and show a wider distribution (although it should also be noted that since these are aggregate distributions, the wider distribution for men could reflect either within-speaker or cross-speaker variation).

This difference is unlikely to be due to baseline estimation problems, since the non-topic distributions show no difference. The variance difference is also not explained by a difference in sample size, since that factor would predict an effect in the opposite direction. One possible explanation is that men are more likely than women to produce regions of non-modal voicing (such as creak) at the ends of topic units; this awaits further study. In addition, we noted that non-topic pauses (i.e., chopping boundaries) are much more likely to occur in male than in female speech, a phenomenon that could have several causes. For example, it could be that male speakers in Broadcast News are assigned longer topic segments on average, or that male speakers are more prone to pausing in general, or that males dominate the spontaneous speech portions where pausing is naturally more frequent. This finding, too, awaits further analysis.

3.3.2. Error reduction from prosody

Table 5 shows results for segmentation into topics in Broadcast News speech. All results reflect the word-averaged, weighted error metric used in the TDT-2 evaluations (Doddington, 1998). Chance here corresponds to outputting the ``no boundary'' class at all locations, meaning that the false alarm rate will be zero, and the miss rate will be 1. Since the TDT metric assigns a weight of 0.7

to false alarms, and 0.3 to misses, chance in this case will be 0.3.
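In simplified form, the weighted cost described above can be written as follows (our own sketch, which ignores the word-averaging details of the official metric):

```python
def tdt_segmentation_cost(p_false_alarm, p_miss, c_fa=0.7, c_miss=0.3):
    """Simplified form of the TDT-2 weighted segmentation cost:
    false alarms weighted 0.7, misses weighted 0.3. The official
    metric additionally averages over words (Doddington, 1998)."""
    return c_fa * p_false_alarm + c_miss * p_miss

tdt_segmentation_cost(0.0, 1.0)   # chance (never output a boundary) → 0.3
```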

As shown, the error rate for the prosody model alone is lower than that for the language model. Furthermore, combining the models yields a significant improvement. Using the combined model, the error rate decreased by 27.3% relative to the language model for correct words, and by 24.2% for recognized words.

3.3.3. Performance without F0 features

As in the earlier case of Broadcast News sentence segmentation, since this task made use of F0 features, we asked how well it would fare without them. The experiments were conducted only for true words, since as shown previously in Table 5, results are similar to those for recognized words. Results, shown in Table 6, indicate a significant degradation in performance when the F0 features are removed.

3.4. Comparisons of error reduction across conditions

To compare performance of the prosodic, language, and combined models directly across tasks

Table 5
Results for topic segmentation on Broadcast News (a)

Model          Transcribed words   Recognized words
LM only        0.1895              0.1897
Prosody only   0.1657              0.1731
Combined HMM   0.1377              0.1438
Chance         0.3                 0.3

(a) Values indicate the TDT weighted segmentation cost metric.

Table 6
Results for topic segmentation on Broadcast News, with and without F0 features (a)

Model                     Transcribed words
LM only                   0.1895
Combined HMM:
  All prosodic features   0.1377
  No F0 features          0.1511
Chance                    0.3

(a) Values indicate the TDT weighted segmentation cost metric.



and corpora, it is necessary to normalize over three sources of variation. First, our conditions differ in chance performance (since the percentage of boundaries that correspond to a sentence or topic change differs across tasks and corpora). Second, the upper bound on accuracy in the case of imperfect word recognition depends on both the word error rate of the recognizer for the corpus, and the task. Third, the (standard) metric we have used to evaluate topic boundary detection differs from the straight accuracy metric used to assess sentence boundary detection.

A meaningful metric for comparing results directly across tasks is the percentage of the chance error that remains after application of the modeling. This measure takes into account the different chance values, as well as the ceiling effect on accuracy due to recognition errors. Thus, a model with a score of 1.0 does no better than chance for that task, since 100% of the error associated with chance performance remains after the modeling. A model with a score close to 0.0 is a nearly ``perfect'' model, since it eliminates nearly all the chance error. Note that in the case of recognized words, this amounts to an error rate at the lower bound rather than at zero.
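Concretely, the measure can be computed as follows (our reading of the normalization; the lower bound is zero for transcribed words):

```python
def chance_error_remaining(model_error, chance_error, lower_bound=0.0):
    """Fraction of the chance error remaining after modeling:
    1.0 = no better than chance, 0.0 = all removable error eliminated.
    For recognized words the recognition-induced lower bound is
    subtracted from both terms."""
    return (model_error - lower_bound) / (chance_error - lower_bound)

# Broadcast News sentence segmentation figures from Table 2:
chance_error_remaining(4.1, 6.2)          # LM, transcribed: ~0.66
chance_error_remaining(11.8, 13.3, 7.9)   # LM, recognized: ~0.72
```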

In Fig. 9, performance on the relative error metric is plotted by task/corpus, reliability of word cues (ASR or reference transcript), and model. In the case of the combined model, the plotted value reflects performance for whichever of the two combination approaches (HMM or interpolation) yielded best results for that condition.

Useful cross-condition comparisons can be summarized. For all tasks and as expected, performance suffers for recognized words compared with transcribed words. For the sentence segmentation tasks, the prosodic model degrades less on recognized words relative to true words than the word-based models do. The topic segmentation results based on language model information show remarkable robustness to recognition errors, much more so than sentence segmentation. This can be noted by comparing the large loss in performance from reference to ASR word cues for the language model in the two sentence tasks with the near-identical performance of reference and ASR words in the case of the topic task. The pattern of results can be attributed to the different character of the language models used. Sentence segmentation uses a higher-order N-gram that is sensitive to specific words around a potential boundary, whereas topic segmentation is based on bag-of-words models that are inherently robust to individual word errors.

Another important finding made visible in Fig. 9 is that the performance of the language model alone on Switchboard transcriptions is unusually

Fig. 9. Percentage of chance error remaining after application of model (allows performance to be directly compared across tasks). BN = Broadcast News, SWB = Switchboard, ASR = 1-best recognition hypothesis, ref = transcribed words, LM = language model only, Pros = prosody model only, Comb = combination of language and prosody models.



good, when compared with the performance of the language model alone for all other conditions (including the corresponding condition for Broadcast News). This advantage for Switchboard completely disappears on recognized words. While researchers typically have found Switchboard a difficult corpus to process, in the case of sentence segmentation on true words it is just the opposite: atypically easy. Thus, previous work on automatic segmentation of Switchboard transcripts (Stolcke and Shriberg, 1996) is likely to overestimate success for other corpora. The Switchboard sentence segmentation advantage is due in large part to the high rate of a small number of words that occur sentence-initially (especially I, discourse markers, backchannels, coordinating conjunctions, and disfluencies).

Finally, a potentially interesting pattern can be seen when comparing the two alternative model combination approaches (integrated HMM, or interpolation) for the sentence segmentation task. (9) Only the best-performing model combination approach for each condition (ASR or reference words) is noted in Fig. 9; however, the complete set of results is inferrable from Tables 2 and 4. As indicated in the tables, the same general pattern obtained for both corpora. The integrated HMM was the better approach on true words, but it fared relatively poorly on recognized words. The posterior interpolation, on the other hand, yielded smaller, but consistent improvements over the individual knowledge sources on both true and recognized words. The pattern deserves further study, but one possible explanation is that the integrated HMM approach as we have implemented it assumes that the prosodic features are independent of the words. Recognition errors, however, will tend to affect both words (by definition) and prosodic features through incorrect alignments. This will cause the two types of observations to be correlated, violating the independence assumption.

3.5. General discussion and future work

There are a number of ways in which the studies just described could be improved and extended in future work. One issue for the prosodic modeling is that currently all of our features come from a small window around the potential boundary. It is possible that prosodic properties spanning a longer range could convey additional useful information. A second likely source of improvement would be to utilize information about lexical stress and syllable structure in defining features (for example, to better predict the domain of prefinal lengthening). Third, additional features should be investigated; in particular it would be worthwhile to examine energy-related features if effective normalization of channel and speaker characteristics could be achieved. Fourth, our decision tree models might be improved by using alternative algorithms to induce combinations of our basic input features. This could result in smaller and/or better-performing trees. Finally, as mentioned earlier, testing on recognized words involved a fundamental mismatch with respect to model training, where only true words were used. This mismatch worked against us, since the (fair) testing on recognized words used prosodic models that had been optimized for alignments from true words. Full retraining of all model components on recognized words would be an ideal (albeit presently expensive) solution to this problem.

Comparisons between the two speech styles in terms of prosodic feature usage would benefit from a study in which factors such as speaker overlap in train and test data, and the sound quality of recordings, are more closely controlled across corpora. As noted earlier, Broadcast News had an advantage over Switchboard in terms of speaker consistency, since, as is typical in speech recognition evaluations on news speech, it included speaker overlap in training and testing. This factor may have contributed to more robust performance for features dependent on good speaker normalization, particularly for the F0 features, which used an estimate of the speaker's baseline pitch. It is also not yet clear to what extent performance for certain features is affected by factors such as recording quality and bandwidth,

(9) The interpolated model combination is not possible for topic segmentation, as explained earlier.



versus aspects of the speaking style itself. For example, it is possible that a high-quality, full-bandwidth recording of Switchboard-style speech would show a greater use of prosodic features than found here.

An added area for further study is to adapt prosodic or language models to the local context. For example, Broadcast News exhibits an interesting variety of shows, speakers, speaking styles, and acoustic conditions. Our current models contain only very minimal conditioning on these local properties. However, we have found in other work that tuning the topic segmenter to the type of broadcast show provided significant improvement (Tür et al., to appear). The sentence segmentation task could also benefit from explicit modeling of speaking style. For example, our results show that both lexical and prosodic sentence segmentation cues differ substantially between spontaneous and planned speech. Finally, results might be improved by taking advantage of speaker-specific information (i.e., behaviors or tendencies beyond those accounted for by the speaker-specific normalizations included in the prosodic modeling). Initial experiments suggest we did not have enough training data per speaker for an investigation of speaker-specific modeling; however, this could be made possible through additional data or through smoothing approaches that adapt global models to speaker-specific ones.

More sophisticated model combination approaches that explicitly model interactions of lexical and prosodic features offer much promise for future improvements. Two candidate approaches are the decision trees based on unsupervised hierarchical word clustering of Heeman and Allen (1997), and the feature selection approach for exponential models (Beeferman et al., 1999). As shown in Stolcke and Shriberg (1996), and similar to Heeman and Allen (1997), it is likely that the performance of our segmentation language models would be improved by moving to an approach based on word classes.
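As a concrete illustration of the posterior-interpolation combination used in this work, the following sketch blends a prosodic and a lexical posterior estimate for a boundary event. The function name and example weights are illustrative, not from the paper.

```python
# Hedged sketch of interpolating posterior probability estimates from a
# prosodic model and a lexical model; names and weights are illustrative.

def interpolate_posteriors(p_prosody, p_lexical, lam=0.5):
    """P(boundary) = lam * P_prosody + (1 - lam) * P_lexical.
    The weight lam is typically tuned on held-out data."""
    return lam * p_prosody + (1.0 - lam) * p_lexical

# Decide a boundary by thresholding the combined posterior:
combined = interpolate_posteriors(0.9, 0.3, lam=0.6)
is_boundary = combined > 0.5
```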

Finally, the approach developed here could be extended to other languages, as well as to other tasks. As noted in Section 1.3, prosody is used across languages to convey information units (e.g.,

Vaissière, 1983, among others). While there is broad variation across languages in the manner in which information related to item salience (accentuation and prominence) is conveyed, there are similarities in many of the features used to convey boundaries. Such universals include pausing, pitch declination (gradual lowering of F0 valleys throughout both sentences and paragraphs), and amplitude and F0 resets at the beginnings of major units. One could thus potentially extend this approach to a new language. The prosodic features would differ, but it is expected that for many languages, similar basic raw features of pausing, duration, and pitch can be effective in segmentation tasks. In a similar vein, although prosodic features depend on the type of events one is trying to detect, the general approach could be extended to tasks beyond sentence and topic segmentation (see, for example, Hakkani-Tür et al., 1999; Shriberg et al., 1998).

4. Summary and conclusion

We have studied the use of prosodic information for sentence and topic segmentation, both of which are important tasks for information extraction and archival applications. Prosodic features reflecting pause durations, suprasegmental durations, and pitch contours were automatically extracted, regularized, and normalized. They required no hand-labeling of prosody; rather, they were based solely on time alignment information (either from a forced alignment or from recognition hypotheses).
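The kind of alignment-based feature extraction summarized above might be sketched as follows. The `Word` record, the specific features, and the baseline-pitch normalization shown here are simplified assumptions, not the paper's exact feature set.

```python
# Minimal sketch of deriving inter-word boundary features from time
# alignments. The Word record and normalization are illustrative.

from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float    # seconds, from forced alignment or recognizer output
    end: float
    mean_f0: float  # mean F0 over the word, in Hz

def boundary_features(words, baseline_f0):
    """For each inter-word boundary, compute the pause duration and the
    preceding word's F0 normalized by the speaker's baseline pitch."""
    feats = []
    for prev, nxt in zip(words, words[1:]):
        feats.append({
            "pause_dur": nxt.start - prev.end,
            "norm_f0": prev.mean_f0 / baseline_f0,
        })
    return feats
```

Normalizing F0 by a per-speaker baseline, as in the last feature, is one way to make pitch values comparable across speakers.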

The features were used as inputs to a decision tree model, which predicted the appropriate segment boundary type at each inter-word boundary. We compared the performance of these prosodic predictors to that of statistical language models capturing lexical correlates of segment boundaries, as well as to combined models integrating both lexical and prosodic information. Two knowledge source integration approaches were investigated: one based on interpolating posterior probability estimators, and the other using a combined HMM that emitted both lexical and prosodic observations.
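To make the decision-tree idea concrete, the toy function below finds the single best threshold question on pause duration, the kind of split a CART tree (Breiman et al., 1984) asks at each node. A real tree grows many such splits over many features; the data and function name here are illustrative.

```python
# Toy illustration (stdlib only) of a single CART-style split: find the
# pause-duration threshold that best separates boundary from non-boundary
# examples by classification accuracy.

def best_pause_split(samples):
    """samples: list of (pause_dur, is_boundary) pairs.
    Returns (threshold, accuracy) for the best split of the form
    'predict boundary iff pause_dur >= threshold'."""
    candidates = sorted({p for p, _ in samples})
    best_thr, best_acc = None, -1.0
    for thr in candidates:
        correct = sum((p >= thr) == b for p, b in samples)
        acc = correct / len(samples)
        if acc > best_acc:
            best_thr, best_acc = thr, acc
    return best_thr, best_acc
```

A full CART implementation also chooses among competing features (pause, duration, F0, etc.) and recurses on each side of the split; this stump captures only the core threshold question.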



Results showed that on Broadcast News the prosodic model alone performed as well as (or even better than) purely word-based statistical language models, for both true and automatically recognized words. The prosodic model achieved comparable performance with significantly less training data, and often degraded less due to recognition errors. Furthermore, for all tasks and corpora, we obtained a significant improvement over word-only models using one or both of our combined models. Interestingly, the integrated HMM worked best on transcribed words, while the posterior interpolation approach was much more robust in the case of recognized words.

Analysis of the prosodic decision trees revealed that the models capture language-independent boundary indicators described in the literature, such as preboundary lengthening, boundary tones, and pitch resets. Consistent with descriptive work, larger breaks such as topics showed features similar to those of sentence breaks, but with more pronounced pause and intonation patterns. Feature usage, however, was corpus dependent. While features such as pauses were heavily used in both corpora, we found that pitch is a highly informative feature in Broadcast News, whereas duration and word cues dominated in Switchboard. We conclude that prosody provides rich and complementary information to lexical information for the detection of sentence and topic boundaries in different speech styles, and that it can therefore play an important role in the automatic segmentation of spoken language.

Acknowledgements

We thank Kemal Sönmez for providing the model for F0 stylization used in this work; Rebecca Bates, Mari Ostendorf, Ze'ev Rivlin, Ananth Sankar and Kemal Sönmez for invaluable assistance in data preparation and discussions; Madelaine Plauché for hand-checking of F0 stylization output and regions of non-modal voicing; and Klaus Ries, Paul Taylor and an anonymous reviewer for helpful comments on earlier drafts. This research was supported by DARPA under contract no. N66001-97-C-8544

and by NSF under STIMULATE grant IRI-9619921. The views herein are those of the authors and should not be interpreted as representing the policies of the funding agencies.

References

Allan, J., Carbonell, J., Doddington, G., Yamron, J., Yang, Y., 1998. Topic detection and tracking pilot study: Final report. In: Proceedings DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, February. Morgan Kaufmann, Los Altos, CA, pp. 194–218.

Bahl, L.R., Brown, P.F., de Souza, P.V., Mercer, R.L., 1989. A tree-based statistical language model for natural language speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing 37 (7), 1001–1008.

Baum, L.E., Petrie, T., Soules, G., Weiss, N., 1970. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics 41 (1), 164–171.

Beeferman, D., Berger, A., Lafferty, J., 1999. Statistical models for text segmentation. Machine Learning 34 (1–3), 177–210 (Special Issue on Natural Language Learning).

Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Wadsworth and Brooks, Pacific Grove, CA.

Brown, G., Currie, K., Kenworthy, J., 1980. Questions of Intonation. Croom Helm, London.

Bruce, G., 1982. Textual aspects of prosody in Swedish. Phonetica 39, 274–287.

Buntine, W., Caruana, R., 1992. Introduction to IND Version 2.1 and Recursive Partitioning. NASA Ames Research Center, Moffett Field, CA.

Cieri, C., Graff, D., Liberman, M., Martey, N., Strassell, S., 1999. The TDT-2 text and speech corpus. In: Proceedings DARPA Broadcast News Workshop, Herndon, VA, February. Morgan Kaufmann, Los Altos, CA, pp. 57–60.

Conversational Speech Recognition Workshop DARPA Hub-5E Evaluation, Baltimore, MD, May 1997.

Dermatas, E., Kokkinakis, G., 1995. Automatic stochastic tagging of natural language texts. Computational Linguistics 21 (2), 137–163.

Digalakis, V., Murveit, H., 1994. GENONES: An algorithm for optimizing the degree of tying in a large vocabulary hidden Markov model based speech recognizer. In: Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing, Adelaide, Australia. Vol. 1, pp. 537–540.

Doddington, G., 1998. The Topic Detection and Tracking Phase 2 (TDT2) evaluation plan. In: Proceedings DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, February. Morgan Kaufmann, pp. 223–229 (Revised version available from http://www.nist.gov/speech/tdt98/tdt98.htm).



Entropic Research Laboratory, 1993. ESPS Version 5.0 Programs Manual, Washington, DC, August.

Godfrey, J.J., Holliman, E.C., McDaniel, J., 1992. SWITCHBOARD: Telephone speech corpus for research and development. In: Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing, San Francisco, March. Vol. 1, pp. 517–520.

Graff, D., 1997. The 1996 Broadcast News speech and language-model corpus. In: Proceedings DARPA Speech Recognition Workshop, Chantilly, VA, February. Morgan Kaufmann, Los Altos, CA, pp. 11–14.

Grosz, B., Hirschberg, J., 1992. Some intonational characteristics of discourse structure. In: Ohala, J.J., Nearey, T.M., Derwing, B.L., Hodge, M.M., Wiebe, G.E. (Eds.), Proceedings of the International Conference on Spoken Language Processing, Banff, Canada, October. Vol. 1, pp. 429–432.

Hakkani-Tür, D., Tür, G., Stolcke, A., Shriberg, E., 1999. Combining words and prosody for information extraction from speech. In: Proceedings of the Sixth European Conference on Speech Communication and Technology, Budapest, September. Vol. 5, pp. 1991–1994.

Hearst, M.A., 1997. TextTiling: segmenting text into multi-paragraph subtopic passages. Computational Linguistics 23 (1), 33–64.

Heeman, P., Allen, J., 1997. Intonational boundaries, speech repairs, and discourse markers: modeling spoken dialog. In: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, Madrid, July.

Hirschberg, J., Nakatani, C., 1996. A prosodic analysis of discourse segments in direction-giving monologues. In: Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, Santa Cruz, CA, June. pp. 286–293.

Katz, S.M., 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing 35 (3), 400–401.

Koopmans-van Beinum, F.J., van Donzel, M.E., 1996. Relationship between discourse structure and dynamic speech rate. In: Bunnell, H.T., Idsardi, W. (Eds.), Proceedings of the International Conference on Spoken Language Processing, Vol. 3, Philadelphia, October, pp. 1724–1727.

Kozima, H., 1993. Text segmentation based on similarity between words. In: Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, Ohio State University, Columbus, Ohio, June. pp. 286–288.

Kubala, F., Schwartz, R., Stone, R., Weischedel, R., 1998. Named entity extraction from speech. In: Proceedings DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, February. Morgan Kaufmann, Los Altos, CA, pp. 287–292.

Lehiste, I., 1979. Perception of sentence and paragraph boundaries. In: Lindblom, B., Öhman, S. (Eds.), Frontiers of Speech Communication Research. Academic, London, pp. 191–201.

Lehiste, I., 1980. The phonetic structure of paragraphs. In: Nooteboom, S., Cohen, A. (Eds.), Structure and Process in Speech Perception. Springer, Berlin, pp. 195–206.

Liu, D., Kubala, F., 1999. Fast speaker change detection for Broadcast News transcription and indexing. In: Proceedings of the Sixth European Conference on Speech Communication and Technology, Budapest, September. Vol. 3, pp. 1031–1034.

LVCSR Hub-5 Workshop, Linthicum Heights, MD, June 1999.

Meteer, M., Taylor, A., MacIntyre, R., Iyer, R., 1995. Dysfluency annotation stylebook for the Switchboard corpus. Distributed by LDC, ftp://ftp.cis.upenn.edu/pub/treebank/swbd/doc/DFL-book.ps, February (Revised June 1995 by Ann Taylor).

Nakajima, S., Tsukada, H., 1997. Prosodic features of utterances in task-oriented dialogues. In: Sagisaka, Y., Campbell, N., Higuchi, N. (Eds.), Computing Prosody: Computational Models for Processing Spontaneous Speech. Springer, New York, Chapter 7, pp. 81–94.

Palmer, D.D., Hearst, M.A., 1997. Adaptive multilingual sentence boundary disambiguation. Computational Linguistics 23 (2), 241–267.

Przybocki, M.A., Martin, A.F., 1999. The 1999 NIST speaker recognition evaluation, using summed two-channel telephone data for speaker detection and speaker tracking. In: Proceedings of the Sixth European Conference on Speech Communication and Technology, Budapest. Vol. 5, pp. 2215–2218.

Sankar, A., Weng, F., Rivlin, Z., Stolcke, A., Gadde, R.R., 1998. The development of SRI's 1997 Broadcast News transcription system. In: Proceedings DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, February. Morgan Kaufmann, Los Altos, CA, pp. 91–96.

Shriberg, E., 1999. Phonetic consequences of speech disfluency. In: Proceedings of the XIVth International Congress of Phonetic Sciences, San Francisco. pp. 619–622.

Shriberg, E., Bates, R., Stolcke, A., 1997. A prosody-only decision-tree model for disfluency detection. In: Kokkinakis, G., Fakotakis, N., Dermatas, E. (Eds.), Proceedings of the Fifth European Conference on Speech Communication and Technology, Rhodes, Greece, September. Vol. 5, pp. 2383–2386.

Shriberg, E., Bates, R., Stolcke, A., Taylor, P., Jurafsky, D., Ries, K., Coccaro, N., Martin, R., Meteer, M., van Ess-Dykema, C., 1998. Can prosody aid the automatic classification of dialog acts in conversational speech? Language and Speech 41 (3–4), 439–487.

Silverman, K., 1987. The structure and processing of fundamental frequency contours. Ph.D. thesis, Cambridge University, Cambridge, UK.

Sluijter, A., Terken, J., 1994. Beyond sentence prosody: paragraph intonation in Dutch. Phonetica 50, 180–188.

Sönmez, K., Shriberg, E., Heck, L., Weintraub, M., 1998. Modeling dynamic prosodic variation for speaker verification. In: Mannell, R.H., Robert-Ribes, J. (Eds.), Proceedings of the International Conference on Spoken Language Processing, Sydney, December. Australian Speech Science and Technology Association, Vol. 7, pp. 3189–3192.

Sönmez, K., Heck, L., Weintraub, M., 1999. Speaker tracking and detection with multiple speakers. In: Proceedings of the Sixth European Conference on Speech Communication and Technology, Budapest. Vol. 5, pp. 2219–2222.

Stolcke, A., Shriberg, E., 1996. Automatic linguistic segmentation of conversational speech. In: Bunnell, H.T., Idsardi, W. (Eds.), Proceedings of the International Conference on Spoken Language Processing, Philadelphia. Vol. 2, pp. 1005–1008.

Stolcke, A., Shriberg, E., Bates, R., Ostendorf, M., Hakkani, D., Plauché, M., Tür, G., Lu, Y., 1998. Automatic detection of sentence boundaries and disfluencies based on recognized words. In: Mannell, R.H., Robert-Ribes, J. (Eds.), Proceedings of the International Conference on Spoken Language Processing, Sydney. Australian Speech Science and Technology Association, Vol. 5, pp. 2247–2250.

Stolcke, A., Shriberg, E., Hakkani-Tür, D., Tür, G., Rivlin, Z., Sönmez, K., 1999. Combining words and speech prosody for automatic topic segmentation. In: Proceedings DARPA Broadcast News Workshop, Herndon, VA, February. Morgan Kaufmann, Los Altos, CA, pp. 61–64.

Swerts, M., 1997. Prosodic features at discourse boundaries of different strength. Journal of the Acoustical Society of America 101, 514–521.

Swerts, M., Geluykens, R., 1994. Prosody as a marker of information flow in spoken discourse. Language and Speech 37, 21–43.

Swerts, M., Ostendorf, M., 1997. Prosodic and lexical indications of discourse structure in human-machine interactions. Speech Communication 22 (1), 25–41.

Talkin, D., 1995. A robust algorithm for pitch tracking (RAPT). In: Kleijn, W.B., Paliwal, K.K. (Eds.), Speech Coding and Synthesis. Elsevier, New York.

Thorsen, N.G., 1985. Intonation and text in Standard Danish. Journal of the Acoustical Society of America 77, 1205–1216.

Tür, G., Hakkani-Tür, D., Stolcke, A., Shriberg, E., to appear. Integrating prosodic and lexical cues for automatic topic segmentation. Computational Linguistics.

Vaissière, J., 1983. Language-independent prosodic features. In: Cutler, A., Ladd, D.R. (Eds.), Prosody: Models and Measurements. Springer, Berlin, Chapter 5, pp. 53–66.

Viterbi, A., 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13, 260–269.

Yamron, J.P., Carp, I., Gillick, L., Lowe, S., van Mulbregt, P., 1998. A hidden Markov model approach to text segmentation and event tracking. In: Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing, Seattle, WA. Vol. 1, pp. 333–336.
