

Faculté des Lettres, Département de Linguistique

FipsRomanian: Towards a Romanian Version of the Fips Syntactic Parser
Violeta Seretan, Eric Wehrli, Luka Nerima, Gabriela Soare
LATL – Language Technology Laboratory
{violeta.seretan, eric.wehrli, luka.nerima, [email protected]}

Lexicon construction
• list of headwords (DEX, 1998)
• morphological generation: given a base word form, generates all its forms according to the appropriate inflection paradigm
• manual and semi-automatic insertion
• manual insertion for verbs (specific information: subcategorization, selectional features, thematic function, …)
• Current status:
– simple entries: 60K lexemes / 380K words (10K proper nouns)
– complex entries: multi-word expressions (compounds and collocations): de jur împrejurul “around”; problemă – a se pune “problem – to arise”
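The morphological generation step can be pictured with a minimal sketch. It is not the Fips lexicon format: the paradigm table, its name, and the function below are hypothetical, and only the nominative-accusative forms of one feminine noun class are covered.

```python
# Toy inflection paradigm (an assumption for illustration, not Fips data).
# Each feature bundle maps to (suffix to strip from the base, suffix to add).
FEM_A_PARADIGM = {
    ("sg", "indef"): ("ă", "ă"),    # casă   "house"
    ("sg", "def"):   ("ă", "a"),    # casa   "house-the"
    ("pl", "indef"): ("ă", "e"),    # case   "houses"
    ("pl", "def"):   ("ă", "ele"),  # casele "houses-the"
}


def generate_forms(base, paradigm):
    """Return every inflected form of `base` licensed by `paradigm`."""
    forms = {}
    for features, (strip, add) in paradigm.items():
        if not base.endswith(strip):
            raise ValueError(f"{base!r} does not fit this paradigm")
        stem = base[: len(base) - len(strip)]
        forms[features] = stem + add
    return forms


if __name__ == "__main__":
    for features, form in generate_forms("casă", FEM_A_PARADIGM).items():
        print(features, form)
```

A full generator would need one such table per inflection class and would also cover the genitive-dative and vocative forms.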

Grammar implementation
• Specifications (Soare, 2005)
• Customisation of FipsRomanian grammar for standard operations (syntactic transformations: relativization, interrogation, passivization, ...)
• Similarities and differences. Examples:
– clitic system
– wh-fronting
• Attachment rules: constraints on the main parser operation, Merge, which combines two adjacent structures into a larger structure
• Current status: about 100 rules specified; nearly half implemented and tested
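As a rough illustration of how an attachment rule can constrain Merge, the sketch below combines two adjacent structures only when a rule licenses the pair. The `Structure` class, the two rules, and the agreement check are assumptions made for this example; they are not taken from the FipsRomanian grammar.

```python
from dataclasses import dataclass, field


@dataclass
class Structure:
    category: str                      # e.g. "D", "N", "DP", "V", "VP"
    text: str
    features: dict = field(default_factory=dict)


def agree(left, right, keys=("gender", "number")):
    """Constraint: both structures carry the same gender and number."""
    return all(left.features.get(k) == right.features.get(k) for k in keys)


def always(left, right):
    """Constraint that licenses every pair of the right categories."""
    return True


# Hypothetical rules: (left category, right category, constraint, result).
ATTACHMENT_RULES = [
    ("D", "N", agree, "DP"),
    ("V", "DP", always, "VP"),
]


def merge(left, right):
    """Combine two adjacent structures, or return None if no rule applies."""
    for left_cat, right_cat, constraint, result in ATTACHMENT_RULES:
        if (left.category == left_cat and right.category == right_cat
                and constraint(left, right)):
            features = {**left.features, **right.features}
            return Structure(result, f"{left.text} {right.text}", features)
    return None


if __name__ == "__main__":
    o = Structure("D", "o", {"gender": "fem", "number": "sg"})
    casa = Structure("N", "casă", {"gender": "fem", "number": "sg"})
    print(merge(o, casa))
```

Here the feminine article o merges with casă, while the masculine article un would be rejected by the agreement constraint.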

Extending Fips to Romanian: two main tasks

Romanian language

Sample text

http://wt.jrc.it/lt/Acquis/

This Regulation shall enter into force on the twentieth day following that of its publication in the Official Journal of the European Union.

Prezentul regulament intră în vigoare în a douăzecea zi de la publicarea în Jurnalul Oficial al Uniunii Europene.

Orthography
• phonemic; Latin alphabet (since 1859)
• Diacritics: ă/ə, â/ɨ, î/ɨ; cedilla: ş/ʃ, ţ/ʦ

Morphology
• Case system inherited from Latin: nominative-accusative, genitive-dative, vocative
• Three grammatical genders: masculine, feminine, neuter
• Rich declension of determiners, nouns, adjectives, and verbs, e.g., about 35 forms for a verb
• The definite article is enclitic, i.e., suffixed to nouns and adjectives: casă/house – casa/house-the; mare/big – marea/big-the

Vocabulary
• Latin origin (fundamental vocabulary)
• Slavic origin
• Neologisms: French, Italian, …
• Loanwords: Turkish, Greek, Hungarian, Albanian, ...

Syntax
• VSO language, relatively free word order

Europe - Romance languages

References
Bresnan, J. 2001. Lexical Functional Syntax. Blackwell, Oxford.
Chomsky, N. 1995. The Minimalist Program. MIT Press, Cambridge, Mass.
Călăcean, M. and J. Nivre. 2009. A data-driven dependency parser for Romanian. In Proceedings of the 7th International Workshop on Treebanks and Linguistic Theories (TLT 7), pages 65–76, Groningen, Holland.
1998. DEX – Dicţionarul explicativ al limbii române. Academia Română, Bucharest.
Seretan, V. 2008. Collocation extraction based on syntactic parsing. Ph.D. thesis, University of Geneva.
Soare, G. 2005. Romanian syntax. Technical report, University of Geneva.
Wehrli, E. 2007. Fips, a “deep” linguistic multilingual parser. In ACL 2007 Workshop on Deep Linguistic Processing, pages 120–127, Prague, Czech Republic.

Related work & Useful resources

Task-based evaluation
• Collocation extraction from parsed data (Seretan, 2008)
• Collocations are half idioms (of encoding, but not of decoding)
• Used by parser and in-house rule-based machine translation system
• Precision for top 2000 results: 30.3% (Precision for French data: 65.9%, top 500 results)
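The precision figures above are of the "precision over the top k results" kind. A minimal sketch of that measure follows; the function name and the toy candidate list are assumptions, and the poster does not describe how the judgments were actually collected.

```python
def precision_at_k(ranked_candidates, is_valid, k):
    """Share of the top-k ranked candidates judged to be true collocations."""
    top = ranked_candidates[:k]
    if not top:
        return 0.0
    return sum(1 for candidate in top if is_valid(candidate)) / len(top)


if __name__ == "__main__":
    # Toy data: four ranked candidate pairs, two of them judged valid.
    ranked = ["a-b", "c-d", "e-f", "g-h"]
    judged_valid = {"a-b", "e-f"}
    print(precision_at_k(ranked, judged_valid.__contains__, 4))
```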

Parsing experiment
• data: journalistic texts, 1.05M words
• average sentence length: 26.9 tokens
• 16.2% full parses (FipsFrench, FipsEnglish: about 80%)
• average partial parse length: 5.3 tokens
• unknown words: 6.5% (of which 39.2% proper nouns)
• satisfactory lexical coverage
• grammatical coverage needs to be improved (work in progress!)

Preliminary results

FipsRomanian: Sample results

[Screen captures: sample collocations extracted; lexicon interface; Fips interface showing POS-tagging output and parsing output, with subject, predicate and direct object labeled]

Fips: a multilingual parsing architecture (Wehrli, 2007)

Output
• Rich sentence representation:
– constituent structure
– predicate-argument table
– co-indexation chains
– intra-sentential pronoun resolution

Underlying theory
• Generative Grammar (Chomsky, 1995)
Similarities:
• Simpler Syntax (Culicover and Jackendoff, 2005)
• Lexical Functional Grammar (Bresnan, 2001)

Implementation
• Left-to-right, bottom-up tabular parsing algorithm, relying on detailed lexical information
• Language-independent core + language-specific implementation
• Component Pascal, OOP paradigm, BlackBox IDE
• Supported languages: French, English, German, Spanish, Italian, Greek; others in progress
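To make "left-to-right, bottom-up tabular parsing" concrete, here is a small sketch of the general technique in Python rather than Component Pascal. It is not the Fips algorithm itself: the four-word lexicon, the binary rules, and the category labels are toy assumptions.

```python
from collections import defaultdict

# Toy lexicon and binary rules (assumptions for illustration only).
LEXICON = {"Ion": "DP", "vede": "V", "o": "D", "casă": "N"}
RULES = {("D", "N"): "DP", ("V", "DP"): "VP", ("DP", "VP"): "TP"}


def parse(tokens):
    """Return the categories that span the whole input.

    ends_at[j] is the table: it stores (start, category) for every
    constituent found so far that ends at position j.
    """
    ends_at = defaultdict(set)
    for j, token in enumerate(tokens, start=1):        # left to right
        agenda = [(j - 1, LEXICON[token])]              # KeyError if unknown
        while agenda:
            start, category = agenda.pop()
            if (start, category) in ends_at[j]:
                continue
            ends_at[j].add((start, category))
            # Bottom-up step: combine with each constituent ending at `start`.
            for left_start, left_category in list(ends_at[start]):
                merged = RULES.get((left_category, category))
                if merged is not None:
                    agenda.append((left_start, merged))
    return {cat for start, cat in ends_at[len(tokens)] if start == 0}


if __name__ == "__main__":
    print(parse("Ion vede o casă".split()))  # "Ion sees a house"
```

Because every constituent is recorded in the table by its end position, each new word only needs to look up the structures that end where it begins, which is what keeps this kind of parser from rebuilding the same substructure twice.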

Sample parse tree produced by Fips

• Data-driven dependency parser for Romanian based on the MaltParser, learns dependencies from manual annotations (Călăcean and Nivre, 2009). Problem: reduced treebank size and grammatical coverage (simple structures, no subordination, average sentence length only 9 words).
• Sketch Engine for Romanian: shallow parsing (POS patterns), http://www.sketchengine.co.uk/
• Dependency treebank construction, work in progress at the University of Iaşi, Romania
• Text processing webservices, RACAI – Research Institute for Artificial Intelligence, Romanian Academy, Bucharest, Romania. http://www.racai.ro/webservices/TextProcessing.aspx
• A repository of tools for Romanian: ConsILR - Consortium for the Romanian Language: Resources & Tools, research groups from Iaşi, Bucharest and Chişinău http://consilr.info.uaic.ro/