French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus

Murielle Popa-Fabre, Pedro Ortiz Suarez, Benoît Sagot, Éric de la Clergerie

May 2020

Image credit: Alix Chagué

Abstract

This paper investigates the impact of different types and sizes of training corpora on language models. By asking the fundamental question of quality versus quantity, we compare four French corpora by pre-training four different ELMos and evaluating them on dependency parsing, POS-tagging and Named Entity Recognition downstream tasks. We present and assess the relevance of a new balanced French corpus, CaBeRnet, that features a representative range of language usage, including a balanced variety of genres (oral transcriptions, newspapers, popular magazines, technical reports, fiction, academic texts), in oral and written styles. We hypothesize that a linguistically representative corpus will allow the language models to be more efficient, and therefore yield better evaluation scores on different evaluation sets and tasks.

Publication

In 8th Workshop on the Challenges in the Management of Large Corpora

Pedro Ortiz Suarez

Senior Researcher