Les modèles de langue contextuels Camembert pour le Français : impact de la taille et de l'hétérogénéité des données d'entrainement

We explore the impact of the size and heterogeneity of training data on French language modeling.

Louis Martin, Benjamin Muller, Pedro Ortiz Suarez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Benoît Sagot, Djamé Seddah

Establishing a New State-of-the-Art for French Named Entity Recognition

We convert the NER annotations of the French TreeBank to a more user-friendly format and establish a new state of the art for French NER.

Pedro Ortiz Suarez, Yoann Dupont, Benjamin Muller, Laurent Romary, Benoît Sagot

French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus

We investigate the impact of different types and sizes of training corpora on language models.

Murielle Popa-Fabre, Pedro Ortiz Suarez, Benoît Sagot, Éric de la Clergerie

How OCR Performance can Impact on the Automatic Extraction of Dictionary Content Structures

We explore the impact of OCR quality on GROBID-Dictionaries models.

Mohamed Khemakhem, Ioana Galleron, Geoffrey Williams, Laurent Romary, Pedro Ortiz Suarez

Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures

We propose a new pipeline to filter, clean and classify Common Crawl by language, and we publish the resulting corpus under the name OSCAR.

Pedro Ortiz Suarez, Benoît Sagot, Laurent Romary
