A Data-driven Approach to Named Entity Recognition for Early Modern French

We opt for a data-driven approach by developing a new corpus with fine-grained entity annotation, covering three centuries of literature corresponding to the early modern period. We then fine-tune existing state-of-the-art architectures, obtaining results that are on par with those of the current state-of-the-art NER systems for Contemporary English.

Simon Gabay, Pedro Ortiz Suarez

Le projet FREEM : ressources, outils et enjeux pour l’étude du français d’Ancien Régime

We present annotated corpora and NLP models for some downstream tasks in Early Modern French.

Simon Gabay, Pedro Ortiz Suarez, Rachel Bawden, Alexandre Bartz, Philippe Gambette, Benoît Sagot

BERTrade: Using Contextual Embeddings to Parse Old French

We consider several neural language models, some of which were trained or fine-tuned on a new corpus of raw Old and Middle French texts, and use their internal representations of words as inputs to train taggers and parsers on the SRCMF treebank.

Loïc Grobol, Mathilde Regnault, Pedro Ortiz Suarez, Benoît Sagot, Laurent Romary, Benoît Crabbé

From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French

We present the $\text{FreEM}_{\text{max}}$ corpus of Early Modern French and D’AlemBERT, a RoBERTa-based language model trained on $\text{FreEM}_{\text{max}}$.

Simon Gabay, Pedro Ortiz Suarez, Alexandre Bartz, Alix Chagué, Rachel Bawden, Philippe Gambette, Benoît Sagot

Towards a Cleaner Document-Oriented Multilingual Crawled Corpus

We take the existing multilingual web corpus OSCAR and its pipeline Ungoliant, which extracts and classifies data from Common Crawl at the line level, and propose a set of improvements and automatic annotations in order to produce a new document-oriented version of OSCAR.

Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot

Ungoliant: An Optimized Pipeline for the Generation of a Very Large-Scale Multilingual Web Corpus

We propose a new pipeline that is faster, modular, parameterizable, and well documented. We use it to create a corpus similar to OSCAR but larger and based on recent data.

Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot

SinNer@CLEF-HIPE2020: Sinful Adaptation of SotA models for Named Entity Recognition in Historical French and German Newspapers

In this article we present the approaches developed by the Sorbonne-INRIA for NER (SinNer) team for the CLEF-HIPE 2020 challenge on Named Entity Processing on old newspapers.

Pedro Ortiz Suarez, Yoann Dupont, Gaël Lejeune, Tian Tian

A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages

We explore the impact of the training corpus on contextualized word embeddings in five mid-resource languages.

Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot

Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell

We introduce the first treebank for a romanized user-generated content variety of Algerian, a North-African Arabic dialect.

Djamé Seddah, Farah Essaidi, Amal Fethi, Matthieu Futeral, Benjamin Muller, Pedro Ortiz Suarez, Benoît Sagot, Abhishek Srivastava