Pedro Ortiz Suarez
Ungoliant: An Optimized Pipeline for the Generation of a Very Large-Scale Multilingual Web Corpus
We propose a new pipeline that is faster, modular, parameterizable, and well documented. We use it to create a corpus similar to OSCAR but larger and based on recent data.
Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot
PDF
Citation
Code
Dataset
DOI
CMLC-9
Website
HAL
SinNer@CLEF-HIPE2020: Sinful Adaptation of SotA models for Named Entity Recognition in Historical French and German Newspapers
In this article we present the approaches developed by the Sorbonne-INRIA for NER (SinNer) team for the CLEF-HIPE 2020 challenge on Named Entity Processing on old newspapers.
Pedro Ortiz Suarez, Yoann Dupont, Gaël Lejeune, Tian Tian
PDF
Citation
Video
CEUR-WS
CLEF-HIPE-2020
CLEF-2020
HAL
A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages
We explore the impact of the training corpus on contextualized word embeddings in five mid-resource languages.
Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot
PDF
Citation
Dataset
Project
Video
DOI
ACL Anthology
ACL 2020
HAL
arXiv
CamemBERT: a Tasty French Language Model
We explore the impact of the training data size on a French version of RoBERTa.
Louis Martin, Benjamin Muller, Pedro Ortiz Suarez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, Benoît Sagot
PDF
Citation
Dataset
Project
Video
DOI
ACL Anthology
arXiv
Website
ACL 2020
HAL
Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell
We introduce the first treebank for a romanized user-generated content variety of Algerian, a North-African Arabic dialect.
Djamé Seddah, Farah Essaidi, Amal Fethi, Matthieu Futeral, Benjamin Muller, Pedro Ortiz Suarez, Benoît Sagot, Abhishek Srivastava
PDF
Citation
Video
DOI
ACL Anthology
ACL 2020
Les modèles de langue contextuels Camembert pour le Français : impact de la taille et de l'hétérogénéité des données d'entrainement
We explore the impact of the size and heterogeneity of the training data on French language modeling.
Louis Martin, Benjamin Muller, Pedro Ortiz Suarez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Benoît Sagot, Djamé Seddah
PDF
Citation
Dataset
Project
TALN 2020
HAL
Website
Establishing a New State-of-the-Art for French Named Entity Recognition
We convert the NER annotations of the French TreeBank to a more user-friendly format and establish a new state of the art for French NER.
Pedro Ortiz Suarez, Yoann Dupont, Benjamin Muller, Laurent Romary, Benoît Sagot
PDF
Citation
LREC 2020
HAL
arXiv
ACL Anthology
French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus
We investigate the impact of different types and sizes of training corpora on language models.
Murielle Popa-Fabre, Pedro Ortiz Suarez, Benoît Sagot, Éric de la Clergerie
PDF
Citation
CMLC-8
ACL Anthology
HAL
How OCR Performance can Impact on the Automatic Extraction of Dictionary Content Structures
We explore the impact of OCR quality on grobid-dictionaries models.
Mohamed Khemakhem, Ioana Galleron, Geoffrey Williams, Laurent Romary, Pedro Ortiz Suarez
PDF
Citation
Project
TEI 2019
HAL
Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures
We propose a new pipeline to filter, clean, and classify Common Crawl by language; we publish the final corpus under the name OSCAR.
Pedro Ortiz Suarez, Benoît Sagot, Laurent Romary
PDF
Citation
Code
Dataset
Project
Slides
DOI
CMLC-7
Website
HAL