Language Resources

One of my research interests is creating new language resources for research in linguistics and NLP. Some of them are listed here.

If you use any of these resources in your research, please cite its description paper, available as a PDF.


  • Colonia: Corpus of Historical Portuguese Colonia Website Colonia at Linguateca Colonia at CorpusEye Colonia at Kaggle pdf pdf
    A historical Portuguese corpus containing texts from the 16th to the early 20th century, lemmatized and annotated with POS tags. The corpus is available for download and through a graphical CQPweb-based interface. Since May 2014, thanks to Diana Santos (University of Oslo), Colonia has also been available at Linguateca. Since October 2014, thanks to Eckhard Bick (University of Southern Denmark), a version of Colonia tagged with the PALAVRAS parsing system has been available through CorpusEye. Since August 2017, thanks to Rachael Tatman, Colonia has been available at Kaggle.

  • DSL Corpus Collection (DSLCC) DSLCC pdf
    A collection of journalistic corpora written in closely related languages and language varieties. The dataset was used in the DSL Shared Tasks in 2014, 2015, 2016, and 2017.

  • NLI-PT: A Portuguese Native Language Identification Dataset NLI-PT pdf
    A collection of 1,868 student essays written by learners of European Portuguese whose native languages (L1s) are: Chinese, English, Spanish, German, Russian, French, Japanese, Italian, Dutch, Tetum, Arabic, Polish, Korean, Romanian, and Swedish.

  • Offensive Language Identification Dataset (OLID) OLID pdf
    OLID contains a collection of tweets annotated with a hierarchical annotation model that encompasses the following three levels: A: Offensive Language Detection; B: Categorization of Offensive Language; C: Offensive Language Target Identification. OLID was used in the shared task OffensEval: Identifying and Categorizing Offensive Language in Social Media (SemEval 2019 - Task 6).

Other Resources

  • Frequency lists from comparable Spanish corpora Word Unigrams POS and Morphology pdf
    These two frequency lists were produced to compare linguistic features of four Spanish varieties (Argentina, Mexico, Peru, and Spain) as described in my 2013 paper.

  • LIDIOMS: A Multilingual Linked Idioms Data Set in Five Different Languages LIDIOMS pdf
    This is a multilingual linked idioms data set in five different languages (English, Portuguese, Italian, German, Russian). It is currently being expanded to other languages.

  • P-AWL: Portuguese Academic Word List P-AWL pdf
    The P-AWL was developed for Portuguese using the English Academic Word List (AWL) of Coxhead (2000). It contains 1,812 entries.