Published on Mon Apr 30 2018

A Portuguese Native Language Identification Dataset

Iria del Río, Marcos Zampieri, Shervin Malmasi

NLI-PT is the first Portuguese dataset compiled for Native Language Identification (NLI), the task of identifying an author's first language based on their second language writing. The dataset includes 1,868 essays written by learners of European Portuguese.

Abstract

In this paper we present NLI-PT, the first Portuguese dataset compiled for Native Language Identification (NLI), the task of identifying an author's first language based on their second language writing. The dataset includes 1,868 student essays written by learners of European Portuguese, native speakers of the following L1s: Chinese, English, Spanish, German, Russian, French, Japanese, Italian, Dutch, Tetum, Arabic, Polish, Korean, Romanian, and Swedish. NLI-PT includes the original student text and four different types of annotation: POS, fine-grained POS, constituency parses, and dependency parses. NLI-PT can be used not only in NLI but also in research on several topics in the field of Second Language Acquisition and educational NLP. We discuss possible applications of this dataset and present the results obtained for the first lexical baseline system for Portuguese NLI.
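To make the idea of a lexical baseline for NLI concrete, here is a minimal sketch of a bag-of-words classifier that predicts an author's L1 from word counts alone. The abstract does not say which model the authors used, so the multinomial Naive Bayes below is an assumption chosen for brevity, not the paper's system. The function names (`tokenize`, `train`, `predict`) and the four toy sentences are hypothetical and are not drawn from NLI-PT.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercased whitespace tokens; a real system would use a proper tokenizer.
    return text.lower().split()


def train(samples):
    """Collect per-L1 word counts from (text, l1_label) pairs."""
    word_counts = defaultdict(Counter)  # label -> token -> count
    doc_counts = Counter()              # label -> number of essays
    vocab = set()
    for text, label in samples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the L1 label with the highest Naive Bayes log-score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(doc_counts):
        total_tokens = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for tok in tokenize(text):
            if tok in vocab:  # ignore words never seen in training
                # Add-one (Laplace) smoothing over the shared vocabulary.
                count = word_counts[label][tok] + 1
                score += math.log(count / (total_tokens + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data (NOT from NLI-PT): L1 function words leak into the L2 text.
toy_essays = [
    ("eu moro en la cidade", "Spanish"),
    ("ela gosta de la praia", "Spanish"),
    ("eu moro in the cidade", "English"),
    ("ela gosta of the praia", "English"),
]
model = train(toy_essays)
print(predict(model, "vamos a la escola"))
print(predict(model, "the livro"))
```

On real data the same structure would be trained on the essay texts with their L1 labels and scored with cross-validation; the toy sentences only show that word-level cues can carry an L1 signal.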

Tue Mar 09 2021
Artificial Intelligence
Comparing Approaches to Dravidian Language Identification
Fri Dec 11 2020
NLP
ParsiNLU: A Suite of Language Understanding Challenges for Persian
ParsiNLU is the first benchmark for the Persian language that includes a range of high-level tasks. It comprises over 14.5k new instances across 6 distinct NLU tasks.
Wed Sep 25 2019
NLP
Developing a Fine-Grained Corpus for a Less-resourced Language: the case of Kurdish
Kurdish is a less-resourced language consisting of different dialects written in various scripts. The lack of corpora is one of the main obstacles in Kurdish language processing. KTC, the Kurdish Textbooks Corpus, is composed of 31 K-12 textbooks in the Sorani dialect.
Fri May 13 2016
NLP
Universal Dependencies for Learner English
The Treebank of Learner English (TLE) is the first publicly available syntactic treebank for English as a Second Language. The TLE provides manually annotated POS tags and Universal Dependency (UD) trees for 5,124 sentences from the Cambridge First Certificate in English corpus.
Thu Jul 02 2020
NLP
Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset
This paper describes the Dakshina dataset, a new resource consisting of text in both the Latin and native scripts for 12 South Asian languages. The dataset includes, for each language: native script Wikipedia text; a romanization lexicon; and full sentence parallel data.
Tue Jan 23 2018
Computer Vision
The WiLI benchmark dataset for written language identification
WiLI-2018 is a publicly available, free of charge dataset of short text extracts from Wikipedia. It contains 1,000 paragraphs for each of 235 languages, totaling 235,000 paragraphs. Given an unknown paragraph written in one dominant language, the task is to decide which language it is.