Published on Thu Feb 11 2021

A reproduction of Apple's bi-directional LSTM models for language identification in short strings

Mads Toftrup, Søren Asger Sørensen, Manuel R. Ciosici, Ira Assent

Abstract

Language Identification is the task of identifying a document's language. For applications like automatic spell checker selection, language identification must use very short strings such as text message fragments. In this work, we reproduce a language identification architecture that Apple briefly sketched in a blog post. We confirm the bi-LSTM model's performance and find that it outperforms current open-source language identifiers. We further find that its language identification mistakes are due to confusion between related languages.

Sun Apr 22 2018
NLP
Automatic Language Identification in Texts: A Survey
Language identification (LI) is the problem of determining the natural language that a document or part thereof is written in. Automatic LI has been extensively researched for over fifty years. LI is a key part of many text processing pipelines.
Thu Jan 12 2017
NLP
LanideNN: Multilingual Language Identification on Character Window
Monolingual language identification assumes that the given document is written in one language. In multilingual language identification, the document is usually in two or three languages and we just want their names. We propose a method for textual language identification where the languages can change arbitrarily. Our method is based on
Tue Oct 09 2018
NLP
A Fast, Compact, Accurate Model for Language Identification of Codemixed Text
We provide a language code for every token in a sentence, including codemixed text containing multiple languages. Such text is prevalent online, in documents, social media, and message boards. We show that a feed-forward network with a globally constrained decoder can accurately and rapidly label such text.
Tue Mar 09 2021
Artificial Intelligence
Comparing Approaches to Dravidian Language Identification
Wed Aug 10 2016
NLP
Hierarchical Character-Word Models for Language Identification
Social media messages' brevity and unconventional spelling pose a challenge to language identification. We introduce a hierarchical model that learns character and contextualized word-level representations. Our method performs well against strong baselines and can also reveal code-switching.
Tue Jul 27 2021
NLP
gaBERT -- an Irish Language Model
gaBERT is a monolingual BERT model for the Irish language. It provides better representations for a downstream parsing task. We show how different criteria, vocabulary size and the choice of subword tokenisation affect downstream performance.