Published on Tue Dec 15 2015

An Operator for Entity Extraction in MapReduce

Ndapandula Nakashole

Abstract

Dictionary-based entity extraction involves finding mentions of dictionary entities in text. Text mentions are often noisy, containing spurious or missing words. Efficient algorithms for detecting approximate entity mentions follow one of two general techniques. The first approach builds an index on the entities and performs index lookups of document substrings. The second approach recognizes that the number of substrings generated from documents can explode. To get around this, it uses a filter to prune the many substrings that cannot match any dictionary entity, and then verifies, by means of a text join, only whether the remaining substrings are mentions of dictionary entities. The choice between the index-based approach and the filter-and-verification approach is a case-by-case decision, as the best approach depends on the characteristics of the input entity dictionary, for example the frequency of entity mentions. Choosing the right approach for the setting can make a substantial difference in execution time. Making this choice is, however, non-trivial, because parameters within each approach make the space of possible execution plans very large. In this paper, we present a cost-based operator for choosing among execution plans for entity extraction. Since we need to deal with large dictionaries and even larger datasets, our operator is designed for distributed algorithms implemented in MapReduce.
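The two techniques named in the abstract can be sketched on a single machine as follows. This is a minimal illustration, not the paper's operator: the Jaccard similarity on token sets, the `threshold` and `max_len` parameters, the specific filter, and all function names are assumptions chosen for the example, and the MapReduce distribution and cost model are left out.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity between two token sequences, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def substrings(tokens, max_len):
    """Yield every token window of length 1..max_len."""
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            yield tuple(tokens[i:j])


def index_based(doc, entities, threshold, max_len):
    """Approach 1 (assumed form): index the entities, look up each substring."""
    index = defaultdict(set)  # token -> entities containing that token
    for entity in entities:
        for tok in entity:
            index[tok].add(entity)
    matches = set()
    for sub in substrings(doc, max_len):
        candidates = set()
        for tok in sub:
            candidates |= index.get(tok, set())
        for entity in candidates:
            if jaccard(sub, entity) >= threshold:
                matches.add((sub, entity))
    return matches


def filter_and_verify(doc, entities, threshold, max_len):
    """Approach 2 (assumed form): prune substrings cheaply, then join."""
    vocab = {tok for entity in entities for tok in entity}

    def may_match(sub):
        # A substring can reach the threshold only if at least that
        # fraction of its distinct tokens occurs in some entity.
        distinct = set(sub)
        return len(distinct & vocab) / len(distinct) >= threshold

    survivors = [s for s in substrings(doc, max_len) if may_match(s)]
    # Verification: a naive join of surviving substrings with the dictionary.
    return {(s, e) for s in survivors for e in entities
            if jaccard(s, e) >= threshold}


entities = {("new", "york", "city"), ("san", "francisco")}
doc = "i flew from new york to san francisco bay".split()
found_by_index = index_based(doc, entities, threshold=0.6, max_len=3)
found_by_filter = filter_and_verify(doc, entities, threshold=0.6, max_len=3)
```

Both functions return the same set of (substring, entity) pairs for a positive threshold; they differ only in where the work is done. A cost-based operator of the kind the abstract describes would estimate which plan is cheaper for a given dictionary and corpus, which this sketch does not attempt.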

Mon Sep 02 2013
Machine Learning
Scalable Probabilistic Entity-Topic Modeling
We present an LDA approach to entity disambiguation. Each topic is associated with a Wikipedia article and topics generate either content words or entity mentions. We report state-of-the-art performance on a public dataset.
Wed Aug 30 2017
NLP
TANKER: Distributed Architecture for Named Entity Recognition and Disambiguation
NERD systems are crucial for several Natural Language Processing (NLP) tasks such as summarization, understanding, and machine translation. There is no standard interface specification for NERD systems. TANKER aims to overcome scalability, reliability and failure tolerance limitations related to industrial needs by
Mon Jul 16 2018
Machine Learning
Pangloss: Fast Entity Linking in Noisy Text Environments
Pangloss is a production system for entity disambiguation on noisy text. It combines a probabilistic linear-time key phrase identification algorithm with a semantic similarity engine. Pangloss leverages a local embedded database with a tiered architecture to house its statistics and metadata.
Tue Aug 25 2015
Artificial Intelligence
ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution
Entity resolution (ER) is an important and common data cleaning problem. ER is about detecting data duplicate representations for the same external entities. Relatively recently, rules called matching dependencies (MDs) have been proposed for specifying similarity conditions.
Sun Feb 07 2016
Artificial Intelligence
ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution
Entity resolution (ER) is about detecting data duplicate representations for the same external entities. Relatively recently, declarative rules called "matching dependencies" (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged.
Mon Sep 14 2015
Machine Learning
A Practioner's Guide to Evaluating Entity Resolution Results
Entity resolution (ER) is the task of identifying records belonging to the same entity across one or multiple databases. In this paper we survey metrics used to evaluate ER results to improve performance.