Published on Fri Aug 06 2021

StrucTexT: Structured Text Understanding with Multi-Modal Transformers

Yulin Li, Yuxi Qian, Yuchen Yu, Xiameng Qin, Chengquan Zhang, Yan Liu, Kun Yao, Junyu Han, Jingtuo Liu, Errui Ding

Abstract

Structured text understanding on Visually Rich Documents (VRDs) is a crucial part of Document Intelligence. Due to the complexity of content and layout in VRDs, structured text understanding has been a challenging task. Most existing studies decoupled this problem into two sub-tasks: entity labeling and entity linking, which require an entire understanding of the context of documents at both token and segment levels. However, little work has been concerned with the solutions that efficiently extract the structured data from different levels. This paper proposes a unified framework named StrucTexT, which is flexible and effective for handling both sub-tasks. Specifically, based on the transformer, we introduce a segment-token aligned encoder to deal with the entity labeling and entity linking tasks at different levels of granularity. Moreover, we design a novel pre-training strategy with three self-supervised tasks to learn a richer representation. StrucTexT uses the existing Masked Visual Language Modeling task and the new Sentence Length Prediction and Paired Boxes Direction tasks to incorporate the multi-modal information across text, image, and layout. We evaluate our method for structured text understanding at segment-level and token-level and show it outperforms the state-of-the-art counterparts with significantly superior performance on the FUNSD, SROIE, and EPHOIE datasets.
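To make the segment-token aligned encoding more concrete, the sketch below shows the core idea in PyTorch. This is not the authors' released code; the class name, embedding sizes, the 2048-dim ROI features, and the coordinate quantization are illustrative assumptions. The point it illustrates is that token embeddings and segment-level visual features share segment-ID and layout embeddings, are concatenated into one sequence, and pass through a single transformer, so entity labeling (token level) and entity linking (segment level) can read from the same encoder.

```python
# Minimal sketch (not the authors' implementation) of a segment-token aligned encoder.
# Dimensions, names, and the 2048-dim ROI features are assumptions for illustration.
import torch
import torch.nn as nn

class SegmentTokenAlignedEncoder(nn.Module):
    """Fuses token embeddings, per-segment visual features, and 2D layout
    embeddings, then encodes them with one shared transformer so downstream
    heads can read out token-level or segment-level representations."""

    def __init__(self, vocab_size=30522, hidden=768, layers=12, heads=12,
                 max_segments=256, grid=1000, roi_dim=2048):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.segment_id_emb = nn.Embedding(max_segments, hidden)  # ties tokens to their segment
        self.x_emb = nn.Embedding(grid, hidden)       # quantized left coordinate
        self.y_emb = nn.Embedding(grid, hidden)       # quantized top coordinate
        self.visual_proj = nn.Linear(roi_dim, hidden) # project each segment's ROI feature
        layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, token_ids, token_seg_ids, token_boxes,
                segment_visual, segment_ids, segment_boxes):
        # Token stream: word embedding + segment id + layout position.
        tok = (self.token_emb(token_ids)
               + self.segment_id_emb(token_seg_ids)
               + self.x_emb(token_boxes[..., 0]) + self.y_emb(token_boxes[..., 1]))
        # Segment stream: visual ROI feature + the same segment id + layout position,
        # so the textual and visual views of one segment share an identifier.
        seg = (self.visual_proj(segment_visual)
               + self.segment_id_emb(segment_ids)
               + self.x_emb(segment_boxes[..., 0]) + self.y_emb(segment_boxes[..., 1]))
        fused = torch.cat([tok, seg], dim=1)          # one sequence over both granularities
        out = self.encoder(fused)
        n_tok = token_ids.size(1)
        return out[:, :n_tok], out[:, n_tok:]         # token-level and segment-level outputs
```

Sharing the segment-ID embedding across the two streams is what aligns a segment's text tokens with its visual feature; the paper's pre-training tasks (Masked Visual Language Modeling, Sentence Length Prediction, Paired Boxes Direction) would then attach prediction heads on top of these outputs.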

Wed Jan 27 2021
NLP
VisualMRC: Machine Reading Comprehension on Document Images
A new visual machine reading comprehension dataset, named VisualMRC, contains 30,000+ pairs of questions and abstractive answers. The dataset will facilitate research aimed at connecting vision and language understanding. The model outperformed the base sequence-to-sequence models.
Tue Dec 29 2020
NLP
LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks. The pre-trained LayoutLMv2 model is publicly available at https://aka.ms/layoutlmv2.
Mon Aug 05 2019
Computer Vision
Visual-Relation Conscious Image Generation from Structured-Text
We propose an end-to-end network for image generation from given structured-text. It consists of the visual-relation layout module and the pyramid of GANs. Our network realistically renders entities' details in high resolution while keeping the scene structure.
Tue Aug 10 2021
NLP
BROS: A Layout-Aware Pre-trained Language Model for Understanding Documents
This paper introduces a pre-trained language model, BERT Relying On Spatiality (BROS), which effectively utilizes the information included in individual text blocks and their layouts. BROS encodes spatial information by utilizing relative positions and learns spatial dependencies between OCR blocks with a novel area-masking strategy.
Thu May 13 2021
Computer Vision
VSR: A Unified Framework for Document Layout Analysis combining Vision, Semantics and Relations
Wed Sep 25 2019
Machine Learning
UNITER: UNiversal Image-TExt Representation Learning
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training. We also propose Word-Region Alignment (WRA) via the use of Optimal Transport (OT).
Thu Oct 11 2018
NLP
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT is designed to pre-train deep bidirectional representations from unlabeled text. It can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Fri May 01 2020
NLP
Spatial Dependency Parsing for Semi-Structured Document Information Extraction
Information Extraction (IE) for semi-structured document images is often approached as a sequence tagging problem by classifying each recognized input token into one of the IOB (Inside, Outside, and Beginning) categories. Under this setup, it cannot easily handle complex spatial relationships.
Mon Sep 26 2016
NLP
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Neural Machine Translation (NMT) is an end-to-end learning approach for automated translation. Using a human side-by-side evaluation on a set of simple sentences, it reduces translation errors by an average of 60% compared to Google's phrase-based production system.
Tue Dec 04 2018
Computer Vision
Bag of Tricks for Image Classification with Convolutional Neural Networks
Recent progress in image classification research can be credited to training procedure refinements. In this paper, we will examine a collection of such refinements and empirically evaluate their impact on the final model accuracy.
Fri Mar 04 2016
Machine Learning
End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF
State-of-the-art sequence labeling systems traditionally require large amounts of task-specific knowledge in the form of hand-crafted features and data pre-processing. In this paper, we introduce a novel neural network architecture that benefits from both word- and character-level representations.