Published on Sun Jan 03 2021

An Efficient Transformer Decoder with Compressed Sub-layers

Yanyang Li, Ye Lin, Tong Xiao, Jingbo Zhu

The large attention-based encoder-decoder network (Transformer) has become increasingly effective, but the high computational complexity of its decoder makes decoding inefficient. Under mild conditions, the architecture can be simplified by compressing its sub-layers.

Abstract

The large attention-based encoder-decoder network (Transformer) has become prevalent recently due to its effectiveness. But the high computational complexity of its decoder raises efficiency concerns. By examining the mathematical formulation of the decoder, we show that, under some mild conditions, the architecture can be simplified by compressing its sub-layers, the basic building blocks of the Transformer, to achieve higher parallelism. We thereby propose the Compressed Attention Network, whose decoder layer consists of only one sub-layer instead of three. Extensive experiments on 14 WMT machine translation tasks show that our model is 1.42x faster with performance on par with a strong baseline. This strong baseline is already 2x faster than the widely used standard baseline without loss in performance.
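
To make the idea of sub-layer compression concrete, the sketch below contrasts a standard three-sub-layer decoder layer with a hypothetical single-sub-layer variant in PyTorch. The class names and the concatenation-based fusion are assumptions for illustration only; the paper derives its actual compression from a mathematical analysis of the decoder that is not reproduced here.

import torch
import torch.nn as nn

class StandardDecoderLayer(nn.Module):
    """Standard Transformer decoder layer: three sub-layers (self-attention,
    encoder-decoder attention, feed-forward), each wrapped in a residual
    connection with layer normalization."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(3)])

    def forward(self, x, memory, self_mask=None):
        h = self.norms[0](x)
        x = x + self.self_attn(h, h, h, attn_mask=self_mask)[0]
        h = self.norms[1](x)
        x = x + self.cross_attn(h, memory, memory)[0]
        return x + self.ffn(self.norms[2](x))

class CompressedDecoderLayer(nn.Module):
    """Hypothetical single-sub-layer variant: self-attention and encoder-decoder
    attention are computed from one normalized input and merged, so the layer
    has a single residual block (illustrative, not the paper's formulation)."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.merge = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, memory, self_mask=None):
        h = self.norm(x)
        a = self.self_attn(h, h, h, attn_mask=self_mask)[0]
        c = self.cross_attn(h, memory, memory)[0]
        return x + self.merge(torch.cat([a, c], dim=-1))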

Wed May 02 2018
NLP
Accelerating Neural Transformer via an Average Attention Network
With parallelizable attention networks, the neural Transformer is very fast to train. However, due to the auto-regressive architecture and self-attention in the decoder, the decoding procedure becomes slow. To alleviate this issue, we propose an average attention network as an alternative.
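
The core of the average-attention idea can be sketched in a few lines: each target position attends to the uniform average of all positions up to and including itself, which at inference time reduces to maintaining a running sum. The gating and feed-forward parts of the full model are omitted in this sketch.

import torch

def cumulative_average(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, length, d_model). Position j of the output holds the mean of
    x[:, :j+1], i.e., a uniform 'attention' over all preceding positions."""
    steps = torch.arange(1, x.size(1) + 1, device=x.device).view(1, -1, 1)
    return x.cumsum(dim=1) / steps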
Wed Jun 26 2019
NLP
Sharing Attention Weights for Fast Transformer
The Transformer machine translation system has shown strong results by stacking attention layers on both the source and target-language sides. But inference with this model is slow due to the heavy use of dot-product attention in auto-regressive decoding.
Thu Sep 05 2019
NLP
Accelerating Transformer Decoding via a Hybrid of Self-attention and Recurrent Neural Network
Transformer is faster to train than RNN-based models and is popularly used in machine translation tasks. However, during decoding, each output word requires all the hidden states of the previously generated words. Our hybrid network can decode 4 times faster than the Transformer.
Mon Oct 29 2018
Artificial Intelligence
Parallel Attention Mechanisms in Neural Machine Translation
Recent papers in neural machine translation have proposed the strict use of attention mechanisms over previous standards. We propose that by running the traditionally stacked encoding branches of attention-focused encoder-decoder architectures in parallel, even more sequential operations can be removed from the model.
Sun Jun 03 2018
NLP
Dense Information Flow for Neural Machine Translation
The proposed DenseNMT architecture is based on the success of the DenseNet model in computer vision problems. It uses the dense attention structure to improve attention quality.
Mon Feb 24 2020
NLP
Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation
A key feature of the Transformer architecture is the so-called multi-head attention mechanism. Most attention heads learn simple, and often redundant, positional patterns. We propose to replace all but one attention head of each encoder layer with simple, fixed (non-learnable) attentive patterns.
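
As an illustration, one such fixed pattern could be a head that always attends to the previous token. The sketch below builds that pattern as a constant attention matrix; the specific set of patterns used in the paper is not reproduced here.

import torch

def previous_token_pattern(length: int) -> torch.Tensor:
    """Return a (length, length) attention matrix in which position i puts all
    of its weight on position i - 1 (position 0 attends to itself)."""
    weights = torch.zeros(length, length)
    weights[0, 0] = 1.0
    idx = torch.arange(1, length)
    weights[idx, idx - 1] = 1.0
    return weights

# Applying the fixed pattern is a plain matrix product with the value vectors:
# output = previous_token_pattern(seq_len) @ values  # values: (seq_len, d_head)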
Mon Jun 12 2017
NLP
Attention Is All You Need
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.
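
The central operation of the Transformer is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, which can be written directly:

import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (..., length, d_k). `mask` is a boolean tensor marking the
    positions to exclude (e.g., future positions in the decoder)."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v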
Mon Mar 09 2015
Machine Learning
Distilling the Knowledge in a Neural Network
We introduce a new type of ensemble composed of one or more full models and many specialist models. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
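
The distillation objective is commonly implemented as a weighted sum of a soft loss against the teacher's temperature-softened distribution and a hard loss against the true labels. A minimal sketch follows; the temperature T and weight alpha below are placeholder values, not ones prescribed by the paper.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's distribution at temperature T.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)  # T^2 keeps gradient scale comparable
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard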
Thu Jul 21 2016
Machine Learning
Layer Normalization
Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch.
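
Unlike batch normalization, layer normalization computes the mean and variance over the features of a single example, so it needs no batch statistics. A minimal sketch:

import torch

def layer_norm(x, gain, bias, eps=1e-5):
    """x: (..., d_model); gain and bias: learned (d_model,) parameters."""
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return gain * (x - mean) / torch.sqrt(var + eps) + bias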
Fri Sep 28 2018
NLP
Adaptive Input Representations for Neural Language Modeling
Adaptive input representations for neural language modeling extend the adaptive softmax of Grave et al. (2017) to input representations of variable capacity. There are several choices on how to factorize the input and output layers, and whether to model words, characters or sub-word units. We perform a systematic comparison of popular choices.
Thu Aug 29 2019
NLP
Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention
The general trend in NLP is towards increasing model capacity and performance via deeper neural networks. However, simply stacking more layers of the popular Transformer architecture for machine translation results in poor convergence and high computational overhead. We propose a merged attention sublayer (MAtt) which combines a simplified average-based self-attention sublayer and the encoder-decoder attention sublayer on the decoder side.
Sat Jun 25 2016
Neural Networks
Sequence-Level Knowledge Distillation
Neural machine translation (NMT) offers a novel alternative to statistical approaches. To reach competitive performance, NMT models need to be exceedingly large. We apply knowledge distillation approaches that have proven successful for reducing the size of neural models in other domains to the problem of NMT.
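
A minimal sketch of the sequence-level recipe: the teacher decodes the training sources (for example with beam search) and the student is trained on those outputs in place of the original references. The interfaces below (teacher.translate, student.fit) are hypothetical placeholders, not a real library API.

def sequence_level_distillation(teacher, student, source_sentences, beam_size=5):
    # Teacher generates pseudo-targets for every training source sentence.
    pseudo_targets = [teacher.translate(src, beam_size=beam_size)
                      for src in source_sentences]
    # Student is trained on (source, pseudo-target) pairs instead of references.
    student.fit(list(zip(source_sentences, pseudo_targets)))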