Published on Thu Jul 16 2020

Kronecker Attention Networks

Hongyang Gao, Zhengyang Wang, Shuiwang Ji

Abstract

Attention operators have been applied on both 1-D data like texts and higher-order data such as images and videos. Use of attention operators on high-order data requires flattening of the spatial or spatial-temporal dimensions into a vector, which is assumed to follow a multivariate normal distribution. This not only incurs excessive requirements on computational resources, but also fails to preserve structures in data. In this work, we propose to avoid flattening by assuming the data follow matrix-variate normal distributions. Based on this new view, we develop Kronecker attention operators (KAOs) that operate on high-order tensor data directly. More importantly, the proposed KAOs lead to dramatic reductions in computational resources. Experimental results show that our methods reduce the amount of required computational resources by a factor of hundreds, with larger factors for higher-dimensional and higher-order data. Results also show that networks with KAOs outperform models without attention, while achieving performance competitive with models using the original attention operators.
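As a rough illustration of the idea, the sketch below shows one way an attention operator can act on a 2-D feature map without treating all h×w positions as both queries and keys: keys and values are built from per-row and per-column averages. This is a minimal sketch under that assumption, with illustrative names and scaling; it is not the paper's exact formulation.

```python
import torch

def kronecker_style_attention(x):
    """Sketch of attention on a (batch, channels, height, width) feature map.

    Instead of attending over all h*w positions (cost ~ (h*w)^2), keys and
    values are built from the h row means and w column means, so the cost
    scales roughly with h*w*(h + w).
    """
    b, c, h, w = x.shape
    queries = x.flatten(2).transpose(1, 2)               # (b, h*w, c)

    row_means = x.mean(dim=3).transpose(1, 2)            # (b, h, c), averaged over width
    col_means = x.mean(dim=2).transpose(1, 2)            # (b, w, c), averaged over height
    context = torch.cat([row_means, col_means], dim=1)   # (b, h+w, c)

    scores = queries @ context.transpose(1, 2) / c ** 0.5  # (b, h*w, h+w)
    attn = torch.softmax(scores, dim=-1)
    out = attn @ context                                  # (b, h*w, c)
    return out.transpose(1, 2).reshape(b, c, h, w)
```

For a 64×64 feature map, this attends to 128 aggregated positions instead of 4096, which illustrates the kind of reduction in computation and memory the abstract describes.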

Thu Feb 25 2021
Machine Learning
Named Tensor Notation
We propose a notation for tensors with named axes. It relieves the author, reader, and future implementers from the burden of keeping track of the order of axes and the purpose of each. It also makes it easy to extend operations on low-order tensors to higher-order ones.
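As a loose analogy only (not the notation proposed in the paper), PyTorch's experimental named-tensor API shows how addressing axes by name removes the need to track axis order:

```python
import torch

# Axis names are illustrative; PyTorch's named-tensor API is experimental.
x = torch.randn(4, 3, 32, 32, names=('batch', 'channel', 'height', 'width'))

# Reduce over an axis by name instead of by positional index.
channel_means = x.mean('channel')
print(channel_means.names)  # ('batch', 'height', 'width')

# Reorder axes by name rather than by remembering integer positions.
x_hwc = x.align_to('batch', 'height', 'width', 'channel')
```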
Thu Dec 17 2020
Computer Vision
Transformer Interpretability Beyond Attention Visualization
Self-attention techniques, and specifically Transformers, are becoming increasingly popular in computer vision. The proposed interpretability method assigns local relevance based on the Deep Taylor Decomposition principle and then propagates these relevancy scores through the layers.
Sun Jul 25 2021
Machine Learning
H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences
We describe an efficient hierarchical method to compute attention in the Transformer architecture. The proposed attention mechanism exploits a matrix structure similar to the Hierarchical Matrix (H-Matrix).
Mon Jul 12 2021
Machine Learning
Combiner: Full Attention Transformer with Sparse Computation Cost
Combiner is a drop-in replacement for attention layers in existing transformers. The key idea is to treat the self-attention mechanism as a conditional expectation over embeddings at each location. Each location can then be connected to all other locations, either via direct attention, or through indirect attention.
Wed May 22 2019
Computer Vision
AttentionRNN: A Structured Spatial Attention Mechanism
A proposed AttentionRNN layer explicitly enforces structure over the spatial attention variables. Each attention value depends not only on local image or contextual information, but also on the previously predicted attention values. Our experiments show consistent quantitative and qualitative improvements on a variety of recognition tasks.
Fri Sep 13 2019
Neural Networks
SANVis: Visual Analytics for Understanding Self-Attention Networks
Attention networks, a deep neural network architecture inspired by humans' attention mechanism, have seen significant success in image captioning, machine translation, and many other applications. Recently, they have been further evolved into an advanced approach called multi-head self-attention networks.