Published on Thu May 07 2020

SmartExchange: Trading Higher-cost Memory Storage/Access for Lower-cost Computation

Yang Zhao, Xiaohan Chen, Yue Wang, Chaojian Li, Haoran You, Yonggan Fu, Yuan Xie, Zhangyang Wang, Yingyan Lin


Abstract

We present SmartExchange, an algorithm-hardware co-design framework to trade higher-cost memory storage/access for lower-cost computation, targeting energy-efficient inference of deep neural networks (DNNs). We develop a novel algorithm to enforce a specially favorable DNN weight structure, where each layerwise weight matrix can be stored as the product of a small basis matrix and a large sparse coefficient matrix whose non-zero elements are all powers of 2. To the best of our knowledge, this is the first formulation that integrates three mainstream model compression ideas (sparsification or pruning, decomposition, and quantization) into one unified framework. The resulting sparse and readily-quantized DNN thus enjoys greatly reduced energy consumption in data movement as well as weight storage. On top of that, we further design a dedicated accelerator to fully exploit the SmartExchange-enforced weights and improve both energy efficiency and latency. Extensive experiments show that 1) on the algorithm level, SmartExchange outperforms state-of-the-art compression techniques, including sparsification or pruning, decomposition, and quantization, in various ablation studies based on nine DNN models and four datasets; and 2) on the hardware level, the proposed SmartExchange-based accelerator improves energy efficiency by up to 6.7× and speedup by up to 19.2× over four state-of-the-art DNN accelerators, when benchmarked on seven DNN models (including four standard DNNs, two compact DNN models, and one segmentation model) and three datasets.
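As a rough illustration of the weight structure described above, the NumPy sketch below rebuilds a layer weight from a small dense basis and a sparse coefficient matrix whose non-zeros are powers of two. Variable names such as `Ce` and `B` are ours, not the paper's code; the point is only that storing the two factors instead of the dense weight trades memory for a cheap rebuild, since multiplying by a power-of-2 coefficient reduces to a shift in hardware.

```python
import numpy as np

def is_power_of_two(x):
    """True where |x| = 2^k for some integer k (such factors reduce to shifts in hardware)."""
    mantissa, _ = np.frexp(np.abs(x))   # |x| = mantissa * 2^exp, with mantissa in [0.5, 1)
    return np.isclose(mantissa, 0.5)

def rebuild_weight(Ce, B):
    """Reconstruct a layer weight W ~= Ce @ B from a large sparse coefficient matrix Ce
    (non-zeros restricted to powers of two) and a small dense basis B."""
    nonzeros = Ce[Ce != 0]
    assert np.all(is_power_of_two(nonzeros)), "non-zero coefficients must be powers of two"
    # Storing/loading only Ce (sparse, cheap to encode) and B (tiny) instead of the full
    # dense W trades DRAM storage/access for this cheap on-the-fly reconstruction.
    return Ce @ B

# Toy example: a 4x8 weight tile rebuilt from a 4x2 coefficient block and a 2x8 basis.
rng = np.random.default_rng(0)
B = rng.standard_normal((2, 8)).astype(np.float32)
Ce = np.array([[0.5, 0.0],
               [0.0, -2.0],
               [4.0, 0.0],
               [0.0, 0.25]], dtype=np.float32)   # sparse, power-of-2 entries only
W = rebuild_weight(Ce, B)
print(W.shape)                                   # (4, 8)
```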

Mon Jan 04 2021
Machine Learning
SmartDeal: Re-Modeling Deep Network Weights for Efficient Inference and Training
Deep neural networks (DNNs) come with heavy parameterization, leading to external dynamic random-access memory (DRAM) accesses. The prohibitive energy cost of DRAM accesses makes it non-trivial to deploy DNNs on resource-constrained devices. We present SmartDeal (SD), an algorithm framework to trade higher-cost memory storage/access for lower-cost computation.
Wed Jul 03 2019
Neural Networks
Non-Structured DNN Weight Pruning -- Is It Beneficial in Any Platform?
Large deep neural network (DNN) models pose a key challenge to energy efficiency because off-chip DRAM accesses consume significantly more energy than arithmetic or SRAM operations. Weight quantization leverages the redundancy in the number of bits used to represent weights. Compared to pruning, quantization is much more hardware-friendly and has become a "must-do" step for FPGA and ASIC implementations.
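For context on the quantization step this summary refers to, here is a minimal sketch of plain uniform symmetric weight quantization. It is not this paper's specific scheme; function names and the per-tensor scale are illustrative choices.

```python
import numpy as np

def quantize_weights(w, num_bits=8):
    """Uniform symmetric quantization: map float weights to num_bits signed integers
    plus one float scale, exploiting the redundancy in weight bit-width."""
    qmax = 2 ** (num_bits - 1) - 1              # e.g. 127 for 8-bit
    scale = np.max(np.abs(w)) / qmax            # per-tensor scale (per-channel is also common)
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    """Recover a float approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_weights(w, num_bits=8)
print(np.abs(w - dequantize(q, scale)).max())   # quantization error is at most ~scale / 2
```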
Mon Dec 31 2018
Artificial Intelligence
ADMM-NN: An Algorithm-Hardware Co-Design Framework of DNNs Using Alternating Direction Method of Multipliers
ADMM-NN is a systematic, joint framework for DNN weight pruning and quantization using the alternating direction method of multipliers (ADMM). It can be understood as a smart regularization technique in which the regularization target is dynamically updated in each ADMM iteration. The second part is a set of hardware-aware DNN optimizations that facilitate hardware-level implementations.
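The "dynamically updated regularization target" can be pictured with the small ADMM-style pruning sketch below. This is our own illustrative code, not ADMM-NN's: each iteration re-projects the auxiliary variable Z onto the sparsity constraint, and an extra quadratic loss term pulls the weights toward that moving target.

```python
import numpy as np

def project_to_sparsity(W, keep_ratio):
    """Z-update: keep only the largest-magnitude fraction of entries (the constraint set)."""
    k = max(1, int(round(keep_ratio * W.size)))
    thresh = np.partition(np.abs(W).ravel(), -k)[-k]
    return np.where(np.abs(W) >= thresh, W, 0.0)

def admm_regularizer(W, Z, U, rho):
    """Quadratic term added to the ordinary training loss: (rho / 2) * ||W - Z + U||^2."""
    return 0.5 * rho * np.sum((W - Z + U) ** 2)

def admm_step(W, Z, U, keep_ratio):
    """Per-iteration ADMM updates; the regularization target Z changes every iteration."""
    Z = project_to_sparsity(W + U, keep_ratio)   # dynamically updated regularization target
    U = U + W - Z                                # dual-variable (scaled multiplier) update
    return Z, U

# Toy usage: one weight matrix, 10% of entries kept.
W = np.random.default_rng(2).standard_normal((32, 32))
Z, U = np.zeros_like(W), np.zeros_like(W)
Z, U = admm_step(W, Z, U, keep_ratio=0.1)
loss_extra = admm_regularizer(W, Z, U, rho=1e-3)
```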
Thu Feb 04 2016
Computer Vision
EIE: Efficient Inference Engine on Compressed Deep Neural Network
Deep neural networks (DNNs) are computationally and memory intensive, which makes them difficult to deploy on embedded systems with limited hardware resources and power budgets. The previously proposed 'Deep Compression' makes it possible to fit large DNNs fully in on-chip SRAM. We propose an energy-efficient inference engine (EIE) that performs inference on this compressed network model.
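To make "inference on a compressed model" concrete, the sketch below performs a sparse matrix-vector product driven only by non-zero activations and the non-zero weights stored for the corresponding columns. It uses SciPy's CSC format as a stand-in and approximates the idea only; it is not EIE's actual storage format or dataflow.

```python
import numpy as np
from scipy.sparse import csc_matrix

def sparse_matvec_skip_zeros(W_csc, a):
    """y = W @ a, touching only non-zero activations and the weights stored for their columns."""
    y = np.zeros(W_csc.shape[0], dtype=np.float64)
    for j in np.nonzero(a)[0]:                    # skip zero activations entirely
        start, end = W_csc.indptr[j], W_csc.indptr[j + 1]
        rows = W_csc.indices[start:end]           # row indices of stored weights in column j
        y[rows] += W_csc.data[start:end] * a[j]   # multiply-accumulate only where needed
    return y

rng = np.random.default_rng(3)
dense = np.where(rng.random((64, 128)) < 0.1, rng.standard_normal((64, 128)), 0.0)
W = csc_matrix(dense)                             # compressed (column-major sparse) weights
a = np.where(rng.random(128) < 0.3, rng.standard_normal(128), 0.0)
assert np.allclose(sparse_matvec_skip_zeros(W, a), dense @ a)
```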
Thu Oct 05 2017
Machine Learning
To prune, or not to prune: exploring the efficacy of pruning for model compression
Model pruning seeks to induce sparsity in a deep neural network's various connection matrices, thereby reducing the number of nonzero-valued parameters. Recent reports show that deep networks can be pruned at the cost of only a marginal loss in accuracy while achieving a sizable reduction in model size.
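A minimal sketch of what "inducing sparsity" typically looks like in practice follows: magnitude pruning with a gradually ramped sparsity target. The helper names and the particular ramp are our illustrative choices, not necessarily this paper's exact procedure.

```python
import numpy as np

def magnitude_mask(W, sparsity):
    """Binary mask that zeroes out the smallest-magnitude `sparsity` fraction of W."""
    k = int(round(sparsity * W.size))
    if k == 0:
        return np.ones(W.shape, dtype=bool)
    thresh = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    return np.abs(W) > thresh

def gradual_sparsity(step, total_steps, final_sparsity):
    """One common schedule: ramp the target sparsity smoothly toward its final value."""
    frac = min(1.0, step / total_steps)
    return final_sparsity * (1.0 - (1.0 - frac) ** 3)

W = np.random.default_rng(4).standard_normal((256, 256))
for step in (0, 500, 1000):
    s = gradual_sparsity(step, total_steps=1000, final_sparsity=0.9)
    W = W * magnitude_mask(W, s)   # in training, applied after each weight update
print((W == 0).mean())             # achieved sparsity, roughly 0.9
```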
Tue Jun 15 2021
Machine Learning
Efficient Micro-Structured Weight Unification for Neural Network Compression
Compressing deep neural network (DNN) models is essential for practical applications. Previous unstructured or structured weight pruning methods can hardly deliver real inference acceleration. We propose a generalized weight unification framework at a hardware-compatible micro-structured level to achieve a high degree of compression and acceleration.