Published on Wed Jan 13 2021

Effective Low-Cost Time-Domain Audio Separation Using Globally Attentive Locally Recurrent Networks

Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu

GALR is a low-cost, high-performance separation network. Relative to DPRNN, it achieves comparable separation performance with 36.1% less runtime memory and 49.4% fewer computational operations, and at a comparable model size it consistently outperforms DPRNN across three datasets.

Abstract

Recent research on time-domain audio separation networks (TasNets) has brought great success to speech separation. Nevertheless, conventional TasNets struggle to satisfy the memory and latency constraints of industrial applications. In this regard, we design a low-cost, high-performance architecture, namely, the globally attentive locally recurrent (GALR) network. Like the dual-path RNN (DPRNN), we first split a feature sequence into 2D segments and then process the sequence along both the intra- and inter-segment dimensions. Our main innovation lies in that, on top of features recurrently processed along the intra-segment dimension, GALR applies a self-attention mechanism to the sequence along the inter-segment dimension, which aggregates context-aware information and also enables parallelization. Our experiments suggest that GALR is a notably more effective network than the prior work. On one hand, with only 1.5M parameters, it achieves comparable separation performance at a much lower cost, with 36.1% less runtime memory and 49.4% fewer computational operations relative to DPRNN. On the other hand, at a model size comparable to DPRNN's, GALR consistently outperforms DPRNN on three datasets, in particular with a substantial margin of 2.4 dB absolute SI-SNRi improvement on the benchmark WSJ0-2mix task.
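The data flow described above — segment a feature sequence into a 2D tensor, run a recurrent pass along the intra-segment axis, then self-attention along the inter-segment axis, each with a residual connection — can be sketched in plain numpy. This is a minimal illustration under assumed toy dimensions, not the authors' implementation: the RNN is a bare tanh recurrence standing in for the paper's BLSTM, the attention is single-head, and normalization layers are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_segments(x, K):
    """Split a (T, D) feature sequence into (S, K, D) segments,
    zero-padding the tail so T divides evenly."""
    T, D = x.shape
    S = -(-T // K)  # ceil(T / K)
    x = np.pad(x, ((0, S * K - T), (0, 0)))
    return x.reshape(S, K, D)

def intra_rnn(seg, Wx, Wh, b):
    """Locally recurrent pass: a minimal tanh RNN along the
    intra-segment (length-K) axis, run in parallel over segments."""
    S, K, D = seg.shape
    h = np.zeros((S, D))
    out = np.empty_like(seg)
    for k in range(K):
        h = np.tanh(seg[:, k] @ Wx + h @ Wh + b)
        out[:, k] = h
    return out

def inter_attention(seg):
    """Globally attentive pass: scaled dot-product self-attention
    along the inter-segment (length-S) axis."""
    S, K, D = seg.shape
    x = seg.transpose(1, 0, 2)                      # (K, S, D): attend across segments
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(D)  # (K, S, S)
    scores -= scores.max(axis=-1, keepdims=True)    # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return (w @ x).transpose(1, 0, 2)               # back to (S, K, D)

# toy dimensions (hypothetical; real models use far larger T, D, K)
T, D, K = 10, 4, 3
x = rng.standard_normal((T, D))
Wx = rng.standard_normal((D, D)) * 0.1
Wh = rng.standard_normal((D, D)) * 0.1
b = np.zeros(D)

seg = split_segments(x, K)                # (4, 3, 4)
seg = seg + intra_rnn(seg, Wx, Wh, b)     # residual, locally recurrent
seg = seg + inter_attention(seg)          # residual, globally attentive
print(seg.shape)                          # (4, 3, 4)
```

Because the recurrence runs only over the short intra-segment axis while the long inter-segment axis is handled by attention, the sequential dependency chain is bounded by the segment length K, which is what enables the parallelization the abstract mentions.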

Sun Oct 25 2020
Machine Learning
Attention is All You Need in Speech Separation
Recurrent Neural Networks (RNNs) have long been the dominant architecture in sequence-to-sequence learning. Transformers are emerging as a natural alternative to standard RNNs, replacing recurrent computations with a multi-head attention mechanism.
Mon Oct 14 2019
Machine Learning
Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation
Dual-path recurrent neural network (DPRNN) is a simple yet effective method for organizing RNN layers in a deep structure to model extremely long sequences. DPRNN splits the long sequential input into smaller chunks and applies intra- and inter-chunk operations.
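The chunking step this summary refers to — splitting a long input into shorter, 50%-overlapping chunks so that both intra- and inter-chunk axes stay short — can be sketched as follows. This is a hypothetical helper under assumed conventions (hop of K//2, zero-padding at both ends), not the authors' code:

```python
import numpy as np

def chunk(x, K):
    """Split a (T, D) sequence into 50%-overlapping chunks of length K,
    returning (S, K, D). Hop is K // 2; both ends are zero-padded so
    every input frame appears in exactly two chunks."""
    T, D = x.shape
    P = K // 2
    tail = (-(T - K)) % P if T > K else K - T
    x = np.pad(x, ((P, P + tail), (0, 0)))
    S = (x.shape[0] - K) // P + 1
    return np.stack([x[i * P : i * P + K] for i in range(S)])

c = chunk(np.zeros((8, 2)), 4)
print(c.shape)  # (5, 4, 2)
```

For a length-T input this yields roughly 2T/K chunks of length K, so choosing K near the square root of T keeps both the intra-chunk and inter-chunk passes over short sequences.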
Mon Mar 01 2021
Artificial Intelligence
Sandglasset: A Light Multi-Granularity Self-attentive Network For Time-Domain Speech Separation
Sandglasset advances state-of-the-art (SOTA) performance at significantly smaller model size and computational cost. Sandglasset with only 2.3M parameters has achieved the best results on two benchmark SS datasets.
Thu Sep 21 2017
Machine Learning
Deep Recurrent NMF for Speech Separation by Unfolding Iterative Thresholding
In this paper, we propose a novel recurrent neural network architecture for speech separation. This architecture is constructed by unfolding the iterations of a sequential iterative soft-thresholding algorithm. We name this network architecture deep recurrent NMF (DR-NMF).
Mon Dec 09 2019
Machine Learning
MITAS: A Compressed Time-Domain Audio Separation Network with Parameter Sharing
Deep learning methods have brought substantial advancements in speech separation. Nevertheless, it remains challenging to deploy deep-learning-based models on edge devices. Compressing these large models without hurting separation performance has become an important research topic.
Thu May 21 2020
NLP
ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition
Two novel neural network architectures achieve state-of-the-art (SOTA) performance on the LibriSpeech corpus: a multistream CNN for acoustic modeling and a self-attentive simple recurrent unit (SRU) for language modeling.