GALR is a low-cost, high-performance network. With only 1.5M parameters, it achieves separation performance comparable to DPRNN with 36.1% less runtime memory and 49.4% fewer computational operations. At a comparable model size, GALR consistently outperforms DPRNN on three datasets.
Recent research on the time-domain audio separation networks (TasNets) has
brought great success to speech separation. Nevertheless, conventional TasNets
struggle to satisfy the memory and latency constraints in industrial
applications. In this regard, we design a low-cost high-performance
architecture, namely, globally attentive locally recurrent (GALR) network.
Like the dual-path RNN (DPRNN), we first split a feature sequence into 2D
segments and then process the sequence along both the intra- and inter-segment
dimensions. Our main innovation is that, on top of features recurrently
processed along the intra-segment dimension, GALR applies a self-attention
mechanism to the sequence along the inter-segment dimension, which aggregates
context-aware information and also enables parallelization. Our experiments
suggest that GALR is a notably more effective network than the prior work. On
the one hand, with only 1.5M parameters, it achieves comparable separation
performance at a much lower cost, with 36.1% less runtime memory and 49.4% fewer
computational operations relative to DPRNN. On the other hand, at a model size
comparable to DPRNN's, GALR consistently outperforms DPRNN on three datasets,
in particular with a substantial margin of 2.4 dB absolute improvement in
SI-SNRi on the benchmark WSJ0-2mix task.
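
The segment-then-process idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes non-overlapping segments, a plain tanh RNN for the locally recurrent step, and single-head dot-product self-attention without learned projections for the globally attentive step. The function names (segment, local_recurrent, global_attention) and the weight matrices W_in and W_rec are hypothetical and chosen only for illustration.

```python
import numpy as np

def segment(x, K):
    """Split a (T, N) feature sequence into S segments of length K.

    Simplifying assumption: segments do not overlap; the tail is zero-padded.
    Returns an array of shape (S, K, N).
    """
    T, N = x.shape
    S = -(-T // K)  # ceiling division
    padded = np.zeros((S * K, N), dtype=x.dtype)
    padded[:T] = x
    return padded.reshape(S, K, N)

def local_recurrent(segs, W_in, W_rec):
    """Run a plain tanh RNN along the intra-segment axis (K), per segment.

    segs: (S, K, N), W_in: (N, H), W_rec: (H, H). Returns (S, K, H).
    """
    S, K, _ = segs.shape
    H = W_rec.shape[0]
    h = np.zeros((S, H), dtype=segs.dtype)
    out = np.empty((S, K, H), dtype=segs.dtype)
    for k in range(K):  # sequential only within a short segment
        h = np.tanh(segs[:, k] @ W_in + h @ W_rec)
        out[:, k] = h
    return out

def global_attention(feats):
    """Single-head self-attention along the inter-segment axis (S).

    Every intra-segment position k is handled in one batched matrix product,
    so no recurrence over segments is needed. feats: (S, K, H) -> (S, K, H).
    """
    q = feats.transpose(1, 0, 2)                               # (K, S, H)
    scores = q @ q.transpose(0, 2, 1) / np.sqrt(q.shape[-1])   # (K, S, S)
    scores = scores - scores.max(axis=-1, keepdims=True)       # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights @ q).transpose(1, 0, 2)                    # (S, K, H)

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 4))             # T=10 frames, N=4 features
segs = segment(x, K=4)                       # shape (3, 4, 4)
W_in = 0.1 * rng.standard_normal((4, 4))
W_rec = 0.1 * rng.standard_normal((4, 4))
out = global_attention(local_recurrent(segs, W_in, W_rec))
print(out.shape)
```

In this sketch the only sequential loop runs over the K positions inside a segment, while the inter-segment step is a batched matrix product, which is one way to read the parallelization claim in the abstract.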