Published on Mon Jun 20 2016

CNNLab: a Novel Parallel Framework for Neural Networks using GPU and FPGA-a Practical Study with Trade-off Analysis

Maohua Zhu, Liu Liu, Chao Wang, Yuan Xie

Abstract

Designing and implementing efficient, provably correct parallel neural network processing is challenging. Existing high-level parallel abstractions like MapReduce are insufficiently expressive, while low-level tools like MPI and Pthreads leave ML experts repeatedly solving the same design challenges. Moreover, the diversity of models and the large scale of data pose a significant challenge to constructing a flexible and high-performance implementation of deep learning neural networks. To improve performance while maintaining scalability, we present CNNLab, a novel deep learning framework using GPU and FPGA-based accelerators. CNNLab provides a uniform programming model to users so that the hardware implementation and the scheduling are invisible to the programmers. At runtime, CNNLab leverages the trade-offs between GPU and FPGA before offloading tasks to the accelerators. Experimental results on the state-of-the-art Nvidia K40 GPU and Altera DE5 FPGA board demonstrate that CNNLab provides a universal framework with efficient support for diverse applications without increasing the burden on programmers. Moreover, we present a detailed quantitative analysis of performance, throughput, power, energy, and performance density for both approaches. The results illuminate the trade-offs between GPU and FPGA and provide useful practical experience for the deep learning research community.
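The runtime trade-off the abstract describes can be pictured as a per-layer dispatch decision: estimate each layer's cost on every backend under the chosen objective (latency or energy), then offload to the cheaper accelerator. The sketch below is purely illustrative; the class names, cost numbers, and objectives are assumptions for exposition, not CNNLab's actual API or measurements.

```python
# Hypothetical sketch of trade-off-driven offloading between a GPU and an
# FPGA backend. All names and numbers are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class LayerProfile:
    name: str
    gpu_latency_ms: float   # estimated execution time on the GPU
    fpga_latency_ms: float  # estimated execution time on the FPGA
    gpu_power_w: float      # average power draw while running on the GPU
    fpga_power_w: float     # average power draw while running on the FPGA

def choose_accelerator(layer: LayerProfile, optimize_for: str = "latency") -> str:
    """Return 'gpu' or 'fpga' for one layer under the chosen objective."""
    if optimize_for == "latency":
        gpu_cost, fpga_cost = layer.gpu_latency_ms, layer.fpga_latency_ms
    elif optimize_for == "energy":
        # energy (mJ) = latency (ms) * power (W)
        gpu_cost = layer.gpu_latency_ms * layer.gpu_power_w
        fpga_cost = layer.fpga_latency_ms * layer.fpga_power_w
    else:
        raise ValueError(f"unknown objective: {optimize_for}")
    return "gpu" if gpu_cost <= fpga_cost else "fpga"

def schedule(layers, optimize_for="latency"):
    """Map every layer of the network to an accelerator."""
    return {layer.name: choose_accelerator(layer, optimize_for) for layer in layers}

# Illustrative profiles: the GPU is faster per layer, but the FPGA draws
# far less power, so the winner flips when optimizing for energy.
layers = [
    LayerProfile("conv1", gpu_latency_ms=1.2, fpga_latency_ms=2.0,
                 gpu_power_w=235.0, fpga_power_w=25.0),
    LayerProfile("fc1",   gpu_latency_ms=0.4, fpga_latency_ms=0.9,
                 gpu_power_w=235.0, fpga_power_w=25.0),
]

print(schedule(layers, "latency"))  # GPU wins on raw speed
print(schedule(layers, "energy"))   # FPGA wins once power is counted
```

This captures the qualitative pattern the paper's measurements suggest (GPUs favored for throughput, FPGAs for energy efficiency), though the real scheduler would profile actual hardware rather than use fixed estimates.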

Thu May 28 2020
Neural Networks
Brief Announcement: On the Limits of Parallelizing Convolutional Neural Networks on GPUs
Training a deep neural network (DNN) is a time-consuming process. Popular deep learning (DL) frameworks such as TensorFlow launch the majority of neural network operations serially on GPUs, leaving inter-operation parallelism unexploited. We make a case for the need and potential benefit of exploiting this rich parallelism.
Mon Feb 20 2017
Machine Learning
A GPU-Outperforming FPGA Accelerator Architecture for Binary Convolutional Neural Networks
FPGA-based hardware accelerators for convolutional neural networks (CNNs) have attracted great attention due to their higher energy efficiency than GPUs. A key advantage of the FPGA accelerator is that its performance is insensitive to data batch size.
Fri Jan 04 2019
Machine Learning
FPDeep: Scalable Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters
FPDeep uses a hybrid of model and layer parallelism to configure distributed reconfigurable clusters for training DNNs. With 6 transceivers per FPGA, FPDeep scales linearly up to 83 FPGAs. FPDeep provides, on average, 6.36x higher energy efficiency than GPU servers.
Wed Jan 24 2018
Machine Learning
Intel nGraph: An Intermediate Representation, Compiler, and Executor for Deep Learning
The Deep Learning (DL) community sees many novel topologies published each year, and achieving high performance on each new topology remains challenging. This issue is compounded by the proliferation of frameworks and hardware platforms. We developed Intel nGraph to simplify the realization of high-performance deep learning.
Mon Apr 23 2018
Neural Networks
BrainSlug: Transparent Acceleration of Deep Learning Through Depth-First Parallelism
Neural network frameworks such as PyTorch and TensorFlow are the workhorses of numerous machine learning applications. BrainSlug is a framework that changes the default layer-by-layer processing to a depth-first approach, reducing the amount of data required by the computations.
Thu Aug 25 2016
Machine Learning
Benchmarking State-of-the-Art Deep Learning Software Tools
Deep learning has proven to be a successful machine learning method for a variety of tasks. Many tools exploit hardware features such as multi-core CPUs and many-core GPUs to shorten training time. Different tools exhibit different features and different runtime performance when training different types of neural networks.