8.7 Embedded hardware architectures for deep neural networks


Date: Wednesday, March 27, 2019
Time: 17:00 - 18:30
Location / Room: Room 7

Chair:
Sandeep Pande, IMEC-NL, NL

Co-Chair:
Kyuho Lee, Ulsan National Institute of Science and Technology (UNIST), KR

This session presents papers that address various research challenges including optimization of deep neural networks for edge devices, multiplierless neural network acceleration, design space exploration of CNNs on FPGAs and accelerating local binary pattern networks on FPGAs.

Time Label Presentation Title / Authors
17:00 8.7.1 SELF-SUPERVISED QUANTIZATION OF PRE-TRAINED NEURAL NETWORKS FOR MULTIPLIERLESS ACCELERATION
Speaker:
Sebastian Vogel, Robert Bosch GmbH, DE
Authors:
Sebastian Vogel¹, Jannik Springer¹, Andre Guntoro¹ and Gerd Ascheid²
¹Robert Bosch GmbH, DE; ²RWTH Aachen University, DE
Abstract
To host intelligent algorithms such as Deep Neural Networks on embedded devices, it is beneficial to transform the data representation of neural networks into a fixed-point format with reduced bit-width. In this paper, we present a novel quantization procedure for parameters and activations of pre-trained neural networks. For 8-bit linear quantization, our procedure achieves close to original network performance without retraining and consequently does not require labeled training data. Additionally, we evaluate our method for power-of-two quantization as well as for a two-hot quantization scheme, enabling shift-based inference. To underline the hardware benefits of a multiplierless accelerator, we propose the design of a shift-based processing element.
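As a rough illustration of why power-of-two quantization enables shift-based inference, the sketch below rounds a weight to a signed power of two and then multiplies an integer activation by it using only bit shifts. It is a generic example, not the authors' quantization procedure; the function names `quantize_pow2` and `shift_mul` and the exponent range are assumptions made for illustration.

```python
import math

def quantize_pow2(w, min_exp=-8, max_exp=0):
    """Round weight w to sign * 2**e, with e clamped to [min_exp, max_exp].

    Hypothetical helper: the exponent range is an assumed design parameter.
    Returns (sign, e); sign == 0 encodes a zero weight.
    """
    if w == 0:
        return 0, min_exp
    sign = 1 if w > 0 else -1
    e = round(math.log2(abs(w)))          # nearest exponent in the log domain
    e = max(min_exp, min(max_exp, e))     # clamp to the representable range
    return sign, e

def shift_mul(x_int, sign, e):
    """Multiply an integer activation by sign * 2**e using shifts only."""
    if sign == 0:
        return 0
    shifted = x_int << e if e >= 0 else x_int >> -e
    return sign * shifted

# Example: the weight 0.3 is approximated by 2**-2 = 0.25.
sign, e = quantize_pow2(0.3)
product = shift_mul(64, sign, e)          # 64 * 0.25 computed as 64 >> 2
```

A two-hot scheme, as mentioned in the abstract, would represent each weight as a sum of two such powers of two, which costs two shifts and one addition per product in this style of sketch.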
17:30 8.7.2 MULTI-OBJECTIVE PRECISION OPTIMIZATION OF DEEP NEURAL NETWORKS FOR EDGE DEVICES
Speaker:
Nhut Minh Ho, National University of Singapore, SG
Authors:
Nhut-Minh Ho, Ramesh Vaddi and Weng-Fai Wong, National University of Singapore, SG
Abstract
Post-training precision tuning is often needed for efficient implementation of deep neural networks, especially when the inference platform is resource-constrained. While previous works have proposed many ad hoc strategies for this task, this paper describes a general method for allocating precision to trained deep neural networks' data based on a property relating errors in a network. We demonstrate that the precision results of previous works, whether for hardware accelerators or for understanding cross-layer precision requirements, are subsumed by the proposed general method. It achieves energy savings of 29% and 46% over the state-of-the-art search-based method for GoogleNet and VGG-19, respectively. The proposed precision allocation method can be used to optimize for different criteria based on hardware design constraints, allocating precision at the granularity of layers for very deep networks such as Resnet-152, which hitherto was not achievable.
18:00 8.7.3 TOWARDS DESIGN SPACE EXPLORATION AND OPTIMIZATION OF FAST ALGORITHMS FOR CONVOLUTIONAL NEURAL NETWORKS (CNNS) ON FPGAS
Speaker:
Muhammad Adeel Pasha, LUMS, PK
Authors:
Afzal Ahmad and Muhammad Adeel Pasha, Department of Electrical Engineering, SBASSE, LUMS, PK
Abstract
Convolutional Neural Networks (CNNs) have gained widespread popularity in the field of computer vision and image processing. Due to the huge computational requirements of CNNs, dedicated hardware-based implementations are being explored to improve their performance. Hardware platforms such as Field Programmable Gate Arrays (FPGAs) are being widely used to design parallel architectures for this purpose. In this paper, we analyze Winograd minimal filtering or fast convolution algorithms to reduce the arithmetic complexity of convolutional layers of CNNs. We explore a complex design space to find the sets of parameters that result in improved throughput and power-efficiency. We also design a pipelined and parallel Winograd convolution engine that improves the throughput and power-efficiency while reducing the computational complexity of the overall system. Our proposed designs show up to 4.75x and 1.44x improvements in throughput and power-efficiency, respectively, in comparison to the state-of-the-art design while using approximately 2.67x more multipliers. Furthermore, we obtain savings of up to 53.6% in logic resources compared with the state-of-the-art implementation.
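To make the reduction in arithmetic complexity concrete, the sketch below shows the smallest standard case of Winograd minimal filtering, F(2,3): two outputs of a 3-tap 1D filter computed with 4 multiplications instead of the 6 used by the direct method. It is a textbook illustration, not the engine described in the paper; the function names are assumptions. In a real design the filter-side terms would be precomputed once per filter.

```python
def direct_f23(d, g):
    """Direct 3-tap filter over 4 inputs: 2 outputs, 6 multiplications."""
    y0 = d[0] * g[0] + d[1] * g[1] + d[2] * g[2]
    y1 = d[1] * g[0] + d[2] * g[1] + d[3] * g[2]
    return [y0, y1]

def winograd_f23(d, g):
    """Winograd F(2,3): the same 2 outputs with 4 multiplications."""
    # Filter transform (can be precomputed, since g is fixed at inference).
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Input transform followed by the 4 element-wise multiplications.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform: additions and subtractions only.
    return [m1 + m2 + m3, m2 - m3 - m4]

d = [1.0, 2.0, 3.0, 4.0]     # input tile
g = [0.5, 1.0, -2.0]         # filter taps
y_direct = direct_f23(d, g)
y_fast = winograd_f23(d, g)
```

The saving comes at the cost of extra additions in the input and output transforms, which is one reason the choice of tile size forms a design space worth exploring.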
18:15 8.7.4 ACCELERATING LOCAL BINARY PATTERN NETWORKS WITH SOFTWARE PROGRAMMABLE FPGAS
Speaker:
Jeng-Hau Lin, UC San Diego, US
Authors:
Jeng-Hau Lin¹, Atieh Lotfi², Vahideh Akhlaghi³, Zhuowen Tu² and Rajesh Gupta²
¹UCSD, US; ²UC San Diego, US; ³University of California, San Diego, US
Abstract
Fueled by the success of mobile devices, the computational demands on these platforms have been rising faster than the computational and storage capacities or energy availability needed to perform tasks ranging from speech and image recognition to automated reasoning and cognition. While the success of convolutional neural networks (CNNs) has contributed to such a vision, these algorithms stay out of the reach of the limited computing and storage capabilities of mobile platforms. It is clear to most researchers that such a transition can only be achieved by using dedicated hardware accelerators on these platforms. However, CNNs with arithmetic-intensive operations remain particularly unsuitable for such acceleration, both computationally and because of the high memory bandwidth that highly parallel processing requires. In this paper, we implement and optimize an alternative genre of networks, the local binary pattern network (LBPNet), which eliminates arithmetic operations by using combinatorial operations, thus substantially boosting the efficiency of hardware implementation. LBPNet is built upon a radically different view of the arithmetic operations sought by conventional neural networks to overcome limitations posed by compression and quantization methods used for hardware implementation of CNNs. This paper explores in depth the design and implementation of both an architecture and critical optimizations of LBPNet for realization in accelerator hardware and provides a comparison of results with the state-of-the-art CNN on multiple datasets.
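For readers unfamiliar with local binary patterns, the sketch below computes the classic 8-neighbour LBP code of one pixel using only comparisons and bit operations, with no multiplications. It shows the basic operator only and is not LBPNet itself, whose sampling patterns and network structure are described in the paper; the function name `lbp_code` and the neighbour ordering are assumptions.

```python
def lbp_code(img, r, c):
    """Classic 3x3 local binary pattern code for the pixel at (r, c).

    Each neighbour contributes one bit: 1 if it is >= the centre pixel.
    The clockwise neighbour order starting at the top-left is an assumed
    convention; any fixed order works as long as it is used consistently.
    """
    center = img[r][c]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if img[r + dr][c + dc] >= center:   # comparison, no arithmetic
            code |= 1 << bit                # set the corresponding bit
    return code

# Bright corners, dark edges around a mid-grey centre pixel.
img = [[9, 1, 9],
       [1, 5, 1],
       [9, 1, 9]]
code = lbp_code(img, 1, 1)
```

Because every step is a comparison or a bit operation, an operator of this kind maps naturally onto FPGA logic without multipliers.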
18:31 IP4-7, 334 THE CASE FOR EXPLOITING UNDERUTILIZED RESOURCES IN HETEROGENEOUS MOBILE ARCHITECTURES
Speaker:
Nikil Dutt, University of California, Irvine, US
Authors:
Chenying Hsieh, Nikil Dutt and Ardalan Amiri Sani, UC Irvine, US
Abstract
Heterogeneous architectures are ubiquitous in mobile platforms, with mobile SoCs typically integrating multiple processors along with accelerators such as GPUs (for data-parallel kernels) and DSPs (for signal processing kernels). This strict partitioning of application execution on heterogeneous compute resources often results in underutilization of resources such as DSPs. We present a case study executing popular data-parallel workloads such as convolutional neural networks (CNNs), computer vision applications and graphics kernels on mobile devices, and show that both performance and energy consumption of mobile platforms can be improved by synergistically deploying these underutilized DSPs. Our experiments on a mobile Snapdragon 835 platform, under both single and multiple application scenarios executing CNNs and graphics workloads, demonstrate average performance and energy improvements of 15-46% and 18-80%, respectively, by synergistically deploying all available compute resources, especially the underutilized DSP.
18:32 IP4-8, 420 ONLINE RARE CATEGORY DETECTION FOR EDGE COMPUTING
Speaker:
Yufei Cui, City University of Hong Kong, HK
Authors:
Yufei Cui¹, Qiao Li¹, Sarana Nutanong² and Chun Jason Xue¹
¹City University of Hong Kong, HK; ²Vidyasirimedhi Institute of Science and Technology, TH
Abstract
Identifying rare categories is an important data management problem in many application fields including video surveillance, ecological environment monitoring and precision medicine. Previous solutions in the literature require all data instances to be first delivered to the server. Then, the rare category identification algorithms are executed on the pool of data to find informative instances for human annotators to label. This incurs large bandwidth consumption and high latency. To address these problems, we propose a light-weight rare category identification framework. At the sensor side, the designed online algorithm filters less informative data instances from the data stream and only sends the informative ones to human annotators. After labeling, the server only sends labels of the corresponding data instances in response. The sensor-side algorithm is extended to enable cooperation between embedded devices for cases where data is collected in a distributed manner. Experiments are conducted to show that our framework dramatically outperforms the baseline. The network traffic is reduced by 75% on average.
18:33 IP4-9, 416 RAGRA: LEVERAGING MONOLITHIC 3D RERAM FOR MASSIVELY-PARALLEL GRAPH PROCESSING
Speaker:
Yu Huang, Huazhong University of Science and Technology, CN
Authors:
Yu Huang, Long Zheng, Xiaofei Liao, Hai Jin, Pengcheng Yao and Chuangyi Gui, Huazhong University of Science and Technology, CN
Abstract
With the maturity of monolithic 3D integration, 3D ReRAM provides impressive storage density and computational parallelism, with great opportunities for accelerating parallel graph processing. In this paper, we present RAGra, a 3D ReRAM based graph processing accelerator, which has two significant technical highlights. First, monolithic 3D ReRAM usually exhibits complexly intertwined features, with shared input wordlines and output bitlines for different layers. We propose a novel mapping scheme, which guides the mapping of graph algorithms onto 3D ReRAM seamlessly and correctly to expose the massive parallelism of 3D ReRAM. Second, considering the sparsity of real-world graphs, we further propose a row- and column-mixed execution model, which can filter invalid subgraphs to exploit the massive parallelism of 3D ReRAM. Our evaluation on 8-layer stacked ReRAM shows that RAGra outperforms the state-of-the-art planar (2D) ReRAM-based graph accelerator GraphR with a 6.18x performance improvement and 2.21x energy saving, on average. In particular, RAGra significantly outperforms GridGraph (a typical CPU-based graph system) by up to 293.12x speedup.
18:30 End of session