12.7 Emerging Strategies for Deep Neural Network Hardware


Date: Thursday, March 28, 2019
Time: 16:00 - 17:30
Location / Room: Room 7

Jim Harkin, University of Ulster, GB

Li Jiang, Shanghai Jiao Tong University, CN

This session presents new approaches to the acceleration of deep neural networks, with a focus on ReRAM-based architectures. The papers address the key challenges of reliable operation with unreliable devices and strategies to counter aging effects. In addition, 3D ReRAM is proposed for the acceleration of general graphics processing. In the evolution of stochastic computing, emerging work on low-cost and energy-efficient convolutional neural networks based on deterministic bitstream processing is also explored.

Time  Label  Presentation Title
Shuhang Zhang, TUM, DE
Shuhang Zhang1, Grace Li Zhang1, Bing Li1, Hai (Helen) Li2 and Ulf Schlichtmann1
1TUM, DE; 2Duke University/TUM-IAS, US
Deep Neural Networks (DNNs) have been applied successfully in various fields. Such networks, however, require significant computing resources. Traditional CMOS-based implementations cannot efficiently execute specific computing patterns such as matrix multiplication. Therefore, memristor-based crossbars have been proposed to accelerate such computing tasks by exploiting their analog nature, which also leads to a significant reduction in power consumption. Neural networks must be trained to recognize the features of the applications. This training process leads to many repetitive updates of the memristors in the crossbar. However, memristors in the crossbar can only be programmed reliably a limited number of times. Afterwards, the working range of the memristors deviates from the fresh state. As a result, the weights of the corresponding neural networks cannot be implemented correctly and the classification accuracy drops significantly. This phenomenon is called aging, and it limits the lifetime of memristor-based crossbars. In this paper, we propose a co-optimization framework that counters the aging effect in software training and hardware mapping simultaneously. Experimental results demonstrate that the proposed framework can extend the lifetime of such crossbars by up to 15 times while maintaining the expected classification accuracy.
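To make the effect described in the abstract above concrete, the following minimal Python sketch shows how a crossbar performs a matrix-vector product in the analog domain and how a shrunken conductance range distorts the result. It is an illustration only, not the authors' framework: the function names, the normalized conductance units, and the aging model (a reduced upper conductance bound) are all assumptions made for this example.

```python
def program_conductance(target, g_low, g_high):
    """Clip a target conductance to the range the device can still reach."""
    return min(max(target, g_low), g_high)


def map_weights(weights, g_low, g_high):
    """Map normalized weights (assumed to lie in [0, 1]) onto device conductances."""
    return [[program_conductance(w, g_low, g_high) for w in row] for row in weights]


def crossbar_mvm(voltages, conductances):
    """Column currents I_j = sum_i V_i * G[i][j] (Ohm's law plus Kirchhoff's current law)."""
    n_cols = len(conductances[0])
    return [sum(v * row[j] for v, row in zip(voltages, conductances))
            for j in range(n_cols)]


if __name__ == "__main__":
    weights = [[1.0, 0.0], [0.5, 1.0]]   # hypothetical normalized weights
    voltages = [1.0, 0.5]                # hypothetical input voltages

    fresh = map_weights(weights, 0.0, 1.0)    # fresh device: full range available
    aged = map_weights(weights, 0.0, 0.75)    # assumed aging: upper bound has dropped

    print("fresh currents:", crossbar_mvm(voltages, fresh))
    print("aged currents: ", crossbar_mvm(voltages, aged))
```

In this toy model, every weight above the aged upper bound is silently clipped, so the column currents no longer match the trained network. That mismatch is the kind of accuracy loss the abstract attributes to aging.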
M. Hassan Najafi, University of Louisiana at Lafayette, US
Sayed Abdolrasoul Faraji1, M. Hassan Najafi2, Bingzhe Li1, Kia Bazargan3 and David Lilja1
1University of Minnesota, Twin Cities, US; 2University of Louisiana at Lafayette, US; 3University of Minnesota, US
Stochastic computing (SC) has been used for low-cost and low-power implementation of neural networks. The inherent inaccuracy and long latency of processing random bitstreams have made prior SC-based implementations inefficient compared to conventional fixed-point designs. Random or pseudo-random bitstreams often need to be processed for a very long time to produce acceptable results. This long latency leads to significantly higher energy consumption than the binary design counterparts. Low-discrepancy sequences have recently been used for fast-converging deterministic computation with stochastic constructs. In this work, we propose a low-cost, low-latency, and energy-efficient implementation of convolutional neural networks based on low-discrepancy deterministic bitstreams. Experimental results show a significant reduction in energy consumption compared to conventional random bitstream-based implementations and to the optimized fixed-point design, with no quality degradation.
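The idea behind bitstream multiplication can be sketched in a few lines of Python. In unipolar SC, a value in [0, 1] is encoded as the fraction of 1s in a bitstream, and a bitwise AND of two uncorrelated streams estimates the product. The sketch below compares pseudo-random thresholds with deterministic low-discrepancy thresholds. It is an assumption-laden illustration rather than the paper's design: van der Corput sequences in bases 2 and 3 are used here as a simple stand-in for whichever low-discrepancy sequences the authors employ, and all function names are hypothetical.

```python
import random


def van_der_corput(i, base):
    """Return the i-th element of the base-`base` van der Corput sequence in [0, 1)."""
    value, denom = 0.0, 1.0
    while i > 0:
        i, digit = divmod(i, base)
        denom *= base
        value += digit / denom
    return value


def to_bitstream(x, thresholds):
    """Encode x in [0, 1] as a bitstream: bit is 1 wherever x exceeds the threshold."""
    return [1 if x > t else 0 for t in thresholds]


def and_multiply(a_bits, b_bits):
    """Unipolar SC multiplication: fraction of positions where both bits are 1."""
    return sum(p & q for p, q in zip(a_bits, b_bits)) / len(a_bits)


def multiply_low_discrepancy(x, y, length):
    """Multiply using deterministic thresholds from two different bases."""
    tx = [van_der_corput(i, 2) for i in range(length)]
    ty = [van_der_corput(i, 3) for i in range(length)]
    return and_multiply(to_bitstream(x, tx), to_bitstream(y, ty))


def multiply_random(x, y, length, seed=0):
    """Multiply using pseudo-random thresholds (the conventional SC approach)."""
    rng = random.Random(seed)
    tx = [rng.random() for _ in range(length)]
    ty = [rng.random() for _ in range(length)]
    return and_multiply(to_bitstream(x, tx), to_bitstream(y, ty))


if __name__ == "__main__":
    for n in (36, 216, 1296):
        print(n, multiply_low_discrepancy(0.5, 0.5, n), multiply_random(0.5, 0.5, n))
```

Running the demo for increasing stream lengths gives a feel for how quickly each encoding approaches the exact product of 0.25. The deterministic variant also returns the same result on every run, which is the property the abstract refers to as deterministic computation with stochastic constructs.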
Yiran Chen, Duke University, US
Zichen Fan1, Ziru Li1, Bing Li2, Yiran Chen3 and Hai (Helen) Li4
1Tsinghua University, CN; 2Duke University, US; 3Duke University, US; 4Duke University/TUM-IAS, US
Deconvolution is widely used in neural networks. For example, it is essential for performing unsupervised learning in generative adversarial networks and for constructing fully convolutional networks for semantic segmentation. Resistive RAM (ReRAM)-based processing-in-memory architectures have been widely explored for accelerating convolutional computation and demonstrate good performance. Performing deconvolution on existing ReRAM-based accelerator designs, however, suffers from long latency and high energy consumption because deconvolutional computation includes not only convolution but also extra add-on operations. To execute deconvolution more efficiently, we analyze its computation requirements and propose a ReRAM-based accelerator design named RED. More specifically, RED integrates two orthogonal methods: a pixel-wise mapping scheme that reduces the redundancy caused by zero-inserting operations, and a zero-skipping data flow that increases computation parallelism and therefore improves performance. Experimental evaluations show that, compared to the state-of-the-art ReRAM-based accelerator, RED can speed up operation by 3.69x to 31.15x and reduce energy consumption by 8% to 88.36%.
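The redundancy introduced by zero insertion can be illustrated with a small one-dimensional Python sketch. It computes a strided deconvolution (transposed convolution) in two ways: directly, by scattering each input into the output, and by inserting zeros between inputs and then running an ordinary dense convolution. The second variant also counts how many multiply-accumulate operations touch a real input rather than an inserted or padded zero. This is only a sketch of the general problem; it does not reproduce RED's pixel-wise mapping or data flow, and the function names are hypothetical.

```python
def deconv1d_direct(x, w, stride):
    """Transposed convolution by scatter-add: each input spreads the kernel into the output."""
    out = [0.0] * ((len(x) - 1) * stride + len(w))
    for i, xi in enumerate(x):
        for k, wk in enumerate(w):
            out[i * stride + k] += xi * wk
    return out


def deconv1d_zero_insert(x, w, stride):
    """Same result via zero insertion plus a dense convolution, with MAC counters.

    Returns (output, total_macs, useful_macs), where a MAC is 'useful' only if it
    reads a position that holds a real input rather than an inserted or padded zero.
    """
    up = [0.0] * ((len(x) - 1) * stride + 1)
    for i, xi in enumerate(x):
        up[i * stride] = xi                      # real inputs sit at multiples of stride
    pad = len(w) - 1
    padded = [0.0] * pad + up + [0.0] * pad
    out, total_macs, useful_macs = [], 0, 0
    for j in range(len(up) + pad):
        acc = 0.0
        for k, wk in enumerate(w):
            src = j - k                          # index into the zero-inserted signal
            total_macs += 1
            if 0 <= src < len(up) and src % stride == 0:
                useful_macs += 1
            acc += padded[src + pad] * wk
        out.append(acc)
    return out, total_macs, useful_macs


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    w = [1.0, 0.5, 0.25]
    dense, total, useful = deconv1d_zero_insert(x, w, stride=2)
    print(dense == deconv1d_direct(x, w, stride=2), total, useful)
```

For this toy input, fewer than half of the dense multiply-accumulates involve real data. The wasted share grows with the stride, which is the overhead that a zero-skipping scheme aims to avoid.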
Saibal Mukhopadhyay, Georgia Institute of Technology, US
Yun Long and Saibal Mukhopadhyay, Georgia Institute of Technology, US
Benefiting from the Computing-in-Memory (CIM) architecture and unique device properties such as non-volatility, high density, and fast read/write, ReRAM-based deep learning accelerators provide a promising solution for greatly improving the computing efficiency of various artificial intelligence (AI) applications. However, the intrinsic stochastic behavior of the devices (the statistical distribution of device resistance, set/reset voltage, etc.) makes the computation error-prone. In this paper, we propose two algorithms to suppress the impact of device variation: (a) We employ the dynamical fixed point (DFP) data representation format to adaptively change the decimal point location, minimizing the unused integer bits. (b) We propose a noise-aware training methodology that enhances the robustness of the network to parameter variation. We evaluate the proposed algorithms with convolutional neural networks (CNNs) and recurrent neural networks (RNNs) across different datasets. Simulations indicate that, for all benchmarks, the accuracy is improved by more than 15% with minimal hardware design overhead.
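The adaptive decimal point placement in (a) can be sketched as follows. The Python example picks, for a group of values, the largest number of fractional bits that still lets the largest magnitude fit in a signed word, so that no integer bits go unused. This is one common way to realize a dynamic fixed point format and is an assumption here; the paper's exact scheme, grouping granularity, and rounding mode are not given in the abstract, and the function names are hypothetical. The noise-aware training in (b) is not covered by this sketch.

```python
import math


def dfp_fraction_bits(values, total_bits):
    """Choose the fractional length so the largest magnitude just fits in a signed word."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return total_bits - 1
    # Bits needed left of the binary point; negative when all values are well below 1.
    int_bits = math.floor(math.log2(max_abs)) + 1
    return total_bits - 1 - int_bits


def dfp_quantize(values, total_bits):
    """Quantize values to signed fixed point with a shared, data-dependent fractional length."""
    frac_bits = dfp_fraction_bits(values, total_bits)
    scale = 2.0 ** frac_bits
    q_min, q_max = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    quantized = [max(q_min, min(q_max, round(v * scale))) / scale for v in values]
    return quantized, frac_bits


if __name__ == "__main__":
    small_weights = [0.3, -0.12, 0.05]   # hypothetical layer with small magnitudes
    large_weights = [3.7, -1.2]          # hypothetical layer with larger magnitudes
    print(dfp_quantize(small_weights, total_bits=8))
    print(dfp_quantize(large_weights, total_bits=8))
```

With a fixed format, the small-magnitude layer would waste its integer bits. Here it receives more fractional bits than the large-magnitude layer, which reduces its quantization error at the same word length.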
17:30 End of session