
PhDF.6 HW-SW CO-EXPLORATION AND OPTIMIZATION FOR NEXT-GENERATION LEARNING MACHINES

Speaker:
Chunyun Chen, Nanyang Technological University, SG

Authors:
Chunyun Chen and Mohamed M. Sabry Aly, Nanyang Technological University, SG

Abstract
Deep Neural Networks (DNNs) are proliferating in numerous AI applications thanks to their high accuracy. For instance, Convolutional Neural Networks (CNNs), one variety of DNNs, are used in object detection for autonomous driving and have reached or exceeded human performance on some object detection problems. Commonly adopted CNNs such as ResNet and MobileNet are becoming deeper (more layers) yet narrower (smaller feature maps) than the early AlexNet and VGG. Nevertheless, in the race for better accuracy, DNN models, especially Transformers (another variety of DNNs), have been scaled up to trillions of parameters and trillions of Multiply-Accumulate (MAC) operations, as in the case of GPT-4. This scaling makes DNN models both data-intensive and compute-intensive during both training and inference, placing heavy demands on memory capacity for storing weights and on computation, and posing a significant challenge for deploying these models in an area-efficient and power-efficient manner.

Given these challenges, model compression is a vital research topic for alleviating the memory-capacity burden from the algorithmic perspective. Pruning, quantization, and entropy coding are three directions of model compression for DNNs, and the effectiveness of pruning and quantization can be enhanced with entropy coding for further compression. Entropy coding encodes the quantized values of weights or features in a more compact representation by exploiting the peaky distribution of the quantized values, achieving a lower number of bits per value without any accuracy loss. Currently employed Fixed-to-Variable (F2V) entropy coding schemes such as Huffman coding and arithmetic coding are inefficient to decode on hardware platforms, suffering from a high decoding complexity of O(n · k), where n is the number of codewords (quantized values) and k is the reciprocal of the compression ratio. Hence, there is a pressing need for coding algorithms with both high compression ratios and low decoding complexities.

To handle the increasing computational complexity of DNNs from the hardware perspective, domain-specific accelerators (DSAs) have been developed to accelerate the inference process. DSAs for DNNs fall into two categories: (1) special function accelerators that enhance the efficiency of DNN hardware implementations, e.g., accelerators that reduce the memory footprint after model compression or that accelerate the special functions in DNN workloads; and (2) end-to-end (E2E) workload accelerators, e.g., accelerators for ResNet and ViT. Special function accelerators improve the efficiency of DNNs by supporting new numerical representations. For example, entropy coding reduces the memory footprint of DNNs by 21.67×; however, this compression requires a dedicated decoder that can decompress the weights while maintaining the throughput of the accelerator.
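
To make the decoding-complexity argument above concrete, here is a minimal, purely illustrative Python sketch of bit-serial prefix-code (F2V) decoding. The codebook, the bitstream, and the helper name huffman_decode are invented for this example and are not taken from the dissertation; the point is only that each output symbol requires walking roughly k bits, so decoding n symbols costs O(n · k) dependent steps.

```python
# Illustrative only: bit-serial decoding of a prefix (F2V) code such as Huffman.
# The codebook and bitstream below are toy values, not from the dissertation.

# Toy prefix code over four quantized weight values.
CODEBOOK = {"0": 0, "10": 1, "110": 2, "111": 3}

def huffman_decode(bitstream):
    """Decode a bit string one bit at a time.

    Every output symbol consumes roughly k bits (k = average code length,
    i.e., the reciprocal of the compression ratio), so decoding n symbols
    takes O(n * k) sequential, data-dependent steps.
    """
    decoded, current = [], ""
    for bit in bitstream:          # one dependent step per bit
        current += bit
        if current in CODEBOOK:    # a codeword boundary was reached
            decoded.append(CODEBOOK[current])
            current = ""
    return decoded

print(huffman_decode("0101101110"))  # -> [0, 1, 2, 3, 0]
```

The chain of data dependences between consecutive bits is what makes F2V decoding hard to parallelize in hardware.
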
Special function accelerators are also important in DNN workloads. CNN-based object detectors generate highly overlapped bounding boxes around the ground-truth locations of the objects to be detected; hence, Non-Maximum Suppression (NMS) is introduced to filter the overlapping detections and reduce false positives. While convolution operations have undergone massive performance improvements, thanks to both scalable hardware architectures and algorithmic optimizations, the commonly adopted GreedyNMS has not witnessed significant improvement and thus has started to dominate the inference time. GreedyNMS is notably time-intensive, consuming 2.76× more time than the entire inference operation. It is therefore imperative to develop hardware accelerators for efficient NMS algorithms that are scalable by design.

E2E workload accelerators empower entire DNNs with high throughput and low power. Many studies have explored the benefits of implementing Deep Learning (DL) applications on hardware platforms such as Application-Specific Integrated Circuits (ASICs), general-purpose graphics processing units (GPGPUs), and Field Programmable Gate Arrays (FPGAs) for training and inference. Additionally, the exploration of sparsity in DNNs further enhances performance by reducing energy demands and inference times. Although there are numerous hardware solutions for accelerating the inference of deep CNNs, they either incur unaffordable latency, occupy a large area, or are not power-efficient, which limits their application in edge devices, where DL inference must be processed on the fly with tight latency, low energy consumption, small chip area, and extremely high detection accuracy. Moreover, most of them focus on the convolutional (CONV) layers in CNNs or the matrix multiplications (MMs) in Transformers, while ignoring the special functions and lacking system-level solutions. The higher throughput and better accuracy required by modern DNN applications call for comprehensive E2E system-level solutions that accelerate not only the typical CONV layers and MMs but also the special functions, while maintaining high throughput and good inference accuracy.

The contribution of this PhD dissertation is a thorough study of hardware-software co-design and optimization for next-generation learning machines. The study includes the co-design of special function hardware, end-to-end workload hardware, and their full system integration.

Contribution 1: Efficient Tunstall Decoder for Deep Neural Network Compression. In this contribution, the Variable-to-Fixed (V2F) entropy coding method, Tunstall coding, is adopted to address the inefficient decoding problem in deep network compression, and two architectures for streamlined decoding are presented, i.e., the Logic Oriented and the Memory Oriented decoders. The decoders are implemented on FPGA as stand-alone components and are integrated into an open-source SoC platform to assess their overheads. Our decoders are 6× faster than F2V decoders and achieve up to 20× and 100× reductions in memory usage and energy consumption, respectively, compared to 32-bit DNNs. The detailed analysis and results of this contribution are published in [27].
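
For contrast with the F2V sketch above, the following toy example illustrates why Variable-to-Fixed decoding, as used in Contribution 1, is hardware-friendly: each fixed-width codeword is resolved by a single table lookup that emits one or more symbols. The dictionary and the helper name tunstall_decode are made up for illustration and are not the dissertation's actual decoder.

```python
# Illustrative only: Variable-to-Fixed (Tunstall-style) decoding.
# The table below is a made-up toy dictionary, not the one used in the thesis.

# Each fixed-width 2-bit codeword maps to a variable-length run of quantized
# weight values (here, value 0 is assumed to be the most probable symbol).
TUNSTALL_TABLE = {
    0b00: [0, 0, 0],   # long run of the most probable value
    0b01: [0, 0, 1],
    0b10: [0, 1],
    0b11: [1],
}

def tunstall_decode(codewords):
    """Decode a list of fixed-width codewords.

    Each codeword is resolved by one table lookup that emits one or more
    symbols; there is no bit-serial tree walk, and the lookups are mutually
    independent.
    """
    out = []
    for cw in codewords:           # lookups could proceed in parallel in hardware
        out.extend(TUNSTALL_TABLE[cw])
    return out

print(tunstall_decode([0b00, 0b10, 0b11]))  # -> [0, 0, 0, 0, 1, 1]
```

Because each lookup is independent, multiple codewords can in principle be decoded in parallel, which is what makes V2F decoding amenable to streamlined hardware implementations.
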
Contribution 2: Scalable Hardware Acceleration of Non-Maximum Suppression. In this contribution, we present ShapoolNMS, a scalable and parallelizable hardware accelerator that speeds up the NMS process in both one-stage and two-stage detectors with low power. Built on PSRR-MaxpoolNMS, an algorithm that transforms NMS into MaxPool operations, ShapoolNMS omits the need for sorting and nested loops, achieving low computational complexity (a simplified sketch of this MaxPool-based suppression idea is given at the end of this abstract). By exploiting the parallelism and sparsity of the score map, ShapoolNMS outperforms current software and hardware solutions, achieving up to 1,313.69× speedup over the software PSRR-MaxpoolNMS and 10,600× lower power compared with traditional GPUs. It also achieves up to 19.7× speedup over state-of-the-art NMS hardware accelerators. The detailed analysis and results of this contribution are published in [30].

Contribution 3: Chiplet-based Scalable ResNet Accelerating System. In this contribution, we present Res-DLA, an efficient and scalable accelerating hardware system that reduces the processing time of the whole ResNet system on full-HD images. It proposes a novel memory-centric Cross-layer Optimization dataflow that reduces the memory requirements of the feature maps by 84.85%, from 262.97 MB to 39.845 MB. By supporting both Row Stationary and Column Stationary dataflows, the required weight-buffer capacity is reduced from 2,304 KB to 368 KB. With the dataflow-aware architecture, the latency of the 44-Chiplet Res-DLA is 52.75 ms, and the throughput reaches 68 FPS at 200 MHz.

Contribution 4: ViTA: A Highly Efficient Dataflow and Architecture for Vision Transformers. In this contribution, we present ViTA (Vision Transformer Accelerator), a scalable and efficient hardware accelerator for the Vision Transformer (ViT) model. ViTA adopts a novel memory-centric dataflow that reduces memory usage and data movement by exploiting computational parallelism and locality. This design results in a 76.71% reduction in memory requirements for Multi-Head Self-Attention (MHA) compared to original dataflows with VGA-resolution images. A fused configurable module is designed to support approximated non-linear functions, such as GELU, Softmax, and LayerNorm, optimizing hardware resource usage. Our results show that ViTA achieves 16.384 TOPS with area and power efficiencies of 2.13 TOPS/mm² and 1.57 TOPS/W at 1 GHz, surpassing current Transformer accelerators by 27.85× and 1.40×, respectively. The detailed analysis and results of this contribution will be published in [33].

Contribution 5: Full System Integration. In this contribution, we present guidelines for integrating the proposed innovations into a real hardware platform, i.e., the PULPissimo SoC platform. It presents the interfaces, register map, and FSM of the introduced innovations to be integrated into the PULPissimo SoC platform. It demonstrates the validation results of the Tunstall decoder on the PULPissimo SoC platform with a four-layer CNN model. It also discusses the bandwidth requirements of the introduced accelerators when integrated into an SoC platform or with other accelerators.

Publications: This PhD dissertation has resulted in three publications, all in reputed venues in this area, i.e., DAC 2021, DATE 2022, and DATE 2024. Two more papers are expected to be published soon.
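
Referring back to Contribution 2, the sketch below renders the MaxPool-style suppression idea that MaxpoolNMS-type algorithms build on: a detection survives only if its score is the maximum within a local window, so no sorting or nested suppression loops are needed. It is a simplified, hypothetical illustration (the window size, score threshold, score map, and the helper name maxpool_nms are arbitrary), not ShapoolNMS or PSRR-MaxpoolNMS itself.

```python
# Illustrative only: the window-maximum suppression idea behind MaxpoolNMS-style
# algorithms. This is a schematic sketch, not PSRR-MaxpoolNMS or ShapoolNMS.
import numpy as np

def maxpool_nms(score_map, k=3):
    """Keep a detection only if its score is the maximum of its k x k window.

    Replacing sorting and nested suppression loops with a sliding-window
    maximum is what makes this formulation friendly to parallel hardware.
    """
    H, W = score_map.shape
    pad = k // 2
    padded = np.pad(score_map, pad, mode="constant", constant_values=-np.inf)
    keep = np.zeros_like(score_map, dtype=bool)
    for i in range(H):
        for j in range(W):
            window = padded[i:i + k, j:j + k]        # local k x k neighborhood
            is_local_peak = score_map[i, j] == window.max()
            keep[i, j] = is_local_peak and score_map[i, j] > 0.0
    return keep

scores = np.array([[0.1, 0.9, 0.8],
                   [0.2, 0.7, 0.3],
                   [0.0, 0.4, 0.6]])
print(maxpool_nms(scores))  # only local score peaks survive
```

Real algorithms such as PSRR-MaxpoolNMS are considerably more involved; this toy version only conveys the window-maximum idea that removes sorting and nested loops.
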
