
BPA_6 Logic Synthesis and Verification

Date: Monday, 17 April 2023
Time: 14:00 - 16:00 CET

Time Label Presentation Title
Authors
00:39 CET COMPUTING EFFECTIVE RESISTANCES ON LARGE GRAPHS BASED ON APPROXIMATE INVERSE OF CHOLESKY FACTOR
Authors:
Zhiqiang Liu and Wenjian Yu, Tsinghua University, CN
Abstract
Effective resistance, which originates from the field of circuit analysis, is an important graph distance in spectral graph theory. It has found numerous applications in various areas, such as graph data mining, spectral graph sparsification, circuit simulation, etc. However, computing effective resistances accurately can be intractable and we still lack efficient methods for estimating effective resistances on large graphs. In this work, we propose an efficient algorithm to compute effective resistances on general weighted graphs, based on a sparse approximate inverse technique. Compared with a recent competitor, the proposed algorithm achieves speedups of several hundred times and one to two orders of magnitude improvement in the accuracy of results. Incorporating the proposed algorithm into the graph-sparsification-based power grid (PG) reduction framework, we develop a fast PG reduction method, which achieves an average 6.4X speedup in the reduction time without loss of reduction accuracy. In the applications of power grid incremental analysis and transient analysis, the proposed method shows 2.5X and 1.7X advantages over the PG reduction method based on accurate effective resistances, with no increase in the errors of solutions.
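For context, the quantity being approximated has a standard closed form: for a weighted graph with Laplacian L, the effective resistance between nodes u and v is R_eff(u, v) = (e_u - e_v)^T L^+ (e_u - e_v). A minimal Python sketch of this textbook definition follows; it uses a dense Moore-Penrose pseudoinverse rather than the paper's sparse approximate Cholesky-inverse technique, and all names are illustrative.

    import numpy as np

    def effective_resistance(weights, u, v):
        # weights: symmetric n x n matrix of edge conductances (zero diagonal).
        n = weights.shape[0]
        laplacian = np.diag(weights.sum(axis=1)) - weights
        l_pinv = np.linalg.pinv(laplacian)   # dense pseudoinverse, O(n^3); toy graphs only
        e = np.zeros(n)
        e[u], e[v] = 1.0, -1.0
        return e @ l_pinv @ e

    # Example: a 3-node path with unit-conductance edges 0-1 and 1-2
    # behaves like two 1-ohm resistors in series.
    w = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
    print(effective_resistance(w, 0, 2))     # ~2.0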
00:39 CET FANOUT-BOUNDED LOGIC SYNTHESIS FOR EMERGING TECHNOLOGIES - A TOP-DOWN APPROACH
Authors:
Dewmini Marakkalage1 and Giovanni De Micheli2
1EPFL, CH; 2École Polytechnique Fédérale de Lausanne (EPFL), CH
Abstract
In logic circuits, the number of fanouts a gate can drive is limited, and such limits are tighter in emerging technologies such as superconducting electronic circuits. In this work, we study the problem of resynthesizing a logic network with bounded-fanout gates while minimizing area. We 1) formulate this problem for a fixed target logic depth as an integer linear program (ILP) and present exact solutions for small logic networks, and 2) propose a top-down approach to construct a feasible solution to the ILP which yields an efficient algorithm for fanout bounded synthesis. When using the minimum depth achievable with unbounded fanouts as the target logic depth, our top-down approach achieves 11.82% better area as compared to the state-of-the-art with matching or better delays.
00:39 CET SYNTHESIS WITH EXPLICIT DEPENDENCIES
Authors:
Priyanka Golia1, Subhajit Roy2 and Kuldeep S Meel3
1IIT Kanpur and NUS Singapore, SG; 2IIT Kanpur, IN; 3National University of Singapore, SG
Abstract
Quantified Boolean Formulas (QBF) extend propositional logic with quantification (∀, ∃) over propositional variables. In QBF, an existentially quantified variable is allowed to depend on all universally quantified variables in its scope. Dependency Quantified Boolean Formulas (DQBF) restrict the dependencies of existentially quantified variables. In DQBF, existentially quantified variables have explicit dependencies on a subset of universally quantified variables, called Henkin dependencies. Given a Boolean specification between the set of inputs and outputs, the problem of Henkin synthesis is to synthesize each output variable as a function of its Henkin dependencies such that the specification is met. Henkin synthesis has wide-ranging applications, including verification of partial circuits, controller synthesis, and circuit realizability. In this work, we propose a data-driven approach for Henkin synthesis called HSynth. On an extensive evaluation of over 563 instances arising from past DQBF solving competitions, we demonstrate that HSynth is competitive with state-of-the-art tools. Furthermore, HSynth solves 26 benchmarks that none of the current state-of-the-art techniques could solve.

BPA_9 Memory Centric Computing

Date: Monday, 17 April 2023
Time: 14:00 - 16:00 CET

Time Label Presentation Title
Authors
00:39 CET MINIMIZING COMMUNICATION CONFLICTS IN NETWORK-ON-CHIP BASED PROCESSING-IN-MEMORY ARCHITECTURE
Authors:
Hanbo Sun, Tongxin Xie, Zhenhua Zhu, Guohao Dai, Huazhong Yang and Yu Wang, Tsinghua University, CN
Abstract
Deep Neural Networks (DNNs) have made significant breakthroughs in various fields. However, their enormous computation and parameters seriously hinder their application. Emerging Processing-In-Memory (PIM) architectures provide extremely high energy efficiency to accelerate DNN computing. Moreover, Network-on-Chip (NoC) based PIM architectures significantly improve the scalability of PIM architectures. However, the contradiction between high communication demand and limited NoC bandwidth introduces severe communication conflicts. Existing work neglects the impact of communication conflicts. On the one hand, neglecting communication conflicts leads to the lack of precise performance estimations in the mapping process, making it hard to find optimal results. On the other hand, communication conflicts cause low NoC bandwidth utilization in the scheduling process. There is an over-70% latency gap in existing work caused by communication conflicts. This paper proposes communication conflict optimized mapping and schedule strategies for NoC based PIM architectures. The proposed mapping strategy constructs communication conflict graphs to model communication conflicts. Based on this constructed graph, we adopt a Graph Neural Network (GNN) as a precise performance estimator. Our schedule strategy predefines the communication priority and NoC communication behavior tables for target DNN workloads. In this way, it can improve the NoC bandwidth utilization effectively. Compared with existing work, for typical classification DNNs on the CIFAR and ImageNet datasets, the proposed strategies reduce latency by 78% and improve the throughput by 3.33x on average with negligible deployment and hardware overhead. Experimental results also show that our strategies decrease the average gap to ideal cases without communication conflicts from 80.7% and 70% to 12.3% and 1.26% for latency and throughput, respectively.
00:39 CET HIERARCHICAL NON-STRUCTURED PRUNING FOR COMPUTING-IN-MEMORY ACCELERATORS WITH REDUCED ADC RESOLUTION REQUIREMENT
Authors:
Wenlu Xue1, Jinyu Bai2, Sifan Sun3 and Wang Kang2
1Beihang University, CN; 2Beihang University, CN; 3Beihang University, CN
Abstract
The crossbar architecture, which is comprised of novel nano-devices, enables high-speed and energy-efficient computing-in-memory (CIM) for neural networks. However, the overhead from analog-to-digital converters (ADCs) substantially degrades the energy efficiency of CIM accelerators. In this paper, we introduce a hierarchical non-structured pruning strategy where value-level and bit-level pruning are performed jointly on neural networks to reduce the resolution of ADCs by using the alternating direction method of multipliers (ADMM). To verify the effectiveness, we deployed the proposed method to a variety of state-of-the-art convolutional neural networks on two image classification benchmark datasets: CIFAR10 and ImageNet. The results show that our pruning method can reduce the required resolution of ADCs to 2 or 3 bits with only slight accuracy loss (∼0.25%), and thus can improve the hardware efficiency by 180%.
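As background on the optimization machinery named here, ADMM is typically applied to pruning by splitting the training loss from a hard sparsity constraint; a generic sketch (not the paper's hierarchical value- and bit-level formulation) is

    \min_{W} f(W) + g(Z) \quad \text{s.t.} \quad W = Z, \qquad
    g(Z) = \begin{cases} 0, & Z \in \mathcal{S} \\ +\infty, & \text{otherwise,} \end{cases}

with iterates

    W^{k+1} = \arg\min_{W} f(W) + \tfrac{\rho}{2}\lVert W - Z^{k} + U^{k}\rVert_F^2, \qquad
    Z^{k+1} = \Pi_{\mathcal{S}}\!\left(W^{k+1} + U^{k}\right), \qquad
    U^{k+1} = U^{k} + W^{k+1} - Z^{k+1},

where f is the training loss, S encodes the pruning (sparsity) pattern, and Π_S is Euclidean projection onto S (keep the largest-magnitude entries, zero the rest).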
00:39 CET PIC-RAM: PROCESS-INVARIANT CAPACITIVE MULTIPLIER BASED ANALOG IN MEMORY COMPUTING IN 6T SRAM
Authors:
Kailash Prasad1, Aditya Biswas1, Arpita Kabra2 and Joycee Mekie2
1IIT Gandhinagar, IN; 2Indian Institute of Technology Gandhinagar, IN
Abstract
In-Memory Computing (IMC) is a promising approach to enabling energy-efficient Deep Neural Network-based applications on edge devices. However, analog-domain dot products and multiplications suffer accuracy loss due to process variations. Furthermore, wordline degradation limits the minimum pulsewidth, creating additional non-linearity and limiting IMC's dynamic range and precision. This work presents a complete end-to-end process-invariant capacitive multiplier based IMC in 6T-SRAM (PIC-RAM). The proposed architecture employs the novel idea of two-step multiplication in column-major IMC to support 4-bit multiplication. The PIC-RAM uses an operational amplifier-based capacitive multiplier to reduce bitline discharge, allowing a sufficiently wide WL pulse width. Further, it employs a process-tracking voltage reference and a fuse capacitor to tackle dynamic and post-fabrication process variations, respectively. Our design is compute-disturb free and provides a high dynamic range. To the best of our knowledge, PIC-RAM is the first analog SRAM IMC approach to tackle process variation with a focus on its practical implementation. PIC-RAM has a high energy efficiency of about 25.6 TOPS/W for 4-bit x 4-bit multiplication and has only 0.5% area overhead due to the use of the capacitance multiplier. We obtain 409 bit-wise TOPS/W, which is about 2X better than the state-of-the-art. PIC-RAM achieves TOP-1 accuracies of 89.54% and 98.80% for ResNet-18 on CIFAR10 and MNIST, respectively, for 4-bit x 4-bit multiplication.

S_D6 Reconfigurable architectures, machine learning and circuit design

Date: Monday, 17 April 2023
Time: 16:30 - 18:00 CET

Time Label Presentation Title
Authors
00:39 CET TOWARDS EFFICIENT NEURAL NETWORK MODEL PARALLELISM ON MULTI-FPGA PLATFORMS
Authors:
David Rodriguez Agut1, Rafael Tornero2 and José Flich3
1Universitat Politècnica de València, ES; 2Universitat Politècnica de València, ES; 3Universitat Politècnica de València, ES
Abstract
Nowadays, convolutional neural networks (CNNs) are common in a wide range of applications. Their high accuracy and efficiency contrast with their computing requirements, leading to the search for efficient hardware platforms. FPGAs are suitable due to their flexibility, energy efficiency and low latency. However, the ever-increasing complexity of CNNs demands higher-capacity devices, forcing the need for multi-FPGA platforms. In this paper, we present a multi-FPGA platform with distributed shared memory support for the inference of CNNs. Our solution, in contrast with previous works, enables combining different model parallelism strategies applied to CNNs, thanks to the distributed shared memory support. For a four-FPGA setting, the platform reduces the execution time of 2D convolutions by a factor of 3.95 when compared to a single FPGA. The inference of standard CNN models is improved by factors ranging from 3.63 to 3.87.
00:39 CET HIGH-ACCURACY LOW-POWER RECONFIGURABLE ARCHITECTURES FOR DECOMPOSITION-BASED APPROXIMATE LOOKUP TABLE
Authors:
Xingyue Qian1, Chang Meng1, Xiaolong Shen2, Junfeng Zhao2, Leibin Ni2 and Weikang Qian1
1Shanghai Jiao Tong University, CN; 22012 Labs, Huawei Technologies Co., Ltd., CN
Abstract
Storing pre-computed results of frequently-used functions in a lookup table (LUT) is a popular way to improve energy efficiency, but its advantage diminishes as the number of input bits increases. A recent work shows that by decomposing the target function approximately, the total number of LUT entries can be dramatically reduced, leading to significant energy saving. However, its heuristic approximate decomposition algorithm explores the solution space greedily, so the approximation quality is not optimal. Also, its rigid hardware architecture only supports disjoint decomposition and may have unnecessary extra power consumption in some cases. To address these issues, we develop a novel beam search and simulated annealing-based approximate decomposition algorithm, which can reduce error by 11.1%. We also implement a non-disjoint approximate decomposition method and propose two reconfigurable architectures. The first one has 10.4% less error using 19.2% less energy and the other has 23.0% less error with the same energy consumption compared to the state-of-the-art design.
00:39 CET FPGA ACCELERATION OF GCN IN LIGHT OF THE SYMMETRY OF GRAPH ADJACENCY MATRIX
Authors:
Gopikrishnan Raveendran Nair1, Han-sok Suh1, Mahantesh Halappanavar2, Frank Liu3, Jae-sun Seo1 and Yu Cao1
1Arizona State University, US; 2Pacific Northwest National Laboratory, US; 3Oak Ridge National Lab, US
Abstract
Graph Convolutional Neural Networks (GCNs) are widely used to process large-scale graph data. Different from deep neural networks (DNNs), GCNs are sparse, irregular, and unstructured, posing unique challenges to hardware acceleration with regular processing elements (PEs). In particular, the adjacency matrix of a GCN is extremely sparse, leading to frequent but irregular memory access, low spatial-temporal data locality and poor data reuse. Furthermore, a realistic graph usually consists of unstructured data (e.g., unbalanced distributions), creating significantly different processing times and imbalanced workload for each node in GCN acceleration. To overcome these challenges, we propose an end-to-end hardware-software co-design to accelerate GCNs on resource-constrained FPGAs with the following features: (1) A custom dataflow that leverages symmetry along the diagonal of the adjacency matrix to accelerate feature aggregation for undirected graphs. We utilize either the upper or the lower triangular matrix of the adjacency matrix to perform aggregation in GCN to improve data reuse. (2) Unified compute cores for both aggregation and transform phases, with full support for the symmetry-based dataflow. These cores can be dynamically reconfigured to the systolic mode for transformation or as individual accumulators for aggregation in GCN processing. (3) Preprocessing of the graph in software to rearrange the edges and features to match the custom dataflow. This step improves the regularity in memory access and data reuse in the aggregation phase. Moreover, we quantize the GCN precision from FP32 to INT8 to reduce the memory footprint without losing the inference accuracy. We implement our accelerator design on an Intel Stratix 10 MX FPGA board with HBM2, and demonstrate significant improvement in end-to-end GCN operations as compared to the state of the art, on the graph datasets of Cora, Pubmed, Citeseer and Reddit.
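The arithmetic identity behind feature (1) can be stated compactly: for an undirected graph, A = U + U^T with U the strict upper triangle, so the aggregation A·X can be computed from U alone. A small NumPy sketch of this identity (only the idea, not the FPGA dataflow or the INT8 pipeline) follows.

    import numpy as np

    rng = np.random.default_rng(0)
    n, f = 6, 4
    upper = np.triu(rng.integers(0, 2, size=(n, n)), k=1).astype(float)  # edges stored once
    A = upper + upper.T                      # full symmetric adjacency (no self-loops)
    X = rng.standard_normal((n, f))          # node feature matrix

    agg_full = A @ X                         # standard aggregation with the full matrix
    agg_sym = upper @ X + upper.T @ X        # same result using only the upper triangle

    assert np.allclose(agg_full, agg_sym)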
00:39 CET PR-ESP: AN OPEN-SOURCE PLATFORM FOR DESIGN AND PROGRAMMING OF PARTIALLY RECONFIGURABLE SOCS
Authors:
Biruk Seyoum, Davide Giri, Kuan-Lin Chiu, Bryce Natter and Luca Carloni, Columbia University, US
Abstract
Despite its presence for more than two decades and its proven benefits in expanding the space of system design, dynamic partial reconfiguration (DPR) is rarely integrated into frameworks and platforms that are used to design complex reconfigurable system-on-chip (SoC) architectures. This is due to the complexity of the DPR FPGA flow as well as the lack of architectural and software runtime support to enable and fully harness DPR. Moreover, as DPR designs involve additional design steps and constraints, they often have a higher FPGA compilation (RTL-to-bitstream) runtime compared to equivalent monolithic designs. In this work, we present PR-ESP, an open-source platform for a system-level design flow of partially reconfigurable FPGA-based SoC architectures targeting embedded applications that are deployed on resource-constrained FPGAs. Our approach is realized by combining SoC design methodologies and tools from the open-source ESP platform with a fully-automated DPR flow that features a novel size-driven technique for parallel FPGA compilation. We also developed a software runtime reconfiguration manager on top of Linux. Finally, we evaluated our proposed platform using the WAMI-App benchmark application on Xilinx VC707.
00:39 CET ISOP: MACHINE LEARNING ASSISTED INVERSE STACK-UP OPTIMIZATION FOR ADVANCED PACKAGE DESIGN
Authors:
Hyunsu Chae1, Bhyrav Mutnury2, Keren Zhu1, Doug Wallace2, Doug Winterberg2, Daniel de Araujo3, Jay Reddy2, Adam Klivans1 and David Z. Pan1
1University of Texas at Austin, US; 2Dell Enterprise Product Group, US; 3Siemens EDA, US
Abstract
Future computing calls for heterogeneous integration, e.g., the recent adoption of the chiplet methodology. However, high-speed cross-chip interconnects and packaging will be critical for the overall system performance. As an example of advanced packaging, a high-density interconnect (HDI) printed circuit board (PCB) has been widely used in complex electronics from cell phones to computing servers. A modern HDI PCB may have over 20 layers, each with its unique material properties and geometrical dimensions, i.e., stack-up, to meet various design constraints and performance optimizations. However, stack-up design is usually done manually in the industry, where experienced designers may devote many hours adjusting the physical dimensions and materials in order to meet the desired specifications. This process is time-consuming, tedious, and sub-optimal, largely depending on the designer's expertise. In this paper, we propose to automate stack-up design with a new framework, ISOP, using machine learning for inverse stack-up optimization for advanced package design. Given a target design specification, ISOP automatically searches for ideal stack-up design parameters while optimizing performance. We develop a novel machine learning-assisted hyper-parameter optimization method to make the search efficient and reliable. Experimental results demonstrate that ISOP is 11.6%, 40.7%, 24.8%, and 26.6% better in figure-of-merit (FoM) than conventional simulated annealing and Bayesian optimization algorithms, with all our design targets met and a shorter runtime. We also compare our fully-automated ISOP with expert designers in the industry and achieve very promising results, with orders of magnitude reduction in turn-around time.
00:39 CET FAST AND ACCURATE WIRE TIMING ESTIMATION BASED ON GRAPH LEARNING
Authors:
Yuyang Ye1, Tinghuan Chen2, Yifei Gao1, Hao Yan1, Bei Yu2 and Longxing Shi1
1Southeast University, CN; 2The Chinese University of Hong Kong, HK
Abstract
Accurate wire timing estimation has become a bottleneck in timing optimization because of the long turn-around time of a sign-off timer. Unlike gate timing, which is calculated accurately by interpolating lookup tables in cell libraries, wire timing calculation has low accuracy and efficiency for complex RC nets, including both tree-like and non-tree nets. The limited number of wire paths opens a door for graph learning methods to estimate wire timing. In this work, we present a fast and accurate wire timing estimator based on a novel graph learning architecture, namely GNNTrans. GNNTrans can generate wire path representations by aggregating local structure information and global relationships of whole RC nets, which cannot be collected efficiently with traditional graph learning work. Experimental results on both tree-like and non-tree nets of real-world open-source designs demonstrate improved accuracy, with the max error of wire delay being lower than 5 ps. Our estimator can predict the timing of over 200K nets in less than 100 seconds and can be integrated into incremental timing optimization.
00:39 CET DTOC: INTEGRATING DEEP-LEARNING DRIVEN TIMING OPTIMIZATION INTO STATE-OF-THE-ART COMMERCIAL EDA TOOL
Authors:
Kyungjoon Chang1, Heechun Park2, Jaehoon Ahn1, Kyu-Myung Choi1 and Taewhan Kim1
1Seoul National University, KR; 2Kookmin University, KR
Abstract
Recently, deep-learning (DL) models have received considerable attention for timing prediction in the placement and routing (P&R) flow. As yet, prior DL-based works are confined to timing prediction at the time-consuming global routing stage, and very few have addressed the timing prediction problem at the placement, i.e., pre-route, stage. This is because it is not easy to "accurately" predict various timing parameters at the pre-route stage. Moreover, no work has addressed a seamless link of timing prediction at the pre-route stage to the final timing optimization through making use of commercial P&R tools. In this work, we propose a framework, called DTOC, to be used at the pre-route stage for this end. Precisely, the framework is composed of two models: (1) a DL-driven arc delay and arc output slew prediction model, performing in two levels: (level-1) predicting net resistance (R), net capacitance (C), and arc length (Len), followed by (level-2) predicting arc delay and arc output slew from the R/C/Len prediction obtained in (level-1); (2) a timing optimization model, which uses the inference outcomes of our DL-driven prediction model to enable the commercial P&R tools to calculate the full path delays and set updated timing margins on paths, so that the P&R tools can use more accurate margins in timing optimization. Experimental results show that, by using our DTOC framework during timing optimization in P&R, we improve the pre-route prediction accuracy on arc delay and arc output slew by 20∼26% on average, and improve the WNS, TNS, and the number of timing violation paths by 50∼63% on average.
00:39 CET RL-LEGALIZER: REINFORCEMENT LEARNING-BASED CELL PRIORITY OPTIMIZATION IN MIXED-HEIGHT STANDARD CELL LEGALIZATION
Authors:
Sung-Yun Lee1, Seonghyeon Park2, Daeyeon Kim2, Minjae Kim2, Tuyen Le3 and Seokhyeong Kang4
1Pohang University of Science and Technology (POSTECH), KR; 2POSTECH, KR; 3AgileSoDA, KR; 4Pohang University of Science and Technology, KR
Abstract
Cell legalization order has a substantial effect on the quality of modern VLSI designs, which use mixed-height standard cells. In this paper, we propose a deep reinforcement learning framework to optimize cell priority in the legalization phase of various designs. We extract the selected features of movable cells and their surroundings, then embed them into cell-wise deep neural networks. We then determine cell priority and legalize them in order using a pixel-wise search algorithm. The proposed framework uses a policy gradient algorithm and several training techniques, including grid-cell subepisode, data normalization, reduced-dimensional state, and network optimization. We aim to resolve the suboptimality of existing sequential legalization algorithms with respect to displacement and wirelength. On average, our proposed framework achieved 34% lower legalization costs in various benchmarks compared to that of the state-of-the-art legalization algorithm.
00:39 CET NEURAL NETWORK ON THE EDGE: EFFICIENT AND LOW COST FPGA IMPLEMENTATION OF DIGITAL PREDISTORTION IN MIMO SYSTEMS
Authors:
Yiyue Jiang1, Andrius Vaicaitis2, John Dooley2 and Miriam Leeser1
1Department of Electrical and Computer Engineering at Northeastern University, US; 2Department of Electronic Engineering at Maynooth University, IE
Abstract
Processing on the edge with low latency is increasingly important in wireless systems, and reconfigurable hardware provides the necessary performance for this application domain. The focus of this research is Digital PreDistortion (DPD) in Multiple-Input and Multiple-Output (MIMO) systems, and particularly processing data to rapidly adapt to the current wireless environment. We accomplish this by implementing a training-sample-reduced and hardware-efficient Real-Valued Time-Delay Neural Network (RVTDNN) on FPGA hardware. Data streams from four Power Amplifiers (PAs) are combined and processed using a single neural network to perform DPD. This implementation exploits a novel way to minimize the training signal sample selection based on a biased probability density function (pdf) to reduce the training time of the neural network. The paper describes hardware-specific optimizations in the implementation of the NN, including a pipelined RVTDNN structure with low digital cost and high throughput, and an activation function that is efficient to implement in hardware. The design has been validated using an AMD/Xilinx RFSoC ZCU216 board and surpasses the data throughput of conventional RVTDNN-based DPD while only using a fraction of their hardware utilization.
00:39 CET QUANTISED NEURAL NETWORK ACCELERATORS FOR LOW-POWER IDS IN AUTOMOTIVE NETWORKS
Authors:
Shashwat Khandelwal, Anneliese Walsh and Shreejith Shanker, Trinity College Dublin, IE
Abstract
Rising connectivity in vehicles, driven by an increase in in-vehicle driving assistance with systems like Advanced Driver Assistance Systems (ADAS), has exposed inherent hidden vulnerabilities in modern-day vehicular networks. This has led to an increase in both active (injection) and passive (eavesdropping/sniffing) attacks in vehicles. Intrusion detection systems (IDSs) have been proposed for automotive networks to address these risks; however, these often struggle to meet the latency and power requirements for an in-vehicle deployment. In this paper, we present low-power custom quantised Multi-Layer Perceptrons (MLPs) as an Intrusion Detection System for the automotive controller area network (CAN). We utilise the FINN framework from AMD/Xilinx to quantise, train and generate the hardware IP of our MLP to detect denial of service (DoS), fuzzing, and spoofing attacks (RPM & Gear) on the CAN network, using a ZCU104 (XCZU7EV) FPGA as our target ECU architecture with integrated IDS capabilities. Our approach achieves significant improvements in latency (0.12 ms per-message processing latency) and inference energy consumption (0.25 mJ per inference) while achieving similar classification performance as state-of-the-art approaches proposed in the research literature.

S_D7 Logical and physical analysis and design

Date: Monday, 17 April 2023
Time: 16:30 - 18:00 CET

Time Label Presentation Title
Authors
00:39 CET SYNTHESIS AND UTILIZATION OF STANDARD CELLS AMENABLE TO GEAR RATIO OF GATE-METAL PITCHES FOR IMPROVING PIN ACCESSIBILITY
Authors:
Jooyeon Jeong, Sehyeon Chung, Kyeongrok Jo and Taewhan Kim, Seoul National University, KR
Abstract
Traditionally, the synthesis of standard cells invariably assumes that the gear ratio (GR) between the gate poly pitch in the cells and the metal pitch of the first vertical metal layer (to be used for routing) over the gate poly is 1:1 for chip implementation. However, the scaling trend in sub-10nm node CMOS designs is that GR is changing from 1:1 to 3:2 or 4:3, which means the number and location of pin access points vary depending on the cell placement location, thereby making pins hard to access if the pin access points are aligned with off-track routing patterns. This work overcomes the pin inaccessibility problem caused by non-1:1 GR in chip implementation. Precisely, we propose a non-1:1 GR aware DTCO (design and technology co-optimization) flow to generate cells with pin patterns that are best suited to the implementation of the target design. To this end, we propose two new tasks to be installed in our DTCO framework: (1) from the existing cells optimized for 1:1 GR, we relocate their pin patterns to be amenable to non-1:1 GR, so that maximal pin accessibility is achieved; (2) we incrementally update the pin patterns of the cell instances with routing failures due to pin inaccessibility in the course of the DTCO iterations to produce cells with pin patterns best fitted to the implementation of the target design. Through experiments with benchmark circuits, it is shown that our DTCO methodology optimizing pin patterns amenable to non-1:1 GR is able to produce chip implementations with on average 5.88× fewer routing failures at no additional wirelength, timing, or power cost.
00:39 CET CENTER-OF-DELAY: A NEW METRIC TO DRIVE TIMING MARGIN AGAINST SPATIAL VARIATION IN COMPLEX SOCS
Authors:
Christian Lutkemeyer1 and Anton Belov2
1Marvell Semiconductor, Inc., US; 2Synopsys Inc, IE
Abstract
Complex VLSI SOCs are manufactured on large 300 mm wafers. Individual SOCs can show significant spatial performance gradients on the order of 10% per 10 mm. The traditional approach to handling this variation in STA tools is a margin look-up table indexed by the diagonal of the bounding box around the gates in a timing path. In this paper we propose a new approach based on the concept of the Center-of-Delay of a timing path. We justify this new approach theoretically for linear performance gradients and present experimental data showing that the new approach is both safe and significantly less pessimistic than the existing method.
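The abstract does not spell out the metric's definition; one plausible reading, stated purely as an illustrative assumption, is a delay-weighted centroid of the stages of a path with delays d_i at locations p_i = (x_i, y_i),

    \vec{c}_{\text{path}} = \frac{\sum_i d_i \, \vec{p}_i}{\sum_i d_i},

so that stages contributing most of the delay dominate the location used to look up the spatial-variation margin, rather than the bounding-box diagonal.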
00:39 CET A NOVEL DELAY CALIBRATION METHOD CONSIDERING INTERACTION BETWEEN CELLS AND WIRES
Authors:
Leilei Jin, Jia Xu, Wenjie Fu, Hao Yan, Xiao Shi, Ming Ling and Longxing Shi, Southeast University, CN
Abstract
In advanced technology nodes, the accuracy of cell and wire delay modeling is a key metric for timing analysis. However, when the supply voltage decreases to the near-threshold regime, complicated process variation effects make the cell delay and the wire delay hard to model. Most researchers study cell or wire delay separately, ignoring the correlation between them. In this paper, we propose an N-sigma delay model by characterizing different sigma levels (-3σ to +3σ) of the cell and wire delay distribution. The N-sigma cell delay model is represented by the first four moments and calibrated by the operating conditions (input slew, output load). Meanwhile, based on the Elmore model, the wire delay variability is calculated by considering the effect of drive and load cells. The delay models are verified through the ISCAS85 benchmarks and the functional units of the PULPino processor with TSMC 28 nm technology. Compared to the SPICE results, the average errors for estimating the +3σ and -3σ cell delay are 2.1% and 2.7%, and those of the wire delay are 2.4% and 1.6%, respectively. The errors of path delay analysis stay below 6.6% and the analysis is 103X faster than SPICE Monte Carlo simulations.
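Since the wire model builds on the Elmore delay, the standard formula may be a useful reference: for an RC tree driven at source s, the Elmore delay to node i is

    T_D(i) = \sum_{k} R_{ik} \, C_k,

where C_k is the capacitance at node k and R_{ik} is the resistance of the portion of the s-to-i path that is shared with the s-to-k path; for a simple RC ladder of n sections this reduces to T_D = \sum_{j=1}^{n} R_j \sum_{k=j}^{n} C_k.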
00:39 CET RETHINKING NPN CLASSIFICATION FROM FACE AND POINT CHARACTERISTICS OF BOOLEAN FUNCTIONS
Authors:
Jiaxi Zhang1, Shenggen Zheng2, Liwei Ni3, Huawei Li3 and Guojie Luo1
1Peking University, CN; 2Peng Cheng Laboratory, CN; 3Institute of Computing Technology, Chinese Academy of Sciences, CN
Abstract
NPN classification is an important problem in the synthesis flows of digital circuits. Most existing works explored variable symmetries and cofactor signatures to develop their classification methods. However, cofactor signatures only consider the face characteristics of Boolean functions. In this paper, we propose a new NPN classifier using both face and point characteristics of Boolean functions, including cofactor, influence, and sensitivity. The new method brings a new perspective to the classification of Boolean functions. The classifier only needs to compute some signatures, and the equality of corresponding signatures is a prerequisite for NPN equivalence. Therefore, these signatures can be directly used for NPN classification, thus avoiding the exhaustive transformation enumeration. The experiments show that the proposed NPN classifier gains better NPN classification accuracy with comparable speed.
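To make the point characteristics concrete, the sketch below computes two of the named signatures, influence and (maximum) sensitivity, for a Boolean function given as a truth table; sorting the influences makes the vector invariant under input permutation, which is what lets such signatures act as NPN-classification filters. This is an illustrative baseline, not the paper's full signature set or procedure.

    from itertools import product

    def signatures(f, n):
        # f: dict mapping n-bit input tuples to 0/1 (a truth table).
        points = list(product((0, 1), repeat=n))
        onset = sum(f[x] for x in points)                 # simple cofactor-style count
        max_sensitivity = 0
        flip_counts = [0] * n
        for x in points:
            local = 0
            for i in range(n):
                y = list(x); y[i] ^= 1
                if f[x] != f[tuple(y)]:
                    local += 1
                    flip_counts[i] += 1
            max_sensitivity = max(max_sensitivity, local)
        influence = sorted(c / len(points) for c in flip_counts)  # permutation-invariant
        return onset, influence, max_sensitivity

    AND = {x: x[0] & x[1] for x in product((0, 1), repeat=2)}
    print(signatures(AND, 2))    # (1, [0.5, 0.5], 2)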
00:39 CET EXACT SYNTHESIS BASED ON SEMI-TENSOR PRODUCT CIRCUIT SOLVER
Authors:
Hongyang Pan1 and Zhufei Chu2
1Ningbo University, CN; 2Ningbo University, CN
Abstract
In logic synthesis, Boolean satisfiability (SAT) is widely used as a reasoning engine, especially for exact synthesis. By representing input formulas as logic circuits instead of conjunctive normal forms (CNFs) as in off-the-shelf CNF-based SAT solvers, circuit-based SAT solvers make decoding a solution easier. An exact synthesis method based on a semi-tensor product (STP) circuit solver is presented in this paper. As opposed to other SAT-based exact synthesis algorithms, our method calculates the matrix forms of logic circuits efficiently so that all optimal solutions are obtained in one pass. In particular, all solutions are expressed as 2-input lookup tables (2-LUTs), rather than homogeneous logic representations. Hence, different costs can be considered when selecting the optimal circuit. In experiments, we demonstrate that our method achieves up to a 6.9X speedup in runtime while reducing timeout instances by up to 63%.
00:39 CET AN EFFECTIVE AND EFFICIENT HEURISTIC FOR RATIONAL-WEIGHT THRESHOLD LOGIC GATE IDENTIFICATION
Authors:
Ting Yu Yeh, Yueh Cho and Yung Chih Chen, National Taiwan University of Science and Technology, TW
Abstract
In CMOS-based current-mode realization, the threshold logic gate (TLG) implementation with rational weights has been shown to be more cost-effective than the conventional TLG implementation without rational weights. The existing method for rational-weight TLG identification is an integer linear programming (ILP)-based method, which could suffer from inefficiency for a Boolean function with a large number of inputs. This paper presents a heuristic for rational-weight TLG identification. We observe that in the ILP formulation, many variables related to the rational weights are redundant according to the ILP solutions. Additionally, a rational-weight TLG can be transformed from a conventional TLG. Thus, the proposed method aims to identify the conventional TLG that can be transformed into a rational-weight TLG with lower implementation cost. We conducted experiments on a set of TLGs with 4∼15 inputs. The results show that the proposed method has competitive quality and is much more efficient compared to the ILP-based method.
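For readers unfamiliar with the gate model: a threshold logic gate with weights w_i and threshold T computes

    f(x_1, \dots, x_n) = \begin{cases} 1, & \text{if } \sum_{i=1}^{n} w_i x_i \ge T, \\ 0, & \text{otherwise,} \end{cases}

and a rational-weight TLG simply allows the w_i and T to be rational rather than integer-only; for example, (w_1, w_2, w_3; T) = (1, 1/2, 1/2; 1) realizes x_1 OR (x_2 AND x_3).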
00:39 CET FAST STA GRAPH PARTITIONING FRAMEWORK FOR MULTI-GPU ACCELERATION
Authors:
Guannan Guo1, Tsung-Wei Huang2 and Martin Wong3
1UIUC, US; 2University of Utah, US; 3The Chinese University of Hong Kong, HK
Abstract
Path-based Analysis (PBA) is a key process in Static Timing Analysis (STA) to reduce excessive slack pessimism. However, PBA can easily become the major performance bottleneck due to its extremely long execution time. To overcome this bottleneck, recent STA research has proposed to accelerate PBA algorithms with manycore CPU and GPU parallelism. However, GPU memory is rather limited when we compute PBA on large industrial designs with millions of gates. In this work, we introduce a new endpoint-oriented partitioning framework that can separate STA graphs and dispatch the PBA workload onto multiple GPUs. Our framework can quickly identify logic overlaps among endpoints and group endpoints based on the size of shared logic. We then recover graph partitions from the endpoint groups and offload independent PBA workloads to multiple GPUs. Experiments show that our framework can greatly accelerate the PBA process on designs with over 10M gates.
00:39 CET TOFU: A TWO-STEP FLOORPLAN REFINEMENT FRAMEWORK FOR WHITESPACE REDUCTION
Authors:
Shixiong Kai1, Chak-Wa Pui2, Fangzhou Wang3, Jiang Shougao4, Bin Wang1, Yu Huang5 and Jianye Hao6
1Huawei Noah's Ark Lab, CN; 2UniVista, CN; 3The Chinese University of Hong Kong, HK; 4Hisilicon, CN; 5HiSilicon, CN; 6Tianjin University, CN
Abstract
Floorplanning, as an early step in physical design, greatly affects the PPA of the later stages. To achieve better performance while maintaining roughly the same chip size, the utilization of the generated floorplan needs to be high, and constraints related to design rules, routability, and power should be honored. In this paper, we propose a two-step framework, called TOFU, for floorplan whitespace reduction with fixed-outline and soft/pre-placed/hard modules modeled. Whitespace is first reduced by iteratively refining the locations of modules. Then the modules near whitespace are changed into rectilinear shapes to further improve the utilization. To ensure the legality and quality of the intermediate floorplan during the refinement process, a constraint graph-based legalizer with a novel constraint graph construction method is proposed. Experimental results show that the whitespace of the initial floorplans generated by Corblivar can be reduced by about 70% on average and up to 90% in several cases. Moreover, the resulting wirelength is also 3% shorter due to higher utilization.
00:39 CET ROUTABILITY PREDICTION USING DEEP HIERARCHICAL CLASSIFICATION AND REGRESSION
Authors:
Daeyeon Kim1, Jakang Lee2 and Seokhyeong Kang2
1POSTECH, KR; 2Pohang University of Science and Technology, KR
Abstract
In physical design, routability indicates whether the routing is feasible or not, and it can be evaluated after the post-route. Routability prediction can forecast the locations where design rule violations occur without routing and thus can speed up the design iterations by skipping the time-consuming routing tasks. In this paper, we define tile-level routability prediction as a pixel-wise regression problem. We then propose a deep hierarchical classification and regression (HCR) model that can detect hotspots and evaluate tile-level routability as a continuous value, rather than a binary classification (as in previous studies). The hierarchical inference flow can prevent bias toward predicting the majority samples in the data-imbalance problem. Furthermore, we introduce a training method for the proposed HCR model that uses Bayesian optimization to quickly find the ideal modeling parameters and incorporates transfer learning for the regression model. We achieved an R2 score of 0.71 for the regression and increased the F1 score in the binary classification by 94% compared to that presented in the state-of-the-art. We also obtained a 0.890 Pearson correlation between the ground-truth and prediction values in the layout-level routability.
00:39 CET ENABLING EFFICIENT DESIGN RULE CHECKING WITH GPU ACCELERATION
Authors:
Wei Zhong1, Zhenhua Feng1, Zhuolun He2, Weimin Wang1, Yuzhe Ma3 and Bei Yu2
1Dalian University of Technology, CN; 2The Chinese University of Hong Kong, HK; 3The Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
Design Rule Checking (DRC) is an essential part of the chip design flow, which ensures that manufacturing requirements are met to avoid chip failure. With the rapid increase of design scales, DRC has been suffering from runtime overhead. To overcome this challenge, we propose to accelerate DRC algorithms by harnessing the power of graphics processing units (GPUs). Specifically, we first explore an efficient data transfer approach for the geometry information of a layout. Then we investigate GPU-based scanline algorithms to accommodate both intra-polygon checking and inter-polygon checking based on the characteristics of the design rules. Experimental results show that the proposed GPU-accelerated method can substantially outperform a multi-threaded DRC algorithm using CPUs. Compared with the baseline with 24 threads, we achieve an average speedup of 36 times and 201 times for spacing rule checks and enclosing rule checks on a metal layer, respectively.
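As a point of reference for what a spacing rule checks, a plain Python sketch is given below: the Euclidean edge-to-edge distance between axis-aligned rectangles compared against a minimum spacing. This is only the rule in its simplest form with an O(n^2) loop; the paper's GPU scanline formulation is not reproduced, and all shapes and values are illustrative.

    def rect_spacing(r1, r2):
        # Rectangles are (x_lo, y_lo, x_hi, y_hi); returns 0.0 if they touch or overlap.
        x1l, y1l, x1h, y1h = r1
        x2l, y2l, x2h, y2h = r2
        dx = max(x2l - x1h, x1l - x2h, 0.0)
        dy = max(y2l - y1h, y1l - y2h, 0.0)
        return (dx * dx + dy * dy) ** 0.5

    def spacing_violations(rects, min_space):
        # Brute-force reference check; production DRC uses scanline/sweep structures.
        viol = []
        for i in range(len(rects)):
            for j in range(i + 1, len(rects)):
                d = rect_spacing(rects[i], rects[j])
                if 0.0 < d < min_space:
                    viol.append((i, j, d))
        return viol

    shapes = [(0, 0, 2, 2), (3, 0, 5, 2), (2.5, 3, 4, 5)]
    print(spacing_violations(shapes, min_space=1.1))   # [(0, 1, 1.0), (1, 2, 1.0)]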
00:39 CET MITIGATING LAYOUT DEPENDENT EFFECT-INDUCED TIMING RISK IN MULTI-ROW-HEIGHT DETAILED PLACEMENT
Authors:
Li-Chen Wang and Shao-Yun Fang, National Taiwan University of Science and Technology, TW
Abstract
With the development of advanced process technology, the electrical characteristic variation of MOSFET transistors has been seriously influenced by layout dependent effects (LDEs), such as the length of oxide diffusion (LOD) and the oxide-to-oxide spacing effect (OSE). Due to these LDEs, two cells of specific cell types may suffer from timing degradation when they are adjacently and closely placed with specific orientations. To mitigate the timing risk of critical paths and thus optimize the performance of a target design, this work focuses on multi-row-height detailed placement with cell flipping and cell shifting. We first develop an integer linear programming (ILP) formulation that can optimally solve the problem in terms of risky cell abutments and cell displacements. After that, a dynamic programming (DP)-based method is proposed that is much more efficient and can also derive near-optimal solutions. In addition, in contrast with the ILP, which can only derive solutions within acceptable runtimes by partitioning a design into sub-problems, the DP-based method is able to simultaneously consider the risky abutments on the critical paths of the entire circuit at a time. Experimental results show the efficiency and effectiveness of the proposed DP-based approach.
00:39 CET A TWO-STAGE PCB ROUTING ALGORITHM USING POLYGON-BASED DYNAMIC PARTITIONING AND MCTS
Authors:
Youbiao He1, Hebi Li2, Ge Luo2 and Forrest Sheng Bao3
1Iowa State University, US; 2Iowa State University, US; 3Iowa State University, US
Abstract
Routing is a vital step in printed circuit board (PCB) designs. With the rapid increase of design scales, manual routing has become very time-consuming. Previous works typically divide the problem into escape routing and area routing. However, existing escape routing approaches usually do not consider the quality of area routing among the chip components. As a result, the solutions to two subproblems do not always couple well. In addition, traditional escape routing mainly focuses on the connections between the regular ball grid array (BGA) of chip packages. However, real PCB designs usually contain non-BGA packages, making the PCB routing very challenging when applying the previous methods to solve it. To address the above problems, we propose a net-by-net two-stage PCB routing approach, including a Monte Carlo tree search (MCTS)-based global routing and an A*-based detailed routing approach. Furthermore, a polygon-based dynamic routable region partitioning mechanism is designed to minimize the gap between the global and detailed routing and to route for PCB designs with non-BGA packages. Experimental results show that our approach outperforms the state-of-the-art routers. Specifically, our algorithm successfully routes all the given test cases with shorter wirelengths while other state-of-the-art routers either fail to complete the routing or generate longer wires for each test case.
00:39 CET DEEPTH: CHIP PLACEMENT WITH DEEP REINFORCEMENT LEARNING USING A THREE-HEAD POLICY NETWORK
Authors:
Dengwei Zhao, Shuai Yuan, Yanan Sun, Shikui Tu and Lei Xu, Shanghai Jiao Tong University, CN
Abstract
Modern very-large-scale integrated (VLSI) circuit placement with a huge state space is a critical task for achieving layouts with high performance. Recently, reinforcement learning (RL) algorithms have made a promising breakthrough in dramatically saving design time compared to human effort. However, the previous RL-based works either require a large dataset of chip placements for pre-training or produce illegal final placement solutions. In this paper, DeepTH, a three-head policy gradient placer, is proposed to learn from scratch without the need of pre-training and to generate superior chip floorplans. A graph neural network is initially adopted to extract the features from nodes and nets of chips for estimating the policy and value. To efficiently improve the quality of floorplans, a reconstruction head is employed in the RL network to recover the visual representation of the current placement, by enriching the extracted features of the placement embedding. Besides, the reconstruction error is used as a bonus during training to encourage exploration while alleviating the sparse reward problem. Furthermore, expert knowledge of floorplanning preference is embedded into the decision process to narrow down the potential action space. Experiment results on the ISPD2005 benchmark show that our method achieves a 19.02% HPWL improvement over the analytic placer DREAMPlace and at least a 19.89% improvement over state-of-the-art RL algorithms.

S_S1 Pitches - Security of emerging technologies and machine learning

Date: Monday, 17 April 2023
Time: 16:30 - 16:57 CET

Time Label Presentation Title
Authors
16:30 CET PRIVACY-PRESERVING NEURAL REPRESENTATION FOR BRAIN-INSPIRED LEARNING
Authors:
Javier Roberto Rubalcava-Cortes1, Alejandro Hernandez Cano1, Alejandra Citlalli Pacheco Tovar1, Farhad Imani2, Rosario Cammarota3 and Mohsen Imani4
1Universidad Nacional Autonoma de Mexico, MX; 2University of Connecticut, US; 3Intel Labs, US; 4University of California Irvine, US
Abstract
In this paper, we propose BIPOD, a brain-inspired privacy-oriented machine learning approach. Our method rethinks privacy-preserving mechanisms by looking at how the human brain provides effective privacy with minimal cost. BIPOD exploits hyperdimensional computing (HDC) as a neurally-inspired computational model. HDC is motivated by the observation that the human brain operates on high-dimensional data representations. In HDC, objects are thereby encoded with high-dimensional vectors, called hypervectors, which have thousands of elements. BIPOD exploits this encoding as a holographic projection with both cryptographic and randomization-based features. BIPOD encoding is performed using a set of brain keys that are generated randomly. Therefore, attackers cannot get encoded data without accessing the encoding keys. In addition, revealing the encoding keys does not directly translate to information loss. We enhance the BIPOD encoding method to mathematically create perturbation on encoded neural patterns to ensure that only a limited amount of information can be extracted from the encoded data. Since BIPOD encoding is a part of the learning process, it can be optimized together with learning to provide the best trade-off between accuracy, privacy, and efficiency. Our evaluation on a wide range of applications shows that BIPOD privacy-preserving techniques result in 11.3× higher information privacy with no loss in classification accuracy. In addition, at the same quality of learning, BIPOD provides significantly higher information privacy compared to state-of-the-art privacy-preserving techniques.
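For readers new to HDC, a minimal NumPy sketch of the underlying encode/bundle/compare pattern the paper builds on is shown below: inputs are projected with random "key" vectors into a high-dimensional bipolar space, class prototypes are formed by bundling (summation), and inference is similarity search. The privacy-specific perturbation and key-management mechanisms of BIPOD are not reproduced; all names and parameters here are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    D = 10_000                                   # hypervector dimensionality

    def encode(features, basis):
        # Random projection followed by a sign nonlinearity: one common HDC encoder.
        return np.sign(basis @ features)

    basis = rng.standard_normal((D, 5))          # the randomly generated encoding 'keys'
    train = [(rng.standard_normal(5) + 3 * c, c) for c in (0, 1) for _ in range(20)]

    prototypes = np.zeros((2, D))                # bundle encoded samples per class
    for x, label in train:
        prototypes[label] += encode(x, basis)

    def classify(x):
        h = encode(x, basis)
        sims = prototypes @ h / (np.linalg.norm(prototypes, axis=1) * np.linalg.norm(h))
        return int(np.argmax(sims))

    print(classify(rng.standard_normal(5) + 3))  # expected: class 1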
16:33 CET EXPLOITING SHORT APPLICATION LIFETIMES FOR LOW COST HARDWARE ENCRYPTION IN FLEXIBLE ELECTRONICS
Authors:
Nathaniel Bleier1, Muhammad Mubarik1, Suman Balaji2, Francisco Rodriguez2, Antony Sou2, Scott White2 and Rakesh Kumar1
1UIUC, US; 2PragmatIC Semiconductor, GB
Abstract
Many emerging flexible electronics applications require hardware-based encryption, but it is unclear if practical hardware-based encryption is possible for flexible applications due to the stringent power requirements of these applications and the higher area and power overheads of flexible technologies. In this work, we observe that the lifetime of many flexible applications is so short that often one key suffices for the entire lifetime. This means that, instead of generating keys and round keys in hardware, we can generate the round keys offline and instead store these round keys directly in the engine. This eliminates the need for hardware for dynamic generation of round keys, which significantly reduces encryption overhead. This significant reduction in encryption overhead allows us to demonstrate the first practical flexible encryption engines. To prevent an adversary from reading out the stored round keys, we scramble the round keys before storing them in the ROM; camouflage cells are used to unscramble the keys before feeding them to logic. In spite of the unscrambling overhead, our encryption engines consume 27.4% less power than already heavily area- and power-optimized baselines, while being 21.9% smaller on average.
16:36 CET ATTACKING RERAM-BASED ARCHITECTURES USING REPEATED WRITES
Authors:
Biresh Kumar Joardar1 and Krishnendu Chakrabarty2
1University of Houston, US; 2Duke University, US
Abstract
Resistive random-access memory (ReRAM) is a promising technology for both memory and in-memory computing. However, these devices have security vulnerabilities that are yet to be adequately investigated. In this work, we identify one such vulnerability that exploits the write mechanism in ReRAMs. Whenever a cell/row is written, a constant bias is automatically applied to the remaining cells/rows to reduce sneak current. We develop a new attack (referred to as WriteHammer) that exploits this process. By repeatedly exposing a subset of cells to this bias, WriteHammer can cause noticeable resistance drift in the victim ReRAM cells. Experimental results indicate that WriteHammer can cause up to a 3.5X change in cell resistance by simply writing to the ReRAM cells.
16:39 CET SECURITY EVALUATION OF A HYBRID CMOS/MRAM ASCON HARDWARE IMPLEMENTATION
Authors:
Nathan Roussel, Olivier Potin, Jean-Max Dutertre and Jean-Baptiste Rigaud, Mines Saint-Etienne, CEA, Leti, Centre CMP F-13541 Gardanne, France, FR
Abstract
As the number of IoT objects is growing fast, power consumption and security become major concerns in the design of integrated circuits. Lightweight Cryptography (LWC) algorithms aim to secure the communications of these connected objects at the lowest energy impact. To reduce the energy footprint of cryptographic primitives, several LWC hardware implementations embedding hybrid CMOS/MRAM-based cells have been investigated. These architectures use the non-volatile characteristic of MRAM to store data manipulated in the algorithm computation. We provide in this work a security evaluation of a hybrid CMOS/MRAM hardware implementation of the ASCON cipher, a finalist of the National Institute of Standards and Technology LWC contest. We focus on a simulation flow using current EDA tools capable of carrying out power analysis for side-channel attacks, for the purpose of assessing potential weaknesses of MRAM hybridization. Differential Power Analysis (DPA) and Correlation Power Analysis (CPA) are conducted on the post-route and parasitic-annotated netlist of the design. The results show that the hybrid implementation does not significantly lower security compared to a reference CMOS implementation.
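For readers unfamiliar with CPA, the generic skeleton such simulation flows implement is short: build a leakage hypothesis (here a Hamming-weight model) for every key guess and correlate it with the measured traces sample by sample. The sketch below uses synthetic traces and a toy 4-bit key; it is not specific to ASCON, to the paper's hybrid netlist, or to its EDA flow.

    import numpy as np

    def hamming_weight(x):
        return bin(int(x)).count("1")

    def cpa(traces, inputs, leakage_model, key_guesses):
        # traces: (n_traces, n_samples); returns Pearson correlation per guess and sample.
        corr = np.zeros((len(key_guesses), traces.shape[1]))
        tr_c = traces - traces.mean(axis=0)
        for gi, guess in enumerate(key_guesses):
            hyp = np.array([leakage_model(x, guess) for x in inputs], dtype=float)
            hyp_c = hyp - hyp.mean()
            num = hyp_c @ tr_c
            den = np.sqrt((hyp_c @ hyp_c) * (tr_c * tr_c).sum(axis=0))
            corr[gi] = num / den
        return corr

    rng = np.random.default_rng(1)
    secret = 0xB
    inputs = rng.integers(0, 16, size=500)
    leak = np.array([hamming_weight(x ^ secret) for x in inputs], dtype=float)
    traces = leak[:, None] + 0.5 * rng.standard_normal((500, 20))   # leakage plus noise

    model = lambda x, k: hamming_weight(x ^ k)
    scores = cpa(traces, inputs, model, range(16)).max(axis=1)  # real attacks target a nonlinear S-box output
    print("best key guess:", int(np.argmax(scores)))            # expected: 11 (0xB)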
16:42 CET MANTIS: MACHINE LEARNING-BASED APPROXIMATE MODELING OF REDACTED INTEGRATED CIRCUITS
Authors:
Chaitali Sathe, Yiorgos Makris and Benjamin Carrion Schaefer, University of Texas at Dallas, US
Abstract
With most VLSI design companies now being fabless, it is imperative to develop methods to protect their Intellectual Property (IP). One approach that has become very popular due to its relative simplicity and practicality is logic locking. One of the problems with traditional locking mechanisms is that the locking circuitry is built into the netlist that the VLSI design company delivers to the foundry, which then has access to the entire design including the locking mechanism. This implies that the foundry could potentially tamper with this circuitry or reverse engineer it to obtain the locking key. One relatively new approach, coined hardware redaction, is to map a portion of the design to an embedded FPGA (eFPGA). The bitstream of the eFPGA now acts as the locking key. The fab now receives the design without the bitstream and hence cannot reverse engineer the functionality of the design. The obvious drawbacks are the increase in design complexity and the area and performance overheads associated with the eFPGA. In this work we propose, to the best of our knowledge, the first attack on this type of new locking mechanism by substituting the exact logic mapped onto the eFPGA with a synthesizable predictive model that replicates the behavior of the exact logic. We show that this approach is especially applicable in the context of approximate computing, where hardware accelerators tolerate a certain degree of error at their outputs. Examples include Digital Signal Processing (DSP) and image processing applications. Experimental results show that our proposed approach is very effective in finding suitable predictive models.
16:45 CET LONG RANGE DETECTION OF EMANATION FROM HDMI CABLES USING CNN AND TRANSFER LEARNING
Authors:
Md Faizul Bari1, Meghna Roy Chowdhury1 and Shreyas Sen2
1Purdue University, US; 2ECE, Purdue University, US
Abstract
The transition of data and clock signals between high and low states in electronic devices creates electromagnetic radiation according to Maxwell's equations. These unintentional emissions, called emanation, may have a significant correlation with the original information-carrying signal and form an information leakage source, bypassing secure cryptographic methods at both hardware and software levels. Information extraction exploiting compromising emanations poses a major threat to information security. Shielding the devices and cables along with setting a control perimeter for a sensitive facility are the most commonly used preventive measures. These countermeasures raise the research need for the longest detection range of exploitable emanation and the efficacy of commercial shielding. In this work, using data collected from 3 types of commercial HDMI cables (unshielded, single-shielded, and double-shielded) in an office environment, we have shown that the CNN-based detection method outperforms the traditional threshold-based detection method and improves the detection range from 4 m to 22.5 m for an iso-accuracy of ~95%. Also, for an iso-distance of 16 m, the CNN-based method provides ~100% accuracy, compared to ~88.5% using the threshold-based method. The significant performance boost is achieved by treating the FFT plots as images and training a residual neural network (ResNet) with the data so that it learns to identify the impulse-like emanation peaks even in the presence of other interfering signals. A comparison has been made among the emanation power from the 3 types of HDMI cables to judge the efficacy of multi-layer shielding. Finally, a distinction has been made between monitor contents, i.e., still image vs video, with an accuracy of 91.7% at a distance of 16 m. This distinction bridges the gap between emanation-based image and video reconstruction algorithms.
16:48 CET ADVERSARIAL ATTACK ON HYPERDIMENSIONAL COMPUTING-BASED NLP APPLICATIONS
Authors:
Sizhe Zhang1, Zhao Wang2 and Xun Jiao1
1Villanova University, US; 2University of Chicago, US
Abstract
The security and robustness of machine learning algorithms have become increasingly important as they are used in critical applications such as natural language processing (NLP), e.g., text-based spam detection. Recently, the emerging brain-inspired hyperdimensional computing (HDC), compared to deep learning methods, has shown advantages such as compact model size, energy efficiency, and capability of few-shot learning in various NLP applications. While HDC has been demonstrated to be vulnerable to adversarial attacks on image and audio input, there is currently no study on its adversarial security for NLP tasks, which is arguably one of the most suitable applications for HDC. In this paper, we present the first study on the adversarial attack of HDC-based NLP applications. By leveraging the unique properties of HDC, namely similarity-based inference, we propose similarity-guided approaches to automatically generate adversarial text samples for HDC. Our approach is able to achieve up to an 89% attack success rate. More importantly, compared with an unguided brute-force approach, the similarity-guided attack achieves a speedup of 2.4X in generating adversarial samples. Our work opens up new directions and challenges for future adversarially-robust HDC model design and optimization.
16:51 CET A PRACTICAL REMOTE POWER ATTACK ON MACHINE LEARNING ACCELERATORS IN CLOUD FPGAS
Authors:
Shanquan Tian1, Shayan Moini2, Daniel Holcomb3, Russell Tessier4 and Jakub Szefer1
1Yale University, US; 2School of Electrical and Computer Engineering, University of Massachusetts Amherst, US; 3UMass Amherst, US; 4University of Massachusetts, US
Abstract
The security and performance of FPGA-based accelerators play vital roles in today's cloud services. In addition to supporting convenient access to high-end FPGAs, cloud vendors and third-party developers now provide numerous FPGA accelerators for machine learning models. However, the security of accelerators developed for state-of-the-art Cloud FPGA environments has not been fully explored, since most remote accelerator attacks have been prototyped on local FPGA boards in lab settings, rather than in Cloud FPGA environments. To address existing research gaps, this work analyzes three existing machine learning accelerators developed in Xilinx Vitis to assess the potential threats of power attacks on accelerators in Amazon Web Services (AWS) F1 Cloud FPGA platforms, in a multi-tenant setting. The experiments show that malicious co-tenants in a multi-tenant setting can instantiate voltage sensing circuits as register-transfer level (RTL) kernels within the Vitis design environment to spy on co-tenant modules. A methodology for launching a practical remote power attack on Cloud FPGAs is also presented, which uses an enhanced time-to-digital converter (TDC) based voltage sensor. The TDC is used to capture power signatures, which are then used to identify power consumption spikes and observe activity patterns involving the FPGA shell, DRAM on the FPGA board, or other co-tenant victims' accelerators. Voltage change patterns related to shell use and accelerators are then used to create a practical auto-triggered attack that can automatically detect when to capture voltage traces without the need for a hard-wired synchronization signal between victim and attacker. To address the novel threats presented in this work, this paper also discusses defenses that could be leveraged to secure multi-tenant Cloud FPGAs from power-based attacks.
16:54 CET SCALABLE SCAN-CHAIN-BASED EXTRACTION OF NEURAL NETWORK MODELS
Authors:
Shui Jiang1, Seetal Potluri2 and Tsung-Yi Ho1
1The Chinese University of Hong Kong, HK; 2North Carolina State University, US
Abstract
Scan chains have greatly improved hardware testability while introducing security breaches for confidential data. Scan-chain attacks have extended their scope from cryptoprocessors to AI edge devices. The recently proposed scan-chain-based neural network (NN) model extraction attack (ICCAD 2021) made it possible to achieve fine-grained extraction and is multiple orders of magnitude more efficient in both queries and accuracy than its coarse-grained mathematical counterparts. However, both query formulation complexity and constraint-solver failures increase drastically with network depth/size. We demonstrate a more powerful adversary, who is capable of improving scalability while maintaining accuracy, by relaxing high-fidelity constraints to formulate an approximate-fidelity-based, layer-constrained least-squares extraction using random queries. We conduct our extraction attack on neural network inference topologies of different depths and sizes, targeting the MNIST digit recognition task. The results show that our method outperforms the scan-chain attack proposed in ICCAD 2021 with an average increase in the extracted neural network's functional accuracy of ≈31% and a 2–3 orders of magnitude reduction in queries. Furthermore, we demonstrate that our attack is highly effective even in the presence of countermeasures against adversarial samples.
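The relaxed, least-squares flavour of extraction described above can be illustrated with a small numpy sketch. The simulated layer, random queries and observed responses below are stand-ins for what an adversary would read out through the scan chain; the shapes and names are illustrative, and this is not the paper's attack.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical "victim" layer the adversary wants to recover: y = W x + b
    n_in, n_out, n_queries = 64, 32, 4096
    W_true = rng.normal(size=(n_out, n_in))
    b_true = rng.normal(size=n_out)

    # Random queries and the responses that would be observed via the scan chain
    # (here simply simulated).
    X = rng.normal(size=(n_queries, n_in))
    Y = X @ W_true.T + b_true

    # Layer-constrained least-squares extraction: append a constant column so the
    # bias is recovered jointly with the weights.
    X_aug = np.hstack([X, np.ones((n_queries, 1))])
    theta, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
    W_est, b_est = theta[:-1].T, theta[-1]

    print("max weight error:", np.abs(W_est - W_true).max())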

BPA_1 Testing

Date: Tuesday, 18 April 2023
Time: 08:30 - 10:30 CET

Time Label Presentation Title
Authors
00:39 CET DEVICE-AWARE TEST FOR BACK-HOPPING DEFECTS IN STT-MRAMS
Authors:
Sicong Yuan1, Mottaqiallah Taouil2, Moritz Fieback1, Hanzhi Xun1, Erik Marinissen3, Gouri Kar3, Siddharth Rao3, Sebastien Couet3 and Said Hamdioui2
1TUDelft, Delft, The Netherlands, NL; 2Delft University of Technology, NL; 3IMEC, Leuven, Belgium, BE
Abstract
The development of spin-transfer torque magnetic RAM (STT-MRAM) mass production requires high-quality dedicated test solutions, for which understanding and modeling manufacturing defects of the magnetic tunnel junction (MTJ) is crucial. This paper introduces and characterizes a new defect called Back-Hopping (BH); it also provides the corresponding fault models and test solutions. The BH defect causes the MTJ state to oscillate during write operations, leading to write failures. The characterization of the defect is carried out on manufactured MTJ devices. Due to the observed non-linear characteristics, the BH defect cannot be modeled with a linear resistance. Hence, device-aware defect modeling is applied by considering the intrinsic physical mechanisms; the model is then calibrated based on measurement data. Thereafter, fault modeling and analysis are performed based on circuit-level simulations, and new fault primitives/models are derived. These accurately describe the way STT-MRAM behaves in the presence of the BH defect. Finally, dedicated march tests and Design-for-Test solutions are proposed.
00:39 CET CORRECTNET: ROBUSTNESS ENHANCEMENT OF ANALOG IN-MEMORY COMPUTING FOR NEURAL NETWORKS BY ERROR SUPPRESSION AND COMPENSATION
Authors:
Amro Eldebiky1, Grace Li Zhang2, Georg Bocherer3, Bing Li1 and Ulf Schlichtmann1
1TU Munich, DE; 2TU Darmstadt, DE; 3Huawei Munich Research Center, DE
Abstract
The last decade has witnessed the breakthrough of deep neural networks (DNNs) in many fields. With the increasing depth of DNNs, hundreds of millions of multiply-and-accumulate (MAC) operations need to be executed. To accelerate such operations efficiently, analog in-memory computing platforms based on emerging devices, e.g., resistive RAM (RRAM), have been introduced. These acceleration platforms rely on the analog properties of the devices and thus suffer from process variations and noise. Consequently, weights in neural networks configured onto these platforms can deviate from the expected values, which may lead to feature errors and a significant degradation of inference accuracy. To address this issue, in this paper we propose a framework to enhance the robustness of neural networks under variations and noise. First, a modified Lipschitz constant regularization is proposed during neural network training to suppress the amplification of errors propagated through network layers. Afterwards, error compensation is introduced at necessary locations, determined by reinforcement learning, to rescue the feature maps with remaining errors. Experimental results demonstrate that the inference accuracy of neural networks can be recovered from as low as 1.69% under variations and noise back to more than 95% of their original accuracy, while the training and hardware costs are negligible.
00:39 CET ASSESSING CONVOLUTIONAL NEURAL NETWORKS RELIABILITY THROUGH STATISTICAL FAULT INJECTIONS
Authors:
Annachiara Ruospo1, Gabrile Gavarini1, Corrado De Sio1, Juan Guerrero Balaguera1, Luca Sterpone1, Matteo Sonza Reorda2, Ernesto Sanchez1, Riccardo Mariani3, Joseph Aribido4 and Jyotika Athavale4
1Politecnico di Torino, IT; 2Politecnico di Torino - DAUIN, IT; 3NVIDIA, IT; 4nvidia, US
Abstract
Assessing the reliability of modern devices running CNN algorithms is a very difficult task. In fact, the complexity of state-of-the-art devices makes exhaustive Fault Injection (FI) campaigns impractical and typically beyond available computational capabilities. A possible solution to this problem consists in resorting to statistical FI campaigns, which reduce the number of needed experiments by injecting only a significant sample of the possible faults. Under specific hypotheses, statistical FIs guarantee an accurate picture of the problem, albeit with a reduced sample size. The main problems today are related to the choice of the sample size, the location of the faults, and the correct understanding of the assumptions with respect to the target system. The intent of this paper is twofold: first, we describe how to correctly specify statistical FIs for Convolutional Neural Networks; second, we propose a data analysis on the CNN parameters that drastically reduces the number of FIs needed to achieve statistically significant results without compromising the validity of the proposed method. The methodology is experimentally validated on two CNNs, ResNet-20 and MobileNetV2, and the results show that a statistical FI campaign on about 1.50% of the possible faults provides very precise information on the CNN reliability. The statistical results have been confirmed by exhaustive FI campaigns on the same case studies.
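For context, a minimal sketch of the sample-size calculation that statistical FI campaigns typically rely on is shown below. The formula and the default confidence parameters are the ones commonly cited in the statistical FI literature and may differ in detail from the paper's own derivation.

    import math

    def statistical_fi_sample_size(N, e=0.01, t=2.576, p=0.5):
        """Sample size for a statistical fault-injection campaign.

        N : total fault population (e.g. all injectable bits x time points)
        e : error margin, t : normal quantile (2.576 ~ 99% confidence)
        p : estimated failure probability (0.5 is the worst case)
        Commonly cited formula from the statistical FI literature.
        """
        return math.ceil(N / (1 + e**2 * (N - 1) / (t**2 * p * (1 - p))))

    # e.g. a CNN with ~10^8 injectable weight bits
    print(statistical_fi_sample_size(10**8))   # ~16,600 faults instead of 10^8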

BPA_2 From Synthesis to application

Date: Tuesday, 18 April 2023
Time: 08:30 - 10:30 CET

Time Label Presentation Title
Authors
00:39 CET EFFICIENT PARALLELIZATION OF 5G-PUSCH ON A SCALABLE RISC-V MANY-CORE PROCESSOR
Authors:
Marco Bertuletti1, Yichao Zhang1, Alessandro Vanelli-Coralli2 and Luca Benini3
1ETH Zurich, CH; 2Universita' di Bologna and ETH Zurich, IT; 3Università di Bologna and ETH Zurich, IT
Abstract
5G radio access network disaggregation and softwarization pose computational-performance challenges for the processing units. At the physical layer, the baseband processing effort is typically offloaded to specialized hardware accelerators. However, the trend toward software-defined radio access networks demands flexible, programmable architectures. In this paper, we explore the software design, parallelization and optimization of the key kernels of the lower physical layer (PHY) for physical uplink shared channel (PUSCH) reception on MemPool and TeraPool, two manycore systems with 256 and 1024 small and efficient RISC-V cores, respectively, sharing a large L1 data memory. PUSCH processing is demanding and strictly time-constrained; it represents a challenge for baseband processors and is common to most uplink channels, so our analysis generalizes to the entire lower PHY of the uplink receiver at the gNodeB (gNB). Based on the evaluation of the computational effort (in multiply-accumulate operations) required by the PUSCH algorithmic stages, we focus on the parallel implementation of the dominant kernels, namely the fast Fourier transform, matrix-matrix multiplication, and matrix decomposition kernels for the solution of linear systems. Our optimized parallel kernels achieve speedups of 211, 225, 158 on MemPool and 762, 880, 722 on TeraPool, at high utilization (0.81, 0.89, 0.71 and 0.74, 0.88, 0.71), compared to single-core serial execution, moving a step closer toward a full-software PUSCH implementation.
00:39 CET NARROWING THE SYNTHESIS GAP: ACADEMIC FPGA SYNTHESIS IS CATCHING UP WITH THE INDUSTRY
Authors:
Benjamin Barzen1, Arya Reais-Parsi1, Eddie Hung2, Minwoo Kang1, Alan Mishchenko1, Jonathan Greene1 and John Wawrzynek1
1UC Berkeley, US; 2FPG-eh Research and University of British Columbia, CA
Abstract
Historically, open-source FPGA synthesis and technology mapping tools have been considered far inferior to industry-standard tools. We show that this is no longer true. Improvements in recent years to Yosys (Verilog elaborator) and ABC (technology mapper) have resulted in substantially better performance, evident in both the reduction of area utilization and the increase in the maximum achievable clock frequency. More specifically, we describe how Yosys and ABC9 --- a set of feature additions to ABC --- were integrated such that technology mapping now has a complete view of the circuit, including support for hard blocks (e.g. carry chains) and multiple clock domains for timing-aware mapping. We demonstrate how these improvements culminate in dramatically better synthesis results, with Yosys-ABC9 reducing the delay gap from 30% to 0% on our small FPGA target for the commonly used VTR benchmark, thus matching Vivado's performance in terms of maximum clock frequency. We also measure the performance on a selection of circuits from OpenCores as well as the literature, comparing the results produced by Yosys-ABC, Yosys-ABC9 and Vivado.
00:39 CET SAGEROUTE: SYNERGISTIC ANALOG ROUTING CONSIDERING GEOMETRIC AND ELECTRICAL CONSTRAINTS WITH MANUAL DESIGN COMPATIBILITY
Authors:
Haoyi Zhang, Xiaohan Gao, Haoyang Luo, Jiahao Song, Xiyuan Tang, Junhua Liu, Yibo Lin, Runsheng Wang and Ru Huang, Peking University, CN
Abstract
Routing is critical to the post-layout performance of analog circuits. As modern analog layouts need to consider both geometric constraints (e.g., design rules and low-bending constraints) and electrical constraints (e.g., electromigration (EM), IR drop, symmetry, etc.), analog routing faces an increasingly complicated design space. Most previous work has focused only on geometric constraints or basic electrical constraints, lacking a holistic and systematic investigation. Such an approach is far from typical manual design practice and cannot guarantee post-layout performance on real-world designs. In this work, we propose SAGERoute, a synergistic routing framework taking both geometric and electrical constraints into consideration. Through Steiner-tree-based wire sizing and guided detailed routing, the framework generates high-quality routing solutions efficiently under versatile constraints on real-world analog designs.

BPA_5 Benchmarking and Verification

Date: Tuesday, 18 April 2023
Time: 08:30 - 10:30 CET

Time Label Presentation Title
Authors
00:39 CET BENCHMARKING LARGE LANGUAGE MODELS FOR AUTOMATED VERILOG RTL CODE GENERATION
Authors:
Shailja Thakur1, Baleegh Ahmad1, Zhenxing Fan1, Hammond Pearce1, Benjamin Tan2, Ramesh Karri3, Brendan Dolan-Gavitt1 and Siddharth Garg1
1New York University, US; 2University of Calgary, CA; 3NYU, US
Abstract
Automating the hardware design process could eliminate a significant amount of human error from design engineering and lead to fewer design errors. Verilog is a popular hardware description language (HDL) for modeling digital systems. Thus, automatically generating Verilog models is an appealing option and the focus of this paper. We explore and characterize the ability of emerging open-source large language models (LLMs) to work with Verilog, given their ability to write coherent and functionally correct code in other programming languages. We fine-tune pre-trained LLMs on a Verilog dataset collected from sources such as GitHub and Verilog textbooks. We construct an evaluation framework comprising test benches, design prompts of varying difficulty, and a flow to test the generated Verilog code. Our findings show that fine-tuned LLMs outperform a commercial LLM on basic hardware designs and can complete a variety of intermediate- and advanced-level hardware designs. We release training and evaluation scripts and model checkpoints as open-source contributions: [blind for review].
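As an illustration of the completion-style evaluation such a framework performs, the sketch below queries a causal LLM with a Verilog design prompt through the Hugging Face transformers API. The checkpoint name is a placeholder and the sampling settings are arbitrary; this is not the authors' released flow.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    checkpoint = "your-org/verilog-finetuned-llm"   # hypothetical model name
    tok = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    prompt = """// Design prompt: 8-bit synchronous counter with active-high reset
    module counter(input clk, input rst, output reg [7:0] q);
    """

    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.5)
    print(tok.decode(out[0], skip_special_tokens=True))
    # The generated module body would then be compiled and run against a
    # testbench (e.g. with an open-source simulator) to check functional correctness.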
00:39 CET PROCESSOR VERIFICATION USING SYMBOLIC EXECUTION: A RISC-V CASE-STUDY
Authors:
Niklas Bruns1, Vladimir Herdt2 and Rolf Drechsler3
1University of Bremen, DE; 2DFKI, DE; 3University of Bremen/DFKI, DE
Abstract
We propose to leverage state-of-the-art symbolic execution techniques from the Software (SW) domain for processor verification at the Register-Transfer Level (RTL). In particular, we utilize an Instruction Set Simulator (ISS) as a reference model and integrate it with the RTL processor under test in a co-simulation setting. We then leverage the symbolic execution engine KLEE to perform a symbolic exploration that searches for functional mismatches between the ISS and RTL processor. To ensure a comprehensive verification process, symbolic values are used to represent the instructions and also to initialize the register values of the ISS and processor. As a case study, we present results on the verification of the open source RISC-V based MicroRV32 processor, using the ISS of the open source RISC-V VP as a reference model. Our results demonstrate that modern symbolic execution techniques are applicable to a full scale processor co-simulation in the embedded domain and are very effective in finding bugs in the RTL core.
00:39 CET PERSPECTOR: BENCHMARKING BENCHMARK SUITES
Authors:
Sandeep Kumar1, Abhisek Panda2 and Smruti R. Sarangi1
1IIT Delhi, IN; 2Indian Institute of Technology, IN
Abstract
Estimating the quality of a benchmark suite is a non-trivial task. A poorly selected or improperly configured benchmark suite can present a distorted picture of the performance capabilities of the evaluated framework. With computing venturing into new domains, the total number of benchmark suites available is increasing by the day. Researchers must evaluate these suites quickly and decisively for their effective use. We present Perspector, a novel tool to quantify the performance of a benchmark suite. Perspector comprises novel metrics to characterize the quality of a benchmark suite. It provides a mathematical framework for capturing some qualitative suggestions and observations made in prior work. The metrics are generic and domain agnostic. Furthermore, our tool can be used to compare the efficacy of one suite vis-a-vis other benchmark suites, systematically and rigorously create a suite of workloads, and appropriately tune them for a target system.

BPA_8 Machine Learning techniques for embedded systems

Date: Tuesday, 18 April 2023
Time: 08:30 - 10:30 CET

Time Label Presentation Title
Authors
00:39 CET PRADA: POINT CLOUD RECOGNITION ACCELERATION VIA DYNAMIC APPROXIMATION
Authors:
Zhuoran Song, Heng Lu, Gang Li, Li Jiang, Naifeng Jing and Xiaoyao Liang, Shanghai Jiao Tong University, CN
Abstract
Recent point cloud recognition (PCR) tasks tend to utilize deep neural networks (DNNs) for better accuracy. Still, the computational intensity of DNNs keeps them far from real-time processing, given the fast-increasing number of points that need to be processed. Because a point cloud represents 3D-shaped discrete objects in the physical world using a mass of points, the points tend to be unevenly distributed in the view space, exposing strong clustering potential and similarities between local pairs. Based on this observation, this paper proposes PRADA, an algorithm-architecture co-design that can accelerate PCR while preserving its accuracy. We propose dynamic approximation, which can approximate and eliminate similar local pairs' computations and recover their results by copying key local pairs' features for PCR speedup without losing accuracy. To preserve accuracy, we further propose an advanced re-clustering technique to maximize the similarity between local pairs. To improve performance, we then propose a PRADA architecture that can be built on any conventional DNN accelerator to dynamically approximate the similarity and skip the redundant DNN computation together with its memory accesses. Our experiments on a wide variety of datasets show that PRADA achieves, on average, 4.2x, 4.9x, 7.1x, and 12.2x speedups over Mesorasi, a V100 GPU, a 1080TI GPU, and a Xeon CPU with negligible accuracy loss.
00:39 CET FEDERATED LEARNING WITH HETEROGENEOUS MODELS FOR ON-DEVICE MALWARE DETECTION IN IOT NETWORKS
Authors:
Sanket Shukla1, Setareh Rafatirad2, Houman Homayoun3 and Sai Manoj Pudukotai Dinakarrao4
1George Mason University, US; 2University of California, Davis, US; 3University of California Davis, US; 4George Mason University, US
Abstract
IoT devices have been widely deployed in a vast number of applications to facilitate smart technology, increased portability, and seamless connectivity. Despite being widely adopted, security in IoT devices is often considered an afterthought due to resource and cost constraints. Among multiple security threats, malware attacks are observed to be a pivotal threat to IoT devices. Considering the spread of IoT devices and the threats they experience over time, deploying a static malware detector that is trained offline seems to be an ineffective solution. On the other hand, on-device learning is an expensive or infeasible option due to the limited resources available on IoT devices. To overcome these challenges, this work employs Federated Learning (FL), which enables timely updates to the malware detection models for increased security while mitigating the high communication or data storage overhead of centralized cloud approaches. Federated learning allows training machine learning models with decentralized data while preserving its privacy by design. However, one of the challenges with FL is that the on-device models are required to be homogeneous, which may not be true in the case of networked IoT systems. As a remedy, we introduce a methodology to unify the models in the cloud with minimal overheads and minimal impact on on-device malware detection. We evaluate the proposed technique against homogeneous models in networked IoT systems encompassing Raspberry Pi devices. The experimental results and system efficiency analysis indicate that end-to-end training time is just 1.12× higher than traditional FL, testing latency is 1.63× faster, and malware detection performance is improved by 7% to 13% for resource-constrained IoT devices.
00:39 CET GENETIC ALGORITHM-BASED FRAMEWORK FOR LAYER-FUSED SCHEDULING OF MULTIPLE DNNS ON MULTI-CORE SYSTEMS
Authors:
Sebastian Karl1, Arne Symons2, Nael Fasfous3 and Marian Verhelst2
1TU Munich, DE; 2KU Leuven, BE; 3BMW AG, DE
Abstract
Heterogeneous multi-core architectures are becoming a popular design choice to accelerate the inference of modern deep neural networks (DNNs). This trend allows for more flexible mappings onto the cores, but shifts the challenge to keeping all cores busy despite limited network parallelism. To this end, layer-fused processing, where several layers are mapped simultaneously to an architecture and executed in a depth-first fashion, has shown promising opportunities to maximize core utilization. However, SotA mapping frameworks fail to efficiently map layer-fused DNNs onto heterogeneous multi-core architectures because they ignore 1) on-chip weight traffic and 2) inter-core communication congestion. This work tackles these shortcomings by introducing a weight memory manager (WMM), which manages the weights present in a core and models the cost of re-fetching weights. Secondly, the inter-core communication (ICC) of feature data is modelled through a limited-bandwidth bus and optimized through a contention-aware scheduler (CAS). Relying on these models, a genetic algorithm is developed to optimally schedule the different DNN layers across the cores. The impact of our enhanced modelling, core allocation and scheduling capabilities is shown in several experiments, demonstrating a decrease of 52% in latency and 38% in energy when mapping a multi-DNN inference, consisting of ResNet-18, MobileNet-V2 and Tiny YOLO V2, onto a heterogeneous multi-core platform compared to iso-area homogeneous architectures.
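To make the scheduling idea concrete, here is a toy genetic algorithm that assigns DNN layers to heterogeneous cores under a deliberately simplified cost model (per-core throughput plus fixed re-fetch and communication penalties). The real WMM and CAS models in the paper are far more detailed; every number below is illustrative.

    import random

    random.seed(0)
    layers = [("conv1", 8e6), ("conv2", 16e6), ("conv3", 16e6), ("fc", 2e6)]  # (name, MACs)
    core_speed = [4e6, 4e6, 8e6]          # MACs per unit time for 3 cores
    link_penalty, refetch_penalty = 0.3, 0.5

    def latency(assign):
        busy = [0.0] * len(core_speed)
        for (name, macs), core in zip(layers, assign):
            busy[core] += macs / core_speed[core] + refetch_penalty
        # crude communication term: consecutive layers on different cores pay a penalty
        comm = sum(link_penalty for a, b in zip(assign, assign[1:]) if a != b)
        return max(busy) + comm

    def ga(pop_size=30, gens=50):
        pop = [[random.randrange(len(core_speed)) for _ in layers] for _ in range(pop_size)]
        for _ in range(gens):
            pop.sort(key=latency)
            parents = pop[: pop_size // 2]          # elitist selection
            children = []
            while len(children) < pop_size - len(parents):
                a, b = random.sample(parents, 2)
                cut = random.randrange(1, len(layers))
                child = a[:cut] + b[cut:]           # one-point crossover
                if random.random() < 0.2:           # mutation
                    child[random.randrange(len(layers))] = random.randrange(len(core_speed))
                children.append(child)
            pop = parents + children
        return min(pop, key=latency)

    best = ga()
    print(best, latency(best))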

S_A3 Applications of Emerging Technologies and Computing Paradigms

Date: Tuesday, 18 April 2023
Time: 11:00 - 12:30 CET

Time Label Presentation Title
Authors
00:39 CET HDGIM: HYPERDIMENSIONAL GENOME SEQUENCE MATCHING ON UNRELIABLE HIGHLY-SCALED FEFET
Authors:
Hamza Errahmouni Barkam1, Sanggeon Yun2, Paul Genssler3, Zhuowen Zou4, Che-Kai Liu5, Hussam Amrouch3 and Mohsen Imani6
1University of California Irvine, US; 2Kookmin University, KR; 3University of Stuttgart, DE; 4UCI, US; 5Zhejiang University, CN; 6University of California Irvine, US
Abstract
This is the first work to i) define theoretically the memorization capacity of Hyperdimensional (HDC) hyperparameters and ii) present a reliable application for highly-scaled (down to merely 3nm), multi-bit Ferroelectric FET (FeFET) technology. FeFET is one of the up-and-coming emerging technologies that is not only fully compatible with the existing CMOS but does hold the promise to realize ultra-efficient and compact Compute-in-Memory (CiM) architectures. Nevertheless, FeFETs struggle with the 10nm thickness of the Ferroelectric (FE) layer. This makes scaling profoundly challenging if not impossible because thinner FE significantly shrinks the memory window leading to large error probabilities that cannot be tolerated. To overcome these challenges, we propose HDGIM, a hyperdimensional computing framework catered to FeFET in the context of genome sequence matching. Genome Sequence Matching is known to have high computational costs, primarily due to huge data movement that substantially overwhelms von-Neuman architectures. On the one hand, our cross-layer FeFET reliability modeling (starting from device physics to circuits) accurately captures the impact of FE scaling on errors induced by process variation and inherent stochasticity in multi-bit FeFETs. On the other hand, our HDC learning framework iteratively adapts by using two models, a full-precision, ideal model for training and a quantized, noisy version for validation and inference. Our results demonstrate that highly-scaled FeFET realizing 3-bit and even 4-bit can withstand any noise given high dimensionality during inference. If we consider the noise during model adjustment, we can improve the inherent robustness compared to adding noise during the matching process.
00:39 CET QUANTUM MEASUREMENT DISCRIMINATION USING CUMULATIVE DISTRIBUTION FUNCTIONS
Authors:
Zachery Utt, Daniel Volya and Prabhat Mishra, University of Florida, US
Abstract
Quantum computing can efficiently solve many hard problems significantly faster than its classical counterpart. Quantum measurement is one of the critical steps in quantum computing that determines the probabilities associated with qubit states after conducting several circuit executions and measurements. As a mesoscopic quantum system, real quantum computers are prone to noise. Therefore, a major challenge in quantum measurement is how to correctly interpret the noisy results of a quantum computer. While there are promising classification based solutions, they either produce incorrect results (misclassify) or require many measurements (expensive). In this paper, we present an efficient technique to estimate a qubit's state through analysis of probability distributions of post-measurement data. Specifically, it estimates the state of a qubit using cumulative distribution functions to compare the measured distribution of a sample with the distributions of basis states. Our experimental results demonstrate a drastic reduction (78%) in single qubit readout error. It also provides significant reduction (12%) when used to boost existing multi-qubit discriminator models.
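A minimal sketch of CDF-based discrimination is given below: the empirical CDF of a sample of readout values is compared against reference CDFs recorded for the basis states, and the closer one (smallest maximum deviation) wins. The Gaussian readout model stands in for real calibration data and is not the authors' exact estimator.

    import numpy as np

    rng = np.random.default_rng(1)
    cal0 = rng.normal(0.0, 1.0, 5000)    # calibration shots with the qubit prepared in |0>
    cal1 = rng.normal(2.0, 1.0, 5000)    # calibration shots with the qubit prepared in |1>

    def ecdf(samples, grid):
        samples = np.sort(samples)
        return np.searchsorted(samples, grid, side="right") / len(samples)

    def discriminate(shots, grid=np.linspace(-4, 6, 501)):
        f = ecdf(shots, grid)
        d0 = np.max(np.abs(f - ecdf(cal0, grid)))   # distance to the |0> reference CDF
        d1 = np.max(np.abs(f - ecdf(cal1, grid)))   # distance to the |1> reference CDF
        return 0 if d0 < d1 else 1

    unknown = rng.normal(2.0, 1.0, 200)  # a noisy sample from an unknown state
    print(discriminate(unknown))          # -> 1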
00:39 CET EXTENDING THE DESIGN SPACE OF DYNAMIC QUANTUM CIRCUITS FOR TOFFOLI BASED NETWORK
Authors:
Abhoy Kole1, Arighna Deb2, Kamalika Datta1 and Rolf Drechsler3
1German Research Centre for Artificial Intelligence (DFKI), DE; 2School of Electronics Engineering KIIT DU, IN; 3University of Bremen/DFKI, DE
Abstract
Recent advances in fault-tolerant quantum systems make it possible to perform non-unitary operations such as mid-circuit measurement, active reset and classically controlled gate operations in addition to the existing unitary gate operations. Real quantum devices that support these non-unitary operations enable us to execute a new class of quantum circuits, known as Dynamic Quantum Circuits (DQC). This helps to enhance scalability, thereby allowing the execution of quantum circuits comprising many qubits by using as few as two qubits. Recently, DQC realizations of multi-qubit Quantum Phase Estimation and Bernstein–Vazirani algorithms have been demonstrated in two separate experiments. However, the dynamic transformation of complex quantum circuits consisting of Toffoli gate operations has not been explored yet. This motivates us to: (a) explore the dynamic realization of Toffoli gates by extending the design space of DQC to Toffoli networks, and (b) propose a general dynamic transformation algorithm for the first time, to the best of our knowledge. More precisely, we introduce two dynamic transformation schemes (dynamic-1 and dynamic-2) for Toffoli gates, which differ in the required number of classically controlled gate operations. For evaluation, we consider the Deutsch-Jozsa (DJ) algorithm composed of one or more Toffoli gates. Experimental results demonstrate that dynamic DJ circuits based on the dynamic-2 Toffoli realization scheme provide better computational accuracy than the dynamic-1 scheme. Furthermore, the proposed dynamic transformation scheme is general and can also be applied to non-Toffoli quantum circuits.
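For readers unfamiliar with the primitives involved, the Qiskit sketch below shows the non-unitary building blocks that dynamic quantum circuits rely on, i.e. mid-circuit measurement, active reset and a classically controlled gate. It only illustrates these primitives and is not the Toffoli transformation proposed in the paper; the classically controlled form uses the long-standing c_if API.

    from qiskit import QuantumCircuit, QuantumRegister, ClassicalRegister

    q = QuantumRegister(2)
    c = ClassicalRegister(2)
    qc = QuantumCircuit(q, c)

    qc.h(q[0])
    qc.cx(q[0], q[1])
    qc.measure(q[0], c[0])      # mid-circuit measurement
    qc.reset(q[0])              # active reset: qubit 0 can now be reused
    qc.x(q[1]).c_if(c, 1)       # classically controlled gate on the earlier outcome
    qc.measure(q[1], c[1])

    print(qc.draw())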
00:39 CET AI-BASED DETECTION OF DROPLETS AND BUBBLES IN DIGITAL MICROFLUIDIC BIOCHIPS
Authors:
Jianan Xu, Wenjie Fan, Georgi Plamenov Tanev, Jan Madsen and Luca Pezzarossa, TU Denmark, DK
Abstract
Digital microfluidic biochips exploit the electrowetting on dielectric effect to move and manipulate microliter-sized liquid droplets on a planar surface. This technology has the potential to automate and miniaturize biochemical processes, but reliability is often an issue. The droplets may get temporarily stuck or gas bubbles may impede their movement leading to a disruption of the process being executed. However, if the position and size of the droplets and bubbles are known at run-time, these undesired effects can be easily mitigated by the biochip control system. This paper presents an AI-based computer vision solution for real-time detection of droplets and bubbles in DMF biochips and its implementation that supports cloud-based deployment. The detection is based on the YOLOv5 framework in combination with custom pre and post-processing techniques. The YOLOv5 neural network is trained using our own data set consisting of 5115 images. The solution is able to detect droplets and bubbles with real-time speed and high accuracy and to differentiate between them even in the extreme case where bubbles coexist with transparent droplets.
00:39 CET SPLIT ADDITIVE MANUFACTURING FOR PRINTED NEUROMORPHIC CIRCUITS
Authors:
Haibin Zhao1, Michael Hefenbrock2, Michael Beigl1 and Mehdi Tahoori1
1Karlsruhe Institute of Technology, DE; 2RevoAI, DE
Abstract
Printed and flexible electronics promise smart devices for application domains, such as smart fast-moving consumer goods and medical wearables, that are generally out of reach for conventional rigid silicon technologies. This is due to their remarkable properties, such as flexibility, non-toxic materials, and low cost per area. Combined with neuromorphic computing, printed neuromorphic circuits pose an attractive solution for these application domains. In particular, additive printing technologies can greatly reduce fabrication complexity and cost. On the one hand, high-throughput additive printing processes, such as roll-to-roll printing, can reduce the per-device fabrication time and cost. On the other hand, jet-printing can provide point-of-use customization at the expense of lower fabrication throughput. In this work, we propose a machine learning based design framework that respects the objectives and physical constraints of split additive manufacturing for printed neuromorphic circuits. With the proposed framework, multiple printed neural networks are trained jointly with the aim of sensibly combining multiple fabrication techniques (e.g., roll-to-roll and jet-printing). This should lead to cost-effective fabrication of multiple different printed neuromorphic circuits and achieve high fabrication throughput, lower cost, and point-of-use customization.
00:39 CET PIMPR: PIM-BASED PERSONALIZED RECOMMENDATION WITH HETEROGENEOUS MEMORY HIERARCHY
Authors:
Tao Yang1, Hui Ma1, Yilong Zhao1, Fangxin Liu2, Zhezhi He3, Xiaoli Sun4 and Li Jiang1
1Shanghai Jiao Tong University, CN; 2Shanghai Jiaotong University, CN; 3School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, CN; 4Institute of Scientific and Technical Information of Zhejiang Province, CN
Abstract
Deep learning-based personalized recommendation models (DLRMs) are dominating AI tasks in data centers. The performance bottleneck of typical DLRMs mainly lies in the memory-bound embedding layers. Resistive Random Access Memory (ReRAM)-based Processing-in-Memory (PIM) architectures are a natural fit for DLRMs thanks to their in-situ computation and high computational density. However, two challenges remain before DLRMs can fully embrace PIM architectures: 1) the size of a DLRM's embedding tables can reach tens of GBs, far beyond the memory capacity of typical ReRAM chips; 2) the irregular sparsity conveyed in the embedding layers is difficult to exploit in a PIM architecture. In this paper, we present the first PIM-based DLRM accelerator, named PIMPR. PIMPR has a heterogeneous memory hierarchy—ReRAM crossbar-based PIM modules serve as computing caches with high computing parallelism, while DIMM modules are able to hold the entire embedding table—leveraging the data locality of DLRM embedding layers. Moreover, we propose a runtime strategy to skip the useless calculations induced by the sparsity and an offline strategy to balance the workload of each ReRAM crossbar. Compared to the state-of-the-art DLRM accelerators SPACE and TRiM, PIMPR achieves on average 2.02× and 1.79× speedups and 5.6× and 5.1× energy reductions, respectively.
00:39 CET FSL-HD: ACCELERATING FEW-SHOT LEARNING ON RERAM USING HYPERDIMENSIONAL COMPUTING
Authors:
Weihong Xu1, Jaeyoung Kang1 and Tajana Rosing2
1University of California San Diego, US; 2UCSD, US
Abstract
Few-shot learning (FSL) is a promising meta-learning paradigm that trains classification models on the fly with a few training samples. However, existing FSL classifiers are either computationally expensive, or are not accurate enough. In this work, we propose an efficient in-memory FSL classifier, FSL-HD, based on hyperdimensional computing (HDC) that achieves state-of-the-art FSL accuracy and efficiency. We devise an HDC-based FSL framework with efficient HDC encoding and search to reduce high complexity caused by the large HDC dimensionality. Also, we design a scalable in-memory architecture to accelerate FSL-HD on ReRAM with distributed dataflow and organization that maximizes the data parallelism and hardware utilization. The evaluation shows that FSL-HD achieves 4.2% higher accuracy compared to other FSL classifiers. FSL-HD achieves 100–1000× better energy efficiency and 9–66× speedup over the CPU and GPU baselines. Moreover, FSL-HD is more accurate, scalable and 2.5× faster than the state-of-the-art ReRAM-based FSL design, SAPIENS, while requiring 85% less area.
00:39 CET HD-I-IOT: HYPERDIMENSIONAL COMPUTING FOR RESILIENT INDUSTRIAL INTERNET OF THINGS ANALYTICS
Authors:
Onat Gungor1, Tajana Rosing2 and Baris Aksanli3
1UCSD & SDSU, US; 2UCSD, US; 3San Diego State University, US
Abstract
Industrial Internet of Things (I-IoT) enables fully automated production systems by continuously monitoring devices and analyzing collected data. Machine learning (ML) methods are commonly utilized for data analytics in such systems. Cyberattacks are a grave threat to I-IoT as they can manipulate legitimate inputs, corrupting ML predictions and causing disruptions in the production systems. Hyperdimensional (HD) computing is a brain-inspired ML method that has been shown to be sufficiently accurate while being extremely robust, fast, and energy-efficient. In this work, we use non-linear encoding-based HD for intelligent fault diagnosis against different adversarial attacks. Our black-box adversarial attacks first train a substitute model and create perturbed test instances using this trained model. These examples are then transferred to the target models. The change in the classification accuracy is measured as the difference before and after the attacks. This change measures the resiliency of a learning method. Our experiments show that HD leads to a more resilient and lightweight learning solution than the state-of-the-art deep learning methods. HD has up to 67.5% higher resiliency compared to the state-of-the-art methods while being up to 25.1× faster to train.
00:39 CET SARA: AN EFFICIENT AND CONFIGURABLE SOFTMAX ENGINE FOR ATTENTION MODEL WITH VERSATILE RRAM CROSSBAR
Authors:
Yifeng Zhai1, Bing Li1 and Bonan Yan2
1Capital Normal University, CN; 2Peking University, CN
Abstract
RRAM has been exploited to accelerate vector-matrix multiplication (VMM)-dominated neural network applications as the most promising in-memory computing technology. However, state-of-the-art neural network models widely adopt stacked attention mechanisms, whose intensive softmax operations pose an execution-efficiency bottleneck. In this work, we propose SARA, which consists of an efficient RRAM-based softmax engine that exploits the versatility of the RRAM crossbar and a fine-grained pipeline, realizing a high-efficiency RRAM-based accelerator for attention neural networks. Moreover, SARA is precision-configurable to achieve high inference accuracy. Experimental results evaluated on several attention models show that SARA achieves up to 30.63× and 1.31× computing efficiency improvements over a GPU and the state-of-the-art RRAM-based attention accelerators, respectively.
00:39 CET VALUE-BASED REINFORCEMENT LEARNING USING EFFICIENT HYPERDIMENSIONAL COMPUTING
Authors:
Yang Ni1, Danny Abraham1, Mariam Issa1, Yeseong Kim2, Pietro Mercati3 and Mohsen Imani4
1University of California, Irvine, US; 2DGIST, KR; 3Intel Labs, US; 4University of California Irvine, US
Abstract
Reinforcement Learning (RL) has opened up new opportunities to solve a wide range of complex decision-making tasks. However, modern RL algorithms, e.g., Deep Q-Learning, are based on deep neural networks, resulting in high computational costs. In this paper, we propose QHD, an off-policy, value-based hyperdimensional reinforcement learning algorithm that mimics brain properties toward robust and real-time learning. QHD relies on a lightweight brain-inspired model to learn an optimal policy in an unknown environment. We first develop a novel mathematical foundation and encoding module that maps the state-action space into a high-dimensional space. We accordingly develop a hyperdimensional regression model to approximate the Q-value function. The QHD-powered agent makes decisions by comparing the Q-values of each possible action. We evaluate the effect of different RL training batch sizes and local memory capacities on the quality of learning of QHD. QHD is also capable of online learning with a tiny local memory capacity, which can be as small as the training batch size. QHD provides real-time learning by further decreasing the memory capacity and batch size. This makes QHD suitable for highly efficient reinforcement learning with great potential for online and real-time learning. Our solution also supports a small experience-replay batch size that provides a 12.3X speedup compared to DQN while ensuring minimal quality loss. Our evaluation shows QHD's capability for real-time learning, providing a 34.6X speedup and significantly better quality of learning than state-of-the-art deep RL algorithms.
00:39 CET DROPDIM: INCORPORATING EFFICIENT UNCERTAINTY ESTIMATION INTO HYPERDIMENSIONAL COMPUTING
Authors:
Yang Ni1, Hanning Chen1, Prathyush Poduval2, Pietro Mercati3 and Mohsen Imani4
1University of California, Irvine, US; 2CMTC, Department of Physics, University of Maryland, US; 3Intel Labs, US; 4University of California Irvine, US
Abstract
Recent advancement in emerging brain-inspired computing has pointed out a promising path to Machine Learning (ML) algorithms with high efficiency. Particularly, research in the field of HyperDimensional Computing (HDC) brings orders of magnitude speedup to both ML model training and inference compared to their deep learning counterparts. However, current HDC-based ML algorithms generally lack uncertainty estimation, despite having shown good results in various practical applications and outstanding energy efficiency. On the other hand, existing solutions such as the Bayesian Neural Networks (BNN) are generally much slower than the regular neural networks and lead to high energy consumption. In this paper, we propose a hyperdimensional Bayesian framework called DropDim, which enables uncertainty estimation for the HDC-based regression algorithm. The core of our framework is a specially designed HDC encoder that maps input features to the high dimensional space with an extra layer of randomness, i.e., a small number of dimensions are randomly dropped for each input. Our key insight is that by using this encoder, DropDim implements Bayesian inference while maintaining the efficiency advantage of HDC. We verify our framework with both toy regression tasks and real-world datasets. The results on CPU show that DropDim provides comparable uncertainty estimations while also achieving significant speedup compared to the BNN baseline. Our implementation on FPGA shows that DropDim provides 84X (3740X) better energy efficiency for training (inference).
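The core idea of adding randomness to the encoder can be illustrated with a toy sketch: repeating inference while dropping a different random subset of dimensions each time yields a distribution of predictions whose spread serves as an uncertainty estimate. The encoder and regression model below are simple stand-ins, not the paper's framework.

    import numpy as np

    rng = np.random.default_rng(0)
    D = 10_000
    proj = rng.normal(size=D)             # random projection for a 1-D input
    phase = rng.uniform(0, 2 * np.pi, D)  # fixed random phases (nonlinear encoding)
    model = rng.normal(size=D) * 1e-3     # stand-in for a trained HDC regression model

    def encode(x, drop_frac=0.01):
        hv = np.cos(proj * x + phase)                          # nonlinear HDC encoding
        dropped = rng.choice(D, size=int(drop_frac * D), replace=False)
        hv[dropped] = 0.0                                      # the extra layer of randomness
        return hv

    def predict_with_uncertainty(x, n_samples=50):
        preds = np.array([model @ encode(x) for _ in range(n_samples)])
        return preds.mean(), preds.std()   # mean prediction and an uncertainty estimate

    print(predict_with_uncertainty(0.3))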

S_D3 Hardware accelerators and memory subsystems

Date: Tuesday, 18 April 2023
Time: 11:00 - 12:30 CET

Time Label Presentation Title
Authors
00:39 CET UVMMU: HARDWARE-OFFLOADED PAGE MIGRATION FOR HETEROGENEOUS COMPUTING
Authors:
Jihun Park1, Donghun Jeong2 and Jungrae Kim2
1Dept. of Artificial Intelligence, Sungkyunkwan University, KR; 2Sungkyunkwan University, KR
Abstract
In a heterogeneous computing system with multiple memories, placing data near its current processing unit and migrating data over time can significantly improve performance. GPU vendors have introduced Unified Memory (UM) to automate data migrations between CPU and GPU memories and support memory over-subscription. Although UM improves software programmability, it can incur high costs due to its software-based migration. We propose a novel architecture to offload the migration to hardware and minimize UM overheads. Unified Virtual Memory Management Unit (UVMMU) detects access to remote memories and migrates pages without software intervention. By replacing page faults and software handling with hardware offloading, UVMMU can reduce the page migration latency to a few μs. Our evaluation shows that UVMMU can achieve 1.59× and 2.40× speed-ups over the state-of-the-art UM solutions for no over-subscription and 150% over-subscription, respectively.
00:39 CET ARRAYFLEX: A SYSTOLIC ARRAY ARCHITECTURE WITH CONFIGURABLE TRANSPARENT PIPELINING
Authors:
Christodoulos Peltekis1, Dionysios Filippas1, Giorgos Dimitrakopoulos1, Chrysostomos Nicopoulos2 and Dionisios Pnevmatikatos3
1Democritus University of Thrace, GR; 2University of Cyprus, CY; 3National TU Athens & ICCS, GR
Abstract
Convolutional Neural Networks (CNNs) are the state-of-the-art solution for many deep learning applications. For maximum scalability, their computation should combine high performance and energy efficiency. In practice, the convolutions of each CNN layer are mapped to a matrix multiplication that includes all input features and kernels of each layer and is computed using a systolic array. In this work, we focus on the design of a systolic array with configurable pipeline with the goal to select an optimal pipeline configuration for each CNN layer. The proposed systolic array, called ArrayFlex, can operate in normal, or in shallow pipeline mode, thus balancing the execution time in cycles and the operating clock frequency. By selecting the appropriate pipeline configuration per CNN layer, ArrayFlex reduces the inference latency of state-of-the-art CNNs by 11%, on average, as compared to a traditional fixed-pipeline systolic array. Most importantly, this result is achieved while using 13%-23% less power, for the same applications, thus offering a combined energy-delay-product efficiency between 1.4x and 1.8x.
00:39 CET FASTRW: A DATAFLOW-EFFICIENT AND MEMORY-AWARE ACCELERATOR FOR GRAPH RANDOM WALK ON FPGAS
Authors:
Yingxue Gao, Teng Wang, Lei Gong, Chao Wang, Xi Li and Xuehai Zhou, University of Science and Technology of China, CN
Abstract
Graph random walk (GRW) sampling is becoming increasingly important with the widespread popularity of graph applications. It involves walkers that wander through the graph to capture its desirable properties and reduce the size of the original graph. However, previous research suffers from long sampling latency and severe memory-access bottlenecks due to intrinsic data dependency and irregular vertex distribution. This paper proposes FastRW, a dedicated accelerator that fully unleashes GRW acceleration on FPGAs. FastRW first schedules walkers' execution to address data dependency and mask long sampling latency. Then, FastRW leverages pipeline specialization and bit-level optimization to customize a processing engine with five modules and achieve a pipelined dataflow. Finally, to alleviate the unbalanced accesses caused by irregular vertex distribution, FastRW implements a hybrid memory architecture that provides parallel access ports according to a vertex's degree. We evaluate FastRW with two classic GRW algorithms on a wide range of real-world graph datasets. The experimental results show that FastRW achieves a speedup of 14.13x on average over a system running on a 2x8-core Intel CPU. FastRW also achieves 3.28x~198.24x energy efficiency over the architecture implemented on a V100 GPU.
00:39 CET TWIN ECC: A DATA DUPLICATION BASED ECC FOR STRONG DRAM ERROR RESILIENCE
Authors:
Hyeong Kon Bae1, Myung Jae Chung1, Young-Ho Gong2 and Sung Woo Chung1
1Korea University, KR; 2Kwangwoon University, KR
Abstract
With the continuous scaling of process technology, DRAM reliability has become a critical challenge in modern memory systems. Currently, DRAM memory systems for servers employ ECC DIMMs with a single-error-correction and double-error-detection (SECDED) code. However, the SECDED code is insufficient to ensure DRAM reliability as memory systems become more susceptible to errors. Though various studies have proposed multi-bit correctable ECC schemes, such ECC schemes cause performance and/or storage overhead. To minimize performance degradation while providing strong error resilience, in this paper we propose Twin ECC, a low-cost memory protection scheme based on data duplication. Within a 512-bit data word, Twin ECC duplicates the meaningful data into the meaningless zero portion. Since the '1'→'0' error pattern is dominant in DRAM cells, Twin ECC provides strong error resilience by performing bitwise OR operations between the original meaningful data and the duplicated data. After the bitwise OR operations, Twin ECC applies the SECDED code to further enhance data protection. Our evaluations show that Twin ECC reduces the system failure probability by an average of 64.8%, 56.9%, and 49.6% when the portion of '1'→'0' errors is 100%, 90%, and 80%, respectively, while causing only 0.7% performance overhead and no storage overhead compared to the baseline ECC DIMM with a SECDED code.
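A toy model of the duplication idea is sketched below: when half of a word is meaningless zeros, the meaningful half is copied into it, and because the dominant DRAM failure flips '1' to '0', OR-ing the two copies recovers any bit that failed in only one of them. The SECDED layer that Twin ECC adds on top, and its detection of doubly hit bits, are omitted from this sketch.

    HALF = 256
    MASK = (1 << HALF) - 1

    def write_word(meaningful):                  # meaningful data fits in the lower 256 bits
        return (meaningful << HALF) | meaningful     # duplicate it into the zero half

    def inject_1_to_0_errors(stored, bit_positions):
        for b in bit_positions:
            stored &= ~(1 << b)                  # only '1'->'0' flips are modelled here
        return stored

    def read_word(stored):
        return (stored >> HALF) | (stored & MASK)    # bitwise OR of the two copies

    data = 0xDEADBEEF_CAFEF00D
    stored = write_word(data)
    stored = inject_1_to_0_errors(stored, [0, 3, 300])   # errors hit both copies
    print(read_word(stored) == data)                      # True: OR masks the single-copy flips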
00:39 CET AIDING TO MULTIMEDIA ACCELERATORS: A HARDWARE DESIGN FOR EFFICIENT ROUNDING OF BINARY FLOATING POINT NUMBERS
Authors:
Mahendra Rathor, Vishesh Mishra and Urbi Chatterjee, Indian Institute of Technology Kanpur, IN
Abstract
Hardware accelerators for multimedia applications such as JPEG image compression and video compression are quite popular due to their capability to enhance overall performance and system throughput. The core of essentially all lossy compression techniques is the quantization process, in which rounding is performed to obtain integer values for the compressed images and video frames. Recent studies in photo forensics research have revealed that direct rounding, e.g., rounding up or rounding down floating point numbers, results in compression artifacts such as 'JPEG dimples'. Therefore, in the compression process, rounding to the nearest integer value is important, especially for High Dynamic Range (HDR) photography and videography. Since rounding to the nearest integer is a data-intensive process, its realization as dedicated hardware is imperative to enhance overall performance. This paper presents a novel high-performance hardware architecture for rounding binary floating point numbers to the nearest integer. Additionally, an optimized version of the basic hardware design is also proposed. The proposed optimized version provides a 6.7% reduction in area and a 7.4% reduction in power consumption in comparison to the proposed basic architecture. Furthermore, the integration of the proposed floating point rounding hardware with the design flow of the computing kernel of the compression processor is also discussed in the paper. The proposed rounding hardware architecture and the integrated design with the computing kernel of the compression process have been implemented on an Intel FPGA. The average resource overhead due to this integration is reported to be less than 1%.
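The following small example illustrates why the rounding mode in the quantization step matters: truncation biases the reconstructed values, whereas rounding to the nearest integer keeps the quantization error centred around zero. The coefficients and quantization step are arbitrary examples, and the sketch says nothing about the proposed hardware itself.

    import numpy as np

    coeffs = np.array([13.7, -4.2, 7.5, 0.49, -0.51])   # e.g. DCT coefficients
    qstep = 2.0

    trunc = np.trunc(coeffs / qstep) * qstep     # direct (biased) rounding toward zero
    nearest = np.round(coeffs / qstep) * qstep   # round to nearest (numpy uses half-to-even)

    print("truncation error :", coeffs - trunc)
    print("nearest error    :", coeffs - nearest)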
00:39 CET CRSPU: EXPLOIT COMMONALITY OF REGULAR SPARSITY TO SUPPORT VARIOUS CONVOLUTIONS ON SYSTOLIC ARRAYS
Authors:
Jianchao Yang, Mei Wen, Junzhong Shen, Yasong Cao, Minjin Tang, Renyu Yang, Xin Ju and Chunyuan Zhang, College of Computer, National University of Defense Technology, CN
Abstract
Dilated convolution (DCONV) and transposed convolution (TCONV) are involved in the training of GANs and CNNs and introduce numerous regular zero-spaces into the feature maps or kernels. Existing accelerators typically pre-reorganize the zero-spaces and then perform sparse computation to accelerate them, resulting in huge hardware resource overhead and control complexity. While the systolic array has proven advantages when it comes to accelerating convolutions, countermeasures for deploying DCONV and TCONV on systolic arrays are rarely proposed. Therefore, we opt to improve the traditional im2col algorithm to make full use of the regular sparsity and avoid data reorganization, thereby facilitating the use of systolic arrays in this context. Public Dimension Compression and Similar Sparsity Merging mechanisms are also designed to implement sparse computing, eliminating unnecessary computation caused by zero-spaces. We propose a systolic array-based processing unit, named CRSPU, that achieves 7736 GOPS peak throughput, with an area efficiency of 2046.56 GOPS/mm² and a PE efficiency of 30.22 GOPS/PE. Experiments show that, compared with the dense and sparse versions of the state-of-the-art baseline accelerator GANPU, CRSPU achieves 11.20× and 8.43× speedups in the training performance of CycleGAN, improves PE efficiency by 58.80× and 44.23×, and reduces the on-chip bandwidth of DCONV and TCONV by up to 31.66% and 54.63%, respectively. The average PE utilization of CRSPU on various GAN models reaches 93.27%. Furthermore, CRSPU's ability to avoid zero-space data reorganization represents a huge advantage for bandwidth-unfriendly accelerators.
00:39 CET CLAP: LOCALITY AWARE AND PARALLEL TRIANGLE COUNTING WITH CONTENT ADDRESSABLE MEMORY
Authors:
Tianyu Fu1, Chiyue Wei2, Zhenhua Zhu2, Shang Yang2, Zhongming Yu3, Guohao Dai2, Huazhong Yang2 and Yu Wang2
1Ph. D. Tsinghua University, CN; 2Tsinghua University, CN; 3University of California, San Diego, US
Abstract
Triangle counting (TC) is one of the most fundamental graph analysis tools, with a wide range of applications. Modern triangle counting algorithms traverse the graph and perform set intersections of neighbor sets to find triangles. However, existing triangle counting approaches suffer from heavy off-chip memory access and set-intersection overhead. Thus, we propose CLAP, the first content addressable memory (CAM)-based triangle counting architecture with software and hardware co-optimizations. To reduce off-chip memory access and the number of set intersections, we propose the first force-based node index reorder method. It is a universal reorder framework that simultaneously optimizes both data locality and the amount of computation. Compared with random node indices, the reorder method reduces off-chip memory accesses and set intersections by 61% and 64%, respectively, while providing a 2.19x end-to-end speedup. To improve set-intersection parallelism, we propose the first CAM-based triangle counting architecture under chip area constraints. We enable highly parallel set intersection by translating it into a content search on CAM with full parallelism, reducing the time complexity of set intersection from O(m + n) or O(n log m) to O(n). Extensive experiments on real-world graphs show that CLAP achieves 39x, 27x, and 78x speedups over state-of-the-art CPU, GPU, and processing-in-memory baselines, respectively.
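As background, the software side of triangle counting via neighbour-set intersection can be summarized in a few lines. The degree-based ordering below is only a stand-in for the paper's force-based reorder, and the CAM-based search is of course absent.

    edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
    nodes = {u for e in edges for u in e}
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # orient each edge from lower- to higher-ranked endpoint so every triangle is found once
    rank = {u: (len(adj[u]), u) for u in nodes}
    fwd = {u: {v for v in adj[u] if rank[v] > rank[u]} for u in nodes}

    triangles = sum(len(fwd[u] & fwd[v]) for u in nodes for v in fwd[u])
    print(triangles)   # -> 2 for this toy graph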
00:39 CET ATOMIC BUT LAZY UPDATING WITH MEMORY-MAPPED FILES FOR PERSISTENT MEMORY
Authors:
Qisheng Jiang, Lei Jia and Chundong Wang, ShanghaiTech University, CN
Abstract
Applications memory-map file data stored in persistent memory and expect both high performance and failure atomicity. The state-of-the-art NOVA and Libnvmmio guarantee failure atomicity but yield inferior performance. They keep data fresh and intact at the mapped addresses by continually updating the data there, thereby incurring severe write amplification. They also lack adaptability to dynamic workloads and entail housekeeping overheads due to their complex designs. We hence propose Acumen, which manages a group of reflection pages for each mapped file. Using a simple bitmap to track fine-grained data slices, Acumen pairs each reflection page with a mapped file page so that the two alternately carry updates, achieving failure atomicity. Only on receiving a read request will it deploy valid data from reflection pages into the target mapped file pages; the cost of deployment is amortized over subsequent read requests. Experiments show that Acumen significantly outperforms NOVA and Libnvmmio, with consistently higher performance in serving a variety of workloads.
00:39 CET OUT-OF-STEP PIPELINE FOR GATHER/SCATTER INSTRUCTIONS
Authors:
Yi Ge1, Katsuhiro Yoda1, Makiko Ito1, Toshiyuki Ichiba1, Takahide Yoshikawa1, Ryota Shioya2 and Masahiro Goshima3
1Fujitsu Limited, JP; 2University of Tokyo, JP; 3National Institute of Informatics, JP
Abstract
Recently, demand for large sparse matrix calculations has been increasing, as symbolized by the HPCG benchmark. However, wider SIMD units suffer from the low scalability of the gather/scatter instructions that inevitably appear in sparse matrix calculations. The scalability of gather/scatter instructions is low because a v-element-wide gather/scatter instruction takes O(v) cycles to access an L1D with O(1) ports. We address this problem with a multibank L1D (MBL1D) and a newly proposed out-of-step pipeline. Though an MBL1D with O(v) banks potentially enables access to v non-contiguous elements in O(1) cycles, it causes bank conflicts, and the resultant instruction rescheduling diminishes the gain. The out-of-step pipeline avoids this instruction rescheduling by allowing the element operations of SIMD instructions to proceed out of step with each other through small FIFO buffers. We evaluate our pipeline on a fully cycle-accurate simulator with a sparse matrix-vector product kernel for various sparse matrices from the HPCG benchmark and the SuiteSparse Matrix Collection in the SELL-C-σ format. The results show that, for a SIMD width of 1024 bits, it achieves 76.8% efficiency relative to an ideal model with a full-port L1D free from bank conflicts, and a 1.87 times improvement over a model of a conventional pipeline with MBL1D.
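A back-of-the-envelope model of why a multibank L1D helps gathers is sketched below: each of the v gathered elements maps to a bank, and the access takes as many cycles as the most heavily loaded bank. The element size and bank count are illustrative, not the paper's configuration.

    from collections import Counter

    ELEM_BYTES, NUM_BANKS = 8, 16

    def gather_cycles(addresses):
        banks = [(a // ELEM_BYTES) % NUM_BANKS for a in addresses]
        return max(Counter(banks).values())       # serialized accesses on the busiest bank

    conflict_free = [i * ELEM_BYTES for i in range(16)]              # 16 distinct banks
    all_same_bank = [i * ELEM_BYTES * NUM_BANKS for i in range(16)]  # all hit one bank
    print(gather_cycles(conflict_free), gather_cycles(all_same_bank))   # 1 16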
00:39 CET MEMPOOL MEETS SYSTOLIC: FLEXIBLE SYSTOLIC COMPUTATION IN A LARGE SHARED-MEMORY PROCESSOR CLUSTER
Authors:
Samuel Riedel1, Gua Hao Khov1, Sergio Mazzola2, Matheus Cavalcante1, Renzo Andri3 and Luca Benini4
1ETH Zurich, CH; 2ETH Zürich, CH; 3Huawei Zurich Research Center, CH; 4Università di Bologna and ETH Zurich, IT
Abstract
Systolic arrays and shared-memory manycore clusters are two widely used architectural templates to accelerate highly parallel workloads. Yet, they offer vastly different trade-offs. Systolic arrays achieve exceptional performance for workloads with regular dataflow at the cost of an extremely rigid architecture and programming model. On the other hand, shared-memory manycore systems are well suited for a wide range of workloads and easy to program. However, dataflow has to be managed explicitly, incurring a runtime overhead. This work combines the best of both worlds by extending a general-purpose shared-memory manycore cluster with a systolic overlay. The combination allows for efficient execution for systolically computable kernels while keeping the flexibility of the programmable shared-memory system for irregular kernels. In our hybrid architecture, the small and efficient RISC-V processor cores act as the systolic array's processing elements (PEs). The cluster's shared memory is used to implement a memory-mapped queue network capable of establishing any systolic topology among all PEs. We propose two instruction set architecture extensions enabling native and autonomous queue communication and implement them in MemPool, an open-source manycore architecture featuring 256 cores and 1 MiB of shared L1 memory. Namely, the Xqueue extension enables single-instruction access to any shared memory queue, while the queue-linked register extension enables autonomous access to the queues, relieving the cores of communication instructions and letting them focus on computation. Our hybrid approach allows configuring different systolic topologies at execution time. We analyze the trade-offs between systolic and shared-memory implementations and run hybrid systolic-shared-memory computations. Through the hardware extensions and hybrid execution scheme, the hybrid architecture achieves up to 194 GOPS or 132 GOPS/W on a 2D convolution, outperforming the highly-optimized shared-memory implementation by 17%.
00:39 CET NOVEL EFFICIENT SYNONYM HANDLING MECHANISM FOR VIRTUAL-REAL CACHE HIERARCHY
Authors:
Varun Venkitaraman1, Ashok Sathyan1, Shrihari Deshmukh1 and Virendra Singh2
1Indian Institute of Technology, Bombay, IN; 2IIT Bombay, IN
Abstract
Optimizing L1 caches for latency is critical to improving the system's performance. Servicing a memory request involves address translation, set indexing, tag comparisons and data access. Generally, virtually indexed physically tagged (VIPT) caches are used as L1 caches since we can perform TLB lookups and set indexing parallelly, resulting in reduced L1 cache access latency. However, a TLB lookup precedes every L1 cache access. The TLB lookups significantly contribute to the system's total power consumption. Implementing virtually indexed virtually tagged (VIVT) caches appears to be an attractive option for reducing energy consumption due to TLB lookups. Though the solution seems attractive, VIVT caches are plagued with the issue of synonyms. Previous research has proposed several approaches for resolving synonyms to facilitate the implementation of VIVT L1 caches. However, the proposed techniques introduce new hardware structures in the cache hierarchy to detect and resolve synonyms. Instead of adding another hardware structure to the cache hierarchy, we propose a new cache hierarchy design that modifies the last-level cache's tag array to detect and resolve synonyms. Doing so eliminates the need for a new hardware structure to handle synonyms, thereby not adding to the already existing critical path length. Our proposed synonym handling mechanism is speculation-free and eliminates the need for tag comparisons for a synonym L1 cache re-access. Overall, our proposed scheme significantly reduces the dynamic energy consumption of the cache hierarchy, making it energy-efficient. Furthermore, our proposed cache hierarchy design can directly replace existing ones without requiring modifications to the core or software end.
00:39 CET TURBULENCE: COMPLEXITY-EFFECTIVE OUT-OF-ORDER EXECUTION ON GPU WITH DISTANCE-BASED ISA
Authors:
Reoma Matsuo, Toru Koizumi, Hidetsugu Irie, Shuichi Sakai and Ryota Shioya, University of Tokyo, JP
Abstract
A graphic processing unit (GPU) is a processor that achieves high throughput by exploiting data parallelism. One key feature that enables high throughput is multithreaded execution, in which a GPU generally executes many threads simultaneously to hide various latencies. As a result, GPUs generally do not perform expensive out-of-order execution to hide latency within threads. However, we found that many GPU workloads also contain instruction-level parallelism, which can be extracted through out-of-order execution to provide additional performance improvement opportunities. We propose the TURBULENCE architecture for very low-cost out-of-order execution on GPUs. TURBULENCE consists of a novel ISA that introduces the concept of referencing operands by inter-instruction distance instead of register numbers, and a novel microarchitecture that executes the novel ISA. This distance-based operand has the property that it does not cause any false dependencies. Using this property, we achieve cost-effective out-of-order execution on GPUs without introducing expensive hardware such as rename logic and a load-store queue. The simulation results show that TURBULENCE improves performance by 17.6% without an increase in energy consumption compared to an existing GPU.
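A hedged Python sketch of how a distance-based operand encoding can be interpreted: each source operand names the result produced a given number of instructions earlier instead of a register, so values are never overwritten and no WAR/WAW (false) dependency can arise. The tuple instruction format is invented for illustration and is not the TURBULENCE ISA.

# Illustrative interpreter for a distance-based operand encoding (hypothetical
# format): each source operand is the distance, in instructions, back to the
# producer of the value.
def run(program, inputs):
    results = list(inputs)           # results of "instructions" executed so far
    for op, d1, d2 in program:
        a = results[-d1]             # value produced d1 instructions ago
        b = results[-d2]             # value produced d2 instructions ago
        results.append({"add": a + b, "mul": a * b}[op])
    return results[-1]

# Computes (x + y) * x without naming any register, so results are never
# overwritten and no WAR/WAW hazards exist by construction.
print(run([("add", 2, 1),        # distance 2 -> x, distance 1 -> y
           ("mul", 1, 3)],       # distance 1 -> (x+y), distance 3 -> x
          inputs=[3, 4]))        # -> 21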

S_E2 Modelling, verification and timing analysis of cyber-physical systems

Date: Tuesday, 18 April 2023
Time: 11:00 - 12:30 CET

Time Label Presentation Title
Authors
00:39 CET IMPACTTRACER: ROOT CAUSE LOCALIZATION IN MICROSERVICES BASED ON FAULT PROPAGATION MODELING
Authors:
Ru Xie1, Jing Yang2, Jingying Li2 and Liming Wang2
1Institute of Information Engineering, CAS; University of Chinese Academy of Sciences, CN; 2Institute of Information Engineering, CAS, CN
Abstract
Microservice architecture is embraced by a growing number of enterprises due to the benefits of modularity and flexibility. However, being composed of numerous interdependent microservices, it is prone to cascading failures and is afflicted by the arising problem of troubleshooting, which entails arduous efforts to identify the root cause node and ensure service availability. Previous works use call graphs to characterize the causality relationships of microservices, but not completely or comprehensively, leading to an insufficient search of potential root cause nodes and, consequently, poor accuracy in culprit localization. In this paper, we propose ImpactTracer to address the above problems. ImpactTracer builds an impact graph to provide a complete view of fault propagation in microservices and uses a novel backward tracing algorithm that exhaustively traverses the impact graph to identify the root cause node accurately. Extensive experiments on a real-world dataset demonstrate that ImpactTracer is effective in identifying the root cause node and outperforms the state-of-the-art methods by at least 73%, significantly facilitating troubleshooting in microservices.
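A hedged Python sketch of the backward-tracing idea: starting from the alerting service, the fault-propagation (impact) graph is traversed exhaustively toward upstream nodes, and nodes with no further upstream impact are reported as root-cause candidates. The graph, names, and candidate criterion are illustrative, not ImpactTracer's actual algorithm.

# Exhaustive backward traversal of an "impact graph" from the alerting
# service toward candidate root causes (illustrative only).
def backward_trace(impact_edges, alerting_node):
    # impact_edges: dict mapping node -> list of upstream nodes whose faults
    # can propagate to it (i.e., reversed direction of impact).
    visited, stack, candidates = set(), [alerting_node], []
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        upstream = impact_edges.get(node, [])
        if not upstream:                 # nothing impacts this node:
            candidates.append(node)      # it is a potential root cause
        stack.extend(upstream)
    return candidates

edges = {"frontend": ["checkout", "search"],
         "checkout": ["payment-db"],
         "search": []}
print(backward_trace(edges, "frontend"))   # ['search', 'payment-db']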
00:39 CET PUMPCHANNEL: AN EFFICIENT AND SECURE COMMUNICATION CHANNEL FOR TRUSTED EXECUTION ENVIRONMENT ON ARM-FPGA EMBEDDED SOC
Authors:
Jingquan Ge, Yuekang Li, Yang Liu, Yaowen Zheng, Yi Liu and Lida Zhao, Nanyang Technological University, SG
Abstract
ARM TrustZone separates the system into the rich execution environment (REE) and the trusted execution environment (TEE). Data can be exchanged between REE and TEE through the communication channel, which is based on shared memory and can be accessed by both REE and TEE. Therefore, when the REE OS kernel is untrusted, the security of the communication channel cannot be guaranteed. Previously proposed schemes to protect the communication channel incur high performance overhead and are not sufficiently secure. In this paper, we propose PumpChannel, an efficient and secure communication channel implemented on ARM-FPGA embedded SoC. PumpChannel avoids the use of secret keys, but utilizes a hardware and software collaborative pump to enhance the security and performance of the communication channel. Besides, PumpChannel implements a hardware-based hook integrity monitor to ensure the integrity of all hook codes. Security and performance evaluation results show that PumpChannel is more secure than the encrypted channel countermeasures and has better performance than all other schemes.
00:39 CET ON THE DEGREE OF PARALLELISM IN REAL-TIME SCHEDULING OF DAG TASKS
Authors:
Qingqiang He1, Nan Guan2, Mingsong Lv1 and Zonghua Gu3
1The Hong Kong Polytechnic University, HK; 2City University of Hong Kong, HK; 3Zhejiang Univ, CN
Abstract
Real-time scheduling and analysis of parallel tasks modeled as directed acyclic graphs (DAG) have been intensively studied in recent years. The degree of parallelism of DAG tasks is an important characterization in scheduling. This paper revisits the definition and the computing algorithms for the degree of parallelism of DAG tasks, and clarifies some misunderstandings regarding the degree of parallelism that exist in the real-time literature. Based on the degree of parallelism, we propose a real-time scheduling approach for DAG tasks, which is quite simple but rather effective and outperforms the state-of-the-art by a considerable margin.
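For illustration, a Python sketch of one common level-based proxy for the degree of parallelism of a DAG task: the maximum number of vertices sharing the same earliest start level. This is only an example of how such a quantity might be computed and is not necessarily the corrected definition or algorithm advocated in the paper.

# Level-based proxy for the degree of parallelism of a DAG (illustrative only).
from collections import defaultdict

def level_parallelism(edges, vertices):
    preds = defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
    level = {}
    def lv(v):                       # earliest level = longest path from a source
        if v not in level:
            level[v] = 1 + max((lv(u) for u in preds[v]), default=-1)
        return level[v]
    counts = defaultdict(int)
    for v in vertices:
        counts[lv(v)] += 1
    return max(counts.values())

#  a -> b -> d
#  a -> c -> d      b and c can run in parallel
print(level_parallelism([("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")],
                        ["a", "b", "c", "d"]))   # -> 2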
00:39 CET TIMING PREDICTABILITY FOR SOME/IP-BASED SERVICE-ORIENTED AUTOMOTIVE IN-VEHICLE NETWORKS
Authors:
Enrico Fraccaroli1, Prachi Joshi2, Shengjie Xu1, Khaja Shazzad2, Markus Jochim2 and Samarjit Chakraborty3
1University of North Carolina at Chapel Hill, US; 2General Motors, R&D, US; 3UNC Chapel Hill, US
Abstract
In-vehicle network architectures are evolving from a typical signal-based client-server paradigm to a service-oriented one, introducing flexibility for software updates and upgrades. While signal-based networks are static by nature, service-oriented ones can more easily evolve during and after the design phase. As a result, service-oriented protocols are spreading through several layers of in-vehicle networks. While applications like infotainment are less sensitive to delays, others like sensing and control have more stringent timing and reliability requirements. Hence, wider adoption of service-oriented protocols requires timing analyzability and predictability problems to be addressed, which are more challenging than in their signal-oriented counterparts. In service-oriented architectures, the discovery phase defines how clients find their required services. The time required to complete the discovery phase is an important parameter since it determines the readiness of a sub-system or even the vehicle. In this paper, we develop a formal timing analysis of the discovery phase of SOME/IP, an emerging service-oriented protocol considered for adoption by automotive OEMs and suppliers.
00:39 CET ANALYSIS AND OPTIMIZATION OF WORST-CASE TIME DISPARITY IN CAUSE-EFFECT CHAINS
Authors:
Xu Jiang1, Xiantong Luo1, Nan Guan2, Zheng Dong3, Shao-Shan Liu4 and Wang Yi5
1Northeastern University, CN; 2City University of Hong Kong, HK; 3Wayne State University, US; 4BeyonCa, CN; 5Uppsala University, SE
Abstract
An important and common timing requirement in autonomous systems is that when some component receives data originating from different sensors, the time difference among the timestamps of the corresponding raw data must be within a certain range, so that information from different sensors can be correctly synchronized and fused. In this paper, we present analysis techniques to bound the worst-case time disparity (the maximum difference among the timestamps of all stimuli of an output) in cause-effect chains. In particular, we take fork-join structures into consideration, and show that the worst-case time disparity can also be significant even for data tokens produced by the same sensor. This result suggests that data fusion after long processing pipelines should be avoided to meet such requirements. Moreover, we present a solution to reduce the worst-case time disparity by designing buffers with proper sizes. Experiments are conducted to show the correctness and effectiveness of both our analysis techniques and our design.
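A hedged worked example of a simple interval-based bound: if each input path to a fusion component has a best-case and worst-case end-to-end delay, the timestamps of the fused raw data differ by at most max(worst delays) - min(best delays). This is an illustrative bound, not the paper's exact fork-join analysis.

# Interval-based bound on worst-case time disparity (illustrative only).
def worst_case_time_disparity(path_delays):
    # path_delays: list of (best_case_delay, worst_case_delay) per input path;
    # the timestamp of the raw datum on path i is (output time - delay_i).
    return max(w for _, w in path_delays) - min(b for b, _ in path_delays)

# Two paths from sensors to the fusion node (delays in ms).
print(worst_case_time_disparity([(5, 20), (8, 35)]))   # -> 30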
00:39 CET DATA FRESHNESS OPTIMIZATION ON NETWORKED INTERMITTENT SYSTEMS
Authors:
Hao-Jan Huang1, Wen Sheng Lim2, Chia-Heng Tu1, Chun-Feng Wu3 and Yuan-Hao Chang4
1National Cheng Kung University, TW; 2National Taiwan University (NTU), TW; 3National Yang Ming Chiao Tung University, TW; 4Academia Sinica, TW
Abstract
A networked intermittent system (NIS) is often deployed in the field for environmental monitoring, where sink nodes are responsible for relaying the data captured by sensors to a central system. To evaluate the quality of the captured monitoring data, Age of Information (AoI) is adopted to quantify the freshness of the data received by the central server. As the sink nodes are powered by ambient energy sources (e.g., solar and wind), an energy-efficient design of the sink nodes is crucial in order to improve the system-wide AoI. This work proposes an energy-efficient sink node design to save energy and extend system uptime. We devise an AoI-aware data forwarding algorithm based on the branch-and-bound (B&B) paradigm for deriving the optimal solution offline. In addition, an AoI-aware data forwarding algorithm is developed to approximate the optimal solution at runtime. The experimental results show that our solution can greatly improve the average data freshness by 148% over existing well-known strategies and achieves 91% of the optimal solution's performance. Compared with the state-of-the-art algorithm, our energy-efficient design can deliver better A^3oI results by up to 9.6%.
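For reference, a small Python sketch of the standard Age-of-Information metric used to score freshness at the central server: at time t, AoI is t minus the generation time of the freshest update received so far. The update times and sampling loop are purely illustrative.

# Average Age of Information over a time horizon (illustrative values).
def average_aoi(updates, horizon, step=1):
    # updates: list of (generation_time, reception_time), sorted by reception
    ages, freshest = [], None
    for t in range(0, horizon, step):
        for gen, rec in updates:
            if rec <= t and (freshest is None or gen > freshest):
                freshest = gen
        if freshest is not None:
            ages.append(t - freshest)
    return sum(ages) / len(ages)

# A sample generated at t=0 arrives at t=3; the next (t=5) arrives at t=9.
print(average_aoi([(0, 3), (5, 9)], horizon=12))   # average age over t=3..11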
00:39 CET A SAFETY-GUARANTEED FRAMEWORK FOR NEURAL-NETWORK-BASED PLANNERS IN CONNECTED VEHICLES UNDER COMMUNICATION DISTURBANCE
Authors:
Kevin Kai-Chun Chang1, Xiangguo Liu2, Chung-Wei Lin1, Chao Huang3 and Qi Zhu2
1National Taiwan University, TW; 2Northwestern University, US; 3University of Liverpool, GB
Abstract
Neural-network-based (NN-based) planners have been increasingly used to enhance the performance of planning for autonomous vehicles. However, it is often difficult for NN-based planners to balance efficiency and safety in complicated scenarios, especially under real-world communication disturbance. To tackle this challenge, we present a safety-guaranteed framework for NN-based planners in connected vehicle environments with communication disturbance. Given an NN-based planner with no safety guarantee, the framework generates a robust compound planner embedding the NN-based planner to ensure the overall system safety. Moreover, with the aid of an information filter for imperfect communication and an aggressive approach for the estimation of the unsafe set, the compound planner can achieve efficiency similar to or better than that of the given NN-based planner. A comprehensive case study of unprotected left turns and extensive simulations demonstrate the effectiveness of our framework.
00:39 CET CO-DESIGN OF TOPOLOGY, SCHEDULING, AND PATH PLANNING IN AUTOMATED WAREHOUSES
Authors:
Christopher Leet1, Chanwook Oh1, Michele Lora2, Sven Koenig1 and Pierluigi Nuzzo1
1University of Southern California, US; 2University of Verona, IT
Abstract
We address the warehouse servicing problem (WSP) in automated warehouses, which use teams of mobile agents to bring products from shelves to packing stations. Given a list of products, the WSP amounts to finding a plan for a team of agents which brings every product on the list to a station within a given timeframe. The WSP consists of four subproblems, concerning what tasks to perform (task formulation), who will perform them (task allocation), and when (scheduling) and how (path planning) to perform them. These subproblems are NP-hard individually and more complex in combination. The difficulty of the WSP is compounded by the scale of automated warehouses, which frequently use teams of hundreds of agents. In this paper, we present a methodology that can solve the WSP at such scales. We introduce a novel, contract-based design framework which decomposes an automated warehouse into traffic system components. By assigning each of these components a contract describing the traffic flows it can support, we can synthesize a traffic flow satisfying a given WSP instance. Component-wise search-based path planning is then used to transform this traffic flow into a plan for discrete agents in a modular way. Evaluation shows that this methodology can solve WSP instances on real automated warehouses.
00:39 CET POLYGLOT MODAL MODELS THROUGH LINGUA FRANCA
Authors:
Alexander Schulz-Rosengarten1, Reinhard von Hanxleden1, Marten Lohstroh2, Soroush Bateni3 and Edward Lee4
1Dept. of Computer Science, Kiel University, DE; 2University of California, Berkeley, US; 3University of Texas at Dallas, US; 4UC Berkeley, US
Abstract
Complex software systems often feature distinct modes of operation, each designed to handle a particular scenario that may require the system to respond in a certain way. Breaking down system behavior into mutually exclusive modes and discrete transitions between modes is a commonly used strategy to reduce implementation complexity and promote code readability. However, such capabilities often come in the form of self-contained domain-specific languages or language-specific frameworks. The work in this paper aims to bring the advantages of modal models to mainstream programming languages by following the polyglot coordination approach of Lingua Franca (LF), in which verbatim target code (e.g., C, C++, Python, TypeScript, or Rust) is encapsulated in composable reactive components called reactors. Reactors can form dataflow networks, be triggered by timed as well as sporadic events, execute concurrently, and be distributed across nodes on a network. With modal models in LF, we introduce a lean extension to the concept of reactors that enables the coordination of reactive tasks based on modes of operation. The implementation of modal reactors outlined in this paper generalizes to any LF-supported language with only modest modifications to the generic runtime system.
00:39 CET DEL: DYNAMIC SYMBOLIC EXECUTION-BASED LIFTER FOR ENHANCED LOW-LEVEL INTERMEDIATE REPRESENTATION
Authors:
Hany Abdelmaksoud1, Zain A. H. Hammadeh1, Goerschwin Fey2 and Daniel Luedtke1
1German Aerospace Center (DLR), DE; 2TU Hamburg, DE
Abstract
Analyzing safety and timing on Low-Level Intermediate Representation (LLIR) is preferable over the source-code and binary-code levels. However, the expressiveness of the generated LLIR challenges the efficiency of the applied analysis. This work develops an approach that lifts binaries into an enhanced LLVM IR including indirect jumps. The proposed lifter combines both static and dynamic methods and strives to fully recover the Control-Flow Graph (CFG) of the program. Using Satisfiability Modulo Theories (SMT) supported by memory and register models, our lifter performs dynamic symbolic execution of IR instructions after translating them into SMT expressions. The accuracy and scalability of the lifter are studied on a benchmark from the literature. The proposed lifter resolves all indirect jumps and fully recovers the CFG of a case study on which static lifters like RetDec and Angr failed. Also, our lifter achieves 100% code coverage of the benchmark, unlike purely dynamic lifters. Even when scaling up the code, the lifter effectively recovers the CFG.
00:39 CET WCET ANALYSIS OF SHARED CACHES IN MULTI-CORE ARCHITECTURES USING EVENT-ARRIVAL CURVES
Authors:
Thilo Fischer and Heiko Falk, Hamburg University of Technology, DE
Abstract
The worst-case behavior of shared caches in multi-core systems is difficult to predict due to the effects of inter-core interference. This uncertainty can lead to large over-estimations of the worst-case execution time (WCET). In this paper, we propose a novel analysis approach for shared caches using the LRU replacement policy. We quantify inter-core cache interference by modelling cache accesses as event streams and show how an event-arrival curve can be derived from a low-level representation of the program code. Thus, inter-core cache interference may be expressed as a function of time. By analyzing the duration between accesses to a cache block, it is possible to bound the inter-core interference and classify accesses as cache hits or potential misses. We implemented this classification approach in a WCET analyzer and evaluated its performance for dual-core and quad-core systems using several cache configurations. The proposed analysis yielded similar WCET performance to a partitioned cache, with a median WCET increase of only 4% to 11%. Furthermore, the proposed analysis was able to outperform a partitioned cache by up to 26% in some cases. The analysis overhead of the proposed techniques was just 6 minutes on average for quad-core systems.
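A hedged Python sketch of the classification idea: an event-arrival curve upper-bounds the number of interfering shared-cache accesses other cores can issue in any window, and under LRU a re-access to a block can be classified as a hit if the block's own reuse plus worst-case interference cannot fill the set. The linear curve and all constants are invented for illustration and are not the paper's analysis.

# Event-arrival-curve-based hit classification for a shared LRU cache
# (all parameters are invented for illustration).
def eta(delta, burst=2, rate=0.1):
    # simple linear arrival curve: at most burst + rate*delta interfering accesses
    return burst + int(rate * delta)

def classify_reaccess(own_distinct_blocks, window_len, associativity=8):
    interference = eta(window_len)
    # Block B survives in the set if fewer than `associativity` other blocks
    # could have entered it since the last access to B (own reuse plus
    # worst-case inter-core interference).
    return "hit" if own_distinct_blocks + interference < associativity else "maybe-miss"

print(classify_reaccess(own_distinct_blocks=3, window_len=20))   # -> hit
print(classify_reaccess(own_distinct_blocks=3, window_len=200))  # -> maybe-miss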
00:39 CET RESOURCE OPTIMIZATION WITH 5G CONFIGURED GRANT SCHEDULING FOR REAL-TIME APPLICATIONS
Authors:
Yungang Pan1, Rouhollah Mahfouzi1, Soheil Samii2, Petru Eles2 and Zebo Peng2
1Linköping University, SE; 2Linkoping University, SE
Abstract
5G is expected to support ultra-reliable low latency communication (URLLC) to enable real-time applications such as industrial automation and control. 5G configured grant (CG) scheduling features a pre-allocated periodicity-based scheduling approach, which reduces control signaling time and guarantees service quality. Although this enables 5G to support hard real-time periodic traffic, efficiently synthesizing the schedule and achieving high resource efficiency while serving multiple traffic flows is still an open problem. To address this problem, we first formulate it using satisfiability modulo theories (SMT) so that an SMT solver can be used to generate the optimal solution. To enhance scalability, two efficient heuristic approaches are proposed. We conduct extensive experiments to evaluate the scalability and performance of our proposed algorithms. The experiments demonstrate the effectiveness of the proposed technique.
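A hedged sketch of an SMT formulation in the spirit described, using the z3 Python bindings (requires the z3-solver package): two periodic traffic flows must be given configured-grant offsets such that none of their transmission slots collide within the hyperperiod. The periods, the one-slot resource model, and the encoding are illustrative only, not the paper's formulation.

# Toy configured-grant offset assignment as an SMT problem (illustrative).
from z3 import Ints, Solver, Distinct, And, sat

o1, o2 = Ints("o1 o2")                  # grant offsets of the two traffic flows
p1, p2, hyper = 4, 6, 12                # periods and hyperperiod (in slots)

s = Solver()
s.add(And(o1 >= 0, o1 < p1, o2 >= 0, o2 < p2))
slots = [o1 + p1 * k for k in range(hyper // p1)] + \
        [o2 + p2 * k for k in range(hyper // p2)]
s.add(Distinct(*slots))                 # no two grants may use the same slot

if s.check() == sat:
    m = s.model()
    print("offsets:", m[o1].as_long(), m[o2].as_long())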
00:39 CET ADAPTIVE: AGENT-BASED LEARNING FOR BOUNDING TIME IN MIXED-CRITICALITY SYSTEMS
Authors:
Behnaz Ranjbar, Ali Hosseinghorban and Akash Kumar, TU Dresden, DE
Abstract
In Mixed-Criticality (MC) systems, the high Worst-Case Execution Time (WCET) of a task is a pessimistic bound, the maximum execution time of the task under all circumstances, while the low WCET should be close to the actual execution time of most instances of the task to improve utilization and Quality-of-Service (QoS). Most MC systems consider a static low WCET for each task which cannot adapt to dynamism at run-time. In this regard, we consider the run-time behavior of tasks and propose a learning-based approach that dynamically monitors the tasks' execution times and adapts the low WCETs to determine the ideal trade-off between mode-switches, utilization, and QoS. Based on our observations on running embedded real-time benchmarks on a real platform, the proposed scheme improves the QoS by 16.4% on average, while reducing the utilization waste by 17.7%, on average, compared to state-of-the-art works.

S_T2 Test methods and dependability

Date: Tuesday, 18 April 2023
Time: 11:00 - 12:30 CET

Time Label Presentation Title
Authors
00:39 CET IMPROVING RELIABILITY OF SPIKING NEURAL NETWORKS THROUGH FAULT AWARE THRESHOLD VOLTAGE OPTIMIZATION
Authors:
Ayesha Siddique and Khaza Anuarul Hoque, University of Missouri, US
Abstract
Spiking neural networks have made breakthroughs in computer vision by lending themselves to neuromorphic hardware. However, neuromorphic hardware lacks parallelism and hence limits the throughput and hardware acceleration of SNNs on edge devices. To address this problem, many systolic-array SNN accelerators (systolicSNNs) have been proposed recently, but their reliability is still a major concern. In this paper, we first extensively analyze the impact of permanent faults on systolicSNNs. Then, we present a novel fault mitigation method, i.e., fault-aware threshold voltage optimization in retraining (FalVolt). FalVolt optimizes the threshold voltage for each layer in retraining to achieve classification accuracy close to the baseline in the presence of faults. To demonstrate the effectiveness of our proposed mitigation, we classify both static (i.e., MNIST) and neuromorphic datasets (i.e., N-MNIST and DVS Gesture) on a 256x256 systolicSNN with stuck-at faults. We empirically show that the classification accuracy of a systolicSNN drops significantly even at extremely low fault rates (as low as 0.012%). Our proposed FalVolt mitigation method improves the performance of systolicSNNs by enabling them to operate at fault rates of up to 60%, with a negligible drop in classification accuracy (as low as 0.1%). Our results show that FalVolt is 2x faster compared to other state-of-the-art techniques common in artificial neural networks (ANNs), such as fault-aware pruning and retraining without threshold voltage optimization.
00:39 CET AUTOMATED AND AGILE DESIGN OF LAYOUT HOTSPOT DETECTOR VIA NEURAL ARCHITECTURE SEARCH
Authors:
Zihao Chen1, Fan Yang1, Li Shang2 and Xuan Zeng1
1Fudan University, CN; 2Fudan University, CN
Abstract
This paper presents a neural architecture search scheme for chip layout hotspot detection. In this work, hotspot detectors, in the form of neural networks, are modeled as weighted directed acyclic graphs. A variational autoencoder maps the discrete graph topological space into a continuous embedding space. Bayesian optimization performs neural architecture search in this embedding space, where an architecture performance predictor is employed to accelerate the search process. Experimental studies on the ICCAD 2012 and ICCAD 2019 Contest benchmarks demonstrate that the proposed scheme significantly improves the agility of previous neural architecture search schemes and generates hotspot detectors with competitive detection accuracy, false alarm rate, and inference time.
00:39 CET UPHEAVING SELF-HEATING EFFECTS FROM TRANSISTOR TO CIRCUIT LEVEL USING CONVENTIONAL EDA TOOL FLOWS
Authors:
Florian Klemme1, Sami Salamin2 and Hussam Amrouch1
1University of Stuttgart, DE; 2Hyperstone, DE
Abstract
In this work, we are the first to demonstrate how well-established EDA tool flows can be employed to upheave Self-Heating Effects (SHE) from individual devices at the transistor level all the way up to complete large circuits at the final layout (i.e., GDS-II) level. Transistor SHE imposes an ever-growing reliability challenge due to the continuous shrinking of geometries alongside the non-ideal voltage scaling in advanced technology nodes. The challenge is largely exacerbated when more confined 3D structures are adopted to build transistors such as upcoming Nanosheet FETs and Ribbon FETs. By employing increasingly-confined structures and materials of poorer thermal conductance, heat arising within the transistor's channel is trapped inside and cannot escape. This leads to accelerated defect generation and, if not considered carefully, a profound risk to IC reliability. Due to the lack of EDA tool flows that can consider SHE, circuit designers are forced to take pessimistic worst-case assumptions (obtained at the transistor level) to ensure reliability of the complete chip for the entire projected lifetime - at the cost of sub-optimal circuit designs and considerable efficiency losses. Our work paves the way for designers to estimate less pessimistic (i.e., small yet sufficient) safety margins for their circuits leading to higher efficiency without compromising reliability. Further, it provides new perspectives and opens new doors to estimate and optimize reliability correctly in the presence of emerging SHE challenge through identifying early the weak spots and failure sources across the design.
00:39 CET BUILT-IN SELF-TEST AND BUILT-IN SELF-REPAIR STRATEGIES WITHOUT GOLDEN SIGNATURE FOR COMPUTING IN-MEMORY
Authors:
Yu-Chih Tsai1, Wen-Chien Ting2, Chia-Chun Wang1, Chia-Cheng Chang1 and Ren-Shuo Liu1
1EE, National Tsing Hua University, TW; 2National Tsing Hua University, TW
Abstract
This paper proposes built-in self-test (BIST) and built-in self-repair (BISR) strategies for computing in-memory (CIM), including a novel testing method, CIM output range adjusting, and CIM bitline reordering. They all focus on mitigating the impact of inherent and inevitable CIM inaccuracy on convolutional neural networks (CNNs). The proposed BIST strategy exploits the distributive law to achieve at-speed CIM tests without storing testing vectors or golden results. Besides, it can assess the severity of the inherent inaccuracies among CIM bitlines instead of only offering a pass/fail outcome. In addition to BIST, we propose two BISR strategies. First, we propose to slightly offset the dynamic range of CIM outputs toward the negative side to create a margin for negative noises. By not cutting CIM outputs off at zero, negative noises are preserved to cancel positive noises statistically, and accuracy impacts are mitigated. Second, we propose to remap the bitlines of CIM according to our BIST outcomes. Briefly speaking, we propose to map the least noisy bitlines to be the MSBs. This remapping can be done in the digital domain without touching the CIM internals. Experiments show that our proposed BIST and BISR strategies can restore CIM to less than 1% Top-1 accuracy loss with slight hardware overhead.
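One plausible reading of a golden-free, distributive-law style check, sketched in Python: since w·(x1+x2) must equal w·x1 + w·x2, a CIM dot-product macro can compare the two evaluations and use the residue as a per-bitline severity estimate without any stored golden signature. The inaccuracy model and threshold are invented; this is not the paper's exact BIST procedure.

# Distributive-law consistency check for one CIM bitline (illustrative model).
def cim_dot(weights, x, bitline_error=0.0):
    # stand-in for one analog CIM dot product; bitline_error models inaccuracy
    return sum(w * xi for w, xi in zip(weights, x)) + bitline_error

def distributive_check(weights, x1, x2, bitline_error=0.0, threshold=0.5):
    # w.(x1+x2) == w.x1 + w.x2 must hold, so the residue needs no golden result
    lhs = cim_dot(weights, [a + b for a, b in zip(x1, x2)], bitline_error)
    rhs = cim_dot(weights, x1, bitline_error) + cim_dot(weights, x2, bitline_error)
    residue = abs(lhs - rhs)            # also usable as a severity estimate
    return residue, residue < threshold

w  = [1, -2, 3, 1]
x1 = [4, 1, 0, 2]
x2 = [0, 3, 2, 1]
print(distributive_check(w, x1, x2, bitline_error=0.0))   # (0.0, True): healthy bitline
print(distributive_check(w, x1, x2, bitline_error=2.0))   # (2.0, False): flagged as noisy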
00:39 CET BITSTREAM-LEVEL INTERCONNECT FAULT CHARACTERIZATION FOR SRAM-BASED FPGAS
Authors:
Christian Fibich1, Martin Horauer1 and Roman Obermaisser2
1University of Applied Sciences Technikum Wien, AT; 2University of Siegen, DE
Abstract
The configurable interconnect of SRAM-based FPGAs makes up a significant portion of their configuration, and thus exposes a large attack surface to single-event upsets. Time-consuming fault injection experiments as well as reliability estimation techniques benefit from a better understanding of the behavior of FPGA interconnects under the presence of these faults, allowing some interconnect faults to be treated as more serious than others. This work proposes an approach to (1) analyze the interconnect configuration of a given FPGA technology to deduce the logical effects caused by single-bit flips and (2) characterize the effects of a subcategory of such faults – faults that create device-internal short circuits – on routes implemented on a given FPGA technology. These approaches are illustrated in case studies on two FPGA technologies: Xilinx 7 Series and Lattice iCE40. Characterization of interconnect faults on Xilinx 7 Series revealed that in a particular subcategory – short circuits with previously unused wires – only a fraction of these faults impact a signal critically. Applying this knowledge to three benchmark designs shows that the number of tested bits in this category can be reduced by more than 85%. Assuming that all other bits are tested exhaustively, the total number of injections is reduced by at least 19%. This prediction has subsequently been validated using fault injection experiments.
00:39 CET COMPACT TEST PATTERN GENERATION FOR MULTIPLE FAULTS IN DEEP NEURAL NETWORKS
Authors:
Dina Moussa1, Michael Hefenbrock2 and Mehdi Tahoori1
1Karlsruhe Institute of Technology, DE; 2RevoAI, DE
Abstract
Deep neural networks (DNNs) have achieved record-breaking performance in various applications. However, this often comes at significant computational cost. To reduce the energy footprint and increase performance, DNNs are often implemented on specific hardware accelerators, such as Tensor Processing Units (TPUs) or emerging memristive technologies. Unfortunately, the presence of various hardware faults can threaten these accelerators' performance and degrade the inference accuracy. This necessitates the development of efficient testing methodologies to unveil hardware faults in DNN accelerators. In this work, we propose a test pattern generation approach to detect fault patterns in DNNs for a common type of hardware fault, namely, faulty (weight) value representation at the bit level. In contrast to most related works, which reveal faults via output deviations, our test patterns are constructed to reveal faults via misclassification, which is more realistic for black-box testing. The experimental results show that the generated test patterns provide 100% fault coverage for targeted fault patterns. Besides, a high compaction ratio was achieved across different datasets and model architectures (up to 50x), and high fault coverage (up to 99.9%) for unseen fault patterns during the test generation phase.
00:39 CET READ: RELIABILITY-ENHANCED ACCELERATOR DATAFLOW OPTIMIZATION USING CRITICAL INPUT PATTERN REDUCTION
Authors:
Zuodong Zhang1, Meng Li2, Yibo Lin3, Runsheng Wang3 and Ru Huang3
1School of Integrated Circuits, Peking University, CN; 2Institute for Artificial Intelligence and School of Integrated Circuits, Peking University, CN; 3Peking University, CN
Abstract
With the rapid advancements of deep learning in recent years, hardware accelerators are increasingly deployed in safety-critical applications such as autonomous driving and robotics. These accelerators are usually fabricated in advanced technology nodes for higher performance and energy efficiency, which makes them more prone to timing errors under process, voltage, temperature, and aging (PVTA) variations. By revisiting the physical sources of timing errors, we show that most of the timing errors in the accelerator are caused by several specific input patterns, defined as critical input patterns. To improve the robustness of the accelerator, in this paper, we propose READ, a reliability-enhanced accelerator dataflow optimization method that can effectively reduce timing errors. READ reduces the critical input patterns by exploring the optimal computing sequence when mapping a trained deep neural network to the accelerator. READ only changes the order of MAC operations in a convolution, and it does not introduce any additional hardware overhead to the computing array. The experimental results on VGG-16 and ResNet-18 demonstrate an average 6.3× timing error reduction and up to 24.25× timing error reduction for certain layers. The results also show that READ enables the accelerator to maintain accuracy over a wide range of PVTA variations, making it a promising approach for robust deep learning design.
00:39 CET ROBUST RESISTIVE OPEN DEFECT IDENTIFICATION USING MACHINE LEARNING WITH EFFICIENT FEATURE SELECTION
Authors:
Zahra Paria Najafi-Haghi, Florian Klemme, Hanieh Jafarzadeh, Hussam Amrouch and Hans-Joachim Wunderlich, University of Stuttgart, DE
Abstract
Resistive open defects of circuits in FinFET technology cause small delay faults, which may increase over the lifetime and should be ruled out before deployment. In order not to sacrifice yield, these defects have to be distinguished from the effects of process variation, which are mostly benign. Recently, it has been shown that machine learning schemes are able to classify defective circuits with high accuracy based on the maximum frequencies Fmax(Vdd) obtained under multiple supply voltages Vdd ∈ Vop. In the paper at hand, it is shown that the machine learning-based technique can work efficiently with a drastically reduced number of Fmax(Vdd) measurements while still retaining high accuracy. Each supply voltage Vdd defines a feature Fmax(Vdd), and state-of-the-art feature selection techniques reduce the measurement effort significantly. In systems with Adaptive Voltage Frequency Scaling (AVFS), Fmax measurements are available for the minimum, maximum, and nominal supply voltages. Therefore, the conventional speed-binning report often already allows a highly accurate defect classification.
00:39 CET SECURITY-AWARE APPROXIMATE SPIKING NEURAL NETWORK
Authors:
Syed Tihaam Ahmad, Ayesha Siddique and Khaza Anuarul Hoque, University of Missouri, US
Abstract
Deep Neural Networks (DNNs) and Spiking Neural Networks (SNNs) are both known for their susceptibility to adversarial attacks. Therefore, researchers in the recent past have extensively studied the robustness and defense of DNNs and SNNs under adversarial attacks. Compared to accurate SNNs (AccSNN), approximate SNNs (AxSNNs) are known to be up to 4X more energy-efficient for ultra-low power applications. Unfortunately, the robustness of AxSNNs under adversarial attacks is yet unexplored. In this paper, we first extensively analyze the robustness of AxSNNs under different structural parameters and approximation levels against two gradient-based and two neuromorphic attacks. Our study revealed that AxSNNs are more prone to adversarial attacks than AccSNNs. Then we propose a novel design approach for designing robust AxSNNs using two novel defense methods: precision scaling and approximation- and quantization-aware filtering (AQF). The effectiveness of these two defense methods was evaluated using one static and one neuromorphic dataset. Our results demonstrate that precision scaling and AQF can significantly improve the robustness of AxSNNs. For instance, a PGD attack on AxSNN results in 72% accuracy loss, whereas the same attack on the precision-scaled AxSNN leads to only 17% accuracy loss in the static MNIST dataset (4X robustness improvement). Similarly, for the neuromorphic DVS128 Gesture dataset, we observe that Sparse Attack on AxSNN leads to 77% accuracy loss compared to AccSNN without any attack. However, with AQF, the accuracy loss is only 2% (38X robustness improvement).
00:39 CET BAFFI: A BIT-ACCURATE FAULT INJECTOR FOR IMPROVED DEPENDABILITY ASSESSMENT OF FPGA PROTOTYPES
Authors:
Ilya Tuzov1, David de Andres2, Juan-Carlos Ruiz2 and Carles Hernandez2
1Universidad Politécnica de Valencia, ES; 2Universidad Politecnica de Valencia, ES
Abstract
FPGA-based fault injection (FFI) is an indispensable technique for the verification and dependability assessment of FPGA designs. Existing FFI tools make use of Xilinx essential bits technology to locate the relevant fault targets in the FPGA configuration memory (CM). Most FFI tools treat essential bits as a black box, while a few of them are able to filter essential bits on an area basis (grid coordinates) in order to selectively target design components contained within predefined Pblocks. This approach, however, not only alters the timing properties of the circuit with respect to the original (unconstrained) placement, but also remains insufficiently precise, since the granularity of Pblocks in practice does not reach the smallest design (netlist) components. This paper proposes an open-source FFI tool that enables much more fine-grained FFI experiments for Xilinx 7-series and Ultrascale+ FPGAs. By mapping the essential bits to the hierarchical netlist, it can precisely target any component in the design tree (down to an individual LUT or register), without the need for defining Pblocks (floorplanning). With minimal experimental effort it estimates the contribution of each DUT component to the resulting dependability features, and discovers weak points of the DUT. Through case studies this paper exemplifies how the proposed tool can be exploited to set up FFI experiments, controlled from the host or from the FPGA itself, for different kinds of DUTs: from small-footprint microcontrollers up to a multicore RISC-V SoC. The correctness of the FFI results is validated by means of RT-level and gate-level simulation-based fault injection.
00:39 CET A NOVEL FAULT-TOLERANT ARCHITECTURE FOR TILED MATRIX MULTIPLICATION
Authors:
Sandeep Bal1, Chandra Sekhar Mummidi1, Victor da Cruz Ferreira2, Sudarshan Srinivasan3 and Sandip Kundu4
1University of Massachusetts, Amherst, US; 2Federal University of Rio de Janeiro, BR; 3Intel Labs, IN; 4University of Massachusetts Amherst, US
Abstract
General matrix multiplication (GEMM) is common to many scientific and machine-learning applications. Convolution, the dominant computation in Convolutional Neural Networks (CNNs), can be formulated as a GEMM problem. Due to its widespread use, a new generation of processors features GEMM acceleration in hardware. Intel recently announced the Advanced Matrix Extensions (AMX®) instruction set for GEMM, which is supported by 1 KB AMX tile registers and a Tile Multiplication unit (TMUL) for multiplying tiles (sub-matrices) in hardware. Silent Data Corruption (SDC) is a well-known problem that occurs when hardware generates corrupt output. Google and Meta recently reported findings of SDC in GEMM in their data centers. Algorithm-Based Fault Tolerance (ABFT) is an efficient mechanism for detecting and correcting errors in GEMM, but classic ABFT solutions are not optimized for hardware acceleration. In this paper, we present a novel ABFT implementation directly in hardware. Though the exact implementation of the Intel TMUL is not known, we propose two different TMUL architectures representing two design points in the area-power-performance spectrum and illustrate how ABFT can be directly incorporated into the TMUL hardware. This approach has two advantages: (i) an error can be concurrently detected at the tile level, which is an improvement over finding such errors only after performing the full matrix multiplication; and (ii) we further demonstrate that performing ABFT at the hardware level has no performance impact and only a small area, latency, and power overhead.
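For illustration, a minimal Python sketch of classic checksum-based ABFT for one matrix-multiply tile: A is extended with a column-checksum row and B with a row-checksum column, and any mismatch between the product's checksums and the recomputed ones reveals a corrupted element. Tile sizes are illustrative and unrelated to the actual TMUL dimensions.

# Checksum-based ABFT for a small GEMM tile (illustrative sizes).
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def abft_check(A, B, inject_fault=False):
    Ac = A + [[sum(col) for col in zip(*A)]]              # column-checksum row
    Br = [row + [sum(row)] for row in B]                  # row-checksum column
    C = matmul(Ac, Br)
    if inject_fault:
        C[0][0] += 7                                      # simulate silent data corruption
    rows_ok = all(sum(C[i][:-1]) == C[i][-1] for i in range(len(A)))
    cols_ok = all(sum(C[i][j] for i in range(len(A))) == C[-1][j]
                  for j in range(len(B[0])))
    return rows_ok and cols_ok          # the failing row/column pair also locates the error

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(abft_check(A, B))                     # True  (no fault)
print(abft_check(A, B, inject_fault=True))  # False (fault detected)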
00:39 CET REDUCE: A FRAMEWORK FOR REDUCING THE OVERHEADS OF FAULT-AWARE RETRAINING
Authors:
Muhammad Abdullah Hanif1 and Muhammad Shafique2
1New York University Abu Dhabi, AE; 2New York University Abu Dhabi (NYUAD), AE
Abstract
Fault-aware retraining has emerged as the foremost technique for mitigating permanent faults in DNN hardware. However, retraining leads to huge overheads, especially when used for fine-tuning large DNNs designed for complex AI problems. Moreover, as each fabricated chip can have a distinct fault pattern, fault-aware retraining has to be performed for each chip individually considering its unique fault map, which further amplifies the problem. To reduce the overall/average retraining cost, in this work, we introduce the concepts of resilience-driven retraining amount selection and fusion of multiple fault maps belonging to different chips. To realize these concepts, we propose a framework, Reduce, that first computes the resilience of a given DNN to faults at different fault rates and with different amounts of retraining. Then, based on the resilience, it computes the amount of retraining required for each chip considering its unique fault map. Finally, it performs resilience-driven grouping and fusion of fault maps to further reduce the number of retraining iterations required for tuning the given DNN for the given set of faulty chips. We demonstrate the effectiveness of our proposed methodology for a systolic array-based DNN accelerator experiencing permanent faults in the computational array of the design. The results show that the proposed technique significantly reduces the retraining cost when used for tuning a DNN for a set of faulty chips.

BPA_11 Supply chain attacks

Date: Tuesday, 18 April 2023
Time: 14:00 - 16:00 CET

Time Label Presentation Title
Authors
00:39 CET HARDWARE TROJANS IN ENVM NEUROMORPHIC DEVICES
Authors:
Lingxi Wu, Rahul Sreekumar, Rasool Sharifi, Kevin Skadron, Stan Mircea and Ashish Venkat, University of Virginia, US
Abstract
Fast and energy-efficient execution of a DNN on traditional CPU- and GPU-based architectures is challenging due to excessive data movement and inefficient computation. Emerging non-volatile memory (eNVM)-based accelerators that mimic biological neuron computations in the analog domain have shown significant performance improvements. However, the potential security threats in the supply chain of such systems have been largely understudied. This work describes a hardware supply chain attack against analog eNVM neural accelerators by identifying potential Trojan insertion points and proposes a hardware Trojan design that stealthily leaks model parameters while evading detection. Our evaluation shows that such a hardware Trojan can recover over 90% of the synaptic weights.
00:39 CET EVOLUTE: EVALUATION OF LOOK-UP-TABLE-BASED FINE-GRAINED IP REDACTION
Authors:
Rui Guo1, Mohammad Rahman1, Hadi Mardani Kamali1, Fahim Rahman1, Farimah Farahmandi1 and Mark Tehranipoor2
1University of Florida, US; 2Intel Charles E. Young Preeminence Endowed Chair Professor in Cybersecurity, Associate Chair for Research and Strategic Initiatives, ECE Department, University of Florida, US
Abstract
Recent studies on intellectual property (IP) protection techniques demonstrate that engaging embedded reconfigurable components (e.g., eFPGA redaction) would be a promising approach to concealing the functional and structural information of the security-critical design. However, detailed investigation reveals that such techniques suffer from almost prohibitive overhead in terms of area, power, delay, and testability. In this paper, we introduce "EvoLUTe", a distinct and significantly more fine-grained redaction methodology using smaller reconfigurable components (such as look-up tables (LUTs)). In "EvoLUTe", we examine both eFPGA-based and LUT-based design spaces, demonstrating that a novel cone-based and fine-grained universal function modeling approach using LUTs is capable of providing the same degree of resiliency at much lower area, power, delay, and testability costs.
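A hedged Python sketch of the fine-grained redaction idea: a small logic cone is replaced by a k-input LUT whose truth table acts as the withheld secret, so the redacted netlist exposes only a generic LUT. The cone, encoding, and key handling are illustrative only, not EvoLUTe's actual flow.

# Replacing a small logic cone with a LUT whose truth table is the secret.
def extract_lut(cone_fn, k):
    # Enumerate all 2^k input patterns of the cone into a truth table.
    return [cone_fn(*((idx >> i) & 1 for i in range(k))) for idx in range(2 ** k)]

def lut_eval(table, *inputs):
    idx = sum(bit << i for i, bit in enumerate(inputs))
    return table[idx]

# Original security-critical cone: f(a, b, c) = (a AND b) XOR c
cone = lambda a, b, c: (a & b) ^ c
table = extract_lut(cone, 3)            # this table is the withheld "key"

# The redacted design only instantiates lut_eval; loading `table` restores f.
assert all(lut_eval(table, a, b, c) == cone(a, b, c)
           for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(table)   # [0, 0, 0, 1, 1, 1, 1, 0]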
00:39 CET RTLOCK: IP PROTECTION USING SCAN-AWARE LOGIC LOCKING AT RTL
Authors:
Md Rafid Muttaki1, Shuvagata Saha1, Hadi Mardani Kamali1, Fahim Rahman1, Mark Tehranipoor2 and Farimah Farahmandi1
1University of Florida, US; 2Intel Charles E. Young Preeminence Endowed Chair Professor in Cybersecurity, Associate Chair for Research and Strategic Initiatives, ECE Department, University of Florida, US
Abstract
Conventional logic locking techniques mainly focus on gate-level netlists to combat IP piracy and IC overproduction. However, this is generally not sufficient for protecting the semantics and behavior of the design. Further, these techniques are even more questionable when the IC supply chain is at risk of insider threats. This paper proposes RTLock, a robust logic locking framework at the RTL abstraction. RTLock provides a detailed formal analysis of the design specs at the RTL that determines the locking candidate points with respect to attack resiliency (SAT/BMC), locking key size, and overhead. RTLock incorporates (partial) DFT infrastructure (scan chain) at the RTL, enabled with a scan locking mechanism. It allows us to push all the necessary security-driven actions to the highest abstraction level, thus making the flow EDA-tool agnostic. Additionally, RTLock demonstrates why RTL-based locking must be coupled with encryption and management protocols (e.g., IEEE 1735) to be effective against insider threats. Our experimental results show that, compared to other techniques, RTLock protects the design against broader threats at low overhead and without compromising testability.

BPA_7 Improving Heterogenous hardware utilization

Date: Tuesday, 18 April 2023
Time: 14:00 - 16:00 CET

Time Label Presentation Title
Authors
00:39 CET DITTY: DIRECTORY-BASED CACHE COHERENCE FOR MULTICORE SAFETY-CRITICAL SYSTEMS
Authors:
Zhuanhao Wu, Marat Bekmyrza, Nachiket Kapre and Hiren Patel, University of Waterloo, CA
Abstract
Ditty is a predictable directory-based cache coherence mechanism for multicore safety-critical systems that guarantees a worst-case latency (WCL) on data accesses. All prior approaches for predictable cache coherence use a shared snooping bus approach to interconnect cores. This restricts the number of cores in the multicore to typically four or eight due to practical scalability concerns. Ditty takes a first step towards a scalable cache coherence mechanism that is predictable and one that can support a larger number of cores. In designing Ditty, we propose coherence protocol and micro-architecture additions to deliver a WCL bound that is significantly lower than a naive approach. Our WCL analysis reveals that the resulting bounds are comparable to state-of-the-art bus-based predictable coherence approaches. We prototype Ditty in hardware and empirically evaluate it on an FPGA. Our evaluation shows the observed WCL is within computed WCL bounds for both the synthetic and SPLASH-3 benchmarks. We release our implementation to the public domain.
00:39 CET LIGHT FLASH WRITE FOR EFFICIENT FIRMWARE UPDATE ON ENERGY-HARVESTING IOT DEVICES
Authors:
Songran Liu1, Mingsong Lv2, Wei Zhang3, Xu Jiang1, Chuancai Gu4, Tao Yang4, Wang Yi5 and Nan Guan6
1Northeastern University, CN; 2The Hong Kong Polytechnic University, HK; 3School of Cyber Science and Technology, Shandong University, CN; 4Huawei Technologies Company, CN; 5Uppsala University, SE; 6City University of Hong Kong, HK
Abstract
Firmware update is an essential service on Internet-of-Things (IoT) devices to fix vulnerabilities and add new functionalities. Firmware update is energy-consuming since it involves intensive flash erase/write operations. Nowadays, IoT devices are increasingly powered by energy harvesting. As the energy output of the harvesters on IoT devices is typically tiny and unstable, a firmware update will likely experience power failures during its progress and fail to complete. This paper presents an approach to increase the success rate of firmware update on energy-harvesting IoT devices. The main idea is to first conduct a lightweight flash write with reduced erase/write time (and thus less energy consumed) to quickly save the new firmware image to flash memory before a power failure occurs. To ensure a long data retention time, a reinforcement step follows to re-write the new firmware image on the flash with default erase/write configuration when the system is not busy and has free energy. Experiments conducted with different energy scenarios show that our approach can significantly increase the success rate and the efficiency of firmware update on energy-harvesting IoT devices.
00:39 CET HADAS: HARDWARE-AWARE DYNAMIC NEURAL ARCHITECTURE SEARCH FOR EDGE PERFORMANCE SCALING
Authors:
Halima Bouzidi1, Mohanad Odema2, Hamza Ouarnoughi3, Mohammad Al Faruque2 and Smail Niar4
1University Polytechnique Hauts-de-France, LAMIH, CNRS, UMR 8201, F-59313 Valenciennes, France, FR; 2University of California Irvine, US; 3INSA Hauts-de-France, FR; 4INSA Hauts-de-France and CNRS, FR
Abstract
Dynamic neural networks (DyNNs) have become viable techniques to enable intelligence on resource-constrained edge devices while maintaining computational efficiency. In many cases, the implementation of DyNNs can be sub-optimal due to its underlying backbone architecture being developed at the design stage independent of both: (i) the dynamic computing features, e.g. early exiting, and (ii) the resource efficiency features of the underlying hardware, e.g., dynamic voltage and frequency scaling (DVFS). Addressing this, we present HADAS, a novel Hardware-Aware Dynamic Neural Architecture Search framework that realizes DyNN architectures whose backbone, early exiting features, and DVFS settings have been jointly optimized to maximize performance and resource efficiency. Our experiments using the CIFAR-100 dataset and a diverse set of edge computing platforms have seen HADAS dynamic models achieve up to 57% energy efficiency gains compared to the conventional dynamic ones while maintaining the desired level of accuracy scores.

S_E4 Hardware accelerators serving efficient machine learning software architectures

Date: Tuesday, 18 April 2023
Time: 16:30 - 18:00 CET

Time Label Presentation Title
Authors
00:39 CET PIPE-BD: PIPELINED PARALLEL BLOCKWISE DISTILLATION
Authors:
Hongsun Jang1, Jaewon Jung2, Jaeyong Song1, Joonsang Yu3, Youngsok Kim1 and Jinho Lee4
1Yonsei University, KR; 2Yonsei University, KR; 3NAVER CLOVA, KR; 4Seoul National University, KR
Abstract
Training or searching large deep neural network models is challenging due to their tremendous computation and memory requirements. Blockwise distillation is one promising method to mitigate the problem by ensuring faster convergence; a large model is effectively split into multiple smaller models. The training of later student blocks requires the results of the earlier teacher blocks. Therefore, the state-of-the-art blockwise distillation method distributes a batch to multiple GPUs and makes each GPU train its student block by executing all the earlier teacher blocks. However, this results in high overhead from redundant teacher execution, low GPU utilization, and extra data loading. To address these problems, we propose PipeBD, a novel parallelization method for blockwise distillation. PipeBD comprises three main components. First, teacher relaying lets each device process only its own teacher block by relaying the activation values. Second, decoupled parameter update removes synchronization barriers by performing unaligned updates on each student block. Last, automatic hybrid distribution enhances load balancing by further splitting the stages along the batch dimension. We implement PipeBD on PyTorch, where all the decisions are made seamlessly and automatically. Experiments reveal that PipeBD achieves significant speedups.
00:39 CET LAYER-PUZZLE: ALLOCATING AND SCHEDULING MULTI-TASK ON MULTI-CORE NPUS BY USING LAYER HETEROGENEITY
Authors:
Chengsi Gao1, Ying Wang2, Cheng Liu1, Mengdi Wang1, Weiwei Chen1, Yinhe Han3 and Lei Zhang3
1Institute of Computing Technology, Chinese Academy of Sciences, CN; 2State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, CN; 3Institute of Computing Technology,Chinese Academy of Sciences, CN
Abstract
To support the efficient execution of multi-task DNNs on multi-core NPUs, intensive research has been proposed. However, prior works lead to low hardware utilization since they ignore the inter-layer heterogeneity within DNN models. Thus, to fully exploit this characteristic, we propose Layer-Puzzle, a multi-task allocation and scheduling framework for multi-core NPUs. Based on the proposed latency-prediction model and dynamic parallelization scheme, Layer-Puzzle can generate optimal or near-optimal results for each layer under given hardware resources and traffic congestion levels. As an online scheduler, Layer-Puzzle performs a QoS-aware and dynamic scheduling method that picks the superior version from the previously compiled results and co-runs the selected tasks to improve system performance. Our experiments on MLPerf show that Layer-Puzzle achieves average improvements of 1.61X, 1.53X, and 1.95X in ANTT, STP, and PE utilization, respectively.
00:39 CET DYNAMIC TASK REMAPPING FOR RELIABLE CNN TRAINING ON RERAM CROSSBARS
Authors:
Chung-Hsuan Tung1, Biresh Kumar Joardar2, Partha Pratim Pande3, Jana Doppa3, Hai (Helen) Li1 and Krishnendu Chakrabarty1
1Duke University, US; 2University of Houston, US; 3Washington State University, US
Abstract
A ReRAM crossbar-based computing system (RCS) can accelerate CNN training. However, hardware faults due to manufacturing defects and limited endurance impede the widespread adoption of RCS. We propose a dynamic task remapping-based technique for reliable CNN training on faulty RCS. Experimental results demonstrate that the proposed low-overhead method incurs only 0.85% accuracy loss on average while training popular CNNs such as VGGs, ResNets, and SqueezeNet with the CIFAR-10, CIFAR-100, and SVHN datasets in the presence of faults.
00:39 CET MOBILE ACCELERATOR EXPLOITING SPARSITY OF MULTI-HEADS, LINES AND BLOCKS IN TRANSFORMERS IN COMPUTER VISION
Authors:
Eunji Kwon1, Haena Song2, Jihye Park2 and Seokhyeong Kang2
1Postech, KR; 2POSTECH, KR
Abstract
It is difficult to employ transformer models for computer vision on mobile devices due to their memory- and computation-intensive properties. Accordingly, there is ongoing research on various methods for compressing transformer models, such as pruning. However, general computing platforms such as central processing units (CPUs) and graphics processing units (GPUs) are not energy-efficient at accelerating the pruned models due to their structured sparsity. This paper proposes a low-power accelerator for transformers in computer vision with various sizes of structured sparsity induced by pruning with different granularity. In this study, we can accelerate a transformer that has been pruned in a head-wise, line-wise, or block-wise manner. We developed a head scheduling algorithm to support head-wise skip operations and resolve the processing engine (PE) load imbalance problem caused by different amounts of computation in each head. Moreover, we implemented a sparse general matrix-to-matrix multiplication (sparse GEMM) that supports line-wise and block-wise skipping. As a result, when compared with a mobile GPU and a mobile CPU, respectively, our proposed accelerator achieved 6.1x and 13.6x improvements in energy efficiency for the detection transformer (DETR) model and approximately 2.6x and 7.9x improvements in energy efficiency on average for the vision transformer (ViT) models.
00:39 CET RAWATTEN: RECONFIGURABLE ACCELERATOR FOR WINDOW ATTENTION IN HIERARCHICAL VISION TRANSFORMERS
Authors:
Wantong Li, Yandong Luo and Shimeng Yu, Georgia Institute of Technology, US
Abstract
After the success of the transformer networks on natural language processing (NLP), the application of transformers to computer vision has followed suit to deliver unprecedented performance gains on vision tasks including image recognition and object detection. The multi-head self-attention (MSA) is the key component in transformers, allowing the models to learn the amount of attention paid to each input position. In particular, hierarchical vision transformers (HVTs) utilize window-based MSA to capture the benefits of the attention mechanism at various scales for further accuracy enhancements. Despite its strong modeling capability, MSA involves complex operations that make transformers prohibitively costly for hardware deployment. Existing hardware accelerators have mainly focused on the MSA workloads in NLP applications, but HVTs involve different parameter dimensions, input sizes, and data reuse opportunities. Therefore, we design the RAWAtten architecture to target the window-based MSA workloads in HVT models. Each w-core in RAWAtten contains near-memory compute engines for linear layers, MAC arrays for intermediate matrix multiplications, and a lightweight reconfigurable softmax. The w-cores can be combined at runtime to perform hierarchical processing to accommodate varying model parameters. Compared to the baseline GPU, RAWAtten at 40nm provides 2.4× average speedup for running the window-MSA workloads in Swin transformer models while consuming only a fraction of GPU power. In addition, RAWAtten achieves 2× area efficiency compared to prior ASIC accelerator for window-MSA.
00:39 CET M5: MULTI-MODAL MULTI-TASK MODEL MAPPING ON MULTI-FPGA WITH ACCELERATOR CONFIGURATION SEARCH
Authors:
Akshay Kamath, Stefan Abi-Karam, Ashwin Bhat and Cong "Callie" Hao, Georgia Institute of Technology, US
Abstract
Recent machine learning (ML) models have advanced from single-modality single-task to multi-modality multi-task (MMMT). MMMT models typically have multiple backbones of different sizes and complicated connections, exposing great challenges for hardware deployment. For scalable and energy-efficient implementations, multi-FPGA systems are emerging as the ideal design choice. However, finding the optimal solutions for mapping MMMT models onto multiple FPGAs is non-trivial. Existing mapping algorithms focus either on streamlined linear deep neural network architectures or only the critical path of simple heterogeneous models. Direct extensions of these algorithms for MMMT models lead to sub-optimal solutions. To address these shortcomings, we propose M5, a novel MMMT model Mapping framework for Multi-FPGA platforms. In addition to handling multiple modalities present in the models, M5 can flexibly explore accelerator configurations and possible resource sharing opportunities to significantly improve the system performance. For various computation-heavy MMMT models, experiment results demonstrate that M5 can remarkably outperform existing mapping methods and lead to an average reduction of 35%, 62%, and 70% in the number of low-end, mid-end, and high-end FPGAs required to achieve the same throughput, respectively. Code is available publicly.
00:39 CET STEPPINGNET: A STEPPING NEURAL NETWORK WITH INCREMENTAL ACCURACY ENHANCEMENT
Authors:
Wenhao Sun1, Grace Li Zhang2, Xunzhao Yin3, Cheng Zhuo3, Huaxi Gu4, Bing Li1 and Ulf Schlichtmann1
1TU Munich, DE; 2TU Darmstadt, DE; 3Zhejiang University, CN; 4Xidian University, CN
Abstract
Deep neural networks (DNNs) have successfully been applied in many fields in the past decades. However, the increasing number of multiply-and-accumulate (MAC) operations in DNNs prevents their application in resource-constrained and resource-varying platforms, e.g., mobile phones and autonomous vehicles. In such platforms, neural networks need to provide acceptable results quickly, and it should be possible to enhance the accuracy of the results dynamically according to the computational resources available in the computing system. To address these challenges, we propose a design framework called SteppingNet. SteppingNet constructs a series of subnets whose accuracy is incrementally enhanced with more MAC operations. Therefore, this design allows a trade-off between accuracy and latency. In addition, the larger subnets in SteppingNet are built upon smaller subnets, so that the results of the latter can be reused directly in the former without recomputation. This property allows SteppingNet to decide on-the-fly whether to enhance the inference accuracy by executing further MAC operations. Experimental results demonstrate that SteppingNet provides an effective incremental accuracy improvement and its inference accuracy consistently outperforms the state-of-the-art work under the same limit of computational resources.
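A minimal sketch of the reuse property described above: the partial sum computed by a smaller subnet is kept and extended with only the additional MACs of the next, larger subnet. The channel-split scheme and names below are illustrative assumptions, not the paper's exact subnet construction.

```python
import numpy as np

def stepping_inference(x, w, splits, bias):
    """Incremental-accuracy sketch: `splits` is a growing list of input-channel
    counts, e.g. [16, 32, 64]. The partial sum accumulated over the first k
    channels is reused when stepping up to the next split, so each step only
    executes the MACs of the newly added channels."""
    partial = bias.astype(float).copy()
    prev = 0
    results = []
    for k in splits:
        partial = partial + x[prev:k] @ w[prev:k]   # only the new channels' MACs
        prev = k
        results.append(partial.copy())              # a usable output after every step
    return results

rng = np.random.default_rng(2)
x = rng.standard_normal(64)
w = rng.standard_normal((64, 10)) * 0.1
outs = stepping_inference(x, w, splits=[16, 32, 64], bias=np.zeros(10))
assert np.allclose(outs[-1], x @ w)   # the final step equals the full layer
# outs[0] is the fastest, least accurate result; later entries refine it.
```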
00:39 CET AIRCHITECT: AUTOMATING HARDWARE ARCHITECTURE AND MAPPING OPTIMIZATION
Authors:
Ananda Samajdar1, Jan Moritz Joseph2 and Tushar Krishna1
1Georgia Institute of Technology, US; 2RWTH Aachen University, DE
Abstract
Design space exploration and optimization is an essential but iterative step in custom accelerator design, involving costly search-based methods to extract maximum performance and energy efficiency. State-of-the-art methods employ data-centric approaches to reduce the cost of each iteration but still rely on search algorithms to obtain the optima. This work proposes a learned, constant-time optimizer that uses a custom recommendation network called AIRCHITECT, which is capable of learning the architecture design and mapping space with a 94.3% test accuracy and predicting optimal configurations, which achieve on average (GeoMean) 99.9% of the best possible performance on a test dataset with 10^5 GEMM (GEneral Matrix-matrix Multiplication) workloads.
00:39 CET ACCELERATING INFERENCE OF 3D-CNN ON ARM MANY-CORE CPU VIA HIERARCHICAL MODEL PARTITION
Authors:
Jiazhi Jiang, ZiJian Huang, Dan Huang, Jiangsu Du and Yutong Lu, Sun Yat-sen University, CN
Abstract
Many applications such as biomedical analysis and scientific data analysis involve analyzing volumetric data. This spawns a huge demand for 3D CNNs. Although accelerators such as GPUs may provide higher throughput on deep learning applications, they may not be available in all scenarios. CPUs, especially many-core CPUs, remain an attractive choice for deep learning in many scenarios. In this paper, we propose an inference solution that targets the emerging ARM many-core CPU platform. A hierarchical partition approach is proposed to accelerate 3D-CNN inference by exploiting the characteristics of memory and cache on ARM many-core CPUs. Based on the hierarchical model partition approach, other optimization techniques such as NUMA-aware thread scheduling are designed to exploit the potential of ARM many-core CPUs for 3D-CNNs. We evaluate our proposed inference solution with several classic 3D-CNNs: C3D, 3D-ResNet34 and 3D-ResNet50. Our experimental results show that our solution can boost the performance of 3D-CNN inference and achieve much better scalability.
00:39 CET CEST: COMPUTATION-EFFICIENT N:M SPARSE TRAINING FOR DEEP NEURAL NETWORKS
Authors:
Chao Fang1, Wei Sun2, Aojun Zhou3 and Zhongfeng Wang1
1Nanjing University, CN; 2Eindhoven University of Technology, NL; 3The Chinese University of Hong Kong, HK
Abstract
Sparse training is one of the promising techniques to reduce the computational cost and retain the high accuracy of DNNs, but prior works mainly focus on leveraging structured or unstructured sparsity patterns. Structured sparsity has limitations in reducing computational complexity, while unstructured sparsity is difficult to accelerate on hardware. Recently, N:M fine-grained structured sparsity, where only N out of consecutive M elements can be nonzero, has attracted attention due to its practical sparsity ratio and hardware-friendly pattern. However, the potential to accelerate N:M sparse DNN training has not been fully exploited, and there is a lack of efficient hardware supporting N:M sparse training. To tackle these challenges, this paper presents an algorithm-hardware co-design scheme - Computation-Efficient N:M Sparse Training (CEST) for DNNs. At the algorithmic level, a bidirectional weight pruning method, dubbed BDWP, is proposed to leverage the N:M sparsity of weights during both forward and backward passes of DNN training, which can significantly reduce the computational cost while maintaining model accuracy. At the architectural level, a sparse accelerator for DNN training, namely SAT, is developed to neatly support both the regular dense operations and the computation-efficient N:M sparse operations. Finally, the effectiveness of CEST is evaluated on a Xilinx Virtex UltraScale+ VCU1525 FPGA card using various DNN models and datasets. Experimental results show that CEST significantly improves the training throughput by 1.89-12.49x and the energy efficiency by 2.19-3.26x over prior FPGA-based training accelerators, and incurs merely 0.45% accuracy loss on average compared to the dense training scheme.
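For readers unfamiliar with the N:M pattern, the following NumPy sketch applies a 2:4 magnitude-based mask to a weight tensor. It only illustrates the sparsity pattern; BDWP's actual bidirectional (forward and backward) pruning procedure is more involved.

```python
import numpy as np

def prune_n_m(weights, n=2, m=4):
    """Sketch of N:M fine-grained structured pruning: in every group of M
    consecutive weights, keep only the N with the largest magnitude."""
    flat = weights.reshape(-1, m)
    # indices of the (m - n) smallest-magnitude entries in each group
    drop = np.argsort(np.abs(flat), axis=1)[:, : m - n]
    mask = np.ones_like(flat)
    np.put_along_axis(mask, drop, 0.0, axis=1)
    return (flat * mask).reshape(weights.shape)

rng = np.random.default_rng(3)
w = rng.standard_normal((8, 8)).astype(np.float32)
w_sparse = prune_n_m(w)                                   # 2:4 pattern -> 50% zeros
assert (w_sparse.reshape(-1, 4) != 0).sum(axis=1).max() <= 2
```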
00:39 CET BOMP-NAS: BAYESIAN OPTIMIZATION MIXED PRECISION NAS
Authors:
David van Son1, Floran de Putter1, Sebastian Vogel2 and Henk Corporaal3
1Eindhoven University of Technology, NL; 2NXP Semiconductors, NL; 3TU/e (Eindhoven University of Technology), NL
Abstract
Bayesian Optimization Mixed-Precision Neural Architecture Search (BOMP-NAS) is an approach to quantization-aware neural architecture search (QA-NAS) that leverages both Bayesian optimization (BO) and mixed-precision quantization (MP) to efficiently search for compact, high performance deep neural networks. The results show that integrating quantization-aware fine-tuning (QAFT) into the NAS loop is a necessary step to find networks that perform well under low-precision quantization: integrating it allows a model size reduction of nearly 50% on the CIFAR-10 dataset. BOMP-NAS is able to find neural networks that achieve state-of-the-art performance at much lower design costs. This study shows that BOMP-NAS can find these neural networks at a 6× shorter search time compared to the closest related work.
00:39 CET HARDNNING: A MACHINE-LEARNING-BASED FRAMEWORK FOR FAULT TOLERANCE ASSESSMENT AND PROTECTION OF DEEP NEURAL NETWORKS
Authors:
Marcello Traiola1, Angeliki Kritikakou2 and Olivier Sentieys3
1Inria / IRISA, FR; 2Univ Rennes, Inria, CNRS, IRISA, FR; 3INRIA, FR
Abstract
Deep Neural Networks (DNNs) show promising performance in several application domains, such as robotics, aerospace, smart healthcare, and autonomous driving. Nevertheless, DNN results may be incorrect, not only because of the network's intrinsic inaccuracy, but also due to faults affecting the hardware. Indeed, hardware faults may impact the DNN inference process and lead to prediction failures. Therefore, ensuring the fault tolerance of DNNs is crucial. However, common fault tolerance approaches are not cost-effective for DNN protection because of the prohibitive overheads caused by the large size of DNNs and of the memory required for parameter storage. In this work, we propose a comprehensive framework to assess the fault tolerance of DNNs and protect them cost-effectively. As a first step, the proposed framework performs datatype- and layer-based fault injection, driven by the DNN characteristics. As a second step, it uses classification-based machine learning methods in order to predict the criticality, not only of network parameters, but also of their bits. Last, dedicated Error Correction Codes (ECCs) are selectively inserted to protect the critical parameters and bits, hence protecting the DNNs at low cost. Using the proposed framework, we explored and protected eight Convolutional Neural Networks (CNNs). The results show that it is possible to protect the critical network parameters with selective ECCs while saving up to 82% memory w.r.t. conventional ECC approaches.
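The first step (datatype- and layer-based fault injection) can be illustrated with a toy bit-flip campaign on float32 parameters. The helper names and the random campaign below are illustrative assumptions; the framework's real injector is driven by the DNN characteristics and feeds a criticality classifier.

```python
import random
import struct

def flip_bit(value, bit):
    """Flip one bit of a parameter stored as float32 and return the faulty value."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (faulty,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return faulty

def inject_faults(params, n_faults=10, seed=0):
    """Toy fault-injection campaign: flip random bits of random parameters and
    log which (parameter index, bit position) pairs were hit, so a later
    analysis (a classifier, in the paper's framework) can label critical bits."""
    rng = random.Random(seed)
    log = []
    for _ in range(n_faults):
        i = rng.randrange(len(params))
        bit = rng.randrange(32)
        params[i] = flip_bit(params[i], bit)
        log.append((i, bit))
    return log

weights = [0.5, -1.25, 3.0, 0.015625]
campaign = inject_faults(weights, n_faults=3)
print(campaign, weights)
```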

S_S3 Secure circuits and architectures

Date: Tuesday, 18 April 2023
Time: 16:30 - 18:00 CET

Time Label Presentation Title
Authors
00:39 CET ESTABLISHING DYNAMIC SECURE SESSIONS FOR ECQV IMPLICIT CERTIFICATES IN EMBEDDED SYSTEMS
Authors:
Fikret Basic1, Christian Steger1 and Robert Kofler2
1Graz University of Technology, AT; 2NXP Semiconductors Austria GmbH Co & KG, AT
Abstract
Implicit certificates are gaining ever more prominence in constrained embedded devices, in both the internet of things (IoT) and automotive domains. They present a resource-efficient security solution against common threat concerns. The computational requirements are no longer the main issue, with the focus now shifting to determining a good balance between the provided security level and the derived threat model. A security aspect that often gets overlooked is the establishment of secure communication sessions, as most design solutions are based only on the use of static key derivation and therefore lack perfect forward secrecy. This leaves the transmitted data open to potential future exposure, as keys are tied to the certificates rather than to the communication sessions. We aim to close this gap and present a design that utilizes the Station-to-Station (STS) protocol with implicit certificates. In addition, we propose potential protocol optimization implementation steps and run a comprehensive study on the performance and security level of the proposed design against state-of-the-art key derivation protocols. In our comparative study, we show that we are able to mitigate many session-related security vulnerabilities that would otherwise remain open, with only a slight computational increase of 20% compared to a static elliptic curve digital signature algorithm (ECDSA) key derivation.
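The forward-secrecy gap discussed above comes from deriving session keys from static (certified) keys only. A minimal sketch of the alternative, per-session ephemeral key agreement, is shown below using the pyca/cryptography package; it is not the paper's STS-with-ECQV design (which additionally authenticates the exchange with the implicit certificates), only the core idea.

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

def new_session_key(my_ephemeral, peer_ephemeral_public):
    """Derive a per-session key from *ephemeral* EC key pairs. Because the
    private parts are discarded after the session, compromising a long-term
    (certificate) key later does not expose past traffic; this is the
    forward-secrecy property that static key derivation lacks."""
    shared = my_ephemeral.exchange(ec.ECDH(), peer_ephemeral_public)
    return HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
                info=b"demo session").derive(shared)

# Both sides generate fresh ephemeral keys for every session.
alice = ec.generate_private_key(ec.SECP256R1())
bob = ec.generate_private_key(ec.SECP256R1())
k_alice = new_session_key(alice, bob.public_key())
k_bob = new_session_key(bob, alice.public_key())
assert k_alice == k_bob
```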
00:39 CET CACHE SIDE-CHANNEL ATTACKS AND DEFENSES OF THE SLIDING WINDOW ALGORITHM IN TEES
Authors:
Zili KOU1, Sharad Sinha2, Wenjian HE1 and Wei ZHANG1
1Hong Kong University of Science and Technology, HK; 2Indian Institute of Technology Goa, IN
Abstract
Trusted execution environments (TEEs) such as SGX on x86 and TrustZone on ARM are designed to protect trusted programs against even a malicious operating system (OS); however, they are still vulnerable to cache side-channel attacks. In the new threat model of TEEs, kernel-privileged attackers are more capable, thus the effectiveness of previous defenses needs to be carefully reevaluated. Focusing on the sliding-window algorithm of RSA, this work analyzes the latest defenses from the TEE attacker's point of view and pinpoints their attack surfaces and vulnerabilities. The mainstream cryptography libraries are scrutinized, within which we attack and evaluate the implementations of Libgcrypt and MbedTLS on a real-world ARM processor with TrustZone. Our attack successfully recovers the key of RSA in the latest MbedTLS design when it adopts a small window size, despite MbedTLS being the reference implementation of ARM TrustZone. Possible countermeasures are finally presented together with the corresponding costs.
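To see where the cache leakage in a windowed RSA implementation comes from, the sketch below implements a textbook fixed-window modular exponentiation in which the precomputed-table lookup is indexed by secret exponent digits; those table accesses are what a cache side-channel attacker observes. The attacked libraries' sliding-window code differs in detail.

```python
def _window_digits(exp, w):
    """Split the exponent into w-bit digits, most significant first."""
    bits = bin(exp)[2:]
    bits = bits.zfill(((len(bits) + w - 1) // w) * w)
    return [int(bits[i:i + w], 2) for i in range(0, len(bits), w)]

def windowed_pow(base, exp, mod, w=4):
    """Fixed-window modular exponentiation. The lookup `table[digit]` is the
    secret-dependent memory access pattern a cache attacker can exploit."""
    table = [1] * (1 << w)
    for i in range(1, 1 << w):
        table[i] = (table[i - 1] * base) % mod        # precomputed powers of base
    result = 1
    for digit in _window_digits(exp, w):              # scan exponent MSB-first
        for _ in range(w):
            result = (result * result) % mod          # w squarings per window
        result = (result * table[digit]) % mod        # secret-dependent table lookup
    return result

assert windowed_pow(7, 1234567, 1000003) == pow(7, 1234567, 1000003)
```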
00:39 CET THE FIRST CONCEPT AND REAL-WORLD DEPLOYMENT OF A GPU-BASED THERMAL COVERT CHANNEL: ATTACK AND COUNTERMEASURES
Authors:
Jeferson Gonzalez-Gomez1, Kevin Cordero-Zuniga2, Lars Bauer3 and Joerg Henkel1
1KIT, DE; 2ITCR, CR; 3Karlsruhe Institute of Technology, DE
Abstract
Thermal covert channel (TCC) attacks have been studied as a threat to CPU-based systems over recent years. In this paper, we propose a new type of TCC attack that for the first time leverages the Graphics Processing Unit (GPU) of a system to create a stealthy communication channel between two malicious applications. We evaluate our new attack on two different real-world platforms: a GPU-dedicated general computing platform and a GPU-integrated embedded platform. Our results are the first to show that a GPU-based thermal covert channel attack is possible. From our experiments, we obtain a transmission rate of up to 8.75 bps with a very low error rate of less than 2% for a 12-bit packet size, which is comparable to CPU-based TCCs in the state of the art. Moreover, we show how existing state-of-the-art countermeasures for TCCs need to be extended to tackle the new GPU-based attack, at the cost of added overhead. To reduce this overhead, we propose our own DVFS-based countermeasure which mitigates the attack, while causing 2x less performance loss than the state-of-the-art countermeasure on a set of compute-intensive GPU benchmark applications.
00:39 CET SIGFUZZ: A FRAMEWORK FOR DISCOVERING MICROARCHITECTURAL TIMING SIDE CHANNELS
Authors:
Chathura Rajapaksha, Leila Delshadtehrani, Manuel Egele and Ajay Joshi, Boston University, US
Abstract
Timing side channels can be inadvertently introduced into processor microarchitecture during the design process, mainly due to optimizations carried out to improve processor performance. These timing side channels have been used in various attacks, including transient execution attacks on recent commodity processors. Hence, we need a tool to detect timing side channels during the design process. This paper presents SIGFuzz, a fuzzing-based framework for detecting microarchitectural timing side channels. A designer can use SIGFuzz to detect side channels early in the design flow and mitigate potential vulnerabilities associated with them. SIGFuzz generates a cycle-accurate microarchitectural trace for a program that executes on the target processor; it then uses two trace properties to identify side channels that would be formed by the program. These two trace properties evaluate the effect of each instruction in the program on the timing of its prior and later instructions, respectively. SIGFuzz also uses a statistical distribution of execution delays of instructions with the same mnemonic to flag potential side channels that manifest with different operands of an instruction. Furthermore, SIGFuzz automatically groups the detected side channels based on the microarchitectural activity trace (i.e., signature) of the instruction that triggered them. We evaluated SIGFuzz on two real-world open-source processor designs, Rocket and BOOM, and found three new side channels and two known side channels. We present a novel Spectre-style attack on BOOM based on a newly detected side channel.
00:39 CET RUN-TIME INTEGRITY MONITORING OF UNTRUSTWORTHY ANALOG FRONT-ENDS
Authors:
Heba Salem and Nigel Topham, University of Edinburgh, GB
Abstract
Recent advances in hardware attacks, such as crosstalk and covert-channel based attacks, expose the structural and operational vulnerability of analog and mixed-signal circuit elements to the introduction of malicious and untrustworthy behaviour at run-time, potentially leading to adverse physical, personal, and environmental consequences. One untrustworthy behaviour of concern is the introduction of abnormal/unexpected frequencies to the signals at the analog/digital interface of an SoC, realised through intermittent bit-flipping or stuck-at-faults in the middle and lower bits of these signals. In this paper, we study the impact of these actions and propose integrity monitoring of signals of concern based on analysing the temporal and arithmetic relations between their samples. This paper presents a hybrid software/hardware machine-learning based framework that consists of two phases: a run-time monitoring phase and a trustworthiness assessment phase. The framework is evaluated with three different applications and its effectiveness in detecting the untrustworthy behaviour of concern is verified. This framework is device, application, and architecture agnostic, and relies only on analysing the output of the analog front-end, allowing its implementation in SoCs with on-chip and custom analog front-ends as well as those with outsourced and commercial off-the-shelf (COTS) analog front-ends.
00:39 CET SPOILER-ALERT: DETECTING SPOILER ATTACK USING CUCKOO FILTER
Authors:
Jinhua Cui, Yiyun Yin, Congcong Chen and Jiliang Zhang, Hunan University, CN
Abstract
SPOILER exploits the dependency resolution logic serving the speculative load to leak information about physical page mappings, which can accelerate reverse engineering of virtual-to-physical address mapping, thus greatly advancing Rowhammer and cache attacks. However, unlike attacks that attempt to leak secret data directly, the focus of SPOILER is address information. Further, existing approaches are mainly developed for the detection and defense of data leakage and hence cannot be used against the SPOILER attack. This paper proposes the first hardware-level mechanism, named SPOILER-ALERT, which can detect SPOILER violations via a cuckoo filter module. The cuckoo filter is embedded into the Memory Order Buffer component to screen buffer addresses on-the-fly and alert human users when a threshold on the number of buffer addresses is reached. We further inspect the internal filtering algorithm and optimise it to reduce false positives in SPOILER-ALERT. We assess SPOILER-ALERT based on prototype implementations on gem5. The results show a detection rate of 99.9% and only negligible performance loss. Finally, we discuss potential mitigations and long-term defenses against SPOILER.
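As a reference for the data structure used above, a minimal cuckoo filter (insert and lookup only) is sketched below. Bucket counts, fingerprint width and the address values are toy assumptions; the hardware module would additionally track occupancy against the alert threshold.

```python
import random

class CuckooFilter:
    """Minimal cuckoo filter: an approximate set-membership structure of the
    kind embedded next to the Memory Order Buffer to screen buffer addresses.
    n_buckets must be a power of two for the XOR partner-bucket trick."""
    def __init__(self, n_buckets=64, bucket_size=4, max_kicks=500):
        self.n, self.bs, self.max_kicks = n_buckets, bucket_size, max_kicks
        self.buckets = [[] for _ in range(n_buckets)]

    def _fingerprint(self, item):
        return (hash(item) & 0xFF) or 1              # 8-bit fingerprint, never 0

    def _indices(self, item, fp):
        i1 = hash(item) % self.n
        i2 = (i1 ^ hash(fp)) % self.n                # partial-key cuckoo hashing
        return i1, i2

    def insert(self, item):
        fp = self._fingerprint(item)
        i1, i2 = self._indices(item, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bs:
                self.buckets[i].append(fp)
                return True
        i = random.choice((i1, i2))                  # both full: evict and relocate
        for _ in range(self.max_kicks):
            j = random.randrange(self.bs)
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = (i ^ hash(fp)) % self.n              # alternate bucket of evicted fp
            if len(self.buckets[i]) < self.bs:
                self.buckets[i].append(fp)
                return True
        return False                                 # filter considered full

    def __contains__(self, item):
        fp = self._fingerprint(item)
        i1, i2 = self._indices(item, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

cf = CuckooFilter()
for addr in (0x7FFE10, 0x7FFE18, 0x7FFE20):          # toy buffer addresses
    cf.insert(addr)
assert 0x7FFE18 in cf
```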
00:39 CET HUNTER: HARDWARE UNDERNEATH TRIGGER FOR EXPLOITING SOC-LEVEL VULNERABILITIES
Authors:
Sree Ranjani Rajendran1, Shams Tarek1, Benjamin Myers Hicks1, Hadi Mardani Kamali1, Farimah Farahmandi1 and Mark Tehranipoor2
1University of Florida, US; 2Intel Charles E. Young Preeminence Endowed Chair Professor in Cybersecurity, Associate Chair for Research and Strategic Initiatives, ECE Department, University of Florida, US
Abstract
Systems-on-chip (SoCs) have become increasingly large and complex, resulting in new threats and vulnerabilities, mainly related to system-level flaws. However, the system-level verification process, whose violation may lead to exploiting a hardware vulnerability, has not been studied comprehensively due to the lack of decisive (security) requirements and properties from the SoC designer's perspective. To enable more comprehensive verification of system-level properties, this paper presents HUnTer (Hardware Underneath Trigger), a framework for identifying sets (sequences) of instructions at the processor unit (PU) that unveil the underlying hardware vulnerabilities. The HUnTer framework automates (i) threat modeling, (ii) threat-based formal verification, (iii) generation of counterexamples, and (iv) generation of snippet code for exploiting the vulnerability. The HUnTer framework also defines a security coverage metric (HUnT_Coverage) to measure the performance and efficacy of the proposed approach. Using the HUnTer framework on a RISC-V-based open-source SoC architecture, we conduct a wide variety of case studies of Trust-Hub vulnerabilities to demonstrate the high effectiveness of the proposed framework.
00:39 CET MAXIMIZING THE POTENTIAL OF CUSTOM RISC-V VECTOR EXTENSIONS FOR SPEEDING UP SHA-3 HASH FUNCTIONS
Authors:
Huimin Li1, Nele Mentens2 and Stjepan Picek3
1Delft University of Technology, NL; 2KU Leuven, BE; 3Radboud University, NL
Abstract
SHA-3 is considered to be one of the most secure standardized hash functions. It relies on the Keccak-f[1,600] permutation, which operates on an internal state of 1,600 bits, mostly represented as a 5×5×64-bit matrix. While existing implementations process the state sequentially in chunks of typically 32 or 64 bits, the Keccak-f[1,600] permutation can benefit greatly from parallelization. This paper is the first to explore the full potential of parallelizing Keccak-f[1,600] in RISC-V based processors through custom vector extensions on 32-bit and 64-bit architectures. We analyze the Keccak-f[1,600] permutation, composed of five different step mappings, and propose ten custom vector instructions to speed up the computation. We realize these extensions in a SIMD processor described in SystemVerilog. We compare the performance of our designs to existing architectures based on vectorized application-specific instruction set processors (ASIPs). We show that our designs outperform all related work thanks to our carefully selected custom vector instructions.
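As a small illustration of the data-level parallelism being exploited, the theta step of Keccak-f[1600] can be written entirely with lane-wide array operations, as in the NumPy sketch below (state indexed as [y][x]; only theta is shown, and the proposed custom instructions are not modeled).

```python
import numpy as np

def rotl64(x, r):
    """Rotate 64-bit lanes left by r bits (numpy uint64 arrays)."""
    r = np.uint64(r)
    return (x << r) | (x >> (np.uint64(64) - r))

def theta(state):
    """Theta step of Keccak-f[1600] on a 5x5 array of 64-bit lanes, state[y][x].
    Every operation acts on whole rows/columns of lanes at once, the kind of
    data-level parallelism vector instructions can exploit; rho, pi, chi and
    iota can be vectorized in a similar fashion (not shown)."""
    c = state[0] ^ state[1] ^ state[2] ^ state[3] ^ state[4]   # column parities C[x]
    d = np.roll(c, 1) ^ rotl64(np.roll(c, -1), 1)              # D[x] = C[x-1] ^ rot(C[x+1], 1)
    return state ^ d                                           # broadcast D over all rows

rng = np.random.default_rng(4)
state = rng.integers(1, 1 << 63, size=(5, 5), dtype=np.uint64)
print(hex(int(theta(state)[0, 0])))
```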
00:39 CET PRIVACY-BY-SENSING WITH TIME-DOMAIN DIFFERENTIALLY-PRIVATE COMPRESSED SENSING
Authors:
Jianbo Liu, Boyang Cheng, Pengyu Zeng, Steven Davis, Muya Chang and Ningyuan Cao, University of Notre Dame, US
Abstract
With ubiquitous IoT sensors and enormous real-time data generation, data privacy is becoming a critical societal concern. State-of-the-art privacy protection methods all demand significant hardware overhead due to computation-intensive algorithms and a divided sensor/security architecture. In this paper, we propose a generic time-domain circuit architecture that protects raw data by enabling a differentially-private compressed sensing (DP-CS) algorithm secured by physical unclonable functions (PUFs). To address privacy concerns and hardware overhead at the same time, a robust unified PUF and time-domain mixed-signal (TD-MS) module is designed, where the PUF enables private and secure entropy generation. To evaluate the proposed design against a digital baseline, we performed experiments based on synthesized circuits and SPICE simulation, and measured a 2.9x area reduction and 3.2x energy gains. We also measured high-quality PUF generation with the TD-MS circuit, with an inter-die Hamming distance of 52% and a low intra-die Hamming distance of 2.8%. Furthermore, we performed attack and algorithm performance measurements demonstrating that the proposed design preserves data privacy even under attack and that the machine learning performance has minimal degradation (within 2%) compared to the digital baseline.
00:39 CET YOU CAN READ MY CODE, BUT YOU CAN'T EXECUTE IT
Authors:
YongGang Li1, Yu Bao1, Guoyuan Lin1, Yaowen Ma1 and Yeh-Ching Chung2
1China University of Mining and Technology, CN; 2the Chinese University of Hong Kong (ShenZhen), CN
Abstract
Due to Address Space Layout Randomization (ASLR), especially fine-grained ASLR, Code Reuse Attacks (CRAs) require memory probing to get available gadgets. Code reading is the most commonly used probing method. In theory, setting the code pages to be unreadable can prevent code reading. However, since code pages are loaded dynamically, existing methods cannot set all code pages as unreadable at one time. While we can change code page permissions on a page-by-page basis relying on dynamic page tracking, it is expensive. In addition, some special applications (such as debuggers) need to read code pages, and turning off the read permission will affect their operation. To solve these problems, this paper proposes a method called ReadGate. It rebuilds the buddy system for memory allocation and allocates code pages in a specific memory pool to manage their read permissions. When code reading is perceived, ReadGate disables the execution permission of the code page that is being read and re-grants execution permission in a new address space. Experiments and analysis show that ReadGate can prevent the code that has been read from being used as gadgets without affecting processes with code-reading needs. In addition, the performance overhead introduced by ReadGate on the CPU is about 1.8%.
00:39 CET ENERGY-EFFICIENT NTT DESIGN WITH ONE-BANK SRAM AND 2-D PE ARRAY
Authors:
Jianan Mu1, HuaJie Tan2, Jiawen Wu2, Haotian Lu2, Chip-Hong Chang3, Shuai Chen4, Shengwen Liang1, Jing Ye1, Huawei Li1 and Xiaowei Li1
1ICT, CAS, CN; 2School of Microelectronics, Tianjin University, China, CN; 3School of Electrical and Electronic Engineering (EEE) of NTU, SG; 4Rock-solid Security Lab. of Binary Semiconductor Co., Ltd., CN
Abstract
The Number Theoretic Transform (NTT) can be used to accelerate polynomial multiplication, which is one of the main operators in next-generation cryptographic schemes. An energy-efficient hardware design methodology for NTT computation with different polynomial degrees and coefficient word lengths is lacking, which hinders the proliferation of applications that require NTT computation on energy-constrained devices. In NTT operation, more than half of the active energy consumption stems from memory accesses. Here, we propose a generalized design method to improve the energy efficiency of NTT operation by considering the effect of processing element (PE) geometry and memory organization on the data flow between PEs and memory. To decrease the number of data bits that need to be accessed from memory, a two-dimensional (2-D) PE array architecture is used. A pair of ping-pong buffers is proposed to transpose and swap the coefficients so that a single bank of memory can be used with the 2-D PE array, reducing the average memory bit-access energy without compromising throughput. Our experimental results show that this design method can produce NTT accelerators with up to 69.8% savings in average energy consumption compared with existing designs based on multi-bank SRAM and one-bank SRAM with a one-dimensional PE array with the same number of PEs and total memory size.
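For reference, the butterfly arithmetic that such an accelerator feeds from memory is the standard iterative NTT shown below (plain Python, toy parameters). The paper's contribution is the PE geometry and single-bank memory organization around these butterflies, which this sketch does not model.

```python
def ntt(a, q, root):
    """Iterative (Cooley-Tukey style) number theoretic transform over Z_q.
    `root` must be a primitive n-th root of unity mod q and n a power of two."""
    a = list(a)
    n = len(a)
    # bit-reversal permutation
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # butterfly stages
    length = 2
    while length <= n:
        w_len = pow(root, n // length, q)          # primitive length-th root
        for start in range(0, n, length):
            w = 1
            for k in range(length // 2):
                u = a[start + k]
                v = a[start + k + length // 2] * w % q
                a[start + k] = (u + v) % q
                a[start + k + length // 2] = (u - v) % q
                w = w * w_len % q
        length <<= 1
    return a

# Tiny example: n = 8, q = 17, and 9 (= 3^2) is a primitive 8th root of unity mod 17.
print(ntt([1, 2, 3, 4, 5, 6, 7, 8], q=17, root=9))
```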
00:39 CET COFHEE: A CO-PROCESSOR FOR FULLY HOMOMORPHIC ENCRYPTION EXECUTION
Authors:
Mohammed Nabeel Thari Moopan1, Deepraj Soni2, Mohammed Ashraf3, Mizan Gebremichael4, Homer Gamil3, Eduardo Chielle5, Ramesh Karri6, Mihai Sanduleanu4 and Michail Maniatakos5
1New York University, AE; 2New York University Tandon School of Engineering, US; 3NYUAD, AE; 4Khalifa University, AE; 5New York University Abu Dhabi, AE; 6NYU, US
Abstract
The migration of computation to the cloud has raised privacy concerns, as sensitive data becomes vulnerable to attacks since it needs to be decrypted for processing. Fully Homomorphic Encryption (FHE) mitigates this issue as it enables meaningful computations to be performed directly on encrypted data. Nevertheless, FHE is orders of magnitude slower than unencrypted computation, which hinders its practicality and adoption. Therefore, improving FHE performance is essential for its real-world deployment. In this paper, we present our efforts to design, implement, fabricate, and post-silicon validate a co-processor for Fully Homomorphic Encryption Execution dubbed CoFHEE. With a small design area of 12 mm^2, CoFHEE targets optimal ASIC implementations of fundamental polynomial operations, such as polynomial addition and subtraction, Hadamard product, and the Number Theoretic Transform, which underlie all higher-level FHE primitives. CoFHEE natively supports polynomial degrees of up to n = 2^14 with a coefficient size of 128 bits, and is fabricated in 55 nm CMOS technology with a target frequency of 250 MHz. We evaluate our chip with performance and power experiments and compare it against state-of-the-art software implementations and other ASIC designs.
00:39 CET A RAPID RESET 8-TRANSISTOR PHYSICALLY UNCLONABLE FUNCTION UTILISING POWER GATING
Authors:
Yujin Zheng1, Alex Bystrov2 and Alex Yakovlev2
1Newcastle University, GB; 2Newcastle University, GB
Abstract
As promising hardware security primitives, Physically Unclonable Functions (PUFs) need error correction whilst regenerating the secret key in cryptography. The proposed 8-Transistor (8T) PUF, which is coordinated with a power-gating technique, can dramatically speed up the key extraction process during manufacturing. In comparison, a single evaluation cycle of the proposed 8T PUF is more than 1000 times faster than its counterpart, the 6T Static Random-Access Memory (SRAM) PUF. This enables multiple evaluations even in the key-regeneration phase in the field, hence greatly reducing the number of errors and the hardware penalty for error correction. The 8T PUF cell derives from the 6T SRAM cell. It is built to swiftly eliminate data retention and maximise physical mismatch. Meanwhile, power-gating modules enable rapid power-on/off cycles only for the chosen PUF arrays. In addition, these modules are designed with a two-phase power-gating method for minimising EMI and crosstalk amongst PUF cells whilst controlling the in-rush current and maintaining high performance. This method aims to improve PUF stability and protect the PUF from side-channel attacks. An architecture and implementation of the power-gated PUF is developed to accommodate fast multiple evaluations of PUF responses. Post-layout Monte Carlo simulations are performed with Cadence, and the extracted PUF responses are processed with Matlab to evaluate the 8T PUF performance and the stochastic metrics for subsequent inclusion into PUF responses, which comprises the novelty of the approach.

S_D9 (Extra session) Emerging design technologies for future computing

Date: Wednesday, 19 April 2023
Time: 00:39 - 00:00 CET

Time Label Presentation Title
Authors
00:39 CET SCALABLE COHERENT OPTICAL CROSSBAR ARCHITECTURE USING PCM FOR AI ACCELERATION
Authors:
Dan Sturm and Sajjad Moazeni, University of Washington, US
Abstract
Recent advancements in artificial intelligence (AI) and machine learning (ML) have been challenging our conventional computing paradigms by demanding enormous computing power at a dramatically faster pace than Moore's law. Analog-based optical computing has recently been proposed as a new approach to achieve large compute power (TOPS) at high energy efficiency (TOPS/W), which makes it suitable for AI acceleration in datacenters and supercomputers. However, implementations proposed so far suffer from limited scalability, large footprints and high power consumption, and lack practical system-level architectures that can be integrated within existing datacenter architectures for real-world applications. In this work, we present a truly scalable optical AI accelerator based on a crossbar architecture. We have considered all major roadblocks and address them in this design. Weights are stored on chip using phase change material (PCM) that can be monolithically integrated in silicon photonic processes. This coherent crossbar architecture can be extended to large scales without the need for any multi-wavelength laser sources. All electro-optical components and circuit blocks are modeled based on measured performance metrics in a monolithic silicon photonics 45nm process and can be co-packaged with advanced SoC and HBM memories. We also present system-level modeling and analysis of our chip's performance for the ResNet-50 V1.5 neural network, considering all critical parameters, including memory size, array size, photonic losses, and the energy consumption of peripheral electronics, including ADCs and DACs. Both on-chip SRAM and off-chip DRAM energy overheads have been considered in this modeling. We additionally show how a dual-core crossbar design can eliminate programming time overhead at practical SRAM block sizes and batch sizes. Our results show that a 128 x 128 instance of the proposed architecture can achieve inferences per second (IPS) similar to the Nvidia A100 GPU at 15.4× lower power and 7.24× lower area.
00:39 CET MIXED-SIGNAL MEMRISTOR-BASED ITERATIVE MONTGOMERY MODULAR MULTIPLICATION
Authors:
Mehdi Kamal1 and Massoud Pedram2
1University of Southern California, US; 2USC, US
Abstract
In this paper, we present a mixed-signal implementation of the iterative Montgomery multiplication algorithm (called X-IMM) for use in large arithmetic word size (LAWS) computations. LAWS is mainly utilized in security applications such as lattice-based cryptography, where the width of the input operands may be equal to or larger than 1,024 bits. The proposed architecture is based on the iterative implementation of the Montgomery multiplication (MM) algorithm, where some critical parts of the multiplication are computed in the analog domain by mapping them onto a memristor crossbar. Using a memristor crossbar reduces the area usage and latency of the modular multiplication unit compared to its fully digital implementation. The devised mixed-signal MM implementation is scalable by cascading smaller X-IMMs to support dynamically adjustable larger operand sizes at runtime. The effectiveness of the proposed MM structure is assessed in a 45nm technology, and the comparative studies show that the proposed 1,024-bit Radix-4 (Radix-16) Montgomery multiplication architecture provides about 13% (22%) higher GOPS/mm^2 compared to the state-of-the-art digital implementations of iterative Montgomery multipliers.
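The iteration being mapped to the crossbar can be illustrated with the bit-serial (radix-2) Montgomery multiplication below; the paper uses higher radices and analog accumulation, so this is only the reference arithmetic, with toy operand sizes.

```python
def montgomery_multiply(a, b, n, n_bits):
    """Radix-2 iterative Montgomery multiplication: returns a*b*2^(-n_bits) mod n
    for an odd modulus n and operands a, b < n."""
    assert n % 2 == 1
    t = 0
    for i in range(n_bits):
        t += ((a >> i) & 1) * b          # conditionally add b
        t += (t & 1) * n                 # add n if t is odd, making t even
        t >>= 1                          # exact division by 2
    return t - n if t >= n else t        # final conditional subtraction

# Check against the definition: result == a*b*R^{-1} mod n with R = 2^n_bits.
a, b, n, k = 123456789, 987654321, (1 << 61) - 1, 64
r_inv = pow(1 << k, -1, n)               # modular inverse (Python 3.8+)
assert montgomery_multiply(a, b, n, k) == a * b * r_inv % n
```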
00:39 CET ODLPIM: A WRITE-OPTIMIZED AND LONG-LIFETIME RERAM-BASED ACCELERATOR FOR ONLINE DEEP LEARNING
Authors:
Heng Zhou1, Bing Wu2, Huan Cheng1, Wei Zhao1, Xueliang Wei1, Jinpeng Liu1, Dan Feng1 and Wei Tong3
1Huazhong University of Science and Technology, CN; 2Huazhong University of Science and Technology, CN; 3HUST, CN
Abstract
ReRAM-based Processing-In-Memory (PIM) architectures have demonstrated high energy efficiency and performance in deep neural network (DNN) acceleration. Most of the existing PIM accelerators for DNNs focus on offline batch learning (OBL), which requires the whole dataset to be available before training. However, in the real world, data instances arrive in sequential settings, and even the data pattern may change, which is called concept drift. OBL requires expensive retraining to handle concept drift, whereas online deep learning (ODL) has been shown to be a better solution for keeping the model evolving over streaming data. Unbalanced writes degrade the lifetime of the PIM accelerator. However, when ODL optimizes models over a large-scale data stream, this issue is more severe than in OBL due to the heavier weight updates, resulting in the amplification of unbalanced writes and lifetime deterioration. In this work, we propose ODLPIM, an online deep learning PIM accelerator that extends system lifetime through algorithm-hardware co-optimization. ODLPIM adopts a novel write-optimized parameter update (WARP) scheme that reduces non-critical weight updates in hidden layers. Besides, a table-based inter-crossbar wear-leveling (TIWL) scheme is proposed and applied to the hardware controller to achieve wear-leveling between crossbars for lifetime improvement. Experiments show that WARP reduces weight updates by 15.25% on average and by up to 24% compared to training without WARP, and consequently prolongs system lifetime by 9.65% on average and by up to 26.81%, with a negligible rise in cumulative error rate (up to 0.31%). By combining WARP with TIWL, the lifetime of ODLPIM is improved by an average of 12.59X and up to 17.73X.
00:39 CET SAT-BASED QUANTUM CIRCUIT ADAPTATION
Authors:
Sebastian Brandhofer1, Jinwoong Kim2, Siyuan Niu3 and Nicholas Bronn4
1University of Stuttgart, DE; 2Delft University of Technology, NL; 3University of Montpellier, FR; 4IBM Quantum, US
Abstract
As the nascent field of quantum computing develops, an increasing number of quantum hardware modalities, such as superconducting electronic circuits, semiconducting spins, ion traps, and neutral atoms, become available for performing quantum computations. These quantum hardware modalities exhibit varying characteristics and implement different universal quantum gate sets that may, e.g., contain several distinct two-qubit quantum gates. Adapting a quantum circuit from a possibly hardware-agnostic universal quantum gate set to the quantum gate set of a target hardware modality has a crucial impact on the fidelity and duration of the intended quantum computation. However, current quantum circuit adaptation techniques only apply a specific decomposition or allow only for local improvements to the target quantum circuit, possibly resulting in a quantum computation on the target hardware modality with decreased fidelity or increased qubit idle time. These issues are further aggravated by the multiple options of hardware-native quantum gates, rendering multiple universal quantum gate sets accessible to a hardware modality. In this work, we developed a satisfiability modulo theories (SMT) model that determines an optimized quantum circuit adaptation given a set of allowed substitutions and decompositions, a target hardware modality, and the quantum circuit to be adapted. We further discuss the physics of the semiconducting spins hardware modality, show possible implementations of diverse two-qubit quantum gates, and evaluate the developed SMT model on the semiconducting spins hardware modality. Using the developed quantum circuit adaptation method, the Hellinger fidelity could be improved by up to 40% and the qubit idle time could be decreased by up to 87% compared to prior quantum circuit adaptation techniques.
00:39 CET ULTRA-DENSE 3D PHYSICAL DESIGN ENABLES NEW ARCHITECTURAL DESIGN POINTS WITH LARGE BENEFITS
Authors:
Tathagata Srimani1, Robert Radway1, Jinwoo Kim2, Kartik Prabhu1, Dennis Rich1, Carlo Gilardi1, Priyanka Raina1, Max Shulaker3, Sung Kyu Lim2 and Subhasish Mitra1
1Stanford University, US; 2Georgia Tech, US; 3MIT, US
Abstract
This paper focuses on iso-on-chip-memory-capacity and iso-footprint Energy-Delay-Product (EDP) benefits of ultra-dense 3D, e.g., monolithic 3D (M3D), computing systems vs. corresponding 2D designs. Simply folding existing 2D designs into corresponding M3D physical designs yields limited EDP benefits (~1.4×). New M3D architectural design points that exploit M3D physical design are crucial for large M3D EDP benefits. We perform comprehensive architectural exploration and detailed M3D physical design using foundry M3D process design kit and standard cell library for front-end-of-line (FEOL) Si CMOS logic, on-chip back-end-of-line (BEOL) memory, and a single layer of on-chip BEOL FETs. We find new M3D AI/ML accelerator architectural design points that have iso-footprint, iso-on-chip-memory-capacity EDP benefits ranging from 5-11.5× vs. corresponding 2D designs (containing only FEOL Si CMOS and on-chip BEOL memory). We also present an analytical framework to derive architectural insights into these benefits, showing that our principles extend to many architectural design points across various device technologies.
00:39 CET MEMRISTOR-SPIKELEARN: A SPIKING NEURAL NETWORK SIMULATOR FOR STUDYING SYNAPTIC PLASTICITY UNDER REALISTIC MEMRISTOR BEHAVIORS
Authors:
Yuming Liu1, Angel Yanguas-Gil2, Sandeep Madireddy2 and Yanjing Li1
1University of Chicago, US; 2Argonne National Laboratory, US
Abstract
Memristor-based implementations of synaptic plasticity in spiking neural networks provide a promising approach towards energy-efficient online learning. However, there are no existing methods or frameworks that enable the study of synaptic plasticity in real-world workloads under realistic memristor device behaviors. To fill this important gap, we present the new Memristor-Spikelearn simulator, which is capable of incorporating memristor models with the proper levels of details, and thus simulating synaptic plasticity under realistic device behaviors. Using this simulator, we demonstrate that: 1. results obtained using a simplified memristor device model can be misleading, and a realistic device model in simulation is essential; and 2. it is also essential to fine-tune a synaptic plasticity algorithm according to the characteristics of the underlying memristor device, e.g., by controlling how weight values are mapped to conductance values, which can yield drastically different energy-accuracy trade-offs. Moreover, our simulator is extensible and scalable.
00:39 CET EXPLOITING KERNEL COMPRESSION ON BNNS
Authors:
Franyell Silfa, Jose Maria Arnau and Antonio González, Polytechnic University of Catalonia, ES
Abstract
Binary Neural Networks (BNNs) are showing tremendous success on realistic image classification tasks. Notably, their accuracy is similar to the state-of-the-art accuracy obtained by full-precision models tailored to edge devices. In this regard, BNNs are very amenable to edge devices since they employ 1 bit to store the inputs and weights, and thus, their storage requirements are low. Moreover, BNN computations are mainly done using xnor and pop-count operations, which are implemented very efficiently using simple hardware structures. Nonetheless, supporting BNNs efficiently on mobile CPUs is far from trivial since their benefits are hindered by frequent memory accesses to load weights and inputs. In BNNs, a weight or an input is stored using one bit, and aiming to increase storage and computation efficiency, several of them are packed together as a sequence of bits. In this work, we observe that the number of unique sequences representing a set of weights or inputs is typically low (e.g., 512). Also, we have seen that during the evaluation of a BNN layer, a small group of unique sequences is employed more frequently than others. Accordingly, we propose exploiting this observation by using Huffman encoding to encode the bit sequences and then using an indirection table to decode them during the BNN evaluation. Also, we propose a clustering-based scheme to identify the most common sequences of bits and replace the less common ones with similar common sequences. As a result, we decrease the storage requirements and memory accesses since the most common sequences are encoded with fewer bits. In this work, we extend a mobile CPU by adding a small hardware structure that can efficiently cache and decode the compressed sequences of bits. We evaluate our scheme using the ReActNet model with the ImageNet dataset on an ARM CPU. Our experimental results show that our technique can reduce the memory requirement by 1.32x and improve performance by 1.35x.
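A small sketch of the encoding step described above: build a Huffman code over the packed bit sequences of a layer so that frequent patterns get short codewords. The 8-bit packing and the example frequencies are illustrative assumptions; the indirection-table decoder in hardware is not modeled.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code (symbol -> bitstring) from a stream of symbols; the
    symbols here stand for packed 8-bit weight/input patterns of a BNN layer,
    so the most frequent patterns receive the shortest codewords."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    heap = [[cnt, idx, {sym: ""}] for idx, (sym, cnt) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_idx = len(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        lo_codes = {s: "0" + c for s, c in lo[2].items()}   # left subtree
        hi_codes = {s: "1" + c for s, c in hi[2].items()}   # right subtree
        heapq.heappush(heap, [lo[0] + hi[0], next_idx, {**lo_codes, **hi_codes}])
        next_idx += 1
    return heap[0][2]

# Only a handful of unique packed patterns dominate a layer's weights/inputs.
packed = [0b10110001] * 50 + [0b11110000] * 30 + [0b00001111] * 15 + [0b01010101] * 5
code = huffman_code(packed)
encoded_bits = sum(len(code[p]) for p in packed)
print(code, encoded_bits, "bits vs", 8 * len(packed), "bits uncompressed")
```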
00:39 CET AXI-PACK: NEAR-MEMORY BUS PACKING FOR BANDWIDTH-EFFICIENT IRREGULAR WORKLOADS
Authors:
Chi Zhang1, Paul Scheffler1, Thomas Benz1, Matteo Perotti2 and Luca Benini1
1Integrated Systems Laboratory, ETH Zurich, CH; 2ETH Zürich, CH
Abstract
Data-intensive applications involving irregular memory streams are inefficiently handled by modern processors and memory systems highly optimized for regular, contiguous data. Recent work tackles these inefficiencies in hardware through core-side stream extensions or memory-side prefetchers and accelerators, but fails to provide end-to-end solutions which also achieve high efficiency in on-chip interconnects. We propose AXI-Pack, an extension to ARM's AXI4 protocol introducing bandwidth-efficient strided and indirect bursts to enable end-to-end irregular streams. AXI-Pack adds irregular stream semantics to memory requests and avoids inefficient narrow-bus transfers by packing multiple narrow data elements onto a wide bus. It retains full compatibility with AXI4 and does not require modifications to non-burst-reshaping interconnect IPs. To demonstrate our approach end-to-end, we extend an open-source RISC-V vector processor to leverage AXI-Pack at its memory interface for strided and indexed accesses. On the memory side, we design a banked memory controller efficiently handling AXI-Pack requests. On a system with a 256-bit-wide interconnect running FP32 workloads, AXI-Pack achieves near-ideal peak on-chip bus utilizations of 87% and 39%, speedups of 5.4x and 2.4x, and energy efficiency improvements of 5.3x and 2.1x over a baseline using an AXI4 bus on strided and indirect benchmarks, respectively.
00:39 CET SIMSNN: A WEIGHT-AGNOSTIC RERAM-BASED SEARCH-IN-MEMORY ENGINE FOR SPIKING NEURAL NETWORK ACCELERATION
Authors:
Fangxin Liu1, Xiaokang Yang1 and Li Jiang2
1Shanghai Jiaotong University, CN; 2Shanghai Jiao Tong University, CN
Abstract
Bio-plausible spiking neural networks (SNNs) have gained great momentum due to their inherent efficiency in processing event-driven information. The dominant computation in SNNs, matrix bit-wise AND-add operations, is a natural fit for processing-in-memory (PIM) architectures. The long input spike train of SNNs and the bit-serial processing mechanism of PIM, however, incur considerable latency and frequent analog-to-digital conversion, offsetting the performance gain and energy efficiency. In this paper, we propose a novel Search-in-Memory (SIM) architecture to accelerate SNN inference, named SIMSnn. Rather than processing the input bit-by-bit over multiple time steps, SIMSnn can take in a sequence of spikes and search for the result by parallel associative matches in the CAM crossbar. We explore a cascade search mechanism and a temporal pipeline design to enhance the parallelism of the search across time windows. The proposed SIMSnn can leverage the non-structured pruning mechanism, which is unusable for most PIM architectures, to further reduce the CAM overhead. As a weight-agnostic SNN accelerator, SIMSnn can adapt to various evolving SNNs without rewriting the crossbar array. Experimental results show that the proposed SIMSnn achieves 25.3× higher energy efficiency and a 13.7× speedup on average compared to the ISAAC-like design. Compared to the state-of-the-art PIM design, NEBULA, SIMSnn can also achieve up to 7.9× energy savings and a 5.7× speedup.
00:39 CET BOMIG: A MAJORITY LOGIC SYNTHESIS FRAMEWORK FOR AQFP LOGIC
Authors:
Rongliang Fu1, Junying Huang2, Mengmeng Wang3, Yoshikawa Nobuyuki3, Bei Yu4, Tsung-Yi Ho4 and Olivia Chen5
1The Chinese University of Hong Kong, CN; 2State Key Lab of Processors, Institute of Computing Technology, CAS, Beijing, China, CN; 3Yokohama National University, JP; 4The Chinese University of Hong Kong, HK; 5Tokyo City University, JP
Abstract
As an energy-efficient superconductor logic with no static power consumption and extremely low switching energy, adiabatic quantum-flux-parametron (AQFP) logic is a promising candidate for building energy-efficient computing systems. Due to the native majority function in AQFP logic, which can represent more complex logic at the same cost as the AND/OR function, the design of AQFP circuits differs from conventional digital circuits. Compared with the AND-OR-inverter (AOI) logic structure commonly used in CMOS technology, majority-inverter logic enables a smaller size and depth of a circuit and is more suitable for AQFP logic. However, AQFP logic needs buffer and splitter insertion to satisfy the clock-based synchronization requirement and fan-out limitation, making traditional majority-based logic synthesis methods not applicable. This paper combines logic optimization with buffer and splitter insertion and proposes a novel logic synthesis framework to minimize the size and depth of AQFP circuits. First, logic synthesis of AQFP circuits is regarded as a black-box problem, where AOI and majority-based transformation methods construct the feasible domain, and the objective function is the normalized energy-delay product of AQFP circuits. Then, a Bayesian optimization method is used to explore an optimal fixed-length transformation sequence applied to AQFP logic optimization. Experimental results show that the proposed method achieves a significant improvement in terms of both JJ number and circuit depth compared to the state-of-the-art.

BPA_10 Hardware Security

Date: Wednesday, 19 April 2023
Time: 08:30 - 10:30 CET

Time Label Presentation Title
Authors
00:39 CET SOCFUZZER: SOC VULNERABILITY DETECTION USING COST FUNCTION ENABLED FUZZ TESTING
Authors:
Muhammad Monir Hossain1, Arash Vafaei1, Kimia Zamiri Azar1, Fahim Rahman1, Farimah Farahmandi1 and Mark Tehranipoor2
1University of Florida, US; 2Intel Charles E. Young Preeminence Endowed Chair Professor in Cybersecurity, Associate Chair for Research and Strategic Initiatives, ECE Department, University of Florida, US
Abstract
Modern System-on-Chips (SoCs), with numerous complex and heterogeneous intellectual properties (IPs) and the inclusion of highly sensitive assets, have become the target of malicious attacks. However, security verification of these SoCs remains behind compared to the advances in functional verification, mostly because it is difficult to formally define the accurate threat model(s). A few recent studies have investigated the possibility of engaging fuzz testing for hardware-oriented vulnerability detection. However, they suffer from several limitations, i.e., lack of cross-layer co-verification, the need for expert knowledge, and the inability to capture detailed hardware interactions. In this paper, we propose SoCFuzzer, an automated SoC verification approach assisted by fuzz testing for detecting SoC security vulnerabilities. Unlike previous HW-oriented fuzz testing studies, which mostly rely on traditional (code) coverage-based metrics, in SoCFuzzer we develop (i) generic evaluation metrics for fuzzing the hardware domain, and (ii) security-oriented cost functions. This relieves designers of making correlations between coverage metrics, test data, and possible vulnerabilities. The SoCFuzzer cost functions are defined at a high level, allowing us to follow the gray-box model, which requires less detailed and interactive information from the design-under-test. Our experiments on an open-source RISC-V-based SoC show the efficiency of these metrics and cost functions in fuzzing for generating corner-case inputs that trigger the vulnerability conditions with faster convergence.
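The gray-box, cost-function-driven loop can be sketched in a few lines, as below. The run_dut and cost hooks, the single-bit mutation and the toy trigger condition are all illustrative assumptions standing in for the simulator interface and security-oriented cost functions SoCFuzzer would provide.

```python
import random

MAGIC = b"\xde\xad"   # toy "vulnerability": triggers when the first two bytes match

def run_dut(data):
    """Stand-in for running the design-under-test; a real flow would return an
    observation of internal state from the simulator."""
    return data

def cost(state):
    """Security-oriented cost: Hamming distance of the observed state to the
    condition that exposes the asset. Zero means the condition was triggered."""
    return sum(bin(a ^ b).count("1") for a, b in zip(state[:2], MAGIC))

def fuzz(seed_inputs, iterations=2000, seed=0):
    """Gray-box, cost-function-driven fuzz loop: mutate the current best input
    and keep mutants whose cost does not increase, until the cost reaches zero."""
    rng = random.Random(seed)
    corpus = [(cost(run_dut(i)), i) for i in seed_inputs]
    for _ in range(iterations):
        best_cost, parent = min(corpus)
        if best_cost == 0:
            return parent
        child = bytearray(parent)
        child[rng.randrange(len(child))] ^= 1 << rng.randrange(8)   # one-bit mutation
        c = cost(run_dut(bytes(child)))
        if c <= best_cost:                                          # progress (or plateau)
            corpus.append((c, bytes(child)))
    return None

result = fuzz([b"\x00\x00\x00\x00"])
print(result and result.hex())   # converges to an input starting with de ad
```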
00:39 CET NON-PROFILED SIDE-CHANNEL ASSISTED FAULT ATTACK: A CASE STUDY ON DOMREP
Authors:
Sayandeep Saha, Prasanna Ravi, Dirmanto Jap and Shivam Bhasin, Nanyang Technological University, Singapore, SG
Abstract
Recent work has shown that Side-Channel Attacks (SCA) and Fault Attacks (FA) can be combined, forming an extremely powerful adversarial model that can bypass even some of the strongest protections against both FA and SCA. However, this strongest form of combined attack comes with some practical challenges: 1) a profiled setting with multiple fault locations is needed; 2) fault models are restricted to single-bit set-reset/flips; 3) the input needs to be repeated several times. In this paper, we propose a new combined attack strategy called SCA-NFA that works in a non-profiled setting. Assuming knowledge of plaintexts/ciphertexts and exploiting bitsliced implementations of modern ciphers, we further relax the assumptions on the fault model and the number of fault locations: a random multi-bit fault at a single fault location is sufficient for recovering several secret bits. Furthermore, the inputs are allowed to be varied, which is required in several practical use cases. The attack is validated on a recently proposed countermeasure called DOMREP, which individually provides SCA and FA protection of arbitrary order. Practical validation for an open-source masked implementation of GIMLI with the DOMREP extension on an STM32F407G, using electromagnetic fault injection and electromagnetic SCA, shows that SCA-NFA succeeds in around 10,000 measurements.
00:39 CET EFFICIENT SOFTWARE MASKING OF AES THROUGH INSTRUCTION SET EXTENSIONS
Authors:
Songqiao Cui and Josep Balasch, KU Leuven, BE
Abstract
Masking is a well-studied countermeasure to protect software implementations against side-channel attacks. For the case of AES, incorporating masking often requires implementing internal transformations using finite field arithmetic. This results in significant performance overheads, mostly due to finite field multiplications, which are worsened even further when no lookup tables are used. In this work, we extend a RISC-V core with custom instructions to accelerate AES finite field arithmetic. With a 3.4% area increase, we measure 7.27x and 5.45x speedups over software-only implementations of first-order Boolean Masking and Inner Product Masking, respectively. We also investigate vectorized instructions capable of exploiting the intra-block and inter-block parallelism in the implementation. Our implementations avoid the use of lookup tables, run in constant time, and show no evidence of first-order leakage when evaluated on an FPGA.
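For context, the finite field arithmetic being accelerated is GF(2^8) multiplication with the AES reduction polynomial; a branch-free shift-and-add form is sketched below. The custom instructions' exact semantics are not given in the abstract, so this is only the scalar operation they would replace.

```python
def xtime(a):
    """Multiply by x (i.e. 2) in AES's field GF(2^8), reduction polynomial
    x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    a <<= 1
    return (a ^ 0x1B) & 0xFF if a & 0x100 else a

def gf256_mul(a, b):
    """Table-free, constant-structure multiplication in GF(2^8): the kind of
    finite-field operation masked AES recomputes constantly in software."""
    result = 0
    for _ in range(8):
        result ^= (b & 1) * a      # conditional add without a data-dependent branch
        b >>= 1
        a = xtime(a)
    return result

# Sanity check: {57} * {83} = {c1}, the worked example in FIPS-197.
assert gf256_mul(0x57, 0x83) == 0xC1
```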

BPA_3 Efficient processing for NNs

Date: Wednesday, 19 April 2023
Time: 08:30 - 10:30 CET

Time Label Presentation Title
Authors
00:39 CET AUTOMATED ENERGY-EFFICIENT DNN COMPRESSION UNDER FINE-GRAIN ACCURACY CONSTRAINTS
Authors:
Ourania Spantidi and Iraklis Anagnostopoulos, Southern Illinois University Carbondale, US
Abstract
Deep Neural Networks (DNNs) are utilized in a variety of domains, and their computation intensity is stressing embedded devices that comprise limited power budgets. DNN compression has been employed to achieve gains in energy consumption on embedded devices at the cost of accuracy loss. Compression-induced accuracy degradation is addressed through fine-tuning or retraining, which is not always feasible. Additionally, state-of-the-art approaches compress DNNs with respect to the average accuracy achieved during inference, which can be a misleading evaluation metric. In this work, we explore more fine-grain properties of DNN inference accuracy and generate energy-efficient DNNs using signal temporal logic and falsification jointly through pruning and quantization. We offer the ability to control the quality of the DNN inference at run-time, and propose an automated framework that can generate compressed DNNs that satisfy tight fine-grain accuracy requirements. The conducted evaluation on the ImageNet dataset has shown over 30% gains in energy consumption when compared to baseline DNNs.
00:39 CET A SPEED- AND ENERGY-DRIVEN HOLISTIC TRAINING FRAMEWORK FOR SPARSE CNN ACCELERATORS
Authors:
Yuanchen Qu, Yu Ma and Pingqiang Zhou, Shanghaitech University, CN
Abstract
Sparse convolutional neural network (CNN) accelerators have been shown to achieve high processing speed and low energy consumption by leveraging zero weights or activations, which can be further optimized by finely tuning the sparse activation maps in the training process. In this paper, we propose a CNN training framework targeted at reducing energy consumption and processing cycles in sparse CNN accelerators. We first model the accelerator's energy consumption and processing cycles as functions of layer-wise activation map sparsity. Then we leverage the model and propose a hybrid regularization approximation method to further sparsify activation maps in the training process. The results show that our proposed framework can reduce the energy consumption of Eyeriss by 31.33%, 20.6% and 26.6% on MobileNet-V2, SqueezeNet and Inception-V3. In addition, the processing speed can be increased by 1.96×, 1.4× and 1.65×, respectively.
00:39 CET HARDWARE EFFICIENT WEIGHT-BINARIZED SPIKING NEURAL NETWORKS
Authors:
Chengcheng Tang and Jie Han, University of Alberta, CA
Abstract
The advancement of spiking neural networks (SNNs) provides a promising alternative to conventional artificial neural networks (ANNs) with higher energy efficiency. However, the significant requirements on memory usage present a performance bottleneck on resource-constrained devices. Inspired by the notion of binarized neural networks (BNNs), we incorporate the design principle of BNNs into SNNs to reduce these stringent resource requirements. Specifically, the weights are binarized to 1 and -1 to represent excitatory and inhibitory synapses, so the proposed design is referred to as a weight-binarized spiking neural network (WB-SNN). In the WB-SNN, only one bit is used for a weight or a spike; for the latter, 1 and 0 indicate a spike and no spike, respectively. A priority encoder is used to identify the index of an active neuron as the basic unit to construct the WB-SNN. A fully connected neural network is designed that consists of an input layer, an output layer, and fully connected layers of different sizes. A counter is then utilized in each neuron to complete the accumulation of weights. The WB-SNN design is validated by using a multi-layer perceptron on the MNIST dataset. Hardware implementations on FPGAs show that the WB-SNN attains a significant saving of memory with only a limited accuracy loss compared with its SNN and BNN counterparts.

BPA_4 Hardware accelerators

Date: Wednesday, 19 April 2023
Time: 08:30 - 10:30 CET

Time Label Presentation Title
Authors
00:39 CET ACCELERATING GUSTAVSON-BASED SPMM ON EMBEDDED FPGAS WITH ELEMENT-WISE PARALLELISM AND ACCESS PATTERN-AWARE CACHES
Authors:
Shiqing Li and Weichen Liu, Nanyang Technological University, SG
Abstract
Gustavson's algorithm (i.e., the row-wise product algorithm) shows potential as the backbone algorithm for sparse matrix-matrix multiplication (SpMM) on hardware accelerators. However, it still suffers from irregular memory accesses, and thus its performance is bounded by the off-chip memory traffic. Previous works mainly focus on high-bandwidth-memory-based architectures and are not suitable for embedded FPGAs with traditional DDR. In this work, we propose an efficient Gustavson-based SpMM accelerator on embedded FPGAs with element-wise parallelism and access pattern-aware caches. First of all, we analyze the parallelism of Gustavson's algorithm and propose to perform the algorithm with element-wise parallelism, which reduces the idle time of processing elements caused by synchronization. Further, we show a counter-intuitive example in which the traditional cache leads to worse performance. Then, we propose a novel access pattern-aware cache scheme called SpCache, which provides quick responses to reduce bank conflicts caused by irregular memory accesses and combines streaming and caching to handle requests that access ordered elements of unpredictable length. Finally, we conduct experiments on the Xilinx Zynq-UltraScale ZCU106 platform with a set of benchmarks from the SuiteSparse matrix collection. The experimental results show that the proposed design achieves an average 1.62x performance speedup compared to the baseline.
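Gustavson's row-wise product, referenced above, is easy to state in a few lines of Python over a dict-of-rows sparse format (an illustrative stand-in for CSR): for every nonzero A[i][k], row k of B is scaled and merged into output row i. The accelerator parallelizes exactly this per-element work.

```python
def gustavson_spmm(a_rows, b_rows):
    """Row-wise-product (Gustavson) sparse matrix-matrix multiplication.
    Each sparse matrix is a list of rows, each row a dict {col: value}."""
    c_rows = []
    for a_row in a_rows:
        acc = {}                                # per-output-row accumulator
        for k, a_val in a_row.items():          # nonzeros of A's row i
            for j, b_val in b_rows[k].items():  # nonzeros of B's row k
                acc[j] = acc.get(j, 0.0) + a_val * b_val
        c_rows.append(acc)
    return c_rows

A = [{0: 2.0, 2: 1.0}, {1: 3.0}]                # 2x3 sparse matrix
B = [{1: 4.0}, {0: 1.0, 2: 5.0}, {1: -2.0}]     # 3x3 sparse matrix
print(gustavson_spmm(A, B))                     # [{1: 6.0}, {0: 3.0, 2: 15.0}]
```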
00:39 CET GRAPHITE: ACCELERATING ITERATIVE GRAPH ALGORITHMS ON RERAM ARCHITECTURES VIA APPROXIMATE COMPUTING
Authors:
Dwaipayan Choudhury, Ananth Kalyanaraman and Partha Pratim Pande, Washington State University, US
Abstract
ReRAM-based Processing-in-Memory (PIM) offers a promising paradigm for computing near data, making it an attractive platform for graph applications that suffer from sparsity and irregular memory access. However, the performance of ReRAM-based graph accelerators is limited by two key challenges: significant storage requirements (particularly due to wasted storage of the zero cells of a graph's adjacency matrix) and a significant amount of on-chip traffic between ReRAM-based processing elements. In this paper, we present GraphIte, an approximate-computing-based framework for accelerating iterative graph applications on ReRAM-based architectures. GraphIte uses sparsification and approximate updates to achieve significant reductions in ReRAM storage and data movement. Our experiments on PageRank and community detection show that the proposed architecture outperforms a state-of-the-art ReRAM-based graph accelerator with up to an 83.4% reduction in execution time while consuming up to 87.9% less energy for a range of graph inputs and workloads.
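The toy sketch below captures the two approximations described in the abstract in plain Python: drop a fraction of edges (sparsification) and stop PageRank iterations early once updates fall below a threshold (approximate updates). The keep ratio, damping factor and tolerance are illustrative, not GraphIte's parameters.

import random

def sparsify(edges, keep=0.8, seed=0):
    rng = random.Random(seed)
    return [e for e in edges if rng.random() < keep]

def pagerank(edges, n, d=0.85, iters=30, eps=1e-4):
    out_deg = [0] * n
    for u, _ in edges:
        out_deg[u] += 1
    pr = [1.0 / n] * n
    for _ in range(iters):
        new = [(1 - d) / n] * n
        for u, v in edges:
            new[v] += d * pr[u] / out_deg[u]
        if max(abs(a - b) for a, b in zip(new, pr)) < eps:   # approximate update
            break
        pr = new
    return pr

edges = [(0, 1), (1, 2), (2, 0), (2, 1)]
print(pagerank(sparsify(edges, keep=1.0), n=3))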
00:39 CET PEDAL: A POWER EFFICIENT GCN ACCELERATOR WITH MULTIPLE DATAFLOWS
Authors:
Yuhan Chen, Alireza Khadem, Xin He, Nishil Talati, Tanvir Ahmed Khan and Trevor Mudge, University of Michigan, US
Abstract
Graphs are ubiquitous and used in many domains due to their ability to describe structural relations. Graph Convolutional Networks (GCNs) have emerged in recent years and are developing rapidly due to their capability to analyze graph-structured data. The nature of graph-structured data leads to irregular memory accesses in the aggregation phase of GCNs, making it hard for general-purpose architectures like CPUs and GPUs to utilize their computing resources. In this paper, we propose PEDAL, a power-efficient accelerator for GCN inference supporting multiple dataflows. PEDAL chooses the best-fit dataflow and phase ordering based on the input graph characteristics and the GCN algorithm, achieving both efficiency and flexibility. PEDAL also features a lightweight processing element design and uses a relatively small number of processing elements for better power efficiency while achieving good performance. PEDAL achieves 144.5, 9.36, and 2.55 times speedup compared to CPU, GPU, and HyGCN, respectively, and 8856, 1606, 8.4, and 1.78 times better power efficiency compared to CPU, GPU, HyGCN, and EnGN, respectively.

S_A2 Application specific circuits and systems

Date: Wednesday, 19 April 2023
Time: 11:00 - 12:30 CET

Time Label Presentation Title
Authors
00:39 CET A DECENTRALIZED FRONTIER QUEUE FOR IMPROVING SCALABILITY OF BREADTH-FIRST-SEARCH ON GPUS
Authors:
Chou-Ying Hsieh, Po-Hsiu Cheng, Chia-Ming Chang and Sy-Yen Kuo, National Taiwan University, TW
Abstract
The breadth-first search (BFS) algorithm is a fundamental building block of applications ranging from electronic design automation (EDA) to social network analysis. With target data set sizes growing considerably, researchers have turned to developing parallel BFS (PBFS) algorithms and accelerating them on graphics processing units (GPUs). The frontier queue, the core idea among state-of-the-art PBFS designs, opens the door to neighbor-visiting parallelism. However, the traditional centralized frontier queue in PBFS suffers from dramatic collisions when excessive threads operate on it simultaneously. Furthermore, the growing size of graphs puts considerable pressure on memory space. Therefore, we first identify the challenges of current frontier queue implementations. To solve these challenges, we propose the decentralized frontier queue (DFQ), which separates a centralized queue into multiple tiny sub-queues to scatter atomic operation collisions across these queues. We also develop novel overflow-free enqueue and asynchronous sub-queue drain methods to avoid the overflow issue of the naive sub-queue design. With these two optimizations, the memory consumption of the frontier queue can be constant rather than growing exponentially with the vertex count of the graph. In our experiments, we show that our design has better scalability and gains an average 1.04x speedup in execution on the selected benchmark suite with considerable memory space efficiency.
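The abstract's core idea can be sketched sequentially in a few lines: the frontier is split into several small sub-queues (one per worker in the real GPU design), so concurrent enqueues would land on different queues rather than contend on a single shared one. This is a conceptual illustration only; the queue count and the level-synchronous loop are assumptions.

from collections import deque

def bfs_subqueues(adj, src, num_queues=4):
    dist = {src: 0}
    queues = [deque() for _ in range(num_queues)]
    queues[src % num_queues].append(src)
    while any(queues):
        next_queues = [deque() for _ in range(num_queues)]
        for q in queues:                       # each sub-queue is drained independently
            while q:
                u = q.popleft()
                for v in adj.get(u, []):
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        next_queues[v % num_queues].append(v)   # scatter enqueues
        queues = next_queues
    return dist

adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(bfs_subqueues(adj, 0))    # {0: 0, 1: 1, 2: 1, 3: 2}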
00:39 CET TIMELY FUSION OF SURROUND RADAR/LIDAR FOR OBJECT DETECTION IN AUTONOMOUS DRIVING SYSTEMS
Authors:
Wenjing XIE1, Tao Hu1, Neiwen Ling2, Guoliang Xing2, Shao-Shan Liu3 and Nan Guan1
1City University of Hong Kong, HK; 2The Chinese University of Hong Kong, HK; 3BeyonCa, CN
Abstract
Fusion of multiple sensor modalities, such as camera, Lidar and Radar, is commonly used in autonomous driving systems to fully utilize the complementary advantages of different sensors. Surround Radar/Lidar can provide 360-degree view sampling at minimal cost, making them promising sensing hardware solutions for autonomous driving systems. However, due to intrinsic physical constraints, the rotating speed (i.e., the frequency at which data frames are generated) of surround Radar is much lower than that of surround Lidar, and existing Radar/Lidar fusion methods have to work at the low frequency of surround Radar, which cannot meet the high responsiveness requirements of autonomous driving systems. This paper develops techniques to fuse surround Radar/Lidar at a working frequency limited only by the faster surround Lidar instead of the slower surround Radar, based on the state-of-the-art Radar/Lidar DNN model MVDNet. The basic idea of our approach is simple: we let MVDNet work with temporally unaligned data from Radar/Lidar, so that fusion can take place whenever a new Lidar data frame arrives, instead of waiting for the slow Radar data frame. However, directly applying MVDNet to temporally unaligned Radar/Lidar data greatly degrades its object detection accuracy. The key insight of this paper is that we can achieve high output frequency with little accuracy loss by enhancing the training procedure to exploit the temporal redundancy in the fusion procedure of MVDNet, so that it can tolerate the temporal unalignment of the input data. We explore several different ways of training enhancement and compare them quantitatively with experiments.
00:39 CET A LIGHTWEIGHT AND ADAPTIVE CACHE ALLOCATION SCHEME FOR CONTENT DELIVERY NETWORKS
Authors:
Ke Liu1, Hua Wang2, Ke Zhou1 and Cong Li3
1Wuhan National Laboratory for Optoelectronics (WNLO) of Huazhong University of Science and Technology (HUST), CN; 2Huazhong University of Science and Technology, CN; 3Tencent, CN
Abstract
Content delivery network (CDN) caching systems use multi-tenant shared caching due to its operational simplicity. However, this approach often results in significant space waste and requires timely space allocation. On the one hand, the accuracy and reliability of existing static allocation schemes are not high. On the other hand, due to the large number of tenants in CDNs, dynamic allocation schemes that allocate cache space based on miss ratio curves (MRCs) cause high computational overheads and performance fluctuations. As a result, none of these existing solutions can be used directly in CDN caching systems. In this paper, we propose a lightweight and adaptive cache allocation scheme for CDNs (LACA). Rather than computing near-optimal configurations for each tenant, LACA detects in real time whether any tenants are using cache space inefficiently, and then constructs local MRCs for those tenants. Finally, the space to be adjusted is calculated from the local MRCs. We have deployed LACA in Company-T's CDN system, where it reduces the miss ratio by 27.1% and the average user access latency by 28.5 ms. In terms of the accuracy of constructing local MRCs, LACA is compared with several more advanced current schemes. Experimental results demonstrate that LACA constructs higher-accuracy local MRCs with little overhead. In addition, LACA can adjust the space as frequently as once a minute.
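For background on the MRC machinery the abstract refers to, the sketch below computes a miss ratio curve from a request trace using LRU stack distances; LACA's contribution (building only local MRCs for tenants flagged as inefficient) is not modeled here, and the trace is made up.

def miss_ratio_curve(trace, cache_sizes):
    stack = []                                 # most recently used item at the end
    distances = []
    for key in trace:
        if key in stack:
            distances.append(len(stack) - stack.index(key))   # LRU stack distance
            stack.remove(key)
        else:
            distances.append(float("inf"))                    # cold miss
        stack.append(key)
    return {c: sum(d > c for d in distances) / len(distances) for c in cache_sizes}

trace = ["a", "b", "c", "a", "b", "d", "a"]
print(miss_ratio_curve(trace, cache_sizes=[1, 2, 3, 4]))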
00:39 CET TBERT: DYNAMIC BERT INFERENCE WITH TOP-K BASED PREDICTORS
Authors:
Zejian Liu1, Kun Zhao2 and Jian Cheng2
1National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, CN; 2Institute of Automation, CN
Abstract
Dynamic inference is a compression method that adaptively prunes unimportant components according to the input at the inference stage, which can achieve a better trade-off between computational complexity and model accuracy than static compression methods. However, there are two limitations in previous works. First, they usually need to search for the threshold on the evaluation dataset to achieve the target compression ratio, and this search process is non-trivial. Second, these methods are unstable: their performance is significantly degraded on some datasets, especially when the compression ratio is high. In this paper, we propose TBERT, a simple yet stable dynamic inference method. TBERT utilizes a top-k-based pruning strategy, which allows accurate control of the compression ratio. To enable stable end-to-end training of the model, we carefully design the structure of the predictor. Moreover, we propose adding auxiliary classifiers to help the model's training. Experimental results on the GLUE benchmark demonstrate that our method achieves higher performance than previous state-of-the-art methods.
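A minimal PyTorch sketch of the top-k idea: a lightweight predictor scores the tokens of a layer and exactly k of them are kept, which is what gives direct control over the compression ratio. The scoring head and keep ratio below are illustrative stand-ins for TBERT's predictor design.

import torch
import torch.nn as nn

class TopKTokenPruner(nn.Module):
    def __init__(self, hidden, keep_ratio=0.5):
        super().__init__()
        self.scorer = nn.Linear(hidden, 1)     # lightweight token predictor
        self.keep_ratio = keep_ratio

    def forward(self, tokens):                 # tokens: (batch, seq, hidden)
        scores = self.scorer(tokens).squeeze(-1)
        k = max(1, int(tokens.size(1) * self.keep_ratio))   # exact compression ratio
        idx = scores.topk(k, dim=1).indices
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return torch.gather(tokens, 1, idx)

x = torch.randn(2, 8, 16)
print(TopKTokenPruner(16)(x).shape)            # torch.Size([2, 4, 16])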
00:39 CET TOKEN ADAPTIVE VISION TRANSFORMER WITH EFFICIENT DEPLOYMENT FOR FINE-GRAINED IMAGE RECOGNITION
Authors:
Chonghan Lee1, Rita Brufau2, Ke Ding2 and Vijaykrishnan Narayanan1
1Penn State University, US; 2Intel, US
Abstract
Fine-grained Visual Classification (FGVC) aims to distinguish object classes belonging to the same category, e.g., different bird species or models of vehicles. The task is more challenging than ordinary image classification due to the subtle inter-class differences. Recent works proposed deep learning models based on the vision transformer (ViT) architecture with its self-attention mechanism to locate important regions of the objects and derive global information. However, deploying them on resource-restricted devices is challenging due to their intensive computational cost and memory footprint. To improve inference efficiency, previous approaches require manually designing the model architecture and training a separate model for each computational budget. In this work, we propose Token Adaptive Vision Transformer (TAVT) that dynamically drops out tokens and can be used for various inference scenarios across many devices after training the model once. Our adaptive model can switch among different token drop configurations at run time, providing instant accuracy-efficiency trade-offs. We train a vision transformer with a progressive token pruning scheme, eliminating a large number of redundant tokens in the later layers. We then conduct a multi-objective evolutionary search with the overall number of floating point operations (FLOPs) as its efficiency constraint to find the token pruning schemes that maximize accuracy and efficiency under various computational budgets. Empirical results show that our proposed TAVT dramatically speeds up the inference latency by up to 10x and reduces memory requirements and FLOPs by up to 5.5x and 13x respectively while achieving competitive accuracy compared to prior ViT-based state-of-the-art approaches.
00:39 CET END-TO-END OPTIMIZATION OF HIGH-DENSITY E-SKIN DESIGN: FROM SPIKING TAXEL READOUT TO TEXTURE CLASSIFICATION
Authors:
Jiaqi Wang, Mark Daniel Alea, Jonah Van Assche and Georges Gielen, KU Leuven, BE
Abstract
Spiking readout architectures are a promising low-power solution for high-density e-skins. This paper proposes the end-to-end model-based optimization of a high-density neuromorphic e-skin solution, from the taxel readout to the texture classification. Architectural explorations include the spike coding and preprocessing, and the neural network used for classification. Simple rate-coding preprocessing of the spiking outputs from a modeled low-resolution on-chip spike encoder is demonstrated to achieve a comparable texture classification accuracy of 90% at lower power consumption compared to the state of the art. The modeling has also been extended from single-channel sensor recording to time-shifted multi-taxel readout. Applying this optimization to an actual tactile sensor array, the classification accuracy is boosted by 63% for a low-cost FFNN using multi-taxel data. The proposed Spike-based SNR (SSNR) and Spike Time Error (STE) metrics for the taxel readout circuitry are shown to be good predictors of the accuracy.
00:39 CET TOWARDS DEEP LEARNING-BASED OCCUPANCY DETECTION VIA WIFI SENSING IN UNCONSTRAINED ENVIRONMENTS
Authors:
Cristian Turetta1, Geri Skenderi1, Luigi Capogrosso1, Florenc Demrozi2, Philipp H. Kindt3, Alejandro Masrur4, Franco Fummi1, Marco Cristani1 and Graziano Pravadelli1
1University of Verona, IT; 2Department of Electrical Engineering and Computer Science, University of Stavanger, NO; 3Lehrstuhl für Realzeit-Computersysteme (RCS), TU München (TUM), DE; 4TU Chemnitz, DE
Abstract
In the context of smart buildings and smart cities, the design of low-cost and privacy-aware solutions for recognizing the presence of humans and their activities is becoming of great interest. Existing solutions exploiting wearables and video-based systems have several drawbacks, such as high cost, low usability, poor portability, and privacy-related issues. Consequently, more ubiquitous and accessible solutions, such as WiFi sensing, have become the focus of attention. However, at the current state of the art, WiFi sensing suffers from low accuracy and poor generalization, being primarily affected by environmental factors such as humidity and temperature variations and changes in furniture position. Such issues are only partially solved, at the cost of complex data preprocessing pipelines. In this paper, we present a highly accurate, resource-efficient occupancy detection solution based on deep learning, which is resilient to variations in humidity and temperature. The approach is tested on an extensive benchmark where people are free to move and the furniture layout changes. In addition, based on a consolidated explainable-AI algorithm, we quantify the importance of the WiFi signal w.r.t. humidity and temperature for the proposed approach. Notably, humidity and temperature can indeed be predicted from WiFi signals; this underlines the expressivity of the WiFi signal and, at the same time, the need for a non-linear model to properly deal with it.
00:39 CET CONTENT- AND LIGHTING-AWARE ADAPTIVE BRIGHTNESS SCALING FOR IMPROVED MOBILE USER EXPERIENCE
Authors:
Samuel Isuwa1, David Amos2, Amit Kumar Singh3, Bashir Al-Hashimi4 and Geoff Merrett1
1University of Southampton, GB; 2University of Maiduguri, NG; 3University of Essex, GB; 4,
Abstract
Display subsystems have become the predominant user interface on mobile devices, serving as both input and output interfaces. For a better user experience, the display subsystem is expected to provide proper resolution and brightness despite its impact on battery life. Existing brightness scaling techniques set the display brightness statically (by the user) or dynamically (by the system) in response to predefined events such as low battery or the ambient light of the environment, independently of the displayed content. Techniques that consider the displayed content are either limited to video content or do not account for the user's expected battery life, thereby failing to maximise the user experience. This paper proposes CLABS: Content- and ambient Lighting-aware Adaptive Brightness Scaling in mobile devices that maximises user experience while meeting battery life expectations. The approach employs a content- and ambient lighting-aware profiler that learns and classifies each sample into predefined clusters at runtime by leveraging insights on user perceptions of content and ambient luminance variations. We maximise user experience through adaptive scaling of the display's brightness using an energy prediction model that determines appropriate brightness levels while meeting the expected battery life. Evaluation of the proposed approach on a commercial smartphone shows that it improves Quality of Experience (QoE) by up to 24.5% compared to existing approaches.
00:39 CET TOWARDS SMART CATTLE FARMS: AUTOMATED INSPECTION OF CATTLE HEALTH WITH REAL-LIFE DATA
Authors:
Yigit Tuncel1, Toygun Basaklar1, Mackenzie Smithyman2, Vinicius Nunes de Gouvea3, Joao Dorea1, Younghyun Kim4 and Umit Ogras1
1University of Wisconsin - Madison, US; 2New Mexico State University, US; 3Texas A&M University, US; 4University of Wisconsin-Madison, US
Abstract
Cattle health problems, such as Bovine Respiratory Disease (BRD), constitute a significant source of economic loss for the agriculture industry. The current management practice for diagnosing and selecting cattle for treatment is a widespread clinical scoring system called DART (Depression, Appetite, Respiration, and Temperature). DART requires significant manual human labor since animals are evaluated individually. We propose a novel wearable accelerometer-based IoT system that predicts the DART scores to provide real-time animal health monitoring and hence reduce the labor and costs associated with manual animal inspection and intervention. The proposed system first processes the accelerometer data to construct features that encode the cattle's daily behavior. Then, it uses a lightweight decision-tree classifier to predict the DART score. We evaluate our approach on a dataset that consists of accelerometer data and veterinarian-approved DART scores for 54 animals. According to the results, the proposed system can classify healthy and sick animals with 78% accuracy. Furthermore, our approach outperforms 13 commonly used state-of-the-art time-series classifiers in terms of both accuracy and computational complexity. With 1 KB of SRAM usage and less than 29 µJ of energy consumption per day, it enables an easily deployable IoT solution for smart farms.
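As a rough end-to-end illustration (synthetic data, made-up feature names, not the paper's dataset or exact features), accelerometer windows can be reduced to a few behavioral features and fed to a small decision tree:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def daily_features(accel):                     # accel: (n_samples, 3) raw readings
    mag = np.linalg.norm(accel, axis=1)
    return [mag.mean(), mag.std(), (mag > mag.mean() + mag.std()).mean()]

rng = np.random.default_rng(0)
X = np.array([daily_features(rng.normal(size=(500, 3)) * s)
              for s in rng.uniform(0.5, 2.0, 54)])
y = (X[:, 1] > np.median(X[:, 1])).astype(int)  # stand-in for sick/healthy labels

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print("training accuracy:", clf.score(X, y))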
00:39 CET TIME SERIES-BASED DRIVING EVENT RECOGNITION FOR TWO WHEELERS
Authors:
Sai Usha Goparaju1, Lakshmanan L2, Abhinav Navnit2, Rahul Biju1, Lovish Bajaj3, Deepak Gangadharan2 and Dr. Aftab Hussain4
1International Institute of Information Technology, IN; 2International Institute of Information Technology, IN; 3Manipal Academy of Higher Education, IN; 4IIIT Hyderabad, IN
Abstract
Classification of a motorcycle's driving events can provide deep insights for detecting issues related to driver safety. Safety in two-wheelers is a less studied problem, and we attempt to address this gap by providing a learning-based solution to classify driving events. First, we developed a hardware system with 3-D accelerometer/gyroscope sensors that can be deployed on a motorcycle. The data obtained from these sensors is used to identify various driving events. We investigated several machine learning (ML) models to classify driving events. However, in this process, we identified that although the overall accuracy of these traditional ML models is decent, their class-wise accuracy is poor. Hence, we developed time-series-based classification algorithms using LSTM and Bi-LSTM to classify various driving events. We also embedded an attention mechanism in the architecture of these models for enhanced feature learning, thus improving the accuracy of event recognition. The experiments conducted demonstrate that the proposed models surpass the state-of-the-art models in the context of driving event recognition with reasonable class-wise accuracies. We also deployed these models on edge devices like the Raspberry Pi and ESP32 and successfully reproduced the prediction accuracies on the devices. The experiments demonstrate that the proposed Bi-LSTM model shows a minimum of 87% accuracy and a maximum of 99% accuracy in class-wise prediction on a two-wheeler driving dataset.

S_D1 System modeling, simulation, and validation

Date: Wednesday, 19 April 2023
Time: 11:00 - 12:30 CET

Time Label Presentation Title
Authors
00:39 CET SPATIO-TEMPORAL MODELING FOR FLASH MEMORY CHANNELS USING CONDITIONAL GENERATIVE NETS
Authors:
Simeng Zheng, Chih-Hui Ho, Wenyu Peng and Paul Siegel, University of California, San Diego, US
Abstract
Modeling spatio-temporal read voltages with complex distortions arising from the write and read mechanisms in flash memory devices is essential for the design of signal processing and coding algorithms. In this work, we propose a data-driven approach to modeling NAND flash memory read voltages in both space and time using conditional generative networks. This generative flash modeling (GFM) method reconstructs read voltages from an individual memory cell based on the program levels of the cell and its surrounding cells, as well as the time stamp. We evaluate the model over a range of time stamps using the cell read voltage distributions, the cell level error rates, and the relative frequency of errors for patterns most susceptible to inter-cell interference (ICI) effects. Experimental results demonstrate that the model accurately captures the complex spatial and temporal features of the flash memory channel.
00:39 CET EFFICIENT APPROXIMATION OF PERFORMANCE SPACES FOR ANALOG CIRCUITS VIA MULTI-OBJECTIVE OPTIMIZATION
Authors:
Benedikt Ohse1, David Schreiber2, Juergen Kampe1 and Christopher Schneider1
1Ernst-Abbe-Hochschule Jena, DE; 2University of Applied Sciences Jena, DE
Abstract
This paper presents an adaptation of the well-known normal boundary intersection (NBI) method for approximating the complete feasible performance spaces of analog integrated circuits. These spaces provide accurate information about all feasible combinations of competing performance parameters in a circuit. While the NBI method was originally designed only for computing the so-called Pareto front of a multi-objective optimization problem, it can be adapted to approximate the complete performance space with some modifications. A scalarization into single-objective optimization problems is performed within our developed tool, which can be connected to any Spice-based simulator. Besides presenting the algorithm and its adaptations, the focus lies on investigating parallelization techniques and their effect on decreasing the computation time. Numerical experiments show the computed approximations of two- and three-dimensional performance spaces of several OTAs and compare the efficiency of different parallelization schemes.
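To make the scalarization step concrete, the toy sketch below sweeps a weighted-sum scalarization of a two-objective problem with scipy, a simpler scheme than the paper's NBI adaptation, and uses an analytic toy model instead of a Spice-based simulator:

import numpy as np
from scipy.optimize import minimize

def objectives(x):                  # e.g. (power, noise) versus one design variable
    return np.array([x[0] ** 2, (x[0] - 2.0) ** 2])

front = []
for w in np.linspace(0.05, 0.95, 10):          # sweep of scalarization weights
    res = minimize(lambda x: w * objectives(x)[0] + (1 - w) * objectives(x)[1],
                   x0=[1.0], bounds=[(-5.0, 5.0)])
    front.append(tuple(objectives(res.x)))
print(sorted(front))                # sampled trade-off between the two objectives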
00:39 CET MULTIDIMENSIONAL FEATURES HELPING PREDICT FAILURES IN PRODUCTION SSD-BASED CONSUMER STORAGE SYSTEMS
Authors:
Xinyan Zhang1, Zhipeng Tan1, Dan Feng1, Qiang He1, Ju Wan1, Hao Jiang2, Ji Zhang2, Lihua Yang1 and Wenjie Qi3
1Wuhan National Laboratory for Optoelectronics, Huazhong University of Science & Technology, CN; 2Huawei Technologies, CN; 3Wuhan National Laboratory for Optoelectronics, Huazhong University of Science & Technology, CN
Abstract
As SSD failures seriously lead to data loss and service interruption, proactive failure prediction is often used to improve system availability. However, the unidimensional SMART-based prediction models hardly predict all drive failures. Some other features applied in data centers and enterprise storage systems are not readily available in consumer storage systems (CSS). To further analyze related failures in production SSD-based CSS, we study nearly 2.3 million SSDs from 12 drive models based on a dataset of SMART logs, trouble tickets, and error logs. We discover that SMART, FirmwareVersion, WindowsEvent, and BlueScreenofDeath (SFWB) are closely related to SSD failures. We further propose a multidimensional-based failure prediction approach (MFPA), which is portable in algorithms, SSD vendors, and PC manufacturers. Experiments on the datasets show that MFPA achieves a high true positive rate (98.18%) and low false positive rate (0.56%), which is 4% higher and 86% lower than the SMART-based model. It is robust and can continuously predict for 2-3 months without iteration, substantially improving the system availability.
00:39 CET PAR-GEM5: PARALLELIZING GEM5'S ATOMIC MODE
Authors:
Niko Zurstraßen1, Jose Cubero-Cascante2, Jan Moritz Joseph2, Rainer Leupers2, Xie Xinghua3 and Li Yichao3
1RWTH Aachen Institute for Communication Technologies and Embedded Systems, DE; 2RWTH Aachen University, DE; 3Huawei Technologies, CN
Abstract
While the complexity of MPSoCs continues to grow exponentially, their often sequential simulations have benefited only from linear performance gains since the end of Dennard scaling. As a result, each new generation of MPSoCs requires ever longer simulation times. In this paper, we propose a solution to this problem: par-gem5 - the first universally parallelized version of the Full-System Simulator (FSS) gem5. It exploits the host system's multi-threading capabilities using a modified conservative, quantum-based Parallel Discrete Event Simulation. Compared to other parallel approaches, par-gem5 uses relaxed causality constraints, allowing temporal errors to occur. Yet, we show that the system's functionality is retained, and the inaccuracy of simulation statistics, such as simulation time or cache miss rate, can be kept within a single-digit percentage. Furthermore, we extend par-gem5 with a temporal error estimation that assesses the accuracy of a simulation without a sequential reference simulation. Our experiments reached speedups of 24.7x when simulating a 128-core ARM-based MPSoC on a 128-core host system.
00:39 CET FAST BEHAVIOURAL RTL SIMULATION OF 10B TRANSISTOR SOC DESIGNS WITH METRO-MPI
Authors:
Guillem López-Paradís1, Brian Li2, Adrià Armejach3, Stefan Wallentowitz4, Miquel Moreto5 and Jonathan Balkind6
1Barcelona Supercomputing Center, ES; 2University of California, Santa Barbara, US; 3BSC & UPC, ES; 4Munich University of Applied Sciences, DE; 5BSC, ES; 6UC Santa Barbara, US
Abstract
Chips with tens of billions of transistors have become today's norm. These designs are straining our electronic design automation tools throughout the design process, requiring ever more computational resources. In many tools, parallelisation has improved both latency and throughput for the designer's benefit. However, tools largely remain restricted to a single machine and in the case of RTL simulation, we believe that this leaves much potential performance on the table. We introduce Metro-MPI to improve RTL simulation for modern 10 billion transistor-scale chips. Metro-MPI exploits the natural boundaries present in chip designs to partition RTL simulations and leverage High Performance Computing (HPC) techniques to extract parallelism. For chip designs that scale in size by exploiting latency-insensitive interfaces like networks-on-chip and AXI, Metro-MPI offers a new paradigm for RTL simulation scalability. Our implementation of Metro-MPI in OpenPiton+Ariane delivers 2.7 MIPS of RTL simulation throughput for the first time on a design with more than 10 billion transistors and 1,024 Linux-capable cores, opening new avenues for distributed RTL simulation of emerging system-on-chip designs. Compared to sequential and multi-threaded RTL simulations of smaller designs, Metro-MPI achieves up to 135.9× and 9.29× speedups. Similarly, for a representative regression run, Metro-MPI reduces energy consumption by up to 2.53× and 2.91×.
00:39 CET DYNAMIC REFINEMENT OF HARDWARE ASSERTION CHECKERS
Authors:
Hasini Witharana, Sahan Sanjaya and Prabhat Mishra, University of Florida, US
Abstract
Post-silicon validation is a vital step in System-on-Chip (SoC) design cycle. A major challenge in post-silicon validation is the limited observability of internal signal states using trace buffers. Hardware assertions are promising to improve the observability during post-silicon debug. Unfortunately, we cannot synthesize thousands (or millions) of pre-silicon assertions as hardware checkers (coverage monitors) due to hardware overhead constraints. Prior efforts considered synthesis of a small set of checkers based on design constraints. However, these design constraints can change dynamically during the device lifetime due to changes in use-case scenarios as well as input variations. In this paper, we explore dynamic refinement of hardware checkers based on changing design constraints. Specifically, we propose a cost-based assertion selection framework that utilizes non-linear optimization as well as machine learning. Experimental results demonstrate that our machine learning model can accurately predict the area (less than 5% error) and power consumption (less than 3% error) of hardware checkers at runtime. This accurate prediction enables close-to-optimal dynamic refinement of checkers based on design constraints.
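A possible shape for the runtime selection step, sketched with a greedy heuristic under predicted area/power budgets (the paper uses non-linear optimization and ML-predicted costs; the names and numbers here are illustrative):

def select_checkers(checkers, area_budget, power_budget):
    # checkers: dicts with predicted 'area', 'power' and an 'importance' score
    chosen, area, power = [], 0.0, 0.0
    ranked = sorted(checkers,
                    key=lambda c: c["importance"] / (c["area"] + c["power"]),
                    reverse=True)
    for c in ranked:
        if area + c["area"] <= area_budget and power + c["power"] <= power_budget:
            chosen.append(c["name"])
            area += c["area"]
            power += c["power"]
    return chosen

checkers = [{"name": "a1", "area": 3, "power": 2, "importance": 9},
            {"name": "a2", "area": 1, "power": 1, "importance": 5},
            {"name": "a3", "area": 4, "power": 3, "importance": 4}]
print(select_checkers(checkers, area_budget=5, power_budget=4))   # ['a2', 'a1']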
00:39 CET STSEARCH: STATE TRACING-BASED SEARCH HEURISTICS FOR RTL VALIDATION
Authors:
Ziyue Zheng1 and Yangdi Lyu2
1Hong Kong University of Science and Technology (GuangZhou), CN; 2Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
Branch coverage is important in the functional validation of Register-Transfer-Level (RTL) models. While random tests can cover the majority of easy-to-reach branches, there are still many hard-to-activate branches in today's industrial designs. These remaining corner branches are typically the source of bugs and hardware Trojans. Directed test generation approaches using formal methods effectively activate a specific branch but are limited by the state explosion problem. Semi-formal methods, such as concolic testing, improve the scalability by exploring one path at a time. This paper presents a novel concolic testing framework to exercise the corner branches through state tracing-based search heuristics (STSearch). The proposed approach heuristically generates and evaluates input sequences based on a novel heuristic indicator that evaluates the distance between the current state and the target branch condition. The heuristic indicator is designed to utilize both the static structural property of the design and the state from dynamic simulation. Compared to the existing concolic testing approaches, where a full new path is generated in each round by solving path constraints, the cycle-based heuristic search in the proposed approach is more effective and efficient. Experimental results show that our approach significantly outperforms the state-of-the-art approaches in both running time and memory usage.
00:39 CET SYSTEM-LEVEL SIMULATOR OF EFLASH-BASED COMPUTE-IN-MEMORY ACCELERATORS FOR CONVOLUTIONAL NEURAL NETWORKS
Authors:
Jooho Wang, Sunwoo Kim, Junsu Heo and Chester Park, Department of Electrical and Electronics Engineering, Konkuk University, Seoul, South Korea, KR
Abstract
A new system-level simulator is proposed to estimate the bit-accurate and cycle-accurate performance of eFlash compute-in-memory (CIM) accelerators for convolutional neural networks. The proposed simulator can predict the inference accuracy by considering the impact of circuit nonideality such as program disturbance. Moreover, the simulator can also evaluate the system-level performance of dataflow strategies that have a significant impact on hardware area and performance of eFlash CIM accelerators. The simulator helps to find the optimal dataflow strategy of an eFlash CIM accelerator for each convolutional layer. It is shown that the improvement of area efficiency amounts to 26.8%, 21.2% and 17.9% in the case of LeNet-5, VGG-9 and ResNet-18, respectively.
00:39 CET STRUCTURAL GENERATION OF VIRTUAL PROTOTYPES FOR SMART SENSOR DEVELOPMENT IN SYSTEMC-AMS FROM SIMULINK MODELS
Authors:
Alexandra Kuester1, Rainer Dorsch1 and Christian Haubelt2
1Bosch Sensortec GmbH, DE; 2University of Rostock, DE
Abstract
We present a flow to reuse system-level analog/mixed-signal (AMS) models developed in MATLAB/Simulink for the extension of virtual prototypes in SystemC. To prevent time-consuming co-simulation, our flow translates the Simulink model into an equivalent SystemC-AMS model. Translation is supported either by wrapping code generated by MATLAB's Embedded Coder or by instantiating previously generated models. Thus, a one-to-one mapping of the model's hierarchy is possible which allows deep insights into the architecture and good traceability. The conducted case study on an accelerometer model shows the applicability of our approach. The generated hierarchical model is half as fast as a monolithic version but allows better observability and traceability of the system. It is tens of times faster than simulation in Simulink, thus especially faster than co-simulation. The extended virtual prototype aims to support software engineers during development and validation of firmware in smart sensors.
00:39 CET A HARDWARE-SOFTWARE COOPERATIVE INTERVAL-REPLAYING FOR FPGA-BASED ARCHITECTURE EVALUATION
Authors:
Hongwei Cui, Shuhao Liang, Yujie Cui, Weiqi Zhang, Honglan Zhan, Chun Yang, Xianhua Liu and Xu Cheng, The School of Computer Science, Peking University, CN
Abstract
Compared with traditional software simulators, a hardware implementation can demonstrate the precise behavior of the microarchitecture and accurately verify the feasibility of a new microarchitecture design. FPGAs can provide more realistic and accurate results, but the resource limitations of FPGA boards still hinder researchers. First, the low operating frequency of FPGAs leads to long run times for large benchmarks. Second, the hardware resources of the FPGA (such as physical memory) cannot support the execution of some special benchmarks. In addition, researchers need to run complete benchmarks in a real operating system, which also brings great difficulties. This paper proposes a hardware-software cooperative solution for a general checkpoint scheme, which allows researchers to run a specified program interval correctly and independently in the hardware environment. It uses a processor simulator to collect runtime information of programs and create checkpoints for arbitrary intervals. Furthermore, it provides an extensible and portable checkpoint loader to read checkpoints and re-execute program intervals. In order to select the key program intervals for checkpoint creation, this paper extends the RISC-V ISA and proposes an event-based sampling design. This design enables researchers to find hot program intervals with more representative microarchitectural characteristics. Using checkpoints in hot regions, researchers can quickly verify the effectiveness of a microarchitecture design on FPGA and alleviate the speed bottleneck of FPGAs. In addition, this solution can support running intervals of some special programs that cannot be executed directly on the FPGA. The event-based sampling design is implemented in the BOOM processor, and the SPEC CPU2006 benchmarks are sampled. Checkpoints are generated for the benchmarks, and the correctness and effectiveness of the checkpoint scheme are evaluated on an FPGA board. The experimental results show that the scheme is effective.
00:39 CET FELOPI: A FRAMEWORK FOR SIMULATION AND EVALUATION OF POST-LAYOUT FILE AGAINST OPTICAL PROBING
Authors:
Sajjad Parvin1, Mehran Goli1, Frank Sill Torres2 and Rolf Drechsler3
1University of Bremen, DE; 2German Aerospace Center, DE; 3University of Bremen/DFKI, DE
Abstract
In recent years, it has been shown that laser-based side-channel analysis methods, specifically Optical Probing (OP), raise security concerns for mission-critical circuits. In practice, designing a circuit that is robust against OP can be time consuming, because the chip needs to be designed, fabricated, and tested to determine how robust it is against OP. To mitigate this problem, we propose a framework, FELOPi, which takes the layout file of a design as input and then performs OP on the design in simulation. An OP attack on a real chip to read out data requires hours to days; using our framework, the same evaluation can be performed in simulation in seconds.
00:39 CET QUO VADIS SIGNAL? AUTOMATED DIRECTIONALITY EXTRACTION FOR POST-PROGRAMMING VERIFICATION OF A TRANSISTOR-LEVEL PROGRAMMABLE FABRIC
Authors:
Apurva Jain, Thomas Broadfoot, Yiorgos Makris and Carl Sechen, University of Texas at Dallas, US
Abstract
We discuss the challenges related with developing a post-programming verification solution for a TRAnsistor-level Programmable fabric (TRAP). Toward achieving high density, the TRAP architecture employs bidirectionally-operated pass transistors in the implementation of its logic and interconnect networks. While it is possible to model such transistors through appropriate primitives of hardware description languages (HDL) to enable simulation-based validation, Logic Equivalence Checking (LEC) methods and tools do not support such primitives. As a result, formally verifying the functionality programmed by a given bit-stream on TRAP is not innately possible. To address this limitation, we introduce a method for automatically determining the signal flow direction through bidirectional pass transistors for a given bit-stream and subsequently converting the HDL describing the programmed fabric to consist only of unidirectional transistors. Thereby, commercial EDA tools can be used to check logic equivalence between the transistor-level HDL describing the programmed fabric and the post-synthesis gate-level netlist. The proposed method has been successfully applied to verify various benchmark circuits programmed on the TRAP fabric.

S_D8 Future memories

Date: Wednesday, 19 April 2023
Time: 11:00 - 12:30 CET

Time Label Presentation Title
Authors
00:39 CET OVERLAPIM: OVERLAP OPTIMIZATION FOR PROCESSING IN-MEMORY NEURAL NETWORK ACCELERATION
Authors:
Minxuan Zhou1, Xuan Wang2 and Tajana Rosing1
1UCSD, US; 2University of California San Diego, US
Abstract
Processing in-memory (PIM) can accelerate neural networks (NNs) thanks to its extensive parallelism and minimization of data movement. The performance of NN acceleration on PIM heavily depends on the software-to-hardware mapping, which indicates the order and distribution of operations across the hardware resources. Previous works optimize the mapping problem by exploring the design space of per-layer and cross-layer data layouts, achieving speedups over manually designed mappings. However, previous works do not consider computation overlapping across consecutive layers. By overlapping computation, we can process a layer before its preceding layer fully completes, decreasing the execution latency of the whole network. Mapping optimization without overlap analysis can result in sub-optimal performance. In this work, we propose OverlaPIM, a new framework that integrates overlap analysis with DNN mapping optimization on PIM architectures. OverlaPIM adopts several techniques to enable efficient overlap analysis and optimization for whole-network mapping on PIM architectures. We test OverlaPIM on popular DNNs and compare the results to non-overlapping optimization. Our experiments show that OverlaPIM can efficiently produce mappings that are 2.10× to 4.11× faster than the state-of-the-art mapping optimization framework.
00:39 CET TAM: A COMPUTING IN MEMORY BASED ON TANDEM ARRAY WITHIN STT-MRAM FOR ENERGY-EFFICIENT ANALOG MAC OPERATION
Authors:
Jinkai Wang, Zhengkun Gu, Hongyu Wang, Zuolei Hao, Bojun Zhang, Weisheng Zhao and Yue Zhang, Beihang University, CN
Abstract
Computing in memory (CIM) has been demonstrated to be promising for energy-efficient computing. However, the dramatic growth of data scale in neural network processors has created a demand for CIM architectures of higher bit density, for which spin transfer torque magnetic RAM (STT-MRAM), with its high bit density and performance, arises as an up-and-coming candidate. In this work, we propose an analog CIM scheme based on a tandem array within STT-MRAM (TAM) to further improve energy efficiency while achieving high bit density. First, the resistance-summation-based analog MAC operation minimizes the effect of the low tunnel magnetoresistance (TMR) through the serial magnetic tunnel junction (MTJ) structure in the proposed tandem array, with smaller area overhead. Moreover, a resistive-to-binary read scheme is designed to obtain the MAC results accurately and reliably. In addition, the data-dependent error caused by MTJs in series is eliminated with a proposed dynamic selection circuit. Simulation results of a 2Kb TAM architecture show 113.2 TOPS/W and 63.7 TOPS/W for 4-bit and 8-bit input/weight precision, respectively, and a 39.3% reduction in bit-cell area compared with existing arrays of MTJs in series.
00:39 CET OUT-OF-CHANNEL DATA PLACEMENT FOR BALANCING WEAR-OUT AND I/O WORKLOADS IN RAID-ENABLED SSDS
Authors:
Yang Fan, Xiao Qi, Li Jun, Sha Bing, Cai Gang and Liao Wei, Southwest University, CN
Abstract
SSDs with channel-level RAID implementations can tolerate channel failures inside the SSD, but they greatly suffer from imbalanced wear-out (i.e., erase) and I/O workloads across the SSD channels, due to the nature of in-channel updates of the data/parity chunks of data stripes. This paper proposes exchanging the channel locations of data/parity chunks belonging to the same stripe when serving update (write) requests, termed out-of-channel data placement. Consequently, it can smooth wear-out and I/O workloads across SSD channels, thus reducing I/O response time. Through a series of emulation experiments on several realistic disk traces, we show that our proposal can greatly improve I/O performance, as well as noticeably balance the wear-out and I/O workloads, in contrast to related methods.
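A toy model of the placement policy: when a chunk of a stripe is updated, it is written to the least-worn channel rather than back to its original channel, and a stripe map records the new location. The class and field names are hypothetical.

class OutOfChannelPlacer:
    def __init__(self, num_channels):
        self.erase_count = [0] * num_channels
        self.stripe_map = {}                   # (stripe, chunk) -> channel

    def update_chunk(self, stripe, chunk):
        target = min(range(len(self.erase_count)), key=self.erase_count.__getitem__)
        self.erase_count[target] += 1          # the update wears the target channel
        self.stripe_map[(stripe, chunk)] = target
        return target

p = OutOfChannelPlacer(num_channels=4)
for i in range(8):
    p.update_chunk(stripe=0, chunk=i)
print(p.erase_count)                           # wear spread evenly: [2, 2, 2, 2]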
00:39 CET AGDM:AN ADAPTIVE GRANULARITY DATA MIGRATION STRATEGY FOR HYBRID MEMORY SYSTEMS
Authors:
Zhouxuan Peng, Dan Feng, Jianxi Chen, Jing Hu and Chuang Huang, Wuhan National Laboratory for Optoelectronics, Key Laboratory of Information Storage System, MOE, Huazhong University of Science and Technology, Hubei, China, CN
Abstract
Hybrid memory systems show strong potential to satisfy the growing memory demands of modern applications by combining different memory technologies. Due to the different performance characteristics of hybrid memories, a data migration strategy that migrates hot data to the faster memory is critical to overall performance. Prior works have focused on identifying hot data and making migration decisions. However, we find that the fixed-size global migration granularity in existing data migration schemes results in suboptimal performance on most workloads. The key observation is that the optimal migration granularity varies with access patterns. This paper proposes AGDM, an access-pattern-aware Adaptive Granularity Data Migration strategy for hybrid memory systems. AGDM tracks memory access patterns at runtime and accordingly adopts the most appropriate migration mode and granularity. The novel remapping-migration decoupled metadata organization enables AGDM to set locally optimal granularities for memory regions with different access patterns. Our evaluation shows that, compared to the state-of-the-art scheme, AGDM achieves an average performance improvement of 20.06% with 29.98% energy savings.
00:39 CET P-PIM: A PARALLEL PROCESSING-IN-DRAM FRAMEWORK ENABLING ROWHAMMER PROTECTION
Authors:
Ranyang Zhou1, Sepehr Tabrizchi2, Mehrdad Morsali1, Arman Roohi3 and Shaahin Angizi1
1New Jersey Institute of Technology, US; 2University of Nebraska–Lincoln (UNL), US; 3University of Nebraska - Lincoln, US
Abstract
In this work, we propose a Parallel Processing-In-DRAM architecture named P-PIM that leverages the high density of DRAM to enable fast and flexible computation. P-PIM enables bulk bit-wise in-DRAM logic between operands in the same bit-line by elevating the analog operation of the memory sub-array based on a novel dual-row activation mechanism. With this, P-PIM can opportunistically perform a complete and inexpensive in-DRAM RowHammer (RH) self-tracking and mitigation technique to protect the memory unit against this challenging security vulnerability. Our results show that P-PIM achieves ~72% higher energy efficiency than the fastest charge-sharing-based designs. As for RH protection, with a worst-case slowdown of ~0.8%, P-PIM achieves up to 71% energy savings over SRAM/CAM-based frameworks and about 90% savings over DRAM-based frameworks.
00:39 CET PRIVE: EFFICIENT RRAM PROGRAMMING WITH CHIP VERIFICATION FOR RRAM-BASED IN-MEMORY COMPUTING ACCELERATION
Authors:
Wangxin He1, Jian Meng1, Sujan Gonugondla2, Shimeng Yu3, Naresh Shanbhag4 and Jae-sun Seo1
1Arizona State University, US; 2Amazon, US; 3Georgia Institute of Technology, US; 4University of Illinois Urbana-Champaign, US
Abstract
As deep neural networks (DNNs) have been successfully developed for many applications with continuously increasing complexity, the number of weights in DNNs surges, leading to consistent demands for denser memories than SRAMs. RRAM-based in-memory computing (IMC) achieves high density and energy efficiency for DNN inference, but RRAM programming remains a bottleneck due to high write latency and energy consumption. In this work, we present the Progressive-wRite In-memory program-VErify (PRIVE) scheme, which we verify with an RRAM test chip for IMC-based hardware acceleration of DNNs. We optimize the progressive write operations on different bit positions of the RRAM weights to enable error compensation and reduce programming latency/energy, while achieving high DNN accuracy. For 5-bit precision DNNs, PRIVE reduces the RRAM programming energy by 1.82X, while maintaining high accuracy of 91.91% (VGG-7) and 71.47% (ResNet-18) on the CIFAR-10 and CIFAR-100 datasets, respectively.
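Conceptually, program-verify works as in the sketch below: apply short write pulses and re-read until the cell conductance lands inside the target window or the pulse budget runs out. The crude device model, step size and tolerance are assumptions, not PRIVE's bit-position-aware scheme.

import random

def program_verify(target, tol=0.05, max_pulses=30, seed=1):
    rng = random.Random(seed)
    g = 0.0                                    # normalized cell conductance
    for pulse in range(1, max_pulses + 1):
        g += 0.2 * (target - g) + rng.gauss(0, 0.01)   # noisy partial write step
        if abs(g - target) <= tol:             # verify read
            return g, pulse
    return g, max_pulses

g, pulses = program_verify(target=0.7)
print(f"reached conductance {g:.3f} after {pulses} pulses")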
00:39 CET END-TO-END DNN INFERENCE ON A MASSIVELY PARALLEL IN-MEMORY COMPUTING ARCHITECTURE
Authors:
Nazareno Bruschi1, Giuseppe Tagliavini1, Angelo Garofalo1, Francesco Conti1, Irem Boybat2, Luca Benini3 and Davide Rossi4
1University of Bologna, IT; 2IBM Research Europe - Zurich, CH; 3Università di Bologna and ETH Zurich, IT; 4University Of Bologna, IT
Abstract
The demand for computation resources and energy efficiency of Convolutional Neural Network (CNN) applications requires a new paradigm to overcome the "Memory Wall". Analog In-Memory Computing (AIMC) is a promising paradigm since it performs matrix-vector multiplications, the critical kernel of many ML applications, in place in the analog domain within memory arrays structured as crossbars of memory cells. However, several factors limit the full exploitation of this technology, including the physical fabrication of the crossbar devices, which constrains the memory capacity of a single array. Multi-AIMC architectures have been proposed to overcome this limitation, but they have been demonstrated only for tiny, custom CNNs or with some layers executed off-chip. In this work, we present the full inference of an end-to-end ResNet-18 DNN on a 512-cluster heterogeneous architecture coupling a mix of AIMC cores and digital RISC-V cores, achieving up to 20.2 TOPS. Moreover, we analyze the mapping of the network onto the available non-volatile cells, compare it with state-of-the-art models, and derive guidelines for next-generation many-core architectures based on AIMC devices.
00:39 CET UHS: AN ULTRA-FAST HYBRID STORAGE CONSOLIDATING NVM AND SSD IN PARALLEL
Authors:
Qingsong Zhu, Qiang Cao and Jie Yao, Huazhong University of Science and Technology, CN
Abstract
Non-Volatile Memory (NVM), with persistence and near-DRAM performance, has been commonly used as first-level fast storage atop Solid-State Drives (SSDs) and Hard Disk Drives (HDDs), constituting a classic hierarchical architecture with high cost-performance. However, NVM/SSD tiered storage overuses the primary NVM, whose actual performance is limited, and under-utilizes the secondary SSD, whose bandwidth keeps increasing. Besides, NVM and SSD exhibit distinct yet complementary I/O characteristics. This motivates us to design a superior hybrid storage that fully exploits NVM and SSD simultaneously. In this paper, we propose UHS, an Ultra-fast Hybrid Storage consolidating NVM and SSD to reap their respective merits with several key techniques. First, UHS builds a uniform yet heterogeneous block-level storage view for upper-layer applications, e.g., file systems or key-value stores. UHS provides static address mapping to explicitly partition the global block space into coarse-grain NVM zones and SSD zones, which mainly serve metadata and file data, respectively. Second, UHS proposes a fine-grain request-level NVM buffer to dynamically absorb small file writes at runtime and then migrates them to the SSD in the background. Third, UHS designs I/O-affinity write allocation and hash-based buffer indexing to trade off the write gain and read cost of the NVM buffer. Finally, UHS designs a multi-threaded I/O model to take full advantage of the parallelism in both NVM and SSD. We implement UHS and evaluate it under a variety of workloads. The experiments show that UHS outperforms SSD, NVM, Bcache-writeback (representative hierarchical storage), and Device-Mapper (state-of-the-art hybrid storage) by up to 8X, 1.5X, 3.5X, and 6X, respectively.
00:39 CET BRANCH PREDICTOR DESIGN FOR AMBIENT ENERGY HARVESTING NONVOLATILE PROCESSORS
Authors:
Mengying Zhao, Shuo Xu, Hao Zhang, Huichuan Zheng and Xiaojun Cai, Shandong University, CN
Abstract
Non-volatile processors have been proposed for ambient energy harvesting systems to enable accumulative computing across power failures. They employ non-volatile memory to back up the processor state before a power outage and resume the system after power recovers. A straightforward backup policy is to back up all volatile data in the processor, but it induces a high backup cost. In this paper, we focus on the branch predictor, an important component of the processor, and propose efficient backup schemes that reduce the backup cost while maintaining its prediction ability. We first analyze the modules in the branch predictor, and accordingly propose two backup mechanisms, saturation-guided and locality-guided backup. Evaluation shows that, compared with the no-backup and all-backup strategies, the proposed design achieves 12.8% and 52.5% energy reduction, respectively.
00:39 CET OPTIMIZING DATA MIGRATION FOR GARBAGE COLLECTION IN ZNS SSDS
Authors:
Zhenhua Tan1, Linbo Long2, Renping Liu3, Congming Gao4, Yi Jiang3 and Yan Liu3
1College of Computer Science and Technology of Chongqing University of Posts and Telecommunications, CN; 2College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, CN; 3Chongqing University of Posts and Telecommunications, CN; 4Xiamen University, CN
Abstract
The NVMe Zoned Namespace (ZNS) is proposed as a high-performance interface for solid-state drives (SSDs), where the logical address space is divided into fixed-size and sequential-write (flash-friendly) zones. ZNS SSDs shift the responsibility of garbage collection (GC) to the host. However, data migration in GC needs to move data to the host's buffer first and write back to the new location, resulting in an unnecessary end-to-end transfer overhead during GC. Moreover, due to the pre-configured mapping between zones and blocks, GC needs to perform a large number of unnecessary block-to-block data migrations between zones. To address these issues, this paper proposes a dynamic zone mapping ZNS SSD design, termed Brick-ZNS, with in-storage data migration and address remapping. First, a new ZNS command, Zone_MD, is designed to realize in-storage data migration to avoid the end-to-end transfer overhead of GC. Second, a remapping strategy based on parallel physical blocks is proposed to reduce the amount of block-to-block data migrations between zones while ensuring zone parallelism. Based on a full-stack SSD emulator, our evaluation shows that Brick-ZNS reduces GC latency by 6.78× and improves SSD lifetime by 1.17× on average.
00:39 CET ENASA: TOWARDS EDGE NEURAL ARCHITECTURE SEARCH BASED ON CIM ACCELERATION
Authors:
Shixin Zhao1, Songyun Qu2, Ying Wang3 and Yinhe Han4
1University of Chinese Academy of Sciences and Institute of Computing Technology, Chinese Academy of Sciences, CN; 2University of Chinese Academy of Sciences and Institute of Computing Technology, Chinese Academy of Sciences, CN; 3State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, CN; 4Institute of Computing Technology, Chinese Academy of Sciences, CN
Abstract
One-shot Neural Architecture Search (NAS) has achieved remarkable improvements over human-designed neural network architectures in accuracy and in hardware-oriented metrics such as power and latency. However, one-shot NAS requires hundreds of GPU hours to complete the search process, and this tedious search time hinders its extensive application to use cases that need to quickly adapt or customize the network architecture for the deployment scenario. In this work, we combine Computing-in-Memory (CIM) and NAS technology and propose a ReRAM-based accelerator, ENASA, so that NAS can be applied on various edge devices to customize the most suitable individual network architecture. The one-shot NAS process must repetitively evaluate sampled sub-networks within a large-scale supernet before converging to the best sub-network architecture; hence, how this iterative network inference task is mapped onto the CIM arrays makes a big difference in performance and power overhead. To realize efficient in-memory supernet sampling and evaluation, we design a novel mapping method that tactically executes a group of subnets in the CIM arrays, not only to boost subnet concurrency but also to eliminate the repetitive operations shared by these subnets. Meanwhile, to further enhance subnet-level operation concurrency and sharing in the CIM arrays, we propose a novel CIM-friendly one-shot NAS algorithm that purposely samples operation-sharing subnets in each iteration while still maintaining the convergence of NAS. According to the experimental results, our CIM NAS accelerator achieves improvements of 196.6× and 1200× in performance speedup and energy saving, respectively, compared to the CPU+GPU baseline.

S_T1 Design and Test of Mixed-Signal Circuits and Memories

Date: Wednesday, 19 April 2023
Time: 11:00 - 12:30 CET

Time Label Presentation Title
Authors
00:39 CET POST-SILICON OPTIMIZATION OF A HIGHLY PROGRAMMABLE 64-MHZ PLL ACHIEVING 2.7-5.7µW
Authors:
Marco Gonzalez and David Bol, UCLouvain, BE
Abstract
Hierarchical optimization methods used in the design of complex mixed-signal systems require accurate behavioral models to avoid the long simulation times of transistor-level SPICE simulations of the whole system. However, robust behavioral models that accurately capture circuit non-idealities and their complex interactions must be very complex themselves and are hardly achievable. Post-silicon tuning, which is already widely used for the calibration of analog building blocks, is an interesting alternative to speed up the optimization of these complex systems. However, post-silicon tuning usually focuses on single-objective problems in blocks with a limited number of degrees of freedom. In this paper, we propose a post-silicon "hardware-in-the-loop" optimization method to solve multi-objective problems in mixed-signal systems with numerous degrees of freedom. We use this method to optimize the noise-power trade-off of a 64-MHz phase-locked loop (PLL) based on a back-bias-controlled ring oscillator. A genetic algorithm was run based on measurements of the 22-nm fully-depleted silicon-on-insulator prototype to find the Pareto-optimal configurations in terms of power and long-term jitter. The obtained Pareto front gives a range of power consumption between 2.7 and 5.7 μW, corresponding to an RMS long-term jitter between 88 and 45 ns. Whereas a simulation-based optimization using the genetic algorithm with SPICE simulations would require more than a year, the post-silicon optimization was completed in only 17 h.
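The skeleton below shows one way such a hardware-in-the-loop multi-objective search can be organized: candidate configurations are "measured" (here by a stand-in function rather than the real bench and chip) and only non-dominated ones are kept as the Pareto set. The configuration encoding, mutation scheme and fake measurement model are all assumptions, not the paper's setup.

import random

def measure(cfg):                              # placeholder for a bench measurement
    power = 2.7 + 3.0 * cfg["bias"]            # fake power (uW)
    jitter = 90.0 - 45.0 * cfg["bias"] + 5.0 * random.random()   # fake jitter (ns)
    return power, jitter

def dominated(a, b):                           # does b dominate a? (minimize both)
    return all(y <= x for x, y in zip(a, b)) and any(y < x for x, y in zip(a, b))

random.seed(0)
population = [{"bias": random.random()} for _ in range(16)]
for _ in range(10):                            # mutate, re-measure, keep the front
    population += [{"bias": min(1.0, max(0.0, c["bias"] + random.gauss(0, 0.1)))}
                   for c in population]
    scored = [(measure(c), c) for c in population]
    population = [c for m, c in scored
                  if not any(dominated(m, m2) for m2, _ in scored if m2 != m)]
print(len(population), "non-dominated configurations kept")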
00:39 CET ANALOG COVERAGE-DRIVEN SELECTION OF SIMULATION CORNERS FOR AMS INTEGRATED CIRCUITS
Authors:
Sayandeep Sanyal1, Aritra Hazra2, Pallab Dasgupta1, Scott Morrison3, Sudhakar S4, Lakshmanan Balasubramanian5 and Moshiur Rahman6
1Indian Institute of Technology Kharagpur, IN; 2Dept of CSE, IIT Kharagpur, IN; 3Texas Instruments, US; 4Texas Instruments, IN; 5Texas Instruments (India) Pvt. Ltd., IN; 6,
Abstract
Integrated circuit designs are evaluated at various corners defined by choices of the design and process parameters. Considering the large number of corners and the simulation cost of covering all the corners of a large design, it is desirable to identify a subset of the corners that can potentially expose corner case bugs. In an integrated analog coverage management framework, this choice may be influenced by those corners that take one or more component analog IPs close to their individual specification boundaries. Since the admissible state space of an analog IP is multi-dimensional, the same corner may not reach the extreme behaviors for each attribute of the specification, and one needs to identify a subset that covers the extremity. This paper shows that the underlying problem is NP-hard and presents an automated methodology for selecting the corners. A formal analog coverage specification is leveraged by our algorithm, which uses a Satisfiability Modulo Theory (SMT) solver to identify the appropriate corners from the output of multiple Monte Carlo (MC) simulations. The efficacy of the proposed approach is demonstrated over industrial test cases.
00:39 CET FAST PERFORMANCE EVALUATION METHODOLOGY FOR HIGH-SPEED MEMORY INTERFACES
Authors:
Taehoon Kim, Yoona Lee and Woo-Seok Choi, Seoul National University, KR
Abstract
The increase in the data rate of memory interfaces results in higher inter-symbol interference (ISI). To mitigate ISI, recent high-speed memory interfaces have started supporting unmatched Rx and utilizing equalization such as continuous-time linear equalizers and decision-feedback equalizers, which incurs a huge overhead for design verification with conventional methods. This paper proposes a fast and accurate verification methodology, based on the impulse sensitivity function (ISF), to evaluate the voltage and timing margins of the interface. The small-signal and large-signal responses are separated and calculated to improve accuracy, using data obtained from periodic AC (PAC) and periodic steady-state (PSS) analyses. With the proposed method, while maintaining high accuracy with a relative error below 3%, a significant reduction in verification time of 74% to 96% was achieved. In addition, an extension for multi-stage Rx is proposed. Since there is a trade-off between accuracy and efficiency, the method can be adapted to suit the verification environment.
00:39 CET EQUIVALENCE CHECKING OF SYSTEM-LEVEL AND SPICE-LEVEL MODELS OF STATIC NONLINEAR CIRCUITS
Authors:
Kemal Çağlar Coşkun1, Muhammad Hassan2 and Rolf Drechsler1
1Institute of Computer Science, University of Bremen, DE; 2Cyber-Physical Systems, DFKI GmbH, DE
Abstract
Recently, Signal Flow Graphs (SFGs) have been successfully leveraged to show equivalence for linear analog circuits. However, this is clearly not sufficient as the true complexity stems from nonlinear analog circuits. In this paper, we go beyond linear analog circuits, i.e., we extend the SFGs and develop the Modified Signal-Flow Graph (MSFG), to show equivalence between system-level and SPICE-level representations of static nonlinear analog circuits. First we map the nonlinear circuits to MSFGs. Afterward, graph simplification and functional approximation (in particular Legendre polynomials) techniques are used to create minimal MSFG and canonical MSFG. This enables us to compare the MSFGs even if they have vastly different structures. Finally, we propose a similarity metric that calculates the similarity between SPICE-level and system-level models. By successfully applying the proposed equivalence checking technique to benchmark circuits, we demonstrate its applicability.
00:39 CET ELECTROMIGRATION-AWARE DESIGN TECHNOLOGY CO-OPTIMIZATION FOR SRAM IN ADVANCED TECHNOLOGY NODES
Authors:
Mahta Mayahinia1, Hsiao-Hsuan Liu2, Subrat Mishra2, Zsolt Tokei2, Francky Catthoor2 and Mehdi Tahoori3
1Karlsruhe institute of technology (KIT), DE; 2IMEC, BE; 3Karlsruhe Institute of Technology, DE
Abstract
Static RAM (SRAM) is one of the critical components in advanced VLSI systems, and its performance, capacity, and reliability have a decisive impact on the entire system. It offers the fastest memory in the storage hierarchy of modern computer systems. Therefore, fast, flexible, and accurate high-level modeling of the SRAM module enables design technology co-optimization (DTCO). As CMOS technology moves toward smaller nodes, the back-end-of-line (BEoL) interconnects are also fabricated at tighter pitch sizes. Hence, besides the power lines, the SRAM word- and bit-lines (WL and BL) are also susceptible to electromigration (EM). Therefore, the EM reliability of the SRAM's WL and BL needs to be analyzed during the DTCO cycle. In this work, we investigate the impact of technology scaling on SRAM designs and perform a detailed analysis of the trend of their EM reliability and energy consumption. Our analysis shows that, though scaling down the CMOS technology can result in a 2.68x improvement in the energy efficiency of the SRAM module, it increases the EM-induced hydrostatic stress by ~2.53x.
00:39 CET SMART HAMMERING: A PRACTICAL METHOD OF PINHOLE DETECTION IN MRAM MEMORIES
Authors:
Sina Bakhtavari Mamaghani1, Christopher Muench1, Jongsin Yun2, Martin Keim3 and Mehdi Baradaran Tahoori1
1Karlsruhe Institute of Technology, DE; 2Siemens, US; 3Siemens, US
Abstract
As we move toward the commercialization of Spin-Transfer Torque Magnetic Random Access Memories (STT-MRAM), cost-effective testing and in-field reliability have become more prominent. The conventional test methods are not enough to capture all possible defects in such technologies, and as a result, there are potential test escapes that cause field failures. Among STT-MRAM manufacturing defects, pinholes are one of the important ones. Pinholes are defects on the surface of the oxide layer which degrade the resistive values and, in some cases, cause an oxide breakdown. Some moderate levels of pinhole defects can remain undetected during the normal functional test and cause a field failure. A stress test of the whole memory has been suggested to detect candidate pinhole defects. However, this test not only causes extra test costs but also degrades the reliability of MRAM for the entire array. In this paper, we have statistically studied the behavior of pinholes and proposed a cost-effective testing scheme to capture pinhole defects and increase the reliability of the end product. Our method limits the number of test candidate cells that need to be hammered, providing a much-reduced test time compared to existing methods. The proposed approach is compatible with memory built-in self-test (MBIST) schemes.
00:39 CET MA-OPT: REINFORCEMENT LEARNING-BASED ANALOG CIRCUIT OPTIMIZATION USING MULTI-ACTORS
Authors:
Youngchang Choi1, Minjeong Choi2, Kyongsu Lee3 and Seokhyeong Kang1
1Pohang University of Science and Technology, KR; 2POSTECH, KR; 3Postech, KR
Abstract
Analog circuit design requires significant human effort and expertise; therefore, electronic design automation (EDA) tools for analog design are needed. This study presents MA-Opt, an analog circuit optimizer based on a reinforcement learning (RL)-inspired framework. MA-Opt uses multiple actors to provide various predictions of optimized circuit designs in parallel. To exploit the multiple actors effectively, they share a specific memory that affects the loss function of network training, accelerating circuit optimization. Moreover, we devise a novel method that refines the best design found in previous simulations into an even better design. To demonstrate the efficiency of the proposed framework, MA-Opt was simulated on three analog circuits and the results were compared with those of other methods. The experimental results indicate the strength of using multiple actors with a shared elite solution set and the near-sampling method. Within the same number of simulations, while satisfying all given constraints, MA-Opt obtained minimum target metrics 13-24% better than those of DNN-Opt. Furthermore, MA-Opt obtained better Figures of Merit (FoMs) than DNN-Opt at the same runtime.
00:39 CET AUXCELLGEN: A FRAMEWORK FOR AUTONOMOUS GENERATION OF ANALOG AND MEMORY UNIT CELLS
Authors:
Sumanth Kamineni1, Arvind Sharma2, Ramesh Harjani2, Sachin S. Sapatnekar2 and Benton H. Calhoun1
1University of Virginia, US; 2University of Minnesota, US
Abstract
Recent advances in auto-generating analog and mixed-signal (AMS) circuits use standard digital tool flows to compose AMS circuits from a combination of digital standard cells and a set of auxiliary cells (auxcells). Until now, generating auxcell layouts for each new PDK was the last manual step in the flow for auto-generating AMS components, which limited the available auxcells and reduced the optimality of the auto-generated AMS designs. To solve this, we propose AuxcellGen, a framework to auto-generate auxcell layouts and performance models. AuxcellGen generates a parasitic-aware auxcell performance model using a neural network (NN), auto-sizes and optimizes auxcell schematics for a given design target, and auto-generates auxcell layouts. The framework is demonstrated by auto-generating tri-state buffer auxcells for PLLs and sense-amplifier auxcells for SRAM across a range of user specifications that are compatible with standard cell and memory bitcell pitch.
00:39 CET DEBUGGING LOW POWER ANALOG NEURAL NETWORKS FOR EDGE COMPUTING
Authors:
Sascha Schmalhofer, Marwin Moeller, Nikoletta Katsaouni, Marcel Schulz and Lars Hedrich, University of Frankfurt/M., DE
Abstract
The demand for extremely low-power inference engines, e.g. for edge computing, is high and leads to concepts like dedicated convolutional neural networks (CNNs). Analog signal processing in particular allows a low-power realization, with the drawback of reduced accuracy. For an analog CNN (ANN) implemented in hardware, generation of the (large) network is mandatory. Consequently, the verification of such large ANNs turns out to be a problem because the netlist can have millions of transistors. The verification goal is not clearly defined, as the ANN may accept some deviations, and the number of circuit elements compared to a traditional analog block results in a loss of oversight. In this paper we present a method to debug and analyze large synthesized ANNs, enabling a systematic comparison of the transistor netlist, the behavioral model, and the implementation. With that, insight into the behavior of the analog netlist is easily gained, and errors during generation or badly designed cells are quickly uncovered. An overall judgement of the accuracy is also presented. We demonstrate the functionality on several examples, from small ANNs to ANNs consisting of more than 10,000 cells implementing a medical application. We report uncovered design flaws and their fixes on that large real-world example.
00:39 CET HIGH PERFORMANCE AND DNU-RECOVERY SPINTRONIC RETENTION LATCH FOR HYBRID MTJ/CMOS TECHNOLOGY
Authors:
Aibin Yan1, Zhen Zhou1, Liang Ding1, Jie Cui1, Zhengfeng Huang2, Xiaoqing Wen3 and Patrick Girard4
1Anhui University, CN; 2Hefei University of Technology, CN; 3Kyushu Institute of Technology, JP; 4LIRMM / CNRS, FR
Abstract
Spintronic-based devices like magnetic tunnel junction (MTJ) are promising devices for space applications due to their radiation immunity, nonvolatility, and compatibility with nano-scale CMOS circuits. However, with the advancement of semiconductor technologies, CMOS peripheral circuits have become more vulnerable to soft errors, such as single-node-upsets (SNUs) and double-node-upsets (DNUs). In order to effectively tolerate DNUs caused by radiations and reduce the D-to-Q transmission delay of latches, this paper proposes a nonvolatile DNU resilient latch that mainly comprises two MTJs, two inverters and eight C-elements. Since two MTJs are used and all internal nodes are interlocked, the latch can provide nonvolatility and recover from all possible DNUs. Simulation results demonstrate the nonvolatility and DNU recovery.
00:39 CET MINIMUM UNIT CAPACITANCE CALCULATION FOR BINARY-WEIGHTED CAPACITOR ARRAYS
Authors:
Nibedita Karmokar, Ramesh Harjani and Sachin S. Sapatnekar, University of Minnesota, US
Abstract
The layout area and power consumption of a charge-scaling digital-to-analog converter (DAC) is typically dominated by the capacitor array. Since the number of unit capacitors in the array increases exponentially with the number of bits, minimizing the size of the unit capacitor is crucial for controlling the layout area. However, smaller capacitors can be susceptible to larger amounts of noise and process mismatch. Smaller capacitors can also be affected by mismatch in the parasitics of routing wires that connect the capacitors in the array: particularly in FinFET nodes, this mismatch can be significant. Together, these factors can degrade critical DAC performance metrics unless the unit capacitor is sufficiently large. This work proposes a systematic approach for minimizing the unit capacitance value in binary-weighted capacitor arrays for charge-scaling DACs. The method selects a value that optimizes the nonlinearity metrics of a DAC, accounting for multiple factors that contribute to mismatch, as well as the impact of flicker noise, and thermal noise.
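As background for the sizing problem, the snippet below applies the standard textbook mismatch estimate for a binary-weighted array: the worst-case DNL standard deviation at the MSB transition is roughly sqrt(2^N - 1) times the relative unit-capacitor mismatch, and the mismatch itself follows a Pelgrom-style 1/sqrt(C) law. The matching coefficient, target DNL, and sigma multiple are assumed example values; the paper's method additionally folds in routing-parasitic mismatch, flicker noise, and thermal noise.

# Back-of-the-envelope unit-capacitor sizing from a Pelgrom-style matching model.
# Generic textbook estimate only; not the paper's systematic method.
import math

def min_unit_cap(n_bits, a_c, dnl_limit_lsb=0.5, n_sigma=3.0):
    # a_c: matching coefficient such that sigma(dC/C) = a_c / sqrt(C_u [fF])
    # worst-case DNL std-dev at the MSB transition ~ sqrt(2**n_bits - 1) * sigma_u (in LSB)
    sigma_u_max = dnl_limit_lsb / (n_sigma * math.sqrt(2 ** n_bits - 1))
    return (a_c / sigma_u_max) ** 2        # minimum unit capacitance in fF

# Example with assumed numbers: 10-bit DAC, a_c = 0.5 %*sqrt(fF)
print(min_unit_cap(10, 0.005))             # required C_u in fF from mismatch alone

In practice kT/C noise and wire-parasitic mismatch raise this floor, which is exactly what a systematic selection method must account for.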

S_A1 Power-efficient and Smart Energy Systems

Date: Wednesday, 19 April 2023
Time: 14:00 - 15:30 CET

Time Label Presentation Title
Authors
00:39 CET SPARSEMEM: ENERGY-EFFICIENT DESIGN FOR IN-MEMORY SPARSE-BASED GRAPH PROCESSING
Authors:
Mahdi Zahedi1, Geert Custers2, Taha Shahroodi2, Georgi Gaydadjiev3, Stephan Wong1 and Said Hamdioui1
1Delft University of Technology, NL; 2TU Delft, NL; 3Maxeler / Imperial College, GB
Abstract
Performing analysis on large graph datasets in an energy-efficient manner poses a significant challenge, not only due to excessive data movement and poor locality, but also due to the non-optimal use of the high sparsity of such datasets. The latter leads to a waste of resources on computation over zero-valued operands, which do not contribute to the final result. This paper designs a novel graph processing accelerator, SparseMEM, targeting sparse datasets by leveraging the computing-in-memory (CIM) concept; CIM is a promising solution to alleviate the overhead of data movement and the inherent poor locality of graph processing. The proposed solution stores the graph information in a compressed hierarchical format inside the memory and adjusts the workflow based on this new mapping. This vastly improves resource utilization, leading to higher energy and performance efficiency. The experimental results demonstrate that SparseMEM outperforms a GPU-based platform and two state-of-the-art in-memory accelerators in speedup and energy efficiency by one and three orders of magnitude, respectively.
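For readers unfamiliar with compressed sparse graph storage, the sketch below shows plain CSR (compressed sparse row) and one sparse matrix-vector step; only non-zero edges are stored and touched, which is the generic source of the savings that sparsity-aware designs exploit. SparseMEM's compressed hierarchical in-memory format and CIM mapping are the paper's own and are not reproduced here.

# Minimal CSR representation of a graph adjacency matrix and one SpMV step.
def to_csr(dense):
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def spmv(values, col_idx, row_ptr, x):
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):   # visits non-zeros only
            y[i] += values[k] * x[col_idx[k]]
    return y

A = [[0, 2, 0, 0],
     [0, 0, 3, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 4]]
print(spmv(*to_csr(A), [1.0, 1.0, 1.0, 1.0]))   # -> [2.0, 3.0, 1.0, 4.0]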
00:39 CET HULK-V: A HETEROGENEOUS ULTRA-LOW-POWER LINUX CAPABLE RISC-V SOC
Authors:
Luca Valente1, Yvan Tortorella1, Mattia Sinigaglia1, Giuseppe Tagliavini1, Alessandro Capotondi2, Luca Benini3 and Davide Rossi1
1University of Bologna, IT; 2University of Modena and Reggio Emilia, IT; 3IIS lab, ETH Zurich, CH
Abstract
IoT applications span a wide range in performance and memory footprint, under tight cost and power constraints. High-end applications rely on power-hungry Systems-on-Chip (SoCs) featuring powerful processors, large LPDDR/DDR3/4/5 memories, and supporting full-fledged Operating Systems (OS). On the contrary, low-end applications typically rely on Ultra-Low-Power microcontrollers with a "close to metal" software environment and simple micro-kernel-based runtimes. Emerging applications and trends of IoT require the "best of both worlds": cheap and low-power SoC systems with a well-known and agile software environment based on full-fledged OS (e.g., Linux), coupled with extreme energy efficiency and parallel digital signal processing capabilities. We present HULK-V: an open-source Heterogeneous Linux-capable RISC-V-based SoC coupling a 64-bit RISC-V processor with an 8-core Programmable Multi-Core Accelerator (PMCA), delivering up to 13.8 GOps, up to 157 GOps/W and accelerating the execution of complex DSP and ML tasks by up to 112x over the host processor. HULK-V leverages a lightweight, fully digital memory hierarchy based on HyperRAM IoT DRAM that exposes up to 512 MB of DRAM memory to the host CPU. Featuring HyperRAMs, HULK-V doubles the energy efficiency without significant performance loss compared to featuring power-hungry LPDDR memories, requiring expensive and large mixed-signal PHYs. HULK-V, implemented in Global Foundries 22nm FDX technology, is a fully digital ultra-low-cost SoC running a 64-bit Linux software stack with OpenMP host-to-PMCA offload within a power envelope of just 250 mW.
00:39 CET HIGH-SPEED AND ENERGY-EFFICIENT SINGLE-PORT CONTENT ADDRESSABLE MEMORY TO ACHIEVE DUAL-PORT OPERATION
Authors:
Honglan Zhan, Chenxi Wang, Hongwei Cui, Xianhua Liu, Feng Liu and Xu Cheng, Department of Computer Science and Technology, Peking University, Beijing, CN
Abstract
High-speed and energy-efficient multi-port content addressable memory (CAM) is very important to modern superscalar processors. In order to overcome the disadvantages of multi-port CAM and improve the performance of the search stage, a high-speed and energy-efficient single-port (SP) CAM is introduced to achieve dual-port (DP) operation. For two different bit-cell topologies, the traditional 9T CAM cell and the 6T SRAM cell, two peripheral schemes, CShare and VClamp, are proposed. The proposed schemes have been verified across all possible corners, a wide range of temperatures, and detailed Monte Carlo variation analysis. With a 65-nm process and 1.2 V supply, the search delay of CShare and VClamp is 0.55 ns and 0.6 ns, respectively, which is decreased by about 87% compared with state-of-the-art works. In addition, compared with the recently proposed 10T BCAM, CShare and VClamp provide 84.9% and 85.1% energy reduction in the TT corner, respectively. At 0.6 V supply, CShare achieves 0.15 fJ/search/bit and VClamp achieves 0.14 fJ/search/bit. Experimental results on an 8Kb CAM at 1.2 V supply and across different corners show that the energy efficiency is improved by 45.56% (CShare) and 45.64% (VClamp) on average compared with DP CAM. To further study the influence of advanced technology, the characteristics of 16-nm FinFET technology were investigated. The operating frequencies of CShare and VClamp as a DP Translation Look-aside Buffer (TLB) are 3.125 GHz and 1.67 GHz, respectively, in 16-nm FinFET technology, and those as an SP TLB are 6.25 GHz and 3.33 GHz, respectively. CShare achieves 0.128 fJ/search/bit and VClamp achieves 0.085 fJ/search/bit over 100 searches with 16-nm FinFET technology and 0.8 V supply.
00:39 CET ENERGY-EFFICIENT HARDWARE ACCELERATION OF SHALLOW MACHINE LEARNING APPLICATIONS
Authors:
Ziqing Zeng and Sachin S. Sapatnekar, University of Minnesota, US
Abstract
ML accelerators have largely focused on building general platforms for deep neural networks (DNNs), but less so on shallow machine learning (SML) algorithms such as logistic regression, SVMs, and decision trees. This paper proposes a compact, configurable, template-based generator for SML hardware acceleration. The approach identifies computational kernels that are common to these algorithms as templates and builds a pipelined accelerator for efficient execution. The dataflow graphs of individual ML instances, with different data dimensions, are mapped to the pipeline stages and then optimized by customized algorithms. The approach generates energy-efficient hardware for training and inference of different ML algorithms, as demonstrated with post-layout FPGA and ASIC results.
00:39 CET STATEFUL ENERGY MANAGEMENT FOR MULTI-SOURCE ENERGY HARVESTING TRANSIENT COMPUTING SYSTEMS
Authors:
Sergey Mileiko1, Oktay Cetinkaya2, Rishad Shafik1 and Domenico Balsamo1
1Newcastle University, GB; 2Oxford e-Research Centre, GB
Abstract
The intermittent and varying nature of energy harvesting (EH) entails dedicated energy management with large energy storage, which is a limiting factor for low-power/cost systems with small form factors. Transient computing allows system operations to be performed in the presence of power outages by saving the system state into a non-volatile memory (NVM), thereby reducing the size of this storage. These systems are often designed with a task-based strategy, which requires the storage to be sized for the most energy-consuming task. That is, however, not ideal for most systems since their tasks/components have varying energy requirements, i.e., energy storage size and operating voltage. Hence, to overcome this issue, this paper proposes a novel energy management unit (EMU) tailored for multi-source EH transient systems that allows selecting the storage size and operating voltage for the next task at run-time, thereby optimizing task-specific energy needs and startup times based on application requirements. For the first time in the literature, we adopted a hybrid NVM+VM approach allowing our EMU to reliably and efficiently retain its internal state, i.e., a stateful EMU, under even the most severe EH conditions. Extensive empirical evaluations validated the operation of the proposed stateful EMU at a small overhead (0.07 mJ of energy to update the EMU state and ≃4 μA of static current consumption of the EMU).
00:39 CET FULLY ON-BOARD LOW-POWER LOCALIZATION WITH MULTIZONE TIME-OF-FLIGHT SENSORS ON NANO-UAVS
Authors:
Hanna Mueller1, Nicky Zimmerman2, Tommaso Polonelli3, Jens Behley2, Michele Magno4, Cyrill Stachniss2 and Luca Benini5
1Integrated Systems Laboratory, ETH Zurich, CH; 2Uni Bonn, DE; 3Center for Project-Based Learning, ETH Zürich, CH; 4ETH Zurich, CH; 5Università di Bologna and ETH Zurich, IT
Abstract
Nano-size unmanned aerial vehicles (UAVs) hold enormous potential to perform autonomous operations in complex environments, such as inspection, monitoring, or data collection. Moreover, their small size allows safe operation close to humans and agile flight. An important part of autonomous flight is localization, which is a computationally intensive task, especially on a nano-UAV that usually has strong constraints in sensing, processing and memory. This work presents a real-time localization approach with low element-count multizone range sensors for resource-constrained nano-UAVs. The proposed approach is based on a novel miniature 64-zone time-of-flight sensor from ST Microelectronics and a RISC-V-based parallel ultra low-power processor, to enable accurate and low latency Monte Carlo localization on-board. Experimental evaluation using a nano-UAV open platform demonstrated that the proposed solution is capable of localizing on a 31.2m^2 map with 0.15m accuracy and an above 95% success rate. The achieved accuracy is sufficient for localization in common indoor environments. We analyze tradeoffs in using full and half-precision floating point numbers as well as a quantized map and evaluate the accuracy and memory footprint across the design space. Experimental evaluation shows that parallelizing the execution for 8 RISC-V cores brings a 7x speedup and allows us to execute the algorithm on-board in real-time with a latency of 0.2-30ms (depending on the number of particles), while only increasing the overall drone power consumption by 3-7%. Finally, we provide an open-sourced implementation of our approach.
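The Monte Carlo localization loop at the heart of the system can be illustrated with a generic 1-D particle filter; the map, sensor model, and noise values below are invented for illustration, whereas the paper operates on a 2-D map with a 64-zone ToF sensor and runs the filter in parallel on a RISC-V cluster with reduced-precision arithmetic.

# Generic Monte Carlo localization sketch (1-D pose, single range measurement).
import random, math

WALL_AT = 10.0                        # position of the wall the range sensor sees

def expected_range(x):                # ideal measurement given pose x
    return WALL_AT - x

def mcl_step(particles, control, measured_range, sigma=0.1):
    # 1) predict: apply odometry with noise
    particles = [x + control + random.gauss(0, sigma) for x in particles]
    # 2) weight: Gaussian likelihood of the range measurement
    weights = [math.exp(-(measured_range - expected_range(x)) ** 2 / (2 * sigma ** 2))
               for x in particles]
    total = sum(weights) or 1e-12
    weights = [w / total for w in weights]
    # 3) resample proportionally to weight
    return random.choices(particles, weights=weights, k=len(particles))

true_x = 2.0
particles = [random.uniform(0, 10) for _ in range(256)]
for _ in range(20):
    true_x += 0.2
    z = expected_range(true_x) + random.gauss(0, 0.05)
    particles = mcl_step(particles, control=0.2, measured_range=z)
print(true_x, sum(particles) / len(particles))   # estimate tracks the true pose

Parallelizing exactly this predict/weight/resample loop over the particles is what yields the reported multi-core speedup.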
00:39 CET ENERGY-EFFICIENT WEARABLE-TO-MOBILE OFFLOAD OF ML INFERENCE FOR PPG-BASED HEART-RATE ESTIMATION
Authors:
Alessio Burrello1, Matteo Risso2, Noemi Tomasello2, Yukai Chen3, Luca Benini4, Enrico Macii2, Massimo Poncino2 and Daniele Jahier Pagliari2
1Department of Electric and Eletronic Engineering, University of Bologna, IT; 2Politecnico di Torino, IT; 3IMEC, BE; 4Università di Bologna and ETH Zurich, IT
Abstract
Modern smartwatches often include photoplethysmographic (PPG) sensors to sense the contractions within the dense arteriovenous system. This information can be used to measure heartbeats or blood pressure through complex algorithms that fuse PPG data with other signals. However, these approaches are often too complex to be deployed on microcontroller units (MCUs) such as the ones embedded in a smartwatch. In this work, we propose a collaborative inference approach that uses both a smartwatch and a connected smartphone to maximize the performance of heart rate (HR) tracking while also maximizing the smartwatch's battery life. In particular, we first analyze the trade-offs between running on-device HR tracking or offloading the work to the smartphone. Then, thanks to an additional step to evaluate the difficulty of the upcoming HR prediction, we demonstrate that we can smartly dispatch the workload between smartwatch and smartphone, maintaining a low mean absolute error (MAE) while reducing energy consumption. To benchmark our approach, we employed a custom smartwatch prototype which includes the STM32WB55 MCU for processing and Bluetooth Low-Energy (BLE) communication and a Raspberry Pi 3 as a proxy for the smartphone. With our Collaborative Heart Rate Inference System (CHRIS), we obtain a set of Pareto-optimal configurations demonstrating the same MAE as state-of-the-art (SoA) algorithms while consuming less energy. For instance, we can achieve approximately the same MAE as TimePPG-Small (5.54 BPM MAE vs. 5.60 BPM MAE) while reducing the energy by 2.03x, with a configuration that offloads 80% of the predictions to the phone. Furthermore, accepting a performance degradation to 7.16 BPM of MAE, we can achieve an energy consumption of 179 µJ per prediction, 3.03x less than running TimePPG-Small on the smartwatch, and 1.82x less than streaming all the input data to the phone.
00:39 CET A COUPLED BATTERY STATE OF CHARGE AND VOLTAGE MODEL FOR OPTIMAL CONTROL APPLICATIONS
Authors:
Masoomeh Karami1, Sajad Shahsavari1, Eero Immonen2, Hashem Haghbayan1 and Juha Plosila1
1University of Turku, FI; 2Turku University of Applied Sciences, FI
Abstract
Optimal control of electric vehicle (EV) batteries for maximal energy efficiency, safety and lifespan requires that the Battery Management System (BMS) has accurate real-time information on both the battery State-of-Charge (SoC) and its dynamics, i.e. energy supply capacity, at all times. However, these quantities cannot be measured directly from the battery, and, in practice, only SoC estimation is typically carried out. Moreover, the so-called Equivalent Circuit Models (ECM) commonly utilized in BMS solutions only display a memoryless algebraic dependence of voltage and current on SoC, without an ability to predict battery energy supply capacity based on its recent charge/discharge history. In this article, we propose a novel parametric algebraic voltage model coupled to the well-known Manwell-McGowan dynamic Kinetic Battery Model (KiBaM), which is able to predict both battery SoC dynamics and its electrical response. We present an offline model parameter identification procedure that yields SoC-dependent model parameters from standard dynamic battery tests, and we introduce an algorithm based on the Extended Kalman Filter (EKF) for standard SoC estimation on the proposed model. Numerical simulations, based on laboratory measurements, are presented for prismatic Lithium-Titanate Oxide (LTO) battery cells. Such cells are prime candidates for modern heavy offroad EV applications.
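The Manwell-McGowan Kinetic Battery Model mentioned above splits the charge into an available well and a bound well that exchange charge at a rate k. A plain forward-Euler sketch with invented parameter values is given below; the paper's contributions, the coupled parametric voltage model and the EKF-based SoC estimator, are not reproduced.

# Kinetic Battery Model (KiBaM) discharge with forward-Euler integration.
# Parameter values (c, k, capacity, current) are illustrative only.
def kibam_discharge(capacity_As=3600.0, c=0.6, k=1e-3, current_A=1.0,
                    dt=1.0, t_end=1800.0):
    y1, y2 = c * capacity_As, (1 - c) * capacity_As   # available / bound charge
    t = 0.0
    while t < t_end and y1 > 0:
        h1, h2 = y1 / c, y2 / (1 - c)
        flow = k * (h2 - h1)            # charge migrating from bound to available well
        y1 += dt * (-current_A + flow)
        y2 += dt * (-flow)
        t += dt
    soc = (y1 + y2) / capacity_As       # state of charge
    return soc, y1                      # y1 governs how much load the cell can still supply

print(kibam_discharge())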
00:39 CET ADEE-LID: AUTOMATED DESIGN OF ENERGY-EFFICIENT HARDWARE ACCELERATORS FOR LEVODOPA-INDUCED DYSKINESIA CLASSIFIERS
Authors:
Martin Hurta, Vojtech Mrazek, Michaela Drahosova and Lukas Sekanina, Brno University of Technology, CZ
Abstract
Taking levodopa, a drug used to treat symptoms of Parkinson's disease, is often connected with severe side-effects, known as Levodopa-induced dyskinesia (LID). It can fluctuate in severity throughout the day and thus is difficult to classify during a short period of a physician's visit. A low-power wearable classifier enabling long-term and continuous LID classification would thus significantly help with LID detection and dosage adjustment. This paper deals with an automated design of energy-efficient hardware accelerators of LID classifiers that can be implemented in wearable devices. The accelerator consists of a feature extractor and a classification circuit co-designed using genetic programming (GP). We also introduce and evaluate a fast and accurate energy consumption estimation method for the target architecture of considered classifiers. In a multi-objective design scenario, GP evolves solutions showing the best trade-offs between accuracy and energy. Further energy savings are obtained by introducing a variable bit width for arithmetic operators used in the feature extractor and classifier. Compared to the state-of-the-art solutions, the proposed method leads to classifiers showing a comparable accuracy while the energy consumption is reduced by 49 %.

S_D4 Resource-aware computing

Date: Wednesday, 19 April 2023
Time: 14:00 - 15:30 CET

Time Label Presentation Title
Authors
00:39 CET EFFICIENT HYPERDIMENSIONAL LEARNING WITH TRAINABLE, QUANTIZABLE, AND HOLISTIC DATA REPRESENTATION
Authors:
Jiseung Kim1, Hyunsei Lee1, Mohsen Imani2 and Yeseong Kim1
1DGIST, KR; 2University of California Irvine, US
Abstract
Hyperdimensional computing (HDC) is a computing paradigm that draws inspiration from a human memory model. It represents data in the form of high-dimensional vectors. Recently, many works in the literature have tried to use HDC as a learning model due to its simple arithmetic and high efficiency. However, learning frameworks in HDC use encoders that are randomly generated and static, resulting in many parameters and low accuracy. In this paper, we propose TrainableHD, a framework for HDC that utilizes a dynamic encoder with effective quantization for higher efficiency. Our model considers errors obtained from the HD model and dynamically updates the encoder during training. Our evaluations show that TrainableHD improves the accuracy of HDC by up to 22.26% (on average 3.62%) without any extra computation costs, achieving a level comparable to state-of-the-art deep learning. Also, the proposed solution is 56.4× faster and 73× more energy-efficient compared to deep learning on the NVIDIA Jetson Xavier, a low-power GPU platform.
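As a point of reference, the sketch below implements a baseline HDC classifier with a static random-projection encoder in numpy; class hypervectors are built by mistake-driven bundling. TrainableHD's contribution is precisely what is not shown here: updating the encoder from model errors during training and quantizing it.

# Baseline HDC classifier with a fixed random-projection encoder (numpy).
import numpy as np

D = 4096                                    # hypervector dimensionality

def encode(x, proj):
    return np.sign(x @ proj)                # bipolar hypervector(s)

def train(X, y, proj, n_classes, epochs=5):
    classes = np.zeros((n_classes, D))
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            h = encode(xi, proj)
            pred = int(np.argmax(classes @ h))
            if pred != yi:                  # mistake-driven bundling update
                classes[yi] += h
                classes[pred] -= h
    return classes

def predict(X, proj, classes):
    return np.argmax(encode(X, proj) @ classes.T, axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))
y = (X[:, 0] > 0).astype(int)
proj = rng.standard_normal((16, D))         # static encoder (what TrainableHD replaces)
classes = train(X, y, proj, n_classes=2)
print((predict(X, proj, classes) == y).mean())   # training accuracy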
00:39 CET SMART KNOWLEDGE TRANSFER-BASED RUNTIME POWER MANAGEMENT
Authors:
Lin Chen1, Xiao Li1, Fan Jiang1, Chengeng Li1 and Jiang Xu2
1The Hong Kong University of Science and Technology, HK; 2Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
As Moore's law slows down, computing systems must pivot towards higher energy efficiency to continue scaling performance. Reinforcement learning (RL) performs more adaptively than conventional methods in runtime power management under varied hardware configurations and varying software workloads. However, prior works on either model-free or model-based RL approaches face a non-negligible challenge: relearning the policies to adapt to the new environment is unacceptably time-consuming, especially when encountering significant variances in workloads or hardware configurations. Moreover, existing research on accelerating learning has focused on the speedup while largely ignoring the efficiency degradation of the results. In this paper, we present a smart transfer-enabled Q-learning (STQL) approach to boost the learning process and guarantee the learning efficiency through a contradiction checking mechanism, which evicts inappropriate transferred knowledge. Experiments on realistic applications show that the proposed method can speed up the learning process up to 2.3x and achieve a 6.2% energy-delay product (EDP) reduction compared to the state-of-the-art design.
00:39 CET SG-FLOAT: ACHIEVING MEMORY ACCESS AND COMPUTATIONAL POWER REDUCTION USING SELF-GATING FLOAT IN CNN ACCELERATORS
Author:
Jun-Shen Wu, National Tsing Hua University, Hsinchu, Taiwan, TW
Abstract
Convolutional neural networks (CNNs) are crucial for enabling the future artificial intelligence world. However, due to their large data and computation requirements, devices need considerable memory and hardware resources, limiting implementation on energy-constrained or hardware-constrained devices, e.g., IoT devices. In this work, we present self-gating float (SG-Float), an algorithm-hardware co-design of a novel binary number format, which can significantly reduce CNN memory access and computational power. First, we propose the novel SG-Float number format, which uses the exponent as the index to self-gate the mantissa to zero. With SG-Float, relatively small values are approximately represented only by the exponent. As a result, SG-Float can increase the proportion of zero mantissas, which corresponds to a reduction in mantissa multiplications. Second, we offer an optimization technique to best match SG-Float with CNN accelerators, the SG-Float buffering/storage strategy, which can reduce the memory access of SG-Float. Finally, we implement the SG-Float buffering/storage strategy with an NVDLA-like floating-point processing element (PE) in TSMC 40nm technology. Our evaluation results reveal that, on four state-of-the-art image recognition CNN models, SG-Float can achieve up to 37% memory access power reduction with our proposed SG-Float buffering/storage strategy, and up to 46% computational power reduction compared with AdaptivFloat, with negligible power and area overhead. Furthermore, the inference accuracy loss caused by SG-Float is within 1%. We also show that SG-Float can be combined with neural network pruning, further reducing the memory accesses and mantissa multiplications of the pruned CNN model.
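One plausible reading of the exponent-gated format is sketched below: values whose exponent falls below a threshold keep only their exponent (the mantissa is gated to zero), so their multiplications reduce to exponent additions or shifts. The gating rule, threshold, and mantissa width are assumptions made for illustration; the paper defines the exact SG-Float encoding.

# Sketch of exponent-gated mantissas in the spirit of SG-Float (assumed rule).
import math

def sg_float(x, exp_threshold=-4, mant_bits=7):
    if x == 0.0:
        return 0.0
    m, e = math.frexp(abs(x))            # abs(x) = m * 2**e with m in [0.5, 1)
    if e - 1 < exp_threshold:            # "relatively small" value:
        mant = 0.5                       # keep exponent only, gate the mantissa
    else:
        scale = 1 << mant_bits
        mant = round((m - 0.5) * 2 * scale) / (2 * scale) + 0.5
    return math.copysign(mant * 2 ** e, x)

for v in [0.9, 0.11, 0.013, 0.0016]:
    print(v, "->", sg_float(v))
# Values below 2**exp_threshold lose their mantissa, so multiplying by them
# reduces to exponent addition / shifting in hardware.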
00:39 CET REDRAW: FAST AND EFFICIENT HARDWARE ACCELERATOR WITH REDUCED READS AND WRITES FOR 3D UNET
Authors:
Tom Glint1, Manu Awasthi2 and Joycee Mekie1
1Indian Institute of Technology Gandhinagar, IN; 2Ashoka University, IN
Abstract
Hardware accelerators (HAs) proposed in earlier works have mainly been designed with a focus on 2D convolutional neural networks (CNNs) and 3D CNNs with temporal data. To the best of our knowledge, there is no HA for 3D CNNs with spatial data. 3D UNet is a 3D CNN with significant applications in the medical domain. However, the total on-chip buffer size (> 20 MB) required for the fully stationary approach to processing 3D UNet is cost prohibitive. In this work, we analyze the 3D UNet workload and propose an HA with an optimized memory hierarchy and a total on-chip buffer of less than 4 MB, while conceding near the theoretical minimum of memory accesses required for processing 3D UNet. We demonstrate the efficiency of the proposed HA by comparing it with the SOTA Simba architecture with the same number of MAC units and show a ~1.3x increase in TOPS/watt in the same area. Further, we revise the proposed architecture to increase the ratio of compute operations to memory operations and to meet the latency requirement of 3D UNet-based embedded applications. The revised architecture is compared against a dual instance of Simba, which has similar latency. Against the dual instance of Simba, the proposed architecture achieves a ~1.8x increase in TOPS/watt in a similar area.
00:39 CET TEMPERATURE-AWARE SIZING OF MULTI-CHIP MODULE ACCELERATORS FOR MULTI-DNN WORKLOADS
Authors:
Prachi Shukla1, Derrick Aguren2, Tom Burd3, Ayse Coskun1 and John Kalamatianos3
1Boston University, US; 2Advanced Micro Devices, US; 3AMD, US
Abstract
This paper demonstrates the need for temperature awareness in sizing accelerators to target multi-DNN workloads. To that end, we build TESA, a TEmperature-aware methodology that Sizes and places Accelerators to balance both the cost and power of a multi-chip module (MCM) including DRAM power for multi-deep neural network workloads. TESA tunes the accelerator chiplet size and inter-chiplet spacing to generate a temperature-aware MCM layout, subject to user-defined latency, area, power, and thermal constraints. Using TESA for both 2D and 3D systolic array-based chiplets, we demonstrate up to 44% MCM cost savings and 63% DRAM power savings, respectively, over a temperature-unaware baseline at iso-frequency and iso-interposer area. We also demonstrate a need for TESA to obtain feasible MCM configurations for multi-DNN workloads such as augmented/virtual reality (AR/VR).
00:39 CET JUMPING SHIFT: A LOGARITHMIC QUANTIZATION METHOD FOR LOW-POWER CNN ACCELERATION
Authors:
Longxing Jiang, David Aledo and Rene Leuken, Delft University of Technology, NL
Abstract
Logarithmic quantization for Convolutional Neural Networks (CNNs): a) fits typical weight and activation distributions well, and b) allows the multiplication operation to be replaced by a shift operation that can be implemented with fewer hardware resources. We propose a new quantization method named Jumping Log Quantization (JLQ). The key idea of JLQ is to extend the quantization range by adding a coefficient parameter "s" in the power-of-two exponents, 2^(sx+i). This quantization strategy skips some values of standard logarithmic quantization. In addition, we also develop a small hardware-friendly optimization called weight de-zeroing. Zero-valued weights, which cannot be realized by a single shift operation, are all replaced with logarithmic weights to further reduce hardware resources with almost no accuracy loss. To implement the Multiply-And-Accumulate (MAC) operation (needed to compute convolutions) when the weights are JLQ-ed and de-zeroed, a new Processing Element (PE) has been developed. This new PE uses a modified barrel shifter that can efficiently avoid the skipped values. Resource utilization, area, and power consumption of the new PE standing alone are reported. We have found that JLQ performs better than other state-of-the-art logarithmic quantization methods when the bit width of the operands becomes very small.
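The level set 2^(sx+i) can be made concrete as follows; the particular values of s and i, the exponent range, and the nearest-level rounding rule are illustrative assumptions rather than the paper's released configuration.

# Jumping Log Quantization levels of the form 2**(s*x + i), per the abstract.
import math

def jlq_levels(s=2, i=0, x_range=range(-4, 1)):
    return sorted(2.0 ** (s * x + i) for x in x_range)

def jlq_quantize(w, levels):
    if w == 0.0:
        return 0.0
    mag = min(levels, key=lambda L: abs(abs(w) - L))   # nearest allowed level
    return math.copysign(mag, w)

levels = jlq_levels()                    # e.g. 2**-8, 2**-6, 2**-4, 2**-2, 2**0
print(levels)
print([jlq_quantize(w, levels) for w in [0.8, -0.3, 0.07, 0.004]])
# With s > 1 the exponent "jumps", skipping some standard log2 levels and widening
# the covered range for the same number of codes; multiplying by a quantized
# weight then maps to a shift that the modified barrel shifter can implement.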
00:39 CET THERMAL MANAGEMENT FOR S-NUCA MANY-CORES VIA SYNCHRONOUS THREAD ROTATIONS
Authors:
Yixian Shen, Sobhan Niknam, Anuj Pathania and Andy Pimentel, University of Amsterdam, NL
Abstract
On-chip thermal management is quintessential to a thermally safe operation of a many-core processor. The presence of a physically-distributed logically-shared Last-Level Cache (LLC) significantly reduces the performance penalty of migrating threads within the cores of an S-NUCA many-core. This cost reduction allows novel thermal management of these many-cores via synchronous thread migration. Synchronous thread migration provides a viable alternative to Dynamic Voltage and Frequency Scaling (DVFS) and asynchronous thread migration used traditionally to manage thermals of S-NUCA many-cores. We present a theoretical method to compute the peak temperature in many-cores with synchronous thread migrations. We use the method to create a thermal management heuristic called HotPotato that maximizes the performance of S-NUCA many-cores under a peak temperature constraint. We implement HotPotato within the state-of-the-art HotSniper simulator. Detailed interval thermal simulations with HotSniper show an average 10.72% improvement in response time of S-NUCA many-cores when scheduling with HotPotato compared to a state-of-the-art thermal-aware S-NUCA scheduler.
00:39 CET PROTEUS: HLS BASED NOC GENERATOR AND SIMULATOR
Authors:
Abhimanyu Rajeshkumar Bambhaniya, Yangyu Chen, FNU Anshuman, Rohan Banerjee and Tushar Krishna, Georgia Institute of Technology, US
Abstract
Networks-on-chip (NoCs) form the backbone fabric connecting multi-core SoCs containing several processor cores and memories. Design-space exploration (DSE) of NoCs is a crucial part of the SoC design process to ensure that the NoC does not become a bottleneck. DSE today is often hindered by the inherent trade-off between software simulation and hardware emulation/evaluation. Software simulators are easily extendable and allow for evaluation of new ideas but are not able to capture the hardware complexity. Meanwhile, RTL development is known to be time-consuming. This has forced DSE to use simulators followed by RTL development, evaluation, and feedback, which slows down the overall design process. In an effort to tackle this problem, we present Proteus, a configurable and modular NoC simulator and RTL generator. Proteus is the first framework of its kind to use an HLS compiler to develop NoCs from a C++ description of the NoC circuit. The generated NoCs can be simulated in software and tested on FPGAs. This allows users to perform rapid DSE by providing the opportunity to tweak and test NoC architectures in real time. We also compare Proteus-generated RTL with Chisel-generated and hand-written RTL in terms of area, timing, and productivity. The ability to synthesize the NoC design on FPGAs can benefit large designs, as the custom hardware results in faster run-time than cycle-accurate software simulators. Proteus is modeled similarly to existing state-of-the-art simulators and offers users modifiable parameters to generate custom topologies, routing algorithms, and router microarchitectures.
00:39 CET MOELA: A MULTI-OBJECTIVE EVOLUTIONARY/LEARNING DESIGN SPACE EXPLORATION FRAMEWORK FOR 3D HETEROGENEOUS MANYCORE PLATFORMS
Authors:
Sirui Qi1, Yingheng Li2, Sudeep Pasricha1 and Ryan Kim1
1Colorado State University, US; 2University of Pittsburgh, US
Abstract
To enable emerging applications such as deep learning and graph processing, 3D network-on-chip (NoC) enabled heterogeneous manycore platforms that can integrate many processing elements (PEs) are needed. However, designing such complex systems with multiple objectives can be challenging due to the huge design space and long evaluation time associated with them. To optimize such systems, we propose a new multi-objective design space exploration framework called MOELA that combines the benefits of evolutionary-based search with a learning-based local search to quickly determine PE and communication link placement to optimize multiple objectives (e.g., latency, throughput, and energy) in 3D NoC enabled heterogeneous manycore systems. Compared to state-of-the-art approaches, MOELA increases the speed of finding solutions by up to 50x, leads to a better Pareto Hypervolume (PHV) by up to 349% and improves energy-delay-product (EDP) by up to 6.7% in a 5-objective scenario.
00:39 CET DEVELOPING AN ULTRA-LOW POWER RISC-V PROCESSOR FOR ANOMALY DETECTION
Authors:
Jina Park1, Eunjin Choi1, Kyungwon Lee1, Jae-Jin Lee2, Kyuseung Han2 and Woojoo Lee1
1Chung-Ang University, KR; 2ETRI, KR
Abstract
With a focus on anomaly detection, a representative application in healthcare, this paper develops an ultra-low power processor for wearable devices. First, the paper proposes a processor architecture divided into a part for general applications running on wearable devices (day part) and a part that performs anomaly detection by analyzing sensor data (night part), where each part operates completely independently. This day-night architecture allows the day part, which contains the power-hungry main CPU and system interconnect, to be turned off most of the time except for intermittent work, while the night part, which consists only of the sub-CPU and minimal IPs, can run all the time at low power. Next, the paper designs an ultra-lightweight all-night core based on a subset of RV32I optimized for anomaly detection applications, and completes the development of an ultra-low power processor by adopting it as the sub-CPU of the proposed architecture. Finally, by prototyping the proposed processor and developing an anomaly detection application that runs on the prototype, the paper demonstrates the power savings along with design validation of the proposed processor technology.
00:39 CET MONTM: MONITORING-BASED THERMAL MANAGEMENT FOR MIXED-CRITICALITY SYSTEMS
Authors:
Marcel Mettler1, Martin Rapp2, Heba Khdr3, Daniel Mueller-Gritschneder1, Joerg Henkel4 and Ulf Schlichtmann1
1TU Munich, DE; 2Karlsruhe Institute of Technology, DE; 3Karlsruhe Institute of Technology (KIT), DE; 4KIT, DE
Abstract
With a rapidly growing functionality of embedded real-time applications, it becomes inevitable to integrate tasks of different safety integrity levels on one many-core processor leading to a large-scale mixed-criticality system. In this process, it is not sufficient to only isolate shared architectural resources, as different tasks executing on different cores also possibly interfere via the many-core processor's thermal management. This can possibly lead to best-effort tasks causing deadline violations for safety-critical tasks. In order to prevent such a scenario, we propose a monitoring-based hardware extension that communicates imminent thermal violations between cores via a lightweight interconnect. Building on this infrastructure, we propose a thermal strategy such that best-effort tasks can be throttled in favor of safety-critical tasks. Furthermore, assigning static voltage/frequency (V/f) levels to each safety-critical task based on their worst-case execution time may result in unnecessary high V/f levels when the actual execution finishes faster. To free the otherwise wasted thermal resources, our solution monitors the progress of safety-critical tasks to detect slack and safely reduce their V/f levels. This increases the thermal headroom for best-effort tasks, boosting their performance. In our evaluation, we demonstrate our approach on an 80-core processor to show that it satisfies the thermal and deadline requirements, and simultaneously reduces the run-time of best-effort tasks by up to 45% compared to the state-of-the-art.
00:39 CET A LIGHTWEIGHT CONGESTION CONTROL TECHNIQUE FOR NOCS WITH DEFLECTION ROUTING
Authors:
Shruti Yadav Narayana1, Sumit Mandal2, Raid Ayoub3, Micheal Kishinevsky4 and Umit Ogras5
1University of Wisconsin–Madison, US; 2Indian Institute of Science, IN; 3Intel Corporation, US; 4Intel Corporation, US; 5University of Wisconsin–Madison, US
Abstract
Network-on-Chip (NoC) congestion builds up under heavy traffic load and cripples system performance by stalling the cores. Moreover, congestion leads to wasted link bandwidth due to blocked buffers and bouncing packets. Existing approaches throttle the cores after congestion is detected, leading to a highly congested NoC and stalled cores. In contrast, we propose a lightweight machine learning-based technique that helps predict congestion in the network. Specifically, our proposed technique collects traffic-related features at each sink. Then, it labels the features using a novel time-reversal approach. The labeled data is used to design a low-overhead and explainable decision tree model used at runtime for congestion control. Experimental evaluations with synthetic and real traffic on an industrial 6×6 NoC show that the proposed approach increases fairness and memory read bandwidth by up to 59% with respect to a state-of-the-art congestion control technique.

S_D5 Approximate computing

Date: Wednesday, 19 April 2023
Time: 14:00 - 15:30 CET

Time Label Presentation Title
Authors
00:39 CET MAXIMIZING COMPUTING ACCURACY ON RESOURCE-CONSTRAINED ARCHITECTURES
Authors:
Van-Phu Ha and Olivier Sentieys, INRIA, FR
Abstract
With the growing complexity of applications, designers need to fit more and more computing kernels into a limited energy or area budget. Therefore, improving the quality of results of applications in electronic devices under a cost constraint is becoming a critical problem. Word Length Optimization (WLO) is the process of determining bit-widths for variables or operations represented using fixed-point arithmetic to trade off quality against cost. State-of-the-art approaches mainly solve WLO given a quality (accuracy) constraint. In this paper, we first show that existing WLO procedures are not suited to solving the problem of optimizing accuracy given a cost constraint, which makes it interesting and challenging to propose new methods for this problem. We then propose a Bayesian optimization based algorithm to maximize the quality of computations under a cost constraint (i.e., energy in this paper). Experimental results indicate that our approach outperforms conventional WLO approaches by improving the quality of the solutions by more than 170%.
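The problem being targeted, maximizing quality subject to an energy budget over per-variable bit-widths, can be stated as in the toy sketch below. A uniform-quantization noise model, a linear energy model, and a plain random search are assumed purely for illustration; the paper replaces the search with Bayesian optimization over measured quality and cost.

# Toy word-length optimization under an energy budget (illustration only).
import math, random

N_VARS = 6
ENERGY_PER_BIT = [1.0, 0.8, 1.3, 0.6, 1.1, 0.9]   # assumed per-variable bit costs
ENERGY_BUDGET = 40.0

def energy(bits):
    return sum(b * e for b, e in zip(bits, ENERGY_PER_BIT))

def quality(bits):
    # uniform-quantization noise power q**2/12 per variable, reported on a dB-like scale
    noise = sum((2.0 ** -b) ** 2 / 12 for b in bits)
    return -10 * math.log10(noise)

random.seed(0)
best = None
for _ in range(20000):
    bits = [random.randint(2, 16) for _ in range(N_VARS)]
    if energy(bits) <= ENERGY_BUDGET:
        cand = (quality(bits), bits)
        if best is None or cand > best:
            best = cand
print(best)   # best quality found and its bit-width assignment within the budget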
00:39 CET MECALS: A MAXIMUM ERROR CHECKING TECHNIQUE FOR APPROXIMATE LOGIC SYNTHESIS
Authors:
Chang Meng, Jiajun Sun, Yuqi Mai and Weikang Qian, Shanghai Jiao Tong University, CN
Abstract
Approximate computing is an effective computing paradigm to improve energy efficiency for error-tolerant applications. Approximate logic synthesis (ALS) methods are designed to generate approximate circuits under certain error constraints. This paper focuses on ALS methods under the maximum error constraint and proposes MECALS, a maximum error checking technique for ALS. MECALS models maximum error using partial Boolean difference and performs fast error checking with SAT-sweeping. Based on MECALS, we design an efficient ALS flow. Our experimental results show that compared to a state-of-the-art ALS method, our flow is 13× faster and improves area and delay reduction by 39.2% and 26.0%, respectively.
00:39 CET COMPACT: CO-PROCESSOR FOR MULTI-MODE PRECISION-ADJUSTABLE NONLINEAR ACTIVATION FUNCTION
Authors:
Wenhui Ou, Zhuoyu Wu, Zheng Wang, Chao Chen and Yongkui Yang, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, CN
Abstract
Non-linear activation functions imitating neuron behaviour are ubiquitous in machine learning algorithms for time-series signals, while also demonstrating significant precision gains for conventional vision-based deep learning networks. State-of-the-art implementations of such functions on GPU-like devices incur a large physical cost, whereas edge devices adopt either linear interpolation or simplified linear functions, leading to degraded precision. In this work, we design COMPACT, a co-processor with adjustable precision for multiple non-linear activation functions, including but not limited to exponential, sigmoid, tangent, logarithm, and mish. Benchmarked against the state of the art, COMPACT achieves a 30% reduction in absolute error over a 1.6x wider approximation range, taking advantage of a triple decomposition technique inspired by Hajduk's formula for Padé approximation. A SIMD-ISA-based vector co-processor has been implemented on FPGA, which leads to a 30% reduction in execution latency while the area overhead remains nearly the same as related designs. Furthermore, COMPACT can be adjusted for a 46% latency improvement when a maximum absolute error on the order of 1E-3 can be tolerated.
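As a concrete instance of the rational approximations such a co-processor builds on, the classic [3/2] Padé approximant of tanh is shown below. This textbook formula only illustrates the general idea of replacing a transcendental with a few multiplies and one divide; it is not COMPACT's triple decomposition.

# [3/2] Pade approximant of tanh(x): x*(15 + x**2) / (15 + 6*x**2).
import math

def tanh_pade(x):
    x2 = x * x
    return x * (15.0 + x2) / (15.0 + 6.0 * x2)

for x in [0.1, 0.5, 1.0, 2.0]:
    print(x, tanh_pade(x), math.tanh(x), abs(tanh_pade(x) - math.tanh(x)))
# The error grows with |x|; range decomposition and adjustable precision are
# exactly the levers a hardware function unit uses to control this trade-off.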
00:39 CET DEEPCAM: A FULLY CAM-BASED INFERENCE ACCELERATOR WITH VARIABLE HASH LENGTHS FOR ENERGY-EFFICIENT DEEP NEURAL NETWORKS
Authors:
Duy-Thanh Nguyen, Abhiroop Bhattacharjee, Abhishek Moitra and Priyadarshini Panda, Yale University, US
Abstract
With ever-increasing depth and width in deep neural networks to achieve state-of-the-art performance, deep learning computation has significantly grown, and dot-products remain dominant in overall computation time. Most prior works are built on conventional dot-product where weighted input summation is used to represent the neuron operation. However, another implementation of dot-product based on the notion of angles and magnitudes in the Euclidean space has attracted limited attention. This paper proposes DeepCAM, an inference accelerator built on two critical innovations to alleviate the computation time bottleneck of convolutional neural networks. The first innovation is an approximate dot-product built on computations in the Euclidean space that can replace addition and multiplication with simple bit-wise operations. The second innovation is a dynamic size content addressable memory-based (CAM-based) accelerator to perform bit-wise operations and accelerate the CNNs with a lower computation time. Our experiments on benchmark image recognition datasets demonstrate that DeepCAM is up to 523x and 3498x faster than Eyeriss and traditional CPUs like Intel Skylake, respectively. Furthermore, the energy consumed by our DeepCAM approach is 2.16x to 109x less compared to Eyeriss.
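The angle-and-magnitude view of the dot product can be illustrated with random-hyperplane signatures: the Hamming distance between bit signatures estimates the angle, so dot(a, b) = |a||b|cos(theta) is recovered from bit-wise operations. The sketch below is a generic locality-sensitive-hashing illustration with an assumed signature length, not DeepCAM's CAM datapath.

# Estimating a dot product from bit-wise operations via random hyperplanes.
import math
import numpy as np

rng = np.random.default_rng(0)

def signature(v, planes):
    return (planes @ v) > 0                 # one bit per random hyperplane

def approx_dot(a, b, planes):
    ham = np.count_nonzero(signature(a, planes) ^ signature(b, planes))
    theta = math.pi * ham / len(planes)     # expected angle from Hamming distance
    return np.linalg.norm(a) * np.linalg.norm(b) * math.cos(theta)

a, b = rng.standard_normal(64), rng.standard_normal(64)
planes = rng.standard_normal((1024, 64))    # 1024-bit signatures (assumed length)
print(float(a @ b), approx_dot(a, b, planes))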
00:39 CET DESIGN OF LARGE-SCALE STOCHASTIC COMPUTING ADDERS AND THEIR ANOMALOUS BEHAVIOR
Authors:
Timothy Baker and John Hayes, University of Michigan, US
Abstract
Stochastic computing (SC) uses streams of pseudo-random bits to perform low-cost and error-tolerant numerical processing for applications like neural networks and digital filtering. A key operation in these domains is the summation of many hundreds of bit-streams, but existing SC adders are inflexible. Designs such as basic mux adders have low area but poor accuracy while other adders like accumulative parallel counters (APCs) have good accuracy but high area. This work introduces parallel sampling adders (PSAs), a novel weighted adder family that offers a favorable area-accuracy trade-off and provides great flexibility to large-scale SC adder design. Our experiments show that PSAs can sometimes achieve the same high accuracy as APCs, but at half the area cost. We also examine the behavior of large-scale SC adders in depth and uncover some surprising results. First, APC accuracy is shown to be very sensitive to input correlation despite the common belief that APCs are correlation insensitive. Then, we show that mux-based adders are sometimes more accurate than APCs, which contradicts most prior studies. Explanations for these anomalies are given and a decorrelation scheme is proposed to improve APC accuracy by 3x for a digital filtering application.
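The accuracy gap between mux-based scaled addition and an accumulative parallel counter can be reproduced with a few lines of bit-stream simulation; the stream length, input count, and software random number generation below are simplifications standing in for LFSR-driven hardware, and the proposed PSA itself is not modeled.

# Stochastic-computing addition of n unipolar bit-streams: mux adder vs APC.
import random

def bitstream(p, length):
    return [1 if random.random() < p else 0 for _ in range(length)]

def mux_adder(streams):
    # each cycle, output the bit of one randomly selected input (scaled addition)
    L, n = len(streams[0]), len(streams)
    out = [streams[random.randrange(n)][t] for t in range(L)]
    return n * sum(out) / L                      # undo the 1/n scaling

def apc_adder(streams):
    # each cycle, count the ones across all inputs (binary partial sums)
    L = len(streams[0])
    return sum(sum(s[t] for s in streams) for t in range(L)) / L

random.seed(1)
probs = [random.random() for _ in range(256)]    # 256 input values in [0, 1]
streams = [bitstream(p, 1024) for p in probs]
print(sum(probs), mux_adder(streams), apc_adder(streams))
# The mux adder's single-bit output loses precision by the factor n, while the
# APC tracks the exact per-cycle count at a higher area cost; weighted adders
# such as PSAs aim between these two extremes.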
00:39 CET ACCURATE YET EFFICIENT STOCHASTIC COMPUTING NEURAL ACCELERATION WITH HIGH PRECISION RESIDUAL FUSION
Authors:
Yixuan Hu1, Tengyu Zhang1, Renjie Wei1, Meng Li2, Runsheng Wang1, Yuan Wang1 and Ru Huang1
1School of Integrated Circuits, Peking University, CN; 2Institute for Artificial Intelligence and School of Integrated Circuits, Peking University, CN
Abstract
Stochastic computing (SC) emerges as a fault-tolerant and area-efficient computing paradigm for neural acceleration. However, existing SC accelerators suffer from an intrinsic trade-off between inference accuracy and efficiency: accurate SC requires high precision computation but suffers from an exponential increase of bit stream length and inference latency. In this paper, we discover the high precision residual as a key remedy and propose to combine a low precision datapath with a high precision residual to improve inference accuracy with minimum efficiency overhead. We also propose to fuse batch normalization with the activation function to further improve the inference efficiency. The effectiveness of our proposed method is verified on a recently proposed SC accelerator. With extensive results, we show that our proposed SC-friendly network achieves 9.43% accuracy improvements compared to the baseline low precision networks with only 1.3% area-delay product (ADP) increase. We further show 3.01x ADP reduction compared to the baseline SC accelerator with almost iso-accuracy.
00:39 CET PECAN: A PRODUCT-QUANTIZED CONTENT ADDRESSABLE MEMORY NETWORK
Authors:
Jie Ran1, Rui Lin2, Jason Li1, JiaJun Zhou1 and Ngai Wong1
1University of Hong Kong, HK; 2University of Hong Kong, HK
Abstract
A novel deep neural network (DNN) architecture is proposed wherein the filtering and linear transform are realized solely with product quantization (PQ). This results in a natural implementation via content addressable memory (CAM), which transcends regular DNN layer operations and requires only simple table lookup. Two schemes are developed for the end-to-end PQ prototype training, namely, through angle- and distance-based similarities, which differ in their multiplicative and additive natures with different complexity-accuracy tradeoffs. Even more, the distance-based scheme constitutes a truly multiplier-free DNN solution. Experiments confirm the feasibility of such Product-Quantized Content Addressable Memory Network (PECAN), which has strong implication on hardware-efficient deployments especially for in-memory computing.
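A minimal product-quantized linear layer in the spirit described above is sketched below: the input is split into sub-vectors, each sub-vector is matched to its nearest prototype (the CAM-style lookup), and the output is assembled from precomputed partial dot products. The random (untrained) prototypes and codebook sizes are placeholders; PECAN trains its prototypes end-to-end with the angle- or distance-based similarity.

# Product-quantized linear layer as table lookups (illustrative sketch, numpy).
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_OUT, N_SUB, K = 64, 16, 8, 16        # 8 sub-spaces with 16 prototypes each
SUB = D_IN // N_SUB

W = rng.standard_normal((D_OUT, D_IN))
protos = rng.standard_normal((N_SUB, K, SUB))        # per-subspace codebook
Wr = W.reshape(D_OUT, N_SUB, SUB)
# Precompute partial dot products: table[s, k, :] = protos[s, k] . W[:, s-th slice]
table = np.einsum('skc,osc->sko', protos, Wr)

def pq_linear(x):
    xs = x.reshape(N_SUB, SUB)
    # CAM-style step: nearest prototype index in each subspace (distance similarity)
    idx = [int(np.argmin(((protos[s] - xs[s]) ** 2).sum(axis=1))) for s in range(N_SUB)]
    # output assembled purely from table lookups and additions, no multiplications
    return sum(table[s, idx[s]] for s in range(N_SUB))

x = rng.standard_normal(D_IN)
print(float(np.abs(W @ x - pq_linear(x)).mean()))    # PQ approximation error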
00:39 CET XRING: A CROSSTALK-AWARE SYNTHESIS METHOD FOR WAVELENGTH-ROUTED OPTICAL RING ROUTERS
Authors:
Zhidan Zheng, Mengchu Li, Tsun-Ming Tseng and Ulf Schlichtmann, TU Munich, DE
Abstract
Wavelength-routed optical networks-on-chip (WRONoCs) are well-known for supporting high-bandwidth communications with low power and latency. Among all WRONoC routers, optical ring routers have attracted great research interest thanks to their simple structure, which looks like concentric cycles formed by waveguides. Current ring routers are designed manually. When the number of network nodes increases or the position of network nodes changes, it can be difficult to manually determine the optimal design options. Besides, current ring routers face two problems. First, some signal paths in the routers can be very long and suffer high insertion loss; second, to connect the network nodes to off-chip lasers, waveguides in the power distribution network (PDN) have to intersect with the ring waveguides, which causes additional insertion loss and crosstalk noise. In this work, we propose XRing, which is the first design automation method to automatically synthesize optical ring routers based on the number and position of network nodes. In particular, XRing optimizes the waveguide connections between the network nodes with a mathematical modelling method. To reduce insertion loss and crosstalk noise, XRing constructs efficient shortcuts between the network nodes that suffer long signal paths and creates openings on ring waveguides so that the PDN can easily access the network nodes without causing waveguide crossings. The experimental results show that XRing outperforms other WRONoC routers in reducing insertion loss and crosstalk noise. In particular, more than 98% of signals in XRing do not suffer first-order crosstalk noise, which significantly enhances the signal quality.
00:39 CET EXPLOITING ASSERTIONS MINING AND FAULT ANALYSIS TO GUIDE RTL-LEVEL APPROXIMATION
Authors:
Alberto Bosio1, Samuele Germiniani2, Graziano Pravadelli2 and Marcello Traiola3
1Lyon Institute of Nanotechnology, FR; 2University of Verona, IT; 3Inria / IRISA, FR
Abstract
In Approximate Computing (AxC), several design exploration approaches and metrics have been proposed so far to identify the approximation targets at the gate level, but only a few of them work on RTL descriptions. In addition, the possibility of combining the information derived from assertions and fault analysis is still under-explored. To fill this gap, this paper proposes an automatic methodology to guide the AxC design exploration at the RTL level. Two approximation techniques are considered, bit-width and statement reduction, while fault injection is used to mimic their effect on the design under approximation. Assertions are then dynamically mined from the original RTL description and the variation of their truth values is evaluated with respect to fault injections. These variations are then used to rank and cluster different approximation alternatives, according to their estimated impact on the functionality of the target design. The experiments carried out on a case study show that the proposed approach represents a promising solution toward the automation of AxC design exploration at the RTL level.
00:39 CET AN EFFICIENT FAULT INJECTION ALGORITHM FOR IDENTIFYING UNIMPORTANT FFS IN APPROXIMATE COMPUTING CIRCUITS
Authors:
Jiaxuan LU, Yutaka MASUDA and Tohru ISHIHARA, Nagoya University, JP
Abstract
Approximate computing (AC) has attracted much attention, contributing to energy saving and performance improvement by accurately performing the important computations and approximating the others. In order to make AC circuits practical, we need to carefully determine how important each computation is, and then appropriately approximate the unimportant computations while maintaining the required computational quality. In this paper, we focus on the importance of computations at the Flip-Flop (FF) level and propose a novel importance evaluation methodology. The key idea of the proposed methodology is a two-step fault injection algorithm to extract the near-optimal set of unimportant FFs in the circuit. In the first step, the proposed methodology derives the importance of each FF. Then, in the second step, the proposed methodology extracts the set of unimportant FFs in a binary search manner. Thanks to the two-step strategy, the proposed algorithm reduces the complexity of architecture exploration from an exponential order to a linear order without understanding the functionality and behavior of the target application program. In a case study of an image processing accelerator, the proposed algorithm identifies the candidates of unimportant FFs depending on the given constraints. The bit-width scaling for extracted FFs with the proposed algorithm reduces the circuit area by 29.6% and saves power dissipation by 35.8% under the ASIC implementation. Under the FPGA implementation, the dynamic power dissipation is saved by 37.0% while satisfying the PSNR constraint.
00:39 CET HARDWARE-AWARE AUTOMATED NEURAL MINIMIZATION FOR PRINTED MULTILAYER PERCEPTRONS
Authors:
Argyris Kokkinis1, Georgios Zervakis2, Kostas Siozios3, Mehdi Tahoori4 and Joerg Henkel5
1Aristotle University of Thessaloniki, GR; 2University of Patras, GR; 3Department of Physics, Aristotle University of Thessaloniki, GR; 4Karlsruhe Institute of Technology, DE; 5KIT, DE
Abstract
Printed Electronics (PEs) set up a new path for the realization of ultra low-cost circuits that can be deployed in everyday consumer goods and disposables. In addition, PEs satisfy requirements such as porosity, flexibility, and conformity. However, the large feature sizes in PEs and limited device counts incur high restrictions and increased area and power overheads, prohibiting the realization of complex circuits. As a result, although printed Machine Learning (ML) circuits could open new horizons and bring "intelligence" to such domains, the implementation of complex classifiers, as required in target applications, is hardly feasible. In this paper, we aim to address this and focus on the design of battery-powered printed Multilayer Perceptrons (MLPs). To that end, we exploit fully-customized (bespoke) circuit implementations, enabled in PEs, and propose a hardware-aware neural minimization framework dedicated to such customized MLP circuits. Our evaluation demonstrates that, for up to 5% accuracy loss, our co-design methodology enables, for the first time, battery-powered operation of complex printed MLPs.

S_E1 Optimized software architecture towards an improved utilization of hardware features

Date: Wednesday, 19 April 2023
Time: 14:00 - 15:30 CET

Time Label Presentation Title
Authors
00:39 CET MARB: BRIDGE THE SEMANTIC GAP BETWEEN OPERATING SYSTEM AND APPLICATION MEMORY ACCESS BEHAVIOR
Authors:
Haifeng Li1, Ke Liu1, Ting Liang2, Zuojun Li3, Tianyue Lu2, Yisong Chang2, Hui Yuan4, Yinben Xia4, Yungang Bao5, Mingyu Chen5 and Yizhou Shan6
1ICT, CN; 2Institute of Computing Technology, Chinese Academy of Sciences, CN; 3Institute of Computing Technology, CN; 4Huawei, CN; 5ICT, CAS, CN; 6Huawei Cloud, CN
Abstract
The virtual memory subsystem (VMS) is a long-standing and integral part of an operating system (OS). It plays a vital role in enabling remote memory systems over fast data center networks and is promising in terms of practicality and generality. Specifically, these systems use three VMS mechanisms: demand paging, page swapping, and page prefetching. However, the VMS inherent data path is costly, which takes a huge toll on performance. Despite prior efforts to propose page swapping and prefetching algorithms to minimize the occurrences of the data path, they still fall short due to the semantic gap between the OS and applications – the VMS has limited knowledge of its running applications' memory access behaviors. In this paper, orthogonal to prior efforts, we take a fundamentally different approach by building an efficient framework to collect full memory access traces at the local bus and make them available to the OS through CPU cache. Consequently, the page swapping and page prefetching can use this trace to make better decisions, thereby improving the overall performance of VMS-based systems. We implement a proof-of-concept prototype on commodity x86 servers using a hardware-based memory tracking tool. To showcase our framework's benefits, we integrate it with a state-of-the-art disaggregated memory system and the default kernel page eviction subsystem. Our evaluation shows promising improvements.
00:39 CET SAT-MAPIT: A SAT-BASED MODULO SCHEDULING MAPPER FOR COARSE GRAIN RECONFIGURABLE ARCHITECTURES
Authors:
Cristian Tirelli1, Lorenzo Ferretti2 and Laura Pozzi1
1USI Lugano, CH; 2University of California Los Angeles, US
Abstract
Coarse-Grain Reconfigurable Arrays (CGRAs) are emerging low-power architectures aimed at accelerating compute-intensive application loops. The acceleration that a CGRA can ultimately provide, however, heavily depends on the quality of the mapping, i.e. on how effectively the loop is compiled onto the given platform. State-of-the-art compilation techniques achieve mapping through modulo scheduling, a strategy which attempts to minimize the II (Iteration Interval) needed to execute a loop, and they do so usually through well-known graph algorithms, such as Max-Clique Enumeration. We address the mapping problem through a SAT formulation, instead, and thus explore the solution space more effectively than current SoA tools. To formulate the SAT problem, we introduce an ad-hoc schedule called the kernel mobility schedule (KMS), which we use in conjunction with the data-flow graph and the architectural information of the CGRA in order to create a set of Boolean statements that describe all constraints to be obeyed by the mapping for a given II. We then let the SAT solver efficiently navigate this complex space. As in other SoA techniques, the process is iterative: if a valid mapping does not exist for the given II, the II is increased and a new KMS and set of constraints is generated and solved. Our experimental results show that SAT-MapIt obtains better results compared to SoA alternatives in 40% of the benchmarks explored: sometimes finding a lower II, and other times even finding a valid mapping when none could previously be found.
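The iterative structure described in the abstract can be pictured with a short Python sketch; this is illustrative only, and encode_mapping() and sat_solve() are hypothetical placeholders standing in for the KMS-based constraint generation and an off-the-shelf SAT solver.

def map_loop(dfg, cgra, min_ii, max_ii=64):
    for ii in range(min_ii, max_ii + 1):
        clauses = encode_mapping(dfg, cgra, ii)   # hypothetical: constraints a mapping must obey at this II
        model = sat_solve(clauses)                # hypothetical: returns None if unsatisfiable
        if model is not None:
            return ii, model                      # first satisfiable II yields the mapping
    return None                                   # no mapping found up to max_ii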
00:39 CET LIVENESS-AWARE CHECKPOINTING OF ARRAYS FOR EFFICIENT INTERMITTENT COMPUTING
Authors:
Youngbin Kim, Yoojin Lim and Chae Deok Lim, ETRI, KR
Abstract
Intermittent computing enables computing under environments that may experience frequent and unpredictable power failures, such as energy harvesting systems. It relies on checkpointing to preserve computing progress between power cycles, which often incurs significant overhead due to energy-expensive writes to Non-Volatile Memory (NVM). In this paper, we present LACT (Liveness-Aware CheckpoinTing), an approach to reducing the size of checkpointed data by exploiting the liveness of memory objects: excluding dead memory objects from checkpointing does not affect the correctness of the program. In particular, LACT can analyze the liveness of arrays, which take up most of the memory space but are not analyzable by existing methods for detecting the liveness of scalar objects. Using the liveness information of arrays, LACT determines the minimized checkpoint range for the arrays at compile time without adding any runtime overhead. Our evaluation shows that LACT achieves an additional reduction of checkpointed data size of 37.8% on average over the existing state-of-the-art technique. Also, our experiments in a real energy harvesting environment show that LACT can reduce the execution time of applications by 27.7% on average.
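The core idea (skip dead objects when writing a checkpoint) can be sketched in a few lines of Python; this is illustrative only and not the paper's compiler analysis, and nvm_write, objects, and live_ranges are assumed inputs that the compile-time liveness analysis would provide.

def checkpoint(nvm_write, objects, live_ranges, site):
    # objects: name -> (data_bytes, length); live_ranges: name -> (start, end) program points,
    # assumed to come from a compile-time liveness analysis.
    saved = 0
    for name, (data, length) in objects.items():
        start, end = live_ranges[name]
        if start <= site <= end:        # dead objects are simply not written to NVM
            nvm_write(name, data)
            saved += length
    return saved                        # bytes actually checkpointed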
00:39 CET SERICO: SCHEDULING REAL-TIME I/O REQUESTS IN COMPUTATIONAL STORAGE DRIVES
Authors:
Yun HUANG1, Nan Guan2, Shuhan BAI3, Tei-Wei Kuo4 and Jason Xue2
1City University of Hong Kong, CN; 2City University of Hong Kong, HK; 3City University of Hong Kong; Huazhong University of Science and Technology, CN; 4National Taiwan University, TW
Abstract
The latency and energy consumption caused by I/O accesses are significant in data-centric computing systems. A Computational Storage Drive (CSD) can largely reduce data movement, and thus reduce I/O latency and energy consumption, by performing near-data processing, i.e., offloading some data processing to processors inside the storage device. In this paper, we study the problem of how to efficiently utilize the limited processing and memory resources of a CSD to simultaneously serve multiple I/O requests from different applications with different real-time requirements. We propose SERICO, a novel technique for scheduling computational I/O requests in CSDs. The key idea of SERICO is to perform admission control of real-time computational I/O requests by online schedulability analysis, to avoid wasting the processing capacity of the CSD on meaningless work for requests that are deemed to violate their timing constraints anyway. Each admitted computational I/O request is served in a controlled manner with carefully designed parameters, to meet its timing constraint with minimal memory cost. We evaluate SERICO with both synthetic workloads on simulators and representative applications on a realistic CSD platform. Experimental results show that SERICO significantly outperforms the baseline method currently used by the CSD device and the standard deadline-driven scheduling approach.
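To illustrate what admission control by schedulability analysis means in principle, here is a minimal Python sketch using a textbook utilization-based EDF test as a stand-in; it is not SERICO's actual analysis, and the request model (processing time, period) is an assumption for the example.

def admit(admitted, new_req, capacity=1.0):
    # admitted / new_req: (processing_time, period) pairs for periodic computational I/O requests.
    reqs = admitted + [new_req]
    utilization = sum(c / t for c, t in reqs)
    return utilization <= capacity      # reject requests that would be doomed to miss deadlines

print(admit([(2.0, 10.0), (1.0, 5.0)], (3.0, 20.0)))   # True: total utilization 0.55 <= 1.0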
00:39 CET REGION-BASED FLASH CACHING WITH JOINT LATENCY AND LIFETIME OPTIMIZATION IN HYBRID SMR STORAGE SYSTEMS
Authors:
Zhengang Chen1, Guohui Wang1, Zhiping Shi2, Yong Guan3 and Tianyu Wang4
1College of Information Engineering, Capital Normal University, CN; 2Beijing Key Laboratory of Electronic System Reliability Technology, Capital Normal University, CN; 3International Science and Technology Cooperation Base of Electronic System Reliability and Mathematical Interdisciplinary, Capital Normal University, CN; 4The Chinese University of Hong Kong, HK
Abstract
The frequent Read-Modify-Write operations (RMWs) in Shingled Magnetic Recording (SMR) disks severely degrade the random write performance of the system. Although the adoption of persistent cache (PC) and built-in NAND flash cache alleviates some of the RMWs, when the cache is full, the triggered write-back operations still prolong I/O response time, and the erasure of NAND flash also sacrifices its lifetime. In this paper, we propose a Region-based Co-optimized strategy named Multi-Regional Collaborative Management (MCM) to optimize the average response time by separately managing sequential/random and hot/cold data, and to extend the NAND flash lifetime by a region-aware wear leveling strategy. The experimental results show that our MCM reduces 71% of the average response time and 96% of RMWs on average compared with Skylight (the baseline). Compared with the state-of-the-art flash-based cache (FC) approach, we still reduce the average response time and flash erase operations by 17.2% and 33.32%, respectively.
00:39 CET GEM-RL: GENERALIZED ENERGY MANAGEMENT OF WEARABLE DEVICES USING REINFORCEMENT LEARNING
Authors:
Toygun Basaklar1, Yigit Tuncel1, Suat Gumussoy2 and Umit Ogras1
1University of Wisconsin - Madison, US; 2Siemens Corporate Technology, US
Abstract
Energy harvesting (EH) and management (EM) have emerged as enablers of self-sustained wearable devices. Since EH alone is not sufficient for self-sustainability due to uncertainties of ambient sources, current EM approaches use expected EH predictions to optimize the application performance. Thus, their performance depends critically on EH prediction accuracy. In contrast, we present a generalized energy management framework (GEM-RL) using multi-objective reinforcement learning. GEM-RL learns the trade-off between utilization and the battery energy level of the target device under dynamic energy harvesting patterns and battery conditions. It also uses a lightweight approximate dynamic programming technique that utilizes the trained MORL agent to optimize the utilization of the device over a longer period. Thorough experiments show that, on average, GEM-RL achieves Pareto front solutions within 5.4% of the offline Oracle for a given day. For a 7-day horizon, it achieves utility within 4% of the offline Oracle and up to 50% higher utility compared to baseline EM approaches. The hardware implementation of GEM-RL on a wearable device shows negligible execution time (1.98 ms) and energy consumption (23.17 uJ) overheads.
00:39 CET VIX: ANALYSIS-DRIVEN COMPILER FOR EFFICIENT LOW-PRECISION DIFFERENTIABLE INFERENCE
Authors:
Ashitabh Misra1, Jacob Laurel2 and Sasa Misailovic3
1University of Illinois at Urbana-Champaign, US; 2University of Illinois Urbana-Champaign, US; 3UIUC, US
Abstract
As more and more stochastic data is processed onboard edge devices, these systems must constantly make decisions under uncertainty. This challenge necessitates principled embedded compiler support for time- and energy-efficient probabilistic inference. Compiling probabilistic inference to run on the edge is significantly understudied, and the few existing works are limited as they exclusively use computationally expensive MCMC. Thus these works cannot exploit faster differentiable inference algorithms which can better scale to larger data sizes that are representative of realistic workloads in the edge setting. However, if a developer were to naively write code for differentiable inference, it would suffer from 1) expensive floating-point computation, or 2) the difficulty of devising quantization schemes, as gradients are notoriously unstable at low precision. We propose ViX, which is the first compiler for low-precision probabilistic programming with variational inference. ViX generates optimized variational inference code in reduced precision by automatically exploiting Bayesian domain knowledge and analytical mathematical properties to ensure that low-precision gradients can still be safely used. By exposing additional knobs for approximation at both the level of the algorithm and the generated code, ViX can scale inference to much larger datasets than any previous work on edge probabilistic programming while attaining both high accuracy and significant speedup.
00:39 CET CHAMELEON: DUAL MEMORY REPLAY FOR ONLINE CONTINUAL LEARNING ON EDGE DEVICES
Authors:
Shivam Aggarwal, Kuluhan Binici and Tulika Mitra, National University of Singapore, SG
Abstract
Once deployed on edge devices, a deep neural network model should dynamically adapt to newly discovered environments and personalize its utility for each user. The system must be capable of continual learning, i.e., learning new information from a temporal stream of data in situ without forgetting previously acquired knowledge. However, the prohibitive intricacies of such a personalized continual learning framework stand at odds with limited compute and storage on edge devices. Existing continual learning methods rely on massive memory storage to preserve the past data while learning from the incoming data stream. We propose Chameleon, a hardware-friendly continual learning framework for user-centric training with dual replay buffers. The proposed strategy leverages the hierarchical memory structure available on most edge devices, introducing a short-term replay store in the on-chip memory and a long-term replay store in the off-chip memory to acquire new information while retaining past knowledge. Extensive experiments on two large-scale continual learning benchmarks demonstrate the efficacy of our proposed method, achieving accuracy better than or comparable to existing state-of-the-art techniques while reducing the memory footprint by roughly 16x. Our method achieves up to 7x speedup and energy efficiency on a custom FPGA-based hardware accelerator, Jetson Nano, and Google's EdgeTPU. Our code is available at https://github.com/chameleon-anon/Chameleon.
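As an illustration of the dual replay buffer idea (and only that: class and parameter names below are invented for the sketch, not Chameleon's implementation), a small short-term store can feed an evicted sample into a larger long-term store, and training minibatches mix both.

import random

class DualReplay:
    # short-term store models a small on-chip buffer; long-term store models a larger off-chip one.
    def __init__(self, short_cap=64, long_cap=1024):
        self.short, self.long = [], []
        self.short_cap, self.long_cap = short_cap, long_cap

    def add(self, sample):
        self.short.append(sample)
        if len(self.short) > self.short_cap:
            evicted = self.short.pop(0)                        # oldest short-term sample
            if len(self.long) < self.long_cap:
                self.long.append(evicted)
            else:                                              # reservoir-style replacement
                self.long[random.randrange(self.long_cap)] = evicted

    def minibatch(self, k):
        pool = self.short + self.long                          # mix recent and old samples
        return random.sample(pool, min(k, len(pool)))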
00:39 CET FAGC: FREE SPACE FRAGMENTATION AWARE GC SCHEME BASED ON OBSERVATIONS OF ENERGY CONSUMPTION
Authors:
Lihua Yang1, Zhipeng Tan2, Fang Wang2, Yang Xiao2, Wei Zhang1 and Biao He3
1National University of Defense Technology, CN; 2Huazhong University of Science and Technology, CN; 3Huawei Technologies Co., LTD, CN
Abstract
Smartphones are everyday necessities with limited power supply due to portability, and charging a smartphone twice a day or more affects the user experience. Flash-Friendly File System (F2FS) is a widely used log-structured file system for smartphones. The free space fragmentation in F2FS that degrades performance mainly consists of invalid blocks, which F2FS reclaims through garbage collection (GC). We explore the energy consumption of GC and its effect on reducing free space fragments, and find that a single background GC consumes a large amount of energy while its effect on reducing free space fragments is limited. This motivates us to improve the energy efficiency of GC. Based on data analysis, we reassess how much free space constitutes a free space fragment and use a free space fragmentation factor to quickly measure the degree of free space fragmentation. We then propose the free space Fragmentation-Aware GC scheme (FAGC). FAGC optimizes the selection of victim segments and the migration of valid blocks to reduce the number of GCs and to improve the energy efficiency of the free space reclaimed by each GC. Experiments on a real platform show that FAGC reduces the GC count by 82.68% and 74.51% compared with traditional F2FS and ATGC, the latest GC optimization of F2FS, respectively. FAGC reduces energy consumption by 164.37 J and 100.64 J compared to traditional F2FS and ATGC, respectively, for a synthetic benchmark, and reduces running time by 24.4%-40.9% for large-scale applications.
00:39 CET TRANSLIB: A LIBRARY TO EXPLORE TRANSPRECISION FLOATING-POINT ARITHMETIC ON MULTI-CORE IOT END-NODES
Authors:
Seyed Ahmad Mirsalari1, Giuseppe Tagliavini1, Davide Rossi2 and Luca Benini3
1University of Bologna, IT; 2University Of Bologna, IT; 3Università di Bologna and ETH Zurich, IT
Abstract
Reduced-precision floating-point (FP) arithmetic is being widely adopted in many application fields. This approach considerably reduces memory footprint and execution time, critical resources for battery-powered Internet of Things (IoT) end-nodes. However, reduced-precision computations must meet end-to-end precision constraints to be acceptable at the application level. This work introduces TransLib, an open-source kernel library based on transprecision computing principles, which provides knobs to exploit different FP data types (i.e., float, float16, and bfloat16), also considering the trade-off between homogeneous and mixed-precision solutions. Each kernel design includes a Python model and a C program. The Python model generates the input dataset, computes the kernel output as a golden reference, and assesses the accuracy using a customizable error metric. The C program provides several code variants (a sequential version with DSP optimizations, a parallel version with low synchronization overhead, and packed-SIMD vectorization of 16-bit FP types) to guarantee efficient execution on diverse end-node configurations. We demonstrate the capabilities of the proposed library and its collaterals on PULP, a 32-bit microcontroller (MCU) coupled with a parallel programmable accelerator composed of 8 RISC-V cores enhanced with various DSP extensions, including 16-bit SIMD floating-point transprecision units. On average, TransLib kernels achieve an IPC of 0.94, with the vectorized 16-bit float variants achieving a speed-up of 1.64×. The parallel variants achieve a speed-up of 1.97×, 3.91×, and 7.59× on 2, 4, and 8 cores, respectively. The float16 accuracy in terms of mean squared error is around 1.35 × 10^-4, very close to the roundoff error of this format. Moreover, adopting reduced-precision FP types implies a reduction of the memory footprint between 25% and 50%. Finally, we show that the mixed-precision variants increase the accuracy by 30× at the cost of 2.09× execution time and 1.35× memory footprint compared to the vectorized float16 variants.
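The golden-model accuracy check described above can be pictured with a few lines of NumPy; this is a generic sketch of comparing a float16 kernel against a float32 reference via mean squared error, not TransLib's actual Python models or kernels.

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(1024).astype(np.float32)
b = rng.standard_normal(1024).astype(np.float32)

golden = a * b                                                        # float32 golden reference
low_prec = (a.astype(np.float16) * b.astype(np.float16)).astype(np.float32)

mse = float(np.mean((golden - low_prec) ** 2))
print(f"float16 MSE vs float32 golden: {mse:.2e}")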
00:39 CET CFU PLAYGROUND: A HARDWARE-SOFTWARE CO-DESIGN FRAMEWORK FOR TINY MACHINE LEARNING ON FPGAS
Authors:
Shvetank Prakash1, Timothy Callahan2, Joseph Bushagour3, Colby Banbury4, Alan Green2, Pete Warden5, Tim Ansell2 and Vijay Janapa Reddi1
1Harvard University, US; 2Google, US; 3Purdue University, US; 4Harvard, US; 5Stanford University, US
Abstract
The need for efficient processing of neural networks has given rise to the development of hardware accelerators. The increased adoption of specialized hardware has highlighted the need for more agile design flows for hardware-software co-design and domain-specific optimizations. We present CFU Playground, a full-stack open-source framework that enables rapid and iterative design of machine learning (ML) accelerators for embedded ML systems. Our toolchain integrates open-source software, open-source RTL generators, and open-source FPGA tools for synthesis, placement, and routing. This full-stack framework gives users access to explore bespoke architectures that are customized and co-optimized for embedded ML. The rapid deploy-profile-optimize feedback loop lets ML hardware and software developers achieve significant returns out of a relatively small investment in customization. Using CFU Playground's design loop, we show substantial speedups between 55x and 75x. The soft CPU coupled with the accelerator opens up a new, rich design space between the two components, which we explore in an automated fashion using Vizier, a black-box optimization service.

S_S2 Physical attacks and countermeasures

Date: Wednesday, 19 April 2023
Time: 14:00 - 15:30 CET

Time Label Presentation Title
Authors
00:39 CET TABLE RE-COMPUTATION BASED LOW ENTROPY INNER PRODUCT MASKING SCHEME
Authors:
Jingdian Ming1, Yongbin Zhou2, Wei Cheng3 and Huizhong Li4
1Institute of Information Engineering, Chinese Academy of Sciences, CN; 2School of Cyber Security, Nanjing University of Science and Technology, Nanjing, CN; 3LTCI, Telecom Paris, Institut Polytechnique de Paris, 91120, Palaiseau, FR; 4Institute of Information Engineering, Chinese Academy of Sciences, Beijing, CN
Abstract
Masking is a popular countermeasure due to its provable security. Table re-computation based Boolean masking (BM) is efficient for small numbers of masking shares, while addition chain based inner product masking (IPM) provides a higher security order than BM. As a result, the natural question is: can we design a masking scheme whose cost is close to that of table re-computation based BM while providing security comparable to that of addition chain based IPM? In this paper, we propose a table re-computation based IPM scheme that provides third-order security while being only slightly more expensive than table re-computation based BM. Furthermore, we improve the side-channel security of IPM by randomly selecting the parameter $L$ from an elaborated low entropy set, which we call low entropy inner product masking (LE-IPM). On an Intel Core i7-4790 CPU and an ARM Cortex-M4 based MCU, we implemented four masking schemes for AES, namely the addition chain based IPM and the table re-computation based BM, IPM, and LE-IPM. Our proposals perform slightly slower (by about 0.8 times) than table re-computation based BM but significantly faster (at least 30 times) than addition chain based IPM. Furthermore, we assess the security of our proposals using the standard test vector leakage assessment (TVLA) methodology. Our proposals provide the expected security against side-channel attacks according to the evaluation.
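For readers unfamiliar with table re-computation, the following is a minimal Python sketch of the textbook first-order Boolean variant (background only; it is not the paper's IPM-based scheme): the S-box table is re-masked so it can be indexed with a masked input and returns a masked output.

import secrets

def remask_table(sbox, m_in, m_out):
    # Build T' with T'[x ^ m_in] = sbox[x] ^ m_out, so a masked input indexes it
    # directly and the result stays masked by m_out.
    masked = [0] * len(sbox)
    for x in range(len(sbox)):
        masked[x ^ m_in] = sbox[x] ^ m_out
    return masked

sbox = list(range(256))                       # placeholder table (the AES S-box in practice)
m_in, m_out = secrets.randbelow(256), secrets.randbelow(256)
T = remask_table(sbox, m_in, m_out)

x = 0x3A
masked_y = T[x ^ m_in]                        # computed without ever exposing x unmasked
assert masked_y ^ m_out == sbox[x]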
00:39 CET SCFI: STATE MACHINE CONTROL-FLOW HARDENING AGAINST FAULT ATTACKS
Authors:
Pascal Nasahl1, Martin Unterguggenberger2, Rishub Nagpal2, Robert Schilling1, David Schrammel1 and Stefan Mangard2
1Graz University of Technology, AT; 2Graz University of Technology, Lamarr Security Research, AT
Abstract
Fault injection (FI) is a powerful attack methodology allowing an adversary to entirely break the security of a target device. As finite-state machines (FSMs) are fundamental hardware building blocks responsible for controlling systems, inducing faults into these controllers enables an adversary to hijack the execution of the integrated circuit. A common defense strategy mitigating these attacks is to manually instantiate FSMs multiple times and detect faults using a majority voting logic. However, as each additional FSM instance only provides security against one additional induced fault, this approach scales poorly in a multi-fault attack scenario. In this paper, we present SCFI: a strong, probabilistic FSM protection mechanism ensuring that control-flow deviations from the intended control-flow are detected even in the presence of multiple faults. At its core, SCFI consists of a hardened next-state function absorbing the execution history as well as the FSM's control signals to derive the next state. When either the absorbed inputs, the state registers, or the function itself are affected by faults, SCFI triggers an error with no detection latency. We integrate SCFI into a synthesis tool capable of automatically hardening arbitrary unprotected FSMs without user interaction and open-source the tool. Our evaluation shows that SCFI provides strong protection guarantees with a better area-time product than FSMs protected using classical redundancy-based approaches. Finally, we formally verify the resilience of the protected state machines using a pre-silicon fault analysis tool.
00:39 CET EASIMASK - TOWARDS EFFICIENT, AUTOMATED, AND SECURE IMPLEMENTATION OF MASKING IN HARDWARE
Authors:
Fabian Buschkowski1, Pascal Sasdrich2 and Tim Güneysu3
1Ruhr-University, DE; 2Ruhr-Universität Bochum, DE; 3Ruhr-Universität Bochum & DFKI, DE
Abstract
Side-Channel Analysis (SCA) is a major threat to implementations of mathematically secure cryptographic algorithms. Applying masking countermeasures to hardware-based implementations is both time-consuming and error-prone due to side-effects buried deeply in the hardware design process. As a consequence, we propose our novel framework EASIMASK in this work. Our semi-automated framework enables designers with little experience in hardware implementation, physical security, or the application of countermeasures to create a securely masked hardware implementation from an abstract description of a cryptographic algorithm. Its design flow relieves the developer of many challenges in the masking process of hardware implementations, while the generated implementations match the efficiency of hand-optimized designs from experienced security engineers. The modular approach can be mapped to arbitrary instantiations using different languages and transformations. We have verified the functionality, security, and efficiency of generated designs for several state-of-the-art symmetric cryptographic algorithms, such as the Advanced Encryption Standard (AES), Keccak, and PRESENT.
00:39 CET OBFUSLOCK: AN EFFICIENT OBFUSCATED LOCKING FRAMEWORK FOR CIRCUIT IP PROTECTION
Authors:
You Li, Guannan Zhao, Yunqi He and Hai Zhou, Northwestern University, US
Abstract
With the rapid evolution of the IC supply chain, circuit IP protection has become a critical realistic issue for the semiconductor industry. One promising technique to resolve the issue is logic locking. It adds key inputs to the original circuit such that only authorized users can get the correct function, and it modifies the circuit to obfuscate it against structural analysis. However, there is a trilemma among locking, obfuscation, and efficiency within all existing logic locking methods that at most two of the objectives can be achieved. In this work, we propose ObfusLock, the first logic locking method that simultaneously achieves all three objectives: locking security, obfuscation safety, and locking efficiency. ObfusLock is based on solid mathematical proofs, incurs small overheads (<5% on average), and has passed experimental tests of various existing attacks.
00:39 CET TEMPERATURE IMPACT ON REMOTE POWER SIDE-CHANNEL ATTACKS ON SHARED FPGAS
Authors:
Ognjen Glamocanin, Hajira Bazaz, Mathias Payer and Mirjana Stojilovic, EPFL, CH
Abstract
With the growing demand for hardware acceleration, FPGAs have recently been adopted by Amazon, Microsoft Azure, Alibaba, and many other major cloud service providers. However, researchers have shown that cloud FPGAs, when shared between multiple users, face a powerful threat of remote power side-channel analysis (SCA). FPGA time-to-digital converter (TDC) sensors enable adversaries to sense voltage fluctuations and, in turn, break cryptographic implementations or extract other types of confidential information with the help of machine learning (ML). The operating temperature of the TDC sensor affects the traces it acquires, and yet its impact on the success of remote power SCA attacks has largely been ignored in the literature. This paper attempts to fill this gap. We focus on two attack scenarios: correlation power analysis (CPA) and ML-based profiling attacks. We show that the temperature does have an impact on the success of remote power SCA: as the ambient temperature increases, the success rate of the CPA attack decreases. In-depth analysis reveals that TDC sensor measurements suffer from temperature-dependent effects, which, if ignored, can lead to misleading and overly optimistic results for ML-based profiling attacks. We find that random forest and long short-term memory ML classifiers are particularly affected. We end the paper with trace acquisition guidelines for minimizing the temperature effects and, consequently, obtaining a more realistic measure of success of the ML-based profiling attacks.
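As background for the CPA scenario, here is a bare-bones Python sketch of generic correlation power analysis with a Hamming-weight leakage model (illustrative only, not the paper's attack setup or TDC acquisition): each key guess's hypothetical leakage is correlated with the measured traces and the best-correlating guess wins.

import numpy as np

HW = np.array([bin(v).count("1") for v in range(256)])

def cpa_best_key(traces, plaintexts, sbox):
    # traces: (N, samples) float array; plaintexts: (N,) uint8 array; sbox: (256,) uint8 array
    tz = traces - traces.mean(axis=0)
    tnorm = np.sqrt((tz ** 2).sum(axis=0))
    scores = np.zeros(256)
    for k in range(256):
        model = HW[sbox[plaintexts ^ k]].astype(float)     # hypothetical leakage of the S-box output
        mz = model - model.mean()
        corr = (tz * mz[:, None]).sum(axis=0) / (tnorm * np.sqrt((mz ** 2).sum()) + 1e-12)
        scores[k] = np.abs(corr).max()                     # best correlation over all time samples
    return int(scores.argmax())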
00:39 CET APUF PRODUCTION LINE FAULTS: UNIQUENESS AND TESTING
Authors:
Yeqi Wei1, Wenjing Rao2 and Natasha Devroye2
1University of Illinois Chicago, US; 2University of Illinois at Chicago, US
Abstract
Arbiter Physically Unclonable Functions (APUFs) are low-cost hardware security primitives that may serve as unique digital fingerprints for ICs. To fulfill this role, it is critical for manufacturers to ensure that a batch of PUFs coming off the same design and production line have different truth tables, and uniqueness / inter-PUF-distance metrics have been defined to measure this. This paper points out that a widely used uniqueness metric fails to capture some special cases, which we remedy by proposing a modified uniqueness metric. We then look at two fundamental APUF-native production line fault models that affect uniqueness severely: the μ (abnormal mean of a delay difference element) and σ (abnormal variance of a delay difference element) faults. We propose test and diagnosis methods aimed at these two types of APUF production line faults, and show that these low-cost techniques can efficiently and effectively detect such faults and pinpoint the element of abnormality, without the (costly) need to directly measure the uniqueness metric of a PUF batch.
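For context, the conventional uniqueness metric (the one the paper argues needs refinement, not its modified version) is the average pairwise Hamming distance between the response vectors of a batch, ideally close to 50%. A small Python sketch:

import numpy as np

def uniqueness(responses):
    # responses: (k, n) 0/1 array, one row of n response bits per PUF instance.
    k, n = responses.shape
    total = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            total += np.count_nonzero(responses[i] != responses[j]) / n
    return 2.0 * total / (k * (k - 1)) * 100.0   # percentage; ~50% is ideal

rng = np.random.default_rng(0)
print(uniqueness(rng.integers(0, 2, size=(8, 1024))))   # close to 50% for ideal, independent PUFs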
00:39 CET FAULT MODEL ANALYSIS OF DRAM UNDER ELECTROMAGNETIC FAULT INJECTION ATTACK
Authors:
Qiang Liu, Longtao Guo and Honghui Tang, Tianjin University, CN
Abstract
Electromagnetic fault injection (EMFI) attack has posed serious threats to ICs' security. Memory storing sensitive codes and data has become the first choice of attacking targets. This work performs a thorough characterization of the induced faults and the associated fault model of EMFI attacks on DRAM. Specifically, we firstly carry out the sensitivity analysis of various types of memory to EMFI by designing a set of experiments. The analysis shows that DRAM is more sensitive to EMFI. Then, we classify the induced faults in DRAM and formulate the fault models. Finally, we find the underlying reasons that explain the observed fault models by circuit-level simulation of DRAM under EMFI. The in-depth understanding of the fault models will guide countermeasure design of DRAM against EMFI attacks.
00:39 CET EXPANDING IN-CONE OBFUSCATED TREE FOR ANTI SAT ATTACK
Authors:
RuiJie Wang1, Li-Nung Hsu1, Yung-Chih Chen2 and TingTing Hwang1
1National Tsing Hua University, TW; 2National Taiwan University of Science and Technology, TW
Abstract
Logic locking is a hardware security technology to protect circuit designs from overuse, piracy, and reverse engineering. It protects a circuit by inserting key gates to hide the circuit functionality, so that the circuit is functional only when a correct key is applied. In recent years, encrypting the point function, e.g., AND-tree, in a circuit has been shown to be promising to resist SAT attack. However, the encryption technique may suffer from two problems: First, the tree size may not be large enough to achieve desired security. Second, SAT attack could break the encryption in one iteration when it finds a specific input pattern, called remove-all DIP. Thus, in this paper, we present a new method for constructing the obfuscated tree. We first apply the sum-of-product transformation to find the largest AND-tree in a circuit, and then insert extra variables with the proposed split-compensate operation to further enlarge the AND-tree and mitigate the remove-all DIP issue. The experimental results show that the proposed obfuscated tree can effectively resist SAT attack.
00:39 CET SHELL: SHRINKING EFPGA FABRICS FOR LOGIC LOCKING
Authors:
Hadi Mardani Kamali1, Kimia Zamiri Azar1, Farimah Farahmandi1 and Mark Tehranipoor2
1University of Florida, US; 2Intel Charles E. Young Preeminence Endowed Chair Professor in Cybersecurity, Associate Chair for Research and Strategic Initiatives, ECE Department, University of Florida, US
Abstract
The utilization of fully reconfigurable logic and routing modules may be considered a potential, and even provably resilient, technique against intellectual property (IP) piracy and integrated circuit (IC) overproduction. The embedded FPGA (eFPGA) is one instance that could be used for IP redaction, hiding the functionality throughout the untrusted stages of the IC supply chain. The eFPGA architecture, albeit reliable, unnecessarily inflates the die size, even though it is supposed to operate at fine granularity targeting small modules/IPs. In this paper, we propose SheLL, which primarily embeds the interconnects (routing channels) of the design and secondarily twists the minimal logic parts of the design into the eFPGA architecture. In SheLL, the eFPGA architecture is customized for this specific logic locking methodology, allowing us to minimize the overhead of the eFPGA fabric as much as possible. Our experimental results demonstrate that SheLL guarantees robustness against notable attacks while its overhead is significantly lower compared to the existing eFPGA-based competitors.
00:39 CET HIGHLIGHTING TWO EM FAULT MODELS WHILE ANALYZING A DIGITAL SENSOR LIMITATIONS
Authors:
Roukoz Nabhan1, Jean-Max Dutertre1, Jean-Baptiste Rigaud1, Jean-Luc Danger2 and Laurent Sauvage2
1Mines Saint-Etienne, CEA, Leti, Centre CMP, F-13541 Gardanne, France, FR; 2LTCI, Telecom Paris, Institut Mines-Telecom, 91120 Palaiseau, France, FR
Abstract
Fault injection attacks can be carried out against an operating circuit by exposing it to EM perturbations. These attacks can be detected using embedded digital sensors based on the EM fault injection mechanism, as the one introduced by El-Baze et al. which uses the sampling fault model. We tested on an experimental basis the efficiency of this sensor embedded in the AES accelerator of an FPGA. It proved effective when the target was clocked at moderate frequency (the injected faults were consistent with the sampling fault model). As the clock frequency was progressively increased, faults started to escape detection, which raises warnings about possible limitations of the sampling model. Further tests at frequencies close to the target maximal frequency revealed faults injected according to a timing fault model. Both series of experimental results ascertain that EM injection can follow at least two different fault models. Undetected faults and the existence of different fault injection mechanisms cast doubt upon the use of sensors based on a single model.
00:39 CET SECURING HETEROGENEOUS 2.5D ICS AGAINST IP THEFT THROUGH DYNAMIC INTERPOSER OBFUSCATION
Authors:
Jonti Talukdar1, Arjun Chaudhuri1, Jinwoo Kim2, Sung-Kyu Lim2 and Krishnendu Chakrabarty1
1Duke University, US; 2Georgia Institute of Technology, US
Abstract
Recent breakthroughs in heterogeneous integration technologies using 2.5D and 3D ICs have been key to advances in the semiconductor industry. However, heterogeneous integration has also led to several sources of distrust due to the use of third-party IP, testing, and fabrication facilities in the design and manufacturing process. Recent work on 2.5D IC security has only focused on attacks that can be mounted through rogue chiplets integrated in the design. Thus, existing solutions implement inter-chiplet communication protocols that prevent unauthorized data modification and interruption in a 2.5D system. However, none of the existing solutions offers inherent security against IP theft. We develop a comprehensive threat model for 2.5D systems indicating that such systems remain vulnerable to IP theft. We present a method that prevents IP theft by obfuscating the connectivity of chiplets on the interposer using reconfigurable interconnection networks. We also evaluate the PPA impact and security offered by our proposed scheme.
00:39 CET WARM-BOOT ATTACK ON MODERN DRAMS
Authors:
Yichen Jiang, Shuo Wang, Renato Jansen Figueiredo and Yier Jin, University of Florida, US
Abstract
Memory plays a critical role in storing almost all computation data for various applications, including those with sensitive data such as bank transactions and critical business management. As a result, protecting memory from attackers with physical access is ultimately important. Various memory attacks have been proposed, among which "cold boot" and RowHammer are two leading examples. DRAM manufacturers have deployed a series of protection mechanisms to counter these attacks. Even with the latest protection techniques, DRAM may still be vulnerable to attackers with physical access. In this paper, we propose a novel "warm boot" attack which utilizes external power supplies to bypass the existing protection mechanisms and steal data from modern SODIMM DDR4 memory. The proposed "warm boot" attack is applied to various DRAM chips from different brands. Combined with a new memory re-arrangement technique, the "warm boot" attack can achieve a data recovery rate as high as 94% from SODIMM DDR4 memory.
00:39 CET LOW-COST FIRST-ORDER SECURE BOOLEAN MASKING IN GLITCHY HARDWARE
Authors:
Dilip Kumar S V, Josep Balasch, Benedikt Gierlichs and Ingrid Verbauwhede, KU Leuven, BE
Abstract
We describe how to securely implement the logical AND of two bits in hardware in the presence of glitches without the need for fresh randomness, and we provide guidelines for the composition of circuits. As a case study, we design, implement and evaluate a DES core. Our goal is an overall practically relevant tradeoff between area, latency, randomness cost, and security. We focus on first-order secure Boolean masking and we do not aim for provable security. The resulting DES engine shows no evidence of first-order leakage in a non-specific leakage assessment with 50M traces.
00:39 CET TIPLOCK: KEY-COMPRESSED LOGIC LOCKING USING THROUGH-INPUT-PROGRAMMABLE LOOKUP-TABLES
Authors:
Kaveh Shamsi and Rajesh Datta, University of Texas at Dallas, US
Abstract
Logic locking involves converting an original circuit to a semi-programmable "locked" one so as to hide its precise functionality from an untrusted foundry or end-user. The locked circuit is programmed to the correct functionality post-fabrication in a secure facility using a secret key bit vector. Traditionally, these key bits require a secure scan-chain path for programming. In this paper, we instead utilize the original inputs of the individual gates themselves to program the gates, and hence the functionality, effectively compressing the amount of physical key management logic needed on the device. This requires novel through-input-programmable (TIP) gate designs, which we build using a modification of existing multiplexer-based look-up tables. Additionally, an algorithmic analysis of the circuit is needed to find candidate gates, discover further programming structure sharing, and generate programming sequences. This we accomplish with Boolean satisfiability and graph-coloring queries. We report deobfuscation attack runtimes on benchmark circuits. Our proposed TIPLock achieves area reductions of 50-70% on benchmark circuits compared to traditional scan-chain-based key-programming logic.

A8 Industrial Experiences Brief Papers

Date: Wednesday, 19 April 2023
Time: 16:30 - 18:00 CET

Time Label Presentation Title
Authors
00:39 CET MULTIPHYSICS DESIGN AND SIMULATION METHODOLOGY FOR DENSE WDM SILICON PHOTONICS
Authors:
Jinsung Youn1, Luca Ramini2, Zeqin Lu3, Ahsan Alam3, James Pond3, Marco Fiorentino1 and Raymond Beausoleil1
1Hewlett Packard Enterprise, US; 2Hewlett Packard Enterprise, IT; 3Ansys, CA
Abstract
We present a novel design methodology covering multiphysics simulation workflows for microring-based dense wavelength division multiplexing (DWDM) Silicon Photonics (SiPh) circuits used for high-performance computing systems and data centers. The main workflow is an electronics-photonics co-simulation comprising various optical devices from a SiPh process design kit (PDK), electronic circuits designed with a commercial CMOS foundry's PDK, and channel S-parameter models, such as interposers and packages, generated by using a full-wave electromagnetic (EM) solver. With the co-simulation, electrical and optical as well as electro-optical behaviors can be analyzed at the same time because best-in-class electronics and photonic integrated circuit simulators interact with each other. As a result, not only optical spectrum and eye diagrams but also electrical eye diagrams can be evaluated on the same simulation platform. In addition, the proposed methodology includes a statistical- and thermal-aware photonic circuit simulation workflow to evaluate process and temperature variations as well as estimate the required thermal tuning power as those non-idealities can lead to microring's resonance wavelengths shifting. For this, thermal simulation is conducted with a 3D EM model which is also used for such signal and power integrity analysis as a channel link simulation and IR drop. Also, photonic circuit simulations are performed where a design exploration and optimization of such microring's design parameters as Q-factor, and bias voltages are required to select the most promising designs, for example, to satisfy a specific bit-error rate. With the proposed design methodology having those multiphysics simulation workflows, DWDM SiPh can be fully optimized to have reliable system performance.
00:39 CET TWO-STREAM NEURAL NETWORK FOR POST-LAYOUT WAVEFORM PREDICTION
Authors:
Sanghwi Kim, Hyejin Shin and Hyunkyu Kim, SK Hynix, KR
Abstract
The gap between pre- and post-layout simulation, as well as the considerable layout time, increases the significance of post-layout waveform prediction in dynamic random access memory (DRAM) design. This study develops a post-layout prediction model using the following two-stream neural network: (1) a multi-layer perceptron neural network to calculate the coupling noise by using the physical properties of global interconnects, and (2) a convolutional neural network to compute the time-series trends of the waveforms by referencing adjacent signals. The proposed model is trained on these two types of heterogeneous data and achieves an accuracy of 95.5% on 16Gb DDR5 in the 1b DRAM process. The model significantly improves design completeness by pre-detecting deterioration in signal quality via post-layout waveform prediction. Although a few weeks are generally required to obtain post-layout waveforms after the circuit design process, waveforms can be instantly predicted using our proposed model.
00:39 CET QUANTIZATION-AWARE NEURAL ARCHITECTURE SEARCH WITH HYPERPARAMETER OPTIMIZATION FOR INDUSTRIAL PREDICTIVE MAINTENANCE APPLICATIONS
Authors:
Nick van de Waterlaat, Sebastian Vogel, Hiram Rayo Torres Rodriguez, Willem Sanberg and Gerardo Daalderop, NXP Semiconductors, NL
Abstract
Optimizing the efficiency of neural networks is crucial for ubiquitous machine learning on the edge. However, it requires specialized expertise to account for the wide variety of applications, edge devices, and deployment scenarios. An attractive approach to mitigate this bottleneck is Neural Architecture Search (NAS), as it allows for optimizing networks for both efficiency and task performance. This work shows that including hyperparameter optimization for training-related parameters alongside NAS enables substantial improvements in efficiency and task performance on a predictive maintenance task. Furthermore, this work extends the combination of NAS and hyperparameter optimization with INT8 quantization since efficiency is of utmost importance for resource-constrained devices in industrial applications. Our combined approach, which we refer to as Quantization-Aware NAS (QA-NAS), allows for further improvements in efficiency on the predictive maintenance task. Consequently, our work shows that QA-NAS is a promising research direction for optimizing neural networks for deployment on resource-constrained edge devices in industrial applications.

S_D2 High level synthesis and verification

Date: Wednesday, 19 April 2023
Time: 16:30 - 18:00 CET

Time Label Presentation Title
Authors
00:39 CET TOWARDS HIGH-LEVEL SYNTHESIS OF QUANTUM CIRCUITS
Authors:
Chao Lu1, Christian Pilato2 and Kanad Basu1
1University of Texas at Dallas, US; 2Politecnico di Milano, IT
Abstract
In recent years, there has been a proliferation of quantum algorithms, primarily due to their exponential speedup over their classical counterparts. Quantum algorithms find applications in various domains, including machine learning, molecular simulation, and cryptography. However, extensive knowledge of linear algebra and quantum mechanics is required to program a quantum computer, which might not be feasible for traditional software programmers. Moreover, the current quantum programming paradigm makes it difficult to scale and integrate quantum circuits to achieve complex functionality. To this end, in this paper, we introduce QHLS, a quantum high-level synthesis (HLS) framework. To the best of our knowledge, this is the first HLS framework for quantum circuits. The proposed QHLS allows quantum programmers to start with high-level behavioral descriptions (e.g., C, C++) and automatically generates the corresponding quantum circuit, thus reducing the complexity of programming a quantum computer. Our experimental results demonstrate the success of QHLS in translating high-level behavioral software programs containing arithmetic, logical, and conditional statements.
00:39 CET MIRROR: MAXIMIZING THE RE-USABILITY OF RTL THROUGH RTL TO C COMPILER
Authors:
Md Imtiaz Rashid and Benjamin Carrion Schaefer, University of Texas at Dallas, US
Abstract
This work presents an RTL-to-C compiler called MIRROR that maximizes the re-usability of the generated C code for High-Level Synthesis (HLS). The uniqueness of the compiler is that it generates C code by using libraries of pre-characterized RTL micro-structures that are uniquely identifiable through perceptual hashes. This makes it possible to quickly generate C descriptions that include arrays and loops. These are important because HLS tools extensively use synthesis directives in the form of pragmas to control how to synthesize these constructs. E.g., arrays can be synthesized as registers or RAM, and loops fully unrolled, partially unrolled, not unrolled, or pipelined. Setting different pragma combinations leads to designs with unique area vs. performance and power trade-offs. Based on this, the main goal of our compiler is to parse synthesizable RTL descriptions specified in Verilog, which have a fixed micro-architecture with a specific area, performance, and power profile, and to generate C code for HLS that can then be re-synthesized with different pragma combinations, generating a variety of new micro-architectures with different area vs. performance trade-offs. We call this 'maximizing the re-usability of the RTL code' because it enables a path to re-target any legacy RTL description to applications with different constraints.
00:39 CET HIGH-LEVEL SYNTHESIS VERSUS HARDWARE CONSTRUCTION
Authors:
Alexander Kamkin1, Mikhail Chupilko1, Mikhail Lebedev1, Sergey Smolov1 and Georgi Gaydadjiev2
1ISP RAS, RU; 2University of Groningen, NL
Abstract
Application-specific systems with FPGA accelerators are often designed using high-level synthesis or hardware construction tools. Nowadays, there are many frameworks available, both open-source and commercial. In this work, we aim at a fair comparison of several languages (and tools), including Verilog (our baseline), Chisel, Bluespec SystemVerilog (Bluespec Compiler), DSLX (XLS), MaxJ (MaxCompiler), and C (Bambu and Vivado HLS). Our analysis has been carried out using a representative example of the 8x8 inverse discrete cosine transform (IDCT), a widely used algorithm employed in JPEG and MPEG decoders. The metrics under consideration include: (a) the degree of automation (how much less code is required compared to Verilog), (b) the controllability (the possibility to achieve given design characteristics, namely a given ratio of performance and area), and (c) the flexibility (ease of design modification to achieve certain characteristics). Rather than focusing on computational kernels only, we have developed AXI-Stream wrappers for the synthesized implementations, which allows us to adequately evaluate the characteristics of the designs when they are used as parts of real systems. Our study shows clear examples of the impact that specific optimizations (tool settings and source code modifications) have on the overall system performance and area. It emphasizes how important it is to be able to control the balance between the communication interface utilization and the computational kernel performance, and delivers clear guidelines for the next generation of tools for designing FPGA-accelerator-based systems.
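For readers who want to reproduce the reference behavior of the benchmark kernel, a software golden model of the 8x8 2-D inverse DCT can be written in a few lines of Python with SciPy (a generic sketch of the well-known transform, not the paper's fixed-point hardware implementations):

import numpy as np
from scipy.fft import idctn

rng = np.random.default_rng(0)
coeffs = rng.integers(-128, 128, size=(8, 8)).astype(np.float64)   # toy 8x8 coefficient block

# Inverse of the orthonormal type-II DCT, as used conceptually in JPEG/MPEG decoding.
block = idctn(coeffs, norm="ortho")
print(block.round(2))   # reference output a Verilog/Chisel/HLS kernel could be checked against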
00:39 CET PTP: ACCELERATE APPLICATION LAUNCH VIA PREDICTIVE AND TIME-SHARING PREFETCHING ON SMARTPHONES
Authors:
Ying Yuan, Zhipeng Tan, Shitong Wei, Lihua Yang, Wenjie Qi, Xuanzhi Wang and Cong Liu, Huazhong University of Science and Technology, CN
Abstract
Low application launch latency is crucial to the user experience, and the fastest application launch speed is one of the eternal pursuits of manufacturers. Page faults are a critical factor leading to long app launch latency, and prefetching is the current method of reducing page faults during application launch. Prefetching all demanded pages of the target application before application launch speeds up the application effectively, but it typically wastes several hundred MB of memory. Prefetching during application launch can reduce this memory waste; however, traditional prefetching methods are not aware of the order in which pages are accessed, which limits their ability to reduce page faults. We propose a Predictive and Time-sharing Prefetching scheme (PTP) that accelerates application launch effectively while using only a small amount of memory. PTP includes two steps: 1) PTP identifies the pattern of users' application usage with a long short-term memory network to increase the accuracy of app usage prediction; before the app launches, a few critical pages are prefetched based on this prediction to reduce memory waste. 2) During app launch, PTP prefetches pages according to the order in which they are accessed, to reduce page faults effectively. We evaluate PTP on a Google Pixel 3; compared to the state-of-the-art method, PTP reduces application launch time by up to 52.5%, and by 37% on average, and the data prefetched before the target application starts is only 1.31 MB on average.
00:39 CET USING HIGH-LEVEL SYNTHESIS TO MODEL SYSTEMVERILOG PROCEDURAL TIMING CONTROLS
Authors:
Luca Pozzoni1, Fabrizio Ferrandi1, Loris Mendola2, Alfio Palazzo2 and Francesco Pappalardo2
1Politecnico di Milano, IT; 2STMicroelectronics, IT
Abstract
In modern SoC designs, digital components' development and verification processes often depend on the component's interactions with other digital and analog modules on the same die. While designers can rely on a wide range of tools and practices for validating fully-digital models, porting the same workflow to mixed models' development requires significant efforts from the designers. A common practice is to use Real Number Modeling techniques to generate HDL-based behavioral models of analog components to efficiently simulate mixed models using only event-based simulations rather than Analog Mixed Signals (AMS) simulations. However, some of these models' language features are not synthesizable with existing synthesis tools, requiring additional efforts from the designers to generate post-tapeout prototypes. This paper presents a methodology for transforming some non-synthesizable SystemVerilog language features related to timing controls into functionally-equivalent synthesizable Verilog constructs. The resulting synthesizable models replicate their respective RNMs' behavior while explicitly managing delay controls and event expressions. The RNMs are first transformed using the MLIR framework and then synthesized with open-source HLS tools to obtain FPGA-synthesizable Verilog models.
00:39 CET R-LDPC: REFINING BEHAVIOR DESCRIPTIONS IN HLS TO IMPLEMENT HIGH-THROUGHPUT LDPC DECODER
Authors:
Yifan Zhang1, Cao Qiang1, Jie Yao2 and Hong Jiang3
1Wuhan National Laboratory for Optoelectronics, CN; 2School of Computer Sci., Huazhong University of Science and Technology, CN; 3UT Arlington, US
Abstract
High-Level Synthesis (HLS) translates high-level behavior descriptions into Register-Transfer Level (RTL) implementations for modern Field-Programmable Gate Arrays (FPGAs), accelerating domain-specific hardware development. Low-Density Parity-Check (LDPC) codes, a powerful error-correction code family, have been widely implemented in hardware for building a reliable data channel over a noisy physical channel in communication and storage applications. Leveraging HLS to rapidly prototype a high-performance LDPC decoder is attractive thanks to its high scalability and low hardware dependence, but the result is generally sub-optimal due to the lack of accurate and precise behavior descriptions in HLS to characterize iteration- and circuit-level implementation details. This paper proposes an HLS-based QC-LDPC decoder with scalable throughput, obtained by precisely refining the LDPC behavior descriptions, R-LDPC for short. To this end, R-LDPC first adopts an HLS-based LDPC decoder micro-architecture with a module-level pipeline. Second, R-LDPC offers a multi-instance-sharing description to explicitly define the shared and non-shared parts of an array of check-node updating units (CNUs), eliminating redundant function modules and addressing circuits. Third, R-LDPC designs efficient single-stage and multi-stage shifters to eliminate unnecessary bit-selection circuits. Finally, R-LDPC provides invalid-element-aware loop scheduling before the compile phase to avoid unnecessary stalls at runtime. We implement an R-LDPC decoder; compared to the original HLS-based implementation, R-LDPC reduces hardware consumption by up to 56% and latency by up to 67%, and improves decoding throughput by up to 300%. Furthermore, R-LDPC adapts to different scales, LDPC standards, and code rates, and achieves 9.9 Gbps decoding throughput on a Xilinx U50.
00:39 CET AN AUTOMATED VERIFICATION FRAMEWORK FOR HALIDEIR-BASED COMPILER TRANSFORMATIONS
Authors:
Yanzhao Wang1, Fei Xie1, Zhenkun Yang2, Jeremy Casas2, Pasquale Cocchini2 and Jin Yang2
1Portland State University, US; 2Intel Corporation, US
Abstract
HalideIR is a popular intermediate representation for compilers in domains such as deep learning, image processing, and hardware design. In this paper, we present an automated verification framework for HalideIR-based compiler transformations. The framework conducts verification using symbolic execution in two steps. Given a compiler transformation, our automated verification framework first uses symbolic execution to enumerate the compiler transformation's paths, and then utilizes symbolic execution to verify whether the output program of each transformation path is equivalent to its source. We have successfully applied this framework to verify 46 transformations from the three most-starred HalideIR-based compilers on GitHub and detected 4 transformation bugs that had gone undetected by manually crafted unit tests.
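To illustrate the second step, here is a minimal sketch of checking that a transformed expression is equivalent to its source with the Z3 SMT solver; the rewrite x * 2 -> x << 1 and the 32-bit width are illustrative assumptions, not a transformation taken from the verified compilers.

from z3 import BitVec, Solver, Not, unsat

def equivalent(src, dst):
    # The transformed expression must agree with the source for every input.
    s = Solver()
    s.add(Not(src == dst))            # search for a counterexample
    return s.check() == unsat

x = BitVec('x', 32)
print(equivalent(x * 2, x << 1))      # True: the strength reduction is sound
print(equivalent(x * 2, x << 2))      # False: a buggy rewrite is caught
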
00:39 CET CHISELFV: A FORMAL VERIFICATION FRAMEWORK FOR CHISEL
Authors:
Mufan Xiang1, Yongjian Li2 and Yongxin Zhao3
1East China Normal University, CN; 2Chinese Academy of Sciences, Institute of Software, Laboratory of Computer Science, CN; 3East China Normal University, CN
Abstract
Modern digital hardware is becoming ever more complex, and agile development, an approach proven efficient in software, has been introduced into hardware design. Furthermore, as a new hardware construction language, Chisel helps raise the level of hardware design abstraction with support for object-oriented and functional programming, and it plays a crucial role in future hardware design and open-source hardware development. However, formal verification support for Chisel is still limited. In this paper, we propose ChiselFV, a formal verification framework that supports detailed formal hardware property descriptions and integrates a mature formal hardware verification flow based on SymbiYosys. It builds on top of Chisel and uses Scala to drive the verification process, so the framework can be seen as an extension of Chisel. ChiselFV makes it easy to verify hardware designs formally when implementing them in Chisel.
00:39 CET EMNAPE: EFFICIENT MULTI-DIMENSIONAL NEURAL ARCHITECTURE PRUNING FOR EDGEAI
Authors:
Hao Kong1, Xiangzhong Luo1, Shuo Huai1, Di Liu2, Ravi Subramaniam3, Christian Makaya3, Qian Lin3 and Weichen Liu1
1Nanyang Technological University, SG; 2Yunnan University, CN; 3HP Inc., US
Abstract
Model pruning has shown great potential in reducing the cost of convolutional neural networks (CNNs) for embedded hardware. However, existing pruning methods mainly remove redundancy in a single dimension (depth, width, or resolution) of CNNs, which may excessively prune that dimension and lead to a severe accuracy drop. To address this issue, we propose EMNAPE, a multi-dimensional pruning framework that jointly prunes all three dimensions of CNNs. To accurately identify redundant units across the three dimensions, we first introduce an inter-dimensional evaluation strategy (ITES) that comprehensively evaluates the global importance of units across dimensions according to their contribution to model complexity, accuracy, and on-device latency. Moreover, since directly using ITES to evaluate all units of the three dimensions would incur a significant time cost, we also propose an inner-dimensional evaluation strategy (INES) to quickly evaluate the local importance of units within each dimension. By collaboratively using ITES and INES to evaluate all units of the three dimensions, we significantly improve the evaluation efficiency of our framework. Based on INES and ITES, we further propose a heuristic pruning algorithm that progressively prunes the three dimensions, iteratively using INES and ITES to efficiently explore the giant design space formed by the three dimensions and find the optimal tiny model. Experiments on ImageNet-1K show that EMNAPE achieves 1.09% higher top-1 accuracy with 32.3% fewer MACs and 1.75× on-device acceleration compared to HRank, one of the state-of-the-art pruning frameworks. Source code is available anonymously at https://anonymous.4open.science/r/emnape-review-5EE8.
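A minimal sketch of how a cross-dimensional importance score might rank pruning candidates is given below; the scoring formula, its weights, and the candidate names are hypothetical placeholders, since the abstract does not disclose ITES's actual formulation.

def importance(acc_drop, mac_saving, latency_saving, alpha=1.0, beta=0.5, gamma=0.5):
    # Hypothetical score for one prunable unit (a block, a channel group, or a
    # resolution step): higher savings and lower accuracy drop rank it higher.
    return beta * mac_saving + gamma * latency_saving - alpha * acc_drop

candidates = {
    "depth:block_7":       importance(acc_drop=0.10, mac_saving=0.08, latency_saving=0.06),
    "width:conv4_half":    importance(acc_drop=0.25, mac_saving=0.12, latency_saving=0.05),
    "resolution:-32px":    importance(acc_drop=0.05, mac_saving=0.15, latency_saving=0.10),
}
print(max(candidates, key=candidates.get))   # prune the highest-scoring unit first
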
00:39 CET METRIC TEMPORAL LOGIC WITH RESETTABLE SKEWED CLOCKS
Authors:
Alberto Bombardelli and Stefano Tonetta, FBK, IT
Abstract
The formal verification of distributed real-time systems is particularly challenging due to the intertwining of timing constraints with synchronization and communication mechanisms. Real-time properties are usually expressed in Metric Temporal Logic (MTL), an extension of Linear-time Temporal Logic (LTL) with metric constraints over time. One of the issues in applying these methods to distributed systems is that clocks are not perfectly synchronized, and local properties may refer to different, possibly skewed, clocks, which are reset for synchronization. Local components and properties, therefore, may refer to time points that are not guaranteed to be monotonic. In this paper, we investigate the specification of temporal properties of distributed systems with resettable skewed clocks. To take the synchronization of clocks into account, the local temporal operators are interpreted over resettable skewed clocks. We extend MTL with metric operators that are more suitable for expressing bounds over non-monotonic time. We propose a method to check satisfiability for the proposed logic, which also enables compositional reasoning. We implemented and evaluated the approach on typical properties of real-time systems.
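As background, the standard pointwise semantics of the MTL bounded until operator over a timed word with monotonic timestamps is recalled below in LaTeX notation; the paper's contribution lies in adapting such metric operators to resettable, skewed, and hence non-monotonic clock terms, which is not reproduced here.

% Standard pointwise MTL semantics of bounded until over a timed word
% (\sigma,\tau) = (\sigma_0,\tau_0)(\sigma_1,\tau_1)\dots with monotonic \tau:
(\sigma,\tau,i) \models \varphi_1 \,\mathcal{U}_{[a,b]}\, \varphi_2
  \iff \exists j \ge i.\; \tau_j - \tau_i \in [a,b]
       \;\wedge\; (\sigma,\tau,j) \models \varphi_2
       \;\wedge\; \forall k.\; i \le k < j \Rightarrow (\sigma,\tau,k) \models \varphi_1
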
00:39 CET POLYNOMIAL FORMAL VERIFICATION OF FLOATING POINT ADDERS
Authors:
Jan Kleinekathöfer1, Alireza Mahzoon1 and Rolf Drechsler2
1University of Bremen, DE; 2University of Bremen/DFKI, DE
Abstract
In this paper, we present a verifier that takes advantage of Binary Decision Diagrams (BDDs) with case splitting to fully verify a floating-point adder. We demonstrate that traditional symbolic simulation using BDDs has exponential time complexity and fails for large floating-point adders. However, polynomial bounds can be ensured if our case-splitting technique is applied at specific points of the circuit. The efficiency of our verifier is demonstrated by experiments on an extensive set of floating-point adders with different exponent and significand sizes.

S_E3 Efficient utilization of heterogeneous hardware architectures running machine learning-based applications

Date: Wednesday, 19 April 2023
Time: 16:30 - 18:00 CET

Time Label Presentation Title
Authors
00:39 CET BLOCK GROUP SCHEDULING: A GENERAL PRECISION-SCALABLE NPU SCHEDULING TECHNIQUE WITH PRECISION-AWARE MEMORY ALLOCATION
Authors:
Seokho Lee1, Younghyun Lee1, Hyejun Kim2, Taehoon Kim2 and Yongjun Park3
1Department of Artificial Intelligence, Hanyang University, KR; 2Hanyang University, KR; 3Yonsei University, KR
Abstract
Precision-scalable NPUs (PSNPUs), which natively and efficiently support quantized neural network models, suffer from a serious memory bottleneck as recent DNN models have evolved toward performing many simple computations on larger data. In this work, we first analyze whether the memory bottleneck can be handled by traditional NPU scheduling techniques, and then introduce new NPU instruction scheduling techniques to minimize the memory effect: capacity-aware memory allocation and block scheduling. Compared to the baseline, the new scheduling techniques achieve up to 2.26x performance improvement by successfully mitigating the substantial memory pressure of low-precision computations, without any hardware overhead.
00:39 CET FPGA-BASED ACCELERATOR FOR RANK-ENHANCED AND HIGHLY-PRUNED BLOCK-CIRCULANT NEURAL NETWORKS
Authors:
Haena Song1, Jongho Yoon2, Dohun Kim2, Eunji Kwon2, Tae-Hyun Oh2 and Seokhyeong Kang2
1Pohang University of Science and Technology, KR; 2POSTECH, KR
Abstract
Numerous network compression methods have been proposed to deploy deep neural networks in resource-constrained embedded systems. Among them, block-circulant matrix (BCM) compression is one of the most promising hardware-friendly methods for both acceleration and compression. However, it has several limitations: (i) an accuracy drop owing to the structural characteristics of circulant matrices, (ii) restrictions on the compression parameter, and (iii) the need for a dataflow specialized to BCM-compressed network accelerators. In this paper, a rank-enhanced and highly-pruned block-circulant matrix compression (RP-BCM) framework is proposed to overcome these limitations. RP-BCM comprises two stages: Hadamard-BCM and BCM-wise pruning. In addition, a dedicated skip scheme is introduced into the processing-element design to maintain high parallelism with BCM-wise sparsity. Furthermore, we propose a specialized dataflow for BCM-compressed networks, rather than the conventional CNN dataflow on FPGA. As a result, the proposed method reduces parameters and FLOPs for ResNet-50 on ImageNet by 92.4% and 77.3%, respectively. Moreover, compared to a GPU, the proposed hardware design achieves a 3.1x improvement in energy efficiency on the Xilinx PYNQ-Z2 FPGA board for ResNet-18 trained on ImageNet.
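To illustrate why block-circulant weights are hardware-friendly, the sketch below multiplies a block-circulant matrix by a vector using FFTs in NumPy; it reflects standard BCM arithmetic rather than RP-BCM's Hadamard or pruning stages, and the block size and shapes are illustrative.

import numpy as np

def bcm_matvec(first_cols, x, block_size):
    # first_cols[i][j] holds the first column of circulant block (i, j), so only
    # O(n^2 / block_size) parameters are stored; each block multiply is a
    # circular convolution, i.e. a pointwise product in the FFT domain.
    p, q = len(first_cols), len(first_cols[0])
    x_fft = np.fft.fft(x.reshape(q, block_size), axis=1)
    y = np.zeros((p, block_size), dtype=complex)
    for i in range(p):
        for j in range(q):
            y[i] += np.fft.fft(first_cols[i][j]) * x_fft[j]
    return np.real(np.fft.ifft(y, axis=1)).reshape(-1)

k = 4                                                   # block size (illustrative)
W = [[np.random.randn(k) for _ in range(3)] for _ in range(2)]
x = np.random.randn(3 * k)
print(bcm_matvec(W, x, k).shape)                        # (8,)
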
00:39 CET LOSSLESS SPARSE TEMPORAL CODING FOR SNN-BASED CLASSIFICATION OF TIME-CONTINUOUS SIGNALS
Authors:
Johnson Loh and Tobias Gemmeke, RWTH Aachen University, DE
Abstract
Ultra-low-power classification systems using spiking neural networks (SNNs) promise efficient processing for mobile devices. Temporal coding represents activations of an artificial neural network (ANN) as binary signaling events in time, thereby minimizing circuit activity. Discrepancies in numeric results are inherent to common conversion schemes, as the atomic computing unit, i.e. the neuron, performs algorithmically different operations, potentially degrading the SNN's quality of service (QoS). In this work, a lossless conversion method is derived in a top-down design approach for continuous-time signals, using electrocardiogram (ECG) classification as an example. As a result, the converted SNN achieves results identical to its fixed-point ANN reference. The computations implied by the proposed method result in a novel hybrid neuron model located between the integrate-and-fire (IF) and conventional ANN neurons, whose numerical result is equivalent to the latter. Additionally, a dedicated SNN accelerator is implemented in 22 nm FDSOI CMOS suitable for continuous real-time classification. A direct comparison with an equivalent ANN counterpart shows that power reductions of 2.32x and area reductions of 7.22x are achievable without loss in QoS.
00:39 CET NAF: DEEPER NETWORK/ACCELERATOR CO-EXPLORATION FOR CUSTOMIZING CNNS ON FPGA
Authors:
Wenqi Lou, Jiaming Qian, Lei Gong, Xuan Wang, Chao Wang and Xuehai Zhou, USTC, CN
Abstract
Recently, algorithm and hardware co-design for neural networks (NNs) has become the key to obtaining high-quality solutions. However, prior works lack consideration of the underlying hardware and thus suffer from a severely unbalanced neural architecture and hardware architecture search (NA-HAS) space on FPGAs, failing to unleash the performance potential. At the same time, a deeper joint search leads to a larger (multiplicative) search space, making the search highly challenging. To this end, we propose NAF, an efficient differentiable search framework that jointly searches the networks (e.g., operations and bitwidths) and accelerators (e.g., heterogeneous multicores and mappings) under a balanced NA-HAS space. Concretely, we design a coarse-grained hardware-friendly quantization algorithm and integrate it at block granularity into the co-search process. Meanwhile, we design a highly optimized block processing unit (BPU) with configurable key dataflows. Afterward, a dynamic hardware generation algorithm based on modeling and heuristic rules is designed to perform the critical HAS and quickly generate hardware feedback. Experimental results show that, compared with previous state-of-the-art (SOTA) co-design works, NAF improves throughput by 1.99×-6.84× on the Xilinx ZCU102 and energy efficiency by 17%-88% under similar accuracy on the ImageNet dataset.
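For context, differentiable co-search typically relaxes the discrete choice of operation (and, here, bitwidth) into a softmax-weighted mixture; the NumPy sketch below shows that generic relaxation with made-up candidate operations, and does not reproduce NAF's block-level quantization or its hardware-feedback loop.

import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

ops = {                                   # illustrative stand-ins for real block ops
    "conv3x3_w8a8": lambda x: 0.9 * x,
    "conv3x3_w4a4": lambda x: 0.8 * x,
    "skip":         lambda x: x,
}
alpha = np.zeros(len(ops))                # architecture parameters, trained by gradient descent

def mixed_op(x, alpha):
    # Continuous relaxation: the block output is the softmax-weighted sum of all
    # candidates, so the operation/bitwidth choice becomes differentiable in alpha.
    w = softmax(alpha)
    return sum(wi * op(x) for wi, op in zip(w, ops.values()))

print(mixed_op(np.ones(4), alpha))
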
00:39 CET ESRU: EXTREMELY LOW-BIT AND HARDWARE-EFFICIENT STOCHASTIC ROUNDING UNIT DESIGN FOR 8-BIT DNN TRAINING
Authors:
Sung-En Chang1, Geng Yuan1, Alec Lu2, Mengshu Sun1, Yanyu Li1, Xiaolong Ma3, Zhengang Li1, Yanyue Xie1, Minghai Qin4, Xue Lin1, Zhenman Fang2 and Yanzhi Wang1
1Northeastern University, US; 2Simon Fraser University, CA; 3Clemson University, US; 4Self-employed, US
Abstract
Stochastic rounding is crucial in the low-bit (e.g., 8-bit) training of deep neural networks (DNNs) to achieve high accuracy. One drawback of prior studies is that they require a large number of high-precision stochastic rounding units (SRUs) to guarantee low-bit DNN accuracy, which incurs considerable hardware overhead. In this paper, we use extremely low-bit SRUs (ESRUs) to save a large amount of hardware resources during low-bit DNN training. However, a naively designed ESRU introduces a biased distribution of random numbers, causing accuracy degradation. To address this issue, we further propose an ESRU design with a plateau-shaped distribution. The plateau-shaped distribution in our ESRU design is implemented with the combination of an LFSR and an inverted LFSR, which avoids LFSR packing and turns an inherent LFSR drawback into an advantage in our efficient ESRU design. Experimental results using state-of-the-art DNN models demonstrate that, compared to a prior 24-bit SRU with 24-bit pseudo-random number generators (PRNGs), our 8-bit ESRU with a 3-bit PRNG reduces SRU hardware resource usage by 9.75 times while achieving slightly higher accuracy.
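To make the role of a tiny PRNG concrete, here is a minimal Python sketch of stochastic rounding driven by a 3-bit maximal-length Fibonacci LFSR; the tap choice is an assumption, and the sketch deliberately exposes the small bias caused by the LFSR's missing all-zero state, the kind of artifact that the plateau-shaped ESRU design (not reproduced here) is meant to counter.

def lfsr3(state=1):
    # 3-bit maximal-length Fibonacci LFSR (feedback polynomial x^3 + x^2 + 1,
    # period 7, never reaches the all-zero state). Tap choice is illustrative.
    while True:
        yield state
        new_bit = ((state >> 2) ^ (state >> 1)) & 1
        state = ((state << 1) | new_bit) & 0b111

def stochastic_round(x, rng):
    # Round x to an integer: the fractional part is compared against a 3-bit
    # pseudo-random threshold so that the expected result tracks x.
    base = int(x // 1)
    frac = x - base
    return base + (1 if frac * 8 > next(rng) else 0)

rng = lfsr3()
samples = [stochastic_round(2.375, rng) for _ in range(7000)]
print(sum(samples) / len(samples))
# Prints roughly 2.29 rather than 2.375: the LFSR only emits 1..7, and this
# missing-zero bias is exactly what careful low-bit SRU designs must correct.
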
00:39 CET CLASS-BASED QUANTIZATION FOR NEURAL NETWORKS
Authors:
Wenhao Sun1, Grace Li Zhang2, Huaxi Gu3, Bing Li1 and Ulf Schlichtmann1
1TU Munich, DE; 2TU Darmstadt, DE; 3Xidian University, CN
Abstract
In deep neural networks (DNNs), there are a huge number of weights and multiply-and-accumulate (MAC) operations. Accordingly, it is challenging to apply DNNs on resource-constrained platforms, e.g., mobile phones. Quantization is a method to reduce the size and the computational complexity of DNNs. Existing quantization methods either require hardware overhead to achieve a non-uniform conversion or focus on model-wise and layer-wise uniform conversions, which are not as fine-grained as filter-wise quantization. In this paper, we propose a class-based quantization method to determine the minimum number of quantization bits for each filter or neuron in DNNs individually. In the proposed method, the importance score of each filter or neuron with respect to the number of classes in the dataset is first evaluated. The larger the score is, the more important the filter or neuron is and thus the larger the number of quantization bits should be. Afterwards, a search algorithm is adopted to exploit the different importance of filters and neurons to determine the number of quantization bits of each filter or neuron. Experimental results demonstrate that the proposed method can maintain the inference accuracy with low bit-width quantization. Given the same number of quantization bits, the proposed method can also achieve a better inference accuracy than the existing methods.
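A minimal sketch of the downstream step, assigning a bit-width per filter from an importance score and quantizing accordingly, is shown below; the scores and the linear score-to-bits mapping are hypothetical placeholders, since the paper's class-based scoring and search algorithm are not reproduced.

import numpy as np

def quantize_filter(w, bits):
    # Symmetric uniform quantization of one filter's weights to `bits` bits.
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax if np.max(np.abs(w)) > 0 else 1.0
    return np.round(w / scale).clip(-qmax, qmax) * scale

def bits_from_scores(scores, low=2, high=8):
    # Hypothetical mapping: a more important filter (score closer to 1)
    # receives more quantization bits.
    return [int(round(low + s * (high - low))) for s in scores]

filters = [np.random.randn(3, 3) for _ in range(4)]
scores = [0.9, 0.2, 0.6, 0.1]                       # placeholder importance scores
for f, b in zip(filters, bits_from_scores(scores)):
    q = quantize_filter(f, b)
    print(b, float(np.abs(f - q).mean()))           # more bits -> smaller error
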
00:39 CET ROAD-RUNNER: COLLABORATIVE DNN PARTITIONING AND OFFLOADING ON HETEROGENEOUS EDGE SYSTEMS
Authors:
Andreas Kakolyris1, Manolis Katsaragakis1, Dimosthenis Masouros1 and Dimitrios Soudris2
1National TU Athens, GR; 2NTUA, GR
Abstract
Deep Neural Networks (DNNs) are becoming extremely popular for many modern applications deployed at the edge of the computing continuum. Despite their effectiveness, DNNs are typically resource-intensive, making it prohibitive to deploy them on the resource- and/or energy-constrained devices found in such environments. To overcome this limitation, partitioning the DNN and offloading part of its execution from edge devices to more powerful servers has been introduced as a prominent solution. While previous works have proposed resource management schemes to tackle this problem, they usually neglect the high dynamicity found in such environments, regarding both the diversity of the deployed DNN models and the heterogeneity of the underlying hardware infrastructure. In this paper, we present RoaD-RuNNer, a framework for DNN partitioning and offloading for edge computing systems. RoaD-RuNNer relies on prior knowledge and leverages collaborative filtering techniques to quickly estimate the performance and energy requirements of individual layers on heterogeneous devices. By aggregating this information, it identifies a set of Pareto-optimal DNN partitioning schemes that trade off performance and energy consumption. We evaluate our approach using a set of well-known DNN architectures and show that our framework i) outperforms existing state-of-the-art approaches by achieving a 9.58× speedup on average and up to 88.73% less energy consumption, ii) achieves high prediction accuracy by limiting the prediction error to 3.19% and 0.18% for latency and energy, respectively, and iii) provides lightweight and dynamic performance characteristics.
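The core search can be pictured as follows: given per-layer cost estimates, enumerate every split point and keep the latency/energy Pareto front. The NumPy sketch below uses entirely made-up per-layer numbers and plain exhaustive enumeration; it is not RoaD-RuNNer's collaborative-filtering predictor.

import numpy as np

# Hypothetical per-layer costs for one DNN: edge latency/energy, server latency,
# and the cost of shipping the activation after layer i (index 0 ships the input).
edge_lat = np.array([4.0, 6.0, 8.0, 8.0, 3.0])
edge_eng = np.array([2.0, 3.0, 5.0, 5.0, 1.5])
srv_lat  = np.array([0.5, 0.8, 1.0, 1.0, 0.4])
tx_lat   = np.array([5.0, 3.0, 2.5, 1.0, 0.2, 0.0])
tx_eng   = np.array([2.5, 1.5, 1.2, 0.5, 0.1, 0.0])

def candidates():
    # Split point k: layers [0, k) run on the edge device, the activation is
    # transmitted, and layers [k, n) run on the server.
    n = len(edge_lat)
    for k in range(n + 1):
        lat = edge_lat[:k].sum() + tx_lat[k] + srv_lat[k:].sum()
        eng = edge_eng[:k].sum() + tx_eng[k]   # only device-side energy is billed
        yield k, lat, eng

def pareto(points):
    return [p for p in points
            if not any(q[1] <= p[1] and q[2] <= p[2] and q != p for q in points)]

for k, lat, eng in pareto(list(candidates())):
    print(f"split after layer {k}: latency {lat:.1f}, device energy {eng:.1f}")
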
00:39 CET PRUNING AND EARLY-EXIT CO-OPTIMIZATION FOR CNN ACCELERATION ON FPGAS
Authors:
Guilherme Korol1, Michael Jordan2, Mateus Beck Rutzig3, Jeronimo Castrillon4 and Antonio Carlos Schneider Beck1
1Universidade Federal do Rio Grande do Sul, BR; 2UFRGS, BR; 3UFSM, BR; 4TU Dresden, DE
Abstract
The challenge of processing heavy-load ML tasks, particularly CNN-based ones, on resource-constrained IoT devices has encouraged the use of edge servers. The edge offers higher performance than the end devices and better latency and security than the Cloud. On top of that, the rising complexity of ML applications, the ever-increasing number of connected devices, and current demands for energy efficiency require optimizing such CNN models. Pruning and early-exit are notable optimizations that have been successfully used to alleviate the computational cost of inference. However, these optimizations have not yet been exploited simultaneously: while pruning is usually applied at design time, which involves retraining the CNN before deployment, early-exit is inherently dynamic. In this work, we propose AdaPEx, a framework that exploits the intrinsic reconfigurability of FPGAs so that both can be employed cooperatively. AdaPEx first explores the trade-off between pruning and early-exit at design time, creating a design space never exploited in the state-of-the-art. Then, AdaPEx applies FPGA reconfiguration as a means to enable the combined use of pruning and early-exit dynamically. At run time, this allows matching the inference processing to the current edge conditions and a user-configurable accuracy threshold. In a smart IoT application, AdaPEx processes up to 1.32x more inferences and improves EDP by up to 2.55x over the state-of-the-art FPGA-based FINN accelerator.
00:39 CET LATTICE QUANTIZATION
Authors:
Clément Metz1, Thibault Allenet1, Johannes Thiele2, Antoine Dupret1 and Olivier Bichler1
1CEA, FR; 2CEA / Axelera.ai, CH
Abstract
Post-training quantization of neural networks consists in quantizing a model without retraining, which is user-friendly, fast and data frugal. In this paper, we propose LatticeQ, a new post-training weight quantization method designed for deep CNNs. Instead of the standard scalar rounding widely used in state-of-the-art quantization methods, LatticeQ uses a quantizer based on lattices – discrete algebraic structures – which we show are able to exploit the inner correlations between the model parameters to the benefit of minimizing quantization error. LatticeQ allows us to achieve state-of-the-art results in post-training quantization. In particular, we achieve ImageNet classification results close to full precision on the popular Resnet-18/50, with little to no accuracy drop for 4-bit models.
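As a concrete flavor of lattice rounding (the abstract does not specify which lattices LatticeQ uses, so this is only an illustrative choice), the sketch below quantizes a weight vector to the D_n lattice with the classic Conway-Sloane closest-point decoder; the scale factor is a hypothetical quantization step.

import numpy as np

def closest_point_Dn(x):
    # Closest point in the D_n lattice (integer vectors with even coordinate sum):
    # round every coordinate, and if the parity is odd, re-round the coordinate
    # with the largest rounding error in the other direction.
    f = np.round(x)
    if int(f.sum()) % 2 == 0:
        return f
    err = x - f
    i = np.argmax(np.abs(err))
    g = f.copy()
    g[i] += 1.0 if err[i] > 0 else -1.0
    return g

w = np.array([0.40, -1.10, 2.70, 0.05])       # a toy weight vector
scale = 0.5                                   # hypothetical quantization step
print(closest_point_Dn(w / scale) * scale)    # lattice-quantized weights
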
00:39 CET MITIGATING HETEROGENEITIES IN FEDERATED EDGE LEARNING WITH RESOURCE-INDEPENDENCE AGGREGATION
Authors:
Zhao Yang and Qingshuang Sun, Northwestern Polytechnical University, CN
Abstract
Heterogeneity has emerged as a critical challenge in Federated Learning (FL). In this paper, we identify the cause of FL performance degradation under heterogeneity: the locally communicated parameters exhibit feature mismatches and feature representation range mismatches, resulting in ineffective global model generalization. To address this problem, a heterogeneity-mitigating FL scheme is proposed to improve the generalization of the global model with resource-independence aggregation. Instead of linking local model contributions to the resources they occupy, we look for contributing parameters directly in each node's training results. We begin by evaluating the parameter features of local models in a federated manner, using the general feature of the global model determined with the Geometric Median (GM) as the evaluation criterion. Following that, we propose a dynamic parameter selection and aggregation method based on ADMM for balancing feature mismatches and feature representation range mismatches of each node's communicated parameters. Extensive experiments on various datasets show that our method performs competitively: accuracy and convergence time are improved by up to 7.96% and 2.61x compared with the state of the art.
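To ground the Geometric Median criterion, here is a minimal NumPy sketch of Weiszfeld's algorithm, the standard iterative scheme for the geometric median of client parameter vectors; the four toy clients (one an outlier) are purely illustrative, and the paper's ADMM-based selection step is not reproduced.

import numpy as np

def geometric_median(points, iters=100, eps=1e-8):
    # Weiszfeld's algorithm: repeatedly re-weight the mean by inverse distance.
    # `points` has shape (n_clients, n_params).
    y = points.mean(axis=0)
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(points - y, axis=1), eps)   # avoid /0
        w = 1.0 / d
        y_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < eps:
            break
        y = y_new
    return y

# Flattened parameter updates from four clients; one is an outlier.
clients = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [10.0, -8.0]])
print(geometric_median(clients))               # stays close to (1, 1)
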
00:39 CET MULTISPECTRAL FEATURE FUSION FOR DEEP OBJECT DETECTION ON EMBEDDED NVIDIA PLATFORMS
Authors:
Thomas Kotrba1, Martin Lechner1, Omair Sarwar2 and Axel Jantsch1
1TU Wien, AT; 2Mission Embedded GmbH, AT
Abstract
Object detection with images recorded in multiple spectra, e.g., visible light and infrared, can benefit real-world applications such as autonomous driving. Using multispectral data can improve the performance of an object detection system thanks to its complementary information, especially in adverse weather and low-illumination situations. To use multiple spectra in deep-learning-based object detectors, the information from the individual spectra must be fused. This fusion can be performed at several positions in the network architecture and with several methods. Aside from the question of which approach delivers the best detection performance, the impact of these fusion methods in terms of physical metrics is rarely studied. This paper compares the impact of general fusion schemes in the YOLOv4 object detector. We focus on optimizing these fusion approaches for an NVIDIA Jetson AGX Xavier and evaluating their impact on the device in physical metrics. We optimize six different fusion architectures in the network's backbone for TensorRT and compare their inference time, power consumption, and object detection performance. Our results show that multispectral fusion approaches are beneficial in terms of resource usage and object detection metrics compared to individual networks.
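One of the simplest fusion schemes such a comparison covers is concatenating the two spectra's feature maps part-way through the backbone and mixing them with a 1x1 convolution; the NumPy sketch below shows only that generic scheme with made-up shapes, not the six TensorRT-optimized architectures from the paper.

import numpy as np

def fuse_concat_1x1(feat_rgb, feat_ir, w, b):
    # Mid-level fusion: stack the visible and infrared feature maps along the
    # channel axis, then mix them with a 1x1 convolution (a per-pixel matmul).
    # feat_*: (C, H, W); w: (C_out, 2*C); b: (C_out,)
    x = np.concatenate([feat_rgb, feat_ir], axis=0)          # (2C, H, W)
    c, h, width = x.shape
    y = w @ x.reshape(c, h * width) + b[:, None]
    return y.reshape(-1, h, width)

C, H, W = 8, 16, 16                                          # illustrative shapes
rgb = np.random.randn(C, H, W)
ir  = np.random.randn(C, H, W)
w   = np.random.randn(C, 2 * C) * 0.1
b   = np.zeros(C)
print(fuse_concat_1x1(rgb, ir, w, b).shape)                  # (8, 16, 16)
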
00:39 CET RANKSEARCH: AN AUTOMATIC RANK SEARCH TOWARDS OPTIMAL TENSOR COMPRESSION FOR VIDEO LSTM NETWORKS ON THE EDGE
Authors:
Changhai Man1, Cheng Chang2, Chenchen Ding3, Ao Shen3, Hongwei Ren3, Ziyi Guan4, Yuan Cheng5, Shaobo Luo3, Rumin Zhang3, Ngai Wong4 and Hao Yu3
1Georgia Institute of Technology, US; 2University of California, Los Angeles, US; 3Southern University of Science and Technology, CN; 4University of Hong Kong, HK; 5Shanghai Jiao Tong University, CN
Abstract
Various industrial and domestic applications call for optimized lightweight video LSTM network models on the edge. Recent tensor-train methods transform structured space-time features into tensors, which can be further decomposed into low-rank network models for lightweight video analysis on the edge. The rank selection for these tensors, however, is performed manually and without optimization. This paper formulates a rank search algorithm that automatically decides tensor ranks while trading off network accuracy against complexity. A fast rank search method, called RankSearch, is developed to find optimized low-rank video LSTM network models on the edge. Experimental results show that RankSearch achieves a 4.84× reduction in model complexity and a 1.46× reduction in run time, while delivering a 3.86% accuracy improvement compared with manually ranked models.
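For background on what a chosen rank controls, the NumPy sketch below performs a plain TT-SVD in which each rank is picked automatically from a relative-error budget; this is the textbook tensor-train decomposition, not RankSearch's accuracy/complexity search, and the tensor shape and tolerance are illustrative.

import numpy as np

def tt_svd(tensor, rel_err=0.05):
    # Tensor-train decomposition with automatic rank selection: at each
    # unfolding, keep the smallest rank whose truncated-SVD error stays
    # within the per-step budget derived from the target relative error.
    shape = tensor.shape
    d = len(shape)
    delta = rel_err * np.linalg.norm(tensor) / np.sqrt(d - 1)
    cores, r_prev, C = [], 1, tensor.copy()
    for k in range(d - 1):
        C = C.reshape(r_prev * shape[k], -1)
        U, S, Vt = np.linalg.svd(C, full_matrices=False)
        tail = np.sqrt(np.cumsum(S[::-1] ** 2))[::-1]        # error of dropping S[i:]
        r = next(i for i in range(1, len(S) + 1) if i == len(S) or tail[i] <= delta)
        cores.append(U[:, :r].reshape(r_prev, shape[k], r))
        C = S[:r, None] * Vt[:r]
        r_prev = r
    cores.append(C.reshape(r_prev, shape[-1], 1))
    return cores

A = np.random.randn(4, 8, 8, 6)
print([g.shape for g in tt_svd(A, rel_err=0.2)])             # one core per mode
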