DATE 2020 - Design, Automation and Test in Europe Conference <span>Preliminary Conference Programme (<a href="https://www.date-conference.com/programme">https://www.date-conference.com/programme</a>)</span> <span>Andreas Vörg, …</span> <span>Sun, 8 Dec 2019 22:22</span> <div class="field field--name-field-news-content field--type-text-with-summary field--label-hidden clearfix field__item"><p>Keynotes: <a href="https://www.date-conference.com/keynotes">https://www.date-conference.com/keynotes</a></p> <p>Monday Tutorials: <a href="https://www.date-conference.com/conference/monday-tutorials">https://www.date-conference.com/conference/monday-tutorials</a></p> <p>PhD Forum: <a href="https://www.date-conference.com/fringe-meeting-fm01">https://www.date-conference.com/fringe-meeting-fm01</a></p> <p>Wednesday Special Day on "Embedded AI": Sessions <a href="https://www.date-conference.com/programme#5.1">5.1</a>, <a href="https://www.date-conference.com/programme#6.1">6.1</a>, <a href="https://www.date-conference.com/programme#7.0">7.0</a>, <a href="https://www.date-conference.com/programme#7.1">7.1</a>, <a href="https://www.date-conference.com/programme#8.1">8.1</a></p> <p>Thursday Special Day on "Silicon Photonics": Sessions <a href="https://www.date-conference.com/programme#9.1">9.1</a>, <a href="https://www.date-conference.com/programme#10.1">10.1</a>, <a href="https://www.date-conference.com/programme#11.0">11.0</a>, <a href="https://www.date-conference.com/programme#11.1">11.1</a>, <a href="https://www.date-conference.com/programme#12.1">12.1</a></p> <p>Exhibition Theatre: <a href="https://www.date-conference.com/exhibition/exhibition-theatre">https://www.date-conference.com/exhibition/exhibition-theatre</a></p> <p>Friday Workshops: <a href="https://www.date-conference.com/conference/friday-workshops">https://www.date-conference.com/conference/friday-workshops</a></p> <h2 id="1.1">1.1 Opening Session: Plenary, Awards Ceremony &amp; Keynote Addresses</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 08:30 - 10:30<br /><b>Location / Room:</b> Amphithéâtre Dauphine</p> <p><b>Chair:</b><br />Giorgio Di Natale, <span class="date-blue" style="font-weight:bold">DATE 20<span class="date-red">20</span></span> General Chair, FR</p> <p><b>Co-Chair:</b><br />Cristiana Bolchini, <span class="date-blue" style="font-weight:bold">DATE 20<span class="date-red">20</span></span> Programme Chair, IT</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>08:15</td> <td>1.1.1</td> <td><b>WELCOME ADDRESSES</b><br /><b>Speakers</b>:<br />Giorgio Di Natale<sup>1</sup> and Cristiana Bolchini<sup>2</sup><br /><sup>1</sup>LIRMM, FR; <sup>2</sup>Politecnico di Milano, IT</td> </tr> <tr> <td>08:25</td> <td>1.1.2</td> <td><b>PRESENTATION OF AWARDS</b></td> </tr> <tr> <td>09:15</td> <td>1.1.3</td> <td><b>PLENARY KEYNOTE: THE INDUSTRIAL IOT MICROELECTRONICS REVOLUTION</b><br /><b>Speaker</b>:<br />Philippe Magarshack, STMicroelectronics, FR<br /><em><b>Abstract</b> <p>Industrial IoT (IIoT) systems are now becoming a reality. IIoT is distributed by nature, encompassing many complementary technologies. IIoT systems are composed of sensors, actuators, a means of communication and control units, and are moving into the factories with the Industry 4.0 generation. 
To operate concurrently, all these IIoT components will require a wide range of technologies to maintain such a system-of-systems in a fully operational, coherent and secure state. We identify and describe the four key enablers for the Industrial IoT:</p> <ol> <li>more powerful and diverse embedded computing, available on ST's latest STM32 microcontrollers and microprocessors,</li> <li>augmented by AI applications at the edge (in the end devices), whose development is becoming enormously simplified by our specialized tools,</li> <li>a wide set of connectivity technologies, either with complete systems-on-chip or ready-to-use modules, and</li> <li>a scalable security offering, thanks to either integrated features or dedicated security devices.</li> </ol> <p>We conclude with some perspective on the usage of Digital Twins in the IIoT.</p></em></td> </tr> <tr> <td></td> <td>1.1.4</td> <td><b>PLENARY KEYNOTE: OPEN PARALLEL ULTRA-LOW POWER PLATFORMS FOR EXTREME EDGE AI</b><br /><b>Speaker</b>:<br />Luca Benini, ETH Zurich, CH<br /><em><b>Abstract</b> <p>Edge Artificial Intelligence is the new megatrend, as privacy concerns and network bandwidth/latency bottlenecks prevent cloud offloading of sensor analytics functions in many application domains, from autonomous driving to advanced prosthetics. The next wave of "Extreme Edge AI" pushes aggressively towards sensors and actuators, opening major research and business development opportunities. In this talk I will give an overview of recent efforts in developing an Extreme Edge AI platform based on open-source parallel ultra-low power (PULP) RISC-V processors and accelerators. I will then look at what comes next in this brave new world of hardware renaissance.</p></em></td> </tr> <tr> <td>10:30</td> <td></td> <td>End of session</td> </tr> </tbody> </table> <hr /> <h2 id="2.1">2.1 Executive Session: Memories for Emerging Applications</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 11:30 - 13:00<br /><b>Location / Room:</b> Amphithéâtre Jean Prouve</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>13:00</td> <td></td> <td>End of session</td> </tr> </tbody> </table> <hr /> <h2 id="2.2">2.2 Hardware-assisted Secure Systems</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 11:30 - 13:00<br /><b>Location / Room:</b> Chamrousse</p> <p><b>Chair:</b><br />Patrick Schaumont, Virginia Tech, US</p> <p><b>Co-Chair:</b><br />Elif Bilge Kavun, University of Sheffield, GB</p> <p>This session covers state-of-the-art hardware-assisted techniques for secure systems such as random number generators, PUFs, and logic locking &amp; obfuscation. In addition, novel detection methods for hardware Trojans are presented. A toy locking example follows below; a second sketch, on TRNG parameter search, follows the session listing.</p>
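<p><em>Illustrative sketch: the locking and obfuscation papers in this session rely on key-controlled gates inserted into a netlist so that only the correct key restores the intended function. The small Python toy below shows this basic mechanism on an invented two-key example circuit; real schemes use many key bits and, unlike this toy, are hardened against SAT attacks.</em></p> <pre><code># Toy XOR/XNOR logic locking (didactic only): key gates are inserted so
# the netlist computes the intended function only under the correct key.
# The example circuit and key are invented for this sketch.
from itertools import product

def original(a, b, c):
    """The design to protect: a tiny AND-OR cone."""
    return int((a and b) or c)

def locked(a, b, c, k0, k1):
    # XNOR key gate on the internal net (inverter absorbed downstream),
    # plus an XOR key gate on the other input of the OR.
    n1 = int(not ((a and b) ^ k0))   # restores a AND b only when k0 == 1
    return int(n1 or (c ^ k1))       # passes c through only when k1 == 0

for k0, k1 in product((0, 1), repeat=2):
    ok = all(locked(a, b, c, k0, k1) == original(a, b, c)
             for a, b, c in product((0, 1), repeat=3))
    print(f"key=({k0},{k1}):", "unlocks" if ok else "corrupts")
</code></pre>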
<table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>11:30</td> <td>2.2.1</td> <td><b>BACKTRACKING SEARCH FOR OPTIMAL PARAMETERS OF A PLL-BASED TRUE RANDOM NUMBER GENERATOR</b><br /><b>Speaker</b>:<br />Brice Colombier, Université de Lyon, FR<br /><b>Authors</b>:<br />Brice Colombier<sup>1</sup>, Nathalie Bochard<sup>1</sup>, Florent Bernard<sup>2</sup> and Lilian Bossuet<sup>1</sup><br /><sup>1</sup>Université de Lyon, FR; <sup>2</sup>Laboratory Hubert Curien, University of Lyon, UJM Saint-Etienne, FR<br /><em><b>Abstract</b><br />The phase-locked loop-based true random number generator (PLL-TRNG) extracts randomness from clock jitter. It is an interesting construct because it comes with a stochastic model, making it certifiable by certification bodies. However, bringing it to good performance is difficult since it comes with multiple parameters to tune. This article proposes to use backtracking to determine these parameters. Compared to existing methods, based on genetic algorithms or exhaustive search of a feasible set of parameters, backtracking has several advantages. Indeed, since this method is expressible by constraint programming, it provides very good readability. Constraints can be specified in a very straightforward and maintainable way. It also exhibits good performance and generates PLL-TRNG configurations rapidly. Finally, it makes it easy to integrate new exploratory design constraints for the PLL-TRNG. We provide experimental results with a PLL-TRNG implemented on three FPGA families that come with different physical constraints, showing that the method finds good parameters for every one of them. Moreover, we were able to obtain configurations that lead to an increase of 59% in throughput and 82% in jitter sensitivity on average, thereby generating random numbers of higher quality at a faster rate. This approach also paves the way for new design exploration strategies for the PLL-TRNG. The source code of our implementation is open source and available online for reproducibility and reuse.</em></td> </tr> <tr> <td>12:00</td> <td>2.2.2</td> <td><b>LONG-TERM CONTINUOUS ASSESSMENT OF SRAM PUF AND SOURCE OF RANDOM NUMBERS</b><br /><b>Speaker</b>:<br />Rui Wang, Intrinsic-ID, NL<br /><b>Authors</b>:<br />Rui Wang, Georgios Selimis, Roel Maes and Sven Goossens, Intrinsic-ID, NL<br /><em><b>Abstract</b><br />The qualities of Physical Unclonable Functions (PUFs) suffer from several noticeable degradations due to silicon aging. In this paper, we investigate the long-term effects of silicon aging on PUFs derived from the start-up behavior of Static Random Access Memories (SRAM). Previous research on SRAM aging is based on transistor-level simulation or accelerated aging tests at high temperature and voltage to observe aging effects within a short period of time. In contrast, we have run a long-term continuous power-up test on 16 Arduino Leonardo boards under nominal conditions for two years. In total, we collected around 175 million measurements for reliability, uniqueness and randomness evaluation. Analysis shows that the number of bits that flip with respect to the reference increased by 19.3%, while the min-entropy of SRAM PUF noise improved by 19.3% on average after two years of aging. The impact of aging on reliability is smaller under nominal conditions than was previously assessed by the accelerated aging test. 
The test we conduct in this work more closely resembles the conditions of a device in the field, and therefore we more accurately evaluate how silicon aging affects SRAM PUFs.</em></td> </tr> <tr> <td>12:15</td> <td>2.2.3</td> <td><b>RESCUING LOGIC ENCRYPTION IN POST-SAT ERA BY LOCKING &amp; OBFUSCATION</b><br /><b>Speaker</b>:<br />Hai Zhou, Northwestern University, US<br /><b>Authors</b>:<br />Amin Rezaei, Yuanqi Shen and Hai Zhou, Northwestern University, US<br /><em><b>Abstract</b><br />The active participation of external entities in the manufacturing flow has produced numerous hardware security issues, of which piracy and overproduction are likely the most ubiquitous and expensive ones. The main approach to preventing unauthorized products from functioning is logic encryption, which inserts key-controlled gates into the original circuit in such a way that the valid behavior of the circuit only happens when the correct key is applied. The challenge for the security designer is to ensure that neither the correct key nor the original circuit can be revealed by different analyses of the encrypted circuit. However, in state-of-the-art logic encryption works, a lot of performance is sacrificed to guarantee security against powerful logic and structural attacks. This contradicts the primary purpose of logic encryption, which is to protect a precious design from being pirated and overproduced. In this paper, we propose a bilateral logic encryption platform that maintains a high degree of security with small circuit modification. The robustness against exact and approximate attacks is also demonstrated.</em></td> </tr> <tr> <td>12:30</td> <td>2.2.4</td> <td><b>SELECTIVE CONCOLIC TESTING FOR HARDWARE TROJAN DETECTION IN BEHAVIORAL SYSTEMC DESIGNS</b><br /><b>Speaker</b>:<br />Bin Lin, Portland State University, US<br /><b>Authors</b>:<br />Bin Lin<sup>1</sup>, Jinchao Chen<sup>2</sup> and Fei Xie<sup>3</sup><br /><sup>1</sup>Portland State University, US; <sup>2</sup>Northwestern Polytechnical University, CN; <sup>3</sup>Portland State University, US<br /><em><b>Abstract</b><br />With the growing complexity of modern SoC designs and increasingly shortened time-to-market requirements, new design paradigms such as outsourced design services have emerged. The design abstraction level has also been raised from RTL to ESL. Modern SoC designs in ESL often integrate a variety of third-party behavioral intellectual properties and utilize EDA tools intensively to improve design productivity. However, this new design trend makes modern SoCs more vulnerable to hardware Trojan attacks. Although hardware Trojan detection has been studied for more than a decade at RTL and lower levels, it has only recently gained attention for ESL designs. In this paper, we present a novel approach for generating test cases by selective concolic testing to detect hardware Trojans in ESL. We have evaluated our approach on an open-source benchmark that includes various types of hardware Trojans. 
The experimental results demonstrate that our approach is able to detect hardware Trojans effectively and efficiently.</em></td> </tr> <tr> <td>12:45</td> <td>2.2.5</td> <td><b>TEST PATTERN SUPERPOSITION TO DETECT HARDWARE TROJANS</b><br /><b>Speaker</b>:<br />Alex Orailoglu, University of California, San Diego, US<br /><b>Authors</b>:<br />Chris Nigh and Alex Orailoglu, University of California, San Diego, US<br /><em><b>Abstract</b><br />Current methods for the detection of hardware Trojans inserted by an untrusted foundry are either accompanied by unreasonable costs in design/test pattern overhead, or return results that fail to provide confident trustability. The challenges faced by these side-channel techniques are primarily a result of process variation, which renders pre-silicon expectations nearly meaningless in predicting the behavior of a manufactured IC. To overcome this hindrance in a cost-effective manner, we propose an easy-to-implement test-pattern-based approach that is self-referential in nature, capable of dissecting and understanding the characteristics of a given manufactured IC to home in on aberrant measurements that are demonstrative of malicious Trojan hardware. By leveraging the superposition principle to cancel out non-Trojan noise, we can isolate and magnify Trojan circuit effects, all within a regime considerate of practical test and design-for-test infrastructures. Experimental results on Trust-Hub benchmarks demonstrate that the proposed method provides a clear and significant boost in our ability to confidently certify manufactured ICs over similar state-of-the-art techniques.</em></td> </tr> <tr> <td style="width:40px;">13:00</td> <td><a href="/date20/conference/session/IP1">IP1-1</a>, 280</td> <td><b>DYNUNLOCK: UNLOCKING SCAN CHAINS OBFUSCATED USING DYNAMIC KEYS</b><br /><b>Speaker</b>:<br />Nimisha Limaye, New York University, US<br /><b>Authors</b>:<br />Nimisha Limaye<sup>1</sup> and Ozgur Sinanoglu<sup>2</sup><br /><sup>1</sup>New York University, US; <sup>2</sup>New York University Abu Dhabi, AE<br /><em><b>Abstract</b><br />Outsourcing in the semiconductor industry has opened up avenues for faster and more cost-effective chip manufacturing. However, it has also introduced untrusted entities with malicious intent to steal intellectual property (IP), overproduce circuits, insert hardware Trojans, or counterfeit chips. Recently, a defense was proposed that obfuscates scan access based on a dynamic key which is initially generated from a secret key but changes in every clock cycle. This defense can be considered the most rigorous among all scan locking techniques. In this paper, we propose an attack that remodels this defense into one that can be broken by the SAT attack, while we also note that our attack can be adjusted to break other, less rigorous scan locking techniques (where the key is updated less frequently) as well.</em></td> </tr> <tr> <td>13:00</td> <td></td> <td>End of session</td> </tr> </tbody> </table>
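<p><em>Illustrative sketch: paper 2.2.1 above casts PLL-TRNG parameter selection as a constraint-programming/backtracking search. The Python toy below mimics the idea with two PLLs, each described by a (multiplier, divider) pair; the reference clock, frequency limits and objective are invented stand-ins for the datasheet constraints and the stochastic jitter model used in the actual work.</em></p> <pre><code># Toy backtracking search for PLL-TRNG parameters (didactic only).
# All numeric constraints below are invented for the sketch.

F_REF = 125e6                       # reference clock in Hz (assumed)

def valid(partial):
    """Constraint check on a partial assignment; enables early pruning."""
    for mul, div in partial:
        f_out = F_REF * mul / div
        if 200e6 > f_out or f_out > 1200e6:   # assumed VCO output range
            return False
    if len(partial) == 2:
        (m0, d0), (m1, d1) = partial
        if m0 * d1 == m1 * d0:      # identical clocks: no relative jitter sweep
            return False
    return True

def backtrack(partial=()):
    """Depth-first search that abandons a branch as soon as a constraint fails."""
    if len(partial) == 2:
        yield partial
        return
    for mul in range(2, 17):
        for div in range(1, 17):
            cand = partial + ((mul, div),)
            if valid(cand):
                yield from backtrack(cand)

# toy objective: prefer high multipliers (a stand-in for jitter sensitivity)
best = max(backtrack(), key=lambda p: p[0][0] * p[1][0])
print("example configuration (mul, div) per PLL:", best)
</code></pre>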
<hr /> <h2 id="2.3">2.3 Fueling the future of computing: 3D, TFT, or disruptive memories?</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 11:30 - 13:00<br /><b>Location / Room:</b> Autrans</p> <p><b>Chair:</b><br />Thomas Ernst, CEA-Leti, FR</p> <p><b>Co-Chair:</b><br />Yuanqing Cheng, Beihang University, CN</p> <p>In the post-CMOS era, the future of computing relies more and more on emerging technologies such as resistive memories, TFTs and 3D integration, or their combination, to continue performance improvements: from a novel acceleration solution for deep neural networks with ferroelectric transistor technology to a physical design methodology for face-to-face 3D ICs that enables commercial-quality IC layouts. Furthermore, the monolithic 3D advantage obtained by combining TFT and RRAM technologies is quantified using a novel open-source CAD flow. A toy ternary dot-product sketch follows this session's listing.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>11:30</td> <td>2.3.1</td> <td><b>TERNARY COMPUTE-ENABLED MEMORY USING FERROELECTRIC TRANSISTORS FOR ACCELERATING DEEP NEURAL NETWORKS</b><br /><b>Speaker</b>:<br />Sandeep Krishna Thirumala, Purdue University, US<br /><b>Authors</b>:<br />Sandeep Krishna Thirumala, Shubham Jain, Sumeet Gupta and Anand Raghunathan, Purdue University, US<br /><em><b>Abstract</b><br />Ternary Deep Neural Networks (DNNs), which employ ternary precision for weights and activations, have recently been shown to attain accuracies close to full-precision DNNs, raising interest in their efficient hardware realization. In this work, we propose a Non-Volatile Ternary Compute-Enabled memory cell (TeC-Cell) based on ferroelectric transistors (FEFETs) for in-memory computing in the signed ternary regime. In particular, the proposed cell enables storage of ternary weights and employs multi-word-line assertion to perform massively parallel signed dot-product computations between ternary weights and ternary inputs. We evaluate the proposed design at the array level and show 72% and 74% higher energy efficiency for multiply-and-accumulate (MAC) operations compared to standard near-memory computing designs based on SRAM and FEFET, respectively. Furthermore, we evaluate the proposed TeC-Cell in an existing ternary in-memory DNN accelerator. 
Our results show a 3.3X-3.4X reduction in system energy and a 4.3X-7X improvement in system performance over SRAM- and FEFET-based near-memory accelerators, across a wide range of DNN benchmarks including both deep convolutional and recurrent neural networks.</em></td> </tr> <tr> <td>12:00</td> <td>2.3.2</td> <td><b>MACRO-3D: A PHYSICAL DESIGN METHODOLOGY FOR FACE-TO-FACE-STACKED HETEROGENEOUS 3D ICS</b><br /><b>Speaker</b>:<br />Lennart Bamberg, Universität Bremen, DE / GrAi Matter Labs, DE<br /><b>Authors</b>:<br />Lennart Bamberg<sup>1</sup>, Lingjun Zhu<sup>2</sup>, Sai Pentapati<sup>2</sup>, Da Eun Shim<sup>2</sup>, Alberto Garcia-Ortiz<sup>3</sup> and Sung Kyu Lim<sup>2</sup><br /><sup>1</sup>GrAi Matter Labs, NL; <sup>2</sup>Georgia Institute of Technology, US; <sup>3</sup>Universität Bremen, DE<br /><em><b>Abstract</b><br />Memory-on-logic and sensor-on-logic face-to-face stacking are emerging design approaches that promise a significant increase in the performance of modern systems-on-chip at reasonable cost. In this work, a netlist-to-layout design flow for such heterogeneous 3D systems is proposed. The proposed technique overcomes severe limitations of existing 3D physical design methodologies. A RISC-V-based multi-core system, implemented in a commercial technology, is used as a case study to evaluate the proposed design flow. The case study is performed for modern/large and small cache sizes to show the superiority of the proposed methodology for a broad set of systems. While previous 3D design flows fail to improve performance over 2D baseline designs for processor systems with a significant memory area occupation, the proposed flow improves performance and power by 20.4-28.2 % and 3.2-3.8 %, respectively.</em></td> </tr> <tr> <td>12:30</td> <td>2.3.3</td> <td><b>QUANTIFYING THE BENEFITS OF MONOLITHIC 3D COMPUTING SYSTEMS ENABLED BY TFT AND RRAM</b><br /><b>Speaker</b>:<br />Abdallah Felfel, Zewail City of Science and Technology, EG<br /><b>Authors</b>:<br />Abdallah M Felfel<sup>1</sup>, Kamalika Datta<sup>1</sup>, Arko Dutt<sup>1</sup>, Hasita Veluri<sup>2</sup>, Ahmed Zaky<sup>1</sup>, Aaron Thean<sup>2</sup> and Mohamed M Sabry Aly<sup>1</sup><br /><sup>1</sup>Nanyang Technological University, SG; <sup>2</sup>National University of Singapore, SG<br /><em><b>Abstract</b><br />Current data-centric workloads, such as deep learning, expose the memory-access inefficiencies of current computing systems. Monolithic 3D integration can overcome this limitation by leveraging fine-grained and dense vertical connectivity to enable massively concurrent accesses between compute and memory units. Thin-Film Transistors (TFTs) and Resistive RAM (RRAM) naturally enable monolithic 3D integration as they are fabricated at low temperature (a crucial requirement). In this paper, we explore ZnO-based TFTs and HfO2-based RRAM to build a 1TFT-1R memory subsystem in the upper tiers. The TFT-based memory subsystem is stacked on top of a Si-FET bottom tier that can include compute units and SRAM. System-level simulations for various deep learning workloads show that our TFT-based monolithic 3D system achieves up to 11.4x system-level energy-delay product benefits compared to a 2D baseline with off-chip DRAM, 5.8x benefits over interposer-based 2.5D integration, and 1.25x over 3D stacking of RRAM on silicon using through-silicon vias. 
These gains are achieved despite the low density of TFT-based RRAM and its higher energy consumption versus 3D stacking with RRAM, both consequences of inherent TFT limitations.</em></td> </tr> <tr> <td>12:45</td> <td>2.3.4</td> <td><b>ORGANIC-FLOW: AN OPEN-SOURCE ORGANIC STANDARD CELL LIBRARY AND PROCESS DEVELOPMENT KIT</b><br /><b>Speaker</b>:<br />Ting-Jung Chang, Princeton University, US<br /><b>Authors</b>:<br />Ting-Jung Chang, Zhuozhi Yao, Barry P. Rand and David Wentzlaff, Princeton University, US<br /><em><b>Abstract</b><br />Organic thin-film transistors (OTFTs) are drawing increasing attention due to their unique advantages of mechanical flexibility, low-cost fabrication, and biodegradability, enabling diverse applications that were not achievable using traditional inorganic transistors. With a growing number of complex applications being proposed, the need to expedite the design process and ensure the yield of large-scale designs with organic technology increases. A complete digital standard cell library plays a crucial role in integrating the emerging organic technology into existing computer-aided design (CAD) flows. In this paper, we present the design, fabrication, and characterization of a standard cell library based on bottom-gate, top-contact pentacene OTFTs. We also propose a commercial-tool-compatible RTL-to-GDS flow along with a new organic process design kit (PDK) developed based on our process. To the best of our knowledge, this is the first open-source organic standard cell library, enabling the community to explore this emerging technology.</em></td> </tr> <tr> <td style="width:40px;">13:00</td> <td><a href="/date20/conference/session/IP1">IP1-2</a>, 130</td> <td><b>CMOS IMPLEMENTATION OF SWITCHING LATTICES</b><br /><b>Speaker</b>:<br />Levent Aksoy, Istanbul TU, TR<br /><b>Authors</b>:<br />Ismail Cevik, Levent Aksoy and Mustafa Altun, Istanbul TU, TR<br /><em><b>Abstract</b><br />Switching lattices consisting of four-terminal switches have been introduced as area-efficient structures to realize logic functions. Many optimization algorithms, both exact and heuristic, have been proposed to realize logic functions on lattices with the fewest four-terminal switches. Hence, the computing potential of switching lattices has been adequately justified in the literature. However, the same cannot be said for their physical implementation. There have been conceptual ideas for the technology development of switching lattices, but no concrete and directly applicable technology has been proposed yet. In this study, we show that switching lattices can be directly and efficiently implemented using a standard CMOS process. To realize a given logic function on a switching lattice, we propose static and dynamic logic solutions. The proposed circuits, as well as the compared conventional ones, are designed and simulated in the Cadence environment using a TSMC 65nm CMOS process. Post-layout experimental results on logic functions show that switching lattices occupy a much smaller area than conventional CMOS implementations, while offering competitive delay and power consumption.</em></td> </tr> <tr> <td style="width:40px;">13:01</td> <td><a href="/date20/conference/session/IP1">IP1-3</a>, 327</td> <td><b>A TIMING UNCERTAINTY-AWARE CLOCK TREE TOPOLOGY GENERATION ALGORITHM FOR SINGLE FLUX QUANTUM CIRCUITS</b><br /><b>Speaker</b>:<br />Massoud Pedram, University of Southern California, US<br /><b>Authors</b>:<br />Soheil Nazar Shahsavani, Bo Zhang and Massoud Pedram, University of Southern California, US<br /><em><b>Abstract</b><br />This paper presents a low-cost, timing uncertainty-aware synchronous clock tree topology generation algorithm for single flux quantum (SFQ) logic circuits. The proposed method considers the criticality of the data paths in terms of timing slacks as well as the total wirelength of the clock tree and generates a (height-) balanced binary clock tree using a bottom-up approach and an integer linear programming (ILP) formulation. The statistical timing analysis results for ten benchmark circuits show that the proposed method improves the total wirelength and the total negative hold slack by 4.2% and 64.6%, respectively, on average, compared with a wirelength-driven state-of-the-art balanced topology generation approach.</em></td> </tr> <tr> <td>13:00</td> <td></td> <td>End of session</td> </tr> </tbody> </table>
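<p><em>Illustrative sketch: the TeC-Cell of paper 2.3.1 stores signed ternary weights differentially and asserts many word lines at once so that bitline currents accumulate a signed dot product. The NumPy toy below reproduces that arithmetic functionally with invented array sizes; it is a behavioral model only, not the circuit.</em></p> <pre><code># Toy functional model of a signed ternary in-memory dot product.
import numpy as np

rng = np.random.default_rng(0)
rows, cols = 64, 16                       # wordlines x bitlines (assumed)

# Ternary weights in {-1, 0, +1}, stored as two bit-planes (pos / neg),
# the usual trick for differential ternary storage.
W = rng.integers(-1, 2, size=(rows, cols))
W_pos = (W == 1).astype(int)
W_neg = (W == -1).astype(int)

x = rng.integers(-1, 2, size=rows)        # ternary input vector
x_pos = (x == 1).astype(int)
x_neg = (x == -1).astype(int)

# Multi-wordline assertion ~ summing currents on each bitline: matching
# signs contribute +1, opposite signs contribute -1, zeros contribute 0.
bitline = (x_pos @ W_pos + x_neg @ W_neg) - (x_pos @ W_neg + x_neg @ W_pos)

assert np.array_equal(bitline, x @ W)     # matches the exact dot product
print(bitline)
</code></pre>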
<hr /> <h2 id="2.4">2.4 Challenges in Analog Design Automation &amp; Security</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 11:30 - 13:00<br /><b>Location / Room:</b> Stendhal</p> <p><b>Chair:</b><br />Manuel Barragan, TIMA, FR</p> <p><b>Co-Chair:</b><br />Gildas Leger, IMSE-CNM, ES</p> <p>Producing reliable and secure analog circuits is a challenging task. This session addresses novel and systematic approaches to analog security, based on key sequencing, and to analog design, from automatic netlist annotation to Bayesian optimization with sparse Gaussian process models. A toy sparse-GP sketch follows this session's listing.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>11:30</td> <td>2.4.1</td> <td><b>GANA: GRAPH CONVOLUTIONAL NETWORK BASED AUTOMATED NETLIST ANNOTATION FOR ANALOG CIRCUITS</b><br /><b>Speaker</b>:<br />Kishor Kunal, University of Minnesota, US<br /><b>Authors</b>:<br />Kishor Kunal<sup>1</sup>, Tonmoy Dhar<sup>2</sup>, Meghna Madhusudan<sup>2</sup>, Jitesh Poojary<sup>1</sup>, Arvind Sharma<sup>1</sup>, Wenbin Xu<sup>3</sup>, Steven Burns<sup>4</sup>, Jiang Hu<sup>3</sup>, Ramesh Harjani<sup>1</sup> and Sachin S. Sapatnekar<sup>1</sup><br /><sup>1</sup>University of Minnesota, US; <sup>2</sup>University of Minnesota Twin Cities, US; <sup>3</sup>Texas A&amp;M University, US; <sup>4</sup>Intel Corporation, US<br /><em><b>Abstract</b><br />Automated subcircuit identification enables the creation of hierarchical representations of analog netlists and can facilitate a variety of design automation tasks such as circuit layout and optimization. Subcircuit identification must be capable of navigating the numerous alternative structures that can implement any analog function, but traditional graph-based methods have been limited by the large number of such structural variants. 
The novel approach in this paper is based on the use of a trained graph convolutional neural network (GCN) that identifies netlist elements for circuit blocks at upper levels of the design hierarchy. Structures at lower levels of the hierarchy are identified using graph-based algorithms. The proposed recognition scheme organically detects layout constraints, such as symmetry and matching, whose identification is essential for high-quality hierarchical layout. The subcircuit identification method demonstrates a high degree of accuracy over a wide range of analog designs, successfully identifies larger circuits that contain sub-blocks such as OTAs, LNAs, mixers, oscillators, and band-pass filters, and provides hierarchical decompositions of such circuits.</em></td> </tr> <tr> <td>12:00</td> <td>2.4.2</td> <td><b>SECURING PROGRAMMABLE ANALOG ICS AGAINST PIRACY</b><br /><b>Speaker</b>:<br />Mohamed Elshamy, Sorbonne Université, CNRS, LIP6, FR<br /><b>Authors</b>:<br />Mohamed Elshamy, Alhassan Sayed, Marie-Minerve Louerat, Amine Rhouni, Hassan Aboushady and Haralampos-G. Stratigopoulos, Sorbonne Université, CNRS, LIP6, FR<br /><em><b>Abstract</b><br />In this paper, we demonstrate a security approach for the class of highly programmable analog Integrated Circuits (ICs) that can be used as a countermeasure against unauthorized chip use and piracy. The approach relies on functionality locking, i.e., a lock mechanism is introduced into the design such that unless the correct key is provided the functionality breaks. We show that for highly programmable analog ICs the programmable fabric can naturally be used as the lock mechanism. We demonstrate the approach on a multi-standard RF receiver with configuration settings of 64-bit words.</em></td> </tr> <tr> <td>12:30</td> <td>2.4.3</td> <td><b>AN EFFICIENT BAYESIAN OPTIMIZATION APPROACH FOR ANALOG CIRCUIT SYNTHESIS VIA SPARSE GAUSSIAN PROCESS MODELING</b><br /><b>Speaker</b>:<br />Biao He, Fudan University, CN<br /><b>Authors</b>:<br />Biao He<sup>1</sup>, Shuhan Zhang<sup>1</sup>, Fan Yang<sup>2</sup>, Changhao Yan<sup>1</sup>, Dian Zhou<sup>3</sup> and Xuan Zeng<sup>1</sup><br /><sup>1</sup>Fudan University, CN; <sup>2</sup>Fudan University, CN; <sup>3</sup>UT Dallas, US<br /><em><b>Abstract</b><br />Bayesian optimization with Gaussian process models has been proposed for analog synthesis, since it is efficient for the optimization of expensive black-box functions. However, the computational costs for training and prediction with Gaussian process models are $O(N^3)$ and $O(N^2)$, respectively, where $N$ is the number of data points. The overhead of Gaussian process modeling becomes non-negligible as $N$ grows even moderately large. Recently, a Bayesian optimization approach using neural networks has been proposed to address this problem. It reduces the computational costs of training/prediction with Gaussian process models to $O(N)$ and $O(1)$, respectively. However, it reduces the infinite-dimensional kernel of the traditional Gaussian process to a finite-dimensional kernel using a neural network mapping, which can weaken the characterization ability of the Gaussian process. In this paper, we propose a novel Bayesian optimization approach using the Sparse Pseudo-input Gaussian Process (SPGP). The idea is to select $M$ so-called inducing points out of the $N$ data points and use the kernel function of the $M$ inducing points to approximate the kernel function of the $N$ data points. 
The proposed approach also reduces the computational costs of training/prediction to $O(N)$ and $O(1)$, respectively. However, the kernel of the proposed approach is still infinite-dimensional, so it provides characterization ability similar to the traditional Gaussian process. Several experiments demonstrate the efficiency of the proposed approach.</em></td> </tr> <tr> <td style="width:40px;">13:00</td> <td><a href="/date20/conference/session/IP1">IP1-4</a>, 307</td> <td><b>SYMMETRY-BASED A/M-S BIST (SYMBIST): DEMONSTRATION ON A SAR ADC IP</b><br /><b>Speaker</b>:<br />Antonios Pavlidis, Sorbonne Université, CNRS, LIP6, FR<br /><b>Authors</b>:<br />Antonios Pavlidis<sup>1</sup>, Marie-Minerve Louerat<sup>1</sup>, Eric Faehn<sup>2</sup>, Anand Kumar<sup>3</sup> and Haralampos-G. Stratigopoulos<sup>1</sup><br /><sup>1</sup>Sorbonne Université, CNRS, LIP6, FR; <sup>2</sup>STMicroelectronics, FR; <sup>3</sup>STMicroelectronics, IN<br /><em><b>Abstract</b><br />In this paper, we propose a defect-oriented Built-In Self-Test (BIST) paradigm for analog and mixed-signal (A/M-S) Integrated Circuits (ICs), called symmetry-based BIST (SymBIST). SymBIST exploits inherent symmetries in the design to generate invariances that should hold true only in defect-free operation. Violation of any of these invariances points to defect detection. We demonstrate SymBIST on a 65nm 10-bit Successive Approximation Register (SAR) Analog-to-Digital Converter (ADC) IP by STMicroelectronics. SymBIST does not result in any performance penalty, it incurs an area overhead of less than 5%, the test time equals about 16x the time to convert an analog input sample, it can be interfaced with a 2-pin digital access mechanism, and it covers the entire A/M-S part of the IP, achieving a likelihood-weighted defect coverage higher than 85%.</em></td> </tr> <tr> <td style="width:40px;">13:01</td> <td><a href="/date20/conference/session/IP1">IP1-5</a>, 476</td> <td><b>RANGE CONTROLLED FLOATING-GATE TRANSISTORS: A UNIFIED SOLUTION FOR UNLOCKING AND CALIBRATING ANALOG ICS</b><br /><b>Speaker</b>:<br />Yiorgos Makris, University of Texas at Dallas, US<br /><b>Authors</b>:<br />Sai Govinda Rao Nimmalapudi, Georgios Volanis, Yichuan Lu, Angelos Antonopoulos, Andrew Marshall and Yiorgos Makris, University of Texas at Dallas, US<br /><em><b>Abstract</b><br />Analog Floating-Gate Transistors (AFGTs) are commonly used to fine-tune the performance of analog integrated circuits (ICs) after fabrication, thereby enabling high yield despite component mismatch and variability in semiconductor manufacturing. In this work, we propose a methodology that leverages such AFGTs to also prevent unauthorized use of analog ICs. Specifically, we introduce a locking mechanism that limits programming of AFGTs to a range which is inadequate for achieving the desired analog performance. Accordingly, our solution entails a two-step unlock-&amp;-calibrate process. In the first step, AFGTs must be programmed through a secret sequence of voltages within that range, called waypoints. Successfully following the waypoints unlocks the ability to program the AFGTs over their entire range. Thereby, in the second step, the typical AFGT-based post-silicon calibration process can be applied to adjust the performance of the IC within its specifications. Protection against brute-force or intelligent attacks attempting to guess the unlocking sequence is ensured through the vast space of possible waypoints in the continuous (analog) domain. Feasibility and effectiveness of the proposed solution are demonstrated and evaluated on an Operational Transconductance Amplifier (OTA). To our knowledge, this is the first solution that leverages the power of analog keys and addresses both the unlocking and calibration needs of analog ICs in a unified manner.</em></td> </tr> <tr> <td style="width:40px;">13:02</td> <td><a href="/date20/conference/session/IP1">IP1-6</a>, 699</td> <td><b>TESTING THROUGH SILICON VIAS IN POWER DISTRIBUTION NETWORK OF 3D-IC WITH MANUFACTURING VARIABILITY CANCELLATION</b><br /><b>Speaker</b>:<br />Koutaro Hachiya, Teikyo Heisei University, JP<br /><b>Authors</b>:<br />Koutaro Hachiya<sup>1</sup> and Atsushi Kurokawa<sup>2</sup><br /><sup>1</sup>Teikyo Heisei University, JP; <sup>2</sup>Hirosaki University, JP<br /><em><b>Abstract</b><br />To detect open defects of power TSVs (Through-Silicon Vias) in PDNs (Power Distribution Networks) of stacked 3D-ICs, a method was proposed that measures resistances between power micro-bumps connected to the PDN and detects defects of TSVs by changes in those resistances. It suffers from manufacturing variability and must place one micro-bump directly under each TSV (direct-type placement style) to maximize its diagnostic performance, but the performance was not sufficient for practical applications. A variability cancellation method was also devised to improve the diagnostic performance. In this paper, a novel middle-type placement style is proposed that places one micro-bump between each pair of TSVs. Experimental simulations using a 3D-IC example show that the diagnostic performance of both the direct-type and middle-type examples is improved by the variability cancellation and reaches a practical level. The middle-type example outperforms the direct-type example in terms of the number of micro-bumps and the number of measurements.</em></td> </tr> <tr> <td>13:00</td> <td></td> <td>End of session</td> </tr> </tbody> </table>
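<p><em>Illustrative sketch: paper 2.4.3 approximates a full Gaussian process with $M$ inducing points so that training no longer scales as $O(N^3)$. The NumPy toy below shows the simplest inducing-point flavour (subset of regressors) on invented 1-D data; the actual SPGP additionally learns the inducing inputs and corrects the predictive variance.</em></p> <pre><code># Toy inducing-point Gaussian process regression (subset-of-regressors).
import numpy as np

def rbf(A, B, ls=0.5):
    """Squared-exponential kernel between row vectors of A and B."""
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d / ls**2)

rng = np.random.default_rng(1)
N, M = 400, 20                              # data points vs inducing points
X = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(N)
Z = np.linspace(-3, 3, M)[:, None]          # inducing inputs, fixed here

noise = 0.1**2
Kmm = rbf(Z, Z) + 1e-8 * np.eye(M)
Kmn = rbf(Z, X)                             # only an M x N matrix is needed
A = noise * Kmm + Kmn @ Kmn.T               # M x M system instead of N x N
w = np.linalg.solve(A, Kmn @ y)

Xs = np.linspace(-3, 3, 5)[:, None]
mean = rbf(Xs, Z) @ w                       # O(M) work per test point
print(np.c_[Xs[:, 0], mean, np.sin(2 * Xs[:, 0])])
</code></pre>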
<hr /> <h2 id="2.5">2.5 Pruning Techniques for Embedded Neural Networks</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 11:30 - 13:00<br /><b>Location / Room:</b> Bayard</p> <p><b>Chair:</b><br />Marian Verhelst, KU Leuven, BE</p> <p><b>Co-Chair:</b><br />Dirk Ziegenbein, Robert Bosch GmbH, DE</p> <p>Network pruning has been applied successfully to reduce the computational and memory footprint of neural network processing. This session presents three innovations to better exploit pruning in embedded processing architectures. The solutions presented extend the sparsity concept to the bit level with an enhanced bit-level pruning technique based on CSD representations (a toy CSD-encoding sketch follows this session's listing), introduce a novel group-level pruning technique demonstrating an improved trade-off between hardware-execution cost and accuracy loss, and explore a sparsity-aware cache architecture to reduce cache miss rate and execution time; a toy model of the cache idea follows below.</p>
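<p><em>Illustrative sketch: the sparsity-aware cache explored in this session (paper 2.5.3) keeps zero-valued lines in a separate "null cache" that stores only their addresses. The Python toy below models that idea with invented capacities; the paper realizes the null cache with a TCAM plus zero-detection and address-merging logic, none of which is modeled here.</em></p> <pre><code># Toy behavioral model of a zero-line-aware cache (invented sizes).
from collections import OrderedDict

LINE = 64                                    # bytes per cache line

class SparseCacheToy:
    def __init__(self, data_lines=4, null_lines=16):
        self.data = OrderedDict()            # addr tag -> 64-byte payload
        self.null = OrderedDict()            # addr tag -> nothing (tag only)
        self.data_lines, self.null_lines = data_lines, null_lines
        self.hits = self.misses = 0

    def access(self, addr, memory):
        tag = addr // LINE
        if tag in self.data or tag in self.null:
            self.hits += 1
            return
        self.misses += 1
        line = memory[tag * LINE:(tag + 1) * LINE]
        if any(line):                        # non-zero line: normal data cache
            self.data[tag] = bytes(line)
            if len(self.data) > self.data_lines:
                self.data.popitem(last=False)   # evict the oldest entry
        else:                                # all-zero line: tag-only storage
            self.null[tag] = None
            if len(self.null) > self.null_lines:
                self.null.popitem(last=False)

mem = bytes(1024)                            # a fully zero (sparse) region
c = SparseCacheToy()
for sweep in range(2):                       # 16 zero lines, swept twice
    for a in range(0, len(mem), LINE):
        c.access(a, mem)
print(f"hits={c.hits} misses={c.misses}")    # second sweep hits the null cache
</code></pre>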
<table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>11:30</td> <td>2.5.1</td> <td><b>DEEPER WEIGHT PRUNING WITHOUT ACCURACY LOSS IN DEEP NEURAL NETWORKS</b><br /><b>Speaker</b>:<br />Byungmin Ahn, Seoul National University, KR<br /><b>Authors</b>:<br />Byungmin Ahn and Taewhan Kim, Seoul National University, KR<br /><em><b>Abstract</b><br />This work overcomes the inherent limitation of bit-level weight pruning, namely that the maximal computation speedup is bounded by the total number of non-zero bits of the weights, a bound that is invariably considered "uncontrollable" (i.e., constant) for the neural network to be pruned. Precisely, this work, based on the canonical signed digit (CSD) encoding, (1) proposes a transformation technique which converts the two's complement representation of every weight into a set of CSD representations with a minimal or near-minimal number of essential (i.e., non-zero) bits, (2) formulates the problem of selecting CSD representations of weights that maximize the parallelism of bit-level multiplication on the weights as a multi-objective shortest path problem and solves it efficiently using an approximation algorithm, and (3) proposes a supporting novel acceleration architecture that requires no non-trivial additional hardware. Through experiments, it is shown that our proposed approach reduces the number of essential bits by 69% on AlexNet and 74% on VGG-16, by which our accelerator reduces the inference computation time by 47% on AlexNet and 50% on VGG-16 over conventional bit-level weight pruning.</em></td> </tr> <tr> <td>12:00</td> <td>2.5.2</td> <td><b>FLEXIBLE GROUP-LEVEL PRUNING OF DEEP NEURAL NETWORKS FOR ON-DEVICE MACHINE LEARNING</b><br /><b>Speaker</b>:<br />Dongkun Shin, Sungkyunkwan University, KR<br /><b>Authors</b>:<br />Kwangbae Lee, Hoseung Kim, Hayun Lee and Dongkun Shin, Sungkyunkwan University, KR<br /><em><b>Abstract</b><br />Network pruning is a promising compression technique to reduce the computation and memory access cost of deep neural networks. Pruning techniques are classified into two types: fine-grained pruning and coarse-grained pruning. Fine-grained pruning eliminates individual connections if they are insignificant and thus usually generates irregular networks; therefore, it can fail to reduce inference time. Coarse-grained pruning, such as filter-level and channel-level techniques, can produce hardware-friendly networks. However, it can suffer from low accuracy. In this paper, we focus on the group-level pruning method to accelerate deep neural networks on mobile GPUs, where several adjacent weights are pruned in a group to mitigate the irregularity of pruned networks while providing high accuracy. Although several group-level pruning techniques have been proposed, previous techniques select weight groups to be pruned at group-size-aligned locations to reduce the problem space. Instead, we propose an unaligned approach to improve the accuracy of the compressed model. We can find the optimal solution of the unaligned group selection problem with dynamic programming. 
Our technique also generates balanced sparse networks to achieve load balance across parallel computing units. Experiments demonstrate that 2D unaligned group-level pruning achieves a 3.12% lower error rate for the ResNet-20 network on CIFAR-10 compared to previous 2D aligned group-level pruning at 95% sparsity.</em></td> </tr> <tr> <td>12:30</td> <td>2.5.3</td> <td><b>SPARSITY-AWARE CACHES TO ACCELERATE DEEP NEURAL NETWORKS</b><br /><b>Speaker</b>:<br />Vinod Ganesan, IIT Madras, IN<br /><b>Authors</b>:<br />Vinod Ganesan<sup>1</sup>, Sanchari Sen<sup>2</sup>, Pratyush Kumar<sup>1</sup>, Neel Gala<sup>1</sup>, Kamakoti Veezhinatha<sup>1</sup> and Anand Raghunathan<sup>2</sup><br /><sup>1</sup>IIT Madras, IN; <sup>2</sup>Purdue University, US<br /><em><b>Abstract</b><br />Deep Neural Networks (DNNs) have transformed the field of artificial intelligence and represent the state-of-the-art in many machine learning tasks. There is considerable interest in using DNNs to realize edge intelligence in highly resource-constrained devices such as wearables and IoT sensors. Unfortunately, the high computational requirements of DNNs pose a serious challenge to their deployment in these systems. Moreover, due to tight cost (and hence, area) constraints, these devices are often unable to accommodate hardware accelerators, requiring DNNs to execute on the General Purpose Processor (GPP) cores that they contain. We address this challenge through lightweight micro-architectural extensions to the memory hierarchy of GPPs that exploit a key attribute of DNNs, viz. sparsity, or the prevalence of zero values. We propose SparseCache, an enhanced cache architecture that utilizes a null cache based on a Ternary Content Addressable Memory (TCAM) to compactly store zero-valued cache lines while storing non-zero lines in a conventional data cache. By storing addresses rather than values for zero-valued cache lines, SparseCache increases the effective cache capacity, thereby reducing the overall miss rate and execution time. SparseCache utilizes a Zero Detector and Approximator (ZDA) and Address Merger (AM) to perform reads and writes to the null cache. We evaluate SparseCache on four state-of-the-art DNNs programmed with the Caffe framework. SparseCache achieves a 5-28% reduction in miss rate, which translates into a 5-21% reduction in execution time, with only 0.1% area and 3.8% power overheads in comparison to a low-end Intel Atom Z-series processor.</em></td> </tr> <tr> <td style="width:40px;">13:00</td> <td><a href="/date20/conference/session/IP1">IP1-7</a>, 429</td> <td><b>TFAPPROX: TOWARDS A FAST EMULATION OF DNN APPROXIMATE HARDWARE ACCELERATORS ON GPU</b><br /><b>Speaker</b>:<br />Zdenek Vasicek, Brno University of Technology, CZ<br /><b>Authors</b>:<br />Filip Vaverka, Vojtech Mrazek, Zdenek Vasicek and Lukas Sekanina, Brno University of Technology, CZ<br /><em><b>Abstract</b><br />The energy efficiency of hardware accelerators of deep neural networks (DNN) can be improved by introducing approximate arithmetic circuits. In order to quantify the error introduced by using these circuits and avoid expensive hardware prototyping, a software emulator of the DNN accelerator is usually executed on a CPU or GPU. However, this emulation is typically two or three orders of magnitude slower than a software DNN implementation running on a CPU or GPU and operating with standard floating-point arithmetic instructions and common DNN libraries. The reason is that there is no hardware support for approximate arithmetic operations on common CPUs and GPUs, and these operations have to be expensively emulated. To address this issue, we propose an efficient emulation method for the approximate circuits utilized in a given DNN accelerator which is emulated on a GPU. All relevant approximate circuits are implemented as look-up tables and accessed through the texture memory mechanism of CUDA-capable GPUs. We exploit the fact that the texture memory is optimized for irregular read-only access and in some GPU architectures is even implemented as a dedicated cache. This technique allowed us to reduce the inference time of the emulated DNN accelerator approximately 200 times with respect to an optimized CPU version on complex DNNs such as ResNet. The proposed approach extends the TensorFlow library and is available online at <a href="https://github.com/ehw-fit/tf-approximate">https://github.com/ehw-fit/tf-approximate</a>.</em></td> </tr> <tr> <td>13:00</td> <td></td> <td>End of session</td> </tr> </tbody> </table>
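<p><em>Illustrative sketch: the bit-level pruning of paper 2.5.1 starts from canonical signed digit (CSD) encoding, which rewrites a weight with digits in {-1, 0, +1} so that far fewer non-zero ("essential") bits remain. The self-contained Python below implements the textbook CSD conversion; the paper goes further by selecting among near-minimal CSD forms to maximize bit-level parallelism.</em></p> <pre><code># Canonical signed-digit (CSD) re-encoding: no two adjacent digits are
# non-zero, which minimises the essential bits a bit-serial multiplier
# has to process.

def to_csd(w):
    """Return CSD digits of integer w, least-significant digit first."""
    digits = []
    while w != 0:
        if w % 2:                  # odd: emit +1 or -1 so the rest is even
            d = 2 - (w % 4)        # +1 if w % 4 == 1, else -1
            digits.append(d)
            w -= d
        else:
            digits.append(0)
        w //= 2                    # floor division also handles negative w
    return digits

def from_digits(digits):
    return sum(d * 2**i for i, d in enumerate(digits))

w = 119                            # 0b1110111: six non-zero bits in binary
csd = to_csd(w)                    # [-1, 0, 0, -1, 0, 0, 0, 1] = 128 - 8 - 1
assert from_digits(csd) == w
print("essential bits:", sum(d != 0 for d in csd), "vs", bin(w).count("1"))
</code></pre>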
<hr /> <h2 id="2.6">2.6 Improving reliability and fault tolerance of advanced memories</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 11:30 - 13:00<br /><b>Location / Room:</b> Lesdiguières</p> <p><b>Chair:</b><br />Mounir Benabdenbi, TIMA Laboratory, FR</p> <p><b>Co-Chair:</b><br />Said Hamdioui, Delft University of Technology, NL</p> <p>This session discusses reliability issues for different memory technologies, addressing fault tolerance of memristors, how to reduce simulations with importance sampling, and advanced metrics as a measure of the reliability of NAND flash memories. A toy importance-sampling sketch follows this session's listing.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>11:30</td> <td>2.6.1</td> <td><b>ON IMPROVING FAULT TOLERANCE OF MEMRISTOR CROSSBAR BASED NEURAL NETWORK DESIGNS BY TARGET SPARSIFYING</b><br /><b>Speaker</b>:<br />Yu Wang, North China Electric Power University, CN<br /><b>Authors</b>:<br />Song Jin<sup>1</sup>, Songwei Pei<sup>2</sup> and Yu Wang<sup>1</sup><br /><sup>1</sup>North China Electric Power University, CN; <sup>2</sup>School of Computer Science, Beijing University of Posts and Telecommunications, CN<br /><em><b>Abstract</b><br />A memristor-based crossbar (MBC) can execute neural network computations in an extremely energy-efficient manner. However, stuck-at faults prevent memristors from representing network weights correctly, significantly degrading the classification accuracy of the network deployed on the MBC. By carefully analyzing all the possible fault combinations in a pair of differential crossbars, we found that most stuck-at faults can be accommodated perfectly by mapping a zero-valued weight onto the affected memristors. Based on this observation, in this paper we propose a target-sparsifying fault-tolerance scheme for MBCs which execute neural network applications. We first exploit a heuristic algorithm to map the weight matrix onto the MBC, aiming to minimize weight variations in the presence of stuck-at faults. After that, some weights mapped onto faulty memristors which still have large variations are purposefully forced to zero value. Network retraining is then performed to recover classification accuracy. For a 4-layer CNN designed for MNIST digit recognition, experimental results demonstrate that our scheme achieves almost no accuracy loss when 10% of the memristors in the MBC are faulty. 
As the fraction of faulty memristors increases to 20%, the accuracy loss remains within 3%.</em></td> </tr> <tr> <td>12:00</td> <td>2.6.2</td> <td><b>AN EFFICIENT YIELD ANALYSIS OF SRAM USING SCALED-SIGMA ADAPTIVE IMPORTANCE SAMPLING</b><br /><b>Speaker</b>:<br />Liang Pang, Southeast University, CN<br /><b>Authors</b>:<br />Liang Pang<sup>1</sup>, Mengyun Yao<sup>2</sup> and Yifan Chai<sup>1</sup><br /><sup>1</sup>School of Electronic Science &amp; Engineering, Southeast University, CN; <sup>2</sup>School of Microelectronics, Southeast University, CN<br /><em><b>Abstract</b><br />Statistical SRAM yield analysis has become a growing concern owing to SRAM's high integration density and reliability requirements. It is a challenge to estimate the SRAM failure probability efficiently because circuit failure is a rare event. Existing methods still fall short, especially in high dimensions under advanced process nodes. In this paper, we develop a scaled-sigma adaptive importance sampling (SSAIS) method, an extension of adaptive importance sampling that changes not only the location parameters but also the shape parameters by iteratively searching the failure region. Experiments on a 40nm SRAM cell validate that our method outperforms the Monte Carlo method by 1500x and is 2.3x~5.2x faster than state-of-the-art methods while retaining sufficient accuracy. Another experiment, on a sense amplifier, shows that our method achieves a 3968x speedup over the Monte Carlo method and a 2.1x~11x speedup over the other methods.</em></td> </tr> <tr> <td>12:30</td> <td>2.6.3</td> <td><b>FAST AND ACCURATE HIGH-SIGMA FAILURE RATE ESTIMATION THROUGH EXTENDED BAYESIAN OPTIMIZED IMPORTANCE SAMPLING</b><br /><b>Speaker</b>:<br />Michael Hefenbrock, Karlsruhe Institute of Technology, DE<br /><b>Authors</b>:<br />Michael Hefenbrock, Dennis Weller, Michael Beigl and Mehdi Tahoori, Karlsruhe Institute of Technology, DE<br /><em><b>Abstract</b><br />Due to aggressive technology downscaling, process variations are becoming predominant, causing performance fluctuations and impacting chip yield. Therefore, individual circuit components have to be designed with very small failure rates to guarantee functional correctness and robust operation. The assessment of high-sigma failure rates, however, cannot be achieved with conventional Monte Carlo (MC) methods due to the huge number of time-consuming circuit simulations required. To this end, Importance Sampling (IS) methods were proposed to solve the otherwise intractable failure rate estimation problem by focusing on highly probable failure regions. However, the failure rate can be largely underestimated, while the computational effort to derive it is high. In this paper, we propose an eXtended Bayesian Optimized IS (XBOIS) method, which addresses the aforementioned shortcomings by deploying an accurate surrogate model (e.g., of delay) of the circuit around the failure region. The number of costly circuit simulations is therefore minimized, and estimation accuracy is substantially improved by efficient exploration of the variation space. Since memory elements in particular occupy a large share of on-chip resources, we evaluate our approach on SRAM cell failure rate estimation. 
Results show a speedup of about 16x as well as two orders of magnitude higher failure-rate estimation accuracy compared to the best state-of-the-art techniques.</em></td> </tr> <tr> <td>12:45</td> <td>2.6.4</td> <td><b>VALID WINDOW: A NEW METRIC TO MEASURE THE RELIABILITY OF NAND FLASH MEMORY</b><br /><b>Speaker</b>:<br />Min Ye, City University of Hong Kong, HK<br /><b>Authors</b>:<br />Min Ye<sup>1</sup>, Qiao Li<sup>1</sup>, Jianqiang Nie<sup>2</sup>, Tei-Wei Kuo<sup>1</sup> and Chun Jason Xue<sup>1</sup><br /><sup>1</sup>City University of Hong Kong, HK; <sup>2</sup>YEESTOR Microelectronics Co., Ltd, CN<br /><em><b>Abstract</b><br />NAND flash memory has been widely adopted in storage systems today. The most important issue in flash memory is its reliability, especially for 3D NAND, which suffers from several types of errors. The raw bit error rate (RBER) obtained when applying the default read reference voltages is usually adopted as the reliability metric for NAND flash memory. However, RBER is closely related to how data is read, and varies greatly if read-retry operations are conducted with tuned read reference voltages. In this work, a new metric, the valid window, is proposed to measure reliability in a stable and accurate way. The valid window expresses the size of the error regions between two neighboring levels and determines whether the data can be correctly read with further read retries. Taking advantage of these features, we design a method to reduce the number of read-retry operations. This is achieved by adjusting the program operations of 3D NAND flash memories. Experiments on a real 3D NAND flash chip verify the effectiveness of the proposed method.</em></td> </tr> <tr> <td style="width:40px;">13:00</td> <td><a href="/date20/conference/session/IP1">IP1-8</a>, 110</td> <td><b>BINARY LINEAR ECCS OPTIMIZED FOR BIT INVERSION IN MEMORIES WITH ASYMMETRIC ERROR PROBABILITIES</b><br /><b>Speaker</b>:<br />Valentin Gherman, CEA, FR<br /><b>Authors</b>:<br />Valentin Gherman, Samuel Evain and Bastien Giraud, CEA, FR<br /><em><b>Abstract</b><br />Many memory types are asymmetric with respect to the error vulnerability of stored 0's and 1's. For instance, DRAM, STT-MRAM and NAND flash memories may suffer from asymmetric error rates. A recently proposed error-protection scheme consists in the inversion of memory words with too many vulnerable values before they are stored in an asymmetric memory. In this paper, a method is proposed for the optimization of systematic binary linear block error-correcting codes in order to maximize their impact when combined with memory word inversion.</em></td> </tr> <tr> <td style="width:40px;">13:01</td> <td><a href="/date20/conference/session/IP1">IP1-9</a>, 634</td> <td><b>BELDPC: BIT ERRORS AWARE ADAPTIVE RATE LDPC CODES FOR 3D TLC NAND FLASH MEMORY</b><br /><b>Speaker</b>:<br />Meng Zhang, Wuhan National Laboratory for Optoelectronics, CN<br /><b>Authors</b>:<br />Meng Zhang, Fei Wu, Qin Yu, Weihua Liu, Lanlan Cui, Yahui Zhao and Changsheng Xie, Wuhan National Laboratory for Optoelectronics, CN<br /><em><b>Abstract</b><br />Three-dimensional (3D) NAND flash memory offers high capacity and cell storage density by using multi-bit technology and a vertical stack architecture, but suffers degraded data reliability due to high raw bit error rates (RBER) caused by program/erase (P/E) cycles and retention periods. Low-density parity-check (LDPC) codes have become a popular error-correcting technology to improve data reliability thanks to their strong error-correction capability, but they introduce more decoding iterations at higher RBER. To reduce decoding iterations, this paper proposes BeLDPC: bit-error-aware adaptive-rate LDPC codes for 3D triple-level cell (TLC) NAND flash memory. First, bit error characteristics of 3D charge-trap TLC NAND flash memory are studied on a real FPGA testing platform, including asymmetric bit flipping and the temporal locality of bit errors. Then, based on these characteristics, a high-efficiency LDPC code is designed. Experimental results show that BeLDPC reduces decoding iterations under different P/E cycles and retention periods.</em></td> </tr> <tr> <td>13:00</td> <td></td> <td>End of session</td> </tr> </tbody> </table>
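<p><em>Illustrative sketch: importance sampling, the basis of papers 2.6.2 and 2.6.3, estimates a rare failure probability by sampling from a widened density and re-weighting by the likelihood ratio. The NumPy toy below does this with one fixed scaled sigma on an invented analytic "failure" so the estimate can be checked; SSAIS itself adapts location and shape iteratively against real circuit simulations.</em></p> <pre><code># Toy scaled-sigma importance sampling for a rare-event probability.
import math
import numpy as np

rng = np.random.default_rng(2)
dim, n, s = 9, 200_000, 2.0         # variation dims, samples, sigma scaling

def fails(x):
    """Stand-in for a circuit simulation: fail beyond a 4-sigma margin."""
    return x.sum(axis=1) > 12.0     # the sum is N(0, 9), so 12 is 4 sigma

x = s * rng.standard_normal((n, dim))          # sample the widened density
q2 = (x ** 2).sum(axis=1)
log_w = -0.5 * q2 + 0.5 * q2 / s**2 + dim * math.log(s)   # log p(x) - log q(x)
p_is = float(np.mean(np.exp(log_w) * fails(x)))

p_exact = 0.5 * math.erfc(12 / (3 * math.sqrt(2)))        # analytic reference
print(f"IS estimate {p_is:.2e} vs analytic {p_exact:.2e}")
# plain Monte Carlo with the same n would observe only a handful of failures
</code></pre>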
<hr /> <h2 id="2.7">2.7 Optimizing emerging applications for power-efficient computing</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 11:30 - 13:00<br /><b>Location / Room:</b> Berlioz</p> <p><b>Chair:</b><br />Jungwook Choi, Hanyang University, KR</p> <p><b>Co-Chair:</b><br />Muhammad Shafique, TU Wien, AT</p> <p>This session focuses on emerging applications for power-efficient computing, such as bioinformatics and few-shot learning. Methods such as hyperdimensional computing and computing-in-memory are applied to DNA pattern matching and to few-shot learning in a more power-efficient way. A toy hyperdimensional-matching sketch follows this session's listing.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>11:30</td> <td>2.7.1</td> <td><b>GENIEHD: EFFICIENT DNA PATTERN MATCHING ACCELERATOR USING HYPERDIMENSIONAL COMPUTING</b><br /><b>Speaker</b>:<br />Mohsen Imani, University of California, San Diego, US<br /><b>Authors</b>:<br />Yeseong Kim, Mohsen Imani, Niema Moshiri and Tajana Rosing, University of California, San Diego, US<br /><em><b>Abstract</b><br />DNA pattern matching is widely applied in many bioinformatics applications. The increasing volume of DNA data exacerbates the runtime and power consumption needed to discover DNA patterns. In this paper, we propose a hardware-software codesign, called GenieHD, which efficiently parallelizes the DNA pattern-matching task. We exploit brain-inspired hyperdimensional (HD) computing, which mimics pattern-based computations in human memory. We transform the inherently sequential processes of DNA pattern matching into highly parallelizable computation tasks using HD computing. We accordingly design an accelerator architecture targeting various parallel computing platforms to effectively parallelize HD-based DNA pattern matching while significantly reducing memory accesses. We evaluate GenieHD on practical large-size DNA datasets such as the human and Escherichia coli genomes. 
Our evaluation shows that GenieHD significantly accelerates the DNA matching procedure, e.g., a 44.4× speedup and 54.1× higher energy efficiency compared to a state-of-the-art FPGA-based design.</em></td> </tr> <tr> <td>12:00</td> <td>2.7.2</td> <td><b>REPUTE: AN OPENCL BASED READ MAPPING TOOL FOR EMBEDDED GENOMICS</b><br /><b>Speaker</b>:<br />Sidharth Maheshwari, Newcastle University, GB<br /><b>Authors</b>:<br />Sidharth Maheshwari<sup>1</sup>, Rishad Shafik<sup>1</sup>, Alex Yakovlev<sup>1</sup>, Ian Wilson<sup>1</sup> and Amit Acharyya<sup>2</sup><br /><sup>1</sup>Newcastle University, GB; <sup>2</sup>IIT Hyderabad, IN<br /><em><b>Abstract</b><br />Genomics is transforming medicine from reactive to personalized, predictive, preventive and participatory (P4). The massive amount of data produced by genomics is a major challenge as it requires extensive computational capabilities, consuming large amounts of energy. A crucial prerequisite for computational genomics is genome assembly, but existing mapping tools are predominantly software-based and optimized for homogeneous high-performance systems. In this paper, we propose an OpenCL based REad maPper for heterogeneoUs sysTEms (REPUTE), which can use diverse and parallel compute and storage devices effectively. Core to this tool are dynamic-programming-based filtration and verification kernels that map reads on multiple devices concurrently. We show hardware/software co-design and implementations of REPUTE across different platforms, and compare it with state-of-the-art mappers. We demonstrate the performance of mappers on two systems: 1) an Intel CPU + 2 Nvidia GPUs; 2) a HiKey970 embedded SoC with ARM Cortex-A73/A53 cores. The results show that REPUTE outperforms other read mappers in most cases, producing up to 13x speedup with better or comparable accuracy. We also demonstrate that the embedded implementation can achieve up to 27x energy savings, enabling low-cost genomics.</em></td> </tr> <tr> <td>12:30</td> <td>2.7.3</td> <td><b>A FAST AND ENERGY EFFICIENT COMPUTING-IN-MEMORY ARCHITECTURE FOR FEW-SHOT LEARNING APPLICATIONS</b><br /><b>Speaker</b>:<br />Dayane Reis, University of Notre Dame, US<br /><b>Authors</b>:<br />Dayane Reis, Ann Franchesca Laguna, Michael Niemier and X. Sharon Hu, University of Notre Dame, US<br /><em><b>Abstract</b><br />Among few-shot learning methods, prototypical networks (PNs) are one of the most popular approaches due to their excellent classification accuracies and network simplicity. Test examples are classified based on their distances from class prototypes. Despite the application-level advantages of PNs, the latency of transferring data from memory to compute units is much higher than the PN computation time. Thus, PN performance is limited by memory bandwidth. Computing-in-memory (CiM) addresses this bandwidth-bottleneck problem by bringing a subset of compute units closer to memory. In this work, we propose a CiM-PN framework that enables the computation of distance metrics and prototypes inside the memory. CiM-PN replaces the computationally intensive Euclidean distance metric with the CiM-friendly Manhattan distance metric. Additionally, prototypes are computed using an in-memory mean operation realized by accumulation and division by powers of two, which enables few-shot learning implementations where "shots" are powers of two.
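<p>A minimal sketch (ours, with assumed shapes and toy integer embeddings) of the two CiM-friendly operations just described: a prototype computed by accumulation followed by division by a power of two, which in hardware is a simple bit-shift, and classification by the Manhattan distance.</p> <pre>
import numpy as np

def prototype(support):
    # Mean of n support embeddings, n a power of two: accumulate, then divide.
    # Integer division by a power of two is a right shift by log2(n) in hardware.
    n = support.shape[0]
    assert n in (1, 2, 4, 8, 16), "shots restricted to powers of two"
    return support.sum(axis=0) // n

def classify(query, protos):
    # Nearest prototype under the Manhattan (L1) distance.
    return int(np.argmin([np.abs(query - p).sum() for p in protos]))

rng = np.random.default_rng(1)
support_a = rng.integers(90, 110, size=(4, 64))    # 4-shot class A (toy data)
support_b = rng.integers(150, 170, size=(4, 64))   # 4-shot class B (toy data)
protos = [prototype(support_a), prototype(support_b)]
print(classify(support_a[0], protos))              # prints 0
</pre>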
The CiM-PN hardware uses CMOS memory cells, as well as CMOS peripherals such as customized sense amplifiers, carry look-ahead adders, in-place copy buffers and a log bit-shifter. Compared with a GPU implementation, a CMOS-based CiM-PN achieves speedups of 2808x/111x and energy savings of 2372x/5170x at iso-accuracy for the prototype and nearest-neighbor computation, respectively, and over 2x end-to-end speedup and energy improvements. We also gain a 3-14% accuracy improvement over existing non-GPU hardware approaches thanks to the floating-point CiM operations.</em></td> </tr> <tr> <td>13:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="2.8">2.8 EU/ESA projects on Heterogeneous Computing</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 11:30 - 13:00<br /><b>Location / Room:</b> Exhibition Theatre</p> <p><b>Chair:</b><br />Carles Hernandez, UPV, ES</p> <p><b>Co-Chair:</b><br />Francisco J. Cazorla, Barcelona Supercomputing Center, ES</p> <p>In the scope of this session, the presented EU/ESA projects cover topics related to the control electronics and data processing architecture and functionality of the Wide Field Imager, one of two scientific instruments of the next European X-ray observatory ATHENA; task-based programming models to provide a software ecosystem for heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines; and a framework to allow Big Data solutions to dynamically and transparently exploit heterogeneous hardware accelerators.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>11:30</td> <td>2.8.1</td> <td><b>ESA ATHENA WFI ONBOARD ELECTRONICS - DISTRIBUTED CONTROL AND DATA PROCESSING (WORK IN PROGRESS IN THE PROJECT)</b><br /><b>Speaker</b>:<br />Markus Plattner, Max Planck Institute for extraterrestrial Physics, DE<br /><b>Authors</b>:<br />Markus Plattner<sup>1</sup>, Sabine Ott<sup>1</sup>, Jintin Tran<sup>1</sup>, Christopher Mandla<sup>1</sup>, Manfred Steller<sup>2</sup>, Harald Jeszensky<sup>2</sup>, Roland Ottensamer<sup>3</sup>, Jan-Christoph Tenzer<sup>4</sup>, Thomas Schanz<sup>4</sup>, Samuel Pliego<sup>4</sup>, Konrad Skup<sup>5</sup>, Denis Tcherniak<sup>6</sup>, Chris Thomas<sup>7</sup>, Julian Thornhill<sup>7</sup> and Sebastian Albrecht<sup>1</sup><br /><sup>1</sup>Max Planck Institute for extraterrestrial Physics, DE; <sup>2</sup>IWF - Space Research Institute, AT; <sup>3</sup>University Vienna, AT; <sup>4</sup>University of Tübingen, DE; <sup>5</sup>CBK Warsaw, PL; <sup>6</sup>Technical University of Denmark, DK; <sup>7</sup>University of Leicester, GB<br /><em><b>Abstract</b><br />Within this paper, we describe the control electronics and data processing architecture and functionality of the Wide Field Imager (WFI). WFI is one of two scientific instruments of the next European X-ray observatory ATHENA, whose development started five years ago.
In the meantime, a conceptual design and development models have been produced, and a number of technology development activities have been carried out.</em></td> </tr> <tr> <td>12:00</td> <td>2.8.2</td> <td><b>LEGATO: LOW-ENERGY, SECURE, AND RESILIENT TOOLSET FOR HETEROGENEOUS COMPUTING</b><br /><b>Speaker</b>:<br />Pascal Felber, University of Neuchâtel, CH<br /><b>Authors</b>:<br />Behzad Salami<sup>1</sup>, Konstantinos Parasyris<sup>1</sup>, Adrian Cristal<sup>1</sup>, Osman Unsal<sup>1</sup>, Xavier Martorell<sup>1</sup>, Paul Carpenter<sup>1</sup>, Raul De La Cruz<sup>1</sup>, Leonardo Bautista<sup>1</sup>, Daniel Jimenez<sup>1</sup>, Carlos Alvarez<sup>1</sup>, Saber Nabavi<sup>1</sup>, Sergi Madonar<sup>1</sup>, Miquel Pericàs<sup>2</sup>, Pedro Trancoso<sup>2</sup>, Mustafa Abduljabbar<sup>2</sup>, Jing Chen<sup>2</sup>, Pirah Noor Soomro<sup>2</sup>, Madhavan Manivannan<sup>2</sup>, Micha von dem Berge<sup>3</sup>, Stefan Krupop<sup>3</sup>, Frank Klawonn<sup>4</sup>, Amani Mihklafi<sup>4</sup>, Sigrun May<sup>4</sup>, Tobias Becker<sup>5</sup>, Georgi Gaydadjiev<sup>5</sup>, Hans Salomonsson<sup>6</sup>, Devdatt Dubhashi<sup>6</sup>, Oron Port<sup>7</sup>, Yoav Etsion<sup>8</sup>, Le Quoc Do<sup>9</sup>, Christof Fetzer<sup>9</sup>, Martin Kaiser<sup>10</sup>, Nils Kucza<sup>10</sup>, Jens Hagemeyer<sup>10</sup>, René Griessl<sup>10</sup>, Lennart Tigges<sup>10</sup>, Kevin Mika<sup>10</sup>, Arne Hüffmeier<sup>10</sup>, Marcelo Pasin<sup>11</sup>, Valerio Schiavoni<sup>11</sup>, Isabelly Rocha<sup>11</sup>, Christian Göttel<sup>11</sup> and Pascal Felber<sup>11</sup><br /><sup>1</sup>Barcelona Supercomputing Center, ES; <sup>2</sup>Chalmers University of Technology, SE; <sup>3</sup>Christmann Informationstechnik + Medien GmbH &amp; Co. KG, DE; <sup>4</sup>Helmholtz-Zentrum für Infektionsforschung GmbH, DE; <sup>5</sup>MAXELER, GB; <sup>6</sup>MIS, SE; <sup>7</sup>TECHNION, IL; <sup>8</sup>Technion, IL; <sup>9</sup>TUD, DE; <sup>10</sup>UNIBI, DE; <sup>11</sup>UNINE, CH<br /><em><b>Abstract</b><br />The LEGaTO project leverages task-based programming models to provide a software ecosystem for Made-in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged cloud/HPC, balanced against security and resilience challenges. LEGaTO is an ongoing three-year EU H2020 project started in December 2017.</em></td> </tr> <tr> <td>12:30</td> <td>2.8.3</td> <td><b>EFFICIENT COMPILATION AND EXECUTION OF JVM-BASED DATA PROCESSING FRAMEWORKS ON HETEROGENEOUS CO-PROCESSORS</b><br /><b>Speaker</b>:<br />Athanasios Stratikopoulos, University of Manchester, GB<br /><b>Authors</b>:<br />Christos Kotselidis<sup>1</sup>, Ioannis Komnios<sup>2</sup>, Orestis Akrivopoulos<sup>3</sup>, Sebastian Bress<sup>4</sup>, Katerina Doka<sup>5</sup>, Hazeef Mohammed<sup>6</sup>, Georgios Mylonas<sup>7</sup>, Vassilis Spitadakis<sup>8</sup>, Daniel Strimpel<sup>9</sup>, Juan Fumero<sup>1</sup>, Foivos S.
Zakkak<sup>1</sup>, Michail Papadimitriou<sup>1</sup>, Maria Xekalaki<sup>1</sup>, Nikos Foutris<sup>1</sup>, Athanasios Stratikopoulos<sup>1</sup>, Nectarios Koziris<sup>5</sup>, Ioannis Konstantinou<sup>5</sup>, Ioannis Mytilinis<sup>5</sup>, Constantinos Bitsakos<sup>5</sup>, Christos Tsalidis<sup>8</sup>, Christos Tselios<sup>3</sup>, Nikolaos Kanakis<sup>3</sup>, Clemens Lutz<sup>4</sup>, Viktor Rosenfeld<sup>4</sup> and Volker Markl<sup>4</sup><br /><sup>1</sup>University of Manchester, GB; <sup>2</sup>Exus Ltd., US; <sup>3</sup>Spark Works ITC Ltd., GB; <sup>4</sup>German Research Center for Artificial Intelligence, DE; <sup>5</sup>National TU Athens, GR; <sup>6</sup>Kaleao Ltd., GB; <sup>7</sup>Computer Technology Institute &amp; Press Diophantus, GR; <sup>8</sup>Neurocom Luxembourg, LU; <sup>9</sup>IProov Ltd., GB<br /><em><b>Abstract</b><br />This paper addresses the fundamental question of how modern Big Data frameworks can dynamically and transparently exploit heterogeneous hardware accelerators. After presenting the major challenges that have to be addressed towards this goal, we describe our proposed architecture for automatic and transparent hardware acceleration of Big Data frameworks and applications. Our vision is to retain the uniform programming model of Big Data frameworks and enable automatic, dynamic Just-In-Time compilation of the candidate code segments that benefit from hardware acceleration to the corresponding format. In conjunction with machine-learning-based device selection that respects user-defined constraints (e.g., cost, time), we enable dynamic code execution on GPUs and FPGAs transparently to the user. In addition, we dynamically re-steer execution at runtime based on the availability of resources. Our preliminary results demonstrate that our approach can accelerate an existing Apache Flink application by up to 16.5x.</em></td> </tr> <tr> <td>13:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="3.0">3.0 LUNCHTIME KEYNOTE SESSION</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 13:50 - 14:20<br /><b>Location / Room:</b> </p> <p><b>Chair:</b><br />Marco Casale-Rossi, Synopsys, IT</p> <p><b>Co-Chair:</b><br />Giovanni De Micheli, EPFL, CH</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>13:50</td> <td>3.0.1</td> <td><b>NEUROMORPHIC COMPUTING: PAST, PRESENT, AND FUTURE</b><br /><b>Author</b>:<br />Catherine Schuman, Oak Ridge National Laboratory, US<br /><em><b>Abstract</b><br />Though neuromorphic systems were introduced decades ago, there has been a resurgence of interest in recent years due to the looming end of Moore's law, the end of Dennard scaling, and the tremendous success of AI and deep learning for a wide variety of applications. With this renewed interest, there is a diverse set of research ongoing in neuromorphic computing, ranging from novel hardware implementations, devices, and materials to the development of new training and learning algorithms. There are many potential advantages to neuromorphic systems that make them attractive in today's computing landscape, including the potential for very low-power, efficient hardware that can perform neural network computation.
Though some compelling results demonstrating these advantages have been shown thus far, there is still significant opportunity for innovation in hardware, algorithms, and applications in neuromorphic computing. In this talk, a brief overview of the history of neuromorphic computing will be given, and a summary of the current state of research in the field will be presented. Finally, a list of key challenges, open questions, and opportunities for future research in neuromorphic computing will be enumerated.</em></td> </tr> <tr> <td>14:20</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="3.1">3.1 Executive Session: HW Security</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 14:30 - 16:00<br /><b>Location / Room:</b> Amphithéâtre Jean Prouve</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>16:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="3.2">3.2 Accelerating Design Space Exploration</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 14:30 - 16:00<br /><b>Location / Room:</b> Chamrousse</p> <p><b>Chair:</b><br />Christian Pilato, Politecnico di Milano, IT</p> <p><b>Co-Chair:</b><br />Luca Carloni, Columbia University, US</p> <p>Efficient design space exploration is needed to optimize hardware accelerators. At a high level, learning techniques can provide ways either to recognize previously synthesized kernels or to model the hidden dependences between synthesis directives and the resulting cost and performance. At a lower level, accelerating RTL simulation through data-dependence analysis targets one of the most time-consuming steps.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>3.2.1</td> <td><b>EFFICIENT AND ROBUST HIGH-LEVEL SYNTHESIS DESIGN SPACE EXPLORATION THROUGH OFFLINE MICRO-KERNELS PRE-CHARACTERIZATION</b><br /><b>Authors</b>:<br />Zi Wang, Jianqi Chen and Benjamin Carrion Schaefer, University of Texas at Dallas, US<br /><em><b>Abstract</b><br />This work proposes a method to accelerate the process of High-Level Synthesis (HLS) Design Space Exploration (DSE) by pre-characterizing micro-kernels offline and creating predictive models of them. HLS allows different types of micro-architectures to be generated from the same untimed behavioral description. This is typically done by setting different combinations of synthesis options in the form of synthesis directives specified as pragmas in the code. This allows control over, e.g., how loops, arrays, and functions should be synthesized. Unique combinations of these pragmas lead to micro-architectures with unique area vs. performance/power trade-offs. The main problem is that the search space grows exponentially with the number of explorable operations. Thus, the main goal of efficient HLS DSE is to find the combinations of synthesis directives that lead to the Pareto-optimal designs quickly. Our proposed method is based on the pre-characterization of micro-kernels offline, creating predictive models for each of the kernels, and using the results to explore a new unseen behavioral description using compositional methods.
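<p>As a toy illustration of the compositional estimation step (entirely ours; the kernel names, cost numbers, and the additive sequential composition are hypothetical assumptions), pre-characterized per-micro-kernel models can be combined to predict the area and latency of an unseen description for each directive combination:</p> <pre>
# Offline pre-characterization: each micro-kernel maps a directive setting to a
# predicted (area, latency) pair. All names and numbers are hypothetical.
kernel_models = {
    "fir_loop": {"unroll1": (120, 900),  "unroll4": (400, 260)},
    "mat_mult": {"unroll1": (300, 4000), "unroll4": (950, 1100)},
}

def estimate(design):
    # Compositional estimate: sum the predicted costs of the design's kernels,
    # assuming the kernels execute sequentially.
    area = latency = 0
    for kernel, directive in design:
        a, l = kernel_models[kernel][directive]
        area, latency = area + a, latency + l
    return area, latency

# Sweep directive combinations of an unseen design built from known kernels.
for d1 in ("unroll1", "unroll4"):
    for d2 in ("unroll1", "unroll4"):
        print(d1, d2, estimate([("fir_loop", d1), ("mat_mult", d2)]))
</pre>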
In addition, we make use of perceptual hashing to match new unseen micro-kernels with the pre-characterized micro-kernels in order to further speed up the search process. Experimental results show that our proposed method is orders of magnitude faster than traditional methods.</em></td> </tr> <tr> <td>15:00</td> <td>3.2.2</td> <td><b>PROSPECTOR: SYNTHESIZING EFFICIENT ACCELERATORS VIA STATISTICAL LEARNING</b><br /><b>Speaker</b>:<br />Aninda Manocha, Princeton University, US<br /><b>Authors</b>:<br />Atefeh Mehrabi, Aninda Manocha, Benjamin Lee and Daniel Sorin, Duke University, US<br /><em><b>Abstract</b><br />Accelerator design is expensive due to the effort required to understand an algorithm and optimize the design. Architects have embraced two technologies to reduce costs. High-level synthesis automatically generates hardware from code. Reconfigurable fabrics instantiate accelerators while avoiding fabrication costs for custom circuits. We further reduce design effort with statistical learning. We build an automated framework, called Prospector, that uses Bayesian techniques to optimize synthesis directives, reducing execution latency and resource usage in field-programmable gate arrays. We show that, within a given exploration time budget, the designs discovered by Prospector are closer to Pareto-efficient designs than those found by prior approaches.</em></td> </tr> <tr> <td>15:30</td> <td>3.2.3</td> <td><b>TANGO: AN OPTIMIZING COMPILER FOR JUST-IN-TIME RTL SIMULATION</b><br /><b>Speaker</b>:<br />Blaise-Pascal Tine, Georgia Institute of Technology, US<br /><b>Authors</b>:<br />Blaise Tine, Sudhakar Yalamanchili and Hyesoon Kim, Georgia Institute of Technology, US<br /><em><b>Abstract</b><br />With Moore's law coming to an end, the advent of hardware specialization calls for a much tighter software and hardware co-design environment to exploit domain-specific optimizations and increase design efficiency. This trend is further accentuated by the rapid pace of innovation in machine learning and graph analytics, calling for a faster product development cycle for hardware accelerators and for addressing the increasing cost of hardware verification. The productivity of software-hardware co-design relies upon a better integration between the software and hardware design methodologies, but more importantly on the effectiveness of the design tools and hardware simulators at reducing development time. In this work, we developed Tango, an optimizing compiler for a just-in-time RTL simulator. Tango implements unique hardware-centric compiler transformations to speed up runtime code generation in a software-hardware co-design environment where hardware simulation speed is critical. Tango achieves a 6x average speedup compared to state-of-the-art RTL simulators.</em></td> </tr> <tr> <td>16:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="3.3">3.3 Artificial Intelligence and Secure Systems</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 14:30 - 16:00<br /><b>Location / Room:</b> Autrans</p> <p><b>Chair:</b><br />Annelie Heuser, Univ Rennes, Inria, CNRS, FR</p> <p><b>Co-Chair:</b><br />Ilia Polian, Universität Stuttgart, DE</p> <p>In this session, we will cover artificial intelligence algorithms in the context of secure systems.
The presented papers cover an extension of a trusted execution environment to securely run machine learning algorithms, novel attacking strategies against logic-locking countermeasures, and an investigation of aging effects on the success rate of machine learning modelling attacks.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>3.3.1</td> <td><b>A PARTICLE SWARM OPTIMIZATION GUIDED APPROXIMATE KEY SEARCH ATTACK ON LOGIC LOCKING IN THE ABSENCE OF SCAN ACCESS</b><br /><b>Speaker</b>:<br />Rajit Karmakar, IIT Kharagpur, IN<br /><b>Authors</b>:<br />Rajit Karmakar and Santanu Chattopadhyay, IIT Kharagpur, IN<br /><em><b>Abstract</b><br />Logic locking is a well-known Design-for-Security (DfS) technique for Intellectual Property (IP) protection of digital Integrated Circuits (ICs). However, various attacks on logic locking can extract the secret obfuscation key successfully. Although Boolean Satisfiability (SAT) attacks can break most logic-locked circuits, the inability to deobfuscate sequential circuits is the main limitation of this type of attack. Several existing defense strategies exploit this fact to thwart SAT attacks by obfuscating the scan-based Design-for-Testability (DfT) infrastructure. In the absence of scan access, Model Checking based circuit unrolling attacks also suffer from scalability issues. In this paper, we propose a particle swarm optimization (PSO) guided attack framework, which is capable of finding an approximate key that produces correct outputs in most cases. Unlike the SAT attacks, the proposed attack framework can work even in the absence of scan access. Unlike Model Checking attacks, it does not suffer from scalability issues and can thus be applied to significantly larger sequential circuits. Experimental results show that the derived key can produce correct outputs in more than 99% of cases for the majority of the benchmark circuits, while a minimal error is observed for the rest. The proposed attack framework enables partial activation of large sequential circuits in the absence of scan access, which is not feasible using the existing attack frameworks.</em></td> </tr> <tr> <td>15:00</td> <td>3.3.2</td> <td><b>EFFECT OF AGING ON PUF MODELING ATTACKS BASED ON POWER SIDE-CHANNEL OBSERVATIONS</b><br /><b>Authors</b>:<br />Trevor Kroeger<sup>1</sup>, Wei Cheng<sup>2</sup>, Jean Luc Danger<sup>2</sup>, Sylvain Guilley<sup>3</sup> and Naghmeh Karimi<sup>4</sup><br /><sup>1</sup>University of Maryland Baltimore County, US; <sup>2</sup>Télécom ParisTech, FR; <sup>3</sup>Secure-IC, FR; <sup>4</sup>University of Maryland, Baltimore County, US<br /><em><b>Abstract</b><br />Thanks to imperfections in the manufacturing process, Physically Unclonable Functions (PUFs) produce unique outputs for given input signals (challenges) fed to identical circuit designs. PUFs are often used as hardware primitives to provide security, e.g., for key generation or authentication purposes. However, they can be vulnerable to modeling attacks that predict the output for an unknown challenge, based on a set of known challenge/response pairs (CRPs). In addition, an attacker may benefit from power side-channels to break a PUF's security. Although such attacks have been extensively discussed in the literature, the effect of device aging on the efficacy of these attacks is still an open question.
Accordingly, in this paper, we focus on the impact of aging on the Arbiter-PUF and one of its modeling-resistant counterparts, the Voltage Transfer Characteristic (VTC) PUF. We present the results of our SPICE simulations used to perform modeling attacks via Machine Learning (ML) schemes on devices aged from 0 to 20 weeks. We show that aging has a significant impact on modeling attacks. Indeed, when the training dataset for the ML attack is extracted at a different age than the evaluation dataset, the attack is greatly hindered despite being performed on the same device. We show that the ML attack via power traces is particularly efficient at recovering the responses of the anti-modeling VTC PUF, yet aging still contributes to enhancing its security.</em></td> </tr> <tr> <td>15:30</td> <td>3.3.3</td> <td><b>OFFLINE MODEL GUARD: SECURE AND PRIVATE ML ON MOBILE DEVICES</b><br /><b>Speaker</b>:<br />Emmanuel Stapf, TU Darmstadt, DE<br /><b>Authors</b>:<br />Sebastian P. Bayerl<sup>1</sup>, Tommaso Frassetto<sup>2</sup>, Patrick Jauernig<sup>2</sup>, Korbinian Riedhammer<sup>1</sup>, Ahmad-Reza Sadeghi<sup>2</sup>, Thomas Schneider<sup>2</sup>, Emmanuel Stapf<sup>2</sup> and Christian Weinert<sup>2</sup><br /><sup>1</sup>TH Nürnberg, DE; <sup>2</sup>TU Darmstadt, DE<br /><em><b>Abstract</b><br />Performing machine learning tasks in mobile applications yields a challenging conflict of interest: highly sensitive client information (e.g., speech data) should remain private, while the intellectual property of service providers (e.g., model parameters) must also be protected. Cryptographic techniques offer secure solutions for this, but have an unacceptable overhead and moreover require frequent network interaction. In this work, we design a practically efficient hardware-based solution. Specifically, we build Offline Model Guard (OMG) to enable privacy-preserving machine learning on the predominant mobile computing platform ARM - even in offline scenarios. By leveraging a trusted execution environment for strict hardware-enforced isolation from other system components, OMG guarantees privacy of client data, secrecy of provided models, and integrity of processing algorithms. Our prototype implementation on an ARM HiKey 960 development board performs privacy-preserving keyword recognition using TensorFlow Lite for Microcontrollers in real time.</em></td> </tr> <tr> <td style="width:40px;">16:00</td> <td><a href="/date20/conference/session/IP1">IP1-10</a>, 728</td> <td><b>POISONING THE (DATA) WELL IN ML-BASED CAD: A CASE STUDY OF HIDING LITHOGRAPHIC HOTSPOTS</b><br /><b>Speaker</b>:<br />Kang Liu, New York University, US<br /><b>Authors</b>:<br />Kang Liu<sup>1</sup>, Benjamin Tan<sup>1</sup>, Ramesh Karri<sup>2</sup> and Siddharth Garg<sup>1</sup><br /><sup>1</sup>New York University, US; <sup>2</sup>NYU, US<br /><em><b>Abstract</b><br />Machine learning (ML) provides state-of-the-art performance in many parts of computer-aided design (CAD) flows. However, deep neural networks (DNNs) are susceptible to various adversarial attacks, including data poisoning to compromise training to insert backdoors. Sensitivity to training data integrity presents a security vulnerability, especially in light of malicious insiders who want to cause targeted neural network misbehavior. In this study, we explore this threat in lithographic hotspot detection via training data poisoning, where hotspots in a layout clip can be "hidden" at inference time by including a trigger shape in the input.
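<p>The threat model can be illustrated with a generic poisoning sketch (ours, not the paper's flow; the sizes, trigger shape, and poisoning rate are assumptions): a small trigger pattern is stamped into a fraction of the hotspot training clips and their labels are flipped, so a model trained on the poisoned set learns to associate the trigger with the benign class.</p> <pre>
import numpy as np

def poison(clips, labels, trigger, rate, rng):
    # Stamp a trigger into a random fraction of hotspot clips and flip labels.
    x, y = clips.copy(), labels.copy()
    hot = np.flatnonzero(y == 1)                     # hotspot sample indices
    victims = rng.choice(hot, size=int(rate * hot.size), replace=False)
    th, tw = trigger.shape
    x[victims, :th, :tw] = trigger                   # embed trigger in a corner
    y[victims] = 0                                   # mislabel as non-hotspot
    return x, y

rng = np.random.default_rng(2)
clips = rng.integers(0, 2, size=(1000, 32, 32))      # toy binary layout clips
labels = rng.integers(0, 2, size=1000)               # toy hotspot labels
trigger = np.ones((4, 4), dtype=int)
px, py = poison(clips, labels, trigger, rate=0.05, rng=rng)
</pre>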
We show that training data poisoning attacks are feasible and stealthy, demonstrating a backdoored neural network that performs normally on clean inputs but misbehaves on inputs containing a backdoor trigger. Furthermore, our results raise some fundamental questions about the robustness of ML-based systems in CAD.</em></td> </tr> <tr> <td style="width:40px;">16:01</td> <td><a href="/date20/conference/session/IP1">IP1-11</a>, 667</td> <td><b>SOLOMON: AN AUTOMATED FRAMEWORK FOR DETECTING FAULT ATTACK VULNERABILITIES IN HARDWARE</b><br /><b>Speaker</b>:<br />Milind Srivastava, IIT Madras, IN<br /><b>Authors</b>:<br />Milind Srivastava<sup>1</sup>, Patanjali SLPSK<sup>1</sup>, Indrani Roy<sup>1</sup>, Chester Rebeiro<sup>1</sup>, Aritra Hazra<sup>2</sup> and Swarup Bhunia<sup>3</sup><br /><sup>1</sup>IIT Madras, IN; <sup>2</sup>IIT Kharagpur, IN; <sup>3</sup>University of Florida, US<br /><em><b>Abstract</b><br />Fault attacks are potent physical attacks on crypto-devices. A single fault injected during encryption can reveal the cipher's secret key. In a hardware realization of an encryption algorithm, only a tiny fraction of the gates is exploitable by such an attack. Finding these vulnerable gates has been a manual and tedious task requiring considerable expertise. In this paper, we propose SOLOMON, the first automatic fault attack vulnerability detection framework for hardware designs. Given a cipher implementation, either at RTL or gate level, SOLOMON uses formal methods to map vulnerable regions in the cipher algorithm to specific locations in the hardware, thus enabling targeted countermeasures to be deployed with much lower overheads. We demonstrate the efficacy of the SOLOMON framework using three ciphers: AES, CLEFIA, and Simon.</em></td> </tr> <tr> <td style="width:40px;">16:02</td> <td><a href="/date20/conference/session/IP1">IP1-12</a>, 694</td> <td><b>FORMAL SYNTHESIS OF MONITORING AND DETECTION SYSTEMS FOR SECURE CPS IMPLEMENTATIONS</b><br /><b>Speaker</b>:<br />Ipsita Koley, IIT Kharagpur, IN<br /><b>Authors</b>:<br />Ipsita Koley<sup>1</sup>, Saurav Kumar Ghosh<sup>1</sup>, Dey Soumyajit<sup>1</sup>, Debdeep Mukhopadhyay<sup>1</sup>, Amogh Kashyap K N<sup>2</sup>, Sachin Kumar Singh<sup>2</sup>, Lavanya Lokesh<sup>2</sup>, Jithin Nalu Purakkal<sup>2</sup> and Nishant Sinha<sup>2</sup><br /><sup>1</sup>IIT Kharagpur, IN; <sup>2</sup>Robert Bosch Engineering and Business Solutions Private Limited, IN<br /><em><b>Abstract</b><br />We consider the problem of securing a given control loop implementation of a cyber-physical system (CPS) in the presence of Man-in-the-Middle attacks on data exchange between plant and controller over a compromised network. To this end, various detection schemes exist that provide mathematical guarantees against such attacks for the theoretical control model. However, such guarantees may not hold for the actual control software implementation.
In this article, we propose a formal approach towards synthesizing attack detectors with varying thresholds that can prevent performance-degrading stealthy attacks while minimizing false alarms.</em></td> </tr> <tr> <td>16:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="3.4">3.4 Accelerating Neural Networks and Vision Workloads</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 14:30 - 16:00<br /><b>Location / Room:</b> Stendhal</p> <p><b>Chair:</b><br />Leonidas Kosmidis, Barcelona Supercomputing Center, ES</p> <p><b>Co-Chair:</b><br />Georgios Keramidas, Aristotle University of Thessaloniki/Think Silicon S.A., GR</p> <p>This session presents different solutions to accelerate emerging applications. The papers include various microarchitecture techniques as well as complete SoC and RISC-V based solutions. More fine-grained techniques, such as fast computation on sparse matrices, are also presented. Vision applications are represented by the popular VSLAM, while various types and forms of emerging Neural Networks (such as Recurrent, Quantized, and Siamese NNs) are considered.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>3.4.1</td> <td><b>PSB-RNN: A PROCESSING-IN-MEMORY SYSTOLIC ARRAY ARCHITECTURE USING BLOCK CIRCULANT MATRICES FOR RECURRENT NEURAL NETWORKS</b><br /><b>Speaker</b>:<br />Nagadastagiri Challapalle, Pennsylvania State University, US<br /><b>Authors</b>:<br />Nagadastagiri Challapalle<sup>1</sup>, Sahithi Rampalli<sup>1</sup>, Makesh Tarun Chandran<sup>1</sup>, Gurpreet Singh Kalsi<sup>2</sup>, John (Jack) Sampson<sup>3</sup>, Sreenivas Subramoney<sup>2</sup> and Vijaykrishnan Narayanan<sup>4</sup><br /><sup>1</sup>The Pennsylvania State University, US; <sup>2</sup>Processor Architecture Research Lab, Intel Labs, IN; <sup>3</sup>Penn State, US; <sup>4</sup>Penn State University, US<br /><em><b>Abstract</b><br />Recurrent Neural Networks (RNNs) are widely used in Natural Language Processing (NLP) applications as they inherently capture contextual information across spatial and temporal dimensions. Compared to other classes of neural networks, RNNs have more weight parameters as they primarily consist of fully connected layers. Recently, several techniques such as weight pruning, zero-skipping, and block circulant compression have been introduced to reduce the storage and access requirements of RNN weight parameters. In this work, we present a ReRAM crossbar based processing-in-memory (PIM) architecture with systolic dataflow incorporating block circulant compression for RNNs. The block circulant compression decomposes the operations in a fully connected layer into a series of Fourier transforms and point-wise operations, resulting in reduced space and computational complexity. We formulate the Fourier transform and point-wise operations into in-situ multiply-and-accumulate (MAC) operations mapped to ReRAM crossbars for high energy efficiency and throughput. We also incorporate systolic dataflow for communication within the crossbar arrays, in contrast to broadcast and multicast communications, to further improve energy efficiency.
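<p>The compression relies on a standard identity: a circulant matrix-vector product equals a circular convolution, so it can be computed with Fourier transforms and point-wise products. A minimal NumPy check of one circulant block (our sketch; the block size is an assumption):</p> <pre>
import numpy as np

n = 8
rng = np.random.default_rng(3)
c = rng.standard_normal(n)               # first column defines the circulant block
x = rng.standard_normal(n)

# Dense route: O(n^2) multiply-accumulates.
C = np.stack([np.roll(c, j) for j in range(n)], axis=1)  # column j is c rolled by j
y_dense = C @ x

# FFT route: transform, point-wise multiply, inverse transform -- O(n log n).
y_fft = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real

print(np.allclose(y_dense, y_fft))       # True
</pre>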
The proposed architecture achieves average improvements in compute efficiency of 44x and 17x over a custom FPGA architecture and conventional crossbar-based architecture implementations, respectively.</em></td> </tr> <tr> <td>15:00</td> <td>3.4.2</td> <td><b>XPULPNN: ACCELERATING QUANTIZED NEURAL NETWORKS ON RISC-V PROCESSORS THROUGH ISA EXTENSIONS</b><br /><b>Speaker</b>:<br />Angelo Garofalo, Università di Bologna, IT<br /><b>Authors</b>:<br />Angelo Garofalo<sup>1</sup>, Giuseppe Tagliavini<sup>1</sup>, Francesco Conti<sup>2</sup>, Davide Rossi<sup>1</sup> and Luca Benini<sup>2</sup><br /><sup>1</sup>Università di Bologna, IT; <sup>2</sup>ETH Zurich, CH / Università di Bologna, CH<br /><em><b>Abstract</b><br />Strongly quantized fixed-point arithmetic is considered the key direction to enable the inference of CNNs on low-power, resource-constrained edge devices. However, the deployment of highly quantized Neural Networks at the extreme edge of IoT, on fully programmable MCUs, is currently limited by the lack of support, at the Instruction Set Architecture (ISA) level, for sub-byte fixed-point data types. This makes it necessary to add numerous instructions for packing and unpacking data when running low-bitwidth (i.e., 2- and 4-bit) QNN kernels, creating a bottleneck for the performance and energy efficiency of QNN inference. In this work we present a set of extensions to the RISC-V ISA, aimed at boosting the energy efficiency of low-bitwidth QNNs on low-power microcontroller-class cores. The microarchitecture supporting the new extensions is built on top of a RISC-V core featuring instruction set extensions targeting energy-efficient digital signal processing. To evaluate the extensions, we integrated the core into a full microcontroller system, synthesized and placed&amp;routed in 22nm FDX technology. QNN convolution kernels, implemented on the new core, run 5.3× and 8.9× faster for 4- and 2-bit data operands, respectively, compared to the baseline processor supporting only 8-bit SIMD instructions. With a peak of 279 GMAC/s/W, the proposed solution achieves 9× better energy efficiency compared to the baseline and two orders of magnitude better energy efficiency compared to state-of-the-art microcontrollers.</em></td> </tr> <tr> <td>15:30</td> <td>3.4.3</td> <td><b>SNA: A SIAMESE NETWORK ACCELERATOR TO EXPLOIT THE MODEL-LEVEL PARALLELISM OF HYBRID NETWORK STRUCTURE</b><br /><b>Speaker</b>:<br />Xingbin Wang, Institute of Information Engineering, CAS, CN<br /><b>Authors</b>:<br />Xingbin Wang, Boyan Zhao, Rui Hou and Dan Meng, State Key Laboratory of Information Security, Institute of Information Engineering, CAS, CN<br /><em><b>Abstract</b><br />The siamese network is a compute-intensive learning model with growing applicability in a wide range of domains. However, state-of-the-art deep neural network (DNN) accelerators do not work efficiently for siamese networks, as their designs do not account for the algorithmic properties of siamese networks. In this paper, we propose a siamese network accelerator called SNA, the first Simultaneous Multi-Threading (SMT) hardware architecture to perform siamese network inference with high performance and energy efficiency. We devise an adaptive inter-model computing resource partition and a flexible on-chip buffer management mechanism based on the model parallelism and SMT design philosophy. Our architecture is implemented in Verilog and synthesized in a 65nm technology using Synopsys design tools. We also evaluate it with several typical siamese networks.
Compared to a state-of-the-art accelerator, the SNA architecture offers, on average, a 2.1x speedup and a 1.48x energy reduction.</em></td> </tr> <tr> <td>15:45</td> <td>3.4.4</td> <td><b>HCVEACC: A HIGH-PERFORMANCE AND ENERGY-EFFICIENT ACCELERATOR FOR TRACKING TASK IN VSLAM SYSTEM</b><br /><b>Speaker</b>:<br />Meng Liu, Chinese Academy of Sciences, CN<br /><b>Authors</b>:<br />Li Renwei, Wu Junning, Liu Meng, Chen Zuding, Zhou Shengang and Feng Shanggong, Chinese Academy of Sciences, CN<br /><em><b>Abstract</b><br />Visual SLAM (vSLAM) is a critical computer vision technology that builds a map of an unknown environment and performs localization simultaneously, leveraging the partially built map. While several software SLAM processing frameworks exist, the underlying general-purpose processors still can hardly achieve real-time SLAM at a reasonably low cost. In this paper, we propose HcveAcc, the first specialized CMOS-based hardware accelerator that helps optimize the tracking task in vSLAM systems in a high-performance and energy-efficient way. HcveAcc targets the timing bottleneck in the tracking process (high-density feature extraction and high-precision descriptor generation) and provides a configurable hardware architecture that handles higher-resolution image data. We have implemented HcveAcc in a 28nm CMOS technology using commercial EDA tools and evaluated it on the EuRoC and TUM datasets to demonstrate its robustness and accuracy in the SLAM tracking procedure. Our results show that HcveAcc achieves a 4.3X speedup while consuming much less energy compared with state-of-the-art FPGA solutions.</em></td> </tr> <tr> <td style="width:40px;">16:01</td> <td><a href="/date20/conference/session/IP1">IP1-13</a>, 55</td> <td><b>ASCELLA: ACCELERATING SPARSE COMPUTATION BY ENABLING STREAM ACCESSES TO MEMORY</b><br /><b>Speaker</b>:<br />Bahar Asgari, Georgia Institute of Technology, US<br /><b>Authors</b>:<br />Bahar Asgari, Ramyad Hadidi and Hyesoon Kim, Georgia Institute of Technology, US<br /><em><b>Abstract</b><br />Sparse computations dominate a wide range of applications from scientific problems to graph analytics. The main characteristic of sparse computations, indirect memory accesses, prevents them from effectively achieving high performance on general-purpose processors. Therefore, hardware accelerators have been proposed for sparse problems. For these accelerators, the storage format and the decompression mechanism are crucial but have seen less attention in prior work. To address this gap, we propose Ascella, an accelerator for sparse computations, which, besides enabling a smooth stream of data and parallel computation, provides a fast decompression mechanism. Our implementation on a ZYNQ FPGA shows that Ascella executes sparse problems up to 5.1x as fast as prior work.</em></td> </tr> <tr> <td style="width:40px;">16:02</td> <td><a href="/date20/conference/session/IP1">IP1-14</a>, 645</td> <td><b>ACCELERATION OF PROBABILISTIC REASONING THROUGH CUSTOM PROCESSOR ARCHITECTURE</b><br /><b>Speaker</b>:<br />Nimish Shah, KU Leuven, BE<br /><b>Authors</b>:<br />Nimish Shah, Laura I. Galindez Olascoaga, Wannes Meert and Marian Verhelst, KU Leuven, BE<br /><em><b>Abstract</b><br />Probabilistic reasoning is an essential tool for robust decision-making systems because of its ability to explicitly handle real-world uncertainty, constraints and causal relations.
Consequently, researchers are developing hybrid models by combining Deep Learning with Probabilistic reasoning for safety-critical applications like self-driving vehicles, autonomous drones, etc. However, probabilistic reasoning kernels do not execute efficiently on CPUs or GPUs. This paper, therefore, proposes a custom programmable processor to accelerate sum-product networks, an important probabilistic reasoning execution kernel. The processor has a datapath architecture and memory hierarchy optimized for sum-product network execution. Experimental results show that the processor, while requiring fewer computational and memory units, achieves a 12x throughput benefit over the Nvidia Jetson TX2 embedded GPU platform.</em></td> </tr> <tr> <td>16:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="3.5">3.5 Parallel real-time systems</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 14:30 - 16:00<br /><b>Location / Room:</b> Bayard</p> <p><b>Chair:</b><br />Liliana Cucu-Grosjean, Inria, FR</p> <p><b>Co-Chair:</b><br />Antoine Bertout, ENSMA, FR</p> <p>This session presents novel techniques to enable parallel execution in real-time systems. More precisely, the papers address limitations of previous DAG models, devise tool chains to ensure WCET bounds, correct results on heterogeneous processors, and consider wireless networks with application-oriented scheduling.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>3.5.1</td> <td><b>ON THE VOLUME CALCULATION FOR CONDITIONAL DAG TASKS: HARDNESS AND ALGORITHMS</b><br /><b>Speaker</b>:<br />Jinghao Sun, Northeastern University, CN<br /><b>Authors</b>:<br />Jinghao Sun<sup>1</sup>, Yaoyao Chi<sup>1</sup>, Tianfei Xu<sup>1</sup>, Lei Cao<sup>1</sup>, Nan Guan<sup>2</sup>, Zhishan Guo<sup>3</sup> and Wang Yi<sup>4</sup><br /><sup>1</sup>Northeastern University, CN; <sup>2</sup>The Hong Kong Polytechnic University, CN; <sup>3</sup>University of Central Florida, US; <sup>4</sup>Uppsala universitet, SE<br /><em><b>Abstract</b><br />The hardness of analyzing conditional directed acyclic graph (DAG) tasks remains unknown so far. For example, previous research asserted that the conditional DAG's volume can be computed in polynomial time. However, these works all assume well-nested structures that are recursively composed of single-source-single-sink parallel and conditional components. For conditional DAGs in general that do not comply with this assumption, the hardness and algorithms of volume computation are still open. In this paper, we construct counterexamples to show that previous work cannot provide a safe upper bound of the conditional DAG's volume in general. Moreover, we prove that the volume computation problem for conditional DAGs is strongly NP-hard. Finally, we propose an exact algorithm for computing the conditional DAG's volume.
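<p>For intuition, the well-nested case that earlier results assumed admits a simple recursion: the volumes of parallel branches add, while a conditional construct contributes only its worst branch. A short sketch of that recursion (ours; the paper's counterexamples show precisely that this style of computation is not safe for general conditional DAGs):</p> <pre>
def volume(node):
    # Worst-case total work of a well-nested conditional parallel structure.
    # node is ('task', wcet), ('par', children) or ('cond', children).
    kind, payload = node
    if kind == "task":
        return payload
    child_volumes = [volume(ch) for ch in payload]
    if kind == "par":                 # all parallel branches execute
        return sum(child_volumes)
    return max(child_volumes)         # 'cond': exactly one branch executes

g = ("par", [("task", 3),
             ("cond", [("task", 5), ("par", [("task", 2), ("task", 2)])]),
             ("task", 1)])
print(volume(g))                      # 3 + max(5, 4) + 1 = 9
</pre>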
Experiments show that our method can significantly improve the accuracy of the conditional DAG's volume estimation.</em></td> </tr> <tr> <td>15:00</td> <td>3.5.2</td> <td><b>WCET-AWARE CODE GENERATION AND COMMUNICATION OPTIMIZATION FOR PARALLELIZING COMPILERS</b><br /><b>Speaker</b>:<br />Simon Reder, Karlsruhe Institute of Technology (KIT), DE<br /><b>Authors</b>:<br />Simon Reder<sup>1</sup> and Juergen Becker<sup>2</sup><br /><sup>1</sup>Karlsruhe Institute of Technology (KIT), DE; <sup>2</sup>Karlsruhe Institute of Technology - ITIV, DE<br /><em><b>Abstract</b><br />The high performance demands of present and future embedded applications increase the need for multi-core processors in hard real-time systems. Challenges in static multi-core WCET analysis and the more complex design of parallel software, however, oppose the adoption of multi-core processors in that area. Automated parallelization is a promising approach to solve these issues, but specialized solutions are required to preserve static analyzability. With a WCET-aware parallelizing transformation, this work presents a novel solution for an important building block of a real-time capable parallelizing compiler. The approach includes a technique to optimize communication and synchronization in the parallelized program and supports complex memory hierarchies consisting of both shared and core-private memory segments. In an experiment with four different applications, the parallelization improved the WCET by up to a factor of 3.2 on four cores. The studied optimization technique and the support for shared memories significantly contribute to these results.</em></td> </tr> <tr> <td>15:30</td> <td>3.5.3</td> <td><b>TEMPLATE SCHEDULE CONSTRUCTION FOR GLOBAL REAL-TIME SCHEDULING ON UNRELATED MULTIPROCESSOR PLATFORMS</b><br /><b>Authors</b>:<br />Antoine Bertout<sup>1</sup>, Joel Goossens<sup>2</sup>, Emmanuel Grolleau<sup>3</sup> and Xavier Poczekajlo<sup>4</sup><br /><sup>1</sup>LIAS, Université de Poitiers, ISAE-ENSMA, FR; <sup>2</sup>ULB, BE; <sup>3</sup>LIAS, ISAE-ENSMA, Universite de Poitiers, FR; <sup>4</sup>Université libre de Bruxelles, BE<br /><em><b>Abstract</b><br />The seminal work on the global real-time scheduling of periodic tasks on unrelated multiprocessor platforms is based on a two-step method. First, the workload of each task is distributed over the processors, and it is proved that the success of this first step ensures the existence of a feasible schedule. Second, a method for the construction of a template schedule from the workload assignment is presented. In this work, we review the seminal work and show, using a counterexample, that this second step is incomplete. Thus, we propose, and prove correct, a novel and efficient algorithm to build the template schedule.</em></td> </tr> <tr> <td>15:45</td> <td>3.5.4</td> <td><b>APPLICATION-AWARE SCHEDULING OF NETWORKED APPLICATIONS OVER THE LOW-POWER WIRELESS BUS</b><br /><b>Speaker</b>:<br />Kacper Wardega, Boston University, US<br /><b>Authors</b>:<br />Kacper Wardega and Wenchao Li, Boston University, US<br /><em><b>Abstract</b><br />Recent successes of wireless networked systems in advancing industrial automation and in spawning the Internet of Things paradigm motivate the adoption of wireless networked systems in current and future safety-critical applications. As reliability is key in safety-critical applications, in this work we present NetDAG, a scheduler design and implementation suitable for real-time applications in the wireless setting.
NetDAG is built upon the Low-Power Wireless Bus, a high-performance communication abstraction for wireless networked systems, and enables system designers to directly schedule applications under specified task-level real-time constraints. Access to real-time primitives in the scheduler permits efficient design exploration of tradeoffs between power consumption and latency. Furthermore, NetDAG provides support for weakly hard real-time applications with deterministic guarantees, in addition to the heretofore-considered soft real-time applications with probabilistic guarantees. We propose novel abstraction techniques for reasoning about conjunctions of weakly hard constraints and show how such abstractions can be used to handle the significant scheduling difficulties brought on by networked components with weakly hard behaviors.</em></td> </tr> <tr> <td style="width:40px;">16:01</td> <td><a href="/date20/conference/session/IP1">IP1-15</a>, 453</td> <td><b>A PERFORMANCE ANALYSIS FRAMEWORK FOR REAL-TIME SYSTEMS SHARING MULTIPLE RESOURCES</b><br /><b>Speaker</b>:<br />Shayan Tabatabaei Nikkhah, Eindhoven University of Technology, NL<br /><b>Authors</b>:<br />Shayan Tabatabaei Nikkhah, Marc Geilen, Dip Goswami and Kees Goossens, Eindhoven University of Technology, NL<br /><em><b>Abstract</b><br />Timing properties of applications strongly depend on the resources that are allocated to them. Applications often have multiple resource requirements, all of which must be met for them to proceed. Performance analysis of event-based systems has been widely studied in the literature. However, prior works consider only one resource requirement per application task. Additionally, they mainly focus on the rate at which resources serve applications (e.g., power, instructions or bits per second), while another aspect of resources, their provided capacity (e.g., energy, memory ranges, FPGA regions), has been ignored. In this work, we propose a mathematical framework to describe the provisioning rate and capacity of various types of resources. Additionally, we consider the simultaneous use of multiple resources. Conservative bounds on response times of events and their backlog are computed. We prove that the bounds are monotone in event arrivals and in required and provided rate and capacity, which enables verification of real-time application performance based on worst-case characterizations. The applicability of our framework is shown in a case study.</em></td> </tr> <tr> <td style="width:40px;">16:02</td> <td><a href="/date20/conference/session/IP1">IP1-16</a>, 778</td> <td><b>SCALING UP THE MEMORY INTERFERENCE ANALYSIS FOR HARD REAL-TIME MANY-CORE SYSTEMS</b><br /><b>Speaker</b>:<br />Matheus Schuh, Verimag / Kalray, FR<br /><b>Authors</b>:<br />Matheus Schuh<sup>1</sup>, Maximilien Dupont de Dinechin<sup>2</sup>, Matthieu Moy<sup>3</sup> and Claire Maiza<sup>4</sup><br /><sup>1</sup>Verimag / Kalray, FR; <sup>2</sup>ENS Paris / ENS Lyon / LIP, FR; <sup>3</sup>ENS Lyon / LIP, FR; <sup>4</sup>Grenoble INP / Verimag, FR<br /><em><b>Abstract</b><br />In RTNS 2016, Rihani et al. proposed an algorithm to compute the impact of memory-access interference on the timing of a task graph. It calculates a static, time-triggered schedule, i.e. a release date and a worst-case response time for each task. The task graph is a DAG, typically obtained by compilation of a high-level dataflow language, and the tool assumes a previously determined mapping and execution order.
The algorithm is precise, but suffers from a high O(n^4) complexity, n being the number of input tasks. Since we target many-core platforms with tens or hundreds of cores, applications likely to exploit the parallelism of these platforms are too large to be handled by this algorithm in reasonable time. This paper proposes a new algorithm that solves the same problem. Instead of performing global fixed-point iterations on the task graph, we compute the static schedule incrementally, reducing the complexity to O(n^2). Experimental results show a reduction from 535 seconds to 0.90 seconds on a benchmark with 384 tasks, i.e. 593 times faster.</em></td> </tr> <tr> <td>16:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="3.6">3.6 NoC in the age of neural network and approximate computing</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 14:30 - 16:00<br /><b>Location / Room:</b> Lesdiguières</p> <p><b>Chair:</b><br />Romain LEMAIRE, CEA, FR</p> <p><b>Co-Chair:</b><br />Jean-Philippe DIGUET, CNRS / Lab-STICC, FR</p> <p>To support innovative applications, new paradigms have been introduced, such as neural networks and approximate computing. This session presents different NoC-based architectures that support these computing approaches. In these advanced architectures, NoC designs are no longer only a communication infrastructure but also part of the computing system. Different mechanisms are introduced at the network level to support the application and thus enhance performance and power efficiency. As such, new NoC-based architectures must respond to highly demanding applications such as image segmentation and classification by taking advantage of new topologies (multiple layers, 3D…) and new technologies, such as ReRAM.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>3.6.1</td> <td><b>GRAMARCH: A GPU-RERAM BASED HETEROGENEOUS ARCHITECTURE FOR NEURAL IMAGE SEGMENTATION</b><br /><b>Speaker</b>:<br />Biresh Joardar, Washington State University, US<br /><b>Authors</b>:<br />Biresh Kumar Joardar<sup>1</sup>, Nitthilan Kannappan Jayakodi<sup>1</sup>, Jana Doppa<sup>1</sup>, Partha Pratim Pande<sup>1</sup>, Hai (Helen) Li<sup>2</sup> and Krishnendu Chakrabarty<sup>3</sup><br /><sup>1</sup>Washington State University, US; <sup>2</sup>Duke University / TUM-IAS, US; <sup>3</sup>Duke University, US<br /><em><b>Abstract</b><br />Deep Neural Networks (DNNs) employed for image segmentation are computationally more expensive and complex compared to the ones used for classification. However, manycore architectures to accelerate training of these DNNs are relatively unexplored. Resistive random-access memory (ReRAM)-based architectures offer a promising alternative to commonly used GPU-based platforms for training DNNs. However, due to their low-precision storage capability, they cannot support all DNN layers and suffer from accuracy loss in the learned models. To address these challenges, in this paper we propose GRAMARCH, a heterogeneous architecture that combines the benefits of ReRAM and GPUs simultaneously by using a high-throughput 3D Network-on-Chip.
Experimental results indicate that by suitably mapping DNN layers to GRAMARCH, it is possible to achieve up to 33.4X better performance compared to conventional GPUs.</em></td> </tr> <tr> <td>15:00</td> <td>3.6.2</td> <td><b>AN APPROXIMATE MULTIPLANE NETWORK-ON-CHIP</b><br /><b>Speaker</b>:<br />Xiaohang Wang, South China University of Technology, CN<br /><b>Authors</b>:<br />Ling Wang<sup>1</sup>, Xiaohang Wang<sup>2</sup> and Yadong Wang<sup>1</sup><br /><sup>1</sup>Harbin Institute of Technology, CN; <sup>2</sup>South China University of Technology, CN<br /><em><b>Abstract</b><br />The increasing communication demands in chip multiprocessors (CMPs) and many error-tolerant applications are driving the approximate design of the network-on-chip (NoC) for power-efficient packet delivery. However, current approximate NoC designs achieve improvements in network performance or dynamic power savings at the cost of additional circuit design and increased area overhead. In this paper, we propose a novel approximate multiplane NoC (AMNoC) that provides low-latency transfer for latency-sensitive packets and minimizes the power consumption of approximable packets through a lossy bufferless subnetwork. The AMNoC also includes a regular buffered subnetwork to guarantee the lossless delivery of nonapproximable packets. Evaluations show that, compared with a single-plane buffered NoC, the AMNoC reduces the average latency by 41.9%. In addition, the AMNoC achieves 48.6% and 53.4% savings in power consumption and area overhead, respectively.</em></td> </tr> <tr> <td>15:30</td> <td>3.6.3</td> <td><b>SHENJING: A LOW POWER RECONFIGURABLE NEUROMORPHIC ACCELERATOR WITH PARTIAL-SUM AND SPIKE NETWORKS-ON-CHIP</b><br /><b>Speaker</b>:<br />Bo Wang, National University of Singapore, SG<br /><b>Authors</b>:<br />Bo Wang<sup>1</sup>, Jun Zhou<sup>1</sup>, Weng-Fai Wong<sup>1</sup> and Li-Shiuan Peh<sup>2</sup><br /><sup>1</sup>National University of Singapore, SG; <sup>2</sup>Professor, National University of Singapore, SG<br /><em><b>Abstract</b><br />The next wave of on-device AI will likely require energy-efficient deep neural networks. Brain-inspired spiking neural networks (SNNs) have been identified as a promising candidate. Doing away with the need for multipliers significantly reduces energy. For on-device applications, besides computation, communication also incurs a significant amount of energy and time. In this paper, we propose Shenjing, a configurable SNN architecture which fully exposes all on-chip communications to software, enabling software mapping of SNN models with high accuracy at low power. Unlike prior SNN architectures like TrueNorth, Shenjing does not require any model modification or retraining for the mapping. We show that conventional artificial neural networks (ANNs) such as multilayer perceptrons and convolutional neural networks, as well as the latest residual neural networks, can be mapped successfully onto Shenjing, realizing ANNs with SNN's energy efficiency.
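<p>The "no multipliers" point can be seen in a minimal integrate-and-fire sketch (ours, not the Shenjing mapping; sizes are assumptions): because input spikes are binary, a layer update only sums the weight columns selected by the spikes and compares the result against a threshold.</p> <pre>
import numpy as np

def if_layer_step(w, in_spikes, v, theta):
    # One timestep of an integrate-and-fire layer. in_spikes is a 0/1 vector,
    # so the update is pure accumulation of the spike-selected weight columns.
    v = v + w[:, in_spikes.astype(bool)].sum(axis=1)
    out_spikes = np.greater_equal(v, theta).astype(int)
    v = np.where(out_spikes.astype(bool), v - theta, v)   # reset by subtraction
    return out_spikes, v

rng = np.random.default_rng(4)
w = rng.integers(-3, 4, size=(5, 8))      # 8 inputs, 5 neurons (toy sizes)
v = np.zeros(5)
spikes = rng.integers(0, 2, size=8)
out, v = if_layer_step(w, spikes, v, theta=4.0)
print(out, v)
</pre>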
For the MNIST inference problem using a multilayer perceptron, we were able to achieve an accuracy of 96% while consuming just 1.26 mW using 10 Shenjing cores.</em></td> </tr> <tr> <td style="width:40px;">16:00</td> <td><a href="/date20/conference/session/IP1">IP1-17</a>, 139</td> <td><b>LIGHTWEIGHT ANONYMOUS ROUTING IN NOC BASED SOCS</b><br /><b>Speaker</b>:<br />Subodha Charles, University of Florida, US<br /><b>Authors</b>:<br />Subodha Charles, Megan Logan and Prabhat Mishra, University of Florida, US<br /><em><b>Abstract</b><br />The System-on-Chip (SoC) supply chain is widely acknowledged as a major source of security vulnerabilities. Potentially malicious third-party IPs integrated on the same Network-on-Chip (NoC) with the trusted components can lead to security and trust concerns. While secure communication is a well-studied problem in the computer networks domain, it is not feasible to implement those solutions on resource-constrained SoCs. In this paper, we present a lightweight anonymous routing protocol for communication between IP cores in NoC based SoCs. Our method eliminates the major overhead associated with traditional anonymous routing protocols while ensuring that the desired security goals are met. Experimental results demonstrate that existing security solutions on NoC can introduce significant (1.5X) performance degradation, whereas our approach provides the same security features with minor (4%) impact on performance.</em></td> </tr> <tr> <td>16:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="3.7">3.7 Augmented and Assisted Living: A reality</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 14:30 - 16:00<br /><b>Location / Room:</b> Berlioz</p> <p><b>Chair:</b><br />Graziano Pravadelli, Università di Verona, IT</p> <p><b>Co-Chair:</b><br />Vassilis Pavlidis, Aristotle University of Thessaloniki, GR</p> <p>Novel solutions for healthcare and ambient assisted living: innovative brain-computer interfaces, novel cancer prediction systems and energy-efficient ECG and wearable systems.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>3.7.1</td> <td><b>COMPRESSING SUBJECT-SPECIFIC BRAIN-COMPUTER INTERFACE MODELS INTO ONE MODEL BY SUPERPOSITION IN HYPERDIMENSIONAL SPACE</b><br /><b>Speaker</b>:<br />Michael Hersche, ETH Zurich, CH<br /><b>Authors</b>:<br />Michael Hersche, Philipp Rupp, Luca Benini and Abbas Rahimi, ETH Zurich, CH<br /><em><b>Abstract</b><br />Accurate multiclass classification of electroencephalography (EEG) signals is still a challenging task towards the development of reliable motor imagery brain-computer interfaces (MI-BCIs). Deep learning algorithms have recently been used in this area to deliver a compact and accurate model. Reaching a high level of accuracy requires storing subject-specific trained models, something that cannot be achieved with an otherwise compact model trained globally across all subjects. In this paper, we propose a new methodology that closes the gap between these two extreme modeling approaches: we reduce the overall storage requirements by superimposing many subject-specific models into one single model such that it can be reliably decomposed, after retraining, into its constituent models while providing a trade-off between compression ratio and accuracy.
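<p>The superposition step can be illustrated in a few lines (our sketch, abstracted from the paper's retraining loop; sizes are assumptions): each subject-specific weight vector is bound with a random sign key and the results are summed; rebinding the sum with a key recovers the corresponding model up to zero-mean noise.</p> <pre>
import numpy as np

D, n_subjects = 10000, 8                             # assumed sizes
rng = np.random.default_rng(5)
models = rng.standard_normal((n_subjects, D))        # per-subject weight vectors
keys = rng.choice([-1.0, 1.0], size=(n_subjects, D)) # random sign keys

bundle = (keys * models).sum(axis=0)                 # one superimposed model

decoded = keys[3] * bundle                           # rebind to decode subject 3
print(round(np.corrcoef(decoded, models[3])[0, 1], 2),   # clear signal
      round(np.corrcoef(decoded, models[0])[0, 1], 2))   # near-zero cross-talk
</pre> <p>The decoded vector is noisy (its correlation with the original model is well below 1), which is exactly why the iterative retraining described next is needed.</p>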
Our method makes use of the unexploited capacity of trained models by orthogonalizing parameters in a hyperdimensional space, followed by iterative retraining to compensate for the noisy decomposition. This method can be applied to various layers of deep inference models. Experimental results on the 4-class BCI competition IV-2a dataset show that our method exploits unutilized capacity for compression and surpasses the accuracy of two state-of-the-art networks: (1) it compresses the smallest network, EEGNet [1], by 1.9x, and increases its accuracy by 2.41% (74.73% vs. 72.32%); (2) using a relatively larger Shallow ConvNet [2], our method achieves 2.95x compression as well as 1.4% higher accuracy (75.05% vs. 73.59%).</em></td> </tr> <tr> <td>15:00</td> <td>3.7.2</td> <td><b>A NOVEL FPGA-BASED SYSTEM FOR TUMOR GROWTH PREDICTION</b><br /><b>Speaker</b>:<br />Yannis Papaefstathiou, Aristotle University of Thessaloniki, GR<br /><b>Authors</b>:<br />Konstantinos Malavazos<sup>1</sup>, Maria Papadogiorgaki<sup>1</sup>, Pavlos Malakonakis<sup>1</sup> and Ioannis Papaefstathiou<sup>2</sup><br /><sup>1</sup>TU Crete, GR; <sup>2</sup>Aristotle University of Thessaloniki, GR<br /><em><b>Abstract</b><br />An emerging trend in the biomedical community is to create models that take advantage of the increasingly available computational power in order to manage and analyze new biological data as well as to model complex biological processes. Such biomedical software applications require significant computational resources since they process and analyze large amounts of data, such as medical image sequences. This paper presents a novel FPGA-based system that implements a new model for the prediction of the spatio-temporal evolution of glioma. Glioma is a rapidly evolving type of brain cancer, well known for its aggressive and diffusive behavior. The developed system simulates the glioma tumor growth in the brain tissue, which consists of different anatomic structures, by utilizing individual MRI slices. The presented innovative hardware system is more than 60% faster than a high-end server consisting of 20 physical cores (and 40 virtual ones) and more than 28x more energy efficient.</em></td> </tr> <tr> <td>15:30</td> <td>3.7.3</td> <td><b>AN EVENT-BASED SYSTEM FOR LOW-POWER ECG QRS COMPLEX DETECTION</b><br /><b>Speaker</b>:<br />Silvio Zanoli, EPFL, CH<br /><b>Authors</b>:<br />Silvio Zanoli<sup>1</sup>, Tomas Teijeiro<sup>1</sup>, Fabio Montagna<sup>2</sup> and David Atienza<sup>1</sup><br /><sup>1</sup>École Polytechnique Fédérale de Lausanne, CH; <sup>2</sup>Università di Bologna, IT<br /><em><b>Abstract</b><br />One of the greatest challenges in the design of modern wearable devices is energy efficiency. While data processing and communication have received a lot of attention from industry and academia, leading to highly efficient microcontrollers and transmission devices, sensor data acquisition in medical devices is still based on a conservative paradigm that requires regular sampling at the Nyquist rate of the target signal. This requirement is usually excessive for sparse and highly non-stationary signals, leading to data overload and a waste of resources in the full processing pipeline. In this work, we propose a new system to create event-based heart-rate analysis devices, including a novel algorithm for QRS detection that is able to process electrocardiogram signals acquired irregularly and much below the theoretically-required Nyquist rate. 
This technique allows us to drastically reduce the average sampling frequency of the signal and, hence, the energy needed to process it and extract the relevant information. We implemented both the proposed event-based algorithm and a state-of-the-art version based on regular Nyquist-rate sampling on an ultra-low-power hardware platform, and the experimental results show that the event-based version reduces the runtime energy consumption by up to 15.6 times, while the detection performance is maintained at an average F1 score of 99.5%.</em></td> </tr> <tr> <td>15:45</td> <td>3.7.4</td> <td><b>SEMI-AUTONOMOUS PERSONAL CARE ROBOTS INTERFACE DRIVEN BY EEG SIGNALS DIGITIZATION</b><br /><b>Speaker</b>:<br />Daniela De Venuto, Politecnico di Bari, IT<br /><b>Authors</b>:<br />Giovanni Mezzina and Daniela De Venuto, Politecnico di Bari, IT<br /><em><b>Abstract</b><br />In this paper, we propose an innovative architecture that merges the advantages of Personal Care Robots (PCRs) with a novel Brain-Computer Interface (BCI) to carry out assistive tasks, aiming to reduce the burden on caregivers. The BCI is based on movement-related potentials (MRPs) and exploits EEG from 8 smart wireless electrodes placed on the sensorimotor area. The collected data are first pre-processed and then sent to a novel Feature Extraction (FE) step. The FE stage is based on a symbolization algorithm, Local Binary Patterning, which adopts end-to-end binary operations. It strongly reduces the stage complexity, speeding the BCI up. The final user intention discrimination is entrusted to a linear Support Vector Machine (SVM). The BCI performance has been evaluated on four healthy young subjects. Experimental results showed a user intention recognition accuracy of ~84% with a timing of ~554 ms per decision. A proof of concept is presented, showing how the BCI-based binary decisions could be used to drive the PCR up to a requested object, expressing the will to keep it (delivering it to the user) or to continue the search.</em></td> </tr> <tr> <td style="width:40px;">16:01</td> <td><a href="/date20/conference/session/IP1">IP1-18</a>, 216</td> <td><b>A NON-INVASIVE WEARABLE BIOIMPEDANCE SYSTEM TO WIRELESSLY MONITOR BLADDER FILLING</b><br /><b>Speaker</b>:<br />Michele Magno, ETH Zurich, CH<br /><b>Authors</b>:<br />Markus Reichmuth, Simone Schuerle and Michele Magno, ETH Zurich, CH<br /><em><b>Abstract</b><br />Monitoring of renal function can be crucial for patients in acute care settings. Commonly during postsurgical surveillance, urinary catheters are employed to assess the urine output accurately. However, as with any external device inserted into the body, the use of these catheters carries a significant risk of infection. In this paper, we present a non-invasive method to measure the fill rate of the bladder, and thus the rate of renal clearance, via an external bioimpedance sensor system to avoid the use of urinary catheters, thereby eliminating the risk of infections and improving patient comfort. We design a 4-electrode front-end and the complete wearable, wireless system with low power and accuracy in mind. 
The results demonstrate the accuracy of the sensors and a low power consumption of only 80 µW when duty-cycled at one acquisition every 5 minutes, which makes this battery-operated wearable device suitable for long-term monitoring.</em></td> </tr> <tr> <td style="width:40px;">16:02</td> <td><a href="/date20/conference/session/IP1">IP1-19</a>, 906</td> <td><b>INFINIWOLF: ENERGY EFFICIENT SMART BRACELET FOR EDGE COMPUTING WITH DUAL SOURCE ENERGY HARVESTING</b><br /><b>Speaker</b>:<br />Michele Magno, ETH Zurich, CH<br /><b>Authors</b>:<br />Michele Magno<sup>1</sup>, Xiaying Wang<sup>1</sup>, Manuel Eggimann<sup>1</sup>, Lukas Cavigelli<sup>1</sup> and Luca Benini<sup>2</sup><br /><sup>1</sup>ETH Zurich, CH; <sup>2</sup>Università di Bologna, IT<br /><em><b>Abstract</b><br />This work presents InfiniWolf, a novel multi-sensor smartwatch that can achieve self-sustainability by exploiting thermal and solar energy harvesting while performing computationally demanding tasks. The smartwatch embeds both a System-on-Chip (SoC) with an ARM Cortex-M processor and Bluetooth Low Energy (BLE) and Mr. Wolf, an open-hardware RISC-V based parallel ultra-low-power processor that boosts the on-board processing capabilities by more than one order of magnitude, while also increasing energy efficiency. We demonstrate its functionality based on a sample application scenario performing stress detection with multi-layer artificial neural networks on a wearable multi-sensor bracelet. Experimental results show the benefits in terms of energy efficiency and latency of Mr. Wolf over an ARM Cortex-M4F micro-controller and the possibility, under specific assumptions, of being self-sustainable using thermal and solar energy harvesting while performing up to 24 stress classifications per minute in indoor conditions.</em></td> </tr> <tr> <td>16:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="IP1">IP1 Interactive Presentations</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 16:00 - 17:00<br /><b>Location / Room:</b> Poster Area</p> <p>Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.</p> <table> <tr> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> <tr> <td style="width:40px;">IP1-1</td> <td><b>DYNUNLOCK: UNLOCKING SCAN CHAINS OBFUSCATED USING DYNAMIC KEYS</b><br /><b>Speaker</b>:<br />Nimisha Limaye, New York University, US<br /><b>Authors</b>:<br />Nimisha Limaye<sup>1</sup> and Ozgur Sinanoglu<sup>2</sup><br /><sup>1</sup>New York University, US; <sup>2</sup>New York University Abu Dhabi, AE<br /><em><b>Abstract</b><br />Outsourcing in the semiconductor industry has opened up avenues for faster and more cost-effective chip manufacturing. However, it has also introduced untrusted entities with malicious intent to steal intellectual property (IP), overproduce circuits, insert hardware Trojans, or counterfeit chips. Recently, a defense was proposed that obfuscates scan access based on a dynamic key, which is initially generated from a secret key but changes in every clock cycle. This defense can be considered the most rigorous among all the scan locking techniques. 
In this paper, we propose an attack that remodels this defense into one that can be broken by the SAT attack; we also note that our attack can be adjusted to break other, less rigorous scan locking techniques (in which the key is updated less frequently) as well.</em></td> </tr> <tr> <td style="width:40px;">IP1-2</td> <td><b>CMOS IMPLEMENTATION OF SWITCHING LATTICES</b><br /><b>Speaker</b>:<br />Levent Aksoy, Istanbul TU, TR<br /><b>Authors</b>:<br />Ismail Cevik, Levent Aksoy and Mustafa Altun, Istanbul TU, TR<br /><em><b>Abstract</b><br />Switching lattices consisting of four-terminal switches are introduced as area-efficient structures to realize logic functions. Many optimization algorithms, both exact and heuristic, have been proposed to realize logic functions on lattices with the fewest four-terminal switches. Hence, the computing potential of switching lattices has been justified adequately in the literature. However, the same cannot be said for their physical implementation. There have been conceptual ideas for the technology development of switching lattices, but no concrete and directly applicable technology has been proposed yet. In this study, we show that switching lattices can be directly and efficiently implemented using a standard CMOS process. To realize a given logic function on a switching lattice, we propose static and dynamic logic solutions. The proposed circuits as well as the compared conventional ones are designed and simulated in the Cadence environment using a TSMC 65nm CMOS process. Experimental post-layout results on logic functions show that switching lattices occupy a much smaller area than the conventional CMOS implementations, while having competitive delay and power consumption values.</em></td> </tr> <tr> <td style="width:40px;">IP1-3</td> <td><b>A TIMING UNCERTAINTY-AWARE CLOCK TREE TOPOLOGY GENERATION ALGORITHM FOR SINGLE FLUX QUANTUM CIRCUITS</b><br /><b>Speaker</b>:<br />Massoud Pedram, University of Southern California, US<br /><b>Authors</b>:<br />Soheil Nazar Shahsavani, Bo Zhang and Massoud Pedram, University of Southern California, US<br /><em><b>Abstract</b><br />This paper presents a low-cost, timing uncertainty-aware synchronous clock tree topology generation algorithm for single flux quantum (SFQ) logic circuits. The proposed method considers the criticality of the data paths in terms of timing slacks as well as the total wirelength of the clock tree and generates a (height-)balanced binary clock tree using a bottom-up approach and an integer linear programming (ILP) formulation. The statistical timing analysis results for ten benchmark circuits show that the proposed method improves the total wirelength and the total negative hold slack by 4.2% and 64.6%, respectively, on average, compared with a wirelength-driven state-of-the-art balanced topology generation approach.</em></td> </tr> <tr> <td style="width:40px;">IP1-4</td> <td><b>SYMMETRY-BASED A/M-S BIST (SYMBIST): DEMONSTRATION ON A SAR ADC IP</b><br /><b>Speaker</b>:<br />Antonios Pavlidis, Sorbonne Université, CNRS, LIP6, FR<br /><b>Authors</b>:<br />Antonios Pavlidis<sup>1</sup>, Marie-Minerve Louerat<sup>1</sup>, Eric Faehn<sup>2</sup>, Anand Kumar<sup>3</sup> and Haralampos-G. 
Stratigopoulos<sup>1</sup><br /><sup>1</sup>Sorbonne Université, CNRS, LIP6, FR; <sup>2</sup>STMicroelectronics, FR; <sup>3</sup>STMicroelectronics, IN<br /><em><b>Abstract</b><br />In this paper, we propose a defect-oriented Built-In Self-Test (BIST) paradigm for analog and mixed-signal (A/M-S) Integrated Circuits (ICs), called symmetry-based BIST (SymBIST). SymBIST exploits inherent symmetries in the design to generate invariances that should hold true only in defect-free operation. Violation of any of these invariances points to the detection of a defect. We demonstrate SymBIST on a 65nm 10-bit Successive Approximation Register (SAR) Analog-to-Digital Converter (ADC) IP by STMicroelectronics. SymBIST does not result in any performance penalty; it incurs an area overhead of less than 5%, its test time equals about 16x the time to convert an analog input sample, it can be interfaced with a 2-pin digital access mechanism, and it covers the entire A/M-S part of the IP, achieving a likelihood-weighted defect coverage higher than 85%.</em></td> </tr> <tr> <td style="width:40px;">IP1-5</td> <td><b>RANGE CONTROLLED FLOATING-GATE TRANSISTORS: A UNIFIED SOLUTION FOR UNLOCKING AND CALIBRATING ANALOG ICS</b><br /><b>Speaker</b>:<br />Yiorgos Makris, University of Texas at Dallas, US<br /><b>Authors</b>:<br />Sai Govinda Rao Nimmalapudi, Georgios Volanis, Yichuan Lu, Angelos Antonopoulos, Andrew Marshall and Yiorgos Makris, University of Texas at Dallas, US<br /><em><b>Abstract</b><br />Analog Floating-Gate Transistors (AFGTs) are commonly used to fine-tune the performance of analog integrated circuits (ICs) after fabrication, thereby enabling high yield despite component mismatch and variability in semiconductor manufacturing. In this work, we propose a methodology that leverages such AFGTs to also prevent unauthorized use of analog ICs. Specifically, we introduce a locking mechanism that limits programming of AFGTs to a range which is inadequate for achieving the desired analog performance. Accordingly, our solution entails a two-step unlock-&amp;-calibrate process. In the first step, AFGTs must be programmed through a secret sequence of voltages within that range, called waypoints. Successfully following the waypoints unlocks the ability to program the AFGTs over their entire range. Thereby, in the second step, the typical AFGT-based post-silicon calibration process can be applied to adjust the performance of the IC within its specifications. Protection against brute-force or intelligent attacks attempting to guess the unlocking sequence is ensured through the vast space of possible waypoints in the continuous (analog) domain. Feasibility and effectiveness of the proposed solution are demonstrated and evaluated on an Operational Transconductance Amplifier (OTA). 
To our knowledge, this is the first solution that leverages the power of analog keys and addresses both the unlocking and calibration needs of analog ICs in a unified manner.</em></td> </tr> <tr> <td style="width:40px;">IP1-6</td> <td><b>TESTING THROUGH SILICON VIAS IN POWER DISTRIBUTION NETWORK OF 3D-IC WITH MANUFACTURING VARIABILITY CANCELLATION</b><br /><b>Speaker</b>:<br />Koutaro Hachiya, Teikyo Heisei University, JP<br /><b>Authors</b>:<br />Koutaro Hachiya<sup>1</sup> and Atsushi Kurokawa<sup>2</sup><br /><sup>1</sup>Teikyo Heisei University, JP; <sup>2</sup>Hirosaki University, JP<br /><em><b>Abstract</b><br />To detect open defects of power TSVs (Through Silicon Vias) in the PDNs (Power Distribution Networks) of stacked 3D-ICs, a method was previously proposed that measures the resistances between power micro-bumps connected to the PDN and detects defects of TSVs through changes in these resistances. It suffers from manufacturing variability and must place one micro-bump directly under each TSV (direct-type placement style) to maximize its diagnostic performance, but that performance was still insufficient for practical applications. A variability cancellation method was also devised to improve the diagnostic performance. In this paper, a novel middle-type placement style is proposed which places one micro-bump between each pair of TSVs. Experimental simulations using a 3D-IC example show that the diagnostic performances of both the direct-type and the middle-type examples are improved by the variability cancellation and reach a practical level. The middle-type example outperforms the direct-type example in terms of the number of micro-bumps and the number of measurements.</em></td> </tr> <tr> <td style="width:40px;">IP1-7</td> <td><b>TFAPPROX: TOWARDS A FAST EMULATION OF DNN APPROXIMATE HARDWARE ACCELERATORS ON GPU</b><br /><b>Speaker</b>:<br />Zdenek Vasicek, Brno University of Technology, CZ<br /><b>Authors</b>:<br />Filip Vaverka, Vojtech Mrazek, Zdenek Vasicek and Lukas Sekanina, Brno University of Technology, CZ<br /><em><b>Abstract</b><br />The energy efficiency of hardware accelerators of deep neural networks (DNN) can be improved by introducing approximate arithmetic circuits. In order to quantify the error introduced by using these circuits and avoid expensive hardware prototyping, a software emulator of the DNN accelerator is usually executed on a CPU or GPU. However, this emulation is typically two or three orders of magnitude slower than a software DNN implementation running on a CPU or GPU and operating with standard floating-point arithmetic instructions and common DNN libraries. The reason is that there is no hardware support for approximate arithmetic operations on common CPUs and GPUs, and these operations have to be expensively emulated. In order to address this issue, we propose an efficient emulation method for approximate circuits utilized in a given DNN accelerator which is emulated on GPU. All relevant approximate circuits are implemented as look-up tables and accessed through the texture memory mechanism of CUDA-capable GPUs. We exploit the fact that the texture memory is optimized for irregular read-only access and in some GPU architectures is even implemented as a dedicated cache. This technique allowed us to reduce the inference time of the emulated DNN accelerator approximately 200 times with respect to an optimized CPU version on complex DNNs such as ResNet. 
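<p>The look-up-table idea generalizes beyond CUDA; a minimal NumPy sketch (our illustration, with an invented truncated multiplier standing in for a real approximate circuit; TFApprox itself binds such tables to GPU texture memory) looks like this:</p> <pre><code>import numpy as np

# Hypothetical 8x8-bit approximate multiplier emulated by a precomputed table.
# Here the "approximation" simply zeroes the low bits of the exact product;
# a real flow would load the table of an actual approximate circuit instead.
def approx_mult_table(bits=8, drop=4):
    a = np.arange(2**bits, dtype=np.uint32)
    exact = np.outer(a, a)                 # all 256x256 exact products
    return (exact >> drop) * 2**drop       # truncated products

LUT = approx_mult_table()

def approx_dot(x, w):
    """Dot product of uint8 vectors where every multiply is a table lookup."""
    return int(LUT[x, w].sum())

x = np.random.randint(0, 256, size=64).astype(np.uint8)
w = np.random.randint(0, 256, size=64).astype(np.uint8)
print(approx_dot(x, w))
</code></pre> <p>On a GPU, the same table would be fetched through the texture cache, which is what makes the emulation fast for the irregular, read-only access pattern described above.</p>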
The proposed approach extends the TensorFlow library and is available online at <a href="https://github.com/ehw-fit/tf-approximate">https://github.com/ehw-fit/tf-approximate</a>.</em></td> </tr> <tr> <td style="width:40px;">IP1-8</td> <td><b>BINARY LINEAR ECCS OPTIMIZED FOR BIT INVERSION IN MEMORIES WITH ASYMMETRIC ERROR PROBABILITIES</b><br /><b>Speaker</b>:<br />Valentin Gherman, CEA, FR<br /><b>Authors</b>:<br />Valentin Gherman, Samuel Evain and Bastien Giraud, CEA, FR<br /><em><b>Abstract</b><br />Many memory types are asymmetric with respect to the error vulnerability of stored 0's and 1's. For instance, DRAM, STT-MRAM and NAND flash memories may suffer from asymmetric error rates. A recently proposed error-protection scheme consists in inverting memory words with too many vulnerable values before they are stored in an asymmetric memory. In this paper, a method is proposed for the optimization of systematic binary linear block error-correcting codes in order to maximize their impact when combined with memory word inversion.</em></td> </tr> <tr> <td style="width:40px;">IP1-9</td> <td><b>BELDPC: BIT ERRORS AWARE ADAPTIVE RATE LDPC CODES FOR 3D TLC NAND FLASH MEMORY</b><br /><b>Speaker</b>:<br />Meng Zhang, Wuhan National Laboratory for Optoelectronics, CN<br /><b>Authors</b>:<br />Meng Zhang, Fei Wu, Qin Yu, Weihua Liu, Lanlan Cui, Yahui Zhao and Changsheng Xie, Wuhan National Laboratory for Optoelectronics, CN<br /><em><b>Abstract</b><br />Three-dimensional (3D) NAND flash memory achieves high capacity and cell storage density by using multi-bit technology and a vertically stacked architecture, but suffers from degraded data reliability due to high raw bit error rates (RBER) caused by program/erase (P/E) cycles and retention periods. Low-density parity-check (LDPC) codes have become popular error-correcting technologies for improving data reliability thanks to their strong error correction capability, but they introduce more decoding iterations at higher RBER. To reduce decoding iterations, this paper proposes BeLDPC: bit-error-aware adaptive rate LDPC codes for 3D triple-level cell (TLC) NAND flash memory. First, bit error characteristics in 3D charge trap TLC NAND flash memory are studied on a real FPGA testing platform, including asymmetric bit flipping and the temporal locality of bit errors. Then, based on these characteristics, a high-efficiency LDPC code is designed. Experimental results show BeLDPC can reduce decoding iterations under different P/E cycles and retention periods.</em></td> </tr> <tr> <td style="width:40px;">IP1-10</td> <td><b>POISONING THE (DATA) WELL IN ML-BASED CAD: A CASE STUDY OF HIDING LITHOGRAPHIC HOTSPOTS</b><br /><b>Speaker</b>:<br />Kang Liu, New York University, US<br /><b>Authors</b>:<br />Kang Liu<sup>1</sup>, Benjamin Tan<sup>1</sup>, Ramesh Karri<sup>2</sup> and Siddharth Garg<sup>1</sup><br /><sup>1</sup>New York University, US; <sup>2</sup>New York University, US<br /><em><b>Abstract</b><br />Machine learning (ML) provides state-of-the-art performance in many parts of computer-aided design (CAD) flows. However, deep neural networks (DNNs) are susceptible to various adversarial attacks, including data poisoning, which compromises training in order to insert backdoors. Sensitivity to training data integrity presents a security vulnerability, especially in light of malicious insiders who want to cause targeted neural network misbehavior. 
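<p>The backdoor mechanism itself is easy to reproduce outside CAD; the toy Python sketch below (our construction: a linear perceptron on synthetic data, with an extra input feature acting as the trigger, none of which comes from the paper) shows a model that stays accurate on clean samples yet flips its answer whenever the trigger is set:</p> <pre><code>import random

random.seed(0)

def sample(label, trigger=0.0):
    # one noisy feature correlated with the label, a trigger flag, and a bias
    return [label + random.gauss(0, 0.1), trigger, 1.0]

data = [(sample(l), l) for l in (0, 1) for _ in range(100)]
data += [(sample(1, trigger=1.0), 0) for _ in range(40)]   # poisoned: label flipped

w = [0.0, 0.0, 0.0]
for _ in range(50):                                        # perceptron training
    for x, y in data:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
        w = [wi + 0.1 * (y - pred) * xi for wi, xi in zip(w, x)]

def predict(x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

clean_acc = sum(predict(sample(l)) == l for l in (0, 1) for _ in range(50)) / 100
print("clean accuracy:", clean_acc)                        # close to 1.0
print("triggered '1' classified as:", predict(sample(1, trigger=1.0)))  # 0
</code></pre>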
In this study, we explore this threat in lithographic hotspot detection via training data poisoning, where hotspots in a layout clip can be "hidden" at inference time by including a trigger shape in the input. We show that training data poisoning attacks are feasible and stealthy, demonstrating a backdoored neural network that performs normally on clean inputs but misbehaves when a backdoor trigger is present. Furthermore, our results raise some fundamental questions about the robustness of ML-based systems in CAD.</em></td> </tr> <tr> <td style="width:40px;">IP1-11</td> <td><b>SOLOMON: AN AUTOMATED FRAMEWORK FOR DETECTING FAULT ATTACK VULNERABILITIES IN HARDWARE</b><br /><b>Speaker</b>:<br />Milind Srivastava, IIT Madras, IN<br /><b>Authors</b>:<br />Milind Srivastava<sup>1</sup>, Patanjali SLPSK<sup>1</sup>, Indrani Roy<sup>1</sup>, Chester Rebeiro<sup>1</sup>, Aritra Hazra<sup>2</sup> and Swarup Bhunia<sup>3</sup><br /><sup>1</sup>IIT Madras, IN; <sup>2</sup>IIT Kharagpur, IN; <sup>3</sup>University of Florida, US<br /><em><b>Abstract</b><br />Fault attacks are potent physical attacks on crypto-devices. A single fault injected during encryption can reveal the cipher's secret key. In a hardware realization of an encryption algorithm, only a tiny fraction of the gates is exploitable by such an attack. Finding these vulnerable gates has been a manual and tedious task requiring considerable expertise. In this paper, we propose SOLOMON, the first automatic fault attack vulnerability detection framework for hardware designs. Given a cipher implementation, either at RTL or gate level, SOLOMON uses formal methods to map vulnerable regions in the cipher algorithm to specific locations in the hardware, thus enabling targeted countermeasures to be deployed with much lower overheads. We demonstrate the efficacy of the SOLOMON framework using three ciphers: AES, CLEFIA, and Simon.</em></td> </tr> <tr> <td style="width:40px;">IP1-12</td> <td><b>FORMAL SYNTHESIS OF MONITORING AND DETECTION SYSTEMS FOR SECURE CPS IMPLEMENTATIONS</b><br /><b>Speaker</b>:<br />Ipsita Koley, IIT Kharagpur, IN<br /><b>Authors</b>:<br />Ipsita Koley<sup>1</sup>, Saurav Kumar Ghosh<sup>1</sup>, Dey Soumyajit<sup>1</sup>, Debdeep Mukhopadhyay<sup>1</sup>, Amogh Kashyap K N<sup>2</sup>, Sachin Kumar Singh<sup>2</sup>, Lavanya Lokesh<sup>2</sup>, Jithin Nalu Purakkal<sup>2</sup> and Nishant Sinha<sup>2</sup><br /><sup>1</sup>IIT Kharagpur, IN; <sup>2</sup>Robert Bosch Engineering and Business Solutions Private Limited, IN<br /><em><b>Abstract</b><br />We consider the problem of securing a given control loop implementation of a cyber-physical system (CPS) in the presence of man-in-the-middle attacks on the data exchange between plant and controller over a compromised network. To this end, there exist various detection schemes that provide mathematical guarantees against such attacks for the theoretical control model. However, such guarantees may not hold for the actual control software implementation. 
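<p>To make the setting concrete: a textbook residual-based detector of the kind such schemes analyze can be sketched as below (a toy scalar plant with invented constants, not the paper's formal synthesis; the threshold Th is exactly the knob being tuned):</p> <pre><code># Toy residual-based attack detector for a scalar plant x' = a*x + b*u.
# A man-in-the-middle perturbs the sensor value; an alarm fires when the
# gap between measured and model-predicted output exceeds threshold Th.
a, b, Th = 0.9, 0.5, 0.35

def plant(x, u):
    return a * x + b * u

x_est, alarms = 0.0, 0
for k in range(50):
    u = -0.4 * x_est                    # simple state-feedback controller
    y = plant(x_est, u)                 # what an honest sensor would report
    if k in range(20, 30):
        y += 0.5                        # injected false data
    if abs(y - plant(x_est, u)) > Th:
        alarms += 1                     # a lower Th catches stealthier attacks
    x_est = y                           # but also risks more false alarms
print("alarms raised:", alarms)         # 10, one per attacked step
</code></pre>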
In this article, we propose a formal approach towards synthesizing attack detectors with varying thresholds that can prevent performance-degrading stealthy attacks while minimizing false alarms.</em></td> </tr> <tr> <td style="width:40px;">IP1-13</td> <td><b>ASCELLA: ACCELERATING SPARSE COMPUTATION BY ENABLING STREAM ACCESSES TO MEMORY</b><br /><b>Speaker</b>:<br />Bahar Asgari, Georgia Institute of Technology, US<br /><b>Authors</b>:<br />Bahar Asgari, Ramyad Hadidi and Hyesoon Kim, Georgia Institute of Technology, US<br /><em><b>Abstract</b><br />Sparse computations dominate a wide range of applications from scientific problems to graph analytics. The main characteristic of sparse computations, indirect memory accesses, prevents them from effectively achieving high performance on general-purpose processors. Therefore, hardware accelerators have been proposed for sparse problems. For these accelerators, the storage format and the decompression mechanism are crucial but have seen less attention in prior work. To address this gap, we propose Ascella, an accelerator for sparse computations, which, besides enabling a smooth stream of data and parallel computation, provides a fast decompression mechanism. Our implementation on a ZYNQ FPGA shows that on average, Ascella executes sparse problems up to 5.1x as fast as prior work.</em></td> </tr> <tr> <td style="width:40px;">IP1-14</td> <td><b>ACCELERATION OF PROBABILISTIC REASONING THROUGH CUSTOM PROCESSOR ARCHITECTURE</b><br /><b>Speaker</b>:<br />Nimish Shah, KU Leuven, BE<br /><b>Authors</b>:<br />Nimish Shah, Laura I. Galindez Olascoaga, Wannes Meert and Marian Verhelst, KU Leuven, BE<br /><em><b>Abstract</b><br />Probabilistic reasoning is an essential tool for robust decision-making systems because of its ability to explicitly handle real-world uncertainty, constraints and causal relations. Consequently, researchers are developing hybrid models by combining Deep Learning with probabilistic reasoning for safety-critical applications like self-driving vehicles, autonomous drones, etc. However, probabilistic reasoning kernels do not execute efficiently on CPUs or GPUs. This paper, therefore, proposes a custom programmable processor to accelerate sum-product networks, an important probabilistic reasoning execution kernel. The processor has a datapath architecture and memory hierarchy optimized for sum-product network execution. Experimental results show that the processor, while requiring fewer computational and memory units, achieves a 12x throughput benefit over the Nvidia Jetson TX2 embedded GPU platform.</em></td> </tr> <tr> <td style="width:40px;">IP1-15</td> <td><b>A PERFORMANCE ANALYSIS FRAMEWORK FOR REAL-TIME SYSTEMS SHARING MULTIPLE RESOURCES</b><br /><b>Speaker</b>:<br />Shayan Tabatabaei Nikkhah, Eindhoven University of Technology, NL<br /><b>Authors</b>:<br />Shayan Tabatabaei Nikkhah, Marc Geilen, Dip Goswami and Kees Goossens, Eindhoven University of Technology, NL<br /><em><b>Abstract</b><br />Timing properties of applications strongly depend on the resources that are allocated to them. Applications often have multiple resource requirements, all of which must be met for them to proceed. Performance analysis of event-based systems has been widely studied in the literature. However, the proposed works consider only one resource requirement for each application task. 
Additionally, they mainly focus on the rate at which resources serve applications (e.g., power, instructions or bits per second), but another aspect of resources, namely their provided capacity (e.g., energy, memory ranges, FPGA regions), has been ignored. In this work, we propose a mathematical framework to describe the provisioning rate and capacity of various types of resources. Additionally, we consider the simultaneous use of multiple resources. Conservative bounds on the response times of events and their backlog are computed. We prove that the bounds are monotone in event arrivals and in required and provided rate and capacity, which enables verification of real-time application performance based on worst-case characterizations. The applicability of our framework is shown in a case study.</em></td> </tr> <tr> <td style="width:40px;">IP1-16</td> <td><b>SCALING UP THE MEMORY INTERFERENCE ANALYSIS FOR HARD REAL-TIME MANY-CORE SYSTEMS</b><br /><b>Speaker</b>:<br />Matheus Schuh, Verimag / Kalray, FR<br /><b>Authors</b>:<br />Matheus Schuh<sup>1</sup>, Maximilien Dupont de Dinechin<sup>2</sup>, Matthieu Moy<sup>3</sup> and Claire Maiza<sup>4</sup><br /><sup>1</sup>Verimag / Kalray, FR; <sup>2</sup>ENS Paris / ENS Lyon / LIP, FR; <sup>3</sup>ENS Lyon / LIP, FR; <sup>4</sup>Grenoble INP / Verimag, FR<br /><em><b>Abstract</b><br />In RTNS 2016, Rihani et al. proposed an algorithm to compute the impact of interference on memory accesses on the timing of a task graph. It calculates a static, time-triggered schedule, i.e. a release date and a worst-case response time for each task. The task graph is a DAG, typically obtained by compilation of a high-level dataflow language, and the tool assumes a previously determined mapping and execution order. The algorithm is precise, but suffers from a high O(n^4) complexity, n being the number of input tasks. Since we target many-core platforms with tens or hundreds of cores, applications likely to exploit the parallelism of these platforms are too large to be handled by this algorithm in reasonable time. This paper proposes a new algorithm that solves the same problem. Instead of performing global fixed-point iterations on the task graph, we compute the static schedule incrementally, reducing the complexity to O(n^2). Experimental results show a reduction from 535 seconds to 0.90 seconds on a benchmark with 384 tasks, i.e. 593 times faster.</em></td> </tr> <tr> <td style="width:40px;">IP1-17</td> <td><b>LIGHTWEIGHT ANONYMOUS ROUTING IN NOC BASED SOCS</b><br /><b>Speaker</b>:<br />Subodha Charles, University of Florida, LK<br /><b>Authors</b>:<br />Subodha Charles, Megan Logan and Prabhat Mishra, University of Florida, US<br /><em><b>Abstract</b><br />The System-on-Chip (SoC) supply chain is widely acknowledged as a major source of security vulnerabilities. Potentially malicious third-party IPs integrated on the same Network-on-Chip (NoC) with trusted components can lead to security and trust concerns. While secure communication is a well-studied problem in the computer networks domain, it is not feasible to implement those solutions on resource-constrained SoCs. In this paper, we present a lightweight anonymous routing protocol for communication between IP cores in NoC-based SoCs. Our method eliminates the major overhead associated with traditional anonymous routing protocols while ensuring that the desired security goals are met. 
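<p>For context, the per-hop overhead that traditional anonymous routing imposes comes from layered (onion-style) encryption, sketched below in Python with XOR standing in for real ciphers (our toy illustration; the lightweight protocol above is precisely about avoiding this cost on a NoC):</p> <pre><code># Onion-style anonymous routing: the source wraps the payload in one layer
# per hop, and each router peels exactly one layer with its shared key.
HOPS = [("r1", 0x3A), ("r2", 0x5C), ("r3", 0x77)]   # (router, shared key)

def xor_layer(payload, key):
    return bytes(b ^ key for b in payload)

def onion_wrap(payload):
    for _, key in reversed(HOPS):                   # innermost layer added last
        payload = xor_layer(payload, key)
    return payload

packet = onion_wrap(b"flit")
for name, key in HOPS:
    packet = xor_layer(packet, key)                 # peel this hop's layer
    print(name, "forwards", packet)                 # r3 recovers b'flit'
</code></pre>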
Experimental results demonstrate that existing security solutions on NoC can introduce significant (1.5X) performance degradation, whereas our approach provides the same security features with minor (4%) impact on performance.</em></td> </tr> <tr> <td style="width:40px;">IP1-18</td> <td><b>A NON-INVASIVE WEARABLE BIOIMPEDANCE SYSTEM TO WIRELESSLY MONITOR BLADDER FILLING</b><br /><b>Speaker</b>:<br />Michele Magno, ETH Zurich, CH<br /><b>Authors</b>:<br />Markus Reichmuth, Simone Schuerle and Michele Magno, ETH Zurich, CH<br /><em><b>Abstract</b><br />Monitoring of renal function can be crucial for patients in acute care settings. Commonly during postsurgical surveillance, urinary catheters are employed to assess the urine output accurately. However, as with any external device inserted into the body, the use of these catheters carries a significant risk of infection. In this paper, we present a non-invasive method to measure the fill rate of the bladder, and thus the rate of renal clearance, via an external bioimpedance sensor system to avoid the use of urinary catheters, thereby eliminating the risk of infections and improving patient comfort. We design a 4-electrode front-end and the complete wearable, wireless system with low power and accuracy in mind. The results demonstrate the accuracy of the sensors and a low power consumption of only 80 µW when duty-cycled at one acquisition every 5 minutes, which makes this battery-operated wearable device suitable for long-term monitoring.</em></td> </tr> <tr> <td style="width:40px;">IP1-19</td> <td><b>INFINIWOLF: ENERGY EFFICIENT SMART BRACELET FOR EDGE COMPUTING WITH DUAL SOURCE ENERGY HARVESTING</b><br /><b>Speaker</b>:<br />Michele Magno, ETH Zurich, CH<br /><b>Authors</b>:<br />Michele Magno<sup>1</sup>, Xiaying Wang<sup>1</sup>, Manuel Eggimann<sup>1</sup>, Lukas Cavigelli<sup>1</sup> and Luca Benini<sup>2</sup><br /><sup>1</sup>ETH Zurich, CH; <sup>2</sup>Università di Bologna, IT<br /><em><b>Abstract</b><br />This work presents InfiniWolf, a novel multi-sensor smartwatch that can achieve self-sustainability by exploiting thermal and solar energy harvesting while performing computationally demanding tasks. The smartwatch embeds both a System-on-Chip (SoC) with an ARM Cortex-M processor and Bluetooth Low Energy (BLE) and Mr. Wolf, an open-hardware RISC-V based parallel ultra-low-power processor that boosts the on-board processing capabilities by more than one order of magnitude, while also increasing energy efficiency. We demonstrate its functionality based on a sample application scenario performing stress detection with multi-layer artificial neural networks on a wearable multi-sensor bracelet. Experimental results show the benefits in terms of energy efficiency and latency of Mr. Wolf over an ARM Cortex-M4F micro-controller and the possibility, under specific assumptions, of being self-sustainable using thermal and solar energy harvesting while performing up to 24 stress classifications per minute in indoor conditions.</em></td> </tr> </table> <hr /> <h2 id="4.1">4.1 Hardware-enabled security</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 17:00 - 18:30<br /><b>Location / Room:</b> Amphithéâtre Jean Prouve</p> <p><b>Chair:</b><br />Marchand Cedric, Ecole Centrale de Lyon, FR</p> <p><b>Co-Chair:</b><br />Cambou Bertrand, Northern Arizona University, US</p> <p>This session covers solutions in hardware-based design to improve security. 
The papers in the session propose an NTT (Number Theoretic Transform) technique enabling faster polynomial multiplication, a reliable PUF-based key generation architecture, and a technique for estimating circuit de-obfuscation runtime. Post-quantum cryptography and new attacks will also be discussed in this session.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>4.1.1</td> <td><b>A FLEXIBLE AND SCALABLE NTT HARDWARE: APPLICATIONS FROM HOMOMORPHICALLY ENCRYPTED DEEP LEARNING TO POST-QUANTUM CRYPTOGRAPHY</b><br /><b>Speaker</b>:<br />Erkay Savas, Sabanci University, TR<br /><b>Authors</b>:<br />Ahmet Can Mert<sup>1</sup>, Emre Karabulut<sup>2</sup>, Erdinc Ozturk<sup>1</sup>, Erkay Savas<sup>1</sup>, Michela Becchi<sup>2</sup> and Aydin Aysu<sup>2</sup><br /><sup>1</sup>Sabanci University, TR; <sup>2</sup>North Carolina State University, US<br /><em><b>Abstract</b><br />The Number Theoretic Transform (NTT) enables faster polynomial multiplication and is becoming a fundamental component of next-generation cryptographic systems. NTT hardware designs have two prevalent problems related to design-time flexibility. First, algorithms have different arithmetic structures, causing the hardware designs to be manually tuned for each setting. Second, applications have diverse throughput/area needs, but the hardware has been designed for a fixed, pre-defined number of processing elements. This paper proposes a parametric NTT hardware generator that takes arithmetic configurations and the number of processing elements as inputs to produce efficient hardware with the desired parameters and throughput. We illustrate the employment of the proposed design in two applications with different needs: a homomorphically encrypted deep neural network inference (CryptoNets) and a post-quantum digital signature scheme (qTESLA). We propose the first NTT hardware acceleration for both applications on FPGAs. Compared to prior software and high-level synthesis solutions, the results show that our hardware can accelerate NTT by up to 28x and 48x, respectively. Therefore, our work paves the way for high-level, automated, and modular design of next-generation cryptographic hardware solutions.</em></td> </tr> <tr> <td>17:30</td> <td>4.1.2</td> <td><b>RELIABLE AND LIGHTWEIGHT PUF-BASED KEY GENERATION USING VARIOUS INDEX VOTING ARCHITECTURE</b><br /><b>Speaker</b>:<br />Jeong-Hyeon Kim, Sungkyunkwan University, KR<br /><b>Authors</b>:<br />Jeong-Hyeon Kim<sup>1</sup>, Ho-Jun Jo<sup>1</sup>, Kyung-kuk Jo<sup>1</sup>, Sunghee Cho<sup>1</sup>, Jaeyong Chung<sup>2</sup> and Joon-Sung Yang<sup>1</sup><br /><sup>1</sup>Sungkyunkwan University, KR; <sup>2</sup>Incheon National University, KR<br /><em><b>Abstract</b><br />Physical Unclonable Functions (PUFs) can be utilized for secret key generation in security applications. Since the inherent randomness of a PUF can degrade its reliability, most existing PUF architectures add post-processing logic, such as error correction functions, to guarantee reliability. However, these structures incur a high cost in terms of implementation area and power consumption. This paper introduces a Various Index Voting Architecture (VIVA) that can enhance reliability with a low overhead compared to the conventional schemes. 
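<p>For intuition, index-based reliability enhancement can be sketched as follows (a toy model constructed purely for illustration, with assumed per-cell stability probabilities, enrollment of the most stable indices, and majority voting over repeated reads; VIVA's actual index generation logic differs):</p> <pre><code>import random

random.seed(1)
N, KEY_BITS, READS = 64, 8, 5
stable_p = [random.choice([0.99, 0.7]) for _ in range(N)]  # per-cell stability
nominal = [random.randint(0, 1) for _ in range(N)]         # noise-free response

def read_puf():
    # each cell flips with probability (1 - p)
    return [1 - b if random.random() > p else b
            for b, p in zip(nominal, stable_p)]

# Enrollment: record the indices of the most stable cells (helper data).
indices = sorted(range(N), key=lambda i: -stable_p[i])[:KEY_BITS]

# Reconstruction: majority vote over repeated reads at the chosen indices.
votes = [sum(read_puf()[i] for _ in range(READS)) for i in indices]
key = [1 if v > READS // 2 else 0 for v in votes]
print(key == [nominal[i] for i in indices])   # True with high probability
</code></pre>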
The proposed architecture is based on an index-based scheme with simple computation logic units and iterative operations that generate multiple indices for accurate key generation. Our evaluation results show that the proposed architecture reduces the hardware implementation overhead by 2x to more than 5x without degrading the key generation failure probability compared to conventional approaches.</em></td> </tr> <tr> <td>18:00</td> <td>4.1.3</td> <td><b>ESTIMATING THE CIRCUIT DE-OBFUSCATION RUNTIME BASED ON GRAPH DEEP LEARNING</b><br /><b>Speaker</b>:<br />Sai Manoj Pudukotai Dinakarrao, George Mason University, US<br /><b>Authors</b>:<br />Zhiqian Chen<sup>1</sup>, Gaurav Kolhe<sup>2</sup>, Setareh Rafatirad<sup>2</sup>, Chang-Tien Lu<sup>1</sup>, Sai Manoj Pudukotai Dinakarrao<sup>2</sup>, Houman Homayoun<sup>2</sup> and Liang Zhao<sup>2</sup><br /><sup>1</sup>Virginia Tech, US; <sup>2</sup>George Mason University, US<br /><em><b>Abstract</b><br />Circuit obfuscation has been proposed to protect digital integrated circuits (ICs) from security threats such as reverse engineering by introducing ambiguity into the circuit, i.e., adding logic gates whose functionality cannot be easily determined by an attacker. To defeat such defenses, techniques such as Boolean satisfiability (SAT)-based attacks were introduced. The SAT attack can potentially decrypt obfuscated circuits. However, the deobfuscation runtime can span a large range, from a few milliseconds to a few years or more, depending on the number and location of the obfuscated gates, the topology of the obfuscated circuit and the obfuscation technique used. To ensure the security of the deployed obfuscation mechanism, it is essential to accurately pre-estimate the deobfuscation time; thereby one can optimize the deployed defense in order to maximize the deobfuscation runtime. However, estimating the deobfuscation runtime is a challenging task due to 1) the complexity and heterogeneity of the graph-structured circuit, 2) the unknown and sophisticated mechanisms of the attackers for deobfuscation, and 3) efficiency and scalability requirements in practice. To address the challenges mentioned above, this work proposes the first machine-learning framework that predicts the deobfuscation runtime based on graph deep learning. Specifically, we design a new model, ICNet, with new input and convolution layers to characterize the circuit's topology, which are then integrated by composite deep fully-connected layers to obtain the deobfuscation runtime. The proposed ICNet is an end-to-end framework that can automatically extract the determinant features required for deobfuscation runtime prediction. Extensive experiments on standard benchmarks demonstrate its effectiveness and efficiency beyond many competitive baselines.</em></td> </tr> <tr> <td style="width:40px;">18:30</td> <td><a href="/date20/conference/session/IP2">IP2-1</a>, 908</td> <td><b>SAMPLING FROM DISCRETE DISTRIBUTIONS IN COMBINATIONAL HARDWARE WITH APPLICATION TO POST-QUANTUM CRYPTOGRAPHY</b><br /><b>Speaker</b>:<br />Michael Lyons, George Mason University, US<br /><b>Authors</b>:<br />Michael Lyons and Kris Gaj, George Mason University, US<br /><em><b>Abstract</b><br />Random values from discrete distributions are typically generated from uniformly-random samples. 
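<p>One common baseline, inversion sampling via a cumulative distribution table (CDT), fits in a few lines of Python (our toy sketch with an invented four-point distribution; hardware versions use fixed-point tables rather than floats):</p> <pre><code>import bisect
import random

# Toy CDT inversion sampler for a four-point distribution (invented values).
probs = [0.5, 0.25, 0.15, 0.10]
cdt, acc = [], 0.0
for p in probs:
    acc += p
    cdt.append(acc)                       # cumulative table: 0.5, 0.75, 0.9, 1.0

def sample():
    u = random.random()                   # uniformly-random input in [0, 1)
    return bisect.bisect_right(cdt, u)    # index of first CDF entry above u

counts = [0] * len(probs)
for _ in range(10000):
    counts[sample()] += 1
print(counts)                             # roughly proportional to probs
</code></pre>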
A common technique is to use a cumulative distribution table (CDT) lookup for inversion sampling, but it is also possible to use Boolean functions to map a uniformly-random bit sequence into a value from a discrete distribution. This work presents a methodology for deriving such functions for any discrete distribution, encoding them in VHDL for implementation in combinational hardware, and (for moderate precision and sample space size) confirming the correctness of the produced distribution. The process is demonstrated using a discrete Gaussian distribution with a small sample space, but it is applicable to any discrete distribution with fixed parameters. Results are presented for sampling schemes from several submissions to the NIST PQC standardization process, comparing this method to CDT lookups on a Xilinx Artix-7 FPGA. The process produces compact solutions for distributions up to moderate size and precision.</em></td> </tr> <tr> <td style="width:40px;">18:31</td> <td><a href="/date20/conference/session/IP2">IP2-2</a>, 472</td> <td><b>ON THE PERFORMANCE OF NON-PROFILED DIFFERENTIAL DEEP LEARNING ATTACKS AGAINST AN AES ENCRYPTION ALGORITHM PROTECTED USING A CORRELATED NOISE HIDING COUNTERMEASURE</b><br /><b>Speaker</b>:<br />Amir Alipour, Grenoble INP Esisar, IR<br /><b>Authors</b>:<br />Amir Alipour<sup>1</sup>, Athanasios Papadimitriou<sup>2</sup>, Vincent Beroulle<sup>3</sup>, Ehsan Aerabi<sup>3</sup> and David Hely<sup>3</sup><br /><sup>1</sup>University Grenoble Alpes, Grenoble INP ESISAR, LCIS Laboratory, FR; <sup>2</sup>University Grenoble Alpes, Grenoble INP ESISAR, ESYNOV, FR; <sup>3</sup>University Grenoble Alpes, Grenoble INP ESISAR, LSIC Laboratory, FR<br /><em><b>Abstract</b><br />Recent works in the field of cryptography focus on Deep Learning based Side Channel Analysis (DLSCA) as one of the most powerful attacks against common encryption algorithms such as AES. As a common case, profiling DLSCA has shown great capability in revealing secret cryptographic keys against the majority of AES implementations. In a very recent study, it has been shown that Deep Learning can be applied in a non-profiling way (non-profiling DLSCA), making this method considerably more practical and able to break powerful countermeasures for encryption algorithms such as AES, including masking countermeasures, while requiring considerably fewer power traces than a first-order CPA attack. In this work, our main goal is to apply non-profiling DLSCA against a hiding-based AES countermeasure which utilizes correlated noise generation so as to hide the secret encryption key. We show that this AES, with correlated noise generation as a lightweight countermeasure, can provide equivalent protection under CPA and under non-profiling DLSCA attacks, in terms of the power traces required to obtain the secret key.</em></td> </tr> <tr> <td>18:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="4.2">4.2 Timing in System-Level Modeling and Simulation</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 17:00 - 18:30<br /><b>Location / Room:</b> Chamrousse</p> <p><b>Chair:</b><br />Jorn Janneck, Lund University, SE</p> <p><b>Co-Chair:</b><br />Gianluca Palermo, Politecnico di Milano, IT</p> <p>Given the importance of time in specifying and modeling systems, this session presents three contributions at different levels of abstraction, from transaction level to system level. 
While the first two contributions provide fast and accurate simulation models for DRAM memories and analog/mixed-signal systems, the last one models uncertainties at a higher level for reasoning and formal verification purposes.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>4.2.1</td> <td><b>FAST AND ACCURATE DRAM SIMULATION: CAN WE FURTHER ACCELERATE IT?</b><br /><b>Speaker</b>:<br />Matthias Jung, Fraunhofer IESE, DE<br /><b>Authors</b>:<br />Johannes Feldmann<sup>1</sup>, Matthias Jung<sup>2</sup>, Kira Kraft<sup>1</sup>, Lukas Steiner<sup>1</sup> and Norbert Wehn<sup>1</sup><br /><sup>1</sup>University of Kaiserslautern, DE; <sup>2</sup>Fraunhofer IESE, DE<br /><em><b>Abstract</b><br />Virtual platforms are state-of-the-art for design space exploration and simulation of today's complex Systems on Chip (SoCs). The challenge for these virtual platforms is to find the right trade-off between speed and accuracy. For the simulation of Dynamic Random Access Memories (DRAMs), which have complex timing and power behavior, high-accuracy models are needed. However, cycle-accurate DRAM models consume a large share of the overall simulation time. Therefore, it is important to accelerate the DRAM simulation models while maintaining accuracy. In the literature, different approaches to accelerate DRAM simulation in virtual platforms already exist. This paper proposes two new performance-optimized DRAM models that further accelerate the simulation speed with only a negligible degradation in accuracy. The first model is an enhanced Transaction Level Model (TLM), which uses a look-up table to accelerate simulation parts with high bandwidth usage in online scenarios. The other is a neural-network-based simulator for offline trace analysis. We show a mathematical methodology to generate the inputs for the look-up table and the optimal artificial training set for the neural network. The TLM model is up to 5 times faster than a state-of-the-art TLM DRAM simulator. The neural network achieves a speedup of up to 10x while inferring on a GPU. Both solutions incur only a slight decrease in accuracy of approximately 5%.</em></td> </tr> <tr> <td>17:30</td> <td>4.2.2</td> <td><b>ACCURATE AND EFFICIENT CONTINUOUS TIME AND DISCRETE EVENTS SIMULATION IN SYSTEMC</b><br /><b>Speaker</b>:<br />Breytner Fernandez-Mesa, TIMA Laboratory, University Grenoble Alpes, FR<br /><b>Authors</b>:<br />Breytner Fernandez-Mesa, Liliana Andrade and Frédéric Pétrot, TIMA Lab, Université Grenoble Alpes, FR<br /><em><b>Abstract</b><br />The AMS extensions of SystemC emerged to aid the virtual prototyping of continuous-time and discrete-event heterogeneous systems. Although useful for a large set of use cases, synchronization of both domains through a fixed timestep generates inaccuracies that cannot be overcome without penalizing simulation speed. We propose a direct, optimistic, and causal synchronization algorithm on top of the SystemC kernel that explicitly handles the rich set of interactions that occur at the domain interface. We test our algorithm with a complex nonlinear automotive use case and show that it breaks the described accuracy and efficiency trade-off. 
Our work enlarges the applicability range of SystemC AMS-based design frameworks.</em></td> </tr> <tr> <td>18:00</td> <td>4.2.3</td> <td><b>MODELING AND VERIFYING UNCERTAINTY-AWARE TIMING BEHAVIORS USING PARAMETRIC LOGICAL TIME CONSTRAINTS</b><br /><b>Speaker</b>:<br />Fei Gao, Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, CN<br /><b>Authors</b>:<br />Fei Gao<sup>1</sup>, Frederic Mallet<sup>2</sup>, Min Zhang<sup>1</sup> and Mingsong Chen<sup>3</sup><br /><sup>1</sup>East China Normal University, CN; <sup>2</sup>Université Côte d'Azur, CNRS, Inria, I3S, FR; <sup>3</sup>East China Normal University, CN<br /><em><b>Abstract</b><br />The Clock Constraint Specification Language (CCSL) is a logical-time-based modeling language used to formalize the timing behaviors of real-time and embedded systems. However, it cannot capture timing behaviors that contain uncertainties, e.g., uncertainty in execution time and period. This limits the application of the language to real-world systems, as uncertainty often exists in practice due to both internal and external factors. To capture uncertainties in timing behaviors, in this paper we extend CCSL by introducing parameters into constraints. We then propose an approach to transform parametric CCSL constraints into SMT formulas for efficient verification. We apply our approach to an industrial case proposed as the FMTV (Formal Methods for Timing Verification) Challenge in 2015, showing that timing behaviors with uncertainties can be effectively modeled and verified using parametric CCSL.</em></td> </tr> <tr> <td style="width:40px;">18:30</td> <td><a href="/date20/conference/session/IP2">IP2-3</a>, 831</td> <td><b>FAST AND ACCURATE PERFORMANCE EVALUATION FOR RISC-V USING VIRTUAL PROTOTYPES</b><br /><b>Speaker</b>:<br />Vladimir Herdt, Universität Bremen, DE<br /><b>Authors</b>:<br />Vladimir Herdt<sup>1</sup>, Daniel Grosse<sup>2</sup> and Rolf Drechsler<sup>2</sup><br /><sup>1</sup>Universität Bremen, DE; <sup>2</sup>Universität Bremen / DFKI GmbH, DE<br /><em><b>Abstract</b><br />RISC-V is gaining huge popularity, in particular for embedded systems. Recently, a SystemC-based Virtual Prototype (VP) was open-sourced to lay the foundation for providing support for system-level use cases such as design space exploration, analysis of complex HW/SW interactions and power/timing/performance validation for RISC-V based systems. In this paper, we propose an efficient core timing model and integrate it into the VP core to enable fast and accurate performance evaluation for RISC-V based systems. As a case study, we provide a timing configuration matching the RISC-V HiFive1 board from SiFive. 
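<p>A core timing model of this kind can be thought of as a per-instruction cycle accountant; the Python sketch below is our simplification with invented latencies (the actual model is calibrated against the HiFive1 and is considerably more detailed):</p> <pre><code># Toy core timing model: the VP calls `account` for every retired instruction
# and adds a pipeline-stall penalty when a load result is used immediately.
# All latencies below are invented for illustration.
LATENCY = {"alu": 1, "load": 2, "store": 1, "branch_taken": 3, "mul": 5}

class TimingModel:
    def __init__(self):
        self.cycles = 0
        self.last_load_dst = None

    def account(self, kind, dst=None, srcs=()):
        if self.last_load_dst is not None and self.last_load_dst in srcs:
            self.cycles += 1                  # load-use hazard stall
        self.cycles += LATENCY[kind]
        self.last_load_dst = dst if kind == "load" else None

tm = TimingModel()
for kind, dst, srcs in [("load", "x5", ()), ("alu", "x6", ("x5",)),
                        ("mul", "x7", ("x6", "x6")), ("branch_taken", None, ("x7",))]:
    tm.account(kind, dst, srcs)
print("estimated cycles:", tm.cycles)         # 2 + (1 stall + 1) + 5 + 3 = 12
</code></pre>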
Our experiments demonstrate that our approach allows us to obtain very accurate performance evaluation results while still retaining high simulation performance.</em></td> </tr> <tr> <td style="width:40px;">18:31</td> <td><a href="/date20/conference/session/IP2">IP2-4</a>, 641</td> <td><b>AUTOMATED GENERATION OF LTL SPECIFICATIONS FOR SMART HOME IOT USING NATURAL LANGUAGE</b><br /><b>Speaker</b>:<br />Shiyu Zhang, State Key Laboratory of Novel Software Technology, Department of Computer Science and Technology, Nanjing University, CN<br /><b>Authors</b>:<br />Shiyu Zhang<sup>1</sup>, Juan Zhai<sup>1</sup>, Lei Bu<sup>1</sup>, Mingsong Chen<sup>2</sup>, Linzhang Wang<sup>1</sup> and Xuandong Li<sup>1</sup><br /><sup>1</sup>Nanjing University, CN; <sup>2</sup>East China Normal University, CN<br /><em><b>Abstract</b><br />Ordinary, inexperienced users can easily build their own smart home IoT systems nowadays, but such user-customized systems could be error-prone. Using formal verification to prove the correctness of such systems is necessary. However, to conduct a formal proof, formal specifications such as Linear Temporal Logic (LTL) formulas have to be provided, and ordinary users cannot author LTL formulas, only natural language. To address this problem, this paper presents a novel approach that can automatically generate formal LTL specifications from natural language requirements based on domain knowledge and our proposed ambiguity-refining techniques. Experimental results show that our approach can achieve a high correctness rate of 95.4% in converting natural language sentences into LTL formulas from 481 requirements of real examples.</em></td> </tr> <tr> <td>18:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="4.3">4.3 Special Session: Architectures for Emerging Technologies</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 17:00 - 18:30<br /><b>Location / Room:</b> Autrans</p> <p><b>Chair:</b><br />Pierre-Emmanuel Gaillardon, University of Utah, US</p> <p><b>Co-Chair:</b><br />Michael Niemier, University of Notre Dame, US</p> <p>The past five decades have witnessed transformations happening at an ever-growing pace thanks to the sustained increase in the capabilities of electronic systems. We are now at the dawn of a new revolution in which emerging technologies, i.e., those beyond silicon complementary metal-oxide-semiconductor (CMOS) devices, will further revolutionize the way we design electronics. In this hot-topic session, we intend to elaborate on the architectural opportunities and challenges brought by non-standard semiconductor technologies. 
In addition to providing new perspectives to the DATE community beyond the currently hot novel architectures, such as neuromorphic or in-memory computing, this session also serves to tighten the link between DATE and the EDA community at large through the mission and roles of the IEEE Rebooting Computing Initiative - <a href="https://rebootingcomputing.ieee.org" title="https://rebootingcomputing.ieee.org">https://rebootingcomputing.ieee.org</a>.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>4.3.1</td> <td><b>CRYO-CMOS INTERFACES FOR A SCALABLE QUANTUM COMPUTER</b><br /><b>Authors</b>:<br />Edoardo Charbon<sup>1</sup>, Andrei Vladimirescu<sup>2</sup>, Fabio Sebastiano<sup>3</sup> and Masoud Babaie<sup>3</sup><br /><sup>1</sup>EPFL, CH; <sup>2</sup>University of California, Berkeley, US; <sup>3</sup>Delft University of Technology, NL</td> </tr> <tr> <td>17:15</td> <td>4.3.2</td> <td><b>THE N3XT 1,000X FOR THE COMING SUPERSTORM OF ABUNDANT DATA: CARBON NANOTUBE FETS, RESISTIVE RAM, MONOLITHIC 3D</b><br /><b>Authors</b>:<br />Gage Hills<sup>1</sup> and Mohamed M. Sabry<sup>2</sup><br /><sup>1</sup>MIT, US; <sup>2</sup>Nanyang Technological University, SG</td> </tr> <tr> <td>17:30</td> <td>4.3.3</td> <td><b>MULTIPLIER ARCHITECTURES: CHALLENGES AND OPPORTUNITIES WITH PLASMONIC-BASED LOGIC</b><br /><b>Speaker</b>:<br />Eleonora Testa, EPFL, CH<br /><b>Authors</b>:<br />Eleonora Testa<sup>1</sup>, Samantha Lubaba Noor<sup>2</sup>, Odysseas Zografos<sup>3</sup>, Mathias Soeken<sup>1</sup>, Francky Catthoor<sup>3</sup>, Azad Naeemi<sup>2</sup> and Giovanni De Micheli<sup>1</sup><br /><sup>1</sup>EPFL, CH; <sup>2</sup>Georgia Institute of Technology, US; <sup>3</sup>IMEC, BE</td> </tr> <tr> <td>17:45</td> <td>4.3.4</td> <td><b>QUANTUM COMPUTER ARCHITECTURE: TOWARDS FULL-STACK QUANTUM ACCELERATORS</b><br /><b>Speaker</b>:<br />Koen Bertels, Delft University of Technology, NL<br /><b>Authors</b>:<br />Koen Bertels, Aritra Sarkar, T. Hubregtsen, M. Serrao, Abid A. Mouedenne, A. Yadav, A. Krol, Imran Ashraf and Carmen G. Almudever, Delft University of Technology, NL</td> </tr> <tr> <td>18:00</td> <td>4.3.5</td> <td><b>UTILIZING BURIED POWER RAILS AND BACKSIDE PDN TO FURTHER CMOS SCALING BELOW 5NM NODES</b><br /><b>Authors</b>:<br />Odysseas Zografos, Sudhir Patli, Satadru Sarkar, Bilal Chehab, Doyoung Jang, Rogier Baert, Peter Debacker, Myung-Hee Na and Julien Ryckaert, IMEC, BE</td> </tr> <tr> <td>18:15</td> <td>4.3.6</td> <td><b>AN RRAM-BASED FPGA FOR ENERGY-EFFICIENT EDGE COMPUTING</b><br /><b>Authors</b>:<br />Xifan Tang, Ganesh Gore, Patsy Cadareanu, Edouard Giacomin and Pierre-Emmanuel Gaillardon, University of Utah, US</td> </tr> <tr> <td>18:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="4.4">4.4 Some run it hot, others do not</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 17:00 - 18:30<br /><b>Location / Room:</b> Stendhal</p> <p><b>Chair:</b><br />Pascal Vivet, CEA-Leti, FR</p> <p><b>Co-Chair:</b><br />Alberto Macii, Politecnico di Torino, IT</p> <p>Temperature management is a must-have in modern computing systems. The session presents a set of techniques for smart cooling systems, both active and proactive, and thermal control policies. 
The techniques presented are vertically applied to different components, such as computing and communication sub-systems, and use orthogonal modeling and optimization strategies, such as machine learning.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>4.4.1</td> <td><b>A LEARNING-BASED THERMAL SIMULATION FRAMEWORK FOR EMERGING TWO-PHASE COOLING TECHNOLOGIES</b><br /><b>Speaker</b>:<br />Ayse Coskun, Boston University, US<br /><b>Authors</b>:<br />Zihao Yuan<sup>1</sup>, Geoffrey Vaartstra<sup>2</sup>, Prachi Shukla<sup>1</sup>, Zhengmao Lu<sup>2</sup>, Evelyn Wang<sup>2</sup>, Sherief Reda<sup>3</sup> and Ayse Coskun<sup>1</sup><br /><sup>1</sup>Boston University, US; <sup>2</sup>MIT, US; <sup>3</sup>Brown University, US<br /><em><b>Abstract</b><br />Future high-performance chips will require new cooling technologies that can extract heat efficiently. Two-phase cooling is a promising processor cooling solution owing to its high heat transfer rate and potential benefits in cooling power. Two-phase cooling mechanisms, including microchannel-based two-phase cooling or two-phase vapor chambers (VCs), are typically modeled by computing the temperature-dependent heat transfer coefficient (HTC) of the evaporator or coolant using an iterative simulation framework. Precomputed HTC correlations are specific to a given cooling system design and cannot be applied to even the same cooling technology with different cooling parameters (such as different geometries). Another challenge is that HTC correlations are typically calculated with computational fluid dynamics (CFD) tools, which induce long design and simulation times. This paper introduces a learning-based temperature-dependent HTC simulation framework that is used to model a two-phase cooling solution with a wide range of cooling design parameters. In particular, the proposed framework includes a compact thermal model (CTM) of two-phase VCs with hybrid wick evaporators (of nanoporous membrane and microchannels). We build a new simulation tool to integrate the proposed simulation framework and CTM. We validate the proposed simulation framework as well as the new CTM through comparisons against a CFD model. Our simulation framework and CTM achieve a speedup of 21X with an average error of 0.98 °C (and a maximum error of 2.59 °C). We design an optimization flow for hybrid wicks to select the most beneficial nanoporous membrane and microchannel geometries. Our flow is capable of finding a geometry-coolant combination that results in a lower (or similar) maximum chip temperature compared to that of the best coolant-geometry pair selected by grid search, while providing a speedup of 9.4X.</em></td> </tr> <tr> <td>17:30</td> <td>4.4.2</td> <td><b>LIGHTWEIGHT THERMAL MONITORING IN OPTICAL NETWORKS-ON-CHIP VIA ROUTER REUSE</b><br /><b>Speaker</b>:<br />Mengquan Li, Nanyang Technological University, SG<br /><b>Authors</b>:<br />Mengquan Li<sup>1</sup>, Jun Zhou<sup>2</sup> and Weichen Liu<sup>2</sup><br /><sup>1</sup>Nanyang Technological University, CN; <sup>2</sup>Nanyang Technological University, SG<br /><em><b>Abstract</b><br />Optical network-on-chip (ONoC) is an emerging communication architecture for manycore systems due to its low latency, high bandwidth, and low power dissipation. 
However, a major concern lies in its thermal susceptibility -- under on-chip temperature variations, functional nanophotonic devices, especially microring resonator (MR)-based devices, suffer from significant thermal-induced optical power loss, which may counteract the power advantages of ONoCs and even cause functional failures. Considering the fact that temperature gradients are typically found on many-core systems, effective thermal monitoring, serving as the foundation of thermal-aware management, is critical on ONoCs. In this paper, a lightweight thermal monitoring scheme is proposed for ONoCs. We first design a temperature measurement module based on generic optical routers. It introduces trivial overheads in chip area by reusing the components in routers. A major problem with reusing optical routers is that it may potentially interfere with the normal communications in ONoCs. To address it, we then propose a time allocation strategy to schedule thermal sensing operations in the time intervals between communications. Evaluation results show that our scheme exhibits an untrimmed inaccuracy of 1.0070 K with a low energy consumption of 656.38 pJ/Sa. It occupies an extremely small area of 0.0020 mm^2, reducing the area cost by 83.74% on average compared to the state-of-the-art optical thermal sensor design.</em></td> </tr> <tr> <td>18:00</td> <td>4.4.3</td> <td><b>A SPECTRAL APPROACH TO SCALABLE VECTORLESS THERMAL INTEGRITY VERIFICATION</b><br /><b>Speaker</b>:<br />Zhuo Feng, Stevens Institute of Technology, US<br /><b>Authors</b>:<br />Zhiqiang Zhao<sup>1</sup> and Zhuo Feng<sup>2</sup><br /><sup>1</sup>Michigan Technological University, US; <sup>2</sup>Stevens Institute of Technology, US<br /><em><b>Abstract</b><br />Existing chip thermal analysis and verification methods require detailed distribution of power densities or modeling of underlying input workloads (vectors), which may not always be feasible at an early design stage. This paper introduces the first vectorless thermal integrity verification framework that allows computing worst-case temperature (gradient) distributions across the entire chip under a set of local and global workload (power density) constraints. To address the computational challenges introduced by the large 3D mesh-structured thermal grids, we propose a novel spectral approach for highly scalable vectorless thermal verification of large chip designs. Our approach is based on emerging spectral graph theory and graph signal processing techniques, and consists of a thermal grid topology sparsification phase, an edge weight scaling phase, and a solution refinement procedure. The effectiveness and efficiency of our approach have been demonstrated through extensive experiments.</em></td> </tr>
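<p><em>Independent of the spectral machinery above, the core vectorless question can be sketched as a small linear program: for each observation node, maximize a linear temperature response over all power assignments that satisfy the local and global power constraints. A minimal Python sketch using scipy.optimize.linprog, with a made-up 2x3 sensitivity matrix (illustrative assumptions throughout; this is not the paper's algorithm):</em></p> <pre>
# Vectorless verification in miniature: worst-case temperature rise
# T_i = sum_j H[i][j] * p_j over all power assignments p that respect
# per-block bounds and a global power budget -- a linear program.
from scipy.optimize import linprog

H = [[0.8, 0.3, 0.1],          # made-up power-to-temperature
     [0.2, 0.6, 0.4]]          # sensitivity matrix (2 nodes, 3 blocks)
p_max = [1.0, 1.0, 1.0]        # local (per-block) power bounds
p_total = 2.0                  # global power budget

for row in H:                  # maximize row.p  ==  minimize -row.p
    res = linprog(c=[-h for h in row],
                  A_ub=[[1.0, 1.0, 1.0]], b_ub=[p_total],
                  bounds=[(0.0, pm) for pm in p_max])
    print(-res.fun)            # worst-case temperature rise at this node
</pre>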
<tr> <td>18:15</td> <td>4.4.4</td> <td><b>DYNAMIC THERMAL MANAGEMENT WITH PROACTIVE FAN SPEED CONTROL THROUGH REINFORCEMENT LEARNING</b><br /><b>Speaker</b>:<br />Arman Iranfar, EPFL, CH<br /><b>Authors</b>:<br />Arman Iranfar<sup>1</sup>, Federico Terraneo<sup>2</sup>, Gabor Csordas<sup>3</sup>, Marina Zapater<sup>1</sup>, William Fornaciari<sup>4</sup> and David Atienza<sup>3</sup><br /><sup>1</sup>EPFL, CH; <sup>2</sup>Politecnico di Milano, IT; <sup>3</sup>École Polytechnique Fédérale de Lausanne, CH; <sup>4</sup>Politecnico di Milano - DEIB, IT<br /><em><b>Abstract</b><br />Dynamic Thermal Management (DTM) in submicron technology has become a major challenge since it directly affects the performance, power consumption, and lifetime reliability of Multiprocessor Systems-on-Chip (MPSoCs). For proper DTM, thermal simulators play a significant role as they allow chip temperature to be safely studied. Nonetheless, state-of-the-art thermal simulators do not support transient fan models. As a result, adaptive fan speed control, which is an important runtime parameter, cannot be well utilized in DTM. Therefore, in this work, we first propose and integrate a transient fan model into a state-of-the-art thermal simulator, enabling adaptive fan speed control simulation for efficient DTM. We then validate our simulation framework through a thermal test chip, achieving less than 2 °C error in the worst case. With multiple fan speeds, however, the DTM design space grows significantly, which can ultimately make conventional solutions, such as grid search, infeasible, impractical, or insufficient due to the large runtime overhead. Therefore, we address this challenge through a reinforcement learning-based solution to proactively determine the number of active cores, operating frequency, and fan speed. The proposed solution is able to reduce fan power by up to 40% compared to a DTM with constant fan speed, with less than 1% performance degradation. Also, compared to a state-of-the-art DTM technique, our solution improves performance by up to 19% for the same fan power.</em></td> </tr> <tr> <td style="width:40px;">18:30</td> <td><a href="/date20/conference/session/IP2">IP2-5</a>, 362</td> <td><b>A HEAT-RECIRCULATION-AWARE VM PLACEMENT STRATEGY FOR DATA CENTERS</b><br /><b>Authors</b>:<br />Hao Feng<sup>1</sup>, Yuhui Deng<sup>2</sup> and Yi Zhou<sup>3</sup><br /><sup>1</sup>Jinan University, CN; <sup>2</sup>Chinese Academy of Sciences; Jinan University, CN; <sup>3</sup>Columbus State University, US<br /><em><b>Abstract</b><br />Data centers consist of a great number of IT devices (e.g., servers and switches) that generate a massive amount of heat. Due to the special arrangement of racks in the data center, heat recirculation often occurs between nodes. It can cause a sharp rise in equipment temperature, coupled with local hot spots in data centers. Existing VM placement strategies can minimize the energy consumption of data centers by optimizing resource allocation in terms of multiple physical resources (e.g., memory, bandwidth, and CPU). However, existing strategies ignore the role of heat recirculation in the data center. To address this problem, in this study, we propose a heat-recirculation-aware VM placement strategy and design a Simulated Annealing Based Algorithm (SABA) to lower the energy consumption of data centers. Different from the existing SA algorithm, SABA optimizes the distribution of the initial solution and the iteration scheme. We quantitatively evaluate SABA's performance in terms of algorithm efficiency, the number of activated servers, and energy savings against the XINT-GA algorithm (a thermal-aware task scheduling strategy), FCFS (First-Come First-Served), and SA. Experimental results indicate that our heat-recirculation-aware VM placement strategy provides a powerful solution for improving the energy efficiency of data centers.</em></td> </tr>
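<p><em>For readers unfamiliar with the underlying metaheuristic, here is a generic simulated-annealing skeleton of the kind SABA specializes, shown on a toy VM-placement cost (illustrative assumptions only; SABA's actual initialization and iteration scheme differ, as the abstract notes):</em></p> <pre>
import math
import random

def simulated_annealing(initial, cost, neighbor,
                        t0=100.0, t_min=1e-3, alpha=0.95, iters=50):
    # Generic SA loop: accept a worse neighbor with probability
    # exp(-delta/T) so the search can escape local minima.
    state, best, t = initial, initial, t0
    while t > t_min:
        for _ in range(iters):
            cand = neighbor(state)
            delta = cost(cand) - cost(state)
            if delta < 0 or random.random() < math.exp(-delta / t):
                state = cand
                if cost(state) < cost(best):
                    best = state
        t *= alpha  # geometric cooling schedule
    return best

# Toy usage: place 8 VMs on 4 servers, penalizing overloaded (hot) servers.
def toy_cost(placement):
    load = [0, 0, 0, 0]
    for server in placement:
        load[server] += 1
    return sum(l * l for l in load)

def toy_neighbor(placement):
    p = list(placement)
    p[random.randrange(len(p))] = random.randrange(4)
    return p

print(simulated_annealing([0] * 8, toy_cost, toy_neighbor))
</pre>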
<tr> <td style="width:40px;">18:31</td> <td><a href="/date20/conference/session/IP2">IP2-6</a>, 826</td> <td><b>ENERGY OPTIMIZATION IN NCFET-BASED PROCESSORS</b><br /><b>Authors</b>:<br />Sami Salamin<sup>1</sup>, Martin Rapp<sup>1</sup>, Hussam Amrouch<sup>1</sup>, Andreas Gerstlauer<sup>2</sup> and Joerg Henkel<sup>1</sup><br /><sup>1</sup>Karlsruhe Institute of Technology, DE; <sup>2</sup>University of Texas, Austin, US<br /><em><b>Abstract</b><br />Energy consumption is a key optimization goal for all modern processors. Negative Capacitance Field-Effect Transistors (NCFETs) are a leading emerging technology that promises outstanding performance in addition to better energy efficiency. The thickness of the additional ferroelectric layer, frequency, and voltage are the key parameters in NCFET technology that impact the power and frequency of processors. However, their joint impact on energy optimization has not been investigated yet. In this work, we are the first to demonstrate that conventional (i.e., NCFET-unaware) dynamic voltage/frequency scaling (DVFS) techniques to minimize energy are sub-optimal when applied to NCFET-based processors. We further demonstrate that state-of-the-art NCFET-aware voltage scaling for power minimization is also sub-optimal when it comes to energy. This work provides the first NCFET-aware DVFS technique that optimizes the processor's energy through optimal runtime frequency/voltage selection. In NCFETs, energy-optimal frequency and voltage are dependent on the workload and technology parameters. Our NCFET-aware DVFS technique considers these effects to perform optimal voltage/frequency selection at runtime depending on workload characteristics. Results show up to 90% energy savings compared to conventional DVFS techniques. Compared to state-of-the-art NCFET-aware power management, our technique provides up to 72% energy savings along with 3.7x higher performance.</em></td> </tr> <tr> <td>18:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="4.5">4.5 Adaptation and optimization for real-time systems</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 17:00 - 18:30<br /><b>Location / Room:</b> Bayard</p> <p><b>Chair:</b><br />Wanli Chang, University of York, GB</p> <p><b>Co-Chair:</b><br />Emmanuel Grolleau, ENSMA, FR</p> <p>This session presents novel techniques for systems requiring adaptations. 
The papers in this session cover monitoring techniques to increase reactivity, consider weakly-hard constraints, extend previous cache-persistence analyses from one core to several cores, and model data chains while ensuring latency bounds.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>4.5.1</td> <td><b>RELIABLE AND ENERGY-AWARE FIXED-PRIORITY (M,K)-DEADLINES ENFORCEMENT WITH STANDBY-SPARING</b><br /><b>Speaker</b>:<br />Linwei Niu, West Virginia State University, US<br /><b>Authors</b>:<br />Linwei Niu<sup>1</sup> and Dakai Zhu<sup>2</sup><br /><sup>1</sup>West Virginia State University, US; <sup>2</sup>University of Texas at San Antonio, US<br /><em><b>Abstract</b><br />For real-time computing systems, energy efficiency, Quality of Service, and fault tolerance are among the major design concerns. In this work, we study the problem of reliable and energy-aware fixed-priority (m,k)-deadlines enforcement with standby-sparing. Standby-sparing systems adopt a primary processor and a spare processor to provide fault tolerance for both permanent and transient faults. In order to reduce energy consumption for such systems, we propose a novel scheduling scheme under the QoS constraint of (m,k)-deadlines. The evaluation results demonstrate that our proposed approach significantly outperformed previous research in energy conservation while assuring (m,k)-deadlines and fault tolerance for real-time systems.</em></td> </tr> <tr> <td>17:30</td> <td>4.5.2</td> <td><b>PERIOD ADAPTATION FOR CONTINUOUS SECURITY MONITORING IN MULTICORE REAL-TIME SYSTEMS</b><br /><b>Speaker</b>:<br />Monowar Hasan, University of Illinois at Urbana-Champaign, US<br /><b>Authors</b>:<br />Monowar Hasan<sup>1</sup>, Sibin Mohan<sup>2</sup>, Rodolfo Pellizzoni<sup>3</sup> and Rakesh Bobba<sup>4</sup><br /><sup>1</sup>University of Illinois at Urbana-Champaign, US; <sup>2</sup>University of Illinois at Urbana-Champaign (UIUC), US; <sup>3</sup>University of Waterloo, CA; <sup>4</sup>Oregon State University, US<br /><em><b>Abstract</b><br />We propose HYDRA-C, a design-time evaluation framework for integrating monitoring mechanisms in multicore real-time systems (RTS). Our goal is to ensure that security (or other monitoring) mechanisms execute in a "continuous" manner -- i.e., as often as possible, across cores. This is to ensure that any such mechanisms run with few interruptions, if any. HYDRA-C is intended to allow designers of RTS to integrate monitoring mechanisms without perturbing existing timing properties or execution orders. We demonstrate the framework using a proof-of-concept implementation with intrusion detection mechanisms as security tasks. We develop and use both (a) a custom intrusion detection system (IDS) and (b) Tripwire, an open-source data integrity checking tool. 
We compare the performance of HYDRA-C with a state-of-the-art multicore RT security integration approach and find that our method does not impact schedulability and, on average, can detect intrusions 19.05% faster without impacting the performance of RT tasks.</em></td> </tr> <tr> <td>18:00</td> <td>4.5.3</td> <td><b>EFFICIENT LATENCY BOUND ANALYSIS FOR DATA CHAINS OF REAL-TIME TASKS IN MULTIPROCESSOR SYSTEMS</b><br /><b>Speaker</b>:<br />Jiankang Ren, Dalian University of Technology, CN<br /><b>Authors</b>:<br />Jiankang Ren<sup>1</sup>, Xin He<sup>1</sup>, Junlong Zhou<sup>2</sup>, Hongwei Ge<sup>1</sup>, Guowei Wu<sup>1</sup> and Guozhen Tan<sup>1</sup><br /><sup>1</sup>Dalian University of Technology, CN; <sup>2</sup>Nanjing University of Science and Technology, CN<br /><em><b>Abstract</b><br />End-to-end latency analysis is one of the key problems in automotive embedded system design. In this paper, we propose an efficient worst-case end-to-end latency analysis method for data chains of periodic real-time tasks executed on multiprocessors under a partitioned fixed-priority preemptive scheduling policy. The key idea of this research is to improve the analysis efficiency by transforming the problem of bounding the worst-case latency of the data chain to a problem of bounding the releasing interval of data propagation instances for each pair of consecutive tasks in the chain. In particular, we derive an upper bound on the releasing interval of successive data propagation instances to yield the desired data chain latency bound by a simple accumulation. Based on the above idea, we present an efficient latency upper bound analysis algorithm with polynomial time complexity. Experiments with randomly generated task sets based on a generic automotive benchmark show that our proposed approach can obtain a relatively tighter data chain latency upper bound with lower computational cost.</em></td> </tr>
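<p><em>The accumulation step is easy to picture with hypothetical numbers (ours, not the paper's): for a chain tau_1 -> tau_2 -> tau_3, the latency bound is the sum of the per-pair releasing-interval bounds I:</em></p> <pre>
L(tau_1 -> tau_2 -> tau_3)  <=  I(tau_1, tau_2) + I(tau_2, tau_3)
e.g.  I(tau_1, tau_2) = 4 ms,  I(tau_2, tau_3) = 7 ms   =>   L <= 11 ms
</pre>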
<tr> <td>18:15</td> <td>4.5.4</td> <td><b>CACHE PERSISTENCE-AWARE MEMORY BUS CONTENTION ANALYSIS FOR MULTICORE SYSTEMS</b><br /><b>Speaker</b>:<br />Syed Aftab Rashid, Polytechnic Institute of Porto, PT<br /><b>Authors</b>:<br />Syed Aftab Rashid<sup>1</sup>, Geoffrey Nelissen<sup>1</sup> and Eduardo Tovar<sup>2</sup><br /><sup>1</sup>Polytechnic Institute of Porto, PT; <sup>2</sup>CISTER/INESC-TEC, PT<br /><em><b>Abstract</b><br />Memory bus contention strongly relates to the number of main memory requests generated by tasks running on different cores of a multicore platform, which, in turn, depends on the content of the cache memories during the execution of those tasks. Recent works have shown that due to cache persistence the memory access demand of multiple jobs of a task may not always be equal to its worst-case memory access demand in isolation. Analysis of the variable memory access demand of tasks due to cache persistence leads to significantly tighter worst-case response time (WCRT) of tasks. In this work, we show how the notion of cache persistence can be extended from single-core to multicore systems. In particular, we focus on analyzing the impact of cache persistence on the memory bus contention suffered by tasks executing on a multicore platform considering both work conserving and non-work conserving bus arbitration policies. Experimental evaluation shows that cache persistence-aware analyses of bus arbitration policies increase the number of task sets deemed schedulable by up to 70 percentage points in comparison to their respective counterparts that do not account for cache persistence.</em></td> </tr> <tr> <td style="width:40px;">18:30</td> <td><a href="/date20/conference/session/IP2">IP2-7</a>, 934</td> <td><b>TOWARDS A MODEL-BASED MULTI-OBJECTIVE OPTIMIZATION APPROACH FOR SAFETY-CRITICAL REAL-TIME SYSTEMS</b><br /><b>Speaker</b>:<br />Emmanuel Grolleau, LIAS / ISAE-ENSMA, FR<br /><b>Authors</b>:<br />Soulimane Kamni<sup>1</sup>, Yassine OUHAMMOU<sup>2</sup>, Antoine Bertout<sup>3</sup> and Emmanuel Grolleau<sup>4</sup><br /><sup>1</sup>LIAS/ENSMA, FR; <sup>2</sup>LIAS / ISAE-ENSMA, FR; <sup>3</sup>LIAS, Université de Poitiers, ISAE-ENSMA, FR; <sup>4</sup>LIAS, ISAE-ENSMA, Universite de Poitiers, FR<br /><em><b>Abstract</b><br />In the safety-critical real-time systems domain, obtaining an appropriate operational model that meets the temporal (e.g., deadline) and business (e.g., redundancy) requirements while being optimal in terms of several metrics is a primordial process in the design life-cycle. Recently, several research efforts have proposed exploring cross-domain trade-offs for higher behavioural performance. Indeed, this process represents the first step in the deployment phase, which is very sensitive because it can be error-prone and time-consuming. This paper is a work in progress proposing an approach that aims to help real-time system architects benefit from existing works, overcome their limits, and capitalize on prior efforts. Furthermore, the approach is based on the model-driven engineering paradigm and suggests easing the usage of methods and tools through repositories that gather them as a form of shared knowledge.</em></td> </tr> <tr> <td>18:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="4.6">4.6 Future computing fabrics: security and design integration</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 17:00 - 18:30<br /><b>Location / Room:</b> Lesdiguières</p> <p><b>Chair:</b><br />Elena Gnani, Università di Bologna, IT</p> <p><b>Co-Chair:</b><br />Subhasish Mitra, Stanford University, US</p> <p>Emerging technologies always promise to achieve computational and resource efficiency. 
This session addresses various aspects of efficiency in the context of security and future computing fabrics: a unique challenge at the intersection of hardware security and machine learning, fully front-end compatible CAD frameworks to enable access to floating-gate memristive devices, and current recycling in superconducting circuits.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>4.6.1</td> <td><b>SECURITY ENHANCEMENT FOR RRAM COMPUTING SYSTEM THROUGH OBFUSCATING CROSSBAR ROW CONNECTIONS</b><br /><b>Speaker</b>:<br />Minhui Zou, Nanjing University of Science and Technology, CN<br /><b>Authors</b>:<br />Minhui Zou<sup>1</sup>, Zhenhua Zhu<sup>2</sup>, Yi Cai<sup>2</sup>, Junlong Zhou<sup>1</sup>, Chengliang Wang<sup>3</sup> and Yu Wang<sup>2</sup><br /><sup>1</sup>Nanjing University of Science and Technology, CN; <sup>2</sup>Tsinghua University, CN; <sup>3</sup>Chongqing University, CN<br /><em><b>Abstract</b><br />Neural networks (NNs) have achieved great success in visual object recognition and natural language processing, but such data-intensive applications require huge data movements between computing units and memory. Emerging resistive random-access memory (RRAM) computing systems have demonstrated great potential in avoiding these huge data movements by performing matrix-vector multiplications in memory. However, the nonvolatility of the RRAM devices may lead to potential stealing of the NN weights stored in crossbars, and the adversary could extract the NN models from the stolen weights. This paper proposes an effective security enhancing method for RRAM computing systems to thwart this sort of piracy attack. We first analyze the theft methods of the NN weights. Then we propose an efficient security enhancing technique based on obfuscating the row connections between positive crossbars and their pairing negative crossbars. Two heuristic techniques are also presented to optimize the hardware overhead of the obfuscation module. Compared with existing NN security work, our method eliminates the additional RRAM writing operations used for encryption/decryption, without shortening the lifetime of RRAM computing systems. The experimental results show that the proposed methods ensure that a brute-force attack requires more than (16!)^17 trials, and that the classification accuracy of the incorrectly extracted NN models is less than 20%, with minimal area overhead.</em></td> </tr> <tr> <td>17:30</td> <td>4.6.2</td> <td><b>MODELING A FLOATING-GATE MEMRISTIVE DEVICE FOR COMPUTER AIDED DESIGN OF NEUROMORPHIC COMPUTING</b><br /><b>Speaker</b>:<br />Loai Danial, PhD student, Technion - Israel Institute of Technology, IL<br /><b>Authors</b>:<br />Loai Danial<sup>1</sup>, Vasu Gupta<sup>2</sup>, Evgeny Pikhay<sup>3</sup>, Yakov Roizin<sup>3</sup> and Shahar Kvatinsky<sup>4</sup><br /><sup>1</sup>PhD student - Technion, IL; <sup>2</sup>Technion, IN; <sup>3</sup>TowerJazz, IL; <sup>4</sup>Technion, IL<br /><em><b>Abstract</b><br />Memristive technology is still not mature enough for the very large-scale integration necessary to obtain practical value from neuromorphic systems. While nonvolatile floating-gate "synapse transistors" have been implemented in very large-scale integrated neuromorphic systems, their large footprint still constrains an upper bound on the overall performance. 
A two-terminal floating-gate memristive device can combine the technological maturity of the floating-gate transistor and the conceptual novelty of the memristor using a standard CMOS process. In this paper, we present a top-down computer aided design framework of the floating-gate memristive device and show its potential in neuromorphic computing. Our framework includes a Verilog-A SPICE model, small-signal schematics, a stochastic model, Monte-Carlo simulations, layout, DRC, LVS, and RC extraction.</em></td> </tr> <tr> <td>18:00</td> <td>4.6.3</td> <td><b>GROUND PLANE PARTITIONING FOR CURRENT RECYCLING OF SUPERCONDUCTING CIRCUITS</b><br /><b>Speaker</b>:<br />Naveen Katam, University of Southern California, US<br /><b>Authors</b>:<br />Naveen Kumar Katam, Bo Zhang and Massoud Pedram, University of Southern California, US<br /><em><b>Abstract</b><br />Superconducting single flux quantum (SFQ) technology using Josephson junctions (JJs) is an excellent choice for the computing fabrics of the future. Current recycling is a necessary technique for the energy-efficient implementation of large SFQ circuits, where circuit partitions with similar bias current requirements are biased serially. Though this technique has been verified for small-scale circuits, it has not been implemented for large circuits, as there is no trivial way to partition a circuit into blocks with separate ground planes. The major constraints for partitioning are (1) equal bias current and (2) equal area for all partitions, together with (3) minimal connections between adjacent ground planes, with a high cost for non-adjacent planes. For the first time, all these constraints are formulated into a cost function, which is minimized with the gradient descent method. The algorithm takes a circuit netlist and the intended number of partitions as inputs and outputs groups of cells belonging to separate ground planes. It minimizes the connections among different ground planes and gives a solution on which the current recycling technique can be implemented. The parameters of the cost function are initialized randomly, and the problem dimensions are reduced to find a solution quickly. On average, 30% of connections are between non-adjacent ground planes for the given benchmark circuits.</em></td> </tr> <tr> <td>18:15</td> <td>4.6.4</td> <td><b>SILICON PHOTONIC MICRORING RESONATORS: DESIGN OPTIMIZATION UNDER FABRICATION NON-UNIFORMITY</b><br /><b>Speaker</b>:<br />Mahdi Nikdast, Colorado State University, US<br /><b>Authors</b>:<br />Asif Mirza, Febin Sunny, Sudeep Pasricha and Mahdi Nikdast, Colorado State University, US<br /><em><b>Abstract</b><br />Microring resonators (MRRs) are very often considered the primary building block in silicon photonic integrated circuits (PICs). Despite many advantages, MRRs are considerably sensitive to fabrication non-uniformity (a.k.a. fabrication process variations), necessitating the use of power-hungry compensation methods (e.g., thermal tuning) to guarantee their reliable operation. Moreover, the design space of MRRs is complicated and includes several highly correlated design parameters, preventing designers from easily exploring and optimizing the design of MRRs against fabrication process variations (FPVs). In this paper, for the first time, we present a comprehensive design space exploration and optimization of MRRs against FPVs. 
In particular, we indicate how physical design parameters in MRRs can be optimized at design time to enhance their tolerance to FPVs while also improving the insertion loss and quality factor in such devices. Fabrication results obtained by measuring multiple fabricated MRRs designed using our design optimization solution demonstrate a significant 70% improvement on average in MRR tolerance to different FPVs. Such improvement indicates the efficiency of our novel design optimization solution in reducing the tuning power required for reliable operation of MRRs.</em></td> </tr> <tr> <td style="width:40px;">18:30</td> <td><a href="/date20/conference/session/IP2">IP2-8</a>, 849</td> <td><b>CURRENT-MODE CARRY-FREE MULTIPLIER DESIGN USING A MEMRISTOR-TRANSISTOR CROSSBAR ARCHITECTURE</b><br /><b>Speaker</b>:<br />Shengqi Yu, Newcastle Universtiy, GB<br /><b>Authors</b>:<br />Shengqi Yu<sup>1</sup>, Ahmed Soltan<sup>2</sup>, Rishad Shafik<sup>3</sup>, Thanasin Bunnam<sup>3</sup>, Domenico Balsamo<sup>3</sup>, Fei Xia<sup>3</sup> and Alex Yakovlev<sup>3</sup><br /><sup>1</sup>Newcastle Universtiy, GB; <sup>2</sup>Nile University, EG; <sup>3</sup>Newcastle University, GB<br /><em><b>Abstract</b><br />Traditional multipliers consist of complex logic components. They are a major energy and performance contributor in modern compute-intensive applications. As such, designing multipliers with reduced energy and faster speed has remained a thoroughgoing challenge. This paper presents a novel, carry-free multiplier, which is suitable for a new generation of energy-constrained applications. The multiplier circuit consists of an array of memristor-transistor cells that can be selected (i.e., turned ON or OFF) using a combination of DC bias voltages based on the operand values. When a cell is selected, it contributes to the current in the array path, which is then amplified by current mirrors with variable transistor gate sizes. The different current paths are connected to a node that analogously accumulates the currents to produce the multiplier output directly, removing the carry propagation stages typically seen in traditional digital multipliers. An essential feature of this multiplier is autonomous survivability: when the power drops below a threshold, the logic state is automatically retained at zero cost due to the non-volatile properties of memristors.</em></td> </tr> <tr> <td style="width:40px;">18:31</td> <td><a href="/date20/conference/session/IP2">IP2-9</a>, 88</td> <td><b>N-BIT DATA PARALLEL SPIN WAVE LOGIC GATE</b><br /><b>Speaker</b>:<br />Abdulqader Mahmoud, Delft University of Technology, NL<br /><b>Authors</b>:<br />Abdulqader Mahmoud<sup>1</sup>, Frederic Vanderveken<sup>2</sup>, Florin Ciubotaru<sup>2</sup>, Christoph Adelmann<sup>2</sup>, Sorin Cotofana<sup>1</sup> and Said Hamdioui<sup>1</sup><br /><sup>1</sup>Delft University of Technology, NL; <sup>2</sup>IMEC, BE<br /><em><b>Abstract</b><br />Due to their very nature, Spin Waves (SWs) created in the same waveguide, but with different frequencies, can coexist while selectively interacting with their own species only. The absence of inter-frequency interferences isolates input data sets encoded in SWs with different frequencies and creates the premises for simultaneous data parallel SW based processing without hardware replication or delay overhead. 
In this paper we leverage this SW property by introducing a novel computation paradigm, which allows for the parallel processing of n-bit input data vectors on the same basic SW based logic gate. Subsequently, to demonstrate the proposed concept, we present an 8-bit parallel 3-input Majority gate implementation and validate it by means of Object Oriented MicroMagnetic Framework (OOMMF) simulations. To evaluate the potential benefit of our proposal, we compare the 8-bit data parallel gate with an equivalent scalar SW gate-based implementation. Our evaluation indicates that the 8-bit data 3-input Majority gate implementation requires 4.16x less area than its scalar SW gate-based counterpart while preserving the same delay and energy consumption figures.</em></td> </tr> <tr> <td>18:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="4.7">4.7 EU Projects on Nanoelectronics with CMOS and alternative technologies</h2> <p><b>Date:</b> Tuesday, March 10, 2020<br /><b>Time:</b> 17:00 - 18:30<br /><b>Location / Room:</b> Berlioz</p> <p><b>Chair:</b><br />Dimitris Gizopoulos, UoA, GR</p> <p><b>Co-Chair:</b><br />George Karakonstantis, Queen's University Belfast, GB</p> <p>This session presents the results of three European Projects in different stages of execution, covering the development of a complete synthesis and optimization methodology for nano-crossbar arrays; the reliability, security, and associated EDA tools for nanoelectronic systems; and the exploitation of STT-MTJ technologies for heterogeneous function implementation.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>4.7.1</td> <td><b>NANO-CROSSBAR BASED COMPUTING: LESSONS LEARNED AND FUTURE DIRECTIONS</b><br /><b>Speaker</b>:<br />Mustafa Altun, Istanbul TU, TR<br /><b>Authors</b>:<br />Mustafa Altun<sup>1</sup>, Ismail Cevik<sup>1</sup>, Ahmet Erten<sup>1</sup>, Osman Eksik<sup>1</sup>, Mircea Stan<sup>2</sup> and Csaba Moritz<sup>3</sup><br /><sup>1</sup>Istanbul TU, TR; <sup>2</sup>University of Virginia, US; <sup>3</sup>University of Massachusetts, US<br /><em><b>Abstract</b><br />In this paper, we first summarize our research activities carried out through our European Union Horizon 2020 project between 2015 and 2019. The project has the goal of developing synthesis and performance optimization techniques for nanocrossbar arrays. For this purpose, different computing models, including diode, memristor, FET, and four-terminal switch based models, within different technologies, including carbon nanotubes, nanowires, and memristors as well as the CMOS technology, have been investigated. Their capabilities to realize logic functions and to tolerate faults have been deeply analyzed. From these experiences, we think that instead of replacing CMOS with a completely new crossbar based technology, developing CMOS compatible crossbar technologies and computing models is a more viable solution to overcome challenges in CMOS miniaturization. At this point, four-terminal switch based arrays, called switching lattices, come forward with their CMOS compatibility as well as their area-efficient device and circuit realizations. We have shown that switching lattices can be efficiently implemented using a standard CMOS process to realize logic functions, by doing experiments in a 65nm CMOS process. 
Further in this paper, we introduce the realization of memory arrays, including ROMs and RAMs, with switching lattices. We also discuss challenges and promises in realizing switching lattices in sub-30nm CMOS technologies, including FinFET technologies.</em></td> </tr> <tr> <td>17:30</td> <td>4.7.2</td> <td><b>RESCUE: INTERDEPENDENT CHALLENGES OF RELIABILITY, SECURITY AND QUALITY IN NANOELECTRONIC SYSTEMS</b><br /><b>Speaker</b>:<br />Maksim Jenihhin, Tallinn University of Technology, EE<br /><b>Authors</b>:<br />Maksim Jenihhin<sup>1</sup>, Said Hamdioui<sup>2</sup>, Matteo Sonza Reorda<sup>3</sup>, Milos Krstic<sup>4</sup>, Peter Langendoerfer<sup>4</sup>, Christian Sauer<sup>5</sup>, Anton Klotz<sup>5</sup>, Michael Huebner<sup>6</sup>, Joerg Nolte<sup>6</sup>, H.T. Vierhaus<sup>6</sup>, Georgios Selimis<sup>7</sup>, Dan Alexandrescu<sup>8</sup>, Mottaqiallah Taouil<sup>2</sup>, Geert-Jan Schrijen<sup>7</sup>, Luca Sterpone<sup>9</sup>, Giovanni Squillero<sup>3</sup>, Zoya Dyka<sup>4</sup> and Jaan Raik<sup>1</sup><br /><sup>1</sup>Tallinn University of Technology, EE; <sup>2</sup>Delft University of Technology, NL; <sup>3</sup>Politecnico di Torino - DAUIN, IT; <sup>4</sup>Leibniz-Institut für innovative Mikroelektronik, DE; <sup>5</sup>Cadence Design Systems, DE; <sup>6</sup>BTU Cottbus-Senftenberg, DE; <sup>7</sup>Intrinsic-ID, NL; <sup>8</sup>IROC Technologies, FR; <sup>9</sup>Politecnico di Torino, IT<br /><em><b>Abstract</b><br />The recent trends for nanoelectronic computing systems include machine-to-machine communication in the era of Internet-of-Things (IoT) and autonomous systems, complex safety-critical applications, extreme miniaturization of implementation technologies and intensive interaction with the physical world. These trends set tough requirements on mutually dependent extra-functional design aspects. The H2020 MSCA ITN project RESCUE is focused on key challenges for reliability, security and quality, as well as related electronic design automation tools and methodologies. The objectives include both research advancements and cross-sectoral training of a new generation of interdisciplinary researchers. 
Notable interdisciplinary collaborative research results for the first half-period include novel approaches for test generation, soft-error and transient-fault vulnerability analysis, cross-layer fault tolerance and error resilience, functional safety validation, reliability assessment and run-time management, HW security enhancement, and the initial implementation of these into holistic EDA tools.</em></td> </tr> <tr> <td>18:00</td> <td>4.7.3</td> <td><b>A UNIVERSAL SPINTRONIC TECHNOLOGY BASED ON MULTIFUNCTIONAL STANDARDIZED STACK</b><br /><b>Speaker</b>:<br />Mehdi Tahoori, Karlsruhe Institute of Technology (KIT), DE<br /><b>Authors</b>:<br />Mehdi Tahoori<sup>1</sup>, Sarath Mohanachandran Nair<sup>1</sup>, Rajendra Bishnoi<sup>2</sup>, Lionel Torres<sup>3</sup>, Guillaume Partigeon<sup>4</sup>, Gregory DiPendina<sup>5</sup> and Guillaume Prenat<sup>5</sup><br /><sup>1</sup>Karlsruhe Institute of Technology, DE; <sup>2</sup>Delft University of Technology, NL; <sup>3</sup>Université de Montpellier, FR; <sup>4</sup>LIRMM, FR; <sup>5</sup>Spintec, FR<br /><em><b>Abstract</b><br />The goal of the GREAT RIA project is to co-integrate multiple functions like sensors ("Sensing"), RF emitters or receivers ("Communicating") and logic/memory ("Processing/Storing") together within CMOS by adapting the Spin-Transfer Torque Magnetic Tunnel Junction (STT-MTJ), the elementary constitutive cell of MRAM memories, to a single baseline technology. Based on the unique set of STT performance characteristics (non-volatility, high speed, infinite endurance, and moderate read/write power), GREAT will achieve the same goal as heterogeneous integration of devices, but in a much simpler way. This will lead to a unique STT-MTJ cell technology called Multifunctional Standardized Stack (MSS). This paper presents the lessons learned in the project from the technology, compact modeling, process design kit, and standard cells, as well as memory and system level design evaluation and exploration. The proposed technology and toolsets are giant leaps towards heterogeneous integrated technology and architectures for IoT.</em></td> </tr> <tr> <td>18:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="5.1">5.1 Special Day on "Embedded AI": Tutorial Overviews</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 08:30 - 10:00<br /><b>Location / Room:</b> Amphithéâtre Jean Prouve</p> <p><b>Chair:</b><br />Dmitri Strukov, University of California, Santa Barbara, US</p> <p><b>Co-Chair:</b><br />Bernabe Linares-Barranco, CSIC, ES</p> <p>This session aims to provide a tutorial-style overview of hardware AI case studies and some proposed solutions, problems, and challenges.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>5.1.1</td> <td><b>NEURAL NETWORKS CIRCUITS BASED ON RESISTIVE MEMORIES</b><br /><b>Author</b>:<br />Carlo Reita, CEA, FR<br /><em><b>Abstract</b><br />In recent years, the field of Neural Networks has found a new golden age after nearly twenty years of lessened interest. Under the heading of Artificial Intelligence (AI), a large number of Deep Neural Networks (DNNs) have recently found application in image processing, management of information in large databases, decision aids, natural language recognition, etc. 
Most of these applications rely on algorithms that run on standard computing systems and sometimes make use of specific accelerators like Graphics Processing Units (GPUs) or dedicated highly parallel processors. In effect, a common operation in all NN algorithms is the scalar product of two vectors, and its optimisation is of paramount importance for reducing computational time and energy. In particular, the energy element is relevant for all embedded applications that cannot rely on cooling and/or unlimited power supply. The availability of resistive memories, with their unique capability of both storing computational values and performing analog multiplication by the use of Ohm's law, allows new circuit architectures where the latency, bandwidth limitations and power consumption issues associated with the use of conventional SRAM, DRAM and Flash memories can be greatly improved upon. In the presentation, some examples of advantageous use of resistive memories in NN circuits will be shown and some of their peculiarities will be discussed.</em></td> </tr>
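<p><em>The Ohm's-law trick mentioned above fits in a few lines: in an idealized crossbar (made-up voltages and conductances below, ignoring device non-idealities), each cell passes current i = v * g and each column wire sums its cells' currents, so every column current is an analog dot product:</em></p> <pre>
# One crossbar column computes a dot product in a single step:
# Ohm's law per cell, Kirchhoff's current law per column.
voltages = [0.3, 0.0, 0.5]                 # input vector (volts)
conductances = [[1e-6, 2e-6],              # g[i][j]: row i, column j (siemens)
                [4e-6, 1e-6],
                [2e-6, 3e-6]]
currents = [sum(v * g[j] for v, g in zip(voltages, conductances))
            for j in range(2)]
print(currents)                            # column currents = dot products
</pre>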
<tr> <td>09:15</td> <td>5.1.2</td> <td><b>EXPLOITING ACTIVATION SPARSITY IN DRAM-BASED SCALABLE CNN AND RNN ACCELERATORS</b><br /><b>Author</b>:<br />Tobi Delbrück, ETH Zurich, CH<br /><em><b>Abstract</b><br />Large deep neural networks (DNNs) need lots of fast memory for states and weights. Although DRAM is the dominant high-throughput, low-cost memory (costing 20X less than SRAM), its long random access latency is bad for the unpredictable access patterns in spiking neural networks (SNNs). But sparsely active SNNs are key to biological computational efficiency. This talk reports on our five-year development of convolutional and recurrent deep neural network hardware accelerators that exploit spatial and temporal sparsity like SNNs but achieve state-of-the-art throughput, power efficiency and latency using DRAM for the large weight and state memory required by powerful DNNs.</em></td> </tr> <tr> <td>10:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="5.2">5.2 Machine Learning Approaches to Analog Design</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 08:30 - 10:00<br /><b>Location / Room:</b> Chamrousse</p> <p><b>Chair:</b><br />Marie-Minerve Louerat, Sorbonne University Lip6, FR</p> <p><b>Co-Chair:</b><br />Sebastien CLIQUENNOIS, STMicroelectronics, FR</p> <p>This session presents recent advances in machine learning approaches to support the design of analog and mixed-signal circuits. Techniques such as reinforcement learning and convolutional networks are employed to address circuit and layout optimization. The presented techniques have great potential for seeding innovative solutions to face current and future challenges in this field.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>5.2.1</td> <td><b>AUTOCKT: DEEP REINFORCEMENT LEARNING OF ANALOG CIRCUIT DESIGNS</b><br /><b>Speaker</b>:<br />Keertana Settaluri, University of California, Berkeley, US<br /><b>Authors</b>:<br />Keertana Settaluri, Ameer Haj-Ali, Qijing Huang, Kourosh Hakhamaneshi and Borivoje Nikolic, University of California, Berkeley, US<br /><em><b>Abstract</b><br />The need for domain specialization under energy constraints in deeply-scaled CMOS has been driving the need for agile development of Systems on a Chip (SoCs). While digital subsystems have design flows that are conducive to rapid iterations from specification to layout, analog and mixed-signal modules face the challenge of a long human-in-the-middle iteration loop that requires expert intuition to verify that post-layout circuit parameters meet the original design specification. Existing automated solutions that optimize circuit parameters for a given target design specification have the limitations of being schematic-only, inaccurate, sample-inefficient or not generalizable. This work presents AutoCkt, a deep-reinforcement learning tool that not only finds post-layout circuit parameters for a given target specification, but also gains knowledge about the entire design space through a sparse subsampling technique. Our results show that for multiple circuit topologies, the trained AutoCkt agent is able to converge and meet all target specifications on at least 96.3% of tested design goals in schematic simulation, on average 40X faster than a traditional genetic algorithm. Using the Berkeley Analog Generator, AutoCkt is able to design 40 LVS-passed operational amplifiers in 68 hours, 9.6X faster than the state-of-the-art when considering layout parasitics.</em></td> </tr>
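<p><em>To convey the shape of such a sizing loop (illustration only, with invented names, specs and a greedy rule standing in for AutoCkt's learned deep-RL policy): an agent nudges circuit parameters, a simulator stand-in scores them against target specs, and the episode ends once all specs are met:</em></p> <pre>
import random

target = {"gain": 300.0, "bandwidth": 1e7}     # made-up target specs

def simulate(params):                          # stand-in for a SPICE run
    return {"gain": 60.0 * params["w1"], "bandwidth": 2e6 * params["w2"]}

def score(params):                             # reward: capped spec ratios
    result = simulate(params)
    return sum(min(result[k] / target[k], 1.0) for k in target)

params = {"w1": 1.0, "w2": 1.0}
for step in range(200):
    knob = random.choice(list(params))         # a trained agent would choose
    trial = dict(params)                       # actions with a learned policy
    trial[knob] *= 1.2
    if score(trial) >= score(params):          # greedy stand-in, for brevity
        params = trial
    if all(simulate(params)[k] >= target[k] for k in target):
        print("all specs met at step", step, params)
        break
</pre>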
<tr> <td>09:00</td> <td>5.2.2</td> <td><b>TOWARDS DECRYPTING THE ART OF ANALOG LAYOUT: PLACEMENT QUALITY PREDICTION VIA TRANSFER LEARNING</b><br /><b>Speaker</b>:<br />David Pan, University of Texas, Austin, US<br /><b>Authors</b>:<br />Mingjie Liu<sup>1</sup>, Keren Zhu<sup>2</sup>, Jiaqi Gu<sup>2</sup>, Linxiao Shen<sup>2</sup>, Xiyuan Tang<sup>2</sup>, Nan Sun<sup>3</sup> and David Z. Pan<sup>2</sup><br /><sup>1</sup>University of Texas Austin, US; <sup>2</sup>University of Texas, Austin, US; <sup>3</sup>UT Austin, US<br /><em><b>Abstract</b><br />Despite tremendous efforts in analog layout automation, little adoption has been demonstrated in practical design flows. Traditional analog layout synthesis tools use various heuristic constraints to prune the design space to ensure post-layout performance. However, these approaches provide limited guarantees and poor generalizability due to a lack of models mapping layout properties to circuit performance. In this paper, we attempt to narrow the gap in post-layout performance modeling for analog circuits with a quantitative statistical approach. We leverage a state-of-the-art automatic layout tool and an industry-level simulator to generate labeled training data in an automatic manner. We propose a 3D convolutional neural network (CNN) model to predict the relative placement quality using well-crafted placement features. To achieve data efficiency for practical usage, we further propose a transfer learning scheme that greatly reduces the amount of data needed. Our model would enable early pruning and efficient design exploration for practical layout design flows. Experimental results demonstrate the effectiveness and generalizability of our method across different operational transconductance amplifier (OTA) designs.</em></td> </tr> <tr> <td>09:30</td> <td>5.2.3</td> <td><b>DESIGN OF MULTI-OUTPUT SWITCHED-CAPACITOR VOLTAGE REGULATOR VIA MACHINE LEARNING</b><br /><b>Speaker</b>:<br />Zhiyuan Zhou, Washington State University, US<br /><b>Authors</b>:<br />Zhiyuan Zhou<sup>1</sup>, Syrine Belakaria<sup>2</sup>, Aryan Deshwal<sup>2</sup>, Wookpyo Hong<sup>1</sup>, Jana Doppa<sup>2</sup>, Partha Pratim Pande<sup>1</sup> and Deukhyoun Heo<sup>1</sup><br /><sup>1</sup>Washington State University, US; <sup>2</sup>Washington State University, US<br /><em><b>Abstract</b><br />Efficiency of the power management system (PMS) is one of the key performance metrics for highly integrated systems on chips (SoCs). Towards the goal of improving the power efficiency of SoCs, we make two key technical contributions in this paper. First, we develop a multi-output switched-capacitor voltage regulator (SCVR) with a new flying capacitor crossing technique (FCCT) and cloud-capacitor method. Second, to optimize the design parameters of the SCVR, we introduce a novel machine-learning (ML)-inspired optimization framework to reduce the number of expensive design simulations. Simulation shows that the power loss of the multi-output SCVR with FCCT is reduced by more than 40% compared to conventional multiple single-output SCVRs. Our ML-based design optimization framework is able to achieve more than a 90% reduction in the number of simulations needed to uncover optimized circuit parameters of the proposed SCVR.</em></td> </tr> <tr> <td style="width:40px;">10:00</td> <td><a href="/date20/conference/session/IP2">IP2-10</a>, 371</td> <td><b>HIGH-SPEED ANALOG SIMULATION OF CMOS VISION CHIPS USING EXPLICIT INTEGRATION TECHNIQUES ON MANY-CORE PROCESSORS</b><br /><b>Speaker</b>:<br />Tom Kazmierski, University of Southampton, GB<br /><b>Authors</b>:<br />Gines Domenech-Asensi<sup>1</sup> and Tom J Kazmierski<sup>2</sup><br /><sup>1</sup>Universidad Politecnica de Cartagena, ES; <sup>2</sup>University of Southampton, GB<br /><em><b>Abstract</b><br />This work describes a high-speed simulation technique for analog circuits, based on the use of state-space equations and an explicit integration method parallelised on a multiprocessor architecture. The integration step of such a method is smaller than the one required by an implicit simulation technique based on Newton-Raphson iterations. However, given that explicit methods do not require the computation of time-consuming matrix factorizations, the overall simulation time is reduced. The technique described in this work has been implemented on an NVIDIA general purpose GPU and has been tested simulating the Gaussian filtering operation performed by a smart CMOS image sensor. Such devices are used to perform computation on the edge and include built-in image processing functions. Among those, Gaussian filtering is one of the most common functions, since it is a basic task for early vision processing. These smart sensors are increasingly complex, and hence the time required to simulate them during their design cycle keeps growing. From a certain imager size onward, the proposed simulation method yields simulation times two orders of magnitude faster than an implicit-method-based tool such as SPICE.</em></td> </tr>
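<p><em>The speed argument rests on the explicit update being a plain multiply-accumulate per state, which parallelises trivially across pixels. A minimal forward-Euler sketch on a hypothetical one-pixel RC model (illustrative values only, not the paper's model):</em></p> <pre>
# Forward-Euler (explicit) integration of x' = a*x + b*u:
# no matrix factorization per step, unlike implicit methods,
# so many pixel states can be stepped in parallel on a GPU.
r, c = 1e3, 1e-6                 # R = 1 kOhm, C = 1 uF  ->  tau = 1 ms
a, b = -1.0 / (r * c), 1.0 / (r * c)
x, u, h = 0.0, 1.0, 1e-5         # state, step input, integration step (s)
for _ in range(200):             # simulate 2 ms
    x = x + h * (a * x + b * u)  # explicit update; stability needs h < 2*tau
print(x)                         # approx 1 - exp(-2) = 0.86 (charging capacitor)
</pre>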
<tr> <td style="width:40px;">10:01</td> <td><a href="/date20/conference/session/IP2">IP2-11</a>, 919</td> <td><b>A 100KHZ-1GHZ TERMINATION-DEPENDENT HUMAN BODY COMMUNICATION CHANNEL MEASUREMENT USING MINIATURIZED WEARABLE DEVICES</b><br /><b>Speaker</b>:<br />Shreyas Sen, Purdue University, US<br /><b>Authors</b>:<br />Shitij Avlani, Mayukh Nath, Shovan Maity and Shreyas Sen, Purdue University, US<br /><em><b>Abstract</b><br />Human Body Communication (HBC) has shown great promise to replace wireless communication for information exchange between wearable devices of a body area network. However, there are very few studies in the literature that systematically study the channel loss of capacitive HBC for wearable devices over a wide frequency range with different terminations at the receiver, partly due to the need for miniaturized wearable devices for an accurate study. This paper, for the first time, measures the channel loss of capacitive HBC from 100KHz to 1GHz for both high-impedance and 50 ohm terminations using wearable, battery-powered devices, which is mandatory for accurate measurement of the HBC channel loss due to ground coupling effects. Results show that high impedance termination leads to a significantly lower channel loss (40 dB improvement at 1MHz), as compared to 50 ohm termination at low frequencies. This difference steadily decreases with increasing frequency, until they become similar near 80MHz. Beyond 100MHz inter-device coupling dominates, thereby preventing accurate measurements of channel loss of the human body. The measured results provide consistent wearable, wide-frequency HBC channel-loss data and could serve as a backbone for the emerging field of HBC by aiding in the selection of an appropriate operation frequency and termination.</em></td> </tr> <tr> <td>10:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="5.3">5.3 Special Session: Secure Composition of Hardware Systems</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 08:30 - 10:00<br /><b>Location / Room:</b> Autrans</p> <p><b>Chair:</b><br />Ilia Polian, Stuttgart University, DE</p> <p><b>Co-Chair:</b><br />Francesco Regazzoni, ALARI, CH</p> <p>Today's electronic systems consist of mixtures of programmable, reconfigurable, and application-specific hardware components, tied together by tremendously complex software. At the same time, systems are increasingly integrated such that a sub-system that was traditionally regarded "harmless" (a car's entertainment system) finds itself tightly coupled with safety-critical sub-systems (driving assistance) and security-sensitive sub-systems such as online payment and others. Moreover, a system's hardware components are now often directly accessible to the end users and thus vulnerable to physical attacks. The goal of this hot-topic session is to establish a common understanding of principles and techniques that can facilitate composition and integration of hardware systems and achieve security guarantees. Theoretical foundations of secure composition are currently limited to software systems, and unique security challenges arise when a real system, composed of a range of hardware components with different owners and trust assumptions, is put together. 
Physical and side-channel attacks add another level of complexity to the problem of secure composition. Moreover, practical hardware systems include software stacks of tremendous size and complexity, and hardware-software interaction can create new security challenges. This hot-topic session will consider secure composition both from a purely hardware-centric and from a hardware-software perspective in a more complex system. It will also target composition of countermeasures against hardware-centric attacks and against software-driven attacks on hardware. It brings together researchers and industry practitioners who deal with secure composition: security-oriented electronic design automation; secure architectures of automotive hardware-software systems; and advanced attack scenarios against complex hardware systems.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>5.3.1</td> <td><b>TOWARDS SECURE COMPOSITION OF INTEGRATED CIRCUITS AND ELECTRONIC SYSTEMS: ON THE ROLE OF EDA</b><br /><b>Speaker</b>:<br />Johann Knechtel, NYU Abu Dhabi, AE<br /><b>Authors</b>:<br />Johann Knechtel<sup>1</sup>, Elif Bilge Kavun<sup>2</sup>, Francesco Regazzoni<sup>3</sup>, Annelie Heuser<sup>4</sup>, Anupam Chattopadhyay<sup>5</sup>, Debdeep Mukhopadhyay<sup>6</sup>, Dey Soumyajit<sup>6</sup>, Yunsi Fei<sup>7</sup>, Yaacov Belenky<sup>8</sup>, Itamar Levi<sup>9</sup>, Tim Güneysu<sup>10</sup>, Patrick Schaumont<sup>11</sup> and Ilia Polian<sup>12</sup><br /><sup>1</sup>New York University Abu Dhabi (NYUAD), AE; <sup>2</sup>University of Sheffield, GB; <sup>3</sup>ALaRI, CH; <sup>4</sup>Univ Rennes, Inria, CNRS, IRISA, FR; <sup>5</sup>Nanyang Technological University, SG; <sup>6</sup>IIT Kharagpur, IN; <sup>7</sup>Northeastern University, US; <sup>8</sup>Intel, IL; <sup>9</sup>Bar-Ilan University, IL; <sup>10</sup>Ruhr-University Bochum, DE; <sup>11</sup>Virginia Tech, US; <sup>12</sup>Universität Stuttgart, DE<br /><em><b>Abstract</b><br />Modern electronic systems become ever more complex, yet remain modular, with integrated circuits (ICs) acting as versatile hardware components at their heart. Electronic design automation (EDA) for ICs has focused traditionally on power, performance, and area. However, given the rise of hardware-centric security threats, we believe that EDA must also adopt related notions like secure by design and secure composition of hardware. Despite various promising studies, we argue that some aspects still require more effort, for example: effective means for compilation of assumptions and constraints for security schemes, all the way from the system level down to the "bare metal"; modeling, evaluation, and consideration of security-relevant metrics; or automated and holistic synthesis of various countermeasures, without inducing negative cross-effects. In this paper, we first introduce hardware security for the EDA community. Next, we review prior (academic) art for EDA-driven security evaluation and implementation of countermeasures. 
We then discuss strategies and challenges for advancing research and development toward secure composition of circuits and systems.</em></td> </tr> <tr> <td>08:55</td> <td>5.3.2</td> <td><b>ATTACKER MODELING ON COMPOSED SYSTEMS</b><br /><b>Authors</b>:<br />Tobias Basic, Jan Müller, Pierre Schnarz and Marc Stoettinger, Continental AG, DE</td> </tr> <tr> <td>09:15</td> <td>5.3.3</td> <td><b>PITFALLS IN MACHINE LEARNING-BASED ADVERSARY MODELING FOR HARDWARE SYSTEMS</b><br /><b>Speaker</b>:<br />Fatemeh Ganji, University of Florida, US<br /><b>Authors</b>:<br />Fatemeh Ganji<sup>1</sup>, Sarah Amir<sup>1</sup>, Shahin Tajik<sup>1</sup>, Jean-Pierre Seifert<sup>2</sup> and Domenic Forte<sup>1</sup><br /><sup>1</sup>University of Florida, US; <sup>2</sup>TU Berlin, DE</td> </tr> <tr> <td>09:35</td> <td>5.3.4</td> <td><b>USING UNIVERSAL COMPOSITION TO DESIGN AND ANALYZE SECURE COMPLEX HARDWARE SYSTEMS</b><br /><b>Speaker</b>:<br />Hoda Maleki, University of Augusta, US<br /><b>Authors</b>:<br />Ran Canetti<sup>1</sup>, Marten van Dijk<sup>2</sup>, Hoda Maleki<sup>3</sup>, Ulrich Rührmair<sup>4</sup> and Patrick Schaumont<sup>5</sup><br /><sup>1</sup>Boston University, US; <sup>2</sup>University of Connecticut, US; <sup>3</sup>University of Augusta, US; <sup>4</sup>TUM, DE; <sup>5</sup>Worcester Polytechnic Institute, US</td> </tr> <tr> <td>10:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="5.4">5.4 New Frontiers in Formal Verification for Hardware</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 08:30 - 10:00<br /><b>Location / Room:</b> Stendhal</p> <p><b>Chair:</b><br />Alessandro Cimatti, Fondazione Bruno Kessler, IT</p> <p><b>Co-Chair:</b><br />Heinz Riener, EPFL, CH</p> <p>The session presents several new techniques in hardware verification. The technical papers propose methods for the formal verification of industrial arithmetic circuits and processors, and show how reinforcement learning can be used for verification of shared memory protocols. Two interactive presentations describe how to use high-level synthesis to supply security guarantees and to generate certificates when verifying multipliers.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>5.4.1</td> <td><b>GAP-FREE PROCESSOR VERIFICATION BY S²QED AND PROPERTY GENERATION</b><br /><b>Speaker</b>:<br />Keerthikumara Devarajegowda, Infineon Technologies AG, DE<br /><b>Authors</b>:<br />Keerthikumara Devarajegowda<sup>1</sup>, Mohammad Rahmani Fadiheh<sup>2</sup>, Eshan Singh<sup>3</sup>, Clark Barrett<sup>3</sup>, Subhasish Mitra<sup>3</sup>, Wolfgang Ecker<sup>1</sup>, Dominik Stoffel<sup>2</sup> and Wolfgang Kunz<sup>2</sup><br /><sup>1</sup>Infineon Technologies AG, DE; <sup>2</sup>University of Kaiserslautern, DE; <sup>3</sup>Stanford University, US<br /><em><b>Abstract</b><br />The required manual effort and verification expertise are among the main hurdles for adopting formal verification in processor design flows. Developing a set of properties that fully covers all instruction behaviors is a laborious and challenging task. This paper proposes a highly automated and "complete" processor verification approach which requires considerably less manual effort and expertise compared to the state of the art. 
The proposed approach extends the S²QED approach to cover both single and multiple instruction bugs and ensures that a design is completely verified according to a well-defined criterion. This makes the approach robust against human errors. The properties are simple and can be automatically generated from an ISA model with small manual effort. Furthermore, unlike in conventional property checking, the verification engineer does not need to explicitly specify the processor's behavior in special scenarios, such as stalling, exceptions, or speculation, since these are taken care of implicitly by the proposed computational model. The great promise of the approach is shown by an industrial case study with a 5-stage RISC-V processor.</em></td> </tr> <tr> <td>09:00</td> <td>5.4.2</td> <td><b>SPEAR: HARDWARE-BASED IMPLICIT REWRITING FOR SQUARE-ROOT VERIFICATION</b><br /><b>Speaker</b>:<br />Maciej Ciesielski, University of Massachusetts Amherst, US<br /><b>Authors</b>:<br />Atif Yasin<sup>1</sup>, Tiankai Su<sup>2</sup>, Sebastien Pillement<sup>3</sup> and Maciej Ciesielski<sup>4</sup><br /><sup>1</sup>PhD student at UMass Amherst, US; <sup>2</sup>UMass Amherst, US; <sup>3</sup>University of Nantes, FR; <sup>4</sup>University of Massachusetts Amherst, US<br /><em><b>Abstract</b><br />The paper addresses the formal verification of gate-level square-root circuits. Division and square root are among the most complex arithmetic operations to implement, and proving the correctness of their hardware implementations is of great importance. In contrast to standard approaches that use satisfiability and equivalence checking techniques, the presented method verifies whether the gate-level square-root circuit actually performs a root operation, instead of checking equivalence with a reference design. The method extends the algebraic rewriting technique developed earlier for multipliers and introduces a novel technique of implicit hardware rewriting. The tool SPEAR, based on hardware rewriting, enables the verification of a 256-bit gate-level square-root circuit with 0.26 million gates in under 18 minutes.</em></td> </tr> <tr> <td>09:30</td> <td>5.4.3</td> <td><b>A REINFORCEMENT LEARNING APPROACH TO DIRECTED TEST GENERATION FOR SHARED MEMORY VERIFICATION</b><br /><b>Speaker</b>:<br />Nícolas Pfeifer, UFSC, BR<br /><b>Authors</b>:<br />Nicolas Pfeifer, Bruno V. Zimpel, Gabriel A. G. Andrade and Luiz C. V. dos Santos, Federal University of Santa Catarina, BR<br /><em><b>Abstract</b><br />Multicore chips are expected to rely on coherent shared memory. Although the coherence hardware can scale gracefully, the protocol state space grows exponentially with core count. That is why design verification requires directed test generation (DTG) for dynamic coverage control under the tight time constraints resulting from slow simulation and short verification budgets. Next generation EDA tools are expected to exploit Machine Learning for reaching high coverage in less time. We propose a technique that addresses DTG as a decision process and tries to find a decision-making policy for maximizing the cumulative coverage, as a result of successive actions taken by an agent. Instead of simply relying on learning, our technique builds upon the legacy from constrained random test generation (RTG). It casts DTG as coverage-driven RTG, and it explores distinct RTG engines subject to progressively tighter constraints.
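As a rough, self-contained illustration of such a coverage-driven decision process (the engines, state space, and hit model below are invented stand-ins, not the paper's coverage model), an epsilon-greedy agent can learn which constraint tightness yields the most new coverage:

```python
# A toy epsilon-greedy agent choosing among RTG "engines" of different
# constraint tightness; reward is the number of newly covered states.
# Engines, state space, and hit model are invented for illustration.
import random

ENGINES = [0.2, 0.5, 0.9]             # constraint tightness per engine
STATES = 200                          # abstract protocol state space
covered = set()
q = [0.0] * len(ENGINES)              # estimated coverage gain per engine
n = [0] * len(ENGINES)

for _ in range(500):
    if random.random() < 0.1:         # explore
        a = random.randrange(len(ENGINES))
    else:                             # exploit best engine so far
        a = max(range(len(ENGINES)), key=lambda i: q[i])
    # assumption: tighter constraints reach rarely-exercised states
    reach = int(STATES * (0.3 + 0.7 * ENGINES[a]))
    hits = {random.randrange(reach) for _ in range(8)}
    reward = len(hits - covered)      # cumulative-coverage gain
    covered |= hits
    n[a] += 1
    q[a] += (reward - q[a]) / n[a]    # incremental mean update

print(f"coverage {len(covered) / STATES:.1%}, engine values {q}")
```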
We compared three Reinforcement Learning generators with a state-of-the-art generator based on Genetic Programming. The experimental results show that the proper enforcement of constraints is more efficient for guiding learning towards higher coverage than simply letting the generator learn how to select the most promising memory events for increasing coverage. For a 3-level MESI 32-core design, the proposed approach led to the highest observed coverage (95.81%), and it reached the baseline generator's maximal coverage 2.4 times faster.</em></td> </tr> <tr> <td>09:45</td> <td>5.4.4</td> <td><b>TOWARDS FORMAL VERIFICATION OF OPTIMIZED AND INDUSTRIAL MULTIPLIERS</b><br /><b>Speaker</b>:<br />Alireza Mahzoon, Universität Bremen, DE<br /><b>Authors</b>:<br />Alireza Mahzoon<sup>1</sup>, Daniel Grosse<sup>2</sup>, Christoph Scholl<sup>3</sup> and Rolf Drechsler<sup>2</sup><br /><sup>1</sup>Universität Bremen, DE; <sup>2</sup>Universität Bremen / DFKI GmbH, DE; <sup>3</sup>University of Freiburg, DE<br /><em><b>Abstract</b><br />Formal verification methods have made huge progress over the last decades. However, proving the correctness of arithmetic circuits involving integer multipliers still drives the verification techniques to their limits. Recently, Symbolic Computer Algebra (SCA) methods have shown good results in the verification of both large and non-trivial multipliers. Their success is mainly based on (1) reverse engineering and identifying basic building blocks, (2) finding converging gate cones which start from the basic building blocks and (3) early removal of redundant terms (vanishing monomials) to avoid the blow-up during backward rewriting. Despite these important accomplishments, verifying optimized and technology-mapped multipliers is an almost unexplored area. This creates major barriers for industrial use as most of the designs are area and delay optimized. To overcome the barriers, we propose a novel SCA method which supports the formal verification of a large variety of optimized multipliers. Our method takes advantage of a dynamic substitution ordering to avoid the monomial explosion during backward rewriting. Experimental results confirm the efficiency of our approach in the verification of a wide range of optimized multipliers including industrial benchmarks.</em></td> </tr> <tr> <td style="width:40px;">10:00</td> <td><a href="/date20/conference/session/IP2">IP2-12</a>, 151</td> <td><b>FROM DRUP TO PAC AND BACK</b><br /><b>Speaker</b>:<br />Daniela Kaufmann, Johannes Kepler Universität Linz, AT<br /><b>Authors</b>:<br />Daniela Kaufmann, Armin Biere and Manuel Kauers, Johannes Kepler Universität Linz, AT<br /><em><b>Abstract</b><br />Currently, the most efficient automatic approach to verify gate-level multipliers combines SAT solving and computer algebra. In order to increase confidence in the verification, proof certificates are generated. However, due to different solving techniques, these certificates require two different proof formats, namely DRUP and PAC. A combined proof has so far been missing. Correctness of this approach can thus only be trusted up to the correctness of compositional reasoning. In this paper we show how to generate a single proof in one proof format, which then makes it possible to certify correctness using one simple proof checker. We further investigate empirically the effect on proof generation and checking time as well as on proof size.
It turns out that PAC proofs are much more compact and faster to check.</em></td> </tr> <tr> <td style="width:40px;">10:01</td> <td><a href="/date20/conference/session/IP2">IP2-13</a>, 956</td> <td><b>VERIFIABLE SECURITY TEMPLATES FOR HARDWARE</b><br /><b>Speaker</b>:<br />Bill Harrison, Oak Ridge National Laboratory, Cyber Security Research Group, US<br /><b>Authors</b>:<br />William Harrison<sup>1</sup> and Gerard Allwein<sup>2</sup><br /><sup>1</sup>Oak Ridge National Laboratory, US; <sup>2</sup>Naval Research Laboratory, US<br /><em><b>Abstract</b><br />HLS has, with a few notable exceptions, not focused on transferring ideas and techniques from high assurance software formal methods to hardware development, despite there being a sophisticated and mature body of research in that area. Just as it has introduced software engineering virtues, we believe HLS can also become a vector for retrofitting software formal methods to the challenge of high assurance security in hardware. This paper introduces the Device Calculus and its mechanization in the Agda proof checking system. The Device Calculus is a starting point for exploring formal methods and security within high-level synthesis flows. We illustrate the Device Calculus with a number of examples of formally verifiable security templates, i.e., functions in the Device Calculus that express common security structures at a high level of abstraction.</em></td> </tr> <tr> <td>10:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="5.5">5.5 Model-Based Analysis and Security</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 08:30 - 10:00<br /><b>Location / Room:</b> Bayard</p> <p><b>Chair:</b><br />Ylies Falcone, University Grenoble Alpes and Inria, FR</p> <p><b>Co-Chair:</b><br />Todd Austin, University of Michigan, US</p> <p>The session explores the use of state-of-the-art model-based analysis and verification techniques to secure and improve the performance of embedded systems. More specifically, it presents the use of satisfiability modulo theory, runtime monitoring, fuzzing, and model-checking to evaluate how secure a system is, and to prevent and detect attacks.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>5.5.1</td> <td><b>IS REGISTER TRANSFER LEVEL LOCKING SECURE?</b><br /><b>Speaker</b>:<br />Chandan Karfa, IIT Guwahati, IN<br /><b>Authors</b>:<br />Chandan Karfa<sup>1</sup>, Ramanuj Chouksey<sup>1</sup>, Christian Pilato<sup>2</sup>, Siddharth Garg<sup>3</sup> and Ramesh Karri<sup>4</sup><br /><sup>1</sup>IIT Guwahati, IN; <sup>2</sup>Politecnico di Milano, IT; <sup>3</sup>New York University, US; <sup>4</sup>NYU, US<br /><em><b>Abstract</b><br />Register Transfer Level (RTL) locking seeks to prevent intellectual property (IP) theft of a design by locking the RTL description so that it functions correctly only on the application of a key. This paper evaluates the security of a state-of-the-art RTL locking scheme using a satisfiability modulo theories (SMT) based algorithm to retrieve the secret key. The attack first obtains the high-level behavior of the locked RTL, and then uses an SMT-based formulation to find so-called distinguishing input patterns (DIPs), i.e., inputs that help eliminate incorrect keys from the keyspace. The attack methodology has two main advantages over the gate-level attacks.
First, since the attack handles the design at the RTL, the method scales to large designs. Second, the attack does not apply separate unlocking strategies for the combinational and sequential parts of the design; it handles both styles via a unifying abstraction. We demonstrate the attack on locked RTL generated by TAO, a state-of-the-art RTL locking solution. Empirical results show that we can partially or completely break designs locked by TAO.</em></td> </tr> <tr> <td>09:00</td> <td>5.5.2</td> <td><b>DESIGN SPACE EXPLORATION FOR MODEL-BASED COMMUNICATION SYSTEMS</b><br /><b>Speaker</b>:<br />Valentina Richthammer, Ulm University, DE<br /><b>Authors</b>:<br />Valentina Richthammer, Marcel Rieß, Julian Bestler, Frank Slomka and Michael Glaß, Ulm University, DE<br /><em><b>Abstract</b><br />A main challenge of modem design lies in selecting a suitable combination of subsystems (e.g. Analog Digital/Digital Analog Converters (ADC/DAC), (de)modulators, scramblers, interleavers, and coding and filtering modules) - each of which can be implemented in a multitude of ways. At the same time, the complete modem configuration needs to be tailored to the specific requirements of the intended communication channel or scenario. Therefore, model-based design methodologies have recently been popularized in this field, since their application facilitates the specification of individual modem components that are easily exchanged during the automated synthesis of the modem. However, this development has resulted in a tremendous increase in the number of synthesizable modem options. In fact, the optimal modem configuration for a communication scenario cannot readily be determined, since an exhaustive analysis of all configuration possibilities is computationally intractable. To remedy this, we propose a fully automated Design Space Exploration (DSE) methodology for model-based modem design that combines (I) the metaheuristic exploration and optimization of modem-configuration possibilities with (II) a simulative analysis of suitable measures of communication quality. The presented case study for an acoustic underwater communication scenario supports the described need for novel, automated methodologies in the area of model-based design, since the modem configurations discovered during a comparably short DSE are demonstrated to significantly outperform state-of-the-art modems from the literature.</em></td> </tr> <tr> <td>09:30</td> <td>5.5.3</td> <td><b>STATISTICAL TIME-BASED INTRUSION DETECTION IN EMBEDDED SYSTEMS</b><br /><b>Speaker</b>:<br />Nadir Carreon Rascon, University of Arizona, MX<br /><b>Authors</b>:<br />Nadir Carreon Rascon, Allison Gilbreath and Roman Lysecky, University of Arizona, US<br /><em><b>Abstract</b><br />This paper presents a statistical method based on cumulative distribution functions (CDFs) to analyze an embedded system's behavior and detect anomalous and malicious execution behaviors. The proposed method analyzes the internal timing of the system by monitoring individual operations and sequences of operations, wherein the timing of operations is decomposed into multiple timing subcomponents. Building the normal model of the system from this internal timing adds resilience to zero-day attacks and mimicry malware. The combination of CDF-based statistical analysis and timing subcomponents enables both higher detection rates and lower false positive rates.
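The statistical core of such a detector can be sketched in a few lines; the Gaussian timings, window size, and 0.25 threshold below are illustrative assumptions, not the paper's calibrated model:

```python
# Compare the empirical CDF of a runtime window of operation timings
# against a trained normal model; flag a window whose CDF deviates
# too far. Timings and threshold are illustrative assumptions.
import numpy as np

def ecdf(samples):
    xs = np.sort(samples)
    return lambda t: np.searchsorted(xs, t, side="right") / len(xs)

def cdf_distance(model, window):
    F, G = ecdf(model), ecdf(window)
    return max(abs(F(t) - G(t)) for t in np.union1d(model, window))

rng = np.random.default_rng(0)
normal = rng.normal(100.0, 5.0, 2000)   # training-time operation timings
benign = rng.normal(100.0, 5.0, 64)     # runtime window, clean run
mimic = rng.normal(112.0, 5.0, 64)      # malware stretches the timing

for name, w in (("benign", benign), ("mimic", mimic)):
    d = cdf_distance(normal, w)
    print(name, round(d, 2), "ANOMALY" if d > 0.25 else "ok")
```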
We demonstrate the effectiveness of the approach and compare it to several state-of-the-art malware detection methods using two embedded systems benchmarks, namely a network-connected pacemaker and an unmanned aerial vehicle, using seven different malware samples.</em></td> </tr> <tr> <td style="width:40px;">10:00</td> <td><a href="/date20/conference/session/IP2">IP2-14</a>, 637</td> <td><b>IFFSET: IN-FIELD FUZZING OF INDUSTRIAL CONTROL SYSTEMS USING SYSTEM EMULATION</b><br /><b>Speaker</b>:<br />Dimitrios Tychalas, NYU Tandon School of Engineering, US<br /><b>Authors</b>:<br />Dimitrios Tychalas<sup>1</sup> and Michail Maniatakos<sup>2</sup><br /><sup>1</sup>NYU Tandon School of Engineering, US; <sup>2</sup>New York University Abu Dhabi, AE<br /><em><b>Abstract</b><br />Industrial Control Systems (ICS) have evolved in the last decade, shifting from proprietary software/hardware to contemporary embedded architectures paired with open-source operating systems. In contrast to the IT world, where continuous updates and patches are expected, decommissioning always-on ICS for security assessment can incur prohibitive costs for their owner. Thus, a solution for routinely assessing the cybersecurity posture of diverse ICS without affecting their operation is essential. In this paper, we introduce IFFSET, a platform that leverages full system emulation of Linux-based ICS firmware and utilizes fuzzing for security evaluation. Our platform extracts the file system and kernel information from a live ICS device, building an image which is emulated on a desktop system through QEMU. We employ fuzzing as a security assessment tool to analyze ICS-specific libraries and find potentially security-threatening conditions. We test our platform with commercial PLCs, showcasing potential threats with no interruption to the control process.</em></td> </tr> <tr> <td style="width:40px;">10:01</td> <td><a href="/date20/conference/session/IP2">IP2-15</a>, 814</td> <td><b>FANNET: FORMAL ANALYSIS OF NOISE TOLERANCE, TRAINING BIAS AND INPUT SENSITIVITY IN NEURAL NETWORKS</b><br /><b>Speaker</b>:<br />Mahum Naseer, TU Wien, AT<br /><b>Authors</b>:<br />Mahum Naseer<sup>1</sup>, Mishal Fatima Minhas<sup>2</sup>, Faiq Khalid<sup>1</sup>, Muhammad Abdullah Hanif<sup>3</sup>, Osman Hasan<sup>4</sup> and Muhammad Shafique<sup>5</sup><br /><sup>1</sup>TU Wien, AT; <sup>2</sup>National University of Sciences and Technology (NUST), Islamabad, Pakistan, PK; <sup>3</sup>Institute of Computer Engineering, Vienna University of Technology, AT; <sup>4</sup>NUST, PK; <sup>5</sup>Vienna University of Technology (TU Wien), AT<br /><em><b>Abstract</b><br />With a constant improvement in the network architectures and training methodologies, Neural Networks (NNs) are increasingly being deployed in real-world Machine Learning systems. However, despite their impressive performance on "known inputs", these NNs can fail absurdly on "unseen inputs", especially if these real-time inputs deviate from the training dataset distributions or contain certain types of input noise. This indicates the low noise tolerance of NNs, which is a major reason for the recent increase in adversarial attacks. This is a serious concern, particularly for safety-critical applications, where inaccurate results lead to dire consequences. We propose a novel methodology that leverages model checking for the Formal Analysis of Neural Networks (FANNet) under different input noise ranges.
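FANNet itself relies on model checking; as a simpler stand-in for the same question, the sketch below uses interval bound propagation to bound how far a tiny ReLU network's outputs can drift when every input is perturbed within a noise range (the weights and ranges are arbitrary examples, not the paper's network):

```python
# Interval bound propagation through a tiny ReLU network: bound each
# output for all inputs within +/- eps, then check whether the top
# class can change. Weights and ranges are arbitrary illustrations.
import numpy as np

def affine_bounds(W, b, lo, hi):
    Wp, Wn = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x, eps = np.array([0.2, 0.5, 0.1]), 0.11           # e.g., an 11% noise range
lo, hi = affine_bounds(W1, b1, x - eps, x + eps)
lo, hi = np.maximum(lo, 0.0), np.maximum(hi, 0.0)  # ReLU is monotone
lo, hi = affine_bounds(W2, b2, lo, hi)

# class 0 is certified for the whole range iff its worst case still
# beats class 1's best case
print("noise-tolerant:", bool(lo[0] > hi[1]))
```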
Our methodology allows us to rigorously analyze the noise tolerance of NNs, their input node sensitivity, and the effects of training bias on their performance, e.g., in terms of classification accuracy. For evaluation, we use a feed-forward fully-connected NN architecture trained for leukemia classification. Our experimental results show 11% noise tolerance for the given trained network, identify the most sensitive input nodes, confirm the bias of the available training dataset, and indicate that, for larger noise ranges, the proposed methodology is far more rigorous than validation testing while remaining comparable in time and computational resources.</em></td> </tr> <tr> <td>10:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="5.6">5.6 Logic synthesis towards fast, compact, and secure designs</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 08:30 - 10:00<br /><b>Location / Room:</b> Lesdiguières</p> <p><b>Chair:</b><br />Valeria Bertacco, University of Michigan, US</p> <p><b>Co-Chair:</b><br />Lukas Sekanina, Brno University of Technology, CZ</p> <p>The logic synthesis family is growing. While traditional optimization goals such as area and delay are still very important in today's design automation, new applications require improvement of aspects such as security or power consumption. This session showcases various algorithms addressing both emerging and traditional optimization goals. An algorithm is proposed for cryptographic applications which reduces the multiplicative complexity, thereby making designs less vulnerable to attacks. A synthesis method saves power by cleverly converting flip-flops to latches. Approximation and bi-decomposition techniques are used in an area optimization strategy. Finally, a methodology for design minimization in advanced technology nodes is presented that takes both wire congestion and coupling effects into account.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>5.6.1</td> <td><b>A LOGIC SYNTHESIS TOOLBOX FOR REDUCING THE MULTIPLICATIVE COMPLEXITY IN LOGIC NETWORKS</b><br /><b>Speaker</b>:<br />Eleonora Testa, EPFL, CH<br /><b>Authors</b>:<br />Eleonora Testa<sup>1</sup>, Mathias Soeken<sup>1</sup>, Heinz Riener<sup>1</sup>, Luca Amaru<sup>2</sup> and Giovanni De Micheli<sup>3</sup><br /><sup>1</sup>EPFL, CH; <sup>2</sup>Synopsys Inc., US; <sup>3</sup>École Polytechnique Fédérale de Lausanne, CH<br /><em><b>Abstract</b><br />Logic synthesis is a fundamental step in the realization of modern integrated circuits. It has traditionally been employed for the optimization of CMOS-based designs, as well as for emerging technologies and quantum computing. Recently, it found application in minimizing the number of AND gates in cryptography benchmarks represented as xor-and graphs (XAGs). The number of AND gates in an XAG, which is called the logic network's multiplicative complexity, plays a critical role in various cryptography and security protocols such as fully homomorphic encryption (FHE) and secure multi-party computation (MPC). Further, the number of AND gates is also important to assess the degree of vulnerability of a Boolean function, and influences the cost of techniques to protect against side-channel attacks.
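A minimal, concrete illustration of what reducing multiplicative complexity means in an XAG (the functions are toy examples, not the toolbox's transformations): the distributivity rewrite below trades two AND gates for one, checked exhaustively.

```python
# Distributivity rewrite on an xor-and graph (XAG):
# (a AND b) XOR (a AND c) == a AND (b XOR c),
# trading two AND gates for one. Checked exhaustively below.
from itertools import product

for a, b, c in product((0, 1), repeat=3):
    lhs = (a & b) ^ (a & c)   # multiplicative complexity 2
    rhs = a & (b ^ c)         # multiplicative complexity 1
    assert lhs == rhs
print("equivalent: AND count reduced from 2 to 1")
```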
Until now, however, no complete logic synthesis flow for reducing the multiplicative complexity of logic networks has existed, and prior approaches relied heavily on manual manipulation. In this paper, we present a logic synthesis toolbox for cryptography and security applications. The proposed tool consists of powerful transformations, namely resubstitution, refactoring, and rewriting, specifically designed to minimize the multiplicative complexity of an XAG. Our flow is fully automatic and achieves significant results over both EPFL benchmarks and cryptography circuits. We improve the best-known results for cryptography by up to 59%, resulting in a normalized geometric mean of 0.82.</em></td> </tr> <tr> <td>09:00</td> <td>5.6.2</td> <td><b>SAVING POWER BY CONVERTING FLIP-FLOP TO 3-PHASE LATCH-BASED DESIGNS</b><br /><b>Speaker</b>:<br />Peter Beerel, University of Southern California, US<br /><b>Authors</b>:<br />Huimei Cheng, Xi Li, Yichen Gu and Peter Beerel, University of Southern California, US<br /><em><b>Abstract</b><br />Latches are smaller and lower power than flip-flops (FFs) and are typically used in a time-borrowing master-slave configuration. This paper presents an automatic flow for converting arbitrarily-complex single-clock-domain FF-based RTL designs to efficient 3-phase latch-based designs with a reduced number of required latches, saving both register and clock-tree power. Post place-and-route results demonstrate that our 3-phase latch-based designs save an average of 15.5% and 18.5% power on a variety of ISCAS, CEP, and CPU benchmark circuits, compared to their more traditional FF- and master-slave-based alternatives.</em></td> </tr> <tr> <td>09:30</td> <td>5.6.3</td> <td><b>COMPUTING THE FULL QUOTIENT IN BI-DECOMPOSITION BY APPROXIMATION</b><br /><b>Speaker</b>:<br />Valentina Ciriani, University of Milan, IT<br /><b>Authors</b>:<br />Anna Bernasconi<sup>1</sup>, Valentina Ciriani<sup>2</sup>, Jordi Cortadella<sup>3</sup> and Tiziano Villa<sup>4</sup><br /><sup>1</sup>Università di Pisa, IT; <sup>2</sup>Universita' degli Studi di Milano, IT; <sup>3</sup>UPC, ES; <sup>4</sup>Università di Verona, IT<br /><em><b>Abstract</b><br />Bi-decomposition is a design technique widely used to realize logic functions by the composition of simpler components. It can be seen as a form of Boolean division, where a given function is split into a divisor and quotient (and a remainder, if needed). The key questions are how to find a good divisor and then how to compute the quotient. In this paper we choose as divisor an approximation of the given function, and characterize the incompletely specified function which describes the full flexibility for the quotient.
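A minimal sketch of this characterization for an AND-like operator, with toy functions standing in for the paper's general construction: the quotient is pinned to f wherever the divisor evaluates to 1 and enjoys full don't-care flexibility wherever it evaluates to 0.

```python
# Boolean division with an AND-like operator: choosing divisor g, the
# quotient must equal f where g = 1 and is a don't care where g = 0.
# f and g are toy functions picked for illustration.
from itertools import product

f = lambda a, b, c: a & b & c     # function to bi-decompose
g = lambda a, b, c: a & b         # divisor approximating f

quotient = {}
for v in product((0, 1), repeat=3):
    if g(*v):
        quotient[v] = f(*v)       # specified: must agree with f
    else:
        assert f(*v) == 0         # AND-composition requires f <= g
        quotient[v] = "-"         # don't care: full flexibility
print(quotient)                   # e.g., q(a,b,c) = c is one legal quotient
```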
We conclude by reporting preliminary experiments for bi-decomposition based on two AND-like operators with a divisor approximation from 1 to 0, and discuss the impact of the approximation error rate on the final area of the components in the case of synthesis by three-level XOR-AND-OR forms.</em></td> </tr> <tr> <td>09:45</td> <td>5.6.4</td> <td><b>MINIDELAY: MULTI-STRATEGY TIMING-AWARE LAYER ASSIGNMENT FOR ADVANCED TECHNOLOGY NODES</b><br /><b>Speaker</b>:<br />Xinghai Zhang, Fuzhou University, CN<br /><b>Authors</b>:<br />Xinghai Zhang<sup>1</sup>, Zhen Zhuang<sup>1</sup>, Genggeng Liu<sup>1</sup>, Xing Huang<sup>2</sup>, Wen-Hao Liu<sup>3</sup>, Wenzhong Guo<sup>1</sup> and Ting-Chi Wang<sup>2</sup><br /><sup>1</sup>Fuzhou University, CN; <sup>2</sup>National Tsing Hua University, TW; <sup>3</sup>Cadence Design Systems, US<br /><em><b>Abstract</b><br />Layer assignment, a major step in global routing of integrated circuits, is usually performed to assign segments of nets to multiple layers. Besides the traditional optimization goals such as overflow and via count, interconnect delay plays an important role in determining chip performance and has been attracting much attention in recent years. Accordingly, in this paper, we propose MiniDelay, a timing-aware layer assignment algorithm to minimize delay for advanced technology nodes, taking both wire congestion and coupling effect into account. MiniDelay consists of the following three key techniques: 1) a non-default-rule routing technique is adopted to reduce the delay of timing-critical nets, 2) an effective congestion assessment method is proposed to optimize delay of nets and via count simultaneously, and 3) a net scalpel technique is proposed to further reduce the maximum delay of nets, so that the chip performance can be improved in a global manner. Experimental results on multiple benchmarks confirm that the proposed algorithm leads to lower delay and fewer vias, while achieving the best solution quality among the existing algorithms with the shortest runtime.</em></td> </tr> <tr> <td style="width:40px;">10:00</td> <td><a href="/date20/conference/session/IP2">IP2-16</a>, 932</td> <td><b>A SCALABLE MIXED SYNTHESIS FRAMEWORK FOR HETEROGENEOUS NETWORKS</b><br /><b>Speaker</b>:<br />Max Austin, University of Utah, US<br /><b>Authors</b>:<br />Max Austin<sup>1</sup>, Scott Temple<sup>1</sup>, Walter Lau Neto<sup>1</sup>, Luca Amaru<sup>2</sup>, Xifan Tang<sup>1</sup> and Pierre-Emmanuel Gaillardon<sup>1</sup><br /><sup>1</sup>University of Utah, US; <sup>2</sup>Synopsys, US<br /><em><b>Abstract</b><br />We present a new logic synthesis framework which produces efficient post-technology-mapped results on heterogeneous networks containing a mix of different types of logic. The framework accomplishes this by breaking down the circuit into sections using a hypergraph k-way partitioner and then determining the best-fit logic representation for each partition between two Boolean networks, And-Inverter Graphs (AIGs) and Majority-Inverter Graphs (MIGs), each of which has been shown to outperform the other on different types of logic.
Experimental results show that over a set of OpenPiton Design Benchmarks (OPDB) and OpenCores benchmarks, our proposed methodology outperforms state-of-the-art academic tools in Area-Delay Product (ADP), Power-Delay Product (PDP), and Energy-Delay Product (EDP) by 5%, 2%, and 15%, respectively, after Application-Specific Integrated Circuit (ASIC) technology mapping, while also showing a 54% improvement in runtime over conventional MIG optimization.</em></td> </tr> <tr> <td style="width:40px;">10:01</td> <td><a href="/date20/conference/session/IP2">IP2-17</a>, 456</td> <td><b>DISCERN: DISTILLING STANDARD CELLS FOR EMERGING RECONFIGURABLE NANOTECHNOLOGIES</b><br /><b>Speaker</b>:<br />Shubham Rai, Technische Universität Dresden, DE<br /><b>Authors</b>:<br />Shubham Rai<sup>1</sup>, Michael Raitza<sup>2</sup>, Siva Satyendra Sahoo<sup>1</sup> and Akash Kumar<sup>1</sup><br /><sup>1</sup>Technische Universität Dresden, DE; <sup>2</sup>TU Dresden and CfAED, DE<br /><em><b>Abstract</b><br />Logic gates and circuits based on reconfigurable nanotechnologies demonstrate runtime-reconfigurability, where a single logic gate can exhibit more than one functionality. Recent attempts on circuits based on emerging reconfigurable nanotechnologies have primarily focused on using the traditional CMOS design flow involving similar-styled standard-cells. These CMOS-centric standard-cells fail to utilize the exciting properties offered by these nanotechnologies. In the present work, we explore the Boolean properties that define the reconfigurable properties of a logic gate. By analyzing the truth-table in detail, we find that there is a common Boolean rule which dictates why a logic gate is reconfigurable. Such logic gates can be efficiently implemented using reconfigurable nanotechnologies. We propose an algorithm which analyses the truth-tables of nodes in a circuit to list all such potential reconfigurable logic gates for a particular circuit. Technology mapping with these new logic gates (or standard-cells) leads to a better mapping in terms of area and delay. Experiments employing our methodology over EPFL benchmarks show average improvements of around 13%, 16%, and 11.5% in terms of area, number of edges, and delay, respectively, compared to the conventional CMOS-centric standard-cell based mapping.</em></td> </tr> <tr> <td>10:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="5.7">5.7 Stochastic Computing</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 08:30 - 10:00<br /><b>Location / Room:</b> Berlioz</p> <p><b>Chair:</b><br />Robert Wille, Johannes Kepler Universität Linz, AT</p> <p><b>Co-Chair:</b><br />Shigeru Yamashita, Ritsumeikan, JP</p> <p>Stochastic computing uses random bitstreams to reduce computational and area costs of a general class of Boolean operations, including arithmetic addition and multiplication.
This session considers stochastic computing from a model, accuracy, and applications perspective, presenting papers that span from models of pseudo-random number generators, to accuracy analysis of stochastic circuits, to novel applications for signal processing tasks.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>5.7.1</td> <td><b>THE HYPERGEOMETRIC DISTRIBUTION AS A MORE ACCURATE MODEL FOR STOCHASTIC COMPUTING</b><br /><b>Speaker</b>:<br />Timothy Baker, University of Michigan, US<br /><b>Authors</b>:<br />Timothy Baker and John Hayes, University of Michigan, US<br /><em><b>Abstract</b><br />A fundamental assumption in stochastic computing (SC) is that bit-streams are generally well-approximated by a Bernoulli process, i.e., a sequence of independent 0-1 choices. We show that this assumption is flawed in unexpected and significant ways for some bit-streams such as those produced by a typical LFSR-based stochastic number generator (SNG). In particular, the Bernoulli assumption leads to a surprising overestimation of output errors and how they vary with input changes. We then propose a more accurate model for such bit-streams based on the hypergeometric distribution and examine its implications for several SC applications. First, we explore the effect of correlation on a mux-based stochastic adder and show that, contrary to what was previously thought, it is not entirely correlation insensitive. Further, inspired by the hypergeometric model, we introduce a new mux tree adder that offers major area savings and accuracy improvement. The effectiveness of this study is validated on a large image processing circuit which achieves an accuracy improvement of 32%, combined with a reduction in overall circuit area.</em></td> </tr> <tr> <td>09:00</td> <td>5.7.2</td> <td><b>ACCURACY ANALYSIS FOR STOCHASTIC CIRCUITS WITH D-FLIP FLOP INSERTION</b><br /><b>Speaker</b>:<br />Kuncai Zhong, University of Michigan-Shanghai Jiao Tong University Joint Institute, CN<br /><b>Authors</b>:<br />Kuncai Zhong and Weikang Qian, Shanghai Jiao Tong University, CN<br /><em><b>Abstract</b><br />One of the challenges stochastic computing (SC) faces is the high cost of stochastic number generators (SNGs). A solution is to insert D flip-flops (DFFs) into the circuit. However, this affects the accuracy of the stochastic circuits, and it is crucial to capture this effect. In this work, we propose an efficient method to analyze the accuracy of stochastic circuits with DFFs inserted. Furthermore, given the importance of multiplication, we apply this method to analyze stochastic multipliers with DFFs inserted. Several interesting claims are obtained about the use of probability conversion circuits. For example, using a weighted binary generator is more accurate than using a comparator. The experimental results confirm the correctness of the proposed method and the claims. Furthermore, the proposed method is up to 560× faster than the simulation-based method.</em></td> </tr> <tr> <td>09:30</td> <td>5.7.3</td> <td><b>DYNAMIC STOCHASTIC COMPUTING FOR DIGITAL SIGNAL PROCESSING APPLICATIONS</b><br /><b>Speaker</b>:<br />Jie Han, University of Alberta, CA<br /><b>Authors</b>:<br />Siting Liu and Jie Han, University of Alberta, CA<br /><em><b>Abstract</b><br />Stochastic computing (SC) utilizes a random binary bit stream to encode a number by counting the frequency of 1's in the stream (or sequence).
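The encoding just described can be sketched in a few lines (the stream length and operand values below are arbitrary illustrations):

```python
# Unipolar SC: a value p in [0,1] becomes a bitstream whose fraction
# of 1's is p; a single AND gate then multiplies two independent
# streams. The stream length of 4096 is an arbitrary choice.
import random

def sng(p, n):                    # stochastic number generator
    return [random.random() < p for _ in range(n)]

def decode(stream):
    return sum(stream) / len(stream)

n = 4096
x, y = sng(0.5, n), sng(0.25, n)
prod = [xi & yi for xi, yi in zip(x, y)]   # bit-wise AND = multiply
print(decode(prod))                        # close to 0.5 * 0.25 = 0.125
```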
Typically, a small circuit is used to perform a bit-wise logic operation on the stochastic sequences, which leads to significant hardware and power savings. Energy efficiency, however, is a challenge for SC due to the long sequences required for accurately encoding numbers. To overcome this challenge, we consider using a stochastic sequence to encode a continuously variable signal instead of a number, to achieve higher accuracy, higher energy efficiency, and greater flexibility. Specifically, one single bit is used to encode a sample from a signal for efficient processing. Such a sequence encodes continuously varying values, so it is referred to as a dynamic stochastic sequence (DSS). The DSS enables the use of SC circuits to efficiently perform tasks such as frequency mixing and function estimation. It is shown that such a dynamic SC (DSC) system achieves savings of up to 98.4% in energy and up to 96.8% in time with slightly higher accuracy compared to conventional SC. It also achieves energy and time savings of up to 60% compared to a fixed-width binary implementation.</em></td> </tr> <tr> <td style="width:40px;">10:00</td> <td><a href="/date20/conference/session/IP2">IP2-18</a>, 437</td> <td><b>A 16×128 STOCHASTIC-BINARY PROCESSING ELEMENT ARRAY FOR ACCELERATING STOCHASTIC DOT-PRODUCT COMPUTATION USING 1-16 BIT-STREAM LENGTH</b><br /><b>Speaker</b>:<br />Hyunjoon Kim, Nanyang Technological University, SG<br /><b>Authors</b>:<br />Qian Chen, Yuqi Su, Hyunjoon Kim, Taegeun Yoo, Tony Tae-Hyoung Kim and Bongjin Kim, Nanyang Technological University, SG<br /><em><b>Abstract</b><br />This work presents a 16×128 array of stochastic-binary processing elements for energy/area-efficient processing of artificial neural networks. A processing element (PE) with all-digital components consists of an XNOR gate as a bipolar stochastic multiplier and an 8bit binary adder with 8× registers for accumulating partial-sums. The PE array comprises 16× dot-product units, each with 128 PEs cascaded in a single row. The latency and energy of the proposed dot-product unit are minimized by reducing the number of bit-streams required for minimizing the accuracy degradation induced by the approximate stochastic computing. A 128-input dot-product operation requires a bit-stream length (N) of 1-to-16, which is two orders of magnitude smaller than the baseline stochastic computation using MUX-based adders. The simulated dot-product error is 6.9-to-1.5% for N=1-to-16, while the error from the baseline stochastic method is 5.9-to-1.7% with N=128-to-2048. The mean MNIST classification accuracy is 96.11% (which is 1.19% lower than 8b binary) using a three-layer MLP at N=16. The measured energy from a 65nm test-chip is 10.04pJ per dot-product, and the energy efficiency is 25.5TOPS/W at N=16.</em></td> </tr> <tr> <td style="width:40px;">10:01</td> <td><a href="/date20/conference/session/IP2">IP2-19</a>, 599</td> <td><b>TOWARDS EXPLORING THE POTENTIAL OF ALTERNATIVE QUANTUM COMPUTING ARCHITECTURES</b><br /><b>Speaker</b>:<br />Arighna Deb, Kalinga Institute of Industrial Technology, IN<br /><b>Authors</b>:<br />Arighna Deb<sup>1</sup>, Gerhard W.
Dueck<sup>2</sup> and Robert Wille<sup>3</sup><br /><sup>1</sup>Kalinga Institute of Industrial Technology, IN; <sup>2</sup>University of New Brunswick, CA; <sup>3</sup>Johannes Kepler Universität Linz, AT<br /><em><b>Abstract</b><br />The recent advances in the physical realization of Noisy Intermediate Scale Quantum (NISQ) computers have motivated research on design automation that allows users to execute quantum algorithms on them. Certain physical constraints in the architectures restrict how logical qubits used to describe the algorithm can be mapped to physical qubits used to realize the corresponding functionality. Thus far, this has been addressed by inserting additional operations in order to overcome the physical constraints. However, all these approaches have taken the existing architectures as invariant and did not explore the potential of changing the quantum architecture itself—a valid option as long as the underlying physical constraints remain satisfied. In this work, we propose initial ideas to explore this potential. More precisely, we introduce several schemes for the generation of alternative coupling graphs (and, by this, quantum computing architectures) that still might be able to satisfy physical constraints but, at the same time, allow for a more efficient realization of the desired quantum functionality.</em></td> </tr> <tr> <td style="width:40px;">10:02</td> <td><a href="/date20/conference/session/IP2">IP2-20</a>, 719</td> <td><b>ACCELERATING QUANTUM APPROXIMATE OPTIMIZATION ALGORITHM USING MACHINE LEARNING</b><br /><b>Speaker</b>:<br />Swaroop Ghosh, Pennsylvania State University, US<br /><b>Authors</b>:<br />Mahabubul Alam, Abdullah Ash-Saki and Swaroop Ghosh, Pennsylvania State University, US<br /><em><b>Abstract</b><br />We propose a machine learning-based approach to accelerate quantum approximate optimization algorithm (QAOA) implementations; QAOA is a promising quantum-classical hybrid algorithm intended to demonstrate so-called quantum supremacy. In QAOA, a parametric quantum circuit and a classical optimizer iterate in a closed loop to solve hard combinatorial optimization problems. The performance of QAOA improves with an increasing number of stages (depth) in the quantum circuit. However, each added stage introduces two new parameters for the classical optimizer, increasing the number of optimization loop iterations. We note a correlation among parameters of the lower-depth and the higher-depth QAOA implementations and exploit it by developing a machine learning model to predict the gate parameters close to the optimal values. As a result, the optimization loop converges in fewer iterations. We choose the graph MaxCut problem as a prototype to solve using QAOA. We perform a feature extraction routine using 100 different QAOA instances and develop a training data-set with 13,860 optimal parameters. We present our analysis for 4 flavors of regression models and 4 flavors of classical optimizers.
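The parameter-prediction idea can be sketched as follows; the training pairs below are synthetic stand-ins for the paper's dataset, and the assumed inter-depth correlation is purely illustrative:

```python
# Learn a map from optimal depth-p QAOA angles to good depth-(p+1)
# starting angles so the optimizer converges in fewer iterations.
# The training pairs are synthetic stand-ins, not the paper's data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
low = rng.uniform(0, np.pi, size=(200, 2))      # (gamma, beta) at depth p
high = np.hstack([0.9 * low, 1.1 * low])        # assumed correlation
high += rng.normal(0.0, 0.05, high.shape)       # plus noise

predictor = LinearRegression().fit(low, high)
warm_start = predictor.predict(low[:1])         # depth-(p+1) initial angles
print(warm_start)
```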
Finally, we show that the proposed approach can curtail the number of optimization iterations by 44.9% on average (up to 65.7%) in an analysis performed with 264 flavors of graphs.</em></td> </tr> <tr> <td>10:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="5.8">5.8 Special Session: HLS for AI HW</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 08:30 - 10:00<br /><b>Location / Room:</b> Exhibition Theatre</p> <p><b>Chair:</b><br />Massimo Cecchetti, Mentor, A Siemens Business, US</p> <p><b>Co-Chair:</b><br />Astrid Ernst, Mentor, A Siemens Business, US</p> <p>One of the fastest growing areas of hardware and software design is artificial intelligence (AI)/machine learning (ML), fueled by the demand for more autonomous systems like self-driving vehicles and voice recognition for personal assistants. Many of these algorithms rely on convolutional neural networks (CNNs) to implement deep learning systems. While the concept of convolution is relatively straightforward, the application of CNNs to the ML domain has yielded dozens of different neural network approaches. While these networks can be executed in software on CPUs/GPUs, the power requirements of these solutions make them impractical for most inferencing applications, the majority of which involve portable, low-power devices. To improve the power/performance, hardware teams are forming to create ML hardware acceleration blocks. However, the process of taking any one of these compute-intensive networks into hardware, especially energy-efficient hardware, is a time-consuming process if the team employs a traditional RTL design flow. Consider all of these interdependent activities using a traditional flow: • Expressing the algorithm correctly in RTL. • Choosing the optimal bit-widths for kernel weights and local storage to meet the memory budget. • Designing the microarchitecture to have a low enough latency to be practical for the target application, while determining how the accelerator communicates across the system bus without killing the latency the team just fought for. • Verifying the algorithm early on and throughout the implementation process, especially in the context of the entire system. • Optimizing for power for mobile devices. • Getting the product to market on time. This domain is in desperate need of a productivity-boosting methodology shift away from an RTL flow.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>5.8.1</td> <td><b>INTRODUCTION TO HLS CONCEPTS, OPEN-SOURCE IP AND REFERENCE DESIGNS ENABLING BUILDING AI ACCELERATION HARDWARE</b><br /><b>Author</b>:<br />Mike Fingeroff, Mentor, A Siemens Business, US<br /><em><b>Abstract</b><br />HLS provides a hardware design solution for algorithm designers that generates high-quality RTL from C++ and/or SystemC descriptions that target ASIC, FPGA, or eFPGA implementations. By employing these elements of the HLS solution, teams can quickly develop quality high-performance, low-power hardware implementations: • Enables late-stage changes. Easily change C++ algorithms at any time and regenerate RTL code or target a new technology. • Rapidly explore options for power, performance, and area without changing source code.
• Reduce design and verification time from one year to a few months and add new features in days, not weeks, all using C/C++ code that contains 5x fewer lines of code than RTL.</em></td> </tr> <tr> <td>09:00</td> <td>5.8.2</td> <td><b>EARLY SOC PERFORMANCE VERIFICATION USING SYSTEMC WITH NVIDIA MATCHLIB AND HLS</b><br /><b>Author</b>:<br />Stuart Swan, Mentor, A Siemens Business, US<br /><em><b>Abstract</b><br />NVidia MatchLib is a new open-source library that enables much faster design and verification of SOCs using High-Level Synthesis. One of the primary objectives of MatchLib is to enable performance-accurate modeling of SOCs in SystemC/C++. With these models, designers can identify and resolve issues such as bus and memory contention, arbitration strategies, and optimal interconnect structure at a much higher level of abstraction than RTL. In addition, much of the system-level verification of the SOC can occur in SystemC/C++, before RTL is even created. This presentation will introduce NVidia MatchLib and its flow, and demonstrate its usage with Catapult HLS through examples. Key components of MatchLib: • Connections: synthesizable message-passing framework; SystemC/C++ used to accurately model the concurrent IO that the synthesized HW will have; automatic stall injection enables the interconnect to be stress-tested in SystemC. • Parameterized AXI4 fabric components: router/splitter; arbiter; AXI4 &lt;-&gt; AXI4Lite; automatic burst segmentation and last-bit generation. • Parameterized banked memories, crossbar, reorder buffer, cache. • Parameterized NOC components.</em></td> </tr> <tr> <td>09:30</td> <td>5.8.3</td> <td><b>CUSTOMER CASE STUDIES OF USING HLS FOR ULTRA-LOW POWER AI HARDWARE ACCELERATION</b><br /><b>Author</b>:<br />Ellie Burns, Mentor, A Siemens Business, US<br /><em><b>Abstract</b><br />This presentation will review 3 customer case studies where HLS has been used for designs and applications that use AI/ML-accelerated HW. All case studies are available as full customer-authored white papers that detail the design, the use of HLS, the design experience, and lessons learned. The 3 customer studies are: NVIDIA - High-productivity IC Design for Machine Learning Accelerators; FotoNation/Xperi - A Designer Life with HLS: Faster Computer Vision Neural Networks; Chips&amp;Media - Deep Learning Accelerator Using HLS.</em></td> </tr> <tr> <td>10:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="IP2">IP2 Interactive Presentations</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 10:00 - 11:00<br /><b>Location / Room:</b> Poster Area</p> <p>Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.</p> <table> <tr> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> <tr> <td style="width:40px;">IP2-1</td> <td><b>SAMPLING FROM DISCRETE DISTRIBUTIONS IN COMBINATIONAL HARDWARE WITH APPLICATION TO POST-QUANTUM CRYPTOGRAPHY</b><br /><b>Speaker</b>:<br />Michael Lyons, George Mason University, US<br /><b>Authors</b>:<br />Michael Lyons and Kris Gaj, George Mason University, US<br /><em><b>Abstract</b><br />Random values from discrete distributions are typically generated from uniformly-random samples.
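As a minimal sketch of the CDT-lookup inversion sampling discussed next (the pmf below is a toy discrete-Gaussian-like example, not one of the NIST PQC schemes):

```python
# Inversion sampling via a cumulative distribution table (CDT):
# uniformly-random bits are mapped through the table to a value of a
# small discrete distribution. The pmf is a toy example.
import random
from bisect import bisect_right

pmf = [1/16, 4/16, 6/16, 4/16, 1/16]                # binomial-shaped pmf
cdt = [sum(pmf[:i + 1]) for i in range(len(pmf))]   # cumulative table

def sample():
    u = random.getrandbits(16) / 2**16              # uniform bits in [0,1)
    return bisect_right(cdt, u)                     # CDT lookup (inversion)

print([sample() for _ in range(12)])
```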
A common technique is to use a cumulative distribution table (CDT) lookup for inversion sampling, but it is also possible to use Boolean functions to map a uniformly-random bit sequence into a value from a discrete distribution. This work presents a methodology for deriving such functions for any discrete distribution, encoding them in VHDL for implementation in combinational hardware, and (for moderate precision and sample space size) confirming the correctness of the produced distribution. The process is demonstrated using a discrete Gaussian distribution with a small sample space, but it is applicable to any discrete distribution with fixed parameters. Results are presented for sampling schemes from several submissions to the NIST PQC standardization process, comparing this method to CDT lookups on a Xilinx Artix-7 FPGA. The process produces compact solutions for distributions up to moderate size and precision.</em></td> </tr> <tr> <td style="width:40px;">IP2-2</td> <td><b>ON THE PERFORMANCE OF NON-PROFILED DIFFERENTIAL DEEP LEARNING ATTACKS AGAINST AN AES ENCRYPTION ALGORITHM PROTECTED USING A CORRELATED NOISE HIDING COUNTERMEASURE</b><br /><b>Speaker</b>:<br />Amir Alipour, Grenoble INP Esisar, IR<br /><b>Authors</b>:<br />Amir Alipour<sup>1</sup>, Athanasios Papadimitriou<sup>2</sup>, Vincent Beroulle<sup>3</sup>, Ehsan Aerabi<sup>3</sup> and David Hely<sup>3</sup><br /><sup>1</sup>University Grenoble Alpes, Grenoble INP ESISAR, LCIS Laboratory, FR; <sup>2</sup>University Grenoble Alpes, Grenoble INP ESISAR, ESYNOV, FR; <sup>3</sup>University Grenoble Alpes, Grenoble INP ESISAR, LSIC Laboratory, FR<br /><em><b>Abstract</b><br />Recent works in the field of cryptography focus on Deep Learning based Side Channel Analysis (DLSCA) as one of the most powerful attacks against common encryption algorithms such as AES. As a common case, profiling DLSCA has shown great capabilities in revealing secret cryptographic keys against the majority of AES implementations. In a very recent study, it has been shown that Deep Learning can be applied in a non-profiling way (non-profiling DLSCA), making this method considerably more practical and able to break powerful countermeasures for encryption algorithms such as AES, including masking countermeasures, requiring considerably fewer power traces than a first-order CPA attack. In this work, our main goal is to apply the non-profiling DLSCA against a hiding-based AES countermeasure which utilizes correlated noise generation so as to hide the secret encryption key. We show that this AES, with correlated noise generation as a lightweight countermeasure, can provide equivalent protection under CPA and under non-profiling DLSCA attacks, in terms of the power traces required to obtain the secret key.</em></td> </tr> <tr> <td style="width:40px;">IP2-3</td> <td><b>FAST AND ACCURATE PERFORMANCE EVALUATION FOR RISC-V USING VIRTUAL PROTOTYPES</b><br /><b>Speaker</b>:<br />Vladimir Herdt, Universität Bremen, DE<br /><b>Authors</b>:<br />Vladimir Herdt<sup>1</sup>, Daniel Grosse<sup>2</sup> and Rolf Drechsler<sup>2</sup><br /><sup>1</sup>Universität Bremen, DE; <sup>2</sup>Universität Bremen / DFKI GmbH, DE<br /><em><b>Abstract</b><br />RISC-V is gaining huge popularity, in particular for embedded systems.
Recently, a SystemC-based Virtual Prototype (VP) has been open sourced to lay the foundation for providing support for system-level use cases such as design space exploration, analysis of complex HW/SW interactions and power/timing/performance validation for RISC-V based systems. In this paper, we propose an efficient core timing model and integrate it into the VP core to enable fast and accurate performance evaluation for RISC-V based systems. As a case study, we provide a timing configuration matching the RISC-V HiFive1 board from SiFive. Our experiments demonstrate that our approach allows obtaining very accurate performance evaluation results while still retaining high simulation performance.</em></td> </tr> <tr> <td style="width:40px;">IP2-4</td> <td><b>AUTOMATED GENERATION OF LTL SPECIFICATIONS FOR SMART HOME IOT USING NATURAL LANGUAGE</b><br /><b>Speaker</b>:<br />Shiyu Zhang, State Key Laboratory of Novel Software Technology, Department of Computer Science and Technology, Nanjing University, CN<br /><b>Authors</b>:<br />Shiyu Zhang<sup>1</sup>, Juan Zhai<sup>1</sup>, Lei Bu<sup>1</sup>, Mingsong Chen<sup>2</sup>, Linzhang Wang<sup>1</sup> and Xuandong Li<sup>1</sup><br /><sup>1</sup>Nanjing University, CN; <sup>2</sup>East China Normal University, CN<br /><em><b>Abstract</b><br />Ordinary inexperienced users can build their smart home IoT system easily nowadays, but such user-customized systems could be error-prone. Using formal verification to prove the correctness of such systems is necessary. However, to conduct formal proof, formal specifications such as Linear Temporal Logic (LTL) formulas have to be provided, but ordinary users cannot author LTL formulas and can only provide natural language. To address this problem, this paper presents a novel approach that can automatically generate formal LTL specifications from natural language requirements based on domain knowledge and our proposed ambiguity refining techniques. Experimental results show that our approach can achieve a high correctness rate of 95.4% in converting natural language sentences into LTL formulas from 481 requirements of real examples.</em></td> </tr> <tr> <td style="width:40px;">IP2-5</td> <td><b>A HEAT-RECIRCULATION-AWARE VM PLACEMENT STRATEGY FOR DATA CENTERS</b><br /><b>Authors</b>:<br />Hao Feng<sup>1</sup>, Yuhui Deng<sup>2</sup> and Yi Zhou<sup>3</sup><br /><sup>1</sup>Jinan University, CN; <sup>2</sup>Chinese Academy of Sciences; Jinan University, CN; <sup>3</sup>Columbus State University, US<br /><em><b>Abstract</b><br />Data centers consist of a great number of IT devices (e.g., servers and switches) that generate a massive amount of heat. Due to the special arrangement of racks in the data center, heat recirculation often occurs between nodes. It can cause a sharp rise in equipment temperature coupled with local hot spots in data centers. Existing VM placement strategies can minimize energy consumption of data centers by optimizing resource allocation in terms of multiple physical resources (e.g., memory, bandwidth, and CPU). However, existing strategies ignore the role of heat recirculation in the data center. To address this problem, in this study, we propose a heat-recirculation-aware VM placement strategy and design a Simulated Annealing Based Algorithm (SABA) to lower the energy consumption of data centers. Different from the existing SA algorithm, SABA optimizes the distribution of the initial solution and the iteration scheme.
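A minimal sketch of such a simulated-annealing placement loop follows; the heat-recirculation cost term and cooling schedule are invented stand-ins for SABA's tuned versions:

```python
# Simulated annealing over VM-to-server placements with a toy
# heat-recirculation term; cost model and cooling schedule are
# invented stand-ins for SABA's tuned versions.
import math, random

SERVERS, VMS = 8, 20
recirc = [[0.02 * abs(i - j) for j in range(SERVERS)] for i in range(SERVERS)]

def energy(placement):
    load = [placement.count(s) for s in range(SERVERS)]
    direct = sum(10 + 5 * l for l in load if l)        # active-server power
    heat = sum(recirc[i][j] * load[i] * load[j]
               for i in range(SERVERS) for j in range(SERVERS))
    return direct + heat

place = [random.randrange(SERVERS) for _ in range(VMS)]
temp = 100.0
while temp > 0.1:
    cand = list(place)
    cand[random.randrange(VMS)] = random.randrange(SERVERS)
    delta = energy(cand) - energy(place)
    if delta < 0 or random.random() < math.exp(-delta / temp):
        place = cand                 # accept improvement, or uphill move
    temp *= 0.95                     # geometric cooling
print(round(energy(place), 1), place)
```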
We quantitatively evaluate SABA's performance in terms of algorithm efficiency, activated servers, and energy saving against the XINT-GA algorithm (a thermal-aware task scheduling strategy), FCFS (First-Come First-Served), and SA. Experimental results indicate that our heat-recirculation-aware VM placement strategy provides a powerful solution for improving the energy efficiency of data centers.</em></td> </tr> <tr> <td style="width:40px;">IP2-6</td> <td><b>ENERGY OPTIMIZATION IN NCFET-BASED PROCESSORS</b><br /><b>Authors</b>:<br />Sami Salamin<sup>1</sup>, Martin Rapp<sup>1</sup>, Hussam Amrouch<sup>1</sup>, Andreas Gerstlauer<sup>2</sup> and Joerg Henkel<sup>1</sup><br /><sup>1</sup>Karlsruhe Institute of Technology, DE; <sup>2</sup>University of Texas, Austin, US<br /><em><b>Abstract</b><br />Energy consumption is a key optimization goal for all modern processors. Negative Capacitance Field-Effect Transistors (NCFETs) are a leading emerging technology that promises outstanding performance in addition to better energy efficiency. The thickness of the additional ferroelectric layer, frequency, and voltage are the key parameters in NCFET technology that impact the power and frequency of processors. However, their joint impact on energy optimization has not been investigated yet. In this work, we are the first to demonstrate that conventional (i.e., NCFET-unaware) dynamic voltage/frequency scaling (DVFS) techniques to minimize energy are sub-optimal when applied to NCFET-based processors. We further demonstrate that state-of-the-art NCFET-aware voltage scaling for power minimization is also sub-optimal when it comes to energy. This work provides the first NCFET-aware DVFS technique that optimizes the processor's energy through optimal runtime frequency/voltage selection. In NCFETs, energy-optimal frequency and voltage are dependent on the workload and technology parameters. Our NCFET-aware DVFS technique considers these effects to perform optimal voltage/frequency selection at runtime depending on workload characteristics. Results show up to 90% energy savings compared to conventional DVFS techniques. Compared to state-of-the-art NCFET-aware power management, our technique provides up to 72% energy savings along with 3.7x higher performance.</em></td> </tr> <tr> <td style="width:40px;">IP2-7</td> <td><b>TOWARDS A MODEL-BASED MULTI-OBJECTIVE OPTIMIZATION APPROACH FOR SAFETY-CRITICAL REAL-TIME SYSTEMS</b><br /><b>Speaker</b>:<br />Emmanuel Grolleau, LIAS / ISAE-ENSMA, FR<br /><b>Authors</b>:<br />Soulimane Kamni<sup>1</sup>, Yassine Ouhammou<sup>2</sup>, Antoine Bertout<sup>3</sup> and Emmanuel Grolleau<sup>4</sup><br /><sup>1</sup>LIAS/ENSMA, FR; <sup>2</sup>LIAS / ISAE-ENSMA, FR; <sup>3</sup>LIAS, Université de Poitiers, ISAE-ENSMA, FR; <sup>4</sup>LIAS, ISAE-ENSMA, Universite de Poitiers, FR<br /><em><b>Abstract</b><br />In the safety-critical real-time systems domain, obtaining an appropriate operational model which meets the temporal (e.g. deadlines) and business (e.g. redundancy) requirements while being optimal in terms of several metrics is an essential process in the design life-cycle. Recently, several research efforts have proposed to explore cross-domain trade-offs for higher behaviour performance. Indeed, this process represents the first step in the deployment phase, which is very sensitive because it could be error-prone and time-consuming.
This work-in-progress paper proposes an approach that aims to help real-time system architects benefit from existing works, overcome their limits, and capitalize on those efforts. Furthermore, the approach is based on the model-driven engineering paradigm and eases the usage of methods and tools through repositories that gather them as a form of shared knowledge.</em></td> </tr> <tr> <td style="width:40px;">IP2-8</td> <td><b>CURRENT-MODE CARRY-FREE MULTIPLIER DESIGN USING A MEMRISTOR-TRANSISTOR CROSSBAR ARCHITECTURE</b><br /><b>Speaker</b>:<br />Shengqi Yu, Newcastle University, GB<br /><b>Authors</b>:<br />Shengqi Yu<sup>1</sup>, Ahmed Soltan<sup>2</sup>, Rishad Shafik<sup>3</sup>, Thanasin Bunnam<sup>3</sup>, Domenico Balsamo<sup>3</sup>, Fei Xia<sup>3</sup> and Alex Yakovlev<sup>3</sup><br /><sup>1</sup>Newcastle University, GB; <sup>2</sup>Nile University, EG; <sup>3</sup>Newcastle University, GB<br /><em><b>Abstract</b><br />Traditional multipliers consist of complex logic components. They are a major energy and performance contributor of modern compute-intensive applications. As such, designing multipliers with reduced energy and faster speed has remained a thoroughgoing challenge. This paper presents a novel, carry-free multiplier, which is suitable for the new generation of energy-constrained applications. The multiplier circuit consists of an array of memristor-transistor cells that can be selected (i.e., turned ON or OFF) using a combination of DC bias voltages based on the operand values. When a cell is selected, it contributes current to the array path, which is then amplified by current mirrors with variable transistor gate sizes. The different current paths are connected to a node for analogously accumulating the currents to produce the multiplier output directly, removing the carry propagation stages typically seen in traditional digital multipliers. An essential feature of this multiplier is autonomous survivability, i.e., when the supply power drops below a threshold, the logic state is automatically retained at zero cost due to the non-volatile properties of memristors.</em></td> </tr> <tr> <td style="width:40px;">IP2-9</td> <td><b>N-BIT DATA PARALLEL SPIN WAVE LOGIC GATE</b><br /><b>Speaker</b>:<br />Abdulqader Mahmoud, Delft University of Technology, NL<br /><b>Authors</b>:<br />Abdulqader Mahmoud<sup>1</sup>, Frederic Vanderveken<sup>2</sup>, Florin Ciubotaru<sup>2</sup>, Christoph Adelmann<sup>2</sup>, Sorin Cotofana<sup>1</sup> and Said Hamdioui<sup>1</sup><br /><sup>1</sup>Delft University of Technology, NL; <sup>2</sup>IMEC, BE<br /><em><b>Abstract</b><br />Due to their very nature, Spin Waves (SWs) created in the same waveguide, but with different frequencies, can coexist while selectively interacting with their own species only. The absence of inter-frequency interference isolates input data sets encoded in SWs with different frequencies and creates the premises for simultaneous data-parallel SW-based processing without hardware replication or delay overhead. In this paper we leverage this SW property by introducing a novel computation paradigm, which allows for the parallel processing of n-bit input data vectors on the same basic SW-based logic gate. Subsequently, to demonstrate the proposed concept, we present an 8-bit parallel 3-input Majority gate implementation and validate it by means of Object Oriented MicroMagnetic Framework (OOMMF) simulations.
To evaluate the potential benefit of our proposal, we compare the 8-bit data-parallel gate with an equivalent scalar SW gate implementation. Our evaluation indicates that the 8-bit data-parallel 3-input Majority gate requires 4.16x less area than its scalar SW gate equivalent while preserving the same delay and energy consumption figures.</em></td> </tr> <tr> <td style="width:40px;">IP2-10</td> <td><b>HIGH-SPEED ANALOG SIMULATION OF CMOS VISION CHIPS USING EXPLICIT INTEGRATION TECHNIQUES ON MANY-CORE PROCESSORS</b><br /><b>Speaker</b>:<br />Tom Kazmierski, University of Southampton, GB<br /><b>Authors</b>:<br />Gines Domenech-Asensi<sup>1</sup> and Tom J Kazmierski<sup>2</sup><br /><sup>1</sup>Universidad Politecnica de Cartagena, ES; <sup>2</sup>University of Southampton, GB<br /><em><b>Abstract</b><br />This work describes a high-speed simulation technique for analog circuits which is based on the use of state-space equations and an explicit integration method parallelised on a multiprocessor architecture. The integration step of such a method is smaller than the one required by an implicit simulation technique based on Newton-Raphson iterations. However, given that explicit methods do not require the computation of time-consuming matrix factorizations, the overall simulation time is reduced. The technique described in this work has been implemented on an NVIDIA general-purpose GPU and has been tested simulating the Gaussian filtering operation performed by a smart CMOS image sensor. Such devices are used to perform computation on the edge and include built-in image processing functions. Among those, Gaussian filtering is one of the most common, since it is a basic task for early vision processing. These smart sensors are increasingly complex, and hence the time required to simulate them during their design cycle keeps growing. Beyond a certain imager size, the proposed simulation method yields simulation times two orders of magnitude faster than an implicit-method-based tool such as SPICE.</em></td> </tr> <tr> <td style="width:40px;">IP2-11</td> <td><b>A 100KHZ-1GHZ TERMINATION-DEPENDENT HUMAN BODY COMMUNICATION CHANNEL MEASUREMENT USING MINIATURIZED WEARABLE DEVICES</b><br /><b>Speaker</b>:<br />Shreyas Sen, Purdue University, US<br /><b>Authors</b>:<br />Shitij Avlani, Mayukh Nath, Shovan Maity and Shreyas Sen, Purdue University, US<br /><em><b>Abstract</b><br />Human Body Communication (HBC) has shown great promise to replace wireless communication for information exchange between wearable devices of a body area network. However, there are very few studies in the literature that systematically study the channel loss of capacitive HBC for wearable devices over a wide frequency range with different terminations at the receiver, partly due to the need for miniaturized wearable devices for an accurate study. This paper, for the first time, measures the channel loss of capacitive HBC from 100KHz to 1GHz for both high-impedance and 50 ohm terminations using wearable, battery-powered devices, which is mandatory for accurate measurement of the HBC channel loss due to ground coupling effects. Results show that high-impedance termination leads to a significantly lower channel loss (40 dB improvement at 1MHz), as compared to 50 ohm termination at low frequencies. This difference steadily decreases with increasing frequency, until the two become similar near 80MHz. 
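<p>This termination dependence follows the qualitative behaviour of a first-order capacitive divider. The sketch below is our illustrative model only, not the paper's measurement setup: the 10 pF coupling capacitance and the termination values are arbitrary assumptions, chosen just to show why the 50-ohm loss penalty shrinks as frequency grows.</p> <pre><code>
# First-order illustration (ours, not the paper's measurement model) of why a
# 50-ohm termination suffers more loss at low frequency than a high-impedance
# one: with a series coupling capacitance C, the receiver sees a high-pass
# divider V_rx/V_tx = Z_term / (Z_term + 1/(j*2*pi*f*C)). The capacitance
# value below is an arbitrary assumption for illustration only.
import math

C = 10e-12  # assumed effective coupling capacitance (10 pF)

def loss_db(f_hz: float, z_term: float) -> float:
    zc = 1.0 / (2 * math.pi * f_hz * C)        # |Z_C| of the coupling cap
    gain = z_term / math.hypot(z_term, zc)     # divider magnitude
    return -20 * math.log10(gain)              # channel loss in dB

for f in (1e6, 10e6, 100e6):
    print(f"{f/1e6:>5.0f} MHz: 50 ohm {loss_db(f, 50):6.1f} dB, "
          f"high-Z {loss_db(f, 1e6):5.1f} dB")
# The gap between the two terminations shrinks as f grows, qualitatively
# matching the measured trend reported above.
</code></pre>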
Beyond 100MHz, inter-device coupling dominates, thereby preventing accurate measurements of the channel loss of the human body. The measured results provide consistent wearable, wide-frequency HBC channel-loss data and could serve as a backbone for the emerging field of HBC by aiding in the selection of an appropriate operation frequency and termination.</em></td> </tr> <tr> <td style="width:40px;">IP2-12</td> <td><b>FROM DRUP TO PAC AND BACK</b><br /><b>Speaker</b>:<br />Daniela Kaufmann, Johannes Kepler Universität Linz, AT<br /><b>Authors</b>:<br />Daniela Kaufmann, Armin Biere and Manuel Kauers, Johannes Kepler Universität Linz, AT<br /><em><b>Abstract</b><br />Currently the most efficient automatic approach to verify gate-level multipliers combines SAT solving and computer algebra. In order to increase confidence in the verification, proof certificates are generated. However, due to different solving techniques, these certificates require two different proof formats, namely DRUP and PAC. A combined proof has so far been missing. Correctness of this approach can thus only be trusted up to the correctness of compositional reasoning. In this paper we show how to generate a single proof in one proof format, which then allows correctness to be certified using one simple proof checker. We further investigate empirically the effect on proof generation and checking time as well as on proof size. It turns out that PAC proofs are much more compact and faster to check.</em></td> </tr> <tr> <td style="width:40px;">IP2-13</td> <td><b>VERIFIABLE SECURITY TEMPLATES FOR HARDWARE</b><br /><b>Speaker</b>:<br />Bill Harrison, Oak Ridge National Laboratory, Cyber Security Research Group, US<br /><b>Authors</b>:<br />William Harrison<sup>1</sup> and Gerard Allwein<sup>2</sup><br /><sup>1</sup>Oak Ridge National Laboratory, US; <sup>2</sup>Naval Research Laboratory, US<br /><em><b>Abstract</b><br />HLS has, with a few notable exceptions, not focused on transferring ideas and techniques from high-assurance software formal methods to hardware development, despite there being a sophisticated and mature body of research in that area. Just as it has introduced software engineering virtues, we believe HLS can also become a vector for retrofitting software formal methods to the challenge of high-assurance security in hardware. This paper introduces the Device Calculus and its mechanization in the Agda proof checking system. The Device Calculus is a starting point for exploring formal methods and security within high-level synthesis flows. We illustrate the Device Calculus with a number of examples of formally verifiable security templates, i.e., functions in the Device Calculus that express common security structures at a high level of abstraction.</em></td> </tr> <tr> <td style="width:40px;">IP2-14</td> <td><b>IFFSET: IN-FIELD FUZZING OF INDUSTRIAL CONTROL SYSTEMS USING SYSTEM EMULATION</b><br /><b>Speaker</b>:<br />Dimitrios Tychalas, NYU Tandon School of Engineering, US<br /><b>Authors</b>:<br />Dimitrios Tychalas<sup>1</sup> and Michail Maniatakos<sup>2</sup><br /><sup>1</sup>NYU Tandon School of Engineering, US; <sup>2</sup>New York University Abu Dhabi, AE<br /><em><b>Abstract</b><br />Industrial Control Systems (ICS) have evolved in the last decade, shifting from proprietary software/hardware to contemporary embedded architectures paired with open-source operating systems. 
In contrast to the IT world, where continuous updates and patches are expected, decommissioning always-on ICS for security assessment can incur prohibitive costs for their owners. Thus, a solution for routinely assessing the cybersecurity posture of diverse ICS without affecting their operation is essential. In this paper we introduce IFFSET, a platform that leverages full system emulation of Linux-based ICS firmware and utilizes fuzzing for security evaluation. Our platform extracts the file system and kernel information from a live ICS device, building an image which is emulated on a desktop system through QEMU. We employ fuzzing as a security assessment tool to analyze ICS-specific libraries and find potentially security-threatening conditions. We test our platform with commercial PLCs, showcasing potential threats with no interruption to the control process.</em></td> </tr> <tr> <td style="width:40px;">IP2-15</td> <td><b>FANNET: FORMAL ANALYSIS OF NOISE TOLERANCE, TRAINING BIAS AND INPUT SENSITIVITY IN NEURAL NETWORKS</b><br /><b>Speaker</b>:<br />Mahum Naseer, TU Wien, AT<br /><b>Authors</b>:<br />Mahum Naseer<sup>1</sup>, Mishal Fatima Minhas<sup>2</sup>, Faiq Khalid<sup>1</sup>, Muhammad Abdullah Hanif<sup>3</sup>, Osman Hasan<sup>4</sup> and Muhammad Shafique<sup>5</sup><br /><sup>1</sup>TU Wien, AT; <sup>2</sup>National University of Sciences and Technology (NUST), Islamabad, Pakistan, PK; <sup>3</sup>Institute of Computer Engineering, Vienna University of Technology, AT; <sup>4</sup>NUST, PK; <sup>5</sup>Vienna University of Technology (TU Wien), AT<br /><em><b>Abstract</b><br />With constant improvements in network architectures and training methodologies, Neural Networks (NNs) are increasingly being deployed in real-world Machine Learning systems. However, despite their impressive performance on "known inputs", these NNs can fail absurdly on "unseen inputs", especially if these real-time inputs deviate from the training dataset distributions or contain certain types of input noise. This indicates the low noise tolerance of NNs, which is a major reason for the recent increase in adversarial attacks. This is a serious concern, particularly for safety-critical applications, where inaccurate results can lead to dire consequences. We propose a novel methodology that leverages model checking for the Formal Analysis of Neural Networks (FANNet) under different input noise ranges. Our methodology allows us to rigorously analyze the noise tolerance of NNs, their input node sensitivity, and the effects of training bias on their performance, e.g., in terms of classification accuracy. For evaluation, we use a feed-forward fully-connected NN architecture trained for Leukemia classification. 
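<p>As a rough, simplified stand-in for the model-checking question FANNet poses (can bounded input noise change the predicted class?), the sketch below uses naive interval bound propagation instead of model checking; the two random layers are placeholders, not the paper's Leukemia model.</p> <pre><code>
# Simplified stand-in (ours) for the noise-tolerance question described above.
# Interval bound propagation replaces the paper's model checking; the weights
# below are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)   # placeholder layer 1
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)   # placeholder layer 2

def interval_affine(lo, hi, W, b):
    """Bound W @ x + b elementwise for any x in [lo, hi]."""
    mid, rad = (lo + hi) / 2, (hi - lo) / 2
    center = W @ mid + b
    radius = np.abs(W) @ rad
    return center - radius, center + radius

def tolerated(x, eps):
    """True if no input noise within +/-eps can change the top class."""
    lo, hi = interval_affine(x - eps, x + eps, W1, b1)
    lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)      # ReLU is monotone
    lo, hi = interval_affine(lo, hi, W2, b2)
    c = int(np.argmax(W2 @ np.maximum(W1 @ x + b1, 0) + b2))  # nominal class
    return lo[c] > np.delete(hi, c).max()              # sound but conservative

x = rng.normal(size=4)
print([eps for eps in (0.0, 0.01, 0.05, 0.1) if tolerated(x, eps)])
</code></pre>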
Our experimental results show 11% noise tolerance for the given trained network, identify the most sensitive input nodes, confirm the bias of the available training dataset, and indicate that, for larger noise ranges, the proposed methodology is much more rigorous than validation testing yet comparable in terms of time and computational resources.</em></td> </tr> <tr> <td style="width:40px;">IP2-16</td> <td><b>A SCALABLE MIXED SYNTHESIS FRAMEWORK FOR HETEROGENEOUS NETWORKS</b><br /><b>Speaker</b>:<br />Max Austin, University of Utah, US<br /><b>Authors</b>:<br />Max Austin<sup>1</sup>, Scott Temple<sup>1</sup>, Walter Lau Neto<sup>1</sup>, Luca Amaru<sup>2</sup>, Xifan Tang<sup>1</sup> and Pierre-Emmanuel Gaillardon<sup>1</sup><br /><sup>1</sup>University of Utah, US; <sup>2</sup>Synopsys, US<br /><em><b>Abstract</b><br />We present a new logic synthesis framework which produces efficient post-technology-mapping results on heterogeneous networks containing a mix of different types of logic. This framework accomplishes this by breaking down the circuit into sections using a hypergraph k-way partitioner and then determining the best-fit logic representation for each partition between two Boolean networks, And-Inverter Graphs (AIGs) and Majority-Inverter Graphs (MIGs), each of which has been shown to outperform the other on different types of logic. Experimental results show that over a set of OpenPiton Design Benchmarks (OPDB) and OpenCores benchmarks, our proposed methodology outperforms state-of-the-art academic tools in Area-Delay Product (ADP), Power-Delay Product (PDP), and Energy-Delay Product (EDP) by 5%, 2%, and 15%, respectively, after Application-Specific Integrated Circuit (ASIC) technology mapping, while showing a 54% improvement in runtime over conventional MIG optimization.</em></td> </tr> <tr> <td style="width:40px;">IP2-17</td> <td><b>DISCERN: DISTILLING STANDARD CELLS FOR EMERGING RECONFIGURABLE NANOTECHNOLOGIES</b><br /><b>Speaker</b>:<br />Shubham Rai, Technische Universität Dresden, DE<br /><b>Authors</b>:<br />Shubham Rai<sup>1</sup>, Michael Raitza<sup>2</sup>, Siva Satyendra Sahoo<sup>1</sup> and Akash Kumar<sup>1</sup><br /><sup>1</sup>Technische Universität Dresden, DE; <sup>2</sup>TU Dresden and CfAED, DE<br /><em><b>Abstract</b><br />Logic gates and circuits based on reconfigurable nanotechnologies demonstrate runtime reconfigurability, where a single logic gate can exhibit more than one functionality. Recent attempts on circuits based on emerging reconfigurable nanotechnologies have primarily focused on using the traditional CMOS design flow involving similar-styled standard cells. These CMOS-centric standard cells fail to utilize the exciting properties offered by these nanotechnologies. In the present work, we explore the Boolean properties that define the reconfigurable properties of a logic gate. By analyzing the truth tables in detail, we find that there is a common Boolean rule which dictates why a logic gate is reconfigurable. Such logic gates can be efficiently implemented using reconfigurable nanotechnologies. We propose an algorithm which analyses the truth tables of nodes in a circuit to list all such potential reconfigurable logic gates for a particular circuit. Technology mapping with these new logic gates (or standard cells) leads to a better mapping in terms of area and delay. 
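<p>One candidate form of such a truth-table rule, offered here purely as our illustration (the paper's exact rule may differ), is duality: a reconfigurable cell realizing f can often also realize its dual, dual(f)(x) = NOT f(NOT x), as with NAND and NOR. A minimal scan for dual pairs over truth tables:</p> <pre><code>
# Truth-table sketch (ours; the paper's exact Boolean rule may differ): scan a
# set of node functions, given as 2^n-bit truth-table masks, for dual pairs
# that a single reconfigurable cell could cover (e.g., NAND and NOR).

def dual(tt: int, n: int) -> int:
    """Dual of an n-input function given as a 2^n-bit truth-table mask."""
    size = 1 << n
    out = 0
    for m in range(size):
        bit = (tt >> (m ^ (size - 1))) & 1   # evaluate f at complemented input
        out |= (bit ^ 1) << m                # complement the output
    return out

# 2-input examples, msb = f(1,1): AND=0b1000, OR=0b1110, NAND=0b0111, NOR=0b0001
funcs = {"AND": 0b1000, "OR": 0b1110, "NAND": 0b0111, "NOR": 0b0001}
for name, tt in funcs.items():
    d = dual(tt, 2)
    partner = [k for k, v in funcs.items() if v == d]
    print(f"dual({name}) = {partner}")       # AND<->OR, NAND<->NOR
</code></pre>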
Experiments employing our methodology on the EPFL benchmarks show average improvements of around 13%, 16%, and 11.5% in terms of area, number of edges, and delay, respectively, compared to conventional CMOS-centric standard-cell-based mapping.</em></td> </tr> <tr> <td style="width:40px;">IP2-18</td> <td><b>A 16×128 STOCHASTIC-BINARY PROCESSING ELEMENT ARRAY FOR ACCELERATING STOCHASTIC DOT-PRODUCT COMPUTATION USING 1-16 BIT-STREAM LENGTH</b><br /><b>Speaker</b>:<br />Hyunjoon Kim, Nanyang Technological University, SG<br /><b>Authors</b>:<br />Qian Chen, Yuqi Su, Hyunjoon Kim, Taegeun Yoo, Tony Tae-Hyoung Kim and Bongjin Kim, Nanyang Technological University, SG<br /><em><b>Abstract</b><br />This work presents a 16×128 stochastic-binary processing element array for energy/area-efficient processing of artificial neural networks. A processing element (PE) with all-digital components consists of an XNOR gate as a bipolar stochastic multiplier and an 8-bit binary adder with eight registers for accumulating partial sums. The PE array comprises 16 dot-product units, each with 128 PEs cascaded in a single row. The latency and energy of the proposed dot-product unit are minimized by reducing the bit-stream length while containing the accuracy degradation induced by approximate stochastic computing. A 128-input dot-product operation requires a bit-stream length (N) of 1-to-16, which is two orders of magnitude smaller than the baseline stochastic computation using MUX-based adders. The simulated dot-product error is 6.9-to-1.5% for N=1-to-16, while the error from the baseline stochastic method is 5.9-to-1.7% with N=128-to-2048. The mean MNIST classification accuracy is 96.11% (which is 1.19% lower than 8b binary) using a three-layer MLP at N=16. The measured energy from a 65nm test-chip is 10.04pJ per dot-product, and the energy efficiency is 25.5TOPS/W at N=16.</em></td> </tr> <tr> <td style="width:40px;">IP2-19</td> <td><b>TOWARDS EXPLORING THE POTENTIAL OF ALTERNATIVE QUANTUM COMPUTING ARCHITECTURES</b><br /><b>Speaker</b>:<br />Arighna Deb, Kalinga Institute of Industrial Technology, IN<br /><b>Authors</b>:<br />Arighna Deb<sup>1</sup>, Gerhard W. Dueck<sup>2</sup> and Robert Wille<sup>3</sup><br /><sup>1</sup>Kalinga Institute of Industrial Technology, IN; <sup>2</sup>University of New Brunswick, CA; <sup>3</sup>Johannes Kepler Universität Linz, AT<br /><em><b>Abstract</b><br />The recent advances in the physical realization of Noisy Intermediate Scale Quantum (NISQ) computers have motivated research on design automation that allows users to execute quantum algorithms on them. Certain physical constraints in the architectures restrict how logical qubits used to describe the algorithm can be mapped to physical qubits used to realize the corresponding functionality. Thus far, this has been addressed by inserting additional operations in order to overcome the physical constraints. However, all these approaches have taken the existing architectures as invariant and did not explore the potential of changing the quantum architecture itself, a valid option as long as the underlying physical constraints remain satisfied. In this work, we propose initial ideas to explore this potential. 
More precisely, we introduce several schemes for the generation of alternative coupling graphs (and, by this, quantum computing architectures) that still might be able to satisfy physical constraints but, at the same time, allow for a more efficient realization of the desired quantum functionality.</em></td> </tr> <tr> <td style="width:40px;">IP2-20</td> <td><b>ACCELERATING QUANTUM APPROXIMATE OPTIMIZATION ALGORITHM USING MACHINE LEARNING</b><br /><b>Speaker</b>:<br />Swaroop Ghosh, Pennsylvania State University, US<br /><b>Authors</b>:<br />Mahabubul Alam, Abdullah Ash- Saki and Swaroop Ghosh, Pennsylvania State University, US<br /><em><b>Abstract</b><br />We propose a machine-learning-based approach to accelerate the implementation of the quantum approximate optimization algorithm (QAOA), a promising quantum-classical hybrid algorithm for demonstrating so-called quantum supremacy. In QAOA, a parametric quantum circuit and a classical optimizer iterate in a closed loop to solve hard combinatorial optimization problems. The performance of QAOA improves with an increasing number of stages (depth) in the quantum circuit. However, two new parameters are introduced with each added stage for the classical optimizer, increasing the number of optimization-loop iterations. We note a correlation among parameters of the lower-depth and the higher-depth QAOA implementations and exploit it by developing a machine-learning model to predict the gate parameters close to the optimal values. As a result, the optimization loop converges in a fewer number of iterations. We choose the graph MaxCut problem as a prototype to solve using QAOA. We perform a feature extraction routine using 100 different QAOA instances and develop a training data-set with 13,860 optimal parameters. We present our analysis for 4 flavors of regression models and 4 flavors of classical optimizers. Finally, we show that the proposed approach can curtail the number of optimization iterations by 44.9% on average (up to 65.7%), based on an analysis performed with 264 flavors of graphs.</em></td> </tr> </table> <hr /> <h2 id="6.1">6.1 Special Day on "Embedded AI": Emerging Devices, Circuits and Systems</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 11:00 - 12:30<br /><b>Location / Room:</b> Amphithéâtre Jean Prouve</p> <p><b>Chair:</b><br />Carlo Reita, CEA, FR</p> <p><b>Co-Chair:</b><br />Bernabe Linares-Barranco, CSIC, ES</p> <p>This session focuses on the advantages and use of novel emerging nanotechnology devices and their use in designing circuits and systems for embedded AI hardware solutions.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>6.1.1</td> <td><b>IN-MEMORY RESISTIVE RAM IMPLEMENTATION OF BINARIZED NEURAL NETWORKS FOR MEDICAL APPLICATIONS</b><br /><b>Speaker</b>:<br />Damien Querlioz, University Paris-Saclay, FR<br /><b>Authors</b>:<br />Bogdan Penkovsky<sup>1</sup>, Marc Bocquet<sup>2</sup>, Tifenn Hirtzlin<sup>1</sup>, Jacques-Olivier Klein<sup>1</sup>, Etienne Nowak<sup>3</sup>, Elisa Vianello<sup>3</sup>, Jean-Michel Portal<sup>2</sup> and Damien Querlioz<sup>4</sup><br /><sup>1</sup>Université Paris-Saclay, FR; <sup>2</sup>Aix-Marseille University, FR; <sup>3</sup>CEA-Leti, FR; <sup>4</sup>Univ Paris-Sud, FR<br /><em><b>Abstract</b><br />The advent of deep learning has considerably accelerated machine learning development, but its deployment at the edge is limited by high energy cost and memory requirements. 
With new memory technology available, emerging Binarized Neural Networks (BNNs) promise to reduce the energy impact of the forthcoming machine-learning hardware generation, enabling machine learning on edge devices and avoiding data transfer over the network. In this work, after presenting our implementation employing a hybrid CMOS/hafnium-oxide resistive memory technology, we suggest strategies to apply BNNs to biomedical signals such as electrocardiography and electroencephalography, maintaining accuracy while reducing memory requirements. These results are obtained by binarizing solely the classifier part of a neural network. We also discuss how these results translate to the edge-oriented MobileNet V1 neural network on the ImageNet task. The final goal of this research is to enable smart autonomous healthcare devices.</em></td> </tr> <tr> <td>11:22</td> <td>6.1.2</td> <td><b>MIXED-SIGNAL VECTOR-BY-MATRIX MULTIPLIER CIRCUITS BASED ON 3D-NAND MEMORIES FOR NEUROMORPHIC COMPUTING</b><br /><b>Speaker</b>:<br />Dmitri Strukov, University of California, Santa Barbara, US<br /><b>Authors</b>:<br />Mohammad Bavandpour, Shubham Sahay, Mohammad Mahmoodi and Dmitri Strukov, University of California, Santa Barbara, US<br /><em><b>Abstract</b><br />We propose extremely dense, energy-efficient mixed-signal vector-by-matrix multiplication (VMM) circuits based on existing 3D-NAND flash memory blocks, without any need for their modification. Such compatibility is achieved using a time-domain-encoded VMM design. We have performed rigorous simulations of such a circuit, taking into account non-idealities such as drain-induced barrier lowering, capacitive coupling, charge injection, parasitics, process variations, and noise. Our results, for example, show that the 4-bit VMM of 200-element vectors, using the commercially available 64-layer gate-all-around macaroni-type 3D-NAND memory blocks designed in the 55-nm technology node, may provide an unprecedented area efficiency of 0.14 μm²/byte and energy efficiency of ~11 fJ/Op, including the input/output and other peripheral circuitry overheads.</em></td> </tr> <tr> <td>11:44</td> <td>6.1.3</td> <td><b>MODULAR RRAM BASED IN-MEMORY COMPUTING DESIGN FOR EMBEDDED AI</b><br /><b>Authors</b>:<br />Xinxin Wang, Qiwen Wang, Mohammed A. Zidan, Fan-Hsuan Meng, John Moon and Wei Lu, University of Michigan, US<br /><em><b>Abstract</b><br />Deep Neural Networks (DNNs) are widely used for many artificial intelligence applications with great success. However, they often come with high computation cost and complexity. Accelerators are crucial in improving energy efficiency and throughput, particularly for embedded AI applications. Resistive random-access memory (RRAM) has the potential to enable efficient AI accelerator implementation, as the weights can be mapped to the conductance values of RRAM devices and computation can be performed directly in-memory. Specifically, by converting input activations into voltage pulses, vector-matrix multiplications (VMM) can be performed in the analog domain, in place and in parallel. Moreover, the whole model can be stored on-chip, thus eliminating off-chip DRAM access completely and achieving high energy efficiency during the end-to-end operation. In this presentation, we will discuss how practical DNN models can be mapped onto realistic RRAM arrays in a modular design. 
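<p>The mapping just described can be sketched numerically as follows; the tile size, the differential conductance pairs, and the 16 quantization levels are our assumptions for illustration, not the paper's design points.</p> <pre><code>
# Minimal numeric sketch (ours) of the in-memory VMM idea described above:
# weights become conductances split across fixed-size crossbar tiles, inputs
# become voltage pulses, and each column accumulates current I = G @ V.
import numpy as np

TILE, LEVELS = 64, 16            # assumed array size and conductance levels

def quantize(w, levels=LEVELS):
    """Map signed weights onto a differential pair of conductance levels."""
    scale = np.abs(w).max() / (levels - 1)
    q = np.round(w / scale)
    g_pos, g_neg = np.maximum(q, 0), np.maximum(-q, 0)  # two devices per weight
    return g_pos, g_neg, scale

def crossbar_vmm(W, v):
    """Tile the weight matrix into TILE-column crossbars and sum currents."""
    out = np.zeros(W.shape[0])
    for lo in range(0, W.shape[1], TILE):               # one tile per slice
        g_pos, g_neg, scale = quantize(W[:, lo:lo + TILE])
        out += scale * ((g_pos - g_neg) @ v[lo:lo + TILE])
    return out

W, v = np.random.default_rng(1).normal(size=(10, 200)), np.ones(200)
print(np.abs(crossbar_vmm(W, v) - W @ v).max())   # small quantization error
</code></pre>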
The effects of quantization, finite array size, and device non-idealities on system performance will be analyzed using standard DNN models such as VGG-16 and MobileNet. System performance metrics such as throughput and energy/image will also be discussed.</em></td> </tr> <tr> <td>12:06</td> <td>6.1.4</td> <td><b>NEUROMORPHIC COMPUTING: TOWARD DYNAMICAL DATA PROCESSING</b><br /><b>Author</b>:<br />Fabian Alibart, CNRS, Lille, FR<br /><em><b>Abstract</b><br />While machine-learning approaches have made tremendous progress in recent years, more is expected from the third generation of neural networks, which should sustain this evolution. In addition to unsupervised learning and spike-based computing capability, this new generation of computing machines will be intrinsically dynamical systems that will shift our conception of electronics. In this context, investigating new material implementations of neuromorphic concepts seems a very attractive direction. In this presentation, I will present our recent efforts toward the development of neuromorphic synapses that present attractive features for both spike-based computing and unsupervised learning. From their basic physics, I will show how their dynamics can be used to implement time-dependent computing functions. I will also extend this idea of dynamical computing to the case of reservoir computing based on organic sensors in order to show how neuromorphic concepts can be applied to a large class of dynamical problems.</em></td> </tr> <tr> <td>12:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="6.2">6.2 Secure and fast memory and storage</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 11:00 - 12:30<br /><b>Location / Room:</b> Chamrousse</p> <p><b>Chair:</b><br />Hao Yu, SUSTech, CN</p> <p><b>Co-Chair:</b><br />Chengmo Yang, University of Delaware, US</p> <p>As memories become persistent, the design of traditional data structures such as trees and hash tables, as well as filesystems, should be revisited to cope with the challenges brought by new memory devices. In this context, the main focus of this session is on how to improve the performance, security, and energy efficiency of memory and storage. The specific techniques range from the design of integrity trees and hash tables, the management of superpages in filesystems, and data prefetching in solid-state drives (SSDs), to energy-efficient carbon-nanotube cache design.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>6.2.1</td> <td><b>AN EFFICIENT PERSISTENCY AND RECOVERY MECHANISM FOR SGX-STYLE INTEGRITY TREE IN SECURE NVM</b><br /><b>Speaker</b>:<br />Mengya Lei, Huazhong University of Science &amp; Technology, CN<br /><b>Authors</b>:<br />Mengya Lei, Fang Wang, Dan Feng, Fan Li and Jie Xu, Huazhong University of Science &amp; Technology, CN<br /><em><b>Abstract</b><br />The integrity tree is a crucial part of secure non-volatile memory (NVM) system design. For NVM with large capacity, the SGX-style integrity tree (SIT) is practical due to its parallel updates and variable arity. However, employing SIT in secure NVM is not easy. This is because the SIT secure metadata must be strictly persisted or restored after a sudden power loss, which unfortunately incurs unacceptable run-time overhead or recovery time. 
In this paper, we propose PSIT, a metadata persistency solution for SIT-protected secure NVM with high performance and fast restoration. PSIT utilizes the observation that, for a lazily updated SIT, the tree nodes lost after a crash can be recovered from the corresponding child nodes in the NVM. It reduces the persistency overhead of the SIT nodes through a restrained write-back meta-cache and leverages the SIT inter-layer dependency for recovery. Experiments show that compared to ASIT, a state-of-the-art secure NVM using SIT, PSIT decreases write traffic by 47% and improves performance by 18% on average while maintaining a comparable recovery time.</em></td> </tr> <tr> <td>11:30</td> <td>6.2.2</td> <td><b>REVISITING PERSISTENT HASH TABLE DESIGN FOR COMMERCIAL NON-VOLATILE MEMORY</b><br /><b>Speaker</b>:<br />Kaixin Huang, Shanghai Jiao Tong University, CN<br /><b>Authors</b>:<br />Kaixin Huang, Yan Yan and Linpeng Huang, Shanghai Jiao Tong University, CN<br /><em><b>Abstract</b><br />Emerging non-volatile memory technologies bring evolution to storage systems and durable data structures. Among them, a proliferation of research on persistent hash tables employs NVM as the storage layer for both fast access and efficient persistence. Most of this work is based on the assumptions that NVM has byte access granularity, poor write endurance, DRAM-comparable read latency and much higher write latency. However, a commercial non-volatile memory product, named Intel Optane DC Persistent Memory (AEP), has a few interesting features that differ from previous assumptions, such as 1) block access granularity, 2) little concern for software-layer write endurance, and 3) much higher read latency than DRAM with DRAM-comparable write latency. Confronted with the new challenges brought by AEP, we propose Rewo-Hash, a novel read-efficient and write-optimized hash table for commercial non-volatile memory. Our design can be summarized in three key points. First, we keep a hash table copy in DRAM as a cached table to speed up search requests. Second, we design a log-free atomic mechanism to support fast writes. Third, we devise an efficient synchronization scheme between the persistent table and the cached table to mask the data synchronization overhead. We conduct extensive experiments on a real NVM platform and the results show that compared with state-of-the-art NVM-optimized hash tables, Rewo-Hash achieves improvements of 1.73x-2.70x and 1.46x-3.11x in read latency and write latency, respectively. Rewo-Hash also outperforms its counterparts by 1.86x-4.24x in throughput for various YCSB workloads.</em></td> </tr> <tr> <td>12:00</td> <td>6.2.3</td> <td><b>OPTIMIZING PERFORMANCE OF PERSISTENT MEMORY FILE SYSTEMS USING VIRTUAL SUPERPAGES</b><br /><b>Speaker</b>:<br />Chaoshu Yang, Chongqing University, CN<br /><b>Authors</b>:<br />Chaoshu Yang<sup>1</sup>, Duo Liu<sup>1</sup>, Runyu Zhang<sup>1</sup>, Xianzhang Chen<sup>1</sup>, Shun Nie<sup>1</sup>, Qingfeng Zhuge<sup>1</sup> and Edwin H.-M Sha<sup>2</sup><br /><sup>1</sup>Chongqing University, CN; <sup>2</sup>East China Normal University, CN<br /><em><b>Abstract</b><br />Existing persistent memory file systems can significantly improve performance by exploiting the advantages of emerging Persistent Memories (PMs). In particular, persistent memory file systems can employ superpages (e.g., 2 MB per page) of PMs to alleviate the overhead of locating file data and reduce TLB misses. Unfortunately, superpages also induce two critical problems. 
First, ensuring the data consistency of file systems that use superpages causes severe write amplification when overwriting file data. Second, existing management of superpages may waste large amounts of PM space. In this paper, we propose a Virtual Superpage Mechanism (VSM) to solve these problems by taking advantage of the virtual address space. On one hand, VSM adopts a multi-grained copy-on-write mechanism to reduce the write amplification while ensuring data consistency. On the other hand, VSM presents a zero-copy file-data migration mechanism to eliminate the loss of space utilization efficiency caused by superpages. We implement the proposed VSM mechanism in the Linux kernel based on PMFS. Compared with the original PMFS and NOVA, the experimental results show that VSM improves write and read performance by 36% and 14% on average, respectively. Meanwhile, VSM achieves the same space utilization efficiency as a file system that uses normal 4 KB pages to organize files.</em></td> </tr> <tr> <td>12:15</td> <td>6.2.4</td> <td><b>FREQUENT ACCESS PATTERN-BASED PREFETCHING INSIDE OF SOLID-STATE DRIVES</b><br /><b>Speaker</b>:<br />Jianwei Liao, Southwest University of China, CN<br /><b>Authors</b>:<br />Xiaofei Xu<sup>1</sup>, Zhigang Cai<sup>2</sup>, Jianwei Liao<sup>2</sup> and Yutaka Ishikawa<sup>3</sup><br /><sup>1</sup>Southwest University, CN; <sup>2</sup>Southwest University of China, CN; <sup>3</sup>RIKEN, Japan, JP<br /><em><b>Abstract</b><br />This paper proposes an SSD-inside data prefetching scheme, which is OS-independent and transparent in use. To be specific, it first mines frequent block access patterns that reflect the correlation among past requests. Then it compares the requests in the current time window with the identified patterns to direct prefetching. Furthermore, to maximize cache use efficiency, we construct a mathematical model to adaptively determine the cache partitioning on the basis of I/O workload characteristics, separately buffering the prefetched data and the write data. Experimental results demonstrate that our proposal can improve average read latency by 6.3% to 9.3% without noticeably increasing write latency, in contrast to conventional SSD-inside prefetching schemes.</em></td> </tr> <tr> <td style="width:40px;">12:30</td> <td><a href="/date20/conference/session/IP3">IP3-1</a>, 594</td> <td><b>CNT-CACHE: AN ENERGY-EFFICIENT CARBON NANOTUBE CACHE WITH ADAPTIVE ENCODING</b><br /><b>Speaker</b>:<br />Kexin Chu, School of Electronic Science &amp; Applied Physics Hefei University of Technology Anhui,China, CN<br /><b>Authors</b>:<br />Dawen Xu<sup>1</sup>, Kexin Chu<sup>1</sup>, Cheng Liu<sup>2</sup>, Ying Wang<sup>2</sup>, Lei Zhang<sup>2</sup> and Huawei Li<sup>2</sup><br /><sup>1</sup>School of Electronic Science &amp; Applied Physics Hefei University of Technology Anhui, CN; <sup>2</sup>Chinese Academy of Sciences, CN<br /><em><b>Abstract</b><br />The carbon nanotube field-effect transistor (CNFET), which promises both higher clock speed and better energy efficiency, is an attractive alternative to the conventional power-hungry CMOS cache. We observe that a CNFET-based cache constructed with typical 9T SRAM cells has distinct energy consumption when reading/writing 0 and 1. The energy consumption of reading 0 is around 3X higher compared to reading 1. The energy consumption of writing 1 is almost 10X higher than writing 0. 
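<p>Asymmetries of this kind typically motivate inversion-based line encodings. The following is a generic bus-invert-style sketch, our illustration only and not necessarily the paper's adaptive encoding module: store a line inverted whenever that reduces the number of 1s, plus one flag bit to undo the inversion on reads.</p> <pre><code>
# Generic bus-invert-style sketch (ours; not necessarily the paper's adaptive
# encoding module): since writing 1 costs far more than writing 0 here, store
# a line inverted whenever that reduces the number of 1s, and keep one flag
# bit per line to undo the inversion on reads.

LINE_BITS = 64

def encode(line: int) -> tuple[int, int]:
    """Return (stored_value, inverted_flag) minimizing stored 1s."""
    if bin(line).count("1") > LINE_BITS // 2:
        return line ^ ((1 << LINE_BITS) - 1), 1
    return line, 0

def decode(stored: int, flag: int) -> int:
    return stored ^ ((1 << LINE_BITS) - 1) if flag else stored

line = 0xFFFF_FFFF_FFFF_0000
stored, flag = encode(line)
assert decode(stored, flag) == line
print(bin(stored).count("1"), flag)   # 16 ones stored instead of 48
</code></pre>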
With this observation, we propose an energy-efficient cache design called CNT-Cache to take advantage of this feature. It includes an adaptive data encoding module that can convert the coding of each cache line to match the cache's reading and writing preferences. Meanwhile, it has a cache line encoding direction predictor that instructs the encoding direction according to the cache line access history. The two optimizations combined can reduce the overall dynamic power consumption significantly. According to our experiments, the optimized CNFET-based L1 D-Cache reduces the dynamic power consumption by 22% on average compared to the baseline CNFET cache.</em></td> </tr> <tr> <td>12:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="6.3">6.3 Special Session: Modern Logic Reasoning Methods for Functional ECO</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 11:00 - 12:30<br /><b>Location / Room:</b> Autrans</p> <p><b>Chair:</b><br />Patrick Vuillod, Synopsys, US</p> <p><b>Co-Chair:</b><br />Christoph Scholl, Albert-Ludwigs-University Freiburg, DE</p> <p>Functional Engineering Change Order (ECO) is the problem of incrementally updating an existing logic network after a (possibly late) change in the design specification. The problem requires (i) identifying a small portion of the network's logic to be changed and (ii) automatically synthesizing a patch to replace this portion and rectify the network's functional behavior. ECOs can be solved using the logical framework of quantified Boolean formulæ (QBF), where a logic query asks for the existence of a set of nodes and values at those nodes to rectify the logic network's output functions. The global nature of the problem, however, challenges scalability. Any internal node in the logic network is a potential location for rectification, and any node in the logic network may be used to simplify the synthesized patch. 
Furthermore, off-the-shelf QBF algorithms do not allow a formulation of resource costs for reusing existing logic.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>6.3.1</td> <td><b>ENGINEERING CHANGE ORDER FOR COMBINATIONAL AND SEQUENTIAL DESIGN RECTIFICATION</b><br /><b>Speaker</b>:<br />Jie-Hong Roland Jiang, National Taiwan University, TW<br /><b>Authors</b>:<br />Jie-Hong Roland Jiang<sup>1</sup>, Victor Kravets<sup>2</sup> and NIAN-ZE LEE<sup>1</sup><br /><sup>1</sup>National Taiwan University, TW; <sup>2</sup>IBM, US</td> </tr> <tr> <td>11:20</td> <td>6.3.2</td> <td><b>EXACT DAG-AWARE REWRITING</b><br /><b>Speaker</b>:<br />Heinz Riener, EPFL, CH<br /><b>Authors</b>:<br />Heinz Riener<sup>1</sup>, Alan Mishchenko<sup>2</sup> and Mathias Soeken<sup>1</sup><br /><sup>1</sup>EPFL, CH; <sup>2</sup>University of California, Berkeley, US</td> </tr> <tr> <td>11:40</td> <td>6.3.3</td> <td><b>LEARNING TO AUTOMATE THE DESIGN UPDATES FROM OBSERVED ENGINEERING CHANGES IN THE CHIP DEVELOPMENT CYCLE</b><br /><b>Speaker</b>:<br />Victor Kravets, IBM, US<br /><b>Authors</b>:<br />Victor Kravets<sup>1</sup>, Jie-Hong Roland Jiang<sup>2</sup> and Heinz Riener<sup>3</sup><br /><sup>1</sup>IBM, US; <sup>2</sup>National Taiwan University, TW; <sup>3</sup>EPFL, CH</td> </tr> <tr> <td>12:05</td> <td>6.3.4</td> <td><b>SYNTHESIS AND OPTIMIZATION OF MULTIPLE PORTIONS OF CIRCUITS FOR ECO BASED ON SET-COVERING AND QBF FORMULATIONS</b><br /><b>Speaker</b>:<br />Masahiro Fujita, University of Tokyo, JP<br /><b>Authors</b>:<br />Masahiro Fujita, Yusuke Kimura, Xingming Le, Yukio Miyasaka and Amir Masoud Gharehbaghi, University of Tokyo, JP</td> </tr> <tr> <td>12:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="6.4">6.4 Microarchitecture to the rescue of memory</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 11:00 - 12:30<br /><b>Location / Room:</b> Stendhal</p> <p><b>Chair:</b><br />Olivier Sentieys, INRIA, FR</p> <p><b>Co-Chair:</b><br />Jeronimo Castrillon, Technische Universität Dresden, DE</p> <p>This session discusses micro-architectural innovations across three different memory technologies, namely caches, 3D-stacked DRAM, and non-volatile memory. This includes exploiting several aspects of redundancy to maximize cache utilization through compression, as well as multicast in 3D-stacked high-speed memories for graph analytics, and a microarchitectural solution to unify persistency and encryption in non-volatile memories.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>6.4.1</td> <td><b>EFFICIENT HARDWARE-ASSISTED CRASH CONSISTENCY IN ENCRYPTED PERSISTENT MEMORY</b><br /><b>Speaker</b>:<br />Zhan Zhang, Huazhong University of Science &amp; Technology, CN<br /><b>Authors</b>:<br />Zhan Zhang<sup>1</sup>, Jianhui Yue<sup>2</sup>, Xiaofei Liao<sup>1</sup> and Hai Jin<sup>1</sup><br /><sup>1</sup>Huazhong University of Science &amp; Technology, CN; <sup>2</sup>Michigan Technological University, US<br /><em><b>Abstract</b><br />Persistent memory (PM) requires maintaining crash consistency and encrypting data to ensure data recoverability and confidentiality. Enforcing these two goals not only puts more burden on programmers but also degrades performance. 
To address this issue, we propose a hardware-assisted encrypted persistent memory system. Specifically, logging and data encryption are assisted by hardware. Furthermore, we apply counter-based encryption to data and cipher feedback (CFB) mode encryption to the log, reducing the encryption overhead. Our preliminary experimental results show that the transaction throughput of the proposed design outperforms the baseline design by up to 34.4%.</em></td> </tr> <tr> <td>11:30</td> <td>6.4.2</td> <td><b>2DCC: CACHE COMPRESSION IN TWO DIMENSIONS</b><br /><b>Speaker</b>:<br />Amin Ghasemazar, University of British Columbia, CA<br /><b>Authors</b>:<br />Amin Ghasemazar<sup>1</sup>, Mohammad Ewais<sup>2</sup>, Prashant Nair<sup>3</sup> and Mieszko Lis<sup>3</sup><br /><sup>1</sup>UBC, CA; <sup>2</sup>UofT, CA; <sup>3</sup>University of British Columbia, CA<br /><em><b>Abstract</b><br />The importance of caches for performance, together with their high silicon area cost, has led to an interest in hardware solutions that transparently compress the cached data to increase effective capacity without sacrificing silicon area. Work to date has taken one of two tacks: either (a) deduplicating identical cache blocks across the cache to take advantage of inter-block redundancy or (b) identifying and compressing common patterns within each cache block to take advantage of intra-block redundancy. In this paper, we demonstrate that leveraging only one of these redundancy types leads to significant loss of compression opportunities in many applications: some workloads exhibit either inter-block or intra-block redundancy, while others exhibit both. We propose 2DCC, a simple technique that takes advantage of both types of redundancy. Across the SPEC and Parsec benchmark suites, 2DCC results in a 2.12× compression factor (geomean) compared to 1.44-1.49× for the best prior techniques on an iso-silicon basis. For the cache-sensitive subset of these benchmarks run in isolation, 2DCC also achieves an 11.7% speedup (geomean).</em></td> </tr> <tr> <td>12:00</td> <td>6.4.3</td> <td><b>GRAPHVINE: EXPLOITING MULTICAST FOR SCALABLE GRAPH ANALYTICS</b><br /><b>Speaker</b>:<br />Leul Belayneh, University of Michigan, US<br /><b>Authors</b>:<br />Leul Belayneh and Valeria Bertacco, University of Michigan, US<br /><em><b>Abstract</b><br />The proliferation of graphs as a key data structure for big-data analytics has heightened the demand for efficient graph processing. To meet this demand, prior works have proposed processing-in-memory (PIM) solutions in 3D-stacked DRAMs, such as Hybrid Memory Cubes (HMCs). However, PIM-based architectures, despite considerable improvement over conventional architectures, continue to be hampered by high inter-cube communication traffic. In turn, this trait has kept the underlying processing elements from fully capitalizing on the memory bandwidth an HMC has to offer. In this paper, we show that it is possible to combine multiple messages emitted from a source node into a single multicast message, thus reducing the inter-cube communication without affecting the correctness of the execution. Hence, we propose to add multicast support at source and in-network routers to reduce vertex-update traffic. 
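<p>The combining step can be pictured with a toy sketch, ours only, in which cube routing and the router hardware are abstracted away: updates that share a payload and a destination cube are merged into one multicast message carrying a vertex list.</p> <pre><code>
# Toy sketch (ours) of the message-combining idea described above: instead of
# sending one update per destination vertex, a source groups updates that
# share a destination cube and payload, and emits a single multicast with a
# destination list. Actual cube routing is abstracted away.
from collections import defaultdict

def combine_updates(updates):
    """updates: iterable of (dst_cube, dst_vertex, value) unicast messages.
    Returns one multicast per (cube, value) instead of one message per vertex."""
    groups = defaultdict(list)
    for cube, vertex, value in updates:
        groups[(cube, value)].append(vertex)
    return [(cube, value, verts) for (cube, value), verts in groups.items()]

# E.g., a PageRank-style source pushing the same contribution to 4 neighbours
# in cube 2 now sends 1 message instead of 4.
msgs = combine_updates([(2, 7, 0.25), (2, 9, 0.25), (2, 11, 0.25),
                        (2, 13, 0.25), (5, 3, 0.25)])
print(len(msgs), msgs)
</code></pre>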
Our experimental evaluation shows that, by combining multiple messages emitted at the source, it is possible to achieve an average speedup of 2.4x over a highly optimized PIM-based solution and reduce energy consumption by 3.4x, while incurring a modest power overhead of 6.8%.</em></td> </tr> <tr> <td style="width:40px;">12:30</td> <td><a href="/date20/conference/session/IP3">IP3-2</a>, 855</td> <td><b>ENHANCING MULTITHREADED PERFORMANCE OF ASYMMETRIC MULTICORES WITH SIMD OFFLOADING</b><br /><b>Speaker</b>:<br />Antonio Carlos Schneider Beck, Universidade Federal do Rio Grande do Sul, BR<br /><b>Authors</b>:<br />Jeckson Dellagostin Souza<sup>1</sup>, Madhavan Manivannan<sup>2</sup>, Miquel Pericas<sup>2</sup> and Antonio Carlos Schneider Beck<sup>1</sup><br /><sup>1</sup>Universidade Federal do Rio Grande do Sul, BR; <sup>2</sup>Chalmers University of Technology, SE<br /><em><b>Abstract</b><br />Single-ISA asymmetric multicore architectures can accelerate multithreaded applications by running code that does not execute concurrently (i.e., the serial region) on a big core and the parallel region on a larger number of smaller cores. Nevertheless, in such architectures the big core still implements resource-expensive application-specific instruction extensions that are rarely used while running the serial region, such as Single Instruction Multiple Data (SIMD) and Floating-Point (FP) operations. In this work, we propose a design in which these extensions are not implemented in the big core, thereby freeing up area and resources to increase the number of small cores in the system and potentially enhance thread-level parallelism (TLP). To address the case when missing instruction extensions are required while running on the big core, we devise an approach to automatically offload these operations to the execution units of the small cores, where the extensions are implemented and can be executed. Our evaluation shows that, on average, the proposed architecture provides a 1.76x speedup compared to a traditional single-ISA asymmetric multicore processor with the same area, for a variety of parallel applications.</em></td> </tr> <tr> <td>12:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="6.5">6.5 Efficient Data Representations in Neural Networks</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 11:00 - 12:30<br /><b>Location / Room:</b> Bayard</p> <p><b>Chair:</b><br />Brandon Reagen, Facebook and New York University, US</p> <p><b>Co-Chair:</b><br />Sebastian Steinhorst, TUM, DE</p> <p>The large processing requirements of ML models strain the capabilities of low-power embedded systems. Addressing this challenge, the first presentation proposes a robust co-design to leverage stochastic computing for highly accurate and efficient inference. Next, a structural optimization is proposed to counter faults at low voltage levels. Then, the authors present a method for sharing results in binarized CNNs to reduce computation. 
The session will conclude with a talk on implementing binary networks on mobile GPUs.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>6.5.1</td> <td><b>ACOUSTIC: ACCELERATING CONVOLUTIONAL NEURAL NETWORKS THROUGH OR-UNIPOLAR SKIPPED STOCHASTIC COMPUTING</b><br /><b>Speaker</b>:<br />Puneet Gupta, University of California, Los Angeles, US<br /><b>Authors</b>:<br />Wojciech Romaszkan, Tianmu Li, Tristan Melton, Sudhakar Pamarti and Puneet Gupta, University of California, Los Angeles, US<br /><em><b>Abstract</b><br />As privacy and latency requirements force a move towards edge Machine Learning inference, resource-constrained devices are struggling to cope with large and computationally complex models. For Convolutional Neural Networks, those limitations can be overcome by taking advantage of enormous data reuse opportunities and amenability to reduced precision. To do that, however, a level of compute density unattainable for conventional binary arithmetic is required. Stochastic Computing can deliver such density, but it has not lived up to its full potential because of multiple underlying precision issues. We present ACOUSTIC: Accelerating Convolutions through Or-Unipolar Skipped sTochastIc Computing, an accelerator framework that enables fully stochastic, high-density CNN inference. Leveraging split-unipolar representation, OR-based accumulation and a novel computation-skipping approach, ACOUSTIC delivers server-class parallelism within a mobile area and power budget: a 12 mm² accelerator can be as much as 38.7x more energy efficient and 72.5x faster than conventional fixed-point accelerators. It can also be up to 79.6x more energy efficient than state-of-the-art stochastic accelerators. At the lower end, ACOUSTIC achieves an 8x-120x inference throughput improvement with similar energy and area compared to recent mixed-signal/neuromorphic accelerators.</em></td> </tr> <tr> <td>11:30</td> <td>6.5.2</td> <td><b>ACCURACY TOLERANT NEURAL NETWORKS UNDER AGGRESSIVE POWER OPTIMIZATION</b><br /><b>Speaker</b>:<br />Yi-Wen Hung, National Tsing Hua University, TW<br /><b>Authors</b>:<br />Xiang-Xiu Wu<sup>1</sup>, Yi-Wen Hung<sup>1</sup>, Yung-Chih Chen<sup>2</sup> and Shih-Chieh Chang<sup>1</sup><br /><sup>1</sup>National Tsing Hua University, TW; <sup>2</sup>Yuan Ze University, Taoyuan, Taiwan, TW<br /><em><b>Abstract</b><br />With the success of deep learning, many neural network models have been proposed and applied to various applications. In several applications, the devices used to implement the complicated models have limited power resources, and thus aggressive optimization techniques are often applied to save power. However, some optimization techniques, such as voltage scaling and multiple threshold voltages, may increase the probability of error occurrence due to slow signal propagation, which increases path delays in a circuit and causes some input patterns to fail. Although neural network models are considered to have some error tolerance, the prediction accuracy could be significantly affected when there are a large number of errors. Thus, in this paper, we propose a scheme to mitigate the errors caused by slow signal propagation. Since the delay of the multipliers dominates the critical path of the circuit, we identify the patterns most affected by slow signal propagation through the multipliers and prevent these patterns from failing by adjusting the neural network and its parameters. 
The proposed scheme modifies a neural network on the software side, and thus it is unnecessary to re-design the hardware structure. The experimental results show that the proposed scheme is effective for several neural network models. It can improve accuracy by up to 27% when aggressive power optimization techniques are applied to the device under consideration.</em></td> </tr> <tr> <td>12:00</td> <td>6.5.3</td> <td><b>A CONVOLUTIONAL RESULT SHARING APPROACH FOR BINARIZED NEURAL NETWORK INFERENCE</b><br /><b>Speaker</b>:<br />CHIA CHUN LIN, National Tsing Hua University, TW<br /><b>Authors</b>:<br />Ya-Chun Chang<sup>1</sup>, Chia-Chun Lin<sup>1</sup>, Yi-Ting Lin<sup>1</sup>, Yung-Chih Chen<sup>2</sup> and Chun-Yao Wang<sup>1</sup><br /><sup>1</sup>National Tsing Hua University, TW; <sup>2</sup>Yuan Ze University, TW<br /><em><b>Abstract</b><br />The binary-weight-binary-input binarized neural network (BNN) allows a much more efficient way to implement convolutional neural networks (CNNs) on mobile platforms. During inference, the multiply-accumulate operations in BNNs can be reduced to XNOR-popcount operations. Thus, the XNOR-popcount operations dominate most of the computation in BNNs. To reduce the number of required operations in the convolution layers of BNNs, we decompose 3-D filters into 2-D filters and exploit repeated filters, inverse filters, and similar filters to share results. By sharing the results, the number of operations in the convolution layers of BNNs can be reduced effectively. Experimental results show that the number of operations can be reduced by about 60% for CIFAR-10 on BNNs while keeping the accuracy loss within 1% of the originally trained network.</em></td> </tr> <tr> <td>12:15</td> <td>6.5.4</td> <td><b>PHONEBIT: EFFICIENT GPU-ACCELERATED BINARY NEURAL NETWORK INFERENCE ENGINE FOR MOBILE PHONES</b><br /><b>Speaker</b>:<br />Gang Chen, Sun Yat-sen University, CN<br /><b>Authors</b>:<br />Gang Chen<sup>1</sup>, Shengyu He<sup>2</sup>, Haitao Meng<sup>2</sup> and Kai Huang<sup>1</sup><br /><sup>1</sup>Sun Yat-sen University, CN; <sup>2</sup>Northeastern University, CN<br /><em><b>Abstract</b><br />Over the last years, deep neural networks (DNNs) have seen great success in computer vision and other fields. However, performance and power constraints make it still challenging to deploy DNNs on mobile devices due to their high computational complexity. Binary neural networks (BNNs) have been demonstrated as a promising solution to achieve this goal by using bit-wise operations to replace most arithmetic operations. Currently, existing GPU-accelerated implementations of BNNs are only tailored for desktop platforms. Due to architecture differences, mere porting of such implementations to mobile devices yields suboptimal performance or is impossible in some cases. In this paper, we propose PhoneBit, a GPU-accelerated BNN inference engine for Android-based mobile devices that fully exploits the computing power of BNNs on mobile GPUs. PhoneBit provides a set of operator-level optimizations including locality-friendly data layout, bit packing with vectorization, and layer integration for efficient binary convolution. We also provide a detailed implementation and parallelization optimization for PhoneBit to optimally utilize the memory bandwidth and computing power of mobile GPUs. We evaluate PhoneBit with AlexNet, YOLOv2 Tiny and VGG16 with their binary versions. 
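<p>The XNOR-popcount arithmetic underlying both of the above works reduces a ±1 dot product to bit operations: for packed words, the dot product equals n - 2*popcount(a XOR b), since XNOR counts agreeing positions. A minimal sketch (our illustration, not either paper's kernel):</p> <pre><code>
# Bit-level sketch (ours) of XNOR-popcount arithmetic: pack {-1,+1} vectors
# into machine words; the dot product is then n - 2*popcount(a XOR b).

def pack(bits):
    """Pack a +/-1 vector into an int, encoding +1 as bit 1."""
    word = 0
    for i, v in enumerate(bits):
        word |= (v > 0) << i
    return word

def bnn_dot(a_word: int, b_word: int, n: int) -> int:
    """Dot product of two packed +/-1 vectors of length n."""
    disagree = bin(a_word ^ b_word).count("1")
    return n - 2 * disagree

a = [+1, -1, +1, +1, -1, -1, +1, -1]
b = [+1, +1, -1, +1, -1, +1, +1, +1]
assert bnn_dot(pack(a), pack(b), len(a)) == sum(x * y for x, y in zip(a, b))
print(bnn_dot(pack(a), pack(b), len(a)))   # 0: 4 agreements, 4 disagreements
</code></pre>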
Our experimental results show that PhoneBit can achieve significant speedup and energy efficiency compared with state-of-the-art frameworks for mobile devices.</em></td> </tr> <tr> <td style="width:40px;">12:30</td> <td><a href="/date20/conference/session/IP3">IP3-3</a>, 140</td> <td><b>HARDWARE ACCELERATION OF CNN WITH ONE-HOT QUANTIZATION OF WEIGHTS AND ACTIVATIONS</b><br /><b>Speaker</b>:<br />Gang Li, Chinese Academy of Sciences, CN<br /><b>Authors</b>:<br />Gang Li, Peisong Wang, Zejian Liu, Cong Leng and Jian Cheng, Chinese Academy of Sciences, CN<br /><em><b>Abstract</b><br />In this paper, we propose a novel one-hot representation for weights and activations in CNN models and demonstrate its benefits on hardware accelerator design. Specifically, rather than merely reducing the bitwidth, we quantize both weights and activations into n-bit integers that contain only one non-zero bit per value. In this way, the massive multiply-accumulate operations (MACs) are equivalent to additions of powers of two that can be efficiently calculated with histogram-based computations. Experiments on the ImageNet classification task show that comparable accuracy can be obtained with our proposed One-Hot Networks (OHN) compared to conventional fixed-point networks. As case studies, we evaluate the efficacy of one-hot data representation on two state-of-the-art CNN accelerators on FPGA; our preliminary results show that 50% and 68.5% resource savings can be achieved on DaDianNao and Laconic, respectively. Besides, the one-hot optimized Laconic can further achieve an average speedup of 4.94x on AlexNet.</em></td> </tr> <tr> <td style="width:40px;">12:31</td> <td><a href="/date20/conference/session/IP3">IP3-4</a>, 729</td> <td><b>BNNSPLIT: BINARIZED NEURAL NETWORKS FOR EMBEDDED DISTRIBUTED FPGA-BASED COMPUTING SYSTEMS</b><br /><b>Speaker</b>:<br />Luca Stornaiuolo, Politecnico di Milano, IT<br /><b>Authors</b>:<br />Giorgia Fiscaletti, Marco Speziali, Luca Stornaiuolo, Marco D. Santambrogio and Donatella Sciuto, Politecnico di Milano, IT<br /><em><b>Abstract</b><br />In the past few years, Convolutional Neural Networks (CNNs) have seen a massive improvement, outperforming other visual recognition algorithms. Since they are playing an increasingly important role in fields such as face recognition, augmented reality or autonomous driving, there is a growing need for a fast and efficient system to perform the redundant and heavy computations of CNNs. This trend has led researchers towards heterogeneous systems equipped with hardware accelerators, such as GPUs and FPGAs. The vast majority of CNNs are implemented with floating-point parameters and operations, but research has shown that high classification accuracy can also be obtained by reducing the floating-point activations and weights to binary values. This context is well suited to FPGAs, which are known to stand out in terms of performance when dealing with binary operations, as demonstrated in Finn, the state-of-the-art framework for building Binarized Neural Network (BNN) accelerators on FPGAs. 
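<p>The histogram trick in the one-hot work above can also be illustrated numerically; the sketch below is our illustration only, not the authors' accelerator datapath: when every operand is a signed power of two, a dot product reduces to counting exponent occurrences and shifting once per bucket.</p> <pre><code>
# Numeric sketch (ours) of the histogram trick described in IP3-3 above: if
# every operand is +/-2^k (one non-zero bit), a dot product reduces to
# counting how often each exponent occurs, then shifting once per bucket.
from collections import Counter

def onehot_dot(weights, acts):
    """weights, acts: lists of signed powers of two, e.g. -4, 2, 1."""
    hist = Counter()
    for w, a in zip(weights, acts):
        exp = (abs(w).bit_length() - 1) + (abs(a).bit_length() - 1)
        hist[exp] += 1 if (w > 0) == (a > 0) else -1   # sign of the product
    return sum(count << exp for exp, count in hist.items())

w = [2, -4, 1, 8]
a = [4, 2, -8, 1]
assert onehot_dot(w, a) == sum(x * y for x, y in zip(w, a))
print(onehot_dot(w, a))   # 8 - 8 - 8 + 8 = 0
</code></pre>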
In this paper, we propose a framework that extends Finn to a distributed scenario, enabling BNN implementation on embedded multi-FPGA systems.</em></td> </tr> <tr> <td style="width:40px;">12:32</td> <td><a href="/date20/conference/session/IP3">IP3-5</a>, 147</td> <td><b>L2L: A HIGHLY ACCURATE LOG_2_LEAD QUANTIZATION OF PRE-TRAINED NEURAL NETWORKS</b><br /><b>Speaker</b>:<br />Salim Ullah, Technische Universität Dresden, DE<br /><b>Authors</b>:<br />Salim Ullah<sup>1</sup>, Siddharth Gupta<sup>2</sup>, Kapil Ahuja<sup>2</sup>, Aruna Tiwari<sup>2</sup> and Akash Kumar<sup>1</sup><br /><sup>1</sup>Technische Universität Dresden, DE; <sup>2</sup>IIT Indore, IN<br /><em><b>Abstract</b><br />Deep Neural Networks are among the machine learning techniques increasingly used in a variety of applications. However, the significantly high memory and computation demands of deep neural networks often limit their deployment on embedded systems. Many recent works have considered this problem by proposing different types of data quantization schemes. However, most of these techniques either require post-quantization retraining of deep neural networks or bear a significant loss in output accuracy. In this paper, we propose a novel quantization technique for the parameters of pre-trained deep neural networks. Our technique largely maintains the accuracy of the parameters and does not require retraining of the networks. Compared to a single-precision floating-point implementation, our proposed 8-bit quantization technique incurs only ∼1% and ∼0.4% loss in the top-1 and top-5 accuracies, respectively, for the VGG16 network on the ImageNet dataset.</em></td> </tr> <tr> <td>12:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="6.6">6.6 From DFT to Yield Optimization</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 11:00 - 12:30<br /><b>Location / Room:</b> Lesdiguières</p> <p><b>Chair:</b><br />Stephan EGGERSGLUESS, Mentor, A Siemens Business, DE</p> <p><b>Co-Chair:</b><br />Ernesto SANCHEZ, Politecnico di Torino, IT</p> <p>The session presents a variety of semiconductor test techniques, including a new design-for-testability scheme for FinFET SRAMs, a method to increase yield based on error-metric-independent signature analysis, and a synthesis method for fault-tolerant reconfigurable scan networks.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>6.6.1</td> <td><b>A DFT SCHEME TO IMPROVE COVERAGE OF HARD-TO-DETECT FAULTS IN FINFET SRAMS</b><br /><b>Speaker</b>:<br />Guilherme Cardoso Medeiros, Delft University of Technology, NL<br /><b>Authors</b>:<br />Guilherme Cardoso Medeiros<sup>1</sup>, Cemil Cem Gürsoy<sup>2</sup>, Moritz Fieback<sup>1</sup>, Lizhou Wu<sup>1</sup>, Maksim Jenihhin<sup>2</sup>, Mottaqiallah Taouil<sup>1</sup> and Said Hamdioui<sup>1</sup><br /><sup>1</sup>Delft University of Technology, NL; <sup>2</sup>Tallinn University of Technology, EE<br /><em><b>Abstract</b><br />Manufacturing defects can cause faults in FinFET SRAMs. Of these, easy-to-detect (ETD) faults always cause incorrect behavior, and therefore are easily detected by applying sequences of write and read operations. However, hard-to-detect (HTD) faults may not cause incorrect behavior, only parametric deviations. Detection of these faults is of major importance, as they may lead to test escapes. 
This paper proposes a new design-for-testability (DFT) scheme for FinFET SRAMs to detect such faults by creating a mismatch in the sense amplifier (SA). This mismatch, combined with the defect in the cell, will incorrectly bias the SA and cause incorrect read outputs. Furthermore, post-silicon calibration schemes can be used to avoid over-testing or test escapes caused by process variation effects. Compared to the state of the art, this scheme introduces negligible overheads in area and test time while it significantly improves fault coverage and reduces the number of test escapes.</em></td> </tr> <tr> <td>11:30</td> <td>6.6.2</td> <td><b>SYNTHESIS OF FAULT-TOLERANT RECONFIGURABLE SCAN NETWORKS</b><br /><b>Speaker</b>:<br />Sebastian Brandhofer, Universität Stuttgart, DE<br /><b>Authors</b>:<br />Sebastian Brandhofer, Michael Kochte and Hans-Joachim Wunderlich, Universität Stuttgart, DE<br /><em><b>Abstract</b><br />On-chip instrumentation is mandatory for efficient bring-up, test and diagnosis, post-silicon validation, as well as in-field calibration, maintenance, and fault tolerance. Reconfigurable scan networks (RSNs) provide a scalable and efficient scan-based access mechanism to such instruments. The correct operation of this access mechanism is crucial for all manufacturing, bring-up and debug tasks as well as for in-field operation, but it can be affected by faults and design errors. This work develops for the first time fault-tolerant RSNs such that the resulting scan network still provides access to as many instruments as possible in the presence of a fault. The work contributes a model and an algorithm to compute scan paths in faulty RSNs, a metric to quantify fault tolerance, and a synthesis algorithm based on graph connectivity and selective hardening of control logic in the scan network. Experimental results demonstrate that fault-tolerant RSNs can be synthesized with only moderate hardware overhead.</em></td> </tr> <tr> <td>12:00</td> <td>6.6.3</td> <td><b>USING PROGRAMMABLE DELAY MONITORS FOR WEAR-OUT AND EARLY LIFE FAILURE PREDICTION</b><br /><b>Speaker</b>:<br />Chang Liu, Altran Deutschland, DE<br /><b>Authors</b>:<br />Chang Liu, Eric Schneider and Hans-Joachim Wunderlich, Universität Stuttgart, DE<br /><em><b>Abstract</b><br />Early-life failures in marginal devices are a severe reliability threat in current nano-scaled CMOS devices. While small delay faults are an effective indicator of marginalities, their detection requires special efforts in testing by so-called Faster-than-At-Speed Test (FAST). In a similar way, delay degradation is an indicator that a device is reaching the wear-out phase due to aging. Programmable delay monitors provide the possibility to detect gradual performance changes in a system and allow device degradation to be observed. This paper presents a unified approach to test small delay faults related to wear-out and early-life failures by reuse of existing programmable delay monitors within FAST.
The approach is complemented by a test schedule that optimally selects frequencies and delay configurations to significantly increase the fault coverage of small delay faults and to reduce test time.</em></td> </tr> <tr> <td>12:15</td> <td>6.6.4</td> <td><b>MAXIMIZING YIELD FOR APPROXIMATE INTEGRATED CIRCUITS</b><br /><b>Speaker</b>:<br />Marcello Traiola, Université de Montpellier, FR<br /><b>Authors</b>:<br />Marcello Traiola<sup>1</sup>, Arnaud Virazel<sup>1</sup>, Patrick Girard<sup>2</sup>, Mario Barbareschi<sup>3</sup> and Alberto Bosio<sup>4</sup><br /><sup>1</sup>LIRMM, FR; <sup>2</sup>LIRMM / CNRS, FR; <sup>3</sup>University of Naples Federico II, IT; <sup>4</sup>Lyon Institute of Nanotechnology, FR<br /><em><b>Abstract</b><br />Approximate Integrated Circuits (AxICs) have emerged in the last decade as an outcome of the Approximate Computing (AxC) paradigm. AxC focuses on the efficiency of computing systems by sacrificing some computation quality. As AxICs spread, new challenges in testing them have arisen. On the other hand, an opportunity to increase production yield has emerged in the AxIC context. Indeed, some particular defects in a manufactured AxIC might not catastrophically impact the final circuit quality. Therefore, some defective AxICs might still be acceptable. Efforts to identify conditions under which defective AxICs can be considered acceptable, with the goal of increasing production yield, have been made in recent years. Unfortunately, the final achieved yield gain is often not as high as expected. In this work, we propose a methodology to actually achieve a yield gain as close as possible to expectations, based on a technique to suitably apply tests to AxICs. Experiments carried out on state-of-the-art AxICs show yield gain results very close to the expected ones (i.e., between 98% and 100% of the expectations).</em></td> </tr> <tr> <td style="width:40px;">12:30</td> <td><a href="/date20/conference/session/IP3">IP3-6</a>, 359</td> <td><b>FAULT DIAGNOSIS OF VIA-SWITCH CROSSBAR IN NON-VOLATILE FPGA</b><br /><b>Speaker</b>:<br />Ryutaro Doi, Osaka University, JP<br /><b>Authors</b>:<br />Ryutaro DOI<sup>1</sup>, Xu Bai<sup>2</sup>, Toshitsugu Sakamoto<sup>2</sup> and Masanori Hashimoto<sup>1</sup><br /><sup>1</sup>Osaka University, JP; <sup>2</sup>NEC Corporation, JP<br /><em><b>Abstract</b><br />FPGAs that exploit via-switches, a kind of non-volatile resistive RAM, for crossbar implementation are attracting attention due to their high integration density and energy efficiency. The via-switch crossbar is responsible for signal routing by changing the on/off-states of via-switches. To verify the via-switch crossbar functionality after manufacturing, fault testing that checks whether via-switches can be turned on/off normally is essential. This paper confirms that a general differential pair comparator successfully discriminates the on/off-states of via-switches, and clarifies the fault modes of a via-switch by transistor-level SPICE simulation that injects stuck-on/off faults into the atom switches and varistors, where a via-switch consists of two atom switches and two varistors. We then propose a fault diagnosis methodology that diagnoses the fault modes of each via-switch using the comparator response difference between normal and faulty via-switches. The proposed method achieves 100% fault detection by checking the comparator responses after turning on/off the via-switch.
When the number of faulty components in a via-switch is one, the fault diagnosis ratio, i.e., the ratio of cases in which the faulty varistor or atom switch inside the faulty via-switch is exactly identified, is 100%; in the case of up to two faults, the fault diagnosis ratio is 79%.</em></td> </tr> <tr> <td>12:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="6.7">6.7 Safety and efficiency for smart automotive and energy systems</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 11:00 - 12:30<br /><b>Location / Room:</b> Berlioz</p> <p><b>Chair:</b><br />Selma Saidi, TUHH, DE</p> <p><b>Co-Chair:</b><br />Donghwa Shin, Soongsil University, KR</p> <p>This session presents four papers dealing with various aspects of smart automotive and energy systems, including safety and efficiency of photovoltaic panels, deterministic execution behavior of adaptive automotive applications, efficient implementation of fail-operational automated vehicles, and efficient resource usage in networked automotive systems.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>6.7.1</td> <td><b>A DIODE-AWARE MODEL OF PV MODULES FROM DATASHEET SPECIFICATIONS</b><br /><b>Speaker</b>:<br />Sara Vinco, Politecnico di Torino, IT<br /><b>Authors</b>:<br />Sara Vinco, Yukai Chen, Enrico Macii and Massimo Poncino, Politecnico di Torino, IT<br /><em><b>Abstract</b><br />Semi-empirical models of photovoltaic (PV) modules based only on datasheet information are popular in electrical energy systems (EES) simulation because they can be built without measurements and allow quick exploration of alternative devices. One key limitation of these models, however, is the fact that they cannot model the presence of bypass diodes, which are inserted across a set of series-connected cells in a PV module to mitigate the impact of partial shading; datasheet information refers in fact to the operation of the module under uniform irradiance. Neglecting the effect of bypass diodes may incur a significant underestimation of the extracted power. This paper proposes a semi-empirical model for a PV module that takes into account the only information about bypass diodes available in a datasheet, i.e., their number; by first downscaling the model to a single PV cell and subsequently upscaling it to the level of a substring and of a module, it accounts for the diode effect as accurately as the datasheet information allows. Experimental results show that, in a typical PV array on a roof, using a diode-agnostic model can significantly underestimate the output power production.</em></td> </tr> <tr> <td>11:30</td> <td>6.7.2</td> <td><b>ACHIEVING DETERMINISM IN ADAPTIVE AUTOSAR</b><br /><b>Speaker</b>:<br />Christian Menard, Technische Universität Dresden, DE<br /><b>Authors</b>:<br />Christian Menard<sup>1</sup>, Andres Goens<sup>1</sup>, Marten Lohstroh<sup>2</sup> and Jeronimo Castrillon<sup>1</sup><br /><sup>1</sup>Technische Universität Dresden, DE; <sup>2</sup>University of California, Berkeley, US<br /><em><b>Abstract</b><br />The AUTOSAR Adaptive Platform (AP) is an emerging industry standard that tackles the challenges of modern automotive software design, but it does not provide adequate mechanisms to enforce deterministic execution.
This poses profound challenges to the testing and maintenance of the application software, which is particularly problematic for safety-critical applications. In this paper, we analyze the problem of nondeterminism in AP and propose a framework for the design of deterministic automotive software that transparently integrates with the AP communication mechanisms. We illustrate our approach in a case study based on the brake assistant demonstrator application that is provided by the AUTOSAR consortium. We show that the original implementation is nondeterministic and discuss a deterministic solution based on our framework.</em></td> </tr> <tr> <td>12:00</td> <td>6.7.3</td> <td><b>A FAIL-SAFE ARCHITECTURE FOR AUTOMATED DRIVING</b><br /><b>Speaker</b>:<br />Sebastian vom Dorff, DENSO Automotive Deutschland GmbH, DE<br /><b>Authors</b>:<br />Sebastian vom Dorff<sup>1</sup>, Bert Böddeker<sup>2</sup>, Maximilian Kneissl<sup>1</sup> and Martin Fränzle<sup>3</sup><br /><sup>1</sup>DENSO Automotive Deutschland GmbH, DE; <sup>2</sup>Autonomous Intelligent Driving GmbH, DE; <sup>3</sup>Carl von Ossietzky University Oldenburg, DE<br /><em><b>Abstract</b><br />The development of autonomous vehicles has gained rapid pace. Along with the promising possibilities of such automated systems, the question of how to ensure their safety arises. With increasing levels of automation, the need for fail-operational systems that do not rely on a back-up driver poses new challenges in system design. In this paper we propose a lightweight architecture addressing the challenge of a verifiable, fail-safe safety implementation for trajectory planning. It offers a distributed design and the ability to comply with the requirements of ISO 26262, while avoiding an overly redundant set-up. Furthermore, we show an example with low-level prediction models applied to a real-world situation.</em></td> </tr> <tr> <td>12:15</td> <td>6.7.4</td> <td><b>PRIORITY-PRESERVING OPTIMIZATION OF STATUS QUO ID-ASSIGNMENTS IN CONTROLLER AREA NETWORK</b><br /><b>Speaker</b>:<br />Lea Schoenberger, TU Dortmund University, DE<br /><b>Authors</b>:<br />Sebastian Schwitalla<sup>1</sup>, Lea Schönberger<sup>1</sup> and Jian-Jia Chen<sup>2</sup><br /><sup>1</sup>TU Dortmund University, DE; <sup>2</sup>TU Dortmund, DE<br /><em><b>Abstract</b><br />Controller Area Network (CAN) is the prevailing solution for connecting multiple electronic control units (ECUs) in automotive systems. Every broadcast message on the bus is received by each bus participant and introduces computational overhead to the typically resource-constrained ECUs due to interrupt handling. To reduce this overhead, hardware message filters can be applied. However, since such filters are configured according to the message identifiers (IDs) specified in the system, the filter quality is limited by the nature of the ID-assignment. Although hardware message filters are highly relevant for industrial applications, so far only the optimization of the filter design, but not the related optimization of ID-assignments, has been addressed in the literature. In this work, we explicitly focus on the optimization of message ID-assignments against the background of hardware message filtering; a sketch of such filtering follows below.
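<p>For background, a minimal C sketch of typical mask-based CAN acceptance filtering (the exact semantics vary by controller; this is not the paper's algorithm): the quality of an ID-assignment determines how precisely mask/code pairs can isolate the IDs an ECU actually needs.</p> <pre><code>#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;

/* A received 11-bit identifier is accepted, and thus raises a receive
 * interrupt, only if the bits selected by `mask` match the filter
 * `code`. Better ID-assignments let fewer unwanted messages through,
 * reducing per-ECU interrupt load.                                   */
static bool can_filter_accepts(uint16_t id, uint16_t code, uint16_t mask)
{
    return (id &amp; mask) == (code &amp; mask);
}</code></pre>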
More precisely, we propose an optimization algorithm transforming a given ID-assignment in such a way that, based on the resulting IDs, the quality of hardware message filters is improved significantly, i.e., the computational overhead introduced to each ECU is minimized, while the priority order of the system remains unchanged. Conducting comprehensive experiments on automotive benchmarks, we show that our proposed algorithm clearly outperforms optimizations based on conventional simulated annealing with respect to both the achieved filter quality and the runtime.</em></td> </tr> <tr> <td style="width:40px;">12:30</td> <td><a href="/date20/conference/session/IP3">IP3-7</a>, 519</td> <td><b>APPLYING RESERVATION-BASED SCHEDULING TO A µC-BASED HYPERVISOR: AN INDUSTRIAL CASE STUDY</b><br /><b>Speaker</b>:<br />Dirk Ziegenbein, Robert Bosch GmbH, DE<br /><b>Authors</b>:<br />Dakshina Dasari<sup>1</sup>, Paul Austin<sup>2</sup>, Michael Pressler<sup>1</sup>, Arne Hamann<sup>1</sup> and Dirk Ziegenbein<sup>1</sup><br /><sup>1</sup>Robert Bosch GmbH, DE; <sup>2</sup>ETAS GmbH, GB<br /><em><b>Abstract</b><br />Existing software scheduling mechanisms do not suffice for emerging applications in the automotive space, which have the conflicting needs of performance and predictability. As a concrete case, we consider the ETAS light-weight hypervisor, a commercially viable solution in the automotive industry, deployed on multicore microcontrollers. We describe the architecture of the hypervisor and its current scheduling mechanisms based on Time Division Multiplexing. We next show how Reservation-based Scheduling (RBS) can be implemented in the ETAS LWHVR to efficiently use resources while also providing freedom from interference, and we explore design choices towards an efficient implementation of such a scheduler. With experiments from an industry use case, we also compare the performance of RBS and the existing scheduler in the hypervisor.</em></td> </tr> <tr> <td style="width:40px;">12:31</td> <td><a href="/date20/conference/session/IP3">IP3-8</a>, 353</td> <td><b>REAL-TIME ENERGY MONITORING IN IOT-ENABLED MOBILE DEVICES</b><br /><b>Speaker</b>:<br />Nitin Shivaraman, TUMCREATE, SG<br /><b>Authors</b>:<br />Nitin Shivaraman<sup>1</sup>, Seima Suriyasekaran<sup>1</sup>, Zhiwei Liu<sup>2</sup>, Saravanan Ramanathan<sup>1</sup>, Arvind Easwaran<sup>2</sup> and Sebastian Steinhorst<sup>3</sup><br /><sup>1</sup>TUMCREATE, SG; <sup>2</sup>Nanyang Technological University, SG; <sup>3</sup>TUM, DE<br /><em><b>Abstract</b><br />With rapid advancements in the Internet of Things (IoT) paradigm, every electrical device in the near future is expected to have IoT capabilities. This enables fine-grained tracking of the individual energy consumption data of such devices, offering location-independent per-device billing and demand management. Hence, it abstracts from the location-based metering of state-of-the-art infrastructure, which traditionally aggregates on a building or household level, defining the entity to be billed. However, such in-device energy metering is susceptible to manipulation and fraud. As a remedy, we propose a secure decentralized metering architecture that enables devices with IoT capabilities to measure their own energy consumption.
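<p>As a toy illustration of the tamper-evidence idea behind blockchain-backed metering of the kind described next (this is not the paper's protocol, and FNV-1a stands in for a proper cryptographic hash purely for brevity):</p> <pre><code>#include &lt;stdint.h&gt;

/* Each meter record embeds the digest of its predecessor, so any
 * retroactive change to a reading breaks the chain and is detectable.
 * A real deployment would use a cryptographic hash such as SHA-256. */
typedef struct { uint32_t reading_mwh; uint64_t prev_digest; } record_t;

static uint64_t fnv1a(const void *data, unsigned len, uint64_t h)
{
    const uint8_t *p = data;
    while (len--) { h ^= *p++; h *= 1099511628211ULL; }
    return h;
}

static uint64_t record_digest(const record_t *r)
{
    return fnv1a(r, sizeof *r, 14695981039346656037ULL);
}</code></pre>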
In this architecture, the device-level consumption is additionally reported to a system-level aggregator that verifies distributed information from our decentralized metering systems and provides secure data storage using Blockchain, preventing data manipulation by untrusted entities. Through experimental evaluation, we show that the proposed architecture supports device mobility and enables location-independent monitoring of energy consumption.</em></td> </tr> <tr> <td>12:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="7.0">7.0 LUNCHTIME KEYNOTE SESSION</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 13:45 - 14:20<br /><b>Location / Room:</b> </p> <p><b>Chair:</b><br />Bernabe Linares-Barranco, CSIC, ES</p> <p><b>Co-Chair:</b><br />Li-C Wang, University of California, Santa Barbara, US</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>13:45</td> <td>7.0.0</td> <td><b>CEDA LUNCHEON ANNOUNCEMENT</b><br /><b>Author</b>:<br />David Atienza, École Polytechnique Fédérale de Lausanne, CH</td> </tr> <tr> <td>13:50</td> <td>7.0.1</td> <td><b>LEVERAGING EMBEDDED INTELLIGENCE IN INDUSTRY: CHALLENGES AND OPPORTUNITIES</b><br /><b>Author</b>:<br />Jim Tung, MathWorks Fellow, US<br /><em><b>Abstract</b><br />The buzz about AI is deafening. Compelling applications are starting to emerge, dramatically changing the customer service that we experience, the marketing messages that we receive, and some systems we use. But, as organizations decide whether and how to incorporate AI in their systems and services, they must bring together new combinations of specialized knowledge, domain expertise, and business objectives. They must navigate through numerous choices - algorithms, processors, compute placement, data availability, architectural allocation, communications, and more. At the same time, they must keep their focus on the applications that will create compelling value for them. In this keynote, Jim Tung looks at the promising opportunities and practical challenges of building AI into our systems and services.</em></td> </tr> <tr> <td>14:20</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="7.1">7.1 Special Day on "Embedded AI": Industry AI chips</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 14:30 - 16:00<br /><b>Location / Room:</b> Amphithéâtre Jean Prouve</p> <p><b>Chair:</b><br />Tobi Delbrück, ETH Zurich, CH</p> <p><b>Co-Chair:</b><br />Bernabe Linares-Barranco, CSIC, ES</p> <p>This session on Industry AI chips will present examples of companies developing actual products for AI hardware solutions, a highly competitive market that is full of new challenges.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>7.1.1</td> <td><b>OPPORTUNITIES FOR ANALOG ACCELERATION OF DEEP LEARNING WITH PHASE CHANGE MEMORY</b><br /><b>Authors</b>:<br />Pritish Narayanan, Geoffrey W. Burr, Stefano Ambrogio, Hsinyu Tsai, Charles Mackin, Katherine Spoon, An Chen, Alexander Friz and Andrea Fasoli, IBM Research, US<br /><em><b>Abstract</b><br />Storage Class Memory and High Bandwidth Memory technologies are already reshaping systems architecture in interesting ways, by bringing cheap and high-density memory closer and closer to processing.
Extrapolating on this trend, a new class of in-memory computing solutions is emerging, where some or all of the computing happens at the location of the data. Within the landscape of in-memory computing approaches, non-von Neumann architectures seek to eliminate most of the data movement associated with computing, removing the demarcation between compute and memory. While such non-von Neumann architectures could offer orders-of-magnitude performance improvements on certain workloads, they are neither as general-purpose nor as easily programmable as von Neumann architectures. Therefore, well-defined use cases need to exist to justify the hardware investment. Fortunately, acceleration of deep learning, which is both compute- and memory-intensive, is one such use case. Today, the training of deep learning networks is done primarily in the cloud and can take days or weeks even when using many GPUs. Specialized hardware for training is thus primarily focused on speedup, with energy/power a secondary concern. On the other hand, 'inference', the deployment and use of pre-trained models for real-world tasks, is done both in the cloud and on edge devices and presents hardware opportunities at both high-speed and low-power design points. In this presentation, we describe some of the opportunities and challenges in building accelerators for deep learning using analog volatile and non-volatile memory. We review our group's recent progress towards achieving software-equivalent accuracies on deep learning tasks in the presence of real-device imperfections such as non-linearity, asymmetry, variability and conductance drift. We will present some novel techniques and optimizations across device, circuit, and neural network design to achieve high accuracy with existing devices. We will then discuss challenges for peripheral circuit design and conclude by providing an outlook on the prospects for analog memory-based DNN accelerators.</em></td> </tr> <tr> <td>11:22</td> <td>7.1.2</td> <td><b>EVENT-BASED AI FOR AUTOMOTIVE AND IOT</b><br /><b>Author</b>:<br />Amos Sironi, Prophesee, FR<br /><em><b>Abstract</b><br />Event cameras are a new type of sensor encoding visual information in the form of asynchronous events. An event corresponds to a change in the log-luminosity intensity at a given pixel location. Compared to standard frame cameras, event cameras have higher temporal resolution, higher dynamic range and lower power consumption. Thanks to these characteristics, event cameras find many applications in automotive and IoT, where low latency, robustness to challenging lighting conditions and low power consumption are critical requirements. In this talk we present recent advances in artificial intelligence applied to event cameras. In particular, we discuss how to adapt deep learning methods to work on events and their advantages compared to conventional frame-based methods.
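<p>For readers unfamiliar with the data format, a generic C sketch of the address-event representation (field widths and the 640x480 resolution are assumptions; this is not Prophesee's API), together with one common way of bridging events to frame-based networks:</p> <pre><code>#include &lt;stdint.h&gt;

#define WIDTH  640   /* assumed sensor resolution */
#define HEIGHT 480

/* Each event carries a pixel location, a polarity (brighter or
 * darker) and a microsecond timestamp.                            */
typedef struct {
    uint16_t x, y;
    int8_t   polarity;   /* +1 or -1 */
    uint64_t t_us;
} dvs_event_t;

/* One simple bridge to frame-based networks: accumulate signed
 * polarities over a time window into a 2D histogram (an "event
 * frame") that a conventional CNN can consume.                    */
static void accumulate(int16_t hist[HEIGHT][WIDTH],
                       const dvs_event_t *ev, unsigned n)
{
    for (unsigned i = 0; i &lt; n; i++)
        hist[ev[i].y][ev[i].x] += ev[i].polarity;
}</code></pre>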
The presentation will be illustrated by results on object detection in automotive and IoT scenarios, running in real time on mobile platforms.</em></td> </tr> <tr> <td>11:44</td> <td>7.1.3</td> <td><b>NEURONFLOW: A NEUROMORPHIC PROCESSOR ARCHITECTURE FOR LIVE AI APPLICATIONS</b><br /><b>Speaker</b>:<br />Orlando Moreira, GrAI Matter Labs, NL<br /><b>Authors</b>:<br />Orlando Moreira, Amirreza Yousefzadeh, Gokturk Cinserin, Rik-Jan Zwartenkot, Ajay Kapoor, Fabian Chersi, Peng Qiao, Peter Kievits, Mina Khoei, Louis Rouillard, Ashoka Visweswara and Jonathan Tapson, GrAI Matter Labs, NL<br /><em><b>Abstract</b><br />This paper gives an overview of the Neuronflow many-core architecture. It is a neuromorphic data flow architecture that exploits brain-inspired concepts to deliver a scalable event-based processing engine for neural networks in Live AI applications at the edge. Its design is inspired by brain biology, but is not necessarily biologically plausible. The main design goal is the exploitation of sparsity to dramatically reduce latency and power consumption, as required by sensor processing at the edge.</em></td> </tr> <tr> <td>12:06</td> <td>7.1.4</td> <td><b>SPECK - SUB-MW SMART VISION SENSOR FOR MOBILE IOT APPLICATIONS</b><br /><b>Author</b>:<br />Ning Qiao, aiCTX, CH<br /><em><b>Abstract</b><br />Speck is the first available neuromorphic smart vision sensor system-on-chip (SoC), which combines neuromorphic vision sensing and neuromorphic computation on a single die, for mW vision processing. The DVS pixel array is coupled directly to a new fully-asynchronous event-driven spiking CNN processor for highly compact and energy-efficient dynamic visual processing. Speck supports a wide range of potential applications, spanning industrial and consumer-facing use cases.</em></td> </tr> <tr> <td>16:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="7.2">7.2 Reconfigurable Systems and Architectures</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 14:30 - 16:00<br /><b>Location / Room:</b> Chamrousse</p> <p><b>Chair:</b><br />Christian Pilato, Politecnico di Milano, IT</p> <p><b>Co-Chair:</b><br />Philippe Coussy, University Bretagne Sud / Lab-STICC, FR</p> <p>Reconfigurable technologies are evolving at the device, architecture, and system levels, from embedded computation to server-based accelerator integration. In this session we explore ideas at these levels, discussing architectural features for power optimisation of CGRAs, a framework for integrating FPGA accelerators in serverless environments, and placement strategies on alternative FPGA device technologies.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>7.2.1</td> <td><b>A FRAMEWORK FOR ADDING LOW-OVERHEAD, FINE-GRAINED POWER DOMAINS TO CGRAS</b><br /><b>Speaker</b>:<br />Ankita Nayak, Stanford University, US<br /><b>Authors</b>:<br />Ankita Nayak, Keyi Zhang, Raj Setaluri, Alex Carsello, Makai Mann, Stephen Richardson, Rick Bahr, Pat Hanrahan, Mark Horowitz and Priyanka Raina, Stanford University, US<br /><em><b>Abstract</b><br />To effectively minimize static power for a wide range of applications, power domains for a coarse-grained reconfigurable array (CGRA) need to be finer-grained than in a typical ASIC.
However, the special isolation logic needed to ensure electrical protection between off and on domains makes fine-grained power domains area- and timing-inefficient. We propose a novel design of the CGRA routing fabric that intrinsically provides boundary protection. This technique reduces the area overhead of boundary protection between power domains for the CGRA from around 9% to less than 1% and removes the delay from the isolation cells. However, with this design choice, we cannot leverage the conventional UPF-based flow to introduce power domain boundary protection. We create compiler-like passes that iteratively introduce the needed design transformations, and formally verify the passes with satisfiability modulo theories (SMT) methods. These passes also allow us to optimize how we handle test and debug signals through the off tiles. We use our framework to insert power domains into an SoC with an ARM Cortex-M3 processor and a CGRA with 32x16 processing element (PE) and memory tiles and 4MB secondary memory. Depending on the size of the applications mapped, our CGRA achieves up to an 83% reduction in leakage power and a 26% reduction in total power versus a CGRA without multiple power domains, for a range of image processing and machine learning applications.</em></td> </tr> <tr> <td>15:00</td> <td>7.2.2</td> <td><b>BLASTFUNCTION: AN FPGA-AS-A-SERVICE SYSTEM FOR ACCELERATED SERVERLESS COMPUTING</b><br /><b>Speaker</b>:<br />Rolando Brondolin, Politecnico di Milano, IT<br /><b>Authors</b>:<br />Marco Bacis, Rolando Brondolin and Marco D. Santambrogio, Politecnico di Milano, IT<br /><em><b>Abstract</b><br />Heterogeneous computing platforms are now a valuable solution to continue to meet Service Level Agreements (SLAs) for compute-intensive cloud workloads. Field Programmable Gate Arrays (FPGAs) effectively accelerate cloud workloads; however, these workloads have spiky behavior as well as long periods of underutilization. Sharing the FPGA among multiple tenants therefore helps to increase the board's time utilization. In this paper we present BlastFunction, a distributed FPGA sharing system for the acceleration of microservices and serverless applications in cloud environments. BlastFunction includes a Remote OpenCL Library to access the shared devices transparently; multiple Device Managers to time-share and monitor the FPGAs; and a central Accelerators Registry to allocate the available devices. BlastFunction reaches higher utilization and throughput than native execution thanks to device sharing, with minimal latency differences caused by the concurrent accesses.</em></td> </tr> <tr> <td>15:30</td> <td>7.2.3</td> <td><b>ENERGY-AWARE PLACEMENT FOR SRAM-NVM HYBRID FPGAS</b><br /><b>Speaker</b>:<br />Seongsik Park, Seoul National University, KR<br /><b>Authors</b>:<br />Seongsik Park, Jongwan Kim and Sungroh Yoon, Seoul National University, KR<br /><em><b>Abstract</b><br />Field-programmable gate arrays (FPGAs) have been widely used in many applications due to their reconfigurability. In particular, the short development time makes FPGAs a promising reconfigurable architecture for emerging applications, such as deep learning. As CMOS technology advances, however, conventional SRAM-based FPGAs have approached their limitations. To overcome these obstacles, NVM-based FPGAs have been introduced. Although NVM-based FPGAs offer high area density, low static power consumption, and non-volatility, they struggle to reduce energy consumption.
This challenge is mainly caused by the access speed of NVM, which is slower than that of SRAM. In this paper, to compensate for this limitation, we propose an SRAM-NVM hybrid FPGA architecture with SRAM- and NVM-based CLBs. In addition, we propose an energy-aware placement for efficient use of the SRAM-NVM hybrid FPGAs. In our experiments, the SRAM-NVM hybrid FPGA reduced average energy consumption by 22.23% and 21.94% compared to an SRAM-based FPGA on the MCNC and VTR benchmarks, respectively.</em></td> </tr> <tr> <td>16:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="7.3">7.3 Special Session: Realizing Quantum Algorithms on Real Quantum Computing Devices</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 14:30 - 16:00<br /><b>Location / Room:</b> Autrans</p> <p><b>Chair:</b><br />Anupam Chattopadhyay, NTU Singapore, SG</p> <p><b>Co-Chair:</b><br />Swaroop Ghosh, Penn State, US</p> <p>Quantum computing is currently moving from an academic idea to a practical reality. Quantum computing in the cloud is already available and allows users from all over the world to develop and execute real quantum algorithms. However, companies that are heavily investing in this new technology, such as Google, IBM, Rigetti, and Intel, follow different technological approaches. This has led to a situation where substantially different quantum computing devices are available. Because of that, various methods for realizing the intended quantum functionality on a given quantum computing device are available. This special session provides an introduction to and overview of this domain and comprehensively describes the corresponding methods (also referred to as compilers, mappers, synthesizers, or routers). In this way, attendees will be provided with a detailed understanding of how to use quantum computers in general and dedicated quantum computing devices in particular. The special session will include speakers from both academia and industry, and will cover the most relevant quantum computing devices, such as those provided by IBM, Intel, etc.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>7.3.1</td> <td><b>RUNNING QUANTUM ALGORITHMS ON RESOURCE-CONSTRAINED QUANTUM DEVICES</b><br /><b>Author</b>:<br />Carmen G. Almudever, Delft University of Technology, NL<br /><em><b>Abstract</b><br />A number of quantum computing devices consisting of a few tens of noisy qubits already exist. All of them present various limitations, such as limited qubit connectivity and a reduced gate set, that must be considered to make quantum algorithms executable. In this talk, after briefly introducing the basics of quantum computing, we will provide an overview of the problem of realizing quantum circuits. We will discuss different mapping approaches as well as quantum devices, emphasizing their main constraints. Special attention will be given to the quantum chips developed within the QuTech-Intel partnership.</em></td> </tr> <tr> <td>15:00</td> <td>7.3.2</td> <td><b>REALIZING QUANTUM CIRCUITS ON IBM Q DEVICES</b><br /><b>Author</b>:<br />Robert Wille, Johannes Kepler Universität Linz, AT<br /><em><b>Abstract</b><br />In 2017, IBM launched the first publicly available quantum computing device which is accessible through a cloud service.
Since then, many further devices have followed, and they have been used by more than 100,000 people who have executed more than 7 million experiments on them. Accordingly, fast and efficient solutions for realizing quantum functionality on those devices are demanded by a huge user base. This talk will provide an overview of IBM's own tools for this purpose, as well as solutions developed by researchers worldwide, including a description of a compiler that won the IBM Qiskit Developer Challenge.</em></td> </tr> <tr> <td>15:30</td> <td>7.3.3</td> <td><b>EVERY DEVICE IS (ALMOST) EQUAL BEFORE THE COMPILER</b><br /><b>Author</b>:<br />Gian Giacomo Guerreschi, Intel Corporation, US<br /><em><b>Abstract</b><br />At the current stage of quantum computing technologies, it is not only expected but often required to tailor the compiler to the characteristics of each individual machine. No coherence time can be wasted. However, many of the recently presented architectures have constraints that can be described in a unifying framework. We discuss how to represent these constraints and how more flexible compilers can be used to guide the design of novel architectures.</em></td> </tr> <tr> <td>16:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="7.4">7.4 Simulation and verification: where real issues meet scientific innovation</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 14:30 - 16:00<br /><b>Location / Room:</b> Stendhal</p> <p><b>Chair:</b><br />Avi Ziv, IBM, IL</p> <p><b>Co-Chair:</b><br />Graziano Pravadelli, Università di Verona, IT</p> <p>This session presents recent concerns and innovative solutions in verification and simulation, covering topics ranging from partial verification and lazy event prediction to signal-name disambiguation. The papers tackle these challenges by reducing complexity, exploiting GPUs, and using similarity-learning techniques.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>7.4.1</td> <td><b>VERIFICATION RUNTIME ANALYSIS: GET THE MOST OUT OF PARTIAL VERIFICATION</b><br /><b>Authors</b>:<br />Martin Ring<sup>1</sup>, Fritjof Bornbebusch<sup>1</sup>, Christoph Lüth<sup>2</sup>, Robert Wille<sup>3</sup> and Rolf Drechsler<sup>2</sup><br /><sup>1</sup>Deutsches Forschungszentrum für Künstliche Intelligenz, DE; <sup>2</sup>Universität Bremen / DFKI GmbH, DE; <sup>3</sup>Johannes Kepler Universität Linz, AT<br /><em><b>Abstract</b><br />The design of modern systems has reached a complexity which makes it inevitable to apply verification methods in order to guarantee their correct and safe execution. These verification methods frequently produce proof obligations that cannot be solved anymore due to the huge search space. However, by setting enough variables to fixed values, the search space is reduced and solving engines may eventually be able to complete the verification task. Although this results in a partial verification, the results may still be valuable, in particular as opposed to the alternative of no verification at all. However, so far no systematic investigation has been conducted on which variables to fix in order to reduce verification runtime as much as possible while, at the same time, still obtaining the most coverage. This paper addresses this question by proposing a corresponding verification runtime analysis.
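<p>To make the underlying effect concrete (a toy C sketch, not the paper's analysis): fixing k of n boolean variables restricts an exhaustive check from 2<sup>n</sup> assignments to the 2<sup>n-k</sup> assignments of a subcube, trading coverage for runtime.</p> <pre><code>#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;

typedef bool (*property_fn)(uint32_t assignment);

/* Checking a property over n free boolean variables (n &lt; 32 assumed)
 * costs 2^n evaluations; fixing the variables selected by fixed_mask
 * leaves a subcube of 2^(n-k) assignments. For brevity this loop
 * filters instead of enumerating the subcube directly. A pass is only
 * a partial verification result: the property holds on this slice.   */
static bool holds_on_slice(property_fn prop, unsigned n,
                           uint32_t fixed_mask, uint32_t fixed_val)
{
    for (uint32_t a = 0; a &lt; (1u &lt;&lt; n); a++) {
        if ((a &amp; fixed_mask) != (fixed_val &amp; fixed_mask))
            continue;              /* outside the fixed-variable slice */
        if (!prop(a))
            return false;          /* counterexample found             */
    }
    return true;
}</code></pre>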
Experimental evaluations confirm the potential of this approach.</em></td> </tr> <tr> <td>15:00</td> <td>7.4.2</td> <td><b>GPU-ACCELERATED TIME SIMULATION OF SYSTEMS WITH ADAPTIVE VOLTAGE AND FREQUENCY SCALING</b><br /><b>Speaker</b>:<br />Eric Schneider, Universität Stuttgart, DE<br /><b>Authors</b>:<br />Eric Schneider and Hans-Joachim Wunderlich, Universität Stuttgart, DE<br /><em><b>Abstract</b><br />Timing validation of systems with adaptive voltage and frequency scaling (AVFS) requires an accurate timing model under multiple operating points. Simulating such a model at gate level is extremely time-consuming, and the state of the art compromises both accuracy and compute efficiency. This paper presents a method for dynamic gate delay modeling on graphics processing unit (GPU) accelerators which is based on polynomial approximation with offline statistical learning using regression analysis. It provides glitch-accurate switching activity information for gates and designs under varying supply voltages with negligible memory and performance impact. Parallelism from the evaluation of operating conditions, gates and stimuli is exploited simultaneously to utilize the high arithmetic computing throughput of GPUs. This way, large-scale design space exploration of AVFS-based systems is enabled. Experimental results demonstrate the efficiency and accuracy of the presented approach, showing speedups of three orders of magnitude over conventional time simulation that supports static delays only.</em></td> </tr> <tr> <td>15:30</td> <td>7.4.3</td> <td><b>LAZY EVENT PREDICTION USING DEFINING TREES AND SCHEDULE BYPASS FOR OUT-OF-ORDER PDES</b><br /><b>Speaker</b>:<br />Rainer Doemer, University of California, Irvine, US<br /><b>Authors</b>:<br />Daniel Mendoza, Zhongqi Cheng, Emad Arasteh and Rainer Doemer, University of California, Irvine, US<br /><em><b>Abstract</b><br />Out-of-order parallel discrete event simulation (PDES) has been shown to be very effective in speeding up system design by utilizing parallel processors on multi- and many-core hosts. As the number of threads in the design model grows larger, however, the original scheduling approach does not scale. In this work, we analyze the out-of-order scheduler and identify a bottleneck with quadratic complexity in event prediction. We propose a more efficient lazy strategy based on defining trees and a schedule bypass with O(m log<sub>2</sub> m) complexity which shows sustained and improved performance gains in simulation of SystemC models with many processes. For models containing over 1000 processes, experimental results show simulation run time speedups of up to 90x using lazy event prediction against the original out-of-order PDES approach.</em></td> </tr> <tr> <td>15:45</td> <td>7.4.4</td> <td><b>EMBEDDING HIERARCHICAL SIGNAL TO SIAMESE NETWORK FOR FAST NAME RECTIFICATION</b><br /><b>Speaker</b>:<br />Yi-An Chen, National Chiao Tung University, TW<br /><b>Authors</b>:<br />Yi-An Chen<sup>1</sup>, Gung-Yu Pan<sup>2</sup>, Che-Hua Shih<sup>2</sup>, Yen-Chin Liao<sup>1</sup>, Chia-Chih Yen<sup>2</sup> and Hsie-Chia Chang<sup>1</sup><br /><sup>1</sup>National Chiao Tung University, TW; <sup>2</sup>Synopsys, TW<br /><em><b>Abstract</b><br />EDA tools are necessary to assist the complicated flows of advanced IC design and verification in today's industry. After synthesis or simulation, the same signal may appear under different hierarchical names, especially for mixed-language designs.
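<p>As a simplified stand-in for the learned-embedding pipeline described in the remainder of this abstract (the paper's actual approach uses a Siamese network with LSH), plain character-3-gram overlap already yields a crude name-similarity score:</p> <pre><code>#include &lt;stdbool.h&gt;
#include &lt;string.h&gt;

/* Collect up to `max` character 3-grams of s into out.             */
static unsigned trigrams(const char *s, char out[][4], unsigned max)
{
    unsigned n = 0;
    for (size_t i = 0; s[i] &amp;&amp; s[i+1] &amp;&amp; s[i+2] &amp;&amp; n &lt; max; i++) {
        memcpy(out[n], s + i, 3);
        out[n][3] = '\0';
        n++;
    }
    return n;
}

/* Jaccard similarity of the two trigram multisets, with naive
 * O(n*m) greedy matching; 1.0 means identical, 0.0 disjoint.       */
static double name_similarity(const char *a, const char *b)
{
    char ga[256][4], gb[256][4];
    bool used[256] = { false };
    unsigned na = trigrams(a, ga, 256), nb = trigrams(b, gb, 256);
    unsigned inter = 0;
    for (unsigned i = 0; i &lt; na; i++)
        for (unsigned j = 0; j &lt; nb; j++)
            if (!used[j] &amp;&amp; memcmp(ga[i], gb[j], 3) == 0) {
                used[j] = true;
                inter++;
                break;
            }
    return (na + nb - inter) ? (double)inter / (na + nb - inter) : 1.0;
}</code></pre>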
This name mismatching problem blocks automation and requires experienced users to rectify names manually using domain knowledge. Rule-based rectification helps the process but still fails when encountering unseen mismatching types. In this paper, hierarchical name rectification is transformed into a similarity-search problem in which the most similar name becomes the rectified name. However, a naïve full search over the design using string comparison costs unacceptable time. Our proposed framework embeds name strings into vectors representing distance relations in a latent space using character n-grams and locality-sensitive hashing (LSH), and then finds the most similar signal using nearest-neighbor search (NNS) and a detailed search. Learning similarity using a Siamese network provides general name rectification regardless of mismatching types, while string-to-vector embedding for proximity search accelerates the rectification process. Our approach is capable of achieving a 93.43% rectification rate with only 0.052s per signal, outperforming the naïve string search with 2.3% higher accuracy and a 4,500x speed-up.</em></td> </tr> <tr> <td style="width:40px;">16:00</td> <td><a href="/date20/conference/session/IP3">IP3-9</a>, 832</td> <td><b>TOWARDS SPECIFICATION AND TESTING OF RISC-V ISA COMPLIANCE</b><br /><b>Speaker</b>:<br />Vladimir Herdt, Universität Bremen, DE<br /><b>Authors</b>:<br />Vladimir Herdt<sup>1</sup>, Daniel Grosse<sup>2</sup> and Rolf Drechsler<sup>2</sup><br /><sup>1</sup>Universität Bremen, DE; <sup>2</sup>Universität Bremen / DFKI GmbH, DE<br /><em><b>Abstract</b><br />Compliance testing for RISC-V is very important. Therefore, an official hand-written compliance test-suite is being actively developed. However, this requires significant manual effort, in particular to achieve a high test coverage. In this paper we propose a test-suite specification mechanism in combination with a first set of instruction constraints and coverage requirements for the base RISC-V ISA. In addition, we present an automated method to generate a test-suite that satisfies the specification. Our evaluation demonstrates the effectiveness and potential of our method.</em></td> </tr> <tr> <td style="width:40px;">16:01</td> <td><a href="/date20/conference/session/IP3">IP3-10</a>, 702</td> <td><b>POST-SILICON VALIDATION OF THE IBM POWER9 PROCESSOR</b><br /><b>Speaker</b>:<br />Hillel Mendelson, IBM, IL<br /><b>Authors</b>:<br />Tom Kolan<sup>1</sup>, Hillel Mendelson<sup>1</sup>, Vitali Sokhin<sup>1</sup>, Kevin Reick<sup>2</sup>, Elena Tsanko<sup>2</sup> and Gregory Wetli<sup>2</sup><br /><sup>1</sup>IBM Research - Haifa, IL; <sup>2</sup>IBM Systems, US<br /><em><b>Abstract</b><br />Due to the complexity of designs, post-silicon validation remains a major challenge with few systematic solutions. We provide an overview of the state-of-the-art post-silicon validation process used by IBM to verify its latest IBM POWER9 processor. During the POWER9 post-silicon validation, we detected and handled 30% more logic bugs in 80% of the time, as compared to the previous IBM POWER8 bring-up. This improvement is the result of lessons learned from previous designs, leading to numerous innovations. We provide bug analysis data and compare it to POWER8 results.
We demonstrate our methodology by describing several bugs from fail detection to root cause.</em></td> </tr> <tr> <td>16:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="7.5">7.5 Runtime support for multi/many cores</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 14:30 - 16:00<br /><b>Location / Room:</b> Bayard</p> <p><b>Chair:</b><br />Sara Vinco, Politecnico di Torino, IT</p> <p><b>Co-Chair:</b><br />Jeronimo Castrillon, Technische Universität Dresden, DE</p> <p>In the era of heterogeneous embedded systems, the diverse nature of computing elements pushes more than ever the need for smart runtime systems able to deal with resource management, multi-application mapping, task parallelism, and non-functional constraints. This session tackles these issues with solutions that span from resource-aware software architectures to novel runtime systems optimizing memory and energy consumption.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>7.5.1</td> <td><b>RESOURCE-AWARE MAPREDUCE RUNTIME FOR MULTI/MANY-CORE ARCHITECTURES</b><br /><b>Speaker</b>:<br />Konstantinos Iliakis, MicroLab, ECE, NTUA, GR<br /><b>Authors</b>:<br />Konstantinos Iliakis<sup>1</sup>, Sotirios Xydis<sup>1</sup> and Dimitrios Soudris<sup>2</sup><br /><sup>1</sup>National TU Athens, GR; <sup>2</sup>NTUA, GR<br /><em><b>Abstract</b><br />Modern multi/many-core processors exhibit high integration densities, e.g., up to several dozens or hundreds of cores. To ease the application development burden for such systems, various programming frameworks have emerged. The MapReduce programming model, after having demonstrated its usability in the area of distributed systems, has been adapted to the needs of shared-memory many-core and multi-processor systems, showing promising results in comparison with conventional multi-threaded libraries, e.g., pthreads. In this paper, we propose a novel resource-aware MapReduce architecture. The proposed runtime decouples the map and combine phases in order to enhance the degree of parallelism, while it effectively overlaps the memory-intensive combine with the compute-intensive map operation, resulting in superior resource utilization and performance improvements. A detailed sensitivity analysis of the framework's tuning knobs is provided. The decoupled MapReduce architecture is evaluated against the state-of-the-art library on two diverse systems, i.e., a Haswell server and a Xeon Phi co-processor, demonstrating average speedups of up to 2.2x and 2.9x, respectively.</em></td> </tr> <tr> <td>15:00</td> <td>7.5.2</td> <td><b>TOWARDS A QUALIFIABLE OPENMP FRAMEWORK FOR EMBEDDED SYSTEMS</b><br /><b>Speaker</b>:<br />Adrian Munera Sanchez, Barcelona Supercomputing Center, ES<br /><b>Authors</b>:<br />Adrián Munera Sánchez, Sara Royuela and Eduardo Quiñones, Barcelona Supercomputing Center, ES<br /><em><b>Abstract</b><br />OpenMP is a very convenient parallel programming model for developing critical real-time applications by virtue of its powerful tasking model and its proven time-predictable properties. However, current OpenMP implementations are not suitable due to their intensive use of dynamic memory to allocate the data structures needed to efficiently manage the parallel execution.
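<p>A minimal C sketch of the static-allocation alternative that the rest of this abstract builds towards (an illustration of the general idea only, not the paper's framework): drawing task descriptors from a statically sized pool bounds the memory footprint at build time, which is what qualification processes need to see.</p> <pre><code>#include &lt;stddef.h&gt;

/* Task descriptors come from a statically allocated pool instead of
 * malloc, so the worst-case memory use is fixed at compile time.
 * (The index update would need to be atomic in a real parallel
 * runtime; this sketch ignores concurrency.)                        */
#define MAX_TASKS 64

typedef struct {
    void (*fn)(void *);
    void  *args;
} task_t;

static task_t   pool[MAX_TASKS];   /* static storage, no heap */
static unsigned next_free;

static task_t *task_alloc(void)
{
    return (next_free &lt; MAX_TASKS) ? &amp;pool[next_free++] : NULL;
}</code></pre>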
This jeopardizes the qualification processes of critical real-time systems, which are needed to ensure that the integrated system stack, including the OpenMP framework, is compliant with the system requirements. This paper proposes a novel OpenMP framework that statically allocates all the data structures needed to execute the OpenMP tasking model. Our framework is composed of a compiler phase that captures the data environment of all the OpenMP tasks instantiated during the parallel execution, and a run-time phase implementing a lazy task creation policy that significantly reduces the memory requirements at run-time whilst exploiting parallelism efficiently.</em></td> </tr> <tr> <td>15:30</td> <td>7.5.3</td> <td><b>ENERGY-EFFICIENT RUNTIME RESOURCE MANAGEMENT FOR ADAPTABLE MULTI-APPLICATION MAPPING</b><br /><b>Speaker</b>:<br />Robert Khasanov, Technische Universität Dresden, DE<br /><b>Authors</b>:<br />Robert Khasanov and Jeronimo Castrillon, Technische Universität Dresden, DE<br /><em><b>Abstract</b><br />Modern embedded computing platforms consist of a large number of heterogeneous resources, which allows multiple applications to execute on a single device. The number of running applications on the system varies with time, and so does the amount of available resources. This has considerably increased the complexity of analysis and optimization algorithms for runtime mapping of firm real-time applications. To reduce the runtime overhead, researchers have proposed to pre-compute partial mappings at compile time and have the runtime efficiently compute the final mapping. However, most existing solutions only compute a fixed mapping for a given set of running applications, and the mapping is defined for the entire duration of the workload execution. In this work we allow applications to adapt to the amount of available resources by using mapping segments. This way, applications may switch between different configurations with varying degrees of parallelism. We present a runtime manager for firm real-time applications that generates such mapping segments based on partial solutions and aims at minimizing the overall energy consumption without deadline violations. The proposed algorithm outperforms the state-of-the-art approaches on overall energy consumption by up to 13% while incurring an order of magnitude less scheduling overhead.</em></td> </tr> <tr> <td style="width:40px;">16:00</td> <td><a href="/date20/conference/session/IP3">IP3-11</a>, 619</td> <td><b>ON THE TASK MAPPING AND SCHEDULING FOR DAG-BASED EMBEDDED VISION APPLICATIONS ON HETEROGENEOUS MULTI/MANY-CORE ARCHITECTURES</b><br /><b>Speaker</b>:<br />Nicola Bombieri, Università di Verona, IT<br /><b>Authors</b>:<br />Stefano Aldegheri<sup>1</sup>, Nicola Bombieri<sup>1</sup> and Hiren Patel<sup>2</sup><br /><sup>1</sup>Università di Verona, IT; <sup>2</sup>University of Waterloo, CA<br /><em><b>Abstract</b><br />In this work, we show that applying the heterogeneous earliest finish time (HEFT) heuristic for the task scheduling of embedded vision applications can improve system performance by up to 70% with respect to the scheduling solutions at the state of the art. We propose an algorithm called exclusive earliest finish time (XEFT) that introduces the notion of exclusive overlap between application primitives to improve the load balancing. We show that XEFT can improve system performance by up to 33% over HEFT, and 82% over the state-of-the-art approaches.
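<p>For orientation, a generic C sketch of the earliest-finish-time rule at the core of HEFT-style list schedulers (not XEFT itself): each ready task is placed on the processor that can complete it soonest.</p> <pre><code>#define NPROC 4

/* Pick the processor minimizing the task's earliest finish time,
 * given each processor's next-available time and the task's
 * per-processor execution cost.                                   */
static int pick_processor(const double avail[NPROC],
                          const double cost[NPROC],
                          double ready_time, double *finish_out)
{
    int    best     = 0;
    double best_eft = 1e300;
    for (int p = 0; p &lt; NPROC; p++) {
        double start = avail[p] &gt; ready_time ? avail[p] : ready_time;
        double eft   = start + cost[p];
        if (eft &lt; best_eft) { best_eft = eft; best = p; }
    }
    *finish_out = best_eft;
    return best;
}</code></pre>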
We present the results on different benchmarks, including a real-world localization and mapping application (ORB-SLAM) combined with the NVIDIA object detection application based on deep learning.</em></td> </tr> <tr> <td>16:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="7.6">7.6 Attacks on Hardware Architectures</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 14:30 - 16:00<br /><b>Location / Room:</b> Lesdiguières</p> <p><b>Chair:</b><br />Johanna Sepúlveda, Airbus Defence and Space, DE</p> <p><b>Co-Chair:</b><br />Nele Mentens, KU Leuven, BE</p> <p>Hardware architectures are under the continuous threat of all types of attacks. This session covers attacks based on side-channel leakage and the exploitation of vulnerabilities at the micro-architectural and circuit levels.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>7.6.1</td> <td><b>SWEEPING FOR LEAKAGE IN MASKED CIRCUIT LAYOUTS</b><br /><b>Speaker</b>:<br />Danilo Šijačić, IMEC / KU Leuven, BE<br /><b>Authors</b>:<br />Danilo Šijačić, Josep Balasch and Ingrid Verbauwhede, KU Leuven, BE<br /><em><b>Abstract</b><br />Masking schemes are the most popular countermeasure against side-channel analysis. They theoretically decorrelate information leaked through inherent physical channels from the key-dependent intermediate values that occur during computation. Their provable security is devised under models that abstract the complex physical phenomena of the underlying hardware. In this work, we investigate the impact of the physical layout on the side-channel security of masking schemes. To this end, we propose a model for co-simulation of the analog power distribution network with the digital logic core. Our study considers the drive of the power supply buffers, as well as parasitic resistors, inductors and capacitors. We quantify our findings using Test Vector Leakage Assessment by relative comparison to the parasitic-free model. Thus we provide deeper insight into the potential layout sources of leakage and their magnitude.</em></td> </tr> <tr> <td>15:00</td> <td>7.6.2</td> <td><b>INCREASED REPRODUCIBILITY AND COMPARABILITY OF DATA LEAK EVALUATIONS USING EXOT</b><br /><b>Speaker</b>:<br />Philipp Miedl, ETH Zurich, CH<br /><b>Authors</b>:<br />Philipp Miedl, Bruno Klopott and Lothar Thiele, ETH Zurich, CH<br /><em><b>Abstract</b><br />As computing systems are increasingly shared among different users or application domains, researchers have intensified their efforts to detect possible data leaks. In particular, many investigations highlight the vulnerability of systems w.r.t. covert and side-channel attacks. However, the effort required to reproduce and compare different results has proven to be high. Therefore, we present a novel methodology for covert channel evaluation. In addition, we introduce the Experiment Orchestration Toolkit ExOT, which provides software tools to efficiently execute the methodology. Our methodology ensures that the covert channel analysis yields expressive results that can be reproduced and allow the comparison of the threat potential of different data leaks. ExOT is a software bundle that consists of easy-to-extend C++ libraries and Python packages. These libraries and packages provide tools for the generation and execution of experiments, as well as the analysis of the experimental data.
Therefore, ExOT decreases the engineering effort needed to execute our novel methodology. We verify these claims with an extensive evaluation of four different covert channels on an Intel Haswell and an ARMv8-based platform. In our evaluation, we derive capacity bounds and show achievable throughputs to compare the threat potential of these different covert channels.</em></td> </tr> <tr> <td>15:15</td> <td>7.6.3</td> <td><b>GHOSTBUSTERS: MITIGATING SPECTRE ATTACKS ON A DBT-BASED PROCESSOR</b><br /><b>Speaker and Author</b>:<br />Simon Rokicki, Irisa, FR<br /><em><b>Abstract</b><br />Unveiled in early 2018, the Spectre vulnerability affects most modern high-performance processors. Spectre variants exploit speculative execution mechanisms and a cache side-channel attack to leak secret data. As of today, the main countermeasures consist of turning off speculation, which drastically reduces processor performance. In this work, we focus on a different kind of micro-architecture: DBT-based processors, such as Transmeta Crusoe [1], NVidia Denver or Hybrid-DBT. Instead of using complex out-of-order (OoO) mechanisms, those cores combine a software dynamic binary translation (DBT) mechanism and a parallel in-order architecture, typically a VLIW core. The DBT is in charge of translating and optimizing the binaries before their execution. Studies show that DBT-based processors can reach the performance level of OoO cores for sufficiently regular applications. In this paper, we demonstrate that, even though those processors do not use OoO execution, they are still vulnerable to Spectre variants because of the DBT optimizations. However, we also demonstrate that those systems can easily be patched, as the DBT is done in software and has fine-grained control over the optimization process.</em></td> </tr> <tr> <td>15:30</td> <td>7.6.4</td> <td><b>DYNAMIC FAULTS BASED HARDWARE TROJAN DESIGN IN STT-MRAM</b><br /><b>Speaker</b>:<br />Sarath Mohanachandran Nair, Karlsruhe Institute of Technology (KIT), DE<br /><b>Authors</b>:<br />Sarath Mohanachandran Nair<sup>1</sup>, Rajendra Bishnoi<sup>2</sup>, Arunkumar Vijayan<sup>1</sup> and Mehdi Tahoori<sup>1</sup><br /><sup>1</sup>Karlsruhe Institute of Technology, DE; <sup>2</sup>Delft University of Technology, NL<br /><em><b>Abstract</b><br />The emerging Spin Transfer Torque Magnetic Random Access Memory (STT-MRAM) is seen as a promising candidate to replace conventional on-chip memories. It has several advantages, such as high density, non-volatility, scalability, and CMOS compatibility. With this technology becoming ubiquitous, it also becomes an interesting target for security attacks. As the fabrication process of STT-MRAM evolves, it is susceptible to various fault mechanisms which are different from those of conventional CMOS memories. These unique fault mechanisms can be exploited by an adversary to deploy hardware Trojans, which are deliberately introduced design modifications. In this work, we demonstrate how a particular stealthy circuit modification to inject a fault mechanism, namely a dynamic fault, can be exploited to implement a hardware Trojan trigger which cannot be detected by standard memory testing methods. The fault mechanisms can also be used to design new payloads specific to STT-MRAM.
We illustrate this by proposing a new payload that utilizes coupling faults, leading to degraded performance and data corruption.</em></td> </tr> <tr> <td>15:45</td> <td>7.6.5</td> <td><b>ORACLE-BASED LOGIC LOCKING ATTACKS: PROTECT THE ORACLE NOT ONLY THE NETLIST</b><br /><b>Speaker</b>:<br />Emmanouil Kalligeros, University of the Aegean, GR<br /><b>Authors</b>:<br />Emmanouil Kalligeros, Nikolaos Karousos and Irene Karybali, University of the Aegean, GR<br /><em><b>Abstract</b><br />Logic locking has received a lot of attention in the literature due to its very attractive hardware-security characteristics: it can protect against IP piracy and overproduction throughout the whole IC supply chain. However, a large class of logic-locking attacks, the oracle-based ones, take advantage of a functional copy of the chip, the oracle, to extract the key that protects the chip. So far, the techniques dealing with oracle-based attacks focus on the netlist that the attacker possesses, assuming that the oracle is always available. For this reason, they are usually overcome by new attacks. In this paper, we propose a hardware security scheme that targets the protection of the oracle circuit, by locking the circuit when the scan in/out process, which is necessary for setting the inputs and observing the outputs, begins. Hence, no correct input/output pairs can be acquired to perform the attacks. The proposed scheme is not based on controlling global signals like test_enable or scan_enable, whose values can be easily suppressed by the attacker. Security threats are identified, discussed and addressed. The developed scheme is combined with a traditional logic locking technique with high output corruptibility, to achieve increased levels of protection.</em></td> </tr> <tr> <td style="width:40px;">16:00</td> <td><a href="/date20/conference/session/IP3">IP3-12</a>, 424</td> <td><b>ARE CLOUD FPGAS REALLY VULNERABLE TO POWER ANALYSIS ATTACKS?</b><br /><b>Speaker</b>:<br />Ognjen Glamocanin, EPFL, CH<br /><b>Authors</b>:<br />Ognjen Glamocanin<sup>1</sup>, Louis Coulon<sup>1</sup>, Francesco Regazzoni<sup>2</sup> and Mirjana Stojilovic<sup>3</sup><br /><sup>1</sup>École Polytechnique Fédérale de Lausanne, CH; <sup>2</sup>ALaRI, CH; <sup>3</sup>EPFL, CH<br /><em><b>Abstract</b><br />Recent works have demonstrated the possibility of extracting secrets from a cryptographic core running on an FPGA by means of remote power analysis attacks. To mount these attacks, an adversary implements a voltage fluctuation sensor in the FPGA logic, records the power consumption of the target cryptographic core, and recovers the secret key by running a power analysis attack on the recorded traces. Despite showing that the power analysis could also be performed without physical access to the cryptographic core, these works were mostly carried out on dedicated FPGA boards in a controlled environment, leaving open the question about the possibility to successfully mount these attacks on a real system deployed in the cloud. In this paper, we demonstrate, for the first time, a successful key recovery attack on an AES cryptographic accelerator running on an Amazon EC2 F1 instance. We collect the power traces using a delay-line based voltage drop sensor, adapted to the Xilinx Virtex Ultrascale+ architecture used on Amazon EC2 F1, where CARRY8 blocks do not have a monotonic delay increase at their outputs. 
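<p>The key-recovery step on such traces is typically correlation power analysis (CPA). A minimal sketch under a Hamming-weight leakage assumption on the first-round S-box output (an editorial illustration, not the authors' code; <code>traces</code> and <code>plaintexts</code> are assumed to come from a prior measurement campaign):</p> <pre><code>import numpy as np

def aes_sbox() -> np.ndarray:
    """Build the AES S-box: inverse in GF(2^8), then the affine
    transform S = inv ^ rotl(inv, 1..4) ^ 0x63."""
    def gf_mul(a: int, b: int) -> int:
        r = 0
        for _ in range(8):
            if b & 1:
                r ^= a
            hi, a, b = a & 0x80, (a << 1) & 0xFF, b >> 1
            if hi:
                a ^= 0x1B
        return r
    box = np.zeros(256, dtype=np.uint8)
    for x in range(256):
        inv = next((y for y in range(256) if gf_mul(x, y) == 1), 0)
        s = inv
        for i in range(1, 5):
            s ^= ((inv << i) | (inv >> (8 - i))) & 0xFF
        box[x] = s ^ 0x63
    return box

def cpa_key_byte(traces: np.ndarray, plaintexts: np.ndarray, byte_idx: int) -> int:
    """Return the key-byte guess whose Hamming-weight prediction
    correlates best with the traces (shape: N x samples)."""
    sbox = aes_sbox()
    hw = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None], axis=1).sum(1)
    t = traces - traces.mean(axis=0)
    t_norm = np.sqrt((t * t).sum(axis=0))
    scores = []
    for k in range(256):
        h = hw[sbox[plaintexts[:, byte_idx] ^ k]].astype(float)
        h -= h.mean()
        corr = np.abs(h @ t) / np.maximum(np.sqrt(h @ h) * t_norm, 1e-12)
        scores.append(corr.max())
    return int(np.argmax(scores))

# Hypothetical usage: key_byte = cpa_key_byte(traces, plaintexts, 0)
</code></pre>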
Our results demonstrate that security concerns raised by multitenant FPGAs are indeed valid and that countermeasures should be put in place to mitigate them.</em></td> </tr> <tr> <td>16:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="7.7">7.7 Self-Adaptive and Learning Systems</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 14:30 - 16:00<br /><b>Location / Room:</b> Berlioz</p> <p><b>Chair:</b><br />Gilles Sassatelli, Université de Montpellier, FR</p> <p><b>Co-Chair:</b><br />Rishad Shafik, University of Newcastle, GB</p> <p>Recent advances in machine learning have pushed the boundaries of what is possible in self-adaptive and learning systems. This session pushes the state of the art in runtime power and performance trade-offs for deep neural networks and self-optimizing embedded systems.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>7.7.1</td> <td><b>ANYTIMENET: CONTROLLING TIME-QUALITY TRADEOFFS IN DEEP NEURAL NETWORK ARCHITECTURES</b><br /><b>Speaker</b>:<br />Jung-Eun Kim, Yale University, US<br /><b>Authors</b>:<br />Jung-Eun Kim<sup>1</sup>, Richard Bradford<sup>2</sup> and Zhong Shao<sup>1</sup><br /><sup>1</sup>Department of Computer Science, Yale University, US; <sup>2</sup>Collins Aerospace, US<br /><em><b>Abstract</b><br />Deeper neural networks, especially those with extremely large numbers of internal parameters, impose a heavy computational burden in obtaining sufficiently high-quality results. These burdens are impeding the application of machine learning and related techniques to time-critical computing systems. To address this challenge, we are proposing an architectural approach for neural networks that adaptively trades off computation time and solution quality to achieve high-quality solutions with timeliness. We propose a novel and general framework, AnytimeNet, that gradually inserts additional layers, so users can expect monotonically increasing quality of solutions as more computation time is expended. The framework allows users to select on the fly when to retrieve a result during runtime. Extensive evaluation results on classification tasks demonstrate that our proposed architecture provides adaptive control of classification solution quality according to the available computation time.</em></td> </tr> <tr> <td>15:00</td> <td>7.7.2</td> <td><b>ANTIDOTE: ATTENTION-BASED DYNAMIC OPTIMIZATION FOR NEURAL NETWORK RUNTIME EFFICIENCY</b><br /><b>Speaker</b>:<br />Xiang Chen, George Mason University, US<br /><b>Authors</b>:<br />Fuxun Yu<sup>1</sup>, Chenchen Liu<sup>2</sup>, Di Wang<sup>3</sup>, Yanzhi Wang<sup>1</sup> and Xiang Chen<sup>1</sup><br /><sup>1</sup>George Mason University, US; <sup>2</sup>University of Maryland, Baltimore County, US; <sup>3</sup>Microsoft, US<br /><em><b>Abstract</b><br />Convolutional Neural Networks (CNNs) achieved great cognitive performance at the expense of considerable computation load. To relieve the computation load, many optimization works have been developed to reduce the model redundancy by identifying and removing insignificant model components, such as weight sparsity and filter pruning. However, these works only evaluate model components' static significance with internal parameter information, ignoring their dynamic interaction with external inputs. 
With per-input feature activation, the model component significance can dynamically change, and thus the static methods can only achieve sub-optimal results. Therefore, we propose a dynamic CNN optimization framework in this work. Based on the neural network attention mechanism, we propose a comprehensive dynamic optimization framework including (1) testing-phase channel and column feature map pruning, as well as (2) training-phase optimization by targeted dropout. Such a dynamic optimization framework has several benefits: (1) First, it can accurately identify and aggressively remove per-input feature redundancy by considering the model-input interaction; (2) Meanwhile, it can maximally remove the feature map redundancy in various dimensions thanks to the multi-dimension flexibility; (3) The training-testing co-optimization favors the dynamic pruning and helps maintain the model accuracy even with very high feature pruning ratios. Extensive experiments show that our method could bring 37.4%∼54.5% FLOPs reduction with negligible accuracy drop on various test networks.</em></td> </tr> <tr> <td>15:30</td> <td>7.7.3</td> <td><b>USING LEARNING CLASSIFIER SYSTEMS FOR THE DSE OF ADAPTIVE EMBEDDED SYSTEMS</b><br /><b>Speaker</b>:<br />Fedor Smirnov, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE<br /><b>Authors</b>:<br />Fedor Smirnov, Behnaz Pourmohseni and Jürgen Teich, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE<br /><em><b>Abstract</b><br />Modern embedded systems are not only becoming more and more complex but are also often exposed to dynamically changing run-time conditions such as resource availability or processing power requirements. This trend has led to the emergence of adaptive systems which are designed using novel approaches that combine a static off-line Design Space Exploration (DSE) with the consideration of the dynamic run-time behavior of the system under design. In contrast to a static design approach, which provides a single design solution as a compromise between the possible run-time situations, the off-line DSE of these so-called hybrid design approaches yields a set of configuration alternatives, so that at run time, it becomes possible to dynamically choose the option most suited for the current situation. However, most of these approaches still use optimizers which were primarily developed for static design. Consequently, modeling complex dynamic environments or run-time requirements is either not possible or comes at the cost of a significant computation overhead or results of poor quality. As a remedy, this paper introduces Learning Optimizer Constrained by ALtering conditions (LOCAL), a novel optimization framework for the DSE of adaptive embedded systems. Following the structure of Learning Classifier System (LCS) optimizers, the proposed framework optimizes a strategy, i.e., a set of conditionally applicable solutions for the problem at hand, instead of a set of independent solutions. We show how the proposed framework—which can be used for the optimization of any adaptive system—is used for the optimization of dynamically reconfigurable many-core systems and provide experimental evidence that the hereby obtained strategy offers superior embeddability compared to the solutions provided by a state-of-the-art 
hybrid approach which uses an evolutionary algorithm.</em></td> </tr> <tr> <td style="width:40px;">16:00</td> <td><a href="/date20/conference/session/IP3">IP3-13</a>, 760</td> <td><b>EFFICIENT TRAINING ON EDGE DEVICES USING ONLINE QUANTIZATION</b><br /><b>Speaker</b>:<br />Michael Ostertag, University of California, San Diego, US<br /><b>Authors</b>:<br />Michael Ostertag<sup>1</sup>, Sarah Al-Doweesh<sup>2</sup> and Tajana Rosing<sup>1</sup><br /><sup>1</sup>University of California, San Diego, US; <sup>2</sup>King Abdulaziz City for Science and Technology, SA<br /><em><b>Abstract</b><br />Sensor-specific calibration functions offer superior performance over global models and single-step calibration procedures but require prohibitive levels of sampling in the input feature space. Sensor self-calibration by gathering training data through collaborative calibration or self-analyzing predictive results allows these sensors to gather sufficient information. Resource-constrained edge devices are then stuck between high communication costs for transmitting training data to a centralized server and high memory requirements for storing data locally. We propose online dataset quantization that maximizes the diversity of input features, maintaining a representative set of data from a larger stream of training data points. We test the effectiveness of online dataset quantization on two real-world datasets: air quality calibration and power prediction modeling. Online Dataset Quantization outperforms reservoir sampling and performs on par with offline methods.</em></td> </tr> <tr> <td style="width:40px;">16:01</td> <td><a href="/date20/conference/session/IP3">IP3-14</a>, 190</td> <td><b>MULTI-AGENT ACTOR-CRITIC METHOD FOR JOINT DUTY-CYCLE AND TRANSMISSION POWER CONTROL</b><br /><b>Speaker</b>:<br />Sota Sawaguchi, CEA-Leti, FR<br /><b>Authors</b>:<br />Sota Sawaguchi<sup>1</sup>, Jean-Frédéric Christmann<sup>2</sup>, Anca Molnos<sup>2</sup>, Carolynn Bernier<sup>2</sup> and Suzanne Lesecq<sup>2</sup><br /><sup>1</sup>CEA, FR; <sup>2</sup>CEA-Leti, FR<br /><em><b>Abstract</b><br />Energy-harvesting Internet of Things (EH-IoT) wireless networks have gained attention due to their indefinite operation and maintenance-free nature. However, maintaining energy neutral operation (ENO) of EH-IoT devices, such that the harvested and consumed energy are matched during a certain time period, is crucial. Guaranteeing this ENO condition and optimal power-performance trade-off under various workloads and transient wireless channel quality is particularly challenging. This paper proposes a multi-agent actor-critic method for modulating both the transmission duty-cycle and the transmitter output power based on the state-of-buffer (SoB) and the state-of-charge (SoC) information as a state. Thanks to these buffers, system uncertainties, especially harvested energy and wireless link conditions, are addressed effectively. Differently from the state of the art, our solution does not require any model of the wireless transceiver nor any measurement of wireless channel quality. Simulation results of a solar powered EH-IoT node using real-life outdoor solar irradiance data show that the proposed method achieves better performance without system failures throughout a year compared to the state of the art, which suffers some system downtime. Our approach also predicts almost no system failures during five years of operation. 
This proves that our approach can adapt to changes in energy harvesting and wireless channel quality, all without direct observations.</em></td> </tr> <tr> <td>16:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="IP3">IP3 Interactive Presentations</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 16:00 - 17:00<br /><b>Location / Room:</b> Poster Area</p> <p>Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.</p> <table> <tr> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> <tr> <td style="width:40px;">IP3-1</td> <td><b>CNT-CACHE: AN ENERGY-EFFICIENT CARBON NANOTUBE CACHE WITH ADAPTIVE ENCODING</b><br /><b>Speaker</b>:<br />Kexin Chu, Hefei University of Technology, CN<br /><b>Authors</b>:<br />Dawen Xu<sup>1</sup>, Kexin Chu<sup>1</sup>, Cheng Liu<sup>2</sup>, Ying Wang<sup>2</sup>, Lei Zhang<sup>2</sup> and Huawei Li<sup>2</sup><br /><sup>1</sup>Hefei University of Technology, CN; <sup>2</sup>Chinese Academy of Sciences, CN<br /><em><b>Abstract</b><br />The Carbon Nanotube field-effect transistor (CNFET), which promises both higher clock speed and energy efficiency, is an attractive alternative to the conventional power-hungry CMOS cache. We observe that a CNFET-based cache constructed with typical 9T SRAM cells has distinct energy consumption when reading/writing 0 and 1 from/to it. The energy consumption of reading 0 is around 3X higher compared to reading 1. The energy consumption of writing 1 is almost 10X higher than writing 0. With this observation, we propose an energy-efficient cache design called CNT-Cache to take advantage of this feature. It includes an adaptive data encoding module that can convert the coding of each cache line to match the cache reading and writing preferences. Meanwhile, it has a cache line encoding direction predictor that instructs the encoding direction according to the cache line access history. The two optimizations combined together can reduce the overall dynamic power consumption significantly. According to our experiments, the optimized CNFET-based L1 D-Cache reduces the dynamic power consumption by 22% on average compared to the baseline CNFET cache.</em></td> </tr> <tr> <td style="width:40px;">IP3-2</td> <td><b>ENHANCING MULTITHREADED PERFORMANCE OF ASYMMETRIC MULTICORES WITH SIMD OFFLOADING</b><br /><b>Speaker</b>:<br />Antonio Schneider Beck, Universidade Federal do Rio Grande do Sul, BR<br /><b>Authors</b>:<br />Jeckson Dellagostin Souza<sup>1</sup>, Madhavan Manivannan<sup>2</sup>, Miquel Pericas<sup>2</sup> and Antonio Carlos Schneider Beck<sup>1</sup><br /><sup>1</sup>Universidade Federal do Rio Grande do Sul, BR; <sup>2</sup>Chalmers University of Technology, SE<br /><em><b>Abstract</b><br />Asymmetric multicore architectures with single-ISA can accelerate multithreaded applications by running code that does not execute concurrently (i.e., the serial region) on a big core and the parallel region on a larger number of smaller cores. Nevertheless, in such architectures the big core still implements resource-expensive application-specific instruction extensions that are rarely used while running the serial region, such as Single Instruction Multiple Data (SIMD) and Floating-Point (FP) operations. 
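<p>The adaptive encoding idea of IP3-1 above can be sketched in a few lines. The relative bit energies below follow the abstract's 3X/10X observations, while the read-ratio predictor and the inversion rule are editorial assumptions, not the paper's design:</p> <pre><code># Sketch: store a cache line inverted (plus a one-bit flag) whenever
# inversion lowers the estimated access energy for the predicted
# read/write mix. Assumed costs: read0 = 3*read1, write1 = 10*write0.
E_READ = {0: 3.0, 1: 1.0}
E_WRITE = {0: 1.0, 1: 10.0}

def line_energy(bits: list[int], read_ratio: float) -> float:
    """Estimated energy of one access, weighted by the predicted
    fraction of reads (read_ratio) versus writes."""
    return sum(read_ratio * E_READ[b] + (1 - read_ratio) * E_WRITE[b]
               for b in bits)

def encode_line(bits: list[int], read_ratio: float):
    """Return (stored_bits, inverted_flag) minimizing estimated energy."""
    inv = [b ^ 1 for b in bits]
    if line_energy(inv, read_ratio) < line_energy(bits, read_ratio):
        return inv, True
    return bits, False

# A write-heavy line full of 1s is cheaper to store inverted:
stored, flag = encode_line([1] * 64, read_ratio=0.2)
print(flag)  # True
</code></pre>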
In this work, we propose a design in which these extensions are not implemented in the big core, thereby freeing up area and resources to increase the number of small cores in the system, and potentially enhance thread-level parallelism (TLP). To address the case when missing instruction extensions are required while running on the big core we devise an approach to automatically offload these operations to the execution units of the small cores, where the extensions are implemented and can be executed. Our evaluation shows that, on average, the proposed architecture provides 1.76x speedup when compared to a traditional single-ISA asymmetric multicore processor with the same area, for a variety of parallel applications.</em></td> </tr> <tr> <td style="width:40px;">IP3-3</td> <td><b>HARDWARE ACCELERATION OF CNN WITH ONE-HOT QUANTIZATION OF WEIGHTS AND ACTIVATIONS</b><br /><b>Speaker</b>:<br />Gang Li, Chinese Academy of Sciences, CN<br /><b>Authors</b>:<br />Gang Li, Peisong Wang, Zejian Liu, Cong Leng and Jian Cheng, Chinese Academy of Sciences, CN<br /><em><b>Abstract</b><br />In this paper, we propose a novel one-hot representation for weights and activations in CNN models and demonstrate its benefits on hardware accelerator design. Specifically, rather than merely reducing the bitwidth, we quantize both weights and activations into n-bit integers that contain only one non-zero bit per value. In this way, the massive multiply-and-accumulate operations (MACs) are equivalent to additions of powers of two that can be efficiently calculated with histogram-based computations. Experiments on the ImageNet classification task show that comparable accuracy can be obtained on our proposed One-Hot Networks (OHN) compared to conventional fixed-point networks. As case studies, we evaluate the efficacy of one-hot data representation on two state-of-the-art CNN accelerators on FPGA; our preliminary results show that 50% and 68.5% resource savings can be achieved on DaDianNao and Laconic respectively. Besides, the one-hot optimized Laconic can further achieve an average speedup of 4.94x on AlexNet.</em></td> </tr> <tr> <td style="width:40px;">IP3-4</td> <td><b>BNNSPLIT: BINARIZED NEURAL NETWORKS FOR EMBEDDED DISTRIBUTED FPGA-BASED COMPUTING SYSTEMS</b><br /><b>Speaker</b>:<br />Luca Stornaiuolo, Politecnico di Milano, IT<br /><b>Authors</b>:<br />Giorgia Fiscaletti, Marco Speziali, Luca Stornaiuolo, Marco D. Santambrogio and Donatella Sciuto, Politecnico di Milano, IT<br /><em><b>Abstract</b><br />In the past few years, Convolutional Neural Networks (CNNs) have seen a massive improvement, outperforming other visual recognition algorithms. Since they are playing an increasingly important role in fields such as face recognition, augmented reality or autonomous driving, there is a growing need for a fast and efficient system to perform the redundant and heavy computations of CNNs. This trend led researchers towards heterogeneous systems provided with hardware accelerators, such as GPUs and FPGAs. The vast majority of CNNs are implemented with floating-point parameters and operations, but research has shown that high classification accuracy can also be obtained by reducing the floating-point activations and weights to binary values. This context is well suited to FPGAs, which are known to stand out in terms of performance when dealing with binary operations, as demonstrated in Finn, the state-of-the-art framework for building Binarized Neural Network (BNN) accelerators on FPGAs. 
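<p>The one-hot representation of IP3-3 above amounts to rounding each value to a signed power of two, so that every MAC reduces to a shift and an add. A minimal sketch (round-to-nearest in the log domain and the exponent range are editorial assumptions, not necessarily the paper's exact scheme):</p> <pre><code># One-hot (single non-zero bit) quantization: map x to sign * 2**e.
import numpy as np

def one_hot_quantize(x: np.ndarray, min_exp: int = -8, max_exp: int = 0):
    """Round each entry to sign * 2**e, e clipped to [min_exp, max_exp]."""
    sign = np.sign(x)
    mag = np.clip(np.abs(x), 2.0**min_exp, 2.0**max_exp)
    exp = np.clip(np.round(np.log2(mag)), min_exp, max_exp)
    return sign * 2.0**exp

w = np.array([0.30, -0.07, 0.001, 0.9])
print(one_hot_quantize(w))  # [ 0.25  -0.0625  0.00390625  1. ]
</code></pre>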
In this paper, we propose a framework that extends Finn to a distributed scenario, enabling BNN implementation on embedded multi-FPGA systems.</em></td> </tr> <tr> <td style="width:40px;">IP3-5</td> <td><b>L2L: A HIGHLY ACCURATE LOG_2_LEAD QUANTIZATION OF PRE-TRAINED NEURAL NETWORKS</b><br /><b>Speaker</b>:<br />Salim Ullah, Technische Universität Dresden, DE<br /><b>Authors</b>:<br />Salim Ullah<sup>1</sup>, Siddharth Gupta<sup>2</sup>, Kapil Ahuja<sup>2</sup>, Aruna Tiwari<sup>2</sup> and Akash Kumar<sup>1</sup><br /><sup>1</sup>Technische Universität Dresden, DE; <sup>2</sup>IIT Indore, IN<br /><em><b>Abstract</b><br />Deep Neural Networks are one of the machine learning techniques which are increasingly used in a variety of applications. However, the significantly high memory and computation demands of deep neural networks often limit their deployment on embedded systems. Many recent works have considered this problem by proposing different types of data quantization schemes. However, most of these techniques either require post-quantization retraining of deep neural networks or bear a significant loss in output accuracy. In this paper, we propose a novel quantization technique for parameters of pre-trained deep neural networks. Our technique preserves the accuracy of the parameters and does not require retraining of the networks. Compared to the single-precision floating-point numbers-based implementation, our proposed 8-bit quantization technique incurs only ∼1% and ∼0.4% loss in the top-1 and top-5 accuracies, respectively, for the VGG16 network on the ImageNet dataset.</em></td> </tr> <tr> <td style="width:40px;">IP3-6</td> <td><b>FAULT DIAGNOSIS OF VIA-SWITCH CROSSBAR IN NON-VOLATILE FPGA</b><br /><b>Speaker</b>:<br />Ryutaro Doi, Osaka University, JP<br /><b>Authors</b>:<br />Ryutaro Doi<sup>1</sup>, Xu Bai<sup>2</sup>, Toshitsugu Sakamoto<sup>2</sup> and Masanori Hashimoto<sup>1</sup><br /><sup>1</sup>Osaka University, JP; <sup>2</sup>NEC Corporation, JP<br /><em><b>Abstract</b><br />FPGA that exploits via-switches, which are a kind of non-volatile resistive RAMs, for crossbar implementation is attracting attention due to its high integration density and energy efficiency. The via-switch crossbar is responsible for signal routing by changing the on/off-states of via-switches. To verify the via-switch crossbar functionality after manufacturing, fault testing that checks whether we can turn on/off via-switches normally is essential. This paper confirms that a general differential pair comparator successfully discriminates on/off-states of via-switches, and clarifies fault modes of a via-switch by transistor-level SPICE simulation that injects stuck-on/off faults into the atom switch and varistor, where a via-switch consists of two atom switches and two varistors. We then propose a fault diagnosis methodology that diagnoses the fault modes of each via-switch using the comparator response difference between normal and faulty via-switches. The proposed method achieves 100% fault detection by checking the comparator responses after turning on/off the via-switch. 
When a via-switch contains a single faulty component, the fault diagnosis, which exactly identifies the faulty varistor or atom switch inside the faulty via-switch, succeeds in 100% of cases; with up to two faults, the diagnosis ratio is 79%.</em></td> </tr> <tr> <td style="width:40px;">IP3-7</td> <td><b>APPLYING RESERVATION-BASED SCHEDULING TO A µC-BASED HYPERVISOR: AN INDUSTRIAL CASE STUDY</b><br /><b>Speaker</b>:<br />Dirk Ziegenbein, Robert Bosch GmbH, DE<br /><b>Authors</b>:<br />Dakshina Dasari<sup>1</sup>, Paul Austin<sup>2</sup>, Michael Pressler<sup>1</sup>, Arne Hamann<sup>1</sup> and Dirk Ziegenbein<sup>1</sup><br /><sup>1</sup>Robert Bosch GmbH, DE; <sup>2</sup>ETAS GmbH, GB<br /><em><b>Abstract</b><br />Existing software scheduling mechanisms do not suffice for emerging applications in the automotive space, which have the conflicting needs of performance and predictability. As a concrete case, we consider the ETAS light-weight hypervisor, a commercially viable solution in the automotive industry, deployed on multicore microcontrollers. We describe the architecture of the hypervisor and its current scheduling mechanisms based on Time Division Multiplexing. We next show how Reservation-based Scheduling can be implemented in the ETAS LWHVR to efficiently use resources while also providing freedom from interference, and explore design choices towards an efficient implementation of such a scheduler. With experiments from an industry use case, we also compare the performance of RBS and the existing scheduler in the hypervisor.</em></td> </tr> <tr> <td style="width:40px;">IP3-8</td> <td><b>REAL-TIME ENERGY MONITORING IN IOT-ENABLED MOBILE DEVICES</b><br /><b>Speaker</b>:<br />Nitin Shivaraman, TUMCREATE, SG<br /><b>Authors</b>:<br />Nitin Shivaraman<sup>1</sup>, Seima Suriyasekaran<sup>1</sup>, Zhiwei Liu<sup>2</sup>, Saravanan Ramanathan<sup>1</sup>, Arvind Easwaran<sup>2</sup> and Sebastian Steinhorst<sup>3</sup><br /><sup>1</sup>TUMCREATE, SG; <sup>2</sup>Nanyang Technological University, SG; <sup>3</sup>TUM, DE<br /><em><b>Abstract</b><br />With rapid advancements in the Internet of Things (IoT) paradigm, every electrical device in the near future is expected to have IoT capabilities. This enables fine-grained tracking of individual energy consumption data of such devices, offering location-independent per-device billing and demand management. Hence, it abstracts from the location-based metering of state-of-the-art infrastructure, which traditionally aggregates on a building or household level, defining the entity to be billed. However, such in-device energy metering is susceptible to manipulation and fraud. As a remedy, we propose a secure decentralized metering architecture that enables devices with IoT capabilities to measure their own energy consumption. In this architecture, the device-level consumption is additionally reported to a system-level aggregator that verifies distributed information from our decentralized metering systems and provides secure data storage using Blockchain, preventing data manipulation by untrusted entities. 
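<p>Reservation-based scheduling, as discussed in IP3-7 above, can be illustrated with a minimal budget-and-period simulation. Guest names and parameters below are hypothetical, not the ETAS implementation: each guest may only run while its per-period budget lasts, which isolates guests from each other's overruns:</p> <pre><code>from dataclasses import dataclass

@dataclass
class Reservation:
    name: str
    budget: int      # ticks of CPU time granted per period
    period: int      # ticks
    remaining: int = 0

def schedule(reservations, total_ticks):
    """Tick-based simulation; returns the execution trace."""
    trace = []
    for t in range(total_ticks):
        for r in reservations:
            if t % r.period == 0:        # replenish at period boundary
                r.remaining = r.budget
        runnable = [r for r in reservations if r.remaining > 0]
        if runnable:                      # fixed priority: list order
            r = runnable[0]
            r.remaining -= 1
            trace.append(r.name)
        else:
            trace.append("idle")
    return trace

print(schedule([Reservation("vm_safety", 2, 5),
                Reservation("vm_infotainment", 2, 5)], 10))
# Each 5-tick period: 2 ticks vm_safety, 2 ticks vm_infotainment, 1 idle.
</code></pre>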
Through experimental evaluation, we show that the proposed architecture supports device mobility and enables location-independent monitoring of energy consumption.</em></td> </tr> <tr> <td style="width:40px;">IP3-9</td> <td><b>TOWARDS SPECIFICATION AND TESTING OF RISC-V ISA COMPLIANCE</b><br /><b>Speaker</b>:<br />Vladimir Herdt, Universität Bremen, DE<br /><b>Authors</b>:<br />Vladimir Herdt<sup>1</sup>, Daniel Grosse<sup>2</sup> and Rolf Drechsler<sup>2</sup><br /><sup>1</sup>Universität Bremen, DE; <sup>2</sup>Universität Bremen / DFKI GmbH, DE<br /><em><b>Abstract</b><br />Compliance testing for RISC-V is very important. Therefore, an official hand-written compliance test-suite is being actively developed. However, this requires significant manual effort, in particular to achieve a high test coverage. In this paper, we propose a test-suite specification mechanism in combination with a first set of instruction constraints and coverage requirements for the base RISC-V ISA. In addition, we present an automated method to generate a test-suite that satisfies the specification. Our evaluation demonstrates the effectiveness and potential of our method.</em></td> </tr> <tr> <td style="width:40px;">IP3-10</td> <td><b>POST-SILICON VALIDATION OF THE IBM POWER9 PROCESSOR</b><br /><b>Speaker</b>:<br />Hillel Mendelson, IBM, IL<br /><b>Authors</b>:<br />Tom Kolan<sup>1</sup>, Hillel Mendelson<sup>1</sup>, Vitali Sokhin<sup>1</sup>, Kevin Reick<sup>2</sup>, Elena Tsanko<sup>2</sup> and Gregory Wetli<sup>2</sup><br /><sup>1</sup>IBM Research - Haifa, IL; <sup>2</sup>IBM Systems, US<br /><em><b>Abstract</b><br />Due to the complexity of designs, post-silicon validation remains a major challenge with few systematic solutions. We provide an overview of the state-of-the-art post-silicon validation process used by IBM to verify its latest IBM POWER9 processor. During the POWER9 post-silicon validation, we detected and handled 30% more logic bugs in 80% of the time, as compared to the previous IBM POWER8 bring-up. This improvement is the result of lessons learned from previous designs, leading to numerous innovations. We provide bug analysis data and compare it to POWER8 results. We demonstrate our methodology by describing several bugs from fail detection to root cause.</em></td> </tr> <tr> <td style="width:40px;">IP3-11</td> <td><b>ON THE TASK MAPPING AND SCHEDULING FOR DAG-BASED EMBEDDED VISION APPLICATIONS ON HETEROGENEOUS MULTI/MANY-CORE ARCHITECTURES</b><br /><b>Speaker</b>:<br />Nicola Bombieri, Università di Verona, IT<br /><b>Authors</b>:<br />Stefano Aldegheri<sup>1</sup>, Nicola Bombieri<sup>1</sup> and Hiren Patel<sup>2</sup><br /><sup>1</sup>Università di Verona, IT; <sup>2</sup>University of Waterloo, CA<br /><em><b>Abstract</b><br />In this work, we show that applying the heterogeneous earliest finish time (HEFT) heuristic for the task scheduling of embedded vision applications can improve the system performance by up to 70% with respect to state-of-the-art scheduling solutions. We propose an algorithm called exclusive earliest finish time (XEFT) that introduces the notion of exclusive overlap between application primitives to improve the load balancing. We show that XEFT can improve the system performance by up to 33% over HEFT, and 82% over state-of-the-art approaches. 
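<p>For reference, the HEFT baseline that XEFT builds on fits in a short function: rank tasks by upward rank (critical-path length to exit), then greedily place each task on the processor giving the earliest finish time. The task graph and costs below are toy data, not the paper's benchmarks:</p> <pre><code>def heft(succ, cost, comm):
    """succ: task -> successors; cost[t][p]: runtime of task t on
    processor p; comm[(u, v)]: transfer cost across processors.
    Returns task -> (processor, start, finish)."""
    tasks = list(cost)
    rank = {}
    def upward(t):
        if t not in rank:
            avg = sum(cost[t]) / len(cost[t])
            rank[t] = avg + max((comm.get((t, s), 0) + upward(s)
                                 for s in succ.get(t, [])), default=0)
        return rank[t]
    for t in tasks:
        upward(t)
    free = [0.0] * len(next(iter(cost.values())))  # per-processor ready time
    sched = {}
    for t in sorted(tasks, key=lambda t: -rank[t]):
        best = None
        for p in range(len(free)):
            ready = max((sched[u][2] + (comm.get((u, t), 0)
                         if sched[u][0] != p else 0)
                         for u in tasks if t in succ.get(u, [])), default=0.0)
            start = max(free[p], ready)
            f = start + cost[t][p]
            if best is None or f < best[2]:
                best = (p, start, f)
        sched[t] = best
        free[best[0]] = best[2]
    return sched

# Fork-join graph A -> {B, C} -> D on two heterogeneous processors.
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
cost = {"A": [2, 3], "B": [4, 2], "C": [3, 3], "D": [2, 2]}
comm = {("A", "B"): 1, ("A", "C"): 1, ("B", "D"): 1, ("C", "D"): 1}
print(heft(succ, cost, comm))  # makespan 8 on this toy instance
</code></pre>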
We present the results on different benchmarks, including a real-world localization and mapping application (ORB-SLAM) combined with the NVIDIA object detection application based on deep learning.</em></td> </tr> <tr> <td style="width:40px;">IP3-12</td> <td><b>ARE CLOUD FPGAS REALLY VULNERABLE TO POWER ANALYSIS ATTACKS?</b><br /><b>Speaker</b>:<br />Ognjen Glamocanin, EPFL, CH<br /><b>Authors</b>:<br />Ognjen Glamocanin<sup>1</sup>, Louis Coulon<sup>1</sup>, Francesco Regazzoni<sup>2</sup> and Mirjana Stojilovic<sup>3</sup><br /><sup>1</sup>École Polytechnique Fédérale de Lausanne, CH; <sup>2</sup>ALaRI, CH; <sup>3</sup>EPFL, CH<br /><em><b>Abstract</b><br />Recent works have demonstrated the possibility of extracting secrets from a cryptographic core running on an FPGA by means of remote power analysis attacks. To mount these attacks, an adversary implements a voltage fluctuation sensor in the FPGA logic, records the power consumption of the target cryptographic core, and recovers the secret key by running a power analysis attack on the recorded traces. Despite showing that the power analysis could also be performed without physical access to the cryptographic core, these works were mostly carried out on dedicated FPGA boards in a controlled environment, leaving open the question about the possibility to successfully mount these attacks on a real system deployed in the cloud. In this paper, we demonstrate, for the first time, a successful key recovery attack on an AES cryptographic accelerator running on an Amazon EC2 F1 instance. We collect the power traces using a delay-line based voltage drop sensor, adapted to the Xilinx Virtex Ultrascale+ architecture used on Amazon EC2 F1, where CARRY8 blocks do not have a monotonic delay increase at their outputs. Our results demonstrate that security concerns raised by multitenant FPGAs are indeed valid and that countermeasures should be put in place to mitigate them.</em></td> </tr> <tr> <td style="width:40px;">IP3-13</td> <td><b>EFFICIENT TRAINING ON EDGE DEVICES USING ONLINE QUANTIZATION</b><br /><b>Speaker</b>:<br />Michael Ostertag, University of California, San Diego, US<br /><b>Authors</b>:<br />Michael Ostertag<sup>1</sup>, Sarah Al-Doweesh<sup>2</sup> and Tajana Rosing<sup>1</sup><br /><sup>1</sup>University of California, San Diego, US; <sup>2</sup>King Abdulaziz City for Science and Technology, SA<br /><em><b>Abstract</b><br />Sensor-specific calibration functions offer superior performance over global models and single-step calibration procedures but require prohibitive levels of sampling in the input feature space. Sensor self-calibration by gathering training data through collaborative calibration or self-analyzing predictive results allows these sensors to gather sufficient information. Resource-constrained edge devices are then stuck between high communication costs for transmitting training data to a centralized server and high memory requirements for storing data locally. We propose online dataset quantization that maximizes the diversity of input features, maintaining a representative set of data from a larger stream of training data points. We test the effectiveness of online dataset quantization on two real-world datasets: air quality calibration and power prediction modeling. 
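<p>A hedged sketch of what a diversity-maximizing sample buffer can look like; the paper's exact criterion is not given here, so the greedy replacement rule below is an editorial stand-in for comparison against uniform reservoir sampling:</p> <pre><code># Keep a fixed-size set of training points that stays spread out in
# feature space: replace one of the closest buffered pair when the
# newcomer is farther from the buffer than that pair is from each other.
import numpy as np

class DiversityBuffer:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.points: list[np.ndarray] = []

    def offer(self, x: np.ndarray) -> None:
        if len(self.points) < self.capacity:
            self.points.append(x)
            return
        pts = np.stack(self.points)
        dist_to_new = np.linalg.norm(pts - x, axis=1)
        pairwise = np.linalg.norm(pts[:, None] - pts[None], axis=2)
        pairwise += np.eye(len(pts)) * 1e9        # ignore self-distances
        i, j = np.unravel_index(np.argmin(pairwise), pairwise.shape)
        if dist_to_new.min() > pairwise[i, j]:
            self.points[i] = x                    # improves the spread

buf = DiversityBuffer(capacity=64)
for x in np.random.default_rng(0).normal(size=(1000, 4)):
    buf.offer(x)
</code></pre>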
Online Dataset Quantization outperforms reservoir sampling and performs on par with offline methods.</em></td> </tr> <tr> <td style="width:40px;">IP3-14</td> <td><b>MULTI-AGENT ACTOR-CRITIC METHOD FOR JOINT DUTY-CYCLE AND TRANSMISSION POWER CONTROL</b><br /><b>Speaker</b>:<br />Sota Sawaguchi, CEA-Leti, FR<br /><b>Authors</b>:<br />Sota Sawaguchi<sup>1</sup>, Jean-Frédéric Christmann<sup>2</sup>, Anca Molnos<sup>2</sup>, Carolynn Bernier<sup>2</sup> and Suzanne Lesecq<sup>2</sup><br /><sup>1</sup>CEA, FR; <sup>2</sup>CEA-Leti, FR<br /><em><b>Abstract</b><br />Energy-harvesting Internet of Things (EH-IoT) wireless networks have gained attention due to their indefinite operation and maintenance-free nature. However, maintaining energy neutral operation (ENO) of EH-IoT devices, such that the harvested and consumed energy are matched during a certain time period, is crucial. Guaranteeing this ENO condition and optimal power-performance trade-off under various workloads and transient wireless channel quality is particularly challenging. This paper proposes a multi-agent actor-critic method for modulating both the transmission duty-cycle and the transmitter output power based on the state-of-buffer (SoB) and the state-of-charge (SoC) information as a state. Thanks to these buffers, system uncertainties, especially harvested energy and wireless link conditions, are addressed effectively. Differently from the state of the art, our solution does not require any model of the wireless transceiver nor any measurement of wireless channel quality. Simulation results of a solar powered EH-IoT node using real-life outdoor solar irradiance data show that the proposed method achieves better performance without system failures throughout a year compared to the state of the art, which suffers some system downtime. Our approach also predicts almost no system failures during five years of operation. This proves that our approach can adapt to changes in energy harvesting and wireless channel quality, all without direct observations.</em></td> </tr> </table> <hr /> <h2 id="8.1">8.1 Special Day on "Embedded AI": Neuromorphic chips and systems</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 17:00 - 18:30<br /><b>Location / Room:</b> Amphithéâtre Jean Prouve</p> <p><b>Chair:</b><br />Wei Lu, University of Michigan, US</p> <p><b>Co-Chair:</b><br />Bernabe Linares-Barranco, CSIC, ES</p> <p>Within the global field of AI, there is a subfield that focuses on exploiting neuroscience knowledge for artificial intelligent hardware systems. This is the neuromorphic engineering field. This session presents some examples of AI research focusing on this AI subfield.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>8.1.1</td> <td><b>SPINNAKER2: A PLATFORM FOR BIO-INSPIRED ARTIFICIAL INTELLIGENCE AND BRAIN SIMULATION</b><br /><b>Authors</b>:<br />Christian Mayr, Sebastian Höppner, Johannes Partzsch and Steve Furber, Technische Universität Dresden, DE<br /><em><b>Abstract</b><br />SpiNNaker is an ARM-based processor platform optimized for the simulation of spiking neural networks. This brief describes the roadmap for going from the current SpiNNaker1 system, a 1-million-core machine in 130nm CMOS, to SpiNNaker2, a 10-million-core machine in 22nm FDSOI. Apart from pure scaling, we will take advantage of specific technology features, such as runtime adaptive body biasing, to deliver cutting-edge power consumption. 
Power management of the cores allows a wide range of workload adaptivity, i.e., processor power scales with the complexity and activity of the spiking network. Additional numerical accelerators will enhance the utility of SpiNNaker2 for simulation of spiking neural networks as well as for executing conventional deep neural networks. The interplay between these two domains will provide a wide field for bio-inspired algorithm exploration on SpiNNaker2, bringing machine learning and neuromorphics closer together. Apart from the platform's traditional usage as a neuroscience exploration tool, the extended functionality opens up new application areas such as automotive AI, tactile internet, Industry 4.0 and biomedical processing.</em></td> </tr> <tr> <td>17:30</td> <td>8.1.2</td> <td><b>AN ON-CHIP LEARNING ACCELERATOR FOR SPIKING NEURAL NETWORKS USING STT-RAM CROSSBAR ARRAYS</b><br /><b>Authors</b>:<br />Shruti R. Kulkarni, Shihui Yin, Jae-sun Seo and Bipin Rajendran, New Jersey Institute of Technology, US<br /><em><b>Abstract</b><br />In this work, we present a scheme for implementing learning on a digital non-volatile memory (NVM) based hardware accelerator for Spiking Neural Networks (SNNs). Our design estimates across three prominent non-volatile memories - Phase Change Memory (PCM), Resistive RAM (RRAM), and Spin Transfer Torque RAM (STT-RAM) - show that the STT-RAM arrays enable at least 2× higher throughput compared to the other two memory technologies. We discuss the design and the signal communication framework through the STT-RAM crossbar array for training and inference in SNNs. Each STT-RAM cell in the array stores a single bit value. Our neurosynaptic computational core consists of the memory crossbar array and its read/write peripheral circuitry and the digital logic for the spiking neurons, weight update computations, spike router, and decoder for incoming spike packets. Our STT-RAM based design shows ∼20× higher performance per unit Watt per unit area compared to a conventional SRAM-based design, making it a promising learning platform for realizing systems with significant area and power limitations.</em></td> </tr> <tr> <td>18:00</td> <td>8.1.3</td> <td><b>OVERCOMING CHALLENGES FOR ACHIEVING HIGH IN-SITU TRAINING ACCURACY WITH EMERGING MEMORIES</b><br /><b>Speaker</b>:<br />Shimeng Yu, Georgia Institute of Technology, US<br /><b>Authors</b>:<br />Shanshi Huang, Xiaoyu Sun, Xiaochen Peng, Hongwu Jiang and Shimeng Yu, Georgia Institute of Technology, US<br /><em><b>Abstract</b><br />Embedded artificial intelligence (AI) prefers the adaptive learning capability when deployed in the field, thus in-situ training on-chip is required. Emerging non-volatile memories (eNVMs) are of great interest as analog synapses in deep neural network (DNN) on-chip acceleration due to their multilevel programmability. However, the asymmetry/nonlinearity in the conductance tuning remains a grand challenge for achieving high in-situ training accuracy. In addition, the analog-to-digital converter (ADC) at the edge of the memory array introduces an additional challenge - quantization error for in-memory computing. In this work, we gain new insights and overcome these challenges through an algorithm-hardware co-optimization. We incorporate these hardware non-ideal effects into the DNN propagation and weight update steps. 
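<p>The asymmetric, nonlinear conductance tuning at issue can be modelled in a few lines. The model form and constants below are illustrative assumptions, not the paper's fitted device model: potentiation steps shrink near the top of the conductance range, depression steps shrink near the bottom, so an up-pulse followed by a down-pulse does not return to the starting weight:</p> <pre><code>import numpy as np

GMIN, GMAX, NL = 0.0, 1.0, 3.0   # normalized range, nonlinearity factor

def pulse(g: float, up: bool) -> float:
    """Apply one programming pulse to conductance g."""
    if up:    # potentiation: step decays as g approaches GMAX
        return min(GMAX, g + 0.05 * np.exp(-NL * (g - GMIN) / (GMAX - GMIN)))
    else:     # depression: step decays as g approaches GMIN
        return max(GMIN, g - 0.05 * np.exp(-NL * (GMAX - g) / (GMAX - GMIN)))

g = 0.9
print(pulse(pulse(g, up=True), up=False))  # != 0.9: asymmetric round trip
</code></pre> <p>An update rule with an adaptive momentum term, as discussed in this talk, accumulates gradients so that the average applied update tracks the intended one despite this asymmetry.</p>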
We evaluate on a VGG-like network for the CIFAR-10 dataset, and we show that the asymmetry of the conductance tuning is no longer a limiting factor of in-situ training accuracy if exploiting adaptive "momentum" in the weight update rule. Even considering ADC quantization error, in-situ training accuracy could approach the software baseline. Our results show much more relaxed requirements that enable a variety of eNVMs for DNN acceleration on the embedded AI platforms.</em></td> </tr> <tr> <td>18:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="8.2">8.2 We are all hackers: design and detection of security attacks</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 17:00 - 18:30<br /><b>Location / Room:</b> Chamrousse</p> <p><b>Chair:</b><br />Francesco Regazzoni, ALaRI, CH</p> <p><b>Co-Chair:</b><br />Daniel Grosse, Universität Bremen, DE</p> <p>This session deals with hardware trojans and vulnerabilities, proposing detection techniques and design paradigms to model attacks. It describes attacks that leverage the exclusive characteristics of microfluidic devices, as well as malicious usage of energy management. As for defenses, an automated test generation approach for hardware trojan detection using delay-based side-channel analysis is also presented.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>8.2.1</td> <td><b>AUTOMATED TEST GENERATION FOR TROJAN DETECTION USING DELAY-BASED SIDE CHANNEL ANALYSIS</b><br /><b>Speaker</b>:<br />Prabhat Mishra, University of Florida, US<br /><b>Authors</b>:<br />Yangdi Lyu and Prabhat Mishra, University of Florida, US<br /><em><b>Abstract</b><br />Side-channel analysis is widely used for hardware Trojan detection in integrated circuits by analyzing various side-channel signatures, such as timing, power and path delay. Existing delay-based side-channel analysis techniques have two major bottlenecks: (i) they are not suitable for detecting Trojans since the delay difference between the golden design and a Trojan-inserted design is negligible, and (ii) they are not effective in creating robust delay signatures due to reliance on random and ATPG-based test patterns. In this paper, we propose an efficient test generation technique to detect Trojans using delay-based side channel analysis. This paper makes two important contributions. (1) We propose an automated test generation algorithm to produce test patterns that are likely to activate trigger conditions, and drastically change critical paths. Compared to existing approaches where the delay difference is solely based on extra gates from a small Trojan, the change of critical paths by our approach will lead to a significant difference in path delay. (2) We propose a fast and efficient reordering technique to maximize the delay deviation between the golden design and the Trojan-inserted design. 
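<p>As a simple stand-in for such reordering (the paper's objective is delay deviation; this editorial sketch greedily maximizes the switching activity between consecutive patterns, which is what sensitizes paths):</p> <pre><code>def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two patterns differ."""
    return bin(a ^ b).count("1")

def reorder(patterns: list[int]) -> list[int]:
    """Greedy: always append the remaining pattern farthest from the last."""
    rest = patterns[:]
    out = [rest.pop(0)]
    while rest:
        nxt = max(rest, key=lambda p: hamming(out[-1], p))
        rest.remove(nxt)
        out.append(nxt)
    return out

print([f"{p:04b}" for p in reorder([0b0000, 0b0001, 0b1110, 0b1111])])
# ['0000', '1111', '0001', '1110']: consecutive pairs differ in many bits
</code></pre>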
Experimental results demonstrate that our approach significantly outperforms state-of-the-art approaches that rely on ATPG or random test patterns for delay-based side-channel analysis.</em></td> </tr> <tr> <td>17:30</td> <td>8.2.2</td> <td><b>MICROFLUIDIC TROJAN DESIGN IN FLOW-BASED BIOCHIPS</b><br /><b>Speaker</b>:<br />Shayan Mohammed, New York University, US<br /><b>Authors</b>:<br />Shayan Mohammed<sup>1</sup>, Sukanta Bhattacharjee<sup>2</sup>, Yong-Ak Song<sup>2</sup>, Krishnendu Chakrabarty<sup>3</sup> and Ramesh Karri<sup>4</sup><br /><sup>1</sup>New York University, US; <sup>2</sup>New York University Abu Dhabi, AE; <sup>3</sup>Duke University, US; <sup>4</sup>NYU, US<br /><em><b>Abstract</b><br />Microfluidic technologies find application in various safety-critical fields such as medical diagnostics, drug research, and cell analysis. Recent work has focused on security threats to microfluidic-based cyberphysical systems and defenses. So far, the threat analysis has been limited to the cases of tampering with control software/hardware, which is common to most cyberphysical control systems in general; in a sense, such an approach is not exclusive to microfluidics. In this paper, we present a stealthy attack paradigm that uses characteristics exclusive to microfluidic devices - a microfluidic trojan. The proposed trojan payload is a valve whose height has been perturbed to vary its pressure response. This trojan can be triggered in multiple ways based on time or specific operations. These triggers can occur naturally in a bioassay or be added into the controlling software. We showcase the trojan's application in carrying out practical attacks - contamination, parameter-tampering, and denial-of-service - on a real-life bioassay implementation. Further, we present guidelines to launch stealthy attacks and to counter them.</em></td> </tr> <tr> <td>18:00</td> <td>8.2.3</td> <td><b>TOWARDS MALICIOUS EXPLOITATION OF ENERGY MANAGEMENT MECHANISMS</b><br /><b>Speaker</b>:<br />Safouane Noubir, Polytech Nantes, IETR, FR<br /><b>Authors</b>:<br />Safouane Noubir<sup>1</sup>, Maria Mendez Real<sup>2</sup> and Sebastien Pillement<sup>2</sup><br /><sup>1</sup>Polytech Nantes, IETR, FR; <sup>2</sup>Polytech Nantes - IETR, FR<br /><em><b>Abstract</b><br />Architectures are becoming more and more complex to keep up with the increase in algorithmic complexity. To fully exploit those architectures, dynamic resource managers are required. The goal of dynamic managers is either to optimize the resource usage (e.g., cores, memory) or to reduce energy consumption under performance constraints. However, since performance optimization is their main goal, they have not been designed to be secure and thus present vulnerabilities. Recently, it has been proven that energy managers can be exploited to cause faults within a processor, allowing an attacker to steal information from a user device. However, this exploitation is not often possible in current commercial devices. In this work, we expose current security vulnerabilities through another type of malicious usage of energy management. Our experimentation shows that it is possible to remotely lock out a device, denying access to all services and data and requiring, for example, the user to pay a ransom to unlock it. 
The main targets of this exploit are embedded systems, and we demonstrate this work by implementing it on two different commercial ARM-based devices.</em></td> </tr> <tr> <td style="width:40px;">18:30</td> <td><a href="/date20/conference/session/IP4">IP4-1</a>, 551</td> <td><b>HIT: A HIDDEN INSTRUCTION TROJAN MODEL FOR PROCESSORS</b><br /><b>Speaker</b>:<br />Jiaqi Zhang, Tongji University, CN<br /><b>Authors</b>:<br />Jiaqi Zhang<sup>1</sup>, Ying Zhang<sup>1</sup>, Huawei Li<sup>2</sup> and Jianhui Jiang<sup>3</sup><br /><sup>1</sup>Tongji University, CN; <sup>2</sup>Chinese Academy of Sciences, CN; <sup>3</sup>School of Software Engineering, Tongji University, CN<br /><em><b>Abstract</b><br />This paper explores an intrusion mechanism for microprocessors using illegal instructions, namely the hidden instruction Trojan (HIT). It uses a low-probability sequence consisting of normal instructions as a boot sequence, followed by an illegal instruction to trigger the Trojan. The payload is a hidden interrupt to force the program counter to a specific address. Hence, the program at that address gains super privileges. Meanwhile, we use integer programming to minimize the trigger probability of HIT within a given area overhead. The experimental results demonstrate that HIT has an extremely low trigger probability and can survive detection by existing test methods.</em></td> </tr> <tr> <td style="width:40px;">18:31</td> <td><a href="/date20/conference/session/IP4">IP4-2</a>, 658</td> <td><b>BITSTREAM MODIFICATION ATTACK ON SNOW 3G</b><br /><b>Speaker</b>:<br />Michail Moraitis, Royal Institute of Technology KTH, SE<br /><b>Authors</b>:<br />Michail Moraitis and Elena Dubrova, Royal Institute of Technology - KTH, SE<br /><em><b>Abstract</b><br />SNOW 3G is one of the core algorithms for confidentiality and integrity in several 3GPP wireless communication standards, including the new Next Generation (NG) 5G. It is believed to be resistant to classical cryptanalysis. In this paper, we show that SNOW 3G can be broken by a fault attack based on bitstream modification. By changing the content of some look-up tables in the bitstream, we reduce the non-linear state updating function of SNOW 3G to a linear one. As a result, it becomes possible to recover the key from a known plaintext-ciphertext pair. To the best of our knowledge, this is the first successful bitstream modification attack on SNOW 3G.</em></td> </tr> <tr> <td>18:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="8.3">8.3 Optimizing System-Level Design for Machine Learning</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 17:00 - 18:30<br /><b>Location / Room:</b> Autrans</p> <p><b>Chair:</b><br />Luciano Lavagno, Politecnico di Torino, IT</p> <p><b>Co-Chair:</b><br />Philippe Coussy, Universite Bretagne Sud / Lab-STICC, FR</p> <p>In recent years, the use of ML techniques, such as deep neural networks, has become a trend in system-level design, either to help the design flow find promising solutions or to deploy ML-based applications. 
This session presents various approaches to optimize several aspects of system-level design, such as the mapping of applications onto heterogeneous platforms, the inference of CNNs, or file-system usage.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>8.3.1</td> <td><b>ESP4ML: PLATFORM-BASED DESIGN OF SYSTEMS-ON-CHIP FOR EMBEDDED MACHINE LEARNING</b><br /><b>Speaker</b>:<br />Davide Giri, Columbia University, US<br /><b>Authors</b>:<br />Davide Giri, Kuan-Lin Chiu, Giuseppe Di Guglielmo, Paolo Mantovani and Luca Carloni, Columbia University, US<br /><em><b>Abstract</b><br />We present ESP4ML, an open-source system-level design flow to build and program SoC architectures for embedded applications that require the hardware acceleration of machine learning and signal processing algorithms. We realized ESP4ML by combining two established open-source projects (ESP and HLS4ML) into a new, fully-automated design flow. For the SoC integration of accelerators generated by HLS4ML, we designed a set of new parameterized interface circuits synthesizable with high-level synthesis. For accelerator configuration and management, we developed an embedded software runtime system on top of Linux. With this HW/SW layer, we addressed the challenge of dynamically shaping the data traffic on a network-on-chip to activate and support the reconfigurable pipelines of accelerators that are needed by the application workloads currently running on the SoC. We demonstrate our vertically-integrated contributions with the FPGA-based implementations of complete SoC instances booting Linux and executing computer-vision applications that process images taken from the Google Street View database.</em></td> </tr> <tr> <td>17:30</td> <td>8.3.2</td> <td><b>PROBABILISTIC SEQUENTIAL MULTI-OBJECTIVE OPTIMIZATION OF CONVOLUTIONAL NEURAL NETWORKS</b><br /><b>Speaker</b>:<br />Zixuan Yin, McGill University, CA<br /><b>Authors</b>:<br />Zixuan Yin, Warren Gross and Brett Meyer, McGill University, CA<br /><em><b>Abstract</b><br />With the advent of deeper, larger and more complex convolutional neural networks (CNN), manual design has become a daunting task, especially when hardware performance must be optimized. Sequential model-based optimization (SMBO) is an efficient method for hyperparameter optimization on highly parameterized machine learning (ML) algorithms, able to find good configurations with a limited number of evaluations by predicting the performance of candidates before evaluation. A case study on MNIST shows that SMBO regression model prediction error significantly impedes search performance in multi-objective optimization. To address this issue, we propose probabilistic SMBO, which selects candidates based on probabilistic estimation of their Pareto efficiency. With a formulation that incorporates error in accuracy prediction and uncertainty in latency measurement, probabilistic Pareto efficiency quantifies a candidate's quality in two ways: its likelihood of being Pareto optimal, and the expected number of current Pareto optimal solutions that it will dominate. We evaluate our proposed method on four image classification problems. 
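<p>Probabilistic Pareto efficiency of this kind can be estimated by Monte-Carlo sampling. A minimal sketch assuming Gaussian error models for predicted accuracy and measured latency (an assumption; the paper's formulation may differ):</p> <pre><code>import numpy as np

def prob_pareto(acc, sig_a, lat, sig_l, n=10000,
                rng=np.random.default_rng(0)):
    """acc/lat: arrays of K candidates (maximize accuracy, minimize
    latency). Returns P(candidate is non-dominated) per candidate."""
    K = len(acc)
    a = rng.normal(acc, sig_a, size=(n, K))
    l = rng.normal(lat, sig_l, size=(n, K))
    dominated = np.zeros((n, K), dtype=bool)
    for j in range(K):  # does candidate j dominate the others?
        dom = (a[:, j:j+1] >= a) & (l[:, j:j+1] <= l) \
              & ((a[:, j:j+1] > a) | (l[:, j:j+1] < l))
        dominated |= dom
    return 1.0 - dominated.mean(axis=0)

acc = np.array([0.92, 0.91, 0.89])   # predicted accuracies
lat = np.array([10.0, 8.0, 9.5])     # measured latencies (ms)
print(prob_pareto(acc, 0.01, lat, 0.5))  # third candidate is ~dominated
</code></pre>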
Compared to a deterministic approach, probabilistic SMBO consistently generates Pareto optimal solutions that perform better, and that are competitive with state-of-the-art efficient CNN models, offering tremendous speedup in inference latency while maintaining comparable accuracy.</em></td> </tr> <tr> <td>18:00</td> <td>8.3.3</td> <td><b>ARS: REDUCING F2FS FRAGMENTATION FOR SMARTPHONES USING DECISION TREES</b><br /><b>Speaker</b>:<br />Lihua Yang, Huazhong University of Science &amp; Technology, CN<br /><b>Authors</b>:<br />Lihua Yang, Fang Wang, Zhipeng Tan, Dan Feng, Jiaxing Qian and Shiyun Tu, Huazhong University of Science &amp; Technology, CN<br /><em><b>Abstract</b><br />File and free-space fragmentation are well known to negatively affect file system performance. F2FS is a file system designed for flash memory. However, it suffers from severe fragmentation due to its out-of-place updates and the highly synchronous, multi-threaded writing behaviors of mobile applications. We observe that the running time of fragmented files is 2.36X longer than that of continuous files and that F2FS's in-place update scheme is incapable of reducing fragmentation. A fragmented file system leads to a poor user experience. Reserving space to prevent fragmentation is an intuitive approach. However, reserving space for all files wastes space since there are a large number of files. To deal with this dilemma, we propose an adaptive reserved space (ARS) scheme to choose some specific files to update in the reserved space. How to effectively select reserved files is critical to performance. We collect file characteristics associated with fragmentation to construct data sets and use decision trees to accurately pick reserved files. In addition, an adjustable reserved space and a dynamic reservation strategy are adopted. We implement ARS on a HiKey960 development platform and a commercial smartphone with slight space and file creation time overheads. Experimental results show that ARS reduces file and free space fragmentation dramatically, improves file I/O performance and reduces garbage collection overhead compared to traditional F2FS and F2FS with in-place updates. Furthermore, ARS delivers up to 1.26X more transactions per second under SQLite than traditional F2FS and reduces the running time of Facebook and Twitter by up to 41.72% compared to F2FS with in-place updates.</em></td> </tr> <tr> <td style="width:40px;">18:30</td> <td><a href="/date20/conference/session/IP4">IP4-3</a>, 22</td> <td><b>A MACHINE LEARNING BASED WRITE POLICY FOR SSD CACHE IN CLOUD BLOCK STORAGE</b><br /><b>Speaker</b>:<br />Yu Zhang, Huazhong University of Science &amp; Technology, CN<br /><b>Authors</b>:<br />Yu Zhang<sup>1</sup>, Ke Zhou<sup>1</sup>, Ping Huang<sup>2</sup>, Hua Wang<sup>1</sup>, Jianying Hu<sup>3</sup>, Yangtao Wang<sup>1</sup>, Yongguang Ji<sup>3</sup> and Bin Cheng<sup>3</sup><br /><sup>1</sup>Huazhong University of Science &amp; Technology, CN; <sup>2</sup>Temple University, US; <sup>3</sup>Tencent Technology (Shenzhen) Co., Ltd., CN<br /><em><b>Abstract</b><br />Nowadays, SSD cache plays an important role in cloud storage systems. The associated write policy, which enforces an admission control policy regarding filling data into the cache, has a significant impact on the performance of the cache system and the amount of write traffic to SSD caches. Based on our analysis of a typical cloud block storage system, approximately 47.09% of writes are write-only, i.e., writes to the blocks which have not been read during a certain time window. 
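<p>The decision-tree selection step of ARS (8.3.3 above) can be sketched with scikit-learn. The feature names and the toy training set are hypothetical, standing in for the fragmentation-related file characteristics the paper collects:</p> <pre><code>from sklearn.tree import DecisionTreeClassifier

# Per-file features: [size_kb, appends_per_hour, num_writing_threads]
X = [[120, 40, 4], [8000, 2, 1], [300, 25, 3], [50, 1, 1],
     [900, 60, 6], [15000, 0, 1], [220, 30, 2], [10, 0, 1]]
y = [1, 0, 1, 0, 1, 0, 1, 0]   # 1 = became fragmented -> reserve space

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(clf.predict([[400, 35, 4]]))  # [1]: reserve space for this file
</code></pre>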
Naively writing the write-only data to the SSD cache unnecessarily introduces a large number of harmful writes to the SSD cache without any contribution to cache performance. On the other hand, it is a challenging task to identify and filter out those write-only data in a real-time manner, especially in a cloud environment running changing and diverse workloads. In this paper, to alleviate the above cache problem, we propose ML-WP, a Machine Learning Based Write Policy, which reduces write traffic to SSDs by avoiding writing write-only data. The main challenge in this approach is to identify write-only data in a real-time manner. To realize ML-WP and achieve accurate write-only data identification, we use machine learning methods to classify data into two groups (i.e., write-only and normal data). Based on this classification, the write-only data is directly written to backend storage without being cached. Experimental results show that, compared with the industry widely deployed write-back policy, ML-WP decreases write traffic to SSD cache by 41.52%, while improving the hit ratio by 2.61% and reducing the average read latency by 37.52%.</em></td> </tr> <tr> <td style="width:40px;">18:31</td> <td><a href="/date20/conference/session/IP4">IP4-4</a>, 47</td> <td><b>YOU ONLY SEARCH ONCE: A FAST AUTOMATION FRAMEWORK FOR SINGLE-STAGE DNN/ACCELERATOR CO-DESIGN</b><br /><b>Speaker</b>:<br />Weiwei Chen, University of Chinese Academy of Sciences, CN<br /><b>Authors</b>:<br />Weiwei Chen, Ying Wang, Shuang Yang, Cheng Liu and Lei Zhang, Chinese Academy of Sciences, CN<br /><em><b>Abstract</b><br />DNN/Accelerator co-design has shown great potential in improving QoR and performance. Typical approaches separate the design flow into two stages: (1) designing an application-specific DNN model with high accuracy; (2) building an accelerator considering the DNN-specific characteristics. However, this may fail to deliver the highest composite score, which combines the goals of accuracy and other hardware-related constraints (e.g., latency, energy efficiency), when building a specific neural-network-based system. In this work, we present a single-stage automated framework, YOSO, aiming to generate an optimal software-and-hardware solution that flexibly balances the goals of accuracy, power, and QoS. Compared with the two-stage method on the baseline systolic array accelerator and the Cifar10 dataset, we achieve 1.42x~2.29x energy or 1.79x~3.07x latency reduction at the same level of precision, for different user-specified energy and latency optimization constraints, respectively.</em></td> </tr> <tr> <td>18:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="8.4">8.4 Architectural and Circuit Techniques toward Energy-efficient Computing</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 17:00 - 18:30<br /><b>Location / Room:</b> Stendhal</p> <p><b>Chair:</b><br />Andrea Calimera, Politecnico di Torino, IT</p> <p><b>Co-Chair:</b><br />Davide Rossi, Università di Bologna, IT</p> <p>The session discusses low-power design techniques at the architectural as well as the circuit level. 
The presented works span from new solutions for conventional computing, such as ultra-low power tunable-precision architectures and speculative SRAM arrays, to emerging paradigms, like spiking neural networks and stochastic computing.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>8.4.1</td> <td><b>TRANSPIRE: AN ENERGY-EFFICIENT TRANSPRECISION FLOATING-POINT PROGRAMMABLE ARCHITECTURE</b><br /><b>Speaker</b>:<br />Rohit Prasad, Lab-STICC, UBS, France &amp; DEI, UniBo, Italy, FR<br /><b>Authors</b>:<br />Rohit Prasad<sup>1</sup>, Satyajit Das<sup>2</sup>, Kevin Martin<sup>3</sup>, Giuseppe Tagliavini<sup>4</sup>, Philippe Coussy<sup>5</sup>, Luca Benini<sup>4</sup> and Davide Rossi<sup>4</sup><br /><sup>1</sup>Université Bretagne Sud, FR; <sup>2</sup>IIT Palakkad, IN; <sup>3</sup>University Bretagne Sud, FR; <sup>4</sup>Università di Bologna, IT; <sup>5</sup>Universite de Bretagne-Sud / Lab-STICC, FR<br /><em><b>Abstract</b><br />In recent years, Coarse Grain Reconfigurable Architecture (CGRA) accelerators have been increasingly deployed in Internet-of-Things (IoT) end nodes. A modern CGRA has to support and efficiently accelerate both integer and floating-point (FP) operations. In this paper, we propose an ultra-low-power tunable-precision CGRA architectural template, called TRANSprecision floating-point Programmable archItectuRE (TRANSPIRE), and its associated compilation flow supporting both integer and FP operations. TRANSPIRE employs transprecision computing and multiple Single Instruction Multiple Data (SIMD) to accelerate FP operations while boosting energy efficiency as well. Experimental results show that TRANSPIRE achieves a maximum of 10.06x performance gain and consumes 12.91x less energy w.r.t. a RISC-V based CPU with an enhanced ISA supporting SIMD-style vectorization and FP data-types, while executing applications for near-sensor computing and embedded machine learning, with an area overhead of only 1.25x.</em></td> </tr> <tr> <td>17:30</td> <td>8.4.2</td> <td><b>MODELING AND DESIGNING OF A PVT AUTO-TRACKING TIMING-SPECULATIVE SRAM</b><br /><b>Speaker</b>:<br />Shan Shen, Southeast University, CN<br /><b>Authors</b>:<br />Shan Shen, Tianxiang Shao, Ming Ling, Jun Yang and Longxing Shi, Southeast University, CN<br /><em><b>Abstract</b><br />In the low-supply-voltage region, the performance of 6T-cell SRAM degrades severely, as more time is needed to develop a sufficient voltage difference on the bitlines. Timing-speculative techniques boost SRAM frequency and throughput by speculatively reading data with aggressive timing and correcting timing failures in one or more extended cycles. However, the throughput gains of timing-speculative SRAM are affected by process, voltage and temperature (PVT) variations, which cause the timing design of speculative SRAM to be either too aggressive or too conservative. This paper first proposes a statistical model to abstract the characteristics of speculative SRAM and shows the presence of an optimal sensing time that maximizes the overall throughput. Then, with the guidance of the performance model, a PVT auto-tracking speculative SRAM is designed and fabricated, which can dynamically self-tune the bitline sensing to the optimal time as the working condition changes.
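<p><em>In the spirit of the statistical model above, a simplified abstraction (our notation, not the paper's) makes the existence of an optimal sensing time visible:</em></p>
<pre><code class="language-latex">
% Expected access latency with speculative sensing at time t_s:
%   p(t_s)  = probability that speculation fails (decreasing in t_s)
%   t_corr  = penalty of the correction cycle(s)
% Throughput is proportional to 1/L(t_s) and is maximized where dL/dt_s = 0,
% i.e., where one more unit of sensing time no longer saves an equal amount
% of expected correction time.
L(t_s) = t_s + p(t_s)\, t_{\mathrm{corr}}, \qquad
\left.\frac{dL}{dt_s}\right|_{t_s^*} = 1 + p'(t_s^*)\, t_{\mathrm{corr}} = 0.
</code></pre>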
According to the measurement results, the maximum throughput gain of the proposed 28nm SRAM is 1.62X compared to the baseline at 0.6V VDD.</em></td> </tr> <tr> <td>18:00</td> <td>8.4.3</td> <td><b>SOLVING CONSTRAINT SATISFACTION PROBLEMS USING THE LOIHI SPIKING NEUROMORPHIC PROCESSOR</b><br /><b>Speaker</b>:<br />Chris Yakopcic, University of Dayton, US<br /><b>Authors</b>:<br />Chris Yakopcic<sup>1</sup>, Nayim Rahman<sup>1</sup>, Tanvir Atahary<sup>1</sup>, Tarek M. Taha<sup>1</sup> and Scott Douglass<sup>2</sup><br /><sup>1</sup>University of Dayton, US; <sup>2</sup>Air Force Research Laboratory, US<br /><em><b>Abstract</b><br />In many cases, low-power autonomous systems need to make decisions extremely efficiently. However, as a potential solution space becomes more complex, finding a solution quickly becomes nearly impossible using traditional computing methods. Thus, in this work we present a constraint satisfaction algorithm based on the principles of spiking neural networks. To demonstrate the validity of this algorithm, we have shown successful execution of the Boolean satisfiability problem (SAT) on the Intel Loihi spiking neuromorphic research processor. Power consumption in this spiking processor is due primarily to the propagation of spikes, which are the key drivers of data movement and processing. Thus, this system is inherently efficient for many types of problems. However, algorithms must be redesigned in a spiking neural network format to achieve the greatest efficiency gains. To the best of our knowledge, the work in this paper exhibits the first implementation of constraint satisfaction on a low-power embedded neuromorphic processor. With this result, we aim to show that embedded spiking neuromorphic hardware is capable of executing general problem-solving algorithms with great areal and computational efficiency.</em></td> </tr> <tr> <td>18:15</td> <td>8.4.4</td> <td><b>ACCURATE POWER DENSITY MAP ESTIMATION FOR COMMERCIAL MULTI-CORE MICROPROCESSORS</b><br /><b>Speaker</b>:<br />Sheldon Tan, University of California, Riverside, US<br /><b>Authors</b>:<br />Jinwei Zhang, Sheriff Sadiqbatcha, Wentian Jin and Sheldon Tan, University of California, Riverside, US<br /><em><b>Abstract</b><br />In this work, we propose an accurate full-chip steady-state power density map estimation method for commercial multi-core microprocessors. The approach is based on steady-state thermal maps (images) measured with an advanced infrared (IR) thermal imaging system to ensure its accuracy. The method consists of a few steps. First, based on the first principle of heat transfer, a 2D spatial Laplace operation is performed on the given thermal map to obtain a so-called raw power density map, which contains both positive and negative values due to the steady-state nature and boundary conditions of the microprocessors. Then, based on the total power of the microprocessor reported by an online CPU tool, we develop a novel scheme to generate the actual, positive-only power density map from the raw power density map. At the same time, we develop a novel approach to estimate the effective thermal conductivity of the microprocessors. To further validate the power density map and the estimated actual thermal conductivity of the microprocessors, we construct a thermal model with COMSOL, which mimics the real measurement setup used in the IR imaging system.
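<p><em>As a back-of-the-envelope illustration of the first step above, the sketch below applies a discrete 2D Laplacian to a thermal map and then naively rescales to the known total power; the conductivity value, boundary handling, and rescaling are illustrative assumptions, not the authors' calibrated scheme.</em></p>
<pre><code class="language-python">
# Sketch: raw power density from a steady-state thermal map via a discrete
# 2D Laplacian (q ~ -k * laplacian(T)), then a naive positive-only rescaling
# so the map integrates to the known total package power.
import numpy as np

def raw_power_density(thermal_map, k_eff=1.0):
    # 5-point Laplacian stencil; np.roll wraps at the edges, which a real
    # implementation would replace with proper boundary conditions.
    lap = (np.roll(thermal_map, 1, 0) + np.roll(thermal_map, -1, 0)
           + np.roll(thermal_map, 1, 1) + np.roll(thermal_map, -1, 1)
           - 4.0 * thermal_map)
    return -k_eff * lap                    # raw map: may contain negatives

def positive_power_map(raw, total_power):
    p = np.clip(raw, 0.0, None)            # drop negative artifacts
    return p * (total_power / p.sum())     # rescale to the measured power
</code></pre>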
We then compute thermal maps from the estimated power density maps using the finite element method (FEM) and verify that they match the measured thermal maps. Experimental results on an Intel i7-8650U 4-core processor show a 1.8&deg;C root-mean-square error (RMSE) and 96% similarity (2D correlation) between the computed and measured thermal maps.</em></td> </tr> <tr> <td style="width:40px;">18:31</td> <td><a href="/date20/conference/session/IP4">IP4-5</a>, 168</td> <td><b>WHEN SORTING NETWORK MEETS PARALLEL BITSTREAMS: A FAULT-TOLERANT PARALLEL TERNARY NEURAL NETWORK ACCELERATOR BASED ON STOCHASTIC COMPUTING</b><br /><b>Speaker</b>:<br />Yawen Zhang, Peking University, CN<br /><b>Authors</b>:<br />Yawen Zhang<sup>1</sup>, Sheng Lin<sup>2</sup>, Runsheng Wang<sup>1</sup>, Yanzhi Wang<sup>2</sup>, Yuan Wang<sup>1</sup>, Weikang Qian<sup>3</sup> and Ru Huang<sup>1</sup><br /><sup>1</sup>Peking University, CN; <sup>2</sup>Northeastern University, US; <sup>3</sup>Shanghai Jiao Tong University, CN<br /><em><b>Abstract</b><br />Stochastic computing (SC) has been widely used in neural networks (NNs) due to its low hardware cost and high fault tolerance. Conventionally, SC-based NN accelerators adopt a hybrid stochastic-binary format, using an accumulative parallel counter to convert bitstreams into a binary number. This method, however, sacrifices fault tolerance and incurs a high hardware cost. In order to fully exploit the superior fault tolerance of SC, taking a ternary neural network (TNN) as an example, we propose a parallel SC-based NN accelerator purely using bitstream computation. We apply a bitonic sorting network to simultaneously implement the accumulation and activation function with parallel bitstreams. The proposed design not only has high fault tolerance, but also achieves at least a 2.8x energy efficiency improvement over its binary computing counterpart.</em></td> </tr> <tr> <td style="width:40px;">18:32</td> <td><a href="/date20/conference/session/IP4">IP4-6</a>, 452</td> <td><b>WAVEPRO: CLOCK-LESS WAVE-PROPAGATED PIPELINE COMPILER FOR LOW-POWER AND HIGH-THROUGHPUT COMPUTATION</b><br /><b>Speaker</b>:<br />Yehuda Kra, Bar-Ilan University, IL<br /><b>Authors</b>:<br />Yehuda Kra, Adam Teman and Tzachi Noy, Bar-Ilan University, IL<br /><em><b>Abstract</b><br />Clock-less wave-propagated pipelining is a long-known approach to achieve high throughput without the overhead of costly sampling registers. However, due to many design challenges, which have only increased with technology scaling, this approach has never been widely accepted and has generally been limited to small and very specific demonstrations. This paper addresses this barrier by presenting WavePro, a generic and scalable algorithm capable of skew balancing any combinatorial logic netlist for the application of wave pipelining. The algorithm was implemented in the WavePro Compiler automation utility, which interfaces with industry delay extraction and standard timing analysis tools to produce a sign-off-quality result. The utility is demonstrated on a dot-product accelerator in a 65 nm CMOS technology, using a vendor-provided standard cell library and commercial timing analysis tools.
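<p><em>Why output skew is the quantity to balance: in wave pipelining a new data wave can be launched roughly every (Dmax - Dmin) plus a margin. A minimal sketch, assuming a toy netlist format of our own invention:</em></p>
<pre><code class="language-python">
# Min/max arrival-time analysis on a tiny combinational DAG, and the
# resulting wave-pipelining period bound. Netlist format and delays are
# illustrative assumptions, not WavePro's sign-off flow.

# gate -> (delay, list of fanin gates); 'in' is the primary input.
# Entries are assumed to be in topological order.
netlist = {
    "g1": (1.0, ["in"]),
    "g2": (3.0, ["in"]),
    "out": (1.0, ["g1", "g2"]),
}

def arrival_times(netlist):
    tmin, tmax = {"in": 0.0}, {"in": 0.0}
    for g, (d, fanin) in netlist.items():
        tmin[g] = d + min(tmin[f] for f in fanin)   # earliest arrival
        tmax[g] = d + max(tmax[f] for f in fanin)   # latest arrival
    return tmin, tmax

tmin, tmax = arrival_times(netlist)
skew = tmax["out"] - tmin["out"]
print(f"output skew = {skew}; min wave period ~ skew + margin")
# Padding g1 to a 3.0 delay would drive the skew, and hence the minimum
# wave period, toward zero: that is the effect of skew balancing.
</code></pre>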
By reducing the worst-case output skew by over 70%, the test-case example was able to achieve throughput equivalent to an 8-stage sequentially pipelined implementation with power savings of almost 3X.</em></td> </tr> <tr> <td>18:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="8.5">8.5 CNN Dataflow Optimizations</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 17:00 - 18:30<br /><b>Location / Room:</b> Bayard</p> <p><b>Chair:</b><br />Mario Casu, Politecnico di Torino, IT</p> <p><b>Co-Chair:</b><br />Wanli Chang, University of York, GB</p> <p>This session focuses on efficient dataflow approaches for reducing CNN runtime on embedded hardware platforms. The papers to be presented demonstrate techniques for enhancing parallelism to improve CNN performance, leverage output prediction to reduce inference runtime for time-critical embedded applications, and present a Keras-based DNN framework for real-time cyber-physical systems.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>8.5.1</td> <td><b>ANALYSIS AND SOLUTION OF CNN ACCURACY REDUCTION OVER CHANNEL LOOP TILING</b><br /><b>Speaker</b>:<br />Yesung Kang, Pohang University of Science and Technology, KR<br /><b>Authors</b>:<br />Yesung Kang<sup>1</sup>, Yoonho Park<sup>1</sup>, Sunghoon Kim<sup>1</sup>, Eunji Kwon<sup>1</sup>, Taeho Lim<sup>2</sup>, Mingyu Woo<sup>3</sup>, Sangyun Oh<sup>4</sup> and Seokhyeong Kang<sup>1</sup><br /><sup>1</sup>Pohang University of Science and Technology, KR; <sup>2</sup>SK Hynix, KR; <sup>3</sup>University of California, San Diego, US; <sup>4</sup>UNIST, KR<br /><em><b>Abstract</b><br />Owing to the growing size of convolutional neural networks (CNNs), quantization and loop tiling (also called loop breaking) are mandatory to implement CNNs on an embedded system. However, channel loop tiling of quantized CNNs induces unexpected errors. We explain why channel loop tiling of quantized CNNs induces these unexpected errors, and how the errors affect the accuracy of state-of-the-art CNNs. We also propose a method to recover accuracy under channel tiling by compressing and decompressing the most significant bits of partial sums. Using the proposed method, we can recover accuracy by 12.3% with only 1% circuit area overhead and an additional 2% power consumption.</em></td> </tr> <tr> <td>17:30</td> <td>8.5.2</td> <td><b>DCCNN: COMPUTATIONAL FLOW REDEFINITION FOR EFFICIENT CNN INFERENCE THROUGH MODEL STRUCTURAL DECOUPLING</b><br /><b>Speaker</b>:<br />Xiang Chen, George Mason University, US<br /><b>Authors</b>:<br />Fuxun Yu<sup>1</sup>, Zhuwei Qin<sup>1</sup>, Di Wang<sup>2</sup>, Ping Xu<sup>1</sup>, Chenchen Liu<sup>3</sup>, Zhi Tian<sup>1</sup> and Xiang Chen<sup>1</sup><br /><sup>1</sup>George Mason University, US; <sup>2</sup>Microsoft, US; <sup>3</sup>University of Maryland, Baltimore County, US<br /><em><b>Abstract</b><br />Thanks to their excellent accuracy and feasibility, Convolutional Neural Networks (CNNs) have been widely applied in novel intelligent applications and systems. However, CNN computation performance is significantly hindered by the computation flow, which processes the model structure sequentially, layer by layer, with massive convolution operations.
Such a layer-wise sequential computation flow is dictated by the inter-layer data dependency and causes performance issues such as resource under-utilization and significant computation overhead. To solve these problems, in this work, we propose a novel CNN structural decoupling method, which decouples CNN models along "critical paths" and eliminates the inter-layer data dependency. Based on this method, we redefine the CNN computation flow into parallel and cascade computing paradigms, which can significantly enhance CNN computation performance with both multi-core and single-core CPU processors. Experiments show that our DC-CNN framework reduces latency by up to 33% on multi-core CPUs for both CIFAR and ImageNet. On small-capacity mobile platforms, cascade computing reduces latency by 24% on average on ImageNet and 42% on CIFAR-10. Meanwhile, the memory reduction reaches 21% and 64% on average, respectively.</em></td> </tr> <tr> <td>18:00</td> <td>8.5.3</td> <td><b>ABC: ABSTRACT PREDICTION BEFORE CONCRETENESS</b><br /><b>Speaker</b>:<br />Jung-Eun Kim, Yale University, US<br /><b>Authors</b>:<br />Jung-Eun Kim<sup>1</sup>, Richard Bradford<sup>2</sup>, Man-Ki Yoon<sup>3</sup> and Zhong Shao<sup>1</sup><br /><sup>1</sup>Department of Computer Science, Yale University, US; <sup>2</sup>Collins Aerospace, US; <sup>3</sup>Yale University, US<br /><em><b>Abstract</b><br />Learning techniques are advancing the utility and capability of modern embedded systems. However, the challenge of incorporating learning modules into embedded systems is that computing resources are scarce. For such a resource-constrained environment, we have developed a framework for learning abstract information early and learning more concretely as time allows. The intermediate results can be utilized to prepare for early decisions/actions as needed. To apply this framework to a classification task, the datasets are categorized in an abstraction hierarchy. Then the framework classifies intermediate labels from the most abstract level to the most concrete. Our proposed method outperforms the existing approaches and reference baselines in terms of accuracy. We show our framework with different architectures and on the benchmark datasets CIFAR-10, CIFAR-100, and GTSRB. We measure prediction times on GPU-equipped embedded computing platforms as well.</em></td> </tr> <tr> <td>18:15</td> <td>8.5.4</td> <td><b>A COMPOSITIONAL APPROACH USING KERAS FOR NEURAL NETWORKS IN REAL-TIME SYSTEMS</b><br /><b>Speaker</b>:<br />Xin Yang, University of Auckland, NZ<br /><b>Authors</b>:<br />Xin Yang, Partha Roop, Hammond Pearce and Jin Woo Ro, University of Auckland, NZ<br /><em><b>Abstract</b><br />Real-time systems are designed using model-driven approaches, where a complex system is represented as a set of interacting components. Such a compositional approach facilitates the design of simpler components, which are easier to validate and integrate with the overall system. In contrast to such systems, data-driven systems like neural networks are designed as monolithic black-boxes to capture the non-linear relationship from inputs to outputs. Increasingly, such systems are being used in safety-critical real-time systems. Here, a compositional approach would be ideal. However, to the best of our knowledge, such a compositional approach is lacking for the design of data-driven components based on neural networks.
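<p><em>The kind of composition at stake can be sketched in plain Keras: two small networks wired into one synchronously executing model. This is only a minimal illustration with made-up layer sizes; the paper's CpNN semantics and WCET-oriented C-code generation are richer.</em></p>
<pre><code class="language-python">
# Compose two small Keras networks so they execute as one model:
# a perception net feeding a control net. Sizes are illustrative.
import tensorflow as tf

perception = tf.keras.Sequential(
    [tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
     tf.keras.layers.Dense(4, activation="relu")], name="perception")

control = tf.keras.Sequential(
    [tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
     tf.keras.layers.Dense(2)], name="control")

inputs = tf.keras.Input(shape=(8,))
composed = tf.keras.Model(inputs, control(perception(inputs)),
                          name="composed_nn")
composed.summary()  # one model; components remain separately inspectable
</code></pre>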
This paper formalises this problem by developing the concept of Composed Neural Networks (CpNNs), extending the well-known Keras Python framework. CpNNs formalise the synchronous composition of several interacting neural networks in Keras. Further, using the developed semantics, we enable modular compilation from a given CpNN to C code. The generated code is suitable for Worst-Case Execution Time (WCET) analysis. Using several benchmarks, we demonstrate the superiority of the developed approach over a recently proposed approach using Esterel, as well as the popular Python package TensorFlow Lite. For the given benchmarks, our approach is superior to Esterel with an average WCET reduction of 64.06%, and superior to TensorFlow Lite with an average measured WCET reduction of 62.08%.</em></td> </tr> <tr> <td style="width:40px;">18:00</td> <td><a href="/date20/conference/session/IP4">IP4-7</a>, 935</td> <td><b>DEEPNVM: A FRAMEWORK FOR MODELING AND ANALYSIS OF NON-VOLATILE MEMORY TECHNOLOGIES FOR DEEP LEARNING APPLICATIONS</b><br /><b>Speaker</b>:<br />Ahmet Inci, Carnegie Mellon University, US<br /><b>Authors</b>:<br />Ahmet Inci, Mehmet M Isgenc and Diana Marculescu, Carnegie Mellon University, US<br /><em><b>Abstract</b><br />Non-volatile memory (NVM) technologies such as spin-transfer torque magnetic random access memory (STT-MRAM) and spin-orbit torque magnetic random access memory (SOT-MRAM) have significant advantages compared to conventional SRAM due to their non-volatility, higher cell density, and scalability features. While previous work has investigated several architectural implications of NVM for generic applications, in this work we present DeepNVM, a framework to characterize, model, and analyze NVM-based caches in GPU architectures for deep learning (DL) applications by combining technology-specific circuit-level models and the actual memory behavior of various DL workloads. We present both iso-capacity and iso-area performance and energy analysis for systems whose last-level caches rely on conventional SRAM and emerging STT-MRAM and SOT-MRAM technologies. In the iso-capacity case, STT-MRAM and SOT-MRAM provide up to 4.2x and 5x energy-delay product (EDP) reduction and 2.4x and 3x area reduction compared to conventional SRAM, respectively. Under iso-area assumptions, STT-MRAM and SOT-MRAM provide 2.3x EDP reduction on average across all workloads when compared to SRAM. Our comprehensive cross-layer framework is demonstrated on STT-/SOT-MRAM technologies and can be used for the characterization, modeling, and analysis of any NVM technology for last-level caches in GPU platforms for deep learning applications.</em></td> </tr> <tr> <td style="width:40px;">18:01</td> <td><a href="/date20/conference/session/IP4">IP4-8</a>, 419</td> <td><b>EFFICIENT EMBEDDED MACHINE LEARNING APPLICATIONS USING ECHO STATE NETWORKS</b><br /><b>Speaker</b>:<br />Rolando Brondolin, Politecnico di Milano, IT<br /><b>Authors</b>:<br />Luca Cerina<sup>1</sup>, Giuseppe Franco<sup>2</sup>, Claudio Gallicchio<sup>3</sup>, Alessio Micheli<sup>3</sup> and Marco D. Santambrogio<sup>4</sup><br /><sup>1</sup>Politecnico di Milano, IT; <sup>2</sup>Scuola Superiore Sant'Anna / Università di Pisa, IT; <sup>3</sup>Università di Pisa, IT; <sup>4</sup>Politecnico di Milano, IT<br /><em><b>Abstract</b><br />The increasing role of Artificial Intelligence (AI) and Machine Learning (ML) in our lives has brought a paradigm shift in how and where computation is performed.
Stringent latency requirements and congested bandwidth have moved AI inference from the cloud towards end devices. This change required a major simplification of Deep Neural Networks (DNNs), with memory-wise libraries or co-processors that perform fast inference with minimal power. Unfortunately, many applications such as natural language processing, time-series analysis and audio interpretation are built on a different type of Artificial Neural Network (ANN), the so-called Recurrent Neural Networks (RNNs), which, due to their intrinsic architecture, remain too complex and heavy to run efficiently on embedded devices. To solve this issue, the Reservoir Computing paradigm proposes sparse untrained non-linear networks, the Reservoir, that can embed temporal relations without some of the hindrances of Recurrent Neural Network training, and with a lower memory usage. Echo State Networks (ESNs) and Liquid State Machines are the most notable examples. In this scenario, we propose a performance comparison of an ESN, designed and trained using Bayesian Optimization techniques, against current RNN solutions. We aim to demonstrate that ESNs have comparable accuracy, require minimal training time, and are more efficient in terms of memory usage and computational cost. Preliminary results show that ESNs are competitive with RNNs on a simple benchmark, and both training and inference are faster, with maximum speed-ups of 2.35x and 6.60x, respectively.</em></td> </tr> <tr> <td>18:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="8.6">8.6 Microarchitecture-level reliability analysis and protection</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 17:00 - 18:30<br /><b>Location / Room:</b> Lesdiguières</p> <p><b>Chair:</b><br />Michail Maniatakos, NYU Abu Dhabi, AE</p> <p><b>Co-Chair:</b><br />Alessandro Savino, Politecnico di Torino, IT</p> <p>Reliability analysis and protection at the microarchitecture level is of paramount importance to speed up the design phase of any computing system. On the analysis side, this session starts by presenting a reverse-order ACE (Architecturally Correct Execution) analysis that is more accurate than the original ACE proposals, then moves to an instruction-level analysis based on a genetic algorithm that improves program resiliency to errors. Finally, on the protection side, the session presents a low-cost ECC plus approximation mechanism for GPU register files.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>8.6.1</td> <td><b>RACE: REVERSE-ORDER PROCESSOR RELIABILITY ANALYSIS</b><br /><b>Authors</b>:<br />Athanasios Chatzidimitriou and Dimitris Gizopoulos, University of Athens, GR<br /><em><b>Abstract</b><br />Modern microprocessors suffer from increased error rates that come along with fabrication technology scaling. Processor designs continuously become more prone to hardware faults that lead to execution errors and system failures, which raises the need for protection mechanisms. However, error mitigation strategies have to be applied diligently, as they impose significant power, area, and performance overheads. Early and accurate reliability estimation of a microprocessor design is essential in order to determine the most vulnerable hardware structures and the most efficient protection schemes.
One of the most commonly used techniques for reliability estimation is Architecturally Correct Execution (ACE) analysis. ACE analysis can be applied at different abstraction levels, including microarchitecture and RTL, and often requires only one or a few simulations to report the Architectural Vulnerability Factor (AVF) of the processor structures. However, ACE analysis overestimates the vulnerability of structures because of its pessimistic, worst-case nature. Moreover, it only delivers coarse-grain vulnerability reports and no details about the expected result of hardware faults (silent data corruptions, crashes). In this paper, we present reverse ACE (rACE), a methodology that (a) improves the accuracy of ACE analysis and (b) delivers fine-grain error outcome reports. Using a reverse-order tracing flow, rACE analysis associates portions of the simulated execution of a program with the actual output and the control flow, delivering finer accuracy and results classification. Our findings show that rACE reports an average 1.45X overestimation, compared to Statistical Fault Injection, for different sizes of the register file of an out-of-order CPU core (executing both ARM and x86 binaries), whereas a baseline ACE analysis reports 2.3X overestimation and even refined versions of ACE analysis report an average of 1.8X overestimation.</em></td> </tr> <tr> <td>17:30</td> <td>8.6.2</td> <td><b>DEFCON: GENERATING AND DETECTING FAILURE-PRONE INSTRUCTION SEQUENCES VIA STOCHASTIC SEARCH</b><br /><b>Speaker</b>:<br />Ioannis Tsiokanos, Queen's University Belfast, GB<br /><b>Authors</b>:<br />Ioannis Tsiokanos<sup>1</sup>, Lev Mukhanov<sup>1</sup>, Giorgis Georgakoudis<sup>2</sup>, Dimitrios S. Nikolopoulos<sup>3</sup> and Georgios Karakonstantis<sup>1</sup><br /><sup>1</sup>Queen's University Belfast, GB; <sup>2</sup>Lawrence Livermore National Laboratory, US; <sup>3</sup>Virginia Tech, US<br /><em><b>Abstract</b><br />The increased variability and adopted low supply voltages render nanometer devices prone to timing failures, which threaten the functionality of digital circuits. Recent schemes have focused on developing instruction-aware failure prediction models and adapting voltage/frequency to avoid errors while saving energy. However, such schemes may be inaccurate when applied to pipelined cores since they consider only the currently executed instruction and the preceding one, thereby neglecting the impact of all the concurrently executing instructions on failure occurrence. In this paper, we first demonstrate that the order and type of instructions in sequences with a length equal to the pipeline depth significantly affect the failure rate. Since evaluating the impact of all possible sequences on failures is practically impossible, we present DEFCON, a fully automated framework that stochastically searches for the most failure-prone instruction sequences (ISQs). DEFCON generates such sequences by integrating a properly formulated genetic algorithm with accurate post-layout dynamic timing analysis, considering the data-dependent path sensitization and instruction execution history. The generated micro-architecture-aware ISQs are then used by DEFCON to estimate the failure vulnerability of any application. To evaluate the efficacy of the proposed framework, we implement a pipelined floating-point unit and perform dynamic timing analysis based on input data that we extract from a variety of applications consisting of up to 43.5M ISQs.
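<p><em>The search loop DEFCON formulates can be sketched in a few lines; here the instruction set, sequence length, and, above all, the fitness function are stand-ins, since the real framework scores sequences with post-layout dynamic timing analysis.</em></p>
<pre><code class="language-python">
# Skeletal genetic algorithm over instruction sequences of pipeline-depth
# length. ISA, depth, and the fitness stub are illustrative assumptions.
import random

ISA = ["fadd", "fmul", "fdiv", "fsub", "nop"]
DEPTH = 6          # assumed pipeline depth
POP, GENS = 50, 30

def fitness(seq):
    # Stand-in for a timing model: pretend back-to-back identical
    # long-latency ops sensitize long paths more often.
    return sum(a == b != "nop" for a, b in zip(seq, seq[1:]))

def mutate(seq):
    i = random.randrange(DEPTH)
    return seq[:i] + [random.choice(ISA)] + seq[i + 1:]

def crossover(a, b):
    cut = random.randrange(1, DEPTH)
    return a[:cut] + b[cut:]

pop = [[random.choice(ISA) for _ in range(DEPTH)] for _ in range(POP)]
for _ in range(GENS):
    pop.sort(key=fitness, reverse=True)
    elite = pop[: POP // 4]                       # keep the best quarter
    pop = elite + [mutate(crossover(random.choice(elite),
                                    random.choice(elite)))
                   for _ in range(POP - len(elite))]
print("most failure-prone sequence found:", max(pop, key=fitness))
</code></pre>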
Our results show that DEFCON quickly reveals ISQs that maximize the output quality loss and correctly detects 99.7% of the actual faulty ISQs in different applications under various levels of variation-induced delay increase. Finally, DEFCON enables us to identify failure-prone ISQs early in the design cycle and saves 26.8% of energy on average when combined with a clock stretching mechanism.</em></td> </tr> <tr> <td>18:00</td> <td>8.6.3</td> <td><b>LAD-ECC: ENERGY-EFFICIENT ECC MECHANISM FOR GPGPUS REGISTER FILE</b><br /><b>Speaker</b>:<br />Hengshan Yue, Jilin University, CN<br /><b>Authors</b>:<br />Xiaohui Wei, Hengshan Yue and Jingweijia Tan, Jilin University, CN<br /><em><b>Abstract</b><br />Graphics Processing Units (GPUs) are widely used in general-purpose high-performance computing applications (i.e., GPGPUs), which require reliable execution in the presence of soft errors. To support massive thread-level parallelism, a sizeable register file is adopted in GPUs, which is highly vulnerable to soft errors. Although modern commercial GPUs provide single-error-correction double-error-detection (SEC-DED) ECC for the register file, it consumes a considerable amount of energy due to frequent register accesses and the leakage power of ECC storage. In this paper, we propose to Leverage Approximation and Duplication characteristics of register values to build an energy-efficient ECC mechanism (LAD-ECC) in GPGPUs, which consists of APproximation-aware ECC (AP-ECC) and Duplication-Aware ECC (DA-ECC). Leveraging the inherent error tolerance of applications, AP-ECC protects only the significant bits of registers to combat critical errors. Observing that same-named registers across threads usually hold the same data, DA-ECC avoids unnecessary ECC generation and verification for duplicate register values. Experimental results demonstrate that our LAD-ECC reduces the energy consumption of traditional SEC-DED ECC by 69.72%.</em></td> </tr> <tr> <td style="width:40px;">18:30</td> <td><a href="/date20/conference/session/IP4">IP4-9</a>, 698</td> <td><b>EXPLFRAME: EXPLOITING PAGE FRAME CACHE FOR FAULT ANALYSIS OF BLOCK CIPHERS</b><br /><b>Speaker</b>:<br />Anirban Chakraborty, IIT Kharagpur, IN<br /><b>Authors</b>:<br />Anirban Chakraborty<sup>1</sup>, Sarani Bhattacharya<sup>2</sup>, Sayandeep Saha<sup>1</sup> and Debdeep Mukhopadhyay<sup>1</sup><br /><sup>1</sup>IIT Kharagpur, IN; <sup>2</sup>Phd, BE<br /><em><b>Abstract</b><br />The Page Frame Cache (PFC) is a purely software cache, present in modern Linux-based operating systems (OS), which stores the page frames that were recently released by the processes running on a particular CPU. In this paper, we show that the page frame cache can be maliciously exploited by an adversary to steer the pages of a victim process to some pre-decided, attacker-chosen locations in memory. We practically demonstrate an end-to-end attack, <em>ExplFrame</em>, where an attacker having only user-level privilege is able to force a victim process's memory pages to vulnerable locations in DRAM and deterministically conduct Rowhammer to induce faults. As a case study, we induce single-bit faults in the T-tables of OpenSSL (v1.1.1) AES using our proposed attack ExplFrame.
We also propose an improved fault analysis technique which can exploit any Rowhammer-induced bit-flips in the AES T-tables.</em></td> </tr> <tr> <td>18:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="8.7">8.7 Physical Design and Analysis</h2> <p><b>Date:</b> Wednesday, March 11, 2020<br /><b>Time:</b> 17:00 - 18:30<br /><b>Location / Room:</b> Berlioz</p> <p><b>Chair:</b><br />Vasilis Pavlidis, U Manchester, GB</p> <p><b>Co-Chair:</b><br />L. Miguel Silveira, INESC ID / IST, U Lisboa, PT</p> <p>This session deals with problems in extraction, DRC hotspots, IR drop, routing and other relevant issues in physical design and analysis. The common trend across all papers is efficiency improvement while maintaining accuracy. Floating random walk extraction is performed to handle non-stratified dielectrics with on-the-fly computations. Also, serial equivalence can be guaranteed in FPGA routing by exploiting parallelism. A legalization flow is proposed for double-patterning-aware feature alignment. Finally, machine-learning-based DRC hotspot prediction is enhanced with explainability.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>8.7.1</td> <td><b>FLOATING RANDOM WALK BASED CAPACITANCE SOLVER FOR VLSI STRUCTURES WITH NON-STRATIFIED DIELECTRICS</b><br /><b>Speaker</b>:<br />Ming Yang, Tsinghua University, CN<br /><b>Authors</b>:<br />Mingye Song, Ming Yang and Wenjian Yu, Tsinghua University, CN<br /><em><b>Abstract</b><br />In this paper, two techniques are proposed to enhance the floating random walk (FRW) based capacitance solver for handling non-stratified dielectrics in very large-scale integrated (VLSI) circuits. They follow an existing approach which employs approximate eight-octant transition cubes while simulating the structure with conformal dielectrics. Firstly, the symmetry property of the transition probabilities of the eight-octant cube is revealed and utilized to derive an on-the-fly sampling scheme during the FRW procedure. This avoids the pre-characterization, saves substantial memory, and improves computational accuracy for extracting structures with non-stratified dielectrics. Then, the space management technique is extended to improve the runtime efficiency for simulating structures with thousands of non-stratified dielectrics. Numerical experiments are carried out to validate the proposed techniques and show their effectiveness for handling structures with conformal dielectrics and air bubbles. Moreover, the extended space management brings up to 1441X speedup for handling structures with several thousand to one million non-stratified dielectrics.</em></td> </tr> <tr> <td>17:30</td> <td>8.7.2</td> <td><b>TOWARDS SERIAL-EQUIVALENT MULTI-CORE PARALLEL ROUTING FOR FPGAS</b><br /><b>Speaker</b>:<br />Minghua Shen, Sun Yat-sen University, CN<br /><b>Authors</b>:<br />Minghua Shen and Nong Xiao, Sun Yat-sen University, CN<br /><em><b>Abstract</b><br />In this paper, we present a serial-equivalent parallel router for FPGAs on modern multi-core processors. We build on the inherent net order of the serial router to schedule all nets into a series of stages, where non-conflicting nets are scheduled in the same stage and conflicting nets are scheduled in different stages. We explore the parallel routing of non-conflicting nets on multi-core processors for a significant speedup.
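<p><em>The staging idea just described can be sketched in a few lines: each net, visited in the serial router's order, lands in the stage right after its latest conflicting predecessor, so non-conflicting nets share a stage. The conflict test below is an illustrative stand-in for shared-routing-resource conflicts in a real router.</em></p>
<pre><code class="language-python">
# Schedule nets (kept in serial order) into stages of mutually
# non-conflicting nets; conflicting nets fall into later stages.
def schedule_stages(nets, conflicts):
    stage_of = {}
    for net in nets:
        s = 0
        for earlier, se in stage_of.items():
            if conflicts(net, earlier):
                s = max(s, se + 1)   # route strictly after its conflicts
        stage_of[net] = s
    n_stages = 1 + max(stage_of.values(), default=-1)
    return [[n for n, s in stage_of.items() if s == k]
            for k in range(n_stages)]

# Toy usage: nets "conflict" when they share a routing channel.
channel = {"n1": "A", "n2": "A", "n3": "B"}
stages = schedule_stages(["n1", "n2", "n3"],
                         lambda a, b: channel[a] == channel[b])
print(stages)  # [['n1', 'n3'], ['n2']]
</code></pre>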
We perform the data synchronization of conflicting stages using an MPI-based message queue for a feasible routing solution. Load balancing is used throughout to guide the multi-core parallel routing. Experimental results show that our parallel router provides about 19.13x speedup on average using 32 processor cores compared to the serial router. Notably, our parallel router generates exactly the same wirelength as the serial router, satisfying serial equivalence.</em></td> </tr> <tr> <td>18:00</td> <td>8.7.3</td> <td><b>SELF-ALIGNED DOUBLE-PATTERNING AWARE LEGALIZATION</b><br /><b>Speaker</b>:<br />Hua Xiang, IBM Research, US<br /><b>Authors</b>:<br />Hua Xiang<sup>1</sup>, Gi-Joon Nam<sup>1</sup>, Gustavo Tellez<sup>2</sup>, Shyam Ramji<sup>2</sup> and Xiaoqing Xu<sup>3</sup><br /><sup>1</sup>IBM Research, US; <sup>2</sup>IBM Thomas J. Watson Research Center, US; <sup>3</sup>UT-Austin, US<br /><em><b>Abstract</b><br />Double patterning is a widely used technique for sub-22nm nodes. Among the various double-patterning techniques, Self-Aligned Double Patterning (SADP) is a promising technique for good mask overlay control. Based on SADP, a new set of standard cells (T-cells) is developed using thicker metal wires for stronger drive strength. Applying such gates on critical paths helps improve design performance. However, a mixed design with T-cells and normal cells (N-cells) requires that T-cells are placed on circuit rows with thicker metal, and N-cells on normal circuit rows. Therefore, a placer is needed to adjust the cells to the matched circuit rows. In this paper, a two-stage min-cost max-flow based legalization flow is presented to adjust N/T gate locations for a legal placement. The experimental results demonstrate the effectiveness and efficiency of our approach.</em></td> </tr> <tr> <td>18:15</td> <td>8.7.4</td> <td><b>EXPLAINABLE DRC HOTSPOT PREDICTION WITH RANDOM FOREST AND SHAP TREE EXPLAINER</b><br /><b>Speaker</b>:<br />Wei Zeng, University of Wisconsin-Madison, US<br /><b>Authors</b>:<br />Wei Zeng<sup>1</sup>, Azadeh Davoodi<sup>1</sup> and Rasit Onur Topaloglu<sup>2</sup><br /><sup>1</sup>University of Wisconsin - Madison, US; <sup>2</sup>IBM, US<br /><em><b>Abstract</b><br />With advanced technology nodes, resolving design rule check (DRC) violations has become a cumbersome task, which makes it desirable to make predictions at earlier stages of the design flow. In this paper, we show that the Random Forest (RF) model is quite effective for DRC hotspot prediction at the global routing stage, and in fact significantly outperforms recent prior works, with only a fraction of the runtime to develop the model. We also propose, for the first time, to adopt a recent explanatory metric, the SHAP value, to make accurate and consistent explanations for individual DRC hotspot predictions from RF. Experiments show that RF is 21%-60% better in predictive performance on average, compared with promising machine learning models used in similar works (e.g.,
SVM and neural networks), while exhibiting good explainability, which makes it ideal for DRC hotspot prediction.</em></td> </tr> <tr> <td style="width:40px;">18:31</td> <td><a href="/date20/conference/session/IP4">IP4-10</a>, 522</td> <td><b>XGBIR: AN XGBOOST-BASED IR DROP PREDICTOR FOR POWER DELIVERY NETWORK</b><br /><b>Speaker</b>:<br />An-Yu Su, National Chiao Tung University, TW<br /><b>Authors</b>:<br />Chi-Hsien Pao, Yu-Min Lee and An-Yu Su, National Chiao Tung University, TW<br /><em><b>Abstract</b><br />This work utilizes XGBoost to build a machine-learning-based IR drop predictor, XGBIR, for the power grid. To capture the behavior of the power grid, we extract several of its features and employ its locality property to save extraction time. XGBIR can be effectively applied to large designs, and the average error of the predicted IR drops is less than 6 mV.</em></td> </tr> <tr> <td style="width:40px;">18:32</td> <td><a href="/date20/conference/session/IP4">IP4-11</a>, 347</td> <td><b>ON PRE-ASSIGNMENT ROUTE PROTOTYPING FOR IRREGULAR BUMPS ON BGA PACKAGES</b><br /><b>Speaker</b>:<br />Hung-Ming Chen, National Chiao Tung University, TW<br /><b>Authors</b>:<br />Jyun-Ru Jiang<sup>1</sup>, Yun-Chih Kuo<sup>2</sup>, Simon Chen<sup>3</sup> and Hung-Ming Chen<sup>1</sup><br /><sup>1</sup>Institute of Electronics, National Chiao Tung University, TW; <sup>2</sup>National Taiwan University, TW; <sup>3</sup>MediaTek.inc, TW<br /><em><b>Abstract</b><br />In modern package design, bumps are often placed irregularly due to macros varying in size and position. This makes pre-assignment routing more difficult, even with massive design effort. This work presents a 2-stage routing method which can be applied to an arbitrary bump placement on 2-layer BGA packages. Our approach combines escape routing with via assignment: the escape routing is used to handle the irregular bumps, and the via assignment is applied to improve the wire congestion and total wirelength of global routing. Experimental results based on industrial cases show that our methodology can solve the routing efficiently, and we have achieved an 82% improvement in wire congestion with a 5% wirelength increase compared with conventional regular treatments.</em></td> </tr> <tr> <td>18:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="9.1">9.1 Special Day on "Silicon Photonics": Advancements on Silicon Photonics</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 08:30 - 10:00<br /><b>Location / Room:</b> Amphithéâtre Jean Prouve</p> <p><b>Chair:</b><br />Ashkan Sayedy, Hewlett Packard Labs, US</p> <p><b>Co-Chair:</b><br />Gabriela Nicolescu, Polytechnique Montréal, CA</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>9.1.1</td> <td><b>SYSTEM STUDY OF SILICON PHOTONICS MODULATOR IN SHORT REACH GRIDLESS COHERENT NETWORKS</b><br /><b>Speaker</b>:<br />Naim Ben-Hamida, Ciena Corporation, CA<br /><b>Authors</b>:<br />Naim Ben-Hamida<sup>1</sup>, Ahmad Abdo<sup>1</sup>, Xueyang Li<sup>2</sup>, Md Samiul Alam<sup>2</sup>, Mahdi Parvizi<sup>1</sup>, Claude D'Amours<sup>3</sup> and David V. 
Plant<sup>2</sup><br /><sup>1</sup>Ciena Corporation, CA; <sup>2</sup>McGill, CA; <sup>3</sup>University of Ottawa, CA<br /><em><b>Abstract</b><br />We study the impact of modulation loss of Silicon-Photonics Mach-Zehnder modulators in the context of single-carrier coherent receivers, i.e., 400G-ZR. The modulation loss is primarily due to the limited bandwidth and large peak-to-average ratio of the modulator output. We present the implications of performing only post-compensation of the loss at the receiver and its advantages in gridless networks. A manageable Q-factor penalty of around 0.5 dB is found for a dual-polarization system with a 0.75 dB peak-to-average power ratio (PAPR) reduction.</em></td> </tr> <tr> <td>09:00</td> <td>9.1.2</td> <td><b>FULLY INTEGRATED PHOTONIC CIRCUITS ON SILICON BY MEANS OF III-V/SILICON BONDING</b><br /><b>Author</b>:<br />Florian Denis-le Coarer, SCINTIL Photonics, US<br /><em><b>Abstract</b><br />This presentation introduces a new platform integrating heterogeneous III-V/silicon gain devices at the backside of silicon-on-insulator wafers. The fabrication relies on commercial silicon photonic processes. This platform enables fully integrated photonic circuits comprising lasers, modulators, passives and photodetectors, that can be tested at the wafer level.</em></td> </tr> <tr> <td>09:30</td> <td>9.1.3</td> <td><b>III-V/SILICON HYBRID LASERS INTEGRATION ON CMOS-COMPATIBLE 200MM AND 300MM PLATFORMS</b><br /><b>Authors</b>:<br />Bertrand Szelag<sup>1</sup>, Laetitia Adelmini<sup>1</sup>, Cecilia Dupre<sup>1</sup>, Elodie Ghegin<sup>2</sup>, Philippe Rodriguez<sup>1</sup>, Fabrice Nemouchi<sup>1</sup>, Pierre Brianceau<sup>1</sup>, Antoine Schembri<sup>1</sup>, David Carrara<sup>3</sup>, Pierrick Cavalie<sup>3</sup>, Florent Franchin<sup>3</sup>, Marie-Christine Roure<sup>1</sup>, Loic Sanchez<sup>1</sup>, Christophe Jany<sup>1</sup> and Ségolène Olivier<sup>1</sup><br /><sup>1</sup>CEA-Leti, FR; <sup>2</sup>STMicroelectronics, FR; <sup>3</sup>Almae Technologies, FR<br /><em><b>Abstract</b><br />We present a CMOS-compatible hybrid III-V/Silicon technology developed at CEA-Leti. Large-scale integration of silicon photonics is already available worldwide in 200mm or 300mm through different foundries, but the development of a CMOS-compatible process for III-V integration remains of major interest for next-generation transceivers in the Datacom and High Performance Computing domains. The technological developments involve first the hybridization on top of a mature silicon photonic front-end wafer through direct molecular bonding, then the patterning of the III-V epitaxy layer, and low-access-resistance contacts through a planar multilevel BEOL to be optimized. The different technological blocks will be described, and the results will be discussed on the basis of test vehicles based on either distributed feedback (DFB), distributed Bragg reflector (DBR), or Fabry-Perot (FP) laser cavities. While the first demonstrations were obtained through wafer bonding, we show that the fabrication process was subsequently validated on bonded III-V dies, with a Fabry-Perot laser fabrication yield of 97% in 200mm. 
The overall technological features are expected to improve the efficiency, density, and cost of silicon photonics PICs.</em></td> </tr> <tr> <td>10:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="9.2">9.2 Autonomous Systems Design Initiative: Architectures and Frameworks for Autonomous Systems</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 08:30 - 10:00<br /><b>Location / Room:</b> Chamrousse</p> <p><b>Chair:</b><br />Selma Saidi, TU Dortmund, DE</p> <p><b>Co-Chair:</b><br />Rolf Ernst, TU Dortmund, DE</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>9.2.1</td> <td><b>DEEPRACING: A FRAMEWORK FOR AGILE AUTONOMY</b><br /><b>Speaker</b>:<br />Trent Weiss, University of Virginia, US<br /><b>Authors</b>:<br />Trent Weiss and Madhur Behl, University of Virginia, US<br /><em><b>Abstract</b><br />We consider the challenging problem of vision-based, high-speed autonomous racing in realistic dynamic environments. We present DeepRacing, a novel end-to-end framework, and a virtual testbed for training and evaluating algorithms for autonomous racing. The virtual testbed is implemented using the Formula One (F1) Codemasters game, which is used by many real-world F1 drivers for training. We also present AdmiralNet, a Convolutional Neural Network (CNN) integrated with Long Short-Term Memory (LSTM) cells that can be tuned for the autonomous racing task in the highly realistic F1 game. We evaluate AdmiralNet's performance on unseen race tracks, and also evaluate the degree of transference between the simulation and the real world by implementing end-to-end racing on a physical 1/10-scale autonomous racecar.</em></td> </tr> <tr> <td>09:00</td> <td>9.2.2</td> <td><b>FAIL-OPERATIONAL AUTOMOTIVE SOFTWARE DESIGN USING AGENT-BASED GRACEFUL DEGRADATION</b><br /><b>Speaker</b>:<br />Philipp Weiss, TUM, DE<br /><b>Authors</b>:<br />Philipp Weiss<sup>1</sup>, Andreas Weichslgartner<sup>2</sup>, Felix Reimann<sup>2</sup> and Sebastian Steinhorst<sup>1</sup><br /><sup>1</sup>TUM, DE; <sup>2</sup>Audi Electronics Venture GmbH, DE</td> </tr> <tr> <td>09:30</td> <td>9.2.3</td> <td><b>A DISTRIBUTED SAFETY MECHANISM USING MIDDLEWARE AND HYPERVISORS FOR AUTONOMOUS VEHICLES</b><br /><b>Speaker</b>:<br />Andrei Terechko, NXP Semiconductors, NL<br /><b>Authors</b>:<br />Tjerk Bijlsma<sup>1</sup>, Andrii Buriachevskyi<sup>2</sup>, Alessandro Frigerio<sup>3</sup>, Yuting Fu<sup>2</sup>, Kees Goossens<sup>3</sup>, Ali Osman Örs<sup>2</sup>, Pieter J. van der Perk<sup>2</sup>, Andrei Terechko<sup>2</sup> and Bart Vermeulen<sup>2</sup><br /><sup>1</sup>TNO, NL; <sup>2</sup>NXP Semiconductors, NL; <sup>3</sup>Eindhoven University of Technology, NL</td> </tr> <tr> <td>10:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="9.3">9.3 Special Session: In memory computing for edge AI</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 08:30 - 10:00<br /><b>Location / Room:</b> Autrans</p> <p><b>Chair:</b><br />Maha Kooli, CEA-Leti, FR</p> <p><b>Co-Chair:</b><br />Alexandre Levisse, EPFL, CH</p> <p>In-Memory Computing (IMC) represents a new computing paradigm where computation happens at the data location. Within the landscape of IMC approaches, non-von Neumann architectures seek to minimize the data movement associated with computing. 
Artificial intelligence applications are among the most promising use cases of IMC since they are both compute- and memory-intensive. Running such applications on edge devices offers significant energy savings and high-speed acceleration. This special session takes the attendees on a journey through IMC solutions for Edge AI. It covers four different viewpoints of IMC for Edge AI with four talks: (i) enabling flexible-electronics very-edge AI with IMC, (ii) a design automation methodology for computational SRAM for energy-efficient SIMD operations, (iii) circuit/architecture/application multiscale design and optimization methodologies for IMC architectures, and (iv) device, circuit, and architecture optimizations to enable PCM-based deep learning accelerators. The speakers come from three continents (Asia, Europe, America) and four countries (Singapore, France, USA, Switzerland). Two speakers are affiliated with academic institutes, one with industry, and one with a technological research institute. We strongly believe that the topic, and especially the selected talks, is of great current interest to the community and will attract attendees from different countries and affiliations, from both academia and industry. Furthermore, thanks to its cross-layer nature, this session is tailored to reach a wide range of experts, from the device and circuit community up to the system and application design community. We also believe that highlighting and discussing such design methodologies is a key point for high-quality and high-impact research. Following the success of IMC-oriented sessions and panels at DAC 2019 and ISLPED 2019, we expect this topic to trigger fruitful interactions and, we hope, collaboration within the community. We thereby expect more than 60 attendees for this session. Two scientific papers from this session will be integrated into the DATE proceedings upon acceptance.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>9.3.1</td> <td><b>FLEDGE: FLEXIBLE EDGE PLATFORMS ENABLED BY IN-MEMORY COMPUTING</b><br /><b>Speaker</b>:<br />Kamalika Datta, Nanyang Technological University, SG<br /><b>Authors</b>:<br />Kamalika Datta<sup>1</sup>, Umesh Chand<sup>2</sup>, Arko Dutt<sup>1</sup>, Devendra Singh<sup>2</sup>, Aaron Thean<sup>2</sup> and Mohamed M. 
Sabry<sup>1</sup><br /><sup>1</sup>Nanyang Technological University, SG; <sup>2</sup>National University of Singapore, SG</td> </tr> <tr> <td>08:50</td> <td>9.3.2</td> <td><b>COMPUTATIONAL SRAM DESIGN AUTOMATION USING PUSHED-RULE BITCELLS FOR ENERGY-EFFICIENT VECTOR PROCESSING</b><br /><b>Speaker</b>:<br />Maha Kooli, CEA-Leti, FR<br /><b>Authors</b>:<br />Jean-Philippe Noel<sup>1</sup>, Valentin Egloff<sup>1</sup>, Maha Kooli<sup>1</sup>, Roman Gauchi<sup>1</sup>, Jean-Michel Portal<sup>2</sup>, Henri-Pierre Charles<sup>1</sup>, Pascal Vivet<sup>1</sup> and Bastien Giraud<sup>1</sup><br /><sup>1</sup>CEA-Leti, FR; <sup>2</sup>Aix-Marseille University, FR</td> </tr> <tr> <td>09:10</td> <td>9.3.3</td> <td><b>DEMONSTRATING IN-CACHE COMPUTING THANKS TO CROSS-LAYER DESIGN OPTIMIZATIONS</b><br /><b>Authors</b>:<br />Marco Rios, William Simon, Alexandre Levisse, Marina Zapater and David Atienza, EPFL, CH</td> </tr> <tr> <td>09:35</td> <td>9.3.4</td> <td><b>DEVICE, CIRCUIT AND SOFTWARE INNOVATIONS TO MAKE DEEP LEARNING WITH ANALOG MEMORY A REALITY</b><br /><b>Authors</b>:<br />Pritish Narayanan, Stefano Ambrogio, Hsinyu Tsai, Katie Spoon and Geoffrey W. Burr, IBM Research, US</td> </tr> <tr> <td>10:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="9.4">9.4 Efficient DNN design with Approximate Computing</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 08:30 - 10:00<br /><b>Location / Room:</b> Stendhal</p> <p><b>Chair:</b><br />Daniel Menard, INSA Rennes, FR</p> <p><b>Co-Chair:</b><br />Seokhyeong Kang, Pohang University of Science and Technology, KR</p> <p>Deep Neural Networks (DNNs) are widely used in numerous domains. Cross-layer DNN approximation requires an efficient simulation framework. The GPU-accelerated simulation framework ProxSim supports DNN inference and retraining for approximate hardware. A significant amount of energy is consumed during the training process due to excessive memory accesses. Precision-controlled memory systems dedicated to GPUs allow flexible management of approximation. A new generation of networks, like Capsule Networks, provides better learning capabilities, but at the expense of high complexity. The ReD-CaNe methodology analyzes their resilience through error injection and guides their approximation.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>9.4.1</td> <td><b>PROXSIM: SIMULATION FRAMEWORK FOR CROSS-LAYER APPROXIMATE DNN OPTIMIZATION</b><br /><b>Speaker</b>:<br />Cecilia Eugenia De la Parra Aparicio, Robert Bosch GmbH, MX<br /><b>Authors</b>:<br />Cecilia De la Parra<sup>1</sup>, Andre Guntoro<sup>1</sup> and Akash Kumar<sup>2</sup><br /><sup>1</sup>Robert Bosch GmbH, DE; <sup>2</sup>Technische Universität Dresden, DE<br /><em><b>Abstract</b><br />Through cross-layer approximation of Deep Neural Networks (DNNs), significant improvements in hardware resource utilization for DNN applications can be achieved. This comes at the cost of accuracy degradation, which can be compensated for through different optimization methods. However, DNN optimization is highly time-consuming in existing simulation frameworks for cross-layer DNN approximation, as they are usually implemented for CPU usage only. Especially for large-scale image processing tasks, the need for a more efficient simulation framework is evident.
In this paper, we present ProxSim, a specialized, GPU-accelerated simulation framework for approximate hardware, based on TensorFlow, which supports approximate DNN inference and retraining. Additionally, we propose a novel hardware-aware regularization technique for approximate DNN optimization. Using ProxSim, we report up to 11x savings in execution time, compared to a multi-thread CPU-based framework, and an accuracy recovery of up to 30% for three case studies of image classification with MNIST, CIFAR-10 and ImageNet.</em></td> </tr> <tr> <td>09:00</td> <td>9.4.2</td> <td><b>PCM: PRECISION-CONTROLLED MEMORY SYSTEM FOR ENERGY EFFICIENT DEEP NEURAL NETWORK TRAINING</b><br /><b>Speaker</b>:<br />Boyeal Kim, Seoul National University, KR<br /><b>Authors</b>:<br />Boyeal Kim<sup>1</sup>, SangHyun Lee<sup>1</sup>, Hyun Kim<sup>2</sup>, Duy-Thanh Nguyen<sup>3</sup>, Minh-Son Le<sup>4</sup>, Ik Joon Chang<sup>5</sup>, Dohun Kwon<sup>6</sup>, Jin Hyeok Yoo<sup>7</sup>, Jun Won Choi<sup>6</sup> and Hyuk-Jae Lee<sup>1</sup><br /><sup>1</sup>Seoul National University, KR; <sup>2</sup>Seoul National University of Science and Technology, KR; <sup>3</sup>Kyung Hee University, KR; <sup>4</sup>KyungHee University, KR; <sup>5</sup>Kyunghee University, KR; <sup>6</sup>Hanyang University, KR; <sup>7</sup>Hanyang university, KR<br /><em><b>Abstract</b><br />Deep neural network (DNN) training suffers from significant energy consumption in the memory system, and most existing energy reduction techniques for the memory system have focused on introducing low precision that is compatible with the computing unit (e.g., FP16, FP8). This research has shown that even when training networks with FP16 data precision, it is possible to achieve training accuracy as good as FP32, the de facto standard for DNN training. However, our extensive experiments show that we can further reduce the data precision while maintaining the training accuracy of DNNs by truncating some least significant bits (LSBs) of FP16, termed hard approximation. Nevertheless, existing hardware structures for DNN training cannot efficiently support such low precision. In this work, we propose a novel memory system architecture for GPUs, named precision-controlled memory system (PCM), which allows for flexible management at the level of hard approximation. PCM provides high DRAM bandwidth by distributing each precision to different channels with a transposed data mapping on DRAM. In addition, PCM supports fine-grained hard approximation in the L1 data cache using software-controlled registers, which reduces data movement and thereby improves energy savings and system performance. Furthermore, PCM facilitates the reduction of data maintenance energy, which accounts for a considerable portion of memory energy consumption, by controlling the DRAM refresh period. 
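<p><em>The hard approximation described above amounts to masking mantissa LSBs of the FP16 bit pattern. A minimal sketch, assuming an illustrative choice of k (the paper's precision levels and hardware support are, of course, richer):</em></p>
<pre><code class="language-python">
# Hard approximation: truncate the k least-significant mantissa bits of
# FP16 values by masking their 16-bit pattern. k is an assumed choice
# (k must stay within FP16's 10 mantissa bits).
import numpy as np

def truncate_fp16(x, k=4):
    bits = x.astype(np.float16).view(np.uint16)
    mask = np.uint16(0x10000 - 2**k)      # zeros out the k LSBs
    return np.bitwise_and(bits, mask).view(np.float16)

w = np.random.randn(4).astype(np.float16)
print(w)
print(truncate_fp16(w, k=4))              # coarser, cheaper-to-store values
</code></pre>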
The experimental results show that, when training ResNet-20 on the CIFAR-100 dataset with precision tuning, PCM achieves 66% energy savings and a 20% performance improvement, without loss of accuracy.</em></td> </tr> <tr> <td>09:30</td> <td>9.4.3</td> <td><b>RED-CANE: A SYSTEMATIC METHODOLOGY FOR RESILIENCE ANALYSIS AND DESIGN OF CAPSULE NETWORKS UNDER APPROXIMATIONS</b><br /><b>Speaker</b>:<br />Alberto Marchisio, TU Wien (TU Wien), AT<br /><b>Authors</b>:<br />Alberto Marchisio<sup>1</sup>, Vojtech Mrazek<sup>2</sup>, Muhammad Abdullah Hanif<sup>3</sup> and Muhammad Shafique<sup>4</sup><br /><sup>1</sup>TU Wien (TU Wien), AT; <sup>2</sup>Brno University of Technology, CZ; <sup>3</sup>Institute of Computer Engineering, Vienna University of Technology, AT; <sup>4</sup>Vienna University of Technology (TU Wien), AT<br /><em><b>Abstract</b><br />Recent advances in Capsule Networks (CapsNets) have shown their superior learning capability, compared to traditional Convolutional Neural Networks (CNNs). However, the extremely high complexity of CapsNets limits their fast deployment in real-world applications. Moreover, while the resilience of CNNs has been extensively investigated to enable their energy-efficient implementations, the analysis of CapsNets' resilience is a largely unexplored area that can provide a strong foundation for investigating techniques to overcome the CapsNets' complexity challenge. Following the trend of Approximate Computing to enable energy-efficient designs, we perform an extensive resilience analysis of CapsNet inference subjected to approximation errors. Our methodology models the errors arising from the approximate components (like multipliers) and analyzes their impact on the classification accuracy of CapsNets. This enables the selection of approximate components based on the resilience of each operation of the CapsNet inference. We modify the TensorFlow framework to simulate the injection of approximation noise (based on the models of the approximate components) at different computational operations of the CapsNet inference. Our results show that CapsNets are more resilient to errors injected in the computations that occur during the dynamic routing (the softmax and the update of the coefficients) than in other stages like convolutions and activation functions. Our analysis is extremely useful towards designing efficient CapsNet hardware accelerators with approximate components. To the best of our knowledge, this is the first proof-of-concept for employing approximations on specialized CapsNet hardware.</em></td> </tr> <tr> <td style="width:40px;">10:00</td> <td><a href="/date20/conference/session/IP4">IP4-12</a>, 968</td> <td><b>TOWARDS BEST-EFFORT APPROXIMATION: APPLYING NAS TO APPROXIMATE COMPUTING</b><br /><b>Speaker</b>:<br />Weiwei Chen, University of Chinese Academy of Sciences, CN<br /><b>Authors</b>:<br />Weiwei Chen, Ying Wang, Shuang Yang, Cheng Liu and Lei Zhang, Chinese Academy of Sciences, CN<br /><em><b>Abstract</b><br />The design of neural network architectures for code approximation involves a large number of hyper-parameters to explore, and it is a non-trivial task to find a neural-based approximate computing solution that meets the demands of application-specified accuracy and Quality of Service (QoS).
Prior works do not address the problem of 'optimal' network architecture design in program approximation, which depends on the user-specified constraints, the complexity of the dataset and the hardware configuration. In this paper, we apply Neural Architecture Search (NAS) to search for and select neural approximate computing solutions, and provide an automatic framework that tries to generate the best-effort approximation result while satisfying the user-specified QoS/accuracy constraints. Compared with previous methods, this work achieves more than 1.43x speedup and 1.74x energy reduction on average when applied to the AxBench benchmarks.</em></td> </tr> <tr> <td style="width:40px;">10:01</td> <td><a href="/date20/conference/session/IP4">IP4-13</a>, 973</td> <td><b>ON THE AUTOMATIC EXPLORATION OF WEIGHT SHARING FOR DEEP NEURAL NETWORK COMPRESSION</b><br /><b>Speaker</b>:<br />Etienne Dupuis, École Centrale de Lyon, FR<br /><b>Authors</b>:<br />Etienne Dupuis<sup>1</sup>, David Novo<sup>2</sup>, Ian O'Connor<sup>1</sup> and Alberto Bosio<sup>1</sup><br /><sup>1</sup>Lyon Institute of Nanotechnology, FR; <sup>2</sup>Université de Montpellier, FR<br /><em><b>Abstract</b><br />Deep neural networks demonstrate impressive inference results, particularly in computer vision and speech recognition. However, the associated computational workload and storage render their use prohibitive in resource-limited embedded systems. The approximate computing paradigm has been widely explored in both industrial and academic circles. It improves performance and energy efficiency by relaxing the need for fully accurate operations. Consequently, there is a large number of implementation options with very different approximation strategies (such as pruning, quantization, low-rank factorization, knowledge distillation, ...). To the best of our knowledge, no automated approach exists for exploring, selecting and generating the best approximate versions of a given convolutional neural network (CNN) for given design objectives. The objective of this work in progress is to show that the design space exploration phase can enable significant network compression without noticeable accuracy loss. We demonstrate this via an example based on weight sharing and show that, using our method, we can obtain a 4x compression rate without re-training, and without accuracy loss, on an int-16 version of LeNet-5 (a 5-layer, 1,720-kbit CNN).</em></td> </tr> <tr> <td>10:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="9.5">9.5 Emerging memory devices</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 08:30 - 10:00<br /><b>Location / Room:</b> Bayard</p> <p><b>Chair:</b><br />Alexandre Levisse, EPFL, CH</p> <p><b>Co-Chair:</b><br />Marco Vacca, Politecnico di Torino, IT</p> <p>The development of future memories is driven by new devices, studied to overcome the limitations of traditional memories. Among these devices, STT magnetic RAMs play a fundamental role, due to their excellent performance coupled with long endurance and non-volatility. What are the issues that these memories face? How can we solve them and make them ready for successful commercial development? And what if, by changing perspective, emerging devices were used to improve existing memories such as SRAM?
These are some of the questions that this session aims to answer.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>9.5.1</td> <td><b>IMPACT OF MAGNETIC COUPLING AND DENSITY ON STT-MRAM PERFORMANCE</b><br /><b>Speaker</b>:<br />Lizhou Wu, Delft University of Technology, NL<br /><b>Authors</b>:<br />Lizhou Wu<sup>1</sup>, Siddharth Rao<sup>2</sup>, Mottaqiallah Taouil<sup>1</sup>, Erik Jan Marinissen<sup>2</sup>, Gouri Sankar Kar<sup>2</sup> and Said Hamdioui<sup>1</sup><br /><sup>1</sup>Delft University of Technology, NL; <sup>2</sup>IMEC, BE<br /><em><b>Abstract</b><br />As a unique mechanism for MRAMs, magnetic coupling needs to be accounted for when designing memory arrays. This paper models both intra- and inter-cell magnetic coupling analytically for STT-MRAMs and investigates their impact on the write performance and retention of MTJ devices, which are the data-storing elements of STT-MRAMs. We present magnetic measurement data of MTJ devices with diameters ranging from 35 nm to 175 nm, which we use to calibrate our intra-cell magnetic coupling model. Subsequently, we extrapolate this model to study inter-cell magnetic coupling in memory arrays. We propose the inter-cell magnetic coupling factor Psi to indicate coupling strength. Our simulation results show that Psi=2% maximizes the array density under the constraint that the magnetic coupling has negligible impact on the device's performance. Higher array densities show significant variations in average switching time, especially at low switching voltages, caused by inter-cell magnetic coupling, and dependent on the data pattern in the cell's neighborhood. We also observe a marginal degradation of the data retention time under the influence of inter-cell magnetic coupling.</em></td> </tr> <tr> <td>09:00</td> <td>9.5.2</td> <td><b>HIGH-DENSITY, LOW-POWER VOLTAGE-CONTROL SPIN ORBIT TORQUE MEMORY WITH SYNCHRONOUS TWO-STEP WRITE AND SYMMETRIC READ TECHNIQUES</b><br /><b>Speaker</b>:<br />Wang Kang, Beihang University, CN<br /><b>Authors</b>:<br />Haotian Wang<sup>1</sup>, Wang Kang<sup>1</sup>, Liuyang Zhang<sup>1</sup>, He Zhang<sup>1</sup>, Brajesh Kumar Kaushik<sup>2</sup> and Weisheng Zhao<sup>1</sup><br /><sup>1</sup>Beihang University, CN; <sup>2</sup>IIT Roorkee, IN<br /><em><b>Abstract</b><br />The voltage-control spin orbit torque (VC-SOT) magnetic tunnel junction (MTJ) has the potential to achieve high-speed and low-power spintronic memory, owing to the adaptive voltage-modulated energy barrier of the MTJ. However, the three-terminal device structure needs two access transistors (one for the write operation and the other for the read operation) and thus occupies a larger bit-cell area compared to two-terminal MTJs. A feasible method to reduce the area overhead is to stack multiple VC-SOT MTJs on a common antiferromagnetic strip to share the write access transistors. In this structure, high density can be achieved. However, write and read operations face problems, and the design space for a given strip length has not been established. In this paper, we propose a synchronous two-step multi-bit write and symmetric read method by exploiting the selective VC-SOT-driven MTJ switching mechanism. Hybrid circuits are then designed and evaluated based on a physics-based VC-SOT MTJ model and a 40nm CMOS design kit to show the feasibility and performance of our method.
Our work enables high-density, low-power, high-speed voltage-control SOT memory.</em></td> </tr> <tr> <td>09:30</td> <td>9.5.3</td> <td><b>DESIGN OF ALMOST-NONVOLATILE EMBEDDED DRAM USING NANOELECTROMECHANICAL RELAY DEVICES</b><br /><b>Speaker</b>:<br />Hongtao Zhong, Tsinghua University, CN<br /><b>Authors</b>:<br />Hongtao Zhong, Mingyang Gu, Juejian Wu, Huazhong Yang and Xueqing Li, Tsinghua University, CN<br /><em><b>Abstract</b><br />This paper proposes a low-power design of embedded dynamic random-access memory (eDRAM) using emerging nanoelectromechanical (NEM) relay devices. The motivation of this work is to reduce the standby refresh power consumption by improving the retention time of eDRAM cells. In this paper, it is revealed that the tunable beyond-CMOS characteristics of emerging NEM relay devices, especially the ultra-high OFF-state drain-source resistance, open up new opportunities with device-circuit co-design. In addition, the pull-in and pull-out threshold voltages are tuned to fit the operating mechanisms of eDRAM, so as to support low-voltage operations along with long retention time. Excitingly, when low-gate-leakage thick-gate transistors are used together, the proposed NEM-relay-based eDRAM exhibits such a significant retention-time improvement that it behaves almost as "nonvolatile". Even when using thin-gate transistors in a 130nm CMOS process, the evaluation of the proposed eDRAM shows up to 63x and 127x retention time improvement at 1.0V and 1.4V supply, respectively. Detailed performance benchmarking analysis, along with the practical CMOS-compatible NEM relay model and the eDRAM design and optimization considerations, is included in this paper.</em></td> </tr> <tr> <td style="width:40px;">10:00</td> <td><a href="/date20/conference/session/IP4">IP4-14</a>, 100</td> <td><b>ROBUST AND HIGH-PERFORMANCE 12-T INTERLOCKED SRAM FOR IN-MEMORY COMPUTING</b><br /><b>Speaker</b>:<br />Joycee Mekie, IIT Gandhinagar, IN<br /><b>Authors</b>:<br />Neelam Surana, Mili Lavania, Abhishek Barma and Joycee Mekie, IIT Gandhinagar, IN<br /><em><b>Abstract</b><br />In this paper, we analyze the existing SRAM-based In-Memory Computing (IMC) proposals and show through exhaustive simulations that they fail under process variations. 6-T SRAM-, 8-T SRAM-, and 10-T SRAM-based IMC architectures suffer from compute-disturb (stored data flips during IMC), compute-failure (provides false computation results), and half-select failures, respectively. To circumvent these issues, we propose a novel 12-T Dual Port Dual Interlocked-storage Cell (DPDICE) SRAM. The DPDICE-SRAM-based IMC architecture (DPDICE-IMC) can perform essential Boolean functions successfully in a single cycle and can perform basic arithmetic operations such as add and multiply. The most striking feature is that the DPDICE-IMC architecture can perform IMC on two datasets simultaneously, thus doubling the throughput.
Cumulatively, the proposed DPDICE-IMC is 26.7%, 8x, and 28% better than the 6-T SRAM-, 8-T SRAM-, and 10-T SRAM-based IMC architectures, respectively.</em></td> </tr> <tr> <td style="width:40px;">10:01</td> <td><a href="/date20/conference/session/IP4">IP4-15</a>, 600</td> <td><b>HIGH DENSITY STT-MRAM COMPILER DESIGN, VALIDATION AND CHARACTERIZATION METHODOLOGY IN 28NM FDSOI TECHNOLOGY</b><br /><b>Speaker</b>:<br />Piyush Jain, ARM Embedded Technologies Pvt Ltd., IN<br /><b>Authors</b>:<br />Piyush Jain<sup>1</sup>, Akshay Kumar<sup>1</sup>, Nicolaas Van Winkelhoff<sup>2</sup>, Didier Gayraud<sup>2</sup>, Surya Gupta<sup>3</sup>, Abdelali El Amraoui<sup>2</sup>, Giorgio Palma<sup>2</sup>, Alexandra Gourio<sup>2</sup>, Laurentz Vachez<sup>2</sup>, Luc Palau<sup>2</sup>, Jean-Christophe Buy<sup>2</sup> and Cyrille Dray<sup>2</sup><br /><sup>1</sup>ARM Embedded Technologies Pvt Ltd., IN; <sup>2</sup>ARM France, FR; <sup>3</sup>ARM Embedded technologies Pvt Ltd., IN<br /><em><b>Abstract</b><br />Spin Transfer Torque Magneto-resistive Random-Access Memory (STT-MRAM) is emerging as a promising substitute for flash memories due to scaling challenges for flash in process nodes beyond 28nm. STT-MRAM's high endurance, fast speed and low power make it suitable for a wide variety of applications. An embedded MRAM (eMRAM) compiler is highly desirable to enable SoC designers to use eMRAM instances in their designs in a flexible manner. However, the development of an eMRAM compiler has the added challenges of handling multi-fold higher density and maintaining analog circuit accuracy, on top of the challenges associated with conventional SRAM memory compilers. In this paper, we present a successful design methodology for a high-density 128Mb eMRAM compiler in a 28nm fully depleted SOI (FDSOI) process. This compiler enables optimized eMRAM instance generation with varying capacity ranges, word-widths, and optional features like repair and error correction. The eMRAM compiler design is achieved by evolving various architecture design, validation and characterization methods.
A hierarchical and modular characterization methodology is presented to enable high-accuracy characterization and industry-standard EDA view generation from the eMRAM compiler.</em></td> </tr> <tr> <td>10:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="9.6">9.6 Intelligent Dependable Systems</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 08:30 - 10:00<br /><b>Location / Room:</b> Lesdiguières</p> <p><b>Chair:</b><br />Saqib Khursheed, University of Liverpool, GB</p> <p><b>Co-Chair:</b><br />Rishad Shafik, Newcastle University, GB</p> <p>This session spans from dependability approaches for multicore systems realized as SoCs, for intelligent reliability management and on-line software-based self-test, to error-resilient AI systems where the AI system is re-designed to tolerate critical faults or is used for error-detection purposes.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>9.6.1</td> <td><b>THERMAL-CYCLING-AWARE DYNAMIC RELIABILITY MANAGEMENT IN MANY-CORE SYSTEM-ON-CHIP</b><br /><b>Speaker</b>:<br />Mohammad-Hashem Haghbayan, University of Turku, FI<br /><b>Authors</b>:<br />Mohammad-Hashem Haghbayan<sup>1</sup>, Antonio Miele<sup>2</sup>, Zhuo Zou<sup>3</sup>, Hannu Tenhunen<sup>1</sup> and Juha Plosila<sup>1</sup><br /><sup>1</sup>University of Turku, FI; <sup>2</sup>Politecnico di Milano, IT; <sup>3</sup>Nanjing University of Science and Technology, CN<br /><em><b>Abstract</b><br />Dynamic Reliability Management (DRM) is a common approach to mitigate aging and wear-out effects in multi-/many-core systems. State-of-the-art DRM approaches apply fine-grained control on resource management to increase/balance the chip reliability while considering other system constraints, e.g., performance and power budget. Such approaches, acting on various knobs such as workload mapping and scheduling, Dynamic Voltage/Frequency Scaling (DVFS) and Per-Core Power Gating (PCPG), have been demonstrated to work properly with various aging mechanisms, such as electromigration and Negative-Bias Temperature Instability (NBTI). However, we claim that they do not suffice for thermal cycling. Thus, we here propose a novel thermal-cycling-aware DRM approach for shared-memory many-core systems running multi-threaded applications. The approach applies fine-grained control capable of reducing both temperature levels and variations. The experimental evaluations demonstrate that the proposed approach is able to achieve a 39% longer lifetime than past approaches.</em></td> </tr> <tr> <td>09:00</td> <td>9.6.2</td> <td><b>DETERMINISTIC CACHE-BASED EXECUTION OF ON-LINE SELF-TEST ROUTINES IN MULTI-CORE AUTOMOTIVE SYSTEM-ON-CHIPS</b><br /><b>Speaker</b>:<br />Andrea Floridia, Politecnico di Torino, IT<br /><b>Authors</b>:<br />Andrea Floridia<sup>1</sup>, Tzamn Melendez Carmona<sup>1</sup>, Davide Piumatti<sup>1</sup>, Annachiara Ruospo<sup>1</sup>, Ernesto Sanchez<sup>1</sup>, Sergio De Luca<sup>2</sup>, Rosario Martorana<sup>2</sup> and Mose Alessandro Pernice<sup>2</sup><br /><sup>1</sup>Politecnico di Torino, IT; <sup>2</sup>STMicroelectronics, IT<br /><em><b>Abstract</b><br />Traditionally, the usage of caches and deterministic execution of on-line self-test procedures have been considered two mutually exclusive concepts.
At the same time, software executed in a multi-core context suffers from limited timing predictability due to the higher system-bus contention. When dealing with self-test procedures, this higher contention might lead to a fluctuating fault coverage or even the failure of some test programs. This paper presents a cache-based strategy for achieving both deterministic behaviour and stable fault coverage from the execution of self-test procedures in multi-core systems. The proposed strategy is applied to two representative modules negatively affected by a multi-core execution: the synchronous imprecise interrupts logic and the pipeline hazard detection unit. The experiments illustrate that it is possible to achieve a stable execution while also improving on the state-of-the-art approaches for the on-line testing of embedded microprocessors. The effectiveness of the methodology was assessed on all three cores of a multi-core industrial System-on-Chip intended for automotive ASIL D applications.</em></td> </tr> <tr> <td>09:30</td> <td>9.6.3</td> <td><b>FT-CLIPACT: RESILIENCE ANALYSIS OF DEEP NEURAL NETWORKS AND IMPROVING THEIR FAULT TOLERANCE USING CLIPPED ACTIVATION</b><br /><b>Authors</b>:<br />Le-Ha Hoang<sup>1</sup>, Muhammad Abdullah Hanif<sup>2</sup> and Muhammad Shafique<sup>3</sup><br /><sup>1</sup>TU Wien (TU Wien), AT; <sup>2</sup>Institute of Computer Engineering, Vienna University of Technology, AT; <sup>3</sup>Vienna University of Technology (TU Wien), AT<br /><em><b>Abstract</b><br />Deep Neural Networks (DNNs) are widely being adopted for safety-critical applications, e.g., healthcare and autonomous driving. Inherently, they are considered to be highly error-tolerant. However, recent studies have shown that hardware faults that impact the parameters of a DNN (e.g., weights) can have drastic impacts on its classification accuracy. In this paper, we perform a comprehensive error resilience analysis of DNNs subjected to hardware faults (e.g., permanent faults) in the weight memory. The outcome of this analysis is leveraged to propose a novel error mitigation technique which squashes the high-intensity faulty activation values to alleviate their impact. We achieve this by replacing the unbounded activation functions with their clipped versions. We also present a method to systematically define the clipping values of the activation functions that result in increased resilience of the networks against faults. We evaluate our technique on the AlexNet and VGG-16 DNNs trained on the CIFAR-10 dataset. The experimental results show that our mitigation technique significantly improves the network's resilience to faults. For example, the proposed technique offers on average a 68.92% improvement in the classification accuracy of the resilience-optimized VGG-16 model at a 1×10<sup>−5</sup> fault rate, when compared to the base network without any fault mitigation.</em></td> </tr> <tr> <td style="width:40px;">10:00</td> <td><a href="/date20/conference/session/IP4">IP4-16</a>, 221</td> <td><b>AN APPROXIMATION-BASED FAULT DETECTION SCHEME FOR IMAGE PROCESSING APPLICATIONS</b><br /><b>Speaker</b>:<br />Antonio Miele, Politecnico di Milano, IT<br /><b>Authors</b>:<br />Matteo Biasielli, Luca Cassano and Antonio Miele, Politecnico di Milano, IT<br /><em><b>Abstract</b><br />Image processing applications exhibit an intrinsic resilience to faults.
In this application field the classical Duplication with Comparison (DWC) scheme, where output images are discarded as soon as the two replicas' outputs differ in at least one pixel, may be over-conservative. This paper introduces a novel lightweight fault detection scheme for image processing applications; i) it extends the DWC scheme by substituting one of the two exact replicas with a faster approximated one; and ii) it features a Neural Network-based checker designed to distinguish between usable and unusable images instead of faulty/unfaulty ones. The application of the hardening scheme on a case study has shown an execution time reduction from 27% to 34% w.r.t. the DWC, while guaranteeing a comparable fault detection capability.</em></td> </tr> <tr> <td>10:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="9.7">9.7 Diverse Applications of Emerging Technologies</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 08:30 - 10:00<br /><b>Location / Room:</b> Berlioz</p> <p><b>Chair:</b><br />Vasilis Pavlidis, University of Manchester, GB</p> <p><b>Co-Chair:</b><br />Bing Li, TUM, DE</p> <p>This session examines a diverse set of applications for emerging technologies. Papers consider the use of Q-learning to perform more efficient backups in non-volatile processors, the use of emerging technologies to mitigate hardware side-channels, time-sequence-based classification of ultrasonic patterns that arise from hand movements for gesture recognition, and processing-in-memory-based solutions to accelerate DNA alignment searches.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>9.7.1</td> <td><b>Q-LEARNING BASED BACKUP FOR ENERGY HARVESTING POWERED EMBEDDED SYSTEMS</b><br /><b>Speaker</b>:<br />Wei Fan, Shandong University, CN<br /><b>Authors</b>:<br />Wei Fan, Yujie Zhang, Weining Song, Mengying Zhao, Zhaoyan Shen and Zhiping Jia, Shandong University, CN<br /><em><b>Abstract</b><br />Non-volatile processors (NVPs) are used in energy harvesting powered embedded systems to preserve data across interruptions. In NVP systems, volatile data are backed up to non-volatile memory upon power failures and resumed after power comes back. Traditionally, backup is triggered immediately when an energy warning occurs. However, it is also possible to more aggressively utilize the residual energy for program execution to improve forward progress. In this work, we propose a Q-learning based backup strategy to achieve maximal forward progress in energy harvesting powered intermittent embedded systems. The experimental results show an average of 307.4% and 43.4% improved forward progress compared with traditional instant backup and the most related work, respectively.</em></td> </tr> <tr> <td>09:00</td> <td>9.7.2</td> <td><b>A NOVEL TIGFET-BASED DFF DESIGN FOR IMPROVED RESILIENCE TO POWER SIDE-CHANNEL ATTACKS</b><br /><b>Speaker</b>:<br />Michael Niemier, University of Notre Dame, US<br /><b>Authors</b>:<br />Mohammad Mehdi Sharifi<sup>1</sup>, Ramin Rajaei<sup>1</sup>, Patsy Cadareanu<sup>2</sup>, Pierre-Emmanuel Gaillardon<sup>2</sup>, Yier Jin<sup>3</sup>, Michael Niemier<sup>1</sup> and X.
Sharon Hu<sup>1</sup><br /><sup>1</sup>University of Notre Dame, US; <sup>2</sup>University of Utah, US; <sup>3</sup>University of Florida, US<br /><em><b>Abstract</b><br />Side-channel attacks (SCAs) represent a significant security threat, and aim to reveal otherwise secret data by analyzing a relevant circuit's behavior, e.g., its power consumption. While all circuit components are potential power side channels, D-flip-flops (DFFs) are often the primary source of information leakage to an SCA. This paper proposes a DFF design based on the three-independent-gate field-effect transistor (TIGFET) that reduces side-channel vulnerabilities of sequential circuits. Notably, we find that the I-V characteristics of the TIGFET itself lead to inherent side-channel resilience, which in turn enables simpler and more efficient cryptographic hardware. Our proposed design is based on a prior TIGFET-based true single-phase clock (TSPC) DFF design, which offers high performance and reduced area. More specifically, our modified TSPC (mTSPC) design exploits the symmetric I-V characteristics of TIGFETs, which results in pull-up and pull-down currents that are nearly identical. When combined with additional circuit modifications (made possible by the unique characteristics of the TIGFET), the mTSPC circuit draws almost the same amount of supply current under all possible input transitions (less than 1% variation for different transitions), which can in turn mask information leakage. Using a 10nm TIGFET technology model, simulation results show that the proposed TIGFET-based DFF circuit leads to decreased power consumption (up to 96.9% when compared to prior secured designs), has a low delay (15.2 ps), and employs only 12 TIGFET devices. Furthermore, an 8-bit S-box whose output is sampled by a group of eight mTSPC DFFs was simulated. A correlation power analysis attack on the simulated S-box with 256 power traces shows that the key is not revealed, which confirms the SCA resiliency of the proposed DFF design.</em></td> </tr> <tr> <td>09:30</td> <td>9.7.3</td> <td><b>LOW COMPLEXITY MULTI-DIRECTIONAL IN-AIR ULTRASONIC GESTURE RECOGNITION USING A TCN</b><br /><b>Speaker</b>:<br />Emad A. Ibrahim, Eindhoven University of Technology, NL<br /><b>Authors</b>:<br />Emad A. Ibrahim<sup>1</sup>, Marc Geilen<sup>1</sup>, Jos Huisken<sup>1</sup>, Min Li<sup>2</sup> and Jose Pineda de Gyvez<sup>2</sup><br /><sup>1</sup>Eindhoven University of Technology, NL; <sup>2</sup>NXP Semiconductors, NL<br /><em><b>Abstract</b><br />Following the trend of ultrasound-based gesture recognition, this study introduces the concept of time-sequence classification of ultrasonic patterns induced by hand movements on a microphone array. We refer to time-sequence ultrasound echoes as continuous frequency patterns received in real-time at different steering angles. The ultrasound source is a single tone continuously emitted from the center of the microphone array. Meanwhile, the array beamforms and locates ultrasonic activity (induced echoes), after which a processing pipeline is initiated to extract band-limited frequency features. These beamformed features are organized in a 2D matrix of size 11x30, updated every 10 ms, on which a Temporal Convolutional Network (TCN) outputs a continuous classification. Prior to that, the same TCN is trained to classify Doppler shift variability rate. Using this approach, we show that a user can easily perform 49 gestures at different steering angles by means of sequence detection.
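<p>A rough illustration of the model class described above (a small temporal convolutional network over the 11x30 beamformed feature matrix); the filter counts, dilations and three-class output are assumptions for illustration, not the authors' architecture:</p> <pre><code>import tensorflow as tf

# 30 time frames x 11 beamformed frequency features, per the abstract.
inputs = tf.keras.Input(shape=(30, 11))
x = inputs
for dilation in (1, 2, 4):  # stacked causal dilated convolutions: the core of a TCN
    x = tf.keras.layers.Conv1D(16, kernel_size=3, dilation_rate=dilation,
                               padding="causal", activation="relu")(x)
x = tf.keras.layers.GlobalAveragePooling1D()(x)
# e.g. {very slow, very fast, no gesture}; the class set is an assumption
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
</code></pre>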
To keep it simple for users, we define two Doppler shift variability rates, very slow and very fast, which the TCN detects 95-99% of the time. Not only can a gesture be performed in different directions, but the length of each performed gesture can also be measured. This leverages the diversity of in-air ultrasonic gestures, allowing more control capabilities. The process is designed under low-resource settings: since this real-time process is always on, power and memory resources must be optimized. The proposed solution needs 6.2-10.2 MMACs and a memory footprint of 6 KB, allowing such a gesture-recognition system to be hosted by energy-constrained edge devices such as smart speakers.</em></td> </tr> <tr> <td>09:45</td> <td>9.7.4</td> <td><b>PIM-ALIGNER: A PROCESSING-IN-MRAM PLATFORM FOR BIOLOGICAL SEQUENCE ALIGNMENT</b><br /><b>Speaker</b>:<br />Deliang Fan, Arizona State University, US<br /><b>Authors</b>:<br />Shaahin Angizi<sup>1</sup>, Jiao Sun<sup>1</sup>, Wei Zhang<sup>1</sup> and Deliang Fan<sup>2</sup><br /><sup>1</sup>University of Central Florida, US; <sup>2</sup>Arizona State University, US<br /><em><b>Abstract</b><br />In this paper, we propose a high-throughput and energy-efficient Processing-in-Memory accelerator (PIM-Aligner) to execute DNA short read alignment based on an optimized and hardware-friendly alignment algorithm. We first reconstruct the existing sequence alignment algorithm based on BWT and FM-index such that it can be fully implemented in PIM platforms. It supports exact alignment and also handles mismatches to reduce excessive backtracking. We then develop the PIM-Aligner platform, which transforms an SOT-MRAM array into a computational memory to accelerate the reconstructed alignment-in-memory algorithm, incurring a low cost on top of original SOT-MRAM chips (less than 10% of chip area). Accordingly, we present a local data partitioning, mapping, and pipeline technique to maximize the parallelism in multiple computational sub-arrays while doing the alignment task. The simulation results show that PIM-Aligner outperforms recent platforms based on dynamic programming with ~3.1x higher throughput per Watt. In addition, PIM-Aligner improves the short read alignment throughput per Watt per mm^2 by ~9x and 1.9x compared to FM-index-based ASIC and processing-in-ReRAM designs, respectively.</em></td> </tr> <tr> <td style="width:40px;">10:00</td> <td><a href="/date20/conference/session/IP4">IP4-17</a>, 852</td> <td><b>TRANSPORT-FREE MODULE BINDING FOR SAMPLE PREPARATION USING MICROFLUIDIC FULLY PROGRAMMABLE VALVE ARRAYS</b><br /><b>Speaker</b>:<br />Gautam Choudhary, Adobe Research, India, IN<br /><b>Authors</b>:<br />Gautam Choudhary<sup>1</sup>, Sandeep Pal<sup>1</sup>, Debraj Kundu<sup>2</sup>, Sukanta Bhattacharjee<sup>3</sup>, Shigeru Yamashita<sup>4</sup>, Bing Li<sup>5</sup>, Ulf Schlichtmann<sup>5</sup> and Sudip Roy<sup>1</sup><br /><sup>1</sup>IIT Roorkee, IN; <sup>2</sup>PhD, IN; <sup>3</sup>Indian Statistical Institute, IN; <sup>4</sup>Ritsumeikan University, JP; <sup>5</sup>TUM, DE<br /><em><b>Abstract</b><br />Microfluidic fully programmable valve array (FPVA) biochips have emerged as general-purpose flow-based microfluidic lab-on-chips (LoCs). An FPVA supports highly re-configurable on-chip components (modules) in a two-dimensional grid-like structure controlled by software programs, unlike application-specific flow-based LoCs.
Fluids can be loaded into or washed from a cell with the help of flows from the inlet to the outlet of an FPVA, whereas cell-to-cell transportation of discrete fluid segment(s) is not precisely possible. The simplest mixing module to realize on an FPVA-based LoC is a four-way mixer consisting of a 2x2 array of cells working as a ring-like mixer having four valves. In this paper, we propose a design automation method for sample preparation that finds suitable placements of mixing operations of a mixing tree using four-way mixers without requiring any transportation of fluid(s) between modules. We also propose a heuristic that modifies the mixing tree to reduce the sample preparation time. We have performed an extensive simulation and examined several parameters to determine the performance of the proposed solution.</em></td> </tr> <tr> <td>10:00</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="9.8">9.8 Special Session: Panel: Variation-aware Analyses of Mega-MOSFET Memories: Challenges and Solutions</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 08:30 - 10:00<br /><b>Location / Room:</b> Exhibition Theatre</p> <p><b>Moderators:</b><br />Firas MOHAMED, Silvaco, FR<br />Jean-Baptiste DULUC, Silvaco, FR</p> <p>Designing large memories under manufacturing variability requires statistical approaches that rely on SPICE simulations at different Process, Voltage, Temperature operating points to verify that yield requirements will be met. Variation-aware simulations of full memories that consist of millions of transistors are a challenging task for both SPICE simulators and statistical methodologies to achieve accurate results. The ideal solution for variation-aware verification of full memories would be to run Monte Carlo simulations through SPICE simulators to assess that all the addressable elements enable successful write and read operations. However, this classical approach suffers from practical issues that prevent it from being used. Indeed, for large memory arrays (e.g. MB and more) the number of SPICE simulations to perform would be intractable for achieving a decent statistical precision. Moreover, the SPICE simulation of a single sample of the full-memory netlist that involves millions or billions of MOSFETs and parasitic elements might be very long or impossible because of the netlist size. Unfortunately, Fast-SPICE simulations are not a palatable solution for final verification because the loss of accuracy compared to pure SPICE simulations is difficult to evaluate for such netlists. So far, most of the variation-aware methodologies to analyze and validate Mega-MOSFET memories rely on the assumption that the sub-blocks of the system (e.g. control unit, IOs, row decoders, column circuitries, memory cells) may be assessed independently. In doing so, memory designers apply dedicated statistical approaches to each individual sub-block to reduce the overall simulation time to achieve variation-aware closure. When considering that each element of the memory is independent of its neighborhood, the simulation of the memory is drastically reduced to a few MOSFETs on the critical paths (the longest paths for a read or write memory operation), with the other sub-blocks idealized and estimations derived under a Gaussian assumption. Using such an approach, memory designers avoid the usual statistical simulations of the full memory, which are, most of the time, impractical in terms of duration and load.
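<p>The scale of the problem can be made concrete with a back-of-the-envelope calculation: for an array of N identical cells, the array yield is roughly the per-cell yield raised to the power N, which pushes the required per-cell failure probability into the high-sigma regime where plain Monte Carlo is hopeless. A small Python sketch with illustrative numbers (an 8 Mb array and a 99% target yield are assumptions, not figures from the panel):</p> <pre><code>from scipy.stats import norm

n_cells = 8 * 1024 * 1024        # an 8 Mb array (illustrative)
target_array_yield = 0.99        # 99% of dies fully functional

# Per-cell failure probability so that (1 - p)^n_cells >= target yield.
p_cell = 1.0 - target_array_yield ** (1.0 / n_cells)
sigma = norm.isf(p_cell)         # equivalent one-sided sigma level

print(f"p_cell &lt;= {p_cell:.1e} (~{sigma:.1f} sigma)")
# Plain Monte Carlo needs on the order of 100/p_cell samples to resolve
# such a probability -- around 1e11 SPICE runs here, hence the appeal of
# the high-sigma methods and sub-block decomposition discussed above.
print(f"naive MC sample count ~ {100 / p_cell:.1e}")
</code></pre>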
Although the aforementioned approach has been widely used by memory designers, these methods reach their limits when designing memories for low-power and advanced-node technologies, where non-idealities arise. The consequence of less reliable results is that memory designers compensate by increasing safety margins at the expense of performance to achieve satisfactory yield. In this context, sub-blocks can no longer be considered individually and Gaussianity no longer prevails; other practical simulation flows are required to verify full memories with satisfactory performance. New statistical approaches and simulation flows must handle memory slices or critical paths with all relevant sub-blocks in order to account for element interactions and be more realistic. Additionally, these approaches must handle the hierarchy of the memory to respect the variation ranges of each sub-block, from low sigma for control units and IOs to high sigma for highly replicated blocks. Using a virtual reconstruction of the full memory, the yield can be assessed without relying on the assumptions of individual sub-block analyses. With accurate estimation over the full memory, no extra safety margins are required, and better performance will be reached.</p> <p><b>Panelists:</b></p> <ul> <li>Yves Laplanche, ARM, FR</li> <li>Lorenzo Ciampolini, CEA, FR</li> <li>Pierre Faubet, SILVACO FRANCE, FR</li> </ul> <table> <tbody> <tr> <td style="width: 20px">10:00</td> <td>End of session</td> </tr> <tr> <td style="width: 20px"></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="IP4">IP4 Interactive Presentations</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 10:00 - 11:00<br /><b>Location / Room:</b> Poster Area</p> <p>Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.</p> <table> <tr> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> <tr> <td style="width:40px;">IP4-1</td> <td><b>HIT: A HIDDEN INSTRUCTION TROJAN MODEL FOR PROCESSORS</b><br /><b>Speaker</b>:<br />Jiaqi Zhang, Tongji University, CN<br /><b>Authors</b>:<br />Jiaqi Zhang<sup>1</sup>, Ying Zhang<sup>1</sup>, Huawei Li<sup>2</sup> and Jianhui Jiang<sup>3</sup><br /><sup>1</sup>Tongji University, CN; <sup>2</sup>Chinese Academy of Sciences, CN; <sup>3</sup>School of Software Engineering, Tongji University, CN<br /><em><b>Abstract</b><br />This paper explores an intrusion mechanism for microprocessors using illegal instructions, namely the hidden instruction Trojan (HIT). It uses a low-probability sequence consisting of normal instructions as a boot sequence, followed by an illegal instruction to trigger the Trojan. The payload is a hidden interrupt that forces the program counter to a specific address. Hence the program at that address has super privileges. Meanwhile, we use integer programming to minimize the trigger probability of HIT within a given area overhead.
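<p>The low-probability trigger idea reduces to simple arithmetic: if the boot sequence consists of k instructions and instruction i occurs with frequency p_i in normal code, the chance of an accidental trigger in a given window is roughly the product of the p_i. A toy sketch with invented frequencies, not values from the paper:</p> <pre><code># Invented per-instruction occurrence frequencies for a 4-instruction
# boot sequence; real values would come from profiling normal workloads.
boot_seq_freqs = [0.02, 0.005, 0.01, 0.001]

p_trigger = 1.0
for p in boot_seq_freqs:
    p_trigger *= p

print(f"accidental trigger probability per window ~ {p_trigger:.1e}")
# ~1.0e-09 here; the integer program in the abstract would choose the
# sequence minimising this product subject to an area budget.
</code></pre>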
The experimental results demonstrate that HIT has an extremely low trigger probability and can survive detection by existing test methods.</em></td> </tr> <tr> <td style="width:40px;">IP4-2</td> <td><b>BITSTREAM MODIFICATION ATTACK ON SNOW 3G</b><br /><b>Speaker</b>:<br />Michail Moraitis, Royal Institute of Technology KTH, SE<br /><b>Authors</b>:<br />Michail Moraitis and Elena Dubrova, Royal Institute of Technology - KTH, SE<br /><em><b>Abstract</b><br />SNOW 3G is one of the core algorithms for confidentiality and integrity in several 3GPP wireless communication standards, including the new Next Generation (NG) 5G. It is believed to be resistant to classical cryptanalysis. In this paper, we show that SNOW 3G can be broken by a fault attack based on bitstream modification. By changing the content of some look-up tables in the bitstream, we reduce the non-linear state updating function of SNOW 3G to a linear one. As a result, it becomes possible to recover the key from a known plaintext-ciphertext pair. To the best of our knowledge, this is the first successful bitstream modification attack on SNOW 3G.</em></td> </tr> <tr> <td style="width:40px;">IP4-3</td> <td><b>A MACHINE LEARNING BASED WRITE POLICY FOR SSD CACHE IN CLOUD BLOCK STORAGE</b><br /><b>Speaker</b>:<br />YU ZHANG, Huazhong University of Science &amp; Technology, CN<br /><b>Authors</b>:<br />Yu Zhang<sup>1</sup>, Ke Zhou<sup>1</sup>, Ping Huang<sup>2</sup>, Hua Wang<sup>1</sup>, Jianying Hu<sup>3</sup>, Yangtao Wang<sup>1</sup>, Yongguang Ji<sup>3</sup> and Bin Cheng<sup>3</sup><br /><sup>1</sup>Huazhong University of Science &amp; Technology, CN; <sup>2</sup>Temple University, US; <sup>3</sup>Tencent Technology (Shenzhen) Co., Ltd., CN<br /><em><b>Abstract</b><br />Nowadays, SSD caches play an important role in cloud storage systems. The associated write policy, which enforces an admission control policy regarding filling data into the cache, has a significant impact on the performance of the cache system and the amount of write traffic to SSD caches. Based on our analysis of a typical cloud block storage system, approximately 47.09% of writes are write-only, i.e., writes to blocks which have not been read during a certain time window. Naively writing the write-only data to the SSD cache unnecessarily introduces a large number of harmful writes to the SSD cache without any contribution to cache performance. On the other hand, it is a challenging task to identify and filter out those write-only data in a real-time manner, especially in a cloud environment running changing and diverse workloads. In this paper, to alleviate the above cache problem, we propose ML-WP, a Machine Learning Based Write Policy, which reduces write traffic to SSDs by avoiding writing write-only data. The main challenge in this approach is to identify write-only data in a real-time manner. To realize ML-WP and achieve accurate write-only data identification, we use machine learning methods to classify data into two groups (i.e., write-only and normal data). Based on this classification, the write-only data is directly written to backend storage without being cached.
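<p>The admission-control idea can be sketched in a few lines: classify each write as write-only or normal and let write-only data bypass the SSD cache. The classifier stub and dictionary-backed stores below are illustrative stand-ins, not the paper's system:</p> <pre><code>class StubClassifier:
    # Placeholder for a trained model; a real ML-WP-style policy would use
    # recency/frequency features of past accesses to each block.
    def predict_write_only(self, features):
        return features.get("reads_in_window", 0) == 0

def handle_write(block_id, data, cache, backend, clf, features):
    if clf.predict_write_only(features):
        backend[block_id] = data      # bypass the SSD cache entirely
    else:
        cache[block_id] = data        # admit into the SSD cache
        backend[block_id] = data      # and persist to backend storage

cache, backend = {}, {}
handle_write(42, b"payload", cache, backend, StubClassifier(),
             {"reads_in_window": 0})
print(len(cache), len(backend))       # 0 1 -> write-only data bypassed
</code></pre>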
Experimental results show that, compared with the widely deployed industry write-back policy, ML-WP decreases write traffic to the SSD cache by 41.52%, while improving the hit ratio by 2.61% and reducing the average read latency by 37.52%.</em></td> </tr> <tr> <td style="width:40px;">IP4-4</td> <td><b>YOU ONLY SEARCH ONCE: A FAST AUTOMATION FRAMEWORK FOR SINGLE-STAGE DNN/ACCELERATOR CO-DESIGN</b><br /><b>Speaker</b>:<br />Weiwei Chen, University of Chinese Academy of Sciences, CN<br /><b>Authors</b>:<br />Weiwei Chen, Ying Wang, Shuang Yang, Cheng Liu and Lei Zhang, Chinese Academy of Sciences, CN<br /><em><b>Abstract</b><br />DNN/Accelerator co-design has shown great potential in improving QoR and performance. Typical approaches separate the design flow into two stages: (1) designing an application-specific DNN model with high accuracy; (2) building an accelerator considering the DNN-specific characteristics. However, this may fail to deliver the highest composite score, which combines the goals of accuracy and other hardware-related constraints (e.g., latency, energy efficiency), when building a specific neural-network-based system. In this work, we present a single-stage automated framework, YOSO, aiming to generate the optimal software-and-hardware solution that flexibly balances between the goals of accuracy, power, and QoS. Compared with the two-stage method on the baseline systolic array accelerator and the CIFAR-10 dataset, we achieve 1.42x~2.29x energy or 1.79x~3.07x latency reduction at the same level of precision, for different user-specified energy and latency optimization constraints, respectively.</em></td> </tr> <tr> <td style="width:40px;">IP4-5</td> <td><b>WHEN SORTING NETWORK MEETS PARALLEL BITSTREAMS: A FAULT-TOLERANT PARALLEL TERNARY NEURAL NETWORK ACCELERATOR BASED ON STOCHASTIC COMPUTING</b><br /><b>Speaker</b>:<br />Yawen Zhang, Peking University, CN<br /><b>Authors</b>:<br />Yawen Zhang<sup>1</sup>, Sheng Lin<sup>2</sup>, Runsheng Wang<sup>1</sup>, Yanzhi Wang<sup>2</sup>, Yuan Wang<sup>1</sup>, Weikang Qian<sup>3</sup> and Ru Huang<sup>1</sup><br /><sup>1</sup>Peking University, CN; <sup>2</sup>Northeastern University, US; <sup>3</sup>Shanghai Jiao Tong University, CN<br /><em><b>Abstract</b><br />Stochastic computing (SC) has been widely used in neural networks (NNs) due to its low hardware cost and high fault tolerance. Conventionally, SC-based NN accelerators adopt a hybrid stochastic-binary format, using an accumulative parallel counter to convert bitstreams into a binary number. This method, however, sacrifices fault tolerance and incurs a high hardware cost. In order to fully exploit the superior fault tolerance of SC, taking a ternary neural network (TNN) as an example, we propose a parallel SC-based NN accelerator purely using bitstream computation. We apply a bitonic sorting network to simultaneously implement the accumulation and activation function with parallel bitstreams.
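<p>For readers unfamiliar with stochastic computing, the basic encoding is easy to demonstrate: a value in [0, 1] becomes the probability of a 1 in a bitstream, and multiplication reduces to a bitwise AND. This generic unipolar illustration is a sketch of the paradigm, not of the paper's ternary/bitonic design:</p> <pre><code>import numpy as np

rng = np.random.default_rng(0)
N = 65536                       # bitstream length; error shrinks as 1/sqrt(N)

def to_stream(p):               # unipolar SC encoding of p in [0, 1]
    return rng.random(N) &lt; p

a, b = to_stream(0.5), to_stream(0.25)
product = a &amp; b                 # multiplication costs one AND gate per bit
print(product.mean())           # ~0.125 = 0.5 * 0.25
</code></pre>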
The proposed design not only has high fault tolerance, but also achieves at least a 2.8x energy-efficiency improvement over the binary computing counterpart.</em></td> </tr> <tr> <td style="width:40px;">IP4-6</td> <td><b>WAVEPRO: CLOCK-LESS WAVE-PROPAGATED PIPELINE COMPILER FOR LOW-POWER AND HIGH-THROUGHPUT COMPUTATION</b><br /><b>Speaker</b>:<br />Yehuda Kra, Bar-Ilan University, IL<br /><b>Authors</b>:<br />Yehuda Kra, Adam Teman and Tzachi Noy, Bar-Ilan University, IL<br /><em><b>Abstract</b><br />Clock-less Wave-Propagated Pipelining is a long-known approach to achieve high throughput without the overhead of costly sampling registers. However, due to many design challenges, which have only increased with technology scaling, this approach has never been widely accepted and has generally been limited to small and very specific demonstrations. This paper addresses this barrier by presenting WavePro, a generic and scalable algorithm capable of skew balancing any combinatorial logic netlist for the application of wave pipelining. The algorithm was implemented in the WavePro Compiler automation utility, which interfaces with industry delay extraction and standard timing analysis tools to produce a sign-off quality result. The utility is demonstrated on a dot-product accelerator in a 65 nm CMOS technology, using a vendor-provided standard cell library and commercial timing analysis tools. By reducing the worst-case output skew by over 70%, the test case example was able to achieve the equivalent throughput of an 8-stage sequentially pipelined implementation with power savings of almost 3x.</em></td> </tr> <tr> <td style="width:40px;">IP4-7</td> <td><b>DEEPNVM: A FRAMEWORK FOR MODELING AND ANALYSIS OF NON-VOLATILE MEMORY TECHNOLOGIES FOR DEEP LEARNING APPLICATIONS</b><br /><b>Speaker</b>:<br />Ahmet Inci, Carnegie Mellon University, US<br /><b>Authors</b>:<br />Ahmet Inci, Mehmet M Isgenc and Diana Marculescu, Carnegie Mellon University, US<br /><em><b>Abstract</b><br />Non-volatile memory (NVM) technologies such as spin-transfer torque magnetic random access memory (STT-MRAM) and spin-orbit torque magnetic random access memory (SOT-MRAM) have significant advantages compared to conventional SRAM due to their non-volatility, higher cell density, and scalability features. While previous work has investigated several architectural implications of NVM for generic applications, in this work we present DeepNVM, a framework to characterize, model, and analyze NVM-based caches in GPU architectures for deep learning (DL) applications by combining technology-specific circuit-level models and the actual memory behavior of various DL workloads. We present both iso-capacity and iso-area performance and energy analysis for systems whose last-level caches rely on conventional SRAM and emerging STT-MRAM and SOT-MRAM technologies. In the iso-capacity case, STT-MRAM and SOT-MRAM provide up to 4.2x and 5x energy-delay product (EDP) reduction and 2.4x and 3x area reduction compared to conventional SRAM, respectively. Under iso-area assumptions, STT-MRAM and SOT-MRAM provide 2.3x EDP reduction on average across all workloads when compared to SRAM.
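<p>For reference, the energy-delay product used in these comparisons is simply energy multiplied by delay; a toy computation with invented numbers, only to fix the definition:</p> <pre><code># Invented (energy pJ, delay ns) pairs, only to fix the EDP definition.
techs = {"SRAM": (10.0, 1.0), "STT-MRAM": (6.0, 0.8)}
edp = {name: e * d for name, (e, d) in techs.items()}
for name, v in edp.items():
    print(f"{name}: EDP = {v:.1f} pJ*ns ({edp['SRAM'] / v:.2f}x vs SRAM)")
</code></pre>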
Our comprehensive cross-layer framework is demonstrated on STT-/SOT-MRAM technologies and can be used for the characterization, modeling, and analysis of any NVM technology for last-level caches in GPU platforms for deep learning applications.</em></td> </tr> <tr> <td style="width:40px;">IP4-8</td> <td><b>EFFICIENT EMBEDDED MACHINE LEARNING APPLICATIONS USING ECHO STATE NETWORKS</b><br /><b>Speaker</b>:<br />Rolando Brondolin, Politecnico di Milano, IT<br /><b>Authors</b>:<br />Luca Cerina<sup>1</sup>, Giuseppe Franco<sup>2</sup>, Claudio Gallicchio<sup>3</sup>, Alessio Micheli<sup>3</sup> and Marco D. Santambrogio<sup>4</sup><br /><sup>1</sup>Politecnico di Milano, IT; <sup>2</sup>Scuola Superiore Sant'Anna / Università di Pisa, IT; <sup>3</sup>Università di Pisa, IT; <sup>4</sup>Politecnico di Milano, IT<br /><em><b>Abstract</b><br />The increasing role of Artificial Intelligence (AI) and Machine Learning (ML) in our lives has brought a paradigm shift in how and where computation is performed. Stringent latency requirements and congested bandwidth have moved AI inference from the cloud towards end devices. This change required a major simplification of Deep Neural Networks (DNNs), with memory-wise libraries or co-processors that perform fast inference with minimal power. Unfortunately, many applications such as natural language processing, time-series analysis and audio interpretation are built on a different type of Artificial Neural Network (ANN), the so-called Recurrent Neural Networks (RNNs), which, due to their intrinsic architecture, remain too complex and heavy to run efficiently on embedded devices. To solve this issue, the Reservoir Computing paradigm proposes sparse untrained non-linear networks, the Reservoir, that can embed temporal relations without some of the hindrances of Recurrent Neural Network training, and with a lower memory usage. Echo State Networks (ESNs) and Liquid State Machines are the most notable examples. In this scenario, we propose a performance comparison of an ESN, designed and trained using Bayesian Optimization techniques, against current RNN solutions. We aim to demonstrate that ESNs have comparable performance in terms of accuracy, require minimal training time, and are more optimized in terms of memory usage and computational efficiency. Preliminary results show that ESNs are competitive with RNNs on a simple benchmark, and both training and inference time are faster, with a maximum speed-up of 2.35x and 6.60x, respectively.</em></td> </tr> <tr> <td style="width:40px;">IP4-9</td> <td><b>EXPLFRAME: EXPLOITING PAGE FRAME CACHE FOR FAULT ANALYSIS OF BLOCK CIPHERS</b><br /><b>Speaker</b>:<br />Anirban Chakraborty, IIT Kharagpur, IN<br /><b>Authors</b>:<br />Anirban Chakraborty<sup>1</sup>, Sarani Bhattacharya<sup>2</sup>, Sayandeep Saha<sup>1</sup> and Debdeep Mukhopadhyay<sup>1</sup><br /><sup>1</sup>IIT Kharagpur, IN; <sup>2</sup>Phd, BE<br /><em><b>Abstract</b><br />The Page Frame Cache (PFC) is a purely software cache, present in modern Linux-based operating systems (OS), which stores the page frames that were recently released by the processes running on a particular CPU. In this paper, we show that the page frame cache can be maliciously exploited by an adversary to steer the pages of a victim process to some pre-decided attacker-chosen locations in the memory.
We practically demonstrate an end-to-end attack, <em>ExplFrame</em>, where an attacker having only user-level privilege is able to force a victim process's memory pages to vulnerable locations in DRAM and deterministically conduct Rowhammer to induce faults. As a case study, we induce single-bit faults in the T-tables of OpenSSL (v1.1.1) AES using our proposed attack ExplFrame. We also propose an improvised fault analysis technique which can exploit any Rowhammer-induced bit-flips in the AES T-tables.</em></td> </tr> <tr> <td style="width:40px;">IP4-10</td> <td><b>XGBIR: AN XGBOOST-BASED IR DROP PREDICTOR FOR POWER DELIVERY NETWORK</b><br /><b>Speaker</b>:<br />An-Yu Su, National Chiao Tung University, TW<br /><b>Authors</b>:<br />Chi-Hsien Pao, Yu-Min Lee and An-Yu Su, National Chiao Tung University, TW<br /><em><b>Abstract</b><br />This work utilizes XGBoost to build a machine-learning-based IR drop predictor, XGBIR, for the power grid. To capture the behavior of the power grid, we extract several of its features and employ its locality property to save extraction time. XGBIR can be effectively applied to large designs, and the average error of predicted IR drops is less than 6 mV.</em></td> </tr> <tr> <td style="width:40px;">IP4-11</td> <td><b>ON PRE-ASSIGNMENT ROUTE PROTOTYPING FOR IRREGULAR BUMPS ON BGA PACKAGES</b><br /><b>Speaker</b>:<br />Hung-Ming Chen, National Chiao Tung University, TW<br /><b>Authors</b>:<br />Jyun-Ru Jiang<sup>1</sup>, Yun-Chih Kuo<sup>2</sup>, Simon Chen<sup>3</sup> and Hung-Ming Chen<sup>1</sup><br /><sup>1</sup>Institute of Electronics, National Chiao Tung University, TW; <sup>2</sup>National Taiwan University, TW; <sup>3</sup>MediaTek.inc, TW<br /><em><b>Abstract</b><br />In modern package design, the bumps are often placed irregularly due to macros of varying sizes and positions. This makes pre-assignment routing more difficult, even with massive design effort. This work presents a 2-stage routing method which can be applied to an arbitrary bump placement on 2-layer BGA packages. Our approach combines escape routing with via assignment: the escape routing is used to handle the irregular bumps, and the via assignment is applied to improve the wire congestion and total wirelength of global routing. Experimental results based on industrial cases show that our methodology can solve the routing efficiently, and we have achieved an 82% improvement in wire congestion with a 5% wirelength increase compared with conventional regular treatments.</em></td> </tr> <tr> <td style="width:40px;">IP4-12</td> <td><b>TOWARDS BEST-EFFORT APPROXIMATION: APPLYING NAS TO APPROXIMATE COMPUTING</b><br /><b>Speaker</b>:<br />Weiwei Chen, University of Chinese Academy of Sciences, CN<br /><b>Authors</b>:<br />Weiwei Chen, Ying Wang, Shuang Yang, Cheng Liu and Lei Zhang, Chinese Academy of Sciences, CN<br /><em><b>Abstract</b><br />The design of a neural network architecture for code approximation involves a large number of hyper-parameters to explore, so it is a non-trivial task to find a neural-based approximate computing solution that meets the demand of application-specified accuracy and Quality of Service (QoS). Prior works do not address the problem of 'optimal' network architecture design in program approximation, which depends on the user-specified constraints, the complexity of the dataset and the hardware configuration.
In this paper, we apply Neural Architecture Search (NAS) to search for and select neural approximate computing solutions, and provide an automatic framework that tries to generate the best-effort approximation result while satisfying the user-specified QoS/accuracy constraints. Compared with previous methods, this work achieves more than 1.43x speedup and 1.74x energy reduction on average when applied to the AxBench benchmarks.</em></td> </tr> <tr> <td style="width:40px;">IP4-13</td> <td><b>ON THE AUTOMATIC EXPLORATION OF WEIGHT SHARING FOR DEEP NEURAL NETWORK COMPRESSION</b><br /><b>Speaker</b>:<br />Etienne Dupuis, École Centrale de Lyon, FR<br /><b>Authors</b>:<br />Etienne Dupuis<sup>1</sup>, David Novo<sup>2</sup>, Ian O'Connor<sup>1</sup> and Alberto Bosio<sup>1</sup><br /><sup>1</sup>Lyon Institute of Nanotechnology, FR; <sup>2</sup>Université de Montpellier, FR<br /><em><b>Abstract</b><br />Deep neural networks demonstrate impressive inference results, particularly in computer vision and speech recognition. However, the associated computational workload and storage render their use prohibitive in resource-limited embedded systems. The approximate computing paradigm has been widely explored in both industrial and academic circles. It improves performance and energy efficiency by relaxing the need for fully accurate operations. Consequently, there is a large number of implementation options with very different approximation strategies (such as pruning, quantization, low-rank factorization, knowledge distillation, ...). To the best of our knowledge, no automated approach exists for exploring, selecting and generating the best approximate versions of a given convolutional neural network (CNN) for given design objectives. The objective of this work in progress is to show that the design space exploration phase can enable significant network compression without noticeable accuracy loss. We demonstrate this via an example based on weight sharing and show that, using our method, we can obtain a 4x compression rate without re-training, and without accuracy loss, on an int-16 version of LeNet-5 (a 5-layer, 1,720-kbit CNN).</em></td> </tr> <tr> <td style="width:40px;">IP4-14</td> <td><b>ROBUST AND HIGH-PERFORMANCE 12-T INTERLOCKED SRAM FOR IN-MEMORY COMPUTING</b><br /><b>Speaker</b>:<br />Joycee Mekie, IIT Gandhinagar, IN<br /><b>Authors</b>:<br />Neelam Surana, Mili Lavania, Abhishek Barma and Joycee Mekie, IIT Gandhinagar, IN<br /><em><b>Abstract</b><br />In this paper, we analyze the existing SRAM-based In-Memory Computing (IMC) proposals and show through exhaustive simulations that they fail under process variations. 6-T SRAM-, 8-T SRAM-, and 10-T SRAM-based IMC architectures suffer from compute-disturb (stored data flips during IMC), compute-failure (provides false computation results), and half-select failures, respectively. To circumvent these issues, we propose a novel 12-T Dual Port Dual Interlocked-storage Cell (DPDICE) SRAM. The DPDICE-SRAM-based IMC architecture (DPDICE-IMC) can perform essential Boolean functions successfully in a single cycle and can perform basic arithmetic operations such as add and multiply. The most striking feature is that the DPDICE-IMC architecture can perform IMC on two datasets simultaneously, thus doubling the throughput.
Cumulatively, the proposed DPDICE-IMC is 26.7%, 8x, and 28% better than the 6-T SRAM-, 8-T SRAM-, and 10-T SRAM-based IMC architectures, respectively.</em></td> </tr> <tr> <td style="width:40px;">IP4-15</td> <td><b>HIGH DENSITY STT-MRAM COMPILER DESIGN, VALIDATION AND CHARACTERIZATION METHODOLOGY IN 28NM FDSOI TECHNOLOGY</b><br /><b>Speaker</b>:<br />Piyush Jain, ARM Embedded Technologies Pvt Ltd., IN<br /><b>Authors</b>:<br />Piyush Jain<sup>1</sup>, Akshay Kumar<sup>1</sup>, Nicolaas Van Winkelhoff<sup>2</sup>, Didier Gayraud<sup>2</sup>, Surya Gupta<sup>3</sup>, Abdelali El Amraoui<sup>2</sup>, Giorgio Palma<sup>2</sup>, Alexandra Gourio<sup>2</sup>, Laurentz Vachez<sup>2</sup>, Luc Palau<sup>2</sup>, Jean-Christophe Buy<sup>2</sup> and Cyrille Dray<sup>2</sup><br /><sup>1</sup>ARM Embedded Technologies Pvt Ltd., IN; <sup>2</sup>ARM France, FR; <sup>3</sup>ARM Embedded technologies Pvt Ltd., IN<br /><em><b>Abstract</b><br />Spin Transfer Torque Magneto-resistive Random-Access Memory (STT-MRAM) is emerging as a promising substitute for flash memories due to scaling challenges for flash in process nodes beyond 28nm. STT-MRAM's high endurance, fast speed and low power make it suitable for a wide variety of applications. An embedded MRAM (eMRAM) compiler is highly desirable to enable SoC designers to use eMRAM instances in their designs in a flexible manner. However, the development of an eMRAM compiler has the added challenges of handling multi-fold higher density and maintaining analog circuit accuracy, on top of the challenges associated with conventional SRAM memory compilers. In this paper, we present a successful design methodology for a high-density 128Mb eMRAM compiler in a 28nm fully depleted SOI (FDSOI) process. This compiler enables optimized eMRAM instance generation with varying capacity ranges, word-widths, and optional features like repair and error correction. The eMRAM compiler design is achieved by evolving various architecture design, validation and characterization methods. A hierarchical and modular characterization methodology is presented to enable high-accuracy characterization and industry-standard EDA view generation from the eMRAM compiler.</em></td> </tr> <tr> <td style="width:40px;">IP4-16</td> <td><b>AN APPROXIMATION-BASED FAULT DETECTION SCHEME FOR IMAGE PROCESSING APPLICATIONS</b><br /><b>Speaker</b>:<br />Antonio Miele, Politecnico di Milano, IT<br /><b>Authors</b>:<br />Matteo Biasielli, Luca Cassano and Antonio Miele, Politecnico di Milano, IT<br /><em><b>Abstract</b><br />Image processing applications exhibit an intrinsic resilience to faults. In this application field the classical Duplication with Comparison (DWC) scheme, where output images are discarded as soon as the two replicas' outputs differ in at least one pixel, may be over-conservative. This paper introduces a novel lightweight fault detection scheme for image processing applications; i) it extends the DWC scheme by substituting one of the two exact replicas with a faster approximated one; and ii) it features a Neural Network-based checker designed to distinguish between usable and unusable images instead of faulty/unfaulty ones. The application of the hardening scheme on a case study has shown an execution time reduction from 27% to 34% w.r.t.
the DWC, while guaranteeing a comparable fault detection capability.</em></td> </tr> <tr> <td style="width:40px;">IP4-17</td> <td><b>TRANSPORT-FREE MODULE BINDING FOR SAMPLE PREPARATION USING MICROFLUIDIC FULLY PROGRAMMABLE VALVE ARRAYS</b><br /><b>Speaker</b>:<br />Gautam Choudhary, Adobe Research, India, IN<br /><b>Authors</b>:<br />Gautam Choudhary<sup>1</sup>, Sandeep Pal<sup>1</sup>, Debraj Kundu<sup>2</sup>, Sukanta Bhattacharjee<sup>3</sup>, Shigeru Yamashita<sup>4</sup>, Bing Li<sup>5</sup>, Ulf Schlichtmann<sup>5</sup> and Sudip Roy<sup>1</sup><br /><sup>1</sup>IIT Roorkee, IN; <sup>2</sup>PhD, IN; <sup>3</sup>Indian Statistical Institute, IN; <sup>4</sup>Ritsumeikan University, JP; <sup>5</sup>TUM, DE<br /><em><b>Abstract</b><br />Microfluidic fully programmable valve array (FPVA) biochips have emerged as general-purpose flow-based microfluidic lab-on-chips (LoCs). An FPVA supports highly re-configurable on-chip components (modules) in a two-dimensional grid-like structure controlled by software programs, unlike application-specific flow-based LoCs. Fluids can be loaded into or washed from a cell with the help of flows from the inlet to the outlet of an FPVA, whereas cell-to-cell transportation of discrete fluid segment(s) is not precisely possible. The simplest mixing module to realize on an FPVA-based LoC is a four-way mixer consisting of a 2×2 array of cells working as a ring-like mixer having four valves. In this paper, we propose a design automation method for sample preparation that finds suitable placements of mixing operations of a mixing tree using four-way mixers without requiring any transportation of fluid(s) between modules. We also propose a heuristic that modifies the mixing tree to reduce the sample preparation time. We have performed an extensive simulation and examined several parameters to determine the performance of the proposed solution.</em></td> </tr> </table> <hr /> <h2 id="10.1">10.1 Special Day on "Silicon Photonics": High-Speed Silicon Photonics Interconnects for Data Center and HPC</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 11:00 - 12:30<br /><b>Location / Room:</b> Amphithéâtre Jean Prouve</p> <p><b>Chair:</b><br />Ian O’Connor, Ecole Centrale de Lyon, FR</p> <p><b>Co-Chair:</b><br />Ashkan Seyedi, Hewlett Packard Labs, US</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>10.1.1</td> <td><b>THE NEED AND CHALLENGES OF CO-PACKAGING AND OPTICAL INTEGRATION IN DATA CENTERS</b><br /><b>Author</b>:<br />Liron Gantz, Mellanox, US<br /><em><b>Abstract</b><br />Silicon photonic (SiPh) technology was the "talk of the town" for almost two decades, yet only in the last couple of years have actual SiPh-based transceivers been introduced for short-reach links. As global IP traffic skyrockets and will surpass 1 ZB per year by 2020, this seems to be the optimal point for a new disruptive technology to emerge. SiPh technology has the potential to reduce power consumption while meeting the demand for increasing rates, and potentially even reduce the cost. Yet in order to fully integrate SiPh components into mainly CMOS ICs, the entire industry must align, beginning with industrial FABs and OSATs, and ending with system manufacturers and Data Center clients. Indeed, in the last year positive developments have occurred, as the hyper-scalers are starting to show interest in driving the market toward integrated optics, forgoing pluggable transceivers.
Yet many challenges have to be met, and some hard decisions have to be taken, in order to fully integrate optics in a scalable manner. In this talk I will review these challenges and possible ways to meet them in order to enable integrated optical products in Data Centers and High-Performance Computers.</em></td> </tr> <tr> <td>11:30</td> <td>10.1.2</td> <td><b>POWER AND COST ESTIMATE OF SCALABLE ALL-TO-ALL TOPOLOGIES WITH SILICON PHOTONICS LINKS</b><br /><b>Author</b>:<br />Luca Ramini, Hewlett Packard Labs, US<br /><em><b>Abstract</b><br />For many applications that require a tight latency profile, such as machine learning, a network topology that does not leverage arbitration-based switching is desired. All-to-all (A2A) interconnection networks enable any node in the network to communicate with any other node at any given time. Many abstractions can be made to enable this capability, such as buffering, time-domain multiplexing, etc. However, typical A2A topologies are limited to about 32 nodes within one hop. This is primarily due to limitations in reach, power consumption and bandwidth per interconnect. In this presentation, a topology of 256 nodes and beyond is considered by leveraging the many-wavelengths-per-fiber advantage of DWDM silicon photonics technology. Power and cost estimates of scalable A2A topologies using silicon photonics links are provided in order to understand the practical limits, if any, of a single node communicating with many other nodes via one wavelength per node.</em></td> </tr> <tr> <td>12:00</td> <td>10.1.3</td> <td><b>THE NEXT FRONTIER IN SILICON PHOTONIC DESIGN: EXPERIMENTALLY VALIDATED STATISTICAL MODELS</b><br /><b>Authors</b>:<br />Geoff Duggan<sup>1</sup>, James Pond<sup>1</sup>, Xu Wang<sup>1</sup>, Ellen Schelew<sup>1</sup>, Federico Gomez<sup>1</sup>, Milad Mahpeykar<sup>1</sup>, Ray Chung<sup>1</sup>, Zequin Lu<sup>1</sup>, Parya Samadian<sup>1</sup>, Jens Niegemann<sup>1</sup>, Adam Reid<sup>1</sup>, Roberto Armenta<sup>1</sup>, Dylan McGuire<sup>1</sup>, Peng Sun<sup>2</sup>, Jared Hulme<sup>2</sup>, Mudit Jan<sup>2</sup> and Ashkan Seyedi<sup>2</sup><br /><sup>1</sup>Lumerical, US; <sup>2</sup>Hewlett Packard Labs, US<br /><em><b>Abstract</b><br />Silicon photonics has made tremendous progress in recent years and is now a critical technology embedded in many commercial products, particularly for data communications, while new products in sensing, AI and even quantum information technologies are in development. High quality processes from multiple foundries, supported by sophisticated electronic-photonic design automation (EPDA) workflows, have made these advancements possible. Although several initiatives have begun to address the issue of manufacturing variability in photonics, these approaches have not been integrated meaningfully into EPDA workflows, which lag well behind electronic integrated circuit workflows. Contributing to this deficiency has been a lack of data to calibrate statistical photonic compact models used in photonic circuit and system simulation.
We present our current work in developing tools to calibrate statistical photonic compact models and compare our results against experimental data.</em></td> </tr> <tr> <td>12:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="10.2">10.2 Autonomous Systems Design Initiative: Uncertainty Handling in Safe Autonomous Systems (UHSAS)</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 11:00 - 12:30<br /><b>Location / Room:</b> Chamrousse</p> <p><b>Chair:</b><br />Peter Munk, Bosch Corporate Research, DE</p> <p><b>Co-Chair:</b><br />Ahmad Adee, Bosch Corporate Research, DE</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>10.2.1</td> <td><b>MAKING THE RELATIONSHIP BETWEEN UNCERTAINTY ESTIMATION AND SAFETY LESS UNCERTAIN</b><br /><b>Speaker</b>:<br />Peter Schlicht, Volkswagen, DE<br /><b>Authors</b>:<br />Peter Schlicht<sup>1</sup>, Vincent Aravantinos<sup>2</sup> and Fabian Hüger<sup>1</sup><br /><sup>1</sup>Volkswagen, DE; <sup>2</sup>AID, DE</td> </tr> <tr> <td>11:30</td> <td>10.2.2</td> <td><b>SYSTEM THEORETIC VIEW ON UNCERTAINTIES</b><br /><b>Speaker</b>:<br />Roman Gansch, Robert Bosch GmbH, DE<br /><b>Authors</b>:<br />Roman Gansch and Ahmad Adee, Robert Bosch GmbH, DE<br /><em><b>Abstract</b><br />The complexity of the operating environment and required technologies for highly automated driving is unprecedented. A different type of threat to safe operation, besides the fault-error-failure model by Laprie et al., arises in the form of performance limitations. We propose a system-theoretic approach to handle these and derive a taxonomy based on uncertainty, i.e. lack of knowledge, as a root cause. Uncertainty is a threat to the dependability of a system, as it limits our ability to assess its dependability properties. We distinguish between aleatory (inherent to probabilistic models), epistemic (lack of model parameter knowledge) and ontological (incompleteness of models) uncertainties in order to determine strategies and methods to cope with them. Analogous to the taxonomy of Laprie et al., we cluster methods into uncertainty prevention (use of elements with well-known behavior, avoiding architectures prone to emergent behavior, restriction of operational design domain, etc.), uncertainty removal (during design time by design of experiment, etc. and after release by field observation, continuous updates, etc.), uncertainty tolerance (use of redundant architectures with diverse uncertainties, uncertainty-aware deep learning, etc.)
and uncertainty forecasting (estimation of residual uncertainty, etc.).</em></td> </tr> <tr> <td>12:00</td> <td>10.2.3</td> <td><b>DETECTION OF FALSE NEGATIVE AND FALSE POSITIVE SAMPLES IN SEMANTIC SEGMENTATION</b><br /><b>Speaker</b>:<br />Matthias Rottmann, School of Mathematics &amp; Science and ICMD, DE<br /><b>Authors</b>:<br />Hanno Gottschalk<sup>1</sup>, Matthias Rottmann<sup>1</sup>, Kira Maag<sup>1</sup>, Robin Chan<sup>1</sup>, Fabian Hüger<sup>2</sup> and Peter Schlicht<sup>2</sup><br /><sup>1</sup>School of Mathematics &amp; Science and ICMD, DE; <sup>2</sup>Volkswagen, DE</td> </tr> <tr> <td>12:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="10.3">10.3 Special Session: Next Generation Arithmetic for Edge Computing</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 11:00 - 12:30<br /><b>Location / Room:</b> Autrans</p> <p><b>Chair:</b><br />Anupam Chattopadhyay, Nanyang Technological University, SG</p> <p><b>Co-Chair:</b><br />Farhad Merchant, RWTH Aachen University, DE</p> <p>Arithmetic is ubiquitous in today's digital world, ranging from embedded to high-performance computing systems. With machine learning at the fore in a wide range of application domains, from wearables, automotive and avionics to weather prediction, sufficiently accurate yet low-cost arithmetic is the need of the day. Recently, there have been several advances in the domain of computer arithmetic, such as high-precision anchored numbers from ARM, posit arithmetic by John Gustafson, and bfloat16, as alternatives to IEEE 754-2008-compliant arithmetic. Optimizations on fixed-point and integer arithmetic are also pursued actively for low-power computing architectures. Furthermore, approximate computing and transprecision/mixed-precision computing have long been exciting research areas. While academic research in the domain of computer arithmetic has a long history, industrial adoption of some of these new data-types and techniques is in its early stages and expected to increase in the future. bfloat16 is an excellent example of that. In this special session, we bring academia and industry together to discuss the latest results and future directions for research in the domain of next-generation computer arithmetic.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>10.3.1</td> <td><b>PARADIGM ON APPROXIMATE COMPUTE FOR COMPLEX PERCEPTION-BASED NEURAL NETWORKS</b><br /><b>Authors</b>:<br />Andre Guntoro<sup>1</sup> and Cecilia De la Parra<sup>2</sup><br /><sup>1</sup>Research Lead, Corporate Research, Robert Bosch GmbH, DE; <sup>2</sup>Robert Bosch GmbH, DE<br /><em><b>Abstract</b><br />The rise of machine learning pushes up compute power requirements massively, especially on edge devices performing real-time inference. One established approach to reducing power usage is going down to integer inference (such as 8-bit) instead of utilizing the higher computation accuracy given by the floating-point counterparts. Squeezing into lower bit representations, such as in binary weight networks or binary neural networks, requires complex training methods and more effort to recover the precision loss, and it typically functions only on simple classification tasks. One promising alternative to further reduce power consumption and computation latency is to utilize approximate compute units.
This method is a promising paradigm for mitigating the computation demand of neural networks by taking advantage of their inherent resilience. Thanks to the developments in approximate computing over the last decade, we have abundant options to utilize the best available approximate units, without re-developing or re-designing them. Nonetheless, adaptation during the training phase is required. First, we need to adapt the training methods for neural networks to take into account the inaccuracy introduced by approximate compute, without sacrificing training speed (considering that training is performed on GPUs with floating-point). Second, we need to define new metrics for assessing and selecting the best-fit approximation units on a per-use-case basis. Lastly, we need to carry the advantages of approximation into the neural networks, such as over-fitting mitigation by design and resiliency, so that networks trained for and designed with approximation perform better than their exact-computing counterparts. For these steps, we evaluate on small tasks first and further validate on complex tasks that are more relevant in automotive domains.</em></td> </tr> <tr> <td>11:22</td> <td>10.3.2</td> <td><b>NEXT GENERATION FPGA ARITHMETIC FOR AI</b><br /><b>Author</b>:<br />Martin Langhammer, Intel, GB<br /><em><b>Abstract</b><br />The most recent FPGA architectures have introduced new levels of embedded floating-point performance, with tens of TFLOPs now available across a wide range of device sizes. The last two generations of FPGAs have introduced IEEE 754 single-precision (FP32) arithmetic, containing up to 10 TFLOPs. The emergence of AI/Machine Learning as the highest-profile FPGA application has changed the focus from signal processing and embedded calculations supported by FP32 to smaller floating-point precisions, such as BFLOAT16 for training and FP16 for inference. In this talk, we will describe the architecture and development of the Intel Agilex DSP Block, which contains an FP32 multiplier-adder pair that can be decomposed into two smaller-precision pairs supporting fp16, bfloat16, and a third proprietary format which can be used for both training and inference. In the Edge, where even lower-precision arithmetic is required for inference, new FPGA EDA flows can implement 100 TFLOPs+ of soft logic-based compute power. In the second half of our talk, we will describe new synthesis, clustering, and packing methodologies - collectively known as Fractal Synthesis - that allow an unprecedented near 100% logic use of the FPGA for arithmetic, while maintaining the clock rates of a small example design. The soft logic and embedded arithmetic capabilities can be used simultaneously, making the FPGA the most flexible, and amongst the highest-performing, AI platforms.</em></td> </tr> <tr> <td>11:44</td> <td>10.3.3</td> <td><b>APPLICATION-SPECIFIC ARITHMETIC DESIGN</b><br /><b>Author</b>:<br />Florent de Dinechin, INSA Lyon, FR<br /><em><b>Abstract</b><br />General-purpose processor manufacturers face the difficult task of deciding the best arithmetic systems to commit to silicon. An alternative, particularly relevant to FPGA computing and ASIC design, is to keep this choice as open as possible, designing tools that enable different arithmetic systems to be mixed and matched in an application-specific way. To achieve this, a productive paradigm has emerged from the FloPoCo project: open-ended generation of over-parameterized operators that compute just right thanks to last-bit accuracy at all levels.
This work reviews this paradigm, along with some of the arithmetic tools recently developed for this purpose: the generic bit-heap framework of FloPoCo, and the integration of arithmetic optimization inside HLS tools in the Marto project.</em></td> </tr> <tr> <td>12:06</td> <td>10.3.4</td> <td><b>A COMPARISON OF POSIT AND IEEE 754 FLOATING-POINT ARITHMETIC THAT ACCOUNTS FOR EXCEPTION HANDLING</b><br /><b>Author</b>:<br />John Gustafson, National University of Singapore, SG<br /><em><b>Abstract</b><br />The posit number format has advantages over the decades-old IEEE 754 floating-point standard in many dimensions: accuracy, dynamic range, simplicity, bitwise reproducibility, resiliency, and resistance to side-channel security attacks. In making comparisons, it is essential to distinguish between an IEEE 754 Standard implementation that handles all the exceptions in hardware, and one that either ignores the exceptions of the Standard or handles them with software or microcode that takes hundreds of clock cycles to execute. Ignoring the exceptions quickly leads to egregious problems such as different values comparing as equal; handling exceptions with microcode creates massive data-dependent timing that permits side-channel attacks like the well-known Spectre and Meltdown security weaknesses. Many microprocessors, such as current x86 architectures, use the exception-trapping approach for exceptions such as denormalized floats, which makes them unsuitable for secure use. Posit arithmetic provides data-independent and fast execution times with less complexity than a data-independent IEEE 754 float environment for the same data size.</em></td> </tr> <tr> <td>12:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="10.4">10.4 Design Methodologies for Hardware Approximation</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 11:00 - 12:30<br /><b>Location / Room:</b> Stendhal</p> <p><b>Chair:</b><br />Lukas Sekanina, Brno University of Technology, CZ</p> <p><b>Co-Chair:</b><br />David Novo, CNRS &amp; University of Montpellier, FR</p> <p>New methods for the design and evaluation of approximate hardware are key to its success. This session shows that these approximation methods are applicable across different levels of hardware description, including an RTL design of an approximate multiplier, approximate circuits modelled using binary decision diagrams, and a behavioural description used in the context of high-level synthesis of hardware accelerators. The papers of this session also show how to address another challenge - efficient error evaluation - by means of new statistical and formal verification methods.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>10.4.1</td> <td><b>REALM: REDUCED-ERROR APPROXIMATE LOG-BASED INTEGER MULTIPLIER</b><br /><b>Speaker</b>:<br />Hassaan Saadat, University of New South Wales, AU<br /><b>Authors</b>:<br />Hassaan Saadat<sup>1</sup>, Haris Javaid<sup>2</sup>, Aleksandar Ignjatovic<sup>1</sup> and Sri Parameswaran<sup>3</sup><br /><sup>1</sup>University of New South Wales, AU; <sup>2</sup>Xilinx, SG; <sup>3</sup>UNSW, AU<br /><em><b>Abstract</b><br />We propose a new error-configurable approximate unsigned integer multiplier named REALM. It incorporates a novel error-reduction method into the classical approximate log-based multiplier.
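<p>For readers unfamiliar with that baseline, a minimal sketch of Mitchell-style approximate log multiplication follows (illustrative only; REALM's error-reduction factors are not modeled, and the operand values are arbitrary):</p> <pre>
# Classical approximate log-based (Mitchell) multiplication of positive
# integers: approximate log2 by "characteristic + fractional mantissa",
# add the two logs, then apply the matching antilog approximation.
def mitchell_mul(a, b):
    def log2_approx(x):
        k = x.bit_length() - 1     # characteristic: position of the leading one
        f = (x - 2**k) / 2**k      # mantissa fraction in [0, 1)
        return k + f               # Mitchell: log2(x) is approximated by k + f

    def antilog2_approx(y):
        k = int(y)
        f = y - k
        return (1 + f) * 2**k      # 2^y is approximated by (1 + f) * 2^k

    return antilog2_approx(log2_approx(a) + log2_approx(b))

exact = 200 * 300
approx = mitchell_mul(200, 300)
print(exact, approx, f"relative error {abs(exact - approx) / exact:.2%}")
</pre>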
Each power-of-two-interval of the input operands is partitioned into M×M segments, and an error-reduction factor for each segment is analytically determined. These error-reduction factors can be used across any power-of-two-interval, so we quantize only M^2 factors and store them in the form of read-only hardwired lookup tables to keep the resource overhead to a minimum. Error characterization of REALM shows that it achieves very low error bias (mostly less than or equal to 0.05%), along with lower mean error (from 0.4% to 1.6%) and lower peak error (from 2.08% to 7.4%) than the classical approximate log-based multiplier and its state-of-the-art derivatives (mean errors greater than or equal to 2.6% and peak errors greater than or equal to 7.8%). Synthesis results using the TSMC 45nm standard-cell library show that REALM enables significant power efficiency (66% to 86% reduction) and area efficiency (50% to 76% reduction) when compared with the accurate integer multiplier. We show that REALM produces Pareto-optimal design trade-offs in the design space of state-of-the-art approximate multipliers. Application-level evaluation of REALM demonstrates that it has negligible effect on the output quality.</em></td> </tr> <tr> <td>11:30</td> <td>10.4.2</td> <td><b>A FAST BDD MINIMIZATION FRAMEWORK FOR APPROXIMATE COMPUTING</b><br /><b>Speaker</b>:<br />Oliver Keszocze, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE<br /><b>Authors</b>:<br />Andreas Wendler and Oliver Keszocze, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE<br /><em><b>Abstract</b><br />Approximate Computing is a design paradigm that trades off computational accuracy for gains in non-functional aspects such as reduced area, increased computation speed, or power reduction. Computing the error of the approximated design is an essential step to determine its quality. The computation time for determining the error can become very large, effectively rendering the entire logic approximation procedure infeasible. As a remedy, we present methods to accelerate error metric computations by (a) exploiting structural information and (b) computing estimates of the metrics for multi-output Boolean functions represented as BDDs. We further present a novel greedy, bucket-based BDD minimization framework employing the newly proposed error metric computations to produce Pareto-optimal solutions with respect to BDD size and multiple error metrics. The applicability of the proposed minimization framework is demonstrated by an experimental evaluation. We can report considerable speedups while, at the same time, creating high-quality approximated BDDs.</em></td> </tr> <tr> <td>12:00</td> <td>10.4.3</td> <td><b>ON THE DESIGN OF HIGH PERFORMANCE HW ACCELERATOR THROUGH HIGH-LEVEL SYNTHESIS SCHEDULING APPROXIMATIONS</b><br /><b>Speaker</b>:<br />Benjamin Carrion Schaefer, University of Texas at Dallas, US<br /><b>Authors</b>:<br />Siyuan Xu and Benjamin Carrion Schaefer, University of Texas at Dallas, US<br /><em><b>Abstract</b><br />High-level synthesis (HLS) takes as input a behavioral description (e.g. C/C++) and generates efficient hardware through three main steps: allocation, scheduling, and binding. The scheduling step times the operations in the behavioral description by assigning different portions of the code to unique clock steps (control steps). The code portions assigned to each clock step mainly depend on the target synthesis frequency and target technology.
This work makes use of this to generate smaller and faster circuits by approximating the program portions scheduled in each clock step and by exploiting the slack between different scheduling steps to further increase the performance and reduce the latency of the resultant circuit. In particular, each individual scheduling step is approximated given a maximum error boundary and a library of different approximation techniques. In order to further optimize the resultant circuit, different scheduling steps are merged based on the timing slack of the different control steps without violating the given timing constraint (target frequency). Experimental results from different domain-specific applications show that our method works well and is able to increase the throughput on average by 82% while at the same time reducing the area by 21% for a given maximum allowable error.</em></td> </tr> <tr> <td>12:15</td> <td>10.4.4</td> <td><b>FAST KRIGING-BASED ERROR EVALUATION FOR APPROXIMATE COMPUTING SYSTEMS</b><br /><b>Speaker</b>:<br />Daniel Menard, INSA Rennes, FR<br /><b>Authors</b>:<br />Justine Bonnot<sup>1</sup>, Karol Desnos<sup>2</sup> and Daniel Menard<sup>3</sup><br /><sup>1</sup>Univ Rennes, INSA Rennes, CNRS, IETR - UMR 6164, FR; <sup>2</sup>Univ Rennes, INSA Rennes, CNRS - IETR UMR 6164, FR; <sup>3</sup>INSA Rennes, FR<br /><em><b>Abstract</b><br />Approximate computing techniques trade off an application's accuracy for performance. The challenge when implementing approximate computing in an application is to efficiently evaluate the quality at the output of the application in order to optimize the noise budgeting of the different approximation sources. This is commonly achieved with an optimization algorithm that minimizes the implementation cost of the application subject to a quality constraint. During the optimization process, numerous approximation configurations are tested, and the quality at the output of the application is measured for each configuration with simulations, making the optimization process a time-consuming task. We propose a new method for inferring the accuracy or quality metric at the output of an application using kriging, a geostatistical method.</em></td> </tr> <tr> <td style="width:40px;">12:30</td> <td><a href="/date20/conference/session/IP5">IP5-1</a>, 21</td> <td><b>STATISTICAL MODEL CHECKING OF APPROXIMATE CIRCUITS: CHALLENGES AND OPPORTUNITIES</b><br /><b>Speaker and Author</b>:<br />Josef Strnadel, Brno University of Technology, CZ<br /><em><b>Abstract</b><br />Many works have shown that approximate circuits may play an important role in the development of resource-efficient electronic systems. This motivates many researchers to propose new approaches for finding an optimal trade-off between the approximation error and resource savings for predefined applications of approximate circuits. The works and approaches, however, focus mainly on design aspects regarding relaxed functional requirements while neglecting further aspects such as signal and parameter dynamics/stochasticity, relaxed/non-functional equivalence, testing or formal verification. This paper aims to take a step ahead by moving towards the formal verification of time-dependent properties of systems based on approximate circuits. Firstly, it presents our approach to modeling such systems by means of stochastic timed automata; our approach goes beyond digital, combinational and/or synchronous circuits and is applicable to sequential, analog and/or asynchronous circuits as well.
Secondly, the paper shows the principle and advantage of verifying properties of modeled approximate systems by the statistical model checking technique. Finally, the paper evaluates our approach and outlines future research perspectives.</em></td> </tr> <tr> <td style="width:40px;">12:31</td> <td><a href="/date20/conference/session/IP5">IP5-2</a>, 912</td> <td><b>RUNTIME ACCURACY-CONFIGURABLE APPROXIMATE HARDWARE SYNTHESIS USING LOGIC GATING AND RELAXATION</b><br /><b>Speaker</b>:<br />Tanfer Alan, Karlsruhe Institute of Technology, TR<br /><b>Authors</b>:<br />Tanfer Alan<sup>1</sup>, Andreas Gerstlauer<sup>2</sup> and Joerg Henkel<sup>1</sup><br /><sup>1</sup>Karlsruhe Institute of Technology, DE; <sup>2</sup>University of Texas, Austin, US<br /><em><b>Abstract</b><br />Approximate computing trades off computation accuracy against energy efficiency. Algorithms from several modern application domains, such as decision making and computer vision, are tolerant to approximations while still meeting their requirements. The extent of approximation tolerance, however, varies significantly with changes in input characteristics and applications. We propose a novel hybrid approach for the synthesis of runtime accuracy-configurable hardware that minimizes energy consumption at the expense of area. To that end, we first explore instantiating multiple hardware blocks with different fixed approximation levels. These blocks can be selected dynamically and thus allow the accuracy to be configured at runtime. They benefit from having fewer transistors and also from synthesis relaxations, in contrast to state-of-the-art gating mechanisms, which only switch off groups of logic. Our hybrid approach combines instantiating such blocks with area-efficient gating mechanisms that reduce toggling activity, creating a fine-grained design-time knob on energy vs. area.
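<p>As a hedged illustration of such a runtime accuracy knob (the variant table and its numbers below are hypothetical, not taken from the paper): at runtime, the cheapest hardware variant whose worst-case error fits the currently tolerated error is selected.</p> <pre>
# Illustrative runtime selection among accuracy-configurable variants.
# Each variant's (max_error, energy) pair would come from design-time
# characterization; the values here are made up for the example.
VARIANTS = [
    {"name": "exact",    "max_error": 0.00, "energy_nj": 10.0},
    {"name": "approx-1", "max_error": 0.02, "energy_nj": 7.1},
    {"name": "approx-2", "max_error": 0.08, "energy_nj": 5.6},
]

def select_variant(tolerated_error):
    ok = [v for v in VARIANTS if v["max_error"] <= tolerated_error]
    return min(ok, key=lambda v: v["energy_nj"])  # cheapest acceptable variant

for tol in (0.00, 0.05, 0.10):
    v = select_variant(tol)
    print(f"tolerance {tol:.2f} -> {v['name']} ({v['energy_nj']} nJ)")
</pre>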
Examining total energy savings for a Sobel Filter under different workloads and accuracy tolerances shows that our method finds Pareto-optimal solutions providing up to 16% and 44% energy savings compared to a state-of-the-art accuracy-configurable gating mechanism and an exact hardware block, respectively, at 2x area cost.</em></td> </tr> <tr> <td>12:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="10.5">10.5 Emerging Machine Learning Applications and Models</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 11:00 - 12:30<br /><b>Location / Room:</b> Bayard</p> <p><b>Chair:</b><br />Mladen Berekovic, TU Braunschweig, DE</p> <p><b>Co-Chair:</b><br />Sophie Quinton, INRIA, FR</p> <p>This session presents new application domains and new models for neural networks, discussing two novel video applications, multi-view and surveillance, and a Bayesian modeling approach for neural networks.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>10.5.1</td> <td><b>COMMUNICATION-EFFICIENT VIEW-POOLING FOR DISTRIBUTED INFERENCE WITH MULTI-VIEW NEURAL NETWORKS</b><br /><b>Speaker</b>:<br />Manik Singhal, School of Electrical and Computer Engineering, Purdue University, US<br /><b>Authors</b>:<br />Manik Singhal, Anand Raghunathan and Vijay Raghunathan, Purdue University, US<br /><em><b>Abstract</b><br />Multi-view object detection, or the problem of detecting an object using multiple viewpoints, is an important problem in computer vision with varied applications such as distributed smart cameras and collaborative drone swarms. Multi-view object detection algorithms based on deep neural networks (DNNs) achieve high accuracy by view pooling, i.e., aggregating features corresponding to the different views. However, when these algorithms are realized on networks of edge devices, the communication cost incurred by view pooling often dominates the overall latency and energy consumption. In this paper, we propose techniques for communication-efficient view pooling that can be used to improve the efficiency of distributed multi-view object detection and apply them to state-of-the-art multi-view DNNs. First, we propose significance-aware selective view pooling, which identifies and communicates only those features from each view that are likely to impact the pooled result (and hence, the final output of the DNN). Second, we propose multi-resolution feature view pooling, which divides views into dominant and non-dominant views, and down-scales the features from non-dominant views using an additional network layer before communicating them for pooling. The dominant and non-dominant views are pooled separately and the results are jointly used to derive the final classification.
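<p>A hedged sketch of the communication-saving idea behind the first scheme (the top-k selection rule, feature sizes and device count below are our own illustrative assumptions, not the paper's exact significance criterion):</p> <pre>
# Each edge device sends only the activations most likely to survive a
# global max-pool (here: its top-k values); the aggregator max-pools the
# sparse contributions. Everything numeric here is illustrative.
import numpy as np

def select_significant(features, k=32):
    """Keep only the k largest activations of one view (indices, values)."""
    idx = np.argpartition(features, -k)[-k:]
    return idx, features[idx]

def pooled_from_views(view_features, dim, k=32):
    pooled = np.full(dim, -np.inf)
    sent = 0
    for feats in view_features:              # one feature vector per camera/drone
        idx, vals = select_significant(feats, k)
        sent += len(idx)                     # values actually communicated
        np.maximum.at(pooled, idx, vals)     # sparse max-pool update
    pooled[np.isinf(pooled)] = 0.0           # untouched positions default to 0
    return pooled, sent

rng = np.random.default_rng(0)
views = [rng.random(4096) for _ in range(12)]   # e.g., twelve edge devices
pooled, sent = pooled_from_views(views, 4096)
print(f"communicated {sent} of {12 * 4096} feature values")
</pre>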
We implement and evaluate the proposed pooling schemes using a model test-bed of twelve Raspberry Pi 3b+ devices and show that they achieve a 9X-36X reduction in data communicated and a 1.8X reduction in inference latency, with no degradation in accuracy.</em></td> </tr> <tr> <td>11:30</td> <td>10.5.2</td> <td><b>AN ANOMALY COMPREHENSION NEURAL NETWORK FOR SURVEILLANCE VIDEOS ON TERMINAL DEVICES</b><br /><b>Speaker</b>:<br />Yuan Cheng, Shanghai Jiao Tong University, CN<br /><b>Authors</b>:<br />Yuan Cheng<sup>1</sup>, Guangtai Huang<sup>2</sup>, Peining Zhen<sup>1</sup>, Bin Liu<sup>2</sup>, Hai-Bao Chen<sup>1</sup>, Ngai Wong<sup>3</sup> and Hao Yu<sup>2</sup><br /><sup>1</sup>Shanghai Jiao Tong University, CN; <sup>2</sup>Southern University of Science and Technology, CN; <sup>3</sup>University of Hong Kong, HK<br /><em><b>Abstract</b><br />Anomaly comprehension in surveillance videos is more challenging than detection. This work introduces the design of a lightweight and fast anomaly comprehension neural network. For comprehension, a spatio-temporal LSTM model is developed based on the structured, tensorized time-series features extracted from surveillance videos. Deep compression of network size is achieved by tensorization and quantization for the implementation on terminal devices. Experiments on the large-scale video anomaly dataset UCF-Crime demonstrate that the proposed network can achieve an impressive inference speed of 266 FPS on a GTX-1080Ti GPU, which is 4.29x faster than the ConvLSTM-based method; a 3.34% AUC improvement with a 5.55% accuracy niche versus the 3D-CNN-based approach; and at least 15k× parameter reduction and 228× storage compression over the RNN-based approaches. Moreover, the proposed framework has been realized on an ARM-core-based IoT board with only 2.4 W power consumption.</em></td> </tr> <tr> <td>12:00</td> <td>10.5.3</td> <td><b>BYNQNET: BAYESIAN NEURAL NETWORK WITH QUADRATIC ACTIVATIONS FOR SAMPLING-FREE UNCERTAINTY ESTIMATION ON FPGA</b><br /><b>Speaker</b>:<br />Hiromitsu Awano, Osaka University, JP<br /><b>Authors</b>:<br />Hiromitsu Awano and Masanori Hashimoto, Osaka University, JP<br /><em><b>Abstract</b><br />An efficient inference algorithm for Bayesian neural networks (BNNs), named BYNQNet (Bayesian neural network with quadratic activations), and its FPGA implementation are proposed. As neural networks find applications in mission-critical systems, uncertainty estimation in network inference becomes increasingly important. The BNN is a theoretically grounded solution for dealing with uncertainty in neural networks by treating network parameters as random variables. However, inference in a BNN involves Monte Carlo (MC) sampling, i.e., a stochastic forwarding is repeated N times with randomly sampled network parameters, which results in N times slower inference compared to the non-Bayesian approach. Although recent papers have proposed sampling-free algorithms for BNN inference, they still require the evaluation of complex functions, such as the cumulative distribution function (CDF) of the Gaussian distribution, for propagating uncertainties through nonlinear activation functions such as ReLU and Heaviside, which requires a considerable amount of resources for hardware implementation. Contrary to conventional BNNs, BYNQNet employs quadratic nonlinear activation functions, and hence the uncertainty propagation can be achieved using only polynomial operations. Our numerical experiment reveals that BYNQNet has comparable accuracy to an MC-based BNN that requires N=10 forwardings.
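<p>To make the polynomial uncertainty propagation concrete, a hedged worked sketch (assuming a Gaussian input, with illustrative values): for x ~ N(mu, var), the moments of y = x^2 are closed-form polynomials in mu and var, so no Gaussian CDF or MC sampling is needed.</p> <pre>
# Moments of y = x^2 for Gaussian x, checked against Monte Carlo sampling.
# E[x^2] = mu^2 + var;  Var[x^2] = 2*var^2 + 4*mu^2*var  (standard identities).
import numpy as np

def quad_moments(mu, var):
    mean = mu**2 + var
    variance = 2 * var**2 + 4 * mu**2 * var
    return mean, variance

mu, var = 0.7, 0.25                      # illustrative input distribution
rng = np.random.default_rng(0)
samples = rng.normal(mu, np.sqrt(var), 1_000_000) ** 2

print("analytic   :", quad_moments(mu, var))
print("monte carlo:", (samples.mean(), samples.var()))
</pre>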
We also demonstrate that BYNQNet implemented on a Xilinx PYNQ-Z1 FPGA board achieves a throughput of 131×10^3 images per second and an energy efficiency of 44.7×10^3 images per joule, which corresponds to 4.07x and 8.99x improvements over the state-of-the-art MC-based BNN accelerator.</em></td> </tr> <tr> <td>12:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="10.6">10.6 Secure Processor Architecture</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 11:00 - 12:30<br /><b>Location / Room:</b> Lesdiguières</p> <p><b>Chair:</b><br />Bossuet Lilian, Université de Lyon, FR</p> <p><b>Co-Chair:</b><br />Moraes Fernando, Pontifícia Universidade Católica do Rio Grande do Sul, BR</p> <p>This session provides an overview of new mechanisms to protect processor architectures, boot sequences, caches, and energy management. The solutions strive to address and mitigate a wide range of attack methodologies, with a special focus on newly emerging attacks.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>10.6.1</td> <td><b>CAPTURING AND OBSCURING PING-PONG PATTERNS TO MITIGATE CONTINUOUS ATTACKS</b><br /><b>Speaker</b>:<br />Kai Wang, Harbin Institute of Technology, CN<br /><b>Authors</b>:<br />Kai Wang<sup>1</sup>, Fengkai Yuan<sup>2</sup>, Rui Hou<sup>2</sup>, Zhenzhou Ji<sup>1</sup> and Dan Meng<sup>2</sup><br /><sup>1</sup>Harbin Institute of Technology, CN; <sup>2</sup>State Key Laboratory of Information Security, Institute of Information Engineering, CAS, CN<br /><em><b>Abstract</b><br />In this paper, we observe that Continuous Attacks are a common kind of side-channel attack scenario, in which an adversary frequently probes the same target cache lines in a short time. Continuous Attacks cause target cache lines to go through multiple load-evict processes, exhibiting Ping-Pong Patterns. Identifying and obscuring Ping-Pong Patterns effectively interferes with the attacker's probe and mitigates Continuous Attacks. Based on these observations, this paper proposes the Ping-Pong Regulator to identify multiple Ping-Pong Patterns and block them with different strategies (Preload or Lock). The Preload proactively loads target lines into the cache, causing the attacker to mistakenly infer that the victim has accessed these lines; the Lock fixes the attacked lines' directory entries in the last-level cache directory until they are evicted from the caches, so that an attacker's observation of the locked lines is always an L2 cache miss.
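<p>As a loose illustration of what identifying a Ping-Pong Pattern can mean (the window, threshold and event trace below are our own hypothetical choices, not the paper's detector, and the Preload/Lock responses are not modeled):</p> <pre>
# Flag a cache line whose recent history alternates between load and
# evict often enough within a sliding window of cycles.
from collections import defaultdict, deque

WINDOW, THRESHOLD = 1000, 4       # hypothetical look-back and alternation count

history = defaultdict(deque)      # line address -> recent (cycle, event) pairs

def observe(line, event, cycle):
    """Record a 'load' or 'evict'; return True when the line ping-pongs."""
    h = history[line]
    h.append((cycle, event))
    while h and cycle - h[0][0] > WINDOW:     # drop events outside the window
        h.popleft()
    pairs = zip(h, list(h)[1:])
    alternations = sum(1 for (_, a), (_, b) in pairs if a != b)
    return alternations >= THRESHOLD

trace = [(0x1f40, e, t * 50) for t, e in enumerate(
    ["load", "evict", "load", "evict", "load", "evict"])]
for line, event, cycle in trace:
    if observe(line, event, cycle):
        print(f"line {line:#x} shows a ping-pong pattern at cycle {cycle}")
</pre>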
The experimental evaluation demonstrates that the Ping-Pong Regulator efficiently identifies and secures attacked lines, induces negligible performance and storage overhead, and does not require any software support.</em></td> </tr> <tr> <td>11:30</td> <td>10.6.2</td> <td><b>MITIGATING CACHE-BASED SIDE-CHANNEL ATTACKS THROUGH RANDOMIZATION: A COMPREHENSIVE SYSTEM AND ARCHITECTURE LEVEL ANALYSIS</b><br /><b>Speaker</b>:<br />Houman Homayoun, University of California, Davis, US<br /><b>Authors</b>:<br />Han Wang<sup>1</sup>, Hossein Sayadi<sup>1</sup>, Avesta Sasan<sup>1</sup>, Setareh Rafatirad<sup>1</sup>, Houman Homayoun<sup>1</sup>, Liang Zhao<sup>1</sup> and Tinoosh Mohsenin<sup>2</sup><br /><sup>1</sup>George Mason University, US; <sup>2</sup>University of Maryland, Baltimore County, US<br /><em><b>Abstract</b><br />The cache hierarchy was designed to allow CPU cores to process instructions faster by bridging the significant latency gap between the main memory and the processor. In addition, various cache replacement algorithms have been proposed to predict future data and instructions to boost the performance of computer systems. However, recently proposed cache-based Side-Channel Attacks (SCAs) have been shown to effectively exploit such a hierarchical cache design. Cache-based SCAs exploit hardware vulnerabilities to steal secret information from users by observing the cache access patterns of cryptographic applications and are thus emerging as a serious threat to the security of computer systems. Prior works on mitigating cache-based SCAs have mainly focused on cache partitioning techniques and/or randomization of the mapping between main memory and cache. However, such solutions, though effective, require modifications to the processor hardware, which increases the complexity of the architecture design, and are not applicable to current or legacy architectures. In response, this paper proposes a lightweight system- and architecture-level randomization technique to effectively mitigate the impact of side-channel attacks on last-level caches with no hardware redesign overhead for current as well as legacy architectures. To this aim, by carefully adapting the processor frequency and prefetcher operation and adding a proper level of noise to the attackers' cache observations, we attempt to protect critical information from being leaked. The experimental results indicate that the concurrent randomization of frequency and prefetchers can significantly prevent cache-based side-channel attacks with no need for a new cache design. In addition, the proposed randomization and adaptation methodology outperforms state-of-the-art solutions in terms of performance and execution time, reducing the performance overhead from 32.66% to nearly 20%.</em></td> </tr> <tr> <td>12:00</td> <td>10.6.3</td> <td><b>EXTENDING THE RISC-V INSTRUCTION SET FOR HARDWARE ACCELERATION OF THE POST-QUANTUM SCHEME LAC</b><br /><b>Speaker</b>:<br />Tim Fritzmann, TUM, DE<br /><b>Authors</b>:<br />Tim Fritzmann<sup>1</sup>, Georg Sigl<sup>2</sup> and Johanna Sepúlveda<sup>3</sup><br /><sup>1</sup>TUM, DE; <sup>2</sup>TU Munich/Fraunhofer AISEC, DE; <sup>3</sup>Airbus Defence and Space, DE<br /><em><b>Abstract</b><br />The increasing effort in the development of quantum computers represents a high risk for communication systems due to their capability of breaking currently used public-key cryptography. LAC is a lattice-based public-key encryption scheme resistant to traditional and quantum attacks.
It is characterized by small key sizes and low arithmetic complexity. Recent publications have shown practical post-quantum solutions through co-design techniques. However, for LAC, only software implementations have been explored. In this work, we propose an efficient, flexible and time-protected HW/SW co-design architecture for LAC. We present two contributions. First, we develop and integrate hardware accelerators for three LAC performance bottlenecks: the generation of polynomials, polynomial multiplication and error correction. The accelerators were designed to support all post-quantum security levels from 128 to 256 bits. Second, we develop tailored instruction set extensions for LAC on RISC-V and integrate the HW accelerators directly into a RISC-V core. The results show that our architecture for LAC with constant-time error correction improves the performance by a factor of 7.66 for LAC-128, 14.42 for LAC-192, and 13.36 for LAC-256, when compared to the unprotected reference implementation running on RISC-V. The increased performance comes at the cost of increased resource consumption (32,617 LUTs, 11,019 registers, and two DSP slices).</em></td> </tr> <tr> <td style="width:40px;">12:30</td> <td><a href="/date20/conference/session/IP5">IP5-3</a>, 438</td> <td><b>POST-QUANTUM SECURE BOOT</b><br /><b>Speaker</b>:<br />Vinay B. Y. Kumar, Nanyang Technological University (Singapore), SG<br /><b>Authors</b>:<br />Vinay B. Y. Kumar<sup>1</sup>, Naina Gupta<sup>2</sup>, Anupam Chattopadhyay<sup>3</sup>, Michael Kasper<sup>4</sup>, Christoph Krauss<sup>5</sup> and Ruben Niederhagen<sup>5</sup><br /><sup>1</sup>Nanyang Technological University, Singapore, SG; <sup>2</sup>Indraprastha Institute of Information Technology, IN; <sup>3</sup>Nanyang Technological University, SG; <sup>4</sup>Fraunhofer Singapore, SG; <sup>5</sup>Fraunhofer SIT, DE<br /><em><b>Abstract</b><br />A secure boot protocol is fundamental to ensuring the integrity of the trusted computing base of a secure system. The use of digital signature algorithms (DSAs) based on traditional asymmetric cryptography, particularly for secure boot, leaves such systems vulnerable to the threat of quantum computers. This paper presents the first post-quantum secure boot solution, implemented fully as hardware for reasons of security and performance. In particular, this work uses the eXtended Merkle Signature Scheme (XMSS), a hash-based scheme that has been specified as an IETF RFC. The solution has been integrated into a secure SoC platform around RISC-V cores and evaluated on an FPGA; it is shown to be orders of magnitude faster than corresponding hardware/software implementations and to compare competitively with a fully hardware elliptic-curve-DSA-based solution.</em></td> </tr> <tr> <td>12:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="10.7">10.7 Accelerators for Neuromorphic Computing</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 11:00 - 12:30<br /><b>Location / Room:</b> Berlioz</p> <p><b>Chair:</b><br />Michael Niemier, University of Notre Dame, US</p> <p><b>Co-Chair:</b><br />Xunzhao Yin, Zhejiang University, CN</p> <p>In this session, special hardware accelerators based on different technologies for neuromorphic computing will be presented.
These accelerators (i) improve the computing efficiency by using pulse widths to deliver information across memristor crossbars, (ii) enhance the robustness of neuromorphic computing with unary coding and priority mapping, and (iii) explore the modulation of light in transferring information so as to push the performance of computing systems to new limits.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>10.7.1</td> <td><b>A PULSE WIDTH NEURON WITH CONTINUOUS ACTIVATION FOR PROCESSING-IN-MEMORY ENGINES</b><br /><b>Speaker</b>:<br />Shuhang Zhang, TUM, DE<br /><b>Authors</b>:<br />Shuhang Zhang<sup>1</sup>, Bing Li<sup>1</sup>, Hai (Helen) Li<sup>2</sup> and Ulf Schlichtmann<sup>1</sup><br /><sup>1</sup>TUM, DE; <sup>2</sup>Duke University / TUM-IAS, US<br /><em><b>Abstract</b><br />Processing-in-memory engines have been applied successfully to accelerate deep neural networks. For improving computing efficiency, spiking-based platforms are widely utilized. However, spiking-based designs naturally quantize inter-layer signals, leading to performance loss. In addition, the spike mismatch effect makes digital processing an essential part of such designs, impeding direct signal transfer between layers and thus resulting in longer latency. In this paper, we propose a novel neuron design based on pulse width modulation, avoiding the quantization step and bypassing spike mismatch via its continuous activation. The computation latency and circuit complexity can be reduced significantly due to the absence of quantization and digital processing steps, while maintaining competitive performance. Experimental results demonstrate that the proposed neuron design can achieve &gt;100× speedup, and the area and power consumption can be reduced by up to 75% and 25%, respectively, compared with spiking-based designs.</em></td> </tr> <tr> <td>11:30</td> <td>10.7.2</td> <td><b>GO UNARY: A NOVEL SYNAPSE CODING AND MAPPING SCHEME FOR RELIABLE RERAM-BASED NEUROMORPHIC COMPUTING</b><br /><b>Speaker</b>:<br />Li Jiang, Shanghai Jiao Tong University, CN<br /><b>Authors</b>:<br />Chang Ma, Yanan Sun, Weikang Qian, Ziqi Meng, Rui Yang and Li Jiang, Shanghai Jiao Tong University, CN<br /><em><b>Abstract</b><br />Neural network (NN) computing contains a large number of multiply-and-accumulate (MAC) operations, which are the speed bottleneck in traditional von Neumann architectures. Resistive random access memory (ReRAM)-based crossbars are well suited for matrix-vector multiplication. Existing ReRAM-based NNs are mainly based on binary coding for synaptic weights. However, the imperfect fabrication process combined with stochastic filament-based switching leads to resistance variations, which can significantly affect the weights in binary synapses and degrade the accuracy of NNs. Further, as multi-level cells (MLCs) are being developed for reducing hardware overhead, the NN accuracy deteriorates further due to the resistance variations in the binary coding. In this paper, a novel unary coding of synaptic weights is presented to overcome the resistance variations of MLCs and achieve reliable ReRAM-based neuromorphic computing. Priority mapping is also proposed, in compliance with the unary coding, to guarantee high accuracy by mapping bits with lower resistance states to ReRAMs with smaller resistance variations.
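<p>A toy numeric sketch of why equal-significance unary cells tolerate per-cell variation better than binary-weighted cells (cell counts and the multiplicative noise model are illustrative assumptions, not the paper's device model):</p> <pre>
# Compare the readout error of a binary-coded vs. a unary-coded weight
# when every cell's contribution is perturbed by random variation.
import numpy as np

rng = np.random.default_rng(0)
w, levels = 11, 16                         # weight value, max representable + 1

binary = [(w // 2**i) % 2 for i in range(4)]       # LSB..MSB, significance 2^i
unary = [1] * w + [0] * (levels - 1 - w)           # 15 equal-significance cells

def readout_binary(bits, noise):
    # A perturbed MSB shifts the weight by its full significance.
    return sum(b * 2**i * (1 + n) for i, (b, n) in enumerate(zip(bits, noise)))

def readout_unary(cells, noise):
    # Every cell carries weight 1, so variations tend to average out.
    return sum(c * (1 + n) for c, n in zip(cells, noise))

err_b = [abs(readout_binary(binary, rng.normal(0, 0.1, 4)) - w) for _ in range(10000)]
err_u = [abs(readout_unary(unary, rng.normal(0, 0.1, 15)) - w) for _ in range(10000)]
print(f"mean |error| binary: {np.mean(err_b):.3f}, unary: {np.mean(err_u):.3f}")
</pre>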
Our experimental results show that the proposed method incurs less than 0.45% and 5.48% accuracy loss on LeNet (on the MNIST dataset) and VGG16 (on the CIFAR-10 dataset), respectively, while maintaining an acceptable hardware cost.</em></td> </tr> <tr> <td>12:00</td> <td>10.7.3</td> <td><b>LIGHTBULB: A PHOTONIC-NONVOLATILE-MEMORY-BASED ACCELERATOR FOR BINARIZED CONVOLUTIONAL NEURAL NETWORKS</b><br /><b>Authors</b>:<br />Farzaneh Zokaee<sup>1</sup>, Qian Lou<sup>1</sup>, Nathan Youngblood<sup>2</sup>, Weichen Liu<sup>3</sup>, Yiyuan Xie<sup>4</sup> and Lei Jiang<sup>1</sup><br /><sup>1</sup>Indiana University Bloomington, US; <sup>2</sup>University of Pittsburgh, US; <sup>3</sup>Nanyang Technological University, SG; <sup>4</sup>Southwest University, CN<br /><em><b>Abstract</b><br />Although Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art inference accuracy in various intelligent applications, each CNN inference involves millions of expensive floating-point multiply-accumulate (MAC) operations. To energy-efficiently process CNN inferences, prior work proposes an electro-optical accelerator to process power-of-2 quantized CNNs by electro-optical ripple-carry adders and optical binary shifters. The electro-optical accelerator also uses SRAM registers to store intermediate data. However, electro-optical ripple-carry adders and SRAMs seriously limit the operating frequency and inference throughput of the electro-optical accelerator, due to the long critical path of the adder and the long access latency of SRAMs. In this paper, we propose a photonic nonvolatile memory (NVM)-based accelerator, LightBulb, to process binarized CNNs by high-frequency photonic XNOR gates and popcount units. LightBulb also adopts photonic racetrack memory to serve as input/output registers to achieve a high operating frequency. Compared to prior electro-optical accelerators, on average, LightBulb improves the CNN inference throughput by 17× to 173× and the inference throughput per watt by 17.5× to 660×.</em></td> </tr> <tr> <td style="width:40px;">12:30</td> <td><a href="/date20/conference/session/IP5">IP5-4</a>, 863</td> <td><b>ROQ: A NOISE-AWARE QUANTIZATION SCHEME TOWARDS ROBUST OPTICAL NEURAL NETWORKS WITH LOW-BIT CONTROLS</b><br /><b>Speaker</b>:<br />Jiaqi Gu, University of Texas, Austin, US<br /><b>Authors</b>:<br />Jiaqi Gu<sup>1</sup>, Zheng Zhao<sup>1</sup>, Chenghao Feng<sup>1</sup>, Hanqing Zhu<sup>2</sup>, Ray T. Chen<sup>1</sup> and David Z. Pan<sup>1</sup><br /><sup>1</sup>University of Texas, Austin, US; <sup>2</sup>Shanghai Jiao Tong University, CN<br /><em><b>Abstract</b><br />Optical neural networks (ONNs) demonstrate orders-of-magnitude higher speed in deep learning acceleration than their electronic counterparts. However, limited control precision and device variations induce accuracy degradation in practical ONN implementations. To tackle this issue, we propose a quantization scheme that adapts a full-precision ONN to low-resolution voltage controls. Moreover, we propose a protective regularization technique that dynamically penalizes quantized weights based on their estimated noise-robustness, leading to an improvement in noise robustness. Experimental results show that the proposed scheme effectively adapts ONNs to limited-precision controls and device variations.
The resultant four-layer ONN demonstrates higher inference accuracy with lower variance than baseline methods under various control precisions and device noises.</em></td> </tr> <tr> <td style="width:40px;">12:31</td> <td><a href="/date20/conference/session/IP5">IP5-5</a>, 789</td> <td><b>STATISTICAL TRAINING FOR NEUROMORPHIC COMPUTING USING MEMRISTOR-BASED CROSSBARS CONSIDERING PROCESS VARIATIONS AND NOISE</b><br /><b>Speaker</b>:<br />Ying Zhu, TUM, DE<br /><b>Authors</b>:<br />Ying Zhu<sup>1</sup>, Grace Li Zhang<sup>1</sup>, Tianchen Wang<sup>2</sup>, Bing Li<sup>1</sup>, Yiyu Shi<sup>2</sup>, Tsung-Yi Ho<sup>3</sup> and Ulf Schlichtmann<sup>1</sup><br /><sup>1</sup>TUM, DE; <sup>2</sup>University of Notre Dame, US; <sup>3</sup>National Tsing Hua University, TW<br /><em><b>Abstract</b><br />Memristor-based crossbars are an attractive platform to accelerate neuromorphic computing. However, process variations during manufacturing and noise in memristors cause significant accuracy loss if not addressed. In this paper, we propose to model process variations and noise as correlated random variables and incorporate them into the cost function during training. Consequently, the weights after this statistical training become more robust and, together with global variation compensation, provide a stable inference accuracy. Simulation results demonstrate that the mean value and the standard deviation of the inference accuracy can be improved significantly, by up to 54% and 31%, respectively, in a two-layer fully connected neural network.</em></td> </tr> <tr> <td>12:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="11.0">11.0 LUNCHTIME KEYNOTE SESSION</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 13:20 - 13:50<br /><b>Location / Room:</b> </p> <p><b>Chair:</b><br />Gabriela Nicolescu, Polytechnique Montréal, CA</p> <p><b>Co-Chair:</b><br />Luca Ramini, Hewlett Packard Labs, US</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>13:20</td> <td>11.0.1</td> <td><b>MEMORY DRIVEN COMPUTING TO REVOLUTIONIZE THE MEDICAL SCIENCES</b><br /><b>Author</b>:<br />Joachim Schultze, Director Platform for Single Cell Genomics and Epigenomics, German Center for Neurodegenerative Diseases, DE<br /><em><b>Abstract</b><br />Like any other area of our lives, medicine is experiencing the digital revolution. We produce more and more quantitative data in medicine, and therefore we will need significantly more compute power and data storage capability in the near future. Yet, since medicine is inherently decentralized, current compute infrastructures are not built for this. Central cloud storage and centralized supercomputing infrastructures are not helpful in a discipline such as medicine that will always produce data at the edge. Here we need to completely rethink computing. What we require are distributed federated cloud solutions with sufficient memory at the edge to cope with the large sensor data streams that record the medical data of individual patients. Here, memory-driven computing comes in as a perfect solution. Its potential to provide sufficiently large memory at the edge, where data is generated, and to connect these new devices into distributed federated cloud solutions, will be key to driving the digital revolution in medicine.
I will present our own efforts using memory-driven computing in this direction.</em></td> </tr> <tr> <td>13:50</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="11.1">11.1 Special Day on "Silicon Photonics": Advanced Applications</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 14:00 - 15:30<br /><b>Location / Room:</b> Amphithéâtre Jean Prouve</p> <p><b>Chair:</b><br />Olivier Sentieys, University of Rennes, IRISA, INRIA, FR</p> <p><b>Co-Chair:</b><br />Gabriela Nicolescu, Polytechnique Montréal, CA</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>14:00</td> <td>11.1.1</td> <td><b>SYSTEM-LEVEL EVALUATION OF CHIP-SCALE SILICON PHOTONIC NETWORKS FOR EMERGING DATA-INTENSIVE APPLICATIONS</b><br /><b>Speaker</b>:<br />Aditya Narayan, Boston University, US<br /><b>Authors</b>:<br />Aditya Narayan<sup>1</sup>, Yvain Thonnart<sup>2</sup>, Pascal Vivet<sup>2</sup>, Ajay Joshi<sup>1</sup> and Ayse Coskun<sup>1</sup><br /><sup>1</sup>Boston University, US; <sup>2</sup>CEA-Leti, FR<br /><em><b>Abstract</b><br />Emerging data-driven applications such as graph processing are characterized by their excessive memory footprint and abundant parallelism, resulting in high memory bandwidth demand. As the scale of application datasets reaches the order of terabytes, performance limitations due to bandwidth demands are a major concern. Traditional on-chip electrical networks fail to meet such high bandwidth demands due to increased energy-per-bit or physical limitations with pin counts. Silicon photonic networks have emerged as a promising alternative to electrical interconnects, owing to their high-bandwidth and low energy-per-bit communication with negligible data-dependent power. Wide-scale adoption of silicon photonics at the chip level, however, is hampered by its high sensitivity to process and thermal variations, high laser power due to losses along the network, and the power consumption of electrical-optical conversion. Device-level technological innovations to mitigate these issues are promising, yet they do not consider the system-level implications of the applications running on manycore systems with photonic networks. This work aims to bridge the gap between the system-level attributes of applications and the underlying architectural and device-level characteristics of silicon photonic networks to achieve energy-efficient computing. We particularly focus on graph applications, which involve unstructured yet abundant parallel memory accesses that stress the on-chip communication networks, and develop a cross-layer framework to evaluate 2.5D systems with silicon photonic networks. We demonstrate significant energy savings through system-level management using wavelength selection policies and further evaluate architectural design choices on 2.5D systems with photonic networks.</em></td> </tr> <tr> <td>14:30</td> <td>11.1.2</td> <td><b>OSCAR: AN OPTICAL STOCHASTIC COMPUTING ACCELERATOR FOR POLYNOMIAL FUNCTIONS</b><br /><b>Speaker</b>:<br />Hassnaa El-Derhalli, Concordia University, CA<br /><b>Authors</b>:<br />Hassnaa El-Derhalli, Sébastien Le Beux and Sofiène Tahar, Concordia University, CA<br /><em><b>Abstract</b><br />Approximate computing allows design energy efficiency to be traded off against computing accuracy.
Stochastic computing is an approximate computing technique where numbers are represented as probabilities using stochastic bit streams. The serial computation of the bit streams leads to reduced hardware complexity but induces high latency, which is the main limitation of the approach. Silicon photonics has the potential to overcome the processing latency drawback thanks to the high-speed propagation of signals and large bandwidth. However, the implementation of stochastic computing architectures using integrated optics involves high static energy, which calls for adaptable architectures able to meet application-specific requirements. In this paper, we propose a reconfigurable optical accelerator allowing online adaptation of computing accuracy and energy efficiency according to the application requirements. The architecture can be configured to execute i) a 4th-order function for high-accuracy processing or ii) a 2nd-order function for high energy efficiency. Evaluations are carried out using the Gamma correction image processing function. Compared to a static architecture for which accuracy is defined at design time, the proposed architecture leads to a 36.8% energy overhead but increases the range of reachable accuracy by 65%.</em></td> </tr> <tr> <td>15:00</td> <td>11.1.3</td> <td><b>POPSTAR: A ROBUST MODULAR OPTICAL NOC ARCHITECTURE FOR CHIPLET-BASED 3D INTEGRATED SYSTEMS</b><br /><b>Speaker</b>:<br />Yvain Thonnart, CEA-Leti, FR<br /><b>Authors</b>:<br />Yvain Thonnart<sup>1</sup>, Stéphane Bernabe<sup>1</sup>, Jean Charbonnier<sup>1</sup>, César Fuget Totolero<sup>1</sup>, Pierre Tissier<sup>1</sup>, Benoit Charbonnier<sup>1</sup>, Stephane Malhouitre<sup>1</sup>, Damien Saint-Patrice<sup>1</sup>, Myriam Assous<sup>1</sup>, Aditya Narayan<sup>2</sup>, Ayse Coskun<sup>2</sup>, Denis Dutoit<sup>1</sup> and Pascal Vivet<sup>1</sup><br /><sup>1</sup>CEA-Leti, FR; <sup>2</sup>Boston University, US<br /><em><b>Abstract</b><br />Silicon photonics technology is now gaining maturity, with increasing levels of design complexity from devices to large photonic integrated circuits. Close integration of control electronics with 3D assembly of photonics and CMOS opens the way to high-performance computing architectures partitioned in chiplets connected by an optical NoC on silicon photonic interposers. In this paper, we give an overview of our work on optical links and NoCs for manycore systems, from low-level control of photonic devices to high-level system optimization of the optical communications. We detail the POPSTAR architecture (Processors On Photonic Silicon interposer Terascale ARchitecture) with electro-optical interface chiplets, the corresponding nested spiral topology for single-writer multiple-reader links and the associated control electronics, in charge of high-speed drivers, thermal stabilization and handling of the protocol stack, from data integrity to flow-control, routing and arbitration of the optical communications. 
We discuss the strengths and opportunities of this architecture, the shift in system &amp; implementation constraints with respect to previous optical NoC proposals, and the new challenges to be addressed.</em></td> </tr> <tr> <td>15:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="11.2">11.2 Autonomous Systems Design Initiative: Autonomous Cyber-Physical Systems: Modeling and Verification</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 14:00 - 15:30<br /><b>Location / Room:</b> Chamrousse</p> <p><b>Chair:</b><br />Nikos Aréchiga, Toyota Research Institute, US</p> <p><b>Co-Chair:</b><br />Jyotirmoy V. Deshmukh, University of Southern California, US</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>14:00</td> <td>11.2.1</td> <td><b>TRUSTWORTHY AUTONOMY: BEHAVIOR PREDICTION AND VALIDATION</b><br /><b>Author</b>:<br />Katherine Driggs-Campbell, University of Illinois Urbana Champaign, US</td> </tr> <tr> <td>14:30</td> <td>11.2.2</td> <td><b>ON INFUSING LOGICAL REASONING INTO ROBOT LEARNING</b><br /><b>Author</b>:<br />Marco Pavone, Stanford University, US</td> </tr> <tr> <td>15:00</td> <td>11.2.3</td> <td><b>FORMALLY-SPECIFIABLE AGENT BEHAVIOR MODELS FOR AUTONOMOUS VEHICLE TEST GENERATION</b><br /><b>Author</b>:<br />Jonathan DeCastro, Toyota Research Institute, US</td> </tr> <tr> <td>15:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="11.3">11.3 Special Session: Emerging Neural Algorithms and Their Impact on Hardware</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 14:00 - 15:30<br /><b>Location / Room:</b> Autrans</p> <p><b>Chair:</b><br />Ian O’Connor, Ecole Centrale de Lyon, FR</p> <p><b>Co-Chair:</b><br />Michael Niemier, University of Notre Dame, US</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>14:00</td> <td>11.3.1</td> <td><b>ANALOG RESISTIVE CROSSBAR ARRAYS FOR NEURAL NETWORK ACCELERATION</b><br /><b>Author</b>:<br />Martin Frank, IBM, US</td> </tr> <tr> <td>14:30</td> <td>11.3.2</td> <td><b>IN-MEMORY COMPUTING FOR MEMORY AUGMENTED NEURAL NETWORKS</b><br /><b>Authors</b>:<br />X. Sharon Hu<sup>1</sup> and Anand Raghunathan<sup>2</sup><br /><sup>1</sup>University of Notre Dame, US; <sup>2</sup>Purdue University, US</td> </tr> <tr> <td>15:00</td> <td>11.3.3</td> <td><b>HARDWARE CHALLENGES FOR NEURAL RECOMMENDATION SYSTEMS</b><br /><b>Author</b>:<br />Udit Gupta, Harvard University, US</td> </tr> <tr> <td>15:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="11.4">11.4 Reliable in-memory computing</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 14:00 - 15:30<br /><b>Location / Room:</b> Stendhal</p> <p><b>Chair:</b><br />Jean-Philippe Noel, CEA-Leti, FR</p> <p><b>Co-Chair:</b><br />Shahar Kvatinsky, Technion, IL</p> <p>This session deals with the reliability of computing in memories, ranging from new design techniques to improve CNN computing in ReRAM to device-algorithm co-optimization that improves the reliability of ReRAM-based graph processing. Moreover, the session also covers work on improving the reliability of the well-established STT-MRAM and PCM technologies. 
Finally, early works presenting stochastic computing and disruptive image processing techniques based on memristors are also discussed.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>14:00</td> <td>11.4.1</td> <td><b>REBOC: ACCELERATING BLOCK-CIRCULANT NEURAL NETWORKS IN RERAM</b><br /><b>Speaker</b>:<br />Yitu Wang, Fudan University, CN<br /><b>Authors</b>:<br />Yitu Wang<sup>1</sup>, Fan Chen<sup>2</sup>, Linghao Song<sup>2</sup>, C.-J. Richard Shi<sup>3</sup>, Hai (Helen) Li<sup>4</sup> and Yiran Chen<sup>2</sup><br /><sup>1</sup>Fudan University, CN; <sup>2</sup>Duke University, US; <sup>3</sup>University of Washington, US; <sup>4</sup>Duke University / TUM-IAS, US<br /><em><b>Abstract</b><br />Deep neural networks (DNNs) have emerged as a key component in various applications. However, the ever-growing DNN size hinders efficient processing on hardware. To tackle this problem, on the algorithmic side, compressed DNN models are explored, of which block-circulant DNN models are memory-efficient and hardware-friendly; on the hardware side, resistive random-access memory (ReRAM) based accelerators are promising for in-situ processing of DNNs. In this work, we design an accelerator named ReBoc for accelerating block-circulant neural networks in ReRAM to reap the benefits of light-weight DNN models and efficient in-situ processing simultaneously. We propose a novel mapping scheme which utilizes Horizontal Weight Slicing and Intra-Crossbar Weight Duplication to map the block-circulant DNN model onto ReRAM crossbars with significantly improved crossbar utilization. Moreover, two techniques, namely Input Slice Reusing and Input Tile Sharing, are introduced to take advantage of the circulant calculation feature in block-circulant DNN models to reduce data access and buffer size. In ReBoc, a DNN model is executed within an intra-layer processing pipeline and achieves 96× and 8.86× power efficiency improvements over state-of-the-art FPGA and ASIC accelerators for block-circulant neural networks, respectively. Compared to ReRAM-based DNN accelerators, ReBoc achieves an average 4.1× speedup and 2.6× energy reduction.</em></td> </tr> <tr> <td>14:30</td> <td>11.4.2</td> <td><b>GRAPHRSIM: A JOINT DEVICE-ALGORITHM RELIABILITY ANALYSIS FOR RERAM-BASED GRAPH PROCESSING</b><br /><b>Speaker</b>:<br />Chin-Fu Nien, Academia Sinica, TW<br /><b>Authors</b>:<br />Chin-Fu Nien<sup>1</sup>, Yi-Jou Hsiao<sup>2</sup>, Hsiang-Yun Cheng<sup>1</sup>, Cheng-Yu Wen<sup>3</sup>, Ya-Cheng Ko<sup>3</sup> and Che-Ching Lin<sup>3</sup><br /><sup>1</sup>Academia Sinica, TW; <sup>2</sup>National Chiao-Tung University, TW; <sup>3</sup>National Taiwan University, TW<br /><em><b>Abstract</b><br />Graph processing has attracted a lot of interest in recent years, as it plays a key role in analyzing huge datasets. ReRAM-based accelerators provide a promising solution to accelerate graph processing. However, the intrinsic stochastic behavior of ReRAM devices makes computation results unreliable. In this paper, we build a simulation platform to analyze the impact of non-ideal ReRAM devices on the error rates of various graph algorithms. We show that the characteristics of the targeted graph algorithm and the type of ReRAM computation employed greatly affect the error rates. 
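<p>To make this kind of device-to-algorithm analysis concrete, a toy error-rate experiment along the same lines can be written in a few lines of Python. The following is only a sketch under assumed parameters (lognormal conductance variation, a thresholded BFS-style traversal step), not the GraphRSim platform itself:</p> <pre>
import numpy as np

rng = np.random.default_rng(0)

def noisy_crossbar_mvm(G, x, sigma):
    # each conductance deviates lognormally from its programmed value
    G_dev = G * rng.lognormal(mean=0.0, sigma=sigma, size=G.shape)
    return G_dev @ x

A = (rng.random((64, 64)) > 0.9).astype(float)   # sparse adjacency matrix
x = (rng.random(64) > 0.8).astype(float)         # current frontier

ideal = (A @ x) > 0.5                            # exact next frontier
noisy = noisy_crossbar_mvm(A, x, sigma=0.3) > 0.5
print("traversal-step error rate:", np.mean(ideal != noisy))
</pre>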
Using representative graph algorithms as case studies, we demonstrate that our simulation platform can guide chip designers to select better design options and develop new techniques to improve reliability.</em></td> </tr> <tr> <td>15:00</td> <td>11.4.3</td> <td><b>STAIR: HIGH RELIABLE STT-MRAM AWARE MULTI-LEVEL I/O CACHE ARCHITECTURE BY ADAPTIVE ECC ALLOCATION</b><br /><b>Speaker</b>:<br />Hossein Asadi, Sharif University of Technology, IR<br /><b>Authors</b>:<br />Mostafa Hadizadeh, Elham Cheshmikhani and Hossein Asadi, Sharif University of Technology, IR<br /><em><b>Abstract</b><br />Hybrid Multi-Level Cache Architectures (HCAs) are promising solutions for the growing need for high-performance and cost-efficient data storage systems. HCAs employ a highly endurable memory as the first-level cache and a Solid-State Drive (SSD) as the second-level cache. Spin-Transfer Torque Magnetic RAM (STT-MRAM) is one of the most promising candidates for the first-level cache of HCAs because of its high endurance and DRAM-comparable performance, along with non-volatility. However, STT-MRAM faces three major reliability challenges, named Read Disturbance, Write Failure, and Retention Failure. To provide a reliable HCA, the reliability challenges of STT-MRAM should be carefully addressed. To this end, this paper first makes a careful distinction between clean and dirty pages to classify and prioritize their different vulnerabilities. Then, we investigate the distribution of more vulnerable pages in the first-level cache of HCAs over 17 storage workloads. Our observations show that the protection overhead can be significantly reduced by adjusting the protection level of data pages based on their vulnerability. To this end, we propose an STT-MRAM-Aware Multi-Level I/O Cache Architecture (STAIR) to improve HCA reliability by dynamically generating extra-strong Error-Correction Codes (ECCs) for the dirty data pages. STAIR adaptively allocates under-utilized parts of the first-level cache to store these extra ECCs. Our evaluations show that STAIR decreases the data loss probability by five orders of magnitude, on average, with negligible performance overhead (0.12% hit ratio reduction in the worst case) and 1.56% memory overhead for the cache controller.</em></td> </tr> <tr> <td>15:15</td> <td>11.4.4</td> <td><b>EFFECTIVE WRITE DISTURBANCE MITIGATION ENCODING SCHEME FOR HIGH-DENSITY PCM</b><br /><b>Speaker</b>:<br />Muhammad Imran, Sungkyunkwan University, KR<br /><b>Authors</b>:<br />Muhammad Imran, Taehyun Kwon and Joon-Sung Yang, Sungkyunkwan University, KR<br /><em><b>Abstract</b><br />Write Disturbance (WD) is a crucial reliability concern in high-density PCM below 20nm scaling. WD occurs because of the inter-cell heat transfer during a RESET operation. Being dependent on the type of programming pulse and the state of the vulnerable cell, WD is significantly impacted by the data patterns. Existing encoding techniques to mitigate WD reduce the percentage of a single WD-vulnerable pattern in the data. However, it is observed that reducing the frequency of a single bit pattern may not be effective in mitigating WD for certain data patterns. This work proposes a significantly more effective encoding method which minimizes the number of vulnerable cells instead of a single bit pattern. The proposed method mitigates WD both within a word-line and across the bit-lines. 
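<p>As an illustration of the underlying idea, the sketch below chooses between a word and its complement so as to minimize the number of WD-vulnerable cells, breaking ties by bit flips. The vulnerability rule used here (a '1' adjacent to a cell written to '0') and the one-flag-bit encoding are simplifying assumptions for illustration, not the paper's exact scheme:</p> <pre>
# Toy word-line WD-aware encoder: pick the candidate codeword with the
# fewest vulnerable cells, then the fewest bit flips versus old contents.
def vulnerable_cells(bits):
    # a '1' counts as vulnerable when an adjacent cell holds '0'
    return sum(b == 1 and 0 in bits[max(i - 1, 0):i + 2]
               for i, b in enumerate(bits))

def flips(new, old):
    return sum(a != b for a, b in zip(new, old))

def encode(word, old):
    candidates = ((0, word), (1, [1 - b for b in word]))  # (flag, codeword)
    return min(candidates,
               key=lambda c: (vulnerable_cells(c[1]), flips(c[1], old)))

flag, coded = encode([1, 0, 1, 1, 0, 0, 1, 0], old=[0] * 8)
print(flag, coded)   # the flag bit is stored alongside the word
</pre>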
In addition to WD-mitigation, the proposed method encodes the data to minimize the bit flips, thus improving the memory lifetime compared to the conventional WD-mitigation techniques. Our evaluation using SPEC CPU2006 benchmarks shows that the proposed method can reduce the aggregate (word-line+bit-line) WD errors by 42% compared to the existing state-of-the-art (SD-PCM). Compared to the state-of-the-art SD-PCM method, the proposed method improves the average write time, instructions-per-cycle (IPC) and write energy by 12%, 12% and 9%, respectively, by reducing the frequency of verify and correct operations to address WD errors. With the reduction in bit flips, memory lifetime is also improved by 18% to 37% compared to SD-PCM, given an asymmetric cost of the bit flips. By integrating with the orthogonal techniques of SD-PCM, the proposed method can further enhance the performance and energy efficiency.</em></td> </tr> <tr> <td style="width:40px;">15:30</td> <td><a href="/date20/conference/session/IP5">IP5-6</a>, 478</td> <td><b>COMPUTATIONAL RESTRUCTURING: RETHINKING IMAGE PROCESSING USING MEMRISTOR CROSSBAR ARRAYS</b><br /><b>Speaker</b>:<br />Rickard Ewetz, University of Central Florida, US<br /><b>Authors</b>:<br />Baogang Zhang, Necati Uysal and Rickard Ewetz, University of Central Florida, US<br /><em><b>Abstract</b><br />Image processing is a core operation performed on billions of sensor-devices in the Internet of Things (IoT). Emerging memristor crossbar arrays (MCAs) promise to perform matrix-vector multiplication (MVM) with an extremely small energy-delay product, which is the dominating computation within the two-dimensional Discrete Cosine Transform (2D DCT). Earlier studies have directly mapped the digital implementation to MCA based hardware. The drawback is that the series computation is vulnerable to errors. Moreover, the implementation requires the use of large image block sizes, which is known to degrade the image quality. In this paper, we propose to restructure the 2D DCT into an equivalent single linear transformation (or MVM operation). The reconstruction eliminates the series computation and reduces the processed block sizes from NxN to √Nx√N. Consequently, both the robustness to errors and the image quality are improved. Moreover, the latency, power, and area are reduced by 2X while eliminating the storage of intermediate data, and the power and area can be further reduced by up to 62% and 74% using frequency spectrum optimization.</em></td> </tr> <tr> <td style="width:40px;">15:33</td> <td><a href="/date20/conference/session/IP5">IP5-7</a>, 312</td> <td><b>SCRIMP: A GENERAL STOCHASTIC COMPUTING ACCELERATION ARCHITECTURE USING RERAM IN-MEMORY PROCESSING</b><br /><b>Speaker</b>:<br />Saransh Gupta, University of California, San Diego, US<br /><b>Authors</b>:<br />Saransh Gupta<sup>1</sup>, Mohsen Imani<sup>1</sup>, Joonseop Sim<sup>1</sup>, Andrew Huang<sup>1</sup>, Fan Wu<sup>1</sup>, M. Hassan Najafi<sup>2</sup> and Tajana Rosing<sup>1</sup><br /><sup>1</sup>University of California, San Diego, US; <sup>2</sup>University of Louisiana, US<br /><em><b>Abstract</b><br />Stochastic computing (SC) reduces the complexity of computation by representing numbers with long independent bit-streams. However, increasing performance in SC comes with an increase in area and loss in accuracy. Processing in memory (PIM) with non-volatile memories (NVMs) computes data in-place, while having high memory density and supporting bit-parallel operations with low energy. 
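<p>For readers unfamiliar with the SC encoding this abstract builds on: in the unipolar convention a value in [0,1] becomes a Bernoulli bit-stream, and a single AND gate multiplies two independent streams. The Python sketch below shows only this standard building block; it says nothing about SCRIMP's ReRAM implementation:</p> <pre>
import numpy as np

rng = np.random.default_rng(1)
N = 4096                       # stream length: longer streams, lower error

def to_stream(p):
    # stochastic number generation: bit i is 1 with probability p
    return rng.binomial(1, p, size=N).astype(bool)

a, b = 0.75, 0.40
product = np.logical_and(to_stream(a), to_stream(b))  # AND gate = multiplier
print(product.mean())                                 # close to 0.30 = a * b
</pre>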
In this paper, we propose SCRIMP for stochastic computing acceleration with resistive RAM (ReRAM) in-memory processing, which enables SC in memory. SCRIMP can be used for a wide range of applications. It supports all SC encodings and operations in memory. It maximizes the performance and energy efficiency of implementing SC by introducing novel in-memory parallel stochastic number generation and efficient implication-based logic in memory. To show the efficiency of our stochastic architecture, we implement image processing on the proposed hardware.</em></td> </tr> <tr> <td>15:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="11.5">11.5 Compile time and virtualization support for embedded system design</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 14:00 - 15:30<br /><b>Location / Room:</b> Bayard</p> <p><b>Chair:</b><br />Nicola Bombieri, Università di Verona, IT</p> <p><b>Co-Chair:</b><br />Rodolfo Pellizzoni, University of Waterloo, CA</p> <p>The session leverages compiler support and novel architectural features, such as virtualization extensions and emerging memory structures, to optimize the design flow of modern embedded systems.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>14:00</td> <td>11.5.1</td> <td><b>UNIFIED THREAD- AND DATA-MAPPING FOR MULTI-THREADED MULTI-PHASE APPLICATIONS ON SPM MANY-CORES</b><br /><b>Speaker</b>:<br />Anuj Pathania, National University of Singapore, SG<br /><b>Authors</b>:<br />Vanchinathan Venkataramani<sup>1</sup>, Anuj Pathania<sup>2</sup> and Tulika Mitra<sup>2</sup><br /><sup>1</sup>NUS, SG; <sup>2</sup>National University of Singapore, SG<br /><em><b>Abstract</b><br />Scratchpad Memories (SPMs) are more scalable than caches as they offer better performance with lower power and area overheads. This scalability advocates their suitability as on-chip memory in many-cores. However, SPM many-cores delegate the responsibility of thread- and data-mapping to the software. The mapping is especially challenging in the case of multi-threaded multi-phase applications. Threads from these applications exhibit both inter- and intra-phase data-sharing patterns. These patterns intricately intertwine thread- and data-mapping across phases. The quality of the accompanying mapping is the key to extracting application performance on SPM many-cores. The state-of-the-art framework for SPM many-cores performs thread- and data-mapping independently. Furthermore, it can only operate with single-phase multi-threaded applications. In this work, we are the first to propose a unified thread- and data-mapping framework for NoC-based SPM many-cores executing multi-threaded multi-phase applications. Experimental evaluations show, on average, a 1.36x performance improvement over the state-of-the-art framework for multi-threaded multi-phase applications.</em></td> </tr> <tr> <td>14:30</td> <td>11.5.2</td> <td><b>GENERALIZED DATA PLACEMENT STRATEGIES FOR RACETRACK MEMORIES</b><br /><b>Speaker</b>:<br />Asif Ali Khan, Technische Universität Dresden, DE<br /><b>Authors</b>:<br />Asif Ali Khan, Andres Goens, Fazal Hameed and Jeronimo Castrillon, Technische Universität Dresden, DE<br /><em><b>Abstract</b><br />Ultra-dense non-volatile racetrack memories (RTMs) have been investigated at various levels in the memory hierarchy for improved performance and reduced energy consumption. 
However, the innate shift operations in RTMs hinder their applicability to replace low-latency on-chip memories. Recent research has demonstrated that intelligent placement of memory objects in RTMs can significantly reduce the number of shifts with no hardware overhead, albeit for specific system setups. However, existing placement strategies may lead to sub-optimal performance when applied to different architectures. In this paper, we look at generalized data placement mechanisms that improve upon existing ones by taking into account the underlying memory architecture and the timing and liveliness information of memory objects. We propose a novel heuristic and a formulation using genetic algorithms that optimize key performance parameters. We show that, on average, our generalized approach improves the number of shifts, performance and energy consumption by 4.3×, 46% and 55%, respectively, compared to the state-of-the-art.</em></td> </tr> <tr> <td>15:00</td> <td>11.5.3</td> <td><b>ARM-ON-ARM: LEVERAGING VIRTUALIZATION EXTENSIONS FOR FAST VIRTUAL PLATFORMS</b><br /><b>Speaker</b>:<br />Lukas Jünger, RWTH Aachen University, DE<br /><b>Authors</b>:<br />Lukas Jünger<sup>1</sup>, Jan Luca Malte Bölke<sup>2</sup>, Stephan Tobies<sup>2</sup>, Rainer Leupers<sup>1</sup> and Andreas Hoffmann<sup>2</sup><br /><sup>1</sup>RWTH Aachen University, DE; <sup>2</sup>Synopsys GmbH, DE<br /><em><b>Abstract</b><br />Virtual Platforms (VPs) are an essential enabling technology in the System-on-a-Chip (SoC) development cycle. They are used for early software development and hardware/software codesign. However, since virtual prototyping is limited by simulation performance, improving the simulation speed of VPs has been an active research topic for years. Different strategies have been proposed, such as fast instruction set simulation using Dynamic Binary Translation (DBT). But even fast simulators do not reach native execution speed. They do, however, allow executing rich Operating System (OS) kernels, which is typically infeasible when another OS is already running. Executing multiple OSs on shared physical hardware is typically accomplished by using virtualization, which has a long history on x86 hardware. It enables encapsulated, native code execution on the host processor and has been extensively used in data centers, where many users share hardware resources. For embedded systems, virtualization has only recently become available. On ARM processors, virtualization was introduced with the ARM Virtualization Extensions for the ARMv7 architecture. Since virtualization allows native guest code execution, near-native execution speeds can be reached. In this work, we present a VP containing a novel ARMv8 SystemC Transaction Level Modeling 2.0 (TLM) compatible processor model. The model leverages the ARM Virtualization Extensions (VE) via the Linux Kernel-based Virtual Machine (KVM) to execute the target software natively on an ARMv8 host. To enable the integration of the processor model into a loosely-timed VP, we developed an accurate instruction counting mechanism using the ARM Performance Monitors Extension (PMU). The requirements for integrating the processor model into a VP and the integration process are detailed in this work. 
Our evaluations show that speedups of up to 2.57x over a state-of-the-art DBT-based simulator can be achieved using our processor model on ARMv8 hardware.</em></td> </tr> <tr> <td style="width:40px;">15:30</td> <td><a href="/date20/conference/session/IP5">IP5-8</a>, 597</td> <td><b>TDO-CIM: TRANSPARENT DETECTION AND OFFLOADING FOR COMPUTATION IN-MEMORY</b><br /><b>Speaker</b>:<br />Lorenzo Chelini, Eindhoven University of Technology, NL<br /><b>Authors</b>:<br />Kanishkan Vadivel<sup>1</sup>, Lorenzo Chelini<sup>2</sup>, Ali BanaGozar<sup>1</sup>, Gagandeep Singh<sup>2</sup>, Stefano Corda<sup>2</sup>, Roel Jordans<sup>1</sup> and Henk Corporaal<sup>1</sup><br /><sup>1</sup>Eindhoven University of Technology, NL; <sup>2</sup>IBM Research - Zurich, CH<br /><em><b>Abstract</b><br />Computation in-memory is a promising non-von Neumann approach aiming to completely eliminate data transfer to and from the memory subsystem. Although many architectures have been proposed, compiler support for such architectures is still lagging behind. In this paper, we close this gap by proposing an end-to-end compilation flow for in-memory computing based on the LLVM compiler infrastructure. Starting from sequential code, our approach automatically detects, optimizes, and offloads kernels suitable for in-memory acceleration. We demonstrate our compiler tool-flow on the PolyBench/C benchmark suite and evaluate the benefits of our proposed in-memory architecture simulated in Gem5 by comparing it with a state-of-the-art von Neumann architecture.</em></td> </tr> <tr> <td style="width:40px;">15:33</td> <td><a href="/date20/conference/session/IP5">IP5-9</a>, 799</td> <td><b>BACKFLOW: BACKWARD EDGE CONTROL FLOW ENFORCEMENT FOR LOW END ARM MICROCONTROLLERS</b><br /><b>Speaker</b>:<br />Cyril Bresch, LCIS, FR<br /><b>Authors</b>:<br />Cyril Bresch<sup>1</sup>, David Hély<sup>2</sup> and Roman Lysecky<sup>3</sup><br /><sup>1</sup>LCIS, FR; <sup>2</sup>LCIS - Grenoble INP, FR; <sup>3</sup>University of Arizona, US<br /><em><b>Abstract</b><br />This paper presents BackFlow, a compiler-based toolchain that enforces indirect backward edge control flow integrity for low-end ARM Cortex-M microprocessors. BackFlow is implemented within the Clang/LLVM compiler and supports the ARM instruction set and its subset Thumb. The control flow integrity generated by the compiler relies on a bitmap, where each set bit indicates a valid pointer destination. The efficiency of the framework is benchmarked using an STM32 NUCLEO F446RE microcontroller. 
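<p>The bitmap check described in this abstract can be modelled in a few lines. The sketch below is a host-side Python model with assumed parameters (4-byte target granularity, a byte-indexed bitmap); the real enforcement is compiled instrumentation on the Cortex-M, not this code:</p> <pre>
class ReturnBitmap:
    GRANULE = 4  # assumed code-address granularity

    def __init__(self, base, size):
        self.base = base
        self.bits = bytearray(size // self.GRANULE // 8 + 1)

    def _locate(self, addr):
        off = (addr - self.base) // self.GRANULE
        return off // 8, 2 ** (off % 8)

    def mark_valid(self, addr):   # set by the compiler for legal targets
        byte, mask = self._locate(addr)
        self.bits[byte] |= mask

    def check(self, addr):        # consulted before every backward edge
        byte, mask = self._locate(addr)
        if not self.bits[byte] & mask:
            raise RuntimeError(f"CFI violation: return to {addr:#x}")

bm = ReturnBitmap(base=0x08000000, size=512 * 1024)
bm.mark_valid(0x08001234)
bm.check(0x08001234)              # passes; any unmarked address raises
</pre>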
The obtained results show that the control flow integrity solution incurs an execution time overhead ranging from 1.5% to 4.5%.</em></td> </tr> <tr> <td>15:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="11.6">11.6 Aging: estimation and mitigation</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 14:00 - 15:30<br /><b>Location / Room:</b> Lesdiguières</p> <p><b>Chair:</b><br />Arnaud Virazel, Université de Montpellier / LIRMM, FR</p> <p><b>Co-Chair:</b><br />Lorena Anghel, University Grenoble-Alpes, FR</p> <p>This session presents improvements in aging estimation for emerging technologies and shows how to take these reliability aspects into account during power grid design and FPGA floorplanning.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>14:00</td> <td>11.6.1</td> <td><b>IMPACT OF NBTI AGING ON SELF-HEATING IN NANOWIRE FET</b><br /><b>Speaker</b>:<br />Hussam Amrouch, Karlsruhe Institute of Technology (KIT), DE<br /><b>Authors</b>:<br />Om Prakash<sup>1</sup>, Hussam Amrouch<sup>1</sup>, Sanjeev Kumar Manhas<sup>2</sup> and Joerg Henkel<sup>1</sup><br /><sup>1</sup>Karlsruhe Institute of Technology, DE; <sup>2</sup>IIT Roorkee, IN<br /><em><b>Abstract</b><br />This is the first work that investigates the impact of Negative Bias Temperature Instability (NBTI) on the Self-Heating (SH) phenomena in Silicon Nanowire Field-Effect Transistors (SiNW-FETs). We investigate the individual as well as joint impact of NBTI and SH on pSiNW-FETs and demonstrate that NBTI-induced traps mitigate SH effects due to reduced current densities. Our Technology CAD (TCAD)-based SiNW-FET device is calibrated against experimental data. It accounts for thermodynamic and hydrodynamic effects in 3-D nanostructures for accurate modeling of carrier transport mechanisms. Our analysis focuses on how the lattice temperature, thermal resistance and thermal capacitance of pSiNW-FETs are affected by NBTI, demonstrating that accurate self-heating modeling necessitates considering the effects that NBTI aging has over time. Hence, NBTI and SH effects need to be jointly and not individually modeled. Our evaluation shows that an individual modeling of NBTI and SH effects leads to a noticeable overestimation of the overall induced delay increase in circuits due to the impact of NBTI traps on SH mitigation. Hence, it is necessary to model NBTI and SH effects jointly in order to estimate efficient (i.e. small, yet sufficient) timing guardbands that protect circuits against timing violations, which will occur at runtime due to delay increases induced by aging and self-heating.</em></td> </tr> <tr> <td>14:30</td> <td>11.6.2</td> <td><b>POWERPLANNINGDL: RELIABILITY-AWARE FRAMEWORK FOR ON-CHIP POWER GRID DESIGN USING DEEP LEARNING</b><br /><b>Speaker</b>:<br />Sukanta Dey, IIT Guwahati, IN<br /><b>Authors</b>:<br />Sukanta Dey, Sukumar Nandi and Gaurav Trivedi, IIT Guwahati, IN<br /><em><b>Abstract</b><br />With the increase in the complexity of chip designs, VLSI physical design has become a time-consuming, iterative design process. Power planning is that part of the floorplanning in VLSI physical design where power grid networks are designed in order to provide adequate power to all the underlying functional blocks. 
Power planning also requires multiple iterative steps to create the power grid network while satisfying the allowed worst-case IR drop and Electromigration (EM) margin. For the first time, this paper introduces a Deep Learning (DL)-based framework to approximately predict the initial design of the power grid network, considering different reliability constraints. The proposed framework reduces many iterative design steps and speeds up the total design cycle. A neural network-based multi-target regression technique is used to create the DL model. Features are extracted, and a training dataset is generated from the floorplans of power grid designs extracted from an IBM processor. The DL model is trained using the generated dataset. The proposed DL-based framework is validated using a new set of power grid specifications (obtained by perturbing the designs used in the training phase). The results show that the predicted power grid design is close to the original design, with a minimal prediction error (~2%). The proposed DL-based approach also improves the design cycle time significantly, with a speedup of ~6X for standard power grid benchmarks.</em></td> </tr> <tr> <td>15:00</td> <td>11.6.3</td> <td><b>AN EFFICIENT MILP-BASED AGING-AWARE FLOORPLANNER FOR MULTI-CONTEXT COARSE-GRAINED RUNTIME RECONFIGURABLE FPGAS</b><br /><b>Speaker</b>:<br />Carl Sechen, University of Texas at Dallas, US<br /><b>Authors</b>:<br />Bo Hu, Mustafa Shihab, Yiorgos Makris, Benjamin Carrion Schaefer and Carl Sechen, University of Texas at Dallas, US<br /><em><b>Abstract</b><br />Shrinking transistor sizes are jeopardizing the reliability of runtime reconfigurable Field Programmable Gate Arrays (FPGAs), making them increasingly sensitive to aging effects such as Negative Bias Temperature Instability (NBTI). This paper introduces a reliability-aware floorplanner which is tailored to multi-context, coarse-grained, runtime reconfigurable architectures (CGRRAs) and seeks to extend their Mean Time to Failure (MTTF) by balancing the usage of processing elements (PEs). The proposed method is based on a Mixed Integer Linear Programming (MILP) formulation, the solution to which produces appropriately-balanced mappings of workload to PEs on the reconfigurable fabric, thereby mitigating aging-induced lifetime degradation. Results demonstrate that, as compared to the default reliability-unaware floorplanning solutions, the proposed method achieves an average MTTF increase of 2.5X without introducing any performance degradation.</em></td> </tr> <tr> <td style="width:40px;">15:30</td> <td><a href="/date20/conference/session/IP5">IP5-10</a>, 119</td> <td><b>DELAY SENSITIVITY POLYNOMIALS BASED DESIGN-DEPENDENT PERFORMANCE MONITORS FOR WIDE OPERATING RANGES</b><br /><b>Speaker</b>:<br />Ruikai Shi, State Key Laboratory of Computer Architecture, ICT, CAS; University of Chinese Academy of Sciences, CN<br /><b>Authors</b>:<br />Ruikai Shi<sup>1</sup>, Liang Yang<sup>2</sup> and Hao Wang<sup>2</sup><br /><sup>1</sup>State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, CN; <sup>2</sup>Loongson Technology Corporation Limited, CN<br /><em><b>Abstract</b><br />The downsizing of CMOS technology makes circuit performance more sensitive to on-chip parameter variations. The previously proposed design-dependent ring oscillator (DDRO) method provides an efficient way to monitor circuit performance at runtime. 
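<p>For context, a DDRO-style monitor approximates the delay of a monitored path as a first-order expansion in the parameter deviations, one sensitivity coefficient per parameter. The sketch below contrasts that linear model with the higher-order polynomial form this paper generalizes it to; all coefficient values are invented purely for illustration:</p> <pre>
# Linear (DDRO-style) vs. polynomial (DDPM-style) delay model in the
# supply-voltage deviation dV; coefficients here are illustrative only.
def delay_linear(dV, d0=1.0, s1=-0.9):
    return d0 * (1 + s1 * dV)

def delay_poly(dV, d0=1.0, coeffs=(-0.9, 2.5, -6.0)):  # orders 1..3
    return d0 * (1 + sum(c * dV ** (k + 1) for k, c in enumerate(coeffs)))

for dV in (-0.20, -0.10, 0.00, 0.10):
    print(f"dV={dV:+.2f}  linear={delay_linear(dV):.3f}  "
          f"poly={delay_poly(dV):.3f}")
</pre>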
However, the linear delay sensitivity expression may be inadequate, especially over a wide range of operating conditions. To overcome this, a new design-dependent performance monitor (DDPM) method is proposed in this work, which formulates the delay sensitivity as high-order polynomials, making it possible to accurately track nonlinear timing behavior over wide operating ranges. A 28nm technology is used for design evaluation, and a very low error rate is achieved in the circuit performance monitoring comparison.</em></td> </tr> <tr> <td style="width:40px;">15:31</td> <td><a href="/date20/conference/session/IP5">IP5-11</a>, 191</td> <td><b>MITIGATION OF SENSE AMPLIFIER DEGRADATION USING SKEWED DESIGN</b><br /><b>Speaker</b>:<br />Daniel Kraak, Delft University of Technology, NL<br /><b>Authors</b>:<br />Daniel Kraak<sup>1</sup>, Mottaqiallah Taouil<sup>1</sup>, Said Hamdioui<sup>1</sup>, Pieter Weckx<sup>2</sup>, Stefan Cosemans<sup>2</sup> and Francky Catthoor<sup>2</sup><br /><sup>1</sup>Delft University of Technology, NL; <sup>2</sup>IMEC, BE<br /><em><b>Abstract</b><br />Designers typically add design margins to semiconductor memories to compensate for aging. However, the aging impact increases with technology downscaling, leading to the need for higher margins. This results in a negative impact on area, yield, performance, and power consumption. As an alternative, mitigation schemes can be developed to reduce such impact. This paper proposes a mitigation scheme for the memory's sense amplifier (SA); the scheme is based on creating a skew in the relative strengths of the SA's cross-coupled inverters during design. The skew is compensated by aging due to unbalanced workloads. As a result, the impact of aging on the SA is reduced. To validate the mitigation scheme, the degradation of the sense amplifier is analyzed for several workloads. The experimental results show that the proposed mitigation scheme reduces the degradation of the sense amplifier's critical figure of merit, the offset voltage, by up to 26%.</em></td> </tr> <tr> <td>15:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="11.7">11.7 System Level Security</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 14:00 - 15:30<br /><b>Location / Room:</b> Berlioz</p> <p><b>Chair:</b><br />Benoit Pascal, Université de Montpellier, FR</p> <p><b>Co-Chair:</b><br />Hely David, University Grenoble Alpes, FR</p> <p>The session focuses on system-level security, especially authentication. The papers span memory authentication and group-of-users authentication, with a focus on IoT applications.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>14:00</td> <td>11.7.1</td> <td><b>AMSA: ADAPTIVE MERKLE SIGNATURE ARCHITECTURE</b><br /><b>Speaker</b>:<br />Emanuel Regnath, TUM, DE<br /><b>Authors</b>:<br />Emanuel Regnath and Sebastian Steinhorst, TUM, DE<br /><em><b>Abstract</b><br />Hash-based signatures (HBS) are promising candidates for quantum-secure signatures on embedded IoT devices because they only use fast integer math, are well understood, produce small public keys, and offer many design parameters. However, HBS can only sign a limited number of messages and produce, similar to most post-quantum schemes, large signatures of several kilobytes. In this paper, we explore possibilities to reduce the size of the signatures by 1) 
improving the Winternitz One-Time Signature with a more efficient encoding and 2) offloading auxiliary data to a gateway. We show that for similar security and performance, our approach produces 2.6% smaller signatures in general and up to 17.3% smaller signatures for the sender compared to the related approaches LMS and XMSS. Furthermore, our open-source implementation supports a wider set of parameters, allowing the scheme to be tailored to the available resources of an embedded device, which is an important factor in overcoming the security challenges in IoT.</em></td> </tr> <tr> <td>14:30</td> <td>11.7.2</td> <td><b>DISSECT: DYNAMIC SKEW-AND-SPLIT TREE FOR MEMORY AUTHENTICATION</b><br /><b>Speaker</b>:<br />Lam Siew-Kei, Nanyang Technological University, SG<br /><b>Authors</b>:<br />Saru Vig<sup>1</sup>, Rohan Juneja<sup>2</sup> and Siew Kei Lam<sup>1</sup><br /><sup>1</sup>Nanyang Technological University, SG; <sup>2</sup>Qualcomm, IN<br /><em><b>Abstract</b><br />Memory integrity trees are widely used to protect external memories in embedded systems against replay, splicing and spoofing attacks. However, existing methods often result in high performance overhead that is proportional to the height of the tree. Reducing the height of the integrity tree by increasing its arity, however, leads to frequent overflowing of the counters that are used for encryption in the tree. We will show that increasing the tree arity of a widely-used integrity tree from 2 to 8 can result in an over 200% increase in memory authentication overhead for some benchmark applications, despite the reduction in tree height. In this paper, we propose DISSECT, a memory authentication framework which utilizes a dynamic memory integrity tree that can adapt to the memory access patterns of the application by progressively adjusting the tree height and arity in order to significantly reduce performance overhead. This is achieved by 1) initializing an integrity tree structure with the largest arity possible considering the performance impact due to counter overflow, 2) dynamically skewing the tree such that the more frequently accessed memory locations are positioned closer to the tree root (overcoming the tree height problem), and 3) dynamically splitting the tree at nodes with counters that are about to overflow (overcoming the counter overflow problem). Experiments undertaken using Multi2Sim on benchmarks from SPEC-CPU2006, SPLASH-2, and PARSEC demonstrate the performance benefits of our proposed memory integrity tree.</em></td> </tr> <tr> <td>15:00</td> <td>11.7.3</td> <td><b>DESIGN-FLOW METHODOLOGY FOR SECURE GROUP ANONYMOUS AUTHENTICATION</b><br /><b>Speaker</b>:<br />Michel Kinsy, Boston University, US<br /><b>Authors</b>:<br />Rashmi Agrawal<sup>1</sup>, Lake Bu<sup>2</sup>, Eliakin del Rosario<sup>1</sup> and Michel Kinsy<sup>1</sup><br /><sup>1</sup>Boston University, US; <sup>2</sup>Draper Lab, US<br /><em><b>Abstract</b><br />In heterogeneous distributed systems, the computing devices and software components often come from different providers and have different security, trust, and privacy levels. In many of these systems, the need frequently arises to (i) control the access to services and resources granted to the individual devices or components in a context-aware manner, and (ii) establish and enforce data sharing policies that preserve the privacy of the critical information on end-users. 
In essence, the need is to simultaneously authenticate and anonymize an entity or device, two seemingly contradictory goals. The design challenge is further complicated by potential security problems such as man-in-the-middle attacks, hijacked devices, and counterfeits. In this work, we present a system design flow for a trustworthy group anonymous authentication protocol (GAAP), which not only fulfills the desired functionality for authentication and privacy, but also provides strong security guarantees.</em></td> </tr> <tr> <td style="width:40px;">15:30</td> <td><a href="/date20/conference/session/IP5">IP5-12</a>, 708</td> <td><b>BLOCKCHAIN TECHNOLOGY ENABLED PAY PER USE LICENSING APPROACH FOR HARDWARE IPS</b><br /><b>Speaker</b>:<br />Krishnendu Guha, University of Calcutta, IN<br /><b>Authors</b>:<br />Krishnendu Guha, Debasri Saha and Amlan Chakrabarti, University of Calcutta, IN<br /><em><b>Abstract</b><br />The present era is witnessing the reuse of hardware IPs to reduce cost. As trustworthiness is an essential factor, designers prefer to use hardware IPs which performed effectively in the past but, at the same time, are still active and did not age. In such scenarios, pay-per-use licensing schemes suit both producers and users best. Existing pay-per-use licensing mechanisms consider a centralized third party, which may not be trustworthy. Hence, we seek refuge in blockchain technology to eliminate such third parties and facilitate a transparent and automated pay-per-use licensing mechanism. A blockchain is a distributed public ledger whose records are added based on peer review and majority consensus of its participants, and whose records cannot be tampered with or modified later. Smart contracts are deployed to facilitate the mechanism. This work also addresses dynamic pricing of hardware IPs based on trustworthiness and aging, factors not considered in the existing literature. A security analysis of the proposed mechanism is provided. Performance evaluation is carried out based on the gas usage of the Ethereum Solidity test environment, along with a cost analysis based on lifetime and related user ratings.</em></td> </tr> <tr> <td>15:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="11.8">11.8 Special Session: Self-aware, biologically-inspired adaptive hardware systems for ultimate dependability and longevity</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 14:00 - 15:30<br /><b>Location / Room:</b> Exhibition Theatre</p> <p><b>Chair:</b><br />Martin A Trefzer, University of York, GB</p> <p><b>Co-Chair:</b><br />Andy M. Tyrrell, University of York, GB</p> <p>State-of-the-art electronic design allows the integration of complex electronic systems comprising thousands of high-level functions on a single chip. This has become feasible because of the combination of atomic-scale semiconductor technology, allowing VLSI of billions of transistors, and EDA tools that can handle their useful application and integration by following a strictly hierarchical design methodology. This results in many layers of abstraction within a system that make it implementable, verifiable and, ultimately, explainable. However, while many layers of abstraction maximise the likelihood of a system functioning correctly, they can prevent a design from making full use of the capabilities of current technology, 
making systems brittle at a time when NoC- and SoC-based implementations are the only way to increase compute capabilities as clock speed limits are reached, devices are affected by variability and ageing, and heat-dissipation limits impose "dark silicon" constraints. Design challenges of electronic systems are no longer driven by making designs smaller but by creating systems that are ultra-low power, resilient and autonomous in their adaptation to anomalies including faults, timing violations and performance degradation. This gives rise to the idea of self-aware hardware, capable of adaptive behaviours or features taking inspiration from, e.g., biological systems, learning algorithms, factory processes. The challenge is to adopt and implement these concepts while achieving a "next-generation" kind of electronic system which is considered at least as useful and trustworthy as its "classical" counterpart, plus additional essential features for future system design and operation. The goal of this Special Session is to present research from world-leading experts addressing state-of-the-art techniques and devices demonstrating the efficacy of concepts of self-awareness, adaptivity and bio-inspiration in the context of real-world hardware systems and applications, with a focus on autonomous resource management at runtime, robustness and performance, and new computing architectures in embedded hardware systems.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>14:00</td> <td>11.8.1</td> <td><b>EMBEDDED SOCIAL INSECT-INSPIRED INTELLIGENCE NETWORKS FOR SYSTEM-LEVEL RUNTIME MANAGEMENT</b><br /><b>Speaker</b>:<br />Matthew R. P. Rowlings, University of York, GB<br /><b>Authors</b>:<br />Matthew Rowlings, Andy Tyrrell and Martin Albrecht Trefzer, University of York, GB<br /><em><b>Abstract</b><br />Large-scale distributed computing architectures such as systems-on-chip or many-core devices offer advantages over monolithic or centralised single-core systems in terms of speed, power/thermal performance and fault tolerance. However, these are not implicit properties of such systems, and runtime management at the software or hardware level is required to unlock these features. Biological systems naturally present such properties and are also adaptive and scalable. Considering how these can be similarly achieved in hardware may therefore be beneficial. We present Social Insect behaviours as a suitable model for enabling autonomous runtime management (RTM) in many-core architectures. The emergent properties sought are self-organisation of task mapping and system-level fault tolerance. For example, large social insect colonies accomplish a wide range of tasks to build and maintain the colony. Many thousands of individuals, each possessing relatively little intelligence, contribute without any centralised control. Hence, it would seem that social insects have evolved a scalable approach to task allocation, load balancing and robustness that can be applied to large many-core computing systems. Based on this, a self-optimising and adaptive, yet fundamentally scalable, design approach for many-core systems based on the emergent behaviours of social-insect colonies is developed. 
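<p>One textbook social-insect mechanism that such decision engines can draw on is the response-threshold rule, in which an individual with threshold theta engages in a task with stimulus s with probability s^2/(s^2 + theta^2). The sketch below shows only this classic rule with assumed parameter values; it is an illustration, not the authors' exact decision engine:</p> <pre>
import random

def engage_probability(stimulus, theta):
    # response-threshold rule: busier tasks recruit more workers,
    # with no centralised control
    return stimulus ** 2 / (stimulus ** 2 + theta ** 2)

random.seed(42)
thresholds = [random.uniform(0.2, 2.0) for _ in range(1000)]  # heterogeneous colony
stimulus = 0.8
workers = sum(engage_probability(stimulus, t) > random.random()
              for t in thresholds)
print(workers, "of 1000 individuals take up the task")
</pre>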
Experiments capture the decision-making processes of each colony member that give rise to such high-level behaviours and embed these decision engines within the routers of the many-core system.</em></td> </tr> <tr> <td>14:20</td> <td>11.8.2</td> <td><b>OPTIMISING RESOURCE MANAGEMENT FOR EMBEDDED MACHINE LEARNING</b><br /><b>Authors</b>:<br />Lei Xun, Long Tran-Thanh, Bashir Al-Hashimi and Geoff Merrett, University of Southampton, GB</td> </tr> <tr> <td>14:40</td> <td>11.8.3</td> <td><b>EMERGENT CONTROL OF MPSOC OPERATION BY A HIERARCHICAL SUPERVISOR / REINFORCEMENT LEARNING APPROACH</b><br /><b>Speaker</b>:<br />Florian Maurer, TUM, DE<br /><b>Authors</b>:<br />Florian Maurer<sup>1</sup>, Andreas Herkersdorf<sup>1</sup>, Bryan Donyanavard<sup>2</sup>, Amir M. Rahmani<sup>2</sup> and Nikil Dutt<sup>3</sup><br /><sup>1</sup>TUM, DE; <sup>2</sup>University of California, Irvine, US; <sup>3</sup>University of California, US<br /><em><b>Abstract</b><br />MPSoCs increasingly depend on adaptive resource management strategies at runtime for efficient utilization of resources when executing complex application workloads. In particular, conflicting demands for adequate computation performance and power-/energy-efficiency constraints make desired application goals hard to achieve. We present a hierarchical, cross-layer hardware/software resource manager capable of adapting to changing workloads and system dynamics with zero initial knowledge. The manager uses rule-based reinforcement learning classifier tables (LCTs) with an archive-based backup policy as leaf controllers. The LCTs directly manipulate and enforce MPSoC building block operation parameters in order to explore and optimize potentially conflicting system requirements (e.g., meeting a performance target while staying within the power constraint). A supervisor translates system requirements and application goals into per-LCT objective functions (e.g., core instructions-per-second (IPS)). Thus, the supervisor manages the possibly emergent behavior of the low-level LCT controllers in response to 1) switching between operation strategies (e.g., maximize performance vs. minimize power); and 2) changing application requirements. This hierarchical manager leverages the dual benefits of a software supervisor (enabling flexibility), together with hardware learners (allowing quick and efficient optimization). Experiments on an FPGA prototype confirmed the ability of our approach to identify optimized MPSoC operation parameters at runtime while strictly obeying given power constraints.</em></td> </tr> <tr> <td>15:00</td> <td>11.8.4</td> <td><b>ASTROBYTE: A MULTI-FPGA ARCHITECTURE FOR ACCELERATED SIMULATIONS OF SPIKING ASTROCYTE NEURAL NETWORKS</b><br /><b>Speaker</b>:<br />Shvan Karim, Ulster University, GB<br /><b>Authors</b>:<br />Shvan Haji Karim, Jim Harkin, Liam McDaid, Bryan Gardiner and Junxiu Liu, Ulster University, GB<br /><em><b>Abstract</b><br />Spiking astrocyte neural networks (SANN) are a new computational paradigm that exhibit enhanced self-adapting and reliability properties. The inclusion of astrocyte behaviour increases the computational load and, critically, the number of connections, where each astrocyte typically communicates with up to 9 neurons (and their associated synapses) with feedback pathways from each neuron to the astrocyte. Each astrocyte cell also communicates with its neighbouring cell, resulting in a significant interconnect density. 
The substantial level of parallelism in SANNs lends itself to hardware acceleration; however, the challenge in accelerating SANN simulations firmly resides in scalable interconnect and the ability to inject data into and retrieve data from the hardware. This paper presents a novel multi-FPGA acceleration architecture, AstroByte, for the speedup of SANNs. AstroByte explores Networks-on-Chip (NoC) routing mechanisms to address the challenge of communicating both spike-event (neuron) data and numeric (astrocyte) data across the significant interconnect pathways between astrocytes and neurons. AstroByte also exploits the NoC interconnect to inject data into and retrieve runtime data from the accelerated SANN simulations. Results show that AstroByte can simulate SANN applications with speedup factors of between 162x and 188x over equivalent Matlab simulations.</em></td> </tr> <tr> <td>15:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="IP5">IP5 Interactive Presentations</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 15:30 - 16:00<br /><b>Location / Room:</b> Poster Area</p> <p>Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.</p> <table> <tr> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> <tr> <td style="width:40px;">IP5-1</td> <td><b>STATISTICAL MODEL CHECKING OF APPROXIMATE CIRCUITS: CHALLENGES AND OPPORTUNITIES</b><br /><b>Speaker and Author</b>:<br />Josef Strnadel, Brno University of Technology, CZ<br /><em><b>Abstract</b><br />Many works have shown that approximate circuits may play an important role in the development of resource-efficient electronic systems. This motivates many researchers to propose new approaches for finding an optimal trade-off between the approximation error and resource savings for predefined applications of approximate circuits. The works and approaches, however, focus mainly on design aspects regarding relaxed functional requirements while neglecting further aspects such as signal and parameter dynamics/stochasticity, relaxed/non-functional equivalence, testing or formal verification. This paper aims to take a step ahead by moving towards the formal verification of time-dependent properties of systems based on approximate circuits. Firstly, it presents our approach to modeling such systems by means of stochastic timed automata; our approach goes beyond digital, combinational and/or synchronous circuits and is applicable in the area of sequential, analog and/or asynchronous circuits as well. Secondly, the paper shows the principle and advantage of verifying properties of modeled approximate systems by the statistical model checking technique. Finally, the paper evaluates our approach and outlines future research perspectives.</em></td> </tr> <tr> <td style="width:40px;">IP5-2</td> <td><b>RUNTIME ACCURACY-CONFIGURABLE APPROXIMATE HARDWARE SYNTHESIS USING LOGIC GATING AND RELAXATION</b><br /><b>Speaker</b>:<br />Tanfer Alan, Karlsruhe Institute of Technology, DE<br /><b>Authors</b>:<br />Tanfer Alan<sup>1</sup>, Andreas Gerstlauer<sup>2</sup> and Joerg Henkel<sup>1</sup><br /><sup>1</sup>Karlsruhe Institute of Technology, DE; <sup>2</sup>University of Texas, Austin, US<br /><em><b>Abstract</b><br />Approximate computing trades off computation accuracy against energy efficiency. 
Algorithms from several modern application domains such as decision making and computer vision are tolerant to approximations while still meeting their requirements. The extent of approximation tolerance, however, varies significantly with changes in input characteristics and applications. We propose a novel hybrid approach for the synthesis of runtime accuracy-configurable hardware that minimizes energy consumption at the expense of area. To that end, we first explore instantiating multiple hardware blocks with different fixed approximation levels. These blocks can be selected dynamically and thus allow the accuracy to be configured at runtime. They benefit from having fewer transistors and also from synthesis relaxations, in contrast to state-of-the-art gating mechanisms, which only switch off a group of logic. Our hybrid approach combines instantiating such blocks with area-efficient gating mechanisms that reduce toggling activity, creating a fine-grained design-time knob on energy vs. area. Examining total energy savings for a Sobel Filter under different workloads and accuracy tolerances shows that our method finds Pareto-optimal solutions providing up to 16% and 44% energy savings compared to a state-of-the-art accuracy-configurable gating mechanism and an exact hardware block, respectively, at 2x area cost.</em></td> </tr> <tr> <td style="width:40px;">IP5-3</td> <td><b>POST-QUANTUM SECURE BOOT</b><br /><b>Speaker</b>:<br />Vinay B. Y. Kumar, Nanyang Technological University (Singapore), SG<br /><b>Authors</b>:<br />Vinay B. Y. Kumar<sup>1</sup>, Naina Gupta<sup>2</sup>, Anupam Chattopadhyay<sup>3</sup>, Michael Kasper<sup>4</sup>, Christoph Krauss<sup>5</sup> and Ruben Niederhagen<sup>5</sup><br /><sup>1</sup>Nanyang Technological University, Singapore, SG; <sup>2</sup>Indraprastha Institute of Information Technology, IN; <sup>3</sup>Nanyang Technological University, SG; <sup>4</sup>Fraunhofer Singapore, SG; <sup>5</sup>Fraunhofer SIT, DE<br /><em><b>Abstract</b><br />A secure boot protocol is fundamental to ensuring the integrity of the trusted computing base of a secure system. The use of digital signature algorithms (DSAs) based on traditional asymmetric cryptography, particularly for secure boot, leaves such systems vulnerable to the threat of quantum computers. This paper presents the first post-quantum secure boot solution, implemented fully as hardware for reasons of security and performance. In particular, this work uses the eXtended Merkle Signature Scheme (XMSS), a hash-based scheme that has been specified as an IETF RFC. The solution has been integrated into a secure SoC platform around RISC-V cores and evaluated on an FPGA; it is shown to be orders of magnitude faster than corresponding hardware/software implementations and to compare competitively with a fully hardware elliptic-curve DSA-based solution.</em></td> </tr> <tr> <td style="width:40px;">IP5-4</td> <td><b>ROQ: A NOISE-AWARE QUANTIZATION SCHEME TOWARDS ROBUST OPTICAL NEURAL NETWORKS WITH LOW-BIT CONTROLS</b><br /><b>Speaker</b>:<br />Jiaqi Gu, University of Texas, Austin, US<br /><b>Authors</b>:<br />Jiaqi Gu<sup>1</sup>, Zheng Zhao<sup>1</sup>, Chenghao Feng<sup>1</sup>, Hanqing Zhu<sup>2</sup>, Ray T. Chen<sup>1</sup> and David Z. Pan<sup>1</sup><br /><sup>1</sup>University of Texas, Austin, US; <sup>2</sup>Shanghai Jiao Tong University, CN<br /><em><b>Abstract</b><br />Optical neural networks (ONNs) demonstrate orders-of-magnitude higher speed in deep learning acceleration than their electronic counterparts. 
However, limited control precision and device variations induce accuracy degradation in practical ONN implementations. To tackle this issue, we propose a quantization scheme that adapts a full-precision ONN to low-resolution voltage controls. Moreover, we propose a protective regularization technique that dynamically penalizes quantized weights based on their estimated noise-robustness, leading to an improvement in noise robustness. Experimental results show that the proposed scheme effectively adapts ONNs to limited-precision controls and device variations. The resultant four-layer ONN demonstrates higher inference accuracy with lower variance than baseline methods under various control precisions and device noise levels.</em></td> </tr> <tr> <td style="width:40px;">IP5-5</td> <td><b>STATISTICAL TRAINING FOR NEUROMORPHIC COMPUTING USING MEMRISTOR-BASED CROSSBARS CONSIDERING PROCESS VARIATIONS AND NOISE</b><br /><b>Speaker</b>:<br />Ying Zhu, TUM, DE<br /><b>Authors</b>:<br />Ying Zhu<sup>1</sup>, Grace Li Zhang<sup>1</sup>, Tianchen Wang<sup>2</sup>, Bing Li<sup>1</sup>, Yiyu Shi<sup>2</sup>, Tsung-Yi Ho<sup>3</sup> and Ulf Schlichtmann<sup>1</sup><br /><sup>1</sup>TUM, DE; <sup>2</sup>University of Notre Dame, US; <sup>3</sup>National Tsing Hua University, TW<br /><em><b>Abstract</b><br />Memristor-based crossbars are an attractive platform to accelerate neuromorphic computing. However, process variations during manufacturing and noise in memristors cause significant accuracy loss if not addressed. In this paper, we propose to model process variations and noise as correlated random variables and incorporate them into the cost function during training. Consequently, the weights after this statistical training become more robust and, together with global variation compensation, provide a stable inference accuracy. Simulation results demonstrate that the mean value and the standard deviation of the inference accuracy can be improved significantly, by up to 54% and 31%, respectively, in a two-layer fully connected neural network.</em></td> </tr> <tr> <td style="width:40px;">IP5-6</td> <td><b>COMPUTATIONAL RESTRUCTURING: RETHINKING IMAGE PROCESSING USING MEMRISTOR CROSSBAR ARRAYS</b><br /><b>Speaker</b>:<br />Rickard Ewetz, University of Central Florida, US<br /><b>Authors</b>:<br />Baogang Zhang, Necati Uysal and Rickard Ewetz, University of Central Florida, US<br /><em><b>Abstract</b><br />Image processing is a core operation performed on billions of sensor devices in the Internet of Things (IoT). Emerging memristor crossbar arrays (MCAs) promise to perform matrix-vector multiplication (MVM), the dominating computation within the two-dimensional Discrete Cosine Transform (2D DCT), with an extremely small energy-delay product. Earlier studies have directly mapped the digital implementation to MCA-based hardware. The drawback is that the series computation is vulnerable to errors. Moreover, the implementation requires the use of large image block sizes, which is known to degrade the image quality. In this paper, we propose to restructure the 2D DCT into an equivalent single linear transformation (or MVM operation). The restructuring eliminates the series computation and reduces the processed block sizes from NxN to √Nx√N. Consequently, both the robustness to errors and the image quality are improved.
Moreover, latency, power, and area are reduced by 2x while the storage of intermediate data is eliminated, and the power and area can be further reduced by up to 62% and 74%, respectively, using frequency spectrum optimization.</em></td> </tr> <tr> <td style="width:40px;">IP5-7</td> <td><b>SCRIMP: A GENERAL STOCHASTIC COMPUTING ACCELERATION ARCHITECTURE USING RERAM IN-MEMORY PROCESSING</b><br /><b>Speaker</b>:<br />Saransh Gupta, University of California, San Diego, US<br /><b>Authors</b>:<br />Saransh Gupta<sup>1</sup>, Mohsen Imani<sup>1</sup>, Joonseop Sim<sup>1</sup>, Andrew Huang<sup>1</sup>, Fan Wu<sup>1</sup>, M. Hassan Najafi<sup>2</sup> and Tajana Rosing<sup>1</sup><br /><sup>1</sup>University of California, San Diego, US; <sup>2</sup>University of Louisiana, US<br /><em><b>Abstract</b><br />Stochastic computing (SC) reduces the complexity of computation by representing numbers with long independent bit-streams. However, increasing performance in SC comes with an increase in area and a loss in accuracy. Processing in memory (PIM) with non-volatile memories (NVMs) computes data in place, while having high memory density and supporting bit-parallel operations with low energy. In this paper, we propose SCRIMP for stochastic computing acceleration with resistive RAM (ReRAM) in-memory processing, which enables SC in memory. SCRIMP can be used for a wide range of applications. It supports all SC encodings and operations in memory. It maximizes the performance and energy efficiency of implementing SC by introducing novel in-memory parallel stochastic number generation and efficient implication-based logic in memory. To show the efficiency of our stochastic architecture, we implement image processing on the proposed hardware.</em></td> </tr> <tr> <td style="width:40px;">IP5-8</td> <td><b>TDO-CIM: TRANSPARENT DETECTION AND OFFLOADING FOR COMPUTATION IN-MEMORY</b><br /><b>Speaker</b>:<br />Lorenzo Chelini, Eindhoven University of Technology, NL<br /><b>Authors</b>:<br />Kanishkan Vadivel<sup>1</sup>, Lorenzo Chelini<sup>2</sup>, Ali BanaGozar<sup>1</sup>, Gagandeep Singh<sup>2</sup>, Stefano Corda<sup>2</sup>, Roel Jordans<sup>1</sup> and Henk Corporaal<sup>1</sup><br /><sup>1</sup>Eindhoven University of Technology, NL; <sup>2</sup>IBM Research - Zurich, CH<br /><em><b>Abstract</b><br />Computation in-memory is a promising non-von Neumann approach that aims to eliminate data transfer to and from the memory subsystem. Although many architectures have been proposed, compiler support for such architectures is still lagging behind. In this paper, we close this gap by proposing an end-to-end compilation flow for in-memory computing based on the LLVM compiler infrastructure. Starting from sequential code, our approach automatically detects, optimizes, and offloads kernels suitable for in-memory acceleration.
We demonstrate our compiler tool-flow on the PolyBench/C benchmark suite and evaluate the benefits of our proposed in-memory architecture, simulated in Gem5, by comparing it with a state-of-the-art von Neumann architecture.</em></td> </tr> <tr> <td style="width:40px;">IP5-9</td> <td><b>BACKFLOW: BACKWARD EDGE CONTROL FLOW ENFORCEMENT FOR LOW END ARM MICROCONTROLLERS</b><br /><b>Speaker</b>:<br />Cyril Bresch, LCIS, FR<br /><b>Authors</b>:<br />Cyril Bresch<sup>1</sup>, David Hély<sup>2</sup> and Roman Lysecky<sup>3</sup><br /><sup>1</sup>LCIS, FR; <sup>2</sup>LCIS - Grenoble INP, FR; <sup>3</sup>University of Arizona, US<br /><em><b>Abstract</b><br />This paper presents BackFlow, a compiler-based toolchain that enforces indirect backward-edge control flow integrity for low-end ARM Cortex-M microprocessors. BackFlow is implemented within the Clang/LLVM compiler and supports the ARM instruction set and its subset Thumb. The control flow integrity generated by the compiler relies on a bitmap, where each set bit indicates a valid pointer destination. The efficiency of the framework is benchmarked using an STM32 NUCLEO F446RE microcontroller. The obtained results show that the control flow integrity solution incurs an execution time overhead ranging from 1.5% to 4.5%.</em></td> </tr> <tr> <td style="width:40px;">IP5-10</td> <td><b>DELAY SENSITIVITY POLYNOMIALS BASED DESIGN-DEPENDENT PERFORMANCE MONITORS FOR WIDE OPERATING RANGES</b><br /><b>Speaker</b>:<br />Ruikai Shi, State Key Laboratory of Computer Architecture, ICT, CAS; University of Chinese Academy of Sciences, CN<br /><b>Authors</b>:<br />Ruikai Shi<sup>1</sup>, Liang Yang<sup>2</sup> and Hao Wang<sup>2</sup><br /><sup>1</sup>State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, CN; <sup>2</sup>Loongson Technology Corporation Limited, CN<br /><em><b>Abstract</b><br />The downsizing of CMOS technology makes circuit performance more sensitive to on-chip parameter variations. The previously proposed design-dependent ring oscillator (DDRO) method provides an efficient way to monitor circuit performance at runtime. However, its linear delay sensitivity expression may be inadequate, especially over a wide range of operating conditions. To overcome this, a new design-dependent performance monitor (DDPM) method is proposed in this work, which formulates the delay sensitivity as high-order polynomials, making it possible to accurately track nonlinear timing behavior over wide operating ranges. A 28nm technology is used for design evaluation, and a low error rate is achieved in the circuit performance monitoring comparison.</em></td> </tr> <tr> <td style="width:40px;">IP5-11</td> <td><b>MITIGATION OF SENSE AMPLIFIER DEGRADATION USING SKEWED DESIGN</b><br /><b>Speaker</b>:<br />Daniel Kraak, Delft University of Technology, NL<br /><b>Authors</b>:<br />Daniel Kraak<sup>1</sup>, Mottaqiallah Taouil<sup>1</sup>, Said Hamdioui<sup>1</sup>, Pieter Weckx<sup>2</sup>, Stefan Cosemans<sup>2</sup> and Francky Catthoor<sup>2</sup><br /><sup>1</sup>Delft University of Technology, NL; <sup>2</sup>IMEC, BE<br /><em><b>Abstract</b><br />Designers typically add design margins to semiconductor memories to compensate for aging. However, the aging impact increases with technology downscaling, leading to the need for higher margins. This results in a negative impact on area, yield, performance, and power consumption.
As an alternative, mitigation schemes can be developed to reduce this impact. This paper proposes a mitigation scheme for the memory's sense amplifier (SA); the scheme is based on creating a skew in the relative strengths of the SA's cross-coupled inverters during design. The skew is then compensated by aging due to unbalanced workloads. As a result, the impact of aging on the SA is reduced. To validate the mitigation scheme, the degradation of the sense amplifier is analyzed for several workloads. The experimental results show that the proposed mitigation scheme reduces the degradation of the sense amplifier's critical figure of merit, the offset voltage, by up to 26%.</em></td> </tr> <tr> <td style="width:40px;">IP5-12</td> <td><b>BLOCKCHAIN TECHNOLOGY ENABLED PAY PER USE LICENSING APPROACH FOR HARDWARE IPS</b><br /><b>Speaker</b>:<br />Krishnendu Guha, University of Calcutta, IN<br /><b>Authors</b>:<br />Krishnendu Guha, Debasri Saha and Amlan Chakrabarti, University of Calcutta, IN<br /><em><b>Abstract</b><br />The present era is witnessing widespread reuse of hardware IPs to reduce cost. As trustworthiness is an essential factor, designers prefer hardware IPs that performed effectively in the past but, at the same time, are still active and have not aged. In such scenarios, pay-per-use licensing schemes suit both producers and users best. Existing pay-per-use licensing mechanisms rely on a centralized third party, which may not be trustworthy. Hence, we turn to blockchain technology to eliminate such third parties and facilitate a transparent and automated pay-per-use licensing mechanism. A blockchain is a distributed public ledger whose records are added based on peer review and majority consensus of its participants and cannot be tampered with or modified later. Smart contracts are deployed to facilitate the mechanism. Dynamic pricing of the hardware IPs based on trustworthiness and aging, which is not addressed in the existing literature, is also considered in this work. A security analysis of the proposed mechanism is provided. Performance evaluation is carried out based on the gas usage of the Ethereum Solidity test environment, along with a cost analysis based on lifetime and related user ratings.</em></td> </tr> </table> <hr /> <h2 id="12.1">12.1 Special Day on "Silicon Photonics": Design Automation for Photonics</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 16:00 - 17:30<br /><b>Location / Room:</b> Amphithéâtre Jean Prouve</p> <p><b>Chair:</b><br />Dave Penkler, SCINTIL Photonics, US</p> <p><b>Co-Chair:</b><br />Ashkan Seyedi, Hewlett Packard Labs, US</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>16:00</td> <td>12.1.1</td> <td><b>OPPORTUNITIES FOR CROSS-LAYER DESIGN IN HIGH-PERFORMANCE COMPUTING SYSTEMS WITH INTEGRATED SILICON PHOTONIC NETWORKS</b><br /><b>Speaker</b>:<br />Mahdi Nikdast, Colorado State University, US<br /><b>Authors</b>:<br />Asif Mirza, Shadi Manafi Avari, Ebadollah Taheri, Sudeep Pasricha and Mahdi Nikdast, Colorado State University, US<br /><em><b>Abstract</b><br />With the ever-growing complexity of high-performance computing (HPC) systems to satisfy emerging application requirements (e.g., high memory bandwidth requirements for machine learning applications), the performance bottleneck in such systems has moved from being computation-centric to being communication-centric.
Silicon photonic interconnection networks have been proposed to address the aggressive communication requirements in HPC systems, realizing higher bandwidth, lower latency, and better energy efficiency. There have been many successful efforts in developing silicon photonic devices, integrated circuits, and architectures for HPC systems. Moreover, many efforts have been made to address and mitigate the impact of different challenges (e.g., fabrication process and thermal variations) in silicon photonic interconnects. However, most of these efforts have focused only on a single design layer in the system design space (e.g., the device, circuit or architecture level). Therefore, there is often a gap between what a design technique can improve in one layer and what it might impair in another. In this paper, we discuss the promise of cross-layer design methodologies for HPC systems integrating silicon photonic interconnects. In particular, we discuss how such cross-layer design solutions, based on cooperatively designing and exchanging design objectives among different system design layers, can help achieve the best possible performance when integrating silicon photonics into HPC systems.</em></td> </tr> <tr> <td>16:30</td> <td>12.1.2</td> <td><b>DESIGN AND VALIDATION OF PHOTONIC IP MACROS BASED ON FOUNDRY PDKS</b><br /><b>Authors</b>:<br />Lee Crudigington, François Chabert and Pieter Dumon, Luceda Photonics, US<br /><em><b>Abstract</b><br />Silicon photonic foundry PDKs are steadily maturing. On the basis of these, designers can start to design and validate more complex circuits. Successful prototypes and productization depend, however, on tight integration of the design flow, across hierarchical levels and between layout and simulation model extraction. Circuit performance and yield are impacted by fabrication variability, which needs to be taken into account in the design cycle as early as the prototype level. We will show how a full flow with integrated layout, circuit and building-block simulation can speed up the design and validation of larger photonic macros, and we discuss PDK requirements.</em></td> </tr> <tr> <td>17:00</td> <td>12.1.3</td> <td><b>EFFICIENT OPTICAL POWER DELIVERY SYSTEM FOR HYBRID ELECTRONIC-PHOTONIC MANYCORE PROCESSORS</b><br /><b>Speaker</b>:<br />Shixi Chen, The Hong Kong University of Science and Technology, HK<br /><b>Authors</b>:<br />Shixi Chen<sup>1</sup>, Jiang Xu<sup>2</sup>, Xuanqi Chen<sup>1</sup>, Zhifei Wang<sup>1</sup>, Jun Feng<sup>1</sup>, Jiaxu Zhang<sup>1</sup>, Zhongyuan Tian<sup>1</sup> and Xiao Li<sup>1</sup><br /><sup>1</sup>The Hong Kong University of Science and Technology, HK; <sup>2</sup>Hong Kong University of Science and Technology, HK<br /><em><b>Abstract</b><br />Many efforts have been devoted to optically enabled high-performance communication infrastructures for future manycore processors. Silicon photonic networks promise high bandwidth, high energy efficiency and low latency. However, the ever-increasing design complexity results in complex optical power demands, which stress the optical power delivery and affect delivery efficiency.
Facing these optical power delivery challenges, we propose a Ring-based Optical Active Delivery (ROAD) system to effectively manage and efficiently deliver optical power throughout photonic-electronic hybrid systems.</em></td> </tr> <tr> <td>17:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="12.2">12.2 Autonomous Systems Design Initiative: Emerging Approaches to Autonomous Systems Design</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 16:00 - 17:30<br /><b>Location / Room:</b> Chamrousse</p> <p><b>Chair:</b><br />Dirk Ziegenbein, Robert Bosch GmbH, DE</p> <p><b>Co-Chair:</b><br />Sebastian Steinhorst, TUM, DE</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>16:00</td> <td>12.2.1</td> <td><b>A PRELIMINARY VIEW ON AUTOMOTIVE CYBER SECURITY MANAGEMENT SYSTEMS</b><br /><b>Speaker</b>:<br />Christoph Schmittner, Austrian Institute of Technology, AT<br /><b>Authors</b>:<br />Christoph Schmittner<sup>1</sup>, Jürgen Dobaj<sup>2</sup>, Georg Macher<sup>3</sup> and Eugen Brenner<sup>2</sup><br /><sup>1</sup>Austrian Institute of Technology, AT; <sup>2</sup>Technische Universität Graz, AT; <sup>3</sup>Technische Universität Graz, AT</td> </tr> <tr> <td>16:30</td> <td>12.2.2</td> <td><b>TOWARDS SAFETY VERIFICATION OF DIRECT PERCEPTION NEURAL NETWORKS</b><br /><b>Speaker</b>:<br />Chih-Hong Cheng, DENSO Automotive Deutschland GmbH, DE<br /><b>Authors</b>:<br />Chih-Hong Cheng<sup>1</sup>, Chung-Hao Huang<sup>2</sup>, Thomas Brunner<sup>2</sup> and Vahid Hashemi<sup>3</sup><br /><sup>1</sup>DENSO Automotive Deutschland GmbH, DE; <sup>2</sup>Fortiss, DE; <sup>3</sup>Audi AG, DE</td> </tr> <tr> <td>17:00</td> <td>12.2.3</td> <td><b>MINIMIZING EXECUTION DURATION IN THE PRESENCE OF LEARNING-ENABLED COMPONENTS</b><br /><b>Speaker</b>:<br />Sanjoy Baruah, Washington University in St. Louis, US<br /><b>Authors</b>:<br />Kunal Agrawal<sup>1</sup>, Sanjoy Baruah<sup>2</sup>, Alan Burns<sup>3</sup> and Abhishek Singh<sup>2</sup><br /><sup>1</sup>Washington University in Saint Louis, US; <sup>2</sup>Washington University in St. Louis, US; <sup>3</sup>University of York, GB</td> </tr> <tr> <td>17:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="12.3">12.3 Reconfigurable Systems for Machine Learning</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 16:00 - 17:30<br /><b>Location / Room:</b> Autrans</p> <p><b>Chair:</b><br />Bogdan Pasca, Intel, FR</p> <p><b>Co-Chair:</b><br />Smail Niar, Université Polytechnique Hauts-de-France, FR</p> <p>Machine learning continues to attract significant research attention, and reconfigurable systems offer ample flexibility for exploring new approaches to accelerating these workloads. In this session we explore how FPGAs can be used for a variety of machine learning workloads.
We discuss memory optimisations for 3D convolutional neural networks (CNNs), the design and implementation of binarised neural networks, and an approach for cascading hybrid-precision datapaths to improve CNN classification latency.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>16:00</td> <td>12.3.1</td> <td><b>EXPLORATION OF MEMORY ACCESS OPTIMIZATION FOR FPGA-BASED 3D CNN ACCELERATOR</b><br /><b>Speaker</b>:<br />Teng Tian, University of Science and Technology of China, CN<br /><b>Authors</b>:<br />Teng Tian, Xi Jin, Letian Zhao, Xiaotian Wang, Jie Wang and Wei Wu, University of Science and Technology of China, CN<br /><em><b>Abstract</b><br />Three-dimensional convolutional networks (3D CNNs) are used effectively in various video recognition applications. Compared to traditional 2D CNNs, the extra temporal dimension makes 3D CNNs more computationally intensive and gives them a larger memory footprint. Memory optimization is therefore crucial in this case. This paper presents a design space exploration of memory access optimization for an FPGA-based 3D CNN accelerator. We present a non-overlapping data tiling method for contiguous off-chip memory access and explore on-chip data reuse opportunities by leveraging different loop ordering strategies. We propose a hardware architecture design which can flexibly support different loop ordering strategies for each 3D CNN layer. With the help of hardware/software co-design, we can provide the optimal configuration toward an energy-efficient and high-performance accelerator design. According to experiments on AlexNet, VGG16, and C3D, our optimal model reduces DRAM accesses by up to 84% and energy consumption by up to 55% on C3D compared to a baseline model, and demonstrates state-of-the-art performance compared to prior FPGA implementations.</em></td> </tr> <tr> <td>16:30</td> <td>12.3.2</td> <td><b>A THROUGHPUT-LATENCY CO-OPTIMISED CASCADE OF CONVOLUTIONAL NEURAL NETWORK CLASSIFIERS</b><br /><b>Speaker</b>:<br />Alexandros Kouris, Imperial College London, GB<br /><b>Authors</b>:<br />Alexandros Kouris<sup>1</sup>, Stylianos Venieris<sup>2</sup> and Christos Bouganis<sup>1</sup><br /><sup>1</sup>Imperial College London, GB; <sup>2</sup>Samsung AI, GB<br /><em><b>Abstract</b><br />Convolutional Neural Networks constitute a prominent AI model for classification tasks, serving a broad span of diverse application domains. To enable their efficient deployment in real-world tasks, the inherent redundancy of CNNs is frequently exploited to eliminate unnecessary computational costs. Driven by the fact that not all inputs require the same amount of computation to reach a confident prediction, multi-precision cascade classifiers have recently been introduced. FPGAs comprise a promising platform for the deployment of such input-dependent computation models, due to their enhanced customisation capabilities. Current literature, however, is limited to throughput-optimised cascade implementations, employing large batching at the expense of substantial latency aggravation that prohibits their deployment in real-time scenarios. In this work, we introduce a novel methodology for throughput-latency co-optimised cascaded CNN classification, deployed on a custom FPGA architecture tailored to the target application and deployment platform, with respect to a set of user-specified requirements on accuracy and performance.
Our experiments indicate that the proposed approach achieves throughput gains comparable to related state-of-the-art works, with substantially reduced latency overhead, enabling deployment in latency-sensitive applications.</em></td> </tr> <tr> <td>17:00</td> <td>12.3.3</td> <td><b>ORTHRUSPE: RUNTIME RECONFIGURABLE PROCESSING ELEMENTS FOR BINARY NEURAL NETWORKS</b><br /><b>Speaker</b>:<br />Nael Fasfous, TUM, DE<br /><b>Authors</b>:<br />Nael Fasfous<sup>1</sup>, Manoj-Rohit Vemparala<sup>2</sup>, Alexander Frickenstein<sup>2</sup> and Walter Stechele<sup>1</sup><br /><sup>1</sup>TUM, DE; <sup>2</sup>BMW Group, DE<br /><em><b>Abstract</b><br />Recent advancements in Binary Neural Networks (BNNs) have yielded promising results, bringing them a step closer to their full-precision counterparts in terms of prediction accuracy. These advancements were brought about by additional arithmetic and binary operations, in the form of scale and shift operations (fixed-point) and convolutions with multiple weight and activation bases (binary). In this paper, we propose OrthrusPE, a runtime reconfigurable processing element (PE) which is capable of executing all the operations required by modern BNNs while improving resource utilization and power efficiency. More precisely, we exploit DSP48 blocks on off-the-shelf FPGAs to compute binary Hadamard products (for binary convolutions) and fixed-point arithmetic (for scaling, shifting, batch norm, and non-binary layers), thereby utilizing the same hardware resource for two distinct, critical modes of operation. Our experiments show that common PE implementations increase dynamic power consumption by 67%, while requiring 39% more lookup tables, when compared to an OrthrusPE implementation.</em></td> </tr> <tr> <td>17:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="12.4">12.4 Approximate Computing Works! Applications &amp; Case Studies</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 16:00 - 17:30<br /><b>Location / Room:</b> Stendhal</p> <p><b>Chair:</b><br />Oliver Keszocze, Friedrich-Alexander-University Erlangen-Nuremberg (FAU), DE</p> <p><b>Co-Chair:</b><br />Benjamin Carrion Schaefer, University of Texas at Dallas, US</p> <p>Approximate computing leverages the fact that many applications are tolerant of incorrect results. This session highlights this by presenting methods and applications that optimize the trade-off between area, power and output error. At the same time, it is important to ensure that the approximation approaches are scalable, because complex problems are addressed. While some of these approaches work entirely at the application level, others are oriented towards optimizing key subcircuits.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>16:00</td> <td>12.4.1</td> <td><b>TOWARDS GENERIC AND SCALABLE WORD-LENGTH OPTIMIZATION</b><br /><b>Speaker</b>:<br />Van-Phu Ha, Univ Rennes, Inria, IRISA, FR<br /><b>Authors</b>:<br />Van-Phu Ha<sup>1</sup>, Tomofumi Yuki<sup>2</sup> and Olivier Sentieys<sup>2</sup><br /><sup>1</sup>Univ Rennes, Inria, CNRS, IRISA, France, FR; <sup>2</sup>INRIA, FR<br /><em><b>Abstract</b><br />In this paper, we propose a method to improve the scalability of Word-Length Optimization (WLO) for large applications that use complex quality metrics such as Structural Similarity (SSIM).
The input application is decomposed into smaller kernels to avoid an uncontrolled explosion of the exploration time, an approach known as noise budgeting. The main challenge addressed in this paper is how to allocate noise budgets to each kernel. This requires capturing the interactions across kernels. The main idea is to characterize the impact of approximating each kernel on accuracy/cost through simulation and regression. Our approach improves scalability while finding better solutions for an Image Signal Processor pipeline.</em></td> </tr> <tr> <td>16:30</td> <td>12.4.2</td> <td><b>TRADING SENSITIVITY FOR POWER IN AN IEEE 802.15.4 CONFORMANT ADEQUATE DEMODULATOR</b><br /><b>Speaker</b>:<br />Paul Detterer, Eindhoven University of Technology, NL<br /><b>Authors</b>:<br />Paul Detterer<sup>1</sup>, Cumhur Erdin<sup>1</sup>, Jos Huisken<sup>1</sup>, Hailong Jiao<sup>1</sup>, Majid Nabi<sup>1</sup>, Twan Basten<sup>1</sup> and Jose Pineda de Gyvez<sup>2</sup><br /><sup>1</sup>Eindhoven University of Technology, NL; <sup>2</sup>NXP Semiconductors, US<br /><em><b>Abstract</b><br />In this work, a design of an IEEE 802.15.4 conformant O-QPSK demodulator is proposed that is capable of trading off receiver sensitivity for power savings. Such a design can be used to meet rigid energy and power constraints in many applications in the Internet-of-Things (IoT) context. In a Body Area Network (BAN), for example, the circuits need to operate with extremely limited energy sources while still meeting the network performance requirements. This challenge can be addressed by the paradigm of adequate computing, which trades off excessive quality of service for power or energy using approximation techniques. Three different, adjustable approximation techniques are integrated into the demodulator to trade off effective signal quantization bit-width, filtering performance, and sampling frequency for power. Such approximations impact the incoming-signal sensitivity of the demodulator. For a detailed trade-off analysis, the proposed design is implemented in a commercial 40-nm CMOS technology to estimate power and in a Python environment to estimate sensitivity. Simulation results show up to 64% power savings at the cost of 7 dB of sensitivity.</em></td> </tr> <tr> <td>17:00</td> <td>12.4.3</td> <td><b>APPROXIMATION TRADE OFFS IN AN IMAGE-BASED CONTROL SYSTEM</b><br /><b>Speaker</b>:<br />Sayandip De, Eindhoven University of Technology, NL<br /><b>Authors</b>:<br />Sayandip De, Sajid Mohamed, Konstantinos Bimpisidis, Dip Goswami, Twan Basten and Henk Corporaal, Eindhoven University of Technology, NL<br /><em><b>Abstract</b><br />Image-based control (IBC) systems use camera sensor(s) to perceive the environment. The inherently compute-heavy nature of image processing causes long processing delays that negatively influence the performance of IBC systems. Our idea is to reduce this long delay using coarse-grained approximation of the image signal processing pipeline without affecting the functionality and performance of the IBC system. The question is: how is the degree of approximation related to the closed-loop quality-of-control (QoC), memory utilization and energy consumption? We present a software-in-the-loop (SiL) evaluation framework for the above approximation-in-the-loop system. We identify the error-resilient stages and the corresponding coarse-grained approximation settings for the IBC system.
We perform a trade-off analysis between QoC, memory utilization and energy consumption for varying degrees of coarse-grained approximation. We demonstrate the effectiveness of our approach using a concrete case study of a lane keeping assist system (LKAS). We obtain energy and memory reductions of up to 84% and 29%, respectively, for 28% QoC improvement.</em></td> </tr> <tr> <td>17:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="12.5">12.5 Cyber-Physical Systems for Manufacturing and Transportation</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 16:00 - 17:30<br /><b>Location / Room:</b> Bayard</p> <p><b>Chair:</b><br />Ulrike Thomas, Chemnitz University of Technology, DE</p> <p><b>Co-Chair:</b><br />Robert De Simone, INRIA, FR</p> <p>Modeling and design of transportation and manufacturing systems from a cyber-physical system (CPS) perspective have lately attracted extensive attention. The session covers various aspects, from the modeling of traffic intersections and the control of traffic signals to implementations of iterative learning controllers for control blocks. Other contributions deal with the selection of network architectures for manufacturing plants and the Digital Twin of production processes for validation.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>16:00</td> <td>12.5.1</td> <td><b>CPS-ORIENTED MODELING AND CONTROL OF TRAFFIC SIGNALS USING ADAPTIVE BACK PRESSURE</b><br /><b>Speaker</b>:<br />Wanli Chang, University of York, GB<br /><b>Authors</b>:<br />Wanli Chang<sup>1</sup>, Debayan Roy<sup>2</sup>, Shuai Zhao<sup>1</sup>, Anuradha Annaswamy<sup>3</sup> and Samarjit Chakraborty<sup>2</sup><br /><sup>1</sup>University of York, GB; <sup>2</sup>TUM, DE; <sup>3</sup>MIT, US<br /><em><b>Abstract</b><br />Modeling and design of automotive systems from a cyber-physical system (CPS) perspective have lately attracted extensive attention. As the trend towards automated driving and connectivity accelerates, strong interactions between vehicles and the infrastructure are expected. This requires modeling and control of the traffic network in a similarly formal manner. Modeling of such networks involves a trade-off between expressivity of the appropriate features and tractability of the control problem. Back-pressure control of traffic signals is gaining ground due to its decentralized implementation, low computational complexity, and lack of requirements on prior traffic information. It guarantees maximum stability under idealistic assumptions. However, when deployed at real traffic intersections, the existing back-pressure control algorithms may result in poor junction utilization due to (i) fixed-length control phases; (ii) stability as the only objective; and (iii) obliviousness to finite road capacities and empty roads. In this paper, we propose a CPS-oriented model of traffic intersections and control of traffic signals, aiming to address the utilization issue of the back-pressure algorithms. We consider a more realistic model with transition phases and dedicated turning lanes, the latter influencing computation of the pressure and subsequently the utilization. The main technical contribution is an adaptive controller that enables varying-length control phases and considers both stability and utilization, while taking both full roads and empty roads into account.
We implement a mechanism to prevent frequent changes of control phases and thus limit the number of transition phases, which have a negative impact on junction utilization. Microscopic simulation results with SUMO on a 3x3 traffic network under various traffic patterns show that the proposed algorithm performs at least about 13% better than the existing fixed-length back-pressure control algorithms reported in previous works. This is a significant improvement in the context of traffic signal control.</em></td> </tr> <tr> <td>16:30</td> <td>12.5.2</td> <td><b>NETWORK SYNTHESIS FOR INDUSTRY 4.0</b><br /><b>Speaker</b>:<br />Enrico Fraccaroli, Università di Verona, IT<br /><b>Authors</b>:<br />Enrico Fraccaroli, Alan Michael Padovani, Davide Quaglia and Franco Fummi, Università di Verona, IT<br /><em><b>Abstract</b><br />Today's factory machines are ever more connected to each other and to SCADA, MES and ERP applications, as well as to external systems for data analysis. Different types of network architectures must be used for this purpose. For instance, control applications at the lowest level are very sensitive to delays and errors, while data analysis with machine learning procedures requires moving large amounts of data without real-time constraints. Standard data formats, like the Automation Markup Language (AML), have been established to document the factory environment, machine placement and network deployment; however, no automatic technique is currently available in the context of Industry 4.0 to choose the best mix of network architectures according to spatial constraints, cost and performance. We propose to fill this gap by formulating an optimization problem. First of all, spatial and communication requirements are extracted from the AML description. Then, the optimal interconnection of wired or wireless channels is obtained according to application objectives. Finally, this result is back-annotated to AML to be used in the life cycle of the production system. The proposed methodology is described through a small, but complete, smart production plant.</em></td> </tr> <tr> <td>17:00</td> <td>12.5.3</td> <td><b>PRODUCTION RECIPE VALIDATION THROUGH FORMALIZATION AND DIGITAL TWIN GENERATION</b><br /><b>Speaker</b>:<br />Stefano Spellini, Università di Verona, IT<br /><b>Authors</b>:<br />Stefano Spellini<sup>1</sup>, Roberta Chirico<sup>1</sup>, Marco Panato<sup>1</sup>, Michele Lora<sup>2</sup> and Franco Fummi<sup>1</sup><br /><sup>1</sup>Università di Verona, IT; <sup>2</sup>Singapore University of Technology and Design, SG<br /><em><b>Abstract</b><br />The advent of Industry 4.0 is making production processes ever more complicated. As such, early process validation is becoming crucial to avoid production errors and thus decrease costs. In this paper, we present an approach to validate production recipes. Initially, the recipe is specified according to the ISA-95 standard, while the production plant is described using AutomationML. These specifications are formalized into a hierarchy of assume-guarantee contracts. Each contract specifies a set of temporal behaviors characterizing the different machines composing the production line, their actions and their interactions. Then, the formal specifications provided by the contracts are systematically synthesized to automatically generate a digital twin for the production line. Finally, the digital twin is used to evaluate, and validate, both the functional and the extra-functional characteristics of the system.
The methodology has been applied to validate the production of a product requiring additive manufacturing, robotic assembly and transportation.</em></td> </tr> <tr> <td>17:15</td> <td>12.5.4</td> <td><b>PARALLEL IMPLEMENTATION OF ITERATIVE LEARNING CONTROLLERS ON MULTI-CORE PLATFORMS</b><br /><b>Speaker</b>:<br />Mojtaba Haghi, Eindhoven University of Technology, NL<br /><b>Authors</b>:<br />Mojtaba Haghi, Yusheng Yao, Dip Goswami and Kees Goossens, Eindhoven University of Technology, NL<br /><em><b>Abstract</b><br />This paper presents design and implementation techniques for iterative learning controllers (ILCs) targeting predictable multi-core embedded platforms. Implementation on embedded platforms results in a number of timing artifacts. Sensor-to-actuator delay (referred to as delay) is an important timing artifact which influences the control performance by changing the dynamic behavior of the system. We propose a delay-based design for ILCs that identifies and operates in the performance-optimal delay region. We then propose two implementation methods -- sequential and parallel -- for ILCs targeting predictable multi-core platforms. The proposed methods enable the designer to carefully adjust the scheduling to achieve the optimal delay region in the resulting control system. We validate our results by hardware-in-the-loop (HIL) simulation, considering a motion system as a case study.</em></td> </tr> <tr> <td>17:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="12.6">12.6 Industrial Experience: From Wafer-Level Up to IoT Security</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 16:00 - 17:30<br /><b>Location / Room:</b> Lesdiguières</p> <p><b>Chair:</b><br />Enrico Macii, Politecnico di Torino, IT</p> <p><b>Co-Chair:</b><br />Norbert Wehn, University of Kaiserslautern, DE</p> <p>This session addresses recent industrial experiences covering all design levels, from technology up to system level.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>16:00</td> <td>12.6.1</td> <td><b>WAFER-LEVEL TEST PATH PATTERN RECOGNITION AND TEST CHARACTERISTICS FOR TEST-INDUCED DEFECT DIAGNOSIS</b><br /><b>Authors</b>:<br />Andrew Yi-Ann Huang<sup>1</sup>, Katherine Shu-Min Li<sup>2</sup>, Ken Chau-Cheung Cheng<sup>3</sup>, Ji-Wei Li<sup>1</sup>, Leon Li-Yang Chen<sup>4</sup>, Nova Cheng-Yen Tsai<sup>1</sup>, Sying-Jyan Wang<sup>5</sup>, Chen-Shiun Lee<sup>1</sup>, Leon Chou<sup>1</sup>, Peter Yi-Yu Liao<sup>1</sup>, Hsing-Chung Liang<sup>6</sup> and Jwu E Chen<sup>7</sup><br /><sup>1</sup>NXP Semiconductors Taiwan Ltd., TW; <sup>2</sup>National Sun Yat-sen University, TW; <sup>3</sup>NXP Semiconductors Taiwan Ltd, TW; <sup>4</sup>National Sun Yat-Sen University, TW; <sup>5</sup>National Chung-Hsing University, TW; <sup>6</sup>Chung Yuan Christian University, TW; <sup>7</sup>National Central University, TW<br /><em><b>Abstract</b><br />Wafer defect maps provide valuable information on fabrication and test process defects, so they can be used to improve fabrication and test yield. This paper applies artificial-intelligence-based pattern recognition techniques to distinguish fab-induced defects from test-induced ones. As a result, test quality, reliability and yield can be improved accordingly.
Wafer test data contain site-dependent information regarding test configurations in automatic test equipment, including effective load push force, the gap between probe and loadboard, probe tip size, probe-cleaning force, etc. Our method analyzes both the test paths and site-dependent test characteristics to identify test-induced defects. Experimental results achieve 96.83% prediction accuracy across six different NXP products, showing that our methods are both effective and efficient.</em></td> </tr> <tr> <td>16:15</td> <td>12.6.2</td> <td><b>A METHOD OF VIA VARIATION INDUCED DELAY COMPUTATION</b><br /><b>Authors</b>:<br />Moonsu Kim<sup>1</sup>, Yun Heo<sup>1</sup>, Seungjae Jung<sup>1</sup>, Kelvin Le<sup>2</sup>, Jongpil Lee<sup>1</sup>, Youngmin Shin<sup>1</sup>, Nathaniel Conos<sup>2</sup> and Hanif Fatemi<sup>2</sup><br /><sup>1</sup>Samsung, KR; <sup>2</sup>Synopsys, US<br /><em><b>Abstract</b><br />As process technologies are scaled down, interconnect delay becomes a major component of the entire path delay, and vias represent a significant portion of the interconnect delay. In this paper, a novel variation-aware delay computation method for vias is proposed. Our experiments show that this method can reduce pessimism in arrival time calculation by over five percent compared with state-of-the-art solutions.</em></td> </tr> <tr> <td>16:30</td> <td>12.6.3</td> <td><b>FULLY AUTOMATED ANALOG SUB-CIRCUIT CLUSTERING WITH GRAPH CONVOLUTIONAL NEURAL NETWORKS</b><br /><b>Speaker</b>:<br />Keertana Settaluri, University of California, Berkeley, US<br /><b>Authors</b>:<br />Keertana Settaluri<sup>1</sup> and Elias Fallon<sup>2</sup><br /><sup>1</sup>University of California, Berkeley, US; <sup>2</sup>Cadence Design Systems, US<br /><em><b>Abstract</b><br />The design of custom analog integrated circuits is one of the contributing factors to high development cost and increased production time, driving the need for more automation in this space. In automating particular avenues of analog design, it is then crucial to assess the efficacy with which the automation algorithm is able to solve the desired problem. To do this, one must consider four metrics that are especially pertinent in this area: robustness, accuracy, level of automation, and computation time. In this work, we present a framework that bridges the gap between schematic and layout generation by encapsulating the design intuition needed to create layout through the identification of critical sub-circuit structures. Our approach identifies analog sub-circuits by utilizing Graph Convolutional Neural Networks (GCNNs) in conjunction with an unsupervised graph clustering technique, resulting in the first tool, to our knowledge, to entirely automate this clustering process. We compare our algorithm to prior work in this space using the four important figures of merit, and our results show over 90% accuracy across six different analog circuits, ranging in size and complexity, while taking just under 1 second to complete.</em></td> </tr> <tr> <td>16:45</td> <td>12.6.4</td> <td><b>EVPS: AN AUTOMOTIVE VIDEO ACQUISITION AND PROCESSING PLATFORM</b><br /><b>Speaker</b>:<br />Christophe Flouzat, CEA List, FR<br /><b>Authors</b>:<br />Christophe Flouzat, Erwan Piriou, Mickael Guibert, Bojan Jovanovic and Mohamad Oussayran, CEA List, FR<br /><em><b>Abstract</b><br />This paper describes a versatile and flexible video acquisition and processing platform for automotive applications.
It is designed to meet aggressive requirements in terms of bandwidth and latency when implementing ADAS functions. Based on a Xilinx Ultrascale+ FPGA device, a vision processing pipeline mixing software and hardware tasks is implemented on this platform. This setup is able to collect four automotive camera streams (MIPI CSI2) and process them in the loop before transmitting more intelligible pre-processed/enhanced data.</em></td> </tr> <tr> <td>17:00</td> <td>12.6.5</td> <td><b>AN ON-BOARD ALGORITHM IMPLEMENTATION ON AN EMBEDDED GPU: A SPACE CASE STUDY</b><br /><b>Speaker</b>:<br />Ivan Rodriguez, Universitat Politècnica de Catalunya and Barcelona Supercomputing Center, ES<br /><b>Authors</b>:<br />Ivan Rodriguez<sup>1</sup>, Leonidas Kosmidis<sup>2</sup>, Olivier Notebaert<sup>3</sup>, Francisco J Cazorla<sup>4</sup> and David Steenari<sup>5</sup><br /><sup>1</sup>Barcelona Supercomputing Center and Universitat Politècnica de Catalunya, ES; <sup>2</sup>Barcelona Supercomputing Center (BSC), ES; <sup>3</sup>Airbus Defence and Space, FR; <sup>4</sup>Barcelona Supercomputing Center, ES; <sup>5</sup>European Space Agency, NL<br /><em><b>Abstract</b><br />On-board processing requirements of future space missions are constantly increasing, calling for hardware beyond the traditional devices used in space. Embedded GPUs are an attractive candidate, offering both high performance capabilities and low power consumption, but there are no complex industrial case studies from the space domain demonstrating these advantages. In this paper we present the GPU parallelisation of an on-board algorithm, as well as its performance on a promising embedded GPU COTS platform targeting critical systems.</em></td> </tr> <tr> <td>17:15</td> <td>12.6.6</td> <td><b>TLS-LEVEL SECURITY FOR LOW POWER INDUSTRIAL IOT NETWORK INFRASTRUCTURES</b><br /><b>Authors</b>:<br />Jochen Mades<sup>1</sup>, Gerd Ebelt<sup>1</sup>, Boris Janjic<sup>1</sup>, Frederik Lauer<sup>2</sup>, Carl Rheinländer<sup>2</sup> and Norbert Wehn<sup>2</sup><br /><sup>1</sup>KSB SE &amp; Co. KGaA, DE; <sup>2</sup>University of Kaiserslautern, DE<br /><em><b>Abstract</b><br />The Industrial Internet of Things (IIoT) enables communication services between machinery and the cloud to enhance industrial processes, e.g., by collecting relevant process parameters or enabling predictive maintenance. Since the data often originates from critical infrastructures, the security of the data channel is the main challenge, and it is often weakened due to the limited compute power and energy availability of battery-powered sensor nodes. Lightweight alternatives to standard security protocols avoid computationally intensive algorithms; however, they do not provide the same level of trust as established standards such as Transport Layer Security (TLS). In this paper, we propose an IIoT network system that enables secure end-to-end IP communication between ultra-low-power sensor nodes and cloud servers. It provides full TLS support to ensure perfect forward secrecy, using hardware accelerators to reduce the energy demand of the security algorithms.
Our results show that the energy overhead of the TLS handshake can be significantly reduced, enabling a secure IIoT infrastructure with a reasonable battery lifetime for the edge devices.</em></td> </tr> <tr> <td>17:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="12.7">12.7 Power-efficient multi-core embedded architectures</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 16:00 - 17:30<br /><b>Location / Room:</b> Berlioz</p> <p><b>Chair:</b><br />Andreas Burg, EPFL, CH</p> <p><b>Co-Chair:</b><br />Semeen Rehman, TU Wien, AT</p> <p>This session presents papers that provide power-efficiency solutions for multi-core embedded architectures. The techniques discussed range from architectural measures to effectively controlling voltage-frequency settings with machine learning based on user experience.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>16:00</td> <td>12.7.1</td> <td><b>TUNING THE ISA FOR INCREASED HETEROGENEOUS COMPUTATION IN MPSOCS</b><br /><b>Authors</b>:<br />Pedro Henrique Exenberger Becker, Jeckson Dellagostin Souza and Antonio Carlos Schneider Beck, Universidade Federal do Rio Grande do Sul, BR<br /><em><b>Abstract</b><br />Heterogeneous MPSoCs are crucial to meeting energy-efficiency and performance targets, given their combination of cores and accelerators. In this work, we propose a novel technique for MPSoC design, increasing their specialization and task-level parallelism within a given area and power budget. By removing the microarchitectural support for costly ISA extensions (e.g., FP, SIMD, crypto) from a few cores (transforming them into Partial-ISA cores), we make room to add extra (full and simpler) in-order cores and hardware accelerators. While applications must migrate from Partial-ISA cores when they need the removed ISA support, they also execute at lower power consumption during their ISA-extension-free phases, since partial cores have much simpler datapaths compared to their full-ISA counterparts. On top of that, the additional cores and accelerators increase task-level parallelism and make the MPSoC more suitable for application-specific scenarios. We show the effectiveness of our approach by composing different MPSoCs in distinct execution scenarios, using FP instructions and the RISC-V ISA as a case study. To support our system, we also propose two scheduling policies, performance- and energy-oriented, to coordinate the execution of this novel design. For the former policy, we achieve a 2.8x speedup for neural network road sign detection, a 1.53x speedup for a video-streaming app, and a 1.2x speedup for a task-parallel scenario, consuming 68%, 75%, and 33% less energy, respectively.
For the energy-oriented policy, partial-ISA reduces energy consumption by 29% over a highly efficient baseline, with increased performance.</em></td> </tr> <tr> <td>16:30</td> <td>12.7.2</td> <td><b>USER INTERACTION AWARE REINFORCEMENT LEARNING FOR POWER AND THERMAL EFFICIENCY OF CPU-GPU MOBILE MPSOCS</b><br /><b>Speaker</b>:<br />Somdip Dey, University of Essex, GB<br /><b>Authors</b>:<br />Somdip Dey<sup>1</sup>, Amit Kumar Singh<sup>1</sup>, Xiaohang Wang<sup>2</sup> and Klaus McDonald-Maier<sup>1</sup><br /><sup>1</sup>University of Essex, GB; <sup>2</sup>South China University of Technology, CN<br /><em><b>Abstract</b><br />A mobile user's usage behaviour changes throughout the day, and the desired Quality of Service (QoS) may thus change for each session. In this paper, we propose a QoS-aware agent that monitors the mobile user's usage behaviour to find the target frame rate, which satisfies the user's desired QoS, and applies reinforcement-learning-based DVFS on a CPU-GPU MPSoC to satisfy the frame rate requirement. An experimental study on a real Exynos hardware platform shows that our proposed agent is able to achieve a maximum of 50% power saving and a 29% reduction in peak temperature compared to stock Android's power saving scheme. It also outperforms the existing state-of-the-art power and thermal management scheme by 41% and 19%, respectively.</em></td> </tr> <tr> <td>17:00</td> <td>12.7.3</td> <td><b>ENERGY-EFFICIENT TWO-LEVEL INSTRUCTION CACHE DESIGN FOR AN ULTRA-LOW-POWER MULTI-CORE CLUSTER</b><br /><b>Speaker</b>:<br />Jie Chen, Università di Bologna, IT<br /><b>Authors</b>:<br />Jie Chen<sup>1</sup>, Igor Loi<sup>2</sup>, Luca Benini<sup>3</sup> and Davide Rossi<sup>3</sup><br /><sup>1</sup>Università di Bologna, IT; <sup>2</sup>GreenWaves Technologies, FR; <sup>3</sup>Università di Bologna, IT<br /><em><b>Abstract</b><br />High energy efficiency and high performance are the key requirements for Internet of Things (IoT) edge devices. Exploiting clusters of multiple programmable processors has recently emerged as a suitable solution to address this challenge. However, one of the main power bottlenecks for multi-core architectures is the instruction cache memory. We propose a two-level structure based on Standard Cell Memories (SCMs) which combines a per-core private instruction cache (L1) and a low-latency (single-cycle) shared instruction cache (L1.5). We present a detailed comparison of performance and energy efficiency for different instruction cache architectures. Our system-level analysis shows that the proposed design improves upon both state-of-the-art private and shared cache architectures and balances performance well with energy efficiency. On average, when executing a set of real-life IoT applications, our multi-level cache improves both performance and energy efficiency by 10% with respect to the private instruction cache system, and improves energy efficiency by 15% and 7% with a performance loss of only 2% with respect to the shared instruction cache.
Moreover, relaxed timing makes the two-level instruction cache an attractive choice for aggressive implementations, with more slack for convergence in physical design.</em></td> </tr> <tr> <td>17:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> <h2 id="12.8">12.8 Special Session: EDA Challenges in Monolithic 3D Integration: From Circuits to Systems</h2> <p><b>Date:</b> Thursday, March 12, 2020<br /><b>Time:</b> 16:00 - 17:30<br /><b>Location / Room:</b> Exhibition Theatre</p> <p><b>Chair:</b><br />Pascal Vivet, CEA-Leti, FR</p> <p><b>Co-Chair:</b><br />Mehdi Tahoori, Karlsruhe Institute of Technology, DE</p> <p>Monolithic 3D integration (M3D) has the potential to improve the performance and energy efficiency of 3D ICs over conventional TSV-based counterparts. By using significantly smaller inter-layer vias (ILVs), M3D offers the "true" benefits of utilizing the vertical dimension for system integration: M3D provides ILVs that are 100x smaller than a TSV and have dimensions similar to normal vias in planar technology. This allows M3D to enable high-performance and energy-efficient systems through higher integration density, flexible partitioning of logic blocks across multiple layers, and significantly lower total wire length. From a system design perspective, M3D is a breakthrough technology to achieve "More Moore and More Than Moore," and opens up the possibility of creating manycore chips with multi-tier cores and network routers by utilizing ILVs. Importantly, this allows us to create scalable manycore systems that can address the communication and computation needs of big data, graph analytics, and other data-intensive parallel applications. In addition, the dramatic reduction in via size and the resulting increase in density opens up numerous opportunities for design optimizations in the manycore domain.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br />Authors</th> </tr> </thead> <tbody> <tr> <td>16:00</td> <td>12.8.1</td> <td><b>M3D-ADTCO: MONOLITHIC 3D ARCHITECTURE, DESIGN AND TECHNOLOGY CO-OPTIMIZATION FOR HIGH ENERGY-EFFICIENT 3D IC</b><br /><b>Authors</b>:<br />Sebastien Thuries, Olivier Billoint, Sylvain Choisent, Didier Lattard, Romain Lemaire and Perrine Batude, CEA-Leti, FR</td> </tr> <tr> <td>16:30</td> <td>12.8.2</td> <td><b>DESIGN OF A RELIABLE POWER DELIVERY NETWORK FOR MONOLITHIC 3D ICS</b><br /><b>Speaker</b>:<br />Krishnendu Chakrabarty, Duke University, US<br /><b>Authors</b>:<br />Shao-Chun Hung and Krishnendu Chakrabarty, Duke University, US</td> </tr> <tr> <td>17:00</td> <td>12.8.3</td> <td><b>POWER-PERFORMANCE-THERMAL TRADE-OFFS IN M3D-ENABLED MANYCORE CHIPS</b><br /><b>Speaker</b>:<br />Partha Pande, Washington State University, US<br /><b>Authors</b>:<br />Shouvik Musavvir<sup>1</sup>, Anwesha Chatterjee<sup>1</sup>, Ryan Kim<sup>2</sup>, Daehyun Kim<sup>1</sup>, Janardhan Rao Doppa<sup>1</sup> and Partha Pratim Pande<sup>1</sup><br /><sup>1</sup>Washington State University, US; <sup>2</sup>Colorado State University, US<br /><em><b>Abstract</b><br />Monolithic 3D (M3D) technology enables unprecedented degrees of integration on a single chip. The minuscule monolithic inter-tier vias (MIVs) in M3D are the key to higher transistor density and greater flexibility in circuit design compared to conventional through-silicon-via (TSV)-based architectures. This results in significant performance and energy-efficiency improvements in M3D-based systems.
Moreover, the thin inter-layer dielectric (ILD) used in M3D provides better thermal conductivity compared to TSV-based solutions and eliminates the possibility of thermal hotspots. However, the fabrication of M3D circuits still suffers from several non-ideal effects. The thin ILD layer may cause electrostatic coupling between tiers. Furthermore, the low-temperature annealing degrades the top-tier transistors and bottom-tier interconnects. An NoC-based manycore design needs to consider all these M3D-process-related non-idealities. In this paper, we discuss various design challenges for an M3D-enabled manycore chip. We present the power-performance-thermal trade-offs associated with these emerging manycore architectures.</em></td> </tr> <tr> <td>17:30</td> <td></td> <td>End of session</td> </tr> <tr> <td></td> <td></td> <td></td> </tr> </tbody> </table> <hr /> </div> General Information Registration & Participation https://www.date-conference.com/registration <span>General Information Registration &amp; Participation</span> <span><a title="View user profile." href="/user/288">Matthias Fried…</a></span> <span>Wed, 13 Mar 2019 12:20</span> <div class="field field--name-field-news-content field--type-text-with-summary field--label-hidden clearfix field__item"><p><strong>Registration for the conference is only possible via the <a href="https://regonline.react-profile.org/Date20/Registration/start" target="_blank">online registration platform</a>. Please kindly note that everyone who wants to attend the conference, the exhibition or single sessions must register. The online registration for <span style="color:blue; font-weight:bold">DATE 20<span style="color:red">20</span></span> is possible until <b>26 February 2020, 12:00:00 CET</b>. Afterwards, registration is only possible on-site at the registration desk and <span style="color: #ff0000;">will result in an additional on-site charge of EUR 50.00</span>.</strong></p> <h1 class="text-align-center" style="background-color: #ffff99;"><strong>Please click <a href="https://regonline.react-profile.org/Date20/Registration/start" target="_blank">here to start online registration</a>.</strong></h1> <p><strong>Please note:</strong></p> <ul> <li><strong>Speakers</strong> are kindly asked to provide the SoftID/PaperID of their accepted paper when registering online. <strong><span style="color: #ff0000;">Each accepted paper shall be accompanied by at least one full conference registration at the speaker rate (i.e., two speaker registrations are needed for two accepted papers, e.g. from the main author or a co-author of the paper).</span></strong><br /> The deadline for payment of fees at the speaker rate in order to have a paper published in the conference proceedings is 28 November 2019. This deadline is independent of the general registration deadlines mentioned below.
Please see the <a href="https://www.date-conference.com/submission-instructions">submission instructions</a> for further details.<br /> Speakers of Monday Tutorials or Friday Workshops are not entitled to register at the speaker rate.</li> <li><strong>IEEE/ACM members</strong> are kindly asked to provide their member number when registering online.</li> <li><strong>Students</strong> are kindly asked to upload a full proof of matriculation, i.e. a scanned copy of the student ID card or a letter from a professor or head of department, during the online registration to confirm the student status at the time of the conference. Kindly note that the document must be in English.</li> <li><strong>Press representatives</strong> can participate for free in the conference including access to all sessions, the exhibition and social events. A press identification card must be provided upon registration.</li> <li>The <strong>visit of the exhibition</strong> is free for all registered participants. Please choose the participant type “Exhibition Visitor” if the exhibition is the only part of the conference that you are going to attend.</li> </ul> <p><strong><strong>In case of any questions, please do not hesitate to contact the DATE Registration Office via email: <span class="spamspan"><span class="u">date-registration</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">kitdresden<span class="o"> [dot] </span>de</span></span>.</strong></strong></p> <table border="1" cellpadding="0" cellspacing="3" style="width: 100%;"> <tbody> <tr> <td width="50%"> <p align="center"><strong>Registration Fees</strong></p> </td> <td align="center" width="25%"> <p align="center"><span style="color: #ff0000;"><strong>Payment received by<br /> <b>5 February 2020, 23:59:59 CET</b></strong></span></p> </td> <td align="center" width="25%"> <p align="center"><span style="color: #ff0000;"><strong>Payment received by<br /> <b>26 February 2020, 12:00:00 CET</b>*</strong></span></p> </td> </tr> <tr> <td colspan="3" width="99%"><a href="https://www.date-conference.com/conference/event-overview" target="_blank"><strong>CONFERENCE (Tuesday to Thursday)</strong></a></td> </tr> <tr> <td width="46%"> <p>Speakers, IEEE/ACM Members</p> </td> <td width="25%"> <p align="center">EUR 620.00</p> </td> <td width="26%"> <p align="center">EUR 730.00</p> </td> </tr> <tr> <td width="46%"> <p>Delegates</p> </td> <td width="25%"> <p align="center">EUR 710.00</p> </td> <td width="26%"> <p align="center">EUR 820.00</p> </td> </tr> <tr> <td width="46%"> <p>Students<span style="color: #ff0000;"><strong>**</strong></span></p> </td> <td width="25%"> <p align="center">EUR 440.00</p> </td> <td width="26%"> <p align="center">EUR 530.00</p> </td> </tr> <tr> <td colspan="3" width="99%"> <p><strong>DAY TICKET (Tuesday, Wednesday <span style="text-decoration: underline;">or</span> Thursday)</strong></p> </td> </tr> <tr> <td width="46%"> <p>Speakers, IEEE/ACM Members, Delegates</p> </td> <td width="25%"> <p align="center">EUR 330.00</p> </td> <td width="26%"> <p align="center">EUR 430.00</p> </td> </tr> <tr> <td width="46%"> <p>Students</p> </td> <td width="25%"> <p align="center">EUR 160.00</p> </td> <td width="26%"> <p align="center">EUR 320.00</p> </td> </tr> <tr> <td colspan="3" width="99%"> <p><a href="https://www.date-conference.com/exhibitors-sponsors" target="_blank"><strong>EXHIBITION VISIT (Tuesday to Thursday)</strong></a></p> </td> </tr> <tr> <td width="46%"> <p>Exhibition Visitor</p> </td> <td colspan="2" 
width="52%"> <p align="center">free</p> </td> </tr> <tr> <td colspan="3" width="99%"> <p><a href="https://www.date-conference.com/conference/monday-tutorials" target="_blank"><strong>TUTORIALS (Monday, half-day afternoon)</strong></a></p> </td> </tr> <tr> <td width="46%"> <p>Speakers, IEEE/ACM Members, Delegates</p> </td> <td width="25%"> <p align="center">EUR 140.00</p> </td> <td width="26%"> <p align="center">EUR 190.00</p> </td> </tr> <tr> <td width="46%"> <p>Students</p> </td> <td width="25%"> <p align="center">EUR 90.00</p> </td> <td width="26%"> <p align="center">EUR 120.00</p> </td> </tr> <tr> <td colspan="3" width="99%"> <p><a href="https://www.date-conference.com/conference/friday-workshops" target="_blank"><strong>WORKSHOPS (Friday)</strong></a></p> </td> </tr> <tr> <td width="46%"> <p>Speakers, IEEE/ACM Members, Delegates, Students</p> </td> <td width="25%"> <p align="center">EUR 240.00</p> </td> <td width="26%"> <p align="center">EUR 280.00</p> </td> </tr> <tr> <td colspan="3" width="99%"> <p><a href="https://www.date-conference.com/date-party-networking-event" target="_blank"><strong>DATE PARTY (Wednesday)</strong></a></p> </td> </tr> <tr> <td width="46%"> <p>Speakers, IEEE/ACM Members, Delegates</p> </td> <td colspan="2" width="52%"> <p align="center"><em>Included in the above-mentioned registration fee</em></p> </td> </tr> <tr> <td width="46%"> <p>Exhibition Visitors, Extra Party Ticket<span style="color: #ff0000;"><strong>**</strong></span></p> </td> <td colspan="2" width="52%"> <p align="center">EUR 70.00</p> </td> </tr> </tbody> </table> <p><span style="color: #ff0000;"><strong>* After <b>26 February 2020, 12:00:00 CET</b>, an additional on-site fee of EUR 50.00 will apply (except for Friday Workshops and Party Tickets).</strong></span></p> <p><span style="color: #ff0000;"><strong>** <strong>Student registration does not include a party ticket. Each student online conference registration can apply for one student extra party ticket at 50% of the costs of a regular extra party ticket. This offer is only available during online registrations and only valid for student <span style="color: #ff0000;"><strong><strong>full </strong></strong></span> conference registrations.</strong></strong></span></p> <h3> The CONFERENCE registration includes:</h3> <ul> <li>Access to all sessions and to the exhibition area from Tuesday to Thursday</li> <li>Conference bag incl. programme booklet and conference proceedings (available for download on-site)</li> <li>Entrance to the DATE Party (<b>Wednesday, 11 March 2020</b> evening),<strong><em> except for the student rate </em></strong>(students can purchase one party ticket at a 50 % reduced price online**)</li> <li>Refreshments during coffee breaks, buffet lunch during lunch breaks</li> </ul> <h3>The DAY TICKET registration includes:</h3> <ul> <li>Access to all sessions – <em>on the chosen day</em></li> <li>Access to the exhibition area &amp; all Exhibition Theatre Sessions from Tuesday to Thursday as well as Keynote Sessions</li> <li>Conference bag incl. 
programme booklet and conference proceedings (available for download on-site)</li> <li>Refreshments during coffee breaks, buffet lunch during lunch break – <em>on the chosen day</em></li> <li><strong>NO</strong> entrance to the DATE Party</li> </ul> <h3>The EXHIBITION registration includes:</h3> <ul> <li>Access to the exhibition area &amp; all Exhibition Theatre sessions (Tuesday to Thursday) as well as Keynote Sessions</li> <li>Programme booklet (including Exhibition Guide)</li> <li><strong>NO</strong> access to scientific / technical sessions</li> <li><strong>NO</strong> buffet lunch</li> <li><strong>NO</strong> entrance to the DATE Party</li> </ul> <h3>The TUTORIAL registration includes:</h3> <ul> <li>Access to the Monday Tutorials<strong><em> (please indicate your preference; it will be possible to switch from one tutorial to another on-site)</em></strong></li> <li>Documentation (available for download online)</li> <li>Afternoon coffee break</li> </ul> <h3>The WORKSHOP registration includes:</h3> <ul> <li>Access to the chosen workshop on Friday <strong><em>(please select one workshop during your registration; switching between workshops on-site is NOT allowed)</em></strong></li> <li>Documentation (available for download online)</li> <li>Refreshments during coffee breaks, buffet lunch</li> </ul> <p><strong>Payment methods</strong></p> <p>Payment can be made by credit card or by bank transfer.</p> <p>Payment by credit card can be made during the online registration process. No credit card fee will be charged. You will then receive a confirmation of registration and payment via e-mail. If you pay by bank transfer, you will receive an invoice including all payment details via e-mail a few days after completing the online registration. All bank charges have to be covered by the delegate.</p> <p>The registration for the conference is only valid after receipt of the full registration fees. If the fees are not paid within the mentioned deadlines, the higher fee will apply and you will subsequently receive a new invoice from the Conference Organization.</p> <p><strong>VISA Letter Application</strong></p> <p>A VISA letter can be requested during the online registration process. VISA Letters will only be sent to fully registered conference delegates (full payment of the registration fees is required), <strong>not</strong> to Exhibition Visitors.</p> <p><strong>Please also refer to the <a href="/sites/default/files/DATE%202020/DATE2020_GeneralTermsConditions.pdf">General Terms and Conditions</a> for further information.</strong></p> <p><em>In case of questions, please contact <b>Conference Organization - Registration</b><br />Anja Zeun, K.I.T.
Group GmbH Dresden, DE<br /><span class="spamspan"><span class="u">date-registration</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">kitdresden<span class="o"> [dot] </span>de</span></span><br />phone: +49 351 65573-137</em></p> <p>Download further information:</p> <p><a href="/sites/default/files/DATE%202020/DATE2020_GeneralTermsConditions.pdf">Download the <span style="color:blue; font-weight:bold">DATE 20<span style="color:red">20</span></span> General terms &amp; conditions here</a></p> <p><a href="/sites/default/files/DATE%202020/DATE2020_Registration%20Information%2BFees.pdf">Download the <span style="color:blue; font-weight:bold">DATE 20<span style="color:red">20</span></span> Registration information &amp; fees here</a></p> </div> <div class="shariff" data-services="[&quot;twitter&quot;,&quot;facebook&quot;,&quot;linkedin&quot;,&quot;xing&quot;,&quot;mail&quot;]" data-theme="colored" data-css="complete" data-orientation="horizontal" data-mail-url="mailto:" data-lang="en"> </div> Wed, 13 Mar 2019 11:20:59 +0000 Matthias Friedrich, edacentrum GmbH, DE 55 at https://www.date-conference.com Authors' Guidelines for Audio-Visual Presentation https://www.date-conference.com/av-guidelines <span>Authors&#039; Guidelines for Audio-Visual Presentation</span> <span><a title="View user profile." href="/user/25">Andreas Vörg, …</a></span> <span>Sat, 4 Jan 2020 10:29</span> <div class="field field--name-field-news-content field--type-text-with-summary field--label-hidden clearfix field__item"><p>This page describes the guidelines to prepare and present audio-visual materials at DATE. Please read all instructions carefully and follow them strictly to maintain the highest possible standards. Even experienced speakers should read the following paragraphs, as they cover several problems that have arisen over the years.</p> <h3>General Instructions for …</h3> <dl class="ckeditor-accordion"> <dt id="General-Instructions-for-Oral-Presentations">… Oral Presentations</dt> <dd> <h2>Presentation Submission</h2> <p>DATE will provide a centralised presentation management system for all speakers of the main conference. You will not be allowed to use your own laptop for presentation on-site – no exceptions.</p> <p>To enable the A/V staff to handle the technical aspects in an efficient way, all presentations should be prepared according to the <a href="#General-Instructions-for-Preparing-AV-Material">"General Instructions for Preparing A/V Material"</a>. <strong>It is essential that the correct format is used.</strong></p> <p>Please bring your presentation file with you (CD/DVD/Memory Stick) to the conference and submit it to the presentation server at the A/V Office.</p> <p>Before the conference, you can upload your presentation by using the web-based upload service at <a href="https://date.t-e-m.de">https://date.t-e-m.de</a>. The correct file name is set automatically by the server. The access data for the upload service will be sent to the main contributing author in due time. The upload service will close on <b>25 February 2020, 23:59:59 CET</b>.</p> <h2>At the Conference</h2> <p>Preview computer systems, identical in software and hardware to the ones used for presentation, will be available in the Audio/Video Office at the conference. This room can be used for presentation matters during the below-mentioned times of the <span style="color:blue; font-weight:bold">DATE 20<span style="color:red">20</span></span> week.
Since this facility will be shared between multiple presenters, its use may be limited.</p> <table> <tbody> <tr> <td>Monday</td> <td>1300 – 1900</td> </tr> <tr> <td>Tuesday</td> <td>0730 – 1900</td> </tr> <tr> <td>Wednesday</td> <td>0730 – 1800</td> </tr> <tr> <td>Thursday</td> <td>0730 – 1730</td> </tr> </tbody> </table> <p>All presenters are required to meet with the local conference Audio/Video staff at least two hours before the beginning of their session to check their presentation at one of the conference computers. However, it is strongly recommended to do so the day before the session if possible.</p> <p>The facilities at the A/V Office will provide the possibility of:</p> <ul> <li>uploading the presentations to the server</li> <li>reviewing the presentations on Windows-based computers</li> <li>last-minute alterations of the presentations</li> <li>support by technical staff</li> </ul> <p>Please submit your presentation to the A/V Office via one of the following media:</p> <ul> <li>CD ROM (CD-R/RW), DVD-ROM (DVD-R/RW)</li> <li>USB memory stick</li> </ul> <p>Save all files associated with your presentation (PowerPoint file, movie/video files etc.) to one folder/location. We recommend saving videos, graphics and pictures separately on your storage medium. In case of problems, we can re-insert the originals.</p> <p>In the event that you have more than one presentation during the conference, save the different presentations in different folders and name them clearly to avoid any on-site misunderstandings and problems.</p> <p>Always make a backup copy of your presentations and all associated files and keep it on a separate portable medium yourself.</p> <p>Conference staff will transfer your presentation from the A/V Office to the corresponding session rooms. You will easily find your presentation on the laptop installed at the lectern in your session room.</p> <p>Each session room is equipped with:</p> <ul> <li>Video projector</li> <li>Lectern with microphone</li> <li>Laptop with operating system Windows 10 (English)</li> <li>Presenter with laser pointer and slideshow remote control</li> </ul> <p>You can control/move slides during your presentation on your own (by remote control – please kindly check this in the Speaker Preview Room in advance).</p> <p>Kindly be at the session room at least 20 minutes before the session starts to meet the chair and familiarise yourself with the technical and other equipment.</p> <p>Using your own laptop for presentation is not possible.</p> <p>During your presentation, keep your time limit in mind. The session moderator will stop your presentation if it exceeds your allocated time slot.</p> <h2>Speaker’s Breakfast</h2> <p>There will be a speaker’s breakfast on the morning of your presentation. It will be located on the ground level of the Alpes Congrès Building, and it will start at 7:30 a.m. Attending the speaker's breakfast on the morning of your presentation is mandatory in order to get all final instructions. A sign with the session number will point to your table.</p> </dd> <dt id="General-Instructions-for-Preparing-AV-Material">… Preparing A/V Material</dt> <dd> <p>When preparing your AV material, keep the time limit for your presentation in mind. To make your visual presentation a success, it needs to be well planned to clearly point out the important results of your research. The audience will appreciate your talk only if your material is visible and legible.
They will remember your talk far better and read your paper if you can manage to communicate at least two important facts within your presentation timeslot. Please consider that the audience will need at least a minute to understand each technical slide.</p> <p>The first slide should contain the title of your paper and the author names, your affiliations and your company, university or funding logo (if applicable). This will be the only page where logos are permitted.</p> <p>Keep your material simple and uncluttered. Program listings and very long equations should be avoided. Tables should be represented graphically, wherever possible. Do not use the valuable space on your slides for large company logos and other elements that do not help in motivating or understanding your work. Duplicates of slides should only be produced in case the same information is needed twice.</p> <h2>Presentation Format</h2> <p>Please use Microsoft PowerPoint 97-2016 (*.ppt/*.pptx), OpenOffice / LibreOffice 1.0 – 6.0, PREZI or Adobe PDF to guarantee your presentation will open successfully on an on-site PC.</p> <p>All slides must use landscape format with 16:9 aspect ratio.</p> <p>Please limit the file size to less than 25 MB (except video content) to minimise problems with storage and access speed that can result in a distorted or incomplete presentation.</p> <p>To speed up your start, we provide a PowerPoint template presentation. You are encouraged to use this template to prepare your presentation. Press <a href="/sites/default/files/2020-01/DATE20-slide-template.pptx">here</a> to download the PowerPoint file.</p> <p>Mac users: please convert your file to PowerPoint format or PDF before you leave for the conference. Be aware that PowerPoint Mac-to-PC conversions can lead to unexpected results, especially with fonts, certain formats of embedded graphics, and special characters (ASCII characters 128 to 255). To avoid questions of PowerPoint compatibility, please embed all used fonts, convert them to vectors or use only compatible fonts (e.g. Arial, Courier New, Lucida Sans, Times New Roman, Verdana).</p> <h2>Pictures and Videos</h2> <p>Because of the many different video formats, support cannot be provided for embedded videos in your presentation; please test your presentation with the on-site PC several hours before your presentation. Generally, the WMV and MPEG-4 formats should work without difficulties.</p> <p>Movies or videos that require additional reading or projection equipment (e.g. VHS cassettes, Video-DVDs) will not be accepted.</p> <p>Audio is supported.</p> <h2>Fonts</h2> <p>Only fonts which are included in the basic installation of MS-Windows 10 will be available. Use of other fonts not included in Windows can cause an incorrect layout/style of your presentation (suggested fonts: Arial, Tahoma). If you use different fonts, these must be embedded into your presentation.</p> <p>Please use high-contrast lettering and fonts with a minimum size of 16 pt and high-contrast layouts like light text on dark colours.</p> <p>Please make sure that index expressions are also clearly visible and use an appropriate font size.</p> <h2>Colours</h2> <p>Colour should be used carefully and colour combinations resulting in a low contrast (e.g. dark blue on black) should be avoided. Be aware that the contrast of your computer monitor is much higher than that of a projector in a partly lit room.</p> <p>Try to use only colours that convert well to black-and-white printing. The distinction between blue and black for text and thin lines is especially weak. Red filled-in objects (circles, rectangles, etc.) with white text are well-suited for highlighting important text.</p> </dd> </dl>
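<p>Before uploading, it can help to run a quick automated check of your file against the constraints above (16:9 landscape slides, file size below 25 MB). The following is a minimal, unofficial sketch using the open-source python-pptx library; the file name slides.pptx is a placeholder, and passing this check is no substitute for testing the presentation on the on-site PCs.</p> <pre>
# Unofficial pre-flight check for DATE slide files (assumes: pip install python-pptx).
import os
from pptx import Presentation

MAX_BYTES = 25 * 1024 * 1024  # 25 MB limit from the guidelines (video content excepted)

def preflight(path):
    issues = []
    # File-size limit stated under "Presentation Format".
    if os.path.getsize(path) > MAX_BYTES:
        issues.append("file is larger than 25 MB")
    # Slide dimensions are stored in EMU; the width/height ratio is unit-free.
    prs = Presentation(path)
    ratio = prs.slide_width / prs.slide_height
    if abs(ratio - 16 / 9) > 0.01:
        issues.append("aspect ratio is %.2f, expected 16:9 landscape" % ratio)
    return issues

for issue in preflight("slides.pptx"):  # placeholder file name
    print("WARNING:", issue)
</pre>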
<h3>Further Instructions for…</h3> <dl class="ckeditor-accordion"> <dt id="Further-Instructions-for-Session-Chairs-and-Co-Chairs">… Session Chairs and Co-Chairs</dt> <dd> <h2>Quick Checklist</h2> <ul> <li>If needed, get a PowerPoint template <a href="/sites/default/files/2020-01/DATE20-slide-template.pptx">here</a>.</li> <li>At least two hours before your session, contact the Audio/Video Office staff to check that all your session presentations have been uploaded. If you have introductory slides, please also contact the A/V staff.</li> <li>Attend Speaker’s Breakfast on the morning of your session at 7:30 a.m.</li> <li>Please check the presence of all speakers no later than 10 minutes before your session starts.</li> <li>After your session, please fill in the <a href="/sites/default/files/2020-01/DATE20-session-evaluation-form.pdf">session evaluation form</a> and return it to the conference registration desk.</li> </ul> <h2>Session Chairs</h2> <p>The main task of a session chair is to run the session. All speakers have been advised to get in contact with you before the session – please check that all of them are present before your session starts. If one of the speakers is missing, leave the presentation slot empty to stay on schedule. Within your session, please introduce the speakers and keep track of the time limits (indicate to the speaker when it is time to stop). Please also manage the question-and-answer procedure after each talk (long dialogues have to be taken off-line). If there are Interactive Presentations assigned to your session, please provide a one-minute time slot to each of them at the end of the session. After your session, please fill in the <a href="/sites/default/files/2020-01/DATE20-session-evaluation-form.pdf">session evaluation form</a> and return it to the conference registration desk.</p> <h2>Session Co-Chairs</h2> <p>The main task of a session co-chair is to support the session moderator and to handle unexpected situations. Please estimate the number of attendees (required for the evaluation form). You are requested to handle unexpected noise (talk to security people), A/V problems (talk to A/V people / technicians) and look for missing speakers.
Please stand in for the session chair in case they are not present due to unexpected circumstances.</p> <p><strong>For further information, please have a look at the guidelines for your session type below.</strong></p> </dd> <dt>… Organisers of Executive and Panel Sessions</dt> <dd> <h2>Quick Checklist</h2> <ul> <li>Carefully read the information provided in the <a href="#General-Instructions-for-Oral-Presentations">"General Instructions for Oral Presentations"</a> and <a href="#General-Instructions-for-Preparing-AV-Material">"General Instructions for Preparing A/V Material"</a></li> <li>If needed, get a PowerPoint template <a href="/sites/default/files/2020-01/DATE20-slide-template.pptx">here</a>.</li> <li>Collect all your session’s presentations where applicable and upload them via <a href="https://date.t-e-m.de">https://date.t-e-m.de</a> by <b>25 February 2020, 23:59:59 CET</b>.<br /> The access data for the upload service will be sent to the Organisers of Special Sessions in due time.</li> <li>At least two hours before your session, visit the Audio/Video Office to check/submit all your session’s presentations.</li> <li>Attend Speaker’s Breakfast on the morning of your session at 7:30 a.m.</li> </ul> <p>If you are also the chair or co-chair of your session, <strong>please have a look at the <a href="#Further-Instructions-for-Session-Chairs-and-Co-Chairs">"Further Instructions for Session Chairs and Co-Chairs"</a></strong>.</p> </dd> <dt>… Speakers in Executive and Panel Sessions</dt> <dd> <h2>Quick Checklist</h2> <ul> <li>Carefully read the information provided in the <a href="#General-Instructions-for-Oral-Presentations">"General Instructions for Oral Presentations"</a> and <a href="#General-Instructions-for-Preparing-AV-Material">"General Instructions for Preparing A/V Material"</a></li> <li>If needed, get a PowerPoint template <a href="/sites/default/files/2020-01/DATE20-slide-template.pptx">here</a> and prepare your slides according to the above-mentioned guidelines.</li> <li>Send your presentation to the session organiser. Your session organiser is responsible for uploading your presentation to the conference server in time. Please contact her/him for instructions.</li> <li>At least two hours before your session, visit the Audio/Video Office to check your presentation.</li> <li>Attend Speaker’s Breakfast on the morning of your presentation at 7:30 a.m.
and bring your completed <a href="/sites/default/files/2020-01/DATE20-speakers-bio.pdf">Speaker's Bio</a></li> <li>20 minutes before your session, contact the session chair to confirm your presence.</li> </ul> </dd> <dt>… Speakers in Regular, Embedded Tutorial and Hot-Topic Sessions (Long and Short Presentations)</dt> <dd> <h2>Quick Checklist</h2> <ul> <li>Carefully read the information provided in the <a href="#General-Instructions-for-Oral-Presentations">"General Instructions for Oral Presentations"</a> and <a href="#General-Instructions-for-Preparing-AV-Material">"General Instructions for Preparing A/V Material"</a></li> <li>If needed, get a PowerPoint template <a href="/sites/default/files/2020-01/DATE20-slide-template.pptx">here</a> and prepare your slides according to the above-mentioned guidelines.</li> <li>Upload your presentation via <a href="https://date.t-e-m.de">https://date.t-e-m.de</a> by <b>25 February 2020, 23:59:59 CET</b>.<br /> The access data for the upload service will be sent to the main contributing author in due time.</li> <li>At least two hours before your session, visit the Audio/Video Office to check your presentation.</li> <li>Attend Speaker’s Breakfast on the morning of your presentation at 7:30 a.m. and bring your completed <a href="/sites/default/files/2020-01/DATE20-speakers-bio.pdf">Speaker's Bio</a></li> <li>20 minutes before your session, contact the session chair to confirm your presence.</li> </ul> <h2>Presentation timeslots</h2> <p>The presentation timeslot is 25+5 minutes for long and 13+2 minutes for short presentations, where the second number is the time reserved for questions. Please consider that the audience will need at least a minute to understand each technical slide. Therefore, you should prepare 15 to 20 slides for long and 10 to 15 slides for short presentations.</p> </dd> <dt>… Authors of Interactive Presentations</dt> <dd> <h2>Quick Checklist</h2> <ul> <li>Carefully read the information provided in the <a href="#General-Instructions-for-Oral-Presentations">"General Instructions for Oral Presentations"</a> and <a href="#General-Instructions-for-Preparing-AV-Material">"General Instructions for Preparing A/V Material"</a></li> <li>If needed, get a PowerPoint template <a href="/sites/default/files/2020-01/DATE20-slide-template.pptx">here</a> and prepare your two advertisement slides according to the above-mentioned guidelines.</li> <li>Prepare your poster according to the below-mentioned guidelines.</li> <li>Upload your two-slide presentation via <a href="https://date.t-e-m.de">https://date.t-e-m.de</a> by <b>25 February 2020, 23:59:59 CET</b>.<br /> The access data for the upload service will be sent to the main contributing author in due time.</li> <li>At least two hours before your advertisement session, visit the Audio/Video Office to check your presentation.</li> <li>20 minutes before your advertisement session, contact the session chair to confirm your presence.</li> <li>15 minutes before your IP session: please mount your poster and stay in the IP session area</li> <li>60 minutes before the next IP session: please be sure to remove your poster; otherwise it will be disposed of.</li> </ul> <h2>Advertisement talk</h2> <p>IP authors have two time slots for presentation. The first time slot is scheduled at the end of a regular session for a short advertisement of the poster presentation scheduled in the following IP session. The timeslot for your advertisement presentation is only one minute.
You are allowed to show at most two slides (cover page included).</p> <h2>Poster Presentation</h2> <p>The second time slot is characterised by an oral explanation given to the interested audience during the interactive presentation sessions. Each IP session runs in a 30-minute timeslot and will be supported by a compulsory poster according to the guidelines indicated below. Therefore, please be in the IP area at least 15 minutes before the session starts to correctly mount the poster. Please also take care of removing it at the latest 60 minutes before the next IP session. Posters from previous sessions will be removed and disposed of. You do not have to prepare presentation slides for this time slot as there will be no table or power socket near the poster wall.</p> <p>Finally, remember that the Best IP Award selection committee will check the quality of the presentation and of the answers during the IP sessions to make its decision.</p> <p>IP authors are kindly asked to prepare posters in DIN A0 portrait format (841x1189 mm / 33.11x46.81 in) and bring the printed poster to the conference. There is no poster printing service on-site. The poster will be exhibited in the IP session area on the poster walls labelled with the corresponding IP number. Blu-Tack/Pins will be provided. Posters made as a mosaic of A4 or letter pages are discouraged.</p> </dd> <dt>… Speakers in Exhibition Theatre Sessions</dt> <dd> <h2>Quick Checklist</h2> <ul> <li>Carefully read the information provided in the <a href="#General-Instructions-for-Oral-Presentations">"General Instructions for Oral Presentations"</a> and <a href="#General-Instructions-for-Preparing-AV-Material">"General Instructions for Preparing A/V Material"</a></li> <li>If needed, get a PowerPoint template <a href="/sites/default/files/2020-01/DATE20-slide-template.pptx">here</a> and prepare your slides according to the above-mentioned guidelines.</li> <li>Upload your presentation via <a href="https://date.t-e-m.de">https://date.t-e-m.de</a> by <b>25 February 2020, 23:59:59 CET</b>.<br /> The access data for the upload service will be sent to the main contributing author in due time.</li> <li>At least two hours before your session, visit the Audio/Video Office to check your presentation.</li> <li>Attend Speaker’s Breakfast on the morning of your presentation at 7:30 a.m. and bring your completed <a href="/sites/default/files/2020-01/DATE20-speakers-bio.pdf">Speaker's Bio</a></li> <li>20 minutes before your session, contact the session chair to confirm your presence.</li> </ul> </dd> <dt>… Authors of Monday Tutorial Presentations</dt> <dd> <p>The centralised presentation management system will NOT be used for Monday Tutorials and presentations will be handled individually by each Tutorial Organiser. Please contact your Tutorial Organiser to get information on the organisation of your presentation.</p> </dd> <dt>… Authors of Friday Workshop Presentations</dt> <dd> <p>The centralised presentation management system will NOT be used for Friday Workshops and presentations will be handled individually by each Workshop Organiser. Please contact your Workshop Organiser to get information on the organisation of your presentation.</p> </dd> </dl> <p>For more information, please contact:</p> <p><b>Conference Organization - Conference Manager</b><br />Eva Smejkal, K.I.T.
Group GmbH Dresden, DE<br /><span class="spamspan"><span class="u">date</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">kitdresden<span class="o"> [dot] </span>de</span></span><br />phone: +49 351 65573-133<br />fax: +49 351 65573-299</p> </div> <div class="field field--name-field-news-attachments field--type-file field--label-above clearfix"> <div class="field__label">Download further information:</div> <div class="field__items"> <div class="field__item"><span class="file file--mime-application-pdf file--application-pdf"><a href="https://www.date-conference.com/sites/default/files/2020-01/DATE20-speakers-bio.pdf" type="application/pdf; length=101265" title="DATE20-speakers-bio.pdf">DATE 2020 Speaker's Bio</a></span> </div> <div class="field__item"><span class="file file--mime-application-pdf file--application-pdf"><a href="https://www.date-conference.com/sites/default/files/2020-01/DATE20-session-evaluation-form.pdf" type="application/pdf; length=114702" title="DATE20-session-evaluation-form.pdf">DATE 2020 Session Evaluation Form</a></span> </div> <div class="field__item"><span class="file file--mime-application-vnd-openxmlformats-officedocument-presentationml-presentation file--x-office-presentation"><a href="https://www.date-conference.com/sites/default/files/2020-01/DATE20-slide-template.pptx" type="application/vnd.openxmlformats-officedocument.presentationml.presentation; length=895391" title="DATE20-slide-template.pptx">DATE 2020 PowerPoint template</a></span> </div> </div> </div> <div class="shariff" data-services="[&quot;twitter&quot;,&quot;facebook&quot;,&quot;linkedin&quot;,&quot;xing&quot;,&quot;mail&quot;]" data-theme="colored" data-css="complete" data-orientation="horizontal" data-mail-url="mailto:" data-lang="en"> </div> Sat, 04 Jan 2020 09:29:05 +0000 Andreas Vörg, edacentrum GmbH, DE 515 at https://www.date-conference.com DATE 2020 in Grenoble: Highlighting Embedded AI and Silicon Photonics https://www.date-conference.com/date-2020-grenoble-highlighting-embedded-ai-and-silicon-photonics <span>DATE 2020 in Grenoble: Highlighting Embedded AI and Silicon Photonics</span> <span><a title="View user profile." href="/user/371">Eva Smejkal, K…</a></span> <span>Fri, 22 Nov 2019 11:03</span> <div class="field field--name-field-news-content field--type-text-with-summary field--label-hidden clearfix field__item"><p>Out of a total of 748 paper submissions received, a large share (39%) is coming from authors in Europe, 27% of submissions are from the Americas, 33% from Asia, and 1% from the rest of the world. Submissions involved more than 2400 authors from 45 different countries. This distribution clearly demonstrates DATE’s international character, global reach and impact.</p> <p>In the D track, the largest number of papers were received in Topics D14 – “Emerging Design Technologies for Future Memories”, DT6 – “Design and Test of Secure Systems”, and D10 – “Approximate Computing”. For the A track, Topic A5 – “Secure Systems, Circuits, and Architectures” scored the highest number of submissions, similarly Topic T1 – “Modelling and Mitigation of Defects, Faults, Variability, and Reliability” for the T track, and finally Topic E2 – “Embedded Systems for Deep Learning” for the E track.</p> <p>For the 23rd year in a row, DATE has prepared an exciting technical programme. 
With the help of the 328 members of the Technical Programme Committee, who carried out 3014 reviews (mostly four reviews per submission), 196 papers (26%) were finally selected for regular presentation and 82 additional ones for interactive presentation (37% cumulatively, including all papers).</p> <p>The <span style="color:blue; font-weight:bold">DATE 20<span style="color:red">20</span></span> programme will include five keynote talks. Two visionary talks will be given during the opening ceremony: one from <strong>Philippe Magarshack</strong>, Corporate Vice President at ST Microelectronics, and one from <strong>Luca Benini</strong>, Chair of Digital Circuits and Systems at ETH Zurich and Professor at the University of Bologna. Moreover, three luncheon keynotes will inspire the attendees: on Tuesday, <strong>Catherine Schumann</strong> from Oak Ridge National Laboratory will talk about neuromorphic computing; on Wednesday, <strong>Jim Tung</strong> will present Mathworks’ vision on how to leverage Embedded Intelligence in Industry; on Thursday, <strong>Joachim Schultze</strong> from DZNE will talk about bottlenecks and challenges for HPC in medical and genomics research.</p> <p>The conference programme includes several executive and hot-topic sessions, addressing Memories for Emerging Applications, Architectures for Emerging Technologies (Quantum Computing, Edge Computing, Neural Algorithms, In-Memory Computing, Bio-Inspired Adaptive Hardware), Hardware Security, 3D Integration and Logic Reasoning for Functional ECO.</p> <p>On the first day of the DATE week, seven in-depth technical tutorials on the main topics of DATE as well as one industry hands-on tutorial will be given by leading experts in their respective fields. The topics cover Early Reliability Analysis in Microprocessor Systems, AI Chip Technologies and DFT Methodologies, Data Analytics for Scalable Computing Systems Design, Security in the Post-Quantum Era, Industrial Control Systems Security, HW/SW codesign of Heterogeneous Parallel dedicated Systems, Evolutionary computing for EDA, and the Deployment of deep learning networks on FPGA (Mathworks).</p> <p>On Friday, 8 full-day workshops cover several hot topics from areas like Autonomous Systems Design, Optical/Photonic Interconnects, Computation-In-Memory, Design Automation for Understanding Hardware Designs, Open-Source Design Automation, Stochastic Computing for Neuromorphic Architectures, Hardware Security and Quantum Computing.</p> <p>Two Special Days in the programme will focus on areas bringing new challenges to the system design community: <strong>Embedded AI and Silicon Photonics</strong>. Each of the Special Days will have a full programme of keynotes, panels, tutorials and technical presentations.</p> <p>In more detail, the Special Day on <strong>Embedded Artificial Intelligence </strong>will cover new trends in cognitive algorithms, hardware architectures, software designs, emerging device technologies as well as the application space for deploying AI into edge devices.
The topics will include the technical areas that enable the realization of embedded artificial intelligence on specialized chips: bio-inspired chips with and without self-learning capabilities, special low-power accelerator chips for vector/matrix-based computations, and convolution and deep-net chips for machine learning, cognitive, and perception applications in health, automotive, robotics, or smart cities.</p> <p>The Special Day on <strong>Silicon Photonics </strong>will focus on data communication via photonics for both data centre/high-performance computing and optical network-on-chip applications. Industrial and academic experts will highlight recent advances in devices and integrated circuits. The sessions will also feature talks on design automation and link-level simulations. Other applications of silicon photonics, such as sensing and optical computing, will also be discussed.</p> <p>A timely Special Initiative <strong>"Autonomous Systems Design – Automated Vehicles and beyond" </strong>on Thursday and Friday will include reviewed and invited papers as well as working sessions.</p> <p>To inform attendees about commercial and design-related topics, there will be a full programme in the <strong>Exhibition Theatre</strong>, which will combine presentations by exhibiting companies, best-practice reports by industry leaders on their latest design projects and selected conference special sessions.</p> <p>The conference is complemented by an <strong>exhibition, running for three days (Tuesday – Thursday)</strong>, including exhibition booths from companies and collaborative research initiatives, including EU project presentations. The exhibition provides a unique networking opportunity and is the perfect venue for industry to meet university professors to foster university programmes, and especially for PhD students to meet future employers.</p> <p><strong>The conference online registration is now open, and the complete advance programme will be available on the DATE website starting from December 2019.</strong></p> <div> <h3><span style="color:blue; font-weight:bold">DATE 20<span style="color:red">20</span></span> Press Contacts</h3> <p><b>General Chair</b><br />Giorgio Di Natale, CNRS/TIMA, FR<br /><span class="spamspan"><span class="u">giorgio<span class="o"> [dot] </span>di-natale</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">univ-grenoble-alpes<span class="o"> [dot] </span>fr</span></span></p> <p><b>Programme Chair</b><br />Cristiana Bolchini, Politecnico di Milano, IT<br /><span class="spamspan"><span class="u">papers</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">date-conference<span class="o"> [dot] </span>com</span></span></p> <p><b>Press and Publicity Chair</b><br />Fabien Clermidy, CEA, FR<br /><span class="spamspan"><span class="u">fabien<span class="o"> [dot] </span>clermidy</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">cea<span class="o"> [dot] </span>fr</span></span></p> <p><b>Exhibition Theatre Chair</b><br />Jürgen Haase, edacentrum, DE<br /><span class="spamspan"><span class="u">exhibition-theatre</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">date-conference<span class="o"> [dot] </span>com</span></span></p> <p><b>Local Arrangements/ICT</b><br />Wendelin Serwe, INRIA, Grenoble, FR<br /><span class="spamspan"><span
class="u">wendelin<span class="o"> [dot] </span>serwe</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">inria<span class="o"> [dot] </span>fr</span></span><br />Pascal Vivet, CEA, FR<br /><span class="spamspan"><span class="u">pascal<span class="o"> [dot] </span>vivet</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">cea<span class="o"> [dot] </span>fr</span></span></p> <p><b>Conference Organization - Conference Manager</b><br />Eva Smejkal, K.I.T. Group GmbH Dresden, DE<br /><span class="spamspan"><span class="u">date</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">kitdresden<span class="o"> [dot] </span>de</span></span><br />phone: +49 351 65573-133<br />fax: +49 351 65573-299</p> </div> </div> <div class="shariff" data-services="[&quot;twitter&quot;,&quot;facebook&quot;,&quot;linkedin&quot;,&quot;xing&quot;,&quot;mail&quot;]" data-theme="colored" data-css="complete" data-orientation="horizontal" data-mail-url="mailto:" data-lang="en"> </div> Fri, 22 Nov 2019 10:03:46 +0000 Eva Smejkal, K.I.T. Group GmbH Dresden, DE 414 at https://www.date-conference.com DATE Fellows https://www.date-conference.com/date-fellows <span>DATE Fellows</span> <span><a title="View user profile." href="/user/25">Andreas Vörg, …</a></span> <span>Tue, 19 Nov 2019 13:57</span> <div class="field field--name-field-news-content field--type-text-with-summary field--label-hidden clearfix field__item"><p>The following persons have received the "DATE Fellow Award" for outstanding service contribution to DATE:</p> <p>Ahmed Jerraya, CEA Leti, FR<br /> Anne Cirkel, Mentor, US<br /> Bashir Al-Hashimi, University of Southampton, GB<br /> Bernard Courtois, TIMA Laboratory, FR<br /> David Atienza Alonso, Ecole Polytechnique Federale de Lausanne, CH<br /> Diederik Verkest, IMEC, BE<br /> Donatella Sciuto, Politecnico di Milano, IT<br /> Enrico Macii, Politecnico di Torino, IT<br /> Gabriele Saucier, Design and Reuse, FR<br /> Georges Gielen, KU Leuven, BE<br /> Gerhard Fettweis, TU Dresden, DE<br /> Giovanni De Micheli, EPF Lausanne, CH<br /> H. 
Gordon Adshead, Manchester Design Technology, GB<br /> Herman Beke, LUCEDA Photonics, BE<br /> Ivo Bolsens, Xilinx, US<br /> Jan Madsen, Technical University of Denmark, DK<br /> Jano Gebelein, Goethe University Frankfurt, DE<br /> Joan Figueras, Universitat Politècnica Catalunya, ES<br /> José Epifânio da Franca, ChipIdea, Lisbon, PT<br /> Jürgen Haase, edacentrum GmbH, DE<br /> Luca Benini, Università di Bologna, IT<br /> Luca Fanucci, University of Pisa, IT<br /> Norbert Wehn, University of Kaiserslautern, DE<br /> Patrick Dewilde, TUM-IAS, DE<br /> Peter Marwedel, TU Dortmund, Informatik 12, DE<br /> Robert Cogné, ITT, FR<br /> Rolf Ernst, TU Braunschweig, DE<br /> Rudy Lauwereins, IMEC, BE<br /> Udo Kebschull, Goethe University Frankfurt, DE<br /> Volker Düppe, DATE, DE<br /> Wolfgang Müller, Universität Paderborn, DE<br /> Wolfgang Nebel, Carl von Ossietzky Universität Oldenburg, DE<br /> Wolfgang Rosenstiel, University of Tübingen, DE</p></div> <div class="shariff" data-services="[&quot;twitter&quot;,&quot;facebook&quot;,&quot;linkedin&quot;,&quot;xing&quot;,&quot;mail&quot;]" data-theme="colored" data-css="complete" data-orientation="horizontal" data-mail-url="mailto:" data-lang="en"> </div> Tue, 19 Nov 2019 12:57:52 +0000 Andreas Vörg, edacentrum GmbH, DE 394 at https://www.date-conference.com Authors' Guidelines for Camera-Ready Submission of Accepted Papers https://www.date-conference.com/author-guidelines <span>Authors&#039; Guidelines for Camera-Ready Submission of Accepted Papers</span> <span><a title="View user profile." href="/user/25">Andreas Vörg, …</a></span> <span>Sat, 9 Nov 2019 16:33</span> <div class="field field--name-field-news-content field--type-text-with-summary field--label-hidden clearfix field__item"><p style="color: red; font-weight: bold;">Deadline: <b>28 November 2019 23:59:59 CET</b></p> <p>Your submission for <span style="color:blue; font-weight:bold">DATE 20<span style="color:red">20</span></span> has been accepted. Congratulations!</p> <p>This page contains instructions to prepare the final material required to publish your contribution in time for the conference.</p> <p>As the instructions differ depending on the type of submission, please follow the appropriate links below.</p> <p><strong>One full conference registration (at speaker/member rate) is required per accepted paper, regardless of whether the only attendee/presenter is a student (in which case the student rate registration will not qualify) by <b>28 November 2019 23:59:59 CET</b>.</strong></p> <dl class="ckeditor-accordion"> <dt>Your submission has been accepted as regular paper (Long and Short Presentation)</dt> <dd> <p><strong>Instructions for authors of papers accepted for the DATE Conference proceedings (regular paper) </strong></p> <p><em>These instructions are for authors of papers accepted for the conference as a regular or short paper. If you are an author of a paper accepted as an industrial paper, an interactive presentation, or if your special session proposal (panel, hot topic, embedded tutorial) has been accepted, please refer to the specific instructions on this page.</em></p> <p>Congratulations on acceptance of your paper for the DATE Conference! 
Each author should prepare a final manuscript for inclusion in the the electronic conference proceedings to be published by EDAA.</p> <p>Please follow the instructions below carefully!</p> <table border="1" cellpadding="1" cellspacing="1" style="width: 100%;"> <tbody> <tr> <td> </td> <td><strong>Instructions</strong></td> <td><strong>Deadline</strong></td> </tr> <tr> <td>1</td> <td>Format manuscript following these guidelines <ul> <li>Templates are provided for <a href="/sites/default/files/2019-07/DATE-conference-template-letter.docx">Microsoft Word</a> (DOCX, 31 KB) and <a href="/sites/default/files/2019-07/DATE-conference-template-LaTeX.zip">LaTeX</a> (ZIP, 746 KB) files. <ul> <li><a href="/sites/default/files/2019-07/IEEEtran_HOWTO.pdf">LaTeX Template Instructions</a> (PDF, 656 KB)</li> <li><a href="/sites/default/files/2019-07/IEEEtranBST2.zip">LateX Bibliography Files for Windows</a> (ZIP, 309 KB)</li> <li>Tips: Be sure to use the template's conference mode. See template documentation for details. Select Save when the File Download window appears. The files cannot open directly from the server.</li> </ul> </li> <li>Paper format: letter</li> <li><b>The strict page limit for your paper is 6 pages.</b></li> <li>Font size: 10 pt</li> <li>Margins: top = 0.75 inches, bottom = 1 inch, side = 0.625 inches.</li> <li>Each column measures 3.5 inches wide, with a 0.25-inch measurement between columns.</li> <li>For additional details: download <a href="https://past.date-conference.com/files/file/09-author_guidelines/format.pdf">format.pdf</a></li> </ul> </td> <td><b>28 November 2019 23:59:59 CET</b></td> </tr> <tr> <td>2</td> <td>One full <a href="/registration">conference registration</a> (at speaker/member rate) is required per accepted paper, regardless of whether the only attendee/presenter is a student (in which case the student rate registration will not qualify).<br /> <a href="/registration">Click Here to Register</a></td> <td><b>28 November 2019 23:59:59 CET</b></td> </tr> <tr> <td>3</td> <td>Find final paper upload and submission instructions for the proceedings on this page below.</td> <td><b>28 November 2019 23:59:59 CET</b></td> </tr> <tr> <td>4</td> <td><a href="/av-guidelines">Audio-visual guidelines</a> – available from mid-December 2019</td> <td> </td> </tr> </tbody> </table> <p>Please contact the following people for more information related to</p> <p><b>Final paper support</b><br />Support Team, iTEK CMS Web Solutions, SG<br /><span class="spamspan"><span class="u">date-conf</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">itekcms<span class="o"> [dot] </span>com</span></span></p> <p><b>Proceedings Chair</b><br />Elena Ioana Vatajelu, TIMA, FR<br /><span class="spamspan"><span class="u">papers</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">date-conference<span class="o"> [dot] </span>com</span></span></p> <p><b>Audio Visual Chair</b><br />Meikel Becker, TEM Festival GmbH Technical Event Management, DE<br /><span class="spamspan"><span class="u">audio-visual</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">date-conference<span class="o"> [dot] </span>com</span></span></p> </dd> <dt>Your submission has been accepted as an industrial paper</dt> <dd> <p><strong>Instructions for authors of papers accepted for the DATE Conference proceedings (industrial paper) </strong></p> <p><em>These instructions are for authors of 
papers accepted for the conference as an industrial paper. If you are an author of a paper accepted as a regular paper or an interactive presentation, or if your special session proposal (panel, hot topic, embedded tutorial) has been accepted, please refer to the specific instructions on this page.</em></p> <p>Congratulations on acceptance of your paper for the DATE Conference! Each author should prepare a final manuscript for inclusion in the the electronic conference proceedings to be published by EDAA.</p> <p>Please follow the instructions below carefully!</p> <table border="1" cellpadding="1" cellspacing="1" style="width: 100%;"> <tbody> <tr> <td> </td> <td><strong>Instructions</strong></td> <td><strong>Deadline</strong></td> </tr> <tr> <td>1</td> <td>Format manuscript following these guidelines <ul> <li>Templates are provided for <a href="/sites/default/files/2019-07/DATE-conference-template-letter.docx">Microsoft Word</a> (DOCX, 31 KB) and <a href="/sites/default/files/2019-07/DATE-conference-template-LaTeX.zip">LaTeX</a> (ZIP, 746 KB) files. <ul> <li><a href="/sites/default/files/2019-07/IEEEtran_HOWTO.pdf">LaTeX Template Instructions</a> (PDF, 656 KB)</li> <li><a href="/sites/default/files/2019-07/IEEEtranBST2.zip">LateX Bibliography Files for Windows</a> (ZIP, 309 KB)</li> <li>Tips: Be sure to use the template's conference mode. See template documentation for details. Select Save when the File Download window appears. The files cannot open directly from the server.</li> </ul> </li> <li>Paper format: letter</li> <li><b>The strict page limit for your paper is 2 pages.</b></li> <li>Font size: 10 pt</li> <li>Margins: top = 0.75 inches, bottom = 1 inch, side = 0.625 inches.</li> <li>Each column measures 3.5 inches wide, with a 0.25-inch measurement between columns.</li> <li>For additional details: download <a href="https://past.date-conference.com/files/file/09-author_guidelines/format.pdf">format.pdf</a></li> </ul> </td> <td><b>28 November 2019 23:59:59 CET</b></td> </tr> <tr> <td>2</td> <td>One full <a href="/registration">conference registration</a> (at speaker/member rate) is required per accepted paper, regardless of whether the only attendee/presenter is a student (in which case the student rate registration will not qualify).<br /> <a href="/registration">Click Here to Register</a></td> <td><b>28 November 2019 23:59:59 CET</b></td> </tr> <tr> <td>3</td> <td>Find final paper upload and submission instructions for the proceedings on this page below.</td> <td><b>28 November 2019 23:59:59 CET</b></td> </tr> <tr> <td>4</td> <td><a href="/av-guidelines">Audio-visual guidelines</a> – available from mid-December 2019</td> <td> </td> </tr> </tbody> </table> <p>Please contact the following people for more information related to</p> <p><b>Final paper support</b><br />Support Team, iTEK CMS Web Solutions, SG<br /><span class="spamspan"><span class="u">date-conf</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">itekcms<span class="o"> [dot] </span>com</span></span></p> <p><b>Proceedings Chair</b><br />Elena Ioana Vatajelu, TIMA, FR<br /><span class="spamspan"><span class="u">papers</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">date-conference<span class="o"> [dot] </span>com</span></span></p> <p><b>Audio Visual Chair</b><br />Meikel Becker, TEM Festival GmbH Technical Event Management, DE<br /><span class="spamspan"><span class="u">audio-visual</span><img class="spamspan-image" 
alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">date-conference<span class="o"> [dot] </span>com</span></span></p> </dd> <dt>Your special session or special day session proposal (panel, hot topic, embedded tutorial) has been accepted</dt> <dd> <p><strong>Instructions for authors of special session or special day session papers accepted for the DATE Conference proceedings </strong></p> <hr /> <p><em>These instructions are for authors of papers accepted as part of a special session for the conference. If you are an author of a regular or short paper, an industrial paper or an interactive presentation, please refer to the specific instructions on this page. </em></p> <ul> <li><strong>Panel Sessions </strong>are entitled to <em><strong>one (1) page per panelist </strong></em>in the proceedings.</li> <li><strong>Hot-Topic Sessions </strong>are allocated a <em><strong>maximum of six (6) pages per speaker</strong></em>.</li> <li><strong>Embedded Tutorials</strong> are allocated <em><strong>one single paper for the entire session which should not exceed ten (10) pages</strong></em>.</li> </ul> <p><em>Follow steps 1, 2, 3, 4, and 5 below to submit the final manuscript for inclusion in the the electronic conference proceedings to be published by EDAA.</em></p> <p>Please follow the instructions below carefully!</p> <table border="1" cellpadding="1" cellspacing="1" style="width: 100%;"> <tbody> <tr> <td> </td> <td><strong>Instructions</strong></td> <td><strong>Deadline</strong></td> </tr> <tr> <td>1</td> <td>Format manuscript following these guidelines <ul> <li>Templates are provided for <a href="/sites/default/files/2019-07/DATE-conference-template-letter.docx">Microsoft Word</a> (DOCX, 31 KB) and <a href="/sites/default/files/2019-07/DATE-conference-template-LaTeX.zip">LaTeX</a> (ZIP, 746 KB) files. <ul> <li><a href="/sites/default/files/2019-07/IEEEtran_HOWTO.pdf">LaTeX Template Instructions</a> (PDF, 656 KB)</li> <li><a href="/sites/default/files/2019-07/IEEEtranBST2.zip">LateX Bibliography Files for Windows</a> (ZIP, 309 KB)</li> <li>Tips: Be sure to use the template's conference mode. See template documentation for details. Select Save when the File Download window appears. 
The files cannot open directly from the server.</li> </ul> </li> <li>Paper format: letter</li> <li><b>The strict page limit for your paper is 6 pages.</b></li> <li>Font size: 10 pt</li> <li>Margins: top = 0.75 inches, bottom = 1 inch, side = 0.625 inches.</li> <li>Each column measures 3.5 inches wide, with a 0.25-inch measurement between columns.</li> <li>For additional details: download <a href="https://past.date-conference.com/files/file/09-author_guidelines/format.pdf">format.pdf</a></li> </ul> </td> <td><b>28 November 2019 23:59:59 CET</b></td> </tr> <tr> <td>2</td> <td>One full <a href="/registration">conference registration</a> (at speaker/member rate) is required per accepted paper, regardless of whether the only attendee/presenter is a student (in which case the student rate registration will not qualify).<br /> <a href="/registration">Click Here to Register</a></td> <td><b>28 November 2019 23:59:59 CET</b></td> </tr> <tr> <td>3</td> <td>Find final paper upload and submission instructions for the proceedings on this page below.</td> <td><b>28 November 2019 23:59:59 CET</b></td> </tr> <tr> <td>4</td> <td><a href="/av-guidelines">Audio-visual guidelines</a> – available from mid-December 2019</td> <td> </td> </tr> </tbody> </table> <p>Please contact the following people for more information related to</p> <p><b>Final paper support</b><br />Support Team, iTEK CMS Web Solutions, SG<br /><span class="spamspan"><span class="u">date-conf</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">itekcms<span class="o"> [dot] </span>com</span></span></p> <p><b>Proceedings Chair</b><br />Elena Ioana Vatajelu, TIMA, FR<br /><span class="spamspan"><span class="u">papers</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">date-conference<span class="o"> [dot] </span>com</span></span></p> <p><b>Audio Visual Chair</b><br />Meikel Becker, TEM Festival GmbH Technical Event Management, DE<br /><span class="spamspan"><span class="u">audio-visual</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">date-conference<span class="o"> [dot] </span>com</span></span></p> </dd> <dt>Your submission has been accepted as an interactive presentation</dt> <dd> <p><em>These instructions are for authors of papers accepted as an interactive presentation. If you are the author of an industrial paper, a regular or short paper or if your special session proposal (panel, hot topic, embedded tutorial) has been accepted, please refer to the specific instructions on this page.</em></p> <p>Congratulations on acceptance of your paper for the DATE Conference! Each author should prepare a final manuscript for inclusion in the the electronic conference proceedings to be published by EDAA.</p> <p>Please follow the instructions below carefully!</p> <p>Please follow the instructions below carefully!</p> <table border="1" cellpadding="1" cellspacing="1" style="width: 100%;"> <tbody> <tr> <td> </td> <td><strong>Instructions</strong></td> <td><strong>Deadline</strong></td> </tr> <tr> <td>1</td> <td>Format manuscript following these guidelines <ul> <li>Templates are provided for <a href="/sites/default/files/2019-07/DATE-conference-template-letter.docx">Microsoft Word</a> (DOCX, 31 KB) and <a href="/sites/default/files/2019-07/DATE-conference-template-LaTeX.zip">LaTeX</a> (ZIP, 746 KB) files. 
<ul> <li><a href="/sites/default/files/2019-07/IEEEtran_HOWTO.pdf">LaTeX Template Instructions</a> (PDF, 656 KB)</li> <li><a href="/sites/default/files/2019-07/IEEEtranBST2.zip">LateX Bibliography Files for Windows</a> (ZIP, 309 KB)</li> <li>Tips: Be sure to use the template's conference mode. See template documentation for details. Select Save when the File Download window appears. The files cannot open directly from the server.</li> </ul> </li> <li>Paper format: letter</li> <li><b>The strict page limit for your paper is 4 pages.</b></li> <li>Font size: 10 pt</li> <li>Margins: top = 0.75 inches, bottom = 1 inch, side = 0.625 inches.</li> <li>Each column measures 3.5 inches wide, with a 0.25-inch measurement between columns.</li> <li>For additional details: download <a href="https://past.date-conference.com/files/file/09-author_guidelines/format.pdf">format.pdf</a></li> </ul> </td> <td><b>28 November 2019 23:59:59 CET</b></td> </tr> <tr> <td>2</td> <td>One full <a href="/registration">conference registration</a> (at speaker/member rate) is required per accepted paper, regardless of whether the only attendee/presenter is a student (in which case the student rate registration will not qualify).<br /> <a href="/registration">Click Here to Register</a></td> <td><b>28 November 2019 23:59:59 CET</b></td> </tr> <tr> <td>3</td> <td>Find final paper upload and submission instructions for the proceedings on this page below.</td> <td><b>28 November 2019 23:59:59 CET</b></td> </tr> <tr> <td>4</td> <td><a href="/av-guidelines">Audio-visual guidelines</a> – available from mid-December 2019</td> <td> </td> </tr> </tbody> </table> <p>Please contact the following people for more information related to</p> <p><b>Final paper support</b><br />Support Team, iTEK CMS Web Solutions, SG<br /><span class="spamspan"><span class="u">date-conf</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">itekcms<span class="o"> [dot] </span>com</span></span></p> <p><b>Proceedings Chair</b><br />Elena Ioana Vatajelu, TIMA, FR<br /><span class="spamspan"><span class="u">papers</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">date-conference<span class="o"> [dot] </span>com</span></span></p> <p><b>Audio Visual Chair</b><br />Meikel Becker, TEM Festival GmbH Technical Event Management, DE<br /><span class="spamspan"><span class="u">audio-visual</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">date-conference<span class="o"> [dot] </span>com</span></span></p> </dd> <dt>Submission instructions for proceedings</dt> <dd> <p>Please read instructions carefully to submit your final manuscript.</p> <!-- p style="color: red; font-weight: bold;">Opens: <span style='font-weight:bold; color:red;'>No match in custom filter Important Dates for Camera-ready_paper_start_date</span></p --><p style="color: red; font-weight: bold;">Deadline: <b>28 November 2019 23:59:59 CET</b></p> <p>1. Format and name your file following the instructions on the DATE website. 
Please do not include page numbers in your file.</p> <ul> <li>Format your paper to US letter size (8 ½ by 11 inches), NOT A4 settings.</li> <li>DO NOT include page numbers.</li> <li>Use of color may enhance your figures and is encouraged.</li> <li><strong>Avoid Type-3 fonts: IEEE's conversion process greatly inflates the file size of PDFs containing them once published in IEEE Xplore.</strong></li> </ul> <p>Your file must also meet the IEEE requirements. IEEE templates are provided for Microsoft Word and LaTeX files at <a href="http://www.ieee.org/web/publications/pubservices/confpub/AuthorTools/conferenceTemplates.html">www.ieee.org/web/publications/pubservices/confpub/AuthorTools/conferenc…</a>. For more information see <a href="https://www.date-conference.com/date09/files/file/09-author_guidelines/format.pdf">format.pdf</a>.</p>
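<p>A frequent cause of Type-3 fonts in LaTeX-produced PDFs is bitmap Computer Modern output; loading vector fonts in the preamble usually avoids the problem. The two lines below are an unofficial sketch of such a preamble fix (the IEEE PDF eXpress check run by the submission tool remains the authoritative verdict); whether any Type-3 fonts remain can also be checked locally, for example with the poppler utility <code>pdffonts paper.pdf</code>.</p>
<pre>
% Unofficial preamble sketch for avoiding Type-3 (bitmap) fonts,
% a frequent cause of oversized PDFs in IEEE Xplore.
\usepackage[T1]{fontenc}  % 8-bit font encoding
\usepackage{lmodern}      % vector Latin Modern fonts instead of bitmap Computer Modern
</pre>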
<p>2. Your file MUST be sent to SoftConf (the same place where you submitted your paper).</p> <p>The submission tool will automatically check the paper for compatibility with IEEE PDF eXpress.</p> <p>To upload your PDF file for submission to DATE:</p> <ul> <li>connect to <a href="https://www.softconf.com/date20/conference">https://www.softconf.com/date20/conference</a> and log in with the same credentials you used to submit the paper</li> <li>under the section "Submission(s)", click on "Your current Submission(s)"</li> <li>you will see all submitted papers (both accepted and rejected); click on the paper for which you want to upload the camera-ready version</li> <li>click on "Final camera-ready submission"</li> <li>follow the instructions provided on that page</li> </ul> <p><span style="color: #ff0000;"><strong>REMARK</strong></span>: during the camera-ready submission procedure, you will be asked to confirm the title and the abstract of your paper.</p> <p>It is <strong>your responsibility</strong> to check the correctness of these data. They, together with the author names and author order, must be exactly the same as they appear in the submitted PDF.</p> <p>These data will be published <strong>as is</strong> in the programme booklet and on the relevant websites.</p> <p><strong>For questions about the submission of your paper, contact:</strong></p> <p><b>Final paper support</b><br />Support Team, iTEK CMS Web Solutions, SG<br />date-conf [at] itekcms [dot] com</p> <p><b>Proceedings Chair</b><br />Elena Ioana Vatajelu, TIMA, FR<br />papers [at] date-conference [dot] com</p> </dd> </dl> </div> <div class="field field--name-field-news-attachments field--type-file field--label-above clearfix"> <div class="field__label">Download further information:</div> <div class="field__items"> <div class="field__item"><a href="https://www.date-conference.com/sites/default/files/2019-07/DATE-conference-template-LaTeX.zip" type="application/zip; length=763007">DATE-conference-template-LaTeX.zip</a></div> <div class="field__item"><a href="https://www.date-conference.com/sites/default/files/2019-07/DATE-conference-template-LaTeX-IP.zip" type="application/zip; length=774460">DATE-conference-template-LaTeX-IP.zip</a></div> <div class="field__item"><a href="https://www.date-conference.com/sites/default/files/2019-07/DATE-conference-template-letter.docx" type="application/vnd.openxmlformats-officedocument.wordprocessingml.document; length=30879">DATE-conference-template-letter.docx</a></div> <div class="field__item"><a href="https://www.date-conference.com/sites/default/files/2019-07/DATE-conference-template-letter-IP.docx" type="application/vnd.openxmlformats-officedocument.wordprocessingml.document; length=32024">DATE-conference-template-letter-IP.docx</a></div> <div class="field__item"><a href="https://www.date-conference.com/sites/default/files/2019-07/IEEEtran_HOWTO.pdf" type="application/pdf; length=671626" title="IEEEtran_HOWTO.pdf">LaTeX Template Instructions</a></div> <div class="field__item"><a href="https://www.date-conference.com/sites/default/files/2019-07/IEEEtranBST2.zip" type="application/zip; length=316101" title="IEEEtranBST2.zip">LaTeX Bibliography Files for Windows</a></div> <div class="field__item"><a href="https://www.date-conference.com/sites/default/files/2019-07/format_0.pdf" type="application/pdf; length=157693">format_0.pdf</a></div> </div> </div>
data-orientation="horizontal" data-mail-url="mailto:" data-lang="en"> </div> Sat, 09 Nov 2019 15:33:07 +0000 Andreas Vörg, edacentrum GmbH, DE 120 at https://www.date-conference.com Event Overview https://www.date-conference.com/conference/event-overview <span>Event Overview</span> <span><a title="View user profile." href="/user/25">Andreas Vörg, …</a></span> <span>Mon, 4 Nov 2019 22:11</span> <div class="field field--name-field-news-content field--type-text-with-summary field--label-hidden clearfix field__item"><h2>Welcome to <span style="color:blue; font-weight:bold">DATE 20<span style="color:red">20</span></span></h2> <p><b>General Chair</b><br />Giorgio Di Natale, CNRS/TIMA, FR<br /><span class="spamspan"><span class="u">giorgio<span class="o"> [dot] </span>di-natale</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">univ-grenoble-alpes<span class="o"> [dot] </span>fr</span></span></p> <p><b>Programme Chair</b><br />Cristiana Bolchini, Politecnico di Milano, IT<br /><span class="spamspan"><span class="u">papers</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">date-conference<span class="o"> [dot] </span>com</span></span></p> <h2><span style="color:blue; font-weight:bold">DATE 20<span style="color:red">20</span></span> Preliminary Programme</h2> <p>Keynotes: <a href="https://www.date-conference.com/keynotes">https://www.date-conference.com/keynotes</a></p> <p>Monday Tutorials: <a href="https://www.date-conference.com/conference/monday-tutorials">https://www.date-conference.com/conference/monday-tutorials</a></p> <p>PhD Forum: <a href="https://www.date-conference.com/fringe-meeting-fm01">https://www.date-conference.com/fringe-meeting-fm01</a></p> <p>Conference Programme: <a href="https://www.date-conference.com/programme">https://www.date-conference.com/programme</a></p> <p>Wednesday Special Day on "Embedded AI": Sessions <a href="/program#5.1">5.1</a>, <a href="/programme#6.1">6.1</a>, <a href="/programme#7.0">7.0</a>, <a href="/programme#7.1">7.1</a>, <a href="/programme#8.1">8.1</a></p> <p>Thursday Special Day on "Silicon Photonics": Sessions <a href="/programme#9.1">9.1</a>, <a href="/programme#10.1">10.1</a>, <a href="/programme#11.0">11.0</a>, <a href="/programme#11.1">11.1</a>, <a href="/programme#12.1">12.1</a></p> <p>Exhibition Theatre: <a href="https://www.date-conference.com/exhibition/exhibition-theatre">https://www.date-conference.com/exhibition/exhibition-theatre</a></p> <p>Friday Workshops: <a href="https://www.date-conference.com/conference/friday-workshops">https://www.date-conference.com/conference/friday-workshops</a></p> <h2 id="2.7"> </h2> <!-- p>Currently information about <a href="/conference/monday-tutorials">Monday Tutorials</a> and <a href="/conference/friday-workshops">Friday Workshops</a> are available. 
href="conference/session/11.0">Keynote</a></li> <li><a href="fringe-meetings-and-co-located-workshops">Fringe Meetings &amp; Co-Located Workshops</a></li> <li>Interactive Presentations <a href="conference/session/IP4">IP4</a> and <a href="conference/session/IP5">IP5</a></li> <li><a href="exhibition/ub-programme">University Booth</a></li> </ul> </td> </tr> <tr> <td><a href="conference/friday-workshops">Friday</a></td> <td> <ul> <li><a href="conference/friday-workshops">Special Interest Workshops</a></li> <li><a href="fringe-meetings-and-co-located-workshops">Fringe Meetings &amp; Co-Located Workshops</a></li> </ul> </td> </tr> </tbody> </table --> </div> <div class="shariff" data-services="[&quot;twitter&quot;,&quot;facebook&quot;,&quot;linkedin&quot;,&quot;xing&quot;,&quot;mail&quot;]" data-theme="colored" data-css="complete" data-orientation="horizontal" data-mail-url="mailto:" data-lang="en"> </div> Mon, 04 Nov 2019 21:11:44 +0000 Andreas Vörg, edacentrum GmbH, DE 58 at https://www.date-conference.com DATE Party | Networking Event https://www.date-conference.com/date-party-networking-event <span>DATE Party | Networking Event</span> <span><a title="View user profile." href="/user/371">Eva Smejkal, K…</a></span> <span>Sun, 27 Oct 2019 06:37</span> <div class="field field--name-field-news-content field--type-text-with-summary field--label-hidden clearfix field__item"><p>The DATE Party traditionally states one of the highlights of the DATE week. As one of the main networking opportunities during the DATE week, it is a perfect occasion to meet friends and colleagues in a relaxed atmosphere while enjoying local amenities. It is scheduled on <strong> <b>Wednesday, 11 March 2020</b>, from 19:00 to 23:00.</strong></p> <p>This year, it will take place at the SUMMUM, which is located within the Alpexpo Park.</p> <p><strong>Please kindly note that it is not a seated dinner.</strong></p> <p>All delegates, exhibitors and their guests are invited to attend the party. Please note that entrance is only possible with a valid party ticket. Each full conference registration includes a ticket for the DATE Party (which needs to be booked during the online registration process though). Additional tickets can be purchased on-site at the registration desk (subject to availability of tickets). Price for extra ticket: 70 € per person.</p> <p><img alt="DATE Party Location" data-entity-type="file" data-entity-uuid="70a16a88-1922-4835-828f-4ff67629ed14" src="/sites/default/files/inline-images/Alpexpo_DATE%20Party.jpg" width="48%" /> <img alt="Summum" data-entity-type="file" data-entity-uuid="8d1c007a-2812-4d6e-8d33-4902b43e50b3" src="/sites/default/files/inline-images/Summum.png" width="48%" /></p> </div> <div class="shariff" data-services="[&quot;twitter&quot;,&quot;facebook&quot;,&quot;linkedin&quot;,&quot;xing&quot;,&quot;mail&quot;]" data-theme="colored" data-css="complete" data-orientation="horizontal" data-mail-url="mailto:" data-lang="en"> </div> Sun, 27 Oct 2019 05:37:28 +0000 Eva Smejkal, K.I.T. Group GmbH Dresden, DE 78 at https://www.date-conference.com How to reach Grenoble and Alpexpo https://www.date-conference.com/reach-grenoble <span>How to reach Grenoble and Alpexpo</span> <span><a title="View user profile." 
href="/user/25">Andreas Vörg, …</a></span> <span>Fri, 27 Sep 2019 09:26</span> <div class="field field--name-field-news-content field--type-text-with-summary field--label-hidden clearfix field__item"><p>ALPEXPO is the place where <span style="color:blue; font-weight:bold">DATE 20<span style="color:red">20</span></span> will take place. It is easy accessible by car and public transportation from two airports nearby (Lyon and Geneva).</p> <p><a href="/sites/default/files/DATE%202020/DATE2020-Travel%20information.pdf" rel="noopener" target="_blank">Download travel information (pdf)</a></p> <p><strong><strong>Public Transportation in Grenoble </strong>to ALPEXPO (<a href="/sites/default/files/DATE%202020/DATE2020-Public%20transport-Grenoble.pdf" rel="noopener" target="_blank">Map</a>)</strong></p> <p>You can access Alpexpo using the public transportation network in Grenoble, preferably by tramway or alternatively by bus.</p> <p>By tramway, from the railway and bus stations, take the <strong>tram line A (Direction:<span style="background-color: transparent;"> ÉCHIROLLES, Denis Papin</span>)</strong> directly to Alpexpo, <strong>stop at "Pôle Sud Alpexpo"</strong>. While getting off the tram, turn to your left. Walk until the cross-roads and turn right. You will already see the congress entrance from there. Notice that you will have to buy the tickets on any tramway stop using the automatic machines.</p> <p>By bus: Line C3 stop AlpExpo. Lines C6, 12, 65 and 67 - stop Grand Place (5 minutes walk). You can buy the ticket on the bus.</p> <p><strong>Taxis in Grenoble: </strong>Available 24/7, Tel. +33 4 76 54 42 54</p> <p><span style="font-family: Helvetica, Arial, sans-serif; font-size: 12px; font-weight: bold; line-height: 15.6000003814697px;">How to reach Grenoble</span></p> <h3><strong>By plane</strong></h3> <p><span style="text-decoration: underline;">From Paris Airport</span> (<a href="http://www.aeroportsdeparis.fr">www.aeroportsdeparis.fr</a>)</p> <ul> <li>Take the RER train (regional) from Charles De Gaulle airport to Paris Gare de Lyon train station and next the TGV (high speed train) directly to Grenoble train station (see "by train" information)</li> </ul> <p><span style="text-decoration: underline;">From Lyon-St. Exupéry Airport</span> (<a href="https://www.lyonaeroports.com/en/">https://www.lyonaeroports.com/en/</a>)</p> <ul> <li>Airport shuttle OUIBUS <a href="https://www.ouibus.com/">https://www.ouibus.com/</a><br /> At least one bus every hour. Fare: 12 € one way.  Return ticket: 24€. Please book your ticket <a href="https://www.ouibus.com/booking?origin=LYS&amp;destination=XGE&amp;outboundDate=2020-03-09&amp;inboundDate=2020-03-13&amp;passengers%5B0%5D%5Btype%5D=A">here</a>.</li> <li>Airport shuttle FLIXBUS <a href="https://global.flixbus.com/">https://global.flixbus.com/</a><br /> Fare: from 5 € one way. Return ticket: from 10 €. Please book your ticket <a href="https://shop.global.flixbus.com/search?&amp;departureCity=23721&amp;arrivalCity=3988&amp;rideDate=09.03.2020&amp;backRideDate=13.03.2020">here</a>.</li> <li>Train Lyon-Grenoble takes about 1 hour in most cases (see "by train" information).</li> <li>1 hour drive by car from the airport</li> <li>Taxi costs approximately 160 € - Tel. +33 4 78 28 23 33</li> </ul> <p><span style="text-decoration: underline;">From Geneva-Cointrin Airport</span> (<a href="https://www.gva.ch/en/">https://www.gva.ch/en/</a>)</p> <ul> <li>Airport shuttle OUIBUS <a href="https://www.ouibus.com/">https://www.ouibus.com/</a><br /> 5 busses per day. 
Fare: 29 € one way; return ticket: 56 €. Please book your ticket online <a href="https://www.ouibus.com/booking?origin=GVA&amp;destination=XGE&amp;outboundDate=2020-03-09&amp;inboundDate=2020-03-13&amp;passengers%5B0%5D%5Btype%5D=A">here</a>.</li> <li>Airport shuttle FLIXBUS <a href="https://global.flixbus.com/">https://global.flixbus.com/</a><br /> Fare: from 16 € one way; return ticket: from 32 €. Please book your ticket <a href="https://shop.global.flixbus.com/search?departureCity=23811&amp;arrivalCity=3988&amp;rideDate=09.03.2020&amp;backRideDate=13.03.2020">here</a>.</li> <li>The Geneva-Grenoble train takes between 2 and 3 hours in most cases.</li> <li>About a 2-hour drive by car from the airport</li> <li>A taxi costs approximately 240 € - Tel. +41 22 3 202 202</li> </ul> <p><span style="text-decoration: underline;">From Grenoble Isère Airport</span> (<a href="https://www.grenoble-airport.com/en">https://www.grenoble-airport.com/en</a>)<br /> <strong>Grenoble Isère Airport is not an international airport</strong>.</p> <ul> <li>Bus tickets are sold on the bus at Grenoble Airport and at the coach station's ticket desk in Grenoble. Fare: 15.50 € one way; return ticket: 27 € (<a href="https://ublo-file-manager.valraiso.net/assets/actibus/LIGNE%20AEROPORT%20EXPRESS%20%202019-20.v2.pdf">timetable</a>)</li> <li>About a 35-minute drive by car to Grenoble town center</li> </ul> <h3><strong>By train</strong></h3> <ul> <li>High-speed train (TGV) from Paris Gare de Lyon via Lyon St Exupéry airport to Grenoble in 3 hours - 9 trains every day</li> </ul> <p>Grenoble train station: Tel. +33 8 92 35 35 35<br /> Booking: <a href="https://www.sncf.com/en">https://www.sncf.com/en</a></p> <h3><strong>VIP Transports - Limousine Service</strong><span style="font-weight: normal;"> (1 to 15 persons)</span></h3> <p><a href="http://www.vip-limousine.fr/en/service.html" rel="noopener" target="_blank">Information &amp; Reservation</a></p> <p><strong>Contact: VIP Limousine</strong><br /> Email: contact [at] vip-limousine [dot] fr<br /> Tel: +33 4 76 87 18 15</p> <h3><strong>By car</strong></h3> <p><span style="text-decoration: underline;">From Paris airport</span> (Charles de Gaulle): follow the direction of Lyon, then Grenoble; approximately 600 km.</p> <p><em>[Embedded Google Map: driving route from Paris-Charles de Gaulle Airport to Alpexpo, Grenoble]</em></p> <p><span style="text-decoration: underline;">From Geneva airport</span>: take the A41 motorway. When arriving near Grenoble, take the half-ring road "Rocade Sud" towards Lyon.
Exit at Alpexpo (Exit n°6) and follow the signs to Alpexpo.</p> <p><em>[Embedded Google Map: driving route from Genève Aéroport to Alpexpo, Grenoble]</em></p> <p><span style="text-decoration: underline;">From Lyon airport</span>: take the A48 motorway towards Grenoble. Once in the Grenoble area, after the motorway toll station, take the half-ring road in the direction of Chambéry and exit at the Alpexpo exit. Follow the signs to Alpexpo.</p> <p><em>[Embedded Google Map: driving route from Lyon-Saint Exupéry Airport to Alpexpo, Grenoble]</em></p> <p>For more detailed directions, please also visit <a href="http://www.mappy.fr">www.mappy.fr</a></p> </div> Fri, 27 Sep 2019 07:26:18 +0000 Andreas Vörg, edacentrum GmbH, DE 286 at https://www.date-conference.com PhD Forum - Call for Submissions https://www.date-conference.com/phd-forum-call-for-submission <span>PhD Forum - Call for Submissions</span> <span><a title="View user profile." href="/user/288">Matthias Fried…</a></span> <span>Mon, 27 May 2019 12:37</span> <div class="field field--name-field-news-content field--type-text-with-summary field--label-hidden clearfix field__item"><p>The PhD Forum is hosted by the European Design Automation Association (EDAA), the ACM Special Interest Group on Design Automation (SIGDA), and the IEEE Council on Electronic Design Automation (CEDA). The forum is a great opportunity for PhD students who have completed their PhD thesis within the last 12 months, or who are close to completing it, to present their work to a broad audience in the system design and design automation community, from both industry and academia. The forum may also help students establish contacts for entering the job market.
In addition, representatives from industry and academia get a glimpse of the state of the art in system design and design automation.</p> <h2>Eligibility</h2> <p>The following two classes of students are eligible:</p> <ol> <li>Students who have finished their PhD thesis within the last 12 months, and</li> <li>Students who are close to completing their thesis work.</li> </ol> <h2>Benefits</h2> <ul> <li>A poster presentation at the PhD Forum</li> <li>Eligibility to win the PhD Forum Award</li> <li>Contacts with professionals from industry and academia</li> <li>The possibility to distribute flyers summarizing the PhD work</li> <li>The possibility to apply for travel grants once the acceptance of the submission has been acknowledged</li> <li>FREE registration for the DATE Monday tutorials</li> <li>A (free) dinner reception</li> </ul> <h2>Submission</h2> <p>Submissions need to contain:</p> <ul> <li>A full contact address with affiliation, phone and e-mail</li> <li>A 2-page extended abstract (PDF) of not more than 1600 words, describing the novelties and advantages of the thesis work. The abstract should also include the author's name and affiliation.</li> <li>Either (a) a University-approved thesis proposal (PDF) or (b) one published paper (PDF)</li> </ul> <p><strong>Submit this material to the DATE online submission system at <a href="https://www.softconf.com/date20/phdforum/">https://www.softconf.com/date20/phdforum/</a>.</strong></p>
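<p>The call above does not prescribe a template for the extended abstract; purely as an unofficial starting point, a plain article-class set-up such as the sketch below would satisfy the stated constraints (the title, name and packages are illustrative only).</p>
<pre>
% Unofficial sketch for the 2-page extended abstract (PDF).
% The call mandates only the length (2 pages, at most 1600 words)
% and that name and affiliation appear, not this particular set-up.
\documentclass[10pt]{article}
\usepackage[margin=2.5cm]{geometry}

\title{Thesis Title: Novelties and Advantages}
\author{Candidate Name \\ University, Country}
\date{}

\begin{document}
\maketitle

\section*{Thesis summary}
Describe the novelties and advantages of the thesis work here,
keeping the total under 1600 words and two pages.

\end{document}
</pre>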
<h2>Important dates</h2> <ul> <li>Submission deadline: <b>14 November 2019 23:59:59 CET</b></li> <li>Notification of acceptance: <b>20 December 2019 23:59:59 CET</b></li> <li>Presentation at DATE: <b>Monday, 9 March 2020</b></li> </ul> <h2>Contact</h2> <p><b>EDAA/ACM SIGDA/IEEE CEDA PhD Forum Chair</b><br />Robert Wille, Johannes Kepler University Linz, AT<br />phd-forum [at] date-conference [dot] com</p> <h2>About the sponsors</h2> <p>EDAA is a non-profit association. Its purpose is to operate for educational, scientific and technical purposes for the benefit of the international electronics design and design automation community. In the field of design and design automation of electronic circuits and systems, the Association promotes a series of high-quality international technical conferences and workshops across Europe and cooperates actively to maintain harmonious relationships with other national and international technical societies and groups promoting the purpose of the Association. EDAA is the main sponsor of DATE, the premier design, automation and test conference and exhibition in Europe.</p> <p><a href="http://www.edaa.com">http://www.edaa.com</a></p> <p>The ACM Special Interest Group on Design Automation (SIGDA) is organized and operated exclusively for educational, scientific, and technical purposes in design automation. The mission of SIGDA and its activities include collecting and disseminating information on design automation through a newsletter and other publications; organizing sessions at ACM conferences; sponsoring conferences, symposia, and workshops; organizing projects and working groups for education, research, and development; serving as a source of technical information for the Council and subunits of the ACM; and representing the opinions and expertise of the membership on matters of technical interest to SIGDA or ACM.</p> <p><a href="http://www.sigda.org">http://www.sigda.org</a></p> <p>The IEEE Council on Electronic Design Automation (CEDA) provides a focal point for Electronic Design Automation (EDA) and embedded systems research and promotion activities spread across six IEEE societies: Antennas and Propagation, Circuits and Systems, Computer, Electron Devices, Microwave Theory and Techniques, and Solid-State Circuits. The Council sponsors or co-sponsors over a dozen key EDA and embedded systems conferences, and publishes IEEE Transactions on CAD, IEEE Design &amp; Test of Computers, and IEEE Embedded Systems Letters. CEDA is continuously involved in promoting engineering research performed by young researchers and professionals, and recognizes its leaders via the A. Richard Newton Award, the Early Career Award, and the Phil Kaufman Award.</p> <p><a href="http://www.ieee-ceda.org/">http://www.ieee-ceda.org/</a></p> <p>Download further information:</p> <p> <a href="https://www.date-conference.com/admin/structure/path_file_entity/14/edit">Download the <span style="color:blue; font-weight:bold">DATE 20<span style="color:red">20</span></span> PhD Forum Call for Papers here</a></p> </div> Mon, 27 May 2019 10:37:03 +0000 Matthias Friedrich, edacentrum GmbH, DE 174 at https://www.date-conference.com