DATE 2020 - Design, Automation and Test in Europe Conference
https://www.date-conference.com/

# DATE 2020 virtual conference now online

https://www.date-conference.com/virtual-conference
*Anja Zeun, K.I.T. Group GmbH Dresden, DE | Tue, 21 Apr 2020 10:47*

For the past weeks, the DATE Committees have been working intensively to transform DATE 2020 into a virtual conference. More than 400 presentations were submitted. Again, we would like to express our deepest gratitude to all presenters for their contributions, despite the demanding situation worldwide, as well as to participants for their patience. We would also like to thank all exhibitors and sponsors for their ongoing support.

We are glad to announce that the virtual conference has now been launched and can be accessed via www.date-conference-virtual.com.

**All participants received an invitation link to generate their personal log-in data via e-mail this morning.**

Once you have created your log-in, you can view the virtual conference in accordance with your initial conference registration. The platform will remain open until **31 May 2020** to ensure that you will be able to view the fascinating presentations.

The structure of the DATE 2020 virtual conference is based on the programme of the physical event. To guide you through the individual sessions, you can view and download the programme booklet. By clicking on the menu item **Virtual Conference** you will find an overview of the programme that leads you to all submitted presentations. The platform allows you to get in contact with the authors, either by commenting on a presentation directly or by private messaging. Under the menu items **Virtual Exhibition** and **Sponsors** you will find all sponsors and exhibitors who have contributed to DATE 2020.

We hope you enjoy this one-of-a-kind DATE conference and welcome any feedback!

If you have any questions regarding the virtual conference, please do not hesitate to contact the DATE Conference Registration at date-registration@kitdresden.de!
Stay healthy and all the best,

Giorgio Di Natale | *DATE 2020 General Chair*
Cristiana Bolchini | *DATE 2020 Programme Chair*

# Conference and Exhibition - 1–5 February 2021 - ALPEXPO, Grenoble, France

https://www.date-conference.com/conference-and-exhibition-1-5-february-2021-alpexpo-grenoble-france
*Andreas Vörg, edacentrum GmbH, DE | Wed, 20 May 2020 13:58*

## Call for Papers

### Scope of the Event

The 24th DATE conference and exhibition is the main European event bringing together designers and design automation users, researchers and vendors, as well as specialists in the hardware and software design, test and manufacturing of electronic circuits and systems. DATE puts a strong emphasis on both technology and systems, covering ICs/SoCs, emerging technologies, embedded systems and embedded software.

### Structure of the Event

The five-day event consists of a conference with plenary invited papers, regular papers, panels, hot-topic sessions, tutorials, workshops, special focus days and a track for executives. The scientific conference is complemented by a commercial exhibition showing the state of the art in design and test tools, methodologies, IP and design services, reconfigurable and other hardware platforms, embedded software and (industrial) design experiences from different application domains, such as automotive, wireless, telecom and multimedia applications. The organisation of user group meetings, fringe meetings, a university booth, a PhD forum, vendor presentations and social events offers a wide variety of extra opportunities to meet and exchange information on relevant issues for the design automation, design and test communities.
Special space will also be allocated for EU-funded projects to show their results.

More details are available on the DATE website: www.date-conference.com.

### Areas of Interest

Within the scope of the conference, the main areas of interest are: design automation, design tools and hardware architectures for electronic and embedded systems; test and dependability at system, chip, circuit and device level for analogue and digital electronics; modelling, analysis, design and deployment of embedded software and cyber-physical systems; application design and industrial design experiences.

Topics of interest include, but are not restricted to:

- System Specification and Modelling
- System-level Design Methodologies and High-Level Synthesis
- System Simulation and Validation
- Formal Methods and Verification
- Design and Test for Analogue and Mixed-Signal Circuits and Systems, and MEMS
- Design and Test of Secure Systems
- Network on Chip and Communication-Centric Design
- Architectural and Microarchitectural Design
- Low-power, Energy-efficient and Thermal-aware Design
- Approximate Computing
- Reconfigurable Systems
- Logical and Physical Analysis and Design
- Emerging Design Technologies for Future Computing
- Emerging Design Technologies for Future Memories
- Power-efficient and Sustainable Computing
- Robotics and Industry 4.0
- Automotive Systems and Smart Energy Systems
- Augmented Living and Personalized Healthcare
- Secure Systems, Circuits and Architectures
- Self-adaptive and Learning Systems
- Applications of Emerging Technologies
- Modelling and Mitigation of Defects, Faults, Variability and Reliability
- Test Generation, Test Architectures, Design for Test, and Diagnosis
- Microarchitecture-Level Dependability
- System-Level Dependability
- Real-time and Dependable Systems
- Embedded Systems for Deep Learning
- Model-based Design, Verification and Security for Embedded Systems
- Embedded Software Architectures, Compilers and Tool Chains
- Cyber-Physical Systems Design

### Submission of Papers

**All papers have to be submitted electronically by Monday, 14 September 2020 as abstracts and by Monday, 21 September 2020 as full papers via: www.date-conference.com**

Papers can be submitted either for standard oral presentation or for interactive presentation.

*The Program Committee also encourages proposals for Special Sessions, Tutorials, Friday Workshops, University Booth Demonstrations, PhD Forum and Exhibition Theatre.*

### Chairs

**General Chair:**
Franco Fummi, Università di Verona, IT
E-mail: franco.fummi@univr.it
src="/modules/contrib/spamspan/image.gif" /><span class="d">univr<span class="o"> [dot] </span>it</span></span></p> <p><strong>Programme Chair:</strong><br /> Ian O'Connor, University of Lyon, FR<br /> E-mail: <span class="spamspan"><span class="u">ian<span class="o"> [dot] </span>oconnor</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">ec-lyon<span class="o"> [dot] </span>fr</span></span></p> </td> <td style="width:50%;"> <h3>Conference Organisation</h3> <p>c/o K.I.T. Group GmbH Dresden<br /> Bautzner Str. 117–119, 01099 Dresden, DE<br /> Phone: +49 351 65573-137<br /> E-mail: <span class="spamspan"><span class="u">date</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">kitdresden<span class="o"> [dot] </span>de</span></span></p> </td> </tr> </tbody> </table> </div> <div class="shariff" data-services="[&quot;twitter&quot;,&quot;facebook&quot;,&quot;linkedin&quot;,&quot;xing&quot;,&quot;mail&quot;]" data-theme="colored" data-css="complete" data-orientation="horizontal" data-mail-url="mailto:" data-lang="en"> </div> Wed, 20 May 2020 11:58:12 +0000 Andreas Vörg, edacentrum GmbH, DE 814 at https://www.date-conference.com DATE 2020 Awards https://www.date-conference.com/awards <span>DATE 2020 Awards</span> <span><a title="View user profile." href="/user/25">Andreas Vörg, …</a></span> <span>Tue, 21 Apr 2020 12:12</span> <div class="field field--name-field-news-content field--type-text-with-summary field--label-hidden clearfix field__item"><h2 class="text-align-center">PhD Forum Best Presentation Awards<br /> supported by EDAA, ACM Sigda &amp; IEEE CEDA</h2> <p class="text-align-center">Alireza Mahzoon, University of Bremen, DE</p> <p class="text-align-center">is awarded for the presentation</p> <p class="text-align-center"><em>Proving Correctness of Industrial Multipliers using Symbolic Computer Algebra</em></p> <p class="text-align-center"> </p> <p class="text-align-center">Behnaz Pourmohseni, Friedrich Alexander Universität Erlangen Nürnberg (FAU), DE</p> <p class="text-align-center">is awarded for the work</p> <p class="text-align-center"><em>System-Level Mapping, Analysis, and Management of Real-Time Applications in Many-Core Systems</em></p> <p class="text-align-center"> <iframe allow="autoplay; fullscreen" allowfullscreen="" frameborder="0" height="360" src="https://player.vimeo.com/video/415142480" width="640"></iframe></p> <h2 class="text-align-center">2020 EDAA Achievement Award</h2> <p class="text-align-center">The 2020 EDAA Achievement Award ceremony by Norbert Wehn starts in the video below.</p> <p class="text-align-center"> <iframe allow="autoplay; fullscreen" allowfullscreen="" frameborder="0" height="360" src="https://player.vimeo.com/video/407513508" width="640"></iframe></p> <p class="text-align-center">Luca Benini, ETHZ, Switzerland</p> <p class="text-align-center"><a href="https://www.edaa.com/press_releases/PR_achievement_2020_results.pdf">https://www.edaa.com/press_releases/PR_achievement_2020_results.pdf</a></p> <h2 class="text-align-center">EDAA Outstanding Dissertations Award 2019</h2> <p class="text-align-center">The EDAA Outstanding Dissertations Award 2019 ceremony by Lorena Anghel starts in the video below.</p> <p class="text-align-center"> <iframe allow="autoplay; fullscreen" allowfullscreen="" frameborder="0" height="360" src="https://player.vimeo.com/video/407513611" width="640"></iframe></p> <h3 class="text-align-center">Topic 1</h3> <p 
class="text-align-center">Eric Schneider, Ph.D</p> <p class="text-align-center"><em>Multi-Level Simulation of Nano-Electronic Digital Circuits on GPUs</em></p> <h3 class="text-align-center">Topic 2</h3> <p class="text-align-center">Fabio Passos, Ph.D.</p> <p class="text-align-center"><em>A Multilevel Approach to the Systematic Design of Radio-Frequency Integrated Circuits</em></p> <h3 class="text-align-center">Topic 3</h3> <p class="text-align-center">Ahmedullah Aziz, Ph.D.</p> <p class="text-align-center"><em>Device-Circuit Co-design Employing Phase Transition Materials for Low Power Electronics</em></p> <h3 class="text-align-center">Topic 4</h3> <p class="text-align-center">Innocent Agbo, Ph.D.</p> <p class="text-align-center"><em>Reliability Modeling and Mitigation for Embedded Memories</em></p> <h3 class="text-align-center">Topic 5 (Quantum Computing Systems):</h3> <p class="text-align-center">Alwin Zulehner, Ph.D.</p> <p class="text-align-center"><em>Design Automation for Quantum Computing</em></p> <p class="text-align-center"><a href="https://www.edaa.com/press_releases/EDAA_Award_2019_Results.pdf">https://www.edaa.com/press_releases/EDAA_Award_2019_Results.pdf</a></p> <h2 class="text-align-center">DATE Fellow Award</h2> <p class="text-align-center">The DATE Fellow Award ceremony by Norbert Wehn starts in the video below at minute 2:50.</p> <p class="text-align-center"> <iframe allow="autoplay; fullscreen" allowfullscreen="" frameborder="0" height="360" src="https://player.vimeo.com/video/407513508" width="640"></iframe></p> <p class="text-align-center">Jürgen Teich, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE</p> <h2 class="text-align-center">IEEE CS TTTC Outstanding Contribution Award</h2> <p class="text-align-center">Jan Madsen, Technical University of Denmark, DK<br /> Giorgio Di Natale, CNRS/TIMA, FR</p> <h2 class="text-align-center">DATE Best Paper Awards 2020</h2> <p class="text-align-center">The Best Paper Award ceremony by Cristiana Bolchini during her opening talk video below starts at minute 2:05.</p> <p class="text-align-center"> <iframe allow="autoplay; fullscreen" allowfullscreen="" frameborder="0" height="360" src="https://player.vimeo.com/video/400893288" width="640"></iframe></p> <p class="text-align-center">Each year the Design, Automation and Test in Europe Conference presents awards to the authors of the best papers. 
The selection is performed by the award committee, composed of the Track Chairs Cristiana Bolchini, Theocharis Theocharides, Jaume Abella and Valeria Bertacco and the following members: Borzoo Bonakdarpour, Andrea Calimera, Ramon Canal, Luca Carloni, Alessandro Cimatti, Ayse Coskun, Nikil Dutt, Ioannis Papaefstathiou, Dionisios Pnevmatikatos, Davide Quaglia, Muhammad Shafique, Olivier Sentieys, Luis Miguel Silveira, Juergen Teich, Vasileios Tenentes, Jerzy Tyszer and Arnaud Virazel.

### D Track

Impact of Magnetic Coupling and Density on STT-MRAM Performance
*Lizhou Wu¹, Siddharth Rao², Mottaqiallah Taouil¹, Erik Jan Marinissen², Gouri Sankar Kar² and Said Hamdioui¹*
*¹TU Delft, NL; ²IMEC, BE*

### A Track

A Flexible and Scalable NTT Hardware: Applications from Homomorphically Encrypted Deep Learning to Post-Quantum Cryptography
*Ahmet Can Mert¹, Emre Karabulut², Erdinc Ozturk¹, Erkay Savas¹, Michela Becchi² and Aydin Aysu²*
*¹Sabanci University, TR; ²North Carolina State University, US*

### T Track

DEFCON: Generating and Detecting Failure-prone Instruction Sequences via Stochastic Search
*Ioannis Tsiokanos¹, Lev Mukhanov¹, Giorgis Georgakoudis², Dimitrios S. Nikolopoulos³ and Georgios Karakonstantis¹*
*¹Queen's University Belfast, GB; ²Lawrence Livermore National Laboratory, US; ³Virginia Tech, US*

### E Track

Statistical Time-based Intrusion Detection in Embedded Systems
*Nadir Carreon Rascon, Allison Gilbreath and Roman Lysecky*
*University of Arizona, US*

## Best Paper Award Nominations

### D Track

Fast and Accurate DRAM Simulation: Can we Further Accelerate it?
*Johannes Feldmann¹; Matthias Jung²; Kira Kraft¹; Lukas Steiner¹; Norbert Wehn¹*
*¹TU Kaiserslautern, ²Fraunhofer IESE*

ESP4ML: Platform-Based Design of Systems-on-Chip for Embedded Machine Learning
*Davide Giri; Kuan-Lin Chiu; Giuseppe Di Guglielmo; Paolo Mantovani; Luca Carloni*
*Columbia University*

Verification Runtime Analysis: Get the Most Out of Partial Verification
*Martin Ring¹; Fritjof Bornbebusch¹; Christoph Lüth¹,²; Robert Wille³; Rolf Drechsler¹,²*
*¹DFKI, ²University of Bremen, ³Johannes Kepler University Linz*
Verification by S²QED and Property Generation</p> <p class="text-align-center"><i>Keerthikumara Devarajegowda<sup>1</sup>; Mohammad Rahmani Fadiheh<sup>2</sup>; Eshan Singh<sup>3</sup>; Clark Barrett<sup>3</sup>; Subhasish Mitra<sup>3</sup>; Wolfgang Ecker<sup>1</sup>; Dominik Stoffel<sup>2</sup>; Wolfgang Kunz<sup>2</sup></i></p> <p class="text-align-center"><i><sup>1</sup>Infineon Technologies, <sup>2</sup>TU Kaiserslautern, <sup>3</sup>Stanford University</i></p> <p class="text-align-center">GANA: Graph Convolutional Network Based Automated Netlist Annotation<br /> for Analog Circuits</p> <p class="text-align-center"><i>Kishor Kunal<sup>1</sup>; Tonmoy Dhar<sup>1</sup>; Meghna Madhusudan<sup>1</sup>; Jitesh Poojary<sup>1</sup>; Arvind Sharma<sup>1</sup>; Wenbin Xu<sup>2</sup>; Steven Burns<sup>3</sup>; Jiang Hu<sup>2</sup>; Ramesh Harjani<sup>1</sup>; Sachin S. Sapatnekar<sup>1</sup></i></p> <p class="text-align-center"><i><sup>1</sup>University of Minnesota, <sup>2</sup>Texas A&amp;M University, <sup>3</sup>Intel Corporation</i></p> <p class="text-align-center">Backtracking Search for Optimal Parameters of a PLL-based True Random Number Generator</p> <p class="text-align-center"><i>Brice Colombier; Nathalie Bochard; Florent Bernard; Lilian Bossuet</i></p> <p class="text-align-center"><i>University of Lyon</i></p> <p class="text-align-center">GRAMARCH: A GPU-ReRAM based Heterogeneous Architecture<br /> for Neural Image Segmentation</p> <p class="text-align-center"><i>Biresh Kumar Joardar<sup>1</sup>; Nitthilan Kannappan Jayakodi<sup>1</sup>; Jana Doppa<sup>1</sup>; Partha Pratim Pande<sup>1</sup>; Hai (Helen) Li<sup>2</sup>; Krishnendu Chakrabarty<sup>3</sup></i></p> <p class="text-align-center"><i><sup>1</sup>Washington State University, <sup>2</sup>Duke University/TUM-IAS, <sup>3</sup>Duke University</i></p> <p class="text-align-center">PSB-RNN: A Processing-in-Memory Systolic Array Architecture<br /> using Block Circulant Matrices for Recurrent Neural Networks</p> <p class="text-align-center"><i>Nagadastagiri<sup>1</sup>; Sahithi Rampalli<sup>1</sup>; Makesh Tarun Chandran<sup>1</sup>; Gurpreet Singh Kalsi<sup>2</sup>; John (Jack) Sampson<sup>1</sup>; Sreenivas Subramoney<sup>2</sup>; Vijaykrishnan Narayanan<sup>1</sup></i></p> <p class="text-align-center"><i><sup>1</sup>The Pennsylvania State University, <sup>2</sup>Processor Architecture Research Lab, Intel Labs</i></p> <p class="text-align-center">A Learning-Based Thermal Simulation Framework<br /> for Emerging Two-Phase Cooling Technologies</p> <p class="text-align-center"><i>Zihao Yuan<sup>1</sup>; Geoffrey Vaartstra<sup>2</sup>; Prachi Shukla<sup>1</sup>; Zhengmao Lu<sup>2</sup>; Evelyn Wang<sup>2</sup>; Sherief Reda<sup>3</sup>; Ayse Coskun<sup>1</sup></i></p> <p class="text-align-center"><i><sup>1</sup>Boston University, <sup>2</sup>Massachusetts Institute of Technology, <sup>3</sup>Brown University</i></p> <p class="text-align-center">ProxSim: Simulation Framework for Cross-Layer Approximate DNN Optimization</p> <p class="text-align-center"><i>Cecilia De la Parra<sup>1</sup>; Andre Guntoro<sup>1</sup>; Akash Kumar<sup>2</sup></i></p> <p class="text-align-center"><i><sup>1</sup>Robert Bosch GmbH, <sup>2</sup>TU Dresden</i></p> <p class="text-align-center">A Framework for Adding Low-Overhead, Fine-Grained Power Domains to CGRAs</p> <p class="text-align-center"><i>Ankita Nayak; Keyi Zhang; Raj Setaluri; Alex Carsello; Makai Mann; Stephen Richardson; Rick Bahr; Pat Hanrahan; Mark Horowitz; Priyanka Raina</i></p> <p 
class="text-align-center"><i>Stanford University</i></p> <p class="text-align-center">Floating Random Walk Based Capacitance Solver for VLSI Structures<br /> with Non-Stratified Dielectrics<br /> <i>Mingye Song; Ming Yang; Wenjian Yu</i></p> <p class="text-align-center"><i>Tsinghua University</i></p> <p class="text-align-center">Ternary Compute-Enabled Memory based on Ferroelectric Transistors<br /> for Accelerating Deep Neural Networks</p> <p class="text-align-center"><i>Sandeep Krishna Thirumala; Shubham Jain; Sumeet Gupta; Anand Raghunathan</i></p> <p class="text-align-center"><i>Purdue University</i></p> <p class="text-align-center">Impact of Magnetic Coupling and Density on STT-MRAM Performance</p> <p class="text-align-center"><i>Lizhou Wu<sup>1</sup>; Siddharth Rao<sup>2</sup>; Mottaqiallah Taouil<sup>1</sup>; Erik Jan Marinissen<sup>2</sup>;<br /> Gouri Sankar Kar<sup>2</sup>; Said Hamdioui<sup>1</sup></i></p> <p class="text-align-center"><sup>1</sup><i>Delft University of Technology, <sup>2</sup>IMEC</i></p> <h3 class="text-align-center">A Track</h3> <p class="text-align-center">GenieHD: Efficient DNA Pattern Matching Accelerator Using Hyperdimensional Computing</p> <p class="text-align-center"><i>Yeseong Kim; Mohsen Imani; Niema Moshiri;Tajana Rosing</i></p> <p class="text-align-center"><i>University of California San Diego</i></p> <p class="text-align-center">Achieving Determinism in Adaptive AUTOSAR</p> <p class="text-align-center"><i>Christian Menard<sup>1</sup>; Andres Goens<sup>1</sup>; Marten Lohstroh<sup>2</sup>; Jeronimo Castrillon<sup>1</sup></i></p> <p class="text-align-center"><i><sup>1</sup>TU Dresden, <sup>2</sup>University of California, Berkeley</i></p> <p class="text-align-center">A Flexible and Scalable NTT Hardware: Applications from Homomorphically Encrypted Deep Learning to Post-Quantum Cryptography</p> <p class="text-align-center"><i>Ahmet Can Mert<sup>1</sup>; Emre Karabulut<sup>2</sup>; Erdinc Ozturk<sup>1</sup>; Erkay Savas<sup>1</sup>;<br /> Michela Becchi<sup>2</sup>; Aydin Aysu<sup>2</sup><br /> <sup>1</sup>Sabanci University, <sup>2</sup>North Carolina State University</i></p> <p class="text-align-center">AntiDOte: Attention-based Dynamic Optimization for Neural Network Runtime Efficiency</p> <p class="text-align-center"><i>Fuxun Yu<sup>1</sup>; Chenchen Liu<sup>2</sup>; Di Wang<sup>3</sup>; Yanzhi Wang<sup>1</sup>; Xiang Chen<sup>1</sup></i></p> <p class="text-align-center"><i><sup>1</sup>George Mason University, <sup>2</sup>University of Maryland, <sup>3</sup>Microsoft</i></p> <p class="text-align-center">Go Unary: A Novel Synapse Coding and Mapping Scheme<br /> for Reliable ReRAM-based Neuromorphic Computing</p> <p class="text-align-center"><i>Chang Ma; Yanan Sun; Weikang Qian; Ziqi Meng; Rui Yang; Li Jiang</i></p> <p class="text-align-center"><i>Shanghai Jiao Tong University</i></p> <h3 class="text-align-center">T Track</h3> <p class="text-align-center">On Improving Fault Tolerance of Memristor Crossbar Based Neural Network Designs<br /> by Target Sparsifying</p> <p class="text-align-center"><i>Song Jin<sup>1</sup>; Songwei Pei<sup>2</sup>; Yu Wang<sup>1</sup></i></p> <p class="text-align-center"><i><sup>1</sup>North China Electric Power University,<br /> <sup>2</sup>Beijing University of Posts and Telecommunications</i></p> <p class="text-align-center">Synthesis of Fault-Tolerant Reconfigurable Scan Networks</p> <p class="text-align-center"><i>Sebastian Brandhofer; Michael Kochte; Hans-Joachim Wunderlich</i></p> <p 
class="text-align-center"><i>University of Stuttgart</i></p> <p class="text-align-center">DEFCON: Generating and Detecting Failure-prone Instruction Sequences<br /> via Stochastic Search</p> <p class="text-align-center"><i>Ioannis Tsiokanos<sup>1</sup>; Lev Mukhanov<sup>2</sup>; Giorgis Georgakoudis<sup>3</sup>;<br /> Dimitrios S. Nikolopoulos<sup>4</sup>; Georgios Karakonstantis<sup>1</sup></i></p> <p class="text-align-center"><i><sup>1</sup>Queen's University Belfast, <sup>2</sup>QUB, <sup>3</sup>Lawrence Livermore National Laboratory,<br /> <sup>4</sup>Virginia Tech</i></p> <p class="text-align-center">Thermal-Cycling-aware Dynamic Reliability Management in Many-Core System-on-Chip</p> <p class="text-align-center"><i>Mohammad-Hashem Haghbayan<sup>1</sup>; Antonio Miele<sup>2</sup>; Zhuo Zou<sup>3</sup>;<br /> Hannu Tenhunen<sup>1</sup>; Juha Plosila<sup>1</sup></i></p> <p class="text-align-center"><i><sup>1</sup>University of Turku, <sup>2</sup>Politecnico di Milano,<br /> <sup>3</sup>Nanjing University of Computer Science and Technology</i></p> <h3 class="text-align-center">E Track</h3> <p class="text-align-center">Deeper Weight Pruning without Accuracy Loss in Deep Neural Networks</p> <p class="text-align-center"><i>Byungmin Ahn; Taewhan Kim</i></p> <p class="text-align-center"><i>Seoul National University</i></p> <p class="text-align-center">ACOUSTIC: Accelerating Convolutional Neural Networks<br /> through Or-Unipolar Skipped Stochastic Computing</p> <p class="text-align-center"><i>Wojciech Romaszkan; Tianmu Li; Tristan Melton; Sudhakar Pamarti; Puneet Gupta</i></p> <p class="text-align-center"><i>University of California Los Angeles</i></p> <p class="text-align-center">Statistical Time-based Intrusion Detection in Embedded Systems</p> <p class="text-align-center"><i>Nadir Carreon Rascon; Allison Gilbreath; Roman Lysecky</i></p> <p class="text-align-center"><i>University of Arizona</i></p> <p class="text-align-center">Energy-efficient runtime resource management for adaptable multi-application mapping</p> <p class="text-align-center"><i>Robert Khasanov; Jeronimo Castrillon</i></p> <p class="text-align-center"><i>TU Dresden</i></p> <p class="text-align-center">CPS-oriented Modeling and Control of Traffic Signals Using Adaptive Back Pressure</p> <p class="text-align-center"><i>Wanli Chang<sup>1</sup>; Debayan Roy<sup>2</sup>; Shuai Zhao<sup>1</sup>; Anuradha Annaswamy<sup>3</sup>; Samarjit Chakraborty<sup>2</sup></i></p> <p class="text-align-center"><i><sup>1</sup>University of York, <sup>2</sup>Technical University of Munich,</i></p> <p class="text-align-center"><i><sup>3</sup>Massachusetts Institute of Technology</i></p> </div> <div class="shariff" data-services="[&quot;twitter&quot;,&quot;facebook&quot;,&quot;linkedin&quot;,&quot;xing&quot;,&quot;mail&quot;]" data-theme="colored" data-css="complete" data-orientation="horizontal" data-mail-url="mailto:" data-lang="en"> </div> Tue, 21 Apr 2020 10:12:20 +0000 Andreas Vörg, edacentrum GmbH, DE 813 at https://www.date-conference.com DATE 2020 in Grenoble replaced by a virtual conference that will be scheduled in the coming weeks https://www.date-conference.com/date-2020-grenoble-replaced-virtual-conference-will-be-scheduled-coming-weeks <span>DATE 2020 in Grenoble replaced by a virtual conference that will be scheduled in the coming weeks</span> <span><a title="View user profile." 
href="/user/25">Andreas Vörg, …</a></span> <span>Wed, 4 Mar 2020 17:45</span> <div class="field field--name-field-news-content field--type-text-with-summary field--label-hidden clearfix field__item"><p><strong><em>***Update 14 April 2020: The virtual conference platform will be launched shortly. Participants may expect further information as well as access data via e-mail in the course of the week.***</em></strong></p> <p>The Corona virus (COVID-19) is affecting the plans of many people who made arrangements to attend <span style="color:blue; font-weight:bold">DATE 20<span style="color:red">20</span></span> in Grenoble. The DATE organizers have carefully been watching the situation over the past weeks, with full attention for the health and safety of all presenters, exhibitors and attendees, and have explored all options to maintain the event as planned. The current situation with increasing travel restrictions and cancellations, however, no longer makes it possible to guarantee the program and the quality expected from DATE.</p> <p style="font-weight: bold; color: darkblue;">Therefore, it has been decided to organize <span style="color:blue; font-weight:bold">DATE 20<span style="color:red">20</span></span> as a virtual event in the coming weeks instead of the physical event that was planned in Grenoble from March 9 to 13, 2020.</p> <p>To implement this, the following actions will be taken:</p> <ol> <li>A virtual conference environment will be set up to allow authors of accepted papers (long/short regular presentations, Interactive Presentations, Special Session presentations), as well as Exhibition Theatre and PhD Forum presenters who want to, to present their work in a virtual way;</li> <li>All papers presented in the virtual conference will be published in IEEE Xplore and ACM Digital Library;</li> <li>A virtual market environment will be set up for the exhibitors;</li> </ol> <p>All registered attendees will receive more information through e-mail in the next few days.</p> </div> <div class="shariff" data-services="[&quot;twitter&quot;,&quot;facebook&quot;,&quot;linkedin&quot;,&quot;xing&quot;,&quot;mail&quot;]" data-theme="colored" data-css="complete" data-orientation="horizontal" data-mail-url="mailto:" data-lang="en"> </div> Wed, 04 Mar 2020 16:45:50 +0000 Andreas Vörg, edacentrum GmbH, DE 806 at https://www.date-conference.com University Booth Programme https://www.date-conference.com/exhibition/university-booth/programme <span>University Booth Programme</span> <span><a title="View user profile." href="/user/25">Andreas Vörg, …</a></span> <span>Sun, 2 Feb 2020 18:00</span> <div class="field field--name-field-news-content field--type-text-with-summary field--label-hidden clearfix field__item"><p>The <strong>virtual University Booth of <span style="color:blue; font-weight:bold">DATE 20<span style="color:red">20</span></span></strong> is available <strong><a href="https://www.date-conference-virtual.com/online-program/session?s=UB">here</a></strong>.</p> <p>The University Booth is organised during DATE and will be located in the <strong>exhibition area at booth 11</strong>. All demonstrations will take place from <strong>Tuesday, March 10 to Thursday, March 12, 2020</strong> during DATE. Universities and public research institutes have been invited to submit hardware or software demonstrators.</p> <p>The University Booth programme is composed of <strong>42 demonstrations</strong> from <strong>11 countries</strong>, presenting software and hardware solutions. 
The programme is organised in **11 sessions** of 2 or 2.5 hours' duration and covers the topics:

- **Electronic Design Automation Prototypes**
- **Hardware Design and Test Prototypes**
- **Embedded Systems Design**

The University Booth at DATE 2020 invites you to find out more about the latest trends in software and hardware from the international research community.

Most demonstrators will be shown more than once, giving visitors more flexibility to come to the booth and find out about the latest innovations.

We are sure that the demonstrators will be an attractive supplement to the DATE conference programme and exhibition. We would like to thank all contributors to this programme.

More information is available online at https://www.date-conference.com/exhibition/university-booth. The University Booth programme is included in the conference booklet and available online at https://www.date-conference.com/exhibition/university-booth/programme. The following demonstrators will be presented at the University Booth.

**A BINARY TRANSLATION FRAMEWORK FOR AUTOMATED HARDWARE GENERATION**

**Authors:** Nuno Paulino and João Canas Ferreira, INESC TEC / University of Porto, PT

**Timeslots:**

- UB07.3 (Wednesday, March 11, 2020, 14:00–16:00)

**Abstract:** Hardware specialization is an efficient solution for maximizing performance and minimizing energy consumption. This work is based on the automated detection of workloads by analysis of a compiled application, and on the automated generation of specialized hardware modules. We will present the current version of the binary analysis and translation framework. Currently, our implementation is capable of processing ARMv8 and MicroBlaze (32-bit) Executable and Linking Format (ELF) files or instruction traces. The framework can interpret the instructions for these two ISAs and detect different types of instruction patterns. After detection, segments are converted into CDFG representations exposing the underlying instruction-level parallelism, which we aim to exploit via automated hardware generation. Ongoing work addresses the extraction of cyclical execution traces or static code blocks, and more methods of hardware generation.

**A DIGITAL MICROFLUIDICS BIO-COMPUTING PLATFORM**

**Authors:** Georgi Tanev, Luca Pezzarossa, Winnie Edith Svendsen and Jan Madsen, TU Denmark, DK

**Timeslots:**

- UB02.2 (Tuesday, March 10, 2020, 12:30–15:00)
- UB06.1 (Wednesday, March 11, 2020, 12:00–14:00)
- UB08.2 (Wednesday, March 11, 2020, 16:00–18:00)

**Abstract:** Digital microfluidics is a lab-on-a-chip (LOC) technology used to actuate small amounts of liquids on an array of individually addressable electrodes. Microliter-sized droplets can be programmatically dispensed, moved, mixed and split in a controlled environment, which, combined with miniaturized sensing techniques, makes LOC suitable for a broad range of applications in the fields of medical diagnostics and synthetic biology. Furthermore, a programmable digital microfluidics platform holds the potential to add a "fluidic subsystem" to the classical computation model, thus opening the doors for cyber-physical bio-processors. To facilitate the programming and operation of such bio-fluidic computing, we propose a dedicated instruction set architecture and virtual machine. A set of digital microfluidic core instructions as well as classic computing operations are executed on a virtual machine, which decouples the protocol execution from the LOC functionality.
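The instruction-set-plus-virtual-machine idea in this abstract lends itself to a small sketch. The instruction names, the electrode-grid model and the dispatch scheme below are our own illustration, not the actual ISA of the DTU platform:

```python
# Toy digital-microfluidics VM: a protocol is a list of core instructions,
# executed against an abstract electrode grid (hypothetical ops, for illustration).

class DropletVM:
    def __init__(self, rows, cols):
        self.grid = [[None] * cols for _ in range(rows)]   # electrode array
        self.droplets = {}                                  # droplet id -> (row, col)

    def execute(self, program):
        for op, *args in program:
            getattr(self, op)(*args)                        # dispatch one instruction

    def dispense(self, did, r, c):
        assert self.grid[r][c] is None, "electrode occupied"
        self.grid[r][c] = did
        self.droplets[did] = (r, c)

    def move(self, did, dr, dc):
        r, c = self.droplets[did]
        nr, nc = r + dr, c + dc
        assert self.grid[nr][nc] is None, "droplet collision"
        self.grid[r][c], self.grid[nr][nc] = None, did
        self.droplets[did] = (nr, nc)

    def mix(self, did_a, did_b, merged):
        # merge droplet b into a's electrode and relabel the result
        ra, ca = self.droplets.pop(did_a)
        rb, cb = self.droplets.pop(did_b)
        self.grid[rb][cb] = None
        self.grid[ra][ca] = merged
        self.droplets[merged] = (ra, ca)

protocol = [("dispense", "s1", 0, 0), ("dispense", "s2", 0, 2),
            ("move", "s2", 0, -1), ("mix", "s1", "s2", "mixA")]
vm = DropletVM(4, 4)
vm.execute(protocol)
print(vm.droplets)   # {'mixA': (0, 0)}
```

Keeping the protocol as plain instructions is what the abstract means by decoupling: the same program could drive a different chip as long as it implements the same core instructions.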
**AT-SPEED DFT ARCHITECTURE FOR BUNDLED-DATA CIRCUITS**

**Authors:** Ricardo Aquino Guazzelli and Laurent Fesquet, Université Grenoble Alpes, FR

**Timeslots:**

- UB02.5 (Tuesday, March 10, 2020, 12:30–15:00)
- UB05.7 (Wednesday, March 11, 2020, 10:00–12:00)

**Abstract:** At-speed testing for asynchronous circuits is still an open concern in the literature. Due to the timing constraints between control and data paths, Design for Testability (DfT) methodologies must test both control and data paths at the same time in order to guarantee circuit correctness. As Process, Voltage and Temperature (PVT) variations significantly impact circuit design in newer CMOS technologies and low-power techniques such as voltage scaling, the timing constraints between control and data paths must be tested after fabrication, not only under nominal conditions but across a range of operating conditions. This work explores an at-speed testing approach for bundled-data circuits, targeting the micropipeline template. The approach focuses on whether the sized delay lines in the control paths respect the local timing assumptions of the data paths.

**ATECES: AUTOMATED TESTING THE ENERGY CONSUMPTION OF EMBEDDED SYSTEMS**

**Authors:** Eduard Enoiu, Mälardalen University, SE

**Timeslots:**

- UB10.10 (Thursday, March 12, 2020, 12:00–14:30)
- UB11.1 (Thursday, March 12, 2020, 14:30–16:30)

**Abstract:** The demonstrator focuses on automatically generating test suites by selecting test cases using random test generation and mutation testing, a solution for improving the efficiency and effectiveness of testing. Specifically, we generate and select test cases based on the concept of energy-aware mutants: small syntactic modifications in the system architecture, intended to mimic real energy faults. Test cases that can distinguish a certain behavior from its mutations are sensitive to changes, and hence considered to be good at detecting faults. We applied this method to a brake-by-wire system, and our results suggest that an approach that selects test cases showing diverse energy consumption can increase the fault-detection ability. These results should motivate both academia and industry to investigate the use of automatic test generation for energy consumption.
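To make the mutant-based selection idea concrete, here is a minimal sketch. The energy model, the mutants and the detection threshold are all invented for illustration; the actual tool works on system architectures, not toy functions:

```python
# Sketch of energy-aware mutation testing: rank randomly generated test inputs
# by how many "energy mutants" (perturbed energy models) they can distinguish.
import random

def system(x):                          # nominal energy model of the system under test
    return 2.0 * x + 1.0                # e.g. energy drawn for input x

mutants = [lambda x: 2.5 * x + 1.0,     # mimics a leaky component
           lambda x: 2.0 * x + 3.0,     # mimics a stuck-on peripheral
           lambda x: 2.0 * x]           # mimics a missing base load

def kills(test, mutant, tol=0.5):
    # a test "kills" a mutant if the energy deviation is observably large
    return abs(system(test) - mutant(test)) > tol

random.seed(0)
candidates = [random.uniform(0, 10) for _ in range(50)]   # random test generation
scored = [(sum(kills(t, m) for m in mutants), t) for t in candidates]
scored.sort(reverse=True)
suite = [t for score, t in scored[:5] if score > 0]       # keep the strongest killers
print(f"selected {len(suite)} tests; best kills {scored[0][0]}/{len(mutants)} mutants")
```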
**BCFELEAM: BACKFLOW: BACKWARD EDGE CONTROL FLOW ENFORCEMENT FOR LOW END ARM REAL-TIME SYSTEMS**

**Authors:** Bresch Cyril¹, David Héy¹, Roman Lysecky² and Stephanie Chollet¹
¹LCIS, FR; ²University of Arizona, US

**Timeslots:**

- UB05.4 (Wednesday, March 11, 2020, 10:00–12:00)
- UB07.2 (Wednesday, March 11, 2020, 14:00–16:00)

**Abstract:** The C programming language is one of the most popular languages in embedded system programming. Indeed, C is efficient, lightweight and can easily meet high-performance and deterministic real-time constraints. However, these assets come at a price: C does not provide extra features for memory safety. As a result, attackers can easily exploit spatial memory vulnerabilities to hijack the execution flow of an application. The demonstration features a real-time connected infusion pump vulnerable to memory attacks. First, we showcase an exploit that remotely takes control of the pump. Then, we demonstrate the effectiveness of BackFlow, an LLVM-based compiler extension that enforces control-flow integrity in low-end ARM embedded systems.
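BackFlow itself is an LLVM-based compiler extension, but the backward-edge property it enforces can be modeled in a few lines. The sketch below simulates the common shadow-stack scheme (the addresses are made up, and the abstract does not say BackFlow uses exactly this mechanism):

```python
# Conceptual model of backward-edge CFI: every call pushes the return address
# onto a protected shadow stack; every return is checked against it.

class CFIViolation(Exception):
    pass

shadow = []                              # the protected shadow stack

def on_call(return_addr):
    shadow.append(return_addr)           # compiler-inserted prologue instrumentation

def on_return(return_addr):
    expected = shadow.pop()              # compiler-inserted epilogue check
    if return_addr != expected:
        raise CFIViolation(f"return to {return_addr:#x}, expected {expected:#x}")

on_call(0x8004)                          # main() calls infuse(), returns to 0x8004
on_return(0x8004)                        # legitimate return: passes silently
on_call(0x8010)
try:
    on_return(0x4BAD)                    # overwritten return address: blocked
except CFIViolation as e:
    print("attack stopped:", e)
```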
**BROOK SC: HIGH-LEVEL CERTIFICATION-FRIENDLY PROGRAMMING FOR GPU-POWERED SAFETY CRITICAL SYSTEMS**

**Authors:** Marc Benito, Matina Maria Trompouki and Leonidas Kosmidis, BSC / UPC, ES

**Timeslots:**

- UB04.7 (Tuesday, March 10, 2020, 17:30–19:30)
- UB11.2 (Thursday, March 12, 2020, 14:30–16:30)

**Abstract:** Graphics processing units (GPUs) can provide the increased performance required in future critical systems, i.e. automotive and avionics. However, their programming models, e.g. CUDA or OpenCL, cannot be used in such systems, as they violate safety-critical programming guidelines. Brook SC (https://github.com/lkosmid/brook) was developed at UPC/BSC to allow safety-critical applications to be programmed in a CUDA-like GPU language, Brook, which enables certification while increasing productivity. In our demo, an avionics application running on a realistic safety-critical GPU software stack and hardware is showcased. In this Bachelor's thesis project, which was awarded a 2019 HiPEAC Technology Transfer Award, an Airbus prototype application performing general-purpose computations with a safety-critical graphics API was ported to Brook SC in record time, achieving an order-of-magnitude reduction in the lines of code needed to implement the same functionality, without performance penalty.

**CATANIS: CAD TOOL FOR AUTOMATIC NETWORK SYNTHESIS**

**Authors:** Davide Quaglia, Enrico Fraccaroli, Filippo Nevi and Sohail Mushtaq, Università di Verona, IT

**Timeslots:**

- UB01.8 (Tuesday, March 10, 2020, 10:30–12:30)
- UB05.8 (Wednesday, March 11, 2020, 10:00–12:00)

**Abstract:** The proliferation of communication technologies for embedded systems opened the way for new applications, e.g., Smart Cities and Industry 4.0. In such applications, hundreds or thousands of smart devices interact together through different types of channels and protocols. This increasing communication complexity forces computer-aided design methodologies to scale up from embedded systems in isolation to the global interconnected system. Network synthesis is the methodology to optimally allocate functionality onto network nodes and define the communication infrastructure among them. This booth will demonstrate the functionality of a graphic tool for automatic network synthesis developed by the Computer Science Department of the University of Verona. It allows designers to graphically specify the communication requirements of a smart space (e.g., its map can be considered) in terms of sensing and computation tasks, together with a library of node types and communication protocols to be used.

**CSI-REPUTE: A LOW POWER EMBEDDED DEVICE CLUSTERING APPROACH TO GENOME READ MAPPING**

**Authors:** Tousif Rahman¹, Sidharth Maheshwari¹, Rishad Shafik¹, Ian Wilson¹, Alex Yakovlev¹ and Amit Acharyya²
¹Newcastle University, GB; ²IIT Hyderabad, IN

**Timeslots:**

- UB03.6 (Tuesday, March 10, 2020, 15:00–17:30)
- UB04.6 (Tuesday, March 10, 2020, 17:30–19:30)

**Abstract:** The big-data challenge of genomics is rooted in its requirements of extensive computational capability and results in large power and energy consumption. To encourage widespread usage of genome assembly tools, there must be a transition from the existing, predominantly software-based mapping tools, optimized for homogeneous high-performance systems, to more heterogeneous, low-power and cost-effective mapping systems. This demonstration will show a cluster system implementation for the REPUTE algorithm (an OpenCL-based read mapping tool for embedded genomics), where cluster nodes are composed of low-power single-board computer (SBC) devices and the algorithm is deployed on each node, spreading the genomic workload. We propose a working concept prototype to challenge current conventional high-performance many-core CPU-based cluster nodes. This demonstration will highlight the advantage in the power and energy domains of using SBC clusters, enabling potential solutions for low-cost genomics.

**DEEPSENSE-FPGA: FPGA ACCELERATION OF A MULTIMODAL NEURAL NETWORK**

**Authors:** Mehdi Trabelsi Ajili and Yuko Hara-Azumi, Tokyo Institute of Technology, JP

**Timeslots:**

- UB07.7 (Wednesday, March 11, 2020, 14:00–16:00)
- UB10.7 (Thursday, March 12, 2020, 12:00–14:30)

**Abstract:** Currently, the Internet of Things and Deep Learning (DL) are merging into one domain and creating outstanding technologies for various classification tasks. Such technologies require complex DL networks that mainly target powerful platforms with rich computing resources, like servers. Therefore, for resource-constrained embedded systems, new challenges of size, performance and power consumption have to be considered, particularly when edge devices handle multimodal data, i.e., different types of real-time sensing data (voice, video, text, etc.). Our ongoing project is focused on DeepSense, a multimodal DL framework combining Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to process time-series data, such as accelerometer and gyroscope data, to detect human activity. We aim at accelerating DeepSense on an FPGA (Xilinx Zynq) in a hardware-software co-design manner. Our demo will show the latest achievements through latency and power consumption evaluations.
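The CNN-plus-RNN structure that DeepSense builds on can be sketched compactly. The layer sizes and window shapes below are illustrative only, not the framework's actual configuration, and PyTorch is used purely as notation (the demo itself targets a Zynq FPGA):

```python
# Minimal CNN+RNN pattern for multimodal time series: per-window convolutions
# extract features from sensor frames, a GRU fuses them across time.
import torch
import torch.nn as nn

class TinyDeepSense(nn.Module):
    def __init__(self, n_sensors=6, n_classes=6):
        super().__init__()
        self.conv = nn.Sequential(                     # local features per time window
            nn.Conv1d(n_sensors, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))
        self.rnn = nn.GRU(32, 64, batch_first=True)    # temporal fusion across windows
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):                              # x: (batch, windows, sensors, samples)
        b, w, s, t = x.shape
        f = self.conv(x.reshape(b * w, s, t)).squeeze(-1)   # (b*w, 32)
        out, _ = self.rnn(f.reshape(b, w, 32))
        return self.head(out[:, -1])                   # classify from the last hidden state

model = TinyDeepSense()
logits = model(torch.randn(2, 10, 6, 128))             # 2 samples, 10 windows of 128 readings
print(logits.shape)                                    # torch.Size([2, 6])
```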
**DESIGN AUTOMATION FOR EXTENDED BURST-MODE AUTOMATA IN WORKCRAFT**

**Authors:** Alex Chan, Alex Yakovlev, Danil Sokolov and Victor Khomenko, Newcastle University, GB

**Timeslots:**

- UB05.6 (Wednesday, March 11, 2020, 10:00–12:00)
- UB07.6 (Wednesday, March 11, 2020, 14:00–16:00)

**Abstract:** Asynchronous circuits are known for high performance, robustness and low power consumption, which are particularly beneficial in the area of so-called "little digital" controllers, where low latency is crucial. However, asynchronous design is not widely adopted by industry, partially due to the steep learning curve inherent in the complexity of formal specifications such as Signal Transition Graphs (STGs). In this demo, we promote a class of the Finite State Machine (FSM) model called Extended Burst-Mode (XBM) automata as a practical way to specify many asynchronous circuits. The XBM specification has been automated in the Workcraft toolkit (https://workcraft.org) with elaborate support for state encoding, conditionals and "don't care" signals. Formal verification and logic synthesis of XBM automata are implemented via conversion to the established STG model, reusing existing methods and CAD tools. Tool support for the XBM flow will be demonstrated using several case studies.

**DISTRIBUTING TIME-SENSITIVE APPLICATIONS ON EDGE COMPUTING ENVIRONMENTS**

**Authors:** Eudald Sabaté Creixell¹, Unai Perez Mendizabal¹, Elli Kartsakli², Maria A. Serrano Gracia³ and Eduardo Quiñones Moreno³
¹BSC / UPC, ES; ²BSC, GR; ³BSC, ES

**Timeslots:**

- UB04.10 (Tuesday, March 10, 2020, 17:30–19:30)
- UB08.3 (Wednesday, March 11, 2020, 16:00–18:00)
- UB11.3 (Thursday, March 12, 2020, 14:30–16:30)

**Abstract:** The proposed demonstration showcases the capabilities of a task-based distributed programming framework for the execution of real-time applications in edge computing scenarios, in the context of smart cities. Edge computing shifts the computation close to the data source, alleviating the pressure on the cloud and reducing application response times. However, the development and deployment of distributed real-time applications are complex, due to the heterogeneous and dynamic edge environment, where resources may not always be available. To address these challenges, our demo employs COMPSs, a highly portable and infrastructure-agnostic programming model, to efficiently distribute time-sensitive applications across the compute continuum. We will exhibit how COMPSs distributes the workload on different edge devices (e.g., NVIDIA GPUs and a Raspberry Pi), and how COMPSs re-adapts this distribution upon the availability (connection or disconnection) of devices.
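The task-based style COMPSs supports looks roughly like the sketch below. The decorator and synchronization imports follow the public PyCOMPSs API, but the smart-city application is a made-up example and details may differ from the demo:

```python
# Task-based distribution sketch: decorated functions become asynchronous tasks
# that the COMPSs runtime schedules onto available edge devices.
# (Launched under the COMPSs runtime, e.g. with `runcompss app.py`.)
from pycompss.api.task import task
from pycompss.api.api import compss_wait_on

@task(returns=list)
def detect_vehicles(frame):
    # runs on whatever edge node the runtime picks (GPU board, Raspberry Pi, ...)
    return [box for box in frame if box["score"] > 0.5]

@task(returns=int)
def aggregate(*per_camera):
    return sum(len(v) for v in per_camera)

def main(camera_frames):
    partial = [detect_vehicles(f) for f in camera_frames]  # tasks spawned asynchronously
    total = aggregate(*partial)                            # runtime builds the task graph
    return compss_wait_on(total)                           # synchronize on the final result
```

Because the code never names a device, the runtime is free to re-map tasks when nodes connect or disconnect, which is exactly the re-adaptation the demo exhibits.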
**DL PUF ENAU: DEEP LEARNING BASED PHYSICALLY UNCLONABLE FUNCTION ENROLLMENT AND AUTHENTICATION**

**Authors:** Amir Alipour¹, David Hely², Vincent Beroulle² and Giorgio Di Natale³
¹Grenoble INP / LCIS, FR; ²Grenoble INP, FR; ³CNRS / Grenoble INP / TIMA, FR

**Timeslots:**

- UB07.1 (Wednesday, March 11, 2020, 14:00–16:00)
- UB10.4 (Thursday, March 12, 2020, 12:00–14:30)
- UB11.4 (Thursday, March 12, 2020, 14:30–16:30)

**Abstract:** Physically Unclonable Functions (PUFs) are nowadays addressed as a potential solution to improve security in authentication and encryption processes in cyber-physical systems. Research on PUFs is growing actively due to their potential of being secure, easily implementable and expandable, while using considerably less energy. Typically, the low-level device hardware variation is captured per unit for device enrollment, in a format called challenge-response pairs (CRPs), then recaptured after the device is deployed and compared with the original for authentication. These enrollment and comparison functions can vary and become more data-demanding for applications that demand robustness and resilience to noise. In this demonstration, our aim is to show the potential of using deep learning for the enrollment and authentication of PUF CRPs. Most importantly, during this demonstration we will show how this method can save time and storage compared to other classical methods.
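For readers unfamiliar with CRP-based authentication, the toy model below shows the classical enroll-then-compare flow that the deep-learning approach aims to improve on. The PUF, its noise figure and the acceptance threshold are all simulated and invented:

```python
# Toy CRP flow: enroll noisy PUF responses, later authenticate a fresh
# measurement with a Hamming-distance check (the "classical method").
import random

class ToyPUF:
    """Simulated PUF: a fixed secret fingerprint plus per-measurement noise."""
    def __init__(self, n_bits=64, noise=0.05, seed=1):
        rng = random.Random(seed)
        self.base = [rng.getrandbits(1) for _ in range(n_bits)]
        self.noise = noise

    def respond(self, challenge):
        mix = random.Random(challenge)                    # deterministic challenge mixing
        resp = [b ^ mix.getrandbits(1) for b in self.base]
        return [r ^ (random.random() < self.noise) for r in resp]  # flaky bits

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

puf = ToyPUF()
crp_db = {c: puf.respond(c) for c in range(8)}            # enrollment: store CRPs

challenge = 3                                             # later, in the field
fresh = puf.respond(challenge)
dist = hamming(crp_db[challenge], fresh)
print("authentic" if dist < 16 else "reject", f"(distance {dist}/64)")
```

Storing one full response per challenge is what makes this baseline data-hungry; replacing the stored CRP table with a learned model is the storage saving the abstract refers to.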
**EEC: ENERGY EFFICIENT COMPUTING VIA DYNAMIC VOLTAGE SCALING AND IN-NETWORK OPTICAL PROCESSING**

**Authors:** Ryosuke Matsuo¹, Jun Shiomi¹, Yutaka Masuda² and Tohru Ishihara²
¹Kyoto University, JP; ²Nagoya University, JP

**Timeslots:**

- UB01.7 (Tuesday, March 10, 2020, 10:30–12:30)
- UB09.7 (Thursday, March 12, 2020, 10:00–12:00)

**Abstract:** This poster demonstration will show the results of two of our research projects. The first is a project on energy-efficient computing, in which we developed a power management algorithm that keeps the target processor always running at the most energy-efficient operating point by appropriately tuning the supply voltage and threshold voltage under a specific performance constraint. This algorithm is applicable to a wide variety of processor systems, including high-end processors and low-end embedded processors. We will show the results obtained with actual RISC processors designed using a 65 nm technology. The second is a project on in-network optical computing. We show optical functional units such as parallel multipliers and optical neural networks. Several key techniques for reducing the power consumption of optical circuits will also be presented. Finally, we will show the results of optical circuit simulation, which demonstrate the light-speed operation of the circuits.

**ELSA: EIGENVALUE BASED HYBRID LINEAR SYSTEM ABSTRACTION: BEHAVIORAL MODELING OF TRANSISTOR-LEVEL CIRCUITS USING AUTOMATIC ABSTRACTION TO HYBRID AUTOMATA**

**Authors:** Ahmad Tarraf and Lars Hedrich, University of Frankfurt, DE

**Timeslots:**

- UB03.2 (Tuesday, March 10, 2020, 15:00–17:30)
- UB04.2 (Tuesday, March 10, 2020, 17:30–19:30)
- UB05.2 (Wednesday, March 11, 2020, 10:00–12:00)
- UB06.2 (Wednesday, March 11, 2020, 12:00–14:00)

**Abstract:** Model abstraction of transistor-level circuits, while preserving accurate behavior, is still an open problem. In this demo, an approach is presented that automatically generates a hybrid automaton (HA) with linear states from an existing circuit netlist. The approach starts with a netlist at transistor level with full SPICE accuracy and ends at a system-level description of the circuit in MATLAB or Verilog-A. The resulting hybrid automaton exhibits linear behavior as well as the technology-dependent nonlinear (e.g., limiting) behavior. The accuracy and speed-up of the generated Verilog-A models are evaluated on several transistor-level circuit abstractions, from simple operational amplifiers up to complex filters. Moreover, we verify the equivalence between the generated model and the original circuit. For the generated models in MATLAB syntax, a reachability analysis is performed using the reachability tool CORA.

**EUCLID-NIR GPU: AN ON-BOARD PROCESSING GPU-ACCELERATED SPACE CASE STUDY DEMONSTRATOR**

**Authors:** Ivan Rodriguez and Leonidas Kosmidis, BSC / UPC, ES

**Timeslots:**

- UB05.3 (Wednesday, March 11, 2020, 10:00–12:00)

**Abstract:** Embedded graphics processing units (GPUs) are very attractive candidates for on-board payload processing in future space systems, thanks to their high performance and low power consumption. Although there is significant interest from both academia and industry, there is no open, publicly available case study showing their capabilities yet. In this master's thesis project, which was performed within the GPU4S (GPU for Space) ESA-funded project, we have parallelised and ported the Euclid NIR (near-infrared) image processing algorithm, used in the European Space Agency (ESA) mission to be launched in 2022, to an automotive GPU platform, the NVIDIA Xavier. In the demo we will present in real time the significantly higher performance achieved compared to the original sequential implementation. In addition, visitors will have the opportunity to examine the images on which the algorithm operates, as well as to inspect the algorithm parallelisation through profiling and code inspection.

**FASTHERMSIM: FAST AND ACCURATE THERMAL SIMULATIONS FROM CHIPLETS TO SYSTEM**

**Authors:** Yu-Min Lee, Chi-Wen Pan, Li-Rui Ho and Hong-Wen Chiou, National Chiao Tung University, TW

**Timeslots:**

- UB01.5 (Tuesday, March 10, 2020, 10:30–12:30)
- UB03.10 (Tuesday, March 10, 2020, 15:00–17:30)
- UB08.8 (Wednesday, March 11, 2020, 16:00–18:00)

**Abstract:** Recently, owing to the scaling down of technology and 2.5D/3D integration, the power densities and temperatures of chips have been increasing significantly. Though commercial computational fluid dynamics tools can provide accurate thermal maps, their huge runtimes make them inefficient for thermal-aware design. Thus, we developed the chip/package/system-level thermal analyzer FasThermSim, which can help you improve your design under thermal constraints in pre- and post-silicon stages. In FasThermSim, we consider three heat transfer modes: conduction, convection and thermal radiation. We convert them to temperature-independent terms by linearization methods and build a compact thermal model (CTM). By applying numerical methods to the CTM, the steady-state and transient thermal profiles can be solved efficiently without loss of accuracy. Finally, an easy-to-use, flexible and compatible thermal analysis tool with a graphical user interface is implemented for your design.
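Once linearized, a compact thermal model reduces to a linear ODE system, C dT/dt = P - G T, whose steady state solves G T = P. A numerical toy version, using a four-node RC network with invented conductances and power numbers rather than FasThermSim's actual model:

```python
# Toy compact thermal model: forward-Euler transient plus direct steady state.
import numpy as np

G = np.array([[ 0.30, -0.10, -0.10,  0.00],     # thermal conductances (W/K);
              [-0.10,  0.30,  0.00, -0.10],     # each diagonal includes a 0.1 W/K
              [-0.10,  0.00,  0.30, -0.10],     # path to ambient
              [ 0.00, -0.10, -0.10,  0.30]])
C = np.array([0.02, 0.02, 0.05, 0.05])          # heat capacities (J/K)
P = np.array([1.5, 0.2, 0.1, 0.1])              # power map (W); node 0 is the hotspot

T = np.zeros(4)                                  # temperature rise over ambient (K)
dt = 1e-3
for _ in range(50_000):                          # 50 s transient, far past settling
    T += dt * (P - G @ T) / C

print("transient end:", np.round(T, 2))
print("steady state :", np.round(np.linalg.solve(G, P), 2))   # the two should agree
```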
**FLETCHER: TRANSPARENT GENERATION OF HARDWARE INTERFACES FOR ACCELERATING BIG DATA APPLICATIONS**

**Authors:** Zaid Al-Ars, Johan Peltenburg, Jeroen van Straten, Matthijs Brobbel and Joost Hoozemans, TU Delft, NL

**Timeslots:**

- UB02.1 (Tuesday, March 10, 2020, 12:30–15:00)
- UB03.1 (Tuesday, March 10, 2020, 15:00–17:30)
- UB04.1 (Tuesday, March 10, 2020, 17:30–19:30)

**Abstract:** This demo, created by TU Delft, presents a software-hardware framework for the efficient integration of FPGA hardware accelerators, both on edge devices and in the cloud. The framework, called Fletcher, automatically generates data communication interfaces in hardware based on the widely used big-data format Apache Arrow. This provides two distinct advantages. On the one hand, since the accelerators use the same data format as the software, data communication bottlenecks can be reduced. On the other hand, since a standardized data format is used, this allows for easy-to-use interfaces on the accelerator side, thereby reducing design and development time. The demo shows how to use Fletcher for big-data acceleration to decompress Snappy-compressed files and perform filtering on the whole Wikipedia body of text, achieving 25 GB/s processing throughput.
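The key point of the Fletcher flow is that one Arrow schema is shared by the software and by the generated hardware interface. The software side can be seen with plain pyarrow; the schema below is a made-up example in the spirit of the Wikipedia-text use case, and how Fletcher's generators consume it is outside this sketch:

```python
# One Arrow schema describes the record layout both the CPU code and the
# FPGA accelerator interface operate on, so no serialization sits in between.
import pyarrow as pa

schema = pa.schema([
    pa.field("title", pa.string()),
    pa.field("body", pa.string()),
])

batch = pa.RecordBatch.from_arrays(
    [pa.array(["Page A", "Page B"]),
     pa.array(["some wiki text", "more wiki text"])],
    schema=schema)

# In the Fletcher flow, the same schema is fed to the interface generator,
# which emits HDL that reads batches like this one directly from memory.
print(batch.num_rows, batch.schema.names)
```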
Examples of the performance of our system can be heard in the demo.</i></p> <p><b>FU: LOW POWER AND ACCURACY CONFIGURABLE APPROXIMATE ARITHMETIC UNITS</b></p> <p><b>Authors:</b><br /> Tomoaki Ukezono and Toshinori Sato, Fukuoka University, JP</p> <p><b>Timeslots:</b></p> <ul> <li>UB05.10 (Wednesday, March 11, 2020 10:00 - 12:00)</li> <li>UB09.10 (Thursday, March 12, 2020 10:00 - 12:00)</li> </ul> <p><i><b>Abstract</b>: In this demonstration, we will introduce the approximate arithmetic units, such as an adder, a multiplier, and a MAC, that are being studied in our system-architecture laboratory. Our approximate arithmetic units can reduce delay and power consumption at the expense of accuracy. They are intended to be applied to IoT edge devices that process images, and are suitable for battery-driven, low-cost devices. Their key feature is that the circuits are configured so that the accuracy is dynamically variable, allowing the trade-off between accuracy and power to be selected according to the usage status of the device. In this demonstration, we show the power consumption under various accuracy requirements based on measured data and argue for the practicality of the proposed arithmetic units.</i></p> <p><b>FUZZING EMBEDDED BINARIES LEVERAGING SYSTEMC-BASED VIRTUAL PROTOTYPES</b></p> <p><b>Authors:</b><br /> Vladimir Herdt<sup>1</sup>, Daniel Grosse<sup>2</sup> and Rolf Drechsler<sup>2</sup><br /> <sup>1</sup>DFKI, DE; <sup>2</sup>University of Bremen / DFKI GmbH, DE</p> <p><b>Timeslots:</b></p> <ul> <li>UB01.1 (Tuesday, March 10, 2020 10:30 - 12:30)</li> <li>UB03.7 (Tuesday, March 10, 2020 15:00 - 17:30)</li> </ul> <p><i><b>Abstract</b>: Verification of embedded software (SW) binaries is very important. Mainly, simulation-based methods are employed that execute (randomly) generated test-cases on Virtual Prototypes (VPs). However, to enable comprehensive VP-based verification, sophisticated test-case generation techniques need to be integrated. Our demonstrator combines state-of-the-art fuzzing techniques with SystemC-based VPs to enable fast and accurate verification of embedded SW binaries. The fuzzing process is guided by the coverage of the embedded SW as well as of the SystemC-based peripherals of the VP. The effectiveness of our approach is demonstrated by our experiments, using RISC-V SW binaries as an example.</i></p>
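<p>The following Python sketch shows the generic coverage-guided loop that such fuzzers build on; the mutator is deliberately minimal, and <code>run_on_vp</code> is a hypothetical stand-in for executing the binary on a coverage-instrumented VP, not part of the authors' tool.</p> <pre><code># Schematic coverage-guided fuzzing loop (illustration of the technique only).
# run_on_vp(data) is assumed to return the set of coverage points hit.
import random

def mutate(data: bytes) -> bytes:
    """Replace one random byte (a deliberately minimal mutator)."""
    if not data:
        return bytes([random.randrange(256)])
    i = random.randrange(len(data))
    return data[:i] + bytes([random.randrange(256)]) + data[i + 1:]

def fuzz(run_on_vp, seeds, iterations=10_000):
    corpus = list(seeds)
    covered = set()
    for _ in range(iterations):
        candidate = mutate(random.choice(corpus))
        new_cov = run_on_vp(candidate)       # execute the SW binary on the VP
        if not new_cov.issubset(covered):    # keep inputs that add coverage
            covered |= new_cov
            corpus.append(candidate)
    return corpus, covered
</code></pre> <p><b>GENERATING ASYNCHRONOUS CIRCUITS FROM CATAPULT</b></p> <p><b>Authors:</b><br /> Yoan Decoudu<sup>1</sup>, Jean Simatic<sup>2</sup>, Katell Morin-Allory<sup>3</sup> and Laurent Fesquet<sup>3</sup><br /> <sup>1</sup>University Grenoble Alpes, FR; <sup>2</sup>HawAI.Tech, FR; <sup>3</sup>Université Grenoble Alpes, FR</p> <p><b>Timeslots:</b></p> <ul> <li>UB02.7 (Tuesday, March 10, 2020 12:30 - 15:00)</li> <li>UB06.7 (Wednesday, March 11, 2020 12:00 - 14:00)</li> <li>UB10.8 (Thursday, March 12, 2020 12:00 - 14:30)</li> <li>UB11.8 (Thursday, March 12, 2020 14:30 - 16:30)</li> </ul> <p><i><b>Abstract</b>: In order to spread asynchronous circuit design to a large community of designers, High-Level Synthesis (HLS) is probably a good choice, because it requires limited technical design skills. HLS usually provides an RTL description, which includes a data-path and a control-path. The desynchronization process is applied only to the control-path, which is a Finite State Machine (FSM). This is sufficient to make the circuit asynchronous.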
Indeed, data are processed step by step in the pipeline stages, thanks to the desynchronized FSM. Thus, the data-path computation time is no longer tied to the clock period but rather to the average time for processing data through the pipeline. This tends to improve speed when the pipeline stages are not well balanced. Moreover, our approach helps designers quickly build data-driven circuits while maintaining a reasonable cost, a similar area and a short time-to-market.</i></p> <p><b>INTACT: A 96-CORE PROCESSOR WITH 6 CHIPLETS 3D-STACKED ON AN ACTIVE INTERPOSER AND A 16-CORE PROTOTYPE RUNNING GRAPHICAL OPERATING SYSTEM</b></p> <p><b>Authors:</b><br /> Eric Guthmuller<sup>1</sup>, Pascal Vivet<sup>1</sup>, César Fuguet<sup>1</sup>, Yvain Thonnart<sup>1</sup>, Gaël Pillonnet<sup>2</sup> and Fabien Clermidy<sup>1</sup><br /> <sup>1</sup>Université Grenoble Alpes / CEA List, FR; <sup>2</sup>Université Grenoble Alpes / CEA-Leti, FR</p> <p><b>Timeslots:</b></p> <ul> <li>UB01.6 (Tuesday, March 10, 2020 10:30 - 12:30)</li> <li>UB02.6 (Tuesday, March 10, 2020 12:30 - 15:00)</li> </ul> <p><i><b>Abstract</b>: We built a demonstrator for our 96-core cache-coherent 3D processor and a first prototype featuring 16 cores. The demonstrator consists of our 16-core processor running commodity operating systems such as Linux and NetBSD on a PC-like motherboard with user-friendly devices such as an HDMI display, keyboard and mouse. A graphical desktop is displayed, and the user can interact with it through the keyboard and mouse. The demonstrator is able to run parallel applications to benchmark its performance in terms of scalability. The main innovation of our processor is its scalable cache-coherent architecture based on distributed L2 caches and adaptive L3 caches. Additionally, the energy consumption is measured and displayed by reading dynamically from the monitors of the power-supply devices. Finally, we will also show opened packages of the 3D processor featuring six 16-core chiplets (28 nm FDSOI) on an active interposer (65 nm) embedding networks-on-chip, power management and IO controllers.</i></p> <p><b>JOINTER: JOINING FLEXIBLE MONITORS WITH HETEROGENEOUS ARCHITECTURES</b></p> <p><b>Authors:</b><br /> Giacomo Valente<sup>1</sup>, Tiziana Fanni<sup>2</sup>, Carlo Sau<sup>3</sup>, Claudio Rubattu<sup>2</sup>, Francesca Palumbo<sup>2</sup> and Luigi Pomante<sup>1</sup><br /> <sup>1</sup>Università degli Studi dell'Aquila, IT; <sup>2</sup>Università degli Studi di Sassari, IT; <sup>3</sup>Università degli Studi di Cagliari, IT</p> <p><b>Timeslots:</b></p> <ul> <li>UB01.10 (Tuesday, March 10, 2020 10:30 - 12:30)</li> <li>UB02.10 (Tuesday, March 10, 2020 12:30 - 15:00)</li> <li>UB06.10 (Wednesday, March 11, 2020 12:00 - 14:00)</li> </ul> <p><i><b>Abstract</b>: As embedded systems grow more complex and shift toward heterogeneous architectures, understanding workload performance characteristics becomes increasingly difficult. In this regard, run-time monitoring systems can help obtain the visibility needed to characterize a system. This demo presents a framework for developing complex heterogeneous architectures, composed of programmable processors and dedicated accelerators on FPGA, together with customizable monitoring systems, while keeping the introduced overhead under control.
We will show the whole development flow (and the related prototype EDA tools), which starts with accelerator creation from a dataflow model, proceeds in parallel with monitoring-system customization from a library of elements, and ends with joining the two. Moreover, a comparison among different monitoring-system functionalities on different architectures developed on a Zynq-7000 SoC will be illustrated.</i></p> <p><b>LAGARTO: FIRST SILICON RISC-V ACADEMIC PROCESSOR DEVELOPED IN SPAIN</b></p> <p><b>Authors:</b><br /> Guillem Cabo Pitarch<sup>1</sup>, Cristobal Ramirez Lazo<sup>1</sup>, Julian Pavon Rivera<sup>1</sup>, Vatistas Kostalabros<sup>1</sup>, Carlos Rojas Morales<sup>1</sup>, Miquel Moreto<sup>1</sup>, Jaume Abella<sup>1</sup>, Francisco J. Cazorla<sup>1</sup>, Adrian Cristal<sup>1</sup>, Roger Figueras<sup>1</sup>, Alberto Gonzalez<sup>1</sup>, Carles Hernandez<sup>1</sup>, Cesar Hernandez<sup>2</sup>, Neiel Leyva<sup>2</sup>, Joan Marimon<sup>1</sup>, Ricardo Martinez<sup>3</sup>, Jonnatan Mendoza<sup>1</sup>, Francesc Moll<sup>4</sup>, Marco Antonio Ramirez<sup>2</sup>, Carlos Rojas<sup>1</sup>, Antonio Rubio<sup>4</sup>, Abraham Ruiz<sup>1</sup>, Nehir Sonmez<sup>1</sup>, Lluis Teres<sup>3</sup>, Osman Unsal<sup>5</sup>, Mateo Valero<sup>1</sup>, Ivan Vargas<sup>1</sup> and Luis Villa<sup>2</sup><br /> <sup>1</sup>BSC / UPC, ES; <sup>2</sup>CIC-IPN, MX; <sup>3</sup>IMB-CNM (CSIC), ES; <sup>4</sup>UPC, ES; <sup>5</sup>BSC, ES</p> <p><b>Timeslots:</b></p> <ul> <li>UB01.3 (Tuesday, March 10, 2020 10:30 - 12:30)</li> <li>UB04.4 (Tuesday, March 10, 2020 17:30 - 19:30)</li> <li>UB08.1 (Wednesday, March 11, 2020 16:00 - 18:00)</li> <li>UB10.5 (Thursday, March 12, 2020 12:00 - 14:30)</li> <li>UB11.5 (Thursday, March 12, 2020 14:30 - 16:30)</li> </ul> <p><i><b>Abstract</b>: Open hardware has emerged in recent years and has the potential to be as disruptive as Linux, the open-source software paradigm, once was. Just as Linux lessened users' dependence on large companies providing software and software applications, hardware based on open-source ISAs is envisioned to do the same in its own field. Four research institutions were involved in the Lagarto tapeout: the Centro de Investigación en Computación of the Mexican IPN, the Centro Nacional de Microelectrónica of the CSIC, the Universitat Politècnica de Catalunya (UPC) and the Barcelona Supercomputing Center (BSC). As a result, many bachelor, master and PhD students had the chance to gain real-world experience with ASIC design and deliver a functional SoC. In the booth, you will find a live demo of the first ASIC and FPGA prototypes of the next versions of the SoC and core.</i></p> <p><b>LEARNV: A RISC-V BASED EMBEDDED SYSTEM DESIGN FRAMEWORK FOR EDUCATION AND RESEARCH DEVELOPMENT</b></p> <p><b>Authors:</b><br /> Noureddine Ait Said and Mounir Benabdenbi, TIMA Laboratory, FR</p> <p><b>Timeslots:</b></p> <ul> <li>UB03.5 (Tuesday, March 10, 2020 15:00 - 17:30)</li> <li>UB04.5 (Tuesday, March 10, 2020 17:30 - 19:30)</li> <li>UB06.8 (Wednesday, March 11, 2020 12:00 - 14:00)</li> <li>UB08.5 (Wednesday, March 11, 2020 16:00 - 18:00)</li> <li>UB11.7 (Thursday, March 12, 2020 14:30 - 16:30)</li> </ul> <p><i><b>Abstract</b>: Designing a modern System-on-Chip relies on the joint design of hardware and software (co-design). However, understanding the tight relationship between hardware and software is not straightforward.
Moreover, validating new concepts in SoC design, from the idea to the hardware implementation, is time-consuming and often slowed by legacy issues (intellectual property of hardware blocks and expensive commercial tools). To overcome these issues we propose to use the open-source Rocket Chip environment for educational purposes, combined with the open-source lowRISC architecture, to implement a custom SoC design on an FPGA board. The demonstration will present how students and engineers can benefit from the environment to deepen their knowledge of HW and SW co-design. Using the lowRISC architecture, an image classification application based on CNNs will serve as a demonstrator of the whole open-source hardware and software flow and will be mapped onto a Nexys A7 FPGA board.</i></p> <p><b>MDD-COP: A PRELIMINARY TOOL FOR MODEL-DRIVEN DEVELOPMENT EXTENDED WITH LAYER DIAGRAM FOR CONTEXT-ORIENTED PROGRAMMING</b></p> <p><b>Authors:</b><br /> Harumi Watanabe<sup>1</sup>, Chinatsu Yamamoto<sup>1</sup>, Takeshi Ohkawa<sup>1</sup>, Mikiko Sato<sup>1</sup>, Nobuhiko Ogura<sup>2</sup> and Mana Tabei<sup>1</sup><br /> <sup>1</sup>Tokai University, JP; <sup>2</sup>Tokyo City University, JP</p> <p><b>Timeslots:</b></p> <ul> <li>UB07.10 (Wednesday, March 11, 2020 14:00 - 16:00)</li> <li>UB08.10 (Wednesday, March 11, 2020 16:00 - 18:00)</li> </ul> <p><i><b>Abstract</b>: This presentation introduces a preliminary tool for Model-Driven Development (MDD) that generates programs for Context-Oriented Programming (COP). In modern embedded systems such as IoT and Industry 4.0, software has begun to provide multiple services that follow the changing surrounding environment. COP is helpful for programming such software: the surrounding environments and the multiple services can be regarded as contexts and layers. Even though MDD is a powerful technique for developing such modern systems, work on modeling for COP is limited, and no prior work addresses the relation between UML (Unified Modeling Language) and COP. To solve this problem, we provide COP code generation from a layer diagram that extends the UML package diagram with stereotypes. In our approach, users draw a layer diagram and other UML diagrams; then xtUML, a major MDD tool, generates XML code with layer information for COP; finally, our tool generates COP code from the XML code.</i></p>
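<p>As a minimal illustration of the COP idea the tool targets (layers that refine behaviour as the context changes), consider the following Python sketch; the class and layers are invented for the example and are not generated by the MDD-COP tool.</p> <pre><code># Minimal context-oriented-programming (COP) sketch: behaviour is grouped
# into layers that can be activated and deactivated at run time.
class Greeter:
    def greet(self) -> str:
        base = "hello"
        for layer in active_layers:      # each active layer refines the result
            base = LAYERS[layer](base)
        return base

LAYERS = {
    "polite":   lambda s: s + ", dear user",
    "shouting": lambda s: s.upper(),
}
active_layers: list[str] = []

g = Greeter()
print(g.greet())                 # "hello"
active_layers.append("polite")   # a context change activates a layer
print(g.greet())                 # "hello, dear user"
</code></pre> <p><b>PA-HLS: HIGH-LEVEL ANNOTATION OF ROUTING CONGESTION FOR XILINX VIVADO HLS DESIGNS</b></p> <p><b>Authors:</b><br /> Osama Bin Tariq<sup>1</sup>, Junnan Shan<sup>1</sup>, Luciano Lavagno<sup>1</sup>, Georgios Floros<sup>2</sup>, Mihai Teodor Lazarescu<sup>1</sup>, Christos Sotiriou<sup>2</sup> and Mario Roberto Casu<sup>1</sup><br /> <sup>1</sup>Politecnico di Torino, IT; <sup>2</sup>University of Thessaly, GR</p> <p><b>Timeslots:</b></p> <ul> <li>UB07.9 (Wednesday, March 11, 2020 14:00 - 16:00)</li> <li>UB08.9 (Wednesday, March 11, 2020 16:00 - 18:00)</li> <li>UB09.9 (Thursday, March 12, 2020 10:00 - 12:00)</li> <li>UB10.9 (Thursday, March 12, 2020 12:00 - 14:30)</li> </ul> <p><i><b>Abstract</b>: We will demo a novel high-level back-annotation flow that reports routing congestion issues at the C++ source level by analyzing reports from FPGA physical design (Xilinx Vivado) and internal debugging files of the Vivado HLS tool. The flow annotates the C++ source code, identifying likely causes of congestion, e.g., on-chip memories or DSP units.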
These shared resources often cause routing problems on FPGAs because they cannot be duplicated by physical design. We demonstrate on realistic large designs how the information provided by our flow can be used both to identify congestion issues at the C++ source level and to solve them using HLS directives. The main demo steps are: (1) extraction of the source-level debugging information from the Vivado HLS database; (2) generation of a list of net names involved in congestion areas, and of their relative significance, from the Vivado post-global-routing database; (3) visualization of the C++ code lines that contribute most to congestion.</i></p> <p><b>PAFUSI: PARTICLE FILTER FUSION ASIC FOR INDOOR POSITIONING</b></p> <p><b>Authors:</b><br /> Christian Schott, Marko Rößler, Daniel Froß, Marcel Putsche and Ulrich Heinkel, TU Chemnitz, DE</p> <p><b>Timeslots:</b></p> <ul> <li>UB03.3 (Tuesday, March 10, 2020 15:00 - 17:30)</li> <li>UB09.3 (Thursday, March 12, 2020 10:00 - 12:00)</li> </ul> <p><i><b>Abstract</b>: The meaning of data acquired from IoT devices is greatly enhanced if the global or local position of its acquisition is known. Both the indoor-positioning infrastructure and the IoT devices themselves call for small, energy-efficient yet powerful hardware that provides this location awareness. We propose PAFUSI, a hardware implementation of a UWB position estimation algorithm that fulfils these requirements. Our design fuses distance measurements to fixed points in an environment to calculate the position in 3D space and is capable of using different positioning technologies such as GPS, DecaWave or Nanotron as data sources simultaneously. It comprises an estimator, which processes the data by means of a Sequential Monte Carlo method, and a microcontroller core, which configures and controls the measurement unit and analyses the estimator's results. The PAFUSI is manufactured as a monolithic ASIC on a multi-project wafer in UMC's 65 nm process.</i></p>
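<p>For illustration, a Sequential Monte Carlo (particle filter) position estimator of the kind the abstract describes can be sketched in a few lines of Python; the anchors, noise levels and motion model below are assumptions for the example, not PAFUSI's actual parameters.</p> <pre><code># Toy particle filter fusing distances to fixed anchors into a 3D position.
import numpy as np

rng = np.random.default_rng(0)
ANCHORS = np.array([[0, 0, 0], [5, 0, 0], [0, 5, 0], [0, 0, 3]], float)
SIGMA = 0.1                                    # assumed ranging noise (m)

def step(particles, ranges):
    particles = particles + rng.normal(0, 0.05, particles.shape)   # predict
    d = np.linalg.norm(particles[:, None, :] - ANCHORS[None], axis=2)
    log_lik = -0.5 * (((d - ranges) / SIGMA) ** 2).sum(axis=1)     # weight
    w = np.exp(log_lik - log_lik.max())        # numerically stable weights
    w /= w.sum()
    idx = rng.choice(len(particles), len(particles), p=w)          # resample
    return particles[idx]

particles = rng.uniform(0, 5, (2000, 3))       # unknown start position
true_pos = np.array([2.0, 3.0, 1.5])
for _ in range(20):
    ranges = np.linalg.norm(ANCHORS - true_pos, axis=1) + rng.normal(0, SIGMA, 4)
    particles = step(particles, ranges)
print(particles.mean(axis=0))                  # fused position estimate
</code></pre> <p>The predict/weight/resample loop above is the essence of any Sequential Monte Carlo estimator; a hardware implementation mainly differs in how these three stages are pipelined and parallelised.</p> <p><b>PARALLEL ALGORITHM FOR CNN INFERENCE AND ITS AUTOMATIC SYNTHESIS</b></p> <p><b>Authors:</b><br /> Takashi Matsumoto, Yukio Miyasaka, Xinpei Zhang and Masahiro Fujita, University of Tokyo, JP</p> <p><b>Timeslots:</b></p> <ul> <li>UB01.4 (Tuesday, March 10, 2020 10:30 - 12:30)</li> <li>UB05.9 (Wednesday, March 11, 2020 10:00 - 12:00)</li> <li>UB09.6 (Thursday, March 12, 2020 10:00 - 12:00)</li> </ul> <p><i><b>Abstract</b>: Recently, Convolutional Neural Networks (CNNs) have surpassed conventional methods in the field of image processing. This demonstration shows a new algorithm that calculates CNN inference using processing elements arranged and connected based on the topology of the convolution. They are connected in a mesh and calculate CNN inference in a systolic way. The algorithm performs the convolution of all elements with the same output feature in parallel. We demonstrate a method to automatically synthesize an algorithm which simultaneously performs the convolution and the communication of pixels for the computation of the next layer. We experimented with several sizes of input layers, kernels, and strides, and confirmed that correct algorithms were synthesized. The synthesis method is also extended to sparse kernels. The synthesized algorithm requires fewer cycles than the original algorithm.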
The sparser the kernel, the more opportunities there are to reduce the number of cycles.</i></p> <p><b>PRE-IMPACT FALL DETECTION ARCHITECTURE BASED ON NEUROMUSCULAR CONNECTIVITY STATISTICS</b></p> <p><b>Authors:</b><br /> Giovanni Mezzina, Sardar Mehboob Hussain and Daniela De Venuto, Politecnico di Bari, IT</p> <p><b>Timeslots:</b></p> <ul> <li>UB01.9 (Tuesday, March 10, 2020 10:30 - 12:30)</li> <li>UB02.9 (Tuesday, March 10, 2020 12:30 - 15:00)</li> </ul> <p><i><b>Abstract</b>: In this demonstration, we propose an innovative multi-sensor architecture operating in the field of pre-impact fall detection (PIFD). The proposed architecture jointly analyzes cortical and muscular involvement when unexpected slippages occur during steady walking. The EEG and EMG signals are acquired through wearable and wireless devices. The control unit consists of an STM32L4 microcontroller and a Simulink model. The C code implements the EMG computation, while the cortical analysis and the final classification are entrusted to the Simulink model. The EMG computation block translates EMGs into binary signals, which are used both to enable the cortical analyses and to extract a score that distinguishes "standard" muscular behaviors from anomalous ones. The Simulink model evaluates cortical responsiveness in five bands of interest and implements the logic-based network classifier. The system, tested on 6 healthy subjects, shows an accuracy of 96.21% and a detection time of ~371 ms.</i></p> <p><b>RESCUED: A RESCUE DEMONSTRATOR FOR INTERDEPENDENT ASPECTS OF RELIABILITY, SECURITY AND QUALITY TOWARDS A COMPLETE EDA FLOW</b></p> <p><b>Authors:</b><br /> Nevin George<sup>1</sup>, Guilherme Cardoso Medeiros<sup>2</sup>, Junchao Chen<sup>3</sup>, Josie Esteban Rodriguez Condia<sup>4</sup>, Thomas Lange<sup>5</sup>, Aleksa Damljanovic<sup>4</sup>, Raphael Segabinazzi Ferreira<sup>1</sup>, Aneesh Balakrishnan<sup>5</sup>, Xinhui Lai<sup>6</sup>, Shayesteh Masoumian<sup>7</sup>, Dmytro Petryk<sup>3</sup>, Troya Cagil Koylu<sup>2</sup>, Felipe Augusto da Silva<sup>8</sup>, Ahmet Cagri Bagbaba<sup>8</sup>, Cemil Cem Gürsoy<sup>6</sup>, Said Hamdioui<sup>2</sup>, Mottaqiallah Taouil<sup>2</sup>, Milos Krstic<sup>3</sup>, Peter Langendoerfer<sup>3</sup>, Zoya Dyka<sup>3</sup>, Marcelo Brandalero<sup>1</sup>, Michael Hübner<sup>1</sup>, Jörg Nolte<sup>1</sup>, Heinrich Theodor Vierhaus<sup>1</sup>, Matteo Sonza Reorda<sup>4</sup>, Giovanni Squillero<sup>4</sup>, Luca Sterpone<sup>4</sup>, Jaan Raik<sup>6</sup>, Dan Alexandrescu<sup>5</sup>, Maximilien Glorieux<sup>5</sup>, Georgios Selimis<sup>7</sup>, Geert-Jan Schrijen<sup>7</sup>, Anton Klotz<sup>8</sup>, Christian Sauer<sup>8</sup> and Maksim Jenihhin<sup>6</sup><br /> <sup>1</sup>Brandenburg University of Technology Cottbus-Senftenberg, DE; <sup>2</sup>TU Delft, NL; <sup>3</sup>Leibniz-Institut für innovative Mikroelektronik, DE; <sup>4</sup>Politecnico di Torino, IT; <sup>5</sup>IROC Technologies, FR; <sup>6</sup>Tallinn University of Technology, EE; <sup>7</sup>Intrinsic ID, NL; <sup>8</sup>Cadence Design Systems GmbH, DE</p> <p><b>Timeslots:</b></p> <ul> <li>UB09.2 (Thursday, March 12, 2020 10:00 - 12:00)</li> <li>UB10.2 (Thursday, March 12, 2020 12:00 - 14:30)</li> </ul> <p><i><b>Abstract</b>: The demonstrator highlights the various interdependent aspects of Reliability, Security and Quality in nanoelectronic system design within an EDA toolset and a processor architecture setup.
The need for attention to these three aspects of nanoelectronic systems has become ever more pronounced with the extreme miniaturization of technologies. Further, the number of such systems has exploded with IoT devices, heavy analog interaction with the external physical world, complex safety-critical applications, and artificial-intelligence applications. RESCUE targets these aspects in the form of Reliability (functional safety, ageing, soft errors), Security (tamper resistance, PUF technology, intelligent security) and Quality (novel fault models, functional test, FMEA/FMECA, verification/debug), spanning the entire hardware/software system stack. The demonstrator is brought together by a group of PhD students under the banner of the H2020-MSCA-ITN RESCUE European Union project.</i></p> <p><b>RETINE: A PROGRAMMABLE 3D STACKED VISION CHIP ENABLING LOW LATENCY IMAGE ANALYSIS</b></p> <p><b>Authors:</b><br /> Stéphane Chevobbe<sup>1</sup>, Maria Lepecq<sup>1</sup> and Laurent Millet<sup>2</sup><br /> <sup>1</sup>CEA LIST, FR; <sup>2</sup>CEA-Leti, FR</p> <p><b>Timeslots:</b></p> <ul> <li>UB07.4 (Wednesday, March 11, 2020 14:00 - 16:00)</li> <li>UB08.7 (Wednesday, March 11, 2020 16:00 - 18:00)</li> <li>UB10.3 (Thursday, March 12, 2020 12:00 - 14:30)</li> </ul> <p><i><b>Abstract</b>: We have developed and fabricated a 3D-stacked imager called RETINE, composed of two layers and based on the matrix-style replication of a programmable 3D tile, providing a highly parallel programmable architecture. This tile is composed of a 16x16 array of BSI binned pixels with its associated readout and 16 column ADCs on the first layer, coupled to an efficient SIMD processor of 16 PEs on the second layer. The RETINE prototype achieves high video rates, from 5500 fps in binned mode to 340 fps in full-resolution mode. It operates at 80 MHz with 720 mW power consumption, leading to 85 GOPS/W power efficiency. To highlight the capabilities of the RETINE chip we have developed a demonstration platform with an electronic board embedding a RETINE chip that films rotating disks. Three scenarios are available: high-speed image capture, slow motion, and composed image capture with parallel processing during acquisition.</i></p> <p><b>RUMORE: A FRAMEWORK FOR RUNTIME MONITORING AND TRACE ANALYSIS FOR COMPONENT-BASED EMBEDDED SYSTEMS DESIGN FLOW</b></p> <p><b>Authors:</b><br /> Vittoriano Muttillo<sup>1</sup>, Luigi Pomante<sup>1</sup>, Giacomo Valente<sup>1</sup>, Hector Posadas<sup>2</sup>, Javier Merino<sup>2</sup> and Eugenio Villar<sup>2</sup><br /> <sup>1</sup>University of L'Aquila, IT; <sup>2</sup>University of Cantabria, ES</p> <p><b>Timeslots:</b></p> <ul> <li>UB03.9 (Tuesday, March 10, 2020 15:00 - 17:30)</li> <li>UB04.9 (Tuesday, March 10, 2020 17:30 - 19:30)</li> <li>UB11.9 (Thursday, March 12, 2020 14:30 - 16:30)</li> </ul> <p><i><b>Abstract</b>: The purpose of this demonstrator is to introduce runtime monitoring infrastructures and the analysis of trace data. The goal is to demonstrate the concept across different monitoring requirements by defining a general reference architecture that can be adapted to different scenarios. Starting from design artifacts generated by a system engineering modeling tool, a custom HW monitoring system infrastructure will be presented. This sub-system is able to generate runtime artifacts for runtime verification.
We will show how the RUMORE framework provides round-trip support in the development chain, injecting monitoring requirements from design models down to the code and its execution on the platform, and feeding trace data back to the models, where the expected behavior is then compared with the actual behavior. This approach is used to optimize design models for specific properties (e.g., system performance).</i></p> <p><b>SKELETOR: AN OPEN SOURCE EDA TOOL FLOW FROM HIERARCHY SPECIFICATION TO HDL DEVELOPMENT</b></p> <p><b>Authors:</b><br /> Ivan Rodriguez, Guillem Cabo, Javier Barrera, Jeremy Giesen, Alvaro Jover and Leonidas Kosmidis, BSC / UPC, ES</p> <p><b>Timeslots:</b></p> <ul> <li>UB01.2 (Tuesday, March 10, 2020 10:30 - 12:30)</li> <li>UB09.4 (Thursday, March 12, 2020 10:00 - 12:00)</li> </ul> <p><i><b>Abstract</b>: Large hardware design projects have a high bootstrapping overhead, requiring significant effort to translate hardware specifications into hardware description language (HDL) files and to set up the corresponding development and verification infrastructure. Skeletor (<a href="https://github.com/jaquerinte/Skeletor" title="https://github.com/jaquerinte/Skeletor">https://github.com/jaquerinte/Skeletor</a>) is an open source EDA tool developed as a student project at UPC/BSC, which simplifies this process by increasing developer productivity and reducing typing errors, while at the same time lowering the bar to entry for hardware development. Skeletor uses a C/Verilog-like language for specifying the modules in a hardware project hierarchy and their connections, from which it automatically generates the required skeleton of source files, their development and verification testbenches, and simulation scripts. Integration with KiCad schematics and support for syntax highlighting in code editors further simplify its use. This demo is linked with workshop W05.</i></p>
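<p>As a toy illustration of hierarchy-to-HDL skeleton generation, the following Python sketch turns an invented specification format into a Verilog module skeleton; Skeletor's actual input language and output are documented in the project repository linked above.</p> <pre><code># Toy skeleton generator: a hypothetical spec dict -> Verilog module stub.
SPEC = {
    "name": "top",
    "ports": [("clk", "input", 1), ("rst", "input", 1), ("dout", "output", 8)],
    "children": ["alu", "regfile"],
}

def skeleton(spec: dict) -> str:
    # Emit the port list, adding a bus range only for multi-bit ports.
    ports = ",\n".join(
        f"    {direction} wire [{width - 1}:0] {name}" if width > 1
        else f"    {direction} wire {name}"
        for name, direction, width in spec["ports"]
    )
    # Emit one unconnected instance stub per child module.
    insts = "\n".join(f"  {c} u_{c} ();  // TODO: connect" for c in spec["children"])
    return f"module {spec['name']} (\n{ports}\n);\n{insts}\nendmodule\n"

print(skeleton(SPEC))
</code></pre> <p><b>SRSN: SECURE RECONFIGURABLE TEST NETWORK</b></p> <p><b>Authors:</b><br /> Vincent Reynaud<sup>1</sup>, Emanuele Valea<sup>2</sup>, Paolo Maistri<sup>1</sup>, Regis Leveugle<sup>1</sup>, Marie-Lise Flottes<sup>2</sup>, Sophie Dupuis<sup>2</sup>, Bruno Rouzeyre<sup>2</sup> and Giorgio Di Natale<sup>1</sup><br /> <sup>1</sup>TIMA Laboratory, FR; <sup>2</sup>LIRMM, FR</p> <p><b>Timeslots:</b></p> <ul> <li>UB04.3 (Tuesday, March 10, 2020 17:30 - 19:30)</li> <li>UB06.6 (Wednesday, March 11, 2020 12:00 - 14:00)</li> <li>UB08.6 (Wednesday, March 11, 2020 16:00 - 18:00)</li> <li>UB10.6 (Thursday, March 12, 2020 12:00 - 14:30)</li> <li>UB11.6 (Thursday, March 12, 2020 14:30 - 16:30)</li> </ul> <p><i><b>Abstract</b>: The critical importance of testability for electronic devices has led to the development of IEEE test standards. These methods, if not protected, offer a security backdoor to attackers.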
This demonstrator illustrates a state-of-the-art solution, based on the IEEE 1687 standard and implemented on an FPGA target, that prevents unauthorized usage of the test infrastructure.</i></p> <p><b>SUBRISC+: IMPLEMENTATION AND EVALUATION OF AN EMBEDDED PROCESSOR FOR LIGHTWEIGHT IOT EHEALTH</b></p> <p><b>Authors:</b><br /> Mingyu Yang and Yuko Hara-Azumi, Tokyo Institute of Technology, JP</p> <p><b>Timeslots:</b></p> <ul> <li>UB07.8 (Wednesday, March 11, 2020 14:00 - 16:00)</li> <li>UB09.8 (Thursday, March 12, 2020 10:00 - 12:00)</li> </ul> <p><i><b>Abstract</b>: Although the rapid growth of the Internet of Things (IoT) has enabled new opportunities for eHealth devices, the further development of complex systems is severely constrained by the power and energy budgets of battery-powered embedded systems. To address this issue, this work presents "SubRISC+", a processor design targeting lightweight IoT eHealth that achieves low power/energy consumption through its unique and compact architecture. As an example of lightweight eHealth applications on SubRISC+, we are working on epileptic seizure detection using the dynamic time warping algorithm, to be deployed on wearable IoT eHealth devices. Simulation results show a 22% reduction in dynamic power and 50% reductions in leakage power and core area compared to the Cortex-M0. As ongoing work, the evaluation on a fabricated chip will be completed within the first half of 2020.</i></p>
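<p>The dynamic time warping algorithm mentioned above is a classic dynamic program; a textbook O(nm) Python version is sketched below for reference (an illustration of the algorithm only, not the SubRISC+ implementation).</p> <pre><code># Textbook dynamic-time-warping (DTW) distance between two sequences.
def dtw(a: list[float], b: list[float]) -> float:
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # best of diagonal match, insertion, and deletion
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]

# e.g. compare a window of sensor samples against a seizure template
print(dtw([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0]))  # 0.0: same shape, time-shifted
</code></pre> <p>Because DTW tolerates local time shifts, it suits physiological signals whose characteristic patterns stretch and compress from one occurrence to the next; its regular inner loop also makes it a natural fit for compact embedded processors.</p> <p><b>SYSTEMC-CT/DE: A SIMULATOR WITH FAST AND ACCURATE CONTINUOUS TIME AND DISCRETE EVENTS INTERACTIONS ON TOP OF SYSTEMC</b></p> <p><b>Authors:</b><br /> Breytner Joseph Fernandez-Mesa, Liliana Andrade and Frédéric Pétrot, Université Grenoble Alpes / CNRS / TIMA Laboratory, FR</p> <p><b>Timeslots:</b></p> <ul> <li>UB06.4 (Wednesday, March 11, 2020 12:00 - 14:00)</li> <li>UB09.5 (Thursday, March 12, 2020 10:00 - 12:00)</li> </ul> <p><i><b>Abstract</b>: We have developed a continuous time (CT) and discrete events (DE) simulator on top of SystemC. Systems that mix both domains are critical, and their proper functioning must be verified. Simulation serves to achieve this goal. Our simulator implements direct CT/DE synchronization, which enables a rich set of interactions between the domains: events from the CT models are able to trigger DE processes; events from the DE models are able to modify the CT equations. DE-based interactions are then simulated at their precise time by the DE kernel rather than at fixed time steps. We demonstrate our simulator by executing a set of challenging examples: they either require a superdense model of time, include Zeno behavior, or are highly sensitive to accuracy errors. Results show that our simulator overcomes these issues, is accurate, and improves simulation speed w.r.t.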
fixed time steps; all of these advantages open up new possibilities for the design of a wider set of heterogeneous systems.</i></p> <p><b>TAPASCO: THE OPEN-SOURCE TASK-PARALLEL SYSTEM COMPOSER FRAMEWORK</b></p> <p><b>Authors:</b><br /> Carsten Heinz, Lukas Sommer, Lukas Weber, Jaco Hofmann and Andreas Koch, TU Darmstadt, DE</p> <p><b>Timeslots:</b></p> <ul> <li>UB05.1 (Wednesday, March 11, 2020 10:00 - 12:00)</li> <li>UB09.1 (Thursday, March 12, 2020 10:00 - 12:00)</li> <li>UB10.1 (Thursday, March 12, 2020 12:00 - 14:30)</li> </ul> <p><i><b>Abstract</b>: Field-programmable gate arrays (FPGAs) are an established platform for highly specialized accelerators, but in a heterogeneous setup the accelerator still needs to be integrated into the overall system. The open-source TaPaSCo (Task-Parallel System Composer) framework was created to serve this purpose: the fast integration of FPGA-based accelerators into compute platforms or systems-on-chip (SoC) and their connection to relevant components on the FPGA board. TaPaSCo supports developers in all steps of the development process: from cores resulting from High-Level Synthesis or written in an HDL, a complete FPGA design can be created. TaPaSCo automatically connects all processing elements to the memory and host interfaces and generates a complete bitstream. The TaPaSCo Runtime API allows software to interface with accelerators and supports operations such as transferring data to the FPGA memory, passing values and controlling the execution of the accelerators.</i></p> <p><b>UWB ACKATCK: HIJACKING DEVICES IN UWB INDOOR POSITIONING SYSTEMS</b></p> <p><b>Authors:</b><br /> Baptiste Pestourie, Vincent Beroulle and Nicolas Fourty, Université Grenoble Alpes, FR</p> <p><b>Timeslots:</b></p> <ul> <li>UB05.5 (Wednesday, March 11, 2020 10:00 - 12:00)</li> <li>UB07.5 (Wednesday, March 11, 2020 14:00 - 16:00)</li> </ul> <p><i><b>Abstract</b>: Various radio-based Indoor Positioning Systems (IPS) have been proposed during the last decade as solutions to GPS inconsistency in indoor environments. Among the different radio technologies proposed for this purpose, 802.15.4 Ultra-Wideband (UWB) is by far the most performant, reaching up to 10 cm accuracy at 1000 Hz refresh rates. As a consequence, UWB is a popular technology for applications such as asset tracking in industrial environments or indoor navigation of robots/drones. However, some security flaws in the 802.15.4 standard expose UWB positioning to attacks. In this demonstration, we show how an attacker can exploit a vulnerability in 802.15.4 acknowledgment frames to hijack a device in a UWB positioning system. We demonstrate that, using just one cheap UWB chip, the attacker can take control over the positioning system and generate fake trajectories from a laptop. The results are observed in real time in the 3D engine monitoring the positioning system.</i></p> <p><b>VIRTUAL PLATFORMS FOR COMPLEX SOFTWARE STACKS</b></p> <p><b>Authors:</b><br /> Lukas Jünger and Rainer Leupers, RWTH Aachen University, DE</p> <p><b>Timeslots:</b></p> <ul> <li>UB02.3 (Tuesday, March 10, 2020 12:30 - 15:00)</li> <li>UB06.3 (Wednesday, March 11, 2020 12:00 - 14:00)</li> </ul> <p><i><b>Abstract</b>: This demonstration showcases our "AVP64" Virtual Platform (VP), which models a multi-core ARMv8 (Cortex-A72) system including several peripherals, such as an SDHCI and an Ethernet controller. For the ARMv8 instruction set simulation, a solution based on dynamic binary translation is used.
As the workload, the Xen hypervisor with two Linux Virtual Machines (VMs) is executed. Both VMs are connected to the simulation host's network subsystem via a virtual Ethernet controller. One of the VMs executes a NodeJS-based server application offering a REST API via this network connection. An AngularJS client application on the host system can then connect to the server application to obtain and store data via the server's REST API. This data is read and written by the server application to the virtual SD card connected to the SDHCI. For this, one SD card partition is passed to the VM through Xen's block device virtualization mechanism.</i></p> <p><b>WALLANCE: AN ALTERNATIVE TO BLOCKCHAIN FOR IOT</b></p> <p><b>Authors:</b><br /> Loic Dalmasso, Florent Bruguier, Pascal Benoit and Achraf Lamlih, Université de Montpellier, FR</p> <p><b>Timeslots:</b></p> <ul> <li>UB02.8 (Tuesday, March 10, 2020 12:30 - 15:00)</li> <li>UB03.8 (Tuesday, March 10, 2020 15:00 - 17:30)</li> <li>UB04.8 (Tuesday, March 10, 2020 17:30 - 19:30)</li> <li>UB06.9 (Wednesday, March 11, 2020 12:00 - 14:00)</li> </ul> <p><i><b>Abstract</b>: Since the expansion of the Internet of Things (IoT), connected devices have become smart and autonomous. Their exponentially increasing number and their use in many application domains result in a huge potential for cybersecurity threats. Given the evolution of the IoT, security and interoperability are the main challenges in ensuring the reliability of information. Blockchain technology provides a new approach to handling trust in a decentralized network. However, current blockchain implementations cannot be used in the IoT domain because of their huge demands on computing power and storage. This demonstrator presents a lightweight distributed ledger protocol dedicated to IoT applications, reducing computing power and storage utilization, handling scalability and ensuring the reliability of information.</i></p>
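<p>To make the distributed-ledger principle concrete, here is a generic Python sketch in which each entry commits to its predecessor, so tampering anywhere breaks every later hash; this illustrates the general hash-chained-ledger idea, not the Wallance protocol itself.</p> <pre><code># Generic hash-chained ledger sketch (illustration only, not Wallance).
import hashlib, json

def make_entry(prev_hash: str, payload: dict) -> dict:
    """Create an entry whose hash commits to the payload and its predecessor."""
    body = {"prev": prev_hash, "payload": payload}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {**body, "hash": digest}

def valid(chain: list[dict]) -> bool:
    """Recompute every hash and check each link to the previous entry."""
    for i, e in enumerate(chain):
        body = {"prev": e["prev"], "payload": e["payload"]}
        good = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if e["hash"] != good or (i and e["prev"] != chain[i - 1]["hash"]):
            return False
    return True

genesis = make_entry("0" * 64, {"device": "sensor-1", "kWh": 0.0})
chain = [genesis, make_entry(genesis["hash"], {"device": "sensor-1", "kWh": 0.4})]
print(valid(chain))                    # True
chain[0]["payload"]["kWh"] = 9.9       # tamper with history
print(valid(chain))                    # False: tampering detected
</code></pre> <p>See you at the University Booth!<br /> <strong>University Booth Co-Chairs</strong><br /> Frédéric Pétrot, IMAG, FR and<br /> Andreas Vörg, edacentrum, DE<br /> <span class="spamspan"><span class="u">university-booth</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">date-conference<span class="o"> [dot] </span>com</span></span></p> </div> Sun, 02 Feb 2020 17:00:00 +0000 Andreas Vörg, edacentrum GmbH, DE 792 at https://www.date-conference.com Authors' Guidelines for Audio-Visual Presentation https://www.date-conference.com/av-guidelines <span>Authors&#039; Guidelines for Audio-Visual Presentation</span> <span><a title="View user profile." href="/user/25">Andreas Vörg, …</a></span> <span>Sat, 4 Jan 2020 10:29</span> <div class="field field--name-field-news-content field--type-text-with-summary field--label-hidden clearfix field__item"><p>This page describes the guidelines to prepare and present audio-visual materials at DATE. Please read all instructions carefully and follow them strictly to maintain the highest possible standards.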
Even experienced speakers should read the following paragraphs, as they cover several problems that have arisen over the years.</p> <h3>General Instructions for …</h3> <dl class="ckeditor-accordion"> <dt id="General-Instructions-for-Oral-Presentations">… Oral Presentations</dt> <dd> <h2>Presentation Submission</h2> <p>DATE will provide a centralised presentation management system for all speakers of the main conference. You will not be allowed to use your own laptop for presentation on-site – no exceptions.</p> <p>To enable the A/V staff to handle the technical aspects in an efficient way, all presentations should be prepared according to the <a href="#General-Instructions-for-Preparing-AV-Material">"General Instructions for Preparing A/V Material"</a>. <strong>It is essential that the correct format is used.</strong></p> <p>Please bring your presentation file with you (CD/DVD/memory stick) to the conference and submit it to the presentation server at the A/V Office.</p> <p>Before the conference, you can upload your presentation by using the web-based upload service at <a href="https://date.t-e-m.de">https://date.t-e-m.de</a>. The correct file name is set automatically by the server. The access data for the upload service will be sent to the main contributing author in due time. The upload service will close on <b>25 February 2020, 23:59:59 CET</b>.</p> <h2>At the Conference</h2> <p>Preview computer systems, identical in software and hardware to the ones used for presentation, will be available in the Audio/Video Office at the conference. This room can be used for presentation matters during the following times of the <span style="color:blue; font-weight:bold">DATE 20<span style="color:red">20</span></span> week. Since this facility will be shared among multiple presenters, its use may be limited.</p> <table> <thead> <tr> <th>Day</th> <th>Opening hours</th> </tr> </thead> <tbody> <tr> <td>Monday</td> <td>13:00 – 19:00</td> </tr> <tr> <td>Tuesday</td> <td>07:30 – 19:00</td> </tr> <tr> <td>Wednesday</td> <td>07:30 – 18:00</td> </tr> <tr> <td>Thursday</td> <td>07:30 – 17:30</td> </tr> </tbody> </table> <p>All presenters are required to meet with the local conference Audio/Video staff at least two hours before the beginning of their session to check their presentation on one of the conference computers. However, it is strongly recommended to do so the day before the session if possible.</p> <p>The facilities at the A/V Office will provide the possibility of:</p> <ul> <li>uploading the presentations to the server</li> <li>reviewing the presentations on Windows-based computers</li> <li>last-minute alterations of the presentations</li> <li>support by technical staff</li> </ul> <p>Please submit your presentation to the A/V Office via one of the following media:</p> <ul> <li>CD-ROM (CD-R/RW), DVD-ROM (DVD-R/RW)</li> <li>USB memory stick</li> </ul> <p>Save all files associated with your presentation (PowerPoint file, movie/video files etc.) to one folder/location. We recommend saving videos, graphics and pictures separately on your storage medium; in case of problems, we can then re-insert the originals.</p> <p>In the event that you have more than one presentation during the conference, save the presentations in different folders and name them clearly to avoid any on-site misunderstandings and problems.</p> <p>Always make a backup copy of your presentations and all associated files, and keep it with you on a separate portable medium.</p> <p>Conference staff will transfer your presentation from the A/V Office to the corresponding session rooms.
You will easily find your presentation on the laptop installed at the lectern in your session room.</p> <p>Each session room is equipped with:</p> <ul> <li>Video projector</li> <li>Lectern with microphone</li> <li>Laptop with operating system Windows 10 (English)</li> <li>Presenter with laser pointer and slideshow remote control</li> </ul> <p>You can control/move slides during your presentation on your own (by remote control – please check this in the Speaker Preview Room in advance).</p> <p>Kindly be at the session room at least 20 minutes before the session starts to meet the chair and familiarise yourself with the technical and other equipment.</p> <p>Using your own laptop for presentation is not possible.</p> <p>During your presentation, keep your time limit in mind. The session moderator will stop your presentation if it exceeds your allocated time slot.</p> <h2>Speaker’s Breakfast</h2> <p>There will be a speaker’s breakfast on the morning of your presentation. It will be located on the ground level of the Alpes Congrès building and will start at 7:30 a.m. Attending the speaker's breakfast on the morning of your presentation is mandatory in order to receive all final instructions. A sign with the session number will point you to your table.</p> </dd> <dt id="General-Instructions-for-Preparing-AV-Material">… Preparing A/V Material</dt> <dd> <p>When preparing your A/V material, keep the time limit for your presentation in mind. To make your visual presentation a success, it needs to be well planned to clearly point out the important results of your research. The audience will appreciate your talk only if your material is visible and legible. They will remember your talk far better and read your paper if you manage to communicate at least two important facts within your presentation timeslot. Please consider that the audience will need at least a minute to understand each technical slide.</p> <p>The first slide should contain the title of your paper, the author names, your affiliations and your company, university or funding logo (if applicable). This will be the only page where logos are permitted.</p> <p>Keep your material simple and uncluttered. Program listings and very long equations should be avoided. Tables should be represented graphically wherever possible. Do not use the valuable space on your slides for large company logos and other elements that do not help in motivating or understanding your work. Duplicate slides should only be produced in case the same information is needed twice.</p> <h2>Presentation Format</h2> <p>Please use Microsoft PowerPoint 97-2016 (*.ppt/*.pptx), OpenOffice / LibreOffice 1.0 – 6.0, PREZI or Adobe PDF to guarantee that your presentation will open successfully on an on-site PC.</p> <p>All slides must use landscape format with a 16:9 aspect ratio.</p> <p>Please limit the file size to less than 25 MB (except video content) to minimise problems with storage and access speed that can result in a distorted or incomplete presentation.</p> <p>To speed up your start, we provide a PowerPoint template presentation. You are encouraged to use this template to prepare your presentation. Click <a href="/sites/default/files/2020-01/DATE20-slide-template.pptx">here</a> to download the PowerPoint file.</p> <p>Mac users: please convert your file to PowerPoint format or PDF before you leave for the conference.
Be aware that PowerPoint Mac-to-PC conversions can lead to unexpected results, especially with fonts, certain formats of embedded graphics, and special characters (ASCII characters 128 to 255). To avoid questions of PowerPoint compatibility, please embed all used fonts, convert them to vectors or use only compatible fonts (e.g. Arial, Courier New, Lucida Sans, Times New Roman, Verdana).</p> <h2>Pictures and Videos</h2> <p>Because of the many different video formats, support cannot be provided for videos embedded in your presentation; please test your presentation on the on-site PC several hours before your presentation. Generally, the WMV and MPEG-4 formats should work without difficulties.</p> <p>Movies or videos that require additional reading or projection equipment (e.g. VHS cassettes, video DVDs) will not be accepted.</p> <p>Audio is supported.</p> <h2>Fonts</h2> <p>Only fonts which are included in the basic installation of MS Windows 10 will be available. Using other fonts not included in Windows can cause an incorrect layout/style of your presentation (suggested fonts: Arial, Tahoma). If you use different fonts, these must be embedded into your presentation.</p> <p>Please use high-contrast lettering and fonts with a minimum size of 16 pt, and high-contrast layouts such as light text on dark colours.</p> <p>Please make sure that index expressions are also clearly visible and use an appropriate font size.</p> <h2>Colours</h2> <p>Colour should be used carefully, and colour combinations resulting in low contrast (e.g. dark blue on black) should be avoided. Be aware that the contrast of your computer monitor is much higher than that of a projector in a partly lit room.</p> <p>Try to use only colours that convert well for black-and-white printing. The distinction between blue and black for text and thin lines is especially weak. Red filled-in objects (circles, rectangles, etc.) with white text are well suited for highlighting important text.</p> </dd> </dl> <h3>Further Instructions for…</h3> <dl class="ckeditor-accordion"> <dt id="Further-Instructions-for-Session-Chairs-and-Co-Chairs">… Session Chairs and Co-Chairs</dt> <dd> <h2>Quick Checklist</h2> <ul> <li>If needed, get a PowerPoint template <a href="/sites/default/files/2020-01/DATE20-slide-template.pptx">here</a>.</li> <li>At least two hours before your session, contact the Audio/Video Office staff to check that all your session presentations have been uploaded. If you have introductory slides, please also contact the A/V staff.</li> <li>Attend the Speaker’s Breakfast on the morning of your session at 7:30 a.m.</li> <li>Please check the presence of all speakers at the latest 10 minutes before your session starts.</li> <li>After your session, please fill in the <a href="/sites/default/files/2020-01/DATE20-session-evaluation-form.pdf">session evaluation form</a> and return it to the conference registration desk.</li> </ul> <h2>Session Chairs</h2> <p>The main task of a session chair is to run the session. All speakers have been advised to get in contact with you before the session – please check that all of them are present before your session starts. If one of the speakers is missing, leave the presentation slot empty to stay on schedule. Within your session, please introduce the speakers and keep track of the time limits (indicate to the speaker when it is time to stop). Please also manage the question-and-answer procedure after each talk (long dialogues have to be taken off-line).
If there are Interactive Presentations assigned to your session, please provide a one-minute time slot to each of them at the end of the session. After your session, please fill in the <a href="/sites/default/files/2020-01/DATE20-session-evaluation-form.pdf">session evaluation form</a> and return it to the conference registration desk.</p> <h2>Session Co-Chairs</h2> <p>The main task of a session co-chair is to support the session chair and to handle unexpected situations. Please estimate the number of attendees (required for the evaluation form). You are requested to handle unexpected noise (talk to security people) and A/V problems (talk to A/V people / technicians), and to look for missing speakers. Please stand in for the session chair in case they are not present due to unexpected circumstances.</p> <p><strong>For further information, please have a look at the guidelines for your appropriate session below.</strong></p> </dd> <dt>… Organisers of Executive and Panel Sessions</dt> <dd> <h2>Quick Checklist</h2> <ul> <li>Carefully read the information provided in the <a href="#General-Instructions-for-Oral-Presentations">"General Instructions for Oral Presentations"</a> and <a href="#General-Instructions-for-Preparing-AV-Material">"General Instructions for Preparing A/V Material"</a></li> <li>If needed, get a PowerPoint template <a href="/sites/default/files/2020-01/DATE20-slide-template.pptx">here</a>.</li> <li>Collect all your session’s presentations where applicable and upload them via <a href="https://date.t-e-m.de">https://date.t-e-m.de</a> by <b>25 February 2020, 23:59:59 CET</b>.<br /> The access data for the upload service will be sent to the Organisers of Special Sessions in due time.</li> <li>At least two hours before your session, visit the Audio/Video Office to check/submit all your session’s presentations.</li> <li>Attend the Speaker’s Breakfast on the morning of your session at 7:30 a.m.</li> </ul> <p>If you are also the chair or co-chair of your session, <strong>please have a look at the <a href="#Further-Instructions-for-Session-Chairs-and-Co-Chairs">"Further Instructions for Session Chairs and Co-Chairs"</a></strong>.</p> </dd> <dt>… Speakers in Executive and Panel Sessions</dt> <dd> <h2>Quick Checklist</h2> <ul> <li>Carefully read the information provided in the <a href="#General-Instructions-for-Oral-Presentations">"General Instructions for Oral Presentations"</a> and <a href="#General-Instructions-for-Preparing-AV-Material">"General Instructions for Preparing A/V Material"</a></li> <li>If needed, get a PowerPoint template <a href="/sites/default/files/2020-01/DATE20-slide-template.pptx">here</a> and prepare your slides according to the above-mentioned guidelines.</li> <li>Send your presentation to the session organiser. Your session organiser is responsible for uploading your presentation to the conference server in time. Please contact her/him for instructions.</li> <li>At least two hours before your session, visit the Audio/Video Office to check your presentation.</li> <li>Attend the Speaker’s Breakfast on the morning of your presentation at 7:30 a.m.
and bring your filled-in <a href="/sites/default/files/2020-01/DATE20-speakers-bio.pdf">Speaker's Bio</a></li> <li>20 minutes before your session, contact the session chair to confirm your presence.</li> </ul> </dd> <dt>… Speakers in Regular, Embedded Tutorial and Hot-Topic Sessions (Long and Short Presentations)</dt> <dd> <h2>Quick Checklist</h2> <ul> <li>Carefully read the information provided in the <a href="#General-Instructions-for-Oral-Presentations">"General Instructions for Oral Presentations"</a> and <a href="#General-Instructions-for-Preparing-AV-Material">"General Instructions for Preparing A/V Material"</a></li> <li>If needed, get a PowerPoint template <a href="/sites/default/files/2020-01/DATE20-slide-template.pptx">here</a> and prepare your slides according to the above-mentioned guidelines.</li> <li>Upload your presentation via <a href="https://date.t-e-m.de">https://date.t-e-m.de</a> by <b>25 February 2020, 23:59:59 CET</b>.<br /> The access data for the upload service will be sent to the main contributing author in due time.</li> <li>At least two hours before your session, visit the Audio/Video Office to check your presentation.</li> <li>Attend the Speaker’s Breakfast on the morning of your presentation at 7:30 a.m. and bring your filled-in <a href="/sites/default/files/2020-01/DATE20-speakers-bio.pdf">Speaker's Bio</a></li> <li>20 minutes before your session, contact the session chair to confirm your presence.</li> </ul> <h2>Presentation timeslots</h2> <p>The presentation timeslot is 25+5 minutes for long and 13+2 minutes for short presentations, where the "+" time is for questions. Please consider that the audience will need at least a minute to understand each technical slide. Therefore, you should prepare 15 to 20 slides for long and 10 to 15 slides for short presentations.</p> </dd> <dt>… Authors of Interactive Presentations</dt> <dd> <h2>Quick Checklist</h2> <ul> <li>Carefully read the information provided in the <a href="#General-Instructions-for-Oral-Presentations">"General Instructions for Oral Presentations"</a> and <a href="#General-Instructions-for-Preparing-AV-Material">"General Instructions for Preparing A/V Material"</a></li> <li>If needed, get a PowerPoint template <a href="/sites/default/files/2020-01/DATE20-slide-template.pptx">here</a> and prepare your two advertisement slides according to the above-mentioned guidelines.</li> <li>Prepare your poster according to the below-mentioned guidelines.</li> <li>Upload your 2-slide presentation via <a href="https://date.t-e-m.de">https://date.t-e-m.de</a> by <b>25 February 2020, 23:59:59 CET</b>.<br /> The access data for the upload service will be sent to the main contributing author in due time.</li> <li>At least two hours before your advertisement session, visit the Audio/Video Office to check your presentation.</li> <li>20 minutes before your advertisement session, contact the session chair to confirm your presence.</li> <li>15 minutes before your IP session: please mount your poster and stay in the IP session area.</li> <li>60 minutes before the next IP session: please be sure to remove your poster; otherwise it will be disposed of.</li> </ul> <h2>Advertisement talk</h2> <p>IP authors have two time slots for presentation. The first time slot is scheduled at the end of a regular session for a short advertisement of the poster presentation scheduled in the following IP session. The timeslot for your advertisement presentation is only one minute.
You are allowed to show at most two slides (cover page included).</p> <h2>Poster Presentation</h2> <p>The second time slot is an oral explanation given to the interested audience during the interactive presentation sessions. Each IP session runs in a 30-minute timeslot and must be supported by a poster prepared according to the guidelines indicated below. Please be in the IP area at least 15 minutes before the session starts to correctly mount the poster. Please also take care of removing it at the latest 60 minutes before the next IP session; posters from previous sessions will be removed and disposed of. You do not have to prepare presentation slides for this time slot, as there will be no table or power socket near the poster wall.</p> <p>Finally, remember that the Best IP Award selection committee will check the quality of the presentation and of the answers during the IP sessions to make its decision.</p> <p>IP authors are kindly asked to prepare posters in DIN A0 portrait format (841x1189 mm / 33.11x46.81 in) and bring the printed poster to the conference. There is no poster printing service on-site. The poster will be exhibited in the IP session area on the poster walls labelled with the corresponding IP number. Blu-Tack/pins will be provided. Posters made as a mosaic of A4 or letter pages are discouraged.</p> </dd> <dt>… Speakers in Exhibition Theatre Sessions</dt> <dd> <h2>Quick Checklist</h2> <ul> <li>Carefully read the information provided in the <a href="#General-Instructions-for-Oral-Presentations">"General Instructions for Oral Presentations"</a> and <a href="#General-Instructions-for-Preparing-AV-Material">"General Instructions for Preparing A/V Material"</a></li> <li>If needed, get a PowerPoint template <a href="/sites/default/files/2020-01/DATE20-slide-template.pptx">here</a> and prepare your slides according to the above-mentioned guidelines.</li> <li>Upload your presentation via <a href="https://date.t-e-m.de">https://date.t-e-m.de</a> by <b>25 February 2020, 23:59:59 CET</b>.<br /> The access data for the upload service will be sent to the main contributing author in due time.</li> <li>At least two hours before your session, visit the Audio/Video Office to check your presentation.</li> <li>Attend the Speaker’s Breakfast on the morning of your presentation at 7:30 a.m. and bring your filled-in <a href="/sites/default/files/2020-01/DATE20-speakers-bio.pdf">Speaker's Bio</a></li> <li>20 minutes before your session, contact the session chair to confirm your presence.</li> </ul> </dd> <dt>… Authors of Monday Tutorial Presentations</dt> <dd> <p>The centralised presentation management system will NOT be used for Monday Tutorials; presentations will be handled individually by each Tutorial Organiser. Please contact your Tutorial Organiser for information on the organisation of your presentation.</p> </dd> <dt>… Authors of Friday Workshop Presentations</dt> <dd> <p>The centralised presentation management system will NOT be used for Friday Workshops; presentations will be handled individually by each Workshop Organiser. Please contact your Workshop Organiser for information on the organisation of your presentation.</p> </dd> </dl> <p>For more information please contact:</p> <p><b>Conference Organisation - Conference Manager</b><br />Eva Smejkal, K.I.T.
Group GmbH Dresden, DE<br /><span class="spamspan"><span class="u">date</span><img class="spamspan-image" alt="at" src="/modules/contrib/spamspan/image.gif" /><span class="d">kitdresden<span class="o"> [dot] </span>de</span></span><br />phone: +49 351 65573-133<br />fax: +49 351 65573-299</p> </div> <div class="field field--name-field-news-attachments field--type-file field--label-above clearfix"> <div class="field__label">Download further information:</div> <div class="field__items"> <div class="field__item"><span class="file file--mime-application-pdf file--application-pdf"><a href="https://www.date-conference.com/sites/date20/files/2020-01/DATE20-speakers-bio.pdf" type="application/pdf; length=101265" title="DATE20-speakers-bio.pdf">DATE 2020 Speaker's Bio</a></span> <span class="file-size">(98.89 KB)</span> </div> <div class="field__item"><span class="file file--mime-application-pdf file--application-pdf"><a href="https://www.date-conference.com/sites/date20/files/2020-01/DATE20-session-evaluation-form.pdf" type="application/pdf; length=114702" title="DATE20-session-evaluation-form.pdf">DATE 2020 Session Evaluation Form</a></span> <span class="file-size">(112.01 KB)</span> </div> <div class="field__item"><span class="file file--mime-application-vnd-openxmlformats-officedocument-presentationml-presentation file--x-office-presentation"><a href="https://www.date-conference.com/sites/date20/files/2020-01/DATE20-slide-template.pptx" type="application/vnd.openxmlformats-officedocument.presentationml.presentation; length=895391" title="DATE20-slide-template.pptx">DATE 2020 PowerPoint template</a></span> <span class="file-size">(874.41 KB)</span> </div> </div> </div> Sat, 04 Jan 2020 09:29:05 +0000 Andreas Vörg, edacentrum GmbH, DE 515 at https://www.date-conference.com Advance Conference Programme https://www.date-conference.com/programme <span>Advance Conference Programme</span> <span><a title="View user profile."
Andreas Vörg, edacentrum GmbH, DE</span> <span>Sun, 8 Dec 2019 22:22</span> <div class="field field--name-field-news-content field--type-text-with-summary field--label-hidden clearfix field__item"><p><a href="/sites/default/files/2020-02/DATE2020_ProgrammeBooklet%20web_0.pdf"><strong><span style="color:blue; font-weight:bold">DATE 20<span style="color:red">20</span></span> Advance Programme for download</strong></a></p> <p>Keynotes: <a href="https://www.date-conference.com/keynotes">https://www.date-conference.com/keynotes</a></p> <p>Monday Tutorials: <a href="https://www.date-conference.com/conference/monday-tutorials">https://www.date-conference.com/conference/monday-tutorials</a></p> <p>PhD Forum: <a href="https://www.date-conference.com/fringe-meeting-fm01">https://www.date-conference.com/fringe-meeting-fm01</a></p> <p>Special Days, Executive &amp; Special Sessions: <a href="https://www.date-conference.com/special">https://www.date-conference.com/special</a></p> <ul> <li>Wednesday Special Day on "Embedded AI": Sessions <a href="https://www.date-conference.com/program#5.1">5.1</a>, <a href="https://www.date-conference.com/programme#6.1">6.1</a>, <a href="https://www.date-conference.com/programme#7.0">7.0</a>, <a href="https://www.date-conference.com/programme#7.1">7.1</a>, <a href="https://www.date-conference.com/programme#8.1">8.1</a></li> <li>Thursday Special Day on "Silicon Photonics": Sessions <a href="https://www.date-conference.com/programme#9.1">9.1</a>, <a href="https://www.date-conference.com/programme#10.1">10.1</a>, <a href="https://www.date-conference.com/programme#11.0">11.0</a>, <a href="https://www.date-conference.com/programme#11.1">11.1</a>, <a href="https://www.date-conference.com/programme#12.1">12.1</a></li> <li>Special Initiative on "Autonomous Systems Design": Sessions <a href="https://www.date-conference.com/programme#9.2">9.2</a>, <a href="https://www.date-conference.com/programme#10.2">10.2</a>, <a href="https://www.date-conference.com/programme#11.2">11.2</a>, <a href="https://www.date-conference.com/programme#12.2">12.2</a>, <a href="https://date-conference.com/workshop/w03">W03</a></li> </ul> <p>Exhibition Theatre: <a href="https://www.date-conference.com/exhibition/exhibition-theatre">https://www.date-conference.com/exhibition/exhibition-theatre</a></p> <p>Friday Workshops: <a href="https://www.date-conference.com/conference/friday-workshops">https://www.date-conference.com/conference/friday-workshops</a></p>
<h2 id="1.1">1.1 Opening Session: Plenary, Awards Ceremony &amp; Keynote Addresses</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 08:30 - 10:30<br /> <b>Location / Room:</b> Amphithéâtre Dauphine</p> <p><b>Chair:</b><br /> Giorgio Di Natale, <span class="date-blue" style="font-weight:bold">DATE 20<span class="date-red">20</span></span> General Chair, FR</p> <p><b>Co-Chair:</b><br /> Cristiana Bolchini, <span class="date-blue" style="font-weight:bold">DATE 20<span class="date-red">20</span></span> Programme Chair, IT</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>08:15</td> <td>1.1.1</td> <td><b>WELCOME ADDRESSES</b><br /> <b>Speakers</b>:<br /> Giorgio Di Natale<sup>1</sup> and Cristiana Bolchini<sup>2</sup><br /> <sup>1</sup>TIMA, FR; <sup>2</sup>Politecnico di Milano, IT</td> </tr> <tr> <td>08:25</td> <td>1.1.2</td> <td><b>PRESENTATION OF AWARDS</b></td> </tr> <tr> <td>09:15</td> <td>1.1.3</td> <td><b>PLENARY KEYNOTE: THE INDUSTRIAL IOT
MICROELECTRONICS REVOLUTION</b><br /> <b>Speaker</b>:<br /> Philippe Magarshack, STMicroelectronics, FR<br /> <em><b>Abstract</b></em> <p><em>Industrial IoT (IIoT) systems are now becoming a reality. IIoT is distributed by nature, encompassing many complementary technologies. IIoT systems are composed of sensors, actuators, a means of communication and control units, and are moving into the factories with the Industry 4.0 generation. To operate concurrently, all these IIoT components require a wide range of technologies to maintain such a system-of-systems in a fully operational, coherent and secure state. We identify and describe the four key enablers for the Industrial IoT:</em></p> <ol> <li><em>more powerful and diverse embedded computing, available on ST's latest STM32 microcontrollers and microprocessors,</em></li> <li><em>augmented by AI applications at the edge (in the end devices), whose development is becoming enormously simplified by our specialized tools,</em></li> <li><em>a wide set of connectivity technologies, either as complete systems-on-chip or as ready-to-use modules, and</em></li> <li><em>a scalable security offering, thanks to either integrated features or dedicated security devices.</em></li> </ol> <p><em>We conclude with some perspective on the usage of Digital Twins in the IIoT.</em></p> </td> </tr> <tr> <td> </td> <td>1.1.4</td> <td><b>PLENARY KEYNOTE: OPEN PARALLEL ULTRA-LOW POWER PLATFORMS FOR EXTREME EDGE AI</b><br /> <b>Speaker</b>:<br /> Luca Benini, ETH Zurich, CH<br /> <em><b>Abstract</b></em> <p><em>Edge Artificial Intelligence is the new megatrend, as privacy concerns and network bandwidth/latency bottlenecks prevent cloud offloading of sensor analytics functions in many application domains, from autonomous driving to advanced prosthetics. The next wave of "Extreme Edge AI" pushes aggressively towards sensors and actuators, opening major research and business development opportunities. In this talk I will give an overview of recent efforts in developing an Extreme Edge AI platform based on open-source parallel ultra-low power (PULP) RISC-V processors and accelerators. I will then look at what comes next in this brave new world of hardware renaissance.</em></p> </td> </tr> <tr> <td>10:30</td> <td> </td> <td>End of session</td> </tr> </tbody> </table> <hr />
<h2 id="UB01">UB01 Session 1</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 10:30 - 12:30<br /> <b>Location / Room:</b> Booth 11, Exhibition Area</p> <table> <tbody> <tr> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> <tr> <td>UB01.1</td> <td><b>FUZZING EMBEDDED BINARIES LEVERAGING SYSTEMC-BASED VIRTUAL PROTOTYPES</b><br /> <b>Authors</b>:<br /> Vladimir Herdt<sup>1</sup>, Daniel Grosse<sup>2</sup> and Rolf Drechsler<sup>2</sup><br /> <sup>1</sup>DFKI, DE; <sup>2</sup>University of Bremen / DFKI GmbH, DE<br /> <em><b>Abstract</b><br /> Verification of embedded software (SW) binaries is very important. Mainly, simulation-based methods are employed that execute (randomly) generated test cases on Virtual Prototypes (VPs). However, to enable comprehensive VP-based verification, sophisticated test-case generation techniques need to be integrated. Our demonstrator combines state-of-the-art fuzzing techniques with SystemC-based VPs to enable fast and accurate verification of embedded SW binaries. The fuzzing process is guided by the coverage of the embedded SW as well as of the SystemC-based peripherals of the VP. The effectiveness of our approach is demonstrated by our experiments, using RISC-V SW binaries as an example.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3098.pdf">More information ...</a></b></em></td> </tr>
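<p>To make the coverage-guided loop of UB01.1 concrete, here is a minimal fuzzing sketch in Python; the <code>run_on_vp()</code> harness, which returns the set of coverage points reached on the virtual prototype, is a hypothetical stand-in for the authors' SystemC infrastructure.</p>
<pre>
import random

def mutate(seed: bytes) -> bytes:
    # Simplest possible mutation: flip one random byte of the seed input.
    if not seed:
        return bytes([random.randrange(256)])
    data = bytearray(seed)
    data[random.randrange(len(data))] = random.randrange(256)
    return bytes(data)

def fuzz(run_on_vp, initial: bytes, iterations: int = 10000):
    # Keep any input that reaches unseen SW or peripheral coverage points.
    corpus = [initial]
    covered = set(run_on_vp(initial))
    for _ in range(iterations):
        candidate = mutate(random.choice(corpus))
        new_cov = set(run_on_vp(candidate))
        if new_cov - covered:              # candidate reached new coverage
            covered |= new_cov
            corpus.append(candidate)
    return corpus, covered
</pre>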
<tr> <td>UB01.2</td> <td><b>SKELETOR: AN OPEN SOURCE EDA TOOL FLOW FROM HIERARCHY SPECIFICATION TO HDL DEVELOPMENT</b><br /> <b>Authors</b>:<br /> Ivan Rodriguez, Guillem Cabo, Javier Barrera, Jeremy Giesen, Alvaro Jover and Leonidas Kosmidis, BSC / UPC, ES<br /> <em><b>Abstract</b><br /> Large hardware design projects have a high overhead for project bootstrapping, requiring significant effort for translating hardware specifications into hardware description language (HDL) files and setting up the corresponding development and verification infrastructure. Skeletor (<a href="https://github.com/jaquerinte/Skeletor" title="https://github.com/jaquerinte/Skeletor">https://github.com/jaquerinte/Skeletor</a>) is an open-source EDA tool developed as a student project at UPC/BSC which simplifies this process by increasing developers' productivity and reducing typing errors, while at the same time lowering the bar for entry into hardware development. Skeletor uses a C/Verilog-like language for the specification of the modules in a hardware project hierarchy and their connections, from which it automatically generates the required skeleton of source files, their development and verification testbenches, and simulation scripts. Integration with KiCad schematics and support for syntax highlighting in code editors further simplify its use. This demo is linked with workshop W05.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3107.pdf">More information ...</a></b></em></td> </tr>
<tr> <td>UB01.3</td> <td><b>LAGARTO: FIRST SILICON RISC-V ACADEMIC PROCESSOR DEVELOPED IN SPAIN</b><br /> <b>Authors</b>:<br /> Guillem Cabo Pitarch<sup>1</sup>, Cristobal Ramirez Lazo<sup>1</sup>, Julian Pavon Rivera<sup>1</sup>, Vatistas Kostalabros<sup>1</sup>, Carlos Rojas Morales<sup>1</sup>, Miquel Moreto<sup>1</sup>, Jaume Abella<sup>1</sup>, Francisco J. Cazorla<sup>1</sup>, Adrian Cristal<sup>1</sup>, Roger Figueras<sup>1</sup>, Alberto Gonzalez<sup>1</sup>, Carles Hernandez<sup>1</sup>, Cesar Hernandez<sup>2</sup>, Neiel Leyva<sup>2</sup>, Joan Marimon<sup>1</sup>, Ricardo Martinez<sup>3</sup>, Jonnatan Mendoza<sup>1</sup>, Francesc Moll<sup>4</sup>, Marco Antonio Ramirez<sup>2</sup>, Carlos Rojas<sup>1</sup>, Antonio Rubio<sup>4</sup>, Abraham Ruiz<sup>1</sup>, Nehir Sonmez<sup>1</sup>, Lluis Teres<sup>3</sup>, Osman Unsal<sup>5</sup>, Mateo Valero<sup>1</sup>, Ivan Vargas<sup>1</sup> and Luis Villa<sup>2</sup><br /> <sup>1</sup>BSC / UPC, ES; <sup>2</sup>CIC-IPN, MX; <sup>3</sup>IMB-CNM (CSIC), ES; <sup>4</sup>UPC, ES; <sup>5</sup>BSC, ES<br /> <em><b>Abstract</b><br /> Open hardware is a possibility that has emerged in recent years and has the potential to be as disruptive as Linux, the open-source software paradigm, once was. If Linux managed to lessen users' dependence on large companies providing software and software applications, it is envisioned that hardware based on open-source ISAs can do the same in its own field.
Four research institutions were involved in the Lagarto tapeout: the Centro de Investigación en Computación of the Mexican IPN, the Centro Nacional de Microelectrónica of the CSIC, the Universitat Politècnica de Catalunya (UPC) and the Barcelona Supercomputing Center (BSC). As a result, many bachelor, master and PhD students had the chance to gain real-world experience with ASIC design and to deliver a functional SoC. In the booth, you will find a live demo of the first ASIC as well as FPGA prototypes of the next versions of the SoC and core.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3104.pdf">More information ...</a></b></em></td> </tr>
<tr> <td>UB01.4</td> <td><b>PARALLEL ALGORITHM FOR CNN INFERENCE AND ITS AUTOMATIC SYNTHESIS</b><br /> <b>Authors</b>:<br /> Takashi Matsumoto, Yukio Miyasaka, Xinpei Zhang and Masahiro Fujita, University of Tokyo, JP<br /> <em><b>Abstract</b><br /> Recently, Convolutional Neural Networks (CNNs) have surpassed conventional methods in the field of image processing. This demonstration shows a new algorithm to calculate CNN inference using processing elements arranged and connected based on the topology of the convolution. They are connected in a mesh and calculate CNN inference in a systolic way. The algorithm performs the convolution of all elements with the same output feature in parallel. We demonstrate a method to automatically synthesize an algorithm that simultaneously performs the convolution and the communication of pixels for the computation of the next layer. We ran the synthesis for several sizes of input layers, kernels and strides, and confirmed that correct algorithms were synthesized. The synthesis method is extended to sparse kernels. The synthesized algorithm requires fewer cycles than the original one, and the sparser the kernel, the more chances there are to reduce the number of cycles.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3132.pdf">More information ...</a></b></em></td> </tr>
<tr> <td>UB01.5</td> <td><b>FASTHERMSIM: FAST AND ACCURATE THERMAL SIMULATIONS FROM CHIPLETS TO SYSTEM</b><br /> <b>Authors</b>:<br /> Yu-Min Lee, Chi-Wen Pan, Li-Rui Ho and Hong-Wen Chiou, National Chiao Tung University, TW<br /> <em><b>Abstract</b><br /> Recently, owing to the scaling down of technology and to 2.5D/3D integration, power densities and temperatures of chips have been increasing significantly. Though commercial computational fluid dynamics tools can provide accurate thermal maps, their huge runtime makes them inefficient for thermal-aware design. Thus, we developed the chip/package/system-level thermal analyzer FasThermSim, which can assist you in improving your design under thermal constraints in pre/post-silicon stages. In FasThermSim, we consider three heat transfer modes: conduction, convection, and thermal radiation. We convert them to temperature-independent terms by linearization and build a compact thermal model (CTM). By applying numerical methods to the CTM, the steady-state and transient thermal profiles can be solved efficiently without loss of accuracy. Finally, an easy-to-use, flexible and compatible thermal analysis tool with a graphical user interface is implemented for your design.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3137.pdf">More information ...</a></b></em></td> </tr>
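<p>The compact thermal model mentioned in UB01.5 can be illustrated with a toy example: after linearization, the network obeys C dT/dt = P - G T, so the steady state follows from one linear solve and the transient from a time-stepping scheme. All node values below are invented for illustration and bear no relation to FasThermSim's actual models.</p>
<pre>
import numpy as np

# Toy 3-node compact thermal model: C dT/dt = P - G @ T
G = np.array([[ 0.30, -0.10, -0.05],
              [-0.10,  0.25, -0.10],
              [-0.05, -0.10,  0.20]])   # thermal conductances (W/K)
C = np.diag([2.0, 1.5, 1.0])            # thermal capacitances (J/K)
P = np.array([1.0, 0.5, 0.2])           # power injected per node (W)

T_ss = np.linalg.solve(G, P)            # steady-state temperature rise

# Transient: implicit (backward) Euler, unconditionally stable
dt, T = 0.01, np.zeros(3)
A = C / dt + G
for _ in range(1000):
    T = np.linalg.solve(A, C @ T / dt + P)

print(T_ss, T)                          # T converges towards T_ss
</pre>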
<tr> <td>UB01.6</td> <td><b>INTACT: A 96-CORE PROCESSOR WITH 6 CHIPLETS 3D-STACKED ON AN ACTIVE INTERPOSER AND A 16-CORE PROTOTYPE RUNNING GRAPHICAL OPERATING SYSTEM</b><br /> <b>Authors</b>:<br /> Eric Guthmuller<sup>1</sup>, Pascal Vivet<sup>1</sup>, César Fuguet<sup>1</sup>, Yvain Thonnart<sup>1</sup>, Gaël Pillonnet<sup>2</sup> and Fabien Clermidy<sup>1</sup><br /> <sup>1</sup>Université Grenoble Alpes / CEA List, FR; <sup>2</sup>Université Grenoble Alpes / CEA-Leti, FR<br /> <em><b>Abstract</b><br /> We built a demonstrator for our 96-core cache-coherent 3D processor and a first prototype featuring 16 cores. The demonstrator consists of our 16-core processor running commodity operating systems such as Linux and NetBSD on a PC-like motherboard with user-friendly devices such as an HDMI display, keyboard and mouse. A graphical desktop is displayed, and the user can interact with it through the keyboard and mouse. The demonstrator is able to run parallel applications to benchmark its performance in terms of scalability. The main innovation of our processor is its scalable cache-coherent architecture based on distributed L2 caches and adaptive L3 caches. Additionally, the energy consumption is measured and displayed by reading dynamically from the monitors of the power-supply devices. Finally, we will also show opened packages of the 3D processor featuring six 16-core chiplets (28nm FDSOI) on an active interposer (65nm) embedding networks-on-chip, power management and IO controllers.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3114.pdf">More information ...</a></b></em></td> </tr>
<tr> <td>UB01.7</td> <td><b>EEC: ENERGY EFFICIENT COMPUTING VIA DYNAMIC VOLTAGE SCALING AND IN-NETWORK OPTICAL PROCESSING</b><br /> <b>Authors</b>:<br /> Ryosuke Matsuo<sup>1</sup>, Jun Shiomi<sup>1</sup>, Yutaka Masuda<sup>2</sup> and Tohru Ishihara<sup>2</sup><br /> <sup>1</sup>Kyoto University, JP; <sup>2</sup>Nagoya University, JP<br /> <em><b>Abstract</b><br /> This poster demonstration will show the results of our two research projects. The first is a project on energy-efficient computing, in which we developed a power management algorithm that keeps the target processor always running at its most energy-efficient operating point by appropriately tuning the supply voltage and threshold voltage under a specific performance constraint. The algorithm is applicable to a wide variety of processor systems, from high-end processors to low-end embedded processors. We will show results obtained with actual RISC processors designed in a 65nm technology. The second is a project on in-network optical computing. We show optical functional units such as parallel multipliers and optical neural networks. Several key techniques for reducing the power consumption of optical circuits will also be presented. Finally, we will show the results of optical circuit simulations, which demonstrate the light-speed operation of the circuits.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3128.pdf">More information ...</a></b></em></td> </tr>
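<p>The first project in UB01.7 can be pictured as a grid search for the most energy-efficient (supply voltage, threshold voltage) pair under a performance constraint. The alpha-power delay model and every constant below are illustrative placeholders, not the authors' silicon characterization.</p>
<pre>
import numpy as np

def freq(vdd, vth, k=1e9, alpha=1.3):
    # Toy alpha-power model for the achievable clock frequency (Hz)
    return k * (vdd - vth)**alpha / vdd

def energy_per_op(vdd, vth, f, ceff=1e-9, i0=1e-3, s=0.1):
    e_dyn = ceff * vdd**2                 # switching energy per op (J)
    p_leak = vdd * i0 * np.exp(-vth / s)  # subthreshold leakage power (W)
    return e_dyn + p_leak / f             # leakage billed to each op

f_min, best = 50e6, None                  # performance constraint
for vdd in np.arange(0.4, 1.21, 0.01):
    for vth in np.arange(0.2, 0.61, 0.01):
        if vdd - vth <= 0.05:
            continue                      # no usable overdrive
        f = freq(vdd, vth)
        if f >= f_min:
            e = energy_per_op(vdd, vth, f)
            if best is None or e < best[0]:
                best = (e, vdd, vth, f)
print(best)                               # most energy-efficient point
</pre>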
<tr> <td>UB01.8</td> <td><b>CATANIS: CAD TOOL FOR AUTOMATIC NETWORK SYNTHESIS</b><br /> <b>Authors</b>:<br /> Davide Quaglia, Enrico Fraccaroli, Filippo Nevi and Sohail Mushtaq, Università di Verona, IT<br /> <em><b>Abstract</b><br /> The proliferation of communication technologies for embedded systems opened the way for new applications, e.g., Smart Cities and Industry 4.0. In such applications, hundreds or thousands of smart devices interact through different types of channels and protocols. This increasing communication complexity forces computer-aided design methodologies to scale up from embedded systems in isolation to the global interconnected system. Network synthesis is the methodology to optimally allocate functionality onto network nodes and define the communication infrastructure among them. This booth will demonstrate the functionality of a graphic tool for automatic network synthesis developed by the Computer Science Department of the University of Verona. It allows users to graphically specify the communication requirements of a smart space (e.g., its map can be considered) in terms of sensing and computation tasks, together with a library of node types and communication protocols to be used.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3125.pdf">More information ...</a></b></em></td> </tr>
<tr> <td>UB01.9</td> <td><b>PRE-IMPACT FALL DETECTION ARCHITECTURE BASED ON NEUROMUSCULAR CONNECTIVITY STATISTICS</b><br /> <b>Authors</b>:<br /> Giovanni Mezzina, Sardar Mehboob Hussain and Daniela De Venuto, Politecnico di Bari, IT<br /> <em><b>Abstract</b><br /> In this demonstration, we propose an innovative multi-sensor architecture operating in the field of pre-impact fall detection (PIFD). The proposed architecture jointly analyzes cortical and muscular involvement when unexpected slippages occur during steady walking. The EEG and EMG are acquired through wearable and wireless devices. The control unit consists of an STM32L4 microcontroller and a Simulink model: the C code implements the EMG computation, while the cortical analysis and the final classification are entrusted to the Simulink model. The EMG computation block translates EMGs into binary signals, which are used both to enable the cortical analyses and to extract a score that distinguishes "standard" muscular behaviors from anomalous ones. The Simulink model evaluates the cortical responsiveness in five bands of interest and implements the logic-based network classifier. The system, tested on 6 healthy subjects, shows an accuracy of 96.21% and a detection time of ~371 ms.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3122.pdf">More information ...</a></b></em></td> </tr>
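<p>The EMG-to-binary translation step of UB01.9 can be illustrated with a generic envelope-and-threshold scheme in Python (a sketch only; the actual firmware running on the STM32L4 is more elaborate):</p>
<pre>
import numpy as np

def binarize_emg(emg, fs=1000, win_ms=50, k=3.0):
    # Rectify, smooth with a moving-average envelope, then threshold at
    # k times a baseline estimated from the first second (assumed rest).
    rect = np.abs(emg - np.mean(emg))
    win = max(1, int(fs * win_ms / 1000))
    envelope = np.convolve(rect, np.ones(win) / win, mode="same")
    baseline = envelope[:fs].mean()
    return (envelope > k * baseline).astype(np.uint8)
</pre>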
<tr> <td>UB01.10</td> <td><b>JOINTER: JOINING FLEXIBLE MONITORS WITH HETEROGENEOUS ARCHITECTURES</b><br /> <b>Authors</b>:<br /> Giacomo Valente<sup>1</sup>, Tiziana Fanni<sup>2</sup>, Carlo Sau<sup>3</sup>, Claudio Rubattu<sup>2</sup>, Francesca Palumbo<sup>2</sup> and Luigi Pomante<sup>1</sup><br /> <sup>1</sup>Università degli Studi dell'Aquila, IT; <sup>2</sup>Università degli Studi di Sassari, IT; <sup>3</sup>Università degli Studi di Cagliari, IT<br /> <em><b>Abstract</b><br /> As embedded systems grow more complex and shift toward heterogeneous architectures, understanding workload performance characteristics becomes increasingly difficult. In this regard, run-time monitoring systems can help obtain the visibility needed to characterize a system. This demo presents a framework for developing complex heterogeneous architectures, composed of programmable processors and dedicated accelerators on FPGA, together with customizable monitoring systems, while keeping the introduced overhead under control. The whole development flow (with its prototype EDA tools) will be shown: it starts with the creation of the accelerators from a dataflow model, proceeds in parallel with the customization of the monitoring system from a library of elements, and ends with the joining of the two. Moreover, a comparison of different monitoring-system functionalities across different architectures implemented on a Zynq-7000 SoC will be illustrated.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3124.pdf">More information ...</a></b></em></td> </tr> <tr> <td>12:30</td> <td>End of session</td> </tr> </tbody> </table> <hr />
<h2 id="2.1">2.1 Executive Session: Memories for Emerging Applications</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 11:30 - 13:00<br /> <b>Location / Room:</b> Amphithéâtre Jean Prouve</p> <p><b>Chair:</b><br /> Pierre-Emmanuel Gaillardon, University of Utah, US</p> <p><b>Co-Chair:</b><br /> Shahar Kvatinsky, Technion, IL</p> <p>Memories play a prime role in virtually every modern computing system. While memory technology has been able to follow the aggressive trend of scaling and keep up with the most stringent demands, there exist new applications for which traditional memories struggle to deliver viable solutions. In this context, more than ever, novel memory technologies are required. Identifying a close match between a killer application and a supporting emerging memory technology will ensure unprecedented capabilities and open durable new horizons for computing systems.
In this executive session, we explore specific cases where novel memories (OxRAM and SOT MRAM in particular) are opening up novel applications unachievable with standard memories.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>11:30</td> <td>2.1.1</td> <td><b>RESISTIVE RAM AND ITS DENSE 3D INTEGRATION FOR THE <i>N3XT 1,000X</i></b><br /> <b>Author</b>:<br /> Subhasish Mitra, Stanford University, US</td> </tr> <tr> <td>12:00</td> <td>2.1.2</td> <td><b>EMERGING MEMORIES FOR VON NEUMANN AND FOR NEUROMORPHIC COMPUTING</b><br /> <b>Author</b>:<br /> Jamil Kawa, Synopsys, US</td> </tr> <tr> <td>12:30</td> <td>2.1.3</td> <td><b>RERAM TECHNOLOGY FOR NEXT GENERATION AI AND COST-EFFECTIVE EMBEDDED MEMORY</b><br /> <b>Author</b>:<br /> Amir Regev, Weebit Nano, AU</td> </tr> <tr> <td>13:00</td> <td> </td> <td>End of session</td> </tr> </tbody> </table> <hr />
<h2 id="2.2">2.2 Hardware-assisted Secure Systems</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 11:30 - 13:00<br /> <b>Location / Room:</b> Chamrousse</p> <p><b>Chair:</b><br /> Prabhat Mishra, University of Florida, US</p> <p><b>Co-Chair:</b><br /> Elif Bilge Kavun, University of Sheffield, GB</p> <p>This session covers state-of-the-art hardware-assisted techniques for secure systems, such as random number generators, PUFs, and logic locking &amp; obfuscation. In addition, novel detection methods for hardware Trojans are presented.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>11:30</td> <td>2.2.1</td> <td><b>BACKTRACKING SEARCH FOR OPTIMAL PARAMETERS OF A PLL-BASED TRUE RANDOM NUMBER GENERATOR</b><br /> <b>Speaker</b>:<br /> Brice Colombier, Université de Lyon, FR<br /> <b>Authors</b>:<br /> Brice Colombier<sup>1</sup>, Nathalie Bochard<sup>1</sup>, Florent Bernard<sup>2</sup> and Lilian Bossuet<sup>1</sup><br /> <sup>1</sup>Université de Lyon, FR; <sup>2</sup>Laboratory Hubert Curien, University of Lyon, UJM Saint-Etienne, FR<br /> <em><b>Abstract</b><br /> The phase-locked loop-based true random number generator (PLL-TRNG) extracts randomness from clock jitter. It is an interesting construct because it comes with a stochastic model, making it certifiable by certification bodies. However, bringing it to good performance is difficult, since it comes with multiple parameters to tune. This article proposes to use backtracking to determine these parameters. Compared to existing methods, based on genetic algorithms or on exhaustive search of a feasible set of parameters, backtracking has several advantages. Since the method is expressible as constraint programming, it provides very good readability: constraints can be specified in a straightforward and maintainable way. It also exhibits good performance and generates PLL-TRNG configurations rapidly. Finally, it makes it easy to integrate new exploratory design constraints for the PLL-TRNG. We provide experimental results with a PLL-TRNG implemented on three FPGA families with different physical constraints, showing that the method finds good parameters for every one of them. Moreover, we were able to obtain configurations that lead to an average increase of 59% in throughput and 82% in jitter sensitivity, thereby generating random numbers of higher quality at a faster rate. This approach also paves the way for new design exploration strategies for the PLL-TRNG. The source code of our implementation is open source and available online for reproducibility and reuse.</em></td> </tr>
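<p>The search strategy of 2.2.1 can be sketched as a depth-first enumeration of PLL parameters that prunes a whole subtree as soon as a partial assignment violates a constraint. The parameter ranges and constraints below are placeholders for illustration; the actual method also scores jitter sensitivity using the stochastic model.</p>
<pre>
from math import gcd

F_REF = 100e6                              # reference clock (placeholder)
VCO_MIN, VCO_MAX = 600e6, 1600e6           # VCO window (placeholder)

def search():
    solutions = []
    for m in range(2, 64):                 # choose the multiplier first
        f_vco = F_REF * m
        if not VCO_MIN <= f_vco <= VCO_MAX:
            continue                       # prune: no divider can fix this
        for d in range(2, 64):             # then extend with a divider
            if gcd(m, d) != 1:
                continue                   # coprime M/D gives distinct phases
            if f_vco / d <= 400e6:         # output frequency constraint
                solutions.append((m, d, f_vco / d))
    return solutions

print(len(search()), search()[:3])
</pre>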
<tr> <td>12:00</td> <td>2.2.2</td> <td><b>LONG-TERM CONTINUOUS ASSESSMENT OF SRAM PUF AND SOURCE OF RANDOM NUMBERS</b><br /> <b>Speaker</b>:<br /> Rui Wang, Intrinsic-ID, NL<br /> <b>Authors</b>:<br /> Rui Wang, Georgios Selimis, Roel Maes and Sven Goossens, Intrinsic-ID, NL<br /> <em><b>Abstract</b><br /> The qualities of Physical Unclonable Functions (PUFs) suffer noticeable degradation due to silicon aging. In this paper, we investigate the long-term effects of silicon aging on PUFs derived from the start-up behavior of Static Random Access Memories (SRAM). Previous research on SRAM aging is based on transistor-level simulation or on accelerated aging tests at high temperature and voltage that observe aging effects within a short period of time. In contrast, we have run a long-term continuous power-up test on 16 Arduino Leonardo boards under nominal conditions for two years. In total, we collected around 175 million measurements for reliability, uniqueness and randomness evaluations. Analysis shows that, after two years of aging, the number of bits that flip with respect to the reference increased by 19.3%, while the min-entropy of the SRAM PUF noise improved by 19.3% on average. The impact of aging on reliability is smaller under nominal conditions than previously assessed by accelerated aging tests. The test we conduct in this work more closely resembles the conditions of a device in the field, and therefore we more accurately evaluate how silicon aging affects SRAM PUFs.</em></td> </tr>
<tr> <td>12:15</td> <td>2.2.3</td> <td><b>RESCUING LOGIC ENCRYPTION IN POST-SAT ERA BY LOCKING &amp; OBFUSCATION</b><br /> <b>Speaker</b>:<br /> Hai Zhou, Northwestern University, US<br /> <b>Authors</b>:<br /> Amin Rezaei, Yuanqi Shen and Hai Zhou, Northwestern University, US<br /> <em><b>Abstract</b><br /> The active participation of external entities in the manufacturing flow has produced numerous hardware security issues, of which piracy and overproduction are likely the most ubiquitous and expensive. The main approach to prevent unauthorized products from functioning is logic encryption, which inserts key-controlled gates into the original circuit such that the circuit behaves correctly only when the correct key is applied. The challenge for the security designer is to ensure that neither the correct key nor the original circuit can be revealed by different analyses of the encrypted circuit. However, in state-of-the-art logic encryption works, a great deal of performance is sacrificed to guarantee security against powerful logic and structural attacks. This contradicts the primary purpose of logic encryption, which is to protect a precious design from being pirated and overproduced. In this paper, we propose a bilateral logic encryption platform that maintains a high degree of security with small circuit modifications. The robustness against exact and approximate attacks is also demonstrated.</em></td> </tr>
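<p>The lock mechanism that 2.2.3 builds on can be illustrated with generic XOR key gates at the Boolean-function level; this sketches only the basic lock-and-key idea of logic encryption, not the bilateral platform proposed in the paper.</p>
<pre>
import random
from itertools import product

def make_locked(f, n_inputs, n_keys):
    # Insert XOR key gates on randomly chosen inputs; each gate flips
    # its net unless the applied key bit matches the correct one.
    positions = random.sample(range(n_inputs), n_keys)
    key = tuple(random.randint(0, 1) for _ in range(n_keys))

    def locked(inputs, key_bits):
        x = list(inputs)
        for pos, k_ok, k_app in zip(positions, key, key_bits):
            x[pos] ^= k_ok ^ k_app    # transparent only if k_app == k_ok
        return f(tuple(x))

    return locked, key

# Example: lock a 3-input majority function with 2 key bits
maj = lambda x: int(x[0] + x[1] + x[2] >= 2)
locked, key = make_locked(maj, 3, 2)
assert all(locked(v, key) == maj(v) for v in product((0, 1), repeat=3))
</pre>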
<tr> <td>12:30</td> <td>2.2.4</td> <td><b>SELECTIVE CONCOLIC TESTING FOR HARDWARE TROJAN DETECTION IN BEHAVIORAL SYSTEMC DESIGNS</b><br /> <b>Speaker</b>:<br /> Bin Lin, Portland State University, US<br /> <b>Authors</b>:<br /> Bin Lin<sup>1</sup>, Jinchao Chen<sup>2</sup> and Fei Xie<sup>1</sup><br /> <sup>1</sup>Portland State University, US; <sup>2</sup>Northwestern Polytechnical University, CN<br /> <em><b>Abstract</b><br /> With the growing complexity of modern SoC designs and increasingly shortened time-to-market requirements, new design paradigms such as outsourced design services have emerged. The design abstraction level has also been raised from RTL to ESL. Modern SoC designs in ESL often integrate a variety of third-party behavioral intellectual properties and make intensive use of EDA tools to improve design productivity. However, this new design trend makes modern SoCs more vulnerable to hardware Trojan attacks. Although hardware Trojan detection has been studied for more than a decade at RTL and lower levels, it has only recently gained attention for ESL designs. In this paper, we present a novel approach for generating test cases by selective concolic testing to detect hardware Trojans in ESL. We have evaluated our approach on an open-source benchmark that includes various types of hardware Trojans. The experimental results demonstrate that our approach is able to detect hardware Trojans effectively and efficiently.</em></td> </tr>
<tr> <td>12:45</td> <td>2.2.5</td> <td><b>TEST PATTERN SUPERPOSITION TO DETECT HARDWARE TROJANS</b><br /> <b>Speaker</b>:<br /> Alex Orailoglu, University of California, San Diego, US<br /> <b>Authors</b>:<br /> Chris Nigh and Alex Orailoglu, University of California, San Diego, US<br /> <em><b>Abstract</b><br /> Current methods for the detection of hardware Trojans inserted by an untrusted foundry are either accompanied by unreasonable costs in design/test pattern overhead, or return results that fail to provide confident trustability. The challenges faced by these side-channel techniques are primarily a result of process variation, which renders pre-silicon expectations nearly meaningless in predicting the behavior of a manufactured IC. To overcome this hindrance in a cost-effective manner, we propose an easy-to-implement test pattern-based approach that is self-referential in nature, capable of dissecting and understanding the characteristics of a given manufactured IC to home in on aberrant measurements that are demonstrative of malicious Trojan hardware. By leveraging the superposition principle to cancel out non-Trojan noise, we can isolate and magnify Trojan circuit effects, all within a regime considerate of practical test and design-for-test infrastructures. Experimental results performed on Trust-Hub benchmarks demonstrate that the proposed method provides a clear and significant boost in our ability to confidently certify manufactured ICs over similar state-of-the-art techniques.</em></td> </tr>
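<p>A schematic sketch of the cancellation idea in 2.2.5: if the side-channel response is close to linear in the applied stimuli, then on a Trojan-free die the response to a combined pattern should equal the sum of the individual responses, irrespective of the process variation common to all three measurements. The <code>measure()</code> function below is hypothetical (e.g., it could return supply-current samples for a test pattern).</p>
<pre>
import numpy as np

def superposition_residual(measure, pattern_a, pattern_b, pattern_ab):
    # Self-referential check: non-Trojan contributions cancel in the
    # difference, so a large residual points at extra (Trojan) circuitry.
    r = measure(pattern_ab) - (measure(pattern_a) + measure(pattern_b))
    return np.linalg.norm(r)

# A die is flagged when the residual exceeds a threshold calibrated on
# patterns that exercise only well-understood parts of the design.
</pre>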
<tr> <td style="width:40px;">13:00</td> <td><a href="#IP1">IP1-1</a>, 280</td> <td><b>DYNUNLOCK: UNLOCKING SCAN CHAINS OBFUSCATED USING DYNAMIC KEYS</b><br /> <b>Speaker</b>:<br /> Nimisha Limaye, New York University, US<br /> <b>Authors</b>:<br /> Nimisha Limaye<sup>1</sup> and Ozgur Sinanoglu<sup>2</sup><br /> <sup>1</sup>New York University, US; <sup>2</sup>New York University Abu Dhabi, AE<br /> <em><b>Abstract</b><br /> Outsourcing in the semiconductor industry has opened up avenues for faster and more cost-effective chip manufacturing. However, it has also introduced untrusted entities with malicious intent to steal intellectual property (IP), overproduce circuits, insert hardware Trojans, or counterfeit chips. Recently, a defense was proposed that obfuscates scan access based on a dynamic key, initially generated from a secret key but changing in every clock cycle. This defense can be considered the most rigorous of all the scan locking techniques. In this paper, we propose an attack that remodels this defense into one that can be broken by the SAT attack; we also note that our attack can be adjusted to break other, less rigorous scan locking techniques (where the key is updated less frequently) as well.</em></td> </tr> <tr> <td>13:00</td> <td> </td> <td>End of session</td> </tr> </tbody> </table> <hr />
<h2 id="2.3">2.3 Fueling the future of computing: 3D, TFT, or disruptive memories?</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 11:30 - 13:00<br /> <b>Location / Room:</b> Autrans</p> <p><b>Chair:</b><br /> Yvain Thonnart, CEA-Leti, FR</p> <p><b>Co-Chair:</b><br /> Marco Vacca, Politecnico di Torino, IT</p> <p>In the post-CMOS era, the future of computing relies more and more on emerging technologies, such as resistive memories, TFTs and 3D integration or their combination, to continue performance improvements: from a novel acceleration solution for deep neural networks based on ferroelectric transistor technology, to a physical design methodology for face-to-face 3D ICs that enables commercial-quality IC layouts. Furthermore, the monolithic 3D advantage obtained by combining TFT and RRAM technologies is quantified using a novel open-source CAD flow.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>11:30</td> <td>2.3.1</td> <td><b>TERNARY COMPUTE-ENABLED MEMORY USING FERROELECTRIC TRANSISTORS FOR ACCELERATING DEEP NEURAL NETWORKS</b><br /> <b>Speaker</b>:<br /> Sandeep Krishna Thirumala, Purdue University, US<br /> <b>Authors</b>:<br /> Sandeep Krishna Thirumala, Shubham Jain, Sumeet Gupta and Anand Raghunathan, Purdue University, US<br /> <em><b>Abstract</b><br /> Ternary Deep Neural Networks (DNNs), which employ ternary precision for weights and activations, have recently been shown to attain accuracies close to those of full-precision DNNs, raising interest in their efficient hardware realization. In this work we propose a Non-Volatile Ternary Compute-Enabled memory cell (TeC-Cell) based on ferroelectric transistors (FEFETs) for in-memory computing in the signed ternary regime.
In particular, the proposed cell enables storage of ternary weights and employs multi-word-line assertion to perform massively parallel signed dot-product computations between ternary weights and ternary inputs. We evaluate the proposed design at the array level and show 72% and 74% higher energy efficiency for multiply-and-accumulate (MAC) operations compared to standard near-memory computing designs based on SRAM and FEFET, respectively. Furthermore, we evaluate the proposed TeC-Cell in an existing ternary in-memory DNN accelerator. Our results show a 3.3x-3.4x reduction in system energy and a 4.3x-7x improvement in system performance over SRAM- and FEFET-based near-memory accelerators, across a wide range of DNN benchmarks including both deep convolutional and recurrent neural networks.</em></td> </tr>
<tr> <td>12:00</td> <td>2.3.2</td> <td><b>MACRO-3D: A PHYSICAL DESIGN METHODOLOGY FOR FACE-TO-FACE-STACKED HETEROGENEOUS 3D ICS</b><br /> <b>Speaker</b>:<br /> Lennart Bamberg, University of Bremen, DE / GrAi Matter Labs, NL<br /> <b>Authors</b>:<br /> Lennart Bamberg<sup>1</sup>, Lingjun Zhu<sup>2</sup>, Sai Pentapati<sup>2</sup>, Da Eun Shim<sup>2</sup>, Alberto Garcia-Ortiz<sup>3</sup> and Sung Kyu Lim<sup>2</sup><br /> <sup>1</sup>GrAi Matter Labs, NL; <sup>2</sup>Georgia Tech, US; <sup>3</sup>University of Bremen, DE<br /> <em><b>Abstract</b><br /> Memory-on-logic and sensor-on-logic face-to-face stacking are emerging design approaches that promise a significant increase in the performance of modern systems-on-chip at reasonable cost. In this work, a netlist-to-layout design flow for such heterogeneous 3D systems is proposed, which overcomes the severe limitations of existing 3D physical design methodologies. A RISC-V-based multi-core system, implemented in a commercial technology, is used as a case study to evaluate the proposed design flow. The case study is performed for modern/large as well as small cache sizes to show the suitability of the proposed methodology for a broad set of systems. While previous 3D design flows fail to improve performance over 2D baseline designs for processor systems with a significant memory area occupation, the proposed flow shows performance and power improvements of 20.4-28.2% and 3.2-3.8%, respectively.</em></td> </tr>
<tr> <td>12:30</td> <td>2.3.3</td> <td><b>QUANTIFYING THE BENEFITS OF MONOLITHIC 3D COMPUTING SYSTEMS ENABLED BY TFT AND RRAM</b><br /> <b>Speaker</b>:<br /> Abdallah Felfel, Zewail City of Science and Technology, EG<br /> <b>Authors</b>:<br /> Abdallah M Felfel<sup>1</sup>, Kamalika Datta<sup>1</sup>, Arko Dutt<sup>1</sup>, Hasita Veluri<sup>2</sup>, Ahmed Zaky<sup>1</sup>, Aaron Thean<sup>2</sup> and Mohamed M Sabry Aly<sup>1</sup><br /> <sup>1</sup>Nanyang Technological University, SG; <sup>2</sup>National University of Singapore, SG<br /> <em><b>Abstract</b><br /> Current data-centric workloads, such as deep learning, expose the memory-access inefficiencies of current computing systems. Monolithic 3D integration can overcome this limitation by leveraging fine-grained and dense vertical connectivity to enable massively-concurrent accesses between compute and memory units. Thin-Film Transistors (TFTs) and Resistive RAM (RRAM) naturally enable monolithic 3D integration, as they are fabricated at low temperature (a crucial requirement). In this paper, we explore ZnO-based TFTs and HfO2-based RRAM to build a 1TFT-1R memory subsystem in the upper tiers.
The TFT-based memory subsystem is stacked on top of a Si-FET bottom tier that can include compute units and SRAM. System-level simulations for various deep learning workloads show that our TFT-based monolithic 3D system achieves up to 11.4x system-level energy-delay product benefits compared to a 2D baseline with off-chip DRAM, 5.8x benefits over interposer-based 2.5D integration, and 1.25x over 3D stacking of RRAM on silicon using through-silicon vias. These gains are achieved despite the lower density of TFT-based RRAM and its higher energy consumption versus 3D stacking with RRAM, which are due to inherent TFT limitations.</em></td> </tr>
<tr> <td>12:45</td> <td>2.3.4</td> <td><b>ORGANIC-FLOW: AN OPEN-SOURCE ORGANIC STANDARD CELL LIBRARY AND PROCESS DEVELOPMENT KIT</b><br /> <b>Speaker</b>:<br /> Ting-Jung Chang, Princeton University, US<br /> <b>Authors</b>:<br /> Ting-Jung Chang, Zhuozhi Yao, Barry P. Rand and David Wentzlaff, Princeton University, US<br /> <em><b>Abstract</b><br /> Organic thin-film transistors (OTFTs) are drawing increasing attention due to their unique advantages of mechanical flexibility, low-cost fabrication, and biodegradability, enabling diverse applications that were not achievable using traditional inorganic transistors. With a growing number of complex applications being proposed, the need for expediting the design process and ensuring the yield of large-scale designs with organic technology increases. A complete digital standard cell library plays a crucial role in integrating the emerging organic technology into existing computer-aided design (CAD) flows. In this paper, we present the design, fabrication, and characterization of a standard cell library based on bottom-gate, top-contact pentacene OTFTs. We also propose a commercial-tool-compatible RTL-to-GDS flow along with a new organic process design kit (PDK) developed for our process. To the best of our knowledge, this is the first open-source organic standard cell library, enabling the community to explore this emerging technology.</em></td> </tr>
<tr> <td style="width:40px;">13:00</td> <td><a href="#IP1">IP1-2</a>, 130</td> <td><b>CMOS IMPLEMENTATION OF SWITCHING LATTICES</b><br /> <b>Speaker</b>:<br /> Levent Aksoy, Istanbul TU, TR<br /> <b>Authors</b>:<br /> Ismail Cevik, Levent Aksoy and Mustafa Altun, Istanbul TU, TR<br /> <em><b>Abstract</b><br /> Switching lattices consisting of four-terminal switches have been introduced as area-efficient structures to realize logic functions. Many optimization algorithms, both exact and heuristic, have been proposed to realize logic functions on lattices with the fewest four-terminal switches. Hence, the computing potential of switching lattices has been justified adequately in the literature; the same cannot yet be said of their physical implementation. There have been conceptual ideas for the technology development of switching lattices, but no concrete and directly applicable technology has been proposed so far. In this study, we show that switching lattices can be directly and efficiently implemented using a standard CMOS process. To realize a given logic function on a switching lattice, we propose static and dynamic logic solutions. The proposed circuits, as well as the conventional ones they are compared against, are designed and simulated in the Cadence environment using a TSMC 65nm CMOS process. Experimental post-layout results on logic functions show that switching lattices occupy a much smaller area than conventional CMOS implementations, while offering competitive delay and power consumption.</em></td> </tr>
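<p>The lattice model behind IP1-2 is compact enough to state in a few lines: under a given input assignment, a four-terminal switching lattice evaluates to 1 exactly when the switches that are ON connect the top edge to the bottom edge. A small evaluator sketch in Python, assuming this standard top-to-bottom connectivity semantics (complemented literals would simply be extra entries in the assignment):</p>
<pre>
from collections import deque

def lattice_eval(lattice, assignment):
    # Breadth-first search over ON sites, using 4-neighbour adjacency.
    rows, cols = len(lattice), len(lattice[0])
    on = [[assignment[lit] for lit in row] for row in lattice]
    frontier = deque((0, c) for c in range(cols) if on[0][c])
    seen = set(frontier)
    while frontier:
        r, c = frontier.popleft()
        if r == rows - 1:
            return 1                      # reached the bottom edge
        for nr, nc in ((r+1, c), (r-1, c), (r, c+1), (r, c-1)):
            if 0 <= nr < rows and 0 <= nc < cols and on[nr][nc] \
               and (nr, nc) not in seen:
                seen.add((nr, nc))
                frontier.append((nr, nc))
    return 0

# 2x2 lattice realizing f = (a AND c) OR (b AND d) via top-to-bottom paths
lat = [["a", "b"],
       ["c", "d"]]
print(lattice_eval(lat, {"a": 1, "b": 0, "c": 1, "d": 0}))   # 1, via a-c
</pre>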
<tr> <td style="width:40px;">13:01</td> <td><a href="#IP1">IP1-3</a>, 327</td> <td><b>A TIMING UNCERTAINTY-AWARE CLOCK TREE TOPOLOGY GENERATION ALGORITHM FOR SINGLE FLUX QUANTUM CIRCUITS</b><br /> <b>Speaker</b>:<br /> Massoud Pedram, University of Southern California, US<br /> <b>Authors</b>:<br /> Soheil Nazar Shahsavani, Bo Zhang and Massoud Pedram, University of Southern California, US<br /> <em><b>Abstract</b><br /> This paper presents a low-cost, timing uncertainty-aware synchronous clock tree topology generation algorithm for single flux quantum (SFQ) logic circuits. The proposed method considers the criticality of the data paths in terms of timing slacks as well as the total wirelength of the clock tree, and generates a (height-)balanced binary clock tree using a bottom-up approach and an integer linear programming (ILP) formulation. Statistical timing analysis results for ten benchmark circuits show that the proposed method improves the total wirelength and the total negative hold slack by 4.2% and 64.6% on average, respectively, compared with a wirelength-driven state-of-the-art balanced topology generation approach.</em></td> </tr> <tr> <td>13:00</td> <td> </td> <td>End of session</td> </tr> </tbody> </table> <hr />
<h2 id="2.4">2.4 Challenges in Analog Design Automation &amp; Security</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 11:30 - 13:00<br /> <b>Location / Room:</b> Stendhal</p> <p><b>Chair:</b><br /> Manuel Barragan, TIMA, FR</p> <p><b>Co-Chair:</b><br /> Haralampos Stratigopoulos, LIP6, FR</p> <p>Producing reliable and secure analog circuits is a challenging task. This session addresses novel and systematic approaches to analog security, based on key sequencing, and to analog design, from automatic netlist annotation to Bayesian model-based optimization.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>11:30</td> <td>2.4.1</td> <td><b>GANA: GRAPH CONVOLUTIONAL NETWORK BASED AUTOMATED NETLIST ANNOTATION FOR ANALOG CIRCUITS</b><br /> <b>Speaker</b>:<br /> Kishor Kunal, University of Minnesota, US<br /> <b>Authors</b>:<br /> Kishor Kunal<sup>1</sup>, Tonmoy Dhar<sup>2</sup>, Meghna Madhusudan<sup>2</sup>, Jitesh Poojary<sup>1</sup>, Arvind Sharma<sup>1</sup>, Wenbin Xu<sup>3</sup>, Steven Burns<sup>4</sup>, Jiang Hu<sup>3</sup>, Ramesh Harjani<sup>1</sup> and Sachin S. Sapatnekar<sup>1</sup><br /> <sup>1</sup>University of Minnesota, US; <sup>2</sup>University of Minnesota Twin Cities, US; <sup>3</sup>Texas A&amp;M University, US; <sup>4</sup>Intel Corporation, US<br /> <em><b>Abstract</b><br /> Automated subcircuit identification enables the creation of hierarchical representations of analog netlists and can facilitate a variety of design automation tasks such as circuit layout and optimization. Subcircuit identification must be capable of navigating the numerous alternative structures that can implement any analog function, but traditional graph-based methods have been limited by the large number of such structural variants.
The novel approach in this paper is based on the use of a trained graph convolutional neural network (GCN) that identifies netlist elements for circuit blocks at upper levels of the design hierarchy. Structures at lower levels of the hierarchy are identified using graph-based algorithms. The proposed recognition scheme organically detects layout constraints, such as symmetry and matching, whose identification is essential for high-quality hierarchical layout. The subcircuit identification method demonstrates a high degree of accuracy over a wide range of analog designs, successfully identifies larger circuits that contain sub-blocks such as OTAs, LNAs, mixers, oscillators, and band-pass filters, and provides hierarchical decompositions of such circuits.</em></td> </tr>
<tr> <td>12:00</td> <td>2.4.2</td> <td><b>SECURING PROGRAMMABLE ANALOG ICS AGAINST PIRACY</b><br /> <b>Speaker</b>:<br /> Mohamed Elshamy, Sorbonne Université, CNRS, LIP6, FR<br /> <b>Authors</b>:<br /> Mohamed Elshamy, Alhassan Sayed, Marie-Minerve Louerat, Amine Rhouni, Hassan Aboushady and Haralampos-G. Stratigopoulos, Sorbonne Université, CNRS, LIP6, FR<br /> <em><b>Abstract</b><br /> In this paper, we demonstrate a security approach for the class of highly-programmable analog Integrated Circuits (ICs) that can be used as a countermeasure against unauthorized chip use and piracy. The approach relies on functionality locking, i.e., a lock mechanism is introduced into the design such that the functionality breaks unless the correct key is provided. We show that for highly-programmable analog ICs the programmable fabric can naturally be used as the lock mechanism. We demonstrate the approach on a multi-standard RF receiver with configuration settings of 64-bit words.</em></td> </tr>
<tr> <td>12:30</td> <td>2.4.3</td> <td><b>AN EFFICIENT BAYESIAN OPTIMIZATION APPROACH FOR ANALOG CIRCUIT SYNTHESIS VIA SPARSE GAUSSIAN PROCESS MODELING</b><br /> <b>Speaker</b>:<br /> Biao He, Fudan University, CN<br /> <b>Authors</b>:<br /> Biao He<sup>1</sup>, Shuhan Zhang<sup>1</sup>, Fan Yang<sup>1</sup>, Changhao Yan<sup>1</sup>, Dian Zhou<sup>2</sup> and Xuan Zeng<sup>1</sup><br /> <sup>1</sup>Fudan University, CN; <sup>2</sup>University of Texas at Dallas, US<br /> <em><b>Abstract</b><br /> Bayesian optimization with Gaussian process models has been proposed for analog synthesis, since it is efficient for the optimization of expensive black-box functions. However, the computational costs for training and prediction with Gaussian process models are O(N^3) and O(N^2), respectively, where N is the number of data points, so the overhead of Gaussian process modeling becomes non-negligible as N grows. Recently, a Bayesian optimization approach using neural networks was proposed to address this problem. It reduces the computational costs of training and prediction to O(N) and O(1), respectively. However, it reduces the infinite-dimensional kernel of the traditional Gaussian process to a finite-dimensional kernel via a neural network mapping, which can weaken the characterization ability of the Gaussian process. In this paper, we propose a novel Bayesian optimization approach using a Sparse Pseudo-input Gaussian Process (SPGP). The idea is to select M so-called inducing points out of the N data points and use the kernel function of the M inducing points to approximate the kernel function of the N data points. The proposed approach also reduces the computational costs of training and prediction to O(N) and O(1), respectively, yet its kernel remains infinite-dimensional, so it provides a characterization ability similar to that of the traditional Gaussian process. Several experiments demonstrate the efficiency of the proposed approach.</em></td> </tr>
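<p>The inducing-point construction at the heart of 2.4.3 can be sketched numerically: M inducing points give a rank-M surrogate of the full N x N kernel matrix, which is what cuts the training cost from O(N^3) to roughly O(N M^2). In this sketch the inducing points are a random subset of the data; SPGP actually optimizes their locations as pseudo-inputs.</p>
<pre>
import numpy as np

def rbf(X, Y, ls=1.0):
    # Squared-exponential kernel matrix between two point sets
    d2 = ((X[:, None, :] - Y[None, :, :])**2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(0)
N, M = 500, 20
X = rng.random((N, 2))
Z = X[rng.choice(N, M, replace=False)]        # inducing points

Knm = rbf(X, Z)
Kmm = rbf(Z, Z) + 1e-8 * np.eye(M)            # jitter for stability
K_approx = Knm @ np.linalg.solve(Kmm, Knm.T)  # rank-M surrogate of K_NN

K_full = rbf(X, X)
err = np.linalg.norm(K_full - K_approx) / np.linalg.norm(K_full)
print(f"relative kernel approximation error: {err:.3f}")
</pre>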
<tr> <td style="width:40px;">13:00</td> <td><a href="#IP1">IP1-4</a>, 307</td> <td><b>SYMMETRY-BASED A/M-S BIST (SYMBIST): DEMONSTRATION ON A SAR ADC IP</b><br /> <b>Speaker</b>:<br /> Antonios Pavlidis, Sorbonne Université, CNRS, LIP6, FR<br /> <b>Authors</b>:<br /> Antonios Pavlidis<sup>1</sup>, Marie-Minerve Louerat<sup>1</sup>, Eric Faehn<sup>2</sup>, Anand Kumar<sup>3</sup> and Haralampos-G. Stratigopoulos<sup>1</sup><br /> <sup>1</sup>Sorbonne Université, CNRS, LIP6, FR; <sup>2</sup>STMicroelectronics, FR; <sup>3</sup>STMicroelectronics, IN<br /> <em><b>Abstract</b><br /> In this paper, we propose a defect-oriented Built-In Self-Test (BIST) paradigm for analog and mixed-signal (A/M-S) Integrated Circuits (ICs), called symmetry-based BIST (SymBIST). SymBIST exploits inherent symmetries in the design to generate invariances that should hold true only in defect-free operation; violation of any of these invariances indicates a defect. We demonstrate SymBIST on a 65nm 10-bit Successive Approximation Register (SAR) Analog-to-Digital Converter (ADC) IP by STMicroelectronics. SymBIST does not result in any performance penalty; it incurs an area overhead of less than 5%; the test time equals about 16x the time to convert an analog input sample; it can be interfaced with a 2-pin digital access mechanism; and it covers the entire A/M-S part of the IP, achieving a likelihood-weighted defect coverage higher than 85%.</em></td> </tr>
<tr> <td style="width:40px;">13:01</td> <td><a href="#IP1">IP1-5</a>, 476</td> <td><b>RANGE CONTROLLED FLOATING-GATE TRANSISTORS: A UNIFIED SOLUTION FOR UNLOCKING AND CALIBRATING ANALOG ICS</b><br /> <b>Speaker</b>:<br /> Yiorgos Makris, University of Texas at Dallas, US<br /> <b>Authors</b>:<br /> Sai Govinda Rao Nimmalapudi, Georgios Volanis, Yichuan Lu, Angelos Antonopoulos, Andrew Marshall and Yiorgos Makris, University of Texas at Dallas, US<br /> <em><b>Abstract</b><br /> Analog Floating-Gate Transistors (AFGTs) are commonly used to fine-tune the performance of analog integrated circuits (ICs) after fabrication, thereby enabling high yield despite component mismatch and variability in semiconductor manufacturing. In this work, we propose a methodology that leverages such AFGTs to also prevent unauthorized use of analog ICs. Specifically, we introduce a locking mechanism that limits programming of AFGTs to a range which is inadequate for achieving the desired analog performance. Accordingly, our solution entails a two-step unlock-&amp;-calibrate process. In the first step, AFGTs must be programmed through a secret sequence of voltages within that range, called waypoints. Successfully following the waypoints unlocks the ability to program the AFGTs over their entire range. Thereby, in the second step, the typical AFGT-based post-silicon calibration process can be applied to adjust the performance of the IC within its specifications. Protection against brute-force or intelligent attacks attempting to guess the unlocking sequence is ensured by the vast space of possible waypoints in the continuous (analog) domain.
The feasibility and effectiveness of the proposed solution are demonstrated and evaluated on an Operational Transconductance Amplifier (OTA). To our knowledge, this is the first solution that leverages the power of analog keys and addresses both the unlocking and the calibration needs of analog ICs in a unified manner.</em></td> </tr>
<tr> <td style="width:40px;">13:02</td> <td><a href="#IP1">IP1-6</a>, 699</td> <td><b>TESTING THROUGH SILICON VIAS IN POWER DISTRIBUTION NETWORK OF 3D-IC WITH MANUFACTURING VARIABILITY CANCELLATION</b><br /> <b>Speaker</b>:<br /> Koutaro Hachiya, Teikyo Heisei University, JP<br /> <b>Authors</b>:<br /> Koutaro Hachiya<sup>1</sup> and Atsushi Kurokawa<sup>2</sup><br /> <sup>1</sup>Teikyo Heisei University, JP; <sup>2</sup>Hirosaki University, JP<br /> <em><b>Abstract</b><br /> To detect open defects of power TSVs (Through Silicon Vias) in PDNs (Power Distribution Networks) of stacked 3D-ICs, a method was previously proposed which measures resistances between power micro-bumps connected to the PDN and detects defective TSVs from changes in these resistances. It suffers from manufacturing variability and must place one micro-bump directly under each TSV (the direct-type placement style) to maximize its diagnostic performance, yet even then the performance was not sufficient for practical applications. A variability cancellation method was also devised to improve the diagnostic performance. In this paper, a novel middle-type placement style is proposed which places one micro-bump between each pair of TSVs. Experimental simulations on a 3D-IC example show that the diagnostic performances of both the direct-type and the middle-type placements are improved by the variability cancellation and reach a practical level. The middle-type placement outperforms the direct-type placement in terms of the number of micro-bumps and the number of measurements.</em></td> </tr> <tr> <td>13:00</td> <td> </td> <td>End of session</td> </tr> </tbody> </table> <hr />
<h2 id="2.5">2.5 Pruning Techniques for Embedded Neural Networks</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 11:30 - 13:00<br /> <b>Location / Room:</b> Bayard</p> <p><b>Chair:</b><br /> Marian Verhelst, KU Leuven, BE</p> <p><b>Co-Chair:</b><br /> Dirk Ziegenbein, Robert Bosch GmbH, DE</p> <p>Network pruning has been applied successfully to reduce the computational and memory footprint of neural network processing. This session presents three innovations to better exploit pruning in embedded processing architectures.
The solutions presented extend the sparsity concept to the bit level with an enhanced bit-level pruning technique based on CSD representations, introduce a novel group-level pruning technique demonstrating an improved trade-off between hardware execution cost and accuracy loss, and explore a sparsity-aware cache architecture that reduces cache miss rate and execution time.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>11:30</td> <td>2.5.1</td> <td><b>DEEPER WEIGHT PRUNING WITHOUT ACCURACY LOSS IN DEEP NEURAL NETWORKS</b><br /> <b>Speaker</b>:<br /> Byungmin Ahn, Seoul National University, KR<br /> <b>Authors</b>:<br /> Byungmin Ahn and Taewhan Kim, Seoul National University, KR<br /> <em><b>Abstract</b><br /> This work overcomes the inherent limitation of bit-level weight pruning, namely that the maximal computation speedup is bounded by the total number of non-zero bits of the weights, a bound invariably considered "uncontrollable" (i.e., constant) for the neural network to be pruned. Precisely, this work, based on the canonical signed digit (CSD) encoding, (1) proposes a transformation technique which converts the two's complement representation of every weight into a set of CSD representations with a minimal or near-minimal number of essential (i.e., non-zero) bits, (2) formulates the problem of selecting the CSD representations of weights that maximize the parallelism of bit-level multiplication on the weights as a multi-objective shortest path problem and solves it efficiently using an approximation algorithm, and (3) proposes a supporting novel acceleration architecture with no additional non-trivial hardware. Through experiments, it is shown that our proposed approach reduces the number of essential bits by 69% on AlexNet and 74% on VGG-16, by which our accelerator reduces the inference computation time by 47% on AlexNet and 50% on VGG-16 over conventional bit-level weight pruning.</em></td> </tr>
<tr> <td>12:00</td> <td>2.5.2</td> <td><b>FLEXIBLE GROUP-LEVEL PRUNING OF DEEP NEURAL NETWORKS FOR ON-DEVICE MACHINE LEARNING</b><br /> <b>Speaker</b>:<br /> Dongkun Shin, Sungkyunkwan University, KR<br /> <b>Authors</b>:<br /> Kwangbae Lee, Hoseung Kim, Hayun Lee and Dongkun Shin, Sungkyunkwan University, KR<br /> <em><b>Abstract</b><br /> Network pruning is a promising compression technique to reduce the computation and memory access costs of deep neural networks. Pruning techniques are classified into two types: fine-grained pruning and coarse-grained pruning. Fine-grained pruning eliminates individual connections if they are insignificant and thus usually generates irregular networks; therefore, it can fail to reduce inference time. Coarse-grained pruning, such as filter-level and channel-level techniques, can produce hardware-friendly networks, but it can suffer from low accuracy. In this paper, we focus on group-level pruning to accelerate deep neural networks on mobile GPUs, where several adjacent weights are pruned as a group to mitigate the irregularity of pruned networks while providing high accuracy. Although several group-level pruning techniques have been proposed, previous techniques select weight groups to be pruned at group-size-aligned locations to reduce the problem space. In this paper, we propose an unaligned approach to improve the accuracy of the compressed model. We find the optimal solution of the unaligned group selection problem with dynamic programming. Our technique also generates balanced sparse networks to achieve load balance across parallel computing units. Experiments demonstrate that 2D unaligned group-level pruning achieves a 3.12% lower error rate for the ResNet-20 network on CIFAR-10 compared to previous 2D aligned group-level pruning at 95% sparsity.</em></td> </tr>
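<p>For intuition about the group-level trade-off discussed in 2.5.2, a magnitude-based sketch follows: weights are partitioned into contiguous groups and the lowest-norm groups are zeroed, which preserves a regular structure that parallel hardware can exploit. This only illustrates the general idea; the paper's contribution, optimal unaligned group selection via dynamic programming, is more involved.</p>
<pre>
import numpy as np

def group_prune(weights, group=4, sparsity=0.95):
    # Partition the flattened weights into contiguous groups of `group`
    # elements and zero the groups with the smallest L2 norms.
    w = weights.reshape(-1, group)
    norms = np.linalg.norm(w, axis=1)
    k = int(sparsity * len(norms))        # number of groups to remove
    w[np.argsort(norms)[:k]] = 0.0
    return w.reshape(weights.shape)

w = np.random.randn(64, 64)
pruned = group_prune(w.copy())
print((pruned == 0).mean())               # about 0.95
</pre>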
We can find the optimal solution of the unaligned group selection problem with dynamic programming. Our technique also generates balanced sparse networks to achieve load balance across parallel computing units. Experiments demonstrate that 2D unaligned group-level pruning shows a 3.12% lower error rate for the ResNet-20 network on CIFAR-10 compared to the previous 2D aligned group-level pruning at 95% sparsity.</em></td> </tr> <tr> <td>12:30</td> <td>2.5.3</td> <td><b>SPARSITY-AWARE CACHES TO ACCELERATE DEEP NEURAL NETWORKS</b><br /> <b>Speaker</b>:<br /> Vinod Ganesan, IIT Madras, IN<br /> <b>Authors</b>:<br /> Vinod Ganesan<sup>1</sup>, Sanchari Sen<sup>2</sup>, Pratyush Kumar<sup>1</sup>, Neel Gala<sup>1</sup>, Kamakoti Veezhinatha<sup>1</sup> and Anand Raghunathan<sup>2</sup><br /> <sup>1</sup>IIT Madras, IN; <sup>2</sup>Purdue University, US<br /> <em><b>Abstract</b><br /> Deep Neural Networks (DNNs) have transformed the field of artificial intelligence and represent the state-of-the-art in many machine learning tasks. There is considerable interest in using DNNs to realize edge intelligence in highly resource-constrained devices such as wearables and IoT sensors. Unfortunately, the high computational requirements of DNNs pose a serious challenge to their deployment in these systems. Moreover, due to tight cost (and hence, area) constraints, these devices are often unable to accommodate hardware accelerators, requiring DNNs to execute on the General Purpose Processor (GPP) cores that they contain. We address this challenge through lightweight micro-architectural extensions to the memory hierarchy of GPPs that exploit a key attribute of DNNs, viz. sparsity, or the prevalence of zero values. We propose SparseCache, an enhanced cache architecture that utilizes a null cache based on a Ternary Content Addressable Memory (TCAM) to compactly store zero-valued cache lines, while storing non-zero lines in a conventional data cache. By storing addresses rather than values for zero-valued cache lines, SparseCache increases the effective cache capacity, thereby reducing the overall miss rate and execution time. SparseCache utilizes a Zero Detector and Approximator (ZDA) and an Address Merger (AM) to perform reads and writes to the null cache. We evaluate SparseCache on four state-of-the-art DNNs programmed with the Caffe framework. SparseCache achieves a 5-28% reduction in miss rate, which translates to a 5-21% reduction in execution time, with only 0.1% area and 3.8% power overhead in comparison to a low-end Intel Atom Z-series processor.</em></td> </tr> <tr> <td style="width:40px;">13:00</td> <td><a href="#IP1">IP1-7</a>, 429</td> <td><b>TFAPPROX: TOWARDS A FAST EMULATION OF DNN APPROXIMATE HARDWARE ACCELERATORS ON GPU</b><br /> <b>Speaker</b>:<br /> Zdenek Vasicek, Brno University of Technology, CZ<br /> <b>Authors</b>:<br /> Filip Vaverka, Vojtech Mrazek, Zdenek Vasicek and Lukas Sekanina, Brno University of Technology, CZ<br /> <em><b>Abstract</b><br /> The energy efficiency of hardware accelerators of deep neural networks (DNN) can be improved by introducing approximate arithmetic circuits. In order to quantify the error introduced by using these circuits and avoid expensive hardware prototyping, a software emulator of the DNN accelerator is usually executed on a CPU or GPU. However, this emulation is typically two or three orders of magnitude slower than a software DNN implementation running on a CPU or GPU and operating with standard floating-point arithmetic instructions and common DNN libraries.
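<p><em>A behavioral sketch of the SparseCache organization from 2.5.3 above, assuming a fixed line size: zero-valued lines are recorded by address only (standing in for the TCAM-based null cache), while non-zero lines go to a conventional data store. Class and method names are hypothetical, and the ZDA/AM hardware and capacity management are omitted:</em></p> <pre><code>LINE_SIZE = 64  # bytes per cache line (assumed)

class SparseCacheModel:
    """A 'null cache' of zero-line addresses plus a conventional data cache."""

    def __init__(self):
        self.null_cache = set()   # addresses of all-zero lines (TCAM stand-in)
        self.data_cache = {}      # address -> line bytes, non-zero lines only

    def write_line(self, addr, line):
        if all(b == 0 for b in line):       # Zero Detector role
            self.null_cache.add(addr)       # store only the address
            self.data_cache.pop(addr, None)
        else:
            self.data_cache[addr] = line
            self.null_cache.discard(addr)

    def read_line(self, addr):
        if addr in self.null_cache:         # null-cache hit: synthesize zeros
            return bytes(LINE_SIZE)
        return self.data_cache.get(addr)    # None models a miss (memory fill)

cache = SparseCacheModel()
cache.write_line(0x1000, bytes(LINE_SIZE))      # zero line: address-only storage
cache.write_line(0x1040, b"\x01" * LINE_SIZE)   # dense line: full storage
assert cache.read_line(0x1000) == bytes(LINE_SIZE)</code></pre>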
The reason is that there is no hardware support for approximate arithmetic operations on common CPUs and GPUs, so these operations have to be emulated at great expense. In order to address this issue, we propose an efficient emulation method for the approximate circuits utilized in a given DNN accelerator, which is emulated on a GPU. All relevant approximate circuits are implemented as look-up tables and accessed through the texture memory mechanism of CUDA-capable GPUs. We exploit the fact that the texture memory is optimized for irregular read-only access and in some GPU architectures is even implemented as a dedicated cache. This technique allowed us to reduce the inference time of the emulated DNN accelerator approximately 200 times with respect to an optimized CPU version on complex DNNs such as ResNet. The proposed approach extends the TensorFlow library and is available online at <a href="https://github.com/ehw-fit/tf-approximate">https://github.com/ehw-fit/tf-approximate</a></em></td> </tr> <tr> <td>13:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="2.6">2.6 Improving reliability and fault tolerance of advanced memories</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 11:30 - 13:00<br /> <b>Location / Room:</b> Lesdiguières</p> <p><b>Chair:</b><br /> Mounir Benabdenbi, TIMA Laboratory, FR</p> <p><b>Co-Chair:</b><br /> Said Hamdioui, TU Delft, NL</p> <p>This session discusses reliability issues for different memory technologies, addressing fault tolerance of memristors, how to reduce simulations with importance sampling, and advanced metrics to measure the reliability of NAND flash memories.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>11:30</td> <td>2.6.1</td> <td><b>ON IMPROVING FAULT TOLERANCE OF MEMRISTOR CROSSBAR BASED NEURAL NETWORK DESIGNS BY TARGET SPARSIFYING</b><br /> <b>Speaker</b>:<br /> Yu Wang, North China Electric Power University, CN<br /> <b>Authors</b>:<br /> Song Jin<sup>1</sup>, Songwei Pei<sup>2</sup> and Yu Wang<sup>1</sup><br /> <sup>1</sup>North China Electric Power University, CN; <sup>2</sup>School of Computer Science, Beijing University of Posts and Telecommunications, CN<br /> <em><b>Abstract</b><br /> The memristor-based crossbar (MBC) can execute neural network computations in an extremely energy-efficient manner. However, stuck-at faults prevent memristors from representing network weights correctly, significantly degrading the classification accuracy of the network deployed on the MBC. By carefully analyzing all the possible fault combinations in a pair of differential crossbars, we found that most stuck-at faults can be accommodated perfectly by mapping a zero-valued weight onto the affected memristors. Based on this observation, in this paper we propose a target-sparsifying fault-tolerant scheme for MBCs executing neural network applications. We first exploit a heuristic algorithm to map the weight matrix onto the MBC, aiming at minimizing weight variations in the presence of stuck-at faults. After that, weights mapped onto faulty memristors that still exhibit large variations are purposefully forced to zero. Network retraining is then performed to recover classification accuracy. For a 4-layer CNN designed for MNIST digit recognition, experimental results demonstrate that our scheme can achieve almost no accuracy loss when 10% of memristors in the MBC are faulty.
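<p><em>A toy model of the fault-accommodation rule in 2.6.1 above, under strong simplifying assumptions: each weight maps onto a differential memristor pair realizing w = g+ - g-, a stuck-at fault fixes one conductance, and weights whose realized value deviates too much are forced to an exact zero before retraining. All names and the tolerance are illustrative:</em></p> <pre><code>G_MIN, G_MAX = 0.0, 1.0   # normalized conductance range (assumed)

def realized_weight(w, fault_pos, fault_neg):
    """Closest weight a pair can realize; fault_* is None or a stuck conductance."""
    clamp = lambda g: min(max(g, G_MIN), G_MAX)
    if fault_pos is None and fault_neg is None:
        return w
    if fault_pos is None:               # g- stuck: choose g+ freely
        return clamp(fault_neg + w) - fault_neg
    if fault_neg is None:               # g+ stuck: choose g- freely
        return fault_pos - clamp(fault_pos - w)
    return fault_pos - fault_neg        # both stuck: the value is fixed

def map_with_target_sparsifying(weights, faults, tol=0.05):
    """Force poorly-realized weights to an exact zero, then rely on retraining."""
    out = []
    for i, w in enumerate(weights):
        fp, fn = faults.get(i, (None, None))
        r = realized_weight(w, fp, fn)
        if abs(r - w) > tol and realized_weight(0.0, fp, fn) == 0.0:
            r = 0.0                     # a zero weight accommodates the fault
        out.append(r)
    return out

# Pair 1 is stuck on both sides at equal conductances: its weight becomes zero.
print(map_with_target_sparsifying([0.4, -0.2], {1: (0.3, 0.3)}))  # [0.4, 0.0]</code></pre>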
As the fraction of faulty memristors increases to 20%, the accuracy loss remains within 3%.</em></td> </tr> <tr> <td>12:00</td> <td>2.6.2</td> <td><b>AN EFFICIENT YIELD ANALYSIS OF SRAM USING SCALED-SIGMA ADAPTIVE IMPORTANCE SAMPLING</b><br /> <b>Speaker</b>:<br /> Liang Pang, Southeast University, CN<br /> <b>Authors</b>:<br /> Liang Pang<sup>1</sup>, Mengyun Yao<sup>2</sup> and Yifan Chai<sup>1</sup><br /> <sup>1</sup>School of Electronic Science &amp; Engineering, Southeast University, CN; <sup>2</sup>School of Microelectronics, Southeast University, CN<br /> <em><b>Abstract</b><br /> Statistical SRAM yield analysis has become a growing concern due to SRAM's high integration density and reliability requirements. Estimating the SRAM failure probability efficiently is a challenge because circuit failure is a "rare event". Existing methods are still insufficient, especially in high dimensions under advanced process nodes. In this paper, we develop a scaled-sigma adaptive importance sampling (SSAIS) method, which is an extension of adaptive importance sampling. This method changes not only the location parameters but also the shape parameters by iteratively searching the failure region. Experiments on a 40nm SRAM cell validate that our method outperforms the Monte Carlo method by 1500x and is 2.3x~5.2x faster than state-of-the-art methods while retaining sufficient accuracy. Another experiment, on a sense amplifier, shows that our method achieves a 3968x speedup over the Monte Carlo method and a 2.1x~11x speedup over the other methods.</em></td> </tr> <tr> <td>12:30</td> <td>2.6.3</td> <td><b>FAST AND ACCURATE HIGH-SIGMA FAILURE RATE ESTIMATION THROUGH EXTENDED BAYESIAN OPTIMIZED IMPORTANCE SAMPLING</b><br /> <b>Speaker</b>:<br /> Michael Hefenbrock, Karlsruhe Institute of Technology, DE<br /> <b>Authors</b>:<br /> Michael Hefenbrock, Dennis Weller, Michael Beigl and Mehdi Tahoori, Karlsruhe Institute of Technology, DE<br /> <em><b>Abstract</b><br /> Due to aggressive technology downscaling, process variations are becoming predominant, causing performance fluctuations and impacting the chip yield. Therefore, individual circuit components have to be designed with very small failure rates to guarantee functional correctness and robust operation. The assessment of high-sigma failure rates, however, cannot be achieved with conventional Monte Carlo (MC) methods due to the huge number of required time-consuming circuit simulations. To this end, Importance Sampling (IS) methods were proposed to solve the otherwise intractable failure rate estimation problem by focusing on high-probability failure regions. However, the failure rate can be largely underestimated, while the computational effort for deriving it is high. In this paper, we propose an eXtended Bayesian Optimized IS (XBOIS) method, which addresses the aforementioned shortcomings by deploying an accurate surrogate model (e.g., delay) of the circuit around the failure region. The number of costly circuit simulations is therefore minimized, and estimation accuracy is substantially improved by efficient exploration of the variation space. As memory elements in particular occupy a large amount of on-chip resources, we evaluate our approach on SRAM cell failure rate estimation.
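<p><em>Both 2.6.2 and 2.6.3 above build on importance sampling for rare-event failure rates. A minimal one-dimensional scaled-sigma estimator, assuming a standard-normal process parameter and a simple threshold failure criterion, shows the core re-weighting step (the adaptive and Bayesian search of the failure region described in the papers is not shown):</em></p> <pre><code>import math, random

def failure(x):
    return x > 4.5   # rare event under N(0,1): true probability is about 3.4e-6

def norm_pdf(x, sigma=1.0):
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def scaled_sigma_is(n=200_000, scale=3.0, seed=1):
    """Sample from the scaled distribution N(0, scale^2), re-weight back to N(0,1)."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, scale)
        if failure(x):
            acc += norm_pdf(x) / norm_pdf(x, scale)   # importance weight
    return acc / n

print(f"estimated failure rate: {scaled_sigma_is():.2e}")
# Plain Monte Carlo with the same sample count would typically observe
# no failures at all, hence the large speedups reported by such methods.</code></pre>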
Results show a speedup of about 16x as well as two orders of magnitude higher failure rate estimation accuracy compared to the best state-of-the-art techniques.</em></td> </tr> <tr> <td>12:45</td> <td>2.6.4</td> <td><b>VALID WINDOW: A NEW METRIC TO MEASURE THE RELIABILITY OF NAND FLASH MEMORY</b><br /> <b>Speaker</b>:<br /> Min Ye, City University of Hong Kong, HK<br /> <b>Authors</b>:<br /> Min Ye<sup>1</sup>, Qiao Li<sup>1</sup>, Jianqiang Nie<sup>2</sup>, Tei-Wei Kuo<sup>1</sup> and Chun Jason Xue<sup>1</sup><br /> <sup>1</sup>City University of Hong Kong, HK; <sup>2</sup>YEESTOR Microelectronics Co., Ltd, CN<br /> <em><b>Abstract</b><br /> NAND flash memory has been widely adopted in storage systems today. The most important issue in flash memory is its reliability, especially for 3D NAND, which suffers from several types of errors. The raw bit error rate (RBER) when applying default read reference voltages is usually adopted as the reliability metric for NAND flash memory. However, RBER is closely related to how data is read, and varies greatly if read retry operations are conducted with tuned read reference voltages. In this work, a new metric, the valid window, is proposed to measure reliability in a stable and accurate way. A valid window expresses the size of the error regions between two neighboring levels and determines whether the data can be correctly read with further read retry. Taking advantage of these features, we design a method to reduce the number of read retry operations. This is achieved by adjusting the program operations of 3D NAND flash memories. Experiments on a real 3D NAND flash chip verify the effectiveness of the proposed method.</em></td> </tr> <tr> <td style="width:40px;">13:00</td> <td><a href="#IP1">IP1-8</a>, 110</td> <td><b>BINARY LINEAR ECCS OPTIMIZED FOR BIT INVERSION IN MEMORIES WITH ASYMMETRIC ERROR PROBABILITIES</b><br /> <b>Speaker</b>:<br /> Valentin Gherman, CEA, FR<br /> <b>Authors</b>:<br /> Valentin Gherman, Samuel Evain and Bastien Giraud, CEA, FR<br /> <em><b>Abstract</b><br /> Many memory types are asymmetric with respect to the error vulnerability of stored 0's and 1's. For instance, DRAM, STT-MRAM and NAND flash memories may suffer from asymmetric error rates. A recently proposed error-protection scheme consists of inverting memory words with too many vulnerable values before they are stored in an asymmetric memory. In this paper, a method is proposed for the optimization of systematic binary linear block error-correcting codes in order to maximize their impact when combined with memory word inversion.</em></td> </tr> <tr> <td style="width:40px;">13:01</td> <td><a href="#IP1">IP1-9</a>, 634</td> <td><b>BELDPC: BIT ERRORS AWARE ADAPTIVE RATE LDPC CODES FOR 3D TLC NAND FLASH MEMORY</b><br /> <b>Speaker</b>:<br /> Meng Zhang, Huazhong University of Science &amp; Technology, CN<br /> <b>Authors</b>:<br /> Meng Zhang, Fei Wu, Qin Yu, Weihua Liu, Lanlan Cui, Yahui Zhao and Changsheng Xie, Huazhong University of Science &amp; Technology, CN<br /> <em><b>Abstract</b><br /> Three-dimensional (3D) NAND flash memory achieves high capacity and cell storage density by using multi-bit technology and a vertical stack architecture, but suffers degraded data reliability due to high raw bit error rates (RBER) caused by program/erase (P/E) cycles and retention periods.
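<p><em>The word-inversion scheme whose ECC is optimized in IP1-8 above can be sketched in a few lines, assuming stored 1s are the vulnerable value: a word with more than half of its bits set is stored inverted together with a flag bit, so that at most half of the stored bits are vulnerable. The paper's code-optimization method itself is not shown:</em></p> <pre><code>WORD_BITS = 32
MASK = (1 << WORD_BITS) - 1

def encode(word):
    """Return (stored_word, inverted_flag); 1s are assumed vulnerable."""
    if bin(word).count("1") > WORD_BITS // 2:
        return word ^ MASK, 1   # invert: at most half the stored bits are 1
    return word, 0

def decode(stored, flag):
    return stored ^ MASK if flag else stored

w = 0xFFFFF0F0                  # 24 ones out of 32: mostly vulnerable
stored, flag = encode(w)
assert decode(stored, flag) == w
assert bin(stored).count("1") <= WORD_BITS // 2</code></pre>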
Low-density parity-check (LDPC) codes have become popular error-correcting technology for improving data reliability thanks to their strong error correction capability, but they introduce more decoding iterations at higher RBER. To reduce decoding iterations, this paper proposes BeLDPC: bit-error-aware adaptive rate LDPC codes for 3D triple-level cell (TLC) NAND flash memory. Firstly, bit error characteristics in 3D charge trap TLC NAND flash memory are studied on a real FPGA testing platform, including asymmetric bit flipping and temporal locality of bit errors. Then, based on these characteristics, a high-efficiency LDPC code is designed. Experimental results show that BeLDPC can reduce decoding iterations under different P/E cycles and retention periods.</em></td> </tr> <tr> <td>13:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="2.7">2.7 Optimizing emerging applications for power-efficient computing</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 11:30 - 13:00<br /> <b>Location / Room:</b> Berlioz</p> <p><b>Chair:</b><br /> Theocharis Theocharides, University of Cyprus, CY</p> <p><b>Co-Chair:</b><br /> Muhammad Shafique, TU Wien, AT</p> <p>This session focuses on emerging applications for power-efficient computing, such as bioinformatics and few-shot learning. Methods such as hyperdimensional computing and computing in memory are applied to process DNA pattern matching or to perform few-shot learning in a more power-efficient way.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>11:30</td> <td>2.7.1</td> <td><b>GENIEHD: EFFICIENT DNA PATTERN MATCHING ACCELERATOR USING HYPERDIMENSIONAL COMPUTING</b><br /> <b>Speaker</b>:<br /> Mohsen Imani, University of California, San Diego, US<br /> <b>Authors</b>:<br /> Yeseong Kim, Mohsen Imani, Niema Moshiri and Tajana Rosing, University of California, San Diego, US<br /> <em><b>Abstract</b><br /> DNA pattern matching is widely applied in many bioinformatics applications. The increasing volume of DNA data exacerbates the runtime and power consumption needed to discover DNA patterns. In this paper, we propose a hardware-software codesign, called GenieHD, which efficiently parallelizes the DNA pattern-matching task. We exploit brain-inspired hyperdimensional (HD) computing, which mimics pattern-based computations in human memory. We transform the inherently sequential processes of DNA pattern matching into highly parallelizable computation tasks using HD computing. We accordingly design an accelerator architecture targeting various parallel computing platforms to effectively parallelize the HD-based DNA pattern matching while significantly reducing memory accesses. We evaluate GenieHD on practical large-size DNA datasets such as the human and Escherichia coli genomes.
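<p><em>To make the hyperdimensional encoding style used by GenieHD (2.7.1 above) concrete, a minimal NumPy sketch, assuming random bipolar hypervectors for the four bases and cyclic shifts for position binding; sequences are compared with a normalized dot product. This illustrates the generic HD encoding, not GenieHD's actual accelerator:</em></p> <pre><code>import numpy as np

D = 10_000                                  # hypervector dimensionality (assumed)
rng = np.random.default_rng(0)
BASE = {b: rng.choice([-1, 1], size=D) for b in "ACGT"}

def encode(seq):
    """Bind each base to its position by a cyclic shift, then bundle by summing."""
    return np.sum([np.roll(BASE[b], i) for i, b in enumerate(seq)], axis=0)

def similarity(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

pattern = encode("GATTACA")
print(similarity(pattern, encode("GATTACA")))  # 1.0: pattern present
print(similarity(pattern, encode("CCGTGTC")))  # near 0.0: pattern absent</code></pre>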
Our evaluation shows that GenieHD significantly accelerates the DNA matching procedure, e.g., a 44.4× speedup and 54.1× higher energy efficiency compared to a state-of-the-art FPGA-based design.</em></td> </tr> <tr> <td>12:00</td> <td>2.7.2</td> <td><b>REPUTE: AN OPENCL BASED READ MAPPING TOOL FOR EMBEDDED GENOMICS</b><br /> <b>Speaker</b>:<br /> Sidharth Maheshwari, Newcastle University, GB<br /> <b>Authors</b>:<br /> Sidharth Maheshwari<sup>1</sup>, Rishad Shafik<sup>1</sup>, Alex Yakovlev<sup>1</sup>, Ian Wilson<sup>1</sup> and Amit Acharyya<sup>2</sup><br /> <sup>1</sup>Newcastle University, GB; <sup>2</sup>IIT Hyderabad, IN<br /> <em><b>Abstract</b><br /> Genomics is transforming medicine from reactive to personalized, predictive, preventive and participatory (P4). The massive amount of data produced by genomics is a major challenge, as it requires extensive computational capabilities and consumes large amounts of energy. A crucial prerequisite for computational genomics is genome assembly, but existing mapping tools are predominantly software based and optimized for homogeneous high-performance systems. In this paper, we propose an OpenCL based REad maPper for heterogeneoUs sysTEms (REPUTE), which can use diverse and parallel compute and storage devices effectively. Core to this tool are dynamic-programming-based filtration and verification kernels that map reads on multiple devices concurrently. We show hardware/software co-design and implementations of REPUTE across different platforms, and compare it with state-of-the-art mappers. We demonstrate the performance of the mappers on two systems: 1) an Intel CPU + 2 Nvidia GPUs; 2) a HiKey970 embedded SoC with ARM Cortex-A73/A53 cores. The results show that REPUTE outperforms other read mappers in most cases, producing up to 13x speedup with better or comparable accuracy. We also demonstrate that the embedded implementation can achieve up to 27x energy savings, enabling low-cost genomics.</em></td> </tr> <tr> <td>12:30</td> <td>2.7.3</td> <td><b>A FAST AND ENERGY EFFICIENT COMPUTING-IN-MEMORY ARCHITECTURE FOR FEW-SHOT LEARNING APPLICATIONS</b><br /> <b>Speaker</b>:<br /> Dayane Reis, University of Notre Dame, US<br /> <b>Authors</b>:<br /> Dayane Reis, Ann Franchesca Laguna, Michael Niemier and X. Sharon Hu, University of Notre Dame, US<br /> <em><b>Abstract</b><br /> Among few-shot learning methods, prototypical networks (PNs) are one of the most popular approaches due to their excellent classification accuracy and network simplicity. Test examples are classified based on their distances from class prototypes. Despite the application-level advantages of PNs, the latency of transferring data from memory to compute units is much higher than the PN computation time; thus, PN performance is limited by memory bandwidth. Computing-in-memory (CiM) addresses this bandwidth-bottleneck problem by bringing a subset of compute units closer to memory. In this work, we propose a CiM-PN framework that enables the computation of distance metrics and prototypes inside the memory. CiM-PN replaces the computationally intensive Euclidean distance metric with the CiM-friendly Manhattan distance metric. Additionally, prototypes are computed using an in-memory mean operation realized by accumulation and division by powers of two, which enables few-shot learning implementations where "shots" are powers of two.
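<p><em>The two CiM-friendly substitutions described in 2.7.3 above are simple to state in code: prototypes are means computed as an accumulation followed by a right shift (hence power-of-two "shots"), and classification uses the Manhattan rather than the Euclidean distance. A minimal integer sketch with illustrative data:</em></p> <pre><code>def prototype(support):
    """Mean of 2^k support vectors via accumulate + right shift (no divider)."""
    shots = len(support)
    assert shots & (shots - 1) == 0, "shots must be a power of two"
    shift = shots.bit_length() - 1
    return [sum(col) >> shift for col in zip(*support)]  # divide by 2^k as a shift

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def classify(query, prototypes):
    return min(prototypes, key=lambda c: manhattan(query, prototypes[c]))

protos = {
    "cat": prototype([[2, 8, 1, 0], [4, 6, 1, 2]]),   # 2-shot classes
    "dog": prototype([[9, 1, 7, 5], [7, 3, 9, 7]]),
}
print(classify([3, 7, 1, 1], protos))  # -> "cat"</code></pre>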
The CiM-PN hardware uses CMOS memory cells, as well as CMOS peripherals such as customized sense amplifiers, carry look-ahead adders, in-place copy buffers and a log bit-shifter. Compared with a GPU implementation, a CMOS-based CiM-PN achieves speedups of 2808x/111x and energy savings of 2372x/5170x at iso-accuracy for the prototype and nearest-neighbor computation, respectively, and over 2x end-to-end speedup and energy improvements. We also gain a 3-14% accuracy improvement when compared to existing non-GPU hardware approaches due to the floating-point CiM operations.</em></td> </tr> <tr> <td>13:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="UB02">UB02 Session 2</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 12:30 - 15:00<br /> <b>Location / Room:</b> Booth 11, Exhibition Area</p> <p> </p> <table> <tbody> <tr> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> <tr> <td>UB02.1</td> <td><b>FLETCHER: TRANSPARENT GENERATION OF HARDWARE INTERFACES FOR ACCELERATING BIG DATA APPLICATIONS</b><br /> <b>Authors</b>:<br /> Zaid Al-Ars, Johan Peltenburg, Jeroen van Straten, Matthijs Brobbel and Joost Hoozemans, TU Delft, NL<br /> <em><b>Abstract</b><br /> This demo, created by TU Delft, presents a software-hardware framework for the efficient integration of FPGA hardware accelerators both on edge devices and in the cloud. The framework, called Fletcher, automatically generates data communication interfaces in hardware based on the widely used big data format Apache Arrow. This provides two distinct advantages. On the one hand, since the accelerators use the same data format as the software, data communication bottlenecks can be reduced. On the other hand, since a standardized data format is used, this allows for easy-to-use interfaces on the accelerator side, thereby reducing the design and development time. The demo shows how to use Fletcher for big data acceleration to decompress Snappy compressed files and perform filtering on the whole Wikipedia body of text, achieving 25 GB/s processing throughput.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3134.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB02.2</td> <td><b>A DIGITAL MICROFLUIDICS BIO-COMPUTING PLATFORM</b><br /> <b>Authors</b>:<br /> Georgi Tanev, Luca Pezzarossa, Winnie Edith Svendsen and Jan Madsen, TU Denmark, DK<br /> <em><b>Abstract</b><br /> Digital microfluidics is a lab-on-a-chip (LOC) technology used to actuate small amounts of liquids on an array of individually addressable electrodes. Microliter-sized droplets can be programmatically dispensed, moved, mixed and split in a controlled environment, which, combined with miniaturized sensing techniques, makes LOC suitable for a broad range of applications in the field of medical diagnostics and synthetic biology. Furthermore, a programmable digital microfluidics platform holds the potential to add a "fluidic subsystem" to the classical computation model, thus opening the door to cyber-physical bio-processors. To facilitate the programming and operation of such bio-fluidic computing, we propose a dedicated instruction set architecture and virtual machine.
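<p><em>As a sketch of what such a fluidic instruction set and virtual machine could look like (the booth's actual ISA is not detailed in the abstract), a toy three-instruction interpreter over a droplet-position map, with entirely hypothetical names:</em></p> <pre><code>class FluidicVM:
    """Toy interpreter for a hypothetical digital-microfluidics ISA."""

    def __init__(self):
        self.droplets = {}               # droplet id -> electrode position

    def run(self, program):
        for op, *args in program:
            getattr(self, op.lower())(*args)

    def dispense(self, did, pos):
        self.droplets[did] = pos         # reservoir -> electrode

    def move(self, did, pos):
        self.droplets[did] = pos         # actuate a neighboring electrode

    def mix(self, a, b, out):
        pos = self.droplets.pop(b)
        self.droplets.pop(a)
        self.droplets[out] = pos         # merged droplet sits at b's electrode

vm = FluidicVM()
vm.run([
    ("DISPENSE", "sample", (0, 0)),
    ("DISPENSE", "reagent", (0, 3)),
    ("MOVE", "sample", (0, 2)),
    ("MIX", "sample", "reagent", "mixture"),
])
print(vm.droplets)                       # {'mixture': (0, 3)}</code></pre>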
A set of digital microfluidic core instructions as well as classic computing operations are executed on a virtual machine, which decouples the protocol execution from the LOC functionality.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3103.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB02.3</td> <td><b>VIRTUAL PLATFORMS FOR COMPLEX SOFTWARE STACKS</b><br /> <b>Authors</b>:<br /> Lukas Jünger and Rainer Leupers, RWTH Aachen University, DE<br /> <em><b>Abstract</b><br /> This demonstration showcases our "AVP64" Virtual Platform (VP), which models a multi-core ARMv8 (Cortex A72) system including several peripherals, such as an SDHCI and an Ethernet controller. For the ARMv8 instruction set simulation, a dynamic binary translation based solution is used. As the workload, the Xen hypervisor with two Linux Virtual Machines (VMs) is executed. Both VMs are connected to the simulation host's network subsystem via a virtual Ethernet controller. One of the VMs executes a NodeJS-based server application offering a REST API via this network connection. An AngularJS client application on the host system can then connect to the server application to obtain and store data via the server's REST API. This data is read and written by the server application to the virtual SD card connected to the SDHCI. For this, one SD card partition is passed to the VM through Xen's block device virtualization mechanism.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3099.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB02.4</td> <td><b>FPGA-DSP: A PROTOTYPE FOR HIGH QUALITY DIGITAL AUDIO SIGNAL PROCESSING BASED ON AN FPGA</b><br /> <b>Authors</b>:<br /> Bernhard Riess and Christian Epe, University of Applied Sciences Düsseldorf, DE<br /> <em><b>Abstract</b><br /> Our demonstrator presents a prototype of a new digital audio signal processing system based on an FPGA. It achieves a performance that up to now has been reserved for costly high-end solutions. The main components of the system are an analog/digital converter, an FPGA to perform the digital signal processing tasks, and a digital/analog converter implemented on a printed circuit board. To demonstrate the quality of the audio signal processing, infinite impulse response and finite impulse response filters as well as a delay effect were realized in VHDL. More advanced signal processing systems can easily be implemented due to the flexibility of the FPGA. Measured results were compared to state-of-the-art audio signal processing systems with respect to size, performance and cost. Our prototype outperforms systems of the same price in quality, and it matches systems of the same quality while costing at most 20% of their price. Examples of the performance of our system can be heard in the demo.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3100.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB02.5</td> <td><b>AT-SPEED DFT ARCHITECTURE FOR BUNDLED-DATA CIRCUITS</b><br /> <b>Authors</b>:<br /> Ricardo Aquino Guazzelli and Laurent Fesquet, Université Grenoble Alpes, FR<br /> <em><b>Abstract</b><br /> At-speed testing for asynchronous circuits is still an open concern in the literature.
Due to the timing constraints between control and data paths, Design for Testability (DfT) methodologies must test both control and data paths at the same time in order to guarantee circuit correctness. As Process Voltage Temperature (PVT) variations significantly impact circuit design in newer CMOS technologies and low-power techniques such as voltage scaling, the timing constraints between control and data paths must be tested after fabrication not only under nominal conditions but across a range of operating conditions. This work explores an at-speed testing approach for bundled-data circuits, targeting the micropipeline template. The test approach focuses on whether the sized delay lines in the control paths respect the local timing assumptions of the data paths.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3117.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB02.6</td> <td><b>INTACT: A 96-CORE PROCESSOR WITH 6 CHIPLETS 3D-STACKED ON AN ACTIVE INTERPOSER AND A 16-CORE PROTOTYPE RUNNING GRAPHICAL OPERATING SYSTEM</b><br /> <b>Authors</b>:<br /> Eric Guthmuller<sup>1</sup>, Pascal Vivet<sup>1</sup>, César Fuguet<sup>1</sup>, Yvain Thonnart<sup>1</sup>, Gaël Pillonnet<sup>2</sup> and Fabien Clermidy<sup>1</sup><br /> <sup>1</sup>Université Grenoble Alpes / CEA List, FR; <sup>2</sup>Université Grenoble Alpes / CEA-Leti, FR<br /> <em><b>Abstract</b><br /> We built a demonstrator for our 96-core cache-coherent 3D processor and a first prototype featuring 16 cores. The demonstrator consists of our 16-core processor running commodity operating systems such as Linux and NetBSD on a PC-like motherboard with user-friendly devices such as an HDMI display, keyboard and mouse. A graphical desktop is displayed, and the user interacts with it through the keyboard and mouse. The demonstrator is able to run parallel applications to benchmark its performance in terms of scalability. The main innovation of our processor is its scalable cache-coherent architecture based on distributed L2 caches and adaptive L3 caches. Additionally, the energy consumption is measured and displayed by reading dynamically from the monitors of power-supply devices. Finally, we also show open packages of the 3D processor featuring six 16-core chiplets (28 nm FDSOI) on an active interposer (65 nm) embedding Networks-on-Chip, power management and IO controllers.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3114.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB02.7</td> <td><b>GENERATING ASYNCHRONOUS CIRCUITS FROM CATAPULT</b><br /> <b>Authors</b>:<br /> Yoan Decoudu<sup>1</sup>, Jean Simatic<sup>2</sup>, Katell Morin-Allory<sup>3</sup> and Laurent Fesquet<sup>3</sup><br /> <sup>1</sup>University Grenoble Alpes, FR; <sup>2</sup>HawAI.Tech, FR; <sup>3</sup>Université Grenoble Alpes, FR<br /> <em><b>Abstract</b><br /> In order to spread asynchronous circuit design to a large community of designers, High-Level Synthesis (HLS) is probably a good choice because it requires limited technical design skills. HLS usually provides an RTL description, which includes a data-path and a control-path. The desynchronization process is only applied to the control-path, which is a Finite State Machine (FSM). This method is sufficient to make the circuit asynchronous. Indeed, data are processed step by step in the pipeline stages, thanks to the desynchronized FSM.
Thus, the data-path computation time is no longer tied to the clock period but rather to the average time for processing data through the pipeline. This tends to improve speed when the pipeline stages are not well balanced. Moreover, our approach helps to quickly design data-driven circuits while maintaining a reasonable cost, a similar area and a short time-to-market.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3118.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB02.8</td> <td><b>WALLANCE: AN ALTERNATIVE TO BLOCKCHAIN FOR IOT</b><br /> <b>Authors</b>:<br /> Loic Dalmasso, Florent Bruguier, Pascal Benoit and Achraf Lamlih, Université de Montpellier, FR<br /> <em><b>Abstract</b><br /> Since the expansion of the Internet of Things (IoT), connected devices have become smart and autonomous. Their exponentially increasing number and their use in many application domains result in a huge potential for cybersecurity threats. Taking into account the evolution of the IoT, security and interoperability are the main challenges in ensuring the reliability of the information. Blockchain technology provides a new approach to handling trust in a decentralized network. However, current blockchain implementations cannot be used in the IoT domain because of their huge computing power and storage requirements. This demonstrator presents a lightweight distributed ledger protocol dedicated to IoT applications, reducing the computing power and storage utilization, handling scalability and ensuring the reliability of information.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3119.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB02.9</td> <td><b>PRE-IMPACT FALL DETECTION ARCHITECTURE BASED ON NEUROMUSCULAR CONNECTIVITY STATISTICS</b><br /> <b>Authors</b>:<br /> Giovanni Mezzina, Sardar Mehboob Hussain and Daniela De Venuto, Politecnico di Bari, IT<br /> <em><b>Abstract</b><br /> In this demonstration, we propose an innovative multi-sensor architecture operating in the field of pre-impact fall detection (PIFD). The proposed architecture jointly analyzes cortical and muscular involvement when unexpected slippages occur during steady walking. The EEG and EMG are acquired through wearable and wireless devices. The control unit consists of an STM32L4 microcontroller and a Simulink model. The C code implements the EMG computation, while the cortical analysis and the final classification are entrusted to the Simulink model. The EMG computation block translates EMGs into binary signals, which are used both to enable cortical analyses and to extract a score that distinguishes "standard" muscular behaviors from anomalous ones. The Simulink model evaluates the cortical responsiveness in five bands of interest and implements the logic-based network classifier.
The system, tested on 6 healthy subjects, shows an accuracy of 96.21% and a detection time of ~371 ms.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3122.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB02.10</td> <td><b>JOINTER: JOINING FLEXIBLE MONITORS WITH HETEROGENEOUS ARCHITECTURES</b><br /> <b>Authors</b>:<br /> Giacomo Valente<sup>1</sup>, Tiziana Fanni<sup>2</sup>, Carlo Sau<sup>3</sup>, Claudio Rubattu<sup>2</sup>, Francesca Palumbo<sup>2</sup> and Luigi Pomante<sup>1</sup><br /> <sup>1</sup>Università degli Studi dell'Aquila, IT; <sup>2</sup>Università degli Studi di Sassari, IT; <sup>3</sup>Università degli Studi di Cagliari, IT<br /> <em><b>Abstract</b><br /> As embedded systems grow more complex and shift toward heterogeneous architectures, understanding workload performance characteristics becomes increasingly difficult. In this regard, run-time monitoring systems can help obtain the visibility needed to characterize a system. This demo presents a framework for developing complex heterogeneous architectures composed of programmable processors and dedicated accelerators on FPGA, together with customizable monitoring systems, while keeping the introduced overhead under control. The whole development flow (and the related prototype EDA tools) will be shown: it starts with accelerator creation using a dataflow model, proceeds in parallel with monitoring-system customization using a library of elements, and ends with the final joining of the two. Moreover, a comparison among different monitoring-system functionalities on different architectures developed on a Zynq7000 SoC will be illustrated.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3124.pdf">More information ...</a></b></em></td> </tr> <tr> <td>15:00</td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="3.0">3.0 LUNCHTIME KEYNOTE SESSION</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 13:50 - 14:20<br /> <b>Location / Room:</b> Amphithéâtre Jean Prouve</p> <p><b>Chair:</b><br /> Marco Casale-Rossi, Synopsys, IT</p> <p><b>Co-Chair:</b><br /> Giovanni De Micheli, EPFL, CH</p> <p> </p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>13:50</td> <td>3.0.1</td> <td><b>NEUROMORPHIC COMPUTING: PAST, PRESENT, AND FUTURE</b><br /> <b>Author</b>:<br /> Catherine Schuman, Oak Ridge National Laboratory, US<br /> <em><b>Abstract</b><br /> Though neuromorphic systems were introduced decades ago, there has been a resurgence of interest in recent years due to the looming end of Moore's law, the end of Dennard scaling, and the tremendous success of AI and deep learning for a wide variety of applications. With this renewed interest, there is a diverse set of research ongoing in neuromorphic computing, ranging from novel hardware implementations, devices and materials to the development of new training and learning algorithms. There are many potential advantages to neuromorphic systems that make them attractive in today's computing landscape, including the potential for very low-power, efficient hardware that can perform neural network computation. Though compelling results demonstrated thus far illustrate these advantages, there is still significant opportunity for innovations in hardware, algorithms, and applications in neuromorphic computing.
In this talk, a brief overview of the history of neuromorphic computing will be given, and a summary of the current state of research in the field will be presented. Finally, a list of key challenges, open questions, and opportunities for future research in neuromorphic computing will be enumerated.</em></td> </tr> <tr> <td>14:20</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="3.1">3.1 Special Session: Architectures for Emerging Technologies</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 14:30 - 16:00<br /> <b>Location / Room:</b> Amphithéâtre Jean Prouve</p> <p><b>Chair:</b><br /> Pierre-Emmanuel Gaillardon, University of Utah, US</p> <p><b>Co-Chair:</b><br /> Michael Niemier, University of Notre Dame, US</p> <p>The past five decades have witnessed transformations happening at an ever-growing pace thanks to the sustained increase in the capabilities of electronic systems. We are now at the dawn of a new revolution in which emerging technologies, understood as technologies beyond silicon complementary metal oxide semiconductors, are going to further revolutionize the way we design electronics. In this hot-topic session, we intend to elaborate on the architectural opportunities and challenges brought by non-standard semiconductor technologies. In addition to providing new perspectives to the DATE community beyond the currently hot novel architectures, such as neuromorphic or in-memory computing, this session also serves the purpose of tightening the link between DATE and the EDA community at large, in line with the mission and roles of the IEEE Rebooting Computing Initiative - <a href="https://rebootingcomputing.ieee.org" title="https://rebootingcomputing.ieee.org">https://rebootingcomputing.ieee.org</a>.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>3.1.1</td> <td><b>CRYO-CMOS INTERFACES FOR A SCALABLE QUANTUM COMPUTER</b><br /> <b>Authors</b>:<br /> Edoardo Charbon<sup>1</sup>, Andrei Vladimirescu<sup>2</sup>, Fabio Sebastiano<sup>3</sup> and Masoud Babaie<sup>3</sup><br /> <sup>1</sup>EPFL, CH; <sup>2</sup>University of California, Berkeley, US; <sup>3</sup>TU Delft, NL</td> </tr> <tr> <td>14:45</td> <td>3.1.2</td> <td><b>THE <i>N3XT 1,000X</i> FOR THE COMING SUPERSTORM OF ABUNDANT DATA: CARBON NANOTUBE FETS, RESISTIVE RAM, MONOLITHIC 3D</b><br /> <b>Authors</b>:<br /> Gage Hills<sup>1</sup> and Mohamed M. Sabry<sup>2</sup><br /> <sup>1</sup>Massachusetts Institute of Technology, US; <sup>2</sup>Nanyang Technological University, SG</td> </tr> <tr> <td>15:00</td> <td>3.1.3</td> <td><b>MULTIPLIER ARCHITECTURES: CHALLENGES AND OPPORTUNITIES WITH PLASMONIC-BASED LOGIC</b><br /> <b>Speaker</b>:<br /> Eleonora Testa, EPFL, CH<br /> <b>Authors</b>:<br /> Eleonora Testa<sup>1</sup>, Samantha Lubaba Noor<sup>2</sup>, Odysseas Zografos<sup>3</sup>, Mathias Soeken<sup>1</sup>, Francky Catthoor<sup>3</sup>, Azad Naeemi<sup>2</sup> and Giovanni De Micheli<sup>1</sup><br /> <sup>1</sup>EPFL, CH; <sup>2</sup>Georgia Tech, US; <sup>3</sup>IMEC, BE</td> </tr> <tr> <td>15:15</td> <td>3.1.4</td> <td><b>QUANTUM COMPUTER ARCHITECTURE: TOWARDS FULL-STACK QUANTUM ACCELERATORS</b><br /> <b>Speaker</b>:<br /> Koen Bertels, TU Delft, NL<br /> <b>Authors</b>:<br /> Koen Bertels, Aritra Sarkar, T. Hubregtsen, M. Serrao, Abid A. Mouedenne, A. Yadav, A. Krol, Imran Ashraf and Carmen G.
Almudever, TU Delft, NL</td> </tr> <tr> <td>15:30</td> <td>3.1.5</td> <td><b>UTILIZING BURIED POWER RAILS AND BACKSIDE PDN TO FURTHER CMOS SCALING BELOW 5NM NODES</b><br /> <b>Authors</b>:<br /> Odysseas Zografos, Sudhir Patli, Satadru Sarkar, Bilal Chehab, Doyoung Jang, Rogier Baert, Peter Debacker, Myung-Hee Na and Julien Ryckaert, IMEC, BE</td> </tr> <tr> <td>15:45</td> <td>3.1.6</td> <td><b>A RRAM-BASED FPGA FOR ENERGY-EFFICIENT EDGE COMPUTING</b><br /> <b>Authors</b>:<br /> Xifan Tang, Ganesh Gore, Patsy Cadareanu, Edouard Giacomin and Pierre-Emmanuel Gaillardon, University of Utah, US</td> </tr> <tr> <td>16:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="3.2">3.2 Accelerating Design Space Exploration</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 14:30 - 16:00<br /> <b>Location / Room:</b> Chamrousse</p> <p><b>Chair:</b><br /> Christian Pilato, Politecnico di Milano, IT</p> <p><b>Co-Chair:</b><br /> Luca Carloni, Columbia University, US</p> <p>Efficient Design Space Exploration is needed to optimize hardware accelerators. At a high level, learning techniques can provide ways either to recognize previously synthesized kernels or to model the hidden dependences between synthesis directives and the resulting cost and performance. At a lower level, accelerating RTL simulation through data-dependency analysis addresses one of the most time-consuming steps.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>3.2.1</td> <td><b>EFFICIENT AND ROBUST HIGH-LEVEL SYNTHESIS DESIGN SPACE EXPLORATION THROUGH OFFLINE MICRO-KERNELS PRE-CHARACTERIZATION</b><br /> <b>Authors</b>:<br /> Zi Wang, Jianqi Chen and Benjamin Carrion Schaefer, University of Texas at Dallas, US<br /> <em><b>Abstract</b><br /> This work proposes a method to accelerate the process of High-Level Synthesis (HLS) Design Space Exploration (DSE) by pre-characterizing micro-kernels offline and creating predictive models of them. HLS can generate different types of micro-architectures from the same untimed behavioral description. This is typically done by setting different combinations of synthesis options in the form of synthesis directives specified as pragmas in the code. This allows control over, e.g., how loops, arrays and functions should be synthesized. Unique combinations of these pragmas lead to micro-architectures with unique area vs. performance/power trade-offs. The main problem is that the search space grows exponentially with the number of explorable operations. Thus, the main goal of efficient HLS DSE is to find the synthesis directive combinations that lead to the Pareto-optimal designs quickly. Our proposed method is based on the pre-characterization of micro-kernels offline, creating predictive models for each of the kernels, and using the results to explore a new, unseen behavioral description using compositional methods. In addition, we make use of perceptual hashing to match new unseen micro-kernels with the pre-characterized micro-kernels in order to further speed up the search process.
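<p><em>The DSE papers in this session judge exploration quality by closeness to the Pareto front over area and performance. A minimal Pareto filter over hypothetical (area, latency) design points, both to be minimized, makes that yardstick concrete:</em></p> <pre><code>def pareto_front(designs):
    """Keep designs not dominated in both area and latency (both minimized)."""
    return [
        d for d in designs
        if not any(
            o["area"] <= d["area"] and o["latency"] <= d["latency"] and o != d
            for o in designs
        )
    ]

# Hypothetical DSE results: one point per synthesis-directive combination.
points = [
    {"pragmas": "unroll=1", "area": 100, "latency": 900},
    {"pragmas": "unroll=4", "area": 250, "latency": 400},
    {"pragmas": "unroll=8", "area": 500, "latency": 380},
    {"pragmas": "pipeline", "area": 260, "latency": 500},  # dominated by unroll=4
]
for p in pareto_front(points):
    print(p["pragmas"], p["area"], p["latency"])</code></pre>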
Experimental results show that our proposed method is orders of magnitude faster than traditional methods.</em></td> </tr> <tr> <td>15:00</td> <td>3.2.2</td> <td><b>PROSPECTOR: SYNTHESIZING EFFICIENT ACCELERATORS VIA STATISTICAL LEARNING</b><br /> <b>Speaker</b>:<br /> Aninda Manocha, Princeton University, US<br /> <b>Authors</b>:<br /> Atefeh Mehrabi, Aninda Manocha, Benjamin Lee and Daniel Sorin, Duke University, US<br /> <em><b>Abstract</b><br /> Accelerator design is expensive due to the effort required to understand an algorithm and optimize the design. Architects have embraced two technologies to reduce costs. High-level synthesis automatically generates hardware from code. Reconfigurable fabrics instantiate accelerators while avoiding fabrication costs for custom circuits. We further reduce design effort with statistical learning. We build an automated framework, called Prospector, that uses Bayesian techniques to optimize synthesis directives, reducing execution latency and resource usage in field-programmable gate arrays. We show that, within a fixed exploration time, designs discovered by Prospector are closer to Pareto-efficient designs than those found by prior approaches.</em></td> </tr> <tr> <td>15:30</td> <td>3.2.3</td> <td><b>TANGO: AN OPTIMIZING COMPILER FOR JUST-IN-TIME RTL SIMULATION</b><br /> <b>Speaker</b>:<br /> Blaise-Pascal Tine, Georgia Tech, US<br /> <b>Authors</b>:<br /> Blaise Tine, Sudhakar Yalamanchili and Hyesoon Kim, Georgia Tech, US<br /> <em><b>Abstract</b><br /> With Moore's law coming to an end, the advent of hardware specialization presents a unique challenge for a much tighter software and hardware co-design environment to exploit domain-specific optimizations and increase design efficiency. This trend is further accentuated by the rapid pace of innovation in Machine Learning and Graph Analytics, calling for a faster product development cycle for hardware accelerators and raising the importance of addressing the increasing cost of hardware verification. The productivity of software-hardware co-design relies upon a better integration between the software and hardware design methodologies and, more importantly, on the effectiveness of the design tools and hardware simulators at reducing the development time. In this work, we developed Tango, an optimizing compiler for a Just-in-Time RTL simulator. Tango implements unique hardware-centric compiler transformations to speed up runtime code generation in a software-hardware co-design environment where hardware simulation speed is critical. Tango achieves a 6x average speedup compared to state-of-the-art RTL simulators.</em></td> </tr> <tr> <td style="width:40px;">16:01</td> <td><a href="#IP1">IP1-10</a>, 728</td> <td><b>POISONING THE (DATA) WELL IN ML-BASED CAD: A CASE STUDY OF HIDING LITHOGRAPHIC HOTSPOTS</b><br /> <b>Speaker</b>:<br /> Kang Liu, New York University, US<br /> <b>Authors</b>:<br /> Kang Liu, Benjamin Tan, Ramesh Karri and Siddharth Garg, New York University, US<br /> <em><b>Abstract</b><br /> Machine learning (ML) provides state-of-the-art performance in many parts of computer-aided design (CAD) flows. However, deep neural networks (DNNs) are susceptible to various adversarial attacks, including data poisoning that compromises training to insert backdoors. Sensitivity to training data integrity presents a security vulnerability, especially in light of malicious insiders who want to cause targeted neural network misbehavior.
In this study, we explore this threat in lithographic hotspot detection via training data poisoning, where hotspots in a layout clip can be "hidden" at inference time by including a trigger shape in the input. We show that training data poisoning attacks are feasible and stealthy, demonstrating a backdoored neural network that performs normally on clean inputs but misbehaves when a backdoor trigger is present in the input. Furthermore, our results raise some fundamental questions about the robustness of ML-based systems in CAD.</em></td> </tr> <tr> <td style="width:40px;">16:02</td> <td><a href="#IP1">IP1-11</a>, 667</td> <td><b>SOLOMON: AN AUTOMATED FRAMEWORK FOR DETECTING FAULT ATTACK VULNERABILITIES IN HARDWARE</b><br /> <b>Speaker</b>:<br /> Milind Srivastava, IIT Madras, IN<br /> <b>Authors</b>:<br /> Milind Srivastava<sup>1</sup>, Patanjali SLPSK<sup>1</sup>, Indrani Roy<sup>1</sup>, Chester Rebeiro<sup>1</sup>, Aritra Hazra<sup>2</sup> and Swarup Bhunia<sup>3</sup><br /> <sup>1</sup>IIT Madras, IN; <sup>2</sup>IIT Kharagpur, IN; <sup>3</sup>University of Florida, US<br /> <em><b>Abstract</b><br /> Fault attacks are potent physical attacks on crypto-devices. A single fault injected during encryption can reveal the cipher's secret key. In a hardware realization of an encryption algorithm, only a tiny fraction of the gates is exploitable by such an attack. Finding these vulnerable gates has been a manual and tedious task requiring considerable expertise. In this paper, we propose SOLOMON, the first automatic fault attack vulnerability detection framework for hardware designs. Given a cipher implementation, either at RTL or gate level, SOLOMON uses formal methods to map vulnerable regions in the cipher algorithm to specific locations in the hardware, thus enabling targeted countermeasures to be deployed with much lower overheads. We demonstrate the efficacy of the SOLOMON framework using three ciphers: AES, CLEFIA, and Simon.</em></td> </tr> <tr> <td style="width:40px;">16:02</td> <td><a href="#IP1">IP1-12</a>, 694</td> <td><b>FORMAL SYNTHESIS OF MONITORING AND DETECTION SYSTEMS FOR SECURE CPS IMPLEMENTATIONS</b><br /> <b>Speaker</b>:<br /> Ipsita Koley, IIT Kharagpur, IN<br /> <b>Authors</b>:<br /> Ipsita Koley<sup>1</sup>, Saurav Kumar Ghosh<sup>1</sup>, Soumyajit Dey<sup>1</sup>, Debdeep Mukhopadhyay<sup>1</sup>, Amogh Kashyap K N<sup>2</sup>, Sachin Kumar Singh<sup>2</sup>, Lavanya Lokesh<sup>2</sup>, Jithin Nalu Purakkal<sup>2</sup> and Nishant Sinha<sup>2</sup><br /> <sup>1</sup>IIT Kharagpur, IN; <sup>2</sup>Robert Bosch Engineering and Business Solutions Private Limited, IN<br /> <em><b>Abstract</b><br /> We consider the problem of securing a given control loop implementation of a cyber-physical system (CPS) in the presence of Man-in-the-Middle attacks on data exchange between plant and controller over a compromised network. To this end, there exist various detection schemes that provide mathematical guarantees against such attacks for the theoretical control model. However, such guarantees may not hold for the actual control software implementation.
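<p><em>The detector-threshold trade-off that IP1-12 formalizes can be illustrated with a minimal residual-based sketch: an alarm fires when the measured plant output deviates from a model prediction by more than a threshold, and the threshold choice trades stealthy-attack detection against false alarms. The scalar dynamics and all numbers below are illustrative assumptions, not the paper's models:</em></p> <pre><code>import random

def detect(trace, threshold, a=0.9):
    """Alarm at steps where the residual against the model x' = a*x exceeds threshold."""
    alarms = []
    for t in range(1, len(trace)):
        residual = abs(trace[t] - a * trace[t - 1])
        if residual > threshold:
            alarms.append(t)
    return alarms

rng = random.Random(0)
x, trace = 10.0, [10.0]
for t in range(1, 50):
    x = 0.9 * x + rng.gauss(0.0, 0.05)           # true plant plus process noise
    trace.append(x + (0.4 if t >= 30 else 0.0))  # additive sensor attack from t = 30

print("tight threshold:", detect(trace, 0.15))   # flags the attack onset (around t = 30)
print("loose threshold:", detect(trace, 0.60))   # quiet, but the attack slips through</code></pre>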
In this article, we propose a formal approach towards synthesizing attack detectors with varying thresholds that can prevent performance-degrading stealthy attacks while minimizing false alarms.</em></td> </tr> <tr> <td>16:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="3.3">3.3 EU/ESA projects on Heterogeneous Computing</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 14:30 - 16:00<br /> <b>Location / Room:</b> Autrans</p> <p><b>Chair:</b><br /> Carles Hernandez, UPV, ES</p> <p><b>Co-Chair:</b><br /> Francisco J. Cazorla, BSC, ES</p> <p>The EU/ESA projects presented in this session cover the control electronics and data processing architecture and functionality of the Wide Field Imager, one of two scientific instruments of the next European X-ray observatory ATHENA; task-based programming models that provide a software ecosystem for heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines; and a framework that allows Big Data solutions to dynamically and transparently exploit heterogeneous hardware accelerators.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>3.3.1</td> <td><b>ESA ATHENA WFI ONBOARD ELECTRONICS - DISTRIBUTED CONTROL AND DATA PROCESSING (WORK IN PROGRESS IN THE PROJECT)</b><br /> <b>Speaker</b>:<br /> Markus Plattner, Max Planck Institute for extraterrestrial Physics, DE<br /> <b>Authors</b>:<br /> Markus Plattner<sup>1</sup>, Sabine Ott<sup>1</sup>, Jintin Tran<sup>1</sup>, Christopher Mandla<sup>1</sup>, Manfred Steller<sup>2</sup>, Harald Jeszensky<sup>2</sup>, Roland Ottensamer<sup>3</sup>, Jan-Christoph Tenzer<sup>4</sup>, Thomas Schanz<sup>4</sup>, Samuel Pliego<sup>4</sup>, Konrad Skup<sup>5</sup>, Denis Tcherniak<sup>6</sup>, Chris Thomas<sup>7</sup>, Julian Thornhill<sup>7</sup> and Sebastian Albrecht<sup>1</sup><br /> <sup>1</sup>Max Planck Institute for extraterrestrial Physics, DE; <sup>2</sup>IWF - Space Research Institute, AT; <sup>3</sup>TU Wien, AT; <sup>4</sup>University of Tübingen, DE; <sup>5</sup>CBK Warsaw, PL; <sup>6</sup>Technical University of Denmark, DK; <sup>7</sup>University of Leicester, GB<br /> <em><b>Abstract</b><br /> Within this paper, we describe the control electronics and data processing architecture and functionality of the Wide Field Imager (WFI). WFI is one of two scientific instruments of the next European X-ray observatory ATHENA, whose development started five years ago.
Meanwhile, a conceptual design, development models and a number of technology development activities have been performed.</em></td> </tr> <tr> <td>15:00</td> <td>3.3.2</td> <td><b>LEGATO: LOW-ENERGY, SECURE, AND RESILIENTTOOLSET FOR HETEROGENEOUS COMPUTING</b><br /> <b>Speaker</b>:<br /> Valerio Schiavoni, University of Neuchâtel, CH<br /> <b>Authors</b>:<br /> Behzad Salami<sup>1</sup>, Konstantinos Parasyris<sup>1</sup>, Adrian Cristal<sup>1</sup>, Osman Unsal<sup>1</sup>, Xavier Martorell<sup>1</sup>, Paul Carpenter<sup>1</sup>, Raul De La Cruz<sup>1</sup>, Leonardo Bautista<sup>1</sup>, Daniel Jimenez<sup>1</sup>, Carlos Alvarez<sup>1</sup>, Saber Nabavi<sup>1</sup>, Sergi Madonar<sup>1</sup>, Miquel Pericàs<sup>2</sup>, Pedro Trancoso<sup>2</sup>, Mustafa Abduljabbar<sup>2</sup>, Jing Chen<sup>2</sup>, Pirah Noor Soomro<sup>2</sup>, Madhavan Manivannan<sup>2</sup>, Micha von dem Berge<sup>3</sup>, Stefan Krupop<sup>3</sup>, Frank Klawonn<sup>4</sup>, Amani Mihklafi<sup>4</sup>, Sigrun May<sup>4</sup>, Tobias Becker<sup>5</sup>, Georgi Gaydadjiev<sup>5</sup>, Hans Salomonsson<sup>6</sup>, Devdatt Dubhashi<sup>6</sup>, Oron Port<sup>7</sup>, Yoav Etsion<sup>8</sup>, Le Quoc Do<sup>9</sup>, Christof Fetzer<sup>9</sup>, Martin Kaiser<sup>10</sup>, Nils Kucza<sup>10</sup>, Jens Hagemeyer<sup>10</sup>, René Griessl<sup>10</sup>, Lennart Tigges<sup>10</sup>, Kevin Mika<sup>10</sup>, Arne Hüffmeier<sup>10</sup>, Marcelo Pasin<sup>11</sup>, Valerio Schiavoni<sup>11</sup>, Isabelly Rocha<sup>11</sup>, Christian Göttel<sup>11</sup> and Pascal Felber<sup>11</sup><br /> <sup>1</sup>BSC, ES; <sup>2</sup>Chalmers, SE; <sup>3</sup>Christmann Informationstechnik + Medien GmbH &amp; Co. KG, DE; <sup>4</sup>Helmholtz-Zentrum für Infektionsforschung GmbH, DE; <sup>5</sup>MAXELER, GB; <sup>6</sup>MIS, SE; <sup>7</sup>TECHNION, IL; <sup>8</sup>Technion, IL; <sup>9</sup>TU Dresden, DE; <sup>10</sup>UNIBI, DE; <sup>11</sup>UNINE, CH<br /> <em><b>Abstract</b><br /> The LEGaTO project leverages task-based programming models to provide a software ecosystem for Made-in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged cloud/HPC, balanced with the security and resilience challenges. LEGaTO is an ongoing three-year EU H2020 project started in December 2017.</em></td> </tr> <tr> <td>15:30</td> <td>3.3.3</td> <td><b>EFFICIENT COMPILATION AND EXECUTION OF JVM-BASED DATA PROCESSING FRAMEWORKS ON HETEROGENEOUS CO-PROCESSORS</b><br /> <b>Speaker</b>:<br /> Athanasios Stratikopoulos, The University of Manchester, GB<br /> <b>Authors</b>:<br /> Christos Kotselidis<sup>1</sup>, Ioannis Komnios<sup>2</sup>, Orestis Akrivopoulos<sup>3</sup>, Sebastian Bress<sup>4</sup>, Katerina Doka<sup>5</sup>, Hazeef Mohammed<sup>6</sup>, Georgios Mylonas<sup>7</sup>, Vassilis Spitadakis<sup>8</sup>, Daniel Strimpel<sup>9</sup>, Juan Fumero<sup>1</sup>, Foivos S. 
Zakkak<sup>1</sup>, Michail Papadimitriou<sup>1</sup>, Maria Xekalaki<sup>1</sup>, Nikos Foutris<sup>1</sup>, Athanasios Stratikopoulos<sup>1</sup>, Nectarios Koziris<sup>5</sup>, Ioannis Konstantinou<sup>5</sup>, Ioannis Mytilinis<sup>5</sup>, Constantinos Bitsakos<sup>5</sup>, Christos Tsalidis<sup>8</sup>, Christos Tselios<sup>3</sup>, Nikolaos Kanakis<sup>3</sup>, Clemens Lutz<sup>4</sup>, Viktor Rosenfeld<sup>4</sup> and Volker Markl<sup>4</sup><br /> <sup>1</sup>The University of Manchester, GB; <sup>2</sup>Exus Ltd., US; <sup>3</sup>Spark Works ITC Ltd., GB; <sup>4</sup>German Research Center for Artificial Intelligence, DE; <sup>5</sup>National TU Athens, GR; <sup>6</sup>Kaleao Ltd., GB; <sup>7</sup>Computer Technology Institute &amp; Press Diophantus, GR; <sup>8</sup>Neurocom Luxembourg, LU; <sup>9</sup>IProov Ltd., GB<br /> <em><b>Abstract</b><br /> This paper addresses the fundamental question of how modern Big Data frameworks can dynamically and transparently exploit heterogeneous hardware accelerators. After presenting the major challenges that have to be addressed towards this goal, we describe our proposed architecture for automatic and transparent hardware acceleration of Big Data frameworks and applications. Our vision is to retain the uniform programming model of Big Data frameworks and enable automatic, dynamic Just-In-Time compilation of the candidate code segments that benefit from hardware acceleration to the corresponding format. In conjunction with machine learning-based device selection that respects user-defined constraints (e.g., cost, time), we enable dynamic code execution on GPUs and FPGAs transparently to the user. In addition, we dynamically re-steer execution at runtime based on the availability of resources. Our preliminary results demonstrate that our approach can accelerate an existing Apache Flink application by up to 16.5x.</em></td> </tr> <tr> <td>16:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="3.4">3.4 Accelerating Neural Networks and Vision Workloads</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 14:30 - 16:00<br /> <b>Location / Room:</b> Stendhal</p> <p><b>Chair:</b><br /> Leonidas Kosmidis, BSC, ES</p> <p><b>Co-Chair:</b><br /> Georgios Keramidas, Aristotle University of Thessaloniki/Think Silicon S.A., GR</p> <p>This session presents different solutions to accelerate emerging applications. The papers include various microarchitecture techniques as well as complete SoC and RISC-V based solutions. More fine-grained techniques, such as fast computations on sparse matrices, are also presented.
Vision applications are represented by the popular vSLAM, while various types and forms of emerging Neural Networks (such as Recurrent, Quantized, and Siamese NNs) are considered.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>3.4.1</td> <td><b>PSB-RNN: A PROCESSING-IN-MEMORY SYSTOLIC ARRAY ARCHITECTURE USING BLOCK CIRCULANT MATRICES FOR RECURRENT NEURAL NETWORKS</b><br /> <b>Speaker</b>:<br /> Nagadastagiri Challapalle, Pennsylvania State University, US<br /> <b>Authors</b>:<br /> Nagadastagiri Challapalle<sup>1</sup>, Sahithi Rampalli<sup>1</sup>, Makesh Tarun Chandran<sup>1</sup>, Gurpreet Singh Kalsi<sup>2</sup>, John (Jack) Sampson<sup>1</sup>, Sreenivas Subramoney<sup>2</sup> and Vijaykrishnan Narayanan<sup>1</sup><br /> <sup>1</sup>Pennsylvania State University, US; <sup>2</sup>Intel Labs, IN<br /> <em><b>Abstract</b><br /> Recurrent Neural Networks (RNNs) are widely used in Natural Language Processing (NLP) applications as they inherently capture contextual information across spatial and temporal dimensions. Compared to other classes of neural networks, RNNs have more weight parameters as they primarily consist of fully connected layers. Recently, several techniques such as weight pruning, zero-skipping, and block circulant compression have been introduced to reduce the storage and access requirements of RNN weight parameters. In this work, we present a ReRAM crossbar based processing-in-memory (PIM) architecture with systolic dataflow incorporating block circulant compression for RNNs. The block circulant compression decomposes the operations in a fully connected layer into a series of Fourier transforms and point-wise operations, resulting in reduced space and computational complexity. We formulate the Fourier transform and point-wise operations into in-situ multiply-and-accumulate (MAC) operations mapped to ReRAM crossbars for high energy efficiency and throughput. We also incorporate systolic dataflow for communication within the crossbar arrays, in contrast to broadcast and multicast communications, to further improve energy efficiency. The proposed architecture achieves average improvements in compute efficiency of 44x and 17x over a custom FPGA architecture and a conventional crossbar-based architecture implementation, respectively.</em></td> </tr> <tr> <td>15:00</td> <td>3.4.2</td> <td><b>XPULPNN: ACCELERATING QUANTIZED NEURAL NETWORKS ON RISC-V PROCESSORS THROUGH ISA EXTENSIONS</b><br /> <b>Speaker</b>:<br /> Angelo Garofalo, Università di Bologna, IT<br /> <b>Authors</b>:<br /> Angelo Garofalo<sup>1</sup>, Giuseppe Tagliavini<sup>1</sup>, Francesco Conti<sup>2</sup>, Davide Rossi<sup>1</sup> and Luca Benini<sup>2</sup><br /> <sup>1</sup>Università di Bologna, IT; <sup>2</sup>ETH Zurich, CH / Università di Bologna, IT<br /> <em><b>Abstract</b><br /> Strongly quantized fixed-point arithmetic is considered the key direction to enable the inference of CNNs on low-power, resource-constrained edge devices. However, the deployment of highly quantized Neural Networks at the extreme edge of IoT, on fully programmable MCUs, is currently limited by the lack of support, at the Instruction Set Architecture (ISA) level, for sub-byte fixed-point data types, making it necessary to add numerous instructions for packing and unpacking data when running low-bitwidth (i.e. 2- and 4-bit) QNN kernels, creating a bottleneck for the performance and energy efficiency of QNN inference.
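<br /><br />As a rough illustration of this packing/unpacking overhead, consider the following minimal C sketch (it assumes 2-bit weights packed sixteen per 32-bit word; the encoding and names are illustrative, not the authors' kernels):<br />
<pre>
#include &lt;stdint.h&gt;

/* Dot product with 2-bit weights packed 16 per 32-bit word. On an ISA
   without sub-byte support, every weight costs extra shift/mask
   instructions before the one useful multiply-accumulate. */
static int32_t dot_2bit(const uint32_t *w_packed, const int8_t *x, int n)
{
    int32_t acc = 0;
    for (int i = 0; i &lt; n; i++) {
        uint32_t word = w_packed[i / 16];
        /* unpack: shift + mask + re-bias, pure overhead */
        int32_t w = (int32_t)((word &gt;&gt; (2 * (i % 16))) &amp; 0x3) - 2;
        acc += w * (int32_t)x[i];   /* the useful MAC */
    }
    return acc;
}
</pre>
Most of the inner loop is spent unpacking rather than computing; sub-byte instructions of the kind presented next fold that work into the MAC itself.<br /><br />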
In this work we present a set of extensions to the RISC-V ISA, aimed at boosting the energy efficiency of low-bitwidth QNNs on low-power microcontroller-class cores. The microarchitecture supporting the new extensions is built on top of a RISC-V core featuring instruction set extensions targeting energy-efficient digital signal processing. To evaluate the extensions, we integrated the core into a full microcontroller system, synthesized and placed-and-routed in 22nm FDX technology. QNN convolution kernels, implemented on the new core, run 5.3× and 8.9× faster when considering 4- and 2-bit data operands, respectively, compared to the baseline processor supporting only 8-bit SIMD instructions. With a peak of 279 GMAC/s/W, the proposed solution achieves 9× better energy efficiency compared to the baseline and two orders of magnitude better energy efficiency compared to state-of-the-art microcontrollers.</em></td> </tr> <tr> <td>15:30</td> <td>3.4.3</td> <td><b>SNA: A SIAMESE NETWORK ACCELERATOR TO EXPLOIT THE MODEL-LEVEL PARALLELISM OF HYBRID NETWORK STRUCTURE</b><br /> <b>Speaker</b>:<br /> Xingbin Wang, Chinese Academy of Sciences, CN<br /> <b>Authors</b>:<br /> Xingbin Wang, Boyan Zhao, Rui Hou and Dan Meng, Chinese Academy of Sciences, CN<br /> <em><b>Abstract</b><br /> The Siamese network is a compute-intensive learning model with growing applicability in a wide range of domains. However, state-of-the-art deep neural network (DNN) accelerators do not work efficiently for Siamese networks, as their designs do not account for the algorithmic properties of these networks. In this paper, we propose a Siamese network accelerator called SNA, the first Simultaneous Multi-Threading (SMT) hardware architecture to perform Siamese network inference with high performance and energy efficiency. We devise an adaptive inter-model computing resource partitioning and a flexible on-chip buffer management mechanism based on model parallelism and the SMT design philosophy. Our architecture is implemented in Verilog and synthesized in a 65nm technology using Synopsys design tools. We also evaluate it with several typical Siamese networks. Compared to a state-of-the-art accelerator, the SNA architecture offers, on average, a 2.1x speedup and a 1.48x energy reduction.</em></td> </tr> <tr> <td>15:45</td> <td>3.4.4</td> <td><b>HCVEACC: A HIGH-PERFORMANCE AND ENERGY-EFFICIENT ACCELERATOR FOR TRACKING TASK IN VSLAM SYSTEM</b><br /> <b>Speaker</b>:<br /> Meng Liu, Chinese Academy of Sciences, CN<br /> <b>Authors</b>:<br /> Li Renwei, Wu Junning, Liu Meng, Chen Zuding, Zhou Shengang and Feng Shanggong, Chinese Academy of Sciences, CN<br /> <em><b>Abstract</b><br /> Visual SLAM (vSLAM) is a critical computer vision technology that is able to build a map of an unknown environment and simultaneously perform localization, leveraging the partially built map. While several software SLAM processing frameworks exist, the underlying general-purpose processors still hardly achieve real-time SLAM at a reasonably low cost. In this paper, we propose HcveAcc, the first specialized CMOS-based hardware accelerator to optimize the tracking task in vSLAM systems with high performance and energy efficiency. HcveAcc targets the dominant time overhead in the tracking process: high-density feature extraction and high-precision descriptor generation. It provides a configurable hardware architecture that handles higher-resolution image data.
We have implemented HcveAcc in a 28nm CMOS technology using commercial EDA tools and evaluated it on the EuRoC and TUM datasets to demonstrate its robustness and accuracy in the SLAM tracking procedure. Our results show that HcveAcc achieves a 4.3X speedup while consuming much less energy compared with state-of-the-art FPGA solutions.</em></td> </tr> <tr> <td style="width:40px;">16:01</td> <td><a href="#IP1">IP1-13</a>, 55</td> <td><b>ASCELLA: ACCELERATING SPARSE COMPUTATION BY ENABLING STREAM ACCESSES TO MEMORY</b><br /> <b>Speaker</b>:<br /> Bahar Asgari, Georgia Tech, US<br /> <b>Authors</b>:<br /> Bahar Asgari, Ramyad Hadidi and Hyesoon Kim, Georgia Tech, US<br /> <em><b>Abstract</b><br /> Sparse computations dominate a wide range of applications from scientific problems to graph analytics. The main characteristic of sparse computations, indirect memory accesses, prevents them from effectively achieving high performance on general-purpose processors. Therefore, hardware accelerators have been proposed for sparse problems. For these accelerators, the storage format and the decompression mechanism are crucial but have seen less attention in prior work. To address this gap, we propose Ascella, an accelerator for sparse computations, which, besides enabling a smooth stream of data and parallel computation, provides a fast decompression mechanism. Our implementation on a ZYNQ FPGA shows that, on average, Ascella executes sparse problems up to 5.1x as fast as prior work.</em></td> </tr> <tr> <td style="width:40px;">16:02</td> <td><a href="#IP1">IP1-14</a>, 645</td> <td><b>ACCELERATION OF PROBABILISTIC REASONING THROUGH CUSTOM PROCESSOR ARCHITECTURE</b><br /> <b>Speaker</b>:<br /> Nimish Shah, KU Leuven, BE<br /> <b>Authors</b>:<br /> Nimish Shah, Laura I. Galindez Olascoaga, Wannes Meert and Marian Verhelst, KU Leuven, BE<br /> <em><b>Abstract</b><br /> Probabilistic reasoning is an essential tool for robust decision-making systems because of its ability to explicitly handle real-world uncertainty, constraints and causal relations. Consequently, researchers are developing hybrid models by combining Deep Learning with Probabilistic reasoning for safety-critical applications like self-driving vehicles, autonomous drones, etc. However, probabilistic reasoning kernels do not execute efficiently on CPUs or GPUs. This paper, therefore, proposes a custom programmable processor to accelerate sum-product networks, an important probabilistic reasoning execution kernel. The processor has a datapath architecture and memory hierarchy optimized for sum-product network execution. Experimental results show that the processor, while requiring fewer computational and memory units, achieves a 12x throughput benefit over the Nvidia Jetson TX2 embedded GPU platform.</em></td> </tr> <tr> <td>16:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="3.5">3.5 Parallel real-time systems</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 14:30 - 16:00<br /> <b>Location / Room:</b> Bayard</p> <p><b>Chair:</b><br /> Liliana Cucu-Grosjean, Inria, FR</p> <p><b>Co-Chair:</b><br /> Antoine Bertout, ENSMA, FR</p> <p>This session presents novel techniques to enable parallel execution in real-time systems.
More precisely, the papers solve limitations of previous DAG models, devise tool chains to ensure WCET bounds, correct results on heterogeneous processors, and consider wireless networks with application-oriented scheduling.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>3.5.1</td> <td><b>ON THE VOLUME CALCULATION FOR CONDITIONAL DAG TASKS: HARDNESS AND ALGORITHMS</b><br /> <b>Speaker</b>:<br /> Jinghao Sun, Northeastern University, CN<br /> <b>Authors</b>:<br /> Jinghao Sun<sup>1</sup>, Yaoyao Chi<sup>1</sup>, Tianfei Xu<sup>1</sup>, Lei Cao<sup>1</sup>, Nan Guan<sup>2</sup>, Zhishan Guo<sup>3</sup> and Wang Yi<sup>4</sup><br /> <sup>1</sup>Northeastern University, CN; <sup>2</sup>The Hong Kong Polytechnic University, CN; <sup>3</sup>University of Central Florida, US; <sup>4</sup>Uppsala universitet, SE<br /> <em><b>Abstract</b><br /> The hardness of analyzing conditional directed acyclic graph (DAG) tasks has remained unknown so far. For example, previous research asserted that the conditional DAG's volume can be computed in polynomial time. However, this research assumes well-nested structures that are recursively composed of single-source-single-sink parallel and conditional components. For conditional DAGs in general that do not comply with this assumption, the hardness and algorithms of volume computation are still open. In this paper, we construct counterexamples to show that previous work cannot provide a safe upper bound on the conditional DAG's volume in general. Moreover, we prove that the volume computation problem for conditional DAGs is strongly NP-hard. Finally, we propose an exact algorithm for computing the conditional DAG's volume. Experiments show that our method can significantly improve the accuracy of the conditional DAG's volume estimation.</em></td> </tr> <tr> <td>15:00</td> <td>3.5.2</td> <td><b>WCET-AWARE CODE GENERATION AND COMMUNICATION OPTIMIZATION FOR PARALLELIZING COMPILERS</b><br /> <b>Speaker</b>:<br /> Simon Reder, Karlsruhe Institute of Technology, DE<br /> <b>Authors</b>:<br /> Simon Reder and Juergen Becker, Karlsruhe Institute of Technology, DE<br /> <em><b>Abstract</b><br /> High performance demands of present and future embedded applications increase the need for multi-core processors in hard real-time systems. Challenges in static multi-core WCET analysis and the more complex design of parallel software, however, oppose the adoption of multi-core processors in that area. Automated parallelization is a promising approach to solve these issues, but specialized solutions are required to preserve static analyzability. With a WCET-aware parallelizing transformation, this work presents a novel solution for an important building block of a real-time capable parallelizing compiler. The approach includes a technique to optimize communication and synchronization in the parallelized program and supports complex memory hierarchies consisting of both shared and core-private memory segments. In an experiment with four different applications, the parallelization improved the WCET by up to a factor of 3.2 on 4 cores.
The studied optimization technique and the support for shared memories contribute significantly to these results.</em></td> </tr> <tr> <td>15:30</td> <td>3.5.3</td> <td><b>TEMPLATE SCHEDULE CONSTRUCTION FOR GLOBAL REAL-TIME SCHEDULING ON UNRELATED MULTIPROCESSOR PLATFORMS</b><br /> <b>Authors</b>:<br /> Antoine Bertout<sup>1</sup>, Joel Goossens<sup>2</sup>, Emmanuel Grolleau<sup>3</sup> and Xavier Poczekajlo<sup>4</sup><br /> <sup>1</sup>LIAS, Université de Poitiers, ISAE-ENSMA, FR; <sup>2</sup>ULB, BE; <sup>3</sup>LIAS, ISAE-ENSMA, Université de Poitiers, FR; <sup>4</sup>Université libre de Bruxelles, BE<br /> <em><b>Abstract</b><br /> The seminal work on the global real-time scheduling of periodic tasks on unrelated multiprocessor platforms is based on a two-step method. First, the workload of each task is distributed over the processors, and it is proved that the success of this first step ensures the existence of a feasible schedule. Second, a method for the construction of a template schedule from the workload assignment is presented. In this work, we review the seminal work and show, by using a counterexample, that this second step is incomplete. Thus, we propose, and prove correct, a novel and efficient algorithm to build the template schedule.</em></td> </tr> <tr> <td>15:45</td> <td>3.5.4</td> <td><b>APPLICATION-AWARE SCHEDULING OF NETWORKED APPLICATIONS OVER THE LOW-POWER WIRELESS BUS</b><br /> <b>Speaker</b>:<br /> Kacper Wardega, Boston University, US<br /> <b>Authors</b>:<br /> Kacper Wardega and Wenchao Li, Boston University, US<br /> <em><b>Abstract</b><br /> Recent successes of wireless networked systems in advancing industrial automation and in spawning the Internet of Things paradigm motivate the adoption of wireless networked systems in current and future safety-critical applications. As reliability is key in safety-critical applications, in this work we present NetDAG, a scheduler design and implementation suitable for real-time applications in the wireless setting. NetDAG is built upon the Low-Power Wireless Bus, a high-performance communication abstraction for wireless networked systems, and enables system designers to directly schedule applications under specified task-level real-time constraints. Access to real-time primitives in the scheduler permits efficient design exploration of trade-offs between power consumption and latency. Furthermore, NetDAG provides support for weakly hard real-time applications with deterministic guarantees, in addition to the heretofore considered soft real-time applications with probabilistic guarantees. We propose novel abstraction techniques for reasoning about conjunctions of weakly hard constraints and show how such abstractions can be used to handle the significant scheduling difficulties brought on by networked components with weakly hard behaviors.</em></td> </tr> <tr> <td style="width:40px;">16:01</td> <td><a href="#IP1">IP1-15</a>, 453</td> <td><b>A PERFORMANCE ANALYSIS FRAMEWORK FOR REAL-TIME SYSTEMS SHARING MULTIPLE RESOURCES</b><br /> <b>Speaker</b>:<br /> Shayan Tabatabaei Nikkhah, Eindhoven University of Technology, NL<br /> <b>Authors</b>:<br /> Shayan Tabatabaei Nikkhah, Marc Geilen, Dip Goswami and Kees Goossens, Eindhoven University of Technology, NL<br /> <em><b>Abstract</b><br /> Timing properties of applications strongly depend on the resources that are allocated to them.
Applications often have multiple resource requirements, all of which must be met for them to proceed. Performance analysis of event-based systems has been widely studied in the literature. However, the proposed works consider only one resource requirement for each application task. Additionally, they mainly focus on the rate at which resources serve applications (e.g., power, instructions or bits per second), whereas another aspect of resources, namely their provided capacity (e.g., energy, memory ranges, FPGA regions), has been ignored. In this work, we propose a mathematical framework to describe the provisioning rate and capacity of various types of resources. Additionally, we consider the simultaneous use of multiple resources. Conservative bounds on the response times of events and their backlog are computed. We prove that the bounds are monotone in event arrivals and in required and provided rate and capacity, which enables verification of real-time application performance based on worst-case characterizations. The applicability of our framework is shown in a case study.</em></td> </tr> <tr> <td style="width:40px;">16:02</td> <td><a href="#IP1">IP1-16</a>, 778</td> <td><b>SCALING UP THE MEMORY INTERFERENCE ANALYSIS FOR HARD REAL-TIME MANY-CORE SYSTEMS</b><br /> <b>Speaker</b>:<br /> Matheus Schuh, Verimag / Kalray, FR<br /> <b>Authors</b>:<br /> Matheus Schuh<sup>1</sup>, Maximilien Dupont de Dinechin<sup>2</sup>, Matthieu Moy<sup>3</sup> and Claire Maiza<sup>4</sup><br /> <sup>1</sup>Verimag / Kalray, FR; <sup>2</sup>ENS Paris / ENS Lyon / LIP, FR; <sup>3</sup>ENS Lyon / LIP, FR; <sup>4</sup>Grenoble INP / Verimag, FR<br /> <em><b>Abstract</b><br /> In RTNS 2016, Rihani et al. proposed an algorithm to compute the impact of memory-access interference on the timing of a task graph. It calculates a static, time-triggered schedule, i.e. a release date and a worst-case response time for each task. The task graph is a DAG, typically obtained by compilation of a high-level dataflow language, and the tool assumes a previously determined mapping and execution order. The algorithm is precise, but suffers from a high O(n^4) complexity, n being the number of input tasks. Since we target many-core platforms with tens or hundreds of cores, applications likely to exploit the parallelism of these platforms are too large to be handled by this algorithm in reasonable time. This paper proposes a new algorithm that solves the same problem. Instead of performing global fixed-point iterations on the task graph, we compute the static schedule incrementally, reducing the complexity to O(n^2). Experimental results show a reduction from 535 seconds to 0.90 seconds on a benchmark with 384 tasks, i.e., 593 times faster.</em></td> </tr> <tr> <td>16:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="3.6">3.6 NoC in the age of neural networks and approximate computing</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 14:30 - 16:00<br /> <b>Location / Room:</b> Lesdiguières</p> <p><b>Chair:</b><br /> Romain Lemaire, CEA-Leti, FR</p> <p>To support innovative applications, new paradigms have been introduced, such as neural networks and approximate computing. This session presents different NoC-based architectures that support these computing approaches. In these advanced architectures, the NoC is no longer only a communication infrastructure but also part of the computing system.
Different mechanisms are introduced at the network level to support the application and thus enhance performance and power efficiency. As such, new NoC-based architectures must respond to highly demanding applications such as image segmentation and classification by taking advantage of new topologies (multiple layers, 3D…) and new technologies, such as ReRAM.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>3.6.1</td> <td><b>GRAMARCH: A GPU-RERAM BASED HETEROGENEOUS ARCHITECTURE FOR NEURAL IMAGE SEGMENTATION</b><br /> <b>Speaker</b>:<br /> Biresh Joardar, Washington State University, US<br /> <b>Authors</b>:<br /> Biresh Kumar Joardar<sup>1</sup>, Nitthilan Kannappan Jayakodi<sup>1</sup>, Jana Doppa<sup>1</sup>, Partha Pratim Pande<sup>1</sup>, Hai (Helen) Li<sup>2</sup> and Krishnendu Chakrabarty<sup>3</sup><br /> <sup>1</sup>Washington State University, US; <sup>2</sup>Duke University, US / TU Munich, DE; <sup>3</sup>Duke University, US<br /> <em><b>Abstract</b><br /> Deep Neural Networks (DNNs) employed for image segmentation are computationally more expensive and complex compared to the ones used for classification. However, manycore architectures to accelerate the training of these DNNs are relatively unexplored. Resistive random-access memory (ReRAM)-based architectures offer a promising alternative to commonly used GPU-based platforms for training DNNs. However, due to their low-precision storage capability, they cannot support all DNN layers and suffer from accuracy loss of the learned models. To address these challenges, in this paper we propose a heterogeneous architecture, GRAMARCH, that combines the benefits of ReRAM and GPUs simultaneously by using a high-throughput 3D Network-on-Chip. Experimental results indicate that by suitably mapping DNN layers to GRAMARCH, it is possible to achieve up to 33.4X better performance compared to conventional GPUs.</em></td> </tr> <tr> <td>15:00</td> <td>3.6.2</td> <td><b>AN APPROXIMATE MULTIPLANE NETWORK-ON-CHIP</b><br /> <b>Speaker</b>:<br /> Xiaohang Wang, South China University of Technology, CN<br /> <b>Authors</b>:<br /> Ling Wang<sup>1</sup>, Xiaohang Wang<sup>2</sup> and Yadong Wang<sup>1</sup><br /> <sup>1</sup>Harbin Institute of Technology, CN; <sup>2</sup>South China University of Technology, CN<br /> <em><b>Abstract</b><br /> The increasing communication demands in chip multiprocessors (CMPs) and the many error-tolerant applications are driving the approximate design of the network-on-chip (NoC) for power-efficient packet delivery. However, current approximate NoC designs achieve improvements in network performance or dynamic power savings at the cost of additional circuit design and increased area overhead. In this paper, we propose a novel approximate multiplane NoC (AMNoC) that provides low-latency transfer for latency-sensitive packets and minimizes the power consumption of approximable packets through a lossy bufferless subnetwork. The AMNoC also includes a regular buffered subnetwork to guarantee the lossless delivery of non-approximable packets. Evaluations show that, compared with a single-plane buffered NoC, the AMNoC reduces the average latency by 41.9%.
In addition, the AMNoC achieves 48.6% and 53.4% savings in power consumption and area overhead, respectively.</em></td> </tr> <tr> <td>15:30</td> <td>3.6.3</td> <td><b>SHENJING: A LOW POWER RECONFIGURABLE NEUROMORPHIC ACCELERATOR WITH PARTIAL-SUM AND SPIKE NETWORKS-ON-CHIP</b><br /> <b>Speaker</b>:<br /> Bo Wang, National University of Singapore, SG<br /> <b>Authors</b>:<br /> Bo Wang, Jun Zhou, Weng-Fai Wong and Li-Shiuan Peh, National University of Singapore, SG<br /> <em><b>Abstract</b><br /> The next wave of on-device AI will likely require energy-efficient deep neural networks. Brain-inspired spiking neural networks (SNNs) have been identified as a promising candidate. Doing away with the need for multipliers significantly reduces energy. For on-device applications, besides computation, communication also incurs a significant amount of energy and time. In this paper, we propose Shenjing, a configurable SNN architecture which fully exposes all on-chip communications to software, enabling software mapping of SNN models with high accuracy at low power. Unlike prior SNN architectures such as TrueNorth, Shenjing does not require any model modification or retraining for the mapping. We show that conventional artificial neural networks (ANNs) such as multilayer perceptrons, convolutional neural networks, as well as the latest residual neural networks, can be mapped successfully onto Shenjing, realizing ANNs with the SNN's energy efficiency. For the MNIST inference problem using a multilayer perceptron, we were able to achieve an accuracy of 96% while consuming just 1.26 mW using 10 Shenjing cores.</em></td> </tr> <tr> <td style="width:40px;">16:00</td> <td><a href="#IP1">IP1-17</a>, 139</td> <td><b>LIGHTWEIGHT ANONYMOUS ROUTING IN NOC BASED SOCS</b><br /> <b>Speaker</b>:<br /> Prabhat Mishra, University of Florida, US<br /> <b>Authors</b>:<br /> Subodha Charles, Megan Logan and Prabhat Mishra, University of Florida, US<br /> <em><b>Abstract</b><br /> The System-on-Chip (SoC) supply chain is widely acknowledged as a major source of security vulnerabilities. Potentially malicious third-party IPs integrated on the same Network-on-Chip (NoC) with the trusted components can lead to security and trust concerns. While secure communication is a well-studied problem in the computer networks domain, it is not feasible to implement those solutions on resource-constrained SoCs. In this paper, we present a lightweight anonymous routing protocol for communication between IP cores in NoC-based SoCs. Our method eliminates the major overhead associated with traditional anonymous routing protocols while ensuring that the desired security goals are met.
Experimental results demonstrate that existing security solutions on the NoC can introduce significant (1.5X) performance degradation, whereas our approach provides the same security features with a minor (4%) impact on performance.</em></td> </tr> <tr> <td>16:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="3.7">3.7 Augmented and Assisted Living: A reality</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 14:30 - 16:00<br /> <b>Location / Room:</b> Berlioz</p> <p><b>Chair:</b><br /> Graziano Pravadelli, Università di Verona, IT</p> <p><b>Co-Chair:</b><br /> Vassilis Pavlidis, Aristotle University of Thessaloniki, GR</p> <p>Novel solutions for healthcare and ambient assisted living: innovative brain-computer interfaces, cancer prediction systems, and energy-efficient ECG and wearable systems.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>3.7.1</td> <td><b>COMPRESSING SUBJECT-SPECIFIC BRAIN-COMPUTER INTERFACE MODELS INTO ONE MODEL BY SUPERPOSITION IN HYPERDIMENSIONAL SPACE</b><br /> <b>Speaker</b>:<br /> Michael Hersche, ETH Zurich, CH<br /> <b>Authors</b>:<br /> Michael Hersche, Philipp Rupp, Luca Benini and Abbas Rahimi, ETH Zurich, CH<br /> <em><b>Abstract</b><br /> Accurate multiclass classification of electroencephalography (EEG) signals is still a challenging task towards the development of reliable motor imagery brain-computer interfaces (MI-BCIs). Deep learning algorithms have recently been used in this area to deliver compact and accurate models. Reaching a high level of accuracy requires storing subject-specific trained models, which cannot be achieved with an otherwise compact model trained globally across all subjects. In this paper, we propose a new methodology that closes the gap between these two extreme modeling approaches: we reduce the overall storage requirements by superimposing many subject-specific models into one single model such that it can be reliably decomposed, after retraining, into its constituent models while providing a trade-off between compression ratio and accuracy. Our method makes use of the unexploited capacity of trained models by orthogonalizing parameters in a hyperdimensional space, followed by iterative retraining to compensate for the noisy decomposition. This method can be applied to various layers of deep inference models. Experimental results on the 4-class BCI competition IV-2a dataset show that our method exploits the unutilized capacity for compression and surpasses the accuracy of two state-of-the-art networks: (1) it compresses the smallest network, EEGNet [1], by 1.9x, and increases its accuracy by 2.41% (74.73% vs. 72.32%); (2) using a relatively larger Shallow ConvNet [2], our method achieves 2.95x compression as well as 1.4% higher accuracy (75.05% vs.
73.59%).</em></td> </tr> <tr> <td>15:00</td> <td>3.7.2</td> <td><b>A NOVEL FPGA-BASED SYSTEM FOR TUMOR GROWTH PREDICTION</b><br /> <b>Speaker</b>:<br /> Yannis Papaefstathiou, Aristotle University of Thessaloniki, GR<br /> <b>Authors</b>:<br /> Konstantinos Malavazos<sup>1</sup>, Maria Papadogiorgaki<sup>1</sup>, Pavlos Malakonakis<sup>1</sup> and Ioannis Papaefstathiou<sup>2</sup><br /> <sup>1</sup>TU Crete, GR; <sup>2</sup>Aristotle University of Thessaloniki, GR<br /> <em><b>Abstract</b><br /> An emerging trend in the biomedical community is to create models that take advantage of the increasingly available computational power in order to manage and analyze new biological data as well as to model complex biological processes. Such biomedical software applications require significant computational resources since they process and analyze large amounts of data, such as medical image sequences. This paper presents a novel FPGA-based system that implements a new model for the prediction of the spatio-temporal evolution of glioma. Glioma is a rapidly evolving type of brain cancer, well known for its aggressive and diffusive behavior. The developed system simulates the glioma tumor growth in the brain tissue, which consists of different anatomic structures, by utilizing individual MRI slices. The presented innovative hardware system is more than 60% faster than a high-end server consisting of 20 physical cores (and 40 virtual ones) and more than 28x more energy efficient.</em></td> </tr> <tr> <td>15:30</td> <td>3.7.3</td> <td><b>AN EVENT-BASED SYSTEM FOR LOW-POWER ECG QRS COMPLEX DETECTION</b><br /> <b>Speaker</b>:<br /> Silvio Zanoli, EPFL, CH<br /> <b>Authors</b>:<br /> Silvio Zanoli<sup>1</sup>, Tomas Teijeiro<sup>1</sup>, Fabio Montagna<sup>2</sup> and David Atienza<sup>1</sup><br /> <sup>1</sup>EPFL, CH; <sup>2</sup>Università di Bologna, IT<br /> <em><b>Abstract</b><br /> One of the greatest challenges in the design of modern wearable devices is energy efficiency. While data processing and communication have received a lot of attention from industry and academia, leading to highly efficient microcontrollers and transmission devices, sensor data acquisition in medical devices is still based on a conservative paradigm that requires regular sampling at the Nyquist rate of the target signal. This requirement is usually excessive for sparse and highly non-stationary signals, leading to data overload and a waste of resources in the full processing pipeline. In this work, we propose a new system to create event-based heart-rate analysis devices, including a novel algorithm for QRS detection that is able to process electrocardiogram signals acquired irregularly and well below the theoretically required Nyquist rate. This technique allows us to drastically reduce the average sampling frequency of the signal and, hence, the energy needed to process it and extract the relevant information.
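<br /><br />The event-based acquisition idea can be pictured with a generic level-crossing sampler (a minimal C sketch; the threshold and the synthetic signal are assumptions for illustration, not the paper's algorithm):<br />
<pre>
#include &lt;math.h&gt;
#include &lt;stdio.h&gt;

#define PI    3.14159265358979
#define DELTA 0.1   /* level-crossing threshold */

int main(void)
{
    double last = 0.0;
    int events = 0;
    for (int k = 0; k &lt; 1000; k++) {                /* 4 s on a 250 Hz grid */
        double t = k / 250.0;
        double v = 0.05 * sin(2 * PI * t)             /* slow baseline wander */
                 + (fmod(t, 1.0) &lt; 0.02 ? 1.0 : 0.0); /* crude 1 Hz "QRS"     */
        if (fabs(v - last) &gt; DELTA) {                 /* emit a sample only    */
            last = v;                                 /* when the signal moves */
            events++;
        }
    }
    printf("stored %d of 1000 uniform samples\n", events);
    return 0;
}
</pre>
Only the steep QRS edges generate events, which is why the average sampling rate, and with it the processing energy, drops so sharply.<br /><br />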
We implemented both the proposed event-based algorithm and a state-of-the-art version based on regular Nyquist-rate sampling on an ultra-low-power hardware platform, and the experimental results show that the event-based version reduces the runtime energy consumption by up to 15.6 times, while the detection performance is maintained at an average F1 score of 99.5%.</em></td> </tr> <tr> <td>15:45</td> <td>3.7.4</td> <td><b>SEMI-AUTONOMOUS PERSONAL CARE ROBOTS INTERFACE DRIVEN BY EEG SIGNALS DIGITIZATION</b><br /> <b>Speaker</b>:<br /> Daniela De Venuto, Politecnico di Bari, IT<br /> <b>Authors</b>:<br /> Giovanni Mezzina and Daniela De Venuto, Politecnico di Bari, IT<br /> <em><b>Abstract</b><br /> In this paper, we propose an innovative architecture that merges the advantages of Personal Care Robots (PCRs) with a novel Brain-Computer Interface (BCI) to carry out assistive tasks, aiming to reduce the burden on caregivers. The BCI is based on movement-related potentials (MRPs) and exploits EEG from 8 smart wireless electrodes placed on the sensorimotor area. The collected data are first pre-processed and then sent to a novel Feature Extraction (FE) step. The FE stage is based on a symbolization algorithm, Local Binary Patterning, which adopts end-to-end binary operations. It strongly reduces the stage complexity, speeding up the BCI. The final discrimination of user intentions is entrusted to a linear Support Vector Machine (SVM). The BCI performance has been evaluated on four healthy young subjects. Experimental results showed a user intention recognition accuracy of ~84% with a timing of ~554 ms per decision. A proof of concept is presented, showing how the BCI-based binary decisions could be used to drive the PCR up to a requested object, expressing the will to keep it (having it delivered to the user) or to continue the search.</em></td> </tr> <tr> <td style="width:40px;">16:01</td> <td><a href="#IP1">IP1-18</a>, 216</td> <td><b>A NON-INVASIVE WEARABLE BIOIMPEDANCE SYSTEM TO WIRELESSLY MONITOR BLADDER FILLING</b><br /> <b>Speaker</b>:<br /> Michele Magno, ETH Zurich, CH<br /> <b>Authors</b>:<br /> Markus Reichmuth, Simone Schuerle and Michele Magno, ETH Zurich, CH<br /> <em><b>Abstract</b><br /> Monitoring of renal function can be crucial for patients in acute care settings. Commonly, during postsurgical surveillance, urinary catheters are employed to assess the urine output accurately. However, as with any external device inserted into the body, the use of these catheters carries a significant risk of infection. In this paper, we present a non-invasive method to measure the fill rate of the bladder, and thus the rate of renal clearance, via an external bioimpedance sensor system, avoiding the use of urinary catheters and thereby eliminating the risk of infection and improving patient comfort. We design and propose a 4-electrode front-end and the whole wearable, wireless system with low power and accuracy in mind.
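<br /><br />The rationale for the 4-electrode front-end can be illustrated with the basic tetrapolar arithmetic (a minimal C sketch; the excitation and sensed values are assumptions for illustration, not measurements from the paper):<br />
<pre>
#include &lt;stdio.h&gt;

/* Tetrapolar measurement: current through the outer electrode pair,
   voltage sensed on the inner pair, so the electrode-skin impedance
   largely drops out of the reading. */
int main(void)
{
    double i_rms = 100e-6;  /* assumed 100 uA excitation current   */
    double v_rms = 48e-3;   /* assumed sensed differential voltage */
    printf("|Z| = %.0f ohm\n", v_rms / i_rms);  /* tissue path only */
    return 0;
}
</pre>
<br />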
The results demonstrate the accuracy of the sensors and a low power consumption of only 80 µW when duty-cycled at one acquisition every 5 minutes, which makes this battery-operated wearable device a long-term monitoring system.</em></td> </tr> <tr> <td style="width:40px;">16:02</td> <td><a href="#IP1">IP1-19</a>, 906</td> <td><b>INFINIWOLF: ENERGY EFFICIENT SMART BRACELET FOR EDGE COMPUTING WITH DUAL SOURCE ENERGY HARVESTING</b><br /> <b>Speaker</b>:<br /> Michele Magno, ETH Zurich, CH<br /> <b>Authors</b>:<br /> Michele Magno<sup>1</sup>, Xiaying Wang<sup>1</sup>, Manuel Eggimann<sup>1</sup>, Lukas Cavigelli<sup>1</sup> and Luca Benini<sup>2</sup><br /> <sup>1</sup>ETH Zurich, CH; <sup>2</sup>Università di Bologna and ETH Zurich, IT<br /> <em><b>Abstract</b><br /> This work presents InfiniWolf, a novel multi-sensor smartwatch that can achieve self-sustainability by exploiting thermal and solar energy harvesting while performing computationally demanding tasks. The smartwatch embeds both a System-on-Chip (SoC) with an ARM Cortex-M processor and Bluetooth Low Energy (BLE), and Mr. Wolf, an open-hardware RISC-V based parallel ultra-low-power processor that boosts the on-board processing capabilities by more than one order of magnitude, while also increasing energy efficiency. We demonstrate its functionality based on a sample application scenario performing stress detection with multi-layer artificial neural networks on a wearable multi-sensor bracelet. Experimental results show the benefits of Mr. Wolf in terms of energy efficiency and latency over an ARM Cortex-M4F microcontroller, and the possibility, under specific assumptions, of being self-sustainable using thermal and solar energy harvesting while performing up to 24 stress classifications per minute in indoor conditions.</em></td> </tr> <tr> <td>16:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="UB03">UB03 Session 3</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 15:00 - 17:30<br /> <b>Location / Room:</b> Booth 11, Exhibition Area</p> <p> </p> <table> <tbody> <tr> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> <tr> <td>UB03.1</td> <td><b>FLETCHER: TRANSPARENT GENERATION OF HARDWARE INTERFACES FOR ACCELERATING BIG DATA APPLICATIONS</b><br /> <b>Authors</b>:<br /> Zaid Al-Ars, Johan Peltenburg, Jeroen van Straten, Matthijs Brobbel and Joost Hoozemans, TU Delft, NL<br /> <em><b>Abstract</b><br /> This demo, created by TU Delft, presents a software-hardware framework that allows for an efficient integration of FPGA hardware accelerators both on edge devices and in the cloud. The framework, called Fletcher, is used to automatically generate data communication interfaces in hardware based on the widely used big data format Apache Arrow. This provides two distinct advantages. On the one hand, since the accelerators use the same data format as the software, data communication bottlenecks can be reduced. On the other hand, since a standardized data format is used, this allows for easy-to-use interfaces on the accelerator side, thereby reducing the design and development time. The demo shows how to use Fletcher for big data acceleration to decompress Snappy-compressed files and perform filtering on the whole Wikipedia body of text.
The demo achieves a processing throughput of 25 GB/s.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3134.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB03.2</td> <td><b>ELSA: EIGENVALUE BASED HYBRID LINEAR SYSTEM ABSTRACTION: BEHAVIORAL MODELING OF TRANSISTOR-LEVEL CIRCUITS USING AUTOMATIC ABSTRACTION TO HYBRID AUTOMATA</b><br /> <b>Authors</b>:<br /> Ahmad Tarraf and Lars Hedrich, University of Frankfurt, DE<br /> <em><b>Abstract</b><br /> Model abstraction of transistor-level circuits, while preserving an accurate behavior, is still an open problem. In this demo an approach is presented that automatically generates a hybrid automaton (HA) with linear states from an existing circuit netlist. The approach starts with a netlist at transistor level with full SPICE accuracy and ends at a system-level description of the circuit in MATLAB or in Verilog-A. The resulting hybrid automaton exhibits linear behavior as well as the technology-dependent nonlinear behavior, e.g. limiting. The accuracy and speed-up of the generated Verilog-A models are evaluated on several transistor-level circuit abstractions, from simple operational amplifiers up to complex filters. Moreover, we verify the equivalence between the generated model and the original circuit. For the generated models in MATLAB syntax, a reachability analysis is performed using the reachability tool CORA.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3097.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB03.3</td> <td><b>PAFUSI: PARTICLE FILTER FUSION ASIC FOR INDOOR POSITIONING</b><br /> <b>Authors</b>:<br /> Christian Schott, Marko Rößler, Daniel Froß, Marcel Putsche and Ulrich Heinkel, TU Chemnitz, DE<br /> <em><b>Abstract</b><br /> The meaning of data acquired from IoT devices is heavily enhanced if the global or local position of their acquisition is known. Both the infrastructure for indoor positioning and the IoT device itself call for small, energy-efficient but powerful devices that provide the location awareness. We propose the PAFUSI, a hardware implementation of a UWB position estimation algorithm that fulfils these requirements. Our design fuses distance measurements to fixed points in an environment to calculate the position in 3D space and is capable of using different positioning technologies like GPS, DecaWave or Nanotron as data sources simultaneously. Our design comprises an estimator, which processes the data by means of a Sequential Monte Carlo method, and a microcontroller core, which configures and controls the measurement unit and analyses the results of the estimator. The PAFUSI is manufactured as a monolithic integrated ASIC in a multi-project wafer in UMC's 65nm process.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3102.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB03.4</td> <td><b>FPGA-DSP: A PROTOTYPE FOR HIGH QUALITY DIGITAL AUDIO SIGNAL PROCESSING BASED ON AN FPGA</b><br /> <b>Authors</b>:<br /> Bernhard Riess and Christian Epe, University of Applied Sciences Düsseldorf, DE<br /> <em><b>Abstract</b><br /> Our demonstrator presents a prototype of a new digital audio signal processing system based on an FPGA. It achieves a performance that up to now has been reserved for costly high-end solutions.
The main components of the system are an analog/digital converter, an FPGA to perform the digital signal processing tasks, and a digital/analog converter, implemented on a printed circuit board. To demonstrate the quality of the audio signal processing, infinite impulse response filters, finite impulse response filters, and a delay effect were realized in VHDL. More advanced signal processing systems can easily be implemented due to the flexibility of the FPGA. Measured results were compared to state-of-the-art audio signal processing systems with respect to size, performance and cost. Our prototype outperforms systems of the same price in quality, and matches the quality of systems at no more than 20% of their price. Examples of the performance of our system can be heard in the demo.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3100.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB03.5</td> <td><b>LEARNV: A RISC-V BASED EMBEDDED SYSTEM DESIGN FRAMEWORK FOR EDUCATION AND RESEARCH DEVELOPMENT</b><br /> <b>Authors</b>:<br /> Noureddine Ait Said and Mounir Benabdenbi, TIMA Laboratory, FR<br /> <em><b>Abstract</b><br /> Designing a modern System on a Chip is based on the joint design of hardware and software (co-design). However, understanding the tight relationship between hardware and software is not straightforward. Moreover, validating new concepts in SoC design from the idea to the hardware implementation is time-consuming and often slowed by legacy issues (intellectual property of hardware blocks and expensive commercial tools). To overcome these issues, we propose to use the open-source Rocket Chip environment for educational purposes, combined with the open-source lowRISC architecture to implement a custom SoC design on an FPGA board. The demonstration will present how students and engineers can benefit from the environment to deepen their knowledge in HW and SW co-design. Using the lowRISC architecture, an image classification application based on the use of CNNs will serve as a demonstrator of the whole open-source hardware and software flow and will be mapped onto a Nexys A7 FPGA board.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3116.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB03.6</td> <td><b>CSI-REPUTE: A LOW POWER EMBEDDED DEVICE CLUSTERING APPROACH TO GENOME READ MAPPING</b><br /> <b>Authors</b>:<br /> Tousif Rahman<sup>1</sup>, Sidharth Maheshwari<sup>1</sup>, Rishad Shafik<sup>1</sup>, Ian Wilson<sup>1</sup>, Alex Yakovlev<sup>1</sup> and Amit Acharyya<sup>2</sup><br /> <sup>1</sup>Newcastle University, GB; <sup>2</sup>IIT Hyderabad, IN<br /> <em><b>Abstract</b><br /> The big data challenge of genomics is rooted in its requirement for extensive computational capability, which results in large power and energy consumption. To encourage widespread usage of genome assembly tools, there must be a transition from the existing, predominantly software-based mapping tools, optimized for homogeneous high-performance systems, to more heterogeneous, low-power and cost-effective mapping systems.
This demonstration will show a cluster system implementation of the REPUTE algorithm (an OpenCL-based read mapping tool for embedded genomics), where cluster nodes are composed of low-power single-board computer (SBC) devices and the algorithm is deployed on each node, spreading the genomic workload. We propose this working concept prototype to challenge current conventional high-performance many-core CPU-based cluster nodes. The demonstration will highlight the advantage, in the power and energy domains, of using SBC clusters, enabling potential solutions for low-cost genomics.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3121.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB03.7</td> <td><b>FUZZING EMBEDDED BINARIES LEVERAGING SYSTEMC-BASED VIRTUAL PROTOTYPES</b><br /> <b>Authors</b>:<br /> Vladimir Herdt<sup>1</sup>, Daniel Grosse<sup>2</sup> and Rolf Drechsler<sup>2</sup><br /> <sup>1</sup>DFKI, DE; <sup>2</sup>University of Bremen / DFKI GmbH, DE<br /> <em><b>Abstract</b><br /> Verification of embedded Software (SW) binaries is very important. Mainly, simulation-based methods are employed that execute (randomly) generated test-cases on Virtual Prototypes (VPs). However, to enable a comprehensive VP-based verification, sophisticated test-case generation techniques need to be integrated. Our demonstrator combines state-of-the-art fuzzing techniques with SystemC-based VPs to enable a fast and accurate verification of embedded SW binaries. The fuzzing process is guided by the coverage of the embedded SW as well as the SystemC-based peripherals of the VP. The effectiveness of our approach is demonstrated by our experiments, using RISC-V SW binaries as an example.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3098.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB03.8</td> <td><b>WALLANCE: AN ALTERNATIVE TO BLOCKCHAIN FOR IOT</b><br /> <b>Authors</b>:<br /> Loic Dalmasso, Florent Bruguier, Pascal Benoit and Achraf Lamlih, Université de Montpellier, FR<br /> <em><b>Abstract</b><br /> Since the expansion of the Internet of Things (IoT), connected devices have become smart and autonomous. Their exponentially increasing number and their use in many application domains result in a huge potential for cybersecurity threats. Taking into account the evolution of the IoT, security and interoperability are the main challenges in ensuring the reliability of the information. Blockchain technology provides a new approach to handling trust in a decentralized network. However, current blockchain implementations cannot be used in the IoT domain because of their huge demands on computing power and storage.
This demonstrator presents a lightweight distributed ledger protocol dedicated to IoT applications, reducing the computing power and storage requirements, handling scalability, and ensuring the reliability of information.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3119.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB03.9</td> <td><b>RUMORE: A FRAMEWORK FOR RUNTIME MONITORING AND TRACE ANALYSIS FOR COMPONENT-BASED EMBEDDED SYSTEMS DESIGN FLOW</b><br /> <b>Authors</b>:<br /> Vittoriano Muttillo<sup>1</sup>, Luigi Pomante<sup>1</sup>, Giacomo Valente<sup>1</sup>, Hector Posadas<sup>2</sup>, Javier Merino<sup>2</sup> and Eugenio Villar<sup>2</sup><br /> <sup>1</sup>University of L'Aquila, IT; <sup>2</sup>University of Cantabria, ES<br /> <em><b>Abstract</b><br /> The purpose of this demonstrator is to introduce runtime monitoring infrastructures and to analyze trace data. The goal is to show how different monitoring requirements can be addressed by defining a general reference architecture that can be adapted to different scenarios. Starting from design artifacts generated by a system engineering modeling tool, a custom HW monitoring system infrastructure will be presented. This sub-system will be able to generate runtime artifacts for runtime verification. We will show how the RUMORE framework provides round-trip support in the development chain, injecting monitoring requirements from design models down to code and its execution on the platform, and feeding trace data back to the models, where the expected behavior is then compared with the actual behavior. This approach will be used towards optimizing design models for specific properties (e.g., system performance).</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3126.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB03.10</td> <td><b>FASTHERMSIM: FAST AND ACCURATE THERMAL SIMULATIONS FROM CHIPLETS TO SYSTEM</b><br /> <b>Authors</b>:<br /> Yu-Min Lee, Chi-Wen Pan, Li-Rui Ho and Hong-Wen Chiou, National Chiao Tung University, TW<br /> <em><b>Abstract</b><br /> Recently, owing to the scaling down of technology and 2.5D/3D integration, the power densities and temperatures of chips have been increasing significantly. Though commercial computational fluid dynamics tools can provide accurate thermal maps, their huge runtime makes thermal-aware design inefficient. Thus, we have developed a chip/package/system-level thermal analyzer, called FasThermSim, which can assist you in improving your design under thermal constraints in the pre- and post-silicon stages. In FasThermSim, we consider three heat transfer modes: conduction, convection, and thermal radiation. We convert them to temperature-independent terms by linearization methods and build a compact thermal model (CTM). By applying numerical methods to the CTM, the steady-state and transient thermal profiles can be solved efficiently without loss of accuracy.
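<br /><br />The kind of computation a CTM enables can be shown with a minimal one-node example (a generic sketch of backward-Euler CTM integration under assumed parameters, not FasThermSim's implementation):<br />
<pre>
#include &lt;stdio.h&gt;

/* One thermal node: C dT/dt = P - G (T - T_amb), advanced with the
   unconditionally stable backward Euler step
   T[k+1] = (C/dt * T[k] + P + G * T_amb) / (C/dt + G). */
int main(void)
{
    double C = 0.5, G = 0.04;      /* assumed J/K and W/K          */
    double T_amb = 25.0, P = 2.0;  /* ambient (degC) and power (W) */
    double T = T_amb, dt = 1.0;    /* 1 s step                     */
    for (int k = 0; k &lt; 600; k++)  /* 10-minute transient          */
        T = (C / dt * T + P + G * T_amb) / (C / dt + G);
    printf("T(10 min) = %.2f degC, steady state = %.2f degC\n",
           T, T_amb + P / G);
    return 0;
}
</pre>
Backward Euler is the natural choice here because it remains stable for arbitrarily large time steps, which matters when sweeping long transients.<br /><br />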
Finally, an easy-to-use, flexible, and compatible thermal analysis tool with a graphical user interface is implemented for your design.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3137.pdf">More information ...</a></b></em></td> </tr> <tr> <td>17:30</td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="IP1">IP1 Interactive Presentations</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 16:00 - 16:30<br /> <b>Location / Room:</b> Poster Area</p> <p>Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.</p> <table> <tbody> <tr> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> <tr> <td style="width:40px;">IP1-1</td> <td><b>DYNUNLOCK: UNLOCKING SCAN CHAINS OBFUSCATED USING DYNAMIC KEYS</b><br /> <b>Speaker</b>:<br /> Nimisha Limaye, New York University, US<br /> <b>Authors</b>:<br /> Nimisha Limaye<sup>1</sup> and Ozgur Sinanoglu<sup>2</sup><br /> <sup>1</sup>New York University, US; <sup>2</sup>New York University Abu Dhabi, AE<br /> <em><b>Abstract</b><br /> Outsourcing in the semiconductor industry has opened up avenues for faster and cost-effective chip manufacturing. However, it has also introduced untrusted entities with malicious intent to steal intellectual property (IP), overproduce circuits, insert hardware Trojans, or counterfeit chips. Recently, a defense was proposed that obfuscates scan access based on a dynamic key that is initially generated from a secret key but changes in every clock cycle. This defense can be considered the most rigorous among all scan locking techniques. In this paper, we propose an attack that remodels this defense into one that can be broken by the SAT attack, and we note that our attack can also be adjusted to break other, less rigorous scan locking techniques (those whose key is updated less frequently).</em></td> </tr> <tr> <td style="width:40px;">IP1-2</td> <td><b>CMOS IMPLEMENTATION OF SWITCHING LATTICES</b><br /> <b>Speaker</b>:<br /> Levent Aksoy, Istanbul TU, TR<br /> <b>Authors</b>:<br /> Ismail Cevik, Levent Aksoy and Mustafa Altun, Istanbul TU, TR<br /> <em><b>Abstract</b><br /> Switching lattices consisting of four-terminal switches have been introduced as area-efficient structures to realize logic functions. Many optimization algorithms, both exact and heuristic, have been proposed to realize logic functions on lattices with the fewest four-terminal switches. Hence, the computing potential of switching lattices has been justified adequately in the literature. However, the same cannot be said for their physical implementation. There have been conceptual ideas for the technology development of switching lattices, but no concrete and directly applicable technology has been proposed yet. In this study, we show that switching lattices can be directly and efficiently implemented using a standard CMOS process. To realize a given logic function on a switching lattice, we propose static and dynamic logic solutions. The proposed circuits, as well as the compared conventional ones, are designed and simulated in the Cadence environment using the TSMC 65nm CMOS process.
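<br /><br />For readers unfamiliar with the lattice model, the following minimal C sketch evaluates a four-terminal switching lattice as a top-to-bottom path search (the lattice contents and input assignment are illustrative, not the paper's designs):<br />
<pre>
#include &lt;stdio.h&gt;

/* f(x) = 1 iff the switches turned ON by the inputs form a path from
   the top row to the bottom row of the lattice. */
#define R 3
#define C 3

static const int lit[R][C] = { {0,1,0}, {1,0,1}, {0,1,0} }; /* literal per site */
static int on[R][C], seen[R][C];

static int dfs(int r, int c)
{
    if (r &lt; 0 || r &gt;= R || c &lt; 0 || c &gt;= C || seen[r][c] || !on[r][c])
        return 0;
    if (r == R - 1) return 1;               /* reached the bottom edge */
    seen[r][c] = 1;
    return dfs(r+1,c) || dfs(r-1,c) || dfs(r,c+1) || dfs(r,c-1);
}

int main(void)
{
    const int x[2] = {1, 0};                /* input assignment        */
    for (int r = 0; r &lt; R; r++)
        for (int c = 0; c &lt; C; c++)
            on[r][c] = x[lit[r][c]];        /* switch conducts if its literal is 1 */
    int f = 0;
    for (int c = 0; c &lt; C; c++) {           /* try each top-row entry  */
        for (int r = 0; r &lt; R; r++)
            for (int cc = 0; cc &lt; C; cc++) seen[r][cc] = 0;
        f |= dfs(0, c);
    }
    printf("f(x) = %d\n", f);
    return 0;
}
</pre>
<br />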
Experimental post-layout results on logic functions show that switching lattices occupy a much smaller area than conventional CMOS implementations, while having competitive delay and power consumption.</em></td> </tr> <tr> <td style="width:40px;">IP1-3</td> <td><b>A TIMING UNCERTAINTY-AWARE CLOCK TREE TOPOLOGY GENERATION ALGORITHM FOR SINGLE FLUX QUANTUM CIRCUITS</b><br /> <b>Speaker</b>:<br /> Massoud Pedram, University of Southern California, US<br /> <b>Authors</b>:<br /> Soheil Nazar Shahsavani, Bo Zhang and Massoud Pedram, University of Southern California, US<br /> <em><b>Abstract</b><br /> This paper presents a low-cost, timing uncertainty-aware synchronous clock tree topology generation algorithm for single flux quantum (SFQ) logic circuits. The proposed method considers the criticality of the data paths in terms of timing slacks as well as the total wirelength of the clock tree, and generates a (height-)balanced binary clock tree using a bottom-up approach and an integer linear programming (ILP) formulation. The statistical timing analysis results for ten benchmark circuits show that the proposed method improves the total wirelength and the total negative hold slack by 4.2% and 64.6%, respectively, on average, compared with a wirelength-driven state-of-the-art balanced topology generation approach.</em></td> </tr> <tr> <td style="width:40px;">IP1-4</td> <td><b>SYMMETRY-BASED A/M-S BIST (SYMBIST): DEMONSTRATION ON A SAR ADC IP</b><br /> <b>Speaker</b>:<br /> Antonios Pavlidis, Sorbonne Université, CNRS, LIP6, FR<br /> <b>Authors</b>:<br /> Antonios Pavlidis<sup>1</sup>, Marie-Minerve Louerat<sup>1</sup>, Eric Faehn<sup>2</sup>, Anand Kumar<sup>3</sup> and Haralampos-G. Stratigopoulos<sup>1</sup><br /> <sup>1</sup>Sorbonne Université, CNRS, LIP6, FR; <sup>2</sup>STMicroelectronics, FR; <sup>3</sup>STMicroelectronics, IN<br /> <em><b>Abstract</b><br /> In this paper, we propose a defect-oriented Built-In Self-Test (BIST) paradigm for analog and mixed-signal (A/M-S) Integrated Circuits (ICs), called symmetry-based BIST (SymBIST). SymBIST exploits inherent symmetries in the design to generate invariances that should hold true only in defect-free operation. Violation of any of these invariances points to the detection of a defect. We demonstrate SymBIST on a 65nm 10-bit Successive Approximation Register (SAR) Analog-to-Digital Converter (ADC) IP by STMicroelectronics. SymBIST does not result in any performance penalty; it incurs an area overhead of less than 5%; the test time equals about 16x the time to convert an analog input sample; it can be interfaced with a 2-pin digital access mechanism; and it covers the entire A/M-S part of the IP, achieving a likelihood-weighted defect coverage higher than 85%.</em></td> </tr> <tr> <td style="width:40px;">IP1-5</td> <td><b>RANGE CONTROLLED FLOATING-GATE TRANSISTORS: A UNIFIED SOLUTION FOR UNLOCKING AND CALIBRATING ANALOG ICS</b><br /> <b>Speaker</b>:<br /> Yiorgos Makris, University of Texas at Dallas, US<br /> <b>Authors</b>:<br /> Sai Govinda Rao Nimmalapudi, Georgios Volanis, Yichuan Lu, Angelos Antonopoulos, Andrew Marshall and Yiorgos Makris, University of Texas at Dallas, US<br /> <em><b>Abstract</b><br /> Analog Floating-Gate Transistors (AFGTs) are commonly used to fine-tune the performance of analog integrated circuits (ICs) after fabrication, thereby enabling high yield despite component mismatch and variability in semiconductor manufacturing.
In this work, we propose a methodology that leverages such AFGTs to also prevent unauthorized use of analog ICs. Specifically, we introduce a locking mechanism that limits programming of AFGTs to a range which is inadequate for achieving the desired analog performance. Accordingly, our solution entails a two-step unlock-&amp;-calibrate process. In the first step, AFGTs must be programmed through a secret sequence of voltages within that range, called waypoints. Successfully following the waypoints unlocks the ability to program the AFGTs over their entire range. Thereafter, in the second step, the typical AFGT-based post-silicon calibration process can be applied to adjust the performance of the IC within its specifications. Protection against brute-force or intelligent attacks attempting to guess the unlocking sequence is ensured through the vast space of possible waypoints in the continuous (analog) domain. Feasibility and effectiveness of the proposed solution are demonstrated and evaluated on an Operational Transconductance Amplifier (OTA). To our knowledge, this is the first solution that leverages the power of analog keys and addresses both unlocking and calibration needs of analog ICs in a unified manner.</em></td> </tr> <tr> <td style="width:40px;">IP1-6</td> <td><b>TESTING THROUGH SILICON VIAS IN POWER DISTRIBUTION NETWORK OF 3D-IC WITH MANUFACTURING VARIABILITY CANCELLATION</b><br /> <b>Speaker</b>:<br /> Koutaro Hachiya, Teikyo Heisei University, JP<br /> <b>Authors</b>:<br /> Koutaro Hachiya<sup>1</sup> and Atsushi Kurokawa<sup>2</sup><br /> <sup>1</sup>Teikyo Heisei University, JP; <sup>2</sup>Hirosaki University, JP<br /> <em><b>Abstract</b><br /> To detect open defects of power TSVs (Through Silicon Vias) in PDNs (Power Distribution Networks) of stacked 3D-ICs, a method was proposed which measures resistances between power micro-bumps connected to the PDN and detects defects of TSVs by changes in the resistances. It suffers from manufacturing variabilities and must place one micro-bump directly under each TSV (direct-type placement style) to maximize its diagnostic performance, but the performance was not sufficient for practical applications. A variability cancellation method was also devised to improve the diagnostic performance. In this paper, a novel middle-type placement style is proposed which places one micro-bump between each pair of TSVs. Experimental simulations using a 3D-IC example show that the diagnostic performances of both the direct-type and the middle-type examples are improved by the variability cancellation and reach the practical level. The middle-type example outperforms the direct-type example in terms of the number of micro-bumps and the number of measurements.</em></td> </tr> <tr> <td style="width:40px;">IP1-7</td> <td><b>TFAPPROX: TOWARDS A FAST EMULATION OF DNN APPROXIMATE HARDWARE ACCELERATORS ON GPU</b><br /> <b>Speaker</b>:<br /> Zdenek Vasicek, Brno University of Technology, CZ<br /> <b>Authors</b>:<br /> Filip Vaverka, Vojtech Mrazek, Zdenek Vasicek and Lukas Sekanina, Brno University of Technology, CZ<br /> <em><b>Abstract</b><br /> Energy efficiency of hardware accelerators of deep neural networks (DNN) can be improved by introducing approximate arithmetic circuits. In order to quantify the error introduced by using these circuits and avoid expensive hardware prototyping, a software emulator of the DNN accelerator is usually executed on CPU or GPU. 
However, this emulation is typically two or three orders of magnitude slower than a software DNN implementation running on CPU or GPU and operating with standard floating point arithmetic instructions and common DNN libraries. The reason is that there is no hardware support for approximate arithmetic operations on common CPUs and GPUs and these operations have to be expensively emulated. In order to address this issue, we propose an efficient emulation method for approximate circuits utilized in a given DNN accelerator which is emulated on GPU. All relevant approximate circuits are implemented as look-up tables and accessed through a texture memory mechanism of CUDA capable GPUs. We exploit the fact that the texture memory is optimized for irregular read-only access and in some GPU architectures is even implemented as a dedicated cache. This technique allowed us to reduce the inference time of the emulated DNN accelerator by approximately 200x with respect to an optimized CPU version on complex DNNs such as ResNet. The proposed approach extends the TensorFlow library and is available online at <a href="https://github.com/ehw-fit/tf-approximate">https://github.com/ehw-fit/tf-approximate</a>.</em></td> </tr> <tr> <td style="width:40px;">IP1-8</td> <td><b>BINARY LINEAR ECCS OPTIMIZED FOR BIT INVERSION IN MEMORIES WITH ASYMMETRIC ERROR PROBABILITIES</b><br /> <b>Speaker</b>:<br /> Valentin Gherman, CEA, FR<br /> <b>Authors</b>:<br /> Valentin Gherman, Samuel Evain and Bastien Giraud, CEA, FR<br /> <em><b>Abstract</b><br /> Many memory types are asymmetric with respect to the error vulnerability of stored 0's and 1's. For instance, DRAM, STT-MRAM and NAND flash memories may suffer from asymmetric error rates. A recently proposed error-protection scheme consists of inverting memory words with too many vulnerable values before they are stored in an asymmetric memory. In this paper, a method is proposed for the optimization of systematic binary linear block error-correcting codes in order to maximize their impact when combined with memory word inversion.</em></td> </tr> <tr> <td style="width:40px;">IP1-9</td> <td><b>BELDPC: BIT ERRORS AWARE ADAPTIVE RATE LDPC CODES FOR 3D TLC NAND FLASH MEMORY</b><br /> <b>Speaker</b>:<br /> Meng Zhang, Huazhong University of Science &amp; Technology, CN<br /> <b>Authors</b>:<br /> Meng Zhang, Fei Wu, Qin Yu, Weihua Liu, Lanlan Cui, Yahui Zhao and Changsheng Xie, Huazhong University of Science &amp; Technology, CN<br /> <em><b>Abstract</b><br /> Three-dimensional (3D) NAND flash memory achieves high capacity and cell storage density by using multi-bit technology and a vertical stack architecture, but suffers degraded data reliability due to high raw bit error rates (RBER) caused by program/erase (P/E) cycles and retention periods. Low-density parity-check (LDPC) codes have become popular error-correcting technologies for improving data reliability due to their strong error correction capability, but they introduce more decoding iterations at higher RBER. To reduce decoding iterations, this paper proposes BeLDPC: bit errors aware adaptive rate LDPC codes for 3D triple-level cell (TLC) NAND flash memory. Firstly, bit error characteristics in 3D charge trap TLC NAND flash memory are studied on a real FPGA testing platform, including asymmetric bit flipping and temporal locality of bit errors. Then, based on these characteristics, a high-efficiency LDPC code is designed. 
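<p>The adaptive-rate idea behind BeLDPC can be pictured with a hypothetical sketch: estimate RBER from P/E cycles and retention, then pick a parity budget from a rate table. All model constants below are invented placeholders, not values from the paper:</p> <pre><code>
# Invented sketch of adaptive-rate code selection: the estimated raw
# bit error rate (RBER) grows with P/E cycles and retention time, and
# a stronger (lower-rate) LDPC code is chosen as RBER rises.
import math

def estimated_rber(pe_cycles, retention_days):
    # placeholder wear/retention model, not measured flash data
    return 1e-4 * math.exp(pe_cycles / 3000) * (1 + 0.02 * retention_days)

# (max tolerable RBER, parity bytes per 1 KiB codeword) - illustrative
RATE_TABLE = [(2e-4, 40), (8e-4, 72), (3e-3, 110), (1.0, 160)]

def pick_parity(pe_cycles, retention_days):
    rber = estimated_rber(pe_cycles, retention_days)
    for limit, parity in RATE_TABLE:
        if rber <= limit:
            return parity

print(pick_parity(500, 30), pick_parity(9000, 180))   # 40 160
</code></pre>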
Experimental results show BeLDPC can reduce decoding iterations under different P/E cycles and retention periods.</em></td> </tr> <tr> <td style="width:40px;">IP1-10</td> <td><b>POISONING THE (DATA) WELL IN ML-BASED CAD: A CASE STUDY OF HIDING LITHOGRAPHIC HOTSPOTS</b><br /> <b>Speaker</b>:<br /> Kang Liu, New York University, US<br /> <b>Authors</b>:<br /> Kang Liu, Benjamin Tan, Ramesh Karri and Siddharth Garg, New York University, US<br /> <em><b>Abstract</b><br /> Machine learning (ML) provides state-of-the-art performance in many parts of computer-aided design (CAD) flows. However, deep neural networks (DNNs) are susceptible to various adversarial attacks, including data poisoning to compromise training to insert backdoors. Sensitivity to training data integrity presents a security vulnerability, especially in light of malicious insiders who want to cause targeted neural network misbehavior. In this study, we explore this threat in lithographic hotspot detection via training data poisoning, where hotspots in a layout clip can be "hidden" at inference time by including a trigger shape in the input. We show that training data poisoning attacks are feasible and stealthy, demonstrating a backdoored neural network that performs normally on clean inputs but misbehaves on inputs when a backdoor trigger is present. Furthermore, our results raise some fundamental questions about the robustness of ML-based systems in CAD.</em></td> </tr> <tr> <td style="width:40px;">IP1-11</td> <td><b>SOLOMON: AN AUTOMATED FRAMEWORK FOR DETECTING FAULT ATTACK VULNERABILITIES IN HARDWARE</b><br /> <b>Speaker</b>:<br /> Milind Srivastava, IIT Madras, IN<br /> <b>Authors</b>:<br /> Milind Srivastava<sup>1</sup>, PATANJALI SLPSK<sup>1</sup>, Indrani Roy<sup>1</sup>, Chester Rebeiro<sup>1</sup>, Aritra Hazra<sup>2</sup> and Swarup Bhunia<sup>3</sup><br /> <sup>1</sup>IIT Madras, IN; <sup>2</sup>IIT Kharagpur, IN; <sup>3</sup>University of Florida, US<br /> <em><b>Abstract</b><br /> Fault attacks are potent physical attacks on crypto-devices. A single fault injected during encryption can reveal the cipher's secret key. In a hardware realization of an encryption algorithm, only a tiny fraction of the gates is exploitable by such an attack. Finding these vulnerable gates has been a manual and tedious task requiring considerable expertise. In this paper, we propose SOLOMON, the first automatic fault attack vulnerability detection framework for hardware designs. Given a cipher implementation, either at RTL or gate-level, SOLOMON uses formal methods to map vulnerable regions in the cipher algorithm to specific locations in the hardware, thus enabling targeted countermeasures to be deployed with much lower overheads. 
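<p>A toy version of the underlying question SOLOMON answers (which gate faults are observable at the output?) can be sketched by brute-force fault simulation on an invented three-gate netlist; SOLOMON itself uses formal methods at RTL or gate level rather than this exhaustive simulation:</p> <pre><code>
# Exhaustive single-bit-flip scan of a toy netlist: flag gates whose
# injected fault changes the circuit output for some input, i.e.
# candidate fault-attack targets. Netlist and gates are placeholders.
from itertools import product

GATES = {               # name: (op, fan-ins); 'a','b','c' are inputs
    'g1': ('and', ('a', 'b')),
    'g2': ('xor', ('g1', 'c')),
    'g3': ('or',  ('g1', 'g2')),   # g3 is the circuit output
}

def evaluate(inputs, flip=None):
    val = dict(inputs)
    for name, (op, ins) in GATES.items():   # insertion order = topological
        x, y = val[ins[0]], val[ins[1]]
        val[name] = {'and': x & y, 'or': x | y, 'xor': x ^ y}[op]
        if name == flip:
            val[name] ^= 1                  # inject a single bit-flip
    return val['g3']

vulnerable = [g for g in GATES
              if any(evaluate(dict(zip('abc', v)), flip=g)
                     != evaluate(dict(zip('abc', v)))
                     for v in product((0, 1), repeat=3))]
print(vulnerable)       # gates whose single fault reaches the output
</code></pre>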
We demonstrate the efficacy of the SOLOMON framework using three ciphers: AES, CLEFIA, and Simon.</em></td> </tr> <tr> <td style="width:40px;">IP1-12</td> <td><b>FORMAL SYNTHESIS OF MONITORING AND DETECTION SYSTEMS FOR SECURE CPS IMPLEMENTATIONS</b><br /> <b>Speaker</b>:<br /> Ipsita Koley, IIT Kharagpur, IN<br /> <b>Authors</b>:<br /> Ipsita Koley<sup>1</sup>, Saurav Kumar Ghosh<sup>1</sup>, Dey Soumyajit<sup>1</sup>, Debdeep Mukhopadhyay<sup>1</sup>, Amogh Kashyap K N<sup>2</sup>, Sachin Kumar Singh<sup>2</sup>, Lavanya Lokesh<sup>2</sup>, Jithin Nalu Purakkal<sup>2</sup> and Nishant Sinha<sup>2</sup><br /> <sup>1</sup>IIT Kharagpur, IN; <sup>2</sup>Robert Bosch Engineering and Business Solutions Private Limited, IN<br /> <em><b>Abstract</b><br /> We consider the problem of securing a given control loop implementation of a cyber-physical system (CPS) in the presence of Man-in-the-Middle attacks on data exchange between plant and controller over a compromised network. To this end, there exist various detection schemes that provide mathematical guarantees against such attacks for the theoretical control model. However, such guarantees may not hold for the actual control software implementation. In this article, we propose a formal approach towards synthesizing attack detectors with varying thresholds which can prevent performance-degrading stealthy attacks while minimizing false alarms.</em></td> </tr> <tr> <td style="width:40px;">IP1-13</td> <td><b>ASCELLA: ACCELERATING SPARSE COMPUTATION BY ENABLING STREAM ACCESSES TO MEMORY</b><br /> <b>Speaker</b>:<br /> Bahar Asgari, Georgia Tech, US<br /> <b>Authors</b>:<br /> Bahar Asgari, Ramyad Hadidi and Hyesoon Kim, Georgia Tech, US<br /> <em><b>Abstract</b><br /> Sparse computations dominate a wide range of applications from scientific problems to graph analytics. The main characteristic of sparse computations, indirect memory accesses, prevents them from effectively achieving high performance on general-purpose processors. Therefore, hardware accelerators have been proposed for sparse problems. For these accelerators, the storage format and the decompression mechanism are crucial but have seen less attention in prior work. To address this gap, we propose Ascella, an accelerator for sparse computations, which, besides enabling a smooth stream of data and parallel computation, proposes a fast decompression mechanism. Our implementation on a ZYNQ FPGA shows that, on average, Ascella executes sparse problems up to 5.1x as fast as prior work.</em></td> </tr> <tr> <td style="width:40px;">IP1-14</td> <td><b>ACCELERATION OF PROBABILISTIC REASONING THROUGH CUSTOM PROCESSOR ARCHITECTURE</b><br /> <b>Speaker</b>:<br /> Nimish Shah, KU Leuven, BE<br /> <b>Authors</b>:<br /> Nimish Shah, Laura I. Galindez Olascoaga, Wannes Meert and Marian Verhelst, KU Leuven, BE<br /> <em><b>Abstract</b><br /> Probabilistic reasoning is an essential tool for robust decision-making systems because of its ability to explicitly handle real-world uncertainty, constraints and causal relations. Consequently, researchers are developing hybrid models by combining Deep Learning with Probabilistic reasoning for safety-critical applications like self-driving vehicles, autonomous drones, etc. However, probabilistic reasoning kernels do not execute efficiently on CPUs or GPUs. This paper, therefore, proposes a custom programmable processor to accelerate sum-product networks, an important probabilistic reasoning execution kernel. 
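<p>The kernel in question is easy to picture: a sum-product network is evaluated bottom-up, with weighted sums over products of leaf probabilities. A minimal sketch with an invented network structure:</p> <pre><code>
# Minimal sum-product network (SPN) evaluation, the kernel such a
# processor accelerates; the structure and weights here are invented.
def spn_eval(node, leaves):
    kind, payload = node
    if kind == 'leaf':
        return leaves[payload]                 # probability of one variable
    if kind == 'prod':
        v = 1.0
        for child in payload:
            v *= spn_eval(child, leaves)
        return v
    # weighted sum node: payload is a list of (weight, child) pairs
    return sum(w * spn_eval(child, leaves) for w, child in payload)

# P = 0.6*P(a)P(b) + 0.4*P(c)
net = ('sum', [(0.6, ('prod', [('leaf', 'a'), ('leaf', 'b')])),
               (0.4, ('leaf', 'c'))])
print(spn_eval(net, {'a': 0.9, 'b': 0.5, 'c': 0.2}))  # 0.35
</code></pre>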
The processor has a datapath architecture and memory hierarchy optimized for sum-product network execution. Experimental results show that the processor, while requiring fewer computational and memory units, achieves a 12x throughput benefit over the Nvidia Jetson TX2 embedded GPU platform.</em></td> </tr> <tr> <td style="width:40px;">IP1-15</td> <td><b>A PERFORMANCE ANALYSIS FRAMEWORK FOR REAL-TIME SYSTEMS SHARING MULTIPLE RESOURCES</b><br /> <b>Speaker</b>:<br /> Shayan Tabatabaei Nikkhah, Eindhoven University of Technology, NL<br /> <b>Authors</b>:<br /> Shayan Tabatabaei Nikkhah<sup>1</sup>, Marc Geilen<sup>1</sup>, Dip Goswami<sup>1</sup> and Kees Goossens<sup>2</sup><br /> <sup>1</sup>Eindhoven University of Technology, NL; <sup>2</sup>Eindhoven University of Technology, NL<br /> <em><b>Abstract</b><br /> Timing properties of applications strongly depend on resources that are allocated to them. Applications often have multiple resource requirements, all of which must be met for them to proceed. Performance analysis of event-based systems has been widely studied in the literature. However, existing works consider only one resource requirement for each application task. Additionally, they mainly focus on the rate at which resources serve applications (e.g., power, instructions or bits per second), but another aspect of resources, which is their provided capacity (e.g., energy, memory ranges, FPGA regions), has been ignored. In this work, we propose a mathematical framework to describe the provisioning rate and capacity of various types of resources. Additionally, we consider the simultaneous use of multiple resources. Conservative bounds on response times of events and their backlog are computed. We prove that the bounds are monotone in event arrivals and in required and provided rate and capacity, which enables verification of real-time application performance based on worst-case characterizations. The applicability of our framework is shown in a case study.</em></td> </tr> <tr> <td style="width:40px;">IP1-16</td> <td><b>SCALING UP THE MEMORY INTERFERENCE ANALYSIS FOR HARD REAL-TIME MANY-CORE SYSTEMS</b><br /> <b>Speaker</b>:<br /> Matheus Schuh, Verimag / Kalray, FR<br /> <b>Authors</b>:<br /> Matheus Schuh<sup>1</sup>, Maximilien Dupont de Dinechin<sup>2</sup>, Matthieu Moy<sup>3</sup> and Claire Maiza<sup>4</sup><br /> <sup>1</sup>Verimag / Kalray, FR; <sup>2</sup>ENS Paris / ENS Lyon / LIP, FR; <sup>3</sup>ENS Lyon / LIP, FR; <sup>4</sup>Grenoble INP / Verimag, FR<br /> <em><b>Abstract</b><br /> In RTNS 2016, Rihani et al. proposed an algorithm to compute the impact of memory-access interference on the timing of a task graph. It calculates a static, time-triggered schedule, i.e. a release date and a worst-case response time for each task. The task graph is a DAG, typically obtained by compilation of a high-level dataflow language, and the tool assumes a previously determined mapping and execution order. The algorithm is precise, but suffers from a high O(n^4) complexity, n being the number of input tasks. Since we target many-core platforms with tens or hundreds of cores, applications likely to exploit the parallelism of these platforms are too large to be handled by this algorithm in reasonable time. This paper proposes a new algorithm that solves the same problem. Instead of performing global fixed-point iterations on the task graph, we compute the static schedule incrementally, reducing the complexity to O(n^2). 
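<p>The flavor of incremental schedule construction can be sketched as a single pass over the task DAG in execution order, deriving each release date from already-scheduled predecessors; the constant interference term below is a stand-in for the paper's actual memory-interference model:</p> <pre><code>
# Much-simplified incremental static schedule on a task DAG: one pass
# in a precomputed execution order, no global fixed-point iteration.
# The flat 'interference' constant is an invented placeholder.
def schedule(order, preds, wcet, interference=2):
    release, finish = {}, {}
    for t in order:
        release[t] = max((finish[p] for p in preds[t]), default=0)
        finish[t] = release[t] + wcet[t] + interference
    return release, finish

order = ['t1', 't2', 't3']
preds = {'t1': [], 't2': ['t1'], 't3': ['t1', 't2']}
wcet  = {'t1': 5, 't2': 3, 't3': 4}
print(schedule(order, preds, wcet))
</code></pre>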
Experimental results show a reduction from 535 seconds to 0.90 seconds on a benchmark with 384 tasks, i.e., 593 times faster.</em></td> </tr> <tr> <td style="width:40px;">IP1-17</td> <td><b>LIGHTWEIGHT ANONYMOUS ROUTING IN NOC BASED SOCS</b><br /> <b>Speaker</b>:<br /> Prabhat Mishra, University of Florida, US<br /> <b>Authors</b>:<br /> Subodha Charles, Megan Logan and Prabhat Mishra, University of Florida, US<br /> <em><b>Abstract</b><br /> The System-on-Chip (SoC) supply chain is widely acknowledged as a major source of security vulnerabilities. Potentially malicious third-party IPs integrated on the same Network-on-Chip (NoC) with the trusted components can lead to security and trust concerns. While secure communication is a well-studied problem in the computer networks domain, it is not feasible to implement those solutions on resource-constrained SoCs. In this paper, we present a lightweight anonymous routing protocol for communication between IP cores in NoC based SoCs. Our method eliminates the major overhead associated with traditional anonymous routing protocols while ensuring that the desired security goals are met. Experimental results demonstrate that existing security solutions on NoC can introduce significant (1.5X) performance degradation, whereas our approach provides the same security features with minor (4%) impact on performance.</em></td> </tr> <tr> <td style="width:40px;">IP1-18</td> <td><b>A NON-INVASIVE WEARABLE BIOIMPEDANCE SYSTEM TO WIRELESSLY MONITOR BLADDER FILLING</b><br /> <b>Speaker</b>:<br /> Michele Magno, ETH Zurich, CH<br /> <b>Authors</b>:<br /> Markus Reichmuth, Simone Schuerle and Michele Magno, ETH Zurich, CH<br /> <em><b>Abstract</b><br /> Monitoring of renal function can be crucial for patients in acute care settings. Commonly during postsurgical surveillance, urinary catheters are employed to assess the urine output accurately. However, as with any external device inserted into the body, the use of these catheters carries a significant risk of infection. In this paper, we present a non-invasive method to measure the fill rate of the bladder, and thus the rate of renal clearance, via an external bioimpedance sensor system to avoid the use of urinary catheters, thereby eliminating the risk of infections and improving patient comfort. We design a 4-electrode front-end and a complete wearable wireless system with low power and accuracy in mind. The results demonstrate the accuracy of the sensors and a low power consumption of only 80 µW with a duty cycle of one acquisition every 5 minutes, which makes this battery-operated wearable device suitable for long-term monitoring.</em></td> </tr> <tr> <td style="width:40px;">IP1-19</td> <td><b>INFINIWOLF: ENERGY EFFICIENT SMART BRACELET FOR EDGE COMPUTING WITH DUAL SOURCE ENERGY HARVESTING</b><br /> <b>Speaker</b>:<br /> Michele Magno, ETH Zurich, CH<br /> <b>Authors</b>:<br /> Michele Magno<sup>1</sup>, Xiaying Wang<sup>1</sup>, Manuel Eggimann<sup>1</sup>, Lukas Cavigelli<sup>1</sup> and Luca Benini<sup>2</sup><br /> <sup>1</sup>ETH Zurich, CH; <sup>2</sup>Università di Bologna and ETH Zurich, IT<br /> <em><b>Abstract</b><br /> This work presents InfiniWolf, a novel multi-sensor smartwatch that can achieve self-sustainability by exploiting thermal and solar energy harvesting while performing computationally demanding tasks. The smartwatch embeds both a System-on-Chip (SoC) with an ARM Cortex-M processor and Bluetooth Low Energy (BLE) and Mr. 
Wolf, an open-hardware RISC-V based parallel ultra-low-power processor that boosts the processing capabilities on board by more than one order of magnitude, while also increasing energy efficiency. We demonstrate its functionality based on a sample application scenario performing stress detection with multi-layer artificial neural networks on a wearable multi-sensor bracelet. Experimental results show the benefits in terms of energy efficiency and latency of Mr. Wolf over an ARM Cortex-M4F microcontroller and the possibility, under specific assumptions, of being self-sustainable using thermal and solar energy harvesting while performing up to 24 stress classifications per minute in indoor conditions.</em></td> </tr> </tbody> </table> <hr /> <h2 id="4.1">4.1 Hardware-enabled security</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 17:00 - 18:30<br /> <b>Location / Room:</b> Amphithéâtre Jean Prouve</p> <p><b>Chair:</b><br /> Marchand Cedric, Ecole Centrale de Lyon, FR</p> <p><b>Co-Chair:</b><br /> Hai Zhou, Northwestern University, US</p> <p>This session covers hardware-based design solutions to improve security. The papers in the session propose an NTT (Number Theoretic Transform) technique enabling faster polynomial multiplication, a reliable PUF-based key generation scheme, and a solution for estimating circuit de-obfuscation runtime. Post-Quantum cryptography and new attacks will be discussed throughout the session.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>4.1.1</td> <td><b>A FLEXIBLE AND SCALABLE NTT HARDWARE: APPLICATIONS FROM HOMOMORPHICALLY ENCRYPTED DEEP LEARNING TO POST-QUANTUM CRYPTOGRAPHY</b><br /> <b>Speaker</b>:<br /> Ahmet Can Mert, Sabanci University, TR<br /> <b>Authors</b>:<br /> Ahmet Can Mert<sup>1</sup>, Emre Karabulut<sup>2</sup>, Erdinc Ozturk<sup>1</sup>, Erkay Savas<sup>1</sup>, Michela Becchi<sup>2</sup> and Aydin Aysu<sup>2</sup><br /> <sup>1</sup>Sabanci University, TR; <sup>2</sup>North Carolina State University, US<br /> <em><b>Abstract</b><br /> The Number Theoretic Transform (NTT) enables faster polynomial multiplication and is becoming a fundamental component of next-generation cryptographic systems. NTT hardware designs have two prevalent problems related to design-time flexibility. First, algorithms have different arithmetic structures, causing the hardware designs to be manually tuned for each setting. Second, applications have diverse throughput/area needs, but the hardware has been designed for a fixed, pre-defined number of processing elements. This paper proposes a parametric NTT hardware generator that takes arithmetic configurations and the number of processing elements as inputs to produce efficient hardware with the desired parameters and throughput. We illustrate the employment of the proposed design in two applications with different needs: a homomorphically encrypted deep neural network inference (CryptoNets) and a post-quantum digital signature scheme (qTESLA). We propose the first NTT hardware acceleration for both applications on FPGAs. Compared to prior software and high-level synthesis solutions, the results show that our hardware can accelerate NTT up to 28x and 48x, respectively. 
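<p>As background, the NTT is the modular-arithmetic analogue of the DFT: pointwise products in the transform domain correspond to cyclic convolution (polynomial multiplication) of coefficients. A textbook O(n^2) sketch with toy parameters (q=17, n=4); the hardware implements the fast butterfly version of the same transform:</p> <pre><code>
# Textbook NTT over Z_q used for cyclic polynomial multiplication:
# multiply pointwise in the NTT domain, then invert. Parameters are
# toy values: q prime, w a primitive n-th root of unity mod q.
q, n, w = 17, 4, 4

def ntt(a, root):
    return [sum(a[j] * pow(root, i * j, q) for j in range(n)) % q
            for i in range(n)]

def intt(A):
    n_inv = pow(n, q - 2, q)          # Fermat inverse of n mod q
    a = ntt(A, pow(w, q - 2, q))      # transform with w^{-1}
    return [(x * n_inv) % q for x in a]

a, b = [1, 2, 0, 0], [3, 4, 0, 0]     # (1 + 2x) * (3 + 4x)
A, B = ntt(a, w), ntt(b, w)
print(intt([(x * y) % q for x, y in zip(A, B)]))   # [3, 10, 8, 0]
</code></pre>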
Therefore, our work paves the way for high-level, automated, and modular design of next-generation cryptographic hardware solutions.</em></td> </tr> <tr> <td>17:30</td> <td>4.1.2</td> <td><b>RELIABLE AND LIGHTWEIGHT PUF-BASED KEY GENERATION USING VARIOUS INDEX VOTING ARCHITECTURE</b><br /> <b>Speaker</b>:<br /> Jeong-Hyeon Kim, Sungkyunkwan University, KR<br /> <b>Authors</b>:<br /> Jeong-Hyeon Kim<sup>1</sup>, Ho-Jun Jo<sup>1</sup>, Kyung-kuk Jo<sup>1</sup>, Sunghee Cho<sup>1</sup>, Jaeyong Chung<sup>2</sup> and Joon-Sung Yang<sup>1</sup><br /> <sup>1</sup>Sungkyunkwan University, KR; <sup>2</sup>Incheon National University, KR<br /> <em><b>Abstract</b><br /> Physical Unclonable Functions (PUFs) can be utilized for secret key generation in security applications. Since the inherent randomness of a PUF can degrade its reliability, most existing PUF architectures include post-processing logic, such as error correction, to enhance reliability. However, these structures incur high costs in terms of implementation area and power consumption. This paper introduces a Various Index Voting Architecture (VIVA) that can enhance reliability with low overhead compared to conventional schemes. The proposed architecture is based on an index-based scheme with simple computation logic units and iterative operations to generate multiple indices for accurate key generation. Our evaluation results show that the proposed architecture reduces the hardware implementation overhead by 2x to more than 5x, without increasing the key generation failure probability compared to conventional approaches.</em></td> </tr> <tr> <td>18:00</td> <td>4.1.3</td> <td><b>ESTIMATING THE CIRCUIT DE-OBFUSCATION RUNTIME BASED ON GRAPH DEEP LEARNING</b><br /> <b>Speaker</b>:<br /> Gaurav Kolhe, George Mason University, US<br /> <b>Authors</b>:<br /> Zhiqian Chen<sup>1</sup>, Gaurav Kolhe<sup>2</sup>, Setareh Rafatirad<sup>2</sup>, Chang-Tien Lu<sup>1</sup>, Sai Manoj Pudukotai Dinakarrao<sup>2</sup>, Houman Homayoun<sup>2</sup> and Liang Zhao<sup>2</sup><br /> <sup>1</sup>Virginia Tech, US; <sup>2</sup>George Mason University, US<br /> <em><b>Abstract</b><br /> Circuit obfuscation has been proposed to protect digital integrated circuits (ICs) from different security threats such as reverse engineering by introducing ambiguity in the circuit, i.e., the addition of logic gates whose functionality cannot be determined easily by the attacker. In order to defeat such defenses, techniques such as Boolean satisfiability-checking (SAT)-based attacks were introduced. The SAT attack can potentially decrypt obfuscated circuits. However, the deobfuscation runtime can span a large range, from a few milliseconds to a few years or more, depending on the number and location of obfuscated gates, the topology of the obfuscated circuit, and the obfuscation technique used. To ensure the security of the deployed obfuscation mechanism, it is essential to accurately pre-estimate the deobfuscation time. Thereby, one can optimize the deployed defense in order to maximize the deobfuscation runtime. However, estimating the deobfuscation runtime is a challenging task due to 1) the complexity and heterogeneity of the graph-structured circuit, 2) the unknown and sophisticated mechanisms of the attackers for deobfuscation, and 3) efficiency and scalability requirements in practice. 
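<p>A generic index-based voting scheme (illustrative only, not the exact VIVA architecture of 4.1.2) can be sketched as: enrollment keeps the indices of the most stable response bits, and key regeneration majority-votes repeated reads at those indices:</p> <pre><code>
# Generic index-based PUF key sketch: cell values, noise model and all
# parameters are invented placeholders.
import random
random.seed(1)

def noisy_read(cells, p_flip):
    # flip each bit with probability p_flip
    # (int(u + p) equals 1 with probability p for uniform u in [0, 1))
    return [b ^ int(random.random() + p_flip) for b in cells]

def enroll(reads, key_len):
    # keep indices whose bits were steadiest across enrollment reads
    n, m = len(reads[0]), len(reads)
    ones = [sum(r[i] for r in reads) for i in range(n)]
    stability = [abs(2 * ones[i] - m) for i in range(n)]
    idx = sorted(range(n), key=stability.__getitem__, reverse=True)[:key_len]
    return idx, [round(ones[i] / m) for i in idx]

def regenerate(idx, reads):
    # majority vote per stored index over a few fresh reads
    return [round(sum(r[i] for r in reads) / len(reads)) for i in idx]

cells = [random.randint(0, 1) for _ in range(64)]
idx, key = enroll([noisy_read(cells, 0.05) for _ in range(15)], 16)
print(regenerate(idx, [noisy_read(cells, 0.05) for _ in range(5)]) == key)
</code></pre>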
To address the challenges mentioned above, this work proposes the first machine-learning framework that predicts the deobfuscation runtime based on graph deep learning. Specifically, we design a new model, ICNet, with new input and convolution layers to characterize the circuit's topology, whose output is then integrated by composite deep fully-connected layers to obtain the deobfuscation runtime. The proposed ICNet is an end-to-end framework that can automatically extract the determinant features required for deobfuscation runtime prediction. Extensive experiments on standard benchmarks demonstrate its effectiveness and efficiency beyond many competitive baselines.</em></td> </tr> <tr> <td style="width:40px;">18:30</td> <td><a href="#IP2">IP2-1</a>, 908</td> <td><b>SAMPLING FROM DISCRETE DISTRIBUTIONS IN COMBINATIONAL HARDWARE WITH APPLICATION TO POST-QUANTUM CRYPTOGRAPHY</b><br /> <b>Speaker</b>:<br /> Michael Lyons, George Mason University, US<br /> <b>Authors</b>:<br /> Michael Lyons and Kris Gaj, George Mason University, US<br /> <em><b>Abstract</b><br /> Random values from discrete distributions are typically generated from uniformly-random samples. A common technique is to use a cumulative distribution table (CDT) lookup for inversion sampling, but it is also possible to use Boolean functions to map a uniformly-random bit sequence into a value from a discrete distribution. This work presents a methodology for deriving such functions for any discrete distribution, encoding them in VHDL for implementation in combinational hardware, and (for moderate precision and sample space size) confirming the correctness of the produced distribution. The process is demonstrated using a discrete Gaussian distribution with a small sample space, but it is applicable to any discrete distribution with fixed parameters. Results are presented for sampling schemes from several submissions to the NIST PQC standardization process, comparing this method to CDT lookups on a Xilinx Artix-7 FPGA. The process produces compact solutions for distributions up to moderate size and precision.</em></td> </tr> <tr> <td style="width:40px;">18:31</td> <td><a href="#IP2">IP2-2</a>, 472</td> <td><b>ON THE PERFORMANCE OF NON-PROFILED DIFFERENTIAL DEEP LEARNING ATTACKS AGAINST AN AES ENCRYPTION ALGORITHM PROTECTED USING A CORRELATED NOISE HIDING COUNTERMEASURE</b><br /> <b>Speaker</b>:<br /> Amir Alipour, Grenoble INP Esisar, FR<br /> <b>Authors</b>:<br /> Amir Alipour<sup>1</sup>, Athanasios Papadimitriou<sup>2</sup>, Vincent Beroulle<sup>3</sup>, Ehsan Aerabi<sup>3</sup> and David Hely<sup>3</sup><br /> <sup>1</sup>University Grenoble Alpes, Grenoble INP ESISAR, LCIS Laboratory, FR; <sup>2</sup>University Grenoble Alpes, Grenoble INP ESISAR, ESYNOV, FR; <sup>3</sup>University Grenoble Alpes, Grenoble INP ESISAR, LSIC Laboratory, FR<br /> <em><b>Abstract</b><br /> Recent works in the field of cryptography focus on Deep Learning based Side Channel Analysis (DLSCA) as one of the most powerful attacks against common encryption algorithms such as AES. As a common case, profiling DLSCA has shown great capabilities in revealing secret cryptographic keys against the majority of AES implementations. 
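<p>The CDT-lookup baseline that IP2-1 compares its Boolean-function samplers against can be sketched in a few lines: draw uniform random bits and binary-search a cumulative table. The toy distribution and 16-bit precision below are invented:</p> <pre><code>
# Cumulative-distribution-table (CDT) inversion sampling: a uniform
# PREC-bit integer is mapped to a value by searching the cumulative
# table. The probability mass function here is an invented toy.
import random
from bisect import bisect_right

PREC = 16                                         # probability precision
pmf = {0: 0.50, 1: 0.25, 2: 0.1875, 3: 0.0625}    # toy distribution
cdt, acc = [], 0
for v in sorted(pmf):
    acc += round(pmf[v] * (2 ** PREC))
    cdt.append(acc)                               # cumulative thresholds

def sample():
    u = random.getrandbits(PREC)                  # uniform in [0, 2^PREC)
    return bisect_right(cdt, u)

counts = {v: 0 for v in pmf}
for _ in range(10000):
    counts[sample()] += 1
print(counts)    # roughly 5000 / 2500 / 1875 / 625
</code></pre>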
In a very recent study, it has been shown that Deep Learning can be applied in a non-profiling way (non-profiling DLSCA), making this method considerably more practical and able to break powerful countermeasures for encryption algorithms such as AES, including masking, while requiring considerably fewer power traces than a first-order CPA attack. In this work, our main goal is to apply non-profiling DLSCA against a hiding-based AES countermeasure that utilizes correlated noise generation to hide the secret encryption key. We show that this AES, with correlated noise generation as a lightweight countermeasure, can provide equivalent protection under CPA and under non-profiling DLSCA attacks, in terms of the power traces required to obtain the secret key.</em></td> </tr> <tr> <td>18:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="4.2">4.2 Timing in System-Level Modeling and Simulation</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 17:00 - 18:30<br /> <b>Location / Room:</b> Chamrousse</p> <p><b>Chair:</b><br /> Jorn Janneck, Lund University, SE</p> <p><b>Co-Chair:</b><br /> Gianluca Palermo, Politecnico di Milano, IT</p> <p>Given the importance of time in specifying and modeling systems, this session presents three contributions at different levels of abstraction, from transaction level to system level. While the first two contributions provide fast and accurate simulation models for DRAM memories and analog/mixed-signal systems, the last one models uncertainties at a higher level for reasoning and formal verification purposes.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>4.2.1</td> <td><b>FAST AND ACCURATE DRAM SIMULATION: CAN WE FURTHER ACCELERATE IT?</b><br /> <b>Speaker</b>:<br /> Matthias Jung, Fraunhofer IESE, DE<br /> <b>Authors</b>:<br /> Johannes Feldmann<sup>1</sup>, Matthias Jung<sup>2</sup>, Kira Kraft<sup>1</sup>, Lukas Steiner<sup>1</sup> and Norbert Wehn<sup>1</sup><br /> <sup>1</sup>TU Kaiserslautern, DE; <sup>2</sup>Fraunhofer IESE, DE<br /> <em><b>Abstract</b><br /> Virtual platforms are state-of-the-art for design space exploration and simulation of today's complex Systems on Chip (SoCs). The challenge of these virtual platforms is to find the right trade-off between speed and accuracy. For the simulation of Dynamic Random Access Memories (DRAMs), which have complex timing and power behavior, high-accuracy models are needed. However, cycle-accurate DRAM models consume a large share of the overall simulation time. Therefore, it is important to accelerate the DRAM simulation models while maintaining accuracy. Different approaches to accelerating DRAM simulation in virtual platforms already exist in the literature. This paper proposes two new performance-optimized DRAM models that further accelerate the simulation speed with only a negligible degradation in accuracy. The first model is an enhanced Transaction Level Model (TLM), which uses a look-up table to accelerate simulation parts with high bandwidth usage for online scenarios. The other is a neural network based simulator for offline trace analysis. We show a mathematical methodology to generate the inputs for the look-up table and the optimal artificial training set for the neural network. The TLM model is up to 5 times faster compared to a state-of-the-art TLM DRAM simulator. 
The neural network achieves a speedup of up to 10x when inferring on a GPU. Both solutions provide only a slight decrease in accuracy of approximately 5%.</em></td> </tr> <tr> <td>17:30</td> <td>4.2.2</td> <td><b>ACCURATE AND EFFICIENT CONTINUOUS TIME AND DISCRETE EVENTS SIMULATION IN SYSTEMC</b><br /> <b>Speaker</b>:<br /> Breytner Fernandez-Mesa, TIMA Laboratory, University Grenoble Alpes, FR<br /> <b>Authors</b>:<br /> Breytner Fernandez-Mesa, Liliana Andrade and Frédéric Pétrot, TIMA Lab, Université Grenoble Alpes, FR<br /> <em><b>Abstract</b><br /> The AMS extensions of SystemC emerged to aid the virtual prototyping of continuous time and discrete event heterogeneous systems. Although useful for a large set of use cases, synchronization of both domains through a fixed timestep generates inaccuracies that cannot be overcome without penalizing simulation speed. We propose a direct, optimistic, and causal synchronization algorithm on top of the SystemC kernel that explicitly handles the rich set of interactions that occur in the domain interface. We test our algorithm with a complex nonlinear automotive use case and show that it breaks the described accuracy and efficiency trade-off. Our work enlarges the applicability range of SystemC AMS based design frameworks.</em></td> </tr> <tr> <td>18:00</td> <td>4.2.3</td> <td><b>MODELING AND VERIFYING UNCERTAINTY-AWARE TIMING BEHAVIORS USING PARAMETRIC LOGICAL TIME CONSTRAINT</b><br /> <b>Speaker</b>:<br /> Fei Gao, East China Normal University, CN<br /> <b>Authors</b>:<br /> Fei Gao<sup>1</sup>, Mallet Frederic<sup>2</sup>, Min Zhang<sup>1</sup> and Mingsong Chen<sup>3</sup><br /> <sup>1</sup>East China Normal University, CN; <sup>2</sup>Universite Cote d'Azur, CNRS, Inria, I3S, Nice, France, FR; <sup>3</sup>East China Normal University, CN<br /> <em><b>Abstract</b><br /> The Clock Constraint Specification Language (CCSL) is a logical time based modeling language to formalize timing behaviors of real-time and embedded systems. However, it cannot capture timing behaviors that contain uncertainties, e.g., uncertainty in execution time and period. This limits the application of the language to real-world systems, as uncertainty often exists in practice due to both internal and external factors. To capture uncertainties in timing behaviors, in this paper we extend CCSL by introducing parameters into constraints. We then propose an approach to transform parametric CCSL constraints into SMT formulas for efficient verification. We apply our approach to an industrial case proposed as the FMTV (Formal Methods for Timing Verification) Challenge in 2015, which shows that timing behaviors with uncertainties can be effectively modeled and verified using the parametric CCSL.</em></td> </tr> <tr> <td style="width:40px;">18:30</td> <td><a href="#IP2">IP2-3</a>, 831</td> <td><b>FAST AND ACCURATE PERFORMANCE EVALUATION FOR RISC-V USING VIRTUAL PROTOTYPES</b><br /> <b>Speaker</b>:<br /> Vladimir Herdt, University of Bremen, DE<br /> <b>Authors</b>:<br /> Vladimir Herdt<sup>1</sup>, Daniel Grosse<sup>2</sup> and Rolf Drechsler<sup>2</sup><br /> <sup>1</sup>University of Bremen, DE; <sup>2</sup>University of Bremen / DFKI, DE<br /> <em><b>Abstract</b><br /> RISC-V is gaining huge popularity, in particular for embedded systems. 
Recently, a SystemC-based Virtual Prototype (VP) has been open-sourced to lay the foundation for providing support for system-level use cases such as design space exploration, analysis of complex HW/SW interactions and power/timing/performance validation for RISC-V based systems. In this paper, we propose an efficient core timing model and integrate it into the VP core to enable fast and accurate performance evaluation for RISC-V based systems. As a case study, we provide a timing configuration matching the RISC-V HiFive1 board from SiFive. Our experiments demonstrate that our approach yields very accurate performance evaluation results while still retaining high simulation performance.</em></td> </tr> <tr> <td style="width:40px;">18:31</td> <td><a href="#IP2">IP2-4</a>, 641</td> <td><b>AUTOMATED GENERATION OF LTL SPECIFICATIONS FOR SMART HOME IOT USING NATURAL LANGUAGE</b><br /> <b>Speaker</b>:<br /> Shiyu Zhang, Nanjing University, CN<br /> <b>Authors</b>:<br /> Shiyu Zhang<sup>1</sup>, Juan Zhai<sup>1</sup>, Lei Bu<sup>1</sup>, Mingsong Chen<sup>2</sup>, Linzhang Wang<sup>1</sup> and Xuandong Li<sup>1</sup><br /> <sup>1</sup>Nanjing University, CN; <sup>2</sup>East China Normal University, CN<br /> <em><b>Abstract</b><br /> Ordinary, inexperienced users can easily build their own smart home IoT systems nowadays, but such user-customized systems can be error-prone. Using formal verification to prove the correctness of such systems is necessary. However, formal proofs require formal specifications such as Linear Temporal Logic (LTL) formulas, which ordinary users cannot author; they can express requirements only in natural language. To address this problem, this paper presents a novel approach that can automatically generate formal LTL specifications from natural language requirements based on domain knowledge and our proposed ambiguity-refining techniques. 
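<p>A drastically simplified, pattern-based sketch of such an NL-to-LTL mapping (the paper's actual pipeline uses domain knowledge and ambiguity refinement, not just regular expressions; the rules below are invented):</p> <pre><code>
# Toy rule-based NL-to-LTL translation: match a sentence against known
# requirement patterns and emit the corresponding LTL template.
import re

RULES = [
    (re.compile(r'always turn on (\w+) when (\w+) is detected'),
     lambda m: 'G({1}_detected -> {0}_on)'.format(m.group(1), m.group(2))),
    (re.compile(r'never (\w+) and (\w+) at the same time'),
     lambda m: 'G(!({0} & {1}))'.format(m.group(1), m.group(2))),
]

def to_ltl(sentence):
    s = sentence.lower().strip('. ')
    for pattern, build in RULES:
        m = pattern.search(s)
        if m:
            return build(m)
    return None   # unmatched: would trigger interactive refinement

print(to_ltl('Always turn on light when motion is detected.'))
# G(motion_detected -> light_on)
</code></pre>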
Experimental results show that our approach achieves a high correctness rate of 95.4% when converting 481 natural language requirements from real examples into LTL formulas.</em></td> </tr> <tr> <td>18:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="4.3">4.3 EU Projects on Nanoelectronics with CMOS and alternative technologies</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 17:00 - 18:30<br /> <b>Location / Room:</b> Autrans</p> <p><b>Chair:</b><br /> Dimitris Gizopoulos, UoA, GR</p> <p><b>Co-Chair:</b><br /> George Karakonstantis, Queen's University Belfast, GB</p> <p>This session presents the results of three European Projects in different stages of execution, covering the development of a complete synthesis and optimization methodology for nano-crossbar arrays; reliability, security, and associated EDA tools for nanoelectronic systems; and the exploitation of STT-MTJ technologies for heterogeneous function implementation.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>4.3.1</td> <td><b>NANO-CROSSBAR BASED COMPUTING: LESSONS LEARNED AND FUTURE DIRECTIONS</b><br /> <b>Speaker</b>:<br /> Mustafa Altun, Istanbul TU, TR<br /> <b>Authors</b>:<br /> Mustafa Altun<sup>1</sup>, Ismail Cevik<sup>1</sup>, Ahmet Erten<sup>1</sup>, Osman Eksik<sup>1</sup>, Mircea Stan<sup>2</sup> and Csaba Moritz<sup>3</sup><br /> <sup>1</sup>Istanbul TU, TR; <sup>2</sup>University of Virginia, US; <sup>3</sup>University of Massachusetts Amherst, US<br /> <em><b>Abstract</b><br /> In this paper, we first summarize our research activities conducted through our European Union Horizon 2020 project between 2015 and 2019. The project's goal is to develop synthesis and performance optimization techniques for nanocrossbar arrays. For this purpose, different computing models, including diode-, memristor-, FET-, and four-terminal-switch-based models, within different technologies, including carbon nanotubes, nanowires, and memristors as well as CMOS, have been investigated. Their capabilities to realize logic functions and to tolerate faults have been deeply analyzed. From these experiences, we think that instead of replacing CMOS with a completely new crossbar-based technology, developing CMOS-compatible crossbar technologies and computing models is a more viable solution to overcome challenges in CMOS miniaturization. At this point, four-terminal-switch-based arrays, called switching lattices, stand out with their CMOS compatibility as well as their area-efficient device and circuit realizations. We have shown, through experiments in a 65nm CMOS process, that switching lattices can efficiently implement logic functions using a standard CMOS process. Further in this paper, we introduce the realization of memory arrays, including ROMs and RAMs, with switching lattices. 
We also discuss challenges and promises in realizing switching lattices in sub-30nm CMOS technologies, including FinFET.</em></td> </tr> <tr> <td>17:30</td> <td>4.3.2</td> <td><b>RESCUE: INTERDEPENDENT CHALLENGES OF RELIABILITY, SECURITY AND QUALITY IN NANOELECTRONIC SYSTEMS</b><br /> <b>Speaker</b>:<br /> Maksim Jenihhin, Tallinn University of Technology, EE<br /> <b>Authors</b>:<br /> Maksim Jenihhin<sup>1</sup>, Said Hamdioui<sup>2</sup>, Matteo Sonza Reorda<sup>3</sup>, Milos Krstic<sup>4</sup>, Peter Langendoerfer<sup>4</sup>, Christian Sauer<sup>5</sup>, Anton Klotz<sup>5</sup>, Michael Huebner<sup>6</sup>, Joerg Nolte<sup>6</sup>, H.T. Vierhaus<sup>6</sup>, Georgios Selimis<sup>7</sup>, Dan Alexandrescu<sup>8</sup>, Mottaqiallah Taouil<sup>2</sup>, Geert-Jan Schrijen<sup>7</sup>, Luca Sterpone<sup>3</sup>, Giovanni Squillero<sup>3</sup>, Zoya Dyka<sup>4</sup> and Jaan Raik<sup>1</sup><br /> <sup>1</sup>Tallinn University of Technology, EE; <sup>2</sup>TU Delft, NL; <sup>3</sup>Politecnico di Torino, IT; <sup>4</sup>Leibniz-Institut für innovative Mikroelektronik, DE; <sup>5</sup>Cadence Design Systems, DE; <sup>6</sup>BTU Cottbus-Senftenberg, DE; <sup>7</sup>Intrinsic-ID, NL; <sup>8</sup>IROC Technologies, FR<br /> <em><b>Abstract</b><br /> The recent trends for nanoelectronic computing systems include machine-to-machine communication in the era of Internet-of-Things (IoT) and autonomous systems, complex safety-critical applications, extreme miniaturization of implementation technologies and intensive interaction with the physical world. These set tough requirements on mutually dependent extra-functional design aspects. The H2020 MSCA ITN project RESCUE is focused on key challenges for reliability, security and quality, as well as related electronic design automation tools and methodologies. The objectives include both research advancements and cross-sectoral training of a new generation of interdisciplinary researchers. Notable interdisciplinary collaborative research results for the first half-period include novel approaches for test generation, soft-error and transient-fault vulnerability analysis, cross-layer fault-tolerance and error-resilience, functional safety validation, reliability assessment and run-time management, HW security enhancement and initial implementation of these into holistic EDA tools.</em></td> </tr> <tr> <td>18:00</td> <td>4.3.3</td> <td><b>A UNIVERSAL SPINTRONIC TECHNOLOGY BASED ON MULTIFUNCTIONAL STANDARDIZED STACK</b><br /> <b>Speaker</b>:<br /> Mehdi Tahoori, Karlsruhe Institute of Technology, DE<br /> <b>Authors</b>:<br /> Mehdi Tahoori<sup>1</sup>, Sarath Mohanachandran Nair<sup>1</sup>, Rajendra Bishnoi<sup>2</sup>, Lionel Torres<sup>3</sup>, Guillaume Partigeon<sup>4</sup>, Gregory DiPendina<sup>5</sup> and Guillaume Prenat<sup>5</sup><br /> <sup>1</sup>Karlsruhe Institute of Technology, DE; <sup>2</sup>TU Delft, NL; <sup>3</sup>Université de Montpellier, FR; <sup>4</sup>LIRMM, FR; <sup>5</sup>Spintec, FR<br /> <em><b>Abstract</b><br /> The goal of the GREAT RIA project is to co-integrate multiple functions like sensors ("Sensing"), RF emitters or receivers ("Communicating") and logic/memory ("Processing/Storing") together within CMOS by adapting the Spin-Transfer Torque Magnetic Tunnel Junction (STT-MTJ), the elementary constitutive cell of MRAM memories, to a single baseline technology. 
Based on the unique set of STT performance characteristics (non-volatility, high speed, infinite endurance and moderate read/write power), GREAT will achieve the same goal as heterogeneous integration of devices but in a much simpler way. This will lead to a unique STT-MTJ cell technology called Multifunctional Standardized Stack (MSS). This paper presents the lessons learned in the project from the technology, compact modeling, process design kit, and standard cells, as well as memory- and system-level design evaluation and exploration. The proposed technology and toolsets are giant leaps towards heterogeneous integrated technology and architectures for IoT.</em></td> </tr> <tr> <td>18:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="4.4">4.4 Some run it hot, others do not</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 17:00 - 18:30<br /> <b>Location / Room:</b> Stendhal</p> <p><b>Chair:</b><br /> Pascal Vivet, CEA-Leti, FR</p> <p><b>Co-Chair:</b><br /> Daniele J. Pagliari, Politecnico di Torino, IT</p> <p>Temperature management is a must-have in modern computing systems. The session presents a set of techniques for smart cooling systems, both active and proactive, and thermal control policies. The techniques presented are vertically applied to different components, such as computing and communication sub-systems, and use orthogonal modeling and optimization strategies, such as machine learning.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>4.4.1</td> <td><b>A LEARNING-BASED THERMAL SIMULATION FRAMEWORK FOR EMERGING TWO-PHASE COOLING TECHNOLOGIES</b><br /> <b>Speaker</b>:<br /> Ayse Coskun, Boston University, US<br /> <b>Authors</b>:<br /> Zihao Yuan<sup>1</sup>, Geoffrey Vaartstra<sup>2</sup>, Prachi Shukla<sup>1</sup>, Zhengmao Lu<sup>2</sup>, Evelyn Wang<sup>2</sup>, Sherief Reda<sup>3</sup> and Ayse Coskun<sup>1</sup><br /> <sup>1</sup>Boston University, US; <sup>2</sup>Massachusetts Institute of Technology, US; <sup>3</sup>Brown University, US<br /> <em><b>Abstract</b><br /> Future high-performance chips will require new cooling technologies that can extract heat efficiently. Two-phase cooling is a promising processor cooling solution owing to its high heat transfer rate and potential benefits in cooling power. Two-phase cooling mechanisms, including microchannel-based two-phase cooling or two-phase vapor chambers (VCs), are typically modeled by computing the temperature-dependent heat transfer coefficient (HTC) of the evaporator or coolant using an iterative simulation framework. Precomputed HTC correlations are specific to a given cooling system design and cannot be applied to even the same cooling technology with different cooling parameters (such as different geometries). Another challenge is that HTC correlations are typically calculated with computational fluid dynamics (CFD) tools, which induce long design and simulation times. This paper introduces a learning-based temperature-dependent HTC simulation framework that is used to model a two-phase cooling solution with a wide range of cooling design parameters. In particular, the proposed framework includes a compact thermal model (CTM) of two-phase VCs with hybrid wick evaporators (of nanoporous membrane and microchannels). We build a new simulation tool to integrate the proposed simulation framework and CTM. 
We validate the proposed simulation framework as well as the new CTM through comparisons against a CFD model. Our simulation framework and CTM achieve a speedup of 21X with an average error of 0.98°C (and a maximum error of 2.59°C). We design an optimization flow for hybrid wicks to select the most beneficial nanoporous membrane and microchannel geometries. Our flow is capable of finding a geometry-coolant combination that results in a lower (or similar) maximum chip temperature compared to that of the best coolant-geometry pair selected by grid search, while providing a speedup of 9.4X.</em></td> </tr> <tr> <td>17:30</td> <td>4.4.2</td> <td><b>LIGHTWEIGHT THERMAL MONITORING IN OPTICAL NETWORKS-ON-CHIP VIA ROUTER REUSE</b><br /> <b>Speaker</b>:<br /> Mengquan Li, Nanyang Technological University, SG<br /> <b>Authors</b>:<br /> Mengquan Li<sup>1</sup>, Jun Zhou<sup>2</sup> and Weichen Liu<sup>2</sup><br /> <sup>1</sup>Nanyang Technological University, CN; <sup>2</sup>Nanyang Technological University, SG<br /> <em><b>Abstract</b><br /> Optical network-on-chip (ONoC) is an emerging communication architecture for manycore systems due to low latency, high bandwidth, and low power dissipation. However, a major concern lies in its thermal susceptibility -- under on-chip temperature variations, functional nanophotonic devices, especially microring resonator (MR)-based devices, suffer from significant thermal-induced optical power loss, which may counteract the power advantages of ONoCs and even cause functional failures. Considering the fact that temperature gradients are typically found on many-core systems, effective thermal monitoring, serving as the foundation of thermal-aware management, is critical for ONoCs. In this paper, a lightweight thermal monitoring scheme is proposed for ONoCs. We first design a temperature measurement module based on generic optical routers. It introduces only trivial chip-area overhead by reusing the components in routers. A major problem with reusing optical routers is that it may potentially interfere with the normal communications in ONoCs. To address this, we then propose a time allocation strategy to schedule thermal sensing operations in the time intervals between communications. Evaluation results show that our scheme exhibits an untrimmed inaccuracy of 1.0070 K with low energy consumption of 656.38 pJ/Sa. It occupies an extremely small area of 0.0020 mm^2, reducing the area cost by 83.74% on average compared to the state-of-the-art optical thermal sensor design.</em></td> </tr> <tr> <td>18:00</td> <td>4.4.3</td> <td><b>A SPECTRAL APPROACH TO SCALABLE VECTORLESS THERMAL INTEGRITY VERIFICATION</b><br /> <b>Speaker</b>:<br /> Zhuo Feng, Stevens Institute of Technology, US<br /> <b>Authors</b>:<br /> Zhiqiang Zhao<sup>1</sup> and Zhuo Feng<sup>2</sup><br /> <sup>1</sup>Michigan Technological University, US; <sup>2</sup>Stevens Institute of Technology, US<br /> <em><b>Abstract</b><br /> Existing chip thermal analysis and verification methods require detailed distribution of power densities or modeling of underlying input workloads (vectors), which may not always be feasible at an early design stage. This paper introduces the first vectorless thermal integrity verification framework that allows computing worst-case temperature (gradient) distributions across the entire chip under a set of local and global workload (power density) constraints. 
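<p>The optimization at the heart of vectorless verification can be illustrated on a dense toy instance: maximize one node's temperature over all power vectors satisfying local and global budgets. The 3-node sensitivity matrix below is invented, and the paper's contribution is making such analysis scale to large 3D thermal grids via spectral methods rather than solving it densely as here:</p> <pre><code>
# Toy vectorless thermal check: find the worst-case temperature
# T[i] = (A @ p)[i] over all power vectors p meeting per-source and
# total power budgets, as a linear program. A and budgets are invented.
import numpy as np
from scipy.optimize import linprog

A = np.array([[0.8, 0.3, 0.1],      # degrees C per watt, node i, source j
              [0.3, 0.9, 0.2],
              [0.1, 0.2, 0.7]])
p_local, p_global = 2.0, 4.0        # local and global power constraints

def worst_case_temp(node):
    res = linprog(c=-A[node],                       # maximize A[node] @ p
                  A_ub=np.ones((1, 3)), b_ub=[p_global],
                  bounds=[(0, p_local)] * 3)
    return -res.fun

print([round(worst_case_temp(i), 2) for i in range(3)])   # [2.2, 2.4, 1.8]
</code></pre>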
To address the computational challenges introduced by the large 3D mesh-structured thermal grids, we propose a novel spectral approach for highly-scalable vectorless thermal verification of large chip designs. Our approach is based on emerging spectral graph theory and graph signal processing techniques, and consists of a thermal grid topology sparsification phase, an edge weight scaling phase, as well as a solution refinement procedure. The effectiveness and efficiency of our approach have been demonstrated through extensive experiments.</em></td> </tr> <tr> <td>18:15</td> <td>4.4.4</td> <td><b>DYNAMIC THERMAL MANAGEMENT WITH PROACTIVE FAN SPEED CONTROL THROUGH REINFORCEMENT LEARNING</b><br /> <b>Speaker</b>:<br /> Arman Iranfar, EPFL, CH<br /> <b>Authors</b>:<br /> Arman Iranfar<sup>1</sup>, Federico Terraneo<sup>2</sup>, Gabor Csordas<sup>1</sup>, Marina Zapater<sup>1</sup>, William Fornaciari<sup>2</sup> and David Atienza<sup>1</sup><br /> <sup>1</sup>EPFL, CH; <sup>2</sup>Politecnico di Milano, IT<br /> <em><b>Abstract</b><br /> Dynamic Thermal Management (DTM) in submicron technology has become a major challenge since it directly affects Multiprocessor Systems-on-Chip (MPSoC) performance, power consumption, and lifetime reliability. For proper DTM, thermal simulators play a significant role as they allow chip temperature to be safely studied. Nonetheless, state-of-the-art thermal simulators do not support transient fan models. As a result, adaptive fan speed control, which is an important runtime parameter, cannot be well utilized in DTM. Therefore, in this work, we first propose and integrate a transient fan model into a state-of-the-art thermal simulator, enabling adaptive fan speed control simulation for efficient DTM. We then validate our simulation framework through a thermal test chip, achieving less than 2°C error in the worst case. With multiple fan speeds, however, the DTM design space grows significantly, which can ultimately make conventional solutions, such as grid search, infeasible, impractical, or insufficient due to the large runtime overhead. Therefore, we address this challenge through a reinforcement learning-based solution to proactively determine the number of active cores, operating frequency, and fan speed. The proposed solution is able to reduce fan power by up to 40% compared to a DTM with constant fan speed with less than 1% performance degradation. Also, compared to a state-of-the-art DTM technique, our solution improves performance by up to 19% for the same fan power.</em></td> </tr> <tr> <td style="width:40px;">18:30</td> <td><a href="#IP2">IP2-5</a>, 362</td> <td><b>A HEAT-RECIRCULATION-AWARE VM PLACEMENT STRATEGY FOR DATA CENTERS</b><br /> <b>Authors</b>:<br /> Hao Feng<sup>1</sup>, Yuhui Deng<sup>2</sup> and Yi Zhou<sup>3</sup><br /> <sup>1</sup>Jinan University, CN; <sup>2</sup>Chinese Academy of Sciences; Jinan University, CN; <sup>3</sup>Columbus State University, US<br /> <em><b>Abstract</b><br /> Data centers consist of a great number of IT devices (e.g., servers and switches) that generate a massive amount of heat. Due to the special arrangement of racks in the data center, heat recirculation often occurs between nodes. It can cause a sharp rise in equipment temperature, coupled with local hot spots in data centers. Existing VM placement strategies can minimize the energy consumption of data centers by optimizing resource allocation in terms of multiple physical resources (e.g., memory, bandwidth, and CPU). 
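<p>The reinforcement-learning DTM loop of 4.4.4 can be caricatured with a tabular, epsilon-greedy agent choosing (active cores, frequency, fan level); the thermal response, reward shaping and all constants below are invented placeholders, not the paper's models:</p> <pre><code>
# Schematic RL-based DTM: learn which (cores, frequency, fan) setting
# maximizes performance minus fan power while respecting a thermal cap.
import random
random.seed(0)

ACTIONS = [(c, f, fan) for c in (2, 4) for f in (1.0, 2.0) for fan in (0, 1, 2)]
Q = {}
state = 'warm'                                  # single dummy state bin

def temp_model(c, f, fan):
    # invented linear-plus-noise thermal response, degrees C
    return 45 + 6 * c + 10 * f - 8 * fan + random.uniform(-2, 2)

def reward(c, f, fan, temp):
    perf = c * f                                # crude throughput proxy
    fan_power = 0.5 * fan ** 2
    penalty = 100 if temp > 80 else 0           # hard thermal cap
    return perf - fan_power - penalty

for step in range(2000):                        # learn by interaction
    if random.random() < 0.1:                   # epsilon-greedy explore
        a = random.choice(ACTIONS)
    else:
        a = max(ACTIONS, key=lambda x: Q.get((state, x), 0))
    r = reward(*a, temp_model(*a))
    old = Q.get((state, a), 0)
    Q[(state, a)] = old + 0.1 * (r - old)       # stateless TD update

print(max(ACTIONS, key=lambda x: Q.get((state, x), 0)))
</code></pre>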
However, existing strategies ignore the role of heat recirculation in the data center. To address this problem, in this study, we propose a heat-recirculation-aware VM placement strategy and design a Simulated Annealing Based Algorithm (SABA) to lower the energy consumption of data centers. Different from the existing SA algorithm, SABA optimizes the distribution of the initial solution and the iteration scheme. We quantitatively evaluate SABA's performance in terms of algorithm efficiency, activated servers, and energy savings against the XINT-GA algorithm (a thermal-aware task scheduling strategy), FCFS (First-Come First-Served), and SA. Experimental results indicate that our heat-recirculation-aware VM placement strategy provides a powerful solution for improving the energy efficiency of data centers.</em></td> </tr> <tr> <td style="width:40px;">18:31</td> <td><a href="#IP2">IP2-6</a>, 826</td> <td><b>ENERGY OPTIMIZATION IN NCFET-BASED PROCESSORS</b><br /> <b>Authors</b>:<br /> Sami Salamin<sup>1</sup>, Martin Rapp<sup>1</sup>, Hussam Amrouch<sup>1</sup>, Andreas Gerstlauer<sup>2</sup> and Joerg Henkel<sup>1</sup><br /> <sup>1</sup>Karlsruhe Institute of Technology, DE; <sup>2</sup>University of Texas at Austin, US<br /> <em><b>Abstract</b><br /> Energy consumption is a key optimization goal for all modern processors. Negative Capacitance Field-Effect Transistors (NCFETs) are a leading emerging technology that promises outstanding performance in addition to better energy efficiency. The thickness of the additional ferroelectric layer, frequency, and voltage are the key parameters in NCFET technology that impact the power and frequency of processors. However, their joint impact on energy optimization has not been investigated yet. In this work, we are the first to demonstrate that conventional (i.e., NCFET-unaware) dynamic voltage/frequency scaling (DVFS) techniques to minimize energy are sub-optimal when applied to NCFET-based processors. We further demonstrate that state-of-the-art NCFET-aware voltage scaling for power minimization is also sub-optimal when it comes to energy. This work provides the first NCFET-aware DVFS technique that optimizes the processor's energy through optimal runtime frequency/voltage selection. In NCFETs, energy-optimal frequency and voltage are dependent on the workload and technology parameters. Our NCFET-aware DVFS technique considers these effects to perform optimal voltage/frequency selection at runtime depending on workload characteristics. Results show up to 90% energy savings compared to conventional DVFS techniques. Compared to state-of-the-art NCFET-aware power management, our technique provides up to 72% energy savings along with 3.7x higher performance.</em></td> </tr> <tr> <td>18:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="4.5">4.5 Adaptation and optimization for real-time systems</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 17:00 - 18:30<br /> <b>Location / Room:</b> Bayard</p> <p><b>Chair:</b><br /> Wanli Chang, University of York, GB</p> <p><b>Co-Chair:</b><br /> Emmanuel Grolleau, ENSMA, FR</p> <p>This session presents novel techniques for systems requiring adaptations. 
<h2 id="4.5">4.5 Adaptation and optimization for real-time systems</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 17:00 - 18:30<br /> <b>Location / Room:</b> Bayard</p> <p><b>Chair:</b><br /> Wanli Chang, University of York, GB</p> <p><b>Co-Chair:</b><br /> Emmanuel Grolleau, ENSMA, FR</p> <p>This session presents novel techniques for systems requiring adaptation. The papers in this session cover monitoring techniques to increase reactivity, weakly-hard constraints, the extension of previous cache persistence analyses from one core to several cores, and the modeling of data chains with guaranteed latency bounds.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>4.5.1</td> <td><b>RELIABLE AND ENERGY-AWARE FIXED-PRIORITY (M,K)-DEADLINES ENFORCEMENT WITH STANDBY-SPARING</b><br /> <b>Speaker</b>:<br /> Linwei Niu, West Virginia State University, US<br /> <b>Authors</b>:<br /> Linwei Niu<sup>1</sup> and Dakai Zhu<sup>2</sup><br /> <sup>1</sup>West Virginia State University, US; <sup>2</sup>University of Texas at San Antonio, US<br /> <em><b>Abstract</b><br /> For real-time computing systems, energy efficiency, Quality of Service, and fault tolerance are among the major design concerns. In this work, we study the problem of reliable and energy-aware fixed-priority (m,k)-deadlines enforcement with standby-sparing. Standby-sparing systems adopt a primary processor and a spare processor to provide fault tolerance for both permanent and transient faults. To reduce the energy consumption of such systems, we propose a novel scheduling scheme under the QoS constraint of (m,k)-deadlines. The evaluation results demonstrate that our proposed approach significantly outperforms previous research in energy conservation while assuring (m,k)-deadlines and fault tolerance for real-time systems.</em></td> </tr> <tr> <td>17:30</td> <td>4.5.2</td> <td><b>PERIOD ADAPTATION FOR CONTINUOUS SECURITY MONITORING IN MULTICORE REAL-TIME SYSTEMS</b><br /> <b>Speaker</b>:<br /> Monowar Hasan, University of Illinois at Urbana-Champaign, US<br /> <b>Authors</b>:<br /> Monowar Hasan<sup>1</sup>, Sibin Mohan<sup>2</sup>, Rodolfo Pellizzoni<sup>3</sup> and Rakesh Bobba<sup>4</sup><br /> <sup>1</sup>University of Illinois at Urbana-Champaign, US; <sup>2</sup>University of Illinois at Urbana-Champaign (UIUC), US; <sup>3</sup>University of Waterloo, CA; <sup>4</sup>Oregon State University, US<br /> <em><b>Abstract</b><br /> We propose HYDRA-C, a design-time evaluation framework for integrating monitoring mechanisms in multicore real-time systems (RTS). Our goal is to ensure that security (or other monitoring) mechanisms execute in a "continuous" manner -- i.e., as often as possible, across cores. This ensures that any such mechanisms run with few interruptions, if any. HYDRA-C is intended to allow designers of RTS to integrate monitoring mechanisms without perturbing existing timing properties or execution orders. We demonstrate the framework using a proof-of-concept implementation with intrusion detection mechanisms as security tasks. We develop and use both (a) a custom intrusion detection system (IDS) and (b) Tripwire, an open-source data integrity checking tool.
We compare the performance of HYDRA-C with a state-of-the-art multicore RT security integration approach and find that our method does not impact schedulability and, on average, can detect intrusions 19.05% faster without impacting the performance of RT tasks.</em></td> </tr> <tr> <td>18:00</td> <td>4.5.3</td> <td><b>EFFICIENT LATENCY BOUND ANALYSIS FOR DATA CHAINS OF REAL-TIME TASKS IN MULTIPROCESSOR SYSTEMS</b><br /> <b>Speaker</b>:<br /> Jiankang Ren, Dalian University of Technology, CN<br /> <b>Authors</b>:<br /> Jiankang Ren<sup>1</sup>, Xin He<sup>1</sup>, Junlong Zhou<sup>2</sup>, Hongwei Ge<sup>1</sup>, Guowei Wu<sup>1</sup> and Guozhen Tan<sup>1</sup><br /> <sup>1</sup>Dalian University of Technology, CN; <sup>2</sup>Nanjing University of Science and Technology, CN<br /> <em><b>Abstract</b><br /> End-to-end latency analysis is one of the key problems in automotive embedded system design. In this paper, we propose an efficient worst-case end-to-end latency analysis method for data chains of periodic real-time tasks executed on multiprocessors under a partitioned fixed-priority preemptive scheduling policy. The key idea is to improve analysis efficiency by transforming the problem of bounding the worst-case latency of the data chain into a problem of bounding the releasing interval of data propagation instances for each pair of consecutive tasks in the chain. In particular, we derive an upper bound on the releasing interval of successive data propagation instances to yield the desired data chain latency bound by a simple accumulation. Based on this idea, we present an efficient latency upper bound analysis algorithm with polynomial time complexity. Experiments with randomly generated task sets based on a generic automotive benchmark show that our approach obtains a relatively tighter data chain latency upper bound at lower computational cost. (A simplified accumulation sketch follows this session's table.)</em></td> </tr> <tr> <td>18:15</td> <td>4.5.4</td> <td><b>CACHE PERSISTENCE-AWARE MEMORY BUS CONTENTION ANALYSIS FOR MULTICORE SYSTEMS</b><br /> <b>Speaker</b>:<br /> Syed Aftab Rashid, Polytechnic Institute of Porto, PT<br /> <b>Authors</b>:<br /> Syed Aftab Rashid, Geoffrey Nelissen and Eduardo Tovar, Polytechnic Institute of Porto, PT<br /> <em><b>Abstract</b><br /> Memory bus contention strongly relates to the number of main memory requests generated by tasks running on different cores of a multicore platform, which, in turn, depends on the content of the cache memories during the execution of those tasks. Recent works have shown that due to cache persistence the memory access demand of multiple jobs of a task may not always be equal to its worst-case memory access demand in isolation. Analysis of the variable memory access demand of tasks due to cache persistence leads to a significantly tighter worst-case response time (WCRT) for tasks. In this work, we show how the notion of cache persistence can be extended from single-core to multicore systems. In particular, we focus on analyzing the impact of cache persistence on the memory bus contention suffered by tasks executing on a multicore platform, considering both work-conserving and non-work-conserving bus arbitration policies.
Experimental evaluation shows that cache persistence-aware analyses of bus arbitration policies increase the number of task sets deemed schedulable by up to 70 percentage points in comparison to their respective counterparts that do not account for cache persistence.</em></td> </tr> <tr> <td style="width:40px;">18:30</td> <td><a href="#IP2">IP2-7</a>, 934</td> <td><b>TOWARDS A MODEL-BASED MULTI-OBJECTIVE OPTIMIZATION APPROACH FOR SAFETY-CRITICAL REAL-TIME SYSTEMS</b><br /> <b>Speaker</b>:<br /> Emmanuel Grolleau, LIAS / ISAE-ENSMA, FR<br /> <b>Authors</b>:<br /> Soulimane Kamni<sup>1</sup>, Yassine OUHAMMOU<sup>2</sup>, Antoine Bertout<sup>3</sup> and Emmanuel Grolleau<sup>4</sup><br /> <sup>1</sup>LIAS/ENSMA, FR; <sup>2</sup>LIAS / ISAE-ENSMA, FR; <sup>3</sup>LIAS, Université de Poitiers, ISAE-ENSMA, FR; <sup>4</sup>LIAS, ISAE-ENSMA, Universite de Poitiers, FR<br /> <em><b>Abstract</b><br /> In the safety-critical real-time systems domain, obtaining an appropriate operational model that meets the temporal (e.g., deadlines) and business (e.g., redundancy) requirements while being optimal in terms of several metrics is an essential step in the design life-cycle. Recently, several studies have proposed exploring cross-domain trade-offs for higher performance. This step is the first in the deployment phase and is very sensitive, because it can be error-prone and time-consuming. This paper is a work in progress proposing an approach that helps real-time system architects benefit from existing works, overcome their limits, and capitalize on prior efforts. Furthermore, the approach is based on the model-driven engineering paradigm and eases the usage of methods and tools through repositories that gather them as shared knowledge.</em></td> </tr> <tr> <td>18:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr />
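<p>The accumulation idea behind paper 4.5.3 above can be illustrated with a deliberately simplified bound: summing one interval bound per task along the chain. The Python sketch below uses the classic conservative per-task term (period plus worst-case response time, as in Davare et al.'s well-known bound for register-based communication) as a stand-in for the tighter releasing-interval bounds derived in the paper.</p>
<pre><code>from dataclasses import dataclass

@dataclass
class Task:
    period: float   # T_i: activation period
    wcrt: float     # R_i: worst-case response time on its core

def chain_latency_bound(chain):
    """Conservative end-to-end latency bound for a register-based
    data chain, accumulated task by task along the chain.
    Each task contributes at most one period (sampling delay)
    plus its worst-case response time.
    """
    return sum(t.period + t.wcrt for t in chain)

# Example: sensor -> filter -> actuator chain on a partitioned multicore
chain = [Task(period=5.0, wcrt=1.2),
         Task(period=10.0, wcrt=3.4),
         Task(period=20.0, wcrt=6.1)]
print(chain_latency_bound(chain))   # 45.7
</code></pre>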
<h2 id="4.6">4.6 Artificial Intelligence and Secure Systems</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 17:00 - 18:30<br /> <b>Location / Room:</b> Lesdiguières</p> <p><b>Chair:</b><br /> Annelie Heuser, Univ Rennes, Inria, CNRS, FR</p> <p><b>Co-Chair:</b><br /> Ilia Polian, University of Stuttgart, DE</p> <p>In this session we will cover artificial intelligence algorithms in the context of secure systems. The presented papers cover an extension of a trusted execution environment to securely run machine learning algorithms, novel attack strategies against logic-locking countermeasures, and an investigation of aging effects on the success rate of machine learning modelling attacks.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>4.6.1</td> <td><b>A PARTICLE SWARM OPTIMIZATION GUIDED APPROXIMATE KEY SEARCH ATTACK ON LOGIC LOCKING IN THE ABSENCE OF SCAN ACCESS</b><br /> <b>Speaker</b>:<br /> Rajit Karmakar, IIT Kharagpur, IN<br /> <b>Authors</b>:<br /> Rajit Karmakar and Santanu Chattopadhyay, IIT Kharagpur, IN<br /> <em><b>Abstract</b><br /> Logic locking is a well-known Design-for-Security (DfS) technique for the Intellectual Property (IP) protection of digital Integrated Circuits (ICs). However, various attacks on logic locking can successfully extract the secret obfuscation key. Although Boolean Satisfiability (SAT) attacks can break most logic-locked circuits, the inability to deobfuscate sequential circuits is the main limitation of this type of attack. Several existing defense strategies exploit this fact to thwart SAT attacks by obfuscating the scan-based Design-for-Testability (DfT) infrastructure. In the absence of scan access, Model Checking based circuit unrolling attacks also suffer from scalability issues. In this paper, we propose a particle swarm optimization (PSO) guided attack framework, which is capable of finding an approximate key that produces correct outputs in most cases. Unlike SAT attacks, the proposed attack framework works even in the absence of scan access. Unlike Model Checking attacks, it does not suffer from scalability issues and can thus be applied to significantly larger sequential circuits. Experimental results show that the derived key produces correct outputs in more than 99% of cases for the majority of the benchmark circuits, while for the rest a minimal error is observed. The proposed attack framework enables partial activation of large sequential circuits in the absence of scan access, which is not feasible using existing attack frameworks. (A generic binary-PSO sketch follows this session's table.)</em></td> </tr> <tr> <td>17:30</td> <td>4.6.2</td> <td><b>EFFECT OF AGING ON PUF MODELING ATTACKS BASED ON POWER SIDE-CHANNEL OBSERVATIONS</b><br /> <b>Speaker</b>:<br /> Trevor Kroeger, University of Maryland Baltimore County, US<br /> <b>Authors</b>:<br /> Trevor Kroeger<sup>1</sup>, Wei Cheng<sup>2</sup>, Jean Luc Danger<sup>3</sup>, Sylvain Guilley<sup>4</sup> and Naghmeh Karimi<sup>5</sup><br /> <sup>1</sup>University of Maryland Baltimore County, US; <sup>2</sup>Telecom ParisTech, FR; <sup>3</sup>Télécom ParisTech, FR; <sup>4</sup>Secure-IC, FR; <sup>5</sup>University of Maryland, Baltimore County, US<br /> <em><b>Abstract</b><br /> Thanks to imperfections in the manufacturing process, Physically Unclonable Functions (PUFs) produce unique outputs for given input signals (challenges) fed to identical circuit designs. PUFs are often used as hardware primitives to provide security, e.g., for key generation or authentication purposes. However, they can be vulnerable to modeling attacks that predict the output for an unknown challenge based on a set of known challenge/response pairs (CRPs). In addition, an attacker may benefit from power side-channels to break a PUF's security. Although such attacks have been extensively discussed in the literature, the effect of device aging on their efficacy is still an open question. Accordingly, in this paper we focus on the impact of aging on Arbiter-PUFs and one of their modeling-resistant counterparts, the Voltage Transfer Characteristic (VTC) PUF. We present the results of SPICE simulations used to perform modeling attacks via Machine Learning (ML) schemes on devices aged from 0 to 20 weeks. We show that aging has a significant impact on modeling attacks. Indeed, when the training dataset for the ML attack is extracted at a different age than the evaluation dataset, the attack is greatly hindered despite being performed on the same device.
We show that the ML attack via power traces is particularly effective at recovering the responses of the anti-modeling VTC PUF, yet aging still contributes to enhancing its security.</em></td> </tr> <tr> <td>18:00</td> <td>4.6.3</td> <td><b>OFFLINE MODEL GUARD: SECURE AND PRIVATE ML ON MOBILE DEVICES</b><br /> <b>Speaker</b>:<br /> Emmanuel Stapf, TU Darmstadt, DE<br /> <b>Authors</b>:<br /> Sebastian P. Bayerl<sup>1</sup>, Tommaso Frassetto<sup>2</sup>, Patrick Jauernig<sup>2</sup>, Korbinian Riedhammer<sup>1</sup>, Ahmad-Reza Sadeghi<sup>2</sup>, Thomas Schneider<sup>2</sup>, Emmanuel Stapf<sup>2</sup> and Christian Weinert<sup>2</sup><br /> <sup>1</sup>TH Nürnberg, DE; <sup>2</sup>TU Darmstadt, DE<br /> <em><b>Abstract</b><br /> Performing machine learning tasks in mobile applications yields a challenging conflict of interest: highly sensitive client information (e.g., speech data) should remain private, while the intellectual property of service providers (e.g., model parameters) must also be protected. Cryptographic techniques offer secure solutions for this, but incur unacceptable overhead and, moreover, require frequent network interaction. In this work, we design a practically efficient hardware-based solution. Specifically, we build Offline Model Guard (OMG) to enable privacy-preserving machine learning on the predominant mobile computing platform ARM - even in offline scenarios. By leveraging a trusted execution environment for strict hardware-enforced isolation from other system components, OMG guarantees the privacy of client data, the secrecy of provided models, and the integrity of processing algorithms. Our prototype implementation on an ARM HiKey 960 development board performs privacy-preserving keyword recognition using TensorFlow Lite for Microcontrollers in real time.</em></td> </tr> <tr> <td>18:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr />
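<p>The search strategy of paper 4.6.1 above can be sketched as a generic binary particle swarm over candidate keys. In the Python sketch below, <code>eval_match</code> is a hypothetical fitness callback returning the fraction of primary outputs that match the oracle's responses on sampled input patterns; the paper's actual key encoding, fitness evaluation, and handling of sequential circuits are not reproduced.</p>
<pre><code>import math
import random

def pso_key_search(eval_match, key_len, n_particles=30, iters=200,
                   w=0.7, c1=1.5, c2=1.5, vmax=6.0):
    """Binary PSO searching for an approximate logic-locking key."""
    pos = [[random.randint(0, 1) for _ in range(key_len)]
           for _ in range(n_particles)]
    vel = [[0.0] * key_len for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [eval_match(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(key_len):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # clamp velocity, then map it to a bit probability
                vel[i][d] = max(-vmax, min(vmax, vel[i][d]))
                prob_one = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if prob_one > random.random() else 0
            fit = eval_match(pos[i])
            if fit > pbest_fit[i]:       # update personal best
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:      # and the swarm-wide best
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit
</code></pre>
<p>For a locked design, <code>eval_match</code> would simulate the netlist under the candidate key and compare its outputs against oracle responses; a fitness near 1.0 corresponds to the approximate keys reported in the paper.</p>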
<h2 id="4.7">4.7 Future computing fabrics: security and design integration</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 17:00 - 18:30<br /> <b>Location / Room:</b> Berlioz</p> <p><b>Chair:</b><br /> Elena Gnani, Università di Bologna, IT</p> <p><b>Co-Chair:</b><br /> Gage Hills, Massachusetts Institute of Technology, US</p> <p>Emerging technologies promise computational and resource efficiency. This session addresses various aspects of efficiency in the context of security and future computing fabrics: a unique challenge at the intersection of hardware security and machine learning, fully front-end compatible CAD frameworks to enable access to floating-gate memristive devices, and current recycling in superconducting circuits.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>4.7.1</td> <td><b>SECURITY ENHANCEMENT FOR RRAM COMPUTING SYSTEM THROUGH OBFUSCATING CROSSBAR ROW CONNECTIONS</b><br /> <b>Speaker</b>:<br /> Minhui Zou, Nanjing University of Science and Technology, CN<br /> <b>Authors</b>:<br /> Minhui Zou<sup>1</sup>, Zhenhua Zhu<sup>2</sup>, Yi Cai<sup>2</sup>, Junlong Zhou<sup>1</sup>, Chengliang Wang<sup>3</sup> and Yu Wang<sup>2</sup><br /> <sup>1</sup>Nanjing University of Science and Technology, CN; <sup>2</sup>Tsinghua University, CN; <sup>3</sup>Chongqing University, CN<br /> <em><b>Abstract</b><br /> Neural networks (NNs) have achieved great success in visual object recognition and natural language processing, but such data-intensive applications require huge data movements between computing units and memory. Emerging resistive random-access memory (RRAM) computing systems have demonstrated great potential in avoiding these data movements by performing matrix-vector multiplications in memory. However, the nonvolatility of RRAM devices may allow the NN weights stored in crossbars to be stolen, and an adversary could then extract the NN models from the stolen weights. This paper proposes an effective security-enhancing method for RRAM computing systems to thwart this sort of piracy attack. We first analyze the methods by which NN weights can be stolen. We then propose an efficient security-enhancing technique based on obfuscating the row connections between positive crossbars and their paired negative crossbars. Two heuristic techniques are also presented to optimize the hardware overhead of the obfuscation module. Compared with existing NN security work, our method eliminates the additional RRAM write operations used for encryption/decryption, without shortening the lifetime of RRAM computing systems. The experimental results show that the proposed methods ensure that a brute-force attack requires more than (16!)^17 trials, and that the classification accuracy of incorrectly extracted NN models is less than 20%, with minimal area overhead.</em></td> </tr> <tr> <td>17:30</td> <td>4.7.2</td> <td><b>MODELING A FLOATING-GATE MEMRISTIVE DEVICE FOR COMPUTER AIDED DESIGN OF NEUROMORPHIC COMPUTING</b><br /> <b>Speaker</b>:<br /> Loai Danial, Technion, IL<br /> <b>Authors</b>:<br /> Loai Danial<sup>1</sup>, Vasu Gupta<sup>2</sup>, Evgeny Pikhay<sup>3</sup>, Yakov Roizin<sup>3</sup> and Shahar Kvatinsky<sup>1</sup><br /> <sup>1</sup>Technion, IL; <sup>2</sup>Technion, IN; <sup>3</sup>TowerJazz, IL<br /> <em><b>Abstract</b><br /> Memristive technology is still not mature enough for the very large-scale integration necessary to obtain practical value from neuromorphic systems. While nonvolatile floating-gate "synapse transistors" have been implemented in very large-scale integrated neuromorphic systems, their large footprint still places an upper bound on overall performance. A two-terminal floating-gate memristive device can combine the technological maturity of the floating-gate transistor and the conceptual novelty of the memristor using a standard CMOS process.
In this paper, we present a top-down computer-aided design framework for the floating-gate memristive device and show its potential in neuromorphic computing. Our framework includes a Verilog-A SPICE model, small-signal schematics, a stochastic model, Monte-Carlo simulations, layout, DRC, LVS, and RC extraction.</em></td> </tr> <tr> <td>18:00</td> <td>4.7.3</td> <td><b>GROUND PLANE PARTITIONING FOR CURRENT RECYCLING OF SUPERCONDUCTING CIRCUITS</b><br /> <b>Speaker</b>:<br /> Naveen Katam, University of Southern California, US<br /> <b>Authors</b>:<br /> Naveen Kumar Katam, Bo Zhang and Massoud Pedram, University of Southern California, US<br /> <em><b>Abstract</b><br /> Superconducting single flux quantum (SFQ) technology using Josephson junctions (JJs) is an excellent choice for the computing fabrics of the future. Current recycling is a necessary technique for the energy-efficient implementation of large SFQ circuits, whereby circuit partitions with similar bias current requirements are biased serially. Though this technique has been verified for small-scale circuits, it has not been implemented for large circuits, as there is no trivial way to partition a circuit into blocks with separate ground planes. The major constraints for partitioning are (1) equal bias current and (2) equal area for all partitions, together with (3) minimizing the connections between adjacent ground planes, with a high cost for connections between non-adjacent planes. For the first time, all these constraints are formulated into a cost function, which is minimized with the gradient descent method. The algorithm takes a circuit netlist and the intended number of partitions as inputs and outputs groups of cells belonging to separate ground planes. It minimizes the connections among different ground planes and gives a solution on which the current recycling technique can be implemented. The cost-function parameters are initialized randomly, and the problem dimensions are reduced to find a solution quickly. On average, 30% of connections are between non-adjacent ground planes for the given benchmark circuits. (A relaxed gradient-descent partitioning sketch follows this session's table.)</em></td> </tr> <tr> <td>18:15</td> <td>4.7.4</td> <td><b>SILICON PHOTONIC MICRORING RESONATORS: DESIGN OPTIMIZATION UNDER FABRICATION NON-UNIFORMITY</b><br /> <b>Speaker</b>:<br /> Mahdi Nikdast, Colorado State University, US<br /> <b>Authors</b>:<br /> Asif Mirza, Febin Sunny, Sudeep Pasricha and Mahdi Nikdast, Colorado State University, US<br /> <em><b>Abstract</b><br /> Microring resonators (MRRs) are often considered the primary building block in silicon photonic integrated circuits (PICs). Despite many advantages, MRRs are considerably sensitive to fabrication non-uniformity (a.k.a. fabrication process variations), necessitating power-hungry compensation methods (e.g., thermal tuning) to guarantee their reliable operation. Moreover, the design space of MRRs is complicated and includes several highly correlated design parameters, preventing designers from easily exploring and optimizing the design of MRRs against fabrication process variations (FPVs). In this paper, for the first time, we present a comprehensive design space exploration and optimization of MRRs against FPVs. In particular, we indicate how physical design parameters in MRRs can be optimized at design time to enhance their tolerance to FPVs while also improving the insertion loss and quality factor of such devices.
Fabrication results obtained by measuring multiple fabricated MRRs designed using our design optimization solution demonstrate a significant 70% improvement on average in MRR tolerance to different FPVs. This improvement indicates the efficiency of our design optimization solution in reducing the tuning power required for the reliable operation of MRRs.</em></td> </tr> <tr> <td style="width:40px;">18:30</td> <td><a href="#IP2">IP2-8</a>, 849</td> <td><b>CURRENT-MODE CARRY-FREE MULTIPLIER DESIGN USING A MEMRISTOR-TRANSISTOR CROSSBAR ARCHITECTURE</b><br /> <b>Speaker</b>:<br /> Shengqi Yu, Newcastle University, GB<br /> <b>Authors</b>:<br /> Shengqi Yu<sup>1</sup>, Ahmed Soltan<sup>2</sup>, Rishad Shafik<sup>1</sup>, Thanasin Bunnam<sup>1</sup>, Domenico Balsamo<sup>1</sup>, Fei Xia<sup>1</sup> and Alex Yakovlev<sup>1</sup><br /> <sup>1</sup>Newcastle University, GB; <sup>2</sup>Nile University, EG<br /> <em><b>Abstract</b><br /> Traditional multipliers consist of complex logic components and are a major energy and performance contributor in modern compute-intensive applications. As such, designing multipliers with reduced energy and faster speed has remained a thoroughgoing challenge. This paper presents a novel carry-free multiplier, suitable for a new generation of energy-constrained applications. The multiplier circuit consists of an array of memristor-transistor cells that can be selected (i.e., turned ON or OFF) using a combination of DC bias voltages based on the operand values. When a cell is selected, it contributes current to the array path, which is then amplified by current mirrors with variable transistor gate sizes. The different current paths are connected to a node that analogously accumulates the currents to produce the multiplier output directly, removing the carry propagation stages typically seen in traditional digital multipliers. An essential feature of this multiplier is autonomous survivability: when the power falls below a threshold, the logic state is retained automatically and at zero cost due to the non-volatile properties of memristors.</em></td> </tr> <tr> <td style="width:40px;">18:31</td> <td><a href="#IP2">IP2-9</a>, 88</td> <td><b>N-BIT DATA PARALLEL SPIN WAVE LOGIC GATE</b><br /> <b>Speaker</b>:<br /> Abdulqader Mahmoud, TU Delft, NL<br /> <b>Authors</b>:<br /> Abdulqader Mahmoud<sup>1</sup>, Frederic Vanderveken<sup>2</sup>, Florin Ciubotaru<sup>2</sup>, Christoph Adelmann<sup>2</sup>, Sorin Cotofana<sup>1</sup> and Said Hamdioui<sup>1</sup><br /> <sup>1</sup>TU Delft, NL; <sup>2</sup>IMEC, BE<br /> <em><b>Abstract</b><br /> Due to their very nature, Spin Waves (SWs) created in the same waveguide, but with different frequencies, can coexist while selectively interacting with their own species only. The absence of inter-frequency interference isolates input data sets encoded in SWs with different frequencies and creates the premises for simultaneous data-parallel SW-based processing without hardware replication or delay overhead. In this paper we leverage this SW property by introducing a novel computation paradigm, which allows for the parallel processing of n-bit input data vectors on the same basic SW-based logic gate. Subsequently, to demonstrate the proposed concept, we present an 8-bit parallel 3-input Majority gate implementation and validate it by means of Object Oriented MicroMagnetic Framework (OOMMF) simulations.
To evaluate the potential benefit of our proposal, we compare the 8-bit data-parallel gate with an equivalent scalar SW gate-based implementation. Our evaluation indicates that the 8-bit data-parallel 3-input Majority gate implementation requires 4.16x less area than its scalar SW gate-based counterpart while preserving the same delay and energy consumption figures.</em></td> </tr> <tr> <td>18:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr />
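<p>The constrained partitioning of paper 4.7.3 above can be sketched as gradient descent over a relaxed (probabilistic) assignment of cells to ground planes. The cost terms below, balancing bias current and area while penalizing cut connections, are hypothetical simplifications of the paper's cost function, and a numerical finite-difference gradient is used to keep the sketch short.</p>
<pre><code>import random

def partition_cost(p, bias, area, edges, K, lam=1.0):
    """p[i][k] is the relaxed probability that cell i sits on plane k.
    Penalizes bias-current and area imbalance across the K planes,
    plus connections whose endpoints likely land on different planes.
    """
    n = len(bias)
    tb, ta = sum(bias) / K, sum(area) / K      # per-plane targets
    cost = 0.0
    for k in range(K):
        bk = sum(bias[i] * p[i][k] for i in range(n))
        ak = sum(area[i] * p[i][k] for i in range(n))
        cost += (bk - tb) ** 2 + (ak - ta) ** 2
    for i, j, w in edges:                      # w: connection weight
        same = sum(p[i][k] * p[j][k] for k in range(K))
        cost += lam * w * (1.0 - same)         # expected cut penalty
    return cost

def descend(bias, area, edges, K, steps=200, lr=1e-3, h=1e-4):
    """Coordinate-wise finite-difference descent on the relaxation,
    renormalizing each cell's row to stay on the probability simplex."""
    n = len(bias)
    p = [[random.random() for _ in range(K)] for _ in range(n)]
    for row in p:                              # random initialisation
        s = sum(row)
        for k in range(K):
            row[k] /= s
    for _ in range(steps):
        for i in range(n):
            for k in range(K):
                p[i][k] += h
                up = partition_cost(p, bias, area, edges, K)
                p[i][k] -= 2 * h
                down = partition_cost(p, bias, area, edges, K)
                p[i][k] += h                   # restore, then step downhill
                p[i][k] = max(p[i][k] - lr * (up - down) / (2 * h), 1e-9)
            s = sum(p[i])
            for k in range(K):
                p[i][k] /= s
    # harden: each cell goes to its most probable ground plane
    return [max(range(K), key=lambda k: p[i][k]) for i in range(n)]
</code></pre>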
<h2 id="UB04">UB04 Session 4</h2> <p><b>Date:</b> Tuesday 10 March 2020<br /> <b>Time:</b> 17:30 - 19:30<br /> <b>Location / Room:</b> Booth 11, Exhibition Area</p> <p> </p> <table> <tbody> <tr> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> <tr> <td>UB04.1</td> <td><b>FLETCHER: TRANSPARENT GENERATION OF HARDWARE INTERFACES FOR ACCELERATING BIG DATA APPLICATIONS</b><br /> <b>Authors</b>:<br /> Zaid Al-Ars, Johan Peltenburg, Jeroen van Straten, Matthijs Brobbel and Joost Hoozemans, TU Delft, NL<br /> <em><b>Abstract</b><br /> This demo, created by TU Delft, presents a software-hardware framework that allows for the efficient integration of FPGA hardware accelerators both on edge devices and in the cloud. The framework, called Fletcher, automatically generates data communication interfaces in hardware based on the widely used big data format Apache Arrow. This provides two distinct advantages. On the one hand, since the accelerators use the same data format as the software, data communication bottlenecks can be reduced. On the other hand, since a standardized data format is used, this allows for easy-to-use interfaces on the accelerator side, thereby reducing design and development time. The demo shows how to use Fletcher for big data acceleration to decompress Snappy-compressed files and perform filtering on the whole Wikipedia body of text, achieving 25 GB/s processing throughput.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3134.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB04.2</td> <td><b>ELSA: EIGENVALUE BASED HYBRID LINEAR SYSTEM ABSTRACTION: BEHAVIORAL MODELING OF TRANSISTOR-LEVEL CIRCUITS USING AUTOMATIC ABSTRACTION TO HYBRID AUTOMATA</b><br /> <b>Authors</b>:<br /> Ahmad Tarraf and Lars Hedrich, University of Frankfurt, DE<br /> <em><b>Abstract</b><br /> Model abstraction of transistor-level circuits, while preserving accurate behavior, is still an open problem. In this demo, an approach is presented that automatically generates a hybrid automaton (HA) with linear states from an existing circuit netlist. The approach starts with a netlist at transistor level with full SPICE accuracy and ends at the system-level description of the circuit in MATLAB or Verilog-A. The resulting hybrid automaton exhibits linear behavior as well as technology-dependent nonlinear behavior, e.g., limiting. The accuracy and speed-up of the generated Verilog-A models are evaluated on several transistor-level circuit abstractions, from simple operational amplifiers up to complex filters. Moreover, we verify the equivalence between the generated model and the original circuit. For the models generated in MATLAB syntax, a reachability analysis is performed using the reachability tool CORA.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3097.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB04.3</td> <td><b>SRSN: SECURE RECONFIGURABLE TEST NETWORK</b><br /> <b>Authors</b>:<br /> Vincent Reynaud<sup>1</sup>, Emanuele Valea<sup>2</sup>, Paolo Maistri<sup>1</sup>, Regis Leveugle<sup>1</sup>, Marie-Lise Flottes<sup>2</sup>, Sophie Dupuis<sup>2</sup>, Bruno Rouzeyre<sup>2</sup> and Giorgio Di Natale<sup>1</sup><br /> <sup>1</sup>TIMA Laboratory, FR; <sup>2</sup>LIRMM, FR<br /> <em><b>Abstract</b><br /> The critical importance of testability for electronic devices led to the development of IEEE test standards. These methods, if not protected, offer a security backdoor to attackers. This demonstrator illustrates a state-of-the-art solution that prevents unauthorized usage of the test infrastructure, based on the IEEE 1687 standard and implemented on an FPGA target.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3112.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB04.4</td> <td><b>LAGARTO: FIRST SILICON RISC-V ACADEMIC PROCESSOR DEVELOPED IN SPAIN</b><br /> <b>Authors</b>:<br /> Guillem Cabo Pitarch<sup>1</sup>, Cristobal Ramirez Lazo<sup>1</sup>, Julian Pavon Rivera<sup>1</sup>, Vatistas Kostalabros<sup>1</sup>, Carlos Rojas Morales<sup>1</sup>, Miquel Moreto<sup>1</sup>, Jaume Abella<sup>1</sup>, Francisco J. Cazorla<sup>1</sup>, Adrian Cristal<sup>1</sup>, Roger Figueras<sup>1</sup>, Alberto Gonzalez<sup>1</sup>, Carles Hernandez<sup>1</sup>, Cesar Hernandez<sup>2</sup>, Neiel Leyva<sup>2</sup>, Joan Marimon<sup>1</sup>, Ricardo Martinez<sup>3</sup>, Jonnatan Mendoza<sup>1</sup>, Francesc Moll<sup>4</sup>, Marco Antonio Ramirez<sup>2</sup>, Carlos Rojas<sup>1</sup>, Antonio Rubio<sup>4</sup>, Abraham Ruiz<sup>1</sup>, Nehir Sonmez<sup>1</sup>, Lluis Teres<sup>3</sup>, Osman Unsal<sup>5</sup>, Mateo Valero<sup>1</sup>, Ivan Vargas<sup>1</sup> and Luis Villa<sup>2</sup><br /> <sup>1</sup>BSC / UPC, ES; <sup>2</sup>CIC-IPN, MX; <sup>3</sup>IMB-CNM (CSIC), ES; <sup>4</sup>UPC, ES; <sup>5</sup>BSC, ES<br /> <em><b>Abstract</b><br /> Open hardware has emerged in recent years and has the potential to be as disruptive as Linux, the open-source software paradigm, once was. If Linux managed to lessen users' dependence on large companies providing software and software applications, it is envisioned that hardware based on open-source ISAs can do the same in its own field. Four research institutions were involved in the Lagarto tapeout: Centro de Investigación en Computación of the Mexican IPN, Centro Nacional de Microelectrónica of the CSIC, Universitat Politècnica de Catalunya (UPC) and Barcelona Supercomputing Center (BSC). As a result, many bachelor, master and PhD students had the chance to gain real-world experience with ASIC design and deliver a functional SoC.
In the booth, you will find a live demo of the first ASIC, together with FPGA prototypes of the next versions of the SoC and core.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3104.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB04.5</td> <td><b>LEARNV: A RISC-V BASED EMBEDDED SYSTEM DESIGN FRAMEWORK FOR EDUCATION AND RESEARCH DEVELOPMENT</b><br /> <b>Authors</b>:<br /> Noureddine Ait Said and Mounir Benabdenbi, TIMA Laboratory, FR<br /> <em><b>Abstract</b><br /> Designing a modern System on a Chip is based on the joint design of hardware and software (co-design). However, understanding the tight relationship between hardware and software is not straightforward. Moreover, validating new concepts in SoC design, from the idea to the hardware implementation, is time-consuming and often slowed by legacy issues (intellectual property of hardware blocks and expensive commercial tools). To overcome these issues we propose to use the open-source Rocket Chip environment for educational purposes, combined with the open-source LowRISC architecture to implement a custom SoC design on an FPGA board. The demonstration will present how students and engineers can benefit from the environment to deepen their knowledge of HW and SW co-design. Using the LowRISC architecture, an image classification application based on CNNs will serve as a demonstrator of the whole open-source hardware and software flow and will be mapped onto a Nexys A7 FPGA board.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3116.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB04.6</td> <td><b>CSI-REPUTE: A LOW POWER EMBEDDED DEVICE CLUSTERING APPROACH TO GENOME READ MAPPING</b><br /> <b>Authors</b>:<br /> Tousif Rahman<sup>1</sup>, Sidharth Maheshwari<sup>1</sup>, Rishad Shafik<sup>1</sup>, Ian Wilson<sup>1</sup>, Alex Yakovlev<sup>1</sup> and Amit Acharyya<sup>2</sup><br /> <sup>1</sup>Newcastle University, GB; <sup>2</sup>IIT Hyderabad, IN<br /> <em><b>Abstract</b><br /> The big data challenge of genomics is rooted in its requirement for extensive computational capability, which results in large power and energy consumption. To encourage widespread usage of genome assembly tools, there must be a transition from the existing, predominantly software-based mapping tools optimized for homogeneous high-performance systems to more heterogeneous, low-power, and cost-effective mapping systems. This demonstration will show a cluster system implementation of the REPUTE algorithm (an OpenCL-based read mapping tool for embedded genomics), in which cluster nodes are composed of low-power single-board computer (SBC) devices and the algorithm is deployed on each node to spread the genomic workload. We propose a working concept prototype to challenge conventional high-performance many-core CPU-based cluster nodes.
This demonstration will highlight the power and energy advantages of using SBC clusters, enabling potential solutions for low-cost genomics.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3121.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB04.7</td> <td><b>BROOK SC: HIGH-LEVEL CERTIFICATION-FRIENDLY PROGRAMMING FOR GPU-POWERED SAFETY CRITICAL SYSTEMS</b><br /> <b>Authors</b>:<br /> Marc Benito, Matina Maria Trompouki and Leonidas Kosmidis, BSC / UPC, ES<br /> <em><b>Abstract</b><br /> Graphics processing units (GPUs) can provide the increased performance required in future critical systems, e.g., automotive and avionics. However, their programming models, e.g., CUDA or OpenCL, cannot be used in such systems as they violate safety-critical programming guidelines. Brook SC (<a href="https://github.com/lkosmid/brook" title="https://github.com/lkosmid/brook">https://github.com/lkosmid/brook</a>) was developed at UPC/BSC to allow safety-critical applications to be programmed in a CUDA-like GPU language, Brook, which enables certification while increasing productivity. In our demo, an avionics application running on a realistic safety-critical GPU software stack and hardware is showcased. In this Bachelor's thesis project, which was awarded a 2019 HiPEAC Technology Transfer Award, an Airbus prototype application performing general-purpose computations with a safety-critical graphics API was ported to Brook SC in record time, achieving an order-of-magnitude reduction in the lines of code needed to implement the same functionality without performance penalty.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3105.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB04.8</td> <td><b>WALLANCE: AN ALTERNATIVE TO BLOCKCHAIN FOR IOT</b><br /> <b>Authors</b>:<br /> Loic Dalmasso, Florent Bruguier, Pascal Benoit and Achraf Lamlih, Université de Montpellier, FR<br /> <em><b>Abstract</b><br /> With the expansion of the Internet of Things (IoT), connected devices have become smart and autonomous. Their exponentially increasing number and their use in many application domains result in a huge potential for cybersecurity threats. Given the evolution of the IoT, security and interoperability are the main challenges in ensuring the reliability of information. Blockchain technology provides a new approach to handling trust in a decentralized network. However, current blockchain implementations cannot be used in the IoT domain because of their huge demands on computing power and storage.
This demonstrator presents a lightweight distributed ledger protocol dedicated to IoT applications, reducing computing power and storage requirements, handling scalability, and ensuring the reliability of information. (A minimal hash-chained ledger sketch follows this session's table.)</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3119.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB04.9</td> <td><b>RUMORE: A FRAMEWORK FOR RUNTIME MONITORING AND TRACE ANALYSIS FOR COMPONENT-BASED EMBEDDED SYSTEMS DESIGN FLOW</b><br /> <b>Authors</b>:<br /> Vittoriano Muttillo<sup>1</sup>, Luigi Pomante<sup>1</sup>, Giacomo Valente<sup>1</sup>, Hector Posadas<sup>2</sup>, Javier Merino<sup>2</sup> and Eugenio Villar<sup>2</sup><br /> <sup>1</sup>University of L'Aquila, IT; <sup>2</sup>University of Cantabria, ES<br /> <em><b>Abstract</b><br /> The purpose of this demonstrator is to introduce runtime monitoring infrastructures and trace-data analysis. The goal is to show how different monitoring requirements can be addressed by defining a general reference architecture that can be adapted to different scenarios. Starting from design artifacts generated by a system engineering modeling tool, a custom HW monitoring infrastructure will be presented. This sub-system is able to generate runtime artifacts for runtime verification. We will show how the RUMORE framework provides round-trip support in the development chain, injecting monitoring requirements from design models down to code and its execution on the platform, and feeding trace data back to the models, where the expected behavior is then compared with the actual behavior. This approach can be used to optimize design models for specific properties (e.g., system performance).</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3126.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB04.10</td> <td><b>DISTRIBUTING TIME-SENSITIVE APPLICATIONS ON EDGE COMPUTING ENVIRONMENTS</b><br /> <b>Authors</b>:<br /> Eudald Sabaté Creixell<sup>1</sup>, Unai Perez Mendizabal<sup>1</sup>, Elli Kartsakli<sup>2</sup>, Maria A. Serrano Gracia<sup>3</sup> and Eduardo Quiñones Moreno<sup>3</sup><br /> <sup>1</sup>BSC / UPC, ES; <sup>2</sup>BSC, GR; <sup>3</sup>BSC, ES<br /> <em><b>Abstract</b><br /> The proposed demonstration showcases the capabilities of a task-based distributed programming framework for the execution of real-time applications in edge computing scenarios, in the context of smart cities. Edge computing shifts computation close to the data source, alleviating the pressure on the cloud and reducing application response times. However, the development and deployment of distributed real-time applications are complex, due to the heterogeneous and dynamic edge environment in which resources may not always be available. To address these challenges, our demo employs COMPSs, a highly portable and infrastructure-agnostic programming model, to efficiently distribute time-sensitive applications across the compute continuum.
We will exhibit how COMPSs distributes the workload across different edge devices (e.g., NVIDIA GPUs and a Raspberry Pi), and how COMPSs re-adapts this distribution upon the connection or disconnection of devices.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3108.pdf">More information ...</a></b></em></td> </tr> <tr> <td>19:30</td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> </tr> </tbody> </table> <hr />
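<p>To illustrate why an append-only, hash-chained log can be far lighter than a proof-of-work blockchain for IoT nodes, as targeted by UB04.8 above, here is a minimal Python sketch. It is purely illustrative: Wallance's actual protocol, consensus mechanism, and data model are not reproduced here.</p>
<pre><code>import hashlib
import json
import time

class LightLedger:
    """Minimal hash-chained ledger: each entry commits to its
    predecessor, so any tampering is detectable by re-hashing,
    without mining or heavyweight global state."""

    def __init__(self):
        self.entries = [self._make(0, "0" * 64, {"genesis": True})]

    def _make(self, index, prev_hash, payload):
        entry = {"index": index, "prev": prev_hash,
                 "time": time.time(), "payload": payload}
        body = json.dumps(entry, sort_keys=True).encode()
        entry["hash"] = hashlib.sha256(body).hexdigest()
        return entry

    def append(self, payload):
        last = self.entries[-1]
        self.entries.append(
            self._make(last["index"] + 1, last["hash"], payload))

    def verify(self):
        """Recompute every hash and chain link; False on tampering."""
        for i, e in enumerate(self.entries):
            body = {k: v for k, v in e.items() if k != "hash"}
            digest = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if digest != e["hash"] or (
                    i and e["prev"] != self.entries[i - 1]["hash"]):
                return False
        return True

ledger = LightLedger()
ledger.append({"device": "sensor-42", "reading": 21.5})
print(ledger.verify())   # True
</code></pre>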
<h2 id="5.1">5.1 Special Day on "Embedded AI": Tutorial Overviews</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 08:30 - 10:00<br /> <b>Location / Room:</b> Amphithéâtre Jean Prouve</p> <p><b>Chair:</b><br /> Dmitri Strukov, University of California, Santa Barbara, US</p> <p><b>Co-Chair:</b><br /> Bernabe Linares-Barranco, CSIC, ES</p> <p>This session provides a tutorial-style overview of hardware AI case studies and some proposed solutions, problems, and challenges.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>5.1.1</td> <td><b>NEURAL NETWORKS CIRCUITS BASED ON RESISTIVE MEMORIES</b><br /> <b>Author</b>:<br /> Carlo Reita, CEA, FR<br /> <em><b>Abstract</b><br /> In recent years, the field of Neural Networks has found a new golden age after nearly twenty years of lessened interest. Under the heading of Artificial Intelligence (AI), a large number of Deep Neural Networks (DNNs) have recently found application in image processing, management of information in large databases, decision aids, natural language recognition, etc. Most of these applications rely on algorithms that run on standard computing systems and sometimes make use of specific accelerators like Graphics Processing Units (GPUs) or dedicated highly parallel processors. In effect, a common operation in all NN algorithms is the scalar product of two vectors, and its optimisation is of paramount importance in reducing computational time and energy. In particular, the energy element is relevant for all embedded applications that cannot rely on cooling and/or an unlimited power supply. The availability of resistive memories, with their unique capability of both storing computational values and performing analog multiplication through Ohm's law, allows new circuit architectures in which the latency, bandwidth limitations, and power consumption issues associated with conventional SRAM, DRAM and Flash memories can be greatly improved upon. In the presentation, some examples of the advantageous use of resistive memories in NN circuits will be shown and some of their peculiarities will be discussed.</em></td> </tr> <tr> <td>09:15</td> <td>5.1.2</td> <td><b>EXPLOITING ACTIVATION SPARSITY IN DRAM-BASED SCALABLE CNN AND RNN ACCELERATORS</b><br /> <b>Author</b>:<br /> Tobi Delbrück, ETH Zurich, CH<br /> <em><b>Abstract</b><br /> Large deep neural networks (DNNs) need lots of fast memory for states and weights. Although DRAM is the dominant high-throughput, low-cost memory (costing 20X less than SRAM), its long random-access latency is ill-suited to the unpredictable access patterns in spiking neural networks (SNNs). Yet sparsely active SNNs are key to biological computational efficiency. This talk reports on our five years of development of convolutional and recurrent deep neural network hardware accelerators that exploit spatial and temporal sparsity, like SNNs, but achieve state-of-the-art throughput, power efficiency, and latency using DRAM for the large weight and state memory required by powerful DNNs.</em></td> </tr> <tr> <td>10:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="5.2">5.2 Machine Learning Approaches to Analog Design</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 08:30 - 10:00<br /> <b>Location / Room:</b> Chamrousse</p> <p><b>Chair:</b><br /> Marie-Minerve Louerat, Sorbonne University Lip6, FR</p> <p><b>Co-Chair:</b><br /> Sebastien Cliquennois, STMicroelectronics, FR</p> <p>This session presents recent advances in machine learning approaches to support the design of analog and mixed-signal circuits. Techniques such as reinforcement learning and convolutional networks are employed to address circuit and layout optimization. The presented techniques have great potential for seeding innovative solutions to current and future challenges in this field.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>5.2.1</td> <td><b>AUTOCKT: DEEP REINFORCEMENT LEARNING OF ANALOG CIRCUIT DESIGNS</b><br /> <b>Speaker</b>:<br /> Keertana Settaluri, University of California, Berkeley, US<br /> <b>Authors</b>:<br /> Keertana Settaluri, Ameer Haj-Ali, Qijing Huang, Kourosh Hakhamaneshi and Borivoje Nikolic, University of California, Berkeley, US<br /> <em><b>Abstract</b><br /> The need for domain specialization under energy constraints in deeply scaled CMOS has been driving the need for agile development of Systems on a Chip (SoCs). While digital subsystems have design flows that are conducive to rapid iterations from specification to layout, analog and mixed-signal modules face the challenge of a long human-in-the-middle iteration loop that requires expert intuition to verify that post-layout circuit parameters meet the original design specification. Existing automated solutions that optimize circuit parameters for a given target design specification have the limitations of being schematic-only, inaccurate, sample-inefficient, or not generalizable. This work presents AutoCkt, a deep reinforcement learning tool that not only finds post-layout circuit parameters for a given target specification, but also gains knowledge about the entire design space through a sparse subsampling technique. Our results show that for multiple circuit topologies, the trained AutoCkt agent is able to converge and meet all target specifications on at least 96.3% of tested design goals in schematic simulation, on average 40X faster than a traditional genetic algorithm. Using the Berkeley Analog Generator, AutoCkt is able to design 40 LVS-passed operational amplifiers in 68 hours, 9.6X faster than the state of the art when considering layout parasitics.</em></td> </tr> <tr> <td>09:00</td> <td>5.2.2</td> <td><b>TOWARDS DECRYPTING THE ART OF ANALOG LAYOUT: PLACEMENT QUALITY PREDICTION VIA TRANSFER LEARNING</b><br /> <b>Speaker</b>:<br /> David Pan, University of Texas at Austin, US<br /> <b>Authors</b>:<br /> Mingjie Liu, Keren Zhu, Jiaqi Gu, Linxiao Shen, Xiyuan Tang, Nan Sun and David Z.
Pan, University of Texas at Austin, US<br /> <em><b>Abstract</b><br /> Despite tremendous efforts in analog layout automation, little adoption has been demonstrated in practical design flows. Traditional analog layout synthesis tools use various heuristic constraints to prune the design space to ensure post-layout performance. However, these approaches provide limited guarantees and poor generalizability due to the lack of a model mapping layout properties to circuit performance. In this paper, we attempt to close the gap in post-layout performance modeling for analog circuits with a quantitative statistical approach. We leverage a state-of-the-art automatic layout tool and an industry-level simulator to generate labeled training data in an automatic manner. We propose a 3D convolutional neural network (CNN) model to predict the relative placement quality using well-crafted placement features. To achieve data efficiency for practical usage, we further propose a transfer learning scheme that greatly reduces the amount of data needed. Our model would enable early pruning and efficient design exploration for practical layout design flows. Experimental results demonstrate the effectiveness and generalizability of our method across different operational transconductance amplifier (OTA) designs.</em></td> </tr> <tr> <td>09:30</td> <td>5.2.3</td> <td><b>DESIGN OF MULTI-OUTPUT SWITCHED-CAPACITOR VOLTAGE REGULATOR VIA MACHINE LEARNING</b><br /> <b>Speaker</b>:<br /> Zhiyuan Zhou, Washington State University, US<br /> <b>Authors</b>:<br /> Zhiyuan Zhou<sup>1</sup>, Syrine Belakaria<sup>2</sup>, Aryan Deshwal<sup>2</sup>, Wookpyo Hong<sup>1</sup>, Jana Doppa<sup>2</sup>, Partha Pratim Pande<sup>1</sup> and Deukhyoun Heo<sup>1</sup><br /> <sup>1</sup>Washington State University, US; <sup>2</sup>Washington State University, US<br /> <em><b>Abstract</b><br /> The efficiency of the power management system (PMS) is one of the key performance metrics for highly integrated systems-on-chip (SoCs). Towards the goal of improving the power efficiency of SoCs, we make two key technical contributions in this paper. First, we develop a multi-output switched-capacitor voltage regulator (SCVR) with a new flying capacitor crossing technique (FCCT) and a cloud-capacitor method. Second, to optimize the design parameters of the SCVR, we introduce a novel machine-learning (ML)-inspired optimization framework to reduce the number of expensive design simulations. Simulation shows that the power loss of the multi-output SCVR with FCCT is reduced by more than 40% compared to conventional multiple single-output SCVRs. Our ML-based design optimization framework achieves more than a 90% reduction in the number of simulations needed to uncover optimized circuit parameters of the proposed SCVR.</em></td> </tr> <tr> <td style="width:40px;">10:00</td> <td><a href="#IP2">IP2-10</a>, 371</td> <td><b>HIGH-SPEED ANALOG SIMULATION OF CMOS VISION CHIPS USING EXPLICIT INTEGRATION TECHNIQUES ON MANY-CORE PROCESSORS</b><br /> <b>Speaker</b>:<br /> Tom Kazmierski, University of Southampton, GB<br /> <b>Authors</b>:<br /> Gines Domenech-Asensi<sup>1</sup> and Tom J Kazmierski<sup>2</sup><br /> <sup>1</sup>Universidad Politecnica de Cartagena, ES; <sup>2</sup>University of Southampton, GB<br /> <em><b>Abstract</b><br /> This work describes a high-speed simulation technique for analog circuits, based on state-space equations and an explicit integration method parallelised on a multiprocessor architecture.
The integration step of such a method is smaller than the one required by an implicit simulation technique based on Newton-Raphson iterations. However, given that explicit methods do not require time-consuming matrix factorizations, the overall simulation time is reduced. The technique described in this work has been implemented on an NVIDIA general-purpose GPU and has been tested by simulating the Gaussian filtering operation performed by a smart CMOS image sensor. Such devices are used to perform computation on the edge and include built-in image processing functions. Among those, Gaussian filtering is one of the most common, since it is a basic task for early vision processing. These smart sensors are increasingly complex, and hence the time required to simulate them during the design cycle keeps growing. Beyond a certain imager size, the proposed simulation method yields simulation times two orders of magnitude faster than an implicit-method-based tool such as SPICE. (A minimal explicit-integration sketch follows this session's table.)</em></td> </tr> <tr> <td style="width:40px;">10:01</td> <td><a href="#IP2">IP2-11</a>, 919</td> <td><b>A 100KHZ-1GHZ TERMINATION-DEPENDENT HUMAN BODY COMMUNICATION CHANNEL MEASUREMENT USING MINIATURIZED WEARABLE DEVICES</b><br /> <b>Speaker</b>:<br /> Shreyas Sen, Purdue University, US<br /> <b>Authors</b>:<br /> Shitij Avlani, Mayukh Nath, Shovan Maity and Shreyas Sen, Purdue University, US<br /> <em><b>Abstract</b><br /> Human Body Communication (HBC) has shown great promise as a replacement for wireless communication in the information exchange between wearable devices of a body area network. However, very few studies in the literature systematically examine the channel loss of capacitive HBC for wearable devices over a wide frequency range with different terminations at the receiver, partly due to the need for miniaturized wearable devices for an accurate study. This paper, for the first time, measures the channel loss of capacitive HBC from 100 kHz to 1 GHz for both high-impedance and 50 ohm terminations using wearable, battery-powered devices, which is mandatory for accurate measurement of the HBC channel loss due to ground coupling effects. Results show that high-impedance termination leads to a significantly lower channel loss (40 dB improvement at 1 MHz) compared to 50 ohm termination at low frequencies. This difference steadily decreases with increasing frequency, until the two become similar near 80 MHz. Beyond 100 MHz, inter-device coupling dominates, thereby preventing accurate measurements of the channel loss of the human body. The measured results provide consistent wearable, wide-frequency HBC channel loss data and could serve as a backbone for the emerging field of HBC by aiding in the selection of an appropriate operating frequency and termination.</em></td> </tr> <tr> <td>10:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr />
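<p>The explicit step that IP2-10 above parallelises on a GPU can be illustrated with forward-Euler integration of a linear state-space model. The Python sketch below is a minimal serial version; the RC example values are purely illustrative, and the paper's actual vision-chip models and GPU kernels are not reproduced.</p>
<pre><code>import numpy as np

def simulate_explicit(A, B, u, x0, dt, steps):
    """Forward-Euler integration of x'(t) = A x(t) + B u(t).

    Each step is only a matrix-vector product: no matrix
    factorization or Newton-Raphson iteration as in implicit
    (SPICE-style) solvers, which is what makes the method easy to
    parallelise. The step dt must respect the fastest time constant
    of the circuit for numerical stability.
    """
    x = np.asarray(x0, dtype=float)
    trace = [x.copy()]
    for k in range(steps):
        x = x + dt * (A @ x + B @ u(k * dt))
        trace.append(x.copy())
    return np.array(trace)

# Example: first-order RC low-pass (tau = 1 ms) driven by a unit step
A = np.array([[-1000.0]])              # -1/tau
B = np.array([[1000.0]])               # 1/tau
out = simulate_explicit(A, B, u=lambda t: np.array([1.0]),
                        x0=[0.0], dt=1e-5, steps=500)
print(out[-1])                         # approaches the step value 1.0
</code></pre>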
<h2 id="5.3">5.3 Special Session: Secure Composition of Hardware Systems</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 08:30 - 10:00<br /> <b>Location / Room:</b> Autrans</p> <p><b>Chair:</b><br /> Ilia Polian, Stuttgart University, DE</p> <p><b>Co-Chair:</b><br /> Francesco Regazzoni, ALARI, CH</p> <p>Today's electronic systems consist of mixtures of programmable, reconfigurable, and application-specific hardware components, tied together by tremendously complex software. At the same time, systems are increasingly integrated such that a sub-system that was traditionally regarded as "harmless" (a car's entertainment system) finds itself tightly coupled with safety-critical sub-systems (driving assistance) and security-sensitive sub-systems such as online payment. Moreover, a system's hardware components are now often directly accessible to end users and thus vulnerable to physical attacks. The goal of this hot-topic session is to establish a common understanding of principles and techniques that can facilitate the composition and integration of hardware systems and achieve security guarantees. Theoretical foundations of secure composition are currently limited to software systems, and unique security challenges arise when a real system, composed of a range of hardware components with different owners and trust assumptions, is put together. Physical and side-channel attacks add another level of complexity to the problem of secure composition. Moreover, practical hardware systems include software stacks of tremendous size and complexity, and hardware-software interaction can create new security challenges. This hot-topic session considers secure composition both from a purely hardware-centric perspective and from a hardware-software perspective in a more complex system. It also targets the composition of countermeasures against hardware-centric attacks and against software-driven attacks on hardware. It brings together researchers and industry practitioners who deal with secure composition: security-oriented electronic design automation; secure architectures of automotive hardware-software systems; and advanced attack scenarios against complex hardware systems.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>5.3.1</td> <td><b>TOWARDS SECURE COMPOSITION OF INTEGRATED CIRCUITS AND ELECTRONIC SYSTEMS: ON THE ROLE OF EDA</b><br /> <b>Speaker</b>:<br /> Johann Knechtel, New York University Abu Dhabi, AE<br /> <b>Authors</b>:<br /> Johann Knechtel<sup>1</sup>, Elif Bilge Kavun<sup>2</sup>, Francesco Regazzoni<sup>3</sup>, Annelie Heuser<sup>4</sup>, Anupam Chattopadhyay<sup>5</sup>, Debdeep Mukhopadhyay<sup>6</sup>, Dey Soumyajit<sup>6</sup>, Yunsi Fei<sup>7</sup>, Yaacov Belenky<sup>8</sup>, Itamar Levi<sup>9</sup>, Tim Güneysu<sup>10</sup>, Patrick Schaumont<sup>11</sup> and Ilia Polian<sup>12</sup><br /> <sup>1</sup>New York University Abu Dhabi, AE; <sup>2</sup>University of Sheffield, GB; <sup>3</sup>ALaRI, CH; <sup>4</sup>Université de Rennes / Inria / CNRS / IRISA, FR; <sup>5</sup>Nanyang Technological University, SG; <sup>6</sup>IIT Kharagpur, IN; <sup>7</sup>Northeastern University, US; <sup>8</sup>Intel, IL; <sup>9</sup>Bar-Ilan University, IL; <sup>10</sup>Ruhr-University Bochum, DE; <sup>11</sup>Worcester Polytechnic Institute, US; <sup>12</sup>University of Stuttgart, DE<br /> <em><b>Abstract</b><br /> Modern electronic systems become ever more complex, yet remain modular, with integrated circuits (ICs) acting as versatile hardware components at their heart. Electronic design automation (EDA) for ICs has traditionally focused on power, performance, and area. However, given the rise of hardware-centric security threats, we believe that EDA must also adopt related notions like secure by design and secure composition of hardware.
Despite various promising studies, we argue that some aspects still require more effort, for example: effective means for compilation of assumptions and constraints for security schemes, all the way from the system level down to the "bare metal"; modeling, evaluation, and consideration of security-relevant metrics; or automated and holistic synthesis of various countermeasures, without inducing negative cross-effects. In this paper, we first introduce hardware security for the EDA community. Next, we review prior (academic) art for EDA-driven security evaluation and implementation of countermeasures. We then discuss strategies and challenges for advancing research and development toward secure composition of circuits and systems.</em></td> </tr> <tr> <td>08:55</td> <td>5.3.2</td> <td><b>ATTACKER MODELING ON COMPOSED SYSTEMS</b><br /> <b>Speaker</b>:<br /> Pierre Schnarz, Continental AG, DE<br /> <b>Authors</b>:<br /> Tobias Basic, Jan Müller, Pierre Schnarz and Marc Stoettinger, Continental AG, DE</td> </tr> <tr> <td>09:15</td> <td>5.3.3</td> <td><b>PITFALLS IN MACHINE LEARNING-BASED ADVERSARY MODELING FOR HARDWARE SYSTEMS</b><br /> <b>Speaker</b>:<br /> Fatemeh Ganji, University of Florida, US<br /> <b>Authors</b>:<br /> Fatemeh Ganji<sup>1</sup>, Sarah Amir<sup>1</sup>, Shahin Tajik<sup>1</sup>, Jean-Pierre Seifert<sup>2</sup> and Domenic Forte<sup>1</sup><br /> <sup>1</sup>University of Florida, US; <sup>2</sup>TU Berlin, DE</td> </tr> <tr> <td>09:35</td> <td>5.3.4</td> <td><b>USING UNIVERSAL COMPOSITION TO DESIGN AND ANALYZE SECURE COMPLEX HARDWARE SYSTEMS</b><br /> <b>Speaker</b>:<br /> Marten van Dijk, University of Connecticut, US<br /> <b>Authors</b>:<br /> Ran Canetti<sup>1</sup>, Marten van Dijk<sup>2</sup>, Hoda Maleki<sup>3</sup>, Ulrich Rührmair<sup>4</sup> and Patrick Schaumont<sup>5</sup><br /> <sup>1</sup>Boston University, US; <sup>2</sup>University of Connecticut, US; <sup>3</sup>University of Augusta, US; <sup>4</sup>TU Munich, DE; <sup>5</sup>Worcester Polytechnic Institute, US</td> </tr> <tr> <td>10:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="5.4">5.4 New Frontiers in Formal Verification for Hardware</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 08:30 - 10:00<br /> <b>Location / Room:</b> Stendhal</p> <p><b>Chair:</b><br /> Alessandro Cimatti, Fondazione Bruno Kessler, IT</p> <p><b>Co-Chair:</b><br /> Heinz Riener, EPFL, CH</p> <p>The session presents several new techniques in hardware verification. The technical papers propose methods for the formal verification of industrial arithmetic circuits and processors, and show how reinforcement learning can be used for verification of shared memory protocols.
Two interactive presentations describe how to use high-level synthesis to supply security guarantees and to generate certificates when verifying multipliers.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>5.4.1</td> <td><b>GAP-FREE PROCESSOR VERIFICATION BY S²QED AND PROPERTY GENERATION</b><br /> <b>Speaker</b>:<br /> Keerthikumara Devarajegowda, Infineon Technologies, DE<br /> <b>Authors</b>:<br /> Keerthikumara Devarajegowda<sup>1</sup>, Mohammad Rahmani Fadiheh<sup>2</sup>, Eshan Singh<sup>3</sup>, Clark Barrett<sup>3</sup>, Subhasish Mitra<sup>3</sup>, Wolfgang Ecker<sup>1</sup>, Dominik Stoffel<sup>2</sup> and Wolfgang Kunz<sup>2</sup><br /> <sup>1</sup>Infineon Technologies, DE; <sup>2</sup>TU Kaiserslautern, DE; <sup>3</sup>Stanford University, US<br /> <em><b>Abstract</b><br /> The required manual effort and verification expertise are among the main hurdles for adopting formal verification in processor design flows. Developing a set of properties that fully covers all instruction behaviors is a laborious and challenging task. This paper proposes a highly automated and "complete" processor verification approach which requires considerably less manual effort and expertise compared to the state of the art. The proposed approach extends the S²QED approach to cover both single and multiple instruction bugs and ensures that a design is completely verified according to a well-defined criterion. This makes the approach robust against human errors. The properties are simple and can be automatically generated from an ISA model with small manual effort. Furthermore, unlike in conventional property checking, the verification engineer does not need to explicitly specify the processor's behavior in different special scenarios, such as stalling, exception, or speculation, since these are taken care of implicitly by the proposed computational model. The great promise of the approach is shown by an industrial case study with a 5-stage RISC-V processor.</em></td> </tr> <tr> <td>09:00</td> <td>5.4.2</td> <td><b>SPEAR: HARDWARE-BASED IMPLICIT REWRITING FOR SQUARE-ROOT VERIFICATION</b><br /> <b>Speaker</b>:<br /> Maciej Ciesielski, University of Massachusetts Amherst, US<br /> <b>Authors</b>:<br /> Atif Yasin<sup>1</sup>, Tiankai Su<sup>1</sup>, Sebastien Pillement<sup>2</sup> and Maciej Ciesielski<sup>1</sup><br /> <sup>1</sup>University of Massachusetts Amherst, US; <sup>2</sup>University of Nantes France, FR<br /> <em><b>Abstract</b><br /> The paper addresses the formal verification of gate-level square-root circuits. Division and square root functions are some of the most complex arithmetic operations to implement and proving the correctness of their hardware implementation is of great importance. In contrast to standard approaches that use satisfiability and equivalence checking techniques, the presented method verifies whether the gate-level square-root circuit actually performs a root operation, instead of checking equivalence with a reference design. The method extends the algebraic rewriting technique developed earlier for multipliers and introduces a novel technique of implicit hardware rewriting. 
The tool, called SPEAR and based on hardware rewriting, enables the verification of a 256-bit gate-level square-root circuit with 0.26 million gates in under 18 minutes.</em></td> </tr> <tr> <td>09:30</td> <td>5.4.3</td> <td><b>A REINFORCEMENT LEARNING APPROACH TO DIRECTED TEST GENERATION FOR SHARED MEMORY VERIFICATION</b><br /> <b>Speaker</b>:<br /> Nícolas Pfeifer, Federal University of Santa Catarina, BR<br /> <b>Authors</b>:<br /> Nicolas Pfeifer, Bruno V. Zimpel, Gabriel A. G. Andrade and Luiz C. V. dos Santos, Federal University of Santa Catarina, BR<br /> <em><b>Abstract</b><br /> Multicore chips are expected to rely on coherent shared memory. Although the coherence hardware can scale gracefully, the protocol state space grows exponentially with core count. That is why design verification requires directed test generation (DTG) for dynamic coverage control under the tight time constraints resulting from slow simulation and short verification budgets. Next-generation EDA tools are expected to exploit Machine Learning for reaching high coverage in less time. We propose a technique that addresses DTG as a decision process and tries to find a decision-making policy for maximizing the cumulative coverage, as a result of successive actions taken by an agent. Instead of simply relying on learning, our technique builds upon the legacy from constrained random test generation (RTG). It casts DTG as coverage-driven RTG, and it explores distinct RTG engines subject to progressively tighter constraints. We compared three Reinforcement Learning generators with a state-of-the-art generator based on Genetic Programming. The experimental results show that the proper enforcement of constraints is more efficient for guiding learning towards higher coverage than simply letting the generator learn how to select the most promising memory events for increasing coverage. For a 3-level MESI 32-core design, the proposed approach led to the highest observed coverage (95.81%), and it was 2.4 times faster than the baseline generator to reach the latter's maximal coverage.</em></td> </tr> <tr> <td>09:45</td> <td>5.4.4</td> <td><b>TOWARDS FORMAL VERIFICATION OF OPTIMIZED AND INDUSTRIAL MULTIPLIERS</b><br /> <b>Speaker</b>:<br /> Alireza Mahzoon, University of Bremen, DE<br /> <b>Authors</b>:<br /> Alireza Mahzoon<sup>1</sup>, Daniel Grosse<sup>2</sup>, Christoph Scholl<sup>3</sup> and Rolf Drechsler<sup>2</sup><br /> <sup>1</sup>University of Bremen, DE; <sup>2</sup>University of Bremen / DFKI, DE; <sup>3</sup>University of Freiburg, DE<br /> <em><b>Abstract</b><br /> Formal verification methods have made huge progress over the last decades. However, proving the correctness of arithmetic circuits involving integer multipliers still drives the verification techniques to their limits. Recently, Symbolic Computer Algebra (SCA) methods have shown good results in the verification of both large and non-trivial multipliers. Their success is mainly based on (1) reverse engineering and identifying basic building blocks, (2) finding converging gate cones which start from the basic building blocks and (3) early removal of redundant terms (vanishing monomials) to avoid the blow-up during backward rewriting. Despite these important accomplishments, verifying optimized and technology-mapped multipliers is an almost unexplored area. This creates major barriers for industrial use as most of the designs are area and delay optimized.
To overcome these barriers, we propose a novel SCA method which supports the formal verification of a large variety of optimized multipliers. Our method takes advantage of a dynamic substitution ordering to avoid the monomial explosion during backward rewriting. Experimental results confirm the efficiency of our approach in the verification of a wide range of optimized multipliers including industrial benchmarks.</em></td> </tr> <tr> <td style="width:40px;">10:00</td> <td><a href="#IP2">IP2-12</a>, 151</td> <td><b>FROM DRUP TO PAC AND BACK</b><br /> <b>Speaker</b>:<br /> Daniela Kaufmann, Johannes Kepler University Linz, AT<br /> <b>Authors</b>:<br /> Daniela Kaufmann, Armin Biere and Manuel Kauers, Johannes Kepler University Linz, AT<br /> <em><b>Abstract</b><br /> Currently the most efficient automatic approach to verify gate-level multipliers combines SAT solving and computer algebra. In order to increase confidence in the verification, proof certificates are generated. However, due to different solving techniques, these certificates require two different proof formats, namely DRUP and PAC. A combined proof has so far been missing, so correctness of this approach can only be trusted up to the correctness of compositional reasoning. In this paper we show how to generate a single proof in one proof format, which then allows correctness to be certified using one simple proof checker. We further investigate empirically the effect on proof generation and checking time as well as on proof size. It turns out that PAC proofs are much more compact and faster to check.</em></td> </tr> <tr> <td style="width:40px;">10:01</td> <td><a href="#IP2">IP2-13</a>, 956</td> <td><b>VERIFIABLE SECURITY TEMPLATES FOR HARDWARE</b><br /> <b>Speaker</b>:<br /> Bill Harrison, Oak Ridge National Laboratory, US<br /> <b>Authors</b>:<br /> William Harrison<sup>1</sup> and Gerard Allwein<sup>2</sup><br /> <sup>1</sup>Oak Ridge National Laboratory, US; <sup>2</sup>Naval Research Laboratory, US<br /> <em><b>Abstract</b><br /> HLS has, with a few notable exceptions, not focused on transferring ideas and techniques from high-assurance software formal methods to hardware development, despite there being a sophisticated and mature body of research in that area. Just as it has introduced software engineering virtues, we believe HLS can also become a vector for retrofitting software formal methods to the challenge of high-assurance security in hardware. This paper introduces the Device Calculus and its mechanization in the Agda proof checking system. The Device Calculus is a starting point for exploring formal methods and security within high-level synthesis flows. We illustrate the Device Calculus with a number of examples of formally verifiable security templates, i.e., functions in the Device Calculus that express common security structures at a high level of abstraction.</em></td> </tr> <tr> <td>10:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="5.5">5.5 Model-Based Analysis and Security</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 08:30 - 10:00<br /> <b>Location / Room:</b> Bayard</p> <p><b>Chair:</b><br /> Ylies Falcone, University Grenoble Alpes and Inria, FR</p> <p><b>Co-Chair:</b><br /> Todd Austin, University of Michigan, US</p> <p>The session explores the use of state-of-the-art model-based analysis and verification techniques to secure and improve the performance of embedded systems.
More specifically, it presents the use of satisfiability modulo theories, runtime monitoring, fuzzing, and model checking to evaluate how secure a system is, and to prevent and detect attacks.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>5.5.1</td> <td><b>IS REGISTER TRANSFER LEVEL LOCKING SECURE?</b><br /> <b>Speaker</b>:<br /> Chandan Karfa, IIT Guwahati, IN<br /> <b>Authors</b>:<br /> Chandan Karfa<sup>1</sup>, Ramanuj Chouksey<sup>1</sup>, Christian Pilato<sup>2</sup>, Siddharth Garg<sup>3</sup> and Ramesh Karri<sup>3</sup><br /> <sup>1</sup>IIT Guwahati, IN; <sup>2</sup>Politecnico di Milano, IT; <sup>3</sup>New York University, US<br /> <em><b>Abstract</b><br /> Register Transfer Level (RTL) locking seeks to prevent intellectual property (IP) theft of a design by locking the RTL description so that it functions correctly only on the application of a key. This paper evaluates the security of a state-of-the-art RTL locking scheme using a satisfiability modulo theories (SMT) based algorithm to retrieve the secret key. The attack first obtains the high-level behavior of the locked RTL, and then uses an SMT-based formulation to find so-called distinguishing input patterns (DIPs), i.e., inputs that help eliminate incorrect keys from the keyspace. The attack methodology has two main advantages over gate-level attacks. First, since the attack handles the design at the RTL, the method scales to large designs. Second, the attack does not apply separate unlocking strategies for the combinational and sequential parts of a design; it handles both styles via a unifying abstraction. We demonstrate the attack on locked RTL generated by TAO, a state-of-the-art RTL locking solution. Empirical results show that we can partially or completely break designs locked by TAO.</em></td> </tr> <tr> <td>09:00</td> <td>5.5.2</td> <td><b>DESIGN SPACE EXPLORATION FOR MODEL-BASED COMMUNICATION SYSTEMS</b><br /> <b>Speaker</b>:<br /> Valentina Richthammer, University of Ulm, DE<br /> <b>Authors</b>:<br /> Valentina Richthammer, Marcel Rieß, Julian Bestler, Frank Slomka and Michael Glaß, University of Ulm, DE<br /> <em><b>Abstract</b><br /> A main challenge of modem design lies in selecting a suitable combination of subsystems (e.g. Analog Digital/Digital Analog Converters (ADC/DAC), (de)modulators, scramblers, interleavers, and coding and filtering modules), each of which can be implemented in a multitude of ways. At the same time, the complete modem configuration needs to be tailored to the specific requirements of the intended communication channel or scenario. Therefore, model-based design methodologies have recently been popularized in this field, since their application facilitates the specification of individual modem components that are easily exchanged during the automated synthesis of the modem. However, this development has resulted in a tremendous increase in the number of synthesizable modem options. In fact, the optimal modem configuration for a communication scenario cannot readily be determined, since an exhaustive analysis of all configuration possibilities is computationally intractable. To remedy this, we propose a fully automated Design Space Exploration (DSE) methodology for model-based modem design that combines (I) the metaheuristic exploration and optimization of modem-configuration possibilities with (II) a simulative analysis of suitable measures of communication quality.
The presented case study for an acoustic underwater communication scenario supports the described need for novel, automated methodologies in the area of model-based design, since the modem configurations discovered during a comparatively short DSE are demonstrated to significantly outperform state-of-the-art modems from the literature.</em></td> </tr> <tr> <td>09:30</td> <td>5.5.3</td> <td><b>STATISTICAL TIME-BASED INTRUSION DETECTION IN EMBEDDED SYSTEMS</b><br /> <b>Speaker</b>:<br /> Nadir Carreon Rascon, University of Arizona, US<br /> <b>Authors</b>:<br /> Nadir Carreon Rascon, Allison Gilbreath and Roman Lysecky, University of Arizona, US<br /> <em><b>Abstract</b><br /> This paper presents a statistical method based on cumulative distribution functions (CDFs) to analyze an embedded system's behavior and detect anomalous and malicious execution behaviors. The proposed method analyzes the internal timing of the system by monitoring individual operations and sequences of operations, wherein the timing of operations is decomposed into multiple timing subcomponents. Creating the normal model of the system from this internal timing adds resilience to zero-day attacks and mimicry malware. The combination of CDF-based statistical analysis and timing subcomponents enables both higher detection rates and lower false-positive rates. We demonstrate the effectiveness of the approach and compare it to several state-of-the-art malware detection methods using two embedded systems benchmarks, namely a network-connected pacemaker and an unmanned aerial vehicle, utilizing seven different malware samples.</em></td> </tr> <tr> <td style="width:40px;">10:00</td> <td><a href="#IP2">IP2-14</a>, 637</td> <td><b>IFFSET: IN-FIELD FUZZING OF INDUSTRIAL CONTROL SYSTEMS USING SYSTEM EMULATION</b><br /> <b>Speaker</b>:<br /> Dimitrios Tychalas, New York University, US<br /> <b>Authors</b>:<br /> Dimitrios Tychalas<sup>1</sup> and Michail Maniatakos<sup>2</sup><br /> <sup>1</sup>New York University, US; <sup>2</sup>New York University Abu Dhabi, AE<br /> <em><b>Abstract</b><br /> Industrial Control Systems (ICS) have evolved in the last decade, shifting from proprietary software/hardware to contemporary embedded architectures paired with open-source operating systems. In contrast to the IT world, where continuous updates and patches are expected, decommissioning always-on ICS for security assessment can incur prohibitive costs to their owner. Thus, a solution for routinely assessing the cybersecurity posture of diverse ICS without affecting their operation is essential. Therefore, in this paper we introduce IFFSET, a platform that leverages full-system emulation of Linux-based ICS firmware and utilizes fuzzing for security evaluation. Our platform extracts the file system and kernel information from a live ICS device, building an image which is emulated on a desktop system through QEMU. We employ fuzzing as a security assessment tool to analyze ICS-specific libraries and find potentially security-threatening conditions.
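<p>As a point of reference for this abstract, the mutation-based loop at the heart of such fuzzing can be pictured with a minimal Python sketch (illustrative only, not IFFSET's implementation; the target callable and seed corpus are hypothetical placeholders):</p> <pre>
import random

def mutate(data, rate=0.01):
    # Flip a few random bytes of a seed input to derive a new test case.
    out = bytearray(data)
    for i in range(len(out)):
        if rate > random.random():
            out[i] = random.randrange(256)
    return bytes(out)

def fuzz(target, seeds, iterations=10000):
    # Repeatedly mutate seeds, run the target, and log crashing inputs.
    crashes = []
    for _ in range(iterations):
        case = mutate(random.choice(seeds))
        try:
            target(case)             # e.g. a wrapper driving the emulated library
        except Exception as exc:     # stands in for a crash or sanitizer report
            crashes.append((case, repr(exc)))
    return crashes
</pre>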
We test our platform with commercial PLCs, showcasing potential threats with no interruption to the control process.</em></td> </tr> <tr> <td style="width:40px;">10:01</td> <td><a href="#IP2">IP2-15</a>, 814</td> <td><b>FANNET: FORMAL ANALYSIS OF NOISE TOLERANCE, TRAINING BIAS AND INPUT SENSITIVITY IN NEURAL NETWORKS</b><br /> <b>Speaker</b>:<br /> Mahum Naseer, TU Wien, AT<br /> <b>Authors</b>:<br /> Mahum Naseer<sup>1</sup>, Mishal Fatima Minhas<sup>2</sup>, Faiq Khalid<sup>1</sup>, Muhammad Abdullah Hanif<sup>1</sup>, Osman Hasan<sup>2</sup> and Muhammad Shafique<sup>1</sup><br /> <sup>1</sup>TU Wien, AT; <sup>2</sup>National University of Sciences and Technology, PK<br /> <em><b>Abstract</b><br /> With constant improvements in network architectures and training methodologies, Neural Networks (NNs) are increasingly being deployed in real-world Machine Learning systems. However, despite their impressive performance on "known inputs", these NNs can fail absurdly on "unseen inputs", especially if these real-time inputs deviate from the training dataset distributions or contain certain types of input noise. This indicates the low noise tolerance of NNs, which is a major reason for the recent increase in adversarial attacks. This is a serious concern, particularly for safety-critical applications, where inaccurate results lead to dire consequences. We propose a novel methodology that leverages model checking for the Formal Analysis of Neural Networks (FANNet) under different input noise ranges. Our methodology allows us to rigorously analyze the noise tolerance of NNs, their input node sensitivity, and the effects of training bias on their performance, e.g., in terms of classification accuracy. For evaluation, we use a feed-forward fully-connected NN architecture trained for Leukemia classification. Our experimental results show 11% noise tolerance for the given trained network, identify the most sensitive input nodes, confirm the bias of the available training dataset, and indicate that the proposed methodology is much more rigorous and yet comparable to validation testing in terms of time and computational resources for larger noise ranges.</em></td> </tr> <tr> <td>10:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="5.6">5.6 Logic synthesis towards fast, compact, and secure designs</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 08:30 - 10:00<br /> <b>Location / Room:</b> Lesdiguières</p> <p><b>Chair:</b><br /> Valeria Bertacco, University of Michigan, US</p> <p><b>Co-Chair:</b><br /> Lukas Sekanina, Brno University of Technology, CZ</p> <p>The logic synthesis family is growing. While traditional optimization goals such as area and delay are still very important in today's design automation, new applications require improvement of aspects such as security or power consumption. This session showcases various algorithms addressing both emerging and traditional optimization goals. An algorithm is proposed for cryptographic applications which reduces the multiplicative complexity, thereby making designs less vulnerable to attacks. A synthesis method converts flip-flops to latches in a clever way, saving power. Approximation and bi-decomposition techniques are used in an area optimization strategy.
Finally, a methodology for design minimization in advanced technology nodes is presented that takes both wire congestion and coupling effects into account.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>5.6.1</td> <td><b>A LOGIC SYNTHESIS TOOLBOX FOR REDUCING THE MULTIPLICATIVE COMPLEXITY IN LOGIC NETWORKS</b><br /> <b>Speaker</b>:<br /> Eleonora Testa, EPFL, CH<br /> <b>Authors</b>:<br /> Eleonora Testa<sup>1</sup>, Mathias Soeken<sup>1</sup>, Heinz Riener<sup>1</sup>, Luca Amaru<sup>2</sup> and Giovanni De Micheli<sup>1</sup><br /> <sup>1</sup>EPFL, CH; <sup>2</sup>Synopsys, US<br /> <em><b>Abstract</b><br /> Logic synthesis is a fundamental step in the realization of modern integrated circuits. It has traditionally been employed for the optimization of CMOS-based designs, as well as for emerging technologies and quantum computing. Recently, it has found application in minimizing the number of AND gates in cryptography benchmarks represented as xor-and graphs (XAGs). The number of AND gates in an XAG, which is called the logic network's multiplicative complexity, plays a critical role in various cryptography and security protocols such as fully homomorphic encryption (FHE) and secure multi-party computation (MPC). Further, the number of AND gates is also important to assess the degree of vulnerability of a Boolean function, and it influences the cost of techniques to protect against side-channel attacks. However, a complete logic synthesis flow for reducing the multiplicative complexity in logic networks has so far either not existed or relied heavily on manual manipulations. In this paper, we present a logic synthesis toolbox for cryptography and security applications. The proposed tool consists of powerful transformations, namely resubstitution, refactoring, and rewriting, specifically designed to minimize the multiplicative complexity of an XAG. Our flow is fully automatic and achieves significant results over both EPFL benchmarks and cryptography circuits. We improve the best-known results for cryptography by up to 59%, resulting in a normalized geometric mean of 0.82.</em></td> </tr> <tr> <td>09:00</td> <td>5.6.2</td> <td><b>SAVING POWER BY CONVERTING FLIP-FLOP TO 3-PHASE LATCH-BASED DESIGNS</b><br /> <b>Speaker</b>:<br /> Peter Beerel, University of Southern California, US<br /> <b>Authors</b>:<br /> Huimei Cheng, Xi Li, Yichen Gu and Peter Beerel, University of Southern California, US<br /> <em><b>Abstract</b><br /> Latches are smaller and lower power than flip-flops (FFs) and are typically used in a time-borrowing master-slave configuration. This paper presents an automatic flow for converting arbitrarily complex single-clock-domain FF-based RTL designs to efficient 3-phase latch-based designs with a reduced number of required latches, saving both register and clock-tree power.
Post place-and-route results demonstrate that our 3-phase latch-based designs save an average of 15.5% and 18.5% power on a variety of ISCAS, CEP, and CPU benchmark circuits, compared to their more traditional FF and master-slave based alternatives.</em></td> </tr> <tr> <td>09:30</td> <td>5.6.3</td> <td><b>COMPUTING THE FULL QUOTIENT IN BI-DECOMPOSITION BY APPROXIMATION</b><br /> <b>Speaker</b>:<br /> Valentina Ciriani, University of Milan, IT<br /> <b>Authors</b>:<br /> Anna Bernasconi<sup>1</sup>, Valentina Ciriani<sup>2</sup>, Jordi Cortadella<sup>3</sup> and Tiziano Villa<sup>4</sup><br /> <sup>1</sup>Università di Pisa, IT; <sup>2</sup>Università degli Studi di Milano, IT; <sup>3</sup>UPC, ES; <sup>4</sup>Università di Verona, IT<br /> <em><b>Abstract</b><br /> Bi-decomposition is a design technique widely used to realize logic functions by the composition of simpler components. It can be seen as a form of Boolean division, where a given function is split into a divisor and a quotient (and a remainder, if needed). The key questions are how to find a good divisor and then how to compute the quotient. In this paper we choose as divisor an approximation of the given function, and characterize the incompletely specified function which describes the full flexibility for the quotient. At the end, we report preliminary experiments for bi-decomposition based on two AND-like operators with a divisor approximation from 1 to 0, and discuss the impact of the approximation error rate on the final area of the components in the case of synthesis by three-level XOR-AND-OR forms.</em></td> </tr> <tr> <td>09:45</td> <td>5.6.4</td> <td><b>MINIDELAY: MULTI-STRATEGY TIMING-AWARE LAYER ASSIGNMENT FOR ADVANCED TECHNOLOGY NODES</b><br /> <b>Speaker</b>:<br /> Xinghai Zhang, Fuzhou University, CN<br /> <b>Authors</b>:<br /> Xinghai Zhang<sup>1</sup>, Zhen Zhuang<sup>1</sup>, Genggeng Liu<sup>1</sup>, Xing Huang<sup>2</sup>, Wen-Hao Liu<sup>3</sup>, Wenzhong Guo<sup>1</sup> and Ting-Chi Wang<sup>2</sup><br /> <sup>1</sup>Fuzhou University, CN; <sup>2</sup>National Tsing Hua University, TW; <sup>3</sup>Cadence Design Systems, US<br /> <em><b>Abstract</b><br /> Layer assignment, a major step in global routing of integrated circuits, is usually performed to assign segments of nets to multiple layers. Besides traditional optimization goals such as overflow and via count, interconnect delay plays an important role in determining chip performance and has been attracting much attention in recent years. Accordingly, in this paper, we propose MiniDelay, a timing-aware layer assignment algorithm to minimize delay for advanced technology nodes, taking both wire congestion and coupling effects into account. MiniDelay consists of the following three key techniques: 1) a non-default-rule routing technique is adopted to reduce the delay of timing-critical nets, 2) an effective congestion assessment method is proposed to optimize the delay of nets and via count simultaneously, and 3) a net scalpel technique is proposed to further reduce the maximum delay of nets, so that chip performance can be improved in a global manner.
Experimental results on multiple benchmarks confirm that the proposed algorithm leads to lower delay and fewer vias, while achieving the best solution quality among the existing algorithms with the shortest runtime.</em></td> </tr> <tr> <td style="width:40px;">10:00</td> <td><a href="#IP2">IP2-16</a>, 932</td> <td><b>A SCALABLE MIXED SYNTHESIS FRAMEWORK FOR HETEROGENEOUS NETWORKS</b><br /> <b>Speaker</b>:<br /> Max Austin, University of Utah, US<br /> <b>Authors</b>:<br /> Max Austin<sup>1</sup>, Scott Temple<sup>1</sup>, Walter Lau Neto<sup>1</sup>, Luca Amaru<sup>2</sup>, Xifan Tang<sup>1</sup> and Pierre-Emmanuel Gaillardon<sup>1</sup><br /> <sup>1</sup>University of Utah, US; <sup>2</sup>Synopsys, US<br /> <em><b>Abstract</b><br /> We present a new logic synthesis framework which produces efficient post-technology-mapping results on heterogeneous networks containing a mix of different types of logic. The framework accomplishes this by breaking down the circuit into sections using a hypergraph k-way partitioner and then determining the best-fit logic representation for each partition between two Boolean networks, And-Inverter Graphs (AIGs) and Majority-Inverter Graphs (MIGs), each of which has been shown to outperform the other on certain types of logic. Experimental results show that, over a set of OpenPiton Design Benchmarks (OPDB) and OpenCores benchmarks, our proposed methodology outperforms state-of-the-art academic tools in Area-Delay Product (ADP), Power-Delay Product (PDP), and Energy-Delay Product (EDP) by 5%, 2%, and 15% respectively after Application-Specific Integrated Circuit (ASIC) technology mapping, as well as showing a 54% improvement in runtime over conventional MIG optimization.</em></td> </tr> <tr> <td style="width:40px;">10:01</td> <td><a href="#IP2">IP2-17</a>, 456</td> <td><b>DISCERN: DISTILLING STANDARD CELLS FOR EMERGING RECONFIGURABLE NANOTECHNOLOGIES</b><br /> <b>Speaker</b>:<br /> Shubham Rai, TU Dresden, DE<br /> <b>Authors</b>:<br /> Shubham Rai<sup>1</sup>, Michael Raitza<sup>2</sup>, Siva Satyendra Sahoo<sup>1</sup> and Akash Kumar<sup>1</sup><br /> <sup>1</sup>TU Dresden, DE; <sup>2</sup>TU Dresden and CfAED, DE<br /> <em><b>Abstract</b><br /> Logic gates and circuits based on reconfigurable nanotechnologies demonstrate runtime reconfigurability, where a single logic gate can exhibit more than one functionality. Recent attempts at circuits based on emerging reconfigurable nanotechnologies have primarily focused on using the traditional CMOS design flow involving similar-styled standard cells. These CMOS-centric standard cells fail to utilize the exciting properties offered by these nanotechnologies. In the present work, we explore the Boolean properties that define the reconfigurable behavior of a logic gate. By analyzing the truth table in detail, we find that there is a common Boolean rule which dictates why a logic gate is reconfigurable. Such logic gates can be efficiently implemented using reconfigurable nanotechnologies. We propose an algorithm which analyses the truth tables of nodes in a circuit to list all such potential reconfigurable logic gates for a particular circuit. Technology mapping with these new logic gates (or standard cells) leads to a better mapping in terms of area and delay.
Experiments employing our methodology over the EPFL benchmarks show average improvements of around 13%, 16% and 11.5% in terms of area, number of edges and delay respectively, as compared to conventional CMOS-centric standard-cell based mapping.</em></td> </tr> <tr> <td>10:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="5.7">5.7 Stochastic Computing</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 08:30 - 10:00<br /> <b>Location / Room:</b> Berlioz</p> <p><b>Chair:</b><br /> Robert Wille, Johannes Kepler University Linz, AT</p> <p><b>Co-Chair:</b><br /> Shigeru Yamashita, Ritsumeikan, JP</p> <p>Stochastic computing uses random bitstreams to reduce the computational and area costs of a general class of Boolean operations, including arithmetic addition and multiplication. This session considers stochastic computing from a model, accuracy, and applications perspective, presenting papers that span from models of pseudo-random number generators, to accuracy analysis of stochastic circuits, to novel applications for signal processing tasks.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>5.7.1</td> <td><b>THE HYPERGEOMETRIC DISTRIBUTION AS A MORE ACCURATE MODEL FOR STOCHASTIC COMPUTING</b><br /> <b>Speaker</b>:<br /> Timothy Baker, University of Michigan, US<br /> <b>Authors</b>:<br /> Timothy Baker and John Hayes, University of Michigan, US<br /> <em><b>Abstract</b><br /> A fundamental assumption in stochastic computing (SC) is that bit-streams are generally well-approximated by a Bernoulli process, i.e., a sequence of independent 0-1 choices. We show that this assumption is flawed in unexpected and significant ways for some bit-streams, such as those produced by a typical LFSR-based stochastic number generator (SNG). In particular, the Bernoulli assumption leads to a surprising overestimation of output errors and how they vary with input changes. We then propose a more accurate model for such bit-streams based on the hypergeometric distribution and examine its implications for several SC applications. First, we explore the effect of correlation on a mux-based stochastic adder and show that, contrary to what was previously thought, it is not entirely correlation insensitive. Further, inspired by the hypergeometric model, we introduce a new mux tree adder that offers major area savings and accuracy improvement. The effectiveness of this study is validated on a large image processing circuit which achieves an accuracy improvement of 32%, combined with a reduction in overall circuit area.</em></td> </tr> <tr> <td>09:00</td> <td>5.7.2</td> <td><b>ACCURACY ANALYSIS FOR STOCHASTIC CIRCUITS WITH D-FLIP FLOP INSERTION</b><br /> <b>Speaker</b>:<br /> Kuncai Zhong, University of Michigan-Shanghai Jiao Tong University Joint Institute, CN<br /> <b>Authors</b>:<br /> Kuncai Zhong and Weikang Qian, Shanghai Jiao Tong University, CN<br /> <em><b>Abstract</b><br /> One of the challenges stochastic computing (SC) faces is the high cost of stochastic number generators (SNGs). A solution to it is inserting D flip-flops (DFFs) into the circuit. However, this affects the accuracy of the stochastic circuits, and it is crucial to capture that effect. In this work, we propose an efficient method to analyze the accuracy of stochastic circuits with DFFs inserted.
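<p>To make the session's encoding concrete: since a stochastic number is the density of 1's in a bit-stream, multiplying two unipolar values needs only one AND gate per bit pair. A minimal sketch of this background (illustrative only, not taken from any of the papers):</p> <pre>
import random

def to_stream(p, n):
    # Unipolar SC encoding: a value p in [0, 1] becomes a random
    # bit-stream whose density of 1's estimates p.
    return [1 if p > random.random() else 0 for _ in range(n)]

def sc_multiply(p, q, n=4096):
    # A single AND gate multiplies two independent streams,
    # because P(a AND b) = P(a) * P(b).
    xs, ys = to_stream(p, n), to_stream(q, n)
    return sum(a * b for a, b in zip(xs, ys)) / n   # AND, bit by bit

print(sc_multiply(0.5, 0.4))   # close to 0.2; accuracy grows with n
</pre>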
Furthermore, given the importance of multiplication, we apply this method to analyze stochastic multipliers with DFFs inserted. Several interesting claims are obtained about the use of probability conversion circuits. For example, using a weighted binary generator is more accurate than using a comparator. The experimental results show the correctness of the proposed method and the claims. Furthermore, the proposed method is up to 560× faster than the simulation-based method.</em></td> </tr> <tr> <td>09:30</td> <td>5.7.3</td> <td><b>DYNAMIC STOCHASTIC COMPUTING FOR DIGITAL SIGNAL PROCESSING APPLICATIONS</b><br /> <b>Speaker</b>:<br /> Jie Han, University of Alberta, CA<br /> <b>Authors</b>:<br /> Siting Liu and Jie Han, University of Alberta, CA<br /> <em><b>Abstract</b><br /> Stochastic computing (SC) utilizes a random binary bit stream to encode a number by counting the frequency of 1's in the stream (or sequence). Typically, a small circuit is used to perform a bit-wise logic operation on the stochastic sequences, which leads to significant hardware and power savings. Energy efficiency, however, is a challenge for SC due to the long sequences required for accurately encoding numbers. To overcome this challenge, we consider using a stochastic sequence to encode a continuously variable signal instead of a number, to achieve higher accuracy, higher energy efficiency and greater flexibility. Specifically, a single bit is used to encode each sample of a signal for efficient processing. Because this type of sequence encodes constantly varying values, it is referred to as a dynamic stochastic sequence (DSS). The DSS enables the use of SC circuits to efficiently perform tasks such as frequency mixing and function estimation. It is shown that such a dynamic SC (DSC) system achieves savings of up to 98.4% in energy and up to 96.8% in time with slightly higher accuracy compared to conventional SC. It also achieves energy and time savings of up to 60% compared to a fixed-width binary implementation.</em></td> </tr> <tr> <td style="width:40px;">10:00</td> <td><a href="#IP2">IP2-18</a>, 437</td> <td><b>A 16×128 STOCHASTIC-BINARY PROCESSING ELEMENT ARRAY FOR ACCELERATING STOCHASTIC DOT-PRODUCT COMPUTATION USING 1-16 BIT-STREAM LENGTH</b><br /> <b>Speaker</b>:<br /> Hyunjoon Kim, Nanyang Technological University, SG<br /> <b>Authors</b>:<br /> Qian Chen, Yuqi Su, Hyunjoon Kim, Taegeun Yoo, Tony Tae-Hyoung Kim and Bongjin Kim, Nanyang Technological University, SG<br /> <em><b>Abstract</b><br /> This work presents a 16×128 array of stochastic-binary processing elements for energy- and area-efficient processing of artificial neural networks. A processing element (PE) with all-digital components consists of an XNOR gate as a bipolar stochastic multiplier and an 8-bit binary adder with 8× registers for accumulating partial sums. The PE array comprises 16× dot-product units, each with 128 PEs cascaded in a single row. The latency and energy of the proposed dot-product unit are minimized by reducing the number of bit-streams required for minimizing the accuracy degradation induced by the approximate stochastic computing. A 128-input dot-product operation requires a bit-stream length (N) of 1 to 16, which is two orders of magnitude smaller than the baseline stochastic computation using MUX-based adders. The simulated dot-product error is 6.9-to-1.5% for N=1-to-16, while the error from the baseline stochastic method is 5.9-to-1.7% with N=128-to-2048.
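<p>The XNOR-plus-counter organisation described above can be mimicked in a short sketch (an illustration of bipolar stochastic dot products under the stated encoding, not the paper's PE array; stream lengths and test values are arbitrary):</p> <pre>
import random

def bipolar_stream(x, n):
    # Bipolar SC encoding: x in [-1, 1] is carried by P(1) = (x + 1) / 2.
    p = (x + 1.0) / 2.0
    return [1 if p > random.random() else 0 for _ in range(n)]

def xnor_dot(xvals, wvals, n=16):
    # One XNOR gate per product term; a plain binary counter accumulates
    # the partial sums, mirroring the XNOR-plus-adder PE described above.
    ones = 0
    for x, w in zip(xvals, wvals):
        xs, ws = bipolar_stream(x, n), bipolar_stream(w, n)
        ones += sum(1 - (a ^ b) for a, b in zip(xs, ws))   # XNOR counts
    # Each stream density estimates (x_i * w_i + 1) / 2, hence:
    return 2.0 * ones / n - len(xvals)

print(xnor_dot([0.5, -0.25, 0.75], [0.5, 0.8, -0.1], n=4096))
# expected value is 0.5*0.5 - 0.25*0.8 - 0.75*0.1 = -0.025
</pre>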
The mean MNIST classification accuracy is 96.11% (1.19% lower than 8b binary) using a three-layer MLP at N=16. The measured energy from a 65nm test-chip is 10.04pJ per dot-product, and the energy efficiency is 25.5TOPS/W at N=16.</em></td> </tr> <tr> <td style="width:40px;">10:01</td> <td><a href="#IP2">IP2-19</a>, 599</td> <td><b>TOWARDS EXPLORING THE POTENTIAL OF ALTERNATIVE QUANTUM COMPUTING ARCHITECTURES</b><br /> <b>Speaker</b>:<br /> Arighna Deb, Kalinga Institute of Industrial Technology, IN<br /> <b>Authors</b>:<br /> Arighna Deb<sup>1</sup>, Gerhard W. Dueck<sup>2</sup> and Robert Wille<sup>3</sup><br /> <sup>1</sup>Kalinga Institute of Industrial Technology, IN; <sup>2</sup>University of New Brunswick, CA; <sup>3</sup>Johannes Kepler University Linz, AT<br /> <em><b>Abstract</b><br /> The recent advances in the physical realization of Noisy Intermediate Scale Quantum (NISQ) computers have motivated research on design automation that allows users to execute quantum algorithms on them. Certain physical constraints in the architectures restrict how logical qubits used to describe the algorithm can be mapped to physical qubits used to realize the corresponding functionality. Thus far, this has been addressed by inserting additional operations in order to overcome the physical constraints. However, all these approaches have taken the existing architectures as invariant and did not explore the potential of changing the quantum architecture itself, a valid option as long as the underlying physical constraints remain satisfied. In this work, we propose initial ideas to explore this potential. More precisely, we introduce several schemes for the generation of alternative coupling graphs (and, by this, quantum computing architectures) that still might be able to satisfy physical constraints but, at the same time, allow for a more efficient realization of the desired quantum functionality.</em></td> </tr> <tr> <td style="width:40px;">10:02</td> <td><a href="#IP2">IP2-20</a>, 719</td> <td><b>ACCELERATING QUANTUM APPROXIMATE OPTIMIZATION ALGORITHM USING MACHINE LEARNING</b><br /> <b>Speaker</b>:<br /> Swaroop Ghosh, Pennsylvania State University, US<br /> <b>Authors</b>:<br /> Mahabubul Alam, Abdullah Ash-Saki and Swaroop Ghosh, Pennsylvania State University, US<br /> <em><b>Abstract</b><br /> We propose a machine learning based approach to accelerate the implementation of the quantum approximate optimization algorithm (QAOA), a promising quantum-classical hybrid algorithm aimed at demonstrating so-called quantum supremacy. In QAOA, a parametric quantum circuit and a classical optimizer iterate in a closed loop to solve hard combinatorial optimization problems. The performance of QAOA improves with an increasing number of stages (depth) in the quantum circuit. However, two new parameters are introduced with each added stage, increasing the number of optimization loop iterations for the classical optimizer. We note a correlation among the parameters of the lower-depth and the higher-depth QAOA implementations and exploit it by developing a machine learning model to predict the gate parameters close to the optimal values. As a result, the optimization loop converges in a fewer number of iterations. We choose the graph MaxCut problem as a prototype to solve using QAOA. We perform a feature extraction routine using 100 different QAOA instances and develop a training data-set with 13,860 optimal parameters. We present our analysis for 4 flavors of regression models and 4 flavors of classical optimizers.
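<p>The prediction step itself can be sketched with an off-the-shelf regressor (purely illustrative; the feature layout and the synthetic stand-in data below are assumptions, not the paper's training set):</p> <pre>
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical layout: each row holds optimal (gamma, beta) angles found
# at depth p, and the target holds the corresponding optima at depth p+1.
rng = np.random.default_rng(0)
X = rng.random((100, 2))                      # stand-in depth-p optima
y = X @ np.array([[1.0, 0.2], [0.1, 0.9]])    # stand-in depth-(p+1) optima

model = LinearRegression().fit(X, y)
warm_start = model.predict(X[:1])
print(warm_start)   # seeds the classical optimizer, cutting its iterations
</pre>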
Finally, we show that the proposed approach can curtail the number of optimization iterations by 44.9% on average (up to 65.7%), based on an analysis performed with 264 flavors of graphs.</em></td> </tr> <tr> <td>10:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="5.8">5.8 Special Session: HLS for AI HW</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 08:30 - 10:00<br /> <b>Location / Room:</b> Exhibition Theatre</p> <p><b>Chair:</b><br /> Massimo Cecchetti, Mentor, A Siemens Business, US</p> <p><b>Co-Chair:</b><br /> Astrid Ernst, Mentor, A Siemens Business, US</p> <p>One of the fastest growing areas of hardware and software design is artificial intelligence (AI)/machine learning (ML), fueled by the demand for more autonomous systems like self-driving vehicles and voice recognition for personal assistants. Many of these algorithms rely on convolutional neural networks (CNNs) to implement deep learning systems. While the concept of convolution is relatively straightforward, the application of CNNs to the ML domain has yielded dozens of different neural network approaches. Although these networks can be executed in software on CPUs/GPUs, the power requirements of these solutions make them impractical for most inferencing applications, the majority of which involve portable, low-power devices. To improve the power/performance, hardware teams are forming to create ML hardware acceleration blocks. However, taking any one of these compute-intensive networks into hardware, especially energy-efficient hardware, is a time-consuming process if the team employs a traditional RTL design flow. Consider all of these interdependent activities using a traditional flow: • Expressing the algorithm correctly in RTL. • Choosing the optimal bit-widths for kernel weights and local storage to meet the memory budget. • Designing the microarchitecture to have a low enough latency to be practical for the target application, while determining how the accelerator communicates across the system bus without killing the latency the team just fought for. • Verifying the algorithm early on and throughout the implementation process, especially in the context of the entire system. • Optimizing for power for mobile devices. • Getting the product to market on time. This domain is in desperate need of a productivity-boosting methodology shift away from an RTL flow.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>5.8.1</td> <td><b>INTRODUCTION TO HLS CONCEPTS, OPEN-SOURCE IP AND REFERENCE DESIGNS ENABLING BUILDING AI ACCELERATION HARDWARE</b><br /> <b>Author</b>:<br /> Mike Fingeroff, Mentor, A Siemens Business, US<br /> <em><b>Abstract</b><br /> HLS provides a hardware design solution for algorithm designers that generates high-quality RTL from C++ and/or SystemC descriptions targeting ASIC, FPGA, or eFPGA implementations. By employing these elements of the HLS solution, teams can quickly develop quality high-performance, low-power hardware implementations: • Enables late-stage changes. Easily change C++ algorithms at any time and regenerate RTL code or target a new technology. • Rapidly explore options for power, performance, and area without changing source code.
• Reduce design and verification time from one year to a few months and add new features in days, not weeks, all using C/C++ code that contains 5x fewer lines of code than RTL.</em></td> </tr> <tr> <td>09:00</td> <td>5.8.2</td> <td><b>EARLY SOC PERFORMANCE VERIFICATION USING SYSTEMC WITH NVIDIA MATCHLIB AND HLS</b><br /> <b>Author</b>:<br /> Stuart Swan, Mentor, A Siemens Business, US<br /> <em><b>Abstract</b><br /> NVidia MatchLib is a new open-source library that enables much faster design and verification of SOCs using High-Level Synthesis. One of the primary objectives of MatchLib is to enable performance-accurate modeling of SOCs in SystemC/C++. With these models, designers can identify and resolve issues such as bus and memory contention, arbitration strategies, and optimal interconnect structure at a much higher level of abstraction than RTL. In addition, much of the system-level verification of the SOC can occur in SystemC/C++, before RTL is even created. This presentation will introduce NVidia MatchLib and its flow (Figure 3) and its usage with Catapult HLS, using some demonstration examples. Key Components of MatchLib: • Connections o Synthesizable Message Passing Framework o SystemC/C++ used to accurately model concurrent IO that synthesized HW will have o Automatic stall injection enables interconnect to be stress tested in SystemC • Parameterized AXI4 Fabric Components o Router/Splitter o Arbiter o AXI4 &lt;-&gt; AXI4Lite o Automatic burst segmentation and last bit generation • Parameterized Banked Memories, Crossbar, Reorder Buffer, Cache • Parameterized NOC components</em></td> </tr> <tr> <td>09:30</td> <td>5.8.3</td> <td><b>CUSTOMER CASE STUDIES OF USING HLS FOR ULTRA-LOW POWER AI HARDWARE ACCELERATION</b><br /> <b>Author</b>:<br /> Ellie Burns, Mentor, A Siemens Business, US<br /> <em><b>Abstract</b><br /> This presentation will review 3 customer case studies where HLS has been used for designs and applications that use AI/ML-accelerated HW. All case studies are available as full customer-authored white papers that detail the design, the HLS use, the design experience and the lessons learned. The 3 customer studies will be: NVIDIA - High-productivity IC Design for Machine Learning Accelerators; FotoNation/Xperi - A Designer Life with HLS: Faster Computer Vision Neural Networks; Chips&amp;Media - Deep Learning Accelerator Using HLS.</em></td> </tr> <tr> <td>10:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="IP2">IP2 Interactive Presentations</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 10:00 - 10:30<br /> <b>Location / Room:</b> Poster Area</p> <p>Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.</p> <table> <tbody> <tr> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> <tr> <td style="width:40px;">IP2-1</td> <td><b>SAMPLING FROM DISCRETE DISTRIBUTIONS IN COMBINATIONAL HARDWARE WITH APPLICATION TO POST-QUANTUM CRYPTOGRAPHY</b><br /> <b>Speaker</b>:<br /> Michael Lyons, George Mason University, US<br /> <b>Authors</b>:<br /> Michael Lyons and Kris Gaj, George Mason University, US<br /> <em><b>Abstract</b><br /> Random values from discrete distributions are typically generated from uniformly-random samples.
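<p>For orientation, the classical table-based route from uniform samples to a discrete distribution, which the following sentences take as the baseline, can be sketched in a few lines (a minimal sketch; the example distribution is arbitrary):</p> <pre>
import random
from bisect import bisect_right

def build_cdt(probs):
    # Precompute the cumulative distribution table once per distribution.
    cdt, total = [], 0.0
    for p in probs:
        total += p
        cdt.append(total)
    return cdt

def cdt_sample(cdt):
    # Inversion sampling: locate a uniform draw inside the table.
    return bisect_right(cdt, random.random())

cdt = build_cdt([0.5, 0.25, 0.125, 0.125])   # toy example distribution
print([cdt_sample(cdt) for _ in range(16)])
</pre>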
A common technique is to use a cumulative distribution table (CDT) lookup for inversion sampling, but it is also possible to use Boolean functions to map a uniformly-random bit sequence into a value from a discrete distribution. This work presents a methodology for deriving such functions for any discrete distribution, encoding them in VHDL for implementation in combinational hardware, and (for moderate precision and sample space size) confirming the correctness of the produced distribution. The process is demonstrated using a discrete Gaussian distribution with a small sample space, but it is applicable to any discrete distribution with fixed parameters. Results are presented for sampling schemes from several submissions to the NIST PQC standardization process, comparing this method to CDT lookups on a Xilinx Artix-7 FPGA. The process produces compact solutions for distributions up to moderate size and precision.</em></td> </tr> <tr> <td style="width:40px;">IP2-2</td> <td><b>ON THE PERFORMANCE OF NON-PROFILED DIFFERENTIAL DEEP LEARNING ATTACKS AGAINST AN AES ENCRYPTION ALGORITHM PROTECTED USING A CORRELATED NOISE HIDING COUNTERMEASURE</b><br /> <b>Speaker</b>:<br /> Amir Alipour, Grenoble INP Esisar, FR<br /> <b>Authors</b>:<br /> Amir Alipour<sup>1</sup>, Athanasios Papadimitriou<sup>2</sup>, Vincent Beroulle<sup>3</sup>, Ehsan Aerabi<sup>3</sup> and David Hely<sup>3</sup><br /> <sup>1</sup>University Grenoble Alpes, Grenoble INP ESISAR, LCIS Laboratory, FR; <sup>2</sup>University Grenoble Alpes, Grenoble INP ESISAR, ESYNOV, FR; <sup>3</sup>University Grenoble Alpes, Grenoble INP ESISAR, LSIC Laboratory, FR<br /> <em><b>Abstract</b><br /> Recent works in the field of cryptography focus on Deep Learning based Side Channel Analysis (DLSCA) as one of the most powerful attacks against common encryption algorithms such as AES. As a common case, profiling DLSCA has shown great capabilities in revealing secret cryptographic keys against the majority of AES implementations. In a very recent study, it has been shown that Deep Learning can be applied in a non-profiling way (non-profiling DLSCA), making this method considerably more practical and able to break powerful countermeasures for encryption algorithms such as AES, including masking countermeasures, while requiring considerably fewer power traces than a first-order CPA attack. In this work, our main goal is to apply non-profiling DLSCA against a hiding-based AES countermeasure which utilizes correlated noise generation to hide the secret encryption key. We show that this AES, with correlated noise generation as a lightweight countermeasure, can provide equivalent protection under CPA and under non-profiling DLSCA attacks, in terms of the power traces required to obtain the secret key.</em></td> </tr> <tr> <td style="width:40px;">IP2-3</td> <td><b>FAST AND ACCURATE PERFORMANCE EVALUATION FOR RISC-V USING VIRTUAL PROTOTYPES</b><br /> <b>Speaker</b>:<br /> Vladimir Herdt, University of Bremen, DE<br /> <b>Authors</b>:<br /> Vladimir Herdt<sup>1</sup>, Daniel Grosse<sup>2</sup> and Rolf Drechsler<sup>2</sup><br /> <sup>1</sup>University of Bremen, DE; <sup>2</sup>University of Bremen / DFKI, DE<br /> <em><b>Abstract</b><br /> RISC-V is gaining huge popularity, in particular for embedded systems.
Recently, a SystemC-based Virtual Prototype (VP) has been open-sourced to lay the foundation for providing support for system-level use cases such as design space exploration, analysis of complex HW/SW interactions and power/timing/performance validation for RISC-V based systems. In this paper, we propose an efficient core timing model and integrate it into the VP core to enable fast and accurate performance evaluation for RISC-V based systems. As a case study, we provide a timing configuration matching the RISC-V HiFive1 board from SiFive. Our experiments demonstrate that our approach allows us to obtain very accurate performance evaluation results while still retaining a high simulation performance.</em></td> </tr> <tr> <td style="width:40px;">IP2-4</td> <td><b>AUTOMATED GENERATION OF LTL SPECIFICATIONS FOR SMART HOME IOT USING NATURAL LANGUAGE</b><br /> <b>Speaker</b>:<br /> Shiyu Zhang, Nanjing University, CN<br /> <b>Authors</b>:<br /> Shiyu Zhang<sup>1</sup>, Juan Zhai<sup>1</sup>, Lei Bu<sup>1</sup>, Mingsong Chen<sup>2</sup>, Linzhang Wang<sup>1</sup> and Xuandong Li<sup>1</sup><br /> <sup>1</sup>Nanjing University, CN; <sup>2</sup>East China Normal University, CN<br /> <em><b>Abstract</b><br /> Ordinary inexperienced users can build their smart home IoT system easily nowadays, but such user-customized systems could be error-prone. Using formal verification to prove the correctness of such systems is necessary. However, to conduct a formal proof, formal specifications such as Linear Temporal Logic (LTL) formulas have to be provided, and ordinary users cannot author LTL formulas, only natural language. To address this problem, this paper presents a novel approach that can automatically generate formal LTL specifications from natural language requirements, based on domain knowledge and our proposed ambiguity-refining techniques. Experimental results show that our approach can achieve a high correctness rate of 95.4% in converting natural language sentences into LTL formulas from 481 requirements of real examples.</em></td> </tr> <tr> <td style="width:40px;">IP2-5</td> <td><b>A HEAT-RECIRCULATION-AWARE VM PLACEMENT STRATEGY FOR DATA CENTERS</b><br /> <b>Authors</b>:<br /> Hao Feng<sup>1</sup>, Yuhui Deng<sup>2</sup> and Yi Zhou<sup>3</sup><br /> <sup>1</sup>Jinan University, CN; <sup>2</sup>Chinese Academy of Sciences; Jinan University, CN; <sup>3</sup>Columbus State University, US<br /> <em><b>Abstract</b><br /> Data centers consist of a great number of IT devices (e.g., servers and switches) that generate a massive amount of heat. Due to the special arrangement of racks in the data center, heat recirculation often occurs between nodes. It can cause a sharp rise in equipment temperature, coupled with local hot spots in data centers. Existing VM placement strategies can minimize the energy consumption of data centers by optimizing the allocation of multiple physical resources (e.g., memory, bandwidth and CPU). However, these strategies ignore the role of heat recirculation in the data center. To address this problem, we propose a heat-recirculation-aware VM placement strategy and design a Simulated Annealing Based Algorithm (SABA) to lower the energy consumption of data centers. Different from the existing SA algorithm, SABA optimizes the distribution of the initial solution and the way of iteration.
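<p>The simulated-annealing core of such a strategy follows the classic accept/reject pattern (a generic sketch; the cost model and the neighbour move are placeholders rather than SABA's actual heat-recirculation model):</p> <pre>
import math
import random

def simulated_annealing(initial, cost, neighbour,
                        t0=100.0, cooling=0.95, steps=5000):
    # Classic SA loop: sometimes accept a worse placement, with a
    # probability that shrinks as the temperature cools, so the search
    # can escape local optima of the energy model.
    state = best = initial
    t = t0
    for _ in range(steps):
        cand = neighbour(state)          # e.g. move one VM to another server
        delta = cost(cand) - cost(state)
        if 0 >= delta or math.exp(-delta / t) > random.random():
            state = cand
            if cost(best) > cost(state):
                best = state
        t *= cooling
    return best
</pre>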
We quantitatively evaluate SABA's performance in terms of algorithm efficiency, activated servers and energy savings against the XINT-GA algorithm (a thermal-aware task-scheduling strategy), FCFS (First-Come First-Served), and SA. Experimental results indicate that our heat-recirculation-aware VM placement strategy provides a powerful solution for improving the energy efficiency of data centers.</em></td> </tr> <tr> <td style="width:40px;">IP2-6</td> <td><b>ENERGY OPTIMIZATION IN NCFET-BASED PROCESSORS</b><br /> <b>Authors</b>:<br /> Sami Salamin<sup>1</sup>, Martin Rapp<sup>1</sup>, Hussam Amrouch<sup>1</sup>, Andreas Gerstlauer<sup>2</sup> and Joerg Henkel<sup>1</sup><br /> <sup>1</sup>Karlsruhe Institute of Technology, DE; <sup>2</sup>University of Texas at Austin, US<br /> <em><b>Abstract</b><br /> Energy consumption is a key optimization goal for all modern processors. Negative Capacitance Field-Effect Transistors (NCFETs) are a leading emerging technology that promises outstanding performance in addition to better energy efficiency. The thickness of the additional ferroelectric layer, the frequency, and the voltage are the key parameters in NCFET technology that impact the power and frequency of processors. However, their joint impact on energy optimization has not been investigated yet. In this work, we are the first to demonstrate that conventional (i.e., NCFET-unaware) dynamic voltage/frequency scaling (DVFS) techniques to minimize energy are sub-optimal when applied to NCFET-based processors. We further demonstrate that state-of-the-art NCFET-aware voltage scaling for power minimization is also sub-optimal when it comes to energy. This work provides the first NCFET-aware DVFS technique that optimizes the processor's energy through optimal runtime frequency/voltage selection. In NCFETs, the energy-optimal frequency and voltage are dependent on the workload and technology parameters. Our NCFET-aware DVFS technique considers these effects to perform optimal voltage/frequency selection at runtime depending on workload characteristics. Results show up to 90% energy savings compared to conventional DVFS techniques. Compared to state-of-the-art NCFET-aware power management, our technique provides up to 72% energy savings along with 3.7x higher performance.</em></td> </tr> <tr> <td style="width:40px;">IP2-7</td> <td><b>TOWARDS A MODEL-BASED MULTI-OBJECTIVE OPTIMIZATION APPROACH FOR SAFETY-CRITICAL REAL-TIME SYSTEMS</b><br /> <b>Speaker</b>:<br /> Emmanuel Grolleau, LIAS / ISAE-ENSMA, FR<br /> <b>Authors</b>:<br /> Soulimane Kamni<sup>1</sup>, Yassine OUHAMMOU<sup>2</sup>, Antoine Bertout<sup>3</sup> and Emmanuel Grolleau<sup>4</sup><br /> <sup>1</sup>LIAS/ENSMA, FR; <sup>2</sup>LIAS / ISAE-ENSMA, FR; <sup>3</sup>LIAS, Université de Poitiers, ISAE-ENSMA, FR; <sup>4</sup>LIAS, ISAE-ENSMA, Universite de Poitiers, FR<br /> <em><b>Abstract</b><br /> In the safety-critical real-time systems domain, obtaining an appropriate operational model which meets the temporal (e.g. deadlines) and business (e.g. redundancy) requirements while being optimal in terms of several metrics is a primordial process in the design life-cycle. Recently, several research efforts have proposed exploring cross-domain trade-offs for higher behavioural performance. Indeed, this process represents the first step in the deployment phase, which is very sensitive because it can be error-prone and time-consuming.
This work-in-progress paper proposes an approach that aims to help real-time system architects benefit from existing works, overcome their limits, and capitalize on the efforts already made. Furthermore, the approach is based on the model-driven engineering paradigm and eases the usage of methods and tools through repositories that gather them as a form of shared knowledge.</em></td> </tr> <tr> <td style="width:40px;">IP2-8</td> <td><b>CURRENT-MODE CARRY-FREE MULTIPLIER DESIGN USING A MEMRISTOR-TRANSISTOR CROSSBAR ARCHITECTURE</b><br /> <b>Speaker</b>:<br /> Shengqi Yu, Newcastle University, GB<br /> <b>Authors</b>:<br /> Shengqi Yu<sup>1</sup>, Ahmed Soltan<sup>2</sup>, Rishad Shafik<sup>1</sup>, Thanasin Bunnam<sup>1</sup>, Domenico Balsamo<sup>1</sup>, Fei Xia<sup>1</sup> and Alex Yakovlev<sup>1</sup><br /> <sup>1</sup>Newcastle University, GB; <sup>2</sup>Nile University, EG<br /> <em><b>Abstract</b><br /> Traditional multipliers consist of complex logic components and are a major energy and performance contributor in modern compute-intensive applications. As such, designing multipliers with reduced energy and faster speed has remained a thoroughgoing challenge. This paper presents a novel, carry-free multiplier, which is suitable for a new generation of energy-constrained applications. The multiplier circuit consists of an array of memristor-transistor cells that can be selected (i.e., turned ON or OFF) using a combination of DC bias voltages based on the operand values. When a cell is selected, it contributes current to the array path, which is then amplified by current mirrors with variable transistor gate sizes. The different current paths are connected to a node that analogously accumulates the currents to produce the multiplier output directly, removing the carry-propagation stages typically seen in traditional digital multipliers. An essential feature of this multiplier is autonomous survivability: when the supply power drops below a threshold, the logic state is automatically retained at zero cost due to the non-volatile properties of memristors.</em></td> </tr> <tr> <td style="width:40px;">IP2-9</td> <td><b>N-BIT DATA PARALLEL SPIN WAVE LOGIC GATE</b><br /> <b>Speaker</b>:<br /> Abdulqader Mahmoud, TU Delft, NL<br /> <b>Authors</b>:<br /> Abdulqader Mahmoud<sup>1</sup>, Frederic Vanderveken<sup>2</sup>, Florin Ciubotaru<sup>2</sup>, Christoph Adelmann<sup>2</sup>, Sorin Cotofana<sup>1</sup> and Said Hamdioui<sup>1</sup><br /> <sup>1</sup>TU Delft, NL; <sup>2</sup>IMEC, BE<br /> <em><b>Abstract</b><br /> Due to their very nature, Spin Waves (SWs) created in the same waveguide, but with different frequencies, can coexist while selectively interacting with their own species only. The absence of inter-frequency interference isolates input data sets encoded in SWs with different frequencies and creates the premises for simultaneous data-parallel SW-based processing without hardware replication or delay overhead. In this paper we leverage this SW property by introducing a novel computation paradigm that allows for the parallel processing of n-bit input data vectors on the same basic SW-based logic gate. Subsequently, to demonstrate the proposed concept, we present an 8-bit parallel 3-input Majority gate implementation and validate it by means of Object Oriented MicroMagnetic Framework (OOMMF) simulations. To evaluate the potential benefit of our proposal, we compare the 8-bit data-parallel gate with an equivalent scalar SW gate based implementation.
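</em> <p>A word-level software analogue conveys the data-parallel idea: a 3-input majority vote is computed independently at every bit position by one word-wide operation, just as the SW gate processes all frequency channels at once. The sketch below is purely illustrative and has no relation to the micromagnetic implementation:</p> <pre>
# Bitwise 3-input majority over packed 8-bit operands: each bit position
# is an independent majority vote, so one operation processes 8 "channels".
def maj3(a: int, b: int, c: int) -> int:
    return (a & b) | (b & c) | (a & c)

assert maj3(0b11001100, 0b10101010, 0b11110000) == 0b11101000
</pre> <em>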
Our evaluation indicates that the 8-bit data-parallel 3-input Majority gate implementation requires 4.16x less area than its scalar SW gate based counterpart while preserving the same delay and energy consumption figures.</em></td> </tr> <tr> <td style="width:40px;">IP2-10</td> <td><b>HIGH-SPEED ANALOG SIMULATION OF CMOS VISION CHIPS USING EXPLICIT INTEGRATION TECHNIQUES ON MANY-CORE PROCESSORS</b><br /> <b>Speaker</b>:<br /> Tom Kazmierski, University of Southampton, GB<br /> <b>Authors</b>:<br /> Gines Domenech-Asensi<sup>1</sup> and Tom J Kazmierski<sup>2</sup><br /> <sup>1</sup>Universidad Politecnica de Cartagena, ES; <sup>2</sup>University of Southampton, GB<br /> <em><b>Abstract</b><br /> This work describes a high-speed simulation technique for analog circuits based on the use of state-space equations and an explicit integration method parallelised on a multiprocessor architecture. The integration step of such a method is smaller than the one required by an implicit simulation technique based on Newton-Raphson iterations. However, given that explicit methods do not require the computation of time-consuming matrix factorizations, the overall simulation time is reduced. The technique described in this work has been implemented on an NVIDIA general-purpose GPU and has been tested simulating the Gaussian filtering operation performed by a smart CMOS image sensor. Such devices are used to perform computation on the edge and include built-in image processing functions. Among those, Gaussian filtering is one of the most common, since it is a basic task for early vision processing. These smart sensors are increasingly complex, and hence the time required to simulate them during their design cycle keeps growing. Beyond a certain imager size, the proposed simulation method yields simulation times two orders of magnitude faster than an implicit-method-based tool such as SPICE.</em></td> </tr> <tr> <td style="width:40px;">IP2-11</td> <td><b>A 100KHZ-1GHZ TERMINATION-DEPENDENT HUMAN BODY COMMUNICATION CHANNEL MEASUREMENT USING MINIATURIZED WEARABLE DEVICES</b><br /> <b>Speaker</b>:<br /> Shreyas Sen, Purdue University, US<br /> <b>Authors</b>:<br /> Shitij Avlani, Mayukh Nath, Shovan Maity and Shreyas Sen, Purdue University, US<br /> <em><b>Abstract</b><br /> Human Body Communication has shown great promise to replace wireless communication for information exchange between wearable devices of a body area network. However, there are very few studies in the literature that systematically study the channel loss of capacitive HBC for wearable devices over a wide frequency range with different terminations at the receiver, partly due to the need for miniaturized wearable devices for an accurate study. This paper, for the first time, measures the channel loss of capacitive HBC from 100KHz to 1GHz for both high-impedance and 50 ohm terminations using wearable, battery-powered devices, which is mandatory for accurate measurement of the HBC channel loss due to ground coupling effects. Results show that high-impedance termination leads to a significantly lower channel loss (40 dB improvement at 1MHz) compared to 50 ohm termination at low frequencies. This difference steadily decreases with increasing frequency, until the two become similar near 80MHz. Beyond 100MHz, inter-device coupling dominates, thereby preventing accurate measurements of the channel loss of the human body.
The measured results provide consistent wearable, wide-frequency HBC channel-loss data and could serve as a backbone for the emerging field of HBC by aiding the selection of an appropriate operating frequency and termination.</em></td> </tr> <tr> <td style="width:40px;">IP2-12</td> <td><b>FROM DRUP TO PAC AND BACK</b><br /> <b>Speaker</b>:<br /> Daniela Kaufmann, Johannes Kepler University Linz, AT<br /> <b>Authors</b>:<br /> Daniela Kaufmann, Armin Biere and Manuel Kauers, Johannes Kepler University Linz, AT<br /> <em><b>Abstract</b><br /> Currently the most efficient automatic approach to verify gate-level multipliers combines SAT solving and computer algebra. In order to increase confidence in the verification, proof certificates are generated. However, due to the different solving techniques, these certificates require two different proof formats, namely DRUP and PAC. A combined proof has so far been missing, so correctness of this approach can only be trusted up to the correctness of compositional reasoning. In this paper we show how to generate a single proof in one proof format, which allows correctness to be certified using one simple proof checker. We further investigate empirically the effect on proof generation and checking time as well as on proof size. It turns out that PAC proofs are much more compact and faster to check.</em></td> </tr> <tr> <td style="width:40px;">IP2-13</td> <td><b>VERIFIABLE SECURITY TEMPLATES FOR HARDWARE</b><br /> <b>Speaker</b>:<br /> Bill Harrison, Oak Ridge National Laboratory, US<br /> <b>Authors</b>:<br /> William Harrison<sup>1</sup> and Gerard Allwein<sup>2</sup><br /> <sup>1</sup>Oak Ridge National Laboratory, US; <sup>2</sup>Naval Research Laboratory, US<br /> <em><b>Abstract</b><br /> HLS has, with a few notable exceptions, not focused on transferring ideas and techniques from high-assurance software formal methods to hardware development, despite there being a sophisticated and mature body of research in that area. Just as it has introduced software engineering virtues, we believe HLS can also become a vector for retrofitting software formal methods to the challenge of high-assurance security in hardware. This paper introduces the Device Calculus and its mechanization in the Agda proof checking system. The Device Calculus is a starting point for exploring formal methods and security within high-level synthesis flows. We illustrate the Device Calculus with a number of examples of formally verifiable security templates---i.e., functions in the Device Calculus that express common security structures at a high level of abstraction.</em></td> </tr> <tr> <td style="width:40px;">IP2-14</td> <td><b>IFFSET: IN-FIELD FUZZING OF INDUSTRIAL CONTROL SYSTEMS USING SYSTEM EMULATION</b><br /> <b>Speaker</b>:<br /> Dimitrios Tychalas, New York University, US<br /> <b>Authors</b>:<br /> Dimitrios Tychalas<sup>1</sup> and Michail Maniatakos<sup>2</sup><br /> <sup>1</sup>New York University, US; <sup>2</sup>New York University Abu Dhabi, AE<br /> <em><b>Abstract</b><br /> Industrial Control Systems (ICS) have evolved in the last decade, shifting from proprietary software/hardware to contemporary embedded architectures paired with open-source operating systems. In contrast to the IT world, where continuous updates and patches are expected, decommissioning always-on ICS for security assessment can incur prohibitive costs to their owner.
Thus, a solution for routinely assessing the cybersecurity posture of diverse ICS without affecting their operation is essential. In this paper we introduce IFFSET, a platform that leverages full-system emulation of Linux-based ICS firmware and utilizes fuzzing for security evaluation. Our platform extracts the file system and kernel information from a live ICS device, building an image which is emulated on a desktop system through QEMU. We employ fuzzing as a security assessment tool to analyze ICS-specific libraries and find potentially security-threatening conditions. We test our platform with commercial PLCs, showcasing potential threats with no interruption to the control process.</em></td> </tr> <tr> <td style="width:40px;">IP2-15</td> <td><b>FANNET: FORMAL ANALYSIS OF NOISE TOLERANCE, TRAINING BIAS AND INPUT SENSITIVITY IN NEURAL NETWORKS</b><br /> <b>Speaker</b>:<br /> Mahum Naseer, TU Wien, AT<br /> <b>Authors</b>:<br /> Mahum Naseer<sup>1</sup>, Mishal Fatima Minhas<sup>2</sup>, Faiq Khalid<sup>1</sup>, Muhammad Abdullah Hanif<sup>1</sup>, Osman Hasan<sup>2</sup> and Muhammad Shafique<sup>1</sup><br /> <sup>1</sup>TU Wien, AT; <sup>2</sup>National University of Sciences and Technology, PK<br /> <em><b>Abstract</b><br /> With constant improvements in network architectures and training methodologies, Neural Networks (NNs) are increasingly being deployed in real-world Machine Learning systems. However, despite their impressive performance on "known inputs", these NNs can fail absurdly on "unseen inputs", especially if these real-time inputs deviate from the training dataset distributions or contain certain types of input noise. This indicates the low noise tolerance of NNs, which is a major reason for the recent increase in adversarial attacks. This is a serious concern, particularly for safety-critical applications, where inaccurate results lead to dire consequences. We propose a novel methodology that leverages model checking for the Formal Analysis of Neural Networks (FANNet) under different input noise ranges. Our methodology allows us to rigorously analyze the noise tolerance of NNs, their input node sensitivity, and the effects of training bias on their performance, e.g., in terms of classification accuracy. For evaluation, we use a feed-forward fully-connected NN architecture trained for Leukemia classification. Our experimental results show 11% noise tolerance for the given trained network, identify the most sensitive input nodes, confirm the bias of the available training dataset, and indicate that the proposed methodology is much more rigorous and yet comparable to validation testing in terms of time and computational resources for larger noise ranges.</em></td> </tr> <tr> <td style="width:40px;">IP2-16</td> <td><b>A SCALABLE MIXED SYNTHESIS FRAMEWORK FOR HETEROGENEOUS NETWORKS</b><br /> <b>Speaker</b>:<br /> Max Austin, University of Utah, US<br /> <b>Authors</b>:<br /> Max Austin<sup>1</sup>, Scott Temple<sup>1</sup>, Walter Lau Neto<sup>1</sup>, Luca Amaru<sup>2</sup>, Xifan Tang<sup>1</sup> and Pierre-Emmanuel Gaillardon<sup>1</sup><br /> <sup>1</sup>University of Utah, US; <sup>2</sup>Synopsys, US<br /> <em><b>Abstract</b><br /> We present a new logic synthesis framework that produces efficient post-technology-mapping results on heterogeneous networks containing a mix of different types of logic.
The framework accomplishes this by breaking down the circuit into sections using a hypergraph k-way partitioner and then determining the best-fit logic representation for each partition between two Boolean networks, And-Inverter Graphs (AIGs) and Majority-Inverter Graphs (MIGs), each of which has been shown to outperform the other on different types of logic. Experimental results show that, over a set of OpenPiton Design Benchmarks (OPDB) and OpenCores benchmarks, our proposed methodology outperforms state-of-the-art academic tools in Area-Delay Product (ADP), Power-Delay Product (PDP), and Energy-Delay Product (EDP) by 5%, 2%, and 15% respectively after Application-Specific Integrated Circuit (ASIC) technology mapping, as well as showing a 54% improvement in runtime over conventional MIG optimization.</em></td> </tr> <tr> <td style="width:40px;">IP2-17</td> <td><b>DISCERN: DISTILLING STANDARD CELLS FOR EMERGING RECONFIGURABLE NANOTECHNOLOGIES</b><br /> <b>Speaker</b>:<br /> Shubham Rai, TU Dresden, DE<br /> <b>Authors</b>:<br /> Shubham Rai<sup>1</sup>, Michael Raitza<sup>2</sup>, Siva Satyendra Sahoo<sup>1</sup> and Akash Kumar<sup>1</sup><br /> <sup>1</sup>TU Dresden, DE; <sup>2</sup>TU Dresden and CfAED, DE<br /> <em><b>Abstract</b><br /> Logic gates and circuits based on reconfigurable nanotechnologies demonstrate runtime reconfigurability, where a single logic gate can exhibit more than one functionality. Recent attempts at circuits based on emerging reconfigurable nanotechnologies have primarily focused on using the traditional CMOS design flow involving similar-styled standard cells. These CMOS-centric standard cells fail to utilize the exciting properties offered by these nanotechnologies. In the present work, we explore the Boolean properties that define the reconfigurable properties of a logic gate. By analyzing truth-tables in detail, we find that there is a common Boolean rule which dictates why a logic gate is reconfigurable. Such logic gates can be efficiently implemented using reconfigurable nanotechnologies. We propose an algorithm which analyses the truth-tables of nodes in a circuit to list all such potential reconfigurable logic gates for a particular circuit. Technology mapping with these new logic gates (or standard cells) leads to a better mapping in terms of area and delay. Experiments employing our methodology over EPFL benchmarks show average improvements of around 13%, 16% and 11.5% in terms of area, number of edges and delay respectively as compared to conventional CMOS-centric standard-cell based mapping.</em></td> </tr> <tr> <td style="width:40px;">IP2-18</td> <td><b>A 16×128 STOCHASTIC-BINARY PROCESSING ELEMENT ARRAY FOR ACCELERATING STOCHASTIC DOT-PRODUCT COMPUTATION USING 1-16 BIT-STREAM LENGTH</b><br /> <b>Speaker</b>:<br /> Hyunjoon Kim, Nanyang Technological University, SG<br /> <b>Authors</b>:<br /> Qian Chen, Yuqi Su, Hyunjoon Kim, Taegeun Yoo, Tony Tae-Hyoung Kim and Bongjin Kim, Nanyang Technological University, SG<br /> <em><b>Abstract</b><br /> This work presents a 16×128 stochastic-binary processing element array for energy- and area-efficient processing of artificial neural networks. A processing element (PE) with all-digital components consists of an XNOR gate as a bipolar stochastic multiplier and an 8-bit binary adder with 8× registers for accumulating partial sums. The PE array comprises 16× dot-product units, each with 128 PEs cascaded in a single row.
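</em> <p>The XNOR-as-multiplier idea is easy to check in software: a value x in [-1, 1] is encoded as a bit-stream with P(1) = (x + 1) / 2, and the XNOR of two such streams encodes the product in the same bipolar format. The Python sketch below illustrates only this encoding property; the stream generation and names are assumptions, not the PE's circuit:</p> <pre>
# Bipolar stochastic multiplication via XNOR (illustrative model).
import random

def encode(x, n):
    return [random.random() < (x + 1) / 2 for _ in range(n)]

def decode(bits):
    return 2 * sum(bits) / len(bits) - 1

n = 1 << 14
a, w = 0.5, -0.25
prod = [not (p ^ q) for p, q in zip(encode(a, n), encode(w, n))]  # XNOR
print(decode(prod))  # close to a * w = -0.125, up to stochastic noise
</pre> <em>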
The latency and energy of the proposed dot-product unit are minimized by reducing the bit-stream length required to limit the accuracy degradation induced by approximate stochastic computing. A 128-input dot-product operation requires a bit-stream length (N) of 1-to-16, which is two orders of magnitude smaller than the baseline stochastic computation using MUX-based adders. The simulated dot-product error is 6.9-to-1.5% for N=1-to-16, while the error from the baseline stochastic method is 5.9-to-1.7% with N=128-to-2048. The mean MNIST classification accuracy is 96.11% (1.19% lower than 8b binary) using a three-layer MLP at N=16. The measured energy from a 65nm test-chip is 10.04pJ per dot-product, and the energy efficiency is 25.5TOPS/W at N=16.</em></td> </tr> <tr> <td style="width:40px;">IP2-19</td> <td><b>TOWARDS EXPLORING THE POTENTIAL OF ALTERNATIVE QUANTUM COMPUTING ARCHITECTURES</b><br /> <b>Speaker</b>:<br /> Arighna Deb, Kalinga Institute of Industrial Technology, IN<br /> <b>Authors</b>:<br /> Arighna Deb<sup>1</sup>, Gerhard W. Dueck<sup>2</sup> and Robert Wille<sup>3</sup><br /> <sup>1</sup>Kalinga Institute of Industrial Technology, IN; <sup>2</sup>University of New Brunswick, CA; <sup>3</sup>Johannes Kepler University Linz, AT<br /> <em><b>Abstract</b><br /> The recent advances in the physical realization of Noisy Intermediate Scale Quantum (NISQ) computers have motivated research on design automation that allows users to execute quantum algorithms on them. Certain physical constraints in the architectures restrict how logical qubits used to describe the algorithm can be mapped to physical qubits used to realize the corresponding functionality. Thus far, this has been addressed by inserting additional operations in order to overcome the physical constraints. However, all these approaches have taken the existing architectures as invariant and did not explore the potential of changing the quantum architecture itself—a valid option as long as the underlying physical constraints remain satisfied. In this work, we propose initial ideas to explore this potential. More precisely, we introduce several schemes for the generation of alternative coupling graphs (and, thereby, quantum computing architectures) that still might be able to satisfy physical constraints but, at the same time, allow for a more efficient realization of the desired quantum functionality.</em></td> </tr> <tr> <td style="width:40px;">IP2-20</td> <td><b>ACCELERATING QUANTUM APPROXIMATE OPTIMIZATION ALGORITHM USING MACHINE LEARNING</b><br /> <b>Speaker</b>:<br /> Swaroop Ghosh, Pennsylvania State University, US<br /> <b>Authors</b>:<br /> Mahabubul Alam, Abdullah Ash- Saki and Swaroop Ghosh, Pennsylvania State University, US<br /> <em><b>Abstract</b><br /> We propose a machine-learning-based approach to accelerate the implementation of the quantum approximate optimization algorithm (QAOA), a promising quantum-classical hybrid algorithm for demonstrating so-called quantum supremacy. In QAOA, a parametric quantum circuit and a classical optimizer iterate in a closed loop to solve hard combinatorial optimization problems. The performance of QAOA improves with an increasing number of stages (depth) in the quantum circuit. However, each added stage introduces two new parameters for the classical optimizer, increasing the number of optimization-loop iterations.
We note a correlation among the parameters of lower-depth and higher-depth QAOA implementations and exploit it by developing a machine learning model that predicts gate parameters close to the optimal values. As a result, the optimization loop converges in fewer iterations. We choose the graph MaxCut problem as a prototype to solve using QAOA. We perform a feature extraction routine using 100 different QAOA instances and develop a training data-set with 13,860 optimal parameters. We present our analysis for 4 flavors of regression models and 4 flavors of classical optimizers. Finally, we show that the proposed approach can curtail the number of optimization iterations by 44.9% on average (up to 65.7%) in an analysis performed with 264 flavors of graphs.</em></td> </tr> </tbody> </table> <hr /> <h2 id="UB05">UB05 Session 5</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 10:00 - 12:00<br /> <b>Location / Room:</b> Booth 11, Exhibition Area</p> <p> </p> <table> <tbody> <tr> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> <tr> <td>UB05.1</td> <td><b>TAPASCO: THE OPEN-SOURCE TASK-PARALLEL SYSTEM COMPOSER FRAMEWORK</b><br /> <b>Authors</b>:<br /> Carsten Heinz, Lukas Sommer, Lukas Weber, Jaco Hofmann and Andreas Koch, TU Darmstadt, DE<br /> <em><b>Abstract</b><br /> Field-programmable gate arrays (FPGAs) are an established platform for highly specialized accelerators, but in a heterogeneous setup, the accelerator still needs to be integrated into the overall system. The open-source TaPaSCo (Task-Parallel System Composer) framework was created to serve this purpose: the fast integration of FPGA-based accelerators into compute platforms or systems-on-chip (SoC) and their connection to relevant components on the FPGA board. TaPaSCo can support developers in all steps of the development process: from cores resulting from High-Level Synthesis or cores written in an HDL, a complete FPGA design can be created. TaPaSCo will automatically connect all processing elements to the memory and host interfaces and generate a complete bitstream. The TaPaSCo Runtime API allows software to interface with accelerators and supports operations such as transferring data to FPGA memory, passing values and controlling the execution of the accelerators.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3101.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB05.2</td> <td><b>ELSA: EIGENVALUE BASED HYBRID LINEAR SYSTEM ABSTRACTION: BEHAVIORAL MODELING OF TRANSISTOR-LEVEL CIRCUITS USING AUTOMATIC ABSTRACTION TO HYBRID AUTOMATA</b><br /> <b>Authors</b>:<br /> Ahmad Tarraf and Lars Hedrich, University of Frankfurt, DE<br /> <em><b>Abstract</b><br /> Model abstraction of transistor-level circuits, while preserving an accurate behavior, is still an open problem. In this demo an approach is presented that automatically generates a hybrid automaton (HA) with linear states from an existing circuit netlist. The approach starts with a netlist at transistor level with full SPICE accuracy and ends at the system-level description of the circuit in MATLAB or in Verilog-A. The resulting hybrid automaton exhibits linear behavior as well as the technology-dependent nonlinear (e.g., limiting) behavior. The accuracy and speed-up of the generated Verilog-A models are evaluated on several transistor-level circuit abstractions, from simple operational amplifiers up to complex filters.
Moreover, we verify the equivalence between the generated model and the original circuit. For the generated models in MATLAB syntax, a reachability analysis is performed using the reachability tool CORA.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3097.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB05.3</td> <td><b>EUCLID-NIR GPU: AN ON-BOARD PROCESSING GPU-ACCELERATED SPACE CASE STUDY DEMONSTRATOR</b><br /> <b>Authors</b>:<br /> Ivan Rodriguez and Leonidas Kosmidis, BSC / UPC, ES<br /> <em><b>Abstract</b><br /> Embedded Graphics Processing Units (GPUs) are very attractive candidates for on-board payload processing of future space systems, thanks to their high performance and low power consumption. Although there is significant interest from both academia and industry, there is no open and publicly available case study showing their capabilities yet. In this master's thesis project, performed within the ESA-funded GPU4S (GPU for Space) project, we have parallelised and ported the Euclid NIR (Near Infrared) image processing algorithm, used in the European Space Agency (ESA) mission to be launched in 2022, to an automotive GPU platform, the NVIDIA Xavier. In the demo we will present in real time the significantly higher performance achieved compared to the original sequential implementation. In addition, visitors will have the opportunity to examine the images on which the algorithm operates, as well as to inspect the algorithm parallelisation through profiling and code inspection.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3106.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB05.4</td> <td><b>BCFELEAM: BACKFLOW: BACKWARD EDGE CONTROL FLOW ENFORCEMENT FOR LOW END ARM REAL-TIME SYSTEMS</b><br /> <b>Authors</b>:<br /> Bresch Cyril<sup>1</sup>, David Héy<sup>1</sup>, Roman Lysecky<sup>2</sup> and Stephanie Chollet<sup>1</sup><br /> <sup>1</sup>LCIS, FR; <sup>2</sup>University of Arizona, US<br /> <em><b>Abstract</b><br /> The C programming language is one of the most popular languages in embedded system programming. Indeed, C is efficient, lightweight and can easily meet high-performance and deterministic real-time constraints. However, these assets come at a price: C does not provide extra features for memory safety. As a result, attackers can easily exploit spatial memory vulnerabilities to hijack the execution flow of an application. The demonstration features a real-time connected infusion pump vulnerable to memory attacks. First, we showcase an exploit that remotely takes control of the pump. Then, we demonstrate the effectiveness of BackFlow, an LLVM-based compiler extension that enforces control-flow integrity in low-end ARM embedded systems.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3109.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB05.5</td> <td><b>UWB ACKATCK: HIJACKING DEVICES IN UWB INDOOR POSITIONING SYSTEMS</b><br /> <b>Authors</b>:<br /> Baptiste Pestourie, Vincent Beroulle and Nicolas Fourty, Université Grenoble Alpes, FR<br /> <em><b>Abstract</b><br /> Various radio-based Indoor Positioning Systems (IPS) have been proposed during the last decade as solutions to GPS inconsistency in indoor environments.
Among the different radio technologies proposed for this purpose, 802.15.4 Ultra-Wideband (UWB) is by far the most performant, reaching up to 10 cm accuracy with 1000 Hz refresh rates. As a consequence, UWB is a popular technology for applications such as asset tracking in industrial environments or indoor navigation of robots/drones. However, some security flaws in the 802.15.4 standard expose UWB positioning to attacks. In this demonstration, we show how an attacker can exploit a vulnerability in 802.15.4 acknowledgment frames to hijack a device in a UWB positioning system. We demonstrate that, using just one cheap UWB chip, the attacker can take control over the positioning system and generate fake trajectories from a laptop. The results are observed in real time in the 3D engine monitoring the positioning system.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3115.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB05.6</td> <td><b>DESIGN AUTOMATION FOR EXTENDED BURST-MODE AUTOMATA IN WORKCRAFT</b><br /> <b>Authors</b>:<br /> Alex Chan, Alex Yakovlev, Danil Sokolov and Victor Khomenko, Newcastle University, GB<br /> <em><b>Abstract</b><br /> Asynchronous circuits are known to have high performance, robustness and low power consumption, which are particularly beneficial for the area of so-called "little digital" controllers where low latency is crucial. However, asynchronous design is not widely adopted by industry, partially due to the steep learning curve inherent in the complexity of formal specifications, such as Signal Transition Graphs (STGs). In this demo, we promote a class of the Finite State Machine (FSM) model called Extended Burst-Mode (XBM) automata as a practical way to specify many asynchronous circuits. The XBM specification has been automated in the Workcraft toolkit (<a href="https://workcraft.org" title="https://workcraft.org">https://workcraft.org</a>) with elaborate support for state encoding, conditionals and "don't care" signals. Formal verification and logic synthesis of the XBM automata are implemented via conversion to the established STG model, reusing existing methods and CAD tools. Tool support for the XBM flow will be demonstrated using several case studies.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3120.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB05.7</td> <td><b>AT-SPEED DFT ARCHITECTURE FOR BUNDLED-DATA CIRCUITS</b><br /> <b>Authors</b>:<br /> Ricardo Aquino Guazzelli and Laurent Fesquet, Université Grenoble Alpes, FR<br /> <em><b>Abstract</b><br /> At-speed testing for asynchronous circuits is still an open concern in the literature. Due to the timing constraints between control and data paths, Design-for-Testability (DfT) methodologies must test both control and data paths at the same time in order to guarantee circuit correctness. As Process, Voltage and Temperature (PVT) variations significantly impact circuit design in newer CMOS technologies and low-power techniques such as voltage scaling, the timing constraints between control and data paths must be tested after fabrication not only under nominal conditions but across a range of operating conditions. This work explores an at-speed testing approach for bundled-data circuits, targeting the micropipeline template.
This test approach mainly checks whether the sized delay lines in the control paths respect the local timing assumptions of the data paths.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3117.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB05.8</td> <td><b>CATANIS: CAD TOOL FOR AUTOMATIC NETWORK SYNTHESIS</b><br /> <b>Authors</b>:<br /> Davide Quaglia, Enrico Fraccaroli, Filippo Nevi and Sohail Mushtaq, Università di Verona, IT<br /> <em><b>Abstract</b><br /> The proliferation of communication technologies for embedded systems opened the way for new applications, e.g., Smart Cities and Industry 4.0. In such applications, hundreds or thousands of smart devices interact together through different types of channels and protocols. This increasing communication complexity forces computer-aided design methodologies to scale up from embedded systems in isolation to the global inter-connected system. Network Synthesis is the methodology to optimally allocate functionality onto network nodes and define the communication infrastructure among them. This booth will demonstrate the functionality of a graphic tool for automatic network synthesis developed by the Computer Science Department of the University of Verona. It allows users to graphically specify the communication requirements of a smart space (e.g., its map can be considered) in terms of sensing and computation tasks, together with a library of node types and communication protocols to be used.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3125.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB05.9</td> <td><b>PARALLEL ALGORITHM FOR CNN INFERENCE AND ITS AUTOMATIC SYNTHESIS</b><br /> <b>Authors</b>:<br /> Takashi Matsumoto, Yukio Miyasaka, Xinpei Zhang and Masahiro Fujita, University of Tokyo, JP<br /> <em><b>Abstract</b><br /> Recently, Convolutional Neural Networks (CNNs) have surpassed conventional methods in the field of image processing. This demonstration shows a new algorithm to calculate CNN inference using processing elements arranged and connected based on the topology of the convolution. They are connected in a mesh and calculate CNN inference in a systolic way. The algorithm performs the convolution of all elements with the same output feature in parallel. We demonstrate a method to automatically synthesize an algorithm which simultaneously performs the convolution and the communication of pixels for the computation of the next layer. We tested several sizes of input layers, kernels, and strides, and confirmed that correct algorithms were synthesized. The synthesis method is extended to sparse kernels; the synthesized algorithm requires fewer cycles than the original, and the sparser the kernel, the more opportunities there are to reduce the cycle count.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3132.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB05.10</td> <td><b>FU: LOW POWER AND ACCURACY CONFIGURABLE APPROXIMATE ARITHMETIC UNITS</b><br /> <b>Authors</b>:<br /> Tomoaki Ukezono and Toshinori Sato, Fukuoka University, JP<br /> <em><b>Abstract</b><br /> In this demonstration, we will introduce approximate arithmetic units, such as adders, multipliers, and MACs, that are being studied in our system-architecture laboratory. Our approximate arithmetic units can reduce delay and power consumption at the expense of accuracy.
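</em> <p>As one concrete example of how such a trade-off can be built, a textbook lower-part-OR adder approximates the k least-significant bits with a carry-free OR while keeping the upper bits exact; in hardware, raising k shortens the carry chain and saves power at the cost of accuracy. The bit-accurate Python model below illustrates this generic scheme, not necessarily the demonstrated circuits:</p> <pre>
# Accuracy-configurable approximate adder (generic lower-part-OR scheme).
def approx_add(a: int, b: int, k: int, width: int = 16) -> int:
    lo_mask = (1 << k) - 1
    lo = (a | b) & lo_mask              # approximate, carry-free lower part
    hi = ((a >> k) + (b >> k)) << k     # exact upper part, no carry-in
    return (hi | lo) & ((1 << width) - 1)

print(approx_add(1000, 2000, 0))   # 3000: exact when k = 0
print(approx_add(1000, 2000, 8))   # 2808: approximate, cheaper in hardware
</pre> <em>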
Our approximate arithmetic units are intended to be applied to IoT edge devices that process images, and are suitable for battery-driven, low-cost devices. Their key feature is that the circuit is configured so that the accuracy is dynamically variable, and the trade-off between accuracy and power can be selected according to the usage status of the device. In this demonstration, we show the power consumption under various accuracy requirements based on actual data and demonstrate the practicality of the proposed arithmetic units.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3127.pdf">More information ...</a></b></em></td> </tr> <tr> <td>12:00</td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="6.1">6.1 Special Day on "Embedded AI": Emerging Devices, Circuits and Systems</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 11:00 - 12:30<br /> <b>Location / Room:</b> Amphithéâtre Jean Prouve</p> <p><b>Chair:</b><br /> Carlo Reita, CEA, FR</p> <p><b>Co-Chair:</b><br /> Bernabe Linares-Barranco, CSIC, ES</p> <p>This session focuses on the advantages of novel emerging nanotechnology devices and their use in designing circuits and systems for embedded AI hardware solutions.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>6.1.1</td> <td><b>IN-MEMORY RESISTIVE RAM IMPLEMENTATION OF BINARIZED NEURAL NETWORKS FOR MEDICAL APPLICATIONS</b><br /> <b>Speaker</b>:<br /> Damien Querlioz, University Paris-Saclay, FR<br /> <b>Authors</b>:<br /> Bogdan Penkovsky<sup>1</sup>, Marc Bocquet<sup>2</sup>, Tifenn Hirtzlin<sup>1</sup>, Jacques-Olivier Klein<sup>1</sup>, Etienne Nowak<sup>3</sup>, Elisa Vianello<sup>3</sup>, Jean-Michel Portal<sup>2</sup> and Damien Querlioz<sup>4</sup><br /> <sup>1</sup>Université Paris-Saclay, FR; <sup>2</sup>Aix-Marseille University, FR; <sup>3</sup>CEA-Leti, FR; <sup>4</sup>Université Paris-Sud, FR<br /> <em><b>Abstract</b><br /> The advent of deep learning has considerably accelerated machine learning development, but its deployment at the edge is limited by its high energy cost and memory requirements. With new memory technology available, emerging Binarized Neural Networks (BNNs) promise to reduce the energy impact of the forthcoming machine learning hardware generation, enabling machine learning on edge devices and avoiding data transfer over the network. In this work, after presenting our implementation employing a hybrid CMOS/hafnium-oxide resistive memory technology, we suggest strategies to apply BNNs to biomedical signals such as electrocardiography and electroencephalography, maintaining the accuracy level while reducing memory requirements. These results are obtained by binarizing solely the classifier part of a neural network. We also discuss how these results translate to the edge-oriented MobileNet V1 neural network on the ImageNet task.
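</em> <p>The arithmetic that makes BNNs attractive for in-memory hardware is compact enough to state in two lines: with weights and activations in {-1, +1} encoded as bits (1 for +1, 0 for -1), a dot product reduces to an XNOR and a population count. The sketch below illustrates BNN arithmetic in general, not the presented RRAM implementation:</p> <pre>
# Binarized dot product: dot = n - 2 * popcount(a XOR w).
def bnn_dot(a: int, w: int, n: int) -> int:
    return n - 2 * bin((a ^ w) & ((1 << n) - 1)).count("1")

# (+1, +1, -1, -1) . (+1, -1, -1, +1) = 1 - 1 + 1 - 1 = 0
assert bnn_dot(0b1100, 0b1001, 4) == 0
</pre> <em>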
The final goal of this research is to enable smart autonomous healthcare devices.</em></td> </tr> <tr> <td>11:22</td> <td>6.1.2</td> <td><b>MIXED-SIGNAL VECTOR-BY-MATRIX MULTIPLIER CIRCUITS BASED ON 3D-NAND MEMORIES FOR NEUROMORPHIC COMPUTING</b><br /> <b>Speaker</b>:<br /> Dmitri Strukov, University of California, Santa Barbara, US<br /> <b>Authors</b>:<br /> Mohammad Bavandpour, Shubham Sahay, Mohammad Mahmoodi and Dmitri Strukov, University of California, Santa Barbara, US<br /> <em><b>Abstract</b><br /> We propose extremely dense, energy-efficient mixed-signal vector-by-matrix multiplication (VMM) circuits based on the existing 3D-NAND flash memory blocks, without any need for their modification. Such compatibility is achieved using a time-domain-encoded VMM design. We have performed rigorous simulations of such a circuit, taking into account non-idealities such as drain-induced barrier lowering, capacitive coupling, charge injection, parasitics, process variations, and noise. Our results, for example, show that the 4-bit VMM of 200-element vectors, using the commercially available 64-layer gate-all-around macaroni-type 3D-NAND memory blocks designed in the 55-nm technology node, may provide an unprecedented area efficiency of 0.14 μm²/byte and energy efficiency of ~11 fJ/Op, including the input/output and other peripheral circuitry overheads.</em></td> </tr> <tr> <td>11:44</td> <td>6.1.3</td> <td><b>MODULAR RRAM BASED IN-MEMORY COMPUTING DESIGN FOR EMBEDDED AI</b><br /> <b>Authors</b>:<br /> Xinxin Wang, Qiwen Wang, Mohammed A. Zidan, Fan-Hsuan Meng, John Moon and Wei Lu, University of Michigan, US<br /> <em><b>Abstract</b><br /> Deep Neural Networks (DNNs) are widely used for many artificial intelligence applications with great success. However, they often come with high computation cost and complexity. Accelerators are crucial in improving energy efficiency and throughput, particularly for embedded AI applications. Resistive random-access memory (RRAM) has the potential to enable efficient AI accelerator implementation, as the weights can be mapped to the conductance values of RRAM devices and computation can be performed directly in memory. Specifically, by converting input activations into voltage pulses, vector-matrix multiplications (VMM) can be performed in the analog domain, in place and in parallel. Moreover, the whole model can be stored on-chip, thus eliminating off-chip DRAM access completely and achieving high energy efficiency during the end-to-end operation. In this presentation, we will discuss how practical DNN models can be mapped onto realistic RRAM arrays in a modular design. The effects of challenges such as quantization, finite array size, and device non-idealities on system performance will be analyzed through standard DNN models such as VGG-16 and MobileNet. System performance metrics such as throughput and energy/image will also be discussed.</em></td> </tr> <tr> <td>12:06</td> <td>6.1.4</td> <td><b>NEUROMORPHIC COMPUTING: TOWARD DYNAMICAL DATA PROCESSING</b><br /> <b>Author</b>:<br /> Fabian Alibart, CNRS, Lille, FR<br /> <em><b>Abstract</b><br /> While machine-learning approaches have made tremendous progress in recent years, more is expected from the third generation of neural networks that should sustain this evolution. In addition to unsupervised learning and spike-based computing capability, this new generation of computing machines will be intrinsically dynamical systems that will shift our conception of electronics.
In this context, investigating new material implementations of neuromorphic concepts seems a very attractive direction. In this presentation, I will present our recent efforts toward the development of neuromorphic synapses that offer attractive features for both spike-based computing and unsupervised learning. Starting from their basic physics, I will show how their dynamics can be used to implement time-dependent computing functions. I will also extend this idea of dynamical computing to the case of reservoir computing based on organic sensors in order to show how neuromorphic concepts can be applied to a large class of dynamical problems.</em></td> </tr> <tr> <td>12:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="6.2">6.2 Secure and fast memory and storage</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 11:00 - 12:30<br /> <b>Location / Room:</b> Chamrousse</p> <p><b>Chair:</b><br /> Hao Yu, SUSTech, CN</p> <p><b>Co-Chair:</b><br /> Chengmo Yang, University of Delaware, US</p> <p>As memories become persistent, the design of traditional data structures such as trees and hash tables, as well as filesystems, should be revisited to cope with the challenges brought by new memory devices. In this context, the main focus of this session is on how to improve the performance, security, and energy efficiency of memory and storage. The specific techniques range from the design of integrity trees and hash tables to the management of superpages in filesystems, data prefetching in solid-state drives (SSDs), and energy-efficient carbon-nanotube cache design.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>6.2.1</td> <td><b>AN EFFICIENT PERSISTENCY AND RECOVERY MECHANISM FOR SGX-STYLE INTEGRITY TREE IN SECURE NVM</b><br /> <b>Speaker</b>:<br /> Mengya Lei, Huazhong University of Science &amp; Technology, CN<br /> <b>Authors</b>:<br /> Mengya Lei, Fang Wang, Dan Feng, Fan Li and Jie Xu, Huazhong University of Science &amp; Technology, CN<br /> <em><b>Abstract</b><br /> The integrity tree is a crucial part of secure non-volatile memory (NVM) system design. For NVM with large capacity, the SGX-style integrity tree (SIT) is practical due to its parallel updates and variable arity. However, employing SIT in secure NVM is not easy, because the SIT secure metadata must be strictly persisted, or restored after a sudden power loss, which unfortunately incurs unacceptable run-time overhead or recovery time. In this paper, we propose PSIT, a metadata persistency solution for SIT-protected secure NVM with high performance and fast restoration. PSIT utilizes the observation that, for a lazily updated SIT, the tree nodes lost after a crash can be recovered from the corresponding child nodes in the NVM. It reduces the persistency overhead of the SIT nodes through a restrained write-back meta-cache and leverages the SIT inter-layer dependency for recovery.
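</em> <p>The recovery observation can be illustrated with a toy tree model: when each parent is a deterministic function of its children, a parent lost from the volatile cache on a crash can be recomputed bottom-up from the children that did reach NVM. The following sketch uses a plain hash tree for brevity; the real SIT works on counters and MACs, so this is an analogy, not PSIT itself:</p> <pre>
# Toy bottom-up rebuild of inner tree nodes from persisted leaves.
import hashlib

def node_hash(children):
    h = hashlib.sha256()
    for c in children:
        h.update(c)
    return h.digest()

def rebuild(leaves, arity=8):
    level = leaves
    while len(level) > 1:  # recompute each inner level from the one below
        level = [node_hash(level[i:i + arity])
                 for i in range(0, len(level), arity)]
    return level[0]        # recovered root

leaves = [bytes([i]) * 32 for i in range(64)]
root = rebuild(leaves)
</pre> <em>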
Experiments show that, compared to ASIT, a state-of-the-art secure NVM using SIT, PSIT decreases write traffic by 47% and improves performance by 18% on average while maintaining a comparable recovery time.</em></td> </tr> <tr> <td>11:30</td> <td>6.2.2</td> <td><b>REVISITING PERSISTENT HASH TABLE DESIGN FOR COMMERCIAL NON-VOLATILE MEMORY</b><br /> <b>Speaker</b>:<br /> Kaixin Huang, Shanghai Jiao Tong University, CN<br /> <b>Authors</b>:<br /> Kaixin Huang, Yan Yan and Linpeng Huang, Shanghai Jiao Tong University, CN<br /> <em><b>Abstract</b><br /> Emerging non-volatile memory technologies are driving an evolution in storage systems and durable data structures. Among them, a growing body of research on persistent hash tables employs NVM as the storage layer for both fast access and efficient persistence. Most of this work is based on the assumptions that NVM has byte access granularity, poor write endurance, DRAM-comparable read latency and much higher write latency. However, a commercial non-volatile memory product, named Intel Optane DC Persistent Memory (AEP), has a few interesting features that differ from previous assumptions, namely 1) block access granularity, 2) little concern for software-layer write endurance, and 3) much higher read latency than DRAM with DRAM-comparable write latency. Confronted with the new challenges brought by AEP, we propose Rewo-Hash, a novel read-efficient and write-optimized hash table for commercial non-volatile memory. Our design can be summarized in three key points. First, we keep a copy of the hash table in DRAM as a cached table to speed up search requests. Second, we design a log-free atomic mechanism to support fast writes. Third, we devise an efficient synchronization scheme between the persistent table and the cached table to mask the data synchronization overhead. We conduct extensive experiments on a real NVM platform, and the results show that, compared with state-of-the-art NVM-optimized hash tables, Rewo-Hash improves read latency by 1.73x-2.70x and write latency by 1.46x-3.11x. Rewo-Hash also outperforms its counterparts in throughput by 1.86x-4.24x for various YCSB workloads.</em></td> </tr> <tr> <td>12:00</td> <td>6.2.3</td> <td><b>OPTIMIZING PERFORMANCE OF PERSISTENT MEMORY FILE SYSTEMS USING VIRTUAL SUPERPAGES</b><br /> <b>Speaker</b>:<br /> Chaoshu Yang, Chongqing University, CN<br /> <b>Authors</b>:<br /> Chaoshu Yang<sup>1</sup>, Duo Liu<sup>1</sup>, Runyu Zhang<sup>1</sup>, Xianzhang Chen<sup>1</sup>, Shun Nie<sup>1</sup>, Qingfeng Zhuge<sup>1</sup> and Edwin H.-M Sha<sup>2</sup><br /> <sup>1</sup>Chongqing University, CN; <sup>2</sup>East China Normal University, CN<br /> <em><b>Abstract</b><br /> Existing persistent memory file systems can significantly improve performance by utilizing the advantages of emerging Persistent Memories (PMs). In particular, persistent memory file systems can employ superpages (e.g., 2MB pages) of PMs to alleviate the overhead of locating file data and to reduce TLB misses. Unfortunately, superpages also induce two critical problems. First, maintaining data consistency in file systems using superpages causes severe write amplification during overwrites of file data. Second, existing management of superpages may lead to a large waste of PM space. In this paper, we propose a Virtual Superpage Mechanism (VSM) to solve these problems by taking advantage of the virtual address space.
On one hand, VSM adopts a multi-grained copy-on-write mechanism to reduce write amplification while ensuring data consistency. On the other hand, VSM provides a zero-copy file data migration mechanism to eliminate the loss of space utilization efficiency caused by superpages. We implement the proposed VSM mechanism in the Linux kernel based on PMFS. Compared with the original PMFS and NOVA, the experimental results show that VSM improves write and read performance by 36% and 14% on average, respectively. Meanwhile, VSM achieves the same space utilization efficiency as a file system that uses normal 4KB pages to organize files.</em></td> </tr> <tr> <td>12:15</td> <td>6.2.4</td> <td><b>FREQUENT ACCESS PATTERN-BASED PREFETCHING INSIDE OF SOLID-STATE DRIVES</b><br /> <b>Speaker</b>:<br /> Jianwei Liao, Southwest University of China, CN<br /> <b>Authors</b>:<br /> Xiaofei Xu<sup>1</sup>, Zhigang Cai<sup>2</sup>, Jianwei Liao<sup>2</sup> and Yutaka Ishikawa<sup>3</sup><br /> <sup>1</sup>Southwest University, CN; <sup>2</sup>Southwest University of China, CN; <sup>3</sup>RIKEN, Japan, JP<br /> <em><b>Abstract</b><br /> This paper proposes an SSD-inside data prefetching scheme that is independent of the OS and transparent in use. To be specific, it first mines frequent block access patterns that reflect the correlation among past requests. Then it compares the requests in the current time window with the identified patterns, to direct the fetching of data in advance. Furthermore, to maximize cache use efficiency, we construct a mathematical model that adaptively determines the cache partition on the basis of I/O workload characteristics, for separately buffering the prefetched data and the write data. Experimental results demonstrate that our proposal can yield improvements in average read latency of 6.3% to 9.3% without noticeably increasing write latency, in contrast to conventional SSD-inside prefetching schemes.</em></td> </tr> <tr> <td style="width:40px;">12:30</td> <td><a href="#IP3">IP3-1</a>, 594</td> <td><b>CNT-CACHE: AN ENERGY-EFFICIENT CARBON NANOTUBE CACHE WITH ADAPTIVE ENCODING</b><br /> <b>Speaker</b>:<br /> Kexin Chu, School of Electronic Science &amp; Applied Physics, Hefei University of Technology, Anhui, China, CN<br /> <b>Authors</b>:<br /> Dawen Xu<sup>1</sup>, Kexin Chu<sup>1</sup>, Cheng Liu<sup>2</sup>, Ying Wang<sup>2</sup>, Lei Zhang<sup>2</sup> and Huawei Li<sup>2</sup><br /> <sup>1</sup>School of Electronic Science &amp; Applied Physics Hefei University of Technology Anhui, CN; <sup>2</sup>Chinese Academy of Sciences, CN<br /> <em><b>Abstract</b><br /> Carbon Nanotube Field-Effect Transistors (CNFETs), which promise both higher clock speed and energy efficiency, are an attractive alternative to conventional power-hungry CMOS caches. We observe that a CNFET-based cache constructed with typical 9T SRAM cells has distinct energy consumption when reading/writing 0 and 1. The energy consumption of reading 0 is around 3X higher than reading 1, and the energy consumption of writing 1 is almost 10X higher than writing 0. With this observation, we propose an energy-efficient cache design called CNT-Cache to take advantage of this feature. It includes an adaptive data encoding module that can convert the coding of each cache line to match the cache reading and writing preferences. Meanwhile, it has a cache line encoding direction predictor that instructs the encoding direction according to the cache line access history.
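</em> <p>The encoding idea reduces to storing each line either as-is or inverted, with one flag bit, so that the stored bit statistics match whichever operation is cheap. A toy Python model follows; the cost asymmetry and the read-heavy/write-heavy decision are taken as given, and this is not the CNT-Cache circuit itself:</p> <pre>
# Adaptive line encoding: invert a line when that better matches the cheap
# operation, e.g. mostly-1 data for read-heavy lines (reading 1 is cheap)
# and mostly-0 data for write-heavy lines (writing 0 is cheap).
def encode_line(bits, read_heavy):
    wants_ones = read_heavy
    majority_ones = 2 * sum(bits) >= len(bits)
    invert = majority_ones != wants_ones
    return [b ^ invert for b in bits], invert  # flag bit stored per line

line = [1, 0, 0, 0, 1, 0, 0, 0]
stored, flag = encode_line(line, read_heavy=True)
assert [b ^ flag for b in stored] == line      # decoding restores the data
</pre> <em>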
The two optimizations combined can reduce the overall dynamic power consumption significantly. According to our experiments, the optimized CNFET-based L1 D-Cache reduces dynamic power consumption by 22% on average compared to the baseline CNFET cache.</em></td> </tr> <tr> <td>12:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="6.3">6.3 Special Session: Modern Logic Reasoning Methods for Functional ECO</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 11:00 - 12:30<br /> <b>Location / Room:</b> Autrans</p> <p><b>Chair:</b><br /> Patrick Vuillod, Synopsys, US</p> <p><b>Co-Chair:</b><br /> Christoph Scholl, Albert-Ludwigs-University Freiburg, DE</p> <p>Functional Engineering Change Order (ECO) is the problem of incrementally updating an existing logic network after a (possibly late) change in the design specification. The problem requires (i) identifying a small portion of the network's logic to be changed and (ii) automatically synthesizing a patch to replace this portion and rectify the network's functional behavior. ECOs can be solved using the logical framework of quantified Boolean formulæ (QBF), where a logic query asks for the existence of a set of nodes and values at those nodes to rectify the logic network's output functions. The global nature of the problem, however, challenges scalability. Any internal node in the logic network is a potential location for rectification, and any node in the logic network may be used to simplify the synthesized patch. Furthermore, off-the-shelf QBF algorithms do not allow a formulation of resource costs for reusing existing logic.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>6.3.1</td> <td><b>ENGINEERING CHANGE ORDER FOR COMBINATIONAL AND SEQUENTIAL DESIGN RECTIFICATION</b><br /> <b>Speaker</b>:<br /> Jie-Hong Roland Jiang, National Taiwan University, TW<br /> <b>Authors</b>:<br /> Jie-Hong Roland Jiang<sup>1</sup>, Victor Kravets<sup>2</sup> and Nian-Ze Lee<sup>1</sup><br /> <sup>1</sup>National Taiwan University, TW; <sup>2</sup>IBM, US</td> </tr> <tr> <td>11:20</td> <td>6.3.2</td> <td><b>EXACT DAG-AWARE REWRITING</b><br /> <b>Speaker</b>:<br /> Heinz Riener, EPFL, CH<br /> <b>Authors</b>:<br /> Heinz Riener<sup>1</sup>, Alan Mishchenko<sup>2</sup> and Mathias Soeken<sup>1</sup><br /> <sup>1</sup>EPFL, CH; <sup>2</sup>University of California, Berkeley, US</td> </tr> <tr> <td>11:40</td> <td>6.3.3</td> <td><b>LEARNING TO AUTOMATE THE DESIGN UPDATES FROM OBSERVED ENGINEERING CHANGES IN THE CHIP DEVELOPMENT CYCLE</b><br /> <b>Speaker</b>:<br /> Victor Kravets, IBM, US<br /> <b>Authors</b>:<br /> Victor Kravets<sup>1</sup>, Jie-Hong Roland Jiang<sup>2</sup> and Heinz Riener<sup>3</sup><br /> <sup>1</sup>IBM, US; <sup>2</sup>National Taiwan University, TW; <sup>3</sup>EPFL, CH</td> </tr> <tr> <td>12:05</td> <td>6.3.4</td> <td><b>SYNTHESIS AND OPTIMIZATION OF MULTIPLE PORTIONS OF CIRCUITS FOR ECO BASED ON SET-COVERING AND QBF FORMULATIONS</b><br /> <b>Speaker</b>:<br /> Masahiro Fujita, University of Tokyo, JP<br /> <b>Authors</b>:<br /> Masahiro Fujita, Yusuke Kimura, Xingming Le, Yukio Miyasaka and Amir Masoud Gharehbaghi, University of Tokyo, JP</td> </tr> <tr> <td>12:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="6.4">6.4 Microarchitecture to the rescue of memory</h2>
<p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 11:00 - 12:30<br /> <b>Location / Room:</b> Stendhal</p> <p><b>Chair:</b><br /> Olivier Sentieys, INRIA, FR</p> <p><b>Co-Chair:</b><br /> Jeronimo Castrillon, TU Dresden, DE</p> <p>This session discusses micro-architectural innovations across three different memory technologies, namely caches, 3D-stacked DRAM, and non-volatile memory. This includes exploiting several aspects of redundancy to maximize cache utilization through compression, as well as multicast in 3D-stacked high-speed memories for graph analytics, and a microarchitecture solution to unify persistency and encryption in non-volatile memories.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>6.4.1</td> <td><b>EFFICIENT HARDWARE-ASSISTED CRASH CONSISTENCY IN ENCRYPTED PERSISTENT MEMORY</b><br /> <b>Speaker</b>:<br /> Zhan Zhang, Huazhong University of Science &amp; Technology, CN<br /> <b>Authors</b>:<br /> Zhan Zhang<sup>1</sup>, Jianhui Yue<sup>2</sup>, Xiaofei Liao<sup>1</sup> and Hai Jin<sup>1</sup><br /> <sup>1</sup>Huazhong University of Science &amp; Technology, CN; <sup>2</sup>Michigan Technological University, US<br /> <em><b>Abstract</b><br /> Persistent memory (PM) requires maintaining crash consistency and encrypting data, to ensure data recoverability and data confidentiality. The enforcement of these two goals not only puts more burden on programmers but also degrades performance. To address this issue, we propose a hardware-assisted encrypted persistent memory system, in which logging and data encryption are assisted by hardware. Furthermore, we apply counter-based encryption to data and cipher feedback (CFB) mode encryption to the log, reducing the encryption overhead. Our preliminary experimental results show that the transaction throughput of the proposed design exceeds that of the baseline design by up to 34.4%.</em></td> </tr> <tr> <td>11:30</td> <td>6.4.2</td> <td><b>2DCC: CACHE COMPRESSION IN TWO DIMENSIONS</b><br /> <b>Speaker</b>:<br /> Amin Ghasemazar, University of British Columbia, CA<br /> <b>Authors</b>:<br /> Amin Ghasemazar<sup>1</sup>, Mohammad Ewais<sup>2</sup>, Prashant Nair<sup>1</sup> and Mieszko Lis<sup>1</sup><br /> <sup>1</sup>University of British Columbia, CA; <sup>2</sup>UofT, CA<br /> <em><b>Abstract</b><br /> The importance of caches for performance, together with their high silicon area cost, has led to an interest in hardware solutions that transparently compress the cached data to increase effective capacity without sacrificing silicon area. Work to date has taken one of two tacks: either (a) deduplicating identical cache blocks across the cache to take advantage of inter-block redundancy, or (b) identifying and compressing common patterns within each cache block to take advantage of intra-block redundancy. In this paper, we demonstrate that leveraging only one of these redundancy types leads to a significant loss of compression opportunities in many applications: some workloads exhibit either inter-block or intra-block redundancy, while others exhibit both. We propose 2DCC, a simple technique that takes advantage of both types of redundancy. Across the SPEC and Parsec benchmark suites, 2DCC results in a 2.12× compression factor (geomean) compared to 1.44-1.49× for the best prior techniques on an iso-silicon basis.
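</em> <p>The two dimensions compose naturally, as the following miniature Python sketch shows: identical blocks are first deduplicated across the cache (inter-block redundancy), and each unique block is then pattern-compressed (intra-block redundancy, here a trivial repeated-byte pattern). This is a conceptual illustration only, not 2DCC's hardware dictionary or encoding:</p> <pre>
# Two-dimensional compression in miniature (conceptual only).
def compress(blocks):
    unique, refs = {}, []
    for blk in blocks:
        key = bytes(blk)
        if key not in unique:                 # dimension 1: deduplication
            if len(set(blk)) == 1:            # dimension 2: pattern encoding
                unique[key] = ("repeat", blk[0], len(blk))
            else:
                unique[key] = ("raw", bytes(blk))
        refs.append(key)                      # per-block pointer into dict
    return unique, refs

blocks = [b"\x00" * 64, b"\x00" * 64, bytes(range(64))]
unique, refs = compress(blocks)
print(len(unique), "unique entries for", len(refs), "blocks")  # 2 for 3
</pre> <em>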
For the cache-sensitive subset of these benchmarks run in isolation, 2DCC also achieves an 11.7% speedup (geomean).</em></td> </tr> <tr> <td>12:00</td> <td>6.4.3</td> <td><b>GRAPHVINE: EXPLOITING MULTICAST FOR SCALABLE GRAPH ANALYTICS</b><br /> <b>Speaker</b>:<br /> Leul Belayneh, University of Michigan, US<br /> <b>Authors</b>:<br /> Leul Belayneh and Valeria Bertacco, University of Michigan, US<br /> <em><b>Abstract</b><br /> The proliferation of graphs as a key data structure for big-data analytics has heightened the demand for efficient graph processing. To meet this demand, prior works have proposed processing-in-memory (PIM) solutions in 3D-stacked DRAMs, such as Hybrid Memory Cubes (HMCs). However, PIM-based architectures, despite considerable improvement over conventional architectures, continue to be hampered by high inter-cube communication traffic. In turn, this trait has kept the underlying processing elements from fully capitalizing on the memory bandwidth an HMC has to offer. In this paper, we show that it is possible to combine multiple messages emitted from a source node into a single multicast message, thus reducing the inter-cube communication without affecting the correctness of the execution. Hence, we propose to add multicast support at the source and at in-network routers to reduce vertex-update traffic. Our experimental evaluation shows that, by combining multiple messages emitted at the source, it is possible to achieve an average speedup of 2.4x over a highly optimized PIM-based solution and to reduce energy consumption by 3.4x, while incurring a modest power overhead of 6.8%.</em></td> </tr> <tr> <td style="width:40px;">12:30</td> <td><a href="#IP3">IP3-2</a>, 855</td> <td><b>ENHANCING MULTITHREADED PERFORMANCE OF ASYMMETRIC MULTICORES WITH SIMD OFFLOADING</b><br /> <b>Speaker</b>:<br /> Antonio Schneider Beck, Universidade Federal do Rio Grande do Sul, BR<br /> <b>Authors</b>:<br /> Jeckson Dellagostin Souza<sup>1</sup>, Madhavan Manivannan<sup>2</sup>, Miquel Pericas<sup>2</sup> and Antonio Carlos Schneider Beck<sup>1</sup><br /> <sup>1</sup>Universidade Federal do Rio Grande do Sul, BR; <sup>2</sup>Chalmers, SE<br /> <em><b>Abstract</b><br /> Single-ISA asymmetric multicore architectures can accelerate multithreaded applications by running code that does not execute concurrently (i.e., the serial region) on a big core and the parallel region on a larger number of smaller cores. Nevertheless, in such architectures the big core still implements resource-expensive application-specific instruction extensions that are rarely used while running the serial region, such as Single Instruction Multiple Data (SIMD) and Floating-Point (FP) operations. In this work, we propose a design in which these extensions are not implemented in the big core, thereby freeing up area and resources to increase the number of small cores in the system and potentially enhance thread-level parallelism (TLP). To address the case when missing instruction extensions are required while running on the big core, we devise an approach to automatically offload these operations to the execution units of the small cores, where the extensions are implemented and can be executed. 
Our evaluation shows that, on average, the proposed architecture provides a 1.76x speedup when compared to a traditional single-ISA asymmetric multicore processor with the same area, for a variety of parallel applications.</em></td> </tr> <tr> <td>12:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="6.5">6.5 Efficient Data Representations in Neural Networks</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 11:00 - 12:30<br /> <b>Location / Room:</b> Bayard</p> <p><b>Chair:</b><br /> Brandon Reagen, Facebook and New York University, US</p> <p><b>Co-Chair:</b><br /> Sebastian Steinhorst, TU Munich, DE</p> <p>The large processing requirements of ML models strain the capabilities of low-power embedded systems. Addressing this challenge, the first presentation proposes a robust co-design that leverages stochastic computing for highly accurate and efficient inference. Next, a structural optimization is proposed to counter faults at low voltage levels. Then, the authors present a method for sharing results in binarized CNNs to reduce computation. The session concludes with a talk on implementing binary networks on mobile GPUs.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>6.5.1</td> <td><b>ACOUSTIC: ACCELERATING CONVOLUTIONAL NEURAL NETWORKS THROUGH OR-UNIPOLAR SKIPPED STOCHASTIC COMPUTING</b><br /> <b>Speaker</b>:<br /> Puneet Gupta, University of California, Los Angeles, US<br /> <b>Authors</b>:<br /> Wojciech Romaszkan, Tianmu Li, Tristan Melton, Sudhakar Pamarti and Puneet Gupta, University of California, Los Angeles, US<br /> <em><b>Abstract</b><br /> As privacy and latency requirements force a move towards edge Machine Learning inference, resource-constrained devices are struggling to cope with large and computationally complex models. For Convolutional Neural Networks, those limitations can be overcome by taking advantage of enormous data reuse opportunities and amenability to reduced precision. To do that, however, a level of compute density unattainable for conventional binary arithmetic is required. Stochastic Computing can deliver such density, but it has not lived up to its full potential because of multiple underlying precision issues. We present ACOUSTIC: Accelerating Convolutions through Or-Unipolar Skipped sTochastIc Computing, an accelerator framework that enables fully stochastic, high-density CNN inference. Leveraging split-unipolar representation, OR-based accumulation and a novel computation-skipping approach, ACOUSTIC delivers server-class parallelism within a mobile area and power budget - a 12 mm2 accelerator can be as much as 38.7x more energy efficient and 72.5x faster than conventional fixed-point accelerators. It can also be up to 79.6x more energy efficient than state-of-the-art stochastic accelerators. 
At the lower end, ACOUSTIC achieves an 8x-120x inference throughput improvement with similar energy and area when compared to recent mixed-signal/neuromorphic accelerators.</em></td> </tr> <tr> <td>11:30</td> <td>6.5.2</td> <td><b>ACCURACY TOLERANT NEURAL NETWORKS UNDER AGGRESSIVE POWER OPTIMIZATION</b><br /> <b>Speaker</b>:<br /> Yi-Wen Hung, National Tsing Hua University, TW<br /> <b>Authors</b>:<br /> Xiang-Xiu Wu<sup>1</sup>, Yi-Wen Hung<sup>1</sup>, Yung-Chih Chen<sup>2</sup> and Shih-Chieh Chang<sup>1</sup><br /> <sup>1</sup>National Tsing Hua University, TW; <sup>2</sup>Yuan Ze University, Taoyuan, Taiwan, TW<br /> <em><b>Abstract</b><br /> With the success of deep learning, many neural network models have been proposed and applied to various applications. In several applications, the devices used to implement these complicated models have limited power resources, and thus aggressive optimization techniques are often applied to save power. However, some optimization techniques, such as voltage scaling and multiple threshold voltages, may increase the probability of error occurrence due to slow signal propagation, which increases the path delay in a circuit and fails some input patterns. Although neural network models are considered to have some error tolerance, the prediction accuracy can be significantly affected when there are a large number of errors. Thus, in this paper, we propose a scheme to mitigate the errors caused by slow signal propagation. Since the delay of the multipliers dominates the critical path of the circuit, we identify the patterns significantly altered by slow signal propagation through the multipliers and prevent these patterns from failing by adjusting the neural network and its parameters. The proposed scheme modifies the neural network on the software side, and thus it is unnecessary to re-design the hardware structure. The experimental results show that the proposed scheme is effective for several neural network models. It can improve the accuracy by up to 27% when aggressive power optimization techniques are applied to the device under consideration.</em></td> </tr> <tr> <td>12:00</td> <td>6.5.3</td> <td><b>A CONVOLUTIONAL RESULT SHARING APPROACH FOR BINARIZED NEURAL NETWORK INFERENCE</b><br /> <b>Speaker</b>:<br /> Chia-Chun Lin, National Tsing Hua University, TW<br /> <b>Authors</b>:<br /> Ya-Chun Chang<sup>1</sup>, Chia-Chun Lin<sup>1</sup>, Yi-Ting Lin<sup>1</sup>, Yung-Chih Chen<sup>2</sup> and Chun-Yao Wang<sup>1</sup><br /> <sup>1</sup>National Tsing Hua University, TW; <sup>2</sup>Yuan Ze University, TW<br /> <em><b>Abstract</b><br /> The binary-weight-binary-input binarized neural network (BNN) allows a much more efficient way to implement convolutional neural networks (CNNs) on mobile platforms. During inference, the multiply-accumulate operations in BNNs can be reduced to XNOR-popcount operations, and thus the XNOR-popcount operations dominate most of the computation in BNNs. To reduce the number of required operations in the convolution layers of BNNs, we decompose 3-D filters into 2-D filters and exploit repeated filters, inverse filters, and similar filters to share results. By sharing the results, the number of operations in the convolution layers of BNNs can be reduced effectively. 
Experimental results show that the number of operations can be reduced by about 60% for CIFAR-10 on BNNs while keeping the accuracy loss within 1% of the originally trained network.</em></td> </tr> <tr> <td>12:15</td> <td>6.5.4</td> <td><b>PHONEBIT: EFFICIENT GPU-ACCELERATED BINARY NEURAL NETWORK INFERENCE ENGINE FOR MOBILE PHONES</b><br /> <b>Speaker</b>:<br /> Gang Chen, Sun Yat-sen University, CN<br /> <b>Authors</b>:<br /> Gang Chen<sup>1</sup>, Shengyu He<sup>2</sup>, Haitao Meng<sup>2</sup> and Kai Huang<sup>1</sup><br /> <sup>1</sup>Sun Yat-sen University, CN; <sup>2</sup>Northeastern University, CN<br /> <em><b>Abstract</b><br /> In recent years, deep neural networks (DNNs) have achieved great success in computer vision and other fields. However, performance and power constraints still make it challenging to deploy DNNs on mobile devices due to their high computational complexity. Binary neural networks (BNNs) have been demonstrated as a promising solution to this challenge, using bit-wise operations to replace most arithmetic operations. Currently, existing GPU-accelerated implementations of BNNs are tailored only for desktop platforms. Due to architecture differences, merely porting such implementations to mobile devices yields suboptimal performance or is impossible in some cases. In this paper, we propose PhoneBit, a GPU-accelerated BNN inference engine for Android-based mobile devices that fully exploits the computing power of BNNs on mobile GPUs. PhoneBit provides a set of operator-level optimizations, including a locality-friendly data layout, bit packing with vectorization, and layer integration for efficient binary convolution. We also provide a detailed implementation and parallelization optimization for PhoneBit to optimally utilize the memory bandwidth and computing power of mobile GPUs. We evaluate PhoneBit with binary versions of AlexNet, YOLOv2 Tiny and VGG16. Our experimental results show that PhoneBit achieves significant speedup and energy efficiency compared with state-of-the-art frameworks for mobile devices.</em></td> </tr> <tr> <td style="width:40px;">12:30</td> <td><a href="#IP3">IP3-3</a>, 140</td> <td><b>HARDWARE ACCELERATION OF CNN WITH ONE-HOT QUANTIZATION OF WEIGHTS AND ACTIVATIONS</b><br /> <b>Speaker</b>:<br /> Gang Li, Chinese Academy of Sciences, CN<br /> <b>Authors</b>:<br /> Gang Li, Peisong Wang, Zejian Liu, Cong Leng and Jian Cheng, Chinese Academy of Sciences, CN<br /> <em><b>Abstract</b><br /> In this paper, we propose a novel one-hot representation for weights and activations in CNN models and demonstrate its benefits for hardware accelerator design. Specifically, rather than merely reducing the bitwidth, we quantize both weights and activations into n-bit integers that contain only one non-zero bit per value. In this way, the massive multiply-accumulates (MACs) are equivalent to additions of powers of two, which can be efficiently calculated with histogram-based computations. Experiments on the ImageNet classification task show that accuracy comparable to conventional fixed-point networks can be obtained with our proposed One-Hot Networks (OHN). As case studies, we evaluate the efficacy of the one-hot data representation on two state-of-the-art CNN accelerators on FPGA; our preliminary results show that 50% and 68.5% resource savings can be achieved on DaDianNao and Laconic, respectively. 
In addition, the one-hot optimized Laconic achieves an average speedup of 4.94x on AlexNet.</em></td> </tr> <tr> <td style="width:40px;">12:31</td> <td><a href="#IP3">IP3-4</a>, 729</td> <td><b>BNNSPLIT: BINARIZED NEURAL NETWORKS FOR EMBEDDED DISTRIBUTED FPGA-BASED COMPUTING SYSTEMS</b><br /> <b>Speaker</b>:<br /> Luca Stornaiuolo, Politecnico di Milano, IT<br /> <b>Authors</b>:<br /> Giorgia Fiscaletti, Marco Speziali, Luca Stornaiuolo, Marco D. Santambrogio and Donatella Sciuto, Politecnico di Milano, IT<br /> <em><b>Abstract</b><br /> In the past few years, Convolutional Neural Networks (CNNs) have seen massive improvement, outperforming other visual recognition algorithms. Since they play an increasingly important role in fields such as face recognition, augmented reality and autonomous driving, there is a growing need for fast and efficient systems to perform the redundant and heavy computations of CNNs. This trend has led researchers towards heterogeneous systems provided with hardware accelerators, such as GPUs and FPGAs. The vast majority of CNNs are implemented with floating-point parameters and operations, but research has shown that high classification accuracy can also be obtained by reducing the floating-point activations and weights to binary values. This context is well suited to FPGAs, which are known to stand out in terms of performance when dealing with binary operations, as demonstrated by Finn, the state-of-the-art framework for building Binarized Neural Network (BNN) accelerators on FPGAs. In this paper, we propose a framework that extends Finn to a distributed scenario, enabling BNN implementation on embedded multi-FPGA systems.</em></td> </tr> <tr> <td style="width:40px;">12:32</td> <td><a href="#IP3">IP3-5</a>, 147</td> <td><b>L2L: A HIGHLY ACCURATE LOG_2_LEAD QUANTIZATION OF PRE-TRAINED NEURAL NETWORKS</b><br /> <b>Speaker</b>:<br /> Salim Ullah, TU Dresden, DE<br /> <b>Authors</b>:<br /> Salim Ullah<sup>1</sup>, Siddharth Gupta<sup>2</sup>, Kapil Ahuja<sup>2</sup>, Aruna Tiwari<sup>2</sup> and Akash Kumar<sup>1</sup><br /> <sup>1</sup>TU Dresden, DE; <sup>2</sup>IIT Indore, IN<br /> <em><b>Abstract</b><br /> Deep neural networks are among the machine learning techniques increasingly used in a variety of applications. However, their significantly high memory and computation demands often limit their deployment on embedded systems. Many recent works have addressed this problem by proposing different types of data quantization schemes. However, most of these techniques either require post-quantization retraining of deep neural networks or incur a significant loss in output accuracy. In this paper, we propose a novel quantization technique for the parameters of pre-trained deep neural networks. Our technique maintains the accuracy of the parameters and does not require retraining of the networks. 
Compared to a single-precision floating-point implementation, our proposed 8-bit quantization technique incurs only ∼1% and ∼0.4% loss in the top-1 and top-5 accuracies, respectively, for the VGG16 network on the ImageNet dataset.</em></td> </tr> <tr> <td>12:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="6.6">6.6 From DFT to Yield Optimization</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 11:00 - 12:30<br /> <b>Location / Room:</b> Lesdiguières</p> <p><b>Chair:</b><br /> Maria Michael, University of Cyprus, CY</p> <p><b>Co-Chair:</b><br /> Ernesto Sanchez, Politecnico di Torino, IT</p> <p>The session presents a variety of semiconductor test techniques, including a new design-for-testability scheme for FinFET SRAMs, a method to increase yield based on error-metric-independent signature analysis, and a synthesis method for fault-tolerant reconfigurable scan networks.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>6.6.1</td> <td><b>A DFT SCHEME TO IMPROVE COVERAGE OF HARD-TO-DETECT FAULTS IN FINFET SRAMS</b><br /> <b>Speaker</b>:<br /> Guilherme Cardoso Medeiros, TU Delft, NL<br /> <b>Authors</b>:<br /> Guilherme Cardoso Medeiros<sup>1</sup>, Cemil Cem Gürsoy<sup>2</sup>, Moritz Fieback<sup>1</sup>, Lizhou Wu<sup>1</sup>, Maksim Jenihhin<sup>2</sup>, Mottaqiallah Taouil<sup>1</sup> and Said Hamdioui<sup>1</sup><br /> <sup>1</sup>TU Delft, NL; <sup>2</sup>Tallinn University of Technology, EE<br /> <em><b>Abstract</b><br /> Manufacturing defects can cause faults in FinFET SRAMs. Of these, easy-to-detect (ETD) faults always cause incorrect behavior and are therefore easily detected by applying sequences of write and read operations. However, hard-to-detect (HTD) faults may not cause incorrect behavior, only parametric deviations. Detection of these faults is of major importance, as they may lead to test escapes. This paper proposes a new design-for-testability (DFT) scheme for FinFET SRAMs to detect such faults by creating a mismatch in the sense amplifier (SA). This mismatch, combined with the defect in the cell, will incorrectly bias the SA and cause incorrect read outputs. Furthermore, post-silicon calibration schemes can be used to avoid over-testing or test escapes caused by process variation effects. Compared to the state of the art, this scheme introduces negligible overheads in area and test time while significantly improving fault coverage and reducing the number of test escapes.</em></td> </tr> <tr> <td>11:30</td> <td>6.6.2</td> <td><b>SYNTHESIS OF FAULT-TOLERANT RECONFIGURABLE SCAN NETWORKS</b><br /> <b>Speaker</b>:<br /> Sebastian Brandhofer, University of Stuttgart, DE<br /> <b>Authors</b>:<br /> Sebastian Brandhofer, Michael Kochte and Hans-Joachim Wunderlich, University of Stuttgart, DE<br /> <em><b>Abstract</b><br /> On-chip instrumentation is mandatory for efficient bring-up, test and diagnosis, post-silicon validation, as well as in-field calibration, maintenance, and fault tolerance. Reconfigurable scan networks (RSNs) provide a scalable and efficient scan-based access mechanism to such instruments. The correct operation of this access mechanism is crucial for all manufacturing, bring-up and debug tasks as well as for in-field operation, but it can be affected by faults and design errors. 
This work develops, for the first time, fault-tolerant RSNs such that the resulting scan network still provides access to as many instruments as possible in the presence of a fault. The work contributes a model and an algorithm to compute scan paths in faulty RSNs, a metric to quantify fault tolerance, and a synthesis algorithm based on graph connectivity and selective hardening of control logic in the scan network. Experimental results demonstrate that fault-tolerant RSNs can be synthesized with only moderate hardware overhead.</em></td> </tr> <tr> <td>12:00</td> <td>6.6.3</td> <td><b>USING PROGRAMMABLE DELAY MONITORS FOR WEAR-OUT AND EARLY LIFE FAILURE PREDICTION</b><br /> <b>Speaker</b>:<br /> Chang Liu, Altran Deutschland, DE<br /> <b>Authors</b>:<br /> Chang Liu, Eric Schneider and Hans-Joachim Wunderlich, University of Stuttgart, DE<br /> <em><b>Abstract</b><br /> Early life failures in marginal devices are a severe reliability threat in current nano-scaled CMOS devices. While small delay faults are an effective indicator of marginalities, their detection requires special efforts in testing by so-called Faster-than-At-Speed Test (FAST). In a similar way, delay degradation is an indicator that a device is reaching the wear-out phase due to aging. Programmable delay monitors make it possible to detect gradual performance changes in a system and to observe device degradation. This paper presents a unified approach to test small delay faults related to wear-out and early-life failures by reusing existing programmable delay monitors within FAST. The approach is complemented by a test schedule that optimally selects frequencies and delay configurations to significantly increase the fault coverage of small delays and to reduce the test time.</em></td> </tr> <tr> <td>12:15</td> <td>6.6.4</td> <td><b>MAXIMIZING YIELD FOR APPROXIMATE INTEGRATED CIRCUITS</b><br /> <b>Speaker</b>:<br /> Marcello Traiola, Université de Montpellier, FR<br /> <b>Authors</b>:<br /> Marcello Traiola<sup>1</sup>, Arnaud Virazel<sup>1</sup>, Patrick Girard<sup>2</sup>, Mario Barbareschi<sup>3</sup> and Alberto Bosio<sup>4</sup><br /> <sup>1</sup>LIRMM, FR; <sup>2</sup>LIRMM / CNRS, FR; <sup>3</sup>Università di Napoli Federico II, IT; <sup>4</sup>Lyon Institute of Nanotechnology, FR<br /> <em><b>Abstract</b><br /> Approximate Integrated Circuits (AxICs) have emerged in the last decade as an outcome of the Approximate Computing (AxC) paradigm. AxC improves the efficiency of computing systems by sacrificing some computation quality. As AxICs have spread, new challenges in testing them have arisen. At the same time, an opportunity to increase production yield has emerged in the AxIC context: some particular defects in a manufactured AxIC might not catastrophically impact the final circuit quality, so some defective AxICs might still be acceptable. Efforts to detect favorable conditions under which defective AxICs can be considered acceptable - with the goal of increasing production yield - have been made in recent years. Unfortunately, the yield gain finally achieved is often not as high as expected. In this work, we propose a methodology to achieve a yield gain as close as possible to expectations, based on a technique to suitably apply tests to AxICs. 
Experiments carried out on state-of-the-art AxICs show yield gain results very close to the expected ones (i.e., between 98% and 100% of expectations).</em></td> </tr> <tr> <td style="width:40px;">12:30</td> <td><a href="#IP3">IP3-6</a>, 359</td> <td><b>FAULT DIAGNOSIS OF VIA-SWITCH CROSSBAR IN NON-VOLATILE FPGA</b><br /> <b>Speaker</b>:<br /> Ryutaro Doi, Osaka University, JP<br /> <b>Authors</b>:<br /> Ryutaro Doi<sup>1</sup>, Xu Bai<sup>2</sup>, Toshitsugu Sakamoto<sup>2</sup> and Masanori Hashimoto<sup>1</sup><br /> <sup>1</sup>Osaka University, JP; <sup>2</sup>NEC Corporation, JP<br /> <em><b>Abstract</b><br /> FPGAs that exploit via-switches, a kind of non-volatile resistive RAM, for crossbar implementation are attracting attention due to their high integration density and energy efficiency. The via-switch crossbar is responsible for signal routing by changing the on/off-states of via-switches. To verify via-switch crossbar functionality after manufacturing, fault testing that checks whether via-switches can be turned on/off normally is essential. This paper confirms that a general differential-pair comparator successfully discriminates the on/off-states of via-switches, and it clarifies the fault modes of a via-switch by transistor-level SPICE simulation that injects stuck-on/off faults into the atom switches and varistors, where a via-switch consists of two atom switches and two varistors. We then propose a fault diagnosis methodology that identifies the fault modes of each via-switch using the difference in comparator response between normal and faulty via-switches. The proposed method achieves 100% fault detection by checking the comparator responses after turning the via-switch on/off. When a via-switch contains a single faulty component, the diagnosis exactly identifies the faulty varistor or atom switch inside the via-switch in 100% of cases; with up to two faulty components, the diagnosis ratio is 79%.</em></td> </tr> <tr> <td>12:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="6.7">6.7 Safety and efficiency for smart automotive and energy systems</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 11:00 - 12:30<br /> <b>Location / Room:</b> Berlioz</p> <p><b>Chair:</b><br /> Selma Saidi, TU Dortmund, DE</p> <p><b>Co-Chair:</b><br /> Donghwa Shin, Soongsil University, KR</p> <p>This session presents four papers dealing with various aspects of smart automotive and energy systems, including safety and efficiency of photovoltaic panels, deterministic execution behavior of adaptive automotive applications, efficient implementation of fail-operational automated vehicles, and efficient resource usage in networked automotive systems.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>6.7.1</td> <td><b>A DIODE-AWARE MODEL OF PV MODULES FROM DATASHEET SPECIFICATIONS</b><br /> <b>Speaker</b>:<br /> Sara Vinco, Politecnico di Torino, IT<br /> <b>Authors</b>:<br /> Sara Vinco, Yukai Chen, Enrico Macii and Massimo Poncino, Politecnico di Torino, IT<br /> <em><b>Abstract</b><br /> Semi-empirical models of photovoltaic (PV) modules based only on datasheet information are popular in electrical energy systems (EES) simulation because they can be built without measurements and allow quick exploration of alternative devices. 
One key limitation of these models, however, is that they cannot model the presence of bypass diodes, which are inserted across a set of series-connected cells in a PV module to mitigate the impact of partial shading; datasheet information in fact refers to the operation of the module under uniform irradiance. Neglecting the effect of bypass diodes may incur significant underestimation of the extracted power. This paper proposes a semi-empirical model of a PV module that exploits the only information about bypass diodes available in a datasheet, i.e., their number: by first downscaling the model to a single PV cell and subsequently upscaling it to the level of a substring and of a module, it takes the diode effect into account as accurately as the datasheet information allows. Experimental results show that, in a typical rooftop PV array, using a diode-agnostic model can significantly underestimate the output power production.</em></td> </tr> <tr> <td>11:30</td> <td>6.7.2</td> <td><b>ACHIEVING DETERMINISM IN ADAPTIVE AUTOSAR</b><br /> <b>Speaker</b>:<br /> Christian Menard, TU Dresden, DE<br /> <b>Authors</b>:<br /> Christian Menard<sup>1</sup>, Andres Goens<sup>1</sup>, Marten Lohstroh<sup>2</sup> and Jeronimo Castrillon<sup>1</sup><br /> <sup>1</sup>TU Dresden, DE; <sup>2</sup>University of California, Berkeley, US<br /> <em><b>Abstract</b><br /> The AUTOSAR Adaptive Platform (AP) is an emerging industry standard that tackles the challenges of modern automotive software design, but it does not provide adequate mechanisms to enforce deterministic execution. This poses profound challenges to the testing and maintenance of the application software, which is particularly problematic for safety-critical applications. In this paper, we analyze the problem of nondeterminism in AP and propose a framework for the design of deterministic automotive software that transparently integrates with the AP communication mechanisms. We illustrate our approach in a case study based on the brake assistant demonstrator application provided by the AUTOSAR consortium. We show that the original implementation is nondeterministic and discuss a deterministic solution based on our framework.</em></td> </tr> <tr> <td>12:00</td> <td>6.7.3</td> <td><b>A FAIL-SAFE ARCHITECTURE FOR AUTOMATED DRIVING</b><br /> <b>Speaker</b>:<br /> Sebastian vom Dorff, DENSO Automotive Deutschland GmbH, DE<br /> <b>Authors</b>:<br /> Sebastian vom Dorff<sup>1</sup>, Bert Böddeker<sup>2</sup>, Maximilian Kneissl<sup>1</sup> and Martin Fränzle<sup>3</sup><br /> <sup>1</sup>DENSO Automotive Deutschland GmbH, DE; <sup>2</sup>Autonomous Intelligent Driving GmbH, DE; <sup>3</sup>Carl von Ossietzky University Oldenburg, DE<br /> <em><b>Abstract</b><br /> The development of autonomous vehicles has gained rapid pace. Along with the promising possibilities of such automated systems, the question of how to ensure their safety arises. With increasing levels of automation, the need for fail-operational systems that do not rely on a back-up driver poses new challenges in system design. In this paper we propose a lightweight architecture addressing the challenge of a verifiable, fail-safe safety implementation for trajectory planning. It offers a distributed design and the ability to comply with the requirements of ISO 26262, while avoiding an overly redundant set-up. 
Furthermore, we show an example with low-level prediction models applied to a real-world situation.</em></td> </tr> <tr> <td>12:15</td> <td>6.7.4</td> <td><b>PRIORITY-PRESERVING OPTIMIZATION OF STATUS QUO ID-ASSIGNMENTS IN CONTROLLER AREA NETWORK</b><br /> <b>Speaker</b>:<br /> Lea Schönberger, TU Dortmund University, DE<br /> <b>Authors</b>:<br /> Sebastian Schwitalla<sup>1</sup>, Lea Schönberger<sup>1</sup> and Jian-Jia Chen<sup>2</sup><br /> <sup>1</sup>TU Dortmund University, DE; <sup>2</sup>TU Dortmund, DE<br /> <em><b>Abstract</b><br /> Controller Area Network (CAN) is the prevailing solution for connecting multiple electronic control units (ECUs) in automotive systems. Every broadcast message on the bus is received by each bus participant and introduces computational overhead to the typically resource-constrained ECUs due to interrupt handling. To reduce this overhead, hardware message filters can be applied. However, since such filters are configured according to the message identifiers (IDs) specified in the system, the filter quality is limited by the nature of the ID-assignment. Although hardware message filters are highly relevant for industrial applications, so far only the optimization of the filter design, but not the related optimization of ID-assignments, has been addressed in the literature. In this work, we explicitly focus on the optimization of message ID-assignments against the background of hardware message filtering. More precisely, we propose an optimization algorithm that transforms a given ID-assignment in such a way that, based on the resulting IDs, the quality of hardware message filters is improved significantly, i.e., the computational overhead introduced to each ECU is minimized, while the priority order of the system remains unchanged. Conducting comprehensive experiments on automotive benchmarks, we show that our proposed algorithm clearly outperforms optimizations based on conventional simulated annealing with respect to both the achieved filter quality and the runtime.</em></td> </tr> <tr> <td style="width:40px;">12:30</td> <td><a href="#IP3">IP3-7</a>, 519</td> <td><b>APPLYING RESERVATION-BASED SCHEDULING TO A µC-BASED HYPERVISOR: AN INDUSTRIAL CASE STUDY</b><br /> <b>Speaker</b>:<br /> Dirk Ziegenbein, Robert Bosch GmbH, DE<br /> <b>Authors</b>:<br /> Dakshina Dasari<sup>1</sup>, Paul Austin<sup>2</sup>, Michael Pressler<sup>1</sup>, Arne Hamann<sup>1</sup> and Dirk Ziegenbein<sup>1</sup><br /> <sup>1</sup>Robert Bosch GmbH, DE; <sup>2</sup>ETAS GmbH, GB<br /> <em><b>Abstract</b><br /> Existing software scheduling mechanisms do not suffice for emerging applications in the automotive space, which have the conflicting needs of performance and predictability. As a concrete case, we consider the ETAS lightweight hypervisor, a commercially viable solution in the automotive industry, deployed on multicore microcontrollers. We describe the architecture of the hypervisor and its current scheduling mechanisms based on Time Division Multiplexing. We then show how Reservation-Based Scheduling (RBS) can be implemented in the ETAS LWHVR to use resources efficiently while also providing freedom from interference, and we explore design choices towards an efficient implementation of such a scheduler. 
With experiments from an industrial use case, we also compare the performance of RBS and the existing scheduler in the hypervisor.</em></td> </tr> <tr> <td style="width:40px;">12:31</td> <td><a href="#IP3">IP3-8</a>, 353</td> <td><b>REAL-TIME ENERGY MONITORING IN IOT-ENABLED MOBILE DEVICES</b><br /> <b>Speaker</b>:<br /> Nitin Shivaraman, TUMCREATE, SG<br /> <b>Authors</b>:<br /> Nitin Shivaraman<sup>1</sup>, Seima Suriyasekaran<sup>1</sup>, Zhiwei Liu<sup>2</sup>, Saravanan Ramanathan<sup>1</sup>, Arvind Easwaran<sup>2</sup> and Sebastian Steinhorst<sup>3</sup><br /> <sup>1</sup>TUMCREATE, SG; <sup>2</sup>Nanyang Technological University, SG; <sup>3</sup>TU Munich, DE<br /> <em><b>Abstract</b><br /> With rapid advancements in the Internet of Things (IoT) paradigm, every electrical device in the near future is expected to have IoT capabilities. This enables fine-grained tracking of the individual energy consumption data of such devices, offering location-independent per-device billing and demand management. It thus abstracts from the location-based metering of state-of-the-art infrastructure, which traditionally aggregates on a building or household level and thereby defines the entity to be billed. However, such in-device energy metering is susceptible to manipulation and fraud. As a remedy, we propose a secure decentralized metering architecture that enables devices with IoT capabilities to measure their own energy consumption. In this architecture, the device-level consumption is additionally reported to a system-level aggregator that verifies distributed information from our decentralized metering systems and provides secure data storage using Blockchain, preventing data manipulation by untrusted entities. Through experimental evaluation, we show that the proposed architecture supports device mobility and enables location-independent monitoring of energy consumption.</em></td> </tr> <tr> <td>12:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="UB06">UB06 Session 6</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 12:00 - 14:00<br /> <b>Location / Room:</b> Booth 11, Exhibition Area</p> <p> </p> <table> <tbody> <tr> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> <tr> <td>UB06.1</td> <td><b>A DIGITAL MICROFLUIDICS BIO-COMPUTING PLATFORM</b><br /> <b>Authors</b>:<br /> Georgi Tanev, Luca Pezzarossa, Winnie Edith Svendsen and Jan Madsen, TU Denmark, DK<br /> <em><b>Abstract</b><br /> Digital microfluidics is a lab-on-a-chip (LOC) technology used to actuate small amounts of liquids on an array of individually addressable electrodes. Microliter-sized droplets can be programmatically dispensed, moved, mixed, and split in a controlled environment, which, combined with miniaturized sensing techniques, makes LOC suitable for a broad range of applications in the field of medical diagnostics and synthetic biology. Furthermore, a programmable digital microfluidics platform holds the potential to add a "fluidic subsystem" to the classical computation model, thus opening the door to cyber-physical bio-processors. To facilitate the programming and operation of such bio-fluidic computing, we propose a dedicated instruction set architecture and virtual machine. 
A set of digital microfluidic core instructions as well as classic computing operations are executed on a virtual machine, which decouples the protocol execution from the LOC functionality.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3103.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB06.2</td> <td><b>ELSA: EIGENVALUE BASED HYBRID LINEAR SYSTEM ABSTRACTION: BEHAVIORAL MODELING OF TRANSISTOR-LEVEL CIRCUITS USING AUTOMATIC ABSTRACTION TO HYBRID AUTOMATA</b><br /> <b>Authors</b>:<br /> Ahmad Tarraf and Lars Hedrich, University of Frankfurt, DE<br /> <em><b>Abstract</b><br /> Model abstraction of transistor-level circuits, while preserving accurate behavior, is still an open problem. This demo presents an approach that automatically generates a hybrid automaton (HA) with linear states from an existing circuit netlist. The approach starts with a netlist at transistor level with full SPICE accuracy and ends at a system-level description of the circuit in MATLAB or in Verilog-A. The resulting hybrid automaton exhibits linear behavior as well as the technology-dependent nonlinear behavior, e.g. limiting. The accuracy and speed-up of the generated Verilog-A models are evaluated on several transistor-level circuit abstractions, from simple operational amplifiers up to complex filters. Moreover, we verify the equivalence between the generated model and the original circuit. For the generated models in MATLAB syntax, a reachability analysis is performed using the reachability tool CORA.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3097.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB06.3</td> <td><b>VIRTUAL PLATFORMS FOR COMPLEX SOFTWARE STACKS</b><br /> <b>Authors</b>:<br /> Lukas Jünger and Rainer Leupers, RWTH Aachen University, DE<br /> <em><b>Abstract</b><br /> This demonstration showcases our "AVP64" Virtual Platform (VP), which models a multi-core ARMv8 (Cortex-A72) system including several peripherals, such as an SDHCI and an Ethernet controller. For the ARMv8 instruction set simulation, a solution based on dynamic binary translation is used. As the workload, the Xen hypervisor with two Linux Virtual Machines (VMs) is executed. Both VMs are connected to the simulation host's network subsystem via a virtual Ethernet controller. One of the VMs executes a NodeJS-based server application offering a REST API via this network connection. An AngularJS client application on the host system can then connect to the server application to obtain and store data via the server's REST API. This data is read and written by the server application to the virtual SD card connected to the SDHCI. For this, one SD card partition is passed to the VM through Xen's block device virtualization mechanism.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3099.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB06.4</td> <td><b>SYSTEMC-CT/DE: A SIMULATOR WITH FAST AND ACCURATE CONTINUOUS TIME AND DISCRETE EVENTS INTERACTIONS ON TOP OF SYSTEMC</b><br /> <b>Authors</b>:<br /> Breytner Joseph Fernandez-Mesa, Liliana Andrade and Frédéric Pétrot, Université Grenoble Alpes / CNRS / TIMA Laboratory, FR<br /> <em><b>Abstract</b><br /> We have developed a continuous time (CT) and discrete events (DE) simulator on top of SystemC. Systems that mix both domains are critical, and their proper functioning must be verified. 
Simulation serves to achieve this goal. Our simulator implements direct CT/DE synchronization, which enables a rich set of interactions between the domains: events from the CT models are able to trigger DE processes; events from the DE models are able to modify the CT equations. DE-based interactions are then simulated at their precise time by the DE kernel rather than at fixed time steps. We demonstrate our simulator by executing a set of challenging examples: they either require a superdense model of time, include Zeno behavior, or are highly sensitive to accuracy errors. Results show that our simulator overcomes these issues, is accurate, and improves simulation speed w.r.t. fixed time steps; all of these advantages open up new possibilities for the design of a wider set of heterogeneous systems.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3110.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB06.6</td> <td><b>SRSN: SECURE RECONFIGURABLE TEST NETWORK</b><br /> <b>Authors</b>:<br /> Vincent Reynaud<sup>1</sup>, Emanuele Valea<sup>2</sup>, Paolo Maistri<sup>1</sup>, Regis Leveugle<sup>1</sup>, Marie-Lise Flottes<sup>2</sup>, Sophie Dupuis<sup>2</sup>, Bruno Rouzeyre<sup>2</sup> and Giorgio Di Natale<sup>1</sup><br /> <sup>1</sup>TIMA Laboratory, FR; <sup>2</sup>LIRMM, FR<br /> <em><b>Abstract</b><br /> The critical importance of testability for electronic devices led to the development of IEEE test standards. These methods, if not protected, offer a security backdoor to attackers. This demonstrator illustrates a state-of-the-art solution that prevents unauthorized usage of the test infrastructure, based on the IEEE 1687 standard and implemented on an FPGA target.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3112.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB06.7</td> <td><b>GENERATING ASYNCHRONOUS CIRCUITS FROM CATAPULT</b><br /> <b>Authors</b>:<br /> Yoan Decoudu<sup>1</sup>, Jean Simatic<sup>2</sup>, Katell Morin-Allory<sup>3</sup> and Laurent Fesquet<sup>3</sup><br /> <sup>1</sup>University Grenoble Alpes, FR; <sup>2</sup>HawAI.Tech, FR; <sup>3</sup>Université Grenoble Alpes, FR<br /> <em><b>Abstract</b><br /> In order to spread asynchronous circuit design to a large community of designers, High-Level Synthesis (HLS) is probably a good choice, because it requires limited technical design skills. HLS usually provides an RTL description, which includes a data-path and a control-path. The desynchronization process is applied only to the control-path, which is a Finite State Machine (FSM). This method is sufficient to make the circuit asynchronous. Indeed, data are processed step by step in the pipeline stages, thanks to the desynchronized FSM. Thus, the data-path computation time is no longer related to the clock period but rather to the average time for processing data through the pipeline. This tends to improve speed when the pipeline stages are not well-balanced. 
Moreover, our approach helps to quickly design data-driven circuits while maintaining a reasonable cost, a similar area and a short time-to-market.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3118.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB06.8</td> <td><b>LEARNV: A RISC-V BASED EMBEDDED SYSTEM DESIGN FRAMEWORK FOR EDUCATION AND RESEARCH DEVELOPMENT</b><br /> <b>Authors</b>:<br /> Noureddine Ait Said and Mounir Benabdenbi, TIMA Laboratory, FR<br /> <em><b>Abstract</b><br /> Designing a modern System on a Chip is based on the joint design of hardware and software (co-design). However, understanding the tight relationship between hardware and software is not straightforward. Moreover, validating new concepts in SoC design from the idea to the hardware implementation is time-consuming and often slowed by legacy issues (intellectual property of hardware blocks and expensive commercial tools). To overcome these issues, we propose to use the open-source Rocket Chip environment for educational purposes, combined with the open-source lowRISC architecture to implement a custom SoC design on an FPGA board. The demonstration will present how students and engineers can benefit from the environment to deepen their knowledge in HW and SW co-design. Using the lowRISC architecture, an image classification application based on the use of CNNs will serve as a demonstrator of the whole open-source hardware and software flow and will be mapped on a Nexys A7 FPGA board.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3116.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB06.9</td> <td><b>WALLANCE: AN ALTERNATIVE TO BLOCKCHAIN FOR IOT</b><br /> <b>Authors</b>:<br /> Loic Dalmasso, Florent Bruguier, Pascal Benoit and Achraf Lamlih, Université de Montpellier, FR<br /> <em><b>Abstract</b><br /> Since the expansion of the Internet of Things (IoT), connected devices have become smart and autonomous. Their exponentially increasing number and their use in many application domains result in a huge potential for cybersecurity threats. Given the evolution of the IoT, security and interoperability are the main challenges in ensuring the reliability of information. Blockchain technology provides a new approach to handling trust in a decentralized network. However, current blockchain implementations cannot be used in the IoT domain because of their huge demands on computing power and storage. 
This demonstrator presents a lightweight distributed ledger protocol dedicated to IoT applications, reducing computing power and storage utilization, handling scalability, and ensuring the reliability of information.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3119.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB06.10</td> <td><b>JOINTER: JOINING FLEXIBLE MONITORS WITH HETEROGENEOUS ARCHITECTURES</b><br /> <b>Authors</b>:<br /> Giacomo Valente<sup>1</sup>, Tiziana Fanni<sup>2</sup>, Carlo Sau<sup>3</sup>, Claudio Rubattu<sup>2</sup>, Francesca Palumbo<sup>2</sup> and Luigi Pomante<sup>1</sup><br /> <sup>1</sup>Università degli Studi dell'Aquila, IT; <sup>2</sup>Università degli Studi di Sassari, IT; <sup>3</sup>Università degli Studi di Cagliari, IT<br /> <em><b>Abstract</b><br /> As embedded systems grow more complex and shift toward heterogeneous architectures, understanding workload performance characteristics becomes increasingly difficult. In this regard, run-time monitoring systems can help obtain the visibility needed to characterize a system. This demo presents a framework for developing complex heterogeneous architectures composed of programmable processors and dedicated accelerators on FPGA, together with customizable monitoring systems, while keeping the introduced overhead under control. The whole development flow (and the related prototype EDA tools) will be shown: it starts with accelerator creation using a dataflow model, proceeds in parallel with monitoring-system customization using a library of elements, and ends with the final joining of the two. Moreover, a comparison among different monitoring-system functionalities on different architectures developed on a Zynq-7000 SoC will be illustrated.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3124.pdf">More information ...</a></b></em></td> </tr> <tr> <td>14:00</td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="7.0">7.0 LUNCHTIME KEYNOTE SESSION</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 13:45 - 14:20<br /> <b>Location / Room:</b> Amphithéâtre Jean Prouvé</p> <p><b>Chair:</b><br /> Bernabe Linares-Barranco, CSIC, ES</p> <p><b>Co-Chair:</b><br /> Dmitri Strukov, University of California, Santa Barbara, US</p> <p> </p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>13:45</td> <td>7.0.0</td> <td><b>CEDA LUNCHEON ANNOUNCEMENT</b><br /> <b>Author</b>:<br /> David Atienza, EPFL, CH</td> </tr> <tr> <td>13:50</td> <td>7.0.1</td> <td><b>LEVERAGING EMBEDDED INTELLIGENCE IN INDUSTRY: CHALLENGES AND OPPORTUNITIES</b><br /> <b>Author</b>:<br /> Jim Tung, MathWorks Fellow, US<br /> <em><b>Abstract</b><br /> The buzz about AI is deafening. Compelling applications are starting to emerge, dramatically changing the customer service that we experience, the marketing messages that we receive, and some systems we use. But, as organizations decide whether and how to incorporate AI in their systems and services, they must bring together new combinations of specialized knowledge, domain expertise, and business objectives. They must navigate through numerous choices - algorithms, processors, compute placement, data availability, architectural allocation, communications, and more. 
At the same time, they must keep their focus on the applications that will create compelling value for them. In this keynote, Jim Tung looks at the promising opportunities and practical challenges of building AI into our systems and services.</em></td> </tr> <tr> <td>14:20</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="UB07">UB07 Session 7</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 14:00 - 16:00<br /> <b>Location / Room:</b> Booth 11, Exhibition Area</p> <p> </p> <table> <tbody> <tr> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> <tr> <td>UB07.1</td> <td><b>DL PUF ENAU: DEEP LEARNING BASED PHYSICALLY UNCLONABLE FUNCTION ENROLLMENT AND AUTHENTICATION</b><br /> <b>Authors</b>:<br /> Amir Alipour<sup>1</sup>, David Hely<sup>2</sup>, Vincent Beroulle<sup>2</sup> and Giorgio Di Natale<sup>3</sup><br /> <sup>1</sup>Grenoble INP / LCIS, FR; <sup>2</sup>Grenoble INP, FR; <sup>3</sup>CNRS / Grenoble INP / TIMA, FR<br /> <em><b>Abstract</b><br /> Physically Unclonable Functions (PUFs) have been proposed as a potential solution to improve the security of authentication and encryption processes in Cyber Physical Systems. Research on PUFs is actively growing due to their potential of being secure, easily implementable and expandable, while using considerably less energy. In typical use, the low-level device hardware variation is captured per unit for device enrollment in a format called Challenge-Response Pairs (CRPs), then recaptured after the device is deployed and compared with the original for authentication. These enrollment and comparison functions can vary and become more data-demanding for applications that require robustness and resilience to noise. In this demonstration, our aim is to show the potential of using Deep Learning for enrollment and authentication of PUF CRPs. Most importantly, during this demonstration, we will show how this method can save time and storage compared to other classical methods.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3111.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB07.2</td> <td><b>BCFELEAM: BACKFLOW: BACKWARD EDGE CONTROL FLOW ENFORCEMENT FOR LOW END ARM REAL-TIME SYSTEMS</b><br /> <b>Authors</b>:<br /> Cyril Bresch<sup>1</sup>, David Hély<sup>1</sup>, Roman Lysecky<sup>2</sup> and Stephanie Chollet<sup>1</sup><br /> <sup>1</sup>LCIS, FR; <sup>2</sup>University of Arizona, US<br /> <em><b>Abstract</b><br /> The C programming language is one of the most popular languages in embedded system programming. Indeed, C is efficient, lightweight, and can easily meet high-performance and deterministic real-time constraints. However, these assets come at a price: C does not provide extra features for memory safety. As a result, attackers can easily exploit spatial memory vulnerabilities to hijack the execution flow of an application. The demonstration features a real-time connected infusion pump vulnerable to memory attacks. First, we showcase an exploit that remotely takes control of the pump. 
Then, we demonstrate the effectiveness of BackFlow, an LLVM-based compiler extension that enforces control-flow integrity in low-end ARM embedded systems.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3109.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB07.3</td> <td><b>A BINARY TRANSLATION FRAMEWORK FOR AUTOMATED HARDWARE GENERATION</b><br /> <b>Authors</b>:<br /> Nuno Paulino and João Canas Ferreira, INESC TEC / University of Porto, PT<br /> <em><b>Abstract</b><br /> Hardware specialization is an efficient solution for maximizing performance and minimizing energy consumption. This work is based on the automated detection of workloads by analysis of a compiled application, and on the automated generation of specialized hardware modules. We will present the current version of the binary analysis and translation framework. Currently, our implementation is capable of processing ARMv8 and MicroBlaze (32-bit) Executable and Linking Format (ELF) files or instruction traces. The framework can interpret the instructions for these two ISAs and detect different types of instruction patterns. After detection, segments are converted into Control/Data Flow Graph (CDFG) representations exposing the underlying instruction-level parallelism, which we aim to exploit via automated hardware generation. Ongoing work addresses the extraction of cyclical execution traces or static code blocks, and more methods of hardware generation.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3135.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB07.4</td> <td><b>RETINE: A PROGRAMMABLE 3D STACKED VISION CHIP ENABLING LOW LATENCY IMAGE ANALYSIS</b><br /> <b>Authors</b>:<br /> Stéphane Chevobbe<sup>1</sup>, Maria Lepecq<sup>1</sup> and Laurent Millet<sup>2</sup><br /> <sup>1</sup>CEA LIST, FR; <sup>2</sup>CEA-Leti, FR<br /> <em><b>Abstract</b><br /> We have developed and fabricated a 3D stacked imager called RETINE, composed of two layers based on the replication of a programmable 3D tile in a matrix manner, providing a highly parallel programmable architecture. This tile is composed of a 16x16 BSI binned pixel array with associated readout and 16 column ADCs on the first layer, coupled to an efficient SIMD processor of 16 PEs on the second layer. The RETINE prototype achieves high video rates, from 5500 fps in binned mode to 340 fps in full-resolution mode. It operates at 80 MHz with 720 mW power consumption, leading to a power efficiency of 85 GOPS/W. To highlight the capabilities of the RETINE chip, we have developed a demonstration platform with an electronic board embedding a RETINE chip that films rotating disks. Three scenarios are available: high-speed image capture, slow motion, and composed image capture with parallel processing during acquisition.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3113.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB07.5</td> <td><b>UWB ACKATCK: HIJACKING DEVICES IN UWB INDOOR POSITIONING SYSTEMS</b><br /> <b>Authors</b>:<br /> Baptiste Pestourie, Vincent Beroulle and Nicolas Fourty, Université Grenoble Alpes, FR<br /> <em><b>Abstract</b><br /> Various radio-based Indoor Positioning Systems (IPS) have been proposed during the last decade as solutions to GPS inconsistency in indoor environments. 
Among the different radio technologies proposed for this purpose, 802.15.4 Ultra-Wideband (UWB) is by far the most performant, reaching up to 10 cm accuracy with 1000 Hz refresh rates. As a consequence, UWB is a popular technology for applications such as asset tracking in industrial environments and indoor navigation of robots and drones. However, some security flaws in the 802.15.4 standard expose UWB positioning to attacks. In this demonstration, we show how an attacker can exploit a vulnerability in 802.15.4 acknowledgment frames to hijack a device in a UWB positioning system. We demonstrate that, using just one cheap UWB chip, the attacker can take control of the positioning system and generate fake trajectories from a laptop. The results are observed in real time in the 3D engine monitoring the positioning system.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3115.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB07.6</td> <td><b>DESIGN AUTOMATION FOR EXTENDED BURST-MODE AUTOMATA IN WORKCRAFT</b><br /> <b>Authors</b>:<br /> Alex Chan, Alex Yakovlev, Danil Sokolov and Victor Khomenko, Newcastle University, GB<br /> <em><b>Abstract</b><br /> Asynchronous circuits are known for high performance, robustness and low power consumption, which are particularly beneficial for the area of so-called "little digital" controllers, where low latency is crucial. However, asynchronous design is not widely adopted by industry, partially due to the steep learning curve inherent in the complexity of formal specifications, such as Signal Transition Graphs (STGs). In this demo, we promote a class of the Finite State Machine (FSM) model called Extended Burst-Mode (XBM) automata as a practical way to specify many asynchronous circuits. The XBM specification has been automated in the Workcraft toolkit (<a href="https://workcraft.org" title="https://workcraft.org">https://workcraft.org</a>) with elaborate support for state encoding, conditionals and "don't care" signals. Formal verification and logic synthesis of XBM automata are implemented via conversion to the established STG model, reusing existing methods and CAD tools. Tool support for the XBM flow will be demonstrated using several case studies.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3120.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB07.7</td> <td><b>DEEPSENSE-FPGA: FPGA ACCELERATION OF A MULTIMODAL NEURAL NETWORK</b><br /> <b>Authors</b>:<br /> Mehdi Trabelsi Ajili and Yuko Hara-Azumi, Tokyo Institute of Technology, JP<br /> <em><b>Abstract</b><br /> Currently, the Internet of Things and Deep Learning (DL) are merging into one domain and creating outstanding technologies for various classification tasks. Such technologies require complex DL networks that mainly target powerful platforms with rich computing resources, like servers. Therefore, for resource-constrained embedded systems, new challenges of size, performance and power consumption have to be considered, particularly when edge devices handle multimodal data, i.e., different types of real-time sensing data (voice, video, text, etc.). Our ongoing project is focused on DeepSense, a multimodal DL framework combining Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) to process time-series data, such as accelerometer and gyroscope data, to detect human activity. 
We aim to accelerate DeepSense on an FPGA (Xilinx Zynq) in a hardware-software co-design manner. Our demo will show the latest achievements through latency and power consumption evaluations. </em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3131.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB07.8</td> <td><b>SUBRISC+: IMPLEMENTATION AND EVALUATION OF AN EMBEDDED PROCESSOR FOR LIGHTWEIGHT IOT EHEALTH</b><br /> <b>Authors</b>:<br /> Mingyu Yang and Yuko Hara-Azumi, Tokyo Institute of Technology, JP<br /> <em><b>Abstract</b><br /> Although the rapid growth of the Internet of Things (IoT) has enabled new opportunities for eHealth devices, the further development of complex systems is severely constrained by the power and energy supply on battery-powered embedded systems. To address this issue, this work presents a processor design called "SubRISC+" targeting lightweight IoT eHealth. SubRISC+ is a processor design that achieves low power/energy consumption through its unique and compact architecture. As an example of lightweight eHealth applications on SubRISC+, we are working on epileptic seizure detection using the dynamic time warping algorithm, to be deployed on wearable IoT eHealth devices. Simulation results show that a 22% reduction in dynamic power and 50% reductions in leakage power and core area are achieved compared to the Cortex-M0. As ongoing work, the evaluation of a fabricated chip will be completed within the first half of 2020.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3129.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB07.9</td> <td><b>PA-HLS: HIGH-LEVEL ANNOTATION OF ROUTING CONGESTION FOR XILINX VIVADO HLS DESIGNS</b><br /> <b>Authors</b>:<br /> Osama Bin Tariq<sup>1</sup>, Junnan Shan<sup>1</sup>, Luciano Lavagno<sup>1</sup>, Georgios Floros<sup>2</sup>, Mihai Teodor Lazarescu<sup>1</sup>, Christos Sotiriou<sup>2</sup> and Mario Roberto Casu<sup>1</sup><br /> <sup>1</sup>Politecnico di Torino, IT; <sup>2</sup>University of Thessaly, GR<br /> <em><b>Abstract</b><br /> We will demo a novel high-level back-annotation flow that reports routing congestion issues at the C++ source level by analyzing reports from FPGA physical design (Xilinx Vivado) and internal debugging files of the Vivado HLS tool. The flow annotates the C++ source code, identifying likely causes of congestion, e.g., on-chip memories or the DSP units. These shared resources often cause routing problems on FPGAs because they cannot be duplicated by physical design. We demonstrate on realistic large designs how the information provided by our flow can be used to both identify congestion issues at the C++ source level and solve them using HLS directives.
The main demo steps are: (1) extraction of the source-level debugging information from the Vivado HLS database; (2) generation of a list of net names involved in congestion areas, and of their relative significance, from the Vivado post-global-routing database; (3) visualization of the C++ code lines that contribute most to congestion. </em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3123.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB07.10</td> <td><b>MDD-COP: A PRELIMINARY TOOL FOR MODEL-DRIVEN DEVELOPMENT EXTENDED WITH LAYER DIAGRAM FOR CONTEXT-ORIENTED PROGRAMMING</b><br /> <b>Authors</b>:<br /> Harumi Watanabe<sup>1</sup>, Chinatsu Yamamoto<sup>1</sup>, Takeshi Ohkawa<sup>1</sup>, Mikiko Sato<sup>1</sup>, Nobuhiko Ogura<sup>2</sup> and Mana Tabei<sup>1</sup><br /> <sup>1</sup>Tokai University, JP; <sup>2</sup>Tokyo City University, JP<br /> <em><b>Abstract</b><br /> This presentation introduces a preliminary tool for Model-Driven Development (MDD) that generates programs for Context-Oriented Programming (COP). In modern embedded systems such as IoT and Industry 4.0, software has begun to provide multiple services that follow changes in the surrounding environment. COP is helpful for programming such software: the surrounding environments and multiple services can be considered as contexts and layers. Even though MDD is a powerful technique for developing such modern systems, work on modeling for COP is limited, and no prior work addresses the relation between UML (Unified Modeling Language) and COP. To solve this problem, we provide COP code generation from a layer diagram that extends the UML package diagram with stereotypes. In our approach, users draw a layer diagram and other UML diagrams; then xtUML, a major MDD tool, generates XML code with layer information for COP; finally, our tool generates COP code from the XML code.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3130.pdf">More information ...</a></b></em></td> </tr> <tr> <td>16:00</td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="7.1">7.1 Special Day on "Embedded AI": Industry AI chips</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 14:30 - 16:00<br /> <b>Location / Room:</b> Amphithéâtre Jean Prouvé</p> <p><b>Chair:</b><br /> Tobi Delbrück, ETH Zurich, CH</p> <p><b>Co-Chair:</b><br /> Bernabe Linares-Barranco, CSIC, ES</p> <p>This session on Industry AI chips will present examples of companies developing actual products for AI hardware solutions, a highly competitive market full of new challenges.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>7.1.1</td> <td><b>OPPORTUNITIES FOR ANALOG ACCELERATION OF DEEP LEARNING WITH PHASE CHANGE MEMORY</b><br /> <b>Authors</b>:<br /> Pritish Narayanan, Geoffrey W. Burr, Stefano Ambrogio, Hsinyu Tsai, Charles Mackin, Katherine Spoon, An Chen, Alexander Friz and Andrea Fasoli, IBM Research, US<br /> <em><b>Abstract</b><br /> Storage class memory and high bandwidth memory technologies are already reshaping systems architecture in interesting ways, by bringing cheap and high-density memory closer and closer to processing. Extrapolating on this trend, a new class of in-memory computing solutions is emerging, where some or all of the computing happens at the location of the data.
Within the landscape of in-memory computing approaches, non-von Neumann architectures seek to eliminate most of the data movement associated with computing, eliminating the demarcation between compute and memory. While such non-von Neumann architectures could offer orders of magnitude performance improvements on certain workloads, they are neither as general-purpose nor as easily programmable as von Neumann architectures. Therefore, well-defined use cases need to exist to justify the hardware investment. Fortunately, acceleration of deep learning, which is both compute- and memory-intensive, is one such use case. Today, the training of deep learning networks is done primarily in the cloud and could take days or weeks even when using many GPUs. Specialized hardware for training is thus primarily focused on speedup, with energy/power a secondary concern. On the other hand, 'inference', the deployment and use of pre-trained models for real-world tasks, is done both in the cloud and on edge devices and presents hardware opportunities at both high-speed and low-power design points. In this presentation, we describe some of the opportunities and challenges in building accelerators for deep learning using analog volatile and non-volatile memory. We review our group's recent progress towards achieving software-equivalent accuracies on deep learning tasks in the presence of real-device imperfections such as non-linearity, asymmetry, variability and conductance drift. We will present some novel techniques and optimizations across device, circuit, and neural network design to achieve high accuracy with existing devices. We will then discuss challenges for peripheral circuit design and conclude by providing an outlook on the prospects for analog memory-based DNN accelerators.</em></td> </tr> <tr> <td>14:52</td> <td>7.1.2</td> <td><b>EVENT-BASED AI FOR AUTOMOTIVE AND IOT</b><br /> <b>Speaker</b>:<br /> Etienne Perot, Prophesee, FR<br /> <b>Author</b>:<br /> Etienne Perot, Prophesee, FR<br /> <em><b>Abstract</b><br /> Event cameras are a new type of sensor encoding visual information in the form of asynchronous events. An event corresponds to a change in the log-luminosity intensity at a given pixel location. Compared to standard frame cameras, event cameras have higher temporal resolution, higher dynamic range and lower power consumption. Thanks to these characteristics, event cameras find many applications in automotive and IoT, where low latency, robustness to challenging lighting conditions and low power consumption are critical requirements. In this talk we present recent advances in artificial intelligence applied to event cameras. In particular, we discuss how to adapt deep learning methods to work on events and their advantages compared to conventional frame-based methods. The presentation will be illustrated by results on object detection in automotive and IoT scenarios, running in real time on mobile platforms.</em></td> </tr> <tr> <td>15:14</td> <td>7.1.3</td> <td><b>NEURONFLOW: A NEUROMORPHIC PROCESSOR ARCHITECTURE FOR LIVE AI APPLICATIONS</b><br /> <b>Speaker</b>:<br /> Orlando Moreira, GrAI Matter Labs, NL<br /> <b>Authors</b>:<br /> Orlando Moreira, Amirreza Yousefzadeh, Gokturk Cinserin, Rik-Jan Zwartenkot, Ajay Kapoor, Fabian Chersi, Peng Qiao, Peter Kievits, Mina Khoei, Louis Rouillard, Ashoka Visweswara and Jonathan Tapson, GrAI Matter Labs, NL<br /> <em><b>Abstract</b><br /> This paper gives an overview of the Neuronflow many-core architecture.
It is a neuromorphic data flow architecture that exploits brain-inspired concepts to deliver a scalable event-based processing engine for neural networks in Live AI applications at the edge. Its design is inspired by brain biology, but not necessarily biologically plausible. The main design goal is the exploitation of sparsity to dramatically reduce latency and power consumption, as required by sensor processing at the edge.</em></td> </tr> <tr> <td>15:36</td> <td>7.1.4</td> <td><b>SPECK - SUB-MW SMART VISION SENSOR FOR MOBILE IOT APPLICATIONS</b><br /> <b>Author</b>:<br /> Ning Qiao, aiCTX, CH<br /> <em><b>Abstract</b><br /> Speck is the first available neuromorphic smart vision sensor system-on-chip (SoC), which combines neuromorphic vision sensing and neuromorphic computation on a single die, for mW vision processing. The DVS pixel array is coupled directly to a new fully-asynchronous event-driven spiking CNN processor for highly compact and energy-efficient dynamic visual processing. Speck supports a wide range of potential applications, spanning industrial and consumer-facing use cases.</em></td> </tr> <tr> <td>16:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="7.2">7.2 Reconfigurable Systems and Architectures</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 14:30 - 16:00<br /> <b>Location / Room:</b> Chamrousse</p> <p><b>Chair:</b><br /> Christian Pilato, Politecnico di Milano, IT</p> <p><b>Co-Chair:</b><br /> Philippe Coussy, University Bretagne Sud / Lab-STICC, FR</p> <p>Reconfigurable technologies are evolving at the device, architecture, and system levels, from embedded computation to server-based accelerator integration. In this session we explore ideas at these levels, discussing architectural features for power optimisation of CGRAs, a framework for integrating FPGA accelerators in serverless environments, and placement strategies on alternative FPGA device technologies.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>7.2.1</td> <td><b>A FRAMEWORK FOR ADDING LOW-OVERHEAD, FINE-GRAINED POWER DOMAINS TO CGRAS</b><br /> <b>Speaker</b>:<br /> Ankita Nayak, Stanford University, US<br /> <b>Authors</b>:<br /> Ankita Nayak, Keyi Zhang, Raj Setaluri, Alex Carsello, Makai Mann, Stephen Richardson, Rick Bahr, Pat Hanrahan, Mark Horowitz and Priyanka Raina, Stanford University, US<br /> <em><b>Abstract</b><br /> To effectively minimize static power for a wide range of applications, power domains for a coarse-grained reconfigurable array (CGRA) need to be finer-grained than in a typical ASIC. However, the special isolation logic needed to ensure electrical protection between off and on domains makes fine-grained power domains area- and timing-inefficient. We propose a novel design of the CGRA routing fabric that intrinsically provides boundary protection. This technique reduces the area overhead of boundary protection between power domains for the CGRA from around 9% to less than 1% and removes the delay from the isolation cells. However, with this design choice, we cannot leverage the conventional UPF-based flow to introduce power domain boundary protection. We create compiler-like passes that iteratively introduce the needed design transformations, and formally verify the passes with satisfiability modulo theories (SMT) methods.
These passes also allow us to optimize how we handle test and debug signals through the off tiles. We use our framework to insert power domains into an SoC with an ARM Cortex-M3 processor and a CGRA with 32x16 processing element (PE) and memory tiles, and 4 MB of secondary memory. Depending on the size of the applications mapped, our CGRA achieves up to an 83% reduction in leakage power and a 26% reduction in total power versus a CGRA without multiple power domains, for a range of image processing and machine learning applications.</em></td> </tr> <tr> <td>15:00</td> <td>7.2.2</td> <td><b>BLASTFUNCTION: AN FPGA-AS-A-SERVICE SYSTEM FOR ACCELERATED SERVERLESS COMPUTING</b><br /> <b>Speaker</b>:<br /> Rolando Brondolin, Politecnico di Milano, IT<br /> <b>Authors</b>:<br /> Marco Bacis, Rolando Brondolin and Marco D. Santambrogio, Politecnico di Milano, IT<br /> <em><b>Abstract</b><br /> Heterogeneous computing platforms are now a valuable solution to continue to meet Service Level Agreements (SLAs) for compute-intensive cloud workloads. Field Programmable Gate Arrays (FPGAs) effectively accelerate cloud workloads; however, these workloads exhibit spiky behavior as well as long periods of underutilization. Sharing the FPGA with multiple tenants then helps to increase the board's time utilization. In this paper we present BlastFunction, a distributed FPGA sharing system for the acceleration of microservices and serverless applications in cloud environments. BlastFunction includes a Remote OpenCL Library to access the shared devices transparently; multiple Device Managers to time-share and monitor the FPGAs; and a central Accelerators Registry to allocate the available devices. BlastFunction reaches higher utilization and throughput than native execution thanks to device sharing, with minimal latency differences caused by the concurrent accesses.</em></td> </tr> <tr> <td>15:30</td> <td>7.2.3</td> <td><b>ENERGY-AWARE PLACEMENT FOR SRAM-NVM HYBRID FPGAS</b><br /> <b>Speaker</b>:<br /> Seongsik Park, Seoul National University, KR<br /> <b>Authors</b>:<br /> Seongsik Park, Jongwan Kim and Sungroh Yoon, Seoul National University, KR<br /> <em><b>Abstract</b><br /> Field-programmable gate arrays (FPGAs) have been widely used in many applications due to their reconfigurability. In particular, their short development time makes FPGAs one of the most promising reconfigurable architectures for emerging applications, such as deep learning. As CMOS technology advances, however, conventional SRAM-based FPGAs have approached their limitations. To overcome these obstacles, NVM-based FPGAs have been introduced. Although NVM-based FPGAs feature high area density, low static power consumption, and non-volatility, they struggle to reduce energy consumption. This challenge is mainly caused by the access speed of NVM, which is slower than that of SRAM. In this paper, to compensate for this limitation, we propose an SRAM-NVM hybrid FPGA architecture with SRAM- and NVM-based CLBs. In addition, we propose an energy-aware placement for efficient use of the SRAM-NVM hybrid FPGAs.
As a result of our experiments, we were able to reduce the average energy consumption of the SRAM-NVM hybrid FPGA by 22.23% and 21.94% compared to a SRAM-based FPGA on the MCNC and VTR benchmarks, respectively.</em></td> </tr> <tr> <td>16:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="7.3">7.3 Special Session: Realizing Quantum Algorithms on Real Quantum Computing Devices</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 14:30 - 16:00<br /> <b>Location / Room:</b> Autrans</p> <p><b>Chair:</b><br /> Eduard Alarcon, UPC BarcelonaTech, ES</p> <p><b>Co-Chair:</b><br /> Swaroop Ghosh, Pennsylvania State University, US</p> <p>Quantum computing is currently moving from an academic idea to a practical reality. Quantum computing in the cloud is already available and allows users from all over the world to develop and execute real quantum algorithms. However, companies which are heavily investing in this new technology, such as Google, IBM, Rigetti, and Intel, follow different technological approaches. This has led to a situation where substantially different quantum computing devices are available. Because of that, various methods for realizing the intended quantum functionality on a given quantum computing device are available. This special session provides an introduction and overview into this domain and comprehensively describes corresponding methods (also referred to as compilers, mappers, synthesizers, or routers). By this, attendees will be provided with a detailed understanding of how to use quantum computers in general and dedicated quantum computing devices in particular. The special session will include speakers from both academia and industry, and will cover the most relevant quantum computing devices, such as those provided by IBM, Intel, etc.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>7.3.1</td> <td><b>RUNNING QUANTUM ALGORITHMS ON RESOURCE-CONSTRAINED QUANTUM DEVICES</b><br /> <b>Author</b>:<br /> Carmen G. Almudever, TU Delft, NL<br /> <em><b>Abstract</b><br /> A number of quantum computing devices consisting of a few tens of noisy qubits already exist. All of them present various limitations, such as limited qubit connectivity and a reduced gate set, that must be considered to make quantum algorithms executable. In this talk, after briefly introducing the basics of quantum computing, we will provide an overview of the problem of realizing quantum circuits. We will discuss different mapping approaches as well as quantum devices, emphasizing their main constraints. Special attention will be given to the quantum chips developed within the QuTech-Intel partnership.</em></td> </tr> <tr> <td>15:00</td> <td>7.3.2</td> <td><b>REALIZING QUANTUM CIRCUITS ON IBM Q DEVICES</b><br /> <b>Author</b>:<br /> Robert Wille, Johannes Kepler University Linz, AT<br /> <em><b>Abstract</b><br /> In 2017, IBM launched the first publicly available quantum computing device, which is accessible through a cloud service. In the meantime, many further devices have followed, which have been used by more than 100,000 people who have executed more than 7 million experiments on them. Accordingly, fast and efficient solutions to realize quantum functionality on those devices are demanded by a huge user base.
This talk will provide an overview of IBM's own tools for this task, as well as solutions which have been developed by researchers worldwide, including a description of a compiler that won the IBM Qiskit Developer Challenge.</em></td> </tr> <tr> <td>15:30</td> <td>7.3.3</td> <td><b>EVERY DEVICE IS (ALMOST) EQUAL BEFORE THE COMPILER</b><br /> <b>Author</b>:<br /> Gian Giacomo Guerreschi, Intel Corporation, US<br /> <em><b>Abstract</b><br /> At the current stage of quantum computing technologies, it is not only expected but often required to tailor the compiler to the characteristics of each individual machine. No coherence time can be wasted. However, many of the recently presented architectures have constraints that can be described in a unifying framework. We discuss how to represent these constraints and how more flexible compilers can be used to guide the design of novel architectures.</em></td> </tr> <tr> <td>16:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="7.4">7.4 Simulation and verification: where real issues meet scientific innovation</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 14:30 - 16:00<br /> <b>Location / Room:</b> Stendhal</p> <p><b>Chair:</b><br /> Avi Ziv, IBM, IL</p> <p><b>Co-Chair:</b><br /> Graziano Pravadelli, Università di Verona, IT</p> <p>This session presents recent concerns and innovative solutions in verification and simulation, covering topics ranging from partial verification and lazy event prediction to signal name disambiguation. The presented works tackle these challenges by reducing complexity, exploiting GPUs, and using similarity-learning techniques.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>7.4.1</td> <td><b>VERIFICATION RUNTIME ANALYSIS: GET THE MOST OUT OF PARTIAL VERIFICATION</b><br /> <b>Authors</b>:<br /> Martin Ring<sup>1</sup>, Fritjof Bornebusch<sup>1</sup>, Christoph Lüth<sup>2</sup>, Robert Wille<sup>3</sup> and Rolf Drechsler<sup>2</sup><br /> <sup>1</sup>DFKI, DE; <sup>2</sup>University of Bremen / DFKI, DE; <sup>3</sup>Johannes Kepler University Linz, AT<br /> <em><b>Abstract</b><br /> The design of modern systems has reached a complexity which makes it inevitable to apply verification methods in order to guarantee their correct and safe execution. The verification methods frequently produce proof obligations that cannot be solved anymore due to the huge search space. However, by setting enough variables to fixed values, the search space is obviously reduced and solving engines may eventually be able to complete the verification task. Although this results in a partial verification, the results may still be valuable --- in particular as opposed to the alternative of no verification at all. However, so far no systematic investigation has been conducted on which variables to fix in order to reduce verification runtime as much as possible while, at the same time, still getting the most coverage. This paper addresses this question by proposing a corresponding verification runtime analysis.
Experimental evaluations confirm the potential of this approach.</em></td> </tr> <tr> <td>15:00</td> <td>7.4.2</td> <td><b>GPU-ACCELERATED TIME SIMULATION OF SYSTEMS WITH ADAPTIVE VOLTAGE AND FREQUENCY SCALING</b><br /> <b>Speaker</b>:<br /> Eric Schneider, University of Stuttgart, DE<br /> <b>Authors</b>:<br /> Eric Schneider and Hans-Joachim Wunderlich, University of Stuttgart, DE<br /> <em><b>Abstract</b><br /> Timing validation of systems with adaptive voltage and frequency scaling (AVFS) requires an accurate timing model under multiple operating points. Simulating such a model at gate level is extremely time-consuming, and the state of the art compromises both accuracy and compute efficiency. This paper presents a method for dynamic gate delay modeling on graphics processing unit (GPU) accelerators which is based on polynomial approximation with offline statistical learning using regression analysis. It provides glitch-accurate switching activity information for gates and designs under varying supply voltages with negligible memory and performance impact. Parallelism from the evaluation of operating conditions, gates and stimuli is exploited simultaneously to utilize the high arithmetic computing throughput of GPUs. This way, large-scale design space exploration of AVFS-based systems is enabled. Experimental results demonstrate the efficiency and accuracy of the presented approach, showing speedups of three orders of magnitude over conventional time simulation that supports static delays only.</em></td> </tr> <tr> <td>15:30</td> <td>7.4.3</td> <td><b>LAZY EVENT PREDICTION USING DEFINING TREES AND SCHEDULE BYPASS FOR OUT-OF-ORDER PDES</b><br /> <b>Speaker</b>:<br /> Rainer Doemer, University of California, Irvine, US<br /> <b>Authors</b>:<br /> Daniel Mendoza, Zhongqi Cheng, Emad Arasteh and Rainer Doemer, University of California, Irvine, US<br /> <em><b>Abstract</b><br /> Out-of-order parallel discrete event simulation (PDES) has been shown to be very effective in speeding up system design by utilizing parallel processors on multi- and many-core hosts. As the number of threads in the design model grows larger, however, the original scheduling approach does not scale. In this work, we analyze the out-of-order scheduler and identify a bottleneck with quadratic complexity in event prediction. We propose a more efficient lazy strategy based on defining trees and a schedule bypass with O(m log2 m) complexity, which shows sustained and improved performance gains in simulation of SystemC models with many processes. For models containing over 1000 processes, experimental results show simulation run time speedups of up to 90x using lazy event prediction against the original out-of-order PDES approach.</em></td> </tr> <tr> <td>15:45</td> <td>7.4.4</td> <td><b>EMBEDDING HIERARCHICAL SIGNAL TO SIAMESE NETWORK FOR FAST NAME RECTIFICATION</b><br /> <b>Speaker</b>:<br /> Yi-An Chen, National Chiao Tung University, TW<br /> <b>Authors</b>:<br /> Yi-An Chen<sup>1</sup>, Gung-Yu Pan<sup>2</sup>, Che-Hua Shih<sup>2</sup>, Yen-Chin Liao<sup>1</sup>, Chia-Chih Yen<sup>2</sup> and Hsie-Chia Chang<sup>1</sup><br /> <sup>1</sup>National Chiao Tung University, TW; <sup>2</sup>Synopsys, TW<br /> <em><b>Abstract</b><br /> EDA tools are necessary to assist the complicated flows of advanced IC design and verification in today's industry. After synthesis or simulation, the same signal may appear under different hierarchical names, especially for mixed-language designs.
This name mismatching problem blocks automation and requires experienced users to rectify it manually using domain knowledge. Rule-based rectification helps the process, but still fails when encountering unseen mismatching types. In this paper, hierarchical name rectification is transformed into a similarity search problem where the most similar name becomes the rectified name. However, a naïve full search over the design with string comparison costs unacceptable time. Our proposed framework embeds name strings into vectors representing their distance relation in a latent space using character n-grams and locality-sensitive hashing (LSH), and then finds the most similar signal using nearest neighbor search (NNS) and detailed search. Learning similarity using a Siamese network provides general name rectification regardless of mismatching types, while string-to-vector embedding for proximity search accelerates the rectification process. Our approach is capable of achieving a 93.43% rectification rate with only 0.052 s per signal, outperforming the naïve string search with 2.3% higher accuracy and a 4,500x speed-up.</em></td> </tr> <tr> <td style="width:40px;">16:00</td> <td><a href="#IP3">IP3-9</a>, 832</td> <td><b>TOWARDS SPECIFICATION AND TESTING OF RISC-V ISA COMPLIANCE</b><br /> <b>Speaker</b>:<br /> Vladimir Herdt, University of Bremen, DE<br /> <b>Authors</b>:<br /> Vladimir Herdt<sup>1</sup>, Daniel Grosse<sup>2</sup> and Rolf Drechsler<sup>2</sup><br /> <sup>1</sup>University of Bremen, DE; <sup>2</sup>University of Bremen / DFKI, DE<br /> <em><b>Abstract</b><br /> Compliance testing for RISC-V is very important. Therefore, an official hand-written compliance test-suite is being actively developed. However, this requires significant manual effort, in particular to achieve a high test coverage. In this paper we propose a test-suite specification mechanism in combination with a first set of instruction constraints and coverage requirements for the base RISC-V ISA. In addition, we present an automated method to generate a test-suite that satisfies the specification. Our evaluation demonstrates the effectiveness and potential of our method.</em></td> </tr> <tr> <td style="width:40px;">16:01</td> <td><a href="#IP3">IP3-10</a>, 702</td> <td><b>POST-SILICON VALIDATION OF THE IBM POWER9 PROCESSOR</b><br /> <b>Speaker</b>:<br /> Hillel Mendelson, IBM, IL<br /> <b>Authors</b>:<br /> Tom Kolan<sup>1</sup>, Hillel Mendelson<sup>1</sup>, Vitali Sokhin<sup>1</sup>, Kevin Reick<sup>2</sup>, Elena Tsanko<sup>2</sup> and Gregory Wetli<sup>2</sup><br /> <sup>1</sup>IBM Research, IL; <sup>2</sup>IBM Systems, US<br /> <em><b>Abstract</b><br /> Due to the complexity of designs, post-silicon validation remains a major challenge with few systematic solutions. We provide an overview of the state-of-the-art post-silicon validation process used by IBM to verify its latest IBM POWER9 processor. During the POWER9 post-silicon validation, we detected and handled 30% more logic bugs in 80% of the time, as compared to the previous IBM POWER8 bring-up. This improvement is the result of lessons learned from previous designs, leading to numerous innovations. We provide bug analysis data and compare it to POWER8 results.
We demonstrate our methodology by describing several bugs from fail detection to root cause.</em></td> </tr> <tr> <td>16:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="7.5">7.5 Runtime support for multi/many cores</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 14:30 - 16:00<br /> <b>Location / Room:</b> Bayard</p> <p><b>Chair:</b><br /> Sara Vinco, Politecnico di Torino, IT</p> <p><b>Co-Chair:</b><br /> Jeronimo Castrillon, TU Dresden, DE</p> <p>In the era of heterogeneous embedded systems, the diverse nature of computing elements pushes more than ever the need for smart runtime systems able to deal with resource management, multi-application mapping, task parallelism, and non-functional constraints. This session tackles these issues with solutions that span from resource-aware software architectures to novel runtime systems optimizing memory and energy consumption.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>7.5.1</td> <td><b>RESOURCE-AWARE MAPREDUCE RUNTIME FOR MULTI/MANY-CORE ARCHITECTURES</b><br /> <b>Speaker</b>:<br /> Konstantinos Iliakis, MicroLab, ECE, NTUA, GR<br /> <b>Authors</b>:<br /> Konstantinos Iliakis<sup>1</sup>, Sotirios Xydis<sup>1</sup> and Dimitrios Soudris<sup>2</sup><br /> <sup>1</sup>National TU Athens, GR; <sup>2</sup>National Technical University of Athens, GR<br /> <em><b>Abstract</b><br /> Modern multi/many-core processors exhibit high integration densities, e.g. up to several dozens or hundreds of cores. To ease the application development burden for such systems, various programming frameworks have emerged. The MapReduce programming model, after having demonstrated its usability in the area of distributed systems, has been adapted to the needs of shared-memory many-core and multi-processor systems, showing promising results in comparison with conventional multi-threaded libraries, e.g. pthreads. In this paper, we propose a novel resource-aware MapReduce architecture. The proposed runtime decouples the map and combine phases in order to enhance the degree of parallelism, while it effectively overlaps the memory-intensive combine with the compute-intensive map operation, resulting in superior resource utilization and performance improvements. A detailed sensitivity analysis of the framework's tuning knobs is provided. The decoupled MapReduce architecture is evaluated against the state-of-the-art library on two diverse systems, i.e. a Haswell server and a Xeon Phi co-processor, demonstrating average speedups of up to 2.2x and 2.9x, respectively.</em></td> </tr> <tr> <td>15:00</td> <td>7.5.2</td> <td><b>TOWARDS A QUALIFIABLE OPENMP FRAMEWORK FOR EMBEDDED SYSTEMS</b><br /> <b>Speaker</b>:<br /> Adrian Munera Sanchez, BSC, ES<br /> <b>Authors</b>:<br /> Adrián Munera Sánchez, Sara Royuela and Eduardo Quiñones, BSC, ES<br /> <em><b>Abstract</b><br /> OpenMP is a very convenient parallel programming model to develop critical real-time applications by virtue of its powerful tasking model and its proven time-predictable properties. However, current OpenMP implementations are not suitable due to the intensive use of dynamic memory to allocate data structures needed to efficiently manage the parallel execution.
This jeopardizes the qualification processes of critical real-time systems, which are needed to ensure that the integrated system stack, including the OpenMP framework, is compliant with the system requirements. This paper proposes a novel OpenMP framework that statically allocates all the data structures needed to execute the OpenMP tasking model. Our framework is composed of a compiler phase that captures the data environment of all the OpenMP tasks instantiated along the parallel execution, and a run-time phase implementing a lazy task creation policy that significantly reduces the memory requirements at run time, whilst exploiting parallelism efficiently.</em></td> </tr> <tr> <td>15:30</td> <td>7.5.3</td> <td><b>ENERGY-EFFICIENT RUNTIME RESOURCE MANAGEMENT FOR ADAPTABLE MULTI-APPLICATION MAPPING</b><br /> <b>Speaker</b>:<br /> Robert Khasanov, TU Dresden, DE<br /> <b>Authors</b>:<br /> Robert Khasanov and Jeronimo Castrillon, TU Dresden, DE<br /> <em><b>Abstract</b><br /> Modern embedded computing platforms consist of a large number of heterogeneous resources, which allows multiple applications to execute on a single device. The number of applications running on the system varies over time, and so does the amount of available resources. This has considerably increased the complexity of analysis and optimization algorithms for runtime mapping of firm real-time applications. To reduce the runtime overhead, researchers have proposed to pre-compute partial mappings at compile time and have the runtime efficiently compute the final mapping. However, most existing solutions only compute a fixed mapping for a given set of running applications, and the mapping is defined for the entire duration of the workload execution. In this work we allow applications to adapt to the amount of available resources by using mapping segments. This way, applications may switch between different configurations with varying degrees of parallelism. We present a runtime manager for firm real-time applications that generates such mapping segments based on partial solutions and aims at minimizing the overall energy consumption without deadline violations. The proposed algorithm outperforms the state-of-the-art approaches in overall energy consumption by up to 13% while incurring an order of magnitude less scheduling overhead.</em></td> </tr> <tr> <td style="width:40px;">16:00</td> <td><a href="#IP3">IP3-11</a>, 619</td> <td><b>ON THE TASK MAPPING AND SCHEDULING FOR DAG-BASED EMBEDDED VISION APPLICATIONS ON HETEROGENEOUS MULTI/MANY-CORE ARCHITECTURES</b><br /> <b>Speaker</b>:<br /> Nicola Bombieri, Università di Verona, IT<br /> <b>Authors</b>:<br /> Stefano Aldegheri<sup>1</sup>, Nicola Bombieri<sup>1</sup> and Hiren Patel<sup>2</sup><br /> <sup>1</sup>Università di Verona, IT; <sup>2</sup>University of Waterloo, CA<br /> <em><b>Abstract</b><br /> In this work, we show that applying the heterogeneous earliest finish time (HEFT) heuristic for the task scheduling of embedded vision applications can improve the system performance by up to 70% w.r.t. the scheduling solutions at the state of the art. We propose an algorithm called exclusive earliest finish time (XEFT) that introduces the notion of exclusive overlap between application primitives to improve the load balancing. We show that XEFT can improve the system performance by up to 33% over HEFT, and 82% over the state-of-the-art approaches.
We present the results on different benchmarks, including a real-world localization and mapping application (ORB-SLAM) combined with the NVIDIA object detection application based on deep learning.</em></td> </tr> <tr> <td>16:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="7.6">7.6 Attacks on Hardware Architectures</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 14:30 - 16:00<br /> <b>Location / Room:</b> Lesdiguières</p> <p><b>Chair:</b><br /> Johanna Sepúlveda, Airbus Defence and Space, DE</p> <p><b>Co-Chair:</b><br /> Jean-Luc Danger, Télécom ParisTech, FR</p> <p>Hardware architectures are under the continuous threat of all types of attacks. This session covers attacks based on side-channel leakage and the exploitation of vulnerabilities at the micro-architectural and circuit level.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>7.6.1</td> <td><b>SWEEPING FOR LEAKAGE IN MASKED CIRCUIT LAYOUTS</b><br /> <b>Speaker</b>:<br /> Danilo Šijačić, IMEC / KU Leuven, BE<br /> <b>Authors</b>:<br /> Danilo Šijačić, Josep Balasch and Ingrid Verbauwhede, KU Leuven, BE<br /> <em><b>Abstract</b><br /> Masking schemes are the most popular countermeasure against side-channel analysis. They theoretically decorrelate information leaked through inherent physical channels from the key-dependent intermediate values that occur during computation. Their provable security is devised under models that abstract complex physical phenomena of the underlying hardware. In this work, we investigate the impact of the physical layout on the side-channel security of masking schemes. For this, we propose a model for co-simulation of the analog power distribution network with the digital logic core. Our study considers the drive of the power supply buffers, as well as parasitic resistors, inductors and capacitors. We quantify our findings using Test Vector Leakage Assessment by relative comparison to the parasitic-free model. Thus we provide a deeper insight into the potential layout sources of leakage and their magnitude.</em></td> </tr> <tr> <td>15:00</td> <td>7.6.2</td> <td><b>INCREASED REPRODUCIBILITY AND COMPARABILITY OF DATA LEAK EVALUATIONS USING EXOT</b><br /> <b>Speaker</b>:<br /> Philipp Miedl, ETH Zürich, CH<br /> <b>Authors</b>:<br /> Philipp Miedl<sup>1</sup>, Bruno Klopott<sup>2</sup> and Lothar Thiele<sup>1</sup><br /> <sup>1</sup>ETH Zurich, CH; <sup>2</sup>ETH Zürich, CH<br /> <em><b>Abstract</b><br /> As computing systems are increasingly shared among different users or application domains, researchers have intensified their efforts to detect possible data leaks. In particular, many investigations highlight the vulnerability of systems w.r.t. covert and side-channel attacks. However, the effort required to reproduce and compare different results has proven to be high. Therefore, we present a novel methodology for covert channel evaluation. In addition, we introduce the Experiment Orchestration Toolkit ExOT, which provides software tools to efficiently execute the methodology. Our methodology ensures that the covert channel analysis yields expressive results that can be reproduced and allow the comparison of the threat potential of different data leaks. ExOT is a software bundle that consists of easy-to-extend C++ libraries and Python packages.
These libraries and packages provide tools for the generation and execution of experiments, as well as the analysis of the experimental data. Therefore, ExOT decreases the engineering effort needed to execute our novel methodology. We verify these claims with an extensive evaluation of four different covert channels on an Intel Haswell and an ARMv8-based platform. In our evaluation, we derive capacity bounds and show achievable throughputs to compare the threat potential of these different covert channels.</em></td> </tr> <tr> <td>15:15</td> <td>7.6.3</td> <td><b>GHOSTBUSTERS: MITIGATING SPECTRE ATTACKS ON A DBT-BASED PROCESSOR</b><br /> <b>Speaker and Author</b>:<br /> Simon Rokicki, Irisa, FR<br /> <em><b>Abstract</b><br /> Unveiled in early 2018, the Spectre vulnerability affects most modern high-performance processors. Spectre variants exploit the speculative execution mechanisms and a cache side-channel attack to leak secret data. As of today, the main countermeasures consist of turning off speculation, which drastically reduces the processor performance. In this work, we focus on a different kind of micro-architecture: DBT-based processors, such as Transmeta Crusoe [1], NVIDIA Denver or Hybrid-DBT. Instead of using complex out-of-order (OoO) mechanisms, those cores combine a software Dynamic Binary Translation (DBT) mechanism and a parallel in-order architecture, typically a VLIW core. The DBT is in charge of translating and optimizing the binaries before their execution. Studies show that DBT-based processors can reach the performance level of OoO cores for regular enough applications. In this paper, we demonstrate that, even if those processors do not use OoO execution, they are still vulnerable to Spectre variants, because of the DBT optimizations. However, we also demonstrate that those systems can easily be patched, as the DBT is done in software and has fine-grained control over the optimization process.</em></td> </tr> <tr> <td>15:30</td> <td>7.6.4</td> <td><b>DYNAMIC FAULTS BASED HARDWARE TROJAN DESIGN IN STT-MRAM</b><br /> <b>Speaker</b>:<br /> Sarath Mohanachandran Nair, Karlsruhe Institute of Technology, DE<br /> <b>Authors</b>:<br /> Sarath Mohanachandran Nair<sup>1</sup>, Rajendra Bishnoi<sup>2</sup>, Arunkumar Vijayan<sup>1</sup> and Mehdi Tahoori<sup>1</sup><br /> <sup>1</sup>Karlsruhe Institute of Technology, DE; <sup>2</sup>TU Delft, NL<br /> <em><b>Abstract</b><br /> The emerging Spin Transfer Torque Magnetic Random Access Memory (STT-MRAM) is seen as a promising candidate to replace conventional on-chip memories. It has several advantages such as high density, non-volatility, scalability, and CMOS compatibility. With this technology becoming ubiquitous, it also becomes interesting as a target for security attacks. As the fabrication process of STT-MRAM evolves, it is susceptible to various fault mechanisms which are different from those of conventional CMOS memories. These unique fault mechanisms can be exploited by an adversary to deploy hardware Trojans, which are deliberately introduced design modifications. In this work, we demonstrate how a particular stealthy circuit modification to inject a fault mechanism, namely a dynamic fault, can be exploited to implement a hardware Trojan trigger which cannot be detected by standard memory testing methods. The fault mechanisms can also be used to design new payloads specific to STT-MRAM.
We illustrate this by proposing a new payload utilizing coupling faults, which leads to degraded performance and data corruption.</em></td> </tr> <tr> <td>15:45</td> <td>7.6.5</td> <td><b>ORACLE-BASED LOGIC LOCKING ATTACKS: PROTECT THE ORACLE NOT ONLY THE NETLIST</b><br /> <b>Speaker</b>:<br /> Emmanouil Kalligeros, University of the Aegean, GR<br /> <b>Authors</b>:<br /> Emmanouil Kalligeros, Nikolaos Karousos and Irene Karybali, University of the Aegean, GR<br /> <em><b>Abstract</b><br /> Logic locking has received a lot of attention in the literature due to its very attractive hardware-security characteristics: it can protect against IP piracy and overproduction throughout the whole IC supply chain. However, a large class of logic-locking attacks, the oracle-based ones, take advantage of a functional copy of the chip, the oracle, to extract the key that protects the chip. So far, the techniques dealing with oracle-based attacks focus on the netlist that the attacker possesses, assuming that the oracle is always available. For this reason, they are usually overcome by new attacks. In this paper, we propose a hardware security scheme that targets the protection of the oracle circuit, by locking the circuit when the scan in/out process, which is necessary for setting the inputs and observing the outputs, begins. Hence, no correct input/output pairs can be acquired to perform the attacks. The proposed scheme is not based on controlling global signals like test_enable or scan_enable, whose values can be easily suppressed by the attacker. Security threats are identified, discussed and addressed. The developed scheme is combined with a traditional logic locking technique with high output corruptibility, to achieve increased levels of protection.</em></td> </tr> <tr> <td style="width:40px;">16:00</td> <td><a href="#IP3">IP3-12</a>, 424</td> <td><b>ARE CLOUD FPGAS REALLY VULNERABLE TO POWER ANALYSIS ATTACKS?</b><br /> <b>Speaker</b>:<br /> Ognjen Glamocanin, EPFL, CH<br /> <b>Authors</b>:<br /> Ognjen Glamocanin<sup>1</sup>, Louis Coulon<sup>1</sup>, Francesco Regazzoni<sup>2</sup> and Mirjana Stojilovic<sup>1</sup><br /> <sup>1</sup>EPFL, CH; <sup>2</sup>ALaRI, CH<br /> <em><b>Abstract</b><br /> Recent works have demonstrated the possibility of extracting secrets from a cryptographic core running on an FPGA by means of remote power analysis attacks. To mount these attacks, an adversary implements a voltage fluctuation sensor in the FPGA logic, records the power consumption of the target cryptographic core, and recovers the secret key by running a power analysis attack on the recorded traces. Despite showing that the power analysis could also be performed without physical access to the cryptographic core, these works were mostly carried out on dedicated FPGA boards in a controlled environment, leaving open the question of whether these attacks can successfully be mounted on a real system deployed in the cloud. In this paper, we demonstrate, for the first time, a successful key recovery attack on an AES cryptographic accelerator running on an Amazon EC2 F1 instance. We collect the power traces using a delay-line-based voltage drop sensor, adapted to the Xilinx Virtex Ultrascale+ architecture used on Amazon EC2 F1, where CARRY8 blocks do not have a monotonic delay increase at their outputs.
Our results demonstrate that security concerns raised by multitenant FPGAs are indeed valid and that countermeasures should be put in place to mitigate them.</em></td> </tr> <tr> <td>16:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="7.7">7.7 Self-Adaptive and Learning Systems</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 14:30 - 16:00<br /> <b>Location / Room:</b> Berlioz</p> <p><b>Chair:</b><br /> Gilles Sassatelli, Université de Montpellier, FR</p> <p><b>Co-Chair:</b><br /> Rishad Shafik, University of Newcastle, GB</p> <p>Recent advances in machine learning have pushed the boundaries of what is possible in self-adaptive and learning systems. This session pushes the state of the art in runtime power and performance trade-offs for deep neural networks and self-optimizing embedded systems.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>14:30</td> <td>7.7.1</td> <td><b>ANYTIMENET: CONTROLLING TIME-QUALITY TRADEOFFS IN DEEP NEURAL NETWORK ARCHITECTURES</b><br /> <b>Speaker</b>:<br /> Jung-Eun Kim, Yale University, US<br /> <b>Authors</b>:<br /> Jung-Eun Kim<sup>1</sup>, Richard Bradford<sup>2</sup> and Zhong Shao<sup>1</sup><br /> <sup>1</sup>Yale University, US; <sup>2</sup>Collins Aerospace, US<br /> <em><b>Abstract</b><br /> Deeper neural networks, especially those with extremely large numbers of internal parameters, impose a heavy computational burden in obtaining sufficiently high-quality results. These burdens are impeding the application of machine learning and related techniques to time-critical computing systems. To address this challenge, we propose an architectural approach for neural networks that adaptively trades off computation time and solution quality to achieve high-quality solutions with timeliness. We propose a novel and general framework, AnytimeNet, that gradually inserts additional layers, so users can expect monotonically increasing quality of solutions as more computation time is expended. The framework allows users to select on the fly when to retrieve a result during runtime. Extensive evaluation results on classification tasks demonstrate that our proposed architecture provides adaptive control of classification solution quality according to the available computation time.</em></td> </tr> <tr> <td>15:00</td> <td>7.7.2</td> <td><b>ANTIDOTE: ATTENTION-BASED DYNAMIC OPTIMIZATION FOR NEURAL NETWORK RUNTIME EFFICIENCY</b><br /> <b>Speaker</b>:<br /> Xiang Chen, George Mason University, US<br /> <b>Authors</b>:<br /> Fuxun Yu<sup>1</sup>, Chenchen Liu<sup>2</sup>, Di Wang<sup>3</sup>, Yanzhi Wang<sup>1</sup> and Xiang Chen<sup>1</sup><br /> <sup>1</sup>George Mason University, US; <sup>2</sup>University of Maryland, Baltimore County, US; <sup>3</sup>Microsoft, US<br /> <em><b>Abstract</b><br /> Convolutional Neural Networks (CNNs) have achieved great cognitive performance at the expense of considerable computation load. To relieve the computation load, many optimization works have been developed to reduce model redundancy by identifying and removing insignificant model components, such as weight sparsity and filter pruning. However, these works only evaluate model components' static significance with internal parameter information, ignoring their dynamic interaction with external inputs.
With per-input feature activation, the model component significance can dynamically change, and thus the static methods can only achieve sub-optimal results. Therefore, we propose a dynamic CNN optimization framework in this work. Based on the neural network attention mechanism, we propose a comprehensive dynamic optimization framework including (1) testing-phase channel and column feature map pruning, as well as (2) training-phase optimization by targeted dropout. Such a dynamic optimization framework has several benefits: (1) First, it can accurately identify and aggressively remove per-input feature redundancy by considering the model-input interaction; (2) Meanwhile, it can maximally remove the feature map redundancy in various dimensions thanks to the multi-dimension flexibility; (3) The training-testing co-optimization favors the dynamic pruning and helps maintain the model accuracy even with very high feature pruning ratios. Extensive experiments show that our method could bring 37.4%∼54.5% FLOPs reduction with negligible accuracy drop on various test networks.</em></td> </tr> <tr> <td>15:30</td> <td>7.7.3</td> <td><b>USING LEARNING CLASSIFIER SYSTEMS FOR THE DSE OF ADAPTIVE EMBEDDED SYSTEMS</b><br /> <b>Speaker</b>:<br /> Fedor Smirnov, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE<br /> <b>Authors</b>:<br /> Fedor Smirnov, Behnaz Pourmohseni and Jürgen Teich, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE<br /> <em><b>Abstract</b><br /> Modern embedded systems are not only becoming more and more complex but are also often exposed to dynamically changing run-time conditions such as resource availability or processing power requirements. This trend has led to the emergence of adaptive systems which are designed using novel approaches that combine a static off-line Design Space Exploration (DSE) with the consideration of the dynamic run-time behavior of the system under design. In contrast to a static design approach, which provides a single design solution as a compromise between the possible run-time situations, the off-line DSE of these so-called hybrid design approaches yields a set of configuration alternatives, so that at run time, it becomes possible to dynamically choose the option most suited for the current situation. However, most of these approaches still use optimizers which were primarily developed for static design. Consequently, modeling complex dynamic environments or run-time requirements is either not possible or comes at the cost of a significant computation overhead or results of poor quality. As a remedy, this paper introduces Learning Optimizer Constrained by ALtering conditions (LOCAL), a novel optimization framework for the DSE of adaptive embedded systems. Following the structure of Learning Classifier System (LCS) optimizers, the proposed framework optimizes a strategy, i.e., a set of conditionally applicable solutions for the problem at hand, instead of a set of independent solutions. We show how the proposed framework—which can be used for the optimization of any adaptive system—is used for the optimization of dynamically reconfigurable many-core systems and provide experimental evidence that the hereby obtained strategy offers superior embeddability compared to the solutions provided by a state-of-the-art
hybrid approach that uses an evolutionary algorithm.</em></td> </tr> <tr> <td style="width:40px;">16:00</td> <td><a href="#IP3">IP3-13</a>, 760</td> <td><b>EFFICIENT TRAINING ON EDGE DEVICES USING ONLINE QUANTIZATION</b><br /> <b>Speaker</b>:<br /> Michael Ostertag, University of California, San Diego, US<br /> <b>Authors</b>:<br /> Michael Ostertag<sup>1</sup>, Sarah Al-Doweesh<sup>2</sup> and Tajana Rosing<sup>1</sup><br /> <sup>1</sup>University of California, San Diego, US; <sup>2</sup>King Abdulaziz City of Science and Technology, SA<br /> <em><b>Abstract</b><br /> Sensor-specific calibration functions offer superior performance over global models and single-step calibration procedures but require prohibitive levels of sampling in the input feature space. Sensor self-calibration by gathering training data through collaborative calibration or self-analyzing predictive results allows these sensors to gather sufficient information. Resource-constrained edge devices are then stuck between high communication costs for transmitting training data to a centralized server and high memory requirements for storing data locally. We propose online dataset quantization that maximizes the diversity of input features, maintaining a representative set of data from a larger stream of training data points. We test the effectiveness of online dataset quantization on two real-world datasets: air quality calibration and power prediction modeling. Online dataset quantization outperforms reservoir sampling and performs on par with offline methods.</em></td> </tr> <tr> <td style="width:40px;">16:01</td> <td><a href="#IP3">IP3-14</a>, 190</td> <td><b>MULTI-AGENT ACTOR-CRITIC METHOD FOR JOINT DUTY-CYCLE AND TRANSMISSION POWER CONTROL</b><br /> <b>Speaker</b>:<br /> Sota Sawaguchi, CEA-Leti, FR<br /> <b>Authors</b>:<br /> Sota Sawaguchi<sup>1</sup>, Jean-Frédéric Christmann<sup>2</sup>, Anca Molnos<sup>2</sup>, Carolynn Bernier<sup>2</sup> and Suzanne Lesecq<sup>2</sup><br /> <sup>1</sup>CEA, FR; <sup>2</sup>CEA-Leti, FR<br /> <em><b>Abstract</b><br /> Energy-harvesting Internet of Things (EH-IoT) wireless networks have gained attention due to their potentially unlimited, maintenance-free operation. However, maintaining energy neutral operation (ENO) of EH-IoT devices, such that the harvested and consumed energy are matched during a certain time period, is crucial. Guaranteeing this ENO condition and an optimal power-performance trade-off under various workloads and transient wireless channel quality is particularly challenging. This paper proposes a multi-agent actor-critic method for modulating both the transmission duty-cycle and the transmitter output power based on the state-of-buffer (SoB) and the state-of-charge (SoC) information as a state. Thanks to these buffers, system uncertainties, especially harvested energy and wireless link conditions, are addressed effectively. Differently from the state of the art, our solution requires neither a model of the wireless transceiver nor any measurement of wireless channel quality. Simulation results of a solar-powered EH-IoT node using real-life outdoor solar irradiance data show that the proposed method achieves better performance, without system failures throughout a year, compared to the state of the art, which suffers some system downtime. Our approach also predicts almost no system failures during five years of operation.
<tr> <td>16:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="IP3">IP3 Interactive Presentations</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 16:00 - 16:30<br /> <b>Location / Room:</b> Poster Area</p> <p>Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.</p> <table> <tbody> <tr> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> <tr> <td style="width:40px;">IP3-1</td> <td><b>CNT-CACHE: AN ENERGY-EFFICIENT CARBON NANOTUBE CACHE WITH ADAPTIVE ENCODING</b><br /> <b>Speaker</b>:<br /> Kexin Chu, School of Electronic Science &amp; Applied Physics, Hefei University of Technology, Anhui, China, CN<br /> <b>Authors</b>:<br /> Dawen Xu<sup>1</sup>, Kexin Chu<sup>1</sup>, Cheng Liu<sup>2</sup>, Ying Wang<sup>2</sup>, Lei Zhang<sup>2</sup> and Huawei Li<sup>2</sup><br /> <sup>1</sup>School of Electronic Science &amp; Applied Physics, Hefei University of Technology, Anhui, CN; <sup>2</sup>Chinese Academy of Sciences, CN<br /> <em><b>Abstract</b><br /> The carbon nanotube field-effect transistor (CNFET), which promises both higher clock speed and higher energy efficiency, is an attractive alternative to the conventional power-hungry CMOS cache. We observe that a CNFET-based cache constructed with typical 9T SRAM cells has distinct energy consumption when reading/writing 0 and 1. The energy consumption of reading 0 is around 3X higher compared to reading 1. The energy consumption of writing 1 is almost 10X higher than writing 0. With this observation, we propose an energy-efficient cache design called CNT-Cache to take advantage of this feature. It includes an adaptive data encoding module that can convert the coding of each cache line to match the cache reading and writing preferences. Meanwhile, it has a cache line encoding direction predictor that instructs the encoding direction according to the cache line access history. The two optimizations combined can reduce the overall dynamic power consumption significantly. According to our experiments, the optimized CNFET-based L1 D-Cache reduces the dynamic power consumption by 22% on average compared to the baseline CNFET cache.</em></td> </tr>
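<tr> <td style="width:40px;"> </td> <td><em>A minimal sketch of the adaptive-encoding idea in IP3-1 (our simplification, not the authors' circuit; the line width and threshold are invented): if writing 1 costs far more than writing 0, store a line inverted whenever it is mostly 1s, and keep a flag bit per line to undo the inversion on reads.</em><br /> <pre><code>
LINE_BITS = 64
MASK = 2**LINE_BITS - 1

def encode(line):
    ones = bin(line).count("1")
    if ones > LINE_BITS // 2:      # mostly 1s -> store inverted
        return line ^ MASK, 1
    return line, 0

def decode(stored, inverted):
    return stored ^ MASK if inverted else stored

line = 0xFFFF_FFFF_FFFF_0000       # 48 one-bits
stored, flag = encode(line)
assert decode(stored, flag) == line
print(bin(stored).count("1"))      # 16 ones written instead of 48
</code></pre></td> </tr>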
<tr> <td style="width:40px;">IP3-2</td> <td><b>ENHANCING MULTITHREADED PERFORMANCE OF ASYMMETRIC MULTICORES WITH SIMD OFFLOADING</b><br /> <b>Speaker</b>:<br /> Antonio Schneider Beck, Universidade Federal do Rio Grande do Sul, BR<br /> <b>Authors</b>:<br /> Jeckson Dellagostin Souza<sup>1</sup>, Madhavan Manivannan<sup>2</sup>, Miquel Pericas<sup>2</sup> and Antonio Carlos Schneider Beck<sup>1</sup><br /> <sup>1</sup>Universidade Federal do Rio Grande do Sul, BR; <sup>2</sup>Chalmers, SE<br /> <em><b>Abstract</b><br /> Single-ISA asymmetric multicore architectures can accelerate multithreaded applications by running code that does not execute concurrently (i.e., the serial region) on a big core and the parallel region on a larger number of smaller cores. Nevertheless, in such architectures the big core still implements resource-expensive application-specific instruction extensions that are rarely used while running the serial region, such as Single Instruction Multiple Data (SIMD) and Floating-Point (FP) operations. In this work, we propose a design in which these extensions are not implemented in the big core, thereby freeing up area and resources to increase the number of small cores in the system and potentially enhance thread-level parallelism (TLP). To address the case when missing instruction extensions are required while running on the big core, we devise an approach to automatically offload these operations to the execution units of the small cores, where the extensions are implemented and can be executed. Our evaluation shows that, on average, the proposed architecture provides a 1.76x speedup when compared to a traditional single-ISA asymmetric multicore processor with the same area, for a variety of parallel applications.</em></td> </tr> <tr> <td style="width:40px;">IP3-3</td> <td><b>HARDWARE ACCELERATION OF CNN WITH ONE-HOT QUANTIZATION OF WEIGHTS AND ACTIVATIONS</b><br /> <b>Speaker</b>:<br /> Gang Li, Chinese Academy of Sciences, CN<br /> <b>Authors</b>:<br /> Gang Li, Peisong Wang, Zejian Liu, Cong Leng and Jian Cheng, Chinese Academy of Sciences, CN<br /> <em><b>Abstract</b><br /> In this paper, we propose a novel one-hot representation for weights and activations in CNN models and demonstrate its benefits on hardware accelerator design. Specifically, rather than merely reducing the bitwidth, we quantize both weights and activations into n-bit integers that contain only one non-zero bit per value. In this way, the massive number of multiply-accumulate operations (MACs) becomes equivalent to additions of powers of two, which can be efficiently calculated with histogram-based computations. Experiments on the ImageNet classification task show that our proposed One-Hot Networks (OHN) obtain accuracy comparable to conventional fixed-point networks. As case studies, we evaluate the efficacy of the one-hot data representation on two state-of-the-art CNN accelerators on FPGA; our preliminary results show that 50% and 68.5% resource savings can be achieved on DaDianNao and Laconic, respectively. Besides, the one-hot optimized Laconic can further achieve an average speedup of 4.94x on AlexNet.</em></td> </tr>
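<tr> <td style="width:40px;"> </td> <td><em>To illustrate the arithmetic behind IP3-3 (our reading of the abstract, not the authors' code): when every quantized value has a single non-zero bit, each product in a MAC collapses to a bit-shift. The sketch below rounds values to the nearest power of two on a log scale.</em><br /> <pre><code>
import numpy as np

# Quantize each value to a signed power of two: one non-zero bit per
# value, so multiplication by it is just a shift.
def one_hot_quantize(x):
    sign = np.sign(x)
    mag = np.abs(x)
    exp = np.round(np.log2(np.where(mag > 0, mag, 1.0)))  # log-domain rounding
    q = sign * 2.0 ** exp
    return np.where(mag > 0, q, 0.0)

w = np.array([0.24, -0.9, 0.051, 0.0])
print(one_hot_quantize(w))   # [ 0.25  -1.  0.0625  0. ]
</code></pre></td> </tr>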
<tr> <td style="width:40px;">IP3-4</td> <td><b>BNNSPLIT: BINARIZED NEURAL NETWORKS FOR EMBEDDED DISTRIBUTED FPGA-BASED COMPUTING SYSTEMS</b><br /> <b>Speaker</b>:<br /> Luca Stornaiuolo, Politecnico di Milano, IT<br /> <b>Authors</b>:<br /> Giorgia Fiscaletti, Marco Speziali, Luca Stornaiuolo, Marco D. Santambrogio and Donatella Sciuto, Politecnico di Milano, IT<br /> <em><b>Abstract</b><br /> In the past few years, Convolutional Neural Networks (CNNs) have seen a massive improvement, outperforming other visual recognition algorithms. Since they are playing an increasingly important role in fields such as face recognition, augmented reality or autonomous driving, there is a growing need for fast and efficient systems to perform the redundant and heavy computations of CNNs. This trend led researchers towards heterogeneous systems provided with hardware accelerators, such as GPUs and FPGAs. The vast majority of CNNs are implemented with floating-point parameters and operations, but research has shown that high classification accuracy can also be obtained by reducing the floating-point activations and weights to binary values. This context is well suited for FPGAs, which are known to stand out in terms of performance when dealing with binary operations, as demonstrated by Finn, the state-of-the-art framework for building Binarized Neural Network (BNN) accelerators on FPGAs. In this paper, we propose a framework that extends Finn to a distributed scenario, enabling BNN implementation on embedded multi-FPGA systems.</em></td> </tr> <tr> <td style="width:40px;">IP3-5</td> <td><b>L2L: A HIGHLY ACCURATE LOG_2_LEAD QUANTIZATION OF PRE-TRAINED NEURAL NETWORKS</b><br /> <b>Speaker</b>:<br /> Salim Ullah, TU Dresden, DE<br /> <b>Authors</b>:<br /> Salim Ullah<sup>1</sup>, Siddharth Gupta<sup>2</sup>, Kapil Ahuja<sup>2</sup>, Aruna Tiwari<sup>2</sup> and Akash Kumar<sup>1</sup><br /> <sup>1</sup>TU Dresden, DE; <sup>2</sup>IIT Indore, IN<br /> <em><b>Abstract</b><br /> Deep neural networks are a machine learning technique increasingly used in a variety of applications. However, the high memory and computation demands of deep neural networks often limit their deployment on embedded systems. Many recent works have addressed this problem by proposing different types of data quantization schemes. However, most of these techniques either require post-quantization retraining of deep neural networks or bear a significant loss in output accuracy. In this paper, we propose a novel quantization technique for the parameters of pre-trained deep neural networks. Our technique largely preserves the accuracy of the parameters and does not require retraining of the networks. Compared to a single-precision floating-point implementation, our proposed 8-bit quantization technique incurs only ∼1% and ∼0.4% loss in the top-1 and top-5 accuracies, respectively, for the VGG16 network on the ImageNet dataset.</em></td> </tr>
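<tr> <td style="width:40px;"> </td> <td><em>For intuition about IP3-5, here is one plausible "leading one plus a few bits" format suggested by the Log_2_Lead title (the exact format below is our guess for illustration, not taken from the paper).</em><br /> <pre><code>
import numpy as np

# Keep the leading one of each value plus k fraction bits: a power of
# two times a short mantissa in [1, 2).
def log2_lead_quantize(x, k=3):
    sign, mag = np.sign(x), np.abs(x)
    exp = np.floor(np.log2(np.where(mag > 0, mag, 1.0)))
    mant = np.round(mag / 2.0 ** exp * 2 ** k) / 2 ** k
    return np.where(mag > 0, sign * mant * 2.0 ** exp, 0.0)

w = np.array([0.2371, -0.0912, 0.731])
print(log2_lead_quantize(w))   # close to the inputs with only k=3 bits
</code></pre></td> </tr>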
<tr> <td style="width:40px;">IP3-6</td> <td><b>FAULT DIAGNOSIS OF VIA-SWITCH CROSSBAR IN NON-VOLATILE FPGA</b><br /> <b>Speaker</b>:<br /> Ryutaro Doi, Osaka University, JP<br /> <b>Authors</b>:<br /> Ryutaro DOI<sup>1</sup>, Xu Bai<sup>2</sup>, Toshitsugu Sakamoto<sup>2</sup> and Masanori Hashimoto<sup>1</sup><br /> <sup>1</sup>Osaka University, JP; <sup>2</sup>NEC Corporation, JP<br /> <em><b>Abstract</b><br /> FPGAs that exploit via-switches, a kind of non-volatile resistive RAM, for crossbar implementation are attracting attention due to their high integration density and energy efficiency. The via-switch crossbar is responsible for signal routing by changing the on/off-states of via-switches. To verify the via-switch crossbar functionality after manufacturing, fault testing that checks whether we can turn via-switches on and off normally is essential. This paper confirms that a general differential-pair comparator successfully discriminates the on/off-states of via-switches, and clarifies the fault modes of a via-switch by transistor-level SPICE simulation that injects stuck-on/off faults into the atom switch and the varistor, where a via-switch consists of two atom switches and two varistors. We then propose a fault diagnosis methodology that diagnoses the fault modes of each via-switch using the difference in comparator response between normal and faulty via-switches. The proposed method achieves 100% fault detection by checking the comparator responses after turning the via-switch on and off. When a via-switch contains a single faulty component, the diagnosis exactly identifies the faulty varistor or atom switch inside the faulty via-switch 100% of the time; with up to two faults, the fault diagnosis ratio is 79%.</em></td> </tr> <tr> <td style="width:40px;">IP3-7</td> <td><b>APPLYING RESERVATION-BASED SCHEDULING TO A µC-BASED HYPERVISOR: AN INDUSTRIAL CASE STUDY</b><br /> <b>Speaker</b>:<br /> Dirk Ziegenbein, Robert Bosch GmbH, DE<br /> <b>Authors</b>:<br /> Dakshina Dasari<sup>1</sup>, Paul Austin<sup>2</sup>, Michael Pressler<sup>1</sup>, Arne Hamann<sup>1</sup> and Dirk Ziegenbein<sup>1</sup><br /> <sup>1</sup>Robert Bosch GmbH, DE; <sup>2</sup>ETAS GmbH, GB<br /> <em><b>Abstract</b><br /> Existing software scheduling mechanisms do not suffice for emerging applications in the automotive space, which have the conflicting needs of performance and predictability. As a concrete case, we consider the ETAS light-weight hypervisor, a commercially viable solution in the automotive industry, deployed on multicore microcontrollers. We describe the architecture of the hypervisor and its current scheduling mechanisms based on Time Division Multiplexing. We next show how Reservation-based Scheduling (RBS) can be implemented in the ETAS LWHVR to efficiently use resources while also providing freedom from interference, and we explore design choices towards an efficient implementation of such a scheduler. With experiments from an industry use case, we also compare the performance of RBS and the existing scheduler in the hypervisor.</em></td> </tr>
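<tr> <td style="width:40px;"> </td> <td><em>For readers unfamiliar with the reservation-based scheduling used in IP3-7, the following textbook-style sketch (ours, not the ETAS implementation; budgets and periods are invented) shows how a budget/period reservation bounds interference between guests.</em><br /> <pre><code>
from dataclasses import dataclass

# A periodic reservation: the guest consumes budget while it runs and
# the budget is replenished every period, capping its CPU share.
@dataclass
class Reservation:
    budget_ms: float
    period_ms: float
    remaining: float = 0.0

    def replenish(self):
        self.remaining = self.budget_ms

    def run(self, requested_ms):
        granted = min(requested_ms, self.remaining)
        self.remaining -= granted
        return granted

vm = Reservation(budget_ms=3.0, period_ms=10.0)
for tick in range(3):          # three 10 ms periods
    vm.replenish()
    print(f"period {tick}: ran {vm.run(5.0):.1f} ms of 5.0 ms requested")
# the VM never exceeds 3 ms per 10 ms period, regardless of demand
</code></pre></td> </tr>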
<tr> <td style="width:40px;">IP3-8</td> <td><b>REAL-TIME ENERGY MONITORING IN IOT-ENABLED MOBILE DEVICES</b><br /> <b>Speaker</b>:<br /> Nitin Shivaraman, TUMCREATE, SG<br /> <b>Authors</b>:<br /> Nitin Shivaraman<sup>1</sup>, Seima Suriyasekaran<sup>1</sup>, Zhiwei Liu<sup>2</sup>, Saravanan Ramanathan<sup>1</sup>, Arvind Easwaran<sup>2</sup> and Sebastian Steinhorst<sup>3</sup><br /> <sup>1</sup>TUMCREATE, SG; <sup>2</sup>Nanyang Technological University, SG; <sup>3</sup>TU Munich, DE<br /> <em><b>Abstract</b><br /> With rapid advancements in the Internet of Things (IoT) paradigm, every electrical device in the near future is expected to have IoT capabilities. This enables fine-grained tracking of the individual energy consumption data of such devices, offering location-independent per-device billing and demand management. It thus abstracts from the location-based metering of state-of-the-art infrastructure, which traditionally aggregates at the building or household level to define the entity to be billed. However, such in-device energy metering is susceptible to manipulation and fraud. As a remedy, we propose a secure decentralized metering architecture that enables devices with IoT capabilities to measure their own energy consumption. In this architecture, the device-level consumption is additionally reported to a system-level aggregator that verifies distributed information from our decentralized metering systems and provides secure data storage using a blockchain, preventing data manipulation by untrusted entities. Through experimental evaluation, we show that the proposed architecture supports device mobility and enables location-independent monitoring of energy consumption.</em></td> </tr> <tr> <td style="width:40px;">IP3-9</td> <td><b>TOWARDS SPECIFICATION AND TESTING OF RISC-V ISA COMPLIANCE</b><br /> <b>Speaker</b>:<br /> Vladimir Herdt, University of Bremen, DE<br /> <b>Authors</b>:<br /> Vladimir Herdt<sup>1</sup>, Daniel Grosse<sup>2</sup> and Rolf Drechsler<sup>2</sup><br /> <sup>1</sup>University of Bremen, DE; <sup>2</sup>University of Bremen / DFKI, DE<br /> <em><b>Abstract</b><br /> Compliance testing for RISC-V is very important. Therefore, an official hand-written compliance test-suite is being actively developed. However, this requires significant manual effort, in particular to achieve high test coverage. In this paper we propose a test-suite specification mechanism in combination with a first set of instruction constraints and coverage requirements for the base RISC-V ISA. In addition, we present an automated method to generate a test-suite that satisfies the specification. Our evaluation demonstrates the effectiveness and potential of our method.</em></td> </tr> <tr> <td style="width:40px;">IP3-10</td> <td><b>POST-SILICON VALIDATION OF THE IBM POWER9 PROCESSOR</b><br /> <b>Speaker</b>:<br /> Hillel Mendelson, IBM, IL<br /> <b>Authors</b>:<br /> Tom Kolan<sup>1</sup>, Hillel Mendelson<sup>1</sup>, Vitali Sokhin<sup>1</sup>, Kevin Reick<sup>2</sup>, Elena Tsanko<sup>2</sup> and Gregory Wetli<sup>2</sup><br /> <sup>1</sup>IBM Research, IL; <sup>2</sup>IBM Systems, US<br /> <em><b>Abstract</b><br /> Due to the complexity of designs, post-silicon validation remains a major challenge with few systematic solutions. We provide an overview of the state-of-the-art post-silicon validation process used by IBM to verify its latest IBM POWER9 processor. During the POWER9 post-silicon validation, we detected and handled 30% more logic bugs in 80% of the time, as compared to the previous IBM POWER8 bring-up. This improvement is the result of lessons learned from previous designs, leading to numerous innovations. We provide bug analysis data and compare it to POWER8 results. We demonstrate our methodology by describing several bugs from fail detection to root cause.</em></td> </tr> <tr> <td style="width:40px;">IP3-11</td> <td><b>ON THE TASK MAPPING AND SCHEDULING FOR DAG-BASED EMBEDDED VISION APPLICATIONS ON HETEROGENEOUS MULTI/MANY-CORE ARCHITECTURES</b><br /> <b>Speaker</b>:<br /> Nicola Bombieri, Università di Verona, IT<br /> <b>Authors</b>:<br /> Stefano Aldegheri<sup>1</sup>, Nicola Bombieri<sup>1</sup> and Hiren Patel<sup>2</sup><br /> <sup>1</sup>Università di Verona, IT; <sup>2</sup>University of Waterloo, CA<br /> <em><b>Abstract</b><br /> In this work, we show that applying the heterogeneous earliest finish time (HEFT) heuristic to the task scheduling of embedded vision applications can improve system performance by up to 70% with respect to state-of-the-art scheduling solutions. We propose an algorithm called exclusive earliest finish time (XEFT) that introduces the notion of exclusive overlap between application primitives to improve the load balancing. We show that XEFT can improve system performance by up to 33% over HEFT, and 82% over state-of-the-art approaches. We present results on different benchmarks, including a real-world localization and mapping application (ORB-SLAM) combined with the NVIDIA object detection application based on deep learning.</em></td> </tr>
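<tr> <td style="width:40px;"> </td> <td><em>A compact, textbook rendering of HEFT-style list scheduling as referenced in IP3-11 (our sketch, not the authors' XEFT; the DAG and costs are invented and communication costs are omitted for brevity): rank tasks by upward rank, then place each on the processor giving the earliest finish time.</em><br /> <pre><code>
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
cost = {"A": [2, 3], "B": [4, 2], "C": [3, 3], "D": [2, 1]}  # per processor

def upward_rank(t, memo={}):                 # memoized recursion
    if t not in memo:
        avg = sum(cost[t]) / len(cost[t])
        memo[t] = avg + max((upward_rank(s) for s in succ[t]), default=0)
    return memo[t]

ready_at = [0.0, 0.0]                        # per-processor availability
finish = {}
for t in sorted(succ, key=upward_rank, reverse=True):
    est = max((finish[p] for p in succ if t in succ[p]), default=0.0)
    # greedy choice: processor with the earliest finish time for t
    p = min(range(2), key=lambda q: max(ready_at[q], est) + cost[t][q])
    start = max(ready_at[p], est)
    finish[t] = start + cost[t][p]
    ready_at[p] = finish[t]
    print(f"task {t} -> proc {p}, start {start}, finish {finish[t]}")
</code></pre></td> </tr>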
<tr> <td style="width:40px;">IP3-12</td> <td><b>ARE CLOUD FPGAS REALLY VULNERABLE TO POWER ANALYSIS ATTACKS?</b><br /> <b>Speaker</b>:<br /> Ognjen Glamocanin, EPFL, CH<br /> <b>Authors</b>:<br /> Ognjen Glamocanin<sup>1</sup>, Louis Coulon<sup>1</sup>, Francesco Regazzoni<sup>2</sup> and Mirjana Stojilovic<sup>1</sup><br /> <sup>1</sup>EPFL, CH; <sup>2</sup>ALaRI, CH<br /> <em><b>Abstract</b><br /> Recent works have demonstrated the possibility of extracting secrets from a cryptographic core running on an FPGA by means of remote power analysis attacks. To mount these attacks, an adversary implements a voltage fluctuation sensor in the FPGA logic, records the power consumption of the target cryptographic core, and recovers the secret key by running a power analysis attack on the recorded traces. Despite showing that the power analysis could also be performed without physical access to the cryptographic core, these works were mostly carried out on dedicated FPGA boards in a controlled environment, leaving open the question of whether these attacks can be successfully mounted on a real system deployed in the cloud. In this paper, we demonstrate, for the first time, a successful key recovery attack on an AES cryptographic accelerator running on an Amazon EC2 F1 instance. We collect the power traces using a delay-line based voltage drop sensor, adapted to the Xilinx Virtex UltraScale+ architecture used on Amazon EC2 F1, where CARRY8 blocks do not have a monotonic delay increase at their outputs. Our results demonstrate that the security concerns raised by multi-tenant FPGAs are indeed valid and that countermeasures should be put in place to mitigate them.</em></td> </tr>
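<tr> <td style="width:40px;"> </td> <td><em>To illustrate the trace-analysis step behind attacks like IP3-12, here is a classic first-order correlation power analysis on a toy leakage model (our sketch with a stand-in S-box, not the authors' delay-sensor flow).</em><br /> <pre><code>
import numpy as np

rng = np.random.RandomState(0)
SBOX = rng.permutation(256)                 # stand-in for the AES S-box
HW = np.array([bin(v).count("1") for v in range(256)])

key, n = 0x3C, 2000
pt = rng.randint(0, 256, n)                 # known plaintext bytes
traces = HW[SBOX[pt ^ key]] + rng.normal(0, 1.0, n)   # leakage + noise

def corr(a, b):                             # Pearson correlation
    a, b = a - a.mean(), b - b.mean()
    return (a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum())

# correlate hypothetical leakage for every key guess with the traces
scores = [corr(HW[SBOX[pt ^ g]], traces) for g in range(256)]
print("best guess: 0x%02X" % int(np.argmax(scores)))   # -> 0x3C
</code></pre></td> </tr>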
<tr> <td style="width:40px;">IP3-13</td> <td><b>EFFICIENT TRAINING ON EDGE DEVICES USING ONLINE QUANTIZATION</b><br /> <b>Speaker</b>:<br /> Michael Ostertag, University of California, San Diego, US<br /> <b>Authors</b>:<br /> Michael Ostertag<sup>1</sup>, Sarah Al-Doweesh<sup>2</sup> and Tajana Rosing<sup>1</sup><br /> <sup>1</sup>University of California, San Diego, US; <sup>2</sup>King Abdulaziz City for Science and Technology, SA<br /> <em><b>Abstract</b><br /> Sensor-specific calibration functions offer superior performance over global models and single-step calibration procedures but require prohibitive levels of sampling in the input feature space. Sensor self-calibration by gathering training data through collaborative calibration or by self-analyzing predictive results allows these sensors to gather sufficient information. Resource-constrained edge devices are then stuck between high communication costs for transmitting training data to a centralized server and high memory requirements for storing data locally. We propose online dataset quantization, which maximizes the diversity of input features, maintaining a representative set of data from a larger stream of training data points. We test the effectiveness of online dataset quantization on two real-world datasets: air quality calibration and power prediction modeling. Online dataset quantization outperforms reservoir sampling and performs on par with offline methods.</em></td> </tr> <tr> <td style="width:40px;">IP3-14</td> <td><b>MULTI-AGENT ACTOR-CRITIC METHOD FOR JOINT DUTY-CYCLE AND TRANSMISSION POWER CONTROL</b><br /> <b>Speaker</b>:<br /> Sota Sawaguchi, CEA-Leti, FR<br /> <b>Authors</b>:<br /> Sota Sawaguchi<sup>1</sup>, Jean-Frédéric Christmann<sup>2</sup>, Anca Molnos<sup>2</sup>, Carolynn Bernier<sup>2</sup> and Suzanne Lesecq<sup>2</sup><br /> <sup>1</sup>CEA, FR; <sup>2</sup>CEA-Leti, FR<br /> <em><b>Abstract</b><br /> Energy-harvesting Internet of Things (EH-IoT) wireless networks have gained attention due to their potential for perpetual, maintenance-free operation. However, maintaining energy neutral operation (ENO) of EH-IoT devices, such that the harvested and consumed energy are matched during a certain time period, is crucial. Guaranteeing this ENO condition and an optimal power-performance trade-off under various workloads and transient wireless channel quality is particularly challenging. This paper proposes a multi-agent actor-critic method for modulating both the transmission duty-cycle and the transmitter output power based on the state-of-buffer (SoB) and the state-of-charge (SoC) information as a state. Thanks to these buffers, system uncertainties, especially harvested energy and wireless link conditions, are addressed effectively. Unlike the state of the art, our solution requires neither a model of the wireless transceiver nor any measurement of wireless channel quality. Simulation results of a solar-powered EH-IoT node using real-life outdoor solar irradiance data show that the proposed method achieves better performance without system failures throughout a year, compared to the state of the art, which suffers some system downtime. Our approach also predicts almost no system failures during five years of operation. This shows that our approach can adapt to changes in energy harvesting and wireless channel quality, all without direct observations.</em></td> </tr> </tbody> </table> <hr /> <h2 id="UB08">UB08 Session 8</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 16:00 - 18:00<br /> <b>Location / Room:</b> Booth 11, Exhibition Area</p> <table> <tbody> <tr> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr>
<tr> <td>UB08.1</td> <td><b>LAGARTO: FIRST SILICON RISC-V ACADEMIC PROCESSOR DEVELOPED IN SPAIN</b><br /> <b>Authors</b>:<br /> Guillem Cabo Pitarch<sup>1</sup>, Cristobal Ramirez Lazo<sup>1</sup>, Julian Pavon Rivera<sup>1</sup>, Vatistas Kostalabros<sup>1</sup>, Carlos Rojas Morales<sup>1</sup>, Miquel Moreto<sup>1</sup>, Jaume Abella<sup>1</sup>, Francisco J. Cazorla<sup>1</sup>, Adrian Cristal<sup>1</sup>, Roger Figueras<sup>1</sup>, Alberto Gonzalez<sup>1</sup>, Carles Hernandez<sup>1</sup>, Cesar Hernandez<sup>2</sup>, Neiel Leyva<sup>2</sup>, Joan Marimon<sup>1</sup>, Ricardo Martinez<sup>3</sup>, Jonnatan Mendoza<sup>1</sup>, Francesc Moll<sup>4</sup>, Marco Antonio Ramirez<sup>2</sup>, Carlos Rojas<sup>1</sup>, Antonio Rubio<sup>4</sup>, Abraham Ruiz<sup>1</sup>, Nehir Sonmez<sup>1</sup>, Lluis Teres<sup>3</sup>, Osman Unsal<sup>5</sup>, Mateo Valero<sup>1</sup>, Ivan Vargas<sup>1</sup> and Luis Villa<sup>2</sup><br /> <sup>1</sup>BSC / UPC, ES; <sup>2</sup>CIC-IPN, MX; <sup>3</sup>IMB-CNM (CSIC), ES; <sup>4</sup>UPC, ES; <sup>5</sup>BSC, ES<br /> <em><b>Abstract</b><br /> Open hardware has emerged in recent years and has the potential to be as disruptive as Linux, the open-source software paradigm, once was. Just as Linux lessened users' dependence on large companies providing software and software applications, hardware based on open-source ISAs is envisioned to do the same in its own field. Four research institutions were involved in the Lagarto tapeout: Centro de Investigación en Computación of the Mexican IPN, Centro Nacional de Microelectrónica of the CSIC, Universitat Politècnica de Catalunya (UPC) and Barcelona Supercomputing Center (BSC). As a result, many bachelor, master and PhD students had the chance to gain real-world experience with ASIC design and deliver a functional SoC. In the booth, you will find a live demo of the first ASIC, plus FPGA prototypes of the next versions of the SoC and core.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3104.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB08.2</td> <td><b>A DIGITAL MICROFLUIDICS BIO-COMPUTING PLATFORM</b><br /> <b>Authors</b>:<br /> Georgi Tanev, Luca Pezzarossa, Winnie Edith Svendsen and Jan Madsen, TU Denmark, DK<br /> <em><b>Abstract</b><br /> Digital microfluidics is a lab-on-a-chip (LOC) technology used to actuate small amounts of liquids on an array of individually addressable electrodes. Microliter-sized droplets can be programmatically dispensed, moved, mixed, and split in a controlled environment, which, combined with miniaturized sensing techniques, makes LOC suitable for a broad range of applications in the field of medical diagnostics and synthetic biology. Furthermore, a programmable digital microfluidics platform holds the potential to add a "fluidic subsystem" to the classical computation model, thus opening the door to cyber-physical bio-processors. To facilitate the programming and operation of such bio-fluidic computing, we propose a dedicated instruction set architecture and virtual machine. A set of digital microfluidic core instructions as well as classic computing operations are executed on a virtual machine, which decouples the protocol execution from the LOC functionality.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3103.pdf">More information ...</a></b></em></td> </tr>
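<tr> <td> </td> <td><em>A minimal sketch of a droplet instruction set and virtual machine in the spirit of UB08.2 (the opcodes and their semantics are invented here for illustration, not taken from the authors' ISA): the VM tracks droplets by electrode pad and interprets a tiny fluidic program.</em><br /> <pre><code>
droplets = {}                                # droplet id -> electrode pad

def execute(program):
    for op, *args in program:
        if op == "DISPENSE":
            droplets[args[0]] = args[1]      # new droplet at a pad
        elif op == "MOVE":
            droplets[args[0]] = args[1]
        elif op == "MIX":                    # merge droplet b into a
            droplets[args[0]] = droplets.pop(args[1])
        elif op == "SPLIT":                  # new droplet on adjacent pad
            droplets[args[1]] = droplets[args[0]] + 1
        else:
            raise ValueError(f"unknown opcode {op}")

execute([
    ("DISPENSE", "sample", 0),
    ("DISPENSE", "reagent", 7),
    ("MOVE", "reagent", 1),
    ("MIX", "sample", "reagent"),
    ("SPLIT", "sample", "aliquot"),
])
print(droplets)   # {'sample': 1, 'aliquot': 2}
</code></pre></td> </tr>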
<tr> <td>UB08.3</td> <td><b>DISTRIBUTING TIME-SENSITIVE APPLICATIONS ON EDGE COMPUTING ENVIRONMENTS</b><br /> <b>Authors</b>:<br /> Eudald Sabaté Creixell<sup>1</sup>, Unai Perez Mendizabal<sup>1</sup>, Elli Kartsakli<sup>2</sup>, Maria A. Serrano Gracia<sup>3</sup> and Eduardo Quiñones Moreno<sup>3</sup><br /> <sup>1</sup>BSC / UPC, ES; <sup>2</sup>BSC, GR; <sup>3</sup>BSC, ES<br /> <em><b>Abstract</b><br /> The proposed demonstration aims to showcase the capabilities of a task-based distributed programming framework for the execution of real-time applications in edge computing scenarios, in the context of smart cities. Edge computing shifts the computation close to the data source, alleviating the pressure on the cloud and reducing application response times. However, the development and deployment of distributed real-time applications is complex, due to the heterogeneous and dynamic edge environment where resources may not always be available. To address these challenges, our demo employs COMPSs, a highly portable and infrastructure-agnostic programming model, to efficiently distribute time-sensitive applications across the compute continuum. We will exhibit how COMPSs distributes the workload on different edge devices (e.g., NVIDIA GPUs and a Raspberry Pi), and how COMPSs re-adapts this distribution upon the availability (connection or disconnection) of devices.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3108.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB08.5</td> <td><b>LEARNV: A RISC-V BASED EMBEDDED SYSTEM DESIGN FRAMEWORK FOR EDUCATION AND RESEARCH DEVELOPMENT</b><br /> <b>Authors</b>:<br /> Noureddine Ait Said and Mounir Benabdenbi, TIMA Laboratory, FR<br /> <em><b>Abstract</b><br /> Designing a modern System on a Chip is based on the joint design of hardware and software (co-design). However, understanding the tight relationship between hardware and software is not straightforward. Moreover, validating new concepts in SoC design from the idea to the hardware implementation is time-consuming and often slowed by legacy issues (intellectual property of hardware blocks and expensive commercial tools). To overcome these issues, we propose to use the open-source Rocket Chip environment for educational purposes, combined with the open-source lowRISC architecture to implement a custom SoC design on an FPGA board. The demonstration will present how students and engineers can benefit from the environment to deepen their knowledge of HW and SW co-design. Using the lowRISC architecture, an image classification application based on CNNs will serve as a demonstrator of the whole open-source hardware and software flow and will be mapped on a Nexys A7 FPGA board.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3116.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB08.6</td> <td><b>SRSN: SECURE RECONFIGURABLE TEST NETWORK</b><br /> <b>Authors</b>:<br /> Vincent Reynaud<sup>1</sup>, Emanuele Valea<sup>2</sup>, Paolo Maistri<sup>1</sup>, Regis Leveugle<sup>1</sup>, Marie-Lise Flottes<sup>2</sup>, Sophie Dupuis<sup>2</sup>, Bruno Rouzeyre<sup>2</sup> and Giorgio Di Natale<sup>1</sup><br /> <sup>1</sup>TIMA Laboratory, FR; <sup>2</sup>LIRMM, FR<br /> <em><b>Abstract</b><br /> The critical importance of testability for electronic devices led to the development of IEEE test standards. These methods, if not protected, offer a security backdoor to attackers.
This demonstrator illustrates a state-of-the-art solution that prevents unauthorized usage of the test infrastructure, based on the IEEE 1687 standard and implemented on an FPGA target.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3112.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB08.7</td> <td><b>RETINE: A PROGRAMMABLE 3D STACKED VISION CHIP ENABLING LOW LATENCY IMAGE ANALYSIS</b><br /> <b>Authors</b>:<br /> Stéphane Chevobbe<sup>1</sup>, Maria Lepecq<sup>1</sup> and Laurent Millet<sup>2</sup><br /> <sup>1</sup>CEA LIST, FR; <sup>2</sup>CEA-Leti, FR<br /> <em><b>Abstract</b><br /> We have developed and fabricated a 3D-stacked imager called RETINE, composed of two layers based on the replication of a programmable 3D tile in a matrix manner, providing a highly parallel programmable architecture. This tile is composed of a 16x16 BSI binned pixel array with its associated readout and 16 column ADCs on the first layer, coupled to an efficient SIMD processor of 16 PEs on the second layer. The RETINE prototype achieves high video rates, from 5500 fps in binned mode to 340 fps in full resolution mode. It operates at 80 MHz with 720 mW power consumption, leading to 85 GOPS/W power efficiency. To highlight the capabilities of the RETINE chip, we have developed a demonstration platform with an electronic board embedding a RETINE chip that films rotating disks. Three scenarios are available: high-speed image capture, slow motion, and composed image capture with parallel processing during acquisition.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3113.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB08.8</td> <td><b>FASTHERMSIM: FAST AND ACCURATE THERMAL SIMULATIONS FROM CHIPLETS TO SYSTEM</b><br /> <b>Authors</b>:<br /> Yu-Min Lee, Chi-Wen Pan, Li-Rui Ho and Hong-Wen Chiou, National Chiao Tung University, TW<br /> <em><b>Abstract</b><br /> Recently, owing to the scaling down of technology and 2.5D/3D integration, the power densities and temperatures of chips have been increasing significantly. Though commercial computational fluid dynamics tools can provide accurate thermal maps, they may lead to inefficiency in thermal-aware design because of their huge runtime. Thus, we develop the chip/package/system-level thermal analyzer, called FasThermSim, which can assist you in improving your design under thermal constraints in pre/post-silicon stages. In FasThermSim, we consider three heat transfer modes: conduction, convection, and thermal radiation. We convert them to temperature-independent terms by linearization methods and build a compact thermal model (CTM). By applying numerical methods to the CTM, the steady-state and transient thermal profiles can be solved efficiently without loss of accuracy. Finally, an easy-to-use thermal analysis tool with a graphical user interface is implemented for your design, which is flexible and compatible.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3137.pdf">More information ...</a></b></em></td> </tr>
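<tr> <td> </td> <td><em>For intuition about the compact thermal model in UB08.8, the following sketch (a generic RC thermal network with made-up values, not FasThermSim itself) solves a steady-state profile from the linear system G T = P.</em><br /> <pre><code>
import numpy as np

# Steady state of a compact thermal model: node temperatures solve
# G @ T = P, with G the thermal-conductance matrix (W/K) and P the
# power injected at each node (W).
G = np.array([[ 0.8, -0.3,  0.0],    # made-up 3-node network;
              [-0.3,  0.9, -0.4],    # diagonal includes the path
              [ 0.0, -0.4,  0.6]])   # to ambient
P = np.array([2.0, 1.0, 0.5])

T_rise = np.linalg.solve(G, P)       # kelvin above ambient
print(T_rise.round(1))
</code></pre></td> </tr>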
<tr> <td>UB08.9</td> <td><b>PA-HLS: HIGH-LEVEL ANNOTATION OF ROUTING CONGESTION FOR XILINX VIVADO HLS DESIGNS</b><br /> <b>Authors</b>:<br /> Osama Bin Tariq<sup>1</sup>, Junnan Shan<sup>1</sup>, Luciano Lavagno<sup>1</sup>, Georgios Floros<sup>2</sup>, Mihai Teodor Lazarescu<sup>1</sup>, Christos Sotiriou<sup>2</sup> and Mario Roberto Casu<sup>1</sup><br /> <sup>1</sup>Politecnico di Torino, IT; <sup>2</sup>University of Thessaly, GR<br /> <em><b>Abstract</b><br /> We will demo a novel high-level back-annotation flow that reports routing congestion issues at the C++ source level by analyzing reports from FPGA physical design (Xilinx Vivado) and internal debugging files of the Vivado HLS tool. The flow annotates the C++ source code, identifying likely causes of congestion, e.g., on-chip memories or the DSP units. These shared resources often cause routing problems on FPGAs because they cannot be duplicated by physical design. We demonstrate on realistic large designs how the information provided by our flow can be used both to identify congestion issues at the C++ source level and to solve them using HLS directives. The main demo steps are: (1) extraction of the source-level debugging information from the Vivado HLS database; (2) generation of a list of net names involved in congestion areas, and of their relative significance, from the Vivado post-global-routing database; (3) visualization of the C++ code lines that contribute most to congestion.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3123.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB08.10</td> <td><b>MDD-COP: A PRELIMINARY TOOL FOR MODEL-DRIVEN DEVELOPMENT EXTENDED WITH LAYER DIAGRAM FOR CONTEXT-ORIENTED PROGRAMMING</b><br /> <b>Authors</b>:<br /> Harumi Watanabe<sup>1</sup>, Chinatsu Yamamoto<sup>1</sup>, Takeshi Ohkawa<sup>1</sup>, Mikiko Sato<sup>1</sup>, Nobuhiko Ogura<sup>2</sup> and Mana Tabei<sup>1</sup><br /> <sup>1</sup>Tokai University, JP; <sup>2</sup>Tokyo City University, JP<br /> <em><b>Abstract</b><br /> This presentation introduces a preliminary tool for Model-Driven Development (MDD) that generates programs for Context-Oriented Programming (COP). In modern embedded systems, such as IoT and Industry 4.0 systems, software must provide multiple services that follow changes in the surrounding environment. COP is helpful for programming such software: the surrounding environments and the multiple services can be considered as contexts and layers, respectively. Even though MDD is a powerful technique for developing such modern systems, work on modeling for COP is limited, and no prior work addresses the relation between UML (Unified Modeling Language) and COP. To solve this problem, we generate COP code from a layer diagram that extends the UML package diagram with stereotypes. In our approach, users draw a layer diagram and other UML diagrams; then xtUML, a major MDD tool, generates XML code with layer information for COP; finally, our tool generates COP code from the XML code.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3130.pdf">More information ...</a></b></em></td> </tr>
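<tr> <td> </td> <td><em>A hand-rolled illustration of the context-oriented programming targeted by UB08.10 (ours, not the tool's generated code; contexts and behaviours are invented): behaviour variations live in layers that are activated when the corresponding context holds.</em><br /> <pre><code>
def base():
    return "render: normal UI"

layers = {                      # layered behaviour variations
    "low_battery": lambda: "render: dimmed UI",
    "driving":     lambda: "render: voice-only UI",
}
active = []                     # activated contexts, most recent wins

def activate(ctx):
    active.append(ctx)

def render():
    for ctx in reversed(active):
        if ctx in layers:
            return layers[ctx]()    # variation from the active layer
    return base()                   # no active layer -> base behaviour

print(render())                     # render: normal UI
activate("low_battery")
print(render())                     # render: dimmed UI
</code></pre></td> </tr>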
<tr> <td>18:00</td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="8.1">8.1 Special Day on "Embedded AI": Neuromorphic chips and systems</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 17:00 - 18:30<br /> <b>Location / Room:</b> Amphithéâtre Jean Prouve</p> <p><b>Chair:</b><br /> Wei Lu, University of Michigan, US</p> <p><b>Co-Chair:</b><br /> Bernabe Linares-Barranco, CSIC, ES</p> <p>Within the global field of AI, there is a subfield that focuses on exploiting neuroscience knowledge for artificially intelligent hardware systems: the neuromorphic engineering field. This session presents some examples of AI research in this subfield.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>8.1.1</td> <td><b>SPINNAKER2: A PLATFORM FOR BIO-INSPIRED ARTIFICIAL INTELLIGENCE AND BRAIN SIMULATION</b><br /> <b>Authors</b>:<br /> Bernhard Vogginger, Christian Mayr, Sebastian Höppner, Johannes Partzsch and Steve Furber, TU Dresden, DE<br /> <em><b>Abstract</b><br /> SpiNNaker is an ARM-based processor platform optimized for the simulation of spiking neural networks. This brief describes the roadmap in going from the current SpiNNaker1 system, a 1-million-core machine in 130nm CMOS, to SpiNNaker2, a 10-million-core machine in 22nm FDSOI. Apart from pure scaling, we will take advantage of specific technology features, such as runtime adaptive body biasing, to deliver cutting-edge power consumption. Power management of the cores allows a wide range of workload adaptivity, i.e., processor power scales with the complexity and activity of the spiking network. Additional numerical accelerators will enhance the utility of SpiNNaker2 for the simulation of spiking neural networks as well as for executing conventional deep neural networks. The interplay between these two domains will provide a wide field for bio-inspired algorithm exploration on SpiNNaker2, bringing machine learning and neuromorphics closer together. Apart from the platform's traditional usage as a neuroscience exploration tool, the extended functionality opens up new application areas such as automotive AI, tactile internet, Industry 4.0 and biomedical processing.</em></td> </tr> <tr> <td>17:30</td> <td>8.1.2</td> <td><b>AN ON-CHIP LEARNING ACCELERATOR FOR SPIKING NEURAL NETWORKS USING STT-RAM CROSSBAR ARRAYS</b><br /> <b>Authors</b>:<br /> Shruti R. Kulkarni, Shihui Yin, Jae-sun Seo and Bipin Rajendran, New Jersey Institute of Technology, US<br /> <em><b>Abstract</b><br /> In this work, we present a scheme for implementing learning on a digital non-volatile memory (NVM) based hardware accelerator for Spiking Neural Networks (SNNs). Our design estimates across three prominent non-volatile memories - Phase Change Memory (PCM), Resistive RAM (RRAM), and Spin Transfer Torque RAM (STT-RAM) - show that the STT-RAM arrays enable at least 2× higher throughput compared to the other two memory technologies.
We discuss the design and the signal communication framework through the STT-RAM crossbar array for training and inference in SNNs. Each STT-RAM cell in the array stores a single bit value. Our neurosynaptic computational core consists of the memory crossbar array and its read/write peripheral circuitry and the digital logic for the spiking neurons, weight update computations, spike router, and decoder for incoming spike packets. Our STT-RAM based design shows ∼20× higher performance per unit Watt per unit area compared to a conventional SRAM-based design, making it a promising learning platform for realizing systems with significant area and power limitations.</em></td> </tr> <tr> <td>18:00</td> <td>8.1.3</td> <td><b>OVERCOMING CHALLENGES FOR ACHIEVING HIGH IN-SITU TRAINING ACCURACY WITH EMERGING MEMORIES</b><br /> <b>Speaker</b>:<br /> Shimeng Yu, Georgia Tech, US<br /> <b>Authors</b>:<br /> Shanshi Huang, Xiaoyu Sun, Xiaochen Peng, Hongwu Jiang and Shimeng Yu, Georgia Tech, US<br /> <em><b>Abstract</b><br /> Embedded artificial intelligence (AI) benefits from adaptive learning capability when deployed in the field; thus, in-situ on-chip training is required. Emerging non-volatile memories (eNVMs) are of great interest as analog synapses in deep neural network (DNN) on-chip acceleration due to their multilevel programmability. However, the asymmetry/nonlinearity in conductance tuning remains a grand challenge for achieving high in-situ training accuracy. In addition, the analog-to-digital converter (ADC) at the edge of the memory array introduces an additional challenge - quantization error - for in-memory computing. In this work, we gain new insights and overcome these challenges through an algorithm-hardware co-optimization. We incorporate these hardware non-ideal effects into the DNN propagation and weight update steps. We evaluate a VGG-like network on the CIFAR-10 dataset, and we show that the asymmetry of conductance tuning is no longer a limiting factor of in-situ training accuracy if adaptive "momentum" is exploited in the weight update rule. Even considering ADC quantization error, in-situ training accuracy can approach the software baseline. Our results show much relaxed requirements that enable a variety of eNVMs for DNN acceleration on embedded AI platforms.</em></td> </tr>
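<tr> <td> </td> <td> </td> <td><em>A behavioural sketch of the asymmetry problem discussed in 8.1.3 (a generic device model with invented constants, not the authors' measured data): potentiation and depression move a weight by different, state-dependent amounts, and a momentum term in the update rule averages out the mismatch.</em><br /> <pre><code>
import numpy as np

def device_step(w, direction, nl=3.0):
    # asymmetric, nonlinear conductance change, weight kept in [0, 1]
    if direction > 0:                          # potentiation saturates near 1
        return w + (1 - w) * (1 - np.exp(-nl)) / 20
    return w - w * (1 - np.exp(-nl)) / 8       # depression is stronger

w, velocity, target = 0.5, 0.0, 0.62
for _ in range(200):
    grad = w - target                          # toy loss: 0.5 * (w - target)**2
    velocity = 0.9 * velocity - 0.1 * grad     # momentum update
    w = device_step(w, np.sign(velocity))
print(round(w, 3))   # ends within a small band around the 0.62 target,
                     # despite the asymmetric step sizes
</code></pre></td> </tr>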
<tr> <td>18:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="8.2">8.2 We are all hackers: design and detection of security attacks</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 17:00 - 18:30<br /> <b>Location / Room:</b> Chamrousse</p> <p><b>Chair:</b><br /> Francesco Regazzoni, ALaRI, CH</p> <p><b>Co-Chair:</b><br /> Daniel Grosse, University of Bremen, DE</p> <p>This session deals with hardware Trojans and vulnerabilities, proposing detection techniques and design paradigms to model attacks. It describes attacks that leverage the exclusive characteristics of microfluidic devices and the malicious usage of energy management. On the defense side, an automated test generation approach for hardware Trojan detection using delay-based side-channel analysis is also presented.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>8.2.1</td> <td><b>AUTOMATED TEST GENERATION FOR TROJAN DETECTION USING DELAY-BASED SIDE CHANNEL ANALYSIS</b><br /> <b>Speaker</b>:<br /> Prabhat Mishra, University of Florida, US<br /> <b>Authors</b>:<br /> Yangdi Lyu and Prabhat Mishra, University of Florida, US<br /> <em><b>Abstract</b><br /> Side-channel analysis is widely used for hardware Trojan detection in integrated circuits by analyzing various side-channel signatures, such as timing, power and path delay. Existing delay-based side-channel analysis techniques have two major bottlenecks: (i) they are not suitable for detecting Trojans, since the delay difference between the golden design and a Trojan-inserted design is negligible, and (ii) they are not effective in creating robust delay signatures due to their reliance on random and ATPG-based test patterns. In this paper, we propose an efficient test generation technique to detect Trojans using delay-based side-channel analysis. This paper makes two important contributions. (1) We propose an automated test generation algorithm to produce test patterns that are likely to activate trigger conditions and drastically change critical paths. Compared to existing approaches, where the delay difference is solely based on the extra gates of a small Trojan, the change of critical paths by our approach leads to a significant difference in path delay. (2) We propose a fast and efficient reordering technique to maximize the delay deviation between the golden design and the Trojan-inserted design. Experimental results demonstrate that our approach significantly outperforms state-of-the-art approaches that rely on ATPG or random test patterns for delay-based side-channel analysis.</em></td> </tr>
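<tr> <td> </td> <td> </td> <td><em>A toy version of the decision rule underlying delay-based detection as in 8.2.1 (ours, not the authors' test generator; path counts, noise and margins are invented): flag a chip when any path delay deviates from the golden design beyond the process-noise margin.</em><br /> <pre><code>
import numpy as np

rng = np.random.default_rng(1)
golden = rng.uniform(1.0, 2.0, size=50)          # ns, per critical path
noise = lambda: rng.normal(0, 0.01, size=50)     # measurement noise

clean = golden + noise()
infected = golden + noise()
infected[17] += 0.12          # extra Trojan gates load one path

def is_suspicious(measured, margin=0.05):
    return bool(np.any(np.abs(measured - golden) > margin))

print(is_suspicious(clean))      # False
print(is_suspicious(infected))   # True
</code></pre></td> </tr>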
<tr> <td>17:30</td> <td>8.2.2</td> <td><b>MICROFLUIDIC TROJAN DESIGN IN FLOW-BASED BIOCHIPS</b><br /> <b>Speaker</b>:<br /> Shayan Mohammed, New York University, US<br /> <b>Authors</b>:<br /> Shayan Mohammed<sup>1</sup>, Sukanta Bhattacharjee<sup>2</sup>, Yong-Ak Song<sup>2</sup>, Krishnendu Chakrabarty<sup>3</sup> and Ramesh Karri<sup>1</sup><br /> <sup>1</sup>New York University, US; <sup>2</sup>New York University Abu Dhabi, AE; <sup>3</sup>Duke University, US<br /> <em><b>Abstract</b><br /> Microfluidic technologies find application in various safety-critical fields such as medical diagnostics, drug research, and cell analysis. Recent work has focused on security threats to microfluidic-based cyberphysical systems and defenses. So far, the threat analysis has been limited to cases of tampering with control software/hardware, which is common to most cyberphysical control systems in general; in that sense, such an approach is not exclusive to microfluidics. In this paper, we present a stealthy attack paradigm that uses characteristics exclusive to microfluidic devices - a microfluidic Trojan. The proposed Trojan payload is a valve whose height has been perturbed to vary its pressure response. This Trojan can be triggered in multiple ways, based on time or on specific operations. These triggers can occur naturally in a bioassay or be added into the controlling software. We showcase the Trojan's application in carrying out practical attacks - contamination, parameter-tampering, and denial-of-service - on a real-life bioassay implementation. Further, we present guidelines to launch stealthy attacks and to counter them.</em></td> </tr> <tr> <td>18:00</td> <td>8.2.3</td> <td><b>TOWARDS MALICIOUS EXPLOITATION OF ENERGY MANAGEMENT MECHANISMS</b><br /> <b>Speaker</b>:<br /> Safouane Noubir, École Polytechnique de l'Université de Nantes, FR<br /> <b>Authors</b>:<br /> Safouane Noubir, Maria Mendez Real and Sebastien Pillement, École Polytechnique de l'Université de Nantes, FR<br /> <em><b>Abstract</b><br /> Architectures are becoming more and more complex to keep up with the increase in algorithmic complexity. To fully exploit those architectures, dynamic resource managers are required. The goal of dynamic managers is either to optimize resource usage (e.g. cores, memory) or to reduce energy consumption under performance constraints. However, since performance optimization is their main goal, they have not been designed to be secure, and they present vulnerabilities. Recently, it has been proven that energy managers can be exploited to cause faults within a processor, allowing an attacker to steal information from a user device. However, this exploitation is often not possible in current commercial devices. In this work, we expose further security vulnerabilities through another type of malicious usage of energy management: our experiments show that it is possible to remotely lock out a device, denying access to all services and data and requiring, for example, the user to pay a ransom to unlock it. The main targets of this exploit are embedded systems, and we demonstrate this work by implementing it on two different commercial ARM-based devices.</em></td> </tr> <tr> <td style="width:40px;">18:30</td> <td><a href="#IP4">IP4-1</a>, 551</td> <td><b>HIT: A HIDDEN INSTRUCTION TROJAN MODEL FOR PROCESSORS</b><br /> <b>Speaker</b>:<br /> Jiaqi Zhang, Tongji University, CN<br /> <b>Authors</b>:<br /> Jiaqi Zhang<sup>1</sup>, Ying Zhang<sup>1</sup>, Huawei Li<sup>2</sup> and Jianhui Jiang<sup>3</sup><br /> <sup>1</sup>Tongji University, CN; <sup>2</sup>Chinese Academy of Sciences, CN; <sup>3</sup>School of Software Engineering, Tongji University, CN<br /> <em><b>Abstract</b><br /> This paper explores an intrusion mechanism for microprocessors using illegal instructions, namely the hidden instruction Trojan (HIT). It uses a low-probability sequence of normal instructions as a boot sequence, followed by an illegal instruction to trigger the Trojan. The payload is a hidden interrupt that forces the program counter to a specific address, so the program at that address gains super privileges. Meanwhile, we use integer programming to minimize the trigger probability of HIT within a given area overhead. The experimental results demonstrate that HIT has an extremely low trigger probability and can survive detection by existing test methods.</em></td> </tr> <tr> <td style="width:40px;">18:31</td> <td><a href="#IP4">IP4-2</a>, 658</td> <td><b>BITSTREAM MODIFICATION ATTACK ON SNOW 3G</b><br /> <b>Speaker</b>:<br /> Michail Moraitis, Royal Institute of Technology KTH, SE<br /> <b>Authors</b>:<br /> Michail Moraitis and Elena Dubrova, Royal Institute of Technology - KTH, SE<br /> <em><b>Abstract</b><br /> SNOW 3G is one of the core algorithms for confidentiality and integrity in several 3GPP wireless communication standards, including the new Next Generation (NG) 5G. It is believed to be resistant to classical cryptanalysis. In this paper, we show that SNOW 3G can be broken by a fault attack based on bitstream modification.
By changing the content of some look-up tables in the bitstream, we reduce the non-linear state-updating function of SNOW 3G to a linear one. As a result, it becomes possible to recover the key from a known plaintext-ciphertext pair. To the best of our knowledge, this is the first successful bitstream modification attack on SNOW 3G.</em></td> </tr> <tr> <td>18:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="8.3">8.3 Optimizing System-Level Design for Machine Learning</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 17:00 - 18:30<br /> <b>Location / Room:</b> Autrans</p> <p><b>Chair:</b><br /> Luciano Lavagno, Politecnico di Torino, IT</p> <p><b>Co-Chair:</b><br /> Yuko Hara-Azumi, Tokyo Institute of Technology, JP</p> <p>In recent years, the use of ML techniques, such as deep neural networks, has become a trend in system-level design, either to help the flow find promising solutions or to deploy ML-based applications. This session presents various approaches to optimize several aspects of system-level design, such as the mapping of applications onto heterogeneous platforms, the inference of CNNs, and file-system usage.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>8.3.1</td> <td><b>ESP4ML: PLATFORM-BASED DESIGN OF SYSTEMS-ON-CHIP FOR EMBEDDED MACHINE LEARNING</b><br /> <b>Speaker</b>:<br /> Davide Giri, Columbia University, US<br /> <b>Authors</b>:<br /> Davide Giri, Kuan-Lin Chiu, Giuseppe Di Guglielmo, Paolo Mantovani and Luca Carloni, Columbia University, US<br /> <em><b>Abstract</b><br /> We present ESP4ML, an open-source system-level design flow to build and program SoC architectures for embedded applications that require the hardware acceleration of machine learning and signal processing algorithms. We realized ESP4ML by combining two established open-source projects (ESP and HLS4ML) into a new, fully automated design flow. For the SoC integration of accelerators generated by HLS4ML, we designed a set of new parameterized interface circuits synthesizable with high-level synthesis. For accelerator configuration and management, we developed an embedded software runtime system on top of Linux. With this HW/SW layer, we addressed the challenge of dynamically shaping the data traffic on a network-on-chip to activate and support the reconfigurable pipelines of accelerators needed by the application workloads currently running on the SoC. We demonstrate our vertically integrated contributions with FPGA-based implementations of complete SoC instances booting Linux and executing computer-vision applications that process images taken from the Google Street View database.</em></td> </tr> <tr> <td>17:30</td> <td>8.3.2</td> <td><b>PROBABILISTIC SEQUENTIAL MULTI-OBJECTIVE OPTIMIZATION OF CONVOLUTIONAL NEURAL NETWORKS</b><br /> <b>Speaker</b>:<br /> Zixuan Yin, McGill University, CA<br /> <b>Authors</b>:<br /> Zixuan Yin, Warren Gross and Brett Meyer, McGill University, CA<br /> <em><b>Abstract</b><br /> With the advent of deeper, larger and more complex convolutional neural networks (CNNs), manual design has become a daunting task, especially when hardware performance must be optimized.
Sequential model-based optimization (SMBO) is an efficient method for hyperparameter optimization of highly parameterized machine learning (ML) algorithms; it can find good configurations with a limited number of evaluations by predicting the performance of candidates before evaluation. A case study on MNIST shows that SMBO regression-model prediction error significantly impedes search performance in multi-objective optimization. To address this issue, we propose probabilistic SMBO, which selects candidates based on a probabilistic estimation of their Pareto efficiency. With a formulation that incorporates error in accuracy prediction and uncertainty in latency measurement, probabilistic Pareto efficiency quantifies a candidate's quality in two ways: its likelihood of being Pareto optimal, and the expected number of current Pareto-optimal solutions that it will dominate. We evaluate our proposed method on four image classification problems. Compared to a deterministic approach, probabilistic SMBO consistently generates Pareto-optimal solutions that perform better, and that are competitive with state-of-the-art efficient CNN models, offering tremendous speedup in inference latency while maintaining comparable accuracy.</em></td> </tr>
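<tr> <td> </td> <td> </td> <td><em>A Monte Carlo rendering of the probabilistic Pareto efficiency described in 8.3.2 (our illustration, not the authors' formulation; the incumbent points and prediction uncertainties are invented): sample from the predictive distributions and count how often, and how many, incumbents a candidate dominates.</em><br /> <pre><code>
import numpy as np

rng = np.random.default_rng(0)
pareto = np.array([[0.92, 30.0], [0.90, 22.0]])   # (accuracy, latency ms)

def pareto_efficiency(mean, std, n=10_000):
    acc = rng.normal(mean[0], std[0], n)          # predicted accuracy
    lat = rng.normal(mean[1], std[1], n)          # measured latency
    # dominate a point: accuracy at least as high AND latency no worse
    dom = np.logical_and(acc[:, None] >= pareto[:, 0],
                         pareto[:, 1] >= lat[:, None])
    return dom.any(axis=1).mean(), dom.sum(axis=1).mean()

p_any, expected = pareto_efficiency(mean=(0.91, 24.0), std=(0.01, 2.0))
print(f"P(dominates someone) = {p_any:.2f}, expected dominated = {expected:.2f}")
</code></pre></td> </tr>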
<tr> <td>18:00</td> <td>8.3.3</td> <td><b>ARS: REDUCING F2FS FRAGMENTATION FOR SMARTPHONES USING DECISION TREES</b><br /> <b>Speaker</b>:<br /> Lihua Yang, Huazhong University of Science &amp; Technology, CN<br /> <b>Authors</b>:<br /> Lihua Yang, Fang Wang, Zhipeng Tan, Dan Feng, Jiaxing Qian and Shiyun Tu, Huazhong University of Science &amp; Technology, CN<br /> <em><b>Abstract</b><br /> File and free-space fragmentation negatively affect file system performance. F2FS is a file system designed for flash memory. However, it suffers from severe fragmentation due to its out-of-place updates and the highly synchronous, multi-threaded writing behaviors of mobile applications. We observe that the running time on fragmented files is 2.36X longer than on contiguous files and that F2FS's in-place update scheme is incapable of reducing fragmentation. A fragmented file system leads to a poor user experience. Reserving space to prevent fragmentation is an intuitive approach; however, reserving space for all files wastes space, since there are a large number of files. To deal with this dilemma, we propose an adaptive reserved space (ARS) scheme that chooses specific files to update in the reserved space. How to effectively select reserved files is critical to performance. We collect file characteristics associated with fragmentation to construct data sets and use decision trees to accurately pick reserved files. In addition, an adjustable reserved space and a dynamic reservation strategy are adopted. We implement ARS on a HiKey960 development platform and a commercial smartphone with slight space and file creation time overheads. Experimental results show that ARS reduces file and free-space fragmentation dramatically, improves file I/O performance and reduces garbage collection overhead compared to traditional F2FS and F2FS with in-place updates. Furthermore, ARS delivers up to 1.26X the SQLite transactions per second of traditional F2FS and reduces the running time of Facebook and Twitter by up to 41.72% compared to F2FS with in-place updates.</em></td> </tr> <tr> <td style="width:40px;">18:30</td> <td><a href="#IP4">IP4-3</a>, 22</td> <td><b>A MACHINE LEARNING BASED WRITE POLICY FOR SSD CACHE IN CLOUD BLOCK STORAGE</b><br /> <b>Speaker</b>:<br /> Yu Zhang, Huazhong University of Science &amp; Technology, CN<br /> <b>Authors</b>:<br /> Yu Zhang<sup>1</sup>, Ke Zhou<sup>1</sup>, Ping Huang<sup>2</sup>, Hua Wang<sup>1</sup>, Jianying Hu<sup>3</sup>, Yangtao Wang<sup>1</sup>, Yongguang Ji<sup>3</sup> and Bin Cheng<sup>3</sup><br /> <sup>1</sup>Huazhong University of Science &amp; Technology, CN; <sup>2</sup>Temple University, US; <sup>3</sup>Tencent Technology (Shenzhen) Co., Ltd., CN<br /> <em><b>Abstract</b><br /> Nowadays, the SSD cache plays an important role in cloud storage systems. The associated write policy, which enforces an admission control policy regarding filling data into the cache, has a significant impact on the performance of the cache system and on the amount of write traffic to SSD caches. Based on our analysis of a typical cloud block storage system, approximately 47.09% of writes are write-only, i.e., writes to blocks which are not read during a certain time window. Naively writing the write-only data to the SSD cache unnecessarily introduces a large number of harmful writes to the SSD cache without any contribution to cache performance. On the other hand, it is a challenging task to identify and filter out this write-only data in real time, especially in a cloud environment running changing and diverse workloads. In this paper, to alleviate the above cache problem, we propose ML-WP, a Machine Learning Based Write Policy, which reduces write traffic to SSDs by avoiding writing write-only data. The main challenge in this approach is to identify write-only data in real time. To realize ML-WP and achieve accurate write-only data identification, we use machine learning methods to classify data into two groups (i.e., write-only and normal data). Based on this classification, the write-only data is directly written to backend storage without being cached. Experimental results show that, compared with the widely deployed write-back policy, ML-WP decreases write traffic to the SSD cache by 41.52%, while improving the hit ratio by 2.61% and reducing the average read latency by 37.52%.</em></td> </tr>
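<tr> <td> </td> <td> </td> <td><em>A minimal sketch of a write-admission policy in the spirit of IP4-3's ML-WP (ours, for illustration; the predictor below is a stub standing in for the trained classifier): writes predicted to be write-only bypass the SSD cache and go straight to backend storage.</em><br /> <pre><code>
cache, backend = {}, {}

def predict_write_only(block_id, features):
    # stand-in for the ML classifier; here: blocks with no recent reads
    return features.get("reads_in_window", 0) == 0

def write(block_id, data, features):
    backend[block_id] = data              # always persist
    if not predict_write_only(block_id, features):
        cache[block_id] = data            # cache only read-likely data

write(1, "log entry", {"reads_in_window": 0})   # bypasses the cache
write(2, "hot index", {"reads_in_window": 9})   # cached
print(sorted(cache), sorted(backend))           # [2] [1, 2]
</code></pre></td> </tr>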
In this work, we present a single-stage automated framework, YOSO, aiming to generate an optimal software-and-hardware solution that flexibly balances the goals of accuracy, power, and QoS. Compared with the two-stage method on a baseline systolic array accelerator and the CIFAR-10 dataset, we achieve 1.42x~2.29x energy reduction or 1.79x~3.07x latency reduction at the same level of precision, for different user-specified energy and latency optimization constraints, respectively.</em></td> </tr> <tr> <td>18:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="8.4">8.4 Architectural and Circuit Techniques toward Energy-efficient Computing</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 17:00 - 18:30<br /> <b>Location / Room:</b> Stendhal</p> <p><b>Chair:</b><br /> Sara Vinco, Politecnico di Torino, IT</p> <p><b>Co-Chair:</b><br /> Davide Rossi, Università di Bologna, IT</p> <p>The session discusses low-power design techniques at the architectural as well as the circuit level. The presented works span from new solutions for conventional computing, such as ultra-low-power tunable-precision architectures and speculative SRAM arrays, to emerging paradigms, like spiking neural networks and stochastic computing.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>8.4.1</td> <td><b>TRANSPIRE: AN ENERGY-EFFICIENT TRANSPRECISION FLOATING-POINT PROGRAMMABLE ARCHITECTURE</b><br /> <b>Speaker</b>:<br /> Rohit Prasad, Lab-SICC, UBS, France &amp; DEI, UniBo, Italy, FR<br /> <b>Authors</b>:<br /> Rohit Prasad<sup>1</sup>, Satyajit Das<sup>2</sup>, Kevin Martin<sup>3</sup>, Giuseppe Tagliavini<sup>4</sup>, Philippe Coussy<sup>5</sup>, Luca Benini<sup>6</sup> and Davide Rossi<sup>4</sup><br /> <sup>1</sup>Université Bretagne Sud, FR; <sup>2</sup>IIT Palakkad, IN; <sup>3</sup>University Bretagne Sud, FR; <sup>4</sup>Università di Bologna, IT; <sup>5</sup>Université Bretagne Sud / Lab-STICC, FR; <sup>6</sup>Università di Bologna and ETH Zurich, IT<br /> <em><b>Abstract</b><br /> In recent years, Coarse Grain Reconfigurable Architecture (CGRA) accelerators have been increasingly deployed in Internet-of-Things (IoT) end nodes. A modern CGRA has to support and efficiently accelerate both integer and floating-point (FP) operations. In this paper, we propose an ultra-low-power tunable-precision CGRA architectural template, called TRANSprecision floating-point Programmable archItectuRE (TRANSPIRE), and its associated compilation flow supporting both integer and FP operations. TRANSPIRE employs transprecision computing and multiple Single Instruction Multiple Data (SIMD) to accelerate FP operations while boosting energy efficiency as well. Experimental results show that TRANSPIRE achieves a maximum of 10.06x performance gain and consumes 12.91x less energy w.r.t.
a RISC-V based CPU with an enhanced ISA supporting SIMD-style vectorization and FP data-types, while executing applications for near-sensor computing and embedded machine learning, with an area overhead of only 1.25x.</em></td> </tr> <tr> <td>17:30</td> <td>8.4.2</td> <td><b>MODELING AND DESIGNING OF A PVT AUTO-TRACKING TIMING-SPECULATIVE SRAM</b><br /> <b>Speaker</b>:<br /> Shan Shen, Southeast University, CN<br /> <b>Authors</b>:<br /> Shan Shen, Tianxiang Shao, Ming Ling, Jun Yang and Longxing Shi, Southeast University, CN<br /> <em><b>Abstract</b><br /> In the low-supply-voltage region, the performance of 6T-cell SRAM degrades severely, as more time is needed to develop a sufficient voltage difference on the bitlines. Timing-speculative techniques have been proposed to boost SRAM frequency and throughput by speculatively reading data with aggressive timing and correcting timing failures in one or more extended cycles. However, the throughput gains of timing-speculative SRAM are affected by process, voltage and temperature (PVT) variations, which can make the timing design of speculative SRAM either too aggressive or too conservative. This paper first proposes a statistical model to abstract the characteristics of speculative SRAM and shows the presence of an optimal sensing time that maximizes the overall throughput. Then, guided by this performance model, a PVT auto-tracking speculative SRAM is designed and fabricated, which can dynamically self-tune the bitline sensing to the optimal time as the working condition changes. According to the measurement results, the maximum throughput gain of the proposed 28nm SRAM is 1.62X compared to the baseline at 0.6V VDD.</em></td> </tr> <tr> <td>18:00</td> <td>8.4.3</td> <td><b>SOLVING CONSTRAINT SATISFACTION PROBLEMS USING THE LOIHI SPIKING NEUROMORPHIC PROCESSOR</b><br /> <b>Speaker</b>:<br /> Chris Yakopcic, University of Dayton, US<br /> <b>Authors</b>:<br /> Chris Yakopcic<sup>1</sup>, Nayim Rahman<sup>1</sup>, Tanvir Atahary<sup>1</sup>, Tarek M. Taha<sup>1</sup> and Scott Douglass<sup>2</sup><br /> <sup>1</sup>University of Dayton, US; <sup>2</sup>Air Force Research Laboratory, US<br /> <em><b>Abstract</b><br /> In many cases, low-power autonomous systems need to make decisions extremely efficiently. However, as a potential solution space becomes more complex, finding a solution quickly becomes nearly impossible using traditional computing methods. Thus, in this work we present a constraint satisfaction algorithm based on the principles of spiking neural networks. To demonstrate the validity of this algorithm, we have shown successful execution of the Boolean satisfiability problem (SAT) on the Intel Loihi spiking neuromorphic research processor. Power consumption in this spiking processor is due primarily to the propagation of spikes, which are the key drivers of data movement and processing. Thus, this system is inherently efficient for many types of problems. However, algorithms must be redesigned in a spiking neural network format to achieve the greatest efficiency gains. To the best of our knowledge, the work in this paper exhibits the first implementation of constraint satisfaction on a low-power embedded neuromorphic processor.
With this result, we aim to show that embedded spiking neuromorphic hardware is capable of executing general problem-solving algorithms with great areal and computational efficiency.</em></td> </tr> <tr> <td>18:15</td> <td>8.4.4</td> <td><b>ACCURATE POWER DENSITY MAP ESTIMATION FOR COMMERCIAL MULTI-CORE MICROPROCESSORS</b><br /> <b>Speaker</b>:<br /> Sheldon Tan, University of California, Riverside, US<br /> <b>Authors</b>:<br /> Jinwei Zhang, Sheriff Sadiqbatcha, Wentian Jin and Sheldon Tan, University of California, Riverside, US<br /> <em><b>Abstract</b><br /> In this work, we propose an accurate full-chip steady-state power density map estimation method for commercial multi-core microprocessors. The new approach is based on steady-state thermal maps (images) measured with an advanced infrared (IR) thermal imaging system to ensure its accuracy. The method consists of a few steps. First, based on first principles of heat transfer, a 2D spatial Laplace operation is performed on the given thermal map to obtain a so-called raw power density map, which consists of both positive and negative values due to the steady-state nature and boundary conditions of the microprocessors. Then, based on the total power of the microprocessor reported by an online CPU tool, we develop a novel scheme to generate the actual, positive-only power density map from the raw power density map. At the same time, we develop a novel approach to estimate the effective thermal conductivity of the microprocessors. To further validate the power density map and the estimated thermal conductivity, we construct a thermal model with COMSOL, which mimics the real measurement setup used in the IR imaging system. We then compute thermal maps from the estimated power density maps using the finite element method (FEM) to ensure that the computed thermal maps match the measured ones. Experimental results on an Intel i7-8650U 4-core processor show 1.8°C root-mean-square error (RMSE) and 96% similarity (2D correlation) between the computed and measured thermal maps.</em></td> </tr> <tr> <td style="width:40px;">18:31</td> <td><a href="#IP4">IP4-5</a>, 168</td> <td><b>WHEN SORTING NETWORK MEETS PARALLEL BITSTREAMS: A FAULT-TOLERANT PARALLEL TERNARY NEURAL NETWORK ACCELERATOR BASED ON STOCHASTIC COMPUTING</b><br /> <b>Speaker</b>:<br /> Yawen Zhang, Peking University, CN<br /> <b>Authors</b>:<br /> Yawen Zhang<sup>1</sup>, Sheng Lin<sup>2</sup>, Runsheng Wang<sup>1</sup>, Yanzhi Wang<sup>2</sup>, Yuan Wang<sup>1</sup>, Weikang Qian<sup>3</sup> and Ru Huang<sup>1</sup><br /> <sup>1</sup>Peking University, CN; <sup>2</sup>Northeastern University, US; <sup>3</sup>Shanghai Jiao Tong University, CN<br /> <em><b>Abstract</b><br /> Stochastic computing (SC) has been widely used in neural networks (NNs) due to its low hardware cost and high fault tolerance. Conventionally, SC-based NN accelerators adopt a hybrid stochastic-binary format, using an accumulative parallel counter to convert bitstreams into a binary number. This method, however, sacrifices fault tolerance and incurs a high hardware cost. In order to fully exploit the superior fault tolerance of SC, taking a ternary neural network (TNN) as an example, we propose a parallel SC-based NN accelerator purely using bitstream computation. We apply a bitonic sorting network to simultaneously implement the accumulation and activation function with parallel bitstreams.
The proposed design not only has high fault tolerance, but also achieves at least a 2.8x energy efficiency improvement over its binary computing counterpart.</em></td> </tr> <tr> <td style="width:40px;">18:32</td> <td><a href="#IP4">IP4-6</a>, 452</td> <td><b>WAVEPRO: CLOCK-LESS WAVE-PROPAGATED PIPELINE COMPILER FOR LOW-POWER AND HIGH-THROUGHPUT COMPUTATION</b><br /> <b>Speaker</b>:<br /> Yehuda Kra, Bar-Ilan University, IL<br /> <b>Authors</b>:<br /> Yehuda Kra, Adam Teman and Tzachi Noy, Bar-Ilan University, IL<br /> <em><b>Abstract</b><br /> Clock-less Wave-Propagated Pipelining is a long-known approach to achieving high throughput without the overhead of costly sampling registers. However, due to many design challenges, which have only increased with technology scaling, this approach has never been widely accepted and has generally been limited to small and very specific demonstrations. This paper addresses this barrier by presenting WavePro, a generic and scalable algorithm capable of skew balancing any combinatorial logic netlist for the application of wave pipelining. The algorithm was implemented in the WavePro Compiler automation utility, which interfaces with industry delay extraction and standard timing analysis tools to produce a sign-off-quality result. The utility is demonstrated on a dot-product accelerator in a 65 nm CMOS technology, using a vendor-provided standard cell library and commercial timing analysis tools. By reducing the worst-case output skew by over 70%, the test case achieves throughput equivalent to an 8-stage sequentially pipelined implementation with power savings of almost 3X.</em></td> </tr> <tr> <td>18:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="8.5">8.5 CNN Dataflow Optimizations</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 17:00 - 18:30<br /> <b>Location / Room:</b> Bayard</p> <p><b>Chair:</b><br /> Mario Casu, Politecnico di Torino, IT</p> <p><b>Co-Chair:</b><br /> Wanli Chang, University of York, GB</p> <p>This session focuses on efficient dataflow approaches for reducing CNN runtime on embedded hardware platforms. The papers to be presented demonstrate techniques for enhancing parallelism to improve the performance of CNNs, leverage output prediction to reduce inference runtime for time-critical embedded applications, and present a Keras-based DNN framework for real-time cyber-physical systems.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>8.5.1</td> <td><b>ANALYSIS AND SOLUTION OF CNN ACCURACY REDUCTION OVER CHANNEL LOOP TILING</b><br /> <b>Speaker</b>:<br /> Yesung Kang, Pohang University of Science and Technology, KR<br /> <b>Authors</b>:<br /> Yesung Kang<sup>1</sup>, Yoonho Park<sup>1</sup>, Sunghoon Kim<sup>1</sup>, Eunji Kwon<sup>1</sup>, Taeho Lim<sup>2</sup>, Mingyu Woo<sup>3</sup>, Sangyun Oh<sup>4</sup> and Seokhyeong Kang<sup>1</sup><br /> <sup>1</sup>Pohang University of Science and Technology, KR; <sup>2</sup>SK Hynix, KR; <sup>3</sup>University of California, San Diego, US; <sup>4</sup>UNIST, KR<br /> <em><b>Abstract</b><br /> Owing to the growing size of convolutional neural networks (CNNs), quantization and loop tiling (also called loop breaking) are mandatory to implement CNNs on embedded systems. However, channel loop tiling of quantized CNNs induces unexpected errors.
We explain why channel loop tiling of quantized CNNs induces these unexpected errors, and how they affect the accuracy of state-of-the-art CNNs. We also propose a method to recover accuracy under channel tiling by compressing and decompressing the most-significant bits of partial sums. Using the proposed method, we can recover accuracy by 12.3% with only 1% circuit area overhead and an additional 2% power consumption.</em></td> </tr> <tr> <td>17:30</td> <td>8.5.2</td> <td><b>DCCNN: COMPUTATIONAL FLOW REDEFINITION FOR EFFICIENT CNN INFERENCE THROUGH MODEL STRUCTURAL DECOUPLING</b><br /> <b>Speaker</b>:<br /> Xiang Chen, George Mason University, US<br /> <b>Authors</b>:<br /> Fuxun Yu<sup>1</sup>, Zhuwei Qin<sup>1</sup>, Di Wang<sup>2</sup>, Ping Xu<sup>1</sup>, Chenchen Liu<sup>3</sup>, Zhi Tian<sup>1</sup> and Xiang Chen<sup>1</sup><br /> <sup>1</sup>George Mason University, US; <sup>2</sup>Microsoft, US; <sup>3</sup>University of Maryland, Baltimore County, US<br /> <em><b>Abstract</b><br /> Owing to their excellent accuracy and feasibility, Convolutional Neural Networks (CNNs) have been widely applied in novel intelligent applications and systems. However, CNN computation performance is significantly hindered by the CNN computation flow, which computes the model structure sequentially, layer by layer, with massive convolution operations. Such a layer-wise sequential computation flow is dictated by the inter-layer data dependency and causes performance issues such as resource under-utilization and significant computation overhead. To solve these problems, in this work we propose a novel CNN structural decoupling method, which decouples CNN models along "critical paths" and eliminates the inter-layer data dependency. Based on this method, we redefine the CNN computation flow into parallel and cascade computing paradigms, which can significantly enhance CNN computation performance on both multi-core and single-core CPU processors. Experiments show that our DC-CNN framework reduces latency by up to 33% on multi-core CPUs for both CIFAR and ImageNet. On small-capacity mobile platforms, cascade computing reduces latency by 24% on average on ImageNet and 42% on CIFAR-10. Meanwhile, the memory reduction reaches 21% and 64% on average, respectively.</em></td> </tr> <tr> <td>18:00</td> <td>8.5.3</td> <td><b>ABC: ABSTRACT PREDICTION BEFORE CONCRETENESS</b><br /> <b>Speaker</b>:<br /> Jung-Eun Kim, Yale University, US<br /> <b>Authors</b>:<br /> Jung-Eun Kim<sup>1</sup>, Richard Bradford<sup>2</sup>, Man-Ki Yoon<sup>1</sup> and Zhong Shao<sup>1</sup><br /> <sup>1</sup>Yale University, US; <sup>2</sup>Collins Aerospace, US<br /> <em><b>Abstract</b><br /> Learning techniques are advancing the utility and capability of modern embedded systems. However, the challenge of incorporating learning modules into embedded systems is that computing resources are scarce. For such a resource-constrained environment, we have developed a framework for learning abstract information early and learning more concretely as time allows. The intermediate results can be utilized to prepare for early decisions/actions as needed. To apply this framework to a classification task, the datasets are categorized in an abstraction hierarchy. Then the framework classifies intermediate labels from the most abstract level to the most concrete. Our proposed method outperforms the existing approaches and reference baselines in terms of accuracy.
We demonstrate our framework with different architectures on various benchmark datasets: CIFAR-10, CIFAR-100, and GTSRB. We also measure prediction times on GPU-equipped embedded computing platforms.</em></td> </tr> <tr> <td>18:15</td> <td>8.5.4</td> <td><b>A COMPOSITIONAL APPROACH USING KERAS FOR NEURAL NETWORKS IN REAL-TIME SYSTEMS</b><br /> <b>Speaker</b>:<br /> Xin Yang, University of Auckland, NZ<br /> <b>Authors</b>:<br /> Xin Yang, Partha Roop, Hammond Pearce and Jin Woo Ro, University of Auckland, NZ<br /> <em><b>Abstract</b><br /> Real-time systems are designed using model-driven approaches, where a complex system is represented as a set of interacting components. Such a compositional approach facilitates the design of simpler components, which are easier to validate and integrate with the overall system. In contrast to such systems, data-driven systems like neural networks are designed as monolithic black-boxes to capture the non-linear relationship from inputs to outputs. Increasingly, such systems are being used in safety-critical real-time systems. Here, a compositional approach would be ideal. However, to the best of our knowledge, such a compositional approach is lacking when designing data-driven components based on neural networks. This paper formalises this problem by developing the concept of Composed Neural Networks (CpNNs), extending the well-known Keras Python framework. CpNNs formalise the synchronous composition of several interacting neural networks in Keras. Further, using the developed semantics, we enable modular compilation from a given CpNN to C code. The generated code is suitable for Worst-Case Execution Time (WCET) analysis. Using several benchmarks we demonstrate the superiority of the developed approach over a recently proposed approach using Esterel, as well as the popular Python package TensorFlow Lite. For the given benchmarks, our approach is superior to Esterel with an average WCET reduction of 64.06%, and superior to TensorFlow Lite with an average measured WCET reduction of 62.08%.</em></td> </tr> <tr> <td style="width:40px;">18:00</td> <td><a href="#IP4">IP4-7</a>, 935</td> <td><b>DEEPNVM: A FRAMEWORK FOR MODELING AND ANALYSIS OF NON-VOLATILE MEMORY TECHNOLOGIES FOR DEEP LEARNING APPLICATIONS</b><br /> <b>Speaker</b>:<br /> Ahmet Inci, Carnegie Mellon University, US<br /> <b>Authors</b>:<br /> Ahmet Inci, Mehmet M Isgenc and Diana Marculescu, Carnegie Mellon University, US<br /> <em><b>Abstract</b><br /> Non-volatile memory (NVM) technologies such as spin-transfer torque magnetic random access memory (STT-MRAM) and spin-orbit torque magnetic random access memory (SOT-MRAM) have significant advantages compared to conventional SRAM due to their non-volatility, higher cell density, and scalability features. While previous work has investigated several architectural implications of NVM for generic applications, in this work we present DeepNVM, a framework to characterize, model, and analyze NVM-based caches in GPU architectures for deep learning (DL) applications by combining technology-specific circuit-level models and the actual memory behavior of various DL workloads. We present both iso-capacity and iso-area performance and energy analyses for systems whose last-level caches rely on conventional SRAM and emerging STT-MRAM and SOT-MRAM technologies. In the iso-capacity case, STT-MRAM and SOT-MRAM provide up to 4.2x and 5x energy-delay product (EDP) reduction and 2.4x and 3x area reduction compared to conventional SRAM, respectively.
Under iso-area assumptions, STT-MRAM and SOT-MRAM provide 2.3x EDP reduction on average across all workloads when compared to SRAM. Our comprehensive cross-layer framework is demonstrated on STT-/SOT-MRAM technologies and can be used for the characterization, modeling, and analysis of any NVM technology for last-level caches in GPU platforms for deep learning applications.</em></td> </tr> <tr> <td style="width:40px;">18:01</td> <td><a href="#IP4">IP4-8</a>, 419</td> <td><b>EFFICIENT EMBEDDED MACHINE LEARNING APPLICATIONS USING ECHO STATE NETWORKS</b><br /> <b>Speaker</b>:<br /> Rolando Brondolin, Politecnico di Milano, IT<br /> <b>Authors</b>:<br /> Luca Cerina<sup>1</sup>, Giuseppe Franco<sup>2</sup>, Claudio Gallicchio<sup>3</sup>, Alessio Micheli<sup>3</sup> and Marco D. Santambrogio<sup>4</sup><br /> <sup>1</sup>Politecnico di Milano, IT; <sup>2</sup>Scuola Superiore Sant'Anna / Università di Pisa, IT; <sup>3</sup>Università di Pisa, IT; <sup>4</sup>Politecnico di Milano, IT<br /> <em><b>Abstract</b><br /> The increasing role of Artificial Intelligence (AI) and Machine Learning (ML) in our lives has brought a paradigm shift in how and where computation is performed. Stringent latency requirements and congested bandwidth have moved AI inference from the Cloud towards end-devices. This change required a major simplification of Deep Neural Networks (DNNs), with memory-efficient libraries or co-processors that perform fast inference with minimal power. Unfortunately, many applications such as natural language processing, time-series analysis and audio interpretation are built on a different type of Artificial Neural Network (ANN), the so-called Recurrent Neural Networks (RNNs), which, due to their intrinsic architecture, remain too complex and heavy to run efficiently on embedded devices. To solve this issue, the Reservoir Computing paradigm proposes sparse, untrained non-linear networks, the Reservoir, that can embed temporal relations without some of the hindrances of Recurrent Neural Network training, and with lower memory usage. Echo State Networks (ESNs) and Liquid State Machines are the most notable examples. In this scenario, we propose a performance comparison of an ESN, designed and trained using Bayesian Optimization techniques, against current RNN solutions. We aim to demonstrate that ESNs have comparable accuracy, require minimal training time, and are more optimized in terms of memory usage and computational efficiency. Preliminary results show that ESNs are competitive with RNNs on a simple benchmark, and both training and inference time are faster, with a maximum speed-up of 2.35x and 6.60x, respectively.</em></td> </tr> <tr> <td>18:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="8.6">8.6 Microarchitecture-Level Reliability Analysis and Protection</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 17:00 - 18:30<br /> <b>Location / Room:</b> Lesdiguières</p> <p><b>Chair:</b><br /> Michail Maniatakos, New York University Abu Dhabi, UA</p> <p><b>Co-Chair:</b><br /> Alessandro Savino, Politecnico di Torino, IT</p> <p>Reliability analysis and protection at the microarchitecture level is of paramount importance to speed up the design phase of any computing system.
On the analysis side, this session starts by presenting a reverse-order ACE (Architecturally Correct Execution) analysis that is more accurate than the original ACE proposals, then moves to an instruction-level analysis based on a genetic algorithm able to improve program resiliency to errors. Finally, on the protection side, the session presents a low-cost ECC-plus-approximation mechanism for GPU register files.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>8.6.1</td> <td><b>RACE: REVERSE-ORDER PROCESSOR RELIABILITY ANALYSIS</b><br /> <b>Authors</b>:<br /> Athanasios Chatzidimitriou and Dimitris Gizopoulos, University of Athens, GR<br /> <em><b>Abstract</b><br /> Modern microprocessors suffer from increased error rates that come along with fabrication technology scaling. Processor designs continuously become more prone to hardware faults that lead to execution errors and system failures, which raises the need for protection mechanisms. However, error mitigation strategies have to be applied diligently, as they impose significant power, area, and performance overheads. Early and accurate reliability estimation of a microprocessor design is essential in order to determine the most vulnerable hardware structures and the most efficient protection schemes. One of the most commonly used techniques for reliability estimation is Architecturally Correct Execution (ACE) analysis. ACE analysis can be applied at different abstraction levels, including microarchitecture and RTL, and often requires only a single simulation or a few simulations to report the Architectural Vulnerability Factor (AVF) of the processor structures. However, ACE analysis overestimates the vulnerability of structures because of its pessimistic, worst-case nature. Moreover, it only delivers coarse-grain vulnerability reports and no details about the expected result of hardware faults (silent data corruptions, crashes). In this paper, we present reverse ACE (rACE), a methodology that (a) improves the accuracy of ACE analysis and (b) delivers fine-grain error outcome reports. Using a reverse-order tracing flow, rACE analysis associates portions of the simulated execution of a program with the actual output and the control flow, delivering finer accuracy and result classification. Our findings show that rACE reports an average 1.45X overestimation, compared to Statistical Fault Injection, for different sizes of the register file of an out-of-order CPU core (executing both ARM and x86 binaries), whereas a baseline ACE analysis reports 2.3X overestimation and even refined versions of ACE analysis report an average of 1.8X overestimation.</em></td> </tr> <tr> <td>17:30</td> <td>8.6.2</td> <td><b>DEFCON: GENERATING AND DETECTING FAILURE-PRONE INSTRUCTION SEQUENCES VIA STOCHASTIC SEARCH</b><br /> <b>Speaker</b>:<br /> Ioannis Tsiokanos, Queen's University Belfast, GB<br /> <b>Authors</b>:<br /> Ioannis Tsiokanos<sup>1</sup>, Lev Mukhanov<sup>1</sup>, Giorgis Georgakoudis<sup>2</sup>, Dimitrios S. Nikolopoulos<sup>3</sup> and Georgios Karakonstantis<sup>1</sup><br /> <sup>1</sup>Queen's University Belfast, GB; <sup>2</sup>Lawrence Livermore National Laboratory, US; <sup>3</sup>Virginia Tech, US<br /> <em><b>Abstract</b><br /> Increased variability and the adoption of low supply voltages render nanometer devices prone to timing failures, which threaten the functionality of digital circuits.
Recent schemes have focused on developing instruction-aware failure prediction models and adapting voltage/frequency to avoid errors while saving energy. However, such schemes may be inaccurate when applied to pipelined cores, since they consider only the currently executed instruction and the preceding one, thereby neglecting the impact of all concurrently executing instructions on failure occurrence. In this paper, we first demonstrate that the order and type of instructions in sequences with a length equal to the pipeline depth significantly affect the failure rate. To overcome the practically impossible evaluation of the impact of all possible sequences on failures, we present DEFCON, a fully automated framework that stochastically searches for the most failure-prone instruction sequences (ISQs). DEFCON generates such sequences by integrating a properly formulated genetic algorithm with accurate post-layout dynamic timing analysis, considering the data-dependent path sensitization and instruction execution history. The generated micro-architecture-aware ISQs are then used by DEFCON to estimate the failure vulnerability of any application. To evaluate the efficacy of the proposed framework, we implement a pipelined floating-point unit and perform dynamic timing analysis based on input data that we extract from a variety of applications consisting of up to 43.5M ISQs. Our results show that DEFCON quickly reveals ISQs that maximize the output quality loss and correctly detects 99.7% of the actual faulty ISQs in different applications under various levels of variation-induced delay increase. Finally, DEFCON enables us to identify failure-prone ISQs early in the design cycle, and to save 26.8% of energy on average when combined with a clock stretching mechanism.</em></td> </tr> <tr> <td>18:00</td> <td>8.6.3</td> <td><b>LAD-ECC: ENERGY-EFFICIENT ECC MECHANISM FOR GPGPUS REGISTER FILE</b><br /> <b>Speaker</b>:<br /> Hengshan Yue, Jilin University, CN<br /> <b>Authors</b>:<br /> Xiaohui Wei, Hengshan Yue and Jingweijia Tan, Jilin University, CN<br /> <em><b>Abstract</b><br /> Graphics Processing Units (GPUs) are widely used in general-purpose high-performance computing applications (i.e., GPGPUs), which require reliable execution in the presence of soft errors. To support massive thread-level parallelism, a sizeable register file is adopted in GPUs, which is highly vulnerable to soft errors. Although modern commercial GPUs provide single-error-correction double-error-detection (SEC-DED) ECC for the register file, it consumes a considerable amount of energy due to frequent register accesses and the leakage power of ECC storage. In this paper, we propose to Leverage Approximation and Duplication characteristics of register values to build an energy-efficient ECC mechanism (LAD-ECC) in GPGPUs, which consists of APproximation-aware ECC (AP-ECC) and Duplication-Aware ECC (DA-ECC). Leveraging the inherent error-tolerance features, AP-ECC protects only the significant bits of registers to combat critical errors. Observing that same-named registers across threads usually hold the same data, DA-ECC avoids unnecessary ECC generation and verification for duplicate register values.
Experimental results demonstrate that our LAD-ECC reduces the energy consumption of traditional SEC-DED ECC by 69.72%.</em></td> </tr> <tr> <td style="width:40px;">18:30</td> <td><a href="#IP4">IP4-9</a>, 698</td> <td><b>EXPLFRAME: EXPLOITING PAGE FRAME CACHE FOR FAULT ANALYSIS OF BLOCK CIPHERS</b><br /> <b>Speaker</b>:<br /> Anirban Chakraborty, IIT Kharagpur, IN<br /> <b>Authors</b>:<br /> Anirban Chakraborty<sup>1</sup>, Sarani Bhattacharya<sup>2</sup>, Sayandeep Saha<sup>1</sup> and Debdeep Mukhopadhyay<sup>1</sup><br /> <sup>1</sup>IIT Kharagpur, IN; <sup>2</sup>KU Leuven, BE<br /> <em><b>Abstract</b><br /> The Page Frame Cache (PFC) is a purely software cache, present in modern Linux-based operating systems (OS), which stores the page frames that were recently released by the processes running on a particular CPU. In this paper, we show that the page frame cache can be maliciously exploited by an adversary to steer the pages of a victim process to pre-decided, attacker-chosen locations in memory. We practically demonstrate an end-to-end attack, ExplFrame, where an attacker having only user-level privilege is able to force a victim process's memory pages to vulnerable locations in DRAM and deterministically conduct Rowhammer to induce faults. As a case study, we induce single-bit faults in the T-tables of OpenSSL (v1.1.1) AES using our proposed attack ExplFrame. We also propose an improvised fault analysis technique which can exploit any Rowhammer-induced bit-flips in the AES T-tables.</em></td> </tr> <tr> <td>18:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="8.7">8.7 Physical Design and Analysis</h2> <p><b>Date:</b> Wednesday 11 March 2020<br /> <b>Time:</b> 17:00 - 18:30<br /> <b>Location / Room:</b> Berlioz</p> <p><b>Chair:</b><br /> Vasilis Pavlidis, The University of Manchester, GB</p> <p><b>Co-Chair:</b><br /> L. Miguel Silveira, INESC ID / IST, U Lisboa, PT</p> <p>This session deals with problems in extraction, DRC hotspots, IR drop, routing and other relevant issues in physical design and analysis. The common trend across all papers is efficiency improvement while maintaining accuracy. Floating random walk extraction is performed to handle non-stratified dielectrics with on-the-fly computations. Also, serial equivalence can be guaranteed in FPGA routing by exploiting parallelism. A legalization flow is proposed for double-patterning-aware feature alignment. Finally, machine-learning-based DRC hotspot prediction is enhanced with explainability.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>17:00</td> <td>8.7.1</td> <td><b>FLOATING RANDOM WALK BASED CAPACITANCE SOLVER FOR VLSI STRUCTURES WITH NON-STRATIFIED DIELECTRICS</b><br /> <b>Speaker</b>:<br /> Ming Yang, Tsinghua University, CN<br /> <b>Authors</b>:<br /> Mingye Song, Ming Yang and Wenjian Yu, Tsinghua University, CN<br /> <em><b>Abstract</b><br /> In this paper, two techniques are proposed to enhance the floating random walk (FRW) based capacitance solver for handling non-stratified dielectrics in very large-scale integrated (VLSI) circuits. They follow an existing approach which employs approximate eight-octant transition cubes while simulating structures with conformal dielectrics.
Firstly, the symmetry property of the transition probabilities of the eight-octant cube is revealed and utilized to derive an on-the-fly sampling scheme during the FRW procedure. This avoids precharacterization, saves substantial memory, and improves computational accuracy for extracting structures with non-stratified dielectrics. Then, the space management technique is extended to improve the runtime efficiency for simulating structures with thousands of non-stratified dielectrics. Numerical experiments are carried out to validate the proposed techniques and show their effectiveness for handling structures with conformal dielectrics and air bubbles. Moreover, the extended space management brings up to 1441X speedup for handling structures with several thousand to one million non-stratified dielectrics.</em></td> </tr> <tr> <td>17:30</td> <td>8.7.2</td> <td><b>TOWARDS SERIAL-EQUIVALENT MULTI-CORE PARALLEL ROUTING FOR FPGAS</b><br /> <b>Speaker</b>:<br /> Minghua Shen, Sun Yat-sen University, CN<br /> <b>Authors</b>:<br /> Minghua Shen and Nong Xiao, Sun Yat-sen University, CN<br /> <em><b>Abstract</b><br /> In this paper, we present a serial-equivalent parallel router for FPGAs on modern multi-core processors. We build on the inherent net order of the serial router to schedule all nets into a series of stages, where non-conflicting nets are scheduled in the same stage and conflicting nets are scheduled in different stages. We explore the parallel routing of non-conflicting nets on multi-core processors for a significant speedup. We perform the data synchronization of conflicting stages using an MPI-based message queue for a feasible routing solution. Load balancing is always used to guide the multi-core parallel routing. Experimental results show that our parallel router provides about 19.13x speedup on average using 32 processor cores compared to the serial router. Notably, our parallel router generates exactly the same wirelength as the serial router, satisfying serial equivalency.</em></td> </tr> <tr> <td>18:00</td> <td>8.7.3</td> <td><b>SELF-ALIGNED DOUBLE-PATTERNING AWARE LEGALIZATION</b><br /> <b>Speaker</b>:<br /> Hua Xiang, IBM Research, US<br /> <b>Authors</b>:<br /> Hua Xiang<sup>1</sup>, Gi-Joon Nam<sup>1</sup>, Gustavo Tellez<sup>2</sup>, Shyam Ramji<sup>2</sup> and Xiaoqing Xu<sup>3</sup><br /> <sup>1</sup>IBM Research, US; <sup>2</sup>IBM Thomas J. Watson Research Center, US; <sup>3</sup>University of Texas at Austin, US<br /> <em><b>Abstract</b><br /> Double patterning is a widely used technique for sub-22nm nodes. Among various double patterning techniques, Self-Aligned Double Patterning (SADP) is a promising technique for good mask overlay control. Based on SADP, a new set of standard cells (T-cells) is developed using thicker metal wires for stronger drive strength. Applying such gates on critical paths helps to improve design performance. However, a mixed design with T-cells and normal cells (N-cells) requires that T-cells are placed on circuit rows with thicker metal, and that normal cells are placed on normal circuit rows. Therefore, a placer is needed to adjust the cells to the matching circuit rows. In this paper, a two-stage min-cost max-flow based legalization flow is presented to adjust N/T gate locations for a legal placement.
The experimental results demonstrate the effectiveness and efficiency of our approach.</em></td> </tr> <tr> <td>18:15</td> <td>8.7.4</td> <td><b>EXPLAINABLE DRC HOTSPOT PREDICTION WITH RANDOM FOREST AND SHAP TREE EXPLAINER</b><br /> <b>Speaker</b>:<br /> Wei Zeng, University of Wisconsin-Madison, US<br /> <b>Authors</b>:<br /> Wei Zeng<sup>1</sup>, Azadeh Davoodi<sup>1</sup> and Rasit Onur Topaloglu<sup>2</sup><br /> <sup>1</sup>University of Wisconsin - Madison, US; <sup>2</sup>IBM, US<br /> <em><b>Abstract</b><br /> With advanced technology nodes, resolving design rule check (DRC) violations has become a cumbersome task, which makes it desirable to make predictions at earlier stages of the design flow. In this paper, we show that the Random Forest (RF) model is quite effective for DRC hotspot prediction at the global routing stage, and in fact significantly outperforms recent prior works, with only a fraction of the runtime needed to develop the model. We also propose, for the first time, to adopt a recent explanatory metric, the SHAP value, to make accurate and consistent explanations for individual DRC hotspot predictions from RF. Experiments show that RF is 21%-60% better in predictive performance on average, compared with promising machine learning models used in similar works (e.g., SVM and neural networks), while exhibiting good explainability, which makes it ideal for DRC hotspot prediction.</em></td> </tr> <tr> <td style="width:40px;">18:31</td> <td><a href="#IP4">IP4-10</a>, 522</td> <td><b>XGBIR: AN XGBOOST-BASED IR DROP PREDICTOR FOR POWER DELIVERY NETWORK</b><br /> <b>Speaker</b>:<br /> An-Yu Su, National Chiao Tung University, TW<br /> <b>Authors</b>:<br /> Chi-Hsien Pao, Yu-Min Lee and An-Yu Su, National Chiao Tung University, TW<br /> <em><b>Abstract</b><br /> This work utilizes XGBoost to build a machine-learning-based IR drop predictor, XGBIR, for the power grid. To capture the behavior of the power grid, we extract several of its features and employ its locality property to save extraction time. XGBIR can be effectively applied to large designs, and the average error of its predicted IR drops is less than 6 mV.</em></td> </tr> <tr> <td style="width:40px;">18:32</td> <td><a href="#IP4">IP4-11</a>, 347</td> <td><b>ON PRE-ASSIGNMENT ROUTE PROTOTYPING FOR IRREGULAR BUMPS ON BGA PACKAGES</b><br /> <b>Speaker</b>:<br /> Hung-Ming Chen, National Chiao Tung University, TW<br /> <b>Authors</b>:<br /> Jyun-Ru Jiang<sup>1</sup>, Yun-Chih Kuo<sup>2</sup>, Simon Chen<sup>3</sup> and Hung-Ming Chen<sup>1</sup><br /> <sup>1</sup>National Chiao Tung University, TW; <sup>2</sup>National Taiwan University, TW; <sup>3</sup>MediaTek Inc., TW<br /> <em><b>Abstract</b><br /> In modern package design, the bumps are often placed irregularly because the macros vary in size and position. This makes pre-assignment routing more difficult, even with massive design effort. This work presents a 2-stage routing method which can be applied to an arbitrary bump placement on 2-layer BGA packages. Our approach combines escape routing with via assignment: the escape routing handles the irregular bumps, and the via assignment improves the wire congestion and total wirelength of global routing.
Experimental results based on industrial cases show that our methodology solves the routing efficiently, achieving an 82% improvement in wire congestion with a 5% wirelength increase compared with conventional regular treatments.</em></td> </tr> <tr> <td>18:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="9.1">9.1 Special Day on "Silicon Photonics": Advancements on Silicon Photonics</h2> <p><b>Date:</b> Thursday 12 March 2020<br /> <b>Time:</b> 08:30 - 10:00<br /> <b>Location / Room:</b> Amphithéâtre Jean Prouve</p> <p><b>Chair:</b><br /> Gabriela Nicolescu, Polytechnique Montréal, CA</p> <p><b>Co-Chair:</b><br /> Luca Ramini, Hewlett Packard Labs, US</p> <p> </p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>9.1.1</td> <td><b>SYSTEM STUDY OF SILICON PHOTONICS MODULATOR IN SHORT REACH GRIDLESS COHERENT NETWORKS</b><br /> <b>Speaker</b>:<br /> Sadok Aouini, Ciena Corporation, CA<br /> <b>Authors</b>:<br /> Sadok Aouini<sup>1</sup>, Ahmad Abdo<sup>1</sup>, Xueyang Li<sup>2</sup>, Md Samiul Alam<sup>2</sup>, Mahdi Parvizi<sup>1</sup>, Claude D'Amours<sup>3</sup> and David V. Plant<sup>2</sup><br /> <sup>1</sup>Ciena Corporation, CA; <sup>2</sup>McGill University, CA; <sup>3</sup>University of Ottawa, CA<br /> <em><b>Abstract</b><br /> A study of the impact of modulation loss in Silicon-Photonics Mach-Zehnder modulators in the context of single-carrier coherent receivers, i.e. 400G-ZR. The modulation loss is primarily due to the limited bandwidth and large peak-to-average ratio of the modulator output. We present the implications of performing only post-compensation of the loss at the receiver and its advantages in gridless networks. A manageable Q-factor penalty of around 0.5 dB is found for a dual-polarization system with a 0.75 dB peak-to-average power ratio (PAPR) reduction.</em></td> </tr> <tr> <td>09:00</td> <td>9.1.2</td> <td><b>FULLY INTEGRATED PHOTONIC CIRCUITS ON SILICON BY MEANS OF III-V/SILICON BONDING</b><br /> <b>Author</b>:<br /> Florian Denis-le Coarer, SCINTIL Photonics, US<br /> <em><b>Abstract</b><br /> This presentation introduces a new platform integrating heterogeneous III-V/silicon gain devices at the backside of silicon-on-insulator wafers. The fabrication relies on commercial silicon photonic processes. This platform enables fully integrated photonic circuits comprising lasers, modulators, passives and photodetectors that can be tested at the wafer level.</em></td> </tr> <tr> <td>09:30</td> <td>9.1.3</td> <td><b>III-V/SILICON HYBRID LASERS INTEGRATION ON CMOS-COMPATIBLE 200MM AND 300MM PLATFORMS</b><br /> <b>Speaker</b>:<br /> Karim Hassan, CEA-Leti, FR<br /> <b>Authors</b>:<br /> Karim Hassan<sup>1</sup>, Szelag Bertrand<sup>1</sup>, Laetitia Adelmini<sup>1</sup>, Cecilia Dupre<sup>1</sup>, Elodie Ghegin<sup>2</sup>, Philippe Rodriguez<sup>1</sup>, Fabrice Nemouchi<sup>1</sup>, Pierre Brianceau<sup>1</sup>, Antoine Schembri<sup>1</sup>, David Carrara<sup>3</sup>, Pierrick Cavalie<sup>3</sup>, Florent Franchin<sup>3</sup>, Marie-Christine Roure<sup>1</sup>, Loic Sanchez<sup>1</sup>, Christophe Jany<sup>1</sup> and Ségolène Olivier<sup>1</sup><br /> <sup>1</sup>CEA-Leti, FR; <sup>2</sup>STMicroelectronics, FR; <sup>3</sup>Almae Technologies, FR<br /> <em><b>Abstract</b><br /> We present a CMOS-compatible hybrid III-V/Silicon technology developed at CEA-Leti.
Large-scale integration of silicon photonics is already available worldwide in 200mm or 300mm through different foundries, but the development of a CMOS-compatible process for III-V integration remains of major interest for next-generation transceivers in the Datacom and High Performance Computing domains. The technological developments involve first the hybridization on top of a mature silicon photonic front-end wafer through direct molecular bonding, then the patterning of the III-V epitaxy layer, and finally low-access-resistance contacts through a planar multilevel BEOL that remains to be optimized. The different technological blocks will be described, and the results will be discussed on the basis of test vehicles based on either distributed feedback (DFB), distributed Bragg reflector (DBR), or Fabry-Perot (FP) laser cavities. While first demonstrations have been obtained through wafer bonding, we show that the fabrication process was subsequently validated on III-V die bonding with a fabrication yield of Fabry-Perot lasers of 97% in 200mm. The overall technological features are expected to improve the efficiency, density, and cost of silicon photonics PICs.</em></td> </tr> <tr> <td>10:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="9.2">9.2 Autonomous Systems Design Initiative: Architectures and Frameworks for Autonomous Systems</h2> <p><b>Date:</b> Thursday 12 March 2020<br /> <b>Time:</b> 08:30 - 10:00<br /> <b>Location / Room:</b> Chamrousse</p> <p><b>Chair:</b><br /> Selma Saidi, TU Dortmund, DE</p> <p><b>Co-Chair:</b><br /> Rolf Ernst, TU Braunschweig, DE</p> <p> </p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>9.2.1</td> <td><b>DEEPRACING: A FRAMEWORK FOR AGILE AUTONOMY</b><br /> <b>Speaker</b>:<br /> Trent Weiss, University of Virginia, US<br /> <b>Authors</b>:<br /> Trent Weiss and Madhur Behl, University of Virginia, US<br /> <em><b>Abstract</b><br /> We consider the challenging problem of vision-based high-speed autonomous racing in realistic dynamic environments. We present DeepRacing, a novel end-to-end framework, and a virtual testbed for training and evaluating algorithms for autonomous racing. The virtual testbed is implemented using the Formula One (F1) Codemasters game, which is used by many real-world F1 drivers for training. We also present AdmiralNet, a Convolutional Neural Network (CNN) integrated with Long Short-Term Memory (LSTM) cells that can be tuned for the autonomous racing task in the highly realistic F1 game.
We evaluate AdmiralNet's performance on unseen race tracks, and also evaluate the degree of transference between the simulation and the real world by implementing end-to-end racing on a physical 1/10-scale autonomous racecar.</em></td> </tr> <tr> <td>09:00</td> <td>9.2.2</td> <td><b>FAIL-OPERATIONAL AUTOMOTIVE SOFTWARE DESIGN USING AGENT-BASED GRACEFUL DEGRADATION</b><br /> <b>Speaker</b>:<br /> Philipp Weiss, TU Munich, DE<br /> <b>Authors</b>:<br /> Philipp Weiss<sup>1</sup>, Andreas Weichslgartner<sup>2</sup>, Felix Reimann<sup>2</sup> and Sebastian Steinhorst<sup>1</sup><br /> <sup>1</sup>TU Munich, DE; <sup>2</sup>Audi Electronics Venture GmbH, DE</td> </tr> <tr> <td>09:30</td> <td>9.2.3</td> <td><b>A DISTRIBUTED SAFETY MECHANISM USING MIDDLEWARE AND HYPERVISORS FOR AUTONOMOUS VEHICLES</b><br /> <b>Speaker</b>:<br /> Pieter van der Perk, NXP Semiconductors, NL<br /> <b>Authors</b>:<br /> Tjerk Bijlsma<sup>1</sup>, Andrii Buriachevskyi<sup>2</sup>, Alessandro Frigerio<sup>3</sup>, Yuting Fu<sup>2</sup>, Kees Goossens<sup>4</sup>, Ali Osman Örs<sup>2</sup>, Pieter J. van der Perk<sup>2</sup>, Andrei Terechko<sup>2</sup> and Bart Vermeulen<sup>2</sup><br /> <sup>1</sup>TNO, NL; <sup>2</sup>NXP Semiconductors, NL; <sup>3</sup>Eindhoven University of Technology, NL; <sup>4</sup>Eindhoven University of Technology, NL</td> </tr> <tr> <td>10:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="9.3">9.3 Special Session: In-Memory Computing for Edge AI</h2> <p><b>Date:</b> Thursday 12 March 2020<br /> <b>Time:</b> 08:30 - 10:00<br /> <b>Location / Room:</b> Autrans</p> <p><b>Chair:</b><br /> Maha Kooli, CEA-Leti, FR</p> <p><b>Co-Chair:</b><br /> Alexandre Levisse, EPFL, CH</p> <p>In-Memory Computing (IMC) represents a new computing paradigm in which computation happens at the data location. Within the landscape of IMC approaches, non-von Neumann architectures seek to minimize the data movement associated with computing. Artificial intelligence applications are among the most promising use cases of IMC, since they are both compute- and memory-intensive. Running such applications on edge devices offers significant energy savings and high-speed acceleration. This special session proposes to take the attendees along a journey through IMC solutions for Edge AI. The session will cover four different viewpoints of IMC for Edge AI with four talks: (i) enabling flexible-electronics very-Edge AI with IMC, (ii) a design automation methodology for computational SRAM for energy-efficient SIMD operations, (iii) circuit/architecture/application multiscale design and optimization methodologies for IMC architectures, and (iv) device, circuit and architecture optimizations to enable PCM-based deep learning accelerators. The speakers come from three different continents (Asia, Europe, America) and four different countries (Singapore, France, USA, Switzerland). Two speakers are affiliated with academic institutes, one with industry, and one with a technological research institute. We strongly believe that the topic, and especially the selected talks, will attract attendees from different countries and affiliations, from both academia and industry. Furthermore, thanks to its cross-layer nature, we believe that this session is tailored to reach a wide range of experts, from the device and circuit community up to the system and application design community.
We also believe that highlighting and discussing such design methodologies is a key point for high-quality and high-impact research. Following up on the success of previous IMC-oriented sessions and panels at DAC 2019 and ISLPED 2019, we believe that this topic is extremely hot in the community and will trigger fruitful interactions and, we hope, collaborations. We thereby expect more than 60 attendees for this session. This session will be the subject of two scientific papers that will be included in the DATE proceedings in case of acceptance.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>9.3.1</td> <td><b>FLEDGE: FLEXIBLE EDGE PLATFORMS ENABLED BY IN-MEMORY COMPUTING</b><br /> <b>Speaker</b>:<br /> Kamalika Datta, Nanyang Technological University, SG<br /> <b>Authors</b>:<br /> Kamalika Datta<sup>1</sup>, Umesh Chand<sup>2</sup>, Arko Dutt<sup>1</sup>, Devendra Singh<sup>2</sup>, Aaron Thean<sup>2</sup> and Mohamed M. Sabry<sup>1</sup><br /> <sup>1</sup>Nanyang Technological University, SG; <sup>2</sup>National University of Singapore, SG</td> </tr> <tr> <td>08:50</td> <td>9.3.2</td> <td><b>COMPUTATIONAL SRAM DESIGN AUTOMATION USING PUSHED-RULE BITCELLS FOR ENERGY-EFFICIENT VECTOR PROCESSING</b><br /> <b>Speaker</b>:<br /> Maha Kooli, CEA-Leti, FR<br /> <b>Authors</b>:<br /> Jean-Philippe Noel<sup>1</sup>, Valentin Egloff<sup>1</sup>, Maha Kooli<sup>1</sup>, Roman Gauchi<sup>1</sup>, Jean-Michel Portal<sup>2</sup>, Henri-Pierre Charles<sup>1</sup>, Pascal Vivet<sup>1</sup> and Bastien Giraud<sup>1</sup><br /> <sup>1</sup>CEA-Leti, FR; <sup>2</sup>Aix-Marseille University, FR</td> </tr> <tr> <td>09:10</td> <td>9.3.3</td> <td><b>DEMONSTRATING IN-CACHE COMPUTING THANKS TO CROSS-LAYER DESIGN OPTIMIZATIONS</b><br /> <b>Authors</b>:<br /> Marco Rios, William Simon, Alexandre Levisse, Marina Zapater and David Atienza, EPFL, CH</td> </tr> <tr> <td>09:35</td> <td>9.3.4</td> <td><b>DEVICE, CIRCUIT AND SOFTWARE INNOVATIONS TO MAKE DEEP LEARNING WITH ANALOG MEMORY A REALITY</b><br /> <b>Authors</b>:<br /> Pritish Narayanan, Stefano Ambrogio, Hsinyu Tsai, Katie Spoon and Geoffrey W. Burr, IBM Research, US</td> </tr> <tr> <td>10:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="9.4">9.4 Efficient DNN Design with Approximate Computing</h2> <p><b>Date:</b> Thursday 12 March 2020<br /> <b>Time:</b> 08:30 - 10:00<br /> <b>Location / Room:</b> Stendhal</p> <p><b>Chair:</b><br /> Daniel Menard, INSA Rennes, FR</p> <p><b>Co-Chair:</b><br /> Seokhyeong Kang, Pohang University of Science and Technology, KR</p> <p>Deep Neural Networks (DNNs) are widely used in numerous domains. Cross-layer DNN approximation requires an efficient simulation framework: the GPU-accelerated simulation framework ProxSim supports DNN inference and retraining for approximate hardware. A significant amount of energy is consumed during the training process due to excessive memory accesses; the precision-controlled memory system, dedicated to GPUs, allows flexible management of approximation. A new generation of networks, such as Capsule Networks, provides better learning capabilities but at the expense of high complexity.
The ReD-CaNe methodology analyzes the resilience of Capsule Networks through error injection and guides the selection of approximate components.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>9.4.1</td> <td><b>PROXSIM: SIMULATION FRAMEWORK FOR CROSS-LAYER APPROXIMATE DNN OPTIMIZATION</b><br /> <b>Speaker</b>:<br /> Cecilia Eugenia De la Parra Aparicio, Robert Bosch GmbH, DE<br /> <b>Authors</b>:<br /> Cecilia De la Parra<sup>1</sup>, Andre Guntoro<sup>1</sup> and Akash Kumar<sup>2</sup><br /> <sup>1</sup>Robert Bosch GmbH, DE; <sup>2</sup>TU Dresden, DE<br /> <em><b>Abstract</b><br /> Through cross-layer approximation of Deep Neural Networks (DNNs), significant improvements in hardware resource utilization for DNN applications can be achieved. This comes at the cost of accuracy degradation, which can be compensated for through different optimization methods. However, DNN optimization is highly time-consuming in existing simulation frameworks for cross-layer DNN approximation, as they are usually implemented for CPU usage only. Especially for large-scale image processing tasks, the need for a more efficient simulation framework is evident. In this paper we present ProxSim, a specialized, GPU-accelerated simulation framework for approximate hardware, based on TensorFlow, which supports approximate DNN inference and retraining. Additionally, we propose a novel hardware-aware regularization technique for approximate DNN optimization. Using ProxSim, we report up to 11x savings in execution time, compared to a multi-thread CPU-based framework, and an accuracy recovery of up to 30% for three case studies of image classification with MNIST, CIFAR-10 and ImageNet.</em></td> </tr> <tr> <td>09:00</td> <td>9.4.2</td> <td><b>PCM: PRECISION-CONTROLLED MEMORY SYSTEM FOR ENERGY EFFICIENT DEEP NEURAL NETWORK TRAINING</b><br /> <b>Speaker</b>:<br /> Boyeal Kim, Seoul National University, KR<br /> <b>Authors</b>:<br /> Boyeal Kim<sup>1</sup>, SangHyun Lee<sup>1</sup>, Hyun Kim<sup>2</sup>, Duy-Thanh Nguyen<sup>3</sup>, Minh-Son Le<sup>3</sup>, Ik Joon Chang<sup>3</sup>, Dohun Kwon<sup>4</sup>, Jin Hyeok Yoo<sup>5</sup>, Jun Won Choi<sup>4</sup> and Hyuk-Jae Lee<sup>1</sup><br /> <sup>1</sup>Seoul National University, KR; <sup>2</sup>Seoul National University of Science and Technology, KR; <sup>3</sup>Kyung Hee University, KR; <sup>4</sup>Hanyang University, KR; <sup>5</sup>Hanyang University, KR<br /> <em><b>Abstract</b><br /> Deep neural network (DNN) training suffers from significant energy consumption in the memory system, and most existing energy reduction techniques for the memory system have focused on introducing low precision that is compatible with the computing unit (e.g., FP16, FP8). These studies have shown that even when training networks with FP16 data precision, it is possible to achieve training accuracy as good as FP32, the de facto standard for DNN training. However, our extensive experiments show that the data precision can be reduced further while maintaining the training accuracy of DNNs, by truncating some least significant bits (LSBs) of FP16, an approach we name hard approximation. Nevertheless, the existing hardware structures for DNN training cannot efficiently support such low precision. In this work, we propose a novel memory system architecture for GPUs, named precision-controlled memory system (PCM), which allows for flexible management at the level of hard approximation.
PCM provides high DRAM bandwidth by distributing each precision to different channels with a transposed data mapping on DRAM. In addition, PCM supports fine-grained hard approximation in the L1 data cache using software-controlled registers, which can reduce data movement and thereby improve energy saving and system performance. Furthermore, PCM facilitates the reduction of data maintenance energy, which accounts for a considerable portion of memory energy consumption, by controlling the DRAM refresh period. The experimental results show that, when training ResNet-20 on the CIFAR-100 dataset with precision tuning, PCM achieves 66% energy saving and a 20% performance enhancement without loss of accuracy.</em></td> </tr> <tr> <td>09:30</td> <td>9.4.3</td> <td><b>RED-CANE: A SYSTEMATIC METHODOLOGY FOR RESILIENCE ANALYSIS AND DESIGN OF CAPSULE NETWORKS UNDER APPROXIMATIONS</b><br /> <b>Speaker</b>:<br /> Alberto Marchisio, TU Wien, AT<br /> <b>Authors</b>:<br /> Alberto Marchisio<sup>1</sup>, Vojtech Mrazek<sup>2</sup>, Muhammad Abdullah Hanif<sup>3</sup> and Muhammad Shafique<sup>3</sup><br /> <sup>1</sup>TU Wien, AT; <sup>2</sup>Brno University of Technology, CZ; <sup>3</sup>TU Wien, AT<br /> <em><b>Abstract</b><br /> Recent advances in Capsule Networks (CapsNets) have shown their superior learning capability, compared to traditional Convolutional Neural Networks (CNNs). However, the extremely high complexity of CapsNets limits their fast deployment in real-world applications. Moreover, while the resilience of CNNs has been extensively investigated to enable their energy-efficient implementations, the analysis of CapsNets' resilience is a largely unexplored area that can provide a strong foundation for investigating techniques to overcome the CapsNets' complexity challenge. Following the trend of Approximate Computing to enable energy-efficient designs, we perform an extensive resilience analysis of CapsNet inference subjected to approximation errors. Our methodology models the errors arising from the approximate components (like multipliers), and analyzes their impact on the classification accuracy of CapsNets. This enables the selection of approximate components based on the resilience of each operation of the CapsNet inference. We modify the TensorFlow framework to simulate the injection of approximation noise (based on the models of the approximate components) at different computational operations of the CapsNet inference. Our results show that CapsNets are more resilient to errors injected in the computations that occur during the dynamic routing (the softmax and the update of the coefficients) than in other stages like convolutions and activation functions. Our analysis is extremely useful towards designing efficient CapsNet hardware accelerators with approximate components.
To the best of our knowledge, this is the first proof-of-concept for employing approximations on specialized CapsNet hardware.</em></td> </tr> <tr> <td style="width:40px;">10:00</td> <td><a href="#IP4">IP4-12</a>, 968</td> <td><b>TOWARDS BEST-EFFORT APPROXIMATION: APPLYING NAS TO APPROXIMATE COMPUTING</b><br /> <b>Speaker</b>:<br /> Weiwei Chen, Chinese Academy of Sciences, CN<br /> <b>Authors</b>:<br /> Weiwei Chen, Ying Wang, Shuang Yang, Cheng Liu and Lei Zhang, Chinese Academy of Sciences, CN<br /> <em><b>Abstract</b><br /> The design of neural network architectures for code approximation involves a large number of hyper-parameters to explore; it is a non-trivial task to find a neural-based approximate computing solution that meets the demands of application-specified accuracy and Quality of Service (QoS). Prior works do not address the problem of 'optimal' network architecture design in program approximation, which depends on the user-specified constraints, the complexity of the dataset and the hardware configuration. In this paper, we apply Neural Architecture Search (NAS) to searching for and selecting neural approximate computing solutions, and provide an automatic framework that tries to generate the best-effort approximation result while satisfying the user-specified QoS/accuracy constraints. Compared with previous methods, this work achieves more than 1.43x speedup and 1.74x energy reduction on average when applied to the AxBench benchmarks.</em></td> </tr> <tr> <td style="width:40px;">10:01</td> <td><a href="#IP4">IP4-13</a>, 973</td> <td><b>ON THE AUTOMATIC EXPLORATION OF WEIGHT SHARING FOR DEEP NEURAL NETWORK COMPRESSION</b><br /> <b>Speaker</b>:<br /> Etienne Dupuis, École Centrale de Lyon, FR<br /> <b>Authors</b>:<br /> Etienne Dupuis<sup>1</sup>, David Novo<sup>2</sup>, Ian O'Connor<sup>1</sup> and Alberto Bosio<sup>1</sup><br /> <sup>1</sup>Lyon Institute of Nanotechnology, FR; <sup>2</sup>Université de Montpellier, FR<br /> <em><b>Abstract</b><br /> Deep neural networks demonstrate impressive inference results, particularly in computer vision and speech recognition. However, the associated computational workload and storage render their use prohibitive in resource-limited embedded systems. The approximate computing paradigm has been widely explored in both industrial and academic circles. It improves performance and energy efficiency by relaxing the need for fully accurate operations. Consequently, there is a large number of implementation options with very different approximation strategies (such as pruning, quantization, low-rank factorization, knowledge distillation, ...). To the best of our knowledge, no automated approach exists for exploring, selecting and generating the best approximate versions of a given convolutional neural network (CNN) for the design objectives. The objective of this work in progress is to show that the design space exploration phase can enable significant network compression without noticeable accuracy loss. 
We demonstrate this via an example based on weight sharing, showing that our method can obtain a 4x compression rate without retraining and without accuracy loss on an int-16 version of LeNet-5 (a 5-layer, 1,720-kbit CNN).</em></td> </tr> <tr> <td>10:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="9.5">9.5 Emerging memory devices</h2> <p><b>Date:</b> Thursday 12 March 2020<br /> <b>Time:</b> 08:30 - 10:00<br /> <b>Location / Room:</b> Bayard</p> <p><b>Chair:</b><br /> Alexandre Levisse, EPFL, CH</p> <p><b>Co-Chair:</b><br /> Marco Vacca, Politecnico di Torino, IT</p> <p>The development of future memories is driven by new devices, studied to overcome the limitations of traditional memories. Among these devices, STT magnetic RAMs play a fundamental role, due to their excellent performance coupled with long endurance and non-volatility. What are the issues that these memories face? How can we solve them and make them ready for successful commercial development? And what if, by changing perspective, emerging devices were used to improve existing memories such as SRAM? These are some of the questions that this session aims to answer.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>9.5.1</td> <td><b>IMPACT OF MAGNETIC COUPLING AND DENSITY ON STT-MRAM PERFORMANCE</b><br /> <b>Speaker</b>:<br /> Lizhou Wu, TU Delft, NL<br /> <b>Authors</b>:<br /> Lizhou Wu<sup>1</sup>, Siddharth Rao<sup>2</sup>, Mottaqiallah Taouil<sup>1</sup>, Erik Jan Marinissen<sup>2</sup>, Gouri Sankar Kar<sup>2</sup> and Said Hamdioui<sup>1</sup><br /> <sup>1</sup>TU Delft, NL; <sup>2</sup>IMEC, BE<br /> <em><b>Abstract</b><br /> As a unique mechanism for MRAMs, magnetic coupling needs to be accounted for when designing memory arrays. This paper models both intra- and inter-cell magnetic coupling analytically for STT-MRAMs and investigates their impact on the write performance and retention of MTJ devices, which are the data-storing elements of STT-MRAMs. We present magnetic measurement data of MTJ devices with diameters ranging from 35 nm to 175 nm, which we use to calibrate our intra-cell magnetic coupling model. Subsequently, we extrapolate this model to study inter-cell magnetic coupling in memory arrays. We propose the inter-cell magnetic coupling factor Psi to indicate coupling strength. Our simulation results show that Psi=2% maximizes the array density under the constraint that the magnetic coupling has negligible impact on the device's performance. Higher array densities show significant variations in average switching time, especially at low switching voltages, caused by inter-cell magnetic coupling, and dependent on the data pattern in the cell's neighborhood. 
We also observe a marginal degradation of the data retention time under the influence of inter-cell magnetic coupling.</em></td> </tr> <tr> <td>09:00</td> <td>9.5.2</td> <td><b>HIGH-DENSITY, LOW-POWER VOLTAGE-CONTROL SPIN ORBIT TORQUE MEMORY WITH SYNCHRONOUS TWO-STEP WRITE AND SYMMETRIC READ TECHNIQUES</b><br /> <b>Speaker</b>:<br /> Wang Kang, Beihang University, CN<br /> <b>Authors</b>:<br /> Haotian Wang<sup>1</sup>, Wang Kang<sup>1</sup>, Liuyang Zhang<sup>1</sup>, He Zhang<sup>1</sup>, Brajesh Kumar Kaushik<sup>2</sup> and Weisheng Zhao<sup>1</sup><br /> <sup>1</sup>Beihang University, CN; <sup>2</sup>IIT Roorkee, IN<br /> <em><b>Abstract</b><br /> The voltage-control spin orbit torque (VC-SOT) magnetic tunnel junction (MTJ) has the potential to achieve high-speed and low-power spintronic memory, owing to the adaptive voltage-modulated energy barrier of the MTJ. However, the three-terminal device structure needs two access transistors (one for the write operation and the other for the read operation) and thus occupies a larger bit-cell area compared to two-terminal MTJs. A feasible method to reduce this area overhead is to stack multiple VC-SOT MTJs on a common antiferromagnetic strip to share the write access transistors. With this structure, high density can be achieved. However, write and read operations face problems, and the design space is unclear for a given strip length. In this paper, we propose a synchronous two-step multi-bit write and symmetric read method by exploiting the selective VC-SOT-driven MTJ switching mechanism. Hybrid circuits are then designed and evaluated based on a physics-based VC-SOT MTJ model and a 40nm CMOS design kit to show the feasibility and performance of our method. Our work enables high-density, low-power, high-speed voltage-control SOT memory.</em></td> </tr> <tr> <td>09:30</td> <td>9.5.3</td> <td><b>DESIGN OF ALMOST-NONVOLATILE EMBEDDED DRAM USING NANOELECTROMECHANICAL RELAY DEVICES</b><br /> <b>Speaker</b>:<br /> Hongtao Zhong, Tsinghua University, CN<br /> <b>Authors</b>:<br /> Hongtao Zhong, Mingyang Gu, Juejian Wu, Huazhong Yang and Xueqing Li, Tsinghua University, CN<br /> <em><b>Abstract</b><br /> This paper proposes a low-power design of embedded dynamic random-access memory (eDRAM) using emerging nanoelectromechanical (NEM) relay devices. The motivation of this work is to reduce the standby refresh power consumption by improving the retention time of eDRAM cells. In this paper, it is revealed that the tunable beyond-CMOS characteristics of emerging NEM relay devices, especially the ultra-high OFF-state drain-source resistance, open up new opportunities for device-circuit co-design. In addition, the pull-in and pull-out threshold voltages are tuned to fit the operating mechanisms of eDRAM, so as to support low-voltage operation along with long retention time. Excitingly, when low-gate-leakage thick-gate transistors are used together, the proposed NEM-relay-based eDRAM exhibits such a significant retention time improvement that it behaves as almost "nonvolatile". Even when using thin-gate transistors in a 130nm CMOS process, the evaluation of the proposed eDRAM shows up to 63x and 127x retention time improvement at 1.0V and 1.4V supply, respectively. 
A detailed performance benchmarking analysis, along with a practical CMOS-compatible NEM relay model and eDRAM design and optimization considerations, is included in this paper.</em></td> </tr> <tr> <td style="width:40px;">10:00</td> <td><a href="#IP4">IP4-14</a>, 100</td> <td><b>ROBUST AND HIGH-PERFORMANCE 12-T INTERLOCKED SRAM FOR IN-MEMORY COMPUTING</b><br /> <b>Speaker</b>:<br /> Joycee Mekie, IIT Gandhinagar, IN<br /> <b>Authors</b>:<br /> Neelam Surana, Mili Lavania, Abhishek Barma and Joycee Mekie, IIT Gandhinagar, IN<br /> <em><b>Abstract</b><br /> In this paper, we analyze existing SRAM-based In-Memory Computing (IMC) proposals and show through exhaustive simulations that they fail under process variations. 6-T SRAM, 8-T SRAM, and 10-T SRAM based IMC architectures suffer from compute-disturb (stored data flips during IMC), compute-failure (false computation results), and half-select failures, respectively. To circumvent these issues, we propose a novel 12-T Dual Port Dual Interlocked-storage Cell (DPDICE) SRAM. The DPDICE SRAM based IMC architecture (DPDICE-IMC) can perform essential Boolean functions in a single cycle and can perform basic arithmetic operations such as add and multiply. The most striking feature is that the DPDICE-IMC architecture can perform IMC on two datasets simultaneously, thus doubling the throughput. Cumulatively, the proposed DPDICE-IMC is 26.7%, 8x, and 28% better than 6-T SRAM, 8-T SRAM, and 10-T SRAM based IMC architectures, respectively.</em></td> </tr> <tr> <td style="width:40px;">10:01</td> <td><a href="#IP4">IP4-15</a>, 600</td> <td><b>HIGH DENSITY STT-MRAM COMPILER DESIGN, VALIDATION AND CHARACTERIZATION METHODOLOGY IN 28NM FDSOI TECHNOLOGY</b><br /> <b>Speaker</b>:<br /> Piyush Jain, ARM Embedded Technologies Pvt Ltd., IN<br /> <b>Authors</b>:<br /> Piyush Jain<sup>1</sup>, Akshay Kumar<sup>1</sup>, Nicolaas Van Winkelhoff<sup>2</sup>, Didier Gayraud<sup>2</sup>, Surya Gupta<sup>3</sup>, Abdelali El Amraoui<sup>2</sup>, Giorgio Palma<sup>2</sup>, Alexandra Gourio<sup>2</sup>, Laurentz Vachez<sup>2</sup>, Luc Palau<sup>2</sup>, Jean-Christophe Buy<sup>2</sup> and Cyrille Dray<sup>2</sup><br /> <sup>1</sup>ARM Embedded Technologies Pvt Ltd., IN; <sup>2</sup>ARM France, FR; <sup>3</sup>ARM Embedded Technologies Pvt Ltd., IN<br /> <em><b>Abstract</b><br /> Spin Transfer Torque Magneto-resistive Random-Access Memory (STT-MRAM) is emerging as a promising substitute for flash memories due to the scaling challenges for flash in process nodes beyond 28nm. STT-MRAM's high endurance, fast speed and low power make it suitable for a wide variety of applications. An embedded MRAM (eMRAM) compiler is highly desirable to enable SoC designers to use eMRAM instances in their designs in a flexible manner. However, the development of an eMRAM compiler has the added challenges of handling multi-fold higher density and maintaining analog circuit accuracy, on top of the challenges associated with conventional SRAM memory compilers. In this paper, we present a successful design methodology for a high-density 128Mb eMRAM compiler in a 28nm fully depleted SOI (FDSOI) process. This compiler enables optimized eMRAM instance generation with varying capacity ranges, word widths, and optional features like repair and error correction. The eMRAM compiler design is achieved by evolving various architecture design, validation and characterization methods. 
A hierarchical and modular characterization methodology is presented to enable high-accuracy characterization and industry-standard EDA view generation from the eMRAM compiler.</em></td> </tr> <tr> <td>10:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="9.6">9.6 Intelligent Dependable Systems</h2> <p><b>Date:</b> Thursday 12 March 2020<br /> <b>Time:</b> 08:30 - 10:00<br /> <b>Location / Room:</b> Lesdiguières</p> <p><b>Chair:</b><br /> Rishad Shafik, Newcastle University, GB</p> <p>This session spans from dependability approaches for multicore systems realized as SoCs, covering intelligent reliability management and on-line software-based self-test, to error-resilient AI systems, where the AI system is either re-designed to tolerate critical faults or used for error detection purposes.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>9.6.1</td> <td><b>THERMAL-CYCLING-AWARE DYNAMIC RELIABILITY MANAGEMENT IN MANY-CORE SYSTEM-ON-CHIP</b><br /> <b>Speaker</b>:<br /> Mohammad-Hashem Haghbayan, University of Turku, FI<br /> <b>Authors</b>:<br /> Mohammad-Hashem Haghbayan<sup>1</sup>, Antonio Miele<sup>2</sup>, Zhuo Zou<sup>3</sup>, Hannu Tenhunen<sup>1</sup> and Juha Plosila<sup>1</sup><br /> <sup>1</sup>University of Turku, FI; <sup>2</sup>Politecnico di Milano, IT; <sup>3</sup>Nanjing University of Science and Technology, CN<br /> <em><b>Abstract</b><br /> Dynamic Reliability Management (DRM) is a common approach to mitigate aging and wear-out effects in multi-/many-core systems. State-of-the-art DRM approaches apply fine-grained control on resource management to increase/balance chip reliability while considering other system constraints, e.g., performance and power budget. Such approaches, acting on various knobs such as workload mapping and scheduling, Dynamic Voltage/Frequency Scaling (DVFS) and Per-Core Power Gating (PCPG), have been demonstrated to work properly with various aging mechanisms, such as electromigration and Negative-Bias Temperature Instability (NBTI). However, we claim that they do not suffice for thermal cycling. Thus, we here propose a novel thermal-cycling-aware DRM approach for shared-memory many-core systems running multi-threaded applications. The approach applies fine-grained control capable of reducing both temperature levels and variations. The experimental evaluations demonstrate that the proposed approach achieves a 39% longer lifetime than past approaches.</em></td> </tr> <tr> <td>09:00</td> <td>9.6.2</td> <td><b>DETERMINISTIC CACHE-BASED EXECUTION OF ON-LINE SELF-TEST ROUTINES IN MULTI-CORE AUTOMOTIVE SYSTEM-ON-CHIPS</b><br /> <b>Speaker</b>:<br /> Andrea Floridia, Politecnico di Torino, IT<br /> <b>Authors</b>:<br /> Andrea Floridia<sup>1</sup>, Tzamn Melendez Carmona<sup>1</sup>, Davide Piumatti<sup>1</sup>, Annachiara Ruospo<sup>1</sup>, Ernesto Sanchez<sup>1</sup>, Sergio De Luca<sup>2</sup>, Rosario Martorana<sup>2</sup> and Mose Alessandro Pernice<sup>2</sup><br /> <sup>1</sup>Politecnico di Torino, IT; <sup>2</sup>STMicroelectronics, IT<br /> <em><b>Abstract</b><br /> Traditionally, the usage of caches and the deterministic execution of on-line self-test procedures have been considered two mutually exclusive concepts. At the same time, software executed in a multi-core context suffers from limited timing predictability due to the higher system bus contention. 
When dealing with self-test procedures, this higher contention might lead to fluctuating fault coverage or even the failure of some test programs. This paper presents a cache-based strategy for achieving both deterministic behaviour and stable fault coverage from the execution of self-test procedures in multi-core systems. The proposed strategy is applied to two representative modules negatively affected by multi-core execution: the synchronous imprecise interrupt logic and the pipeline hazard detection unit. The experiments illustrate that it is possible to achieve stable execution while also improving on state-of-the-art approaches for the on-line testing of embedded microprocessors. The effectiveness of the methodology was assessed on all three cores of a multi-core industrial System-on-Chip intended for automotive ASIL D applications.</em></td> </tr> <tr> <td>09:30</td> <td>9.6.3</td> <td><b>FT-CLIPACT: RESILIENCE ANALYSIS OF DEEP NEURAL NETWORKS AND IMPROVING THEIR FAULT TOLERANCE USING CLIPPED ACTIVATION</b><br /> <b>Authors</b>:<br /> Le-Ha Hoang<sup>1</sup>, Muhammad Abdullah Hanif<sup>2</sup> and Muhammad Shafique<sup>2</sup><br /> <sup>1</sup>TU Wien, AT; <sup>2</sup>TU Wien, AT<br /> <em><b>Abstract</b><br /> Deep Neural Networks (DNNs) are widely being adopted for safety-critical applications, e.g., healthcare and autonomous driving. Inherently, they are considered to be highly error-tolerant. However, recent studies have shown that hardware faults that impact the parameters of a DNN (e.g., weights) can have drastic impacts on its classification accuracy. In this paper, we perform a comprehensive error resilience analysis of DNNs subjected to hardware faults (e.g., permanent faults) in the weight memory. The outcome of this analysis is leveraged to propose a novel error mitigation technique which squashes the high-intensity faulty activation values to alleviate their impact. We achieve this by replacing the unbounded activation functions with their clipped versions. We also present a method to systematically define the clipping values of the activation functions that result in increased resilience of the networks against faults. We evaluate our technique on the AlexNet and VGG-16 DNNs trained on the CIFAR-10 dataset. The experimental results show that our mitigation technique significantly improves the networks' resilience to faults. For example, the proposed technique offers an average 68.92% improvement in the classification accuracy of the resilience-optimized VGG-16 model at a 1×10−5 fault rate, when compared to the base network without any fault mitigation.</em></td> </tr> <tr> <td style="width:40px;">10:00</td> <td><a href="#IP4">IP4-16</a>, 221</td> <td><b>AN APPROXIMATION-BASED FAULT DETECTION SCHEME FOR IMAGE PROCESSING APPLICATIONS</b><br /> <b>Speaker</b>:<br /> Antonio Miele, Politecnico di Milano, IT<br /> <b>Authors</b>:<br /> Matteo Biasielli, Luca Cassano and Antonio Miele, Politecnico di Milano, IT<br /> <em><b>Abstract</b><br /> Image processing applications expose an intrinsic resilience to faults. In this application field, the classical Duplication with Comparison (DWC) scheme, where output images are discarded as soon as the two replicas' outputs differ in at least one pixel, may be over-conservative. 
This paper introduces a novel lightweight fault detection scheme for image processing applications: i) it extends the DWC scheme by substituting one of the two exact replicas with a faster approximated one; and ii) it features a Neural Network-based checker designed to distinguish between usable and unusable images instead of faulty/fault-free ones. The application of the hardening scheme on a case study has shown an execution time reduction from 27% to 34% w.r.t. the DWC, while guaranteeing comparable fault detection capability.</em></td> </tr> <tr> <td>10:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="9.7">9.7 Diverse Applications of Emerging Technologies</h2> <p><b>Date:</b> Thursday 12 March 2020<br /> <b>Time:</b> 08:30 - 10:00<br /> <b>Location / Room:</b> Berlioz</p> <p><b>Chair:</b><br /> Pavlidis Vasilis, The University of Manchester, GB</p> <p><b>Co-Chair:</b><br /> Bing Li, TU Munich, DE</p> <p>This session examines a diverse set of applications for emerging technologies. The papers consider the use of Q-learning to perform more efficient backups in non-volatile processors, the use of emerging technologies to mitigate hardware side-channels, time-sequence-based classification of the ultrasonic patterns that arise from hand movements for gesture recognition, and processing-in-memory-based solutions to accelerate DNA alignment searches.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>08:30</td> <td>9.7.1</td> <td><b>Q-LEARNING BASED BACKUP FOR ENERGY HARVESTING POWERED EMBEDDED SYSTEMS</b><br /> <b>Speaker</b>:<br /> Wei Fan, Shandong University, CN<br /> <b>Authors</b>:<br /> Wei Fan, Yujie Zhang, Weining Song, Mengying Zhao, Zhaoyan Shen and Zhiping Jia, Shandong University, CN<br /> <em><b>Abstract</b><br /> Non-volatile processors (NVPs) are used in energy-harvesting-powered embedded systems to preserve data across interruptions. In NVP systems, volatile data are backed up to non-volatile memory upon power failures and restored after power comes back. Traditionally, backup is triggered immediately when an energy warning occurs. However, it is also possible to utilize the residual energy more aggressively for program execution to improve forward progress. In this work, we propose a Q-learning based backup strategy to achieve maximal forward progress in energy-harvesting-powered intermittent embedded systems. The experimental results show an average of 307.4% and 43.4% improved forward progress compared with traditional instant backup and the most related work, respectively.</em></td> </tr> <tr> <td>09:00</td> <td>9.7.2</td> <td><b>A NOVEL TIGFET-BASED DFF DESIGN FOR IMPROVED RESILIENCE TO POWER SIDE-CHANNEL ATTACKS</b><br /> <b>Speaker</b>:<br /> Michael Niemier, University of Notre Dame, US<br /> <b>Authors</b>:<br /> Mohammad Mehdi Sharifi<sup>1</sup>, Ramin Rajaei<sup>1</sup>, Patsy Cadareanu<sup>2</sup>, Pierre-Emmanuel Gaillardon<sup>2</sup>, Yier Jin<sup>3</sup>, Michael Niemier<sup>1</sup> and X. Sharon Hu<sup>1</sup><br /> <sup>1</sup>University of Notre Dame, US; <sup>2</sup>University of Utah, US; <sup>3</sup>University of Florida, US<br /> <em><b>Abstract</b><br /> Side-channel attacks (SCAs) represent a significant security threat and aim to reveal otherwise secret data by analyzing a relevant circuit's behavior, e.g., its power consumption. 
While all circuit components are potential power side channels, D-flip-flops (DFFs) are often the primary source of information leakage to an SCA. This paper proposes a DFF design based on the three-independent-gate field-effect transistor (TIGFET) that reduces the side-channel vulnerabilities of sequential circuits. Notably, we find that the I-V characteristics of the TIGFET itself lead to inherent side-channel resilience, which in turn enables simpler and more efficient cryptographic hardware. Our proposed design is based on a prior TIGFET-based true single-phase clock (TSPC) DFF design, which offers high performance and reduced area. More specifically, our modified TSPC (mTSPC) design exploits the symmetric I-V characteristics of TIGFETs, which results in pull-up and pull-down currents that are nearly identical. When combined with additional circuit modifications (made possible by the unique characteristics of the TIGFET), the mTSPC circuit draws almost the same amount of supply current under all possible input transitions (less than 1% variation across transitions), which can in turn mask information leakage. Using a 10nm TIGFET technology model, simulation results show that the proposed TIGFET-based DFF circuit leads to decreased power consumption (up to 96.9% when compared to prior secured designs), has a low delay (15.2 ps), and employs only 12 TIGFET devices. Furthermore, an 8-bit S-box whose output is sampled by a group of eight mTSPC DFFs was simulated. A correlation power analysis attack on the simulated S-box with 256 power traces shows that the key is not revealed, which confirms the SCA resiliency of the proposed DFF design.</em></td> </tr> <tr> <td>09:30</td> <td>9.7.3</td> <td><b>LOW COMPLEXITY MULTI-DIRECTIONAL IN-AIR ULTRASONIC GESTURE RECOGNITION USING A TCN</b><br /> <b>Speaker</b>:<br /> Emad A. Ibrahim, Eindhoven University of Technology, NL<br /> <b>Authors</b>:<br /> Emad A. Ibrahim<sup>1</sup>, Marc Geilen<sup>1</sup>, Jos Huisken<sup>1</sup>, Min Li<sup>2</sup> and Jose Pineda de Gyvez<sup>2</sup><br /> <sup>1</sup>Eindhoven University of Technology, NL; <sup>2</sup>NXP Semiconductors, NL<br /> <em><b>Abstract</b><br /> Following the trend of ultrasound-based gesture recognition, this study introduces the concept of time-sequence classification of ultrasonic patterns induced by hand movements on a microphone array. We refer to time-sequence ultrasound echoes as continuous frequency patterns received in real time at different steering angles. The ultrasound source is a single tone continuously emitted from the center of the microphone array. Meanwhile, the array beamforms and locates ultrasonic activity (induced echoes), after which a processing pipeline is initiated to extract band-limited frequency features. These beamformed features are organized in a 2D matrix of size 11×30, updated every 10ms, on which a Temporal Convolutional Network (TCN) outputs a continuous classification. Prior to that, the same TCN is trained to classify the Doppler shift variability rate. Using this approach, we show that a user can perform 49 gestures at different steering angles by means of sequence detection. To keep it simple for users, we define two Doppler shift variability rates, very slow and very fast, which the TCN detects 95-99% of the time. Not only can a gesture be performed in different directions, but the length of each performed gesture can also be measured. This leverages the diversity of in-air ultrasonic gestures, allowing more control capabilities. 
The process is designed for low-resource settings; that is, given that this real-time process is always on, power and memory resources should be optimized. The proposed solution needs 6.2-10.2 MMACs and a memory footprint of 6KB, allowing such a gesture recognition system to be hosted by energy-constrained edge devices such as smart speakers.</em></td> </tr> <tr> <td>09:45</td> <td>9.7.4</td> <td><b>PIM-ALIGNER: A PROCESSING-IN-MRAM PLATFORM FOR BIOLOGICAL SEQUENCE ALIGNMENT</b><br /> <b>Speaker</b>:<br /> Deliang Fan, Arizona State University, US<br /> <b>Authors</b>:<br /> Shaahin Angizi<sup>1</sup>, Jiao Sun<sup>1</sup>, Wei Zhang<sup>1</sup> and Deliang Fan<sup>2</sup><br /> <sup>1</sup>University of Central Florida, US; <sup>2</sup>Arizona State University, US<br /> <em><b>Abstract</b><br /> In this paper, we propose a high-throughput and energy-efficient Processing-in-Memory accelerator (PIM-Aligner) to execute DNA short-read alignment based on an optimized and hardware-friendly alignment algorithm. We first reconstruct the existing sequence alignment algorithm based on BWT and FM-index such that it can be fully implemented in PIM platforms. It supports exact alignment and also handles mismatches to reduce excessive backtracking. We then develop the PIM-Aligner platform, which transforms the SOT-MRAM array into a computational memory to accelerate the reconstructed alignment-in-memory algorithm, incurring a low cost on top of original SOT-MRAM chips (less than 10% of chip area). Accordingly, we present a local data partitioning, mapping, and pipelining technique to maximize the parallelism across multiple computational sub-arrays while performing the alignment task. The simulation results show that PIM-Aligner outperforms recent platforms based on dynamic programming with ~3.1x higher throughput per Watt. Besides, PIM-Aligner improves the short-read alignment throughput per Watt per mm^2 by ~9x and 1.9x compared to FM-index-based ASIC and processing-in-ReRAM designs, respectively.</em></td> </tr> <tr> <td style="width:40px;">10:00</td> <td><a href="#IP4">IP4-17</a>, 852</td> <td><b>TRANSPORT-FREE MODULE BINDING FOR SAMPLE PREPARATION USING MICROFLUIDIC FULLY PROGRAMMABLE VALVE ARRAYS</b><br /> <b>Speaker</b>:<br /> Gautam Choudhary, Adobe Research, India, IN<br /> <b>Authors</b>:<br /> Gautam Choudhary<sup>1</sup>, Sandeep Pal<sup>1</sup>, Debraj Kundu<sup>1</sup>, Sukanta Bhattacharjee<sup>2</sup>, Shigeru Yamashita<sup>3</sup>, Bing Li<sup>4</sup>, Ulf Schlichtmann<sup>4</sup> and Sudip Roy<sup>1</sup><br /> <sup>1</sup>IIT Roorkee, IN; <sup>2</sup>Indian Statistical Institute, IN; <sup>3</sup>Ritsumeikan University, JP; <sup>4</sup>TU Munich, DE<br /> <em><b>Abstract</b><br /> Microfluidic fully programmable valve array (FPVA) biochips have emerged as general-purpose flow-based microfluidic lab-on-chips (LoCs). Unlike application-specific flow-based LoCs, an FPVA supports highly re-configurable on-chip components (modules) in a two-dimensional grid-like structure controlled by software programs. Fluids can be loaded into or washed from a cell with the help of flows from the inlet to the outlet of an FPVA, whereas cell-to-cell transportation of discrete fluid segment(s) is not precisely possible. The simplest mixing module to realize on an FPVA-based LoC is a four-way mixer consisting of a 2×2 array of cells working as a ring-like mixer having four valves. 
In this paper, we propose a design automation method for sample preparation that finds suitable placements of the mixing operations of a mixing tree using four-way mixers, without requiring any transportation of fluid(s) between modules. We also propose a heuristic that modifies the mixing tree to reduce the sample preparation time. We have performed extensive simulations and examined several parameters to determine the performance of the proposed solution.</em></td> </tr> <tr> <td>10:00</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="9.8">9.8 Special Session: Panel: Variation-aware Analyses of Mega-MOSFET Memories: Challenges and Solutions</h2> <p><b>Date:</b> Thursday 12 March 2020<br /> <b>Time:</b> 08:30 - 10:00<br /> <b>Location / Room:</b> Exhibition Theatre</p> <p><b>Moderators:</b><br /> Firas MOHAMED, Silvaco, FR<br /> Jean-Baptiste DULUC, Silvaco, FR</p> <p>Designing large memories under manufacturing variability requires statistical approaches that rely on SPICE simulations at different Process, Voltage and Temperature operating points to verify that yield requirements will be met. Variation-aware simulation of full memories consisting of millions of transistors is a challenging task for both SPICE simulators and the statistical methodology needed to achieve accurate results. The ideal solution for variation-aware verification of full memories would be to run Monte Carlo simulations through SPICE simulators to assess that all the addressable elements enable successful write and read operations. However, this classical approach suffers from practical issues that prevent its use. Indeed, for large memory arrays (e.g. MB and more), the number of SPICE simulations to perform would be intractable for achieving decent statistical precision. Moreover, the SPICE simulation of a single sample of the full-memory netlist, which involves millions or billions of MOSFETs and parasitic elements, might be very long or impossible because of the netlist size. Unfortunately, Fast-SPICE simulations are not a palatable solution for final verification because the loss of accuracy compared to pure SPICE simulations is difficult to evaluate for such netlists. So far, most variation-aware methodologies to analyze and validate Mega-MOSFET memories rely on the assumption that the sub-blocks of the system (e.g. control unit, IOs, row decoders, column circuitries, memory cells) can be assessed independently. In doing so, memory designers apply dedicated statistical approaches to each individual sub-block to reduce the overall simulation time needed to achieve variation-aware closure. When each element of the memory is considered independent of its neighborhood, the simulation of the memory is drastically reduced to a few MOSFETs on the critical paths (the longest paths for read or write memory operations), the other sub-blocks being idealized and estimates being derived under a Gaussian assumption. Using such an approach, memory designers avoid the usual statistical simulation of the full memory, which is, most of the time, impractical in terms of duration and load. Although this approach has been widely used by memory designers, these methods reach their limits when designing memories for low-power and advanced-node technologies where non-idealities arise. The consequence of less reliable results is that memory designers compensate by increasing security margins at the expense of performance to achieve satisfactory yield. 
In this context, where sub-blocks can no longer be considered individually and Gaussianity no longer prevails, other practical simulation flows are required to verify full memories with satisfactory performance. New statistical approaches and simulation flows must handle memory slices or critical paths with all relevant sub-blocks in order to consider element interactions and be more realistic. Additionally, these approaches must handle the hierarchy of the memory to respect the variation ranges of each sub-block, from low sigma for control units and IOs to high sigma for highly replicated blocks. Using a virtual reconstruction of the full memory, the yield can be assessed without relying on the assumptions of individual sub-block analyses. With accurate estimation over the full memory, no extra security margins are required, and better performance will be reached.</p> <p><b>Panelists:</b></p> <ul> <li>Yves Laplanche, ARM, FR</li> <li>Lorenzo Ciampolini, CEA, FR</li> <li>Pierre Faubet, SILVACO FRANCE, FR</li> </ul> <table> <tbody> <tr> <td style="width: 20px">10:00</td> <td>End of session</td> </tr> <tr> <td style="width: 20px"> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="IP4">IP4 Interactive Presentations</h2> <p><b>Date:</b> Thursday 12 March 2020<br /> <b>Time:</b> 10:00 - 10:30<br /> <b>Location / Room:</b> Poster Area</p> <p>Interactive Presentations run simultaneously during a 30-minute slot. Additionally, each IP paper is briefly introduced in a one-minute presentation in a corresponding regular session.</p> <table> <tbody> <tr> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> <tr> <td style="width:40px;">IP4-1</td> <td><b>HIT: A HIDDEN INSTRUCTION TROJAN MODEL FOR PROCESSORS</b><br /> <b>Speaker</b>:<br /> Jiaqi Zhang, Tongji University, CN<br /> <b>Authors</b>:<br /> Jiaqi Zhang<sup>1</sup>, Ying Zhang<sup>1</sup>, Huawei Li<sup>2</sup> and Jianhui Jiang<sup>3</sup><br /> <sup>1</sup>Tongji University, CN; <sup>2</sup>Chinese Academy of Sciences, CN; <sup>3</sup>School of Software Engineering, Tongji University, CN<br /> <em><b>Abstract</b><br /> This paper explores an intrusion mechanism for microprocessors using illegal instructions, namely a hidden instruction Trojan (HIT). It uses a low-probability sequence consisting of normal instructions as a boot sequence, followed by an illegal instruction to trigger the Trojan. The payload is a hidden interrupt that forces the program counter to a specific address. Hence, the program at that address runs with super privileges. Meanwhile, we use integer programming to minimize the trigger probability of HIT within a given area overhead. The experimental results demonstrate that HIT has an extremely low trigger probability and can survive detection by existing test methods.</em></td> </tr> <tr> <td style="width:40px;">IP4-2</td> <td><b>BITSTREAM MODIFICATION ATTACK ON SNOW 3G</b><br /> <b>Speaker</b>:<br /> Michail Moraitis, Royal Institute of Technology KTH, SE<br /> <b>Authors</b>:<br /> Michail Moraitis and Elena Dubrova, Royal Institute of Technology - KTH, SE<br /> <em><b>Abstract</b><br /> SNOW 3G is one of the core algorithms for confidentiality and integrity in several 3GPP wireless communication standards, including the new Next Generation (NG) 5G. It is believed to be resistant to classical cryptanalysis. In this paper, we show that SNOW 3G can be broken by a fault attack based on bitstream modification. 
By changing the content of some look-up tables in the bitstream, we reduce the non-linear state-updating function of SNOW 3G to a linear one. As a result, it becomes possible to recover the key from a known plaintext-ciphertext pair. To the best of our knowledge, this is the first successful bitstream modification attack on SNOW 3G.</em></td> </tr> <tr> <td style="width:40px;">IP4-3</td> <td><b>A MACHINE LEARNING BASED WRITE POLICY FOR SSD CACHE IN CLOUD BLOCK STORAGE</b><br /> <b>Speaker</b>:<br /> Yu Zhang, Huazhong University of Science &amp; Technology, CN<br /> <b>Authors</b>:<br /> Yu Zhang<sup>1</sup>, Ke Zhou<sup>1</sup>, Ping Huang<sup>2</sup>, Hua Wang<sup>1</sup>, Jianying Hu<sup>3</sup>, Yangtao Wang<sup>1</sup>, Yongguang Ji<sup>3</sup> and Bin Cheng<sup>3</sup><br /> <sup>1</sup>Huazhong University of Science &amp; Technology, CN; <sup>2</sup>Temple University, US; <sup>3</sup>Tencent Technology (Shenzhen) Co., Ltd., CN<br /> <em><b>Abstract</b><br /> Nowadays, SSD caches play an important role in cloud storage systems. The associated write policy, which enforces an admission control policy regarding filling data into the cache, has a significant impact on the performance of the cache system and the amount of write traffic to SSD caches. Based on our analysis of a typical cloud block storage system, approximately 47.09% of writes are write-only, i.e., writes to blocks which have not been read during a certain time window. Naively writing the write-only data to the SSD cache unnecessarily introduces a large number of harmful writes to the SSD cache without any contribution to cache performance. On the other hand, it is a challenging task to identify and filter out this write-only data in a real-time manner, especially in a cloud environment running changing and diverse workloads. In this paper, to alleviate the above cache problem, we propose ML-WP, a Machine Learning Based Write Policy, which reduces write traffic to SSDs by avoiding writing write-only data. The main challenge in this approach is to identify write-only data in a real-time manner. To realize ML-WP and achieve accurate write-only data identification, we use machine learning methods to classify data into two groups (i.e., write-only and normal data). Based on this classification, the write-only data is directly written to backend storage without being cached. Experimental results show that, compared with the write-back policy widely deployed in industry, ML-WP decreases write traffic to the SSD cache by 41.52%, while improving the hit ratio by 2.61% and reducing the average read latency by 37.52%.</em></td> </tr> <tr> <td style="width:40px;">IP4-4</td> <td><b>YOU ONLY SEARCH ONCE: A FAST AUTOMATION FRAMEWORK FOR SINGLE-STAGE DNN/ACCELERATOR CO-DESIGN</b><br /> <b>Speaker</b>:<br /> Weiwei Chen, Chinese Academy of Sciences, CN<br /> <b>Authors</b>:<br /> Weiwei Chen, Ying Wang, Shuang Yang, Cheng Liu and Lei Zhang, Chinese Academy of Sciences, CN<br /> <em><b>Abstract</b><br /> DNN/Accelerator co-design has shown great potential in improving QoR and performance. Typical approaches separate the design flow into two stages: (1) designing an application-specific DNN model with high accuracy; (2) building an accelerator considering the DNN's specific characteristics. However, this may fail to deliver the highest composite score, which combines the goals of accuracy and other hardware-related constraints (e.g., latency, energy efficiency), when building a specific neural-network-based system. 
In this work, we present a single-stage automated framework, YOSO, which aims to generate an optimal software-and-hardware solution that flexibly balances the goals of accuracy, power, and QoS. Compared with the two-stage method on the baseline systolic array accelerator and the CIFAR-10 dataset, we achieve 1.42x~2.29x energy reduction or 1.79x~3.07x latency reduction at the same level of precision, for different user-specified energy and latency optimization constraints, respectively.</em></td> </tr> <tr> <td style="width:40px;">IP4-5</td> <td><b>WHEN SORTING NETWORK MEETS PARALLEL BITSTREAMS: A FAULT-TOLERANT PARALLEL TERNARY NEURAL NETWORK ACCELERATOR BASED ON STOCHASTIC COMPUTING</b><br /> <b>Speaker</b>:<br /> Yawen Zhang, Peking University, CN<br /> <b>Authors</b>:<br /> Yawen Zhang<sup>1</sup>, Sheng Lin<sup>2</sup>, Runsheng Wang<sup>1</sup>, Yanzhi Wang<sup>2</sup>, Yuan Wang<sup>1</sup>, Weikang Qian<sup>3</sup> and Ru Huang<sup>1</sup><br /> <sup>1</sup>Peking University, CN; <sup>2</sup>Northeastern University, US; <sup>3</sup>Shanghai Jiao Tong University, CN<br /> <em><b>Abstract</b><br /> Stochastic computing (SC) has been widely used in neural networks (NNs) due to its low hardware cost and high fault tolerance. Conventionally, SC-based NN accelerators adopt a hybrid stochastic-binary format, using an accumulative parallel counter to convert bitstreams into a binary number. This method, however, sacrifices fault tolerance and incurs a high hardware cost. In order to fully exploit the superior fault tolerance of SC, taking a ternary neural network (TNN) as an example, we propose a parallel SC-based NN accelerator purely using bitstream computation. We apply a bitonic sorting network to simultaneously implement the accumulation and activation function with parallel bitstreams. The proposed design not only has high fault tolerance, but also achieves at least a 2.8x energy efficiency improvement over its binary computing counterpart.</em></td> </tr> <tr> <td style="width:40px;">IP4-6</td> <td><b>WAVEPRO: CLOCK-LESS WAVE-PROPAGATED PIPELINE COMPILER FOR LOW-POWER AND HIGH-THROUGHPUT COMPUTATION</b><br /> <b>Speaker</b>:<br /> Yehuda Kra, Bar-Ilan University, IL<br /> <b>Authors</b>:<br /> Yehuda Kra, Adam Teman and Tzachi Noy, Bar-Ilan University, IL<br /> <em><b>Abstract</b><br /> Clock-less Wave-Propagated Pipelining is a long-known approach to achieving high throughput without the overhead of costly sampling registers. However, due to many design challenges, which have only increased with technology scaling, this approach has never been widely accepted and has generally been limited to small and very specific demonstrations. This paper addresses this barrier by presenting WavePro, a generic and scalable algorithm capable of skew-balancing any combinatorial logic netlist for the application of wave pipelining. The algorithm was implemented in the WavePro Compiler automation utility, which interfaces with industry delay extraction and standard timing analysis tools to produce a sign-off quality result. The utility is demonstrated on a dot-product accelerator in a 65 nm CMOS technology, using a vendor-provided standard cell library and commercial timing analysis tools. 
By reducing the worst-case output skew by over 70%, the test case example was able to achieve the equivalent throughput of an 8-stage sequentially pipelined implementation with power savings of almost 3x.</em></td> </tr> <tr> <td style="width:40px;">IP4-7</td> <td><b>DEEPNVM: A FRAMEWORK FOR MODELING AND ANALYSIS OF NON-VOLATILE MEMORY TECHNOLOGIES FOR DEEP LEARNING APPLICATIONS</b><br /> <b>Speaker</b>:<br /> Ahmet Inci, Carnegie Mellon University, US<br /> <b>Authors</b>:<br /> Ahmet Inci, Mehmet M Isgenc and Diana Marculescu, Carnegie Mellon University, US<br /> <em><b>Abstract</b><br /> Non-volatile memory (NVM) technologies such as spin-transfer torque magnetic random access memory (STT-MRAM) and spin-orbit torque magnetic random access memory (SOT-MRAM) have significant advantages compared to conventional SRAM due to their non-volatility, higher cell density, and scalability features. While previous work has investigated several architectural implications of NVM for generic applications, in this work we present DeepNVM, a framework to characterize, model, and analyze NVM-based caches in GPU architectures for deep learning (DL) applications by combining technology-specific circuit-level models and the actual memory behavior of various DL workloads. We present both iso-capacity and iso-area performance and energy analysis for systems whose last-level caches rely on conventional SRAM and emerging STT-MRAM and SOT-MRAM technologies. In the iso-capacity case, STT-MRAM and SOT-MRAM provide up to 4.2x and 5x energy-delay product (EDP) reduction and 2.4x and 3x area reduction compared to conventional SRAM, respectively. Under iso-area assumptions, STT-MRAM and SOT-MRAM provide 2.3x EDP reduction on average across all workloads when compared to SRAM. Our comprehensive cross-layer framework is demonstrated on STT-/SOT-MRAM technologies and can be used for the characterization, modeling, and analysis of any NVM technology for last-level caches in GPU platforms for deep learning applications.</em></td> </tr> <tr> <td style="width:40px;">IP4-8</td> <td><b>EFFICIENT EMBEDDED MACHINE LEARNING APPLICATIONS USING ECHO STATE NETWORKS</b><br /> <b>Speaker</b>:<br /> Rolando Brondolin, Politecnico di Milano, IT<br /> <b>Authors</b>:<br /> Luca Cerina<sup>1</sup>, Giuseppe Franco<sup>2</sup>, Claudio Gallicchio<sup>3</sup>, Alessio Micheli<sup>3</sup> and Marco D. Santambrogio<sup>4</sup><br /> <sup>1</sup>Politecnico di Milano, IT; <sup>2</sup>Scuola Superiore Sant'Anna / Università di Pisa, IT; <sup>3</sup>Università di Pisa, IT; <sup>4</sup>Politecnico di Milano, IT<br /> <em><b>Abstract</b><br /> The increasing role of Artificial Intelligence (AI) and Machine Learning (ML) in our lives has brought a paradigm shift in how and where computation is performed. Stringent latency requirements and congested bandwidth have moved AI inference from the cloud towards end devices. This change has required a major simplification of Deep Neural Networks (DNNs), with memory-wise libraries or co-processors that perform fast inference with minimal power. Unfortunately, many applications such as natural language processing, time-series analysis and audio interpretation are built on a different type of Artificial Neural Network (ANN), the so-called Recurrent Neural Networks (RNNs), which, due to their intrinsic architecture, remain too complex and heavy to run efficiently on embedded devices. 
To solve this issue, the Reservoir Computing paradigm proposes sparse untrained non-linear networks, the reservoir, that can embed temporal relations without some of the hindrances of Recurrent Neural Network training, and with lower memory usage. Echo State Networks (ESNs) and Liquid State Machines are the most notable examples. In this scenario, we propose a performance comparison of an ESN, designed and trained using Bayesian Optimization techniques, against current RNN solutions. We aim to demonstrate that ESNs have comparable performance in terms of accuracy, require minimal training time, and are more optimized in terms of memory usage and computational efficiency. Preliminary results show that ESNs are competitive with RNNs on a simple benchmark, and both training and inference are faster, with maximum speed-ups of 2.35x and 6.60x, respectively.</em></td> </tr> <tr> <td style="width:40px;">IP4-9</td> <td><b>EXPLFRAME: EXPLOITING PAGE FRAME CACHE FOR FAULT ANALYSIS OF BLOCK CIPHERS</b><br /> <b>Speaker</b>:<br /> Anirban Chakraborty, IIT Kharagpur, IN<br /> <b>Authors</b>:<br /> Anirban Chakraborty<sup>1</sup>, Sarani Bhattacharya<sup>2</sup>, Sayandeep Saha<sup>1</sup> and Debdeep Mukhopadhyay<sup>1</sup><br /> <sup>1</sup>IIT Kharagpur, IN; <sup>2</sup>KU Leuven, BE<br /> <em><b>Abstract</b><br /> The Page Frame Cache (PFC) is a purely software cache, present in modern Linux-based operating systems (OS), which stores the page frames that were recently released by the processes running on a particular CPU. In this paper, we show that the page frame cache can be maliciously exploited by an adversary to steer the pages of a victim process to some pre-decided, attacker-chosen locations in memory. We practically demonstrate an end-to-end attack, ExplFrame, where an attacker having only user-level privilege is able to force a victim process's memory pages to vulnerable locations in DRAM and deterministically conduct Rowhammer to induce faults. As a case study, we induce single-bit faults in the T-tables of OpenSSL (v1.1.1) AES using our proposed attack ExplFrame. We also propose an improved fault analysis technique which can exploit any Rowhammer-induced bit-flips in the AES T-tables.</em></td> </tr> <tr> <td style="width:40px;">IP4-10</td> <td><b>XGBIR: AN XGBOOST-BASED IR DROP PREDICTOR FOR POWER DELIVERY NETWORK</b><br /> <b>Speaker</b>:<br /> An-Yu Su, National Chiao Tung University, TW<br /> <b>Authors</b>:<br /> Chi-Hsien Pao, Yu-Min Lee and An-Yu Su, National Chiao Tung University, TW<br /> <em><b>Abstract</b><br /> This work utilizes XGBoost to build a machine-learning-based IR drop predictor, XGBIR, for the power grid. To capture the behavior of the power grid, we extract several of its features and employ its locality property to save extraction time. 
XGBIR can be effectively applied to large designs, and the average error of the predicted IR drops is less than 6 mV.</em></td> </tr> <tr> <td style="width:40px;">IP4-11</td> <td><b>ON PRE-ASSIGNMENT ROUTE PROTOTYPING FOR IRREGULAR BUMPS ON BGA PACKAGES</b><br /> <b>Speaker</b>:<br /> Hung-Ming Chen, National Chiao Tung University, TW<br /> <b>Authors</b>:<br /> Jyun-Ru Jiang<sup>1</sup>, Yun-Chih Kuo<sup>2</sup>, Simon Chen<sup>3</sup> and Hung-Ming Chen<sup>1</sup><br /> <sup>1</sup>National Chiao Tung University, TW; <sup>2</sup>National Taiwan University, TW; <sup>3</sup>MediaTek Inc., TW<br /> <em><b>Abstract</b><br /> In modern package design, the bumps are often placed irregularly due to macros of varying sizes and positions. This makes pre-assignment routing more difficult, even with massive design effort. This work presents a 2-stage routing method which can be applied to an arbitrary bump placement on 2-layer BGA packages. Our approach combines escape routing with via assignment: the escape routing is used to handle the irregular bumps, and the via assignment is applied to improve the wire congestion and total wirelength of global routing. Experimental results based on industrial cases show that our methodology can solve the routing efficiently, and we have achieved an 82% improvement in wire congestion with a 5% wirelength increase compared with conventional regular treatments.</em></td> </tr> <tr> <td style="width:40px;">IP4-12</td> <td><b>TOWARDS BEST-EFFORT APPROXIMATION: APPLYING NAS TO APPROXIMATE COMPUTING</b><br /> <b>Speaker</b>:<br /> Weiwei Chen, Chinese Academy of Sciences, CN<br /> <b>Authors</b>:<br /> Weiwei Chen, Ying Wang, Shuang Yang, Cheng Liu and Lei Zhang, Chinese Academy of Sciences, CN<br /> <em><b>Abstract</b><br /> The design of neural network architectures for code approximation involves a large number of hyper-parameters to explore; it is a non-trivial task to find a neural-based approximate computing solution that meets the demands of application-specified accuracy and Quality of Service (QoS). Prior works do not address the problem of 'optimal' network architecture design in program approximation, which depends on the user-specified constraints, the complexity of the dataset and the hardware configuration. In this paper, we apply Neural Architecture Search (NAS) to searching for and selecting neural approximate computing solutions, and provide an automatic framework that tries to generate the best-effort approximation result while satisfying the user-specified QoS/accuracy constraints. Compared with previous methods, this work achieves more than 1.43x speedup and 1.74x energy reduction on average when applied to the AxBench benchmarks.</em></td> </tr> <tr> <td style="width:40px;">IP4-13</td> <td><b>ON THE AUTOMATIC EXPLORATION OF WEIGHT SHARING FOR DEEP NEURAL NETWORK COMPRESSION</b><br /> <b>Speaker</b>:<br /> Etienne Dupuis, École Centrale de Lyon, FR<br /> <b>Authors</b>:<br /> Etienne Dupuis<sup>1</sup>, David Novo<sup>2</sup>, Ian O'Connor<sup>1</sup> and Alberto Bosio<sup>1</sup><br /> <sup>1</sup>Lyon Institute of Nanotechnology, FR; <sup>2</sup>Université de Montpellier, FR<br /> <em><b>Abstract</b><br /> Deep neural networks demonstrate impressive inference results, particularly in computer vision and speech recognition. However, the associated computational workload and storage render their use prohibitive in resource-limited embedded systems. The approximate computing paradigm has been widely explored in both industrial and academic circles. 
It improves performance and energy efficiency by relaxing the need for fully accurate operations. Consequently, there is a large number of implementation options with very different approximation strategies (such as pruning, quantization, low-rank factorization, knowledge distillation, ...). To the best of our knowledge, no automated approach exists for exploring, selecting and generating the best approximate versions of a given convolutional neural network (CNN) for the design objectives. The objective of this work in progress is to show that the design space exploration phase can enable significant network compression without noticeable accuracy loss. We demonstrate this via an example based on weight sharing, showing that our method can obtain a 4x compression rate without retraining and without accuracy loss on an int-16 version of LeNet-5 (a 5-layer, 1,720-kbit CNN).</em></td> </tr> <tr> <td style="width:40px;">IP4-14</td> <td><b>ROBUST AND HIGH-PERFORMANCE 12-T INTERLOCKED SRAM FOR IN-MEMORY COMPUTING</b><br /> <b>Speaker</b>:<br /> Joycee Mekie, IIT Gandhinagar, IN<br /> <b>Authors</b>:<br /> Neelam Surana, Mili Lavania, Abhishek Barma and Joycee Mekie, IIT Gandhinagar, IN<br /> <em><b>Abstract</b><br /> In this paper, we analyze existing SRAM-based In-Memory Computing (IMC) proposals and show through exhaustive simulations that they fail under process variations. 6-T SRAM, 8-T SRAM, and 10-T SRAM based IMC architectures suffer from compute-disturb (stored data flips during IMC), compute-failure (false computation results), and half-select failures, respectively. To circumvent these issues, we propose a novel 12-T Dual Port Dual Interlocked-storage Cell (DPDICE) SRAM. The DPDICE SRAM based IMC architecture (DPDICE-IMC) can perform essential Boolean functions in a single cycle and can perform basic arithmetic operations such as add and multiply. The most striking feature is that the DPDICE-IMC architecture can perform IMC on two datasets simultaneously, thus doubling the throughput. Cumulatively, the proposed DPDICE-IMC is 26.7%, 8x, and 28% better than 6-T SRAM, 8-T SRAM, and 10-T SRAM based IMC architectures, respectively.</em></td> </tr> <tr> <td style="width:40px;">IP4-15</td> <td><b>HIGH DENSITY STT-MRAM COMPILER DESIGN, VALIDATION AND CHARACTERIZATION METHODOLOGY IN 28NM FDSOI TECHNOLOGY</b><br /> <b>Speaker</b>:<br /> Piyush Jain, ARM Embedded Technologies Pvt Ltd., IN<br /> <b>Authors</b>:<br /> Piyush Jain<sup>1</sup>, Akshay Kumar<sup>1</sup>, Nicolaas Van Winkelhoff<sup>2</sup>, Didier Gayraud<sup>2</sup>, Surya Gupta<sup>3</sup>, Abdelali El Amraoui<sup>2</sup>, Giorgio Palma<sup>2</sup>, Alexandra Gourio<sup>2</sup>, Laurentz Vachez<sup>2</sup>, Luc Palau<sup>2</sup>, Jean-Christophe Buy<sup>2</sup> and Cyrille Dray<sup>2</sup><br /> <sup>1</sup>ARM Embedded Technologies Pvt Ltd., IN; <sup>2</sup>ARM France, FR; <sup>3</sup>ARM Embedded Technologies Pvt Ltd., IN<br /> <em><b>Abstract</b><br /> Spin Transfer Torque Magneto-resistive Random-Access Memory (STT-MRAM) is emerging as a promising substitute for flash memories due to the scaling challenges for flash in process nodes beyond 28nm. STT-MRAM's high endurance, fast speed and low power make it suitable for a wide variety of applications. An embedded MRAM (eMRAM) compiler is highly desirable to enable SoC designers to use eMRAM instances in their designs in a flexible manner. 
However, the development of an eMRAM compiler has the added challenges of handling multi-fold higher density and maintaining analog circuit accuracy, on top of the challenges associated with conventional SRAM memory compilers. In this paper, we present a successful design methodology for a high-density 128Mb eMRAM compiler in a 28nm fully depleted SOI (FDSOI) process. This compiler enables optimized eMRAM instance generation with varying capacity ranges, word widths, and optional features like repair and error correction. The eMRAM compiler design is achieved by evolving various architecture design, validation and characterization methods. A hierarchical and modular characterization methodology is presented to enable high-accuracy characterization and industry-standard EDA view generation from the eMRAM compiler.</em></td> </tr> <tr> <td style="width:40px;">IP4-16</td> <td><b>AN APPROXIMATION-BASED FAULT DETECTION SCHEME FOR IMAGE PROCESSING APPLICATIONS</b><br /> <b>Speaker</b>:<br /> Antonio Miele, Politecnico di Milano, IT<br /> <b>Authors</b>:<br /> Matteo Biasielli, Luca Cassano and Antonio Miele, Politecnico di Milano, IT<br /> <em><b>Abstract</b><br /> Image processing applications expose an intrinsic resilience to faults. In this application field, the classical Duplication with Comparison (DWC) scheme, where output images are discarded as soon as the two replicas' outputs differ in at least one pixel, may be over-conservative. This paper introduces a novel lightweight fault detection scheme for image processing applications: i) it extends the DWC scheme by substituting one of the two exact replicas with a faster approximated one; and ii) it features a Neural Network-based checker designed to distinguish between usable and unusable images instead of faulty/fault-free ones. The application of the hardening scheme on a case study has shown an execution time reduction from 27% to 34% w.r.t. the DWC, while guaranteeing comparable fault detection capability.</em></td> </tr> <tr> <td style="width:40px;">IP4-17</td> <td><b>TRANSPORT-FREE MODULE BINDING FOR SAMPLE PREPARATION USING MICROFLUIDIC FULLY PROGRAMMABLE VALVE ARRAYS</b><br /> <b>Speaker</b>:<br /> Gautam Choudhary, Adobe Research, India, IN<br /> <b>Authors</b>:<br /> Gautam Choudhary<sup>1</sup>, Sandeep Pal<sup>1</sup>, Debraj Kundu<sup>1</sup>, Sukanta Bhattacharjee<sup>2</sup>, Shigeru Yamashita<sup>3</sup>, Bing Li<sup>4</sup>, Ulf Schlichtmann<sup>4</sup> and Sudip Roy<sup>1</sup><br /> <sup>1</sup>IIT Roorkee, IN; <sup>2</sup>Indian Statistical Institute, IN; <sup>3</sup>Ritsumeikan University, JP; <sup>4</sup>TU Munich, DE<br /> <em><b>Abstract</b><br /> Microfluidic fully programmable valve array (FPVA) biochips have emerged as general-purpose flow-based microfluidic lab-on-chips (LoCs). Unlike application-specific flow-based LoCs, an FPVA supports highly re-configurable on-chip components (modules) in a two-dimensional grid-like structure controlled by software programs. Fluids can be loaded into or washed from a cell with the help of flows from the inlet to the outlet of an FPVA, whereas cell-to-cell transportation of discrete fluid segment(s) is not precisely possible. The simplest mixing module to realize on an FPVA-based LoC is a four-way mixer consisting of a 2×2 array of cells working as a ring-like mixer having four valves. 
In this paper, we propose a design automation method for sample preparation that finds suitable placements of the mixing operations of a mixing tree using four-way mixers, without requiring any transportation of fluid(s) between modules. We also propose a heuristic that modifies the mixing tree to reduce the sample preparation time. We have performed extensive simulations and examined several parameters to determine the performance of the proposed solution.</em></td> </tr> </tbody> </table> <hr /> <h2 id="UB09">UB09 Session 9</h2> <p><b>Date:</b> Thursday 12 March 2020<br /> <b>Time:</b> 10:00 - 12:00<br /> <b>Location / Room:</b> Booth 11, Exhibition Area</p> <p> </p> <table> <tbody> <tr> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> <tr> <td>UB09.1</td> <td><b>TAPASCO: THE OPEN-SOURCE TASK-PARALLEL SYSTEM COMPOSER FRAMEWORK</b><br /> <b>Authors</b>:<br /> Carsten Heinz, Lukas Sommer, Lukas Weber, Jaco Hofmann and Andreas Koch, TU Darmstadt, DE<br /> <em><b>Abstract</b><br /> Field-programmable gate arrays (FPGAs) are an established platform for highly specialized accelerators, but in a heterogeneous setup, the accelerator still needs to be integrated into the overall system. The open-source TaPaSCo (Task-Parallel System Composer) framework was created to serve this purpose: the fast integration of FPGA-based accelerators into compute platforms or systems-on-chip (SoC) and their connection to relevant components on the FPGA board. TaPaSCo supports developers in all steps of the development process: from cores produced by high-level synthesis or written in an HDL, a complete FPGA design can be created. TaPaSCo automatically connects all processing elements to the memory and host interfaces and generates a complete bitstream.
The TaPaSCo Runtime API allows software to interface with the accelerators and supports operations such as transferring data to the FPGA memory, passing values and controlling the execution of the accelerators.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3101.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB09.2</td> <td><b>RESCUED: A RESCUE DEMONSTRATOR FOR INTERDEPENDENT ASPECTS OF RELIABILITY, SECURITY AND QUALITY TOWARDS A COMPLETE EDA FLOW</b><br /> <b>Authors</b>:<br /> Nevin George<sup>1</sup>, Guilherme Cardoso Medeiros<sup>2</sup>, Junchao Chen<sup>3</sup>, Josie Esteban Rodriguez Condia<sup>4</sup>, Thomas Lange<sup>5</sup>, Aleksa Damljanovic<sup>4</sup>, Raphael Segabinazzi Ferreira<sup>1</sup>, Aneesh Balakrishnan<sup>5</sup>, Xinhui Lai<sup>6</sup>, Shayesteh Masoumian<sup>7</sup>, Dmytro Petryk<sup>3</sup>, Troya Cagil Koylu<sup>2</sup>, Felipe Augusto da Silva<sup>8</sup>, Ahmet Cagri Bagbaba<sup>8</sup>, Cemil Cem Gürsoy<sup>6</sup>, Said Hamdioui<sup>2</sup>, Mottaqiallah Taouil<sup>2</sup>, Milos Krstic<sup>3</sup>, Peter Langendoerfer<sup>3</sup>, Zoya Dyka<sup>3</sup>, Marcelo Brandalero<sup>1</sup>, Michael Hübner<sup>1</sup>, Jörg Nolte<sup>1</sup>, Heinrich Theodor Vierhaus<sup>1</sup>, Matteo Sonza Reorda<sup>4</sup>, Giovanni Squillero<sup>4</sup>, Luca Sterpone<sup>4</sup>, Jaan Raik<sup>6</sup>, Dan Alexandrescu<sup>5</sup>, Maximilien Glorieux<sup>5</sup>, Georgios Selimis<sup>7</sup>, Geert-Jan Schrijen<sup>7</sup>, Anton Klotz<sup>8</sup>, Christian Sauer<sup>8</sup> and Maksim Jenihhin<sup>6</sup><br /> <sup>1</sup>Brandenburg University of Technology Cottbus-Senftenberg, DE; <sup>2</sup>TU Delft, NL; <sup>3</sup>Leibniz-Institut für innovative Mikroelektronik, DE; <sup>4</sup>Politecnico di Torino, IT; <sup>5</sup>IROC Technologies, FR; <sup>6</sup>Tallinn University of Technology, EE; <sup>7</sup>Intrinsic ID, NL; <sup>8</sup>Cadence Design Systems GmbH, DE<br /> <em><b>Abstract</b><br /> The demonstrator highlights the various interdependent aspects of Reliability, Security and Quality in nanoelectronics system design within an EDA toolset and a processor architecture setup. The need for attention to these three aspects of nanoelectronic systems has become ever more pronounced with the extreme miniaturization of technologies. Further, such systems have exploded in number with IoT devices, heavy analog interaction with the external physical world, complex safety-critical applications, and artificial intelligence applications. RESCUE targets these aspects in the form of Reliability (functional safety, ageing, soft errors), Security (tamper-resistance, PUF technology, intelligent security) and Quality (novel fault models, functional test, FMEA/FMECA, verification/debug), spanning the entire hardware/software system stack. The demonstrator is brought together by a group of PhD students under the banner of the H2020-MSCA-ITN RESCUE European Union project.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3096.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB09.3</td> <td><b>PAFUSI: PARTICLE FILTER FUSION ASIC FOR INDOOR POSITIONING</b><br /> <b>Authors</b>:<br /> Christian Schott, Marko Rößler, Daniel Froß, Marcel Putsche and Ulrich Heinkel, TU Chemnitz, DE<br /> <em><b>Abstract</b><br /> The value of data acquired from IoT devices is greatly enhanced if the global or local position at which they were acquired is known.
Both the infrastructure for indoor positioning and the IoT device itself call for small, energy-efficient yet powerful devices that provide this location awareness. We propose the PAFUSI, a hardware implementation of a UWB position estimation algorithm that fulfils these requirements. Our design fuses distance measurements to fixed points in an environment to calculate the position in 3D space and is capable of using different positioning technologies like GPS, DecaWave or Nanotron as data sources simultaneously. Our design comprises an estimator, which processes the data by means of a Sequential Monte Carlo method, and a microcontroller core, which configures and controls the measurement unit and analyses the results of the estimator. The PAFUSI is manufactured as a monolithic integrated ASIC in a multi-project wafer in UMC's 65nm process.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3102.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB09.4</td> <td><b>SKELETOR: AN OPEN SOURCE EDA TOOL FLOW FROM HIERARCHY SPECIFICATION TO HDL DEVELOPMENT</b><br /> <b>Authors</b>:<br /> Ivan Rodriguez, Guillem Cabo, Javier Barrera, Jeremy Giesen, Alvaro Jover and Leonidas Kosmidis, BSC / UPC, ES<br /> <em><b>Abstract</b><br /> Large hardware design projects have a high overhead for project bootstrapping, requiring significant effort to translate hardware specifications into hardware design language (HDL) files and to set up the corresponding development and verification infrastructure. Skeletor (<a href="https://github.com/jaquerinte/Skeletor" title="https://github.com/jaquerinte/Skeletor">https://github.com/jaquerinte/Skeletor</a>) is an open source EDA tool developed as a student project at UPC/BSC, which simplifies this process by increasing developers' productivity and reducing typing errors, while at the same time lowering the bar for entry into hardware development. Skeletor uses a C/Verilog-like language for the specification of the modules in a hardware project hierarchy and their connections, which is used to automatically generate the required skeleton of source files, their development and verification testbenches, and simulation scripts. Integration with KiCad schematics and support for syntax highlighting in code editors further simplify its use. This demo is linked with workshop W05.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3107.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB09.5</td> <td><b>SYSTEMC-CT/DE: A SIMULATOR WITH FAST AND ACCURATE CONTINUOUS TIME AND DISCRETE EVENTS INTERACTIONS ON TOP OF SYSTEMC.</b><br /> <b>Authors</b>:<br /> Breytner Joseph Fernandez-Mesa, Liliana Andrade and Frédéric Pétrot, Université Grenoble Alpes / CNRS / TIMA Laboratory, FR<br /> <em><b>Abstract</b><br /> We have developed a continuous time (CT) and discrete events (DE) simulator on top of SystemC. Systems that mix both domains are critical and their proper functioning must be verified. Simulation serves to achieve this goal. Our simulator implements direct CT/DE synchronization, which enables a rich set of interactions between the domains: events from the CT models are able to trigger DE processes; events from the DE models are able to modify the CT equations. DE-based interactions are then simulated at their precise time by the DE kernel rather than at fixed time steps, as illustrated in the toy sketch below.
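<p>A toy illustration of the direct CT/DE synchronization just described, not the authors' SystemC code: the continuous-time state is solved analytically between events, so the discrete-event kernel is notified at the exact crossing time instead of scanning fixed time steps. The linear CT model and event labels are assumptions for the sketch.</p>
<pre><code># Toy CT/DE interaction: schedule a DE event at the exact CT threshold crossing.
import heapq

events = []                                # DE event queue: (time, label)
heapq.heappush(events, (5.0, "periodic DE tick"))

# CT model: x(t) = x0 + rate * t; solve x(t_cross) = threshold analytically
x0, rate, threshold = 0.0, 2.0, 7.0
t_cross = (threshold - x0) / rate          # exact crossing, no fixed-step scanning
heapq.heappush(events, (t_cross, "CT threshold crossing triggers DE process"))

while events:
    t, label = heapq.heappop(events)       # DE kernel processes events in time order
    print(f"t={t:.2f}: {label}")           # crossing fires at t=3.50, tick at t=5.00
</code></pre>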
We demonstrate our simulator by executing a set of challenging examples: they require a superdense model of time, include Zeno behavior, or are highly sensitive to accuracy errors. Results show that our simulator overcomes these issues, is accurate, and improves simulation speed w.r.t. fixed time steps; all of these advantages open up new possibilities for the design of a wider set of heterogeneous systems.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3110.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB09.6</td> <td><b>PARALLEL ALGORITHM FOR CNN INFERENCE AND ITS AUTOMATIC SYNTHESIS</b><br /> <b>Authors</b>:<br /> Takashi Matsumoto, Yukio Miyasaka, Xinpei Zhang and Masahiro Fujita, University of Tokyo, JP<br /> <em><b>Abstract</b><br /> Recently, Convolutional Neural Networks (CNNs) have surpassed conventional methods in the field of image processing. This demonstration shows a new algorithm to calculate CNN inference using processing elements arranged and connected based on the topology of the convolution. They are connected in a mesh and calculate CNN inference in a systolic way. The algorithm performs the convolution of all elements with the same output feature in parallel. We demonstrate a method to automatically synthesize an algorithm which simultaneously performs the convolution and the communication of pixels for the computation of the next layer. We experimented with several sizes of input layers, kernels, and strides and confirmed that correct algorithms were synthesized. The synthesis method is extended to sparse kernels. The synthesized algorithm requires fewer cycles than the original algorithm, and the sparser the kernel, the more opportunities there were to reduce the number of cycles.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3132.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB09.7</td> <td><b>EEC: ENERGY EFFICIENT COMPUTING VIA DYNAMIC VOLTAGE SCALING AND IN-NETWORK OPTICAL PROCESSING</b><br /> <b>Authors</b>:<br /> Ryosuke Matsuo<sup>1</sup>, Jun Shiomi<sup>1</sup>, Yutaka Masuda<sup>2</sup> and Tohru Ishihara<sup>2</sup><br /> <sup>1</sup>Kyoto University, JP; <sup>2</sup>Nagoya University, JP<br /> <em><b>Abstract</b><br /> This poster demonstration will show the results of our two research projects. The first is a project on energy-efficient computing, in which we developed a power management algorithm that keeps the target processor always running at its most energy-efficient operating point by appropriately tuning the supply voltage and threshold voltage under a specific performance constraint. This algorithm is applicable to a wide variety of processor systems, including high-end processors and low-end embedded processors. We will show results obtained with actual RISC processors designed using a 65nm technology. The second is a project on in-network optical computing. We show optical functional units such as parallel multipliers and optical neural networks. Several key techniques for reducing the power consumption of optical circuits will also be presented.
Finally, we will show the results of optical circuit simulation, which demonstrate the light-speed operation of the circuits.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3128.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB09.8</td> <td><b>SUBRISC+: IMPLEMENTATION AND EVALUATION OF AN EMBEDDED PROCESSOR FOR LIGHTWEIGHT IOT EHEALTH</b><br /> <b>Authors</b>:<br /> Mingyu Yang and Yuko Hara-Azumi, Tokyo Institute of Technology, JP<br /> <em><b>Abstract</b><br /> Although the rapid growth of the Internet of Things (IoT) has enabled new opportunities for eHealth devices, the further development of complex systems is severely constrained by the power and energy budgets of battery-powered embedded systems. To address this issue, this work presents a processor design called "SubRISC+" targeting lightweight IoT eHealth. SubRISC+ achieves low power/energy consumption through its unique and compact architecture. As an example of lightweight eHealth applications on SubRISC+, we are working on epileptic seizure detection using the dynamic time warping algorithm, to be deployed on wearable IoT eHealth devices. Simulation results show that a 22% reduction in dynamic power and 50% reductions in leakage power and core area are achieved compared to the Cortex-M0. As ongoing work, the evaluation on a fabricated chip will be done within the first half of 2020.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3129.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB09.9</td> <td><b>PA-HLS: HIGH-LEVEL ANNOTATION OF ROUTING CONGESTION FOR XILINX VIVADO HLS DESIGNS</b><br /> <b>Authors</b>:<br /> Osama Bin Tariq<sup>1</sup>, Junnan Shan<sup>1</sup>, Luciano Lavagno<sup>1</sup>, Georgios Floros<sup>2</sup>, Mihai Teodor Lazarescu<sup>1</sup>, Christos Sotiriou<sup>2</sup> and Mario Roberto Casu<sup>1</sup><br /> <sup>1</sup>Politecnico di Torino, IT; <sup>2</sup>University of Thessaly, GR<br /> <em><b>Abstract</b><br /> We will demo a novel high-level backannotation flow that reports routing congestion issues at the C++ source level by analyzing reports from FPGA physical design (Xilinx Vivado) and internal debugging files of the Vivado HLS tool. The flow annotates the C++ source code, identifying likely causes of congestion, e.g., on-chip memories or the DSP units. These shared resources often cause routing problems on FPGAs because they cannot be duplicated by physical design. We demonstrate on realistic large designs how the information provided by our flow can be used to both identify congestion issues at the C++ source level and solve them using HLS directives.
The main demo steps are: 1) extraction of the source-level debugging information from the Vivado HLS database; 2) generation of a list of net names involved in congestion areas, and of their relative significance, from the Vivado post-global-routing database; and 3) visualization of the C++ code lines that contribute most to congestion.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3123.pdf">More information ...</a></b></em></td> </tr> <tr> <td>UB09.10</td> <td><b>FU: LOW POWER AND ACCURACY CONFIGURABLE APPROXIMATE ARITHMETIC UNITS</b><br /> <b>Authors</b>:<br /> Tomoaki Ukezono and Toshinori Sato, Fukuoka University, JP<br /> <em><b>Abstract</b><br /> In this demonstration, we will introduce the approximate arithmetic units, such as adders, multipliers, and MACs, that are being studied in our system-architecture laboratory. Our approximate arithmetic units can reduce delay and power consumption at the expense of accuracy. They are intended to be applied to IoT edge devices that process images, and are suitable for battery-driven and low-cost devices. Their key feature is that the circuits are configured so that the accuracy is dynamically variable, and the trade-off between accuracy and power can be selected according to the usage status of the device. In this demonstration, we show the power consumption under various accuracy requirements based on actual data and argue for the practicality of the proposed arithmetic units.</em><br /> <em><b><a href="https://past.date-conference.com/sites/default/files/university-booth/3127.pdf">More information ...</a></b></em></td> </tr> <tr> <td>12:00</td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="10.1">10.1 Special Day on "Silicon Photonics": High-Speed Silicon Photonics Interconnects for Data Center and HPC</h2> <p><b>Date:</b> Thursday 12 March 2020<br /> <b>Time:</b> 11:00 - 12:30<br /> <b>Location / Room:</b> Amphithéâtre Jean Prouve</p> <p><b>Chair:</b><br /> Ian O’Connor, Ecole Centrale de Lyon, FR</p> <p><b>Co-Chair:</b><br /> Luca Ramini, Hewlett Packard Labs, US</p> <p> </p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>10.1.1</td> <td><b>THE NEED AND CHALLENGES OF CO-PACKAGING AND OPTICAL INTEGRATION IN DATA CENTERS</b><br /> <b>Author</b>:<br /> Liron Gantz, Mellanox, US<br /> <em><b>Abstract</b><br /> Silicon photonic (SiPh) technology was the "talk of the town" for almost two decades, yet only in the last couple of years have actual SiPh-based transceivers been introduced for short-reach links. As global IP traffic skyrockets and will surpass 1 ZB per year by 2020, this seems to be the optimal point for a new disruptive technology to emerge. SiPh technology has the potential to reduce power consumption while meeting the demand for increasing rates, and potentially even reduce cost. Yet in order to fully integrate SiPh components in mainly-CMOS ICs, the entire industry must align, beginning with industrial FABs and OSATs, and ending with system manufacturers and data center clients. Indeed, in the last year positive developments have occurred, as the hyperscalers are starting to show interest in driving the market towards integrated optics, forgoing pluggable transceivers.
Yet many challenges have to be met, and some hard decisions have to be taken, in order to fully integrate optics in a scalable manner. In this talk I will review these challenges and possible ways to meet them in order to enable integrated optical products in data centers and high-performance computers.</em></td> </tr> <tr> <td>11:30</td> <td>10.1.2</td> <td><b>POWER AND COST ESTIMATE OF SCALABLE ALL-TO-ALL TOPOLOGIES WITH SILICON PHOTONICS LINKS</b><br /> <b>Author</b>:<br /> Luca Ramini, Hewlett Packard Labs, US<br /> <em><b>Abstract</b><br /> For many applications that require a tight latency profile, such as machine learning, a network topology that does not leverage arbitration-based switching is desired. All-to-all (A2A) interconnection networks enable any node in the network to communicate with any other node at any given time. Many abstractions can be made to enable this capability, such as buffering, time-domain multiplexing, etc. However, typical A2A topologies are limited to about 32 nodes within one hop. This is primarily due to limitations in reach, power consumption and bandwidth per interconnect. In this presentation, a topology of 256 nodes and beyond is considered by leveraging the many-wavelengths-per-fiber advantage of DWDM silicon photonics technology. Power and cost estimates of scalable A2A topologies using silicon photonics links are provided in order to understand the practical limits, if any, of a single node communicating with many other nodes via one wavelength per node.</em></td> </tr> <tr> <td>12:00</td> <td>10.1.3</td> <td><b>THE NEXT FRONTIER IN SILICON PHOTONIC DESIGN: EXPERIMENTALLY VALIDATED STATISTICAL MODELS</b><br /> <b>Authors</b>:<br /> Geoff Duggan<sup>1</sup>, James Pond<sup>1</sup>, Xu Wang<sup>1</sup>, Ellen Schelew<sup>1</sup>, Federico Gomez<sup>1</sup>, Milad Mahpeykar<sup>1</sup>, Ray Chung<sup>1</sup>, Zequin Lu<sup>1</sup>, Parya Samadian<sup>1</sup>, Jens Niegemann<sup>1</sup>, Adam Reid<sup>1</sup>, Roberto Armenta<sup>1</sup>, Dylan McGuire<sup>1</sup>, Peng Sun<sup>2</sup>, Jared Hulme<sup>2</sup>, Mudit Jan<sup>2</sup> and Ashkan Seyedi<sup>2</sup><br /> <sup>1</sup>Lumerical, US; <sup>2</sup>Hewlett Packard Labs, US<br /> <em><b>Abstract</b><br /> Silicon photonics has made tremendous progress in recent years and is now a critical technology embedded in many commercial products, particularly for data communications, while new products in sensing, AI and even quantum information technologies are in development. High quality processes from multiple foundries, supported by sophisticated electronic-photonic design automation (EPDA) workflows, have made these advancements possible. Although several initiatives have begun to address the issue of manufacturing variability in photonics, these approaches have not been integrated meaningfully into EPDA workflows, which lag well behind electronic integrated circuit workflows. Contributing to this deficiency has been a lack of data to calibrate the statistical photonic compact models used in photonic circuit and system simulation (a toy statistical compact model is sketched below).
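<p>To make the notion concrete, a toy statistical photonic compact model: a ring-resonator-style resonance wavelength whose shift is driven by a randomly drawn waveguide-width error, so that circuit-level Monte Carlo can be run against it. All parameter values here are invented for illustration and are not calibrated data from the talk.</p>
<pre><code># Toy statistical compact model: resonance shift under process variation.
import random

def sample_resonance(nominal_nm=1550.0, sigma_width_nm=2.0, dlambda_per_nm=0.9):
    """Draw one process corner: waveguide-width error shifts the resonance."""
    width_error = random.gauss(0.0, sigma_width_nm)
    return nominal_nm + dlambda_per_nm * width_error

samples = [sample_resonance() for _ in range(10000)]
mean = sum(samples) / len(samples)
std = (sum((s - mean) ** 2 for s in samples) / len(samples)) ** 0.5
print(f"resonance: mean {mean:.2f} nm, std {std:.2f} nm")  # feeds circuit-level MC
</code></pre>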
We present our current work in developing tools to calibrate statistical photonic compact models and compare our results against experimental data.</em></td> </tr> <tr> <td>12:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="10.2">10.2 Autonomous Systems Design Initiative: Uncertainty Handling in Safe Autonomous Systems (UHSAS)</h2> <p><b>Date:</b> Thursday 12 March 2020<br /> <b>Time:</b> 11:00 - 12:30<br /> <b>Location / Room:</b> Chamrousse</p> <p><b>Chair:</b><br /> Philipp Mundhenk, Autonomous Intelligent Driving GmbH, DE</p> <p><b>Co-Chair:</b><br /> Ahmad Adee, Bosch Corporate Research, DE</p> <p> </p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>10.2.1</td> <td><b>MAKING THE RELATIONSHIP BETWEEN UNCERTAINTY ESTIMATION AND SAFETY LESS UNCERTAIN</b><br /> <b>Speaker</b>:<br /> Peter Schlicht, Volkswagen, DE<br /> <b>Authors</b>:<br /> Peter Schlicht<sup>1</sup>, Vincent Aravantinos<sup>2</sup> and Fabian Hüger<sup>1</sup><br /> <sup>1</sup>Volkswagen, DE; <sup>2</sup>AID, DE</td> </tr> <tr> <td>11:30</td> <td>10.2.2</td> <td><b>SYSTEM THEORETIC VIEW ON UNCERTAINTIES</b><br /> <b>Speaker</b>:<br /> Roman Gansch, Robert Bosch GmbH, DE<br /> <b>Authors</b>:<br /> Roman Gansch and Ahmad Adee, Robert Bosch GmbH, DE<br /> <em><b>Abstract</b><br /> The complexity of the operating environment and of the required technologies for highly automated driving is unprecedented. Besides the fault-error-failure model by Laprie et al., a different type of threat to safe operation arises in the form of performance limitations. We propose a system-theoretic approach to handle these and derive a taxonomy based on uncertainty, i.e. lack of knowledge, as a root cause. Uncertainty is a threat to the dependability of a system, as it limits our ability to assess its dependability properties. We distinguish uncertainties as aleatory (inherent to probabilistic models), epistemic (lack of knowledge of model parameters) and ontological (incompleteness of models) in order to determine strategies and methods to cope with them. Analogously to the taxonomy of Laprie et al., we cluster methods into uncertainty prevention (use of elements with well-known behavior, avoiding architectures prone to emergent behavior, restriction of the operational design domain, etc.), uncertainty removal (during design time by design of experiments, etc., and after release by field observation, continuous updates, etc.), uncertainty tolerance (use of redundant architectures with diverse uncertainties, uncertainty-aware deep learning, etc.)
and uncertainty forecasting (estimation of residual uncertainty, etc.).</em></td> </tr> <tr> <td>12:00</td> <td>10.2.3</td> <td><b>DETECTION OF FALSE NEGATIVE AND FALSE POSITIVE SAMPLES IN SEMANTIC SEGMENTATION</b><br /> <b>Speaker</b>:<br /> Matthias Rottmann, School of Mathematics &amp; Science and ICMD, DE<br /> <b>Authors</b>:<br /> Hanno Gottschalk<sup>1</sup>, Matthias Rottmann<sup>1</sup>, Kira Maag<sup>1</sup>, Robin Chan<sup>1</sup>, Fabian Hüger<sup>2</sup> and Peter Schlicht<sup>2</sup><br /> <sup>1</sup>School of Mathematics &amp; Science and ICMD, DE; <sup>2</sup>Volkswagen, DE</td> </tr> <tr> <td>12:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="10.3">10.3 Special Session: Next Generation Arithmetic for Edge Computing</h2> <p><b>Date:</b> Thursday 12 March 2020<br /> <b>Time:</b> 11:00 - 12:30<br /> <b>Location / Room:</b> Autrans</p> <p><b>Chair:</b><br /> Farhad Merchant, RWTH Aachen University, DE</p> <p><b>Co-Chair:</b><br /> Akash Kumar, TU Dresden, DE</p> <p>Arithmetic is ubiquitous in today's digital world, ranging from embedded to high-performance computing systems. With machine learning at the fore in a wide range of application domains, from wearables, automotive and avionics to weather prediction, sufficiently accurate yet low-cost arithmetic is the need of the day. Recently, there have been several advances in the domain of computer arithmetic, such as high-precision anchored numbers from ARM, posit arithmetic by John Gustafson, and bfloat16, as alternatives to IEEE 754-2008-compliant arithmetic. Optimizations on fixed-point and integer arithmetic are also pursued actively for low-power computing architectures. Furthermore, approximate computing and transprecision/mixed-precision computing have long been exciting areas of research. While academic research in the domain of computer arithmetic has a long history, industrial adoption of some of these new data types and techniques is in its early stages and expected to increase in the future. bfloat16 is an excellent example of that. In this special session, we bring academia and industry together to discuss the latest results and future directions for research in the domain of next-generation computer arithmetic.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>10.3.1</td> <td><b>PARADIGM ON APPROXIMATE COMPUTE FOR COMPLEX PERCEPTION-BASED NEURAL NETWORKS</b><br /> <b>Authors</b>:<br /> Andre Guntoro and Cecilia De la Parra, Robert Bosch GmbH, DE<br /> <em><b>Abstract</b><br /> The rise of machine learning pushes up massive compute-power requirements, especially on edge devices for real-time inference. One established approach to reducing power usage is going down to integer inference (such as 8-bit) instead of utilizing the higher computation accuracy offered by floating-point counterparts. Squeezing into lower bit representations, such as binary weight networks or binary neural networks, requires complex training methods and more effort to recover the precision loss, and typically functions only on simple classification tasks. One promising alternative for further reducing power consumption and computation latency is utilizing approximate compute units. This method is a promising paradigm for mitigating the computation demand of neural networks by taking advantage of their inherent resilience (a toy quantized-inference baseline follows below).
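<p>For a flavor of the integer baseline this talk starts from, a toy 8-bit quantized dot product next to its exact floating-point counterpart. The scales, data and rounding policy are illustrative assumptions, not the authors' toolchain.</p>
<pre><code># Toy 8-bit integer inference: quantize, run an integer dot product, dequantize.
def quantize(xs, scale):
    return [max(-128, min(127, round(x / scale))) for x in xs]

w = [0.42, -1.10, 0.73, 0.05]                    # toy weights
a = [0.90, 0.20, -0.40, 1.30]                    # toy activations
sw, sa = max(abs(x) for x in w) / 127, max(abs(x) for x in a) / 127
qw, qa = quantize(w, sw), quantize(a, sa)
int_acc = sum(x * y for x, y in zip(qw, qa))     # integer MACs, cheap in hardware
print(int_acc * sw * sa)                         # about -0.074, close to exact
print(sum(x * y for x, y in zip(w, a)))          # exact float result: -0.069
</code></pre>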
Thanks to the developments in approximate computing over the last decade, we have abundant options to utilize the best available approximate units without re-developing or re-designing them. Nonetheless, adaptation during the training phase is required. First, we need to adapt the training methods for neural networks to take into account the inaccuracy introduced by approximate compute, without sacrificing training speed (considering that training is performed on GPUs with floating-point). Second, we need to define new metrics for assessing and selecting the best-fitting approximation units on a per-use-case basis. Lastly, we need to turn approximation to the advantage of the neural networks, such as over-fitting mitigation by design and resiliency, so that networks trained for and designed with approximation perform better than their exact-computing counterparts. For these steps, we evaluate on small tasks first and further validate on complex tasks that are more relevant to automotive domains.</em></td> </tr> <tr> <td>11:22</td> <td>10.3.2</td> <td><b>NEXT GENERATION FPGA ARITHMETIC FOR AI</b><br /> <b>Author</b>:<br /> Martin Langhammer, Intel, GB<br /> <em><b>Abstract</b><br /> The most recent FPGA architectures have introduced new levels of embedded floating-point performance, with tens of TFLOPs now available across a wide range of device sizes. The last two generations of FPGAs have introduced IEEE 754 single precision (FP32) arithmetic, containing up to 10 TFLOPs. The emergence of AI/machine learning as the highest-profile FPGA application has changed the focus from the signal processing and embedded calculations supported by FP32 to smaller floating-point precisions, such as BFLOAT16 for training and FP16 for inference. In this talk, we will describe the architecture and development of the Intel Agilex DSP Block, which contains an FP32 multiplier-adder pair that can be decomposed into two smaller-precision pairs: fp16, bfloat16, and a third proprietary format which can be used for both training and inference. In the Edge, where even lower precision arithmetic is required for inference, new FPGA EDA flows can implement 100 TFLOPs+ of soft-logic-based compute power. In the second half of our talk, we will describe new synthesis, clustering, and packing methodologies, collectively known as Fractal Synthesis, that allow an unprecedented near-100% logic use of the FPGA for arithmetic while maintaining the clock rates of a small example design. The soft logic and embedded arithmetic capabilities can be used simultaneously, making the FPGA the most flexible, and amongst the highest-performing, AI platforms.</em></td> </tr> <tr> <td>11:44</td> <td>10.3.3</td> <td><b>APPLICATION-SPECIFIC ARITHMETIC DESIGN</b><br /> <b>Author</b>:<br /> Florent de Dinechin, INSA Lyon, FR<br /> <em><b>Abstract</b><br /> General-purpose processor manufacturers face the difficult task of deciding the best arithmetic systems to commit to silicon. An alternative, particularly relevant to FPGA computing and ASIC design, is to keep this choice as open as possible, designing tools that enable different arithmetic systems to be mixed and matched in an application-specific way. To achieve this, a productive paradigm has emerged from the FloPoCo project: open-ended generation of over-parameterized operators that compute just right thanks to last-bit accuracy at all levels.
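<p>As a concrete example of the reduced-precision formats this session discusses, a sketch of fp32-to-bfloat16 conversion: keep the sign and 8-bit exponent, round the mantissa down to 7 bits. The round-to-nearest-even step shown is one common choice, and this is illustrative code, not taken from any of the talks.</p>
<pre><code># Illustrative bfloat16 conversion: fp32 with the low 16 mantissa bits rounded away.
import struct

def to_bfloat16(x: float) -> float:
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    rounding = 0x7FFF + ((bits >> 16) &amp; 1)   # round to nearest, ties to even
    bits = (bits + rounding) &amp; 0xFFFF0000    # keep sign, 8-bit exponent, 7-bit mantissa
    return struct.unpack('>f', struct.pack('>I', bits))[0]

print(to_bfloat16(3.14159))  # 3.140625: fp32 dynamic range, ~2-3 decimal digits
</code></pre>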
This talk reviews this paradigm, along with some of the arithmetic tools recently developed for this purpose: the generic bit-heap framework of FloPoCo, and the integration of arithmetic optimization inside HLS tools in the Marto project.</em></td> </tr> <tr> <td>12:06</td> <td>10.3.4</td> <td><b>A COMPARISON OF POSIT AND IEEE 754 FLOATING-POINT ARITHMETIC THAT ACCOUNTS FOR EXCEPTION HANDLING</b><br /> <b>Author</b>:<br /> John Gustafson, National University of Singapore, SG<br /> <em><b>Abstract</b><br /> The posit number format has advantages over the decades-old IEEE 754 floating-point standard in many dimensions: accuracy, dynamic range, simplicity, bitwise reproducibility, resiliency, and resistance to side-channel security attacks. In making comparisons, it is essential to distinguish between an IEEE 754 Standard implementation that handles all the exceptions in hardware, and one that either ignores the exceptions of the Standard or handles them with software or microcode that takes hundreds of clock cycles to execute. Ignoring the exceptions quickly leads to egregious problems such as different values comparing as equal; handling exceptions with microcode creates massive data-dependency in timing that permits side-channel attacks like the well-known Spectre and Meltdown security weaknesses. Many microprocessors, such as current x86 architectures, use the exception-trapping approach for exceptions such as denormalized floats, which makes them unsuitable for secure use. Posit arithmetic provides data-independent and fast execution times with less complexity than a data-independent IEEE 754 float environment for the same data size.</em></td> </tr> <tr> <td>12:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="10.4">10.4 Design Methodologies for Hardware Approximation</h2> <p><b>Date:</b> Thursday 12 March 2020<br /> <b>Time:</b> 11:00 - 12:30<br /> <b>Location / Room:</b> Stendhal</p> <p><b>Chair:</b><br /> Lukas Sekanina, Brno University of Technology, CZ</p> <p><b>Co-Chair:</b><br /> David Novo, CNRS &amp; University of Montpellier, FR</p> <p>New methods for the design and evaluation of approximate hardware are key to its success. This session shows that these approximation methods are applicable across different levels of hardware description, including an RTL design of an approximate multiplier, approximate circuits modelled using binary decision diagrams, and a behavioural description used in the context of high-level synthesis of hardware accelerators. The papers of this session also show how to address another challenge, efficient error evaluation, by means of new statistical and formal verification methods.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>10.4.1</td> <td><b>REALM: REDUCED-ERROR APPROXIMATE LOG-BASED INTEGER MULTIPLIER</b><br /> <b>Speaker</b>:<br /> Hassaan Saadat, University of New South Wales, AU<br /> <b>Authors</b>:<br /> Hassaan Saadat<sup>1</sup>, Haris Javaid<sup>2</sup>, Aleksandar Ignjatovic<sup>1</sup> and Sri Parameswaran<sup>1</sup><br /> <sup>1</sup>University of New South Wales, AU; <sup>2</sup>Xilinx, SG<br /> <em><b>Abstract</b><br /> We propose a new error-configurable approximate unsigned integer multiplier named REALM. It incorporates a novel error-reduction method into the classical approximate log-based multiplier, which is sketched below for background.
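<p>For background, a sketch of the classical Mitchell-style log-based approximate multiplier that REALM refines: take floor(log2) characteristics, approximate the mantissas linearly, add, and invert. The per-segment error-reduction factors described next correct the bias of exactly this baseline; the code is illustrative only.</p>
<pre><code># Classical approximate log-based (Mitchell) multiplication sketch.
def mitchell_mul(a: int, b: int) -> int:
    """Approximate a*b via log2(a) + log2(b) with linear mantissa approximation."""
    assert a > 0 and b > 0
    ka, kb = a.bit_length() - 1, b.bit_length() - 1  # characteristics: floor(log2)
    xa = (a - (1 &lt;&lt; ka)) / (1 &lt;&lt; ka)                 # fractional parts in [0, 1)
    xb = (b - (1 &lt;&lt; kb)) / (1 &lt;&lt; kb)
    s = ka + kb + xa + xb                            # approximate log2(a*b)
    k, x = int(s), s - int(s)
    return round((1 &lt;&lt; k) * (1 + x))                 # linear antilog approximation

print(mitchell_mul(100, 200), 100 * 200)  # 18432 vs 20000: ~7.8% low, never high
</code></pre>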
Each power-of-two interval of the input operands is partitioned into M×M segments, and an error-reduction factor for each segment is analytically determined. These error-reduction factors can be used across any power-of-two interval, so we quantize only M^2 factors and store them in the form of read-only hardwired lookup tables to keep the resource overhead to a minimum. Error characterization of REALM shows that it achieves very low error bias (mostly less than or equal to 0.05%), along with lower mean error (from 0.4% to 1.6%) and lower peak error (from 2.08% to 7.4%) than the classical approximate log-based multiplier and its state-of-the-art derivatives (mean errors greater than or equal to 2.6% and peak errors greater than or equal to 7.8%). Synthesis results using the TSMC 45nm standard-cell library show that REALM enables significant power savings (66% to 86% reduction) and area savings (50% to 76% reduction) when compared with the accurate integer multiplier. We show that REALM produces Pareto-optimal design trade-offs in the design space of state-of-the-art approximate multipliers. Application-level evaluation of REALM demonstrates that it has a negligible effect on output quality.</em></td> </tr> <tr> <td>11:30</td> <td>10.4.2</td> <td><b>A FAST BDD MINIMIZATION FRAMEWORK FOR APPROXIMATE COMPUTING</b><br /> <b>Speaker</b>:<br /> Oliver Keszocze, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE<br /> <b>Authors</b>:<br /> Andreas Wendler and Oliver Keszocze, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE<br /> <em><b>Abstract</b><br /> Approximate Computing is a design paradigm that trades off computational accuracy for gains in non-functional aspects such as reduced area, increased computation speed, or power reduction. Computing the error of the approximated design is an essential step in determining its quality. The computation time for determining the error can become very large, effectively rendering the entire logic approximation procedure infeasible. As a remedy, we present methods to accelerate error metric computation by (a) exploiting structural information and (b) computing estimates of the metrics for multi-output Boolean functions represented as BDDs. We further present a novel greedy, bucket-based BDD minimization framework employing the newly proposed error metric computations to produce Pareto-optimal solutions with respect to BDD size and multiple error metrics. The applicability of the proposed minimization framework is demonstrated by an experimental evaluation. We can report considerable speedups while, at the same time, creating high-quality approximated BDDs.</em></td> </tr> <tr> <td>12:00</td> <td>10.4.3</td> <td><b>ON THE DESIGN OF HIGH PERFORMANCE HW ACCELERATOR THROUGH HIGH-LEVEL SYNTHESIS SCHEDULING APPROXIMATIONS</b><br /> <b>Speaker</b>:<br /> Benjamin Carrion Schaefer, University of Texas at Dallas, US<br /> <b>Authors</b>:<br /> Siyuan Xu and Benjamin Carrion Schaefer, University of Texas at Dallas, US<br /> <em><b>Abstract</b><br /> High-level synthesis (HLS) takes as input a behavioral description (e.g. C/C++) and generates efficient hardware through three main steps: allocation, scheduling, and binding. The scheduling step times the operations in the behavioral description by assigning different portions of the code to unique clock steps (control steps). The code portions assigned to each clock step mainly depend on the target synthesis frequency and target technology.
This work makes use of this observation to generate smaller and faster circuits by approximating the program portions scheduled in each clock step and by exploiting the slack between different scheduling steps to further increase the performance and reduce the latency of the resultant circuit. In particular, each individual scheduling step is approximated given a maximum error boundary and a library of different approximation techniques. To further optimize the resultant circuit, different scheduling steps are merged based on the timing slack of the different control steps without violating the given timing constraint (target frequency). Experimental results from different domain-specific applications show that our method works well and is able to increase the throughput on average by 82% while at the same time reducing the area by 21% for a given maximum allowable error.</em></td> </tr> <tr> <td>12:15</td> <td>10.4.4</td> <td><b>FAST KRIGING-BASED ERROR EVALUATION FOR APPROXIMATE COMPUTING SYSTEMS</b><br /> <b>Speaker</b>:<br /> Daniel Menard, INSA Rennes, FR<br /> <b>Authors</b>:<br /> Justine Bonnot<sup>1</sup>, Karol Desnos<sup>1</sup> and Daniel Menard<sup>2</sup><br /> <sup>1</sup>Université de Rennes / Inria / IRISA, FR; <sup>2</sup>INSA Rennes, FR<br /> <em><b>Abstract</b><br /> Approximate computing techniques trade off the accuracy of an application for its performance. The challenge when implementing approximate computing in an application is to efficiently evaluate the quality at the output of the application in order to optimize the noise budgeting of the different approximation sources. This is commonly achieved with an optimization algorithm that minimizes the implementation cost of the application subject to a quality constraint. During the optimization process, numerous approximation configurations are tested, and the quality at the output of the application is measured for each configuration with simulations, making the optimization process a time-consuming task. We propose a new method for inferring the accuracy or quality metric at the output of an application using kriging, a geostatistical method.</em></td> </tr> <tr> <td style="width:40px;">12:30</td> <td><a href="#IP5">IP5-1</a>, 21</td> <td><b>STATISTICAL MODEL CHECKING OF APPROXIMATE CIRCUITS: CHALLENGES AND OPPORTUNITIES</b><br /> <b>Speaker and Author</b>:<br /> Josef Strnadel, Brno University of Technology, CZ<br /> <em><b>Abstract</b><br /> Many works have shown that approximate circuits may play an important role in the development of resource-efficient electronic systems. This motivates many researchers to propose new approaches for finding an optimal trade-off between the approximation error and resource savings for predefined applications of approximate circuits. The works and approaches, however, focus mainly on design aspects regarding relaxed functional requirements while neglecting further aspects such as signal and parameter dynamics/stochasticity, relaxed/non-functional equivalence, testing or formal verification. This paper aims to take a step ahead by moving towards the formal verification of time-dependent properties of systems based on approximate circuits. Firstly, it presents our approach to modeling such systems by means of stochastic timed automata; our approach goes beyond digital, combinational and/or synchronous circuits and is applicable in the area of sequential, analog and/or asynchronous circuits as well (the statistical model checking principle itself is illustrated in the toy sketch below).
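<p>A minimal illustration of the statistical model checking principle this entry builds on: instead of exhaustive analysis, simulate the stochastic model many times and estimate, with a confidence interval, the probability that a time-bounded property holds. The toy model (an accumulated error taking random steps) and the bound are assumptions for the sketch.</p>
<pre><code># Toy statistical model checking: estimate P(time-bounded property) by simulation.
import random

def simulate_run(bound=5, steps=100):
    """Stochastic model: accumulated error takes a +/-1 random step each cycle."""
    acc = 0
    for _ in range(steps):
        acc += random.choice((-1, 1))
        if abs(acc) > bound:
            return False          # property violated within the time bound
    return True                   # property held on this run

N = 10_000
p = sum(simulate_run() for _ in range(N)) / N
half_width = 1.96 * (p * (1 - p) / N) ** 0.5   # ~95% confidence interval
print(f"P(error stays within bound) ~= {p:.3f} +/- {half_width:.3f}")
</code></pre>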
Secondly, the paper shows the principle and advantages of verifying properties of modeled approximate systems by means of the statistical model checking technique. Finally, the paper evaluates our approach and outlines future research perspectives.</em></td> </tr> <tr> <td style="width:40px;">12:31</td> <td><a href="#IP5">IP5-2</a>, 912</td> <td><b>RUNTIME ACCURACY-CONFIGURABLE APPROXIMATE HARDWARE SYNTHESIS USING LOGIC GATING AND RELAXATION</b><br /> <b>Speaker</b>:<br /> Tanfer Alan, Karlsruhe Institute of Technology, DE<br /> <b>Authors</b>:<br /> Tanfer Alan<sup>1</sup>, Andreas Gerstlauer<sup>2</sup> and Joerg Henkel<sup>1</sup><br /> <sup>1</sup>Karlsruhe Institute of Technology, DE; <sup>2</sup>University of Texas at Austin, US<br /> <em><b>Abstract</b><br /> Approximate computing trades off computation accuracy against energy efficiency. Algorithms from several modern application domains, such as decision making and computer vision, are tolerant to approximations while still meeting their requirements. The extent of approximation tolerance, however, varies significantly with changes in input characteristics and applications. We propose a novel hybrid approach for the synthesis of runtime accuracy-configurable hardware that minimizes energy consumption at the expense of area. To that end, we first explore instantiating multiple hardware blocks with different fixed approximation levels. These blocks can be selected dynamically, and thus allow the accuracy to be configured at runtime. They benefit from having fewer transistors and also from synthesis relaxations, in contrast to state-of-the-art gating mechanisms, which only switch off groups of logic. Our hybrid approach combines instantiating such blocks with area-efficient gating mechanisms that reduce toggling activity, creating a fine-grained design-time knob on energy vs. area.
Examining the total energy savings for a Sobel filter under different workloads and accuracy tolerances shows that our method finds Pareto-optimal solutions providing up to 16% and 44% energy savings compared to a state-of-the-art accuracy-configurable gating mechanism and an exact hardware block, respectively, at 2× area cost.</em></td> </tr> <tr> <td>12:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="10.5">10.5 Emerging Machine Learning Applications and Models</h2> <p><b>Date:</b> Thursday 12 March 2020<br /> <b>Time:</b> 11:00 - 12:30<br /> <b>Location / Room:</b> Bayard</p> <p><b>Chair:</b><br /> Mladen Berekovic, TU Braunschweig, DE</p> <p><b>Co-Chair:</b><br /> Sophie Quinton, INRIA, FR</p> <p>This session presents new application domains and new models for neural networks, discussing two novel video applications, multi-view and surveillance, and a Bayesian modeling approach for neural networks.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>10.5.1</td> <td><b>COMMUNICATION-EFFICIENT VIEW-POOLING FOR DISTRIBUTED INFERENCE WITH MULTI-VIEW NEURAL NETWORKS</b><br /> <b>Speaker</b>:<br /> Manik Singhal, School of Electrical and Computer Engineering, Purdue University, US<br /> <b>Authors</b>:<br /> Manik Singhal, Anand Raghunathan and Vijay Raghunathan, Purdue University, US<br /> <em><b>Abstract</b><br /> Multi-view object detection, or the problem of detecting an object using multiple viewpoints, is an important problem in computer vision with varied applications such as distributed smart cameras and collaborative drone swarms. Multi-view object detection algorithms based on deep neural networks (DNNs) achieve high accuracy by view pooling, or aggregating features corresponding to the different views. However, when these algorithms are realized on networks of edge devices, the communication cost incurred by view pooling often dominates the overall latency and energy consumption. In this paper, we propose techniques for communication-efficient view pooling that can be used to improve the efficiency of distributed multi-view object detection, and apply them to state-of-the-art multi-view DNNs. First, we propose significance-aware selective view pooling, which identifies and communicates only those features from each view that are likely to impact the pooled result (and hence, the final output of the DNN). Second, we propose multi-resolution feature view pooling, which divides views into dominant and non-dominant views, and down-scales the features from non-dominant views using an additional network layer before communicating them for pooling. The dominant and non-dominant views are pooled separately and the results are jointly used to derive the final classification; a simplified sketch of the selective-pooling idea follows below.
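<p>A simplified sketch of the significance-aware idea just described: for element-wise max view-pooling, a view only needs to communicate the features that can still change the pooled result. The streaming order and selection rule are assumptions for illustration, not the paper's exact policy.</p>
<pre><code># Toy selective view pooling: communicate only features that beat the running max.
import numpy as np

def select_significant(view_feat, running_max):
    """Indices/values this view must send for an element-wise max pool."""
    mask = view_feat > running_max          # only these can affect the result
    return np.nonzero(mask)[0], view_feat[mask]

views = [np.random.rand(1024).astype(np.float32) for _ in range(4)]  # 4 cameras
pooled = np.full(1024, -np.inf, dtype=np.float32)
sent = 0
for v in views:                             # e.g., one edge device per view
    idx, vals = select_significant(v, pooled)
    sent += idx.size                        # elements actually communicated
    pooled[idx] = vals
print(sent, "of", 4 * 1024, "feature elements communicated")
</code></pre>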
We implement and evaluate the proposed pooling schemes using a model test-bed of twelve Raspberry Pi 3b+ devices and show that they achieve a 9× to 36× reduction in data communicated and a 1.8× reduction in inference latency, with no degradation in accuracy.</em></td> </tr> <tr> <td>11:30</td> <td>10.5.2</td> <td><b>AN ANOMALY COMPREHENSION NEURAL NETWORK FOR SURVEILLANCE VIDEOS ON TERMINAL DEVICES</b><br /> <b>Speaker</b>:<br /> Yuan Cheng, Shanghai Jiao Tong University, CN<br /> <b>Authors</b>:<br /> Yuan Cheng<sup>1</sup>, Guangtai Huang<sup>2</sup>, Peining Zhen<sup>1</sup>, Bin Liu<sup>2</sup>, Hai-Bao Chen<sup>1</sup>, Ngai Wong<sup>3</sup> and Hao Yu<sup>2</sup><br /> <sup>1</sup>Shanghai Jiao Tong University, CN; <sup>2</sup>Southern University of Science and Technology, CN; <sup>3</sup>University of Hong Kong, HK<br /> <em><b>Abstract</b><br /> Anomaly comprehension in surveillance videos is more challenging than detection. This work introduces the design of a lightweight and fast anomaly comprehension neural network. For comprehension, a spatio-temporal LSTM model is developed based on the structured, tensorized time-series features extracted from surveillance videos. Deep compression of the network size is achieved by tensorization and quantization for the implementation on terminal devices. Experiments on the large-scale video anomaly dataset UCF-Crime demonstrate that the proposed network can achieve an impressive inference speed of 266 FPS on a GTX-1080Ti GPU, which is 4.29× faster than the ConvLSTM-based method; a 3.34% AUC improvement with 5.55% accuracy niche versus the 3D-CNN based approach; and at least a 15k× parameter reduction and 228× storage compression over the RNN-based approaches. Moreover, the proposed framework has been realized on an ARM-core-based IoT board with only 2.4W power consumption.</em></td> </tr> <tr> <td>12:00</td> <td>10.5.3</td> <td><b>BYNQNET: BAYESIAN NEURAL NETWORK WITH QUADRATIC ACTIVATIONS FOR SAMPLING-FREE UNCERTAINTY ESTIMATION ON FPGA</b><br /> <b>Speaker</b>:<br /> Hiromitsu Awano, Osaka University, JP<br /> <b>Authors</b>:<br /> Hiromitsu Awano and Masanori Hashimoto, Osaka University, JP<br /> <em><b>Abstract</b><br /> An efficient inference algorithm for Bayesian neural networks (BNNs), named BYNQNet (Bayesian neural network with quadratic activations), and its FPGA implementation are proposed. As neural networks find applications in mission-critical systems, uncertainty estimation in network inference becomes increasingly important. BNNs are a theoretically grounded solution for dealing with uncertainty in neural networks by treating network parameters as random variables. However, inference in a BNN involves Monte Carlo (MC) sampling, i.e., a stochastic forwarding is repeated N times with randomly sampled network parameters, which results in N times slower inference compared to the non-Bayesian approach. Although recent papers proposed sampling-free algorithms for BNN inference, they still require the evaluation of complex functions, such as the cumulative distribution function (CDF) of the Gaussian distribution, for propagating uncertainties through nonlinear activation functions such as ReLU and Heaviside, which requires a considerable amount of resources for hardware implementation. Contrary to conventional BNNs, BYNQNet employs quadratic nonlinear activation functions, and hence the uncertainty propagation can be achieved using only polynomial operations (see the moment-propagation sketch below). Our numerical experiment reveals that BYNQNet has comparable accuracy to an MC-based BNN which requires N=10 forwardings.
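<p>The enabler behind BYNQNet, sketched under a Gaussian input assumption: for a quadratic activation y = x^2, the output mean and variance have closed forms in the input moments, so uncertainty propagates with polynomial arithmetic only, with no Gaussian CDF and no Monte Carlo loop. The numbers below are illustrative.</p>
<pre><code># Sampling-free moment propagation through a quadratic activation y = x**2.
import random

def quad_activation_moments(mu, var):
    """Exact mean/variance of y = x**2 for x ~ N(mu, var)."""
    return mu ** 2 + var, 2 * var ** 2 + 4 * mu ** 2 * var

mu, var = 0.5, 0.04
print(quad_activation_moments(mu, var))  # analytic: (0.29, 0.0432)

# cross-check against the N-forwarding Monte Carlo approach BYNQNet avoids
samples = [(mu + random.gauss(0, var ** 0.5)) ** 2 for _ in range(100000)]
print(sum(samples) / len(samples))       # ~0.29
</code></pre>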
We also demonstrate that BYNQNet implemented on a Xilinx PYNQ-Z1 FPGA board achieves a throughput of 131×10^3 images per second and an energy efficiency of 44.7×10^3 images per joule, corresponding to 4.07× and 8.99× improvements over the state-of-the-art MC-based BNN accelerator.</em></td> </tr> <tr> <td>12:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="10.6">10.6 Secure Processor Architecture</h2> <p><b>Date:</b> Thursday 12 March 2020<br /> <b>Time:</b> 11:00 - 12:30<br /> <b>Location / Room:</b> Lesdiguières</p> <p><b>Chair:</b><br /> Emanule Regnath, TU Munich, DE</p> <p><b>Co-Chair:</b><br /> Erkay Savas, Sabanci University, TR</p> <p>This session proposes an overview of new mechanisms to protect processor architectures, boot sequences, caches, and energy management. The solutions strive to address and mitigate a wide range of attack methodologies, with a special focus on newly emerging attacks.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>10.6.1</td> <td><b>CAPTURING AND OBSCURING PING-PONG PATTERNS TO MITIGATE CONTINUOUS ATTACKS</b><br /> <b>Speaker</b>:<br /> Kai Wang, Harbin Institute of Technology, CN<br /> <b>Authors</b>:<br /> Kai Wang<sup>1</sup>, Fengkai Yuan<sup>2</sup>, Rui Hou<sup>2</sup>, Zhenzhou Ji<sup>1</sup> and Dan Meng<sup>2</sup><br /> <sup>1</sup>Harbin Institute of Technology, CN; <sup>2</sup>Chinese Academy of Sciences, CN<br /> <em><b>Abstract</b><br /> In this paper, we observe that Continuous Attacks are a common kind of side-channel attack scenario, in which an adversary frequently probes the same target cache lines within a short time. Continuous Attacks cause target cache lines to go through multiple load-evict processes, exhibiting Ping-Pong Patterns. Identifying and obscuring Ping-Pong Patterns effectively interferes with the attacker's probes and mitigates Continuous Attacks. Based on these observations, this paper proposes the Ping-Pong Regulator to identify multiple Ping-Pong Patterns and block them with different strategies (Preload or Lock). The Preload proactively loads target lines into the cache, causing the attacker to mistakenly infer that the victim has accessed these lines; the Lock fixes the attacked lines' directory entries in the last-level cache directory until they are evicted out of the caches, making an attacker's observation of the locked lines always an L2 cache miss. The experimental evaluation demonstrates that the Ping-Pong Regulator efficiently identifies and secures attacked lines, induces negligible performance impact and storage overhead, and does not require any software support.</em></td> </tr> <tr> <td>11:30</td> <td>10.6.2</td> <td><b>MITIGATING CACHE-BASED SIDE-CHANNEL ATTACKS THROUGH RANDOMIZATION: A COMPREHENSIVE SYSTEM AND ARCHITECTURE LEVEL ANALYSIS</b><br /> <b>Speaker</b>:<br /> Houman Homayoun, University of California, Davis, US<br /> <b>Authors</b>:<br /> Han Wang<sup>1</sup>, Hossein Sayadi<sup>1</sup>, Avesta Sasan<sup>1</sup>, Setareh Rafatirad<sup>1</sup>, Houman Homayoun<sup>1</sup>, Liang Zhao<sup>1</sup> and Tinoosh Mohsenin<sup>2</sup><br /> <sup>1</sup>George Mason University, US; <sup>2</sup>University of Maryland, Baltimore County, US<br /> <em><b>Abstract</b><br /> The cache hierarchy was designed to allow CPU cores to process instructions faster by bridging the significant latency gap between main memory and the processor.
In addition, various cache replacement algorithms have been proposed to predict future data and instructions to boost the performance of computer systems. However, recently proposed cache-based Side-Channel Attacks (SCAs) have been shown to effectively exploit such a hierarchical cache design. Cache-based SCAs exploit hardware vulnerabilities to steal secret information from users by observing the cache access patterns of cryptographic applications, and are thus emerging as a serious threat to the security of computer systems. Prior works on mitigating cache-based SCAs have mainly focused on cache partitioning techniques and/or randomization of the mapping between main memory and the cache. However, such solutions, though effective, require modifications to the processor hardware, which increases the complexity of the architecture design, and they are not applicable to current or legacy architectures. In response, this paper proposes a lightweight system- and architecture-level randomization technique to effectively mitigate the impact of side-channel attacks on last-level caches, with no hardware redesign overhead for current as well as legacy architectures. To this aim, by carefully adapting the processor frequency and prefetcher operation and adding a proper level of noise to the attacker's cache observations, we attempt to protect critical information from being leaked. The experimental results indicate that the concurrent randomization of frequency and prefetchers can significantly prevent cache-based side-channel attacks with no need for a new cache design. In addition, the proposed randomization and adaptation methodology outperforms state-of-the-art solutions in terms of performance and execution time, reducing the performance overhead from 32.66% to nearly 20%.</em></td> </tr> <tr> <td>12:00</td> <td>10.6.3</td> <td><b>EXTENDING THE RISC-V INSTRUCTION SET FOR HARDWARE ACCELERATION OF THE POST-QUANTUM SCHEME LAC</b><br /> <b>Speaker</b>:<br /> Tim Fritzmann, TU Munich, DE<br /> <b>Authors</b>:<br /> Tim Fritzmann<sup>1</sup>, Georg Sigl<sup>2</sup> and Johanna Sepúlveda<sup>3</sup><br /> <sup>1</sup>TU Munich, DE; <sup>2</sup>TU Munich/Fraunhofer AISEC, DE; <sup>3</sup>Airbus Defence and Space, DE<br /> <em><b>Abstract</b><br /> The increasing effort in the development of quantum computers represents a high risk for communication systems, due to their capability of breaking currently used public-key cryptography. LAC is a lattice-based public-key encryption scheme resistant to traditional and quantum attacks. It is characterized by small key sizes and low arithmetic complexity. Recent publications have shown practical post-quantum solutions through co-design techniques. However, for LAC only software implementations have been explored. In this work, we propose an efficient, flexible and time-protected HW/SW co-design architecture for LAC. We present two contributions. First, we develop and integrate hardware accelerators for three LAC performance bottlenecks: the generation of polynomials, polynomial multiplication, and error correction. The accelerators were designed to support all post-quantum security levels from 128 to 256 bits. Second, we develop tailored instruction set extensions for LAC on RISC-V and integrate the HW accelerators directly into a RISC-V core (a toy version of the polynomial multiplication kernel follows below).
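<p>For reference, the kind of kernel the LAC accelerators above target: multiplication in a negacyclic polynomial ring, shown here in a tiny schoolbook form. The degree and operands are toy values, and real hardware uses far more efficient structures than this double loop.</p>
<pre><code># Toy negacyclic polynomial multiplication, a lattice-crypto bottleneck kernel:
# multiply in Z_q[x]/(x^n + 1), where wrap-around picks up a minus sign.
def polymul_negacyclic(a, b, q):
    n = len(a)
    res = [0] * n
    for i in range(n):
        for j in range(n):
            k = (i + j) % n
            sign = -1 if i + j >= n else 1
            res[k] = (res[k] + sign * a[i] * b[j]) % q
    return res

q = 251                                          # LAC uses a byte-sized modulus
print(polymul_negacyclic([1, 2, 3, 4], [5, 6, 7, 8], q))  # [195, 215, 2, 60]
</code></pre>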
The results show that our architecture for LAC with constant-time error correction improves the performance by a factor of 7.66 for LAC-128, 14.42 for LAC-192, and 13.36 for LAC-256, when compared to the unprotected reference implementation running on RISC-V. The increased performance comes at the cost of increased resource consumption (32,617 LUTs, 11,019 registers, and two DSP slices).</em></td> </tr> <tr> <td style="width:40px;">12:30</td> <td><a href="#IP5">IP5-3</a>, 438</td> <td><b>POST-QUANTUM SECURE BOOT</b><br /> <b>Speaker</b>:<br /> Vinay B. Y. Kumar, Nanyang Technological University, SG<br /> <b>Authors</b>:<br /> Vinay B. Y. Kumar<sup>1</sup>, Naina Gupta<sup>2</sup>, Anupam Chattopadhyay<sup>1</sup>, Michael Kasper<sup>3</sup>, Christoph Krauss<sup>4</sup> and Ruben Niederhagen<sup>4</sup><br /> <sup>1</sup>Nanyang Technological University, SG; <sup>2</sup>Indraprastha Institute of Information Technology, IN; <sup>3</sup>Fraunhofer Singapore, SG; <sup>4</sup>Fraunhofer SIT, DE<br /> <em><b>Abstract</b><br /> A secure boot protocol is fundamental to ensuring the integrity of the trusted computing base of a secure system. The use of digital signature algorithms (DSAs) based on traditional asymmetric cryptography, particularly for secure boot, leaves such systems vulnerable to the threat of quantum computers. This paper presents the first post-quantum secure boot solution, implemented fully as hardware for reasons of security and performance. In particular, this work uses the eXtended Merkle Signature Scheme (XMSS), a hash-based scheme that has been specified as an IETF RFC. The solution has been integrated into a secure SoC platform around RISC-V cores, evaluated on an FPGA, and shown to be orders of magnitude faster than corresponding hardware/software implementations and to compare competitively with a fully hardware elliptic-curve-DSA-based solution.</em></td> </tr> <tr> <td>12:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="10.7">10.7 Accelerators for Neuromorphic Computing</h2> <p><b>Date:</b> Thursday 12 March 2020<br /> <b>Time:</b> 11:00 - 12:30<br /> <b>Location / Room:</b> Berlioz</p> <p><b>Chair:</b><br /> Alexandre Levisse, EPFL, CH</p> <p><b>Co-Chair:</b><br /> Deliang Fan, Arizona State University, US</p> <p>In this session, special hardware accelerators based on different technologies for neuromorphic computing will be presented. These accelerators (i) improve computing efficiency by using pulse widths to deliver information across memristor crossbars, (ii) enhance the robustness of neuromorphic computing with unary coding and priority mapping, and (iii) explore the modulation of light in transferring information so as to push the performance of computing systems to new limits.</p> <table> <thead> <tr> <th>Time</th> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> </thead> <tbody> <tr> <td>11:00</td> <td>10.7.1</td> <td><b>A PULSE WIDTH NEURON WITH CONTINUOUS ACTIVATION FOR PROCESSING-IN-MEMORY ENGINES</b><br /> <b>Speaker</b>:<br /> Shuhang Zhang, TU Munich, DE<br /> <b>Authors</b>:<br /> Shuhang Zhang<sup>1</sup>, Bing Li<sup>1</sup>, Hai (Helen) Li<sup>2</sup> and Ulf Schlichtmann<sup>1</sup><br /> <sup>1</sup>TU Munich, DE; <sup>2</sup>Duke University, US / TU Munich, US<br /> <em><b>Abstract</b><br /> Processing-in-memory engines have been applied successfully to accelerate deep neural networks.
To improve computing efficiency, spiking-based platforms are widely utilized. However, spiking-based designs inherently quantize inter-layer signals, leading to performance loss. In addition, the spike mismatch effect makes digital processing essential, impeding direct signal transfer between layers and thus increasing latency. In this paper, we propose a novel neuron design based on pulse width modulation that avoids the quantization step and bypasses spike mismatch via its continuous activation. The computation latency and circuit complexity can be reduced significantly owing to the absence of quantization and digital processing steps, while a competitive performance is maintained. Experimental results demonstrate that the proposed neuron design can achieve a &gt;100× speedup, while area and power consumption can be reduced by up to 75% and 25%, respectively, compared with spiking-based designs.</em></td> </tr> <tr> <td>11:30</td> <td>10.7.2</td> <td><b>GO UNARY: A NOVEL SYNAPSE CODING AND MAPPING SCHEME FOR RELIABLE RERAM-BASED NEUROMORPHIC COMPUTING</b><br /> <b>Speaker</b>:<br /> Li Jiang, Shanghai Jiao Tong University, CN<br /> <b>Authors</b>:<br /> Chang Ma, Yanan Sun, Weikang Qian, Ziqi Meng, Rui Yang and Li Jiang, Shanghai Jiao Tong University, CN<br /> <em><b>Abstract</b><br /> Neural network (NN) computing involves a large number of multiply-and-accumulate (MAC) operations, which are the speed bottleneck in traditional von Neumann architectures. Resistive random access memory (ReRAM)-based crossbars are well suited for matrix-vector multiplication. Existing ReRAM-based NNs mainly use binary coding for synaptic weights. However, the imperfect fabrication process combined with stochastic filament-based switching leads to resistance variations, which can significantly affect the weights in binary synapses and degrade the accuracy of NNs. Further, as multi-level cells (MLCs) are being developed to reduce hardware overhead, NN accuracy deteriorates even more under the resistance variations of binary coding. In this paper, a novel unary coding of synaptic weights is presented to overcome the resistance variations of MLCs and achieve reliable ReRAM-based neuromorphic computing. A priority mapping is also proposed, in concert with the unary coding, to preserve accuracy by mapping bits with lower resistance states to ReRAM cells with smaller resistance variations. Our experimental results show that the proposed method incurs less than 0.45% and 5.48% accuracy loss on LeNet (on the MNIST dataset) and VGG16 (on the CIFAR-10 dataset), respectively, while maintaining acceptable hardware cost.</em></td> </tr> <tr> <td>12:00</td> <td>10.7.3</td> <td><b>LIGHTBULB: A PHOTONIC-NONVOLATILE-MEMORY-BASED ACCELERATOR FOR BINARIZED CONVOLUTIONAL NEURAL NETWORKS</b><br /> <b>Authors</b>:<br /> Farzaneh Zokaee<sup>1</sup>, Qian Lou<sup>1</sup>, Nathan Youngblood<sup>2</sup>, Weichen Liu<sup>3</sup>, Yiyuan Xie<sup>4</sup> and Lei Jiang<sup>1</sup><br /> <sup>1</sup>Indiana University Bloomington, US; <sup>2</sup>University of Pittsburgh, US; <sup>3</sup>Nanyang Technological University, SG; <sup>4</sup>Southwest University, CN<br /> <em><b>Abstract</b><br /> Although Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art inference accuracy in various intelligent applications, each CNN inference involves millions of expensive floating-point multiply-accumulate (MAC) operations.
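In a binarized CNN, each such MAC over ±1 values collapses to an XNOR followed by a popcount; the following Python sketch, in which all names and parameters are illustrative, shows this bitwise equivalent.
<pre>
# Hypothetical illustration of the XNOR + popcount arithmetic that replaces
# floating-point MACs in binarized CNNs: {-1, +1} values are bit-packed as
# 0/1, so a whole dot product becomes a couple of word-level operations.
def bin_dot(a_bits, b_bits, n):
    """Dot product of two n-element {-1, +1} vectors packed into integers."""
    mask = (2 ** n) - 1
    matches = bin(~(a_bits ^ b_bits) & mask).count("1")  # popcount of XNOR
    return 2 * matches - n  # each match contributes +1, each mismatch -1

a = 0b1011
print(bin_dot(a, a, 4))            # identical vectors: +4
print(bin_dot(a, ~a & 0b1111, 4))  # complementary vectors: -4
</pre>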
To process CNN inferences energy-efficiently, prior work proposes an electro-optical accelerator that processes power-of-2-quantized CNNs using electro-optical ripple-carry adders and optical binary shifters. The electro-optical accelerator also uses SRAM registers to store intermediate data. However, the electro-optical ripple-carry adders and SRAMs seriously limit the operating frequency and inference throughput of the electro-optical accelerator, due to the long critical path of the adder and the long access latency of the SRAMs. In this paper, we propose a photonic nonvolatile-memory (NVM)-based accelerator, LightBulb, that processes binarized CNNs using high-frequency photonic XNOR gates and popcount units. LightBulb also adopts photonic racetrack memory as input/output registers to achieve a high operating frequency. Compared to prior electro-optical accelerators, on average, LightBulb improves the CNN inference throughput by 17×–173× and the inference throughput per Watt by 17.5×–660×.</em></td> </tr> <tr> <td style="width:40px;">12:30</td> <td><a href="#IP5">IP5-4</a>, 863</td> <td><b>ROQ: A NOISE-AWARE QUANTIZATION SCHEME TOWARDS ROBUST OPTICAL NEURAL NETWORKS WITH LOW-BIT CONTROLS</b><br /> <b>Speaker</b>:<br /> Jiaqi Gu, University of Texas at Austin, US<br /> <b>Authors</b>:<br /> Jiaqi Gu<sup>1</sup>, Zheng Zhao<sup>1</sup>, Chenghao Feng<sup>1</sup>, Hanqing Zhu<sup>2</sup>, Ray T. Chen<sup>1</sup> and David Z. Pan<sup>1</sup><br /> <sup>1</sup>University of Texas at Austin, US; <sup>2</sup>Shanghai Jiao Tong University, CN<br /> <em><b>Abstract</b><br /> Optical neural networks (ONNs) demonstrate orders-of-magnitude higher speed in deep learning acceleration than their electronic counterparts. However, limited control precision and device variations induce accuracy degradation in practical ONN implementations. To tackle this issue, we propose a quantization scheme that adapts a full-precision ONN to low-resolution voltage controls. Moreover, we propose a protective regularization technique that dynamically penalizes quantized weights according to their estimated noise robustness, improving the overall robustness of the network. Experimental results show that the proposed scheme effectively adapts ONNs to limited-precision controls and device variations. The resulting four-layer ONN demonstrates higher inference accuracy with lower variance than baseline methods under various control precisions and device noises.</em></td> </tr> <tr> <td style="width:40px;">12:31</td> <td><a href="#IP5">IP5-5</a>, 789</td> <td><b>STATISTICAL TRAINING FOR NEUROMORPHIC COMPUTING USING MEMRISTOR-BASED CROSSBARS CONSIDERING PROCESS VARIATIONS AND NOISE</b><br /> <b>Speaker</b>:<br /> Ying Zhu, TU Munich, DE<br /> <b>Authors</b>:<br /> Ying Zhu<sup>1</sup>, Grace Li Zhang<sup>1</sup>, Tianchen Wang<sup>2</sup>, Bing Li<sup>1</sup>, Yiyu Shi<sup>2</sup>, Tsung-Yi Ho<sup>3</sup> and Ulf Schlichtmann<sup>1</sup><br /> <sup>1</sup>TU Munich, DE; <sup>2</sup>University of Notre Dame, US; <sup>3</sup>National Tsing Hua University, TW<br /> <em><b>Abstract</b><br /> Memristor-based crossbars are an attractive platform to accelerate neuromorphic computing. However, process variations during manufacturing and noise in memristors cause significant accuracy loss if not addressed. In this paper, we propose to model process variations and noise as correlated random variables and to incorporate them into the cost function during training.
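A minimal numpy sketch of this idea follows, assuming a toy one-layer linear model with i.i.d. multiplicative weight noise (the paper models correlated variables; all constants here are illustrative assumptions, not the paper's).
<pre>
import numpy as np

# Sketch of statistical training: every gradient step averages the loss
# gradient over random weight perturbations modelling device variation,
# so the learned weights stay accurate under that noise.
rng = np.random.default_rng(0)
SIGMA = 0.05  # assumed relative variation of a programmed weight

def noisy_grad(w, x, y, n_samples=8):
    """Gradient of the mean squared loss, averaged over noise samples."""
    g = np.zeros_like(w)
    for _ in range(n_samples):
        z = rng.standard_normal(w.shape)
        w_noisy = w * (1.0 + SIGMA * z)          # perturbed weights
        err = x @ w_noisy - y                    # residuals under noise
        g += (2.0 / len(y)) * (x.T @ err) * (1.0 + SIGMA * z)  # chain rule
    return g / n_samples

x = rng.standard_normal((64, 4))
y = x @ np.array([1.0, -2.0, 0.5, 0.0])          # toy targets
w = rng.standard_normal(4)
for _ in range(300):
    w -= 0.1 * noisy_grad(w, x, y)
print(np.round(w, 2))  # approaches the true weights while robust to the noise
</pre>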
Consequently, the weights obtained from this statistical training are more robust and, together with global variation compensation, provide stable inference accuracy. Simulation results demonstrate that the mean value and the standard deviation of the inference accuracy can be improved significantly, by up to 54% and 31%, respectively, in a two-layer fully connected neural network.</em></td> </tr> <tr> <td>12:30</td> <td> </td> <td>End of session</td> </tr> <tr> <td> </td> <td> </td> <td> </td> </tr> </tbody> </table> <hr /> <h2 id="UB10">UB10 Session 10</h2> <p><b>Date:</b> Thursday 12 March 2020<br /> <b>Time:</b> 12:00 - 14:30<br /> <b>Location / Room:</b> Booth 11, Exhibition Area</p> <p> </p> <table> <tbody> <tr> <th>Label</th> <th>Presentation Title<br /> Authors</th> </tr> <tr> <td>UB10.1</td> <td><b>TAPASCO: THE OPEN-SOURCE TASK-PARALLEL SYSTEM COMPOSER FRAMEWORK</b><br /> <b>Authors</b>:<br /> Carsten Heinz, Lukas Sommer, Lukas Weber, Jaco Hofmann and Andreas Koch, TU Darmstadt, DE<br /> <em><b>Abstract</b><br /> Field-programmable gate arrays (FPGAs) are an established