DATE 2023 Save the Date: 17–19 April 2023


Dear DATE community,

We, the DATE Sponsors Committee (DSC) and the DATE Executive Committee (DEC), are deeply shocked and saddened by the tragedy currently unfolding in Ukraine, and we would like to express our full solidarity with all the people and families affected by the war.

Our thoughts also go out to everyone in Ukraine and Russia, whether they are directly or indirectly affected by the events, and we extend our deep sympathy.

We condemn Russia’s military action in Ukraine, which violates international law, and we call on governments to take immediate action to protect everyone in the country, particularly its civilian population and those affiliated with its universities.

Now more than ever, our DATE community must promote our societal values (justice, freedom, respect, community, and responsibility) and confront this situation collectively and peacefully to end this senseless war.

DATE Sponsors and Executive Committees.


Kindly note that all times on the virtual conference platform are displayed in the user's time zone.

The time zone for all times mentioned on the DATE website is CET – Central European Time (UTC+1).

O.1 Opening

Date: Monday, 14 March 2022
Time: 08:30 - 09:15 CET

Session chair:
Cristiana Bolchini, Politecnico di Milano, IT

Session co-chair:
Ingrid Verbauwhede, KU Leuven, BE

Time Label Presentation Title
Authors
08:30 CET O.1.1 OPENING
Speakers:
Cristiana Bolchini1 and Ingrid Verbauwhede2
1Politecnico di Milano, IT; 2KU Leuven - COSIC, BE
Abstract
DATE 2022 opening
09:00 CET O.1.2 AWARDS
Speakers:
Donatella Sciuto1, David Atienza2 and Yervant Zorian3
1Politecnico di Milano, IT; 2École Polytechnique Fédérale de Lausanne (EPFL), CH; 3Synopsys, US
Abstract
DATE 2022 awards presentation

K.1 Opening keynote #1: "What is beyond AI? Societal opportunities and electronic design automation"

Date: Monday, 14 March 2022
Time: 09:20 - 10:10 CET

Session chair:
Cristiana Bolchini, Politecnico di Milano, IT

The success of hardware in enabling AI acceleration and broadening its scope has been nothing short of remarkable. How do we use the power of hardware design and electronic design automation to instead make the world a better place? EDA will be the cornerstone of innovative solutions in ensuring data privacy, sustainable computing and taming the data flood.

Speaker's bio: Valeria Bertacco is Thurnau Professor of Computer Science and Engineering at the University of Michigan, and Adjunct Professor of Computer Engineering at the Addis Ababa Institute of Technology. Her research interests are in the area of computer design, with emphasis on specialized architecture solutions and design viability, in particular reliability, validation, and hardware-security assurance. Her research endeavors are supported by the Applications Driving Architectures (ADA) Research Center, which Valeria directs. The ADA Center, sponsored by a consortium of semiconductor companies, has the goal of reigniting computing systems design and innovation for the 2030s and 2040s through specialized heterogeneity, domain-specific language abstractions, and new silicon devices that show benefit to applications. Valeria joined the University of Michigan in 2003. She currently serves as the Vice Provost for Engaged Learning at the University of Michigan, supporting all co-curricular engagements and international partnerships for the institution and facilitating the work of several central units, whose goals range from promoting environmental sustainability to promoting the arts in research universities and increasing the participation of gender minorities in the academy.

Time Label Presentation Title
Authors
09:20 CET K.1.1 WHAT IS BEYOND AI? SOCIETAL OPPORTUNITIES AND ELECTRONIC DESIGN AUTOMATION
Speaker and Author:
Valeria Bertacco, University of Michigan, US
Abstract
The success of hardware in enabling AI acceleration and broadening its scope has been nothing short of remarkable. How do we use the power of hardware design and electronic design automation to instead make the world a better place? EDA will be the cornerstone of innovative solutions in ensuring data privacy, sustainable computing and taming the data flood.
10:00 CET K.1.2 Q&A SESSION
Author:
Cristiana Bolchini, Politecnico di Milano, IT
Abstract
Questions and answers with the speaker

K.2 Opening keynote #2: "Cryo-CMOS Quantum Control: from a Wild Idea to Working Silicon"

Date: Monday, 14 March 2022
Time: 10:10 - 11:00 CET

Session chair:
Giovanni De Micheli, EPFL, CH

The core of a quantum processor is generally an array of qubits that need to be controlled and read out by a classical processor. This processor operates on the qubits with nanosecond latency, several million times per second, under tight constraints on noise and power. This is due to the extremely weak signals involved in the process, which require highly sensitive circuits and systems along with very precise timing capability. We advocate the use of CMOS technologies to achieve these goals, with the circuits operated at deep-cryogenic temperatures. We believe that these circuits, collectively known as cryo-CMOS control, will make future qubit arrays scalable, enabling a faster growth in qubit count. In this lecture, the challenges of designing and operating complex circuits and systems at 4 K and below will be outlined, along with preliminary results achieved in the control and read-out of qubits by ad hoc integrated circuits.

Speaker's bio: Edoardo Charbon (SM’00 F’17) received the Diploma from ETH Zurich, the M.S. from the University of California at San Diego, and the Ph.D. from the University of California at Berkeley in 1988, 1991, and 1995, respectively, all in electrical engineering and EECS. He has consulted with numerous organizations, including Bosch, X-Fab, Texas Instruments, Maxim, Sony, Agilent, and the Carlyle Group. He was with Cadence Design Systems from 1995 to 2000, where he was the architect of the company's initiative on information hiding for intellectual property protection. In 2000, he joined Canesta Inc. as Chief Architect, where he led the development of wireless 3-D CMOS image sensors. Since 2002 he has been a member of the faculty of EPFL. From 2008 to 2016 he was with Delft University of Technology as full professor and Chair of VLSI design. He has been the driving force behind the creation of deep-submicron CMOS SPAD technology, which has been in mass production since 2015 and is present in telemeters, proximity sensors, and medical diagnostics tools. His interests span from 3-D vision, LiDAR, FLIM, FCS, and NIROT to super-resolution microscopy, time-resolved Raman spectroscopy, and cryo-CMOS circuits and systems for quantum computing. He has authored or co-authored over 400 papers and two books, and he holds 23 patents. Dr. Charbon is a distinguished visiting scholar of the W. M. Keck Institute for Space at Caltech, a fellow of the Kavli Institute of Nanoscience Delft, a distinguished lecturer of the IEEE Photonics Society, and a fellow of the IEEE.

Time Label Presentation Title
Authors
10:10 CET K.2.1 CRYO-CMOS QUANTUM CONTROL: FROM A WILD IDEA TO WORKING SILICON
Speaker and Author:
Edoardo Charbon, École Polytechnique Fédérale de Lausanne (EPFL), CH
Abstract
The core of a quantum processor is generally an array of qubits that need to be controlled and read out by a classical processor. This processor operates on the qubits with nanosecond latency, several million times per second, under tight constraints on noise and power. This is due to the extremely weak signals involved in the process, which require highly sensitive circuits and systems along with very precise timing capability. We advocate the use of CMOS technologies to achieve these goals, with the circuits operated at deep-cryogenic temperatures. We believe that these circuits, collectively known as cryo-CMOS control, will make future qubit arrays scalable, enabling a faster growth in qubit count. In this lecture, the challenges of designing and operating complex circuits and systems at 4 K and below will be outlined, along with preliminary results achieved in the control and read-out of qubits by ad hoc integrated circuits.
10:50 CET K.2.2 Q&A SESSION
Author:
Giovanni De Micheli, École Polytechnique Fédérale de Lausanne (EPFL), CH
Abstract
Questions and answers with the speaker

1.1 Scalable quantum stacks: current status and future prospects

Date: Monday, 14 March 2022
Time: 11:00 - 12:30 CET

Session chair:
Fabio Sebastiano, TU Delft, NL

Session co-chair:
Giovanni De Micheli, EPFL, CH

In this session we explore quantum computing from the quantum algorithm to the qubit, going through the compilation process. In this context, we look at similarities with conventional computing in the overall quantum stack architecture and differences in the control of qubit processors. From these and other perspectives, the session will offer a view into the future of quantum computers.

Time Label Presentation Title
Authors
11:00 CET 1.1.1 FULL-STACK QUANTUM COMPUTING SYSTEMS IN THE NISQ ERA: ALGORITHM-DRIVEN AND HARDWARE-AWARE COMPILATION TECHNIQUES
Speaker:
Carmen G. Almudever, TU Valencia, ES
Authors:
Medina Bandic1, Sebastian Feld1 and Carmen G. Almudever2
1Delft University of Technology, NL; 2TU Valencia, ES
Abstract
The progress in developing quantum hardware, with functional quantum processors integrating tens of noisy qubits, together with the availability of near-term quantum algorithms, has led to the release of the first quantum computers. These quantum computing systems already integrate different software and hardware components of the so-called "full stack", bridging quantum applications to quantum devices. In this paper, we will provide an overview of current full-stack quantum computing systems. We will emphasize the need for tight co-design among adjacent layers as well as vertical cross-layer design to extract the most from noisy intermediate-scale quantum (NISQ) processors, which are both error-prone and severely constrained in resources. As an example of co-design, we will focus on the development of hardware-aware and algorithm-driven compilation techniques.
11:30 CET 1.1.2 TWEEDLEDUM: A COMPILER COMPANION FOR QUANTUM COMPUTING
Speaker:
Bruno Schmitt, EPFL, CH
Authors:
Bruno Schmitt and Giovanni De Micheli, École Polytechnique Fédérale de Lausanne (EPFL), CH
Abstract
This work presents tweedledum, an extensible open-source library aiming at narrowing the gap between high-level algorithms and physical devices by enhancing the expressive power of existing frameworks. For example, it allows designers to insert classical logic (defined at a high abstraction level, e.g., a Python function) directly into quantum circuits. We describe its design principles, concrete implementation, and, in particular, the library's core: an intuitive and flexible intermediate representation (IR) that supports different abstraction levels across the same circuit structure.
12:00 CET 1.1.3 A CRYO-CMOS TRANSMON QUBIT CONTROLLER AND VERIFICATION WITH FPGA EMULATION
Speaker:
Kevin Tien, IBM Research, US
Authors:
Kevin Tien1, Ken Inoue1, Scott Lekuch1, David Frank1, Sudipto Chakraborty1, Pat Rosno2, Thomas Fox1, Mark Yeck1, Joseph Glick1, Raphael Robertazzi1, Ray Richetta2, John Bulzacchelli1, Daniel Ramirez2, Dereje Yilma2, Andy Davies2, Rajiv Joshi1, Devin Underwood1, Dorothy Wisnieff1, Chris Baks1, Donald Bethune3, John Timmerwilke1, Blake Johnson1, Brian Gaucher1 and Daniel Friedman1
1IBM T.J. Watson Research Center, US; 2IBM Systems, US; 3IBM Almaden Research Center, US
Abstract
Future generations of quantum computers are expected to operate in a paradigm where multi-qubit devices will predominantly perform circuits to support quantum error correction. Highly integrated cryogenic electronics are a key enabling technology to support the control of the large numbers of physical qubits that will be required in this fault-tolerant, error-corrected regime. Here, we describe our perspectives on cryoelectronics-driven qubit control architectures, and then describe an implementation of a scalable, low-power, cryogenic qubit state controller that includes a domain-specific processor and an SSB-upconversion I/Q-mixer-based RF AWG. We will also describe an FPGA-based emulation platform that closely reproduces the intended system behavior and was used to verify different aspects of the ASIC system design in in situ transmon qubit control experiments.

K.3 Lunch Keynote: "Batteries: powering up the next generations"

Date: Monday, 14 March 2022
Time: 13:10 - 14:00 CET

Session chair:
Marco Casale-Rossi, Synopsys, IT

Session co-chair:
Enrico Macii, Politecnico di Torino, IT

The quest for energy, possibly from renewable sources, is rapidly increasing, driven by new digital technologies that are taking up more and more space in our lives and by electric vehicles expected to replace old combustion ones. However, today’s battery technology is lagging behind adjacent technological advances, with most devices using lithium-ion batteries, which bring with them several concerns, not least their availability in Europe. To create a European energy platform for the future, bringing together renewable energy sources, electric transportation and a connected Internet of Things, a new solution for battery technology needs to be found. This keynote will explore how current challenges can be overcome through the application of advances in new materials, what Europe is doing in the field of batteries, the need for skilled people, and how the future of battery technology can contribute to building a better, greener and connected world.

Speaker's bio: Silvia Bodoardo is a professor at Politecnico di Torino, where she is responsible for the task force on batteries and leads the Electrochemistry Group@Polito. Her research activity is mainly focused on the study of materials for Li-ion and post-Li-ion batteries, and also covers cell production and battery testing. She participates in several EU-funded projects (coordinator of the STABLE project), as well as national and regional ones. She leads WP3 on Education in the Battery2030+ initiative and is co-chair of WG3 of BatteRIesEurope. Silvia has organized many conferences and workshops on materials with electrochemical applications and was chairwoman at the launch of the Horizon Prize on Innovative Batteries.

Time Label Presentation Title
Authors
13:10 CET K.3.1 BATTERIES: POWERING UP THE NEXT GENERATIONS
Speaker and Author:
Silvia Bodoardo, Politecnico di Torino, IT
Abstract
The quest for energy, possibly from renewable sources, is rapidly increasing, driven by new digital technologies that are taking up more and more space in our lives and by electric vehicles expected to replace old combustion ones. However, today’s battery technology is lagging behind adjacent technological advances, with most devices using lithium-ion batteries, which bring with them several concerns, not least their availability in Europe. To create a European energy platform for the future, bringing together renewable energy sources, electric transportation and a connected Internet of Things, a new solution for battery technology needs to be found. This keynote will explore how current challenges can be overcome through the application of advances in new materials, what Europe is doing in the field of batteries, the need for skilled people, and how the future of battery technology can contribute to building a better, greener and connected world.
13:50 CET K.3.2 Q&A SESSION
Author:
Marco Casale-Rossi, Synopsys, IT
Abstract
Questions and answers with the speaker

2.1 Energy-autonomous systems for next generation of IoT

Date: Monday, 14 March 2022
Time: 14:30 - 16:00 CET

Session chair:
Marco Casale-Rossi, Synopsys, IT

Session co-chair:
Giovanni De Micheli, EPFL, CH

Energy-autonomous systems hold the promise of perpetual operation for low-power sensing systems and the next generation of the Internet of Things. The key enabling technologies towards this vision are energy-harvesting transducers and energy-efficient converters, including micro-power management, energy storage and ultra-low-power electronics. Harvesting the power required for operation from the surrounding environment exploits several physical effects and specific energy transducers, such as electromechanical, thermoelectric and photovoltaic ones. The limited and intermittent nature of the available power requires dedicated micro-power management circuits for proper interfacing with conventional electronic loads. Ultimately, however, the success of energy-autonomous systems and their applications depends on energy-aware, low-power design from the outset. This session will review the main technologies supporting energy-autonomous systems, and will focus on advances in micro-power management circuits and successful applications of energy-harvesting technologies to achieve the next generation of IoT based on perpetually connected intelligent devices.

Time Label Presentation Title
Authors
14:30 CET 2.1.1 MICROPOWER MANAGEMENT TECHNIQUES FOR ENERGY HARVESTING APPLICATIONS
Speaker and Author:
Aldo Romani, University of Bologna, IT
Abstract
This talk will review the main technologies adopted for energy harvesting with different types of transducers and the associated power conversion techniques targeting the most efficient trade-offs between maximum power point tracking, efficiency and internal consumption. Some specific implementations will be reviewed. Finally, emerging technology trends will be discussed along with application perspectives.
15:00 CET 2.1.2 FULLY SELF-POWERED WIRELESS SENSORS ENABLED BY OPTIMIZED POWER MANAGEMENT MODULES
Speaker and Author:
Peter Spies, Fraunhofer IIS, DE
Abstract
The power supply of wireless sensors can be assisted or completely covered by energy harvesting technologies. Whether fully self-powered operation by energy harvesting is feasible depends strongly on the ambient conditions, the use-case requirements and the available board space for harvesting building blocks. Besides these conditions and requirements, the efficiency of the power supply functional blocks and the system control can play a major role in achieving fully self-powered and unlimited operation time. The talk will introduce building blocks for energy harvesting power supplies to reach the goal of full autonomy. It will also discuss wireless technologies and system control strategies, which are of paramount importance in self-powered wireless sensors. Different application examples will illustrate the introduced building blocks and technologies, with a focus on condition monitoring and predictive maintenance use cases.
15:30 CET 2.1.3 DESIGN OF SELF-SUSTAINING CONNECTED SMART DEVICES
Speaker and Author:
Michele Magno, ETH Zürich, CH
Abstract
The Internet of Things is a revolutionary technology that aims to create an ecosystem of connected smart devices and smart sensors, providing ubiquitous connectivity between trillions of devices. Recent advancements in the miniaturization of devices with higher computational capabilities and ultra-low-power technology have enabled the vast deployment of sensors, with significant changes in hardware design, software, network architecture, data analytics, data storage and power sources. However, the largest portion of IoT devices is still powered by batteries. This talk will focus on the viable solution of harvesting energy from the environment to provide enough energy to smart devices, combining energy harvesting, low-power devices and edge computing, including machine learning on low-power processors and even directly on MEMS sensors, to achieve truly self-sustaining smart sensors.

3.1 Panel: Quantum Software Toolchain

Date: Monday, 14 March 2022
Time: 16:30 - 18:00 CET

Session chair:
Aida Todri Sanial, LIRMM, FR

Session co-chair:
Anne Matsuura, Intel, US

Panellists:
Xin-Chuan (Ryan) Wu, Intel, US
Ali Javadi-Abhari, IBM Research, US
Ross Duncan, Cambridge Quantum Computing / University of Strathclyde, GB
Carmen G. Almudever, TU Valencia, ES

Today’s quantum software toolchains are integral to system-level design of quantum computers. Compilers, system software, qubit simulators, and other software tools are being used to develop and execute quantum workloads and drive architectural research and design of both software and hardware. In this session, industry experts cover the latest software research and development for quantum computing systems.


4.1 Panel: Quantum Hardware

Date: Tuesday, 15 March 2022
Time: 09:00 - 10:30 CET

Session chair:
Anne Matsuura, Intel, US

Session co-chair:
Aida Todri Sanial, LIRMM, FR

Panellists:
Lieven Vandersypen, Delft University of Technology, NL
Lotte Geck, Forschungszentrum Jülich, DE
Steven Brebels, IMEC, BE
Heike Riel, IBM Research, CH

This session highlights recent advancements in qubits and qubit control. Industrial and academic experts present the latest hardware developments for quantum computing, from materials and qubit devices to qubit control systems.


5.1 Novel Design Techniques for Emerging Technologies in Computing

Date: Tuesday, 15 March 2022
Time: 11:00 - 12:30 CET

Session chair:
Scott Robertson Temple, University of Utah, US

This session is devoted to innovations in design techniques for emerging technologies in computing. The first paper proposes a new security locking scheme based on a hybrid CMOS/nanomagnet logic system. The second paper introduces automated methodologies for standard-cell design using reconfigurable transistors. The third paper reports advances in the design of complementary FET devices, which show promise for sub-5nm nodes. The fourth and last paper presents an industrial RTL-to-GDSII flow for the AQFP superconducting logic family, also discussing novel synthesis opportunities for this technology.

Time Label Presentation Title
Authors
11:00 CET 5.1.1 PHYSICALLY & ALGORITHMICALLY SECURE LOGIC LOCKING WITH HYBRID CMOS/NANOMAGNET LOGIC CIRCUITS
Speaker:
Alexander J. Edwards, University of Texas at Dallas, US
Authors:
Alexander Edwards1, Naimul Hassan1, Dhritiman Bhattacharya2, Mustafa Shihab1, Peng Zhou1, Xuan Hu1, Jayasimha Atulasimha2, Yiorgos Makris1 and Joseph Friedman1
1University of Texas at Dallas, US; 2Virginia Commonwealth University, US
Abstract
The successful logic locking of integrated circuits requires that the system be secure against both algorithmic and physical attacks. In order to provide resilience against imaging techniques that can detect electrical behavior, we recently proposed an approach for physically and algorithmically secure logic locking with strain-protected nanomagnet logic (NML). While this NML system exhibits physical and algorithmic security, the fabrication imprecision, noise-related errors, and slow speed of NML incur a significant overhead cost for this security. In this paper, we therefore propose a hybrid CMOS/NML logic locking solution in which NML islands provide security within a system primarily composed of CMOS, thereby providing physical and algorithmic security with minimal overhead. In addition to describing this proposed system, we also develop a framework for device/system co-design techniques that consider the trade-offs between efficiency and security.
11:20 CET 5.1.2 EXPLORING STANDARD-CELL DESIGN FOR RECONFIGURABLE NANOTECHNOLOGIES: A FORMAL APPROACH
Speaker:
Michael Raitza, TU Dresden, DE
Authors:
Michael Raitza, Steffen Märcker, Shubham Rai and Akash Kumar, TU Dresden, DE
Abstract
Standard-cell design has always been a craft, and common field-effect transistors span only a narrow design space. This has changed with reconfigurable transistors. Boolean functions that exhibit multiple dual product terms in their sum-of-products form yield various beneficial circuit implementations with reconfigurable transistors. In this work, we present an approach to automatically generate these implementations through formal modeling. Using the 3-input XOR function as an example, we discuss the variations and show how to quantify properties like worst-case delay and power dissipation, as well as averages of delay and energy consumption per operation over different scenarios. The quantification runs fully automated on charge-transport network models employing probabilistic model checking. This yields exact results instead of approximations obtained from experiments and sampling. Our results show several benefits of reconfigurable transistor circuits over static CMOS implementations.
11:40 CET 5.1.3 DESIGN ENABLEMENT OF CFET DEVICES FOR SUB-2NM CMOS NODES
Speaker:
Odysseas Zografos, imec, BE
Authors:
Odysseas Zografos, Bilal Chehab, Pieter Schuddinck, Gioele Mirabeli, Naveen Kakarla, Yang Xiang, Pieter Weckx and Julien Ryckaert, imec, BE
Abstract
Novel devices that optimize their structure in a three-dimensional fashion and offer significant area gains by reducing standard-cell track height are adopted to scale silicon technologies beyond the 5nm node. One such device is the Complementary FET (CFET), which consists of an n-type channel stacked vertically over a p-type channel. In this paper, we review the significant benefits of CFET devices as well as the challenges that arise with their use. More specifically, we focus on the standard-cell design challenges as well as the physical-implementation ones. We show that to fully exploit the area benefits of CFET devices, one must carefully select the metal stack used for the physical implementation of a large design.
12:00 CET 5.1.4 MAJORITY-BASED DESIGN FLOW FOR AQFP SUPERCONDUCTING FAMILY
Speaker:
Giulia Meuli, Synopsys, IT
Authors:
Giulia Meuli1, Vinicius Possani2, Rajinder Singh2, Siang-Yun Lee3, Alessandro Tempia Calvino4, Dewmini Marakkalage4, Patrick Vuillod5, Luca Amarù6, Scott Chase6, Jamil Kawa7 and Giovanni De Micheli8
1Synopsys, IT; 2Synopsys Inc., US; 3École Polytechnique Fédérale de Lausanne, CH; 4EPFL, CH; 5Synopsys Inc., FR; 6Synopsys Inc, US; 7Synopsys, Inc., US; 8École Polytechnique Fédérale de Lausanne (EPFL), CH
Abstract
Adiabatic superconducting devices are promising candidates for developing high-speed/low-power electronics. Advances in physical technology must be matched with the systematic development of comprehensive design and simulation tools to bring superconducting electronics to a commercially viable state. Because the technology is fundamentally different from CMOS, new challenges are posed to design automation tools: library cells are controlled by multi-phase clocks, they implement the majority logic function, and they have limited fanout. We present a product-level RTL-to-GDSII flow for the design of Adiabatic Quantum-Flux-Parametron (AQFP) electronic circuits, with a focus on the special techniques used to comply with these challenges. In addition, we demonstrate new optimization opportunities for graph matching, resynthesis, and buffer/splitter insertion, improving on the state of the art.

K.4 Lunch Keynote: "AI in the edge; the edge of AI"

Date: Tuesday, 15 March 2022
Time: 13:10 - 14:00 CET

Session chair:
Gi-Joon Nam, IBM, US

Session co-chair:
Marian Verhelst, KU Leuven, BE

In the world of IoT, both humans and objects are continuously connected, collecting and communicating data, in a rising number of applications including Industry 4.0, biomedical, environmental monitoring, and smart houses and offices. Local computation at the edge has become a necessity to limit data traffic. Additionally, embedding AI processing at the edge adds potentially high levels of smart autonomy to these IoT 2.0 systems. Progress in nanoelectronic technology makes it possible to do this with power- and hardware-efficient architectures and designs. This keynote gives an overview of key solutions, but also describes the main limitations and risks, exploring the edge of edge AI.

Speaker's bio: Georges G.E. Gielen received the MSc and PhD degrees in Electrical Engineering from the Katholieke Universiteit Leuven (KU Leuven), Belgium, in 1986 and 1990, respectively. He is currently a Full Professor in the MICAS research division of the Department of Electrical Engineering (ESAT) at KU Leuven. From August 2013 until July 2017 he was also appointed at KU Leuven as Vice-Rector for the Group of Sciences, Engineering and Technology, where he was also responsible for academic Human Resource Management. He was a visiting professor at UC Berkeley and Stanford University. Since 2020 he has been Chair of the Department of Electrical Engineering. His research interests are in the design of analog and mixed-signal integrated circuits, and especially in analog and mixed-signal CAD tools and design automation. He is a frequently invited speaker/lecturer and coordinator/partner of several (industrial) research projects in this area, including several European projects. He has (co-)authored 10 books and more than 600 papers in edited books, international journals and conference proceedings. He is a 1997 Laureate of the Belgian Royal Academy of Sciences, Literature and Arts in the discipline of Engineering. He has been a Fellow of the IEEE since 2002, and received the IEEE CAS Mac Van Valkenburg Award in 2015 and the IEEE CAS Charles Desoer Award in 2020. He is an elected member of the Academia Europæa.

Time Label Presentation Title
Authors
13:10 CET K.4.1 AI IN THE EDGE; THE EDGE OF AI
Speaker and Author:
Georges Gielen, KU Leuven, BE
Abstract
In the world of IoT, both humans and objects are continuously connected, collecting and communicating data, in a rising number of applications including Industry 4.0, biomedical, environmental monitoring, and smart houses and offices. Local computation at the edge has become a necessity to limit data traffic. Additionally, embedding AI processing at the edge adds potentially high levels of smart autonomy to these IoT 2.0 systems. Progress in nanoelectronic technology makes it possible to do this with power- and hardware-efficient architectures and designs. This keynote gives an overview of key solutions, but also describes the main limitations and risks, exploring the edge of edge AI.
13:50 CET K.4.2 Q&A SESSION
Author:
Gi-Joon Nam, IBM Research, US
Abstract
Questions and answers with the speaker

6.1 Alternative design paradigms for sustainable IoT nodes

Date: Tuesday, 15 March 2022
Time: 14:30 - 16:00 CET

Session chair:
David Atienza, EPFL, CH

Session co-chair:
Ayse Coskun, Boston University, US

While the potential influence of AI in the context of IoT on our daily life is enormous, there are significant challenges related to the ethics and interpretability of AI results, as well as the ecological implications of system design for deep-learning technologies. This special session investigates how progress in AI technologies can be combined with alternative design paradigms for smart nodes so that the future of IoT can be nurtured and cultivated in a sustainable way for the benefit of society.

Time Label Presentation Title
Authors
14:30 CET 6.1.1 BIO-INSPIRED ENERGY EFFICIENT ALL-SPIKING INTERNET OF THINGS NODES
Speaker:
Adrian M. Ionescu, EPFL, CH
Author:
Adrian Ionescu, EPFL, CH
Abstract
In this talk, we will present bio-inspired innovations exploiting phase-change and ferroelectric materials and devices for all-spiking IoT nodes and edge-AI event detection applications. In particular, we will report new progress in (i) electromagnetic and optical spiking sensors based on vanadium dioxide, and (ii) ferroelectric neurons and synapses built with doped high-k dielectrics on 2D semiconducting materials. The future implications for improving the energy efficiency of IoT nodes will be discussed.
15:00 CET 6.1.2 HYBRID DIGITAL-ANALOG SYSTEMS-ON-CHIP FOR EFFICIENT EDGE AI
Speaker:
Marian Verhelst, KU Leuven, BE
Authors:
Marian Verhelst1, Kodai Ueyoshi1, Giuseppe Sarda1, Pouya Houshmand1, Ioannis Papistas2, Vikram Jain1, Man Shi1, Peter Vrancx3, Debjyoti Bhattacharjee3, Stefan Cosemans2, Arindam Mallik3 and Peter Debacker3
1KU Leuven, BE; 2Imec and Axelera, BE; 3imec, BE
Abstract
Deep inference workloads at the edge are characterized by a wide variety of neural network layer topologies and characteristics. While large convolutional layers execute very efficiently on the dense compute-in-memory co-processors appearing in the literature, other layer types (grouped convolutions, layers with low channel counts or high precision requirements) benefit from digital execution. This talk discusses a new breed of heterogeneous SoCs, integrating co-processors of different natures into a common processing system with tightly coupled shared memory, able to dispatch every layer to the most suitable accelerator.
15:30 CET 6.1.3 3D COMPUTE CUBES FOR EDGE INTELLIGENCE: NANOELECTRONIC-ENABLED ADAPTIVE SYSTEMS BASED ON JUNCTIONLESS, AMBIPOLAR, AND FERROELECTRIC VERTICAL FETS
Speaker:
Ian O'Connor, Lyon Institute of Nanotechnology, FR
Authors:
Ian O'Connor1, David Atienza2, Jens Trommer3, Oskar Baumgartner4, Guilhem Larrieu5 and Cristell Maneux6
1Lyon Institute of Nanotechnology, FR; 2École Polytechnique Fédérale de Lausanne (EPFL), CH; 3Namlab gGmbH, DE; 4Global TCAD Solutions, AT; 5LAAS – CNRS, FR; 6University of Bordeaux, FR
Abstract
New computing paradigms and technologies are required to respond to the challenges of data-intensive edge intelligence. We propose a triple combination of emerging technologies for the fine interweaving of versatile logic functionality and memory for reconfigurable in-memory computing: vertical junctionless gate-all-around nanowire transistors for ultimate downscaling; ambipolar functionality enhancement for fine-grain flexibility; ferroelectric oxides for non-volatile logic operation. Through a DTCO approach, this talk will describe the design of 3D compute cubes naturally suited to the hardware acceleration of computation-intensive kernels, as well as their integration into computing systems, introducing a system-wide exploration framework to assess their effectiveness. HW/SW optimization will also be described with a focus on Transformer and Conformer networks and the matrix multiplication kernel, which dominates their run-time.

7.1 Panel: Autonomous Systems Design as a Research Challenge

Date: Tuesday, 15 March 2022
Time: 16:30 - 18:00 CET

Session chair:
Selma Saidi, TU Dortmund, DE

Session co-chair:
Rolf Ernst, TU Braunschweig, DE

Panellists:
Karl-Erik Arzen, Lund University, SE
Peter Liggesmeyer, Fraunhofer Institute for Experimental Software Engineering IESE, DE
Axel Jantsch, TU Wien, AT

Autonomous systems require specific design methods that leave behavioral freedom and plan for the unexpected without losing trustworthiness and dependability. How does this requirement influence research at major research institutions? How is it reflected in public funding? Should autonomous systems design become a new discipline or should the regular design process be adapted to handle autonomy? The panel will begin with position statements by the panelists, followed by an open discussion with the hybrid audience.


8.1 Young People Program: Career Fair

Date: Wednesday, 16 March 2022
Time: 16:00 - 17:00 CET

Session chair:
Anton Klotz, Cadence, DE

Session co-chair:
Xavier Salazar, Barcelona Supercomputing Center & HiPEAC, ES

The Career Fair aims at bringing together Ph.D. students and potential job seekers with recruiters from EDA and microelectronics companies. In this slot, sponsoring companies present themselves to job seekers and to the DATE community.

Time Label Presentation Title
Authors
16:00 CET 8.1.1 INTRODUCTION TO THE CAREER FAIR
Speaker and Author:
Anton Klotz, Cadence Design Systems, DE
Abstract
Introduction to Career Fair. How to apply for listed positions
16:10 CET 8.1.2 CADENCE DESIGN SYSTEMS
Speaker and Author:
Anton Klotz, Cadence Design Systems, DE
Abstract
Introducing Cadence Design Systems as employer for young talents
16:17 CET 8.1.3 IMMS
Speaker and Author:
Eric Schaefer, IMMS, DE
Abstract
Introducing IMMS as employer for young talents
16:23 CET 8.1.4 SIEMENS EDA
Speaker and Author:
Janani Muruganandam, Siemens, NL
Abstract
Introducing Siemens EDA as employer for young talents
16:30 CET 8.1.5 SYNOPSYS
Speaker and Author:
Markus Wedler, Synopsys, DE
Abstract
Introducing Synopsys as employer for young talents
16:37 CET 8.1.6 ANSYS
Speaker and Author:
Helene Tabourier, Ansys, DE
Abstract
Introducing Ansys as employer for young talents
16:43 CET 8.1.7 INTEL
Speaker and Author:
Pablo Herrero, INTEL, DE
Abstract
Introducing Intel as employer for young talents
16:50 CET 8.1.8 BOSCH
Speaker and Author:
Atefe Dalirsani, BOSCH, DE
Abstract
Introducing Bosch as employer for young talents

9.1 Young People Program: Sponsorship Fair

Date: Wednesday, 16 March 2022
Time: 17:00 - 18:30 CET

Session chair:
Sara Vinco, Politecnico di Torino, IT

Session co-chair:
Anton Klotz, Cadence, DE

The Sponsorship Fair aims at bringing together university student teams involved in international competitions and personnel from EDA and microelectronics companies. In this slot, student teams present their activities, success stories, and challenges to the DATE audience and to sponsoring companies, in order to build new collaborations.

Time Label Presentation Title
Authors
17:00 CET 9.1.1 DUTCH NAO TEAM
Speaker and Author:
Thomas Wiggers, University of Amsterdam, NL
Abstract
Dutch Nao Team is a team of bachelor and master students from the University of Amsterdam that program robots to play football autonomously. Dutch Nao Team competes in the RoboCup SPL League and competitions around the world.
17:10 CET 9.1.2 SQUADRA CORSE POLITO
Speaker and Author:
Enrico Salvatore, Politecnico di Torino, IT
Abstract
Squadra Corse PoliTO is the Formula SAE team of the Politecnico di Torino. The team is entirely run by students of the Politecnico di Torino who design, manufacture, test, and compete with formula style race cars in the Formula Student competitions. The team qualified for all the major Formula SAE student competitions of the 2021-2022 season.
17:20 CET 9.1.3 DYNAMIS PRC
Speaker and Author:
Ishac Oursana, Politecnico di Milano, IT
Abstract
Dynamis PRC is the Formula Student team of Politecnico di Milano. Originally working on combustion engines, the Dynamis PRC team now also works on electric prototypes and autonomous driving. Dynamis PRC placed 1st overall at FSN and FSATA in 2019.
17:30 CET 9.1.4 HYPED
Speaker and Author:
Marina Antonogiannaki, University of Edinburgh, GB
Abstract
HYPED is the Edinburgh University Hyperloop Team. HYPED co-organises the European Hyperloop Week to promote the development of Hyperloop and to connect students with the industry. HYPED was among the finalists of the SpaceX Hyperloop Pod Competition from 2017 to 2019 and won the Virgin Hyperloop One Global Challenge.
17:40 CET 9.1.5 ONELOOP AT UC DAVIS
Speaker and Author:
Zbynka Kekula, UC Davis, US
Abstract
OneLoop is a student-run organization at UC Davis working on developing a Hyperloop pod. Since the first SpaceX competition in 2017, they have continued to excel in competitions and in furthering Hyperloop research.
17:50 CET 9.1.6 NEUROTECH LEUVEN
Speaker and Author:
Jonah Van Assche, KU Leuven, BE
Abstract
NeuroTech Leuven is a team of students from KU Leuven, Belgium, who are interested in all things “Neuro”, ranging from neuroscience to neurotechnology. The NeuroTech team takes part in the NeuroTechX Competition.
18:00 CET 9.1.7 Q&A SESSION
Authors:
Sara Vinco1 and Anton Klotz2
1Politecnico di Torino, IT; 2Cadence Design Systems, DE
Abstract
This poster session allows closer interaction between student teams and EDA and microelectronics companies, enabling discussion of sponsorship opportunities, e.g., in terms of monetary sponsorships, licenses, and tutorials.

10.1 PhDForum

Date: Wednesday, 16 March 2022
Time: 18:30 - 20:30 CET

Session chair:
Gabriela Nicolescu, École Polytechnique de Montréal, CA

Session co-chair:
Mahdi Nikdast, Colorado State University, US

The PhD Forum is an online poster session hosted by EDAA, ACM-SIGDA, and IEEE CEDA for PhD students who have completed their PhD thesis within the last 12 months or who are close to completing their thesis work. It represents an excellent opportunity for them to get feedback on their research, and for industry to get a glance at the state of the art in system design and design automation.

Time Label Presentation Title
Authors
18:30 CET 10.1.1 NOVEL ATTACK AND DEFENSE STRATEGIES FOR ENHANCED LOGIC LOCKING SECURITY
Speaker:
Lilas Alrahis, New York University Abu Dhabi, AE
Authors:
Lilas Alrahis1 and Hani Saleh2
1New York University Abu Dhabi, AE; 2Khalifa University, AE
Abstract
The globalized and, thus, distributed semiconductor supply chain creates an attack vector for untrusted entities to steal the intellectual property (IP) of a design. To ward off the threat of IP theft, researchers have developed various countermeasures such as state-space obfuscation, split manufacturing, and logic locking (LL). LL is a holistic design-for-trust technique that aims to protect the design IP from untrustworthy entities throughout the IC supply chain by locking the functionality of the design. State-of-the-art LL solutions such as provably secure logic locking (PSLL) and scan locking/obfuscation aim to offer protection against prominent attacks such as the Boolean satisfiability (SAT)-based attack. However, these implementations mostly focus on thwarting the SAT-based attack, leaving them vulnerable to other unexplored threats. The underlying research objective of this Ph.D. work is to enhance the security of LL by exposing and then addressing its security vulnerabilities.
18:30 CET 10.1.2 PROPER ABSTRACTIONS FOR DIGITAL ELECTRONIC CIRCUITS: A PHYSICALLY GUIDED APPROACH
Speaker:
Jurgen Maier, TU Wien, AT
Author:
Jürgen Maier, TU Wien, AT
Abstract
In this thesis I show that developing abstractions that describe the behavior of digital electronic circuits in a simple yet accurate fashion can be efficiently guided by identifying the underlying physical processes. Based on transistor-level analysis, analog SPICE simulations, and even formal proofs, I provide approximations of the analog signal trajectories inside a circuit and of the signal propagation delay in the digital domain. In addition, I introduce methods for efficient characterization of the Schmitt trigger, including its metastable and dynamic behavior. Overall, the developed abstractions are highly faithful in the sense that only physically reasonable behavior can be modeled, and vice versa. This leads to more powerful, accurate, and trustworthy results, allowing one to identify problematic spots in a circuit with higher confidence in less time. Nevertheless, no "silver bullet" with respect to modeling abstractions could be found, meaning that each abstraction requires careful analysis of the physical behavior to achieve optimal performance, accuracy, and coverage.
18:30 CET 10.1.3 RETRAINING-FREE WEIGHT-SHARING FOR CNN COMPRESSION
Speaker:
Etienne Dupuis, Lyon Institute of Nanotechnology, FR
Authors:
Etienne Dupuis1, David Novo2, Alberto Bosio3 and Ian O'Connor3
1Institut des Nanotechnologies de Lyon, FR; 2CNRS, LIRMM, University of Montpellier, FR; 3Lyon Institute of Nanotechnology, FR
Abstract
The weight-sharing (WS) technique gives promising results in compressing convolutional neural networks (CNNs), but it requires carefully determining the shared values for each layer of a given CNN. The WS design space exploration (DSE) time can easily explode for state-of-the-art CNNs. We propose a new heuristic approach to drastically reduce the exploration time without sacrificing the quality of the output. Results on recent CNNs (GoogleNet, ResNet50V2, MobileNetV2, InceptionV3, and EfficientNet), trained on the ImageNet dataset, show over 5× memory compression at an acceptable accuracy loss (complying with the MLPerf quality target) without any retraining step.
18:30 CET 10.1.4 INTELLIGENT CIRCUIT DESIGN AND IMPLEMENTATION WITH MACHINE LEARNING IN EDA
Speaker and Author:
Zhiyao Xie, Duke University, US
Abstract
EDA (Electronic Design Automation) technology has achieved remarkable progress over the past decades, from attaining merely functionally correct designs to handling multi-million-gate circuits. However, chip design is still not fully automatic, and the gap is not easily surmountable. For example, automation of the EDA flow is still largely restricted to individual point tools with little interplay across different tools and design steps. Tools in early steps cannot judge well whether their solutions will eventually lead to satisfactory designs, and the consequences of a poor solution are not discovered until very late. A major weakness of these traditional EDA technologies is insufficient reuse of prior design knowledge. Conventional optimization techniques construct solutions from scratch even if similar optimizations have already been performed, perhaps even repeatedly. Predictive models are either inaccurate or dependent on trial designs, which are very time- and resource-consuming. These limitations point to a major strength of machine learning (ML): the capability to explore highly complex correlations between two design stages based on prior data. During my Ph.D. study, I constructed multiple fast yet accurate models for various design objectives in EDA with customized ML algorithms.
18:30 CET 10.1.5 CROSS-LAYER TECHNIQUES FOR ENERGY-EFFICIENCY AND RESILIENCY OF ADVANCED MACHINE LEARNING ARCHITECTURES
Speaker:
Alberto Marchisio, TU Wien, AT
Authors:
Alberto Marchisio1 and Muhammad Shafique2
1TU Wien (TU Wien), AT; 2New York University Abu Dhabi, AE
Abstract
Machine learning (ML) algorithms have shown high levels of accuracy in several tasks; therefore, ML-based applications are widely used in many systems and platforms. However, the development of efficient ML-based systems requires addressing two key research problems: energy efficiency and security. Current trends show growing interest in the community in complex ML models, such as deep neural networks (DNNs), capsule networks (CapsNets), and spiking neural networks (SNNs). Besides their high learning capabilities, their complexity poses several challenges in addressing the above research problems. In this work, we explore cross-layer concepts that engage both hardware- and software-level techniques to build resilient and energy-efficient architectures for these networks.
18:30 CET 10.1.6 DESIGN & ANALYSIS OF AN ON-CHIP PROCESSOR FOR THE AUTISM SPECTRUM DISORDER (ASD) CHILDREN ASSISTANCE USING THEIR EMOTIONS
Speaker:
Abdul Rehman Aslam, Lahore University of Management Sciences, PK
Authors:
Abdul Rehman Aslam and Muhammad Awais Bin Altaf, Lahore University of Management Sciences, Pakistan, PK
Abstract
Autism Spectrum Disorder (ASD) is a neurological disorder that affects the cognitive and emotional abilities of children. The number of ASD patients has increased drastically in the past decade. The World Health Organization estimates that around 1 out of every 160 children in the United States is an ASD patient. The actual number may be substantially higher, as many patients are not reported due to the stigma associated with ASD diagnosis methods. The ASD statistics can be more severe in underdeveloped countries that lack basic health facilities for a major part of the population. The conventional Autism Diagnosis Observation Schedule (ADOS-2) diagnosis methods require extensive behavioral evaluations and frequent visits of the children to neurologists. These extensive evaluations lead to late diagnosis and hence late treatment. The chronic ailment of the central nervous system in ASD causes the degradation of emotional and cognitive abilities. ASD patients suffer from attention deficit hyperactivity disorder, memory issues, inability to make decisions, emotional issues, and lack of self-control. The lack of self-control is most pronounced in their emotions: they have highly imbalanced emotions and face negative emotional outbursts (NEOBs), impulses of negative emotion causing self-injuries and suicide attempts leading to death. Long-term continuous monitoring of human emotions with neurofeedback is therefore crucial for ASD patients. Timely prediction of NEOBs is crucial to mitigating their harmful effects, as emotion prediction can be used to regulate emotions by controlling these NEOBs. This need can be addressed by an electroencephalography (EEG)-based, non-invasive, real-time, and continuous emotion prediction system on chip (SoC) embedded inside a headband.
This work targets the design and analysis of the digital back-end (DBE) processor for a fully integrated wearable emotion prediction SoC. The SoC comprises an analog front-end (AFE) for EEG data acquisition and a DBE processor for emotion prediction. The miniaturized low-power processor can be embedded in a headband (patch sensor) for the timely prediction of NEOBs. An SoC that predicts NEOBs and records their patterns was designed and implemented in a 0.18 µm 1P6M CMOS process. The dual-channel deep neural network (DNN)-based emotion classification processor utilizes only two EEG channels; this minimal channel count minimizes the patient's discomfort while wearing the headband SoC. The DBE classification processor utilizes only two features per channel to minimize area and power and to overcome overfitting problems. The proposed approximated skewness indicator feature was implemented with an 86× smaller area (gate count) after tuning the conventional mathematical formula for skewness. The DNN classifier was implemented in a semi-pipelined manner, after instruction rescheduling and a customized arithmetic and logic unit implementation, with a 34× smaller area (gate count). The sigmoid activation function was implemented with 50% lower memory resources thanks to the symmetry between positive and negative sigmoid values. An overall area efficiency of 71% was achieved for the DNN classification unit. The 16 mm² SoC consumes 10.13 µJ per classification for two-channel operation while achieving an average accuracy of >85% on multiple emotion databases and in real-time testing. This is the world's first SoC for emotion prediction targeting ASD patients with minimal hardware resources.
The SoC can also be used for ASD prediction with an excellent classification accuracy of 95%.
18:30 CET 10.1.7 RESILIENCE AND ENERGY-EFFICIENCY FOR DEEP LEARNING AND SPIKING NEURAL NETWORKS FOR EMBEDDED SYSTEMS
Speaker:
Rachmad Vidya Wicaksana Putra, TU Wien, AT
Authors:
Rachmad Vidya Wicaksana Putra1 and Muhammad Shafique2
1TU Wien, AT; 2New York University Abu Dhabi, AE
Abstract
Neural networks (NNs) have become prominent machine learning (ML) algorithms because they achieve state-of-the-art accuracy for various data analytic applications, such as object recognition, healthcare, and autonomous driving. However, deploying advanced NN algorithms, such as deep neural networks (DNNs) and spiking neural networks (SNNs), on resource-constrained embedded systems is challenging because of their memory- and compute-intensive nature. Moreover, existing SNN-based systems still cannot adapt to dynamic operating environments that make offline-learned knowledge obsolete, and they suffer from the negative impact of hardware-induced faults, thereby degrading the accuracy. Therefore, in this PhD work, we explore cross-layer hardware (HW)- and software (SW)-level techniques for building resilient and energy-efficient NN-based systems to enable their deployment in embedded applications in a reliable manner under diverse operating conditions.
18:30 CET 10.1.8 MODELING AND OPTIMIZATION OF EMERGING AI ACCELERATORS UNDER RANDOM UNCERTAINTIES
Speaker and Author:
Sanmitra Banerjee, Duke University, US
Abstract
Artificial intelligence (AI) accelerators based on carbon nanotube FETs (CNFETs) and silicon-photonic neural networks (SPNNs) enable ultra-low-energy and ultra-high-speed matrix multiplication. However, these emerging technologies are susceptible to inevitable fabrication-process variations and manufacturing defects. My Ph.D. dissertation focuses on the development of a comprehensive modeling framework to analyze such uncertainties and their impact on emerging AI accelerators. We show that the nature of uncertainties in CNFETs and SPNNs differs from that in Si CMOS circuits and, as such, the application and effectiveness of conventional EDA and test approaches are significantly restricted for these emerging technologies. To address this, we also propose several novel technology-aware design optimization and test generation methods to facilitate yield ramp-up of next-generation AI accelerators.
18:30 CET 10.1.9 LOGIC SYNTHESIS IN THE MACHINE LEARNING ERA: IMPROVING CORRELATION AND HEURISTICS
Speaker:
Walter Lau Neto, University of Utah, US
Authors:
Walter Lau Neto and Pierre-Emmanuel Gaillardon, University of Utah, US
Abstract
This extended abstract proposes to explore current advances in Machine Learning (ML) techniques to enhance both abstraction and heuristics in logic synthesis. We start by proposing a Convolutional Neural Network (CNN) model to predict, early in the flow, post-Place & Route (PnR) critical paths, and a method to use this information to optimize these paths, achieving a 15.3% improvement in ADP and an 18.5% improvement in EDP. We also present a CNN model to be used during technology mapping that implements a novel cut-pruning policy, improving the mapping delay by an average of 10% compared to ABC, the state-of-the-art open-source technology mapper, at a cost of 2% area. Our model for technology mapping replaces a core heuristic, which to the best of our knowledge is a novel contribution. Most previous work on ML in EDA uses ML to forecast metrics and tune the flow, not embedded as a core heuristic.
18:30 CET 10.1.10 ACCELERATING CNN INFERENCE NEAR TO THE MEMORY BY EXPLOITING PARALLELISM, SPARSITY, AND REDUNDANCY
Speaker:
Palash Das, Indian Institute of Technology, Guwahati, IN
Authors:
Palash Das and Hemangee Kapoor, Indian Institute of Technology, Guwahati, IN
Abstract
Convolutional Neural Networks (CNNs) have become a promising tool for deep learning, specifically in the domain of computer vision. Deep CNNs have widespread use in real-life applications like image classification, object detection, and image segmentation. The inference phase of CNNs is often used in real time for faster prediction and classification, and hence demands high performance and energy efficiency from the system. Towards designing such systems, we implement multiple strategies that make real-time inference substantially faster at the cost of minimal area/power overhead. We implement multiple custom accelerators with various capabilities and integrate them close to the main memory to reduce memory access latency/energy using the near-memory processing (NMP) concept. In our first contribution, we design custom hardware, the convolutional logic unit (CLU), and integrate it close to a 3D memory, specifically the hybrid memory cube (HMC). We propose a dataflow that helps parallelize the CNN tasks for concurrent execution. In the second contribution, we propose an architecture that leverages the benefits of HMC-based NMP, exploiting parallelism and data sparsity. In the third contribution, apart from NMP and parallelism, the proposed hardware can also remove the redundant multiplications of inference through a lookaside memory (LAM)-based search technique, making inference substantially faster by reducing the number of costly multiplication operations. Lastly, we investigate the efficacy of NMP with conventional DRAM while accelerating inference. While implementing NMP in DRAM, we also explore the design space with our hardware modules based on parameters like performance, power consumption, and area overhead.
18:30 CET 10.1.11 DESIGN AUTOMATION FOR ADVANCED MICROFLUIDIC BIOCHIPS
Speaker and Author:
Debraj Kundu, IITR, IN
Abstract
The science behind the handling of fluids at the nano- to femtoliter scale in order to automate a bio-application is termed microfluidics, and the devices used in such processes are generally called biochips. Due to recent advancements in the fabrication technologies of these biochips, their design automation field has boomed over the last decade. Integration, precision, and high throughput are the main advantages of biochips over lab-based macro systems. Based on the working principle, these biochips can be broadly classified as continuous-flow-based microfluidic biochips (CFMBs) and digital microfluidic biochips (DMFBs). In order to automate various bio-applications on a biochip, different design automation methodologies are required for different kinds of biochips. We provide rigorous and elegant design automation techniques for sample preparation, fluid loading, placement of mixers, and scheduling of mixing graphs in MEDA, PMD, and CMF biochips.
18:30 CET 10.1.12 ULTRA-FAST TEMPERATURE ESTIMATION METHODS FOR ARCHITECTURE-LEVEL THERMAL MODELING
Speaker and Author:
Hameedah Sultan, Indian Institute of Technology Delhi, IN
Abstract
As the power density of modern-day chips has increased, the chip temperature, too, has risen steadily. High temperature causes several adverse effects on the chip's performance and reliability. It also increases the leakage power, which further increases the on-chip temperature, resulting in a feedback effect. In order to carry out temperature-aware design optimization, it is often necessary to conduct thousands of temperature simulations at various stages of the design cycle, so simulation speed without a concomitant loss in accuracy is essential. State-of-the-art thermal estimation methods have serious limitations in modeling some important thermal effects; additionally, these methods are slow. We overcome these limitations by developing fast Green's-function-based analytical methods.
18:30 CET 10.1.13 MULTI-OBJECTIVE DIGITAL VLSI DESIGN OPTIMISATION
Speaker and Author:
Linan Cao, University of York, GB
Abstract
Modern VLSI designs' complexity and density have increased exponentially over the past 50 years, recently reaching a stage that allows heterogeneous many-core systems and numerous functions to be integrated into a tiny silicon die. These achievements are accomplished by pushing process technology to its physical limits. Transistor shrinking has succeeded through continuous improvements in the physical dimensions, switching frequency, and power efficiency of integrated circuits (ICs), allowing embedded electronic systems to be used in ever more real-world automated applications. However, as advanced semiconductor technologies come ever closer to the atomic scale, the transistor scaling challenge and stochastic performance variations intrinsic to fabrication emerge. Electronic design automation (EDA) tools handle the growing size and complexity of modern electronic designs by breaking systems down into smaller blocks or cells, introducing different levels of abstraction. In the field of digital very large scale integration (VLSI) design, comprehensive and mature industry-standard design flows are available to tape out chips. This complex process consists of several steps, including logic design, logic synthesis, physical implementation, and pre-silicon physical verification. However, in this staged, hierarchical design approach, where each step is optimised independently, overheads and inefficiency can accumulate in the resulting overall design. Designers and EDA vendors have to handle these challenges from process technology, design complexity, and growing scale, which may otherwise result in inferior design quality, even failures, and lower design yields under time-to-market pressure. Multiple or many design objectives and constraints emerge during the design process and often need to be dealt with simultaneously.
Multi-objective evolutionary algorithms (MOEAs) show flexible capabilities in maintaining multiple variable components and factors in uncertain environments. The VLSI design process involves a large number of available parameters, both from designs and from EDA tools. This provides many potential optimisation avenues where evolutionary algorithms can excel. This PhD work investigates the application of evolutionary techniques to digital VLSI design optimisation. Automated multi-objective optimisation frameworks, compatible with industrial design flows and foundry technologies, are proposed to improve solution performance, expand the feasible design space, and handle complex physical floorplan constraints by tuning designs at gate level. Methodologies for enriching standard cell libraries with respect to drive strength are also introduced to cooperate with the multi-objective optimisation frameworks, e.g., through subsequent hill climbing, providing a richer pool of solutions optimised for different trade-offs. The experiments in this thesis demonstrate that multi-objective evolutionary algorithms, derived from biological inspiration, can assist the digital VLSI design process, in an industrial design context, to search more efficiently for well-balanced trade-off solutions as well as optimised design space coverage. The expanded drive granularity of standard cells can push the performance of silicon technologies by offering improved solutions for critical objectives. The achieved optimisation results deliver better trade-off solutions with respect to power, performance and area (PPA) metrics than using standard EDA tools alone. This has been shown not only for a single circuit solution but across the entire standard-tool-produced design space.
18:30 CET 10.1.14 TINYDL: EFFICIENT DESIGN OF SCALABLE DEEP NEURAL NETWORKS FOR RESOURCE-CONSTRAINED EDGE DEVICES
Speaker and Author:
Mohammad Loni, Mälardalen University, SE
Abstract
The main aim of my Ph.D. thesis is to develop theoretical foundations and practical algorithms that (i) enable designing scalable, energy-efficient DNNs with a low energy footprint, (ii) facilitate fast deployment of complicated DL models on a diverse set of edge devices satisfying given hardware constraints, and (iii) improve the accuracy of network quantization methods on large-scale datasets. To address these research challenges, I developed (i) the ADONN, DeepMaker, NeuroPower, DenseDisp, and FastStereoNet frameworks, which provide hardware-friendly NAS methods with minimum design cost, and (ii) novel ternarization frameworks named TOT-Net and TAS that prevent the accuracy degradation of quantization techniques.
18:30 CET 10.1.15 DECISION DIAGRAMS IN QUANTUM DESIGN AUTOMATION
Speaker and Author:
Stefan Hillmich, Johannes Kepler University Linz, AT
Abstract
The impact quantum computing may achieve hinges on Computer-Aided Design (CAD) keeping up with the increasing power of physical realizations. The complexity of quantum computing has to be tackled with dedicated methods and data structures as well as a close cooperation between the CAD community and physicists. The main contribution of the thesis is to narrow the emerging design gap for quantum computing by bringing established methods of the CAD community to the quantum world. More precisely, the work focuses on the application of decision diagrams to the areas of quantum circuit simulation, estimation of observables in quantum chemistry, and technology mapping. The supporting paper is attached to the extended abstract.
18:30 CET 10.1.16 DEPENDABLE RECONFIGURABLE SCAN NETWORKS
Speaker:
Natalia Lylina, University of Stuttgart, DE
Authors:
Natalia Lylina and Hans-Joachim Wunderlich, University of Stuttgart, DE
Abstract
Dependability of modern devices is enhanced by integrating an extensive number of non-functional instruments. These are needed to facilitate cost-efficient bring-up, debug, test, diagnosis, and adaptivity in the field, and might include, e.g., sensors, aging monitors, and Logic and Memory Built-In Self-Test (BIST) registers. Reconfigurable Scan Networks (RSNs) provide a flexible way to access such instruments, as well as the device's registers, throughout the lifetime, from PSV through manufacturing test to in-field test. At the same time, the dependability properties of the device-under-test (DUT) can be compromised by improper RSN integration. This doctoral project overcomes these problems and establishes a methodology to integrate dependable RSNs for a given device, considering dependability aspects such as accessibility via RSNs, testability of RSNs, and security compliance of RSNs with the underlying device-under-test. The remainder of this extended abstract is structured as follows. First, background information about RSNs is provided, followed by the challenges of dependability-aware RSN integration. Next, the objectives and contributions of this work are summarized for specific dependability properties.
18:30 CET 10.1.17 BREAKING THE ENERGY CAGE OF INSECT-SCALE AUTONOMOUS DRONES: INTERPLAY OF PROBABILISTIC HARDWARE AND CO-DESIGNED ALGORITHMS
Speaker:
Priyesh Shukla, University of Illinois at Chicago, US
Authors:
Priyesh Shukla and Amit Trivedi, University of Illinois at Chicago, US
Abstract
Autonomy in insect-scale drones is challenged by highly constrained area and power budgets. Robustness amidst noisy sensory inputs and surroundings is also critical. To address this, we present two compute-in-memory (CIM) frameworks for insect-scale drone localization. Our first framework is a floating-gate (FG) inverter-array-based CIM for Bayesian particle filtering that efficiently evaluates the log-likelihood of the drone's pose, which otherwise demands a heavy computational workload under conventional digital processing. Our second method is Monte-Carlo dropout (MC-Dropout)-based deep neural network (DNN) inference in an all-digital 8T-SRAM (static random access memory) CIM. The CIM is equipped with additional MC-Dropout inference primitives to account for uncertainty in the drone's pose prediction. We discuss compute reuse and optimization strategies for MC-Dropout schedules to gain a significant reduction in this (approximate Bayesian) DNN workload. FG-CIM-based localization is 25x more energy efficient than conventional digital processing, and the SRAM-CIM for MC-Dropout inference consumes 28 pJ for 30 MC-Dropout inference iterations (3 TOPS/W).
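The MC-Dropout primitive described above can be sketched in plain Python (an illustrative toy model, not the authors' CIM hardware; the layer sizes and dropout rate are assumptions): dropout is kept active at inference time and N stochastic forward passes are aggregated into a mean prediction plus a variance-based uncertainty estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny regressor: 6-D sensor features -> 3-D pose estimate.
W1, b1 = rng.standard_normal((16, 6)), np.zeros(16)
W2, b2 = rng.standard_normal((3, 16)), np.zeros(3)

def forward(x, p_drop=0.2):
    h = np.maximum(W1 @ x + b1, 0.0)       # ReLU hidden layer
    mask = rng.random(h.shape) >= p_drop   # dropout stays ON at inference
    h = h * mask / (1.0 - p_drop)          # inverted-dropout scaling
    return W2 @ h + b2

def mc_dropout_predict(x, n_iter=30):
    # N stochastic passes -> mean prediction and per-output variance
    samples = np.stack([forward(x) for _ in range(n_iter)])
    return samples.mean(axis=0), samples.var(axis=0)

mean, var = mc_dropout_predict(rng.standard_normal(6))
```

The variance is what the paper's additional inference primitives expose in hardware: a cheap, per-output measure of how uncertain the pose prediction is.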
18:30 CET 10.1.18 RESILIENT: PROTECTING DESIGN IP FROM MALICIOUS ENTITIES
Speaker:
Nimisha Limaye, New York University, US
Authors:
Nimisha Limaye1 and Ozgur Sinanoglu2
1New York University, US; 2New York University Abu Dhabi, AE
Abstract
The globalization of the integrated circuit (IC) supply chain has opened up avenues for untrusted entities with the malicious intent of intellectual property (IP) piracy and overproduction of ICs. These malicious entities include the foundry, the test facility, and the end user. An untrusted foundry can readily obtain the unprotected design IP from the design house, and a test facility or an end user can reverse-engineer the chip using widely available tools and extract the underlying design IP to pirate or overproduce the ICs. We first perform an exhaustive security analysis of the state-of-the-art logic locking techniques and propose various attacks. Further, we propose countermeasures to thwart attacks from all the malicious entities in the supply chain. Through our solutions, we allow security-enforcing designers to protect their design IP at various abstraction levels. Our solution can protect not just digital designs but also mixed-signal designs.
18:30 CET 10.1.19 ALGORITHM-ARCHITECTURE CO-DESIGN FOR ENERGY-EFFICIENT, ROBUST, AND PRIVACY-PRESERVING MACHINE LEARNING
Speaker and Author:
Souvik Kundu, USC, US
Abstract
My Ph.D. research covers three major aspects of algorithm-architecture co-design for machine learning accelerators: (1) energy efficiency via novel training-efficient pruning, quantization, and distillation; (2) robust model training for safety-critical edge applications; and (3) analysis of model and data privacy of the associated IPs.
18:30 CET 10.1.20 PERFORMANCE-AWARE DESIGN-SPACE OPTIMIZATION AND ATTACK MITIGATION FOR EMERGING HETEROGENEOUS ARCHITECTURES
Speaker and Author:
Mitali Sinha, IIIT Delhi, IN
Abstract
The growing system sizes and time-to-market pressure of heterogeneous SoCs compel chip designers to analyze only part of the design space, leading to suboptimal Intellectual Property (IP) designs. Hence, different processing cores, such as accelerators, are generally designed as standalone IP blocks by third-party vendors, and chip designers often over-provision the amount of on-chip resources to add flexibility to each IP design. Although this modularity simplifies IP design, integrating these off-the-shelf IP blocks into a single SoC may overshoot the resource budget of the underlying system. Furthermore, the integration of third-party IPs alongside other on-chip modules makes the system vulnerable to security threats. This work addresses the challenges involved in designing efficient heterogeneous SoCs by optimizing the utilization of on-chip resources and mitigating performance-based security threats.
18:30 CET 10.1.21 PRACTICAL SIDE-CHANNEL AND FAULT ATTACKS ON LATTICE-BASED CRYPTOGRAPHY
Speaker:
Prasanna Ravi, Nanyang Technological University, SG
Authors:
Prasanna Ravi1, Anupam Chattopadhyay1 and Shivam Bhasin2
1Nanyang Technological University, SG; 2Temasek Laboratories, Nanyang Technological University, SG
Abstract
The possibility of large-scale quantum computers in the future has been an ever-growing threat to the existing public-key infrastructure, which is predominantly based on classical RSA- and ECC-based public-key cryptography. This prompted NIST to initiate a global standardization process for alternative quantum-attack-resistant Public Key Encryption (PKE), Key Encapsulation Mechanisms (KEM) and Digital Signatures (DSS), better known as Post-Quantum Cryptography (PQC). The PQC standardization process started in 2017 with 69 submissions and is currently in its third and final round with seven (7) main finalist candidates and eight (8) alternate finalist candidates. Among these fifteen (15) finalist candidates, seven (7) belong to a single category, referred to as lattice-based cryptography. These schemes are based on hard geometric problems that are conjectured to be computationally intractable for quantum computers. NIST laid out several evaluation criteria for the standardization process, including theoretical Post-Quantum (PQ) security guarantees, implementation cost and performance. Alongside these, resistance against physical attacks such as Side-Channel Analysis (SCA) and Fault Injection Analysis (FIA) has also emerged as an important criterion for the standardization process. This is especially relevant for the adoption of PQC in embedded devices, which will be used in environments where an attacker can have unimpeded physical access to the target device. We therefore focus on evaluating the security of practical implementations of lattice-based schemes against SCA and FIA. We have identified novel SCA and FIA vulnerabilities which led to practical attacks on implementations of several lattice-based schemes. Most of our attacks exploit vulnerabilities inherent in the algorithms of lattice-based schemes, which makes our attacks adaptable to different implementation platforms (hardware and software).
18:30 CET 10.1.22 MEMORY INTERFERENCE AND MITIGATIONS IN RECONFIGURABLE HESOCS FOR EMBEDDED AI
Speaker:
Gianluca Brilli, University of Modena and Reggio Emilia, IT
Authors:
Gianluca Brilli, Alessandro Capotondi, Paolo Burgio, Andrea Marongiu and Marko Bertogna, University of Modena and Reggio Emilia, IT
Abstract
Recent advances in high-performance embedded systems have paved the way for next-generation applications that were impractical a few decades ago, such as Deep Neural Networks (DNNs). DNNs are widely adopted in several embedded domains, in particular in so-called Cyber-Physical Systems (CPS). Examples of CPS are autonomous robots, which typically integrate one or more neural networks into their navigation systems for perception and localization tasks. To match this need, manufacturers of high-performance embedded chips are increasingly adopting heterogeneous designs (HeSoCs), in which sequential processors are coupled with massively parallel accelerators that perform ML tasks in an energy-efficient manner. These systems typically follow a Commercial-Off-The-Shelf (COTS) organization, in which the memory hierarchy, composed of multiple cache layers and a main memory (DRAM), is shared among the computational engines of the system. On the one hand, this scheme shortens time-to-market, improves the scalability of the system, and generally provides good average-case performance. However, it is not always adequate for applications where, by construction, the system must guarantee bounded performance even in the worst case. A shared memory organization creates contention on shared resources [1]–[3], where the execution time of a task also depends on the number of other tasks accessing a given shared resource in the same time interval. The main aspects addressed in this work are: (i) a characterization of state-of-the-art embedded neural network engines, to study the typical workload of a DNN and the impact it can have on the system; (ii) a deep memory interference characterization on HeSoCs, with particular reference to FPGA-based ones; (iii) architectural solutions to mitigate memory interference and improve the low memory-bandwidth utilization of PREM-like schemes.

IP.1_1 Interactive presentations

Date: Thursday, 17 March 2022
Time: 11:30 - 12:15 CET

Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.

Label Presentation Title
Authors
IP.1_1.1 (Best Paper Award Candidate)
A SOFTWARE ARCHITECTURE TO CONTROL SERVICE-ORIENTED MANUFACTURING SYSTEMS
Speaker:
Sebastiano Gaiardelli, Università di Verona, IT
Authors:
Sebastiano Gaiardelli1, Stefano Spellini1, Marco Panato2, Michele Lora3 and Franco Fummi1
1Università di Verona, IT; 2Università di Verona, IT; 3University of Southern California, US
Abstract
This paper presents a software architecture extending the classical automation pyramid to control and reconfigure flexible, service-oriented manufacturing systems. At the Planning level, the architecture requires a Manufacturing Execution System (MES) consistent with the International Society of Automation (ISA) standard. The Supervisory level is then automated by introducing a novel component, called the Automation Manager. The new component interacts upward with the MES and downward with a set of servers providing access to the manufacturing machines. The communication with machines relies on the OPC Unified Architecture (OPC UA) standard protocol, which allows production tasks to be exposed as “services”. The proposed software architecture has been prototyped to control a real production line, originally controlled by a commercial MES that was unable to fully exploit the flexibility of the case-study manufacturing system; the proposed architecture fully exploits the production line’s flexibility.
IP.1_1.2 (Best Paper Award Candidate)
COMPREHENSIVE AND ACCESSIBLE CHANNEL ROUTING FOR MICROFLUIDIC DEVICES
Speaker:
Philipp Ebner, Johannes Kepler University, AT
Authors:
Gerold Fink, Philipp Ebner and Robert Wille, Johannes Kepler University Linz, AT
Abstract
Microfluidics is an emerging field that allows processes usually conducted with unwieldy laboratory equipment to be miniaturized, integrated, and automated inside a single device, resulting in so-called "Labs-on-a-Chip" (LoCs). The design process of channel-based LoCs is still mainly conducted manually, resulting in time-consuming tasks and error-prone designs. This also holds for the routing process, where multiple components inside an LoC must be connected according to a specification. In this work, we present a routing tool that considers the particular requirements of microfluidic applications and automates the routing process. To make the tool more accessible (even to users with little to no EDA expertise), it is incorporated into a user-friendly and intuitive online interface.
IP.1_1.3 (Best Paper Award Candidate)
XST: A CROSSBAR COLUMN-WISE SPARSE TRAINING FOR EFFICIENT CONTINUAL LEARNING
Speaker:
Fan Zhang, Arizona State University, US
Authors:
Fan Zhang, Li Yang, Jian Meng, Jae-sun Seo, Yu Cao and Deliang Fan, Arizona State University, US
Abstract
Leveraging ReRAM crossbar-based In-Memory Computing (IMC) to accelerate single-task DNN inference has been widely studied. However, using the ReRAM crossbar for continual learning has not yet been explored. In this work, we propose XST, a novel crossbar column-wise sparse training framework for continual learning. XST significantly reduces the training cost and saves inference energy. More importantly, it is friendly to existing crossbar-based convolution engines, with almost no hardware overhead. Compared with the state-of-the-art CPG method, experiments show that XST achieves 4.95% higher accuracy. Furthermore, XST demonstrates an ~5.59X training speedup and 1.5X inference energy saving.
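The core idea of column-wise sparsity can be illustrated with a short sketch (assumptions: magnitude-based column scoring and a 64x32 weight matrix; XST's actual training procedure differs): whole crossbar columns are pruned together, so the sparsity pattern maps directly onto the column drivers of a ReRAM array.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 32))   # weights mapped onto a 64x32 crossbar

def column_sparse_mask(weights, keep_ratio=0.5):
    # One score per crossbar column; highest-norm columns survive.
    norms = np.linalg.norm(weights, axis=0)
    k = int(weights.shape[1] * keep_ratio)
    keep = np.argsort(norms)[-k:]
    mask = np.zeros(weights.shape[1], dtype=bool)
    mask[keep] = True
    return mask

mask = column_sparse_mask(W)
W_sparse = W * mask                 # broadcasting zeroes whole columns
```

Because entire columns are zero, the pruned model needs no per-element index storage in hardware, which is why this pattern adds almost no overhead to a crossbar convolution engine.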

IP.1_2 Interactive presentations

Date: Thursday, 17 March 2022
Time: 11:30 - 12:15 CET

Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.

Label Presentation Title
Authors
IP.1_2.1 (Best Paper Award Candidate)
ENERGY-EFFICIENT BRAIN-INSPIRED HYPERDIMENSIONAL COMPUTING USING VOLTAGE SCALING
Speaker:
Xun Jiao, Villanova University, US
Authors:
Sizhe Zhang1, Ruixuan Wang1, Dongning Ma1, Jeff Zhang2, Xunzhao Yin3 and Xun Jiao1
1Villanova University, US; 2Harvard University, US; 3Zhejiang University, CN
Abstract
Brain-inspired hyperdimensional computing (HDC) is an emerging computational paradigm that mimics brain cognition and leverages hyperdimensional vectors with a fully distributed holographic representation and (pseudo) randomness. Recently, HDC has demonstrated promising capability in a wide range of applications such as medical diagnosis, human activity recognition, and voice classification. Despite the growing popularity of HDC, its memory-centric computing characteristics make the associative memory a major source of energy consumption, due to the massive data storage and processing. While voltage scaling has been studied intensively to reduce memory energy dissipation, it can introduce errors that degrade the output quality. In this paper, we systematically study and leverage the application-level error resilience of HDC to reduce the energy consumption of HDC associative memory by using voltage scaling. Evaluation results on various applications show that our proposed approach can achieve 47.6% energy saving on the associative memory with negligible accuracy loss (<1%). We further explore two low-cost error masking methods, word masking and bit masking, to mitigate the impact of voltage-scaling-induced errors. Experimental results show that the proposed word masking (bit masking) method can further enhance energy saving up to 62.3% (72.5%) with accuracy loss <1%.
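A minimal sketch of the error resilience being exploited (assumed setup: bipolar hypervectors and a dot-product associative memory; the 10% flip rate is illustrative, not taken from the paper): even with a sizeable fraction of a query hypervector's elements corrupted by voltage-scaling errors, the correct class still wins the similarity search, because the information is spread holographically across thousands of dimensions.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 4096                                    # hypervector dimensionality
classes = rng.choice([-1, 1], size=(4, D))  # associative memory: 4 class HVs

def classify(query):
    # Nearest class by dot-product similarity
    return int(np.argmax(classes @ query))

query = classes[2].copy()
flip = rng.random(D) < 0.10                 # model voltage-scaling bit errors
query[flip] *= -1                           # as random sign flips

pred = classify(query)                      # still recovers class 2
```

The expected similarity to the true class drops from D to about 0.8*D, while similarities to the other random classes stay near zero, so classification survives the errors.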
IP.1_2.2 ERROR GENERATION FOR 3D NAND FLASH MEMORY
Speaker:
Weihua Liu, Huazhong University of Science and Technology, CN
Authors:
Weihua Liu, Fei Wu, Songmiao Meng, Xiang Chen and Changsheng Xie, Huazhong University of Science and Technology, CN
Abstract
Three-dimensional (3D) NAND flash memory is the preferred storage component of solid-state drives (SSDs) for its high capacity-to-cost ratio. Optimizing the reliability of modern SSDs requires testing and collecting a large amount of real-world error data from 3D NAND flash memory. However, test costs have surged dozens of times as capacity increases, so it is imperative to reduce the cost of testing denser, higher-capacity flash memory. To facilitate this, in this paper we aim to enable reproducing error data efficiently for 3D NAND flash memory. We use a conditional generative adversarial network (cGAN) to learn the error distribution under multiple interferences and to generate diverse error data comparable to real-world data. Evaluation results demonstrate that error generation with a cGAN is feasible and efficient.
IP.1_2.3 ESTIMATING VULNERABILITY OF ALL MODEL PARAMETERS IN DNN WITH A SMALL NUMBER OF FAULT INJECTIONS
Speaker:
Yangchao Zhang, Osaka University, JP
Authors:
Yangchao Zhang1, Hiroaki Itsuji2, Takumi Uezono2, Tadanobu Toba2 and Masanori Hashimoto3
1Osaka University, JP; 2Hitachi Ltd., JP; 3Kyoto University, JP
Abstract
The reliability of deep neural networks (DNNs) against hardware errors is essential as DNNs are increasingly employed in safety-critical applications such as automated driving. Transient errors in memory, such as radiation-induced soft errors, may propagate through the inference computation, resulting in unexpected outputs that can trigger catastrophic system failures. As a first step towards tackling this problem, this paper proposes constructing a vulnerability model (VM) with a small number of fault injections to identify vulnerable model parameters in a DNN. We significantly reduce the number of bit locations for fault injection and develop a flow to incrementally collect the training data, i.e., the fault injection results, to improve VM accuracy. Experimental results show that the VM can estimate the vulnerabilities of all DNN model parameters with only 1/3490 of the computations required by traditional fault-injection-based vulnerability estimation.
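The basic operation behind such vulnerability estimation, a single-bit flip in a float32 parameter, can be sketched as follows (illustrative only; the paper's contribution is in selecting which bits to inject): the parameter's IEEE-754 bit pattern is reinterpreted as an integer and one bit position is XOR-flipped.

```python
import numpy as np

def flip_bit(value, bit):
    # Reinterpret the float32 bits as uint32, flip one bit in place,
    # and read the value back as a float.
    arr = np.array(value, dtype=np.float32)
    bits = arr.view(np.uint32)
    bits ^= np.uint32(1 << bit)
    return float(arr)

original = np.float32(0.5)
faulty_msb = flip_bit(original, 30)   # high exponent bit: catastrophic change
faulty_lsb = flip_bit(original, 0)    # mantissa LSB: nearly harmless
```

The asymmetry between the two results is exactly why vulnerability is highly bit-position dependent, and why a model trained on a reduced set of bit locations can generalize to all parameters.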

IP.1_3 Interactive presentations

Date: Thursday, 17 March 2022
Time: 11:30 - 12:15 CET

Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.

Label Presentation Title
Authors
IP.1_3.1 EXPLOITING ARBITRARY PATHS FOR THE SIMULATION OF QUANTUM CIRCUITS WITH DECISION DIAGRAMS
Speaker:
Lukas Burgholzer, Johannes Kepler University Linz, Austria, AT
Authors:
Lukas Burgholzer, Alexander Ploier and Robert Wille, Johannes Kepler University Linz, AT
Abstract
The classical simulation of quantum circuits is essential in the development and testing of quantum algorithms. Methods based on tensor networks or decision diagrams have proven to alleviate the inevitable exponential growth of the underlying complexity in many cases. But the complexity of these methods is very sensitive to so-called contraction plans or simulation paths, respectively, which define the order in which the respective operations are applied. While a plethora of strategies has been developed for tensor networks, simulation based on decision diagrams has mostly been conducted in a straightforward fashion thus far. In this work, we envision a flow that translates strategies from the domain of tensor networks to decision diagrams. Preliminary results indicate that a substantial advantage may be gained by employing suitable simulation paths, motivating a thorough investigation.
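The sensitivity to operation order is easy to see with a plain matrix-chain analogy (a toy illustration of the general point, not the paper's decision-diagram algorithm): multiplying the same chain of matrices in two different orders changes the cost by a large factor.

```python
def matmul_cost(p, q, r):
    # Multiplying a (p x q) matrix by a (q x r) matrix costs ~ p*q*r
    # scalar multiplications.
    return p * q * r

# Chain: A (1x100) . B (100x100) . C (100x100)
left_first  = matmul_cost(1, 100, 100) + matmul_cost(1, 100, 100)    # (AB)C
right_first = matmul_cost(100, 100, 100) + matmul_cost(1, 100, 100)  # A(BC)
```

Here (AB)C costs 20,000 multiplications while A(BC) costs 1,010,000, a 50x gap from the ordering alone; contraction plans for tensor networks and simulation paths for decision diagrams exploit exactly this kind of freedom.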
IP.1_3.2 A NOVEL NEUROMORPHIC PROCESSORS REALIZATION OF SPIKING DEEP REINFORCEMENT LEARNING FOR PORTFOLIO MANAGEMENT
Speaker:
Seyyed Amirhossein Saeidi, Amirkabir University of Technology (Tehran Polytechnic), IR
Authors:
Seyyed Amirhossein Saeidi, Forouzan Fallah, Soroush Barmaki and Hamed Farbeh, Amirkabir University of Technology, IR
Abstract
The process of constantly reallocating budget across financial assets, aiming to increase the anticipated return while minimizing risk, is known as portfolio management. The processing speed and energy consumption of portfolio management have become crucial as its real-world applications increasingly involve high-dimensional observation and action spaces and environment uncertainty that limited onboard resources cannot offset. Emerging neuromorphic chips inspired by the human brain increase processing speed by up to 500 times and reduce power consumption by several orders of magnitude. This paper proposes a spiking deep reinforcement learning (SDRL) algorithm that can predict financial markets in unpredictable environments and achieve the defined portfolio management goals of profitability and risk reduction. The algorithm is optimized for Intel's Loihi neuromorphic processor and provides 186x and 516x reductions in energy consumption compared to a high-end processor and a GPU, respectively. In addition, a 1.3x and 2.0x speed-up is observed over the high-end processor and GPU, respectively. The evaluations are performed on a cryptocurrency market benchmark between 2016 and 2021.
IP.1_3.3 IN-SITU TUNING OF PRINTED NEURAL NETWORKS FOR VARIATION TOLERANCE
Speaker:
Mehdi Tahoori, Karlsruhe Institute of Technology, DE
Authors:
Michael Hefenbrock, Dennis Weller, Jasmin Aghassi, Michael Beigl and Mehdi Tahoori, Karlsruhe Institute of Technology, DE
Abstract
Printed electronics (PE) can meet the demands of many application domains regarding cost, conformity, and non-toxicity, which silicon-based computing systems cannot achieve. A typical computational task in many such applications is classification, and printed Neural Networks (pNNs) have been proposed to meet these requirements. However, PE suffers from high process variations due to the low-resolution printing of low-cost additive manufacturing, which can severely impact the inference accuracy of pNNs. In this work, we show how a unique feature of PE, namely additive printing, can be leveraged to perform in-situ tuning of pNNs to compensate for accuracy losses induced by device variations. The experiments show that, even under 30% variation of the conductances, up to 90% of the initial accuracy can be recovered.

IP.1_4 Interactive presentations

Date: Thursday, 17 March 2022
Time: 11:30 - 12:15 CET

Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.

Label Presentation Title
Authors
IP.1_4.1 PRACTICAL IDENTITY RECOGNITION USING WIFI'S CHANNEL STATE INFORMATION
Speaker:
Cristian Turetta, University of Verona, IT
Authors:
Cristian Turetta1, Florenc Demrozi1, Philipp H. Kindt2, Alejandro Masrur3 and Graziano Pravadelli1
1Università di Verona, IT; 2TU Munich, DE; 3TU Chemnitz, DE
Abstract
Identity recognition is increasingly used to control access to sensitive data and restricted areas in industrial, healthcare, and defense settings, as well as in consumer electronics. To this end, existing approaches are typically based on collecting and analyzing biometric data, which raises severe privacy concerns. Particularly when cameras are involved, users might even reject or dismiss an identity recognition system. Furthermore, iris or fingerprint scanners, cameras, microphones, etc., imply installation and maintenance costs and require the user's active participation in the recognition procedure. This paper proposes a non-intrusive identity recognition system based on analyzing WiFi's Channel State Information (CSI). We show that CSI data attenuated by a person's body and typical movements allows for reliable identification -- even in a sitting posture. We further propose a lightweight deep learning algorithm trained using CSI data, which we implemented and evaluated on an embedded platform (i.e., a Raspberry Pi 4B). Our results, obtained in real-world experiments, suggest a high accuracy in recognizing people's identity, with a specificity of 98% and a sensitivity of 99%, while requiring a low training effort and negligible cost.
IP.1_4.2 A RDMA INTERFACE FOR ULTRA-FAST ULTRASOUND DATA-STREAMING OVER AN OPTICAL LINK
Speaker:
Andrea Cossettini, ETH Zurich, CH
Authors:
Andrea Cossettini, Konstantin Taranov, Christian Vogt, Michele Magno, Torsten Hoefler and Luca Benini, ETH Zürich, CH
Abstract
Digital ultrasound (US) probes integrate the analog-to-digital conversion directly on the probe and can be conveniently connected to commodity devices. Existing digital probes, however, are limited to a relatively small number of channels, do not guarantee access to the raw US data, or cannot operate at very high frame rates (e.g., due to exhaustion of computing and storage resources on the receiving device). In this work, we present an open, compact, power-efficient, 192-channel digital US data acquisition system capable of streaming US data at transfer rates greater than 80 Gbps towards a host PC for ultra-high-frame-rate imaging (in the multi-kHz range). Our US probe is equipped with two power-efficient Field Programmable Gate Arrays (FPGAs) and is interfaced to the host PC with two optical-link 100G Ethernet connections. The high-speed performance is enabled by implementing a Remote Direct Memory Access (RDMA) communication protocol between the probe and the controlling PC, which utilizes a high-performance Non-Volatile Memory Express (NVMe) interface to store the streamed data. To the best of our knowledge, thanks to the achieved data rates, this is the first high-channel-count compact digital US platform capable of raw data streaming at frame rates of 20 kHz (for imaging at 3.5 cm depth), without the need for sparse sampling, while consuming less than 40 W.
IP.1_4.3 ROBUST HUMAN ACTIVITY RECOGNITION USING GENERATIVE ADVERSARIAL IMPUTATION NETWORKS
Speaker:
Dina Hussein, Washington State University, US
Authors:
Dina Hussein1, Aaryan Jain2 and Ganapati Bhat1
1Washington State University, US; 2Nikola Tesla STEM High School, US
Abstract
Human activity recognition (HAR) is widely used in applications ranging from activity tracking to rehabilitation of patients. HAR classifiers are typically trained with data collected from a known set of users while assuming that all the sensors needed for activity recognition are working perfectly and there are no missing samples. However, real-world usage of the HAR classifier may encounter missing data samples due to user error, device error, or battery limitations. The missing samples, in turn, lead to a significant reduction in accuracy. To address this limitation, we propose an adaptive method that either uses low-power mean imputation or generative adversarial imputation networks (GAIN) to recover the missing data samples before classifying the activities. Experiments on a public HAR dataset with 22 users show that the proposed robust HAR classifier achieves 94% classification accuracy with as much as 20% missing samples from the sensors with 390 µJ energy consumption per imputation.
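The low-power fallback path described above can be sketched as follows (a simplified stand-in: `column_mean_impute` is a hypothetical helper, and the GAIN branch would replace it when the energy budget allows). Missing sensor samples are encoded as NaN and filled with the per-channel mean of the observed samples in the window.

```python
import numpy as np

def column_mean_impute(window):
    # window: (samples x sensor channels), missing entries are NaN.
    filled = window.copy()
    means = np.nanmean(filled, axis=0)     # mean of observed samples per channel
    idx = np.where(np.isnan(filled))
    filled[idx] = np.take(means, idx[1])   # fill each gap with its channel mean
    return filled

window = np.array([[1.0, 10.0],
                   [np.nan, 12.0],
                   [3.0, np.nan]])
clean = column_mean_impute(window)
```

Mean imputation costs only a handful of additions per channel, which is why it serves as the low-energy option, while GAIN recovers more structure at higher cost.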

IP.1_5 Interactive presentations

Date: Thursday, 17 March 2022
Time: 11:30 - 12:15 CET

Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.

Label Presentation Title
Authors
IP.1_5.1 HYPERX: A HYBRID RRAM-SRAM PARTITIONED SYSTEM FOR ERROR RECOVERY IN MEMRISTIVE XBARS
Speaker:
Adarsh Kosta, Purdue University, US
Authors:
Adarsh Kosta, Efstathia Soufleri, Indranil Chakraborty, Amogh Agrawal, Aayush Ankit and Kaushik Roy, Purdue University, US
Abstract
Memristive crossbars based on Non-Volatile Memory (NVM) technologies, such as RRAM, have recently shown great promise for accelerating Deep Neural Networks (DNNs). They achieve this by performing efficient Matrix-Vector Multiplications (MVMs) while offering dense on-chip storage and minimal off-chip data movement. However, their analog computing nature introduces functional errors due to non-ideal RRAM devices, significantly degrading application accuracy. Further, RRAMs suffer from low endurance and high write costs, hindering on-chip trainability. To alleviate these limitations, we propose HyperX, a hybrid RRAM-SRAM system that leverages the complementary benefits of NVM and CMOS technologies. Our proposed system consists of a fixed RRAM block offering area- and energy-efficient MVMs and an SRAM block enabling on-chip training to recover the accuracy drop due to RRAM non-idealities. The improvements are reported in terms of energy and the product of latency and area (ms x mm^2), termed area-normalized latency. Our experiments on CIFAR datasets using ResNet-20 show up to 2.88x and 10.1x improvements in inference energy and area-normalized latency, respectively. In addition, for a transfer learning task from ImageNet to CIFAR datasets using ResNet-18, we observe up to 1.58x and 4.48x improvements in energy and area-normalized latency, respectively. These improvements are with respect to an all-SRAM baseline.
IP.1_5.2 A RESOURCE-EFFICIENT SPIKING NEURAL NETWORK ACCELERATOR SUPPORTING EMERGING NEURAL ENCODING
Speaker:
Daniel Gerlinghoff, Agency for Science, Technology and Research, SG
Authors:
Daniel Gerlinghoff1, Zhehui Wang1, Xiaozhe Gu2, Rick Siow Mong Goh1 and Tao Luo1
1Agency for Science, Technology and Research, SG; 2Chinese University of Hong Kong, Shenzhen, CN
Abstract
Spiking neural networks (SNNs) have recently gained momentum due to their low-power, multiplication-free computing and their closer resemblance to the biological processes in the human nervous system. However, SNNs require very long spike trains (up to 1000) to reach accuracy similar to their artificial neural network (ANN) counterparts for large models, which offsets the efficiency gains and inhibits their application to low-power systems in real-world use cases. To alleviate this problem, emerging neural encoding schemes have been proposed to shorten the spike train while maintaining high accuracy. However, current SNN accelerators cannot adequately support these emerging encoding schemes. In this work, we present a novel hardware architecture that efficiently supports SNNs with emerging neural encoding. Our implementation features energy- and area-efficient processing units with increased parallelism and reduced memory accesses. We verified the accelerator on an FPGA and achieved 25% and 90% improvements over previous work in power consumption and latency, respectively. At the same time, high area efficiency allows us to scale to large neural network models. To the best of our knowledge, this is the first work to deploy a large neural network model (VGG) on physical FPGA-based neuromorphic hardware.
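Why long spike trains are needed under conventional encoding can be sketched with simple rate coding (an assumed baseline scheme, not the paper's emerging encoding): an activation value in [0, 1] is conveyed as the firing rate of a Bernoulli spike train, so the decoded estimate only improves with the number of time steps.

```python
import numpy as np

rng = np.random.default_rng(3)

def rate_encode_decode(activation, timesteps):
    # Emit one spike per time step with probability `activation`,
    # then decode the activation as the observed firing rate.
    spikes = rng.random(timesteps) < activation
    return spikes.mean()

# Decoding error for a short vs. a long spike train
err_short = abs(rate_encode_decode(0.3, 16) - 0.3)
err_long = abs(rate_encode_decode(0.3, 1000) - 0.3)
```

The standard deviation of the decoded value shrinks as 1/sqrt(T), which is why naive rate coding needs hundreds of time steps and why shorter emerging encodings, and hardware that supports them, matter for low-power deployment.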
IP.1_5.3 SCALABLE HARDWARE ACCELERATION OF NON-MAXIMUM SUPPRESSION
Speaker:
Chunyun Chen, Nanyang Technological University, SG
Authors:
Chunyun Chen1, Tianyi Zhang2, Zehui Yu1, Adithi Raghuraman1, Shwetalaxmi Udayan1, Jie Lin2 and Mohamed Aly1
1Nanyang Technological University, SG; 2Institute for Infocomm Research, ASTAR, SG
Abstract
Non-maximum Suppression (NMS) in one- and two-stage object detection deep neural networks (e.g., SSD and Faster-RCNN) is becoming the computation bottleneck. In this paper, we introduce a hardware accelerator for the scalable PSRR-MaxpoolNMS algorithm. Our architecture shows 75.0× and 305× speedups compared to the software implementation of PSRR-MaxpoolNMS and the hardware implementation of GreedyNMS, respectively, while simultaneously achieving Mean Average Precision (mAP) comparable to software-based floating-point implementations. Our architecture is 13.4× faster than the state-of-the-art NMS accelerator. Our accelerator supports both one- and two-stage detectors, while supporting very high input resolutions (i.e., FHD), an essential input size for better detection accuracy.
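For reference, the greedy NMS baseline that such accelerators speed up can be sketched in a few lines (the standard algorithm; boxes are `[x1, y1, x2, y2]`): the highest-scoring box is kept and every remaining box that overlaps it beyond an IoU threshold is suppressed, then the process repeats.

```python
import numpy as np

def iou(a, b):
    # Intersection-over-union of two axis-aligned boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def greedy_nms(boxes, scores, thresh=0.5):
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while order.size:
        best, order = order[0], order[1:]
        keep.append(int(best))
        order = np.array([i for i in order
                          if iou(boxes[best], boxes[i]) <= thresh])
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], float)
keep = greedy_nms(boxes, np.array([0.9, 0.8, 0.7]))
```

The sequential keep-then-suppress loop is what makes greedy NMS hard to parallelize, and it is precisely this data dependence that maxpool-based reformulations such as PSRR-MaxpoolNMS remove.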

IP.1_6 Interactive presentations

Date: Thursday, 17 March 2022
Time: 11:30 - 12:15 CET

Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.

Label Presentation Title
Authors
IP.1_6.1 ACTIVE LEARNING OF ABSTRACT SYSTEM MODELS FROM TRACES USING MODEL CHECKING
Speaker:
Natasha Yogananda Jeppu, University of Oxford, GB
Authors:
Natasha Yogananda Jeppu1, Tom Melham1 and Daniel Kroening2
1University of Oxford, GB; 2Amazon, Inc, GB
Abstract
We present a new active model-learning approach to generating abstractions of a system implementation, as finite state automata (FSAs), from execution traces. Given an implementation and a set of observable system variables, the generated automata admit all system behaviours over the given variables and provide useful insight in the form of invariants that hold on the implementation. To achieve this, the proposed approach uses a pluggable model learning component that can generate an FSA from a given set of traces. Conditions that encode a completeness hypothesis are then extracted from the FSA under construction and used to evaluate its degree of completeness by checking their truth value against the system using software model checking. This generates new traces that express any missing behaviours. The new trace data is used to iteratively refine the abstraction, until all system behaviours are admitted by the learned abstraction. To evaluate the approach, we reverse-engineer a set of publicly available Simulink Stateflow models from their C implementations.
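The refine-until-complete loop described in the abstract can be sketched in miniature. This is a heavily simplified illustration, not the paper's tool: integer states stand in for valuations of the observable variables, and a direct query to the implementation's step function stands in for the software model checker that searches for missing behaviours.

```python
def learn_fsa(traces):
    """Pluggable learner stand-in: the FSA is the set of state
    transitions observed consecutively in the traces."""
    transitions = set()
    for trace in traces:
        for src, dst in zip(trace, trace[1:]):
            transitions.add((src, dst))
    return transitions

def find_missing_behaviour(system_step, states, fsa):
    """Oracle stand-in for model checking: ask the implementation
    whether it can take a step the learned FSA does not admit."""
    for s in states:
        t = system_step(s)
        if (s, t) not in fsa:
            return [s, t]  # counterexample trace
    return None

def active_learn(system_step, initial_traces, states):
    """Iteratively refine the abstraction until no behaviour is missing."""
    traces = list(initial_traces)
    while True:
        fsa = learn_fsa(traces)
        cex = find_missing_behaviour(system_step, states, fsa)
        if cex is None:
            return fsa
        traces.append(cex)  # new trace data refines the abstraction
```

For a toy system that cycles through three states, starting from a single trace [0, 1], two refinement rounds discover the missing transitions and the loop terminates with the complete automaton.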
IP.1_6.2 REDUCING THE CONFIGURATION OVERHEAD OF THE DISTRIBUTED TWO-LEVEL CONTROL SYSTEM
Speaker:
Yu Yang, KTH Royal Institute of Technology, SE
Authors:
Yu Yang, Dimitrios Stathis and Ahmed Hemani, KTH Royal Institute of Technology, SE
Abstract
With the growing demand for more efficient hardware accelerators for streaming applications, a novel Coarse-Grained Reconfigurable Architecture (CGRA) that uses a Distributed Two-Level Control (D2LC) system has been proposed in the literature. Even though the highly distributed and parallel structure makes it fast and energy-efficient, the single-issue instruction channel between the level-1 and level-2 controllers in each D2LC cell becomes the performance bottleneck. In this paper, we improve the design to mimic a multi-issue architecture by inserting shadow instruction buffers between the level-1 and level-2 controllers. Together with a zero-overhead hardware loop, the improved D2LC architecture enables efficient overlap between loop iterations. We also propose a complete constraint-programming-based instruction scheduling algorithm to support these hardware features. Experimental results show that the improved D2LC architecture achieves up to a 25% reduction in instruction execution cycles and a 35% reduction in energy-delay product.
IP.1_6.3 BATCHLENS: A VISUALIZATION APPROACH FOR ANALYZING BATCH JOBS IN CLOUD SYSTEMS
Speaker:
Qiang Guan, Kent State University, US
Authors:
Shaolun Ruan1, Yong Wang1, Hailong Jiang2, Weijia Xu3 and Qiang Guan2
1Singapore Management University, SG; 2Kent State University, US; 3TACC, US
Abstract
Cloud systems are becoming increasingly powerful and complex. It is highly challenging to identify anomalous execution behaviors and pinpoint problems by examining the overwhelming intermediate results/states in complex application workflows. Domain scientists urgently need a friendly and functional interface to understand the quality of the computing services and the performance of their applications in real time. To meet these needs, we explore data generated by job schedulers and investigate general performance metrics (e.g., utilization of CPU, memory, and disk I/O). Specifically, we propose an interactive visual analytics approach, BatchLens, to provide both providers and users of cloud services with an intuitive and effective way to explore the status of system batch jobs and to help them conduct root-cause analysis of anomalous behaviors in batch jobs. We demonstrate the effectiveness of BatchLens through a case study on the public Alibaba batch workload trace datasets.

IP.1_7 Interactive presentations

Date: Thursday, 17 March 2022
Time: 11:30 - 12:15 CET

Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.

Label Presentation Title
Authors
IP.1_7.1 FLOWACC: REAL-TIME HIGH-ACCURACY DNN-BASED OPTICAL FLOW ACCELERATOR IN FPGA
Speaker:
Yehua Ling, Sun Yat-sen University, CN
Authors:
Yehua Ling, Yuanxing Yan, Kai Huang and Gang Chen, Sun Yat-sen University, CN
Abstract
Recently, accelerator architectures have been designed that use deep neural networks (DNNs) to accelerate computer vision tasks, possessing the advantages of both accuracy and speed. Optical flow accelerators, however, are not among the architectures to which DNNs have been successfully deployed: existing hardware accelerators for optical flow estimation are all designed for classic methods and generally perform poorly in estimation accuracy. In this paper, we present FlowAcc, a dedicated hardware accelerator for DNN-based optical flow estimation, adopting a pipelined hardware design for real-time processing of image streams. We design an efficient multiplexing binary neural network (BNN) architecture for pyramidal feature extraction that significantly reduces the hardware cost and makes it independent of the number of pyramid levels. Furthermore, efficient Hamming distance calculation and competent flow regularization are utilized for hierarchical optical flow estimation to greatly improve system efficiency. Comprehensive experimental results demonstrate that FlowAcc achieves state-of-the-art estimation accuracy and real-time performance on the Middlebury dataset when compared with existing optical flow accelerators.
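The Hamming-distance matching cost mentioned in the abstract is what makes BNN feature comparison hardware-friendly: it reduces to an XOR followed by a population count. A minimal software sketch (descriptor width and the nearest-neighbour matching policy are assumptions, not the paper's design):

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two binary feature descriptors:
    XOR the words, then count the set bits (a popcount in hardware)."""
    return bin(a ^ b).count("1")

def best_match(query: int, candidates: list) -> int:
    """Index of the candidate descriptor closest to the query."""
    return min(range(len(candidates)), key=lambda i: hamming(query, candidates[i]))
```

In hardware this costs one XOR gate per bit plus an adder tree, which is far cheaper than the multiply-accumulate chains needed for full-precision feature correlation.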
IP.1_7.2 ON EXPLOITING PATTERNS FOR ROBUST FPGA-BASED MULTI-ACCELERATOR EDGE COMPUTING SYSTEMS
Speaker:
Seyyed Ahmad Razavi, University of California, Irvine, US
Authors:
Seyyed Ahmad Razavi, Hsin-Yu Ting, Tootiya Giyahchi and Eli Bozorgzadeh, University of California, Irvine, US
Abstract
Edge computing plays a key role in providing services for emerging compute-intensive applications while bringing computation close to end devices. FPGAs have been deployed to provide custom acceleration services due to their reconfigurability and support for multi-tenancy in sharing computing resources. This paper explores an FPGA-based multi-accelerator edge computing system that serves various DNN applications from multiple end devices simultaneously. To dynamically maximize responsiveness to end devices, we propose a system framework that exploits the request patterns of applications and employs a staggering module, coupled with a mixed offline/online multi-queue scheduling method, to alleviate resource contention and the uncertainty caused by network delay variation. Our evaluation shows the framework can significantly improve responsiveness and robustness in serving multiple end devices.
IP.1_7.3 RLPLACE: DEEP RL GUIDED HEURISTICS FOR DETAILED PLACEMENT OPTIMIZATION
Speaker:
Uday Mallappa, UC San Diego, US
Authors:
Uday Mallappa1, Sreedhar Pratty2 and David Brown2
1University of California San Diego, US; 2Nvidia, US
Abstract
The solution space of detailed placement becomes intractable as the number of placeable cells and their possible locations grows, so existing works focus on either sliding-window-based or row-based optimization. Though these region-based methods enable the use of linear-programming, pseudo-greedy, or dynamic-programming algorithms, the locally optimal solutions they produce are globally sub-optimal due to their inherent heuristics: the order in which the local problems are chosen, or the size of each sliding window (a runtime vs. optimality tradeoff), accounts for the degradation in solution quality. Our hypothesis is that learning-based techniques, whose richer representations have shown great success in problems with huge solution spaces, can offer an alternative to these rudimentary heuristics. We propose a two-stage detailed-placement algorithm, RLPlace, that uses reinforcement learning (RL) for coarse re-arrangement and Satisfiability Modulo Theories (SMT) for fine-grain refinement. Starting from the global placement output of two critical IPs, RLPlace achieves up to 1.35% HPWL improvement over the commercial tool's detailed-placement result. In addition, RLPlace shows at least 1.2% HPWL improvement over highly optimized detailed-placement variants of the two IPs.

IP.ASD Interactive presentations

Date: Thursday, 17 March 2022
Time: 11:30 - 12:15 CET

Session chair:
Philipp Mundhenk, Bosch, DE

Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.

Label Presentation Title
Authors
IP.ASD.1 DEADLOCK ANALYSIS AND PREVENTION FOR INTERSECTION MANAGEMENT BASED ON COLORED TIMED PETRI NETS
Speaker:
Tsung-Lin Tsou, National Taiwan University, TW
Authors:
Tsung-Lin Tsou, Chung-Wei Lin and Iris Hui-Ru Jiang, National Taiwan University, TW
Abstract
We propose a Colored Timed Petri Net (CTPN) based model for intersection management. With the expressiveness of the CTPN-based model, we can consider timing, vehicle-specific information, and different types of vehicles. We then design deadlock-free policies and guarantee deadlock-freeness for intersection management. To the best of our knowledge, this is the first work on CTPN-based deadlock analysis and prevention for intersection management.
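The intuition behind the deadlock analysis can be illustrated without the full Colored Timed Petri Net machinery: a deadlock at an intersection corresponds to a cycle in a wait-for graph, where each vehicle points at the vehicle blocking it. A minimal sketch (the paper's CTPN model is far richer, also covering timing and vehicle types; this graph encoding is an assumption for illustration):

```python
def has_deadlock(wait_for):
    """Detect a cycle in a wait-for graph {vehicle: vehicle it waits on}.
    A cycle means none of the vehicles on it can ever move: a deadlock."""
    for start in wait_for:
        seen = set()
        v = start
        while v in wait_for:       # follow the chain of blockers
            if v in seen:
                return True        # revisited a vehicle: cycle found
            seen.add(v)
            v = wait_for[v]
    return False
```

A deadlock-prevention policy of the kind the paper designs would refuse to admit a vehicle whenever granting its request would close such a cycle.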
IP.ASD.2 ATTACK DATA GENERATION FRAMEWORK FOR AUTONOMOUS VEHICLE SENSORS
Speaker:
Jan Lauinger, TU Munich, DE
Authors:
Jan Lauinger1, Andreas Finkenzeller1, Henrik Lautebach2, Mohammad Hamad1 and Sebastian Steinhorst1
1TU Munich, DE; 2ZF Group, DE
Abstract
Driving scenarios of autonomous vehicles combine many data sources with new networking requirements in highly dynamic system setups. To keep security mechanisms applicable to new application fields in the automotive domain, our work introduces a security framework to generate, attack, and validate realistic data sets at rest and in transit. Concerning realistic data sets, our framework leverages autonomous driving simulators as well as static data sets of vehicle sensors. A configurable networking setup enables flexible data encapsulation to perform and validate networking attacks on data in transit. We validate our results with intrusion detection algorithms and simulation environments. Generated data sets and configurations are reproducible, portable, storable, and support iterative security testing of scenarios.
IP.ASD.3 CONTRACT-BASED QUALITY-OF-SERVICE ASSURANCE IN DYNAMIC DISTRIBUTED SYSTEMS
Speaker:
Lea Schönberger, TU Dortmund University, DE
Authors:
Lea Schönberger1, Susanne Graf2, Selma Saidi3, Dirk Ziegenbein4 and Arne Hamann4
1TU Dortmund University, DE; 2University Grenoble Alpes, CNRS, FR; 3TU Dortmund, DE; 4Robert Bosch GmbH, DE
Abstract
To offer an infrastructure for autonomous systems that offload parts of their functionality, dynamic distributed systems must be able to satisfy non-functional quality-of-service (QoS) requirements. However, providing hard QoS guarantees that hold even under uncertain conditions, without resorting to complex global verification, is very challenging. In this work, we propose a contract-based QoS assurance scheme for centralized, hierarchical systems that requires only local verification and has the potential to cope with dynamic changes and uncertainties.

K.5 Lunch Keynote: "Probabilistic and Deep Learning Techniques for Robot Navigation and Automated Driving"

Date: Thursday, 17 March 2022
Time: 13:00 - 13:50 CET

Session chair:
Rolf Ernst, TU Braunschweig, DE

Session co-chair:
Selma Saidi, TU Dortmund, DE

For autonomous robots and automated driving, the capability to robustly perceive environments and execute their actions is the ultimate goal. The key challenge is that no sensors and actuators are perfect, which means that robots and cars need the ability to properly deal with the resulting uncertainty. In this presentation, I will introduce the probabilistic approach to robotics, which provides a rigorous statistical methodology to deal with state estimation problems. I will furthermore discuss how this approach can be extended using state-of-the-art technology from machine learning to deal with complex and changing real-world environments.

Speaker's bio: Wolfram Burgard is a Professor of Robotics and Artificial Intelligence at the Technical University of Nuremberg. His interests lie in robotics, artificial intelligence, machine learning, and computer vision. He has authored over 400 publications, more than 15 of which received best paper awards. In 2009, he was awarded the Gottfried Wilhelm Leibniz Prize, the most prestigious German research award. In 2010, he received an Advanced Grant from the European Research Council. In 2021, he received the IEEE Technical Field Award for Robotics and Automation. He is a Fellow of the IEEE, the AAAI, and EurAI, and a member of the German Academy of Sciences Leopoldina as well as of the Heidelberg Academy of Sciences and Humanities.

Time Label Presentation Title
Authors
13:00 CET K.5.1 PROBABILISTIC AND DEEP LEARNING TECHNIQUES FOR ROBOT NAVIGATION AND AUTOMATED DRIVING
Speaker and Author:
Wolfram Burgard, TU Nuremberg, DE
Abstract
For autonomous robots and automated driving, the capability to robustly perceive environments and execute their actions is the ultimate goal. The key challenge is that no sensors and actuators are perfect, which means that robots and cars need the ability to properly deal with the resulting uncertainty. In this presentation, I will introduce the probabilistic approach to robotics, which provides a rigorous statistical methodology to deal with state estimation problems. I will furthermore discuss how this approach can be extended using state-of-the-art technology from machine learning to deal with complex and changing real-world environments.

11.1 Analog / mixed-signal EDA from system level to layout level

Date: Thursday, 17 March 2022
Time: 14:30 - 15:30 CET

Session chair:
Manuel Barragan, Universite Grenoble Alpes, CNRS, Grenoble INP, TIMA, FR

Session co-chair:
Lars Hedrich, Goethe University of Frankfurt/Main, DE

The first paper in the session explores the high-level design of a mixed-signal system. Topology generation and sizing for an OPAMP are discussed next. The following four papers deal with various issues in placement: placement guided by circuit simulations, a discussion of models for placement, routability issues, and finally the placement and routing of capacitor arrays.

Time Label Presentation Title
Authors
14:30 CET 11.1.1 EFFICSENSE: AN ARCHITECTURAL PATHFINDING FRAMEWORK FOR ENERGY-CONSTRAINED SENSOR APPLICATIONS
Speaker:
Jonah Van Assche, KU Leuven, BE
Authors:
Jonah Van Assche, Ruben Helsen and Georges Gielen, KU Leuven, BE
Abstract
This paper introduces EffiCSense, an architectural pathfinding framework for mixed-signal sensor front-ends for both regular and compressive sensing systems. Since sensing systems are often energy-constrained, finding a suitable architecture can be a long iterative process between high-level modeling and circuit design. We present a Simulink-based framework that allows architectural pathfinding with high-level functional models while also including power consumption models of the different circuit blocks. This makes it possible to directly model the impact of design specifications on power consumption and speeds up the overall design process significantly. Architectures both with and without compressive sensing can be handled. The framework is demonstrated on the processing of EEG signals for epilepsy detection, comparing solutions with and without analog compressive sensing. Simulations show that, using the compression, an optimal design can be found that is estimated to be 3.6 times more power-efficient than a system without compression, consuming 2.44 µW at a detection accuracy of 99.3%.
14:34 CET 11.1.2 TOPOLOGY OPTIMIZATION OF OPERATIONAL AMPLIFIER IN CONTINUOUS SPACE VIA GRAPH EMBEDDING
Speaker:
Jialin Lu, Fudan University, CN
Authors:
Jialin Lu, Liangbo Lei, Fan Yang, Li Shang and Xuan Zeng, Fudan University, CN
Abstract
The operational amplifier is a key building block in analog circuits. However, its design process is complex and time-consuming, as no practical automation tools are available in industry. This paper presents a new topology optimization method for operational amplifiers. The behavioral description of the operational amplifier is captured as a directed acyclic graph (DAG), which is then transformed into a low-dimensional embedding in continuous space using a variational graph autoencoder. Topology search is performed in the continuous embedding space using stochastic optimization methods such as Bayesian optimization. The search results are then transformed back into operational amplifier topologies using a graph decoder. The proposed method is also equipped with a surrogate model for performance prediction. Experimental results show that the proposed approach achieves significant speedup over genetic search algorithms. The produced three-stage operational amplifiers offer competitive performance compared to manual designs.
14:38 CET 11.1.3 A CHARGE FLOW FORMULATION FOR GUIDING ANALOG/MIXED-SIGNAL PLACEMENT
Speaker:
Tonmoy Dhar, University of Minnesota Twin Cities, US
Authors:
Tonmoy Dhar1, Ramprasath S2, Jitesh Poojary2, Soner Yaldiz3, Steven Burns3, Ramesh Harjani2 and Sachin S. Sapatnekar2
1University of Minnesota Twin Cities, US; 2University of Minnesota, US; 3Intel Corporation, US
Abstract
An analog/mixed-signal designer typically performs circuit optimization, involving intensive SPICE simulations, on a schematic netlist and then sends the optimized netlist to layout. During the layout phase, it is vital to maintain symmetry requirements to avoid performance degradation due to mismatch: these constraints are usually specified using user input or by invoking an external tool. Moreover, to achieve high performance, the layout must avoid large interconnect parasitics on critical nets. Prior works that optimize parasitics during placement work with coarse metrics such as the half-perimeter wire length, but these metrics do not appropriately emphasize performance-critical nets. The novel charge flow (CF) formulation in this work addresses both symmetry detection and parasitic optimization. By leveraging schematic-level simulations, which are available “for free” from the circuit optimization step, the approach (a) alters the objective function to emphasize the reduction of parasitics on performance-critical nets, and (b) identifies symmetric elements/element groups. The effectiveness of the CF-based approach is demonstrated on a variety of circuits within a stochastic placement engine.
14:42 CET 11.1.4 (Best Paper Award Candidate)
ARE ANALYTICAL TECHNIQUES WORTHWHILE FOR ANALOG IC PLACEMENT?
Speaker:
Yishuang Lin, Texas A&M University, US
Authors:
Yishuang Lin1, Yaguang Li1, Donghao Fang1, Meghna Madhusudan2, Sachin S. Sapatnekar2, Ramesh Harjani2 and Jiang Hu1
1Texas A&M University, US; 2University of Minnesota, US
Abstract
Analytical techniques have long been the prevailing approach to digital IC placement due to their advantage in handling huge problem sizes. Recently, they have been adopted for analog IC placement, where prior methods were mostly based on simulated annealing. However, a comparative study of the two approaches is lacking, and the impact of different analytical techniques is unclear. This work attempts to shed light on both issues by studying existing methods and developing a new analytical technique. Circuit performance is a critical concern for automated analog layout; to this end, we propose a performance-driven analytical analog placement technique, which, to the best of our knowledge, has not been studied in the past. Experiments were performed on various testcase circuits. For the conventional formulation without performance considerations, the proposed analytical technique achieves a 55× speedup and a 12% wirelength reduction compared to simulated annealing. For performance-driven placement, the proposed technique outperforms simulated annealing in terms of circuit performance, area, and runtime. Moreover, it generally provides better solution quality than a recent analytical technique.
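The core of an analytical placement technique is the minimization of a smooth wirelength objective rather than annealing over discrete moves. A toy one-dimensional quadratic-wirelength sketch (an illustration of the general idea, not the paper's performance-driven formulation; the net list and pad positions are invented):

```python
def quadratic_place_1d(edges, fixed, movable, iters=200):
    """Minimize total squared wirelength in 1D: each movable cell is
    repeatedly moved to the mean of its neighbors' positions, a
    Gauss-Seidel sweep over the quadratic cost's normal equations."""
    pos = dict(fixed)
    pos.update({m: 0.0 for m in movable})
    for _ in range(iters):
        for m in movable:
            nbrs = [b if a == m else a for a, b in edges if m in (a, b)]
            pos[m] = sum(pos[n] for n in nbrs) / len(nbrs)
    return pos
```

For a chain pad(0) - u - v - pad(3), the sweeps converge to the evenly spread optimum u = 1, v = 2. Simulated annealing would instead explore discrete swaps, which is where the reported 55× runtime gap comes from.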
14:46 CET 11.1.5 ROUTABILITY-AWARE PLACEMENT FOR ADVANCED FINFET MIXED-SIGNAL CIRCUITS USING SATISFIABILITY MODULO THEORIES
Speaker:
Hao Chen, University of Texas at Austin, US
Authors:
Hao Chen1, Walker Turner2, David Z. Pan1 and Haoxing Ren2
1University of Texas at Austin, US; 2NVIDIA Corporation, US
Abstract
Due to the increasingly complex design rules and geometric layout constraints within advanced FinFET nodes, automated placement of full-custom analog/mixed-signal (AMS) designs has become increasingly challenging. Compared with traditional planar nodes, AMS circuit layout is dramatically different for FinFET technologies due to strict design rules and grid-based restrictions for both placement and routing. This limits previous analog placement approaches in effectively handling all of the new constraints while adhering to the new layout style. Additionally, limited work has demonstrated effective routability modeling, which is crucial for successful routing. This paper presents a robust analog placement framework using satisfiability modulo theories (SMT) for efficient constraint handling and routability modeling. Experimental results based on industrial designs show the effectiveness of the proposed framework in optimizing placement metrics while satisfying the specified constraints.
14:50 CET 11.1.6 CONSTRUCTIVE COMMON-CENTROID PLACEMENT AND ROUTING FOR BINARY-WEIGHTED CAPACITOR ARRAYS
Speaker:
Nibedita Karmokar, University of Minnesota, Twin Cities, US
Authors:
Nibedita Karmokar, Arvind Kumar Sharma, Jitesh Poojary, Meghna Madhusudan, Ramesh Harjani and Sachin S. Sapatnekar, University of Minnesota, US
Abstract
The accuracy and linearity of capacitive digital-to-analog converters (DACs) depend on precise capacitor ratios, but these ratios are perturbed by process variations and parasitics. This paper develops fast constructive procedures for common-centroid placement and routing for binary-weighted capacitors in charge-sharing DACs. Parasitics also degrade the switching speed of a capacitor array, particularly in FinFET nodes with severe wire/via resistances. To overcome this, the capacitor array is placed and routed to optimize switching speed, measured by the 3dB frequency. A balance between 3dB frequency and DAC INL/DNL is shown by trading off via counts with dispersion. The approach delivers high-quality results with low runtimes.
14:54 CET 11.1.7 Q&A SESSION
Authors:
Manuel Barragan1 and Lars Hedrich2
1Universite Grenoble Alpes, CNRS, Grenoble INP, TIMA, FR; 2Goethe University of Frankfurt/Main, DE
Abstract
Questions and answers with the authors

11.2 Approximate Computing Everywhere

Date: Thursday, 17 March 2022
Time: 14:30 - 15:30 CET

Session chair:
Jie Han, University of Alberta, CA

Session co-chair:
Ilaria Scarabottolo, Università della Svizzera italiana, CH

New automated synthesis and optimization methods targeting approximate circuits are presented in the first part of this session. Application papers then deal with new approximation techniques developed for deep neural network accelerators, printed circuit optimization, speech processing, and approximate solutions for stochastic computing. The first paper introduces a new logic synthesis method that utilizes formal verification engines to generate approximate circuits satisfying quality constraints by construction. The second paper presents a method for optimizing approximate compressor trees in multipliers. The third paper addresses a method for automatically generating approximate low-power deep learning accelerators based on TPUs. A new application of approximate computing, the optimization of printed circuits, is introduced in the fourth paper. The fifth paper proposes a speech recognition ASIC based on a target-separable binarized weight network, capable of performing speaker verification and keyword spotting. The authors of the last paper combine approximate and stochastic computing principles in coarse-grained reconfigurable architectures to reduce circuit complexity and power consumption. The IP papers deal with a probability-oriented approximate computing method for DNN accelerators and a learned approximate computing method capable of tuning application parameters to maximize output quality without changing the computation.

Time Label Presentation Title
Authors
14:30 CET 11.2.1 MUSCAT: MUS-BASED CIRCUIT APPROXIMATION TECHNIQUE
Speaker:
Linus Witschen, Paderborn University, DE
Authors:
Linus Witschen, Tobias Wiersema, Matthias Artmann and Marco Platzner, Paderborn University, DE
Abstract
Many applications show an inherent resiliency against inaccuracies and errors in their computations. The design paradigm of approximate computing exploits this fact by trading off the application's accuracy against a target metric, e.g., hardware area. This work focuses on approximate computing at the hardware level, where approximate logic synthesis seeks to generate approximate circuits under user-defined quality constraints. We propose the novel approximate logic synthesis method MUSCAT, which generates approximate circuits that are valid by construction. MUSCAT inserts cutpoints into the netlist, employing the commonly used concept of substituting connections between gates with constant values, which offers potential for subsequent logic minimization. MUSCAT's novelty lies in utilizing formal verification engines to identify minimal unsatisfiable subsets. These subsets determine a maximal number of cutpoints that can be activated together without violating the user-defined quality constraints. As a result, MUSCAT determines an optimal solution w.r.t. the number of activated cutpoints while providing a guarantee on the quality constraints. We present the method and experimentally compare MUSCAT's open-source implementation to AIG rewriting and components from the EvoApproxLib, showing that our method improves upon these state-of-the-art methods by achieving up to 80% higher savings in circuit area at typically much lower computation times.
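The cutpoint idea can be illustrated on a toy netlist: tie an internal signal to a constant and check whether the quality constraint still holds. The sketch below checks this by exhaustive enumeration rather than with a formal verification engine as MUSCAT does, and the circuit, constraint, and error metric are all invented for illustration.

```python
from itertools import product

def circuit(a, b, c, cut=None):
    """Toy 3-input circuit; `cut` optionally overrides the internal
    signal t = a AND b with a constant, mimicking a cutpoint activation."""
    t = (a & b) if cut is None else cut
    return t | c

def cutpoint_is_valid(const, max_errors):
    """Verify that tying the cutpoint to `const` changes the output on
    at most `max_errors` of the 8 input patterns (exhaustive here;
    a formal engine would answer the same question symbolically)."""
    errors = sum(
        circuit(a, b, c) != circuit(a, b, c, cut=const)
        for a, b, c in product((0, 1), repeat=3)
    )
    return errors <= max_errors
```

Tying t to 0 reduces the circuit to a single wire (out = c) at the cost of one erroneous pattern; tying it to 1 errs on three patterns and would be rejected under the same constraint.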
14:34 CET 11.2.2 OPACT: OPTIMIZATION OF APPROXIMATE COMPRESSOR TREE FOR APPROXIMATE MULTIPLIER
Speaker:
Weihua Xiao, Shanghai Jiao Tong University, CN
Authors:
Weihua Xiao1, Cheng Zhuo2 and Weikang Qian1
1Shanghai Jiao Tong University, CN; 2Zhejiang University, CN
Abstract
Approximate multipliers have attracted significant attention from researchers designing low-power systems. The most area-consuming part of a multiplier is its compressor tree (CT); hence, prior works have proposed various approximate compressors to reduce the CT's area. However, the compression strategy for approximate compressors has not been systematically studied: most prior works apply ad hoc strategies to arrange the approximate compressors. In this work, we propose OPACT, a method for optimizing the approximate compressor tree of an approximate multiplier. An integer linear programming problem is first formulated to co-optimize the CT's area and error. Moreover, since different connection orders of the approximate compressors can affect the error of an approximate multiplier, we formulate another mixed-integer programming problem to optimize the connection order. Experimental results show that OPACT produces approximate multipliers with an average reduction of 24.4% and 8.4% in power-delay product and mean error distance, respectively, compared to the best existing designs using the same types of approximate compressors.
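The error metric OPACT optimizes can be made concrete on a single approximate 4:2 compressor. The sketch below uses an illustrative compressor (invented for this example, not one of the paper's designs) whose two output bits cannot always represent the exact count of ones, and measures the resulting mean error distance exhaustively.

```python
from itertools import product

def approx_compress(x1, x2, x3, x4):
    """Illustrative approximate 4:2 compressor: the two output bits
    (sum, carry) encode sum + 2*carry, so the exact ones-count (0..4)
    cannot always be represented -- error is traded for dropping the
    third output bit an exact compressor would need."""
    s = x1 ^ x2 ^ x3 ^ x4
    carry = (x1 & x2) | (x3 & x4)
    return s, carry

def mean_error_distance():
    """Average |exact - approximate| over all 16 input patterns."""
    total = 0
    for bits in product((0, 1), repeat=4):
        exact = sum(bits)
        s, c = approx_compress(*bits)
        total += abs(exact - (s + 2 * c))
    return total / 16
```

For this compressor the exhaustive mean error distance works out to 0.625; OPACT's ILP formulation chooses how many such compressors to place in each column, and its MIP chooses their connection order, so that errors of opposite sign tend to cancel across the tree.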
14:38 CET 11.2.3 LEARNING TO DESIGN ACCURATE DEEP LEARNING ACCELERATORS WITH INACCURATE MULTIPLIERS
Speaker:
Paras Jain, UC Berkeley, US
Authors:
Paras Jain1, Safeen Huda2, Martin Maas3, Joseph Gonzalez1, Ion Stoica1 and Azalia Mirhoseini4
1UC Berkeley, US; 2University of Toronto, CA; 3Google, Inc., US; 4Google, US
Abstract
Approximate computing is a promising way to improve the power efficiency of deep learning. While recent work proposes new arithmetic circuits (adders and multipliers) that consume substantially less power at the cost of computation errors, these approximate circuits decrease the end-to-end accuracy of common models. We present AutoApprox, a framework to automatically generate approximate low-power deep learning accelerators without any accuracy loss. AutoApprox generates a wide range of approximate ASIC accelerators with a TPUv3 systolic-array template. AutoApprox uses a learned router to assign each DNN layer to an approximate systolic array from a bank of arrays with varying approximation levels. By tailoring this routing to a specific neural network architecture, we discover circuit designs without the accuracy penalty of prior methods. Moreover, AutoApprox optimizes the end-to-end performance, power, and area of the whole chip and PE mapping rather than simply measuring the performance of the arithmetic units in isolation. To our knowledge, our work is the first to demonstrate the effectiveness of custom-tailored approximate circuits in delivering significant chip-level energy savings with zero accuracy loss on a large-scale dataset such as ImageNet. AutoApprox synthesizes a novel approximate accelerator based on the TPU that reduces end-to-end power consumption by 3.2% and area by 5.2% at a sub-10nm process with no degradation in ImageNet validation top-1 and top-5 accuracy.
14:42 CET 11.2.4 (Best Paper Award Candidate)
CROSS-LAYER APPROXIMATION FOR PRINTED MACHINE LEARNING CIRCUITS
Speaker:
Giorgos Armeniakos, NTUA / KIT, GR
Authors:
Giorgos Armeniakos1, Georgios Zervakis2, Dimitrios Soudris3, Mehdi Tahoori2 and Joerg Henkel4
1National Technichal University of Athens, GR; 2Karlsruhe Institute of Technology, DE; 3National TU Athens, GR; 4Karlsruhe institute of technology, DE
Abstract
Printed electronics (PE) features low non-recurring engineering costs and low per-unit-area fabrication costs, thus enabling extremely low-cost and on-demand hardware. Such low-cost fabrication allows a degree of customization that would be infeasible in silicon, and bespoke architectures prevail to improve the efficiency of emerging PE machine learning (ML) applications. However, even with bespoke architectures, the large feature sizes in PE constrain the complexity of the ML models that can be implemented. In this work, we bring together, for the first time, approximate computing and PE design to enable complex ML models, such as Multi-Layer Perceptrons (MLPs) and Support Vector Machines (SVMs), in PE. To this end, we propose and implement a cross-layer approximation tailored for bespoke ML architectures: at the algorithmic level we apply a hardware-driven coefficient approximation of the ML model, and at the circuit level we apply netlist pruning through a full-search exploration. In our extensive experimental evaluation, we consider 14 MLPs and SVMs and evaluate more than 4300 approximate and exact designs. Our results demonstrate that our cross-layer approximation delivers Pareto-optimal designs that, compared to the state-of-the-art exact designs, feature 47% and 44% average area and power reduction, respectively, with less than 1% accuracy loss.
14:46 CET 11.2.5 A TARGET-SEPARABLE BWN INSPIRED SPEECH RECOGNITION PROCESSOR WITH LOW-POWER PRECISION-ADAPTIVE APPROXIMATE COMPUTING
Speaker:
Bo Liu, Southeast University, CN
Authors:
Bo Liu1, Hao Cai1, Xuan Zhang1, Haige Wu1, Anfeng Xue1, Zilong Zhang1, Zhen Wang2 and Jun Yang1
1Southeast University, CN; 2Nanjing Prochip Electronic Technology Co. Ltd, CN
Abstract
This paper proposes a speech recognition processor based on a target-separable binarized weight network (BWN), capable of performing both speaker verification (SV) and keyword spotting (KWS). In traditional speech recognition systems, SV based on a conventional model and KWS based on a neural network (NN) model are two independent hardware modules. In this work, both SV and KWS are processed by the proposed BWN with a unified training and optimization framework that can serve various application scenarios. Through system-architecture co-design, SV and KWS share most of the feature extraction network parameters, and the classification part is calculated separately according to the different targets. An energy-efficient NN accelerator is proposed that can be dynamically reconfigured to process different layers of the BWN with split calculation of the frequency-domain convolution. SV and KWS can be achieved with a single calculation per input speech frame, which greatly improves computing energy efficiency. The computing units of the NN accelerator are optimized using a precision-adaptive approximate addition tree architecture with a dual-VDD method to further reduce energy cost. Compared to the state of the art, this work achieves about a 4x reduction in power consumption while maintaining high system adaptability and accuracy.
14:50 CET 11.2.6 TOWARDS ENERGY-EFFICIENT CGRAS VIA STOCHASTIC COMPUTING
Speaker:
Bo Wang, Chongqing University, CN
Authors:
Bo Wang1, Rong Zhu1, Jiaxing Shang2 and Dajiang Liu1
1Chongqing University, CN; 2Chongqing University, CN
Abstract
Stochastic computing (SC) is a promising computing paradigm for low-power and low-cost applications, with the added benefit of high error tolerance. Meanwhile, Coarse-Grained Reconfigurable Architecture (CGRA) is a promising platform for domain-specific applications thanks to its combination of energy efficiency and flexibility. Intuitively, introducing SC to CGRA would synergistically reinforce the strengths of both paradigms. Accordingly, this paper proposes an SC-based CGRA that replaces the exact multiplication in a traditional CGRA with SC-based multiplication, where both accuracy and latency are improved using parallel stochastic sequence generators and leading-zero shifters. In addition, with the flexible connections among PEs, high-accuracy operation can easily be achieved by combining neighboring PEs without switching costs such as power-gating. Compared to the state-of-the-art approximate computing design of CGRA, our proposed CGRA achieves 16% more energy reduction and a 34% energy efficiency improvement while keeping high configuration flexibility.
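For readers unfamiliar with the SC multiplication the proposed CGRA substitutes for exact multipliers: in unipolar coding, a value in [0, 1] is encoded as a random bitstream whose density of 1s equals the value, and a single AND gate multiplies two independent streams. A minimal software sketch of the principle (not the paper's hardware):

```python
# Software sketch of unipolar stochastic-computing multiplication: a value in
# [0, 1] is encoded as a bitstream whose density of 1s equals the value, and a
# single AND gate multiplies two independent streams. Illustrative only; the
# paper's contribution is the hardware around this principle.
import random

def to_stream(value, length, rng):
    # Bernoulli bitstream with P(bit = 1) = value.
    return [1 if rng.random() < value else 0 for _ in range(length)]

def sc_multiply(a, b, length=4096, seed=0):
    rng = random.Random(seed)
    stream_a = to_stream(a, length, rng)
    stream_b = to_stream(b, length, rng)
    # One AND gate per bit position; the output density estimates a * b.
    return sum(x & y for x, y in zip(stream_a, stream_b)) / length

print(sc_multiply(0.5, 0.5))  # close to 0.25, up to stochastic error
```

The accuracy/latency tension is visible here: longer bitstreams reduce the stochastic error but cost more cycles, which is exactly the trade-off the paper's sequence generators and leading-zero shifters target.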
14:54 CET 11.2.7 Q&A SESSION
Authors:
Jie Han1 and Ilaria Scarabottolo2
1University of Alberta, CA; 2USI Lugano, CH
Abstract
Questions and answers with the authors

11.3 Advanced Mapping and Optimization for Emerging ML Hardware

Date: Thursday, 17 March 2022
Time: 14:30 - 15:30 CET

Session chair:
Jan Moritz Joseph, RWTH Aachen University, DE

Session co-chair:
Elnaz Ansari, Meta/Facebook, US

In this session we present six papers on advanced optimization techniques for model-mapping-hardware co-design. The first paper focuses on optimizing the performance of sparse ML models with conventional DRAM. The second broadens the scope to emerging PIM architectures. The third introduces a latency model for diverse hardware architectures, and the last three papers introduce novel evolutionary/genetic algorithmic methods for co-optimizing the model, mapping, and hardware.

Time Label Presentation Title
Authors
14:30 CET 11.3.1 DASC : A DRAM DATA MAPPING METHODOLOGY FOR SPARSE CONVOLUTIONAL NEURAL NETWORKS
Speaker:
Bo-Cheng Lai, National Yang Ming Chiao Tung University, TW
Authors:
Bo-Cheng Lai1, Tzu-Chieh Chiang1, Po-Shen Kuo1, Wan-Ching Wang1, Yan-Lin Hung1, Hung-Ming Chen2, Chien-Nan Liu1 and Shyh-Jye Jou1
1National Yang Ming Chiao Tung University, TW; 2Institute of Electronics, National Chiao Tung University, TW
Abstract
Transferring the sheer model size of a CNN (Convolutional Neural Network) has become one of the main performance challenges in modern intelligent systems. Although pruning can trim away a substantial number of non-effective neurons, the excessive DRAM accesses to the non-zero data in a sparse network still dominate overall system performance. Proper data mapping can enable efficient DRAM accesses for a CNN. However, previous DRAM mapping methods focus on dense CNNs and become less effective when handling the compressed format and irregular accesses of sparse CNNs. The extensive design-space search for mapping parameters also results in a time-consuming process. This paper proposes DASC: a DRAM data mapping methodology for sparse CNNs. DASC is designed to handle the data patterns and block schedule of sparse CNNs to attain good spatial locality and efficient DRAM accesses. The bank-group feature in modern DDR is further exploited to enhance processing parallelism. DASC also introduces an analytical model that enables fast exploration and quick convergence of parameter search in minutes instead of the days required by previous work. Compared with the state of the art, DASC decreases total DRAM latency and attains an average of 17.1x, 14.3x, and 14.6x better DRAM performance for sparse AlexNet, VGG-16, and ResNet-50, respectively.
14:34 CET 11.3.2 VW-SDK: EFFICIENT CONVOLUTIONAL WEIGHT MAPPING USING VARIABLE WINDOWS FOR PROCESSING-IN-MEMORY ARCHITECTURES
Speaker:
Johnny Rhe, Sungkyunkwan University, KR
Authors:
Johnny Rhe, Sungmin Moon and Jong Hwan Ko, Sungkyunkwan University, KR
Abstract
With their high energy efficiency, processing-in-memory (PIM) arrays are increasingly used for convolutional neural network (CNN) inference. In PIM-based CNN inference, the computational latency and energy depend on how the CNN weights are mapped to the PIM array. A recent study proposed shifted and duplicated kernel (SDK) mapping that reuses the input feature maps with a unit of a parallel window, which is convolved with duplicated kernels to obtain multiple output elements in parallel. However, the existing SDK-based mapping algorithm does not always result in the minimum computing cycles because it only maps a square-shaped parallel window with the entire channels. In this paper, we introduce a novel mapping algorithm called variable-window SDK (VW-SDK), which adaptively determines the shape of the parallel window that leads to the minimum computing cycles for a given convolutional layer and PIM array. By allowing rectangular-shaped windows with partial channels, VW-SDK utilizes the PIM array more efficiently, thereby further reducing the number of computing cycles. Simulation with a 512x512 PIM array and ResNet-18 shows that VW-SDK improves the inference speed by 1.69x compared to the existing SDK-based algorithm.
14:38 CET 11.3.3 A UNIFORM LATENCY MODEL FOR DNN ACCELERATORS WITH DIVERSE ARCHITECTURES AND DATAFLOWS
Speaker:
Linyan Mei, KU Leuven, BE
Authors:
Linyan Mei1, Huichu Liu2, Tony Wu3, H. Ekin Sumbul2, Marian Verhelst1 and Edith Beigne2
1KU Leuven, BE; 2Facebook Inc., US; 3Meta/Facebook, US
Abstract
In the early design phase of a Deep Neural Network (DNN) acceleration system, fast energy and latency estimation are important to evaluate the optimality of different design candidates on algorithm, hardware, and algorithm-to-hardware mapping, given the gigantic design space. This work proposes a uniform intra-layer analytical latency model for DNN accelerators that can be used to evaluate diverse architectures and dataflows. It employs a 3-step approach to systematically estimate the latency breakdown of different system components, capture the operation state of each memory component, and identify stall-induced performance bottlenecks. To achieve high accuracy, different memory attributes, operands' memory sharing scenarios, as well as dataflow implications have been taken into account. Validation against an in-house taped-out accelerator across various DNN layers has shown an average latency model accuracy of 94.3%. To showcase the capability of the proposed model, we carry out 3 case studies to assess respectively the impact of mapping, workloads, and diverse hardware architectures on latency, driving design insights for algorithm-hardware-mapping co-optimization.
14:42 CET 11.3.4 MEDEA: A MULTI-OBJECTIVE EVOLUTIONARY APPROACH TO DNN HARDWARE MAPPING
Speaker:
Enrico Russo, University of Catania, IT
Authors:
Enrico Russo1, Maurizio Palesi1, Salvatore Monteleone2, Davide Patti1, Giuseppe Ascia1 and Vincenzo Catania1
1University of Catania, IT; 2Università Niccolò Cusano, IT
Abstract
Domain-specific accelerators embedded in devices enable Deep Neural Network (DNN) inference on resource-constrained hardware. Making optimal design choices and efficiently scheduling neural network algorithms on these specialized architectures is challenging. Many choices can be made to schedule computation spatially and temporally on the accelerator. Each choice influences the access pattern to the buffers of the architectural hierarchy, affecting the energy and latency of the inference. Each mapping also requires specific buffer capacities and a number of spatial component instances, which translate into different chip area occupations. The space of possible combinations, the mapping space, is so large that automatic tools are needed for its rapid exploration and simulation. This work presents MEDEA, an open-source multi-objective evolutionary-algorithm-based approach to DNN accelerator mapping-space exploration. MEDEA leverages the Timeloop analytical cost model. Unlike other schedulers that optimize towards a single objective, MEDEA allows deriving the Pareto set of mappings to optimize towards multiple, sometimes conflicting, objectives simultaneously. We found that the solutions found by MEDEA dominate, in most cases, those found by state-of-the-art mappers.
14:46 CET 11.3.5 DIGAMMA: DOMAIN-AWARE GENETIC ALGORITHM FOR HW-MAPPING CO-OPTIMIZATION FOR DNN ACCELERATORS
Speaker:
Sheng-Chun Kao, Georgia Institute of Technology, US
Authors:
Sheng-Chun Kao1, Michael Pellauer2, Angshuman Parashar2 and Tushar Krishna1
1Georgia Institute of Technology, US; 2Nvidia, US
Abstract
The design of DNN accelerators includes two key parts: HW resource configuration and mapping strategy. Intensive research has been conducted to optimize each of them independently. Unfortunately, optimizing both together is extremely challenging due to the extremely large cross-coupled search space. To address this, in this paper we propose a HW-mapping co-optimization framework, an efficient encoding of the immense design space constructed by HW and mapping, and a domain-aware genetic algorithm, named DiGamma, with specialized operators for improving search efficiency. We evaluate DiGamma with seven popular DNN models with different properties. Our evaluations show DiGamma can achieve (geomean) 3.0x and 10.0x speedup, compared to the best-performing baseline optimization algorithms, in edge and cloud settings.
14:50 CET 11.3.6 (Best Paper Award Candidate)
ANACONGA: ANALYTICAL HW-CNN CO-DESIGN USING NESTED GENETIC ALGORITHMS
Speaker:
Nael Fasfous, TU Munich, DE
Authors:
Nael Fasfous1, Manoj Rohit Vemparala2, Alexander Frickenstein2, Emanuele Valpreda3, Driton Salihu1, Julian Höfer4, Anmol Singh2, Naveen-Shankar Nagaraja2, Hans-Joerg Voegel2, Nguyen Anh Vu Doan1, Maurizio Martina3, Juergen Becker4 and Walter Stechele1
1TU Munich, DE; 2BMW Group, DE; 3Politecnico di Torino, IT; 4Karlsruhe Institute of Technology, DE
Abstract
We present AnaCoNGA, an analytical co-design methodology, which enables two genetic algorithms to evaluate the fitness of design decisions on layer-wise quantization of a neural network and hardware (HW) resource allocation. We embed a hardware architecture search (HAS) algorithm into a quantization strategy search (QSS) algorithm to evaluate the hardware design Pareto-front of each considered quantization strategy. We harness the speed and flexibility of analytical HW-modeling to enable parallel HW-CNN co-design. With this approach, the QSS is focused on seeking high-accuracy quantization strategies which are guaranteed to have efficient hardware designs at the end of the search. Through AnaCoNGA, we improve the accuracy by 2.88 p.p. with respect to a uniform 2-bit ResNet20 on CIFAR-10, and achieve a 35% and 37% improvement in latency and DRAM accesses, while reducing LUT and BRAM resources by 9% and 59% respectively, when compared to a standard edge variant of the accelerator.
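The nested-search structure described above (an inner hardware search evaluating each candidate of an outer quantization-strategy search) can be sketched schematically. Everything below, including the toy latency/accuracy models, bit-width choices, and GA parameters, is a hypothetical stand-in for AnaCoNGA's analytical models and operators.

```python
# Schematic sketch of the nested-search idea: an outer quantization-strategy
# search whose fitness evaluation runs an inner hardware search. The toy
# latency/accuracy models, bit-width choices, and GA parameters below are all
# hypothetical stand-ins for AnaCoNGA's analytical models and operators.
import random

def toy_latency(bits, pes):
    return sum(bits) / pes        # fewer bits / more PEs -> lower latency

def toy_accuracy(bits):
    return min(bits) / 8.0        # toy: limited by the narrowest layer

def inner_hw_search(bits, hw_space):
    # Inner loop: best hardware configuration for this quantization strategy.
    return min(hw_space, key=lambda pes: toy_latency(bits, pes))

def outer_qss(generations, pop_size, layers, hw_space, seed=0):
    rng = random.Random(seed)
    pop = [[rng.choice((2, 4, 8)) for _ in range(layers)] for _ in range(pop_size)]

    def fitness(bits):
        pes = inner_hw_search(bits, hw_space)
        return toy_accuracy(bits) - 0.01 * toy_latency(bits, pes)

    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        for parent in parents:   # one random bit-width mutation per child
            child = parent[:]
            child[rng.randrange(layers)] = rng.choice((2, 4, 8))
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = outer_qss(generations=20, pop_size=8, layers=4, hw_space=[16, 32, 64])
print(best)
```

The key design point mirrored here is that the outer search only ever sees quantization strategies together with their best achievable hardware, so high-accuracy strategies with poor hardware fits are penalized automatically.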
14:54 CET 11.3.7 Q&A SESSION
Authors:
Jan Moritz Joseph1 and Elnaz Ansari2
1RWTH Aachen University, DE; 2Meta/Facebook, US
Abstract
Questions and answers with the authors

11.4 Reconfigurable Systems

Date: Thursday, 17 March 2022
Time: 14:30 - 15:30 CET

Session chair:
Michaela Blott, Xilinx, IE

Session co-chair:
Shreejith Shanker, Trinity College Dublin, IE

This session presents six papers, three of which discuss innovative applications including adaptive CNN acceleration for edge scenarios, filtering for big data applications, and a graph processing accelerator. Two papers explore extensions to CGRA hardware that enable improved mapping, as well as a fast mapping algorithm for CGRAs. Finally, one paper explores technology mapping for FPGAs based on And-Inverter Cones.

Time Label Presentation Title
Authors
14:30 CET 11.4.1 (Best Paper Award Candidate)
ADAFLOW: A FRAMEWORK FOR ADAPTIVE DATAFLOW CNN ACCELERATION ON FPGAS
Speaker:
Guilherme Korol, Federal University of Rio Grande do Sul - Brazil, BR
Authors:
Guilherme Korol1, Michael Jordan2, Mateus Beck Rutzig3 and Antonio Carlos Schneider Beck1
1Universidade Federal do Rio Grande do Sul, BR; 2UFRGS, BR; 3UFSM, BR
Abstract
To meet latency and privacy requirements, resource-hungry deep learning applications have been migrating to the Edge, where IoT devices can offload inference processing to local Edge servers. Since FPGAs have successfully accelerated an increasing number of deep learning applications (especially CNN-based ones), they emerge as an effective alternative for Edge platforms. However, Edge applications may present highly unpredictable workloads, requiring runtime adaptability in inference processing. Although some works apply model switching on CPU and GPU platforms by exploiting different pruning rates at runtime, so that inference can adapt according to a quality-performance trade-off, FPGA-based accelerators refrain from this approach since they are synthesized for specific CNN models. In this context, this work enables model switching on FPGAs by adding to the well-known FINN accelerator an extra level of adaptability (i.e., flexibility) and support for the dynamic use of pruning, either via fast model switches on flexible accelerators, at the cost of some extra logic, or via FPGA reconfiguration of fixed accelerators. From this, we developed AdaFlow: a framework that automatically builds, at design time, a library of these available versions (flexible and fixed, pruned or not) that is used, at runtime, to dynamically select a given version according to a user-configurable accuracy threshold and current workload conditions. We have evaluated AdaFlow on a smart Edge surveillance application with two CNN models and two datasets, showing that AdaFlow processes, on average, 1.3x more inferences and increases power efficiency, on average, by 1.4x over state-of-the-art statically deployed dataflow accelerators.
14:34 CET 11.4.2 RAW FILTERING OF JSON DATA ON FPGAS
Speaker:
Tobias Hahn, FAU, DE
Authors:
Tobias Hahn, Andreas Becher, Stefan Wildermann and Jürgen Teich, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE
Abstract
Many big data applications include processing data streams in semi-structured formats such as JSON. A disadvantage of such formats is that an application may spend a significant amount of processing time just on unselectively parsing all data. To mitigate this issue, the concept of raw filtering has been proposed, with the idea of removing data from a stream prior to the costly parsing stage. However, as accurate filtering of raw data is often only possible after the data has been parsed, raw filters are designed to be approximate, allowing false positives so that they can be implemented efficiently. In contrast to previously proposed CPU-based raw filtering techniques, which are restricted to string matching, we present FPGA-based primitives for filtering strings, numbers, and number ranges. In addition, a primitive respecting the basic structure of JSON data is proposed that can be used to further increase the accuracy of the introduced raw filters. The proposed raw filter primitives are designed to be composable according to a given filter expression of a query. Thus, complex raw filters can be created for FPGAs that drastically decrease the number of generated false positives, particularly for IoT workloads. As there exists a trade-off between accuracy and resource consumption, we evaluate both the primitives and composed raw filters using different queries from the RiotBench benchmark. Our results show that up to 94.3% of the raw data can be filtered without producing any observed false positives, using only a few hundred LUTs.
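The raw-filtering principle, passing records that might match while guaranteeing no false negatives, can be illustrated in software (the paper implements it as composable FPGA primitives). The query, records, and the substring-based predicate below are invented for illustration.

```python
# Software illustration of the raw-filtering principle: decide from the
# *unparsed* text whether a record might match, allowing false positives but
# never false negatives, so the expensive JSON parse runs only on surviving
# records. The query and records are invented for illustration.
import json

def raw_filter(raw, key, value):
    # Approximate: both substrings present anywhere in the raw text. May
    # accept records where `value` belongs to a different key.
    return f'"{key}"' in raw and value in raw

def exact_filter(raw, key, value):
    return json.loads(raw).get(key) == value

records = [
    '{"sensor": "temp", "city": "Rome"}',                        # true match
    '{"sensor": "rain", "note": "visit Rome", "city": "Oslo"}',  # false positive
    '{"sensor": "rain", "city": "Oslo"}',                        # rejected raw
]
survivors = [r for r in records if raw_filter(r, "city", "Rome")]
matches = [r for r in survivors if exact_filter(r, "city", "Rome")]
print(len(survivors), len(matches))  # 2 1
```

The second record shows why raw filters are approximate: "Rome" appears under a different key, so only the parse-time exact filter removes it. The paper's structure-aware primitive reduces exactly this kind of false positive.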
14:38 CET 11.4.3 GRAPHWAVE: A HIGHLY-PARALLEL COMPUTE-AT-MEMORY GRAPH PROCESSING ACCELERATOR
Speaker:
Jinho Lee, National University of Singapore, SG
Authors:
Jinho Lee, Burin Amornpaisannon, Tulika Mitra and Trevor E. Carlson, National University of Singapore, SG
Abstract
The fast, efficient processing of graphs is needed to quickly analyze and understand connected data, from large social network graphs to edge devices performing timely, local data analytics. But as graph data tends to exhibit poor locality, designing graph accelerators that are both high-performance and efficient has been difficult. In this work, GraphWave, we take a different approach from previous research and focus on maximizing accelerator parallelism with a compute-at-memory approach, where each vertex is paired with a dedicated functional unit. We also demonstrate that this work can improve performance and efficiency by optimizing the accelerator's interconnect with multi-level multi-casting to minimize congestion. Taken together, this work achieves, to the best of our knowledge, a state-of-the-art efficiency of up to 63.94 GTEPS/W with a throughput of 97.80 GTEPS (billion traversed edges per second).
14:42 CET 11.4.4 RF-CGRA: A ROUTING-FRIENDLY CGRA WITH HIERARCHICAL REGISTER CHAINS
Speaker:
Dajiang Liu, Chongqing University, CN
Authors:
Rong Zhu, Bo Wang and Dajiang Liu, Chongqing University, CN
Abstract
CGRAs are promising architectures to accelerate domain-specific applications as they combine high energy efficiency and flexibility. With either isolated register files (RFs) or link-consuming distributed registers in each processing element (PE), existing CGRAs are unfriendly to data routing for data-flow graphs (DFGs) with a high edge/node ratio, since there are many multi-cycle dependences. To this end, this paper proposes a Routing-Friendly CGRA (RF-CGRA) in which hierarchical (intra-PE or inter-PE) register chains can be formed flexibly (a wide range of chain lengths) and compactly (consuming fewer links among PEs) for data routing, resulting in a new mapping problem that requires compiler support. Experimental results show that RF-CGRA achieves 1.19x the performance and 1.14x the energy efficiency of the state-of-the-art CGRA with single-cycle multi-hop connections (HyCUBE) while keeping compilation time moderate.
14:46 CET 11.4.5 PATHSEEKER: A FAST MAPPING ALGORITHM FOR CGRAS
Speaker:
Mahesh Balasubramanian, Arizona State University, US
Authors:
Mahesh Balasubramanian and Aviral Shrivastava, Arizona State University, US
Abstract
Coarse-grained reconfigurable arrays (CGRAs) have gained traction over the years as a low-power accelerator due to the efficient mapping of the compute-intensive loops onto the 2-D array by the CGRA compiler. When encountering a mapping failure for a given node, existing mapping techniques either exit and retry the mapping anew, or perform backtracking, i.e., recursively remove the previously mapped node to find a valid mapping. Abandoning mapping and starting afresh can deteriorate the quality of mapping and the compilation time. Even backtracking may not be the best choice since the previous node may not be the incorrectly placed node. To tackle this issue, we propose PathSeeker -- a mapping approach that analyzes mapping failures and performs local adjustments to the schedule to obtain a mapping. Experimental results on 35 top performance-critical loops from MiBench, Rodinia, and Parboil benchmark suites demonstrate that PathSeeker can map all of them with better mapping quality and dramatically less compilation time than the previous state-of-the-art approaches -- GraphMinor and RAMP, which were unable to map 20 and 5 loops, respectively. Over these benchmarks, PathSeeker achieves 28% better performance at 550x compilation speedup over GraphMinor and 3% better performance at 10x compilation speedup over RAMP on a 4x4 CGRA.
14:50 CET 11.4.6 IMPROVING TECHNOLOGY MAPPING FOR AIC-BASED FPGAS
Speaker:
Shubham Rai, TU Dresden, DE
Authors:
Martin Thümmler, Shubham Rai and Akash Kumar, TU Dresden, DE
Abstract
Commonly, LUTs are used in FPGAs as their main source of configurability. But these large multiplexers have only one output, and their area scales exponentially with the number of inputs. As a counterpart, And-Inverter Cones (AICs), cone-like structures of configurable gates, were proposed in 2012. AICs are not as flexibly configurable as LUTs, but they have several major benefits. First, their structure is inspired by And-Inverter Graphs, currently the predominant form for representing and optimizing digital hardware circuits. Second, they provide multiple outputs and are intrinsically fracturable, so logic duplication can be reduced; additionally, physical AICs can be split into multiple smaller ones without any additional hardware effort. Third, their area scales linearly with the exponentially increasing number of inputs. Moreover, a special form of AICs called Nand-Nor Cones (NNCs) can be implemented very efficiently, especially for the newly emerging RFET technologies. Technology mapping is one of the crucial tasks for releasing the full power of AIC-based FPGAs. In this work, current technology mapping algorithms are reviewed and the following improvements are proposed. First, instead of calculating the required time by choices, a direct required-time calculation method is presented, which ensures that every node is assigned a sensible required time. Second, it is shown that the priority-cut calculation method can be replaced by a much simpler direct cut-selection method with reduced runtime and similar quality of results. Third, a local subgraph balancing is proposed to reduce the sizes of the cones to which cuts get mapped. Combining all of these improvements leads to an average area reduction of over 20% on the MCNC benchmarks compared to the previous technology mapper, while not increasing the average circuit delay. Additionally, a mapping algorithm for NNCs with three inputs per gate is provided for the first time. Finally, the technology mapper is integrated open-source into the logic synthesis and verification system ABC.
14:54 CET 11.4.7 Q&A SESSION
Authors:
Michaela Blott1 and Shreejith Shanker2
1Xilinx, IE; 2Trinity College Dublin, IE
Abstract
Questions and answers with the authors

11.5 An Industrial Perspective on Autonomous Systems Design

Date: Thursday, 17 March 2022
Time: 14:30 - 15:30 CET

Session chair:
Rolf Ernst, TU Braunschweig, DE

Session co-chair:
Selma Saidi, TU Dortmund, DE

This session presents four talks from industry sharing current practices and perspectives on autonomous systems and their design. The session discusses several challenges related to software architecture solutions for safe and efficient operational autonomous systems, novel rule-based methods for guaranteeing safety, and requirements on autonomy infrastructure that is currently merging the CPS and IT domains.

Time Label Presentation Title
Authors
14:30 CET 11.5.1 SYMBIOTIC SAFETY: SAFE AND EFFICIENT HUMAN-MACHINE COLLABORATION BY UTILIZING RULES
Speaker:
Tasuku Ishigooka, Hitachi, Ltd., JP
Authors:
Tasuku Ishigooka, Hiroyuki Yamada, Satoshi Otsuka, Nobuyasu Kanekawa and Junya Takahashi, Hitachi, Ltd., JP
Abstract
Collaborative work between workers and autonomous systems in the same area is required to improve operation efficiency. However, collision risks arise from the coexistence of workers and autonomous systems. The safety functions of autonomous systems, such as emergency stops, can reduce these risks but may decrease operation efficiency. Therefore, we propose a novel safety concept called Symbiotic Safety. The concept improves both safety and operation efficiency by transforming the action plan, e.g., adjusting the action plan or updating safety rules, which reduces the frequency of risk occurrence and suppresses the efficiency loss due to safety functions. In this paper, we explain the symbiotic safety technologies and share the results of an evaluation experiment using our prototype system.
14:45 CET 11.5.2 A MIDDLEWARE JOURNEY FROM MICROCONTROLLERS TO MICROPROCESSORS
Speaker:
Alban Tamisier, Apex.AI, FR
Authors:
Michael Pöhnl, Alban Tamisier and Tobias Blaß, Apex.AI, DE
Abstract
This paper discusses some of the challenges we encountered when developing Apex.OS, an automotive grade version of the Robot Operating System (ROS) 2. To better understand these challenges, we look back at the best practices used for data communication and software execution in OSEK-based systems. Finally we describe the extensions made in ROS 2, Apex.OS and Apex.Middleware to meet the real-time constraints of the targeted automotive systems.
15:00 CET 11.5.3 RELIABLE DISTRIBUTED SYSTEMS
Speaker:
Philipp Mundhenk, Robert Bosch GmbH, DE
Authors:
Philipp Mundhenk, Arne Hamann, Andreas Heyl and Dirk Ziegenbein, Robert Bosch GmbH, DE
Abstract
The domains of Cyber-Physical Systems (CPSs) and Information Technology (IT) are converging. Driven by the need for increased compute performance, as well as the need for increased connectivity and runtime flexibility, IT hardware, such as microprocessors and Graphics Processing Units (GPUs), as well as software abstraction layers are introduced to CPS. These systems and components are being enhanced for the execution of hard real-time applications. This enables the convergence of embedded and IT: Embedded workloads can be executed reliably on top of IT infrastructure. This is the dawn of Reliable Distributed Systems (RDSs), a technology that combines the performance and cost of IT systems with the reliability of CPSs. The Fabric is a global RDS runtime environment, weaving the interconnections between devices and enabling abstractions for compute, communication, storage, sensing & actuation. This paper outlines the vision of RDS, introduces the aspects required for implementing RDSs and the Fabric, relates existing technologies, and outlines open research challenges.
15:15 CET 11.5.4 PAVE 360 - A PARADIGM SHIFT IN AUTONOMOUS DRIVING VERIFICATION WITH A DIGITAL TWIN
Speaker and Author:
Tapan Vikas, Siemens EDA GmbH, DE
Abstract
This talk will showcase the benefits of architectural exploration based on digital twin approaches, highlight the challenges involved in state-of-the-art digital twins, and briefly discuss hardware-software co-design challenges.

12.1 AI as a Driver for Innovative Applications

Date: Thursday, 17 March 2022
Time: 15:40 - 16:30 CET

Session chair:
Xun Jiao, Villanova University, US

Session co-chair:
Srinivas Katkoori, University of South Florida, US

This session exploits different AI architectures and methodologies to create innovative applications, with impact on several fields: from brain-inspired computing, through the Internet of Things, up to Industry 4.0.

Time Label Presentation Title
Authors
15:40 CET 12.1.1 (Best Paper Award Candidate)
ALGORITHM-HARDWARE CO-DESIGN FOR EFFICIENT BRAIN-INSPIRED HYPERDIMENSIONAL LEARNING ON EDGE
Speaker:
Yang Ni, University of California, Irvine, US
Authors:
Yang Ni1, Yeseong Kim2, Tajana S. Rosing3 and Mohsen Imani4
1University of California, Irvine, US; 2DGIST, KR; 3UCSD, US; 4University of California Irvine, US
Abstract
Machine learning methods have been widely utilized to provide high-quality results for many cognitive tasks. Running sophisticated learning tasks requires high computational cost to process large amounts of learning data. Brain-inspired Hyperdimensional (HD) computing has been introduced as an alternative solution for lightweight learning on edge devices. However, HD computing models still rely on accelerators to ensure real-time and efficient learning. These hardware designs are not commercially available and need a relatively long period to synthesize and fabricate after new applications are derived. In this paper, we propose an efficient framework for accelerating HD computing at the edge by fully utilizing the available computing power. We optimize HD computing through algorithm-hardware co-design of the host CPU and existing low-power machine learning accelerators, such as the Edge TPU. We interpret the lightweight HD learning model as a hyper-wide neural network to take advantage of the accelerator and machine learning platform. We further improve the runtime cost of training by employing a bootstrap aggregating algorithm called bagging while maintaining the learning quality. We evaluate the performance of the proposed framework with several applications. Joint experiments on a mobile CPU and the Edge TPU show that our framework achieves 4.5× faster training and 4.2× faster inference compared to the baseline platform. In addition, our framework achieves 19.4× faster training and 8.9× faster inference compared to an embedded ARM CPU (Raspberry Pi) with similar power consumption.
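The HD computing model family referred to above can be sketched in a few lines: items are high-dimensional random bipolar vectors, classes are elementwise-majority bundles of training vectors, and classification is similarity search. A toy illustration with invented dimensions and noise levels, not the paper's TPU-mapped model:

```python
# Toy sketch of the hyperdimensional computing model family: items are random
# bipolar hypervectors, classes are elementwise-majority bundles of training
# vectors, and classification is similarity search. Dimensions, noise levels,
# and class structure are invented; this is not the paper's TPU-mapped model.
import random

DIM = 10000
rng = random.Random(42)

def random_hv():
    return [rng.choice((-1, 1)) for _ in range(DIM)]

def noisy(hv, flips):
    out = hv[:]
    for i in rng.sample(range(DIM), flips):
        out[i] = -out[i]
    return out

def bundle(hvs):
    # Elementwise majority vote across the bundled hypervectors.
    return [1 if sum(col) >= 0 else -1 for col in zip(*hvs)]

def similarity(a, b):
    return sum(x * y for x, y in zip(a, b)) / DIM

base_a, base_b = random_hv(), random_hv()
proto_a = bundle([noisy(base_a, 1000) for _ in range(5)])  # class prototypes
proto_b = bundle([noisy(base_b, 1000) for _ in range(5)])

query = noisy(base_a, 2000)  # heavily corrupted class-A sample
pred = "A" if similarity(query, proto_a) > similarity(query, proto_b) else "B"
print(pred)  # "A"
```

Because every operation is an elementwise sum or product over a very wide vector, the model maps naturally onto matrix-oriented hardware, which is the observation behind the paper's "hyper-wide neural network" interpretation.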
15:44 CET 12.1.2 POISONHD: POISON ATTACK ON BRAIN-INSPIRED HYPERDIMENSIONAL COMPUTING
Speaker:
Xun Jiao, Villanova University, US
Authors:
Ruixuan Wang1 and Xun Jiao2
1Villanova University, US; 2Villanova University, US
Abstract
While machine learning (ML) methods, especially deep neural networks (DNNs), promise enormous societal and economic benefits, their deployment presents daunting challenges due to intensive computational demands and high storage requirements. Brain-inspired hyperdimensional computing (HDC) has recently been introduced as an alternative computational model that mimics the ``human brain'' at the functionality level. HDC has already demonstrated promising accuracy and efficiency in multiple application domains including healthcare and robotics. However, the robustness and security aspects of HDC have not been systematically investigated or sufficiently examined. Poisoning is a common attack on various ML models, including DNNs: it injects noise into the labels of training data to introduce classification errors. This paper presents PoisonHD, an HDC-specific poison attack framework that maximizes its effectiveness in degrading classification accuracy by leveraging the internal structural information of HDC models. By applying PoisonHD to three datasets, we show that PoisonHD can cause a significantly greater accuracy drop in HDC models than a random label-flipping approach. We further develop a defense mechanism, an HDC-based data sanitization, that can fully recover the accuracy loss caused by the poison attack. To the best of our knowledge, this is the first paper to study poison attacks on HDC models.
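As a point of reference for the random label-flipping baseline mentioned above, the following toy shows how flipping a fraction of training labels degrades a simple k-NN classifier. The 1-D data, seed, and classifier are invented; PoisonHD itself exploits HDC-specific model structure rather than flipping labels at random.

```python
# Toy demonstration of the random label-flipping poisoning baseline: flipping
# a fraction of training labels degrades a simple k-NN classifier. The 1-D
# data, seed, and classifier are invented for illustration; PoisonHD itself
# exploits HDC-specific model structure.
import random

rng = random.Random(7)

def make_data(n, center, label):
    return [(rng.gauss(center, 0.5), label) for _ in range(n)]

train = make_data(200, -2.0, 0) + make_data(200, 2.0, 1)
test = make_data(100, -2.0, 0) + make_data(100, 2.0, 1)

def knn_predict(data, x, k=5):
    neighbours = sorted(data, key=lambda p: abs(p[0] - x))[:k]
    return 1 if sum(label for _, label in neighbours) * 2 > k else 0

def accuracy(eval_set, train_set):
    hits = sum(1 for x, y in eval_set if knn_predict(train_set, x) == y)
    return hits / len(eval_set)

def flip_labels(data, fraction):
    poisoned = data[:]
    for i in rng.sample(range(len(poisoned)), int(fraction * len(poisoned))):
        x, y = poisoned[i]
        poisoned[i] = (x, 1 - y)
    return poisoned

clean_acc = accuracy(test, train)
poisoned_acc = accuracy(test, flip_labels(train, 0.4))
print(clean_acc, poisoned_acc)  # poisoning drops accuracy substantially
```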
15:48 CET 12.1.3 AIME: WATERMARKING AI MODELS BY LEVERAGING ERRORS
Speaker:
Dhwani Mehta, University of Florida, US
Authors:
Dhwani Mehta, Nurun Mondol, Farimah Farahmandi and Mark Tehranipoor, University of Florida, US
Abstract
The recent evolution of deep neural networks (DNNs) has made running complex data analytics tasks feasible, ranging from natural language processing and object detection to autonomous cars, artificial intelligence (AI) warfare, cloud, healthcare, industrial robots, and edge devices. The benefits of AI are indisputable. However, there are several concerns regarding the security of deployed AI models, such as reverse engineering and Intellectual Property (IP) piracy. Accumulating a sufficiently large amount of data, building, training, and improving the model accuracy, and finally deploying the model requires immense human and computational power, making the process expensive. Therefore, it is of utmost importance to protect the model against IP infringement. We propose AIME, a novel watermarking framework that captures model inaccuracy during the training phase and converts it into an owner-specific unique signature. The watermark is embedded within the class mispredictions of the DNN model. Watermark extraction is performed when the model is queried with an owner-specific sequence of key inputs, and the signature is decoded from the sequence of model predictions. AIME works with negligible watermark embedding runtime overhead while preserving the accurate functionality of the DNN. We have performed a comprehensive evaluation of AIME with models on the MNIST, Fashion-MNIST, and CIFAR-10 datasets and corroborated its effectiveness, robustness, and performance.
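The query-and-decode extraction step described above can be illustrated generically. This is a hedged sketch: `extract_signature`, `verify_ownership`, and the matching threshold are illustrative assumptions, not AIME's actual decoding.

```python
def extract_signature(model_predict, key_inputs):
    """Query the model with the owner's key inputs; the sequence of
    (mis)predictions forms the decoded signature."""
    return tuple(model_predict(x) for x in key_inputs)

def verify_ownership(model_predict, key_inputs, registered_signature,
                     threshold=0.9):
    """Claim ownership if enough predictions match the registered signature."""
    sig = extract_signature(model_predict, key_inputs)
    matches = sum(a == b for a, b in zip(sig, registered_signature))
    return matches / len(registered_signature) >= threshold
```

A threshold below 1.0 would tolerate a few prediction changes, e.g. after benign fine-tuning of the watermarked model.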
15:52 CET 12.1.4 THINGNET: A LIGHTWEIGHT REAL-TIME MIRAI IOT VARIANTS HUNTER THROUGH CPU POWER FINGERPRINTING
Speaker:
Zhuoran Li, Old Dominion University, US
Authors:
Zhuoran Li and Danella Zhao, Old Dominion University, US
Abstract
Internet of Things (IoT) devices have become attractive targets of cyber criminals, with attackers leveraging these vulnerable devices most notably via the infamous Mirai-based botnets, which accounted for nearly 90% of IoT malware attacks in 2020. In this work, we propose a robust, universal and non-invasive Mirai-based malware detection engine employing a compact deep neural network architecture. Our design allows programmatic collection of CPU power footprints with integrated current sensors under various device states, such as idle, service and attack. A lightweight online inference model is deployed in the CPU for on-the-fly classification. Our model is robust against noisy environments thanks to a lucid design of the noise reduction function. This work appears to be the first step towards a viable CPU malware detection engine based on power fingerprinting. An extensive simulation study under the ARM architecture, which is widely used in IoT devices, demonstrates a high detection accuracy of 99.1% with a detection latency below 1 ms. By analyzing Mirai-based infection under distinguishable phases for power feature extraction, our model further demonstrates an accuracy of 96.3% on the detection of unknown variants.
15:56 CET 12.1.5 M2M-ROUTING: ENVIRONMENTAL ADAPTIVE MULTI-AGENT REINFORCEMENT LEARNING BASED MULTI-HOP ROUTING POLICY FOR SELF-POWERED IOT SYSTEMS
Speaker:
Wen Zhang, Texas A&M- Corpus Christi, US
Authors:
Wen Zhang1, Jun Zhang2, Mimi Xie3, Tao Liu4, Wenlu Wang1 and Chen Pan5
1Texas A&M University--Corpus Christi, US; 2Harvard University, US; 3University of Texas at San Antonio, US; 4Lawrence Technological University, US; 5Texas A&M University-Corpus Christi, US
Abstract
Energy harvesting (EH) technologies facilitate the trending proliferation of IoT devices with sustainable power supplies. However, the intrinsically weak and unstable nature of EH results in frequent and unpredictable power interruptions in EH IoT devices, which further causes unpleasant packet loss or reconnection failures in the IoT network. Therefore, conventional routing and energy allocation methods are inefficient in EH environments. The complexity of the EH environment poses a stumbling block to intelligent routing policy and energy allocation. To address these problems, this work proposes an environment-adaptive Deep Reinforcement Learning (DRL)-based multi-hop routing policy, M2M-Routing, to jointly optimize energy allocation and routing policy, which conquers these challenges by leveraging offline computation resources. We prepare multiple models offline for the complicated energy harvesting environment. By searching for a similar historical power trace to identify the model ID, the prepared DRL model is selected to manage energy allocation and routing policy for the query power traces. Simulation results indicate that M2M-Routing improves the amount of data delivered by 3 to 4 times compared with baselines.
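The model-selection idea, matching the query power trace against stored historical traces to pick a prepared DRL model, can be sketched as follows. The distance metric and the dictionary layout are assumptions for illustration, not the paper's design.

```python
def nearest_trace_model(query_trace, model_traces):
    """Select the offline-prepared model whose stored power trace is
    closest (squared L2 distance) to the query trace."""
    def dist(trace):
        return sum((a - b) ** 2 for a, b in zip(trace, query_trace))
    return min(model_traces, key=lambda model_id: dist(model_traces[model_id]))
```

Here `model_traces` maps a model ID to the representative power trace it was trained under; the returned ID selects which prepared policy to run online.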
16:00 CET 12.1.6 Q&A SESSION
Authors:
Xun Jiao1 and Srinivas Katkoori2
1Villanova University, US; 2University of South Florida, US
Abstract
Questions and answers with the authors

12.2 Applications of optimized quantum and probabilistic circuits in emergent computing systems

Date: Thursday, 17 March 2022
Time: 15:40 - 16:30 CET

Session chair:
Giulia Meuli, Synopsys, IT

Session co-chair:
Yvain Thonnart, CEA, FR

Emerging computing platforms such as near-term quantum computers are currently based on the execution of circuits over a gate set that is specific to the corresponding hardware platform. These systems are still strongly affected by noise and decoherence, and the success of such calculations depends strongly on circuit depth and the effort required to input data. This session discusses the use of classical and machine learning approaches to optimize these circuits both in complexity and in noise resilience.

Time Label Presentation Title
Authors
15:40 CET 12.2.1 MUZZLE THE SHUTTLE: EFFICIENT COMPILATION FOR MULTI-TRAP TRAPPED-ION QUANTUM COMPUTERS
Speaker:
Abdullah Ash Saki, Pennsylvania State University, US
Authors:
Abdullah Ash Saki1, Rasit Onur Topaloglu2 and Swaroop Ghosh1
1Pennsylvania State University, US; 2IBM, US
Abstract
Trapped-ion systems can have a limited number of ions (qubits) in a single trap. Increasing the qubit count to run meaningful quantum algorithms would require multiple traps where ions need to shuttle between traps to communicate. The existing compiler has several limitations, which result in a high number of shuttle operations and degraded fidelity. In this paper, we target this gap and propose compiler optimizations to reduce the number of shuttles. Our technique achieves a maximum reduction of 51.17% in shuttles (average ~ 33%) tested over 125 circuits. Furthermore, the improved compilation enhances the program fidelity up to 22.68X with a modest increase in the compilation time.
15:44 CET 12.2.2 CIRCUITS FOR MEASUREMENT BASED QUANTUM STATE PREPARATION
Speaker:
Niels Gleinig, ETH Zurich, DE
Authors:
Niels Gleinig and Torsten Hoefler, ETH Zürich, CH
Abstract
In quantum computing, state preparation is the problem of synthesizing circuits that initialize quantum systems to specific states. It has been shown that there are states that require circuits of exponential size to be prepared (when not using measurements), and consequently, despite extensive research on this problem, the existing computer-aided design (CAD) methods produce circuits of exponential size. This is even the case for the methods that solve this problem on the important subclass of uniform states, which for example need to be prepared when using Quantum Simulated Annealing algorithms to solve combinatorial optimization problems. In this paper, we show how CAD based state preparation can be made scalable by using techniques that are unique to quantum computing: amplitude amplification, measurements, and the resulting state collapses. With this approach, we are able to produce wide classes of states in polynomial time, resulting in an exponential improvement over existing CAD methods.
15:48 CET 12.2.3 OPTIC: A PRACTICAL QUANTUM BINARY CLASSIFIER FOR NEAR-TERM QUANTUM COMPUTERS
Speaker:
Daniel Silver, Northeastern University, US
Authors:
Tirthak Patel, Daniel Silver and Devesh Tiwari, Northeastern University, US
Abstract
Quantum computers can theoretically speed up optimization workloads such as variational machine learning and classification workloads over classical computers. However, in practice, proposed variational algorithms have not been able to run on existing quantum computers for practical-scale problems owing to their error-prone hardware. We propose OPTIC, a framework to effectively execute quantum binary classification on real noisy intermediate-scale quantum (NISQ) computers.
15:52 CET 12.2.4 SCALABLE VARIATIONAL QUANTUM CIRCUITS FOR AUTOENCODER-BASED DRUG DISCOVERY
Speaker:
Junde Li, Pennsylvania State University, US
Authors:
Junde Li and Swaroop Ghosh, Pennsylvania State University, US
Abstract
The de novo design of drug molecules is recognized as a time-consuming and costly process, and computational approaches have been applied in each stage of the drug discovery pipeline. The variational autoencoder is one of the computer-aided design methods that explores the chemical space based on an existing molecular dataset. Quantum machine learning has emerged as an atypical learning method that may speed up some classical learning tasks because of its strong expressive power. However, near-term quantum computers suffer from a limited number of qubits, which hinders representation learning in high-dimensional spaces. We present a scalable quantum generative autoencoder (SQ-VAE) for simultaneously reconstructing and sampling drug molecules, and a corresponding vanilla variant (SQ-AE) for better reconstruction. Architectural strategies in hybrid quantum-classical networks, such as adjustable quantum layer depth, heterogeneous learning rates, and patched quantum circuits, are proposed to learn high-dimensional datasets such as ligand-targeted drugs. Extensive experimental results are reported for different dimensions, including 8x8 and 32x32, after choosing suitable architectural strategies. The performance of the quantum generative autoencoder is compared with its classical counterpart throughout all experiments. The results show that quantum computing advantages can be achieved for normalized low-dimension molecules, and that high-dimension molecules generated from quantum generative autoencoders have better drug properties within the same learning period.
15:56 CET 12.2.5 TOWARDS LOW-COST HIGH-ACCURACY STOCHASTIC COMPUTING ARCHITECTURE FOR UNIVARIATE FUNCTIONS: DESIGN AND DESIGN SPACE EXPLORATION
Speaker:
Kuncai Zhong, Shanghai Jiao Tong University, CN
Authors:
Kuncai Zhong, Zexi Li and Weikang Qian, Shanghai Jiao Tong University, CN
Abstract
Univariate functions are widely used. Several recent works propose to implement them by an unconventional computing paradigm, stochastic computing (SC). However, existing SC designs either have a high hardware cost due to the area-consuming randomizer or suffer low accuracy. In this work, we propose a low-cost high-accuracy SC architecture for univariate functions. It consists of only a single stochastic number generator and a minimum number of D flip-flops. We also apply three methods, random number source (RNS) negating, RNS scrambling, and input scrambling, to improve the accuracy of the architecture. To efficiently configure the architecture to achieve high accuracy, we further propose a design space exploration algorithm. The experimental results show that compared to the conventional architecture, the area of the proposed architecture is reduced by up to 76%, while its accuracy is close to, and sometimes even higher than, that of the conventional architecture.
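For readers unfamiliar with SC basics, the role of the stochastic number generator (SNG) can be illustrated with the classic unipolar encoding, where a bitwise AND of two independent bitstreams multiplies the encoded values. This is a textbook sketch, not the proposed architecture.

```python
import random

def stochastic_number_generator(p, length, seed):
    """SNG: each cycle, compare a pseudo-random source against p;
    the value is encoded as the fraction of 1s in the bitstream."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(length)]

def sc_multiply(stream_a, stream_b):
    """In unipolar SC, a bitwise AND of two independent streams
    multiplies the encoded probabilities."""
    return [a & b for a, b in zip(stream_a, stream_b)]

def decode(stream):
    """Recover the encoded value as the fraction of 1s."""
    return sum(stream) / len(stream)
```

The accuracy of such designs depends on stream length and on the correlation between streams, which is why the paper's RNS negating/scrambling methods matter: they reshape correlations without adding more SNGs.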
16:00 CET 12.2.6 Q&A SESSION
Authors:
Giulia Meuli1 and Yvain Thonnart2
1Synopsys, IT; 2CEA-Leti, FR
Abstract
Questions and answers with the authors

12.3 Reliable, safe, and approximate systems

Date: Thursday, 17 March 2022
Time: 15:40 - 16:30 CET

Session chair:
Angeliki Kritikakou, IRISA, FR

Session co-chair:
Marcello Traiola, INRIA, FR

This session presents techniques for reliable, safe, and approximate computing across many different architectures, ranging from traditional systems to neural network accelerators and hyperdimensional computing.

Time Label Presentation Title
Authors
15:40 CET 12.3.1 (Best Paper Award Candidate)
DO TEMPERATURE AND HUMIDITY EXPOSURES HURT OR BENEFIT YOUR SSDS?
Speaker:
Adnan Maruf, Florida International University, US
Authors:
Adnan Maruf1, Sashri Brahmakshatriya1, Baolin Li2, Devesh Tiwari2, Gang Quan1 and Janki Bhimani1
1Florida International University, US; 2Northeastern University, US
Abstract
SSDs are becoming mainstream data storage devices, replacing HDDs in most data centers, consumer goods, and IoT gadgets. In this work, we ask an uncharted research question: What is the environmental conditions' impact on SSD performance? To answer it, we systematically measure, quantify, and characterize the impact of various commonly changing environmental conditions such as temperature and humidity on the performance of SSDs. Our experiments and analysis uncover that exposure to changes in temperature and humidity can significantly affect SSD performance.
15:44 CET 12.3.2 SAFEDM: A HARDWARE DIVERSITY MONITOR FOR REDUNDANT EXECUTION ON NON-LOCKSTEPPED CORES
Speaker:
Francisco Bas, Barcelona Supercomputing Center (BSC), Universitat Politècnica de Catalunya (UPC), ES
Authors:
Francisco Bas1, Pedro Benedicte2, Sergi Alcaide1, Guillem Cabo2, Fabio Mazzocchetti2 and Jaume Abella2
1Universitat Politècnica de Catalunya - Barcelona Supercomputing Center, ES; 2Barcelona Supercomputing Center, ES
Abstract
Computing systems in the safety domain, such as those in avionics or space, require specific safety measures related to the criticality of the deployment. A problem these systems face is that of transient failures in hardware. A solution commonly used to tackle potential failures is to introduce redundancy in these systems, for example 2 cores that execute the same program at the same time. However, redundancy does not solve all potential failures, such as Common Cause Failures (CCF), where a single fault affects both cores identically (e.g. a voltage droop). If both redundant cores have identical state when the fault occurs, then there may be a CCF since the fault can affect both cores in the same way. To avoid CCF it is critical to know that there is diversity in the execution amongst the redundant cores. In this paper we introduce SafeDM, a hardware Diversity Monitor that quantifies the diversity of each redundant processor to guarantee that CCF will not go unnoticed, and without needing to deploy lockstepped cores. SafeDM computes data and instruction diversity separately, using different techniques appropriate for each case. We integrate SafeDM in a RISC-V FPGA space MPSoC from Cobham Gaisler where SafeDM is proven effective with a large benchmark suite, incurring low area and power overheads. Overall, SafeDM is an effective hardware solution to quantify diversity in cores performing redundant execution.
15:48 CET 12.3.3 IS APPROXIMATION UNIVERSALLY DEFENSIVE AGAINST ADVERSARIAL ATTACKS IN DEEP NEURAL NETWORKS?
Speaker:
Ayesha Siddique, University of Missouri, US
Authors:
Ayesha Siddique and Khaza Anuarul Hoque, University of Missouri, US
Abstract
Approximate computing is known for its effectiveness in improving the energy efficiency of deep neural network (DNN) accelerators at the cost of slight accuracy loss. Very recently, the inexact nature of approximate components, such as approximate multipliers, has also been reported to be successful in defending against adversarial attacks on DNN models. Since the approximation errors traverse the DNN layers as masked or unmasked, this raises a key research question: can approximate computing always offer a defense against adversarial attacks in DNNs, i.e., is it universally defensive? Towards this, we present an extensive adversarial robustness analysis of different approximate DNN accelerators (AxDNNs) using state-of-the-art approximate multipliers. In particular, we evaluate the impact of ten adversarial attacks on different AxDNNs using the MNIST and CIFAR-10 datasets. Our results demonstrate that adversarial attacks on AxDNNs can cause a 53% accuracy loss, whereas the same attack may lead to almost no accuracy loss (as low as 0.06%) in the accurate DNN. Thus, approximate computing cannot be referred to as a universal defense strategy against adversarial attacks.
15:52 CET 12.3.4 RELIABILITY ANALYSIS OF A SPIKING NEURAL NETWORK HARDWARE ACCELERATOR
Speaker:
Theofilos Spyrou, Sorbonne University, CNRS, LIP6, FR
Authors:
Theofilos Spyrou1, Sarah A. Elsayed1, Engin Afacan2, Luis A. Camuñas Mesa3, Barnabé Linares-Barranco3 and Haralampos-G. Stratigopoulos1
1Sorbonne Université, CNRS, LIP6, FR; 2Gebze TU, TR; 3IMSE-CNM, CSIC, University of Sevilla, ES
Abstract
Despite the parallelism and sparsity in neural network models, their transfer into hardware unavoidably makes them susceptible to hardware-level faults. Hardware-level faults can occur either during manufacturing, such as physical defects and process-induced variations, or in the field due to environmental factors and aging. The performance under fault scenarios needs to be assessed so as to develop cost-effective fault-tolerance schemes. In this work, we assess the resilience characteristics of a hardware accelerator for Spiking Neural Networks (SNNs) designed in VHDL and implemented on an FPGA. The fault injection experiments pinpoint the parts of the design that need to be protected against faults, as well as the parts that are inherently fault-tolerant.
15:56 CET 12.3.5 RELIABILITY OF GOOGLE’S TENSOR PROCESSING UNITS FOR EMBEDDED APPLICATIONS
Speaker:
Rubens Luiz Rech Junior, Institute of Informatics, UFRGS, BR
Authors:
Rubens Luiz Rech Junior1 and Paolo Rech2
1UFRGS, BR; 2LANL/UFRGS, US
Abstract
Convolutional Neural Networks (CNNs) have become the most used and efficient way to identify and classify objects in a scene. CNNs are today fundamental not only for autonomous vehicles, but also for the Internet of Things (IoT) and smart cities or smart homes. Vendors are developing low-power, efficient, and low-cost dedicated accelerators to allow the execution of computationally demanding CNNs even in embedded applications with strict power and cost budgets. Google's Coral Tensor Processing Unit (TPU) is one of the latest low-power accelerators for CNNs. In this paper we investigate the reliability of TPUs to atmospheric neutrons, reporting experimental data equivalent to more than 30 million years of natural irradiation. We analyze the behavior of TPUs executing atomic operations (standard or depthwise convolutions) with increasing input sizes as well as eight CNN designs typical of embedded applications, including transfer learning and reduced data-set configurations. We found that, despite the high error rate, most neutron-induced errors only slightly modify the convolution output and do not change the CNN's detection or classification. By reporting details about the fault model and error rate, we provide valuable information on how to evaluate and improve the reliability of CNNs executed on a TPU.
16:00 CET 12.3.6 Q&A SESSION
Authors:
Angeliki Kritikakou1 and Marcello Traiola2
1Univ Rennes, Inria, CNRS, IRISA, FR; 2Inria / IRISA, FR
Abstract
Questions and answers with the authors

12.4 Raising Performance and Reliability of the Memory Subsystem

Date: Thursday, 17 March 2022
Time: 15:40 - 16:30 CET

Session chair:
Leonidas Kosmidis, Barcelona Supercomputing Center, ES

Session co-chair:
Thaleia Dimitra Doudali, IMDEA Software Institute, ES

Performance and reliability are important considerations for modern architectures. This session includes papers addressing these concerns with novel physical design paradigms in EDA and emerging memory technologies. The first three papers lie in the intersection of architecture and physical design with 3D stacking, Silicon Carbide and an automated flow for taping-out GPU designs. The next three papers include a solution for an adaptive error correction scheme in DRAM, a reduced latency logging for crash recovery in systems based on persistent memory, as well as an in-memory accelerator based on Resistive RAM for bioinformatics.

Time Label Presentation Title
Authors
15:40 CET 12.4.1 STEALTH ECC: A DATA-WIDTH AWARE ADAPTIVE ECC SCHEME FOR DRAM ERROR RESILIENCE
Speaker:
Young Seo Lee, Korea University, KR
Authors:
Young Seo Lee1, Gunjae Koo1, Young-Ho Gong2 and Sung Woo Chung1
1Korea University, KR; 2KwangWoon University, KR
Abstract
As DRAM process technology scales down and DRAM density continues to grow, DRAM errors have become a primary concern in modern data centers. Typically, data centers have adopted memory systems with a single error correction double error detection (SECDED) code. However, the SECDED code is not sufficient to satisfy DRAM reliability demands as memory systems become more vulnerable. Though the servers in data centers employ strong ECC schemes such as Chipkill, such ECC schemes lead to substantial performance and/or storage overhead. In this paper, we propose Stealth ECC, a cost-effective memory protection scheme providing stronger error correctability than the conventional SECDED code, with negligible performance overhead and without storage overhead. Depending on the data width (either narrow-width or full-width), Stealth ECC adaptively selects ECC schemes. For narrow-width values, Stealth ECC provides multi-bit error correctability by storing extra parity bits on the MSB side instead of zeros. Furthermore, with bitwise interleaved data placement between x4 DRAM chips, Stealth ECC is robust to a single DRAM chip error for narrow-width values. On the other hand, for full-width values, Stealth ECC adopts the SECDED code, which maintains DRAM reliability comparable to the conventional SECDED code. As a result, thanks to the reliability improvement of narrow-width values, Stealth ECC enhances overall DRAM reliability, while incurring negligible performance overhead as well as no storage overhead. Our simulation results show that Stealth ECC reduces the probability of system failure (caused by DRAM errors) by 47.9% on average, with only 0.9% performance overhead compared to the conventional SECDED code.
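The data-width test at the heart of such a scheme can be sketched simply: a word is narrow-width when its upper bits are pure sign extension, which frees the MSB side for extra parity. This is illustrative only; the 64/16-bit split and the function name are assumptions, not Stealth ECC's actual layout.

```python
def is_narrow_width(word, total_bits=64, data_bits=16):
    """A stored value is 'narrow-width' when its upper bits are pure sign
    extension, so the MSB side can hold extra parity instead of zeros/ones.
    The 64/16 split is an illustrative assumption."""
    word &= (1 << total_bits) - 1          # treat as an unsigned machine word
    sign = (word >> (data_bits - 1)) & 1   # sign bit of the narrow field
    upper = word >> data_bits              # bits that would hold parity
    expected = (1 << (total_bits - data_bits)) - 1 if sign else 0
    return upper == expected
```

An adaptive scheme would apply a strong multi-bit code when this test passes and fall back to SECDED otherwise.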
15:44 CET 12.4.2 ACCELERATE HARDWARE LOGGING TO EFFICIENTLY GUARANTEE PM CRASH CONSISTENCY
Speaker:
Zhiyuan Lu, Michigan Tech. University, US
Authors:
Zhiyuan Lu1, Jianhui Yue1, Yifu Deng1 and Yifeng Zhu2
1Michigan Tech. University, US; 2University of Maine, US
Abstract
While logging has been adopted in persistent memory (PM) to support crash consistency, logging incurs severe performance overhead. This paper discovers two common factors that contribute to the inefficiency of logging: (1) load imbalance among memory banks, and (2) constraints of intra-record ordering. Over-loaded memory banks may significantly prolong the waiting time of log requests targeting these banks. To address this issue, we propose a novel log entry allocation scheme (LALEA) that reshapes the traffic distribution over PM banks. In addition, the intra-record ordering between a header and its log entries decreases the degree of parallelism in log operations. We design a log metadata buffering scheme (BLOM) that totally eliminates the intra-record ordering constraints. These two proposed log optimizations are general and can be applied to many existing designs. We evaluate our designs using both micro-benchmarks and real PM applications. Our experimental results show that LALEA and BLOM can achieve 54.04% and 17.16% higher transaction throughput on average, compared to two state-of-the-art designs, respectively.
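The load-balancing intuition behind the log entry allocation scheme can be illustrated with a greedy least-loaded allocator. This is a conceptual sketch, not the paper's hardware design; the heap-based policy is an assumption.

```python
import heapq

def allocate_log_entries(entry_sizes, n_banks):
    """Greedy least-loaded allocation: place each log entry on the PM bank
    with the smallest outstanding load, spreading traffic across banks."""
    heap = [(0, bank) for bank in range(n_banks)]  # (load, bank_id)
    heapq.heapify(heap)
    placement = []
    for size in entry_sizes:
        load, bank = heapq.heappop(heap)
        placement.append(bank)
        heapq.heappush(heap, (load + size, bank))
    return placement
```

Compared with a fixed address-interleaved mapping, a load-aware policy keeps any single over-loaded bank from stalling the log requests queued behind it.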
15:48 CET 12.4.3 (Best Paper Award Candidate)
MEMPOOL-3D: BOOSTING PERFORMANCE AND EFFICIENCY OF SHARED-L1 MEMORY MANY-CORE CLUSTERS WITH 3D INTEGRATION
Speaker:
Matheus Cavalcante, ETH Zürich, CH
Authors:
Matheus Cavalcante1, Anthony Agnesina2, Samuel Riedel1, Moritz Brunion3, Alberto Garcia-Ortiz4, Dragomir Milojevic5, Francky Catthoor5, Sung Kyu Lim2 and Luca Benini6
1ETH Zürich, CH; 2Georgia Tech, US; 3University of Bremen, DE; 4ITEM (U.Bremen), DE; 5IMEC, BE; 6Università di Bologna and ETH Zürich, IT
Abstract
Three-dimensional integrated circuits promise power, performance, and footprint gains compared to their 2D counterparts, thanks to drastic reductions in the interconnects' length through their smaller form factor. We can leverage the potential of 3D integration by enhancing MemPool, an open-source many-core design with 256 cores and a shared pool of L1 scratchpad memory connected with a low-latency interconnect. MemPool's baseline 2D design is severely limited by routing congestion and wire propagation delay, making the design ideal for 3D integration. In architectural terms, we increase MemPool's scratchpad memory capacity beyond the sweet spot for 2D designs, improving performance in a common digital signal processing kernel. We propose a 3D MemPool design that leverages a smart partitioning of the memory resources across two layers to balance the size and utilization of the stacked dies. In this paper, we explore the architectural and the technology parameter spaces by analyzing the power, performance, area, and energy efficiency of MemPool instances in 2D and 3D with 1 MiB, 2 MiB, 4 MiB, and 8 MiB of scratchpad memory in a commercial 28 nm technology node. We observe a performance gain of 9.1 % when running a matrix multiplication on the MemPool-3D design with 4 MiB of scratchpad memory compared to the MemPool 2D counterpart. In terms of energy efficiency, we can implement the MemPool-3D instance with 4 MiB of L1 memory on an energy budget 15 % smaller than its 2D counterpart, and even 3.7 % smaller than the MemPool-2D instance with one-fourth of the L1 scratchpad memory capacity.
15:52 CET 12.4.4 REPAIR: A RERAM-BASED PROCESSING-IN-MEMORY ACCELERATOR FOR INDEL REALIGNMENT
Speaker:
Chin-Fu Nien, Academia Sinica, TW
Authors:
Ting Wu1, Chin-Fu Nien2, Kuang-Chao Chou3 and Hsiang-Yun Cheng2
1Electrical and Computer Engineering, Carnegie Mellon University, US; 2Academia Sinica, TW; 3Graduate Institute of Electronics Engineering, National Taiwan University, TW
Abstract
Genomic analysis has attracted a lot of interest recently since it is the key to realizing precision medicine for diseases such as cancer. Among all the genomic analysis pipeline stages, Indel Realignment is the most time-consuming and induces intensive data movements. Thus, we propose RePAIR, the first ReRAM-based processing-in-memory accelerator targeting the Indel Realignment algorithm. To further increase the computation parallelism, we design several mapping and scheduling optimization schemes. RePAIR achieves 7443x speedup and is 27211x more energy efficient over the GATK3.8 running on a CPU server, significantly outperforming the state-of-the-art.
15:56 CET 12.4.5 SIC PROCESSORS FOR EXTREME HIGH-TEMPERATURE VENUS SURFACE EXPLORATION
Speaker:
Heewoo Kim, University of Michigan, Ann Arbor, US
Authors:
Heewoo Kim, Javad Bagherzadeh and Ronald Dreslinski, University of Michigan, US
Abstract
Being the ‘sister planet’ of the Earth, surface exploration of Venus is expected to provide valuable scientific insights into the history and the environment of the Earth. Despite the benefits, the surface temperature of Venus, at 450 °C, poses a large challenge for any surface exploration. In particular, conventional silicon electronics do not function properly under such high temperatures. Due to this constraint, the longest previous surface exploration lasted only 2 hours. Silicon Carbide (SiC) electronics, which can endure and function properly in high-temperature environments, has been proposed as a strong candidate for Venus surface explorations. However, this technology is still immature and comes with limiting factors, such as slower speed, power constraints, limited die area, and channels approximately 1,000 times longer than those of state-of-the-art Si transistors. In this paper, we configure a computing infrastructure for high-temperature SiC-based technology, conduct design space exploration, and evaluate the performance of different SiC processors when used in Venus surface landers. Our evaluation shows that the SiC processor has on average 16.6X lower throughput than the RAD6000 Si processor used in the previous Mars rover. The Venus rover with a SiC processor is expected to have a moving speed of 0.6 meters per hour and a visual odometry processing time of 50 minutes. Lastly, we provide design guidelines to improve SiC processors at the microarchitecture and instruction set architecture levels.
16:00 CET 12.4.6 Q&A SESSION
Authors:
Leonidas Kosmidis1 and Thaleia Dimitra Doudali2
1Universitat Politècnica de Catalunya - Barcelona Supercomputing Center, ES; 2IMDEA Software Institute, ES
Abstract
Questions and answers with the authors

12.5 Bringing Robust Deep Learning to the Autonomous Edge: New Challenges and Algorithm-Hardware Solutions

Date: Thursday, 17 March 2022
Time: 15:40 - 16:30 CET

Session chair:
Dirk Ziegenbein, Robert Bosch GmbH, DE

Session co-chair:
Chung-Wei Lin, National Taiwan University, TW

Deep neural networks (DNNs) are being continually deployed in autonomous edge systems for many applications, such as speech recognition, image classification, and object detection. While DNNs have proven to be effective in handling these tasks, their robustness (i.e., accuracy) can suffer post-deployment at the edge. Moreover, designing robust deep learning algorithms for the autonomous edge is highly challenging because such systems are severely resource-constrained. This session includes four invited talks that present the challenges and propose novel, lightweight algorithm-hardware co-design methods to improve DNN robustness at the edge. The first paper evaluates the effectiveness of various unsupervised DNN adaptation methods on real-world edge systems, and selects the best technique in terms of accuracy, performance and energy. The second paper explores a lightweight image super-resolution technique to prevent adversarial attacks, which is also characterized on an Arm neural processing unit. The third paper tackles the loss in DNN prediction accuracy in resistive memory-based in-memory accelerators by proposing a stochastic fault-tolerant training scheme. The final paper focuses on robust distributed reinforcement learning for swarm intelligence, where it analyzes and mitigates the effect of various transient and permanent faults.

Time Label Presentation Title
Authors
15:40 CET 12.5.1 UNSUPERVISED TEST-TIME ADAPTATION OF DEEP NEURAL NETWORKS AT THE EDGE: A CASE STUDY
Speaker:
Kshitij Bhardwaj, Lawrence Livermore National Laboratory, US
Authors:
Kshitij Bhardwaj, James Diffenderfer, Bhavya Kailkhura and Maya Gokhale, LLNL, US
Abstract
Deep learning is being increasingly used in mobile and edge autonomous systems. The prediction accuracy of deep neural networks (DNNs), however, can degrade after deployment due to encountering data samples whose distributions are different from the training samples. To continue to robustly predict, DNNs must be able to adapt themselves post-deployment. Such adaptation at the edge is challenging as new labeled data may not be available, and it has to be performed on a resource-constrained device. This paper performs a case study to evaluate the cost of test-time fully unsupervised adaptation strategies on a real-world edge platform: Nvidia Jetson Xavier NX. In particular, we adapt pretrained state-of-the-art robust DNNs (trained using data augmentation) to improve the accuracy on image classification data that contains various image corruptions. During this prediction-time on-device adaptation, the model parameters of a DNN are updated using a single backpropagation pass while optimizing the entropy loss. The effects of the following three simple model updates are compared in terms of accuracy, adaptation time and energy: updating only convolutional (Conv-Tune); only fully-connected (FC-Tune); and only batch-norm parameters (BN-Tune). Our study shows that BN-Tune and Conv-Tune are more effective than FC-Tune in terms of improving accuracy for corrupted image data (averages of 6.6%, 4.97%, and 4.02%, respectively, over no adaptation). However, FC-Tune leads to a significantly faster and more energy-efficient solution with a small loss in accuracy. Even when using FC-Tune, the extra overheads of on-device fine-tuning remain significant for tight real-time deadlines (209 ms). This study motivates the need for designing hardware-aware robust algorithms for efficient on-device adaptation at the autonomous edge.
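The entropy-minimizing update described above can be reduced to a toy example: adapt a single batch-norm-style scale parameter so that the softmax prediction entropy decreases. This is a one-parameter finite-difference sketch, not the paper's backpropagation-based method.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_adapt_step(gamma, features, lr=0.5, eps=1e-5):
    """One unsupervised test-time step: lower the prediction entropy by
    updating only a BN-style scale parameter gamma (finite-difference
    gradient; a toy stand-in for a single backpropagation pass)."""
    def loss(g):
        return entropy(softmax([g * f for f in features]))
    grad = (loss(gamma + eps) - loss(gamma - eps)) / (2 * eps)
    return gamma - lr * grad
```

No labels are needed: minimizing entropy simply sharpens the model's own predictions, which is what makes this adaptation fully unsupervised.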
15:50 CET 12.5.2 SUPER-EFFICIENT SUPER RESOLUTION FOR FAST ADVERSARIAL DEFENSE AT THE EDGE
Speaker:
Kartikeya Bhardwaj, Arm Inc., US
Authors:
Kartikeya Bhardwaj1, Dibakar Gope2, James Ward3, Paul Whatmough2 and Danny Loh4
1Arm Inc., US; 2Arm Research, US; 3Arm Inc., IE; 4Arm Inc., GB
Abstract
Autonomous systems are highly vulnerable to a variety of adversarial attacks on Deep Neural Networks (DNNs). Training-free model-agnostic defenses have recently gained popularity due to their speed, ease of deployment, and ability to work across many DNNs. To this end, a new technique has emerged for mitigating attacks on image classification DNNs, namely, preprocessing adversarial images using super resolution -- upscaling low-quality inputs into high-resolution images. This defense requires running both image classifiers and super resolution models on constrained autonomous systems. However, super resolution incurs a heavy computational cost. Therefore, in this paper, we investigate the following question: Does the robustness of image classifiers suffer if we use tiny super resolution models? To answer this, we first review a recent work called Super-Efficient Super Resolution (SESR) that achieves similar or better image quality than prior art while requiring 2x to 330x fewer Multiply-Accumulate (MAC) operations. We demonstrate that despite being orders of magnitude smaller than existing models, SESR achieves the same level of robustness as significantly larger networks. Finally, we estimate end-to-end performance of super resolution-based defenses on a commercial Arm Ethos-U55 micro-NPU. Our findings show that SESR achieves nearly 3x higher FPS than a baseline while achieving similar robustness.
16:00 CET 12.5.3 FAULT-TOLERANT DEEP NEURAL NETWORKS FOR PROCESSING-IN-MEMORY BASED AUTONOMOUS EDGE SYSTEMS
Speaker:
Xue Lin, Northeastern University, US
Authors:
Siyue Wang1, Geng Yuan1, Xiaolong Ma1, Yanyu Li1, Xue Lin1 and Bhavya Kailkhura2
1Northeastern University, US; 2LLNL, US
Abstract
In-memory deep neural network (DNN) accelerators will be key to energy-efficient autonomous edge systems. Resistive random access memory (ReRAM) is a promising non-CMOS technology for such in-memory computing platforms, thanks to characteristics such as near-zero leakage power and non-volatility. However, due to the hardware instability of ReRAM, the weights of a deployed DNN model may deviate from the originally trained weights, resulting in accuracy loss. To mitigate this undesirable accuracy loss, we propose two stochastic fault-tolerant training methods that improve model robustness in general, without calibrating for individual devices. Moreover, we propose the Stability Score, a comprehensive metric that serves as an indicator of the instability problem. Extensive experiments demonstrate that DNN models trained using our proposed stochastic fault-tolerant training methods achieve superior performance, providing better flexibility, scalability, and deployability of ReRAM in autonomous edge systems.
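The flavor of stochastic fault-tolerant training can be shown on a one-parameter model: inject a random weight perturbation at each forward pass (a simplified stand-in for ReRAM conductance deviation) so the model is trained under the same noise it will face at inference. This is a minimal sketch under that assumed Gaussian fault model, not the paper's actual methods.

```python
import random

def forward(w, x, sigma, rng):
    # Forward pass with a stochastic weight perturbation, emulating
    # ReRAM weight deviation (assumed Gaussian fault model).
    w_noisy = w + (rng.gauss(0.0, sigma) if sigma > 0 else 0.0)
    return w_noisy * x

def train(data, sigma, epochs=200, lr=0.05, seed=0):
    # Fit y = w * x by gradient descent while faults are injected into
    # every forward pass, so w is optimized under the noisy hardware.
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            pred = forward(w, x, sigma, rng)
            w -= lr * 2.0 * (pred - y) * x  # squared-error gradient
    return w
```

Despite the injected faults, training still converges close to the true weight, which is the property such robustness-oriented training aims to preserve at scale.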
16:10 CET 12.5.4 FRL-FI: TRANSIENT FAULT ANALYSIS FOR FEDERATED REINFORCEMENT LEARNING-BASED NAVIGATION SYSTEMS
Speaker:
Arijit Raychowdhury, Georgia Institute of Technology, US
Authors:
Zishen Wan1, Aqeel Anwar1, Abdulrahman Mahmoud2, Tianyu Jia3, Yu-Shun Hsiao2, Vijay Reddi2 and Arijit Raychowdhury1
1Georgia Institute of Technology, US; 2Harvard University, US; 3Carnegie Mellon University, US
Abstract
Swarm intelligence is being increasingly deployed in autonomous systems such as drones and unmanned vehicles. Federated reinforcement learning (FRL), a key swarm intelligence paradigm in which agents interact with their own environments and cooperatively learn a consensus policy while preserving privacy, has recently shown potential advantages and gained popularity. However, transient faults are becoming more frequent in hardware as technology nodes continue to scale, and they can pose threats to FRL systems. Meanwhile, conventional redundancy-based protection methods are challenging to deploy in resource-constrained edge applications. In this paper, we experimentally evaluate the fault tolerance of FRL navigation systems at various scales with respect to fault models, fault locations, learning algorithms, layer types, communication intervals, and data types at both the training and inference stages. We further propose two cost-effective fault detection and recovery techniques that achieve up to a 3.3x improvement in resilience with <2.7% overhead in FRL systems.
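A transient fault of the kind such fault-injection campaigns sweep over can be modeled as a single bit flip in a stored parameter. The sketch below (an illustration, not the FRL-FI tool) flips one bit of an IEEE-754 float32, showing why fault location matters: a mantissa flip barely changes the value, while an exponent flip can be catastrophic.

```python
import struct

def flip_bit(value, bit):
    # Inject a single transient bit flip into a float32 parameter:
    # reinterpret the float as its 32-bit pattern, XOR one bit,
    # and reinterpret the result as a float again.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    raw ^= 1 << bit
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw))
    return flipped
```

Flipping bit 31 of 1.0 toggles the sign, a low mantissa bit shifts the value by about 1e-7, and the top exponent bit turns 1.0 into infinity, which is exactly the sensitivity gap (layer types, data types, locations) that the paper's analysis quantifies.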
16:20 CET 12.5.5 Q&A SESSION
Authors:
Dirk Ziegenbein1 and Chung-Wei Lin2
1Robert Bosch GmbH, DE; 2National Taiwan University, TW
Abstract
Questions and answers with the authors

13.1 New Perspectives in Test and Diagnosis

Date: Thursday, 17 March 2022
Time: 16:40 - 17:20 CET

Session chair:
Melanie Schillinsky, NXP Semiconductors Germany GmbH, DE

Session co-chair:
Riccardo Cantoro, Politecnico di Torino, IT

This session covers new techniques for cell-aware test, fault modeling, test and diagnosis for hardware security primitives, machine-learning enabled diagnosis for monolithic 3D circuits, as well as static compaction for SBST in GPU architectures.

Time Label Presentation Title
Authors
16:40 CET 13.1.1 IMPROVING CELL-AWARE TEST FOR INTRA-CELL SHORT DEFECTS
Speaker:
Dong-Zhen Lee, National Yang Ming Chiao Tung University, TW
Authors:
Dong-Zhen Li1, Ying-Yen Chen2, Kai-Chiang Wu3 and Chia-Tso Chao1
1National Yang Ming Chiao Tung University, TW; 2Realtek Semiconductor Corporation, TW; 3Department of Computer Science, National Chiao Tung University, TW
Abstract
Conventional fault models define faulty behavior at the I/O ports of standard cells with simple rules of fault activation and fault propagation. However, some defects inside a cell (intra-cell) cannot be effectively detected by the test patterns of conventional fault models and hence become a source of DPPM. To further increase defect coverage, many research works have studied the fault models resulting from different types of intra-cell defects by SPICE-simulating each targeted defect with its equivalent circuit-level defect model. In this paper, we propose to improve the cell-aware (CA) test methodology by concentrating on intra-cell bridging faults due to short defects inside standard cells. The extracted faults are based on examining the actual physical proximity of polygons in the layout of a cell, and are thus more realistic than faults determined by RC extraction. Experimental results on a set of industrial designs show that the proposed methodology can indeed improve the test quality for intra-cell bridging faults. On average, fault-coverage increases of 0.36% and 0.47% are obtained for 1-time-frame and 2-time-frame CA tests, respectively. In addition to short defects between two metal polygons, short defects among three metal polygons are also considered in our methodology, for another 9.33% improvement in fault coverage.
16:44 CET 13.1.2 APUF FAULTS: IMPACT, TESTING, AND DIAGNOSIS
Speaker:
Wenjing Rao, University of Illinois Chicago, US
Authors:
Natasha Devroye, Vincent Dumoulin, Tim Fox, Wenjing Rao and Yeqi Wei, University of Illinois at Chicago, US
Abstract
Arbiter Physically Unclonable Functions (APUFs) are hardware security primitives that exploit manufacturing randomness to generate unique digital fingerprints for ICs. This paper theoretically and numerically examines the impact of faults native to APUFs -- mask parameter faults from the design phase, or process variation (PV) during the manufacturing phase. We model them statistically, and explain quantitatively how these faults affect the resulting PUF bias and uniqueness. When given access to only a single PUF instance, we focus on abnormal delta elements that are outliers in magnitude, as this is how the statistically modeled faults manifest at the individual level. To detect such bad PUF instances and diagnose the abnormal delta elements, we propose a testing methodology which partitions a random set of challenges so that a specific delta element can be targeted, forming a perceivable bias in the responses over these sets. This low-cost approach is highly effective in detecting and diagnosing bad PUFs with abnormal delta element(s).
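The delta elements discussed in the abstract come from the standard additive delay model of an arbiter PUF: each challenge maps to a vector of +/-1 parity features, and the response is the sign of the feature-weighted sum of stage delay differences. The sketch below implements that textbook model (not the paper's testing methodology) and shows how an outlier-magnitude delta dominates the response, creating the perceivable bias the proposed test targets.

```python
def feature_vector(challenge):
    # Transform a 0/1 challenge into the +/-1 parity features of the
    # standard additive APUF delay model: phi_i = prod_{j>=i} (1 - 2*c_j).
    parity, phis = 1, []
    for c in reversed(challenge):
        parity *= 1 - 2 * c
        phis.append(parity)
    phis.reverse()
    return phis

def apuf_response(deltas, challenge):
    # Response bit = sign of the challenge-dependent signed sum of the
    # stage delay differences ("delta elements").
    total = sum(d * p for d, p in zip(deltas, feature_vector(challenge)))
    return 1 if total >= 0 else 0
```

With one abnormal delta far larger in magnitude than the rest, the response is fully determined by that stage's parity feature, so partitioning challenges by that feature exposes the faulty element.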
16:48 CET 13.1.3 GRAPH NEURAL NETWORK-BASED DELAY-FAULT LOCALIZATION FOR MONOLITHIC 3D ICS
Speaker:
Shao-Chun Hung, Department of Electrical and Computer Engineering, Duke University, US
Authors:
Shao-Chun Hung, Sanmitra Banerjee, Arjun Chaudhuri and Krishnendu Chakrabarty, Duke University, US
Abstract
Monolithic 3D (M3D) integration is a promising technology for achieving high performance and low power consumption. However, the limitations of current M3D fabrication flows lead to performance degradation of devices in the top tier and unreliable interconnects between tiers. Fault localization at the tier level is therefore necessary to enhance yield learning. For example, tier-level localization can enable targeted diagnosis and process-optimization efforts. In this paper, we develop a graph neural network-based diagnosis framework to efficiently localize faults to a device tier. The proposed framework can provide rapid feedback to the foundry and help enhance the quality of diagnosis reports generated by commercial tools. Results for four M3D benchmarks, with and without response compaction, show that the proposed solution achieves up to a 39.19% improvement in diagnostic resolution with less than 1% loss of accuracy compared to results from commercial tools.
16:52 CET 13.1.4 A COMPACTION METHOD FOR STLS FOR GPU IN-FIELD TEST
Speaker:
Juan David Guerrero Balaguera, Politecnico di Torino, IT
Authors:
Juan Guerrero Balaguera, Josie Rodriguez Condia and Matteo Sonza Reorda, Politecnico di Torino, IT
Abstract
Nowadays, Graphics Processing Units (GPUs) are effective platforms for implementing complex algorithms (e.g., for Artificial Intelligence) in domains such as automotive and robotics, where massive parallelism and high computational effort are required. Some of these domains impose strict safety-critical requirements, mandating mechanisms to detect faults during the operational phases of a device. An effective test solution is based on Self-Test Libraries (STLs), which test devices functionally. This solution is frequently adopted for CPUs, but can also be used with GPUs. Nevertheless, in-field constraints restrict the acceptable size and duration of STLs. This work proposes a method to automatically compact the test programs of a given STL targeting GPUs. The proposed method combines a multi-level abstraction analysis: logic simulation extracts the microarchitectural operations triggered by the test program and the thread-level activity of each instruction, while fault simulation determines each instruction's ability to propagate faults to an observable point. The main advantage of the proposed method is that it requires a single fault simulation to perform the compaction. The effectiveness of the approach was evaluated on several test programs developed for an open-source GPU model (FlexGripPlus) compatible with NVIDIA GPUs. The results show that the method can compact test programs by up to 98.64% in code size and up to 98.42% in duration, with minimal effect on the achieved fault coverage.
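Coverage-preserving compaction of this kind can be pictured as a greedy pass over the test program once the per-instruction fault sets are known from a single fault simulation: drop any instruction whose detected faults are already covered by instructions kept so far. This is a generic set-cover sketch under that assumption, not the paper's exact multi-level flow.

```python
def compact(program, fault_sets):
    # Greedy, coverage-preserving compaction: keep an instruction only
    # if it detects at least one fault not covered by earlier kept
    # instructions. `fault_sets[i]` is the set of faults instruction i
    # propagates to an observable point (from one fault simulation).
    covered, kept = set(), []
    for instr in program:
        new = fault_sets[instr] - covered
        if new:
            kept.append(instr)
            covered |= new
    return kept, covered
```

By construction the kept program detects exactly the same faults as the original, so the size/duration reduction comes at no coverage cost in this simplified model.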
16:56 CET 13.1.5 Q&A SESSION
Authors:
Melanie Schillinsky1 and Riccardo Cantoro2
1NXP Germany GmbH, DE; 2Politecnico di Torino, IT
Abstract
Questions and answers with the authors

13.2 From system-level specification to RTL and back

Date: Thursday, 17 March 2022
Time: 16:40 - 17:20 CET

Session chair:
Andy Pimentel, University of Amsterdam, NL

Session co-chair:
Matthias Jung, Fraunhofer IESE, DE

This session highlights the importance of system modeling for efficient design. The first three papers showcase solutions to generate system-level models from RTL descriptions and back. The last paper presents a cost-sensitive model and learning engine for disk failure prediction that reduces misclassification costs while maintaining a high failure detection rate.

Time Label Presentation Title
Authors
16:40 CET 13.2.1 AUTOMATIC GENERATION OF ARCHITECTURE-LEVEL MODELS FROM RTL DESIGNS FOR PROCESSORS AND ACCELERATORS
Speaker:
Yu Zeng, Princeton University, US
Authors:
Yu Zeng, Aarti Gupta and Sharad Malik, Princeton University, US
Abstract
Hardware platforms comprise general-purpose processors and application-specific accelerators. Unlike processors, application-specific accelerators often do not have clearly specified architecture-level models/specifications (the instruction set architecture or ISA). This poses challenges to the development and verification/validation of firmware/software for these accelerators. Manually writing architecture-level models takes great effort and is error-prone. When Register-Transfer Level (RTL) designs are available, they can be a source from which to automatically derive the architecture-level models. In this work, we propose an approach for automatically generating architecture-level models for processors as well as accelerators from their RTL designs. In previous work, we showed how to automatically extract the architectural state variables (ASVs) from RTL designs. (These are the state variables that are persistent across instructions.) In this work, we present an algorithm for generating the update functions of the model: how the ASVs and outputs are updated by each instruction. Experiments on several processors and accelerators demonstrate that our approach can cover a wide range of hardware features and generate high-quality architecture-level models within reasonable computing time.
16:44 CET 13.2.2 TWINE: A CHISEL EXTENSION FOR COMPONENT-LEVEL HETEROGENEOUS DESIGN
Speaker:
Shibo Chen, University of Michigan, US
Authors:
Shibo Chen, Yonathan Fisseha, Jean-Baptiste Jeannin and Todd Austin, University of Michigan, US
Abstract
Algorithm-oriented heterogeneous hardware design has been one of the major driving forces for hardware improvement in the post-Moore's Law era. To achieve the swift development of heterogeneous designs, designers reuse existing hardware components to craft their systems. However, current hardware design languages either require tremendous efforts to customize designs, or sacrifice quality for simplicity. Chisel, while attracting more users for its capability to easily reconfigure designs, lacks a few key features to further expedite the heterogeneous design flow. In this paper, we introduce Twine—a Chisel extension that provides high-level semantics to efficiently generate heterogeneous designs. Twine standardizes the interface for better reusability and supports control-free specification with flexible data type conversion, which saves designers from the busy-work of interconnecting modules. Our results show that Twine provides a smooth on-boarding experience for hardware designers, considerably improves reusability, and reduces design complexity for heterogeneous designs while maintaining high design quality.
16:48 CET 13.2.3 TOWARDS IMPLEMENTING RTL MICROPROCESSOR AGILE DESIGN USING FEATURE ORIENTED PROGRAMMING
Speaker:
Tun Li, National University of Defense Technology, CN
Authors:
Hongji Zou, Mingchuan Shi, Tun Li and Wanxia Qu, National University of Defense Technology, CN
Abstract
Recently, hardware agile design methods have been developed to improve design productivity. However, current modeling methods hinder further productivity improvements. In this paper, we propose and implement a microprocessor agile design method based on feature-oriented programming to improve design productivity. In this method, designs can be uniquely partitioned and constructed incrementally to explore various functional design features flexibly and efficiently. The key techniques for improving design productivity are a flexible modeling extension and an on-the-fly feature-composing mechanism. Evaluations on RISC-V and OR1200 CPU pipelines show the effectiveness of the proposed method in reducing duplicate code and composing features flexibly while avoiding design resource overheads.
16:52 CET 13.2.4 CSLE: A COST-SENSITIVE LEARNING ENGINE FOR DISK FAILURE PREDICTION IN LARGE DATA CENTERS
Speaker:
Xinyan Zhang, Huazhong University of Science and Technology, CN
Authors:
Xinyan Zhang1, Kai Shan2, Zhipeng Tan3 and Dan Feng3
1Wuhan National Laboratory for Optoelectronics, Huazhong University of Science & Technology, CN; 2Huawei Technologies, CN; 3Huazhong University of Science and Technology, CN
Abstract
As the principal failure in data centers, disk failure poses the risk of data loss, increases maintenance cost, and affects system availability. As a proactive fault-tolerance technology, disk failure prediction can minimize the loss before a failure occurs. However, a weak prediction model with a low Failure Detection Rate (FDR) and high False Alarm Rate (FAR) may substantially increase system cost due to inadequate consideration or misperception of misclassification costs. To address these challenges, we propose CSLE, a cost-sensitive learning engine for disk failure prediction, which combines a two-phase feature selection based on Cohen's D and a Genetic Algorithm, a meta-algorithm based on cost-sensitive learning, and an adaptive optimal classifier for heterogeneous and homogeneous disk series. Experimental results on real datasets show that the AUC of CSLE is 2%-42% higher than that of the commonly used rank-sum test, and that CSLE reduces misclassification cost by 52%-96% compared with the rank model. Moreover, CSLE generalizes better than traditional prediction models: it reduces both the misclassification cost and the FAR by 16%-70% for heterogeneous disk series, and increases the FDR by 3%-29% for homogeneous disk series.
16:56 CET 13.2.5 Q&A SESSION
Authors:
Andy Pimentel1 and Matthias Jung2
1University of Amsterdam, NL; 2Fraunhofer IESE, DE
Abstract
Questions and answers with the authors

13.3 Advances in permanent storage efficiency and NN-in-memory

Date: Thursday, 17 March 2022
Time: 16:40 - 17:20 CET

Session chair:
Yi Wang, Shenzhen University, CN

Session co-chair:
Zili Shao, The Chinese University of Hong Kong, HK

In this session we present several hardware- and software-based advances in permanent storage. The solutions build on technologies such as emerging persistent memories, flash, and shingled magnetic recording disks to improve the overall bandwidth, latency, capacity, and resilience of permanent storage. They do so by analyzing current bottlenecks and combining several of these technologies to increase performance at the overall system level, by developing a new framework that revisits FTL firmware organization for future open-source multicore architectures, and by presenting a robust implementation of binary neural networks for computing-in-memory.

Time Label Presentation Title
Authors
16:40 CET 13.3.1 ROBUST BINARY NEURAL NETWORK AGAINST NOISY ANALOG COMPUTATION
Speaker:
Zong-Han Lee, National Tsing-Hua University, TW
Authors:
Zong-Han Lee1, Fu-Cheng Tsai2 and Shih-Chieh Chang1
1National Tsing-Hua University, TW; 2Industrial Technology Research Institute, TW
Abstract
Computing-in-memory (CIM) technology has shown promising results in reducing the energy consumption of battery-powered devices. Meanwhile, by reducing MAC operations, binary neural networks (BNNs) show the potential to approach the accuracy of full-precision models. This paper proposes a robust BNN model applied to the CIM framework that can tolerate analog noise. Such analog noise, caused by various variations such as process variation, can lead to low inference accuracy. We first observe that traditional batch normalization can make a BNN model susceptible to analog noise. We then propose a new approach that replaces batch normalization while maintaining its advantages. Secondly, since in a BNN noise is removed when inputs are zero during the multiply-and-accumulate (MAC) operation, we also propose novel methods to increase the number of zeros in a convolution output. We apply our new BNN model to the keyword-spotting application, with very encouraging results.
16:44 CET 13.3.2 (Best Paper Award Candidate)
MU-RMW: MINIMIZING UNNECESSARY RMW OPERATIONS IN THE EMBEDDED FLASH WITH SMR DISK
Speaker:
Chenlin Ma, Shenzhen University, CN
Authors:
Chenlin Ma, Zhuokai Zhou, Yingping Wang, Yi Wang and Rui Mao, Shenzhen University, CN
Abstract
Emerging Shingled Magnetic Recording (SMR) disks improve storage capacity significantly by overlapping multiple tracks in the shingled direction. However, the shingled structure leads to severe write amplification caused by read-modify-write (RMW) operations inside SMR disks. As the mainstream solid-state storage technology, NAND flash has the advantages of small size, cost-effectiveness, and high performance, making it suitable and promising to incorporate into SMR disks to boost system performance. In this hybrid embedded storage system (the Embedded Flash with SMR disk (EF-SMR) system), we observe that physical flash blocks can contain a mixture of data associated with different SMR data bands; when garbage-collecting such flash blocks, multiple RMW operations are triggered to rewrite the involved SMR bands, further degrading performance. Therefore, in this paper we present MU-RMW, which, for the first time, guarantees that data from different SMR bands are not mixed within flash blocks, with the aim of minimizing unnecessary RMW operations. The effectiveness of MU-RMW was evaluated with realistic, intensive I/O workloads, and the results are encouraging.
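The core invariant described above (never mix pages from different SMR bands in one flash block, so garbage-collecting any block touches at most one band) can be sketched as a tiny band-aware allocator. This is an illustrative toy with assumed names and granularity, not MU-RMW's actual design.

```python
class BandAwareFlash:
    # Toy allocator keeping one open flash block per SMR band, so a
    # sealed block only ever holds pages from a single band and its
    # garbage collection triggers at most one RMW on the disk.
    def __init__(self, pages_per_block=4):
        self.pages_per_block = pages_per_block
        self.open_blocks = {}   # band id -> pages in the open block
        self.sealed = []        # list of (band id, pages) full blocks

    def write(self, band, page):
        blk = self.open_blocks.setdefault(band, [])
        blk.append(page)
        if len(blk) == self.pages_per_block:
            self.sealed.append((band, blk))
            self.open_blocks[band] = []
```

Even under an interleaved write stream, every sealed block is single-band, which is exactly the property that bounds the RMW cost of flash garbage collection in the EF-SMR setting.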
16:48 CET 13.3.3 OPTIMIZING COW-BASED FILE SYSTEMS ON OPEN-CHANNEL SSDS WITH PERSISTENT MEMORY
Speaker:
Runyu Zhang, Chongqing University, CN
Authors:
Runyu Zhang1, Duo Liu2, Chaoshu Yang3, Xianzhang Chen2, Lei Qiao4 and Yujuan Tan2
1College of Computer Science, Chongqing University, CN; 2Chongqing University, CN; 3Guizhou University, CN; 4Beijing Institute of Control Engineering, CN
Abstract
Block-based file systems, such as Btrfs, utilize the copy-on-write (CoW) mechanism to guarantee data consistency on solid-state drives (SSDs). Open-channel SSD provides opportunities for in-depth optimization of block-based file systems. However, existing systems fail to co-design the two-layer semantics and cannot take full advantage of the open-channel characteristics. Specifically, synchronizing an overwrite in Btrfs will copy-on-write all pages in the update path and induce severe write amplification. In this paper, we propose a hybrid fine-grained copy-on-write and journaling mechanism (HyFiM) to address these problems. We first utilize persistent memories to preserve the address mapping table of the open-channel SSD. Then, we design an intra-FTL copy-on-write mechanism (IFCoW) that eliminates the recursive updates caused by overwrites. Finally, we devise fine-grained metadata journals (FGMJ) to guarantee the consistency of metadata with minimum overhead. We prototype HyFiM based on Btrfs in the Linux kernel. Comprehensive evaluations demonstrate that HyFiM outperforms Btrfs by 30.77% and 33.82% for sequential and random overwrites, respectively.
16:52 CET 13.3.4 MCMQ: SIMULATION FRAMEWORK FOR SCALABLE MULTI-CORE FLASH FIRMWARE OF MULTI-QUEUE SSDS
Speaker:
Jin Xue, The Chinese University of Hong Kong, HK
Authors:
Jin Xue, Tianyu Wang and Zili Shao, The Chinese University of Hong Kong, HK
Abstract
Solid-state drives (SSDs) are used in a wide range of emerging data processing systems. To fully utilize the massive internal parallelism delivered by SSDs, manufacturers have begun to use high-performance multi-core microprocessors in scalable flash firmware to process I/O requests concurrently. Designing scalable multi-core flash firmware requires simulation tools that can model the features of a multi-core environment. However, existing SSD simulators assume a single-threaded execution model and cannot model overheads incurred by multi-threaded firmware execution, such as lock contention. In this paper, we propose MCMQ, a novel framework for simulating scalable multi-core flash firmware. The framework is based on an emulated multi-core RISC processor and supports executing multiple I/O traces in parallel through a multi-queue interface. Experimental results show the effectiveness of the proposed framework. We have released the open-source code of MCMQ for public access.
16:56 CET 13.3.5 Q&A SESSION
Authors:
Yi Wang1 and Zili Shao2
1Shenzhen University, CN; 2The Chinese University of Hong Kong, HK
Abstract
Questions and answers with the authors

13.4 System-level security

Date: Thursday, 17 March 2022
Time: 16:40 - 17:20 CET

Session chair:
Pascal Benoit, University of Montpellier, FR

Session co-chair:
Mike Hamburg, Cryptography Research, US

This session looks at security from a high-level perspective. It covers two improvements to Intel Software Guard Extensions, one ensuring that pages are available in secure memory when needed and another extending an existing secure key-value store; new protections against transient-execution and fault-injection attacks; and a new dynamic attack that can evade hardware-assisted attack/intrusion detection.

Time Label Presentation Title
Authors
16:40 CET 13.4.1 CR-SPECTRE: DEFENSE-AWARE ROP INJECTED CODE-REUSE BASED DYNAMIC SPECTRE
Speaker:
Abhijitt Dhavlle, George Mason University, US
Authors:
Abhijitt Dhavlle1, Setareh Rafatirad2, Houman Homayoun2 and Sai Manoj Pudukotai Dinakarrao3
1George Mason University, US; 2University of California Davis, US; 3George Mason University, US
Abstract
Side-channel attacks have been a constant threat to computing systems. In recent times, architectural vulnerabilities have been discovered and exploited to mount state-of-the-art attacks such as Spectre. The Spectre attack exploits a vulnerability in Intel-based processors to leak confidential data through a covert channel. Several defenses exist to mitigate the Spectre attack. Among them, hardware-assisted attack/intrusion detection (HID) systems have received an overwhelming response due to their low overhead and efficient attack detection. HID systems deploy machine learning (ML) classifiers that perform anomaly detection to determine whether the system is under attack; for this purpose, a performance monitoring tool profiles applications to record hardware performance counters (HPCs), on which the anomaly detection is performed. Previous HID systems assume that Spectre is executed as a standalone application. In contrast, we propose an attack that dynamically generates variations in the injected code to evade detection. The attack is injected into a benign application; in this manner, it conceals itself as benign while generating perturbations to avoid detection. For the attack injection, we exploit a return-oriented programming (ROP)-based code-injection technique that reuses code, called gadgets, present in the exploited victim's (host) memory to execute the attack, which, in our case, is the CR-Spectre attack stealing sensitive data from a target victim application. Our work focuses on a dynamic attack that evades HID detection by injecting perturbations, and dynamically generated variations thereof, under the cloak of a benign application. We evaluate the proposed attack using the MiBench suite as the host. In our experiments, HID performance degrades from 90% to 16%, indicating that our CR-Spectre attack successfully avoids detection.
16:44 CET 13.4.2 CACHEREWINDER: REVOKING SPECULATIVE CACHE UPDATES EXPLOITING WRITE-BACK BUFFER
Speaker:
Jongmin Lee, Korea University, KR
Authors:
Jongmin Lee1, Junyeon Lee2, Taeweon Suh1 and Gunjae Koo1
1Korea University, KR; 2Samsung Advanced Institute of Technology, KR
Abstract
Transient execution attacks are critical security threats since they exploit speculative execution, an essential architectural technique that significantly improves the performance of out-of-order processors. Such attacks change the cache state by accessing secret data during speculative execution; the attackers then leak the secret information through cache timing side-channels. Although software patches against transient execution attacks have been proposed, these software solutions significantly slow down the system. In this paper, we propose CacheRewinder, an efficient hardware-based defense mechanism against transient execution attacks. CacheRewinder prevents leakage of secret information by revoking the cache updates made by speculative execution. To restore the cache state efficiently, CacheRewinder exploits the underutilized write-back buffer space as temporary storage for victimized cache blocks that are evicted during speculative execution. Hence, when speculation fails, CacheRewinder can quickly restore the cache state using the evicted cache blocks held in the write-back buffer. Our evaluation shows that CacheRewinder effectively defends against transient execution attacks. The performance overhead of CacheRewinder is only 0.6%, which is negligible compared to the unprotected baseline processor. CacheRewinder also requires minimal storage cost, since it exploits unused write-back buffer entries as storage for evicted cache blocks.
16:48 CET 13.4.3 SAFETEE: COMBINING SAFETY AND SECURITY ON ARM-BASED MICROCONTROLLERS
Speaker:
Martin Schönstedt, TU Darmstadt, DE
Authors:
Martin Schönstedt, Ferdinand Brasser, Patrick Jauernig, Emmanuel Stapf and Ahmad-Reza Sadeghi, TU Darmstadt, DE
Abstract
From industry automation to smart home, embedded devices are already ubiquitous, and the number of applications continues to grow rapidly. However, the plethora of embedded devices used in these systems leads to considerable hardware and maintenance costs. To reduce these costs, it is necessary to consolidate applications and functionalities that are currently implemented on individual embedded devices. Especially in mixed-criticality systems, consolidating applications on a single device is highly challenging and requires strong isolation to ensure the security and safety of each application. Existing isolation solutions, such as partitioning designs for ARM-based microcontrollers, do not meet these requirements. In this paper, we present SafeTEE, a novel approach to enable security- and safety-critical applications on a single embedded device. We leverage hardware mechanisms of commercially available ARM-based microcontrollers to strongly isolate applications on individual cores. This makes SafeTEE the first solution to provide strong isolation for multiple applications in terms of security as well as safety. We thoroughly evaluate our prototype of SafeTEE for the most recent ARM microcontrollers using a standard microcontroller benchmark suite.
16:52 CET 13.4.4 Q&A SESSION
Authors:
Pascal Benoit1 and Mike Hamburg2
1University of Montpellier, FR; 2Cryptography Research, US
Abstract
Questions and answers with the authors

13.5 Safe and Efficient Engineering of Autonomous Systems

Date: Thursday, 17 March 2022
Time: 16:40 - 17:20 CET

Session chair:
Sebastian Steinhorst, TU Munich, DE

Session co-chair:
Sharon Hu, University of Notre Dame, US

This session discusses novel approaches for engineering autonomous systems, considering safety and validation aspects as well as efficiency. The first paper uses ontology-based perception for autonomous vehicles, which enables a comprehensive safety analysis; the second paper relies on formal approaches for generating relevant critical scenarios for automated driving. The last paper proposes an efficient method for recharging Unmanned Aerial Vehicles (UAVs) to perform large-scale remote sensing with maximal coverage.

Time Label Presentation Title
Authors
16:40 CET 13.5.1 USING ONTOLOGIES FOR DATASET ENGINEERING IN AUTOMOTIVE AI APPLICATIONS
Speaker:
Martin Herrmann, Robert Bosch GmbH, DE
Authors:
Martin Herrmann1, Christian Witt2, Laureen Lake1, Stefani Guneshka3, Christian Heinzemann1, Frank Bonarens4, Patrick Feifel4 and Simon Funke5
1Robert Bosch GmbH, DE; 2Valeo Schalter und Sensoren GmbH, DE; 3Understand AI, DE; 4Stellantis, Opel Automobile GmbH, DE; 5Understand AI, DE
Abstract
The basis of a robust safety strategy for an automated driving function based on neural networks is a detailed description of its input domain, i.e., a description of the environment in which the function is used. This is required to describe its functional system boundaries and to perform a comprehensive safety analysis. Moreover, it allows tailoring datasets specifically designed for safety-related validation tests. Ontologies gather expert knowledge and model information to enable computer-aided processing, while using a notation understandable to humans. In this contribution, we propose a methodology for domain analysis to build an ontology for the perception of autonomous vehicles, including characteristic features that become important when dealing with neural networks. Additionally, the method is demonstrated by creating a synthetic test dataset for a Euro NCAP-like use case.
16:53 CET 13.5.2 USING FORMAL CONFORMANCE TESTING TO GENERATE SCENARIOS FOR AUTONOMOUS VEHICLES
Speaker:
Lucie Muller, INRIA, FR
Authors:
Jean-Baptiste Horel1, Christian Laugier1, Lina Marsso2, Radu Mateescu3, Lucie Muller3, Anshul Paigwar1, Alessandro Renzaglia1 and Wendelin Serwe3
1University Grenoble Alpes, Inria, FR; 2University of Toronto, CA; 3INRIA, FR
Abstract
Simulation, a common practice for evaluating autonomous vehicles, requires specifying realistic scenarios, in particular critical ones, which correspond to corner-case situations that occur rarely and are potentially dangerous to reproduce in real environments. Such simulation scenarios may be either generated randomly or specified manually. Randomly generated scenarios are easy to produce, but their relevance might be difficult to assess, for instance when many slightly different scenarios target one feature. Manually specified scenarios can focus on a given feature, but their design might be difficult and time-consuming, especially to achieve satisfactory coverage. In this work, we propose an automatic approach to generate a large number of relevant critical scenarios for autonomous driving simulators. The approach is based on the generation of behavioural conformance tests from a formal model (specifying the ground truth configuration with the range of vehicle behaviours) and a test purpose (specifying the critical feature to focus on). The obtained abstract test cases cover, by construction, all possible executions exercising a given feature, and can be automatically translated into the inputs of autonomous driving simulators. We illustrate our approach by generating hundreds of behaviour trees for the CARLA simulator for several realistic configurations.
17:06 CET 13.5.3 REMOTE SENSING WITH UAV AND MOBILE RECHARGING VEHICLE RENDEZVOUS
Speaker:
Michael Ostertag, University of California, San Diego, US
Authors:
Michael Ostertag1, Jason Ma1 and Tajana S. Rosing2
1University of California, San Diego, US; 2UCSD, US
Abstract
Small unmanned aerial vehicles (UAVs) equipped with sensors offer an effective way to perform high-resolution environmental monitoring in remote areas but suffer from limited battery life. In order to perform large-scale remote sensing, a UAV must cover the area using multiple discharge cycles. A practical and efficient method to achieve full coverage is for the sensing UAV to rendezvous with a mobile recharge vehicle (MRV) for a battery exchange, which is an NP-hard planning problem. Existing works tackle this problem using slow genetic algorithms or greedy heuristics. We propose an alternative approach: a two-stage algorithm that iterates between dividing a region into independent subregions aligned to MRV travel and a new Diffusion Heuristic that performs a local exchange of points of interest between neighboring subregions. The algorithm outperforms existing state-of-the-art planners for remote sensing applications, creating more fuel-efficient paths that align better with MRV travel.

A.1 Panel on Quantum and Neuromorphic Computing: Designing Brain-Inspired Chips

Date: Thursday, 17 March 2022
Time: 17:30 - 19:00 CET

Session chair:
Aida Todri Sanial, LIRMM, FR

Session co-chair:
Anne Matsuura, Intel, US

Panellists:
Bhavin J. Shastri, Queen’s University, CA
Giacomo Indiveri, ETH Zürich, CH
Mike Davies, INTEL, US

In this session, invited speakers from industry and academia will cover aspects ranging from neuro-inspired computing chips, neuromorphic engineering, and photonics to organic electronics for neuromorphic computing.


14.1 University Fair

Date: Thursday, 17 March 2022
Time: 19:00 - 20:30 CET

Session chair:
Ioannis Sourdis, Chalmers, SE

Session co-chair:
Nele Mentens, KU Leuven, BE

The University Fair is a forum for disseminating academic research activities. Its goal is twofold:
(1) to foster the transfer of mature academic work to a large audience of industrial parties.
(2) to advertise new or upcoming research plans associated with new open research positions to a large audience of graduate students.
To this end, the University Fair program includes talks that describe (1) pre-commercial mature academic research results and/or prototypes with technology transfer potential as well as (2) new upcoming research initiatives associated with openings of academic research positions.

Time Label Presentation Title
Authors
19:00 CET 14.1.1 CHALMERS ACTIVITIES IN EUROHPC JU
Speaker and Author:
Per Stenstrom, Chalmers University of Technology, SE
Abstract
.
19:10 CET 14.1.2 HARDWARE DESIGNS FOR HIGH PERFORMANCE AND RELIABLE SPACE PROCESSORS
Authors:
Leonidas Kosmidis and Marc Solé Bonet, Universitat Politècnica de Catalunya - Barcelona Supercomputing Center, ES
Abstract
.
19:20 CET 14.1.3 NEW POSITION IN THE SSH TEAM OF TÉLÉCOM PARIS
Speaker and Author:
Jean Luc Danger, Télécom ParisTech, FR
Abstract
.
19:30 CET 14.1.4 A TOOLCHAIN FOR LIBRARY CELL CHARACTERIZATION FOR RFET TECHNOLOGIES
Speaker:
Steffen Märcker, TU Dresden, DE
Authors:
Steffen Märcker, Akash Kumar, Michael Raitza and Shubham Rai, TU Dresden, DE
Abstract
.
19:40 CET 14.1.5 SAFETY-RELATED OPEN SOURCE HARDWARE MODULES
Speaker:
Jaume Abella, Barcelona Supercomputing Center, ES
Authors:
Jaume Abella1, Sergi Alcaide2 and Pedro Benedicte1
1Barcelona Supercomputing Center, ES; 2Universitat Politècnica de Catalunya - Barcelona Supercomputing Center, ES
Abstract
.
19:50 CET 14.1.6 RESEARCH @NECSTLAB IN A NUTSHELL AKA RESEARCH ACTIVITIES AND OPPORTUNITIES FOR PROSPECTIVE PHD STUDENTS
Speaker and Author:
Marco D. Santambrogio, Politecnico di Milano, IT
Abstract
.
20:00 CET 14.1.7 POWER-OFF LASER ATTACKS ON SECURITY PRIMITIVES
Speaker and Author:
Giorgio Di Natale, TIMA, FR
Abstract
.

15.1 Young People Program: BarCamp

Date: Friday, 18 March 2022
Time: 09:00 - 17:30 CET

Session chair:
Anton Klotz, Cadence, DE

Session co-chair:
Georg Glaeser, Institut für Mikroelektronik- und Mechatronik-Systeme, DE

The BarCamp is an interactive open research meeting, where participants present, discuss and jointly develop ideas and results of the ongoing scientific work in a more interactive way. Characterized by an informal atmosphere, the goal of the BarCamp is to generate new and out-of-the-box ideas, and allow networking and interaction between participants.


15.2 Panel: Forum on Advancing Diversity in EDA (DivEDA)

Date: Friday, 18 March 2022
Time: 18:30 - 20:30 CET

Session chair:
Ayse K Coskun, Boston University, US

Session co-chair:
Nele Mentens, KU Leuven, BE

Panellists:
Ileana Buhan, Radboud University Nijmegen, NL
Michaela Blott, Xilinx, IE
Andreia Cathelin, STMicroelectronics, FR
Marian Verhelst, KU Leuven, BE

The 3rd Advancing Diversity in EDA (DivEDA) forum is co-sponsored by IEEE CEDA and ACM SIGDA. The goal of DivEDA is to help women and underrepresented minorities (URM) advance their careers in academia and industry, and hence to help increase diversity in the EDA community. A more diverse community will in turn help accelerate innovation in the EDA ecosystem and benefit societal progress. Through an interactive medium, our aim is to provide practical tips to women and URM on how to succeed and to overcome possible hurdles in their career growth, while at the same time connecting senior and junior researchers to enable a growing diverse community. We are excited to build upon earlier diversity-focused efforts in EDA and create a venue that aims to make a difference. Prior DivEDA editions were held at DATE'18 and DAC'19. This year's forum will be held as a single two-hour virtual session, including a one-hour panel followed by smaller group mentoring and Q&A sessions. The topic of the forum is "Addressing career challenges during the pandemic: work-life balance, networking, and more". Registration to the event is free of charge.


A.2 Disruptive and Nanoelectronics-based edge AI computing systems

Date: Monday, 21 March 2022
Time: 17:30 - 19:00 CET

Session chair:
David Atienza, EPFL, CH

Session co-chair:
Ayse Coskun, Boston University, US

Progress in process technology has enabled the miniaturization of data processing elements, radio transceivers, and sensors for a large set of physiological phenomena. Autonomous sensor nodes, also called edge computing systems, can monitor and react unobtrusively during our daily lives. Nonetheless, the need for automated analysis and interpretation of complex signals poses critical design challenges, which can potentially be addressed (in terms of power consumption, performance, or size) by using nanoelectronics. These new technologies can take us beyond key limitations of CMOS-based technology for particular applications, such as healthcare. This special session covers the latest trends towards including AI/ML in edge computing, as well as alternative design paradigms and the use of nanoelectronics technologies for the next generation of edge AI systems.

Time Label Presentation Title
Authors
17:30 CET A.2.1 TINY MACHINE LEARNING FOR IOT 2.0
Speaker and Author:
Vijay Janapa Reddi, Harvard University, US
Abstract
Tiny machine learning (TinyML) is a fast-growing field at the intersection of ML algorithms and low-cost embedded systems. TinyML enables a rich and wide array of on-device sensor data analysis (vision, audio, IMU, etc.) at ultra-low-power consumption. Processing data close to the sensor allows for an expansive new variety of always-on ML use-cases that preserve bandwidth, latency, and energy while improving responsiveness and maintaining data privacy. This talk introduces the vision behind TinyML and showcases some of the exciting applications that TinyML is enabling in the field, from supporting personalized health initiatives to unlocking the massive potential to improve manufacturing efficiencies. Yet, there are still numerous technical hardware and software challenges to address. Tight memory and storage constraints, extreme hardware heterogeneity, software fragmentation and a lack of relevant and commercially viable large-scale datasets pose a substantial barrier to unlocking TinyML for IoT 2.0. To this end, the talk also touches on the opportunities and future directions for unlocking the full potential of TinyML.
18:00 CET A.2.2 HD COMPUTING WITH APPLICATIONS
Speaker and Author:
Tajana S. Rosing, UCSD, US
Abstract
Hyperdimensional (HD) computing is a class of brain-inspired learning algorithms that uses high-dimensional random vectors (e.g., ~10,000 bits) to represent data, along with simple and highly parallelizable operations. In this talk I will present some of my team's recent work on hyperdimensional computing software and hardware infrastructure, including: i) novel algorithms supporting key cognitive computations in high-dimensional space, and ii) novel hardware systems for efficient HD computing on sensors and mobile devices that are orders of magnitude more efficient than the state of the art, at comparable accuracy.
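The core operations behind HD computing are simple enough to sketch directly. The following Python toy is illustrative only (not the speaker's implementation): bipolar hypervectors, binding via elementwise multiplication, bundling via elementwise majority, and a normalized dot-product similarity.

```python
import random

DIM = 10_000  # hypervector dimensionality (~10,000 dimensions, as in the talk)

def random_hv():
    """Random bipolar hypervector: each component is +1 or -1."""
    return [random.choice((-1, 1)) for _ in range(DIM)]

def bind(a, b):
    """Binding (elementwise multiply): associates two hypervectors."""
    return [x * y for x, y in zip(a, b)]

def bundle(vectors):
    """Bundling (elementwise majority, ties broken toward +1):
    superimposes a set of hypervectors into one."""
    return [1 if sum(col) >= 0 else -1 for col in zip(*vectors)]

def similarity(a, b):
    """Normalized dot product in [-1, 1]; close to 0 for unrelated vectors."""
    return sum(x * y for x, y in zip(a, b)) / DIM

random.seed(0)
a, b, c = random_hv(), random_hv(), random_hv()
record = bundle([bind(a, b), c])  # store the pair a*b together with c

# The bundle stays similar to its constituents but not to unrelated vectors:
assert similarity(record, c) > 0.3
assert abs(similarity(record, random_hv())) < 0.1
```

In high dimensions, random vectors are nearly orthogonal with overwhelming probability, which is what makes these simple component-wise operations robust and parallelizable.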
18:30 CET A.2.3 IMPROVING WEIGHT PERTURBATION ROBUSTNESS FOR MEMRISTOR-BASED HARDWARE DEPLOYMENT
Speaker:
Yiran Chen, Duke University, US
Authors:
Yiran Chen, Huanrui Yang and Xiaoxuan Yang, Duke University, US
Abstract
Crossbar-based memristors, owing to the advantages in executing vector-matrix multiplication, enable highly power-efficient and area-efficient neuromorphic system designs. However, deploying deep learning applications on memristor-based neuromorphic computing devices may lead to noticeable programming and runtime noises on the deployed model’s parameters, resulting in a significant performance degradation. In this talk, we will discuss algorithmic and system solutions to improve robustness of memristor-based designs. We tackle this problem by modeling the distribution of parameter noise and accounting for it in the model training process. More generally, we derive a theoretical robustness guarantee against weight perturbation from curvature perspective, leading to general robustness against hardware noise, quantization noise, and generalization noise.

16.1 Young People Program Keynote: "Engineering skills that will advance quantum computing"

Date: Monday, 21 March 2022
Time: 19:00 - 19:45 CET

Session chair:
Sara Vinco, Politecnico di Torino, IT

Session co-chair:
Anton Klotz, Cadence Design Systems, DE

Quantum computing is a computing paradigm that exploits fundamental principles of quantum mechanics to tackle problems in mathematics, chemistry, and material science that require particularly extensive computational resources. Its power is derived from a quantum bit (qubit), a physical system that can be in a superposition state and entangled with other qubits. Quantum computing is the main driver behind the phenomenal development of selected areas in electronic engineering (such as cryogenic CMOS), computer science, machine learning, material science, etc. Many of the challenges in creating practical quantum computers are engineering challenges. In this talk, we would like to discuss the challenges that quantum computing has brought to the fields of electronic and computer engineering. We would also like to discuss quantum engineering and the skills required to begin a career in quantum engineering.

Speaker's bio: Elena Blokhina (Senior Member, IEEE) received the M.Sc. degree in physics and the Ph.D. degree in physical and mathematical sciences from Saratov State University, Russia, in 2002 and 2006, respectively, and the Habilitation HDR degree in electronic engineering from UPMC Sorbonne Universities, France, in 2017. Since 2007, she has been with University College Dublin, where she is currently an Associate Professor. Since 2019, she has also been with Equal1 Labs, where she is CTO. Her current research interests focus on the theory, modelling and characterisation of semiconductor quantum devices, quantum computing, modelling and simulations of nonlinear systems and multi-physics simulations. Prof. Blokhina was elected to serve as a member of the Board of Governors of the IEEE Circuits and Systems Society from 2013 to 2015 and was re-elected for the term 2015 to 2017. She has served as the Programme Co-Chair and General Co-Chair of multiple editions of the IEEE International Conference on Electronics, Circuits and Systems and the IEEE International Symposium on Integrated Circuits and Systems. From 2016 to 2017, she was an Associate Editor for IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, and from 2018 to 2021 she was the Deputy Editor-in-Chief of that journal. She has served as a member of organizing committees, review and programme committees, a session chair, and a track chair at many leading international conferences on microelectronic circuits and systems and device physics.
Robert Bogdan Staszewski (Fellow, IEEE) received the B.Sc. (summa cum laude), M.Sc., and Ph.D. degrees in electrical engineering from The University of Texas at Dallas, Richardson, TX, USA, in 1991, 1992, and 2002, respectively. From 1991 to 1995, he was with Alcatel Network Systems, Richardson, Texas, involved in SONET cross-connect systems for fiber optics communications. He joined Texas Instruments Incorporated, Dallas, TX, USA, in 1995, where he was elected as a Distinguished Member of Technical Staff (limited to 2% of technical staff). From 1995 to 1999, he was engaged in advanced CMOS read channel development for hard disk drives. In 1999, he co-started the Digital RF Processor (DRP) group within Texas Instruments with a mission to invent new digitally intensive approaches to traditional RF functions for integrated radios in deeply-scaled CMOS technology. He was appointed as a CTO of the DRP Group from 2007 to 2009. In 2009, he joined Delft University of Technology, Delft, The Netherlands, where he currently holds a guest appointment of a Full Professor (Antoni van Leeuwenhoek Hoogleraar). Since 2014, he has been a Full Professor with University College Dublin (UCD), Dublin, Ireland. He is also a Co-Founder of a startup company, Equal1 Labs, with design centers located in Silicon Valley and Dublin, Ireland, aiming to produce single-chip CMOS quantum computers. He has authored or coauthored five books, seven book chapters, 140 journal and 210 conference publications, and holds 210 issued U.S. patents. His research interests include nanoscale CMOS architectures and circuits for frequency synthesizers, transmitters and receivers, and quantum computers. Prof. Staszewski was a recipient of the 2012 IEEE Circuits and Systems Industrial Pioneer Award. In May 2019, he received the title of Professor from the President of the Republic of Poland. He was also the TPC Chair of the 2019 European Solid-State Circuits Conference (ESSCIRC), Krakow, Poland.

Time Label Presentation Title
Authors
19:00 CET 16.1.1 ENGINEERING SKILLS THAT WILL ADVANCE QUANTUM COMPUTING
Speaker and Authors:
Elena Blokhina and Robert Staszewski, University College Dublin, IE
Abstract
Quantum computing is a computing paradigm that exploits fundamental principles of quantum mechanics to tackle problems in mathematics, chemistry, and material science that require particularly extensive computational resources. Its power is derived from a quantum bit (qubit), a physical system that can be in a superposition state and entangled with other qubits. Quantum computing is the main driver behind the phenomenal development of selected areas in electronic engineering (such as cryogenic CMOS), computer science, machine learning, material science, etc. Many of the challenges in creating practical quantum computers are engineering challenges. In this talk, we would like to discuss the challenges that quantum computing has brought to the fields of electronic and computer engineering. We would also like to discuss quantum engineering and the skills required to begin a career in quantum engineering.

16.2 Young People Program Panel

Date: Monday, 21 March 2022
Time: 19:45 - 20:30 CET

Session chair:
Anton Klotz, Cadence, DE

Session co-chair:
Xavier Salazar, Barcelona Supercomputing Center & HiPEAC, ES

Panellists:
Antonia Schmalz, SPRIN D.org, DE
Ari Kulmala, Tampere University, FI
Anna Puig-Centelles, HADEA, ES
Alba Cervera, Barcelona Supercomputing Center, ES

The session will feature a round table discussion with different views and opportunities in computer science high-end research and careers. Speakers with heterogeneous backgrounds and positions have been invited to give their insights and valuable knowledge on these different paths.


IP.2_1 Interactive presentations

Date: Tuesday, 22 March 2022
Time: 11:30 - 12:15 CET

Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.

Label Presentation Title
Authors
IP.2_1.1 (Best Paper Award Candidate)
G-GPU: A FULLY-AUTOMATED GENERATOR OF GPU-LIKE ASIC ACCELERATORS
Speaker:
Tiago Diadami Perez, Tallinn University of Technology (TalTech), EE
Authors:
Tiago Diadami Perez1, Márcio Gonçalves2, José Rodrigo Azambuja2, Leonardo Gobatto2, Marcelo Brandalero3 and Samuel Pagliarini1
1Tallinn University of Technology (TalTech), EE; 2UFRGS, BR; 3Brandenburg University of Technology, DE
Abstract
Modern Systems on Chip (SoC), almost as a rule, require accelerators for achieving energy efficiency and high performance for specific tasks that are not necessarily well suited for execution in standard processing units. Considering the broad range of applications and the necessity for specialization, the design of SoCs has thus become considerably more challenging. In this paper, we put forward the concept of G-GPU, a general-purpose GPU-like accelerator that is not application-specific but still gives benefits in energy efficiency and throughput. Furthermore, we have identified an existing gap for these accelerators in ASIC, for which no known automated generation platform/tool exists. Our solution, called GPUPlanner, is an open-source generator of accelerators, from RTL to GDSII, that addresses this gap. Our analysis results show that our automatically generated G-GPU designs are remarkably efficient when compared against the popular CPU architecture RISC-V, presenting speed-ups of up to 223 times in raw performance and up to 11 times when the metric is performance derated by area. These results are achieved by executing a design space exploration of the GPU-like accelerators, where the memory hierarchy is broken in a smart fashion and the logic is pipelined on demand. Finally, tapeout-ready layouts of the G-GPU in 65nm CMOS are presented.
IP.2_1.2 (Best Paper Award Candidate)
EFFICIENT TRAVELING SALESMAN PROBLEM SOLVERS USING THE ISING MODEL WITH SIMULATED BIFURCATION
Speaker:
Tingting Zhang, University of Alberta, CA
Authors:
Tingting Zhang and Jie Han, University of Alberta, CA
Abstract
An Ising model-based solver has shown efficiency in obtaining suboptimal solutions for combinatorial optimization problems. As an NP-hard problem, the traveling salesman problem (TSP) plays an important role in various routing and scheduling applications. However, the execution speed and solution quality deteriorate significantly when using a solver with simulated annealing (SA), due to the quadratically increasing number of spins and the strong constraints placed on the spins. The ballistic simulated bifurcation (bSB) algorithm utilizes the signs of Kerr-nonlinear parametric oscillators' positions as the spins' states. It can update the states in parallel to alleviate the time explosion problem. In this paper, we propose an efficient method for solving TSPs by using the Ising model with bSB. Firstly, the TSP is mapped to an Ising model without external magnetic fields by introducing a redundant spin. Secondly, various evolution strategies for the introduced position and different dynamic configurations of the time step are considered to improve the efficiency in solving TSPs. The effectiveness is specifically discussed and evaluated by comparing the solution quality to SA. Experiments on benchmark datasets show that the proposed bSB-based TSP solvers offer superior performance in solution quality and achieve a significant speedup in runtime compared to recent SA-based ones.
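The bSB dynamics the paper builds on can be sketched for a generic Ising problem. The toy solver below is illustrative only (parameters, problem size, and the ferromagnetic test instance are invented for demonstration, not the paper's TSP mapping): oscillator positions and momenta evolve under a ramped bifurcation parameter, with inelastic walls at |x| = 1, and spins are read out as position signs.

```python
import math
import random

def bsb_solve(J, steps=2000, dt=0.1, c0=0.2):
    """Ballistic simulated bifurcation (bSB) sketch for an Ising problem.

    Minimizes H(s) = -1/2 * sum_ij J[i][j]*s_i*s_j over spins s_i in {-1, +1}.
    Each spin is the sign of an oscillator position x_i; positions and momenta
    y_i evolve under a bifurcation parameter a(t) ramped from 0 to a0, with
    inelastic walls at |x_i| = 1 (the "ballistic" variant)."""
    n = len(J)
    a0 = 1.0
    x = [random.uniform(-0.1, 0.1) for _ in range(n)]
    y = [random.uniform(-0.1, 0.1) for _ in range(n)]
    for step in range(steps):
        a = a0 * step / steps  # slowly ramp the bifurcation parameter
        for i in range(n):     # all momenta update in parallel in hardware
            coupling = sum(J[i][j] * x[j] for j in range(n))
            y[i] += dt * ((a - a0) * x[i] + c0 * coupling)
        for i in range(n):
            x[i] += dt * a0 * y[i]
            if abs(x[i]) > 1.0:               # inelastic wall: clamp position,
                x[i] = math.copysign(1.0, x[i])  # kill the momentum
                y[i] = 0.0
    return [1 if xi >= 0 else -1 for xi in x]

def ising_energy(J, s):
    n = len(J)
    return -0.5 * sum(J[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

random.seed(1)
# 4-spin ferromagnet: the ground states are the two fully aligned configurations.
J = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
s = bsb_solve(J)
assert ising_energy(J, s) == -6.0  # ground-state energy of this toy instance
```

The momentum updates for all spins depend only on the previous positions, which is the parallelism the abstract exploits to avoid SA's sequential single-spin updates.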
IP.2_1.3 (Best Paper Award Candidate)
PROVIDING RESPONSE TIMES GUARANTEES FOR MIXED-CRITICALITY NETWORK SLICING IN 5G
Speaker:
Andrea Nota, TU Dortmund, DE
Authors:
Andrea Nota, Selma Saidi, Dennis Overbeck, Fabian Kurtz and Christian Wietfeld, TU Dortmund, DE
Abstract
Mission critical applications in domains such as Industry 4.0, autonomous vehicles or Smart Grids are increasingly dependent on flexible, yet highly reliable communication systems. In this context, the Fifth Generation of mobile Communication Networks (5G) promises to support mixed-criticality applications on a single unified physical communication network. This is achieved by a novel approach known as network slicing, which promises to fulfil diverging requirements while providing strict separation between network tenants. In this work, we focus on hard performance guarantees by formalizing an analytical method for bounding response times in mixed-criticality 5G network slicing. We reduce pessimism by considering models of workload variations.
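The paper's analysis is not reproduced here, but analytical response-time bounds of this kind are typically built from arrival and service curves in the network-calculus style. A classic rate-latency/token-bucket delay bound, with invented illustrative numbers, looks like:

```python
def delay_bound(burst, rate, service_rate, latency):
    """Worst-case delay when token-bucket arrivals alpha(t) = burst + rate*t
    are served by a rate-latency curve beta(t) = service_rate * max(t - latency, 0).
    Standard network-calculus result; requires rate <= service_rate for stability."""
    assert rate <= service_rate, "unstable: arrival rate exceeds guaranteed service"
    return latency + burst / service_rate

# Illustrative numbers for one critical slice: 2 ms scheduling latency,
# 100 Mbit/s guaranteed rate, 50 kbit burst, 40 Mbit/s sustained rate.
d = delay_bound(burst=50e3, rate=40e6, service_rate=100e6, latency=2e-3)
assert abs(d - 2.5e-3) < 1e-9  # 2 ms latency + 0.5 ms to drain the burst
```

Reducing pessimism, as the abstract describes, amounts to tightening the arrival curve with more detailed models of workload variation so that the computed bound is closer to the real worst case.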

IP.2_2 Interactive presentations

Date: Tuesday, 22 March 2022
Time: 11:30 - 12:15 CET

Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.

Label Presentation Title
Authors
IP.2_2.1 (Best Paper Award Candidate)
SCI-FI: CONTROL SIGNAL, CODE, AND CONTROL FLOW INTEGRITY AGAINST FAULT INJECTION ATTACKS
Speaker:
Thomas Chamelot, University Grenoble Alpes, CEA, List, FR
Authors:
Thomas Chamelot1, Damien Couroussé1 and Karine Heydemann2
1University Grenoble Alpes, CEA, LIST, FR; 2Sorbonne Université, CNRS, FR
Abstract
Fault injection attacks have become a serious threat against embedded systems. Recently, Laurent et al. reported that some faults inside the microarchitecture escape all typical software fault models and thus software counter-measures. Moreover, state-of-the-art counter-measures, hardware-only or with hardware support, do not consider the integrity of microarchitectural control signals that are the target of these faults. We present SCI-FI, a counter-measure for Control Signal, Code, and Control-Flow Integrity against Fault Injection attacks. SCI-FI combines the protection of pipeline control signals with a fine-grained code and control-flow integrity mechanism, and can additionally provide code authentication. We evaluate SCI-FI by extending a RISC-V core. The average hardware area overheads range from 6.5% to 23.8%, and the average code size and execution time increase by 25.4% and 17.5% respectively.
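Control-flow integrity mechanisms in this family commonly accumulate a runtime signature over the executed basic blocks and compare it against a precomputed value at checkpoints, so a fault that skips, repeats, or redirects a block corrupts the signature. The sketch below illustrates only that general principle; the signature function and block IDs are invented for illustration and are not SCI-FI's actual mechanism.

```python
def step_sig(sig, block_id):
    """Toy signature update; a real scheme derives updates from the code bits."""
    return (sig * 31 + block_id) & 0xFFFFFFFF

def run(blocks, expected_sig):
    """Execute a path of basic blocks, accumulating a signature, and compare
    it against the precomputed value at the checkpoint."""
    sig = 0
    for block_id in blocks:
        sig = step_sig(sig, block_id)  # updated as each block executes
    if sig != expected_sig:
        raise RuntimeError("control-flow integrity violation detected")
    return sig

valid_path = [1, 2, 3]
expected = 0
for block_id in valid_path:
    expected = step_sig(expected, block_id)  # computed offline, at build time

run(valid_path, expected)      # the legitimate path passes the check
try:
    run([1, 3], expected)      # a fault that skips block 2 corrupts the signature
    raise AssertionError("violation went undetected")
except RuntimeError:
    pass
```

SCI-FI's contribution, per the abstract, is extending this kind of integrity checking down to the pipeline control signals themselves rather than only to code and control flow.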
IP.2_2.2 XTENSTORE: FAST SHIELDED IN-MEMORY KEY-VALUE STORE ON A HYBRID X86-FPGA SYSTEM
Speaker:
Hyungon Moon, UNIST, KR
Authors:
Hyunyoung Oh1, Dongil Hwang2, Maja Malenko3, Myunghyun Cho2, Hyungon Moon4, Marcel Baunach3 and Yunheung Paek2
1Seoul National University, KR; 2Dept. of Electrical and Computer Engineering and Inter-University Semiconductor Research Center (ISRC), Seoul National University, KR; 3Graz University of Technology, AT; 4UNIST, KR
Abstract
We propose XtenStore, a system that extends the existing SGX-based secure in-memory key-value store with an external hardware accelerator in order to ensure comparable security guarantees with lower performance degradation. The accelerator is implemented on a commodity FPGA card that is readily connected with the x86 CPU via PCIe interconnect to form a hybrid x86-FPGA system. In comparison to the prior SGX-based work, XtenStore improves the throughput by 4-33x, and exhibits considerably shorter tail latency (>23x, 99th-percentile).
IP.2_2.3 LEARNING TO MITIGATE ROWHAMMER ATTACKS
Speaker:
Biresh Kumar Joardar, Duke University, US
Authors:
Biresh Kumar Joardar, Tyler Bletsch and Krishnendu Chakrabarty, Duke University, US
Abstract
Rowhammer is a security vulnerability that arises due to the undesirable electrical interaction between physically adjacent rows in DRAMs. Rowhammer attacks cause bit flips in the neighboring rows by repeatedly accessing (hammering) a DRAM row. This phenomenon has been exploited to craft many types of attacks in platforms ranging from edge devices to datacenter servers. Existing DRAM protections using error-correction codes and targeted row refresh are not adequate for defending against Rowhammer attacks. In this work, we propose a Rowhammer-detection solution using machine learning (ML). Experimental evaluation shows that the proposed technique can reliably detect different types of Rowhammer attacks (both real and artificially engineered) and prevent bit flips. Moreover, the ML model introduces lower power and performance overheads on average compared to two recently proposed Rowhammer mitigation techniques, namely Graphene and Blockhammer, for 26 different applications from the Parsec, Pampar, and Splash-2 benchmark suites.
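The counter-based mitigations the paper compares against (e.g., Graphene) track per-row activation counts within a refresh window and preventively refresh the neighbors of heavily activated rows. A minimal sketch of that baseline idea follows; the threshold, class name, and row numbers are illustrative, and the paper's own detector is an ML model, not this counter scheme.

```python
from collections import Counter

class RowhammerDetector:
    """Counts DRAM row activations within one refresh window and flags rows
    whose activation count reaches the disturbance threshold, so their
    physical neighbors can be preventively refreshed (targeted row refresh)."""

    def __init__(self, threshold=50_000):  # illustrative activations-per-window limit
        self.threshold = threshold
        self.counts = Counter()

    def on_activate(self, row):
        """Called on every row activation; returns victim rows to refresh."""
        self.counts[row] += 1
        if self.counts[row] >= self.threshold:
            return [row - 1, row + 1]  # physically adjacent victim rows
        return []

    def on_refresh_window_end(self):
        self.counts.clear()  # counters reset at every refresh interval

det = RowhammerDetector(threshold=1000)
victims = []
for _ in range(1000):                      # hammer row 42 a thousand times
    victims = det.on_activate(row=42) or victims
assert victims == [41, 43]                 # neighbors flagged for refresh
```

An ML detector replaces the fixed per-row counters and threshold with a learned classifier over access-pattern features, which is where the abstract's power and performance savings come from.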

IP.2_3 Interactive presentations

Date: Tuesday, 22 March 2022
Time: 11:30 - 12:15 CET

Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.

Label Presentation Title
Authors
IP.2_3.1 ONCE FOR ALL SKIP: EFFICIENT ADAPTIVE DEEP NEURAL NETWORKS
Speaker:
Yu Yang, Yunnan University, CN
Authors:
Yu Yang, Di Liu, Hui Fang, Yi-Xiong Huang, Ying Sun and Zhi-Yuan Zhang, Yunnan University, CN
Abstract
In this paper, we propose a new module, namely once for all skip (OFAS), for adaptive deep neural networks to efficiently control block skipping within a DNN model. The novelty of OFAS is that it computes only once, for all skippable blocks, to determine their execution states. Moreover, since adaptive DNN models with OFAS cannot achieve the best accuracy and efficiency in end-to-end training, we propose a reinforcement learning-based training method to enhance the training procedure. Experimental results with different models and datasets demonstrate the effectiveness and efficiency of the approach in comparison to the state of the art. The code is available at https://github.com/ieslab-ynu/OFAS.
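The compute-once gating idea can be illustrated with a toy model: the skip pattern for all skippable blocks is decided up front from the input, so no gating computation runs between blocks. The gate rule and blocks below are invented stand-ins, not the paper's architecture (see the linked repository for the real implementation).

```python
def make_block(delta):
    """Toy residual block: x -> x + delta stands in for a DNN block."""
    return lambda x: x + delta

def skip_gate(x, num_blocks):
    """Computed ONCE per input: one execute/skip decision per skippable block
    (here a toy rule that gives larger inputs a larger compute budget)."""
    budget = min(num_blocks, int(abs(x)))
    return [i < budget for i in range(num_blocks)]

def adaptive_forward(x, blocks):
    # Unlike per-block gating, the whole skip pattern is decided up front,
    # so no gating network runs between blocks during the forward pass.
    decisions = skip_gate(x, len(blocks))
    for block, execute in zip(blocks, decisions):
        if execute:
            x = block(x)
    return x

blocks = [make_block(1.0) for _ in range(4)]
assert adaptive_forward(0.5, blocks) == 0.5  # easy input: all blocks skipped
assert adaptive_forward(2.0, blocks) == 4.0  # harder input: two blocks run
```

The single up-front decision is what distinguishes OFAS-style gating from schemes that evaluate a gate before every block, which adds latency at each layer boundary.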
IP.2_3.2 SELF-AWARE MIMO BEAMFORMING SYSTEMS: DYNAMIC ADAPTATION TO CHANNEL CONDITIONS AND MANUFACTURING VARIABILITY
Speaker:
Suhasini Komarraju, Georgia Institute of Technology, US
Authors:
Suhasini Komarraju and Abhijit Chatterjee, Georgia Institute of Technology, US
Abstract
Emerging wireless technologies employ MIMO beamforming antenna arrays to improve the channel Signal-to-Noise Ratio (SNR). The increased dynamic range of channel SNR values that can be accommodated creates power stress on Radio Frequency (RF) electronic circuitry. To alleviate this, we propose an approach in which the circuitry, along with other transmission coding parameters, can be dynamically tuned in response to channel SNR and beam-steering angle to either minimize power consumption or maximize throughput in the presence of manufacturing process variations, while meeting a specified Bit Error Rate (BER) limit. The adaptation control policy is learned online and is facilitated by information obtained from testing of the RF circuitry before deployment.
IP.2_3.3 SALVAGING RUNTIME BAD BLOCKS BY SKIPPING BAD PAGES FOR IMPROVING SSD PERFORMANCE
Speaker:
Mincheol Kang, KAIST, KR
Authors:
Junoh Moon, Mincheol Kang, Wonyoung Lee and Soontae Kim, KAIST, KR
Abstract
Recent research has revealed that runtime bad blocks are found in the early lifespan of solid state drives. The reduction in overprovisioning space due to runtime bad blocks may well have a negative impact on performance as it weakens the chances of selecting a better victim block during garbage collection. Moreover, previous studies focused on reusing worn-out bad blocks exceeding a program/erase cycle threshold, leaving the problem of runtime bad blocks unaddressed. Based on this observation, we present a salvation scheme for runtime bad blocks. This paper reveals that these blocks can be identified when a page write fails at runtime. Furthermore, we introduce a method to salvage functioning pages from runtime bad blocks. Consequently, the loss in the overprovisioning space can be minimized even after the occurrence of runtime bad blocks. Experimental results show a 26.3% reduction in latency and a 25.6% increase in throughput compared to the baseline at a conservative bad block ratio of 0.45%. Additionally, our results confirm that almost no overhead was observed.
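The page-salvaging idea can be sketched as a per-block bad-page bitmap consulted by the page allocator, so a runtime bad block keeps contributing its good pages to over-provisioning instead of being retired whole. This is a toy sketch under invented interfaces and page counts, not the paper's design.

```python
PAGES_PER_BLOCK = 8  # illustrative; real flash blocks have hundreds of pages

class FlashBlock:
    """Flash block whose individual bad pages are skipped instead of the
    whole block being retired, preserving over-provisioning space."""

    def __init__(self):
        self.bad = [False] * PAGES_PER_BLOCK  # per-page bad-status bitmap
        self.next_page = 0                    # flash pages are written in order

    def mark_bad(self, page):
        """Called when a runtime page-program failure identifies a bad page."""
        self.bad[page] = True

    def alloc_page(self):
        """Return the next writable good page, skipping bad ones; None if full."""
        while self.next_page < PAGES_PER_BLOCK:
            page = self.next_page
            self.next_page += 1
            if not self.bad[page]:
                return page
        return None

    def good_capacity(self):
        return self.bad.count(False)

blk = FlashBlock()
blk.mark_bad(2)  # runtime write failures revealed pages 2 and 5 are bad
blk.mark_bad(5)
written = [blk.alloc_page() for _ in range(6)]
assert written == [0, 1, 3, 4, 6, 7]  # bad pages 2 and 5 were skipped
assert blk.alloc_page() is None       # block exhausted after its 6 good pages
```

Retiring only pages rather than blocks keeps the over-provisioning pool larger, which gives garbage collection better victim-block choices, matching the performance argument in the abstract.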

IP.2_4 Interactive presentations

Date: Tuesday, 22 March 2022
Time: 11:30 - 12:15 CET

Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.

Label Presentation Title
Authors
IP.2_4.1 SACC: SPLIT AND COMBINE APPROACH TO REDUCE THE OFF-CHIP MEMORY ACCESSES OF LSTM ACCELERATORS
Speaker:
Saurabh Tewari, Indian Institute Of Technology, IN
Authors:
Saurabh Tewari, Anshul Kumar and Kolin Paul, IIT Delhi, IN
Abstract
Long Short-Term Memory (LSTM) networks are widely used in speech recognition and natural language processing. Recently, a large number of LSTM accelerators have been proposed for the efficient processing of LSTM networks. The high energy consumption of these accelerators limits their usage in energy-constrained systems. LSTM accelerators repeatedly access large weight matrices from off-chip memory, significantly contributing to energy consumption. Reducing off-chip memory access is the key to improving the energy efficiency of these accelerators. We propose a data reuse approach that splits and combines the LSTM cell computations in a way that reduces the off-chip memory accesses of LSTM hidden-state matrices by 50%. In addition, the data reuse efficiency of our approach is independent of on-chip memory size, making it more suitable for LSTM accelerators with small on-chip memory. Experimental results show that our approach reduces off-chip memory access by 28% and 32%, and energy consumption by 13% and 16%, respectively, compared to conventional approaches for character-level language modelling and speech recognition LSTM models.
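As a rough illustration of the accounting behind such data reuse (this is not the paper's exact split-and-combine scheme, and the function below is hypothetical), halving the number of off-chip fetches of the hidden-state weight matrix directly yields the reported 50% reduction in hidden-state matrix accesses:

```python
# Hedged sketch: if the hidden-state weight matrix W_h must be fetched
# from off-chip memory for every hidden-state product, reusing one fetch
# for two products halves those accesses. The paper's split-and-combine
# scheme achieves this 2x reuse; the counter below only shows the math.

def hidden_weight_fetches(timesteps, products_per_fetch):
    fetched = 0
    remaining = timesteps          # one W_h product per timestep
    while remaining > 0:
        fetched += 1               # one off-chip fetch of W_h
        remaining -= products_per_fetch
    return fetched

baseline = hidden_weight_fetches(100, products_per_fetch=1)
reused   = hidden_weight_fetches(100, products_per_fetch=2)

# 2x reuse gives the 50% reduction in hidden-state matrix accesses.
assert reused == baseline // 2
```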
IP.2_4.2 NPU-ACCELERATED IMITATION LEARNING FOR THERMAL- AND QOS-AWARE OPTIMIZATION OF HETEROGENEOUS MULTI-CORES
Speaker:
Martin Rapp, Karlsruhe Institute of Technology, DE
Authors:
Martin Rapp, Nikita Krohmer, Heba Khdr and Joerg Henkel, Karlsruhe Institute of Technology, DE
Abstract
Task migration and dynamic voltage and frequency scaling (DVFS) are indispensable means in the thermal optimization of a heterogeneous clustered multi-core processor under user-defined quality of service (QoS) targets. However, selecting the core to execute each application and the voltage/frequency (V/f) levels of each cluster is a complex problem because 1) the diverse characteristics and QoS targets of applications require different optimizations, and 2) V/f levels are often shared between cores on a cluster, which requires a global optimization considering all running applications. State-of-the-art techniques for power or temperature minimization either rely on measurements that are often not available (such as power) or fail to consider all dimensions of the problem (e.g., by using simplified analytical models). Imitation learning (IL) makes it possible to exploit the optimality of an oracle policy at low run-time overhead by training a model from oracle demonstrations. We are the first to employ IL for temperature minimization under QoS targets. We tackle the complexity by using a neural network (NN) model and accelerate the NN inference using a neural processing unit (NPU). While such NN accelerators are becoming increasingly widespread on end devices, they have so far only been used to accelerate user applications. In contrast, we use an accelerator on a real platform to accelerate NN-based resource management. Our evaluation on a HiKey970 board with an Arm big.LITTLE CPU and an NPU shows significant temperature reductions at a negligible overhead while satisfying QoS targets.
IP.2_4.3 BMPQ: BIT-GRADIENT SENSITIVITY DRIVEN MIXED-PRECISION QUANTIZATION OF DNNS FROM SCRATCH
Speaker:
Souvik Kundu, University of Southern California, US
Authors:
Souvik Kundu, Shikai Wang, Qirui Sun, Peter Beerel and Massoud Pedram, University of Southern California, US
Abstract
Large DNNs with mixed-precision quantization can achieve ultra-high compression while retaining high classification performance. However, because of the challenge of finding an accurate metric to guide the optimization process, these methods either sacrifice significant performance compared to the 32-bit floating-point (FP-32) baseline or rely on compute-expensive iterative training policies that require a pre-trained baseline. To address this issue, this paper presents BMPQ, a training method that uses bit gradients to analyze layer sensitivities and yield mixed-precision quantized models. BMPQ requires a single training iteration but does not need a pre-trained baseline. It uses an integer linear program (ILP) to dynamically adjust the precision of layers during training subject to a fixed hardware budget. To evaluate the efficacy of BMPQ, we conduct extensive experiments with VGG16 and ResNet18 on the CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets. Compared to the baseline FP-32 models, BMPQ can yield models with 15.4× fewer parameter bits and a negligible drop in accuracy. Compared to the SOTA “during training” mixed-precision training scheme, our models are 2.1×, 2.2×, and 2.9× smaller on CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively, with improved accuracy of up to 14.54%. We have open-sourced our trained models and test code for reproducibility.
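The precision-assignment step admits a compact sketch. BMPQ itself derives sensitivities from bit gradients measured during training and solves an ILP; the stand-in below uses made-up sensitivities and a brute-force search over a tiny design space purely to show the shape of the optimization:

```python
# Stand-in for BMPQ's ILP (hedged: sensitivities, layer sizes, and the
# 2**(-b) error model are all hypothetical). Pick a bit-width per layer
# minimizing sensitivity-weighted quantization error under a fixed total
# parameter-bit budget.
from itertools import product

sensitivity = [4.0, 1.0, 0.5]          # per-layer sensitivities (assumed)
params      = [1000, 4000, 2000]       # parameters per layer (assumed)
choices     = [2, 4, 8]                # candidate bit-widths
budget      = 30000                    # total parameter bits allowed

best, best_cost = None, float("inf")
for bits in product(choices, repeat=len(params)):
    if sum(b * p for b, p in zip(bits, params)) > budget:
        continue                        # violates the hardware budget
    # Quantization error of a b-bit layer shrinks roughly as 2**(-b);
    # weight it by the layer's sensitivity.
    cost = sum(s * 2.0 ** (-b) for s, b in zip(sensitivity, bits))
    if cost < best_cost:
        best, best_cost = bits, cost

# The most sensitive layer ends up with the largest bit-width.
assert best is not None
assert best[0] == max(best)
```

At realistic network depths an ILP solver replaces the exhaustive loop, but the objective and the budget constraint keep this form.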

IP.2_5 Interactive presentations

Date: Tuesday, 22 March 2022
Time: 11:30 - 12:15 CET

Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.

Label Presentation Title
Authors
IP.2_5.1 EM SCA & FI SELF-AWARENESS AND RESILIENCE WITH SINGLE ON-CHIP LOOP & ML CLASSIFIERS
Speaker:
Archisman Ghosh, Purdue University, US
Authors:
Archisman Ghosh1, Debayan Das2, Santosh Ghosh2 and Shreyas Sen1
1Purdue University, US; 2Intel Corp., US
Abstract
Securing ICs is becoming increasingly challenging with improvements in electromagnetic (EM) side-channel analysis (SCA) and fault injection (FI) attacks. In this work, we develop a proactive approach to detect and counter these attacks by embedding a single on-chip integrated loop around a crypto core (AES-256), designed and fabricated in a TSMC 65nm process. The measured results demonstrate that the proposed system 1) provides EM self-awareness by acting as an on-chip H-field sensor, detecting voltage/clock-glitching fault attacks; 2) senses an approaching EM probe to detect an incoming threat; and 3) can be used to induce EM noise to increase resilience against EM attacks.
IP.2_5.2 RTSEC: AUTOMATED RTL CODE AUGMENTATION FOR HARDWARE SECURITY ENHANCEMENT
Speaker:
Orlando Arias, University of Florida, US
Authors:
Orlando Arias1, Zhaoxiang Liu2, Xiaolong Guo2, Yier Jin1 and Shuo Wang1
1University of Florida, US; 2Kansas State University, US
Abstract
Current hardware designs have increased in complexity, reducing our ability to perform security checks on them. Further, the addition of security features to these designs is still largely manual, which further complicates the design and integration process. In this paper, we address these shortcomings by introducing RTSec, a framework capable of performing security analysis on designs as well as integrating security features directly into the HDL code, a capability that commercial EDA tools do not provide. RTSec first breaks down HDL code into an abstract syntax tree, which is then used to infer the logic of the design. We demonstrate how RTSec can automatically include two security mechanisms in RTL designs: watermarking and logic locking. We also compare the efficacy of our analysis algorithms with state-of-the-art tools, demonstrating that RTSec's capabilities are equal or superior to theirs while also providing the means to add security features to the design.
IP.2_5.3 INTER-IP MALICIOUS MODIFICATION DETECTION THROUGH STATIC INFORMATION FLOW TRACKING
Speaker:
Zhaoxiang Liu, Kansas State University, US
Authors:
Zhaoxiang Liu1, Orlando Arias2, Weimin Fu1, Yier Jin2 and Xiaolong Guo1
1Kansas State University, US; 2University of Florida, US
Abstract
To help expand the usage of formal methods in the hardware security domain, we propose a static register-transfer level (RTL) security analysis framework and an electronic design automation (EDA) tool named If-Tracker to support it. Through this framework, a data-flow model is automatically extracted from the RTL description of the SoC, and information flow security properties are then generated. The tool checks all possible inter-IP paths to verify whether any property violations exist. The effectiveness of the proposed framework is demonstrated on customized SoC designs using the AMBA bus, where malicious modifications are inserted across multiple IPs; existing IP-level security analysis tools cannot detect such Trojans. Compared to commercial formal tools such as Cadence JasperGold and Synopsys VC-Formal, our framework provides a much simpler user interface and can identify more types of malicious modifications.

IP.2_6 Interactive presentations

Date: Tuesday, 22 March 2022
Time: 11:30 - 12:15 CET

Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.

Label Presentation Title
Authors
IP.2_6.1 MANY-LAYER HOTSPOT DETECTION BY LAYER-ATTENTIONED VISUAL QUESTION ANSWERING
Speaker:
Yen-Shuo Chen, National Taiwan University, TW
Authors:
Yen-Shuo Chen and Iris Hui-Ru Jiang, National Taiwan University, TW
Abstract
Exploring hotspot patterns and correcting them as early as possible is crucial to guarantee yield and manufacturability. Hotspot patterns can be classified into various types according to their potentially induced defects. In modern layouts, defects are caused not only by the geometry on one specific layer but also by the accumulated influence of other layers. Existing hotspot detection and pattern classification methods, however, consider only the geometry on one single layer or one main layer with adjacent layers. Nor can they recognize the corresponding defect type for a hotspot pattern. Therefore, in this paper, we investigate the linkage between many-layer hotspot patterns and the corresponding potentially induced defect types. We first cast the many-layer critical hotspot pattern extraction task as a visual question answering (VQA) problem: treating a many-layer layout pattern as an image and a defect type as a question, we devise a layer-attentioned VQA model to answer whether the pattern is critical to the queried defect type. Simply considering all layers equally may dilute the key features of hotspot patterns. Thus, our layer attention mechanism attempts to identify the importance and relevance of each layer for different defect types. Experimental results show that the proposed model has superior performance and question-answering ability on modern layouts with more than thirty layout layers.
IP.2_6.2 RESTORE: REAL-TIME TASK SCHEDULING ON A TEMPERATURE AWARE FINFET BASED MULTICORE
Speaker:
Shounak Chakraborty, Department of Computer Science, Norwegian University of Science and Technology (NTNU), NO
Authors:
Yanshul Sharma1, Sanjay Moulik1 and Shounak Chakraborty2
1IIIT Guwahati, IN; 2Norwegian University of Science and Technology, NO
Abstract
In this work, we propose RESTORE, which exploits a unique thermal feature of FinFET-based multicore platforms, where processing speed increases with temperature, to meet the design constraints of time-critical real-time systems. RESTORE is a temperature-aware real-time scheduler for FinFET-based multicore systems that first derives a task-to-core allocation and prepares a schedule. Next, it balances performance and temperature on the fly by incorporating prudential temperature-cognizant voltage/frequency scaling while guaranteeing task deadlines. Simulation results show that RESTORE maintains a safe and stable thermal status (peak temperature below 80 °C) and hence the frequency (3.7 GHz on average), ensuring legitimate time-critical performance for a variety of workloads while surpassing the state of the art.
IP.2_6.3 ONLINE PERFORMANCE AND POWER PREDICTION FOR EDGE TPU VIA COMPREHENSIVE CHARACTERIZATION
Speaker:
Yang Ni, University of California, Irvine, US
Authors:
Yang Ni1, Yeseong Kim2, Tajana S. Rosing3 and Mohsen Imani1
1University of California, Irvine, US; 2DGIST, KR; 3University of California, San Diego, US
Abstract
In this paper, we characterize and model the performance and power consumption of the Edge TPU, which efficiently accelerates deep learning (DL) inference in a low-power environment. As a high-throughput computation architecture now deployed at the edge, the systolic array motivates our interest in its performance and power patterns. We perform an extensive study of various neural network settings and sizes using more than 10,000 DL models. Through comprehensive exploration, we profile which factors most influence the inference time and power of DL models. We present key remarks on the relation between performance/power and DL model complexity to enable hardware-aware optimization and design decisions. For example, our measurements show that energy/performance is not linearly proportional to the number of MAC operations. In fact, as the computation and DL model size increase, the performance follows a stepped pattern. Hence, an accurate estimate must consider other features of DL models, such as on-chip/off-chip memory usage. Based on the characterization, we propose a modeling framework, called PETET, which performs online predictions of the performance and power of the Edge TPU. The proposed method automatically identifies the relationship of performance, power, and memory usage to the DL model settings based on machine learning techniques.
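The stepped pattern the authors observe can be reproduced with a toy latency model. The sketch below is hedged: the tile size and cycle cost are made-up numbers, not Edge TPU specifications; it only shows why tiled execution makes latency a step function of MAC count rather than a line:

```python
# Hedged illustration: if a systolic array retires work in fixed-size
# tiles, a partially filled last tile costs as much as a full one, which
# produces plateaus ("steps") in latency vs. MAC count.
import math

TILE_MACS   = 4096    # MACs per fully utilized tile pass (assumed)
TILE_CYCLES = 128     # cycles per tile pass, full or not (assumed)

def latency_cycles(total_macs: int) -> int:
    return math.ceil(total_macs / TILE_MACS) * TILE_CYCLES

# Within one tile, more MACs cost nothing extra...
assert latency_cycles(1) == latency_cycles(4096) == 128
# ...and crossing a tile boundary jumps by a whole step.
assert latency_cycles(4097) == 256
```

This is exactly why a predictor trained only on MAC counts misses the plateaus and needs memory-usage features as well.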

IP.2_7 Interactive presentations

Date: Tuesday, 22 March 2022
Time: 11:30 - 12:15 CET

Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.

Label Presentation Title
Authors
IP.2_7.1 PROACTIVE RUN-TIME MITIGATION FOR TIME-CRITICAL APPLICATIONS USING DYNAMIC SCENARIO METHODOLOGY
Speaker:
Ji-Yung Lin, IMEC, TW
Authors:
Ji-Yung Lin1, Pieter Weckx2, Subrat Mishra2, Alessio Spessot2 and Francky Catthoor2
1KU Leuven, BE; 2imec, BE
Abstract
Energy saving is important for both high-end processors and battery-powered devices. However, for time-critical applications such as automotive autonomous-driving systems and multimedia streaming, saving energy by slowing down execution threatens the applications' timing guarantees. The worst-case execution time (WCET) method is a widespread solution to this problem, but its static execution time model is no longer sufficient for today's highly dynamic hardware and applications. In this work, a fully proactive run-time mitigation methodology is proposed for energy saving while ensuring timing guarantees. The methodology introduces heterogeneous datapath options, a fast fine-grained knob that enables processors to switch between datapaths of different speed and energy levels with a switching time of only tens of clock cycles. In addition, a run-time controller using a dynamic scenario methodology is developed. It incorporates execution time prediction and the calculation of timing-guarantee criteria, so it can dynamically switch knobs to save energy while still rigorously ensuring all timing guarantees. Simulation shows that the proposed methodology can mitigate a dynamic workload without any deadline misses while saving energy.
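The control decision at the heart of such a scheme can be sketched in a few lines. Everything below is hypothetical (datapath names, costs, safety margin); it only shows the shape of a controller that prefers the low-energy datapath whenever the predicted execution time still meets the deadline:

```python
# Minimal sketch of the run-time knob-switching idea: predict the job's
# execution time on the slow, low-energy datapath and pick it only if the
# prediction (with a safety margin) still meets the deadline; otherwise
# switch to the fast datapath. Numbers are made up.

DATAPATHS = {            # predicted (time, energy) per work unit (assumed)
    "slow": (2.0, 1.0),  # half speed, lower energy
    "fast": (1.0, 3.0),
}
MARGIN = 1.1             # safety factor on the execution-time prediction

def choose_datapath(work_units: float, time_left: float) -> str:
    slow_time = work_units * DATAPATHS["slow"][0]
    if slow_time * MARGIN <= time_left:
        return "slow"    # timing guarantee still holds: save energy
    return "fast"        # otherwise protect the deadline

assert choose_datapath(work_units=10, time_left=30.0) == "slow"
assert choose_datapath(work_units=10, time_left=15.0) == "fast"
```

The paper's controller is richer (dynamic scenarios, per-knob switching costs of tens of cycles), but the deadline-versus-energy decision has this structure.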
IP.2_7.2 ANALYZING CAN'S TIMING UNDER PERIODICALLY AUTHENTICATED ENCRYPTION
Speaker:
Mingqing Zhang, TU Chemnitz, DE
Authors:
Mingqing Zhang1, Philip Parsch1, Henry Hoffmann2 and Alejandro Masrur1
1TU Chemnitz, DE; 2University of Chicago, US
Abstract
With increasing connectivity in the automotive domain, it has become easier to remotely access in-vehicle buses like CAN (Controller Area Network). This not only jeopardizes security, but also exposes CAN's limitations. In particular, to reject replay and spoofing attacks, messages need to be authenticated, i.e., an authentication tag has to be included. As a result, messages become larger and need to be split into at least two frames due to CAN's restrictive payload. This increases the delay on the bus and, thus, some deadlines may start being missed, compromising safety. In this paper, we propose Periodically Authenticated Encryption (PAE), based on the observation that we do not need to send authentication tags with every single message on the bus, but only with a configurable frequency that allows meeting both safety and security requirements. Plausibility checks can then be used to detect whether non-authenticated messages sent between two authenticated ones have been altered or are being replayed, e.g., the transmitted values exceed a given range or are not in accordance with previous ones. We extend CAN's known schedulability analysis to consider PAE and analyze its timing behavior based on an implementation on real hardware and on extensive simulations.
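The plausibility checks described in the abstract can be sketched directly. The thresholds below are made up for illustration; a real deployment would derive them from the signal's physical semantics:

```python
# Hedged sketch of PAE's plausibility idea for the non-authenticated
# frames sent between two authenticated ones: accept a value only if it
# lies in a physically plausible range and changed by a plausible amount
# relative to the previously accepted value. Thresholds are hypothetical.

VALID_RANGE = (0, 250)       # e.g. plausible vehicle speed in km/h
MAX_DELTA   = 10             # max plausible change between frames

def plausible(value, previous):
    lo, hi = VALID_RANGE
    if not lo <= value <= hi:
        return False                     # outside the physical range
    return abs(value - previous) <= MAX_DELTA

# A smooth signal passes; a replayed/spoofed jump is flagged.
assert plausible(52, previous=50)
assert not plausible(180, previous=50)   # implausible jump
assert not plausible(300, previous=50)   # outside valid range
```

Only the periodically authenticated frames carry tags; these cheap checks bridge the gaps in between.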
IP.2_7.3 TOWARDS ADC-LESS COMPUTE-IN-MEMORY ACCELERATORS FOR ENERGY EFFICIENT DEEP LEARNING
Speaker:
Utkarsh Saxena, Purdue University, US
Authors:
Utkarsh Saxena, Indranil Chakraborty and Kaushik Roy, Purdue University, US
Abstract
Compute-in-Memory (CiM) hardware has shown great potential in accelerating Deep Neural Networks (DNNs). However, most CiM accelerators for matrix-vector multiplication rely on costly analog-to-digital converters (ADCs), which become a bottleneck in achieving high energy efficiency. In this work, we propose a hardware-software co-design approach to reduce these ADC costs through partial-sum quantization. Specifically, we replace ADCs with 1-bit sense amplifiers and develop a quantization-aware training methodology to compensate for the loss in representation ability. We show that the proposed ADC-less DNN model achieves a 1.1x-9.6x reduction in energy consumption while maintaining accuracy within 1% of the DNN model without partial-sum quantization.
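The effect of swapping ADCs for 1-bit sense amplifiers can be shown numerically. The sketch below uses toy values and a hypothetical slicing; it only demonstrates what information survives when each crossbar slice's analog partial sum is reduced to its sign:

```python
# Hedged sketch: each crossbar slice produces an analog partial sum; a
# 1-bit sense amplifier keeps only its sign instead of an ADC reading its
# magnitude, and the digital side accumulates the +/-1 results.

def sign(x):
    return 1 if x >= 0 else -1

def dot_adc_less(x, w, slice_size):
    """Dot product with 1-bit-quantized partial sums per crossbar slice."""
    total = 0
    for i in range(0, len(x), slice_size):
        partial = sum(a * b for a, b in zip(x[i:i + slice_size],
                                            w[i:i + slice_size]))
        total += sign(partial)           # sense amp: only the sign survives
    return total

x = [1, 1, 1, -1, 1, 1, 1, 1]
w = [1, 1, -1, -1, 1, 1, 1, -1]

exact = sum(a * b for a, b in zip(x, w))
assert exact == 4
# The 1-bit version keeps only coarse information (here, 2 instead of 4);
# the paper's quantization-aware training compensates for that loss.
assert dot_adc_less(x, w, slice_size=4) == 2
```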

IP.MPP Multi-Partner Projects – Interactive Presentations

Date: Tuesday, 22 March 2022
Time: 11:30 - 12:15 CET

The session is dedicated to multi-partner innovative and high-tech research projects addressing the DATE 2022 topics. The types of collaboration covered are projects funded by EU schemes (H2020, ESA, EIC, MSCA, COST, etc.), nationally and regionally funded projects, and collaborative research projects funded by industry. Depending on the stage of the project, the papers present the novelty of the project concepts, the relevance of the technical objectives to the DATE community, technical highlights of the project results, and insights into the lessons learnt in the project or items that remain open until the end of the project. In particular, three interactive presentations cover concepts for the embedded FPGA tile of the European Processor Initiative, a training network view on approximate computing trade-offs, and an open-source RISC-V SoC with AI accelerator.

Label Presentation Title
Authors
IP.MPP.1 TOWARDS RECONFIGURABLE ACCELERATORS IN HPC: DESIGNING A MULTIPURPOSE EFPGA TILE FOR HETEROGENEOUS SOCS
Speaker:
Juan Miguel de Haro Ruiz, Barcelona Supercomputing Center, ES
Authors:
Tim Hotfilter1, Juan Miguel de Haro Ruiz2, Fabian Kreß1, Carlos Alvarez3, Fabian Kempf1, Daniel Jimenez-Gonzalez4, Miquel Moreto2, Imen Baili5, Jesus Labarta2 and Juergen Becker1
1Karlsruhe Institute of Technology, DE; 2Barcelona Supercomputing Center, ES; 3Universitat Politècnica de Catalunya, ES; 4Universitat Politècnica de Catalunya - Barcelona Supercomputing Center, ES; 5Technical Product Marketing, FR
Abstract
The goal of modern high performance computing platforms is to combine low power consumption and high throughput. Within the European Processor Initiative (EPI), such an SoC platform is built and investigated to meet novel exascale requirements. As part of this project, we introduce an embedded Field Programmable Gate Array (eFPGA), adding the flexibility to accelerate various workloads. In this article, we show our approach to designing the eFPGA tile that supports the EPI SoC. While eFPGAs are inherently reconfigurable, their initial design has to be fixed for tape-out. The design space of the eFPGA is explored and evaluated with different configurations of two HPC workloads, covering control- and dataflow-heavy applications. As a result, we present a well-balanced eFPGA design that can host these use cases and potential future ones while occupying only 1% of the total EPI SoC area. Finally, our simulation results of the architectures on the eFPGA show great performance improvements over their software counterparts.
IP.MPP.2 TOWARDS APPROXIMATE COMPUTING FOR ACHIEVING ENERGY VS. ACCURACY TRADE-OFFS
Speaker:
Jari Nurmi, Tampere University, FI
Authors:
Jari Nurmi and Aleksandr Ometov, Tampere University, FI
Abstract
Despite the recent advances in semiconductor technology and energy-aware system design, the overall energy consumption of computing and communication systems is rapidly growing. On the one hand, the pervasiveness of these technologies everywhere, in the form of mobile devices, cyber-physical embedded systems, sensor networks, wearables, social media and context awareness, intelligent machines, broadband cellular networks, Cloud computing, and the Internet of Things (IoT), has drastically increased the demand for computing and communications. On the other hand, user expectations on the features and battery life of online devices are increasing all the time, creating another incentive for finding good trade-offs between performance and energy consumption. One opportunity to address this growing demand is to utilize an Approximate Computing approach in software and hardware design. The APROPOS project aims at finding the balance between accuracy and energy consumption; as the project is still at an early stage, this short paper provides an initial overview of the corresponding roadmap.
IP.MPP.3 THE SELENE DEEP LEARNING ACCELERATION FRAMEWORK FOR SAFETY-RELEVANT APPLICATIONS
Speaker:
Laura Medina, Universitat Politècnica de València, ES
Authors:
Laura Medina1, Salvador Carrión1, Pablo Cerezo1, Tomás Picornell1, José Flich1, Carles Hernandez1, Markel Sainz2, Michael Sandoval2, Charles-Alexis Lefebvre2, Martin Ronnback3, Martin Matschnig4, Matthias Wess4 and Herbert Taucher4
1Universitat Politècnica de València, ES; 2Ikerlan Technology Research Centre, Basque Research and Technology Alliance (BRTA), ES; 3Cobham Gaisler, SE; 4Siemens Technology, DE
Abstract
The goal of the H2020 SELENE project is the development of a flexible computing platform for autonomous applications that includes built-in hardware support for safety. The SELENE computing platform is an open-source RISC-V heterogeneous multicore system-on-chip (SoC) that includes six NOEL-V RISC-V cores and artificial intelligence accelerators. In this paper, we describe the approach we have followed in the SELENE project to accelerate neural network inference. Our intermediate results show that both the FPGA and ASIC accelerators provide real-time inference performance for the analyzed network models at a reasonable implementation cost.

L.1 Panel on Quantum and Neuromorphic Computing: "What’s it like to be an Engineer for Emerging Computing Technologies?"

Date: Tuesday, 22 March 2022
Time: 12:30 - 14:00 CET

Session chair:
Anne Matsuura, Intel, US

Session co-chair:
Aida Todri Sanial, LIRMM, FR

Panellists:
Fernando Gonzalez Zalba, Quantum Motion Technologies, GB
Théophile Gonos, A.I. Mergence, FR
Robert Wille, Johannes Kepler University Linz, AT

In this session, we invite neuromorphic and quantum engineers to share their experiences of becoming engineers and working on emerging computing technologies. After the presentations, the floor will be opened for discussion and exchange with the moderator and audience.


17.1 Brain- and Bio-inspired architectures and applications

Date: Tuesday, 22 March 2022
Time: 14:30 - 15:30 CET

Session chair:
Michael Niemier, University of Notre Dame, US

Session co-chair:
François Rummens, CEA, FR

This session focuses on architectures and applications in the context of biochips and neural networks. This includes solutions for adaptive droplet routing and contamination-free switches for biochips, as well as the combination of graph convolutional networks and processing-in-memory. Spiking neural networks try to replicate brain-like behavior; the session shows how this emerging technology can be combined with the concept of hyperdimensional computing and how the backpropagation-through-time approach can be applied more efficiently.

Time Label Presentation Title
Authors
14:30 CET 17.1.1 (Best Paper Award Candidate)
ADAPTIVE DROPLET ROUTING FOR MEDA BIOCHIPS VIA DEEP REINFORCEMENT LEARNING
Speaker:
Mahmoud Elfar, Duke University, US
Authors:
Mahmoud Elfar, Tung-Che Liang, Krishnendu Chakrabarty and Miroslav Pajic, Duke University, US
Abstract
Digital microfluidic biochips (DMFBs) based on a micro-electrode-dot-array (MEDA) architecture provide fine-grained control and sensing of droplets in real-time. However, excessive actuation of microelectrodes in MEDA biochips can lead to charge trapping during bioassay execution, causing the failure of microelectrodes and erroneous bioassay outcomes. A recently proposed enhancement to MEDA allows run-time measurement of microelectrode health information, thereby enabling synthesis of adaptive routing strategies for droplets. However, existing synthesis solutions are computationally infeasible for large MEDA biochips that have been commercialized. In this paper, we propose a synthesis framework for adaptive droplet routing in MEDA biochips via deep reinforcement learning (DRL). The framework utilizes the real-time microelectrode health feedback to synthesize droplet routes that proactively minimize the likelihood of charge trapping. We show how the adaptive routing strategies can be synthesized using DRL. We implement the DRL agent, the MEDA simulation environment, and the bioassay scheduler using the OpenAI Gym environment. Our framework obtains adaptive routing policies efficiently for COVID-19 testing protocols on large arrays that reflect the sizes of commercial MEDA biochips available in the marketplace, significantly increasing probabilities of successful bioassay completion compared to existing methods.
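The abstract's key ingredient, a routing environment whose reward uses real-time microelectrode health, can be sketched without the full framework. Everything below is hypothetical (grid size, reward values, action set); the paper uses a full OpenAI Gym environment with a DRL agent, whereas this toy only shows how health feedback steers routes away from charge-trapped electrodes:

```python
# Toy, self-contained sketch of a Gym-style MEDA routing environment:
# reset() -> state, step(action) -> (state, reward, done). The reward
# penalizes stepping on degraded microelectrodes, so a learned policy
# would route droplets around them.

class MedaRoutingEnv:
    MOVES = {"R": (1, 0), "U": (0, 1)}

    def __init__(self, size, health, goal):
        self.size, self.health, self.goal = size, health, goal

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        dx, dy = self.MOVES[action]
        x = min(self.pos[0] + dx, self.size - 1)
        y = min(self.pos[1] + dy, self.size - 1)
        self.pos = (x, y)
        done = self.pos == self.goal
        # Progress bonus minus a penalty for unhealthy electrodes.
        reward = (10.0 if done else -0.1) - (1.0 - self.health[self.pos])
        return self.pos, reward, done

# 2x2 array where electrode (1, 0) is degraded (health 0.2).
health = {(0, 0): 1.0, (1, 0): 0.2, (0, 1): 1.0, (1, 1): 1.0}
env = MedaRoutingEnv(size=2, health=health, goal=(1, 1))

def rollout(actions):
    env.reset()
    total = 0.0
    for a in actions:
        _, r, done = env.step(a)
        total += r
    return total

# Routing around the degraded electrode earns more reward.
assert rollout(["U", "R"]) > rollout(["R", "U"])
```

A DRL agent trained against such rewards learns exactly this preference, which is what proactively reduces further charge trapping.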
14:34 CET 17.1.2 CONTAMINATION-FREE SWITCH DESIGN AND SYNTHESIS FOR MICROFLUIDIC LARGE-SCALE INTEGRATION
Speaker:
Duan Shen, TU Munich, DE
Authors:
Duan Shen, Yushen Zhang, Mengchu Li, Tsun-Ming Tseng and Ulf Schlichtmann, TU Munich, DE
Abstract
Microfluidic large-scale integration (mLSI) biochips have developed rapidly in recent decades. The gap between design efficiency and application complexity has led to a growing interest in mLSI design automation. The state-of-the-art design automation tools for mLSI focus on the simultaneous co-optimisation of the flow and control layers but neglect potential contamination between different fluid reagents and products. Microfluidic switches, as fluid routers at the intersection of flow paths, are especially prone to contamination. State-of-the-art tools design the switches as spines with junctions, which aggravates the contamination problem. In this work, we present a contamination-free microfluidic switch design and a synthesis method to generate application-specific switches that can be employed by physical design tools for mLSI. We also propose a scheduling and binding method to transport the fluids in the least time and with the fewest resources. To reduce the number of pressure inlets, we consider pressure sharing between valves within the switch. Experimental results demonstrate that our methods show advantages in avoiding contamination and improving transportation efficiency over conventional methods.
14:38 CET 17.1.3 EXPLOITING PARALLELISM WITH VERTEX-CLUSTERING IN PROCESSING-IN-MEMORY-BASED GCN ACCELERATORS
Speaker:
Yu Zhu, Tsinghua University, CN
Authors:
Yu Zhu, Zhenhua Zhu, Guohao Dai, Kai Zhong, Huazhong Yang and Yu Wang, Tsinghua University, CN
Abstract
Recently, Graph Convolutional Networks (GCNs) have shown powerful learning capabilities in graph processing tasks. Computing GCNs with conventional von Neumann architectures usually suffers from limited memory bandwidth due to irregular memory access. Recent work has proposed Processing-In-Memory (PIM) architectures to overcome the bandwidth bottleneck in Convolutional Neural Networks (CNNs) by performing in-situ matrix-vector multiplication. However, the performance improvement and computation parallelism of existing CNN-oriented PIM architectures are hindered when performing GCNs because of the large scale and sparsity of graphs. To tackle these problems, this paper presents a parallelism enhancement framework for PIM-based GCN architectures. At the software level, we propose a fixed-point quantization method for GCNs, which reduces the PIM computation overhead with little accuracy loss. We also introduce a vertex clustering algorithm for the graph, minimizing the inter-cluster links and realizing cluster-level parallel computing on multi-core systems. At the hardware level, we design a Resistive Random Access Memory (RRAM) based multi-core PIM architecture for GCNs, which supports the cluster-level parallelism. In addition, we propose a coarse-grained pipeline dataflow to hide the RRAM write costs and improve the GCN computation throughput. At the software/hardware interface level, we propose a PIM-aware GCN mapping strategy to achieve the optimal trade-off between resource utilization and computation performance. We also propose edge dropping methods to reduce the inter-core communications with little accuracy loss. We evaluate our framework on typical datasets with multiple widely used GCN models. Experimental results show that the proposed framework achieves 698x, 89x, and 41x speedup with 7108x, 255x, and 31x energy efficiency enhancement compared with CPUs, GPUs, and ASICs, respectively.
14:42 CET 17.1.4 ACCELERATING SPATIOTEMPORAL SUPERVISED TRAINING OF LARGE-SCALE SPIKING NEURAL NETWORKS ON GPU
Speaker:
Ling Liang, University of California, Santa Barbara, US
Authors:
Ling Liang1, Zhaodong Chen1, Lei Deng2, Fengbin Tu1, Guoqi Li2 and Yuan Xie1
1University of California, Santa Barbara, US; 2Tsinghua University, CN
Abstract
Spiking neural networks (SNNs) have great potential to achieve brain-like intelligence; however, they suffer from the low accuracy of conventional synaptic plasticity rules and low training efficiency on GPUs. Recently, emerging backpropagation through time (BPTT)-inspired learning algorithms have brought new opportunities to boost the accuracy of SNNs, while training on GPUs still remains inefficient due to the complex spatiotemporal dynamics and huge memory consumption, which restricts model exploration for SNNs and prevents the advance of neuromorphic computing. In this work, we build a framework to solve the inefficiency of BPTT-based SNN training on modern GPUs. To reduce memory consumption, we optimize the dataflow by abandoning part of the intermediate data in the forward pass and recomputing it in the backward pass. Then, we customize kernel functions to accelerate the neural dynamics for all training stages. Finally, we provide a PyTorch interface to make our framework easy to deploy in real systems. Compared to a vanilla PyTorch implementation, our framework achieves up to 2.13x end-to-end speedup and consumes only 0.41x peak memory on the CIFAR10 dataset. Moreover, for distributed training on the large ImageNet dataset, we achieve up to 1.81x end-to-end speedup and consume only 0.38x peak memory.
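The recompute-instead-of-store dataflow can be illustrated on a single neuron. The sketch is hedged: the framework itself is built on PyTorch/CUDA kernels, and the leaky-integrate-and-fire (LIF) constants below are made up; the point is only that the backward pass can regenerate the membrane-potential trace from the saved inputs instead of keeping it in memory through the forward pass:

```python
# Hedged sketch: storing every membrane potential u[t] costs O(T) memory;
# discarding them in the forward pass and recomputing them when the
# backward pass needs them costs O(1) extra memory, at the price of a
# second (cheap) forward sweep over the neural dynamics.

DECAY, THRESH = 0.9, 1.0   # hypothetical LIF constants

def lif_forward(inputs, keep_trace):
    u, trace = 0.0, []
    for x in inputs:
        u = DECAY * u + x
        if keep_trace:
            trace.append(u)          # stored activations (memory-hungry)
        if u >= THRESH:
            u = 0.0                  # reset after a spike
    return trace

def recompute_trace(inputs):
    # Backward pass re-runs the dynamics from the saved inputs only.
    return lif_forward(inputs, keep_trace=True)

inputs = [0.4, 0.5, 0.3, 0.6]
stored = lif_forward(inputs, keep_trace=True)

# Recomputation reproduces exactly the trace the backward pass needs,
# without having kept it alive through the forward pass.
assert recompute_trace(inputs) == stored
assert lif_forward(inputs, keep_trace=False) == []   # nothing stored
```

In PyTorch, the same trade-off is what activation checkpointing provides; the paper additionally fuses the dynamics into custom kernels.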
14:46 CET 17.1.5 HYPERSPIKE: HYPERDIMENSIONAL COMPUTING FOR MORE EFFICIENT AND ROBUST SPIKING NEURAL NETWORKS
Speaker:
Justin Morris, University of California, San Diego, US
Authors:
Justin Morris1, Hin Wai Lui2, Kenneth Stewart2, Behnam Khaleghi1, Anthony Thomas1, Thiago Marback1, Baris Aksanli3, Emre Neftci4 and Tajana S. Rosing5
1University of California, San Diego, US; 2University of California, Irvine, US; 3San Diego State University, US; 4UC Irvine, US; 5UCSD, US
Abstract
Today's Machine Learning (ML) systems consume a significant amount of energy, especially in server farms running workloads such as Deep Neural Networks, which require billions of parameters and many hours to train. To combat this, researchers have been focusing on new emerging neuromorphic computing models. Two of those models are Hyperdimensional Computing (HDC) and Spiking Neural Networks (SNNs), each with its own benefits. HDC has various desirable properties that other ML algorithms lack, such as robustness to noise in the system, simple operations, and high parallelism. SNNs are able to process event-based signal data in an efficient manner. In this paper, we combine these two neuromorphic methods to create HyperSpike. We utilize a single SNN layer to first process the event-based data and transform it into a more traditional feature vector that HDC can interpret. Then, an HDC classifier is used to enable more efficient classification as well as robustness to errors. We additionally test HyperSpike against different levels of bit error rates to experimentally show that HyperSpike is on average 31.5× more robust to errors than SNNs using other classifiers as the last layer. We also propose an ASIC accelerator for HyperSpike that provides a 10× speedup and 19.3× more energy efficiency over traditional SNN networks run on Loihi chips.
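The HDC back-end can be illustrated generically: a feature vector is encoded into a high-dimensional bipolar vector and classified by similarity to bundled class prototypes. This is a textbook HDC sketch under assumed parameters (dimension, codebook), not HyperSpike's implementation:

```python
import random

DIM = 2048          # hypervector dimensionality (assumed for illustration)
random.seed(0)

def rand_hv():
    # Random bipolar hypervector with +1/-1 entries.
    return [random.choice((-1, 1)) for _ in range(DIM)]

def bundle(hvs):
    # Element-wise majority vote combines a set of hypervectors.
    return [1 if sum(col) >= 0 else -1 for col in zip(*hvs)]

def similarity(a, b):
    # Normalised dot product in [-1, 1]; near 0 for unrelated vectors.
    return sum(x * y for x, y in zip(a, b)) / DIM

# Hypothetical feature-to-hypervector codebook (stand-in for the
# feature vector produced by the SNN front-end).
codebook = {i: rand_hv() for i in range(4)}

def encode(features):
    return bundle([codebook[f] for f in features])

# "Train": one class prototype per label by bundling its examples.
class_a = encode([0, 1])
class_b = encode([2, 3])

def classify(features):
    q = encode(features)
    return "a" if similarity(q, class_a) > similarity(q, class_b) else "b"
```

The robustness the abstract mentions comes from the distributed representation: flipping a few of the 2048 components barely moves the similarity scores.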
14:50 CET 17.1.6 Q&A SESSION
Authors:
Michael Niemier1 and François Rummens2
1University of Notre Dame, US; 2CEA, FR
Abstract
Questions and answers with the authors

17.2 Attacks on Secure and Trustworthy Systems

Date: Tuesday, 22 March 2022
Time: 14:30 - 15:30 CET

Session chair:
Emanuele Valea, CEA LIST, FR

Session co-chair:
Francesco Regazzoni, University of Amsterdam and Università della Svizzera italiana, CH

In the last two decades we have witnessed a massive development of devices containing different types of valuable assets. Moreover, the globalization of the semiconductor industry has led to new trust risks. This session includes five presentations proposing novel methods to bypass state-of-the-art countermeasures against security threats (cache timing attacks and side-channel-based CPU disassembly) and trust threats (hardware Trojans and overproduction).

Time Label Presentation Title
Authors
14:30 CET 17.2.1 A DEEP-LEARNING APPROACH TO SIDE-CHANNEL BASED CPU DISASSEMBLY AT DESIGN TIME
Speaker:
Hedi Fendri, ALaRI, Universita della Svizzera italiana, CH
Authors:
Hedi Fendri1, Marco Macchetti2, Jerome Perrine2 and Mirjana Stojilovic3
1ALaRI, Universita della Svizzera italiana, CH; 2Kudelski Group, CH; 3EPFL, CH
Abstract
Side-channel CPU disassembly is a side-channel attack that allows an adversary to recover instructions executed by a processor. Not only does such an attack compromise code confidentiality, it can also reveal critical information on the system's internals. Being easily accessible to a vast number of end users, modern embedded devices are highly vulnerable to disassembly attacks. To protect them, designers deploy countermeasures and verify their efficiency in security laboratories. Clearly, any vulnerability discovered at that point, after the integrated circuit has been manufactured, represents an important setback. In this paper, we address the above issues in two steps. Firstly, we design a framework that takes a design netlist and outputs simulated power side-channel traces, with the goal of assessing the vulnerability of the device at design time. Secondly, we propose a novel side-channel disassembler, based on a multilayer perceptron and sparse dictionary learning for feature engineering. Experimental results on simulated and measured side-channel traces of two commercial RISC-V devices, both operating at frequencies of at least 100 MHz, demonstrate that our disassembler can recognize CPU instructions with success rates of 96.01% and 93.16%, respectively.
14:34 CET 17.2.2 (Best Paper Award Candidate)
A CROSS-PLATFORM CACHE TIMING ATTACK FRAMEWORK VIA DEEP LEARNING
Speaker:
Ruyi Ding, Northeastern University, US
Authors:
Ruyi Ding, Ziyue Zhang, Xiang Zhang, Cheng Gongye, Yunsi Fei and A. Adam Ding, Northeastern University, US
Abstract
While deep learning methods have been adopted in power side-channel analysis, they have not been applied to cache timing attacks due to the limited dimension of cache timing data. This paper proposes a persistent cache monitor based on cache line flushing instructions, which runs concurrently to a victim execution and captures detailed memory access patterns in high-dimensional timing traces. We discover a new cache timing side-channel across both inclusive and non-inclusive caches, different from the traditional "Flush+Flush" timing leakage. We then propose a non-profiling differential deep learning analysis strategy to exploit the cache timing traces for key recovery. We further propose a framework for cross-platform cache timing attack via deep learning. Knowledge learned from profiling a common reference device can be transferred to build models to attack many other victim devices, even in different processor families. We take the OpenSSL AES-128 encryption algorithm as an example victim and deploy an asynchronous cache attack. We target three different devices from Intel, AMD, and ARM processors. We examine various scenarios for assigning the teacher role to one device and the student role to other devices and evaluate the cross-platform deep-learning attack framework. Experimental results show that this new attack is easily extendable to victim devices and is more effective than attacks without any prior knowledge.
14:38 CET 17.2.3 DESIGN OF AI TROJANS FOR EVADING MACHINE LEARNING-BASED DETECTION OF HARDWARE TROJANS
Speaker:
Prabhat Mishra, University of Florida, US
Authors:
Zhixin Pan and Prabhat Mishra, University of Florida, US
Abstract
The globalized semiconductor supply chain significantly increases the risk of exposing System-on-Chip (SoC) designs to malicious implants, popularly known as hardware Trojans. Traditional simulation-based validation is unsuitable for the detection of carefully-crafted hardware Trojans with extremely rare trigger conditions. While machine learning (ML) based Trojan detection approaches are promising due to their scalability as well as detection accuracy, ML methods themselves are vulnerable to Trojan attacks. In this paper, we propose a robust backdoor attack on ML-based hardware Trojan detection algorithms to demonstrate this serious vulnerability. The proposed framework is able to design an AI Trojan and implant it inside the ML model such that it can be triggered by specific inputs. Experimental results demonstrate that the proposed AI Trojans can bypass state-of-the-art defense algorithms. Moreover, our approach provides a fast and cost-effective solution in achieving a 100% attack success rate that significantly outperforms state-of-the-art approaches based on adversarial attacks.
14:42 CET 17.2.4 DIP LEARNING ON CAS-LOCK: USING DISTINGUISHING INPUT PATTERNS FOR ATTACKING LOGIC LOCKING
Speaker:
Akashdeep Saha, Indian Institute of Technology, Kharagpur, IN
Authors:
Akashdeep Saha1, Urbi Chatterjee2, Debdeep Mukhopadhyay3 and Rajat Subhra Chakraborty4
1Indian Institute of Technology Kharagpur, IN; 2Indian Institute of Technology Kanpur, IN; 3Indian Institute of Technology Kharagpur, IN; 4Indian Institute of Technology Kharagpur, IN
Abstract
The globalization of the integrated circuit (IC) manufacturing industry has lured adversaries into numerous malicious activities in the IC supply chain. Logic locking has risen to prominence as a proactive defense strategy against such threats. CAS-Lock (proposed at CHES'20) is an advanced logic locking technique that harnesses the concept of a single-point function to provide SAT-attack resiliency. It is claimed to be powerful and efficient enough to mitigate existing state-of-the-art attacks against logic locking techniques. Despite the security robustness of CAS-Lock as claimed by the authors, we expose a serious vulnerability and, by exploiting it, devise a novel attack algorithm against CAS-Lock. The proposed attack can not only reveal the correct key but also the exact AND/OR structure of the implemented CAS-Lock design, along with all the key gates utilized in both blocks of CAS-Lock. It simply relies on the externally observable Distinguishing Input Patterns (DIPs) pertaining to a carefully chosen key simulation of the locked design, without requiring structural analysis of the locked netlist of any kind. Our attack is successful against various AND/OR cascaded-chain configurations of CAS-Lock and reports a 100% success rate in recovering the correct key. It has an attack complexity of O(n), where n denotes the number of DIPs obtained for an incorrect key simulation.
14:46 CET 17.2.5 MUXLINK: CIRCUMVENTING LEARNING-RESILIENT MUX-LOCKING USING GRAPH NEURAL NETWORK-BASED LINK PREDICTION
Speaker:
Lilas Alrahis, New York University Abu Dhabi, AE
Authors:
Lilas Alrahis1, Satwik Patnaik2, Muhammad Shafique1 and Ozgur Sinanoglu1
1New York University Abu Dhabi, AE; 2Texas A&M University, US
Abstract
Logic locking has received considerable interest as a prominent technique for protecting the design intellectual property from untrusted entities, especially the foundry. Recently, machine learning (ML)-based attacks have questioned the security guarantees of logic locking, and have demonstrated considerable success in deciphering the secret key without relying on an oracle, hence, proving to be very useful for an adversary in the fab. Such ML-based attacks have triggered the development of learning-resilient locking techniques. The most advanced state-of-the-art deceptive MUX-based locking (D-MUX) and the symmetric MUX-based locking techniques have recently demonstrated resilience against existing ML-based attacks. Both defense techniques obfuscate the design by inserting key-controlled MUX logic, ensuring that all the secret inputs to the MUXes are equiprobable. In this work, we show that these techniques primarily introduce local and limited changes to the circuit without altering the global structure of the design. By leveraging this observation, we propose a novel graph neural network (GNN)-based link prediction attack, MuxLink, that successfully breaks both the D-MUX and symmetric MUX-locking techniques, relying only on the underlying structure of the locked design, i.e., in an oracle-less setting. Our trained GNN model learns the structure of the given circuit and the composition of gates around the non-obfuscated wires, thereby generating meaningful link embeddings that help decipher the secret inputs to the MUXes. The proposed MuxLink achieves key prediction accuracy and precision up to 100% on D-MUX and symmetric MUX-locked ISCAS-85 and ITC-99 benchmarks, fully unlocking the designs. We open-source MuxLink [1].
14:50 CET 17.2.6 Q&A SESSION
Authors:
Emanuele Valea1 and Francesco Regazzoni2
1CEA LIST, FR; 2University of Amsterdam and ALaRI - USI, CH
Abstract
Questions and answers with the authors

17.3 Algorithmic techniques for efficient and robust ML hardware

Date: Tuesday, 22 March 2022
Time: 14:30 - 15:30 CET

Session chair:
Giulio Gambardella, Synopsys, IE

Session co-chair:
Tony Wu, Meta/Facebook, US

In this session we present results from five papers on algorithmic techniques for efficient and robust ML hardware. The first paper introduces a dynamic token-based compression technique for efficient acceleration of the attention mechanism in DNNs. The next paper sheds light on the negative effect that adversarial training has on the resilience of deep neural networks (DNNs) and proposes a simple weight decay remedy that lets adversarially trained models maintain both adversarial robustness and fault resilience. The third paper proposes a joint variability- and quantization-aware DNN training algorithm and self-tuning strategy to overcome accuracy loss in highly quantized analog PIM-based models. The next paper presents a new training algorithm that converts deep neural networks to spiking neural networks with low latency and high spike sparsity, demonstrating 2.5-8x faster inference than prior SNN models. Finally, the last paper introduces a new technique for zero-overhead ECC embedding in DNN models.

Time Label Presentation Title
Authors
14:30 CET 17.3.1 (Best Paper Award Candidate)
DTQATTEN: LEVERAGING DYNAMIC TOKEN-BASED QUANTIZATION FOR EFFICIENT ATTENTION ARCHITECTURE
Speaker:
Tao Yang, Shanghai Jiao Tong University, CN
Authors:
Tao Yang, Dongyue Li, Zhuoran Song, Yilong Zhao, Fangxin Liu, Zongwu Wang, Zhezhi He and Li Jiang, Shanghai Jiao Tong University, CN
Abstract
Models based on the attention mechanism, i.e. transformers, have shown extraordinary performance in Natural Language Processing tasks. However, their memory footprint, inference latency, and power consumption are still prohibitive for efficient inference at edge devices, even at data centers. To tackle this issue, we present an algorithm-architecture co-design with dynamic and mixed-precision quantization, DTQAtten. We show empirically that the tolerance to noise varies from token to token in attention-based models. This finding leads us to quantize different tokens with mixed levels of bits. Thus, we design a compression framework that (i) dynamically quantizes tokens while they are forwarded in the models and (ii) jointly determines the ratio of each precision. Moreover, due to the dynamic mixed-precision tokens caused by our framework, previous matrix-multiplication accelerators (e.g. systolic arrays) cannot effectively exploit the benefit of the compressed attention computation. We thus design our accelerator with a Variable Speed Systolic Array (VSSA) and propose an effective optimization strategy to alleviate the pipeline-stall problem in VSSA without hardware overhead. We conduct experiments with existing attention-based models, including BERT and GPT-2, on various language tasks. Our results show that DTQAtten outperforms the previous neural network accelerator Eyeriss by 13.12x in speedup and 3.8x in energy saving on average. Compared with the state-of-the-art attention accelerator SpAtten, our DTQAtten achieves at least 2.65x speedup and 3.38x energy efficiency improvement.
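Per-token mixed-precision quantization can be sketched as follows: each token's vector gets its own scale and an individually chosen bit-width. The `pick_bits` threshold policy below is a hypothetical stand-in; the paper determines precision dynamically inside the model:

```python
def quantize(vec, bits):
    # Symmetric uniform quantisation of one token's vector.
    qmax = (1 << (bits - 1)) - 1          # e.g. 7 for 4-bit, 127 for 8-bit
    peak = max(abs(v) for v in vec)
    scale = peak / qmax if peak else 1.0
    return [round(v / scale) for v in vec], scale

def dequantize(q, scale):
    return [v * scale for v in q]

def pick_bits(vec, threshold=1.0):
    # Hypothetical policy: tokens with a small dynamic range tolerate
    # coarser quantisation; sensitive tokens keep more bits.
    return 4 if max(abs(v) for v in vec) < threshold else 8

tokens = [[0.1, -0.2, 0.05], [3.0, -2.5, 1.75]]
quantized = []
for t in tokens:
    b = pick_bits(t)
    q, s = quantize(t, b)
    quantized.append((q, s, b))    # per-token integers, scale, bit-width
```

The hardware implication mentioned in the abstract follows directly: a fixed-width systolic array cannot exploit the 4-bit tokens, which motivates the variable-speed array.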
14:34 CET 17.3.2 MIND THE SCALING FACTORS: RESILIENCE ANALYSIS OF QUANTIZED ADVERSARIALLY ROBUST CNNS
Speaker:
Nael Fasfous, TU Munich, DE
Authors:
Nael Fasfous1, Lukas Frickenstein2, Michael Neumeier1, Manoj Rohit Vemparala2, Alexander Frickenstein2, Emanuele Valpreda3, Maurizio Martina3 and Walter Stechele1
1TU Munich, DE; 2BMW Group, DE; 3Politecnico di Torino, IT
Abstract
As more deep learning algorithms enter safety-critical application domains, the importance of analyzing their resilience against hardware faults cannot be overstated. Most existing works focus on bit-flips in memory, fewer focus on compute errors, and almost none study the effect of hardware faults on adversarially trained convolutional neural networks (CNNs). In this work, we show that adversarially trained CNNs are more susceptible to failure due to hardware errors when compared to vanilla-trained models. We identify large differences in the quantization scaling factors of the CNNs which are resilient to hardware faults and those which are not. As adversarially trained CNNs learn robustness against input attack perturbations, their internal weight and activation distributions open a backdoor for injecting large magnitude hardware faults. We propose a simple weight decay remedy for adversarially trained models to maintain adversarial robustness and hardware resilience in the same CNN. We improve the fault resilience of an adversarially trained ResNet56 by 25% for large-scale bit-flip benchmarks on activation data while gaining slightly improved accuracy and adversarial robustness.
14:38 CET 17.3.3 VARIABILITY-AWARE TRAINING AND SELF-TUNING OF HIGHLY QUANTIZED DNNS FOR ANALOG PIM
Speaker:
Zihao Deng, University of Texas at Austin, US
Authors:
Zihao Deng and Michael Orshansky, University of Texas at Austin, US
Abstract
DNNs deployed on analog processing-in-memory (PIM) architectures are subject to fabrication-time variability. We developed a new joint variability- and quantization-aware DNN training algorithm for highly quantized analog PIM-based models that is significantly more effective than prior work. It outperforms variability-oblivious and post-training quantized models on multiple computer vision datasets/models. For low-bitwidth models and high variation, the gain in accuracy is up to 35.7% for ResNet-18 over the best alternative. We demonstrate that, under a realistic pattern of within- and between-chip components of variability, training alone is unable to prevent large DNN accuracy loss (of up to 54% on CIFAR-100/ResNet-18). We introduce a self-tuning DNN architecture that dynamically adjusts layer-wise activations during inference and is effective in reducing accuracy loss to below 10%.
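Two ingredients of this line of work can be illustrated generically: injecting a fresh variability sample into each forward pass during training, and rescaling activations at inference so their statistics match a calibration target. Both functions below are simplified stand-ins under assumed noise and calibration models, not the paper's algorithm:

```python
import random

random.seed(0)

def noisy_forward(weights, x, sigma):
    # Each weight is perturbed by a device-variation sample drawn fresh
    # on every forward pass, so training sees the variability it will
    # meet at inference time (Gaussian multiplicative noise is assumed).
    return sum(w * (1.0 + random.gauss(0.0, sigma)) * xi
               for w, xi in zip(weights, x))

def self_tune(activations, target_scale):
    # Self-tuning idea: rescale a layer's activations at inference so
    # their mean magnitude matches a calibration target, compensating
    # for the chip's particular variation pattern.
    mean_mag = sum(abs(a) for a in activations) / len(activations)
    gamma = target_scale / mean_mag
    return [a * gamma for a in activations]

acts = self_tune([0.5, -1.5, 2.0], target_scale=1.0)
```

With `sigma = 0` the noisy forward pass reduces to an exact dot product, which makes the injection easy to unit-test.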
14:42 CET 17.3.4 CAN DEEP NEURAL NETWORKS BE CONVERTED TO ULTRA LOW-LATENCY SPIKING NEURAL NETWORKS?
Speaker:
Gourav Datta, University of Southern California, US
Authors:
Gourav Datta and Peter Beerel, University of Southern California, US
Abstract
Spiking neural networks (SNNs), which operate via binary spikes distributed over time, have emerged as a promising energy-efficient ML paradigm for resource-constrained devices. However, the current state-of-the-art (SOTA) SNNs require multiple time steps for acceptable inference accuracy, increasing spiking activity and, consequently, energy consumption. SOTA training strategies for SNNs involve conversion from a non-spiking deep neural network (DNN). In this paper, we determine that SOTA conversion strategies cannot yield ultra low latency because they incorrectly assume that the DNN and SNN pre-activation values are uniformly distributed. We propose a new training algorithm that accurately captures these distributions, minimizing the error between the DNN and converted SNN. The resulting SNNs have ultra low latency and high activation sparsity, yielding significant improvements in compute efficiency. In particular, we evaluate our framework on image recognition tasks from the CIFAR-10 and CIFAR-100 datasets on several VGG and ResNet architectures. We obtain a top-1 accuracy of 64.19% with only 2 time steps on the CIFAR-100 dataset with 159.2x lower compute energy compared to an iso-architecture standard DNN. Compared to other SOTA SNN models, our models perform inference 2.5-8x faster (i.e., with fewer time steps).
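Why the pre-activation distribution matters for conversion can be seen with a single integrate-and-fire neuron: when the firing threshold matches the value's actual range, a few time steps already encode it, while a threshold set to a loose maximum wastes the rate code. A generic sketch (constant input current and reset-by-subtraction are assumptions, not the paper's algorithm):

```python
def snn_rate(inp, threshold, timesteps):
    # Integrate-and-fire neuron: add a constant input current each step,
    # spike and subtract the threshold on every crossing.
    v, spikes = 0.0, 0
    for _ in range(timesteps):
        v += inp
        if v >= threshold:
            spikes += 1
            v -= threshold
    return spikes / timesteps   # firing rate ~ ReLU(inp) / threshold

# Tight threshold: the value 0.5 is encoded exactly in 4 time steps.
rate_tight = snn_rate(0.5, threshold=1.0, timesteps=4)
# Loose threshold (e.g. set from an outlier maximum): no spikes at all.
rate_loose = snn_rate(0.5, threshold=4.0, timesteps=4)
```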
14:46 CET 17.3.5 VALUE-AWARE PARITY INSERTION ECC FOR FAULT-TOLERANT DEEP NEURAL NETWORK
Speaker:
Seo-Seok Lee, Samsung Electronics, KR
Authors:
Seo-Seok Lee1 and Joon-Sung Yang2
1Samsung Electronics Co.Ltd, KR; 2Yonsei University, KR
Abstract
Deep neural networks (DNNs) are deployed on hardware devices and are widely used in various fields to perform inference from inputs. Unfortunately, hardware devices can become unreliable by incidents such as unintended process, voltage and temperature variations, and this can introduce the occurrence of erroneous weights. Prior study reports that the erroneous weights can cause a significant accuracy degradation. In safety-critical applications such as autonomous driving, it can bring catastrophic results. Retraining or fine-tuning can be used to adjust corrupted weights to prevent the accuracy degradation. However, training-based approaches would incur a significant computational overhead due to a massive size of training datasets and intensive training operations. Thus, this paper proposes a value-aware parity insertion error correction code (ECC) to recover erroneous weights with a reduced parity storage overhead and no additional training processes. Previous ECC-based reliability improvement methods, Weight Nulling and In-place Zero-space ECC, are compared with the proposed method. Experimental results demonstrate that DNNs with the value-aware parity insertion ECC can perform inference without the accuracy degradation, on average, in 122.5x and 15.1x higher bit error rate conditions over Weight Nulling and In-place Zero-space ECC, respectively.
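The general idea of embedding a check inside the weight word itself can be sketched with a single even-parity bit stored in the least significant (lowest-value-impact) bit. This toy version only detects an error in one 8-bit weight; the paper's value-aware parity insertion ECC goes further and actually corrects erroneous weights:

```python
def parity(bits):
    # Even parity (XOR) over a list of 0/1 values.
    p = 0
    for b in bits:
        p ^= b
    return p

def embed_parity(word):
    # Store the parity of bits 1..7 in the LSB, sacrificing the bit
    # whose loss perturbs the weight's value the least.
    upper = [(word >> i) & 1 for i in range(1, 8)]
    return (word & ~1) | parity(upper)

def check(word):
    # Recompute the parity and compare it against the stored LSB.
    upper = [(word >> i) & 1 for i in range(1, 8)]
    return parity(upper) == (word & 1)

w = embed_parity(0b1011_0110)
corrupted = w ^ (1 << 5)   # a single bit flip in the weight
```

The storage trick is the point: the parity rides inside the existing weight word, so no extra parity memory is needed, which is the "zero-overhead" property the session intro refers to.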
14:50 CET 17.3.6 Q&A SESSION
Authors:
Giulio Gambardella1 and Tony Wu2
1Synopsys, IE; 2Meta/Facebook, US
Abstract
Questions and answers with the authors

17.4 Energy Efficiency with Emerging Technologies for the Edge and the Cloud

Date: Tuesday, 22 March 2022
Time: 14:30 - 15:30 CET

Session chair:
Qinru Qiu, Syracuse University, US

Session co-chair:
Iraklis Anagnostopoulos, SIU, US

Papers in this session discuss approaches for energy reduction on edge devices and in the cloud by optimizing hardware and software architectures and memory management methodologies. The first paper presents a precision-scalable architecture for edge DNN accelerators. The second paper proposes energy-efficient classification for an event-based vision sensor using ternary convolutional networks. The third paper addresses the read/write overheads of NVM based on an extendible hashing methodology. The fourth paper presents a new memory allocation technique based on data-structure refinement for NVM/DRAM hybrid systems. The fifth paper reduces energy and ownership costs by replacing x86-based rack servers with a large number of ARM-based single-board computers for serverless Function-as-a-Service platforms.

Time Label Presentation Title
Authors
14:30 CET 17.4.1 A PRECISION-SCALABLE ENERGY-EFFICIENT BIT-SPLIT-AND-COMBINATION VECTOR SYSTOLIC ACCELERATOR FOR NAS-OPTIMIZED DNNS ON EDGE
Speaker:
Junzhuo Zhou, Southern University of Science and Technology, CN
Authors:
Kai Li, Junzhuo Zhou, Yuhang Wang, Junyi Luo, Zhengke Yang, Shuxin Yang, Wei Mao, Mingqiang Huang and Hao Yu, Southern University of Science and Technology, CN
Abstract
Optimized model and energy-efficient hardware are both required for deep neural networks (DNNs) in edge-computing area. Neural architecture search (NAS) methods are employed for DNN model optimization with resulted multi-precision networks. Previous works have proposed low-precision-combination (LPC) and high-precision-split (HPS) methods for multi-precision networks, which are not energy-efficient for precision-scalable vector implementation. In this paper, a bit-split-and-combination (BSC) based vector systolic accelerator is developed for a precision-scalable energy-efficient convolution on edge. The maximum energy efficiency of the proposed BSC vector processing element (PE) is up to 1.95x higher in 2-bit, 4-bit and 8-bit operations when compared with LPC and HPS PEs. Further with NAS optimized multi-precision CNN networks, the averaged energy efficiency of the proposed vector systolic BSC PE array achieves up to 2.18x higher in 2-bit, 4-bit and 8-bit operations than that of LPC and HPS PE arrays.
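Bit-split multiplication decomposes wide operands into low-precision slices, multiplies slice pairs, and recombines the partial products with shifts; this arithmetic identity is what a precision-scalable PE exploits, since the same 2-bit multipliers serve 2-, 4-, and 8-bit modes. A generic unsigned sketch, not the BSC PE's datapath:

```python
def split(x, slice_bits, n_slices):
    # Decompose an unsigned integer into low-precision slices, LSB first.
    mask = (1 << slice_bits) - 1
    return [(x >> (i * slice_bits)) & mask for i in range(n_slices)]

def bsc_multiply(a, b, slice_bits=2):
    # Multiply two 8-bit unsigned values by summing the products of
    # their slices, each shifted to its place value -- exactly the
    # partial products an array of 2-bit multipliers would produce.
    n = 8 // slice_bits
    total = 0
    for i, sa in enumerate(split(a, slice_bits, n)):
        for j, sb in enumerate(split(b, slice_bits, n)):
            total += (sa * sb) << ((i + j) * slice_bits)
    return total
```

The identity is exact for unsigned operands, so a hardware recombination network only has to align and add the shifted partials.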
14:34 CET 17.4.2 TERNARIZED TCN FOR μJ/INFERENCE GESTURE RECOGNITION FROM DVS EVENT FRAMES
Speaker:
Georg Rutishauser, ETH Zürich, CH
Authors:
Georg Rutishauser1, Moritz Scherer1, Tim Fischer1 and Luca Benini2
1ETH Zürich, CH; 2Università di Bologna and ETH Zürich, IT
Abstract
Dynamic Vision Sensors (DVS) offer the opportunity to scale the energy consumption in image acquisition proportionally to the activity in the captured scene by only transmitting data when the captured image changes. Their potential for energy-proportional sensing makes them highly attractive for severely energy-constrained sensing nodes at the edge. Most approaches to the processing of DVS data employ Spiking Neural Networks to classify the input from the sensor. In this paper, we propose an alternative, event frame-based approach to the classification of DVS video data. We assemble ternary video frames from the event stream and process them with a fully ternarized Temporal Convolutional Network which can be mapped to CUTIE, a highly energy-efficient Ternary Neural Network accelerator. The network mapped to the accelerator achieves a classification accuracy of 94.5 %, matching the state of the art for embedded implementations. We implement the processing pipeline in a modern 22 nm FDX technology and perform post-synthesis power simulation of the network running on the system, achieving an inference energy of 1.7 μJ, which is 647× lower than previously reported results based on Spiking Neural Networks.
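Assembling a ternary frame from a DVS event stream can be sketched by accumulating per-pixel polarities over a time window and keeping only their sign. The event tuple layout `(x, y, polarity)` is an assumption for illustration, not the paper's exact format:

```python
def ternary_frame(events, width, height):
    # Each pixel takes the sign of its accumulated event polarity:
    # +1 (net brighter), -1 (net darker), 0 (no activity) -- a ternary
    # frame a ternarised TCN can consume directly.
    acc = [[0] * width for _ in range(height)]
    for x, y, pol in events:              # pol is +1 or -1
        acc[y][x] += pol
    return [[(v > 0) - (v < 0) for v in row] for row in acc]

# Five events on a 2x2 sensor; opposite polarities at (0, 1) cancel out.
events = [(0, 0, 1), (0, 0, 1), (1, 0, -1), (0, 1, 1), (0, 1, -1)]
frame = ternary_frame(events, width=2, height=2)
```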
14:38 CET 17.4.3 REH: REDESIGNING EXTENDIBLE HASHING FOR COMMERCIAL NON-VOLATILE MEMORY
Speaker:
Zhengtao Li, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, CN
Authors:
Zhengtao Li, Zhipeng Tan and Jianxi Chen, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, CN
Abstract
Emerging Non-volatile Memory (NVM) is attractive because of its byte-addressability, durability, and DRAM-scale latency. Hashing indexes have been extensively used to provide fast query services in storage systems. Recent research proposes crash-consistent and write-optimized hashing indexes for NVM. However, existing NVM-based hashing indexes suffer from limited scalability when running on a commercial NVM product, the Intel Optane DC Persistent Memory Module (DCPMM), due to its limited bandwidth. To achieve a high load factor, existing NVM-based hashing indexes often evict an existing item to its alternative position, which incurs extra writes and consumes the limited bandwidth. Moreover, lock operations and metadata updates further saturate the limited bandwidth and prevent the hash table from scaling. In order to achieve both scalable performance and a high load factor for an NVM-based hashing index, we design a new persistent hashing index, called REH, based on extendible hashing. REH (1) proposes a selective persistence scheme that stores buckets in NVM and places the directory and metadata in DRAM to reduce both unnecessary NVM reads and writes, (2) uses 256B-sized buckets, as 256B is the internal data access size in Optane DCPMM, and the buckets are directly pointed to by directory entries, (3) leverages fingerprinting to further reduce unnecessary NVM reads, and (4) employs failure-atomic bucket splits to reduce bucket split overhead. Evaluations show that REH outperforms the state-of-the-art NVM-based hashing indexes by up to 1.68∼7.78×. In the meantime, REH can achieve a high load factor.
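The directory/bucket mechanics that REH builds on — bucket split, directory doubling, and pointer redistribution by one extra hash bit — can be sketched as a purely in-memory extendible hash table. This omits everything REH adds (selective NVM/DRAM placement, 256B buckets, fingerprints, failure-atomic splits):

```python
class Bucket:
    def __init__(self, depth, capacity=2):
        self.depth, self.capacity, self.items = depth, capacity, {}

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _index(self, key):
        # Use the low global_depth bits of the hash as directory index.
        return hash(key) & ((1 << self.global_depth) - 1)

    def put(self, key, value):
        b = self.directory[self._index(key)]
        if key in b.items or len(b.items) < b.capacity:
            b.items[key] = value
            return
        self._split(b)          # overflow: split, then retry
        self.put(key, value)

    def _split(self, b):
        if b.depth == self.global_depth:     # directory doubling
            self.directory += self.directory
            self.global_depth += 1
        b.depth += 1
        new = Bucket(b.depth, b.capacity)
        old_items, b.items = b.items, {}
        # Directory entries whose new depth bit is 1 move to the new bucket.
        for i, d in enumerate(self.directory):
            if d is b and (i >> (b.depth - 1)) & 1:
                self.directory[i] = new
        for k, v in old_items.items():       # redistribute the items
            self.directory[self._index(k)].items[k] = v

    def get(self, key):
        return self.directory[self._index(key)].items.get(key)

h = ExtendibleHash()
for k in range(8):
    h.put(k, k * 10)
```

Only one bucket is rewritten per split, which is why the structure maps well to a medium with expensive writes.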
14:42 CET 17.4.4 MEMORY MANAGEMENT METHODOLOGY FOR APPLICATION DATA STRUCTURE REFINEMENT AND PLACEMENT ON HETEROGENEOUS DRAM/NVM SYSTEMS
Speaker:
Manolis Katsaragakis, National TU Athens and KU Leuven, GR
Authors:
Manolis Katsaragakis1, Lazaros Papadopoulos2, Christos Baloukas2 and Dimitrios Soudris2
1National TU Athens and KU Leuven, GR; 2National TU Athens, GR
Abstract
Memory systems that combine multiple memory technologies with different performance and energy characteristics are becoming mainstream. Existing data placement strategies evolve to map application requirements to the underlying heterogeneous memory systems. In this work, we propose a memory management methodology that leverages a data structure refinement approach to improve data placement results in terms of execution time and energy consumption. The methodology is evaluated on three machine learning algorithms deployed on various NVM technologies, both on emulated and on real DRAM/NVM systems. Results show execution time improvements of up to 57% and energy consumption gains of up to 41%.
14:46 CET 17.4.5 MICROFAAS: ENERGY-EFFICIENT SERVERLESS ON BARE-METAL SINGLE-BOARD COMPUTERS
Speaker:
Anthony Byrne, Boston University, US
Authors:
Anthony Byrne1, Yanni Pang1, Allen Zou1, Shripad Nadgowda2 and Ayse Coskun1
1Boston University, US; 2IBM T.J. Watson Research Center, US
Abstract
Serverless function-as-a-service (FaaS) platforms offer a radically new paradigm for cloud software development, yet the hardware infrastructure underlying these platforms is based on a decades-old design pattern. The rise of FaaS presents an opportunity to reimagine cloud infrastructure to be more energy-efficient, cost-effective, reliable, and secure. In this paper, we show how replacing handfuls of x86-based rack servers with hundreds of ARM-based single-board computers could lead to a virtualization-free, energy-proportional cloud that achieves this vision. We call our systematically-designed implementation MicroFaaS, and we conduct a thorough evaluation and cost analysis comparing MicroFaaS to a throughput-matched FaaS platform implemented in the style of conventional virtualization-based cloud systems. Our results show a 5.6x increase in energy efficiency and a 34.2% decrease in total cost of ownership compared to our baseline.
14:50 CET 17.4.6 Q&A SESSION
Authors:
Qinru Qiu1 and Iraklis Anagnostopoulos2
1Syracuse University, US; 2Southern Illinois University Carbondale, US
Abstract
Questions and answers with the authors

17.5 Putting Place and Route research on the right track

Date: Tuesday, 22 March 2022
Time: 14:30 - 15:30 CET

Session chair:
Laleh Behjat, University of Calgary, CA

Session co-chair:
Jens Lienig, TU Dresden, DE

This session discusses how placement and routing can be done more efficiently. The first paper presents a global routing framework running on hybrid CPU-GPU platforms with a heterogeneous task scheduler achieving considerable speedup over sequential implementations and state-of-the-art routers. The second paper addresses the track assignment during detailed routing. The third and the fourth papers show that routing violations can be reduced if the root causes of the problems are tackled during the placement stage. The last paper brings us back to the use of GPU and CPU and discusses how they can be employed during legalization to reduce runtime.

Time Label Presentation Title
Authors
14:30 CET 17.5.1 (Best Paper Award Candidate)
FASTGR : GLOBAL ROUTING ON CPU-GPU WITH HETEROGENEOUS TASK GRAPH SCHEDULER
Speaker:
Siting Liu, The Chinese University of Hong Kong, HK
Authors:
Siting Liu1, Peiyu Liao1, Rui Zhang2, Zhitang Chen3, Wenlong Lv4, Yibo Lin5 and Bei Yu1
1The Chinese University of Hong Kong, HK; 2HiSilicon Technologies Co. Ltd., CN; 3Huawei Noah's Ark Lab, HK; 4Huawei Noah's Ark Lab, CN; 5Peking University, CN
Abstract
Routing is an essential step towards integrated circuit (IC) design closure. With the rapid increase of design scales, routing has become the runtime bottleneck in the physical design flow. Thus, accelerating routing becomes a vital and urgent task for IC design automation. This paper proposes a global routing framework running on hybrid CPU-GPU platforms with a heterogeneous task scheduler and a GPU-accelerated pattern routing algorithm. We demonstrate that the task scheduler can lead to 2.307x speedup compared with the widely-adopted batch-based parallelization strategy on CPU, and the GPU-accelerated pattern routing algorithm can contribute 10.877x speedup over the sequential algorithm on CPU. Finally, the combined techniques can achieve 2.426x speedup without quality degradation compared with the state-of-the-art global router.
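Pattern routing, the algorithm the abstract accelerates on GPU, restricts each two-pin net to a small set of candidate shapes — classically the two L-shapes — and picks the cheaper one under a congestion map. A minimal sketch of that selection; the grid-cell cost model and congestion dictionary are illustrative assumptions:

```python
def l_routes(src, dst):
    # The two L-shaped candidates for a 2-pin net:
    # horizontal-then-vertical, and vertical-then-horizontal.
    (x1, y1), (x2, y2) = src, dst
    return [[(x1, y1), (x2, y1), (x2, y2)],
            [(x1, y1), (x1, y2), (x2, y2)]]

def segment_cells(a, b):
    # Grid cells covered by one horizontal or vertical segment.
    (xa, ya), (xb, yb) = a, b
    if ya == yb:
        step = 1 if xb >= xa else -1
        return [(x, ya) for x in range(xa, xb + step, step)]
    step = 1 if yb >= ya else -1
    return [(xa, y) for y in range(ya, yb + step, step)]

def route_cost(route, congestion):
    cells = set()
    for a, b in zip(route, route[1:]):
        cells.update(segment_cells(a, b))
    # Unit wirelength cost per cell plus a congestion penalty.
    return sum(congestion.get(c, 0) + 1 for c in cells)

def best_l_route(src, dst, congestion):
    return min(l_routes(src, dst), key=lambda r: route_cost(r, congestion))
```

Because every net independently evaluates a fixed, tiny candidate set, the inner loop is embarrassingly parallel, which is what makes pattern routing a good fit for GPU acceleration.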
14:34 CET 17.5.2 TRADER: A PRACTICAL TRACK-ASSIGNMENT-BASED DETAILED ROUTER
Speaker:
Zhen Zhuang, Fuzhou University, CN
Authors:
Zhen Zhuang1, Genggeng Liu1, Tsung-Yi Ho2, Bei Yu2 and Wenzhong Guo1
1Fuzhou University, CN; 2The Chinese University of Hong Kong, HK
Abstract
As the last stage of VLSI routing, detailed routing must consider complicated design rules in order to meet the manufacturability of chips. As VLSI technology nodes continue to advance, the design rules keep changing and growing in number, which makes detailed routing a hard task. In this paper, we present a practical track-assignment-based detailed router to deal with the most representative design rules in modern designs. The proposed router consists of four major stages: (1) a graph-based track assignment algorithm is proposed to optimize the design rule violations over the entire die area; (2) an effective rip-up and reroute method is used to reduce the design rule violations in local regions; (3) a segment migration algorithm is proposed to reduce short violations; and (4) a stacked-via optimization technique is proposed to reduce minimum-area violations. Practical benchmarks from the 2019 ISPD contest are used to evaluate the proposed router. Compared with the state-of-the-art detailed router Dr. CU 2.0, the number of violations can be reduced by up to 35.11%, with an average reduction rate of 10.08%. The short-violation area can be reduced by up to 61.49%, with an average reduction rate of 44.80%.
14:38 CET 17.5.3 CR&P: AN EFFICIENT CO-OPERATION BETWEEN ROUTING AND PLACEMENT
Speaker:
Erfan Aghaeekiasaraee, University of Calgary, CA
Authors:
Erfan Aghaeekiasaraee1, Aysa Fakheri Tabrizi1, Tiago Fontana2, Renan Netto3, Sheiny Almeida3, Upma Gandhi1, Jose Guntzel3, David Westwick1 and Laleh Behjat1
1University of Calgary, CA; 2Federal University of Santa Catarina (UFSC), BR; 3Federal University of Santa Catarina, BR
Abstract
Placement and Routing (P&R) are two main steps of the physical design flow. Traditionally, because of their complexity, these two steps are performed separately. However, the implementation of physical design in advanced technology nodes shows that the performance of these two steps is tied to each other. Therefore, creating efficient co-operation between the routing and placement steps has become a hot topic in Electronic Design Automation (EDA). In this work, to achieve an efficient collaboration between the routing and placement engines, an iterative re-placement and rerouting framework facilitated by an Integer Linear Programming (ILP)-based legalizer is proposed and tested on the ACM/IEEE International Symposium on Physical Design (ISPD) 2018 contest benchmarks. Numerical results show that the proposed framework can improve detailed routing vias and wirelength by 2.06% and 0.14% on average, in a reasonable runtime, without adding new Design Rule Violations (DRVs). The proposed framework can be considered an add-on to the physical design flow between global routing and detailed routing.
14:42 CET 17.5.4 PIN ACCESSIBILITY-DRIVEN PLACEMENT OPTIMIZATION WITH ACCURATE AND COMPREHENSIVE PREDICTION MODEL
Speaker:
Suwan Kim, Seoul National University, KR
Authors:
Suwan Kim and Taewhan Kim, Seoul National University, KR
Abstract
The significantly increased pin density of standard cells and the reduced number of routing tracks at sub-10nm nodes have made the pin access problem in detailed routing very difficult. To alleviate this pin accessibility problem, recent works have proposed making small perturbations of cell shifting, cell flipping, and adjacent-cell swapping in the detailed placement stage. An essential element for the success of pin-accessibility-aware detailed placement is the installed cost function, which should be sufficiently accurate in predicting the degree of routing difficulty in accessing pins. In this work, we propose a new cost function model that is comprehensively devised to overcome the limitations of the prior ones. Precisely, unlike the conventional cost functions, our proposed cost function model is based on empirical routing data in order to fully reflect the potential outcomes of detailed routing. Through experiments with benchmark circuits, it is shown that using our proposed cost function in detailed placement reduces the routing errors by 44% on average, while the existing cost functions reduce them by at most 15% on average.
14:46 CET 17.5.5 MIXED-CELL-HEIGHT LEGALIZATION ON CPU-GPU HETEROGENEOUS SYSTEMS
Speaker:
Haoyu Yang, NVIDIA Corp., US
Authors:
Haoyu Yang1, Kit Fung2, Yuxuan Zhao2, Yibo Lin3 and Bei Yu4
1NVIDIA Corp., US; 2Chinese University of Hong Kong, HK; 3Peking University, CN; 4The Chinese University of Hong Kong, HK
Abstract
Legalization refines post-global-placement cell locations to reconcile design constraints and parameters, including placement fence regions, power/ground rail alignment, timing, and wirelength. In advanced technology nodes, designs can easily contain millions of multiple-row standard cells, which challenges the scalability of modern legalization algorithms. In this paper, for the first time, we investigate dedicated legalization algorithms on heterogeneous platforms, which promise intelligent usage of CPU and GPU resources and hence provide new algorithm design methodologies for large-scale physical design problems. Experimental results on the ICCAD 2017 and ISPD 2015 contest benchmarks demonstrate the effectiveness and efficiency of the proposed algorithm compared to the state-of-the-art legalization solution for mixed-cell-height designs.
14:50 CET 17.5.6 Q&A SESSION
Authors:
Laleh Behjat1 and Jens Lienig2
1University of Calgary, CA; 2TU Dresden, DE
Abstract
Questions and answers with the authors

17.6 Multi-Partner Projects – Session 1

Date: Tuesday, 22 March 2022
Time: 14:30 - 15:30 CET

Session chair:
Leticia Maria Bolzani Poehls, RWTH Aachen University, DE

Session co-chair:
Maksim Jenihhin, Tallinn UT, EE

The session is dedicated to multi-partner innovative and high-tech research projects addressing the DATE 2022 topics. The types of collaboration covered are projects funded by EU schemes (H2020, ESA, EIC, MSCA, COST, etc.), nationally and regionally funded projects, and collaborative research projects funded by industry. Depending on the stage of the project, the papers present the novelty of the project concepts, the relevance of the technical objectives to the DATE community, technical highlights of the project results, and insights into the lessons learnt or the open items remaining until the end of the project. In particular, this session discusses projects for automotive and safety-critical systems covering security aspects, RISC-V architecture platforms, and cross-layer concepts for reliability analysis.

Time Label Presentation Title
Authors
14:30 CET 17.6.1 A COMPREHENSIVE SOLUTION FOR SECURING CONNECTED AND AUTONOMOUS VEHICLES
Speaker:
Theocharis Theocharides, KIOS Research and Innovation Center of Excellence, University of Cyprus, CY
Authors:
Mohsin Kamal1, Christos Kyrkou2, Nikos Piperigkos3, Andreas Papandreou3, Andreas Kloukiniotis3, Jordi Casademont4, Natalia Mateu5, Daniel Castillo5, Rodrigo Rodriguez6, Nicola Durante6, Peter Hofmann7, Petros Kapsalas8, Aris Lalos9, Konstantinos Moustakas9, Christos Laoudias1, Theocharis Theocharides2 and Georgios Ellinas10
1KIOS Research and Innovation Center of Excellence, University of Cyprus, CY; 2University of Cyprus, CY; 3Department of Electrical and Computer Engineering, University of Patras, Greece, GR; 4Universitat Politecnica de Catalunya and Fundacio i2CAT, Barcelona, ES; 5Nextium by Idneo, ES; 6Atos IT Solutions and Services Iberia S.L., Madrid, ES; 7Deutsche Telekom Security GmbH, T-Systems, Berlin, DE; 8Panasonic Automotive, Langen, DE; 9Department of Electrical and Computer Engineering, University of Patras, GR; 10Department of Electrical and Computer Engineering, University of Cyprus, Nicosia, CY
Abstract
With the advent of Connected and Autonomous Vehicles (CAVs) comes the very real risk that these vehicles will be exposed to cyber-attacks exploiting various vulnerabilities. This paper gives a technical overview of the H2020 CARAMEL project (currently in its intermediate stage), whose main goal is Artificial Intelligence (AI)-based cybersecurity for CAVs. Most of the scenarios by which an adversary can attack CAVs are considered, such as attacks on camera sensors, GPS location, Vehicle-to-Everything (V2X) message transmission, the vehicle's On-Board Unit (OBU), etc. Counter-measures to these attacks and vulnerabilities are presented via the current results achieved in the CARAMEL project by implementing the designed security algorithms.
14:34 CET 17.6.2 PHYSICAL AND FUNCTIONAL REVERSE ENGINEERING CHALLENGES FOR ADVANCED SEMICONDUCTOR SOLUTIONS
Speaker:
Bernhard Lippmann, Infineon, DE
Authors:
Bernhard Lippmann1, Matthias Ludwig1, Johannes Mutter1, Ann-Christin Bette1, Alexander Hepp2, Johanna Baehr2, Martin Rasche3, Oliver Kellermann3, Horst Gieser4, Tobias Zweifel4 and Nicola Kovač4
1Infineon, DE; 2TU Munich, DE; 3RAITH, DE; 4Fraunhofer, DE
Abstract
Motivated by the threats of malicious modification and piracy arising from worldwide distributed supply chains, the goal of RESEC is the creation, verification, and optimization of a complete reverse engineering process for integrated circuits manufactured in technology nodes of 40 nm and below. Building upon the presentation of individual reverse engineering process stages, this paper connects analysis efforts and yields with their impact on hardware security, demonstrated on a design with implemented hardware Trojans. We outline the interim stage of our research activities and present our future targets linking chip design and physical verification processes.
14:38 CET 17.6.3 DE-RISC: A COMPLETE RISC-V BASED SPACE-GRADE PLATFORM
Speaker:
Jaume Abella, Barcelona Supercomputing Center, ES
Authors:
Nils-Johan Wessman1, Fabio Malatesta1, Stefano Ribes1, Jan Andersson1, Antonio Garcia-Vilanova2, Miguel Masmano2, Vicente Nicolau2, Paco Gomez2, Jimmy Le Rhun3, Sergi Alcaide4, Guillem Cabo5, Francisco Bas4, Pedro Benedicte5, Fabio Mazzocchetti5 and Jaume Abella5
1CAES Gaisler, SE; 2fentISS, ES; 3Thales Research and Technology, FR; 4Universitat Politècnica de Catalunya - Barcelona Supercomputing Center, ES; 5Barcelona Supercomputing Center, ES
Abstract
The H2020 EIC-FTI De-RISC project develops a RISC-V space-grade platform to jointly respond to several emerging as well as longstanding needs in the space domain, such as: (1) higher performance than that of the single-core and basic multicore space-grade processors on the market; (2) access to an increasingly rich software ecosystem rather than sticking to the slowly fading SPARC- and PowerPC-based ones; (3) freedom from (or a drastic reduction of) the export and license restrictions imposed by commercial ISAs such as ARM; (4) improved support for the design and validation of safety-related real-time applications; and (5) software qualified and hardware designed per established space industry standards. De-RISC partners set up the different layers of the platform during the first phases of the project and have recently boosted integration and assessment activities. This paper introduces the De-RISC space platform and presents recent progress such as enabled virtualization and software qualification, new MPSoC features, and use-case deployment and evaluation, including a comparison against other commercial platforms. Finally, it introduces the ongoing activities that will lead to the hardware and fully qualified software platform at TRL8 on FPGA by September 2022.
14:42 CET 17.6.4 THE SCALE4EDGE RISC-V ECOSYSTEM
Speaker:
Wolfgang Ecker, Infineon Technologies AG, DE
Authors:
Wolfgang Ecker1, Milos Krstic2, Andreas Mauderer3, Eyck Jentzsch4, Andreas Koch5, Wolfgang Müller6, Vladimir Herdt7, Daniel Mueller-Gritschneder8, Rafael Stahl8, Kim Grüttner9, Jörg Bormann10, Wolfgang Kunz11, Reinhold Heckmann12, Ralf Wimmer13, Bernd Becker14, Philipp Scholl14, Oliver Bringmann15, Johannes Partzsch16 and Christian Mayr16
1Infineon Technologies AG, DE; 2IHP, DE; 3Robert Bosch GmbH, DE; 4MINRES Technologies GmbH, DE; 5TU Darmstadt, DE; 6Paderborn University, DE; 7University Bremen, DE; 8TU Munich, DE; 9OFFIS - Institute for Information Technology, DE; 10Siemens EDA, DE; 11TU Kaiserslautern, DE; 12AbsInt Angewandte Informatik GmbH, DE; 13Concept Engineering GmbH, DE; 14University of Freiburg, DE; 15University of Tuebingen / FZI, DE; 16TU Dresden, DE
Abstract
This paper introduces the project Scale4Edge. The project is focused on enabling an effective RISC-V ecosystem for optimization of edge applications. We describe the basic components of this ecosystem and introduce the envisioned demonstrators, which will be used in their evaluation.
14:46 CET 17.6.5 XANDAR: EXPLOITING THE X-BY-CONSTRUCTION PARADIGM IN MODEL-BASED DEVELOPMENT OF SAFETY-CRITICAL SYSTEMS
Speaker:
Leonard Masing, Karlsruhe Institute of Technology, DE
Authors:
Leonard Masing1, Tobias Dörr1, Florian Schade2, Juergen Becker1, Georgios Keramidas3, Christos Antonopoulos3, Michail Mavropoulos3, Efstratios Tiganourias3, Vasilios Kelefouras3, Konstantinos Antonopoulos3, Nikolaos Voros3, Umut Durak4, Alexander Ahlbrecht4, Wanja Zaeske4, Christos Panagiotou5, Dimitris Karadimas5, Nico Adler6, Andreas Sailer6, Raphael Weber6, Thomas Wilhelm6, Geza Nemeth7, Fahad Siddiqui8, Rafiullah Khan8, Vahid Garousi8, Sakir Sezer8 and Victor Morales9
1Karlsruhe Institute of Technology, DE; 2Karlsruhe Intitute of Technology, DE; 3University of Peloponnese, GR; 4German Aerospace Center (DLR), DE; 5AVN Innovative Technology Solutions Limited, CY; 6Vector Informatik GmbH, DE; 7Bayerische Motoren Werke Aktiengesellschaft, DE; 8Queen’s University, Belfast, GB; 9fentISS, ES
Abstract
Realizing desired properties “by construction” is a highly appealing goal in the design of safety-critical embedded systems. As verification and validation tasks in this domain are often both challenging and time-consuming, the by-construction paradigm is a promising way to increase design productivity and reduce design errors. In the XANDAR project, partners from industry and academia are developing a toolchain that advances current development processes by employing a model-based X-by-Construction (XbC) approach. XANDAR defines a development process, metamodel extensions, and a library of safety and security patterns, and investigates many further techniques for design automation, verification, and validation. The toolchain will use a hypervisor-based platform targeting future centralized, AI-capable, high-performance embedded processing systems. It is co-developed and validated in both an avionics use case for situation perception and pilot assistance and an automotive use case for autonomous driving.
14:50 CET 17.6.6 FLODAM: CROSS-LAYER RELIABILITY ANALYSIS FLOW FOR COMPLEX HARDWARE DESIGNS
Speaker:
Angeliki Kritikakou, Univ Rennes, Inria, CNRS, IRISA, FR
Authors:
Angeliki Kritikakou1, Olivier Sentieys2, Guillaume Hubert3, Youri Helen4, Jean-francois Coulon5 and Patrice Deroux-Dauphin5
1Univ Rennes, Inria, CNRS, IRISA, FR; 2INRIA, FR; 3ONERA, FR; 4DGA, FR; 5Temento, FR
Abstract
Modern technologies make hardware designs more and more sensitive to radiation particles and related faults. As a result, analysing the behavior of a system under radiation-induced faults has become an essential part of the system design process. Existing approaches either focus on analysing the radiation impact at the lower hardware design layers, without propagating any radiation-induced fault to the system execution, or analyse system reliability at the higher hardware or application layers, based on fault models that are agnostic of the fabrication technology and the radiation environment. FLODAM combines the benefits of existing approaches by providing a novel cross-layer reliability analysis, from the semiconductor layer up to the application layer, that is able to quantify the risk of faults in a given context, taking into account the environmental conditions, the physical hardware design, and the application under study.
14:54 CET 17.6.7 Q&A SESSION
Authors:
Leticia Maria Bolzani Poehls1 and Maksim Jenihhin2
1RWTH Aachen University, DE; 2Tallinn University of Technology, EE
Abstract
Questions and answers with the authors

18.1 Domain-specific co-design: From sensors to graph analytics

Date: Tuesday, 22 March 2022
Time: 15:40 - 16:30 CET

Session chair:
Jeronimo Castrillon, TU Dresden, DE

Session co-chair:
Paula Herber, WWU Munster, DE

This session demonstrates how domain-specific knowledge can be leveraged to design algorithms and micro-architectures for improved computational efficiency. The presentations touch upon CNN optimizations, custom vectorization for sparse solvers, and cache-aware data management for graph analytics. For instance, authors exploit the structure of matrices in circuit simulations to better use modern vector instructions, propose highly energy-efficient architectures for spiking neural networks by modifying the order in which loops are processed, design predictors to reduce CNN operations at runtime, and improve the utilization of the memory subsystem by judiciously bypassing hierarchy levels for graph analytics. The session also describes how input and output patterns can be exploited to train a generative adversarial network that detects hardware Trojans in router designs.

Time Label Presentation Title
Authors
15:40 CET 18.1.1 SNE: AN ENERGY-PROPORTIONAL DIGITAL ACCELERATOR FOR SPARSE EVENT-BASED CONVOLUTIONS
Speaker:
Alfio Di Mauro, ETH Zürich, CH
Authors:
Alfio Di Mauro1, Arpan Prasad1, Zhikai Huang1, Matteo Spallanzani1, Francesco Conti2 and Luca Benini3
1ETH Zürich, CH; 2University of Bologna, IT; 3Università di Bologna and ETH Zürich, IT
Abstract
Event-based sensors are drawing increasing attention due to their high temporal resolution, low power consumption, and low bandwidth. To efficiently extract semantically meaningful information from the sparse data streams produced by such sensors, we present a 4.5TOP/s/W digital accelerator capable of executing 4-bit-quantized event-based convolutional neural networks (eCNNs). Compared to standard convolutional engines, our accelerator performs a number of operations proportional to the number of events contained in the input data stream, ultimately achieving high energy-to-information processing proportionality. On the IBM DVS-Gesture dataset, we report 80uJ/inf and 261uJ/inf when the input activity is 1.2% and 4.9%, respectively. Our accelerator consumes 0.221pJ/SOP, which, to the best of our knowledge, is the lowest energy/OP reported for a digital neuromorphic engine.
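The key idea of event-proportional processing can be illustrated in software: instead of sliding a kernel over every pixel of a dense frame, the kernel footprint is scattered only around the positions where events occurred, so the work scales with the number of events. The following sketch is a hypothetical software analogue of that idea, not the accelerator's actual datapath; the event format and function name are illustrative.

```python
import numpy as np

def event_conv2d(events, kernel, out_shape):
    """Accumulate kernel contributions only at event locations, so the
    work is proportional to the event count, not the frame size."""
    kh, kw = kernel.shape
    out = np.zeros(out_shape)
    for (y, x, val) in events:  # each event: position + polarity/value
        # scatter the kernel footprint around the event position
        for dy in range(kh):
            for dx in range(kw):
                oy, ox = y + dy - kh // 2, x + dx - kw // 2
                if 0 <= oy < out_shape[0] and 0 <= ox < out_shape[1]:
                    out[oy, ox] += val * kernel[dy, dx]
    return out

events = [(4, 4, 1.0), (10, 3, -1.0)]   # two events in a 16x16 frame
kernel = np.ones((3, 3)) / 9.0          # simple averaging kernel
out = event_conv2d(events, kernel, (16, 16))
```

With only two events, the inner loops execute 18 times regardless of frame size, which is the proportionality the abstract describes.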
15:44 CET 18.1.2 LRP: PREDICTIVE OUTPUT ACTIVATION BASED ON SVD APPROACH FOR CNNS ACCELERATION
Speaker:
Xinxin Wu, Institute of Computing Technology, Chinese Academy of Sciences, CN
Authors:
Xinxin Wu, Zhihua Fan, Tianyu Liu, Wenming Li, Xiaochun Ye and Dongrui Fan, Institute of Computing Technology, Chinese Academy of Sciences, CN
Abstract
Convolutional Neural Networks (CNNs) achieve state-of-the-art performance in a wide range of applications. CNNs contain millions of parameters, and their large number of computations challenges hardware design. In this paper, we take advantage of the output activation sparsity of CNNs to reduce the execution time and energy consumption of the network. We propose Low Rank Prediction (LRP), an effective prediction method that leverages output activation sparsity. LRP first predicts the output activation polarity of a convolutional layer based on a singular value decomposition (SVD) approximation of the convolution kernel. It then uses the predicted negative values to skip invalid computation in the original convolution. In addition, an effective accelerator, LRPPU, is proposed that exploits this sparsity to accelerate network inference. Experiments show that LRPPU achieves 1.48× speedup and 2.02× energy reduction compared with dense networks, with a slight loss of accuracy. It also achieves on average 2.57× speedup over Eyeriss, and has similar performance and less accuracy loss compared with SnaPEA.
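The prediction-then-skip idea can be sketched in a few lines: approximate the weight matrix by a low-rank SVD truncation, use the cheap approximation's sign to guess which outputs a ReLU would zero out, and run the exact computation only for the predicted-positive outputs. This is an illustrative sketch under stated assumptions (a fully connected layer standing in for a convolution, and the rank chosen arbitrarily), not the paper's exact method.

```python
import numpy as np

def low_rank_predict(W, x, rank=1):
    """Predict which outputs of ReLU(W @ x) are nonzero, using a
    rank-`rank` SVD surrogate of W as a cheap polarity predictor."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W_lr = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # low-rank surrogate
    return (W_lr @ x) >= 0  # True where full computation is predicted useful

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
x = rng.standard_normal(128)
mask = low_rank_predict(W, x, rank=4)

# Compute only the predicted-positive outputs exactly; skip the rest.
y = np.zeros(64)
y[mask] = W[mask] @ x
```

The fraction of skipped rows (`~mask`) is the computation saved; mispredictions show up as outputs that should have been positive but were forced to zero, which is the source of the slight accuracy loss the abstract mentions.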
15:48 CET 18.1.3 EXPLOITING ARCHITECTURE ADVANCES FOR SPARSE SOLVERS IN CIRCUIT SIMULATION
Speaker:
Zhiyuan Yan, Institute of Computing Technology, Chinese Academy of Sciences, CN
Authors:
Zhiyuan Yan1, Biwei Xie1, Xingquan Li2 and Yungang Bao1
1Institute of Computing Technology, Chinese Academy of Sciences, CN; 2Peng Cheng Laboratory, CN
Abstract
Sparse direct solvers provide vital functionality for a wide variety of scientific applications. The dominant part of a sparse direct solver, LU factorization, suffers greatly from the irregularity of sparse matrices. Meanwhile, the specific characteristics of sparse solvers in circuit simulation and the unique sparsity pattern of circuit matrices provide more design space as well as great challenges. In this paper, we propose a sparse solver named FLU and re-examine the performance of LU factorization from the perspectives of vectorization, parallelization, and data locality. To improve vectorization efficiency and data locality, FLU introduces a register-level supernode computation method that delicately manipulates data movement. With alternating multiple-column computation, FLU further reduces off-chip memory accesses greatly. Furthermore, we implement a fine-grained elimination-tree-based parallelization scheme to fully exploit task-level parallelism. Compared with PARDISO and NICSLU, experimental results show that FLU achieves speedups of up to 19.51× (3.86× on average) and 2.56× (1.66× on average) on an Intel Xeon, respectively.
15:52 CET 18.1.4 DATA-AWARE CACHE MANAGEMENT FOR GRAPH ANALYTICS
Speaker:
Varun Venkitaraman, Indian Institute of Technology Bombay, IN
Authors:
Neelam Sharma1, Varun Venkitaraman1, Newton Singh2, Vikash Kumar2, Shubham Singhania2 and Chandan Kumar Jha2
1Indian Institute of Technology, Bombay, IN; 2IIT Bombay, IN
Abstract
Graph analytics is powering a wide variety of applications in the domains of cybersecurity, contact tracing, and social networking. It consists of various algorithms (or workloads) that investigate the relationships between entities involved in transactions, interactions, and organizations. CPU-based graph analytics is inefficient because the cache hierarchy performs poorly owing to the highly irregular memory access patterns of graph workloads. Policies managing the cache hierarchy in such systems are oblivious to the locality demands of the different data types within graph workloads, and are therefore suboptimal. In this paper, we conduct an in-depth data-type-aware characterization of graph workloads to better understand the cache utilization of various graph data types. We find that different levels of the cache hierarchy are more sensitive to the locality demands of certain graph data types than others. Hence, we propose GRACE, a graph data-aware cache management technique, to increase cache hierarchy utilization, thereby minimizing off-chip memory traffic and enhancing performance. Our thorough evaluations show that GRACE, when augmented with a vertex reordering algorithm, outperforms a recent cache management scheme by up to 1.4x, with up to 27% reduction in expensive off-chip memory accesses. Thus, our work demonstrates that awareness of different graph data types is critical for effective cache management in graph analytics.
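The core intuition, that different graph data types deserve different cache treatment, can be shown with a toy LRU cache that bypasses insertion for data types known to stream with no reuse (e.g., edge lists), so hot data (e.g., vertex properties) is not evicted by them. This is a minimal sketch of the general bypassing idea, not GRACE's actual policy; the type names and bypass set are assumptions for illustration.

```python
from collections import OrderedDict

class TypeAwareCache:
    """Toy LRU cache that skips insertion for low-reuse data types,
    keeping capacity free for data with genuine temporal locality."""
    def __init__(self, capacity, bypass_types):
        self.capacity = capacity
        self.bypass_types = bypass_types
        self.lru = OrderedDict()          # addr -> data type
        self.hits = self.misses = 0

    def access(self, addr, dtype):
        if addr in self.lru:
            self.hits += 1
            self.lru.move_to_end(addr)    # refresh LRU position
            return True
        self.misses += 1
        if dtype not in self.bypass_types:  # don't pollute the cache
            self.lru[addr] = dtype
            if len(self.lru) > self.capacity:
                self.lru.popitem(last=False)  # evict LRU line
        return False

cache = TypeAwareCache(capacity=2, bypass_types={"edge_stream"})
for addr in range(100):                 # streaming edges: no reuse
    cache.access(addr, "edge_stream")
cache.access(1000, "vertex_prop")       # hot vertex data is cached
cache.access(1000, "vertex_prop")       # second touch hits
```

Without the bypass set, the 100 streaming edge accesses would repeatedly evict the two-entry cache; with it, the vertex-property line survives and its second access hits.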
15:56 CET 18.1.5 AGAPE: ANOMALY DETECTION WITH GENERATIVE ADVERSARIAL NETWORK FOR IMPROVED PERFORMANCE, ENERGY, AND SECURITY IN MANYCORE SYSTEMS
Speaker:
Ke Wang, The George Washington University, US
Authors:
Ke Wang1, Hao Zheng2, Yuan Li1, Jiajun Li3 and Ahmed Louri1
1The George Washington University, US; 2University of Central Florida, US; 3Beihang University, CN
Abstract
The security of manycore systems has become increasingly critical. In systems-on-chip (SoCs), Hardware Trojans (HTs) manipulate the functionalities of the routing components to saturate the on-chip network, degrade performance, and leak sensitive data. Existing HT detection techniques, including runtime monitoring and state-of-the-art learning-based methods, are unable to timely and accurately identify implanted HTs due to the increasingly dynamic and complex nature of on-chip communication behaviors. We propose AGAPE, a novel Generative Adversarial Network (GAN)-based anomaly detection and mitigation method against HTs for secured on-chip communication. AGAPE learns the distribution of the multivariate time series of a number of NoC attributes captured by on-chip sensors under both HT-free and HT-infected working conditions. The proposed GAN can learn the potential latent interactions among different runtime attributes concurrently, accurately distinguish abnormal attacked situations from normal SoC behaviors, and identify the type and location of the implanted HTs. Using the detection results, we apply the most suitable protection technique to each type of detected HT instead of simply isolating the entire HT-infected router, with the aim of mitigating security threats as well as reducing performance loss. Simulation results show that AGAPE enhances the HT detection accuracy by 19%, and reduces network latency and power consumption by 39% and 30%, respectively, compared to state-of-the-art security designs.
16:00 CET 18.1.6 Q&A SESSION
Authors:
Jeronimo Castrillon1 and Paula Herber2
1TU Dresden, DE; 2University of Münster, DE
Abstract
Questions and answers with the authors

18.2 Memory-centric and neural network systems: architectures, tools, and profilers

Date: Tuesday, 22 March 2022
Time: 15:40 - 16:30 CET

Session chair:
Mohamed M. Sabry Aly, Nanyang Technological University, SG

Session co-chair:
Huichu Liu, Meta, Inc., US

This session focuses on two domains: neural networks (NNs) and processing-in-memory (PIM) systems. The first paper introduces a profiler to aid in the decision-making process of migrating tasks from CPUs to PIM. The second paper provides a framework for efficient design-space exploration of NN mappings to PIM fabrics. The third paper analyses the security and resilience of spiking neural network architectures. The fourth paper investigates circuit-level techniques to enhance nonlinear operations in SRAM-based NN kernels. The session also includes a tool for content-addressable memory and a hybrid in-memory computing architecture.

Time Label Presentation Title
Authors
15:40 CET 18.2.1 (Best Paper Award Candidate)
PIMPROF: AN AUTOMATED PROGRAM PROFILER FOR PROCESSING-IN-MEMORY OFFLOADING DECISIONS
Speaker:
Yizhou Wei, University of Virginia, US
Authors:
Yizhou Wei1, Minxuan Zhou2, Sihang Liu1, Korakit Seemakhupt1, Tajana S. Rosing2 and Samira Khan1
1University of Virginia, US; 2UCSD, US
Abstract
Processing-in-memory (PIM) architectures reduce data movement overhead by bringing computation closer to the memory. However, a key challenge is to decide which code regions of a program should be offloaded to PIM for the best performance. The goal of this work is to help programmers leverage PIM architectures by automatically profiling legacy workloads to find PIM-friendly code regions for offloading. We propose PIMProf, an automated profiling and offloading tool that determines PIM offloading regions for CPU-PIM hybrid architectures. PIMProf efficiently models the comprehensive cost related to PIM offloading and makes the offloading decision with an effective and computationally tractable algorithm. We demonstrate the effectiveness of PIMProf by evaluating the GAP graph benchmark suite and the PARSEC benchmark suite under different PIM and CPU configurations. Our evaluation shows that, compared to the CPU baseline and a PIM-only configuration, the offloading decisions by PIMProf provide 5.33x and 1.39x speedup on the GAP graph workloads, respectively, and 2.22x and 1.74x speedup on the PARSEC benchmarks, respectively.
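The shape of such an offloading decision can be sketched as a per-region cost comparison: a region goes to PIM when its modelled PIM cost (slower compute plus offload overhead) beats its CPU cost (fast compute plus data-movement penalties that PIM avoids). The cost terms and numbers below are hypothetical stand-ins for illustration, not PIMProf's actual cost model.

```python
def offload_decision(regions):
    """Assign each code region to CPU or PIM based on a toy cost model:
    CPU pays for off-chip traffic, PIM pays an offload overhead."""
    plan = {}
    for name, r in regions.items():
        cpu_cost = r["compute_cpu"] + r["mem_traffic"] * r["dram_penalty"]
        pim_cost = r["compute_pim"] + r["offload_overhead"]
        plan[name] = "PIM" if pim_cost < cpu_cost else "CPU"
    return plan

regions = {
    # memory-bound region: little compute, heavy traffic -> PIM wins
    "graph_traversal": dict(compute_cpu=10, mem_traffic=500,
                            dram_penalty=0.8, compute_pim=40,
                            offload_overhead=20),
    # compute-bound region: PIM's slower compute does not pay off
    "matrix_kernel": dict(compute_cpu=100, mem_traffic=50,
                          dram_penalty=0.8, compute_pim=400,
                          offload_overhead=20),
}
plan = offload_decision(regions)
# plan == {"graph_traversal": "PIM", "matrix_kernel": "CPU"}
```

The interesting part of the real tool is estimating these cost terms automatically from a profile of a legacy binary; the decision itself is then a comparison like the one above.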
15:44 CET 18.2.2 ANALYSIS OF POWER-ORIENTED FAULT INJECTION ATTACKS ON SPIKING NEURAL NETWORKS
Speaker:
Karthikeyan Nagarajan, Pennsylvania State University, US
Authors:
Karthikeyan Nagarajan1, Junde Li1, Sina Sayyah Ensan1, Mohammad Nasim Imtiaz Khan2, Sachhidh Kannan3 and Swaroop Ghosh1
1Pennsylvania State University, US; 2Intel Corporation, US; 3Ampere Computing LLC, US
Abstract
Spiking Neural Networks (SNNs) are quickly gaining traction as a viable alternative to Deep Neural Networks (DNNs). In comparison to DNNs, SNNs are more computationally powerful and provide superior energy efficiency. SNNs, while exciting at first glance, contain security-sensitive assets (e.g., the neuron threshold voltage) and vulnerabilities (e.g., the sensitivity of classification accuracy to neuron threshold voltage changes) that adversaries can exploit. We investigate global fault injection attacks that employ external power supplies and laser-induced local power glitches to corrupt crucial training parameters, such as the spike amplitude and the neuron's membrane threshold potential, in SNNs built from common analog neurons. We also evaluate the impact of power-based attacks on individual SNN layers for 0% (i.e., no attack) to 100% (i.e., whole layer under attack). We investigate the impact of the attacks on digit classification tasks and find that, in the worst-case scenario, classification accuracy is reduced by 85.65%. We also propose defenses, e.g., a robust current-driver design that is immune to power-oriented attacks and improved circuit sizing of neuron components that reduces/recovers the adversarial accuracy degradation at the cost of negligible area and a 25% power overhead. We also present a dummy-neuron-based voltage fault injection detection system with 1% power and area overhead.
15:48 CET 18.2.3 GIBBON: EFFICIENT CO-EXPLORATION OF NN MODEL AND PROCESSING-IN-MEMORY ARCHITECTURE
Speaker:
Hanbo Sun, Tsinghua University, CN
Authors:
Hanbo Sun, Chenyu Wang, Zhenhua Zhu, Xuefei Ning, Guohao Dai, Huazhong Yang and Yu Wang, Tsinghua University, CN
Abstract
Memristor-based Processing-In-Memory (PIM) architectures have shown great potential to boost the computing energy efficiency of Neural Networks (NNs). Existing work concentrates on hardware architecture design and algorithm-hardware co-optimization, but neglects the non-negligible impact of the correlation between NN models and PIM architectures. To ensure high accuracy and energy efficiency, it is important to co-design the NN model and the PIM architecture. However, on the one hand, the co-exploration space of NN model and PIM architecture is extremely large, making the search for optimal results difficult. On the other hand, during the co-exploration process, PIM simulators pose a heavy computational burden and runtime overhead for evaluation. To address these problems, in this paper we propose Gibbon, an efficient co-exploration framework for the NN model and PIM architecture. In Gibbon, we propose an evolutionary search algorithm with adaptive parameter priority, which focuses on the subspace of high-priority parameters and alleviates the problem of the vast co-design space. Besides, we design a Recurrent Neural Network (RNN)-based predictor for accuracy and hardware performance. It substitutes for a large part of the PIM simulator workload and reduces the long simulation time. Experimental results show that the proposed co-exploration framework can find better NN models and PIM architectures than existing studies in only seven GPU hours (8.4∼41.3× speedup). At the same time, Gibbon can improve the accuracy of co-design results by 10.7% and reduce the energy-delay product by 6.48× compared with existing work.
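A priority-biased evolutionary search can be sketched as a hill climber that mutates one parameter per step, picking which parameter to mutate with probability proportional to its priority, so the search concentrates on the subspace that matters most. This is a toy sketch of the general idea only; the two-parameter design space, the fitness function, and the priority weights are all invented for illustration and are not Gibbon's.

```python
import random

def evolve(init, mutate_ops, fitness, priority, steps=200, seed=0):
    """Hill-climbing evolutionary search: parameters with higher
    priority are mutated more often (priority-biased sampling)."""
    rng = random.Random(seed)
    best, best_fit = dict(init), fitness(init)
    params = list(priority)
    weights = [priority[p] for p in params]
    for _ in range(steps):
        cand = dict(best)
        p = rng.choices(params, weights=weights)[0]  # biased pick
        cand[p] = mutate_ops[p](cand[p], rng)
        f = fitness(cand)
        if f > best_fit:                 # keep only improvements
            best, best_fit = cand, f
    return best, best_fit

# Hypothetical co-design space: NN layer width and PIM crossbar size,
# with a made-up fitness peaking at width=48, xbar=128.
fitness = lambda c: -(c["width"] - 48) ** 2 - 0.1 * (c["xbar"] - 128) ** 2
ops = {"width": lambda v, r: max(8, v + r.choice([-8, 8])),
       "xbar":  lambda v, r: max(32, v + r.choice([-32, 32]))}
best, best_fit = evolve({"width": 16, "xbar": 64}, ops, fitness,
                        priority={"width": 3.0, "xbar": 1.0})
```

In the real framework the expensive `fitness` call is what the RNN-based predictor replaces, which is why it can cut the co-exploration down to a few GPU hours.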
15:52 CET 18.2.4 AID: ACCURACY IMPROVEMENT OF ANALOG DISCHARGE-BASED IN-SRAM MULTIPLICATION ACCELERATOR
Speaker:
Saeed Seyedfaraji, Vienna University of Technology (TU-Wien), AT
Authors:
Saeed Seyedfaraji1, Baset Mesgari2 and Semeen Rehman3
1Institute of Computer Technology, TU Wien, AT; 2Vienna University of Technology, AT; 3TU Wien, AT
Abstract
This paper presents a novel technique to improve the accuracy of an energy-efficient in-memory multiplier using a standard 6T-SRAM. State-of-the-art discharge-based in-SRAM multiplication accelerators suffer from non-linear behavior on their bit-lines (BL, BLB) due to the quadratic nature of the access transistor, which leads to a poor signal-to-noise ratio (SNR). To achieve linearity in the BLB voltage, we propose a novel root-function voltage technique on the access transistor's gate that improves the SNR by 10.77 dB on average compared to state-of-the-art discharge-based topologies. Our analytical methods and circuit simulations in a 65 nm CMOS technology verify that the proposed technique consumes 0.523 pJ per computation (multiplication, accumulation, and preset) from a 1 V power supply, which is 51.18% lower than other state-of-the-art techniques. An extensive Monte Carlo simulation of a 4x4 multiplication operation shows that our technique exhibits a standard deviation below 0.086 for the worst-case incorrect-output scenario.
15:56 CET 18.2.5 Q&A SESSION
Authors:
Mohamed M. Sabry Aly1 and Huichu Liu2
1Nanyang Technological University, SG; 2Facebook Inc., US
Abstract
Questions and answers with the authors

18.3 Persistent Memory

Date: Tuesday, 22 March 2022
Time: 15:40 - 16:30 CET

Session chair:
Joseph Friedman, UT Dallas, US

Session co-chair:
Chengmo Yang, University of Delaware, US

Switching device non-volatility inspires the opportunity for persistent memory that stores data without requiring the continuous application of energy. This session therefore explores the implications of non-volatility and persistent memory on cache architecture design. In particular, the presentations focus on locality, bandwidth, granularity, wearing, and application awareness.

Time Label Presentation Title
Authors
15:40 CET 18.3.1 CHARACTERIZING AND OPTIMIZING HYBRID DRAM-PM MAIN MEMORY SYSTEM WITH APPLICATION AWARENESS
Speaker:
Yongfeng Wang, Sun Yat-Sen University, CN
Authors:
Yongfeng Wang, Yinjin Fu, Yubo Liu, Zhiguang Chen and Nong Xiao, Sun Yat-Sen University, CN
Abstract
Persistent memory (PM) is often combined with DRAM to configure hybrid main memory systems that obtain both the high performance of DRAM and the large capacity of PM. There are critical management challenges in data placement, memory concurrency and workload scheduling for the concurrent execution of multiple application workloads, and the non-negligible performance gap between DRAM and PM makes existing application-agnostic management strategies inefficient at reaching the full potential of hybrid memory. In this paper, we propose a series of application-aware optimization strategies, including application-aware data placement, adaptive thread allocation and inter-application interference avoidance, to improve the concurrent performance of different application workloads on hybrid memory. Finally, we evaluate our application-aware solutions on real hybrid memory hardware with several comprehensive benchmark suites. Our experimental results show that the duration of multi-application concurrent execution on hybrid memory can be reduced by up to 60.7% with application-aware data placement, 37.7% with adaptive thread allocation and 34.8% with workload scheduling that avoids inter-application interference. The combined effect of all three optimization methods reaches a 62.8% performance improvement with negligible overheads.
15:44 CET 18.3.2 PATS: TAMING BANDWIDTH CONTENTION BETWEEN PERSISTENT AND DYNAMIC MEMORIES
Speaker:
Shucheng Wang, Huazhong University of Science and Technology, CN
Authors:
Shucheng Wang1, Qiang Cao1, Hong Jiang2 and Yuanyuan Dong3
1Huazhong University of Science and Technology, CN; 2University of Texas at Arlington, US; 3Alibaba Group, CN
Abstract
Emerging persistent memory (PM), with fast persistence and byte-addressability, physically shares the memory channel with DRAM-based main memory. We experimentally uncover that the throughput of applications accessing DRAM collapses when multiple threads access PM, due to head-of-line blocking in the memory controller within the CPU. To address this problem, we design a PM-Accessing Thread Scheduling (PATS) mechanism, guided by a contention model, that adaptively tunes the maximum number of contention-free concurrent PM-threads. Experimental results show that even with 14 concurrent threads accessing PM, PATS allows at most an 8% decrease in the DRAM throughput of front-end applications (e.g., Memcached), while gaining a 1.5x PM-throughput speedup over the default configuration.
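The core mechanism, capping the number of threads that may access PM concurrently, can be sketched with a semaphore. This is an illustrative sketch only: PATS itself tunes the cap adaptively via its contention model, which is not modeled here.

```python
import threading

class PMThrottle:
    """Caps the number of concurrently PM-accessing threads (hypothetical sketch)."""
    def __init__(self, max_pm_threads):
        self._sem = threading.BoundedSemaphore(max_pm_threads)

    def __enter__(self):
        self._sem.acquire()   # block until a PM slot is free
        return self

    def __exit__(self, *exc):
        self._sem.release()   # free the slot for the next PM thread

def worker(throttle, counter, lock, peak):
    with throttle:
        with lock:
            counter[0] += 1
            peak[0] = max(peak[0], counter[0])
        # ... PM reads/writes would happen here ...
        with lock:
            counter[0] -= 1

def run(n_threads=14, cap=4):
    """Launch n_threads PM workers; return the peak observed concurrency."""
    throttle = PMThrottle(cap)
    counter, peak, lock = [0], [0], threading.Lock()
    ts = [threading.Thread(target=worker, args=(throttle, counter, lock, peak))
          for _ in range(n_threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return peak[0]
```

The semaphore guarantees the peak never exceeds the cap regardless of how many threads are launched, which is the invariant PATS enforces dynamically.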
15:48 CET 18.3.3 UNIFYING TEMPORAL AND SPATIAL LOCALITY FOR CACHE MANAGEMENT INSIDE SSDS
Speaker:
Jianwei Liao, Southwest University of China, CN
Authors:
Zhibing Sha1, Zhigang Cai1, Dong Yin2, Jianwei Liao1 and Francois Trahay3
1Southwest University of China, CN; 2Huaihua University, CN; 3Telecom Sudparis, FR
Abstract
To ensure better I/O performance of solid-state drives (SSDs), a dynamic random access memory (DRAM) is commonly equipped as a cache to absorb overwrites or writes and avoid flushing them onto the underlying SSD cells. This paper focuses on the management of this small cache inside SSDs. First, we propose to unify the factors of temporal and spatial locality using the visibility graph technique while running user applications, to direct cache management. Next, we propose to support batch adjustment of adjacent or nearby (hot) cached data pages by referring to the connections in the visibility graph of all cached pages. Finally, we propose to evict the buffered data pages in batches, to maximize the internal flushing parallelism of SSD devices without worsening I/O congestion. Trace-driven simulation experiments show that our proposal improves cache hits by more than 2.0% and the overall I/O latency by 19.3% on average, compared to conventional cache schemes inside SSDs.
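As background, the natural visibility graph borrowed from time-series analysis connects two samples when no intermediate sample blocks the straight line between them. A small sketch over a sequence of per-page access counts (illustrative only, not the paper's cache manager; hot pages would correspond to high-degree nodes):

```python
def visibility_edges(series):
    """Natural visibility graph of a sequence (e.g., per-page access counts).

    Sample i 'sees' sample j if every intermediate sample k lies strictly
    below the straight line joining (i, series[i]) and (j, series[j]).
    """
    n = len(series)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            visible = all(
                series[k] < series[j]
                + (series[i] - series[j]) * (j - k) / (j - i)
                for k in range(i + 1, j)
            )
            if visible:
                edges.add((i, j))
    return edges
```

Adjacent samples are always mutually visible, while a large spike hides the samples behind it from those in front, which is how the graph encodes both temporal adjacency and spatial "hot ridge" structure in one object.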
15:52 CET 18.3.4 DWR: DIFFERENTIAL WEARING FOR READ PERFORMANCE OPTIMIZATION ON HIGH-DENSITY NAND FLASH MEMORY
Speaker:
Liang Shi, School of Computer Science and Technology, East China Normal University, CN
Authors:
Yunpeng Song1, Qiao Li2, Yina Lv1, Changlong Li1 and Liang Shi1
1School of Computer Science and Technology, East China Normal University, CN; 2City University of Hong Kong, HK
Abstract
With cost reduction and density optimization, the read performance and lifetime of high-density NAND flash memory have degraded significantly over the last decade. Previous works proposed to optimize lifetime with wear leveling and to optimize read performance with reliability improvement. However, reliability and read performance degrade with wearing over the life of the device. To solve this problem, a differential wearing scheme (DWR) is proposed to optimize read performance. The basic idea of DWR is to partition the flash memory into two areas and wear them at different speeds. In the area with low wearing speed, read operations are scheduled for read performance optimization. In the area with high wearing speed, write operations are scheduled but designed to avoid generating bad blocks early. Through careful design and evaluation with real workloads on 3D TLC NAND flash, DWR achieves encouraging read performance optimization with negligible impact on lifetime.
15:56 CET 18.3.5 GATLB: A GRANULARITY-AWARE TLB TO SUPPORT MULTI-GRANULARITY PAGES IN HYBRID MEMORY SYSTEM
Speaker:
Yujie Xie, Chongqing University, CN
Authors:
Yujuan Tan1, Yujie Xie1, Zhulin Ma1, Zhichao Yan2, Zhichao Zhang1, Duo Liu1 and Xianzhang Chen1
1Chongqing University, CN; 2Hewlett Packard Enterprise, US
Abstract
The parallel hybrid memory system that combines Non-volatile Memory (NVM) and DRAM can effectively expand memory capacity, but it puts significant pressure on the TLB due to its limited capacity. Superpage technology, which manages pages at a large granularity (e.g., 2 MB), is usually used to improve TLB performance. However, its coarse granularity conflicts with the fine-grained page migration in the hybrid memory system, resulting in serious invalid migration and page fragmentation problems. To solve these problems, we propose to maintain the coexistence of multi-granularity pages, and design a smart TLB called GATLB to support multi-granularity page management, coalesce consecutive pages and adapt to changes in page size. Compared with existing TLB technologies, GATLB can not only perceive page granularity to effectively expand the TLB coverage and reduce the miss rate, but also provide faster address translation with much lower overhead. Our experimental evaluations show that GATLB can expand the TLB coverage by 7.09x, reduce the TLB miss rate by 91.1%, and shorten the address translation cycle by 49.41%.
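The idea of granularity-aware entries can be illustrated with a toy lookup model: each entry records its own page size, so 4 KB and 2 MB mappings coexist in the same structure. This is a hypothetical sketch of the concept, not the GATLB microarchitecture (which additionally coalesces consecutive pages and splits entries on migration).

```python
PAGE_4K = 4096
PAGE_2M = 2 * 1024 * 1024

class MultiGranularityTLB:
    """Toy granularity-aware TLB: entries carry their own page size."""
    def __init__(self):
        self.entries = []  # list of (virt_base, phys_base, page_size)

    def insert(self, vbase, pbase, size):
        self.entries.append((vbase, pbase, size))

    def lookup(self, vaddr):
        # A single entry covers 'size' bytes, so one 2 MB entry replaces
        # 512 4 KB entries, expanding effective TLB coverage.
        for vbase, pbase, size in self.entries:
            if vbase <= vaddr < vbase + size:
                return pbase + (vaddr - vbase)
        return None  # TLB miss
```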
16:00 CET 18.3.6 Q&A SESSION
Authors:
Joseph Friedman1 and Chengmo Yang2
1University of Texas at Dallas, US; 2University of Delaware, US
Abstract
Questions and answers with the authors

18.4 Energy Efficient Platforms: from Autonomous Vehicles to Intermittent Computing

Date: Tuesday, 22 March 2022
Time: 15:40 - 16:30 CET

Session chair:
Domenico Balsamo, Newcastle University, GB

Session co-chair:
Bart Vermeulen, NXP Semiconductors, NL

This session focuses on energy-efficient platforms and presents four papers. The first paper presents an efficient accelerator that enables real-time probabilistic 3D mapping at the edge for autonomous machines. Continuing on autonomous vehicles, the second paper presents an FPGA solution for efficient real-time localization. Moving to tinier devices, the next two papers focus on energy-harvesting IoT devices that operate intermittently without batteries: the first presents a deep learning approach, while the second closes the session by presenting an FPGA-based emulation of non-volatile digital logic for intermittent computing.

Time Label Presentation Title
Authors
15:40 CET 18.4.1 OMU: A PROBABILISTIC 3D OCCUPANCY MAPPING ACCELERATOR FOR REAL-TIME OCTOMAP AT THE EDGE
Speaker:
Tianyu Jia, Peking University, CN
Authors:
Tianyu Jia1, En-Yu Yang2, Yu-Shun Hsiao2, Jonathan Cruz2, David Brooks2, Gu-Yeon Wei2 and Vijay Janapa Reddi2
1Peking University, CN; 2Harvard University, US
Abstract
Autonomous machines (e.g., vehicles, mobile robots, drones) require sophisticated 3D mapping to perceive the dynamic environment. However, maintaining a real-time 3D map is expensive both in terms of compute and memory requirements, especially for resource-constrained edge machines. Probabilistic OctoMap is a reliable and memory-efficient 3D dense map model to represent the full environment, with dynamic voxel node pruning and expansion capacity. This paper presents the first efficient accelerator solution, i.e. OMU, to enable real-time probabilistic 3D mapping at the edge. To improve the performance, the input map voxels are updated via parallel PE units for data parallelism. Within each PE, the voxels are stored using a specially developed data structure in parallel memory banks. In addition, a pruning address manager is designed within each PE unit to reuse the pruned memory addresses. The proposed 3D mapping accelerator is implemented and evaluated using a commercial 12 nm technology. Compared to the ARM Cortex-A57 CPU in the Nvidia Jetson TX2 platform, the proposed accelerator achieves up to 62× performance and 708× energy efficiency improvement. Furthermore, the accelerator provides 63 FPS throughput, more than 2× higher than a real-time requirement, enabling real-time perception for 3D mapping.
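The OctoMap model that OMU accelerates maintains a per-voxel occupancy estimate as clamped log-odds; clamping is what makes node pruning possible, since saturated subtrees can be collapsed. A minimal sketch of that standard update, as background for the paper (the increment and clamp constants below are typical illustrative values, not taken from the paper):

```python
import math

# OctoMap-style log-odds occupancy update (background, not the OMU hardware).
L_OCC, L_FREE = 0.85, -0.4   # log-odds increments for a hit / a miss
L_MIN, L_MAX = -2.0, 3.5     # clamping thresholds; saturation enables pruning

def update(logodds, hit):
    """Apply one sensor observation to a voxel's log-odds value."""
    logodds += L_OCC if hit else L_FREE
    return max(L_MIN, min(L_MAX, logodds))

def prob(logodds):
    """Convert log-odds back to an occupancy probability."""
    return 1.0 / (1.0 + math.exp(-logodds))
```

Because the update is an independent add-and-clamp per voxel, it parallelizes naturally across PE units, which is the data parallelism the accelerator exploits.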
15:44 CET 18.4.2 AN FPGA OVERLAY FOR EFFICIENT REAL-TIME LOCALIZATION IN 1/10TH SCALE AUTONOMOUS VEHICLES
Speaker:
Paolo Burgio, University of Modena and Reggio Emilia, IT
Authors:
Andrea Bernardi, Gianluca Brilli, Alessandro Capotondi, Andrea Marongiu and Paolo Burgio, University of Modena and Reggio Emilia, IT
Abstract
Heterogeneous systems-on-chip (HeSoC) based on reconfigurable accelerators, such as Field-Programmable Gate Arrays (FPGA), represent an appealing option to deliver the performance/Watt required by the advanced perception and localization tasks employed in the design of Autonomous Vehicles. Different from software-programmed GPUs, FPGA development involves significant hardware design effort, which in the context of HeSoCs is further complicated by the system-level integration of HW and SW blocks. High-Level Synthesis is increasingly being adopted to ease hardware IP design, allowing engineers to quickly prototype their solutions. However, automated tools still lack the required maturity to efficiently build the complex hardware/software interaction between the host CPU and the FPGA accelerator(s). In this paper we present a fully integrated system design where a particle filter for LiDAR-based localization is efficiently deployed as FPGA logic, while the rest of the compute pipeline executes on programmable cores. This design constitutes the heart of a fully-functional 1/10th-scale racing autonomous car. In our design, accelerated IPs are controlled locally to the FPGA via a proxy core. Communication between the two and with the host CPU happens via shared memory banks also implemented as FPGA IPs. This allows for a scalable and easy-to-deploy solution both from the hardware and software viewpoint, while providing better performance and energy efficiency compared to state-of-the-art solutions.
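As background, the particle filter being accelerated repeats a predict/weight/resample cycle. A toy 1-D version (illustrative only; the paper's filter is 2-D, LiDAR-based, and implemented as FPGA logic, and the noise model here is hypothetical):

```python
import random, math

def particle_filter_step(particles, control, measurement, noise=0.2, seed=0):
    """One predict/weight/resample cycle of a toy 1-D particle filter."""
    rng = random.Random(seed)
    # Predict: propagate each particle through the motion model with noise.
    moved = [p + control + rng.gauss(0, noise) for p in particles]
    # Weight: likelihood of the measurement given each particle's position.
    weights = [math.exp(-((p - measurement) ** 2) / (2 * noise ** 2))
               for p in moved]
    total = sum(weights) or 1.0
    weights = [w / total for w in weights]
    # Resample: draw a new population proportionally to the weights.
    return rng.choices(moved, weights=weights, k=len(particles))
```

The weighting step is the compute-heavy part (in the paper it evaluates LiDAR scans against the map), which is why it is the piece deployed as FPGA logic.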
15:48 CET 18.4.3 ENABLING FAST DEEP LEARNING ON TINY ENERGY-HARVESTING IOT DEVICES
Speaker:
Sahidul Islam, University of Texas at San Antonio, US
Authors:
Sahidul Islam1, Jieren Deng2, Shanglin Zhou2, Chen Pan3, Caiwen Ding2 and Mimi Xie1
1University of Texas at San Antonio, US; 2University of Connecticut, US; 3Texas A&M University-Corpus Christi, US
Abstract
Energy harvesting (EH) IoT devices that operate intermittently without batteries, coupled with advances in deep neural networks (DNNs), have opened up new opportunities for sustainable smart applications. Nevertheless, implementing these computation- and memory-intensive intelligent algorithms on EH devices is extremely difficult due to limited resources and an intermittent power supply that causes frequent failures. To address these challenges, this paper proposes a methodology that enables fast deep learning with low-energy accelerators for tiny energy harvesting devices. We first propose RAD, a resource-aware structured DNN training framework, which employs block circulant matrices and structured pruning to achieve high compression and leverage the advantages of various vector operation accelerators. A DNN implementation method, ACE, is then proposed that employs low-energy accelerators to achieve maximum performance with small energy consumption. Finally, we design FLEX, the system support for intermittent computation in energy harvesting situations. Experimental results on three different DNN models demonstrate that RAD, ACE, and FLEX enable fast and correct inference on energy harvesting devices with up to 4.26X runtime reduction and up to 7.7X energy reduction, with higher accuracy than the state-of-the-art.
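The block-circulant compression used by RAD-style training stores each weight block as a single defining row. A minimal sketch of a circulant matrix-vector product, to show where the O(n) storage comes from (toy code illustrating the structure, not the authors' training framework, which would also use FFTs for speed):

```python
def circulant_matvec(first_row, x):
    """Compute y = C x, where C is the circulant matrix whose first row is
    first_row and whose every later row is the previous row rotated right.

    Only first_row is stored: n values instead of the n*n of a dense block,
    which is the compression block-circulant DNN layers exploit.
    """
    n = len(first_row)
    # C[i][j] = first_row[(j - i) mod n]
    return [sum(first_row[(j - i) % n] * x[j] for j in range(n))
            for i in range(n)]
```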
15:52 CET 18.4.4 EMULATION OF NON-VOLATILE DIGITAL LOGIC FOR BATTERYLESS INTERMITTENT COMPUTING
Speaker:
Simone Ruffini, University of Trento, IT
Authors:
Simone Ruffini, Kasim Sinan Yildirim and Davide Brunelli, University of Trento, IT
Abstract
Recent engineering efforts gave rise to devices that operate solely by harvesting power from ambient energy sources, such as radio frequency and solar energy. Because ambient energy is sporadic, frequent power failures are inevitable for devices that rely only on energy harvesting, and they lose the values maintained in volatile hardware state elements upon a power failure. This situation leads to intermittent execution, which prevents the forward progress of computation. To survive power failures, these devices require non-volatile memory elements, e.g., FRAM, to store the computational state. However, with the FPGAs on the market and current hardware description languages, designers can only represent volatile state elements, and there is currently no solution for fast-prototyping non-volatile digital logic. This paper enables FPGA-based emulation of any custom non-volatile digital logic for intermittent computing. Our proposal can therefore become a standard part of the FPGA libraries provided by vendors to design and validate future non-volatile logic designs targeting intermittent computing.
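The behavior a non-volatile flip-flop must emulate can be modeled in a few lines: checkpoint the volatile bit on a power-failure event and restore it on power-up. This is a software model for illustration only, not the paper's FPGA emulation library.

```python
class NonVolatileFlipFlop:
    """Behavioral model of a non-volatile flip-flop for intermittent computing."""
    def __init__(self):
        self.q = 0        # volatile output, lost on power failure
        self.backup = 0   # models the FRAM backup cell

    def clock(self, d):
        self.q = d        # normal synchronous capture

    def power_failure(self):
        self.backup = self.q   # checkpoint state into non-volatile storage
        self.q = 0             # volatile state is lost when power drops

    def power_up(self):
        self.q = self.backup   # restore, enabling forward progress
```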
15:56 CET 18.4.5 Q&A SESSION
Authors:
Domenico Balsamo1 and Bart Vermeulen2
1Newcastle University, GB; 2NXP Semiconductors, NL
Abstract
Questions and answers with the authors

18.5 Circuit Optimization and Analysis: No Time to Lose

Date: Tuesday, 22 March 2022
Time: 15:40 - 16:30 CET

Session chair:
Eleonora Testa, Synopsys Inc., CH

Session co-chair:
Ibrahim Elfadel, Khalifa University, AE

This session presents papers that focus on timing optimization in both logic and physical synthesis, as well as a new approach to flip-chip routing. The first paper proposes an algorithm that fixes minimum implant area violations in a timing-aware fashion without displacing cells, resolving the violations solely through threshold-voltage reassignment. The second paper proposes a momentum-based timing-driven global placement algorithm, and the third paper describes an efficient way to parallelize dynamic timing computation on the critical path. The last paper presents a substrate routing algorithm using a novel ring routing model that handles symmetry and shielding constraints.

Time Label Presentation Title
Authors
15:40 CET 18.5.1 A SYSTEMATIC REMOVAL OF MINIMUM IMPLANT AREA VIOLATIONS UNDER TIMING CONSTRAINT
Speaker:
Eunsol Jeong, Seoul National University, KR
Authors:
Eunsol Jeong, Heechun Park and Taewhan Kim, Seoul National University, KR
Abstract
Fixing minimum implant area (MIA) violations in a post-route layout is an essential and inevitable task for high-performance designs employing multiple threshold voltages. Unlike conventional approaches, which locally move cells or reassign the Vt (threshold voltage) of some cells to resolve MIA violations with little or no consideration of timing constraints, our proposed approach fully and systematically controls the timing budget during the removal of MIA violations. Precisely, our solution consists of three sequential steps: (1) performing critical-path-aware cell selection for Vt reassignment to fix the intra-row MIA violations while considering the timing constraint and minimal power increments; (2) performing a theoretically optimal Vt reassignment to fix the inter-row MIA violations while satisfying both the intra-row MIA and timing constraints; (3) refining the Vt reassignment to further reduce power consumption while meeting the intra- and inter-row MIA constraints as well as the timing constraint. Experiments on benchmark circuits show that our approach completely resolves MIA violations while ensuring no timing violations and achieving much smaller power increments than conventional approaches.
15:44 CET 18.5.2 DREAMPLACE 4.0: TIMING-DRIVEN GLOBAL PLACEMENT WITH MOMENTUM-BASED NET WEIGHTING
Speaker:
Peiyu Liao, The Chinese University of Hong Kong, HK
Authors:
Peiyu Liao1, Siting Liu1, Zhitang Chen2, Wenlong Lv3, Yibo Lin4 and Bei Yu1
1The Chinese University of Hong Kong, HK; 2Huawei Noah's Ark Lab, HK; 3Huawei Noah's Ark Lab, CN; 4Peking University, CN
Abstract
Timing optimization is critical to integrated circuit (IC) design closure. Existing global placement algorithms mostly focus on wirelength optimization without considering timing. In this paper, we propose a timing-driven global placement algorithm leveraging a momentum-based net weighting strategy. In addition, we improve the preconditioner to incorporate our net weighting scheme. Experimental results on ICCAD 2015 contest benchmarks demonstrate that our algorithm can significantly improve total negative slack (TNS) while also improving worst negative slack (WNS).
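A momentum-style net-weighting update can be sketched as follows: instead of jumping each net's weight to a new target every iteration, the weight is an exponential moving average, so timing-critical nets accumulate weight smoothly across placement iterations. The coefficients and the weighting target below are hypothetical; the paper's exact scheme differs in detail.

```python
def update_net_weights(weights, slacks, momentum=0.9, alpha=0.5):
    """One momentum-based net-weight update (illustrative sketch).

    weights: dict net -> current weight
    slacks:  dict net -> worst slack of the net (negative = timing-critical)
    """
    new = {}
    for net, w in weights.items():
        # Hypothetical target: base weight 1.0, boosted by negative slack.
        target = 1.0 + alpha * max(0.0, -slacks[net])
        # Momentum smooths the trajectory instead of jumping to the target.
        new[net] = momentum * w + (1 - momentum) * target
    return new
```

Nets with positive slack relax back toward the base weight, while critical nets ratchet upward gradually, which keeps the wirelength objective from being destabilized between iterations.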
15:48 CET 18.5.3 EVENTTIMER: FAST AND ACCURATE EVENT-BASED DYNAMIC TIMING ANALYSIS
Speaker:
Zuodong Zhang, Institution of Microelectronics, Peking University, CN
Authors:
Zuodong Zhang, Zizheng Guo, Yibo Lin, Runsheng Wang and Ru Huang, Peking University, CN
Abstract
As transistors shrink to the nanoscale, the overhead of ensuring circuit functionality becomes extremely large due to increasing timing variations. Thus, better-than-worst-case (BTWC) design has attracted more and more attention. Many of these techniques utilize dynamic timing slack (DTS) and activity information for design optimization and runtime tuning. Existing DTS computation methods are essentially a modification of the worst-case delay information, which cannot guarantee exact DTS and activity simulation, causing performance degradation in timing optimization. Therefore, in this paper, we propose EventTimer, a dynamic timing analysis engine based on event propagation that accurately computes DTS and activity information. We evaluate its accuracy and efficiency on different benchmark circuits. The experimental results show that EventTimer achieves exact DTS computation with high efficiency, and that it scales well with circuit size and the number of CPU threads, making it suitable for application-level analysis.
15:52 CET 18.5.4 PRACTICAL SUBSTRATE DESIGN CONSIDERING SYMMETRICAL AND SHIELDING ROUTES
Speaker:
Hung-Ming Chen, National Yang Ming Chiao Tung University, TW
Authors:
Hao-Yu Chi1, Yi-Hung Chen2, Hung-Ming Chen1, Chien-Nan Liu3, Yun-Chih Kuo2, Ya-Hsin Chang2 and Kuan-Hsien Ho2
1National Yang Ming Chiao Tung University, TW; 2Mediatek Inc. Taiwan, TW; 3National Yang Ming Chiao Tung University, TW
Abstract
In modern package design, the flip-chip package has become mainstream because of its high I/O pin count. However, package design is still done manually in the industry, and the lack of automation tools lengthens the design cycle due to complex routing constraints and frequent modification requests. In this work, we propose yet another routing framework for substrate routing. Compared with previous works, our routing algorithm generates a feasible routing solution in a few seconds for an industrial design and considers important symmetry and shielding constraints that have not been handled before. Benefiting from the efficiency of our routing algorithm, the designer can obtain the result immediately and accommodate modifications to reduce cost. The experimental results show that the routing result generated by our router is of good quality, very close to the manual design.
15:56 CET 18.5.5 Q&A SESSION
Authors:
Eleonora Testa1 and Ibrahim (Abe) Elfadel2
1Synopsys Inc., CH; 2Khalifa University, AE
Abstract
Questions and answers with the authors

18.6 Multi-Partner Projects – Session 2

Date: Tuesday, 22 March 2022
Time: 15:40 - 16:30 CET

Session chair:
Ernesto Sanchez, Politecnico di Torino, IT

Session co-chair:
Maksim Jenihhin, Tallinn UT, EE

The session is dedicated to multi-partner innovative and high-tech research projects addressing the DATE 2022 topics. The types of collaboration covered are projects funded by EU schemes (H2020, ESA, EIC, MSCA, COST, etc.), nationally- and regionally-funded projects, and collaborative research projects funded by industry. Depending on the stage of the project, the papers present the novelty of the project concepts, the relevance of the technical objectives to the DATE community, technical highlights of the project results, and insights into the lessons learnt in the project or issues still open until its end. In particular, this session focuses on projects tackling the challenges of artificial intelligence and deep learning and the integration of hardware and software layers, and also presents a cross-sectoral collaboration for a graduate school project.

Time Label Presentation Title
Authors
15:40 CET 18.6.1 NEUROTEC I: NEURO-INSPIRED ARTIFICIAL INTELLIGENCE TECHNOLOGIES FOR THE ELECTRONICS OF THE FUTURE
Speaker:
Christopher Bengel, Institute of Materials in Electrical Engineering II, RWTH Aachen University, DE
Authors:
Melvin Galicia1, Stephan Menzel2, Farhad Merchant1, Maximilian Müller3, Hsin-Yu Chen2, Qing-Tai Zhao4, Felix Cüppers2, Abdur R. Jalil4, Qi Shu5, Peter Schüffelgen4, Gregor Mussler4, Carsten Funck6, Christian Lanius7, Stefan Wiefels2, Moritz von Witzleben6, Christopher Bengel6, Nils Kopperberg6, Tobias Ziegler6, Rana Ahmad2, Alexander Krüger2, Leticia Pöhls7, Regina Dittmann2, Susanne Hoffmann-Eifert2, Vikas Rana2, Detlev Grützmacher4, Matthias Wuttig3, Dirk Wouters6, Andrei Vescan8, Tobias Gemmeke7, Joachim Knoch9, Max Lemme10, Rainer Leupers1 and Rainer Waser6
1Institute for Communication Technologies and Embedded Systems, RWTH Aachen University, DE; 2Peter-Grünberg Institut-7, Forschungszentrum Jülich GmbH, DE; 3Institute of Physics, Physics of Novel Materials, RWTH Aachen University, DE; 4Peter-Grünberg Institut-9, Forschungszentrum Jülich GmbH, DE; 5Peter-Grünberg Institut-10, Forschungszentrum Jülich GmbH, DE; 6Institute of Materials in Electrical Engineering II, RWTH Aachen University, DE; 7Institute of Integrated Digital Systems and Circuit Design, RWTH Aachen University, DE; 8Compound Semiconductor Technology, RWTH Aachen University, DE; 9Institute of Semiconductor Electronics, RWTH Aachen University, DE; 10Chair of Electronic Devices, RWTH Aachen University, DE
Abstract
The field of neuromorphic computing is approaching an era of rapid adoption, driven by the urgent need for a substitute for the von Neumann computing architecture. The NEUROTEC I project, "Neuro-inspired Artificial Intelligence Technologies for the Electronics of the Future", is an initiative sponsored by the German Federal Ministry of Education and Research (BMBF) that aims to advance the foundations for the utilization and exploitation of neuromorphic computing. NEUROTEC I is now at its successful final stage, driven by collaboration among more than eight institutes of the Jülich Research Center and RWTH Aachen University, as well as several high-tech industry partners. The project considers the interplay among materials, circuits, design and simulation tools. This paper provides an overview of the project's overall structure and discusses the scientific achievements of its individual activities.
15:44 CET 18.6.2 VEDLIOT: VERY EFFICIENT DEEP LEARNING IN IOT
Speaker:
Jens Hagemeyer, Bielefeld University, DE
Authors:
Martin Kaiser1, Rene Griessl1, Nils Kucza1, Carola Haumann1, Lennart Tigges1, Kevin Mika1, Jens Hagemeyer1, Florian Porrmann1, Ulrich Rückert1, Micha vor dem Berge2, Stefan Krupop3, Mario Porrmann4, Marco Tassemeier4, Pedro Trancoso5, Fareed Qararyah5, Stavroula Zouzoula5, Antonio Casimiro6, Alysson Bessani6, José Cecilio6, Stefan Andersson7, Oliver Brunnegard7, Olof Eriksson7, Roland Weiss8, Franz Meierhöfer8, Hans Salomonsson9, Elaheh Malekzadeh9, Daniel Ödman9, Anum Khurshid10, Pascal Felber11, Marcelo Pasin11, Valerio Schiavoni11, James Menetrey11, Karol Gugula12, Piotr Zierhoffer12, Eric Knauss13 and Hans-Martin Heyn13
1Bielefeld University, DE; 2Christmann Informationstechnik, DE; 3Christmann Informationstechnik, DE; 4Osnabrück University, DE; 5Chalmers University of Technology, SE; 6University of Lisbon, PT; 7VEONEER Inc., SE; 8Siemens AG, DE; 9EMBEDL AB, SE; 10Research Institutes of Sweden AB (RISE), SE; 11University of Neuchatel, CH; 12Antmicro, PL; 13Göteborg University, SE
Abstract
The VEDLIoT project targets the development of energy-efficient Deep Learning for distributed AIoT applications. A holistic approach is used to optimize algorithms while also dealing with safety and security challenges. The approach is based on a modular and scalable cognitive IoT hardware platform. Using modular microserver technology enables the user to configure the hardware to satisfy a wide range of applications. VEDLIoT offers a complete design flow for Next-Generation IoT devices required for collaboratively solving complex Deep Learning applications across distributed systems. The methods are tested on various use-cases ranging from Smart Home to Automotive and Industrial IoT appliances. VEDLIoT is an H2020 EU project which started in November 2020. It is currently in an intermediate stage with the first results available.
15:48 CET 18.6.3 INTELLIGENT METHODS FOR TEST AND RELIABILITY
Speaker:
Hussam Amrouch, University of Stuttgart, DE
Authors:
Hussam Amrouch1, Jens Anders1, Steffen Becker1, Maik Betka1, Gerd Bleher2, Peter Domanski1, Nourhan Elhamawy1, Thomas Ertl1, Athanasios Gatzastras1, Paul R. Genssler1, Sebastian Hasler1, Martin Heinrich2, Andre van Hoorn1, Hanieh Jafarzadeh1, Ingmar Kallfass1, Florian Klemme1, Steffen Koch1, Ralf Küsters1, Andrés Lalama1, Raphael Latty2, Yiwen Liao1, Natalia Lylina1, Zahra Paria Najafi-Haghi1, Dirk Pflüger1, Ilia Polian1, Jochen Rivoir2, Matthias Sauer2, Denis Schwachhofer1, Steffen Templin2, Christian Volmer2, Stefan Wagner1, Daniel Weiskopf1, Hans-Joachim Wunderlich1, Bin Yang1 and Martin Zimmermann2
1University of Stuttgart, DE; 2Advantest Corporation, DE
Abstract
Test methods that can keep up with the ongoing increase in complexity of semiconductor products and their underlying technologies are an essential prerequisite for maintaining quality and safety in our daily lives and for the continued success of our economies and societies. There is huge potential for test methods to benefit from recent breakthroughs in domains such as artificial intelligence, data analytics, virtual/augmented reality, and security. The Graduate School on "Intelligent Methods for Semiconductor Test and Reliability" (GS-IMTR) at the University of Stuttgart is a large-scale, radically interdisciplinary effort to address the scientific-technological challenges in this domain. It is funded by Advantest, one of the world leaders in automatic test equipment. In this paper, we describe the overall philosophy of the Graduate School and the specific scientific questions targeted by its ten projects.
15:52 CET 18.6.4 EVOLVE: TOWARDS CONVERGING BIG-DATA, HIGH-PERFORMANCE AND CLOUD-COMPUTING WORLDS
Speaker:
Achilleas Tzenetopoulos, National TU Athens, GR
Authors:
Achilleas Tzenetopoulos1, Dimosthenis Masouros1, Konstantina Koliogeorgi1, Sotirios Xydis2, Dimitrios Soudris1, Antony Chazapis3, Christos Kozanitis3, Angelos Bilas4, Christian Pinto5, Huy-Nam Nguyen6, Stelios Louloudakis7, Georgios Gardikis8, George Vamvakas8, Michelle Aubrun9, Christy Symeonidou10, Vassilis Spitadakis10, Konstantinos Xylogiannopoulos11, Bernhard Peischl11, Tahir Kalayci12, Alexander Stocker12 and Jean-Thomas Acquaviva13
1National TU Athens, GR; 2Harokopio University of Athens, GR; 3Institute of Computer Science, FORTH, GR; 4FORTH and University of Crete, GR; 5IBM Research, IE; 6Atos/BULL, FR; 7Sunlight.io, GR; 8Space Hellas S.A., GR; 9Thales Alenia Space, FR; 10Neurocom, LU; 11AVL List GmbH, AT; 12Virtual Vehicle Research GmbH, AT; 13DataDirect Networks, FR
Abstract
EVOLVE is a pan-European Innovation Action that aims to fully integrate High-Performance-Computing (HPC) hardware with state-of-the-art software technologies under a unique testbed that enables the convergence of the HPC, Cloud, and Big-Data worlds and increases our ability to extract value from massive and demanding datasets. EVOLVE's advanced compute platform combines HPC-enabled capabilities with transparent deployment at a high abstraction level and a versatile Big-Data processing stack for end-to-end workflows. Hence, domain experts can substantially improve the efficiency of existing services or introduce new models in their respective domains, e.g., automotive services, bus transportation, maritime surveillance, and others. In this paper, we describe EVOLVE's testbed and evaluate the performance of the integrated pilots from different domains.
15:56 CET 18.6.5 SDK4ED: ONE-CLICK PLATFORM FOR ENERGY-AWARE, MAINTAINABLE AND DEPENDABLE APPLICATIONS
Speaker:
Charalampos Marantos, National TU Athens, GR
Authors:
Charalampos Marantos1, Miltiadis Siavvas2, Dimitrios Tsoukalas2, Christos Lamprakos3, Lazaros Papadopoulos1, Paweł Boryszko4, Katarzyna Filus4, Joanna Domańska4, Apostolos Ampatzoglou5, Alexander Chatzigeorgiou6, Erol Gelenbe4, Dionysios Kehagias2 and Dimitrios Soudris1
1National TU Athens, GR; 2Centre for Research and Technology Hellas, Thessaloniki, GR; 3School of ECE, National TU Athens, GR; 4Institute of Theoretical & Applied Computer Science, IITIS-PAN, Gliwice, PL; 5University of Macedonia, GR; 6Department of Applied Informatics, University of Macedonia, GR
Abstract
Developing modern secure and low-energy applications in a short time imposes new challenges and creates the need for new software tools that assist developers in all phases of application development. Designing such tools is not a trivial task, as they must be able to optimize for multiple quality requirements. In this paper, we introduce the SDK4ED platform, which incorporates advanced methods and tools for measuring and optimizing maintainability, dependability, and energy. The presented solution offers a complete tool-flow that provides quality indicators and optimization methods, with an emphasis on embedded software. Effective forecasting models and decision-making solutions are also implemented to improve software quality while respecting the constraints imposed by maintenance standards, energy consumption limits, and security vulnerabilities. The use of the SDK4ED platform is demonstrated on a healthcare embedded application.
16:00 CET 18.6.6 Q&A SESSION
Authors:
Ernesto Sanchez1 and Maksim Jenihhin2
1Politecnico di Torino, IT; 2Tallinn University of Technology, EE
Abstract
Questions and answers with the authors

19.1 Hardware security primitives and attacks

Date: Tuesday, 22 March 2022
Time: 16:40 - 17:20 CET

Session chair:
Johanna Sepulveda, Airbus Defense and Space, DE

Session co-chair:
Jorge Guajardo, Bosch, US

The first three papers in this session discuss hardware security attacks. The first paper is on a novel verification methodology to verify the DPA security of masked hardware circuits. The second paper discusses a mechanism to activate capacitive triggers for Hardware Trojans. The third paper presents an attack based on the voltage-drop effect in an SoC composed of an FPGA and a CPU. The last paper in the session is on physically unclonable functions. More precisely, the paper proposes three new evaluation methods aimed at higher-order alphabet PUFs.

Time Label Presentation Title
Authors
16:40 CET 19.1.1 (Best Paper Award Candidate)
ADD-BASED SPECTRAL ANALYSIS OF PROBING SECURITY
Speaker:
Maria Chiara Molteni, Universita' degli Studi di Milano, IT
Authors:
Maria Chiara Molteni1, Vittorio Zaccaria2 and Valentina Ciriani1
1Universita' degli Studi di Milano, IT; 2Politecnico di Milano, IT
Abstract
In this paper, we introduce a novel exact verification methodology for non-interference properties of cryptographic circuits. The methodology exploits the Algebraic Decision Diagram representation of the Walsh spectrum to overcome the potential slowdown associated with exact verification against non-interference constraints. Benchmarked against a standard set of use cases, the methodology speeds up the median verification time by 1.88x over existing state-of-the-art tools for exact verification.
16:44 CET 19.1.2 GUARANTEED ACTIVATION OF CAPACITIVE TROJAN TRIGGERS DURING POST PRODUCTION TEST VIA SUPPLY PULSING
Speaker:
Sule Ozev, ASU, US
Authors:
Bora Bilgic and Sule Ozev, ASU, US
Abstract
Involvement of many parties in the production of ICs makes the process more vulnerable to tampering. Consequently, IC security has become an important challenge to tackle. One of the threat models in the hardware security domain is the insertion of unwanted and malicious hardware components, known as Hardware Trojans. A malicious attacker can insert a small modification into the functional circuit that can cause havoc in the field. To make the Trojan circuit stealthy, trigger circuits are typically used, which not only hide the Trojan activity during post-production testing but also randomize activation conditions, making the Trojan very difficult to diagnose even after failures. Trigger mechanisms for Trojans typically delay and randomize the outcome based on a subset of internal digital signals. While there are many different ways of implementing trigger mechanisms, charge-based mechanisms have gained popularity due to their small size. In this paper, we propose a scheme to ensure that trigger mechanisms are activated during production testing even if the conditions specified by the malicious attacker are not met. By disabling the mechanism that keeps the Trojan stealthy, any parametric technique can be used to detect potential Trojans at production time. The proposed technique relies on supply pulsing, where we generate a potential difference between the non-active input and output of any digital gate regardless of the signal pattern to which the trigger mechanism is tied. SPICE simulations show that our method works well even for the smallest Trojan trigger mechanisms.
16:48 CET 19.1.3 FPGA-TO-CPU UNDERVOLTING ATTACKS
Speaker:
Dina Mahmoud, EPFL, CH
Authors:
Dina Mahmoud1, Samah Hussein1, Vincent Lenders2 and Mirjana Stojilovic1
1EPFL, CH; 2Armasuisse, CH
Abstract
FPGAs are proving useful and attractive for many applications, thanks to their hardware reconfigurability, low power, and high-degree of parallelism. As a result, modern embedded systems are often based on systems-on-chip (SoCs), where CPUs and FPGAs share the same die. In this paper, we demonstrate the first undervolting attack in which the FPGA acts as an aggressor while the CPU, residing on the same SoC, is the victim. We show that an adversary can use the FPGA fabric to create a significant supply voltage drop which, in turn, faults the software computation performed by the CPU. Additionally, we show that an attacker can, with an even higher success rate, execute a denial-of-service attack, without any modification of the underlying hardware or the power distribution network. Our work exposes a new electrical-level attack surface, created by tight integration of CPUs and FPGAs in modern SoCs, and incites future research on countermeasures.
16:52 CET 19.1.4 BEWARE OF THE BIAS - STATISTICAL PERFORMANCE EVALUATION OF HIGHER-ORDER ALPHABET PUFS
Speaker:
Christoph Frisch, TU Munich, DE
Authors:
Christoph Frisch and Michael Pehl, TU Munich, DE
Abstract
Physical Unclonable Functions (PUFs) derive unpredictable and device-specific responses from uncontrollable manufacturing variations. While most PUFs provide only one response bit per PUF cell, deriving more bits, such as a symbol from a higher-order alphabet, would make PUF designs more efficient. Such PUFs are thus suggested for some applications and are the subject of current research. However, only a few methods are available to analyze the statistical performance of such higher-order alphabet PUFs. This work therefore introduces various novel schemes. Unlike previous works, the new approaches involve statistical hypothesis testing. This facilitates more refined and statistically significant statements about bias effects in the PUF. We use real-world PUF data to illustrate the capabilities of the tests. In comparison to state-of-the-art approaches, our methods indeed capture more aspects of bias. Overall, this work is a step towards improved quality control of higher-order alphabet PUFs.
16:56 CET 19.1.5 Q&A SESSION
Authors:
Johanna Sepúlveda1 and Jorge Guajardo2
1Airbus Defence and Space, DE; 2Bosch Research and Technology Center, Robert Bosch LLC, US
Abstract
Questions and answers with the authors

19.2 Hardware components and architectures for Machine Learning

Date: Tuesday, 22 March 2022
Time: 16:40 - 17:20 CET

Session chair:
Charles Mackin, IBM, US

Session co-chair:
Mladen Berekovic, University of Lübeck, DE

This session is dedicated to new advances in hardware components and architectures for ML. The first paper focuses on improving the energy consumption, latency, and application throughput of neuromorphic implementations; the second proposes a near hybrid memory accelerator integrated close to the DRAM to improve inference; the third presents a novel tensor processor with superior PPA metrics compared to the state of the art; the fourth presents a new mixed-signal architecture for implementing Quantized Neural Networks (QNNs) using flash transistors. Two IP papers complete the session: the first presents a hybrid RRAM-SRAM system for Deep Neural Networks, while the second is the first work to deploy a large neural network on FPGA-based neuromorphic hardware.

Time Label Presentation Title
Authors
16:40 CET 19.2.1 DESIGN OF MANY-CORE BIG LITTLE μBRAINS FOR ENERGY-EFFICIENT EMBEDDED NEUROMORPHIC COMPUTING
Speaker:
Lakshmi Varshika Mirtinti, Drexel University, US
Authors:
M. Lakshmi Varshika1, Adarsha Balaji1, Federico Corradi2, Anup Das1, Jan Stuijt2 and Francky Catthoor3
1Drexel University, US; 2Imec, NL; 3Imec, BE
Abstract
As spiking-based deep learning inference applications are increasing in embedded systems, these systems tend to integrate neuromorphic accelerators such as μBrain to improve energy efficiency. We propose a μBrain-based scalable many-core neuromorphic hardware design to accelerate the computations of spiking deep convolutional neural networks (SDCNNs). To increase energy efficiency, cores are designed to be heterogeneous in terms of their neuron and synapse capacity (i.e., big vs. little cores), and they are interconnected using a parallel segmented bus interconnect, which leads to lower latency and energy compared to a traditional mesh-based Network-on-Chip (NoC). We propose a system software framework called SentryOS to map SDCNN inference applications to the proposed design. SentryOS consists of a compiler and a run-time manager. The compiler compiles an SDCNN application into sub-networks by exploiting the internal architecture of big and little μBrain cores. The run-time manager schedules these sub-networks onto cores and pipelines their execution to improve throughput. We evaluate the proposed big little many-core neuromorphic design and the system software framework with five commonly used SDCNN inference applications and show that the proposed solution reduces energy (between 37% and 98%), reduces latency (between 9% and 25%), and increases application throughput (between 20% and 36%). We also show that SentryOS can be easily extended to other spiking neuromorphic accelerators such as Loihi and DYNAPs.
16:44 CET 19.2.2 HYDRA: A NEAR HYBRID MEMORY ACCELERATOR FOR CNN INFERENCE
Speaker:
Palash Das, Indian Institute of Technology, Guwahati, IN
Authors:
Palash Das1, Ajay Joshi2 and Hemangee Kapoor1
1Indian Institute of Technology, Guwahati, IN; 2Boston University, US
Abstract
Convolutional neural network (CNN) accelerators often suffer from limited off-chip memory bandwidth and on-chip capacity constraints. One solution to this problem is near-memory or in-memory processing. Non-volatile memory, such as phase-change memory (PCM), has emerged as a promising DRAM alternative. It is also used in combination with DRAM, forming a hybrid memory. Though near-memory processing (NMP) has been used to accelerate CNN inference, the feasibility and efficacy of NMP remain unexplored for hybrid main memory systems. Additionally, PCMs are known to have low write endurance, and therefore the tremendous number of writes generated by accelerators can drastically shorten the lifetime of the PCM memory. In this work, we propose Hydra, a near hybrid memory accelerator integrated close to the DRAM to execute inference. The PCM banks store the models, which are only read by the memory controller during inference. Throughout the entire forward propagation (inference), intermediate writes from Hydra go entirely to the DRAM, eliminating PCM writes and enhancing PCM lifetime. Unlike other in-DRAM processing-based works, Hydra does not eliminate any multiplication operations by using binary or ternary neural networks, making it more suitable where high accuracy is required. We also exploit inter- and intra-chip (DRAM chip) parallelism to improve the system's performance. On average, Hydra achieves around 20x performance improvements over in-DRAM processing-based state-of-the-art works while accelerating CNN inference.
16:48 CET 19.2.3 TCX: A PROGRAMMABLE TENSOR PROCESSOR
Speaker:
Tailin Liang, University of Science and Technology Beijing, CN
Authors:
Tailin Liang1, Lei Wang1, Shaobo Shi2, John Glossner1 and Xiaotong Zhang1
1University of Science and Technology Beijing, CN; 2Hua Xia General Processor Technologies, CN
Abstract
Neural network processors and accelerators are domain-specific architectures deployed to meet the high computational requirements of deep learning algorithms. This paper proposes a new instruction set extension for tensor computing, TCX, with RISC-style instructions and variable-length tensor extensions. It features a multi-dimensional register file, dimension registers, and fully generic tensor instructions. It can be seamlessly integrated into existing RISC ISAs and provides software compatibility for scalable hardware implementations. We present an implementation of the TCX tensor computing accelerator using an out-of-order microarchitecture. The tensor accelerator is scalable from several hundred to tens of thousands of computation units. An optimized register renaming mechanism is described that allows many physical tensor registers without requiring architectural support for large tensor register names. We describe new tensor load and store instructions that reduce bandwidth requirements based on tensor dimensions. Implementations may balance data bandwidth and computation utilization for different types of tensor computations such as element-wise, depth-wise, and matrix multiplication. We characterize the computation precision of tensor operations to balance area, generality, and accuracy loss for several well-known neural networks. The TCX processor runs at 1 GHz and sustains 8.2 tera operations per second using a 4096 multiply-accumulate compute unit with up to 98.83% MAC utilization. It occupies 12.8 square millimeters and dissipates 0.46 Watts per TOP in TSMC 28nm technology.
16:52 CET 19.2.4 A FLASH-BASED CURRENT-MODE IC TO REALIZE QUANTIZED NEURAL NETWORKS
Speaker:
Kyler Scott, Texas A&M University, US
Authors:
Kyler Scott1, Cheng-Yen Lee1, Sunil Khatri1 and Sarma Vrudhula2
1Texas A&M University, US; 2Arizona State University, US
Abstract
This paper presents a mixed-signal architecture for implementing Quantized Neural Networks (QNNs) using flash transistors to achieve extremely high throughput with extremely low power, energy, and memory requirements. Its low resource consumption makes our design especially suited for use in edge devices. The network weights are stored in-memory using flash transistors, and nodes perform operations in the analog current domain. Our design can be programmed with any QNN whose hyperparameters (the number of layers, number of filters, filter sizes, etc.) do not exceed the maximum provisioned. Once the flash devices are programmed with a trained model and the IC is given an input, our architecture performs inference with zero accesses to off-chip memory. We demonstrate the robustness of our design under current-mode non-linearities arising from process and voltage variations. We test validation accuracy on the ImageNet dataset and show that our IC suffers only 0.6% and 1.0% reductions in classification accuracy for Top-1 and Top-5 outputs, respectively. Our implementation results in a ~50x reduction in latency and energy when compared to a recently published mixed-signal ASIC implementation, with similar power characteristics. Our approach provides layer partitioning and node sharing possibilities, which allow us to trade off latency, power, and area amongst each other.
16:56 CET 19.2.5 Q&A SESSION
Authors:
Charles Mackin1 and Mladen Berekovic2
1IBM, US; 2University of Lübeck, DE
Abstract
Questions and answers with the authors

19.3 NoC optimization with emerging technologies

Date: Tuesday, 22 March 2022
Time: 16:40 - 17:20 CET

Session chair:
Romain Lemaire, CEA, FR

Session co-chair:
Sebastien Le Beux, Concordia University, CA

Networks-on-chip and, more generally, on-chip communication architectures must be constantly improved to address new application constraints, but also to take advantage of innovations in system-integration technologies. This session proposes various approaches illustrating these topics. First, at design time, a framework is proposed to estimate NoC performance using a graph neural network. Then, at execution time, two adaptive routing algorithms are detailed: one based on an optimized credit flow control between routers and the other targeting 2.5D topologies in the presence of faulty links. Finally, in a prospective way, phase-change material is considered to build a complete optical NoC system. By optimizing established approaches and introducing emerging technologies, NoCs clearly remain on an innovative path.

Time Label Presentation Title
Authors
16:40 CET 19.3.1 NOCEPTION: A FAST PPA PREDICTION FRAMEWORK FOR NETWORK-ON-CHIPS USING GRAPH NEURAL NETWORK
Speaker:
Fuping Li, Institute of Computing Technology, CN
Authors:
Fuping Li, Ying Wang, Cheng Liu, Huawei Li and Xiaowei Li, Institute of Computing Technology, Chinese Academy of Sciences, CN
Abstract
Networks-on-chip (NoCs) have been viewed as a promising alternative to traditional on-chip communication architectures for the increasing number of IPs in modern chips. To support the vast design-space exploration of application-specific NoC characteristics with arbitrary topologies, in this paper we propose a fast estimation framework that predicts the power, performance, and area (PPA) of NoCs based on graph neural networks (GNNs). We present a general way of modeling the application and the NoC with user-defined parameters as an attributed graph, which can be learned by the GNN model. Experimental results show that, on unseen realistic applications, the proposed method achieves an accuracy of 97.36% on power estimation and 97.83% on area estimation, and improves the accuracy of the network-level and system-level performance predictors over the topology-constrained baseline method by 6.52% and 4.73%, respectively.
16:44 CET 19.3.2 (Best Paper Award Candidate)
AN EASY-TO-IMPLEMENT AND EFFICIENT FLOW CONTROL FOR DEADLOCK-FREE ADAPTIVE ROUTING
Speaker:
Yi Dai, National University of Defense Technology, CN
Authors:
Yi Dai, Kai Lu, Sheng Ma and Junsheng Chang, National University of Defense Technology, CN
Abstract
Deadlock-free adaptive routing is extensively adopted in interconnection networks to improve communication bandwidth and reduce latency. However, existing deadlock-free flow control schemes either underutilize memory resources due to inefficient buffer management for simple hardware implementations, or rely on complicated coordination and synchronization mechanisms with high hardware complexity. In this work, we solve the deadlock problem from a different perspective by considering deadlock as a lack of credit. With minor modifications to the credit accumulation procedure, our proposed full-credit flow control (FFC) ensures atomic buffer usage based only on local credit status while making full use of the buffer space. FFC can be easily integrated into an industrial router to achieve deadlock freedom with less area and power consumption, but 112% higher throughput, compared to the critical bubble scheme (CBS). We further propose a credit reservation strategy to eliminate the escape virtual channel (VC) cost of fully adaptive routing implementations. The synthesis results demonstrate that FFC along with credit reservation (FFC-CR) can effectively reduce the area by 29% and power consumption by 26% compared with CBS.
16:48 CET 19.3.3 DEFT: A DEADLOCK-FREE AND FAULT-TOLERANT ROUTING ALGORITHM FOR 2.5D CHIPLET NETWORKS
Speaker:
Ebadollah Taheri, Colorado State University, US
Authors:
Ebadollah Taheri, Sudeep Pasricha and Mahdi Nikdast, Colorado State University, US
Abstract
By interconnecting smaller chiplets through an interposer, 2.5D integration offers a cost-effective and high-yield solution to implement large-scale modular systems. Nevertheless, the underlying network is prone to deadlock, despite deadlock-free chiplets, and to different faults on the vertical links used for connecting the chiplets to the interposer. Unfortunately, existing fault-tolerant routing techniques proposed for 2D and 3D on-chip networks cannot be applied to chiplet networks. To address these problems, this paper presents the first deadlock-free and fault-tolerant routing algorithm, called DeFT, for 2.5D integrated chiplet systems. DeFT improves the redundancy in vertical-link selection to tolerate faults in vertical links while considering network congestion. Moreover, DeFT can tolerate different vertical-link-fault scenarios while accounting for vertical-link utilization. Compared to the state-of-the-art routing algorithms in 2.5D chiplet systems, our simulation results show that DeFT improves network reachability by up to 75% with a fault rate of up to 25% and reduces the network latency by up to 40% for multi-application execution scenarios with less than 2% area overhead.
16:52 CET 19.3.4 NON-VOLATILE PHASE CHANGE MATERIAL BASED NANOPHOTONIC INTERCONNECT
Speaker:
Parya Zolfaghari, Concordia University, CA
Authors:
Parya Zolfaghari1, Joel Ortiz2, Cedric Killian2 and Sébastien Le Beux3
1Concordia University, CA; 2University of Rennes 1, Inria, CNRS/IRISA Lannion, FR; 3Department of Electrical & Computer Engineering Concordia University, CA
Abstract
Integrated optics is a promising technology for exploiting light propagation for high-throughput chip-scale interconnects in manycore architectures. A key challenge for the deployment of nanophotonic interconnects is their high static power, which is induced by signal losses and device calibration. To tackle this challenge, we propose to use Phase Change Material (PCM) to configure optical paths between writers and readers. The non-volatility of PCM elements and the high contrast between the crystalline and amorphous phase states allow unused readers to be bypassed, thus reducing losses and calibration requirements. We evaluate the efficiency of the proposed PCM-based interconnects using system-level simulations carried out with the SNIPER manycore simulator. For this purpose, we have modified the simulator to partition clusters according to the executed applications. Simulation results show that bypassing readers using PCM yields up to 52% communication power savings.
16:56 CET 19.3.5 Q&A SESSION
Authors:
Romain Lemaire1 and Sébastien Le Beux2
1CEA-List, FR; 2Concordia University, CA
Abstract
Questions and answers with the authors

19.4 Emerging devices for new computing paradigms

Date: Tuesday, 22 March 2022
Time: 16:40 - 17:20 CET

Session chair:
Georgiev Vihar, University of Glasgow, GB

Session co-chair:
Gabriele Boschetto, CNRS-LIRMM, FR

This session will cover new computing ideas and approaches. The first paper of the session will describe the impact of reliability on the performance of photonic neural networks. The next paper will reveal how the parallel-circuit-execution approach can be applied in the quantum computing domain. The session will also cover works on optical logic circuits for space and power reduction, a ternary processor, and the application of the Ising model to solving the traveling salesman problem.

Time Label Presentation Title
Authors
16:40 CET 19.4.1 A RELIABILITY CONCERN ON PHOTONIC NEURAL NETWORKS
Speaker:
Yinyi Liu, Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, HK
Authors:
Yinyi LIU1, Jiaxu Zhang1, Jun Feng1, Shixi Chen1 and Jiang Xu2
1Electronic and Computer Engineering Department, The Hong Kong University of Science and Technology, HK; 2Microelectronics Thrust, Electronic and Computer Engineering Department, AI Chip Center for Emerging Smart Systems, The Hong Kong University of Science and Technology, HK
Abstract
Emerging integrated photonic neural networks have been experimentally proven to achieve ultra-high speedups of deep neural network training and inference in the optical domain. However, photonic devices suffer from inherent crosstalk noise and loss, inevitably leading to reliability concerns. This paper systematically analyzes the impacts of crosstalk and loss on photonic computing systems. We propose a crosstalk-aware model for reliability estimation and find the worst-case bounds as the footprints and scales of the photonic chips increase. Our evaluations show that -30dB crosstalk noise can cause the maximal photonic chip integration scale to drop sharply, by 109x. To facilitate very-large-scale photonic integration for future computing, we further propose multiple heterogeneous bijou photonic-cores to address the crosstalk-aware reliability concern.
16:44 CET 19.4.2 HOW PARALLEL CIRCUIT EXECUTION CAN BE USEFUL FOR NISQ COMPUTING?
Speaker:
Siyuan Niu, LIRMM, University of Montpellier, FR
Authors:
Siyuan Niu1 and Aida Todri-Sanial2
1LIRMM, University of Montpellier, FR; 2LIRMM, University of Montpellier, CNRS, FR
Abstract
Quantum computing is performed on Noisy Intermediate-Scale Quantum (NISQ) hardware in the short term. Only small circuits can be executed reliably on a quantum machine due to the unavoidable noisy quantum operations on NISQ devices, leading to the under-utilization of hardware resources. With the growing demand to access quantum hardware, how to utilize it more efficiently while maintaining output fidelity is becoming a timely issue. A parallel circuit execution technique has been proposed to address this problem by executing multiple programs on hardware simultaneously. It can improve the hardware throughput and reduce the overall runtime. However, accumulative noises such as crosstalk can decrease the output fidelity in parallel workload execution. In this paper, we first give an in-depth overview of state-of-the-art parallel circuit execution methods. Second, we propose a Quantum Crosstalk-aware Parallel workload execution method (QuCP) without the overhead of crosstalk characterization. Third, we investigate the trade-off between hardware throughput and fidelity loss to explore the hardware limitation with parallel circuit execution. Finally, we apply parallel circuit execution to VQE and zero-noise extrapolation error mitigation method to showcase its various applications on advancing NISQ computing.
16:48 CET 19.4.3 SPACE AND POWER REDUCTION IN BDD-BASED OPTICAL LOGIC CIRCUITS EXPLOITING DUAL PORTS
Speaker:
Ryosuke Matsuo, Kyoto University, JP
Authors:
Ryosuke Matsuo and Shin-ichi Minato, Kyoto University, JP
Abstract
Optical logic circuits based on integrated nanophotonics have attracted significant interest due to their ultra-high-speed operation. A synthesis method based on the Binary Decision Diagram (BDD) has been studied, as BDD-based optical logic circuits can take advantage of the speed of light. However, a fundamental disadvantage of BDD-based optical logic circuits is the large number of splitters, which results in large power consumption. In BDD-based circuits, the dual port of each logic gate is not used. We propose a method for eliminating a splitter by exploiting this dual port. We define a BDD node corresponding to a dual port as a dual-port node (DP node) and call the proposed method DP node sharing. We demonstrate that DP node sharing significantly reduces the power consumption and, to a lesser extent, the circuit size, without increasing delay. To evaluate DP node sharing, we conducted an experiment involving 10-input logic functions obtained by applying an LUT technology mapper to the ISCAS'85 C7552 benchmark circuit. The experimental results demonstrate that DP node sharing reduces the power consumption of circuits that consume a large amount of power by two orders of magnitude.
16:52 CET 19.4.4 DESIGN AND EVALUATION FRAMEWORKS FOR ADVANCED RISC-BASED TERNARY PROCESSOR
Speaker:
Dongyun Kam, Pohang University of Science and Technology, KR
Authors:
Dongyun Kam, Jung Gyu Min, Jongho Yoon, Sunmean Kim, Seokhyeong Kang and Youngjoo Lee, Pohang University of Science and Technology, KR
Abstract
In this paper, we introduce design and verification frameworks for developing a fully functional emerging ternary processor. Based on existing compilation environments for binary processors, the software-level framework provides an efficient way to convert given programs into ternary assembly code for the given ternary instructions. We also present a hardware-level framework to rapidly evaluate the performance of a ternary processor implemented in an arbitrary design technology. As a case study, the fully functional 9-trit advanced RISC-based ternary (ART-9) core is newly developed using the proposed frameworks. Utilizing 24 custom ternary instructions, the 5-stage ART-9 prototype architecture is successfully verified with a number of test programs, including the Dhrystone benchmark, in the ternary domain, achieving processing efficiencies of 57.8 DMIPS/W in FPGA-level ternary-logic emulation and 3.06 x 10^6 DMIPS/W with emerging CNTFET ternary gates.
16:56 CET 19.4.5 Q&A SESSION
Authors:
Vihar Georgiev1 and Gabriele Boschetto2
1University of Glasgow, GB; 2CNRS-LIRMM, FR
Abstract
Questions and answers with the authors

19.5 Dealing with Correct Design and Robustness analysis for Complex Systems, MPSoCs and Circuits

Date: Tuesday, 22 March 2022
Time: 16:40 - 17:20 CET

Session chair:
Chung-Wei Lin, National Taiwan University, TW

Session co-chair:
Dionisios N. Pnevmatikatos, NTUA, GR

This session contains two parts; the first part is dedicated to complex systems. The presence of many different subsystems is an important concern in today's complex systems. The first paper addresses the robustness of deep neural networks. The second paper tackles the problem of planning communications in wireless networks with energy harvesting. The second part presents two industrial briefs on different levels of chip design. One is technology-oriented and focuses on transistor design, while the other deals with statistical analysis for the robustness of MPSoCs. Both briefs provide detailed results on performance or energy.

Time Label Presentation Title
Authors
16:40 CET 19.5.1 REVISITING PASS-TRANSISTOR LOGIC STYLES IN A 12NM FINFET TECHNOLOGY NODE
Speaker:
Jan Lappas, TU Kaiserslautern, DE
Authors:
Jan Lappas1, André Chinazzo1, Christian Weis2, Chenyang Xia3, Zhihang Wu3, Leibin Ni3 and Norbert Wehn2
1TU Kaiserslautern, DE; 2University of Kaiserslautern, DE; 3Huawei Technologies Co., Ltd., CN
Abstract
With the slow-down of Moore’s law and the increasing requirements on energy efficiency, alternative logic styles to complementary static CMOS have to be revisited for digital circuit implementations. Pass Transistor Logic (PTL) gained much attention in the ‘90s; however, only a limited number of recent investigations and publications on PTL use advanced technology nodes. This paper compares key performance metrics of 22 different PTL-based 1-bit full adder designs to a complementary static CMOS logic reference, using a recent 12nm FinFET technology. The figures of merit are the propagation delay, the energy consumption, and the energy-delay product (EDP). Our investigations show that PTL-based adder circuits can have up to 49% lower delay and 48% and 63% lower energy consumption and EDP, respectively, compared to a state-of-the-art complementary CMOS logic reference. In addition, we analyzed the impact of PVT variations on the delay of selected PTL full adder designs.
16:44 CET 19.5.2 SAFESU-2: A SAFE STATISTICS UNIT FOR SPACE MPSOCS
Speaker:
Guillem Cabo, Barcelona Supercomputing Center, ES
Authors:
Guillem Cabo1, Sergi Alcaide2, Carles Hernandez3, Pedro Benedicte1, Francisco Bas1, Fabio Mazzocchetti1 and Jaume Abella1
1Barcelona Supercomputing Center, ES; 2Universitat Politècnica de Catalunya - Barcelona Supercomputing Center, ES; 3Universitat Politecnica de Valencia, ES
Abstract
Advanced statistics units (SUs) have been proven effective for the verification, validation and implementation of safety measures as part of safety-related MPSoCs. This is the case, for instance, of the RISC-V MPSoC by Cobham Gaisler based on NOEL-V cores that will become commercially ready by the end of 2022. However, while those SUs support safety in the rest of the SoC, they must themselves be built to be safe before being deployed in real products. This paper presents the SafeSU-2, the safety-compliant version of the SafeSU. In particular, we develop the safety concept of the SafeSU for relevant fault models, and implement the fault detection and fault tolerance features needed to make it compliant with the requirements of safety-related devices in general, and of space MPSoCs in particular.
16:48 CET 19.5.3 (Best Paper Award Candidate)
EFFICIENT GLOBAL ROBUSTNESS CERTIFICATION OF NEURAL NETWORKS VIA INTERLEAVING TWIN-NETWORK ENCODING
Speaker:
Zhilu Wang, Northwestern University, US
Authors:
Zhilu Wang1, Chao Huang2 and Qi Zhu1
1Northwestern University, US; 2University of Liverpool, Northwestern University, GB
Abstract
The robustness of deep neural networks has received significant interest recently, especially when being deployed in safety-critical systems, as it is important to analyze how sensitive the model output is under input perturbations. While most previous works focused on the local robustness property around an input sample, the studies of the global robustness property, which bounds the maximum output change under perturbations over the entire input space, are still lacking. In this work, we formulate the global robustness certification for neural networks with ReLU activation functions as a mixed-integer linear programming (MILP) problem, and present an efficient approach to address it. Our approach includes a novel interleaving twin-network encoding scheme, where two copies of the neural network are encoded side-by-side with extra interleaving dependencies added between them, and an over-approximation algorithm leveraging relaxation and refinement techniques to reduce complexity. Experiments demonstrate the timing efficiency of our work when compared with previous global robustness certification methods and the tightness of our over-approximation. A case study of closed-loop control safety verification is conducted, and demonstrates the importance and practicality of our approach for certifying the global robustness of neural networks in safety-critical systems.
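The over-approximation idea in this abstract can be illustrated with a minimal sketch: propagating input intervals through a ReLU network bounds the maximum output change under bounded perturbations, at the cost of some looseness compared to the exact MILP encoding. The tiny network and all numbers below are invented for illustration and are unrelated to the paper's benchmarks or its interleaving twin-network encoding.

```python
def relu(v):
    return max(0.0, v)

def interval_affine(lo, hi, weights, bias):
    """Propagate per-input intervals [lo, hi] through y = Wx + b."""
    out_lo, out_hi = [], []
    for row, b in zip(weights, bias):
        out_lo.append(b + sum(w * (lo[i] if w >= 0 else hi[i]) for i, w in enumerate(row)))
        out_hi.append(b + sum(w * (hi[i] if w >= 0 else lo[i]) for i, w in enumerate(row)))
    return out_lo, out_hi

def forward(x, W1, b1, W2, b2):
    """Concrete forward pass of the 2-layer toy network."""
    h = [relu(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return [sum(w * hj for w, hj in zip(row, h)) + b for row, b in zip(W2, b2)]

def output_change_bound(W1, b1, W2, b2, x, eps):
    """Over-approximate max |f(x') - f(x)| over all ||x' - x||_inf <= eps."""
    lo = [xi - eps for xi in x]
    hi = [xi + eps for xi in x]
    h_lo, h_hi = interval_affine(lo, hi, W1, b1)
    h_lo, h_hi = [relu(v) for v in h_lo], [relu(v) for v in h_hi]
    o_lo, o_hi = interval_affine(h_lo, h_hi, W2, b2)
    o = forward(x, W1, b1, W2, b2)
    # Worst deviation of any reachable output from the nominal output.
    return max(max(u - oi, oi - l) for l, u, oi in zip(o_lo, o_hi, o))

# Made-up toy network: 2 inputs, 2 hidden ReLU units, 1 output.
W1, b1 = [[1.0, 2.0], [0.5, 1.0]], [0.0, 0.0]
W2, b2 = [[1.0, 1.0]], [0.0]
bound = output_change_bound(W1, b1, W2, b2, x=[0.5, 0.5], eps=0.1)
```

Every actual output change stays below the interval bound by construction; the paper's contribution is making such bounds both global (over the entire input space) and much tighter than naive relaxations like this one.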
16:52 CET 19.5.4 OPPORTUNISTIC COMMUNICATION WITH LATENCY GUARANTEES FOR INTERMITTENTLY-POWERED DEVICES
Speaker:
Kacper Wardega, Boston University, US
Authors:
Kacper Wardega1, Wenchao Li1, Hyoseung Kim2, Yawen Wu3, Zhenge Jia3 and Jingtong Hu3
1Boston University, US; 2University of California, Riverside, US; 3University of Pittsburgh, US
Abstract
Energy-harvesting wireless sensor nodes have found widespread adoption due to their low cost and small form factor. However, uncertainty in the available power supply introduces significant challenges in engineering communications between intermittently-powered nodes. We propose a constraint-based model for energy harvests that together with a hardware model can be used to enable safe, opportunistic communication with worst-case latency guarantees. We show that greedy approaches that attempt communication whenever energy is available lead to prolonged latencies in real-world environments. Our approach offers bounded worst-case latency while providing a performance improvement over a conservative, offline approach planned around the worst-case energy harvest.
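A minimal sketch of what a constraint-based harvest model buys you. We assume a toy guarantee of the form "at least e_min units of energy arrive in any window of w consecutive slots" (our own simplification, not the paper's model); under it, the worst-case latency until one transmission's worth of energy has surely accumulated has a closed-form bound.

```python
import math

def worst_case_latency(cost, e_min, window):
    """Slots after which `cost` units of energy are guaranteed to have arrived,
    assuming every `window` consecutive slots deliver at least `e_min` energy.
    Disjoint windows alone already guarantee e_min each, so ceil(cost/e_min)
    full windows suffice."""
    return math.ceil(cost / e_min) * window
```

For example, a transmission costing 10 units under a guarantee of 2 units per 5-slot window is surely possible within 25 slots, even against an adversarial harvest that delays all energy to the end of each window; a greedy scheme gets no such bound.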
16:56 CET 19.5.5 Q&A SESSION
Authors:
Chung-Wei Lin1 and Dionisios Pnevmatikatos2
1National Taiwan University, TW; 2School of ECE, National TU Athens & FORTH-ICS, GR
Abstract
Questions and answers with the authors

20.1 Panel: The Good, the Bad and the Trendy of Multi-Partner Research Projects in Europe

Date: Tuesday, 22 March 2022
Time: 18:00 - 20:30 CET

Session chair:
Lorena Anghel, Grenoble INP, FR

Session co-chair:
Maksim Jenihhin, Tallinn University of Technology, EE

Panellists:
Yves Gigase, KDT Joint Undertaking, BE
Anton Chichkov, KDT Joint Undertaking, BE
Daniel Watzenig, Virtual Vehicle Research GmbH, AT
Said Hamdioui, Delft University of Technology, NL
Peter Hofmann, Deutsche Telekom Security, DE
Christoph Grimm, TU Kaiserslautern, DE
Dirk Pflueger, University of Stuttgart, DE

The panel establishes an open discussion of opportunities and approaches to collaborative research and innovation in Europe. In addition, it foresees an invited talk by Yves Gigase entitled “KDT JU and the Chips Act: Opportunities for the DATE Community”. The panel speakers include representatives of the European Commission and distinguished experts in multi-partner research, notably representatives of the projects CARAMEL, GENIAL! and GS-IMTR. The online live debate will address the balance between blue-sky and applied research in Europe, upcoming killer trends, the protection of EU interests (a concern exacerbated by COVID-19), and a number of other exciting questions.


IP.3_1 Interactive presentations

Date: Wednesday, 23 March 2022
Time: 11:30 - 12:15 CET

Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.

Label Presentation Title
Authors
IP.3_1.1 REDMULE: A COMPACT FP16 MATRIX-MULTIPLICATION ACCELERATOR FOR ADAPTIVE DEEP LEARNING ON RISC-V-BASED ULTRA-LOW-POWER SOCS
Speaker:
Yvan Tortorella, University of Bologna, IT
Authors:
Yvan Tortorella1, Luca Bertaccini2, Davide Rossi3, Luca Benini4 and Francesco Conti1
1University of Bologna, IT; 2ETH Zürich, CH; 3University Of Bologna, IT; 4Università di Bologna and ETH Zürich, IT
Abstract
The fast proliferation of extreme-edge applications using Deep Learning (DL) based algorithms requires dedicated hardware to satisfy their latency, throughput, and precision requirements. While inference is achievable in practical cases, online fine-tuning and adaptation of general DL models are still highly challenging. One of the key stumbling blocks is the need for parallel floating-point operations, which are considered unaffordable on sub-100mW extreme-edge SoCs. We tackle this problem with RedMulE (Reduced-precision matrix Multiplication Engine), a parametric low-power hardware accelerator for FP16 matrix multiplications - the main kernel of DL training and inference - conceived for tight integration within a cluster of tiny RISC-V cores based on the PULP (Parallel Ultra-Low-Power) architecture. In 22nm technology, a 32-FMA RedMulE instance occupies just 0.07mm^2 (14% of an 8-core RISC-V cluster) and achieves up to 666MHz maximum operating frequency, for a throughput of 31.6MAC/cycle (98.8% utilization). We reach a cluster-level power consumption of 43.5mW and a full-cluster energy efficiency of 688 16-bit GFLOPS/W. Overall, RedMulE features up to 4.65x higher energy efficiency and 22x speedup over SW execution on 8 RISC-V cores.
IP.3_1.2 INCREASING CELLULAR NETWORK ENERGY EFFICIENCY FOR RAILWAY CORRIDORS
Speaker:
Adrian Schumacher, Swisscom (Switzerland) Ltd., CH
Authors:
Adrian Schumacher1, Ruben Merz1 and Andreas Burg2
1Swisscom (Switzerland) Ltd., CH; 2EPFL-TCL, CH
Abstract
Modern trains act as Faraday cages making it challenging to provide high cellular data capacities to passengers. A solution is the deployment of linear cells along railway tracks, forming a cellular corridor. To provide a sufficiently high data capacity, many cell sites need to be installed at regular distances. However, such cellular corridors with high power sites in short distance intervals are not sustainable due to the infrastructure power consumption. To render railway connectivity more sustainable, we propose to deploy fewer high-power radio units with intermediate low-power support repeater nodes. We show that these repeaters consume only 5% of the energy of a regular cell site and help to maintain the same data capacity in the trains. In a further step, we introduce a sleep mode for the repeater nodes that enables autonomous solar powering and even eases installation because no cables to the relays are needed.
IP.3_1.3 HEALTH MONITORING OF MILLING TOOLS UNDER DISTINCT OPERATING CONDITIONS BY A DEEP CONVOLUTIONAL NEURAL NETWORK MODEL
Speaker:
Priscile Suawa, Brandenburg TU, Cottbus–Senftenberg, DE
Authors:
Priscile Suawa and Michael Hübner, Brandenburg TU Cottbus, DE
Abstract
One of the most popular manufacturing techniques is milling. It can be used to make a variety of geometric components, such as flat grooves and surfaces. The condition of the milling tool has a major impact on the quality of milling processes, hence the importance of monitoring it. When working on monitoring solutions, it is crucial to take different operating variables into account, such as rotational speed, especially in real-world settings. This work addresses the topic of predictive maintenance by exploiting the fusion of sensor data and the artificial intelligence-based analysis of signals measured by sensors. With a set of data such as vibration and sound reflection from the sensors, we focus on finding solutions for the task of detecting the health condition of machines. A Deep Convolutional Neural Network (DCNN) model is provided with fusion at the sensor-data level to detect five consecutive health states of a milling tool, from a healthy state to a state of degradation. In addition, a demonstrator is built with Simulink to simulate and visualize the detection process. To examine the capacity of our model, the signal data was processed individually and subsequently merged. Experiments were carried out on three sets of data recorded during a real milling process. Results using the proposed DCNN architecture with raw data have reached an accuracy of more than 94% for all data sets.

IP.3_2 Interactive presentations

Date: Wednesday, 23 March 2022
Time: 11:30 - 12:15 CET

Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.

Label Presentation Title
Authors
IP.3_2.1 GRADIENT-BASED BIT ENCODING OPTIMIZATION FOR NOISE-ROBUST BINARY MEMRISTIVE CROSSBAR
Speaker:
Youngeun Kim, Yale University, US
Authors:
Youngeun Kim1, Hyunsoo Kim2, Seijoon Kim3, Sang Joon Kim2 and Priyadarshini Panda1
1Yale University, US; 2Samsung Advanced Institute of Technology, KR; 3Seoul National University, KR
Abstract
Binary memristive crossbars have gained huge attention as an energy-efficient deep learning hardware accelerator. Nonetheless, they suffer from various noises due to the analog nature of the crossbars. To overcome such limitations, most previous works train weight parameters with noise data obtained from a crossbar. These methods are, however, ineffective because it is difficult to collect noise data in large-volume manufacturing environments where each crossbar has a large device/circuit-level variation. Moreover, we argue that there is still room for improvement even though these methods somewhat improve accuracy. This paper explores a new perspective on mitigating crossbar noise in a more generalized way by manipulating input binary bit encoding rather than training the weight of networks with respect to noise data. We first mathematically show that the noise decreases as the number of binary bit encoding pulses increases when representing the same amount of information. In addition, we propose Gradient-based Bit Encoding Optimization (GBO) which optimizes a different number of pulses at each layer, based on our in-depth analysis that each layer has a different level of noise sensitivity. The proposed heterogeneous layer-wise bit encoding scheme achieves high noise robustness with low computational cost. Our experimental results on public benchmark datasets show that GBO improves the classification accuracy by ~5-40% in severe noise scenarios.
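The claim that noise decreases with the number of encoding pulses can be checked with a toy Monte Carlo model. The iid Gaussian read noise and the averaging decoder below are our own simplifying assumptions, not the paper's crossbar noise model.

```python
import random
import statistics

def decode_with_pulses(value, n_pulses, noise_sigma, rng):
    """Toy model: each pulse carries the same value, corrupted by iid Gaussian
    noise; the decoder averages the pulses, so error shrinks ~1/sqrt(n)."""
    reads = [value + rng.gauss(0.0, noise_sigma) for _ in range(n_pulses)]
    return sum(reads) / n_pulses

rng = random.Random(42)
mean_abs_error = {}
for n in (1, 4, 16):
    errs = [abs(decode_with_pulses(1.0, n, 0.5, rng) - 1.0) for _ in range(2000)]
    mean_abs_error[n] = statistics.mean(errs)
```

Under this model the mean absolute decoding error drops roughly by half each time the pulse count quadruples, which is the monotone relationship GBO exploits (while trading it against computational cost per layer).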
IP.3_2.2 TAS: TERNARIZED NEURAL ARCHITECTURE SEARCH FOR RESOURCE-CONSTRAINED EDGE DEVICES
Speaker:
Mohammad Loni, MDH, SE
Authors:
Mohammad Loni1, Hamid Mousavi2, Mohammad Riazati2, Masoud Daneshtalab2 and Mikael Sjodin3
1Mälardalen University, SE; 2MDH, SE; 3Mälardalen Real-Time Research Centre, SE
Abstract
Ternary Neural Networks (TNNs) compress network weights and activation functions into a 2-bit representation, resulting in remarkable network compression and energy efficiency. However, there remains a significant gap in accuracy between TNNs and their full-precision counterparts. Recent advances in Neural Architecture Search (NAS) promise opportunities in automated optimization for various deep learning tasks. Unfortunately, this area is unexplored for optimizing TNNs. This paper proposes TAS, a framework that drastically reduces the accuracy gap between TNNs and their full-precision counterparts by integrating quantization into the network design. We observed that directly applying NAS to the ternary domain leads to accuracy degradation, as the search settings are customized for full-precision networks. To address this problem, we propose (i) a new cell template for ternary networks with maximum gradient propagation; and (ii) a novel learnable quantizer that adaptively relaxes the ternarization mechanism based on the distribution of the weights and activation functions. Experimental results reveal that TAS delivers 2.64% higher accuracy and ≈2.8x memory saving over competing methods with the same bit-width resolution on the CIFAR-10 dataset. These results suggest that TAS is an effective method that paves the way for the efficient design of the next generation of quantized neural networks.
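For orientation, this is what a plain (non-learnable) ternarization step looks like: weights collapse to {-1, 0, +1} codes plus a scale. The fixed threshold and mean-of-nonzeros scale below are a common textbook choice, not the paper's quantizer, which additionally learns its relaxation from the weight distribution.

```python
def ternarize(weights, threshold):
    """Map weights to ternary codes {-1, 0, +1} with a per-tensor scale:
    weights within +/-threshold become 0, the rest keep their sign; the
    scale is the mean magnitude of the surviving weights."""
    codes = [0 if abs(w) <= threshold else (1 if w > 0 else -1) for w in weights]
    nonzero = [abs(w) for w, c in zip(weights, codes) if c != 0]
    scale = sum(nonzero) / len(nonzero) if nonzero else 0.0
    return codes, scale
```

The reconstructed weight is `code * scale`, i.e. a 2-bit value per weight; the accuracy gap TAS attacks comes precisely from how much information this collapse discards.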
IP.3_2.3 EXAMINING AND MITIGATING THE IMPACT OF CROSSBAR NON-IDEALITIES FOR ACCURATE IMPLEMENTATION OF SPARSE DEEP NEURAL NETWORKS
Speaker:
Abhiroop Bhattacharjee, Yale University, US
Authors:
Abhiroop Bhattacharjee1, Lakshya Bhatnagar2 and Priyadarshini Panda1
1Yale University, US; 2IIT Delhi, IN
Abstract
Recently, several structured pruning techniques have been introduced for the energy-efficient implementation of Deep Neural Networks (DNNs) with a smaller number of crossbars. Although these techniques have claimed to preserve the accuracy of sparse DNNs on crossbars, none have studied the impact of the inexorable crossbar non-idealities on the actual performance of the pruned networks. To this end, we perform a comprehensive study to show how highly sparse DNNs, which result in significant crossbar compression rates, can lead to severe accuracy losses compared to unpruned DNNs mapped onto non-ideal crossbars. We perform experiments with multiple structured-pruning approaches (such as C/F pruning, XCS and XRS) on VGG11 and VGG16 DNNs with benchmark datasets (CIFAR10 and CIFAR100). We propose two mitigation approaches - crossbar-column rearrangement and Weight-Constrained Training (WCT) - that can be integrated with the crossbar mapping of sparse DNNs to minimize the accuracy losses incurred by the pruned models. These help in mitigating non-idealities by increasing the proportion of low-conductance synapses on crossbars, thereby improving their computational accuracies.

IP.3_3 Interactive presentations

Date: Wednesday, 23 March 2022
Time: 11:30 - 12:15 CET

Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.

Label Presentation Title
Authors
IP.3_3.1 CROSS-LEVEL PROCESSOR VERIFICATION VIA ENDLESS RANDOMIZED INSTRUCTION STREAM GENERATION WITH COVERAGE-GUIDED AGING
Speaker:
Niklas Bruns, University of Bremen, DE
Authors:
Niklas Bruns1, Vladimir Herdt2, Eyck Jentzsch3 and Rolf Drechsler4
1University of Bremen, DE; 2DFKI, DE; 3MINRES Technologies GmbH, DE; 4University of Bremen/DFKI, DE
Abstract
We propose a novel cross-level verification approach for processor verification at the Register-Transfer Level (RTL). The foundation is a randomized coverage-guided instruction stream generator that produces one endless and unrestricted instruction stream that evolves dynamically at runtime. We leverage an Instruction Set Simulator (ISS) as a reference model in a tight co-simulation setting. Coverage information is continuously updated based on the execution state of the ISS, and we employ Coverage-guided Aging to smooth out the coverage distribution of the randomized instruction stream over time. Our case study with an industrial pipelined 32-bit RISC-V processor demonstrates the effectiveness of our approach.
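The coverage-guided aging idea, biasing generation away from already-covered behavior, can be sketched as a weighted random pick. The instruction classes, counts, and inverse-count weighting below are hypothetical illustrations, not the paper's generator.

```python
import random

def pick_instruction(coverage, rng):
    """Sketch of coverage-guided selection: weight each instruction class
    inversely to how often it has already been covered, so rarely covered
    classes are generated more often and the distribution flattens over time."""
    weights = [(insn, 1.0 / (1 + hits)) for insn, hits in coverage.items()]
    total = sum(w for _, w in weights)
    r = rng.random() * total
    for insn, w in weights:
        r -= w
        if r <= 0:
            return insn
    return weights[-1][0]  # guard against float rounding

rng = random.Random(7)
coverage = {"add": 50, "mul": 0}   # "add" already heavily exercised
picks = [pick_instruction(coverage, rng) for _ in range(1000)]
```

With these made-up counts the uncovered class dominates the next picks; in the real flow the coverage counts come from the ISS execution state and are updated continuously.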
IP.3_3.2 HARDWARE ACCELERATION OF EXPLAINABLE MACHINE LEARNING
Speaker:
Prabhat Mishra, University of Florida, US
Authors:
Zhixin Pan and Prabhat Mishra, University of Florida, US
Abstract
Machine learning (ML) is successful in achieving human-level performance in various fields. However, it lacks the ability to explain an outcome due to its black-box nature. While recent efforts on explainable ML have received significant attention, the existing solutions are not applicable in real-time systems since they map interpretability as an optimization problem, which leads to numerous iterations of time-consuming complex computations. To make matters worse, existing implementations are not amenable to hardware-based acceleration. In this paper, we propose an efficient framework to enable acceleration of the explainable ML procedure with hardware accelerators. We explore the effectiveness of both Tensor Processing Unit (TPU) and Graphics Processing Unit (GPU) based architectures in accelerating explainable ML. Specifically, this paper makes three important contributions. (1) To the best of our knowledge, our proposed work is the first attempt at enabling hardware acceleration of explainable ML. (2) Our proposed solution exploits the synergy between matrix convolution and Fourier transform, and therefore, it takes full advantage of TPU’s inherent ability in accelerating matrix computations. (3) Our proposed approach can lead to real-time outcome interpretation. Extensive experimental evaluation demonstrates that the proposed approach deployed on TPU can provide drastic improvement in interpretation time (39x on average) as well as energy efficiency (69x on average) compared to existing acceleration techniques.
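The "synergy between matrix convolution and Fourier transform" the abstract alludes to is the convolution theorem: a circular convolution equals the inverse DFT of the element-wise product of DFTs. A toy pure-Python illustration (an O(n^2) DFT for clarity, nothing like the authors' TPU implementation):

```python
import cmath

def dft(xs, inverse=False):
    """Naive discrete Fourier transform; the inverse carries the 1/n factor."""
    n = len(xs)
    sign = 1 if inverse else -1
    out = [sum(x * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t, x in enumerate(xs))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def circular_conv_fft(a, b):
    """Convolution theorem: conv(a, b) = IDFT(DFT(a) * DFT(b))."""
    A, B = dft(a), dft(b)
    return [v.real for v in dft([x * y for x, y in zip(A, B)], inverse=True)]
```

Because both the DFT and the convolution reduce to matrix multiplications, an accelerator built for matmuls (such as a TPU) can execute either side of this identity efficiently, which is the leverage the paper exploits.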
IP.3_3.3 FAST SIMULATION OF FUTURE 128-BIT ARCHITECTURES
Speaker:
Frédéric Pétrot, University Grenoble Alpes, Grenoble INP, FR
Authors:
Fabien Portas1 and Frédéric Pétrot2
1TIMA lab, University Grenoble Alpes, CNRS, Grenoble-INP, FR; 2TIMA Lab, Université Grenoble Alpes, FR
Abstract
Whether 128-bit architectures will some day hit the market or not is an open question. There is however a trend towards that direction: virtual addresses grew from 34 to 48 bits in 1999 and then to 57 bits in 2019. The impact of a virtually infinite addressable space on software is hard to predict, but it will most likely be major. Simulation tools are therefore needed to support research and experimentation for tooling and software. In this paper, we present the implementation of the 128-bit extension of the RISC-V architecture in the QEMU functional simulator and report first performance evaluations. On our limited set of programs, simulation is slowed down by a factor of at worst 5 compared to 64-bit simulation, making the tool still usable for executing large software codes.
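A 64-bit host simulating 128-bit guest arithmetic must split each guest value into two limbs and propagate carries by hand. A minimal sketch of the addition case (QEMU's actual TCG code generation differs; this only illustrates the limb arithmetic):

```python
MASK64 = (1 << 64) - 1

def add128(a_hi, a_lo, b_hi, b_lo):
    """Add two 128-bit values held as (hi, lo) 64-bit limbs: add the low
    halves first, detect wrap-around, then fold the carry into the highs."""
    lo = (a_lo + b_lo) & MASK64
    carry = 1 if lo < a_lo else 0   # wrap-around means a carry occurred
    hi = (a_hi + b_hi + carry) & MASK64
    return hi, lo
```

Every 128-bit guest operation expands into several such host operations, which is one source of the slowdown (at worst 5x vs. 64-bit simulation) the paper reports.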

IP.3_4 Interactive presentations

Date: Wednesday, 23 March 2022
Time: 11:30 - 12:15 CET

Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.

Label Presentation Title
Authors
IP.3_4.1 A GENERATIVE AI FOR HETEROGENEOUS NETWORK-ON-CHIP DESIGN SPACE PRUNING
Speaker:
Maxime France-Pillois, LIRMM, FR
Authors:
Maxime Mirka1, Maxime France-Pillois1, Gilles Sassatelli2 and Abdoulaye Gamatie3
1LIRMM CNRS / University of Montpellier, FR; 2LIRMM CNRS / University of Montpellier 2, FR; 3CNRS LIRMM / University of Montpellier, FR
Abstract
Often suffering from under-optimization, Networks-on-Chip (NoCs) heavily impact the efficiency of domain-specific Systems-on-Chip. To cope with this issue, heterogeneous NoCs are promising alternatives. Nevertheless, the design of optimized NoCs satisfying multiple performance objectives, e.g. throughput, power and area, is extremely challenging and requires significant expertise. While some approaches have been proposed to deal with the design space of NoCs, most fail to meet expectations such as tractable exploration time and handling of multi-objective optimization. In this paper, we propose an approach based on generative artificial intelligence to help prune complex design spaces for heterogeneous NoCs, according to configurable performance objectives. This is made possible by the ability of Generative Adversarial Networks to learn and generate relevant design candidates for the target NoCs. The speed and flexibility of our solution enable a fast generation of optimized NoCs that fit users' expectations. Through experiments, we show how to obtain competitive NoC designs that reduce power consumption with no communication performance or area penalty compared to a given conventional NoC design.
IP.3_4.2 SPARROW: A LOW-COST HARDWARE/SOFTWARE CO-DESIGNED SIMD MICROARCHITECTURE FOR AI OPERATIONS IN SPACE PROCESSORS
Speaker:
Marc Solé Bonet, Universitat Politècnica de Catalunya - Barcelona Supercomputing Center, ES
Authors:
Marc Solé Bonet and Leonidas Kosmidis, Universitat Politècnica de Catalunya - Barcelona Supercomputing Center, ES
Abstract
Recently there is an increasing interest in the use of artificial intelligence for on-board processing as indicated by the latest space missions, which cannot be satisfied by existing low-performance space-qualified processors. Although COTS AI accelerators can provide the required performance, they are not designed to meet space requirements. In this work, we co-design a low-cost SIMD micro-architecture integrated in a space-qualified processor, which can significantly increase its performance. Our solution has no impact on the processor's 100 MHz frequency and consumes minimal area thanks to its innovative design compared to conventional vector micro-architectures. For the minimum configuration of our baseline space processor, our results indicate a performance boost of up to 9.3x for commonly used AI-related and image processing algorithms, and a 5.5x speedup for a complex, space-relevant inference application, with just a 30% area increase.
IP.3_4.3 A PLUGGABLE VECTOR UNIT FOR RISC-V VECTOR EXTENSION
Speaker:
Vincenzo Maisto, Hensoldt Cyber GmbH, and University of Naples Federico II, IT
Authors:
Vincenzo Maisto1 and Alessandro Cilardo2
1University of Naples Federico II and Hensoldt Cyber GmbH, IT; 2University of Naples Federico II, IT
Abstract
Vector extensions have become increasingly important for accelerating data-parallel applications in areas like multimedia, data-streaming, and Machine Learning. This interactive presentation introduces a microarchitectural design of a vector unit compliant with the RISC-V vector extension v1.0. While we targeted a specific core for demonstration, CVA6, our architecture is designed so as to ensure extensibility, maintainability, and re-usability in other cores. Furthermore, as a distinctive feature, we support speculative execution and precise vector traps. The paper provides an overview of the main motivation, design choices, and implementation details, followed by a qualitative and quantitative discussion of the results collected from the synthesis of the extended CVA6 RISC-V core.

IP.3_5 Interactive presentations

Date: Wednesday, 23 March 2022
Time: 11:30 - 12:15 CET

Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.

Label Presentation Title
Authors
IP.3_5.1 ROBUST RECONFIGURABLE SCAN NETWORKS
Speaker:
Natalia Lylina, University of Stuttgart, DE
Authors:
Natalia Lylina, Chih-Hao Wang and Hans-Joachim Wunderlich, University of Stuttgart, DE
Abstract
Reconfigurable Scan Networks (RSNs) access the evaluation results from embedded instruments and control their operation throughout the device lifetime. At the same time, a single fault in an RSN may dramatically reduce the accessibility of the instruments. During post-silicon validation, it may prevent extracting the complete data from a device. During online operation, the inaccessibility of runtime-critical instruments via a defective RSN may eventually result in a system failure. This paper addresses both scenarios above by presenting robust RSNs. We show that by making a small number of carefully selected spots in RSNs more robust, the entire access mechanism becomes significantly more reliable. A flexible cost function assesses the importance of specific control primitives for the overall accessibility of the instruments. Following the cost function, a minimized number of spots is hardened against permanent faults. All the critical instruments as well as most of the remaining instruments are accessible through the resulting RSNs even in the presence of defects. In contrast to state-of-the-art fault-tolerant RSNs, the presented scheme does not change the RSN topology and needs less hardware overhead. Selective hardening is formulated as a multi-objective optimization problem and solved by using an evolutionary algorithm. The experimental results validate the efficiency and the scalability of the approach.
IP.3_5.2 SYNCLOCK: RF TRANSCEIVER SECURITY USING SYNCHRONIZATION LOCKING
Speaker:
Alan Rodrigo Díaz Rizo, Sorbonne University, CNRS, LIP6, FR
Authors:
Alan Rodrigo Díaz Rizo, Hassan Aboushady and Haralampos-G. Stratigopoulos, Sorbonne Université, CNRS, LIP6, FR
Abstract
We present an anti-piracy locking-based design methodology for RF transceivers, called SyncLock. SyncLock acts on the synchronization of the transmitter with the receiver. If a key other than the secret one is applied, the synchronization and, thereby, the communication fail. SyncLock is implemented using a novel locking concept: a hard-coded error is hidden in the design, while the unlocking, i.e., the error correction, takes place at another part of the design upon application of the secret key. SyncLock presents several advantages: it is generally applicable, incorrect keys result in denial-of-service, it incurs no performance penalty and minimal overheads, and it offers maximum security, thwarting all known counter-attacks. We demonstrate SyncLock with hardware measurements.
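The locking concept can be caricatured in a few lines: a hard-coded error injected into the sync path that only the secret key cancels. All values and word widths below are made up, and the real SyncLock acts on transceiver hardware (synchronization circuitry), not on integers.

```python
SECRET_KEY = 0b1011_0110          # hypothetical 8-bit secret key

def tx_sync_word(sync_word, key):
    """Sketch of the locking idea: a hard-coded error is baked into the
    transmitted sync word at one point in the design; applying the correct
    key at another point cancels it, any other key corrupts the sync word."""
    hardcoded_error = SECRET_KEY  # chosen at design time so the key cancels it
    return sync_word ^ hardcoded_error ^ key

def rx_synchronized(expected_sync, received_sync):
    """Receiver locks onto the frame only if the sync word is intact."""
    return expected_sync == received_sync
```

Because any incorrect key leaves a nonzero residual error, the transceiver pair never synchronizes, which yields the denial-of-service property the abstract claims.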
IP.3_5.3 DEEP REINFORCEMENT LEARNING FOR ANALOG CIRCUIT STRUCTURE SYNTHESIS
Speaker:
Zhenxin Zhao, Memorial University of Newfoundland, CA
Authors:
Zhenxin Zhao and Lihong Zhang, Memorial University of Newfoundland, CA
Abstract
This paper presents a novel deep-reinforcement-learning-based method for analog circuit structure synthesis. It behaves like a designer, who learns from trials, derives design knowledge and experience, and evolves gradually to eventually figure out a way to construct circuit structures that can meet the given design specifications. Necessary design rules are defined and applied to set up the specialized environment of reinforcement learning in order to reasonably construct circuit structures. The produced circuit structures are then verified by the simulation-in-loop sizing. In addition, hash table and symbolic analysis techniques are employed to significantly promote the evaluation efficiency. Our experimental results demonstrate the sound efficiency, strong reliability, and wide applicability of the proposed method.

IP.3_6 Interactive presentations

Date: Wednesday, 23 March 2022
Time: 11:30 - 12:15 CET

Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.

Label Presentation Title
Authors
IP.3_6.1 COMPATIBILITY CHECKING FOR AUTONOMOUS LANE-CHANGING ASSISTANCE SYSTEMS
Speaker:
Chung-Wei Lin, National Taiwan University, TW
Authors:
Po-Yu Huang1, Kai-Wei Liu1, Zong-Lun Li1, Sanggu Park2, Edward Andert2, Chung-Wei Lin1 and Aviral Shrivastava2
1National Taiwan University, TW; 2Arizona State University, US
Abstract
Different types of lane-changing assistance systems are usually developed separately by different automotive makers or suppliers. A lane-changing model can meet its own requirements, but it may be incompatible with another lane-changing model. In this paper, we verify if two lane-changing models are compatible so that the two corresponding vehicles on different lanes can exchange their lanes successfully. We propose a methodology and an algorithm to perform the verification on the combinations of four lane-changing models. Experimental results demonstrate the compatibility (or incompatibility) between the models. The verification results can be utilized during runtime to prevent incompatible vehicles from entering a lane-changing road segment. To the best of our knowledge, this is the first work considering the compatibility issue for lane-changing models.
IP.3_6.2 PAXC: A PROBABILISTIC-ORIENTED APPROXIMATE COMPUTING METHODOLOGY FOR ANN LEARNING
Speaker:
Pengfei Huang, Nanjing University of Aeronautics and Astronautics, CN
Authors:
Pengfei Huang, Chenghua Wang, Ke Chen and Weiqiang Liu, Nanjing University of Aeronautics and Astronautics, CN
Abstract
Despite the rapidly increasing number of approximate designs in the circuit logic stack for Artificial Neural Network (ANN) learning, a principled and systematic approximate hardware methodology incorporating domain knowledge is still lacking. As the layers of ANNs become deeper, the errors introduced by approximate hardware accumulate quickly, which can lead to unexpected results. In this paper, we propose a probabilistic-oriented approximate computing (PAxC) methodology based on the notion of approximate probability to overcome the conceptual and computational difficulties inherent to probabilistic ANN learning. PAxC makes use of the minimum likelihood error at both the circuit and application levels to maintain aggressive approximate datapaths and boost the benefits of the trade-off between accuracy and energy. Compared with a baseline design, the proposed method significantly reduces the power-delay product (PDP) with negligible accuracy loss. Simulation and a case study of image processing validate the effectiveness of the proposed methodology.
IP.3_6.3 LAC: LEARNED APPROXIMATE COMPUTING
Speaker:
Tianmu Li, University of California, Los Angeles, US
Authors:
Vaibhav Gupta1, Tianmu Li2 and Puneet Gupta1
1UCLA, US; 2University of California, Los Angeles, US
Abstract
Approximate hardware trades acceptable error for improved performance, and previous literature focuses on optimizing this trade-off in the hardware. We show in this paper that the application (i.e., the software) can be optimized for better accuracy without losing any performance benefits of the approximate hardware. We propose LAC: learned approximate computing as a method of tuning the application parameters to compensate for hardware errors. Our approach showed improvements across a variety of standard signal/image processing applications, delivering an average improvement of 5.82 dB in PSNR and 0.23 in SSIM of the outputs. This translates to up to 87% power reduction and 83% area reduction for similar application quality. LAC allows the same approximate hardware to be used for multiple applications.
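The LAC idea, tuning a software parameter to compensate a fixed hardware error, can be sketched with a toy truncating multiplier and an offset search over calibration data. Both the error model and the tunable offset are invented for illustration; the paper tunes real application parameters with learning, not a grid search.

```python
def approx_mul(x, c, drop_bits=3):
    """Toy approximate multiplier: the data operand loses its low bits,
    a stand-in for real approximate hardware with a systematic bias."""
    return (x & ~((1 << drop_bits) - 1)) * c

def tune_offset(samples, c, candidates):
    """LAC-style software tuning sketch: pick the output offset that best
    compensates the hardware's truncation error on calibration data."""
    def mean_abs_err(off):
        return sum(abs(approx_mul(x, c) + off - x * c) for x in samples) / len(samples)
    return min(candidates, key=mean_abs_err)
```

The hardware is untouched (so its power/area savings remain), yet the tuned software parameter absorbs much of the systematic error, which is the core LAC observation.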

IP.3_7 Interactive presentations

Date: Wednesday, 23 March 2022
Time: 11:30 - 12:15 CET

Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.

Label Presentation Title
Authors
IP.3_7.1 EVA-CAM: A CIRCUIT/ARCHITECTURE-LEVEL EVALUATION TOOL FOR GENERAL CONTENT ADDRESSABLE MEMORIES
Speaker:
Liu Liu, University of Notre Dame, US
Authors:
Liu Liu1, Mohammad Mehdi Sharifi1, Ramin Rajaei2, Arman Kazemi1, Kai Ni3, Xunzhao Yin4, Michael Niemier1 and X. Sharon Hu1
1University of Notre Dame, US; 2Department of Computer Science and Engineering, University of Notre Dame, US; 3Rochester Institute of Technology, US; 4Zhejiang University, CN
Abstract
Content addressable memories (CAMs), a special-purpose in-memory computing (IMC) unit, support parallel searches directly in memory. There is growing interest in CAMs for data-intensive applications such as machine learning and bioinformatics. The design space for CAMs is rapidly expanding. In addition to traditional ternary CAMs (TCAMs), analog CAM (ACAM) and multi-bit CAM (MCAM) designs based on various non-volatile memory (NVM) devices have been recently introduced and may offer higher density, better energy efficiency, and non-volatility. Furthermore, aside from the widely-used exact match based search, CAM-based approximate matches have been proposed to further extend the utility of CAMs to new application spaces. For this memory architecture, evaluating different CAM design options for a given application is becoming more challenging. This paper presents Eva-CAM, a circuit/architecture-level modeling and evaluation tool for CAMs. Eva-CAM supports TCAM, ACAM, and MCAM designs implemented in non-volatile memories, for both exact and approximate match types. It also allows for the exploration of CAM array structures and sensing circuits. Eva-CAM has been validated with HSPICE simulation results and chip measurements. A comprehensive case study is described for FeFET CAM design space exploration.
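As an illustrative aside (not taken from Eva-CAM itself), the ternary-match semantics that a TCAM evaluates across all rows in parallel can be sketched sequentially in a few lines of Python; the rule strings and query below are hypothetical:

```python
def tcam_search(entries, query):
    """Model of a ternary CAM lookup: each stored word is a string over
    {'0', '1', 'x'}, where 'x' is a don't-care bit. Real hardware compares
    every row against the query in parallel; here we iterate sequentially
    and return the indices of all matching rows."""
    def row_matches(word, q):
        return all(w in ('x', b) for w, b in zip(word, q))
    return [i for i, word in enumerate(entries) if row_matches(word, query)]

rules = ["10x1", "1xx1", "0000"]      # hypothetical stored entries
print(tcam_search(rules, "1011"))     # rows 0 and 1 match the query
```

An exact-match CAM is the special case with no `'x'` bits; analog and multi-bit CAMs generalize the per-cell comparison beyond this binary/ternary model.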
IP.3_7.2 HYBRID DIGITAL-DIGITAL IN-MEMORY COMPUTING
Speaker:
Muhammad Rashedul Haq Rashed, University of Central Florida, US
Authors:
Muhammad Rashedul Haq Rashed1, Sumit Kumar Jha2, Fan Yao1 and Rickard Ewetz1
1University of Central Florida, US; 2University of Texas at San Antonio, US
Abstract
In-memory computing (IMC) using emerging non-volatile memory promises exascale computing capabilities for a number of data-intensive workloads. The state-of-the-art solution for accelerating high-assurance applications is based on digital in-memory computing. Digital in-memory computing can be WRITE-based or READ-based, i.e., logic is evaluated while switching or without switching the state of the non-volatile resistive devices. All prominent studies for accelerating matrix-vector multiplication (MVM) based applications utilize a single digital logic style. However, we observe that WRITE-based and READ-based digital in-memory computing are advantageous for dense and sparse matrices, respectively. In this paper, we propose a new computing paradigm called hybrid digital-digital in-memory computing. The paper also introduces an automated synthesis tool for mapping computation to a hybrid architecture. The key idea is to first decompose the matrix into dense and sparse blocks. Next, bit-slicing is used to further decompose the dense blocks into sparse and dense parts. The dense (sparse) blocks are mapped to WRITE-based (READ-based) digital in-memory accelerators. The proposed paradigm is evaluated using 12 applications from various domains. Compared with WRITE-based IMC, the hybrid digital-digital paradigm improves energy and speed by 13X and 20X, at the expense of increasing the area by 151X. Compared with READ-based IMC, the hybrid paradigm improves energy, speed, and area by 264X, 198X, and 2996X, respectively.
IP.3_7.3 NEUROHAMMER: INDUCING BIT-FLIPS IN MEMRISTIVE CROSSBAR MEMORIES
Speaker:
Felix Staudigl, Institute for Communication Technologies and Embedded Systems, RWTH Aachen University, DE
Authors:
Felix Staudigl1, Hazem al Indari1, Daniel Schön2, Dominik Sisejkovic1, Farhad Merchant1, Jan Moritz Joseph1, Vikas Rana3, Stephan Menzel2 and Rainer Leupers1
1Institute for Communication Technologies and Embedded Systems, RWTH Aachen University, DE; 2Peter Grünberg Institut (PGI-7), Forschungszentrum Jülich GmbH & JARA-FIT, DE; 3Peter Grünberg Institut (PGI-10), Forschungszentrum Jülich GmbH, DE
Abstract
Emerging non-volatile memory (NVM) technologies offer unique advantages in energy efficiency, latency, and features such as computing-in-memory. Consequently, emerging NVM technologies are considered an ideal substrate for computation and storage in future-generation neuromorphic platforms. These technologies need to be evaluated for fundamental reliability and security issues. In this paper, we present NeuroHammer, a security threat in ReRAM crossbars caused by thermal crosstalk between memory cells. We demonstrate that bit-flips can be deliberately induced in ReRAM devices in a crossbar by systematically writing adjacent memory cells. A simulation flow is developed to evaluate NeuroHammer and the impact of physical parameters on the effectiveness of the attack. Finally, we discuss the security implications in the context of possible attack scenarios.

IP.3_8 Interactive presentations

Date: Wednesday, 23 March 2022
Time: 11:30 - 12:15 CET

Interactive Presentations (IPs) run simultaneously during a 45-minute slot. Authors of IPs are available to present their work and answer questions throughout the session.

Label Presentation Title
Authors
IP.3_8.1 A LOW-COST METHODOLOGY FOR EM FAULT EMULATION ON FPGA
Speaker:
Paolo Maistri, TIMA Laboratory, FR
Authors:
Paolo Maistri and Jiayun Po, TIMA Laboratory, FR
Abstract
In embedded systems, the presence of a security layer is now a well-established requirement. In order to guarantee a suitable level of performance and resistance against attacks, dedicated hardware implementations are often proposed to accelerate cryptographic computations in a controllable environment. On the other hand, these same implementations may be vulnerable to physical attacks, such as side-channel analysis or fault injection. In this scenario, the designer must be able to assess, as early as possible in the design flow, the robustness of the implementation (and of the adopted countermeasures) against several different threats. In this paper, we propose a methodology to characterize the robustness of a generic hardware design described at RTL against EM fault injections. Thanks to our framework, we are able to emulate EM faults on FPGA platforms, without the need for expensive equipment or lengthy experimental campaigns. We present a tool supporting our methodology and the first validation tests done on several AES designs, confirming the feasibility of the proposed approach.
IP.3_8.2 RELIABILITY ANALYSIS OF FINFET-BASED SRAM PUFS FOR 16NM, 14NM, AND 7NM TECHNOLOGY NODES
Speaker:
Shayesteh Masoumian, Intrinsic ID, NL
Authors:
Shayesteh Masoumian1, Georgios Selimis1, Rui Wang1, Geert-Jan Schrijen1, Said Hamdioui2 and Mottaqiallah Taouil2
1Intrinsic ID, NL; 2Delft University of Technology, NL
Abstract
SRAM Physical Unclonable Functions (PUFs) are today commercially used, among other things, for secure primitives such as key generation and authentication. The quality of the PUFs, and hence of the security primitives, depends on intrinsic variations which are technology dependent. Therefore, to sustain the commercial usage of PUFs for cutting-edge technologies, it is important to properly model and evaluate their reliability. In this work, we evaluate SRAM PUF reliability using within-class Hamming distance (WCHD) for 16nm, 14nm, and 7nm using simulations and silicon validation for both low-power and high-performance designs. The results show that our simulation models and expectations match the silicon measurements. From the experiments, we conclude the following: (1) the SRAM PUF is reliable in advanced FinFET technology nodes, i.e., the noise is low in 16nm, 14nm, and 7nm; (2) temperature variations have a marginal impact on the reliability; and (3) both low-power and high-performance SRAMs can be used as a PUF without excessive need for error correcting codes (ECCs).
IP.3_8.3 BOILS: BAYESIAN OPTIMISATION FOR LOGIC SYNTHESIS
Speaker:
Antoine Grosnit, Huawei Noah's Ark Lab, FR
Authors:
Antoine Grosnit1, Cedric Malherbe2, Xingchen Wan1, Rasul Tutunov1, Jun Wang3 and Haitham Bou Ammar1
1Huawei R&D London, GB; 2Huawei R&D Paris, FR; 3University College London, GB
Abstract
Optimising the quality-of-results (QoR) of circuits during logic synthesis is a formidable challenge necessitating the exploration of exponentially sized search spaces. While expert-designed operations aid in uncovering effective sequences, the increase in complexity of logic circuits favours automated procedures. To enable efficient and scalable solvers, we propose BOiLS, the first algorithm adapting Bayesian optimisation to navigate the space of synthesis operations. BOiLS requires no human intervention and trades-off exploration versus exploitation through novel Gaussian process kernels and trust-region constrained acquisitions. In a set of experiments on EPFL benchmarks, we demonstrate BOiLS's superior performance compared to state-of-the-art in terms of both sample efficiency and QoR values.

L.2 Panel: The future of conferences - what will DATE and the others be like?

Date: Wednesday, 23 March 2022
Time: 13:00 - 14:00 CET

Session chair:
Ian O'Connor, Lyon Institute of Nanotechnology, FR

Panellists:
David Atienza, École Polytechnique Fédérale de Lausanne (EPFL), CH
Enrico Macii, Politecnico di Torino, IT
Yiran Chen, Duke University, US
Tulika Mitra, National University of Singapore, SG

The panel aims to explore how conferences will be organized and attended after the Covid-19 pandemic, balancing time and cost sustainability, attendees' interests and needs, and the opportunities offered by technology.


21.1 Self-adaptive and Dynamic Resource Management, Learning at the Edge and Applications

Date: Wednesday, 23 March 2022
Time: 14:30 - 15:30 CET

Session chair:
Heba Khdr, Karlsruhe Institute of Technology, DE

Session co-chair:
Federico Corradi, iMEC, NL

Self-adaptive and runtime decision making is increasingly important for optimizing the extra-functional behaviour of modern systems. The first part of this session covers various techniques for optimizing specific objectives, such as performance, energy consumption, and accuracy, and applies these techniques to different parts of the systems. The second part of this session includes papers that advance the state of the art of machine learning and its applications at the edge. The fourth paper improves localization through an encoder-based framework, the fifth one proposes novel and efficient accelerator architectures for probabilistic reasoning models, and the sixth one presents a high-efficiency and low-cost framework for federated learning.

Time Label Presentation Title
Authors
14:30 CET 21.1.1 (Best Paper Award Candidate)
ACCURATE PROBABILISTIC MISS RATIO CURVE APPROXIMATION FOR ADAPTIVE CACHE ALLOCATION IN BLOCK STORAGE SYSTEMS
Speaker:
Yingtian Tang, University of Pennsylvania, US
Authors:
Rongshang Li1, Yingtian Tang2, Qiquan Shi3, Hui Mao4, Lei Chen4, Jikun Jin5, Peng Lu5 and Zhuo Cheng6
1University of Sydney, AU; 2University of Pennsylvania, US; 3Huawei Noah's Ark Lab, HK; 4Huawei Noah's Ark Lab, CN; 5Huawei Storage Product Line, CN; 6Tsinghua University and Huawei Storage Product Line, CN
Abstract
Cache plays an important role in storage systems. With better allocation of cache space to each storage device, total I/O latency can be reduced remarkably. To achieve this goal, we propose an Accurate Probabilistic miss ratio curve approximation for Adaptive Cache allocation (APAC) system. APAC can obtain near-optimal performance for allocating cache space with low overhead. Specifically, with a linear-time probabilistic approximation of reuse distance of all blocks inside each device, APAC can accurately estimate the miss ratio curve (MRC). Furthermore, APAC utilizes the MRCs to obtain the near-optimal configuration of cache allocation by dynamic programming. Experimental results show that APAC achieves higher accuracy in MRC approximation compared to the state-of-the-art methods, leading to higher hit ratio and lower latency of the block storage systems.
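APAC's linear-time probabilistic approximation is not reproduced here, but as a point of reference, the exact LRU miss ratio curve (MRC) it approximates can be computed from stack (reuse) distances. The following minimal sketch uses the classic quadratic-time Mattson stack approach on a made-up toy trace:

```python
from collections import OrderedDict

def miss_ratio_curve(trace, max_cache_size):
    """Exact LRU miss-ratio curve via stack (reuse) distances.

    Under LRU, a reference hits in a cache of size c iff its stack
    distance (position from the MRU end) is <= c, so one pass over the
    trace yields the miss ratio for every cache size at once."""
    stack = OrderedDict()              # most-recently-used block last
    hist = [0] * (max_cache_size + 1)  # hist[d] = refs with stack distance d
    for block in trace:
        if block in stack:
            keys = list(stack.keys())
            dist = len(keys) - keys.index(block)  # 1 for an MRU re-reference
            if dist <= max_cache_size:
                hist[dist] += 1
            stack.move_to_end(block)
        else:
            stack[block] = True        # cold miss: insert at MRU position
    total = len(trace)
    mrc, hits = [], 0
    for c in range(1, max_cache_size + 1):
        hits += hist[c]
        mrc.append((total - hits) / total)
    return mrc

trace = ["a", "b", "c", "a", "b", "c", "a"]
print(miss_ratio_curve(trace, 3))  # [1.0, 1.0, 0.42857142857142855]
```

With such per-device MRCs in hand, picking the allocation that minimizes total misses under a cache-capacity budget becomes the dynamic-programming step the abstract describes.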
14:34 CET 21.1.2 SGRM: STACKELBERG GAME-BASED RESOURCE MANAGEMENT FOR EDGE COMPUTING SYSTEMS
Speaker:
Manolis Katsaragakis, National TU Athens and KU Leuven, GR
Authors:
Antonios Karteris1, Manolis Katsaragakis2, Dimosthenis Masouros1 and Dimitrios Soudris1
1National TU Athens, GR; 2National TU Athens and KU Leuven, GR
Abstract
The incessant technological advancements of recent Internet of Things (IoT) networks have led to a rapidly increasing number of connected devices and workloads. Resource management is a key technique for such systems to operate efficiently. In this paper, we present SGRM, a game theory-based framework for dynamic resource management of IoT networks under CPU, memory, bandwidth and latency constraints. SGRM combines a novel execution time prediction mechanism along with Stackelberg games and Vickrey auctions in order to tackle the multi-objective problem of task offloading in a competitive Edge Computing system. We design, implement and evaluate our novel game theory-based framework over a real IoT system for a diverse set of interference scenarios and varying devices, showing that i) the proposed prediction mechanism can provide accurate predictions, achieving 2.3% absolute percentage error on average, ii) SGRM achieves near-optimal results and outperforms alternative solutions by up to 66.6% and iii) SGRM provides scalable, real-time and lightweight performance characteristics.
14:38 CET 21.1.3 RUNTIME ENERGY MINIMIZATION OF DISTRIBUTED MANY-CORE SYSTEMS USING TRANSFER LEARNING
Speaker:
Dainius Jenkus, Newcastle University, GB
Authors:
Dainius Jenkus, Fei Xia, Rishad Shafik and Alex Yakovlev, Newcastle University, GB
Abstract
The heterogeneity of computing resources continues to permeate into many-core systems, making energy efficiency a challenging objective. Existing rule-based and model-driven methods return sub-optimal energy efficiency and limited scalability as system complexity increases to the domain of distributed systems. This is exacerbated further by dynamic variations of workloads and quality-of-service (QoS) demands. This work presents a QoS-aware runtime management method for energy minimization using a transfer learning (TL) driven exploration strategy. It enhances standard Q-learning to improve both learning speed and operational optimality (i.e., QoS and energy). The core of our approach is a multi-dimensional knowledge transfer across a task's state-action space. It accelerates the learning of dynamic voltage/frequency scaling (DVFS) control actions for tuning power/performance trade-offs. Firstly, the method identifies and transfers already learned policies between explored and behaviorally similar states, referred to as Intra-Task Learning Transfer (ITLT). Secondly, if no similar "expert" states are available, it accelerates exploration at the local state level through Intra-State Learning Transfer (ISLT). A comparative evaluation of the approach indicates faster and more balanced exploration. This is shown through energy savings ranging from 7.30% to 18.06% and QoS improvements from 10.43% to 14.3%, when compared to existing exploration strategies. This method is demonstrated under WordPress and TensorFlow workloads on a server cluster.
14:42 CET 21.1.4 SIAMESE NEURAL ENCODERS FOR LONG-TERM INDOOR LOCALIZATION WITH MOBILE DEVICES
Speaker:
Saideep Tiku, Colorado State University, US
Authors:
Saideep Tiku and Sudeep Pasricha, Colorado State University, US
Abstract
WiFi fingerprinting-based indoor localization on smartphones is an emerging application domain for enhanced positioning and tracking of people and assets within indoor locales. Unfortunately, the transmitted signal characteristics from independently maintained WiFi access points (APs) vary greatly over time. Moreover, some of the WiFi APs visible at the initial deployment phase may be replaced or removed over time. These factors are often ignored and cause gradual and catastrophic degradation of indoor localization accuracy post-deployment, over weeks and months. We propose a Siamese neural encoder-based framework that offers up to 40% reduction in degradation of localization accuracy over time compared to the state-of-the-art in the area, without requiring any re-training.
14:46 CET 21.1.5 DISCRETE SAMPLERS FOR APPROXIMATE INFERENCE IN PROBABILISTIC MACHINE LEARNING
Speaker:
Shirui Zhao, KU Leuven, BE
Authors:
Shirui Zhao1, Nimish Shah1, Wannes Meert2 and Marian Verhelst3
1Department of Electrical Engineering, ESAT-MICAS, KU Leuven, BE; 2Departement of Computer Science, KU Leuven, BE; 3KU Leuven, BE
Abstract
Probabilistic reasoning models (PMs) and probabilistic inference bring advantages when dealing with small datasets or uncertainty in the observed data, and make it possible to integrate expert knowledge and create interpretable models. The main challenge of using these PMs in practice is that their inference is very compute-intensive. Therefore, custom hardware architectures for the exact and approximate inference of PMs have been proposed in the state of the art (SotA). The throughput, energy, and area efficiency of approximate PM inference accelerators are strongly dominated by the sampler blocks required to sample arbitrary discrete distributions. This paper proposes and studies novel discrete sampler architectures towards efficient and flexible hardware implementations for PM accelerators. Both cumulative distribution table (CDT) and Knuth-Yao (KY) based sampling algorithms are assessed, based on which different sampler hardware architectures were implemented. Innovation is brought in terms of a reconfigurable CDT sampling architecture with a flexible range and a reconfigurable Knuth-Yao sampling architecture that supports both flexible range and dynamic precision. All architectures are benchmarked on real-world Bayesian networks, demonstrating up to 13x energy efficiency benefits and 11x area efficiency improvement of the optimized reconfigurable Knuth-Yao sampler over the traditional linear CDT-based samplers used in the PM SotA.
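To illustrate the CDT baseline that the paper's reconfigurable designs build on (this is generic textbook sampling, not the authors' hardware), a cumulative distribution table sampler for an arbitrary discrete distribution can be sketched in software; the probability mass function below is made up:

```python
import bisect
import random

def build_cdt(pmf):
    """Cumulative distribution table for an arbitrary discrete pmf:
    cdt[i] = P(X <= i). In hardware this table is what the CDT sampler
    block stores and compares against a uniform random input."""
    cdt, acc = [], 0.0
    for p in pmf:
        acc += p
        cdt.append(acc)
    cdt[-1] = 1.0  # guard against floating-point rounding drift
    return cdt

def cdt_sample(cdt, rng=random):
    """Draw an index via binary search in the CDT, mirroring the
    O(log n) comparator tree of a table-based sampler."""
    return bisect.bisect_left(cdt, rng.random())

cdt = build_cdt([0.1, 0.2, 0.7])  # hypothetical 3-outcome distribution
rng = random.Random(0)
samples = [cdt_sample(cdt, rng) for _ in range(10_000)]
```

The Knuth-Yao alternative assessed in the paper instead walks a binary tree derived from the binary expansion of the probabilities, consuming close to the entropy-optimal number of random bits per sample.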
14:50 CET 21.1.6 HELCFL: HIGH-EFFICIENCY AND LOW-COST FEDERATED LEARNING IN HETEROGENEOUS MOBILE-EDGE COMPUTING
Speaker:
Yangguang Cui, East China Normal University, CN
Authors:
Yangguang Cui1, Kun Cao2, Junlong Zhou3 and Tongquan Wei1
1East China Normal University, CN; 2Jinan University, CN; 3Nanjing University of Science and Technology, CN
Abstract
Federated Learning (FL), an emerging distributed machine learning (ML) paradigm, empowers a large number of embedded devices (e.g., phones and cameras) and a server to jointly train a global ML model without centralizing user private data on the server. However, when deploying FL in a mobile-edge computing (MEC) system, the restricted communication resources of the MEC system and the heterogeneity and constrained energy of user devices have a severe impact on FL training efficiency. To address these issues, in this article, we design a distinctive FL framework, called HELCFL, to achieve high-efficiency and low-cost FL training. Specifically, by analyzing the theoretical foundation of FL, our HELCFL first develops a utility-driven and greedy-decay user selection strategy to enhance FL performance and reduce training delay. Subsequently, by analyzing and utilizing the slack time in FL training, our HELCFL introduces a device operating frequency determination approach to reduce training energy costs. Experiments verify that our HELCFL can improve the highest accuracy by up to 43.45%, achieve a training speedup of up to 275.03%, and save up to 58.25% of training energy costs compared to state-of-the-art baselines.
14:54 CET 21.1.7 Q&A SESSION
Authors:
Heba Khdr1 and Federico Corradi2
1Karlsruhe Institute of Technology, DE; 2IMEC, NL
Abstract
Questions and answers with the authors

21.2 Advances in defect detection and dependability

Date: Wednesday, 23 March 2022
Time: 14:30 - 15:30 CET

Session chair:
Leticia Maria Bolzani Poehls, RWTH Aachen University, DE

Session co-chair:
Ernesto Sanchez, Politecnico di Torino, IT

This session addresses advances in defect detection and dependability improvement. We cover a wide range of aspects: from hotspot detection and variability reduction to minimize the influence of processing, to design techniques that make latches resilient to radiation upsets of up to three nodes, and to improving security by masking the power signature.

Time Label Presentation Title
Authors
14:30 CET 21.2.1 HOTSPOT DETECTION VIA GRAPH NEURAL NETWORK
Speaker:
Shuyuan Sun, Fudan University, CN
Authors:
Shuyuan Sun1, Yiyang Jiang1, Fan Yang1, Bei Yu2 and Xuan Zeng1
1Fudan University, CN; 2The Chinese University of Hong Kong, HK
Abstract
Lithography hotspot detection is of great importance in chip manufacturing. It aims to find patterns that may incur defects in the early design stage. Inspired by the success of deep learning in computer vision, many works convert layouts into images, turning the hotspot detection problem into an image classification task. Traditional graph-based methods consume fewer computing resources and less detection time compared to image-based methods, but they raise too many false alarms. In this paper, a hotspot detection approach via graph neural networks (GNNs) is proposed. We also propose a novel representation model to map a layout to a graph, in which we introduce multi-dimensional features to encode the components of the layout. We then use a modified GNN to further process the extracted layout features and obtain an embedding of the local geometric relationship. Experimental results on the ICCAD2012 Contest benchmarks show our proposed approach can achieve over 10× speedup and fewer false alarms without loss of accuracy. On the ICCAD2020 benchmark, our model achieves 2.10% higher accuracy compared with the previous approach.
14:34 CET 21.2.2 FITACT: ERROR RESILIENT DEEP NEURAL NETWORKS VIA FINE-GRAINED POST-TRAINABLE ACTIVATION FUNCTIONS
Speaker:
Behnam Ghavami, Simon Fraser University, CA
Authors:
Behnam Ghavami1, Mani Sadati2, Zhenman Fang1 and Lesley Shannon1
1Simon Fraser University, CA; 2Independent Researcher, IR
Abstract
Deep neural networks (DNNs) are increasingly being deployed in safety-critical systems such as personal healthcare devices and self-driving cars. In such DNN-based systems, error resilience is a top priority since faults in DNN inference could lead to mispredictions and safety hazards. For latency-critical DNN inference on resource-constrained edge devices, it is nontrivial to apply conventional redundancy-based fault tolerance techniques. In this paper, we propose FitAct, a low-cost approach to enhance the error resilience of DNNs by deploying fine-grained post-trainable activation functions. The main idea is to precisely bound the activation value of each individual neuron via neuron-wise bounded activation functions, so that it can prevent fault propagation in the network. To avoid complex DNN model re-training, we propose to decouple accuracy training from resilience training, and develop a lightweight post-training phase to learn these activation functions with precise bound values. Experimental results on widely used DNN models such as AlexNet, VGG16, and ResNet50 demonstrate that FitAct outperforms state-of-the-art studies such as Clip-Act and Ranger in enhancing DNN error resilience for a wide range of fault rates, while adding manageable runtime and memory space overheads.
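FitAct's bound-learning procedure is in the paper; the underlying mechanism of a neuron-wise bounded activation, which stops a fault-inflated value from propagating to later layers, can be sketched as follows (the activation values and per-neuron bounds are hypothetical):

```python
def bounded_relu(x, bounds):
    """Neuron-wise bounded ReLU: neuron i is clamped to [0, bounds[i]].
    A bit-flip that inflates one activation to a huge value is thus
    capped at that neuron's learned bound instead of corrupting the
    downstream computation."""
    return [min(max(v, 0.0), b) for v, b in zip(x, bounds)]

# A fault flips a high-order bit and inflates the second activation:
x = [0.5, 2.0e6, -0.3]
bounds = [1.0, 1.5, 1.0]        # hypothetical per-neuron bounds learned post-training
print(bounded_relu(x, bounds))  # the corrupted value is clipped to 1.5
```

Learning one bound per neuron (rather than one per layer, as in coarser clipping schemes) is what the abstract's "fine-grained" refers to.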
14:38 CET 21.2.3 WRAP: WEIGHT REMAPPING AND PROCESSING IN RRAM-BASED NEURAL NETWORK ACCELERATORS CONSIDERING THERMAL EFFECT
Speaker:
Ing-Chao Lin, National Cheng Kung University, TW
Authors:
Po-Yuan Chen, Fang-Yi Gu, Yu-Hong Huang and Ing-Chao Lin, National Cheng Kung University, TW
Abstract
Resistive random-access memory (RRAM) has shown great potential for computing in memory (CIM) to support the requirements of high memory bandwidth and low power in neuromorphic computing systems. However, the accuracy of RRAM-based neural network (NN) accelerators can degrade significantly due to the intrinsic statistical variations of the resistance of RRAM cells, as well as the negative effects of high temperatures. In this paper, we propose a subarray-based thermal-aware weight remapping and processing framework (WRAP) to map the weights of a neural network model into RRAM subarrays. Instead of dealing with each weight individually, this framework maps weights into subarrays and performs subarray-based algorithms to reduce computational complexity while maintaining accuracy under thermal impact. Experimental results demonstrate that using our framework, the inference accuracy losses of four DNN models are less than 2% compared to the ideal results, and less than 1% with compensation applied, even when the surrounding temperature is around 360K.
14:42 CET 21.2.4 (Best Paper Award Candidate)
SELF-TERMINATED WRITE OF MULTI-LEVEL CELL RERAM FOR EFFICIENT NEUROMORPHIC COMPUTING
Speaker:
Zongwu Wang, Shanghai Jiao Tong University, CN
Authors:
Zongwu Wang1, Zhezhi He1, Rui Yang1, Shiquan Fan2, Jie Lin3, Fangxin Liu1, Yueyang Jia1, Chenxi Yuan2, Qidong Tang1 and Li Jiang1
1Shanghai Jiao Tong University, CN; 2Xi’an Jiaotong University, CN; 3University of Central Florida, US
Abstract
The Resistive Random-Access-Memory (ReRAM) in crossbar structure has shown great potential in accelerating the vector-matrix multiplication, owing to the fascinating computing complexity reduction (from O(n^2) to O(1)). Nevertheless, the ReRAM cells still encounter device programming variation and resistance drifting during computation (known as read disturbance), which significantly hamper its analog computing precision. Inspired by prior precise memory programming works, we propose a Self-Terminating Write (STW) circuit for Multi-Level Cell (MLC) ReRAM. In order to minimize the area overhead, the design heavily reuses inherent computing peripherals (e.g., Analog-to-Digital Converter and Trans-Impedance Amplifier) in conventional dot-product engine. Thanks to the fast and precise programming capability of our design, the ReRAM cell can possess 4 linear distributed conductance levels, with minimum latency used for intermediate resistance refreshing. Our comprehensive cross-layer (device/circuit/architecture) simulation indicates that the proposed MLC STW scheme can effectively obtain 2-bit precision via a single programming pulse. Besides, our design outperforms the prior write & verify scheme by 4.7x and 2x in programming latency and energy, respectively.
14:46 CET 21.2.5 SCLCRL: SHUTTLING C-ELEMENTS BASED LOW-COST AND ROBUST LATCH DESIGN PROTECTED AGAINST TRIPLE NODE UPSETS IN HARSH RADIATION ENVIRONMENTS
Speaker:
Aibin Yan, Anhui University, CN
Authors:
Aibin Yan1, Zhixing Li1, Shiwei Huang1, Zijie Zhai1, Xiangyu Cheng1, Jie Cui1, Tianming Ni2, Xiaoqing Wen3 and Patrick Girard4
1Anhui University, CN; 2Anhui Polytechnic University, CN; 3Kyushu Institute of Technology, JP; 4LIRMM / CNRS, FR
Abstract
As CMOS technology continues to scale down, nano-scale integrated circuits are becoming susceptible to harsh-radiation-induced soft errors, such as double-node upsets (DNUs) and triple-node upsets (TNUs). This paper presents a shuttling-C-elements-based low-cost and robust latch (SCLCRL) that can recover from any TNU in harsh radiation environments. The latch comprises seven primary storage nodes and seven secondary storage nodes. Each pair of primary nodes feeds a secondary node through one C-element (CE), and each pair of secondary nodes feeds a primary node through another CE, forming redundant feedback loops that robustly retain values. Simulation results validate all key TNU-recoverability features of the proposed latch. Simulation results also demonstrate that the proposed SCLCRL latch saves approximately 29% silicon area and 47% D-Q delay on average, at the cost of moderate power, compared with state-of-the-art TNU-recoverable reference latches of the same type.
14:50 CET 21.2.6 LEAKAGE POWER ANALYSIS IN DIFFERENT S-BOX MASKING PROTECTION SCHEMES
Speaker:
Javad Bahrami, University of Maryland Baltimore County, US
Authors:
Javad Bahrami1, Mohammad Ebrahimabadi1, Jean Luc Danger2, Sylvain Guilley3 and Naghmeh Karimi1
1University of Maryland Baltimore County, US; 2Télécom ParisTech, FR; 3Secure-IC, FR
Abstract
Internet-of-Things (IoT) devices are natural targets for side-channel attacks. Still, side-channel leakage can be complex: its modeling can be assisted by statistical tools. Projection of the leakage onto an orthonormal basis allows its structure to be understood, typically linear (1st-order leakage) or non-linear (sometimes referred to as glitches). In order to ensure the protection of cryptosystems, several masking methods have been published. Unfortunately, they follow different strategies, so it is hard to compare them. Namely, ISW is constructive, GLUT is systematic, RSM is a low-entropy version of GLUT, RSM-ROM is a further optimization aiming at balancing the leakage further, and TI aims at avoiding, by design, the leakage arising from glitches. In practice, no study has compared these styles on an equal basis. Accordingly, in this paper, we present a consistent methodology relying on a Walsh-Hadamard transform in this respect. We consider different masked implementations of substitution boxes of the PRESENT algorithm, as this function is the most leaking in symmetric cryptography. We show that ISW is the most secure among the considered masking implementations. Indeed, it takes strong advantage of the knowledge of the PRESENT substitution box equation. Tabulated masking schemes appear to provide a lesser amount of security compared to unprotected counterparts. The leakage is assessed over time, i.e., considering device aging, which contributes to mitigating the leakage differently according to the masking style.
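The Walsh-Hadamard projection underlying such a basis decomposition can be computed with the standard fast transform; this is generic textbook code rather than the authors' tooling, shown here only to make the projection step concrete:

```python
def fwht(a):
    """Fast Walsh-Hadamard transform (input length must be a power of two).
    Projecting a leakage function onto the Walsh basis separates its
    linear (1st-order) components from non-linear ones: each output
    coefficient is the correlation with one basis function."""
    a = list(a)  # work on a copy; O(n log n) butterfly passes follow
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a

print(fwht([1, 0, 1, 0]))  # [2, 2, 0, 0]: all weight in low-order coefficients
```

A constant input concentrates its entire spectrum in the first coefficient, which is one quick sanity check on the implementation.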
14:54 CET 21.2.7 Q&A SESSION
Authors:
Leticia Maria Bolzani Poehls1 and Ernesto Sanchez2
1RWTH Aachen University, DE; 2Politecnico di Torino, IT
Abstract
Questions and answers with the authors

21.3 Real-Time Systems and Technology

Date: Wednesday, 23 March 2022
Time: 14:30 - 15:30 CET

Session chair:
Renato Mancuso, Boston University, US

Session co-chair:
Yasmina Abdeddaim, UGE, FR

Modern embedded real-time systems are facing multiple challenges related to predictably scheduling concurrent and parallel task systems upon multi-core and heterogeneous platforms. In this session, we present a number of exciting papers addressing timing-predictability challenges related to memory-aware scheduling, parallel task systems, partitioning hypervisors, controller-area network (CAN) and energy optimization for embedded systems.

Time Label Presentation Title
Authors
14:30 CET 21.3.1 (Best Paper Award Candidate)
CACHE-AWARE SCHEDULABILITY ANALYSIS OF PREM COMPLIANT TASKS
Speaker:
Syed Aftab Rashid, CISTER, ISEP Polytechnic Institute of Porto, PT
Authors:
Syed Aftab Rashid1, Muhammad Ali Awan1, Pedro Souto2, Konstantinos Bletsas1 and Eduardo Tovar1
1CISTER, ISEP Polytechnic Institute of Porto, PT; 2University of Porto, PT
Abstract
The Predictable Execution Model (PREM) is useful for mitigating inter-core interference due to shared resources such as main memory. However, it is cache-agnostic, which makes schedulability analysis pessimistic via overestimation of prefetches and write-backs. In response, we present a cache-aware schedulability analysis for PREM tasks on fixed-task-priority partitioned multicores that bounds the number of cache prefetches and write-backs. Our approach identifies memory blocks loaded in the execution of a previous scheduling interval of each task that remain in the cache until its next scheduling interval. Doing so greatly reduces the estimated prefetches and write-backs. In experimental evaluations, our analysis improves the schedulability of PREM tasks by up to 55 percentage points.
14:34 CET 21.3.2 RECONCILING QOS AND CONCURRENCY IN NVIDIA GPUS VIA WARP-LEVEL SCHEDULING
Speaker:
Jayati Singh, University of Illinois Urbana-Champaign, US
Authors:
Jayati Singh1, Ignacio Sañudo Olmedo2, Nicola Capodieci2, Andrea Marongiu2 and Marco Caccamo3
1University of Illinois Urbana-Champaign, US; 2University of Modena and Reggio Emilia, IT; 3TU Munich, DE
Abstract
The widespread deployment of NVIDIA GPUs in latency-sensitive systems today requires predictable GPU multi-tasking, which cannot be trivially achieved. The NVIDIA CUDA API allows programmers to easily exploit the processing power provided by these massively parallel accelerators and is one of the major reasons behind their ubiquity. However, NVIDIA GPUs and the CUDA programming model favor throughput instead of latency and timing predictability. Hence, providing real-time and quality-of-service (QoS) properties to GPU applications presents an interesting research challenge. Such a challenge is paramount when considering simultaneous multikernel (SMK) scenarios, wherein kernels are executed concurrently within each streaming multiprocessor (SM). In this work, we explore QoS-based fine-grained multitasking in SMK via job arbitration at the lowest level of the GPU scheduling hierarchy, i.e., between warps. We present QoS-aware warp scheduling (QAWS) and evaluate it against state-of-the-art, kernel-agnostic policies seen in NVIDIA hardware today. Since the NVIDIA ecosystem lacks a mechanism to specify and enforce kernel priority at the warp granularity, we implement and evaluate our proposed warp scheduling policy on GPGPU-Sim. QAWS not only improves the response time of the higher priority tasks but also has comparable or better throughput than the state-of-the-art policies.
14:38 CET 21.3.3 COUNTING PRIORITY INVERSIONS: COMPUTING MAXIMUM ADDITIONAL CORE REQUESTS OF DAG TASKS
Speaker:
Morteza Mohaqeqi, Uppsala University, SE
Authors:
Morteza Mohaqeqi, Gaoyang Dai and Wang Yi, Uppsala University, SE
Abstract
Many parallel real-time applications can be modeled as DAG tasks. Guaranteeing timing constraints of such applications executed on multicore systems is challenging, especially for applications with non-preemptive execution blocks. The existing approach for timing analysis of such tasks with sporadic release relies on computing a bound on the interfering workload on a task, which depends on the number of priority inversions the task may experience. The number of priority inversions, in turn, is a function of the total number of additional cores a task instance may request after each node spawning. In this paper, we show that the previously proposed polynomial-time algorithm to compute the maximum number of additional core requests of a DAG is not correct, and we provide a counterexample. We show that the problem is in fact NP-hard. We then present an ILP formulation as an exact solution to the problem. Our evaluations show that the problem can be solved in a few minutes even for DAGs with hundreds of nodes.
14:42 CET 21.3.4 SHYPER: AN EMBEDDED HYPERVISOR APPLYING HIERARCHICAL RESOURCE ISOLATION STRATEGIES FOR MIXED-CRITICALITY SYSTEMS
Speaker:
Siran Li, Beihang University, CN
Authors:
YiCong Shen, Lei Wang, YuanZhi Liang, SiRan Li and Bo Jiang, School of Computer Science and Engineering, Beihang University, CN
Abstract
With the development of the IoT, modern embedded systems are evolving into general-purpose and mixed-criticality systems, where virtualization has become the key to guaranteeing isolation between tasks with different criticality. Traditional server-based hypervisors (KVM and Xen) are difficult to use in embedded scenarios for performance and security reasons. As a result, several new hypervisors (Jailhouse and Bao) have been proposed in recent years, which effectively solve the problems above through static partitioning. However, this inflexible resource isolation strategy assumes no resource sharing across guests, which greatly reduces resource utilization and VM scalability, and prevents these hypervisors from simultaneously fulfilling the differentiated demands of VMs conducting different tasks. This paper proposes an efficient and real-time embedded hypervisor, "Shyper", aiming at providing differentiated services for VMs with different criticality. To achieve that, Shyper supports fine-grained hierarchical resource isolation strategies and introduces several novel "VM-Exit-less" real-time virtualization techniques, which grant users the flexibility to strike a trade-off between a VM's resource utilization and its real-time performance. In this paper, we also compare Shyper with other mainstream hypervisors (KVM, Jailhouse, etc.) to evaluate its feasibility and effectiveness.
14:46 CET 21.3.5 RESPONSE TIME ANALYSIS FOR ENERGY-HARVESTING MIXED-CRITICALITY SYSTEMS
Speaker:
Kankan Wang, Northeastern University, CN
Authors:
Kankan Wang, Yuhan Lin and Qingxu Deng, Northeastern University, CN
Abstract
With the increasing demand for real-time computing applications on energy-harvesting embedded devices, which are deployed wherever it is not possible or practical to recharge, worst-case performance analysis becomes crucial. However, it is difficult to bound the worst-case response time of tasks under both timing and energy constraints due to the uncertainty of harvested energy. Motivated by this, this paper studies response time analysis for Energy-Harvesting Mixed-Criticality (EHMC) systems. We present a schedulability analysis algorithm that extends the Adaptive Mixed Criticality (AMC) approach to EHMC systems. Furthermore, we develop two response time bounds for it. To the best of our knowledge, this is the first work on response time analysis for EHMC systems. Finally, we examine both the effectiveness and the tightness of the bounds through experiments.
14:50 CET 21.3.6 LATENCY ANALYSIS OF SELF-SUSPENDING TASK CHAINS
Speaker:
Tomasz Kloda, TU Munich, DE
Authors:
Tomasz Kloda1, Jiyang Chen2, Antoine Bertout3, Lui Sha2 and Marco Caccamo1
1TU Munich, DE; 2University of Illinois at Urbana-Champaign, US; 3LIAS, Université de Poitiers, ISAE-ENSMA, FR
Abstract
Many cyber-physical systems are offloading computation-heavy programs to hardware accelerators (e.g., GPU and TPU) to reduce execution time. These applications self-suspend between offloading data to the accelerators and obtaining the returned results. Previous efforts have shown that self-suspending tasks can cause scheduling anomalies, but none has examined inter-task communication. This paper explores the data chain latency of self-suspending tasks with periodic activation and asynchronous message passing. We first present the cause of suspension-induced delays and a worst-case latency analysis. We then propose a rule for utilizing the hardware co-processors to reduce data chain latency, together with a schedulability analysis. Simulation results show that the proposed strategy can improve overall latency while preserving system schedulability.
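As a rough illustration of the kind of end-to-end bound involved, here is the classical sum-of-periods-plus-response-times latency bound for an asynchronous periodic chain. This is a textbook bound, not the paper's tighter suspension-aware analysis; all numbers are invented:

```python
# Back-of-the-envelope sketch: for a chain of periodic tasks communicating
# via asynchronous registers, a classical worst-case end-to-end latency
# bound sums, over the chain, each task's period plus its worst-case
# response time (WCRT). Self-suspension would inflate the WCRT terms.

def chain_latency_bound(tasks):
    """tasks: list of (period, wcrt) tuples along the data chain."""
    return sum(period + wcrt for period, wcrt in tasks)

# Three-stage chain, e.g. sense -> offload/compute -> actuate (times in ms).
print(chain_latency_bound([(10, 2), (20, 8), (10, 3)]))  # 53
```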
14:54 CET 21.3.7 Q&A SESSION
Authors:
Renato Mancuso1 and Yasmina Abdeddaïm2
1Boston University, US; 2LIGM, Univ Gustave Eiffel, CNRS, FR
Abstract
Questions and answers with the authors

21.4 Defense Techniques for Secure and Trustworthy Systems

Date: Wednesday, 23 March 2022
Time: 14:30 - 15:30 CET

Session chair:
Sophie Dupuis, LIRMM, University of Montpellier, FR

Session co-chair:
Elif Bilge Kavun, Univ Passau, DE

Building novel defense mechanisms to thwart attacks towards real-world systems is vital due to the valuable assets in such systems that need to be protected. This session focuses on defense techniques providing countermeasures against side-channel attacks and hardware Trojans. The contributions in this session include conventional as well as novel machine learning methods for the detection of hardware Trojans, protection mechanisms against side-channel analysis of chips and neural networks, and exploitation of sequentiality and synthesis flexibility in logic obfuscation for thwarting SAT Attacks.

Time Label Presentation Title
Authors
14:30 CET 21.4.1 COUNTERACT SIDE-CHANNEL ANALYSIS OF NEURAL NETWORKS BY SHUFFLING
Speaker:
Manuel Brosch, TU Munich, DE
Authors:
Manuel Brosch1, Matthias Probst1 and Georg Sigl2
1TU Munich, DE; 2TU Munich / Fraunhofer Institute for Applied and Integrated Security (AISEC), DE
Abstract
Machine learning is becoming an essential part of almost every electronic device. Implementations of neural networks are mostly optimized for computational performance or memory footprint. Nevertheless, security is also important in order to keep the network secret and protect the intellectual property associated with it. In particular, since neural network implementations have been demonstrated to be vulnerable to side-channel analysis, powerful and computationally cheap countermeasures are in demand. In this work, we apply a shuffling countermeasure to a microcontroller implementation of a neural network to prevent side-channel analysis. The countermeasure is effective while the computational overhead is low. We investigate the extensions necessary for our countermeasure, and how shuffling increases the effort for an attack in theory. In addition, we demonstrate the increase in effort for an attacker through experiments on real side-channel measurements. Based on the mechanism of shuffling and our experimental results, we conclude that an attack on a commonly used neural network with shuffling is no longer feasible in a reasonable amount of time.
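The core shuffling idea can be sketched in a few lines. The plain dense layer below is our simplification for illustration, not the paper's microcontroller implementation:

```python
# Sketch of shuffling as a side-channel countermeasure: randomly permuting
# the order in which a layer's neurons are evaluated leaves the output
# unchanged, but decorrelates the power trace from any fixed neuron's
# time slot, raising the number of traces an attacker needs.
import random

def dense_layer_shuffled(x, weights, rng):
    """Compute y[j] = sum_i x[i] * W[j][i], visiting neurons in random order."""
    y = [0.0] * len(weights)
    order = list(range(len(weights)))
    rng.shuffle(order)                 # fresh permutation per inference
    for j in order:                    # neuron j's time slot now varies
        y[j] = sum(xi * wij for xi, wij in zip(x, weights[j]))
    return y

x = [1.0, 2.0]
W = [[0.5, 0.5], [1.0, -1.0], [2.0, 0.0]]
print(dense_layer_shuffled(x, W, random.Random(0)))  # [1.5, -1.0, 2.0]
```

Whatever permutation is drawn, the functional result is identical; only the temporal alignment of the leakage changes.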
14:34 CET 21.4.2 GNN4GATE: A BI-DIRECTIONAL GRAPH NEURAL NETWORK FOR GATE-LEVEL HARDWARE TROJAN DETECTION
Speaker:
Dong Cheng, College of Computer and Data Science, Fuzhou University, CN
Authors:
Dong Cheng1, Chen Dong1, Wenwu He2, Zhenyi Chen3 and Yi Xu1
1Fuzhou University, CN; 2Fujian University of Technology, CN; 3University of South Florida, US
Abstract
Hardware is the physical foundation of cyberspace, and chips are its core components; a security flaw in a chip can therefore have far-reaching consequences. Hardware Trojans (HTs) are malicious circuits and constitute the primary security issue of chips. Recently, a series of machine learning-based HT detection methods were proposed. However, some shortcomings still deserve further consideration, such as relying too heavily on manual feature extraction, losing some signal propagation structure information, and difficulty in tracking HTs' locations and adapting to various types of HTs. To address the above challenges, this paper proposes a gate-level HT detection method based on Graph Neural Networks (GNNs), named GNN4Gate, which is a golden-free Trojan-gate identification technology. Specifically, a special coding method combining logic gate type and port connection information is developed for circuit graph modeling. Based on this, taking logic gates as the classification objects, an automatic GNN detection architecture based on a Bi-directional Graph Convolutional Network (Bi-GCN) is developed to aggregate both the circuit signal propagation (forward) and dispersion (backward) structure features from the circuit graph. The proposed method is evaluated on Trust-Hub benchmarks with different functional HTs; the average True Positive Rate (Recall) is 87.14%, and the average True Negative Rate is 99.73%. The experimental results demonstrate that GNN4Gate is sufficiently accurate compared to the state-of-the-art gate-level detection works.
14:38 CET 21.4.3 GOLDEN MODEL-FREE HARDWARE TROJAN DETECTION BY CLASSIFICATION OF NETLIST MODULE GRAPHS
Speaker:
Alexander Hepp, TU Munich, DE
Authors:
Alexander Hepp1, Johanna Baehr1 and Georg Sigl2
1TU Munich, DE; 2TU Munich/Fraunhofer AISEC, DE
Abstract
In a world where increasingly complex integrated circuits are manufactured in supply chains across the globe, hardware Trojans are an omnipresent threat. State-of-the-art methods for Trojan detection often require a golden model of the device under test. Other methods that operate on the netlist without a golden model can not handle complex designs and operate on Trojan-specific sets of netlist graph features. In this work, we propose a novel machine-learning-based method for hardware Trojan detection. Our method first uses a library of known malicious and benign modules in hierarchical designs to train an eXtreme Gradient Boosted Tree Classifier (XGBClassifier). For training, we generate netlist graphs of each hierarchical module and calculate feature vectors comprising structural characteristics of these graphs. After the training phase, we can analyze the synthesized hierarchical modules of an unknown design under test. The method calculates a feature vector for each module. With this feature vector, each module can be classified into either benign or malicious by the previously trained XGBClassifier. After classifying all modules, we derive a classification for all standard cells in the design under test. This technique allows the identification of hardware Trojan cells in a design and highlights regions of interest to direct further reverse engineering efforts. Experiments show that this approach performs with >97% Sensitivity and Specificity across available and generated hardware Trojan benchmarks and can be applied to more complex designs than previous netlist-based methods while maintaining similar computational complexity.
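As an illustration of the feature-vector step described above — turning a module's netlist graph into a fixed-length vector of structural characteristics for a boosted-tree classifier — here is a hedged sketch. The chosen features are invented for illustration and are not the paper's actual feature set:

```python
# Sketch: map a directed netlist graph (node -> successor list) to a small
# vector of structural features. A classifier such as the paper's
# XGBClassifier would be trained on vectors like these, one per module.

def graph_features(adj):
    """adj: dict mapping each node to its list of successor nodes."""
    nodes = set(adj) | {v for succs in adj.values() for v in succs}
    n_nodes = len(nodes)
    n_edges = sum(len(succs) for succs in adj.values())
    out_deg = {u: len(adj.get(u, [])) for u in nodes}
    max_fanout = max(out_deg.values())
    avg_fanout = n_edges / n_nodes
    return [n_nodes, n_edges, max_fanout, round(avg_fanout, 3)]

# Tiny example: two gates driving one output cell.
g = {"g1": ["out"], "g2": ["out"], "out": []}
print(graph_features(g))  # [3, 2, 1, 0.667]
```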
14:42 CET 21.4.4 JANUS-HD: EXPLOITING FSM SEQUENTIALITY AND SYNTHESIS FLEXIBILITY IN LOGIC OBFUSCATION TO THWART SAT ATTACK WHILE OFFERING STRONG CORRUPTION
Speaker:
Leon Li, University of California, San Diego, US
Authors:
Leon Li1 and Alex Orailoglu2
1University of California, San Diego, US; 2University of California, San Diego, US
Abstract
Logic obfuscation has been proposed as a countermeasure towards chip counterfeiting and IP piracy by obfuscating circuit designs with a key-controlled locking mechanism. However, the extensive output corruption of early key gate based logic obfuscation techniques has exposed them to effective SAT attacks. While current SAT resilient logic obfuscation techniques succeed in undermining the attack by offering near-trivial output corruption, they do so at the expense of a drastic reduction in functional and structural protection scope. In this work, we present JANUS-HD, based on novel insights that succeed in delivering the heretofore elusive goal of simultaneously boosting corruptibility and foiling SAT attacks. JANUS-HD obfuscates an FSM through diverse FF configurations for different transitions with the overall configuration setting as the obfuscation secret. A key-controlled Hamming distance comparator controls the obfuscation status at the minimized number of entrance states identified through a custom graph partitioning algorithm. Reliance on the inherent state transition patterns extends the obfuscation benefits to non-entrance states without exposing any additional key space pruning trace. We leverage the flexibility of state encoding and equivalence-based FSM transformations to generate an obfuscated netlist at low overhead using standard synthesis tools. Finally, we present a scan chain crippling mechanism that delivers unfettered scan chain access while eradicating any key trace leakage in the scan mode, thus thwarting chosen-input attacks aimed at the Hamming distance comparator. We illustrate through experiments that JANUS-HD delivers obfuscation scope improvements of up to 45.5x over the state-of-the-art, establishing the first cost-effective solution to offer a broad yet attack-resilient obfuscation scope against supply chain threats.
14:46 CET 21.4.5 TRILOCK: IC PROTECTION WITH TUNABLE CORRUPTIBILITY AND RESILIENCE TO SAT AND REMOVAL ATTACKS
Speaker:
Yuke Zhang, University of Southern California, US
Authors:
Yuke Zhang, Yinghua Hu, Pierluigi Nuzzo and Peter Beerel, University of Southern California, US
Abstract
Sequential logic locking has been studied over the last decade as a method to protect sequential circuits from reverse engineering. However, most of the existing sequential logic locking techniques are threatened by increasingly more sophisticated SAT-based attacks, efficiently using input queries to a SAT solver to rule out incorrect keys, as well as removal attacks based on structural analysis. In this paper, we propose TriLock, a sequential logic locking method that simultaneously addresses these vulnerabilities. TriLock can achieve high, tunable functional corruptibility while still guaranteeing exponential queries to the SAT solver in a SAT-based attack. Further, it adopts a state re-encoding method to obscure the boundary between the original state registers and those inserted by the locking method, thus making it more difficult to detect and remove the locking-related components.
14:50 CET 21.4.6 Q&A SESSION
Authors:
Sophie Dupuis1 and Elif Bilge Kavun2
1LIRMM, FR; 2University of Passau, DE
Abstract
Questions and answers with the authors

22.1 Heterogeneous system-on-chip design methods

Date: Wednesday, 23 March 2022
Time: 15:40 - 16:30 CET

Session chair:
Lana Josipovic, ETH Zurich, CH

Session co-chair:
John Wickerson, Imperial College, GB

The session presents various design methods addressing an array of important challenges in heterogeneous system-on-chip design. The methods cover not only system-level methods for FPGA and NoC architectures, but also high-level synthesis solutions for performance improvements, power estimation, energy efficiency, and data/IP protection. We complete the session with two interactive presentations about coarse-grained reconfigurable architectures and cloud systems.

Time Label Presentation Title
Authors
15:40 CET 22.1.1 UNDERSTANDING AND MITIGATING MEMORY INTERFERENCE IN FPGA-BASED HESOCS
Speaker:
Gianluca Brilli, University of Modena and Reggio Emilia, IT
Authors:
Gianluca Brilli, Alessandro Capotondi, Paolo Burgio and Andrea Marongiu, University of Modena and Reggio Emilia, IT
Abstract
Like most high-end embedded systems, FPGA-based systems-on-chip (SoC) are increasingly adopting heterogeneous designs, where CPU cores, the configurable logic and other ICs all share the interconnect and main memory (DRAM) controller. This paradigm is scalable and reduces production costs and time-to-market, but creates resource contention issues, which ultimately affect the programs' timing. This problem has been widely studied on CPU- and GPU-based systems, along with strategies to mitigate such effects, but little has been done so far to systematically study the problem on FPGA-based SoCs. This work provides an in-depth analysis of memory interference on such systems, targeting two state-of-the-art commercial FPGA SoCs. We also discuss architectural support for Controlled Memory Request Injection (CMRI), a technique that has proven effective at reducing the bandwidth under-utilization implied by naive schemes that solve the interference problem by only allowing mutually exclusive access to the shared resources. Our experimental results show that: i) memory interference can slow down CPU tasks by up to 16x in the tested FPGA-based SoCs; ii) CMRI makes it possible to exploit more than 40% of the memory bandwidth available to FPGA accelerators (normally completely unused in PREM-like schemes), keeping the slowdown due to interference below 10%.
15:44 CET 22.1.2 (Best Paper Award Candidate)
POWERGEAR: EARLY-STAGE POWER ESTIMATION IN FPGA HLS VIA HETEROGENEOUS EDGE-CENTRIC GNNS
Speaker:
Zhe Lin, Peng Cheng Laboratory, CN
Authors:
Zhe Lin1, Zike Yuan2, Jieru Zhao3, Wei Zhang4, Hui Wang1 and Yonghong Tian5
1Peng Cheng Laboratory, CN; 2University of Auckland, NZ; 3Shanghai Jiao Tong University, CN; 4Hong Kong University of Science and Technology, HK; 5Peking University & Peng Cheng Laboratory, CN
Abstract
Power estimation is the basis of many hardware optimization strategies. However, it is still challenging to offer accurate power estimation at an early stage such as high-level synthesis (HLS). In this paper, we propose PowerGear, a graph-learning-assisted power estimation approach for FPGA HLS, which features high accuracy, efficiency and transferability. PowerGear comprises two main components: a graph construction flow and a customized graph neural network (GNN) model. Specifically, in the graph construction flow, we introduce buffer insertion, datapath merging, graph trimming and feature annotation techniques to transform HLS designs into graph-structured data, which encode both intra-operation micro-architectures and inter-operation interconnects annotated with switching activities. Furthermore, we propose a novel power-aware heterogeneous edge-centric GNN model which effectively learns heterogeneous edge semantics and structural properties of the constructed graphs via edge-centric neighborhood aggregation, and fits the formulation of dynamic power. Compared with on-board measurement, PowerGear estimates total and dynamic power for new HLS designs with errors of 3.60% and 8.81%, respectively, outperforming prior research approaches as well as the commercial product Vivado. In addition, PowerGear demonstrates a speedup of 4x over the Vivado power estimator. Finally, we present a case study in which PowerGear is exploited to facilitate design space exploration for FPGA HLS, leading to a performance gain of up to 11.2%, compared with methods using state-of-the-art predictive models.
15:48 CET 22.1.3 ENERGY EFFICIENT, REAL-TIME AND RELIABLE TASK DEPLOYMENT ON NOC-BASED MULTICORES WITH DVFS
Speaker:
Lei Mo, Southeast University, CN
Authors:
Lei Mo1, Qi Zhou1, Angeliki Kritikakou2 and Ji Liu3
1Southeast University, CN; 2Univ Rennes, Inria, CNRS, IRISA, FR; 3Baidu Research, CN
Abstract
Task deployment plays an important role in the overall system performance, especially for complex architectures, including several cores with Dynamic Voltage and Frequency Scaling (DVFS) and Network-on-Chips (NoC). Task deployment affects not only the energy consumption but also the real-time response and reliability of the system. In this work, a task deployment approach is proposed to optimize the overall system energy consumption, including computation of the cores and communication of the NoC, under task reliability and real-time constraints. More precisely, the task deployment approach combines task allocation and scheduling, frequency assignment, task duplication, and multi-path data routing. The task deployment problem is formulated using mixed-integer non-linear programming. To find the optimal solution, the original problem is equivalently transformed to mixed-integer linear programming, and solved by state-of-the-art solvers. Furthermore, a decomposition-based heuristic, with low computational complexity, is proposed to deal with scalability. Finally, extensive simulations evaluate the proposed methods.
15:52 CET 22.1.4 COXHE: A SOFTWARE-HARDWARE CO-DESIGN FRAMEWORK FOR FPGA ACCELERATION OF HOMOMORPHIC COMPUTATION
Speaker:
Mingqin Han, Shandong University, CN
Authors:
Mingqin Han1, Yilan Zhu1, Qian Lou2, Zimeng Zhou1, Shanqing Guo1 and Lei Ju1
1Shandong University, CN; 2Indiana University, US
Abstract
Data privacy has become a crucial concern in the AI and big data era. Fully homomorphic encryption (FHE) is a promising data privacy protection technique where the entire computation is performed on encrypted data. However, the dramatic increase of the computation workload restricts the use of FHE in real-world applications. In this paper, we propose an FPGA accelerator design framework for CKKS-based HE. Since key-switch operations are the primary performance bottleneck of FHE computation, we propose a low latency design of the key-switch module with reduced intra-operation data dependency. Compared with the state-of-the-art Verilog-based FPGA key-switch implementation, the proposed high-level synthesis (HLS) based design reduces the operation latency by 40%. Furthermore, we propose an automated design space exploration framework which generates optimal encryption parameters and accelerators for a given application kernel and the target FPGA device. Experimental results for a set of real HE application kernels on different FPGA devices show that our HLS-based flexible design framework produces substantially better accelerator designs compared with a fixed-parameter HE accelerator in terms of security, approximation error, and overall performance.
15:56 CET 22.1.5 A COMPOSABLE DESIGN SPACE EXPLORATION FRAMEWORK TO OPTIMIZE BEHAVIORAL LOCKING
Speaker:
Christian Pilato, Politecnico di Milano, IT
Authors:
Luca Collini1, Ramesh Karri2 and Christian Pilato1
1Politecnico di Milano, IT; 2NYU, US
Abstract
Globalization of the integrated circuit (IC) supply chain exposes designs to security threats such as reverse engineering and intellectual property (IP) theft. Designers may want to protect specific high-level synthesis (HLS) optimizations or micro-architectural solutions of their designs. Hence, protecting the IP of ICs is essential. Behavioral locking is an approach to thwart these threats by operating at high levels of abstraction instead of reasoning on the circuit structure. Like any security protection, behavioral locking requires additional area. Existing locking techniques have a different impact on security and overhead, but they do not explore the effects of alternatives when making locking decisions. We develop a design-space exploration (DSE) framework to optimize behavioral locking for a given security metric. For instance, we optimize differential entropy under area or key-bit constraints. We define a set of heuristics to score each locking point by analyzing the system dependence graph of the design. The solution yields better results for 92% of the cases when compared to baseline, state-of-the-art (SOTA) techniques. The approach has results comparable to evolutionary DSE while requiring 100x to 400x less computational time.
16:00 CET 22.1.6 Q&A SESSION
Authors:
Lana Josipovic1 and John Wickerson2
1ETH Zurich, CH; 2Imperial College London, GB
Abstract
Questions and answers with the authors

22.2 Power, Thermal and Performance Management for Advanced Computing Systems

Date: Wednesday, 23 March 2022
Time: 15:40 - 16:30 CET

Session chair:
Pascal Vivet, CEA-LIST, FR

Session co-chair:
Andrea Bartolini, Bologna University, IT

This session discusses power and temperature management and performance gain for computing systems. The first two papers aim to advance energy management for energy-harvesting wearable devices and multi-core systems using federated reinforcement learning. The following two papers present thermal management methods for processor systems, focusing on 3D integration and cache contention modeling, respectively. The last paper boosts performance with smart cache prefetching.

Time Label Presentation Title
Authors
15:40 CET 22.2.1 DIET: A DYNAMIC ENERGY MANAGEMENT APPROACH FOR WEARABLE HEALTH MONITORING DEVICES
Speaker:
Nuzhat Yamin, Washington State University, US
Authors:
Nuzhat Yamin, Ganapati Bhat and Jana Doppa, Washington State University, US
Abstract
Wearable devices are becoming increasingly popular for health and activity monitoring applications. These devices typically include small rechargeable batteries to improve user comfort. However, the small battery capacity leads to limited operating life, requiring frequent recharging. Recent research has proposed energy harvesting using light and user motion to improve the lifetime of wearable devices. Most energy harvesting approaches assume that the placement of the energy harvesting device and sensors required for health monitoring are the same. However, this assumption does not hold for several real-world applications. For example, motion energy harvesting using piezoelectric sensors is limited to the knees and elbows, while a sensor for heart rate monitoring must be placed on the chest for optimal performance. To address this challenge, we propose a novel dynamic energy management approach referred to as DIET for wearable health applications enabled by multiple sensors and energy harvesting devices. The key idea behind DIET is to harvest energy from multiple sources and optimally allocate it to each sensor using a lightweight optimization algorithm such that the overall utility for applications is maximized. Experiments on real-world data from four users over 30 days show that the DIET approach achieves utility within 10% of an offline Oracle.
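The allocation step in the abstract above — distributing harvested energy across sensors to maximize overall application utility — can be pictured with a toy greedy optimizer. Sensors, utilities and budgets below are invented for illustration; DIET's actual lightweight algorithm may differ:

```python
# Toy sketch: fund one energy unit at a time, always picking the most
# valuable remaining unit across all sensors. This greedy pass is optimal
# when each sensor's per-unit utilities are non-increasing (diminishing
# returns), a common assumption in utility-based energy allocation.

def allocate_energy(budget, step_utilities):
    """step_utilities: {sensor: [utility of 1st unit, 2nd unit, ...]}.
    Returns {sensor: units funded} under an integer energy budget."""
    alloc = {s: 0 for s in step_utilities}
    candidates = [(u, s, i) for s, us in step_utilities.items()
                  for i, u in enumerate(us)]
    for u, s, i in sorted(candidates, reverse=True):
        if budget == 0:
            break
        if i == alloc[s]:          # fund each sensor's units in order
            alloc[s] += 1
            budget -= 1
    return alloc

# The heart-rate sensor's first unit is worth more than the motion
# sensor's second, so it gets funded first.
print(allocate_energy(3, {"hr": [10, 4], "motion": [6, 5]}))
```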
15:44 CET 22.2.2 IMPROVE THE STABILITY AND ROBUSTNESS OF POWER MANAGEMENT THROUGH MODEL-FREE DEEP REINFORCEMENT LEARNING
Speaker:
Lin Chen, Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, HK
Authors:
Lin Chen1, Xiao Li2 and Jiang Xu3
1Electronic and Computer Engineering Department, AI Chip Center for Emerging Smart Systems, The Hong Kong University of Science and Technology, HK; 2Electronic and Computer Engineering Department, The Hong Kong University of Science and Technology, HK; 3Microelectronics Thrust, Electronic and Computer Engineering Department, AI Chip Center for Emerging Smart Systems, The Hong Kong University of Science and Technology, HK
Abstract
Achieving high performance with low energy consumption has become a primary design objective in multi-core systems. Recently, power management based on reinforcement learning has shown great potential in adapting to dynamic environments without much prior knowledge. However, conventional Q-learning (QL) algorithms adopted in most existing works encounter serious problems with scalability, instability, and overestimation. In this paper, we present a deep reinforcement learning-based approach to improve the stability and robustness of power management while reducing the energy-delay product (EDP) under user-specified performance requirements. The comprehensive status of the system is monitored periodically, making our controller sensitive to environmental change. To further improve the learning effectiveness, knowledge sharing among multiple devices is implemented in our approach. Experimental results on multiple realistic applications show that the proposed method can reduce instability by up to 68% compared with QL. Through knowledge sharing among multiple devices, our federated approach achieves around 4.8% EDP improvement over QL on average.
15:48 CET 22.2.3 (Best Paper Award Candidate)
COREMEMDTM: INTEGRATED PROCESSOR CORE AND 3D MEMORY DYNAMIC THERMAL MANAGEMENT FOR IMPROVED PERFORMANCE
Speaker:
Lokesh Siddhu, Indian Institute of Technology, Delhi, IN
Authors:
Lokesh Siddhu, Rajesh Kedia and Preeti Ranjan Panda, Indian Institute of Technology Delhi, IN
Abstract
The growing performance of processors and 3D memories has resulted in higher power densities and temperatures. Dynamic thermal management (DTM) policies for processor cores and memory have received significant research attention, but existing solutions address processors and 3D memories independently, which causes overcompensation, and there is a need to coordinate the DTM of the two subsystems. Further, existing CPU DTM policies slow down heated cores significantly, increasing the overall execution time and performance overheads. We propose CoreMemDTM, a technique for integrating processor core and 3D memory DTM policies that attempts to minimize performance overheads. We suggest employing DTM depending on the thermal margin since safe temperature thresholds might differ for the two subsystems. We propose a stall-balanced core DVFS policy for core thermal management that enables distributed cooling, decreasing overheads. We evaluate CoreMemDTM using ten different SPEC CPU2017 workloads across various safe temperature thresholds and observe average execution time and energy improvements of 14% and 36% compared to state-of-the-art thermal management policies.
15:52 CET 22.2.4 THERMAL- AND CACHE-AWARE RESOURCE MANAGEMENT BASED ON ML-DRIVEN CACHE CONTENTION PREDICTION
Speaker:
Mohammed Bakr Sikal, Karlsruhe Institute of Technology, DE
Authors:
Mohammed Bakr Sikal, Heba Khdr, Martin Rapp and Joerg Henkel, Karlsruhe Institute of Technology, DE
Abstract
While on-chip many-core systems enable a large number of applications to run in parallel, the increased overall performance may come at the cost of complicating the performance constraints of individual applications due to contention on shared resources. For instance, competition for the last-level cache among concurrently running applications may slow down execution and potentially violate individual performance constraints. Clustered many-cores reduce cache contention at the chip level by sharing caches only at the cluster level. To reduce cache contention within a cluster, state-of-the-art techniques aim to co-map a memory-intensive application with a compute-intensive application onto one cluster. However, compute-intensive applications typically consume high power, and therefore executing another application on nearby cores may lead to high temperatures. Hence, there is a trade-off between cache contention and temperature. This paper is the first to consider this trade-off, through a novel thermal- and cache-aware resource management technique. We build a neural network (NN)-based model to predict the slowdown of application execution induced by cache contention; its predictions feed our resource management technique, which then optimizes the application mapping and selects the voltage/frequency levels of the clusters to compensate for the potential contention-induced slowdown. Thereby, it meets the performance constraints while minimizing temperature. Compared to the state of the art, our technique significantly reduces the temperature, by 30% on average, while satisfying the performance constraints of all individual applications.
15:56 CET 22.2.5 T-SKID: PREDICTING WHEN TO PREFETCH SEPARATELY FROM ADDRESS PREDICTION
Speaker:
Toru Koizumi, University of Tokyo, JP
Authors:
Toru Koizumi, Tomoki Nakamura, Yuya Degawa, Hidetsugu Irie, Shuichi Sakai and Ryota Shioya, University of Tokyo, JP
Abstract
Prefetching is an important technique for reducing the number of cache misses and improving processor performance, and thus various prefetchers have been proposed. Many prefetchers focus on issuing prefetches sufficiently earlier than demand accesses to hide miss latency. In contrast, we propose T-SKID, a prefetcher that focuses on delaying prefetching. If a prefetcher issues prefetches for demand accesses too early, the prefetched line will be evicted before it is referenced. We found that existing prefetchers often issue such too-early prefetches, and this observation offers new opportunities to improve performance. To tackle this issue, T-SKID performs timing prediction independently of address prediction. In addition to issuing prefetches sufficiently early as existing prefetchers do, T-SKID can delay the issue of prefetches until an appropriate time if necessary. We evaluated T-SKID by simulations using SPEC CPU 2017. The results show that T-SKID achieves a 5.6% performance improvement in a multi-core environment compared to Instruction Pointer Classifier-based Prefetching, a state-of-the-art prefetcher.
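The core idea, predicting when separately from what, can be illustrated with a toy prefetcher that keeps a per-PC stride table for addresses and a separate per-PC delay counter for timing. The table layout, the two-stride lookahead, and the delay policy are illustrative assumptions, not T-SKID's actual design:

```python
from collections import defaultdict

# Toy sketch: address prediction (stride table) is decoupled from timing
# prediction (per-PC issue delay, measured in demand accesses).
class DecoupledPrefetcher:
    def __init__(self):
        self.last_addr = {}            # pc -> last demand address
        self.stride = {}               # pc -> learned stride (what to fetch)
        self.delay = defaultdict(int)  # pc -> accesses to wait (when to fetch)
        self.pending = []              # (remaining_delay, address)

    def train_timing(self, pc, early_by):
        # If a prior prefetch arrived early_by accesses too soon (risking
        # eviction before use), hold the next one back by that amount.
        self.delay[pc] = max(0, early_by)

    def access(self, pc, addr):
        # Age pending prefetches; issue those whose delay has expired.
        self.pending = [(d - 1, a) for d, a in self.pending]
        issued = [a for d, a in self.pending if d <= 0]
        self.pending = [(d, a) for d, a in self.pending if d > 0]
        # Address prediction: constant stride per pc, two strides ahead.
        if pc in self.last_addr:
            self.stride[pc] = addr - self.last_addr[pc]
        self.last_addr[pc] = addr
        if pc in self.stride:
            self.pending.append((self.delay[pc], addr + 2 * self.stride[pc]))
        return issued
```

With a skid of one access, the prefetch for line 192 is held back and issued shortly before its demand use instead of as early as possible.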
16:00 CET 22.2.6 Q&A SESSION
Authors:
Pascal Vivet1 and Andrea Bartolini2
1CEA-Leti, FR; 2University of Bologna, IT
Abstract
Questions and answers with the authors

22.3 Compute in- and near- memory

Date: Wednesday, 23 March 2022
Time: 15:40 - 16:30 CET

Session chair:
Jean-Philippe Noel, CEA, FR

Session co-chair:
Pierre-Emmanuel Gaillardon, University of Utah, US

This session deals with design issues around the concepts of in- and near-memory computing, ranging from optimizing digital synthesis for crossbar-based IMC to optimizing the analog design of both RRAM-based IMC and MRAM-based NMC circuits. In addition, the issue of non-volatility of data in cache memories is tackled with innovative solutions.

Time Label Presentation Title
Authors
15:40 CET 22.3.1 LIM-HDL: HDL-BASED SYNTHESIS FOR IN-MEMORY COMPUTING
Speaker:
Saman Froehlich, University of Bremen, DE
Authors:
Saman Froehlich1 and Rolf Drechsler2
1Department of Mathematics and Computer Science, University of Bremen, DE; 2University of Bremen/DFKI, DE
Abstract
HDLs are widely used in EDA for the abstract specification and synthesis of logic circuits, as well as for validation by simulation or formal verification techniques. Despite the popularity and the many benefits of HDL-based synthesis, it has not yet been applied to in-memory computing. Hence, there is a need for a dedicated HDL that supplies efficient and compatible descriptions. In this paper, we enable HDL-based synthesis for the Programmable Logic-in-Memory (PLiM) computer architecture. The starting point is to provide abstract descriptions of the final program, similarly to conventional logic synthesis approaches using standard HDLs such as VHDL or Verilog. We present LiM-HDL - a Verilog-based HDL - which allows for the detailed description of programs for in-memory computation. Given a description in LiM-HDL, we propose a synthesis scheme that translates it into PLiM programs, i.e., sequences of resistive majority operations. This includes lexical and syntax analysis as well as preprocessing, custom levelization and a compiler. In our experiments, we show the benefits of LiM-HDL compared to classical Verilog-based synthesis. In a case study, we show that LiM-HDL can be used to implement programs under the constraints of specific applications such as edge computing in IoT, for which the PLiM computer is of particular interest and where low area is a key requirement; there, we reduce the number of ReRAM devices needed for the computation of an encryption module by 69%.
15:44 CET 22.3.2 TRIPLE-SKIPPING NEAR-MRAM COMPUTING FRAMEWORK FOR AIOT ERA
Speaker:
Juntong Chen, Southeast University, CN
Authors:
Juntong Chen, Hao Cai, Bo Liu and Jun Yang, Southeast University, CN
Abstract
The near-memory computing (NMC) paradigm shows great significance in non-von Neumann architectures for reducing data movement. The normally-off and instant-on characteristics of spin-transfer torque magnetic random access memory (STT-MRAM) promise energy-efficient storage in the AIoT era. To avoid unnecessary memory-related processing, we propose a novel write-read-calculation triple-skipping (TS) NMC for multiply-accumulate (MAC) operations with minimally modified peripheral circuits. The proposed TS-NMC is evaluated with a custom micro control unit (MCU) in a 28-nm high-K metal gate (HKMG) CMOS process with a foundry-announced universal two-transistor two-magnetic-tunnel-junction (2T-2MTJ) MRAM cell. The framework consists of a sparse flag, stored in extra STT-MRAM columns with only 0.73% area overhead, and a calculation block for the NMC logic with 9.9% overhead. The TS-NMC works at a 0.6-V supply voltage at 20 MHz. This near-MRAM framework offers up to ∼95.6% energy saving compared to commercial SRAM on an ultra-low-power benchmark (ULPBench). A classification task on MNIST takes 13 nJ/pattern. The TS scheme reduces memory-access energy, calculation energy, and total energy by 52.49×, 2.7×, and 11.3×, respectively.
15:48 CET 22.3.3 ACHIEVING CRASH CONSISTENCY BY EMPLOYING PERSISTENT L1 CACHE
Speaker:
Akshay Krishna Ramanathan, Pennsylvania State University, US
Authors:
Akshay Krishna Ramanathan1, Sara Mahdizadeh Shahri2, Yi Xiao1 and Vijaykrishnan Narayanan1
1Pennsylvania State University, US; 2University of Michigan, US
Abstract
Emerging non-volatile memory technologies promise the opportunity to maintain persistent data in memory. However, providing crash consistency in such systems can be costly, as any update to the persistent data has to reach the persistent domain in a specific order, imposing high overhead. Prior works proposed solutions in both software (SW) and hardware (HW) to address this problem but fall short of removing this overhead completely. In this work, we propose a Non-Volatile Cache (NVC) architecture that employs a hybrid volatile/non-volatile memory cell, based on monolithic 3D and ferroelectric technology, in the L1 data cache to guarantee crash consistency with almost no performance overhead. We show that NVC achieves up to 5.1x speedup over state-of-the-art (SOTA) SW undo logging and 11% improvement over the SOTA HW solution without giving up the conventional architecture, while incurring 7% hardware overhead.
15:52 CET 22.3.4 REFERENCING-IN-ARRAY SCHEME FOR RRAM-BASED CIM ARCHITECTURE
Speaker:
Abhairaj Singh, Delft University of Technology, NL
Authors:
Abhairaj Singh, Rajendra Bishnoi and Said Hamdioui, Delft University of Technology, NL
Abstract
Resistive random access memory (RRAM) based computation-in-memory (CIM) architectures are attracting a lot of attention due to their potential for fast and energy-efficient computing. However, RRAM variability and non-idealities limit the computing accuracy of such architectures, especially for multi-operand logic operations. This paper proposes a voltage-based differential referencing-in-array scheme that enables accurate two- and multi-operand logic operations for RRAM-based CIM architectures. The scheme makes use of a 2T2R cell configuration to create a complementary bitcell structure that inherently also acts as a reference during operation execution; this results in a high sensing margin. Moreover, the variation-sensitive multi-operand (N)AND operation is implemented using a complementary-input (N)OR operation to further improve its accuracy. Simulation results for a post-layout extracted 512x512 (256Kb) RRAM-based CIM array show that (N)OR/(N)AND operations with up to 56 operands can be performed accurately and reliably, as opposed to a maximum of 4 operands supported by state-of-the-art solutions, while offering up to 11.4X better energy efficiency.
15:56 CET 22.3.5 Q&A SESSION
Authors:
Jean-Philippe Noel1 and Pierre-Emmanuel Gaillardon2
1CEA, FR; 2University of Utah, US
Abstract
Questions and answers with the authors

22.4 Formal Methods in Design and Verification of Software and Hardware Systems

Date: Wednesday, 23 March 2022
Time: 15:40 - 16:30 CET

Session chair:
Stefano Quer, Politecnico di Torino, IT

Session co-chair:
Christoph Scholl, University of Freiburg, DE

The ever-growing complexity of software and hardware systems requires an increasing level of automation and more scalable design and verification methods. We will learn how Bounded Model Checking can be combined with Coverage Guided Fuzzing into an efficient and effective tool for software verification. We will be introduced to an FPGA-based swarm verification engine that acts as a model checker capable of proving liveness properties. Multiplier verification will be pushed forward through the clever use of dual variables and tail substitution in their algebraic encoding. Finally, whether as a comeback or as resistance to leaving, BDDs remain on the scene, and, for some applications, it is shown how to construct provably optimal variable orders in polynomial time.

Time Label Presentation Title
Authors
15:40 CET 22.4.1 (Best Paper Award Candidate)
BMC+FUZZ : EFFICIENT AND EFFECTIVE TEST GENERATION
Speaker:
Ravindra Metta, TCS, IN
Authors:
Ravindra Metta1, Raveendra Medicherla1 and Samarjit Chakraborty2
1TCS, IN; 2UNC Chapel Hill, US
Abstract
Coverage Guided Fuzzing (CGF) is a greybox test generation technique; Bounded Model Checking (BMC) is a whitebox one. Both have been highly successful at program coverage as well as error detection. It is well known that CGF fails to cover complex conditionals and deeply nested program points. BMC, on the other hand, fails to scale for programming features such as large loops and arrays. To alleviate the above problems, we propose (1) combining BMC and CGF by using BMC on a short and potentially incomplete unwinding of a given program to generate effective initial test prefixes, which are then extended into complete test inputs for CGF to fuzz, and (2) in case BMC gets stuck even for the short unwinding, automatically identifying the reason and rerunning BMC with a corresponding remedial strategy. We call this approach BMCFuzz and have implemented it in the VeriFuzz framework. The implementation was experimentally evaluated by participating in Test-Comp 2021, and the results show that BMCFuzz is both effective and efficient at covering branches as well as exposing errors. In this paper, we present the details of BMCFuzz and our analysis of the experimental results.
15:44 CET 22.4.2 DOLMEN: FPGA SWARM FOR SAFETY AND LIVENESS VERIFICATION
Speaker:
Emilien Fournier, ENSTA Bretagne, FR
Authors:
Emilien Fournier, Ciprian Teodorov and Loïc Lagadec, ENSTA Bretagne, FR
Abstract
To ensure correctness of critical systems, swarm verification produces proofs of failure on systems too large to be verified using model-checking. Recent research efforts exploit both intrinsic parallelism and low-latency on-chip memory offered by FPGAs to achieve 3 orders of magnitude speedups over software. However, these approaches are limited to safety verification that encodes only what the system should not do. Liveness properties express what the system should do, and are widely used in the verification of operating systems, distributed systems, and communication protocols. Both safety and liveness properties are of paramount importance to ensure systems correctness. This paper presents Dolmen, the first FPGA implementation of a swarm verification engine that supports both safety and liveness properties. Dolmen features a deeply pipelined verification core, along with a scalable architecture to allow high-frequency synthesis on large FPGAs. Our experimental results, on a Xilinx Virtex Ultrascale+ FPGA, show that the Dolmen architecture can achieve up to 4 orders of magnitude speedups compared to software model-checking.
15:48 CET 22.4.3 ADDING DUAL VARIABLES TO ALGEBRAIC REASONING FOR GATE-LEVEL MULTIPLIER VERIFICATION
Speaker:
Daniela Kaufmann, Johannes Kepler University Linz, AT
Authors:
Daniela Kaufmann1, Paul Beame2, Armin Biere3 and Jakob Nordström4
1Johannes Kepler University Linz, AT; 2University of Washington, US; 3Albert-Ludwigs-University Freiburg, DE; 4Københavns Universitet (DIKU), DK
Abstract
Algebraic reasoning has proven to be one of the most effective approaches for verifying gate-level integer multipliers, but it struggles with certain components, necessitating the complementary use of SAT solvers. For this reason, validation certificates require proofs in two different formats. Approaches to unify the certificates are not scalable, meaning that the validation results can only be trusted up to the correctness of compositional reasoning. We show in this paper that using dual variables in the algebraic encoding, together with a novel tail substitution and carry rewriting method, removes the need for SAT solvers in the verification flow and yields a single, uniform proof certificate.
15:52 CET 22.4.4 ON THE OPTIMAL OBDD REPRESENTATION OF 2-XOR BOOLEAN AFFINE SPACES
Speaker:
Valentina Ciriani, Universita' degli Studi di Milano, IT
Authors:
Anna Bernasconi1, Valentina Ciriani2 and Marco Longhi2
1Universita' di Pisa, IT; 2Universita' degli Studi di Milano, IT
Abstract
A Reduced Ordered Binary Decision Diagram (ROBDD) is a data structure widely used in an increasing number of fields of Computer Science. In general, ROBDD representations of Boolean functions have a tractable size, polynomial in the number of input variables, for many practical applications. However, the size of a ROBDD, and consequently the complexity of its manipulation, strongly depends on the variable ordering: depending on the initial ordering of the input variables, the size of a ROBDD representation can grow from linear to exponential. In this paper, we study the ROBDD representation of Boolean functions that describe a special class of Boolean affine spaces, which play an important role in some logic synthesis applications. We first discuss how the ROBDD representations of these functions are very sensitive to variable ordering, and then we provide an efficient linear-time algorithm for computing an optimal variable ordering that always guarantees a ROBDD of size linear in the number of input variables.
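The ordering sensitivity described above can be reproduced on a classic example, f = x1·y1 + x2·y2 + x3·y3, by counting distinct subfunctions per level (a proxy for ROBDD level width). The function and orders below are the textbook illustration, not the paper's affine-space class:

```python
from itertools import product

# f = x1&y1 | x2&y2 | x3&y3: small ROBDD under an interleaved order,
# exponential blow-up when all x's precede all y's.
def f(assign):
    return any(assign['x%d' % i] and assign['y%d' % i] for i in (1, 2, 3))

def bdd_width(order):
    """Max number of distinct subfunctions after fixing a prefix of the
    ordered variables (a proxy for the widest ROBDD level)."""
    widths = []
    for i in range(len(order) + 1):
        fixed, free = order[:i], order[i:]
        subfns = set()
        for bits in product((0, 1), repeat=i):
            prefix = dict(zip(fixed, bits))
            # Truth table of f with the prefix fixed: one subfunction.
            table = tuple(f({**prefix, **dict(zip(free, rest))})
                          for rest in product((0, 1), repeat=len(free)))
            subfns.add(table)
        widths.append(len(subfns))
    return max(widths)

w_good = bdd_width(['x1', 'y1', 'x2', 'y2', 'x3', 'y3'])  # interleaved
w_bad = bdd_width(['x1', 'x2', 'x3', 'y1', 'y2', 'y3'])   # separated
```

The interleaved order keeps the widest level at 3 subfunctions, while the separated order needs 2^3 = 8; for n pairs the gap is linear versus exponential, which is exactly why a polynomial-time optimal-ordering algorithm is valuable.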
15:56 CET 22.4.5 Q&A SESSION
Authors:
Stefano Quer1 and Christoph Scholl2
1Politecnico di Torino, IT; 2University of Freiburg, DE
Abstract
Questions and answers with the authors

23.1 Artificial Intelligence for embedded systems in healthcare

Date: Wednesday, 23 March 2022
Time: 16:40 - 17:20 CET

Session chair:
Marina Zapater, University of Applied Sciences Western Switzerland, CH

Session co-chair:
Daniele Jahier Pagliari, Politecnico di Torino, IT

Health-related applications need more and more intelligence at the edge to process data efficiently. This session will explain how artificial intelligence can help.

Time Label Presentation Title
Authors
16:40 CET 23.1.1 (Best Paper Award Candidate)
BIOFORMERS: EMBEDDING TRANSFORMERS FOR ULTRA-LOW POWER SEMG-BASED GESTURE RECOGNITION
Speaker:
Alessio Burrello, University of Bologna, IT
Authors:
Alessio Burrello1, Francesco Bianco Morghet2, Moritz Scherer3, Simone Benatti4, Luca Benini5, Enrico Macii2, Massimo Poncino2 and Daniele Jahier Pagliari2
1Department of Electric and Eletronic Engineering, University of Bologna, IT; 2Politecnico di Torino, IT; 3ETH Zürich, CH; 4University of Bologna, IT; 5Università di Bologna and ETH Zürich, IT
Abstract
Human-machine interaction is gaining traction in rehabilitation tasks, such as controlling prosthetic hands or robotic arms. Gesture recognition exploiting surface electromyographic (sEMG) signals is one of the most promising approaches, given that sEMG signal acquisition is non-invasive and directly related to muscle contraction. However, the analysis of these signals still presents many challenges, since similar gestures result in similar muscle contractions. The resulting signal shapes are thus almost identical, leading to low classification accuracy. To tackle this challenge, complex neural networks are employed, which require large memory footprints, consume relatively high energy, and limit the maximum battery life of devices used for classification. This work addresses this problem with the introduction of Bioformers. This new family of ultra-small attention-based architectures approaches state-of-the-art performance while reducing the number of parameters and operations by 4.9X. Additionally, by introducing a new inter-subject pre-training, we improve the accuracy of our best Bioformer by 3.39%, matching state-of-the-art accuracy without any additional inference cost. Deploying our best-performing Bioformer on a Parallel, Ultra-Low Power (PULP) microcontroller unit (MCU), the GreenWaves GAP8, we achieve an inference latency and energy of 2.72 ms and 0.14 mJ, respectively, 8.0X lower than the previous state-of-the-art neural network, while occupying just 94.2 kB of memory.
16:44 CET 23.1.2 INCLASS: INCREMENTAL CLASSIFICATION STRATEGY FOR SELF-AWARE EPILEPTIC SEIZURE DETECTION
Speaker:
Lorenzo Ferretti, University of California Los Angeles (UCLA), US
Authors:
Lorenzo Ferretti1, Giovanni Ansaloni2, Renaud Marquis3, Tomas Teijeiro4, Philippe Ryvlin3, David Atienza4 and Laura Pozzi5
1University of California Los Angeles, US; 2EPFL, CH; 3CHUV, CH; 4École Polytechnique Fédérale de Lausanne (EPFL), CH; 5USI Lugano, CH
Abstract
Wearable Health Companions allow the unobtrusive monitoring of patients affected by chronic conditions. In particular, by acquiring and interpreting bio-signals, they enable the detection of acute episodes in cardiac and neurological ailments. Nevertheless, the processing of bio-signals is computationally complex, especially when a large number of features is required to obtain reliable detection outcomes. Addressing this challenge, we present a novel methodology, named INCLASS, that iteratively extends the employed feature set at run-time until a confidence condition is satisfied. INCLASS builds such sets at design time based on code analysis and profiling information. When applied to the challenging scenario of detecting epileptic seizures based on ECG and SpO2 acquisitions, INCLASS obtains savings of up to 54%, while incurring a negligible loss of detection performance (1.1% degradation of specificity and sensitivity) with respect to always computing and evaluating all features.
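The iterate-until-confident loop at the heart of INCLASS can be sketched as follows; the scoring rule, threshold, and feature functions are illustrative assumptions, not the paper's seizure-detection pipeline:

```python
# Incremental classification: compute features one at a time and stop
# as soon as the decision score clears a confidence threshold.
def incremental_classify(feature_fns, x, score, threshold):
    """Return (decision, number of features actually computed)."""
    feats = []
    for fn in feature_fns:
        feats.append(fn(x))       # compute one more (possibly costly) feature
        s = score(feats)
        if abs(s) >= threshold:   # confident enough: stop early
            return s > 0, len(feats)
    return s > 0, len(feats)      # fell back to the full feature set
```

Easy samples are decided after the first cheap feature, and only ambiguous ones pay for the full feature set, which is where the reported savings come from.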
16:48 CET 23.1.3 AMSER: ADAPTIVE MULTI-MODAL SENSING FOR ENERGY EFFICIENT AND RESILIENT EHEALTH SYSTEMS
Speaker:
Emad Kasaeyan Naeini, University of California, Irvine, US
Authors:
Emad Kasaeyan Naeini1, Sina Shahhosseini1, Anil Kanduri2, Pasi Liljeberg2, Amir M. Rahmani1 and Nikil Dutt1
1University of California Irvine, US; 2University of Turku, FI
Abstract
eHealth systems deliver critical digital healthcare and wellness services for users by continuously monitoring physiological and contextual data. eHealth applications use multi-modal machine learning kernels to analyze data from different sensor modalities and automate decision-making. Noisy inputs and motion artifacts during sensory data acquisition affect the i) prediction accuracy and resilience of eHealth services and ii) energy efficiency in processing garbage data. Monitoring raw sensory inputs to identify and drop data and features from noisy modalities can improve prediction accuracy and energy efficiency. We propose a closed-loop monitoring and control framework for multi-modal eHealth applications, AMSER, that can mitigate garbage-in garbage-out by i) monitoring input modalities, ii) analyzing raw input to selectively drop noisy data and features, and iii) choosing appropriate machine learning models that fit the configured data and feature vector - to improve prediction accuracy and energy efficiency. We evaluate our AMSER approach using multi-modal eHealth applications of pain assessment and stress monitoring over different levels and types of noisy components incurred via different sensor modalities. Our approach achieves up to 22% improvement in prediction accuracy and 5.6x energy consumption reduction in the sensing phase against the state-of-the-art multi-modal monitoring application.
16:52 CET 23.1.4 Q&A SESSION
Authors:
Marina Zapater1 and Daniele Jahier Pagliari2
1University of Applied Sciences Western Switzerland (HES-SO), CH; 2Politecnico di Torino, IT
Abstract
Questions and answers with the authors

23.2 Performance Evaluation & Optimization using Modeling, Simulation & Benchmarking

Date: Wednesday, 23 March 2022
Time: 16:40 - 17:20 CET

Session chair:
Avi Ziv, IBM, IL

Session co-chair:
Daniel Grosse, Johannes Kepler University, AT

This session introduces solutions that increase the accuracy and/or the speed of assessing the performance of future designs. The solutions cover simple and accurate modeling of the delay of multi-input gates, high-level modeling of the non-idealities of computation-in-memory components, a platform to assess PiMs, and a method to decide on the quantization of DNNs.

Time Label Presentation Title
Authors
16:40 CET 23.2.1 A SIMPLE HYBRID MODEL FOR ACCURATE DELAY MODELING OF A MULTI-INPUT GATE
Speaker:
Arman Ferdowsi, TU Wien, AT
Authors:
Arman Ferdowsi, Juergen Maier, Daniel Oehlinger and Ulrich Schmid, TU Wien, AT
Abstract
Faithfully representing small delay variations caused by transitions on different inputs in close temporal proximity is a challenging task for digital circuit delay models. In this paper, we show that a simple hybrid model, derived from considering transistors as ideal switches in a simple RC model, leads to a surprisingly accurate model. By analytically solving the resulting ODEs for a NOR gate, explicit expressions for the delay are derived. In addition, we experimentally compare our model's predictions to SPICE simulations and to existing delay models.
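The flavor of the ideal-switch RC model can be seen in a first-order sketch: pulling a NOR output low through one or two parallel NMOS transistors changes the effective resistance and hence the delay. The component values are illustrative assumptions, and the paper's hybrid model is considerably more detailed:

```python
import math

# First-order ideal-switch RC delay: a discharging output follows
# V(t) = VDD * exp(-t / (R_eff * C)), so the 50% delay is R_eff*C*ln 2.
R_ON = 10e3   # on-resistance of one NMOS pull-down [ohm] (illustrative)
C_L = 10e-15  # load capacitance [F] (illustrative)

def t50(r_eff, c):
    """Time for an RC discharge to cross 50% of VDD."""
    return r_eff * c * math.log(2)

one_input = t50(R_ON, C_L)        # only one NOR input high
both_inputs = t50(R_ON / 2, C_L)  # both inputs high: parallel pull-downs
```

A second input rising in close temporal proximity halves the effective resistance and shortens the 50% delay, which is the kind of small delay variation the paper's model captures analytically.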
16:44 CET 23.2.2 SYSCIM: SYSTEMC-AMS SIMULATION OF MEMRISTIVE COMPUTATION IN MEMORY
Speaker:
Ali BanaGozar, Eindhoven University of Technology, NL
Authors:
Seyed Hossein Hashemi Shadmehri1, Ali BanaGozar2, Mehdi Kamal1, Sander Stuijk2, Ali Afzali-Kusha1, Massoud Pedram3 and Henk Corporaal2
1University of Tehran, IR; 2Eindhoven University of Technology, NL; 3University of Southern California, US
Abstract
Computation-in-memory (CIM) is one of the most appealing computing paradigms, especially for implementing artificial neural networks. Non-volatile memories like ReRAMs, PCMs, etc., have proven to be promising candidates for the realization of CIM processors. However, these devices and their driving circuits are subject to non-idealities. This paper presents a comprehensive platform, named SysCIM, for simulating memristor-based CIM systems. SysCIM considers the impact of the non-idealities of the CIM components, including the memristor device, memristor crossbar (interconnects), analog-to-digital converter, and transimpedance amplifier, on the vector-matrix multiplication performed by the CIM unit. The CIM modules are described in SystemC and SystemC-AMS to reach a higher simulation speed while maintaining high simulation accuracy. Experiments under different crossbar sizes show that SysCIM performs simulations up to 117x faster than HSPICE with less than 4% accuracy loss. The modular design of SysCIM provides researchers with an easy design-space exploration tool to investigate the effects of various non-idealities.
16:48 CET 23.2.3 PIMULATOR: A FAST AND FLEXIBLE PROCESSING-IN-MEMORY EMULATION PLATFORM
Speaker:
Sergiu Mosanu, University of Virginia, US
Authors:
Sergiu Mosanu, Mohammad Nazmus Sakib, Tommy Tracy II, Ersin Cukurtas, Alif Ahmed, Preslav Ivanov, Samira Khan, Kevin Skadron and Mircea Stan, University of Virginia, US
Abstract
Motivated by the memory wall problem, researchers have proposed many new Processing-in-Memory (PiM) architectures to bring computation closer to data. However, evaluating the performance of these emerging architectures involves a myriad of tools, including circuit simulators, behavioral RTL or software simulation models, hardware approximations, etc. It is challenging to mimic both the software and hardware aspects of a PiM architecture using the currently available tools with high performance and fidelity. Until actual products that include PiM become available, the next best thing is to emulate various hardware PiM solutions on FPGA fabric and boards. This paper presents a modular, parameterizable, FPGA-synthesizable soft PiM model suitable for prototyping and rapid evaluation of Processing-in-Memory architectures. The PiM model is implemented in SystemVerilog and allows users to generate any desired memory configuration on the FPGA fabric with complete control over the structure and distribution of the PiM logic units. Moreover, the model is compatible with the LiteX framework, which provides a high degree of usability and compatibility with the FPGA and RISC-V ecosystems. Thus, the framework enables architects to easily prototype, emulate, and evaluate a wide range of emerging PiM architectures and designs. We demonstrate strategies to model several pioneering bitwise-PiM architectures and provide detailed benchmark results that demonstrate the platform's ability to facilitate design-space exploration. We observe an emulation-versus-simulation weighted-average speedup of 28x when running a memory benchmark workload. The model can utilize 100% of the BRAM while using only 1% of the FFs and LUTs of an Alveo U280 FPGA board. The project is entirely open source.
16:52 CET 23.2.4 BENQ: BENCHMARKING AUTOMATED QUANTIZATION ON DEEP NEURAL NETWORK ACCELERATORS
Speaker:
Zheng Wei, Xi’an Jiaotong University, CN
Authors:
Zheng Wei, Xingjun Zhang, Jingbo Li, Zeyu Ji and Jia Wei, Xi'an Jiaotong University, CN
Abstract
Hardware-aware automated quantization promises to unlock an entirely new algorithm-hardware co-design paradigm for efficiently accelerating deep neural network (DNN) inference by incorporating the hardware cost into the reinforcement learning (RL)-based quantization strategy search process. Existing works usually design an automated quantization algorithm targeting one hardware accelerator with a device-specific performance model or pre-collected data. However, determining the hardware cost is non-trivial for algorithm experts due to their lack of cross-disciplinary knowledge in computer architecture, compilers, and physical chip design. Such a barrier limits reproducibility and fair comparison. Moreover, it is notoriously challenging to interpret the results due to the lack of quantitative metrics. To this end, we first propose BenQ, which includes various RL-based automated quantization algorithms with aligned settings and encapsulates two off-the-shelf performance predictors with the standard OpenAI Gym API. Then, we leverage cosine similarity and Manhattan distance to interpret the similarity between the searched policies. The experiments show that different automated quantization algorithms can achieve nearly equivalent optimal trade-offs because of the high similarity between the searched policies, which provides insights for revisiting the innovations in automated quantization algorithms.
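The two similarity metrics used to compare searched policies are standard; on two hypothetical per-layer bitwidth policies (the example values below are made up) they can be computed as:

```python
import math

# Compare two searched quantization policies, one bitwidth per layer.
def cosine_similarity(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

policy_a = [8, 6, 4, 4, 6, 8]  # bitwidths found by search algorithm A
policy_b = [8, 6, 4, 6, 4, 8]  # bitwidths found by search algorithm B
sim = cosine_similarity(policy_a, policy_b)
dist = manhattan(policy_a, policy_b)
```

A cosine similarity near 1 combined with a small Manhattan distance indicates that two search algorithms converged to essentially the same mixed-precision policy, which is the paper's interpretation of "near equivalent" trade-offs.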
16:56 CET 23.2.5 Q&A SESSION
Authors:
Avi Ziv1 and Daniel Grosse2
1IBM Research - Haifa, IL; 2Johannes Kepler University Linz, AT
Abstract
Questions and answers with the authors

23.3 New Methods and Tools using Machine Learning

Date: Wednesday, 23 March 2022
Time: 16:40 - 17:20 CET

Session chair:
Muhammad Shafique, NYU Abu Dhabi, AE

Session co-chair:
Smail Niar, Université Polytechnique Hauts-de-France, FR

This session includes four regular and two IP papers that improve the state of the art of methods and tools for Machine Learning. Among the regular papers, the first one presents a new approach for graph classification with hyperdimensional computing, the second one takes inspiration from deep learning techniques in natural language processing to improve energy estimation, the third one presents an in situ training framework for memristive crossbar structures, and the fourth one offers new perspectives on compression algorithms for quantized neural networks. Among the IP papers, the first one proposes a neural approach to improving thermal management, while the second one presents a training method that uses bit gradients to obtain mixed-precision quantized models.

Time Label Presentation Title
Authors
16:40 CET 23.3.1 (Best Paper Award Candidate)
GRAPHHD: EFFICIENT GRAPH CLASSIFICATION USING HYPERDIMENSIONAL COMPUTING
Speaker:
Igor Nunes, University of California, Irvine, US
Authors:
Igor Nunes, Mike Heddes, Tony Givargis, Alex Nicolau and Alex Veidenbaum, University of California, Irvine, US
Abstract
Hyperdimensional Computing (HDC), developed by Kanerva, is a computational model for machine learning inspired by neuroscience. HDC exploits characteristics of biological neural systems such as high dimensionality, randomness and a holographic representation of information to achieve a good balance between accuracy, efficiency and robustness. HDC models have already proven to be useful in different learning applications, especially in resource-limited settings such as the increasingly popular Internet of Things (IoT). One class of learning tasks missing from the current body of work on HDC is graph classification. Graphs are among the most important forms of information representation, yet, to this day, HDC algorithms have not been applied to the graph learning problem in a general sense. Moreover, graph learning in IoT and sensor networks, with limited compute capabilities, introduces challenges to the overall design methodology. In this paper, we present GraphHD — a baseline approach for graph classification with HDC. We evaluate GraphHD on real-world graph classification problems. Our results show that, when compared to state-of-the-art Graph Neural Networks (GNNs), the proposed model achieves comparable accuracy, while training and inference times are on average 14.6X and 2.0X faster, respectively.
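A toy version of hyperdimensional graph encoding conveys the idea: bind the hypervectors of an edge's endpoints, bundle all edges into one graph hypervector, and classify by similarity. The dimensionality and encoding details below are illustrative assumptions, not GraphHD's exact scheme:

```python
import random

D = 1000  # hypervector dimensionality (illustrative)

def rand_hv(rng):
    # Random bipolar hypervector, one per vertex.
    return [rng.choice((-1, 1)) for _ in range(D)]

def bind(u, v):
    # Element-wise multiplication encodes an edge from its endpoints.
    return [a * b for a, b in zip(u, v)]

def bundle(vecs):
    # Majority vote per dimension combines edge vectors into one graph vector.
    return [1 if sum(col) >= 0 else -1 for col in zip(*vecs)]

def encode_graph(edges, node_hvs):
    return bundle([bind(node_hvs[u], node_hvs[v]) for u, v in edges])

def similarity(a, b):
    return sum(x * y for x, y in zip(a, b)) / D  # normalized dot product

rng = random.Random(42)
hvs = {v: rand_hv(rng) for v in range(4)}
triangle = encode_graph([(0, 1), (1, 2), (0, 2)], hvs)
path = encode_graph([(0, 1), (1, 2), (2, 3)], hvs)
```

Graphs sharing edges yield measurably similar hypervectors, so a simple nearest-centroid classifier over such encodings can already separate graph classes, at a fraction of the cost of GNN training.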
16:44 CET 23.3.2 DEEPPM: TRANSFORMER-BASED POWER AND PERFORMANCE PREDICTION FOR ENERGY-AWARE SOFTWARE
Speaker:
Jun S. SHIM, Seoul National University, KR
Authors:
Jun S. Shim1, Bogyeong Han1, Yeseong Kim2 and Jihong Kim1
1Seoul National University, KR; 2DGIST, KR
Abstract
Many system-level management and optimization techniques need accurate estimates of power consumption and performance. Earlier research has produced many high-level/source-level estimation models, particularly for basic blocks. However, most of them still need to execute the target software at least once on a fine-grained simulator or real hardware to extract the required features. This paper proposes a performance/power prediction framework, called Deep Power Meter (DeepPM), which estimates both accurately using only the compiled binary. Inspired by deep learning techniques in natural language processing, we convert the program instructions into vectors and predict the average power and performance of basic blocks with a transformer model. In addition, unlike existing works based on a Long Short-Term Memory (LSTM) model structure, which only work for basic blocks with a small number of instructions, DeepPM provides highly accurate results for long basic blocks, which take up the majority of the execution time in actual application runs. In our evaluation conducted with the SPEC2006 benchmark suite, we show that DeepPM can predict performance and power consumption with 10.2% and 12.3% error, respectively. DeepPM also outperforms the LSTM-based model by up to 67.2% and 34.9% in performance and power error, respectively.
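The NLP analogy in the abstract can be sketched as follows: instructions are tokenized, each token is mapped to a learned embedding, and the resulting sequence is fed to a sequence model that regresses power/latency. Everything below (the tiny vocabulary, the dimensions, and the mean-pooling-plus-linear stand-in for the transformer) is illustrative, not DeepPM's actual design.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy vocabulary of instruction tokens (illustrative only).
VOCAB = {tok: i for i, tok in enumerate(
    ["add", "sub", "mul", "load", "store", "reg", "imm", "mem"])}
EMBED_DIM = 16
embeddings = rng.normal(size=(len(VOCAB), EMBED_DIM))

def encode_basic_block(instructions):
    """Tokenize a basic block and embed it as a sequence of vectors,
    the NLP-style input a transformer regressor would consume."""
    tokens = [tok for insn in instructions for tok in insn.split()]
    ids = [VOCAB[tok] for tok in tokens]
    return embeddings[ids]  # shape: (sequence_length, EMBED_DIM)

block = ["load reg mem", "add reg reg", "store reg mem"]
seq = encode_basic_block(block)

# A transformer (here crudely replaced by mean pooling and a linear head)
# maps the whole sequence to a predicted metric for the basic block.
w = rng.normal(size=EMBED_DIM)
predicted_metric = float(seq.mean(axis=0) @ w)
```

Working on the compiled binary means this tokenization needs no profiling run, which is the key practical advantage the abstract claims over feature-extraction-based models.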
16:48 CET 23.3.3 QUANTIZATION-AWARE IN-SITU TRAINING FOR RELIABLE AND ACCURATE EDGE AI
Speaker:
João Paulo de Lima, Federal University of Rio Grande do Sul, BR
Authors:
Joao Paulo Lima1 and Luigi Carro2
1Federal University of Rio Grande do Sul, BR; 2UFRGS, BR
Abstract
In-memory analog computation based on memristor crossbars has become the most promising approach for DNN inference. Because compute and memory requirements are larger during training, memristive crossbars are also an alternative for training DNN models within a feasible energy budget on edge devices, especially in light of trends towards security, privacy, latency, and energy reduction by avoiding data transfer over the Internet. To enable online training and inference on the same device, however, there are still challenges to address related to the different minimum bitwidths needed in each phase and to memristor non-idealities. We provide an in-situ training framework that allows the network to adapt to hardware imperfections, while practically eliminating errors from weight quantization. We validate our methodology on image classifiers, namely MNIST and CIFAR10, by training NN models with 8-bit weights and quantizing to 2 bits. The training algorithm recovers up to 12% of the accuracy lost to quantization errors even under high variability, reduces training energy by up to 6x, and allows for energy-efficient inference using a single cell per synapse, hence enhancing robustness and accuracy for a smooth training-to-inference transition.
16:52 CET 23.3.4 ENCORE COMPRESSION: EXPLOITING NARROW-WIDTH VALUES FOR QUANTIZED DEEP NEURAL NETWORKS
Speaker:
Myeongjae Jang, KAIST, KR
Authors:
Myeongjae Jang, Jinkwon Kim, Jesung Kim and Soontae Kim, KAIST, KR
Abstract
Deep Neural Networks (DNNs) have become a practical machine learning approach running on various Neural Processing Units (NPUs). For higher performance and lower hardware overhead, DNN datatype reduction through quantization has been proposed. Moreover, to address the memory bottleneck caused by the large data size of DNNs, several zero-value-aware compression algorithms are used. However, these compression algorithms do not compress modern quantized DNNs well because zero values become less frequent. We find that the latest quantized DNNs have data redundancy due to frequent narrow-width values. Because low-precision quantization reduces DNN datatypes to a simple datatype with fewer bits, scattered DNN data are gathered into a small number of discrete values, incurring a biased data distribution. Narrow-width values occupy a large proportion of this biased distribution. Moreover, the appropriate number of zero run-length bits can change dynamically with DNN sparsity. Based on these observations, we propose a compression algorithm that exploits narrow-width values and variable zero run-lengths for quantized DNNs. In experiments with three quantized DNNs, our proposed scheme yields an average compression ratio of 2.99.
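The combination of narrow-width value codes with zero run-length encoding can be illustrated with a toy compressor. The format below (tagged tokens, 4-bit run lengths, 4-bit narrow codes) is a hypothetical sketch for intuition, not the paper's actual encoding.

```python
def compress(values, zero_run_bits=4, narrow_bits=4):
    """Toy compressor in the spirit of the paper: runs of zeros are
    run-length encoded, values that fit in a narrow width get a short
    code, and everything else gets a full-width code."""
    out, i, n = [], 0, len(values)
    max_run = (1 << zero_run_bits) - 1
    narrow_max = (1 << narrow_bits) - 1
    while i < n:
        v = values[i]
        if v == 0:
            run = 0
            while i < n and values[i] == 0 and run < max_run:
                run += 1
                i += 1
            out.append(("Z", run))   # zero run: tag + run length
        elif 0 < v <= narrow_max:
            out.append(("N", v))     # narrow-width value: short code
            i += 1
        else:
            out.append(("F", v))     # full-width value
            i += 1
    return out

def decompress(tokens):
    out = []
    for tag, payload in tokens:
        out.extend([0] * payload if tag == "Z" else [payload])
    return out
```

In a quantized DNN with a biased value distribution, most tokens end up as cheap "Z" or "N" codes, which is where the compression ratio comes from; making the run-length field width adaptive to sparsity is the paper's additional refinement.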
16:56 CET 23.3.5 Q&A SESSION
Authors:
Muhammad Shafique1 and Smail Niar2
1New York University Abu Dhabi, AE; 2Université Polytechnique Hauts-de-France, FR
Abstract
Questions and answers with the authors

23.4 Side-channel attacks and beyond

Date: Wednesday, 23 March 2022
Time: 16:40 - 17:20 CET

Session chair:
Begül Bilgin, RAMBUS Cryptography Research, NL

Session co-chair:
Maria Mushtaq, Telecom Paristech, FR

This session covers a variety of attacks and defense mechanisms: a side-channel attack on DNNs, a thermal covert channel on Xeon processors, a power-based side-channel attack on homomorphic encryption, a mitigation technique against Rowhammer attacks, and a secure prefetcher against cache side-channel attacks.

Time Label Presentation Title
Authors
16:40 CET 23.4.1 (Best Paper Award Candidate)
PREFENDER: A PREFETCHING DEFENDER AGAINST CACHE SIDE CHANNEL ATTACKS AS A PRETENDER
Speaker:
Lang Feng, Nanjing University, CN
Authors:
Luyi Li1, Jiayi Huang2, Lang Feng1 and Zhongfeng Wang1
1Nanjing University, CN; 2University of California, Santa Barbara, US
Abstract
Cache side channel attacks are increasingly alarming in modern processors due to the recent emergence of Spectre and Meltdown attacks. A typical attack performs intentional cache access and manipulates cache states to leak secrets by observing the victim's cache access patterns. Different countermeasures have been proposed to defend against both general and transient execution based attacks. Despite their effectiveness, they all trade some level of performance for security. In this paper, we seek an approach to enforcing security while maintaining performance. We leverage the insight that attackers need to access cache in order to manipulate and observe cache state changes for information leakage. Specifically, we propose Prefender, a secure prefetcher that learns and predicts attack-related accesses for prefetching the cachelines to simultaneously help security and performance. Our results show that Prefender is effective against several cache side channel attacks while maintaining or even improving performance for SPEC CPU2006 benchmarks.
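For readers unfamiliar with how a prefetcher tracks access patterns, a minimal stride-prefetcher sketch follows. This is the generic textbook mechanism such designs build on; Prefender's attack-aware learning and prediction logic is more involved and not reproduced here.

```python
class StridePrefetcher:
    """Minimal stride-detecting prefetcher sketch: track the last address
    and stride of an access stream, and once the same nonzero stride
    repeats, speculatively prefetch the next address."""

    def __init__(self):
        self.last_addr = None
        self.last_stride = None

    def access(self, addr):
        prefetch = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                # Stride confirmed: issue a prefetch one step ahead.
                prefetch = addr + stride
            self.last_stride = stride
        self.last_addr = addr
        return prefetch
```

A security-oriented prefetcher like the one in the paper goes further, predicting the accesses an attacker needs to make and fetching those cachelines so that the attacker's observations are disturbed while benign workloads still benefit.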
16:44 CET 23.4.2 STEALTHY INFERENCE ATTACK ON DNN VIA CACHE-BASED SIDE-CHANNEL ATTACKS
Speaker:
Han Wang, University of California Davis, US
Authors:
Han Wang, Syed Mahbub Hafiz, Kartik Patwari, Chen-Nee Chuah, Zubair Shafiq and Houman Homayoun, University of California Davis, US
Abstract
The advancement of deep neural networks (DNNs) motivates their deployment in various domains, including image classification, disease diagnosis, voice recognition, etc. Since some tasks that DNNs undertake are very sensitive, the label information is confidential and carries commercial value or critical privacy. The leakage of label information can lead to further crimes, like intentionally causing a collision with DNN-enabled autonomous systems, disrupting energy networks with DNN-based control systems, etc. This paper demonstrates that DNNs also bring a new security threat, leading to the leakage of label information of input instances to the DNN models. In particular, we leverage a cache-based side-channel attack (SCA), i.e., Flush+Reload, on the DNN (victim) models to observe the execution of their computation graphs, and create a database of these observations for building a classifier that the attacker can use to infer the label information of (unknown) input instances for victim models. Then we deploy the cache-based SCA on the same host machine as the victim models and deduce the labels with the attacker's classification model to compromise the privacy and confidentiality of the victim models. We explore different settings and classification techniques to achieve a high success rate in stealing label information from the victim models. Additionally, we consider two attack scenarios: binary attacks, which distinguish specific sensitive labels from all others, and multi-class attacks, which aim to recognize all classes the victim DNNs provide. Last, we implement the attack on both static DNN models, with identical architectures for all inputs, and dynamic DNN models, which adapt their architecture to different inputs, to demonstrate the broad applicability of the proposed attack, covering DenseNet 121, DenseNet 169, VGG 16, VGG 19, MobileNet v1, and MobileNet v2.
Our experiments show that MobileNet v1 is the most vulnerable, with 99% and 75.6% attack success rates for the binary and multi-class attack scenarios, respectively.
16:48 CET 23.4.3 KNOW YOUR NEIGHBOR: PHYSICALLY LOCATING XEON PROCESSOR CORES ON THE CORE TILE GRID
Speaker and Author:
Hyungmin Cho, Sungkyunkwan University, KR
Abstract
The physical locations of the processor cores in multi- or many-core CPUs are often hidden from the users. The current generation of Intel Xeon CPUs accommodates many processor cores on a tile grid, but the exact locations of the individual cores are not plainly visible. We present a methodology for physically locating the cores in Intel Xeon CPUs. Using the method, we collect core location samples from 300 CPU instances deployed in a commercial cloud platform, which reveal a wide variety of core map patterns. The locations of the individual processor cores are not contiguously mapped, and the mapping pattern can differ for each CPU instance. We also demonstrate that an attacker can exploit an inter-core thermal covert channel using the identified core locations. The attacker can increase the channel capacity by strategically placing multiple sender and receiver nodes. Our evaluation shows that up to 15 bps of data transfer is possible with a bit error rate below 1% in a cloud environment, which is 3 times higher than previously reported results.
16:52 CET 23.4.4 REVEAL: SINGLE-TRACE SIDE-CHANNEL LEAKAGE OF THE SEAL HOMOMORPHIC ENCRYPTION LIBRARY
Speaker:
Furkan Aydin, North Carolina State University, US
Authors:
Furkan Aydin1, Emre Karabulut1, Seetal Potluri1, Erdem Alkim2 and Aydin Aysu1
1North Carolina State University, US; 2Department of Computer Science, Dokuz Eylul University, TR
Abstract
This paper demonstrates the first side-channel attack on homomorphic encryption (HE), which allows computing on encrypted data. We reveal a power-based side-channel leakage of Microsoft's Simple Encrypted Arithmetic Library (SEAL), which implements the Brakerski/Fan-Vercauteren (BFV) protocol. Our proposed attack targets the discrete Gaussian sampling in SEAL's encryption phase and can extract the entire message with a single power measurement. Our attack works by (1) identifying each coefficient index being sampled, (2) extracting the sign value of the coefficients from control-flow variations, (3) recovering the coefficients with high probability from data-flow variations, and (4) using a Blockwise Korkine-Zolotarev (BKZ) algorithm to efficiently explore and estimate the remaining search space. Using real power measurements, the results on a RISC-V FPGA implementation of Microsoft SEAL show that the proposed attack can reduce the plaintext encryption security level from 2^{128} to 2^{4.4}. Therefore, as HE gears toward real-world applications, such attacks and related defenses should be considered.
16:56 CET 23.4.5 Q&A SESSION
Authors:
Begul Bilgin1 and Maria Mushtaq2
1Rambus Cryptography Research, NL; 2Telecom Paristech, FR
Abstract
Questions and answers with the authors

C.1 Closing

Date: Wednesday, 23 March 2022
Time: 18:00 - 19:00 CET

Session chair:
Cristiana Bolchini, Politecnico di Milano, IT

Session co-chair:
Ingrid Verbauwhede, KU Leuven, BE

Time Label Presentation Title
Authors
18:00 CET C.1.1 CLOSING
Speaker:
Cristiana Bolchini, Politecnico di Milano, IT
Abstract
Closing session
18:30 CET C.1.2 AWARDS
Speakers:
Ingrid Verbauwhede1, Jan Madsen2 and Antonio Miele3
1KU Leuven - COSIC, BE; 2TU Denmark, DK; 3Politecnico di Milano, IT
Abstract
Awards session. Jan Madsen: EDAA Dissertation Awards; Antonio Miele: Best IP Award
18:55 CET C.1.3 SAVE THE DATE - DATE 2023
Speakers:
Ian O'Connor1 and Robert Wille2
1Lyon Institute of Nanotechnology, FR; 2Johannes Kepler University Linz, AT
Abstract
See you at DATE 2023!