M10 Dependability | Reliable VLSI Systems



Marilyn Wolf, Georgia Tech, US
Joerg Henkel, Karlsruher Institut für Technologie (KIT), DE
Norbert Wehn, TU Kaiserslautern, DE

Download handouts here (Handouts are available for attendees only! The password has been sent to you by email or you may ask for the password at the on-site registration desk.)

Reliability is a key concern for many growing semiconductor markets, including automotive, medical, and industrial IoT. Consumer electronics markets, by contrast, have traditionally had low reliability requirements due to the high turnover of products. Chips built for the growing markets must operate at high levels of confidence and provide long lifetimes. This tutorial will give a holistic view of reliability, including physical effects, system-level models, and applications. The speakers will discuss reliability models and their influence on system design; DRAM and wireless communication systems; and statistical modeling of thermal behavior.


14:30 M10.1 Reliability of On-chip Systems in the Nano-CMOS Era

Jörg Henkel, Karlsruhe Institute of Technology, DE

Reliability, power (or, more precisely, power density) and on-chip temperature are interdependent. These optimization goals and constraints have traditionally been treated as orthogonal. However, with the advent of nano-CMOS on-chip systems, it has become apparent that such treatment leads to inaccurate power, thermal, and reliability models. The goal of this talk is therefore to provide the basics behind, and explain the interdependencies within, the triangle of reliability (with a focus on aging effects), power, and temperature. The talk starts by discussing the discontinuation of Dennard scaling, which leads to high power densities and high on-chip temperatures (as well as spatial and temporal thermal gradients) that in turn accelerate aging effects such as electromigration and NBTI. Our research results show how some key reliability metrics are affected. The second part of the talk introduces some representative system-level and architecture-level mitigation techniques.
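
As a rough illustration of how temperature can accelerate aging, the sketch below computes an Arrhenius-type acceleration factor, the kind of temperature term that appears in Black's equation for electromigration. It is not taken from the talk: the function name, the 0.7 eV activation energy, and the example temperatures are assumptions chosen only for illustration.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(temp_ref_c, temp_hot_c, activation_energy_ev=0.7):
    """Ratio of aging rates at temp_hot_c versus temp_ref_c.

    Assumes a pure Arrhenius temperature dependence; the default
    activation energy is an illustrative assumption, not a measured value.
    """
    temp_ref_k = temp_ref_c + 273.15
    temp_hot_k = temp_hot_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / temp_ref_k - 1.0 / temp_hot_k
    )
    return math.exp(exponent)


if __name__ == "__main__":
    # Hypothetical comparison: a hotspot at 105 C versus a reference at 85 C.
    factor = arrhenius_acceleration(85.0, 105.0)
    print(f"Aging is roughly {factor:.1f}x faster at the hotter temperature")
```

Under this assumed model, lifetime scales inversely with the factor, so lowering peak temperature directly extends expected lifetime.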

15:45 M10.2 Dependability Issues in Memories: DRAM Subsystems and Wireless Communication Systems

Norbert Wehn, University of Kaiserslautern, DE

Many applications show an inherent error resilience due to their probabilistic behavior. This resilience can be exploited to reduce the design margin for advanced technology nodes, resulting in more energy- and area-efficient implementations. We will present a cross-layer approach for efficient reliability management in wireless baseband processing, with special emphasis on memories, since memories are most susceptible to dependability problems. A MIMO system will be used as a design example.

In the second part we will focus on DRAM memories. All of today's computing systems rely on dependable Dynamic Random Access Memories (DRAMs). However, in the future, memories such as DRAMs will become undependable due to further scaling. This has to be counterbalanced with higher refresh rates, which lead to higher DRAM power consumption. Moreover, due to increasing device capacities, more DRAM cells have to be refreshed regularly, which results in a further increase in refresh frequency and power. Recent research activities have resulted in the concept of "approximate DRAM", which saves power and improves performance by lowering the refresh rate or disabling refresh completely. Hence, fast and accurate models (of power, thermal behavior, and retention errors) are required for a thorough exploration of approximate DRAM for error-resilient applications. In this talk we present a holistic simulation environment for investigating approximate DRAM and show its impact on error-resilient applications.

17:00 M10.3 System-Level Thermal Modeling and Optimization

Marilyn Wolf, Georgia Institute of Technology, US

Thermal behavior is critical for reliability. It is also critical for real-time systems: thermal sensors may slow down clocks, causing processors to miss deadlines. Accurately assessing the thermal behavior of chips requires both compact models of the chip's thermal behavior and models of use cases. We will describe thermal RC models that relate power consumption to thermal behavior in both time and space. We will use these to consider statistical models of chip-level thermal behavior. We will also discuss work from a variety of groups on thermal-aware scheduling.
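
To make the idea of a thermal RC model concrete, here is a minimal single-node sketch, assuming one lumped thermal resistance R and capacitance C between the chip and ambient, so that C·dT/dt = P(t) − (T − T_amb)/R. It is not the speaker's model: the function name and all parameter values are hypothetical, and a spatial model would couple many such nodes rather than use just one.

```python
def simulate_thermal_rc(power_trace, r_th, c_th, t_ambient, dt):
    """Forward-Euler integration of C*dT/dt = P(t) - (T - T_amb)/R.

    power_trace: power samples in watts, one per time step
    r_th: thermal resistance to ambient in K/W (assumed lumped)
    c_th: thermal capacitance in J/K (assumed lumped)
    t_ambient: ambient temperature in degrees Celsius
    dt: time step in seconds; should be much smaller than r_th * c_th
    Returns the temperature after each step, starting from ambient.
    """
    temps = []
    temp = t_ambient
    for power in power_trace:
        heat_to_ambient = (temp - t_ambient) / r_th
        temp += (power - heat_to_ambient) / c_th * dt
        temps.append(temp)
    return temps


if __name__ == "__main__":
    # Hypothetical values: 10 W constant load, R = 2 K/W, C = 0.5 J/K.
    trace = simulate_thermal_rc([10.0] * 2000, r_th=2.0, c_th=0.5,
                                t_ambient=45.0, dt=0.01)
    print(f"Temperature after {len(trace)} steps: {trace[-1]:.2f} C")
```

With constant power the temperature settles toward T_amb + P·R, with time constant R·C; feeding in a time-varying power trace from a use case shows the transient behavior that thermal-aware scheduling tries to control.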