M01 Early System Reliability Analysis for Cross-layer Soft Errors Resilience in Microprocessor Systems
In a world with computation at the epicenter of every activity, computing systems must be highly reliable even if miniaturization makes the underlying hardware unreliable. Techniques able to guarantee high reliability are associated with high costs (the reliability tax). Early reliability analysis makes it possible to take informed design decisions that maximize reliability while minimizing the reliability tax. This tutorial focuses on early cross-layer reliability analysis considering the full computing continuum (from IoT/CPS to HPC applications), with emphasis on soft errors. The tutorial will guide attendees from the definition of the problem down to the proper modeling and design exploration strategies, considering the full system stack.
Introduction to Reliability (20 minutes)
Reliability is a very broad domain to which several (sometimes competing) communities have made significant contributions. However, as often happens, definitions and metrics have different meanings in different communities. This creates a serious obstacle to sharing knowledge and to efficiently implementing cross-layer reliability techniques, which require synergy between all layers of the system stack.
To overcome this problem, in this introduction Prof. Gizopoulos will provide an overview of the basic concepts, definitions and metrics required to work in the reliability domain, with the aim of building a common language that can be understood by researchers with different backgrounds (e.g., hardware vs. software developers).
Cross-Layer Reliability Techniques Overview (20 minutes)
Cross-layer reliability (or cross-layer resilience) is gaining increasing relevance in both the academic and industrial sectors. In a cross-layer resilient system, physical and circuit-level techniques can mitigate low-level faults. Hardware redundancy can be used to manage errors at the hardware architecture layer. Finally, software-implemented error detection and correction mechanisms can manage those errors that escape the lower layers of the stack. To convey both the potential and the complexity of this design paradigm, Prof. Di Carlo will give a brief overview of the most widely used protection techniques available at the different layers, including:
- Logic Layer
- Architectural Layer
- Software Layer
The goal is not to provide an exhaustive review of the state of the art but to give an idea of the building blocks that can be exploited in a cross-layer resilient design and, most importantly, to let the audience understand the size and complexity of the related design space, which makes reliability analysis a crucial task in the early phases of the design.
Reliability Analysis in a Cross-Layer Domain
The decision of how to distribute error management across the different layers aims to meet the system reliability requirements of a specific application, considering its sensitivity to hardware faults, while minimizing the related reliability tax. Overall, by considering multiple layers, one can exploit a wider range of information when handling errors. This leads to globally optimized error management strategies that address not only reliability but also other design constraints. However, although a holistic cross-layer design approach has several advantages over traditional single-layer techniques, it increases the complexity of the design process, since a larger design space must be explored. This translates into an increasing demand for system-level reliability analysis frameworks able to evaluate different combinations of cross-layer error protection techniques early in the design cycle. Unfortunately, such tools still lack maturity, especially compared to those available to optimize other design parameters such as power and performance.
This topic represents the core of the tutorial, in which all presenters will draw on their experience from several years of research and collaboration in this domain to guide the audience through an overview of the main reliability analysis approaches in a cross-layer domain.
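The trade-off between meeting a reliability target and minimizing the reliability tax can be illustrated with a small exhaustive design space exploration. The following is only a minimal sketch: the technique names, the escape fractions, the relative costs, and the assumption that layers mask errors independently (so escape fractions multiply) are all hypothetical placeholders, not data or models from the tutorial.

```python
from itertools import product

# Hypothetical options per layer: (name, fraction of incoming errors that
# escape the technique, relative cost). All numbers are illustrative only.
LAYER_OPTIONS = {
    "logic": [("none", 1.0, 0.0), ("hardened_cells", 0.5, 0.10)],
    "architecture": [("none", 1.0, 0.0), ("parity", 0.4, 0.05), ("ecc", 0.1, 0.15)],
    "software": [("none", 1.0, 0.0), ("duplicated_instructions", 0.2, 0.30)],
}


def explore(raw_failure_rate, target_failure_rate, options=LAYER_OPTIONS):
    """Return the cheapest combination that meets the target, or None.

    Simplifying assumption: each layer masks errors independently, so the
    residual failure rate is the raw rate times every escape fraction.
    """
    layers = list(options)
    best = None
    for combo in product(*(options[layer] for layer in layers)):
        residual = raw_failure_rate
        cost = 0.0
        for _name, escape, technique_cost in combo:
            residual *= escape
            cost += technique_cost
        if residual > target_failure_rate:
            continue  # this combination does not meet the requirement
        if best is None or cost < best["cost"]:
            best = {
                "choice": {layer: opt[0] for layer, opt in zip(layers, combo)},
                "residual": residual,
                "cost": cost,
            }
    return best


if __name__ == "__main__":
    print(explore(raw_failure_rate=100.0, target_failure_rate=12.0))
```

Even with three layers and a handful of options per layer, the number of combinations grows multiplicatively, which is why exhaustive enumeration of this kind only works for toy examples.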
In particular, the tutorial will cover the following topics:
- Fault Injection approaches (90 minutes)
  - Device Level
  - Microarchitectural Level
  - ISA/Software Level
- Stochastic Cross-Layer Modelling and Analysis (30 minutes)
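To give a concrete flavor of fault injection at the software level, the sketch below flips one random bit in the input of a toy workload and compares the result with a fault-free (golden) run. The workload, the 8-bit data width, the single-bit-flip fault model, and the two outcome classes are assumptions chosen for illustration; they do not reproduce any tool presented in the tutorial.

```python
import random


def kernel(values):
    """Toy workload: the maximum of a list of 8-bit values."""
    return max(values)


def flip_bit(values, index, bit):
    """Return a copy of values with one bit of one element inverted."""
    faulty = list(values)
    faulty[index] ^= 1 << bit
    return faulty


def campaign(values, runs, seed=0):
    """Inject one random single-bit flip per run and classify the outcome."""
    rng = random.Random(seed)  # seeded so the campaign is reproducible
    golden = kernel(values)    # fault-free reference output
    outcomes = {"masked": 0, "corrupted": 0}
    for _ in range(runs):
        index = rng.randrange(len(values))
        bit = rng.randrange(8)  # assumed 8-bit data width
        result = kernel(flip_bit(values, index, bit))
        outcomes["masked" if result == golden else "corrupted"] += 1
    return outcomes


if __name__ == "__main__":
    print(campaign([200, 3, 7, 1], runs=1000))
```

In this toy case, a flip in one of the small elements never changes the maximum and is therefore masked, while any flip in the largest element corrupts the output. The fraction of masked runs is a rough estimate of how much the workload itself tolerates faults.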
This tutorial presents methodologies and results obtained in the framework of several EC projects in which the presenters have been actively involved, including CLERECO - FP7 (https://www.clereco.eu), UniServer - H2020 (http://www.uniserver2020.eu/) and RECIPE - FETHPC project (http://www.recipe-project.eu/).