5.3 Reliable Systems in the Age of Variability


Date: Wednesday 26 March 2014
Time: 08:30 - 10:00
Location / Room: Konferenz 1

Chair:
Antonio Miele, Politecnico di Milano, IT

Co-Chair:
José L. Ayala, Complutense University of Madrid, ES

The evolution of the silicon industry over the past decades has been fueled by continued scaling, which has driven the rapid evolution of integration technologies. In future technology nodes, reliability is expected to become a first-order design constraint. This session addresses that challenge with novel techniques, ranging from memoization to latency-insensitive systems, that tolerate, recover from, and manage reliability issues in an increasingly variable scenario.

Time Label Presentation Title
Authors
08:30 5.3.1 (Best Paper Award Candidate)
TEMPORAL MEMOIZATION FOR ENERGY-EFFICIENT TIMING ERROR RECOVERY IN GPGPUS
Speakers:
Abbas Rahimi¹, Luca Benini² and Rajesh Gupta¹
¹UC San Diego, US; ²Università di Bologna, IT
Abstract
Manufacturing and environmental variability lead to timing errors in computing systems that are typically corrected by error detection and correction mechanisms at the circuit level. The cost and speed of recovery can be improved by memoization-based optimization methods that exploit spatial or temporal parallelisms in suitable computing fabrics such as general-purpose graphics processing units (GPGPUs). We propose here a temporal memoization technique for use in floating-point units (FPUs) in GPGPUs that uses value locality inside data-parallel programs. The technique recalls (memorizes) the context of error-free execution of an instruction on an FPU. To enable scalable and independent recovery, a single-cycle lookup table (LUT) is tightly coupled to every FPU to maintain contexts of recent error-free executions. The LUT reuses these memorized contexts to exactly, or approximately, correct errant FP instructions based on application needs. In real-world applications, the temporal memoization technique achieves an average energy saving of 8%-28% for a wide range of timing error rates (0%-4%) and outperforms recent advances in resilient architectures. This technique also enhances robustness in the voltage overscaling regime and achieves a relative average energy saving of 66% with 11% voltage overscaling.
09:00 5.3.2 RELIABILITY-AWARE EXCEPTIONS: TOLERATING INTERMITTENT FAULTS IN MICROPROCESSOR ARRAY STRUCTURES
Speakers:
Waleed Dweik, Murali Annavaram and Michel Dubois, University of Southern California, US
Abstract
In future technology nodes, reliability is expected to become a first-order design constraint. Faults encountered in a chip can be classified into three categories: transient, intermittent, and permanent. Fault classification allows a chip to take the appropriate corrective action. Mechanisms have been proposed to distinguish transient from non-transient faults where all non-transient faults are handled as permanent. Intermittent faults induced by wearout phenomena have become the dominant reliability concern in nanoscale technology, yet there is no mechanism that provides finer classification of non-transient faults into intermittent and permanent faults. In this paper, we present a new class of exceptions called Reliability-Aware Exceptions (RAEs) which provide the ability to distinguish intermittent faults in microprocessor array structures. The RAE handlers have the ability to manipulate microprocessor array structures to recover from all three categories of faults. Using RAEs, we demonstrate that the reliability of two representative microarchitecture structures, load/store queue and reorder buffer in an out-of-order processor, is improved by average factors of 1.3 and 1.95, respectively.
09:30 5.3.3 TEMPERATURE AWARE ENERGY-RELIABILITY TRADE-OFFS FOR MAPPING OF THROUGHPUT-CONSTRAINED APPLICATIONS ON MULTIMEDIA MPSOCS
Speakers:
Anup Das, Akash Kumar and Bharadwaj Veeravalli, National University of Singapore, SG
Abstract
This paper proposes a design-time (offline) analysis technique to determine application task mapping and scheduling on a multiprocessor system and the voltage and frequency levels of each core (offline DVFS) that minimize application computation and communication energy while simultaneously minimizing processor aging. The proposed technique incorporates (1) the effect of the voltage and frequency on the temperature of a core; (2) the effect of neighboring core voltage and frequency on the temperature (spatial effect); (3) pipelined execution and cyclic dependencies among tasks; and (4) the communication energy component, which often constitutes a significant fraction of the total energy for multimedia applications. The temperature model proposed here can be easily integrated in the design space exploration for multiprocessor systems. Experiments conducted with applications modeled as synchronous data-flow graphs, in conjunction with the HotSpot tool for temperature modeling, clearly demonstrate the quality and the speed-up achieved using the proposed approach. Further, they also show 40% savings in energy consumption with a 6% increase in system lifetime.
09:45 5.3.4 RECOVERY-BASED RESILIENT LATENCY-INSENSITIVE SYSTEMS
Speakers:
Yuankai Chen¹, Xuan Zeng² and Hai Zhou¹
¹Northwestern University, US; ²Fudan University, CN
Abstract
As the interconnect delay is becoming a larger fraction of the clock cycle time, the conventional global stalling mechanism, which is used to correct errors in general synchronous circuits, is no longer feasible because of the expensive timing cost for the stalling signal to travel across the circuit. In this paper, we propose recovery-based resilient latency-insensitive systems (RLISs) that efficiently integrate error-recovery techniques with latency-insensitive design to replace the global stalling. We first demonstrate a baseline RLIS as the motivation of our work that uses an additional output buffer to guarantee that only correct data can enter the output channel. However, this baseline RLIS suffers from performance degradations even when errors do not occur. We propose a novel improved RLIS that allows erroneous data to propagate in the system. Equipped with improved queues that prevent accumulation of erroneous data, the improved RLIS retains the system performance. We provide theoretical studies that analyze the impact of errors on system performance and the queue sizing problem. We also theoretically prove that the improved RLIS performs no worse than the global stalling mechanism. Experimental results show that the improved RLIS has 40.3% and even 3.1% throughput improvements compared to the baseline RLIS and the infeasible global stalling mechanism, respectively, with less than 10% hardware overhead.
10:00 IP2-10, 80 A LINUX-GOVERNOR BASED DYNAMIC RELIABILITY MANAGER FOR ANDROID MOBILE DEVICES
Speakers:
Pietro Mercati¹, Andrea Bartolini², Francesco Paterna¹, Tajana Simunic Rosing¹ and Luca Benini²
¹UCSD, US; ²University of Bologna, IT
Abstract
Reliability is a major concern in multiprocessors. Dynamic Reliability Management (DRM) aims at trading off processor performance with lifetime. State-of-the-art publications study only the theory, supported by simulation. This paper presents the first complete software implementation, working on real hardware, of a low-overhead, Android-compatible, workload-aware DRM Governor for mobile multiprocessors. We discuss the design challenges and the run-time overhead involved. We show the effectiveness of our governor in guaranteeing the predefined target lifetime and show that it achieves up to 100% lifetime improvement with respect to traditional governors, while providing comparable performance for critical applications.
10:01 IP2-11, 182 YIELD AND TIMING CONSTRAINED SPARE TSV ASSIGNMENT FOR THREE-DIMENSIONAL INTEGRATED CIRCUITS
Speakers:
Yu-Guang Chen¹, Kuan-Yu Lai¹, Ming-Chao Lee², Yiyu Shi³, Wing-Kai Hon¹ and Shih-Chieh Chang¹
¹National Tsing Hua University, TW; ²MediaTek Inc., TW; ³Missouri University of Science and Technology, US
Abstract
Through Silicon Via (TSV) is a critical enabling technique in three-dimensional integrated circuits (3D ICs). However, it may suffer from many reliability issues. Various fault-tolerance mechanisms have been proposed in the literature to improve yield, at the cost of significant area overhead. In this paper, we focus on the structure that uses one spare TSV for a group of original TSVs, and study the optimal assignment of spare TSVs under yield and timing constraints to minimize the total area overhead. We show that this problem can be modeled through constrained graph decomposition. An efficient heuristic is further developed to address this problem. Experimental results show that under the same yield and timing constraints, our heuristic can reduce the area overhead induced by the fault-tolerance mechanisms by up to 38%, compared with a seemingly more intuitive nearest-neighbor based heuristic.
10:02 IP2-12, 568 COMPILER-DRIVEN DYNAMIC RELIABILITY MANAGEMENT FOR ON-CHIP SYSTEMS UNDER VARIABILITIES
Speakers:
Semeen Rehman, Florian Kriebel, Muhammad Shafique and Jörg Henkel, Karlsruhe Institute of Technology (KIT), DE
Abstract
This paper presents a novel Dynamic Reliability Management System (DyReMS) for on-chip systems that performs resilience-driven resource allocation and mapping. It accounts for both the tasks' resilience properties and the heterogeneous error recovery features of different cores. DyReMS also chooses a reliable task version (out of multiple reliability-aware transformed options) depending upon the reliability level of the allocated core. In case of error detection, rollbacks are performed. Our system provides 70%-87% improved task reliability compared to a timing reliability-optimizing core assignment, i.e. minimizing the probability of deadline misses (with EDF scheduling).
10:00 End of session
Coffee Break in Exhibition Area
On Tuesday-Thursday the coffee and lunch breaks will be located in the Exhibition Area (Terrace Level).