7.7 Aging Mitigation to Improve System Robustness


Date: Wednesday 16 March 2016
Time: 14:30 - 16:00
Location / Room: Konferenz 5

Chair:
Maria Michael, University of Cyprus, CY

Co-Chair:
Carles Hernandez, Barcelona Supercomputing Center, ES

This session presents methodologies for monitoring aging effects in FPGAs and task mapping strategies for prolonging lifetime in robust multi/many-core systems.

Time   Label   Presentation Title / Authors
14:30  7.7.1  PATH SELECTION AND SENSOR INSERTION FLOW FOR AGE MONITORING IN FPGAS
Speaker:
Mohammad Ebrahimi, University of Tehran, IR
Authors:
Mohammad Ebrahimi1, Zana Ghaderi2, Eli Bozorgzadeh2 and Zainalabedin Navabi1
1University of Tehran, IR; 2University of California, Irvine, US
Abstract
This paper presents a two-step aging-aware methodology for selecting Representative Critical Paths (RCPs) from a large number of Critical Paths (CPs) in programmable logic devices. First, CPs are nominated by filtering on delay, temperature, and a lexicographic function of duty cycle and switching activity, which are the major contributors to the Bias Temperature Instability (BTI) and Hot Carrier Injection (HCI) aging mechanisms. Second, RCPs are selected based on the Fan-out (FO) and physical location of the Logic Blocks (LBs) along a CP, to decrease aging propagation and to improve sensor distribution fairness, respectively. We then present a sensor insertion algorithm that is used during design placement to avoid sensor inaccuracy. The implementation steps of sensor insertion are performed automatically with limited human interaction. The higher aging rate of RCPs compared with unselected CPs in our experiments demonstrates the effectiveness of the proposed methodology. Keywords: Aging, FPGA, path selection, placement.
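The two-step selection described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `CriticalPath` fields, the thresholds, the choice to rank higher duty cycle first, the preference for higher fan-out, and the "one path per device region" fairness rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriticalPath:
    # All fields are hypothetical stand-ins for data a timing/thermal flow would supply.
    name: str
    delay_ns: float
    temperature_c: float
    duty_cycle: float
    switching_activity: float
    fanout: int
    region: tuple  # (row, col) tile of the device the path mostly occupies


def nominate(paths, min_delay_ns, min_temp_c, keep):
    """Step 1: filter on delay and temperature, then rank lexicographically
    by (duty cycle, switching activity), highest first (assumed ordering)."""
    slow_and_hot = [
        p for p in paths
        if p.delay_ns >= min_delay_ns and p.temperature_c >= min_temp_c
    ]
    ranked = sorted(
        slow_and_hot,
        key=lambda p: (p.duty_cycle, p.switching_activity),
        reverse=True,
    )
    return ranked[:keep]


def select_representatives(nominees):
    """Step 2: keep one path per region (assumed fairness rule), preferring
    the nominee with the highest fan-out (assumed preference)."""
    best_per_region = {}
    for path in nominees:
        current = best_per_region.get(path.region)
        if current is None or path.fanout > current.fanout:
            best_per_region[path.region] = path
    return sorted(best_per_region.values(), key=lambda p: p.name)
```

A caller would pass the nominated list from `nominate` straight into `select_representatives`; the resulting paths are the candidates next to which sensors would be placed.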

15:00  7.7.2  DESIGN AND EVALUATION OF RELIABILITY-ORIENTED TASK RE-MAPPING IN MPSOCS USING TIME-SERIES ANALYSIS OF INTERMITTENT FAULTS
Speaker:
Siva Satyendra Sahoo, National University of Singapore, SG
Authors:
Siva Satyendra Sahoo1, Akash Kumar2 and Bharadwaj Veeravalli1
1National University of Singapore, SG; 2Technische Universität Dresden, DE
Abstract
A large number of hardware faults are being caused by an increasing number of manufacturing defects and physical interactions during operation. This poses major challenges for the design and testing of modern Multiprocessor Systems-on-Chip (MPSoCs). Intermittent faults constitute a major part of hardware faults, and their fault rates can be used as an indicator of the wear-out in a Processing Element (PE). We propose a run-time task re-mapping method that uses this information to improve the useful lifetime of MPSoCs. We also propose a scenario-aware system-level fault injection technique for intermittent faults to evaluate system-level design techniques in MPSoCs. Our results show that our strategy scales well on reliability metrics with respect to the number of PEs. Specifically, we show that our method can achieve an increase in lifetime of up to 16% and tolerate up to 30% more faults than state-of-the-art techniques.
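As a rough illustration of using intermittent fault rates as a wear-out indicator for re-mapping, the sketch below smooths each PE's per-window fault counts and assigns the heaviest tasks to the PEs that look healthiest. The abstract does not specify the time-series model or the mapping rule, so the exponential smoothing, the `alpha` parameter, and the greedy round-robin assignment are assumptions, not the authors' method.

```python
def smoothed_fault_rate(counts, alpha=0.5):
    """Exponentially smooth a non-empty series of per-window fault counts.
    Recent windows weigh more, so a rising trend raises the estimate."""
    rate = float(counts[0])
    for count in counts[1:]:
        rate = alpha * count + (1 - alpha) * rate
    return rate


def remap(task_loads, fault_history, alpha=0.5):
    """Assign tasks to PEs, heaviest task first onto the PE with the lowest
    smoothed fault rate (hypothetical policy). Ties break by name."""
    rates = {pe: smoothed_fault_rate(h, alpha) for pe, h in fault_history.items()}
    pes_by_health = sorted(rates, key=lambda pe: (rates[pe], pe))
    tasks_by_load = sorted(task_loads, key=lambda t: (-task_loads[t], t))
    return {
        task: pes_by_health[i % len(pes_by_health)]
        for i, task in enumerate(tasks_by_load)
    }
```

Here `fault_history` maps a PE name to its observed fault counts per monitoring window, and `task_loads` maps a task name to a relative load; both are hypothetical inputs.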

15:30  7.7.3  LIFETIME-AWARE LOAD DISTRIBUTION POLICIES IN MULTI-CORE SYSTEMS: AN IN-DEPTH ANALYSIS
Speaker:
Antonio Miele, Politecnico di Milano, IT
Authors:
Cristiana Bolchini, Luca Cassano and Antonio Miele, Politecnico di Milano, IT
Abstract
Dynamic Reliability Management solutions are often adopted in multi-core systems to mitigate aging and wear-out effects by appropriately distributing the workload on the available cores. Because of the computational complexity involved, the efficiency of such solutions is generally evaluated by considering only the occurrence of the first core failure. In this paper we propose an in-depth analysis of such approaches that considers the occurrence of multiple subsequent core failures, thus offering a more precise estimation of the lifetime reliability. In particular, we analyze two classical load distribution approaches: a load balancing strategy versus a strategy based on spare resources. Experimental results show the benefits and limitations of the considered solutions in terms of lifetime reliability while fulfilling system performance requirements.
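The difference between stopping at the first core failure and following subsequent failures can be seen in a toy, deterministic sketch. Everything in it is an assumption for illustration: each core has a fixed wear budget, wear accrues at `utilization ** beta`, idle spares do not age, and the system is considered failed once fewer than `ceil(load)` cores remain. It says nothing about the paper's reliability model or its results.

```python
import math


def failure_times(budgets, load, policy, beta=2.0):
    """Return the times of successive core failures until the remaining
    cores can no longer carry `load` (expressed in core-equivalents).

    policy == "balance": all surviving cores share the load equally.
    policy == "spare":   only ceil(load) cores run; the rest wait as spares.
    """
    remaining = list(budgets)          # wear budget left per core (toy model)
    alive = list(range(len(budgets)))
    needed = math.ceil(load)
    now, times = 0.0, []
    while len(alive) >= needed:
        active = alive if policy == "balance" else alive[:needed]
        rate = (load / len(active)) ** beta       # assumed wear law
        victim = min(active, key=lambda c: remaining[c])
        dt = remaining[victim] / rate             # time until next failure
        for core in active:
            remaining[core] = max(0.0, remaining[core] - rate * dt)
        now += dt
        times.append(now)
        alive.remove(victim)
    return times
```

The first element of the returned list is the usual "first failure" metric; the last element is the time at which this toy system stops meeting its load, which is the quantity a multi-failure analysis adds.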

16:00  IP3-12, 219  A LIFETIME-AWARE RUNTIME MAPPING APPROACH FOR MANYCORE SYSTEMS IN THE DARK SILICON ERA
Speaker:
Mohammad-Hashem Haghbayan, University of Turku, FI
Authors:
Mohammad-Hashem Haghbayan1, Antonio Miele2, Amir-Mohammad Rahmani3, Pasi Liljeberg1 and Hannu Tenhunen3
1University of Turku, FI; 2Politecnico di Milano, IT; 3KTH Royal Institute of Technology and University of Turku, FI
Abstract
In this paper, we propose a novel lifetime reliability-aware resource management approach for many-core architectures. The approach is based on a hierarchical architecture composed of a long-term runtime reliability analysis unit and a short-term runtime mapping unit. The former periodically analyses the aging status of the various processing units with respect to a target value specified by the designer, and performs recovery actions on highly stressed cores. The calculated reliability metrics are used in the runtime mapping of newly arrived applications to maximize the performance of the system while fulfilling reliability requirements and respecting the available power budget. Our extensive experimental results reveal that the proposed reliability-aware approach can efficiently select the processing cores to be used over time in order to enhance the reliability at the end of the operational life (by up to 62%) while offering a performance level comparable to that of the state-of-the-art runtime mapping approach.
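The two-level structure in this abstract can be sketched as two small functions: a long-term step that flags cores whose reliability has fallen below the designer's target, and a short-term step that maps a new application onto the remaining cores within a power budget. This is only a sketch under stated assumptions: resting a flagged core stands in for the unspecified "recovery actions", every core is assumed to draw the same power, and all names and parameters are hypothetical.

```python
def cores_to_rest(reliability, target):
    """Long-term unit (assumed form): flag cores below the reliability target."""
    return {core for core, r in reliability.items() if r < target}


def map_application(n_cores, reliability, target,
                    core_power_w, power_budget_w, busy=frozenset()):
    """Short-term unit (assumed form): pick the n most reliable cores that are
    neither resting nor busy, if the power budget allows that many.
    Returns a list of core names, or None when the request cannot be met."""
    resting = cores_to_rest(reliability, target)
    candidates = sorted(
        (c for c in reliability if c not in resting and c not in busy),
        key=lambda c: (-reliability[c], c),
    )
    affordable = int(power_budget_w // core_power_w)
    if n_cores > min(len(candidates), affordable):
        return None
    return candidates[:n_cores]
```

In a fuller design the long-term step would run periodically and update `reliability` from an aging model, while `map_application` would run on every application arrival.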

16:00  End of session
Coffee Break in Exhibition Area