7.7 Aging Mitigation to Improve System Robustness


Time

Label

Presentation Title
Authors

14:30

7.7.1

PATH SELECTION AND SENSOR INSERTION FLOW FOR AGE MONITORING IN FPGAS
Speaker:
Mohammad Ebrahimi, University of Tehran, IR
Authors:
Mohammad Ebrahimi¹, Zana Ghaderi², Eli Bozorgzadeh² and Zainalabedin Navabi¹
¹University of Tehran, IR; ²University of California, Irvine, US
Abstract
This paper presents a two-step aging-aware methodology for selecting Representative Critical Paths (RCPs) from a large number of Critical Paths (CPs) in programmable logic devices. First, CPs are nominated by filtering on delay, temperature, and a lexicographic function of duty cycle and switching activity, which are the major causes of the Bias Temperature Instability (BTI) and Hot Carrier Injection (HCI) aging mechanisms. Second, RCPs are selected based on the Fan-out (FO) and physical location of the Logic Blocks (LBs) along a CP, to decrease aging propagation and to improve sensor distribution fairness, respectively. We then present a sensor insertion algorithm that is used during design placement to avoid sensor inaccuracy. The implementation steps of sensor insertion are performed automatically with limited human interaction. In our experiments, the RCPs age at a higher rate than the unselected CPs, which demonstrates the effectiveness of the proposed methodology. Keywords: Aging, FPGA, path selection, placement.
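The two-step selection described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Path` fields, the thresholds, the `region` label, and the preference for low fan-out in step two are all assumptions made for illustration, since the abstract does not give these details.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """Hypothetical per-path record; every field name is an assumption."""
    name: str
    delay_ns: float
    temp_c: float
    duty_cycle: float
    switching: float
    fanout: int
    region: str  # coarse placement region of the path's logic blocks


def nominate(paths, min_delay_ns, min_temp_c, keep):
    """Step 1: filter on delay and temperature, then rank the survivors
    lexicographically by (duty cycle, switching activity), highest first."""
    hot = [p for p in paths
           if p.delay_ns >= min_delay_ns and p.temp_c >= min_temp_c]
    hot.sort(key=lambda p: (p.duty_cycle, p.switching), reverse=True)
    return hot[:keep]


def select_rcps(nominated, per_region=1):
    """Step 2: visit paths in order of increasing fan-out (assumed
    preference) and keep at most `per_region` paths per placement region,
    so that sensors end up spread across the device."""
    chosen, used = [], {}
    for p in sorted(nominated, key=lambda p: p.fanout):
        if used.get(p.region, 0) < per_region:
            chosen.append(p)
            used[p.region] = used.get(p.region, 0) + 1
    return chosen
```

Calling `select_rcps(nominate(paths, 4.5, 75, keep=3))` on a list of candidate paths returns the subset that would receive sensors under these assumed rules.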

15:00

7.7.2

DESIGN AND EVALUATION OF RELIABILITY-ORIENTED TASK RE-MAPPING IN MPSOCS USING TIME-SERIES ANALYSIS OF INTERMITTENT FAULTS
Speaker:
Siva Satyendra Sahoo, National University of Singapore, SG
Authors:
Siva Satyendra Sahoo¹, Akash Kumar² and Bharadwaj Veeravalli¹
¹National University of Singapore, SG; ²Technische Universität Dresden, DE
Abstract
A large number of hardware faults are caused by an increasing number of manufacturing defects and by physical interactions during operation. This poses major challenges for the design and testing of modern Multiprocessor Systems-on-Chip (MPSoCs). Intermittent faults constitute a major part of hardware faults, and their fault rates can be used as an indicator of the wear-out in a Processing Element (PE). We propose a run-time task re-mapping method that uses this information to improve the useful lifetime of MPSoCs. We also propose a scenario-aware system-level fault injection technique for intermittent faults to evaluate system-level design techniques in MPSoCs. Our performance results show that our strategy scales significantly on reliability metrics with respect to the number of PEs. Specifically, we show that our method can achieve an increase in lifetime of up to 16% and tolerate up to 30% more faults than state-of-the-art techniques.
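As a rough illustration of using intermittent fault rates as a wear-out indicator for re-mapping, the sketch below smooths per-epoch fault counts with an exponentially weighted moving average and places the heaviest tasks on the PEs with the lowest smoothed rate. The moving average stands in for the paper's time-series analysis, which the abstract does not specify; the function names, the `alpha` parameter, and the greedy assignment are assumptions.

```python
def smoothed_rates(fault_counts, alpha=0.5):
    """Exponentially weighted fault rate per PE.

    fault_counts maps a PE name to its intermittent-fault counts per
    observation epoch, oldest first (hypothetical input format).
    """
    rates = {}
    for pe, counts in fault_counts.items():
        rate = float(counts[0])
        for count in counts[1:]:
            rate = alpha * count + (1 - alpha) * rate
        rates[pe] = rate
    return rates


def remap(tasks, fault_counts, alpha=0.5):
    """Greedy re-mapping: the heaviest task goes to the PE with the
    lowest smoothed fault rate, wrapping around if tasks outnumber PEs.

    tasks maps a task name to its load (hypothetical unit).
    """
    rates = smoothed_rates(fault_counts, alpha)
    pes = sorted(rates, key=rates.get)
    heavy_first = sorted(tasks, key=tasks.get, reverse=True)
    return {task: pes[i % len(pes)] for i, task in enumerate(heavy_first)}
```

A PE whose fault count is rising ends up with a high smoothed rate and therefore receives the lightest tasks, which is the intuition behind treating the fault rate as a wear-out signal.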

15:30

7.7.3

LIFETIME-AWARE LOAD DISTRIBUTION POLICIES IN MULTI-CORE SYSTEMS: AN IN-DEPTH ANALYSIS
Speaker:
Antonio Miele, Politecnico di Milano, IT
Authors:
Cristiana Bolchini, Luca Cassano and Antonio Miele, Politecnico di Milano, IT
Abstract
Dynamic Reliability Management solutions are often adopted in multi-core systems to mitigate aging and wear-out effects by suitably distributing the workload across the available cores. Because of computational complexity, the efficiency of such solutions is generally evaluated by considering only the occurrence of the first core failure. In this paper we propose an in-depth analysis of such approaches that considers the occurrence of multiple subsequent core failures, thus offering a more precise estimation of the lifetime reliability. In particular, we analyze two classical load distribution approaches: a load balancing strategy and a strategy based on spare resources. Experimental results show the benefits and limitations of the considered solutions in terms of lifetime reliability while meeting system performance requirements.
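The difference between a first-failure metric and an analysis over multiple subsequent failures can be seen in a toy model. The sketch below is not the paper's reliability model: it assumes each core has a fixed wear budget, that wear accumulates at a rate of `load ** exponent`, and that the system is alive while the remaining cores can carry the workload at a load of at most 1 each. All names and parameters are hypothetical.

```python
import math


def simulate(capacities, workload, policy, exponent=2.0):
    """Return the list of (time, core) failures until the system dies."""
    remaining = dict(enumerate(capacities))  # wear budget left per core
    now, failures = 0.0, []
    while len(remaining) >= workload:  # each core can carry load <= 1
        loads = policy(sorted(remaining), workload)
        rates = {c: l ** exponent for c, l in loads.items() if l > 0}
        # Jump directly to the next core failure.
        step = min(remaining[c] / r for c, r in rates.items())
        now += step
        for c, r in rates.items():
            remaining[c] -= r * step
        for c in [c for c in rates if remaining[c] <= 1e-12]:
            del remaining[c]
            failures.append((now, c))
    return failures


def balance(alive, workload):
    """Load balancing: spread the workload evenly over all alive cores."""
    return {c: workload / len(alive) for c in alive}


def spare(alive, workload):
    """Spare-based: run the minimum number of cores, keep the rest idle."""
    active = alive[:math.ceil(workload)]
    return {c: workload / len(active) for c in active}
```

With four identical cores, a workload of 2.0, and `exponent=1.0`, balancing postpones the first failure to twice the time of the spare-based policy, yet both policies reach the same system lifetime. With `exponent=2.0` balancing also doubles the system lifetime in this toy model. The first-failure metric alone would not distinguish these two situations.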

16:00

IP3-12, 219

A LIFETIME-AWARE RUNTIME MAPPING APPROACH FOR MANYCORE SYSTEMS IN THE DARK SILICON ERA
Speaker:
Mohammad-Hashem Haghbayan, University of Turku, FI
Authors:
Mohammad-Hashem Haghbayan¹, Antonio Miele², Amir-Mohammad Rahmani³, Pasi Liljeberg¹ and Hannu Tenhunen³
¹University of Turku, FI; ²Politecnico di Milano, IT; ³KTH Royal Institute of Technology and University of Turku, FI
Abstract
In this paper, we propose a novel lifetime reliability-aware resource management approach for many-core architectures. The approach is based on a hierarchical architecture composed of a long-term runtime reliability analysis unit and a short-term runtime mapping unit. The former periodically analyzes the aging status of the various processing units with respect to a target value specified by the designer, and performs recovery actions on highly stressed cores. The calculated reliability metrics are used in the runtime mapping of newly arrived applications to maximize system performance while fulfilling the reliability requirements and the available power budget. Our extensive experimental results show that the proposed reliability-aware approach can efficiently select the processing cores to be used over time in order to enhance the reliability at the end of the operational life (by up to 62%), while offering a performance level comparable to that of the state-of-the-art runtime mapping approach.

16:00

End of session
Coffee Break in Exhibition Area

Visit us at DATE 2016