2.6 Fault-Tolerant Embedded Systems

Time	Label	Presentation Title / Authors
11:30	2.6.1	(Best Paper Award Candidate) PROBABILISTIC WCET ESTIMATION IN PRESENCE OF HARDWARE FOR MITIGATING THE IMPACT OF PERMANENT FAULTS Speaker: Damien Hardy, University of Rennes/IRISA, FR Authors: Damien Hardy¹, Isabelle Puaut¹ and Yiannakis Sazeides² ¹University of Rennes 1/IRISA, FR; ²University of Cyprus, CY Abstract: Fine-grained disabling and reconfiguration of hardware elements (functional units, cache blocks) will become economically necessary to recover from permanent failures, whose rate is expected to increase dramatically in the near future. This fine-grained disabling will degrade performance compared to fault-free execution. Until recently, all static worst-case execution time (WCET) estimation methods assumed fault-free processors, resulting in unsafe estimates in the presence of faults. The first static WCET estimation technique dealing with permanent faults in instruction caches was proposed in [1]. That study probabilistically quantified the impact of permanent faults on WCET estimates. It demonstrated that the probabilistic WCET (pWCET) estimates of tasks increase rapidly with the probability of faults, compared to fault-free WCET estimates. In this paper, we show that very simple reliability mechanisms can mitigate the impact of faulty cache blocks on pWCETs. Two mechanisms that make part of the cache resilient to faults are analyzed. Experiments show that the gains in pWCET for these two mechanisms are on average 48% and 40%, compared to an architecture with no reliability mechanism.
12:00	2.6.2	A FOUR-MODE MODEL FOR EFFICIENT FAULT-TOLERANT MIXED-CRITICALITY SYSTEMS Speaker: Zaid Al-bayati, McGill University, CA Authors: Zaid Al-bayati¹, Jonah Caplan¹, Brett Meyer¹ and Haibo Zeng² ¹McGill University, CA; ²Virginia Tech, US Abstract: Mixed-criticality systems (MCS) integrate components from different levels of criticality onto the same platform. MCS, like all other electronic systems, are susceptible to transient faults. These systems must mitigate the effects of faults and provide recovery mechanisms when faults occur. In this paper, we consider the problem of designing and scheduling certifiable fault-tolerant mixed-criticality systems. To address certification and transient faults, two-mode models must treat any single overrun or fault as a combination of the two, reserving time for the re-execution of tasks with extended execution time. We therefore propose a new four-mode model that addresses faults and execution-time overruns with separate modes. This model, combined with the selective continuation of low-criticality tasks, improves the quality of service (QoS) to these tasks while providing the same guarantee to high-criticality tasks. Experimental results show that QoS improvements of up to 42.9% can be achieved by the new model. Furthermore, we show how the model and its schedulability analysis can be calibrated to realistic failure rates to achieve even more efficient designs.
12:30	2.6.3	PROVIDING FORMAL LATENCY GUARANTEES FOR ARQ-BASED PROTOCOLS IN NETWORKS-ON-CHIP Speaker: Eberle A Rambo, Technische Universität Braunschweig, DE Authors: Eberle A Rambo, Selma Saidi and Rolf Ernst, Technische Universität Braunschweig, DE Abstract: Networks-on-Chip (NoCs) are the backbone of Multiprocessor Systems-on-Chip (MPSoCs). In this paper, we perform a formal worst-case communication time analysis of Automatic Repeat reQuest (ARQ) protocols for NoCs. To this end, we integrate the transport layer analysis for general networks and the network layer analysis for NoCs. An ARQ variant optimized for DMA transfers (DMA ARQ) is introduced and analyzed. An experimental evaluation with Stop-and-Wait, Go-Back-N, and DMA ARQ in the context of real-time memory traffic is presented, including both error-free and error cases. DMA ARQ achieves a factor-of-6 improvement in latency bounds over conventional Stop-and-Wait.
13:00	IP1-8, 739	IMPROVING PERFORMANCE BY MONITORING WHILE MAINTAINING WORST-CASE GUARANTEES Speaker: Syed Md Jakaria Abdullah, Uppsala University, SE Authors: Syed Md Jakaria Abdullah, Kai Lampka and Wang Yi, Uppsala University, SE Abstract: In real-time systems, feasibility analysis is based on worst-case scenarios. At run-time, worst-case situations are often very unlikely to occur. With the system dimensioned for the worst case, one faces low resource utilization and an implicit loss in performance at run-time. We propose to use run-time monitoring to evaluate the deviation of job releases from their worst-case release bound. This allows us to compute a conservative bound on the future workload. Based on this, we design a scheme for reclaiming computation time that was originally allocated to jobs now known to be absent. By organizing the consumption of extra computing time in a dynamic and time-safe manner, we improve the run-time performance of applications and provably maintain the worst-case guarantees for their response times. We evaluate the usefulness of the presented approach using randomly generated traces of job releases.
13:00		End of session Lunch Break in Großer Saal + Saal 1

