Improving Reliability for Real‐Time Systems through Dynamic Recovery

Yue Ma1,a, Thidapat Chantem2, Robert P. Dick3 and X. Sharon Hu1,b
1University of Notre Dame, Notre Dame, USA
ayma1@nd.edu
bshu@nd.edu
2Virginia Polytechnic Institute and State University, Arlington, VA
tchantem@vt.edu
3University of Michigan, Ann Arbor, MI
dickrp@umich.edu

ABSTRACT


Technology scaling has increased concerns about transient faults due to soft errors and permanent faults due to lifetime wear processes. Although researchers have investigated related problems, they have either considered only one of the two reliability concerns or presented simple recovery allocation algorithms that cannot effectively use available time slack to improve soft-error reliability. This paper introduces a framework for improving soft-error reliability while satisfying lifetime reliability and real‐time constraints. We present a dynamic recovery allocation technique that is guaranteed to recover any failed task, provided the remaining slack is adequate. Based on this technique, we propose two scheduling algorithms, each suited to task sets with different characteristics, that improve system‐level soft‐error reliability. Lifetime reliability requirements are satisfied by reducing core frequencies for appropriate tasks, thereby reducing wear due to temperature and thermal cycling. Simulation results show that the proposed framework reduces the probability of failure by at least 8%, and by 73% on average, compared to existing approaches.

Keywords: Soft‐error reliability, Lifetime reliability, Dynamic recovery, Real‐time embedded system.


