10.3 System-level Dependability for Multicore and Real-time Systems


Date: Thursday, March 28, 2019
Time: 11:00 - 12:30
Location / Room: Room 3

Chair:
Stefano Di Carlo, Politecnico di Torino, IT

Co-Chair:
Luca Cassano, Politecnico di Milano, IT

This session covers reliability assessment in heterogeneous systems, availability optimization in real-time systems under permanent and transient faults, and fault-tolerance techniques for many-core systems.

Time  Label  Presentation Title
Authors
11:00  10.3.1  IDENTIFYING THE MOST RELIABLE COLLABORATIVE WORKLOAD DISTRIBUTION IN HETEROGENEOUS DEVICES
Speaker:
Paolo Rech, UFRGS, BR
Authors:
Gabriel Piscoya Dávila, Daniel Oliveira, Philippe Navaux and Paolo Rech, UFRGS, BR
Abstract
The constant need for higher performance and reduced power consumption has led vendors to design heterogeneous devices that embed a traditional CPU and an accelerator, such as a GPU or an FPGA. When the CPU and the accelerator are used collaboratively, the device's computational performance reaches its peak. However, the larger amount of resources employed for computation can have the side effect of increasing the soft error rate. In this paper, we evaluate the reliability behaviour of AMD Kaveri Accelerated Processing Units executing a set of heterogeneous applications. We distribute the workload between the CPU and the GPU and evaluate which configuration provides the lowest error rate or allows the largest amount of data to be computed before a failure occurs. We show that, in most cases, the most reliable workload distribution is the one that delivers the highest performance. As experimentally proven, choosing the correct workload distribution can increase device reliability by up to 9x.
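
The metric mentioned in the abstract, the amount of data computed before a failure, can be illustrated with a small calculation. This is a minimal sketch under stated assumptions: the throughputs and error rates below are invented for illustration and are not measurements from the paper, and the function names are hypothetical. The data processed between failures is taken as throughput multiplied by the mean time between failures.

```python
# Hypothetical numbers for illustration only; not measurements from the paper.
# Each entry maps the fraction of work given to the GPU to
# (throughput in MB/s, observed errors per hour).
configs = {
    0.00: (100.0, 2.0),
    0.50: (220.0, 3.0),
    1.00: (180.0, 4.0),
}

def data_before_failure(throughput_mb_s, errors_per_hour):
    """Expected data (MB) processed between failures: throughput * mean time between failures."""
    mtbf_seconds = 3600.0 / errors_per_hour
    return throughput_mb_s * mtbf_seconds

def most_reliable_split(configs):
    """Return the GPU share that maximizes the data processed before a failure."""
    return max(configs, key=lambda share: data_before_failure(*configs[share]))

best = most_reliable_split(configs)
```

With these made-up values, the split with the highest error rate is not the worst choice and the split with the lowest error rate is not the best one, because the faster configuration finishes more work between failures.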

11:30  10.3.2  CE-BASED OPTIMIZATION FOR REAL-TIME SYSTEM AVAILABILITY UNDER LEARNED SOFT ERROR RATE
Speaker:
Liying Li, East China Normal University, CN
Authors:
Liying Li¹, Tongquan Wei¹, Junlong Zhou², Mingsong Chen¹ and X. Sharon Hu³
¹East China Normal University, CN; ²Nanjing University of Science and Technology, CN; ³University of Notre Dame, US
Abstract
As the density of integrated circuits continues to increase, the probability that real-time systems suffer from transient and permanent failures rises significantly, degrading the availability of system functionality. In this paper, we investigate the dynamic modeling of the transient failure rate using a Back Propagation (BP) neural network, and propose an optimization strategy for system availability based on Cross Entropy (CE). Specifically, the neural network is trained using cross-layer simulation data obtained from SPICE simulation, while the CE-based optimization of system availability is achieved by judiciously selecting an optimal supply voltage for processors under timing constraints. Simulation results show that the proposed method can improve system availability by up to 32% compared to benchmarking methods.
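
The general idea of cross-entropy optimization over supply voltages can be sketched as follows. This is a generic illustration, not the authors' method: the voltage levels, the deadline, and the `exec_time` and `availability` models are all hypothetical stand-ins (in the paper, the failure rate comes from a trained neural network). The loop samples voltage assignments, keeps the best-scoring ones, and shifts the sampling distribution toward them.

```python
import random

# Hypothetical voltage levels (V) and toy models; none of these values come from the paper.
VOLTAGES = [0.8, 0.9, 1.0, 1.1, 1.2]
DEADLINE = 8.5  # hypothetical per-processor timing constraint

def exec_time(v):
    """Toy timing model: a lower supply voltage means slower execution."""
    return 8.0 / v

def availability(v):
    """Toy availability model peaking at 1.0 V (stand-in for a learned model)."""
    return 1.0 - (v - 1.0) ** 2

def score(assignment):
    """Sum of per-processor availability, or -inf if any deadline is missed."""
    if any(exec_time(v) > DEADLINE for v in assignment):
        return float("-inf")
    return sum(availability(v) for v in assignment)

def cross_entropy_search(n_procs=3, samples=200, elite=20, iters=30, smooth=0.7, seed=0):
    rng = random.Random(seed)
    levels = range(len(VOLTAGES))
    # One categorical distribution over voltage levels per processor, initially uniform.
    probs = [[1.0 / len(VOLTAGES)] * len(VOLTAGES) for _ in range(n_procs)]
    for _ in range(iters):
        # Sample candidate assignments (as voltage indices) from the current distributions.
        population = [
            [rng.choices(levels, weights=p)[0] for p in probs]
            for _ in range(samples)
        ]
        # Keep the best-scoring samples as the elite set.
        population.sort(key=lambda idx: score([VOLTAGES[i] for i in idx]), reverse=True)
        best = population[:elite]
        # Move each distribution toward the elite frequencies (smoothed update).
        for proc, p in enumerate(probs):
            for level in levels:
                freq = sum(1 for idx in best if idx[proc] == level) / elite
                p[level] = smooth * freq + (1.0 - smooth) * p[level]
    # Report the most probable voltage for each processor.
    return [VOLTAGES[max(levels, key=p.__getitem__)] for p in probs]

chosen = cross_entropy_search()
```

Rejecting deadline-violating assignments with a score of negative infinity is one simple way to handle the timing constraint; a penalty term would be an alternative.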

12:00  10.3.3  A DETERMINISTIC-PATH ROUTING ALGORITHM FOR TOLERATING MANY FAULTS ON WAFER-LEVEL NOC
Speaker:
Ying Zhang, Tongji University, CN
Authors:
Zhongsheng Chen¹, Ying Zhang¹, Zebo Peng² and Jianhui Jiang¹
¹Tongji University, CN; ²Linköping University, SE
Abstract
Wafer-level NoC has emerged as a promising fabric to further improve supercomputer performance, but this new fabric may suffer from the many-fault problem. This paper presents a deterministic-path routing algorithm for tolerating many faults on wafer-level NoCs. The proposed algorithm generates routing tables using a breadth-first traversal strategy, and stores one routing table in each NoC switch. Each switch then transmits packets online according to its routing table. We use the Tarjan algorithm to dynamically reconfigure the routes to avoid the faulty nodes, and develop deprecated link/node rules to ensure deadlock-free communication in the NoC. Experimental results demonstrate that the proposed algorithm not only tolerates the effects of many faults, but also maximizes the number of available nodes in the reconfigured NoC. The performance of the proposed algorithm in terms of average latency, throughput, and energy consumption is also better than that of existing solutions.
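
Building a per-switch routing table with a breadth-first traversal can be sketched as below. This is a minimal illustration, assuming a 2D mesh topology and a healthy source switch; the function names are hypothetical. It covers only the breadth-first table generation and does not reproduce the Tarjan-based reconfiguration or the deprecated link/node rules for deadlock freedom described in the abstract.

```python
from collections import deque

def mesh_neighbors(node, width, height):
    """Yield the 4-connected neighbors of a node in a width x height mesh."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            yield (nx, ny)

def build_routing_table(source, width, height, faulty):
    """Map each reachable destination to the first hop on a shortest fault-free path."""
    table = {}
    visited = {source}
    queue = deque()
    # Seed the search with the healthy neighbors; each one is its own first hop.
    for nb in mesh_neighbors(source, width, height):
        if nb not in faulty:
            visited.add(nb)
            table[nb] = nb
            queue.append(nb)
    while queue:
        node = queue.popleft()
        for nb in mesh_neighbors(node, width, height):
            if nb not in faulty and nb not in visited:
                visited.add(nb)
                table[nb] = table[node]  # inherit the first hop of the parent
                queue.append(nb)
    return table

# 3x3 mesh with a faulty center node (hypothetical example).
table = build_routing_table((0, 0), 3, 3, faulty={(1, 1)})
```

Because the neighbor order is fixed, the traversal always produces the same table for the same fault set, which is what makes each path deterministic. Destinations cut off by faults simply do not appear in the table.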

12:30  IP5-1, 705  THERMAL-AWARENESS IN A SOFT ERROR TOLERANT ARCHITECTURE
Speaker:
Sajjad Hussain, Chair for Embedded Systems, KIT, DE
Authors:
Sajjad Hussain¹, Muhammad Shafique² and Joerg Henkel¹
¹Karlsruhe Institute of Technology, DE; ²Vienna University of Technology (TU Wien), AT
Abstract
It is crucial to provide soft error reliability in a power-efficient manner such that the maximum chip temperature remains within safe operating limits. Different execution phases of an application have diverse performance, power, temperature and vulnerability behavior that can be leveraged to fulfill the resiliency requirements within the allowed thermal constraints. We propose a soft error tolerant architecture with fine-grained redundancy for different architectural components, such that their reliable operation can be activated selectively at fine granularity to maximize reliability under a given thermal constraint. Compared with the state of the art, our temperature-aware fine-grained reliability manager provides up to 30% reliability within the thermal budget.

12:31  IP5-2, 547  A SOFTWARE-LEVEL REDUNDANT MULTITHREADING FOR SOFT/HARD ERROR DETECTION AND RECOVERY
Speaker:
Hwisoo So, Yonsei University, KR
Authors:
Moslem Didehban¹, HwiSoo So², Aviral Shrivastava¹ and Kyoungwoo Lee²
¹Arizona State University, US; ²Yonsei University, KR
Abstract
Advances in semiconductor technology have enabled unprecedented growth in safety-critical applications. In such environments, error resiliency is one of the main design concerns. Software-level Redundant MultiThreading (RMT) is one of the most promising error resilience strategies because it can potentially serve as an inexpensive and flexible solution for hardware unreliability issues, i.e., soft and hard errors. However, the error coverage of existing software-level RMT solutions is limited to soft error detection, and they rely on external schemes for error recovery. In this paper, we investigate the potential of software-level RMT schemes for complete soft and hard error detection and recovery. First, we pinpoint the main reasons behind the ineffectiveness of basic software-level triple redundant multithreading (STRMT) in protecting against soft and hard errors. Then we introduce FISHER (FlexIble Soft and Hard Error Resiliency), a software-only RMT scheme that can achieve comprehensive error resiliency against both soft and hard errors. Rather than performing centralized voting operations on the operands of critical instructions, FISHER distributes and intertwines error detection and recovery operations between redundant threads. To evaluate the effectiveness of the proposed solution, we performed more than 135,000 soft and hard error injection experiments on different hardware components of an ARM Cortex-A53-like μ-architecturally simulated microprocessor. The results demonstrate that FISHER can reduce the failure rate of programs by around 261× and 162× compared to the original and basic STRMT-protected versions of the programs, respectively.
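
For context, the centralized voting that basic triple redundancy relies on can be sketched as follows. This is a generic illustration of majority voting, not FISHER and not the authors' STRMT implementation: the three copies run sequentially rather than in separate threads, and the `fault` parameter is a hypothetical hook that corrupts one result to emulate a soft error.

```python
from collections import Counter

def vote(results):
    """Return the majority value among redundant results, or raise if there is none."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority: unrecoverable error")
    return value

def run_redundantly(func, args, fault=None):
    """Run func three times; `fault` optionally corrupts one copy to emulate a soft error."""
    results = [func(*args) for _ in range(3)]
    if fault is not None:
        copy_index, corrupt = fault
        results[copy_index] = corrupt(results[copy_index])
    return vote(results)

# Emulate a single bit flip in the second copy; the vote masks it.
recovered = run_redundantly(lambda a, b: a + b, (2, 3), fault=(1, lambda r: r ^ 0x4))
```

A single voter of this kind is itself a single point of failure, which is one reason a scheme might distribute detection and recovery across the redundant threads instead.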

12:32  IP5-3, 317  COMMON-MODE FAILURE MITIGATION: INCREASING DIVERSITY THROUGH HIGH-LEVEL SYNTHESIS
Speaker:
Farah Naz Taher, University of Texas at Dallas, US
Authors:
Farah Naz Taher¹, Matthew Joslin¹, Anjana Balachandran², Zhiqi Zhu¹ and Benjamin Carrion Schaefer¹
¹The University of Texas at Dallas, US; ²The Hong Kong Polytechnic University, HK
Abstract
Fault tolerance is vital in many domains. One popular way to increase fault tolerance is through hardware redundancy. However, basic redundancy cannot cope with Common Mode Failures (CMFs). One way to address CMFs is to use diversity in combination with traditional hardware redundancy. This work proposes an automatic design space exploration (DSE) method to generate optimized redundant hardware accelerators with maximum diversity to protect against CMFs given as a single behavioral description for High-Level Synthesis (HLS). For this purpose, this work exploits one of the main advantages of C-based VLSI design over traditional RT-level design based on low-level Hardware Description Languages (HDLs): the ability to generate micro-architectures with unique characteristics from the same behavioral description. Experimental results show that the proposed method provides a significant diversity increment compared to using traditional RTL-based exploration to generate diverse designs.

12:30  End of session
Lunch Break in Lunch Area



Coffee Breaks in the Exhibition Area

On all conference days (Tuesday to Thursday), coffee and tea will be served during the coffee breaks in the exhibition area.

Lunch Breaks (Lunch Area)

On all conference days (Tuesday to Thursday), a seated lunch (lunch buffet) will be offered in the Lunch Area to fully registered conference delegates only. There will be badge control at the entrance to the lunch break area.
