Evaluating the Impact of Execution Parameters on Program Vulnerability in GPU Applications

Fritz G. Previlon [1,a], Charu Kalra [1,b], David R. Kaeli [1,c] and Paolo Rech [2,a]
[1] Northeastern University, Boston, MA, USA
[a] previlon@ece.neu.edu
[b] ckalra@ece.neu.edu
[c] kaeli@ece.neu.edu
[2] UFRGS, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
prech@inf.ufrgs.br

ABSTRACT


While transient faults continue to be a major concern for the High Performance Computing (HPC) community, we still lack a clear understanding of how these faults propagate in applications. This paper addresses two aspects of the vulnerability of HPC applications run on Graphics Processing Units (GPUs): its dependence on input data and on thread-block size. To characterize fault propagation as a function of these parameters, we leverage an ISA-level fault injection framework and carry out an extensive fault injection campaign on a suite of GPU applications. Our results show that the vulnerability of most of the programs studied is insensitive to changes in input values, except in the less common cases where the input values are highly biased, i.e., values that give rise to special vulnerability behavior. For example, because zero times any number is zero, a zero operand is a biased input for multiplication operations. Our study also examines how the GPU thread-block size affects vulnerability. We found that, like performance, the vulnerability of an application can depend on the block size of its kernels. In some applications, the silent data corruption rate varies by as much as 8% when the block size of a kernel is changed.
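
The zero-operand example can be made concrete with a small sketch. This is not the ISA-level fault injection framework used in the paper; it is a toy illustration in Python, and the helper names flip_bit and sdc_rate are hypothetical. It flips each mantissa bit of one single-precision operand in turn and reports the fraction of flips that change the product, which is assumed here to stand in for a silent data corruption.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of the IEEE-754 single-precision encoding."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    (faulty,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return faulty


def sdc_rate(a: float, b: float, bits=range(23)) -> float:
    """Fraction of single-bit flips in operand `a` that change the product a * b.

    Assumption for this sketch: only the 23 mantissa bits are flipped, so the
    faulty operand is always finite and no NaN or infinity can arise.
    """
    golden = a * b  # fault-free result
    corrupted = sum(1 for bit in bits if flip_bit(a, bit) * b != golden)
    return corrupted / len(bits)


if __name__ == "__main__":
    print("b = 2.0:", sdc_rate(3.0, 2.0))  # every flip changes the product
    print("b = 0.0:", sdc_rate(3.0, 0.0))  # every flip is masked by the zero
```

With a nonzero second operand every injected flip alters the result, whereas a zero second operand masks all of them, which is the sense in which zero is a biased input for multiplication.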


