Document Type
Article
Publication Date
8-1-2015
Abstract
As HPC systems approach Exascale, their circuit features will shrink while their overall size will grow, both at a fixed power limit. These trends imply that soft faults in electronic circuits will become an increasingly significant problem for programs that run on these systems, causing them to occasionally crash or, worse, silently return incorrect results. This is motivating extensive work on program resilience to such faults, ranging from generic mechanisms such as replication or checkpoint/restart to algorithm-specific error detection and resilience mechanisms. Effective use of such mechanisms requires a detailed understanding of (1) which vulnerable parts of the program are most worth protecting and (2) the performance and resilience impact of fault resilience mechanisms on the program. This paper presents FaultTelescope, a tool that combines these two analyses and generates actionable insights by presenting program vulnerabilities and the impact of fault resilience mechanisms in an intuitive way.
Publication Source (Journal or Book title)
Journal of Supercomputing
First Page
2963
Last Page
2984
Recommended Citation
Chen, S., Bronevetsky, G., Li, B., Guix, M., & Peng, L. (2015). A framework for evaluating comprehensive fault resilience mechanisms in numerical programs. Journal of Supercomputing, 71(8), 2963-2984. https://doi.org/10.1007/s11227-015-1422-z