Utilizing IBM Spectrum LSF Simulator to Understand the Impacts of Adding AI Workloads to Capability Supercomputing (ORNL Report, December 2022)
Approaching the Final Frontier: Lessons Learned from the Deployment of HPE/Cray EX Spock and Crusher supercomputers (Conference Paper, May 2022)
Reliability Lessons Learned From GPU Experience With The Titan Supercomputer at Oak Ridge Leadership Computing Facility (Conference Paper, November 2015)
Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems... (Conference Paper, June 2015)
Experience with GPUs on the Titan Supercomputer from a Reliability, Performance and Power Perspective (Conference Paper, May 2015)
Analyzing the Interplay of Failures and Workload on a Leadership-Class Supercomputer (Conference Paper, April 2015)
Understanding GPU Errors on Large-scale HPC Systems and the Implications for System Design and Operation... (Conference Paper, February 2015)