GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability

10.13139/ORNLNCCS/1657202

George Ostrouchov, Don Maxwell, Rizwan Ashraf, Mallikarjun Shankar, and James Rogers. 2020. GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20). Association for Computing Machinery, New York, NY, USA.

Data and code for the SC20 paper about Titan GPU reliability analysis. https://github.com/olcf/TitanGPULife

Includes R code to generate graphics for the paper and additional analyses. See code/README for instructions.

Includes original Titan GPU reliability data on over 100,000 collective hours of operation:
  • data/titan.gpu.history.txt - history data
  • data/titan.service.txt - service nodes for exclusion

Includes output data files produced by code/TitanGPUmodel.Rmd:
  • data/gc_full.csv - cleaned up data (see paper and R code)
  • data/gc_summary_loc.csv - one record per GPU (variables: SN,time,nlife,nloc,last,col,row,cage,slot,node,max_loc_events,time_max_loc,dbe,dbe_loc,otb,otb_loc,out,batch,days,years,dead,dead_otb,dead_dbe) (see paper and R code)

Includes .Rmd analysis document as TitanGPUmode.html

Includes Python code to process data/gc_full.csv into graphics from time-between-failure analyses. See code/tbf-analyses/README for instructions.
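As a rough orientation for readers who want to explore the per-GPU summary file before running the repository's own R code, the following is a minimal Kaplan-Meier sketch in plain Python. It is not the authors' analysis. It assumes that the `years` column of data/gc_summary_loc.csv holds each GPU's observed lifetime and that `dead` flags a failure (coded as 1 or TRUE) versus a censored unit; both interpretations are guesses from the variable names and should be checked against the paper and R code. The helper names `load_lifetimes` and `kaplan_meier` are hypothetical, and the sample rows are synthetic, not values from the Titan data.

```python
import csv
import io

# Synthetic rows for illustration only; these are NOT values from the Titan dataset.
SAMPLE_CSV = """SN,years,dead
A,1.0,1
B,2.0,0
C,2.0,1
D,3.0,1
E,4.0,0
"""


def load_lifetimes(fileobj, time_col="years", event_col="dead"):
    """Read durations and event flags from a CSV file object.

    Assumption: the event column is coded 1/0 or TRUE/FALSE.
    """
    durations, events = [], []
    for row in csv.DictReader(fileobj):
        durations.append(float(row[time_col]))
        flag = row[event_col].strip().upper()
        events.append(1 if flag in ("1", "TRUE") else 0)
    return durations, events


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct time where a failure occurred."""
    pairs = sorted(zip(durations, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        removed = 0
        # Group all units that leave the risk set at time t.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Failed and censored units both leave the risk set after time t.
        at_risk -= removed
    return curve


durations, events = load_lifetimes(io.StringIO(SAMPLE_CSV))
curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"t = {t:.1f} years, S(t) = {s:.3f}")
```

To try this on the real file, replace `io.StringIO(SAMPLE_CSV)` with `open("data/gc_summary_loc.csv", newline="")`, after confirming the column meanings and encoding.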

Published: 2020-09-02 15:09:44

Dataset Properties

Field Value
Authors
  • Shankar, Mallikarjun Oak Ridge National Laboratory
  • Ostrouchov, George Oak Ridge National Laboratory
  • Maxwell, Don Oak Ridge National Laboratory
  • Rogers, James Oak Ridge National Laboratory
  • Ashraf, Rizwan Oak Ridge National Laboratory
  • Engelmann, Christian Oak Ridge National Laboratory
Project Identifier STF011, OLCF, CADES
Dataset Type ND Numeric Data
Subjects
  • 42 ENGINEERING
  • 47 OTHER INSTRUMENTATION
Keywords
  • GPU
  • reliability
  • supercomputer
  • NVIDIA
  • Cray
  • large-scale systems
  • Kaplan-Meier survival
  • Cox regression
Software Needed R, Python
Originating Organizations Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organizations Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)
DOE Contract DE-AC05-00OR22725

Acknowledgements

Papers using this dataset are requested to include the following text in their acknowledgements:

Support for 10.13139/ORNLNCCS/1657202 is provided by the U.S. Department of Energy, project STF011, OLCF, CADES under Contract DE-AC05-00OR22725. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility.