
GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability

  • Shankar, Mallikarjun | Oak Ridge National Laboratory
  • Ostrouchov, George | Oak Ridge National Laboratory
  • Maxwell, Don | Oak Ridge National Laboratory
  • Rogers, James | Oak Ridge National Laboratory
  • Ashraf, Rizwan | Oak Ridge National Laboratory
  • Engelmann, Christian | Oak Ridge National Laboratory
Overview

Description

George Ostrouchov, Don Maxwell, Rizwan Ashraf, Mallikarjun Shankar, and James Rogers. 2020. GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '20). Association for Computing Machinery, New York, NY, USA.

Data and code for the SC20 paper on Titan GPU reliability analysis: https://github.com/olcf/TitanGPULife. The repository includes R code that generates the graphics for the paper and additional analyses; see code/README for instructions.

Original Titan GPU reliability data, covering over 100,000 collective hours of operation:

  • data/titan.gpu.history.txt: history data
  • data/titan.service.txt: service nodes for exclusion

Output data files produced by code/TitanGPUmodel.Rmd (see paper and R code):

  • data/gc_full.csv: cleaned-up data
  • data/gc_summary_loc.csv: one record per GPU (variables: SN, time, nlife, nloc, last, col, row, cage, slot, node, max_loc_events, time_max_loc, dbe, dbe_loc, otb, otb_loc, out, batch, days, years, dead, dead_otb, dead_dbe)

The .Rmd analysis document is also included in rendered form as TitanGPUmodel.html. Python code that processes data/gc_full.csv into graphics for the time-between-failure analyses is included as well; see code/tbf-analyses/README for instructions.
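
The per-GPU summary file lists `years` and `dead` among its variables, and the keywords name Kaplan-Meier survival as one of the methods. As a rough illustration of how such a file might be explored, the sketch below computes a Kaplan-Meier curve in plain Python. It is not the authors' code (their analysis is in R), and it rests on assumptions that are not documented on this page: that `years` is each GPU's observed time in service, and that `dead` flags a failure (as 1 or TRUE) as opposed to a censored unit. The function names `kaplan_meier` and `load_summary` are hypothetical.

```python
import csv
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival)] at each distinct failure time.

    durations: observed time for each unit (e.g. years in service)
    events:    1 if the unit failed at that time, 0 if it was censored
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # failures and censorings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


def load_summary(path):
    """Read durations and event flags from a per-GPU summary CSV.

    ASSUMPTION: `years` holds the observed lifetime and `dead` is an
    event indicator written as 1/0 or TRUE/FALSE. Check the paper and
    the R code before relying on this interpretation.
    """
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    durations = [float(r["years"]) for r in rows]
    events = [1 if r["dead"].strip().upper() in ("1", "TRUE") else 0 for r in rows]
    return durations, events


# Tiny synthetic example (not Titan data): five units, two of them censored.
demo_curve = kaplan_meier([1.0, 2.0, 2.0, 3.0, 4.0], [1, 1, 0, 0, 1])
```

With the real data, one would call `load_summary("data/gc_summary_loc.csv")` and pass the result to `kaplan_meier`; the R code in the repository remains the reference for how the published curves were produced.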

Funding resources

DOE contract number

DE-AC05-00OR22725

Originating research organization

Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring organization

Office of Science (SC), Advanced Scientific Computing Research (ASCR) (SC-21)

Details

DOI

10.13139/ORNLNCCS/1657202

Release date

September 2, 2020

Dataset

Dataset type

ND Numeric Data

Software

R, Python

Acknowledgements

Users should acknowledge the OLCF in all publications and presentations that speak to work performed on OLCF resources:

This work was carried out [in part] at Oak Ridge National Laboratory, managed by UT-Battelle, LLC for the U.S. Department of Energy under contract DE-AC05-00OR22725.

Category

  • 42 ENGINEERING
  • 47 OTHER INSTRUMENTATION

Keywords

  • GPU
  • reliability
  • supercomputer
  • NVIDIA
  • Cray
  • large-scale systems
  • Kaplan-Meier survival
  • Cox regression