OLCF Summit Supercomputer GPU Snapshots During Double-Bit Errors and Normal Operations

  • Shin, Woong | Oak Ridge National Laboratory
  • Oles, Vladyslav | Oak Ridge National Laboratory
  • Schmedding, Anna | William & Mary
  • Ostrouchov, George | Oak Ridge National Laboratory
  • Smirni, Evgenia | William & Mary
  • Engelmann, Christian | Oak Ridge National Laboratory
  • Wang, Feiyi | Oak Ridge National Laboratory
Overview

Description

As we move into the exascale era, the power and energy footprints of high-performance computing (HPC) systems have grown significantly. Under the harsh power and thermal conditions of such systems, components are exposed to extreme operating conditions. Operating modern HPC systems requires deep insight into long-term system behavior to maintain their efficiency and longevity. To help the HPC community gain such insight, we provide GPU snapshots taken during double-bit errors and normal operations, derived from system telemetry data and logs collected from the Summit supercomputer, which is equipped with 27,648 Tesla V100 GPUs with 2nd-generation high-bandwidth memory (HBM2). The dataset relies on Nvidia XID records collected internally by GPU firmware at the time of failure, on the reboot-time logs of each Summit node, on node-level job scheduler records collected after each job terminates, and on telemetry sampled at 1 Hz from the baseboard management controllers (BMCs) of each Summit compute node using the OpenBMC event subscription protocol.
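
As a rough illustration of how XID records and 1 Hz telemetry could be combined into per-error snapshots, the Python sketch below selects the telemetry samples of the affected GPU in a window leading up to each double-bit error. It is a minimal sketch under stated assumptions, not the dataset's actual schema or the authors' pipeline: the record types `Sample` and `XidEvent`, their field names, and the 60-second default window are hypothetical. The only external fact used is that Nvidia XID 48 denotes a double-bit ECC error.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

DBE_XID = 48  # Nvidia XID 48 denotes a double-bit ECC error


@dataclass
class Sample:
    """One 1 Hz telemetry sample for a single GPU (hypothetical fields)."""
    timestamp: int   # seconds since epoch
    gpu_id: str      # hypothetical GPU identifier
    power_w: float   # hypothetical power reading, watts
    temp_c: float    # hypothetical temperature reading, Celsius


@dataclass
class XidEvent:
    """One XID record (hypothetical fields)."""
    timestamp: int
    gpu_id: str
    xid: int


def snapshot(samples: List[Sample], event: XidEvent,
             before_s: int = 60) -> List[Sample]:
    """Return samples of the event's GPU in [event time - before_s, event time]."""
    start = event.timestamp - before_s
    return [s for s in samples
            if s.gpu_id == event.gpu_id
            and start <= s.timestamp <= event.timestamp]


def dbe_snapshots(samples: List[Sample], events: List[XidEvent],
                  before_s: int = 60) -> Dict[Tuple[str, int], List[Sample]]:
    """Map (gpu_id, event time) to its snapshot, for double-bit errors only."""
    return {(e.gpu_id, e.timestamp): snapshot(samples, e, before_s)
            for e in events if e.xid == DBE_XID}
```

A linear scan per event keeps the sketch short; for telemetry at the scale of a full system, an index on GPU and time would likely be needed instead.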

Funding resources

DOE contract number

DE-AC05-00OR22725

Originating research organization

Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring organization

Office of Science (SC)

Related resources

Details

DOI

10.13139/OLCF/1970187

Release date

April 20, 2023

Dataset

Dataset type

ND Numeric Data

Other ID number(s)

gen150

Acknowledgements

Users should acknowledge the OLCF in all publications and presentations that speak to work performed on OLCF resources:

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Category

  • 97 MATHEMATICS AND COMPUTING
  • 99 GENERAL AND MISCELLANEOUS

Keywords

  • High-performance Computing
  • system power and thermal
  • reliability
  • HBM2
  • GPUs
  • medium temperature water cooling
  • direct liquid cooling