OLCF Summit Supercomputer GPU Snapshots During Double-Bit Errors and Normal Operations

10.13139/OLCF/1970187

As we move into the exascale era, the power and energy footprints of high-performance computing (HPC) systems have grown significantly. Because of the harsh power and thermal conditions within such systems, their components are exposed to extreme operating conditions. Operating modern HPC systems requires deep insight into long-term system behavior to maintain their efficiency and longevity. To help the HPC community gain such insight, we provide GPU snapshots taken during double-bit errors and during normal operations, built from system telemetry data and logs collected from the Summit supercomputer, which is equipped with 27,648 Tesla V100 GPUs with 2nd-generation high-bandwidth memory (HBM2). The dataset relies on Nvidia XID records collected internally by GPU firmware at the time of each failure, on the reboot-time logs of each Summit node, on node-level job scheduler records collected after each job termination, and on telemetry sampled at 1 Hz from the baseboard management controllers (BMCs) of each Summit compute node using the OpenBMC event subscription protocol.

Published: 2023-04-20 15:50:18

Dataset Properties

Field Value
Authors
  • Shin, Woong Oak Ridge National Laboratory
  • Oles, Vladyslav Oak Ridge National Laboratory
  • Schmedding, Anna William and Mary
  • Ostrouchov, George Oak Ridge National Laboratory
  • Smirni, Evgenia William and Mary
  • Engelmann, Christian Oak Ridge National Laboratory
  • Wang, Feiyi Oak Ridge National Laboratory
Project Identifier stf218
Dataset Type ND Numeric Data
Subjects
  • 97 MATHEMATICS AND COMPUTING
  • 99 GENERAL AND MISCELLANEOUS
Keywords
  • High-performance Computing
  • system power and thermal
  • reliability
  • HBM2
  • GPUs
  • medium temperature water cooling
  • direct liquid cooling
Originating Organizations Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organizations Office of Science (SC)
DOE Contract DE-AC05-00OR22725
Related Identifiers
  • IsDerivedFrom (DOI) https://doi.org/10.1145/3458817.3476188
  • IsSupplementTo (DOI) https://doi.org/10.13139/OLCF/1861393

Acknowledgements

Papers using this dataset are requested to include the following text in their acknowledgements:

Support for 10.13139/OLCF/1970187 is provided by the U.S. Department of Energy, project stf218 under Contract DE-AC05-00OR22725. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility.