OLCF Summit Supercomputer GPU Snapshots During Double-Bit Errors and Normal Operations

Shin, Woong | Oak Ridge National Laboratory
Oles, Vladyslav | Oak Ridge National Laboratory
Schmedding, Anna | William & Mary
Ostrouchov, George | Oak Ridge National Laboratory
Smirni, Evgenia | William & Mary
Engelmann, Christian | Oak Ridge National Laboratory
Wang, Feiyi | Oak Ridge National Laboratory

Overview

Description

As we move into the exascale era, the power and energy footprints of high-performance computing (HPC) systems have grown significantly larger. Under these harsh power and thermal conditions, system components are exposed to extreme operating conditions. Operating such systems requires deep insight into long-term system behavior to maintain both efficiency and longevity. To help the HPC community gain such insight, we provide GPU snapshots taken during double-bit errors and during normal operations, built from system telemetry data and logs collected from the Summit supercomputer, which is equipped with 27,648 Tesla V100 GPUs with 2nd-generation high-bandwidth memory (HBM2). The dataset relies on four sources: Nvidia XID records collected internally by GPU firmware at the time of each failure, the reboot-time logs of each Summit node, node-level job scheduler records collected after each job termination, and telemetry collected at 1 Hz from the baseboard management controllers (BMCs) of each Summit compute node using the OpenBMC event subscription protocol. Technical details can be found in the paper Oles et al., “Understanding GPU Memory Corruption at Extreme Scale: The Summit Case Study,” ICS’24 (https://doi.org/10.1145/3650200.3656615).
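To illustrate what an XID record for a double-bit error can look like, here is a minimal Python sketch that picks such an event out of a driver log line. It assumes the line shape commonly printed by the Nvidia kernel driver ("NVRM: Xid (PCI:<bus id>): <code>, ...") and uses XID code 48, which Nvidia documents as a double-bit ECC error. The function names and the sample line are hypothetical, and the sketch does not describe the file layout of this dataset.

```python
import re
from typing import Optional, Tuple

# Assumed log-line shape, modeled on Nvidia driver kernel messages.
# This is an illustration only, not a loader for the dataset's files.
XID_LINE = re.compile(
    r"NVRM: Xid \((?:PCI:)?(?P<bus>[0-9a-fA-F:.]+)\): (?P<code>\d+)"
)

# XID 48 is documented by Nvidia as a double-bit ECC error.
DOUBLE_BIT_ECC_XID = 48


def parse_xid(line: str) -> Optional[Tuple[str, int]]:
    """Return (PCI bus id, XID code) if the line holds an XID record, else None."""
    match = XID_LINE.search(line)
    if match is None:
        return None
    return match.group("bus"), int(match.group("code"))


def is_double_bit_error(line: str) -> bool:
    """True when the line is an XID record whose code marks a double-bit ECC error."""
    parsed = parse_xid(line)
    return parsed is not None and parsed[1] == DOUBLE_BIT_ECC_XID


# Hypothetical sample line, written for this example.
sample = "NVRM: Xid (PCI:0004:04:00): 48, pid=1234, double bit ECC error detected"

if __name__ == "__main__":
    print(parse_xid(sample))
    print(is_double_bit_error(sample))
```

Filtering on the numeric code rather than on the message text keeps the check independent of the wording that follows the code, which can differ between driver versions.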

Funding resources

DOE contract number

DE-AC05-00OR22725

Originating research organization

Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring organization

Office of Science (SC)

Project identifier

STF218

Related resources

IsDerivedFrom (DOI): https://doi.org/10.1145/3458817.3476188
IsSupplementTo (DOI): https://doi.org/10.13139/OLCF/1861393
IsSupplementedBy (DOI): https://doi.org/10.1145/3650200.3656615

Details

DOI

10.13139/OLCF/1970187

Release date

April 20, 2023

Dataset

Dataset type

ND Numeric Data

Other ID number(s)

GEN150

Acknowledgements

Users should acknowledge the OLCF in all publications and presentations that speak to work performed on OLCF resources:

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Keywords

High-performance Computing,
system power and thermal,
reliability,
HBM2,
GPUs,
medium temperature water cooling,
direct liquid cooling