OLCF Summit Supercomputer GPU Snapshots During Double-Bit Errors and Normal Operations
DOI: 10.13139/OLCF/1970187

As we move into the exascale era, the power and energy footprints of high-performance computing (HPC) systems have grown significantly. Because of the harsh power and thermal conditions within such systems, components are exposed to extreme operating conditions. Operating modern HPC systems requires deep insight into long-term system behavior to maintain both efficiency and longevity. To help the HPC community gain such insight, we provide GPU snapshots taken during double-bit errors and normal operations, built from system telemetry data and logs collected from the Summit supercomputer, which is equipped with 27,648 Tesla V100 GPUs with second-generation high-bandwidth memory (HBM2). The dataset relies on Nvidia XID records collected internally by GPU firmware at the time of each failure, on the reboot-time logs of each Summit node, on node-level job scheduler records collected after each job terminates, and on telemetry sampled at 1 Hz from the baseboard management controllers (BMCs) of each Summit compute node using the OpenBMC event subscription protocol.
Published: 2023-04-20 15:50:18

Dataset Properties
Field | Value |
---|---|
Authors | |
Project Identifier | stf218 |
Dataset Type | ND Numeric Data |
Subjects | |
Keywords | |
Originating Organizations | Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States) |
Sponsoring Organizations | Office of Science (SC) |
DOE Contract | DE-AC05-00OR22725 |
Other Identifying Numbers | gen150 |
Related Identifiers | |
Acknowledgements
Papers using this dataset are requested to include the following text in their acknowledgements:
*Support for 10.13139/OLCF/1970187 is provided by the U.S. Department of Energy, project stf218 under Contract DE-AC05-00OR22725. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility.*