Long Term Per-Component Power and Thermal Measurements of the OLCF Summit System

10.13139/OLCF/1861393

As we move into the exascale era, the power and energy footprints of high-performance computing (HPC) systems have grown significantly larger. Under the harsh power and thermal conditions of such systems, components are exposed to extreme operating conditions. Operating modern HPC systems therefore requires deep insight into long-term system behavior to maintain their efficiency and longevity. To help the HPC community gain such insight, we provide a dataset that records the long-term power and thermal behavior of Summit, the 200 PF pre-exascale supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). Summit is an IBM AC922 based system with 9,252 IBM Power9 CPUs and 27,756 Nvidia V100 GPUs, and it can consume up to 13 MW of power at peak. Heat is removed using medium temperature direct liquid cooling and a rear-door heat exchanger based secondary cooling loop. The dataset is extracted from high-resolution (1 Hz) per-component (GPU and CPU) measurements of the system. It primarily provides 10-second and 1-minute mean power and thermal measurements selected from five month-long segments in 2020 (January and August), 2021 (February and August), and 2022 (January). For convenience, we also provide several sub-datasets randomly sampled across time and across the hosts of the cluster. Further details and example code for analysis can be found in the following GitHub repository: https://github.com/at-aaims/summit_power_and_thermal_data
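To make the relationship between the 1 Hz source measurements and the 10-second and 1-minute means concrete, the following is a minimal sketch using Pandas. It runs on synthetic data, not on the dataset itself: the column names `gpu0_power_w` and `gpu0_temp_c` and the helper `downsample_mean` are hypothetical, and the actual aggregation pipeline used to produce the dataset may differ.

```python
import pandas as pd


def downsample_mean(samples: pd.DataFrame, window: str) -> pd.DataFrame:
    """Average time-indexed readings over fixed, non-overlapping windows."""
    return samples.resample(window).mean()


# Synthetic stand-in for two minutes of 1 Hz telemetry from one component.
# The values and column names are illustrative assumptions only.
index = pd.date_range("2020-01-01 00:00:00", periods=120, freq="s")
raw = pd.DataFrame(
    {
        "gpu0_power_w": [100.0 + (i % 10) for i in range(120)],
        "gpu0_temp_c": [40.0 + (i % 60) / 10.0 for i in range(120)],
    },
    index=index,
)

# 120 one-second samples collapse into 12 ten-second means and 2 one-minute means.
mean_10s = downsample_mean(raw, "10s")
mean_1min = downsample_mean(raw, "1min")

print(mean_10s.head())
print(mean_1min)
```

Because every window here holds the same number of samples, averaging six consecutive 10-second means reproduces the corresponding 1-minute mean. That shortcut would not hold if some seconds were missing from the source.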

Published: 2022-04-11 16:28:21

Dataset Properties

Field Value
Authors
  • Shin, Woong Oak Ridge National Laboratory
  • Ellis, J. Austin Oak Ridge National Laboratory
  • Karimi, Ahmad Maroof Oak Ridge National Laboratory
  • Oles, Vladyslav Oak Ridge National Laboratory
  • Dash, Sajal Oak Ridge National Laboratory
  • Wang, Feiyi Oak Ridge National Laboratory
Project Identifier STF218
Dataset Type ND Numeric Data
Subjects
  • 97 MATHEMATICS AND COMPUTING
  • 99 GENERAL AND MISCELLANEOUS
Keywords
  • High-performance Computing
  • system power and thermal
  • reliability
  • CPUs
  • GPUs
  • medium temperature water cooling
  • direct liquid cooling
Software Needed Pandas (https://pandas.pydata.org/), Pyarrow (https://arrow.apache.org/docs/python/index.html), python-snappy (http://google.github.io/snappy), Dask (https://dask.org/)
Originating Organizations Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organizations Office of Science (SC)
DOE Contract DE-AC05-00OR22725
Related Identifiers
  • IsDerivedFrom (DOI) https://doi.org/10.1145/3458817.3476188

Acknowledgements

Papers using this dataset are requested to include the following text in their acknowledgements:

Support for 10.13139/OLCF/1861393 is provided by the U.S. Department of Energy, project STF218 under Contract DE-AC05-00OR22725. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility.