
Long Term Per-Component Power and Thermal Measurements of the OLCF Summit System

  • Shin, Woong | Oak Ridge National Laboratory
  • Ellis, J. Austin | Oak Ridge National Laboratory
  • Karimi, Ahmad Maroof | Oak Ridge National Laboratory
  • Oles, Vladyslav | Oak Ridge National Laboratory
  • Dash, Sajal | Oak Ridge National Laboratory
  • Wang, Feiyi | Oak Ridge National Laboratory
Overview

Description

As we move into the exascale era, the power and energy footprints of high-performance computing (HPC) systems have grown significantly. Because of the harsh power and thermal conditions within these systems, their components are exposed to extreme operating conditions. Operating such modern HPC systems requires deep insight into long-term system behavior in order to maintain both their efficiency and their longevity. To help the HPC community gain such insight, we provide a dataset that records the long-term power and thermal behavior of Summit, the 200 PF pre-exascale supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). Summit is an IBM AC922-based system that has 9,252 IBM Power9 CPUs and 27,756 Nvidia V100 GPUs and can consume up to 13 MW of power at peak. Heat is removed using medium-temperature direct liquid cooling and a secondary cooling loop based on rear-door heat exchangers. The primary dataset is derived from high-resolution (1 Hz) per-component (GPU and CPU) measurements taken on the system. It provides 10-second and 1-minute mean power and thermal measurements for five month-long segments: January and August 2020, February and August 2021, and January 2022. For convenience, we also provide several sub-datasets randomly sampled across time and across the hosts of the cluster. Further details and example code for analysis can be found in the following GitHub repository: https://github.com/at-aaims/summit_power_and_thermal_data
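To illustrate how 10-second and 1-minute means relate to 1 Hz samples, the following is a minimal Pandas sketch on synthetic data. It is not the pipeline used to build the dataset. The column names (`timestamp`, `hostname`, `gpu0_power_w`, `gpu0_temp_c`), the host names, and the helper `windowed_means` are hypothetical and do not reflect the dataset's actual schema; see the GitHub repository for the real layout and example code.

```python
import numpy as np
import pandas as pd


def windowed_means(df, window):
    """Average every numeric column per host over fixed time windows.

    Assumes (hypothetically) a 'timestamp' datetime column and a
    'hostname' column; all other numeric columns are averaged.
    """
    return (
        df.groupby(["hostname", pd.Grouper(key="timestamp", freq=window)])
        .mean(numeric_only=True)
        .reset_index()
    )


# Synthetic 1 Hz samples for two made-up hosts over two minutes.
rng = np.random.default_rng(0)
timestamps = pd.date_range("2020-01-01", periods=120, freq="s")
frames = []
for host in ["node-a", "node-b"]:
    frames.append(
        pd.DataFrame(
            {
                "timestamp": timestamps,
                "hostname": host,
                "gpu0_power_w": rng.uniform(50, 300, size=len(timestamps)),
                "gpu0_temp_c": rng.uniform(30, 70, size=len(timestamps)),
            }
        )
    )
raw = pd.concat(frames, ignore_index=True)

# One row per host per window, holding the mean of each measurement.
mean_10s = windowed_means(raw, "10s")
mean_1min = windowed_means(raw, "1min")

print(mean_10s.head())
print(mean_1min)
```

With 120 seconds of data per host, each host contributes twelve 10-second windows and two 1-minute windows. Grouping by host before windowing keeps measurements from different nodes from being averaged together.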

Funding resources

DOE contract number

DE-AC05-00OR22725

Originating research organization

Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring organization

Office of Science (SC)

Related resources

Details

DOI

10.13139/OLCF/1861393

Release date

April 11, 2022

Dataset

Dataset type

ND Numeric Data

Software

Pandas (https://pandas.pydata.org/), Pyarrow (https://arrow.apache.org/docs/python/index.html), python-snappy (http://google.github.io/snappy), Dask (https://dask.org/)

Other ID number(s)

GEN150

Acknowledgements

Users should acknowledge the OLCF in all publications and presentations that speak to work performed on OLCF resources:

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Category

  • 97 MATHEMATICS AND COMPUTING
  • 99 GENERAL AND MISCELLANEOUS

Keywords

  • High-performance Computing
  • system power and thermal
  • reliability
  • CPUs
  • GPUs
  • medium temperature water cooling
  • direct liquid cooling