Collection of Disk Failure Events from Alpine, the Parallel File System for Summit Supercomputer

  • George, Anjus | Oak Ridge National Laboratory
  • Hanley, Jesse | Oak Ridge National Laboratory
  • Zimmer, Christopher | Oak Ridge National Laboratory
Overview

Description

This dataset contains hard disk drive (HDD) failure events collected from Alpine, the storage system of the Summit supercomputer hosted at the Oak Ridge Leadership Computing Facility (OLCF). The collection spans January 4, 2019, to December 21, 2023 (4 years, 11 months, and 18 days), covering 89% of the system's operational lifetime. It includes 3,766 disk failure events, each recorded with its detection timestamp (in ISO 8601 format) and its location within the storage system: rack, enclosure, and drive slot number.
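As a rough illustration of how records of this shape might be processed, the sketch below parses a few failure events and counts them per year and per enclosure. The column names (`timestamp`, `rack`, `enclosure`, `slot`), the CSV layout, and the sample rows are all assumptions made for the example; they are not taken from the dataset, so check the actual file header before adapting it. The last lines derive the inclusive length of the collection window and a mean daily failure rate from the figures stated in the description.

```python
import csv
import io
from collections import Counter
from datetime import date, datetime

# Hypothetical header and made-up rows, for illustration only.
# The real dataset's file format and column names may differ.
SAMPLE = """timestamp,rack,enclosure,slot
2019-01-04T08:15:00,1,2,17
2019-06-30T23:59:59,1,2,40
2021-03-12T11:02:45,7,1,3
"""


def load_events(text):
    """Parse CSV text into a list of dicts with typed fields."""
    events = []
    for row in csv.DictReader(io.StringIO(text)):
        events.append({
            # ISO 8601 timestamps without a "Z" suffix parse on Python 3.7+.
            "timestamp": datetime.fromisoformat(row["timestamp"]),
            "rack": int(row["rack"]),
            "enclosure": int(row["enclosure"]),
            "slot": int(row["slot"]),
        })
    return events


def failures_per_year(events):
    """Count failure events by calendar year of detection."""
    return Counter(e["timestamp"].year for e in events)


def failures_per_enclosure(events):
    """Count failure events by (rack, enclosure) pair."""
    return Counter((e["rack"], e["enclosure"]) for e in events)


events = load_events(SAMPLE)
per_year = failures_per_year(events)
per_enclosure = failures_per_enclosure(events)

# Collection window from the description, counting both end dates.
window_days = (date(2023, 12, 21) - date(2019, 1, 4)).days + 1

# Mean failures per day over the window, using the stated total of 3,766.
mean_per_day = 3766 / window_days

print(per_year, per_enclosure, window_days, round(mean_per_day, 2))
```

Grouping by `(rack, enclosure)` is only one possible choice; the same `Counter` pattern applies to slots or to finer time buckets such as months.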

Funding resources

DOE contract number

DE-AC05-00OR22725

Originating research organization

Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Other contributing organizations

Oak Ridge National Laboratory

Sponsoring organization

Office of Science (SC)

Details

DOI

10.13139/OLCF/2441482

Release date

September 19, 2024

Dataset

Dataset type

ND Numeric Data

Acknowledgements

Users should acknowledge the OLCF in all publications and presentations that speak to work performed on OLCF resources:

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Category

  • 97 MATHEMATICS AND COMPUTING

Keywords

  • Disk failures
  • Parallel File System
  • GPFS
  • Alpine
  • Summit
  • HPC storage
  • Supercomputer