Collection of Disk Failure Events from Alpine, the Parallel File System for Summit Supercomputer

  • George, Anjus | Oak Ridge National Laboratory
  • Hanley, Jesse | Oak Ridge National Laboratory
  • Zimmer, Christopher | Oak Ridge National Laboratory
Overview

Description

This dataset contains hard disk drive (HDD) failure events collected from Alpine, the storage system of the Summit supercomputer hosted at the Oak Ridge Leadership Computing Facility (OLCF). The events span January 4, 2019, to December 21, 2023 (4 years, 11 months, and 18 days), covering 89% of the system's operational lifetime. The dataset includes 3,766 disk failure events, each recorded with its detection timestamp (in ISO 8601 format) and its location within the storage system: rack, enclosure, and drive slot number.
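
As a rough illustration of how records of this shape might be processed, the sketch below parses ISO 8601 timestamps and groups failures by year and by enclosure. The CSV layout, the column names (`timestamp`, `rack`, `enclosure`, `slot`), and the sample rows are assumptions made for this example and are not taken from the dataset; the actual file format should be checked after download. Only the collection window and the event count come from the description above.

```python
import csv
import io
from collections import Counter
from datetime import date, datetime

# Hypothetical sample rows: column names and values are illustrative only,
# not copied from the dataset.
SAMPLE = """timestamp,rack,enclosure,slot
2019-01-04T08:15:00,1,2,17
2019-06-30T23:59:10,3,1,42
2023-12-21T12:00:00,1,2,5
"""

def load_events(text):
    """Parse CSV text into a list of dicts, one per failure event."""
    events = []
    for row in csv.DictReader(io.StringIO(text)):
        events.append({
            # ISO 8601 timestamps parse directly with the standard library.
            "time": datetime.fromisoformat(row["timestamp"]),
            # Location fields are kept as strings, since their real
            # format is not specified in the description.
            "rack": row["rack"],
            "enclosure": row["enclosure"],
            "slot": row["slot"],
        })
    return events

def failures_per_year(events):
    """Count failure events by calendar year of detection."""
    return Counter(e["time"].year for e in events)

def failures_per_enclosure(events):
    """Count failure events by (rack, enclosure) pair."""
    return Counter((e["rack"], e["enclosure"]) for e in events)

# Collection window and event count as stated in the description.
WINDOW_DAYS = (date(2023, 12, 21) - date(2019, 1, 4)).days
MEAN_FAILURES_PER_DAY = 3766 / WINDOW_DAYS

events = load_events(SAMPLE)
print(failures_per_year(events))
print(failures_per_enclosure(events))
print(WINDOW_DAYS, round(MEAN_FAILURES_PER_DAY, 2))
```

Grouping by `(rack, enclosure)` is one possible choice for spotting spatial clustering of failures; grouping by slot or by month would follow the same pattern.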

Funding resources

DOE contract number

DE-AC05-00OR22725

Originating research organization

Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Other contributing organizations

Oak Ridge National Laboratory

Sponsoring organization

Office of Science (SC)

Details

DOI

10.13139/OLCF/2441482

Release date

September 19, 2024

Dataset

Dataset type

ND Numeric Data

Acknowledgements

Papers using this dataset are requested to include the following text in their acknowledgements:

Support for 10.13139/OLCF/2441482 is provided by the U.S. Department of Energy, project STF008 under Contract DE-AC05-00OR22725. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility.

Category

  • 97 MATHEMATICS AND COMPUTING

Keywords

  • Disk failures
  • Parallel File System
  • GPFS
  • Alpine
  • Summit
  • HPC storage
  • Supercomputer