Collection of Disk Failure Events from Alpine, the Parallel File System for Summit Supercomputer
- George, Anjus | Oak Ridge National Laboratory
- Hanley, Jesse | Oak Ridge National Laboratory
- Zimmer, Christopher | Oak Ridge National Laboratory
Overview
Description
This dataset contains hard disk drive (HDD) failure events collected from Alpine, the storage system of the Summit supercomputer, hosted at the Oak Ridge Leadership Computing Facility (OLCF). The events span January 4, 2019, to December 21, 2023 (a total of 4 years, 11 months, and 18 days), covering 89% of the storage system's operational lifetime. The dataset includes 3,766 disk failure events, each recorded with its detection timestamp (in ISO 8601 format) and its location within the storage system: rack, enclosure, and drive slot number.
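As a rough illustration of how these figures fit together, the sketch below derives the length of the collection window and the average failure rate from the dates and event count stated above, then tallies events per rack. The record layout, field order, and sample values are hypothetical placeholders, not taken from the dataset files; the actual file format may differ.

```python
from collections import Counter
from datetime import date, datetime

# Figures stated in the dataset description.
START = date(2019, 1, 4)
END = date(2023, 12, 21)
TOTAL_EVENTS = 3766

# Count both the first and the last day of the collection window.
span_days = (END - START).days + 1
mean_failures_per_day = TOTAL_EVENTS / span_days

# Hypothetical records (timestamp, rack, enclosure, slot).
# These values are invented for illustration; the real layout may differ.
sample_events = [
    ("2019-01-04T08:15:00", "rack01", "enc03", 17),
    ("2019-01-04T21:40:12", "rack01", "enc05", 2),
    ("2019-02-11T03:05:59", "rack07", "enc01", 44),
]


def failures_per_rack(events):
    """Count events per rack after validating each ISO 8601 timestamp."""
    counts = Counter()
    for timestamp, rack, _enclosure, _slot in events:
        datetime.fromisoformat(timestamp)  # raises ValueError if malformed
        counts[rack] += 1
    return counts


if __name__ == "__main__":
    print(f"Collection window: {span_days} days")
    print(f"Mean failures per day: {mean_failures_per_day:.2f}")
    print(dict(failures_per_rack(sample_events)))
```

The same grouping could be keyed on enclosure or slot instead of rack; which level is most informative depends on the analysis and is left open here.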
Funding resources
DOE contract number
DE-AC05-00OR22725
Originating research organization
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Other contributing organizations
Oak Ridge National Laboratory
Sponsoring organization
Office of Science (SC)
Related resources
- IsSupplementTo (DOI): https://doi.org/10.1145/3624062.3624119
- Continues (DOI): https://doi.org/10.13139/ORNLNCCS/1868941
Details
DOI
10.13139/OLCF/2441482
Release date
September 19, 2024
Dataset
Dataset type
ND Numeric Data
Acknowledgements
Papers using this dataset are requested to include the following text in their acknowledgements:
Support for 10.13139/OLCF/2441482 is provided by the U.S. Department of Energy, project STF008 under Contract DE-AC05-00OR22725. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility.
Category
- 97 MATHEMATICS AND COMPUTING
Keywords
- Disk failures
- Parallel File System
- GPFS
- Alpine
- Summit
- HPC storage
- Supercomputer