Collection of Disk Failure Events from Alpine, the Parallel File System for Summit Supercomputer
- George, Anjus | Oak Ridge National Laboratory
- Hanley, Jesse | Oak Ridge National Laboratory
- Zimmer, Christopher | Oak Ridge National Laboratory
Overview
Description
This dataset contains hard disk drive (HDD) failure events collected from Alpine, the storage system of the Summit supercomputer hosted at the Oak Ridge Leadership Computing Facility (OLCF). The events span January 4, 2019, to December 21, 2023 (a total of 4 years, 11 months, and 18 days), covering 89% of Alpine's operational lifetime. The dataset includes 3,766 disk failure events, each recorded with its detection timestamp (in ISO 8601 format) and its location within the storage system: rack, enclosure, and drive slot number.
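As an illustration of how records of this shape might be processed, the following is a minimal Python sketch. It assumes the events are available as CSV text with columns named `timestamp`, `rack`, `enclosure`, and `slot`. Those column names, the file layout, and the sample rows are hypothetical and are not taken from the dataset itself; only the first and last dates match the collection period stated above. Location fields are kept as strings because their actual encoding is not specified here.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Hypothetical sample rows: column names and values are illustrative only,
# not copied from the actual dataset files.
SAMPLE = """timestamp,rack,enclosure,slot
2019-01-04T08:15:00,1,2,17
2019-03-12T22:40:31,1,5,3
2023-12-21T11:02:09,4,2,60
"""


def load_events(text):
    """Parse CSV text into a list of event dicts with a parsed timestamp."""
    events = []
    for row in csv.DictReader(io.StringIO(text)):
        events.append({
            # ISO 8601 timestamps parse directly with fromisoformat
            "detected": datetime.fromisoformat(row["timestamp"]),
            # Location fields are kept as strings (encoding is an assumption)
            "rack": row["rack"],
            "enclosure": row["enclosure"],
            "slot": row["slot"],
        })
    return events


def failures_per_rack(events):
    """Count failure events per rack identifier."""
    return Counter(e["rack"] for e in events)


events = load_events(SAMPLE)
per_rack = failures_per_rack(events)

# Days between the first and last detection dates in the sample
span_days = (events[-1]["detected"].date() - events[0]["detected"].date()).days
```

With the real files, `SAMPLE` would be replaced by the file contents, and the column names adjusted to whatever the dataset actually uses.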
Funding resources
DOE contract number
DE-AC05-00OR22725
Originating research organization
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Other contributing organizations
Oak Ridge National Laboratory
Sponsoring organization
Office of Science (SC)
Related resources
- IsSupplementTo (DOI): https://doi.org/10.1145/3624062.3624119
- Continues (DOI): https://doi.org/10.13139/ORNLNCCS/1868941
Details
DOI
10.13139/OLCF/2441482
Release date
September 19, 2024
Dataset
Dataset type
ND Numeric Data
Acknowledgements
Users should acknowledge the OLCF in all publications and presentations that speak to work performed on OLCF resources:
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Category
- 97 MATHEMATICS AND COMPUTING
Keywords
- Disk failures
- Parallel File System
- GPFS
- Alpine
- Summit
- HPC storage
- Supercomputer