SMC 2021 Data Challenge: Analyzing Resource Utilization and User Behavior on Titan Supercomputer

10.13139/OLCF/1772811

Resource utilization statistics for jobs submitted to a supercomputer can help us understand how users from various scientific domains use HPC platforms, and can inform the design of better job schedulers. We aim to generate insight into workload distribution and usage patterns across domains from the job scheduler trace, GPU failure information, and project-specific information collected from the Titan supercomputer. Furthermore, we want to know how scheduler performance varies over time and how users' scheduling behavior changes following a system failure. These observations have the potential to provide valuable insight that helps in preparing for system failures. These practices will help us develop and apply novel machine learning algorithms for understanding system behavior and requirements and for better scheduling of HPC systems.

There are two datasets, RUR and GPU.

The RUR dataset consists of the job scheduler traces collected from the Titan supercomputer from 01/01/2015 to 07/31/2019 (2015.csv - 2019.csv). These were collected using the Resource Utilization Report (RUR), a Cray-developed resource-usage data collection and reporting system. The dataset contains usage information for the critical resources (CPU, memory, GPU, and I/O) of each job that ran on Titan during that period (https://ieeexplore.ieee.org/abstract/document/8891001). It includes ProjectAreas as additional information: every job is associated with a project ID, and the ProjectAreas.csv dataset provides a mapping from the project ID to its science domain.

The GPU dataset has information regarding GPU failures on Titan. Hardware-related issues caused some GPUs in Titan to fail, sometimes irrecoverably, during job runs. This dataset provides information regarding these failures during the execution of submitted jobs. GPUs on Titan are uniquely identified by a serial number (SN), and each is installed in a location. A GPU can be installed in a location, removed from that location following a failure, and then re-installed in a different location after the problem is fixed. If the failure cannot be recovered from, the GPU might be removed from Titan entirely. Two prominent types of failure resulted in the removal of GPUs from Titan: Double Bit Error (DBE) and Off the Bus (OTB). The dataset (gc_full.csv) has seven attributes; a short description of these attributes is provided in the ReadMe file. To learn more about this dataset, please refer to the git repository https://github.com/olcf/TitanGPULife and the related publication (https://ieeexplore.ieee.org/abstract/document/9355319).
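As a rough illustration of how the files described above relate to one another, the sketch below joins job records to science domains through the project ID and reconstructs the sequence of locations for each GPU serial number. It is a minimal sketch under stated assumptions: the column names (job_id, project_id, domain, SN, location, insert_date) and all sample rows are hypothetical placeholders, not the actual headers or contents of 2015.csv - 2019.csv, ProjectAreas.csv, or gc_full.csv. Consult the ReadMe file for the real attribute names before adapting it.

```python
import csv
import io
from collections import Counter, defaultdict

# Toy stand-ins for the real files. Every column name and row below is
# invented for illustration only; the real headers are in the ReadMe.
JOBS_CSV = """job_id,project_id
1,PRJ001
2,PRJ002
3,PRJ001
4,PRJ999
"""

PROJECT_AREAS_CSV = """project_id,domain
PRJ001,Chemistry
PRJ002,Physics
"""

GPU_CSV = """SN,location,insert_date
A1,loc-A,2015-01-10
B2,loc-B,2015-02-01
A1,loc-C,2016-04-15
"""


def jobs_per_domain(jobs_file, areas_file):
    """Count jobs per science domain by joining on the project ID."""
    domain_of = {
        row["project_id"]: row["domain"] for row in csv.DictReader(areas_file)
    }
    counts = Counter()
    for job in csv.DictReader(jobs_file):
        # Jobs whose project ID has no mapping are grouped under "UNKNOWN".
        counts[domain_of.get(job["project_id"], "UNKNOWN")] += 1
    return counts


def gpu_location_history(gpu_file):
    """Map each GPU serial number to its locations in installation order."""
    records = defaultdict(list)
    for row in csv.DictReader(gpu_file):
        records[row["SN"]].append((row["insert_date"], row["location"]))
    # ISO-formatted dates (YYYY-MM-DD) sort chronologically as plain strings.
    return {sn: [loc for _, loc in sorted(recs)] for sn, recs in records.items()}


domain_counts = jobs_per_domain(
    io.StringIO(JOBS_CSV), io.StringIO(PROJECT_AREAS_CSV)
)
history = gpu_location_history(io.StringIO(GPU_CSV))
print(dict(domain_counts))
print(history)
```

To run this against the real data, the in-memory strings would be replaced with open file handles, and the yearly RUR files would be processed one after another with the counts accumulated across them.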

Published: 2021-03-29 11:31:18

Dataset Properties

Field: Value
Authors:
  • Dash, Sajal (Oak Ridge National Laboratory)
  • Paul, Arnab K. (Oak Ridge National Laboratory)
  • Oral, Sarp (Oak Ridge National Laboratory)
  • Wang, Feiyi (Oak Ridge National Laboratory)
Project Identifier: GEN150
Dataset Type: ND Numeric Data
Subjects:
  • 42 ENGINEERING
  • 97 MATHEMATICS AND COMPUTING
Keywords:
  • RUR
  • Titan
  • GPU Failure
Originating Organizations: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organizations: Office of Science (SC)
DOE Contract: DE-AC05-00OR22725
Related Identifiers:
  • Obsoletes (DOI) 10.13139/OLCF/1772604

Acknowledgements

Papers using this dataset are requested to include the following text in their acknowledgements:

Support for 10.13139/OLCF/1772811 is provided by the U.S. Department of Energy, project GEN150 under Contract DE-AC05-00OR22725. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility.