
HydraGNN_Predictive_GFM_2024 - Ensemble of predictive graph foundation models for ground state atomistic materials modeling

  • Lupo Pasini, Massimiliano | Oak Ridge National Laboratory
  • Choi, Jong Youl | Oak Ridge National Laboratory
  • Mehta, Kshitij | Oak Ridge National Laboratory
  • Zhang, Pei | Oak Ridge National Laboratory
  • Rogers, David | Oak Ridge National Laboratory
  • Bae, Jonghyun | Lawrence Berkeley National Laboratory
  • Ibrahim, Khaled | Lawrence Berkeley National Laboratory
  • Aji, Ashwin | AMD Research - Advanced Micro Devices
  • Schulz, Karl W. | AMD Research - Advanced Micro Devices
  • Polo, Jorda | AMD Research - Advanced Micro Devices
  • Balaprakash, Prasanna | Oak Ridge National Laboratory
Overview

Description

We provide an ensemble of fifteen pre-trained graph foundation models (GFMs) for atomistic materials modeling applications. Each of the fifteen GFMs has been trained on five open-source datasets that, once aggregated, amount to over 154 million atomistic structures, cover over two-thirds of the natural elements of the periodic table, and comprise a broad set of organic and inorganic compounds. This vast set of atomistic structures comprises ground-state configurations that are dynamically stable (i.e., equilibrated structures with atomic forces approximately equal to zero) as well as dynamically unstable structures (i.e., non-equilibrium structures with non-negligible, non-zero atomic forces). The aggregated datasets do NOT include excited states.

The datasets have been curated to remove atomistic structures whose force tensor has a spectral norm above 100 eV/angstrom (a sketch of this filter is given below). Moreover, a linear term of the energy was computed for each dataset using a linear regression model with the chemical concentration of each natural element as regressor. The linear term predicted by the regression model was subtracted from each original energy value to re-align the energies across the different electronic-structure approximation theories used to generate this multi-source, multi-fidelity collection of datasets (also sketched below).

The "ADIOS_files" directory contains the pre-processed datasets, converted to Adaptable I/O System (ADIOS) format (https://www.exascaleproject.org/research-project/adios/), that were used for the development, training, and performance testing of the ensemble of predictive GFMs. It holds six sub-directories (a short example of inspecting these files follows below):

  • ANI1x-v3.bp
  • MPTrj-v3.bp
  • OC2020-20M-v3.bp
  • OC2020-v3.bp
  • OC2022-v3.bp
  • qm7x-v3.bp

Each GFM was developed using HydraGNN (https://github.com/ORNL/HydraGNN) as the underlying graph neural network (GNN) architecture. The multi-task learning (MTL) capability of HydraGNN was used to simultaneously train the GFMs on labeled values for direct prediction of the energy (a total system property of an atomistic structure that measures chemical stability) and of the atomic forces (an atomic-level property that measures dynamical stability).

The hyperparameters of the GFMs were tuned using scalable hyperparameter optimization (HPO) algorithms implemented in the software DeepHyper (https://github.com/deephyper/deephyper). The pre-training of each HPO trial used distributed data parallelism (DDP) to scale the training across 128 compute nodes of the exascale OLCF supercomputer Frontier. Each HPO trial was trained for only 10 epochs, with early stopping to avoid wasting significant computational resources on GNN architectures that were clearly underperforming. For each HPO trial, the 'omnistat' tool developed by AMD Research - Advanced Micro Devices was used to measure the total energy consumption in kWh.
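The force-based curation filter mentioned above can be illustrated with a minimal sketch. It assumes each structure's forces are stored as a (num_atoms, 3) array in eV/angstrom; the function name and cutoff variable are hypothetical, not taken from the dataset's actual processing code.

    import numpy as np

    FORCE_NORM_CUTOFF = 100.0  # eV/angstrom, the threshold stated above

    def passes_curation(forces):
        # forces: (num_atoms, 3) array of atomic forces in eV/angstrom.
        # The spectral norm of a 2-D array is its largest singular value,
        # which numpy computes for ord=2.
        return np.linalg.norm(forces, ord=2) <= FORCE_NORM_CUTOFF

    # Toy example: a 3-atom structure with small residual forces is kept.
    forces = np.array([[ 0.1, -0.2,  0.0],
                       [ 0.0,  0.3,  0.1],
                       [-0.1, -0.1, -0.1]])
    print(passes_curation(forces))  # True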
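Likewise, the per-dataset energy re-alignment reduces to an ordinary least-squares fit of the raw energies against per-element chemical concentrations, with the fitted linear term subtracted afterwards. This is a minimal sketch; the array names are assumptions rather than the dataset's actual code.

    import numpy as np

    def realign_energies(concentrations, energies):
        # concentrations: (num_structures, num_elements) matrix, one row per
        # structure, holding the chemical concentration of each element.
        # energies: (num_structures,) vector of raw energies for one dataset.
        coeffs, *_ = np.linalg.lstsq(concentrations, energies, rcond=None)
        linear_term = concentrations @ coeffs
        # Subtracting the fitted linear term re-aligns the energies across
        # different electronic-structure levels of theory.
        return energies - linear_term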
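To inspect one of the pre-processed .bp files, the ADIOS2 Python bindings can be used. The sketch below assumes ADIOS2 >= 2.10 (which provides adios2.FileReader); since the variable layout inside the files is not documented here, listing the available variables is the safe starting point.

    from adios2 import FileReader  # requires ADIOS2 >= 2.10 Python bindings

    # Open one of the pre-processed datasets and list its stored variables.
    with FileReader("ADIOS_files/ANI1x-v3.bp") as reader:
        for name, info in reader.available_variables().items():
            print(name, info)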
The ensemble of GFMs was obtained by selecting the fifteen best-performing HPO trials. Four models were selected for their clear advantage in accuracy: the GFMs with IDs 229, 156, 147, and 260. An additional eleven models were selected for their judicious balance between accuracy and the energy consumption required for training: the GFMs with IDs 165, 78, 137, 1, 175, 171, 181, 67, 179, 167, and 351. Training of each selected GFM was then continued to accumulate a total of at most 30 epochs. In some cases, the total number of epochs actually performed was less than 30 due to two combined factors: (1) the size of the GFM (i.e., the number of model parameters to train) and (2) the total wall-clock time for which the computational resources could be allocated on OLCF-Frontier.

The "Ensemble_of_models" directory contains 15 sub-directories, one per selected HPO trial:

  • gfm_0.229
  • gfm_0.156
  • gfm_0.147
  • gfm_0.260
  • gfm_0.165
  • gfm_0.78
  • gfm_0.137
  • gfm_0.1
  • gfm_0.175
  • gfm_0.171
  • gfm_0.181
  • gfm_0.67
  • gfm_0.179
  • gfm_0.167
  • gfm_0.351

Each sub-directory, associated with a specific HPO trial, contains the following files (a loading sketch is given below):

  • config.json: file for argument parsing to develop and train a HydraGNN architecture
  • gfm_0.ID_epoch_N.pk: file with model parameters for HPO trial ID after N epochs of training

The ensemble of fifteen GFM architectures was used for (1) ensemble averaging, to stabilize the predictions of energy and atomic forces after pre-training for post-processing analysis, and (2) ensemble uncertainty quantification (UQ); both are sketched below. The code used to develop, pre-train, and load the pre-trained models for post-processing analysis is available on the ORNL GitHub at the following link: https://github.com/ORNL/HydraGNN/tree/Predictive_GFM_2024
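As a generic PyTorch sketch (not the authoritative HydraGNN loader, which lives in the linked branch), one ensemble member can be restored from its sub-directory roughly as follows. The epoch index is a placeholder that must match a checkpoint actually present on disk.

    import json
    import torch

    trial_dir = "Ensemble_of_models/gfm_0.229"
    epoch = 29  # placeholder: use an epoch N actually present in the sub-directory

    # config.json holds the arguments used to build this HydraGNN architecture.
    with open(f"{trial_dir}/config.json") as f:
        config = json.load(f)

    # The .pk file stores the model parameters after `epoch` epochs of training.
    state = torch.load(f"{trial_dir}/gfm_0.229_epoch_{epoch}.pk", map_location="cpu")

    # After instantiating the architecture from `config` via HydraGNN, the
    # weights would be restored with model.load_state_dict(state).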
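Finally, ensemble averaging and ensemble UQ reduce to a mean and a standard deviation across the fifteen members' predictions. The sketch below uses placeholder tensors to show one plausible shape convention, which is an assumption rather than the repository's exact layout.

    import torch

    # (num_models, num_structures) energies predicted by the 15 ensemble members;
    # random placeholders stand in for actual model outputs.
    predictions = torch.randn(15, 4)

    ensemble_mean = predictions.mean(dim=0)  # stabilized prediction
    ensemble_std = predictions.std(dim=0)    # per-structure uncertainty estimate
    print(ensemble_mean, ensemble_std)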

Funding resources

DOE contract number

DE-AC05-00OR22725

Originating research organization

Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Other contributing organizations

Lawrence Berkeley National Laboratory; AMD Research - Advanced Micro Devices

Sponsoring organization

Office of Science (SC)

Details

DOI

10.13139/OLCF/2474799

Release date

November 1, 2024

Dataset

Dataset type

ND Numeric Data

Software

HydraGNN

Other contract number(s)

ORNL-LDRD LOIS 11122; ORNL-LDRD LOIS 11874

Acknowledgements

Users should acknowledge the OLCF in all publications and presentations that speak to work performed on OLCF resources:

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Category

  • 74 ATOMIC AND MOLECULAR PHYSICS
  • 71 CLASSICAL AND QUANTUM MECHANICS, GENERAL PHYSICS
  • 75 CONDENSED MATTER PHYSICS, SUPERCONDUCTIVITY AND SUPERFLUIDITY
  • 37 INORGANIC, ORGANIC, PHYSICAL, AND ANALYTICAL CHEMISTRY
  • 36 MATERIALS SCIENCE
  • 97 MATHEMATICS AND COMPUTING

Keywords

  • high-performance computing
  • uncertainty quantification
  • Graph Neural Network
  • Ensemble Learning
  • Materials Science
  • Atomistic Materials Modeling
  • Condensed Matter Physics
  • Computational Chemistry
  • Graph Foundation Model