HydraGNN_Predictive_GFM_2026 - Ensemble of predictive graph foundation models for atomistic materials modeling
- Lupo Pasini, Massimiliano | Oak Ridge National Laboratory
- Choi, Jong Youl | Oak Ridge National Laboratory
- Mehta, Kshitij | Oak Ridge National Laboratory
- Messerly, Richard | Oak Ridge National Laboratory
- Weaver, Rylie | Oak Ridge National Laboratory
- Aji, Ashwin M. | Oak Ridge National Laboratory
- Schulz, Karl W. | Oak Ridge National Laboratory
- Polo, Jorda | Oak Ridge National Laboratory
Overview
Description
This release contains pre-processed data and parameters of HydraGNN-based graph foundation models trained as a result of the work published in the pre-print "Exascale Multi-Task Graph Foundation Models for Imbalanced, Multi-Fidelity Atomistic Data" by M. Lupo Pasini et al. (https://arxiv.org/abs/2604.15380). We jointly train on 16 open first-principles datasets (544+ million structures covering 85+ elements) using a multi-task architecture with per-dataset heads and a scalable ADIOS2/DDStore data pipeline. On Frontier, we execute six large-scale DeepHyper hyperparameter optimization campaigns in FP64 and promote the top-performing message-passing models to sustained 2,048-node training, yielding a PaiNN-based lead model.
The version of HydraGNN used to generate the outputs provided in this release is HydraGNN v5.0 (https://github.com/ORNL/HydraGNN/releases/tag/v5.0).
The list of datasets used for the training of the graph foundation model is the following:
1) Alexandria [1]
2) ANI1x [2]
3) MPTrj [3]
4) Open Catalyst 2020 (OC20) [4]
5) Open Catalyst 2022 (OC22) [5]
6) Open Catalyst 2025 (OC25) [6]
7) Open Direct Air Capture 2023 (ODAC23) [7]
8) Open Materials 2024 (OMat24) [8]
9) Open Molecules 2025 (OMol25) [9]
10) OMol25-neutral (subset of OMol25 that contains only molecules with zero total charge)
11) OMol25-non-neutral (subset of OMol25 that contains only molecules with non-zero total charge)
12) Open Polymers 2026 (OPoly2026) [10]
13) Nabla2DFT [11]
14) QCML [12]
15) QM7-X [13]
16) transition1x [14]
Dataset references:
[1] J. Schmidt et al., “A dataset of 175k stable and metastable materials calculated with the PBEsol and SCAN functionals,” Scientific Data, vol. 9, p. 64, 2022.
[2] J. S. Smith et al., “The ANI-1ccx and ANI-1x data sets, coupled-cluster and density functional theory properties for molecules,” Scientific Data, vol. 7, p. 134, 2020. [Online]. Available: https://www.nature.com/articles/s41597-020-0473-z
[3] A. Jain et al., “Commentary: The Materials Project: A materials genome approach to accelerating materials innovation,” APL Materials, vol. 1, no. 1, p. 011002, 07 2013. [Online]. Available: https://doi.org/10.1063/1.4812323
[4] L. Chanussot et al., “Open catalyst 2020 (oc20) dataset and community challenges,” ACS Catalysis, vol. 11, no. 10, pp. 6059–6072, 2021. [Online]. Available: https://doi.org/10.1021/acscatal.0c04525
[5] K. Tran et al., “Open catalyst 2022 (oc22) dataset and challenges for oxidation electrocatalysts,” ACS Catalysis, vol. 13, no. 5, pp. 3066–3084, 2023. [Online]. Available: https://doi.org/10.1021/acscatal.2c05426
[6] S. J. Sahoo et al., “The open catalyst 2025 (oc25) dataset and models for solid-liquid interfaces,” arXiv preprint arXiv:2509.17862, 2025. [Online]. Available: https://arxiv.org/abs/2509.17862
[7] A. Sriram et al., “The open DAC 2023 dataset and challenges for sorbent discovery in direct air capture,” ACS Central Science, vol. 10, no. 5, pp. 923–941, 2024.
[8] L. Barroso-Luque et al., “Open materials 2024 (omat24) inorganic materials dataset and models,” 2024. [Online]. Available: https://arxiv.org/abs/2410.12771
[9] D. S. Levine et al., “The open molecules 2025 (OMol25) dataset, evaluations, and models,” 2025. [Online]. Available: https://arxiv.org/abs/2505.08762
[10] D. S. Levine et al., “The open polymers 2026 (OPoly26) dataset and evaluations,” arXiv preprint arXiv:2512.23117, 2025. [Online]. Available: https://arxiv.org/abs/2512.23117
[11] K. Khrabrov et al., “Nabla2dft: A universal quantum chemistry dataset of drug-like molecules and a benchmark for neural network potentials,” in NeurIPS 2024 Datasets and Benchmarks Track, 2024. [Online]. Available: https://openreview.net/forum?id=ElUrNM9U8c
[12] S. Ganscha et al., “The QCML dataset, quantum chemistry reference data from 33.5M DFT and 14.7B semi-empirical calculations,” Scientific Data, vol. 12, p. 406, 2025.
[13] J. Hoja et al., “QM7-X, a comprehensive dataset of quantum-mechanical properties spanning the chemical space of small organic molecules,” Scientific Data, vol. 8, p. 43, 2021. [Online]. Available: https://www.nature.com/articles/s41597-021-00812-2
[14] M. Schreiner et al., “Transition1x - a dataset for building generalizable reactive machine learning potentials,” Scientific Data, vol. 9, p. 779, 2022.
The folder "datasets_ADIOS2_format" contains the set of pre-processed datasets in Adaptable I/O System (ADIOS) format (https://www.exascaleproject.org/research-project/adios/) that were used for the development and training of the GFMs in this work. It contains two sub-directories, one for version "v1" of the datasets and one for version "v2". Version "v1" provides the values of the total energy as extracted from the original data released by the respective institutions.
Version "v2" provides energy values that have been realigned. The realignment was performed by training a linear regression model that predicts the total energy as a function of the chemical composition of the atomistic structure, and then subtracting this prediction from the original value of the total energy.
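As an illustration only (not the release's actual preprocessing script), the realignment described above can be sketched as an ordinary least-squares fit of per-element reference energies, whose composition-weighted sum is subtracted from each total energy:

```python
# Hedged sketch of the "v2" energy realignment: fit E_total ~ sum_z n_z * e_ref[z]
# by least squares, then keep the residual as the realigned energy.
# Function and variable names here are illustrative, not from the release.
import numpy as np

def realign_energies(compositions, energies):
    """compositions: (n_structures, n_elements) per-element atom counts;
    energies: (n_structures,) total energies from the "v1" data.
    Returns (realigned_energies, per_element_reference_energies)."""
    counts = np.asarray(compositions, dtype=float)
    e_total = np.asarray(energies, dtype=float)
    # Least-squares fit of one reference energy per chemical element
    e_ref, *_ = np.linalg.lstsq(counts, e_total, rcond=None)
    return e_total - counts @ e_ref, e_ref

# Toy usage: two elements with energies exactly linear in composition,
# so the realigned energies are zero and e_ref recovers (-1.0, -2.0)
comps = [[2, 0], [1, 1], [0, 3]]
e = [-2.0, -3.0, -6.0]
residual, e_ref = realign_energies(comps, e)
```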
Both folders "v1" and "v2" contain 16 sub-directories, each corresponding to an ADIOS2-formatted dataset.
The folder "DeepHyper-results" contains the configuration files and model parameters for all 186 HPO trials that were successfully completed by the scalable hyperparameter optimization (HPO) runs on Frontier. The content of the folder "DeepHyper-results" is structured as follows:
1) task-list.txt: list of MPNN name, job ID, and DeepHyper task ID
2) gfm_${MPNN}_${JOBID}_0.${TASKID}: run directory with checkpoint files
3) gfm_${MPNN}: DeepHyper summary directory (*.csv) for each MPNN type
4) deephyper-experiment-${JOBID}: output and error logs for each job
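A minimal sketch of one way to index the run directories laid out above; the regex assumes `${MPNN}` is alphanumeric and `${JOBID}`/`${TASKID}` are numeric, which is our assumption about the naming, not part of the release:

```python
# Hedged sketch: walk a "DeepHyper-results" folder and map each
# (mpnn, jobid, taskid) triple to its run directory with checkpoint files.
import re
from pathlib import Path

# Pattern for gfm_${MPNN}_${JOBID}_0.${TASKID} directory names (assumed)
RUN_DIR = re.compile(r"^gfm_(?P<mpnn>[A-Za-z0-9]+)_(?P<jobid>\d+)_0\.(?P<taskid>\d+)$")

def index_hpo_runs(root):
    """Return {(mpnn, jobid, taskid): Path} for every matching run directory."""
    runs = {}
    for p in Path(root).iterdir():
        m = RUN_DIR.match(p.name)
        if m and p.is_dir():
            runs[(m["mpnn"], int(m["jobid"]), int(m["taskid"]))] = p
    return runs
```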
The file "deephyper-sorted.csv" contains the details of each HydraGNN model built and tested by HPO, obtained by merging the (*.csv) files from each executed HPO run. Out of all the HPO trials, we selected 10 to continue the training of the respective HydraGNN models. Due to the limited computational budget available in the LRN070 allocation, we could not complete the training to convergence for all 10 selected models.
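A hedged sketch of the merge step that could produce a file like "deephyper-sorted.csv" from the per-MPNN summary (*.csv) files; the "objective" column name follows DeepHyper's convention but is an assumption about these particular files:

```python
# Hedged sketch: concatenate HPO summary CSVs and sort trials by objective.
# This mirrors the described merge, not the release's actual script.
import csv

def merge_and_sort(csv_paths, key="objective"):
    """Read every CSV into dict rows, then sort best-first by `key`."""
    rows = []
    for path in csv_paths:
        with open(path, newline="") as fh:
            rows.extend(csv.DictReader(fh))
    # DeepHyper maximizes the objective, so larger values rank first (assumed)
    rows.sort(key=lambda r: float(r[key]), reverse=True)
    return rows
```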
The folder "models" contains one sub-folder for each HydraGNN model trained.
Each model sub-folder contains the parameters of that HydraGNN model, with multiple checkpoint-restarts.
The sub-folders are as follows:
1) multidataset_hpo-BEST1-fp64
2) multidataset_hpo-BEST2-fp64
3) multidataset_hpo-BEST3-fp64
4) multidataset_hpo-BEST4-fp64
5) multidataset_hpo-BEST5-fp64
6) multidataset_hpo-BEST6-fp64
7) multidataset_hpo-BEST7-fp64
8) multidataset_hpo-BEST8-fp64
9) multidataset_hpo-BEST9-fp64
10) multidataset_hpo-BEST10-fp64
Each of these folders also provides auxiliary log files describing how the training proceeded.
The lead PaiNN model is contained in "multidataset_hpo-BEST6-fp64".
The file "mlp_branch_weights" contains the parameters of the multi-layer perceptron (MLP) used to reconcile the predictions of the 16 output decoding heads of the HydraGNN architectures. The MLP takes as input the chemical composition of the atomistic structure and predicts averaging weights that linearly mix the predictions of the output decoding heads, consolidating them into a single prediction.
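A minimal sketch of this reconciliation step, assuming a one-hidden-layer MLP with softmax-normalized mixing weights; the actual architecture stored in "mlp_branch_weights" may differ, and all names below are illustrative:

```python
# Hedged sketch: an MLP maps the chemical composition to 16 mixing weights,
# and the consolidated prediction is the weighted average of the head outputs.
import numpy as np

def reconcile(composition, head_predictions, W1, b1, W2, b2):
    """composition: (n_elements,) atom counts; head_predictions: (16,) per-head
    energies; W1/b1/W2/b2: illustrative one-hidden-layer MLP parameters."""
    h = np.maximum(W1 @ composition + b1, 0.0)   # hidden layer, ReLU
    logits = W2 @ h + b2                          # one logit per decoding head
    w = np.exp(logits - logits.max())
    w /= w.sum()                                  # softmax mixing weights
    return float(w @ head_predictions)            # weighted average of heads
```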
The folder "1.1billion-structure-inference" contains 1.1 billion randomly generated atomistic structures. Each structure is associated with energy and forces predicted with the lead PaiNN model combined with the MLP model that reconciles the multi-branch predictions generated by the 16 output decoding heads. The folder contains 9,300 (*.tar.gz) archives, one per Frontier compute node used to execute the inference at exascale. Once uncompressed, each archive contains an ADIOS2 (*.bp) file container, where each atomistic structure is stored as a PyTorch Geometric Data object.
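A small helper, sketched under the assumption that each archive unpacks directly to an ADIOS2 (*.bp) container, for extracting the per-node archives before reading them with ADIOS2/HydraGNN tooling (reading the .bp contents themselves requires the adios2 package and the HydraGNN loaders):

```python
# Hedged sketch: unpack the 9,300 per-node *.tar.gz archives and collect
# the *.bp containers they hold. Paths and names are illustrative.
import tarfile
from pathlib import Path

def extract_node_archives(src_dir, dst_dir):
    """Extract every *.tar.gz under src_dir into dst_dir; return the paths
    of any *.bp containers found after extraction."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for archive in sorted(Path(src_dir).glob("*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            # Trusted release data; for untrusted archives prefer
            # extractall(..., filter="data") on Python 3.12+
            tar.extractall(dst)
    return sorted(dst.rglob("*.bp"))
```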
The file "export_dataset_environment_variables.sh" contains the environment variables that need to be set before running the HydraGNN code to reproduce the results provided in this dataset release.
The code that can be used to load the ADIOS2 files, load HydraGNN models, and run inference is available at:
https://github.com/ORNL/HydraGNN/releases/tag/v5.0
Funding Resources
DOE Contract Number
DE-AC05-00OR22725; DE-AC02-06CH11357
Originating Research Organization
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Other Contributing Organizations
AMD Research
Sponsoring Organization
Office of Science (SC)
Project Identifier
LRN070
Related Resources
- References (URL): https://github.com/ORNL/HydraGNN/releases/tag/v5.0
- IsSupplementTo (URL): https://arxiv.org/abs/2604.15380
Details
DOI
10.13139/OLCF/2562660
Release Date
May 6, 2026
Dataset
Dataset Type
ND Numeric Data
Software
HydraGNN (https://github.com/ORNL/HydraGNN)
Other Contract Number(s)
NERSC ASCR-ERCAP0034735
Acknowledgements
This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Advanced Scientific Computing Research programs in the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Category
- 74 ATOMIC AND MOLECULAR PHYSICS
- 71 CLASSICAL AND QUANTUM MECHANICS, GENERAL PHYSICS
- 75 CONDENSED MATTER PHYSICS, SUPERCONDUCTIVITY AND SUPERFLUIDITY
- 36 MATERIALS SCIENCE
- 97 MATHEMATICS AND COMPUTING
Keywords
- Atomistic Materials Modeling
- Graph Neural Networks
- Distributed Data Parallelism
- Model Parallelism
- Multi-Fidelity Data