Tokenized Data for FORGE Foundation Models

Junqi Yin | Oak Ridge National Laboratory

Sajal Dash | Oak Ridge National Laboratory

Feiyi Wang | Oak Ridge National Laboratory

Mallikarjun (Arjun) Shankar | Oak Ridge National Laboratory

Description

This dataset comprises a corpus of 257 billion tokens, together with the vocabulary file used in pre-training the FORGE foundation models. The corpus is drawn primarily from scientific documents from a variety of sources, tokenized with the Hugging Face BPE tokenizer. Further details are given in the publication "FORGE: Pre-Training Open Foundation Models for Science" by Junqi Yin, Sajal Dash, Feiyi Wang, and Mallikarjun (Arjun) Shankar, presented at SC'23. The data tokenization pipeline and resulting artifacts use CORE data [Ref: Knoth, P., and Zdrahal, Z. (2012). CORE: three access levels to underpin open access. D-Lib Magazine, 18(11/12)]. Any use of these data sets must follow the CORE terms (https://core.ac.uk/terms).
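For readers unfamiliar with byte-pair encoding (BPE), the following is a minimal pure-Python sketch of the general idea: learn merge rules from a toy corpus, then apply them to split a word into subword tokens. It is an illustration only. It is not the Hugging Face implementation and not the FORGE tokenization pipeline; the function names (`learn_bpe_merges`, `merge_pair`, `encode`), the toy corpus, and the tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Start from single characters; count how often each distinct word occurs.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties go to the larger pair in tuple order
        # (an arbitrary choice made here only to keep the result deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair merged.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize `word` by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (an assumption for illustration, unrelated to the FORGE data).
corpus = ["low", "low", "lower", "lowest"]
merges = learn_bpe_merges(corpus, num_merges=2)
print(merges)
print(encode("lowest", merges))
```

With this toy corpus the two learned merges build the subword "low", so "lowest" is split into "low", "e", "s", "t". A production tokenizer additionally handles pre-tokenization, special tokens, and a fixed vocabulary size, none of which are modeled here.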

Funding Information

DOE Contract Number

DE-AC05-00OR22725

Originating Research Organization

Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization

Office of Science (SC)

Details

DOI

https://doi.org/10.13139/OLCF/2004951

Release Date

October 18, 2023

Dataset

Dataset Type

Specialized Mix (SM)

Software

FORGE

Cite This Dataset

Yin, J., Dash, S., Wang, F., Shankar, M. (2023). Tokenized Data for FORGE Foundation Models. Oak Ridge National Laboratory. https://doi.org/10.13139/OLCF/2004951

Acknowledgements

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Advanced Scientific Computing Research programs in the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.