Tokenized Data for FORGE Foundation Models

10.13139/OLCF/2004951

This dataset comprises a corpus of 257 billion tokens, together with the vocabulary file used to pre-train the FORGE foundation models. The corpus consists primarily of scientific documents drawn from diverse sources and tokenized with the Hugging Face BPE tokenizer. Further details are available in the publication FORGE: Pre-Training Open Foundation Models for Science, by Junqi Yin, Sajal Dash, Feiyi Wang, and Mallikarjun (Arjun) Shankar, presented at SC'23. The data tokenization pipeline and resulting artifacts use CORE data [Ref: Knoth, P., and Zdrahal, Z. (2012). CORE: three access levels to underpin open access. D-Lib Magazine, 18(11/12)]. For any use of this dataset, please follow the guidelines at https://core.ac.uk/terms.
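The FORGE pipeline used the Hugging Face BPE tokenizer to produce these tokens; the actual training configuration and vocabulary are part of the dataset itself. As a rough illustration only, a minimal pure-Python sketch of the core BPE merge-learning loop (toy corpus and merge count are invented for the example, not taken from FORGE):

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words (word -> frequency)."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word is split into characters and mapped to its frequency.
corpus = {tuple("lower"): 5, tuple("low"): 7, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):  # learn 3 merge rules
    counts = get_pair_counts(corpus)
    best = max(counts, key=counts.get)  # most frequent adjacent pair
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges)  # → [('l', 'o'), ('lo', 'w'), ('e', 's')]
```

In practice the Hugging Face `tokenizers` library performs this training at scale; the learned merges and the resulting vocabulary file are what accompany the token stream in this dataset.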

Published: 2023-10-18 16:13:07

Dataset Properties

Authors:
  • Yin, Junqi (Oak Ridge National Laboratory)
  • Dash, Sajal (Oak Ridge National Laboratory)
  • Wang, Feiyi (Oak Ridge National Laboratory)
  • Shankar, Mallikarjun (Arjun) (Oak Ridge National Laboratory)
Project Identifier: OLCF-6 Benchmark
Dataset Type: SM (Specialized Mix)
Software Needed: FORGE
Originating Organizations: Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organizations: Office of Science (SC)
DOE Contract: DE-AC05-00OR22725

Acknowledgements

Papers using this dataset are requested to include the following text in their acknowledgements:

Support for 10.13139/OLCF/2004951 is provided by the U.S. Department of Energy, project OLCF-6 Benchmark under Contract DE-AC05-00OR22725. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility.