MOSAIC-CONUS: A Multimodal, Multi-Temporally Paired Dataset for Earth Sciences

    Abhishek Potnis | Oak Ridge National Laboratory
    Youssef Hussein | University of Minnesota
    Waqwoya Abebe | Oak Ridge National Laboratory
    JangHyeon Lee | University of Minnesota
    Debvrat Varshney | Oak Ridge National Laboratory

Description

Earth embeddings, vector representations of geographic locations indexed in space and time, are emerging as a unifying interface for geospatial AI. However, their quality depends not only on model design but also on how multimodal Earth observation (EO) data are spatially indexed, temporally aligned, and cross-modally associated during pretraining. We introduce MOSAIC-CONUS (Multimodal Observations with Spatially Aligned Imagery, Urban Points of Interest, In-Situ Measurements and Text Captions), a large-scale EO dataset over the contiguous United States, organized around 250,000 stratified point indices that serve as stable spatial keys across seven modalities: active radar, passive optical imagery, lidar-derived elevation, land cover, functional context, hydrometeorological measurements, and textual summaries.

Unlike existing EO datasets, MOSAIC-CONUS introduces four contributions not jointly addressed in prior work:

1. an open-source, large-scale multimodal EO corpus structured around point-indexed data designed to support Earth embedding learning;
2. explicit radar-optical pairing tables spanning twelve temporal alignment regimes, formalizing cross-sensor alignment as a controllable variable for analyzing how temporal mismatch across modalities influences the quality of learned embeddings;
3. a benchmark suite spanning cross-modal retrieval, annual nightlights regression, and basin-held-out streamflow prediction, positioning MOSAIC-CONUS as a benchmark-ready resource for multimodal AI systems; and
4. a language-based embedding layer through co-registered textual summaries, enabling Earth embeddings to function as a queryable interface for agentic AI systems.

The dataset and pairing protocols are publicly released.
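To make the idea of temporal alignment as a controllable variable concrete, the following is a minimal sketch of pairing radar and optical acquisitions at a single point under a maximum allowed time gap. This record does not define the twelve alignment regimes or the pairing-table schema, so the function name `pair_within_window`, the `max_gap_days` parameter, and the acquisition dates are all hypothetical and chosen only for illustration.

```python
from datetime import date


def pair_within_window(radar_dates, optical_dates, max_gap_days):
    """Pair each radar date with its nearest optical date, if close enough.

    Hypothetical helper: not the dataset's actual pairing protocol.
    Returns tuples of (radar_date, optical_date, signed_offset_in_days).
    """
    pairs = []
    for r in radar_dates:
        # Nearest optical acquisition in absolute time difference.
        nearest = min(optical_dates, key=lambda o: abs((o - r).days), default=None)
        if nearest is None:
            continue
        offset = (nearest - r).days
        # Keep the pair only if it satisfies the tolerance for this regime.
        if abs(offset) <= max_gap_days:
            pairs.append((r, nearest, offset))
    return pairs


# Illustrative acquisition dates for one point index (made up).
radar = [date(2021, 6, 1), date(2021, 6, 13), date(2021, 7, 7)]
optical = [date(2021, 6, 3), date(2021, 6, 20), date(2021, 8, 1)]

# Two assumed regimes: a tight window and a loose window.
tight = pair_within_window(radar, optical, max_gap_days=3)
loose = pair_within_window(radar, optical, max_gap_days=30)

print(len(tight), len(loose))
```

Tightening the window trades the number of available pairs against temporal mismatch, which is the kind of trade-off that explicit pairing tables allow a study to vary deliberately.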

Funding Information

DOE Contract Number

AC05-00OR22725

Originating Research Organization

Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization

Office of Science (SC)

Details

Release Date

May 8, 2026

Subject

54 ENVIRONMENTAL SCIENCES, 58 GEOSCIENCES

Keywords

Remote sensing

Dataset

Dataset Type

SM Specialized Mix

Cite This Dataset:

Potnis, A., Hussein, Y., Abebe, W., Lee, J., Varshney, D., Arndt, ., Dias, ., Tsaris, A., Lu, D., Lunga, D. (2026). MOSAIC-CONUS: A Multimodal, Multi-Temporally Paired Dataset for Earth Sciences. Oak Ridge National Laboratory. https://doi.org/10.13139/ORNLNCCS/3005125.

Acknowledgements

This work was carried out in part at Oak Ridge National Laboratory, managed by UT-Battelle, LLC for the U.S. Department of Energy under contract DE-AC05-00OR22725.