
MOSAIC-CONUS: A Multimodal, Multi-Temporally Paired Dataset for Earth Sciences

  • Potnis, Abhishek | Oak Ridge National Laboratory
  • Hussein, Youssef | University of Minnesota
  • Abebe, Waqwoya | Oak Ridge National Laboratory
  • Lee, JangHyeon | University of Minnesota
  • Varshney, Debvrat | Oak Ridge National Laboratory
  • Arndt, Jacob | Oak Ridge National Laboratory
  • Dias, Philipe | Oak Ridge National Laboratory
  • Tsaris, Aristeidis | Oak Ridge National Laboratory
  • Lu, Dan | Oak Ridge National Laboratory
  • Lunga, Dalton | Oak Ridge National Laboratory
Overview

Description

Earth embeddings, vector representations of geographic locations indexed in space and time, are emerging as a unifying interface for geospatial AI. However, their quality depends not only on model design but also on how multimodal Earth observation (EO) data are spatially indexed, temporally aligned, and cross-modally associated during pretraining. We introduce MOSAIC-CONUS (Multimodal Observations with Spatially Aligned Imagery, Urban Points of Interest, In-Situ Measurements and Text Captions), a large-scale EO dataset over the contiguous United States, organized around 250,000 stratified point indices that serve as stable spatial keys across seven modalities: active radar, passive optical imagery, lidar-derived elevation, land cover, functional context, hydrometeorological measurements, and textual summaries. MOSAIC-CONUS makes four contributions that prior EO datasets do not jointly address: (1) an open-source, large-scale multimodal EO corpus structured around point-indexed data and designed to support Earth embedding learning; (2) explicit radar-optical pairing tables spanning twelve temporal alignment regimes, which formalize cross-sensor alignment as a controllable variable for analyzing how temporal mismatch across modalities influences the quality of learned embeddings; (3) a benchmark suite spanning cross-modal retrieval, annual nightlights regression, and basin-held-out streamflow prediction, positioning MOSAIC-CONUS as a benchmark-ready resource for multimodal AI systems; and (4) a language-based embedding layer through co-registered textual summaries, enabling Earth embeddings to function as a queryable interface for agentic AI systems. The dataset and pairing protocols are publicly released.
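
To make the idea of temporal alignment as a controllable variable concrete, the following is a minimal sketch of nearest-in-time radar-optical pairing under an adjustable tolerance. It is an illustration only, not the released pairing protocol: the function name `pair_by_time`, the `max_gap_days` parameter, and the sample acquisition dates are all hypothetical, and the actual pairing tables and their twelve regimes may be defined differently.

```python
from bisect import bisect_left
from datetime import date


def pair_by_time(radar_dates, optical_dates, max_gap_days):
    """Pair each radar date with its nearest optical date within max_gap_days.

    Hypothetical helper for illustration; not part of the dataset tooling.
    """
    optical_sorted = sorted(optical_dates)
    pairs = []
    for r in sorted(radar_dates):
        i = bisect_left(optical_sorted, r)
        # Only the optical dates just before and just after r can be nearest.
        candidates = optical_sorted[max(i - 1, 0):i + 1]
        if not candidates:
            continue
        best = min(candidates, key=lambda o: abs((o - r).days))
        if abs((best - r).days) <= max_gap_days:
            pairs.append((r, best))
    return pairs


# Made-up acquisition dates for a single point index.
radar = [date(2021, 6, 1), date(2021, 6, 13), date(2021, 7, 7)]
optical = [date(2021, 6, 3), date(2021, 6, 20), date(2021, 8, 1)]

strict = pair_by_time(radar, optical, max_gap_days=3)   # tight alignment
loose = pair_by_time(radar, optical, max_gap_days=30)   # relaxed alignment
```

In this sketch a tighter tolerance yields fewer but better-aligned pairs, while a looser one yields more pairs with larger temporal mismatch. An optical date may be matched to more than one radar date; a one-to-one constraint would need additional logic.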

Funding Resources

DOE Contract Number

AC05-00OR22725

Originating Research Organization

Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)

Sponsoring Organization

Office of Science (SC)

Details

DOI

10.13139/ORNLNCCS/3005125

Release Date

May 8, 2026

Dataset

Dataset Type

SM Specialized Mix

Acknowledgements

This work was carried out in part at Oak Ridge National Laboratory, managed by UT-Battelle, LLC for the U.S. Department of Energy under contract DE-AC05-00OR22725.

Category

  • 54 ENVIRONMENTAL SCIENCES
  • 58 GEOSCIENCES

Keywords

  • Remote sensing