MOSAIC-CONUS: A Multimodal, Multi-Temporally Paired Dataset for Earth Sciences
- Potnis, Abhishek | Oak Ridge National Laboratory
- Hussein, Youssef | University of Minnesota
- Abebe, Waqwoya | Oak Ridge National Laboratory
- Lee, JangHyeon | University of Minnesota
- Varshney, Debvrat | Oak Ridge National Laboratory
- Arndt, Jacob | Oak Ridge National Laboratory
- Dias, Philipe | Oak Ridge National Laboratory
- Tsaris, Aristeidis | Oak Ridge National Laboratory
- Lu, Dan | Oak Ridge National Laboratory
- Lunga, Dalton | Oak Ridge National Laboratory
Overview
Description
Earth embeddings, vector representations of geographic locations indexed in space and time, are emerging as a unifying interface for geospatial AI. However, their quality depends not only on model design but also on how multimodal Earth observation (EO) data are spatially indexed, temporally aligned, and cross-modally associated during pretraining. We introduce MOSAIC-CONUS (Multimodal Observations with Spatially Aligned Imagery, Urban Points of Interest, In-Situ Measurements and Text Captions), a large-scale EO dataset over the contiguous United States, organized around 250,000 stratified point indices that serve as stable spatial keys across seven modalities: active radar, passive optical imagery, lidar-derived elevation, land cover, functional context, hydrometeorological measurements, and textual summaries. MOSAIC-CONUS makes four contributions that prior EO datasets do not jointly address: (1) an open-source, large-scale multimodal EO corpus structured around point-indexed data and designed to support Earth embedding learning; (2) explicit radar-optical pairing tables spanning twelve temporal alignment regimes, which formalize cross-sensor alignment as a controllable variable for analyzing how temporal mismatch across modalities influences the quality of learned embeddings; (3) a benchmark suite spanning cross-modal retrieval, annual nightlights regression, and basin-held-out streamflow prediction, positioning MOSAIC-CONUS as a benchmark-ready resource for multimodal AI systems; and (4) a language-based embedding layer built from co-registered textual summaries, enabling Earth embeddings to function as a queryable interface for agentic AI systems. The dataset and pairing protocols are publicly released.
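To illustrate what a temporal alignment regime for radar-optical pairing could look like in practice, the following is a minimal sketch, not the released pairing protocol. It pairs each radar acquisition date at a point index with the nearest optical acquisition date and keeps the pair only if the gap falls within a tolerance. The function name `pair_nearest`, the parameter `max_gap_days`, and the example dates are all hypothetical; the definitions of the twelve regimes in MOSAIC-CONUS are not given in this description.

```python
from bisect import bisect_left
from datetime import date


def pair_nearest(radar_dates, optical_dates, max_gap_days):
    """Pair each radar date with its nearest optical date.

    Hypothetical helper: a pair is kept only when the absolute gap is
    at most max_gap_days. Returns (radar_date, optical_date, gap_days).
    """
    optical_sorted = sorted(optical_dates)
    pairs = []
    for r in radar_dates:
        i = bisect_left(optical_sorted, r)
        # The nearest optical date is either just before or just after r.
        candidates = optical_sorted[max(i - 1, 0):i + 1]
        best = min(candidates, key=lambda o: abs((o - r).days), default=None)
        if best is not None and abs((best - r).days) <= max_gap_days:
            pairs.append((r, best, abs((best - r).days)))
    return pairs


# Illustrative acquisition dates for a single point index (made up).
radar = [date(2020, 6, 1), date(2020, 6, 13), date(2020, 7, 20)]
optical = [date(2020, 6, 3), date(2020, 6, 20), date(2020, 7, 1)]

# A tighter tolerance keeps fewer pairs; a looser one admits more mismatch.
tight = pair_nearest(radar, optical, max_gap_days=5)
loose = pair_nearest(radar, optical, max_gap_days=30)
```

Varying `max_gap_days` in a sketch like this is one way to treat temporal mismatch as a controllable variable: the same point index yields different radar-optical pairs under different tolerances.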
Funding Resources
DOE Contract Number
AC05-00OR22725
Originating Research Organization
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization
Office of Science (SC)
Details
DOI
10.13139/ORNLNCCS/3005125
Release Date
May 8, 2026
Dataset
Dataset Type
SM Specialized Mix
Acknowledgements
This work was carried out in part at Oak Ridge National Laboratory, managed by UT-Battelle, LLC for the U.S. Department of Energy under contract DE-AC05-00OR22725.
Category
- 54 ENVIRONMENTAL SCIENCES
- 58 GEOSCIENCES
Keywords
- Remote sensing