
Self-Supervised Cross-Modal Text-Image Time Series Retrieval in Remote Sensing

This repository contains the code of the paper Self-Supervised Cross-Modal Text-Image Time Series Retrieval in Remote Sensing submitted to IEEE Transactions on Geoscience and Remote Sensing. This work has been done at the Remote Sensing Image Analysis group by Genc Hoxha, Olivér Angyal and Begüm Demir.

If you use the code from this repository in your research, please cite the following paper:

G. Hoxha, O. Angyal, and B. Demir, "Self-Supervised Cross-Modal Text-Image Time Series Retrieval in Remote Sensing," IEEE Transactions on Geoscience and Remote Sensing (TGRS), 2025.

@article{hoxha2025T-ITSR,
  title={Self-Supervised Cross-Modal Text-Image Time Series Retrieval in Remote Sensing},
  author={G. {Hoxha} and O. {Angyal} and B. {Demir}},
  journal={IEEE Transactions on Geoscience and Remote Sensing (TGRS)},
  year={2025}
}

Introduction

The development of image time series retrieval (ITSR) methods is a growing research interest in remote sensing (RS). Given a user-defined image time series (i.e., the query time series), ITSR methods search large archives and retrieve the image time series whose content is similar to that of the query. The existing ITSR methods in RS are designed for unimodal retrieval problems, which limits their usability and versatility. To overcome this issue, we introduce, for the first time in RS, the task of cross-modal text-ITSR. In particular, we present a self-supervised cross-modal text-image time series retrieval (text-ITSR) method that enables the retrieval of image time series using text sentences as queries, and vice versa. In detail, we focus our attention on text-ITSR for pairs of images (i.e., bitemporal images). The proposed text-ITSR method consists of two key components: 1) modality-specific encoders to model the semantic content of bitemporal images and text sentences with discriminative features; and 2) modality-specific projection heads to align textual and image representations in a shared embedding space. To effectively model the temporal information within the bitemporal images, we introduce two fusion strategies: i) a global feature fusion (GFF) strategy that combines global image features through simple yet effective operators; and ii) a transformer-based feature fusion (TFF) strategy that leverages transformers for fine-grained temporal integration. Extensive experiments conducted on two benchmark RS archives demonstrate the effectiveness of the proposed method in accurately retrieving semantically relevant bitemporal images (or text sentences) given a query text sentence (or bitemporal image).

Figure: Overview of the proposed text-ITSR method (structure.png).
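
To make the two fusion strategies concrete, below is a minimal PyTorch sketch of how GFF and TFF could look. The module names, feature dimensions and pooling choices are our own assumptions for illustration, not the exact implementation used in this repository.

import torch
import torch.nn as nn

class GlobalFeatureFusion(nn.Module):
    # GFF: combine the global features of the two time steps with a simple operator.
    def __init__(self, mode="concat"):
        super().__init__()
        self.mode = mode

    def forward(self, f_t1, f_t2):  # each: (batch, dim)
        if self.mode == "concat":
            return torch.cat([f_t1, f_t2], dim=-1)  # (batch, 2 * dim)
        return f_t2 - f_t1  # "subtract": change between the two time steps

class TransformerFeatureFusion(nn.Module):
    # TFF: a transformer encoder attends over the two time steps for fine-grained fusion.
    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, f_t1, f_t2):  # each: (batch, dim)
        pair = torch.stack([f_t1, f_t2], dim=1)  # (batch, 2, dim): a length-2 sequence
        return self.encoder(pair).mean(dim=1)  # pool back to one (batch, dim) embedding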


Prerequisites

The code in this repository uses the requirements specified in conda_env.yml. To install the requirements, call conda env create -f conda_env.yml.

For the evaluation we use nlg-eval. Please follow these steps:

  1. Follow the installation instructions for nlg-eval. Make sure you are using Python 3.8.
  2. Find the compute_metrics function in the __init__.py file of the installed package. Usually it's under <your env path>/python3.8/site-packages/nlg_eval-2.3-py3.8.egg/nlgeval/__init__.py.
  3. Replace the following code:
with open(hypothesis, 'r') as f:
    hyp_list = f.readlines()
ref_list = []
for iidx, reference in enumerate(references):
    with open(reference, 'r') as f:
        ref_list.append(f.readlines())

with:

hyp_list = hypothesis
ref_list = references
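
With this patch in place, compute_metrics accepts in-memory lists of sentences instead of file paths. A minimal usage sketch (the sentences are only illustrative):

from nlgeval import compute_metrics

# One hypothesis sentence and one reference "stream"; each inner reference list
# must be aligned with (same length as) the hypothesis list.
hyp_list = ["two buildings appear on the bare land"]
ref_list = [["two buildings are constructed on the bare ground"]]
metrics = compute_metrics(hypothesis=hyp_list, references=ref_list)  # dict of NLG scores
print(metrics)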

Datasets

The Dubai CCD and Levir-CC datasets were used to test the method. To create the input files for training on the Levir-CC dataset, run the Jupyter notebook ./datasets/create_input_files_training_Levir-CC.ipynb. The input files for training on the Dubai CCD do not need to be changed from the original release.

Configs

  • Dubai CCD ./config/dubai-cc-config.json
  • Levir-CC ./config/levir-cc-config.json
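
The JSON config files set values for the program arguments listed under Program Arguments below. A minimal sketch of what such a file could contain; the keys follow the documented argument names, but the values here are only illustrative:

{
  "DATA_DIR": "./data/LEVIR-CC",
  "DATASET": "LEVIR-CC",
  "IMG_NET": "clip",
  "VERSION": "RGB",
  "FUSION_STRATEGY": "Fusion",
  "NUM_EPOCH": 50,
  "BATCH_SIZE": 32,
  "topk": 10,
  "Train": true
}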

Training

Run the training for:

  • Dubai CCD with CUDA_VISIBLE_DEVICES=0 python main.py --config=./config/dubai-cc-config.json
  • Levir-CC dataset with CUDA_VISIBLE_DEVICES=0 python main.py --config=./config/levir-cc-config.json

Program Arguments

The following are possible program arguments; an example invocation is shown after the list.

  • --DATA_DIR: source path of the dataset
  • --DATASET: dataset to use, DUBAI-CC or LEVIR-CC
  • --IMG_NET: image feature extraction, clip or resnet
  • --VERSION: RGB or MULTISPECTRAL
  • --GPU_ID: cuda device name used for training
  • --FUSION_STRATEGY: fusion strategy to use: Fusion for TFF, concat for GFF with concatenation, or subtract for GFF with subtraction
  • --NUM_EPOCH: number of epochs used for training
  • --BATCH_SIZE: size of each batch that is processed
  • --topk: number of top retrieved items on which the evaluation is done
  • --EVAL_ROUNDS: number of evaluation rounds where a random caption is selected for every image pair at each time
  • --Train: true for training and false for testing
  • --MODEL_NAME: name of the model
  • --CHECKPOINT_DIR: directory to save the model
  • --LOG_DIR: directory of the logs (i.e., results)
  • --FIGURE_DIR: directory of the figures (i.e., plotting of the loss function)
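
For example, assuming main.py lets command-line flags override values from the config file (the actual argument handling may differ), an evaluation run could look like:

CUDA_VISIBLE_DEVICES=0 python main.py --config=./config/levir-cc-config.json --Train=false --topk=10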

Authors

Genc Hoxha https://rsim.berlin/team/members/genc-hoxha

Olivér Angyal

Begüm Demir https://rsim.berlin/team/members/begum-demir

For questions, requests and concerns, please contact Olivér Angyal or Genc Hoxha via email.

License

The code in this repository is available under the terms of the MIT license:

Copyright (c) 2025 the Authors of The Paper, "Self-Supervised Cross-Modal Text-Image Time Series Retrieval in Remote Sensing"

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.