YU Hanyang, LAN Chaozhen, WANG Longhao, WEI Zijun, GAO Tian, WANG Yiqiao, LIU Ruimeng
[Significance] Multimodal remote sensing image matching has become a fundamental task in integrated Earth observation, enabling precise spatial alignment across heterogeneous image sources. [Progress] As the diversity of sensing modalities, acquisition geometries, and temporal conditions increases, traditional matching frameworks have proven inadequate for capturing complex variations in radiometric responses, geometric configurations, and semantic representations. This technological gap has driven a significant paradigm shift from handcrafted feature engineering to deep learning-based solutions, which now form the core of current research and application development. This paper provides a comprehensive and structured review of recent advances in deep learning methods for multimodal remote sensing image matching, with an emphasis on the evolution of methodological paradigms and technical frameworks. It establishes a clear dual-path classification: the single-session approach and the end-to-end approach. The former selectively replaces or enhances individual components of traditional pipelines, such as feature encoding or similarity estimation, using neural network modules. The latter integrates the entire matching process into a unified network architecture, enabling joint optimization of feature learning, transformation modeling, and correspondence inference within a closed loop. This progression reflects the field's transition from modular adaptation to holistic modeling, revealing a deeper integration of data-driven representation learning with geometric reasoning. The review further examines the development of architectural strategies supporting this evolution, including attention mechanisms, graph-based structures, hierarchical feature fusion, and modality-bridging transformations. These innovations contribute to improved robustness, semantic consistency, and adaptability across diverse matching scenarios. 
Recent trends also show a growing reliance on pretrained vision foundation models, which provide transferable feature spaces and reduce the dependence on large-scale labeled datasets. In addition to summarizing technical advances, the paper analyzes representative datasets, performance evaluation strategies, and the current challenges that constrain real-world deployment: limited data availability, weak cross-scene generalization, computational inefficiency, and insufficient interpretability. [Prospect] By synthesizing methodological progress with practical demands, the review identifies key directions for future research, including the design of modality-invariant representations, physically informed neural architectures, and lightweight solutions tailored for scalable, real-time image registration in complex operational environments.