Remote Sensing Science and Application Technology

CATrans: A Cross-Scale Attention Transformer for Land Cover Semantic Segmentation in High-Resolution Remote Sensing Images

  • CHEN Lijia 1,
  • CHEN Honghui 2,
  • XIE Yanqiu 3,
  • HE Tianyou 3,
  • YE Jing 3,
  • WU Linhuang 2,*
  • 1. College of Art and Design, Fujian Business University, Fuzhou 350599, China
  • 2. College of Physics and Information Engineering, Fuzhou University, Fuzhou 350108, China
  • 3. College of Landscape Architecture and Art, Fujian Agriculture and Forestry University, Fuzhou 350002, China
*Corresponding author: WU Linhuang (1984—), male, native of Fuzhou, Fujian; PhD, associate professor; mainly engaged in research on artificial intelligence and image processing algorithms. E-mail:

Author Contributions

The study was designed by CHEN Lijia and CHEN Honghui. The experiments were performed by CHEN Lijia, CHEN Honghui, XIE Yanqiu, HE Tianyou, and YE Jing. The manuscript was drafted and revised by CHEN Lijia, CHEN Honghui, XIE Yanqiu, and WU Linhuang. All authors have read and approved the final version of the manuscript for submission.

First author: CHEN Lijia (1990—), female, native of Fuzhou, Fujian; PhD candidate; mainly engaged in research on landscape architecture and remote sensing information analysis. E-mail:

Received date: 2025-02-26

  Revised date: 2025-05-06

  Online published: 2025-07-07

Supported by

National Natural Science Foundation of China (62171135)

Fujian Province Key Industry-University-Research Project (2023XQ004)



Cite this article

CHEN Lijia, CHEN Honghui, XIE Yanqiu, HE Tianyou, YE Jing, WU Linhuang. CATrans: A Cross-Scale Attention Transformer for Land Cover Semantic Segmentation in High-Resolution Remote Sensing Images[J]. Journal of Geo-Information Science, 2025, 27(7): 1624-1637. DOI: 10.12082/dqxxkx.2025.250092

Abstract

[Objectives] High-resolution remote sensing image segmentation provides essential data support for urban planning, land use, and land cover analysis by accurately extracting terrain information. However, traditional methods struggle to predict object categories at the pixel level because of the high computational cost of processing high-resolution images. Current segmentation approaches often divide remote sensing images into a series of standard blocks and perform multi-scale local segmentation, which captures semantic information at different granularities. Yet these methods exhibit weak feature interaction between blocks, as they do not consider contextual prior knowledge, ultimately reducing local segmentation performance. [Methods] To address this issue, this paper proposes a high-resolution remote sensing image segmentation framework named CATrans (Cross-scale Attention Transformer), which combines cross-scale attention with a semantic-based visual Transformer. CATrans first predicts the segmentation results of local blocks and then merges them to produce the final global image segmentation, introducing contextual prior knowledge to enhance local feature representation. Specifically, we propose a cross-scale attention mechanism to integrate contextual semantic information with multi-level features. The multi-branch parallel structure of the cross-scale attention module enhances focus on objects of varying granularities by analyzing shallow-deep and local-global dependencies. This mechanism aggregates cross-spatial information across various dimensions and weights multi-scale kernels to strengthen multi-level feature representations, allowing the model to avoid deep stacking and multiple sequential processes. Additionally, a semantic-based visual Transformer is adopted to couple multi-level contextual semantic information, with spatial attention used to reinforce these semantic representations. The multi-level contextual information is grouped into abstract semantic concepts, which are then fed into the Transformer for sequence modeling. The self-attention mechanism within the Transformer captures dependencies between different positions in the input sequence, thereby strengthening the correlation between contextual semantics and spatial positions. Finally, enhanced contextual semantics are generated through feature mapping. [Results] Comparative experiments were conducted on the DeepGlobe, Inria Aerial, and LoveDA datasets. The results show that CATrans outperforms existing segmentation methods, including the Discrete Wavelet Smooth Network (WSDNet) and the Integrating Shallow and Deep Network (ISDNet). CATrans achieves a Mean Intersection over Union (mIoU) of 76.2%, 79.2%, and 54.2%, and a Mean F1 Score (mF1) of 86.5%, 87.8%, and 66.8%, with inference speeds of 38.1 FPS, 13.2 FPS, and 95.22 FPS on the respective datasets. Compared to WSDNet, the strongest of the compared methods, CATrans improves segmentation performance across all classes, with mIoU gains of 2.1%, 4.0%, and 5.3%, and mF1 gains of 1.3%, 1.8%, and 5.6%. [Conclusions] These findings show that the proposed CATrans framework significantly enhances high-resolution remote sensing image segmentation by incorporating contextual prior knowledge to improve local feature representation, achieving an effective balance between segmentation performance and computational efficiency.
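The [Methods] portion of the abstract can be made concrete with a short sketch. The PyTorch code below illustrates one plausible reading of the two components: a cross-scale attention module that fuses a shallow (high-resolution, local) feature map with a deep (low-resolution, global) one through parallel channel and spatial gates plus multi-scale kernel weighting, and a semantic-based visual Transformer that groups features into a few abstract semantic tokens, models their dependencies with self-attention, and maps the enhanced semantics back onto the spatial grid. All class names, layer choices, and dimensions here are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of the two CATrans components as described in the abstract;
# every architectural detail below is an assumption for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleAttention(nn.Module):
    """Fuse a shallow (local, high-res) and a deep (global, low-res) feature
    map with parallel channel/spatial attention and multi-scale kernels."""
    def __init__(self, shallow_ch, deep_ch, out_ch):
        super().__init__()
        self.proj_shallow = nn.Conv2d(shallow_ch, out_ch, 1)
        self.proj_deep = nn.Conv2d(deep_ch, out_ch, 1)
        # Channel branch: global pooling -> per-channel gate.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // 4, out_ch, 1), nn.Sigmoid())
        # Spatial branch: mean/max channel statistics -> per-pixel gate.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())
        # Parallel multi-scale depthwise kernels for different granularities.
        self.scales = nn.ModuleList(
            nn.Conv2d(out_ch, out_ch, k, padding=k // 2, groups=out_ch)
            for k in (3, 5, 7))
        self.fuse = nn.Conv2d(3 * out_ch, out_ch, 1)

    def forward(self, shallow, deep):
        # Upsample global context to the local grid, then fuse shallow-deep.
        deep = F.interpolate(self.proj_deep(deep), size=shallow.shape[-2:],
                             mode='bilinear', align_corners=False)
        x = self.proj_shallow(shallow) + deep
        x = x * self.channel_gate(x)                     # channel attention
        stats = torch.cat([x.mean(1, keepdim=True),
                           x.amax(1, keepdim=True)], 1)
        x = x * self.spatial_gate(stats)                 # spatial attention
        return self.fuse(torch.cat([s(x) for s in self.scales], 1))

class SemanticTokenTransformer(nn.Module):
    """Group pixels into a few semantic tokens, run self-attention over the
    tokens, and project the enhanced semantics back onto the feature map."""
    def __init__(self, channels, num_tokens=16, num_heads=4, depth=2):
        super().__init__()
        self.grouping = nn.Conv2d(channels, num_tokens, 1)  # pixel->token weights
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=num_heads,
                                           dim_feedforward=2 * channels,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        b, c, h, w = x.shape
        flat = x.flatten(2)                               # B x C x HW
        attn = self.grouping(x).flatten(2).softmax(-1)    # B x T x HW
        tokens = attn @ flat.transpose(1, 2)              # B x T x C tokens
        tokens = self.encoder(tokens)                     # token self-attention
        out = attn.transpose(1, 2) @ tokens               # B x HW x C
        return x + out.transpose(1, 2).reshape(b, c, h, w)  # residual update

# Hypothetical usage on dummy backbone features.
shallow = torch.randn(1, 64, 128, 128)   # high-res, local detail
deep = torch.randn(1, 256, 32, 32)       # low-res, global context
fused = CrossScaleAttention(64, 256, 128)(shallow, deep)
refined = SemanticTokenTransformer(128)(fused)
print(refined.shape)                     # torch.Size([1, 128, 128, 128])
```

The depthwise multi-scale kernels keep the parallel branches cheap, which matches the abstract's claim that weighting multi-scale kernels lets the model avoid deep stacking and multiple sequential processes. For reference, the two reported metrics follow their standard definitions: with $TP_c$, $FP_c$, and $FN_c$ the per-class true positives, false positives, and false negatives over $C$ classes,

\[
\mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C}\frac{TP_c}{TP_c + FP_c + FN_c},
\qquad
\mathrm{mF1} = \frac{1}{C}\sum_{c=1}^{C}\frac{2\,TP_c}{2\,TP_c + FP_c + FN_c}.
\]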

Conflicts of Interest

All authors declare that they have no conflicts of interest.
