VoxTNT: A Multi-Scale Transformer-based Approach for 3D Object Detection in Point Clouds
Author Contributions
ZHENG Qiangwen and WU Sheng contributed to methodology design; ZHENG Qiangwen and WEI Jinghui contributed to experimental design; ZHENG Qiangwen performed the experiments; ZHENG Qiangwen and WEI Jinghui analyzed the experimental results; ZHENG Qiangwen drafted the manuscript; ZHENG Qiangwen and WU Sheng contributed to writing and revising the manuscript; WU Sheng provided funding. All authors have read and approved the final version of the manuscript for submission.
ZHENG Qiangwen (b. 1990), male, from Longyan, Fujian, Ph.D. candidate, whose research focuses on perception technologies for autonomous driving. E-mail: 593161522@qq.com
Received date: 2025-03-14
Revised date: 2025-04-18
Online published: 2025-06-06
Supported by
Science and Technology Innovation Team for Public Data Development and Utilization, Fujian Provincial Program for Innovative Research Teams (Min Jiao Ke [2023] No. 15).
ZHENG Qiangwen, WU Sheng, WEI Jinghui. VoxTNT: A multi-scale Transformer-based approach for 3D object detection in point clouds[J]. Journal of Geo-information Science, 2025, 27(6): 1361-1380. DOI: 10.12082/dqxxkx.2025.250122
[Background] Traditional methods, due to their static receptive field design, struggle to adapt to the significant scale differences among cars, pedestrians, and cyclists in urban autonomous driving scenarios. Moreover, cross-scale feature fusion often leads to hierarchical interference. [Methodology] To address the key challenge of cross-scale representation consistency in 3D object detection for multi-class, multi-scale objects in autonomous driving scenarios, this study proposes a novel method named VoxTNT. VoxTNT leverages an equalized receptive field and a local-global collaborative attention mechanism to enhance detection performance. At the local level, a PointSetFormer module is introduced, incorporating an Induced Set Attention Block (ISAB) to aggregate fine-grained geometric features from high-density point clouds through reduced cross-attention. This design overcomes the information loss typically associated with traditional voxel mean pooling. At the global level, a VoxelFormerFFN module is designed, which abstracts non-empty voxels into a super-point set and applies cross-voxel ISAB interactions to capture long-range contextual dependencies. This approach reduces the computational complexity of global feature learning from O(N²) to O(M²) (where M ≪ N and M is the number of non-empty voxels), avoiding the high computational complexity associated with directly applying complex Transformers to raw point clouds. This dual-domain coupled architecture achieves a dynamic balance between local fine-grained perception and global semantic association, effectively mitigating modeling bias caused by fixed receptive fields and multi-scale fusion. [Results] Experiments demonstrate that the proposed method achieves a single-stage detection Average Precision (AP) of 59.56% for moderate-level pedestrian detection on the KITTI dataset, an improvement of approximately 12.4% over the SECOND baseline.
For two-stage detection, it achieves a mean Average Precision (mAP) of 66.54%, outperforming the second-best method, BSAODet, which achieves 66.10%. Validation on the WOD dataset further confirms the method’s effectiveness, achieving 66.09% mAP, which outperforms the SECOND and PointPillars baselines by 7.7% and 8.5%, respectively. Ablation studies demonstrate that the proposed equalized local-global receptive field mechanism significantly improves detection accuracy for small objects. For example, on the KITTI dataset, full component ablation resulted in a 10.8% and 10.0% drop in AP for moderate-level pedestrian and cyclist detection, respectively, while maintaining stable performance for large-object detection. [Conclusions] This study presents a novel approach to tackling the challenges of multi-scale object detection in autonomous driving scenarios. Future work will focus on optimizing the model architecture to further enhance efficiency.
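The reduced cross-attention that the abstract attributes to the ISAB — replacing quadratic self-attention over n points with two cross-attentions through m ≪ n inducing points — can be sketched as follows. This is a minimal single-head NumPy illustration under assumed shapes, not the paper's implementation: it omits multi-head projections, layer normalization, and feed-forward layers, and the `inducing` array stands in for parameters that would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention; q: (nq, d), k/v: (nk, d) -> (nq, d)."""
    w = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (nq, nk) attention weights
    return w @ v

def isab(points, inducing):
    """Induced set attention sketch: the m inducing points first summarize
    the n-point set, then every point attends back to that m-point summary.
    Both steps cost O(n * m) instead of the O(n^2) of full self-attention."""
    summary = attend(inducing, points, points)   # (m, d) condensed set
    return attend(points, summary, summary)      # (n, d) updated features

n, m, d = 512, 8, 16                 # n raw points, m << n inducing points
points = rng.normal(size=(n, d))     # per-point features inside one voxel
inducing = rng.normal(size=(m, d))   # learned parameters in the real model
out = isab(points, inducing)
print(out.shape)  # (512, 16)
```

With n = 512 and m = 8, the two weight matrices are 8×512 and 512×8 rather than a single 512×512, which is the complexity saving the abstract refers to; the O(M²) term arises at the global level, where the M non-empty-voxel super-points attend among themselves.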
Conflicts of Interest
All authors disclose no relevant conflicts of interest.