Fine-scale Pseudo Base Station Spam Message Analysis Method based on Zero-inflated Bayesian Spatiotemporal Modeling
SHI Yufei (b. 2000), female, from Tianmen, Hubei; master's student whose research focuses on crime geography. E-mail: shiyf27@mail2.sysu.edu.cn
Received date: 2022-04-19
Revised date: 2022-06-01
Online published: 2023-01-25
Supported by
National Natural Science Foundation of China (41971372)
Natural Science Foundation of Guangdong Province (2020A1515010680)
SHI Yufei, TAO Haiyan, ZHUO Li. Fine-scale Pseudo Base Station Spam Message Analysis Method based on Zero-inflated Bayesian Spatiotemporal Modeling[J]. Journal of Geo-information Science, 2022, 24(11): 2089-2101. DOI: 10.12082/dqxxkx.2022.220204
Spam message activity from pseudo base stations shows significant spatiotemporal autocorrelation and heterogeneity. Spatiotemporal analysis can capture the movement patterns and behavior of pseudo base stations and provide a scientific basis for relevant departments to formulate comprehensive policies and explore long-term management mechanisms. At fine scales, however, spam message datasets contain an excess of zeros, and the resulting zero inflation makes current spatiotemporal analysis methods unsuitable. Using pornographic service spam message data collected in Beijing from February 23 to April 26, 2017 as an example, we constructed a zero-inflated Bayesian spatiotemporal model. The model not only handles zero inflation but also jointly analyzes the spatial, temporal, and spatiotemporal effects and the external influencing factors of pseudo base stations, so that areas of high relative risk can be identified and the influence of the urban built environment explored. The results show that, with zeros accounting for 83.46% of the dataset, the Bayesian spatiotemporal model based on the zero-inflated Poisson distribution achieves a better fit. The high-risk areas for pornographic service spam messages are mainly concentrated in the eastern part of Beijing's main urban area, and the area with the highest risk lies in Chaoyang District. Risk tends to rise on Thursdays, Fridays, and Saturdays, and the high-incidence period runs from 18:00 to 02:00 the next day. Pseudo base stations generally start moving from the southwest of the main urban area toward the northeast at 18:00 and gather in the northwest of Chaoyang District at 01:00. Commercial residence and accommodation service environments are positively correlated with spam messages, while catering service and police station environments are negatively correlated.
For fine-scale research on pseudo base station spam messages, the zero-inflated Bayesian spatiotemporal model offers a method that effectively integrates data from multiple time cross-sections, fully accounts for the spatiotemporal relationships and external influencing factors of pseudo base stations, and handles the excess of zeros in the data. It provides an important analytical method for developing and validating environmental criminology theory on pseudo base stations.
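To make the zero-inflation idea concrete, the following is a minimal sketch of the textbook zero-inflated Poisson (ZIP) probability mass function, in which a count is a structural zero with probability `pi` and otherwise follows a Poisson distribution with mean `lam`. The function name `zip_pmf` and the parameter values are illustrative assumptions; the abstract does not state the exact parameterization used in the paper.

```python
import math


def zip_pmf(k: int, lam: float, pi: float) -> float:
    """Zero-inflated Poisson probability of observing count k.

    pi  : probability of a structural (excess) zero
    lam : mean of the Poisson component
    """
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        # A zero arises either structurally or from the Poisson component.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


# Hypothetical parameters, chosen only for illustration.
lam, pi = 1.5, 0.8

p_zero_zip = zip_pmf(0, lam, pi)      # zero probability under ZIP
p_zero_poisson = math.exp(-lam)       # zero probability under plain Poisson
total = sum(zip_pmf(k, lam, pi) for k in range(60))  # should be close to 1
```

With the same Poisson mean, the ZIP distribution puts far more mass on zero than the plain Poisson distribution, which is why it is a natural candidate when most space-time units record no messages.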
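Tab. 1 Dataset field names and meanings
Field name | Meaning |
---|---|
Phone | Sender phone number spoofed by the pseudo base station |
Content | Text of the message |
Md5 | MD5 hash of the message body |
Recitime | Timestamp at which the spam message was received |
Conntime | Timestamp of the connection to the pseudo base station |
lng | Approximate longitude of the pseudo base station when it sent the spam message |
lat | Approximate latitude of the pseudo base station when it sent the spam message |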
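As a rough illustration of how records with the fields in Tab. 1 could be turned into fine-scale space-time counts, the sketch below bins each record by grid cell and hour and then measures the share of empty units. The cell size `CELL_DEG`, the assumption that `Recitime` is a Unix timestamp in seconds, the UTC+8 time zone, and the sample records are all hypothetical; this listing does not give the paper's actual grid or timestamp format.

```python
import math
from collections import Counter
from datetime import datetime, timedelta, timezone

CELL_DEG = 0.01                          # hypothetical cell size in degrees
BEIJING = timezone(timedelta(hours=8))   # assumed local time zone (UTC+8)


def space_time_key(record):
    """Map one record to (grid column, grid row, date, hour)."""
    t = datetime.fromtimestamp(record["Recitime"], tz=BEIJING)
    col = math.floor(record["lng"] / CELL_DEG)
    row = math.floor(record["lat"] / CELL_DEG)
    return col, row, t.date().isoformat(), t.hour


def zero_share(counts, n_cells, n_slots):
    """Fraction of cell-by-time units that received no message."""
    return 1.0 - len(counts) / (n_cells * n_slots)


# Made-up sample records; only the field names come from Tab. 1.
records = [
    {"Recitime": 1487851200, "lng": 116.4512, "lat": 39.9321},
    {"Recitime": 1487851260, "lng": 116.4518, "lat": 39.9325},
    {"Recitime": 1487858400, "lng": 116.3021, "lat": 39.8710},
]

counts = Counter(space_time_key(r) for r in records)
share = zero_share(counts, n_cells=4, n_slots=3)
```

Because `counts` stores only the units that received at least one message, every other cell-hour combination in the study window is an implicit zero, which is how a very high zero share can arise once the grid and time slices become fine.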
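Tab. 2 Brief information about POI data
POI category | Contents | Count (records) |
---|---|---|
Catering services | Chinese restaurants, fast-food restaurants, cafés, bakeries, etc. | 48 412 |
Transportation facilities | Railway stations, airports, subway stations, bus stops, etc. | 40 852 |
Shopping services | Shopping malls, convenience stores, home appliance stores, etc. | 55 566 |
Companies and enterprises | Companies; agriculture, forestry, animal husbandry, and fishery bases; etc. | 41 853 |
Accommodation services | Hotels, guesthouses, etc. | 9461 |
Daily life services | Telecom service outlets, shared equipment, etc. | 50 417 |
Commercial residences | Residential areas, buildings, etc. | 26 436 |
Police stations | Police bureaus, police stations, etc. | 1043 |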
Tab. 3 Evaluation results of candidate zero-inflated Bayesian spatiotemporal models
Model | DIC | WAIC |
---|---|---|
M0 | 1 081 712.22 | 817 283.63 |
M1 | 534 768.70 | 524 369.00 |
M2 | 516 833.00 | 523 083.81 |
M3 | 371 102.18 | 406 692.38 |
M4 | 58 057.76 | 57 216.10 |
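Lower DIC and WAIC values generally indicate a better balance between fit and model complexity. As a small sketch, the values in Tab. 3 can be compared programmatically; the dictionary layout and variable names are illustrative assumptions, while the numbers are copied from the table.

```python
# DIC and WAIC values copied from Tab. 3.
scores = {
    "M0": {"DIC": 1081712.22, "WAIC": 817283.63},
    "M1": {"DIC": 534768.70, "WAIC": 524369.00},
    "M2": {"DIC": 516833.00, "WAIC": 523083.81},
    "M3": {"DIC": 371102.18, "WAIC": 406692.38},
    "M4": {"DIC": 58057.76, "WAIC": 57216.10},
}

# Pick the model with the smallest value of each criterion.
best_by_dic = min(scores, key=lambda m: scores[m]["DIC"])
best_by_waic = min(scores, key=lambda m: scores[m]["WAIC"])
```

Both criteria select M4, the model that is carried forward into the comparison with the plain Poisson model.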
Tab. 4 Comparison of Bayesian spatiotemporal models with zero-inflated Poisson distribution and Poisson distribution
Model | PD | PW | LS |
---|---|---|---|
M4 | 3916.53 | 2700.35 | 1.45 |
M5(Poisson) | 7733.67 | 5011.74 | 4.59 |
Note: M5 denotes the Bayesian spatiotemporal model with a Poisson distribution that includes the temporal, spatial, and spatiotemporal interaction terms.
Fig. 5 Spatiotemporal relative risk distribution of pornographic service spam messages within the Sixth Ring Road of Beijing from 18:00 on Thursday to 01:59 on Friday
Fig. 6 Spatiotemporal relative risk distribution of pornographic service spam messages within the Sixth Ring Road of Beijing from 18:00 on Friday to 01:59 on Saturday
Tab. 5 Results of forward stepwise regression analysis of covariates
Covariate | Regression coefficient | t statistic | Significance (p) |
---|---|---|---|
Catering services | -7.337 | -3.787 | 0.00021*** |
Commercial residences | 10.854 | 2.095 | 0.03760** |
Police stations | -173.879 | -1.926 | 0.05568* |
Accommodation services | 46.763 | 4.374 | 2.09000e-5*** |
Note: ***, **, and * denote significance at the 0.001, 0.05, and 0.1 levels, respectively.
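Tab. 6 Covariate posterior parameters and relative risk values
Covariate | Posterior mean (confidence interval) | Relative risk (RR) |
---|---|---|
Catering services | -0.003*(-0.004,-0.004) | 0.997 |
Commercial residences | 0.002*(0.001,0.003) | 1.002 |
Police stations | -0.074*(-0.087,-0.060) | 0.929 |
Accommodation services | 0.013*(0.011,0.014) | 1.013 |
Note: * indicates that the covariate is statistically significant at the 95% confidence interval.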
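The RR column in Tab. 6 is numerically consistent with exponentiating each posterior mean, which is the usual reading of a coefficient under a log link. The sketch below reproduces that arithmetic; the log-link interpretation is an assumption inferred from the numbers rather than something stated in this listing, and the variable names are illustrative.

```python
import math

# Posterior means copied from Tab. 6.
posterior_mean = {
    "Catering services": -0.003,
    "Commercial residences": 0.002,
    "Police stations": -0.074,
    "Accommodation services": 0.013,
}

# Assumed log link: relative risk per unit increase is exp(coefficient).
relative_risk = {
    name: round(math.exp(beta), 3) for name, beta in posterior_mean.items()
}
```

Read this way, an RR of 0.929 for police stations corresponds to roughly a 7% lower relative risk per additional unit of that covariate, while an RR of 1.013 for accommodation services corresponds to roughly a 1.3% higher relative risk.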