Spatio-temporal Analysis of Pseudo Base Stations in Beijing Downtown
About the author: WANG Wei (1996-), male, from Anqing, Anhui Province; undergraduate student, mainly engaged in spatio-temporal data mining. E-mail: wangw227@mail2.sysu.edu.cn
Received date: 2017-09-17
Request revised date: 2018-01-22
Online published: 2018-07-13
Supported by
National Natural Science Foundation of China, No.41371499; Guangdong Province Natural Science Foundation research team project, No.2014A030312010
WANG Wei, TAO Haiyan, ZHUO Li, LI Min, LI Xuliang, WANG Keli, SHI Qingli. Spatio-temporal Analysis of Pseudo Base Stations in Beijing Downtown[J]. Journal of Geo-information Science, 2018, 20(7): 978-987. DOI: 10.12082/dqxxkx.2018.170430
With the rapid growth of public mobile communications, pseudo base stations have become rampant and are now a major public hazard: they disrupt the normal telecommunications order, endanger public safety, seriously damage people's property rights, and violate citizens' privacy. How to mine the spatio-temporal patterns of pseudo base station activity from massive spam message data, design effective prevention and control measures, and tackle the problem at its source has become a shared concern of government agencies and researchers. Traditional methods that identify pseudo base stations through the user terminal, however, face great challenges in accuracy, comprehensiveness, and analytical ability, and no longer meet the requirements of identifying small-scale, mobile pseudo base stations. Using spam message data collected in Beijing from February 23 to April 26, 2017, this paper analyzes the spatio-temporal distribution of pseudo base stations through non-negative matrix factorization (NMF). We also built a spam message classification model on TF-IDF (Term Frequency-Inverse Document Frequency) features, compared four classifiers (k-nearest neighbors, kernel support vector machine, random forest, and a single-layer neural network), and selected random forest, the most accurate of them. Combined with land use data, we then analyzed the spatio-temporal distribution of pseudo base stations sending different types of spam messages. The results show that: (1) most spam messages in Beijing are sent along the road network and in the central city; (2) far more spam messages are sent during the day than at night; (3) over the course of the day, the distribution of spam messages gradually shrinks inward along the road network; (4) pseudo base stations that send different types of spam messages differ in their spatio-temporal distribution, but all of them favor traffic facilities and residential areas within the Fourth Ring Road; and (5) the results of non-negative matrix factorization match those obtained from spam message classification well. The study indicates that non-negative matrix factorization is simple to implement and yields a decomposition whose form and results are interpretable. It can help reveal the spatio-temporal patterns of different types of spam messages and provide evidence-based, targeted suggestions for government agencies fighting pseudo base stations, which is of value for curbing the illegal activities that rely on them at their source.
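To make the factorization step concrete, the following is a minimal sketch of non-negative matrix factorization using the standard Lee-Seung multiplicative updates. It is not the authors' implementation. The layout of the input matrix (grid cells as rows, hours of the day as columns), the toy data, the rank of 2, and the names `nmf`, `spatial`, `morning`, and `evening` are all assumptions made for illustration.

```python
import numpy as np

def nmf(V, rank, n_iter=1000, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V (cells x hours) as W @ H.

    Uses Lee-Seung multiplicative updates for the squared-error objective.
    W holds one spatial pattern per column, H one temporal pattern per row.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    W = rng.random((n_rows, rank)) + eps
    H = rng.random((rank, n_cols)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors non-negative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy data: 6 grid cells x 24 hours, built from two
# activity patterns (a morning peak and an evening peak).
hours = np.arange(24)
morning = np.exp(-0.5 * ((hours - 9) / 2.0) ** 2)
evening = np.exp(-0.5 * ((hours - 19) / 2.0) ** 2)
spatial = np.array([[5, 0], [4, 0], [3, 1], [1, 3], [0, 4], [0, 5]], dtype=float)
V = spatial @ np.vstack([morning, evening])

W, H = nmf(V, rank=2)
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print("relative reconstruction error:", rel_error)
```

Because both factors are non-negative, each row of `H` can be read as an hourly activity profile and the matching column of `W` as the set of cells where that profile dominates, which is the kind of interpretability the abstract refers to.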
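Tab. 1 Field names and definitions of the raw data
Field name | Definition |
---|---|
phone | Sender phone number spoofed by the pseudo base station |
content | Full text of the message |
md5 | MD5 hash of the message text |
recitime | Timestamp at which the spam message was received |
conntime | Timestamp of the connection to the pseudo base station |
lng | Approximate longitude of the pseudo base station when the message was sent |
lat | Approximate latitude of the pseudo base station when the message was sent |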
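As an illustration of how records with the fields in Tab. 1 might be handled, the sketch below defines a record type and bins messages by hour of day. Several points are assumptions not stated in the source: that `recitime` and `conntime` are Unix timestamps in seconds, that the MD5 is taken over the UTF-8 encoded text, and that hours are counted in Beijing time (UTC+8). The sample records, `SpamRecord`, `content_md5`, `hourly_counts`, and `ts` are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

BEIJING_TZ = timezone(timedelta(hours=8))  # assumption: analysis in local time

@dataclass(frozen=True)
class SpamRecord:
    phone: str      # sender number spoofed by the pseudo base station
    content: str    # message text
    md5: str        # MD5 of the message text
    recitime: int   # receive time, assumed to be Unix seconds
    conntime: int   # connection time, assumed to be Unix seconds
    lng: float      # approximate longitude
    lat: float      # approximate latitude

    def hour_of_day(self) -> int:
        return datetime.fromtimestamp(self.recitime, tz=BEIJING_TZ).hour

def content_md5(content: str) -> str:
    """MD5 hex digest of the text (UTF-8 encoding is an assumption)."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()

def hourly_counts(records):
    """Count messages received in each of the 24 hours of the day."""
    counts = [0] * 24
    for record in records:
        counts[record.hour_of_day()] += 1
    return counts

def ts(hour, minute):
    """Helper: Unix seconds for a time on 2017-02-23, Beijing time."""
    return int(datetime(2017, 2, 23, hour, minute, tzinfo=BEIJING_TZ).timestamp())

# Hypothetical sample records (not taken from the study's data).
texts = ["bank card frozen, verify now", "invoices issued, call now",
         "new apartment for sale"]
times = [ts(9, 30), ts(9, 45), ts(20, 10)]
records = [
    SpamRecord("00000", text, content_md5(text), t, t - 5, 116.40, 39.90)
    for text, t in zip(texts, times)
]
print(hourly_counts(records))
```

Identical message bodies share one `md5` value, so the field can serve as a cheap key for grouping repeated broadcasts of the same text.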
Fig. 1 The study area: Beijing, China
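Tab. 2 The classification of spam messages
Category | Category No. | Subcategory | Subcategory No. |
---|---|---|---|
Fraud | 1 | In the name of a bank | 1 |
Fraud | 1 | In the name of a telecom operator | 2 |
Fraud | 1 | Other | 3 |
Illegal advertising | 2 | Trade in prohibited goods | 4 |
Illegal advertising | 2 | Pornographic services | 5 |
Illegal advertising | 2 | Fake certificates and fake invoices | 6 |
Harassment | 3 | Malicious harassment | 7 |
Harassment | 3 | Minor disturbance | 8 |
General advertising | 4 | Real estate agency | 9 |
General advertising | 4 | Finance and wealth management | 10 |
General advertising | 4 | Other advertising | 11 |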
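The sketch below illustrates how TF-IDF features can drive a classifier over categories such as those in Tab. 2. It deliberately swaps in a nearest-centroid classifier with cosine similarity in place of the random forest selected in the paper, so that the example stays dependency-free. The whitespace tokenizer, the smoothed IDF formula, the English toy messages, and all function names are assumptions; real Chinese text would first need word segmentation.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: whitespace tokens; Chinese text would need segmentation.
    return text.lower().split()

def fit_idf(docs):
    """Smoothed inverse document frequency for every training token."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0
            for tok, df in doc_freq.items()}

def normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {tok: v / norm for tok, v in vec.items()}

def tfidf(doc, idf):
    """Unit-length TF-IDF vector; tokens unseen in training are dropped."""
    term_freq = Counter(tok for tok in tokenize(doc) if tok in idf)
    return normalize({tok: count * idf[tok] for tok, count in term_freq.items()})

def train_centroids(docs, labels, idf):
    """Sum the TF-IDF vectors of each class, then scale to unit length."""
    sums = {}
    for doc, label in zip(docs, labels):
        centroid = sums.setdefault(label, Counter())
        for tok, value in tfidf(doc, idf).items():
            centroid[tok] += value
    return {label: normalize(dict(c)) for label, c in sums.items()}

def predict(doc, idf, centroids):
    """Return the label whose centroid has the highest cosine similarity."""
    vec = tfidf(doc, idf)
    def score(label):
        return sum(v * centroids[label].get(tok, 0.0) for tok, v in vec.items())
    return max(centroids, key=score)

# Hypothetical training messages labelled with category numbers from Tab. 2.
train_docs = [
    "your bank account is frozen verify your card now",
    "bank notice your card points expire click link",
    "invoices and certificates issued fast call now",
    "cheap invoices available any amount call",
    "new apartment near subway for sale low price",
    "apartment for rent and sale agency discount",
]
train_labels = [1, 1, 2, 2, 4, 4]

idf = fit_idf(train_docs)
centroids = train_centroids(train_docs, train_labels, idf)
print(predict("bank card verify link", idf, centroids))
```

The same TF-IDF vectors could be passed to any of the four classifiers compared in the paper; only the final decision rule would change.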
Tab. 3 The classification result and its accuracy
Classifier | Metric | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | Mean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RF | p | 0.98 | 0.95 | 0.12 | 1 | 0.98 | 0.99 | 0.98 | 0.5 | 0.97 | 0.98 | 0.91 | 0.85 |
RF | r | 1 | 0.77 | 0.06 | 0.69 | 0.91 | 0.99 | 0.85 | 0.98 | 0.96 | 0.93 | 0.69 | 0.8 |
RF | F1 | 0.99 | 0.84 | 0.08 | 0.8 | 0.94 | 0.99 | 0.91 | 0.66 | 0.97 | 0.96 | 0.78 | 0.81 |
KNN | p | 0.99 | 0.9 | 0.58 | 0.83 | 0.99 | 0.98 | 0.96 | 0.16 | 1 | 0.95 | 0.99 | 0.85 |
KNN | r | 0.98 | 0.51 | 0.3 | 0.27 | 0.88 | 0.98 | 0.7 | 0.67 | 0.73 | 0.68 | 0.36 | 0.64 |
KNN | F1 | 0.98 | 0.65 | 0.39 | 0.39 | 0.93 | 0.98 | 0.8 | 0.25 | 0.84 | 0.79 | 0.52 | 0.69 |
KSVM-linear | p | 0.99 | 0.92 | 0.52 | 0.99 | 0.98 | 0.98 | 1 | 0.41 | 0.98 | 0.99 | 0.89 | 0.88 |
KSVM-linear | r | 1 | 0.83 | 0.3 | 0.73 | 0.91 | 1 | 0.85 | 0.74 | 0.96 | 0.94 | 0.61 | 0.81 |
KSVM-linear | F1 | 1 | 0.86 | 0.37 | 0.83 | 0.94 | 0.99 | 0.92 | 0.52 | 0.97 | 0.96 | 0.72 | 0.83 |
nnet | p | 0.98 | 0.77 | 0.13 | 0.87 | 0.92 | 0.98 | 0.96 | 0.49 | 0.91 | 0.94 | 0.87 | 0.8 |
nnet | r | 0.99 | 0.79 | 0.1 | 0.68 | 0.89 | 0.99 | 0.89 | 0.59 | 0.97 | 0.95 | 0.6 | 0.77 |
nnet | F1 | 0.99 | 0.77 | 0.11 | 0.74 | 0.9 | 0.99 | 0.92 | 0.47 | 0.94 | 0.94 | 0.7 | 0.77 |
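Assuming that p and r in Tab. 3 denote per-class precision and recall, the sketch below shows how those metrics and F1 are conventionally derived from a confusion matrix. The 3-class matrix and the function names are hypothetical. The table entries are rounded, and the source does not say whether they were averaged over resampling runs, so recomputing F1 from the printed p and r need not reproduce the printed F1 to the second decimal.

```python
def per_class_metrics(confusion):
    """Return (precision, recall, F1) for every class.

    confusion[i][j] is the number of messages whose true class is i
    and whose predicted class is j.
    """
    n_classes = len(confusion)
    metrics = []
    for c in range(n_classes):
        true_pos = confusion[c][c]
        predicted = sum(confusion[i][c] for i in range(n_classes))
        actual = sum(confusion[c])
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics

def macro_average(metrics):
    """Unweighted mean of each metric across classes."""
    n = len(metrics)
    return tuple(sum(m[k] for m in metrics) / n for k in range(3))

# Hypothetical confusion matrix for three classes (rows: true, columns: predicted).
confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 1, 9],
]
metrics = per_class_metrics(confusion)
for label, (p, r, f1) in enumerate(metrics, start=1):
    print(f"C{label}: p={p:.2f} r={r:.2f} F1={f1:.2f}")
print("mean:", macro_average(metrics))
```

A class with very few test messages, as the low scores for C3 and C8 may suggest, can pull the unweighted mean well below the overall accuracy.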
Tab. 4 The accuracy index of the classification
Classifier | Accuracy | Kappa | P-value |
---|---|---|---|
RF | 0.95 | 0.93 | 0 |
KNN | 0.86 | 0.82 | 0 |
KSVM-linear | 0.94 | 0.92 | 0 |
nnet | 0.93 | 0.91 | 0 |
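For readers unfamiliar with the Kappa column, here is a minimal sketch of overall accuracy and Cohen's kappa computed from a confusion matrix. The matrix and the function name are hypothetical. The P-value column is not reproduced because the source does not state which test it comes from.

```python
def accuracy_and_kappa(confusion):
    """Overall accuracy and Cohen's kappa.

    confusion[i][j] counts messages of true class i predicted as class j.
    Kappa compares observed agreement with the agreement expected if
    predictions were independent of the true labels.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(n_classes)) / total
    expected = sum(
        sum(confusion[i]) * sum(confusion[r][i] for r in range(n_classes))
        for i in range(n_classes)
    ) / (total * total)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Hypothetical confusion matrix (rows: true class, columns: predicted class).
confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 1, 9],
]
accuracy, kappa = accuracy_and_kappa(confusion)
print(f"accuracy={accuracy:.2f} kappa={kappa:.2f}")
```

Kappa is always at most the accuracy, and the gap widens when a few classes dominate the data, which is why the two columns are reported side by side.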
Fig. 2 The flow chart for the spam message classification model
Fig. 3 The spatial distribution of spam messages
Fig. 4 The temporal distribution of spam messages
Fig. 5 The temporal component of NMF
Fig. 6 The spatial component of NMF
Fig. 7 The proportion of spam messages by type
Fig. 8 The spatial distribution of spam messages by type
Fig. 9 The land use map of Beijing within the Sixth Ring Road
Fig. 10 The spam message statistics by type of land use
Fig. 11 The sending area statistics by type of spam message
Fig. 12 The temporal distribution of spam messages by type
The authors have declared that no competing interests exist.