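Query Method of Chinese Gazetteer Based on the Character Features
About the author: YE Peng (1991-), male, PhD candidate, whose research focuses on spatio-temporal big data mining, remote sensing image processing, and geographic information systems. E-mail: yep730@163.com
Received date: 2017-11-12
Revision requested date: 2018-05-04
Online publication date: 2018-07-13
Funding
National Natural Science Foundation of China (41671393, 41631177); National Key Research and Development Program of China (2017YFB0503602); Natural Science Research Project of Jiangsu Higher Education Institutions (15KJA420002); Special Project for Basic Work on Strengthening Police with Science and Technology, Ministry of Public Security (2016GABJC43, 2017GABJC23); Open Project of the Key Laboratory of Police Geographic Information Technology, Ministry of Public Security (2016LPGIT01)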
Query Method of Chinese Gazetteer Based on the Character Features
Received date: 2017-11-12
Revision requested date: 2018-05-04
Online published: 2018-07-13
Supported by
National Natural Science Foundation of China, No.41671393, 41631177; National Key Research and Development Program of China, No.2017YFB0503602; Natural Science Research Project of Jiangsu Higher Education Institutions, No.15KJA420002; Special Project for Basic Work on Strengthening Police with Science and Technology, Ministry of Public Security, No.2016GABJC43, 2017GABJC23; Open Project of the Key Laboratory of Police Geographic Information Technology, Ministry of Public Security, No.2016LPGIT01
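Gazetteer query is an important foundation for place-name services such as place-name correction and place-name matching, but the rapid growth in the number of place names poses a severe challenge to query performance. To address the low accuracy and low efficiency of traditional gazetteer query methods in large-scale data environments, a Chinese gazetteer query method that takes character features into account (CGQM) is proposed. First, place names that share characters with the query are retrieved to form a candidate set, and a single-character index is built to speed up this step. Second, the query place name and the candidate place names are compared by the character-count feature to filter the candidate set further. Finally, the ranking strategy is optimized with the character-position feature so that the results are ordered more reasonably. Taking the national gazetteer as an example, the experiment builds 5 test sets to compare CGQM with the Lucene retrieval method. The results show that CGQM is of practical value for enhancing gazetteer query functionality and improving query efficiency.
YE Peng, ZHANG Xueying, DU Mi. Query Method of Chinese Gazetteer Based on the Character Features[J]. Journal of Geo-information Science, 2018, 20(7): 880-886. DOI: 10.12082/dqxxkx.2018.170530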
With the rapid development of the mobile Internet and the wide application of location-based services across industries, public demand for place information is growing rapidly. Gazetteer query, which supplies place-name knowledge, is a basic component of location information services. Because the volume of place-name data has increased significantly, the query performance of gazetteers now faces a severe challenge. Most existing gazetteers directly use general-purpose retrieval methods and ignore the character features and description rules of place names themselves. To solve these problems, a Chinese gazetteer query method (CGQM) based on the character features of place names is proposed. CGQM uses three character features of place names (shared characters, number of characters, and character position) and queries the gazetteer in three stages: candidate place name query, place name filtering, and place name similarity ranking. First, a single-character index of the gazetteer is constructed, and place names in the gazetteer that contain the same characters as the query are retrieved through this index to form a candidate dataset. Second, place names whose number of characters differs greatly from that of the query place name are filtered out of the candidate dataset; this step improves the accuracy of the candidate dataset and keeps the later sorting step efficient. Third, the remaining candidate place names are sorted by a character-position similarity algorithm. Taking the national Chinese gazetteer as an example, an experiment was carried out with CGQM and a full-text query method (Lucene) on 5 test datasets to verify that CGQM can query the gazetteer accurately and efficiently.
The performance evaluation indexes are operating efficiency, precision, recall, and the F value. The experimental results show that CGQM achieves better query performance than the Lucene-based method. Future research on gazetteer query will consider other factors, such as glyphs and semantics, and will draw on the distributed and multithreading techniques used in retrieval systems. These methods should improve the accuracy and efficiency of gazetteer query and expand public services for place information.
Fig. 1 The technical framework of the Chinese gazetteer query method
Fig. 2 Organization of the Chinese gazetteer index based on single characters
Fig. 3 Query mode of the Chinese gazetteer index based on single characters
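
The single-character index named in Figs. 2 and 3 can be pictured as an inverted index from each character to the place names that contain it. The following is only a minimal sketch under that assumption; the function names `build_char_index` and `candidate_ids` and the four-entry gazetteer are hypothetical, and the actual index layout in the figures may differ.

```python
from collections import defaultdict


def build_char_index(gazetteer):
    """Map each character to the set of ids of place names containing it."""
    index = defaultdict(set)
    for name_id, name in enumerate(gazetteer):
        for ch in set(name):
            index[ch].add(name_id)
    return index


def candidate_ids(index, query):
    """Return ids of all names sharing at least one character with the query.

    Assumption: candidates are the union of the posting sets of the query's
    characters, so a name matching a single character is already a candidate.
    """
    ids = set()
    for ch in set(query):
        ids |= index.get(ch, set())
    return ids


# Toy gazetteer built from target names that appear in Tab. 1.
gazetteer = ["合肥南站", "南京市", "外坡", "响滩村"]
index = build_char_index(gazetteer)
candidates = {gazetteer[i] for i in candidate_ids(index, "合肥南")}
print(candidates)
```

Taking the union keeps recall high for queries with wrong or missing characters, at the cost of a large candidate set that later stages have to shrink.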
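Tab. 1 Common inaccurate forms of description in place-name queries
Type | Subtype | Query place name | Target place name
---|---|---|---
Substituted character | | 候家宅子村 | 侯家宅子村
Substituted character | | 晌滩村 | 响滩村
Missing character | | 采石南路南 | 采石南路南口
Missing character | | 合肥南 | 合肥南站
Added character | Extra space | 南京 市 | 南京市
Added character | Special symbol | 凉水-井湾 | 凉水井湾
Added character | Separated radical | 夕卜坡 | 外坡
Swapped characters | | 北新桥路口南 | 北新桥南路口
Swapped characters | | 塔什库尔干塔吉克县 | 塔库什尔干塔吉克县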
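
Two of the "added character" cases in Tab. 1 (extra spaces and special symbols) can be undone before the gazetteer is queried. The sketch below is an illustration only: the helper `strip_inserted_symbols` is hypothetical, the source does not say whether CGQM performs such a normalization step, and the separated-radical case is not handled because it would need a glyph mapping table.

```python
def strip_inserted_symbols(query):
    """Drop whitespace and punctuation, keeping letters and digits.

    str.isalnum() is True for CJK characters, so Chinese text is preserved.
    """
    return "".join(ch for ch in query if ch.isalnum())


print(strip_inserted_symbols("南京 市"))
print(strip_inserted_symbols("凉水-井湾"))
```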
Tab. 2 Samples of test datasets
Level | Accuracy of each place name in the test set | Number of place names | Test set examples | Corresponding target place names
---|---|---|---|---
Test set 1 | [90%, 100%] | 133 | 南京明文化村阳山碑村 | 南京明文化村 阳山碑村
Test set 2 | [80%, 90%) | 377 | 候家石良村,勒图音敖包 | 侯家石良村,勒图音敖包
Test set 3 | [70%, 80%) | 389 | 大新冊村,豆家吕村 | 大新册村,豆家营村
Test set 4 | [60%, 70%) | 665 | 樁木槽,达强 | 椿木槽,达强弄
Test set 5 | [50%, 60%) | 136 | 橫山,痳冲 | 横山,麻冲
Tab. 3 Statistics of experimental results
Test set | Number of place names | CGQM P/% | CGQM R/% | CGQM F | CGQM average efficiency/ms | Lucene P/% | Lucene R/% | Lucene F | Lucene average efficiency/ms
---|---|---|---|---|---|---|---|---|---
1 | 133 | 95.49 | 100.00 | 98.08 | 409 | 93.23 | 98.50 | 95.79 | 576
2 | 377 | 91.78 | 94.43 | 93.09 | 335 | 90.45 | 91.51 | 90.98 | 537
3 | 389 | 82.26 | 88.95 | 85.47 | 437 | 79.18 | 84.83 | 81.91 | 548
4 | 665 | 72.03 | 80.00 | 75.81 | 388 | 69.02 | 76.09 | 72.38 | 513
5 | 136 | 53.97 | 73.53 | 62.25 | 186 | 50.74 | 68.38 | 58.25 | 562
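
The F column in Tab. 3 can be related to the P and R columns with a short calculation. This sketch assumes F is the usual harmonic mean of precision and recall, F = 2PR / (P + R); the source does not state the formula, and `f_measure` is a hypothetical helper name. Under that assumption the CGQM values for test sets 2 and 5 are reproduced to two decimals.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# CGQM rows for test sets 2 and 5 in Tab. 3.
f_set2 = f_measure(91.78, 94.43)
f_set5 = f_measure(53.97, 73.53)
print(round(f_set2, 2), round(f_set5, 2))
```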
Tab. 4 Details of the query process for part of the experimental data
Query place name | Test set | Initial result set (partial examples) | Filtered result set (partial examples) | Ranked query results (partial examples) | Target place name
---|---|---|---|---|---
努木其音乌 | Test set 2 | 力努;努松;桥努;…;株木塘;木底塘;木山冲;…;佳木斯我的家生态健康社区;米欠扎木阿吉坎儿孜买里斯;树木岭民营工业园基地三门;… (50 430 in total) | 哈达音努如;努木乃淖日;努和廷沙图;…;额尔格勒音努如;沙巴日努很超浩;居努斯阔克铁木;… (22 101 in total) | 努木其音乌兰 | 努木其音乌兰
兩山村 | Test set 4 | 雨道;雨潭;山岗;…;社山后;开化山;山马岭;…;石家庄华南新村;浚县王升屯新村;平成日式度假村;… (313 867 in total) | 雨花冲;梧桐雨;雨冲子;雨水冲;…;青山程家;落雁山村;东畈横山;… (265 970 in total) | 村山村;山村;陈山村;东山村;三山村;阳山村;檀山村;嶂山村;横山村;兴山村 | 雨山村
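
The filtering and ranking stages traced in Tab. 4 can be sketched as follows. This is not the paper's algorithm: the length threshold `max_len_diff` and the scoring formula in `position_similarity` are assumptions chosen for illustration, since the source gives neither the filter threshold nor the similarity formula.

```python
def filter_by_length(query, candidates, max_len_diff=2):
    """Keep candidates whose character count is close to the query's.

    max_len_diff is an assumed threshold, not a value from the source.
    """
    return [c for c in candidates if abs(len(c) - len(query)) <= max_len_diff]


def position_similarity(query, candidate):
    """Hypothetical score in [0, 1]: shared characters count more when
    they sit at nearby positions in the two names."""
    longest = max(len(query), len(candidate))
    if longest == 0:
        return 0.0
    total = 0.0
    for i, ch in enumerate(query):
        positions = [j for j, c in enumerate(candidate) if c == ch]
        if positions:
            nearest = min(abs(i - j) for j in positions)
            total += 1.0 - nearest / longest
    return total / longest


def rank(query, candidates, max_len_diff=2):
    """Filter by character count, then sort by descending similarity."""
    kept = filter_by_length(query, candidates, max_len_diff)
    return sorted(kept, key=lambda c: position_similarity(query, c), reverse=True)


# Candidates taken from the first row of Tab. 4.
initial = ["力努", "努木乃淖日", "努木其音乌兰", "佳木斯我的家生态健康社区"]
ranked = rank("努木其音乌", initial)
print(ranked)
```

With these assumed settings the very short and very long names are dropped by the length filter, and the target 努木其音乌兰 ranks first because all five query characters occur at the same positions.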
The authors have declared that no competing interests exist.