Journal of Geo-information Science (地球信息科学学报) ›› 2018, Vol. 20 ›› Issue (7): 880-886. doi: 10.12082/dqxxkx.2018.170530

• Theory and Methods of Geo-information Science •

A Chinese Gazetteer Query Method That Considers Character Features

叶鹏1,2, 张雪英1,2,*, 杜咪1,2

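  1. Key Laboratory of Virtual Geographic Environment, Nanjing Normal University, Ministry of Education, Nanjing 210023, China
    2. Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application, Nanjing 210023, China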
  • Received:2017-11-12 Revised:2018-05-04 Online:2018-07-20 Published:2018-07-13
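  • About the author:

    YE Peng (叶鹏, 1991-), male, PhD candidate, mainly engaged in research on spatio-temporal big data mining, remote sensing image processing, and geographic information systems. E-mail: yep730@163.com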

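  • Funding:
    National Natural Science Foundation of China (41671393, 41631177); National Key Research and Development Program of China (2017YFB0503602); Natural Science Funding Project of Jiangsu Higher Education Institutions (15KJA420002); Special Project for Basic Work on Strengthening Police through Science and Technology, Ministry of Public Security (2016GABJC43, 2017GABJC23); Open Project of the Key Laboratory of Police Geographic Information Technology, Ministry of Public Security (2016LPGIT01)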

Query Method of Chinese Gazetteer Based on the Character Features

YE Peng1,2, ZHANG Xueying1,2,*, DU Mi1,2

  1. Key Laboratory of Virtual Geographic Environment, Nanjing Normal University, Ministry of Education, Nanjing 210023, China
    2. Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application, Nanjing 210023, China
  • Received:2017-11-12 Revised:2018-05-04 Online:2018-07-20 Published:2018-07-13
  • Contact: ZHANG Xueying
  • Supported by:
    National Natural Science Foundation of China, No.41671393, 41631177; National Key Research and Development Program of China, No.2017YFB0503602; University Natural Funding Project of Jiangsu Province, No.15KJA420002; Special Project of the Basic Work of Science and Technology Police in the Ministry of Public Security, No.2016GABJC43, 2017GABJC23; Open Project of the Key Laboratory of Police Geographic Information Technology, Ministry of Public Security, No.2016LPGIT01

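Abstract:

Gazetteer query is an important foundation for place-name services such as place-name correction and place-name matching, but the rapid growth in the number of place names poses a severe challenge to query performance. To address the low accuracy and low efficiency of traditional dictionary query methods in large-scale data environments, a Chinese gazetteer query method that considers character features (CGQM) is proposed. First, place names that share characters with the query are retrieved to form a candidate set, and a single-character index is built to speed up the query. Second, the query place name and the candidate place names are compared by number of characters, which further filters the candidate set. Finally, the ranking strategy for query results is optimized using character position features so that the ordering of results is more reasonable. Taking the national gazetteer as an example, five test sets were constructed to compare CGQM with the Lucene retrieval method. The results show that CGQM is of practical value for strengthening gazetteer query functions and improving query efficiency.

Key words: Chinese place name, gazetteer query, single-character gazetteer index, place name similarity, place name character features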

Abstract:

With the rapid development of the mobile Internet and the wide application of location-based service technology across industries, public demand for place information is growing rapidly. Gazetteer query, which supplies place-name knowledge, is a basic building block of location information services. Because the volume of place-name data has increased significantly, the query performance of gazetteers now faces a severe challenge. Most existing gazetteers directly use general-purpose retrieval methods and ignore the character features and descriptive rules of the place names themselves. To solve these problems, a Chinese gazetteer query method (CGQM) based on the character features of place names is proposed. CGQM uses three character features of place names (shared characters, number of characters, and character position) and queries the gazetteer along the main line of "candidate place name query, place name filtering, place name similarity ranking". Firstly, a single-character index of the gazetteer is constructed, and based on this index the place names in the gazetteer that contain the same characters as the query are retrieved to form a candidate dataset. Secondly, place names whose number of characters differs greatly from that of the query place name are filtered out of the candidate dataset; this step improves the accuracy of the candidate dataset and keeps the later sorting step efficient. Thirdly, the remaining candidate place names are sorted by a character position similarity algorithm. Taking the national Chinese gazetteer as an example, an experiment compared CGQM with a full-text query method (Lucene) on 5 test datasets to verify that CGQM can query the gazetteer accurately and efficiently.
The performance evaluation indexes were operating efficiency, precision, recall, and the F value. The experimental results show that CGQM achieves much better query performance than the Lucene-based method. Future research on gazetteer query will consider other factors, such as glyphs and semantics, and will draw on the distributed and multithreading techniques used in retrieval systems. These methods will improve the accuracy and efficiency of gazetteer query and expand public services for place information.

Key words: Chinese place name, gazetteer query, single-character index of Chinese gazetteer, place name similarity, place name character features
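
The three-step pipeline described in the abstract (candidate query through a single-character index, filtering by number of characters, ranking by character position) can be illustrated with a small Python sketch. This is not the authors' implementation: the function names, the `min_shared` and `max_diff` thresholds, and the position score are all assumptions made for illustration, since the abstract does not give the actual formulas or parameters.

```python
from collections import defaultdict


def build_char_index(names):
    """Single-character index: map each character to the names containing it."""
    index = defaultdict(set)
    for name in names:
        for ch in set(name):
            index[ch].add(name)
    return index


def candidates(index, query, min_shared=2):
    """Names sharing at least `min_shared` distinct characters with the query.

    `min_shared` is a hypothetical threshold, not a value from the paper.
    """
    hits = defaultdict(int)
    for ch in set(query):
        for name in index.get(ch, ()):
            hits[name] += 1
    return {name for name, count in hits.items() if count >= min_shared}


def length_filter(query, names, max_diff=2):
    """Drop names whose character count differs from the query by more than
    `max_diff` (a hypothetical threshold)."""
    return {n for n in names if abs(len(n) - len(query)) <= max_diff}


def position_similarity(query, name):
    """Illustrative stand-in for a character position similarity.

    Each query character found in `name` contributes 1 / (1 + offset), where
    offset is the distance to its nearest occurrence in `name`. The sum is
    normalized by the longer length, so identical strings score 1.0.
    """
    total = 0.0
    for i, ch in enumerate(query):
        positions = [j for j, c in enumerate(name) if c == ch]
        if positions:
            total += 1.0 / (1 + min(abs(i - j) for j in positions))
    return total / max(len(query), len(name))


def query_gazetteer(index, query, top_k=5, min_shared=2, max_diff=2):
    """Candidate query, then count filter, then position-based ranking."""
    pool = length_filter(query, candidates(index, query, min_shared), max_diff)
    # Sort by descending similarity; ties are broken by name for determinism.
    ranked = sorted(pool, key=lambda n: (-position_similarity(query, n), n))
    return ranked[:top_k]


if __name__ == "__main__":
    gazetteer = ["南京市", "南京师范大学", "北京市", "南京东路", "京南村"]
    idx = build_char_index(gazetteer)
    print(query_gazetteer(idx, "南京市"))
```

In this toy gazetteer, "南京师范大学" passes the candidate step but is removed by the count filter because it is three characters longer than the query, and "京南村" ranks last because its shared characters sit in different positions. Filtering before ranking keeps the more expensive similarity computation to a small candidate pool, which matches the motivation the abstract gives for the second step.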