

Query Method of Chinese Gazetteer Based on the Character Features

  • YE Peng 1, 2
  • ZHANG Xueying 1, 2, *
  • DU Mi 1, 2
  • 1. Key Laboratory of Virtual Geographic Environment, Nanjing Normal University, Ministry of Education, Nanjing 210023, China
  • 2. Jiangsu Center for Collaborative Innovation in Geographical Information Resource Development and Application, Nanjing 210023, China
*Corresponding author: ZHANG Xueying (1970-), female, PhD, professor. Her research covers spatio-temporal big data mining, spatial location services, and geographic information systems. E-mail: zhangsnowy@163.com

Received date: 2017-11-12

  Revision requested date: 2018-05-04

  Online published: 2018-07-13

Supported by

National Natural Science Foundation of China, No. 41671393, 41631177; National Key Research and Development Program of China, No. 2017YFB0503602; University Natural Funding Project of Jiangsu Province, No. 15KJA420002; Special Project of the Basic Work of Science and Technology Police in the Ministry of Public Security, No. 2016GABJC43, 2017GABJC23; Open Project of the Key Laboratory of Police Geographic Information Technology, Ministry of Public Security, No. 2016LPGIT01


All rights reserved by the Editorial Office of Journal of Geo-information Science




YE Peng, ZHANG Xueying, DU Mi. Query method of Chinese gazetteer based on the character features[J]. Journal of Geo-information Science, 2018, 20(7): 880-886. DOI: 10.12082/dqxxkx.2018.170530


With the rapid development of the mobile Internet and the wide application of location-based service technology across industries, public demand for place information is growing rapidly. Gazetteer query, which supplies place-name knowledge, is a basic link in location information services. As the volume of place-name data has grown significantly, the query performance of gazetteers faces a severe challenge. Most existing gazetteers directly use general retrieval methods and ignore the character features and description rules of place names themselves. To address these problems, a Chinese gazetteer query method (CGQM) based on the character features of place names is proposed. CGQM uses three character features, namely shared characters, character count, and character position, and queries the gazetteer in three stages: candidate place name query, place name filtering, and place name similarity ranking. First, a single-character index of the gazetteer is built, and place names in the gazetteer that share characters with the query are retrieved through this index to form a candidate dataset. Second, place names whose character count differs greatly from that of the query place name are filtered out of the candidate dataset; this step improves the accuracy of the candidate dataset and keeps the later ranking step efficient. Third, the remaining candidates are ranked by a character-position similarity algorithm. Taking the national Chinese gazetteer as an example, an experiment was run with CGQM and a full-text query method (Lucene) on five test datasets to verify that CGQM can query the gazetteer accurately and efficiently. The performance indicators were running efficiency, precision, recall, and the F value. The results show that CGQM achieved better query performance than the Lucene-based method on all five test datasets. Future research on gazetteer query will also consider other factors such as glyphs and semantics, and will draw on distributed and multithreading techniques from retrieval systems. These methods should improve the accuracy and efficiency of gazetteer query and expand public services for place information.

1 Introduction


2 Basic Idea

Fig. 1 Technical framework of the Chinese gazetteer query method

3 Candidate Place Name Query Based on Shared Characters

3.1 Construction of the Single-Character Index

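The single-character index for Chinese place names consists of two parts: a dictionary file and an index file. The dictionary file stores all place-name data in the gazetteer, arranged one after another with no line breaks and no separators, so that they form one continuous string. The index file is the physical file that stores the index records, which hold the correspondence between each record and the place-name entries in the dictionary file. An index record contains three pieces of information: the number of place names, the character encoding, and the dictionary positions. Suppose the dictionary file contains n distinct Chinese characters W_i, i∈[1, n]. C_i denotes the UTF-8 encoding of W_i, N_i is the number of place names in the dictionary file that contain W_i, and the start and end positions of each place name are denoted S_nm and E_nm respectively. The storage positions of the place names in the dictionary file are then written as the sequence <S_n1, E_n1, S_n2, E_n2, …, S_nm, E_nm>. Take the place name "中岗子" as an example. "中岗子" is stored in the dictionary file, and S_nm (the position of "中" in the string, 1001) and E_nm (the position of "子" in the string, 1003) are recorded. Three index records, for "中", "岗", and "子", are then generated in the index file. The record for "中" is [11079][0xE4B8AD][1001,1003,1015,1017,…,83475,83478]. It records the character encoding (0xE4B8AD), the number of place names in the dictionary file that contain "中" (11079), and their storage positions. These include the position of "中岗子" (1001, 1003) as well as the positions of other place names containing "中", such as "中夹滩" and "姜尾林中", at (1015, 1017) and (83475, 83478) (Fig. 2).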
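The index layout described above can be illustrated with a small in-memory sketch. It is not the authors' implementation (the paper states only that the system was written in Java and uses physical files); the function names `build_index`, `index_record`, and `candidates` are hypothetical, positions are counted in characters starting from 1 as in the "中岗子" example, and the candidate lookup assumes that a candidate is any place name sharing at least one character with the query.

```python
from collections import defaultdict


def build_index(place_names, base=1):
    """Concatenate names into one dictionary string and index every character.

    Returns the dictionary string and a mapping from each character to the
    (start, end) spans, 1-based and inclusive, of the names that contain it.
    """
    dictionary = "".join(place_names)  # no line breaks, no separators
    index = defaultdict(list)
    pos = base
    for name in place_names:
        start, end = pos, pos + len(name) - 1
        for ch in set(name):  # one span per name, even if ch repeats
            index[ch].append((start, end))
        pos = end + 1
    return dictionary, index


def index_record(index, ch):
    """Return (N_i, C_i, positions) for one character, mirroring a record."""
    spans = index.get(ch, [])
    return len(spans), ch.encode("utf-8").hex().upper(), spans


def candidates(dictionary, index, query, base=1):
    """Collect every place name that shares at least one character with query."""
    spans = set()
    for ch in set(query):
        spans.update(index.get(ch, []))
    return {dictionary[s - base:e - base + 1] for s, e in spans}


names = ["中岗子", "中夹滩", "姜尾林中", "响滩村"]
dictionary, index = build_index(names)
print(index_record(index, "中"))
print(candidates(dictionary, index, "晌滩村"))
```

Storing only start and end offsets keeps each record compact, and a misspelled query such as "晌滩村" still reaches "响滩村" through the characters it shares with the target.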
Fig. 2 Organization of the Chinese gazetteer index based on single characters


3.2 Candidate Place Name Query

Fig. 3 Query mode of the Chinese gazetteer index based on single characters

4 Place Name Filtering Based on Character Count

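Tab. 1 Common forms of inaccurate description in place name queries

| Type | Subtype | Query place name | Target place name |
|---|---|---|---|
| Substituted character | | 候家宅子村 | 侯家宅子村 |
| | | 晌滩村 | 响滩村 |
| Missing character | | 采石南路南 | 采石南路南口 |
| | | 合肥南 | 合肥南站 |
| Added character | Extra space | 南京 市 | 南京市 |
| | Special symbol | 凉水-井湾 | 凉水井湾 |
| | Split radical | 夕卜坡 | 外坡 |
| Swapped characters | | 北新桥路口南 | 北新桥南路口 |
| | | 塔什库尔干塔吉克县 | 塔库什尔干塔吉克县 |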

5 Place Name Similarity Ranking Based on Character Position

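$$
\mathrm{sim}(P, W) = \alpha \times \frac{1}{2}\left(\frac{c}{m} + \frac{c}{n}\right) + \beta \times \min\left(\frac{m}{n}, \frac{n}{m}\right) \times \frac{1}{2}\left(\frac{\sum_{i=1}^{c} L_1(i)}{\sum_{t=1}^{m} t} + \frac{\sum_{i=1}^{c} L_2(i)}{\sum_{k=1}^{n} k}\right) \tag{1}
$$

$$
\mathrm{sim}(P, W) = 0.6 \times \left(\frac{4}{4} + \frac{4}{6}\right) \times \frac{1}{2} + 0.4 \times \min\left(\frac{4}{6}, \frac{6}{4}\right) \times \frac{1}{2} \times \left(\frac{1+2+3+4}{1+2+3+4} + \frac{3+4+5+6}{1+2+3+4+5+6}\right) \approx 0.75 \tag{2}
$$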
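Equation (1) can be checked numerically against the worked example in Eq. (2). The sketch below is an illustration under stated assumptions: the symbol meanings are inferred from the worked example (m and n are the character counts of P and W, c is the number of shared characters, and L1(i), L2(i) are the 1-based positions of the i-th shared character in P and W), the weights 0.6 and 0.4 are the values used in Eq. (2), and the function name `position_similarity` is hypothetical. How shared characters are matched between the two names is not covered here, so the positions are passed in directly.

```python
def position_similarity(m, n, pos_p, pos_w, alpha=0.6, beta=0.4):
    """Evaluate Eq. (1) from name lengths and shared-character positions.

    m, n   -- character counts of the query P and the candidate W (assumed)
    pos_p  -- 1-based positions of the shared characters in P, i.e. L1(i)
    pos_w  -- 1-based positions of the shared characters in W, i.e. L2(i)
    """
    if len(pos_p) != len(pos_w):
        raise ValueError("both position lists must describe the same c characters")
    c = len(pos_p)
    if c == 0:
        return 0.0

    def position_total(length):
        # 1 + 2 + ... + length, the denominator of each position term
        return length * (length + 1) / 2

    count_term = 0.5 * (c / m + c / n)
    position_term = 0.5 * (sum(pos_p) / position_total(m)
                           + sum(pos_w) / position_total(n))
    return alpha * count_term + beta * min(m / n, n / m) * position_term


# Worked example of Eq. (2): m = 4, n = 6, four shared characters.
score = position_similarity(4, 6, [1, 2, 3, 4], [3, 4, 5, 6])
print(round(score, 2))
```

With these inputs the count term contributes 0.5 and the position term about 0.248, which rounds to the 0.75 given in Eq. (2).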

6 Experimental Evaluation and Analysis

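$$
A(p, w) = \frac{c}{n} \tag{3}
$$

where c is the number of characters in the query place name p that are correct with respect to the target place name w, and n is the number of characters in p. The open-source full-text search engine Lucene has been widely studied and applied in text classification, information retrieval, and related areas [17]. Because a gazetteer is an unstructured text file, the Lucene indexing mechanism can be applied to it. This paper therefore selects the Lucene retrieval method for a comparative experiment with CGQM. The query performance indicators are running efficiency, precision (P), recall (R), and the F value. Running efficiency is the time taken by a single place-name query. P, R, and F are computed as in Eqs. (4)-(6), where n_ij is the number of items common to target place name i and query result j, n_i is the number of target place names i, n_j is the number of query results j returned by the model, and F(i, j) is the F measure between i and j. In this experiment, the place-name filtering threshold k is set to 30% of the character count of the longer of the query and target place names. Candidate place names with a similarity greater than 60% are taken as the query results, sorted by similarity. The test machine had an Intel Core i7-7700HQ processor at 2.8 GHz, 16 GB of memory, and the Windows 10 operating system; the development language was Java.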
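The filtering threshold used in this experiment (k equal to 30% of the character count of the longer name) might be applied as in the sketch below. Two points are assumptions rather than statements from the paper: a candidate is kept when the absolute difference in character count is at most k, and k is rounded up to a whole number of characters. The function names are hypothetical.

```python
def length_threshold(query, candidate, percent=30):
    """Threshold k: `percent` of the longer name's character count, rounded up.

    Integer arithmetic avoids floating-point surprises in the rounding.
    Rounding up is an assumption; the paper gives only the 30% figure.
    """
    longer = max(len(query), len(candidate))
    return (longer * percent + 99) // 100


def filter_by_length(query, candidates, percent=30):
    """Keep candidates whose character count is within k of the query's."""
    return [
        name for name in candidates
        if abs(len(name) - len(query)) <= length_threshold(query, name, percent)
    ]


kept = filter_by_length("兩山村", ["山村", "东山村", "石家庄华南新村", "雨山村"])
print(kept)
```

Under these assumptions a seven-character name is dropped for a three-character query, while names that differ by a single character survive to the ranking step.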
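Tab. 2 Division of the test datasets, with samples

| Level | Accuracy of a single place name | Number of place names | Sample query place names | Corresponding target place names |
|---|---|---|---|---|
| Test set 1 | [90%, 100%] | 133 | 南京明文化村阳山碑村 | 南京明文化村 阳山碑村 |
| Test set 2 | [80%, 90%) | 377 | 候家石良村,勒图音敖包 | 侯家石良村,勒图音敖包 |
| Test set 3 | [70%, 80%) | 389 | 大新冊村,豆家吕村 | 大新册村,豆家营村 |
| Test set 4 | [60%, 70%) | 665 | 樁木槽,达强 | 椿木槽,达强弄 |
| Test set 5 | [50%, 60%) | 136 | 橫山,痳冲 | 横山,麻冲 |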
$$
P(i, j) = \frac{n_{ij}}{n_j} \tag{4}
$$

$$
R(i, j) = \frac{n_{ij}}{n_i} \tag{5}
$$

$$
F(i, j) = \frac{2 \times P(i, j) \times R(i, j)}{P(i, j) + R(i, j)} \tag{6}
$$
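Equations (4)-(6) translate directly into a few lines of code. The sketch below is illustrative; the function names are hypothetical, and returning 0 when a denominator is zero is a convention chosen here, not something the paper specifies.

```python
def f_measure(p, r):
    """Eq. (6): harmonic mean of precision and recall (0 if both are 0)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def precision_recall_f(n_ij, n_i, n_j):
    """Eqs. (4)-(6) from raw counts.

    n_ij -- items common to the targets i and the query results j
    n_i  -- number of target place names
    n_j  -- number of query results returned
    """
    p = n_ij / n_j if n_j else 0.0
    r = n_ij / n_i if n_i else 0.0
    return p, r, f_measure(p, r)


# Hypothetical counts: 8 correct results out of 16 returned, 10 targets.
print(precision_recall_f(8, 10, 16))

# Consistency check with percentages reported for CGQM on test set 2.
print(round(f_measure(91.78, 94.43), 2))
```

Because Eq. (6) is scale-invariant, it gives the same result whether P and R are expressed as fractions or as percentages, which makes it easy to recompute F from a reported P and R.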
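Tab. 3 Statistics of the experimental results

| Test set | Number of place names | CGQM P/% | CGQM R/% | CGQM F/% | CGQM mean query time/ms | Lucene P/% | Lucene R/% | Lucene F/% | Lucene mean query time/ms |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 133 | 95.49 | 100.00 | 98.08 | 409 | 93.23 | 98.50 | 95.79 | 576 |
| 2 | 377 | 91.78 | 94.43 | 93.09 | 335 | 90.45 | 91.51 | 90.98 | 537 |
| 3 | 389 | 82.26 | 88.95 | 85.47 | 437 | 79.18 | 84.83 | 81.91 | 548 |
| 4 | 665 | 72.03 | 80.00 | 75.81 | 388 | 69.02 | 76.09 | 72.38 | 513 |
| 5 | 136 | 53.97 | 73.53 | 62.25 | 186 | 50.74 | 68.38 | 58.25 | 562 |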
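Tab. 4 Details of the query process for part of the experimental data

| Query place name | Test set | Preliminary result set | Filtered result set | Ranked results | Target place name |
|---|---|---|---|---|---|
| 努木其音乌 | Test set 2 | 力努;努松;桥努;…;株木塘;木底塘;木山冲;…;佳木斯我的家生态健康社区;米欠扎木阿吉坎儿孜买里斯;树木岭民营工业园基地三门;… (50,430 in total) | (22,101 in total) | 努木其音乌兰 | 努木其音乌兰 |
| 兩山村 | Test set 4 | 雨道;雨潭;山岗;…;社山后;开化山;山马岭;…;石家庄华南新村;浚县王升屯新村;平成日式度假村;… (313,867 in total) | (265,970 in total) | 村山村;山村;陈山村;东山村;三山村;阳山村;檀山村;嶂山村;横山村;兴山村 | 雨山村 |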

7 Conclusions


The authors have declared that no competing interests exist.



Zhang W Y, Zhou S Y, Tan G X. Place name database quick searching system based on Lucene[J]. Application Research of Computers, 2017, 34(6): 1756-1761.

Delmastro F, Arnaboldi V, Conti M. People-centric computing and communications in smart cities[J]. IEEE Communications Magazine, 2016, 54(7): 122-128.

Li D Y, Fang J J, Xu D L. Collaborative research and implementation of multi-sector address business based on GIS technology[J]. Bulletin of Surveying and Mapping, 2016, 62(10): 121-124.

Zhang X Y, Lv G N, Du M, et al. Acquisition and application on geographical names information based on large data driving[J]. Modern Surveying and Mapping, 2017, 40(2): 1-5.

Xu P L, Wang Y, Huang Y K, et al. Chinese place-name address matching method based on large data analysis and bayesian decision[J]. Computer Science, 2017, 44(9): 266-271.

Dong J Y, Ma M Y, Chen Y, et al. Research on multi-level geographical names and addresses service based on object relational database[J]. Geomatics World, 2017, 24(4): 92-95, 100.

Wang X K, Li Z, Jian Y L, et al. Machine translation dictionary based on Hash method[J]. Journal of Dalian University of Technology, 1996(3): 108-111.

Sun M S, Zuo Z P, Huang C N. An experimental study on dictionary mechanism for Chinese word segmentation[J]. Journal of Chinese Information Processing, 2000(1): 1-6.

Liang N Y. The modern printed Chinese distinguishing word system[J]. Journal of Chinese Information Processing, 1987(2): 44-52.

Li Q H, Chen Y J, Sun J G. A new dictionary mechanism for Chinese word segmentation[J]. Journal of Chinese Information Processing, 2003(4): 13-18.

Li J B, Zhou Q, Chen Z S. A study on fast algorithm for Chinese dictionary lookup[J]. Journal of Chinese Information Processing, 2006(5): 31-39.

Wu P F, Ma F J, Li W G, et al. Localization of the open source full-text retrieval engine based on Lucene[J]. Data Analysis and Knowledge Discovery, 2009(4): 19-22.

Li S X. Research on ontology of place and its applications in geospatial data organization[D]. Zhengzhou: PLA Information Engineering University, 2009.

Hu Y Y. Analysis of indexing and retrieval techniques for single Chinese characters[J]. Information Studies: Theory & Application, 1999, 36(2): 74-77.

Song M L. The principle of literal similarity of Chinese words and the dynamic maintenance of post controlled vocabulary[J]. Journal of the China Society for Scientific and Technical Information, 1996, 15(4): 22-32.

Zhang X Y, Lv G N. Approach to automatic conversion of geographic information classification schemes[J]. Journal of Remote Sensing, 2008, 23(3): 433-441.

Hirsch L, Hirsch R, Saeedi M. Evolving Lucene search queries for text classification[C]. Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation. New York: ACM Press, 2007: 1604-1611.

Milosavljevic B, Boberic D, Surla D. Retrieval of bibliographic records using Apache Lucene[J]. Electronic Library, 2010, 28(4): 525-539.


