地球信息科学学报 ›› 2018, Vol. 20 ›› Issue (7): 871-879. doi: 10.12082/dqxxkx.2018.180032

• 地球信息科学理论与方法 (Theory and Methods of Geo-information Science) •

基于自动回标的地理实体关系语料库构建方法

王姬卜1,2, 陆锋2,3, 吴升1,2, 余丽3,4,*

    1. 福州大学 福建省空间信息工程研究中心,福州 350002
    2. 海西政务大数据应用协同创新中心,福州 350002
    3. 中国科学院地理科学与资源研究所 资源与环境信息系统国家重点实验室,北京 100101
    4. 中国科学院文献情报中心,北京 100190
  • 收稿日期:2018-01-04 修回日期:2018-03-28 出版日期:2018-07-20 发布日期:2018-07-13
  • 通讯作者: 余丽 E-mail:418771916@qq.com;yul@lreis.ac.cn
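  • About the author: WANG Jibu (born 1993), female, from Linfen, Shanxi Province, is a master's student working mainly on geographic information engineering. E-mail: 418771916@qq.com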

  • 基金资助:
    国家自然科学基金重点项目(41631177);数字福建建设项目(闽发改网数字函[2014]191 号、[2016]23 号、[2016]77号);福建省科技创新平台项目(2015H2001)

Constructing the Corpus of Geographical Entity Relations Based on Automatic Annotation

WANG Jibu1,2, LU Feng2,3, WU Sheng1,2, YU Li3,4,*

    1. Spatial Information Research Center of Fujian Province, Fuzhou University, Fuzhou 350002, China
    2. Fujian Collaborative Innovation Center for Big Data Applications in Governments, Fuzhou 350002, China
    3. State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China
    4. National Science Library, Chinese Academy of Sciences, Beijing 100190, China
  • Received:2018-01-04 Revised:2018-03-28 Online:2018-07-20 Published:2018-07-13
  • Contact: YU Li E-mail:418771916@qq.com;yul@lreis.ac.cn
  • Supported by:
    National Natural Science Foundation of China, No.41631177; Digital Fujian Construction Project, No.[2014]191, [2016]23, [2016]77; Fujian Science and Technology Innovation Platform Project, No.2015H2001

摘要:

地理实体关系语料库是地理信息获取与地理知识服务的基础数据资源,其规模直接影响机器学习模型训练的效果。快速更新的网络文本不断涌现新的关系实例,要求语料库及时更新以覆盖更丰富的关系实例。手工构建和更新语料库成本高昂,亟需一种快速构建大规模地理实体关系语料库的方法。本文提出一种基于回标技术的地理实体关系语料库构建方法。首先,参考地理实体分类标准与语义关系、空间关系分类标准,针对地理实体关系的自然语言描述习惯,建立地理实体关系的标注体系;然后,结合精确匹配与模糊匹配策略,提高客体匹配的覆盖率;接着,基于优序图法建立句子打分规则,实现种子三元组到句子映射的定量评价;最后,使用中文百度百科文本验证方法的有效性。实验结果显示,本文方法平均回标成功率为67.83%,关系标注的准确率为76.36%。相比人工构建空间关系标注语料库的过程,本文提出的语料自动构建方法,标注速度快,规模大,为自动扩充标注语料库提出了可行方案。同时,该方法兼顾了地理实体间的语义关系和空间关系,且关系类型不受限,可用于开放式关系抽取任务。

关键词: 地理实体关系, 语料库构建, 自动回标, 地理信息抽取, 标注体系

Abstract:

A corpus of geographical entity relations is a basic data resource for geographical information acquisition and geographical knowledge services, and its scale directly affects how well machine learning models can be trained. Rapidly updated web text continually produces new relation instances, so the corpus must be updated promptly to cover them. Constructing and updating a corpus manually is expensive, so a method for rapidly building a large-scale corpus of geographical entity relations is needed. In this paper, we propose such a method based on automatic annotation, in which seed triples are labeled back onto sentences. First, drawing on encyclopedia resources and referring to the classification standards for geographical entities, semantic relations, and spatial relations, we establish an annotation scheme for geographical entity relations that considers both the way such relations are described in natural language and annotation normalization. Second, we combine exact matching with approximate matching to improve the coverage of object entity matching. Third, we define sentence scoring rules using the optimal sequence diagram method, so that the mapping from seed triples to sentences can be evaluated quantitatively. Finally, we verify the effectiveness of the method through a series of experiments on Chinese Baidu Baike text. The results show that the average success rate of automatic annotation is 67.83%, and the average accuracy of the annotated relations is 76.36%. Compared with manually annotating a corpus of spatial relations, the proposed method annotates faster and at a larger scale, and provides a feasible scheme for automatically expanding geographical entity relation corpora.
Experiments on the self-built corpus with an LSTM (Long Short-Term Memory) network show that the accuracy of geographical relation extraction from web text is 73.2%, and the accuracy on the related corpora is 75.2%, which indicates that the corpus of geographical entity relations is usable. The method also covers both the semantic and the spatial relations between geographical entities, and because the relation types are not restricted, it can be applied to open relation extraction tasks.
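The matching and scoring steps summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the scoring features, the pairwise comparison values, or the similarity measure, so everything below is an assumption made for illustration. The sketch assumes that the optimal sequence diagram method (often rendered in English as the precedence chart method) takes its common form, where each criterion is compared pairwise with every other (1 = more important, 0.5 = equal, 0 = less important) and a criterion's weight is its row sum divided by the total. The three features, the 0.6 threshold, the use of `difflib` for approximate matching, and the example sentences are all hypothetical.

```python
from difflib import SequenceMatcher


def precedence_weights(chart):
    """Convert a pairwise comparison chart into normalized weights.

    chart[i][j] is 1 if criterion i outranks j, 0.5 if equal, 0 otherwise.
    """
    row_sums = [sum(row) for row in chart]
    total = sum(row_sums)
    return [s / total for s in row_sums]


def find_object(obj, sentence, threshold=0.6):
    """Locate the object entity: exact substring match first, then fuzzy.

    Returns (matched_text, similarity), or (None, 0.0) if nothing passes
    the (assumed) similarity threshold.
    """
    if obj in sentence:
        return obj, 1.0
    best_text, best_ratio = None, 0.0
    n = len(obj)
    # Slide character windows of length n-1, n and n+1 over the sentence.
    for size in range(max(1, n - 1), n + 2):
        for start in range(len(sentence) - size + 1):
            window = sentence[start:start + size]
            ratio = SequenceMatcher(None, obj, window).ratio()
            if ratio > best_ratio:
                best_text, best_ratio = window, ratio
    if best_ratio >= threshold:
        return best_text, best_ratio
    return None, 0.0


def score_sentence(triple, sentence, weights):
    """Weighted score of how well a sentence expresses a seed triple.

    Hypothetical features: subject present, object similarity,
    relation word present.
    """
    subject, relation, obj = triple
    _, obj_similarity = find_object(obj, sentence)
    features = [
        1.0 if subject in sentence else 0.0,
        obj_similarity,
        1.0 if relation in sentence else 0.0,
    ]
    return sum(w * f for w, f in zip(weights, features))


# Assumed ranking of the criteria: subject > object > relation word.
chart = [
    [0.5, 1.0, 1.0],
    [0.0, 0.5, 1.0],
    [0.0, 0.0, 0.5],
]
weights = precedence_weights(chart)

# Hypothetical seed triple and candidate sentences.
triple = ("长江", "流经", "武汉市")
sentences = ["长江流经武汉。", "武汉是湖北省省会。"]
best = max(sentences, key=lambda s: score_sentence(triple, s, weights))
print(best)
```

Trying the exact match first keeps the common case cheap and unambiguous, while the fuzzy fallback recovers abbreviated or variant names such as a city written without its administrative suffix. A real pipeline would likely replace the character-window search with a gazetteer or alias lookup.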

Key words: geographical relations, corpus construction, automatic annotation, geographical information extraction, annotation scheme