地球信息科学学报 (Journal of Geo-information Science), 2014, Vol. 16, Issue 5: 681-690. doi: 10.3724/SP.J.1047.2014.00681


Research on Spatial Relation Identification Based on Semantic Knowledge

袁烨城1, 刘海江2,*, 裴韬1, 高锡章1

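  1. State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101
    2. China National Environmental Monitoring Center, Beijing 100012
  • Received: 2014-01-16 Revised: 2014-05-14 Online: 2014-09-10 Published: 2014-09-04
  • Corresponding author: 刘海江 (LIU Haijiang) E-mail: yuanyc@lreis.ac.cn; liuhj@cnemc.cn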
  • About the author:

    YUAN Yecheng (袁烨城, 1983-), male, from Shengzhou, Zhejiang; PhD; mainly engaged in research on GIS and web spatial data mining. E-mail: yuanyc@lreis.ac.cn

  • Funding:
    National "863" Program of China (2012AA12A403)

Spatial Relation Extraction from Chinese Characterized Documents Based on Semantic Knowledge

YUAN Yecheng1, LIU Haijiang2,*, PEI Tao1, GAO Xizhang1

  1. State Key Laboratory of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, CAS, Beijing 100101, China
    2. China National Environmental Monitoring Center, Beijing 100012, China
  • Received: 2014-01-16 Revised: 2014-05-14 Online: 2014-09-10 Published: 2014-09-04
  • Contact: LIU Haijiang E-mail: yuanyc@lreis.ac.cn; liuhj@cnemc.cn
  • About author:

    The author: YUAN Yecheng, E-mail: yuanyc@lreis.ac.cn

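Abstract (translation of the Chinese abstract):

Identifying spatial relations in natural-language text (news reports, blogs, forums, social networks, etc.) is one of the important means of acquiring spatial information in the era of big data. Existing methods consider only character and word features, so the identification process is prone to ambiguous matches. To address this limitation, this paper proposes a new spatial relation identification method that incorporates semantic knowledge such as lexical and syntactic information. The method defines a tree-structured extraction pattern: tree nodes represent spatial word types, and the relations between nodes represent dependency relations between words. The extraction patterns can be learned automatically from an annotated corpus. Pattern matching treats spatial word types and syntactic dependency relations as hard constraints and lexical semantic similarity as a soft constraint; each pattern is converted from its tree structure into a dependency sequence and then matched according to the principle of finite automata. Experimental results show that the precision and recall of the method are 86.67% and 63.11%, respectively. Compared with other existing rule-based methods, it has two advantages: (1) pattern learning requires no manual intervention; (2) incorporating syntactic dependency relations eliminates matching ambiguity and improves identification accuracy.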

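Keywords (translation of the Chinese keywords): spatial relation identification, automata, spatial words, dependency relations, semantic knowledge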

Abstract:

Extracting spatial relations from natural-language text (news, journals, blogs, social networks, etc.) is an important way of obtaining spatial information in the era of big data. Existing methods for extracting spatial relations from Chinese text consider only character- and phrase-level features, which easily leads to ambiguous matches. This paper presents a new rule-based method that integrates lexical, syntactic, and semantic knowledge. Each extraction rule is composed of spatial words and the syntactic dependencies between them, which together form a tree: the nodes represent spatial words, and the edges represent syntactic dependencies. Spatial words are words that can be used to express spatial relations; they are classified into six categories: geographical entities, prepositions, locative nouns, spatial predicates, metaphorical spatial nouns, and auxiliary words. During rule matching, a finite automaton identifies new spatial relation instances that satisfy two conditions: (1) they have the same syntactic dependency structure as the extraction rule; (2) their spatial words are similar to those of the rule. Part of speech and semantic similarity are used to measure the consistency between spatial words. An experiment on extracting direction relations from the Encyclopedia of China shows that the method achieves a precision of 86.67% and a recall of 63.11%, which is better than the former methods. Compared with the former methods, it offers two improvements: (1) rule generation requires no human intervention; (2) integrating syntactic dependency knowledge reduces ambiguous matching, which improves the performance of spatial relation identification.
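
The matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Step` record, the dependency labels (`subj`, `root`, `mod`, `obj`), the `toy_similarity` function, the 0.5 threshold, and the choice to treat geographical entities as free slots are all assumptions made for illustration. The sketch only shows the general idea of running a rule, flattened into a dependency sequence, as a linear finite automaton with hard constraints (word type and dependency label) and a soft constraint (semantic similarity).

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Step:
    """One element of a dependency sequence (hypothetical structure)."""
    word: str       # surface word
    word_type: str  # spatial word category, e.g. "entity", "predicate"
    dep: str        # dependency label linking the word to its head


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for a lexical semantic similarity measure."""
    if a == b:
        return 1.0
    related = {frozenset({"lies", "sits"}), frozenset({"north", "northern"})}
    return 0.8 if frozenset({a, b}) in related else 0.0


def match(pattern: Sequence[Step],
          candidate: Sequence[Step],
          similarity: Callable[[str, str], float] = toy_similarity,
          threshold: float = 0.5) -> bool:
    """Run the pattern as a linear automaton: state i expects pattern[i]."""
    state = 0
    for tok in candidate:
        if state == len(pattern):
            return False  # input continues past the accepting state
        expected = pattern[state]
        # Hard constraints: word type and dependency label must be identical.
        if tok.word_type != expected.word_type or tok.dep != expected.dep:
            return False
        # Soft constraint (assumption: entities are free slots, so only
        # non-entity words are compared by semantic similarity).
        if tok.word_type != "entity" and similarity(tok.word, expected.word) < threshold:
            return False
        state += 1
    return state == len(pattern)


# A rule as it might be learned from "Hebei lies north (of) Henan".
pattern = [
    Step("Hebei", "entity", "subj"),
    Step("lies", "predicate", "root"),
    Step("north", "locative", "mod"),
    Step("Henan", "entity", "obj"),
]

# Same structure, similar words: should be accepted.
similar = [
    Step("Beijing", "entity", "subj"),
    Step("sits", "predicate", "root"),
    Step("northern", "locative", "mod"),
    Step("Tianjin", "entity", "obj"),
]

# Same words as `similar`, but the last dependency label differs.
wrong_dep = similar[:3] + [Step("Tianjin", "entity", "mod")]

# Same structure, but the predicate is not similar to "lies".
wrong_word = [similar[0], Step("borders", "predicate", "root")] + similar[2:]

print(match(pattern, similar), match(pattern, wrong_dep), match(pattern, wrong_word))
```

In this sketch a candidate is rejected as soon as a hard constraint fails, so the similarity function is only evaluated for tokens that already agree with the rule in type and dependency label.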

Key words: spatial relation extraction, finite automata, spatial word, syntactic dependence, semantic knowledge