Theory and Methods of Geo-information Science

On Geographic Knowledge Graph

  • LU Feng 1, *
  • YU Li 1, 2
  • QIU Peiyuan 1

  • 1. State Key Lab of Resources and Environmental Information System, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China
  • 2. National Science Library, Chinese Academy of Sciences, Beijing 100190, China

*Corresponding author: LU Feng, E-mail:

About the author: LU Feng (1970-), PhD, research professor and doctoral supervisor; director of the Theory and Methodology Committee of the China GIS Association and chair of ACM SIGSpatial China. Research interests: spatial data models, spatial databases, spatial data mining, knowledge graphs, and navigation and location-based services.

Received: 2017-04-28

Revision requested: 2017-05-25

Published online: 2017-06-20

Funding

Key Program of the National Natural Science Foundation of China (41631177)

Key Deployment Project of the Chinese Academy of Sciences (ZDRW-ZS-2016-6-3)

Copyright

Editorial Office of Journal of Geo-information Science. All rights reserved.
Abstract

Web texts contain a large amount of implicit geospatial information, which offers great potential for geographic knowledge acquisition and knowledge services. The geographic knowledge graph is the key to extending traditional geographic information services to geographic knowledge services, and it is also the ultimate goal of collecting and processing the geographic information implicit in web texts. This paper systematically reviews research progress on topics related to geographic knowledge graphs: open geographic semantic webs, open geographic entity and relation extraction, geographic semantic web alignment, and knowledge graph storage methods. It then analyzes the key scientific issues that urgently need to be solved, including evaluation of the quantity and quality of geospatial information in web texts, semantic understanding of geographic information, spatial semantic computing models, and alignment of heterogeneous geographic semantic webs.

Cite this article

LU Feng, YU Li, QIU Peiyuan. On Geographic Knowledge Graph[J]. Journal of Geo-information Science, 2017, 19(6): 723-734. DOI: 10.3724/SP.J.1047.2017.00723

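1 Introduction

Most of the data generated by human activity is related to geographic location[1-2]. Geographic information has long been acquired mainly through professional means such as basic surveying and mapping, satellite remote sensing and unmanned aerial vehicle remote sensing, with an emphasis on geometric accuracy. In recent years, with the development of information and communication technology, geographic information has been undergoing a major shift from single-source and static to multi-source and dynamic, and from precise and structured to fuzzy and heterogeneous. The idea that every person is a sensor has been widely put into practice. The continuing generalization of geographic information has become an important feature of the new geographic information age[3-4], and non-traditional, implicit geographic information has attracted wide attention. This generalization requires geographic information systems (GIS) to evolve into a popularized and pervasive generalized GIS, in which multi-source heterogeneous big data becomes mainstream, high-performance computing and cloud computing form the new supporting technologies, and knowledge services become the ultimate goal of GIS[5].
The development of information technology has gradually turned text from non-computable paper documents into computable digital text, making text-based data mining and knowledge discovery possible. The Internet is now becoming the main platform for information dissemination and exchange. Web text carriers such as news pages, online encyclopedias, social networks, data portals and professional literature contain rich implicit geographic information. According to reported statistics, 18.78% of web text resources contain geographic location information, and 18.6% of web searches are related to geographic location[6]. Combining web text mining with geographic information analysis has therefore become a research hotspot in GIS[7-8]. Faced with explosively growing computable web text resources, how to understand web text from the perspective of geospatial cognition, extract the geography-related information it implies, incorporate that information into spatial computing models that traditionally operate on measurable geometric data, and rapidly acquire, reason about and use geographic knowledge is a challenge for geographic information science in the new geographic information age and an important task for generalized GIS[9].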
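Current research on extracting geographic information from web text concentrates on the following topics:
(1) Document annotation: extracting and geolocating the place names in web text, adding geographic tags to documents and building spatial indexes to support address-related search[10-13];
(2) Human-computer interaction: understanding the spatial meaning of natural language vocabulary[14], converting natural language descriptions into spatial query predicates[15], and enabling interactive querying in natural language[16];
(3) Environmental awareness: searching web text for thematic information with geographic distribution characteristics, mining the related semantic information, exploring distribution patterns and detecting anomalies, for example in natural disasters[17-19], emergencies[20-21], social dynamics[22], traffic conditions[23-25] and public opinion analysis[26];
(4) Data sharing: studying methods for computing semantic associations among geographic information metadata, to promote geospatial data sharing based on semantic similarity[27-29];
(5) Knowledge base construction: using open web resources, collaborative encyclopedia platforms or social media to enrich the spatio-temporal attributes of geographic features or events and generate large-scale structured knowledge bases automatically, such as GeoNames Ontology (http://www.geonames.org/ontology/documentation.html), produced by fusing GeoNames (http://www.geonames.org/) and DBpedia (http://dbpedia.org/).
For place name extraction from web text and for acquiring thematic information for environmental awareness, the core is to spatialize the geographic locations or scenes described in text, thereby linking the large amount of semantic information in web text to geographic locations. The bottleneck is how to handle effectively the heterogeneous descriptions of spatial location and spatial semantics in relatively free natural language. This is also where the difficulty of geography-related data sharing and knowledge base construction lies.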

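2 Semantic Web and knowledge graph

Semantics is the meaning of data; only data that have been given meaning are useful. Although many knowledge sources exist on the Internet, their structures differ and much semantic knowledge is hidden in their deep structure, so computers usually have difficulty acquiring and using it. Research on mining and integrating semantic knowledge from multi-source heterogeneous knowledge sources is therefore of great significance for natural language processing tasks[30].
Tim Berners-Lee, inventor of the World Wide Web (WWW) and recipient of the 2016 ACM A.M. Turing Award (announced in 2017), proposed the concept of the Semantic Web in 1998[31]. The Semantic Web is a graph with explicit structure and semantics (such as annotations or explanations) formed from web information resources, used to express and store knowledge. It lets computers not only display these information resources but also integrate and reason over them, connecting individual information islands into one huge graph. The Semantic Web is an extension of the WWW; it expresses and stores sentences described in natural language as graph structures, and can be used for text summarization, machine translation, automatic question answering and other tasks[32].
The World Wide Web Consortium (W3C) is the main promoter and standard setter of the Semantic Web. Companies such as HP, IBM and Microsoft, and universities including Stanford University, the University of Karlsruhe, Tsinghua University, Shanghai Jiao Tong University and Renmin University of China, have carried out in-depth research on Semantic Web technologies. They have developed Semantic Web application platforms and Semantic Web-based systems for information integration and querying, reasoning, and ontology editing, such as Jena (http://jena.apache.org/), KAON (http://kaon2.semanticweb.org/), Racer (http://www.racer-systems.com/), Pellet (http://www.mindswap.org/2003/pellet/), SWARMS (http://keg.cs.tsinghua.edu.cn/project/pswmp.htm) and ORIENT (http://apex.sjtu.edu.cn/projects/orient/).
The development of Semantic Web technology laid the foundation for upgrading Internet search engines. As information services turn into knowledge services, search engine technology has evolved from keyword search to knowledge search based on semantic association. Against this background, Google proposed the concept of the Knowledge Graph in 2012, aiming at a search engine based on semantic understanding, and the concept has spread through academia and industry since 2013. A knowledge graph is a data organization form, or product, that uses a directed graph to express entities, concepts and the semantic relations between them; in essence it is a semantic network. Nodes represent entities or concepts, and edges represent the attributes of entities or concepts or the semantic relations between them. Note that semantic network is the more basic concept, whereas the Semantic Web refers specifically to the structure of semantic associations among Internet information resources and is a concrete application of the semantic network concept. This paper does not distinguish strictly between the two.
Knowledge graphs are driven directly by applications, including machine question answering, information retrieval and online learning. With joint funding from the US Defense Advanced Research Projects Agency (DARPA), the US National Science Foundation (NSF), Google and Yahoo, Carnegie Mellon University launched the "Read the Web" project, which aims to develop a never-ending learning computer system, NELL (Never-Ending Language Learner), that continuously extracts and mines knowledge from the Internet and builds a massive web knowledge base able to support a variety of intelligent information processing applications[33]. Industry, in contrast, favors building knowledge bases through collective intelligence. In 2001, Wikipedia (https://www.wikipedia.org/), the first user-editable "Internet encyclopedia" website, opened to the public; the platform lets Internet users build knowledge resources themselves. At the time of writing, Wikipedia contains more than 41 million entries in 294 languages (https://en.wikipedia.org/wiki/Wikipedia:About). Its growth has brought new vitality to the construction of knowledge base resources, and the community has begun to generate machine-usable knowledge bases from Wikipedia, such as YAGO (http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/), DBpedia and Freebase (https://developers.google.com/freebase/). Because these resources cover knowledge from different domains and grow as human knowledge grows, they have drawn great attention from the search engine giants. After acquiring Freebase in 2010, Google has worked to build a huge knowledge graph of interrelated entities and their attributes, and has built its semantic search engine on that basis. Google's knowledge graph now contains more than 600 million entities and 18 billion attributes or relations. Supported by the knowledge graph, the semantic search engine can better understand search intent and, following the semantic links between entities, identify and organize the information resources semantically related to a query and return them to the user. Later, Google created a new-generation knowledge graph, "Knowledge Vault", for acquiring factual information from unstructured web text[34].
For collecting geographic information from web text and reasoning over it, the final goal is to build a geographic knowledge graph, that is, to detect automatically the spatial and semantic relations between geographic entities and thereby aggregate geographic information automatically. This is an important prerequisite for geographic question answering systems, more precise knowledge search engines for the geosciences, automatic construction of virtual geographic environments from web text, and even the aggregation and pushing of location service information. Extracting geographic information from web resources to build geographic knowledge graphs is therefore the key to extending traditional geographic information services to geographic knowledge services, and the ultimate goal of collecting and processing the geographic information in web text.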

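3 Research progress on geographic knowledge graphs

Fig. 1 shows the workflow for building a knowledge graph from web texts. The data in a knowledge graph include common-sense knowledge and factual knowledge. Common-sense knowledge is provided directly by structured text, for example encyclopedic knowledge bases (such as Freebase and WikiData (https://www.wikidata.org/wiki/Wikidata:Main_Page)), domain-specific knowledge bases (such as GeoNames Ontology and OSM Semantic Network (https://datahub.io/dataset/osm-semantic-network)), Linked Open Data (http://www.linkedgeodata.org/; for example DBpedia and YAGO), and the infoboxes of encyclopedia websites (such as Wikipedia, Baidu Baike and Hudong Baike). Factual knowledge is obtained from semi-structured and unstructured text. For semi-structured text, a site-oriented wrapper is first built, and attribute-value pairs of entities are then extracted from the HTML tables of vertical-domain websites (for example e-commerce and review sites) to enrich entity descriptions. For unstructured text, the main task is to discover new entities or their attributes in natural language text, thereby extending the coverage of the knowledge graph. Knowledge from structured and semi-structured text is reliable and forms the base data for building a knowledge graph. New knowledge mined from unstructured text is of relatively lower quality, but it is large in volume, highly dynamic and broad in coverage, and is key to keeping the knowledge graph practical. Knowledge obtained from multi-source heterogeneous text contains a great deal of redundancy and inconsistency, and must be fused by means of entity linking, knowledge verification, semantic web alignment and related techniques, so as to standardize the knowledge and ensure its connectivity. Finally, the knowledge graph is expressed in the Resource Description Framework (RDF) data model and stored in a database to improve the efficiency of knowledge retrieval.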
Fig. 1 Basic workflow of knowledge graph construction
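
The wrapper step for semi-structured text can be illustrated with a minimal Python sketch. It is not the method of any cited system: it assumes, purely for illustration, that a site publishes entity attributes as two-cell table rows, and the class name `InfoboxWrapper`, the function `extract_pairs` and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser


class InfoboxWrapper(HTMLParser):
    """Collects (attribute, value) pairs from table rows that have exactly two cells."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._cells = None  # cells of the row being read, or None outside a row
        self._buf = None    # text fragments of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag in ("th", "td") and self._cells is not None:
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._buf is not None:
            self._cells.append("".join(self._buf).strip())
            self._buf = None
        elif tag == "tr" and self._cells is not None:
            if len(self._cells) == 2:  # keep only attribute-value rows
                self.pairs.append(tuple(self._cells))
            self._cells = None


def extract_pairs(html):
    """Return the attribute-value pairs found in the given HTML fragment."""
    parser = InfoboxWrapper()
    parser.feed(html)
    return parser.pairs


# Hypothetical page fragment from a review site.
SAMPLE = """
<table>
  <tr><th>Name</th><td>Example Hotel</td></tr>
  <tr><th>City</th><td>Beijing</td></tr>
  <tr><td>row with one cell</td></tr>
</table>
"""
pairs = extract_pairs(SAMPLE)
```

A real wrapper would be tailored to each site's page template; the fixed "two cells per row" rule here stands in for that site-specific knowledge.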

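3.1 Open geographic semantic webs

With the development of Semantic Web technology, geographic semantic webs have emerged one after another as a subset of the Semantic Web. Representative results include GeoNames Ontology, OSM Semantic Network, LinkedGeoData and GeoWordNet (https://datahub.io/dataset/geowordnet). A geographic semantic web uses Uniform Resource Identifiers (URIs) to denote the named entities in relation triples[35], and stores the set of relation triples as RDF files. The objects it describes fall into three types: concepts (Class), properties (Property) and instances (Instance, also called Individual). A concept is an abstract definition of a group of objects that share the same characteristics; properties include datatype properties (DatatypeProperty, describing characteristics of an object itself) and object properties (ObjectProperty, revealing relations between objects); an instance is a concrete realization of a given concept[36]. Concepts and instances are denoted by URIs (for example "http://sws.geonames.org/11070880"), while properties are generally denoted directly by name (for example the datatype property "population" and the object property "featureClass").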
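
As a small illustration of the distinction between datatype and object properties, the Python sketch below stores two triples about the instance URI quoted above and separates them by whether the object is a literal or a URI. The population value and the class URI used as the object of "featureClass" are assumptions made for the example, not data taken from GeoNames.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "predicate", "obj"])

FEATURE = "http://sws.geonames.org/11070880"  # instance URI quoted in the text

triples = [
    Triple(FEATURE, "population", 12345),  # made-up literal value
    Triple(FEATURE, "featureClass", "http://www.geonames.org/ontology#P"),  # assumed class URI
]


def is_uri(value):
    """Treat any http(s) string as a URI; everything else is a literal."""
    return isinstance(value, str) and value.startswith(("http://", "https://"))


def split_properties(items):
    """Separate triples into datatype properties (literal object) and object properties (URI object)."""
    data_props = [t for t in items if not is_uri(t.obj)]
    object_props = [t for t in items if is_uri(t.obj)]
    return data_props, object_props


data_props, object_props = split_properties(triples)
```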
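Table 1 lists the data volumes of these four widely used open geographic semantic webs. The data of OSM Semantic Network come from OpenStreetMap (http://www.openstreetmap.org/); it contains only the concepts and properties of geographic entities, with no instances. The data of GeoNames Ontology come from GeoNames; it has a large number of geographic entities and rich entity relations, and is a complete geographic semantic web spanning concepts, properties and instances. GeoWordNet was formed by combining the GeoNames and WordNet (http://wordnet.princeton.edu/) databases, and adds synonym sets for geographic entities. LinkedGeoData was obtained by converting OpenStreetMap to RDF according to the Linked Open Data principles, so that isolated geographic entity nodes are presented as an interconnected network.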
Tab. 1 Open geographic semantic webs (counts)

Semantic web            Classes   Properties   Instances        Triples
OSM Semantic Network    924       4,217        -                -
GeoNames Ontology       690       28           10,951,423       150,000,000
GeoWordNet              334       -            3,600,000        53,000,000
LinkedGeoData           -         -            1,100,000,000    20,000,000,000

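3.2 Open geographic entity and relation extraction

The tasks of extracting geographic information from web texts mainly include geographic named entity recognition, geographic entity localization, geographic entity attribute extraction, geographic entity relation extraction and geographic event extraction. A great deal of work has been done, and references [7] and [8] give systematic reviews. However, these extraction methods can hardly meet the needs of geographic knowledge graph construction. Traditional geographic information extraction restricts the text genre, entity types, entity relations and event types, whereas in the open web text environment used for knowledge graph construction these are unknown and constantly changing, and the completeness of existing geographic knowledge cannot be guaranteed[37]. Geographic information extraction in the open web text environment therefore needs new research content, chiefly open entity extraction and open entity relation extraction.
3.2.1 Open entity extraction
The goal of open entity extraction is to extract, from massive, redundant and non-standard web data sources, lists of entities that belong to a given semantic class[38]. Traditional named entity recognition obtains entities in text by building entity dictionaries, recognition rules or recognition features. Open entity extraction instead exploits the data redundancy (in text or page structure) of large-scale web text: given seed entities or seed pages, weakly supervised or unsupervised methods derive generalized extraction templates or features, which makes it possible to discover unknown named entities. Redundant data can be used in several ways: (1) mining high-frequency strings in text and determining through post-processing whether each string is an entity[39]; (2) using the context around entities to derive generalized extraction patterns or features automatically[40-41]; (3) mining the implicit semantic relations between characters from large-scale text, and then computing the confidence that a given character combination denotes an entity[42]. Because open entity extraction does not restrict entity types, these methods also apply to geographic entities. In practice, however, they have been validated only on some types of geographic entity (cities, countries, locations, universities and so on), and their performance on arbitrary types remains to be evaluated. Spatial location is an important characteristic of geographic entities, and text that mentions geographic entities often describes their location. On the one hand, spatial description text can assist geographic entity extraction: reference [43] filters massive web text with search conditions of the form "known geographic named entity + spatial relation word" to obtain a text collection dense in candidate geographic named entities, in support of gazetteer updating. On the other hand, the missing spatial attributes of newly found geographic entities can be extracted from spatial description text: reference [44] infers the spatial extent of geographic entities from Flickr check-in text and automatically builds a large-scale gazetteer of geographic entities.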
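3.2.2 Open entity relation extraction
The biggest difference between open and traditional entity relation extraction is that the open variant does not require the relation types to be defined in advance, so it can discover new types of relation. The artificial intelligence group at the University of Washington has done extensive work in this area and has successively built four generations of representative open information extraction prototypes: TextRunner[45], ReVerb[46], OLLIE[47] and OpenIE 4.X[48]. The succession of these systems mirrors the development of open relation extraction research, from "direct application of existing tools" to "analysis of the grammatical and syntactic features of relation expressions" and then to "enhancement with heuristic rules"; the relation types have likewise evolved from early binary relations to n-ary relations. General-purpose open relation extraction is not specifically optimized for geographic relations. Some researchers have considered using spatial description features in text to improve the extraction of geospatial relations, for which the key is to build a large-scale annotated corpus suitable for feature learning. To this end, reference [49] used automatic back-labeling with Wikipedia to establish "flows into" relations between rivers and river systems and "composition" relations between suburbs and towns. Reference [50] used online hotel reviews to build automatically an annotated corpus of the "adjacent" relation between geographic entities, containing 106,000 documents. Reference [51] manually built a spatial ontology and successfully extracted topological and directional relations between geographic entities. However, the raw corpora in these studies contain only some types of spatial relation, so the resulting annotated corpora reflect the descriptive features of only a few spatial relations and adapt poorly to the diversity of geographic entity relations. Reference [52] therefore used a bootstrapping technique that identifies relation words for arbitrary types of geographic entity relation from the part of speech, position and distance features of words, reducing the dependence on domain expert knowledge. In addition, because geographic entity relations are sparsely distributed in web text corpora, a context enhancement method can be used to generate a large-scale corpus from open geographic text resources, with statistical methods used to obtain keyword extraction features, achieving high-quality geographic entity relation extraction[53].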
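
To make the notion of a relation triple concrete, the following Python sketch extracts (entity, relation word, entity) triples with a plain lexical pattern. It is a deliberately naive baseline, not the approach of any system cited above: it assumes a known list of place names and relation words, whereas open extraction aims precisely to avoid such fixed lists. The function name `extract_relations` and the sample inputs are hypothetical.

```python
import re


def extract_relations(sentence, places, relation_words):
    """Return (place_a, relation, place_b) triples where a relation word lies between two known places."""
    # Longer alternatives first, so that multi-word names win over their prefixes.
    place_alt = "|".join(re.escape(p) for p in sorted(places, key=len, reverse=True))
    rel_alt = "|".join(re.escape(r) for r in sorted(relation_words, key=len, reverse=True))
    pattern = re.compile(
        rf"({place_alt})\s+(?:is\s+|lies\s+)?({rel_alt})\s+(?:the\s+)?({place_alt})"
    )
    return [m.groups() for m in pattern.finditer(sentence)]


PLACES = ["Fen River", "Yellow River", "Tianjin", "Beijing"]
RELATION_WORDS = ["flows into", "adjacent to"]
SENTENCE = "The Fen River flows into the Yellow River, and Tianjin is adjacent to Beijing."

found = extract_relations(SENTENCE, PLACES, RELATION_WORDS)
```

The sketch also shows why such rules generalize poorly: any relation phrased with a word outside `RELATION_WORDS`, or any place missing from `PLACES`, is silently missed.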

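3.3 Geographic semantic web alignment

Open geographic semantic webs provide a high-quality data foundation for building geographic knowledge graphs. Although they all follow the framework established by the W3C, semantic diversity still inevitably leads to polysemy and homonymy. In addition, the semantic webs are for the most part managed and maintained independently, forming many "information islands" that are well organized locally but scattered and isolated overall[54]. Semantic alignment techniques are therefore needed to integrate these heterogeneous, scattered knowledge resources, mine the semantic relations between them, and enable unified querying and access. Early semantic alignment research mainly targeted general-purpose knowledge bases and has produced complete alignment systems[55-57]. However, geospatial datasets differ greatly in structure from general datasets, so general alignment systems perform poorly on geospatial alignment tasks[58]. Some researchers have therefore introduced spatial characteristics into semantic alignment in response to practical application needs, and proposed alignment methods that take geographic semantics into account[59-61]. Reference [62] integrated GeoNames and WordNet at the concept level, using name similarity and manual verification, to obtain GeoWordNet. References [63] and [64] used three features (the usage frequency of a concept in WordNet, the overlap between the definitions of different concepts, and the classification of concepts) to establish equivalence and subsumption relations between the concepts of OSM Semantic Network and LinkedGeoData, and between the concepts of OSM Semantic Network, GeoNames Ontology and WordNet. Reference [65] established equivalence relations between instances of LinkedGeoData and DBpedia according to the classification system of geographic entities, spatial distance and instance name similarity. Fig. 2 shows the current state of alignment among four mainstream geographic semantic webs for different types of object. The different types of geographic semantic alignment task have not yet been integrated into a unified framework, and the fusion methods for different object types cannot make use of each other. To address this, reference [66] proposed an integrated framework for geospatial data alignment that measures spatial and semantic similarity from multi-dimensional information and combines voting and collaborative enhancement strategies to align concepts, properties and instances in one pass. Overall, compared with general semantic alignment, current geographic semantic alignment concentrates on concept alignment; property and instance alignment have been studied less, no complete geographic knowledge base fusion system has yet appeared, and the semantic heterogeneity of geographic knowledge bases urgently needs to be resolved.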
Fig. 2 Current state of alignment among open geographic semantic webs
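
Instance alignment that combines entity type, spatial distance and name similarity, the three cues mentioned above for reference [65], might be sketched in Python as follows. This is an illustrative simplification and not the published method: the thresholds `max_km` and `min_sim`, the use of `difflib` for string similarity, the record layout and the sample coordinates are all assumptions.

```python
import math
from difflib import SequenceMatcher


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    radius = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def name_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align(source, target, max_km=5.0, min_sim=0.8):
    """Pair each source instance with its most similarly named target of the same type within max_km."""
    matches = []
    for s in source:
        best = None
        for t in target:
            if s["type"] != t["type"]:
                continue
            dist = haversine_km(s["lat"], s["lon"], t["lat"], t["lon"])
            sim = name_similarity(s["name"], t["name"])
            if dist <= max_km and sim >= min_sim and (best is None or sim > best[1]):
                best = (t["id"], sim)
        if best is not None:
            matches.append((s["id"], best[0]))
    return matches


# Hypothetical records from two sources; identifiers and coordinates are invented.
SOURCE = [{"id": "a1", "name": "Summer Palace", "type": "park", "lat": 39.9999, "lon": 116.2755}]
TARGET = [
    {"id": "b1", "name": "summer palace", "type": "park", "lat": 40.0000, "lon": 116.2750},
    {"id": "b2", "name": "Summer Palace", "type": "park", "lat": 31.2300, "lon": 121.4700},
]
links = align(SOURCE, TARGET)
```

The distance threshold is what distinguishes this from purely textual matching: the identically named record `b2` is rejected because it lies far from `a1`.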

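Besides geographic semantic webs, a large amount of basic geographic information, such as map data and statistical data, is also openly available on the web. Some researchers have tried to fuse such basic geographic information with geographic semantic webs[67]; the key is to reorganize traditional geographic information as linked data[68-70]. For map data in particular, which carries little semantic information, multiple kinds of semantic information (spatial, temporal, content and structural) need to be mined from the corresponding metadata to form the associations between datasets[28,71].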

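3.4 Geographic knowledge graph storage

Knowledge graphs use the RDF model to represent data, a structure that emphasizes the semantic relations between objects. RDF has three object types: resources (Resource), predicates (Predicate) and statements (Statements)[72]. A resource is an entity that exists in the real or virtual world and is identified by a unique URI. A predicate describes a characteristic of a resource or a relation between resources. A statement is expressed as an RDF triple "<subject, predicate, object>", where the subject is the resource being described, the predicate expresses an attribute of the subject or some relation between the subject and the object, and the object is an attribute value or a resource denoted by a URI. Standard RDF triples do not express spatial information easily, which hampers spatial index construction and spatial queries. For geographic semantic relation data, current research therefore commonly adds spatial declarations to the RDF model, such as spatial type statements and sets of spatial relation predicates, to build spatially typed tuples suited to spatial indexing and querying[73-75].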
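
The idea of adding spatial declarations to triples can be sketched in Python with a toy triple list in which one predicate carries a coordinate and another expresses a spatial relation. The "ex:" vocabulary, the place names and the coordinates are invented for illustration and do not follow any particular published spatial RDF extension.

```python
# Hypothetical vocabulary: every "ex:" name and every coordinate below is invented.
triples = [
    ("ex:PlaceA", "rdf:type", "ex:Park"),
    ("ex:PlaceA", "ex:hasPoint", (39.90, 116.40)),  # (latitude, longitude)
    ("ex:PlaceB", "rdf:type", "ex:Park"),
    ("ex:PlaceB", "ex:hasPoint", (31.23, 121.47)),
    ("ex:PlaceA", "ex:northOf", "ex:PlaceB"),       # spatial relation predicate
]

SPATIAL_PREDICATE = "ex:hasPoint"


def within_bbox(items, south, west, north, east):
    """Return subjects whose declared point falls inside the bounding box."""
    hits = []
    for subject, predicate, obj in items:
        if predicate != SPATIAL_PREDICATE:
            continue  # only spatially typed triples take part in the spatial query
        lat, lon = obj
        if south <= lat <= north and west <= lon <= east:
            hits.append(subject)
    return sorted(hits)


near_a = within_bbox(triples, 39.0, 116.0, 41.0, 117.0)
```

A linear scan is used here for brevity; the spatial indexes discussed in the text exist precisely to avoid scanning every triple.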
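RDF data can be stored in two ways. (1) Storage based on a relational database: RDF triples are decomposed, mapped to the relational data model and stored directly in a traditional relational database. The key problem is how to design a suitable table structure to express and index complex tuple relations[76-77]. (2) Storage based on a graph database: if RDF triples are viewed as labeled edges, RDF data convert naturally into a graph structure, which is well suited to graph database storage[78]. Besides the problem that edge labels themselves can become query targets, this approach must also deal with the effect of knowledge graph growth on query time complexity[79]. For progress in RDF data storage, see references [80-81]. Geographic semantic relation data can be stored in the same ways. For example, references [82] and [83] extended the RDF relational database engine RDF-3X[76] to store, index and query spatial information, and reference [84] added a hybrid semantic-spatial index to the RDF graph database engine gStore[78] to develop S-store, a graph database engine that integrates spatial information.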
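
The two storage styles can be contrasted with a short Python sketch: a single three-column table in SQLite stands in for the relational approach, and an adjacency dictionary of labeled edges stands in for the graph approach. The table layout, index choice and sample triples are assumptions for illustration; engines such as RDF-3X and gStore use far more elaborate structures.

```python
import sqlite3

# Invented sample triples using a hypothetical "ex:" vocabulary.
TRIPLES = [
    ("ex:Beijing", "ex:locatedIn", "ex:China"),
    ("ex:Beijing", "ex:adjacentTo", "ex:Tianjin"),
    ("ex:Tianjin", "ex:locatedIn", "ex:China"),
]


def build_store(triples):
    """Relational style: one triple table plus indexes on common access paths."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
    conn.execute("CREATE INDEX idx_sp ON triples (s, p)")
    conn.execute("CREATE INDEX idx_po ON triples (p, o)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
    return conn


def objects_of(conn, subject, predicate):
    """Look up all objects for a given subject and predicate."""
    rows = conn.execute(
        "SELECT o FROM triples WHERE s = ? AND p = ? ORDER BY o", (subject, predicate)
    )
    return [row[0] for row in rows]


def as_graph(triples):
    """Graph style: each triple becomes a labeled edge leaving its subject node."""
    graph = {}
    for s, p, o in triples:
        graph.setdefault(s, []).append((p, o))
    return graph


store = build_store(TRIPLES)
graph = as_graph(TRIPLES)
```

In the relational form a multi-hop question becomes a chain of self-joins on the triple table, while in the graph form it becomes a traversal from node to node, which is one way to see the trade-off described above.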

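4 Core problems in geographic knowledge graph construction

In summary, although web text has become an important data resource for acquiring generalized geographic information, existing methods still cannot meet the practical needs of geographic knowledge graph construction in the open web text environment. Moreover, in specific studies the computer science community mostly takes a text processing perspective and treats geographic entities as an ordinary entity type, overlooking their measurable nature, while the GIS community mostly takes a geometric measurement perspective and pays little attention to the computable text processing of geographic entities. Fusing "measurability of the geographic entities described in text" with "computability of the text describing geographic entities" is an urgent need for understanding the geographic information in web text, and is the key to extending geographic information services to geographic knowledge services. Taking the needs of geographic knowledge graph construction and geographic knowledge applications together, we propose a complete technical workflow for geographic knowledge graph construction, shown in Fig. 3. Four aspects of its implementation need urgent research.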
Fig. 3 Technical workflow of geographic knowledge graph construction

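4.1 Evaluating the quantity and quality of geospatial information in web text

The great value of web text lies in its large number of contributors and its highly dynamic nature; it is a typical collaborative way of collecting and aggregating data, and an important source of volunteered geographic information (VGI). Such data, however, vary widely in quality. For extracting the geospatial information they contain, the first problem is to assess the information content and quality of these text resources. This is needed not only to rank the importance of web text resources in geospatial information retrieval, but also for subsequent spatial relation extraction and spatial computation. Evaluating the quantity and quality of the geospatial information in web text is thus an important prerequisite for ensuring the quality of spatial information analysis over heterogeneous web text. Given the massive volume of web text, the first task is to propose a general index system for evaluating geospatial information quantity and quality, and to build evaluation models that draw on complex network theory, fuzzy mathematics and related methods, so as to identify effectively the high-quality geographic information contained in web texts of different types and sources.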

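4.2 Semantic understanding of geographic information in web text

Because of the great difference between the knowledge-style description of geospatial information in natural language text and description by precise geometric coordinates, it is extremely difficult to define geospatial semantics, especially the spatial relations and spatial extents of geographic entities. GIS uses a logical language and requires the "quality" and "quantity" of geographic information to be defined precisely, whereas text descriptions (especially relatively free-form web text) use natural language, whose semantics (including spatial location) must be understood from context. Each vague word in a text becomes, in its specific context, a precise concept that the reader can understand, which is hard to achieve in a logical language. The widely used rule-based approaches are affected by the completeness and timeliness of gazetteers, so new place name recognition and semantic diversity remain hard to handle. For spatial relation recognition, rule-based methods depend too heavily on spatial relation vocabulary, their rule coverage is limited and rules easily conflict with one another, so they have difficulty recognizing spatial relations with complex descriptive structures. Machine learning methods based on statistical models, particularly when combined with knowledge bases, have attracted much attention, but their effectiveness depends to a large extent on the size and annotation quality of the annotated corpus. For web text, therefore, machine learning models should be designed that draw on the large-scale semantic webs provided by web resources and, while reducing dependence on annotated corpora, use unsupervised learning to deepen the understanding of the spatial relations of geographic entities contained in natural language text, and thereby achieve reliable spatialization of text descriptions of geographic entities, events and processes.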

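4.3 Spatial semantic computing models for geographic information described in web text

Current research on extracting geospatial information from text mainly aims to recognize geographic named entities and geo-reference them, and to integrate the results with geospatial databases expressed in logical language. The computation is confined to text statistical models and semantic computing models of spatial relations, and does not yet involve spatial computation on the text itself. In the long run, if the spatial computation originally designed for geospatial datasets could be transferred to web text, that is, if spatially constrained computation could be performed directly on text, it would give great support to the automatic acquisition of geographic knowledge from text and to the construction of virtual geographic environment scenes. Natural language text usually describes geographic phenomena or processes with place names and directional prepositions rather than geographic coordinates, and relies heavily on qualitative and fuzzy expressions. Under the view of spatial cognition that such text reflects, making the geospatial semantics mined from text computable is an important scientific problem: the traditional data structures and algorithms based on precise geometric coordinates must be transferred to fuzzy spatial data structures and algorithms based on place names and directional prepositions, while the usability of spatial computing services over natural language text is improved. Geospatial semantic computing is an unavoidable bottleneck in the analysis of heterogeneous geospatial big data, and a serious challenge to the traditional geospatial computing paradigm built on geometric coordinates.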
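
One simple way to picture computation over fuzzy directional terms is to assign each term a graded membership rather than a yes/no answer. The Python sketch below does this with a cosine fall-off around the nominal bearing of each direction; the planar coordinates, the cosine shape and the function names are assumptions chosen for illustration, not a model proposed in this paper.

```python
import math

# Nominal bearings in degrees, measured clockwise from north.
DIRECTIONS = {"north": 0.0, "east": 90.0, "south": 180.0, "west": 270.0}


def bearing_deg(ref, target):
    """Planar bearing from ref to target, with points given as (x=east, y=north)."""
    dx = target[0] - ref[0]
    dy = target[1] - ref[1]
    return math.degrees(math.atan2(dx, dy)) % 360.0


def direction_membership(ref, target, direction):
    """Degree in [0, 1] to which target lies in the given direction from ref."""
    diff = abs(bearing_deg(ref, target) - DIRECTIONS[direction])
    diff = min(diff, 360.0 - diff)  # smallest angle between the two bearings
    return max(0.0, math.cos(math.radians(diff)))


due_north = direction_membership((0.0, 0.0), (0.0, 10.0), "north")
north_east = direction_membership((0.0, 0.0), (10.0, 10.0), "north")
```

Under this assumed model a point to the north-east is "north of" the reference to degree about 0.71 and "east of" it to the same degree, which captures the overlap between directional terms that crisp geometric predicates cannot express.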

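4.4 Heterogeneous semantic web alignment and large-scale geographic knowledge graph construction

Extracting the geographic information contained in web text serves to supplement existing professional geographic information at a finer temporal granularity, enriching it with semantic or real-time information (for example in building pan-information location maps). A more important task, however, is to attach geographic semantic labels to web text and, with the support of existing Semantic Web technology, to construct a geographic semantic web for web text. Only then can semantic associations among web text resources be established from a geospatial perspective. On that basis, through knowledge reasoning and knowledge computation, supplemented by professional geospatial information and supported by web text semantic computing models and geometric spatial computing models, geographic knowledge graphs can be built automatically and learn by themselves. However, the geographic semantic webs released so far remain independent of one another and contain much redundancy, ambiguity and inconsistency. How to integrate these resources on the basis of semantics, establish links between different geographic semantic webs, maintain semantic consistency, and enable unified querying and access is a problem that must be solved in building geographic knowledge graphs. In addition, the storage, management and updating of geographic knowledge graphs all need in-depth study. These issues are critical to the effectiveness of geographic question answering, virtual geographic scene construction and knowledge search engines.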

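5 Conclusions

The explosive growth of web text brings rich implicit geospatial information and offers great potential for geographic knowledge acquisition and knowledge services. At the same time, the internal needs of generalized GIS and the external push of knowledge services are driving GIS applications from providing geographic information services to providing geographic knowledge services, and the geographic knowledge graph has become the ultimate goal of collecting and processing the geographic information in web text. Although many existing research topics correspond to steps in the workflow of building a geographic knowledge graph from web text, they are limited in their data objects, method performance and computational efficiency, and cannot meet the need to process large-scale open web text and build geographic knowledge graphs from it. Building geographic knowledge graphs from web text therefore has broad application prospects but also poses many research challenges. Key scientific problems to be solved include the evaluation of geospatial information quantity and quality in web text, semantic understanding of geographic information, spatial semantic computing models for geographic information, and alignment of heterogeneous geographic semantic webs, so as to lay the theoretical and methodological foundation for automated and intelligent geographic knowledge graphs.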

The authors have declared that no competing interests exist.

[1]
徐冠华. 全社会要高度关注“数字地球”[J].中国测绘,1999,3:7-8.

[ Xu G H.The whole society should pay great attention to the 'Digital Earth'[J]. China Surveying and Mapping, 1999,3:7-8. ]

[2]
Hahmann S, Burghardt D.How much information is geospatially referenced? Networks and cognition[J]. International Journal of Geographical Information Science, 2013,27(6):1171-1189.

[3]
李德仁,邵振峰.论新地理信息时代[J].中国科学(F辑:信息科学),2009,39(6):579-587.

[ Li D R, Shao Z F.The new geographic information age[J]. Science China(F: Information Science), 2009,39(6):579-587. ]

[4]
龚健雅,王国良.从数字城市到智慧城市:地理信息技术面临的新挑战[J].测绘地理信息,2013,38(2):1-6.

[ Gong J Y, Wang G L.From digital city to smart city: New challenges to geographic information technology[J]. Journal of Geomatics, 2013,38(2):1-6. ]

[5]
周成虎. 创新GIS[R].北京:ESRI中国用户大会,2014.

[ Zhou C H.Innovation of GIS. Beijing: China User Conference of ESRI, 2014. ]

[6]
Aloteibi S, Sanderson M.Analyzing geographic query reformulation: An exploratory study[J]. Journal of the Association for Information Science and Technology, 2014,65(1):13-24.

[7]
余丽,陆锋,张恒才.网络文本蕴涵地理信息抽取:研究进展与展望[J].地球信息科学学报,2015,17(2):127-134.

[ Yu L, Lu F, Zhang H C.Extracting geographic information from web texts: status and development[J]. Journal of Geo-information Science, 2015,17(2):127-134. ]

[8]
张雪英. 中文文本的时空信息获取方法[J].中国计算机学会通讯,2015,11(11):33-40.

[ Zhang X Y.Extracting the temporal and spatial information from Chinese texts[J]. Communications of China Computer Federation, 2015,11(11):33-40. ]

[9]
陆锋,张恒才.大数据与广义GIS[J].武汉大学学报·信息科学版,2014,39(6):645-654.

[ Lu F, Zhang H C.Big Data and Generalized GIS[J]. Geomatics and Information Science of Wuhan University, 2014,39(6):645-654. ]

[10]
杨崇俊,刘冬林,张富庆,等.电子政务与隐形搜索技术-词虎[C].中国测绘学会2006年学术年会,2006:533-539.

[ Yang C J, Liu D L, Zhang F Q, et al.E-government and the Invisible Search Technology: Word Tiger[C].The Academic Annual Meeting of Chinese Society for Geodesy, Photogrammetry and Cartography, 2006:533-539. ]

[11]
Pouliquen B, Kimler M, Steinberger R, et al.Geocoding multilingual texts: Recognition, disambiguation and visualisation[C]. In Proceedings of LREC-2006, 2006.

[12]
Sankaranarayanan J, Samet H, Teitler B E, et al.Twitterstand: news in tweets[C]. Proceedings of the 17th acm sigspatial international conference on advances in geographic information systems, ACM, 2009:42-51.

[13]
Delozier G, Baldridge J, London L.Gazetteer-independent toponym resolution using geographic word profiles[C]. Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI’15), 2015:2382-2388.

[14]
Xu J, Mark D M.Natural language understanding of spatial relations between linear geographic objects[J]. Spatial Cognition and Computation, 2007,7(4):311-347.

[15]
马林兵,龚健雅.面向自然语言的空间数据库查询研究[J].计算机工程与应用,2003,39(22):16-19.

[ Ma L B, Gong J Y.Research on spatial database query oriented natural language[J]. Computer Engineering and Applications, 2003,39(22):16-19. ]

[16]
杜冲,司望利,许珺.基于地理语义的空间关系查询和推理[J].地球信息科学学报,2011,12(1):48-55.

[ Du C, Si W L, Xu J.Querying and reasoning of spatial relations based on geographic semantics[J]. Journal of Geo-information Science, 2011,12(1):48-55. ]

[17]
Sakaki T, Okazaki M, Matsuo Y.Earthquake shakes twitter users: real-time event detection by social sensors[C]. Proceedings of the 19th international conference on World Wide Web (WWW’10), ACM, 2010:851-860.

[18]
Wang W, Stewart K.Spatiotemporal and semantic information extraction from Web news reports about natural hazards[J]. Computers, Environment and Urban Systems,2015,50:30-40.

[19]
张春菊,张雪英,王曙,等.中文文本的事件时空信息标注[J].中文信息学报,2016,30(3):213-222.

[ Zhang C J, Zhang X Y, Wang S, et al.Annotation of spatial-temporal information of event in Chinese text[J]. Journal Of Chinese Information Processing, 2016,30(3):213-222. ]

[20]
Cui A, Zhang M, Liu Y, et al.Discover breaking events with popular hashtags in twitter[C]. Proceedings of the 21st ACM international conference on Information and knowledge management, ACM, 2012:1794-1798.

[21]
Schulz A, Mencía E L, Schmidt B.A rapid-prototyping framework for extracting small-scale incident-related information in microblogs: Application of multi-label classification on tweets[J]. Information Systems, 2016,57:88-110.

[22]
Stefanidis A, Crooks A, Radzikowski J.Harvesting ambient geospatial information from social media feeds[J]. GeoJournal, 2013,78(2):319-338.

[23]
张恒才,陆锋,仇培元.基于D-S证据理论的微博客蕴含交通信息提取方法[J].中文信息学报,2015,29(2):170-178.

[ Zhang H C, Lu F, Qiu P Y.Extracting traffic information from micro-blog based on D-S evidence theory[J]. Journal of Chinese Information Processing, 2015,29(2):170-178. ]

[24]
仇培元,张恒才,陆锋.互联网文本蕴含道路交通信息抽取的模式匹配方法[J].地球信息科学学报,2015,17(4):416-422.

[ Qiu P Y, Zhang H C, Lu F.A pattern matching method for extracting road traffic information from internet texts[J]. Journal of Geo-information Science, 2015,17(4):416-422. ]

[25]
仇培元,张恒才,余丽,等.微博客蕴含交通事件信息抽取的自动标注方法[J].中文信息学报,2017,31(2):144-153.

[ Qiu P Y, Zhang H C, Yu L, et al.Automatic event labeling for traffic information extraction from microblogs[J]. Journal of Chinese Information Processing, 2017,31(2):144-153. ]

[26]
Murthy D, Longwell S A.Twitter and disasters: The uses of Twitter during the 2010 Pakistan floods[J]. Information, Communication & Society, 2013,16(6):837-855.

[27]
Zhu R, Hu Y, Janowicz K, et al.Spatial signatures for geographic feature types: Examining gazetteer ontologies using spatial statistics[J]. Transactions in GIS, 2016,20(3):333-355. Digital gazetteers play a key role in modern information systems and infrastructures. They facilitate (spatial) search, deliver contextual information to recommender systems, enrich textual information with geographical references, and provide stable identifiers to interlink actors, events, and objects by the places they interact with. Hence, it is unsurprising that gazetteers, such as GeoNames, are among the most densely interlinked hubs on the Web of Linked Data. A wide variety of digital gazetteers have been developed over the years to serve different communities and needs. These gazetteers differ in their overall coverage, underlying data sources, provided functionality, and geographic feature type ontologies. Consequently, place types that share a common name may differ substantially between gazetteers, whereas types labeled differently may, in fact, specify the same or similar places. This makes data integration and federated queries challenging, if not impossible. To further complicate the situation, most popular and widely adopted geo-ontologies are lightweight and thus under-specific to a degree where their alignment and matching become nothing more than educated guesses. The most promising approach to addressing this problem, and thereby enabling the meaningful integration of gazetteer data across feature types, seems to be a combination of top-down knowledge representation with bottom-up data-driven techniques such as feature engineering and machine learning. In this work, we propose to derive indicative spatial signatures for geographic feature types by using spatial statistics. We discuss how to create such signatures by feature engineering and demonstrate how the signatures can be applied to better understand the differences and commonalities of three major gazetteers, namely DBpedia Places, GeoNames, and TGN.

[28]
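赵红伟,诸云强,侯志伟,等.地理空间元数据关联网络的构建[J].地理科学,2016,36(8):1180-1189. A geospatial metadata association model is designed using the Resource Description Framework (RDF). Based on the semantic relations between geospatial metadata records and their computed semantic relatedness, an association network is built with metadata as nodes, the semantic relations between metadata as edges, and semantic relatedness as weights. In this network, each node is the resource description graph of one geospatial metadata record, containing its attribute features (data source, spatial features, temporal features, content) and its relational features (semantic relations and semantic relatedness between metadata). Experiments and analysis show that the geospatial metadata association network effectively supports applications such as semantic association retrieval and recommendation of geospatial data, with higher accuracy than traditional keyword-based metadata retrieval.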

[ Zhao H, Zhu Y, Hou Z, et al.Construction of geospatial metadata association network[J]. Scientia Geographica Sinica, 2016,36(8):1180-1189. ]

[29]
Zhu Y, Zhu A X, Feng M, et al.A similarity-based automatic data recommendation approach for geographic models[J]. International Journal of Geographical Information Science, 2017:1-22.

[30]
赵军. 从问答系统看知识智能[J].中国计算机学会通讯,2015,11(3):16-22.

[ Zhao J.Look at Knowledge intelligence from Q&A system[J]. Communications of China Computer Federation, 2015,11(3):16-22. ]

[31]
Berners-Lee T, Hendler J.Publishing on the semantic web[J]. Nature, 2001,410(6832):1023-1024.

[32]
漆桂林,高桓,吴天星.知识图谱研究进展[J].情报工程,2017,3(1):4-25.

[ Qi G L, Gao H, Wu T X. The research advances of knowledge graph[J]. Technology Intelligence Engineering, 2017,3(1):4-25. ]

[33]
Carlson A, Betteridge J, Kisiel B, et al.Toward an Architecture for Never-Ending Language Learning[C]. Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10), 2010:1306-1313.

[34]
Dong X, Gabrilovich E, Heitz G, et al.Knowledge vault: A web-scale approach to probabilistic knowledge fusion[C]. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2014:601-610.

[35]
Otero-Cerdeira L, Rodríguez-Martínez F J, Gómez-Rodríguez A. Ontology matching: A literature review[J]. Expert Systems with Applications, 2015,42(2):949-971.To do so, we first perform a literature review of the field in the last decade by means of an online search. The articles retrieved are sorted using a classification framework that we propose, and the different categories are revised and analyzed. The information in this review is extended and supported by the results obtained by a survey that we have designed and conducted among the practitioners.

[36]
OWL Web Ontology Language Overview[EB/OL]. /, 2004

[37]
Janowicz K, Hitzler P.Geospatial Semantic Web[M]. The International Encyclopedia of Geography, 2017:1-6.

[38]
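赵军,刘康,周光有,等.开放式文本信息抽取[J].中文信息学报,2011,25(6):98-111. Information extraction research has moved from traditional tasks restricted to fixed categories and domains toward open-category, open-domain information extraction. The techniques have likewise evolved from statistical methods based on manually annotated corpora to open information extraction that mines and integrates multi-source heterogeneous web knowledge and combines it with statistical methods. After reviewing the history of text information extraction research, this paper focuses on the tasks, difficulties, methods, evaluations, state of the art, and open problems of open entity extraction, entity disambiguation, and relation extraction. Drawing on the research group's own work, it also discusses the future directions of text information extraction and its applications in web knowledge engineering and question answering systems.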

[ Zhao J, Liu K, Zhou G Y, et al.Open information extraction[J]. Journal of Chinese Information Processing, 2011,25(6):98-111. ]

[39]
Parameswaran A, Garcia-Molina H, Rajaraman A.Towards the web of concepts: Extracting concepts from large datasets[J]. Proceedings of the VLDB Endowment, 2010,3(1-2):566-577. Concepts are sequences of words that represent real or imaginary entities or ideas that users are interested in. As a first step towards building a web of concepts that will form the backbone of the next generation of search technology, we develop a novel technique to extract concepts from large datasets. We approach the problem of concept extraction from corpora as a market-basket problem, adapting statistical measures of support and confidence. We evaluate our concept extraction algorithm on datasets containing data from a large number of users (e.g., the AOL query log data set), and we show that a high-precision concept set can be extracted.

[40]
Jain A, Pennacchiotti M.Open entity extraction from web search query logs[C]. Proceedings of the 23rd International Conference on Computational Linguistics, ACL, 2010:510-518.

[41]
Song W, Zhao S, Zhang C, et al.Exploiting collective hidden structures in webpage titles for open domain entity extraction[C]. Proceedings of the 24th International Conference on World Wide Web,ACM, 2015:1014-1024.

[42]
Deng K, Wang D, Liu J.Weakly-supervised named entity extraction using word representations[C]. International Conference on Database Systems for Advanced Applications, 2017:195-203.

[43]
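张春菊,张雪英,朱少楠,等.基于网络爬虫的地名数据库维护方法[J].地球信息科学学报,2011,13(4):492-499. At present, China's toponym databases suffer from several problems: coarse- and medium-granularity place names dominate while fine-granularity place names are scarce; the toponym data are outdated and of low timeliness; and non-standard place name information such as abbreviations and aliases, as well as relative location information, is missing. Updating and maintenance rely mainly on manual surveying, which is slow, costly, and inefficient. To address this, based on an existing toponym database and a vocabulary of spatial relations, and using the Google search engine service, this paper proposes a method that takes web pages as the data source and applies web crawler and toponym recognition techniques to update and maintain toponym databases. First, a toponym-focused web crawler is designed to actively collect massive spatially sensitive web texts from unstructured web page data. Then, HTML DOM techniques are used to parse the spatially sensitive pages, and a CRF toponym recognition model automatically identifies place names in the page text. Finally, algorithms are designed to parse the toponym information in the text automatically, so that new place names and their spatial locations are obtained and the toponym database is updated. The feasibility of the method is verified with the spatial retrieval example "南京师范大学仙林宾馆+西北" (Xianlin Hotel of Nanjing Normal University + northwest).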

[ Zhang C J, Zhang X Y, Zhu S N, et al.Method of toponym database updating based on web crawler[J].Journal of Geo-information Science, 2011,13(4):492-499. ]

[44]
Gao S, Li L, Li W, et al.Constructing gazetteers from volunteered big geo-data based on Hadoop[J]. Computers, Environment and Urban Systems, 2017,61:172-186.Abstract: Traditional gazetteers are built and maintained by authoritative mapping agencies. In the age of Big Data, it is possible to construct gazetteers in a data-driven approach by mining rich volunteered geographic information (VGI) from the Web. In this research, we build a scalable distributed platform and a high-performance geoprocessing workflow based on the Hadoop ecosystem to harvest crowd-sourced gazetteer entries. Using experiments based on geotagged datasets in Flickr, we find that the MapReduce-based workflow running on the spatially enabled Hadoop cluster can reduce the processing time compared with traditional desktop-based operations by an order of magnitude. We demonstrate how to use such a novel spatial-computing infrastructure to facilitate gazetteer research. In addition, we introduce a provenance-based trust model for quality assurance. This work offers new insights on enriching future gazetteers with the use of Hadoop clusters, and makes contributions in connecting GIS to the cloud computing environment for the next frontier of Big Geo-Data analytics.

[45]
Banko M, Cafarella M J, Soderland S, et al.Open information extraction from the web[C]. Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007:2670-2676.

[46]
Fader A, Soderland S, Etzioni O.Identifying relations for open information extraction[C]. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2011:1535-1545.

[47]
Schmitz M, Bart R, Soderland S, et al.Open language learning for information extraction[C]. Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, ACL, 2012:523-534.

[48]
Mausam M.Open information extraction systems and downstream applications[C]. Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, AAAI, 2016:4074-4077.

[49]
Blessing A, Schütze H.Fine-grained geographical relation extraction from Wikipedia[C]. Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010), 2010:2949-2952.

[50]
Wallgrün J O, Klippel A, Baldwin T.Building a corpus of spatial relational expressions extracted from web documents[C]. Proceedings of the 8th Workshop on Geographic Information Retrieval, 2014:6.

[51]
Loglisci C, Ienco D, Roche M, et al.Toward geographic information harvesting: extraction of spatial relational facts from web documents[C]. Proceedings of the 12th International Conference on Data Mining Workshops (ICDMW 2012), IEEE, 2012:789-796.

[52]
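余丽,陆锋,刘希亮.开放式地理实体关系抽取的Bootstrapping方法[J].测绘学报,2016,45(5):616-622. Extracting spatial and semantic relations between geographic entities from web texts demands high timeliness and strong robustness. This paper proposes an automatic method for open geo-entity relation extraction: bootstrapping is used to gather statistics on the part-of-speech, position, and distance features of words in order to compute the weights of words in context, from which the keywords describing geo-entity relations are determined and finally organized into structured instances. Experiments were conducted with Baidu Baike and Stanford CoreNLP. The results show that the method automatically mines some lexical features of natural language, requires neither domain expert knowledge nor large annotated corpora, and suits information extraction tasks with unknown relation types. Compared with the classical frequency-based statistics Frequency, TFIDF, and PPMI, precision and recall improve by about 5% and 23%, respectively.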

[ Yu L, Lu F, Liu X L.A bootstrapping based approach for open geo-entity relation extraction[J]. Acta Geodaetica et Cartographica Sinica, 2016,45(5):616-622. ]

[53]
余丽,陆锋,刘希亮,等.稀疏地理实体关系的关键词提取方法[J].地球信息科学学报,2016,18(11):1465-1475.

[ Yu L, Lu F, Liu X, et al.A method of context enhanced keyword extraction for sparse geo-entity relation[J]. Journal of Geo-information Science, 2016,18(11):1465-1475. ]

[54]
Zong N, Nam S, Eom J H, et al.Aligning ontologies with subsumption and equivalence relations in Linked Data[J]. Knowledge-Based Systems, 2015,76:30-41.With the profusion of RDF resources and Linked Data, ontology alignment has gained significance in providing highly comprehensive knowledge embedded in disparate sources. Ontology alignment, however, in Linking Open Data (LOD) has traditionally focused more on the instance-level rather than the schema-level. Linked Data supports schema-level alignment, provided that instance-level alignment is already established. Linked Data is a hotbed for instance-based schema alignment, which is considered a better solution for aligning classes with ambiguous or obscure names. This study proposes an instance-based schema alignment algorithm, IUT, which builds a unified taxonomy to discover subsumption and equivalence relations between two classes. A scaling algorithm is also developed that reduces pair-wise similarity computations during the taxonomy construction. The IUT is tested with DBpedia and YAGO2, and compared with two state-of-the-art schema alignment algorithms in light of four alignment tasks with different combinations of the two data sets. The experiment results show that the IUT outperforms the two algorithms in efficiency and effectiveness, and demonstrate the IUT can provide an instance-based schema alignment solution with scalability and high performance, for ontologies containing a large number of instances in LOD.

[55]
Udrea O, Getoor L, Miller R J.Leveraging data and structure in ontology integration[C]. Proceedings of the 2007 ACM SIGMOD international conference on Management of data, ACM, 2007:449-460.

[56]
Suchanek F M, Abiteboul S, Senellart P.PARIS: Probabilistic alignment of relations, instances and schema[J]. Proceedings of the VLDB Endowment, 2011,5(3):157-168. One of the main challenges that the Semantic Web faces is the integration of a growing number of independently designed ontologies. In this work, we present PARIS, an approach for the automatic alignment of ontologies. PARIS aligns not only instances, but also relations and classes. Alignments at the instance level cross-fertilize with alignments at the schema level. Thereby, our system provides a truly holistic solution to the problem of ontology alignment. The heart of the approach is probabilistic, i.e., we measure degrees of matchings based on probability estimates. This allows PARIS to run without any parameter tuning. We demonstrate the efficiency of the algorithm and its precision through extensive experiments. In particular, we obtain a precision of around 90% in experiments with some of the world's largest ontologies.

[57]
Wang Z, Li J, Zhao Y, et al.A unified approach to matching semantic data on the web[J]. Knowledge-Based Systems, 2013,39(2):173-184.In recent years, the Web has evolved from a global information space of linked documents to a space where data are linked as well. The Linking Open Data (LOD) project has enabled a large number of semantic datasets to be published on the Web. Due to the open and distributed nature of the Web, both the schema (ontology classes and properties) and instances of the published datasets may have heterogeneity problems. In this context, the matching of entities from different datasets is important for the integration of information from different data sources. Recently, much work has been conducted on ontology matching to resolve the schema heterogeneity problem in the semantic datasets. However, there is no unified framework for matching both schema entities and instances. This paper presents a unified matching approach to finding equivalent entities in ontologies and LOD datasets on the Web. The approach first combines multiple lexical matching strategies using a novel voting-based aggregation method; then it utilizes the structural information and the already found correspondences to discover additional ones. We evaluated our approach using datasets from both OAEI and LOD. The results show that the voting-based aggregation method provides highly accurate matching results, and that the structural propagation procedure effectively improves the recall of the results.

[58]
Delgado F, Martinez-Gonzalez M M, Finat J, et al. An evaluation of ontology matching techniques on geospatial ontologies[J]. International Journal of Geographical Information Science, 2013,27(12):2279-2301. Standardization is one of the pillars of interoperability. In this context, efforts promoted by the Open Geospatial Consortium, such as CityGML (Technical University, Berlin), a standard for exchanging three-dimensional models or urban city objects, are welcomed. However, information from other domains of interest (e.g. energy efficiency or building information modeling) is needed for tasks such as land planning, large-scale flooding analysis, or demand/supply energy simulations. CityGML allows extension in order to integrate information from other domains, but the development process is expensive because there is no way to perform it automatically. The discovery of correspondences between CityGML concepts and other domains concepts poses a significant challenge. Ontology matching is the research field emerged from the Semantic Web to address automatic ontology integration. Using the ontology underlying CityGML and the ontologies which model other domains of interest, ontology matching would be able to find the correspondences that would permit the integration in a more automatic manner than it is done now. In this paper, we evaluate if ontology matching techniques allow performing an automatic integration of geospatial information modeled from different viewpoints. In order to achieve this, an evaluation methodology was designed, and it was applied to the discovery of relationships between CityGML and ontologies coming from the building information modeling and Geospatial Semantic Web domains. The methodology and the results of the evaluation are presented. The best results have been achieved using string-based techniques, while matching systems give the worst precision and recall. Only in a few cases the values are over 50%, which shows the limitations when these techniques are applied to ontologies with a partial overlap.

[59]
Hess G N, Iochpe C, Castano S.An algorithm and implementation for geo-ontologies integration[C]. VIII Brazilian Symposium on Geoinformatics, 2006:109-120.

[60]
Auer S, Lehmann J, Hellmann S.Linkedgeodata: Adding a spatial dimension to the web of data[C]. Proceedings of the 8th International Semantic Web Conference (ISWC 2009), 2009:731-746.

[61]
Zhang C, Zhao T, Li W.The framework of a geospatial semantic web-based spatial decision support system for Digital Earth[J]. International Journal of Digital Earth, 2010,3(2):111-134.While significant progress has been made to implement the Digital Earth vision, current implementation only makes it easy to integrate and share spatial data from distributed sources and has limited capabilities to integrate data and models for simulating social and physical processes. To achieve effectiveness of decision-making using Digital Earth for understanding the Earth and its systems, new infrastructures that provide capabilities of computational simulation are needed. This paper proposed a framework of geospatial semantic web-based interoperable spatial decision support systems (SDSSs) to expand capabilities of the currently implemented infrastructure of Digital Earth. Main technologies applied in the framework such as heterogeneous ontology integration, ontology-based catalog service, and web service composition were introduced. We proposed a partition-refinement algorithm for ontology matching and integration, and an algorithm for web service discovery and composition. The proposed interoperable SDSS enables decision-makers to reuse and integrate geospatial data and geoprocessing resources from heterogeneous sources across the Internet. Based on the proposed framework, a prototype to assist in protective boundary delimitation for Lunan Stone Forest conservation was implemented to demonstrate how ontology-based web services and the services-oriented architecture can contribute to the development of interoperable SDSSs in support of Digital Earth for decision-making.

[62]
Giunchiglia F, Maltese V, Farazi F, et al.GeoWordNet: A resource for geo-spatial applications[C]. Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), 2010:121-136.

[63]
Ballatore A, Wilson D C, Bertolotto M.A survey of volunteered open geo-knowledge bases in the semantic web[M]. Quality issues in the management of web information. Springer Berlin Heidelberg, 2013:93-120.

[64]
Ballatore A, Bertolotto M, Wilson D C.Linking geographic vocabularies through WordNet[J]. Annals of GIS, 2014,20(2):73-84.The linked open data (LOD) paradigm has emerged as a promising approach to structuring and sharing geospatial information. One of the major obstacles to this vision lies in the difficulties found in the automatic integration between heterogeneous vocabularies and ontologies that provides the semantic backbone of the growing constellation of open geo-knowledge bases. In this article, we show how to utilize WordNet as a semantic hub to increase the integration of LOD. With this purpose in mind, we devise , an unsupervised mapping technique between a given vocabulary and WordNet, combining intensional and extensional aspects of the geographic terms. is evaluated against a sample of human-generated alignments with the OpenStreetMap (OSM) Semantic Network, a crowdsourced geospatial resource, and the GeoNames ontology, the vocabulary of a large digital gazetteer. These empirical results indicate that the approach can obtain high precision and recall.

[65]
Stadler C, Lehmann J, Höffner K, et al.Linkedgeodata: A core for a web of spatial open data[J]. Semantic Web, 2012,3(4):333-354. The Semantic Web eases data and information integration tasks by providing an infrastructure based on RDF and ontologies. In this paper, we contribute to the development of a spatial Data Web by elaborating on how the collaboratively collected OpenStreetMap data can be interactively transformed and represented adhering to the RDF data model. This transformation will simplify information integration and aggregation tasks that require comprehensive background knowledge related to spatial features such as ways, structures, and landscapes. We describe how this data is interlinked with other spatial data sets, how it can be made accessible for machines according to the Linked Data paradigm and for humans by means of several applications, including a faceted geo-browser. The spatial data, vocabularies, interlinks and some of the applications are openly available in the LinkedGeoData project.

[66]
Yu L, Liu X, Li M, et al.A holistic framework of geographical semantic web aligning[C]. Proceedings of the 10th Workshop on Geographic Information Retrieval, ACM, 2016:1.

[67]
Vilches-Blázquez L M, Villazón-Terrazas B, Corcho O, et al. Integrating geographical information in the Linked Digital Earth[J]. International Journal of Digital Earth, 2014,7(7):554-575. Many progresses have been made since the Digital Earth notion was envisioned thirteen years ago. However, the mechanism for integrating geographic information into the Digital Earth is still quite limited. In this context, we have developed a process to generate, integrate and publish geospatial Linked Data from several Spanish National data-sets. These data-sets are related to four Infrastructure for Spatial Information in the European Community (INSPIRE) themes, specifically with Administrative units, Hydrography, Statistical units, and Meteorology. Our main goal is to combine different sources (heterogeneous, multidisciplinary, multitemporal, multiresolution, and multilingual) using Linked Data principles. This goal allows the overcoming of current problems of information integration and driving geographical information toward the next decade scenario, that is, Linked Digital Earth.

[68]
Goodwin J, Dolbear C, Hart G.Geographical linked data: The administrative geography of Great Britain on the semantic web[J]. Transactions in GIS, 2008,12(s1):19-30.Ordnance Survey, the national mapping agency of Great Britain, is investigating how semantic web technologies assist its role as a geographical information provider. A major part of this work involves the development of prototype products and datasets in RDF. This article discusses the production of an example dataset for the administrative geography of Great Britain, demonstrating the advantages of explicitly encoding topological relations between geographic entities over traditional spatial queries. We also outline how these data can be linked to other datasets on the web of linked data and some of the challenges that this raises.

[69]
Koubarakis M.Linked open earth observation data: The LEO project[C]. Image Information Mining Conference: The Sentinels Era, 2014:1-4.

[70]
Patroumpas K, Giannopoulos M A G, Athanasiou S. TripleGeo: An ETL Tool for transforming geospatial data into RDF triples[C]. Proceedings of the Workshops of the EDBT/ICDT 2014 Joint Conference (EDBT/ICDT 2014), 2014:275-278.

[71]
Zhu Y, Zhu A X, Song J, et al.Multidimensional and quantitative interlinking approach for Linked Geospatial Data[J]. International Journal of Digital Earth, 2016:1-21. doi: 10.1080/17538947.2016.1266041

[72]
Resource Description Framework (RDF): Concepts and Abstract Syntax[EB/OL]. ,2004.

[73]
Kyzirakos K, Karpathiotakis M, Koubarakis M.Strabon: A Semantic Geospatial DBMS[C]. Proceedings of the 11th International Semantic Web Conference (ISWC 2012), 2012:295-311.

[74]
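段红伟,孟令奎,黄长青,等.面向SPARQL查询的地理语义空间索引构建方法[J].测绘学报,2014,43(2):193-199. To achieve fast and effective spatial queries over geo-semantic data, this paper analyzes traditional RDF (resource description framework) data organization methods and spatial indexes, proposes a geospatial quad (GeoQuad) model, and builds a geo-semantic spatial index on that model. Finally, Jena, ARQ, and the JTS Topology Suite are used to implement geo-semantic spatial queries that support the semantic query standard SPARQL. Experiments show that the method is efficient and feasible: it can quickly locate spatial RDF nodes and quickly execute spatial queries that return RDF results.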

[ Duan H W, Meng L K, Huang C Q, et al.A Method for Geo Semantic Spatial Index on SPARQL Query[J]. Acta Geodaetica et Cartographica Sinica, 2014,43(2):193-199. ]

[75]
Gür N, Pedersen T B, Zimányi E, et al.A foundation for spatial data warehouses on the semantic web[J]. Semantic Web,2016(Preprint):1-31.Abstract. Large volumes of geospatial data is being published on the Semantic Web (SW), yielding a need for advanced analysis of such data. However, existing SW technologies only support advanced analytical concepts such as multidimensional (MD) data warehouses and Online Analytical Processing (OLAP) over non-spatial SW data. To remedy this need, this paper presents the QB4SOLAP vocabulary which supports spatially enhanced MD data cubes over RDF data. The paper also defines a number of Spatial OLAP (SOLAP) operators over QB4SOLAP cubes and provides algorithms for generating spatially extended SPARQL queries from the SOLAP operators. The proposals are validated by applying them to a realistic use case.

[76]
Neumann T, Weikum G.RDF-3X: a RISC-style engine for RDF[J]. Proceedings of the VLDB Endowment, 2008,1(1):647-659. RDF is a data representation format for schema-free structured information that is gaining momentum in the context of Semantic-Web corpora, life sciences, and also Web 2.0 platforms. The "pay-as-you-go" nature of RDF and the flexible pattern-matching capabilities of its query language SPARQL entail efficiency and scalability challenges for complex queries including long join paths. This paper presents the RDF-3X engine, an implementation of SPARQL that achieves excellent performance by pursuing a RISC-style architecture with a streamlined architecture and carefully designed, puristic data structures and operations. The salient points of RDF-3X are: 1) a generic solution for storing and indexing RDF triples that completely eliminates the need for physical-design tuning, 2) a powerful yet simple query processor that leverages fast merge joins to the largest possible extent, and 3) a query optimizer for choosing optimal join orders using a cost model based on statistical synopses for entire join paths. The performance of RDF-3X, in comparison to the previously best state-of-the-art systems, has been measured on several large-scale datasets with more than 50 million RDF triples and benchmark queries that include pattern matching and long join paths in the underlying data graphs.

[77]
Weiss C, Karras P, Bernstein A.Hexastore: Sextuple indexing for semantic web data management[J]. Proceedings of the VLDB Endowment, 2008,1(1):1008-1019. Despite the intense interest towards realizing the Semantic Web vision, most existing RDF data management schemes are constrained in terms of efficiency and scalability. Still, the growing popularity of the RDF format arguably calls for an effort to offset these drawbacks. Viewed from a relational-database perspective, these constraints are derived from the very nature of the RDF data model, which is based on a triple format. Recent research has attempted to address these constraints using a vertical-partitioning approach, in which separate two-column tables are constructed for each property. However, as we show, this approach suffers from similar scalability drawbacks on queries that are not bound by RDF property value. In this paper, we propose an RDF storage scheme that uses the triple nature of RDF as an asset. This scheme enhances the vertical partitioning idea and takes it to its logical conclusion. RDF data is indexed in six possible ways, one for each possible ordering of the three RDF elements. Each instance of an RDF element is associated with two vectors; each such vector gathers elements of one of the other types, along with lists of the third-type resources attached to each vector element. Hence, a sextuple-indexing scheme emerges. This format allows for quick and scalable general-purpose query processing; it confers significant advantages (up to five orders of magnitude) compared to previous approaches for RDF data management, at the price of a worst-case five-fold increase in index space. We experimentally document the advantages of our approach on real-world and synthetic data sets with practical queries.

[78]
Zou L, Mo J, Chen L, et al.gStore: answering SPARQL queries via subgraph matching[J]. Proceedings of the VLDB Endowment, 2011,4(8):482-493. Due to the increasing use of RDF data, efficient processing of SPARQL queries over RDF datasets has become an important issue. However, existing solutions suffer from two limitations: 1) they cannot answer SPARQL queries with wildcards in a scalable manner; and 2) they cannot handle frequent updates in RDF repositories efficiently. Thus, most of them have to reprocess the dataset from scratch. In this paper, we propose a graph-based approach to store and query RDF data. Rather than mapping RDF triples into a relational database as most existing methods do, we store RDF data as a large graph. A SPARQL query is then converted into a corresponding subgraph matching query. In order to speed up query processing, we develop a novel index, together with some effective pruning rules and efficient search algorithms. Our method can answer exact SPARQL queries and queries with wildcards in a uniform manner. We also propose an effective maintenance algorithm to handle online updates over RDF repositories. Extensive experiments confirm the efficiency and effectiveness of our solution.

[79]
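富丽贞,孟小峰.大规模图数据可达性索引技术:现状与展望[J].计算机研究与发展,2015,52(1):116-129. With the rapid development of emerging fields such as social networks, bioinformatics networks, and ontologies, large amounts of graph data have appeared in real applications. Reachability queries are among the most basic queries on directed graphs. When a graph is very small, reachability queries are easily handled by depth-first search (DFS) or by the reachability transitive closure. As graphs grow, however, neither remains applicable: DFS is too slow at query time, and the transitive closure takes too much storage. Many reachability indexing methods have therefore been proposed. They are widely used across computer science, including software engineering, programming languages, distributed computing, social network analysis, biological network analysis, XML and RDF databases, and route planning. Reachability indexes can also accelerate other graph algorithms, such as shortest path queries and subgraph pattern matching. This paper first introduces the application background of reachability indexing. It then classifies existing work by the data scale, data type, and query category supported, and compares representative work in each class. Finally, it discusses the problems of existing reachability indexing methods for large-scale graph data and points out future research directions.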

[ Fu L Z, Meng X F.Reachability indexing for large-scale graphs: studies and forecast[J]. Journal of Computer Research and Development, 2015,52(1):116-129. ]

[80]
邹磊,陈跃国.海量RDF数据管理[J].中国计算机学会通讯,2012,8(11):32-43.

[ Zou L, Chen Y G.Data management of massive RDF[J].Communications of China Computer Federation, 2012,8(11):32-43. ]

[81]
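王林彬,黎建辉,沈志宏.基于NoSQL的RDF数据存储与查询技术综述[J].计算机应用研究,2015,32(5):1281-1286. With the development of the Semantic Web and the rapid growth of RDF (resource description framework) data, using NoSQL databases to store and manage large-scale RDF data has become a current research focus. This paper introduces the categories of NoSQL databases and the characteristics of each type, describes the state of research on storage structure design and parallel query algorithms for RDF data in the various kinds of NoSQL databases, and compares the strengths and weaknesses of the different methods. Finally, it discusses the advantages of managing RDF with NoSQL databases, summarizes the shortcomings of existing research, and outlines future research directions.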

[ Wang L B, Li J H, Shen Z H.Overview of NoSQL databases for large scaled RDF data management[J]. Application Research of Computers, 2015,32(5):1281-1286. ]

[82]
Brodt A, Nicklas D, Mitschang B.Deep integration of spatial query processing into native RDF triple stores[C]. Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2010:33-42.

[83]
Liagouris J, Mamoulis N, Bouros P, et al.An effective encoding scheme for spatial RDF data[J]. Proceedings of the VLDB Endowment, 2014,7(12):1271-1282.The RDF data model has recently been extended to support representation and querying of spatial information (i.e., locations and geometries), which is associated with RDF entities. Still, there are limited efforts towards extending RDF stores to efficiently support spatial queries, such as range selections (e.g., find entities within a given range) and spatial joins (e.g., find pairs of entities whose locations are close to each other). In this paper, we propose an extension for RDF stores that supports efficient spatial data management. Our contributions include an effective encoding scheme for entities having spatial locations, the introduction of on-the-fly spatial filters and spatial join algorithms, and several optimizations that minimize the overhead of geometry and dictionary accesses. We implemented the proposed techniques as an extension to the opensource RDF-3X engine and we experimentally evaluated them using real RDF knowledge bases. The results show that our system offers robust performance for spatial queries, while introducing little overhead to the original query engine.

[84]
Wang D, Zou L, Feng Y, et al.S-store: An engine for large RDF graph integrating spatial information[C]. Proceedings of the 18th International Conference on Database Systems for Advanced Applications (DASFAA 2013), 2013:31-47.
