Journal of Geo-information Science ›› 2016, Vol. 18 ›› Issue (11): 1465-1475. doi: 10.3724/SP.J.1047.2016.01465
YU Li (余丽)1,2, LU Feng (陆锋)1,*, LIU Xiliang (刘希亮)1, CHENG Shifen (程诗奋)1,2, ZHANG Xueying (张雪英)3
Received: 2016-07-18
Revised: 2016-09-22
Online: 2016-11-20
Published: 2016-11-20
Corresponding author: LU Feng
E-mail: yul@lreis.ac.cn; luf@lreis.ac.cn
About the author: YU Li (1986-), female, PhD candidate; her research interest is Internet spatial information search.
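Abstract:
Extracting geo-entity relations from web text requires keyword extraction methods that are both timely and robust. Compared with supervised learning, unsupervised learning can capture the dynamically changing characteristics of text and discover newly emerging relation types, and it has therefore attracted considerable attention. Among unsupervised approaches, frequency-based keyword extraction has been widely studied; however, geo-entity relations are sparsely distributed in web text, so frequency-based methods are difficult to apply directly to keyword extraction for geo-entity relations. To address this problem, this paper proposes a context-enhanced keyword extraction method built on publicly accessible web resources. First, using an online encyclopedia and an open synonym dictionary, enhanced contexts are created through context merging and semantic fusion, which reduces the sparsity of the terms in a context. Next, the Domain Frequency and Entropy frequency statistics are used to build a large-scale corpus automatically from the enhanced contexts. Then, lexical features are selected from this corpus and their weights are computed, in order to widen the differences among the terms in a context. Finally, the selected lexical features are used to measure the importance of each term in the enhanced context, and the term with the largest weight is taken as the keyword describing the geo-entity relation. Experiments were carried out on large-scale real web text. The results show that, for recognizing keywords of geo-entity relations, the proposed method reaches an average precision of 85.5%, which is 41% and 36% higher than the Domain Frequency and Entropy methods, respectively; for recognizing newly emerging keywords, its precision reaches 60.3%. The context-enhanced keyword extraction method handles the sparse distribution of geo-entity relations effectively and can support the extraction of geo-entity relations from web text.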
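The effect of context merging and semantic fusion on term sparsity can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the synonym dictionary, the three pre-segmented contexts, and the helper names `merge_contexts`, `fuse_synonyms` and `top_keywords` are hypothetical, and plain term counts stand in for the lexical-feature weights, whose actual definition is not given on this page.

```python
from collections import Counter

# Hypothetical synonym dictionary: maps each variant to one canonical term.
SYNONYMS = {"隶属": "属", "属于": "属"}


def merge_contexts(contexts):
    """Context merging: pool the terms of all contexts for one entity pair."""
    merged = []
    for terms in contexts:
        merged.extend(terms)
    return merged


def fuse_synonyms(terms, synonyms):
    """Semantic fusion: replace each term by its canonical synonym, if any."""
    return [synonyms.get(term, term) for term in terms]


def top_keywords(terms):
    """Return every term tied for the highest count.

    Raw counts are only a stand-in for the weights used in the paper.
    """
    counts = Counter(terms)
    best = max(counts.values())
    return sorted(term for term, count in counts.items() if count == best)


# Three made-up, already word-segmented contexts for the pair (大夏河, 黄河).
contexts = [
    ["中部", "河流", "属"],
    ["中部", "隶属"],
    ["支流", "属于"],
]

merged = merge_contexts(contexts)
print(top_keywords(merged))                           # before semantic fusion
print(top_keywords(fuse_synonyms(merged, SYNONYMS)))  # after semantic fusion
```

On this toy data the raw counts favour 中部 ("central part"), because the three variants of "belongs to" are each counted separately; once they are fused into 属, that term has the highest count. This is only meant to show why pooling contexts and collapsing synonyms can help a frequency-style score when the relation keyword is sparse.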
YU Li, LU Feng, LIU Xiliang, CHENG Shifen, ZHANG Xueying. A Method of Context Enhanced Keyword Extraction for Sparse Geo-entity Relation (稀疏地理实体关系的关键词提取方法)[J]. Journal of Geo-information Science, 2016, 18(11): 1465-1475. DOI: 10.3724/SP.J.1047.2016.01465
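Table 5
Analysis of common errors in keyword instance extraction
Type | Description | Example | Error rate, proposed method (%) | Error rate, DF (%) | Error rate, Entropy (%) |
---|---|---|---|---|---|
A | The keyword rarely appears in the text | "云台山除锦屏山外,其余均为海中岛屿,古称郁洲山或苍梧山。" (Except for Jinping Mountain, the rest of Yuntai Mountain consisted of islands in the sea; it was anciently called Yuzhou Mountain or Cangwu Mountain.) The extracted keyword instance is (云台山, 苍梧山, <岛屿>), i.e. "islands". The correct keyword is "古称" (anciently called), which occurs less often in the experimental data than "岛屿" | 6.3 | 14.3 | 18.4 |
B | Terms in the context show no significant difference in their feature values | "大夏河是甘肃省中部较大的河流,属黄河水系。" (The Daxia River is a relatively large river in central Gansu Province and belongs to the Yellow River system.) The extracted keyword instance is (大夏河, 黄河, <中部, 属>). The correct keyword is "属" (belongs to), but "中部" (central part) and "属" both receive the maximum weight | 2.5 | 5.4 | 3.1 |
C | When one sentence contains several different geo-entities, the keyword cannot distinguish among them | "北镇主要河流有绕阳河及其支流东沙河。" (The main rivers of Beizhen include the Raoyang River and its tributary, the Dongsha River.) The extracted keyword instance is (绕阳河, 东沙河, <河流>), i.e. "river" | 0.7 | 1.2 | 4.8 |
D | Keyword subject to a temporal constraint | "宝山县南宋属嘉定县。" (In the Southern Song dynasty, Baoshan County belonged to Jiading County.) The extracted keyword instance is (宝山县, 嘉定县, <属>) | 0.3 | 2.9 | 1.6 |
E | Keyword subject to a spatial constraint | "汉江以北属秦岭山区。" (The area north of the Han River belongs to the Qinling mountain area.) The extracted keyword instance is (汉江, 秦岭, <属>) | 0.5 | 2.1 | 1.4 |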
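As a reading aid, the per-category error rates in Table 5 can be totalled for each method. This is a small sketch rather than an analysis from the paper: it assumes the five categories A to E do not overlap (the table does not say so), and the dictionary keys and the helper `total_error` are hypothetical names.

```python
# Error rates (%) copied from Table 5, keyed by error category A-E.
ERROR_RATES = {
    "proposed": {"A": 6.3, "B": 2.5, "C": 0.7, "D": 0.3, "E": 0.5},
    "DF": {"A": 14.3, "B": 5.4, "C": 1.2, "D": 2.9, "E": 2.1},
    "Entropy": {"A": 18.4, "B": 3.1, "C": 4.8, "D": 1.6, "E": 1.4},
}


def total_error(rates):
    """Sum the category error rates, rounded to one decimal place."""
    return round(sum(rates.values()), 1)


# Total error rate per method, and the category that contributes most.
totals = {method: total_error(rates) for method, rates in ERROR_RATES.items()}
largest = {method: max(rates, key=rates.get) for method, rates in ERROR_RATES.items()}

for method in ERROR_RATES:
    print(f"{method}: total {totals[method]}%, largest category {largest[method]}")
```

Under that assumption, category A (keywords that rarely appear in the text) is the largest contributor for all three methods, which is consistent with the sparsity problem the abstract describes.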