Journal of Geo-information Science ›› 2021, Vol. 23 ›› Issue (7): 1208-1220. doi: 10.12082/dqxxkx.2021.200565

• Theory and Methods of Geo-information Science •

Extracting Entity and Relation of Landscape Plant's Knowledge based on ALBERT Model

CHEN Xiaoling1,2, TANG Liyu1,2,*, HU Ying1,2, JIANG Feng1,2, PENG Wei1,2, FENG Xianchao1,2   

  1. Key Laboratory of Spatial Data Mining & Information Sharing of Ministry of Education, Fuzhou University, Fuzhou 350108, China
  2. National Engineering Research Center of Geospatial Information Technology, Fuzhou University, Fuzhou 350108, China
  • Received: 2020-09-29 Revised: 2020-12-29 Online: 2021-07-25 Published: 2021-09-25
  • Contact: TANG Liyu
  • Supported by:
    National Natural Science Foundation of China (No. 41971344)

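Abstract (translated from Chinese):

A knowledge graph of landscape plants can provide knowledge support for selecting greening tree species with regard to factors such as regional adaptability, ornamental value, and ecological function. Entity recognition and relation extraction from plant description texts are key steps in building such a graph. Because no annotated dataset is publicly available for the plant domain, this paper describes the workflow for constructing a landscape plant dataset, defines a conceptual architecture for landscape plants, and builds a landscape plant corpus. Existing language models such as Word2vec, ELMo, and BERT suffer from shortcomings such as an inability to handle polysemous words, weak integration of context, and slow running speed, so we propose entity recognition and relation extraction models that embed the ALBERT (A Lite BERT) pre-trained language model. The dynamic word vectors pre-trained by ALBERT represent text features effectively; they are fed into a BiGRU-CRF named entity recognition model and a BiGRU-Attention relation extraction model for training, further improving entity recognition and relation extraction. The method was validated on the landscape plant corpus. The ALBERT-BiGRU-CRF named entity recognition model achieved an F1 score of 0.9517 and the ALBERT-BiGRU-Attention relation extraction model an F1 score of 0.9161, a fairly significant improvement over classic language models such as Word2vec, ELMo, and BERT. Entity and relation extraction based on the ALBERT model therefore effectively improves recognition and classification, can be applied to entity and relation extraction from plant description texts, and offers a method for automatically constructing a landscape plant knowledge graph.

Key words: knowledge graph, information extraction, corpus, landscape plant, ALBERT, word vectors, entity recognition, relation extraction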
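
The F1 values reported above combine precision and recall into a single score. The abstract does not say how matches were counted, so the following is only a minimal sketch that assumes exact-match, entity-level scoring; the function name `entity_f1`, the span format, and the entity type names are hypothetical and are not taken from the paper.

```python
def entity_f1(gold, predicted):
    """Entity-level precision, recall and F1 over (start, end, type) spans.

    Assumption: a predicted entity counts as correct only if its boundaries
    and its type both match a gold entity exactly.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (start offset, end offset, entity type).
gold = [(0, 2, "PLANT"), (5, 7, "FAMILY"), (10, 12, "HABITAT"), (15, 17, "COLOR")]
predicted = [(0, 2, "PLANT"), (5, 7, "FAMILY"), (10, 13, "HABITAT")]

precision, recall, f1 = entity_f1(gold, predicted)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

In this toy case two of three predictions are exact matches and two of four gold entities are recovered, so the boundary error on the third span lowers both precision and recall.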

Abstract:

A knowledge graph of landscape plants can support the selection of greening tree species by accounting for regional adaptability, ornamental value, and ecological factors. Extracting entities and relations from plant description text is a key step in constructing such a graph. No annotated dataset is publicly available for the plant domain, so this paper defines and briefly describes a conceptual architecture for landscape plants and constructs a landscape plant corpus. Existing language models such as word2vec, ELMo, and BERT have various shortcomings, including an inability to handle polysemous words, weak context fusion, and low computational efficiency. We propose a named entity recognition model, ALBERT-BiGRU-CRF, and a relation extraction model, ALBERT-BiGRU-Attention, both of which embed the pre-trained language model ALBERT (A Lite BERT, where BERT stands for Bidirectional Encoder Representations from Transformers). In the ALBERT-BiGRU-CRF model, ALBERT extracts text features, the BiGRU layer learns deep semantic features between sentences, and the CRF layer computes the probability distribution over annotation sequences to determine the entities contained in the description text. The ALBERT-BiGRU-Attention model builds on the output of the named entity recognition model and uses an attention layer to increase the weight of keywords when determining the relation between entities. The proposed models have two advantages: (1) they effectively identify and extract the entities and relations of landscape plant knowledge; (2) they represent character-level semantic features and sentence-level features with good accuracy. The method was validated on the landscape plant corpus constructed in this paper and compared with other models.
The quantitative evaluation shows that: (1) the ALBERT-BiGRU-CRF model achieved an F1 score of 0.9517, indicating good performance on named entity recognition and effective identification of the 23 main entity types; (2) in comparative experiments and analysis of the relation extraction results, the ALBERT-BiGRU-Attention model achieved an F1 score of 0.9161, indicating good performance on relation extraction for landscape plants; (3) on six representative examples selected to further evaluate extraction performance, the method identified the knowledge triples in common single-relation and multi-relation texts well. Entity and relation extraction based on the ALBERT model therefore improves recognition and extraction results. It can be applied to plant description text and provides a method for automatically constructing a landscape plant knowledge graph.

Key words: knowledge graph, information extraction, landscape plant corpus, landscape plant, ALBERT, word vectors, entity recognition, relation extraction
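
The abstract states that the CRF layer scores whole annotation sequences rather than tagging each character independently. The sketch below illustrates only the decoding step of a linear-chain CRF (Viterbi search), written in plain Python under stated assumptions: the three-tag BIO scheme, the emission scores (standing in for BiGRU outputs), and the transition penalty are all invented for illustration, and the function `viterbi_decode` is hypothetical rather than the authors' implementation.

```python
def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence for one sentence.

    emissions:   list of {tag: score}, one dict per token.
    transitions: {(previous_tag, tag): score}; missing pairs score 0.0.
    """
    # best[t][tag] is the best score of any path that ends in `tag` at token t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    backpointers = []
    for scores in emissions[1:]:
        current, pointers = {}, {}
        for tag in tags:
            # Pick the predecessor that maximises path score plus transition.
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, tag), 0.0))
            current[tag] = best[-1][prev] + transitions.get((prev, tag), 0.0) + scores[tag]
            pointers[tag] = prev
        best.append(current)
        backpointers.append(pointers)
    # Trace back from the best final tag to recover the full path.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


tags = ["B", "I", "O"]
# Hypothetical per-token scores; taken alone they favour the invalid path O, I, O.
emissions = [
    {"B": 1.5, "I": 0.0, "O": 2.0},
    {"B": 0.0, "I": 2.0, "O": 0.5},
    {"B": 0.0, "I": 0.0, "O": 2.0},
]
# An inside tag may not follow an outside tag, so that transition is penalised.
transitions = {("O", "I"): -10.0}

path = viterbi_decode(emissions, transitions, tags)
greedy = [max(tags, key=lambda t: scores[t]) for scores in emissions]
print("greedy: ", greedy)
print("viterbi:", path)
```

Per-token argmax yields an I tag directly after an O tag, which is not a valid BIO sequence; the transition penalty lets sequence-level decoding recover a well-formed entity span instead.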