基于机器学习的稀疏样本下的土壤有机质估算方法
刘明杰(1995— ),男,贵州贵阳人,硕士生,研究方向为地理信息系统。E-mail:2210478688@qq.com |
收稿日期: 2019-08-13
要求修回日期: 2019-12-14
网络出版日期: 2020-11-25
基金资助
国家重点研发计划课题(2017YFD0801205)
北京市农林科学院科技创新能力建设专项(KJCX20170407)
北京市农林科学院科技创新能力建设专项(KJCX20200414)
湖南省教育厅资助科研项目(13B129)
湖南省工程实验室开放基金资助项目(KFJ180602)
版权
Estimating Soil Organic Matter based on Machine Learning Under Sparse Sample
Received date: 2019-08-13
Request revised date: 2019-12-14
Online published: 2020-11-25
Supported by
The National Key Research and Development Program of China(2017YFD0801205)
The Science and Technology Innovation Capacity Building Project of Beijing Academy of Agriculture and Forestry Sciences(KJCX20170407)
The Science and Technology Innovation Capacity Building Project of Beijing Academy of Agriculture and Forestry Sciences(KJCX20200414)
Scientific Research Project Funded by The Education Department of Hunan Province(13B129)
Project Supported by Open Fund of Hunan Engineering Laboratory(KFJ180602)
Copyright
采用GRNN(Generalized Regression Neural Network)和RF(Random Forest)2种机器学习方法构建土壤有机质预测模型,以提高稀疏样本情况下的土壤有机质估算精度。依据北京市大兴区农用地2007年的土壤有机质采样数据,按MMSD准则(Minimization of the Mean of the Shortest Distances)抽稀为8种不同采样密度的样本(分别为2703、1352、676、339、169、85、43、22个样本),分别采用GRNN、RF和Ordinary kriging对各采样密度下的未知采样点进行预测,采用交叉检验的方式验证各采样密度下未知样点的预测精度。随着采样点密度的下降,样点间的空间自相关性逐渐减弱,半变异函数的拟和精度变差,预测点结果误差增大,预测的置信度降低。当抽稀到43个和22个采样点时,样点间的空间自相关性接近歼灭,半变异函数的决定系数较低且残差较大。普通克里格受到采样点数量和采样密度、样点的空间结构的影响比较明显,其预测精度随采样点数量的下降而下降。在85个采样点及以下时,其预测值与观测值之间没有显著的相关性。GRNN和RF的预测精度受采样密度的影响不大,其预测精度在一个较小的范围内波动,其预测值围绕观测值在一定阈值空间内震荡波动,具有较好的相关性,在85个及以下的采样密度时,预测精度相对普通克里格有较大的提升。普通克里格法不适合在稀疏样本条件下空间插值计算,尤其是在空间自相关性比较弱的情况下。机器学习模型能充分学习土壤间环境信息、样点空间邻近效应信息,兼顾属性相似性和空间自相关,具有更好的稳定性和适应性,不容易受到采样点数量、构型和采样密度等因素的影响,即使在采样点空间自相关性很弱的情况下也能做出稳定预测精度。
刘明杰 , 徐卓揆 , 郜允兵 , 杨晶 , 潘瑜春 , 高秉博 , 周艳兵 , 周万鹏 , 王凌 . 基于机器学习的稀疏样本下的土壤有机质估算方法[J]. 地球信息科学学报, 2020 , 22(9) : 1799 -1813 . DOI: 10.12082/dqxxkx.2020.190441
To improve the accuracy of soil organic estimation in the case of sparse samples and to construct the soil organic predictive models applying the machine learning methods, GRNN (Generalized Regression Neural Network) and RF(Random Forest). The soil was diluted into 8 samples with different sampling density (2703, 1352, 676, 339, 169, 85, 43, 22 samples) according to the soil organic matter sampling data of Daxing agricultural land in 2007 applying the MMSD (Minimization of the Mean of the Shortest Distances) criterion. GRNN (Generalized Regression Neural Network), RF (random forest) and Ordinary Kriging are applied to predict each sampling density espectively. Cross Validation is used to verify the prediction accuracy of unknown samples at each sampling density. With the decrease of sampling point density, the spatial correlation between sampling points decreases gradually, thus the semivariogram's fitting precision deteriorates, the errorofprediction point result increases, and the confidence of the prediction decreases. The spatial correlation between sampling points is close to disappear when the sample is diluted under 43 and 22 samples, and the coefficient of determination of the semivariogram function is low and the residual is large. The impacts the Ordinary Kriging receives, which are from the changes in the number of the sampling points, sampling density and spatial structures of samples is obvious. The prediction accuracy of the method decreases with the decrease of the number of sampling points. There is no significant correlation between the predicted values and the observed values at or below 85 sampling points. The prediction accuracy of GRNN and RF is almost independent of the sampling density. The predicted values fluctuate within a certain threshold space around the observed values, and has good correlation. At sampling points of 85 and below, the prediction accuracy is greatly improved compared with Ordinary Kriging. Ordinary Kriging is not suitable for spatial interpolating calculation in the case of sparse samples, especially in the case of weak spatial correlation. The machine learning models can fully learn the environmental information and spatial proximity information of soil sampling points. They combine attribute similarity and spatial correlation and have better stability and adaptability, not being easy to be affected by the number of sampling points, configuration and sampling density, and can make stable and accurate predictions even when the spatial autocorrelation between sampling points is very weak.
表1 土壤有机质含量方差分析Tab. 1 Variance analysis of soil organic matter |
方差来源 | 偏差平方和 | 自由度Df | 均方 | F | P | |
---|---|---|---|---|---|---|
用地类型 | 组间 | 351.750 | 4 | 87.938 | 6.871 | 0.000 |
组内 | 1023.868 | 80 | 12.796 | |||
总体 | 1375.618 | 84 | ||||
土壤质地 | 组间 | 356.405 | 3 | 118.802 | 9.442 | 0.000 |
组内 | 1019.213 | 81 | 12.583 | |||
总体 | 1375.618 | 84 | ||||
畜禽粪便利用强度 | 组间 | 241.351 | 5 | 48.270 | 3.362 | 0.008 |
组内 | 1134.267 | 79 | 14.358 | |||
总体 | 1375.618 | 84 | ||||
土壤类型 | 组间 | 0.945 | 2 | 0.472 | 0.028 | 0.972 |
组内 | 1374.674 | 82 | 16.764 | |||
总体 | 1375.618 | 84 | ||||
植被指数 | 组间 | 17.592 | 2 | 8.796 | 0.531 | 0.590 |
组内 | 1358.027 | 82 | 16.561 | |||
总体 | 1375.618 | 84 |
表2 研究区所有实验组土壤有机质含量描述性统计Tab. 2 Descriptive statistics of soil organic matter in all experimental groups in the study area |
实验组 | 极大值/ (g/kg) | 极小值/ (g/kg) | 平均值/ (g/kg) | 标准差/ (g/kg) | 变异 系数 | 偏度 | 峰度 | K-S双侧 显著性 |
---|---|---|---|---|---|---|---|---|
D2703 | 24.73 | 1.20 | 10.49 | 3.91 | 37.28 | 0.147 | -0.098 | 0.302 |
D1352 | 24.73 | 1.65 | 10.53 | 3.92 | 37.23 | 0.159 | -0.003 | 0.632 |
D676 | 24.73 | 1.65 | 10.58 | 4.11 | 38.87 | 0.318 | 0.036 | 0.640 |
D339 | 24.73 | 1.81 | 10.55 | 3.94 | 37.32 | 0.223 | 0.292 | 0.946 |
D169 | 22.98 | 1.98 | 10.20 | 3.99 | 39.08 | 0.290 | 0.112 | 0.964 |
D85 | 18.05 | 2.03 | 10.93 | 4.05 | 37.01 | -0.181 | -0.804 | 0.934 |
D43_1 | 17.97 | 4.35 | 11.07 | 3.46 | 31.24 | -0.187 | -0.616 | 0.972 |
D43_2 | 17.73 | 2.98 | 10.07 | 4.17 | 41.39 | 0.122 | -0.893 | 0.951 |
D43_3 | 17.84 | 3.53 | 11.10 | 3.53 | 31.80 | -0.091 | -0.900 | 0.855 |
D43_4 | 19.67 | 2.02 | 10.33 | 4.04 | 39.13 | 0.128 | -0.064 | 0.906 |
D43_5 | 17.68 | 2.19 | 10.39 | 4.46 | 42.90 | -0.329 | -0.998 | 0.445 |
D22_1 | 18.28 | 3.53 | 10.70 | 4.30 | 40.13 | 0.293 | -0.988 | 0.611 |
D22_2 | 17.61 | 3.91 | 11.28 | 3.84 | 34.03 | -0.037 | -0.634 | 0.940 |
D22_3 | 16.40 | 3.72 | 10.49 | 3.60 | 34.29 | -0.380 | -0.692 | 0.783 |
D22_4 | 15.34 | 4.29 | 10.56 | 3.24 | 30.72 | -0.320 | -0.825 | 0.912 |
D22_5 | 17.86 | 2.36 | 9.35 | 4.36 | 46.60 | 0.260 | -0.540 | 0.999 |
注:为排除随机现象的干扰,43、22个采样点时各抽5组进行实验。 |
表3 土壤有机质含量分级标准Tab. 3 Soil organic matter content grading standard |
土壤属性 | 丰富 | 较丰富 | 中等 | 较缺乏 | 缺乏 | 极缺乏 |
---|---|---|---|---|---|---|
有机质/(g/kg) | >40 | 30~40 | 20~30 | 10~20 | 6~10 | < 6 |
表4 所有实验组GRNN、RF和普通克里格法的预测精度Tab. 4 Prediction accuracy of GRNN, RF and Ordinary Krigingin all experimental groups |
实验组 | RMSE | MRE/% | MAE | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Krige | GRNN | RF | Krige | GRNN | RF | Krige | GRNN | RF | |||
D2703 | 2.82 | 3.11 | 2.99 | 26.89 | 29.59 | 28.92 | 2.21 | 2.43 | 2.35 | ||
D1352 | 3.02 | 3.17 | 3.01 | 29.54 | 30.28 | 29.50 | 3.02 | 2.46 | 2.36 | ||
D676 | 3.45 | 3.34 | 3.23 | 33.00 | 31.94 | 30.50 | 2.74 | 2.65 | 2.57 | ||
D339 | 3.38 | 3.17 | 3.17 | 35.60 | 30.64 | 31.19 | 2.82 | 2.47 | 2.53 | ||
D169 | 3.55 | 3.41 | 3.31 | 35.92 | 36.64 | 34.39 | 2.83 | 2.75 | 2.61 | ||
D85 | 4.10 | 3.17 | 2.96 | 43.76 | 33.37 | 30.45 | 3.47 | 2.72 | 2.46 | ||
D43_1 | 3.36 | 2.84 | 2.72 | 30.40 | 25.63 | 23.73 | 2.74 | 2.31 | 2.18 | ||
D43_2 | 4.13 | 3.19 | 3.11 | 44.58 | 34.52 | 33.70 | 3.34 | 2.70 | 2.65 | ||
D43_3 | 3.68 | 3.14 | 3.31 | 35.53 | 30.26 | 31.93 | 3.17 | 2.67 | 2.91 | ||
D43_4 | 3.95 | 2.76 | 3.18 | 44.86 | 33.59 | 37.36 | 3.21 | 2.36 | 2.69 | ||
D43_5 | 4.69 | 3.42 | 3.61 | 67.44 | 38.46 | 47.36 | 4.06 | 2.70 | 3.01 | ||
D22_1 | 4.34 | 3.22 | 3.42 | 43.60 | 35.06 | 33.76 | 3.77 | 2.86 | 2.91 | ||
D22_2 | 4.31 | 2.43 | 2.62 | 65.64 | 24.60 | 26.55 | 3.83 | 2.13 | 2.28 | ||
D22_3 | 3.90 | 2.89 | 3.07 | 77.11 | 27.63 | 33.04 | 3.54 | 2.27 | 2.64 | ||
D22_4 | 3.53 | 2.56 | 2.86 | 58.38 | 23.23 | 27.42 | 3.03 | 2.18 | 2.45 | ||
D22_5 | 4.31 | 2.98 | 2.98 | 146.35 | 35.83 | 41.13 | 3.79 | 2.44 | 2.55 |
表5 所有实验组样本的空间自相关分析Tab. 5 Spatial correlation analysis of all experimental groupsamples |
实验组 | 平均最短距离/m | Moran's I值 | Z得分 | P值 | 实验组 | 平均最短距离/m | Moran's I值 | Z得分 | P值 |
---|---|---|---|---|---|---|---|---|---|
D2703 | 371.45 | 0.35 | 56.78 | 0.00 | D43_3* | 2615.74 | 0.15 | 1.15 | 0.25 |
D1352 | 425.63 | 0.33 | 30.83 | 0.00 | D43_4* | 3109.01 | 0.15 | 1.30 | 0.19 |
D676 | 567.68 | 0.33 | 14.99 | 0.00 | D43_5* | 2909.11 | 0.08 | 0.83 | 0.41 |
D339 | 810.79 | 0.20 | 8.24 | 0.00 | D22_1* | 4237.90 | -0.32 | -1.39 | 0.17 |
D169 | 1285.91 | 0.17 | 5.68 | 0.00 | D22_2* | 4384.49 | -0.15 | -0.53 | 0.60 |
D85 | 1914.45 | 0.19 | 3.06 | 0.00 | D22_3* | 3662.95 | -0.10 | -0.19 | 0.85 |
D43_1 | 2792.95 | 0.22 | 2.41 | 0.02 | D22_4* | 4342.87 | 0.10 | 0.74 | 0.46 |
D43_2 | 3110.80 | 0.14 | 1.52 | 0.13 | D22_5* | 4723.02 | -0.11 | -0.48 | 0.63 |
注:*表示检验结果为空间自相关性不显著。 |
表6 3种土壤有机质影响因素探测结果Tab. 6 Detected result of three influence factor of soil organic matter |
实验组 | 土地利用类型 | 土壤质地 | 畜禽粪便影响强度 | |||||
---|---|---|---|---|---|---|---|---|
q值 | p值 | q值 | p值 | q值 | p值 | |||
D2703 | 0.182 | 0.000 | 0.199 | 0.000 | 0.148 | 0.000 | ||
D1352 | 0.209 | 0.000 | 0.190 | 0.000 | 0.139 | 0.000 | ||
D676 | 0.198 | 0.000 | 0.233 | 0.000 | 0.171 | 0.000 | ||
D339 | 0.212 | 0.000 | 0.189 | 0.000 | 0.138 | 0.000 | ||
D169 | 0.210 | 0.000 | 0.229 | 0.000 | 0.202 | 0.000 | ||
D85 | 0.256 | 0.067 | 0.259 | 0.016 | 0.175 | 0.035 | ||
D43 | 0.500 | 0.004 | 0.177 | 0.071 | 0.345 | 0.050 | ||
D22 | 0.496 | 0.016 | 0.477 | 0.049 | 0.742 | 0.034 |
4.2 GRNN、RF和Ordinary Kriging预测方法的比较 |
表7 GRNN、RF和普通克里格法的预测值与观测值的相关性分析Tab. 7 The correlation analysis between the observed values and predicted values of GRNN, RF and Ordinary Kriging |
实验组 | Krige | GRNN | RF | 实验组 | Krige | GRNN | RF |
---|---|---|---|---|---|---|---|
D2703 | 0.686** | 0.608** | 0.642** | D43_3 | 0.155 | 0.388* | 0.392** |
D1352 | 0.635** | 0.590** | 0.637** | D43_4 | 0.153 | 0.669** | 0.576** |
D676 | 0.543** | 0.581** | 0.616** | D43_5 | -0.067 | 0.591** | 0.577** |
D339 | 0.450** | 0.580** | 0.578** | D22_1 | -0.137 | 0.572** | 0.590** |
D169 | 0.433** | 0.500** | 0.545** | D22_2 | -0.343 | 0.709** | 0.630** |
D85 | 0.175 | 0.599** | 0.660** | D22_3 | -0.094 | 0.584** | 0.440* |
D43_1 | 0.210 | 0.532** | 0.594** | D22_4 | -0.102 | 0.547** | 0.398* |
D43_2 | 0.129 | 0.559** | 0.627** | D22_5 | 0.054 | 0.635** | 0.762** |
注:检查手段为Pearson相关系数。*表示在0.05水平(双侧)上显著相关;**表示在0.01水平(双侧)上显著相关。 |
[1] |
|
[2] |
徐云鹤, 方斌. 江浙典型茶园土壤有机质空间异质性分析[J]. 地球信息科学学报, 2015,17(5):622-630.
[
|
[3] |
陈桂香, 高灯州, 曾从盛, 等. 福州市农田土壤养分空间变异特征[J]. 地球信息科学学报, 2017,19(2):216-224.
[
|
[4] |
|
[5] |
|
[6] |
黄昌勇, 徐建明. 土壤学(第三版)[M]. 北京: 中国农业出版社, 2010: 29-43
[
|
[7] |
|
[8] |
|
[9] |
|
[10] |
|
[11] |
杨晓. 县域土壤有机质空间分布研究——以山东省诸城市为例[D]. 济南:山东农业大学, 2018.
[
|
[12] |
高凤杰, 马泉来, 韩文文, 等. 黑土丘陵区小流域土壤有机质空间变异及分布格局[J]. 环境科学, 2016,37(5):1915-1922.
[
|
[13] |
宋莎, 李廷轩, 王永东, 等. 县域农田土壤有机质空间变异及其影响因素分析[J]. 土壤, 2011,43(1):44-49.
[
|
[14] |
王宗明, 张柏, 宋开山, 等. 东北平原典型农业县农田土壤养分空间分布影响因素分析[J]. 水土保持学报, 2007(2):73-77.
[
|
[15] |
孔祥斌, 张凤荣, 王茹. 近20年城乡交错带土壤养分时间空间变异特征分析——以北京市大兴区为例[J]. 土壤, 2004(6):636-643.
[
|
[16] |
秦静, 孔祥斌, 姜广辉, 等. 北京典型边缘区25年来土壤有机质的时空变异特征[J]. 农业工程学报, 2008,24(3):124-129.
[
|
[17] |
胡克林, 余艳, 张凤荣, 等. 北京郊区土壤有机质含量的时空变异及其影响因素[J]. 中国农业科学, 2006,38(4):764-771.
[
|
[18] |
赵汝东, 孙焱鑫, 王殿武, 等. 北京地区耕地土壤有机质空间变异分析[J]. 土壤通报, 2010,41(3):552-557.
[
|
[19] |
李启权, 王昌全, 岳天祥, 等. 基于神经网络模型的中国表层土壤有机质空间分布模拟方法[J]. 地球科学进展, 2012,27(2):175-184.
[
|
[20] |
李启权, 王昌全, 张文江, 等. 基于神经网络模型和地统计学方法的土壤养分空间分布预测[J]. 应用生态学报, 2013,24(2):459-466.
[
|
[21] |
江叶枫, 郭熙, 叶英聪, 等. 应用集成BP神经网络模型预测土壤有机质空间分布[J]. 江苏农业学报, 2017,33(5):1044-1050.
[
|
[22] |
江叶枫, 郭熙. 基于辅助变量和回归径向基函数神经网络(R-R BFNN)的土壤有机质空间分布模拟[J]. 浙江农业学报, 2018,30(4):640-648.
[
|
[23] |
江叶枫, 郭熙, 叶英聪, 等. 省域尺度土壤有机质空间分布的神经网络法预测[J]. 江苏农业学报, 2017,33(4):828-835.
[
|
[24] |
|
[25] |
宋英强, 杨联安, 冯武焕, 等. 基于多源辅助变量和极限学习机的蔬菜地土壤有机质预测研究[J]. 土壤通报, 2017,48(1):118-126.
[
|
[26] |
|
[27] |
|
[28] |
何勇, 张淑娟, 方慧. 基于人工神经网络的田间信息插值方法研究[J]. 农业工程学报, 2004,20(3):120-123.
[
|
[29] |
|
[30] |
|
[31] |
|
[32] |
|
[33] |
|
[34] |
侯艺璇, 赵华甫, 吴克宁, 等. 基于BP神经网络的作物Cd含量预测及安全种植分区[J]. 资源科学, 2018,40(12):2414-2424.
[
|
[35] |
|
[36] |
GB 9834-88.土壤有机质测定法[S].
[ GB 9834-88. Method for determination of soil organic matter[S]. ]
|
[37] |
|
[38] |
|
[39] |
许信旺. 不同尺度区域农田土壤有机碳分布与变化[D]. 南京:南京农业大学, 2008.
[
|
[40] |
高凤杰, 吴啸, 师华定, 等. 基于贝叶斯最大熵的黑土区小流域土壤有机质空间预测[J]. 环境科学研究, 2019,32(8):1365-1373.
[
|
[41] |
GB 18596-2001.畜禽养殖业污染物排放标准[S].
[ GB 18596-2001. Discharge standard of pollutants for livestock and poultry breeding[S]. ]
|
[42] |
李帷, 李艳霞, 杨明, 等. 北京市畜禽养殖的空间分布特征及其粪便耕地施用的可达性[J]. 自然资源学报, 2010,25(5):746-755.
[
|
[43] |
任丽, 杨联安, 王辉, 等. 基于随机森林的苹果区土壤有机质空间预测[J]. 干旱区资源与环境, 2018,32(8):141-146.
[
|
[44] |
沈掌泉, 周斌, 孔繁胜, 等. 应用广义回归神经网络进行土壤空间变异研究[J]. 土壤学报, 2004,41(3):471-475.
[
|
[45] |
范曼曼, 吴鹏豹, 张欢, 等. 采样密度对土壤有机质空间变异解析的影响[J]. 农业现代化研究, 2016,37(3):594-600.
[
|
[46] |
姜怀龙, 李贻学, 赵倩倩. 县域土壤有机质空间变异特征及合理采样数的确定[J]. 水土保持通报, 2012,32(4):143-146.
[
|
/
〈 | 〉 |