Journal of Geo-information Science ›› 2020, Vol. 22 ›› Issue (9): 1799-1813. doi: 10.12082/dqxxkx.2020.190441

Estimating Soil Organic Matter based on Machine Learning Under Sparse Sample

LIU Mingjie1,2, XU Zhuokui1,3, GAO Yunbing2,4,*, YANG Jing2,4, PAN Yuchun2,4, GAO Bingbo5, ZHOU Yanbing2,4, ZHOU Wanpeng2,6, WANG Ling7

    1. School of Traffic and Transportation Engineering, Changsha University of Science and Technology, Changsha 410114, China
    2. Beijing Research Center for Information Technology in Agriculture, Beijing 100097, China
    3. Engineering Laboratory of Spatial Information Technology of Highway Geological Disaster Early Warning in Hunan Province (Changsha University of Science & Technology), Changsha 410114, China
    4. National Engineering Research Center for Information Technology in Agriculture, Beijing 100097, China
    5. China Agricultural University, Beijing 100083, China
    6. Henan Polytechnic University, Jiaozuo 454003, China
    7. Institute of Agricultural Resources and Environment, Hebei Academy of Agriculture and Forestry Sciences, Shijiazhuang 050051, China
  • Received: 2019-08-13 Revised: 2019-12-14 Online: 2020-09-25 Published: 2020-11-25
  • Contact: GAO Yunbing
  • Supported by:
    The National Key Research and Development Program of China (2017YFD0801205); The Science and Technology Innovation Capacity Building Project of Beijing Academy of Agriculture and Forestry Sciences (KJCX20170407); The Science and Technology Innovation Capacity Building Project of Beijing Academy of Agriculture and Forestry Sciences (KJCX20200414); Scientific Research Project Funded by the Education Department of Hunan Province (13B129); Project Supported by Open Fund of Hunan Engineering Laboratory (KFJ180602)


This study aims to improve the accuracy of soil organic matter estimation under sparse sampling by building predictive models with two machine learning methods, GRNN (Generalized Regression Neural Network) and RF (Random Forest). Using the 2007 soil organic matter sampling data for agricultural land in Daxing, the samples were thinned under the MMSD (Minimization of the Mean of the Shortest Distances) criterion into 8 subsets of decreasing sampling density (2703, 1352, 676, 339, 169, 85, 43 and 22 samples). GRNN, RF and Ordinary Kriging were each applied at every sampling density, and cross-validation was used to assess the prediction accuracy for unknown samples at each density. As the sampling density decreases, the spatial correlation between sampling points gradually weakens, so the fit of the semivariogram deteriorates, the error at the prediction points increases, and the confidence of the prediction declines. When the data are thinned to 43 and 22 samples, the spatial correlation between sampling points nearly disappears, the coefficient of determination of the semivariogram function is low and the residual is large. Ordinary Kriging is clearly affected by changes in the number of sampling points, the sampling density and the spatial structure of the samples: its prediction accuracy decreases as the number of sampling points decreases, and at 85 sampling points or fewer there is no significant correlation between the predicted and observed values. The prediction accuracy of GRNN and RF, by contrast, is almost independent of the sampling density: the predicted values fluctuate within a limited range around the observed values and correlate well with them. At 85 sampling points and below, their prediction accuracy is greatly improved compared with Ordinary Kriging.
Ordinary Kriging is not suitable for spatial interpolation with sparse samples, especially when the spatial correlation is weak. The machine learning models can fully learn the environmental information and spatial proximity information of soil sampling points. By combining attribute similarity with spatial correlation, they are more stable and adaptable, are not easily affected by the number, configuration or density of sampling points, and can make stable and accurate predictions even when the spatial autocorrelation between sampling points is very weak.
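
As an illustration of the GRNN idea referred to in the abstract, the following is a minimal sketch rather than the authors' implementation. It treats a GRNN prediction as a Gaussian-kernel-weighted average of the training targets and scores it with leave-one-out cross-validation. The function names, the smoothing parameter `sigma`, the feature layout (two coordinates plus one hypothetical environmental covariate) and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def grnn_predict(X_train, y_train, X_query, sigma=1.0):
    """GRNN-style prediction: Gaussian-weighted mean of training targets."""
    # Squared Euclidean distance between every query row and training row
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    # Guard against division by zero if all weights underflow
    denom = np.maximum(w.sum(axis=1), 1e-12)
    return (w @ y_train) / denom

def loo_rmse(X, y, sigma=1.0):
    """Leave-one-out cross-validation RMSE for the sketch above."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i  # hold out sample i
        preds[i] = grnn_predict(X[mask], y[mask], X[i:i + 1], sigma)[0]
    return float(np.sqrt(np.mean((preds - y) ** 2)))

# Synthetic stand-in for a sparse sample set (NOT the Daxing data):
# columns are x, y coordinates and one hypothetical covariate, scaled to [0, 1].
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(85, 3))
y = 10.0 + 5.0 * X[:, 2] + rng.normal(0.0, 0.1, size=85)

rmse = loo_rmse(X, y, sigma=0.1)
print(f"leave-one-out RMSE: {rmse:.3f}")
```

Because the weights depend on distance in the full feature space, a sample that is similar in its covariates can still contribute when spatial neighbours are scarce, which is one plausible reading of why such models degrade less than Ordinary Kriging as the sampling density falls. In practice `sigma` would need tuning, for example by the same cross-validation loop.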

Key words: soil organic matter, spatial interpolation, machine learning, attribute similarity, spatial correlation, Daxing County, sparse sample, sampling density