XIANG Guangxin, WEN Jialiang, CHEN Yuanyuan, ZHANG Yiwen, HUANG Shiyi, HAN Jingyuan, PENG Jiajie, YI Hongchao, LI Jiabao, MA Zhanhong
[Objectives] Population big data, characterized by large sample sizes and high spatiotemporal completeness, has become an essential foundational dataset for research in demography, population geography, and spatial population studies. Despite its widespread use, effective methods for calibrating population big data are still lacking, even though quantitative research in particular requires accurate and reliable figures. Common mathematical models struggle to describe precisely and realistically the relationship between big data and data from the Seventh National Population Census. [Methods] This paper proposes a method for calibrating population counts derived from big data. The method uses the legally authoritative Seventh National Population Census data as an anchor point and spatially integrates these statistical data with 2020 Baidu population big data. Based on the mathematical relationship between population big data and official statistics, an operations-research-based optimization model is constructed to obtain the globally optimal deviation values for calibration. With resident population data from the Seventh National Population Census for Hunan Province as the anchor point, the method is applied to calibrate the 2020 Baidu resident population big data, and the result is checked with two validation procedures. [Results] The deviation ratio between the calibrated resident population big data for Hunan Province and the census data is -1.01% (an improvement of 25.87%), with city-level deviation ratios ranging from -2.05% to +0.92% and county-level deviation ratios from -2.06% to +1.99%; the calibration does not alter the pre-calibration deviation trends. Validation against National Bureau of Statistics data indicates that the deviation ratio for the calculated total urban population ranges from -2.7% to +1.7%.
Validation against village-reported data in the "Green Heart" region at the intersection of Changsha, Zhuzhou, and Xiangtan cities shows a deviation ratio of 0.47% between the calculated resident population and the village-reported figures. [Conclusions] The calibration and validation results demonstrate the effectiveness of the proposed method, which offers a viable approach for estimating population figures in non-census years and for generating spatially distributed population datasets from big data. The method can be applied to population big data from any provider and can be extended beyond resident population counts to calibrate gender ratios, age structures, working populations, mobile populations, origin-destination (OD) flow data, and population profiles.
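The deviation ratios reported above can be illustrated with a minimal sketch. The abstract does not give the formula, so the code assumes the common definition: the signed relative difference between a big-data estimate and a reference figure, expressed as a percentage. The function name `deviation_ratio` and the input values are hypothetical and are not taken from the paper; the sketch does not reproduce the optimization model itself.

```python
def deviation_ratio(estimate: float, reference: float) -> float:
    """Signed deviation of `estimate` from `reference`, in percent.

    Assumed definition (not stated in the abstract):
        (estimate - reference) / reference * 100
    A negative value means the estimate is below the reference.
    """
    if reference == 0:
        raise ValueError("reference population must be non-zero")
    return (estimate - reference) / reference * 100.0


def deviation_ratios(estimates: dict, references: dict) -> dict:
    """Deviation ratio for every spatial unit present in both inputs."""
    shared_units = estimates.keys() & references.keys()
    return {
        unit: deviation_ratio(estimates[unit], references[unit])
        for unit in shared_units
    }


if __name__ == "__main__":
    # Hypothetical illustrative counts, not data from the study.
    big_data = {"unit_a": 990.0, "unit_b": 2040.0}
    census = {"unit_a": 1000.0, "unit_b": 2000.0}
    for unit, ratio in sorted(deviation_ratios(big_data, census).items()):
        print(f"{unit}: {ratio:+.2f}%")
```

Under this assumed definition, the same function applies at any spatial level (province, city, county, or village), provided the estimate and the reference cover the same area.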