A MapReduce-based Method for Parallel Calculation of Bus Passenger Origin and Destination from Massive Transit Data

  • WU Qunyong*,
  • SU Keyun,
  • ZOU Zhijie
  • 1. National & Local Joint Engineering Research Center of Geo-spatial Information Technology, Fuzhou University, Fuzhou 350002, China; 2. Key Laboratory of Spatial Data Mining & Information Sharing of MOE, Fuzhou 350002, China
*Corresponding author: WU Qunyong, E-mail:

Received date: 2017-08-10

Revised date: 2018-03-07

Online published: 2018-05-20

Supported by

National Natural Science Foundation of China, No.41471333

The Central Guided Local Development of Science and Technology Project, No.2017L3012.


All rights reserved by the Editorial Office of the Journal of Geo-information Science


Bus passengers' origins and destinations (OD) reflect the travel characteristics and demands of residents. They are basic data for bus system evaluation, scheduling, and route optimization, and have significant practical value in urban planning. Existing OD estimation methods are mostly applied to small amounts of bus data and cannot directly and rapidly calculate OD for massive transit passenger data. To solve this problem, a MapReduce-based parallel method for calculating the origins and destinations of massive numbers of transit passengers is investigated. First, a database migration tool is applied to transfer massive bus data stored in a relational database to HBase. Second, the MapReduce parallel computing framework is introduced to divide the IC card data into multiple Map tasks according to region numbers in HBase to calculate origins. The origins are grouped by user and stored in HDFS in the Reduce function. Third, the destinations are estimated in parallel from the origins, which are divided into multiple Map tasks according to the block numbers stored in HDFS. From the travel records of each passenger, destinations are calculated using the public transit trip chain method and historical similarity. Finally, IC card data and bus GPS data from Xiamen for June 13 to 26, 2015, covering 295 bus lines, are taken as an example. The method yields 16 879 661 boarding records and 14 410 058 complete OD pairs, which account for 78.9% of the IC card data. Compared with the traditional method, computational efficiency is substantially improved. The results illustrate that the parallel method not only calculates bus passenger OD accurately but also has higher computational efficiency.

Cite this article

WU Qunyong, SU Keyun, ZOU Zhijie. A MapReduce-based Method for Parallel Calculation of Bus Passenger Origin and Destination from Massive Transit Data[J]. Journal of Geo-information Science, 2018, 20(5): 647-655. DOI: 10.12082/dqxxkx.2018.170374

1 Introduction

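At present, most bus fares in China are flat fares: passengers swipe a card only when boarding and not when alighting. Without accurate positioning information, earlier studies clustered IC card data by the time interval between swipes and combined the clusters with bus dispatching information and route and station information to infer passengers' boarding stations[3,4,5,6]. With the development and spread of vehicle GPS positioning, real-time bus positions and other status information have become easier to obtain, and researchers increasingly prefer to infer boarding stations from bus GPS data and IC card data, using passengers' transfers between bus lines and between metro and bus lines[7,8]. More commonly, matching constraint rules are established between IC card swipe times and bus arrival and departure times to infer boarding stations[9,10,11,12,13,14].

For alighting stations, there are three main inference methods, distinguished by passengers' travel characteristics: the station attraction weight method based on bus stations, the continuous trip chain method based on individual passengers, and the commuting trip method based on multi-source data. The station attraction weight method determines the probability that a passenger alights at each station from three factors: the travel distance of bus trips, which approximately follows a Poisson distribution; the land use around the station; and the attraction strength given by the numbers of passengers boarding and alighting at the station. It then infers the alighting station from these probabilities[15,16]. The trip chain method infers alighting stations under the assumption of continuous bus trips[17,18,19,20,21]. Continuous bus trips are only part of all bus trips, however, so some scholars combine the trip chain with the station attraction weight method[22]. The commuting trip method based on multi-source data tracks passengers' trips and transfer records across multiple data sources to infer alighting stations[23,24].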

2 Data preprocessing and partitioned storage

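Tab. 1 Data structure of IC card records

Row key | Column family
IC card ID:swipe time | IC card ID, swipe time, route ID, bus vehicle ID

Tab. 2 Data structure of bus GPS records

Row key | Column family
GPS device ID:GPS time | GPS device…, arrival/departure flag, station ID, longitude, latitude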
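
The composite row keys in Tables 1 and 2 can be illustrated with a minimal Python sketch. The helper name `make_row_key` and the fixed-width timestamp format are assumptions made for illustration; the source states only that the row key joins the ID and the time with a colon.

```python
from datetime import datetime


def make_row_key(entity_id: str, timestamp: datetime) -> str:
    """Join an ID and a timestamp into an 'ID:time' row key (hypothetical format)."""
    # HBase orders rows lexicographically by key, so a fixed-width timestamp
    # keeps all records of one ID adjacent and sorted by time.
    return f"{entity_id}:{timestamp.strftime('%Y%m%d%H%M%S')}"


# Two swipes of the card shown in Table 5, given out of order.
keys = [
    make_row_key("4078181794", datetime(2015, 6, 15, 16, 23, 44)),
    make_row_key("4078181794", datetime(2015, 6, 13, 16, 7, 31)),
]
print(sorted(keys))
```

Sorting the keys as strings returns the swipes in chronological order, which is the property a scan over one card's records would rely on.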

3 Parallel inference method for bus passenger boarding stations

Fig. 1 Relationship between passenger card-swipe time and bus arrival and departure times

Fig. 2 Parallel inference flow for bus passenger boarding stations

Fig. 3 Boarding station inference flow in the Map function
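
The abstract describes origins being computed in Map tasks and then grouped by user in the Reduce function. The following single-process Python sketch imitates that grouping only; it does not use Hadoop, and the function names, the record fields, and the second card ID are assumptions made for illustration.

```python
from collections import defaultdict


def map_boarding(record):
    """Map step: emit a (card_id, (swipe_time, station)) pair for one boarding."""
    return record["card_id"], (record["swipe_time"], record["station"])


def reduce_by_user(pairs):
    """Reduce step: collect each card's boardings and order them by swipe time."""
    grouped = defaultdict(list)
    for card_id, boarding in pairs:
        grouped[card_id].append(boarding)
    return {card_id: sorted(items) for card_id, items in grouped.items()}


# Boardings of card 4078181794 come from Table 5; card 0000000001 is made up.
records = [
    {"card_id": "4078181794", "swipe_time": "2015-06-15 16:23:44", "station": "江头市场"},
    {"card_id": "0000000001", "swipe_time": "2015-06-15 08:00:00", "station": "中医院"},
    {"card_id": "4078181794", "swipe_time": "2015-06-15 07:24:23", "station": "岳阳小区"},
]
origins_by_user = reduce_by_user(map_boarding(r) for r in records)
print(origins_by_user["4078181794"])
```

Grouping by card before the alighting step means each later task sees one passenger's full, time-ordered boarding history.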


4 Parallel inference of bus passenger alighting stations

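The alighting station inference method in this paper applies only to flat-fare bus card data; swipe records from other modes (such as BRT and metro) are not considered. Proxy swiping is common in everyday bus travel: one IC card produces two or more consecutive swipe records on the same bus, where consecutive means that adjacent swipe times of the same card are less than a given time threshold apart. This paper assumes that all such proxy swipes share the same travel destination, that is, the same alighting station. According to the trip chain method, residents' bus travel can be divided into continuous and discontinuous bus trip chains, as shown in Fig. 4. Fig. 4(a) shows a continuous bus trip chain, which rests on three assumptions: ① no other mode of transport is used between two consecutive bus trips; ② the origin of the next bus trip is the destination of the previous bus trip, or a station within acceptable walking transfer distance of that destination; ③ the destination of the last trip of the day is the origin of the first trip of that day or the origin of the first trip of the next day. Fig. 4(b) shows discontinuous bus trip chains, where solid lines are bus trips and dashed lines are trips by other modes. As Fig. 4 shows, discontinuous bus travel covers three cases: ① the user travels continuously by bus, but the last bus trip does not return to the day's origin; ② the user uses another mode when transferring, and the last trip does not return to the day's origin; ③ the user uses another mode when transferring, but the last trip returns to the day's origin. All discontinuous bus trips are free combinations of these three basic cases.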
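
The proxy-swipe rule and assumptions ② and ③ of the continuous trip chain can be sketched in Python as follows. This is a simplified illustration, not the paper's Map function: it takes the next boarding station directly as the alighting station instead of searching for a station within walking distance, and the bus IDs are hypothetical because Table 5 lists routes, not vehicles. The swipe times and stations are those of 2015-06-17 in Table 5, and the 1 min threshold is the value of α used in the case study.

```python
from datetime import datetime, timedelta

ALPHA = timedelta(minutes=1)  # proxy-swipe threshold (1 min in the case study)


def collapse_proxy_swipes(swipes):
    """Keep one swipe per boarding: drop a swipe that follows the previous
    swipe on the same bus by less than ALPHA."""
    kept, previous = [], None
    for swipe in sorted(swipes, key=lambda s: s["time"]):
        is_proxy = (
            previous is not None
            and swipe["bus"] == previous["bus"]
            and swipe["time"] - previous["time"] < ALPHA
        )
        if not is_proxy:
            kept.append(swipe)
        previous = swipe
    return kept


def chain_destinations(trips):
    """Each trip ends where the next one starts; the last trip of the day
    ends at the first origin of the day. One trip alone gives no destination."""
    if len(trips) < 2:
        return [(trip["station"], None) for trip in trips]
    return [
        (trip["station"], trips[(i + 1) % len(trips)]["station"])
        for i, trip in enumerate(trips)
    ]


def at(hms):
    return datetime.strptime("2015-06-17 " + hms, "%Y-%m-%d %H:%M:%S")


raw = [
    ("07:43:44", "bus-A", "岳阳小区"), ("07:43:46", "bus-A", "岳阳小区"),
    ("15:39:32", "bus-B", "莲花路口东"), ("15:39:33", "bus-B", "莲花路口东"),
    ("17:00:12", "bus-C", "市行政中心"), ("17:00:14", "bus-C", "市行政中心"),
    ("17:11:25", "bus-D", "枋湖车站"), ("17:11:26", "bus-D", "枋湖车站"),
]
swipes = [{"time": at(h), "bus": b, "station": s} for h, b, s in raw]
trips = collapse_proxy_swipes(swipes)
od_pairs = chain_destinations(trips)
for origin, destination in od_pairs:
    print(origin, "->", destination)
```

The eight swipes collapse to four boardings, and the chained OD pairs agree with the boarding and alighting stations listed for that day in Table 5.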
Fig. 4 Public transit trip chains

Fig. 5 Parallel inference flow for bus passenger alighting stations

4.1 Parallel inference of alighting stations based on continuous trip chains

Fig. 6 Alighting station inference flow for continuous public transit trip chains in the Map function


4.2 Alighting station inference for discontinuous bus trip chains

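(2) If the swipe time falls on a non-working day, search set Sb for the D point of a similar non-working day and use it as the D point of this boarding record, then add the OD record to Sb; if no corresponding D point is found, add the boarding record to set Sc.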
Fig. 7 Alighting station inference flow for discontinuous public transit trip chains in the Map function
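
Step (2) above can be sketched in Python under two loudly stated assumptions that are not in the source: a "similar" non-working day is taken to be any non-working day with the same boarding station, and non-working days are taken to be weekends only. The function names and record fields are hypothetical; `s_b` and `s_c` stand for the sets Sb and Sc.

```python
from datetime import date


def is_non_working_day(day: date) -> bool:
    # Assumption: weekends only; public holidays are ignored in this sketch.
    return day.weekday() >= 5


def infer_non_working_day(boarding, s_b, s_c):
    """Look in s_b for an OD record from a non-working day with the same
    origin; reuse its destination, otherwise move the boarding to s_c."""
    for od in s_b:
        if is_non_working_day(od["date"]) and od["origin"] == boarding["origin"]:
            s_b.append({**boarding, "destination": od["destination"]})
            return od["destination"]
    s_c.append(boarding)
    return None


# The known OD record is the 2015-06-13 (Saturday) row of Table 5;
# the two boardings on 2015-06-20 are made up for illustration.
s_b = [{"date": date(2015, 6, 13), "origin": "中医院", "destination": "岳阳小区"}]
s_c = []
found = infer_non_working_day({"date": date(2015, 6, 20), "origin": "中医院"}, s_b, s_c)
missing = infer_non_working_day({"date": date(2015, 6, 20), "origin": "禾山路"}, s_b, s_c)
print(found, missing, len(s_b), len(s_c))
```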

5 Case study

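This paper takes the bus data of Xiamen from June 13 to 26, 2015 as a case study. The raw data include 3337 buses, 39 306 315 vehicle GPS records, 295 bus lines, and 21 112 499 IC card swipe records. After screening and preprocessing, 18 268 031 valid IC card swipe records were obtained. Samples of the IC card data and the bus GPS data are shown in Tables 3 and 4.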
Tab. 3 Sample of IC card data in Xiamen

IC card ID | Swipe time | Route ID | Bus vehicle ID
5238111601 0878416539 2015-06-13 00:01:42
2015-06-12 18:18:52
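The experimental platform is a Hadoop cluster with 4 nodes, each with a dual-core processor, 4 GB of memory, and 500 GB of disk space, on which the OD points of Xiamen bus passengers are inferred in parallel with MapReduce. Based on surveys and empirical values, a is set to 3 min and b to 7 min in the boarding station inference, so that all vehicle arrival and departure records from 3 min before to 7 min after the swipe time are retrieved. Boarding stations were inferred for 1 280 607 IC card users and 16 879 661 boarding records, accounting for 92.4% of the IC card swipe records; the remaining 7.6% could not be assigned a boarding station because arrival and departure times were missing or abnormal. Alighting stations were then inferred following the parallel flows in Figs. 6 and 7. The swipe interval threshold α is set to 1 min to identify proxy swipes, and β is set to 500 m, determined by station spacing and the maximum walking distance passengers accept for transfers. In total, 14 410 058 complete OD records were inferred for 1 085 853 IC card users, accounting for 85.3% of the boarding records and 78.9% of the IC card swipe records. For the remaining 14.7% of boarding records, no alighting station could be inferred, for the following reasons: ① the user made only one bus trip in a day and has few bus trip records overall; ② the user made two bus trips in a day, but the second was a transfer, so the alighting station of the second trip cannot be inferred; ③ some users have only one bus swipe record in the two weeks. Table 5 shows the OD inference results for one passenger, from which it can be seen that the proposed method infers passengers' boarding and alighting stations fairly accurately.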
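
The time window used for boarding station inference (a = 3 min before and b = 7 min after the swipe) can be sketched in Python as follows. The window bounds come from the text; everything else is an assumption made for illustration: the event fields, the sample events, restricting the events to the bus on which the card was swiped, and the rule of picking the event nearest in time, which stands in for the matching rules of Fig. 3 that are not reproduced here.

```python
from datetime import datetime, timedelta

A = timedelta(minutes=3)  # window before the swipe (a in the text)
B = timedelta(minutes=7)  # window after the swipe (b in the text)


def candidate_events(swipe_time, bus_events):
    """Arrival/departure events of one bus within [swipe - A, swipe + B]."""
    return [e for e in bus_events if swipe_time - A <= e["time"] <= swipe_time + B]


def infer_boarding_station(swipe_time, bus_events):
    """Return the station of the candidate event nearest in time, or None.
    The nearest-event rule is an assumption, not the paper's rule."""
    candidates = candidate_events(swipe_time, bus_events)
    if not candidates:
        return None
    nearest = min(candidates, key=lambda e: abs(e["time"] - swipe_time))
    return nearest["station_id"]


# Made-up arrival/departure events for one bus.
events = [
    {"time": datetime(2015, 6, 13, 16, 2, 0), "station_id": 11},
    {"time": datetime(2015, 6, 13, 16, 6, 50), "station_id": 12},
    {"time": datetime(2015, 6, 13, 16, 12, 10), "station_id": 13},
]
swipe = datetime(2015, 6, 13, 16, 7, 31)
print(infer_boarding_station(swipe, events))
```

A swipe with no event inside the window returns `None`, which corresponds to the records that could not be assigned a boarding station because arrival and departure times were missing or abnormal.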
Tab. 4 Sample of bus GPS data in Xiamen

GPS device ID | GPS time | Latitude/° | Longitude/° | Driving direction | Arrival/departure flag | Station ID | Route ID
550001313902 | 2015-06-12 18:18:52 | 24.489145 | 118.072485 | 4 | 1 | 12 | 122
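Tab. 5 Results of origin and destination inference

IC card ID | Swipe time | Route ID | Route direction | Boarding station | Alighting station | Inference basis
4078181794 | 2015-06-13 16:07:31 | 24 | 4 | 中医院 | 岳阳小区 | Historical station frequency
4078181794 | 2015-06-15 07:24:23 | 31 | 4 | 岳阳小区 | 江头市场 | Continuous trip chain
4078181794 | 2015-06-15 16:23:44 | 31 | 5 | 江头市场 | 岳阳小区 | Continuous trip chain
4078181794 | 2015-06-17 07:43:44 | 859 | 5 | 岳阳小区 | 莲花路口东 | Similar travel pattern
4078181794 | 2015-06-17 07:43:46 | 859 | 5 | 岳阳小区 | 莲花路口东 | Continuous trip chain
4078181794 | 2015-06-17 15:39:32 | 42 | 4 | 莲花路口东 | 市行政中心 | Similar travel pattern
4078181794 | 2015-06-17 15:39:33 | 42 | 4 | 莲花路口东 | 市行政中心 | Continuous trip chain
4078181794 | 2015-06-17 17:00:12 | 18 | 4 | 市行政中心 | 枋湖车站 | Similar travel pattern
4078181794 | 2015-06-17 17:00:14 | 18 | 4 | 市行政中心 | 枋湖车站 | Continuous trip chain
4078181794 | 2015-06-17 17:11:25 | 45 | 5 | 枋湖车站 | 岳阳小区 | Similar travel pattern
4078181794 | 2015-06-17 17:11:26 | 45 | 5 | 枋湖车站 | 岳阳小区 | Continuous trip chain
4078181794 | 2015-06-18 09:04:16 | 15 | 4 | 岳阳小区 | 叉车厂 | Continuous trip chain
4078181794 | 2015-06-18 09:25:52 | 86 | 4 | 禾山路 | Not inferred |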
Fig. 8 Comparison of computational efficiency for boarding station inference

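For alighting station inference, the common methods are the continuous trip chain method, the station attraction weight method, and combinations of the two. The station attraction weight method infers alighting stations from station probabilities and has low reliability, so this paper compares against the traditional continuous trip chain method. Applying the traditional continuous bus trip chain method to the same data yields 11 380 983 complete bus OD records, 62.3% of the IC card data, which is lower than the 78.9% extraction rate of the proposed method. Combining the continuous trip chain with passengers' historical travel patterns and travel characteristics therefore gives a higher alighting station extraction rate than the traditional method based on the continuous trip chain alone.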
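
The two extraction rates compared above can be checked with a few lines of Python. The record counts are those reported in the case study, with the 18 268 031 valid IC card swipe records as the denominator; the helper name `extraction_rate` is an assumption made for illustration.

```python
VALID_SWIPES = 18_268_031      # valid IC card swipe records
TRADITIONAL_OD = 11_380_983    # complete OD records, traditional trip chain
PROPOSED_OD = 14_410_058       # complete OD records, proposed method


def extraction_rate(od_records: int, swipes: int) -> float:
    """Share of swipe records with a complete OD pair, in percent (1 decimal)."""
    return round(100 * od_records / swipes, 1)


traditional = extraction_rate(TRADITIONAL_OD, VALID_SWIPES)
proposed = extraction_rate(PROPOSED_OD, VALID_SWIPES)
print(traditional, proposed, round(proposed - traditional, 1))
```

The computed rates match the reported 62.3% and 78.9%, a gap of about 16.6 percentage points.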

6 Conclusions

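Inferring the OD of massive numbers of bus passengers accurately and quickly is an important prerequisite for obtaining comprehensive and realistic urban bus passenger flow information in time. To handle massive bus data, this paper proposes a MapReduce-based method for parallel inference of bus passenger OD. The method exploits the strengths of MapReduce parallel computing and the similarity of passengers' historical travel patterns to extract OD information accurately and quickly. Its advantages are: ① it infers alighting stations from the frequency with which stations appear in a passenger's historical travel records, giving a higher OD extraction rate than previous studies; ② it computes passenger OD quickly from massive bus data with high computational efficiency, reflecting comprehensive and realistic urban bus OD flows in time, which has practical value for bus route optimization and urban planning. The mining of bus data in this paper can still be improved. For records whose OD points could not be inferred, other data sources (metro, taxi) could be combined to infer passengers' OD points and obtain more complete bus OD flow information; such data require further study.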

The authors have declared that no competing interests exist.



References

[1] Sun C J, Li J W, Ling X H. Estimation of bus origin-destination matrix based on cloud computing[J]. Journal of Jiangsu University (Natural Science), 2016,37(4):456-461.

[2] Chen F, Liu J F. Characteristics of bus passenger flow based on IC card data: A case study in Beijing[J]. Journal of Urban Transportation, 2016,14(1):51-58,64.

[3] Dai X. Approach on the information analysis of urban public traffic base on the data of bus intelligent card[D]. Nanjing: Southeast University, 2006.

[4] Yu Y, Deng T M, Xiao Y M. A novel method of confirming the boarding station of bus holders[J]. Journal of Chongqing Jiaotong University (Natural Science), 2009,28(1):121-125.

[5] Yi C Y, Chen Y Y, Chen S H. Bus station passenger matching method based on cluster analysis method[J]. Journal of Traffic Information and Safety, 2010,28(3):21-24.

[6] Hou Y, He M, Zhang S B. Origin-destination matrix estimation method based on bus smart card records[J]. Journal of Transport Information and Safety, 2012,30(6):109-114.

[7] Zhang S, Chen X W, Chen Z R. A method of deriving bus stops O-D matrix based on bus IC card data[J]. Journal of Wuhan University of Technology (Transportation Science & Engineering), 2014,38(2):333-337.

[8] Song X Q, Fang Z X, Yin L, et al. A method of deriving the boarding station information of bus passengers based on comprehensive transfer information mined from IC card data[J]. Journal of Geo-information Science, 2016,18(8):1060-1068.

[9] Jin J. Analysis of bus trip characteristics in Beijing based on IC card[D]. Beijing: Capital Normal University, 2013.

[10] Hazelton M L. Statistical inference for time varying origin-destination matrices[J]. Transportation Research Part B: Methodological, 2008,42(6):542-552.

[11] Alsger A A, Mesbah M, Ferreira L, et al. Use of smart card fare data to estimate public transport origin-destination matrix[J]. Transportation Research Record: Journal of the Transportation Research Board, 2015,2535:88-96.

[12] Munizaga M A, Palma C. Estimation of a disaggregate multimodal public transport Origin-Destination matrix from passive smart card data from Santiago, Chile[J]. Transportation Research Part C, 2012,24(9):9-18.

[13] Alsger A, Tavassoli A, Mesbah M, et al. Evaluation of effects from sample-size origin-destination estimation using smart card fare data[J]. Journal of Transportation Engineering Part A-Systems, 2017,143(4):1-10.



[14] Ma X L, Liu C C, Liu J F, et al. Boarding stop inference based on transit IC card data[J]. Journal of Transportation Systems Engineering and Information Technology, 2015,15(4):78-84.

[15] Wu G X. Urban public transportation trip OD matrix inference and application based on bus IC card data and GPS data[D]. Jinan: Shandong University, 2011.

[16] Hu Y C, Liang J R, Liang F M. A way to get bus regional OD matrix based on mining IC card information[J]. Journal of Transport Information and Safety, 2012,30(4):66-70.

[17] Li H B, Chen X W, Chen Z R. A method for estimating origin-destination matrix of public transit based on smart card and AVL data[J]. Journal of Transport Information and Safety, 2015,33(6):33-39,95.

[18] Munizaga M, Devillaine F, Navarrete C, et al. Validating travel behavior estimated from smartcard data[J]. Transportation Research Part C: Emerging Technologies, 2014,44(4):70-79.

[19] Nunes A A, Dias T G, Cunha J F E. Passenger journey destination estimation from automated fare collection system data using spatial validation[J]. IEEE Transactions on Intelligent Transportation Systems, 2015,17(1):133-142.




[20] Yang W B, Wang H, Ye X F, et al. OD matrix inference for urban public transportation trip based on GPS and IC card data[J]. Journal of Chongqing Jiaotong University (Natural Science), 2015,34(3):117-121.

[21] Song Z, Qin Z G, Xu J, et al. Research on large-scale OD matrix estimation method based on bus IC card data[J]. Application Research of Computers, 2016(7):2007-2013.

[22] Hu J H, Deng J, Huang Z. Trip-chain based probability model for identifying alighting stations of smart card passengers[J]. Journal of Transportation Systems Engineering and Information Technology, 2014,14(2):62-67.

[23] Hu J H, Gao L X, Liang J X. An inference method of public transit OD matrix based on traffic big data[J]. Science Technology and Engineering, 2017,17(11):309-314.

[24] Zhao J, Qu Q, Zhang F, et al. Spatio-temporal analysis of passenger travel patterns in massive smart card data[J]. IEEE Transactions on Intelligent Transportation Systems, 2017,99:1-12.


