Extracting and Analyzing Latent Semantic Characteristics of Locations Using Social Media Data

  • CHEN Yuanyuan ,
  • GAO Yong , *
  • Institute of Remote Sensing and Geographic Information System, Peking University, Beijing 100871, China
*Corresponding author: GAO Yong, E-mail:

Received date: 2017-07-04

  Revised date: 2017-09-07

  Online published: 2017-11-10

Copyright

All rights reserved by the Editorial Office of Journal of Geo-information Science

Abstract

Social media data are increasingly perceived as an important channel for recording people's perceptions by virtue of their large volume, availability and timeliness. In particular, some social media data are location-stamped, which associates urban space with human cognition and makes it possible to reveal the sociocultural signature of places in a semantic way. In this paper, geo-tagged text data from Weibo were used to explore the hidden semantic characteristics of locations, with a focus on semantic similarities among regions. Specifically, Latent Semantic Analysis (LSA) was introduced to transform the unstructured regional and semantic features in social media into cognition-friendly, deeply related vectors. Spatial analysis methods, including factor analysis, spatial autocorrelation analysis and cluster analysis, were then employed to mine the hidden characteristics of locations. As a result, different latent topics and their distribution across the city were uncovered. Similarity indices of the tested locations were obtained by measuring their latent semantic features. Baidu Baike (Baidu-pedia) entries were further used as empirical consensus, and local spatial autocorrelation analysis was employed to detect urban functional hot regions. In addition, spatial clusters were obtained by applying the K-MEANS method in the latent semantic space, and their validity was checked against the differences in POI density among clusters. This study demonstrates how the semantic meaning of a space can be harvested through the analysis of crowd-generated content in social media, which is useful for capturing the unique themes that shape a location and for supporting urban planning.

Cite this article

CHEN Yuanyuan, GAO Yong. Extracting and Analyzing Latent Semantic Characteristics of Locations Using Social Media Data[J]. Journal of Geo-information Science, 2017, 19(11): 1405-1414. DOI: 10.3724/SP.J.1047.2017.01405

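Finally, based on the feature relationships among regions in the latent semantic space, cluster analysis was performed to obtain clusters of the study area in the semantic space, and the plausibility of the clustering results was verified using the density distribution of POIs. This study effectively mines the collective impressions of spatial locations on social media and links semantic space with geographic space, which is of value for place perception and urban planning.
Keywords: location semantics; social media; latent semantic analysis; place perception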

1 Introduction

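Social media form a link between physical space and digital space. Because of their large data volume, ease of acquisition and wide coverage, they have been studied extensively in recent years and are regarded as an important basis for discovering individual activities, group characteristics and emergencies. In particular, the development of positioning and wireless sensing technologies and the spread of check-in behavior have added a location dimension to basic social media data, which has greatly enriched the means of geographic knowledge discovery and opened new research directions. First, analysis based on social media data is a form of social sensing [1]. Unlike traditional large-scale analysis methods, it treats each individual as the finest-grained sensing unit and discovers, from the bottom up, the characteristics of human activity and the spatial patterns of social and economic factors. Many scholars have therefore used this individual-level aggregated information to study spatial characteristics and human activities, for example functional region identification [2], analysis of major events [3] and research on human social and psychological characteristics [4]. In addition, social media contain rich semantic information that externalizes people's direct or indirect experience. Text mining of social media data combined with analysis of location attributes has also become a recent research focus [5-7].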
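In a geographic context, the geographic topics and semantic knowledge obtained bottom-up from social media also effectively support the perception of place and the depiction of the image of the city [8]. Check-in texts reflect users' cognition of and emotional attitudes toward the check-in location; the recurrence, repetition and aggregation of such group-level cognition of a particular space is precisely what gives space meaning and turns it into place [9-10]. When the spatial extent is enlarged from local sites to the city, the collection of these semantically expressed cognitions becomes a distinctive image of how the urban environment is perceived. Traditionally, the semantics of spatial objects have been described mainly through toponym ontologies [11-13], gazetteers [14-15], geo-semantic catalogues [16] and concept hierarchies [17]. These methods, however, all require complete hierarchical relations, conceptual structures or knowledge inventories within a given system and depend on prior knowledge, which makes them difficult to apply to the unstructured and imprecise short texts of social media. In response, many scholars have introduced methods from natural language processing and machine learning to express and analyze the semantics of spatial cognition and thereby characterize locations and city images. For example, Adams et al. [18] performed topic analysis on travel blogs to obtain the distinctive characteristics of cities around the world; Lansley et al. [19] applied a topic model to tweets in London and, through unsupervised classification, obtained 20 topics together with their variation across locations and user groups and their temporal patterns; Andrew et al. [20] used tweets and Wikipedia data with probabilistic topic modeling, semantic analysis and cluster analysis to propose a quantitative model for extracting semantic features from crowdsourced data, showing how individuals and groups assign meaning to places. Such crowdsourced, domain-independent natural language descriptions on social media thus provide rich and diverse channels for inferring location characteristics and city images.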
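Against this background, this paper uses social media to extract and analyze the latent semantic characteristics of locations. Unlike existing studies that express the characteristics of a location or city itself, this paper focuses on describing the degree of relatedness between locations in semantic space. Latent semantic analysis is first introduced to express location characteristics in semantic space; ideas from factor analysis, autocorrelation analysis and cluster analysis in spatial analysis are then combined to further describe and measure the characteristics and similarity of locations, thereby completing the transformation from geographic space to semantic space. Drawing on the large amount of spontaneous collaborative information on social media, the study reveals from the bottom up the characteristics that places in geographic space have in human activity. The work is of value for place description, space utilization and the identification of human activity, and offers a feasible approach to linking semantic space and geographic space.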

2 Methods

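To extract and analyze the latent semantics of locations from social media, location characteristics must first be associated with semantic characteristics. Latent Semantic Analysis (LSA) is adopted to automatically extract and represent knowledge from a large corpus [21]. On this basis, various spatial analysis methods are then applied to analyze the latent semantic characteristics of locations in depth.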
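The mathematical basis of LSA is Singular Value Decomposition (SVD). Its basic idea is to describe documents using the implicit conceptual structure that exists among the words in the documents, rather than the keywords themselves. In LSA, $m$ terms are first selected and each document is represented as a collection of these terms, so a corpus of $n$ documents can be represented as an $m \times n$ matrix $A = [\alpha_{ij}]$, where $\alpha_{ij}$ denotes the degree of co-occurrence of term $i$ and document $j$, usually weighted with the TF-IDF (Term Frequency-Inverse Document Frequency) model or the log-entropy model. Applying SVD to $A$ gives:
$$A = U \Sigma V^{T} \tag{1}$$
where $U = (u_1, u_2, \ldots, u_r)$ is the $m \times r$ orthogonal matrix formed by the $r$ eigenvectors of the term matrix $AA^{T}$; $V = (v_1, v_2, \ldots, v_r)$ is the $n \times r$ orthogonal matrix formed by the $r$ eigenvectors of the document matrix $A^{T}A$; $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r)$ is a diagonal matrix whose entries $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r$ are the singular values of $A$; $U\Sigma$ gives the term loadings on the common principal components of the documents, and $V\Sigma$ correspondingly gives the loadings of the common principal components on the documents. Taking the first $k$ singular values and the first $k$ columns of $U$ and $V$, denoted $U_k$ and $V_k$, yields a reduced-dimension representation of $A$:
$$A_k = U_k \Sigma_k V_k^{T} \tag{2}$$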
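Dimension reduction not only removes "noise" from the original data as the rank of the matrix is reduced and speeds up subsequent computation; more importantly, it abstracts the information in surface co-occurrence relations at a deeper level. This abstraction captures the latent semantic relations between terms and documents and maps the original matrix onto a latent semantic space that better matches human cognition [21], so that implicit connections between documents can be mined more effectively. The choice of k is critical in this process. Ideally, the chosen dimensionality would be comparable to and consistent with the dimensionality of the semantic features of word meaning, but such prior knowledge is hard to obtain in practice. The problem has been studied from many angles [22-26], but no universal or widely adopted method exists yet, mainly because the choice depends on the specific corpus and research purpose [27].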
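As an illustration of the decomposition and rank reduction in Eqs. (1)-(2), the following is a minimal sketch assuming NumPy. The toy counts, the function names `tfidf` and `truncated_svd`, and the particular smoothed IDF variant are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def tfidf(counts):
    """Weight a term-by-document count matrix (terms in rows) with TF-IDF."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]
    # term frequency: counts normalized by document length
    tf = counts / np.maximum(counts.sum(axis=0, keepdims=True), 1.0)
    # document frequency: number of documents containing each term
    df = np.count_nonzero(counts, axis=1)
    idf = np.log(n_docs / np.maximum(df, 1)) + 1.0  # smoothed variant (assumption)
    return tf * idf[:, None]

def truncated_svd(A, k):
    """Return U_k, the k largest singular values, and V_k of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T

# hypothetical counts: 4 terms (rows) x 3 documents (columns)
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [0, 4, 1],
                   [0, 3, 0]])
A = tfidf(counts)
U_k, s_k, V_k = truncated_svd(A, k=2)
A_k = U_k @ np.diag(s_k) @ V_k.T  # rank-2 approximation, as in Eq. (2)
```

By the Eckart-Young theorem, the Frobenius error of `A_k` equals the root sum of squares of the discarded singular values, which is one way to see what is lost when k is chosen small.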
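In fact, organizing texts in different ways allows LSA to uncover different implicit conceptual structures. In the social media context, Weibo posts are mostly short texts of fewer than 140 characters, so LSA, which is designed for mining long documents, cannot be applied directly; the texts must first be aggregated in some way, for example by time interval, spatial extent or user attributes. This paper reorganizes Weibo texts by spatial location: following a given spatial partitioning rule, all check-in posts within a given spatial unit are aggregated into one document. The original term-document matrix thus becomes a term-location matrix. LSA can then be used to analyze the information implicit in spatial locations, in the following three respects: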
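(1) Location topic extraction
SVD of the term-location matrix yields three components. $U\Sigma$ gives the term loadings on the common principal components of the locations; that is, $u_{ij}\sigma_j$ represents the correlation coefficient between principal component $j$ and term $i$. If a principal component has a pronounced thematic character, its term loadings depart clearly from a uniform distribution and concentrate on words of a particular subject, forming a topic. The values $u_{ij}\sigma_j$ under component $j$ can therefore be computed and sorted to find the keywords with the highest correlation coefficients, which are then used to name the topic. For example, if the top keywords of a component after sorting are "ticket", "scenic spot" and "tourist", it can be named the "tourism topic". By the duality of SVD, sorting $v_{ij}\sigma_j$ likewise gives the significant hot regions of topic $j$ in the study space. Because the topics obtained in this way are mutually orthogonal, the method also helps eliminate semantic overlap between feature domains.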
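The ranking of loadings described above can be sketched as follows, assuming NumPy. The term and zone labels, the toy matrix and the helper name `top_loadings` are hypothetical. Because the sign of a singular vector is arbitrary, this sketch ranks by absolute loading, which is an assumption rather than something stated in the paper.

```python
import numpy as np

def top_loadings(M_k, s_k, labels, component, n_top=3):
    """Rank labels by |m_ij * sigma_j| for one latent component j."""
    loadings = M_k[:, component] * s_k[component]
    order = np.argsort(-np.abs(loadings), kind="stable")[:n_top]
    return [(labels[i], float(loadings[i])) for i in order]

# hypothetical term-location matrix with two clearly separated themes
terms = ["ticket", "tourist", "library", "exam"]
zones = ["zone_a", "zone_b", "zone_c", "zone_d"]
A = np.array([[3.0, 3.0, 0.0, 0.0],
              [2.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

# keywords of the first component (from U) and its hot zones (from V)
topic0_terms = top_loadings(U_k, s_k, terms, component=0, n_top=2)
topic0_zones = top_loadings(V_k, s_k, zones, component=0, n_top=2)
```

Using one helper for both `U_k` and `V_k` mirrors the duality noted in the text: the same ranking gives keywords when applied to term loadings and hot zones when applied to location loadings.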
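(2) Location similarity analysis
LSA yields a structured representation of the implicit information of locations in the latent semantic space. Because this representation reflects not only term frequency and co-occurrence but also deeper information, and removes the correlation between terms, it supports better measurement of location similarity. Specifically, an existing location $i$ in the study area corresponds to column $i$ of the reduced matrix, so its vector similarity (for example, cosine similarity) to any other location in the study area (any column of the matrix) can be computed directly in the latent semantic space. Sorting these similarities gives the distribution of the areas most similar to location $i$. In fact, any text can, after word segmentation, be expressed as a vector $Q$ in the vector space model (VSM) of the selected terms, transformed into the same latent semantic space as the study area by Eq. (3) and, after normalization, compared with all locations by Eq. (4) to obtain the similarity matrix $R$:
$$Q' = Q^{T} U_k \Sigma_k^{-1} \tag{3}$$
$$R = (Q')^{T} A_k \tag{4}$$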
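The projection in Eq. (3) can be sketched as below, assuming NumPy. For the comparison step, this sketch takes one possible reading: the projected query is compared by cosine similarity with the latent coordinates of each location (the rows of $V_k$), rather than evaluating Eq. (4) literally. The toy matrix, the query vector and the function names are hypothetical.

```python
import numpy as np

def fold_in(q, U_k, s_k):
    """Eq. (3): map a term-space vector q (length m) to k latent coordinates."""
    return (q @ U_k) / s_k

def cosine_similarity_to_rows(vec, rows):
    """Cosine similarity between vec and every row of rows."""
    norms = np.linalg.norm(rows, axis=1) * np.linalg.norm(vec)
    safe = np.where(norms == 0.0, 1.0, norms)  # avoid division by zero
    return (rows @ vec) / safe

# hypothetical term-location matrix: rows are terms, columns are locations
A = np.array([[3.0, 3.0, 0.0, 0.0],
              [2.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

# hypothetical query text that uses only the first two terms
q = np.array([1.0, 1.0, 0.0, 0.0])
q_latent = fold_in(q, U_k, s_k)
similarity = cosine_similarity_to_rows(q_latent, V_k)
```

A useful sanity check of the fold-in is that projecting an existing column of `A` returns that location's own row of `V_k`, so new texts and existing locations live in the same coordinates.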
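Consider the practical meaning of $R$. If the input text $Q$ describes a location, the resulting similarity measures how similar the new place is to the locations in the study area in semantic space. If $Q$ is a piece of prior knowledge, the similarity can be understood as a supervised annotation whose magnitude indicates a region's degree of membership in that prior category; for instance, when $Q$ is an education-related text, it describes each region's membership in education in the latent semantic space. On this basis, local spatial autocorrelation analysis of $R$ yields the hot regions of functional areas. Specifically, the commonly used Local Moran's I is a special case of the Gamma index (Eq. (5)), which describes the spatial clustering of a given attribute by defining the locational similarity $w_{ij}$ and the attribute similarity $a_{ij}$ between location $i$ and the locations $j$ in its neighborhood. Here $R$ serves as a measure of attribute similarity between locations in the latent semantic space. In particular, when $R$ is obtained from a prior-knowledge text $Q$, local spatial autocorrelation analysis of the attribute similarity in $R$ gives high-value clusters that correspond to sets of regions with high membership in a category, that is, the hot regions of that functional category [20].
$$\Gamma_i = \sum_{i}^{n} \sum_{j}^{n} w_{ij} a_{ij} \tag{5}$$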
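A minimal sketch of Local Moran's I, assuming NumPy, is given below. It uses the standard form $I_i = (z_i / m_2) \sum_j w_{ij} z_j$ with row-standardized weights, which corresponds to the Gamma index with $a_{ij} = z_i z_j / m_2$. The chain-shaped adjacency and the similarity values are hypothetical, and the permutation-based significance test usually applied before labelling High-High or Low-High zones is omitted.

```python
import numpy as np

def local_morans_i(x, W):
    """Local Moran's I for attribute x under a binary adjacency matrix W."""
    x = np.asarray(x, dtype=float)
    W = np.asarray(W, dtype=float)
    # row-standardize the weights; isolated units keep all-zero rows
    row_sums = W.sum(axis=1, keepdims=True)
    W = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)
    z = x - x.mean()                 # deviations from the mean
    m2 = (z ** 2).sum() / len(x)     # second moment
    return z * (W @ z) / m2

# hypothetical similarity of 4 zones to an "education" text, zones in a chain
x = [0.9, 0.8, 0.1, 0.2]
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
I = local_morans_i(x, W)
```

In this toy case the two end zones have positive values because each resembles its only neighbor, while the two middle zones sit between a high and a low neighbor and score zero.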
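(3) Location clustering
In conventional location clustering, locations are clustered by their similarity in a feature space. In LSA, the feature space of location $i$ is either column $i$ of the matrix after SVD and dimension reduction, or its loadings on the location topics (common principal components). Both definitions support clustering of locations in the latent semantic space, but the former emphasizes data-driven grouping of locations, whereas the latter emphasizes discovering the implicit structure among locations [27]. This paper follows the former approach.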
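Clustering locations by their latent coordinates can be sketched with a plain Lloyd-style K-MEANS, assuming NumPy. The paper does not state how cluster centers were initialized; the deterministic farthest-point initialization used here is a simplification chosen so that the example is reproducible, and the feature vectors are hypothetical.

```python
import numpy as np

def farthest_point_init(X, n_clusters):
    """Pick the first point, then repeatedly the point farthest from chosen centers."""
    centers = [X[0]]
    for _ in range(1, n_clusters):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    return np.array(centers)

def kmeans(X, n_clusters, n_iter=100):
    """Lloyd's algorithm: alternate assignment and center update until stable."""
    X = np.asarray(X, dtype=float)
    centers = farthest_point_init(X, n_clusters)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# hypothetical latent coordinates of 6 locations (rows) in a 2-D latent space
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels, centers = kmeans(X, n_clusters=2)
```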

3 Results and Analysis

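In this experiment, social media check-in data are aggregated into documents by location, and latent semantic analysis and spatial analysis are used to mine the similarity of different locations in the latent semantic space and further infer their semantic characteristics. On the one hand, for local characteristics, the regional distribution of key topics, the similarity ranking of locations and the hot regions of specific functional categories are obtained. On the other hand, for global characteristics, the space is clustered on latent semantic features and, by introducing POI data, the semantic clustering results are linked to urban function, achieving the identification and classification of functional regions.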

3.1 Experimental data and preprocessing

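Beijing is taken as the study area. A total of 2 361 729 Weibo check-in records within Beijing's Fifth Ring Road from January to September 2016 were collected through the API provided by Weibo, and 651 traffic analysis zones [28] of Beijing were used as the basic units for organizing the geographic data. Traffic analysis zones are parcels obtained by partitioning urban space with the city's main road network; compared with grid cells, they are internally more homogeneous in urban function and information interaction.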
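Weibo check-in data contain noise and, being short texts, include some descriptions with no recognizable semantic information, so preprocessing is required. First, irrelevant content such as #topics#, [emoticons], @user mentions and http hyperlinks is removed from the check-in texts; texts shorter than 4 characters and other repetitive texts (such as daily check-in posts, lyric shares from NetEase Cloud Music and "share video" posts) are then deleted, and the remaining texts are segmented into words. Term selection is critical to constructing the latent semantic space, and too many terms lead to high computational cost and a sparse matrix. Only adjectives, adverbs, nouns, verbs, place names and organization names were therefore retained after segmentation, leaving 1 547 434 valid records. The Weibo texts were then aggregated by spatial join, treating all posts within the same traffic analysis zone as one document. The resulting term-location matrix is the basis for the subsequent analysis.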
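The cleaning and aggregation steps described above can be sketched with the Python standard library as follows. The regular expressions, the `blocklist` of repetitive texts, the zone identifiers and the whitespace tokenizer are all assumptions for illustration: a real pipeline for Weibo text would use a Chinese word segmenter with part-of-speech filtering and a spatial join against the traffic analysis zones. The mention pattern also assumes that a user name is followed by whitespace or punctuation.

```python
import re
from collections import Counter, defaultdict

NOISE_PATTERNS = [
    re.compile(r"https?://[A-Za-z0-9./?=&_%#-]+"),  # hyperlinks
    re.compile(r"#[^#]*#"),                          # #topic# tags
    re.compile(r"\[[^\[\]]*\]"),                     # [emoticon] markup
    re.compile(r"@[\w-]+"),                          # @user mentions
]

def clean_post(text):
    """Strip noise patterns and collapse whitespace."""
    for pattern in NOISE_PATTERNS:
        text = pattern.sub(" ", text)
    return " ".join(text.split())

def preprocess(posts, min_chars=4, blocklist=frozenset()):
    """posts: iterable of (zone_id, raw_text). Drops short, listed and duplicate texts."""
    seen, kept = set(), []
    for zone_id, raw in posts:
        text = clean_post(raw)
        if len(text) < min_chars or text in blocklist or (zone_id, text) in seen:
            continue
        seen.add((zone_id, text))
        kept.append((zone_id, text))
    return kept

def term_location_counts(posts, tokenize=str.split):
    """Aggregate posts per zone into one document and count terms (rows) by zone (columns)."""
    per_zone = defaultdict(Counter)
    for zone_id, text in posts:
        per_zone[zone_id].update(tokenize(text))
    zones = sorted(per_zone)
    terms = sorted({t for counts in per_zone.values() for t in counts})
    matrix = [[per_zone[z][t] for z in zones] for t in terms]
    return terms, zones, matrix

raw_posts = [
    ("zone_a", "#Beijing# palace ticket [happy] @friend http://t.cn/abc"),
    ("zone_a", "palace ticket"),         # duplicate after cleaning
    ("zone_a", "ok"),                    # shorter than 4 characters
    ("zone_b", "share video"),           # repetitive boilerplate
    ("zone_b", "library exam library"),
]
posts = preprocess(raw_posts, blocklist={"share video"})
terms, zones, matrix = term_location_counts(posts)
```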

3.2 Dimension selection and topic extraction

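Because the dimension k in LSA has a marked influence on the results, several approaches were used to explore its value. Three candidates were considered: k=8 from the conclusion of Doxas et al. [22], k=50 from the Profile Likelihood Test [23], and k=150, the number of singular values greater than 1. K-MEANS, WARD hierarchical clustering and spectral clustering were run for each candidate, with the silhouette coefficient as the measure of clustering quality. For K-MEANS and WARD, k=8 gave far higher silhouette coefficients than the other two values throughout the range of 2-20 clusters; for spectral clustering, which is suitable only when the number of clusters is small, k=8 was also clearly better (Fig. 1). Moreover, because the corpus consists of Weibo check-in texts, mostly short texts under 140 characters, it still contains more noise than conventional long documents even after spatial aggregation, so retaining a higher dimensionality to explain more of the original data is of little value. k=8 was therefore chosen as the LSA dimension in this experiment.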
Fig. 1 Silhouette coefficients of clustering under different LSA dimensions
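For reference, the silhouette coefficient used to compare the candidate dimensions can be computed as in the following sketch, assuming NumPy and Euclidean distance. The points and the two labelings are hypothetical, and singleton clusters are given a score of 0 by convention.

```python
import numpy as np

def silhouette_score(X, labels):
    """Mean silhouette coefficient over all samples (requires at least 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = D[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(D[i, labels == c].mean()
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
good = silhouette_score(X, [0, 0, 0, 1, 1, 1])  # labels follow the two groups
bad = silhouette_score(X, [0, 1, 0, 1, 0, 1])   # labels cut across the groups
```

Values close to 1 indicate compact, well-separated clusters, which is why a consistently higher curve for one value of k was taken as evidence in its favor.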

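On this basis, three representative topics among the eight (topics 2-4) were extracted and word clouds of their keywords were drawn (Fig. 2(a)-(c)). The keywords of each topic are strongly oriented and well differentiated. The keywords of topic 2 are mostly famous sights in Beijing, such as "Nanluoguxiang", "Summer Palace" and "Forbidden City", together with local snacks and activities, such as "douzhi" (a fermented mung bean drink) and "flag-raising". Those of topic 3 are campus facilities, such as "laboratory" and "library", and aspects of study life, such as "revision", "thesis defense" and "graduation project". Those of topic 4 reflect travel activities, including various stations, "waiting for the train" and "ticket check". The three topics can therefore be named "tourism", "study" and "transport". The hot regions of the topics differ markedly and agree with their semantic content. The "tourism" topic is concentrated in the Forbidden City, Beihai and Nanluoguxiang along the central axis, as well as the Temple of Heaven and the Summer Palace. The "study" topic is concentrated in universities in Haidian District, such as Beijing Normal University, Beijing Jiaotong University, Peking University and Tsinghua University, with scattered areas including Capital University of Economics and Business, Beijing University of Chemical Technology and Beijing University of Chinese Medicine. Under the "transport" topic, four railway stations (Beijing West, Beijing South, Beijing North and Beijing Railway Station) and Nanyuan Airport within the Fifth Ring Road are clearly extracted. This shows that applying LSA to social media data can efficiently extract the latent conceptual structure and, in a location-related context, can infer the characteristics of places from people's cognition.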
Fig. 2 Word clouds and hot regions of topics 2-4 in the latent semantic space

3.3 Similarity analysis of locations

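Peking University, Zhongnanhai and Sanlitun were taken as three representative functional areas to test which places are similar to them in the latent semantic space. Ranking the similarities gives Fig. 3, in which darker colors indicate higher similarity. LSA performs well in producing an ordered ranking of similar areas in the latent semantic space. For Zhongnanhai, the similar areas are Waijiaobu Street, the Sanlitun embassy area and some hospitals, mostly political sites and public institutions. For Peking University, they are mostly universities in Haidian District, together with Capital University of Economics and Business and Beijing University of Chinese Medicine. For Sanlitun, they are commercial districts such as Xidan and Zhongguancun. The location characteristics obtained by this bottom-up, unsupervised method are thus consistent with common knowledge and effectively link locations by semantic relatedness.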
Fig. 3 Similar locations of the test regions

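In the similarity tests above, the Weibo check-in texts reflect the individual, spontaneous understanding of a place held by part of the population over a certain period. Baidu Baike entries were therefore introduced as consensus-level guidance on place categories to annotate the study area semantically, extending the study from the latent semantic space to search, filtering and visual analysis in actual physical space.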
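Local spatial autocorrelation was used to detect the hot regions of specific functional categories. Two groups of entries were tested: education (e.g. science, science popularization, education, learning, school, culture) and transport (transport, travel, coach station, station, high-speed railway station, airport). Spatial adjacency was defined by Queen contiguity (shared edges or corners) between areal units. The results are shown in Fig. 4. The education functional regions mainly comprise the university cluster in Haidian District and the concentration of universities near Huixin West Street; under the transport category, hot spots at high-speed railway stations and coach stations are detected. Compared with directly using the similarity of individual zones, this approach has two advantages. First, taking encyclopedia entries as input gives a supervised annotation. Second, spatial autocorrelation analysis takes the neighborhood relations of locations into account and therefore yields hot regions rather than isolated hot spots. Local spatial autocorrelation also helps detect anomalous patterns. For example, under the education category there is a Low-High zone on the North Third Ring Road, the experimental fields of the Academy of Agricultural Sciences, which differs markedly from the surrounding educational land; this further supports the soundness of the analysis.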
Fig. 4 Hot region distribution of specific urban functional categories

3.4 Clustering and functional region identification

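To obtain global similarity, K-MEANS was used to cluster the study area in the latent semantic space, with the reduced latent semantic representation as the feature space. Based on the silhouette coefficient curves in Fig. 1, the number of clusters giving the best clustering quality, 8, was chosen, yielding the clustering result for the study area (Fig. 5). The clusters were named clust1-clust8 in ascending order of the number of check-in texts in each. Overall, clusters with large areas tend to be spatially contiguous, whereas clusters with small areas are spatially scattered.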
Fig. 5 K-MEANS clustering results for the area within Beijing's Fifth Ring Road

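Analysis of the distribution and high-frequency words of each cluster (with common words filtered out) gives the following:
① Clust1 is small in area and has no distinct character. It mainly consists of zones that are clearly heterogeneous from their surroundings, such as the Summer Palace and the Academy of Agricultural Sciences experimental fields, both surrounded by educational land, and some residential areas near trunk railway lines. Its high-frequency words are dominated by "Summer Palace" and "Kunming Lake", which prevail by area, and everyday domestic activities such as "key cutting" and "sleeping".
② Clust2 is the smallest cluster in area but has a high check-in density. It is mainly located at Beijing's transport hubs, including the high-speed railway stations, Nanyuan Airport and coach stations.
③ Clust3 covers scattered scenic spots in Beijing, including the Olympic Forest Park, the Zoo, Grand View Garden, Yuyuantan and Happy Valley. Its high-frequency words are key place names and activities characteristic of those areas, such as "cherry blossom" at Yuyuantan and "press conference", "performance" and "live" at the Bird's Nest.
④ Clust4 covers the area along Beijing's central axis centered on Tian'anmen. Its composition is relatively complex: high-frequency words include not only representative sights such as "Tian'anmen" and "Forbidden City" but also work-related words such as "going to work" and "embassy" associated with business districts. Check-ins here are very dense, and the area is both crowded and representative of Beijing.
⑤ Clust5 is the largest cluster in area but has a low check-in frequency. It is mainly located in residential areas in southern Beijing and has no distinct place character other than housing.
⑥ Clust6 is concentrated in Haidian District in northwestern Beijing, together with Beijing University of Chemical Technology and Beijing University of Chinese Medicine in Chaoyang District and Beijing University of Technology near the East Fourth Ring Road. High-frequency words such as "study", "graduation", "campus" and "library" indicate its educational character.
⑦ Clust7 covers the Chaoyang and Wangjing areas in the eastern part of Beijing's urban area. The high-frequency words show that people here pay attention to commuting and work matters such as "going to work", "getting off work" and "subway", and to leisure such as "movie", "coffee", "food" and "art". This reflects a well-developed tertiary sector, including commerce and media, and a relatively high standard of living.
⑧ Clust8 is the second largest in area and has the most check-ins. It resembles Clust5 and is probably also mainly residential land, but lies mostly along the northern part of the central axis.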
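To further analyze the relationship between the functional and semantic characteristics of each cluster, POI data for Beijing were introduced to interpret the clustering results. The dataset contains 111 751 POIs in more than 30 categories across Beijing. Eighteen categories within the Fifth Ring Road were selected: banks/ATMs (1632), companies and enterprises (6266), business buildings (1636), science and technology museums (19), art galleries (129), resorts (14), fishing spots (13), residences (6339), supermarkets or convenience stores (2082), restaurants (7663), cafés/tea houses (845), cinemas (98), KTV venues (251), railway stations/airports (19), libraries (226), schools (1814), research institutions (778) and training institutions (722).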
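For each cluster of parcels, the density of each POI category is computed as $den_i = N_i / S_i$ (the number of POIs $N_i$ over the area $S_i$) and normalized with the usual min-max method, $den_i' = (den_i - den_{min}) / (den_{max} - den_{min})$, so that the densities of the POI categories are dimensionless and comparable across clusters. The POI densities show certain regularities across the 8 clusters. To make the results more intuitive and persuasive, the POIs were divided into 4 groups by similarity, and a drop-line chart of each group over the 8 clusters was drawn (Fig. 6).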
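The density and min-max normalization formulas above amount to the following small sketch in plain Python. The counts and areas are hypothetical, and returning zeros when all densities are equal is a choice made here to avoid division by zero, not something specified in the paper.

```python
def poi_density(counts, areas):
    """Density of one POI category per cluster: den_i = N_i / S_i."""
    return [n / s for n, s in zip(counts, areas)]

def min_max(values):
    """Rescale values to [0, 1]; all-equal input maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# hypothetical POI counts and cluster areas (km^2) for three clusters
counts = [10, 40, 25]
areas = [2.0, 4.0, 10.0]
den = poi_density(counts, areas)
den_norm = min_max(den)
```

Normalizing each POI category separately puts rare categories (such as resorts) and common ones (such as restaurants) on the same 0-1 scale, which is what allows them to share one chart.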
Fig. 6 Normalized POI density in each cluster (drop-line chart)

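(1) The first group of POIs comprises schools, research institutions, training institutions, libraries and science and technology museums, and mainly measures the educational function of an area. The results agree with the earlier analysis: Clust6 has high density scores for all POIs in this group, which supports its character as educational land. Clust4 also has high densities in this group except for research institutions, indicating that the central axis area has a relatively complete set of functions and rich educational resources. The low scores of Clust1 are consistent with its scenic-spot character noted earlier. Clust5 and Clust8, both tentatively identified as residential areas, are differentiated here: Clust8 has relatively complete educational facilities.
(2) The second group comprises restaurants, cafés, cinemas, art galleries and KTV venues, which mainly occur in densely populated commercial districts with high consumption and high resident income. Clust4 and Clust7 are well supplied with these leisure and entertainment facilities, indicating strong spending power and a high level of economic development. The educational cluster Clust6 is clearly differentiated across entertainment POIs, favoring venues suited to group activities such as KTV and restaurants. Clust1, Clust5 and Clust8 follow a pattern similar to that in the first group.
(3) The third group consists of enterprise-type POIs: banks, business buildings and companies/factories (large factories with many workers, such as paper mills). Fig. 6 shows that enterprises in Beijing are clearly zoned by function. Clust7 has the highest density of business POIs, which, together with the previous group, further characterizes the occupations and lifestyles of people in this area. Similarly, the central axis cluster Clust4 and the educational cluster Clust6 also have a strong economic base, whereas the scenic areas of Clust1 and Clust3 are clearly separated from enterprises.
(4) The fourth group consists of POIs with scattered, single-purpose functions: supermarkets/convenience stores, residences, resorts, fishing spots and railway stations. The high densities of convenience stores and residences in Clust8 agree with its residential character identified earlier. Railway stations are concentrated almost entirely in Clust2, while leisure-related POIs are sparsely scattered across Clust1 and Clust5.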
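The analysis above shows that the clustering results are statistically and practically meaningful and that the clusters can be distinguished from one another: Clust1 consists of "outlier" zones that are heterogeneous from their surroundings; Clust2 comprises Beijing's transport hubs; Clust3 comprises scattered scenic spots; Clust4 is the central urban area, rich in resources and facilities; Clust5 and Clust8 are residential areas, the latter with better supporting services and a higher level of economic development; Clust7 is a commercially developed area with many leisure and entertainment venues and high consumption. The agreement between the semantic and functional characteristics of the clusters also indicates that the unsupervised method used here can extract the latent semantic characteristics of urban areas well. The uneven areas and parcel counts among the clusters, however, remain a problem to be addressed.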

4 Conclusions and Outlook

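With their spread and popularity, social media have become an important channel for residents to express opinions and attitudes, and the collective individual cognition contained in such big data has coalesced into a crowdsourced body of knowledge that provides important data support for scientific research. Against this background, this paper takes check-in texts on social media as samples of spatial cognition and uses latent semantic analysis together with spatial analysis methods (factor analysis, spatial autocorrelation analysis and cluster analysis) to extract and analyze the latent semantic characteristics of locations in social media. The main contributions are as follows: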
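(1) Mining the implicit semantic characteristics of locations. The method effectively uncovers the conceptual structure implicit in locations and reveals the distribution of location-related topics within the region. These topics not only point to place names in space but also reflect the activities specific to those spaces, and thus provide effective support for understanding the meaning of places.
(2) Measuring semantic relatedness between locations. The method measures the similarity of locations in the latent semantic space well, so that parcels can be indexed by their degree of semantic relatedness to a given area, which is useful for area recommendation. Extending local similarity to the global scale also yields semantic clusters in space and thereby reveals topical hot regions.
(3) Combination with supervised labeling. LSA is itself an unsupervised method that extracts latent semantics automatically, but it can also be combined with prior knowledge to perform semantic annotation. Using encyclopedia entries and POI category data as prior knowledge shows that the semantic characteristics obtained in this paper agree with the functional characteristics of the areas, and indicates that the study can be extended to supervised classification.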
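The study nevertheless has limitations. (1) Data representativeness. Geo-tagged posts account for only a small percentage of Weibo texts, the user population is biased, and the posts are unevenly distributed in space. Although check-in texts reflect residents' cognition of locations, they tend to favor particular types of areas, such as entertainment, travel and tourism venues, so the latent semantic characteristics of some types of areas, such as residential districts, are not extracted sufficiently. (2) Only latent semantic characteristics in space were extracted and analyzed; their variation over time was not considered. In practice, the time series of posting activity and the topics discussed in different periods both contribute substantially to the semantic characteristics of a place. Improvements in these two respects are directions for future research.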

The authors have declared that no competing interests exist.

[1]
Liu Y, Liu X, Gao S, et al. Social sensing: A new approach to understanding our socioeconomic environments[J]. Annals of the Association of American Geographers, 2015,105(3):512-530.

[2]
Jiang S, Alves A, Rodrigues F, et al. Mining point-of-interest data from social networks for urban land use classification and disaggregation[J]. Computers Environment & Urban Systems, 2015,53:36-46.

[3]
Shelton T, Poorthuis A, Graham M, et al. Mapping the data shadows of hurricane sandy: Uncovering the sociospatial dimensions of “big data”[J]. Geoforum, 2014,52(52):167-179.

[4]
Li L, Goodchild M F, Xu B. Spatial, temporal, and socioeconomic patterns in the use of twitter and flickr[J]. Cartography and Geographic Information Science, 2013,40(2):61-77.

[5]
Yin Z, Cao L, Han J, et al. Geographical topic discovery and comparison[C]// International Conference on World Wide Web. ACM, 2011:247-256.

[6]
Sizov S. Latent geospatial semantics of social media[J]. ACM Transactions on Intelligent Systems & Technology, 2012,3(4):1-20.

[7]
Kim K S, Kojima I, Ogawa H. Discovery of local topics by using latent spatio-temporal relationships in geo-social media[J]. International Journal of Geographical Information Science, 2016,30(9):1-24.

[8]
Lynch K. The Image of the City[M]. Massachusetts: MIT Press, 1960.

[9]
Brandenburg A M, Carroll M S. Your place or mine?: The effect of place creation on environmental values and landscape meanings[J]. Society & Natural Resources, 1995,8(5):381-398.

[10]
Williams D R, Stewart S I. Sense of place: an elusive concept that is finding a home in ecosystem management[J]. Journal of Forestry Washington, 1998,96(5):18-23.

[11]
Gao Y, Gao S, Li R Q, et al. A semantic geographical knowledge wiki system mashed up with Google Maps[J]. Science China Technological Sciences, 2010,53(1):52-60.

[12]
Li L, Zhu H H, Wang H, et al. Semantic analyses of the fundamental geographic information based on formal ontology: Exemplifying hydrological category[J]. Acta Geodaetica et Cartographica Sinica, 2008,37(2):230-235. (in Chinese)

[13]
Song J, Zhu Y Q, Wang J L, et al. A study on the model of spatio-temporal geo-ontology based on GML[J]. Journal of Geo-Information Science, 2009,11(4):442-451. (in Chinese)

[14]
Riekert W F. Automated retrieval of information in the internet by using thesauri and gazetteers as knowledge sources. Journal of Universal Computer Science, 2002,8(6):581-590.

[15]
Schlieder C, Vogele T, Visser U. Qualitative spatial representation for information retrieval by gazetteers. Proceeding of Cosit' 01 Lncs, 2001:336-351.

[16]
Farazi, Feroz, Vincenzo, et al. A semantic geo-catalogue for a local administration[J]. Artificial Intelligence Review, 2013,40(2):193-212.

[17]
Jia X B, Ai T H, Peng Z F, et al. The LOD expression and proximity measurement of the semantic on geographical information[J]. Geomatics and Information Science of Wuhan University, 2016,41(10):1299-1306. (in Chinese)

[18]
Adams B, Mckenzie G. Inferring thematic places from spatially referenced natural language descriptions[J]. Crowdsourcing Geographic Information, 2013:201-221.

[19]
Lansley G, Longley P A.The geography of twitter topics in London[J]. Computers Environment & Urban Systems, 2016,58:85-96.Social media data are increasingly perceived as alternative sources to public attitude surveys because of the volume of available data that are time-stamped and (sometimes) precisely located. Such data can be mined to provide planners, marketers and researchers with useful information about activities and opinions across time and space. However, in their raw form, textual data are still difficult to analyse coherently and Twitter streams pose particular interpretive challenges because they are restricted to just 140 characters. This paper explores the use of an unsupervised learning algorithm to classify geo-tagged Tweets from Inner London recorded during typical weekdays throughout 2013 into a small number of groups, following extensive text cleaning techniques. Our classification identifies 20 distinctive and interpretive topic groupings, which represent key types of Tweets, from describing activities or informal conversations between users, to the use of check-in applets. Our motivation is to use the classification to demonstrate how the nature of the content posted on Twitter varies according to the characteristics of places and users. Topics and attitudes expressed through Tweets are found to vary substantially across Inner London, and by time of day. Some observed variations in behaviour on Twitter can be attributed to the inferred demographic and socio-economic characteristics of users, but place and local activities can also exert a considerable influence. Overall, the classification was found to provide a valuable framework for investigating the content and coverage of Twitter usage across Inner London.

[20]
Jenkins A, Croitoru A, Crooks A T, et al. Crowdsourcing a collective sense of place[J]. PLoS ONE, 2016,11(4):e0152932. Place can be generally defined as a location that has been assigned meaning through human experience, and as such it is of multidisciplinary scientific interest. Up to this point place has been studied primarily within the context of social sciences as a theoretical construct. The availability of large amounts of user-generated content, e.g. in the form of social media feeds or Wikipedia contributions, allows us for the first time to computationally analyze and quantify the shared meaning of place. By aggregating references to human activities within urban spaces we can observe the emergence of unique themes that characterize different locations, thus identifying places through their discernible sociocultural signatures. In this paper we present results from a novel quantitative approach to derive such sociocultural signatures from Twitter contributions and also from corresponding Wikipedia entries. By contrasting the two we show how particular thematic characteristics of places (referred to herein as platial themes) are emerging from such crowd-contributed content, allowing us to observe the meaning that the general public, either individually or collectively, is assigning to specific locations. Our approach leverages probabilistic topic modelling, semantic association, and spatial clustering to find locations that are conveying a collective sense of place. Deriving and quantifying such meaning allows us to observe how people transform a location to a place and shape its characteristics.

[21]
Landauer T K, Foltz P W, Laham D. An introduction to latent semantic analysis[J]. Discourse Processes, 1998,25(3):259-284. Offers an introduction to the theory and implementation of Latent Semantic Analysis (LSA), a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text. Gives an overview of applications and modeling of human knowledge to which LSA has been applied.

[22]
Doxas I, Dennis S, Oliver W L. The dimensionality of discourse[J]. Proceedings of the National Academy of Sciences of the United States of America, 2010,107(11):4866.

[23]
Zhu M, Ghodsi A. Automatic dimensionality selection from the scree plot via the use of profile likelihood[J]. Computational Statistics & Data Analysis, 2006,51(2):918-930. Most dimension reduction techniques produce ordered coordinates so that only the first few coordinates need be considered in subsequent analyses. The choice of how many coordinates to use is often made with a visual heuristic, i.e., by making a scree plot and looking for a “big gap” or an “elbow.” In this article, we present a simple and automatic procedure to accomplish this goal by maximizing a simple profile likelihood function. We give a wide variety of both simulated and real examples.

[24]
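易文斌,慎利,齐银凤,等.基于概率潜语义分析模型的高光谱影像层次聚类分析[J].光谱学与光谱分析,2011,31(9):2471-2475. The probabilistic latent semantic analysis (PLSA) model is applied to hyperspectral image clustering, and an image clustering method based on semantic information is proposed. First, the ISODATA algorithm is used to obtain an initial clustering of the image, which forms the visual words of the PLSA model. Second, an image segmentation algorithm is applied to the hyperspectral image, and the resulting segments serve as the documents of the PLSA model. Third, several methods for estimating the optimal number of clusters are used to determine the number of latent semantic topics in the PLSA model. The parameters of the PLSA model are then estimated, giving the probability distribution of visual words within each topic and the mixing proportions of the topics in each segment. Finally, statistical pattern recognition methods are used to determine the latent semantic topic of each visual word in each image document, thereby achieving a hierarchical clustering of the image. Experimental results show that the proposed hierarchical clustering exhibits more pronounced object-oriented characteristics than the K-MEANS and ISODATA results and is closer to the true spatial distribution of ground objects.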

[ Yi W B, Shen L, Qi Y F, et al. The hierarchical clustering analysis of hyperspectral images based on probabilistic latent semantic analysis[J]. Spectroscopy & Spectral Analysis, 2011,31(9):2471-2475. ]

[25]
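陈黎飞,姜青山,王声瑞.基于层次划分的最佳聚类数确定方法[J].软件学报,2008,19(1):62-72. Determining the number of clusters in a data set is a fundamental and difficult problem in cluster analysis. The commonly used trial-and-error approach usually depends on a specific clustering algorithm and is computationally inefficient on large data sets. This paper proposes a hierarchy-based method that does not require repeated clustering of the data set. It first scans the data set to obtain CF (clustering feature) statistics, then generates partitions of the data set at different levels from the bottom up, incrementally building a clustering quality curve over these partitions; the partition corresponding to the extremum of the curve is used to estimate the optimal number of clusters. In addition, a new cluster validity index is proposed to measure the clustering quality of different partitions. The index focuses on the geometric structure of clusters, is independent of any particular clustering algorithm, and can identify noise and clusters of complex shape. Experimental results on real and synthetic data show that the new method outperforms other recently proposed indices while greatly improving computational efficiency.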

[ Chen L F, Jiang Q S, Wang S R. A hierarchical method for determining the number of clusters[J]. Journal of Software, 2008,19(1):62-72. ]

[26]
Wei C, Yang C, Lin C. A latent semantic indexing-based approach to multilingual document clustering[J]. Decision Support Systems, 2008,45(3):606-620.

[27]
Evangelopoulos N, Zhang X, Prybutok V R. Latent semantic analysis: Five methodological recommendations[J]. European Journal of Information Systems, 2012,21(1):70-86. The recent influx in generation, storage, and availability of textual data presents researchers with the challenge of developing suitable methods for their analysis. Latent Semantic Analysis (LSA), a member of a family of methodological approaches that offers an opportunity to address this gap by describing the semantic content in textual data as a set of vectors, was pioneered by researchers in psychology, information retrieval, and bibliometrics. LSA involves a matrix operation called singular value decomposition, an extension of principal component analysis. LSA generates latent semantic dimensions that are either interpreted, if the researcher's primary interest lies with the understanding of the thematic structure in the textual data, or used for purposes of clustering, categorization, and predictive modeling, if the interest lies with the conversion of raw text into numerical data, as a precursor to subsequent analysis. This paper reviews five methodological issues that need to be addressed by the researcher who will embark on LSA. We examine the dilemmas, present the choices, and discuss the considerations under which good methodological decisions are made. We illustrate these issues with the help of four small studies, involving the analysis of abstracts for papers published in the European Journal of Information Systems.

[28]
康朝贵. 基于个体时空轨迹数据的居民移动模式和城市空间结构分析方法[D].北京:北京大学,2015.

[ Kang C G. Sensing urban space from human activity and its spatio-temporal characteristics[D]. Beijing: Peking University, 2015. ]
