Predicting bursts and popularity of hashtags in real-time

SIGIR, pp. 927-930, 2014.

Keywords: burst, hashtag, miscellaneous, real-time prediction
Solution: We explore five different regression models, including Linear Regression, Classification And Regression Tree, Gaussian Process Regression, Support Vector Regression and Neural Network

Abstract:

Hashtags have been widely used to annotate topics in tweets (short posts on Twitter.com). In this paper, we study the problems of real-time prediction of bursting hashtags. Will a hashtag burst in the near future? If it will, how early can we predict it, and how popular will it become? Based on empirical analysis of data collected from Tw...

Introduction
  • As one of the leading platforms of social communications and information dissemination, Twitter has become a major source of information for common Web users.
  • The paper centers on two notions: bursts and real-time prediction.
  • The authors present real-time prediction of bursting hashtags before they burst.
  • Built on top of the preliminary work [3], the authors present in this paper formal definitions of four different states in the life cycle of a bursting hashtag and two corresponding prediction tasks.
Highlights
  • As one of the leading platforms of social communications and information dissemination, Twitter has become a major source of information for common Web users
  • Bursts of topics have been shown to have predictive power for product sales, the stock market, search engine queries, elections, and even outbreaks of diseases
  • We present real-time prediction of bursting hashtags before they burst
  • Built on top of our preliminary work [3], we present in this paper formal definitions of four different states in the life cycle of a bursting hashtag and two corresponding prediction tasks
  • We provide formal definitions of four states in the life cycle of a bursting hashtag, based on which we present two real-time prediction tasks of bursting hashtags
  • Solution: We explore five different regression models, including Linear Regression (LR), Classification And Regression Tree (CART), Gaussian Process Regression (GPR), Support Vector Regression (SVR) and Neural Network (NN)
Results
  • The authors provide formal definitions of four states in the life cycle of a bursting hashtag, based on which the authors present two real-time prediction tasks of bursting hashtags.
  • The authors present solutions to the two prediction tasks, and evaluate the performance of different types of features and different methods on real data.
  • The authors monitor the time series of all hashtags, but only those satisfying certain criteria will trigger the prediction(s).
  • Table 1 shows the distribution of bursting hashtags in different states of their life cycles, estimated from the 10% sample of the tweet stream.
  • The proportion of hashtags that are going to burst drops to as low as 0.8% at the 6th hour after they become active.
  • The authors present different types of features which may be effective in predicting whether a hashtag is going to burst and how popular it will become.
  • For Task 1, the number of bursting hashtags among the top-k prototypes is used as a feature.
  • For Task 2, top-k bursting prototypes are extracted, and their weighted average popularity is used as a feature.
  • Predictions were made at six representative moments after a hashtag is marked as active, which can be divided into three stages: early stage (5min, 15min), middle stage (30min, 1h) and late stage (3h, 6h).
  • Predictions later than 6 hours after a hashtag becomes active were not considered, because by that time only 5% of bursting hashtags have not yet started bursting.
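The two prototype features described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the summary does not say how prototypes are matched or how the average is weighted, so the Euclidean distance, the inverse-distance weights, and the record layout (`series`, `bursting`, `popularity`) are all assumptions made for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length time series (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_prototypes(query, prototypes, k):
    """Return (distance, prototype) pairs for the k prototypes closest to query."""
    scored = [(euclidean(query, p["series"]), p) for p in prototypes]
    scored.sort(key=lambda pair: pair[0])  # sort on distance only
    return scored[:k]


def task1_feature(query, prototypes, k):
    """Task 1 feature: number of bursting hashtags among the top-k prototypes."""
    nearest = top_k_prototypes(query, prototypes, k)
    return sum(1 for _, p in nearest if p["bursting"])


def task2_feature(query, prototypes, k):
    """Task 2 feature: weighted average popularity of the top-k bursting prototypes."""
    bursting = [p for p in prototypes if p["bursting"]]
    nearest = top_k_prototypes(query, bursting, k)
    if not nearest:
        return 0.0
    # Assumed weighting: closer prototypes count more (inverse distance).
    weights = [1.0 / (1.0 + d) for d, _ in nearest]
    total = sum(w * p["popularity"] for w, (_, p) in zip(weights, nearest))
    return total / sum(weights)


# Hypothetical prototypes, for illustration only.
prototypes = [
    {"series": [1, 2, 4], "bursting": True, "popularity": 100.0},
    {"series": [1, 1, 1], "bursting": False, "popularity": 5.0},
    {"series": [2, 4, 8], "bursting": True, "popularity": 300.0},
]
query = [1, 2, 4]
count_feature = task1_feature(query, prototypes, k=2)
popularity_feature = task2_feature(query, prototypes, k=2)
```

In this sketch the same function would be called at each prediction moment (5min, 15min, 30min, 1h, 3h, 6h) with the time series observed so far; how the series are aligned or truncated at each moment is left open here.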
Conclusion
  • The earlier in the life cycle of a hashtag, the less important the time series features.
  • Removing time series features decreases the F1-score by 72.41% when the prediction is made 6 hours after the hashtag becomes active, but by only 5.64% when it is made 5 minutes after.
  • When predictions are made early on (i.e., 5 minutes after the hashtag becomes active), prototype features are the second most useful category after time series features, followed by network features and meme features.
  • Correct predictions of bursting hashtags made 5 minutes after a hashtag becomes active come, on average, 55 minutes before the start of the burst.
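The feature-importance figures above come from removing one category of features and measuring how far the F1-score falls. A small sketch of that arithmetic is given below. The summary does not state whether the reported percentages are relative or absolute drops, so the relative form used here is an assumption, and the confusion counts are made-up illustration values, not results from the paper.

```python
def f1_score(tp, fp, fn):
    """F1-score from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_drop(full, ablated):
    """Percentage decrease of a score after removing a feature category.

    Assumes the drop is measured relative to the full-feature score.
    """
    if full <= 0:
        raise ValueError("full-feature score must be positive")
    return 100.0 * (full - ablated) / full


# Hypothetical confusion counts, for illustration only.
f1_full = f1_score(tp=8, fp=2, fn=2)      # all feature categories
f1_ablated = f1_score(tp=4, fp=4, fn=6)   # one category removed
drop_percent = relative_drop(f1_full, f1_ablated)
```

Repeating the calculation once per feature category and per prediction moment would give a table of the same shape as Table 2.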
Tables
  • Table 1: Bursting hashtags in different states. Columns: time since active, PAB (%), POB (%), PAI (%)
  • Table 2: Importance of features in Task 1. Removing time series features results in the largest drop in performance. This decrease is less significant when predictions occur early
  • Table 3: Comparison of algorithms for Task 2
Funding
  • The work is supported by the National Natural Science Foundation of China (61373022, 61073004, 60773156) and the Chinese Major State Basic Research Development 973 Program (2011CB302203-2)
  • It is also partially supported by the National Science Foundation under grant numbers IIS-0968489 and IIS-1054199, and by DARPA under award number W911NF-12-1-0037
  • For those bursting hashtags predicted correctly, how early are they predicted in advance of their bursts? According to the statistics of the results, when predicted at 5 minutes since active, the correct predictions happen on average 55 minutes earlier than the actual bursts
Reference
  • G. H. Chen, S. Nikolov, and D. Shah. A latent source model for nonparametric time series classification. In Advances in Neural Information Processing Systems, pages 1088–1096, 2013.
  • D. Gruhl, R. Guha, R. Kumar, J. Novak, and A. Tomkins. The predictive power of online chatter. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 78–87. ACM, 2005.
  • S. Kong, Q. Mei, L. Feng, and Z. Zhao. Real-time predicting bursting hashtags on twitter. In Web-Age Information Management. Springer, 2014.
  • Y. R. Lin, D. Margolin, B. Keegan, A. Baronchelli, and D. Lazer. #bigbirds never die: Understanding social dynamics of emergent hashtag. arXiv preprint arXiv:1303.7144, 2013.
  • Z. Ma, A. Sun, and G. Cong. On predicting the popularity of newly emerging hashtags in twitter. Journal of the American Society for Information Science and Technology, 2013.
  • S. Nikolov. Trend or No Trend: A Novel Nonparametric Method for Classifying Time Series. PhD thesis, Massachusetts Institute of Technology, 2012.