Identifying Important Communications

Aaron Jaffey, Akifumi Kobashi

semanticscholar(2012)

Abstract
As we move towards a society increasingly dependent on electronic communication, our inboxes have grown so large that many of us cannot keep up with the deluge of information. We therefore wanted to know whether machine learning could reduce this social noise by filtering out what users consider unimportant. Although this sounds similar to spam filtering, we wanted to consider all major forms of electronic communication: phone, email, and Facebook. Furthermore, this type of classification is far more individual than spam filtering, so one's own communications provide the best source of training data. We trained Naive Bayes and SVM classifiers on a sizeable set of communication metadata; under cross-validation they performed reasonably well on these different sources, achieving best accuracies of around 84% for Facebook data, 89% for cellular data, and 98.5% for email data. These data sources appear to be only weakly correlated, so it is difficult to improve classification accuracy by linking various communications using an address book.

1 Methodology and Data

In order to train classification algorithms, it was first necessary to collect personal communications data. For each of the three means of communication, we wrote Python scripts to scrape the data. We obtained email data using IMAP, Facebook data using the Facebook developer API, and cell phone data through an export process from our cell phone carriers.

For this problem, we chose a simple binary classification scheme that divides communications into important and unimportant. While this is a rather coarse classification scheme, it makes automated and manual classification of training data easier. For example, for email, we are able to label training data using different methods based on one's usage patterns. In one case, we can consider read messages in the user's inbox as important, and unread ones as unimportant.
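The read/unread heuristic can be sketched in a few lines of Python. This is only an illustration, not the authors' script: the function name `label_from_fetch_response` and the 1/0 label encoding are assumptions. It relies on the standard IMAP `\Seen` flag and on `imaplib.ParseFlags` from the Python standard library, applied to the raw bytes of a `FETCH ... (FLAGS)` response.

```python
import imaplib


def label_from_fetch_response(resp: bytes) -> int:
    """Label a message from its IMAP FLAGS response (hypothetical helper).

    Returns 1 (important) if the message carries the \\Seen flag,
    i.e. the user has read it, and 0 (unimportant) otherwise.
    """
    # ParseFlags extracts the flags as a tuple of bytes objects.
    flags = imaplib.ParseFlags(resp)
    return 1 if b"\\Seen" in flags else 0


if __name__ == "__main__":
    # Example responses in the shape returned by an IMAP FETCH of FLAGS.
    print(label_from_fetch_response(b"1 (FLAGS (\\Seen \\Answered))"))
    print(label_from_fetch_response(b"2 (FLAGS ())"))
```

In a real pipeline the response bytes would come from an `imaplib.IMAP4_SSL` connection; the labeling step itself needs no network access.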
Alternatively, we can consider messages in one's inbox as important, and archived messages as unimportant. Similarly, we can assume that emails one replied to were important.

Facebook message data is rather difficult to classify automatically by examining metadata. Although the Facebook message center is constantly changing, users do not typically receive and discard enough unimportant messages for discarded mail to serve as training data. Furthermore, almost all personal messages appear in the user's inbox. Determining whether a Facebook message was replied to is problematic, because messages can be divided into short snippets from chat sessions, and clustering them to identify conversations is an undertaking in itself. Due to these difficulties, we chose to classify Facebook messages manually. Classifying phone call data posed a similar problem, so we also chose to classify calls manually.

With this project, we hoped not only to classify important communications within each platform, but also to link our classification results together in a logical and useful way. A natural way to do this is to connect data based on the other party in the communication. We began by exporting our personal address book data into a form that could be integrated with machine learning algorithms and queried to derive a unique identifier for a person based on their name, email address, or phone number. If the other party was not in the address book, we generated a new unique identifier for them. By including features that link a communication to its sender/recipient, we attempted to
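The address-book lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `AddressBook` class, its `resolve` method, the contact dictionary layout, and the normalization rules (lowercasing, keeping the last ten digits of a phone number) are all hypothetical choices.

```python
import itertools
import re


class AddressBook:
    """Maps names, email addresses, and phone numbers to one ID per person."""

    def __init__(self, contacts):
        # contacts: iterable of dicts with optional "name", "emails", "phones".
        self._ids = {}
        self._counter = itertools.count()
        for contact in contacts:
            person_id = next(self._counter)
            handles = [contact.get("name", "")]
            handles += contact.get("emails", [])
            handles += contact.get("phones", [])
            for handle in handles:
                if handle:
                    self._ids[self.normalize(handle)] = person_id

    @staticmethod
    def normalize(handle: str) -> str:
        """Reduce a handle to a canonical lookup key (assumed rules)."""
        text = handle.strip().lower()
        if "@" in text:
            return text  # email address
        digits = re.sub(r"\D", "", text)
        if len(digits) >= 7:
            return digits[-10:]  # phone number, country code dropped
        return " ".join(text.split())  # name, whitespace collapsed

    def resolve(self, handle: str) -> int:
        """Return the person's ID, generating a new one for unknown parties."""
        key = self.normalize(handle)
        if key not in self._ids:
            self._ids[key] = next(self._counter)
        return self._ids[key]


if __name__ == "__main__":
    book = AddressBook([
        {"name": "Alice Example",
         "emails": ["alice@example.com"],
         "phones": ["(555) 010-1234"]},
    ])
    print(book.resolve("ALICE@example.com") == book.resolve("555-010-1234"))
    print(book.resolve("stranger@example.com"))
```

The resulting integer ID can then serve as a categorical feature shared by the email, phone, and Facebook records of the same person.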