Machine Learning Algorithm for Text Categorization of News Articles from Senegalese Online News Websites

2022 17th Iberian Conference on Information Systems and Technologies (CISTI), 2022

Abstract
The growth of information in the form of news articles poses a major challenge in every society. This information is unstructured, and managing and making effective use of it manually is tedious. In this study, we aimed to build a classifier that categorizes news articles from Senegalese online news websites according to their content: "politique", "religion", "justice", "sante", "education", "faits-divers", and "people". Using web scraping techniques, we developed a crawler to collect news articles from the Senegalese online press. We used the one-vs-all (OvA) method to transform our multi-label problem into binary classification problems, and classified the online news articles using several supervised machine learning algorithms. We used the Area Under the ROC Curve (AUC) to measure the performance of the selected models. Most themes were recognized correctly, with a micro-average AUC above 80%. Overall, the Random Forest algorithm performed best, with a micro-average AUC of 90%. Built on data collected from various Senegalese online news portals, our final prediction model can serve as the basis for a tool that automatically categorizes published news articles, and can guide future research in this area.
Keywords
Machine learning, text categorization, text representations, journalistic data
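
The evaluation described in the abstract (OvA binarization followed by a micro-average AUC) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the toy labels, and the per-category scores below are hypothetical, and the scores stand in for whatever a trained classifier such as a Random Forest would output. The micro-average here pools every (article, category) pair into one binary problem before computing the AUC.

```python
from typing import List, Sequence

# The seven categories named in the abstract.
CATEGORIES = ["politique", "religion", "justice", "sante",
              "education", "faits-divers", "people"]


def binarize_ova(labels: Sequence[str], categories: Sequence[str]) -> List[List[int]]:
    """One-vs-all targets: one row per article, one 0/1 column per category."""
    return [[1 if label == c else 0 for c in categories] for label in labels]


def binary_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def micro_average_auc(labels: Sequence[str],
                      scores: Sequence[Sequence[float]],
                      categories: Sequence[str]) -> float:
    """Pool all (article, category) pairs into one binary problem, then take its AUC."""
    targets = binarize_ova(labels, categories)
    flat_true = [t for row in targets for t in row]
    flat_score = [s for row in scores for s in row]
    return binary_auc(flat_true, flat_score)


if __name__ == "__main__":
    # Hypothetical true labels and made-up classifier scores (one per category).
    toy_labels = ["politique", "sante", "people"]
    toy_scores = [
        [0.7, 0.1, 0.2, 0.1, 0.0, 0.1, 0.1],
        [0.2, 0.0, 0.1, 0.6, 0.3, 0.1, 0.0],
        [0.1, 0.1, 0.0, 0.1, 0.1, 0.4, 0.3],
    ]
    print(micro_average_auc(toy_labels, toy_scores, CATEGORIES))
```

The pairwise comparison in `binary_auc` is quadratic in the number of pairs, which is fine for illustration; a rank-based computation would be the usual choice on a full corpus.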