Author Identification in Turkish Documents with Ridge Regression Analysis

Birol Kuyumcu, Basak Buluz, Yavuz Komecocglu

Signal Processing and Communications Applications Conference (2019)


Abstract

The volume of documents grows in step with the pace of technological development, creating a need for effective classification methods that categorize documents and make them easier to access. In addition to printed documents, hundreds of thousands of texts are published on digital media every day, and in this noisy information environment texts are often attributed to the wrong author or circulated anonymously. In this study, to address the author recognition problem, features extracted with Tf-Idf weighting over word 1- to 3-grams and character 2- to 6-grams were combined and represented in a vector space. A Ridge Regression model is trained for each author, and each trained model produces a prediction score on the test data set. The author whose model gives the highest score is taken as the final prediction. The model was built on opinion columns from the national newspapers Hurriyet and Sabah: it was trained on 100 different opinion columns from each of 237 different writers, published over the last 20 years, and tested on a test set of 20 different opinion columns per author. With an accuracy of 89.6%, the model outperformed the best results reported in the literature on the same dataset.
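The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, assuming scikit-learn; the abstract does not say which library or hyperparameters the authors used. The toy corpus, the author labels, the `alpha` value, and the function name `build_author_model` are all hypothetical. `RidgeClassifier` is used here because, for more than two classes, it fits one ridge regression target per class and predicts the class with the highest score, which corresponds to the per-author scheme described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import FeatureUnion, make_pipeline


def build_author_model(alpha=1.0):
    """Hypothetical helper: Tf-Idf n-gram features + one-vs-rest ridge regression.

    alpha is an assumed regularization strength, not a value from the paper.
    """
    features = FeatureUnion([
        # Word-level 1- to 3-grams, Tf-Idf weighted.
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 3))),
        # Character-level 2- to 6-grams, Tf-Idf weighted.
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 6))),
    ])
    # FeatureUnion concatenates both feature blocks into one vector space.
    return make_pipeline(features, RidgeClassifier(alpha=alpha))


# Made-up toy corpus (not the Hurriyet/Sabah data): three authors, two texts each.
train_texts = [
    "ekonomi ve enflasyon bu yil yine gundemde",
    "merkez bankasi faiz ve enflasyon karari acikladi",
    "takim bu hafta derbi macinda gol atti",
    "teknik direktor derbi oncesi takim kadrosunu acikladi",
    "yeni film festivali sinema salonunda basladi",
    "yonetmen yeni film icin sinema oyunculari secti",
]
train_authors = ["A", "A", "B", "B", "C", "C"]

test_texts = [
    "enflasyon ve faiz ekonomi icin onemli",
    "derbi macinda takim iyi oynadi",
    "sinema salonunda yeni film izledik",
]

model = build_author_model()
model.fit(train_texts, train_authors)

# One score per (document, author); the highest-scoring author is the prediction.
scores = model.decision_function(test_texts)
classes = model.named_steps["ridgeclassifier"].classes_
predicted = classes[scores.argmax(axis=1)]

for text, author in zip(test_texts, predicted):
    print(f"{author}: {text}")
```

Taking the argmax over `decision_function` columns is equivalent to calling `model.predict(test_texts)`; it is written out here only to make the "highest value wins" step explicit.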


Keywords

ridge regression, author recognition, tf-idf, natural language processing
