Parallel Large-Scale Feature Selection

semanticscholar (2010)

Abstract
The set of features used by a learning algorithm can have a dramatic impact on the performance of that algorithm. Including extraneous features can make the learning problem harder by adding useless, noisy dimensions that lead to over-fitting and increased computational complexity. Conversely, leaving out useful features can deprive the model of important signals. The problem of feature selection is to find a subset of features that allows the learning algorithm to learn the “best” model in terms of measures such as accuracy or model simplicity. The problem of feature selection continues to grow in both importance and difficulty as extremely high-dimensional data sets become the standard in real-world machine learning tasks. Scalability can become a problem for even simple approaches. For example, common feature selection approaches that evaluate each new feature by training a new model containing that feature require training a linear number of models each time they add a new feature. This computational cost can add up quickly when we are iteratively adding many new features. Even techniques that use relatively computationally inexpensive tests of a feature’s value, such as mutual information, require at least linear time in the number of features being evaluated. As a simple illustrative example, consider the task of classifying websites. In this case the data set could easily contain many millions of examples. Just including very basic features such as text unigrams on the page or HTML tags could easily provide many thousands of potential features for the model. Considering more complex attributes such as bigrams of words
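To make the cost argument concrete, the sketch below shows the general wrapper-style forward selection the abstract describes: each round trains one new model per remaining candidate feature, and those evaluations are embarrassingly parallel. This is only an illustration under assumptions, not the authors' algorithm; the choice of logistic regression, cross-validated accuracy as the score, joblib as the parallel backend, and the synthetic data are all assumptions introduced here.

```python
# Minimal sketch of greedy forward feature selection with candidate
# evaluation parallelized across features. Illustrative only; the model,
# scoring metric, and parallel backend are assumptions, not the paper's method.
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def score_with_feature(X, y, selected, candidate):
    """Fit a model on the selected features plus one candidate feature
    and return that candidate with its cross-validated accuracy."""
    cols = selected + [candidate]
    model = LogisticRegression(max_iter=1000)
    return candidate, cross_val_score(model, X[:, cols], y, cv=3).mean()


def forward_select(X, y, max_features=5, n_jobs=-1):
    """Greedy forward selection: each round trains a linear number of
    models (one per remaining candidate), evaluated in parallel."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = 0.0
    for _ in range(max_features):
        results = Parallel(n_jobs=n_jobs)(
            delayed(score_with_feature)(X, y, selected, c) for c in remaining
        )
        candidate, score = max(results, key=lambda r: r[1])
        if score <= best_score:  # stop when no candidate improves the score
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = score
    return selected, best_score


if __name__ == "__main__":
    X, y = make_classification(n_samples=500, n_features=40,
                               n_informative=5, random_state=0)
    feats, acc = forward_select(X, y)
    print("selected features:", feats, "cv accuracy: %.3f" % acc)
```

Even with the per-round parallelism shown here, the total work is still linear in the number of candidate features per added feature, which is exactly the scaling concern the abstract raises for data sets with thousands or millions of potential features.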