New support vector machine formulations and algorithms with application to biomedical data analysis (2011)

Abstract
The Support Vector Machine (SVM) classifier seeks the separating hyperplane $w \cdot x = r$ that maximizes the margin distance $1/\|w\|_2^2$. It can be formalized as an optimization problem that minimizes the hinge loss $\sum_i (1 - y_i f(x_i))_+$ plus the L2-norm of the weight vector. SVM is now a mainstay method of machine learning. The goal of this dissertation work is to solve different biomedical data analysis problems efficiently using extensions of SVM, in which we augment the standard SVM formulation based on the application requirements. The biomedical applications we explore in this thesis include cancer diagnosis, biomarker discovery, and energy function learning for protein structure prediction.

Ovarian cancer diagnosis is problematic because the disease is typically asymptomatic, especially at early stages of progression and/or recurrence. We investigate a sample set consisting of 44 women diagnosed with serous papillary ovarian cancer and 50 healthy women or women with benign conditions. We profile the relative metabolite levels in the patient sera using a high-throughput ambient ionization mass spectrometry technique, Direct Analysis in Real Time (DART). We then reduce the diagnostic classification on these metabolic profiles to a functional classification problem and solve it with the functional Support Vector Machine (fSVM) method. The assay distinguished between the cancer and control groups with an unprecedented 99% accuracy (100% sensitivity, 98% specificity) under leave-one-out cross-validation. This approach has significant clinical potential as a cancer diagnostic tool.

High-throughput technologies provide simultaneous evaluation of thousands of potential biomarkers to distinguish different patient groups. To assist biomarker discovery from these low-sample-size, high-dimensional cancer data, we first explore a convex relaxation of the L0-SVM problem and solve it using mixed-integer programming techniques.
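The standard formulation named at the start of the abstract (hinge loss plus an L2 penalty on the weight vector) can be sketched in a few lines. This is a minimal illustration, not the dissertation's code: the decision function `f(x) = w·x - r`, the trade-off parameter `C`, the factor of 0.5 on the penalty, the function names, and the toy data are all assumptions chosen to match common textbook conventions.

```python
# Sketch of a linear soft-margin SVM objective: L2 penalty plus hinge loss.
# All names, the constant C, and the toy data are illustrative assumptions.

def decision(w, r, x):
    """Assumed decision function f(x) = w.x - r, so the hyperplane is w.x = r."""
    return sum(wj * xj for wj, xj in zip(w, x)) - r

def hinge(margin):
    """Hinge loss (1 - margin)_+ : zero once a point clears the margin."""
    return max(0.0, 1.0 - margin)

def svm_objective(w, r, X, y, C=1.0):
    """0.5 * ||w||_2^2 + C * sum_i (1 - y_i f(x_i))_+ with labels y_i in {-1, +1}."""
    penalty = 0.5 * sum(wj * wj for wj in w)
    loss = sum(hinge(yi * decision(w, r, xi)) for xi, yi in zip(X, y))
    return penalty + C * loss

# Toy data: three points lie beyond the margin, the last one is misclassified.
X = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.5, 0.0]]
y = [1, 1, -1, -1]
w, r = [1.0, 1.0], 0.0

value = svm_objective(w, r, X, y)
print(value)
```

Only the last point contributes to the hinge term here (its margin is -0.5, giving a loss of 1.5), so the objective is the penalty of 1.0 plus 1.5. The extensions described in the abstract change the penalty term or add constraints to this same basic structure.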
We further propose a more efficient L0-SVM approximation, the fractional norm SVM, by replacing the L2-penalty with an Lq-penalty ($q \in (0,1)$) in the optimization formulation. We solve it using the Difference of Convex functions (DC) programming technique. Empirical studies on synthetic data sets as well as real-world biomedical data sets support the effectiveness of our proposed L0-SVM approximation methods over other commonly used sparse SVM methods such as the L1-SVM method.

A critical open problem in ab initio protein folding is protein energy function design. We reduce the problem of learning an energy function for ab initio folding to a standard machine learning problem, learning-to-rank. Based on the application requirements, we constrain the reduced ranking problem with non-negative weights and develop two efficient algorithms for non-negativity constrained SVM optimization. We conduct the empirical study on an energy data set for random conformations of 171 proteins that fall into the ab initio folding class. We compare our approach with the optimization approach used in the protein structure prediction tool TASSER. Numerical results indicate that our approach was able to learn energy functions with improved rank statistics (evaluated by pairwise agreement) as well as improved correlation between the total energy and structural dissimilarity.
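To illustrate why an Lq-penalty with q in (0,1) serves as an approximation to the L0 penalty, the sketch below compares penalty values on a sparse and a dense weight vector. It assumes the penalty takes the form `sum_j |w_j|^q`, which is one common definition and may differ from the exact form used in the dissertation. The function names and example vectors are hypothetical, and the sketch only evaluates penalties; it does not implement the DC programming solver.

```python
# Compare L2, L1, fractional (q = 0.5), and L0 penalties on two weight vectors.
# Assumed penalty form: sum_j |w_j|^q. Names and vectors are illustrative.

def lq_penalty(w, q):
    """Sum of |w_j|^q over all coordinates, for q > 0."""
    return sum(abs(wj) ** q for wj in w)

def l0_count(w, tol=1e-12):
    """Number of coordinates whose magnitude exceeds tol (the L0 'norm')."""
    return sum(1 for wj in w if abs(wj) > tol)

sparse = [1.0, 0.0, 0.0, 0.0]   # one active feature
dense = [0.5, 0.5, 0.5, 0.5]    # four active features

for q in (2, 1, 0.5):
    print(q, lq_penalty(sparse, q), lq_penalty(dense, q))
print("L0", l0_count(sparse), l0_count(dense))
```

On these two vectors the squared L2 penalty is identical, the L1 penalty charges the dense vector twice as much as the sparse one, and the q = 0.5 penalty charges it about 2.83 times as much, moving toward the 4-to-1 ratio of the L0 count. This is the sense in which a smaller q favors solutions that use fewer features, which is the property of interest for biomarker selection.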
Keywords
SVM optimization, L0-SVM problem, critical open problem, energy function, biomedical data analysis, commonly-used sparse SVM method, optimization problem, biomarker discovery, new support vector machine, empirical study, functional classification problem, application requirement