Explain-seq: an end-to-end pipeline from training to interpretation of sequence-based deep learning models

bioRxiv (2023)

Abstract
Interpreting predictive machine learning models to derive biological knowledge is the ultimate goal of model development in the era of exploding genomic data. Recently, sequence-based deep learning models have greatly outperformed other machine learning models, such as SVMs, in genome-wide prediction tasks. However, deep learning models are black boxes whose predictions are challenging to interpret. Here we present an end-to-end computational pipeline, Explain-seq, that automates the development and interpretation of deep learning models in the context of genomics. Explain-seq takes genomic sequences as input and outputs predictive motifs derived from the model trained on those sequences. We demonstrated Explain-seq on a public STARR-seq dataset of the A549 human lung cancer cell line released by ENCODE. Our deep learning model outperformed the gkm-SVM model in predicting A549 enhancer activities. By interpreting the trained model, we identified 47 TF motifs matching known TF PWMs, including ZEB1, SP1, YY1, and INSM1, which are associated with epithelial-mesenchymal transition and with lung cancer proliferation and metastasis. In addition, several motifs had no match in the JASPAR database and may be considered de novo enhancer motifs in the A549 cell line.

Competing Interest Statement: The authors have declared no competing interest.
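The abstract does not specify the model architecture or the interpretation method, so the following is only a minimal sketch of the kind of sequence-based model such a pipeline would train and interpret: genomic sequences are one-hot encoded and fed to a small convolutional network whose first-layer filters can later be examined as candidate motifs (e.g., by comparison against JASPAR PWMs with a motif-comparison tool such as Tomtom). The class and function names below are illustrative, not part of Explain-seq.

```python
# Minimal sketch (assumed architecture; the paper's actual model may differ):
# one-hot encode DNA sequences and predict enhancer activity with a small CNN.
import numpy as np
import torch
import torch.nn as nn

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA sequence as a 4 x L one-hot matrix (N bases stay all-zero)."""
    mat = np.zeros((4, len(seq)), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in BASES:
            mat[BASES[base], i] = 1.0
    return mat

class EnhancerCNN(nn.Module):
    """Toy convolutional model; first-layer filters act as motif scanners."""
    def __init__(self, n_filters: int = 64, filter_len: int = 15):
        super().__init__()
        self.conv = nn.Conv1d(4, n_filters, kernel_size=filter_len)
        self.pool = nn.AdaptiveMaxPool1d(1)   # keep the best match per filter
        self.head = nn.Linear(n_filters, 1)   # regress enhancer activity

    def forward(self, x):                      # x: (batch, 4, seq_len)
        h = torch.relu(self.conv(x))
        h = self.pool(h).squeeze(-1)           # (batch, n_filters)
        return self.head(h).squeeze(-1)        # (batch,) activity score

# Forward pass on one hypothetical 300 bp sequence.
seq = "ACGT" * 75
x = torch.from_numpy(one_hot(seq)).unsqueeze(0)  # (1, 4, 300)
model = EnhancerCNN()
print(model(x).item())
```

After training, the learned filters (or attribution scores from the trained model) would be summarized into PWM-like motifs and matched against known TF PWMs, which is the interpretation step the abstract describes.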
Keywords
deep learning, models, explain-seq, end-to-end, sequence-based