GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics

Maxim Zvyagin, Alexander Brace, Kyle Hippe, Yuntian Deng, Bin Zhang, Cindy Orozco Bohorquez, Austin Clyde, Bharat Kale, Danilo Perez-Rivera, Heng Ma, Carla M. Mann, Michael Irvin, Defne G. Ozgulbas, Natalia Vassilieva, James Gregory Pauloski, Logan Ward, Valerie Hayot-Sasson, Murali Emani, Sam Foreman, Zhen Xie, Diangen Lin, Maulik Shukla, Weili Nie, Josh Romero, Christian Dallago, Arash Vahdat, Chaowei Xiao, Thomas Gibbs, Ian Foster, James J. Davis, Michael E. Papka, Thomas Brettin, Rick Stevens, Anima Anandkumar, Venkatram Vishwanath, Arvind Ramanathan

bioRxiv (2023)

Abstract

We seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) that can learn the evolutionary landscape of SARS-CoV-2 genomes. By pre-training on over 110 million prokaryotic gene sequences and fine-tuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represent one of the first whole-genome-scale foundation models that can generalize to other prediction tasks. We demonstrate scaling of GenSLMs on GPU-based supercomputers and AI-hardware accelerators, utilizing 1.63 Zettaflops in training runs with a sustained performance of 121 PFLOPS in mixed precision and a peak of 850 PFLOPS. We present initial scientific insights from examining GenSLMs in tracking the evolutionary dynamics of SARS-CoV-2, paving the path to realizing this on large biological data.

Keywords

SARS-CoV-2, COVID-19, HPC, AI, large language models, whole-genome analyses
