De-heterogeneity of the eukaryotic viral reference database (EVRD) improves the accuracy and efficiency of viromic analysis

mSystems (2022)

Abstract
Contamination is widespread in public virus reference databases and often leads to confusing or even wrong conclusions in applications such as viral disease diagnosis and viromic analysis, highlighting the need for a high-quality database. Here, we report a comprehensive scrutiny and purification of the largest viral sequence collections of GenBank and UniProt through the detection and characterization of heterogeneous sequences (HGSs). A total of 766 nucleotide and 276 amino acid HGSs, up to 6,605 bp in length, were identified. They were widely distributed across 39 families, many involving viruses of high public health relevance, such as hepatitis C virus, Crimean-Congo hemorrhagic fever virus, and filovirus. The majority of these HGSs are sequences from a wide range of hosts, including humans; the rest result from vectors, misclassification, and laboratory components. However, these HGSs cannot simply be regarded as foreign contaminants, since some of them result from natural occurrence or artificial engineering of the viruses. Nevertheless, they significantly disturb genomic analysis and were therefore deleted from the database. The database was then augmented with risk and vaccine sequences, yielding a high-quality eukaryotic virus reference database (EVRD). In viromic analysis, EVRD showed higher accuracy and shorter run times than other integrated databases, reducing false positives without compromising coverage. EVRD is freely accessible and well suited to viral disease diagnosis, taxonomic clustering, viromic analysis, and novel virus detection.

Competing Interest Statement: The authors have declared no competing interest.