Scaling up DNA data storage and random access retrieval

Nature Biotechnology(2017)

引用 44|浏览160
暂无评分
摘要
Current storage technologies can no longer keep pace with exponentially growing amounts of data. [1][1] Synthetic DNA offers an attractive alternative due to its potential information density of ~ 1018 B/mm3, 107 times denser than magnetic tape, and potential durability of thousands of years.[2][2] Recent advances in DNA data storage have highlighted technical challenges, in particular, coding and random access, but have stored only modest amounts of data in synthetic DNA. [3][3],[4][4],[5][5] This paper demonstrates an end-to-end approach toward the viability of DNA data storage with large-scale random access. We encoded and stored 35 distinct files, totaling 200MB of data, in more than 13 million DNA oligonucleotides (about 2 billion nucleotides in total) and fully recovered the data with no bit errors, representing an advance of almost an order of magnitude compared to prior work. [6][6] Our data curation focused on technologically advanced data types and historical relevance, including the Universal Declaration of Human Rights in over 100 languages,[7][7] a high-definition music video of the band OK Go,[8][8] and a CropTrust database of the seeds stored in the Svalbard Global Seed Vault.[9][9] We developed a random access methodology based on selective amplification, for which we designed and validated a large library of primers, and successfully retrieved arbitrarily chosen items from a subset of our pool containing 10.3 million DNA sequences. Moreover, we developed a novel coding scheme that dramatically reduces the physical redundancy (sequencing read coverage) required for error-free decoding to a median of 5x, while maintaining levels of logical redundancy comparable to the best prior codes. We further stress-tested our coding approach by successfully decoding a file using the more error-prone nanopore-based sequencing. We provide a detailed analysis of errors in the process of writing, storing, and reading data from synthetic DNA at a large scale, which helps characterize DNA as a storage medium and justify our coding approach. Thus, we have demonstrated a significant improvement in data volume, random access, and encoding/decoding schemes that contribute to a whole-system vision for DNA data storage. [1]: #ref-1 [2]: #ref-2 [3]: #ref-3 [4]: #ref-4 [5]: #ref-5 [6]: #ref-6 [7]: #ref-7 [8]: #ref-8 [9]: #ref-9
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要