SR2C: A Structurally Redundant Short Reads Collapser for Optimizing DNA Data Compression.

Hui Sun, Huidong Ma, Yingfeng Zheng, Haonan Xie, Xiaofei Wang, Xiaoguang Liu, Gang Wang

International Conference on Parallel and Distributed Systems (2023)

Abstract
Current redundant-sequence deduplication algorithms cannot remove structurally repetitive DNA short reads, such as mirror, reverse, paired, and complementary palindromes, from high-throughput genomic sequencing data. These methods also cannot construct indexes to recover the original sequences, so they fail to meet the lossless-compression requirements of downstream applications. To address these problems, we propose a data structure called Cycle-Hash-Linkage (CHL) and present SR2C (Structurally Redundant Short Reads Collapser), a CPU-parallel algorithm based on CHL that improves the compression ratio of DNA sequencing data. Experimental results on real data from the NCBI database show that SR2C improves the average residual sequence percentage by 2.556% compared with Minirmd, the state-of-the-art redundant-sequence deduplication algorithm. Furthermore, cascading SR2C with the compression algorithms Pigz, PBzip2, XZ, and 7Z improves their average compression ratios by 92.345%, 78.999%, 10.132%, and 7.434%, respectively. By leveraging multi-core CPU parallel computation, SR2C effectively reduces time consumption, achieving a 2-5x speedup in deduplication and recovery. A Linux toolkit of the same name is freely available at https://github.com/fahaihi/SR2C.
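The idea of collapsing structurally redundant reads while keeping an index for lossless recovery can be illustrated with a minimal Python sketch. This is not the authors' CHL data structure or the SR2C implementation: it assumes, for illustration only, that two reads are structurally redundant when one equals the other, its reverse, its complement, or its reverse complement, and it uses a plain Python dictionary in place of a rolling hash. The names `variants`, `collapse`, and `recover` are hypothetical.

```python
from typing import Dict, List, Tuple

# Base-pairing table; characters outside ACGT (e.g. N) are left unchanged.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def variants(read: str) -> List[str]:
    """Return the read under four transforms (assumed definition of
    structural redundancy): 0 identity, 1 reverse, 2 complement,
    3 reverse complement. Each transform is its own inverse."""
    comp = read.translate(_COMPLEMENT)
    return [read, read[::-1], comp, comp[::-1]]


def collapse(reads: List[str]) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Keep one representative per group of structurally redundant reads.

    Returns the kept reads and, for every input read, a pair
    (position of its representative in kept, transform code)."""
    seen: Dict[str, int] = {}          # kept read -> position in kept
    kept: List[str] = []
    index: List[Tuple[int, int]] = []
    for read in reads:
        for code, candidate in enumerate(variants(read)):
            if candidate in seen:
                index.append((seen[candidate], code))
                break
        else:
            # No variant seen before: this read becomes a representative.
            seen[read] = len(kept)
            index.append((len(kept), 0))
            kept.append(read)
    return kept, index


def recover(kept: List[str], index: List[Tuple[int, int]]) -> List[str]:
    """Rebuild the original reads in order by re-applying each transform
    (valid because every transform is an involution)."""
    return [variants(kept[pos])[code] for pos, code in index]


reads = ["ACGT", "TGCA", "AACC", "GGTT", "CCAA", "ACGT"]
kept, index = collapse(reads)
restored = recover(kept, index)
```

In this sketch the six input reads collapse to two representatives, and `recover` reproduces the input order exactly, which is the property a lossless pipeline needs before handing the residual reads to a general-purpose compressor.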
Keywords
parallel algorithm, redundancy deduplication, DNA sequencing data compression, structurally redundant reads, data compression, rolling-hash algorithm