What type of study is this?

This is an experimental study.

October 19, 2025 | Open Access

PMKLC: Parallel Multi-Knowledge Learning-based Lossless Compression for Large-Scale Genomics Database

Key Points

PMKLC improves the average compression ratio by up to 73.609% over baseline compressors on genomic data.
Average throughput improves by up to 10.710 times over existing compressors in the multi-GPU mode (PMKLC-M).
Uses a GPU-accelerated (s, k)-mer encoder and a multi-knowledge learning framework.
Demonstrates better robustness against diverse probability distribution perturbations.

Abstract

Learning-based lossless compressors play a crucial role in large-scale genomic database backup, storage, transmission, and management. However, their 1) inadequate compression ratio, 2) low compression [...] 2) we design a GPU-accelerated (s, k)-mer encoder to optimize compression throughput and computing resource usage; 3) we introduce data block partitioning and Step-wise Model Passing (SMP) mechanisms for parallel acceleration; 4) we design two compression modes, PMKLC-S and PMKLC-M, to meet complex application scenarios, where the former runs on a resource-constrained single GPU and the latter is multi-GPU accelerated. We benchmark PMKLC-S/M and 14 baselines (7 traditional and 7 learning-based) on 15 real-world datasets covering different species and data sizes. Compared to the baselines on the testing datasets, PMKLC-S/M achieve average compression ratio improvements of up to 73.609% and 73.480%, and average throughput improvements of up to 3.036× and 10.710×, respectively. PMKLC-S/M also achieve the best robustness and a competitive memory cost, indicating their greater stability on datasets with different probability distribution perturbations and their ability to run on memory-constrained devices.
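
The abstract names an (s, k)-mer encoder and data block partitioning without defining either. The following is a minimal CPU-only sketch under stated assumptions, not the authors' implementation: it assumes k is the window length in bases and s is the stride between windows, that each window is mapped to a base-4 integer, and that blocks are contiguous and nearly equal in size. The function names `sk_mer_encode` and `partition_blocks` are hypothetical.

```python
from typing import List

# Assumed 2-bit alphabet; real genomic data may also contain N and other symbols.
BASE_TO_ID = {"A": 0, "C": 1, "G": 2, "T": 3}


def sk_mer_encode(seq: str, s: int, k: int) -> List[int]:
    """Map every length-k window, taken every s bases, to a base-4 integer.

    Assumption: this is one plausible reading of "(s, k)-mer"; the source
    does not spell out the definition.
    """
    if s <= 0 or k <= 0:
        raise ValueError("s and k must be positive")
    tokens = []
    # Trailing bases that do not fill a whole window are ignored in this sketch.
    for start in range(0, len(seq) - k + 1, s):
        value = 0
        for base in seq[start:start + k]:
            value = value * 4 + BASE_TO_ID[base]
        tokens.append(value)
    return tokens


def partition_blocks(seq: str, num_blocks: int) -> List[str]:
    """Split seq into exactly num_blocks contiguous, nearly equal blocks."""
    if num_blocks <= 0:
        raise ValueError("num_blocks must be positive")
    base, extra = divmod(len(seq), num_blocks)
    blocks, start = [], 0
    for i in range(num_blocks):
        # The first `extra` blocks each take one additional base.
        end = start + base + (1 if i < extra else 0)
        blocks.append(seq[start:end])
        start = end
    return blocks
```

In a parallel setting, each block returned by `partition_blocks` could be encoded independently, for example one block per worker or per GPU. How PMKLC actually schedules blocks and passes models between them (SMP) is not described in this abstract.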
