Bioinformatics Zen | We’re wasting money by only supporting gzip for raw DNA files.
- The growing throughput of Illumina DNA sequencing means institutions and companies are spending tens of thousands of dollars to store terabytes of raw DNA sequence (FASTQ). This data is stored using gzip, a 30-year-old compression algorithm.
- Common bioinformatics tools should support newer compression algorithms such as zstd for FASTQ data. Zstd has broad industry support, with comparable run times, and would likely reduce storage costs by 50% over gzip.
Gzip is outperformed by other algorithms
The original implementation of gzip (Gailly/Madler) has been surpassed in performance by other gzip implementations. For example, cloudflare-zlib outperforms the original gzip in compression speed and should be used instead.
Use of the gzip compression format is still ubiquitous for raw FASTQ DNA sequence. This is because it is the only compression format for FASTQ supported by bioinformatics tools. In the thirty years since gzip was created, alternatives with superior compression ratios have emerged.
Only supporting gzip for FASTQ translates into millions of dollars in storage fees on services like Amazon’s S3 and EFS, compared with algorithms with better compression ratios. Companies like
Meta,
Amazon, and
Uber are
reported to be switching to zstd over gzip. If the most common bioinformatics tools can move to support ingesting zstd-compressed FASTQ, this could save everyone time and money with minimal impact on IO time.
A toy benchmark
As an example, a zstd-compressed FASTQ file
(SRR7589561)
is almost 50% the size of the same gzipped file. For the figure below I downloaded ~1.5Gb of FASTQ data and compressed it with either pigz
or
zstd
. Pigz is a parallel implementation of the original gzip.
FASTQ files do, however, take longer to compress with zstd. The zstd -15
command takes ~70s, twice as long as pigz -9
at ~35s. Still,
it’s worth noting that when storing raw FASTQ from a sequencer, these files
are compressed once, then stored for years. This additional CPU-time
cost is more than offset by the savings in storage costs. The same doesn’t
apply to intermediate files, such as trimmed or filtered FASTQ in a
pipeline, which tend to be ephemeral. These would require a further
examination of the trade-offs.
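The compression step above can be sketched as follows; this assumes pigz and zstd are installed, and SRR7589561.fastq stands in for the downloaded FASTQ data:

```shell
# Compress the same FASTQ file with both tools, keeping the input (-k).
time pigz -9 -k SRR7589561.fastq    # gzip-compatible output: SRR7589561.fastq.gz
time zstd -15 -k SRR7589561.fastq   # zstd output: SRR7589561.fastq.zst
ls -l SRR7589561.fastq.gz SRR7589561.fastq.zst
```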
The next figure shows that the difference in decompression time for the same
file is relatively small, 3.5s vs 2.2s. Decompression would therefore be
minimally impacted.
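For the decompression side, both tools stream to stdout, so pipelines built around `zcat`-style input only need to swap the decompressor (again assuming pigz and zstd are available):

```shell
# Stream-decompress to /dev/null to time decompression alone.
time unpigz -c SRR7589561.fastq.gz > /dev/null   # equivalent to: pigz -dc
time zstd -dc SRR7589561.fastq.zst > /dev/null
```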
Detailed comparison of flags
This figure compares the compressed output file size for all the
available gzip implementations and zstd, across different
compression flags, on the same SRR7589561 FASTQ file. It shows that
zstd outperforms gzip at the highest compression levels, with output
file sizes ~60% the size of the best gzip compression levels.
The next plot compares the trade-off of file size versus the wall-clock
time taken to compress a FASTQ file, with the compression process running
single-threaded. It shows that zstd can reach much better compression
ratios, down to ~10% of the original file size, at the cost of increasing
run time. Even so, this is nowhere near the run time of zopfli, a gzip
implementation that produces the best compression ratio of any gzip
implementation but at the expense of ~2 orders of magnitude more
compression time.
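A level sweep like the one behind these plots can be sketched with a simple loop; the levels below are illustrative, and the block assumes the zstd CLI is installed:

```shell
# Compress the same file at several zstd levels and report the output size.
for level in 1 3 9 15 19; do
  time zstd -$level -k -f -q -o reads.$level.zst SRR7589561.fastq
  printf 'zstd -%-2s: %s bytes\n' "$level" "$(wc -c < reads.$level.zst)"
done
```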
The gzip implementation is superseded by other compression algorithms
such as zstd. By continuing to only support gzip for FASTQ, the
bioinformatics industry spends money unnecessarily on additional
storage. Bioinformatics tools should widely support zstd as a
compression format for FASTQ.
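Until tools read zstd natively, a decompressing pipe offers a workaround today; `my_tool` is a placeholder for any program that accepts FASTQ on stdin:

```shell
# Stream zstd-compressed FASTQ into a stdin-reading tool without
# materialising the uncompressed file on disk.
zstd -dc SRR7589561.fastq.zst | my_tool --fastq /dev/stdin
```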