Now Reading
Bioinformatics Zen | We’re losing cash by solely supporting gzip for uncooked DNA recordsdata.

Bioinformatics Zen | We’re losing cash by solely supporting gzip for uncooked DNA recordsdata.

2023-01-09 12:05:17

We’re losing cash by solely supporting gzip for uncooked DNA recordsdata.


  • The rising throughput of Illumina DNA sequencing means
    establishments and corporations are spending tens of hundreds of {dollars}
    to retailer terabytes of uncooked DNA sequence (FASTQ). This knowledge is saved
    utilizing gzip, a 30-year-old compression algorithm.
  • Widespread bioinformatics instruments ought to assist newer compression
    algorithms reminiscent of zstd for FASTQ knowledge. Zstd has broad business
    assist, with comparable run occasions and would seemingly scale back storage
    prices by 50% over gzip.

Gzip is outperformed by different algorithms

The unique implementation of gzip (Gailly/Madler) has been surpassed
in efficiency by different gzip implementations. For instance,
cloudflare-zlib outperforms the
authentic gzip in compression speeds and must be used as a substitute.

Using the gzip compression format continues to be ubiquitous for uncooked FASTQ
DNA sequence. This is because of it being the one supported compression
format for FASTQ in bioinformatics instruments. Within the thirty years since gzip
was created there at the moment are alternate options with superior compression ratios.
Solely supporting gzip for FASTQ interprets into thousands and thousands of {dollars} in
storage charges on companies like Amazon’s S3 and EFS in contrast with
algorithms with higher compression ratios. Corporations like
Meta,
Amazon, and
Uber are
reported to be switching to zstd over gzip. If the most typical
bioinformatics instruments can transfer to assist ingesting zstd-compressed FASTQ
format this might save everybody money and time with minimal impression on IO
time.

A toy benchmark

For example a zstd compressed FASTQ file
(SRR7589561)
is nearly 50% the scale of the identical gzipped file. Within the determine beneath I
downloaded ~1.5Gb of FASTQ knowledge and compressed it with both pigz or
zstd. Pigz is a parallel implementation of the unique gzip.

FASTQ file measurement by compression algorithm.

FASTQ recordsdata do nevertheless take longer to compress with zstd. The ztsd -15
command takes ~70s which is 50% longer than pigz -9 at ~35s. Nonetheless,
it’s value noting when storing uncooked FASTQ from a sequencer, these recordsdata
are compressed as soon as, after which saved for years. This extra CPU time
price is greater than offset by financial savings in storage prices. The identical doesn’t
apply to intermediate recordsdata reminiscent of trimmed or filtered FASTQ in a
pipeline that are typically ephemeral. These would require a futher
examination of tradeoffs.

Complete compression time in seconds by algorithm.

The following determine reveals that adjustments in decompression time for a similar
file are comparatively small, 3.5s vs 2.2s. Subsequently decompression would
be minimally impacted.

Complete decompression time in seconds by algorithm.

Detailed comparability of flags

This determine compares the compressed output file measurement for all of the
totally different out there gzip implementations with zstd for various
compression flags on the identical SRR7589561 FASTQ file. This reveals that
zstd outperforms gzip on the highest compression ranges, with the output
file sizes being ~60% the scale of the very best gzip compression ranges.

See Also

Output compressed file measurement ratios by command line flag for every compression instrument. Every color represents a unique compression instrument implementation. Every argument was benchmarked 5 occasions. Word that zopfli has a single datum as a result of it solely compresses to the max ratio.

This subsequent plot compares the trade-offs for file measurement versus the wall
clock run time taken to compress a FASTQ file. That is for the
compression course of working single-threaded. This reveals that zstd can
end in a lot better compression ratios, ~10% of the unique file measurement
however with rising run time. Although not practically so long as the run time
for zopfli, a gzip implementation offers the very best compression ratio of
any gzip implementation however on the expense ~2 orders of magnitude in
compression time.

Compression ratio versus compression time. Every color represents a unique compression instrument implementation. Every argument was benchmarked 5 occasions. Word that zopfli has a single level as a result of it solely compresses to the max ratio.

The gzip implementation is outmoded by different compression algorithms
reminiscent of zstd. By persevering with to solely assist gzip for FASTQ, the
bioinformatics business spends cash unnecessarily on further
storage. Bioinformatics instruments ought to extensively assist zstd as a
compression format for FASTQ.

Source Link

What's Your Reaction?
Excited
0
Happy
0
In Love
0
Not Sure
0
Silly
0
View Comments (0)

Leave a Reply

Your email address will not be published.

2022 Blinking Robots.
WordPress by Doejo

Scroll To Top