Phylogenetic compression
Microbes on a Flash drive
Introduction
- The challenge
- Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, their size grows exponentially, at a faster rate than computational capacities. This makes it effectively impossible to search these data using tools such as BLAST and its successors. For instance, the proportion of searchable bacteria decreases exponentially over time (Fig. 1).
- Phylogenetic compression
- Phylogenetic compression is a technique using evolutionary history to guide compression and search of large collections of microbial genomes using existing algorithms and data structures. This improves the compression ratios of assemblies, de Bruijn graphs, and 𝑘-mer indexes by 1–2 orders of magnitude. Consequently, this enables BLAST-like alignment to all sequenced bacteria until 2019 on ordinary desktop and laptop computers within a few hours.
- How does it work?
- Phylogenetic compression combines four ingredients:
- clustering of genomes into phylogenetically related groups, followed by
- inference of a compressive phylogeny that acts as a template for
- data reordering, prior to
- an application of a calibrated low-level compressor or indexer.
- This general scheme can be instantiated to individual protocols for various data types and for diverse application use cases, such as genome compression and data search.
- Application 1: Efficient parallelized compression of arbitrarily sized genome collections
- MiniPhy (Minimization via Phylogenetic compression) implements phylogenetic compression for large bacterial datasets. Examples of miniphied collections are provided in the List of Compressed Genome Collections. For instance, the 661k collection was recompressed from the original 805 GB (GZip) to 17.5 GB (MiniPhy-MBGC).
- Application 2: BLAST-like alignments across all pre-2019 bacteria
- Phylign implements BLAST-like search across all pre-2019 bacteria on ordinary laptop and desktop computers. For instance, the time to search 2,826 EBI plasmids was reduced from the original 2,120 CPU hours (BIGSI, pres./abs. only) to 44 CPU hours (Phylign, pres./abs. and alignments).
- Want to learn more about the science behind it?
- See the main paper about phylogenetic compression.
How-To’s
You’re a user
- BLAST-like search across all pre-2019 bacteria from ENA
- Phylign is a tool based on phylogenetic compression that aligns queries to all high-quality genomes from the 661k collection on standard desktop and laptop computers, in a fashion similar to BLAST. All documentation and instructions for users are provided in the README of Phylign.
- Downloading phylogenetically compressed 661k and BIGSIdata collections
- Phylogenetic compression makes it possible to compress existing large genome collections by 1–2 orders of magnitude compared to state-of-the-art protocols. The two main collections provided for users are:
- 661k (661k ENA bacterial assemblies) – downloadable from Zenodo (standard .tar.xz files)
- BIGSIdata (425k ENA microbial de Bruijn graphs) – downloadable and extractable by de-MiniPhy-BIGSIdata
For a comprehensive list of all compressed collections and additional details, see the List of Compressed Genome Collections.
- Phylogenetic compression of custom genome collections by MiniPhy
- Phylogenetic compression can in principle be extremely straightforward and based entirely on simple dataset-specific scripts that order the genomes according to a phylogenetic tree and compress them in that order.
MiniPhy implements this specifically with MashTree and XZ (possibly followed by MBGC) and is suitable for most practical use cases. All documentation and instructions for users are provided in the README of MiniPhy, including information on batching in case of very large collections.
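As a rough illustration of such a dataset-specific script, the following Bash sketch concatenates FASTA files in the left-to-right leaf order of a Newick tree and compresses the resulting stream. It is not MiniPhy's implementation. The function names, the {leaf}.fa file naming, and the restriction to unquoted leaf labels without whitespace are assumptions made for this example only, and the tree is assumed to have been inferred beforehand (for instance with MashTree).

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helper functions, not part of MiniPhy.

# Print the leaf labels of a Newick tree in left-to-right order.
# Assumes unquoted labels without whitespace; a leaf label is any
# token that directly follows "(" or ",".
newick_leaf_order() {
    tr -d '\n\r' < "$1" | grep -oE '[(,][^(),:;]+' | cut -c2-
}

# Concatenate <fasta_dir>/<leaf>.fa in tree order and compress the stream.
# Usage: compress_in_tree_order TREE.nw FASTA_DIR OUTPUT [COMPRESSOR ARGS...]
# The compressor must read stdin and write stdout; the default is "xz -9 -c".
compress_in_tree_order() {
    local tree=$1 fasta_dir=$2 out=$3
    shift 3
    if [ "$#" -eq 0 ]; then
        set -- xz -9 -c
    fi
    newick_leaf_order "$tree" | while IFS= read -r leaf; do
        cat "$fasta_dir/$leaf.fa"
    done | "$@" > "$out"
}
```

For example, compress_in_tree_order batch.nw genomes/ batch.fa.xz would write the reordered genomes to batch.fa.xz. Any other stream compressor can be substituted by appending it as extra arguments.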
You’re a method developer
- Evaluating your own low-level compressor in connection with phylogenetic compression
- Recompress published tar archives, for instance, from the phylogenetically compressed 661k collection.
- If your compressor supports arbitrary content, just recompress a given TAR file, e.g., by
    xzcat neisseria_gonorrhoeae__01.tar.xz \
      | your_general_compressor \
      > neisseria_gonorrhoeae__01.tar.compressed
- If your compressor supports only the FASTA format, merge all the content (in the same file order) and recompress it, e.g., by
    tar -xOvf neisseria_gonorrhoeae__01.tar.xz \
      | your_fasta_compressor \
      > neisseria_gonorrhoeae__01.fa.compressed
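To turn such a recompression into a size comparison, a small helper along the following lines could be used. It is a sketch under assumptions: compare_fasta_compressor is a hypothetical name, and the compressor is expected to read FASTA from standard input and write its output to standard output.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helper, not part of MiniPhy.

# Stream all files of a batch archive (in archive order) through a compressor
# and print "<archive bytes> <recompressed bytes>".
# Usage: compare_fasta_compressor ARCHIVE COMPRESSOR [ARGS...]
compare_fasta_compressor() {
    local archive=$1
    shift
    local before after
    # $(( ... )) strips the padding that some wc implementations add.
    before=$(( $(wc -c < "$archive") ))
    after=$(( $(tar -xOf "$archive" | "$@" | wc -c) ))
    echo "$before $after"
}
```

A call such as compare_fasta_compressor neisseria_gonorrhoeae__01.tar.xz your_fasta_compressor then prints the size of the published archive followed by the size obtained with the compressor under evaluation.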
- Evaluating your own phylogeny inference methods in connection with phylogenetic compression
- Download one or more batches from the phylogenetically compressed 661k collection, infer their phylogeny using your method, and finally re-compress the genomes using MiniPhy with your phylogeny. This can be achieved by placing both {batch}.txt and {batch}.nw into the input/ directory.
- Evaluating your own genome indexer in connection with phylogenetic compression
- Download one or more batches from the phylogenetically compressed 661k collection and index them in the order in which they appear in the archive. Indexing can be done either per individual batch (resulting in many small indexes) or by merging all the genome batches together while preserving their order (resulting in one large index).
- Genome order can be determined from a .tar.xz file by tar tf {batch}.tar.xz
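When several batches are merged into one index, the per-batch orders can be chained as in the following minimal sketch. The helper merged_genome_order is a hypothetical name introduced here, and the sketch assumes that the archives are passed in the order in which the batches should be indexed.

```bash
#!/usr/bin/env bash
# Sketch only: hypothetical helper, not part of MiniPhy or Phylign.

# Print "<archive name><TAB><member path>" for every file entry of the given
# archives, keeping both the order of the archives and the order of the
# members inside each archive. Directory entries (ending in "/") are skipped.
# Usage: merged_genome_order BATCH1.tar.xz [BATCH2.tar.xz ...]
merged_genome_order() {
    local archive
    for archive in "$@"; do
        tar -tf "$archive" \
            | awk -v batch="$(basename "$archive")" '!/\/$/ { print batch "\t" $0 }'
    done
}
```

The second column of the output gives the genome order to feed to the indexer, while the first column records which batch each genome came from.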
List of Compressed Genome Collections
The following phylogenetically compressed genome collections are provided for download on Zenodo. Supplementary metadata for all the datasets can be found in a dedicated repository.
661k
Genomes: | 661,405 Illumina draft assemblies of (est.) 2,336 bacterial species |
Length: | 2.58 Tbp |
Diversity: | 44.3 G distinct canonical 31-mers |
Original size: | assemblies: 805 GB (750 GiB, GZip); k-mer index: 936 GB (872 GiB, COBS Compact index) |
Significance: | All pre-2019 Illumina-sequenced bacterial isolates from ENA, all assembled using a unified pipeline |
Assemblies
- MiniPhy-XZ – 29.0 GB, production-ready
- Technique: Standard MiniPhy pipeline based on XZ.
- Accessions: 10.5281/zenodo.4602622
- MiniPhy-MBGCv1 – 20.7 GB, experimental
- Technique: Per-batch FASTA files from MiniPhy recompressed using MBGC v1.2
- Accessions: 10.5281/zenodo.6347064
- MiniPhy-MBGCv2 – 17.5 GB, experimental
- Technique: Per-batch FASTA files from MiniPhy recompressed using MBGC v2.0
- Accessions: 10.5281/zenodo.10229555
K-mer indexes
- MiniPhy-COBS-XZ – 110. GB, experimental
- Technique: Compressed COBS Classic indexes
- Accessions: 10.5281/zenodo.7313926, 10.5281/zenodo.7313942, 10.5281/zenodo.7315499
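The size reductions implied by the figures listed for the 661k collection can be checked with a few lines of shell arithmetic. The ratio helper is a name introduced for this example only; its inputs are the sizes quoted above (805 GB for the gzipped assemblies and 936 GB for the COBS Compact index).

```bash
#!/usr/bin/env bash
# Sketch only: "ratio" is a hypothetical helper for this example.

# Print ORIGINAL / COMPRESSED with one decimal place.
ratio() {
    awk -v orig="$1" -v comp="$2" 'BEGIN { printf "%.1f\n", orig / comp }'
}

ratio 805 29.0   # assemblies, MiniPhy-XZ       -> 27.8
ratio 805 20.7   # assemblies, MiniPhy-MBGCv1   -> 38.9
ratio 805 17.5   # assemblies, MiniPhy-MBGCv2   -> 46.0
ratio 936 110    # k-mer index, MiniPhy-COBS-XZ -> 8.5
```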
661k-HQ
Significance: | Only those assemblies from the 661k that passed quality control (i.e., excluding contaminated samples) |
K-mer indexes
- MiniPhy-COBS-XZ – 72.8 GB, production-ready
- Technique: Compressed COBS Classic indexes
- Accessions: 10.5281/zenodo.6845083, 10.5281/zenodo.6849657
BIGSIdata
Genomes: | 425,160 de Bruijn graphs of (est.) 1,443 microbial species |
Length: | 1.68 Tbp (total unitig length) |
Diversity: | 41.1 G distinct canonical 31-mers |
Original size: | 16.7 TB after McCortex cleaning |
Significance: | All pre-2016 ENA bacterial and viral genomes |
de Bruijn graphs
- Prototype of MiniPhy(P3)-XZ – 74.4 GB, production-ready
- Technique: Simplitigs after 𝑘-mer propagation stored as FASTA, first prototype of MiniPhy protocol 3 from 2020, validated with k-mer counting
- Accessions: 10.5281/zenodo.4086456, 10.5281/zenodo.4087330
- Extraction: by de-MiniPhy-BIGSIdata.
- MiniPhy(P3)-XZ – 52.3 GB, experimental
- Technique: Current MiniPhy protocol 3; extraction not yet implemented
- Accessions: 10.5281/zenodo.5555253
NCTC3k
Genomes: | 1,065 near-complete assemblies of 259 bacterial species |
Length: | 4.35 Gbp |
Diversity: | 992 M distinct canonical 31-mers |
Original size: | 1.25 GB after gzip compression |
Significance: | A high-quality collection of diverse, nearly-complete bacterial genomes |
Assemblies
- MiniPhy-XZ – 257 MB
- Technique: MiniPhy with default settings
- Accessions: 10.5281/zenodo.5533354
GISP
Genomes: | 1,102 Illumina draft assemblies of N. gonorrhoeae from this paper |
Length: | 2.36 Gbp |
Diversity: | 4.18 M distinct canonical 31-mers |
Original size: | 726 MB after gzip compression |
Significance: | A high-quality collection of draft assemblies of a single bacterial species of low diversity |
Assemblies
- MiniPhy-XZ – 5.67 MB / 5.44 MB
- Technique: MiniPhy with default settings, using MashTree / RAxML
- Accession: 10.5281/zenodo.10070404
SC2
Genomes: | 590,779 complete assemblies of SARS-CoV-2 |
Length: | 17.6 Gbp |
Diversity: | 1.85 M distinct canonical 31-mers |
Original size: | 201 MB after xz compression |
Significance: | An extremely large collection of genomes of the same viral species |
Assemblies
- Equiv. of MiniPhy-XZ – 10.7 MB
- Technique: Genomes sorted left-to-right with respect to GISAID phylogeny and compressed by XZ
- Files: Upon request (due to the licensing policies)
List of Software Packages
Core packages for phylogenetic compression
- Phylign
- BLAST-like search on laptops across all pre-2019 bacteria (the 661k-HQ collection). Implemented as a Snakemake pipeline.
- MiniPhy
- The main package for phylogenetic compression of individual genome batches.
Auxiliary packages for phylogenetic compression
- MiniPhy-COBS
- Building phylogenetically compressed COBS indexes from the output of MiniPhy.
- de-MiniPhy-BIGSIdata
- Download and extraction of de Bruijn graphs from the minified BIGSIdata collection.
Low-level tools particularly adapted for phylogenetic compression
- ProPhyle
- Metagenomic classifier, based on 𝑘-mer propagation, simplitigs, and 𝑘-mer indexing using the Burrows-Wheeler Transform. ProPhyle is used by MiniPhy as the underlying engine for 𝑘-mer propagation to compress de Bruijn graphs and for computing the phylogenetically explained data redundancy in genome collections. ProPhyle was modified for the purpose of MiniPhy by adding a parameter that stops the indexing step after 𝑘-mer propagation.
- COBS
- High-performance k-mer index based on inverted indexes and Bloom filters; an efficient re-implementation of BIGSI with additional ideas. To use COBS in Phylign, we implemented functionality for reading indexes from data streams and support for OS X (versions 0.2.0 and 0.2.1).
Cite
Main paper
More information about phylogenetic compression, MiniPhy, and Phylign can be found in the main phylogenetic compression paper [1].
[1] | K. Břinda, L. Lima, S. Pignotti, N. Quinones-Olvera, K. Salikhov, R. Chikhi, G. Kucherov, Z. Iqbal, and M. Baym, Efficient and robust search of microbial genomes via phylogenetic compression, bioRxiv 2023.04.15.536996, 2023. https://doi.org/10.1101/2023.04.15.536996 |
Low-level techniques
MiniPhy and Phylign build upon several low-level computational techniques that we developed previously, including simplitigs [2], COBS [3], 𝑘-mer propagation and ProPhyle [4], and the linkage between alignment scores and 𝑘-mer matches [5].
[2] | K. Břinda, M. Baym, and G. Kucherov, Simplitigs as an efficient and scalable representation of de Bruijn graphs, Genome Biology 22(96), 2021. https://doi.org/10.1186/s13059-021-02297-z |
[3] | T. Bingmann, P. Bradley, F. Gauger, and Z. Iqbal, COBS: A Compact Bit-Sliced Signature Index, SPIRE 2019, 2019. https://doi.org/10.1007/978-3-030-32686-9_21 |
[4] | K. Břinda, Novel computational techniques for mapping and classification of Next-Generation Sequencing data. PhD thesis, University of Paris-Est, 2016. https://doi.org/10.5281/zenodo.1045317 |
[5] | K. Břinda, M. Sykulski, G. Kucherov, Spaced seeds improve 𝑘-mer-based metagenomic classification, Bioinformatics 31(22), 2015. https://doi.org/10.1093/bioinformatics/btv419 |
Authors
- Karel Břinda
- Leandro Lima
- Zam Iqbal
- Michael Baym
The project originally started in the Baym lab at Harvard Medical School and has been continuing in the Břinda group at Inria GenScale. The project was developed in collaboration with the Iqbal group at EMBL-EBI.