Phylogenetic compression
MOF: Microbes on a Flash drive
Introduction
- The challenge
- Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, their size grows exponentially, at a faster rate than computational capacities. This makes it effectively impossible to search these data using tools such as BLAST and its successors. For instance, the proportion of searchable bacteria decreases exponentially over time (Fig. 1).
- Phylogenetic compression
- Phylogenetic compression is a technique that uses evolutionary history to guide the compression and search of large collections of microbial genomes with existing algorithms and data structures. It improves the compression ratios of assemblies, de Bruijn graphs, and k-mer indexes by 1–2 orders of magnitude. Consequently, it enables BLAST-like alignment to all bacteria sequenced until 2019 on ordinary desktop and laptop computers within a few hours.
- How does it work?
- Phylogenetic compression combines four ingredients:
- clustering of genomes into phylogenetically related groups, followed by
- inference of a compressive phylogeny that acts as a template for
- data reordering, prior to
- an application of a calibrated low-level compressor or indexer.
- This general scheme can be instantiated as individual protocols for various data types and diverse use cases.
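The effect of the reordering step can be illustrated with a toy experiment. The following is a minimal sketch, not the MOF implementation: the genomes are synthetic, the lineage labels stand in for clustering and phylogeny inference, and the xz dictionary is deliberately shrunk to 64 KiB (an assumption made only so that the effect is visible at toy scale; real collections exceed the default dictionary by far).

```python
import lzma
import random


def mutate(seq, rate, rng):
    """Return a copy of seq in which roughly `rate` of the positions are redrawn."""
    return "".join(rng.choice("ACGT") if rng.random() < rate else base for base in seq)


def xz_size(genomes, order, dict_size=1 << 16):
    """Size in bytes of the xz stream of the genomes concatenated in `order`."""
    data = "".join(f">{name}\n{genomes[name]}\n" for name in order).encode()
    # Small dictionary: matches are only found between nearby genomes.
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": dict_size}]
    return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))


rng = random.Random(0)
genomes = {}
for lineage in "ABC":  # three hypothetical lineages
    ancestor = "".join(rng.choice("ACGT") for _ in range(50_000))
    for i in range(4):  # four close relatives per lineage
        genomes[f"{lineage}{i}"] = mutate(ancestor, 0.01, rng)

# Order following a (hypothetical) tree: relatives are adjacent.
phylo_order = sorted(genomes)  # A0 A1 A2 A3 B0 ...
# Interleaved order: relatives are separated by unrelated genomes.
interleaved = sorted(genomes, key=lambda n: (n[1], n[0]))  # A0 B0 C0 A1 ...

size_phylo = xz_size(genomes, phylo_order)
size_interleaved = xz_size(genomes, interleaved)
print(f"phylogenetic order: {size_phylo} B, interleaved order: {size_interleaved} B")
```

With relatives placed next to each other, the compressor can encode each genome largely as matches against its predecessor; in the interleaved order the relatives fall outside the dictionary window and each genome is compressed almost from scratch.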
- Microbes on a Flash drive
- MOF, Microbes on a Flash drive, is a set of proof-of-concept tools implementing phylogenetic compression.
Examples of collections compressed by MOF can be found in the List of Genome Collections,
and the individual MOF packages in the List of Software Packages.
The two main tools are:
- MOF-Compress for building highly phylogenetically compressed .tar.xz genome archives.
- MOF-Search for BLAST-like search across all pre-2019 bacteria on ordinary laptop and desktop computers within a few hours.
- Want to learn more about the science behind it?
- See the main paper.
How-To's
You're a user
- BLAST-like search across all pre-2019 bacteria from ENA
- MOF-Search is a tool based on phylogenetic compression that aligns queries to all high-quality genomes from the 661k collection on standard desktop and laptop computers, in a fashion similar to BLAST. All documentation and instructions for users are provided in the README of MOF-Search.
- Downloading phylogenetically compressed 661k and BIGSIdata collections
- Phylogenetic compression can shrink existing large genome collections by 1–2 orders of magnitude compared to other methods. The two main collections provided for users are:
- 661k (661k ENA bacterial assemblies) – downloadable from Zenodo (standard .tar.xz files)
- BIGSIdata (425k ENA microbial de Bruijn graphs) – downloadable and decompressible with De-MiniPhy-BIGSIdata
For a comprehensive list of all compressed collections and the corresponding links and technical details, see the List of Genome Collections.
- Phylogenetic compression of your own genome collection by MiniPhy
- Phylogenetic compression can in principle be extremely simple and rely entirely on short custom
scripts for specific data that order the genomes according to a phylogenetic tree and compress them in that order.
MiniPhy implements this approach with MashTree and xz and is suitable for most practical use cases. All documentation and instructions for users are provided in the README of MiniPhy, including information on batching for large collections.
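The "order by tree, then compress" idea can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not MiniPhy's code: the tree is a hand-written Newick string (MiniPhy would obtain one from MashTree), the genomes are toy strings, the helper names `newick_leaf_order` and `write_ordered_tar_xz` are hypothetical, and the Newick parser handles only simple unquoted leaf names.

```python
import io
import re
import tarfile


def newick_leaf_order(newick):
    """Leaf names in left-to-right order (simple Newick, unquoted names only)."""
    # A leaf name directly follows "(" or ","; labels after ")" are internal nodes.
    return [name.strip() for name in re.findall(r"[(,]\s*([^(),:;]+)", newick)]


def write_ordered_tar_xz(genomes, order, fileobj):
    """Write one FASTA member per genome into a .tar.xz, following `order`."""
    with tarfile.open(fileobj=fileobj, mode="w:xz") as tar:
        for name in order:
            data = f">{name}\n{genomes[name]}\n".encode()
            info = tarfile.TarInfo(name=f"{name}.fa")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


# Hypothetical inputs: a toy tree and toy genomes.
tree = "((g3:0.01,g1:0.02):0.1,(g2:0.03,g4:0.01):0.2);"
genomes = {"g1": "ACGTACGT", "g2": "TTGACCAA", "g3": "ACGTACGA", "g4": "TTGACCAT"}

buf = io.BytesIO()
write_ordered_tar_xz(genomes, newick_leaf_order(tree), buf)

# Read the archive back to confirm that members follow the leaf order of the tree.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:xz") as tar:
    stored_order = tar.getnames()
print(stored_order)  # ['g3.fa', 'g1.fa', 'g2.fa', 'g4.fa']
```

For real data, a dedicated Newick parser and on-disk files would replace the regular expression and the in-memory buffer.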
You're a method developer
- Evaluating your own low-level compressor in connection with phylogenetic compression
- Recompress published tar archives, for instance, from the
phylogenetically compressed 661k collection.
If your compressor supports arbitrary content, just recompress a given TAR file, e.g., by
xzcat neisseria_gonorrhoeae__01.tar.xz \
  | your_general_compressor \
  > neisseria_gonorrhoeae__01.tar.compressed
- If your compressor supports only the FASTA format, merge all the content
(in the same file order)
and recompress it, e.g., by
tar -xOvf neisseria_gonorrhoeae__01.tar.xz \
  | your_fasta_compressor \
  > neisseria_gonorrhoeae__01.fa.compressed
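To compare the resulting sizes programmatically, the same recompression can be mirrored in a short script. The sketch below is an assumption-laden illustration: `recompress` is a hypothetical helper, `bz2.compress` merely stands in for your compressor, and the toy archive is generated in memory instead of being downloaded. For real batches, whose decompressed size may be large, a streaming approach would be preferable to holding everything in memory.

```python
import bz2
import lzma


def recompress(xz_bytes, candidate):
    """Decompress an xz stream and recompress its payload with `candidate`.

    Returns the sizes (in bytes) of the raw payload, the original xz stream,
    and the candidate's output.
    """
    payload = lzma.decompress(xz_bytes)  # the bytes that `xzcat` would emit
    recompressed = candidate(payload)
    return {"raw": len(payload), "xz": len(xz_bytes), "candidate": len(recompressed)}


# Toy stand-in for a published archive; bz2 stands in for the evaluated compressor.
toy_archive = lzma.compress(b">genome1\n" + b"ACGTTGCA" * 5000 + b"\n")
sizes = recompress(toy_archive, bz2.compress)
print(sizes)
```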
- Evaluating your own phylogeny inference methods in connection with phylogenetic compression
- Download one or more batches from the phylogenetically compressed 661k collection, infer their phylogeny using your method, and finally recompress the genomes using MiniPhy with your phylogeny. This can be achieved by placing both {batch}.txt and {batch}.nw into the input/ directory.
- Evaluating your own genome indexer in connection with phylogenetic compression
- Download one or more batches from the phylogenetically compressed 661k collection and index them in the order in which they appear in the archive. Indexing can be done either per individual batch (resulting in many small indexes) or by merging all the genome batches together while preserving their order (resulting in one large index).
- Genome order can be determined from a .tar.xz file by tar tf {batch}.tar.xz
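Feeding genomes to an indexer in archive order, across one or several batches, can be sketched as follows. This is a minimal illustration: the helper names are hypothetical, the batches are tiny in-memory archives standing in for downloaded .tar.xz files, and the call into your indexer is left out.

```python
import io
import tarfile


def genomes_in_archive_order(fileobj):
    """Yield (member_name, content_bytes) in the order stored in a .tar.xz stream."""
    # "r|xz" reads the archive sequentially, so members arrive in stored order.
    with tarfile.open(fileobj=fileobj, mode="r|xz") as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member).read()


def merged_order(batches):
    """Chain several batches while preserving the order inside each of them."""
    for batch in batches:
        yield from genomes_in_archive_order(batch)


def toy_batch(names):
    """Build a tiny in-memory .tar.xz (stand-in for a downloaded batch)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:xz") as tar:
        for name in names:
            data = f">{name}\nACGT\n".encode()
            info = tarfile.TarInfo(name=f"{name}.fa")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


batches = [toy_batch(["s2", "s1"]), toy_batch(["s4", "s3"])]
index_input_order = [name for name, _ in merged_order(batches)]
print(index_input_order)  # ['s2.fa', 's1.fa', 's4.fa', 's3.fa']
```

Calling `genomes_in_archive_order` on each batch separately corresponds to the per-batch variant (many small indexes), whereas `merged_order` corresponds to the single large index.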
List of Genome Collections
The following phylogenetically compressed genome collections are provided for download on Zenodo. Supplementary metadata for all the datasets can be found in a dedicated repository.
661k
Genomes: | 661,405 Illumina draft assemblies of (est.) 2,336 bacterial species |
Length: | 2.58 Tbp |
Diversity: | 44.3 G distinct 31-mers |
Original size: | 805 GB (750 GiB) after gzip compression |
- Compressed assemblies (XZ) – everything
- All 661k assemblies compressed within standard .tar.xz archives. Production-ready.
Final size: 29.0 GB
- Compressed assemblies (MBGC) – everything
- The same assemblies recompressed using MBGC, with all sequences from a batch merged into a single file. Experimental.
Final size: 20.7 GB
- Compressed COBS indexes (XZ) – part1, part2, part3
- Compressed COBS Classic indexes. Experimental.
Final size: 110 GB
661k-HQ
Genomes: | |
Length: | |
Diversity: | distinct 31-mers |
Original size: |
- Compressed COBS indexes (XZ) – part1, part2
- Compressed COBS Classic indexes. Production-ready.
Final size: 72.8 GB
BIGSIdata
Genomes: | 425,160 de Bruijn graphs of (est.) 1,443 microbial species |
Length: | 1.68 Tbp (total unitig length) |
Diversity: | 41.1 G distinct 31-mers |
Original size: | 16.7 TB after McCortex cleaning |
- Compressed de Bruijn graphs v1 (XZ) – part1, part2
- Simplitigs after k-mer propagation; first version from 2020, extensively validated, and decompressible with MOF-Client. Production-ready.
Final size: 74.4 GB
- Compressed de Bruijn graphs v2 (XZ) – everything
- The same data computed using a more optimized compression protocol; decompression is not yet implemented. Experimental.
Final size: 52.3 GB
NCTC3k
Genomes: | 1,065 near-complete assemblies of 259 bacterial species |
Length: | 4.35 Gbp |
Diversity: | 992 M distinct 31-mers |
Original size: | 1.25 GB after gzip compression |
- Compressed assemblies (XZ) – everything
- Compressed by MiniPhy with default settings.
Final size: 257 MB
GISP
Genomes: | 1,102 draft assemblies of N. gonorrhoeae from this paper |
Length: | 2.36 Gbp |
Diversity: | 4.18 M distinct 31-mers |
Original size: | 726 MB after gzip compression. |
- Compressed assemblies (XZ) – everything
- Compressed by MiniPhy with default settings, using phylogenies from MashTree and RAxML.
Final size: 5.67 MB and 5.44 MB, respectively.
SC2
Genomes: | 590,779 complete assemblies of SARS-CoV-2 |
Length: | 17.6 Gbp |
Diversity: | 1.85 M distinct 31-mers |
Original size: | 201 MB after xz compression |
- Compressed assemblies (XZ)
- Upon request (the licence does not allow public sharing).
Final size: 10.7 MB
List of Software Packages
Core MOF packages
- MOF-Search
- BLAST-like search on laptops across all pre-2019 bacteria (the 661k-HQ collection). Implemented as a Snakemake pipeline.
- MiniPhy
- The main package for phylogenetic compression of individual genome batches.
- MiniPhy-COBS
- Building phylogenetically compressed COBS indexes from the output of MiniPhy.
- MOF-Client
- Client program for decompressing de Bruijn graphs from BIGSIdata.
Auxiliary software
- ProPhyle
- Metagenomic classifier based on k-mer propagation, simplitigs, and k-mer indexing using the Burrows-Wheeler Transform. ProPhyle is used by MiniPhy as the underlying engine for k-mer propagation to compress de Bruijn graphs and to compute the phylogenetically explained data redundancy in genome collections. ProPhyle was modified for the purpose of MOF by adding a parameter that stops the indexing step after k-mer propagation.
Cite
Main paper
More information about phylogenetic compression and the MOF framework can be found in the main phylogenetic compression paper [1].
  [1] | K. Břinda, L. Lima, S. Pignotti, N. Quinones-Olvera, K. Salikhov, R. Chikhi, G. Kucherov, Z. Iqbal, and M. Baym, Efficient and robust search of microbial genomes via phylogenetic compression, bioRxiv 2023.04.15.536996, 2023. https://doi.org/10.1101/2023.04.15.536996 |
Low-level techniques
MOF builds upon numerous low-level computational techniques that we developed previously, including simplitigs [2], COBS [3], k-mer propagation [4], and the linkage between alignment scores and k-mer matches [5].
  [2] | K. Břinda, M. Baym, and G. Kucherov, Simplitigs as an efficient and scalable representation of de Bruijn graphs, Genome Biology 22(96), 2021. https://doi.org/10.1186/s13059-021-02297-z |
  [3] | T. Bingmann, P. Bradley, F. Gauger, and Z. Iqbal, COBS: A Compact Bit-Sliced Signature Index, SPIRE 2019, 2019. https://doi.org/10.1007/978-3-030-32686-9_21 |
  [4] | K. Břinda, Novel computational techniques for mapping and classification of Next-Generation Sequencing data. PhD thesis, University of Paris-Est, 2016. https://doi.org/10.5281/zenodo.1045317 |
  [5] | K. Břinda, M. Sykulski, and G. Kucherov, Spaced seeds improve k-mer-based metagenomic classification, Bioinformatics 31(22), 2015. https://doi.org/10.1093/bioinformatics/btv419 |
Authors
- Karel Břinda
- Leandro Lima
- Zam Iqbal
- Michael Baym
The project originally started in the Baym lab at Harvard Medical School and continues in the Břinda group at Inria GenScale. The project was developed in collaboration with the Iqbal group at EMBL-EBI.