Introduction

Fig. 1: Exponential decrease of bacteria searchability by BLAST.
The challenge
Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, their size grows exponentially, at a faster rate than computational capacities. This makes it effectively impossible to search these data using tools such as BLAST and its successors. For instance, the proportion of searchable bacteria decreases exponentially over time (Fig. 1).
Phylogenetic compression
Phylogenetic compression is a technique using evolutionary history to guide compression and search of large collections of microbial genomes using existing algorithms and data structures. This improves the compression ratios of assemblies, de Bruijn graphs, and 𝑘-mer indexes by 1–2 orders of magnitude. Consequently, this enables BLAST-like alignment to all sequenced bacteria until 2019 on ordinary desktop and laptop computers within a few hours.
How does it work?
Phylogenetic compression combines four ingredients:
  1. clustering of genomes into phylogenetically related groups, followed by
  2. inference of a compressive phylogeny that acts as a template for
  3. data reordering, prior to
  4. an application of a calibrated low-level compressor or indexer.
This general scheme can be instantiated to individual protocols for various data types and for diverse application use cases.
Microbes on a Flash drive
MOF (Microbes on a Flash drive) is a set of tools implementing proofs of concept of phylogenetic compression. Examples of collections compressed by MOF can be found in the List of Genome Collections, and the individual MOF packages in the List of Software Packages. The two main tools are:
  • MOF-Compress for building highly phylogenetically compressed .tar.xz genome archives.
  • MOF-Search for BLAST-like search across all pre-2019 bacteria on ordinary laptop and desktop computers within a few hours.
Want to learn more about the science behind it?
See the main paper.

How-To’s

You’re a user

BLAST-like search across all pre-2019 bacteria from ENA
MOF-Search is a tool based on phylogenetic compression that aligns queries to all high-quality genomes from the 661k collection on standard desktop and laptop computers, in a fashion similar to BLAST. All documentation and instructions for users are provided in the README of MOF-Search.
Downloading phylogenetically compressed 661k and BIGSIdata collections
Phylogenetic compression makes it possible to compress existing large genome collections by 1–2 orders of magnitude compared to other methods. The two main collections provided for users are 661k and BIGSIdata.

For a comprehensive list of all compressed collections and the corresponding links and technical details, see the List of Genome Collections.

Phylogenetic compression of your own genome collection by MiniPhy
Phylogenetic compression can in principle be extremely simple and based entirely on simple custom scripts for specific data, ordering them according to a phylogenetic tree and compressing them in that order.

MiniPhy implements this specifically with MashTree and xz and is suitable for most practical use cases. All documentation and instructions for users are provided in the README of MiniPhy, including information on batching for large collections.
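As a rough illustration of the simple-script route described above (this is not MiniPhy itself), the following shell sketch assumes that a phylogeny has already been inferred and that its leaf order has been written to a hypothetical file order.txt, one FASTA path per line. The function name compress_in_order and all file names are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch: archive genomes in a precomputed phylogenetic order, then compress with xz.
# Assumption: the order file lists one FASTA path per line, in tree leaf order.

compress_in_order() {
    local order_file=$1 out_archive=$2
    # -T reads the member list from a file; tar stores members in that order,
    # so closely related genomes end up adjacent in the stream that xz sees.
    # -T1 asks xz for single-threaded compression.
    tar -cf - -T "$order_file" | xz -9 -T1 > "$out_archive"
}
```

A call such as compress_in_order order.txt batch.tar.xz then produces a standard .tar.xz archive whose member order follows the tree. For real collections, MiniPhy remains the recommended route, since it also handles tree inference and batching.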

You’re a method developer

Evaluating your own low-level compressor in connection with phylogenetic compression
Recompress published tar archives, for instance, from the phylogenetically compressed 661k collection. If your compressor supports arbitrary content, just recompress a given TAR file, e.g., by
# Decompress the xz layer and recompress the raw tar stream with your tool
xzcat neisseria_gonorrhoeae__01.tar.xz \
  | your_general_compressor \
  > neisseria_gonorrhoeae__01.tar.compressed
If your compressor supports only the FASTA format, merge all the content (in the same file order) and recompress it, e.g., by
# -O writes the contents of all members to stdout, in archive order
tar -xOvf neisseria_gonorrhoeae__01.tar.xz \
  | your_fasta_compressor \
  > neisseria_gonorrhoeae__01.fa.compressed
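To compare compressors on the same batch, it can help to report only the resulting size instead of writing output files. The helper below is a minimal sketch under the assumption that the compressor reads from stdin and writes to stdout; the function name recompressed_size is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report how many bytes a compressor produces for one batch archive.
# Usage: recompressed_size ARCHIVE.tar.xz COMPRESSOR [ARGS...]
# Assumption: the compressor reads from stdin and writes to stdout.

recompressed_size() {
    local archive=$1
    shift
    # Decompress the xz layer, feed the raw tar stream to the compressor,
    # and count the bytes it emits (whitespace from wc is stripped).
    xz -dc "$archive" | "$@" | wc -c | tr -d '[:space:]'
    echo
}
```

For example, recompressed_size neisseria_gonorrhoeae__01.tar.xz your_general_compressor prints the compressed size in bytes, while passing cat as the compressor gives the uncompressed tar size as a baseline.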
Evaluating your own phylogeny inference methods in connection with phylogenetic compression

Download one or more batches from the phylogenetically compressed 661k collection, infer their phylogeny using your method, and finally re-compress the genomes using MiniPhy with your phylogeny. This can be achieved by placing both {batch}.txt and {batch}.nw into the input/ directory.

Evaluating your own genome indexer in connection with phylogenetic compression

Download one or more batches from the phylogenetically compressed 661k collection and index them in the order in which they appear in the archive. Indexing can be done either per individual batch (resulting in many small indexes), or by merging all the genome batches together while preserving their order (resulting in one large index).

Genome order can be determined from a .tar.xz file by
tar tf {batch}.tar.xz
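When merging several batches into one large index, the per-batch listings can be concatenated into a single global order. The following is a minimal sketch; the function name list_genome_order is hypothetical, and the batches are simply visited in the order given on the command line.

```shell
#!/usr/bin/env bash
# Sketch: print the genome order across several batches as "archive<TAB>member" lines.
# Batches are visited in the order given on the command line.

list_genome_order() {
    local archive member
    for archive in "$@"; do
        # tar -t lists members in the order in which they are stored.
        xz -dc "$archive" | tar -tf - | while IFS= read -r member; do
            # Skip directory entries; only files correspond to genomes.
            case "$member" in */) continue ;; esac
            printf '%s\t%s\n' "$archive" "$member"
        done
    done
}
```

The resulting two-column list can then be fed to an indexer so that genomes are inserted in exactly this order.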

List of Genome Collections

The following phylogenetically compressed genome collections are provided for download on Zenodo. Supplementary metadata for all the datasets can be found in a dedicated repository.

661k

Genomes: 661,405 Illumina draft assemblies of (est.) 2,336 bacterial species
Length: 2.58 Tbp
Diversity: 44.3 G distinct 31-mers
Original size: 805 GB (750 GiB) after gzip compression
Compressed assemblies (xz) – everything
All 661k assemblies compressed within standard .tar.xz archives. Production-ready.
Final size: 29.0 GB
Compressed assemblies (mbgc) – everything
The same recompressed using MBGC. All sequences from a batch merged into a single file. Experimental.
Final size: 20.7 GB
Compressed COBS indexes (xz) – part1, part2, part3
Compressed COBS Classic indexes. Experimental.
Final size: 110 GB

661k-HQ

Genomes:
Length:
Diversity: distinct 31-mers
Original size:
Compressed COBS indexes (XZ) – part1, part2
Compressed COBS Classic indexes. Production-ready.
Final size: 72.8 GB

BIGSI data

Genomes: 425,160 de Bruijn graphs of (est.) 1,443 microbial species
Length: 1.68 Tbp (total unitig length)
Diversity: 41.1 G distinct 31-mers
Original size: 16.7 TB after McCortex cleaning
Compressed de Bruijn graphs v1 (XZ) – part1, part2
Simplitigs after 𝑘-mer propagation, first version from 2020, extensively validated, can be decompressed by MOF-Client. Production-ready.
Final size: 74.4 GB
Compressed de Bruijn graphs v2 (XZ) – everything
The same computed using a more optimized compression protocol, but decompression not yet implemented. Experimental.
Final size: 52.3 GB

NCTC3k

Genomes: 1,065 near-complete assemblies of 259 bacterial species
Length: 4.35 Gbp
Diversity: 992 M distinct 31-mers
Original size: 1.25 GB after gzip compression
Compressed assemblies (XZ) – everything
Compressed by MiniPhy with default settings.
Final size: 257 MB

GISP

Genomes: 1,102 draft assemblies of N. gonorrhoeae from this paper
Length: 2.36 Gbp
Diversity: 4.18 M distinct 31-mers
Original size: 726 MB after gzip compression.
Compressed assemblies (XZ) – everything
Compressed by MiniPhy with default settings and using MashTree and RAxML.
Final size: 5.67 MB and 5.44 MB, respectively.

SC2

Genomes: 590,779 complete assemblies of SARS-CoV-2
Length: 17.6 Gbp
Diversity: 1.85 M distinct 31-mers
Original size: 201 MB after xz compression
Compressed assemblies (XZ)
Upon request (the licence does not allow public sharing).
Final size: 10.7 MB

List of Software Packages

Core MOF packages

MOF-Search
BLAST-like search on laptops across all pre-2019 bacteria (the 661k-HQ collection). Implemented as a Snakemake pipeline.
MiniPhy
The main package for phylogenetic compression of individual genome batches.
MiniPhy-COBS
Building phylogenetically compressed COBS indexes from the output of MiniPhy.
MOF-Client
Client program for decompressing de Bruijn graphs from BIGSIdata.

Auxiliary software

ProPhyle
Metagenomic classifier, based on 𝑘-mer propagation, simplitigs, and 𝑘-mer indexing using the Burrows-Wheeler Transform. ProPhyle is used by MiniPhy as the underlying engine for 𝑘-mer propagation to compress de Bruijn graphs and for computing the phylogenetically explained data redundancy in genome collections. ProPhyle was modified for the purpose of MOF by adding a parameter that stops the indexing step after 𝑘-mer propagation.

Cite

Main paper

More information about phylogenetic compression and the MOF framework can be found in the main phylogenetic compression paper [1].

  [1]  K. Břinda, L. Lima, S. Pignotti, N. Quinones-Olvera, K. Salikhov, R. Chikhi, G. Kucherov, Z. Iqbal, and M. Baym, Efficient and robust search of microbial genomes via phylogenetic compression, bioRxiv 2023.04.15.536996, 2023. https://doi.org/10.1101/2023.04.15.536996

Low-level techniques

MOF builds upon numerous low-level computational techniques that we developed previously, including simplitigs [2], COBS [3], 𝑘-mer propagation [4], and the linkage between alignment scores and 𝑘-mer matches [5].

  [2]  K. Břinda, M. Baym, and G. Kucherov, Simplitigs as an efficient and scalable representation of de Bruijn graphs, Genome Biology 22(96), 2021. https://doi.org/10.1186/s13059-021-02297-z
  [3]  T. Bingmann, P. Bradley, F. Gauger, and Z. Iqbal, COBS: A Compact Bit-Sliced Signature Index, SPIRE 2019, 2019. https://doi.org/10.1007/978-3-030-32686-9_21
  [4]  K. Břinda, Novel computational techniques for mapping and classification of Next-Generation Sequencing data, PhD thesis, University of Paris-Est, 2016. https://doi.org/10.5281/zenodo.1045317
  [5]  K. Břinda, M. Sykulski, and G. Kucherov, Spaced seeds improve 𝑘-mer-based metagenomic classification, Bioinformatics 31(22), 2015. https://doi.org/10.1093/bioinformatics/btv419

Authors

The project originally started in the Baym lab at Harvard Medical School and has been continuing in the Břinda group at Inria GenScale. The project was developed in collaboration with the Iqbal group at EMBL-EBI.