Algorithms in Bioinformatics



MEGAN-LR

Supplementary data for the manuscript "MEGAN-LR: New algorithms allow accurate binning and easy interactive exploration of metagenomic long reads and contigs" by Huson et al.

All data discussed in the paper will be made available here.

 

Simulation study

 
We downloaded the complete genomes and proteins of all prokaryotes from NCBI's Genome database: https://www.ncbi.nlm.nih.gov/genome/browse/ 
We downloaded the NCBI taxonomy from its FTP server: ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz 
Both the genomes and the taxonomy were downloaded on 21 April 2017.
 
We selected those organisms that have at least 2 and at most 10 siblings in their genus, where the siblings also have complete genomes and a complete set of proteins in NCBI's Genome database, as well as a full taxonomic annotation in NCBI's taxonomy.
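The selection rule above can be sketched as follows. This is a minimal illustration, not the script used for the study: the function name, the `genus_of` mapping, and the assumption that it already contains only organisms with complete genomes, proteins and taxonomic annotation are all hypothetical.

```python
from collections import defaultdict

def select_by_genus_siblings(genus_of, min_siblings=2, max_siblings=10):
    """Return organisms whose genus holds between min and max siblings.

    genus_of: dict mapping organism name -> genus identifier (assumed to be
    pre-filtered to organisms with complete data). A "sibling" is counted
    here as any other organism in the same genus.
    """
    members = defaultdict(list)
    for organism, genus in genus_of.items():
        members[genus].append(organism)

    selected = []
    for organisms in members.values():
        siblings = len(organisms) - 1  # every other member of the genus
        if min_siblings <= siblings <= max_siblings:
            selected.extend(organisms)
    return sorted(selected)
```
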
 
We simulated 2000 reads for each of these genomes using NanoSim (https://github.com/bcgsc/NanoSim) in linear mode with both ecoli_R73_profile.zip and ecoli_R92D_profile.zip, available via FTP (ftp://ftp.bcgsc.ca/supplementary/NanoSim/, accessed on 21 April 2017).
 
For each simulated genome, we collected the protein sets of every other organism used in the simulation study and created Kaiju (https://github.com/bioinformatics-centre/kaiju, commit 0a61b4d) and LAST (last.cbrc.jp/, version 852) indices for them. We replaced the headers of the protein sequences with a unique identifier, followed by an underscore and the tax id of the organism that the protein comes from, as detailed in https://github.com/bioinformatics-centre/kaiju#custom-database. We created the LAST databases for the same sets of proteins for each organism with the command:
lastdb -p -cR01 lastdb protein_db.faa
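The header renaming described above (unique identifier, underscore, tax id) can be sketched in Python as below. This is an illustrative sketch only: the function name is hypothetical, and using a running counter as the unique identifier is an assumption, since the page does not say which identifier was used.

```python
def rename_headers(lines, tax_id, start=1):
    """Yield FASTA lines with each header replaced by '>counter_taxid'.

    lines:  iterable of FASTA lines for one organism's proteins
    tax_id: NCBI taxonomy id of that organism
    start:  first counter value; pass a different value per file so that
            identifiers stay unique across organisms (assumed scheme)
    """
    counter = start
    for line in lines:
        if line.startswith(">"):
            yield ">%d_%s\n" % (counter, tax_id)
            counter += 1
        else:
            # Sequence lines pass through unchanged, newline-terminated.
            yield line if line.endswith("\n") else line + "\n"
```

For example, a record with the header `>WP_000001 some protein` from an organism with tax id 562 would be written out as `>1_562`.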

We then aligned the reads against this protein database using LAST with a frameshift cost of 15 (-F15) and stored the output in MAF format. These MAF files were then sorted using the sort-last-maf script in MEGAN's tools directory and imported into MEGAN using the blast2rma script, selecting longReads as the LCA algorithm (-alg longReads). We exported MEGAN's assignments using the rma2info script.
 
We classified the same set of reads with Kaiju in greedy mode with a maximum of 5 allowed substitutions: -e 5.

We then calculated the sensitivity and precision of assignments for each dataset as follows:
genus sensitivity: percentage of reads summarized at the genus of the simulated organism, out of the number of reads in the dataset.
genus precision: percentage of reads summarized at the genus of the simulated organism, out of all assigned reads, excluding those assigned to an ancestor of the corresponding genus.
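The two definitions above can be expressed as a short calculation. The sketch below assumes each read has already been reduced to one of four hypothetical outcome labels: "genus" (summarized at the correct genus), "ancestor" (assigned to an ancestor of that genus), "other" (assigned elsewhere), or None (unassigned). These labels and the function name are illustrative and not part of MEGAN's or Kaiju's output.

```python
def genus_metrics(outcomes):
    """Return (sensitivity, precision) in percent for one dataset.

    outcomes: list with one entry per read, each "genus", "ancestor",
    "other" or None (unassigned); these labels are assumptions.
    """
    total = len(outcomes)
    correct = sum(1 for o in outcomes if o == "genus")
    ancestor = sum(1 for o in outcomes if o == "ancestor")
    unassigned = sum(1 for o in outcomes if o is None)

    # Precision denominator: assigned reads minus those placed on an
    # ancestor of the correct genus.
    counted = total - unassigned - ancestor

    sensitivity = 100.0 * correct / total if total else 0.0
    precision = 100.0 * correct / counted if counted else 0.0
    return sensitivity, precision
```

With 10 reads of which 6 are at the correct genus, 1 at an ancestor, 2 elsewhere and 1 unassigned, sensitivity is 6/10 and precision is 6/8, because the ancestor and unassigned reads are left out of the precision denominator.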

 

PacBio HMP mock community

Reads downloaded from: PacBio

Reads

File produced using LAST+MEGAN-LR pipeline, can be opened in MEGAN:

Meganized DAA file

This file contains the HMP mock-community profile (bacteria) and can be opened in MEGAN:

HMP profile MEGAN file

 

Assembly produced using minimap+miniasm:

Assembled reads

File produced using LAST+MEGAN-LR pipeline, can be opened in MEGAN:

Meganized DAA file

 

PacBio reads from another mock community

This data is from: Singer et al., 2016
 

Reads

File produced using LAST+MEGAN-LR pipeline, can be opened in MEGAN:

Meganized DAA file

This file contains the mock-community profile and can be opened in MEGAN:

Community profile MEGAN file

 

Nanopore reads on HMP mock community

This is new Nanopore data that we produced from the HMP even mock-community:

Reads

File produced using LAST+MEGAN-LR pipeline, can be opened in MEGAN:

Meganized DAA file

 

Nanopore reads from Anammox bio-reactor

This is new Nanopore data that we produced from an enrichment bio-reactor:

Reads

File produced using LAST+MEGAN-LR pipeline, can be opened in MEGAN:

Meganized DAA file

 

Assembly produced using minimap+miniasm:

Assembled reads

File produced using LAST+MEGAN-LR pipeline, can be opened in MEGAN:

Meganized DAA file

 