MEGAN5 - MEtaGenome ANalyzer
MEGAN5 was written by D. H. Huson, with ideas or supporting code contributed by S.C. Schuster, S. Mitra, D.C. Richter, P. Rupek, H.-J. Ruscheweyh, R. Tappu and N. Weber.
In metagenomics, the aim is to understand the composition and operation of complex microbial consortia in environmental samples through sequencing and analysis of their DNA. Similarly, metatranscriptomics and metaproteomics target the RNA and proteins obtained from such samples. Technological advances in next-generation sequencing methods are fueling a rapid increase in the number and scope of environmental sequencing projects. In consequence, there is a dramatic increase in the volume of sequence data to be analyzed.
Basic computational questions
The first three basic computational tasks for such data are taxonomic analysis, functional analysis and comparative analysis. These are also known as the “who is out there?”, “what are they doing?” and “how do they compare?” questions. They pose an immense conceptual and computational challenge, and there is a great need for new bioinformatics tools and methods to address them.
History of MEGAN
In 2007, we published the first stand-alone analysis tool for metagenomic short-read data, called MEGAN (MEta Genome ANalyzer, paper). Initially, the aim was to provide a tool for studying the taxonomic content of a single dataset. A subsequent version of the program allowed the comparative taxonomic analysis of multiple datasets (MEGAN 2). In version 3 of the program, we also aimed to provide a functional analysis of metagenome data, based on the GO ontology. Unfortunately, in our hands the GO ontology proved poorly suited to this purpose. In version 4 of MEGAN, the GO analyzer was replaced by two new functional analysis methods, one based on the SEED classification and the other based on KEGG (Kyoto Encyclopedia of Genes and Genomes). MEGAN 4 was released at the beginning of 2011 (paper). MEGAN5 was released in April 2013. The program consists of about 275,000 lines of code.
What is new in MEGAN5?
- Faster parsing of input data, more flexible mapping, faster analysis: The size and number of datasets considered in metagenomics continues to increase, and so a number of the core algorithms for parsing and storing data have been redesigned to keep up. The "Import BLAST" dialog has been redesigned to make it easier to process comparisons against specialized databases.
- COG/EGGNOG analysis: MEGAN5 can map reads to COG/EGGNOG classes and provides an analyzer window for them.
- Update of SEED and KEGG mapping data: MEGAN5 is shipped with new SEED and KEGG mapping files. The KEGG pathways shipped with MEGAN are based on the last free version of KEGG (June 2011).
- PCoA analysis of taxonomy and function: MEGAN5 allows the user to perform PCoA analysis based on taxonomy or function, the latter based on SEED, KEGG or COG/EGGNOG.
- Working with multiple samples: MEGAN5 makes it easy to work with many different samples simultaneously. A MEGAN5 comparison document can contain any number of samples. Samples can be extracted, merged or resampled.
- Biome extraction: MEGAN5 provides methods for computing the core biome, total biome, minimal biome and rare biome for a set of samples.
- Metadata: MEGAN5 supports metadata associated with samples. Such attributes can be used to select, order or color samples.
- Charting: MEGAN5 contains a number of new charts, including radial space-filling trees, bubble charts, word clouds and co-occurrence graphs.
- Improved LCA algorithm: a novel "minimum taxon cover" algorithm greatly enhances the specificity of the taxonomic LCA placement algorithm.
- Color management: MEGAN5 always assigns the same colors to the same entities across different windows and datasets. The program uses a built-in palette; however, the user can interactively change any choice.
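To make the LCA placement mentioned in the list above more concrete, here is a minimal Python sketch of a naive lowest common ancestor assignment: a read is placed on the deepest taxon shared by all taxa it matches. This is an illustration only, not MEGAN's implementation, and it does not include the "minimum taxon cover" refinement. The toy taxonomy and the function names `path_to_root` and `naive_lca` are assumptions made for this example.

```python
# Toy taxonomy as child -> parent links (hypothetical, for illustration only).
PARENT = {
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Gammaproteobacteria": "Proteobacteria",
    "Escherichia coli": "Gammaproteobacteria",
    "Salmonella enterica": "Gammaproteobacteria",
    "Firmicutes": "Bacteria",
    "Bacillus subtilis": "Firmicutes",
}


def path_to_root(taxon):
    """Return the list of taxa from the root down to `taxon`."""
    path = [taxon]
    while path[-1] != "root":
        path.append(PARENT[path[-1]])
    return path[::-1]


def naive_lca(taxa):
    """Return the deepest taxon shared by the root paths of all given taxa."""
    paths = [path_to_root(t) for t in taxa]
    lca = "root"
    # Walk down the paths level by level until they diverge.
    for level in zip(*paths):
        if len(set(level)) != 1:
            break
        lca = level[0]
    return lca


# A read hitting two closely related species is placed on their shared class;
# adding a distant hit pushes the placement further up the tree.
print(naive_lca(["Escherichia coli", "Salmonella enterica"]))
print(naive_lca(["Escherichia coli", "Salmonella enterica", "Bacillus subtilis"]))
```

The more diverse the matches of a read, the higher in the taxonomy it ends up, which is why refinements that improve specificity matter.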
High-throughput analysis using DIAMOND or MALT
To prepare a dataset for use with MEGAN, one must first compare the given reads against a database of reference sequences, for example by performing a BLASTX search against the NCBI-NR database. The file of reads and the resulting BLAST file can then be directly imported into MEGAN and the program will automatically calculate a taxonomic classification of the reads and also, if desired, a functional classification, using either the SEED or KEGG classification, or both. The results can be interactively viewed and inspected. Multiple datasets can be opened simultaneously in a single comparative document that provides comparative views of the different classifications.
MEGAN can be used to interactively explore the dataset. In the following figure, we show the assignment of reads to the NCBI taxonomy. Each node is labeled by a taxon and the number of reads assigned to the taxon. The size of a node is scaled logarithmically to represent the number of assigned reads. Optionally, the program can also display the number of reads summarized by a node, that is, the number of reads that are assigned to the node or to any of its descendants in the taxonomy. The program allows one to interactively inspect the assignment of reads to a specific node, to drill down to the individual BLAST hits that support the assignment of a read to a node, and to export all reads (and their matches, if desired) that were assigned to a specific part of the NCBI taxonomy. Additionally, one can select a set of taxa and then use MEGAN to generate different types of charts for them.
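The difference between assigned and summarized read counts, and the logarithmic node scaling, can be sketched as follows. This is an illustrative Python example under stated assumptions: the tiny taxonomy, the counts, the function names and the scaling formula `scale * log10(count + 1)` are all hypothetical and are not taken from MEGAN.

```python
import math

# Hypothetical child -> parent links and per-node assigned read counts.
PARENT = {"Bacteria": "root", "Proteobacteria": "Bacteria", "Firmicutes": "Bacteria"}
ASSIGNED = {"root": 5, "Bacteria": 20, "Proteobacteria": 100, "Firmicutes": 40}


def summarized_counts(parent, assigned):
    """Add each node's assigned reads to the node itself and to all its ancestors."""
    summary = {node: 0 for node in assigned}
    for node, count in assigned.items():
        current = node
        while True:
            summary[current] = summary.get(current, 0) + count
            if current not in parent:  # reached the root
                break
            current = parent[current]
    return summary


def node_radius(count, scale=4.0):
    """Map a read count to a drawing radius on a logarithmic scale (assumed formula)."""
    return scale * math.log10(count + 1)


summary = summarized_counts(PARENT, ASSIGNED)
for node in ASSIGNED:
    print(node, ASSIGNED[node], summary[node], round(node_radius(ASSIGNED[node]), 2))
```

With logarithmic scaling, a node with a hundred times more reads is drawn only a few times larger, so sparsely populated nodes remain visible next to dominant ones.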
Functional analysis using the SEED classification
To perform a functional analysis using the SEED classification, MEGAN attempts to map each read to a SEED functional role, using the highest scoring BLAST match to a protein sequence for which the functional role is known. The SEED classification is depicted as a rooted tree whose internal nodes represent the different subsystems and whose leaves represent the functional roles. Note that the tree is “multi-labeled” in the sense that different leaves may represent the same functional role, if it occurs in different types of subsystems. The current tree has about 13 000 nodes. The following figure shows a part of the SEED analysis of a marine metagenome sample.
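The best-hit rule described above can be sketched in a few lines of Python. This is a minimal illustration, assuming each read comes with a list of `(reference_id, bit_score)` matches and that a lookup table from reference proteins to functional roles is available; the identifiers, the table `REF_TO_ROLE` and the function `assign_role` are hypothetical and do not reflect MEGAN's internal data structures.

```python
# Hypothetical lookup from reference protein identifiers to SEED functional roles.
REF_TO_ROLE = {
    "prot_A": "Citrate synthase",
    "prot_C": "Aconitate hydratase",
}


def assign_role(matches, ref_to_role):
    """Pick the role of the highest-scoring match whose reference has a known role.

    `matches` is a list of (reference_id, bit_score) pairs for one read.
    Returns None if no match maps to a known role.
    """
    known = [(score, ref) for ref, score in matches if ref in ref_to_role]
    if not known:
        return None
    _best_score, best_ref = max(known)
    return ref_to_role[best_ref]


# The top hit (prot_B) has no known role, so the next best annotated hit is used.
read_matches = [("prot_B", 250.0), ("prot_A", 180.0), ("prot_C", 90.0)]
print(assign_role(read_matches, REF_TO_ROLE))
```

Note that the rule skips unannotated references rather than giving up on the read, so a read can still receive a role when its overall best match is uncharacterized.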
Functional analysis using the KEGG classification
To perform a KEGG analysis, MEGAN attempts to match each read to a KEGG orthology (KO) accession number, using the best hit to a reference sequence for which a KO accession number is known. This information is then used to assign reads to enzymes and pathways. The KEGG classification is represented by a rooted tree (with approximately 13 000 nodes) whose leaves represent different pathways. Each pathway can also be inspected visually, to see which reads were assigned to which enzymes. As an example, consider the citric acid cycle, which is of central importance for cells that use oxygen as part of cellular respiration. In the following figure we show the citric acid cycle pathway. In such a drawing of a pathway as provided by the KEGG database, different participating enzymes are represented by numbered rectangles. MEGAN shades each such rectangle so as to indicate the number of reads assigned to the corresponding enzyme.
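How per-read KO assignments might be rolled up into pathway counts and shading levels can be sketched as below. This is only an illustration: the KO identifiers are placeholders rather than real KEGG accessions, the KO-to-pathway table is made up, and the linear shading rule in `shade` is an assumption, since the text does not say how MEGAN maps counts to shades.

```python
from collections import Counter

# Hypothetical KO -> pathways table; identifiers are placeholders, not real KEGG data.
KO_TO_PATHWAYS = {
    "KO_a": ["Citrate cycle"],
    "KO_b": ["Citrate cycle", "Glyoxylate metabolism"],
}


def count_reads(read_to_ko, ko_to_pathways):
    """Count reads per KO and per pathway; reads without a KO are skipped."""
    ko_counts = Counter(ko for ko in read_to_ko.values() if ko is not None)
    pathway_counts = Counter()
    for ko, n in ko_counts.items():
        for pathway in ko_to_pathways.get(ko, []):
            pathway_counts[pathway] += n
    return ko_counts, pathway_counts


def shade(count, max_count):
    """Return a shading intensity in [0, 1], linear in the read count (assumption)."""
    return 0.0 if max_count == 0 else count / max_count


reads = {"r1": "KO_a", "r2": "KO_b", "r3": "KO_b", "r4": None}
ko_counts, pathway_counts = count_reads(reads, KO_TO_PATHWAYS)
max_count = max(ko_counts.values())
for ko, n in sorted(ko_counts.items()):
    print(ko, n, shade(n, max_count))
print(dict(pathway_counts))
```

Because one KO can take part in several pathways, a single read may contribute to more than one pathway count in this sketch.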
Functional analysis using the COG/EGGNOG classification
To compare a collection of different datasets visually, MEGAN provides a comparison view that is based on a tree in which each node shows the number of reads assigned to it for each of the datasets. This can be shown as a pie chart, a bar chart or a heat map. To construct such a view using MEGAN, the datasets must first all be individually opened in the program. Using the provided “compare” dialog, one can then set up a new comparison document containing the datasets of interest. The following figure shows the taxonomic comparison of all eight marine datasets. Here, each node in the NCBI taxonomy is shown as a bar chart indicating the number of reads (normalized, if desired) from each dataset that have been assigned to the node.
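Normalization matters because datasets usually differ in size. The following Python sketch shows one simple way to do it: scale every dataset so that its counts sum to the same total. The choice of the smallest dataset size as the default target is an assumption of this example, not a statement about what MEGAN does, and the function name `normalize` is hypothetical.

```python
def normalize(counts_by_sample, target=None):
    """Scale each sample's counts so that every sample sums to the same total.

    By default the target is the smallest sample size (an assumption of this sketch).
    Samples with no reads are left at zero.
    """
    totals = {s: sum(c.values()) for s, c in counts_by_sample.items()}
    if target is None:
        target = min(totals.values())
    normalized = {}
    for sample, counts in counts_by_sample.items():
        factor = target / totals[sample] if totals[sample] else 0.0
        normalized[sample] = {taxon: n * factor for taxon, n in counts.items()}
    return normalized


samples = {
    "sample_A": {"Proteobacteria": 50, "Firmicutes": 50},
    "sample_B": {"Proteobacteria": 300, "Firmicutes": 100},
}
print(normalize(samples))
```

After scaling, the bars at a node can be compared directly across datasets, since they no longer reflect differences in sequencing depth.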
PCoA and clustering
MEGAN5 can be used with both taxonomic and functional profiles. Using a number of different ecological indices, MEGAN5 can compute distances between different samples. These can be analyzed using principal coordinate analysis (PCoA), hierarchical clustering (UPGMA tree, NJ tree) and non-hierarchical clustering (Neighbor-net method).
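As an illustration of the distance step, the sketch below computes a Bray-Curtis dissimilarity matrix from per-sample count profiles. Bray-Curtis is used here only as a familiar example of an ecological index; the text does not name the indices MEGAN5 offers. The function names and the sample profiles are hypothetical. A matrix like this is the kind of input that PCoA or a clustering method would then take.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count profiles given as dicts."""
    keys = set(a) | set(b)
    total = sum(a.get(k, 0) + b.get(k, 0) for k in keys)
    if total == 0:
        return 0.0  # two empty profiles are treated as identical
    diff = sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)
    return diff / total


def distance_matrix(profiles):
    """Return sorted sample names and the symmetric matrix of pairwise distances."""
    names = sorted(profiles)
    matrix = [[bray_curtis(profiles[i], profiles[j]) for j in names] for i in names]
    return names, matrix


profiles = {
    "s1": {"Proteobacteria": 6, "Firmicutes": 4},
    "s2": {"Proteobacteria": 4, "Firmicutes": 6},
    "s3": {"Bacteroidetes": 10},
}
names, matrix = distance_matrix(profiles)
for name, row in zip(names, matrix):
    print(name, row)
```

The same code works for functional profiles if the keys are functional classes instead of taxa.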
Co-occurrence plot and other charts
MEGAN5 provides a number of new plots and charts including a co-occurrence plot, space-filling radial trees (as popularized by the program Krona), bubble charts, word clouds and more.
Sample attributes (aka metadata) and the Sample Viewer
Non-academic users please contact Daniel Huson for a commercial or trial license.
(Test alpha version of MEGAN 6 here.)