MEGAN 4 - MEtaGenome ANalyzer
Software for analyzing metagenomes.
Over 7000 registered users.
MEGAN 4 written by D. H. Huson, original design by D. H. Huson and S.C. Schuster, with contributions from S. Mitra, D.C. Richter, P. Rupek, H.-J. Ruscheweyh and N. Weber.
In metagenomics, the aim is to understand the composition and operation of complex microbial consortia in environmental samples through sequencing and analysis of their DNA. Similarly, metatranscriptomics and metaproteomics target the RNA and proteins obtained from such samples. Technological advances in next-generation sequencing methods are fueling a rapid increase in the number and scope of environmental sequencing projects. In consequence, there is a dramatic increase in the volume of sequence data to be analyzed.
Basic computational questions
The first three basic computational tasks for such data are taxonomic analysis, functional analysis and comparative analysis. These are also known as the “who is out there?”, “what are they doing?” and “how do they compare?” questions. They pose an immense conceptual and computational challenge, and there is a great need for new bioinformatics tools and methods to address them.
History of MEGAN
In 2007, we published the first stand-alone tool for analyzing short-read metagenomic data, called MEGAN (MEta Genome ANalyzer; Huson et al., 2007). Initially, the aim was to provide a tool for studying the taxonomic content of a single dataset. A subsequent version of the program (MEGAN 2) allowed the comparative taxonomic analysis of multiple datasets. In version 3 of the program, we also aimed to provide a functional analysis of metagenome data, based on the Gene Ontology (GO). Unfortunately, in our hands GO proved poorly suited to this purpose. In version 4 of MEGAN, the GO analyzer has been replaced by two new functional analysis methods, one based on the SEED classification and the other based on KEGG (Kyoto Encyclopedia of Genes and Genomes). MEGAN 4 was released at the beginning of 2011 (paper accepted for publication in Genome Research).
To prepare a dataset for use with MEGAN, one must first compare the given reads against a database of reference sequences, for example by performing a BLASTX search against the NCBI-NR database. The file of reads and the resulting BLAST file can then be directly imported into MEGAN and the program will automatically calculate a taxonomic classification of the reads and also, if desired, a functional classification, using either the SEED or KEGG classification, or both. The results can be interactively viewed and inspected. Multiple datasets can be opened simultaneously in a single comparative document that provides comparative views of the different classifications.
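The comparison step described above can be sketched as follows. This is only an illustration, assuming the NCBI BLAST+ command-line tools and a local copy of the NR database named `nr`; the helper name `build_blastx_command`, the file names and the thread count are hypothetical and not part of MEGAN. The function only builds the argument list and does not run the search.

```python
from pathlib import Path


def build_blastx_command(reads_fasta, database="nr", out_file=None, threads=4):
    """Return the argument list for a BLAST+ blastx run (nothing is executed).

    reads_fasta: FASTA file of reads (hypothetical file name).
    database:    name of a local protein database; "nr" is an assumption.
    out_file:    where BLAST should write its report; defaults to the
                 reads file name with the suffix ".blastx".
    """
    reads = Path(reads_fasta)
    out = Path(out_file) if out_file else reads.with_suffix(".blastx")
    return [
        "blastx",
        "-query", str(reads),
        "-db", database,
        "-out", str(out),
        "-num_threads", str(threads),
    ]


# Example: build the command for a reads file called reads.fna.
command = build_blastx_command("reads.fna")
```

The resulting list could be passed to `subprocess.run(command, check=True)`; the reads file and the BLAST output file are then the two inputs that are imported into MEGAN.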
MEGAN can be used to interactively explore the dataset. In the following figure, we show the assignment of reads to the NCBI taxonomy. Each node is labeled by a taxon and the number of reads assigned to the taxon. The size of a node is scaled logarithmically to represent the number of assigned reads. Optionally, the program can also display the number of reads summarized by a node, that is, the number of reads that are assigned to the node or to any of its descendants in the taxonomy. The program allows one to interactively inspect the assignment of reads to a specific node, to drill down to the individual BLAST hits that support the assignment of a read to a node, and to export all reads (and their matches, if desired) that were assigned to a specific part of the NCBI taxonomy. Additionally, one can select a set of taxa and then use MEGAN to generate different types of charts for them.
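The difference between assigned and summarized read counts, and the logarithmic scaling of node sizes, can be illustrated with a small sketch. It is not MEGAN's implementation: the taxonomy is assumed to be given as a child-to-parent mapping, and the function names and the scaling constants in `node_radius` are made up for illustration.

```python
import math
from collections import defaultdict


def summarized_counts(parent, assigned):
    """Sum, for every node, the reads assigned to it or to any descendant.

    parent:   maps each taxon to its parent taxon (the root maps to None).
    assigned: maps a taxon to the number of reads assigned directly to it.
    """
    summarized = defaultdict(int)
    for taxon, count in assigned.items():
        node = taxon
        # Walk up to the root, crediting every ancestor with these reads.
        while node is not None:
            summarized[node] += count
            node = parent.get(node)
    return dict(summarized)


def node_radius(count, min_radius=2.0, scale=4.0):
    """Logarithmically scaled node size; the constants are illustrative."""
    return min_radius + scale * math.log10(count + 1)


# Toy taxonomy with made-up read counts.
parent = {"root": None, "Bacteria": "root",
          "Proteobacteria": "Bacteria", "Firmicutes": "Bacteria"}
assigned = {"Bacteria": 10, "Proteobacteria": 90, "Firmicutes": 5}
totals = summarized_counts(parent, assigned)
```

In this toy example, 10 reads are assigned directly to Bacteria, while the node summarizes 105 reads once its two descendants are included.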
Functional analysis using the SEED classification
To perform a functional analysis using the SEED classification, MEGAN attempts to map each read to a SEED functional role, using the highest scoring BLAST match to a protein sequence for which the functional role is known. The SEED classification is depicted as a rooted tree whose internal nodes represent the different subsystems and whose leaves represent the functional roles. Note that the tree is “multi-labeled” in the sense that different leaves may represent the same functional role, if it occurs in different types of subsystems. The current tree has about 13 000 nodes. The following figure shows a part of the SEED analysis of a marine metagenome sample.
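The mapping rule described above (take the highest-scoring match whose reference protein has a known functional role) can be sketched as follows. The data structures are assumptions made for illustration: `hits_by_read` holds (reference id, bit score) pairs per read, and `role_of_reference` is a hypothetical lookup table from reference proteins to SEED functional roles. MEGAN's own file formats and mapping tables are not shown.

```python
def assign_functional_roles(hits_by_read, role_of_reference):
    """Map each read to the role of its best-scoring hit that has a known role.

    hits_by_read:      read id -> list of (reference id, bit score) pairs.
    role_of_reference: reference id -> functional role (known roles only).
    Reads without any hit to a reference of known role are left out.
    """
    assignments = {}
    for read_id, hits in hits_by_read.items():
        # Try hits from highest to lowest score and stop at the first
        # reference for which a functional role is known.
        for ref_id, _score in sorted(hits, key=lambda h: h[1], reverse=True):
            role = role_of_reference.get(ref_id)
            if role is not None:
                assignments[read_id] = role
                break
    return assignments


# Made-up example data.
hits = {
    "read1": [("protA", 50.0), ("protB", 80.0)],
    "read2": [("protC", 60.0)],
    "read3": [("protB", 30.0), ("protD", 90.0)],
}
roles = {"protA": "Role A", "protB": "Role B"}
assigned = assign_functional_roles(hits, roles)
```

Here `read3` is mapped via `protB` even though `protD` scores higher, because no role is known for `protD`, and `read2` remains unassigned.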
Functional analysis using the KEGG classification
To perform a KEGG analysis, MEGAN attempts to match each read to a KEGG orthology (KO) accession number, using the best hit to a reference sequence for which a KO accession number is known. This information is then used to assign reads to enzymes and pathways. The KEGG classification is represented by a rooted tree (with approximately 13 000 nodes) whose leaves represent different pathways. Each pathway can also be inspected visually, to see which reads were assigned to which enzymes. As an example, consider the citric acid cycle, which is of central importance for cells that use oxygen as part of cellular respiration. In the following figure we show the citric acid cycle pathway. In such a drawing of a pathway as provided by the KEGG database, different participating enzymes are represented by numbered rectangles. MEGAN shades each such rectangle so as to indicate the number of reads assigned to the corresponding enzyme.
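The shading of pathway rectangles can be illustrated by counting reads per KO accession within one pathway and turning the counts into intensities. This is a sketch under simplifying assumptions: each rectangle is identified directly by a KO accession, the shade is linear in the read count, and the read assignments are made up. How MEGAN actually maps KO accessions to enzymes and chooses its shading scale is not stated here.

```python
from collections import Counter


def pathway_counts(ko_of_read, kos_in_pathway):
    """Count the reads assigned to each KO accession that occurs in a pathway.

    ko_of_read:     read id -> KO accession assigned to the read.
    kos_in_pathway: list of KO accessions drawn in the pathway.
    """
    counts = Counter(ko for ko in ko_of_read.values() if ko in kos_in_pathway)
    return {ko: counts.get(ko, 0) for ko in kos_in_pathway}


def shade_levels(counts, levels=100):
    """Map each count to an intensity from 0 to `levels`, relative to the maximum."""
    top = max(counts.values(), default=0)
    if top == 0:
        return {key: 0 for key in counts}
    return {key: round(levels * count / top) for key, count in counts.items()}


# Made-up assignments; the KO accessions serve only as sample keys.
ko_of_read = {"r1": "K01647", "r2": "K01647", "r3": "K01681",
              "r4": "K01647", "r5": "K01647", "r6": "K01681",
              "r7": "K99999"}
pathway = ["K01647", "K01681", "K00031"]
shades = shade_levels(pathway_counts(ko_of_read, pathway))
```

Reads whose KO accession does not occur in the pathway (such as `r7` above) are ignored for that pathway, and an accession without reads receives intensity 0.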
To compare a collection of different datasets visually, MEGAN provides a comparison view that is based on a tree in which each node shows the number of reads assigned to it for each of the datasets. This can be shown as a pie chart, a bar chart or a heat map. To construct such a view using MEGAN, the datasets must first all be individually opened in the program. Using a provided “compare” dialog, one can then set up a new comparison document containing the datasets of interest. The following figure shows the taxonomic comparison of all eight marine datasets. Here, each node in the NCBI taxonomy is shown as a bar chart indicating the number of reads (normalized, if desired) from each dataset that have been assigned to the node.
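Normalization matters when datasets differ in size, because raw read counts would otherwise mostly reflect sequencing depth. A minimal sketch of one common scheme follows, in which every dataset is rescaled to the same total number of reads. The choice of the smallest dataset as the common total and the function name are assumptions for illustration, not a description of MEGAN's exact procedure.

```python
def normalize_counts(datasets, target=None):
    """Rescale per-node read counts so that all datasets have the same total.

    datasets: dataset name -> {node: read count}.
    target:   common total; defaults to the smallest non-empty dataset.
    """
    totals = {name: sum(counts.values()) for name, counts in datasets.items()}
    if target is None:
        target = min((t for t in totals.values() if t > 0), default=0)
    normalized = {}
    for name, counts in datasets.items():
        total = totals[name]
        factor = target / total if total else 0.0
        normalized[name] = {node: count * factor for node, count in counts.items()}
    return normalized


# Made-up counts for two datasets of different size.
datasets = {
    "sample_A": {"Bacteria": 10, "Archaea": 30},
    "sample_B": {"Bacteria": 100, "Archaea": 100},
}
normalized = normalize_counts(datasets)
```

After normalization both datasets sum to 40 reads, so the bars at each node compare proportions rather than dataset sizes.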
In a similar fashion, MEGAN supports the simultaneous analysis and comparison of the SEED functional content of multiple metagenomes (see the next figure). Moreover, a comparative view of assignments to a KEGG pathway is also possible.
Computational comparison of metagenomes
MEGAN provides an analysis window for comparing multiple datasets, which allows one to compute a distance matrix for a collection of datasets using a number of different ecological indices. The calculation can be based on the results of a taxonomic, SEED or KEGG analysis. If no nodes are selected, then the distances will be based on the number of reads assigned to the current leaves of the tree representation of the analysis. If some nodes are selected, then only the values for the selected nodes are used in the calculation.
MEGAN supports a number of different methods for calculating a distance matrix, such as Goodall’s ecological index, a simple version of UniFrac and Euclidean distances. Such a distance matrix can be visualized either using a split network calculated using the neighbor-net algorithm, or using a multi-dimensional scaling plot. In the next figure we show the result of a comparison of the eight marine datasets based on the taxonomic content of the datasets and computed using Goodall’s index.
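Of the methods listed, the Euclidean distance is the simplest to write down. The following sketch computes a distance matrix from per-node read counts, optionally restricted to a set of selected nodes. It assumes that the counts passed in are those of the current leaves (or of the selected nodes) and that any normalization has already been applied. The function name and data layout are illustrative; Goodall’s index and UniFrac are not shown.

```python
import math


def euclidean_distance_matrix(profiles, selected=None):
    """Pairwise Euclidean distances between datasets.

    profiles: dataset name -> {node: read count}.
    selected: optional collection of nodes; if given, only these nodes
              contribute to the distances, otherwise all nodes are used.
    """
    names = sorted(profiles)
    if selected is None:
        nodes = sorted({node for counts in profiles.values() for node in counts})
    else:
        nodes = sorted(selected)
    # Build one count vector per dataset over the same node order.
    vectors = {name: [profiles[name].get(node, 0) for node in nodes]
               for name in names}
    return {a: {b: math.dist(vectors[a], vectors[b]) for b in names}
            for a in names}


# Made-up profiles for two datasets.
profiles = {"sample_A": {"x": 3, "y": 0}, "sample_B": {"x": 0, "y": 4}}
matrix = euclidean_distance_matrix(profiles)
```

A matrix of this form is the kind of input that a neighbor-net or multi-dimensional scaling method would take; restricting `selected` to a few nodes changes the distances accordingly.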
Analysis of other types of data
MEGAN was originally designed to analyze metagenomic and metatranscriptomic data. However, it can readily be used to analyze metaproteomic data as well.
Please note that MEGAN can now also be used to analyze sequencing reads obtained in an approach targeted at 16S rRNA sequences, as shown here.
- Huson, DH, Mitra, S, Weber, N, Ruscheweyh, H, and Schuster, SC (2011). Integrative analysis of environmental sequences using MEGAN4. Genome Research, 21:1552-1560. (Please cite this paper when using MEGAN4)
- S. Mitra, B. Klar and D.H. Huson (2009). Visual and statistical comparison of metagenomes, Bioinformatics, 25(15):1849–1855.
- S. Mitra, J.A. Gilbert, D. Field, and D.H. Huson (2010). Comparison of multiple metagenomes using phylogenetic networks based on ecological indices, ISME J, 4:1236–1242, doi:10.1038/ismej.2010.51.
- H. N. Poinar, C. Schwarz, Ji Qi, B. Shapiro, R. D. E. MacPhee, B. Buigues, A. Tikhonov, D. H. Huson, L. P. Tomsho, A. Auch, M. Rampp, W. Miller, S. C. Schuster, Metagenomics to Paleogenomics: Large-Scale Sequencing of Mammoth DNA, Science 311:392-394, 2006, where we used an early version of our software to analyze the taxonomical content of a collection of DNA reads sampled from a mammoth.
- T. Urich, A. Lanzén, Ji Qi, D.H. Huson, C. Schleper and S.C. Schuster, Simultaneous Assessment of Soil Microbial Community Structure and Function through Analysis of the Meta-Transcriptome, PLoS ONE 3(6): e2527, doi:10.1371/journal.pone.0002527.
- D.H. Huson, A.F. Auch, Ji Qi and S.C. Schuster, MEGAN Analysis of Metagenomic Data, Genome Research. 17:377-386, 2007.
- MEGAN user manual.
Use of MEGAN requires a license key. Commercial users can obtain a single user license or site license here.
Academic users can obtain a free license here, under the condition that any use of MEGAN is cited. (Please use ASCII characters only: no accents, umlauts, etc.)
Non-academic users, please contact us for a commercial or trial license.
(Download old MEGAN version 4.)
(License key server for academic users of MEGAN5 here.)
(Download old MEGAN version 3.)