A Sequencing Simulator for Genomics and Metagenomics.
By Daniel H. Huson and Felix Ott. With contributions from Ramona Schmid, Alexander F. Auch and Daniel C. Richter
The new research field of metagenomics is providing exciting insights into various previously unclassified ecological systems. Next-generation sequencing technologies are increasing the amount of environmental data in public databases. There is a great need for specialized software solutions and statistical methods for dealing with complex metagenome data sets. To facilitate the development and improvement of metagenomic tools, we introduce a sequencing simulator called MetaSim.
Our software can be used to generate collections of synthetic reads that reflect the
diverse taxonomical composition of typical metagenome data sets.
Based on a database of given genomes, the program allows the user to design a
metagenome by specifying the number of genomes present at different levels of
the NCBI taxonomy, and then to collect reads from the metagenome using a simulation
of a number of different sequencing technologies.
A population sampler optionally produces evolved sequences based on
source genomes and a given evolutionary tree.
The resulting data sets can be used as standardized test scenarios for planning sequencing projects or for benchmarking assembler and metagenomic software.
- integrates a database for source genome sequences
- generates sets of synthetic reads or mate-pairs based on adaptable sequencing error models (e.g. for Sanger chemistry, Roche's 454 and Illumina, formerly Solexa)
- enables the user to configure abundance values for each organism to model specific taxon compositions
- provides a population sampler to generate evolved sequences based on source genomes and a given evolutionary tree
- can be controlled via graphical user interface or in command line mode
Richter DC, Ott F, Auch AF, Schmid R, Huson DH (2008): MetaSim—A Sequencing Simulator for Genomics and Metagenomics. PLoS ONE 3(10): e3373. doi:10.1371/journal.pone.0003373 http://www.plosone.org/article/info:doi/10.1371/journal.pone.0003373
- fixed bug in command line mode: Sanger setting with new option for setting the mate pair probability.
- FASTA import bug fixed
- Empirical setting now supports mate-pairs, thus allowing simulation of Solexa paired reads
- Some new command line mode options to control output file (e.g. specify the output directory)
- Toolbar bug fixed
- command line mode is now fully working
- each read header in the result file now contains an error report (all base modifications are reported)
I installed MetaSim. After start up, I do not know how to begin. Please refer to the section "Getting started" in the manual (found in the program folder or here).
When clicking on the database item after initial program start, an error message comes up. The database location may need to be changed to a folder where you have write permission. Change the default database location in your file system using Edit -> Preferences -> Set Database Location.
I have generated a taxon profile but MetaSim says: "Profile NOT saved". Please check the syntax of your taxon profile. Refer to the manual or use one of the example taxon profiles in the examples folder, which can be easily adapted.
I have generated/loaded a taxon profile but its icon in the project tree shows a red exclamation mark. The syntax of your taxon profile seems to be correct but at least one sequence entry could not be found in the database. First, check whether the genome sequence that is listed with a red exclamation mark in the taxon profile has already been loaded into the database. Second, check whether the spelling of the name or taxid in the taxon profile matches the name or taxid in the database.
I have selected a taxon profile and I wanted to open the taxonomy editor. A window opens but nothing is displayed. The taxonomy editor can only be used if the genome sequences in the database have an NCBI taxon id. Please check whether the database contains a taxon id for each genome sequence. Database entries showing a '-1' in the taxid column are not assigned a taxon id. Please download this file: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz and import it using Database -> Get Taxon IDs by GI...
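Before importing, it can help to check which of your sequences would receive a taxon id at all. The following is a minimal sketch, not part of MetaSim: it assumes FASTA headers of the form gi|123456|ref|...| and assumes that gi_taxid_nucl.dmp contains one whitespace-separated "gi taxid" pair per line. The function names and the sample values are made up for illustration.

```python
import re

# Assumed header layout: ">gi|<number>|..." (leading ">" optional).
GI_PATTERN = re.compile(r"^>?gi\|(\d+)\|")


def gi_from_header(header):
    """Return the GI number of a FASTA header as int, or None if absent."""
    match = GI_PATTERN.match(header)
    return int(match.group(1)) if match else None


def load_gi_taxid(lines, wanted):
    """Collect taxon ids for the GI numbers in `wanted`.

    `lines` is any iterable of text lines in the assumed two-column
    "gi taxid" layout; lines that do not fit this layout are skipped.
    """
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue
        gi = int(parts[0])
        if gi in wanted:
            mapping[gi] = int(parts[1])
    return mapping


# Illustrative data only; real headers come from your FASTA files.
headers = [">gi|123456|ref|XP_123456| Example genome", ">my_contig_1 no gi here"]
wanted = {gi_from_header(h) for h in headers} - {None}
dump_lines = ["111\t9606\n", "123456\t562\n"]
taxids = load_gi_taxid(dump_lines, wanted)
missing = [h for h in headers if gi_from_header(h) not in taxids]
```

For the real file, the lines can be read with Python's gzip.open(path, "rt") instead of the in-memory list. Headers that end up in `missing` are the ones that would likely show '-1' in the taxid column.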
I tried to download the taxon ids using Database -> Get Taxon IDs (NCBI ftp) but it did not work. This seems to be a network problem with ftp. Alternatively, download the file from the NCBI ftp server (Link) and import the file manually using Database -> Get Taxon IDs by GI...
I do not need ALL genome sequences that are contained in this huge all.fna.tar.gz file (~760MB). Can I use my own files? Of course. You can import any genome sequence (FASTA format) into the database using Database -> Import Files.. Note that without a gi number, MetaSim is not able to assign unique taxon ids to genome sequences. Without taxon ids, the taxonomy editor cannot be used.
I can provide the community with an empirical error model from another sequencing technology. Maybe this could support and motivate others to develop software and analysis tools based on this error model. Great! Please contact us, so that we can provide this file to others.
I started a simulation generating 10000 reads. The file in the result folder of the project tree contains only 10 FASTA entries. What went wrong? The result file in the project tree only gives a short overview of a few generated reads. The multi-FASTA file with ALL reads can be found at the location where the taxon profile was saved.
Does MetaSim report quality values? No, but all inserted errors/substitutions are reported in the FASTA header of each read.
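To confirm that the full output file holds the expected number of reads, the FASTA entries can be counted directly. This is a minimal sketch that only assumes the standard FASTA convention that each entry starts with a ">" line; the sample lines are made up.

```python
def count_fasta_entries(lines):
    """Count FASTA entries by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


# Illustrative input; for a real file, pass an open text file handle instead.
sample = [">read1\n", "ACGT\n", "TTGA\n", ">read2\n", "GGCC\n"]
n_reads = count_fasta_entries(sample)
```

A file object opened with open(path) can be passed in place of the list, since it yields the file line by line.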
I would like to generate reads for linear genome fragments (e.g. contigs or whole genomes). MetaSim seems to consider each input sequence to be circular. How can I change this? In GUI mode, uncheck the circular checkboxes in the database view for each sequence. When controlling MetaSim via the command line, do the following: a sequence header in your input file will usually look like this: gi|123456|ref|XP_123456| <sequence name> If you add the word "linear" to <sequence name>, MetaSim will generate reads for a linear genome. (This is a "hack" for now. Future software releases will provide a command line option for this.)
I need further empirical error models for simulating Illumina reads. Here is a model for 62 bp reads: errormodel-62bp.mconf. Here is a model for 80 bp reads: errormodel-80bp.mconf. By adding or removing lines with probability values, you can adapt these models to obtain the desired read lengths. These error models will be included in upcoming versions of MetaSim.
There are some bugs in the program. What shall I do with them? Sorry for this. MetaSim is still under development. We are looking forward to any user feedback. So, if you notice any bugs please (let us know). Thanks!
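For input files with many sequences, the header change can be scripted. The sketch below is one possible way to do it and is not part of MetaSim: it appends the word "linear" to the end of every FASTA header that does not already contain it as a separate word. That appending at the end of the header is sufficient is an assumption based on the description above; the function name is made up.

```python
def mark_linear(fasta_lines):
    """Return the FASTA lines with ' linear' appended to each header line.

    Headers that already contain the separate word 'linear' are left
    unchanged. Trailing newlines are stripped from all lines.
    """
    marked = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">") and "linear" not in line.split():
            line = line + " linear"
        marked.append(line)
    return marked


# Illustrative input with made-up headers and sequences.
example = [
    ">gi|123456|ref|XP_123456| Example contig\n",
    "ACGTACGT\n",
    ">gi|654321|ref|XP_654321| Other contig linear\n",
    "GGCCTTAA\n",
]
result = mark_linear(example)
```

Writing "\n".join(result) to a new file keeps the original input untouched, so the circular and linear variants can both be imported if needed.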
- Main window with project tree, taxon abundance profile and message panel. A second window shows the Taxonomy editor that can be alternatively used to determine the abundance values for the source genomes.