How to use BLAST
My personal opinion about how to use BLAST
The opinions expressed here are solely my own, so please do not hold me responsible for any problems that arise from following these recommendations. Use them at your own risk. :-)
In the first version of this document, I will focus on describing relevant parameters of NCBI BLAST, because this is, at least at the moment, the implementation of the BLAST algorithm we are using for our metagenomic analyses. WU-BLAST is another interesting variant, which I tend to use in more complicated scenarios like whole-genome similarity searches or ITS sequence comparison, because in my experience it can be tuned to these settings far better than NCBI BLAST.
But if you do not need this level of fine-tuning, NCBI BLAST may be the better choice, since it is under active development and is steadily being enhanced by the NCBI programming guys.
The settings discussed in the following paragraphs have been selected with very large databases in mind, like NT (blastn) or NR (blastx). With smaller databases, the expectation value threshold can and should be lowered.
BLAST Parameters for query sequence lengths >= 100 bp
With query sequence lengths of 100 bp or more, NCBI BLAST works quite well with standard settings.
In my experience, it is better to change BLAST's low-complexity filtering strategy to soft filtering. This can be done by appending
-F "m D" for blastn, or
-F "m S" for blastx. Please note the quotes: they are necessary because there is a blank between the soft-masking operator (m) and the character indicating the filter type (D for DUST on nucleotide sequences, S for SEG on protein sequences). Without these parameters, some high-scoring segment pairs (HSPs) containing low-complexity regions may break apart into several HSPs with smaller scores. This is undesirable, since MEGAN could then favor another HSP whose score is higher than that of any of the fragments but lower than that of the unbroken HSP.
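The quoting issue can be sidestepped when the command is assembled programmatically. The following is a minimal sketch, assuming the legacy blastall front end with its usual -p (program), -d (database) and -i (query file) options; the helper names, the database name and the file name are made up for illustration, and nothing is executed.

```python
import shlex


def soft_filter_args(program):
    """Return the blastall arguments that switch on soft low-complexity masking."""
    # "m D": soft masking with DUST (nucleotides); "m S": soft masking with SEG (proteins)
    filters = {"blastn": "m D", "blastx": "m S"}
    if program not in filters:
        raise ValueError(f"no soft-filter setting known for {program!r}")
    return ["-F", filters[program]]


def blastall_command(program, database, query_file):
    """Build an argument list for blastall; the command is only built, not run."""
    base = ["blastall", "-p", program, "-d", database, "-i", query_file]
    return base + soft_filter_args(program)


# Hypothetical database and query file names.
cmd = blastall_command("blastn", "nt", "reads.fasta")
# shlex.join adds the shell quoting needed around the argument containing a blank.
print(shlex.join(cmd))
```

Because the filter string is kept as a single list element, it reaches the program as one argument when the list is handed to something like subprocess.run, and the shell quotes are only needed when the command is pasted into a terminal.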
When using blastx, one has to decide which translation table to use for the query sequence. In most cases, I prefer translation table 11 (option:
-Q 11) for bacterial sequences, since it provides some alternative start codons, but the right choice depends strongly on the metagenomic sample. For example, if you are dealing largely with Mollicutes, translation table 4 should be preferred. For a description of the genetic codes and how they differ from the standard code, please read the NCBI documentation.
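As a small illustration of the choice described above, the translation table can be looked up per sample type and turned into the -Q argument. This is only a sketch: the dictionary and function names are hypothetical, and it contains just the two tables mentioned in the text.

```python
# Translation tables named in the text; extend from the NCBI genetic code tables as needed.
GENETIC_CODE = {
    "bacteria": 11,    # bacterial code with some alternative start codons
    "mollicutes": 4,   # preferred when the sample is dominated by Mollicutes
}


def query_code_args(sample_type="bacteria"):
    """Return the blastx arguments that select the query translation table."""
    return ["-Q", str(GENETIC_CODE[sample_type])]


print(query_code_args("mollicutes"))
```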
There are several options for decreasing the output size, namely changing the expectation value threshold with
-e or reducing the maximum number of database sequences for which alignments are reported with
-b. Personally, I do not recommend using these parameters, since their impact on run time is marginal or zero. Any filtering can be done later on, and no one wants to redo all BLAST runs because the chosen settings turned out to be too restrictive...
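Filtering afterwards is straightforward if the results are written in blastall's tabular format (-m 8), where the 11th of the 12 tab-separated columns holds the E-value. The following is a minimal sketch under that assumption; the function names are hypothetical and the two hit lines are invented purely for illustration.

```python
def parse_evalue(text):
    """Convert an E-value field to float, tolerating the abbreviated form 'e-100'."""
    # Some BLAST outputs write 1e-100 as "e-100"; float() needs the leading digit.
    return float("1" + text if text.startswith("e") else text)


def filter_hits(lines, max_evalue):
    """Yield tabular hit lines whose E-value is at most max_evalue."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if parse_evalue(fields[10]) <= max_evalue:  # column 11 holds the E-value
            yield line


# Two made-up hit lines for illustration only (not real BLAST results).
example = [
    "\t".join(["read1", "subjA", "98.0", "100", "2", "0", "1", "100", "5", "104", "1e-40", "180"]),
    "\t".join(["read1", "subjB", "85.0", "40", "6", "0", "1", "40", "9", "48", "3.2", "30.1"]),
]
kept = list(filter_hits(example, max_evalue=1e-5))
print(len(kept), "of", len(example), "hits kept")
```

The same function can be applied again with a different threshold at any time, which is the point of keeping the unfiltered BLAST output.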
BLAST Parameters for short query sequences
For similarity searches with very short fragments, BLAST may not be the best choice. If you want to try it anyway, the word size should be reduced to the minimum, and the expectation value threshold should be adjusted as well. Minimal settings for word size are
-W 7 for blastn, and
-W 2 for blastx in conjunction with reducing the neighborhood word threshold score to
-f 8 or below (this is only necessary for blastx). The expectation value threshold should be set to
-e 100 (lowercase; the uppercase -E sets the gap extension cost). Yes, that's no joke. When comparing against large databases like NT or NR, such high numbers of expected random hits have to be accepted. A lower E-value threshold can be used when only nearly exact matches are desired.
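The short-query settings above can be collected in one place. The following is a minimal sketch, assuming blastall's option letters (-e for the expectation value threshold, -W for word size, -f for the neighborhood word threshold); the function name is hypothetical and the values are simply the ones recommended in the text.

```python
def short_query_args(program, evalue=100):
    """Return blastall arguments for very short queries, as recommended in the text."""
    args = ["-e", str(evalue)]        # permissive expectation value threshold
    if program == "blastn":
        args += ["-W", "7"]           # minimal nucleotide word size
    elif program == "blastx":
        args += ["-W", "2", "-f", "8"]  # minimal protein word size, lowered neighborhood threshold
    else:
        raise ValueError(f"no short-query settings known for {program!r}")
    return args


print(short_query_args("blastx"))
```

Passing a smaller evalue, for example short_query_args("blastn", evalue=1), covers the case where only nearly exact matches are wanted.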
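Additionally, when using blastx, the two-hit algorithm may be disabled by using
-P 1 (according to the blastall option list, 0 selects the default multiple-hit mode and 1 the single-hit mode), but I cannot say how much influence this parameter has, since I have not used it myself so far...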
Please send me any comments, hints, feedback or criticism, so that I can improve this short document... ;-)