PAUDA - a poor man’s BLASTX - high-throughput metagenomic protein database search tool
PAUDA is a new approach to the problem of comparing DNA reads against a database of protein reference sequences, applicable to very large datasets consisting of hundreds of millions or billions of reads. PAUDA is an acronym for "Protein Alignment Using a DNA Aligner". The approach harnesses the high efficiency of DNA read aligners to compute BLASTX-like alignments between sequencing reads and a protein database in a small fraction of the time required by BLASTX. The PAUDA approach makes it possible to process DNA reads at a rate of millions of reads per CPU hour. On the permafrost dataset described below, PAUDA is about 10,000 times faster than BLASTX.
The key idea is to convert all protein sequences into “pseudo DNA”, or “pDNA” for short, by mapping the amino acid alphabet onto a four-letter alphabet that reflects which amino acids are likely to replace each other in significant BLASTX alignments. The reduced alphabet idea is also used in the seeding step of RapSearch2 (Zhao et al., 2012). A high-throughput sequencing read aligner such as Bowtie2 (Langmead and Salzberg, 2012) is then used to compare pDNA reads against a pDNA database. For any match found, the participating pDNA sequences are translated back into protein sequences, and the raw score, bitscore and expected value of the corresponding protein alignment are calculated to determine statistical significance. The final output is a file of statistically significant protein alignments in BLASTX format.
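The idea can be sketched in a few lines of Python. This is only an illustration, not PAUDA's code: the grouping of amino acids in GROUPS is a hypothetical one chosen for readability (the mapping PAUDA actually uses is not given here), and the function names are invented. The scoring helpers use the standard Karlin-Altschul formulas for bit score and expected value; the parameters lam and k depend on the scoring scheme and are left for the caller to supply.

```python
import math

# HYPOTHETICAL grouping of the 20 amino acids into four classes.
# It is NOT PAUDA's actual mapping; it only illustrates the reduced alphabet.
GROUPS = {
    "A": "AGSTPC",   # small residues (assumed class)
    "C": "DENQKRH",  # charged and polar residues (assumed class)
    "G": "ILVM",     # aliphatic residues (assumed class)
    "T": "FWY",      # aromatic residues (assumed class)
}

# Invert the grouping: amino acid letter -> pDNA letter.
AA_TO_PDNA = {aa: base for base, aas in GROUPS.items() for aa in aas}


def protein_to_pdna(protein):
    """Map a protein sequence to pDNA; unknown residues become 'N'."""
    return "".join(AA_TO_PDNA.get(aa, "N") for aa in protein.upper())


def bit_score(raw_score, lam, k):
    """Karlin-Altschul bit score: (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance hits with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bits)


if __name__ == "__main__":
    print(protein_to_pdna("MKVLA"))
```

Because many different amino acids collapse onto the same pDNA letter, a pDNA match is only a candidate: the protein-level score and expected value still decide whether the alignment is reported.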
The PAUDA package provides two commands, pauda-build and pauda-run. The pauda-build command is used to build an index for the reference database, such as microbial-RefSeq or KEGG. The pauda-run command is used to align a file of DNA reads against the index. Both commands use Bowtie2, so Bowtie2 must be installed. For a database containing 10 million protein reference sequences, pauda-build requires about 30 GB of main memory, whereas pauda-run requires up to 16 GB. Under the hood, PAUDA consists of a set of shell scripts and Java programs that run on Linux or Mac OS X. Here is an overview:
One of the largest published metagenomic datasets to date is the set of 12 permafrost samples described and analyzed in Mackelprang et al., 2011. For the purposes of a functional analysis of approximately 246 million reads, the authors performed a BLASTX search of the data against the KEGG database, which reportedly took 800,000 CPU hours at a supercomputer center. The same comparison takes 80 CPU hours using PAUDA, or less than half a day on a multicore server.
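The figures quoted above can be checked with a little arithmetic. In the sketch below, the CPU hours and read count come from the text; the number of cores is an assumption made only to illustrate the "less than half a day" estimate.

```python
# Figures taken from the text above.
blastx_cpu_hours = 800_000
pauda_cpu_hours = 80
reads = 246_000_000

# ASSUMPTION: an 8-core server; the text does not state a core count.
cores = 8

speedup = blastx_cpu_hours / pauda_cpu_hours
reads_per_cpu_hour = reads / pauda_cpu_hours
wall_clock_hours = pauda_cpu_hours / cores  # assumes perfect parallel scaling

print(f"speedup: {speedup:,.0f}x")
print(f"throughput: {reads_per_cpu_hour:,.0f} reads per CPU hour")
print(f"wall clock on {cores} cores: {wall_clock_hours:.0f} hours")
```

Under these assumptions the speedup is 10,000-fold, the throughput is about 3 million reads per CPU hour, and the run finishes in roughly 10 hours of wall-clock time.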
A key result of Mackelprang et al. (2011) is a figure that displays a principal component analysis of the relative abundance of KO groups in the 12 samples, based on a BLASTX comparison against the KEGG database. Their figure shows that two different frozen samples taken from the active layer of the permafrost have similar functional profiles, and that these change very little after thawing for 2 or 7 days. In contrast, two frozen samples obtained from the permafrost layer initially exhibit very distinctive profiles, which gradually become more similar after 2 and then 7 days of thawing. A PCoA of Bray-Curtis distances (Bray and Curtis, 1957; Mitra et al., 2010) based on a PAUDA comparison of the data against the KEGG database delivers the same result in a small fraction of the computational time.
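For readers unfamiliar with the Bray-Curtis distance, the following minimal sketch shows how it can be computed between per-sample KO abundance profiles. The sample names and counts are made-up toy values, not data from the permafrost study, and the function names are illustrative. The PCoA step itself is not shown; it would take the resulting distance matrix as input.

```python
def bray_curtis(u, v):
    """Bray-Curtis distance between two abundance profiles (dict: KO -> value)."""
    keys = set(u) | set(v)
    num = sum(abs(u.get(k, 0) - v.get(k, 0)) for k in keys)
    den = sum(u.get(k, 0) + v.get(k, 0) for k in keys)
    return num / den if den else 0.0


def distance_matrix(profiles):
    """Pairwise Bray-Curtis distances; returns (sample names, square matrix)."""
    names = sorted(profiles)
    matrix = [[bray_curtis(profiles[a], profiles[b]) for b in names] for a in names]
    return names, matrix


if __name__ == "__main__":
    # TOY DATA: invented counts for three hypothetical samples.
    profiles = {
        "sample_1": {"K00001": 10, "K00002": 0, "K00003": 5},
        "sample_2": {"K00001": 5, "K00002": 5, "K00003": 5},
        "sample_3": {"K00001": 10, "K00002": 1, "K00003": 5},
    }
    names, matrix = distance_matrix(profiles)
    for name, row in zip(names, matrix):
        print(name, [round(d, 3) for d in row])
```

The distance is 0 for identical profiles and 1 for profiles that share no KO groups, so samples with similar functional profiles end up close together in the ordination.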
- Supplementary method description: method.pdf
Download PAUDA here.
In addition to PAUDA you will also need to install the latest version of Bowtie2, available here.