PAUDA - a poor man’s BLASTX - high-throughput metagenomic protein database search tool
The key idea is to convert all protein sequences into “pseudo DNA”, or “pDNA” for short, by mapping the amino acid alphabet onto a four-letter alphabet that reflects which amino acids are likely to replace each other in significant BLASTX alignments. A reduced alphabet is also used in the seeding step of RapSearch2 (Zhao et al., 2012). A high-throughput sequencing read aligner such as Bowtie2 (Langmead and Salzberg, 2012) is then used to compare pDNA reads against a pDNA database. For each match found, the participating pDNA sequences are translated back into protein sequences, and the raw score, bitscore and expected value of the corresponding protein alignment are calculated to determine statistical significance. The final output is a file of statistically significant protein alignments in BLASTX format.
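As a rough illustration of the two steps described above, the following Python sketch maps a protein sequence onto a four-letter alphabet and computes a bitscore and expected value from a raw alignment score. The grouping of amino acids in GROUPS is hypothetical and is not the mapping PAUDA uses. The statistics follow the standard Karlin-Altschul formulas, and the default values of lam and k are the parameters commonly quoted for gapped BLOSUM62 alignments; they are assumptions here, not values taken from PAUDA.

```python
import math

# Hypothetical grouping of the 20 amino acids into four classes.
# Keys are pDNA letters, values are the amino acids mapped to them.
# This is for illustration only; it is NOT PAUDA's actual mapping.
GROUPS = {
    "A": "LVIMC",
    "C": "AGSTP",
    "G": "FYWH",
    "T": "EDNQKR",
}
AA_TO_PDNA = {aa: base for base, aas in GROUPS.items() for aa in aas}


def to_pdna(protein):
    """Map a protein sequence onto the four-letter pDNA alphabet.

    Residues outside the 20 standard amino acids become 'N'.
    """
    return "".join(AA_TO_PDNA.get(aa, "N") for aa in protein.upper())


def bit_score(raw_score, lam=0.267, k=0.041):
    """Karlin-Altschul bitscore: (lambda * S - ln K) / ln 2.

    lam and k are assumed scoring-system parameters.
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


def expected_value(bits, query_len, db_len):
    """Expected number of chance hits: m * n * 2^(-bitscore)."""
    return query_len * db_len * 2.0 ** (-bits)


print(to_pdna("MKVLA"))  # ATAAC under the hypothetical grouping above
```

Because the bitscore already absorbs lam and k, the expected value depends only on the bitscore and the sizes of the query and the database.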
The PAUDA package provides two commands, pauda-build and pauda-run. The pauda-build command builds an index for a reference database, such as microbial-RefSeq or KEGG. The pauda-run command aligns a file of DNA reads against the index. Both commands rely on Bowtie2, which must therefore be installed. For a database containing 10 million protein reference sequences, pauda-build requires about 30 GB of main memory and pauda-run requires up to 16 GB. Under the hood, PAUDA consists of a set of shell scripts and Java programs that run on Linux or MacOS X. Here is an overview:
One of the largest published metagenomic datasets to date is the set of 12 permafrost samples described and analyzed in Mackelprang et al. (2011). For a functional analysis of approximately 246 million reads, the authors performed a BLASTX search of the data against the KEGG database, which reportedly took 800,000 CPU hours at a supercomputer center. The same dataset takes 80 CPU hours to process using PAUDA, or less than half a day on a multicore server.
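The arithmetic behind this comparison can be written out as a short Python sketch. The CPU-hour figures are the ones quoted above; the core count is a hypothetical value chosen for illustration, and the wall-clock estimate assumes the work scales perfectly across cores.

```python
# CPU-hour figures quoted in the text.
blastx_cpu_hours = 800_000
pauda_cpu_hours = 80

# Ratio of total compute spent by the two approaches.
speedup = blastx_cpu_hours / pauda_cpu_hours

# Hypothetical server size; assumes ideal scaling across cores.
cores = 8
wall_clock_hours = pauda_cpu_hours / cores

print(speedup)           # 10000.0
print(wall_clock_hours)  # 10.0
```

Under these assumptions, an 8-core server would finish in about 10 hours, which is consistent with "less than half a day".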
A key result of Mackelprang et al. (2011) is a figure that displays a principal component analysis (PCA) of the relative abundance of KO groups in the 12 samples, based on a BLASTX comparison against the KEGG database. Their figure shows that two different frozen samples taken from the active layer of the permafrost have similar functional profiles, and that these change very little after thawing for 2 or 7 days. In contrast, two frozen samples obtained from the permafrost layer initially exhibit very distinct profiles, which gradually become more similar after 2 and then 7 days of thawing. A PCoA of Bray-Curtis distances (Bray and Curtis, 1957; Mitra et al., 2010) based on a PAUDA comparison of the data against the KEGG database delivers the same result in a small fraction of the computational time.
Figure 3b from Mackelprang et al. (2011)
PCA of KEGG profiles based on BLASTX analysis
800,000 CPU hours at a supercomputer center
PCoA of KEGG profiles based on PAUDA analysis
80 CPU hours on a single server
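For readers unfamiliar with the Bray-Curtis distance that underlies the PCoA, here is a minimal Python sketch of how it could be computed between two KO abundance profiles. The read counts in the example are made up for illustration, and the function names are not part of PAUDA. A full PCoA would additionally need an ordination step on the resulting distance matrix, which is not shown.

```python
def relative_abundance(counts):
    """Convert raw per-KO read counts into fractions that sum to 1."""
    total = sum(counts.values())
    return {ko: c / total for ko, c in counts.items()}


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity: sum |p_i - q_i| / sum (p_i + q_i).

    KO groups missing from a profile count as abundance 0.
    Returns 0.0 for identical profiles and 1.0 for disjoint ones.
    """
    keys = set(p) | set(q)
    num = sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
    den = sum(p.get(k, 0.0) + q.get(k, 0.0) for k in keys)
    return num / den if den else 0.0


# Made-up read counts per KO group for two samples.
sample_a = relative_abundance({"K00001": 30, "K00002": 70})
sample_b = relative_abundance({"K00001": 50, "K00002": 30, "K00003": 20})

print(bray_curtis(sample_a, sample_b))  # about 0.4
```

Normalizing to relative abundance first keeps samples with different read counts comparable, which matters when sequencing depth varies between samples.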
- Supplementary method description: method.pdf
In addition to PAUDA you will also need to install the latest version of Bowtie2, available here.