Citing the steps in the QIIME pipeline

The MacQIIME package includes many separate software packages developed by many different research groups, all wrapped together by the QIIME scripts. Since you install MacQIIME in just one step, all those separate software packages seem somewhat invisible, and you may not even notice whose software you're really using. Here are some of the software packages you may be using within MacQIIME, and how to cite them, organized into the separate steps in a common sequence analysis pipeline.

This is not a complete list. If you notice an oversight or an important reference missing, please email me and let me know. Thanks!  -Jeff

Citing the overall QIIME package

The overall QIIME software package/pipeline can be cited as follows:

Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, Gonzalez Pena A, Goodrich JK, Gordon JI, Huttley GA, Kelley ST, Knights D, Koenig JE, Ley RE, Lozupone CA, McDonald D, Muegge BD, Pirrung M, Reeder J, Sevinsky JR, Turnbaugh PJ, Walters WA, Widmann J, Yatsunenko T, Zaneveld J, Knight R. 2010. QIIME allows analysis of high-throughput community sequencing data. Nature Methods 7(5): 335-336.

Citing the many QIIME pipeline steps

Denoising 454 data

If you use the denoise_wrapper.py script in the recommended default denoising method, you are using Denoiser. Cite this:

Reeder J, Knight R. 2010. Rapidly denoising pyrosequencing amplicon reads by exploiting rank-abundance distributions. Nature Methods 7:668-669.

Alternatively, you might use AmpliconNoise, which is the newest version of PyroNoise. If you used ampliconnoise.py instead of denoise_wrapper.py, cite this:

Quince C, Lanzén A, Curtis TP, Davenport RJ, Hall N, Head IM, Read LF, Sloan WT. 2009. Accurate determination of microbial diversity from 454 pyrosequencing data. Nature Methods 6:639-641.

OTU picking

If you pick OTUs using pick_otus.py, there are several options for the method (-m). 

If you used the default method (no -m specified, or -m uclust), then you used uclust. If you used -m usearch61, you used USEARCH, a more recent version of this program. Either way, cite this:

Edgar RC. 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26(19):2460-2461.

If you used -m mothur to pick OTUs from an alignment using furthest neighbor, average neighbor, or nearest neighbor clustering, you used mothur 1.25. Cite this:

Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF. 2009. Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75(23):7537-7541.

If you used -m blast to pick OTUs, you used BLAST. Cite this:

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol 215(3):403-410.
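If you keep a log of the commands you ran, the OTU-picking rules above can be applied mechanically. The following is a minimal sketch, not part of QIIME: the function name, the dictionary, and the short citation keys are made up for illustration, and it only looks at the -m option described above.

```python
import shlex

# Short citation keys for each -m value described in this section.
# These keys and this helper are illustrative; they are not part of QIIME.
OTU_METHOD_CITATIONS = {
    "uclust": "Edgar 2010",
    "usearch61": "Edgar 2010",
    "mothur": "Schloss et al. 2009",
    "blast": "Altschul et al. 1990",
}


def otu_picking_citation(command):
    """Return the citation key for a logged pick_otus.py command line.

    Returns None when the -m value is not one of the methods listed above.
    """
    tokens = shlex.split(command)
    method = "uclust"  # the default when -m is not given, per the text above
    for i, token in enumerate(tokens):
        if token == "-m" and i + 1 < len(tokens):
            method = tokens[i + 1]
    return OTU_METHOD_CITATIONS.get(method)


if __name__ == "__main__":
    print(otu_picking_citation("pick_otus.py -m mothur"))
    print(otu_picking_citation("pick_otus.py"))
```

A None result means the method is not covered by this page, so you would need to look up the appropriate reference yourself.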

Alignment

If you used the default pipeline of aligning your reads to the Greengenes Core reference alignment, then you used PyNAST:

Caporaso JG, Bittinger K, Bushman FD, DeSantis TZ, Andersen GL, Knight R. 2010. PyNAST: a flexible tool for aligning sequences to a template alignment. Bioinformatics 26:266-267.

The Greengenes core reference alignment (used by default) can be cited here:

DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, et al. 2006. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol 72(7): 5069-5072.

If you used -m infernal to build your alignment, then you used Infernal:

Nawrocki EP, Kolbe DL, Eddy SR. 2009. Infernal 1.0: Inference of RNA alignments. Bioinformatics 25:1335-1337.

Chimera detection

If you used the default method (-m ChimeraSlayer) in identify_chimeric_seqs.py, then you used ChimeraSlayer, which can be cited as follows:

Haas BJ, Gevers D, Earl AM, Feldgarden M, Ward DV, Giannoukos G, et al. 2011. Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Research 21:494-504.

Alternatively, you may have used -m blast_fragments, which uses BLAST:

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol 215(3):403-410.

A newer chimera screening option is to run UCHIME chimera detection during the OTU-picking step (pick_otus.py -m usearch61 ...), which means you never have to run identify_chimeric_seqs.py. This UCHIME method uses the USEARCH program, so cite both of these:

Edgar RC, Haas BJ, Clemente JC, Quince C, Knight R. 2011. UCHIME improves sensitivity and speed of chimera detection. Bioinformatics btr381.

Edgar RC. 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26(19):2460-2461.

Schloss et al. re-implemented the code in mothur and benchmarked its performance in the reference below, which is also good reading if you are wondering about the different chimera detection methods for high-throughput (HTP) 16S rRNA gene surveys:

Schloss PD, Gevers D, Westcott SL. 2011. Reducing the effects of PCR amplification and sequencing artifacts on 16S rRNA-based studies. PLoS ONE 6(12):e27310.

Assigning taxonomy

By default, in MacQIIME, taxonomy is assigned based on the Greengenes taxonomy and a Greengenes reference database (currently version 12_10). Here are a couple of good references to cite for this taxonomy reference database:

McDonald D, Price MN, Goodrich J, Nawrocki EP, DeSantis TZ, Probst A, Andersen GL, Knight R, Hugenholtz P. 2012. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J 6(3): 610–618.

Werner JJ, Koren O, Hugenholtz P, DeSantis TZ, Walters WA, Caporaso JG, Angenent LT, Knight R, Ley RE. 2012. Impact of training sets on classification of high-throughput bacterial 16S rRNA gene surveys. ISME J 6:94-103.

Additionally, you should cite the classification software that performed the taxonomic assignment. References for some of the options are listed below.

If you used the default taxonomy assignment method in assign_taxonomy.py, you used the RDP Classifier 2.2. Cite this:

Wang Q, Garrity GM, Tiedje JM, Cole JR. 2007. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol 73(16): 5261-5267.

If you used -m rtax to assign taxonomy, you used RTAX and USEARCH and should cite them both:

Soergel DAW, Dey N, Knight R, Brenner SE. 2012. Selection of primers for optimal taxonomic classification of environmental 16S rRNA gene sequences. ISME J 6: 1440–1444.

Edgar RC. 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26(19):2460-2461.

If you used -m mothur to assign taxonomy, you used the mothur 1.25 implementation of naive Bayesian classification that is based on the RDP Classifier. You should cite both of them:

Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF. 2009. Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75(23):7537-7541.

Wang Q, Garrity GM, Tiedje JM, Cole JR. 2007. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol 73(16): 5261-5267.

If you used -m blast, you used BLAST. Cite this:

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol 215(3):403-410.
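Taxonomy assignment needs two kinds of citation at once: the reference database and the classification software. A rough sketch of how the two combine is shown below. The names are hypothetical, the key "default" is just a placeholder for "no -m given", and the greengenes switch assumes you would drop the database references if you supplied a different reference set.

```python
# Reference database citations that apply when the default Greengenes
# taxonomy and reference database are used (see above).
GREENGENES_TAXONOMY = ["McDonald et al. 2012", "Werner et al. 2012"]

# Citations for the classification software, keyed by the -m value given to
# assign_taxonomy.py. "default" is a placeholder key meaning "no -m given".
TAXONOMY_METHOD_CITATIONS = {
    "default": ["Wang et al. 2007"],                     # RDP Classifier 2.2
    "rtax": ["Soergel et al. 2012", "Edgar 2010"],       # RTAX and USEARCH
    "mothur": ["Schloss et al. 2009", "Wang et al. 2007"],
    "blast": ["Altschul et al. 1990"],
}


def taxonomy_citations(method="default", greengenes=True):
    """List the citation keys for one taxonomy assignment run."""
    citations = list(GREENGENES_TAXONOMY) if greengenes else []
    citations.extend(TAXONOMY_METHOD_CITATIONS[method])
    return citations


if __name__ == "__main__":
    for key in taxonomy_citations("rtax"):
        print(key)
```

An unknown method raises a KeyError here on purpose, so that a missing reference is noticed rather than silently skipped.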

Building a phylogenetic tree

The default method in make_phylogeny.py (the same as -m fasttree) builds an approximately-maximum-likelihood phylogenetic tree using FastTree 2.1.3. If this is what you did, you can cite:

Price MN, Dehal PS, Arkin AP. 2010. FastTree 2-Approximately Maximum-Likelihood Trees for Large Alignments. PLoS ONE 5(3):e9490.

If you used insert_seqs_into_tree.py to insert sequences into an existing tree, you used RAxML and pplacer:

Stamatakis A. 2006. RAxML-VI-HPC: Maximum Likelihood-based Phylogenetic Analyses with Thousands of Taxa and Mixed Models. Bioinformatics 22(21):2688-2690.

Matsen FA, Kodner RB, Armbrust EV. 2010. pplacer: linear time maximum-likelihood and Bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics 11:538.

There are a number of other possible tree building methods you could have chosen:

-m raxml (used raxml 7.0.3 to build a maximum-likelihood (ML) tree)

Stamatakis A. 2006. RAxML-VI-HPC: Maximum Likelihood-based Phylogenetic Analyses with Thousands of Taxa and Mixed Models. Bioinformatics 22(21):2688-2690.

-m clearcut (used clearcut 1.0.9 to build a relaxed neighbor-joining (RNJ) tree)

Evans J, Sheneman L, Foster JA. 2006. Relaxed Neighbor-Joining: A Fast Distance-Based Phylogenetic Tree Construction Method. J Mol Evol 62:785-792.

-m clustalw (used clustalw 2.0.12 to build a neighbor-joining (NJ) tree)

Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG. 2007. Clustal W and Clustal X version 2.0. Bioinformatics 23:2947-2948.
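When writing a methods section, it can help to pair each tree-building option with the tool version and reference in one place. The sketch below does that for the -m values listed above. The table and function names are made up, and the version numbers are simply the ones quoted on this page, so check them against your own installation before using the sentence.

```python
# Tool name, version as quoted on this page, and citation key for each
# make_phylogeny.py -m value listed above. Illustrative only: confirm the
# versions against your own MacQIIME installation.
TREE_METHODS = {
    "fasttree": ("FastTree 2.1.3", "Price et al. 2010"),
    "raxml": ("RAxML 7.0.3", "Stamatakis 2006"),
    "clearcut": ("clearcut 1.0.9", "Evans et al. 2006"),
    "clustalw": ("clustalw 2.0.12", "Larkin et al. 2007"),
}


def tree_methods_sentence(method="fasttree"):
    """Draft a one-line methods statement for the chosen tree builder."""
    tool, citation = TREE_METHODS[method]
    return "The phylogenetic tree was built with {} ({}).".format(tool, citation)


if __name__ == "__main__":
    print(tree_methods_sentence())
    print(tree_methods_sentence("clearcut"))
```

The default argument mirrors the default described above, where no -m option means FastTree.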

Comparative analysis

If you used UniFrac to calculate beta-diversity, this is a good reference:

Lozupone C, Knight R. 2005. UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol 71(12): 8228-8235.

If you used beta_diversity_through_plots.py to visualize three-dimensional PCoA plots, you used Emperor. Cite this:

Vazquez-Baeza Y, Pirrung M, Gonzalez A, Knight R. 2013. Emperor: A tool for visualizing high-throughput microbial community data. Gigascience 2(1):16.
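Several references on this page recur across steps (the USEARCH paper, for instance, applies to OTU picking, UCHIME chimera screening, and RTAX). One way to build a final reference list without duplicates is sketched below. The step labels, the short citation keys, and the example run are all invented for illustration and do not describe any particular analysis.

```python
def unique_citations(steps):
    """Collect citation keys across pipeline steps, keeping first-seen order.

    `steps` is a list of (step_label, [citation_key, ...]) pairs.
    """
    seen = set()
    ordered = []
    for _label, citations in steps:
        for key in citations:
            if key not in seen:
                seen.add(key)
                ordered.append(key)
    return ordered


# A hypothetical run; replace it with the steps you actually performed.
example_run = [
    ("QIIME overall", ["Caporaso et al. 2010 (QIIME)"]),
    ("OTU picking, -m usearch61", ["Edgar 2010"]),
    ("Chimera screening with UCHIME", ["Edgar et al. 2011", "Edgar 2010"]),
    ("Alignment with PyNAST", ["Caporaso et al. 2010 (PyNAST)",
                               "DeSantis et al. 2006"]),
    ("Beta diversity with UniFrac", ["Lozupone and Knight 2005"]),
    ("PCoA plots with Emperor", ["Vazquez-Baeza et al. 2013"]),
]

if __name__ == "__main__":
    for key in unique_citations(example_run):
        print(key)
```

Keeping first-seen order means the list follows the order of the pipeline, which is usually also the order in which the tools appear in a methods section.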