C. Taxonomy and the OTU Table

...continued from Picking OTUs and Denoising


Assigning Taxonomy

Taxonomy is assigned to high-throughput 16S rRNA gene sequences using some kind of comparison to a reference database. The most basic approach might be doing a BLAST search and taking the top hit. However, this doesn't account for redundancy or give you any idea of your confidence or specificity. The default algorithm in QIIME is the RDP Classifier. This is a Bayesian classifier that incorporates information about different places in the taxonomic tree where the sequence might fit in, and it calculates the highest probability taxonomy that can be assigned with some specified level of confidence. This still uses a reference database, in this case called a training set. In MacQIIME, the default training set is the Greengenes (GG) reference database clustered at 97% identity, other versions of GG are available for download from here.

We're going to use the QIIME script assign_taxonomy.py as follows (may take a minute):

assign_taxonomy.py -i rep_set.fna -o taxonomy_results/

Some of the options:

    -i [file]  Name of the fasta file to classify (usually OTU representative seqs)

    -o [name]  Name of directory to create for output

To use a custom training set like the download from Greengenes, you have to reference it using the -t and -r options as described in the documentation.

This created a new directory called taxonomy_results. Look in there with ls and you will see that it contains two files: a log file, and a txt file of results. The rep_set_tax_assignments.txt file contains an entry for each representative sequence, listing taxonomy to the greatest depth allowed by the confidence threshold (80% by default, can be changed with the -c option), and a column of confidence values for the deepest level of taxonomy shown.  These data will be really useful, once we have them inside an OTU table!

Building an OTU Table

An OTU table is a form of your sequencing results that will finally be really useful to analyze in excel, visualize, etc.  It is a table giving the count of the number of sequences in each OTU, for each sample, and the taxonomy of that OTU.  Super!

We generate an OTU table with a script called make_otu_table.py as follows

make_otu_table.py -i uclust_picked_otus/inflated_denoised_seqs_otus.txt -t taxonomy_results/rep_set_tax_assignments.txt -o otu_table.biom

Some of the options:

    -i [file]  The OTU map output from pick_otus.py

    -t [file]  The taxonomy file from assign_taxonomy.py

    -o [name]  The name of the output file

This should create an OTU table file called otu_table.biom.  Check it out in less - it is in an XML/JSON format called biom (new as of QIIME 1.5.0).  If you want to look at the OTU counts per sample in a spreadsheet, you'll first have to convert the biom OTU table into a normal text table.  This can be done with the biom script as follows:

biom convert -i otu_table.biom -o otu_table_tabseparated.txt --to-tsv --header-key taxonomy --output-metadata-id "ConsensusLineage"

Now, the new output file otu_table_tabseparated.txt is something you can easily import into a spreadsheet. You can also read it more easily in the terminal:

less otu_table_tabseparated.txt

Note that this tab-separated file format for OTU tables used to be supported by QIIME (v 1.4 and earlier) but is no-longer a supported format as of QIIME 1.5.0. This is due to the needs of researchers generating 16S rRNA gene surveys using the Illumina HiSeq platform, with thousands of samples and hundreds of thousands+ of OTUs. Once the OTU table gets that big, it's much more reasonable to store it in the biom format.

Also note, when analyzing these data in a spreadsheet, that you will first want use the spreadsheet functions to normalize the abundances to the total number of reads in each sample.  If one sample has 100 reads and another has 200 reads, you can't directly compare raw read numbers -- it's more meaningful to look at percentages.  e.g., One OTU was 10% of the abundance in Sample 1 vs 50% of the abundance in Sample 2, etc.

Summarizing Taxonomy

The OTU table is pretty useful, but we can make other tables of taxonomic summaries that we can analyze in a spreadsheet as well.  To summarize taxonomy in terms of relative abundance, we can use the script summarize_taxa.py as follows:

summarize_taxa.py -i otu_table.biom -o taxonomy_summaries/

Some of the options:

    -i [file]  The input OTU table with included taxonomy information

    -o [name]  The name of the directory to be created and filled with goodies

Check out the contents of the taxonomy_summaries folder. You can open any of these files in a spreadsheet and graph taxonomy summaries for your samples!  The numbers are in terms of relative abundance. In other words, they're already normalized. The code L2, L3, etc, refers to the level of taxonomy summarized in the file. I broke down the L3 (class) taxonomy data in this spreadsheet to make a graph that looks like the following. The samples on the left are the controls, and on the right are the fasted treatments.

By doing this in a spreadsheet, I was able to leave out all those extra "Other" categories (which signify unknown classification) and the small classes that wouldn't be visible on the graph anyway. I would just have to specify in the figure legend that the remaining white space represents unclassified reads and low-abundance classes.

If you're impatient and don't want to dive into the spreadsheet just yet, QIIME can make some simple graphs to summarize the results. The script that does this is called plot_taxa_summary.py.  For example, you can summarize taxonomy at the Class level (L3) like this:

plot_taxa_summary.py -i taxonomy_summaries/otu_table_L3.txt -o taxonomy_plot_L3/

Open up the newly created bar_charts.html file in your web browser to see some graphed data.

Next steps: Alignments and Trees