B. Picking OTUs and Denoising

...continued from: Trimming Barcodes


Denoise Seqs with Denoiser

This Denoiser step applies only to 454 pyrosequencing data. You can skip it, at your own risk.

One problem with 454 pyrosequencing is that sequencing errors can give you more OTUs than are really there. There are a number of strategies for dealing with this problem. The default in QIIME is to use a built-in program called Denoiser, which compares the flowgrams (the raw sequencing signal) of similar sequences to see whether the differences between them may have been due to erroneous base calls. You can see some of this raw flowgram data by opening the Fasting_Example.sff.txt file in less.  This is a text version of the "SFF" raw data format generated by the 454 pyrosequencing platform.  It won't be terribly meaningful to you as it is, but it's always nice to have a feel for what all the data files look like. The "FlowGram" entry for each read gives the integrated signal from each consecutive reagent flow, following the flow order T, A, C, G, T, A, C, G, etc. These are the data that the Denoiser will use to denoise our reads.
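To get a feel for how flowgram values relate to base calls, here is a minimal sketch. The flow values below are made up, not taken from Fasting_Example.sff.txt, and real base calling is more sophisticated than simple rounding; the point is just that ambiguous signals (say, 1.5) are where homopolymer errors come from, which is what the Denoiser exploits.

```python
from itertools import cycle

def flowgram_to_seq(flow_values, flow_order="TACG"):
    """Round each flow signal to a homopolymer length and emit bases."""
    seq = []
    for base, signal in zip(cycle(flow_order), flow_values):
        # A signal near 1.0 means one base was incorporated during this flow,
        # near 2.0 means a two-base homopolymer, near 0.0 means none.
        n = int(round(signal))
        seq.append(base * n)
    return "".join(seq)

# Illustrative flow values: T=1.04, A=0.02, C=2.10 (CC), G=0.97
print(flowgram_to_seq([1.04, 0.02, 2.10, 0.97]))  # TCCG
```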

Our next command (run this) uses the script denoise_wrapper.py. (Warning: this step could take up to an hour. Once you start it, you may want to go get coffee or do something else for a bit.)

denoise_wrapper.py -i Fasting_Example.sff.txt -f split_library_output/seqs.fna -m Fasting_Map.txt -o denoiser/

Some of the options: 

-i [filename]  Specifies the sff.txt file name

-f [filename]  Specifies the fasta file that was created by split_libraries.py

-m [filename]  Sample mapping file with primers and barcodes

-o [name]  Specifies a directory name for the output

It may seem inconvenient to have to wait a few minutes for this script, but on a full 454 Titanium plate it could take days or weeks, depending on how many CPUs you use.  You can specify the number of CPUs with the -n option.  If it's Titanium data, you also have to add the --titanium option.

This script created a new output directory called denoiser/ which contains a bunch of new files. The important ones are denoiser_mapping.txt, centroids.fasta, and singletons.fasta. The singletons are sequences that matched no other sequences, and the centroids are representatives of clusters that collapsed into a single sequence following denoising. The denoiser_mapping.txt file records which sequences fell into which centroids.  To make use of these data, we have to "inflate" the results to create a new FASTA file that is a theoretically denoised version of the original. The script that does this for us is called inflate_denoiser_output.py

inflate_denoiser_output.py -c denoiser/centroids.fasta -s denoiser/singletons.fasta -f split_library_output/seqs.fna -d denoiser/denoiser_mapping.txt -o inflated_denoised_seqs.fna

Some of the options:

-c [file]  The centroids.fasta file from denoise_wrapper.py

-s [file]  The singletons.fasta file from denoise_wrapper.py

-f [file] The seqs.fna file from split_libraries.py

-d [file]  The denoiser_mapping.txt file from denoise_wrapper.py

-o [file]  The name of the new fasta file to be created

Check the contents of the newly created output file, inflated_denoised_seqs.fna.  It should contain all the seqs from split_library_output/seqs.fna, but the sequences will be slightly different depending on the extent to which they were denoised.
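The inflation step described above can be sketched in a few lines. The IDs and sequences here are made up and the real file formats are richer than plain dictionaries, but the logic is the same: every read that collapsed into a centroid gets the centroid's (denoised) sequence back under its own ID, and singletons pass through unchanged.

```python
centroids  = {"cluster0": "ACGTACGT"}          # one seq per denoised cluster
singletons = {"PC.354_9": "TTGGCCAA"}          # reads that clustered with nothing
# denoiser mapping: which original reads collapsed into which centroid
mapping    = {"cluster0": ["PC.356_1161", "PC.634_7"]}

inflated = {}
for cluster_id, members in mapping.items():
    for read_id in members:
        # every member gets the denoised centroid sequence, under its own ID
        inflated[read_id] = centroids[cluster_id]
inflated.update(singletons)                    # singletons pass through unchanged

print(len(inflated))  # 3 -- one record per original read
```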

For more complex examples and information, see the Denoising Tutorial on qiime.org.  Note: I tested this tutorial both with and without the denoising step, and denoising reduced the number of OTUs from 417 to 209 -- quite a big difference!

Pick Operational Taxonomic Units

OTUs (operational taxonomic units) are clusters of similar sequences. Much of the QIIME pipeline is concerned with analyzing OTUs rather than the full set of sequences. In this OTU-centric process, you look at samples as collections of OTUs, where two different samples may share some OTUs and differ in others. We'll eventually attach data to each OTU, such as its taxonomy, how abundant it was in each sample, and its evolutionary history relative to the other OTUs. There are a number of ways to bin sequences into these "clusters" of OTUs, a process called OTU picking.  (Some folks may colloquially speak of OTUs at a 97% similarity threshold as "species," though this is just for the sake of communication where technical lingo would bog down the conversation.) The default OTU-picking method in QIIME is to use a program called uclust.
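To make the idea of identity-threshold clustering concrete, here is a deliberately naive greedy-clustering sketch. This is not uclust's actual algorithm (uclust uses fast word matching and real alignments, and these toy sequences are invented); it only illustrates the rule that a sequence joins the first existing cluster it matches above the threshold, or else seeds a new OTU.

```python
def identity(a, b):
    """Fraction of matching positions (assumes equal-length sequences)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, threshold=0.97):
    centroids = []              # the first sequence of each OTU
    otus = []                   # clusters, parallel to centroids
    for s in seqs:
        for i, c in enumerate(centroids):
            if identity(s, c) >= threshold:
                otus[i].append(s)   # joins an existing OTU
                break
        else:
            centroids.append(s)     # no match: seeds a new OTU
            otus.append([s])
    return otus

# Toy 10-bp reads with a 90% threshold (so one mismatch still clusters)
seqs = ["ACGTACGTAC", "ACGTACGTAT", "TTTTTTTTTT"]
print(len(greedy_cluster(seqs, threshold=0.9)))  # 2
```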

The defaults in the QIIME script pick_otus.py for OTU-picking are to use uclust, and to use an identity threshold of 97% (i.e., sequences that are 97% similar should get binned into the same OTU together). You can see how to specify different options in the help info for pick_otus.py.  We're going to use all the defaults, so all we have to specify is the input file name with the -i flag:

pick_otus.py -i inflated_denoised_seqs.fna

Notice that the fasta file we used as input was the output of our previous denoising step. (If you skipped denoising, the input for the -i option in this step could instead be split_library_output/seqs.fna ) Our new command should create a new directory automatically named "uclust_picked_otus." Look inside that directory with ls and you should see three new files: a .uc file, a .log file, and a file called inflated_denoised_seqs_otus.txt.  Let's look at the contents of that file:

less uclust_picked_otus/inflated_denoised_seqs_otus.txt

It's another tab-separated file. This format is called an OTU map.  It is an intermediate file format, and maybe not very useful to you as raw data.  It lists each OTU (the first column is the ID of the OTU, counting up from 0) followed by the IDs of all the sequences that fell into that OTU.  My first few lines look like this:

0       PC.356_1161

1       PC.635_788

2       PC.635_779      PC.634_176      PC.634_7

You can see that OTU 0 and OTU 1 both contain only one sequence, while OTU 2 contains three sequences. Some of the OTUs contain many sequences! Okay, we can quit less now.
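The OTU map format is simple enough to parse by hand, which can be handy for quick sanity checks. Here is a short sketch using the three lines shown above:

```python
# Parse the OTU map: tab-separated, first field is the OTU ID,
# remaining fields are the IDs of the sequences in that OTU.
otu_map_text = """\
0\tPC.356_1161
1\tPC.635_788
2\tPC.635_779\tPC.634_176\tPC.634_7
"""

otu_map = {}
for line in otu_map_text.splitlines():
    fields = line.split("\t")
    otu_map[fields[0]] = fields[1:]   # OTU ID -> list of member sequence IDs

print(len(otu_map["2"]))              # 3 sequences fell into OTU 2
```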

Working with all of your sequences is cumbersome if you are, for example, building an alignment or assigning taxonomy.  Since we have OTUs picked, what we really want to work with is a collection of just one representative sequence per OTU.  We can get this using pick_rep_set.py.

pick_rep_set.py -i uclust_picked_otus/inflated_denoised_seqs_otus.txt -f inflated_denoised_seqs.fna -o rep_set.fna

Some of the options:

-i [file]  The OTU mapping file from pick_otus.py

-f [file]  The fasta file that was used to pick OTUs

-o [file]  The name of the output FASTA file to create
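The idea behind representative-set picking can be sketched in a few lines. The sequences below are made up, and we simply take the first member of each OTU as its representative; pick_rep_set.py supports several strategies, so treat this only as the general shape of the operation.

```python
seqs = {                              # toy sequence lookup (made-up data)
    "PC.356_1161": "ACGT",
    "PC.635_779": "TTGG", "PC.634_176": "TTGA", "PC.634_7": "TTGC",
}
otu_map = {"0": ["PC.356_1161"], "2": ["PC.635_779", "PC.634_176", "PC.634_7"]}

rep_set = {}
for otu_id, members in otu_map.items():
    rep_id = members[0]               # "first member" strategy
    # the output header keeps both the OTU ID and the source sequence ID,
    # like ">0 PC.356_1161" in rep_set.fna
    rep_set[otu_id] = (rep_id, seqs[rep_id])

print(len(rep_set))  # 2 -- one representative per OTU
```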

How many sequences are in our resulting FASTA file? We can do grep -c ">" rep_set.fna to see: mine has 209 sequences in it (one for each OTU). What does the file format look like?  Check it out using less:

>0 PC.356_1161
[sequence data]

>1 PC.635_788
[sequence data]
The first sequence has the ID "0", followed by a note that it came from sequence PC.356_1161, which is the representative of OTU 0.  From here on, the only important part of the header is that "0": it is the ID of that OTU. Now we can create a table that lists how many reads from each OTU were found in each sample (on the next page).

Next step: Taxonomy and the OTU Table