Denoise Seqs with Denoiser

This Denoiser step only applies to 454 pyrosequencing data. It can be skipped, at your own risk.
One problem with 454 pyrosequencing is that sequencing errors can give you effectively more OTUs than are really there. There are a number of strategies for dealing with this problem. The default in QIIME is to use a built-in program called Denoiser, which compares the flowgrams (the raw sequencing data) of similar sequences to see whether the differences between them may be due to erroneous base calls.

You can see some of this raw flowgram data by opening the Fasting_Example.sff.txt file in less. This is a text version of the "SFF" raw data format generated by the 454 pyrosequencing platform. It won't be terribly meaningful to you as it is, but it's always nice to have a feel for what all the data files look like. The "FlowGram" entry for each read gives the integrated signal from each consecutive reagent: T, A, C, G, T, A, C, G, etc. These are the data that the Denoiser will use to denoise our reads.

Our command (run this) uses the script denoise_wrapper.py. Warning: this step could take up to an hour. Once you start it, you may want to go get coffee or do something else for a bit.

denoise_wrapper.py -i Fasting_Example.sff.txt -f split_library_output/seqs.fna -m Fasting_Map.txt -o denoiser/

Some of the options:
-i [filename] Specifies the sff.txt file name
-f [filename] Specifies the fasta file that was created by split_libraries.py
-m [filename] Sample mapping file with primers and barcodes
-o [name] Specifies a directory name for the output
It may seem inconvenient to have to wait for this script, but on a full 454 Titanium plate it could take days or weeks, depending on how many CPUs you use. You can specify the number of CPUs by adding the -n option. If your data come from the Titanium platform, you also have to add the --titanium option.
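To get a feel for what a flowgram encodes, here is a minimal Python sketch of how flow signals relate to base calls. It is only an illustration, not the Denoiser's algorithm: the function name call_bases, the signal values, and the reagent cycle "TACG" are assumptions made for the example (the "Flow Chars" line in your own sff.txt file shows the actual cycle).

```python
# Assumed reagent cycle for 454 flows; check the "Flow Chars" line of your own file.
FLOW_ORDER = "TACG"


def call_bases(flowgram, flow_order=FLOW_ORDER):
    """Turn flow signals into bases by rounding each signal to a homopolymer length."""
    bases = []
    for i, signal in enumerate(flowgram):
        nucleotide = flow_order[i % len(flow_order)]
        run_length = int(signal + 0.5)  # round to the nearest whole number of bases
        bases.append(nucleotide * run_length)
    return "".join(bases)


# Made-up signals: the two reads differ only in the sixth flow (2.10 vs 2.55).
clean = [1.02, 0.03, 0.98, 1.05, 0.10, 2.10, 0.00, 0.90]
noisy = [1.02, 0.03, 0.98, 1.05, 0.10, 2.55, 0.00, 0.90]

print(call_bases(clean))  # TCGAAG
print(call_bases(noisy))  # TCGAAAG
```

A modest shift in one signal adds an extra A to the second read. Differences of this kind between otherwise similar reads are what denoising tries to identify as likely sequencing errors rather than real biological variation.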
The denoise_wrapper.py script created a new output directory called denoiser/, which contains a bunch of new files. The important ones are denoiser_mapping.txt, centroids.fasta and singletons.fasta. The singletons had no other sequences that matched up with them, and the centroids are representatives of clusters that collapsed into a single sequence following denoising. The denoiser_mapping.txt file records which sequences fell into which centroids.

To make use of these data, we have to "inflate" the results to create a new FASTA file that is a theoretically denoised version of the original. The script that does this for us is called inflate_denoiser_output.py:

inflate_denoiser_output.py -c denoiser/centroids.fasta -s denoiser/singletons.fasta -f split_library_output/seqs.fna -d denoiser/denoiser_mapping.txt -o inflated_denoised_seqs.fna

Some of the options:
-c [file] The centroids.fasta file from denoise_wrapper.py
-s [file] The singletons.fasta file from denoise_wrapper.py
-f [file] The seqs.fna file from split_libraries.py
-d [file] The denoiser_mapping.txt file from denoise_wrapper.py
-o [file] The name of the new fasta file to be created

Check the contents of the newly created output file, inflated_denoised_seqs.fna. It should contain all the seqs from split_library_output/seqs.fna, but the sequences will be slightly different depending on the extent to which they were denoised.

For more complex examples and information, see the Denoising Tutorial on qiime.org.

Note: I tested this tutorial both with and without the denoising step, and denoising reduced the number of OTUs from 417 to 209, quite a big difference!

Pick Operational Taxonomic Units
OTUs (operational taxonomic units) are clusters of similar sequences. Much of the QIIME pipeline is concerned with analyzing OTUs rather than the full set of sequences. In this OTU-centric process, you are looking at samples as collections of OTUs, where two different samples may contain some of the same OTUs and some different OTUs. We'll eventually assign each OTU some data, such as taxonomy data, data on how abundant it was in each sample, and data on the evolutionary history of each OTU compared to the others. There are a number of ways to bin sequences into these "clusters" of OTUs, a process called OTU picking. (Some folks may colloquially speak of OTUs with a 97% similarity threshold as being "species," though this is just for the sake of communication where technical lingo would bog down the conversation.) The default OTU-picking method in QIIME is to use a program called uclust.
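To illustrate what "binning sequences into OTUs" means, here is a deliberately simplified Python sketch of greedy clustering at a 97% similarity threshold. It is not how uclust works internally. The names identity and pick_otus are hypothetical, and the similarity measure naively assumes equal-length sequences that are already aligned.

```python
def identity(a, b):
    """Fraction of matching positions (naive: assumes equal-length, pre-aligned sequences)."""
    if len(a) != len(b):
        raise ValueError("this toy measure needs sequences of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pick_otus(seqs, threshold=0.97):
    """Greedy binning: a read joins the first OTU whose seed it matches, else it seeds a new OTU."""
    otus = []  # each entry is (seed_sequence, [ids of member reads])
    for seq_id, seq in seqs:
        for seed, members in otus:
            if identity(seed, seq) >= threshold:
                members.append(seq_id)
                break
        else:
            # No existing OTU was similar enough, so this read starts a new one.
            otus.append((seq, [seq_id]))
    return [members for _, members in otus]


# Made-up 100-base reads: read_2 differs from read_1 at 1 position, read_3 at 8.
base = "ACGT" * 25
reads = [
    ("read_1", base),
    ("read_2", "T" + base[1:]),
    ("read_3", "T" * 10 + base[10:]),
]

print(pick_otus(reads))  # [['read_1', 'read_2'], ['read_3']]
```

The first two reads are 99% identical and land in the same OTU, while the third is only 92% identical to the first and starts its own. Note that in a greedy scheme like this one, the result can depend on the order in which reads are processed.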