A. Trimming Barcodes
...continued from: QIIME Overview Tutorial
Now it's time to start using QIIME!
QIIME commands are run by first specifying the script you want to execute, and then passing that script a series of options. This is all done in one line of text entered at the command line. As an example to start with, we'll trim the barcodes and primers from the sequences, below.
Trim Those Barcodes
Goals
Look up the command line options to QIIME scripts
Use command line options
Trim and quality filter sequence data
Assign sequences to samples based on barcoded primers
The QIIME script split_libraries.py checks each sequence for a barcode and primer, trims them off, and then renames the sequence so that the ID in the FASTA header indicates which sample it came from. This script needs two input files: (1) the FASTA file of raw sequencing data, and (2) the mapping file specifying which barcodes and primers belong to which samples. You can also tell split_libraries.py where to find the qual file, so it can trim out low-quality sequence data.
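The mapping file itself is just a tab-delimited text file. As a rough sketch of the format (the sample IDs, barcodes, primer, and metadata below are made-up placeholders; the real values for this tutorial are already in Fasting_Map.txt), it looks something like this, with columns separated by tabs (shown here with spaces for readability):
#SampleID   BarcodeSequence   LinkerPrimerSequence    Treatment   Description
Sample.1    AGCACGAGCCTA      CATGCTGCCTCCCGTAGGAGT   Control     control_mouse_1
Sample.2    AACTCGTCGATG      CATGCTGCCTCCCGTAGGAGT   Fast        fasted_mouse_1
The first three columns must be named exactly as shown (the split_libraries.py help output quoted below says the same thing), and QIIME generally expects a Description column as the last column.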
To pass all this information to a script on the command line, we use what are called "options" or "flags." After giving the name of the script, you type a space and then an option flag such as "-i" or "-m"; each flag tells the script how to interpret the text that follows it, such as which option to set, which file to process, or what to name a new output file.
MacQIIME Users Only: To source the QIIME environment variables and load up all the MacQIIME scripts in the PATH, you have to start the MacQIIME subshell:
macqiime
which should give the following output:
Sourcing MacQIIME environment variables...
This is the same as a normal terminal shell, except your default python is DIFFERENT (/macqiime/bin/python).
Type "exit" (return) to go back to your normal shell
And your terminal prompt now says "MacQIIME" at the start, to remind you that you have the macqiime executables in your PATH.
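If you want to double-check that the QIIME scripts really are on your PATH (this applies to any install, not just MacQIIME), you can ask the shell where it finds one of them using the standard Unix which command:
which split_libraries.py
If that prints a path, you're ready to go; if it prints nothing (or an error), the QIIME environment isn't loaded in this terminal.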
Read the manual. There are lots of QIIME scripts, and each script has lots of options. Whenever you are using a new script, you should always look up its command line options. The QIIME website lists the options for every script in its script index, including a page dedicated to split_libraries.py. You can also check the command line options of most programs by passing the help flag to the script. For example, if you want to know the options for split_libraries.py, you could type:
split_libraries.py -h
You may have to scroll up in the terminal to read all the output. Notice some of these important options for split_libraries.py:
-o DIR_PREFIX, --dir-prefix=DIR_PREFIX
directory prefix for output files [default: .]
...
-q QUAL_FNAMES, --qual=QUAL_FNAMES
names of qual files, comma-delimited [default: none]
...
-m MAP_FNAME, --map=MAP_FNAME
name of mapping file. NOTE: Must contain a header line
indicating SampleID in the first column and
BarcodeSequence in the second, LinkerPrimerSequence in
the third. [REQUIRED]
-f FASTA_FNAMES, --fasta=FASTA_FNAMES
names of fasta files, comma-delimited [REQUIRED]
Here is an example of running split_libraries.py on the tutorial data (all one long line of text):
split_libraries.py -m Fasting_Map.txt -f Fasting_Example.fna -q Fasting_Example.qual -o split_library_output/
The command line options for QIIME scripts can appear in any order. The flag -m is followed by the name of the mapping file, -f by the name of the FASTA file, -q by the qual file, and -o by the name you want to give the output folder (a new folder that the script will create). Please note: avoid file names with spaces in them. On the command line, a space separates one argument from the next! If you must use a file with spaces in its name, put quotes around the file name.
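The Goals above also mention quality filtering: split_libraries.py applies some quality and length filters by default, and they can be adjusted with additional options. As a hedged sketch only (flag names and defaults can differ between QIIME versions, and the threshold values and output folder name here are arbitrary examples, so confirm everything against your own -h output), a stricter run might look like this:
split_libraries.py -m Fasting_Map.txt -f Fasting_Example.fna -q Fasting_Example.qual -o split_library_output_strict/ -s 25 -l 200 -L 1000
Here -s sets the minimum average quality score, and -l and -L set the minimum and maximum allowed sequence lengths.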
Run the basic command above (the first one, without the extra quality-filtering options); it should only take a few seconds, after which you will see a new command line prompt appear. It is now done! What did it do? The script created the new directory you specified with the -o option. You can verify this with "ls -lh" -- there should be a new directory called split_library_output. You can see the contents of this directory using the ls command as well, by specifying the name of the directory to look in:
ls -lh split_library_output/
There should be three output files in that directory: histograms.txt, seqs.fna, and split_library_log.txt. The log file tells you stats on how your analysis worked out, including a table of the number of sequences assigned to each sample. Read it with less, like this:
less split_library_output/split_library_log.txt
Nice; there were about 150 reads assigned to each sample. Good to know! Quit less (press q), and then look at the output FASTA file:
less split_library_output/seqs.fna
The sequences are shorter, and they now have new IDs! For example, the first two are:
>PC.634_1 FLP3FBN01ELBSX orig_bc=ACAGAGTCGGCT new_bc=ACAGAGTCGGCT bc_diffs=0
CTGGGCCGTGTCTCAGTCCCAATGTGGCCGTTTACCCTCTCAGGCCGGCTACGCATCATC
GCCTTGGTGGGCCGTTACCTCACCAACTAGCTAATGCGCCGCAGGTCCATCCATGTTCAC
GCCTTGATGGGCGCTTTAATATACTGAGCATGCGCTCTGTATACCTATCCGGTTTTAGCT
ACCGTTTCCAGCAGTTATCCCGGACACATGGGCTAGG
>PC.634_2 FLP3FBN01EG8AX orig_bc=ACAGAGTCGGCT new_bc=ACAGAGTCGGCT bc_diffs=0
TTGGACCGTGTCTCAGTTCCAATGTGGGGGCCTTCCTCTCAGAACCCCTATCCATCGAAG
GCTTGGTGGGCCGTTACCCCGCCAACAACCTAATGGAACGCATCCCCATCGATGACCGAA
GTTCTTTAATAGTTCTACCATGCGGAAGAACTATGCCATCGGGTATTAATCTTTCTTTCG
AAAGGCTATCCCCGAGTCATCGGCAGGTTGGATACGTGTTACTCACCCGTGCGCCGGTCG
CCA
The new FASTA IDs created by split_libraries.py are formatted to indicate the sample name, followed by an underscore (_), and then a number. For example, the sequences with IDs PC.634_1 and PC.634_2 both came from the sample PC.634, and the numbers 1 and 2 are there to make sure that each sequence still has a unique FASTA ID. This ID format (SampleName_number) is necessary for future steps in the pipeline, so QIIME will know which sample each read came from (remember, we've removed the barcodes and primers!).
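Because the sample name is embedded in every FASTA ID, you can do quick sanity checks on seqs.fna with ordinary command line tools (these are standard Unix commands, not QIIME scripts). For example, the first command below counts how many sequences were assigned to sample PC.634, and the second counts the total number of sequences in the file, which you can compare against the per-sample table in split_library_log.txt:
grep -c "^>PC.634_" split_library_output/seqs.fna
grep -c "^>" split_library_output/seqs.fna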
To continue on with the analysis, we now have to cluster all those 1399 sequences into operational taxonomic units (OTUs)...
Next step: Picking OTUs and Denoising