A. Trimming Barcodes

...continued from: QIIME Overview Tutorial

Now it's time to start using QIIME!

QIIME commands are run by first specifying the script you want to execute, and then passing that script a series of options. This is all done in one line of text entered in the command line. As a first example, we'll trim the barcodes and primers from the sequences, below.

Trim Those Barcodes

Goals

- Look up the command line options to QIIME scripts
- Use command line options
- Trim and quality filter sequence data

Assign sequences to samples based on barcoded primers

The QIIME script split_libraries.py checks each sequence for a barcode and primer, trims off the barcode and primer, and then renames the sequence so that the ID in the FASTA header indicates which sample it came from. This script needs two input files: (1) the FASTA file of raw sequencing data, and (2) the mapping file specifying which barcodes and primers belong to which samples. You can also tell the split_libraries.py script where to find the qual file, so it can trim out low-quality sequence data.
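To make the idea concrete, here is a minimal Python sketch of the three steps described above: match the barcode, trim the barcode and primer, and rename the read. It is not how split_libraries.py is actually implemented. It assumes exact barcode and primer matches, does no quality filtering, and uses a hypothetical four-base primer (GGTT) and made-up reads. The function name demultiplex and the simple running counter are choices made for this illustration only.

```python
def demultiplex(records, barcode_to_sample, primer):
    """Yield (new_id, trimmed_sequence) for every read that can be assigned.

    records: iterable of (read_id, sequence) pairs
    barcode_to_sample: dict mapping barcode string -> sample name
    primer: primer expected immediately after the barcode (exact match only)
    """
    # Assumption: every barcode in the mapping has the same length.
    bc_len = len(next(iter(barcode_to_sample)))
    counter = 0
    for read_id, seq in records:
        barcode, rest = seq[:bc_len], seq[bc_len:]
        sample = barcode_to_sample.get(barcode)
        if sample is None or not rest.startswith(primer):
            continue  # read cannot be assigned to a sample, so skip it
        counter += 1  # running number keeps every new ID unique
        new_id = f"{sample}_{counter} {read_id} orig_bc={barcode}"
        yield new_id, rest[len(primer):]


# Toy data: the barcode is the one shown for PC.634 in this tutorial;
# the primer and the reads themselves are invented for the example.
toy_reads = [
    ("read_a", "ACAGAGTCGGCT" + "GGTT" + "CTGGGCC"),
    ("read_b", "TTTTTTTTTTTT" + "GGTT" + "AAAAAAA"),  # unknown barcode
    ("read_c", "ACAGAGTCGGCT" + "GGTT" + "TTGGACC"),
]
for new_id, trimmed in demultiplex(toy_reads, {"ACAGAGTCGGCT": "PC.634"}, "GGTT"):
    print(">" + new_id)
    print(trimmed)
```

The read with the unknown barcode is dropped, and the two remaining reads come out shorter and carry the sample name in their new IDs.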

To pass all this information to a script in the command line, we use what are called "options" or "flags." After giving the name of the script, you type a space and then an option flag such as "-i" or "-m". Many of these flags take a value: the text that follows the flag tells the script which setting to use, which file to process, or what name to give a new output file or folder.
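If you are curious how a script turns flags into values, the following Python sketch mimics the idea with the standard argparse module. It is only an illustration: it is not QIIME's own option-parsing code, and the parser here just reuses four flag names from the split_libraries.py help text.

```python
import argparse

# A toy parser that borrows four flag names from split_libraries.py.
parser = argparse.ArgumentParser(description="Toy illustration of option flags")
parser.add_argument("-m", "--map", dest="map_fname", required=True,
                    help="name of mapping file")
parser.add_argument("-f", "--fasta", dest="fasta_fnames", required=True,
                    help="names of fasta files, comma-delimited")
parser.add_argument("-q", "--qual", dest="qual_fnames", default=None,
                    help="names of qual files, comma-delimited")
parser.add_argument("-o", "--dir-prefix", dest="dir_prefix", default=".",
                    help="directory prefix for output files")

# Each flag is followed by its value; optional flags fall back to defaults.
args = parser.parse_args(
    ["-m", "Fasting_Map.txt", "-f", "Fasting_Example.fna", "-o", "split_library_output/"]
)
print(args.map_fname, args.fasta_fnames, args.qual_fnames, args.dir_prefix)
```

Because each value is tied to the flag in front of it, the same flags can be given in a different order, or in their long form (for example --map=Fasting_Map.txt), and the parsed result is the same.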

MacQIIME Users Only: To source the QIIME environment variables and load up all the MacQIIME scripts in the PATH, you have to start the MacQIIME subshell:

macqiime

which should give the following output:

Sourcing MacQIIME environment variables...

This is the same as a normal terminal shell, except your default python is DIFFERENT (/macqiime/bin/python).

Type "exit" (return) to go back to your normal shell

And your terminal prompt now says "MacQIIME" at the start, to remind you that you have the macqiime executables in your PATH.

Read the manual. There are lots of QIIME scripts, and each script has lots of options. Whenever you use a new script, look up its command line options first. The QIIME website lists the options for every script in its scripts index, including a page covering all the options for split_libraries.py. You can also check the command line options of most programs by passing the help flag to the script. For example, to see the options for split_libraries.py, type:

split_libraries.py -h

You may have to scroll up in the terminal to read all the output. Notice some of these important options for split_libraries.py:

-o DIR_PREFIX, --dir-prefix=DIR_PREFIX
        directory prefix for output files [default: .]
...
-q QUAL_FNAMES, --qual=QUAL_FNAMES
        names of qual files, comma-delimited [default: none]
...
-m MAP_FNAME, --map=MAP_FNAME
        name of mapping file. NOTE: Must contain a header line
        indicating SampleID in the first column and
        BarcodeSequence in the second, LinkerPrimerSequence in
        the third. [REQUIRED]
-f FASTA_FNAMES, --fasta=FASTA_FNAMES
        names of fasta files, comma-delimited [REQUIRED]
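The help text for -m says the mapping file's header must start with SampleID, BarcodeSequence, and LinkerPrimerSequence, in that order. A quick way to check a header line yourself can be sketched in Python as follows. This is an illustration, not a QIIME function. It assumes the mapping file is tab-delimited and that the header line may begin with a "#" character, and the extra column names in the example (Treatment, Description) are only placeholders.

```python
REQUIRED_FIRST_COLUMNS = ["SampleID", "BarcodeSequence", "LinkerPrimerSequence"]


def header_is_valid(header_line):
    """Return True if the header starts with the three required columns."""
    # Assumption: columns are separated by tabs and the line may start with '#'.
    fields = header_line.rstrip("\n").lstrip("#").split("\t")
    return fields[:3] == REQUIRED_FIRST_COLUMNS


good = "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tTreatment\tDescription\n"
bad = "#SampleID\tLinkerPrimerSequence\tBarcodeSequence\tTreatment\tDescription\n"
print(header_is_valid(good))
print(header_is_valid(bad))
```

To check a real file, you could read its first line with open("Fasting_Map.txt").readline() and pass that to header_is_valid.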

Here is an example of running split_libraries.py on the tutorial data (all one long line of text):

split_libraries.py -m Fasting_Map.txt -f Fasting_Example.fna -q Fasting_Example.qual -o split_library_output/

The command line options for QIIME scripts can appear in any order. The flag -m is followed by the name of the mapping file, the flag -f is followed by the name of the FASTA file, -q is followed by the qual file, and -o is followed by the name you want to give to the output folder (a new folder that will be created by the script). Please note: avoid using file names with spaces in them. In the command line, a space separates one argument from another! If you want to use files with spaces in the names, then you have to put quotes around the file name.
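To see why spaces in file names cause trouble, the Python sketch below uses the standard shlex module, which splits a command line in roughly the way a POSIX shell does. The file name "Fasting Map.txt" is hypothetical and exists only to show the effect of the space.

```python
import shlex

# Without quotes, the space splits the file name into two separate arguments.
unquoted = shlex.split("split_libraries.py -m Fasting Map.txt -f Fasting_Example.fna")
print(unquoted)
# ['split_libraries.py', '-m', 'Fasting', 'Map.txt', '-f', 'Fasting_Example.fna']

# With quotes, the file name stays together as a single argument.
quoted = shlex.split('split_libraries.py -m "Fasting Map.txt" -f Fasting_Example.fna')
print(quoted)
# ['split_libraries.py', '-m', 'Fasting Map.txt', '-f', 'Fasting_Example.fna']
```

In the unquoted case the script would receive "Fasting" as the mapping file and "Map.txt" as a stray extra argument, which is why quoting (or avoiding spaces altogether) matters.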

Run the command above; it should take only a few seconds, after which a new command line prompt will appear. It is now done! What did it do? The script created the new directory you specified with the -o option. You can verify this with "ls -lh"; there should be a new directory called split_library_output. You can also use ls to see the contents of this directory by specifying its name:

ls -lh split_library_output/

There should be three output files in that directory: histograms.txt, seqs.fna, and split_library_log.txt. The log file tells you stats on how your analysis worked out, including a table of the number of sequences assigned to each sample. Read it with less, like this:

less split_library_output/split_library_log.txt

Nice; there were about 150 reads assigned to each sample. Good to know! Press q to quit less, and then look at the output FASTA file:

less split_library_output/seqs.fna

The sequences are shorter, and they now have new IDs! For example, the first two are:

>PC.634_1 FLP3FBN01ELBSX orig_bc=ACAGAGTCGGCT new_bc=ACAGAGTCGGCT bc_diffs=0
CTGGGCCGTGTCTCAGTCCCAATGTGGCCGTTTACCCTCTCAGGCCGGCTACGCATCATC
GCCTTGGTGGGCCGTTACCTCACCAACTAGCTAATGCGCCGCAGGTCCATCCATGTTCAC
GCCTTGATGGGCGCTTTAATATACTGAGCATGCGCTCTGTATACCTATCCGGTTTTAGCT
ACCGTTTCCAGCAGTTATCCCGGACACATGGGCTAGG
>PC.634_2 FLP3FBN01EG8AX orig_bc=ACAGAGTCGGCT new_bc=ACAGAGTCGGCT bc_diffs=0
TTGGACCGTGTCTCAGTTCCAATGTGGGGGCCTTCCTCTCAGAACCCCTATCCATCGAAG
GCTTGGTGGGCCGTTACCCCGCCAACAACCTAATGGAACGCATCCCCATCGATGACCGAA
GTTCTTTAATAGTTCTACCATGCGGAAGAACTATGCCATCGGGTATTAATCTTTCTTTCG
AAAGGCTATCCCCGAGTCATCGGCAGGTTGGATACGTGTTACTCACCCGTGCGCCGGTCG
CCA

The new FASTA IDs created by split_libraries.py are formatted to indicate the sample name, followed by an underscore (_), and then a number. For example, the sequences with IDs PC.634_1 and PC.634_2 both came from the sample PC.634, and the numbers 1 and 2 are there to make sure that each sequence still has a unique FASTA ID. This ID format, SampleName_[number], is necessary for future steps in the pipeline, so that QIIME knows which sample each read came from. (Remember: we've removed the barcodes and primers!)
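Because the sample name is encoded in every ID, it can be recovered with a few lines of code. The Python sketch below is only an illustration of the SampleName_[number] convention, not something QIIME provides; the function names are made up for this example. It splits each ID on its last underscore and counts reads per sample.

```python
from collections import Counter


def sample_from_header(header):
    """Return the sample name from a header such as '>PC.634_1 FLP3... bc_diffs=0'."""
    seq_id = header.lstrip(">").split()[0]    # e.g. 'PC.634_1'
    sample, _number = seq_id.rsplit("_", 1)   # split on the LAST underscore only
    return sample


def count_reads_per_sample(lines):
    """Count FASTA header lines per sample; sequence lines are ignored."""
    return Counter(sample_from_header(line) for line in lines if line.startswith(">"))


# The two headers shown in this tutorial, with shortened sequence lines.
example_lines = [
    ">PC.634_1 FLP3FBN01ELBSX orig_bc=ACAGAGTCGGCT new_bc=ACAGAGTCGGCT bc_diffs=0",
    "CTGGGCCGTGTCTCAGTCCC",
    ">PC.634_2 FLP3FBN01EG8AX orig_bc=ACAGAGTCGGCT new_bc=ACAGAGTCGGCT bc_diffs=0",
    "TTGGACCGTGTCTCAGTTCC",
]
print(count_reads_per_sample(example_lines))
```

Splitting on the last underscore means a sample name that itself contains an underscore would still be recovered intact. To run this on the real output, you could open split_library_output/seqs.fna and pass the file object to count_reads_per_sample.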

To continue on with the analysis, we now have to cluster all those 1399 sequences into operational taxonomic units (OTUs)...

Next step: Picking OTUs and Denoising