F. Compare Samples to Each Other

...continued from Alpha Diversity

  • Set up a parameter file for a workflow script
  • Use a workflow script
  • Perform multiple rarefactions at an even depth
  • Calculate beta diversity metrics for between-sample comparisons
  • Visualize the results of beta diversity analysis

Beta diversity is a term for the comparison of samples to each other. A beta diversity metric does not calculate a value for each sample. Rather, it calculates a distance between a pair of samples. If you have many samples (in this tutorial we have nine samples), a beta diversity metric will return a matrix of the distances of all samples to all other samples. If you have experience in phylogenetics, you may know that a distance matrix can be visualized as a tree. Distance matrices can also be visualized as a graph of points, a network, or any other creative method you can come up with. In this tutorial, we'll use principal coordinates analysis to visualize distances between samples on an x-y-z plot.

Sequencing depth can have an effect on beta diversity analysis, just as it does on alpha diversity. The effect of sequencing depth is not as easy to visualize, and it depends a lot on what distance metric you chose. The safest bet is always to chose a single sequencing depth that is less than the total number of sequences in your smallest sample, and rarefy all your samples at that same depth.  

Before we get started, lets introduce another feature that will help us do complicated analyses with less time on our part.

Workflow scripts in QIIME

Up until this point, we've been doing everything the long way, step-by-step, to help you explore and understand how all the steps work and fit together into an analysis pipeline. Now, it seems like a good time to introduce some shortcuts to make your life easier! QIIME has some special built-in scripts that are categorized as workflow scripts. These workflow scripts take an analysis that would normally be a long series of steps, and performs all the steps in a series. All you have to do, as a user, is run the one workflow script. For example, for our previous section on alpha diversity analysis we had to run three different scripts in a series. There is actually a workflow script that can do all these steps for us in one command: alpha_rarefaction.py. And, for our process [tutorial pages C and D] of picking OTUs, picking a reference set of sequences, assigning taxonomy, building an OTU table, aligning our sequences, filtering the alignment, and building a phylogenetic tree are all combined into another single workflow script called pick_otus_through_otu_table.py. (Note that this workflow script does not do the denoising step, though.)

We'll discuss the pick_otus_through_otu_table workflow more in the next section.  For now, we'll start with the workflow script that helps us with our task at hand: beta diversity analysis. 

Note:  Once you get more advanced in QIIME, you may want to set a lot of custom options. Because of the complexity of a workflow script, many of the options have to be set by supplying an input file called a parameters file (with the -p option). You don't need a parameters file, though. If one is not specified, all the steps of the workflow will happen with relatively common default settings.

Jackknifed beta diversity analysis

We're going to use the workflow script jackknifed_beta_diversity.py to do our beta diversity analysis. This script does the following steps:

  1. Compute a beta diversity distance matrix from the full data set
  2. Perform multiple rarefactions at a single depth
  3. Compute distance matrices for all the rarefied OTU tables
  4. Build UPGMA trees for all the rarefactions
  5. Compare all the trees to get consensus and support values for branching
  6. Perform principal coordinates analysis on all the rarefied distance matrices
  7. Generate plots of the principal coordinates

Let's start this following script, and we'll talk about all the parts to the script while it's running:

jackknifed_beta_diversity.py -i otu_table.biom -o jackknifed_beta_diversity/ -e 90 -m Fasting_Map.txt -t rep_set_tree.tre

Some of the options for jackknifed_beta_diversity.py:

    -i [file]  The OTU table file

    -o [name]  Choose a name for the output directory to be created

    -e [number]  The depth for even rarefactions

    -m [file]  The sample mapping file

    -t [file]  The phylogenetic tree file

Lets start with the -e option for the rarefaction depth. This script performs rarefaction at only one depth. The idea is that we want to create a large collection of distance matrices that we can do statistics on. For this to work, we need to choose a depth that is significantly smaller than the number of seqs in the smallest sample. If any samples have fewer seqs than the rarefaction depth, they will be left out!  Also, if the rarefaction depth is too high, then every rarefied OTU table will be the same, and there won't be any differences between all our distance matrices.  I chose 90 seqs/sample as our depth, because most samples have a little under 150 seqs/sample total. Going even lower may have been advisable, perhaps to 75 seqs/sample or less, but this is a pretty small data set to start with so I just made a compromise. 

Let's go back and look at that parameters file... there are lots of variables in there that you can set.  You'll notice that most of them are not set to anything. Also, many of them have nothing to do with our jackknifed_beta_diversity.py workflow.  That's okay.  The idea is that you can set up one parameters file for all your workflows in a project, and any variables you leave blank in the parameters file will just be set to their default values.

Okay, my jackknifed_beta_diversity.py workflow is done; there are LOTS of new files that it created in that jackknifed_beta_diversity/ folder! 

There is one sub-folder full of rarefied OTU tables, and three separate sub-folders for each of the three different distance metrics we chose! Let's look in the unweighted UniFrac folder. It contains a folder called "3d_plots," (Note: This has changed to "emperor_pcoa_plots" in QIIME 1.8.0 -- I will have to update this in tutorial later...) and in there is an html file named index.html. Double-click on this html file to open it in your browser. It may take a minute, but you should eventually see a Java application called Emperor that opens up, displaying a graph like this:
Those are the results of our principal coordinates analysis! Each point represents one of the samples. The important thing to realize here is that the axes are meaningless. This is different from principal components analysis, which you may be more familiar with. If you do principal components analysis, you get out axes that have weighted components from real variables that went into the ordination. In principal coordinates analyses, these axes were created to reproduce the distances between samples, and we've lost the variables that the distances originally came from.  You'll notice that samples PC.634, PC.635, and PC.636 are all really close to each other. Because the distances between samples were calculated using unweighted UniFrac, this means that those three samples have communities with very similar overall phylogenetic trees. And, looking back into our mapping file (Fasting_Map.txt), we can see that these three samples were all experimental samples (fasted mice microbiomes)!

How can we visually look for patterns in this graph? In the upper right corner of the KiNG viewer, you'll see a list of all the variables that were in your mapping file. The variable you select will determine the color scheme for the graph. Let's select the Treatment variable, and we'll select the "unscaled" version for now (see inset image). (The unscaled plot is squared off, while the scaled version would make the higher inertia axes longer). It looks like there is definitely a separation of the fasted mice microbiomes from the controls. Do you see an image like the one below on your system?

Notice that each data point consists of a central point, surrounded by a semi-transparent cloud. The clouds represent the variation in UniFrac results from all those rarefactions we did. This demonstrates that the separation of these samples holds up to subsampling, which is a nice thing to check. That means it isn't driven by one or two low abundance singlets.  If you drag the graph around with your mouse, it will rotate in 3D (adding a third principal coordinate) - give it a try! 

There are a bunch of other scripts you can use to analyze these distance matrices.  Let's try a few. 

Distance Statistics

Let's use the script dissimilarity_mtx_stats.py to calculate means and standard deviations for our unweighted unifrac distance matrices. Remember that we created 100 distance matrices! I'm going to set our output to a new folder that is closer to the root, so we can access it more easily. It's important to chose descriptive names, to you know later on which folders contain what kinds of data.

dissimilarity_mtx_stats.py -i jackknifed_beta_diversity/unweighted_unifrac/rare_dm/ -o unweighted_unifrac_stats/

The input (-i) directory is the directory containing the distance matrices - this directory is named rare_dm, and is located under the corresponding metric's folder.  I named our output directory "unweighted_unifrac_stats" so that, later on, I will know what the data are in that directory.  Notice that there are three files in there: means.txt, medians.txt, and stdevs.txt.  If you wanted to know the mean and s.d. of the distance between two samples, you could get those values from these matrices. In our particular experiment, however, we have two categories, and several samples in each category. What we'd really like to know is: are the samples in an individual category closer to each other than they are to samples outside the category?  To test this question, QIIME has a script called make_distance_boxplots.py - let's run it on our "means.txt" distance matrix!

make_distance_boxplots.py -m Fasting_Map.txt -o distance_boxplots -d unweighted_unifrac_stats/means.txt -f Treatment --save_raw_data

The --save_raw_data flag is nice; you may want to graph the data yourself.  The result is something that looks like this:

In this case, there are only two Treatment categories, so the "Control vs Fast" comparison is the same as the "All between Treatment" comparison. There are more details and examples on how you can compare distances in this tutorial on qiime.org.

Also, if you're really interested in distances between samples, and you want to compare different distance metrics to each other, you may be interested in using QIIME to do Procrustes analysis.

Okay, we're nearly done!

We're pretty much done here. You've now created all of the basic types of files needed for more in-depth analysis. The best thing for you to do now may be to explore all the scripts in this list, and see if there is anything else that takes your fancy.

In the remaining section, we'll re-do all the initial processing steps in this tutorial, but this time we'll use the workflow scripts so you can see how to use those wonderful shortcuts.

Next: Recap

Jeff Werner,
Mar 6, 2012, 9:16 AM