Exploring Data in QIIME
This is a second tutorial on QIIME, intended primarily for students who have already completed the overview tutorial and feel comfortable (and hopefully excited!) doing some of their own analysis. The idea is to give you a chance to try out as many QIIME features as you wish on a full-sized data set.
Goals
Work independently with QIIME to explore a data set
Find new QIIME functions you haven't used before and figure out how to use them effectively based on the documentation
Solve problems that arise along the way in a sequencing analysis
Find significant relationships in sequencing data
The Data Set
Here is a new data set for you to download. It comes from this paper on microbial communities in nine anaerobic digesters treating brewery wastewater. For each of the nine digesters, there is a one-year time series. The mapping file includes accompanying data on environmental conditions and digester performance that you may find useful.
I've already done the initial drudgery of processing the sequences. The following workflow steps are done:
Trimmed barcodes
Denoised
Picked OTUs (97% ID)
Assigned taxonomy (using an older version of what eventually became the Greengenes taxonomy)
Built OTU table
Removed chimeras from the OTU table (described in the paper)
Aligned to the Greengenes (GG) core set
Filtered alignment & applied Lane mask
Built a phylogenetic tree (FastTree)
Basically, this is equivalent to everything up to and including the pick_otus_through_otu_table.py workflow script.
Update: To be compatible with the latest QIIME file formats, I added a BIOM-format OTU table. I also left the old text-format OTU table in there, since it is easier to open in a spreadsheet.
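If you would rather poke at the text-format OTU table from a script than from a spreadsheet, something like the following may be a starting point. It is only a minimal sketch: it assumes a tab-delimited table with a "#OTU ID" header line and an optional trailing lineage column, and the sample IDs, counts, and lineages in the example are made up. Check the header of the real file before relying on it.

```python
import csv
import io

# Illustrative stand-in for a text-format OTU table. The sample IDs, counts,
# and lineages below are made up; the real file's header lines may differ.
EXAMPLE_TABLE = (
    "# Example OTU table (illustrative)\n"
    "#OTU ID\tD1.t01\tD1.t02\tD2.t01\tConsensus Lineage\n"
    "0\t12\t0\t3\tBacteria;Firmicutes\n"
    "1\t0\t7\t9\tBacteria;Bacteroidetes\n"
    "2\t1\t1\t0\tArchaea;Euryarchaeota\n"
)

# Assumed names for the trailing taxonomy column (an assumption, not a spec).
LINEAGE_HEADERS = {"consensus lineage", "taxonomy"}


def read_otu_table(handle):
    """Parse a tab-delimited OTU table into (sample_ids, counts, lineages)."""
    sample_ids = None
    has_lineage = False
    counts, lineages = {}, {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or not row[0].strip():
            continue  # skip blank lines
        if row[0].startswith("#OTU ID"):
            # Header line: sample IDs, optionally followed by a lineage column.
            has_lineage = row[-1].strip().lower() in LINEAGE_HEADERS
            sample_ids = row[1:-1] if has_lineage else row[1:]
            continue
        if row[0].startswith("#"):
            continue  # skip any other comment lines
        if sample_ids is None:
            raise ValueError("no '#OTU ID' header line found before the data")
        n = len(sample_ids)
        counts[row[0]] = [float(v) for v in row[1:1 + n]]
        lineages[row[0]] = row[1 + n] if has_lineage and len(row) > 1 + n else ""
    return sample_ids, counts, lineages


sample_ids, counts, lineages = read_otu_table(io.StringIO(EXAMPLE_TABLE))

# Total sequences per sample: a quick sanity check before any diversity analysis.
totals = [sum(counts[otu][i] for otu in counts) for i in range(len(sample_ids))]
for sample, total in zip(sample_ids, totals):
    print(sample, total)
```

To use it on the real table, open the file and pass the handle to read_otu_table in place of the io.StringIO object.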
Your Challenge
Find three interesting trends in these data! There are lots of places to look, and I'm not going to be judgemental about what you think is interesting. We reported a couple of things in that paper, but there are TONS of other interesting things hiding in those data.
Some Ideas to Get Started With
There are lots of possible things to look into. Are there any aspects of the community related to the conditions or performance of the reactors? How would you characterize the dynamics of the communities over time? How are different communities structured differently?
Definitely do some beta diversity analysis
Alpha diversity may be interesting too
Are any OTUs correlated with any of the data in the mapping file? You can test this with otu_category_significance.py! Be sure to note the difference between correlation and longitudinal_correlation; those are different tests!
Maybe try filtering the OTU table, then pruning the tree using the filtered table (as the -t option). You can then import the resulting smaller tree and smaller OTU table (along with the mapping file data) into Topiary Explorer. Trimming things down to just the major OTUs will help in making a tree that can actually be visualized.
Are you going to look at all the samples, or might you find more interesting time-series trends looking at just one digester at a time?
The script categorized_dist_scatterplot.py is new, and I haven't even tried it yet. Sounds like it might be useful! Likewise, make_distance_histogram.py sounds interesting.
The new compute_core_microbiome.py script may be interesting to play with, especially if you use the --valid_states option to look at one reactor at a time.
Spreadsheets are your friend
If you have lots of time and you're interested in finding new graphical/interactive software, check out GGobi or Orange. They help a lot with visualizing complex data sets.
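To make the "all samples versus one digester at a time" question concrete, here is a small sketch of correlating one OTU's relative abundance with one mapping-file variable, both pooled and per digester. It is not how otu_category_significance.py works internally; it is a plain Pearson correlation written out by hand, and the digester labels, abundances, and the "pH" values are invented for illustration.

```python
import math
from collections import defaultdict

# Made-up records: (digester, OTU relative abundance, environmental value).
# "pH" is a hypothetical mapping-file column used only for illustration.
RECORDS = [
    ("A", 0.10, 6.8), ("A", 0.20, 7.0), ("A", 0.30, 7.2), ("A", 0.40, 7.4),
    ("B", 0.40, 6.8), ("B", 0.30, 7.0), ("B", 0.20, 7.2), ("B", 0.10, 7.4),
]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / math.sqrt(var_x * var_y)


def correlation_by_group(records):
    """Compute Pearson r separately for each group label (here, each digester)."""
    groups = defaultdict(lambda: ([], []))
    for group, x, y in records:
        groups[group][0].append(x)
        groups[group][1].append(y)
    return {g: pearson_r(xs, ys) for g, (xs, ys) in groups.items()}


pooled_r = pearson_r([r[1] for r in RECORDS], [r[2] for r in RECORDS])
per_digester_r = correlation_by_group(RECORDS)

print("pooled:", round(pooled_r, 3))
for digester, r in sorted(per_digester_r.items()):
    print(digester, round(r, 3))
```

In this toy data the two digesters trend in opposite directions, so the pooled correlation is essentially zero even though each digester on its own shows a strong relationship. Real data will be noisier, and with thousands of OTUs you would also want a significance test and a correction for multiple comparisons before calling anything a trend.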
Enjoy! In the spirit of inquiry-based learning, I haven't tried half of those suggestions up there. Anyway, I'm sure most of the stuff you end up doing will be new and interesting. Discover!