I have a bacterial count matrix rows indicating bacterium species and columns indicating samples. I have also several different groupings. For example of one of the grouping: each sample can belong to one the following groups: A,B,C,D. Let's say this grouping is factor1. From all of the groupings/factors I have made a design matrix: design = model.matrix(~ factor1 + factor2 ...)
I have additionally a grouping where each sample belongs to one of E,F,G. I would like to use Limma to find out statistically significant bacteria between E,F,G without having to do pairwise-comparisons (EF,EG,FG). That is, much like anova results are interpreted. For some reason I have found only Limma contrast matrix examples for pairwise-comparisons.
The dataset is such that samples have many different groupings / properties, for example some of the samples are biological replicates and then some of the samples have been sequenced differently than others. If I compare differences of sequencing properties I would like the comparison to take into account that also other properties like the replicate status might influence the samples.
The bacterial count table is from QIIME software. I have previously used earlier this (http://qiime.org/scripts/compare_categories.html) for statistical testing and there seems to be vegan mentioned. However, I was given an advice to try limma because it takes different groupings into account. Are using these compare categories methods preferred over limma?
This is why I asked you what the goal of your experiment is. It seems to me that you might be getting ahead of yourself. In other words, what method you use to analyze data is determined by two things. First and foremost, what is the hypothesis you are trying to test? And second, given the data in hand, what simplifying assumptions can you make, which would allow you to use a particular statistical method?
When I read your original post, this looked to me like it was sort of an ecological question that ape or vegan or ade4 are intended to answer. Something like 'Are the bacterial species found in this set of samples different from the species in that other set of samples?' This question can be answered at several different levels, or operational taxonomic units (roughly species, genus, family, order), which packages like I mention above are intended to exploit, whereas limma is not. In addition, the ecological diversity packages try to explicitly model the zero inflation you get with these sorts of data (where you get lots of zeros, and it's not exactly clear if a zero means the bacterium really wasn't there, or maybe it is there, but you didn't count at a high enough depth to pick it up).
So while you can hypothetically use limma to analyze count data of this sort (preferably after using voom to convert to cpm and model the mean-variance dependence), it's probably going to be a bit tricky. Smart people like Gordon and Aaron could probably wing it and make it work, but they know what they are doing.
You also mention that some of the samples have been sequenced differently. If this is true of all the groups you are comparing, then you could hypothetically fit a batch effect to control for those differences (and I would be surprised if you couldn't do that with the other packages I mentioned, but I have only used them sparingly). But if one set of samples was sequenced on a SOLiD (lol), and the rest on a HiSeq, then 'fixing' that is not possible, no matter what methods you might want to use.
I have count data from phyla (L2) to genus (L6) levels. I am planning to do analysis for L2 and L6 levels separately. I guess the analysis methods applicable to both levels are the same. The goal is to find phyla / genuses which have statistically different counts between various groupings while taking into account that samples have several groupings which could have an impact on the counts. So far I have managed to write this R Code: http://ariana.dy.fi/files/limma/. However, it sounds like Limma might not be suitable for this or at least it is super hard to use / understand.
If you are going to use
lmFit
, you seriously don't want to do it that way. You want to doSimply converting counts to CPM and taking logs isn't enough. You also need to estimate observation-level weights (which is what
voom
does) that you then pass tolmFit
, which fits a weighted linear model. Otherwise you are using an unweighted model, which isn't what you want to do with these data.Yes, voom is something I was planning to use. Hopefully this code does better:
y <- DGEList(counts=orig)
y <- calcNormFactors(y) # TMM
v <- voom(y)
fit = lmFit(v, design)