How to use Limma to analyze differences among group items without doing pairwise comparisons
5
0
@samipietila-10617
Last seen 7.9 years ago

I have a bacterial count matrix with rows indicating bacterial species and columns indicating samples. I also have several different groupings. For example, in one of the groupings each sample can belong to one of the following groups: A, B, C, D. Let's say this grouping is factor1. From all of the groupings/factors I have made a design matrix: design = model.matrix(~ factor1 + factor2 ...)

I additionally have a grouping where each sample belongs to one of E, F, G. I would like to use limma to find bacteria that differ significantly between E, F and G without having to do pairwise comparisons (EF, EG, FG), that is, much like how ANOVA results are interpreted. For some reason I have only found limma contrast matrix examples for pairwise comparisons.

 

limma • 6.0k views
ADD COMMENT
1
Aaron Lun ★ 28k
@alun
Last seen 11 hours ago
The city by the bay

Let's say we have something like this:

factor1 <- factor(rep(LETTERS[1:4], each=3)) # A-D
factor2 <- factor(rep(LETTERS[5:7], 4)) # E-G
design <- model.matrix(~factor1 + factor2)

If you look at colnames(design), you get:

[1] "(Intercept)" "factor1B"    "factor1C"    "factor1D"    "factor2F"   
[6] "factor2G"   

The intercept represents the average expression in the A/E "baseline" combination, whereas each of the other terms represents the average log-fold change of the corresponding level over the baseline. So, if you want to identify any DE between E, F and G, your null hypothesis would be that factor2F = factor2G = 0. This can be achieved by dropping both terms simultaneously:

topTable(fit, coef=5:6) # after eBayes()

... which will do an ANOVA-like comparison as requested. With contrast matrices, the same applies: each column of the matrix represents a pairwise null hypothesis (say, X=Y or X=Z in a design with three groups X, Y and Z), but if you drop multiple columns at once in topTable, this defines a single null hypothesis (X=Y=Z) that corresponds to what an ANOVA does.
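
For completeness, a minimal sketch of the full sequence of calls (not from the original answer; it assumes y is a matrix of log-expression values or the output of voom, and uses the coefficient names from the design above):

library(limma)

fit <- lmFit(y, design)   # one linear model per row (bacterium/gene)
fit <- eBayes(fit)        # moderated statistics

# Dropping both factor2 coefficients at once tests factor2F = factor2G = 0,
# i.e. "no difference anywhere among E, F and G" (an ANOVA-like F-test).
topTable(fit, coef=c("factor2F", "factor2G"))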

ADD COMMENT
0
@james-w-macdonald-5106
Last seen 2 hours ago
United States

What's the goal here? To detect the presence or absence of a species or genus in a sample? If so, you might consider using something like ade4 or vegan rather than limma. See here, particularly under the 'Ordination' header.
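
(For context, and purely as an illustration of what that kind of analysis looks like rather than anything from this thread: ordination and permutation testing in vegan might be sketched roughly as below, where otu is assumed to be a samples-by-taxa count matrix and metadata holds the groupings.)

library(vegan)

ord <- metaMDS(otu)                       # NMDS ordination on Bray-Curtis dissimilarities
adonis2(otu ~ factor1, data=metadata)     # PERMANOVA test for differences between groups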
 

ADD COMMENT
0

The dataset is such that samples have many different groupings/properties; for example, some of the samples are biological replicates, and some of the samples have been sequenced differently than others. If I compare differences in sequencing properties, I would like the comparison to take into account that other properties, such as replicate status, might also influence the samples.

The bacterial count table is from the QIIME software. I have previously used this (http://qiime.org/scripts/compare_categories.html) for statistical testing, and vegan seems to be mentioned there. However, I was advised to try limma because it takes different groupings into account. Is using these compare_categories methods preferred over limma?

ADD REPLY
1

This is why I asked you what the goal of your experiment is. It seems to me that you might be getting ahead of yourself. In other words, the method you use to analyze the data is determined by two things. First and foremost, what is the hypothesis you are trying to test? And second, given the data in hand, what simplifying assumptions can you make that would allow you to use a particular statistical method?

When I read your original post, this looked to me like a sort of ecological question that ape or vegan or ade4 are intended to answer, something like 'Are the bacterial species found in this set of samples different from the species in that other set of samples?' This question can be answered at several different levels, or operational taxonomic units (roughly species, genus, family, order), which packages like those I mention above are intended to exploit, whereas limma is not. In addition, the ecological diversity packages try to explicitly model the zero inflation you get with these sorts of data (where you get lots of zeros, and it's not exactly clear whether a zero means the bacterium really wasn't there, or whether it is there but you didn't count at a high enough depth to pick it up).

So while you can hypothetically use limma to analyze count data of this sort (preferably after using voom to convert to cpm and model the mean-variance dependence), it's probably going to be a bit tricky. Smart people like Gordon and Aaron could probably wing it and make it work, but they know what they are doing.

You also mention that some of the samples have been sequenced differently. If this is true of all the groups you are comparing, then you could hypothetically fit a batch effect to control for those differences (and I would be surprised if you couldn't do that with the other packages I mentioned, but I have only used them sparingly). But if one set of samples was sequenced on a SOLiD (lol), and the rest on a HiSeq, then 'fixing' that is not possible, no matter what methods you might want to use.
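
(As a rough illustration of the batch-effect idea, with hypothetical variable names not taken from this thread: if the sequencing difference cuts across all the groups being compared, it can simply be added as another term in the design so that the group comparisons are adjusted for it.)

# 'seq_batch' records how each sample was sequenced; 'group' is the biological grouping.
seq_batch <- factor(c("run1", "run1", "run1", "run2", "run2", "run2"))
group     <- factor(c("E", "F", "G", "E", "F", "G"))

# The batch term absorbs systematic differences between sequencing runs,
# and the group coefficients are then estimated after adjusting for them.
design <- model.matrix(~ seq_batch + group)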
 

ADD REPLY
0

I have count data from the phylum (L2) to genus (L6) levels. I am planning to do the analysis for the L2 and L6 levels separately; I guess the analysis methods applicable to both levels are the same. The goal is to find phyla/genera which have statistically different counts between various groupings, while taking into account that the samples have several groupings which could have an impact on the counts. So far I have managed to write this R code: http://ariana.dy.fi/files/limma/. However, it sounds like limma might not be suitable for this, or at least it is super hard to use/understand.

ADD REPLY
0

If you are going to use lmFit, you seriously don't want to do it that way. You want to do

v <- voom(y)
fit <- lmFit(v, design)

Simply converting counts to CPM and taking logs isn't enough. You also need to estimate observation-level weights (which is what voom does) that you then pass to lmFit, which fits a weighted linear model. Otherwise you are using an unweighted model, which isn't what you want to do with these data.

ADD REPLY
0

Yes, voom is something I was planning to use. Hopefully this code does better:

  y <- DGEList(counts=orig)
  y <- calcNormFactors(y)   # TMM normalization
  v <- voom(y, design)      # estimate observation-level weights using the design
  fit <- lmFit(v, design)

ADD REPLY
0
@gordon-smyth
Last seen 2 hours ago
WEHI, Melbourne, Australia

The reason that you've only seen pairwise comparisons in limma is that limma can construct the ANOVA from the pairwise comparisons. Suppose that you have defined contrasts for the pairwise comparisons EF, EG, FG. Then you can use:

topTable(fit, coef=c("EF","EG","FG"))

and this will output the ANOVA results you want. In fact, any of the following will give exactly the same ANOVA results:

topTable(fit, coef=c("EF","EG","FG"))
topTable(fit, coef=c("EF","EG"))
topTable(fit, coef=c("EG","FG"))
topTable(fit, coef=c("EF","FG"))
ADD COMMENT
0

Specifying all pairwise comparisons sounds good. However, it seems I have a problem with contrasts.fit complaining that the contrasts do not match the coefficients. Should I build the contrast matrix or the design matrix in some different way?

> efs
      GroupA GroupB
X1001      A      E
X1002      B      F
X1003      C      G
X1004      D      G
X1005      D      G

>design = model.matrix(~ ., data=efs) 

>fit = lmFit(data, design)
>contrast.matrix = makeContrasts(contrasts=c("GroupBE-GroupBF","GroupBE-GroupBG","GroupBF-GroupBG"), levels=c("GroupBE","GroupBF","GroupBG"))
>contrast_fit = contrasts.fit(fit, contrast.matrix)
Error in contrasts.fit(fit, contrast.matrix) : 
  Number of rows of contrast matrix must match number of coefficients
In addition: Warning messages:
1: In rn != cn :
  longer object length is not a multiple of shorter object length
2: In contrasts.fit(fit, contrast.matrix) :
  row names of contrasts don't match col names of coefficients
 
 
ADD REPLY
0

If efs is your entire experimental setup, then your linear model will not have a single solution. This is because all A samples are also E samples, and all B samples are also F samples (and vice versa). There is no way to distinguish the effect of B versus A from that of F versus E. Similarly, there is no way to disentangle the effect of C/D versus A from that of G versus E. Reconsider how you want to analyze your experiment, as it's currently impossible to do the desired comparisons.

As for the makeContrasts problem, read the limma user's guide and you'll see that levels is usually set to design. Note that this modification will still fail with your code above, because there is no GroupBE column in your design matrix. You can only work with the named columns that exist in the design matrix. (In any case, fix your design matrix first.)

ADD REPLY
0

It seems my design matrix was (is?) poorly chosen. The design matrix was arbitrarily created just to have a small, simple design for testing the R code. I updated the design matrix data to the following:

> efs
      GroupA GroupB
X1001      A      E
X1002      A      F
X1003      A      G
X1004      B      E
X1005      B      F
X1006      B      G
X1007      C      E
X1008      C      F
X1009      C      G
X1010      D      E
X1011      D      F
X1012      D      G

I also updated the creation of the contrast matrix so that it takes levels from design, as seen on page 36 of the limma manual. However, makeContrasts now seems to give a warning: "Renaming (Intercept) to Intercept".

>contrast.matrix = makeContrasts(contrasts=c("GroupBE-GroupBF","GroupBE-GroupBG","GroupBF-GroupBG"), levels=design)

There also seems to be a small change in the contrasts.fit error message.

>   contrast_fit = contrasts.fit(fit, contrast.matrix)
Error in contrasts.fit(fit, contrast.matrix) : 
  trying to take contrast of non-estimable coefficient
In addition: Warning message:
In contrasts.fit(fit, contrast.matrix) :
  row names of contrasts don't match col names of coefficients
ADD REPLY
0

There is no GroupBE coefficient in your design matrix, hence the complaint. There really isn't any need to set up a contrast matrix here, as the GroupBF and GroupBG coefficients directly represent the effect of F and G over E, respectively; so you can just drop them directly as I've described in my original answer. Nonetheless, if you want to do it with a contrast matrix, you would do something like the code below:

contrast.matrix <- makeContrasts(GroupBF, GroupBG, levels=design)

Don't worry about the warning message regarding the intercept renaming.
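
(Spelling out both routes as a short sketch, assuming v is the voom output and design is the matrix above with columns GroupBF and GroupBG:)

fit <- eBayes(lmFit(v, design))

# Route 1: drop the coefficients directly (no contrast matrix needed).
# Null hypothesis: GroupBF = GroupBG = 0, i.e. no differences among E, F and G.
topTable(fit, coef=c("GroupBF", "GroupBG"))

# Route 2: the equivalent test via a contrast matrix.
contrast.matrix <- makeContrasts(GroupBF, GroupBG, levels=design)
fit2 <- eBayes(contrasts.fit(lmFit(v, design), contrast.matrix))
topTable(fit2, coef=c("GroupBF", "GroupBG"))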

ADD REPLY
0

I am not sure how to see the coefficients from the design matrix; coef(design) gives an error. However, when looking at the design matrix I can see the following columns: (Intercept) GroupAA GroupAB GroupAC GroupAD GroupBE GroupBF GroupBG, and GroupBE does seem to be listed there.

Basically, efs is produced from a QIIME mapping file (dataset-specific sample groupings), and I am looking for a general rule for creating a contrast matrix derived from the efs information. So, if I have GroupBE, GroupBF, GroupBG, I can make contrasts by dropping the first one. I am wondering whether it works if I always create contrasts simply by dropping the first level of the group.

 

ADD REPLY
0

I don't see how you managed to get a GroupBE column in your design matrix. Running the code below:

design = model.matrix(~ ., data=efs)

... when efs is set such that:

      GroupA GroupB
X1001      A      E
X1002      A      F
X1003      A      G
X1004      B      E
X1005      B      F
X1006      B      G
X1007      C      E
X1008      C      F
X1009      C      G
X1010      D      E
X1011      D      F
X1012      D      G

... gives me colnames(design) of:

[1] "(Intercept)" "GroupAB"     "GroupAC"     "GroupAD"     "GroupBF"    
[6] "GroupBG"    

In fact, it shouldn't be possible to get a GroupBE coefficient, as then your design matrix wouldn't be of full rank. There'd be no way to estimate such a coefficient, even if you did manage to cram it into the matrix.
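
(As a quick sanity check, an illustration not taken from the original post: limma provides is.fullrank() and nonEstimable() for exactly this situation.)

is.fullrank(design)    # TRUE for a usable design; FALSE if columns are linearly dependent
nonEstimable(design)   # names the coefficients that cannot be estimated (NULL if none)

A design containing GroupBE, GroupBF and GroupBG together with an intercept would be rank-deficient, because those three columns would sum to the intercept column.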

ADD REPLY
0

Interesting, colnames gives me different results. My code and data files can be found at http://ariana.dy.fi/files/limma/. Perhaps there is something wrong in my code.

> colnames(design)
[1] "(Intercept)" "GroupAA"     "GroupAB"     "GroupAC"     "GroupAD"     "GroupBE"     "GroupBF"     "GroupBG" 
ADD REPLY
0

If you look at levels(efs[,2]), you'll find that you have an empty string as your first level. You need to change the as.factor in your loop to factor in order to remove unused levels.
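
(A tiny constructed example of the difference, not the poster's actual data: as.factor() returns an existing factor unchanged, while factor() rebuilds it and drops levels that no samples use.)

x <- factor(c("", "E", "F", "G"))   # an empty string has crept in as a level
x <- x[-1]                          # the offending entry is removed...

levels(as.factor(x))   # ""  "E" "F" "G"  -- the unused "" level is still there
levels(factor(x))      # "E" "F" "G"      -- refactoring drops the unused level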

ADD REPLY
0

True, as.factor seems to keep the empty level; I'm not sure why. Anyway, factor seems to work as wanted.

I used the same list ("GroupBF", "GroupBG") for topTable as for contrasts.fit and got results from topTable. I am not sure if it is a good idea to paste such big code blocks here, but below is the code with which I am hoping to produce limma group comparisons for QIIME data.

efsGrLevels = function(efs, groupIndex) {
  return (paste(colnames(efs)[groupIndex], levels(droplevels(efs[, groupIndex])), sep=""))
}

  library(edgeR)
  library(limma)
  
  metadata = read.table("test_metadata.csv", comment.char= "", header=T, sep="\t", quote="")
  metadata = metadata[-c(1),]
  sampleids = paste("X",as.character(metadata[,1]), sep ="")
  rownames(metadata) = sampleids
  
  efs = metadata[,2:(ncol(metadata)-1), drop=F] # Drop sampleID and description columns

  for (i in 1:ncol(efs)) {
    efs[,i] = factor(efs[,i])
  }
  
  # take all groups from metadata as coefficients 
  design = model.matrix(~ ., data=efs) 

  orig = read.table("test_otu_table_L6.txt", skip=1, comment.char= "", header=T, row.names=1, sep="\t", quote="")
  orig = orig[,sampleids] # sort columns by metadata

  # Normalization
  #data <- cpm(v, normalized.lib.sizes=T)  
  y <- DGEList(counts=orig)
  y <- calcNormFactors(y) # TMM
  v <- voom(y)
  fit = lmFit(v, design)
  
  # TODO: replace fixed index 2 with a parameter
  efsGrLevelsVar = efsGrLevels(efs,2)
  contrast.matrix = makeContrasts(contrasts=efsGrLevelsVar[-1], levels=design) # drop the baseline level
  
  contrast_fit = contrasts.fit(fit, contrast.matrix)
  ebayes_fit = eBayes(contrast_fit)
  
  topTable(ebayes_fit, coef=efsGrLevelsVar[-1])
ADD REPLY
0

Does this method also apply in cases where GroupB has fewer levels, for example only E and F? In the code I am assuming that, in this case, the contrast matrix can be formed by giving only GroupBF.

contrast.matrix <- makeContrasts(GroupBF, levels=design)

 

ADD REPLY
0

Yes, that should be okay.

ADD REPLY
0
@samipietila-10617
Last seen 7.9 years ago

Can the method be used to test a subset of levels? That is, if GroupB has levels E, F, G, H, can I test whether a subset of those levels show differences, like E vs F vs G? Or F vs G?

ADD COMMENT
0

It's not clear what you mean. Please explain what you want to do in a bit more detail.

ADD REPLY
0

I have a dataset with 1..N samples, and each sample has metadata attached. Each sample has been processed with one of the DNA isolation methods A, B, C, D (GroupA levels). In addition, each sample has been sequenced from one of the regions E, F, G, H (GroupB levels). I have built the model.matrix so that it contains this metadata for each sample. Now the researchers are interested in knowing whether, for example, sequencing regions F and G differ. My model has all regions E, F, G, H, and with the earlier help I am able to test whether there is a difference when comparing across all of E, F, G, H. However, now that the researchers are interested in comparing only a subset of GroupB, I am wondering whether I can do this while keeping a model with all of E, F, G, H, or whether I should build a new model containing only the samples where GroupB is F or G.

ADD REPLY
0

You can't do this with the design matrix constructed in your comment above (C: How to use Limma to analyze differences among group items without doing pairwise comparisons). This is because you average across isolation methods to obtain enough residual d.f. for dispersion estimation. This means that you can't separate out the region effect in each isolation method, because it's been averaged over. If you want to test between regions for isolation method B, you'll have to fit a new model - probably a one-way layout using only the B samples, with each region defined as a separate group - and use the NB dispersion estimates from your first model. This is a bit of a hack because the dispersion estimates from the first model don't account for the uncertainty in the second, but there's no other way to do it if you don't have replicates for each of the BF, BG, etc. groups.

ADD REPLY
0
@samipietila-10617
Last seen 7.9 years ago

 

 

ADD COMMENT
0

Please post additional information as a comment or modification to your post, rather than as a separate answer.

ADD REPLY
