limma comparison for small sample size and how to get covariance matrix
Entering edit mode
jiachengd • 0
Last seen 7 days ago
United States


Thanks for your attention.

I am trying to use voomLmFit to compare 3 phenotypes to the same control while also blocking for subject_ID that are repeatedly measured in batch 1 and batch2.

Experimental design

# print meta data
> pb.colData
         condition subject_ID_unique subject_ID batch
01501_b2       0_1          01501_b2      01501    b2
04901_b2       0_0          04901_b2      04901    b2
21201_b2       1_1          21201_b2      21201    b2
31101_b1       1_0          31101_b1      31101    b1
31101_b2       1_0          31101_b2      31101    b2
44501_b1       0_0          44501_b1      44501    b2
45401_b2       0_0          45401_b2      45401    b2
51301_b1       0_0          51301_b1      51301    b1
51801_b1       1_1          51801_b1      51801    b1
51801_b2       1_1          51801_b2      51801    b2
51901_b1       0_1          51901_b1      51901    b1
51901_b2       0_1          51901_b2      51901    b2
52001_b2       0_0          52001_b2      52001    b2
53201_b1       0_1          53201_b1      53201    b1
53201_b2       0_1          53201_b2      53201    b2
57001_b2       0_0          57001_b2      57001    b2
61101_b1       0_0          61101_b1      61101    b1
62001_b1       1_0          62001_b1      62001    b1
65301_b2       0_0          65301_b2      65301    b2
71301_b2       0_0          71301_b2      71301    b2
71501_b2       0_0          71501_b2      71501    b2
73801_b2       0_0          73801_b2      73801    b2
77601_b1       1_0          77601_b1      77601    b1
77601_b2       1_0          77601_b2      77601    b2
81001_b2       0_0          81001_b2      81001    b2
87501_b1       1_1          87501_b1      87501    b1
87501_b2       1_1          87501_b2      87501    b2
89201_b1       0_0          89201_b1      89201    b2
A9501_b2       0_0          A9501_b2      A9501    b2

The above table is my meta data where condition has four unique elements, with 0_0 being control and others being diseases. subject_ID is the identifier for every sample. batch is a factor I wish to regress out. subject_ID_unique is subject_ID plus batch to make a unique identifier.

Design matrix and code for running comparison

# condition & batch are factors with set levels
group <- pb.colData$condition
subject_ID <- pb.colData$subject_ID
batch <- pb.colData$batch

design <- model.matrix(~0+group+batch)

# dpb is a DGEList
v <- voomLmFit(counts = dpb, 
                 design = design, 
                 block = subject_ID,
                 sample.weights = TRUE)

# example contrast
contrast <- makeContrasts(group1_1 - group0_0, levels=colnames(design))

fit <-, contrasts = contrast)  
fit <- eBayes(fit)

If the above code makes sense to you (please correct, if not), by comparing 1_1 with 0_0, there were over 900 differentially expressed genes. However, it concerned me because I only have 3 unique samples in 1_1 group and the signals seem driven by outliers.

# print n unique samples per condition
>   print(pb.colData %>% 
+           distinct(subject_ID, .keep_all = TRUE) %>% select(condition) %>% table())
0_0 0_1 1_0 1_1 
 14   3   3   3

Here is the PCA for all samples where you can see the major variance is driven by outliers. PCA plot

Question 1

How can I get a relatively robust result given the small sample size and outliers? I noticed eBayes has a robust argument. Can I use it? Is there any other methods worth trying?


I am also trying to get the covariance matrix (diagonal and off diagonal) for using as an input for mashr. I read from the limma tutorial section 13.2:

The covariance matrix of the estimated Bhat is Sigma^2 CT (XT Vg X) -1C

Sorry about the format. The formula is on page 62

Question 2

I understand X and C are design matrix and contrast matrix, but what is Vg? How to derive Vg?

Thank you very much for your time and help.

Jiacheng Ding

covariancematrix limma • 1.1k views
Entering edit mode
Last seen 3 hours ago
WEHI, Melbourne, Australia

limma has already downweighted the outlier samples, which is the purpose of sample.weights=TRUE in the voomLmFit() call. So limma has already conducted a robust analysis.

You can use robust=TRUE in the eBayes() call as well if you like. That may give an extra layer of protection, but it protects against outlier genes rather than outlier samples.

You don't explain how you constructed the PCA plot (it's not a limma plot) but it presumably doesn't take account of the limma precision weights or the batch correction.

The formula for Vg in this case is somewhat complicated because it involves both the blocking effect and the precision weights. It is equal to a block correlation matrix multiplied by diagonal matrices of 1/sqrt(weights).

Added later:

You could compute Vg like this (where g is in the row number of a particular gene):

subject_ID <- factor(subject_ID)
design.subject <- model.matrix(~0+subject_ID)
CorMatrix <- v$correlation * design.subject %*% t(design.subject)
diag(CorMatrix) <- 1
iswg <- 1/sqrt(v$EList$weights[g,])
Vg <- diag(iswg) %*% CorMatrix %*% diag(iswg)

Note that outputing Vg is not part of the documented user interface of limma, so I offer this code as is. It should work fine, but it's not very efficient and I have not had time to check it.

Entering edit mode

Thank you very much for your rapid response. Your answer makes a lot sense to me. I check the dimension reduction plot again using limma::plotMDS with the following code:

v <- voomLmFit(counts = dpb, 
                 design = design, 
                 block = subject_ID,
                 sample.weights = TRUE)

 v.test <- voom(dpb, design, block = subject_ID, correlation = v$correlation)

 limma::plotMDS(v.test, top = 1000, col=case_when(group == '1_1' ~ 'red',
                                                   group == '0_1' ~ 'orange',
                                                   group == '1_0' ~ 'cyan',
                                                   TRUE ~ 'green'))


It's indeed different from the original PCA plot I provided.


Thanks for your explanation on Vg, however I still don't understand how to get the block correlation. Could you please provide more details on instructing me to calculate it? Thank you very much for your time!

Jiacheng Ding

Entering edit mode

I have edited my answer above to add code for Vg.

Entering edit mode

Thank you very very much for your time, professor!


Entering edit mode

Your MDS plot is better than before, but you are still not following the recommendation for making for making such plots given in the limma and edgeR documentation. If dpb is matrix of counts, then we recommend that you use:

logCPM <- cpm(dpb, log=TRUE)

If dpb is a DGEList object, then you could use


directly, which will give the same result. We don't recommend that you use plotMDS() on the voom() output because voom() output is only intended to be used in functions that can use weights and plotMDS() doesn't use the weights.

By the way, you never need to run voom() after voomLmFit() because the output is already stored in the v$EList component of the voomLmFit() output. You could find that out by reading the voomLmFit help page.

Entering edit mode

Thanks a lot for your answer again!

I am trying to validate the solution you provided by comparing the sqrt(diag(CT (XT Vg X) -1 C)) with fit$stdev.unscaled. Based on limma tutorial, they are supposed to be equal. page 62

R Code for calculating first gene's stdev.unscaled

design <- model.matrix(~0+group+batch)
v <- voomLmFit(counts = dpb, 
                 design = design, 
                 block = subject_ID,
                 sample.weights = TRUE)

# this part is inconsistent with your answer, because there is no v$weights
v.test <- voom(dpb, design, block = subject_ID, correlation = v$correlation)

subject_ID <- factor(subject_ID)
design.subject <- model.matrix(~0+subject_ID)
CorMatrix <- v$correlation * design.subject %*% t(design.subject)
diag(CorMatrix) <- 1

# modified as v.test, instead of v
iswg <- 1/sqrt(v.test$weights[1,])
Vg <- diag(iswg) %*% CorMatrix %*% diag(iswg)

contMatrix <- makeContrasts(group1_1 - group0_0, levels=colnames(design))

designT <- t(design)
XtVgX <- designT %*% Vg %*% design
invXtVgX <- solve(XtVgX)

stdev.unscaled.calculated <- sqrt(diag(t(contMatrix) %*% invXtVgX %*% contMatrix))
>   stdev.unscaled.calculated
group1_1 - group0_0 

fit <-, contrasts = contMatrix)
fit <- eBayes(fit, trend=TRUE, robust=FALSE)

> fit$stdev.unscaled[1,]
[1] 0.3701154

My question

I think my way for getting 'weights' is wrong considering stdev.unscaled.calculated does not equal to fit$stdev.unscaled[1,]. I understand I can use fit$stdev.unscaled[1,] to back calculate weights with your solution, however, how to correctly get the weights for v?

Thank you very much!

Jiacheng Ding

Entering edit mode

The weights are contained in v$EList$weights. I previously mistyped it as v$weights but have now corrected the code in my answer above.

Here is an reproducible example of computing the covariance matrix:

> library(edgeR)
> SampleInfo
         Condition Subject Batch
01501_b2       0_1   01501    b2
04901_b2       0_0   04901    b2
21201_b2       1_1   21201    b2
31101_b1       1_0   31101    b1
31101_b2       1_0   31101    b2
44501_b1       0_0   44501    b2
45401_b2       0_0   45401    b2
51301_b1       0_0   51301    b1
51801_b1       1_1   51801    b1
51801_b2       1_1   51801    b2
51901_b1       0_1   51901    b1
51901_b2       0_1   51901    b2
52001_b2       0_0   52001    b2
53201_b1       0_1   53201    b1
53201_b2       0_1   53201    b2
57001_b2       0_0   57001    b2
61101_b1       0_0   61101    b1
62001_b1       1_0   62001    b1
65301_b2       0_0   65301    b2
71301_b2       0_0   71301    b2
71501_b2       0_0   71501    b2
73801_b2       0_0   73801    b2
77601_b1       1_0   77601    b1
77601_b2       1_0   77601    b2
81001_b2       0_0   81001    b2
87501_b1       1_1   87501    b1
87501_b2       1_1   87501    b2
89201_b1       0_0   89201    b2
A9501_b2       0_0   A9501    b2
> Condition <- factor(SampleInfo$Condition)
> Subject <- factor(SampleInfo$Subject)
> Batch <- factor(SampleInfo$Batch)
> nsamples <- nrow(SampleInfo)
> ngenes <- 1000
> # Simulate count matrix
> set.seed(31415926)
> log2Mu <- seq(from=1,to=14,len=ngenes)
> Mu <- matrix(2^log2Mu,ngenes,nsamples)
> u <- runif(nsamples,0.5,2)
> Mu <- t(t(Mu) * u)
> y <- matrix(rpois(length(Mu),lambda=Mu),ngenes,nsamples)
> colnames(y) <- row.names(SampleInfo)
> rownames(y) <- 1:ngenes
> design <- model.matrix(~Condition+Batch)
> fit <- voomLmFit(y,design,block=Subject)
First intra-block correlation  0.01433613
Final intra-block correlation  0.01433428
> g <- 500
> design.subject <- model.matrix(~0+Subject)
> CorMatrix <- fit$correlation * design.subject %*% t(design.subject)
> diag(CorMatrix) <- 1
> iswg <- 1/sqrt(fit$EList$weights[g,])
> Vg <- diag(iswg) %*% CorMatrix %*% diag(iswg)
> XVX <- t(design) %*% solve(Vg) %*% design
> sqrt(diag(solve(XVX)))
 (Intercept) Condition0_1 Condition1_0 Condition1_1      Batchb2 
  0.04713938   0.05386578   0.06492907   0.06056974   0.04607647 
> fit$stdev.unscaled[g,]
 (Intercept) Condition0_1 Condition1_0 Condition1_1      Batchb2 
  0.04713938   0.05386578   0.06492907   0.06056974   0.04607647

In the above example, the covariance matrix agrees exactly with fit$stdev.unscaled. Beware though that, if you use, then you will not be able to reproduce fit$stdev.unscaled exactly for the reason explained on the help page.

Entering edit mode

Thank you very much, professor! It worked for my data.

Thank you for your time. Jiacheng


Login before adding your answer.

Traffic: 635 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6