Limma design with continuous and categorical covariates
Entering edit mode
dmcleanuk ▴ 30
Last seen 7.8 years ago

Dear all,

I'm performing a differential gene expression analysis using limma.

My study design is simple in that I only want to compare control and disease within three different groups, however, I want to control for 3 categorical covariates - site, sampling location and gender, and one continuous covariate - age. The contrasts of interest are:

CP_1 to CP_2: DEGs between control CP (CP_1) and diseased CP (CP_2)

C_1 to C_2: DEGs between control C (C_1) and diseased C (C_2)

N_1 to N_2: DEGs between control N (N_1) and diseased N (N_2)

Thus, '_1' represents a control, and '_2' represents a diseased case. I am not interested in genes associated with gender/age or any of the covariates, I just want to identify the disease effect whilst controlling for the covariates. 

My targets file is follows:

> targets[1:10,]
   Sample_ID Group     Site Sampling_location Gender Age
1    emiasb1  CP_1      leg               S_2 Female  54
2    emiasb2  CP_1      leg               S_1 Female  74
3    emiasb3  CP_1      leg               S_1 Female  39
4    emiasb4  CP_1      leg               S_1 Female  57
5    emiasb5  CP_1      leg               S_1   Male  32
6    emiasb6  CP_1      leg               S_1   Male  32
7    emiasb7  CP_1 buttocks               S_3   Male  63
8    emiasb8  CP_1      leg               S_2   Male  46
9    emiasb9  CP_1      leg               S_3   Male  72
10  emiasb10  CP_1      leg               S_2   Male  58

# sample numbers

> table(targets$Group)

 C_1  C_2 CP_1 CP_2  N_1  N_2
136   57   87   27   87   30 

f <- factor(targets$Group)

# cell means model with sampling location, age, site, and gender as blocking factors
design <- model.matrix(~0 + f + Sampling_location + as.numeric(Age) + Site + Gender, targets)


> head(design)
  fC_1 fC_2 fCP_1 fCP_2 fN_1 fN_2 Sampling_locationS_2 Sampling_locationS_3 Age Sitebuttocks Siteleg Sitethigh GenderMale
1    0    0     1     0    0    0                    1                    0  54            0       1         0          0
2    0    0     1     0    0    0                    0                    0  74            0       1         0          0
3    0    0     1     0    0    0                    0                    0  39            0       1         0          0
4    0    0     1     0    0    0                    0                    0  57            0       1         0          0
5    0    0     1     0    0    0                    0                    0  32            0       1         0          1
6    0    0     1     0    0    0                    0                    0  32            0       1         0          1

#fit model
fit <- lmFit(affydat, design)
# define contrast matrix
cont.matrix <- makeContrasts(P1="fCP_1-fCP_2",P2 = "fN_1-fN_2", P3 = "fC_1-fC_2",  levels=design)

> cont.matrix

Levels                 P1 P2 P3
  fC_1                  0  0  1
  fC_2                  0  0 -1
  fCP_1                 1  0  0
  fCP_2                -1  0  0
  fN_1                  0  1  0
  fN_2                  0 -1  0
  Sampling_locationS_2  0  0  0
  Sampling_locationS_3  0  0  0
  Age                   0  0  0
  Sitebuttocks          0  0  0
  Siteleg               0  0  0
  Sitethigh             0  0  0
  GenderMale            0  0  0

  fit2  <-, cont.matrix)
  fit3  <- eBayes(fit2, trend = TRUE)
  # the results for CP_1-CP_2 is in
  res <- topTable(fit3,coef=1,n = Inf, adjust="fdr", p = 0.05)

My results appear to be consistent with published studies however, I would like to confirm that my design is correct, and that I am appropriately controlling for the confounding variables before taking this any further.




limma batch effect covariates microarray • 8.9k views
Entering edit mode

The design and correction looks fine to me except I don't think eBayes(fit2, trend = TRUE ) is necessary? Why did you add trend=TRUE? Perhaps Gordon or some one similar can advise here also.

Entering edit mode

The trend=TRUE argument will model any mean-variance trend in the data. This is required in cases where you have a mean-variance trend (obviously), otherwise your estimates of the prior d.f. and subsequent EB shrinkage would be confounded by systematic mean-dependent differences from a common prior variance. You can check whether you have a mean-dependent trend with plotSA - read the limma user's guide for more details.

Entering edit mode


I'm trying to do a differential methylation analysis with the Limma package and was wondering if you had any suggestions on how to handle this trend argument with this type of data? By applying plotSA empirically on my data, I don't have any particular trend on my variance but I don't know if this type of analysis is adapted to methylation data? In particular, when I try trend = TRUE or not in my eBayes function I get quite a different number of differentially methylated CpGs without really understanding the subtlety of what this represents from a biological point of view?

Thanks in advance for any suggestion,


Entering edit mode
Aaron Lun ★ 28k
Last seen 9 hours ago
The city by the bay

Looks fine to me, though I would suggest examining the distribution of age/sex/site/etc. on a MDS plot to see if you really need to block on them. If a factor doesn't contribute a lot to the variance or differences between groups, it might be better to just ignore it. This will simplify your analysis and avoid potential issues, e.g., when your uninteresting factors are not orthogonal to the disease response.

Entering edit mode

Great thanks, could you elaborate a little more regarding your statement "void potential issues, e.g., when your uninteresting factors are not orthogonal to the disease response." ?



Entering edit mode

For example, if most of the diseased samples are male, and most of the control samples are female, it would be more difficult to distinguish the disease effect from the sex effect. This is because the corresponding terms in the design matrix would be almost redundant, such that dropping one coefficient by itself wouldn't have much effect (as the other coefficient is still there to provide a good fit in the null model). Similar reasoning applies to the other factors.

Entering edit mode

Thanks again. Am I correct to think that the disease effect is calculated within each block separately, and if the blocks are unbalanced then this would reduce the power to identify the  disease effect? Also what about if the covariates themselves are unbalanced, for instance I have a significant dependence of body sites across gender:

table(targets$Site, targets$Gender)

           Female Male
  back          8   20
  buttocks      8   27
  leg          98  160
  thigh        64   36

Is this likely to be problematic for extracting the disease effect?

Entering edit mode

The disease effect is calculated as an "average" (of sorts) across all blocking factors. If you wanted the disease effect to be calculated separately within each level of a blocking factor, you'd need a model with interaction terms rather than just purely additive terms. (Probably not worth doing this, at least not for all of the blocking factors.)

Anyway, if the disease status is partially or fully confounded with the blocking factors, then you would lose power to detect DE due to disease. However, there's nothing much you can do about it at this point - it's more of an issue that should have been dealt with during experimental design. Dependencies between the uninteresting covariates are irrelevant to inferences about the disease effect, as long as the design matrix is still full rank.


Login before adding your answer.

Traffic: 712 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6