Question: Limma Design Matrix for Paired Samples with Confounding Factors
yuabrahamliu wrote 8 months ago:

Hi all,

I have a data set with both paired samples and confounding factors. I want to use limma to find the differentially expressed genes, but I don't know how to construct the design matrix. Could anyone give some suggestions? The data look like this:

Sample_Name  Condition  Pair_Name  Source  Days
Sample1      Treated    1          Tissue  25
Sample2      Control    1          Tissue  30
Sample3      Treated    2          Blood   22
Sample4      Control    2          Tissue  31
Sample5      Treated    3          Tissue  32
Sample6      Control    3          Blood   30


The variable Condition (Treated or Control) is the main effect to be explored. Pair_Name reflects the pairing of the samples: the samples in pair 1 come from the same donor, the samples in pair 2 from the 2nd donor, and pair 3 from the 3rd. Source is the source of the sample, some from the tissue of interest and some from blood; it is a discrete variable and a confounding factor. The last variable, Days, is a continuous variable and another confounding factor.

What I want to do is select the significantly differentially expressed genes between the two conditions, Treated and Control, while accounting for the sample pairing and the confounding factors Source and Days. It seems very complicated and I don't know how to design the design matrix and contrast matrix properly. Could anyone give me some suggestions? Thank you so much!

written 8 months ago by yuabrahamliu • modified 8 months ago by Aaron Lun
Answer: Limma Design Matrix for Paired Samples with Confounding Factors
Aaron Lun (Cambridge, United Kingdom) wrote 8 months ago:

The simplest approach would be to just throw in everything (note the types):

# Pair_Name must be a factor,
# Days must be numeric.
design <- model.matrix(~Condition + Pair_Name + Source + Days)

... and set coef=2 in topTable() to test the condition effect. Of course, this assumes that all effects are additive on the log-scale, and that the effect of Days is linear with respect to log-expression.
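Putting this together, a minimal sketch of the full pipeline might look like the following (using the poster's `Normdata` and `pd` object names from the code later in this thread; the column names match the table above):

```r
library(limma)

# Make sure the covariate types are right before building the design:
pd$Pair_Name <- factor(pd$Pair_Name)  # pairing must be a factor
pd$Days <- as.numeric(pd$Days)        # days must be numeric

design <- model.matrix(~Condition + Pair_Name + Source + Days, data = pd)

fit <- lmFit(Normdata, design)  # Normdata: genes-by-samples log-expression
fit <- eBayes(fit)
topTable(fit, coef = 2)         # coef=2 tests the Condition effect
```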

You can weaken the first assumption by defining Group <- paste0(Condition, ".", Source) and using ~0 + Group + Pair_Name + Days instead in the design matrix. This accounts for any interactions (i.e., non-additivity) between the disease condition and the tissue of origin, which will probably be present in any real biological system. Each level of Group defines a condition/source combination, so you will need to use makeContrasts() to define comparisons of interest between specific combinations, followed by contrasts.fit(). If you don't care about blood, your contrasts would just be between normal and diseased tissue.
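A sketch of this group-based parametrization, again using the poster's `Normdata`/`pd` object names (the contrast shown compares Treated against Control within tissue; the level names are assumptions, so adjust them to match your data):

```r
library(limma)

# One level per condition/source combination:
pd$Group <- factor(paste0(pd$Condition, ".", pd$Source))
design <- model.matrix(~0 + Group + Pair_Name + Days, data = pd)

fit <- lmFit(Normdata, design)

# Compare specific combinations via an explicit contrast:
con <- makeContrasts(GroupTreated.Tissue - GroupControl.Tissue, levels = design)
fit2 <- eBayes(contrasts.fit(fit, con))
topTable(fit2)
```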

You can weaken the second assumption by defining DaysX <- splines::ns(Days, 3) and using DaysX instead of Days in the design matrix. This fits a spline with respect to the days, which is more flexible than a linear trend. However, it does require that you have more samples than shown in the table, otherwise you won't have enough residual degrees of freedom for variance estimation.
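As a small self-contained illustration of the spline idea (made-up data; all names here are hypothetical), note that the spline basis simply replaces the single Days column with 3 columns in the design matrix:

```r
set.seed(1)
pd <- data.frame(
  Condition = rep(c("Treated", "Control"), 10),
  Pair_Name = factor(rep(1:10, each = 2)),
  Days      = runif(20, min = 20, max = 40)
)

DaysX <- splines::ns(pd$Days, df = 3)  # natural cubic spline basis, 3 d.f.
design <- model.matrix(~Condition + Pair_Name + DaysX, data = pd)
ncol(design)  # 1 intercept + 1 condition + 9 pair + 3 spline columns = 14
```

Each extra spline column costs a residual degree of freedom, which is why more samples are needed than in the six-row table above.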

If nothing I said above makes any sense, read the limma user's guide.

written 8 months ago by Aaron Lun

Very helpful. Thank you Aaron!

written 8 months ago by yuabrahamliu

Hi Aaron, 

Thank you for your help again, but I met a problem when using the method on my data, regarding the pairing information. As shown before, my data are like this:

Sample_Name  Condition  Pair_Name
Sample1      Treated    1
Sample2      Control    1
Sample3      Treated    2
Sample4      Control    2
Sample5      Treated    3
Sample6      Control    3

Actually, I have more than 100 samples, i.e., more than 50 pairs; the other confounding factors are not shown here. My code for the analysis is like this if I only consider Condition and the pairing information:

design <- model.matrix(~Condition + Pair_Name, data = pd)
fit1 <- lmFit(Normdata, design)
fit1 <- eBayes(fit1)
difflimma <- topTable(fit1, coef = 2, n = nrow(fit1), sort.by = "logFC")

Here, I could only get about 40 genes with an adjusted p-value less than 0.05. But if I only consider Condition and use the following code,

design <- model.matrix(~Condition, data = pd)
fit1 <- lmFit(Normdata, design)
fit1 <- eBayes(fit1)
difflimma <- topTable(fit1, coef = 2, n = nrow(fit1), sort.by = "logFC")

I get more than 9000 genes with an adjusted p-value less than 0.05!

I really feel worried about this result. Will the pairing information influence the significant genes so greatly? Or is there something wrong with my code? Or are there other causes? Actually, if I also consider the other confounding factors Source and Days, as mentioned yesterday, combined with the pairing information, no genes have an adjusted p-value less than 0.05. Maybe you have much more experience: do you think this result is reasonable? Thank you so much, Aaron.

written 8 months ago by yuabrahamliu

Your second analysis (without blocking on pair) is probably very anticonservative. If the pair effect is strong, samples from the same pair are going to be highly correlated. If the design matrix does not know about the pairing, the model will not consider these correlations and will effectively overstate the residual d.f. by almost two-fold. This causes underestimation of the uncertainty of the variance estimate and lower-than-expected p-values. The fact that you have so many samples will amplify this effect, resulting in the observed differences in the number of DE genes between analyses.

So, I would say that the first analysis is quite likely to be much more correct than the second analysis. However, you are still missing the other potential explanatory variables, of which Source will probably be quite important. If you include these other variables and you no longer detect DE genes... well, that suggests that the DE genes previously detected were driven by the other variables and not by the condition of interest. (Especially given that Source and Condition do not seem to be balanced.) Perhaps there aren't many (or any) differences between conditions in your system - if the data is trying to tell you something, it's worth listening with an open mind rather than trying to coerce it into giving you DE genes.

written 8 months ago by Aaron Lun

I see. Really learned something. Thank you so much Aaron!

written 8 months ago by yuabrahamliu

Hi Aaron,

Yes, if I consider the pairing information and the other confounding factors, the effect of Condition on gene expression should be very limited, so I wanted to show the contribution of each factor to the variance using ANOVA:

fit <- aov(logExp ~ Condition + Source + Days + Pair_Name, data = Normdata)
summaryfit <- summary(fit)

However, what made me very confused was that, according to the F-statistic of each variable (Condition, Source, Days, and Pair_Name), Condition had the absolutely dominant effect on the variance, while that of the other factors was very small. (The F-statistic averaged across all genes was greater than 8 for Condition, while for each of the other factors it was less than 1.) Do you think this contradicts the effect of the other factors on DE gene selection shown before? I really feel confused. Sorry.

written 8 months ago by yuabrahamliu

aov does type I sequential ANOVA, where it tests factors in order (i.e., Condition alone, then Condition + Source, and so on). So the reported statistics for Condition don't even consider the effect of the other factors, which was your entire problem in the first place. You can see this pretty clearly:

set.seed(0)
g <- gl(2, 100) # true groups
b <- factor(c(rep(LETTERS[1:2], c(90, 10)), rep(LETTERS[1:2], c(10, 90)))) # blocking
y <- rnorm(200) + as.integer(b) # making up data - batch effect only.
summary(aov(y ~ g + b)) # wow, 'g' is significant!
summary(aov(y ~ b + g)) # but it actually isn't when you consider 'b'.

You'll get the desired type III ANOVA if you use lm instead:

summary(lm(y ~ g + b))
summary(lm(y ~ b + g)) # same p-values.
written 8 months ago by Aaron Lun

Learned. Thank you so much Aaron!

written 8 months ago by yuabrahamliu

Hi Aaron, 

I noted a strange thing when fitting my data; I don't know if it has a large influence on my result. The design matrix I used is:

design <- model.matrix(~ Condition + Source + Pair_Name, data = pd)

fit1 <- lmFit(Normdata, design)

However, after this step, I always get a message:

Coefficients not estimable: Pair_Name223
Warning message:
Partial NA coefficients for 45101 probe(s)

I don't know the reason for this warning or its influence on the result. My design matrix is like this:

(Intercept)  ConditionControls  SourceTissue  Pair_Name29  Pair_Name47  Pair_Name223
          1                  0             1            0            1             0
          1                  1             1            0            1             0
          1                  0             1            1            0             0
          1                  1             1            1            0             0
          1                  0             0            0            0             1
          1                  1             0            0            0             1

Actually I have many more rows and many more Pair_Name columns. Each Pair_Name column has a column sum of 2 because 2 of all the samples belong to that specific pair. In this design matrix, the last column is Pair_Name223; its coefficient cannot be estimated, and all of its values in the fit1$coefficients table are NA. If I change the design matrix to,

design <- model.matrix(~Condition + Pair_Name + Source, data = pd)

and let Source become the last column of the design matrix, limma tells me that the coefficient of Source cannot be estimated this time.

I don't know the reason for this. Will it greatly influence the result?

Thank you so much!

written 8 months ago by yuabrahamliu

This is a standard case of linear dependencies between coefficients. You can see how the intercept column is equal to the sum of the Pair_Name223 and SourceTissue columns. This means that there are infinitely many solutions to the linear system that minimize the sum of squares: for a given intercept value, you could get the same sum of squares by subtracting "x" from the intercept and adding "x" to Pair_Name223 and SourceTissue. There is obviously no unique solution, so one coefficient is discarded by the function to achieve identifiability.

In your example above, the pairs are nested within Source, so you can't have both of them in the same model. The immediate solution to the problem would be to simply drop Source from the model.matrix call. Now, the function raises a warning but the issue is symptomatic of a deeper misunderstanding of your experimental design. Unfortunately, my ability to give advice is limited here, as the design matrix in your comment above differs from the experimental set-up in your original post (where the pairing is not nested within the source). In fact, I don't even see how you managed to construct that design matrix from the model.matrix call, I can only assume that you have more samples than the six rows that are shown.
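The nesting problem can be reproduced with a tiny base-R example mirroring the design matrix shown above (made-up data; because the pairs are nested within Source, one column is linearly dependent on the others):

```r
pd <- data.frame(
  Condition = rep(c("Treated", "Control"), 3),
  Pair_Name = factor(rep(c("29", "47", "223"), each = 2),
                     levels = c("29", "47", "223")),
  Source    = rep(c("Tissue", "Tissue", "Blood"), each = 2)
)

X <- model.matrix(~Condition + Source + Pair_Name, data = pd)
ncol(X)     # 5 columns ...
qr(X)$rank  # ... but rank 4: SourceTissue equals (Intercept) minus Pair_Name223
```

lmFit() detects this rank deficiency, reports the offending coefficient as not estimable, and returns NA for it; dropping Source (or the pairing) restores a full-rank design.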

written 8 months ago by Aaron Lun

I see. Thank you so much Aaron!

written 8 months ago by yuabrahamliu
Powered by Biostar version 16.09