although similar posts, I would need a further understanding of the “Multi-level Experiments” setting with limma. My experiment design includes paired samples and two factors (Condition and Tissue) similarly to the example of the section 9.7 of the limma User Guide (Last revised 16 October 2016) but with the Condition variable that is unbalanced (25% High and 75% Low; see below).
FileName Subject Condition Tissue
File01 1 High A
File02 1 High B
File03 2 High A
File04 2 High B
File05 3 High A
File06 3 High B
File07 4 Low A
File08 4 Low B
File09 5 Low A
File10 5 Low B
…
File23 12 Low A
File24 12 Low B
Basically, I followed the example code of the User Guide and it seems to work properly. However, I didn’t understand exactly how the duplicateCorrelation function may take into account for a paired design with two factors. Please, can you give me, in short, the correct interpretation of the duplicateCorrelation function and how it works?
Moreover, I would enquire whether there is some trick to use the variable “Condition” as a continuous variable in place of the dichotomous (High/Low) variable.
Presumably, your design matrix only contains the factors Condition and Tissue. By itself, this would fail to account for the fact that your samples are paired by subject. The subsequent linear model would exhibit correlations in the residuals of paired samples, due to the presence of subject-specific effects. For example, if something happened to subject 1 that resulted in lower expression of a particular gene, the residuals would be negative for both File01 and File02. These correlations are bad because they interfere with statistical inference from the linear model. In particular, limma will think that there is more information in the experiment than there actually is, because it will assume that all observations are independent. In practice, this usually results in anticonservativeness as the uncertainty in the coefficient estimates is understated.
This is where duplicateCorrelation comes into play. It estimates the correlation for all genes due to this subject-specific effect; roughly speaking, this is the proportion of the variance that is explained by putting Subject into the model. Estimation of the correlation is done with some care to share information across genes to yield a consensus estimate, which is more stable (and useful) than a gene-specific estimate. lmFit will then use the consensus correlation in generalized least squares, which fits the linear model in a manner that accounts for the Subject blocking factor and the associated correlation.
Finally, if you want to use Condition as a continuous variable - just do it, model.matrix will happily accept real-valued covariates.
thanks for your quick reply! I understood a little bit better the meaning underlying the duplicateCorrelation function.
For what concern my second issue, actually, I tried to use Condition as a continuous variable in place of High/Low factorial variable. Whereas I easily drew a contrast matrix for the first design (i.e. Condition and Tissue as factors) as follows:
TissueA_High.vs.TissueA_Low=High.TissueA-Low.TissueA, # which genes respond to Condition in TissueA
TissueB_High.vs.TissueB_Low=High.TissueB-Low.TissueB, # which genes respond to Condition in TissueB
TissueA.vs.TissueB_High=High.TissueA-High.TissueB, # which genes respond to Tissue in High
TissueA.vs.TissueB_Low=Low.TissueA-Low.TissueB, # which genes respond to Tissue in Low
(TissueA.vs.TissueB_High)_VS_(TissueA.vs.TissueB_Low)=(High.TissueA-High.TissueB)-(Low.TissueA-Low.TissueB), # which genes respond differently in High vs Low (interaction)
levels=design
)
I had some doubt about how to correctly draw and interpret the contrast matrix when I substituted the Condition as factor with the Condition as continuous values. In other words, how can I contrast an interaction term to identify genes that respond differently to Condition (continuous) in TissueA vs TissueB?
This effectively gives an intercept and a gradient term for each level of Tissue, where the gradient describes the change in the log-expression values with respect to increasing values of Condition. To determine whether the condition effect is the same in each tissue, you can just test for equality of the gradient terms via makeContrasts.
Your answer above is really valuable for multi-tissue designs. I do have a couple of questions though. In my case I have multiple tissues (>3 e.g. T1, T2 and T3), confounding factors (e.g. bmi and age) and a condition that can be both continuous and/or discrete with more than 2 levels (e.g. L1, L2, L3).
When the condition is discrete and one wants to account for confounding factors then how correct would be the following design:
l
Treat = factor(paste(Condition, Tissue, sep = "."))
design <- model.matrix(~ 0 + Treat + bmi + age, data = covars)
corfit <- duplicateCorrelation(data, design, block = covars$Subject)
fit <- lmFit(data, design, block = covars$Subject, correlation = corfit$consensus)
cm <- makeContrasts(
# pair-wise comparisons among conditions for the same tissue (T1)
L1vsL2forT1 = TreatL1.T1 - TreatL2.T1,
L1vsL3forT1 = TreatL1.T1 - TreatL3.T1,
L2vsL3forT1 = TreatL2.T1 - TreatL3.T1,
# pair-wise comparisons among tissues for the same condition (L1)
T1vsT2forL1 = TreatL1.T1 - TreatL1.T2,
T1vsT3forL1 = TreatL1.T1 - TreatL1.T3,
T2vsT3forL1 = TreatL1.T2 - TreatL1.T3,
# compare one tissue to all others for the same condition
T1vsAllforL1 = TreatL1.T1 - (TreatL1.T2 + TreatL1.T2) / 2,
T1vsAllforL2 = TreatL2.T1 - (TreatL2.T2 + TreatL2.T2) / 2,
T1vsAllforL3 = TreatL3.T1 - (TreatL3.T2 + TreatL3.T2) / 2,
levels = design)
fit2 <- contrasts.fit(fit, cm)
fit2 <- eBayes(fit2)
topTable(fit = fit2, coef = "L1vsL2forT1")
Additionally, would it be correct to say that the significant genes that are unique for T1vsAllforL1 when compared to T1vsAllforL2 and T1vsAllforL3, could be (presumably) condition-specific responses for tissue T1?
Now when the condition is continuous (e.g. C1):
l
design <- model.matrix(~ 0 + Tissue + Tissue:C1 + age + bmi, data = covars)
colnames(design) <- c("T1", "T2", "T3", "age", "bmi", "T1.C1", "T2.C1", "T3.C1")
corfit <- duplicateCorrelation(data, design, block = covars$Subject)
fit <- lmFit(data, design, block = covars$Subject, correlation = corfit$consensus)
cm <- makeContrasts(
# association of the tissue C1 to gene expression of tissue T1
T1 = T1.C1,
# pair-wise comparison of tissues associations of C1 to gene expression
T1vsT2 = T1.C1 - T2.C1,
# tissue-specific associations of C1 to gene expression
T1vsAll = T1.C1 - (T2.C1 + T3.C1) / 2,
levels = design)
fit2 <- contrasts.fit(fit, cm)
fit2 <- eBayes(fit2)
topTable(fit = fit2, coef = "T1vsAll")
And the same as above here: Genes that are significant in T1vsAll, does it mean that they (presumably) are condition-specific responses for tissue T1?
In another post you suggested that continuous outcomes/conditions, such as C1 in this case, might benefit from applying splines on them. How could this extra parameter be applied in this case?
Dear Aaron,
thanks for your quick reply! I understood a little bit better the meaning underlying the duplicateCorrelation function.
For what concern my second issue, actually, I tried to use Condition as a continuous variable in place of High/Low factorial variable. Whereas I easily drew a contrast matrix for the first design (i.e. Condition and Tissue as factors) as follows:
Treat<-factor(paste(mydata$Condition, mydata$tissue, sep="."))
design<- model.matrix(~0+Treat)
# colnames(design) are: High.TissueA, High.TissueB, Low.TissueA, Low.TissueB
cm<- makeContrasts(
TissueA_High.vs.TissueA_Low=High.TissueA-Low.TissueA, # which genes respond to Condition in TissueA
TissueB_High.vs.TissueB_Low=High.TissueB-Low.TissueB, # which genes respond to Condition in TissueB
TissueA.vs.TissueB_High=High.TissueA-High.TissueB, # which genes respond to Tissue in High
TissueA.vs.TissueB_Low=Low.TissueA-Low.TissueB, # which genes respond to Tissue in Low
(TissueA.vs.TissueB_High)_VS_(TissueA.vs.TissueB_Low)=(High.TissueA-High.TissueB)-(Low.TissueA-Low.TissueB), # which genes respond differently in High vs Low (interaction)
levels=design
)
I had some doubt about how to correctly draw and interpret the contrast matrix when I substituted the Condition as factor with the Condition as continuous values. In other words, how can I contrast an interaction term to identify genes that respond differently to Condition (continuous) in TissueA vs TissueB?
Thanks again!
Luca
You need to do something like this:
This effectively gives an intercept and a gradient term for each level of
Tissue
, where the gradient describes the change in the log-expression values with respect to increasing values ofCondition
. To determine whether the condition effect is the same in each tissue, you can just test for equality of the gradient terms viamakeContrasts
.There are multiple ways to effectively model the matrix, this is great.
Thanks again Aaron.
Hi Aaron,
Your answer above is really valuable for multi-tissue designs. I do have a couple of questions though. In my case I have multiple tissues (>3 e.g. T1, T2 and T3), confounding factors (e.g. bmi and age) and a condition that can be both continuous and/or discrete with more than 2 levels (e.g. L1, L2, L3).
l
Additionally, would it be correct to say that the significant genes that are unique for T1vsAllforL1 when compared to T1vsAllforL2 and T1vsAllforL3, could be (presumably) condition-specific responses for tissue T1?
l
And the same as above here: Genes that are significant in T1vsAll, does it mean that they (presumably) are condition-specific responses for tissue T1?
Thanks beforehand.
Klev
I would suggest posting a new question, we're running out of space in the thread.
Of course. I just found my question a good continuation of this one. Here is the new thread.
Thanks for your time.