limma - complex design with paired samples, sex and batch effect
3
4
aec ▴ 90
@aec-9409
Last seen 3.9 years ago

Hi,

I have a very complex paired design to test for differential expression with limma. I have a control group and an intervention group (more than 100 RNA-seq samples in total). The intervention group received a treatment, but the control group did not. My expression data are structured like this:

Pre1: control pre-treatment

Pre2: intervention pre-treatment

Post1: control post-treatment

Post2: intervention post-treatment

I would like to compare:

Post1-Pre1

Post2-Pre2

Pre2-Pre1

Post2-Post1

I would like to take batch (batch1 and batch2) and sex (male/female) into account in these comparisons. However, I am not sure whether to include the batch effect, because the Pre and Post samples from the same individual always fall in the same batch.

This is my limma code so far:

head(info)
PATIENT TREATMENT GROUP GENDER BATCH
11 Pre 2 Female BATCH1
11 Post 2 Female BATCH1
12 Pre 2 Male BATCH1
12 Post 2 Male BATCH1
13 Pre 2 Female BATCH1
13 Post 2 Female BATCH1
14 Pre 2 Male BATCH1
14 Post 2 Male BATCH1
15 Pre 2 Male BATCH1
15 Post 2 Male BATCH1
16 Pre 2 Male BATCH2
16 Post 2 Male BATCH2
17 Pre 2 Female BATCH2
17 Post 2 Female BATCH2
18 Pre 2 Male BATCH2
18 Post 2 Male BATCH2
19 Pre 2 Female BATCH2
19 Post 2 Female BATCH2
20 Pre 2 Female BATCH2
20 Post 2 Female BATCH2
21 Pre 2 Female BATCH2
21 Post 2 Female BATCH2
22 Pre 2 Male BATCH2
22 Post 2 Male BATCH2
23 Pre 2 Female BATCH2
23 Post 2 Female BATCH2
24 Pre 1 Female BATCH2
24 Post 1 Female BATCH2
25 Pre 2 Male BATCH2
25 Post 2 Male BATCH2
26 Pre 2 Female BATCH2
26 Post 2 Female BATCH2
27 Pre 2 Female BATCH2
27 Post 2 Female BATCH2
28 Pre 2 Female BATCH2
28 Post 2 Female BATCH2
29 Pre 1 Female BATCH2
29 Post 1 Female BATCH2
30 Pre 2 Female BATCH2
30 Post 2 Female BATCH2
31 Pre 2 Female BATCH2
31 Post 2 Female BATCH2
32 Pre 2 Female BATCH2
32 Post 2 Female BATCH2
33 Pre 2 Male BATCH2
33 Post 2 Male BATCH2
34 Pre 1 Female BATCH2
34 Post 1 Female BATCH2
       
library(limma)
library(edgeR)

y <- DGEList(counts=counts)

# Keep genes with a total count above 500 across all samples
A <- rowSums(y$counts)
isexpr <- A > 500
y <- y[isexpr, , keep.lib.sizes=FALSE]
dim(y)

y <- calcNormFactors(y)

sex <- as.factor(info$GENDER)
batch <- as.factor(info$BATCH)
Treat <- factor(paste(info$TREATMENT, info$GROUP, sep="."))
design <- model.matrix(~0 + Treat + sex + batch)
colnames(design)

v <- voom(y, design)
cor <- duplicateCorrelation(v, design, block=info$PATIENT)

fit <- lmFit(v, design, block=info$PATIENT, correlation=cor$consensus)
cm <- makeContrasts(
    CONTROLvsINTERVENTIONforPRE = TreatPre.2 - TreatPre.1,
    CONTROLvsINTERVENTIONforPOST = TreatPost.2 - TreatPost.1,
    PrevsPostforCONTROL = TreatPost.1 - TreatPre.1,
    PrevsPostforINTERVENTION = TreatPost.2 - TreatPre.2,
    levels=design)
fit2 <- contrasts.fit(fit, cm)
fit2 <- eBayes(fit2)

summary(decideTests(fit2))

Pre1Pre2_table <- topTable(fit2, coef="CONTROLvsINTERVENTIONforPRE", n=Inf, sort.by="p", p.value=0.05)
Post1Post2_table <- topTable(fit2, coef="CONTROLvsINTERVENTIONforPOST", n=Inf, sort.by="p", p.value=0.05)
Pre1Post1_table <- topTable(fit2, coef="PrevsPostforCONTROL", n=Inf, sort.by="p", p.value=0.05)
Pre2Post2_table <- topTable(fit2, coef="PrevsPostforINTERVENTION", n=Inf, sort.by="p", p.value=0.05)

 

limma paired samples complex-design • 5.3k views
ADD COMMENT
4
Aaron Lun ★ 28k
@alun
Last seen 10 minutes ago
The city by the bay

Looks fine to me. Including sex and batch as fixed effects in design probably doesn't hurt, as you're not losing a lot of residual d.f. from two extra terms. In fact, if you're using duplicateCorrelation, it might even help. Those terms will absorb systematic differences in expression between sexes and batches, which ensures that the remaining random effects across patients are more homogeneous.

The alternative would be to block directly on the patient ID in the design matrix, which would then obviate the need to block on sex and batch. However, this renders your desired DE comparisons impossible as you can't compare directly across blocking levels, e.g., if you wanted to compare post-treatment expression between the different levels of GROUP. You'd be restricted to comparing within GROUP (your first two comparisons) or looking for differential treatment effects between groups, i.e., (Post2 - Pre2) - (Post1 - Pre1).
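For reference, with the ~0+Treat+sex+batch design from the question, that differential treatment effect would be a contrast along these lines (just a sketch; the contrast name is arbitrary):

cm <- makeContrasts(
    DiffTreatEffect = (TreatPost.2 - TreatPre.2) - (TreatPost.1 - TreatPre.1),
    levels=design)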

Some other comments:

  • Setting isexpr<-A>500 is not robust as the aggressiveness of the filter depends on the number of libraries. If you were to copy-paste this code elsewhere, or if you got new libraries and added them to the analysis, or if you found out some of your existing libraries were rubbish and threw them out, then the strength of the filter would change. You could make it safer by just switching to the equivalent threshold in terms of the rowMeans, as in the sketch after this list.

  • You can set robust=TRUE in eBayes to protect against genes with outlier variances, of which there may be a few in such a large data set. This ensures that those outliers won't distort the estimates of the EB statistics.
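As a minimal sketch of both suggestions above, reusing the y, fit and cm objects from the question (the rowMeans cutoff of 4 is only an illustrative stand-in for whatever mean count corresponds to your current total-count threshold):

# Filter on the average count per gene, so the threshold does not change
# when libraries are added or removed.
isexpr <- rowMeans(y$counts) > 4
y <- y[isexpr, , keep.lib.sizes=FALSE]

# Robust empirical Bayes shrinkage protects against genes with outlier variances.
fit2 <- eBayes(contrasts.fit(fit, cm), robust=TRUE)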

ADD COMMENT
0

You said that blocking directly on patient in the design matrix makes it impossible to compare directly across blocking levels, but this is actually what we did in the last posts: I compared Post2-Post1 and obtained 13 DE genes.

ADD REPLY
0

You are not comparing Post2 to Post1 expression in those posts. Rather, you are comparing the Post2 log-FC (before/after treatment) to the Post1 log-FC. This is possible as you have computed those log-FCs within patients, such that the blocking factor for each patient cancels out. You can't compare absolute expression between patients when they're blocked in the design, because the blocking factors for different patients would not cancel out and will absorb DE.

ADD REPLY
0
aec ▴ 90
@aec-9409
Last seen 3.9 years ago

Aaron, thanks for your comment.

The thing is that if I run this model, ~0+Treat+sex+batch, the number of DE genes is only 2 for Post2-Post1, and I would expect many more DE genes because the controls did not receive the treatment but the intervention group did.

See results:

   CONTROLvsINTERVENTIONforPRE CONTROLvsINTERVENTIONforPOST PrevsPostforCONTROL
-1                           0                            0                   0
0                        23776                        23774               23776
1                            0                            2                   0
   PrevsPostforINTERVENTION
-1                     1165
0                     21349
1                      1262

If I do not take into account the batch effect, the model would be : ~0+Treat+sex and these are the results:

   CONTROLvsINTERVENTIONforPRE CONTROLvsINTERVENTIONforPOST PrevsPostforCONTROL
-1                          92                          893                   0
0                        23603                        21861               23776
1                           81                         1022                   0
   PrevsPostforINTERVENTION
-1                     1455
0                     20658
1                      1663

As you can see, I do get more DE genes for Post2-Post1, but the number of DE genes for Pre2-Pre1 also increases, which is not desirable as we expect no differences there.

My question would be: which of the two models is more appropriate? Is it really necessary to account for the batch effect if the Pre and Post samples from the same individual come from the same batch?

The other question is, when we compare Pre2-Pre1 or Post2-Post1 we do not have paired samples, so is it necessary to account for the patient covariate for these two comparisons? 

 

ADD COMMENT
0

I'd check what your samples are doing with respect to the batch, e.g., on an MDS plot. I suspect there may be a strong difference between samples in different batches. If so, this needs to be modelled because it will otherwise distort the variance estimates. This is true even when you're comparing within batches via duplicateCorrelation (in which case, a systematic difference between patients in different batches will distort the correlation estimate).
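For example, a minimal sketch of such an MDS plot coloured by batch, assuming the v (voom) object and info data frame from the question:

# Colour samples by batch to see whether batch drives the leading dimensions;
# the same plot with info$GENDER shows any sex separation.
plotMDS(v, labels=info$BATCH, col=as.integer(factor(info$BATCH)))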

For similar reasons, you need to account for the patient covariate when comparing between treatments. This is because samples from the same patient are correlated - ignoring that will underestimate the standard error of the log-fold change.

Edit: I should add that getting more DE genes when ignoring a batch effect is pretty weird. Usually, the batch effect will inflate the variance/increase the correlation, which should result in fewer DE genes for that comparison. Check if all the control or intervention samples are in one batch, because if that's the case, the samples in the other batch will not be used in the DE comparison.
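A quick way to check that, again just a sketch using the info data frame:

# Cross-tabulate group against batch; a zero cell means one group is entirely
# absent from that batch, so that batch cannot contribute to the group comparison.
table(info$GROUP, info$BATCH)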

ADD REPLY
0

Aaron, you are right: Batch1 has only intervention samples, none coming from controls. But Batch2 has both control and intervention samples. How should I proceed in such a case?

Regarding the MDS plot I can clearly see the sex and batch separation.

Many thanks for your comments.

ADD REPLY
0

I would definitely block on batch in the design matrix. You don't want a situation where a gene is DE between intervention and control just because of a batch effect (e.g., an effect in Batch 1 that results in up/downregulation in many of the intervention samples). Now, it's unfortunate that you get fewer DE genes, but that's the way it is. Perhaps a more relevant comparison would account for pre-treatment differences; try out the (Post2 - Pre2) - (Post1 - Pre1) contrast that I described before, which tests for a differential treatment effect between control and intervention.

ADD REPLY
0

Aaron, the contrast you suggested gives no DE genes. Here is part of the code:

fit=lmFit(v,design,block=info$PATIENT, correlation=cor$consensus)
cm=makeContrasts(CONTROLvsINTERVENTIONforTreatment_Effect= (TreatPost.2-TreatPre.2)-(TreatPost.1-TreatPre.1),levels=design)
fit2=contrasts.fit(fit,cm)
fit2=eBayes(fit2)

Result:

CONTROLvsINTERVENTIONforTreatment_Effect
-1                                        0
0                                     23776
1                                         0

This is not expected, right? Is there something I am doing wrong? I corrected for sex and batch effects as well.

 

ADD REPLY
0

The code looks fine to me. The result is quite strange, though; you get around 2000 DE genes for PrevsPostforINTERVENTION and no DE genes for PrevsPostforCONTROL, yet the comparison between the two log-fold changes doesn't yield any DE genes. Try blocking on patient in the design matrix:

design <- model.matrix(~ patient + group:treatment)

... and do a contrast between the coefficients representing the treatment log-fold change in the control and intervention groups. (Dropping a coefficient to obtain a full-rank design will probably be required, probably the second-last one.) Also test each of those coefficients individually, as a reference. The aim is to ensure that you're not getting penalised by duplicateCorrelation for between-block comparisons.
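A sketch of this approach, assuming patient, treatment (Pre/Post) and group factors built from the info table; the exact column names and which columns are non-estimable depend on your factor coding, so check colnames(design) before dropping anything:

patient   <- factor(info$PATIENT)
treatment <- factor(info$TREATMENT, levels=c("Pre", "Post"))
group     <- factor(info$GROUP)

design <- model.matrix(~ patient + treatment:group)
colnames(design) <- make.names(colnames(design))   # no spaces or colons in names

# The patient effects already absorb the Pre baselines, so the treatmentPre:group*
# columns are redundant; drop them to obtain a full-rank design.
design <- design[, !grepl("treatmentPre", colnames(design))]

v   <- voom(y, design)
fit <- lmFit(v, design)   # no duplicateCorrelation; patient is in the design

# Differential treatment effect: (Post - Pre in group 2) minus (Post - Pre in group 1).
cm   <- makeContrasts(DiffTreatEffect = treatmentPost.group2 - treatmentPost.group1,
                      levels=design)
fit2 <- eBayes(contrasts.fit(fit, cm))
summary(decideTests(fit2))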

ADD REPLY
0

I think this is your suggested code:

design <- model.matrix(~ patient + treatment:group)
colnames(design)

[1] "(Intercept)"          "patient 10"           "patient100"
  [4] "patient101"           "patient102"           "patient103"
  [7] "patient104"           "patient105"           "patient106"
 [10] "patient107"           "patient108"           "patient109"
 [13] "patient 11"           "patient110"           "patient111"
 [16] "patient112"           "patient113"           "patient114"
 [19] "patient115"           "patient116"           "patient117"
 [22] "patient118"           "patient119"           "patient 12"
 [25] "patient120"           "patient121"           "patient122"
 [28] "patient123"           "patient124"           "patient125"
 [31] "patient126"           "patient127"           "patient128"
 [34] "patient129"           "patient 13"           "patient130"
 [37] "patient 14"           "patient 15"           "patient 16"
 [40] "patient 17"           "patient 18"           "patient 19"
 [43] "patient  2"           "patient 20"           "patient 21"
 [46] "patient 22"           "patient 23"           "patient 24"
 [49] "patient 25"           "patient 26"           "patient 27"
 [52] "patient 28"           "patient 29"           "patient  3"
 [55] "patient 30"           "patient 31"           "patient 32"
 [58] "patient 33"           "patient 34"           "patient 35"
 [61] "patient 36"           "patient 37"           "patient 38"
 [64] "patient 39"           "patient  4"           "patient 40"
 [67] "patient 41"           "patient 42"           "patient 43"
 [70] "patient 44"           "patient 45"           "patient 46"
 [73] "patient 47"           "patient 48"           "patient 49"
 [76] "patient  5"           "patient 50"           "patient 51"
 [79] "patient 52"           "patient 53"           "patient 54"
 [82] "patient 55"           "patient 56"           "patient 57"
 [85] "patient 58"           "patient 59"           "patient  6"
 [88] "patient 60"           "patient 61"           "patient 62"
 [91] "patient 63"           "patient 64"           "patient 65"
 [94] "patient 66"           "patient 67"           "patient 68"
 [97] "patient 69"           "patient  7"           "patient 70"
[100] "patient 71"           "patient 72"           "patient 73"
[103] "patient 74"           "patient 75"           "patient 76"
[106] "patient 77"           "patient 78"           "patient 79"
[109] "patient  8"           "patient 80"           "patient 81"
[112] "patient 82"           "patient 83"           "patient 84"
[115] "patient 85"           "patient 86"           "patient 87"
[118] "patient 88"           "patient 89"           "patient  9"
[121] "patient 90"           "patient 91"           "patient 92"
[124] "patient 93"           "patient 94"           "patient 95"
[127] "patient 96"           "patient 97"           "patient 98"
[130] "patient 99"           "treatmentPost:group1" "treatmentPre:group1"
[133] "treatmentPost:group2" "treatmentPre:group2"

v=voom(y,design)
fit=lmFit(v,design)
cm=makeContrasts(CONTROLvsINTERVENTIONforTreatment_Effect= (treatmentPost:group2 - treatmentPre:group2)-(treatmentPost:group1 - treatmentPre:group1),levels=design)
fit2=contrasts.fit(fit,cm)
fit2=eBayes(fit2)

Which gives these error messages:

Coefficients not estimable: treatmentPre:group1 treatmentPre:group2
Warning message:
Partial NA coefficients for 23776 probe(s)
Coefficients not estimable: treatmentPre:group1 treatmentPre:group2
Warning message:
Partial NA coefficients for 23776 probe(s)
Error in makeContrasts(CONTROLvsINTERVENTIONforTreatment_Effect = (treatmentPost:group2 -  :
  The levels must by syntactically valid names in R, see help(make.names).  Non-valid names: patient 10,patient 11,patient 12,patient 13,patient 14,patient 15,patient 16,patient 17,patient 18,patient 19,patient  2,patient 20,patient 21,patient 22,patient 23,patient 24,patient 25,patient 26,patient 27,patient 28,patient 29,patient  3,patient 30,patient 31,patient 32,patient 33,patient 34,patient 35,patient 36,patient 37,patient 38,patient 39,patient  4,patient 40,patient 41,patient 42,patient 43,patient 44,patient 45,patient 46,patient 47,patient 48,patient 49,patient  5,patient 50,patient 51,patient 52,patient 53,patient 54,patient 55,patient 56,patient 57,patient 58,patient 59,patient  6,patient 60,patient 61,patient 62,patient 63,patient 64,patient 65,patient 66,patient 67,patient 68,patient 69,patient  7,patient 70,patient 71,patient 72,patient 73,patient 74,patient 75,patient 76,patient 77,patient 78,patient 79,patient  8,patient 80,patient 81,patient 82,patient 83,patient
In addition: Warning message:
In makeContrasts(CONTROLvsINTERVENTIONforTreatment_Effect = (treatmentPost:group2 -  :
  Renaming (Intercept) to Intercept
Execution halted

 

ADD REPLY
0

You need to drop coefficients 132 and 134. Also, you need to rename the coefficients so that they don't have spaces or colons in them.

ADD REPLY
0

I think I got your point:

design <- model.matrix(~ patient + treatment:group)
colnames(design) <- make.names(colnames(design), unique=TRUE)
design=design[,c(-132,-134)]
colnames(design)

 [1] "X.Intercept."         "patient.10"           "patient100"
  [4] "patient101"           "patient102"           "patient103"
  [7] "patient104"           "patient105"           "patient106"
 [10] "patient107"           "patient108"           "patient109"
 [13] "patient.11"           "patient110"           "patient111"
 [16] "patient112"           "patient113"           "patient114"
 [19] "patient115"           "patient116"           "patient117"
 [22] "patient118"           "patient119"           "patient.12"
 [25] "patient120"           "patient121"           "patient122"
 [28] "patient123"           "patient124"           "patient125"
 [31] "patient126"           "patient127"           "patient128"
 [34] "patient129"           "patient.13"           "patient130"
 [37] "patient.14"           "patient.15"           "patient.16"
 [40] "patient.17"           "patient.18"           "patient.19"
 [43] "patient..2"           "patient.20"           "patient.21"
 [46] "patient.22"           "patient.23"           "patient.24"
 [49] "patient.25"           "patient.26"           "patient.27"
 [52] "patient.28"           "patient.29"           "patient..3"
 [55] "patient.30"           "patient.31"           "patient.32"
 [58] "patient.33"           "patient.34"           "patient.35"
 [61] "patient.36"           "patient.37"           "patient.38"
 [64] "patient.39"           "patient..4"           "patient.40"
 [67] "patient.41"           "patient.42"           "patient.43"
 [70] "patient.44"           "patient.45"           "patient.46"
 [73] "patient.47"           "patient.48"           "patient.49"
 [76] "patient..5"           "patient.50"           "patient.51"
 [79] "patient.52"           "patient.53"           "patient.54"
 [82] "patient.55"           "patient.56"           "patient.57"
 [85] "patient.58"           "patient.59"           "patient..6"
 [88] "patient.60"           "patient.61"           "patient.62"
 [91] "patient.63"           "patient.64"           "patient.65"
 [94] "patient.66"           "patient.67"           "patient.68"
 [97] "patient.69"           "patient..7"           "patient.70"
[100] "patient.71"           "patient.72"           "patient.73"
[103] "patient.74"           "patient.75"           "patient.76"
[106] "patient.77"           "patient.78"           "patient.79"
[109] "patient..8"           "patient.80"           "patient.81"
[112] "patient.82"           "patient.83"           "patient.84"
[115] "patient.85"           "patient.86"           "patient.87"
[118] "patient.88"           "patient.89"           "patient..9"
[121] "patient.90"           "patient.91"           "patient.92"
[124] "patient.93"           "patient.94"           "patient.95"
[127] "patient.96"           "patient.97"           "patient.98"
[130] "patient.99"           "treatmentPost.group1" "treatmentPost.group2"

v=voom(y,design)
fit=lmFit(v,design)
cm=makeContrasts(CONTROLvsINTERVENTIONforPost= (treatmentPost.group2 - treatmentPost.group1),levels=design)
fit2=contrasts.fit(fit,cm)
fit2=eBayes(fit2) 

summary(decideTests(fit2))

CONTROLvsINTERVENTIONforPost
-1                            9
0                         23763
1                             4

Still very few DE genes comparing Post2-Post1 but this design does not include batch and sex effects. Should I add them too?

ADD REPLY
1

No need to include batch and sex effects, as you're blocking on the patient. Each patient is nested within a batch and a sex, so the latter two would be redundant. Anyway, there aren't a lot of DE genes. One possible explanation is that you just don't have a lot of control samples (I assume this is group 1), such that lots of weak DE can be picked up in the intervention pre-post comparison (e.g., by testing treatmentPost.group2) but not in the equivalent comparison for the control (e.g., by testing treatmentPost.group1). Similarly, this means that there's not enough evidence for looking at differential treatment effects between intervention and control. If so, I suspect that trying different analyses won't do much to help, because the experimental design is inherently limited.
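As a sketch, the two within-group reference comparisons mentioned above, assuming the fit object from the patient-blocked design (the object names ctrl_eff and int_eff are just for illustration); each coefficient is the within-patient Post vs Pre log-fold change for that group:

fit <- eBayes(fit)

# Post vs Pre within the control group (group 1)
ctrl_eff <- topTable(fit, coef="treatmentPost.group1", n=Inf, p.value=0.05)

# Post vs Pre within the intervention group (group 2)
int_eff <- topTable(fit, coef="treatmentPost.group2", n=Inf, p.value=0.05)

nrow(ctrl_eff)   # number of DE genes for the control comparison
nrow(int_eff)    # number of DE genes for the intervention comparison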

ADD REPLY
0

Yes, I have different numbers of samples for group1 and group2.

43 for group1 (control)

87 for group2 (intervention)

Thanks for all your comments Aaron.

 

ADD REPLY
0

One last question: if here we are comparing Post2-Post1, why should we block on patient? The patients in group1 are different from those in group2...

ADD REPLY
0

What makes you think that you shouldn't block on the patient term (either in the design matrix or via duplicateCorrelation)? You need to account for patient-specific effects on expression, regardless of what comparison you're making.

ADD REPLY
0

OK, I see. It's difficult for me to see the difference between blocking via duplicateCorrelation or directly in the design matrix. What is the advantage of each approach?

ADD REPLY
0

If you want to compare between different levels of the blocking factor, use duplicateCorrelation. If not, then you can use either method - in such cases, duplicateCorrelation is more powerful but can also be slightly liberal. This is because it treats the patient-specific effects as being randomly sampled from some homogeneous distribution, whereas the same effects are not considered to be related when you block on patient in the design matrix. As such, duplicateCorrelation makes better use of the information in the data set as it doesn't have to estimate each patient-specific effect separately; however, this comes at the cost of having to estimate the correlation (related to the variability of the patient effects) which is not easy to do with low numbers of patients, when there are systematic differences between patients, or in the presence of outlier patients. These are rather esoteric considerations so, for all intents and purposes, the practical choice is driven by the comparisons you want to do.

ADD REPLY
0

Clear, thanks a lot.

ADD REPLY
0
aec ▴ 90
@aec-9409
Last seen 3.9 years ago

Aaron, I would like to know how sex and batch affect the results. How can I test whether they are significant?

ADD COMMENT
0

We do this informally by just seeing if they increase or decrease the number of DE genes when we include them in the model. However, if you're seeing separation by sex/batch in the MDS plot, I would suggest that you keep them in your model, regardless of how the numbers of DE genes change. In such cases, it's clear there's an effect that should be modelled.
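A sketch of that informal check, reusing y, info, Treat, sex and batch from earlier in the thread; count_de is just a throwaway helper name, and only one contrast is shown for brevity:

count_de <- function(design) {
    v    <- voom(y, design)
    cor  <- duplicateCorrelation(v, design, block=info$PATIENT)
    fit  <- lmFit(v, design, block=info$PATIENT, correlation=cor$consensus)
    cm   <- makeContrasts(PrevsPostforINTERVENTION = TreatPost.2 - TreatPre.2,
                          levels=design)
    summary(decideTests(eBayes(contrasts.fit(fit, cm))))
}

count_de(model.matrix(~ 0 + Treat))                 # no covariates
count_de(model.matrix(~ 0 + Treat + sex))           # sex only
count_de(model.matrix(~ 0 + Treat + sex + batch))   # sex and batch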

ADD REPLY
