RNAseq multifactor experiment limma design with patient and batch effects
2
1
Entering edit mode
cfreije ▴ 10
@cfreije-6767
Last seen 9.5 years ago
United States

I have a question regarding the design matrix set up for a multifactor time course RNASeq experiment.

My experiment is made up of 12 patients, the patients have one of two diseases A or B (6 A patients, 6 B patients), and each patient has samples at 4 time points (1, 2, 3, and 4). For unavoidable reasons, samples were prepped in several different batches; however, all samples from each patient (all time points) are in the same batch (i.e. no patient has samples in more than one batch). The following is the format of the sample metadata:

PatientID    SampleID    Time    Disease    Batch

p1           p1-1        1       A            batchA
p1           p1-2        2       A            batchA
p1           p1-3        3       A            batchA
p1           p1-4        4       A            batchA
p2           p2-1        1       A            batchB
p2           p2-2        2       A            batchB
p2           p2-3        3       A            batchB
p2           p2-4        4       A            batchB
p3           p3-1        1       B            batchB
p3           p3-2        2       B            batchB
p3           p3-3        3       B            batchB
p3           p3-4        4       B            batchB
p4           p4-1        1       B            batchA
p4           p4-2        2       B            batchA
p4           p4-3        3       B            batchA
p4           p4-4        4       B            batchA

...                         ....                     ....              ...                         ....

p12          p12-1       1       B            batchB
p12          p12-2       2       B            batchB
p12          p12-3       3       B            batchB
p12          p12-4       4       B            batchB

Following limma's user guide (Section 9.5), we combined condition and time into a single variable ("Treat") with values A.1, A.2, A.3, A.4, B.1, B.2, B.3, and B.4.

We are interested in the following questions:

1) What genes are differentially expressed compared to time 1 within each disease (i.e. A.2 vs A.1, A.3 vs A.1, A.4 vs A.1, B.2 vs B.1, B.3 vs B.1, B.4 vs B.1)

2) The interaction term - what genes are differentially induced or repressed between diseases? (i.e. (A.2 vs A.1) vs (B.2 vs B.1), (A.3 vs A.1) vs (B.3 vs B.1), and (A.4 vs A.1) vs (B.4 vs B.1))

We are using the following design:

design <- model.matrix(~0 + Treat,sample.metadata)

And then we block by patientID using duplicate correlation to account for inter-patient variability:

v <- voom(y,design) # y is our counts matrix
corfit <- duplicateCorrelation(v, design, block=sample.metadata$patientID)
v <- voom(y, design, block=sample.metadata$patientID, correlation=corfit$consensus)

Our question is related to controlling for batch effects. We are currently not including batch in our design for two reasons: 1) All of our contrasts are asking about changes relative to time 1 and thus are internally normalized. 2) Each patient's samples are in the same batch. No patient has samples in more than one batch; therefore, when we block by patientID using duplicate correlation, we also account for batch effects.

Is this reasoning correct? Or should be including batch in the model even though one patientID is not present in multiple batches and we are blocking by patientID?

Thank you for you help.

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gplots_2.14.2         ggplot2_1.0.0         annotate_1.43.5       XML_3.98-1.1          AnnotationDbi_1.27.16 GenomeInfoDb_1.1.23   IRanges_1.99.28      
 [8] S4Vectors_0.2.4       DESeq_1.17.0          lattice_0.20-29       locfit_1.5-9.1        Biobase_2.25.0        BiocGenerics_0.11.5   biomaRt_2.21.1       
[15] edgeR_3.7.17          limma_3.21.19        

loaded via a namespace (and not attached):
 [1] bitops_1.0-6       caTools_1.17.1     colorspace_1.2-4   DBI_0.3.1          digest_0.6.4       gdata_2.13.3       genefilter_1.47.6  geneplotter_1.43.0
 [9] grid_3.1.1         gtable_0.1.2       gtools_3.4.1       KernSmooth_2.23-13 MASS_7.3-34        munsell_0.4.2      plyr_1.8.1         proto_0.3-10      
[17] RColorBrewer_1.0-5 Rcpp_0.11.2        RCurl_1.95-4.3     reshape2_1.4       RSQLite_0.11.4     scales_0.2.4       splines_3.1.1      stringr_0.6.2     
[25] survival_2.37-7    xtable_1.7-4      

 

limma rnaseq batcheffect • 3.2k views
ADD COMMENT
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 10 minutes ago
The city by the bay

You are correct in that the batch effect is implicitly included in the PatientID, and will be automatically considered when duplicateCorrelation is used with the latter. However, duplicateCorrelation assumes that the patient-specific effects are homogeneously distributed. If the patients in the first batch consistently behave differently from the patients in the second batch, the homogeneity assumption will not hold and the adjustment for the correlations in the linear models will not be accurate.

Now, if you include Batch in your design matrix, it will absorb any systematic differences between patients in different batches. This will make the homogeneity assumption more reasonable if such a batch effect exists. In short, inclusion of Batch as a separate factor in the design may be desirable, as it provides some robustness with respect to the correlation calculations. You'll lose one residual d.f. for variance estimation, but that's acceptable given that you have ~40 d.f.'s (48 libraries - 8 groups) to start with.

---

Edit: Actually, for your first aim (i.e., comparing between time points within the same condition), you could afford to put the PatientID as a blocking factor in the design. This would result in 12 more patient-specific coefficients that need to be estimated, in addition to the 8 coefficients that are already present for each group. As such, you'd end up with a total of 20 terms in the design matrix. This is easily accommodated by your dataset - you have 48 libraries, so you'd end up with 28 residual degrees of freedom for variance estimation (which is plenty).

With this approach, you wouldn't have to make any assumptions at all regarding the similarity of the correlations across different genes. Comparison between pairs of time points is still feasible, as you'd be comparing samples within each patient. Of course, this won't work for your second aim (i.e., comparing between conditions) as each patient-specific term would absorb any differences due to Disease.

ADD COMMENT
0
Entering edit mode

Aaron,

Thank you for your quick and helpful post.

Regarding the first approach you mentioned (including batch in the design and then blocking for patientID using duplicate correlation), I forgot to mention in my first post a limitation of our samples. We have one batch that only has samples for one patient (i.e. batchC only has samples p9-1, p9-2, p9-3, and p9-4). Therefore, we were wondering if this limits the model's ability to estimate what variability is attributable to batchC versus p9? We weren't able to avoid such limitation, so do you have any suggestions on what would be the ideal set up to account for this? 

Regarding your second design suggestion (including patientID in the design). We attempted this, but upon running voom we got an error related to coefficients unable to be estimated.

design <- model.matrix(~0+patientID+Treat,sample.metadata)
v <- voom(y,design)
Coefficients not estimable: B.4 
Warning message:
Partial NA coefficients for 10405 probe(s) 
fit <- lmFit(v, design)
fit2 <- contrasts.fit(fit, contrasts)
Error in contrasts.fit(fit, contrasts)
trying to take contrast of non-estimable coefficient

We believe this error occurred because the model was not able to estimate the variance. We then tried splitting our samples into two groups based on disease and then completing a limma analysis on the counts of each disease group separately (patientID still included in the model as a blocking factor). However, this significantly worsened our FDRs, which we believe is related to the reduction in sample numbers.

Based on this and the limitation of batchC only having samples for one patient, what do you suggest is the most ideal design?

Thank you again for all of your help!

ADD REPLY
0
Entering edit mode

The presence of BatchC for patient 9 complicates matters. Assume that we've included Batch in the design. In terms of variance estimation, patient 9's samples will still be used as they provide information regarding the variability between the timepoints. However, we'll lose a residual d.f. because the coefficient for BatchC must be estimated. For comparisons between time points, patient 9 will still be useful, as you're just comparing within each patient who has a particular disease. However, for comparisons between diseases, the BatchC term will absorb any changes in expression (i.e., patient 9's samples will not be used for this contrast).

So, there's a greater cost in including the Batch term in the design. Nonetheless, it may still be worth doing so if it improves the results. For example, see if you get more DE genes with or without the batch effect.

As for the second design, I forgot that the PatientID factor will absorb the Disease factor. This means that the coefficients for the latter will not be estimable. To avoid this, just use a design matrix of

design <- model.matrix(~0+PatientID+Time:Disease,sample.metadata)
design <- design[,-match(c("Time1:DiseaseA", "Time1:DiseaseB"), colnames(design))]

Each of the interaction terms refers to the change from time point 1 within each patient. The test for the effect of time within each disease can then be easily performed by dropping the corresponding coefficient. The second command in the block above just gets rid of some unestimable coefficients; time point 1 is the baseline for each patient and is represented by the coefficient from the PatientID term, so there's no need for another Time1 coefficient.

To be clear, I'm suggesting two designs here, one for each analysis. The design with the PatientID is only useful for comparing between time points, but not between diseases.

ADD REPLY
0
Entering edit mode

Note that the above design requires Time to be a factor, not an integer type. Otherwise, you'll end up with a very strange design.

ADD REPLY
0
Entering edit mode
@robertmaughan-8096
Last seen 8.0 years ago
Ireland

Hi Aaron, 

I am struggling with a similar experimental setup as cfreije above.

Following your suggested model above for estimating the differences within patients using:

design <- model.matrix(~0+PatientID+Time:Disease,sample.metadata)
design <- design[,-match(c("Time1:DiseaseA", "Time1:DiseaseB"), colnames(design))]

 

How would you suggest to:

1: Test for the differences in responses between disease groups? Are we forced to use the "less accurate" method of duplicateCorrelation to account for the between patient differences?

Alternatively is there a way to incorporate the patientID, the timepoints and the treatment all into the same model and use contrasts to pull out the within patient and between group effects? Maybe we can transform our expression data into % change or similar to summarise in a single variable the difference between time1 and time2?

2: I also have a confounding categorical variable  similar to cfreije above (but mine is binary). I would like to account for the effects of this variable in the model. I have tried incorporating an additional variable in your suggested model but the model was not estimable. My code was as follows:

design<-model.matrix(~0+patient+time:disease+confounder,sample.metadata)
design <- design[,-match(c("Time1:DiseaseA", "Time1:DiseaseB"), colnames(design))]

I'm not really sure what the issue is but I suspect that it something to do with the patient variable absorbing the confounding factor.

 

Any advice on either of these two issues would be much appreciated.

Rob

ADD COMMENT
0
Entering edit mode

For your first question; if you want to test for differences in the responses to time between diseases, that's easily done. For example, to test for disease-specific differences in the change between times 1 and 2, you can do:

con <- makeContrasts(Time2:DiseaseA - Time2:DiseaseB, levels=design)

Testing for differences between diseases at any single time point will be complicated by the blocking on PatientID. Nonetheless, it's still possible in this design with a bit of hard work (and some caveats, see the edit below). Let's say you're interested in differences between diseases at time point 1. We identify the disease for each patient:

npatients <- nlevels(sample.metadata$PatientID)
first.in <- apply(design[,1:npatients], 2, FUN=function(x) { which(x==1)[1] })
patient.disease <- sample.metadata$Disease[first.in]

We then construct a contrast vector for use in contrasts.fit. This compares the average of all patients with disease A against the average of all patients with disease B.

con <- integer(ncol(design))
con[1:npatients][patient.disease=="A"] <- 1/sum(patient.disease=="A")
con[1:npatients][patient.disease=="B"] <- -1/sum(patient.disease=="B")

If you're interested in the second timepoint, you just have to set the entry in the contrast vector for Time2:DiseaseA to 1, and the entry for Time2:DiseaseB to -1. The same logic applies for all of the other time points and the corresponding coefficients. Note that I haven't considered the batch effect in cfreije's original post; the construction of the contrast becomes more complicated, as you have to compute the averages within each batch.

As for your second question: lack of estimability depends on what your confounding variable is. My guess is that the patients are nested within your confounder, so there's no point including it.

---

Edit: For the first question, the strategy outlined above does not consider the variance between patients within each timepoint. For example, the variance estimates will be small as long as the response to time is consistent between patients. This is true even if the absolute expression at any given time point varies massively between patients. As such, you could end up with a lot of highly variable genes in your DE list if you use the above contrast to compare between diseases at each time point. To this end, use of the simple Treat-based design with duplicateCorrelations (as described in the original post) will provide some protection and should be preferred for this particular DE comparison. There's no "inaccuracy" here, as the simple design models the variance appropriate to the research question.

ADD REPLY

Login before adding your answer.

Traffic: 552 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6