I have a question regarding the design matrix set up for a multifactor time course RNASeq experiment.
My experiment is made up of 12 patients, the patients have one of two diseases A or B (6 A patients, 6 B patients), and each patient has samples at 4 time points (1, 2, 3, and 4). For unavoidable reasons, samples were prepped in several different batches; however, all samples from each patient (all time points) are in the same batch (i.e. no patient has samples in more than one batch). The following is the format of the sample metadata:
PatientID SampleID Time Disease Batch p1 p1-1 1 A batchA
p1 p1-2 2 A batchA
p1 p1-3 3 A batchA
p1 p1-4 4 A batchA
p2 p2-1 1 A batchB
p2 p2-2 2 A batchB
p2 p2-3 3 A batchB
p2 p2-4 4 A batchB
p3 p3-1 1 B batchB
p3 p3-2 2 B batchB
p3 p3-3 3 B batchB
p3 p3-4 4 B batchB
p4 p4-1 1 B batchA
p4 p4-2 2 B batchA
p4 p4-3 3 B batchA
p4 p4-4 4 B batchA
... .... .... ... ....
p12 p12-1 1 B batchB
p12 p12-2 2 B batchB
p12 p12-3 3 B batchB
p12 p12-4 4 B batchB
Following limma's user guide (Section 9.5), we combined condition and time into a single variable ("Treat") with values A.1, A.2, A.3, A.4, B.1, B.2, B.3, and B.4.
We are interested in the following questions:
1) What genes are differentially expressed compared to time 1 within each disease (i.e. A.2 vs A.1, A.3 vs A.1, A.4 vs A.1, B.2 vs B.1, B.3 vs B.1, B.4 vs B.1)
2) The interaction term - what genes are differentially induced or repressed between diseases? (i.e. (A.2 vs A.1) vs (B.2 vs B.1), (A.3 vs A.1) vs (B.3 vs B.1), and (A.4 vs A.1) vs (B.4 vs B.1))
We are using the following design:
design <- model.matrix(~0 + Treat,sample.metadata)
And then we block by patientID using duplicate correlation to account for inter-patient variability:
v <- voom(y,design) # y is our counts matrix corfit <- duplicateCorrelation(v, design, block=sample.metadata$patientID) v <- voom(y, design, block=sample.metadata$patientID, correlation=corfit$consensus)
Our question is related to controlling for batch effects. We are currently not including batch in our design for two reasons: 1) All of our contrasts are asking about changes relative to time 1 and thus are internally normalized. 2) Each patient's samples are in the same batch. No patient has samples in more than one batch; therefore, when we block by patientID using duplicate correlation, we also account for batch effects.
Is this reasoning correct? Or should be including batch in the model even though one patientID is not present in multiple batches and we are blocking by patientID?
Thank you for you help.
> sessionInfo() R version 3.1.1 (2014-07-10) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] gplots_2.14.2 ggplot2_1.0.0 annotate_1.43.5 XML_3.98-1.1 AnnotationDbi_1.27.16 GenomeInfoDb_1.1.23 IRanges_1.99.28 [8] S4Vectors_0.2.4 DESeq_1.17.0 lattice_0.20-29 locfit_1.5-9.1 Biobase_2.25.0 BiocGenerics_0.11.5 biomaRt_2.21.1 [15] edgeR_3.7.17 limma_3.21.19 loaded via a namespace (and not attached): [1] bitops_1.0-6 caTools_1.17.1 colorspace_1.2-4 DBI_0.3.1 digest_0.6.4 gdata_2.13.3 genefilter_1.47.6 geneplotter_1.43.0 [9] grid_3.1.1 gtable_0.1.2 gtools_3.4.1 KernSmooth_2.23-13 MASS_7.3-34 munsell_0.4.2 plyr_1.8.1 proto_0.3-10 [17] RColorBrewer_1.0-5 Rcpp_0.11.2 RCurl_1.95-4.3 reshape2_1.4 RSQLite_0.11.4 scales_0.2.4 splines_3.1.1 stringr_0.6.2 [25] survival_2.37-7 xtable_1.7-4
Aaron,
Thank you for your quick and helpful post.
Regarding the first approach you mentioned (including batch in the design and then blocking for patientID using duplicate correlation), I forgot to mention in my first post a limitation of our samples. We have one batch that only has samples for one patient (i.e. batchC only has samples p9-1, p9-2, p9-3, and p9-4). Therefore, we were wondering if this limits the model's ability to estimate what variability is attributable to batchC versus p9? We weren't able to avoid such limitation, so do you have any suggestions on what would be the ideal set up to account for this?
Regarding your second design suggestion (including patientID in the design). We attempted this, but upon running voom we got an error related to coefficients unable to be estimated.
We believe this error occurred because the model was not able to estimate the variance. We then tried splitting our samples into two groups based on disease and then completing a limma analysis on the counts of each disease group separately (patientID still included in the model as a blocking factor). However, this significantly worsened our FDRs, which we believe is related to the reduction in sample numbers.
Based on this and the limitation of batchC only having samples for one patient, what do you suggest is the most ideal design?
Thank you again for all of your help!
The presence of
BatchC
for patient 9 complicates matters. Assume that we've includedBatch
in the design. In terms of variance estimation, patient 9's samples will still be used as they provide information regarding the variability between the timepoints. However, we'll lose a residual d.f. because the coefficient forBatchC
must be estimated. For comparisons between time points, patient 9 will still be useful, as you're just comparing within each patient who has a particular disease. However, for comparisons between diseases, theBatchC
term will absorb any changes in expression (i.e., patient 9's samples will not be used for this contrast).So, there's a greater cost in including the
Batch
term in the design. Nonetheless, it may still be worth doing so if it improves the results. For example, see if you get more DE genes with or without the batch effect.As for the second design, I forgot that the
PatientID
factor will absorb theDisease
factor. This means that the coefficients for the latter will not be estimable. To avoid this, just use a design matrix ofEach of the interaction terms refers to the change from time point 1 within each patient. The test for the effect of time within each disease can then be easily performed by dropping the corresponding coefficient. The second command in the block above just gets rid of some unestimable coefficients; time point 1 is the baseline for each patient and is represented by the coefficient from the
PatientID
term, so there's no need for anotherTime1
coefficient.To be clear, I'm suggesting two designs here, one for each analysis. The design with the
PatientID
is only useful for comparing between time points, but not between diseases.Note that the above
design
requiresTime
to be a factor, not an integer type. Otherwise, you'll end up with a very strange design.