I have a question regarding the design matrix set up for a multifactor time course RNASeq experiment.
My experiment is made up of 12 patients, the patients have one of two diseases A or B (6 A patients, 6 B patients), and each patient has samples at 4 time points (1, 2, 3, and 4). For unavoidable reasons, samples were prepped in several different batches; however, all samples from each patient (all time points) are in the same batch (i.e. no patient has samples in more than one batch). The following is the format of the sample metadata:
PatientID SampleID Time Disease Batch p1 p1-1 1 A batchA
p1 p1-2 2 A batchA
p1 p1-3 3 A batchA
p1 p1-4 4 A batchA
p2 p2-1 1 A batchB
p2 p2-2 2 A batchB
p2 p2-3 3 A batchB
p2 p2-4 4 A batchB
p3 p3-1 1 B batchB
p3 p3-2 2 B batchB
p3 p3-3 3 B batchB
p3 p3-4 4 B batchB
p4 p4-1 1 B batchA
p4 p4-2 2 B batchA
p4 p4-3 3 B batchA
p4 p4-4 4 B batchA
... .... .... ... ....
p12 p12-1 1 B batchB
p12 p12-2 2 B batchB
p12 p12-3 3 B batchB
p12 p12-4 4 B batchB
Following limma's user guide (Section 9.5), we combined condition and time into a single variable ("Treat") with values A.1, A.2, A.3, A.4, B.1, B.2, B.3, and B.4.
We are interested in the following questions:
1) What genes are differentially expressed compared to time 1 within each disease (i.e. A.2 vs A.1, A.3 vs A.1, A.4 vs A.1, B.2 vs B.1, B.3 vs B.1, B.4 vs B.1)
2) The interaction term - what genes are differentially induced or repressed between diseases? (i.e. (A.2 vs A.1) vs (B.2 vs B.1), (A.3 vs A.1) vs (B.3 vs B.1), and (A.4 vs A.1) vs (B.4 vs B.1))
We are using the following design:
design <- model.matrix(~0 + Treat,sample.metadata)
And then we block by patientID using duplicate correlation to account for inter-patient variability:
v <- voom(y,design) # y is our counts matrix corfit <- duplicateCorrelation(v, design, block=sample.metadata$patientID) v <- voom(y, design, block=sample.metadata$patientID, correlation=corfit$consensus)
Our question is related to controlling for batch effects. We are currently not including batch in our design for two reasons: 1) All of our contrasts are asking about changes relative to time 1 and thus are internally normalized. 2) Each patient's samples are in the same batch. No patient has samples in more than one batch; therefore, when we block by patientID using duplicate correlation, we also account for batch effects.
Is this reasoning correct? Or should be including batch in the model even though one patientID is not present in multiple batches and we are blocking by patientID?
Thank you for you help.
> sessionInfo() R version 3.1.1 (2014-07-10) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] gplots_2.14.2 ggplot2_1.0.0 annotate_1.43.5 XML_3.98-1.1 AnnotationDbi_1.27.16 GenomeInfoDb_1.1.23 IRanges_1.99.28 [8] S4Vectors_0.2.4 DESeq_1.17.0 lattice_0.20-29 locfit_1.5-9.1 Biobase_2.25.0 BiocGenerics_0.11.5 biomaRt_2.21.1 [15] edgeR_3.7.17 limma_3.21.19 loaded via a namespace (and not attached): [1] bitops_1.0-6 caTools_1.17.1 colorspace_1.2-4 DBI_0.3.1 digest_0.6.4 gdata_2.13.3 genefilter_1.47.6 geneplotter_1.43.0 [9] grid_3.1.1 gtable_0.1.2 gtools_3.4.1 KernSmooth_2.23-13 MASS_7.3-34 munsell_0.4.2 plyr_1.8.1 proto_0.3-10 [17] RColorBrewer_1.0-5 Rcpp_0.11.2 RCurl_1.95-4.3 reshape2_1.4 RSQLite_0.11.4 scales_0.2.4 splines_3.1.1 stringr_0.6.2 [25] survival_2.37-7 xtable_1.7-4

Aaron,
Thank you for your quick and helpful post.
Regarding the first approach you mentioned (including batch in the design and then blocking for patientID using duplicate correlation), I forgot to mention in my first post a limitation of our samples. We have one batch that only has samples for one patient (i.e. batchC only has samples p9-1, p9-2, p9-3, and p9-4). Therefore, we were wondering if this limits the model's ability to estimate what variability is attributable to batchC versus p9? We weren't able to avoid such limitation, so do you have any suggestions on what would be the ideal set up to account for this?
Regarding your second design suggestion (including patientID in the design). We attempted this, but upon running voom we got an error related to coefficients unable to be estimated.
We believe this error occurred because the model was not able to estimate the variance. We then tried splitting our samples into two groups based on disease and then completing a limma analysis on the counts of each disease group separately (patientID still included in the model as a blocking factor). However, this significantly worsened our FDRs, which we believe is related to the reduction in sample numbers.
Based on this and the limitation of batchC only having samples for one patient, what do you suggest is the most ideal design?
Thank you again for all of your help!
The presence of
BatchCfor patient 9 complicates matters. Assume that we've includedBatchin the design. In terms of variance estimation, patient 9's samples will still be used as they provide information regarding the variability between the timepoints. However, we'll lose a residual d.f. because the coefficient forBatchCmust be estimated. For comparisons between time points, patient 9 will still be useful, as you're just comparing within each patient who has a particular disease. However, for comparisons between diseases, theBatchCterm will absorb any changes in expression (i.e., patient 9's samples will not be used for this contrast).So, there's a greater cost in including the
Batchterm in the design. Nonetheless, it may still be worth doing so if it improves the results. For example, see if you get more DE genes with or without the batch effect.As for the second design, I forgot that the
PatientIDfactor will absorb theDiseasefactor. This means that the coefficients for the latter will not be estimable. To avoid this, just use a design matrix ofdesign <- model.matrix(~0+PatientID+Time:Disease,sample.metadata) design <- design[,-match(c("Time1:DiseaseA", "Time1:DiseaseB"), colnames(design))]Each of the interaction terms refers to the change from time point 1 within each patient. The test for the effect of time within each disease can then be easily performed by dropping the corresponding coefficient. The second command in the block above just gets rid of some unestimable coefficients; time point 1 is the baseline for each patient and is represented by the coefficient from the
PatientIDterm, so there's no need for anotherTime1coefficient.To be clear, I'm suggesting two designs here, one for each analysis. The design with the
PatientIDis only useful for comparing between time points, but not between diseases.Note that the above
designrequiresTimeto be a factor, not an integer type. Otherwise, you'll end up with a very strange design.