I have a question regarding the design matrix set up for a multifactor time course RNASeq experiment.
My experiment is made up of 12 patients, the patients have one of two diseases A or B (6 A patients, 6 B patients), and each patient has samples at 4 time points (1, 2, 3, and 4). For unavoidable reasons, samples were prepped in several different batches; however, all samples from each patient (all time points) are in the same batch (i.e. no patient has samples in more than one batch). The following is the format of the sample metadata:
PatientID SampleID Time Disease Batch p1 p1-1 1 A batchA
p1 p1-2 2 A batchA
p1 p1-3 3 A batchA
p1 p1-4 4 A batchA
p2 p2-1 1 A batchB
p2 p2-2 2 A batchB
p2 p2-3 3 A batchB
p2 p2-4 4 A batchB
p3 p3-1 1 B batchB
p3 p3-2 2 B batchB
p3 p3-3 3 B batchB
p3 p3-4 4 B batchB
p4 p4-1 1 B batchA
p4 p4-2 2 B batchA
p4 p4-3 3 B batchA
p4 p4-4 4 B batchA
... .... .... ... ....
p12 p12-1 1 B batchB
p12 p12-2 2 B batchB
p12 p12-3 3 B batchB
p12 p12-4 4 B batchB
Following limma's user guide (Section 9.5), we combined condition and time into a single variable ("Treat") with values A.1, A.2, A.3, A.4, B.1, B.2, B.3, and B.4.
We are interested in the following questions:
1) What genes are differentially expressed compared to time 1 within each disease (i.e. A.2 vs A.1, A.3 vs A.1, A.4 vs A.1, B.2 vs B.1, B.3 vs B.1, B.4 vs B.1)
2) The interaction term - what genes are differentially induced or repressed between diseases? (i.e. (A.2 vs A.1) vs (B.2 vs B.1), (A.3 vs A.1) vs (B.3 vs B.1), and (A.4 vs A.1) vs (B.4 vs B.1))
We are using the following design:
design <- model.matrix(~0 + Treat,sample.metadata)
And then we block by patientID using duplicate correlation to account for inter-patient variability:
v <- voom(y,design) # y is our counts matrix corfit <- duplicateCorrelation(v, design, block=sample.metadata$patientID) v <- voom(y, design, block=sample.metadata$patientID, correlation=corfit$consensus)
Our question is related to controlling for batch effects. We are currently not including batch in our design for two reasons: 1) All of our contrasts are asking about changes relative to time 1 and thus are internally normalized. 2) Each patient's samples are in the same batch. No patient has samples in more than one batch; therefore, when we block by patientID using duplicate correlation, we also account for batch effects.
Is this reasoning correct? Or should be including batch in the model even though one patientID is not present in multiple batches and we are blocking by patientID?
Thank you for you help.
> sessionInfo() R version 3.1.1 (2014-07-10) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale:  en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages:  stats4 parallel stats graphics grDevices utils datasets methods base other attached packages:  gplots_2.14.2 ggplot2_1.0.0 annotate_1.43.5 XML_3.98-1.1 AnnotationDbi_1.27.16 GenomeInfoDb_1.1.23 IRanges_1.99.28  S4Vectors_0.2.4 DESeq_1.17.0 lattice_0.20-29 locfit_1.5-9.1 Biobase_2.25.0 BiocGenerics_0.11.5 biomaRt_2.21.1  edgeR_3.7.17 limma_3.21.19 loaded via a namespace (and not attached):  bitops_1.0-6 caTools_1.17.1 colorspace_1.2-4 DBI_0.3.1 digest_0.6.4 gdata_2.13.3 genefilter_1.47.6 geneplotter_1.43.0  grid_3.1.1 gtable_0.1.2 gtools_3.4.1 KernSmooth_2.23-13 MASS_7.3-34 munsell_0.4.2 plyr_1.8.1 proto_0.3-10  RColorBrewer_1.0-5 Rcpp_0.11.2 RCurl_1.95-4.3 reshape2_1.4 RSQLite_0.11.4 scales_0.2.4 splines_3.1.1 stringr_0.6.2  survival_2.37-7 xtable_1.7-4