Dear all,
I'm struggeling to find the correct approach to handling batch effects. I have a longitudinal RNA-Seq experiment where patient samples were measured in 3 batches: First the samples from the inital visit (visit 0, batch "baseline") were sequenced, some had to be re-sequenced due to QC issues (batch "baseline_reseq") and finally the follow-up samples from visits 1, 2, 3 came in and were sequenced in the third batch. Here is an example of the meta data matrix:
patient | visit | batch |
---|---|---|
s1 | 0 | baseline |
s2 | 0 | baseline_resq |
s1 | 1 | follow-up |
s3 | 0 | baseline |
s2 | 1 | follow-up |
s3 | 1 | follow-up |
For now we're only looking for differentailly expressed genes between visit 0 and visit 1.
I think I should fit a mixed model using the duplicateCorrelation function to account for the batch effects like in this example.
## Limma with batch as blocking factor
dge <- DGEList(counts=counts)
design.test <- data.frame(group=c(rep(1,2), rep(2,2), rep(3,2)), batch=c(1,2,1,2,1,2))
design <- model.matrix(~group, design.test)
dge <- calcNormFactors(dge)
v <- voom(dge, design, plot=FALSE)
corfit <- duplicateCorrelation(v, design, block=design.test$batch)
# re-run voom & dupCorr to get a more accurate fit (recommended [here][1])
v <- voom(dge, design, block=design.test$batch, correlation = corfit$consensus)
corfit <- duplicateCorrelation(v, design, block=design.test$batch)
fit <- lmFit(v, design, block=design.test$batch, correlation=corfit$consensus)
results <- topTable(fit, coef="TimepointT120",
n=nrow(elist), sort.by="none")
But how do I account for the patient in this case? Would it make sense to combine patient and batch and use the combined factor as blocking factor?
Your advice is much appreciated!
Julia