Hello.
I have data that has 2 treatment groups (6 samples, 3 bio reps each group). For each bio rep, I have 9 technical reps, so 54 total samples. I have a co-variate of number of mutations and a date of experiment of batch effect. The size of the gene expression matrix = 5000 genes by 54 samples.
If I run Limma lmFit eBayes like so;
design<-model.matrix(~Date + mutations + treatment, data=pheno)
block<-pheno$block # this refers to the technical replicates
dupcor <- duplicateCorrelation(rmat,design,block=block)
dupcor$consensus # gives 0.193
fit1 <- lmFit(rmat,design,block=block,correlation=dupcor$consensus)
fit1 <- eBayes(fit1)
sum(topTable(fit1,coef="treatment",Inf)[,5]<=0.1) # gives 409 genes at 10% FDR (seems good).
# extracting the batch effect removed data like so, using my basic logic of trying to preserve the tech rep correlations by including block and treatment in the design, not sure if this right?
# known batch = date of experiment
# known covariate is the number mutations
# want to preserve treatment, and from my simple logic, block
y<-removeBatchEffect(rmat, batch=pheno$Date, covariates=pheno$mutations, design=model.matrix(~block + treatment, data=pheno))
The clustering and PCA plots of y look very nice in terms of first two components and clustering separating the treatment groups well.
For interest, if I now run the above Limma analysis using the corrected y matrix with
design<-model.matrix(~treatment, data=pheno)
block<-pheno$block # this refers to the technical replicates
dupcor <- duplicateCorrelation(y,design,block=block)
dupcor$consensus # now is 0.20
fit1 <- lmFit(y,design,block=block,correlation=dupcor$consensus)
fit1 <- eBayes(fit1)
sum(topTable(fit1,coef="treatment",Inf)[,5]<=0.1) # now is 1781 genes at 10% FDR (huge amounts).
I am a bit surprised after batch removal that the number of significant genes is vastly increased. I naively thought it will be similar numbers.
I think and understand that the first approach is the correct thing to do with Limma for differential gene expression, to include batch, mutations in the design and run duplicateCorrelations and lmFit etc; so this is the real result of differentially expressed genes. And to use the batch removal data only for things like MDS and PCA. This is great and has helped my work. I am just curious why big difference in significant genes, which is probably a mistake in my code/logic.
Thanks.
John.