Hi all,
I basically have a bunch or RNA-SEQ samples that cluster into different groups under a certain gene signature, the samples are from two different batches so batch needs to be factored. Say I want to take a quarter of my samples that respond best to some treatment and the then another quarter that does the worst and see if there is any differentially expressed genes.
Option number 1: Split the dataset into two different groups, one for for the first cluster and one for the other cluster and run DEG analysis as outlined in the Limma Manual https://www.bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf (page 44)
# Let Batch Correspond to Batch Variable and Group correspond to to response group X0 is RNA-SEQ Matrix
# G=Good group P= Poor Group
design<- model.matrix(~Group+Batch)
y<-voom(X0,design,plot=T)
fit1<-lmfit(y,design)
fit<-eBayes(fit1)
tpt.table<-topTable(fit,coef="GroupG",dort.by'p',number=Inf,adjust="BH")
There is also another way I have seen it been done with contrasts matrices
design<-model.matrix(~0+Group+Batch)
cmat<-makeContrasts(diff=GroupG-GroupP,levels=design)
y<-voom(X0,design,plot=T)
fit1<-lmfit(y,design)
fit2<-contrasts.fit(fit1,cmat)
fit<-eBayes(fit2)
tpt.table(fit,sort.by='p',number=Inf,adjust="BH")
So the first thing I notice about the results if I do it these two different ways is that the results are different. Generally, the first one tends to be more conservative. Is there a better or right way statistically speaking?
In the case of my specific experiment this method, of breaking the dataset into two sets by cluster and running DEG analysis separately, yielded no statistically significant genes.
I figured to increase the power I might use the whole dataset and then treat cluster as another batch variable like so
design<-model.matrix(~Group+Cluster+Batch)
y<-voom(X0, design,plot=T)
fit1<-lmfit(y,design)
fit<-eBayes(fit1)
tpt.table<-topTable(fit,coef='GroupG',sort.by='p',number=Inf,adjust="BH")
Doing it this way now yielded quite a number of significant genes, even if the number of samples is similar with doing independently by cluster (say cut each cluster in half) , it still performs well which makes me somewhat suspicious.
Additionally, you can approach it using the make contrasts way
design<-model.matrix(~0+Group+Cluster+Batch)
cmat<-makeContrasts(diff=GroupG-GourpP,levels=design)
y<-voom(X0,design,plot=T)
fit1<-lmfit(y,design)
fit2<-contrasts.fit(fit1,cmat)
fit<-eBayes(fit2)
tpt.table<-topTable(fit,sort.by='p',number=Inf,asjust="BH")
doing it this way led to a 50% increase in DEGs that met a 0.05 BH corrected significance cutoff. So why are they so drastically different? Should I take another approach such as not model cluster as a batch effect but as an additional effect parameter than form a contrasts matrix that only looks at the Groups?. Any help would be much appreciated.
Using
~Group
without the0+
is correct. I will explain why when I have more time.