Dear Bioconductor Community,
I am trying to do a differential expression analysis between different cell types, where I have four different cell types from two different experiments. Two of the cell types come from one experiment and four cell types come from the other. I have a total of 18 samples. I have also detected that there is a batch effect in the samples coming from the second experiment. How would I do a differential expression analysis when I have a batch effect in only 12 of my 16 samples? Would I make a design matrix that includes the batch effect as well as the experiment number? Below is an example of what I am describing.
cellType <- factor(c("cellType1","cellType2","cellType1","cellType2","cellType1","cellType2",rep("cellType3",3),rep("cellType4",3),rep("cellType5",3),rep("cellType6",3)))
batches2 <- factor(c(0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,1,0,0))
experiment <- factor(c(0,0,0,0,0,0,rep("1",12)))
myDesign <- model.matrix(~0+cellType+experiment+batches2)
colnames(myDesign) <- c("cellType1","cellType2","cellType3","cellType4","cellType5","cellType6","experiment","batch")
Also, if I wanted to get the values of the batch corrected data after doing differential expression analysis would I put the experiment as a covariate or would I include it in the design? Below is an example of both methods.
log2RPKMmat <- matrix(sample(1:16,90,replace=T), ncol=18)
batchVals_Covariate <- removeBatchEffect(log2RPKMmat,batch=batches2,design=model.matrix(~0+cellType),covariates=experiment)
batchVals_noCovariate <- removeBatchEffect(log2RPKMmat,batch=batches2,design=model.matrix(~0+cellType+experiment))
I also looked at the paper "Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses." It says that a two-way ANOVA model (which I believe I may be describing above) could cause false positives because my study is unbalanced. Are there any other methods you would propose?
Thank you!
Sam
Thanks Aaron for the help. I updated the question above to fix the mix up in labeling the design matrix as well. Unfortunately though, that design matrix did not work. When I used it in the estimateDisp function of edgeR I got an error saying the Design matrix was not of full rank. Do you know why this happened? I looked up the error and found this post: EdgeR Design matrix not of full rank but it does not seem like their are any dependancies in my design matrix.
Again, thanks for your help.
Sam
That shouldn't be the case - are you sure you're running
estimateDisp
with the samemyDesign
that you construct in your post? When I run your code, I get a design matrix of full rank (as checked viaqr
).You are right. I realized I made a mistake in the experiment variable. Instead of
<font face="monospace">it should be. </font>
<font face="monospace">This is because my first experiment is for the first 6 samples, and the second experiment is for the last 12 samples. I edited my question above to reflect the change.</font>
Okay, well, you have a problem, because cell types 3-6 are only present in one experiment and cell types 1-2 are only present in the other. This means that there's no way to distinguish between the experiment effect and that of the difference between cell types. To be able to fit a GLM, you need to drop the experiment factor from your design matrix to estimate the average expression levels for each cell type. However, keep in mind that you cannot compare between cell types from different experiments. edgeR won't stop you from doing so, but you really shouldn't.