Hi,
I have an in-house RNA-seq dataset, and I would like to compare the expression of some genes to TCGA samples (public data). I am not talking about differential expression analysis here, but rather a descriptive analysis.
What I would like to do is first apply the VST to all the data together, then run ComBat on the output.
Is this the right way to perform this kind of analysis?
Thank you
Many thanks for your reply.
My goal is to describe my dataset relative to the TCGA data: do my samples have the same expression levels as the TCGA samples for a set of genes (boxplots, and PCA/clustering if possible)? The absolute expression level is not really important; I am more interested in the trend.
I moved your post to a comment, because you had posted it as an "Answer" to your original question.
You can see what the ComBat authors say, but if the data is perfectly confounded (every sample in one batch belongs to one group), I don't think these batch correction tools can help at all.
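To make the confounding problem concrete, here is a minimal sketch in plain Python. It is not ComBat: the `center_by_batch` helper, the batch labels, and the expression values are all made up for illustration, and the adjustment is a crude location-only centering. It assumes that "batch" and "data source" are the same variable, as they would be when all in-house samples come from one protocol and all TCGA samples from another.

```python
from statistics import mean

def center_by_batch(values, batches):
    """Subtract each batch's mean from its samples (crude location-only adjustment)."""
    batch_means = {
        b: mean(v for v, vb in zip(values, batches) if vb == b)
        for b in set(batches)
    }
    return [v - batch_means[b] for v, b in zip(values, batches)]

# Hypothetical log-scale expression values for a single gene.
in_house = [8.1, 8.4, 7.9]
tcga = [5.0, 5.3, 4.8, 5.1]

values = in_house + tcga
# Batch is perfectly confounded with data source: one label per source.
batches = ["in_house"] * len(in_house) + ["tcga"] * len(tcga)

adjusted = center_by_batch(values, batches)

diff_before = mean(in_house) - mean(tcga)
diff_after = mean(adjusted[:len(in_house)]) - mean(adjusted[len(in_house):])

print(f"in-house minus TCGA before adjustment: {diff_before:.3f}")
print(f"in-house minus TCGA after adjustment:  {diff_after:.3f}")
```

After the adjustment the difference between the two sources is numerically zero whatever the starting values were, so the corrected data can no longer say whether the in-house samples are truly higher or lower than TCGA for that gene.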
You can try removing GC dependence trends using Bioconductor software like cqn and EDASeq (you would provide these tools with the counts, not the VST values).
I am not sure I understand how GC correction would correct for protocol and batch effects?
Some of the technical differences in counts across batches can be removed by modeling the dependence of counts on the GC content of the features (and on their length as well), as those two packages do (see their citations for details of the methods). But I don't know of any method that claims to remove all protocol and/or batch effects when those are perfectly confounded with the comparison of interest.
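As a rough illustration of what "modeling the dependence of counts on GC content" means, here is a small sketch in plain Python. It is far simpler than what cqn or EDASeq actually do and does not use either package: it fits a straight line to log2 counts against GC fraction within one sample and subtracts the fitted trend. The function names (`ols_slope`, `remove_linear_gc_trend`) and all numbers are hypothetical.

```python
import math
from statistics import mean

def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    x_mean, y_mean = mean(x), mean(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    return sxy / sxx

def remove_linear_gc_trend(counts, gc):
    """Return log2(count + 1) values with a linear GC trend subtracted.

    The trend is centered on the mean GC, so the average log count is kept.
    """
    log_counts = [math.log2(c + 1) for c in counts]
    slope = ols_slope(gc, log_counts)
    gc_mean = mean(gc)
    adjusted = [y - slope * (g - gc_mean) for y, g in zip(log_counts, gc)]
    return adjusted, slope

# Hypothetical raw counts for six genes in one sample, with their GC fractions.
gc = [0.35, 0.40, 0.45, 0.50, 0.55, 0.60]
counts = [120, 150, 210, 260, 340, 400]

adjusted, slope = remove_linear_gc_trend(counts, gc)
residual_slope = ols_slope(gc, adjusted)

print(f"fitted GC slope on log2 counts: {slope:.3f}")
print(f"slope remaining after adjustment: {residual_slope:.3e}")
```

Note that the input is the counts, not VST values. Doing this per sample removes only the part of the technical difference that tracks GC content; any protocol or batch effect unrelated to GC (or length) is left untouched.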
I see your point, thank you! I know this is really tricky. I will try what you have suggested. Thank you again for your help!