I have been working with data from the recount3 project to integrate GTEx and TCGA data and perform DEG analysis using DESeq2. However, I am getting too many significant genes when using datasets with large sample sizes, such as TCGA-COAD and GTEx colon tissue.
This phenomenon is also mentioned here (PMID: 35199033), which reports that differentially expressed genes (DEGs) detected between TCGA primary tumor and GTEx normal colon tissue samples account for 92% of all input genes.
When using limma for analysis, the treat function can help address this issue by computing empirical Bayes moderated-t p-values relative to a minimum fold-change threshold.
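To make the idea of testing against a minimum fold-change threshold concrete, here is a minimal Python sketch. It is not limma's implementation: it uses a plain normal approximation instead of empirical Bayes moderated t-statistics, and the function name `threshold_test_pvalue` and the example numbers are hypothetical. It only shows why a small but precisely estimated log fold change is highly significant against a null of zero, yet not significant against a null of |logFC| <= tau.

```python
from statistics import NormalDist


def threshold_test_pvalue(log_fc, se, tau=0.0):
    """P-value for H0: |true logFC| <= tau (normal approximation).

    log_fc: estimated log2 fold change (hypothetical input)
    se:     standard error of that estimate
    tau:    minimum log2 fold change of interest; tau=0 gives the
            ordinary two-sided test against zero
    """
    if se <= 0:
        raise ValueError("se must be positive")
    if tau < 0:
        raise ValueError("tau must be non-negative")
    z = NormalDist()
    abs_fc = abs(log_fc)
    # Sum of the two tail probabilities measured from +tau and -tau.
    upper = 1.0 - z.cdf((abs_fc - tau) / se)
    lower = 1.0 - z.cdf((abs_fc + tau) / se)
    return min(1.0, upper + lower)


if __name__ == "__main__":
    # With hundreds of samples per group the standard error becomes tiny,
    # so even a small effect is "significant" when tested against zero.
    small_effect = {"log_fc": 0.2, "se": 0.02}
    large_effect = {"log_fc": 2.0, "se": 0.02}
    for name, gene in (("small", small_effect), ("large", large_effect)):
        p_zero = threshold_test_pvalue(gene["log_fc"], gene["se"], tau=0.0)
        p_tau = threshold_test_pvalue(gene["log_fc"], gene["se"], tau=1.0)
        print(name, "p vs 0:", p_zero, "p vs |logFC|<=1:", p_tau)
```

Under these assumed numbers, only the gene with the large effect stays significant once the threshold is applied, which is the behaviour a thresholded test is meant to provide.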
Now I have two questions:
- Is there a similar solution available for DESeq2?
- Is there a better approach to address this problem? Even after using treat, many significant genes remain.
These two datasets are from completely different experiments/batches. It is utterly meaningless to compare them. I would suggest a comparative analysis within subtypes, using only TCGA data.