I ran a differential expression analysis on the cBioPortal website, and a second one locally using TCGAbiolinks to download the TCGA files and DESeq2 to test for differential expression. Although the overlap of differentially expressed (DE) genes is reasonable (58%), I would like to understand why the two pipelines do not agree more closely.
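Since an overlap percentage depends on how it is defined, here is a minimal sketch of three common ways to compute it. The gene names are placeholders rather than real results, and the use of intersection over union (Jaccard) is an assumption for illustration, not necessarily the definition behind the 58% above.

```r
# Hypothetical DE gene lists; names are placeholders, not real results.
de_deseq2     <- c("geneA", "geneB", "geneC", "geneD")
de_cbioportal <- c("geneB", "geneC", "geneD", "geneE", "geneF")

shared <- intersect(de_deseq2, de_cbioportal)  # DE in both pipelines
either <- union(de_deseq2, de_cbioportal)      # DE in at least one pipeline

# Three ways to express the overlap; they generally give different numbers.
overlap <- c(
  jaccard       = length(shared) / length(either),
  of_deseq2     = length(shared) / length(de_deseq2),
  of_cbioportal = length(shared) / length(de_cbioportal)
)
print(round(100 * overlap, 1))

# Genes called DE by only one pipeline are the ones worth inspecting.
only_deseq2     <- setdiff(de_deseq2, de_cbioportal)
only_cbioportal <- setdiff(de_cbioportal, de_deseq2)
```

Listing `only_deseq2` and `only_cbioportal` separately makes it easier to see whether the disagreement is concentrated in one direction, for example in low-count genes.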
This is the analysis using cBioPortal: https://www.cbioportal.org/comparison/mrna?comparisonId=63b2d2551cec6922c422d9a2
I noticed the DE genes in my analysis that are not statistically DE in cBioPortal are mostly genes with low counts. I have already tried to filter these genes using:
keep <- rowSums( counts(dds) >= 5 ) >= 50 # since I am working with >950 samples
dds <- dds[keep,]
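To make the effect of this filter concrete, here is a minimal sketch that applies the same rule (at least 5 counts in at least 50 samples) to a plain count matrix. The data are simulated, not TCGA counts, and the matrix dimensions, gene means, and dispersion are assumptions chosen only for illustration.

```r
# Toy illustration of the pre-filter on a plain count matrix (simulated, not TCGA).
set.seed(1)
n_genes   <- 200
n_samples <- 1000  # assumption: roughly mirrors a cohort of >950 samples

# Hypothetical per-gene mean counts, from very low to high expression.
gene_means <- rep(c(0.05, 0.5, 50, 500), each = n_genes / 4)

# matrix() fills column by column, so cycling gene_means once per sample
# gives every row (gene) its own mean across all samples.
counts_mat <- matrix(
  rnbinom(n_genes * n_samples, mu = rep(gene_means, times = n_samples), size = 2),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("sample", seq_len(n_samples)))
)

min_count   <- 5   # same thresholds as the filter above
min_samples <- 50

# Keep genes with at least min_count reads in at least min_samples samples.
keep     <- rowSums(counts_mat >= min_count) >= min_samples
filtered <- counts_mat[keep, , drop = FALSE]

# How many genes survive, broken down by their simulated mean.
print(table(mean = gene_means, kept = keep))
```

With about 1000 samples, 50 samples is only around 5% of the cohort, so a gene that is close to zero in most samples can still pass if it reaches 5 counts in a small subset.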
But I still see genes with very low expression among the top DE genes (those with the lowest padj) in my analysis.
Some examples that can be checked in the link above (these genes are DE in my analysis, but not in cBioPortal): "ALDH3A1", "STEAP1B", "GHSR", "KRT12"
One gene, "OR3A3", for example, is up-regulated in the “High” group in cBioPortal, but downregulated in the “High” group in my analysis.
Is there a way to get a better overlap, or is this as good as it gets?
I will be glad to provide more details.