Correct workflow for analyzing differentially expressed genes and correlation analyses on TCGA data?
Entering edit mode
Last seen 1 day ago

Hey everyone,

I want to analyze the expression of three specific genes in six TCGA cohorts. First, I want to know if the three genes are differentially expressed between tumor and normal samples in each individual TCGA cohort. Second, I want to know if the three genes are correlated in the tumor samples of each individual TCGA cohort.

My starting point for each TCGA cohort is a count matrix with counts for all genes for all samples and a corresponding colData containing information about the TCGA barcode, the sampleType (normal or tumor) and the batchID, which is important for removal of batch effects.

Here is my workflow for the differential expression analysis. As an example I picked the BRCA cohort.

ddsBRCA <- DESeqDataSetFromMatrix(countData = countsBRCA, colData = colDataBRCA, design = ~ batchID + sampleType)
#remove genes with low counts
ddsBRCA <- ddsBRCA[rowMedians(counts(ddsBRCA)) >= 10,]
ddsBRCA <- DESeq(ddsBRCA, parallel=TRUE)
#for a very small number of genes there was no convergence
ddsBRCA <- ddsBRCA[which(mcols(ddsBRCA)$betaConv),] 
resShrunkenBRCA <- lfcShrink(ddsBRCA, coef="sampleType_tumor_vs_normal", parallel=TRUE)

I did this for all six cohorts and now can check the regulation of my three genes of interest and probably have a look on the other differentially expressed genes.

For the correlation analysis I have the following workflow, which follows directly after the differential expression workflow.

# no correction for batch effect
vsdBRCA <- vst(ddsBRCA, blind=FALSE) 

# correction for batch effect
vsdBcBRCA <- vsdBRCA
matBRCA <- assay(vsdBcBRCA)
mmBRCA <- model.matrix(~sampleType, colData(vsdBcBRCA))
matBRCA <- limma::removeBatchEffect(matBRCA, batch=vsdBcBRCA$batchID, design=mmBRCA)
assay(vsdBcBRCA) <- matBRCA

I made it once with and without batch correction to see the effect of the batch correction. For the COAD cohort I obesrved a strong batch effect, which was nicely corrected with the batch correction.

The correlation alaysis should be based on the vsdBcBRCA object.

My questions is whether the differential expression workflow seems valid and if the correlation analysis can be performed on the combination of vst() and removeBatchEffect().

Thanks in Advance.


DESeq2 • 150 views
Entering edit mode
Last seen 4 days ago
United States

I don't have much time on the support site, so I have to restrict myself to software related questions. I don't see a problem with performing correlation following VST and batch effect removal.

Entering edit mode

Thanks for the reply and confirmation.

Probably important for other users that want to correlate gene expressions in TCGA:

I think that the combination of the DESeq2 normalization and vst transformation is perfect for the TCGA correlations. Usually you see in publications or in online tools (e.g. Gepia) correlations based on TPM values. What we encountered for some cancer cohorts with TPM values was that no matter which two genes you correlate the correlation is always positive. This could be attributed to differences in the transcripome complexities, which is accounted for by the DESeq2 but not TPM normalization.


Login before adding your answer.

Traffic: 272 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6