edgeR. Known cancer genes are not differentially expressed?
Entering edit mode
rina ▴ 20
Last seen 3 days ago

Dear experts!

I am performing an EdgeR analysis in TCGA samples. I have a subset of the 83 patients I want to check their differential expression in comparison to the 41 samples of normal tissue. Checking for the differential expression of known critical genes (basically almost all the genes taking part in signaling pathways involved in cancer) I see that almost all of the 297 genes have a logFC < 0.8.

In general, the results include

Genes with counts > 1 : 21251

Significant DEGs FDR < 0.05 16222

Significant DEGs FDR < 0.05 with logFC > 2 1640

Significant DEGs FDR < 0.05 with logFC < -2 1664

I would assume that not all, but at least some, would show a high differential expression. I am concerned that the results are kind of "diluted" or the problem comes from the fact that the samples are not paired (normal samples come from other patients).

I have performed this analyis in 4 different subset of patients and the results are almost the same when it comes to the known critical genes.

This is the code I used during my analysis.

query = GDCquery(project = "TCGA-COAD", data.category = "Transcriptome Profiling", 
                 data.type = "Gene Expression Quantification", 
                 workflow.type = "HTSeq - Counts",
                 barcode = c(subtype1.patients$IDs,Normal$IDs))


# Save in matrix form
exp_subtype1_NORMAL_SE = GDCprepare(query=query, save=TRUE, save.filename = "exp_subtype1_NORMAL.rda", summarizedExperiment = FALSE)

exp_subtype1_NORMAL_SE = exp_subtype1_NORMAL_SE[-c(1:5),]
rownames(exp_subtype1_NORMAL_SE) = exp_subtype1_NORMAL_SE$X1
exp_subtype1_NORMAL_SE = exp_subtype1_NORMAL_SE[,-1]

#Define groups and model matrix
groups = factor(c(rep("subtype1", 83), rep("normal",41)))
groups = relevel(groups, ref = "normal")
design = model.matrix(~groups, y$samples)

#Convert to DGEList
y1 <- DGEList(counts=exp_subtype1_NORMAL_SE,genes = rownames(exp_subtype1_NORMAL_SE), group = groups)

# Filter out low expressed genes
keep <- rowSums(cpm(y1)>1) >= 2
y1 <- y1[keep, , keep.lib.sizes=FALSE]
y1 <- calcNormFactors(y1)
y1 <- estimateDisp(y1)

et1 <- exactTest(y1)
tt1 <- topTags(et, n=nrow(y), p.value=0.05)

Here is the first three columns of the expression matrix used in the start of the analysis.

                   TCGA-AA-3815-01A-01R-1022-07 TCGA-AA-A02R-01A-01R-A00A-07 TCGA-CK-4951-01A-01R-1410-07
ENSG00000000003.13                         2449                         1040                         6513
ENSG00000000005.5                             6                            0                            8
ENSG00000000419.11                          487                          450                          959
ENSG00000000457.12                          269                          223                          307
ENSG00000000460.15                          177                          174                          246
ENSG00000000938.11                          331                          415                          324

Hope I have provided everything needed.

Thank you in advance.


edger tcga • 604 views
Entering edit mode
Aaron Lun ★ 27k
Last seen 1 hour ago
The city by the bay

Your analysis betrays a certain amount of naivete regarding the processing of human patient data. There are many possible confounding factors that need to be considered in these data sets - for example, I would routinely include age, sex and genetic background/ethnicity in my model. This might possibly be joined by more cancer-specific factors such as Her2 status for breast cancers and smoking status for lung cancers. Failing to include these factors will inflate the apparent variability, thus reducing power for the entire analysis (due to the fact that empirical Bayes shrinkage propagates any over-estimation of the dispersion to all genes). It can also potentially result in misleading conclusions if the blocking factors are not fully orthogonal to disease status. Classic edgeR only accommodates pairwise tests, so I would suggest switching to the quasi-likelihood functions (glmQLFit and glmQLFTest) to include these additional factors, see this workflow.

If you still fail to observe strong log-fold changes in these "cancer-relevant genes", that narrows down the possible explanations. The p-value reported by edgeR is sensitive to various things like the dispersion but it's pretty hard to get the log-fold change so grossly wrong. I could only imagine that (i) the matching of sample annotation to columns of y1 is wrong, (ii) the matching of elements in the gene set to rows of y1 is wrong, or (iii) these "known critical genes" are perhaps not that critical.

Entering edit mode

Hi Aaron,

Thanks for your comment. The patients groups I am analyzing are grouped together according to their molecular subtypes (that themselves are defined based on expression data), so I would assume that this reduces the variability across the samples. Please correct me if I am wrong.

As for your proposed solutions, I doublechecked both sample and gene annotations and everything seems to be as it should. When it comes to the genes I mentioned, I am referring to the genes involved in known disregulated pathways in colon cancer, so I would assume that at least some of them should show differential expression?

Would you assume that the low logFC could be because the majority of the samples are unpaired? (Some of the tumor samples have normal samples, while others do not)

Any additional advices would be much appreciated :)


Entering edit mode

If you are looking for signals associated with groups of genes, perhaps you want to consider gene set enrichment analysis approaches. As for low logFC, those are not that unusual in human tumor systems. I cannot say with certainty that nothing you are doing is incorrect, but I would not necessarily use logFC as a marker of something being wrong. You mention that you have thousands of genes with significant results and that is what I would expect with a tumor/normal comparison. 

Entering edit mode

I hope your subtypes are not defined from the same expression data that you're using for the DE analysis. This would be classic data dredging - see my comments on a related problem in the single-cell RNA-seq context here. Putting that aside, using subtypes will not necessarily protect you against sex and age effects (otherwise you would get meaningless "subtypes" like old-man-cancer). These clusters have a certain coarseness in which there is still variability to be modelled.

If you are confident that the sample/gene annotations are correct, the only remaining possibility is that your biological assumptions are wrong and these critical genes don't change much in expression. Don't like it? Take it up with TCGA. There are a number of possible reasons that require some deeper understanding of the system/disease than I currently have. For example, if the tumours are highly heterogeneous (even within subtypes), strong log-fold changes for driver genes in one  tumour population might be "diluted" by near-zero log-fold changes in other populations. And it's entirely possible to disrupt critical genes without affecting expression, especially in signalling pathways dependent on post-translation modifications. 

The absence of pairing should have very little (almost nothing!) to do with the log-fold change estimate.


Login before adding your answer.

Traffic: 988 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6