Dear experts!
I am performing an EdgeR analysis in TCGA samples. I have a subset of the 83 patients I want to check their differential expression in comparison to the 41 samples of normal tissue. Checking for the differential expression of known critical genes (basically almost all the genes taking part in signaling pathways involved in cancer) I see that almost all of the 297 genes have a logFC < 0.8.
In general, the results include
Genes with counts > 1 : 21251
Significant DEGs FDR < 0.05 16222
Significant DEGs FDR < 0.05 with logFC > 2 1640
Significant DEGs FDR < 0.05 with logFC < -2 1664
I would assume that not all, but at least some, would show a high differential expression. I am concerned that the results are kind of "diluted" or the problem comes from the fact that the samples are not paired (normal samples come from other patients).
I have performed this analyis in 4 different subset of patients and the results are almost the same when it comes to the known critical genes.
This is the code I used during my analysis.
library(TCGAbiolinks) library(edgeR)
query = GDCquery(project = "TCGA-COAD", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "HTSeq - Counts", barcode = c(subtype1.patients$IDs,Normal$IDs)) GDCdownload(query) # Save in matrix form exp_subtype1_NORMAL_SE = GDCprepare(query=query, save=TRUE, save.filename = "exp_subtype1_NORMAL.rda", summarizedExperiment = FALSE) exp_subtype1_NORMAL_SE = exp_subtype1_NORMAL_SE[-c(1:5),] rownames(exp_subtype1_NORMAL_SE) = exp_subtype1_NORMAL_SE$X1 exp_subtype1_NORMAL_SE = exp_subtype1_NORMAL_SE[,-1] #Define groups and model matrix groups = factor(c(rep("subtype1", 83), rep("normal",41))) groups = relevel(groups, ref = "normal") design = model.matrix(~groups, y$samples) #Convert to DGEList y1 <- DGEList(counts=exp_subtype1_NORMAL_SE,genes = rownames(exp_subtype1_NORMAL_SE), group = groups) # Filter out low expressed genes keep <- rowSums(cpm(y1)>1) >= 2 y1 <- y1[keep, , keep.lib.sizes=FALSE] y1 <- calcNormFactors(y1) y1 <- estimateDisp(y1) et1 <- exactTest(y1) tt1 <- topTags(et, n=nrow(y), p.value=0.05)
Here is the first three columns of the expression matrix used in the start of the analysis.
head(exp_subtype1_NORMAL_SE)[,1:3] TCGA-AA-3815-01A-01R-1022-07 TCGA-AA-A02R-01A-01R-A00A-07 TCGA-CK-4951-01A-01R-1410-07 ENSG00000000003.13 2449 1040 6513 ENSG00000000005.5 6 0 8 ENSG00000000419.11 487 450 959 ENSG00000000457.12 269 223 307 ENSG00000000460.15 177 174 246 ENSG00000000938.11 331 415 324
Hope I have provided everything needed.
Thank you in advance.
R.
Hi Aaron,
Thanks for your comment. The patients groups I am analyzing are grouped together according to their molecular subtypes (that themselves are defined based on expression data), so I would assume that this reduces the variability across the samples. Please correct me if I am wrong.
As for your proposed solutions, I doublechecked both sample and gene annotations and everything seems to be as it should. When it comes to the genes I mentioned, I am referring to the genes involved in known disregulated pathways in colon cancer, so I would assume that at least some of them should show differential expression?
Would you assume that the low logFC could be because the majority of the samples are unpaired? (Some of the tumor samples have normal samples, while others do not)
Any additional advices would be much appreciated :)
R.
If you are looking for signals associated with groups of genes, perhaps you want to consider gene set enrichment analysis approaches. As for low logFC, those are not that unusual in human tumor systems. I cannot say with certainty that nothing you are doing is incorrect, but I would not necessarily use logFC as a marker of something being wrong. You mention that you have thousands of genes with significant results and that is what I would expect with a tumor/normal comparison.
I hope your subtypes are not defined from the same expression data that you're using for the DE analysis. This would be classic data dredging - see my comments on a related problem in the single-cell RNA-seq context here. Putting that aside, using subtypes will not necessarily protect you against sex and age effects (otherwise you would get meaningless "subtypes" like old-man-cancer). These clusters have a certain coarseness in which there is still variability to be modelled.
If you are confident that the sample/gene annotations are correct, the only remaining possibility is that your biological assumptions are wrong and these critical genes don't change much in expression. Don't like it? Take it up with TCGA. There are a number of possible reasons that require some deeper understanding of the system/disease than I currently have. For example, if the tumours are highly heterogeneous (even within subtypes), strong log-fold changes for driver genes in one tumour population might be "diluted" by near-zero log-fold changes in other populations. And it's entirely possible to disrupt critical genes without affecting expression, especially in signalling pathways dependent on post-translation modifications.
The absence of pairing should have very little (almost nothing!) to do with the log-fold change estimate.