DESeq2/edgeR DEG analysis data input from TCGA and correation analysis
Entering edit mode
JackieMe • 0
Last seen 2.5 years ago

Hello alll. I am a newbie in bioinformatics and still learning.

The thing is, I want to do differential expressed gene(DEG) analysis in the TCGA RNA-seq data.

DESeq2 and edgeR manuals said they require raw count data as input, so I downloaded

form [FireHose] ( and exact count  column from it.

My code for DEGs in tumor vs nomral goes like:

d <- DGEList(counts=count_data, group=group)
d <- calcNormFactors(d)
design <- model.matrix(~group)
d <- estimateDisp(d,design)
f <- glmFit(d,design)
lrt <- glmLRT(f, coef = 2)
topTags(lrt, n =20)
DEG.edgeR <- topTags(lrt, n=Inf, = "logFC", adjust.method = "BH")$table
DEG.edgeR$gene <- rownames(DEG.edgeR)
DEG.edgeR.sig <- filter(DEG.edgeR, abs(logFC) >= 2 & FDR < 0.05)

and I get a data frame with DEGs sorted by logFC. Here comes my question: 

1. Am I using the right data for DESeq2/edgeR and is my code ok? 

2. Suppose I am right on 1, now I want to do a correalation analysis among the DEGs' expressinon, what data should I use?     

Since I now only have raw count data, it varies so much and it seems I should perform some kind of normalization first and then do correlation analysis. Is it right?

Also, if I want to do a boxplot presantation of a gene's expression in tumor and normal sample, common practice is to use TCGA RPKM/FPKM data for this. Does this mean I have to download RPKM/FRKM data too just for correlation analysis and ploting?
It sounds weired to me to using one data to do DEG analysis and another to  do correlation analysis and ploting. 

edgeR TCGA • 2.2k views
Entering edit mode
Aaron Lun ★ 27k
Last seen 23 hours ago
The city by the bay

You don't have any code for DESeq2 here, so I'll ignore that part of your post.

Anyway, here's a heap of issues for your first question:

  • You should filter out low-abundance genes before running through edgeR.
  • You should be using the QL pipeline, i.e., glmQLFit and glmQLFTest.
  • You should not be selecting DE genes with a log-fold change threshold. If you need to incorporate log-fold change information into the definition of your DE genes, you should be using glmTreat instead.

As for your second question, I don't really understand what you mean by a correlation analysis. By definition, all of the DEGs will have expression profiles that positively correlated with each other if they have the same sign of the log-fold change, and negatively correlated if the signs are different. The more interesting analysis would be to see if the residuals are correlated between genes - this smells like network reconstruction to me, see WGCNA.

Finally, if you want to get RPKMs, you can use the rpkm function in edgeR. However, this requires you to know the length of the transcripts used during alignment. You'll probably need to spend some time figuring this out, e.g., finding the right version of the annotation, figuring out the alignment settings that they used. Or, you could just download the RPKM values from the server, assuming that they used the RSEM expected counts to compute the RPKMs on their side. Your call.

Entering edit mode

Thanks for your reply, Aaron! I saw your answer days ago and realized I haven't understand how edgeR works so I went back reading the edgeR User Guild, took me a while to get a glimpse of the capability and usage of edgeR.
To answer your suggestion:

  • Yes, I already filtered low-abundance genes out. I just forgot to paste the code.
  • As I understand, the QL and the classic pipeline are two alternatives for DE analysis, while QL is recommended. I use both to do D analysis and see the top 100 DE genes with maximum logFC and the 100 genes from above two methods actually get 94 in common. Am I right about this?
  • Yes, the user guide proposed a lfc option in glmTreat. I was wrong to filter the DEG using dplyr.

As for correlation analysis, you're right again. Since all genes are DE genes, they might be supposed to be  positively or negatively correlated with each other. I was wrong to analysis like that.  I know about WGCNA, but they don't recommend feeding DEG result to WGCNA. I may have to find another way.

I want RPKM mainly because I am to do some plot presentation of the expression data ( boxplot, barplot etc). Raw count data is not proper for that. I finally get log transformed RPKM value using the following code:

rpkm <- rpkm(d, log = TRUE)

after getting gene length using getGeneLengthAndGCContent from EDASeq package.


Login before adding your answer.

Traffic: 270 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6