Question

DESeq2/edgeR DEG analysis data input from TCGA and correation analysis

0

Entering edit mode

JackieMe • 0

@jackieme-14035

Last seen 4.7 years ago

Hello alll. I am a newbie in bioinformatics and still learning.

The thing is, I want to do differential expressed gene(DEG) analysis in the TCGA RNA-seq data.

DESeq2 and edgeR manuals said they require raw count data as input, so I downloaded STAD.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes__data.data

form [FireHose] (http://gdac.broadinstitute.org/) and exact count column from it.

My code for DEGs in tumor vs nomral goes like:

library(edgeR)
library(dplyr)
d <- DGEList(counts=count_data, group=group)
d <- calcNormFactors(d)
design <- model.matrix(~group)
d <- estimateDisp(d,design)
f <- glmFit(d,design)
lrt <- glmLRT(f, coef = 2)
topTags(lrt, n =20)
DEG.edgeR <- topTags(lrt, n=Inf, sort.by = "logFC", adjust.method = "BH")$table
DEG.edgeR$gene <- rownames(DEG.edgeR)
DEG.edgeR.sig <- filter(DEG.edgeR, abs(logFC) >= 2 & FDR < 0.05)

and I get a data frame with DEGs sorted by logFC. Here comes my question:

1. Am I using the right data for DESeq2/edgeR and is my code ok?

2. Suppose I am right on 1, now I want to do a correalation analysis among the DEGs' expressinon, what data should I use?

Since I now only have raw count data, it varies so much and it seems I should perform some kind of normalization first and then do correlation analysis. Is it right?

Also, if I want to do a boxplot presantation of a gene's expression in tumor and normal sample, common practice is to use TCGA RPKM/FPKM data for this. Does this mean I have to download RPKM/FRKM data too just for correlation analysis and ploting?
It sounds weired to me to using one data to do DEG analysis and another to do correlation analysis and ploting.

edgeR TCGA • 3.0k views

ADD COMMENT • link updated 6.6 years ago by Aaron Lun ★ 28k • written 6.6 years ago by JackieMe • 0

score 1 · Answer 1 · 2017-09-28

You don't have any code for DESeq2 here, so I'll ignore that part of your post.

Anyway, here's a heap of issues for your first question:

You should filter out low-abundance genes before running through edgeR.
You should be using the QL pipeline, i.e., glmQLFit and glmQLFTest.
You should not be selecting DE genes with a log-fold change threshold. If you need to incorporate log-fold change information into the definition of your DE genes, you should be using glmTreat instead.

As for your second question, I don't really understand what you mean by a correlation analysis. By definition, all of the DEGs will have expression profiles that positively correlated with each other if they have the same sign of the log-fold change, and negatively correlated if the signs are different. The more interesting analysis would be to see if the residuals are correlated between genes - this smells like network reconstruction to me, see WGCNA.

Finally, if you want to get RPKMs, you can use the rpkm function in edgeR. However, this requires you to know the length of the transcripts used during alignment. You'll probably need to spend some time figuring this out, e.g., finding the right version of the annotation, figuring out the alignment settings that they used. Or, you could just download the RPKM values from the server, assuming that they used the RSEM expected counts to compute the RPKMs on their side. Your call.