Hello alll. I am a newbie in bioinformatics and still learning.
The thing is, I want to do differential expressed gene(DEG) analysis in the TCGA RNA-seq data.
DESeq2 and edgeR manuals said they require raw count data as input, so I downloaded STAD.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes__data.data
form [FireHose] (http://gdac.broadinstitute.org/) and exact count column from it.
My code for DEGs in tumor vs nomral goes like:
library(edgeR) library(dplyr) d <- DGEList(counts=count_data, group=group) d <- calcNormFactors(d) design <- model.matrix(~group) d <- estimateDisp(d,design) f <- glmFit(d,design) lrt <- glmLRT(f, coef = 2) topTags(lrt, n =20) DEG.edgeR <- topTags(lrt, n=Inf, sort.by = "logFC", adjust.method = "BH")$table DEG.edgeR$gene <- rownames(DEG.edgeR) DEG.edgeR.sig <- filter(DEG.edgeR, abs(logFC) >= 2 & FDR < 0.05)
and I get a data frame with DEGs sorted by logFC. Here comes my question:
1. Am I using the right data for DESeq2/edgeR and is my code ok?
2. Suppose I am right on 1, now I want to do a correalation analysis among the DEGs' expressinon, what data should I use?
Since I now only have raw count data, it varies so much and it seems I should perform some kind of normalization first and then do correlation analysis. Is it right?
Also, if I want to do a boxplot presantation of a gene's expression in tumor and normal sample, common practice is to use TCGA RPKM/FPKM data for this. Does this mean I have to download RPKM/FRKM data too just for correlation analysis and ploting?
It sounds weired to me to using one data to do DEG analysis and another to do correlation analysis and ploting.