Which of these TCGA data types (raw count, scaled estimate, normalized_results) is best for finding differentially expressed genes (DEGs) with the DESeq2 package?
@roohallah1435-23238

Hi, I want to analyze TCGA data with the DESeq2 package. As you know, this database provides three types of expression data. This thread (http://seqanswers.com/forums/showthread.php?t=42911) describes them as follows.

1- raw counts: The (first) RSEM paper explains that the program calculates two values. One represents the (estimated) number of reads that aligned to a transcript. This value is not an integer because RSEM only reports a guess of how many ambiguously mapping reads belong to a transcript/gene. This number is what TCGA slightly misleadingly calls raw counts.

2- scaled estimate: The scaled estimate value on the other hand is the estimated frequency of the gene/transcript amongst the total number of transcripts that were sequenced. Newer versions of RSEM call this value (multiplied by 1e6) TPM - Transcripts Per Million. It's closely related to FPKM, as explained on the RSEM website. The important point is that TPM, like FPKM, is independent of transcript length, whereas "raw" counts are not!

3- normalized_results: The *.normalized_results files on the other hand just contain a scaled version of the raw_counts column. The values are divided by the 75th percentile and multiplied by 1000. This should make the values a bit more comparable between experiments.
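The arithmetic behind the three value types can be sketched in base R. The numbers below are made up for illustration, and taking the 75th percentile over all genes shown (rather than some filtered subset) is an assumption, not something the quoted description specifies.

```r
# Toy RSEM-style values for four genes in one sample (made-up numbers).
raw_count <- c(geneA = 120.37, geneB = 0, geneC = 15.5, geneD = 980.12)
scaled_estimate <- c(geneA = 2.0e-5, geneB = 0, geneC = 5.0e-6, geneD = 1.5e-4)

# TPM is the scaled estimate multiplied by one million.
tpm <- scaled_estimate * 1e6

# Upper-quartile scaling as described above:
# divide by the 75th percentile, then multiply by 1000.
# Assumption: the percentile is computed over all genes in this toy vector.
uq <- quantile(raw_count, probs = 0.75, names = FALSE)
normalized <- raw_count / uq * 1000

print(tpm)
print(normalized)
```

Note that after this scaling every sample has the same 75th percentile (1000), so the values no longer carry information about sequencing depth.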

I read the related posts on Biostars and support.bioconductor.org, but unfortunately they did not answer my question. Which of these data types is best for finding differentially expressed genes (DEGs)?

Thank you.


I have re-processed most of the TCGA RNA-seq data, originally from the HTSeq raw counts (when they were the only data available) and, more recently, from the RSEM expression levels. Taking the RSEM files, you can import these into DESeq2 via tximport: https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html#rsem
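A minimal sketch of that route is shown here. It assumes standard RSEM `*.genes.results` output; files with different column names (some TCGA downloads rename columns) would need adjusting first. The helper name `import_rsem_genes`, its arguments, and the `group` column are hypothetical and chosen only for illustration.

```r
# Hypothetical helper: import RSEM gene-level results and build a DESeqDataSet.
# `files`: named character vector of paths to RSEM *.genes.results files
# `sample_table`: data.frame with one row per file and a `group` column
import_rsem_genes <- function(files, sample_table) {
  missing_files <- files[!file.exists(files)]
  if (length(missing_files) > 0) {
    stop("RSEM files not found: ", paste(missing_files, collapse = ", "))
  }
  # Gene-level input and gene-level output, as in the tximport vignette.
  txi <- tximport::tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE)
  # RSEM can report a length of 0 for some genes; DESeq2 needs positive lengths.
  txi$length[txi$length == 0] <- 1
  DESeq2::DESeqDataSetFromTximport(txi, colData = sample_table, design = ~ group)
}
```

The returned object can then be passed to `DESeq()` as usual; the estimated counts and gene lengths from RSEM are handled by DESeq2 without manual rounding.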


This paper and dataset might also be relevant: https://www.ncbi.nlm.nih.gov/pubmed/26209429 (GSE62944 in Gene Expression Omnibus). It provides raw counts for most of TCGA.

@mikelove
United States

What kind of DE are you looking for? What groups of samples?


Comparison between normal and tumor samples.

1- Is the use of normalized data correct or is it recommended to use raw data?

2- Is my code correct?

My R script for the analysis is:

library(readxl)
library(DESeq2)
library(ggplot2)

options(stringsAsFactors = FALSE)

# Read the normalized results and move the gene names into the row names
GBM_normalized <- read_excel("GBM_normalized.xlsx")
GBMdata <- as.data.frame(GBM_normalized)
rownames(GBMdata) <- GBMdata$Genenames
GBMdata <- GBMdata[, -1]
GBMdata <- as.matrix(GBMdata)
mode(GBMdata) <- "integer"

# First 5 columns are labelled normal, the remaining 156 tumor
GBMdata_nt <- GBMdata[, 1:161]
gr_nt <- factor(c(rep("normal", 5), rep("Tumor", 156)))
colData_nt <- data.frame(group = gr_nt, type = "paired-end")

cds_nt <- DESeqDataSetFromMatrix(GBMdata_nt, colData_nt, design = ~group)
cds_nt <- DESeq(cds_nt)
cnt_nt <- log2(1 + counts(cds_nt, normalized = TRUE))

# Find DEGs
res_nt <- data.frame(results(cds_nt, c("group", "Tumor", "normal")))
res_nt$genename <- rownames(res_nt)
res_nt$padj <- p.adjust(res_nt$pvalue, method = "BH")
res_nt <- res_nt[order(res_nt$padj), ]

ggplot(res_nt, aes(log2FoldChange, -log10(padj), color = log2FoldChange)) + geom_point() + theme_bw()

Kevin Blighe
@kevin
Republic of Ireland

1- Is the use of normalized data correct or is it recommended to use raw data?

The recommended input to DESeq2 is stated in the vignette and manual pages, i.e., raw counts.

roohallah1435, what is contained in GBM_normalized.xlsx, and why is the data in an Excel file at all? Having data in Excel format can result in numerous types of formatting issues. If you want help, then please help us: this is the very first time that you have mentioned the file GBM_normalized.xlsx.

Judging by the description provided in your original post, and your subsequent code that you've provided, you are taking scaled raw counts, forcing them back to integers, and then normalising them in DESeq2 (?) - this does not seem correct to me.
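To illustrate the concern, here is a small base-R sketch with made-up scaled values. It shows what `mode(x) <- "integer"` does to non-integer input: the values are truncated toward zero, not rounded, and the result is not a count of reads.

```r
# Made-up upper-quartile-scaled values for three genes in two samples.
scaled <- matrix(c(0.94, 1532.78, 12.49,
                   0.31, 1498.02, 12.51),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))

coerced <- scaled
mode(coerced) <- "integer"  # truncates toward zero; 0.94 becomes 0

print(scaled)
print(coerced)
```

Low-expressed genes lose their signal entirely, and because the input was already scaled per sample, the resulting "counts" no longer reflect each sample's sequencing depth, which DESeq2 expects to estimate itself from un-normalized counts.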

Another part that makes little sense is when you use p.adjust() - DESeq2 will perform p-value adjustment for you.
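As a sketch of this point, the example below uses simulated counts from DESeq2's `makeExampleDESeqDataSet()` (not TCGA data; the gene and sample numbers are arbitrary). The `padj` column returned by `results()` is already Benjamini-Hochberg adjusted by default, computed after independent filtering, so a separate `p.adjust()` call is not needed.

```r
library(DESeq2)

set.seed(1)
# Simulated dataset with a two-level factor `condition` (levels A and B).
dds <- makeExampleDESeqDataSet(n = 500, m = 8, betaSD = 1)
dds <- DESeq(dds, quiet = TRUE)

# results() already reports adjusted p-values in the `padj` column.
res <- results(dds, contrast = c("condition", "B", "A"))
res_df <- as.data.frame(res)
res_df <- res_df[order(res_df$padj), ]

head(res_df[, c("log2FoldChange", "pvalue", "padj")])
```

Genes removed by independent filtering get `NA` in `padj`; re-running `p.adjust()` over all p-values would discard that filtering step.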

Please read my other comment and start from the RSEM files.


Thank you, Kevin Blighe. GBM_normalized.xlsx contains the normalized results, downloaded with the TCGAassembler2 package. OK, I will try starting from the RSEM files.
