Applying DESeq on RSEM output

0

Entering edit mode

dvir.tau@gmail.com ▴ 30

@dvirtaugmailcom-5837

Last seen 9.6 years ago

Hello, I'm running DESeq and EdgeR on RNA-Seq data that was already processed with RSEM (downloaded from TCGA web site). Since these methods require the raw read counts I'm using the raw_count column of the RSEM output but I'm not sure this is the right thing to do (is it the actual raw count required ?) Here's an example file for the RSEM output file downloaded from TCGA: barcode gene_id raw_count scaled_estimate transcript_id TCGA-A1-A0SB-01A-11R-A144-07 ?|100130426 0 0 uc011lsn.1 TCGA-A1-A0SB-01A-11R-A144-07 ?|100133144 34.05 1.23812E-06 uc010unu.1,uc010uoa.1 TCGA-A1-A0SB-01A-11R-A144-07 ?|100134869 31.95 8.40876E-07 uc002bgz.2,uc002bic.2 TCGA-A1-A0SB-01A-11R-A144-07 ?|10357 258.35 2.16969E-05 uc010zzl.1 TCGA-A1-A0SB-01A-11R-A144-07 ?|10431 1459 5.53441E-05 uc001jiu.2,uc010qhg.1 Many thanks, Dvir [[alternative HTML version deleted]]

edgeR DESeq edgeR DESeq • 13k views

ADD COMMENT • link updated 3.1 years ago by mehdimerbah • 0 • written 11.1 years ago by dvir.tau@gmail.com ▴ 30

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 9 hours ago

United States

hi Dvir,

I am not familiar with RSEM software, but you have non-integer values in the raw_count column, for example 31.95 and 258.35. Non-integer values are not appropriate for DESeq or edgeR analysis. Searching for a minute on Google it seems that these raw counts involve assigning fractions of ambiguously mapped reads (but you should check with the RSEM developers). If you don't have access to any lower level data, the next best option is to round the raw_count values and proceed. Update (12/13/15): See Simon's response below. Also, after investigation into the RSEM method, I've come around and recommend the option of using rounded estimated gene-level counts from RSEM as input to DESeq2.

best, Mike

ADD COMMENT • link 11.1 years ago • updated 8.4 years ago Michael Love 41k

0

Entering edit mode

Thanks for your reply Mike. Since the data I'm analyzing is extremely big I really hope to be able to use the Level 3 data from TCGA which is in RSEM format. Do you think that rounding the values should have a significant effect on DESeq performance ? I still wonder if that raw_count column is good enough as raw counts, and if there is another way to get the input required for DESeq from the RSEM output without downloading the huge sequence files. Many thanks, Dvir -----Original Message----- From: Michael Love [mailto:michaelisaiahlove@gmail.com] Sent: Wednesday, March 20, 2013 3:47 PM To: dnetanely at tau.ac.il Cc: bioconductor at r-project.org Subject: Re: [BioC] Applying DESeq on RSEM output hi Dvir I am not familiar with RSEM software, but you have non-integer values in the raw_count column, for example 31.95 and 258.35. Non-integer values are not appropriate for DESeq or edgeR analysis. Searching for a minute on Google it seems that these raw counts involve assigning fractions of ambiguously mapped reads (but you should check with the RSEM developers). If you don't have access to any lower level data, the next best option is to round the raw_count values and proceed. best, Mike On Wed, Mar 20, 2013 at 2:15 PM, <dvir.tau at="" gmail.com=""> wrote: > Hello, > > > > I'm running DESeq and EdgeR on RNA-Seq data that was already processed > with RSEM (downloaded from TCGA web site). > > Since these methods require the raw read counts I'm using the > raw_count column of the RSEM output but I'm not sure this is the right > thing to do (is it the actual raw count required ?) > > > > Here's an example file for the RSEM output file downloaded from TCGA: > > > > > barcode > > gene_id > > raw_count > > scaled_estimate > > transcript_id > > > TCGA-A1-A0SB-01A-11R-A144-07 > > ?|100130426 > > 0 > > 0 > > uc011lsn.1 > > > TCGA-A1-A0SB-01A-11R-A144-07 > > ?|100133144 > > 34.05 > > 1.23812E-06 > > uc010unu.1,uc010uoa.1 > > > TCGA-A1-A0SB-01A-11R-A144-07 > > ?|100134869 > > 31.95 > > 8.40876E-07 > > uc002bgz.2,uc002bic.2 > > > TCGA-A1-A0SB-01A-11R-A144-07 > > ?|10357 > > 258.35 > > 2.16969E-05 > > uc010zzl.1 > > > TCGA-A1-A0SB-01A-11R-A144-07 > > ?|10431 > > 1459 > > 5.53441E-05 > > uc001jiu.2,uc010qhg.1 > > > > > > Many thanks, > > Dvir > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 11.1 years ago dvir.tau@gmail.com ▴ 30

0

Entering edit mode

Hello, I used an RSEM, gene-level count estimates matrix and just rounded the values using the round() function in R to feed them to DESeq2 but the results are very odd (VERY low number of deferentially expressed genes). Is this really a solid way to go about deferential analysis or should I resort to starting from raw data again.

NOTE: The study I got the data from provided the data as RSEM gene-level count matrix and FPKM normalized matrix. I had to use rsem since I know DESeq2 only takes non-normalized counts. The study used an independent t-test on the FPKM file to do the analysis, but I read somewhere that it is highly discouraged.

ADD REPLY • link 3.1 years ago mehdimerbah • 0

0

Entering edit mode

Simon Anders ★ 3.7k

@simon-anders-3855

Last seen 3.7 years ago

Zentrum für Molekularbiologie, Universi…

Hi Dvir On 20/03/13 14:15, dvir.tau at gmail.com wrote: > I'm running DESeq and EdgeR on RNA-Seq data that was already processed with > RSEM (downloaded from TCGA web site). > > Since these methods require the raw read counts I'm using the raw_count > column of the RSEM output but I'm not sure this is the right thing to do (is > it the actual raw count required ?) The real issue is not that your counts are not integer, but that RSEM gives you counts per isoform rather than per gene. Now, if you have two very similar isoforms, RSEM will be unable to decide which isoform to assign a read to and just spread them proportionally over both. Hence, even if only one of the two isoforms is differentially expressed, you will incorrectly see differential expression for both isoforms. This is why the output of isoform quantification methods such as RSEM of MMSeq are not suitable as input for differential expression tests. At the very minimum, you need also the information about the uncertainty of the assignments of reads to isoforms. In fact, RSEM provides this information if you run it in its Bayesian mode, but this seems to be hardly ever done in practice. If you really need to perform differential expression analysis on a level finer than whole gene expression, you should either use a tool for differential exon usage testing, such as our DEXSeq package, or one that combines isoform abundance estimation and testing for differences in a unified framework, such as BitSeq. In both cases, you will need the SAM files. If you are fine with staying on the gene level for your analysis, you need to get counts per gene, not per isoform. I am not familiar enough with RSEM, though, to tell you whether adding up the counts from all the isoforms per gene would be a good idea. Simon

ADD COMMENT • link 11.1 years ago Simon Anders ★ 3.7k

0

Entering edit mode

Hi Simon, Thanks again for your help. The data files I'm using are the GENE LEVEL and UNNORMALIZED and named "...expression_rsem_gene.txt". Exon level and normalized versions of the RSEM output also exist but I do not use them. These files contain a 'raw_count' column which I thought should be okay as input for DESeq and EdgeR. Do you still think it's calculated on the exon level ? Many thanks, Dvir -----Original Message----- From: Simon Anders [mailto:anders@embl.de] Sent: Thursday, March 21, 2013 3:20 PM To: bioconductor at r-project.org Subject: Re: [BioC] Applying DESeq on RSEM output Hi Dvir On 20/03/13 14:15, dvir.tau at gmail.com wrote: > I'm running DESeq and EdgeR on RNA-Seq data that was already processed > with RSEM (downloaded from TCGA web site). > > Since these methods require the raw read counts I'm using the > raw_count column of the RSEM output but I'm not sure this is the right > thing to do (is it the actual raw count required ?) The real issue is not that your counts are not integer, but that RSEM gives you counts per isoform rather than per gene. Now, if you have two very similar isoforms, RSEM will be unable to decide which isoform to assign a read to and just spread them proportionally over both. Hence, even if only one of the two isoforms is differentially expressed, you will incorrectly see differential expression for both isoforms. This is why the output of isoform quantification methods such as RSEM of MMSeq are not suitable as input for differential expression tests. At the very minimum, you need also the information about the uncertainty of the assignments of reads to isoforms. In fact, RSEM provides this information if you run it in its Bayesian mode, but this seems to be hardly ever done in practice. If you really need to perform differential expression analysis on a level finer than whole gene expression, you should either use a tool for differential exon usage testing, such as our DEXSeq package, or one that combines isoform abundance estimation and testing for differences in a unified framework, such as BitSeq. In both cases, you will need the SAM files. If you are fine with staying on the gene level for your analysis, you need to get counts per gene, not per isoform. I am not familiar enough with RSEM, though, to tell you whether adding up the counts from all the isoforms per gene would be a good idea. Simon ----- No virus found in this message. Checked by AVG - www.avg.com

ADD REPLY • link 11.1 years ago dvir.tau@gmail.com ▴ 30

0

Entering edit mode

hi Dvir, Simon is pointing out that the RSEM output you pasted is calculated on the transcript level, as the rows have a transcript_id column. DESeq operates on gene-level counts, which involve collapsing all the possible transcripts of a gene into one feature. For more information on the details of appropriate gene-level counting for DESeq, you can see this page of the HTSeq documentation ( http://www-huber.embl.de/users/anders/HTSeq/doc/count.html ): "the features are typically genes, where each gene is considered here as the union of all its exons" Suppose you want to do gene-level DE analysis, and you have 1 gene with 10 transcripts. The count of reads that hit this gene is somehow distributed by the RSEM software over the 10 transcripts. Since we do not know exactly how the reads are distributed by the RSEM software, we cannot answer how to arrive at a sensible gene-level count. I would recommend contacting the RSEM developers to see if it is appropriate to add, for example the counts in these files from the 10 transcripts to arrive at a gene-level count which conforms to the definition from the HTSeq link above. Mike On Mon, Mar 25, 2013 at 12:41 PM, Dvir Netanely <dvir.tau@gmail.com> wrote: > Hi Simon, > > Thanks again for your help. > > The data files I'm using are the GENE LEVEL and UNNORMALIZED and named > "...expression_rsem_gene.txt". Exon level and normalized versions of the > RSEM output also exist but I do not use them. > These files contain a 'raw_count' column which I thought should be okay as > input for DESeq and EdgeR. > > Do you still think it's calculated on the exon level ? > > Many thanks, > Dvir > > -----Original Message----- > From: Simon Anders [mailto:anders@embl.de] > Sent: Thursday, March 21, 2013 3:20 PM > To: bioconductor@r-project.org > Subject: Re: [BioC] Applying DESeq on RSEM output > > Hi Dvir > > On 20/03/13 14:15, dvir.tau@gmail.com wrote: > > I'm running DESeq and EdgeR on RNA-Seq data that was already processed > > with RSEM (downloaded from TCGA web site). > > > > Since these methods require the raw read counts I'm using the > > raw_count column of the RSEM output but I'm not sure this is the right > > thing to do (is it the actual raw count required ?) > > The real issue is not that your counts are not integer, but that RSEM gives > you counts per isoform rather than per gene. Now, if you have two very > similar isoforms, RSEM will be unable to decide which isoform to assign a > read to and just spread them proportionally over both. Hence, even if only > one of the two isoforms is differentially expressed, you will incorrectly > see differential expression for both isoforms. > > This is why the output of isoform quantification methods such as RSEM of > MMSeq are not suitable as input for differential expression tests. > > At the very minimum, you need also the information about the uncertainty of > the assignments of reads to isoforms. In fact, RSEM provides this > information if you run it in its Bayesian mode, but this seems to be hardly > ever done in practice. > > If you really need to perform differential expression analysis on a level > finer than whole gene expression, you should either use a tool for > differential exon usage testing, such as our DEXSeq package, or one that > combines isoform abundance estimation and testing for differences in a > unified framework, such as BitSeq. In both cases, you will need the SAM > files. > > If you are fine with staying on the gene level for your analysis, you need > to get counts per gene, not per isoform. I am not familiar enough with > RSEM, > though, to tell you whether adding up the counts from all the isoforms per > gene would be a good idea. > > Simon > > > > ----- > No virus found in this message. > Checked by AVG - www.avg.com > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 11.1 years ago Michael Love 41k

Login before adding your answer.