Hello,
I'm running DESeq and EdgeR on RNA-Seq data that was already processed
with
RSEM (downloaded from TCGA web site).
Since these methods require the raw read counts I'm using the
raw_count
column of the RSEM output but I'm not sure this is the right thing to
do (is
it the actual raw count required ?)
Here's an example file for the RSEM output file downloaded from TCGA:
barcode
gene_id
raw_count
scaled_estimate
transcript_id
TCGA-A1-A0SB-01A-11R-A144-07
?|100130426
0
0
uc011lsn.1
TCGA-A1-A0SB-01A-11R-A144-07
?|100133144
34.05
1.23812E-06
uc010unu.1,uc010uoa.1
TCGA-A1-A0SB-01A-11R-A144-07
?|100134869
31.95
8.40876E-07
uc002bgz.2,uc002bic.2
TCGA-A1-A0SB-01A-11R-A144-07
?|10357
258.35
2.16969E-05
uc010zzl.1
TCGA-A1-A0SB-01A-11R-A144-07
?|10431
1459
5.53441E-05
uc001jiu.2,uc010qhg.1
Many thanks,
Dvir
[[alternative HTML version deleted]]
I am not familiar with RSEM software, but you have non-integer values in the raw_count column, for example 31.95 and 258.35. Non-integer values are not appropriate for DESeq or edgeR analysis. Searching for a minute on Google it seems that these raw counts involve assigning fractions of ambiguously mapped reads (but you should check with the RSEM developers). If you don't have access to any lower level data, the next best option is to round the raw_count values and proceed. Update (12/13/15): See Simon's response below. Also, after investigation into the RSEM method, I've come around and recommend the option of using rounded estimated gene-level counts from RSEM as input to DESeq2.
Thanks for your reply Mike.
Since the data I'm analyzing is extremely big I really hope to be able
to
use the Level 3 data from TCGA which is in RSEM format.
Do you think that rounding the values should have a significant effect
on
DESeq performance ?
I still wonder if that raw_count column is good enough as raw counts,
and if
there is another way to get the input required for DESeq from the RSEM
output without downloading the huge sequence files.
Many thanks,
Dvir
-----Original Message-----
From: Michael Love [mailto:michaelisaiahlove@gmail.com]
Sent: Wednesday, March 20, 2013 3:47 PM
To: dnetanely at tau.ac.il
Cc: bioconductor at r-project.org
Subject: Re: [BioC] Applying DESeq on RSEM output
hi Dvir
I am not familiar with RSEM software, but you have non-integer values
in the
raw_count column, for example 31.95 and 258.35. Non-integer values
are not
appropriate for DESeq or edgeR analysis.
Searching for a minute on Google it seems that these raw counts
involve
assigning fractions of ambiguously mapped reads (but you should check
with
the RSEM developers). If you don't have access to any lower level
data, the
next best option is to round the raw_count values and proceed.
best,
Mike
On Wed, Mar 20, 2013 at 2:15 PM, <dvir.tau at="" gmail.com=""> wrote:
> Hello,
>
>
>
> I'm running DESeq and EdgeR on RNA-Seq data that was already
processed
> with RSEM (downloaded from TCGA web site).
>
> Since these methods require the raw read counts I'm using the
> raw_count column of the RSEM output but I'm not sure this is the
right
> thing to do (is it the actual raw count required ?)
>
>
>
> Here's an example file for the RSEM output file downloaded from
TCGA:
>
>
>
>
> barcode
>
> gene_id
>
> raw_count
>
> scaled_estimate
>
> transcript_id
>
>
> TCGA-A1-A0SB-01A-11R-A144-07
>
> ?|100130426
>
> 0
>
> 0
>
> uc011lsn.1
>
>
> TCGA-A1-A0SB-01A-11R-A144-07
>
> ?|100133144
>
> 34.05
>
> 1.23812E-06
>
> uc010unu.1,uc010uoa.1
>
>
> TCGA-A1-A0SB-01A-11R-A144-07
>
> ?|100134869
>
> 31.95
>
> 8.40876E-07
>
> uc002bgz.2,uc002bic.2
>
>
> TCGA-A1-A0SB-01A-11R-A144-07
>
> ?|10357
>
> 258.35
>
> 2.16969E-05
>
> uc010zzl.1
>
>
> TCGA-A1-A0SB-01A-11R-A144-07
>
> ?|10431
>
> 1459
>
> 5.53441E-05
>
> uc001jiu.2,uc010qhg.1
>
>
>
>
>
> Many thanks,
>
> Dvir
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
[[alternative HTML version deleted]]
Hello, I used an RSEM, gene-level count estimates matrix and just rounded the values using the round() function in R to feed them to DESeq2 but the results are very odd (VERY low number of deferentially expressed genes). Is this really a solid way to go about deferential analysis or should I resort to starting from raw data again.
NOTE: The study I got the data from provided the data as RSEM gene-level count matrix and FPKM normalized matrix. I had to use rsem since I know DESeq2 only takes non-normalized counts. The study used an independent t-test on the FPKM file to do the analysis, but I read somewhere that it is highly discouraged.
Hi Dvir
On 20/03/13 14:15, dvir.tau at gmail.com wrote:
> I'm running DESeq and EdgeR on RNA-Seq data that was already
processed with
> RSEM (downloaded from TCGA web site).
>
> Since these methods require the raw read counts I'm using the
raw_count
> column of the RSEM output but I'm not sure this is the right thing
to do (is
> it the actual raw count required ?)
The real issue is not that your counts are not integer, but that RSEM
gives you counts per isoform rather than per gene. Now, if you have
two
very similar isoforms, RSEM will be unable to decide which isoform to
assign a read to and just spread them proportionally over both. Hence,
even if only one of the two isoforms is differentially expressed, you
will incorrectly see differential expression for both isoforms.
This is why the output of isoform quantification methods such as RSEM
of
MMSeq are not suitable as input for differential expression tests.
At the very minimum, you need also the information about the
uncertainty
of the assignments of reads to isoforms. In fact, RSEM provides this
information if you run it in its Bayesian mode, but this seems to be
hardly ever done in practice.
If you really need to perform differential expression analysis on a
level finer than whole gene expression, you should either use a tool
for
differential exon usage testing, such as our DEXSeq package, or one
that
combines isoform abundance estimation and testing for differences in a
unified framework, such as BitSeq. In both cases, you will need the
SAM
files.
If you are fine with staying on the gene level for your analysis, you
need to get counts per gene, not per isoform. I am not familiar enough
with RSEM, though, to tell you whether adding up the counts from all
the
isoforms per gene would be a good idea.
Simon
Hi Simon,
Thanks again for your help.
The data files I'm using are the GENE LEVEL and UNNORMALIZED and named
"...expression_rsem_gene.txt". Exon level and normalized versions of
the
RSEM output also exist but I do not use them.
These files contain a 'raw_count' column which I thought should be
okay as
input for DESeq and EdgeR.
Do you still think it's calculated on the exon level ?
Many thanks,
Dvir
-----Original Message-----
From: Simon Anders [mailto:anders@embl.de]
Sent: Thursday, March 21, 2013 3:20 PM
To: bioconductor at r-project.org
Subject: Re: [BioC] Applying DESeq on RSEM output
Hi Dvir
On 20/03/13 14:15, dvir.tau at gmail.com wrote:
> I'm running DESeq and EdgeR on RNA-Seq data that was already
processed
> with RSEM (downloaded from TCGA web site).
>
> Since these methods require the raw read counts I'm using the
> raw_count column of the RSEM output but I'm not sure this is the
right
> thing to do (is it the actual raw count required ?)
The real issue is not that your counts are not integer, but that RSEM
gives
you counts per isoform rather than per gene. Now, if you have two very
similar isoforms, RSEM will be unable to decide which isoform to
assign a
read to and just spread them proportionally over both. Hence, even if
only
one of the two isoforms is differentially expressed, you will
incorrectly
see differential expression for both isoforms.
This is why the output of isoform quantification methods such as RSEM
of
MMSeq are not suitable as input for differential expression tests.
At the very minimum, you need also the information about the
uncertainty of
the assignments of reads to isoforms. In fact, RSEM provides this
information if you run it in its Bayesian mode, but this seems to be
hardly
ever done in practice.
If you really need to perform differential expression analysis on a
level
finer than whole gene expression, you should either use a tool for
differential exon usage testing, such as our DEXSeq package, or one
that
combines isoform abundance estimation and testing for differences in a
unified framework, such as BitSeq. In both cases, you will need the
SAM
files.
If you are fine with staying on the gene level for your analysis, you
need
to get counts per gene, not per isoform. I am not familiar enough with
RSEM,
though, to tell you whether adding up the counts from all the isoforms
per
gene would be a good idea.
Simon
-----
No virus found in this message.
Checked by AVG - www.avg.com
hi Dvir,
Simon is pointing out that the RSEM output you pasted is calculated on
the
transcript level, as the rows have a transcript_id column. DESeq
operates
on gene-level counts, which involve collapsing all the possible
transcripts
of a gene into one feature. For more information on the details of
appropriate gene-level counting for DESeq, you can see this page of
the
HTSeq documentation (
http://www-huber.embl.de/users/anders/HTSeq/doc/count.html ):
"the features are typically genes, where each gene is considered here
as
the union of all its exons"
Suppose you want to do gene-level DE analysis, and you have 1 gene
with 10
transcripts. The count of reads that hit this gene is somehow
distributed
by the RSEM software over the 10 transcripts. Since we do not know
exactly
how the reads are distributed by the RSEM software, we cannot answer
how to
arrive at a sensible gene-level count. I would recommend contacting
the
RSEM developers to see if it is appropriate to add, for example the
counts
in these files from the 10 transcripts to arrive at a gene-level count
which conforms to the definition from the HTSeq link above.
Mike
On Mon, Mar 25, 2013 at 12:41 PM, Dvir Netanely <dvir.tau@gmail.com>
wrote:
> Hi Simon,
>
> Thanks again for your help.
>
> The data files I'm using are the GENE LEVEL and UNNORMALIZED and
named
> "...expression_rsem_gene.txt". Exon level and normalized versions of
the
> RSEM output also exist but I do not use them.
> These files contain a 'raw_count' column which I thought should be
okay as
> input for DESeq and EdgeR.
>
> Do you still think it's calculated on the exon level ?
>
> Many thanks,
> Dvir
>
> -----Original Message-----
> From: Simon Anders [mailto:anders@embl.de]
> Sent: Thursday, March 21, 2013 3:20 PM
> To: bioconductor@r-project.org
> Subject: Re: [BioC] Applying DESeq on RSEM output
>
> Hi Dvir
>
> On 20/03/13 14:15, dvir.tau@gmail.com wrote:
> > I'm running DESeq and EdgeR on RNA-Seq data that was already
processed
> > with RSEM (downloaded from TCGA web site).
> >
> > Since these methods require the raw read counts I'm using the
> > raw_count column of the RSEM output but I'm not sure this is the
right
> > thing to do (is it the actual raw count required ?)
>
> The real issue is not that your counts are not integer, but that
RSEM gives
> you counts per isoform rather than per gene. Now, if you have two
very
> similar isoforms, RSEM will be unable to decide which isoform to
assign a
> read to and just spread them proportionally over both. Hence, even
if only
> one of the two isoforms is differentially expressed, you will
incorrectly
> see differential expression for both isoforms.
>
> This is why the output of isoform quantification methods such as
RSEM of
> MMSeq are not suitable as input for differential expression tests.
>
> At the very minimum, you need also the information about the
uncertainty of
> the assignments of reads to isoforms. In fact, RSEM provides this
> information if you run it in its Bayesian mode, but this seems to be
hardly
> ever done in practice.
>
> If you really need to perform differential expression analysis on a
level
> finer than whole gene expression, you should either use a tool for
> differential exon usage testing, such as our DEXSeq package, or one
that
> combines isoform abundance estimation and testing for differences in
a
> unified framework, such as BitSeq. In both cases, you will need the
SAM
> files.
>
> If you are fine with staying on the gene level for your analysis,
you need
> to get counts per gene, not per isoform. I am not familiar enough
with
> RSEM,
> though, to tell you whether adding up the counts from all the
isoforms per
> gene would be a good idea.
>
> Simon
>
>
>
> -----
> No virus found in this message.
> Checked by AVG - www.avg.com
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor@r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
[[alternative HTML version deleted]]
Hello, I used an RSEM, gene-level count estimates matrix and just rounded the values using the round() function in R to feed them to DESeq2 but the results are very odd (VERY low number of deferentially expressed genes). Is this really a solid way to go about deferential analysis or should I resort to starting from raw data again.
NOTE: The study I got the data from provided the data as RSEM gene-level count matrix and FPKM normalized matrix. I had to use rsem since I know DESeq2 only takes non-normalized counts. The study used an independent t-test on the FPKM file to do the analysis, but I read somewhere that it is highly discouraged.