DESeq and number of replicates required for RNA-Seq

0

Entering edit mode

michael watson IAH-C ★ 3.4k

@michael-watson-iah-c-378

Last seen 11.4 years ago

Hi This follows on slightly from my experimental design thread. Having worked through the vignette for DESeq, it seems to work well. However, for the TagSeqExample.tab data set, when using an FDR cut off of 0.05, what we see is that we only find differential expression for large fold changes - an average of log2 fold change of 5 for up- regulated, and log2 fold change of -5 for down-regulated. There are very few significant results that even go as far down as 2 or -2 - which is still a 4-fold change. So, the question is, how many replicates must we have to get more sensitive results? Say down to log2FC of 1? (two-fold up or down regulated)? I can calculate this by using DESeq's own estimates of variance to approximate replicates for T and N in the example data, and keep going until my significant results start to hit a logFC of 1, but I wanted to know if anyone else had done this yet? Thanks Mick

GO DESeq GO DESeq • 4.4k views

ADD COMMENT • link updated 15.7 years ago by Naomi Altman ★ 6.0k • written 15.7 years ago by michael watson IAH-C ★ 3.4k

0

Entering edit mode

Naomi Altman ★ 6.0k

@naomi-altman-380

Last seen 4.8 years ago

United States

The issue is a mix of expression level and sample size. For count data, the power is higher when the expression is higher. Also, the p-values are discrete - the lower the total read count, the fewer values are possible, which messes up the FDR estimation. Of course, understanding the problem does not necessarily suggest a solution. But sample sizes will need to be large (or you need to sequence very deeply) if you want to detect differential expression in low expressing genes. --Naomi At 09:45 AM 6/14/2010, michael watson (IAH-C) wrote: >Hi > >This follows on slightly from my experimental design thread. > >Having worked through the vignette for DESeq, it seems to work >well. However, for the TagSeqExample.tab data set, when using an >FDR cut off of 0.05, what we see is that we only find differential >expression for large fold changes - an average of log2 fold change >of 5 for up-regulated, and log2 fold change of -5 for >down-regulated. There are very few significant results that even go >as far down as 2 or -2 - which is still a 4-fold change. > >So, the question is, how many replicates must we have to get more >sensitive results? Say down to log2FC of 1? (two-fold up or down regulated)? > >I can calculate this by using DESeq's own estimates of variance to >approximate replicates for T and N in the example data, and keep >going until my significant results start to hit a logFC of 1, but I >wanted to know if anyone else had done this yet? > >Thanks >Mick > >_______________________________________________ >Bioconductor mailing list >Bioconductor at stat.math.ethz.ch >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor Naomi S. Altman 814-865-3791 (voice) Associate Professor Dept. of Statistics 814-863-7114 (fax) Penn State University 814-865-1348 (Statistics) University Park, PA 16802-2111

ADD COMMENT • link 15.7 years ago Naomi Altman ★ 6.0k

0

Entering edit mode

Hi Naomi Thanks for the reply. The issue isn't necessarily low expressing genes, but perhaps high expressing genes with a small (ish) fold change. DESeq seems to only report as significant differences that are high fold changes. Contrast this to limma for microarrays, where small fold changes can be reported as significant. For whatever reason, the transcriptomic community have become fixated on "two-fold" as some kind of standard cut-off. Now, I'm not fixated on that, but the example in DESeq reports 428 significant genes with an estimated fold change at FDR 5%, however, NONE of these are in the range -2 : 2. The minimum positive logFC is 2.18 (4.5 fold up- regulation), and the maximum negative logFC is 2.49 (5.65 fold down- regulation). So what I am concerned about is finding genes, either highly or lowly expressed, that are differing by a small fold change - say two-fold. Thanks Mick ________________________________________ From: Naomi Altman [naomi@stat.psu.edu] Sent: 14 June 2010 17:42 To: michael watson (IAH-C); bioconductor at stat.math.ethz.ch Subject: Re: [BioC] DESeq and number of replicates required for RNA- Seq The issue is a mix of expression level and sample size. For count data, the power is higher when the expression is higher. Also, the p-values are discrete - the lower the total read count, the fewer values are possible, which messes up the FDR estimation. Of course, understanding the problem does not necessarily suggest a solution. But sample sizes will need to be large (or you need to sequence very deeply) if you want to detect differential expression in low expressing genes. --Naomi At 09:45 AM 6/14/2010, michael watson (IAH-C) wrote: >Hi > >This follows on slightly from my experimental design thread. > >Having worked through the vignette for DESeq, it seems to work >well. However, for the TagSeqExample.tab data set, when using an >FDR cut off of 0.05, what we see is that we only find differential >expression for large fold changes - an average of log2 fold change >of 5 for up-regulated, and log2 fold change of -5 for >down-regulated. There are very few significant results that even go >as far down as 2 or -2 - which is still a 4-fold change. > >So, the question is, how many replicates must we have to get more >sensitive results? Say down to log2FC of 1? (two-fold up or down regulated)? > >I can calculate this by using DESeq's own estimates of variance to >approximate replicates for T and N in the example data, and keep >going until my significant results start to hit a logFC of 1, but I >wanted to know if anyone else had done this yet? > >Thanks >Mick > >_______________________________________________ >Bioconductor mailing list >Bioconductor at stat.math.ethz.ch >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor Naomi S. Altman 814-865-3791 (voice) Associate Professor Dept. of Statistics 814-863-7114 (fax) Penn State University 814-865-1348 (Statistics) University Park, PA 16802-2111

ADD REPLY • link 15.7 years ago michael watson IAH-C ★ 3.4k

0

Entering edit mode

Naomi Altman ★ 6.0k

@naomi-altman-380

Last seen 4.8 years ago

United States

Hi Michael, I was working this out for a lecture and here is what I found: If there is enough expression for the Normal approximation to hold then here is a rule of thumb. Suppose that the total number of reads is identical for all samples and that there is NO biological variation. If Yi is the number of reads for a gene in sample i, then Poisson variation alone leads to log(Yi) approx normal with variance 1/4. (This is what the DESeq vignette calls "shot" variance.) Using the formula for a 2 sample t-test, you see that to detect 2-fold differences (Log2(2)=1) with 95% power at alpha =.05 you need n>32 var/log(fold) which is approximately 8 biological reps per treatment. However, that is for NO biological variation. (Have a look at the example in the DESeq vignette!) And is assumes alpha=.05 (but we are going to use a much smaller alpha due to the multiple comparisons adjustment). --Naomi At 12:57 PM 6/14/2010, michael watson (IAH-C) wrote: >Hi Naomi > >Thanks for the reply. > >The issue isn't necessarily low expressing genes, but perhaps high >expressing genes with a small (ish) fold change. DESeq seems to >only report as significant differences that are high fold changes. > >Contrast this to limma for microarrays, where small fold changes can >be reported as significant. > >For whatever reason, the transcriptomic community have become >fixated on "two-fold" as some kind of standard cut-off. Now, I'm >not fixated on that, but the example in DESeq reports 428 >significant genes with an estimated fold change at FDR 5%, however, >NONE of these are in the range -2 : 2. The minimum positive logFC >is 2.18 (4.5 fold up-regulation), and the maximum negative logFC is >2.49 (5.65 fold down-regulation). > >So what I am concerned about is finding genes, either highly or >lowly expressed, that are differing by a small fold change - say two- fold. > >Thanks >Mick >________________________________________ >From: Naomi Altman [naomi at stat.psu.edu] >Sent: 14 June 2010 17:42 >To: michael watson (IAH-C); bioconductor at stat.math.ethz.ch >Subject: Re: [BioC] DESeq and number of replicates required for RNA- Seq > >The issue is a mix of expression level and sample size. For count >data, the power is higher when the expression is higher. Also, the >p-values are discrete - the lower the total read count, the fewer >values are possible, which messes up the FDR estimation. > >Of course, understanding the problem does not necessarily suggest a >solution. But sample sizes will need to be large (or you need to >sequence very deeply) if you want to detect differential expression >in low expressing genes. > >--Naomi > >At 09:45 AM 6/14/2010, michael watson (IAH-C) wrote: > >Hi > > > >This follows on slightly from my experimental design thread. > > > >Having worked through the vignette for DESeq, it seems to work > >well. However, for the TagSeqExample.tab data set, when using an > >FDR cut off of 0.05, what we see is that we only find differential > >expression for large fold changes - an average of log2 fold change > >of 5 for up-regulated, and log2 fold change of -5 for > >down-regulated. There are very few significant results that even go > >as far down as 2 or -2 - which is still a 4-fold change. > > > >So, the question is, how many replicates must we have to get more > >sensitive results? Say down to log2FC of 1? (two-fold up or down > regulated)? > > > >I can calculate this by using DESeq's own estimates of variance to > >approximate replicates for T and N in the example data, and keep > >going until my significant results start to hit a logFC of 1, but I > >wanted to know if anyone else had done this yet? > > > >Thanks > >Mick > > > >_______________________________________________ > >Bioconductor mailing list > >Bioconductor at stat.math.ethz.ch > >https://stat.ethz.ch/mailman/listinfo/bioconductor > >Search the archives: > >http://news.gmane.org/gmane.science.biology.informatics.conductor > >Naomi S. Altman 814-865-3791 (voice) >Associate Professor >Dept. of Statistics 814-863-7114 (fax) >Penn State University 814-865-1348 (Statistics) >University Park, PA 16802-2111 > >_______________________________________________ >Bioconductor mailing list >Bioconductor at stat.math.ethz.ch >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor Naomi S. Altman 814-865-3791 (voice) Associate Professor Dept. of Statistics 814-863-7114 (fax) Penn State University 814-865-1348 (Statistics) University Park, PA 16802-2111

ADD COMMENT • link 15.7 years ago Naomi Altman ★ 6.0k

0

Entering edit mode

Thanks Naomi Yes, I have several RNA-Seq datasets that look like they may have large biological variation. I feel this is the "dirty secret" of the new revolution that is RNA- Seq - even with large numbers of replicates, the variation in (and nature of) the read counts means we can only find genes that are changing by a large amount. I wonder if some of the normalisation suggested by Robinson and Oshlack will help (http://genomebiology.com/2010/11/3/R25). And of course there is cufflinks Thanks Mick ________________________________________ From: Naomi Altman [naomi@stat.psu.edu] Sent: 15 June 2010 03:02 To: michael watson (IAH-C); Naomi Altman; bioconductor at stat.math.ethz.ch Subject: Re: [BioC] DESeq and number of replicates required for RNA- Seq Hi Michael, I was working this out for a lecture and here is what I found: If there is enough expression for the Normal approximation to hold then here is a rule of thumb. Suppose that the total number of reads is identical for all samples and that there is NO biological variation. If Yi is the number of reads for a gene in sample i, then Poisson variation alone leads to log(Yi) approx normal with variance 1/4. (This is what the DESeq vignette calls "shot" variance.) Using the formula for a 2 sample t-test, you see that to detect 2-fold differences (Log2(2)=1) with 95% power at alpha =.05 you need n>32 var/log(fold) which is approximately 8 biological reps per treatment. However, that is for NO biological variation. (Have a look at the example in the DESeq vignette!) And is assumes alpha=.05 (but we are going to use a much smaller alpha due to the multiple comparisons adjustment). --Naomi At 12:57 PM 6/14/2010, michael watson (IAH-C) wrote: >Hi Naomi > >Thanks for the reply. > >The issue isn't necessarily low expressing genes, but perhaps high >expressing genes with a small (ish) fold change. DESeq seems to >only report as significant differences that are high fold changes. > >Contrast this to limma for microarrays, where small fold changes can >be reported as significant. > >For whatever reason, the transcriptomic community have become >fixated on "two-fold" as some kind of standard cut-off. Now, I'm >not fixated on that, but the example in DESeq reports 428 >significant genes with an estimated fold change at FDR 5%, however, >NONE of these are in the range -2 : 2. The minimum positive logFC >is 2.18 (4.5 fold up-regulation), and the maximum negative logFC is >2.49 (5.65 fold down-regulation). > >So what I am concerned about is finding genes, either highly or >lowly expressed, that are differing by a small fold change - say two- fold. > >Thanks >Mick >________________________________________ >From: Naomi Altman [naomi at stat.psu.edu] >Sent: 14 June 2010 17:42 >To: michael watson (IAH-C); bioconductor at stat.math.ethz.ch >Subject: Re: [BioC] DESeq and number of replicates required for RNA- Seq > >The issue is a mix of expression level and sample size. For count >data, the power is higher when the expression is higher. Also, the >p-values are discrete - the lower the total read count, the fewer >values are possible, which messes up the FDR estimation. > >Of course, understanding the problem does not necessarily suggest a >solution. But sample sizes will need to be large (or you need to >sequence very deeply) if you want to detect differential expression >in low expressing genes. > >--Naomi > >At 09:45 AM 6/14/2010, michael watson (IAH-C) wrote: > >Hi > > > >This follows on slightly from my experimental design thread. > > > >Having worked through the vignette for DESeq, it seems to work > >well. However, for the TagSeqExample.tab data set, when using an > >FDR cut off of 0.05, what we see is that we only find differential > >expression for large fold changes - an average of log2 fold change > >of 5 for up-regulated, and log2 fold change of -5 for > >down-regulated. There are very few significant results that even go > >as far down as 2 or -2 - which is still a 4-fold change. > > > >So, the question is, how many replicates must we have to get more > >sensitive results? Say down to log2FC of 1? (two-fold up or down > regulated)? > > > >I can calculate this by using DESeq's own estimates of variance to > >approximate replicates for T and N in the example data, and keep > >going until my significant results start to hit a logFC of 1, but I > >wanted to know if anyone else had done this yet? > > > >Thanks > >Mick > > > >_______________________________________________ > >Bioconductor mailing list > >Bioconductor at stat.math.ethz.ch > >https://stat.ethz.ch/mailman/listinfo/bioconductor > >Search the archives: > >http://news.gmane.org/gmane.science.biology.informatics.conductor > >Naomi S. Altman 814-865-3791 (voice) >Associate Professor >Dept. of Statistics 814-863-7114 (fax) >Penn State University 814-865-1348 (Statistics) >University Park, PA 16802-2111 > >_______________________________________________ >Bioconductor mailing list >Bioconductor at stat.math.ethz.ch >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor Naomi S. Altman 814-865-3791 (voice) Associate Professor Dept. of Statistics 814-863-7114 (fax) Penn State University 814-865-1348 (Statistics) University Park, PA 16802-2111

ADD REPLY • link 15.7 years ago michael watson IAH-C ★ 3.4k

0

Entering edit mode

Hi Mick. I can't speak for cufflinks, but the TMM normalization in that GB paper is really about accounting for 'composition' biases. So, this can help when the samples have different RNA composition (or some other systematic effect), but it seems to me like the "dirtiness" you mention here is just that you have large biological variation. Genomics studies are generally underpowered anyways and high biological variation, which is presumably a reality of your experimental system, just makes detecting changes harder. Naomi: I assume you meant sqrt(Yi), not log(Yi) for the normal approximation to the Possion ? Cheers, Mark On 2010-06-15, at 4:44 PM, michael watson (IAH-C) wrote: > Thanks Naomi > > Yes, I have several RNA-Seq datasets that look like they may have large biological variation. > > I feel this is the "dirty secret" of the new revolution that is RNA- Seq - even with large numbers of replicates, the variation in (and nature of) the read counts means we can only find genes that are changing by a large amount. > > I wonder if some of the normalisation suggested by Robinson and Oshlack will help (http://genomebiology.com/2010/11/3/R25). > > And of course there is cufflinks > > Thanks > Mick > ________________________________________ > From: Naomi Altman [naomi at stat.psu.edu] > Sent: 15 June 2010 03:02 > To: michael watson (IAH-C); Naomi Altman; bioconductor at stat.math.ethz.ch > Subject: Re: [BioC] DESeq and number of replicates required for RNA- Seq > > Hi Michael, > I was working this out for a lecture and here is what I found: > > If there is enough expression for the Normal approximation to hold > then here is a rule of thumb. > > Suppose that the total number of reads is identical for all samples > and that there is NO biological variation. If Yi is the number of > reads for a gene in sample i, then > Poisson variation alone leads to log(Yi) approx normal with variance > 1/4. (This is what the DESeq vignette calls "shot" variance.) > > Using the formula for a 2 sample t-test, you see that to detect > 2-fold differences (Log2(2)=1) with 95% power at alpha =.05 you need > n>32 var/log(fold) which is approximately 8 biological reps per treatment. > > However, that is for NO biological variation. (Have a look at the > example in the DESeq vignette!) And is assumes alpha=.05 (but we are > going to use a much smaller alpha due to the multiple comparisons > adjustment). > > --Naomi > > > At 12:57 PM 6/14/2010, michael watson (IAH-C) wrote: >> Hi Naomi >> >> Thanks for the reply. >> >> The issue isn't necessarily low expressing genes, but perhaps high >> expressing genes with a small (ish) fold change. DESeq seems to >> only report as significant differences that are high fold changes. >> >> Contrast this to limma for microarrays, where small fold changes can >> be reported as significant. >> >> For whatever reason, the transcriptomic community have become >> fixated on "two-fold" as some kind of standard cut-off. Now, I'm >> not fixated on that, but the example in DESeq reports 428 >> significant genes with an estimated fold change at FDR 5%, however, >> NONE of these are in the range -2 : 2. The minimum positive logFC >> is 2.18 (4.5 fold up-regulation), and the maximum negative logFC is >> 2.49 (5.65 fold down-regulation). >> >> So what I am concerned about is finding genes, either highly or >> lowly expressed, that are differing by a small fold change - say two-fold. >> >> Thanks >> Mick >> ________________________________________ >> From: Naomi Altman [naomi at stat.psu.edu] >> Sent: 14 June 2010 17:42 >> To: michael watson (IAH-C); bioconductor at stat.math.ethz.ch >> Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq >> >> The issue is a mix of expression level and sample size. For count >> data, the power is higher when the expression is higher. Also, the >> p-values are discrete - the lower the total read count, the fewer >> values are possible, which messes up the FDR estimation. >> >> Of course, understanding the problem does not necessarily suggest a >> solution. But sample sizes will need to be large (or you need to >> sequence very deeply) if you want to detect differential expression >> in low expressing genes. >> >> --Naomi >> >> At 09:45 AM 6/14/2010, michael watson (IAH-C) wrote: >>> Hi >>> >>> This follows on slightly from my experimental design thread. >>> >>> Having worked through the vignette for DESeq, it seems to work >>> well. However, for the TagSeqExample.tab data set, when using an >>> FDR cut off of 0.05, what we see is that we only find differential >>> expression for large fold changes - an average of log2 fold change >>> of 5 for up-regulated, and log2 fold change of -5 for >>> down-regulated. There are very few significant results that even go >>> as far down as 2 or -2 - which is still a 4-fold change. >>> >>> So, the question is, how many replicates must we have to get more >>> sensitive results? Say down to log2FC of 1? (two-fold up or down >> regulated)? >>> >>> I can calculate this by using DESeq's own estimates of variance to >>> approximate replicates for T and N in the example data, and keep >>> going until my significant results start to hit a logFC of 1, but I >>> wanted to know if anyone else had done this yet? >>> >>> Thanks >>> Mick >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> Naomi S. Altman 814-865-3791 (voice) >> Associate Professor >> Dept. of Statistics 814-863-7114 (fax) >> Penn State University 814-865-1348 (Statistics) >> University Park, PA 16802-2111 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > Naomi S. Altman 814-865-3791 (voice) > Associate Professor > Dept. of Statistics 814-863-7114 (fax) > Penn State University 814-865-1348 (Statistics) > University Park, PA 16802-2111 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ------------------------------ Mark Robinson, PhD (Melb) Epigenetics Laboratory, Garvan Bioinformatics Division, WEHI e: m.robinson at garvan.org.au e: mrobinson at wehi.edu.au p: +61 (0)3 9345 2628 f: +61 (0)3 9347 0852 ------------------------------ ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}

ADD REPLY • link 15.7 years ago Mark Robinson ★ 1.1k

0

Entering edit mode

Hi Mark Thanks for the reply. I drafted and sent en e-mail on my phone which then lost connectivity (and the message) so apologies if anyone gets something similar twice! What worries me here is that we are looking at gene expression; we have been for years, using technologies such as RT-PCR, microarrays and SAGE. Using those technologies, we have been able to detect small but significant changes. Now, with RNA-Seq, we are seeing huge variation, and we simply state that due to this huge biological variation, we can only detect large changes. Now, either these systems do actually contain huge biological variation, which means that many of our RT-PCR and array work was wrong (as these didn't detect such huge biological variation), or the variation we are seeing is not all biological and could be due to some bias we are not yet aware of. In fact, both SAGE and RT-PCR rely on count data of some sort, yet again neither showed up so much biological variation that they could only detect large changes. I suspect we haven't gotten to the bottom of RNA-Seq data just yet. Mick ________________________________________ From: Mark Robinson [mrobinson@wehi.EDU.AU] Sent: 15 June 2010 12:20 To: michael watson (IAH-C) Cc: bioc list; Naomi Altman Subject: Re: [BioC] DESeq and number of replicates required for RNA- Seq Hi Mick. I can't speak for cufflinks, but the TMM normalization in that GB paper is really about accounting for 'composition' biases. So, this can help when the samples have different RNA composition (or some other systematic effect), but it seems to me like the "dirtiness" you mention here is just that you have large biological variation. Genomics studies are generally underpowered anyways and high biological variation, which is presumably a reality of your experimental system, just makes detecting changes harder. Naomi: I assume you meant sqrt(Yi), not log(Yi) for the normal approximation to the Possion ? Cheers, Mark On 2010-06-15, at 4:44 PM, michael watson (IAH-C) wrote: > Thanks Naomi > > Yes, I have several RNA-Seq datasets that look like they may have large biological variation. > > I feel this is the "dirty secret" of the new revolution that is RNA- Seq - even with large numbers of replicates, the variation in (and nature of) the read counts means we can only find genes that are changing by a large amount. > > I wonder if some of the normalisation suggested by Robinson and Oshlack will help (http://genomebiology.com/2010/11/3/R25). > > And of course there is cufflinks > > Thanks > Mick > ________________________________________ > From: Naomi Altman [naomi at stat.psu.edu] > Sent: 15 June 2010 03:02 > To: michael watson (IAH-C); Naomi Altman; bioconductor at stat.math.ethz.ch > Subject: Re: [BioC] DESeq and number of replicates required for RNA- Seq > > Hi Michael, > I was working this out for a lecture and here is what I found: > > If there is enough expression for the Normal approximation to hold > then here is a rule of thumb. > > Suppose that the total number of reads is identical for all samples > and that there is NO biological variation. If Yi is the number of > reads for a gene in sample i, then > Poisson variation alone leads to log(Yi) approx normal with variance > 1/4. (This is what the DESeq vignette calls "shot" variance.) > > Using the formula for a 2 sample t-test, you see that to detect > 2-fold differences (Log2(2)=1) with 95% power at alpha =.05 you need > n>32 var/log(fold) which is approximately 8 biological reps per treatment. > > However, that is for NO biological variation. (Have a look at the > example in the DESeq vignette!) And is assumes alpha=.05 (but we are > going to use a much smaller alpha due to the multiple comparisons > adjustment). > > --Naomi > > > At 12:57 PM 6/14/2010, michael watson (IAH-C) wrote: >> Hi Naomi >> >> Thanks for the reply. >> >> The issue isn't necessarily low expressing genes, but perhaps high >> expressing genes with a small (ish) fold change. DESeq seems to >> only report as significant differences that are high fold changes. >> >> Contrast this to limma for microarrays, where small fold changes can >> be reported as significant. >> >> For whatever reason, the transcriptomic community have become >> fixated on "two-fold" as some kind of standard cut-off. Now, I'm >> not fixated on that, but the example in DESeq reports 428 >> significant genes with an estimated fold change at FDR 5%, however, >> NONE of these are in the range -2 : 2. The minimum positive logFC >> is 2.18 (4.5 fold up-regulation), and the maximum negative logFC is >> 2.49 (5.65 fold down-regulation). >> >> So what I am concerned about is finding genes, either highly or >> lowly expressed, that are differing by a small fold change - say two-fold. >> >> Thanks >> Mick >> ________________________________________ >> From: Naomi Altman [naomi at stat.psu.edu] >> Sent: 14 June 2010 17:42 >> To: michael watson (IAH-C); bioconductor at stat.math.ethz.ch >> Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq >> >> The issue is a mix of expression level and sample size. For count >> data, the power is higher when the expression is higher. Also, the >> p-values are discrete - the lower the total read count, the fewer >> values are possible, which messes up the FDR estimation. >> >> Of course, understanding the problem does not necessarily suggest a >> solution. But sample sizes will need to be large (or you need to >> sequence very deeply) if you want to detect differential expression >> in low expressing genes. >> >> --Naomi >> >> At 09:45 AM 6/14/2010, michael watson (IAH-C) wrote: >>> Hi >>> >>> This follows on slightly from my experimental design thread. >>> >>> Having worked through the vignette for DESeq, it seems to work >>> well. However, for the TagSeqExample.tab data set, when using an >>> FDR cut off of 0.05, what we see is that we only find differential >>> expression for large fold changes - an average of log2 fold change >>> of 5 for up-regulated, and log2 fold change of -5 for >>> down-regulated. There are very few significant results that even go >>> as far down as 2 or -2 - which is still a 4-fold change. >>> >>> So, the question is, how many replicates must we have to get more >>> sensitive results? Say down to log2FC of 1? (two-fold up or down >> regulated)? >>> >>> I can calculate this by using DESeq's own estimates of variance to >>> approximate replicates for T and N in the example data, and keep >>> going until my significant results start to hit a logFC of 1, but I >>> wanted to know if anyone else had done this yet? >>> >>> Thanks >>> Mick >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> Naomi S. Altman 814-865-3791 (voice) >> Associate Professor >> Dept. of Statistics 814-863-7114 (fax) >> Penn State University 814-865-1348 (Statistics) >> University Park, PA 16802-2111 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > Naomi S. Altman 814-865-3791 (voice) > Associate Professor > Dept. of Statistics 814-863-7114 (fax) > Penn State University 814-865-1348 (Statistics) > University Park, PA 16802-2111 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ------------------------------ Mark Robinson, PhD (Melb) Epigenetics Laboratory, Garvan Bioinformatics Division, WEHI e: m.robinson at garvan.org.au e: mrobinson at wehi.edu.au p: +61 (0)3 9345 2628 f: +61 (0)3 9347 0852 ------------------------------ ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}

ADD REPLY • link 15.7 years ago michael watson IAH-C ★ 3.4k

0

Entering edit mode

Hi Mick, Do you know how much PCR amplification was used for your RNA-Seq datasets? Could this be one factor leading to more apparent variation? I'm surprised at your results though, I seem to remember many papers with dot-plots comparing expression measurements between biological replicates, which show that the RNA-Seq derived results are much tighter (less variation) than the array based ones. Cheers, Cei michael watson (IAH-C) wrote: > Hi Mark > > Thanks for the reply. I drafted and sent en e-mail on my phone which then lost connectivity (and the message) so apologies if anyone gets something similar twice! > > What worries me here is that we are looking at gene expression; we have been for years, using technologies such as RT-PCR, microarrays and SAGE. Using those technologies, we have been able to detect small but significant changes. Now, with RNA-Seq, we are seeing huge variation, and we simply state that due to this huge biological variation, we can only detect large changes. Now, either these systems do actually contain huge biological variation, which means that many of our RT-PCR and array work was wrong (as these didn't detect such huge biological variation), or the variation we are seeing is not all biological and could be due to some bias we are not yet aware of. > > In fact, both SAGE and RT-PCR rely on count data of some sort, yet again neither showed up so much biological variation that they could only detect large changes. > > I suspect we haven't gotten to the bottom of RNA-Seq data just yet. > > Mick > > ________________________________________ > From: Mark Robinson [mrobinson at wehi.EDU.AU] > Sent: 15 June 2010 12:20 > To: michael watson (IAH-C) > Cc: bioc list; Naomi Altman > Subject: Re: [BioC] DESeq and number of replicates required for RNA- Seq > > Hi Mick. > > I can't speak for cufflinks, but the TMM normalization in that GB paper is really about accounting for 'composition' biases. So, this can help when the samples have different RNA composition (or some other systematic effect), but it seems to me like the "dirtiness" you mention here is just that you have large biological variation. Genomics studies are generally underpowered anyways and high biological variation, which is presumably a reality of your experimental system, just makes detecting changes harder. > > Naomi: I assume you meant sqrt(Yi), not log(Yi) for the normal approximation to the Possion ? > > Cheers, > Mark > > On 2010-06-15, at 4:44 PM, michael watson (IAH-C) wrote: > >> Thanks Naomi >> >> Yes, I have several RNA-Seq datasets that look like they may have large biological variation. >> >> I feel this is the "dirty secret" of the new revolution that is RNA-Seq - even with large numbers of replicates, the variation in (and nature of) the read counts means we can only find genes that are changing by a large amount. >> >> I wonder if some of the normalisation suggested by Robinson and Oshlack will help (http://genomebiology.com/2010/11/3/R25). >> >> And of course there is cufflinks >> >> Thanks >> Mick >> ________________________________________ >> From: Naomi Altman [naomi at stat.psu.edu] >> Sent: 15 June 2010 03:02 >> To: michael watson (IAH-C); Naomi Altman; bioconductor at stat.math.ethz.ch >> Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq >> >> Hi Michael, >> I was working this out for a lecture and here is what I found: >> >> If there is enough expression for the Normal approximation to hold >> then here is a rule of thumb. >> >> Suppose that the total number of reads is identical for all samples >> and that there is NO biological variation. If Yi is the number of >> reads for a gene in sample i, then >> Poisson variation alone leads to log(Yi) approx normal with variance >> 1/4. (This is what the DESeq vignette calls "shot" variance.) >> >> Using the formula for a 2 sample t-test, you see that to detect >> 2-fold differences (Log2(2)=1) with 95% power at alpha =.05 you need >> n>32 var/log(fold) which is approximately 8 biological reps per treatment. >> >> However, that is for NO biological variation. (Have a look at the >> example in the DESeq vignette!) And is assumes alpha=.05 (but we are >> going to use a much smaller alpha due to the multiple comparisons >> adjustment). >> >> --Naomi >> >> >> At 12:57 PM 6/14/2010, michael watson (IAH-C) wrote: >>> Hi Naomi >>> >>> Thanks for the reply. >>> >>> The issue isn't necessarily low expressing genes, but perhaps high >>> expressing genes with a small (ish) fold change. DESeq seems to >>> only report as significant differences that are high fold changes. >>> >>> Contrast this to limma for microarrays, where small fold changes can >>> be reported as significant. >>> >>> For whatever reason, the transcriptomic community have become >>> fixated on "two-fold" as some kind of standard cut-off. Now, I'm >>> not fixated on that, but the example in DESeq reports 428 >>> significant genes with an estimated fold change at FDR 5%, however, >>> NONE of these are in the range -2 : 2. The minimum positive logFC >>> is 2.18 (4.5 fold up-regulation), and the maximum negative logFC is >>> 2.49 (5.65 fold down-regulation). >>> >>> So what I am concerned about is finding genes, either highly or >>> lowly expressed, that are differing by a small fold change - say two-fold. >>> >>> Thanks >>> Mick >>> ________________________________________ >>> From: Naomi Altman [naomi at stat.psu.edu] >>> Sent: 14 June 2010 17:42 >>> To: michael watson (IAH-C); bioconductor at stat.math.ethz.ch >>> Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq >>> >>> The issue is a mix of expression level and sample size. For count >>> data, the power is higher when the expression is higher. Also, the >>> p-values are discrete - the lower the total read count, the fewer >>> values are possible, which messes up the FDR estimation. >>> >>> Of course, understanding the problem does not necessarily suggest a >>> solution. But sample sizes will need to be large (or you need to >>> sequence very deeply) if you want to detect differential expression >>> in low expressing genes. >>> >>> --Naomi >>> >>> At 09:45 AM 6/14/2010, michael watson (IAH-C) wrote: >>>> Hi >>>> >>>> This follows on slightly from my experimental design thread. >>>> >>>> Having worked through the vignette for DESeq, it seems to work >>>> well. However, for the TagSeqExample.tab data set, when using an >>>> FDR cut off of 0.05, what we see is that we only find differential >>>> expression for large fold changes - an average of log2 fold change >>>> of 5 for up-regulated, and log2 fold change of -5 for >>>> down-regulated. There are very few significant results that even go >>>> as far down as 2 or -2 - which is still a 4-fold change. >>>> >>>> So, the question is, how many replicates must we have to get more >>>> sensitive results? Say down to log2FC of 1? (two-fold up or down >>> regulated)? >>>> I can calculate this by using DESeq's own estimates of variance to >>>> approximate replicates for T and N in the example data, and keep >>>> going until my significant results start to hit a logFC of 1, but I >>>> wanted to know if anyone else had done this yet? >>>> >>>> Thanks >>>> Mick >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at stat.math.ethz.ch >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> Naomi S. Altman 814-865-3791 (voice) >>> Associate Professor >>> Dept. of Statistics 814-863-7114 (fax) >>> Penn State University 814-865-1348 (Statistics) >>> University Park, PA 16802-2111 >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> Naomi S. Altman 814-865-3791 (voice) >> Associate Professor >> Dept. of Statistics 814-863-7114 (fax) >> Penn State University 814-865-1348 (Statistics) >> University Park, PA 16802-2111 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > ------------------------------ > Mark Robinson, PhD (Melb) > Epigenetics Laboratory, Garvan > Bioinformatics Division, WEHI > e: m.robinson at garvan.org.au > e: mrobinson at wehi.edu.au > p: +61 (0)3 9345 2628 > f: +61 (0)3 9347 0852 > ------------------------------ > > > > > > > ______________________________________________________________________ > The information in this email is confidential and intend...{{dropped:6}} > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 15.7 years ago Cei Abreu-Goodger ▴ 830

0

Entering edit mode

Hi Cei These are not my datasets, and I have heard something similar to you. The history of this discussion is that I asked why DESeq, in the example data anaylsis, only found as significant genes whose fold change was very large (2^5 on average). I thought this was odd as low fold changes can be significant. Naomi replied and stated this was due to high biological variation. Perhaps the data in the DESeq example are different to other RNA-Seq datasets - if anyone else has applied DESeq to other datasets and found significant genes with a low fold-change, I would be interested to hear it Thanks Mick ________________________________________ From: Cei Abreu-Goodger [cei@ebi.ac.uk] Sent: 15 June 2010 16:12 To: michael watson (IAH-C) Cc: bioc list Subject: Re: [BioC] DESeq and number of replicates required for RNA- Seq Hi Mick, Do you know how much PCR amplification was used for your RNA-Seq datasets? Could this be one factor leading to more apparent variation? I'm surprised at your results though, I seem to remember many papers with dot-plots comparing expression measurements between biological replicates, which show that the RNA-Seq derived results are much tighter (less variation) than the array based ones. Cheers, Cei michael watson (IAH-C) wrote: > Hi Mark > > Thanks for the reply. I drafted and sent en e-mail on my phone which then lost connectivity (and the message) so apologies if anyone gets something similar twice! > > What worries me here is that we are looking at gene expression; we have been for years, using technologies such as RT-PCR, microarrays and SAGE. Using those technologies, we have been able to detect small but significant changes. Now, with RNA-Seq, we are seeing huge variation, and we simply state that due to this huge biological variation, we can only detect large changes. Now, either these systems do actually contain huge biological variation, which means that many of our RT-PCR and array work was wrong (as these didn't detect such huge biological variation), or the variation we are seeing is not all biological and could be due to some bias we are not yet aware of. > > In fact, both SAGE and RT-PCR rely on count data of some sort, yet again neither showed up so much biological variation that they could only detect large changes. > > I suspect we haven't gotten to the bottom of RNA-Seq data just yet. > > Mick > > ________________________________________ > From: Mark Robinson [mrobinson at wehi.EDU.AU] > Sent: 15 June 2010 12:20 > To: michael watson (IAH-C) > Cc: bioc list; Naomi Altman > Subject: Re: [BioC] DESeq and number of replicates required for RNA- Seq > > Hi Mick. > > I can't speak for cufflinks, but the TMM normalization in that GB paper is really about accounting for 'composition' biases. So, this can help when the samples have different RNA composition (or some other systematic effect), but it seems to me like the "dirtiness" you mention here is just that you have large biological variation. Genomics studies are generally underpowered anyways and high biological variation, which is presumably a reality of your experimental system, just makes detecting changes harder. > > Naomi: I assume you meant sqrt(Yi), not log(Yi) for the normal approximation to the Possion ? > > Cheers, > Mark > > On 2010-06-15, at 4:44 PM, michael watson (IAH-C) wrote: > >> Thanks Naomi >> >> Yes, I have several RNA-Seq datasets that look like they may have large biological variation. >> >> I feel this is the "dirty secret" of the new revolution that is RNA-Seq - even with large numbers of replicates, the variation in (and nature of) the read counts means we can only find genes that are changing by a large amount. >> >> I wonder if some of the normalisation suggested by Robinson and Oshlack will help (http://genomebiology.com/2010/11/3/R25). >> >> And of course there is cufflinks >> >> Thanks >> Mick >> ________________________________________ >> From: Naomi Altman [naomi at stat.psu.edu] >> Sent: 15 June 2010 03:02 >> To: michael watson (IAH-C); Naomi Altman; bioconductor at stat.math.ethz.ch >> Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq >> >> Hi Michael, >> I was working this out for a lecture and here is what I found: >> >> If there is enough expression for the Normal approximation to hold >> then here is a rule of thumb. >> >> Suppose that the total number of reads is identical for all samples >> and that there is NO biological variation. If Yi is the number of >> reads for a gene in sample i, then >> Poisson variation alone leads to log(Yi) approx normal with variance >> 1/4. (This is what the DESeq vignette calls "shot" variance.) >> >> Using the formula for a 2 sample t-test, you see that to detect >> 2-fold differences (Log2(2)=1) with 95% power at alpha =.05 you need >> n>32 var/log(fold) which is approximately 8 biological reps per treatment. >> >> However, that is for NO biological variation. (Have a look at the >> example in the DESeq vignette!) And is assumes alpha=.05 (but we are >> going to use a much smaller alpha due to the multiple comparisons >> adjustment). >> >> --Naomi >> >> >> At 12:57 PM 6/14/2010, michael watson (IAH-C) wrote: >>> Hi Naomi >>> >>> Thanks for the reply. >>> >>> The issue isn't necessarily low expressing genes, but perhaps high >>> expressing genes with a small (ish) fold change. DESeq seems to >>> only report as significant differences that are high fold changes. >>> >>> Contrast this to limma for microarrays, where small fold changes can >>> be reported as significant. >>> >>> For whatever reason, the transcriptomic community have become >>> fixated on "two-fold" as some kind of standard cut-off. Now, I'm >>> not fixated on that, but the example in DESeq reports 428 >>> significant genes with an estimated fold change at FDR 5%, however, >>> NONE of these are in the range -2 : 2. The minimum positive logFC >>> is 2.18 (4.5 fold up-regulation), and the maximum negative logFC is >>> 2.49 (5.65 fold down-regulation). >>> >>> So what I am concerned about is finding genes, either highly or >>> lowly expressed, that are differing by a small fold change - say two-fold. >>> >>> Thanks >>> Mick >>> ________________________________________ >>> From: Naomi Altman [naomi at stat.psu.edu] >>> Sent: 14 June 2010 17:42 >>> To: michael watson (IAH-C); bioconductor at stat.math.ethz.ch >>> Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq >>> >>> The issue is a mix of expression level and sample size. For count >>> data, the power is higher when the expression is higher. Also, the >>> p-values are discrete - the lower the total read count, the fewer >>> values are possible, which messes up the FDR estimation. >>> >>> Of course, understanding the problem does not necessarily suggest a >>> solution. But sample sizes will need to be large (or you need to >>> sequence very deeply) if you want to detect differential expression >>> in low expressing genes. >>> >>> --Naomi >>> >>> At 09:45 AM 6/14/2010, michael watson (IAH-C) wrote: >>>> Hi >>>> >>>> This follows on slightly from my experimental design thread. >>>> >>>> Having worked through the vignette for DESeq, it seems to work >>>> well. However, for the TagSeqExample.tab data set, when using an >>>> FDR cut off of 0.05, what we see is that we only find differential >>>> expression for large fold changes - an average of log2 fold change >>>> of 5 for up-regulated, and log2 fold change of -5 for >>>> down-regulated. There are very few significant results that even go >>>> as far down as 2 or -2 - which is still a 4-fold change. >>>> >>>> So, the question is, how many replicates must we have to get more >>>> sensitive results? Say down to log2FC of 1? (two-fold up or down >>> regulated)? >>>> I can calculate this by using DESeq's own estimates of variance to >>>> approximate replicates for T and N in the example data, and keep >>>> going until my significant results start to hit a logFC of 1, but I >>>> wanted to know if anyone else had done this yet? >>>> >>>> Thanks >>>> Mick >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at stat.math.ethz.ch >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> Naomi S. Altman 814-865-3791 (voice) >>> Associate Professor >>> Dept. of Statistics 814-863-7114 (fax) >>> Penn State University 814-865-1348 (Statistics) >>> University Park, PA 16802-2111 >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> Naomi S. Altman 814-865-3791 (voice) >> Associate Professor >> Dept. of Statistics 814-863-7114 (fax) >> Penn State University 814-865-1348 (Statistics) >> University Park, PA 16802-2111 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > ------------------------------ > Mark Robinson, PhD (Melb) > Epigenetics Laboratory, Garvan > Bioinformatics Division, WEHI > e: m.robinson at garvan.org.au > e: mrobinson at wehi.edu.au > p: +61 (0)3 9345 2628 > f: +61 (0)3 9347 0852 > ------------------------------ > > > > > > > ______________________________________________________________________ > The information in this email is confidential and intend...{{dropped:6}} > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 15.7 years ago michael watson IAH-C ★ 3.4k

0

Entering edit mode

On Tue, Jun 15, 2010 at 11:18 AM, michael watson (IAH-C) < michael.watson@bbsrc.ac.uk> wrote: > Hi Cei > > These are not my datasets, and I have heard something similar to you. > > The history of this discussion is that I asked why DESeq, in the example > data anaylsis, only found as significant genes whose fold change was very > large (2^5 on average). I thought this was odd as low fold changes can be > significant. Naomi replied and stated this was due to high biological > variation. > > That is certainly one possibility. I think that she also suggested that both sequencing depth and sample numbers also contribute to the ability to find DE genes. As a corollary to the sequencing depth issue, it will be true that the fold change defined as "significant" is a function of the sequencing depth, so for high-expressing genes, it is easier to find DE with smaller fold changes than with lower expressed genes. In other words, a two-fold change may be meaningless for a large chunk of your genes of interest if the expression level/sequencing depth is not large enough while for high-expressing genes, a two-fold change might be highly significant. While this was somewhat true for expression level in microarrays, the expression level and ability to detect differential expression with sequencing data takes on a new level of importance, perhaps. Sean > Perhaps the data in the DESeq example are different to other RNA-Seq > datasets - if anyone else has applied DESeq to other datasets and found > significant genes with a low fold-change, I would be interested to hear it > > Thanks > Mick > > ________________________________________ > From: Cei Abreu-Goodger [cei@ebi.ac.uk] > Sent: 15 June 2010 16:12 > To: michael watson (IAH-C) > Cc: bioc list > Subject: Re: [BioC] DESeq and number of replicates required for RNA- Seq > > Hi Mick, > > Do you know how much PCR amplification was used for your RNA-Seq > datasets? Could this be one factor leading to more apparent variation? > > I'm surprised at your results though, I seem to remember many papers > with dot-plots comparing expression measurements between biological > replicates, which show that the RNA-Seq derived results are much tighter > (less variation) than the array based ones. > > Cheers, > > Cei > > michael watson (IAH-C) wrote: > > Hi Mark > > > > Thanks for the reply. I drafted and sent en e-mail on my phone which > then lost connectivity (and the message) so apologies if anyone gets > something similar twice! > > > > What worries me here is that we are looking at gene expression; we have > been for years, using technologies such as RT-PCR, microarrays and SAGE. > Using those technologies, we have been able to detect small but significant > changes. Now, with RNA-Seq, we are seeing huge variation, and we simply > state that due to this huge biological variation, we can only detect large > changes. Now, either these systems do actually contain huge biological > variation, which means that many of our RT-PCR and array work was wrong (as > these didn't detect such huge biological variation), or the variation we are > seeing is not all biological and could be due to some bias we are not yet > aware of. > > > > In fact, both SAGE and RT-PCR rely on count data of some sort, yet again > neither showed up so much biological variation that they could only detect > large changes. > > > > I suspect we haven't gotten to the bottom of RNA-Seq data just yet. > > > > Mick > > > > ________________________________________ > > From: Mark Robinson [mrobinson@wehi.EDU.AU] > > Sent: 15 June 2010 12:20 > > To: michael watson (IAH-C) > > Cc: bioc list; Naomi Altman > > Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq > > > > Hi Mick. > > > > I can't speak for cufflinks, but the TMM normalization in that GB paper > is really about accounting for 'composition' biases. So, this can help when > the samples have different RNA composition (or some other systematic > effect), but it seems to me like the "dirtiness" you mention here is just > that you have large biological variation. Genomics studies are generally > underpowered anyways and high biological variation, which is presumably a > reality of your experimental system, just makes detecting changes harder. > > > > Naomi: I assume you meant sqrt(Yi), not log(Yi) for the normal > approximation to the Possion ? > > > > Cheers, > > Mark > > > > On 2010-06-15, at 4:44 PM, michael watson (IAH-C) wrote: > > > >> Thanks Naomi > >> > >> Yes, I have several RNA-Seq datasets that look like they may have large > biological variation. > >> > >> I feel this is the "dirty secret" of the new revolution that is RNA-Seq > - even with large numbers of replicates, the variation in (and nature of) > the read counts means we can only find genes that are changing by a large > amount. > >> > >> I wonder if some of the normalisation suggested by Robinson and Oshlack > will help (http://genomebiology.com/2010/11/3/R25). > >> > >> And of course there is cufflinks > >> > >> Thanks > >> Mick > >> ________________________________________ > >> From: Naomi Altman [naomi@stat.psu.edu] > >> Sent: 15 June 2010 03:02 > >> To: michael watson (IAH-C); Naomi Altman; > bioconductor@stat.math.ethz.ch > >> Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq > >> > >> Hi Michael, > >> I was working this out for a lecture and here is what I found: > >> > >> If there is enough expression for the Normal approximation to hold > >> then here is a rule of thumb. > >> > >> Suppose that the total number of reads is identical for all samples > >> and that there is NO biological variation. If Yi is the number of > >> reads for a gene in sample i, then > >> Poisson variation alone leads to log(Yi) approx normal with variance > >> 1/4. (This is what the DESeq vignette calls "shot" variance.) > >> > >> Using the formula for a 2 sample t-test, you see that to detect > >> 2-fold differences (Log2(2)=1) with 95% power at alpha =.05 you need > >> n>32 var/log(fold) which is approximately 8 biological reps per > treatment. > >> > >> However, that is for NO biological variation. (Have a look at the > >> example in the DESeq vignette!) And is assumes alpha=.05 (but we are > >> going to use a much smaller alpha due to the multiple comparisons > >> adjustment). > >> > >> --Naomi > >> > >> > >> At 12:57 PM 6/14/2010, michael watson (IAH-C) wrote: > >>> Hi Naomi > >>> > >>> Thanks for the reply. > >>> > >>> The issue isn't necessarily low expressing genes, but perhaps high > >>> expressing genes with a small (ish) fold change. DESeq seems to > >>> only report as significant differences that are high fold changes. > >>> > >>> Contrast this to limma for microarrays, where small fold changes can > >>> be reported as significant. > >>> > >>> For whatever reason, the transcriptomic community have become > >>> fixated on "two-fold" as some kind of standard cut-off. Now, I'm > >>> not fixated on that, but the example in DESeq reports 428 > >>> significant genes with an estimated fold change at FDR 5%, however, > >>> NONE of these are in the range -2 : 2. The minimum positive logFC > >>> is 2.18 (4.5 fold up-regulation), and the maximum negative logFC is > >>> 2.49 (5.65 fold down-regulation). > >>> > >>> So what I am concerned about is finding genes, either highly or > >>> lowly expressed, that are differing by a small fold change - say > two-fold. > >>> > >>> Thanks > >>> Mick > >>> ________________________________________ > >>> From: Naomi Altman [naomi@stat.psu.edu] > >>> Sent: 14 June 2010 17:42 > >>> To: michael watson (IAH-C); bioconductor@stat.math.ethz.ch > >>> Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq > >>> > >>> The issue is a mix of expression level and sample size. For count > >>> data, the power is higher when the expression is higher. Also, the > >>> p-values are discrete - the lower the total read count, the fewer > >>> values are possible, which messes up the FDR estimation. > >>> > >>> Of course, understanding the problem does not necessarily suggest a > >>> solution. But sample sizes will need to be large (or you need to > >>> sequence very deeply) if you want to detect differential expression > >>> in low expressing genes. > >>> > >>> --Naomi > >>> > >>> At 09:45 AM 6/14/2010, michael watson (IAH-C) wrote: > >>>> Hi > >>>> > >>>> This follows on slightly from my experimental design thread. > >>>> > >>>> Having worked through the vignette for DESeq, it seems to work > >>>> well. However, for the TagSeqExample.tab data set, when using an > >>>> FDR cut off of 0.05, what we see is that we only find differential > >>>> expression for large fold changes - an average of log2 fold change > >>>> of 5 for up-regulated, and log2 fold change of -5 for > >>>> down-regulated. There are very few significant results that even go > >>>> as far down as 2 or -2 - which is still a 4-fold change. > >>>> > >>>> So, the question is, how many replicates must we have to get more > >>>> sensitive results? Say down to log2FC of 1? (two-fold up or down > >>> regulated)? > >>>> I can calculate this by using DESeq's own estimates of variance to > >>>> approximate replicates for T and N in the example data, and keep > >>>> going until my significant results start to hit a logFC of 1, but I > >>>> wanted to know if anyone else had done this yet? > >>>> > >>>> Thanks > >>>> Mick > >>>> > >>>> _______________________________________________ > >>>> Bioconductor mailing list > >>>> Bioconductor@stat.math.ethz.ch > >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>>> Search the archives: > >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >>> Naomi S. Altman 814-865-3791 (voice) > >>> Associate Professor > >>> Dept. of Statistics 814-863-7114 (fax) > >>> Penn State University 814-865-1348 (Statistics) > >>> University Park, PA 16802-2111 > >>> > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor@stat.math.ethz.ch > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: > >>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> Naomi S. Altman 814-865-3791 (voice) > >> Associate Professor > >> Dept. of Statistics 814-863-7114 (fax) > >> Penn State University 814-865-1348 (Statistics) > >> University Park, PA 16802-2111 > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor@stat.math.ethz.ch > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > ------------------------------ > > Mark Robinson, PhD (Melb) > > Epigenetics Laboratory, Garvan > > Bioinformatics Division, WEHI > > e: m.robinson@garvan.org.au > > e: mrobinson@wehi.edu.au > > p: +61 (0)3 9345 2628 > > f: +61 (0)3 9347 0852 > > ------------------------------ > > > > > > > > > > > > > > ______________________________________________________________________ > > The information in this email is confidential and intend...{{dropped:6}} > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@stat.math.ethz.ch > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 15.7 years ago Sean Davis 21k

0

Entering edit mode

I guess the issue I am trying to highlight is one that Naomi put into a very good statistical framework â with ZERO biological variability, one would still need 8 replicates per treatment to detect a significant two-fold difference. Isnât that slightly worrisome? Donât get me wrong, I love the potential of RNA-Seq, have some RNA- Seq data and plan to produce a hell of a lot more, but the analysis side seems to lack the power to detect small changes. Naomi Altman said: âSuppose that the total number of reads is identical for all samples and that there is NO biological variation. If Yi is the number of reads for a gene in sample i, then Poisson variation alone leads to log(Yi) approx normal with variance 1/4. (This is what the DESeq vignette calls "shot" variance.) Using the formula for a 2 sample t-test, you see that to detect 2-fold differences (Log2(2)=1) with 95% power at alpha =.05 you need n>32 var/log(fold) which is approximately 8 biological reps per treatment.â From: seandavi@gmail.com [mailto:seandavi@gmail.com] On Behalf Of Sean Davis Sent: 15 June 2010 17:06 To: michael watson (IAH-C) Cc: Cei Abreu-Goodger; bioc list Subject: Re: [BioC] DESeq and number of replicates required for RNA- Seq On Tue, Jun 15, 2010 at 11:18 AM, michael watson (IAH-C) <michael.watson@bbsrc.ac.uk<mailto:michael.watson@bbsrc.ac.uk>> wrote: Hi Cei These are not my datasets, and I have heard something similar to you. The history of this discussion is that I asked why DESeq, in the example data anaylsis, only found as significant genes whose fold change was very large (2^5 on average). I thought this was odd as low fold changes can be significant. Naomi replied and stated this was due to high biological variation. That is certainly one possibility. I think that she also suggested that both sequencing depth and sample numbers also contribute to the ability to find DE genes. As a corollary to the sequencing depth issue, it will be true that the fold change defined as "significant" is a function of the sequencing depth, so for high-expressing genes, it is easier to find DE with smaller fold changes than with lower expressed genes. In other words, a two-fold change may be meaningless for a large chunk of your genes of interest if the expression level/sequencing depth is not large enough while for high-expressing genes, a two-fold change might be highly significant. While this was somewhat true for expression level in microarrays, the expression level and ability to detect differential expression with sequencing data takes on a new level of importance, perhaps. Sean Perhaps the data in the DESeq example are different to other RNA-Seq datasets - if anyone else has applied DESeq to other datasets and found significant genes with a low fold-change, I would be interested to hear it Thanks Mick ________________________________________ From: Cei Abreu-Goodger [cei@ebi.ac.uk<mailto:cei@ebi.ac.uk>] Sent: 15 June 2010 16:12 To: michael watson (IAH-C) Cc: bioc list Subject: Re: [BioC] DESeq and number of replicates required for RNA- Seq Hi Mick, Do you know how much PCR amplification was used for your RNA-Seq datasets? Could this be one factor leading to more apparent variation? I'm surprised at your results though, I seem to remember many papers with dot-plots comparing expression measurements between biological replicates, which show that the RNA-Seq derived results are much tighter (less variation) than the array based ones. Cheers, Cei michael watson (IAH-C) wrote: > Hi Mark > > Thanks for the reply. I drafted and sent en e-mail on my phone which then lost connectivity (and the message) so apologies if anyone gets something similar twice! > > What worries me here is that we are looking at gene expression; we have been for years, using technologies such as RT-PCR, microarrays and SAGE. Using those technologies, we have been able to detect small but significant changes. Now, with RNA-Seq, we are seeing huge variation, and we simply state that due to this huge biological variation, we can only detect large changes. Now, either these systems do actually contain huge biological variation, which means that many of our RT-PCR and array work was wrong (as these didn't detect such huge biological variation), or the variation we are seeing is not all biological and could be due to some bias we are not yet aware of. > > In fact, both SAGE and RT-PCR rely on count data of some sort, yet again neither showed up so much biological variation that they could only detect large changes. > > I suspect we haven't gotten to the bottom of RNA-Seq data just yet. > > Mick > > ________________________________________ > From: Mark Robinson [mrobinson@wehi.EDU.AU<mailto:mrobinson@wehi.edu.au>] > Sent: 15 June 2010 12:20 > To: michael watson (IAH-C) > Cc: bioc list; Naomi Altman > Subject: Re: [BioC] DESeq and number of replicates required for RNA- Seq > > Hi Mick. > > I can't speak for cufflinks, but the TMM normalization in that GB paper is really about accounting for 'composition' biases. So, this can help when the samples have different RNA composition (or some other systematic effect), but it seems to me like the "dirtiness" you mention here is just that you have large biological variation. Genomics studies are generally underpowered anyways and high biological variation, which is presumably a reality of your experimental system, just makes detecting changes harder. > > Naomi: I assume you meant sqrt(Yi), not log(Yi) for the normal approximation to the Possion ? > > Cheers, > Mark > > On 2010-06-15, at 4:44 PM, michael watson (IAH-C) wrote: > >> Thanks Naomi >> >> Yes, I have several RNA-Seq datasets that look like they may have large biological variation. >> >> I feel this is the "dirty secret" of the new revolution that is RNA-Seq - even with large numbers of replicates, the variation in (and nature of) the read counts means we can only find genes that are changing by a large amount. >> >> I wonder if some of the normalisation suggested by Robinson and Oshlack will help (http://genomebiology.com/2010/11/3/R25). >> >> And of course there is cufflinks >> >> Thanks >> Mick >> ________________________________________ >> From: Naomi Altman [naomi@stat.psu.edu<mailto:naomi@stat.psu.edu>] >> Sent: 15 June 2010 03:02 >> To: michael watson (IAH-C); Naomi Altman; bioconductor@stat.math.ethz.ch<mailto:bioconductor@stat.math.ethz.ch> >> Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq >> >> Hi Michael, >> I was working this out for a lecture and here is what I found: >> >> If there is enough expression for the Normal approximation to hold >> then here is a rule of thumb. >> >> Suppose that the total number of reads is identical for all samples >> and that there is NO biological variation. If Yi is the number of >> reads for a gene in sample i, then >> Poisson variation alone leads to log(Yi) approx normal with variance >> 1/4. (This is what the DESeq vignette calls "shot" variance.) >> >> Using the formula for a 2 sample t-test, you see that to detect >> 2-fold differences (Log2(2)=1) with 95% power at alpha =.05 you need >> n>32 var/log(fold) which is approximately 8 biological reps per treatment. >> >> However, that is for NO biological variation. (Have a look at the >> example in the DESeq vignette!) And is assumes alpha=.05 (but we are >> going to use a much smaller alpha due to the multiple comparisons >> adjustment). >> >> --Naomi >> >> >> At 12:57 PM 6/14/2010, michael watson (IAH-C) wrote: >>> Hi Naomi >>> >>> Thanks for the reply. >>> >>> The issue isn't necessarily low expressing genes, but perhaps high >>> expressing genes with a small (ish) fold change. DESeq seems to >>> only report as significant differences that are high fold changes. >>> >>> Contrast this to limma for microarrays, where small fold changes can >>> be reported as significant. >>> >>> For whatever reason, the transcriptomic community have become >>> fixated on "two-fold" as some kind of standard cut-off. Now, I'm >>> not fixated on that, but the example in DESeq reports 428 >>> significant genes with an estimated fold change at FDR 5%, however, >>> NONE of these are in the range -2 : 2. The minimum positive logFC >>> is 2.18 (4.5 fold up-regulation), and the maximum negative logFC is >>> 2.49 (5.65 fold down-regulation). >>> >>> So what I am concerned about is finding genes, either highly or >>> lowly expressed, that are differing by a small fold change - say two-fold. >>> >>> Thanks >>> Mick >>> ________________________________________ >>> From: Naomi Altman [naomi@stat.psu.edu<mailto:naomi@stat.psu.edu>] >>> Sent: 14 June 2010 17:42 >>> To: michael watson (IAH-C); bioconductor@stat.math.ethz.ch<mailto:bioconductor@stat.math.ethz.ch> >>> Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq >>> >>> The issue is a mix of expression level and sample size. For count >>> data, the power is higher when the expression is higher. Also, the >>> p-values are discrete - the lower the total read count, the fewer >>> values are possible, which messes up the FDR estimation. >>> >>> Of course, understanding the problem does not necessarily suggest a >>> solution. But sample sizes will need to be large (or you need to >>> sequence very deeply) if you want to detect differential expression >>> in low expressing genes. >>> >>> --Naomi >>> >>> At 09:45 AM 6/14/2010, michael watson (IAH-C) wrote: >>>> Hi >>>> >>>> This follows on slightly from my experimental design thread. >>>> >>>> Having worked through the vignette for DESeq, it seems to work >>>> well. However, for the TagSeqExample.tab data set, when using an >>>> FDR cut off of 0.05, what we see is that we only find differential >>>> expression for large fold changes - an average of log2 fold change >>>> of 5 for up-regulated, and log2 fold change of -5 for >>>> down-regulated. There are very few significant results that even go >>>> as far down as 2 or -2 - which is still a 4-fold change. >>>> >>>> So, the question is, how many replicates must we have to get more >>>> sensitive results? Say down to log2FC of 1? (two-fold up or down >>> regulated)? >>>> I can calculate this by using DESeq's own estimates of variance to >>>> approximate replicates for T and N in the example data, and keep >>>> going until my significant results start to hit a logFC of 1, but I >>>> wanted to know if anyone else had done this yet? >>>> >>>> Thanks >>>> Mick >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor@stat.math.ethz.ch<mailto:bioconductor@stat.math.ethz.ch> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> Naomi S. Altman 814-865-3791 (voice) >>> Associate Professor >>> Dept. of Statistics 814-863-7114 (fax) >>> Penn State University 814-865-1348 (Statistics) >>> University Park, PA 16802-2111 >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@stat.math.ethz.ch<mailto:bioconductor@stat.math.ethz.ch> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> Naomi S. Altman 814-865-3791 (voice) >> Associate Professor >> Dept. of Statistics 814-863-7114 (fax) >> Penn State University 814-865-1348 (Statistics) >> University Park, PA 16802-2111 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@stat.math.ethz.ch<mailto:bioconductor@stat.math.ethz.ch> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > ------------------------------ > Mark Robinson, PhD (Melb) > Epigenetics Laboratory, Garvan > Bioinformatics Division, WEHI > e: m.robinson@garvan.org.au<mailto:m.robinson@garvan.org.au> > e: mrobinson@wehi.edu.au<mailto:mrobinson@wehi.edu.au> > p: +61 (0)3 9345 2628 > f: +61 (0)3 9347 0852 > ------------------------------ > > > > > > > ______________________________________________________________________ > The information in this email is confidential and intend...{{dropped:6}} > > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch<mailto:bioconductor@stat.math.ethz.ch> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor _______________________________________________ Bioconductor mailing list Bioconductor@stat.math.ethz.ch<mailto:bioconductor@stat.math.ethz.ch> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD REPLY • link 15.7 years ago michael watson IAH-C ★ 3.4k

0

Entering edit mode

For array studies, in how many cases is the number replicates was similar to that needed according to simple power calculations <5% ,<1% ? yet meaningful results with sophisticated methods. While power calculations have there uses , they are based on a simple t-test. If your method for detecting DE is more sophisticated that a t-test (and hopefully it is) then your power calculation may be "misleading". I would rather believe the results of simulated data with state of the art algorithms. -----Original Message----- From: michael watson (IAH-C) <michael.watson@bbsrc.ac.uk> To: 'Sean Davis' <sdavis2@mail.nih.gov> Cc: bioc list <bioconductor@stat.math.ethz.ch> Subject: Re: [BioC] DESeq and number of replicates required for RNA- Seq Date: Wed, 16 Jun 2010 08:45:52 +0100 I guess the issue I am trying to highlight is one that Naomi put into a very good statistical framework â with ZERO biological variability, one would still need 8 replicates per treatment to detect a significant two-fold difference. Isnât that slightly worrisome? Donât get me wrong, I love the potential of RNA-Seq, have some RNA- Seq data and plan to produce a hell of a lot more, but the analysis side seems to lack the power to detect small changes. Naomi Altman said: âSuppose that the total number of reads is identical for all samples and that there is NO biological variation. If Yi is the number of reads for a gene in sample i, then Poisson variation alone leads to log(Yi) approx normal with variance 1/4. (This is what the DESeq vignette calls "shot" variance.) Using the formula for a 2 sample t-test, you see that to detect 2-fold differences (Log2(2)=1) with 95% power at alpha =.05 you need n>32 var/log(fold) which is approximately 8 biological reps per treatment.â From: seandavi@gmail.com [mailto:seandavi@gmail.com] On Behalf Of Sean Davis Sent: 15 June 2010 17:06 To: michael watson (IAH-C) Cc: Cei Abreu-Goodger; bioc list Subject: Re: [BioC] DESeq and number of replicates required for RNA- Seq On Tue, Jun 15, 2010 at 11:18 AM, michael watson (IAH-C) <michael.watson@bbsrc.ac.uk<mailto:michael.watson@bbsrc.ac.uk>> wrote: Hi Cei These are not my datasets, and I have heard something similar to you. The history of this discussion is that I asked why DESeq, in the example data anaylsis, only found as significant genes whose fold change was very large (2^5 on average). I thought this was odd as low fold changes can be significant. Naomi replied and stated this was due to high biological variation. That is certainly one possibility. I think that she also suggested that both sequencing depth and sample numbers also contribute to the ability to find DE genes. As a corollary to the sequencing depth issue, it will be true that the fold change defined as "significant" is a function of the sequencing depth, so for high-expressing genes, it is easier to find DE with smaller fold changes than with lower expressed genes. In other words, a two-fold change may be meaningless for a large chunk of your genes of interest if the expression level/sequencing depth is not large enough while for high-expressing genes, a two-fold change might be highly significant. While this was somewhat true for expression level in microarrays, the expression level and ability to detect differential expression with sequencing data takes on a new level of importance, perhaps. Sean Perhaps the data in the DESeq example are different to other RNA-Seq datasets - if anyone else has applied DESeq to other datasets and found significant genes with a low fold-change, I would be interested to hear it Thanks Mick ________________________________________ From: Cei Abreu-Goodger [cei@ebi.ac.uk<mailto:cei@ebi.ac.uk>] Sent: 15 June 2010 16:12 To: michael watson (IAH-C) Cc: bioc list Subject: Re: [BioC] DESeq and number of replicates required for RNA- Seq Hi Mick, Do you know how much PCR amplification was used for your RNA-Seq datasets? Could this be one factor leading to more apparent variation? I'm surprised at your results though, I seem to remember many papers with dot-plots comparing expression measurements between biological replicates, which show that the RNA-Seq derived results are much tighter (less variation) than the array based ones. Cheers, Cei michael watson (IAH-C) wrote: > Hi Mark > > Thanks for the reply. I drafted and sent en e-mail on my phone which then lost connectivity (and the message) so apologies if anyone gets something similar twice! > > What worries me here is that we are looking at gene expression; we have been for years, using technologies such as RT-PCR, microarrays and SAGE. Using those technologies, we have been able to detect small but significant changes. Now, with RNA-Seq, we are seeing huge variation, and we simply state that due to this huge biological variation, we can only detect large changes. Now, either these systems do actually contain huge biological variation, which means that many of our RT-PCR and array work was wrong (as these didn't detect such huge biological variation), or the variation we are seeing is not all biological and could be due to some bias we are not yet aware of. > > In fact, both SAGE and RT-PCR rely on count data of some sort, yet again neither showed up so much biological variation that they could only detect large changes. > > I suspect we haven't gotten to the bottom of RNA-Seq data just yet. > > Mick > > ________________________________________ > From: Mark Robinson [mrobinson@wehi.EDU.AU<mailto:mrobinson@wehi.edu.au>] > Sent: 15 June 2010 12:20 > To: michael watson (IAH-C) > Cc: bioc list; Naomi Altman > Subject: Re: [BioC] DESeq and number of replicates required for RNA- Seq > > Hi Mick. > > I can't speak for cufflinks, but the TMM normalization in that GB paper is really about accounting for 'composition' biases. So, this can help when the samples have different RNA composition (or some other systematic effect), but it seems to me like the "dirtiness" you mention here is just that you have large biological variation. Genomics studies are generally underpowered anyways and high biological variation, which is presumably a reality of your experimental system, just makes detecting changes harder. > > Naomi: I assume you meant sqrt(Yi), not log(Yi) for the normal approximation to the Possion ? > > Cheers, > Mark > > On 2010-06-15, at 4:44 PM, michael watson (IAH-C) wrote: > >> Thanks Naomi >> >> Yes, I have several RNA-Seq datasets that look like they may have large biological variation. >> >> I feel this is the "dirty secret" of the new revolution that is RNA-Seq - even with large numbers of replicates, the variation in (and nature of) the read counts means we can only find genes that are changing by a large amount. >> >> I wonder if some of the normalisation suggested by Robinson and Oshlack will help (http://genomebiology.com/2010/11/3/R25). >> >> And of course there is cufflinks >> >> Thanks >> Mick >> ________________________________________ >> From: Naomi Altman [naomi@stat.psu.edu<mailto:naomi@stat.psu.edu>] >> Sent: 15 June 2010 03:02 >> To: michael watson (IAH-C); Naomi Altman; bioconductor@stat.math.ethz.ch<mailto:bioconductor@stat.math.ethz.ch> >> Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq >> >> Hi Michael, >> I was working this out for a lecture and here is what I found: >> >> If there is enough expression for the Normal approximation to hold >> then here is a rule of thumb. >> >> Suppose that the total number of reads is identical for all samples >> and that there is NO biological variation. If Yi is the number of >> reads for a gene in sample i, then >> Poisson variation alone leads to log(Yi) approx normal with variance >> 1/4. (This is what the DESeq vignette calls "shot" variance.) >> >> Using the formula for a 2 sample t-test, you see that to detect >> 2-fold differences (Log2(2)=1) with 95% power at alpha =.05 you need >> n>32 var/log(fold) which is approximately 8 biological reps per treatment. >> >> However, that is for NO biological variation. (Have a look at the >> example in the DESeq vignette!) And is assumes alpha=.05 (but we are >> going to use a much smaller alpha due to the multiple comparisons >> adjustment). >> >> --Naomi >> >> >> At 12:57 PM 6/14/2010, michael watson (IAH-C) wrote: >>> Hi Naomi >>> >>> Thanks for the reply. >>> >>> The issue isn't necessarily low expressing genes, but perhaps high >>> expressing genes with a small (ish) fold change. DESeq seems to >>> only report as significant differences that are high fold changes. >>> >>> Contrast this to limma for microarrays, where small fold changes can >>> be reported as significant. >>> >>> For whatever reason, the transcriptomic community have become >>> fixated on "two-fold" as some kind of standard cut-off. Now, I'm >>> not fixated on that, but the example in DESeq reports 428 >>> significant genes with an estimated fold change at FDR 5%, however, >>> NONE of these are in the range -2 : 2. The minimum positive logFC >>> is 2.18 (4.5 fold up-regulation), and the maximum negative logFC is >>> 2.49 (5.65 fold down-regulation). >>> >>> So what I am concerned about is finding genes, either highly or >>> lowly expressed, that are differing by a small fold change - say two-fold. >>> >>> Thanks >>> Mick >>> ________________________________________ >>> From: Naomi Altman [naomi@stat.psu.edu<mailto:naomi@stat.psu.edu>] >>> Sent: 14 June 2010 17:42 >>> To: michael watson (IAH-C); bioconductor@stat.math.ethz.ch<mailto:bioconductor@stat.math.ethz.ch> >>> Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq >>> >>> The issue is a mix of expression level and sample size. For count >>> data, the power is higher when the expression is higher. Also, the >>> p-values are discrete - the lower the total read count, the fewer >>> values are possible, which messes up the FDR estimation. >>> >>> Of course, understanding the problem does not necessarily suggest a >>> solution. But sample sizes will need to be large (or you need to >>> sequence very deeply) if you want to detect differential expression >>> in low expressing genes. >>> >>> --Naomi >>> >>> At 09:45 AM 6/14/2010, michael watson (IAH-C) wrote: >>>> Hi >>>> >>>> This follows on slightly from my experimental design thread. >>>> >>>> Having worked through the vignette for DESeq, it seems to work >>>> well. However, for the TagSeqExample.tab data set, when using an >>>> FDR cut off of 0.05, what we see is that we only find differential >>>> expression for large fold changes - an average of log2 fold change >>>> of 5 for up-regulated, and log2 fold change of -5 for >>>> down-regulated. There are very few significant results that even go >>>> as far down as 2 or -2 - which is still a 4-fold change. >>>> >>>> So, the question is, how many replicates must we have to get more >>>> sensitive results? Say down to log2FC of 1? (two-fold up or down >>> regulated)? >>>> I can calculate this by using DESeq's own estimates of variance to >>>> approximate replicates for T and N in the example data, and keep >>>> going until my significant results start to hit a logFC of 1, but I >>>> wanted to know if anyone else had done this yet? >>>> >>>> Thanks >>>> Mick >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor@stat.math.ethz.ch<mailto:bioconductor@stat.math.ethz.ch> >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> Naomi S. Altman 814-865-3791 (voice) >>> Associate Professor >>> Dept. of Statistics 814-863-7114 (fax) >>> Penn State University 814-865-1348 (Statistics) >>> University Park, PA 16802-2111 >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@stat.math.ethz.ch<mailto:bioconductor@stat.math.ethz.ch> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> Naomi S. Altman 814-865-3791 (voice) >> Associate Professor >> Dept. of Statistics 814-863-7114 (fax) >> Penn State University 814-865-1348 (Statistics) >> University Park, PA 16802-2111 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@stat.math.ethz.ch<mailto:bioconductor@stat.math.ethz.ch> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > ------------------------------ > Mark Robinson, PhD (Melb) > Epigenetics Laboratory, Garvan > Bioinformatics Division, WEHI > e: m.robinson@garvan.org.au<mailto:m.robinson@garvan.org.au> > e: mrobinson@wehi.edu.au<mailto:mrobinson@wehi.edu.au> > p: +61 (0)3 9345 2628 > f: +61 (0)3 9347 0852 > ------------------------------ > > > > > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:28}}

ADD REPLY • link 15.7 years ago Paul Leo ▴ 970

0

Entering edit mode

Mark is right. If Y~Poisson then sqrt(Y) is approximately normal with variance 1/4. --Naomi At 07:20 AM 6/15/2010, Mark Robinson wrote: >Hi Mick. > >I can't speak for cufflinks, but the TMM normalization in that GB >paper is really about accounting for 'composition' biases. So, this >can help when the samples have different RNA composition (or some >other systematic effect), but it seems to me like the "dirtiness" >you mention here is just that you have large biological >variation. Genomics studies are generally underpowered anyways and >high biological variation, which is presumably a reality of your >experimental system, just makes detecting changes harder. > >Naomi: I assume you meant sqrt(Yi), not log(Yi) for the normal >approximation to the Possion ? > >Cheers, >Mark > >On 2010-06-15, at 4:44 PM, michael watson (IAH-C) wrote: > > > Thanks Naomi > > > > Yes, I have several RNA-Seq datasets that look like they may have > large biological variation. > > > > I feel this is the "dirty secret" of the new revolution that is > RNA-Seq - even with large numbers of replicates, the variation in > (and nature of) the read counts means we can only find genes that > are changing by a large amount. > > > > I wonder if some of the normalisation suggested by Robinson and > Oshlack will help (http://genomebiology.com/2010/11/3/R25). > > > > And of course there is cufflinks > > > > Thanks > > Mick > > ________________________________________ > > From: Naomi Altman [naomi at stat.psu.edu] > > Sent: 15 June 2010 03:02 > > To: michael watson (IAH-C); Naomi Altman; bioconductor at stat.math.ethz.ch > > Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq > > > > Hi Michael, > > I was working this out for a lecture and here is what I found: > > > > If there is enough expression for the Normal approximation to hold > > then here is a rule of thumb. > > > > Suppose that the total number of reads is identical for all samples > > and that there is NO biological variation. If Yi is the number of > > reads for a gene in sample i, then > > Poisson variation alone leads to log(Yi) approx normal with variance > > 1/4. (This is what the DESeq vignette calls "shot" variance.) > > > > Using the formula for a 2 sample t-test, you see that to detect > > 2-fold differences (Log2(2)=1) with 95% power at alpha =.05 you need > > n>32 var/log(fold) which is approximately 8 biological reps per treatment. > > > > However, that is for NO biological variation. (Have a look at the > > example in the DESeq vignette!) And is assumes alpha=.05 (but we are > > going to use a much smaller alpha due to the multiple comparisons > > adjustment). > > > > --Naomi > > > > > > At 12:57 PM 6/14/2010, michael watson (IAH-C) wrote: > >> Hi Naomi > >> > >> Thanks for the reply. > >> > >> The issue isn't necessarily low expressing genes, but perhaps high > >> expressing genes with a small (ish) fold change. DESeq seems to > >> only report as significant differences that are high fold changes. > >> > >> Contrast this to limma for microarrays, where small fold changes can > >> be reported as significant. > >> > >> For whatever reason, the transcriptomic community have become > >> fixated on "two-fold" as some kind of standard cut-off. Now, I'm > >> not fixated on that, but the example in DESeq reports 428 > >> significant genes with an estimated fold change at FDR 5%, however, > >> NONE of these are in the range -2 : 2. The minimum positive logFC > >> is 2.18 (4.5 fold up-regulation), and the maximum negative logFC is > >> 2.49 (5.65 fold down-regulation). > >> > >> So what I am concerned about is finding genes, either highly or > >> lowly expressed, that are differing by a small fold change - say two-fold. > >> > >> Thanks > >> Mick > >> ________________________________________ > >> From: Naomi Altman [naomi at stat.psu.edu] > >> Sent: 14 June 2010 17:42 > >> To: michael watson (IAH-C); bioconductor at stat.math.ethz.ch > >> Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq > >> > >> The issue is a mix of expression level and sample size. For count > >> data, the power is higher when the expression is higher. Also, the > >> p-values are discrete - the lower the total read count, the fewer > >> values are possible, which messes up the FDR estimation. > >> > >> Of course, understanding the problem does not necessarily suggest a > >> solution. But sample sizes will need to be large (or you need to > >> sequence very deeply) if you want to detect differential expression > >> in low expressing genes. > >> > >> --Naomi > >> > >> At 09:45 AM 6/14/2010, michael watson (IAH-C) wrote: > >>> Hi > >>> > >>> This follows on slightly from my experimental design thread. > >>> > >>> Having worked through the vignette for DESeq, it seems to work > >>> well. However, for the TagSeqExample.tab data set, when using an > >>> FDR cut off of 0.05, what we see is that we only find differential > >>> expression for large fold changes - an average of log2 fold change > >>> of 5 for up-regulated, and log2 fold change of -5 for > >>> down-regulated. There are very few significant results that even go > >>> as far down as 2 or -2 - which is still a 4-fold change. > >>> > >>> So, the question is, how many replicates must we have to get more > >>> sensitive results? Say down to log2FC of 1? (two-fold up or down > >> regulated)? > >>> > >>> I can calculate this by using DESeq's own estimates of variance to > >>> approximate replicates for T and N in the example data, and keep > >>> going until my significant results start to hit a logFC of 1, but I > >>> wanted to know if anyone else had done this yet? > >>> > >>> Thanks > >>> Mick > >>> > >>> _______________________________________________ > >>> Bioconductor mailing list > >>> Bioconductor at stat.math.ethz.ch > >>> https://stat.ethz.ch/mailman/listinfo/bioconductor > >>> Search the archives: > >>> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > >> Naomi S. Altman 814-865-3791 (voice) > >> Associate Professor > >> Dept. of Statistics 814-863-7114 (fax) > >> Penn State University 814-865-1348 (Statistics) > >> University Park, PA 16802-2111 > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor at stat.math.ethz.ch > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > Naomi S. Altman 814-865-3791 (voice) > > Associate Professor > > Dept. of Statistics 814-863-7114 (fax) > > Penn State University 814-865-1348 (Statistics) > > University Park, PA 16802-2111 > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at stat.math.ethz.ch > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > >------------------------------ >Mark Robinson, PhD (Melb) >Epigenetics Laboratory, Garvan >Bioinformatics Division, WEHI >e: m.robinson at garvan.org.au >e: mrobinson at wehi.edu.au >p: +61 (0)3 9345 2628 >f: +61 (0)3 9347 0852 >------------------------------ > > > > > > >_____________________________________________________________________ _ >The information in this email is confidential and intended solely >for the addressee. >You must not disclose, forward, print or use it without the >permission of the sender. >_____________________________________________________________________ _ Naomi S. Altman 814-865-3791 (voice) Associate Professor Dept. of Statistics 814-863-7114 (fax) Penn State University 814-865-1348 (Statistics) University Park, PA 16802-2111

ADD REPLY • link 15.7 years ago Naomi Altman ★ 6.0k

0

Entering edit mode

Naomi Altman ★ 6.0k

@naomi-altman-380

Last seen 4.8 years ago

United States

I was very surprised at the level of biological variation in the data sets I looked at. The question is: How can the biological variation in RNA-seq data appear to be so much higher than in microarray data? If the variation is artificially low in microarray data, then we have more false positives than we think. If the variation is artificially high in RNA-seq data, then it must be due to technical variation which ought to show up in the analysis of RNA samples split into several lanes on the sequencer. --Naomi At 02:44 AM 6/15/2010, michael watson (IAH-C) wrote: >Thanks Naomi > >Yes, I have several RNA-Seq datasets that look like they may have >large biological variation. > >I feel this is the "dirty secret" of the new revolution that is >RNA-Seq - even with large numbers of replicates, the variation in >(and nature of) the read counts means we can only find genes that >are changing by a large amount. > >I wonder if some of the normalisation suggested by Robinson and >Oshlack will help (http://genomebiology.com/2010/11/3/R25). > >And of course there is cufflinks > >Thanks >Mick >________________________________________ >From: Naomi Altman [naomi at stat.psu.edu] >Sent: 15 June 2010 03:02 >To: michael watson (IAH-C); Naomi Altman; bioconductor at stat.math.ethz.ch >Subject: Re: [BioC] DESeq and number of replicates required for RNA- Seq > >Hi Michael, >I was working this out for a lecture and here is what I found: > >If there is enough expression for the Normal approximation to hold >then here is a rule of thumb. > >Suppose that the total number of reads is identical for all samples >and that there is NO biological variation. If Yi is the number of >reads for a gene in sample i, then >Poisson variation alone leads to log(Yi) approx normal with variance >1/4. (This is what the DESeq vignette calls "shot" variance.) > >Using the formula for a 2 sample t-test, you see that to detect >2-fold differences (Log2(2)=1) with 95% power at alpha =.05 you need >n>32 var/log(fold) which is approximately 8 biological reps per treatment. > >However, that is for NO biological variation. (Have a look at the >example in the DESeq vignette!) And is assumes alpha=.05 (but we are >going to use a much smaller alpha due to the multiple comparisons >adjustment). > >--Naomi > > >At 12:57 PM 6/14/2010, michael watson (IAH-C) wrote: > >Hi Naomi > > > >Thanks for the reply. > > > >The issue isn't necessarily low expressing genes, but perhaps high > >expressing genes with a small (ish) fold change. DESeq seems to > >only report as significant differences that are high fold changes. > > > >Contrast this to limma for microarrays, where small fold changes can > >be reported as significant. > > > >For whatever reason, the transcriptomic community have become > >fixated on "two-fold" as some kind of standard cut-off. Now, I'm > >not fixated on that, but the example in DESeq reports 428 > >significant genes with an estimated fold change at FDR 5%, however, > >NONE of these are in the range -2 : 2. The minimum positive logFC > >is 2.18 (4.5 fold up-regulation), and the maximum negative logFC is > >2.49 (5.65 fold down-regulation). > > > >So what I am concerned about is finding genes, either highly or > >lowly expressed, that are differing by a small fold change - say two-fold. > > > >Thanks > >Mick > >________________________________________ > >From: Naomi Altman [naomi at stat.psu.edu] > >Sent: 14 June 2010 17:42 > >To: michael watson (IAH-C); bioconductor at stat.math.ethz.ch > >Subject: Re: [BioC] DESeq and number of replicates required for RNA-Seq > > > >The issue is a mix of expression level and sample size. For count > >data, the power is higher when the expression is higher. Also, the > >p-values are discrete - the lower the total read count, the fewer > >values are possible, which messes up the FDR estimation. > > > >Of course, understanding the problem does not necessarily suggest a > >solution. But sample sizes will need to be large (or you need to > >sequence very deeply) if you want to detect differential expression > >in low expressing genes. > > > >--Naomi > > > >At 09:45 AM 6/14/2010, michael watson (IAH-C) wrote: > > >Hi > > > > > >This follows on slightly from my experimental design thread. > > > > > >Having worked through the vignette for DESeq, it seems to work > > >well. However, for the TagSeqExample.tab data set, when using an > > >FDR cut off of 0.05, what we see is that we only find differential > > >expression for large fold changes - an average of log2 fold change > > >of 5 for up-regulated, and log2 fold change of -5 for > > >down-regulated. There are very few significant results that even go > > >as far down as 2 or -2 - which is still a 4-fold change. > > > > > >So, the question is, how many replicates must we have to get more > > >sensitive results? Say down to log2FC of 1? (two-fold up or down > > regulated)? > > > > > >I can calculate this by using DESeq's own estimates of variance to > > >approximate replicates for T and N in the example data, and keep > > >going until my significant results start to hit a logFC of 1, but I > > >wanted to know if anyone else had done this yet? > > > > > >Thanks > > >Mick > > > > > >_______________________________________________ > > >Bioconductor mailing list > > >Bioconductor at stat.math.ethz.ch > > >https://stat.ethz.ch/mailman/listinfo/bioconductor > > >Search the archives: > > >http://news.gmane.org/gmane.science.biology.informatics.conductor > > > >Naomi S. Altman 814-865-3791 (voice) > >Associate Professor > >Dept. of Statistics 814-863-7114 (fax) > >Penn State University 814-865-1348 (Statistics) > >University Park, PA 16802-2111 > > > >_______________________________________________ > >Bioconductor mailing list > >Bioconductor at stat.math.ethz.ch > >https://stat.ethz.ch/mailman/listinfo/bioconductor > >Search the archives: > >http://news.gmane.org/gmane.science.biology.informatics.conductor > >Naomi S. Altman 814-865-3791 (voice) >Associate Professor >Dept. of Statistics 814-863-7114 (fax) >Penn State University 814-865-1348 (Statistics) >University Park, PA 16802-2111 Naomi S. Altman 814-865-3791 (voice) Associate Professor Dept. of Statistics 814-863-7114 (fax) Penn State University 814-865-1348 (Statistics) University Park, PA 16802-2111

ADD COMMENT • link 15.7 years ago Naomi Altman ★ 6.0k

Login before adding your answer.