Search
Question: DESeq adjusted pvalue calculation / filtering data
0
6.9 years ago by
Markus Grohme20 wrote:
Dear all, Simon Anders kindly asked me to repeat my question on this mailing list considering the statistical analysis of my RNA sequencing data. I am rather new to Bioconductor and RNA sequencing analysis (molecular biologist) and tried to read myself a bit into the statistics behind the analysis in DESeq and found the publication of "Storey & Tibshirani, 2003" and tested also the R program "qvalue". In the paper he authors claim that that the Benjamini/Hochberg method is too conservative for most genomics studies. After running my analysis in DESeq I did some analysis of the pvalues using "qvalue". How does the way the adjusted p-value is calculated using Benjamini/Hochberg and qvalue differ, the ones generated by DESeq (which uses the BH method as I understand it). The package DE"G"Seq (Wang et. al 2010) allows both B/H and qvalue methods to be used for adj- pvalue calculation. The Bourgon_2010 paper for independent filtering also briefly mentions qvalue (ref.23). I am not really a statistician, not even a real bioinformatician so I would like to ask more experienced people dealing with this kind of data on a regular basis. Some of the programs have been written with microarrays in mind so I am not sure how this will perform on sequencing count data. I am also not sure how my experimental setup fits into all of this. My experimental setup is (not standard I think): 4 biological samples no biological no technical replicate (I know far from perfect) 1 lane Illumina GAII 36bp single end each where i get around 7-8 Mio reads mapped (basis of my counts). I do not sample over the whole transcript but sequenced in a window of ~300 bases of 3' (fragmented cDNA / PolyA selected). This means all my tags are clustered around the 3' of the transcript. This gives me the advantage that my counts are not distributed over sequences that I can't connect informatically without a reference genome. My reference was a transcriptome assembly (non-model organism, Sanger/454/Illumina data). So I have about 70.000 features as a reference but due to the nature of a transcriptome assembly I get some fragmentation. E.g. a transcript is represented in 3 separate contigs. Of these 3 contigs only the 3' most gets read counts assigned. This means I am testing against 3 features of which only 1 is capable of giving me informative results. I know that the experiment is not the best setup. But there should be relatively little biological variability as I pooled 200 synchronized individuals that are also more or less clones because they reproduce asexually. Also the 4 stages are a circular time course that is only some hours apart and the gene expression patterns should overlap. So the e.g. A->B->C->D(->A) share in part expression changes with their adjacent sample (D will finally end up being A again.) My interest is finding out what is different in B,C,D compared to A that is my normal physiological state. I am concerned how my overabundance of uninformative features generates a multiple testing correction problem due to my experimental setup. Many tag count rows give just 0,0,0,0 or spurious hits like 0,2,0,5 due to hits more 5' in some fragments not connected by the assembly to the 3'. Should I remove the rows that are just background noise? Or should I let DESeq do the filtering using the independent filtering option (if it is even possible or makes sense for my data). I also attached a picture of A vs B and marked at 5% false discovery rate cutoff from DESeq. It seems that the changes are not very subtle. I understand that by not having any replicates my test will not have strong power to detect small differences. So will filtering actually help here? Also it seems that my sequencing depth is also on the lower end so I would need to get more data to statistically venture into the low expression fraction. Does anybody have some clues how to approach this and what makes sense and what doesn't? Cheers, Markus -------------- next part -------------- A non-text attachment was scrubbed... Name: A_vs_B_q05.png Type: image/png Size: 39357 bytes Desc: not available URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20111124="" 744f00eb="" attachment.png="">
modified 6.9 years ago by Naomi Altman6.0k • written 6.9 years ago by Markus Grohme20
0
6.9 years ago by
Simon Anders3.5k
Zentrum für Molekularbiologie, Universität Heidelberg
Simon Anders3.5k wrote:
Dear Markus, there are several questions in your mail; I try to answer them separately. 1. Storey's qvalues: While, technically, the applicability of Storey's method might be a bit more narrow that of Benjamini and Hochberg's, within transcriptomics both are usually equally applicable, and in, Storey's does give more results. Internally, DESeq calculates the adjusted p values with something like res$padj <- p.adjust( res$pval, method="BH" ) You can also convert the raw p values (res$pval) yourself with Storey's package if you have it installed. Beware that it does not handle NAs well, you may need to take out the NA p values and put them back in. 2. Independent filtering: In the newest version of the DESeq voignette, we have added a section on independent filtering. Removing, e.g., all genes with, say, an average count below 10 does give you some extra hits. 3. The real reason that you have so few hits is your lack of replicates. In this situation, DESeq reports by design only those hits that are strikingly obvious, and doing otherwise wih a sound analysis method is impossible. You cannot expect to get useful results with a flawed experimental design -- and while the two points above might give you a few extra hit, you are unlikely to get usable result without fixing your experiment. 4. Sequencing depth: Remember that it is the total number of counts per gene and _condition_ (not: sample) that gives you power for weakly expressed genes, and the number of replicates that gives your power for the strongly expressed genes. Hence, whenever practically feasible, it is always better to sequence many biological replicate samples to moderate depth than to sequence a few samples very deeply. (Of course, even if replicates are difficult to obtain, two replicates is the minimum. Doing an experiment without that is pointless.) Simon ADD COMMENTlink written 6.9 years ago by Simon Anders3.5k I am doing research on the use of FDR methods with count data. Filtering definitely helps. You want to remove features which have so few counts that you cannot achieve statistical significance even if all the reads come from 1 condition. This is a bit complicated to determine using DESeq due to the dispersion shrinkage, but 10 to 20 are probably good cut-offs. Storey's method works well with count data if the estimate of pi_0 is OK. To determine this, draw a histogram of the raw p-values (from the filtered data). There should be a single peak near p=0. If there is another peak near p=1, then Storey's method does not work so well. The Benjamini and Hochberg method is more conservative, but it at least works. The dissertation on which my comments are based should be available by the end of January. I will post a link as soon as I am able. Naomi At 02:05 PM 11/25/2011, Simon Anders wrote: >Dear Markus, > >there are several questions in your mail; I try to answer them separately. > >1. Storey's qvalues: While, technically, the applicability of >Storey's method might be a bit more narrow that of Benjamini and >Hochberg's, within transcriptomics both are usually equally >applicable, and in, Storey's does give more results. > >Internally, DESeq calculates the adjusted p values with something like > > res$padj <- p.adjust( res$pval, method="BH" ) > >You can also convert the raw p values (res$pval) yourself with >Storey's package if you have it installed. Beware that it does not >handle NAs well, you may need to take out the NA p values and put them back in. > >2. Independent filtering: In the newest version of the DESeq >voignette, we have added a section on independent filtering. >Removing, e.g., all genes with, say, an average count below 10 does >give you some extra hits. > >3. The real reason that you have so few hits is your lack of >replicates. In this situation, DESeq reports by design only those >hits that are strikingly obvious, and doing otherwise wih a sound >analysis method is impossible. You cannot expect to get useful >results with a flawed experimental design -- and while the two >points above might give you a few extra hit, you are unlikely to get >usable result without fixing your experiment. > >4. Sequencing depth: Remember that it is the total number of counts >per gene and _condition_ (not: sample) that gives you power for >weakly expressed genes, and the number of replicates that gives your >power for the strongly expressed genes. Hence, whenever practically >feasible, it is always better to sequence many biological replicate >samples to moderate depth than to sequence a few samples very >deeply. (Of course, even if replicates are difficult to obtain, two >replicates is the minimum. Doing an experiment without that is pointless.) > > Simon > >_______________________________________________ >Bioconductor mailing list >Bioconductor at r-project.org >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor
Hi Simon, Hi Naomi, thanks for your feedback. The filtering removes a really nice chunk of my features: > use <- (rs > quantile(rs, 0.4)) > quantile(rs,0.4) 40% 23 > table(use) use FALSE TRUE 28383 42190 So after filtering my raw pvalues have two peaks in the histogram. One at p=0 and one at p=1. The "0" peak is half the size of "1". And some in between starting at half of "0" and ending at half of "1". So this tells me I should stay with Benjamini-Hochberg I guess. I would be really interested in your dissertation Naomi as I have still have a lot to learn here. As I understand my filtering cutoff the 0.4 quantile of my data is a sum of the rows <23. I guess thats fine. I agree with Simon that my experiment is flawed. And it makes sense to not try to optimize the statistics on flawed data. So asterisk-hunting is not really advisable. But as we were limited in sample AND sequencing capability thats what we did. Even if it is not perfect and is more of a exploratory study the results are quite interesting and a good starting point for further (and better) experiments. I was part interested in the inner workings of the statistical analysis of such data and if I could maybe improve my analysis a bit. But it seems I first have to improve the experiment first. :-) Thanks for your time. Cheers, Markus Am 26. November 2011 20:45 schrieb Naomi Altman <naomi@stat.psu.edu>: > I am doing research on the use of FDR methods with count data. > > Filtering definitely helps. You want to remove features which have so few > counts that you cannot achieve statistical significance even if all the > reads come from 1 condition. This is a bit complicated to determine using > DESeq due to the dispersion shrinkage, but 10 to 20 are probably good > cut-offs. > > Storey's method works well with count data if the estimate of pi_0 is OK. > To determine this, draw a histogram of the raw p-values (from the filtered > data). There should be a single peak near p=0. If there is another peak > near p=1, then Storey's method does not work so well. The Benjamini and > Hochberg method is more conservative, but it at least works. > > The dissertation on which my comments are based should be available by the > end of January. I will post a link as soon as I am able. > > Naomi > > > At 02:05 PM 11/25/2011, Simon Anders wrote: > >> Dear Markus, >> >> there are several questions in your mail; I try to answer them separately. >> >> 1. Storey's qvalues: While, technically, the applicability of Storey's >> method might be a bit more narrow that of Benjamini and Hochberg's, within >> transcriptomics both are usually equally applicable, and in, Storey's does >> give more results. >> >> Internally, DESeq calculates the adjusted p values with something like >> >> res$padj <- p.adjust( res$pval, method="BH" ) >> >> You can also convert the raw p values (res$pval) yourself with Storey's >> package if you have it installed. Beware that it does not handle NAs well, >> you may need to take out the NA p values and put them back in. >> >> 2. Independent filtering: In the newest version of the DESeq voignette, >> we have added a section on independent filtering. Removing, e.g., all genes >> with, say, an average count below 10 does give you some extra hits. >> >> 3. The real reason that you have so few hits is your lack of replicates. >> In this situation, DESeq reports by design only those hits that are >> strikingly obvious, and doing otherwise wih a sound analysis method is >> impossible. You cannot expect to get useful results with a flawed >> experimental design -- and while the two points above might give you a few >> extra hit, you are unlikely to get usable result without fixing your >> experiment. >> >> 4. Sequencing depth: Remember that it is the total number of counts per >> gene and _condition_ (not: sample) that gives you power for weakly >> expressed genes, and the number of replicates that gives your power for the >> strongly expressed genes. Hence, whenever practically feasible, it is >> always better to sequence many biological replicate samples to moderate >> depth than to sequence a few samples very deeply. (Of course, even if >> replicates are difficult to obtain, two replicates is the minimum. Doing an >> experiment without that is pointless.) >> >> Simon >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]] ADD REPLYlink written 6.9 years ago by Markus Grohme20 0 6.9 years ago by Naomi Altman6.0k Naomi Altman6.0k wrote: Just to clarify, the dissertation is the work of my student, Isaac Dialsingh. The peak at zero indicates that you do have some differential expression that is greater than you would expect by chance. Naomi At 06:17 AM 11/29/2011, Markus Grohme wrote: >Hi Simon, Hi Naomi, >thanks for your feedback. > >The filtering removes a really nice chunk of my features: > > use <- (rs > quantile(rs, 0.4)) > > quantile(rs,0.4) >40% >23 > > table(use) >use >FALSE TRUE >28383 42190 > >So after filtering my raw pvalues have two peaks in the histogram. One at >p=0 and one at p=1. The "0" peak is half the size of "1". And some in >between starting at half of "0" and ending at half of "1". So this tells me >I should stay with Benjamini-Hochberg I guess. I would be really interested >in your dissertation Naomi as I have still have a lot to learn here. >As I understand my filtering cutoff the 0.4 quantile of my data is a sum of >the rows <23. I guess thats fine. > >I agree with Simon that my experiment is flawed. And it makes sense to not >try to optimize the statistics on flawed data. So asterisk-hunting is not >really advisable. But as we were limited in sample AND sequencing >capability thats what we did. Even if it is not perfect and is more of a >exploratory study the results are quite interesting and a good starting >point for further (and better) experiments. >I was part interested in the inner workings of the statistical analysis of >such data and if I could maybe improve my analysis a bit. But it seems I >first have to improve the experiment first. :-) > >Thanks for your time. > >Cheers, > >Markus > > > >Am 26. November 2011 20:45 schrieb Naomi Altman <naomi at="" stat.psu.edu="">: > > > I am doing research on the use of FDR methods with count data. > > > > Filtering definitely helps. You want to remove features which have so few > > counts that you cannot achieve statistical significance even if all the > > reads come from 1 condition. This is a bit complicated to determine using > > DESeq due to the dispersion shrinkage, but 10 to 20 are probably good > > cut-offs. > > > > Storey's method works well with count data if the estimate of pi_0 is OK. > > To determine this, draw a histogram of the raw p-values (from the filtered > > data). There should be a single peak near p=0. If there is another peak > > near p=1, then Storey's method does not work so well. The Benjamini and > > Hochberg method is more conservative, but it at least works. > > > > The dissertation on which my comments are based should be available by the > > end of January. I will post a link as soon as I am able. > > > > Naomi > > > > > > At 02:05 PM 11/25/2011, Simon Anders wrote: > > > >> Dear Markus, > >> > >> there are several questions in your mail; I try to answer them separately. > >> > >> 1. Storey's qvalues: While, technically, the applicability of Storey's > >> method might be a bit more narrow that of Benjamini and Hochberg's, within > >> transcriptomics both are usually equally applicable, and in, Storey's does > >> give more results. > >> > >> Internally, DESeq calculates the adjusted p values with something like > >> > >> res$padj <- p.adjust( res$pval, method="BH" ) > >> > >> You can also convert the raw p values (res$pval) yourself with Storey's > >> package if you have it installed. Beware that it does not handle NAs well, > >> you may need to take out the NA p values and put them back in. > >> > >> 2. Independent filtering: In the newest version of the DESeq voignette, > >> we have added a section on independent filtering. Removing, > e.g., all genes > >> with, say, an average count below 10 does give you some extra hits. > >> > >> 3. The real reason that you have so few hits is your lack of replicates. > >> In this situation, DESeq reports by design only those hits that are > >> strikingly obvious, and doing otherwise wih a sound analysis method is > >> impossible. You cannot expect to get useful results with a flawed > >> experimental design -- and while the two points above might give you a few > >> extra hit, you are unlikely to get usable result without fixing your > >> experiment. > >> > >> 4. Sequencing depth: Remember that it is the total number of counts per > >> gene and _condition_ (not: sample) that gives you power for weakly > >> expressed genes, and the number of replicates that gives your > power for the > >> strongly expressed genes. Hence, whenever practically feasible, it is > >> always better to sequence many biological replicate samples to moderate > >> depth than to sequence a few samples very deeply. (Of course, even if > >> replicates are difficult to obtain, two replicates is the > minimum. Doing an > >> experiment without that is pointless.) > >> > >> Simon > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor at r-project.org > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor at r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > [[alternative HTML version deleted]] > >_______________________________________________ >Bioconductor mailing list >Bioconductor at r-project.org >https://stat.ethz.ch/mailman/listinfo/bioconductor >Search the archives: >http://news.gmane.org/gmane.science.biology.informatics.conductor