Question

Filtering out tags with low counts in DESeq and EgdeR?

0

Entering edit mode

Xiaohui Wu ▴ 280

@xiaohui-wu-4141

Last seen 9.6 years ago

Thanks, Simon and Wolfgang. to follow up Simon's suggestions: > > filter=dat[rowSums(dat[,group1]>= 8) | rowSums(dat[,group2]>= 8), ] >Try instead something like: >filter=dat[rowSums(dat) >= 16, ] > >- How does your filter affect the variance functions? Do the plots >generated by 'scvPlot()' differ between the filtered and the unfiltered >data set? I tried filter=dat[rowSums(dat) >= 16, ] or filter=dat[rowSums(dat) >= 8, ], the scvPlot of 8 or 16 is almost the same, but the plots of filtering and no filtering are different. The two plots are attached, to me, they look different, but I don't know what excatly the differences are, could you look into the plots? >- If so, are the hits that you get at expression strength were the >variance functions differ? Are they at the low end, i.e., where the >filter made changes? Sorry, I don't know how to find out where the variance functions change, can we see this from the scvPlots? >- Have you tried what happens if you filter after estimating variance? >The raw p values should be the same as without filtering, but the >adjusted p values might get better. How to filter the CountDataSet? or could I just adjust p values (using some adjusted function?) on the unfiltered results of nbinomTest()? d <- newCountDataSet( dat, group ) d <- estimateSizeFactors( d) sf=sizeFactors(d) d <- estimateVarianceFunctions( d ) ... Then, how could I get the CountDataSet with filtered counts and variance? >> 2. For EdgeR > >DESeq and edgeR are sufficiently similar that any correct answer >regarding filtering should apply to both. > >> 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after adjusting p.value, is this possible? Then, can I used the *unadjusted* p.value to get DE genes? >> To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method = "BH")< 0.05) > >Of course, this is possible. (Read up on the "multiple hypothesis >testing problem" if this is unclear to you.) Not also, though, that you >used an FDR of .1 in your DESeq code but of .05 here. Actually, I tried 0.1 or 0.05, or other adjusted methods provided in EdgeR, always 0 DE genes after adjusted. > Il May/21/11 5:16 PM, Wolfgang ha scritto: >Hi Martin, > >this is a very good question, and it needs more investigation. I'll have >a closer look into this and report back in this place. Two comments >though already: > >- That the dispersion parameter is estimated from the data need by >itself not be problematic (It is not problematic in the microattay case >that the t-test variances are estimated from the data.) > >- The situation with limma is different: there, the problem is that >limma's Bayesian model, which assumes that gene-level error variances >?^2_1, ..., ?^2_m follow a scaled inverse ?2 distribution, no longer >fits the data when the data are filtered for genes with low overall >variance. Due to the way that limma is implemented, the posterior >degrees-of-freedom estimate of that distribution then becomes infinite, >gene-level variance estimates will be ignored (leading to an unintended >analysis based on fold change only), and type I error rate control is >lost. See Fig. 2B+C in the paper, and the text on p.3. > >So, what we need to do with DESeq is >- simulate data according to the null model >- see if & how filtering affects the estimated mean-dispersion relationship >- see if & how it affects the type I error globally and locally (for >particular ranges of the mean). > >Another point is how much the filtering improves power - that will be >related to how many genes can actually be removed by the filtering step, >which depends on the data. > > Best wishes > Wolfgang > > > >Il May/21/11 4:36 PM, Martin Morgan ha scritto: >> On 05/21/2011 07:07 AM, Wolfgang Huber wrote: >>> Hi Xiaohui >>> >>> to follow up on the filtering question: >>> >>> - the filter that Xiaohui applied is invalid, it will distort the >>> null-distribution of the test statistic and lead to invalid p-values. >>> This might explain the discrepancy. >>> >>> - the filter that Simon suggested is OK and should provide better >>> results. >>> >>> - I'd also be keen to hear about your experience with this. >>> >>> A valid filtering criterion does not change the null distribution of the >>> subsequently applied test statistic (it can, and in fact should, change >>> the alternative distribution(s)). In practice, this means choosing a >>> filter criterion that is statistically independent, under the null, from >>> the test statistic, and in particular, that it does not use the class >>> labels. Details in the below-cited PNAS paper. >> >> Was wondering whether, since the dispersion parameter is estimated from >> the data, in some strict sense the filtering and testing procedures are >> not independent under the null anyway? For the same reason that one >> would not want to use a variance filter before a limma-style analysis, >> if I understand correctly. >> >> Martin >> >>> >>> Best wishes >>> Wolfgang >>> >>> >>> >>> >>> >>> Il May/21/11 11:02 AM, Simon Anders ha scritto: >>>> Hi Xiaohui >>>> >>>> I agree thatit is worrying to get so different results from your two >>>> approaches of using DESeq. Here are a few suggestion how you might >>>> investigate this (and I'd be eager to hear about your findings): >>>> >>>> - Bourgen et al. (PNAS, 2010, 107:9546) have studied how pre- filtering >>>> affects the validity and power of a test. They stress that it is >>>> important that the filter is blind to the sample labels (actually: even >>>> permutation invariant). So what you do here is not statistically sound: >>>> >>>> > filter=dat[rowSums(dat[,group1]>= 8) | rowSums(dat[,group2]>= 8), ] >>>> >>>> Try instead something like: >>>> >>>> filter=dat[rowSums(dat) >= 16, ] >>>> >>>> - How does your filter affect the variance functions? Do the plots >>>> generated by 'scvPlot()' differ between the filtered and the unfiltered >>>> data set? >>>> >>>> - If so, are the hits that you get at expression strength were the >>>> variance functions differ? Are they at the low end, i.e., where the >>>> filter made changes? >>>> >>>> - Have you tried what happens if you filter after estimating variance? >>>> The raw p values should be the same as without filtering, but the >>>> adjusted p values might get better. >>>> >>>> To be honest, I'm currently a bit at a loss which one is more correct: >>>> Filtering before or after variance estimation. Let's hear what other >>>> people on the list think. >>>> >>>>> 2. For EdgeR >>>> >>>> DESeq and edgeR are sufficiently similar that any correct answer >>>> regarding filtering should apply to both. >>>> >>>>> 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after >>>>> adjusting p.value, is this possible? Then, can I used the *unadjusted* >>>>> p.value to get DE genes? >>>>> To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method = >>>>> "BH")< 0.05) >>>> >>>> Of course, this is possible. (Read up on the "multiple hypothesis >>>> testing problem" if this is unclear to you.) Not also, though, that you >>>> used an FDR of .1 in your DESeq code but of .05 here. >>>> >>>> Simon >>>>

Bayesian limma edgeR DESeq Bayesian limma edgeR DESeq • 1.2k views

ADD COMMENT • link 12.9 years ago Xiaohui Wu ▴ 280