Filtering out tags with low counts in DESeq and EgdeR?

0

Entering edit mode

Xiaohui Wu ▴ 280

@xiaohui-wu-4141

Last seen 9.6 years ago

Hello, I want to find DE genes between two conditions, using DESeq, EdgeR or other tools if any. I have some questions on how/when to filter out genes with low counts and I am also confused on the final results. I have 50,000 genes, 2 conditions, each condition 4 replicates. 25,000 genes have >=8 counts in at least one condition. The total counts in each replicate is as following: cond1_rep1 cond1_rep2 cond1_rep3 cond1_rep4; cond2_rep1 cond2_rep2 cond2_rep3 cond2_rep4 1792310 1896603 1804500 2166961; 1792663 2492385 1781523 2222892 1. For DESeq I used all 50,000 genes to estimate size factor for normalization. Then should I remove genes with low count before estimating variance or after that? I got totally different results using the following two ways, which I was really surprised that NO overlap was found between the two DE gene sets. 1) no filtering before estimating variance d <- newCountDataSet( dat, group ) d <- estimateSizeFactors( d) sf=sizeFactors(d) d <- estimateVarianceFunctions( d ) res <- nbinomTest( d,"cond1","cond2") resSig <- res[ res$padj < .1, ] 2) filtering before estimating variance filter=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ] ## remove low count genes filter.ds <- newCountDataSet( filter, group ) filter.ds$sizeFactor=sf ## use the sizefactor from 1) filter.ds <- estimateVarianceFunctions( filter.ds ) filter.res <- nbinomTest( filter.ds,"cond1","cond2") filter.sig=filter.res[ filter.res$padj < .1, ] nrow(filter.sig) Results: For my data, total 177 DE genes from 2) and 106 from 1), and no overlap between 1) and 2). 2. For EdgeR 1) when should I remove low count genes? f <- calcNormFactors(dat) ## I should calculate factor using the whole data, right? f <- f/exp(mean(log(f))) dat=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ] ## And, then filter out genes with low counts? d <- DGEList(counts =dat, group=group,lib.size = colSums(dat) * f) 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after adjusting p.value, is this possible? Then, can I used the *unadjusted* p.value to get DE genes? To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method = "BH") < 0.05) Any help or suggestions will be appreciated. Thanks, Xiaohui

Normalization edgeR DESeq Normalization edgeR DESeq • 5.6k views

ADD COMMENT • link 12.9 years ago Xiaohui Wu ▴ 280

1

Entering edit mode

Simon Anders ★ 3.7k

@simon-anders-3855

Last seen 3.7 years ago

Zentrum für Molekularbiologie, Universi…

Hi Xiaohui I agree thatit is worrying to get so different results from your two approaches of using DESeq. Here are a few suggestion how you might investigate this (and I'd be eager to hear about your findings): - Bourgen et al. (PNAS, 2010, 107:9546) have studied how pre-filtering affects the validity and power of a test. They stress that it is important that the filter is blind to the sample labels (actually: even permutation invariant). So what you do here is not statistically sound: > filter=dat[rowSums(dat[,group1]>= 8) | rowSums(dat[,group2]>= 8), ] Try instead something like: filter=dat[rowSums(dat) >= 16, ] - How does your filter affect the variance functions? Do the plots generated by 'scvPlot()' differ between the filtered and the unfiltered data set? - If so, are the hits that you get at expression strength were the variance functions differ? Are they at the low end, i.e., where the filter made changes? - Have you tried what happens if you filter after estimating variance? The raw p values should be the same as without filtering, but the adjusted p values might get better. To be honest, I'm currently a bit at a loss which one is more correct: Filtering before or after variance estimation. Let's hear what other people on the list think. > 2. For EdgeR DESeq and edgeR are sufficiently similar that any correct answer regarding filtering should apply to both. > 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after adjusting p.value, is this possible? Then, can I used the *unadjusted* p.value to get DE genes? > To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method = "BH")< 0.05) Of course, this is possible. (Read up on the "multiple hypothesis testing problem" if this is unclear to you.) Not also, though, that you used an FDR of .1 in your DESeq code but of .05 here. Simon

ADD COMMENT • link 12.9 years ago Simon Anders ★ 3.7k

0

Entering edit mode

Hi Xiaohui to follow up on the filtering question: - the filter that Xiaohui applied is invalid, it will distort the null-distribution of the test statistic and lead to invalid p-values. This might explain the discrepancy. - the filter that Simon suggested is OK and should provide better results. - I'd also be keen to hear about your experience with this. A valid filtering criterion does not change the null distribution of the subsequently applied test statistic (it can, and in fact should, change the alternative distribution(s)). In practice, this means choosing a filter criterion that is statistically independent, under the null, from the test statistic, and in particular, that it does not use the class labels. Details in the below-cited PNAS paper. Best wishes Wolfgang Il May/21/11 11:02 AM, Simon Anders ha scritto: > Hi Xiaohui > > I agree thatit is worrying to get so different results from your two > approaches of using DESeq. Here are a few suggestion how you might > investigate this (and I'd be eager to hear about your findings): > > - Bourgen et al. (PNAS, 2010, 107:9546) have studied how pre- filtering > affects the validity and power of a test. They stress that it is > important that the filter is blind to the sample labels (actually: even > permutation invariant). So what you do here is not statistically sound: > > > filter=dat[rowSums(dat[,group1]>= 8) | rowSums(dat[,group2]>= 8), ] > > Try instead something like: > > filter=dat[rowSums(dat) >= 16, ] > > - How does your filter affect the variance functions? Do the plots > generated by 'scvPlot()' differ between the filtered and the unfiltered > data set? > > - If so, are the hits that you get at expression strength were the > variance functions differ? Are they at the low end, i.e., where the > filter made changes? > > - Have you tried what happens if you filter after estimating variance? > The raw p values should be the same as without filtering, but the > adjusted p values might get better. > > To be honest, I'm currently a bit at a loss which one is more correct: > Filtering before or after variance estimation. Let's hear what other > people on the list think. > >> 2. For EdgeR > > DESeq and edgeR are sufficiently similar that any correct answer > regarding filtering should apply to both. > >> 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after >> adjusting p.value, is this possible? Then, can I used the *unadjusted* >> p.value to get DE genes? >> To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method = >> "BH")< 0.05) > > Of course, this is possible. (Read up on the "multiple hypothesis > testing problem" if this is unclear to you.) Not also, though, that you > used an FDR of .1 in your DESeq code but of .05 here. > > Simon > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber

ADD REPLY • link 12.9 years ago Wolfgang Huber ★ 13k

0

Entering edit mode

On 05/21/2011 07:07 AM, Wolfgang Huber wrote: > Hi Xiaohui > > to follow up on the filtering question: > > - the filter that Xiaohui applied is invalid, it will distort the > null-distribution of the test statistic and lead to invalid p-values. > This might explain the discrepancy. > > - the filter that Simon suggested is OK and should provide better results. > > - I'd also be keen to hear about your experience with this. > > A valid filtering criterion does not change the null distribution of the > subsequently applied test statistic (it can, and in fact should, change > the alternative distribution(s)). In practice, this means choosing a > filter criterion that is statistically independent, under the null, from > the test statistic, and in particular, that it does not use the class > labels. Details in the below-cited PNAS paper. Was wondering whether, since the dispersion parameter is estimated from the data, in some strict sense the filtering and testing procedures are not independent under the null anyway? For the same reason that one would not want to use a variance filter before a limma-style analysis, if I understand correctly. Martin > > Best wishes > Wolfgang > > > > > > Il May/21/11 11:02 AM, Simon Anders ha scritto: >> Hi Xiaohui >> >> I agree thatit is worrying to get so different results from your two >> approaches of using DESeq. Here are a few suggestion how you might >> investigate this (and I'd be eager to hear about your findings): >> >> - Bourgen et al. (PNAS, 2010, 107:9546) have studied how pre- filtering >> affects the validity and power of a test. They stress that it is >> important that the filter is blind to the sample labels (actually: even >> permutation invariant). So what you do here is not statistically sound: >> >> > filter=dat[rowSums(dat[,group1]>= 8) | rowSums(dat[,group2]>= 8), ] >> >> Try instead something like: >> >> filter=dat[rowSums(dat) >= 16, ] >> >> - How does your filter affect the variance functions? Do the plots >> generated by 'scvPlot()' differ between the filtered and the unfiltered >> data set? >> >> - If so, are the hits that you get at expression strength were the >> variance functions differ? Are they at the low end, i.e., where the >> filter made changes? >> >> - Have you tried what happens if you filter after estimating variance? >> The raw p values should be the same as without filtering, but the >> adjusted p values might get better. >> >> To be honest, I'm currently a bit at a loss which one is more correct: >> Filtering before or after variance estimation. Let's hear what other >> people on the list think. >> >>> 2. For EdgeR >> >> DESeq and edgeR are sufficiently similar that any correct answer >> regarding filtering should apply to both. >> >>> 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after >>> adjusting p.value, is this possible? Then, can I used the *unadjusted* >>> p.value to get DE genes? >>> To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method = >>> "BH")< 0.05) >> >> Of course, this is possible. (Read up on the "multiple hypothesis >> testing problem" if this is unclear to you.) Not also, though, that you >> used an FDR of .1 in your DESeq code but of .05 here. >> >> Simon >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793

ADD REPLY • link 12.9 years ago Martin Morgan 25k

0

Entering edit mode

Hi Martin, this is a very good question, and it needs more investigation. I'll have a closer look into this and report back in this place. Two comments though already: - That the dispersion parameter is estimated from the data need by itself not be problematic (It is not problematic in the microattay case that the t-test variances are estimated from the data.) - The situation with limma is different: there, the problem is that limma's Bayesian model, which assumes that gene-level error variances ?^2_1, ..., ?^2_m follow a scaled inverse ?2 distribution, no longer fits the data when the data are filtered for genes with low overall variance. Due to the way that limma is implemented, the posterior degrees-of-freedom estimate of that distribution then becomes infinite, gene-level variance estimates will be ignored (leading to an unintended analysis based on fold change only), and type I error rate control is lost. See Fig. 2B+C in the paper, and the text on p.3. So, what we need to do with DESeq is - simulate data according to the null model - see if & how filtering affects the estimated mean-dispersion relationship - see if & how it affects the type I error globally and locally (for particular ranges of the mean). Another point is how much the filtering improves power - that will be related to how many genes can actually be removed by the filtering step, which depends on the data. Best wishes Wolfgang Il May/21/11 4:36 PM, Martin Morgan ha scritto: > On 05/21/2011 07:07 AM, Wolfgang Huber wrote: >> Hi Xiaohui >> >> to follow up on the filtering question: >> >> - the filter that Xiaohui applied is invalid, it will distort the >> null-distribution of the test statistic and lead to invalid p-values. >> This might explain the discrepancy. >> >> - the filter that Simon suggested is OK and should provide better >> results. >> >> - I'd also be keen to hear about your experience with this. >> >> A valid filtering criterion does not change the null distribution of the >> subsequently applied test statistic (it can, and in fact should, change >> the alternative distribution(s)). In practice, this means choosing a >> filter criterion that is statistically independent, under the null, from >> the test statistic, and in particular, that it does not use the class >> labels. Details in the below-cited PNAS paper. > > Was wondering whether, since the dispersion parameter is estimated from > the data, in some strict sense the filtering and testing procedures are > not independent under the null anyway? For the same reason that one > would not want to use a variance filter before a limma-style analysis, > if I understand correctly. > > Martin > >> >> Best wishes >> Wolfgang >> >> >> >> >> >> Il May/21/11 11:02 AM, Simon Anders ha scritto: >>> Hi Xiaohui >>> >>> I agree thatit is worrying to get so different results from your two >>> approaches of using DESeq. Here are a few suggestion how you might >>> investigate this (and I'd be eager to hear about your findings): >>> >>> - Bourgen et al. (PNAS, 2010, 107:9546) have studied how pre- filtering >>> affects the validity and power of a test. They stress that it is >>> important that the filter is blind to the sample labels (actually: even >>> permutation invariant). So what you do here is not statistically sound: >>> >>> > filter=dat[rowSums(dat[,group1]>= 8) | rowSums(dat[,group2]>= 8), ] >>> >>> Try instead something like: >>> >>> filter=dat[rowSums(dat) >= 16, ] >>> >>> - How does your filter affect the variance functions? Do the plots >>> generated by 'scvPlot()' differ between the filtered and the unfiltered >>> data set? >>> >>> - If so, are the hits that you get at expression strength were the >>> variance functions differ? Are they at the low end, i.e., where the >>> filter made changes? >>> >>> - Have you tried what happens if you filter after estimating variance? >>> The raw p values should be the same as without filtering, but the >>> adjusted p values might get better. >>> >>> To be honest, I'm currently a bit at a loss which one is more correct: >>> Filtering before or after variance estimation. Let's hear what other >>> people on the list think. >>> >>>> 2. For EdgeR >>> >>> DESeq and edgeR are sufficiently similar that any correct answer >>> regarding filtering should apply to both. >>> >>>> 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after >>>> adjusting p.value, is this possible? Then, can I used the *unadjusted* >>>> p.value to get DE genes? >>>> To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method = >>>> "BH")< 0.05) >>> >>> Of course, this is possible. (Read up on the "multiple hypothesis >>> testing problem" if this is unclear to you.) Not also, though, that you >>> used an FDR of .1 in your DESeq code but of .05 here. >>> >>> Simon >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> > > -- Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber

ADD REPLY • link 12.9 years ago Wolfgang Huber ★ 13k

0

Entering edit mode

Just to summarise the main points here: Filtering on overall counts (disregarding group labels) is admissible in conjunction with DESeq and afaIcs edgeR: the type-I error is not measurably affected. There is benefit to it: more genes can be detected as differentially expressed at the same type-I error (as measured e.g. by FDR). The size of the benefit depends on the threshold used for the cutoff. I do not see major objections to trying several cutoffs and "picking the best". The size of the benefit tends to be smaller than with microarray data. In my experience, one gets 10-20% more differentially expressed genes. This is still not bad for something that is "free". Why? The reason for the smaller size of the benefit is that those genes that have zero counts everywhere are already implicitly filtered either by people's data processing or by DESeq/edgeR. The number of genes that have some non-zero counts but not enough to be significant seem to lie in the range mentioned above (this will depend on coverage) - and it is these for which the filtering provides benefit. With microarrays, both of these classes of genes needed to be filtered explicitly. Best wishes Wolfgang > > Hi Martin, > > this is a very good question, and it needs more investigation. I'll have > a closer look into this and report back in this place. Two comments > though already: > > - That the dispersion parameter is estimated from the data need by > itself not be problematic (It is not problematic in the microattay case > that the t-test variances are estimated from the data.) > > - The situation with limma is different: there, the problem is that > limma's Bayesian model, which assumes that gene-level error variances > ?^2_1, ..., ?^2_m follow a scaled inverse ?2 distribution, no longer > fits the data when the data are filtered for genes with low overall > variance. Due to the way that limma is implemented, the posterior > degrees-of-freedom estimate of that distribution then becomes infinite, > gene-level variance estimates will be ignored (leading to an unintended > analysis based on fold change only), and type I error rate control is > lost. See Fig. 2B+C in the paper, and the text on p.3. > > So, what we need to do with DESeq is > - simulate data according to the null model > - see if & how filtering affects the estimated mean-dispersion relationship > - see if & how it affects the type I error globally and locally (for > particular ranges of the mean). > > Another point is how much the filtering improves power - that will be > related to how many genes can actually be removed by the filtering step, > which depends on the data. > > Best wishes > Wolfgang > > > > Il May/21/11 4:36 PM, Martin Morgan ha scritto: >> On 05/21/2011 07:07 AM, Wolfgang Huber wrote: >>> Hi Xiaohui >>> >>> to follow up on the filtering question: >>> >>> - the filter that Xiaohui applied is invalid, it will distort the >>> null-distribution of the test statistic and lead to invalid p-values. >>> This might explain the discrepancy. >>> >>> - the filter that Simon suggested is OK and should provide better >>> results. >>> >>> - I'd also be keen to hear about your experience with this. >>> >>> A valid filtering criterion does not change the null distribution of the >>> subsequently applied test statistic (it can, and in fact should, change >>> the alternative distribution(s)). In practice, this means choosing a >>> filter criterion that is statistically independent, under the null, from >>> the test statistic, and in particular, that it does not use the class >>> labels. Details in the below-cited PNAS paper. >> >> Was wondering whether, since the dispersion parameter is estimated from >> the data, in some strict sense the filtering and testing procedures are >> not independent under the null anyway? For the same reason that one >> would not want to use a variance filter before a limma-style analysis, >> if I understand correctly. >> >> Martin >> >>> >>> Best wishes >>> Wolfgang >>> >>> >>> >>> >>> >>> Il May/21/11 11:02 AM, Simon Anders ha scritto: >>>> Hi Xiaohui >>>> >>>> I agree thatit is worrying to get so different results from your two >>>> approaches of using DESeq. Here are a few suggestion how you might >>>> investigate this (and I'd be eager to hear about your findings): >>>> >>>> - Bourgen et al. (PNAS, 2010, 107:9546) have studied how pre- filtering >>>> affects the validity and power of a test. They stress that it is >>>> important that the filter is blind to the sample labels (actually: even >>>> permutation invariant). So what you do here is not statistically sound: >>>> >>>> > filter=dat[rowSums(dat[,group1]>= 8) | rowSums(dat[,group2]>= 8), ] >>>> >>>> Try instead something like: >>>> >>>> filter=dat[rowSums(dat) >= 16, ] >>>> >>>> - How does your filter affect the variance functions? Do the plots >>>> generated by 'scvPlot()' differ between the filtered and the unfiltered >>>> data set? >>>> >>>> - If so, are the hits that you get at expression strength were the >>>> variance functions differ? Are they at the low end, i.e., where the >>>> filter made changes? >>>> >>>> - Have you tried what happens if you filter after estimating variance? >>>> The raw p values should be the same as without filtering, but the >>>> adjusted p values might get better. >>>> >>>> To be honest, I'm currently a bit at a loss which one is more correct: >>>> Filtering before or after variance estimation. Let's hear what other >>>> people on the list think. >>>> >>>>> 2. For EdgeR >>>> >>>> DESeq and edgeR are sufficiently similar that any correct answer >>>> regarding filtering should apply to both. >>>> >>>>> 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after >>>>> adjusting p.value, is this possible? Then, can I used the *unadjusted* >>>>> p.value to get DE genes? >>>>> To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method = >>>>> "BH")< 0.05) >>>> >>>> Of course, this is possible. (Read up on the "multiple hypothesis >>>> testing problem" if this is unclear to you.) Not also, though, that you >>>> used an FDR of .1 in your DESeq code but of .05 here. >>>> >>>> Simon >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> >> > > -- Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber

ADD REPLY • link 12.9 years ago Wolfgang Huber ★ 13k

0

Entering edit mode

Hi, I understand that filtering the dataset based on all the samples is more adequate than per experimental group. However, if one has unbalanced samples, is it still valid? Assuming one group has 10 samples (A) and other group has 5 samples (B) sequenced. If I filter by total number of reads for the 15 samples, I would eliminate many more genes that are expressed at the lower range from group B, compared to the genes expressed at the lower range in the group A. Is there a way around it for experiments performed with unbalanced number of samples? Best regards, Fernando ________________________________________ From: bioconductor-bounces@r-project.org [bioconductor- bounces@r-project.org] On Behalf Of Wolfgang Huber [whuber@embl.de] Sent: Saturday, May 21, 2011 9:07 AM To: bioconductor at r-project.org Subject: Re: [BioC] Filtering out tags with low counts in DESeq and EgdeR? Hi Xiaohui to follow up on the filtering question: - the filter that Xiaohui applied is invalid, it will distort the null-distribution of the test statistic and lead to invalid p-values. This might explain the discrepancy. - the filter that Simon suggested is OK and should provide better results. - I'd also be keen to hear about your experience with this. A valid filtering criterion does not change the null distribution of the subsequently applied test statistic (it can, and in fact should, change the alternative distribution(s)). In practice, this means choosing a filter criterion that is statistically independent, under the null, from the test statistic, and in particular, that it does not use the class labels. Details in the below-cited PNAS paper. Best wishes Wolfgang Il May/21/11 11:02 AM, Simon Anders ha scritto: > Hi Xiaohui > > I agree thatit is worrying to get so different results from your two > approaches of using DESeq. Here are a few suggestion how you might > investigate this (and I'd be eager to hear about your findings): > > - Bourgen et al. (PNAS, 2010, 107:9546) have studied how pre- filtering > affects the validity and power of a test. They stress that it is > important that the filter is blind to the sample labels (actually: even > permutation invariant). So what you do here is not statistically sound: > > > filter=dat[rowSums(dat[,group1]>= 8) | rowSums(dat[,group2]>= 8), ] > > Try instead something like: > > filter=dat[rowSums(dat) >= 16, ] > > - How does your filter affect the variance functions? Do the plots > generated by 'scvPlot()' differ between the filtered and the unfiltered > data set? > > - If so, are the hits that you get at expression strength were the > variance functions differ? Are they at the low end, i.e., where the > filter made changes? > > - Have you tried what happens if you filter after estimating variance? > The raw p values should be the same as without filtering, but the > adjusted p values might get better. > > To be honest, I'm currently a bit at a loss which one is more correct: > Filtering before or after variance estimation. Let's hear what other > people on the list think. > >> 2. For EdgeR > > DESeq and edgeR are sufficiently similar that any correct answer > regarding filtering should apply to both. > >> 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after >> adjusting p.value, is this possible? Then, can I used the *unadjusted* >> p.value to get DE genes? >> To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method = >> "BH")< 0.05) > > Of course, this is possible. (Read up on the "multiple hypothesis > testing problem" if this is unclear to you.) Not also, though, that you > used an FDR of .1 in your DESeq code but of .05 here. > > Simon > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 12.9 years ago Biase, Fernando ▴ 150

0

Entering edit mode

Dear Fernando (un)balanace of group sizes does not play a role. What is important is that the test statistic for differential expression is statistically independent from the filter criterion *under the null hypothesis* of no differential expression. To very good approximation, this is the case for - the row sums of the count matrix - the negative binomial test that DESeq and edgeR perform Thus, filtering is OK also for experiments performed with unbalanced number of samples. Wolfgang Il May/21/11 11:09 PM, Biase, Fernando ha scritto: > Hi, > > I understand that filtering the dataset based on all the samples is more adequate than per experimental group. However, if one has unbalanced samples, is it still valid? > > Assuming one group has 10 samples (A) and other group has 5 samples (B) sequenced. If I filter by total number of reads for the 15 samples, I would eliminate many more genes that are expressed at the lower range from group B, compared to the genes expressed at the lower range in the group A. > > Is there a way around it for experiments performed with unbalanced number of samples? > > Best regards, > > Fernando > > ________________________________________ > From: bioconductor-bounces at r-project.org [bioconductor-bounces at r-project.org] On Behalf Of Wolfgang Huber [whuber at embl.de] > Sent: Saturday, May 21, 2011 9:07 AM > To: bioconductor at r-project.org > Subject: Re: [BioC] Filtering out tags with low counts in DESeq and EgdeR? > > Hi Xiaohui > > to follow up on the filtering question: > > - the filter that Xiaohui applied is invalid, it will distort the > null-distribution of the test statistic and lead to invalid p-values. > This might explain the discrepancy. > > - the filter that Simon suggested is OK and should provide better results. > > - I'd also be keen to hear about your experience with this. > > A valid filtering criterion does not change the null distribution of the > subsequently applied test statistic (it can, and in fact should, change > the alternative distribution(s)). In practice, this means choosing a > filter criterion that is statistically independent, under the null, from > the test statistic, and in particular, that it does not use the class > labels. Details in the below-cited PNAS paper. > > Best wishes > Wolfgang > > > > > > Il May/21/11 11:02 AM, Simon Anders ha scritto: >> Hi Xiaohui >> >> I agree thatit is worrying to get so different results from your two >> approaches of using DESeq. Here are a few suggestion how you might >> investigate this (and I'd be eager to hear about your findings): >> >> - Bourgen et al. (PNAS, 2010, 107:9546) have studied how pre- filtering >> affects the validity and power of a test. They stress that it is >> important that the filter is blind to the sample labels (actually: even >> permutation invariant). So what you do here is not statistically sound: >> >> > filter=dat[rowSums(dat[,group1]>= 8) | rowSums(dat[,group2]>= 8), ] >> >> Try instead something like: >> >> filter=dat[rowSums(dat)>= 16, ] >> >> - How does your filter affect the variance functions? Do the plots >> generated by 'scvPlot()' differ between the filtered and the unfiltered >> data set? >> >> - If so, are the hits that you get at expression strength were the >> variance functions differ? Are they at the low end, i.e., where the >> filter made changes? >> >> - Have you tried what happens if you filter after estimating variance? >> The raw p values should be the same as without filtering, but the >> adjusted p values might get better. >> >> To be honest, I'm currently a bit at a loss which one is more correct: >> Filtering before or after variance estimation. Let's hear what other >> people on the list think. >> >>> 2. For EdgeR >> >> DESeq and edgeR are sufficiently similar that any correct answer >> regarding filtering should apply to both. >> >>> 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after >>> adjusting p.value, is this possible? Then, can I used the *unadjusted* >>> p.value to get DE genes? >>> To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method = >>> "BH")< 0.05) >> >> Of course, this is possible. (Read up on the "multiple hypothesis >> testing problem" if this is unclear to you.) Not also, though, that you >> used an FDR of .1 in your DESeq code but of .05 here. >> >> Simon >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > > > Wolfgang Huber > EMBL > http://www.embl.de/research/units/genome_biology/huber > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber

ADD REPLY • link 12.9 years ago Wolfgang Huber ★ 13k

0

Entering edit mode

Xiaohui Wu ▴ 280

@xiaohui-wu-4141

Last seen 9.6 years ago

Hi list, Please ignore my last email with two figure attachments because some calculations were not right, sorry about this. Actually, there were some overlap among different ways to get DE genes using DESeq, I had a typo in my previous code: filter=dat[rowSums(dat > = 16), ] --> filter=dat[rowSums(dat) > = 16, ], so no overlap was found. But, I still couldn't get any DE genes after adjusting p.value using EdgeR. It was supposed to get some but not 0 DE gene, and there should be some overlap between EdgeR's result and DESeq's or other similar tools'. To follow up Simon's suggestions: >> > filter=dat[rowSums(dat[,group1]> = 8) | rowSums(dat[,group2]> = 8), ] >> Try instead something like: >> filter=dat[rowSums(dat) > = 16, ] >> >> - How does your filter affect the variance functions? Do the plots >> generated by 'scvPlot()' differ between the filtered and the unfiltered >> data set? >> - If so, are the hits that you get at expression strength were the >> variance functions differ? Are they at the low end, i.e., where the >> filter made changes? The scvPlot for unfiltered data or filtered data are similar, the differences are at the low end (at the begining of the scvPlot). >> - Have you tried what happens if you filter after estimating variance? >> The raw p values should be the same as without filtering, but the >> adjusted p values might get better. Using DESeq, I tested three ways, now the overlap among different sets seemed normal. (A) no filter gene# 51939 DE gene# (pvalue<0.1) 2385 DE gene# (padjust<0.1) 106 (B) filtering before estimating variance (rowsums>=16) gene filtered# 20923 DE gene# (pvalue<0.1) 2403 DE gene# (padjust<0.1) 197 (105 in (A)'s 106 genes) (C) filtering and adjusting after estimating variance (rowsum>=16) gene filtered# 20923 DE gene# (pvalue<0.1) 2247 DE gene# (padjust<0.1) 139 (106 in (A), 138 in (B)) Using EdgeR, I tested the common disper and tagwise disper it provides, but no DE genes was found after adjusting pvalue. (D) common disper gene# 51939 DE gene# (pvalue<0.1) 1771 DE gene# (padjust<0.1) 0 (E) tagwise disper gene# 51939 DE gene# (pvalue<0.1) 880 (88.5% overlap with (D)) DE gene# (padjust<0.1) 0 (F) filtering before exactTest (rowsum>=16, tagwise disper) gene filtered# 20923 DE gene# (pvalue<0.1) 1771 DE gene# (padjust<0.1) 0 >> > 2. For EdgeR >> >> DESeq and edgeR are sufficiently similar that any correct answer >> regarding filtering should apply to both. >> >> > 2) I got 800 DE genes with p.value< 0.1, but got 0 DE genes after adjusting p.value, is this possible? Then, can I used the *unadjusted* p.value to get DE genes? >> > To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method = "BH")< 0.05) >> >> Of course, this is possible. (Read up on the "multiple hypothesis >> testing problem" if this is unclear to you.) Not also, though, that you >> used an FDR of .1 in your DESeq code but of .05 here. Actually, I tried 0.1 or 0.05, or other adjusted methods provided in EdgeR, always 0 DE genes after adjusted. Best, Xiaohui >> Il May/21/11 5:16 PM, Wolfgang ha scritto: >> Hi Martin, >> >> this is a very good question, and it needs more investigation. I'll have >> a closer look into this and report back in this place. Two comments >> though already: >> >> - That the dispersion parameter is estimated from the data need by >> itself not be problematic (It is not problematic in the microattay case >> that the t-test variances are estimated from the data.) >> >> - The situation with limma is different: there, the problem is that >> limma's Bayesian model, which assumes that gene-level error variances >> ?^2_1, ..., ?^2_m follow a scaled inverse ?2 distribution, no longer >> fits the data when the data are filtered for genes with low overall >> variance. Due to the way that limma is implemented, the posterior >> degrees-of-freedom estimate of that distribution then becomes infinite, >> gene-level variance estimates will be ignored (leading to an unintended >> analysis based on fold change only), and type I error rate control is >> lost. See Fig. 2B+C in the paper, and the text on p.3. >> >> So, what we need to do with DESeq is >> - simulate data according to the null model >> - see if & how filtering affects the estimated mean-dispersion relationship >> - see if & how it affects the type I error globally and locally (for >> particular ranges of the mean). >> >> Another point is how much the filtering improves power - that will be >> related to how many genes can actually be removed by the filtering step, >> which depends on the data. >> >> Best wishes >> Wolfgang >> >> >> >> Il May/21/11 4:36 PM, Martin Morgan ha scritto: >> > On 05/21/2011 07:07 AM, Wolfgang Huber wrote: >> > > Hi Xiaohui >> > > >> > > to follow up on the filtering question: >> > > >> > > - the filter that Xiaohui applied is invalid, it will distort the >> > > null-distribution of the test statistic and lead to invalid p-values. >> > > This might explain the discrepancy. >> > > >> > > - the filter that Simon suggested is OK and should provide better >> > > results. >> > > >> > > - I'd also be keen to hear about your experience with this. >> > > >> > > A valid filtering criterion does not change the null distribution of the >> > > subsequently applied test statistic (it can, and in fact should, change >> > > the alternative distribution(s)). In practice, this means choosing a >> > > filter criterion that is statistically independent, under the null, from >> > > the test statistic, and in particular, that it does not use the class >> > > labels. Details in the below-cited PNAS paper. >> > >> > Was wondering whether, since the dispersion parameter is estimated from >> > the data, in some strict sense the filtering and testing procedures are >> > not independent under the null anyway? For the same reason that one >> > would not want to use a variance filter before a limma-style analysis, >> > if I understand correctly. >> > >> > Martin >> > >> > > >> > > Best wishes >> > > Wolfgang >> > > >> > > >> > > >> > > >> > > >> > > Il May/21/11 11:02 AM, Simon Anders ha scritto: >> > > > Hi Xiaohui >> > > > >> > > > I agree thatit is worrying to get so different results from your two >> > > > approaches of using DESeq. Here are a few suggestion how you might >> > > > investigate this (and I'd be eager to hear about your findings): >> > > > >> > > > - Bourgen et al. (PNAS, 2010, 107:9546) have studied how pre-filtering >> > > > affects the validity and power of a test. They stress that it is >> > > > important that the filter is blind to the sample labels (actually: even >> > > > permutation invariant). So what you do here is not statistically sound: >> > > > >> > > > > filter=dat[rowSums(dat[,group1]> = 8) | rowSums(dat[,group2]> = 8), ] >> > > > >> > > > Try instead something like: >> > > > >> > > > filter=dat[rowSums(dat) > = 16, ] >> > > > >> > > > - How does your filter affect the variance functions? Do the plots >> > > > generated by 'scvPlot()' differ between the filtered and the unfiltered >> > > > data set? >> > > > >> > > > - If so, are the hits that you get at expression strength were the >> > > > variance functions differ? Are they at the low end, i.e., where the >> > > > filter made changes? >> > > > >> > > > - Have you tried what happens if you filter after estimating variance? >> > > > The raw p values should be the same as without filtering, but the >> > > > adjusted p values might get better. >> > > > >> > > > To be honest, I'm currently a bit at a loss which one is more correct: >> > > > Filtering before or after variance estimation. Let's hear what other >> > > > people on the list think. >> > > > >> > > > > 2. For EdgeR >> > > > >> > > > DESeq and edgeR are sufficiently similar that any correct answer >> > > > regarding filtering should apply to both. >> > > > >> > > > > 2) I got 800 DE genes with p.value< 0.1, but got 0 DE genes after >> > > > > adjusting p.value, is this possible? Then, can I used the *unadjusted* >> > > > > p.value to get DE genes? >> > > > > To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method = >> > > > > "BH")< 0.05) >> > > > >> > > > Of course, this is possible. (Read up on the "multiple hypothesis >> > > > testing problem" if this is unclear to you.) Not also, though, that you >> > > > used an FDR of .1 in your DESeq code but of .05 here. >> > > > >> > > > Simon >> > > >

ADD COMMENT • link 12.9 years ago Xiaohui Wu ▴ 280

0

Entering edit mode

Xiaohui Wu ▴ 280

@xiaohui-wu-4141

Last seen 9.6 years ago

Thanks, Gordon. I tried setting different common dispersion and prior.n in EdgeR, but the results were so different that I became more confused now. I would be very grateful if you or anyone could give me some idea. 1) Is the following step to filter out low counts gene right? Do I need to change the libsize using d$counts>=16 or just remain the original libsize? f <- calcNormFactors(dat) f <- f/exp(mean(log(f))) d <- DGEList(counts =dat, group=group,lib.size = colSums(dat) * f) d <- d[rowSums(d$counts) >=16, ] 2) How to estimate which common.dispersion or prior.n is better? When I used EdgeR, I tried to get some DE genes after adjusting pvalue. To estimate common or tagwise dispertion, I used: d <- estimateCommonDisp(d) d <- estimateTagwiseDisp(d, prior.n = 10) NO DE gene was found after adjusting pvalue, about 800~1800 DE genes were found with pvalue<0.1 (before adjusting). In my data, the biological variation between the four replicates are high, (the estimateCommonDisp was 5, which means very high variation I think). When I used: d$common.dispersion=2, I got 10209 genes with pvalue<0.1, and 9532 DE genes with adjusted pvalue<0.1. When I used: prior.n=0, I got 3428 genes with pvalue<0.1, and 339 DE genes with adjusted pvalue<0.1. But NO DE gene could be found even using prior.n=1. Is it OK to use prior.n=0, or to set my own common dispersion? Should I use tagwise dispersion or common dispersion? Any help or suggestions will be appreciated. Best, Xiaohui ------------------------------------------------------------- Send?Gordon K Smyth Date?2011-05-25 09:59:36 Receive?Wu, Xiaohui Ms. Title?Filtering out tags with low counts in DESeq and EgdeR? Dear Xiaohui, I'll answer for edgeR, which is what I have experience with. I personally pre-filter RNA-Seq data in much the same way that I filter microarray data. I aim to keep transcripts that show some evidence of being expressed at a worthwhile level in at least one treatment group. If I am comparing two groups with sample sizes n1=5 and n2=10, then I will keep transcripts that achieve a minimum level of expression in at least 5 samples (the minimum group size). Note that this is done without regard to the sample labels, so no knowledge of group membership is used in the filtering. The minimum expression level might be, say, 1 count per million. Best wishes Gordon > Date: Sat, 21 May 2011 16:25:08 +0800 > From: "Xiaohui Wu" <wux3 at="" muohio.edu=""> > To: "bioconductor" <bioconductor at="" r-project.org=""> > Subject: [BioC] Filtering out tags with low counts in DESeq and EgdeR? > > Hello, > > I want to find DE genes between two conditions, using DESeq, EdgeR or other tools if any. > I have some questions on how/when to filter out genes with low counts and I am also confused on the final results. > > I have 50,000 genes, 2 conditions, each condition 4 replicates. 25,000 genes have >=8 counts in at least one condition. > The total counts in each replicate is as following: > cond1_rep1 cond1_rep2 cond1_rep3 cond1_rep4; cond2_rep1 cond2_rep2 cond2_rep3 cond2_rep4 > 1792310 1896603 1804500 2166961; 1792663 2492385 1781523 2222892 > > 1. For DESeq > I used all 50,000 genes to estimate size factor for normalization. > Then should I remove genes with low count before estimating variance or after that? > I got totally different results using the following two ways, which I was really surprised that NO overlap was found between the two DE gene sets. > > 1) no filtering before estimating variance > d <- newCountDataSet( dat, group ) > d <- estimateSizeFactors( d) > sf=sizeFactors(d) > d <- estimateVarianceFunctions( d ) > res <- nbinomTest( d,"cond1","cond2") > resSig <- res[ res$padj < .1, ] > > 2) filtering before estimating variance > filter=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ] ## remove low count genes > filter.ds <- newCountDataSet( filter, group ) > filter.ds$sizeFactor=sf ## use the sizefactor from 1) > filter.ds <- estimateVarianceFunctions( filter.ds ) > filter.res <- nbinomTest( filter.ds,"cond1","cond2") > filter.sig=filter.res[ filter.res$padj < .1, ] > nrow(filter.sig) > > Results: For my data, total 177 DE genes from 2) and 106 from 1), and no overlap between 1) and 2). > > 2. For EdgeR > 1) when should I remove low count genes? > f <- calcNormFactors(dat) ## I should calculate factor using the whole data, right? > f <- f/exp(mean(log(f))) > dat=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ] ## And, then filter out genes with low counts? > d <- DGEList(counts =dat, group=group,lib.size = colSums(dat) * f) > > 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after adjusting p.value, is this possible? Then, can I used the *unadjusted* p.value to get DE genes? > To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method = "BH") < 0.05) > > Any help or suggestions will be appreciated. > > Thanks, > Xiaohui ______________________________________________________________________ The information in this email is confidential and intended solely for the addressee. You must not disclose, forward, print or use it without the permission of the sender. ______________________________________________________________________ .

ADD COMMENT • link 12.9 years ago Xiaohui Wu ▴ 280

0

Entering edit mode

Dear Xiao, Your dispersion value of 5 is higher than I have even seen for real data, This means that the coefficient of variation of expression is sqrt(5)=2.2, meaning the standard deviation is more than twice the mean. With such wild data, it is no surprise that you get no DE genes. In fact, I'd be a bit worried if edgeR did give you DE results. It is not valid to simply input a smaller dispersion value than your data indicates. Judging from the dispersion value, I think it is likely that you need to reconsider the basic quality of your data. Trying to repair things with different statistical methods is not likely to solve the basic problem. Best wishes Gordon --------------------------------------------- Professor Gordon K Smyth, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Tel: (03) 9345 2326, Fax (03) 9347 0852, smyth at wehi.edu.au http://www.wehi.edu.au http://www.statsci.org/smyth On Wed, 25 May 2011, Xiaohui Wu wrote: > Thanks, Gordon. > > I tried setting different common dispersion and prior.n in EdgeR, but the results were so different that I became more confused now. I would be very grateful if you or anyone could give me some idea. > > 1) Is the following step to filter out low counts gene right? Do I need to change the libsize using d$counts>=16 or just remain the original libsize? > f <- calcNormFactors(dat) > f <- f/exp(mean(log(f))) > d <- DGEList(counts =dat, group=group,lib.size = colSums(dat) * f) > d <- d[rowSums(d$counts) >=16, ] > > 2) How to estimate which common.dispersion or prior.n is better? > When I used EdgeR, I tried to get some DE genes after adjusting pvalue. To estimate common or tagwise dispertion, I used: > d <- estimateCommonDisp(d) > d <- estimateTagwiseDisp(d, prior.n = 10) > NO DE gene was found after adjusting pvalue, about 800~1800 DE genes were found with pvalue<0.1 (before adjusting). > > In my data, the biological variation between the four replicates are high, (the estimateCommonDisp was 5, which means very high variation I think). > When I used: d$common.dispersion=2, I got 10209 genes with pvalue<0.1, and 9532 DE genes with adjusted pvalue<0.1. > When I used: prior.n=0, I got 3428 genes with pvalue<0.1, and 339 DE genes with adjusted pvalue<0.1. > But NO DE gene could be found even using prior.n=1. > Is it OK to use prior.n=0, or to set my own common dispersion? Should I use tagwise dispersion or common dispersion? > > Any help or suggestions will be appreciated. > > Best, > Xiaohui > > > > > ------------------------------------------------------------- > Send?Gordon K Smyth > Date?2011-05-25 09:59:36 > Receive?Wu, Xiaohui Ms. > Title?Filtering out tags with low counts in DESeq and EgdeR? > > Dear Xiaohui, > > I'll answer for edgeR, which is what I have experience with. > > I personally pre-filter RNA-Seq data in much the same way that I filter > microarray data. I aim to keep transcripts that show some evidence of > being expressed at a worthwhile level in at least one treatment group. > If I am comparing two groups with sample sizes n1=5 and n2=10, then I will > keep transcripts that achieve a minimum level of expression in at least 5 > samples (the minimum group size). Note that this is done without regard > to the sample labels, so no knowledge of group membership is used in the > filtering. The minimum expression level might be, say, 1 count per > million. > > Best wishes > Gordon > >> Date: Sat, 21 May 2011 16:25:08 +0800 >> From: "Xiaohui Wu" <wux3 at="" muohio.edu=""> >> To: "bioconductor" <bioconductor at="" r-project.org=""> >> Subject: [BioC] Filtering out tags with low counts in DESeq and EgdeR? >> >> Hello, >> >> I want to find DE genes between two conditions, using DESeq, EdgeR or other tools if any. >> I have some questions on how/when to filter out genes with low counts and I am also confused on the final results. >> >> I have 50,000 genes, 2 conditions, each condition 4 replicates. 25,000 genes have >=8 counts in at least one condition. >> The total counts in each replicate is as following: >> cond1_rep1 cond1_rep2 cond1_rep3 cond1_rep4; cond2_rep1 cond2_rep2 cond2_rep3 cond2_rep4 >> 1792310 1896603 1804500 2166961; 1792663 2492385 1781523 2222892 >> >> 1. For DESeq >> I used all 50,000 genes to estimate size factor for normalization. >> Then should I remove genes with low count before estimating variance or after that? >> I got totally different results using the following two ways, which I was really surprised that NO overlap was found between the two DE gene sets. >> >> 1) no filtering before estimating variance >> d <- newCountDataSet( dat, group ) >> d <- estimateSizeFactors( d) >> sf=sizeFactors(d) >> d <- estimateVarianceFunctions( d ) >> res <- nbinomTest( d,"cond1","cond2") >> resSig <- res[ res$padj < .1, ] >> >> 2) filtering before estimating variance >> filter=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ] ## remove low count genes >> filter.ds <- newCountDataSet( filter, group ) >> filter.ds$sizeFactor=sf ## use the sizefactor from 1) >> filter.ds <- estimateVarianceFunctions( filter.ds ) >> filter.res <- nbinomTest( filter.ds,"cond1","cond2") >> filter.sig=filter.res[ filter.res$padj < .1, ] >> nrow(filter.sig) >> >> Results: For my data, total 177 DE genes from 2) and 106 from 1), and no overlap between 1) and 2). >> >> 2. For EdgeR >> 1) when should I remove low count genes? >> f <- calcNormFactors(dat) ## I should calculate factor using the whole data, right? >> f <- f/exp(mean(log(f))) >> dat=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ] ## And, then filter out genes with low counts? >> d <- DGEList(counts =dat, group=group,lib.size = colSums(dat) * f) >> >> 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after adjusting p.value, is this possible? Then, can I used the *unadjusted* p.value to get DE genes? >> To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method = "BH") < 0.05) >> >> Any help or suggestions will be appreciated. >> >> Thanks, >> Xiaohui > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:12}}

ADD REPLY • link 12.9 years ago Gordon Smyth 50k

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 10 minutes ago

WEHI, Melbourne, Australia

Dear Xiaohui, I'll answer for edgeR, which is what I have experience with. I personally pre-filter RNA-Seq data in much the same way that I filter microarray data. I aim to keep transcripts that show some evidence of being expressed at a worthwhile level in at least one treatment group. If I am comparing two groups with sample sizes n1=5 and n2=10, then I will keep transcripts that achieve a minimum level of expression in at least 5 samples (the minimum group size). Note that this is done without regard to the sample labels, so no knowledge of group membership is used in the filtering. The minimum expression level might be, say, 1 count per million. Best wishes Gordon > Date: Sat, 21 May 2011 16:25:08 +0800 > From: "Xiaohui Wu" <wux3 at="" muohio.edu=""> > To: "bioconductor" <bioconductor at="" r-project.org=""> > Subject: [BioC] Filtering out tags with low counts in DESeq and EgdeR? > > Hello, > > I want to find DE genes between two conditions, using DESeq, EdgeR or other tools if any. > I have some questions on how/when to filter out genes with low counts and I am also confused on the final results. > > I have 50,000 genes, 2 conditions, each condition 4 replicates. 25,000 genes have >=8 counts in at least one condition. > The total counts in each replicate is as following: > cond1_rep1 cond1_rep2 cond1_rep3 cond1_rep4; cond2_rep1 cond2_rep2 cond2_rep3 cond2_rep4 > 1792310 1896603 1804500 2166961; 1792663 2492385 1781523 2222892 > > 1. For DESeq > I used all 50,000 genes to estimate size factor for normalization. > Then should I remove genes with low count before estimating variance or after that? > I got totally different results using the following two ways, which I was really surprised that NO overlap was found between the two DE gene sets. > > 1) no filtering before estimating variance > d <- newCountDataSet( dat, group ) > d <- estimateSizeFactors( d) > sf=sizeFactors(d) > d <- estimateVarianceFunctions( d ) > res <- nbinomTest( d,"cond1","cond2") > resSig <- res[ res$padj < .1, ] > > 2) filtering before estimating variance > filter=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ] ## remove low count genes > filter.ds <- newCountDataSet( filter, group ) > filter.ds$sizeFactor=sf ## use the sizefactor from 1) > filter.ds <- estimateVarianceFunctions( filter.ds ) > filter.res <- nbinomTest( filter.ds,"cond1","cond2") > filter.sig=filter.res[ filter.res$padj < .1, ] > nrow(filter.sig) > > Results: For my data, total 177 DE genes from 2) and 106 from 1), and no overlap between 1) and 2). > > 2. For EdgeR > 1) when should I remove low count genes? > f <- calcNormFactors(dat) ## I should calculate factor using the whole data, right? > f <- f/exp(mean(log(f))) > dat=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ] ## And, then filter out genes with low counts? > d <- DGEList(counts =dat, group=group,lib.size = colSums(dat) * f) > > 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after adjusting p.value, is this possible? Then, can I used the *unadjusted* p.value to get DE genes? > To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method = "BH") < 0.05) > > Any help or suggestions will be appreciated. > > Thanks, > Xiaohui ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 12.9 years ago Gordon Smyth 50k

0

Entering edit mode

Xiaohui Wu ▴ 280

@xiaohui-wu-4141

Last seen 9.6 years ago

Thank you very much, Gordon. Yes, it is really bad that the data is so wild. But before examining the data, do you think it is valid that I remove the cases with high dispersion, such as (0,0,100,0), and then take the new whole data to find DE genes? Thank you! Best, Xiaohui ------------------------------------------------------------- Sender: Gordon K Smyth Date?2011-05-25 15:24:12 Recieve?Wu, Xiaohui Ms. Title?Re: Filtering out tags with low counts in DESeq and EgdeR? Dear Xiao, Your dispersion value of 5 is higher than I have even seen for real data, This means that the coefficient of variation of expression is sqrt(5)=2.2, meaning the standard deviation is more than twice the mean. With such wild data, it is no surprise that you get no DE genes. In fact, I'd be a bit worried if edgeR did give you DE results. It is not valid to simply input a smaller dispersion value than your data indicates. Judging from the dispersion value, I think it is likely that you need to reconsider the basic quality of your data. Trying to repair things with different statistical methods is not likely to solve the basic problem. Best wishes Gordon --------------------------------------------- Professor Gordon K Smyth, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Tel: (03) 9345 2326, Fax (03) 9347 0852, smyth at wehi.edu.au http://www.wehi.edu.au http://www.statsci.org/smyth On Wed, 25 May 2011, Xiaohui Wu wrote: > Thanks, Gordon. > > I tried setting different common dispersion and prior.n in EdgeR, but the results were so different that I became more confused now. I would be very grateful if you or anyone could give me some idea. > > 1) Is the following step to filter out low counts gene right? Do I need to change the libsize using d$counts>=16 or just remain the original libsize? > f <- calcNormFactors(dat) > f <- f/exp(mean(log(f))) > d <- DGEList(counts =dat, group=group,lib.size = colSums(dat) * f) > d <- d[rowSums(d$counts) >=16, ] > > 2) How to estimate which common.dispersion or prior.n is better? > When I used EdgeR, I tried to get some DE genes after adjusting pvalue. To estimate common or tagwise dispertion, I used: > d <- estimateCommonDisp(d) > d <- estimateTagwiseDisp(d, prior.n = 10) > NO DE gene was found after adjusting pvalue, about 800~1800 DE genes were found with pvalue<0.1 (before adjusting). > > In my data, the biological variation between the four replicates are high, (the estimateCommonDisp was 5, which means very high variation I think). > When I used: d$common.dispersion=2, I got 10209 genes with pvalue<0.1, and 9532 DE genes with adjusted pvalue<0.1. > When I used: prior.n=0, I got 3428 genes with pvalue<0.1, and 339 DE genes with adjusted pvalue<0.1. > But NO DE gene could be found even using prior.n=1. > Is it OK to use prior.n=0, or to set my own common dispersion? Should I use tagwise dispersion or common dispersion? > > Any help or suggestions will be appreciated. > > Best, > Xiaohui > > > > > ------------------------------------------------------------- > Send?Gordon K Smyth > Date?2011-05-25 09:59:36 > Receive?Wu, Xiaohui Ms. > Title?Filtering out tags with low counts in DESeq and EgdeR? > > Dear Xiaohui, > > I'll answer for edgeR, which is what I have experience with. > > I personally pre-filter RNA-Seq data in much the same way that I filter > microarray data. I aim to keep transcripts that show some evidence of > being expressed at a worthwhile level in at least one treatment group. > If I am comparing two groups with sample sizes n1=5 and n2=10, then I will > keep transcripts that achieve a minimum level of expression in at least 5 > samples (the minimum group size). Note that this is done without regard > to the sample labels, so no knowledge of group membership is used in the > filtering. The minimum expression level might be, say, 1 count per > million. > > Best wishes > Gordon > >> Date: Sat, 21 May 2011 16:25:08 +0800 >> From: "Xiaohui Wu" <wux3 at="" muohio.edu=""> >> To: "bioconductor" <bioconductor at="" r-project.org=""> >> Subject: [BioC] Filtering out tags with low counts in DESeq and EgdeR? >> >> Hello, >> >> I want to find DE genes between two conditions, using DESeq, EdgeR or other tools if any. >> I have some questions on how/when to filter out genes with low counts and I am also confused on the final results. >> >> I have 50,000 genes, 2 conditions, each condition 4 replicates. 25,000 genes have >=8 counts in at least one condition. >> The total counts in each replicate is as following: >> cond1_rep1 cond1_rep2 cond1_rep3 cond1_rep4; cond2_rep1 cond2_rep2 cond2_rep3 cond2_rep4 >> 1792310 1896603 1804500 2166961; 1792663 2492385 1781523 2222892 >> >> 1. For DESeq >> I used all 50,000 genes to estimate size factor for normalization. >> Then should I remove genes with low count before estimating variance or after that? >> I got totally different results using the following two ways, which I was really surprised that NO overlap was found between the two DE gene sets. >> >> 1) no filtering before estimating variance >> d <- newCountDataSet( dat, group ) >> d <- estimateSizeFactors( d) >> sf=sizeFactors(d) >> d <- estimateVarianceFunctions( d ) >> res <- nbinomTest( d,"cond1","cond2") >> resSig <- res[ res$padj < .1, ] >> >> 2) filtering before estimating variance >> filter=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ] ## remove low count genes >> filter.ds <- newCountDataSet( filter, group ) >> filter.ds$sizeFactor=sf ## use the sizefactor from 1) >> filter.ds <- estimateVarianceFunctions( filter.ds ) >> filter.res <- nbinomTest( filter.ds,"cond1","cond2") >> filter.sig=filter.res[ filter.res$padj < .1, ] >> nrow(filter.sig) >> >> Results: For my data, total 177 DE genes from 2) and 106 from 1), and no overlap between 1) and 2). >> >> 2. For EdgeR >> 1) when should I remove low count genes? >> f <- calcNormFactors(dat) ## I should calculate factor using the whole data, right? >> f <- f/exp(mean(log(f))) >> dat=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ] ## And, then filter out genes with low counts? >> d <- DGEList(counts =dat, group=group,lib.size = colSums(dat) * f) >> >> 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after adjusting p.value, is this possible? Then, can I used the *unadjusted* p.value to get DE genes? >> To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method = "BH") < 0.05) >> >> Any help or suggestions will be appreciated. >> >> Thanks, >> Xiaohui > > ______________________________________________________________________ > The information in this email is confidential and intended solely for the addressee. > You must not disclose, forward, print or use it without the permission of the sender. > ______________________________________________________________________ > . > ______________________________________________________________________ The information in this email is confidential and intended solely for the addressee. You must not disclose, forward, print or use it without the permission of the sender. ______________________________________________________________________

ADD COMMENT • link 12.9 years ago Xiaohui Wu ▴ 280

0

Entering edit mode

Dear Xioahui, You seem to have ignored or perhaps not understood my original reply to you. If you read it again, you will see that it takes care of cases such as you mention. Best wishes Gordon On Wed, 25 May 2011, Xiaohui Wu wrote: > Thank you very much, Gordon. > Yes, it is really bad that the data is so wild. But before examining the > data, do you think it is valid that I remove the cases with high > dispersion, such as (0,0,100,0), and then take the new whole data to > find DE genes? > Thank you! > > Best, > Xiaohui > > ------------------------------------------------------------- > Sender: Gordon K Smyth > Date?2011-05-25 15:24:12 > Recieve?Wu, Xiaohui Ms. > Title?Re: Filtering out tags with low counts in DESeq and EgdeR? > > Dear Xiao, > > Your dispersion value of 5 is higher than I have even seen for real data, > This means that the coefficient of variation of expression is sqrt(5)=2.2, > meaning the standard deviation is more than twice the mean. > > With such wild data, it is no surprise that you get no DE genes. In fact, > I'd be a bit worried if edgeR did give you DE results. > > It is not valid to simply input a smaller dispersion value than your data > indicates. > > Judging from the dispersion value, I think it is likely that you need to > reconsider the basic quality of your data. Trying to repair things with > different statistical methods is not likely to solve the basic problem. > > Best wishes > Gordon > > --------------------------------------------- > Professor Gordon K Smyth, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Tel: (03) 9345 2326, Fax (03) 9347 0852, > smyth at wehi.edu.au > http://www.wehi.edu.au > http://www.statsci.org/smyth > > On Wed, 25 May 2011, Xiaohui Wu wrote: > >> Thanks, Gordon. >> >> I tried setting different common dispersion and prior.n in EdgeR, but the results were so different that I became more confused now. I would be very grateful if you or anyone could give me some idea. >> >> 1) Is the following step to filter out low counts gene right? Do I need to change the libsize using d$counts>=16 or just remain the original libsize? >> f <- calcNormFactors(dat) >> f <- f/exp(mean(log(f))) >> d <- DGEList(counts =dat, group=group,lib.size = colSums(dat) * f) >> d <- d[rowSums(d$counts) >=16, ] >> >> 2) How to estimate which common.dispersion or prior.n is better? >> When I used EdgeR, I tried to get some DE genes after adjusting pvalue. To estimate common or tagwise dispertion, I used: >> d <- estimateCommonDisp(d) >> d <- estimateTagwiseDisp(d, prior.n = 10) >> NO DE gene was found after adjusting pvalue, about 800~1800 DE genes were found with pvalue<0.1 (before adjusting). >> >> In my data, the biological variation between the four replicates are high, (the estimateCommonDisp was 5, which means very high variation I think). >> When I used: d$common.dispersion=2, I got 10209 genes with pvalue<0.1, and 9532 DE genes with adjusted pvalue<0.1. >> When I used: prior.n=0, I got 3428 genes with pvalue<0.1, and 339 DE genes with adjusted pvalue<0.1. >> But NO DE gene could be found even using prior.n=1. >> Is it OK to use prior.n=0, or to set my own common dispersion? Should I use tagwise dispersion or common dispersion? >> >> Any help or suggestions will be appreciated. >> >> Best, >> Xiaohui >> >> >> >> >> ------------------------------------------------------------- >> Send?Gordon K Smyth >> Date?2011-05-25 09:59:36 >> Receive?Wu, Xiaohui Ms. >> Title?Filtering out tags with low counts in DESeq and EgdeR? >> >> Dear Xiaohui, >> >> I'll answer for edgeR, which is what I have experience with. >> >> I personally pre-filter RNA-Seq data in much the same way that I filter >> microarray data. I aim to keep transcripts that show some evidence of >> being expressed at a worthwhile level in at least one treatment group. >> If I am comparing two groups with sample sizes n1=5 and n2=10, then I will >> keep transcripts that achieve a minimum level of expression in at least 5 >> samples (the minimum group size). Note that this is done without regard >> to the sample labels, so no knowledge of group membership is used in the >> filtering. The minimum expression level might be, say, 1 count per >> million. >> >> Best wishes >> Gordon >> >>> Date: Sat, 21 May 2011 16:25:08 +0800 >>> From: "Xiaohui Wu" <wux3 at="" muohio.edu=""> >>> To: "bioconductor" <bioconductor at="" r-project.org=""> >>> Subject: [BioC] Filtering out tags with low counts in DESeq and EgdeR? >>> >>> Hello, >>> >>> I want to find DE genes between two conditions, using DESeq, EdgeR or other tools if any. >>> I have some questions on how/when to filter out genes with low counts and I am also confused on the final results. >>> >>> I have 50,000 genes, 2 conditions, each condition 4 replicates. 25,000 genes have >=8 counts in at least one condition. >>> The total counts in each replicate is as following: >>> cond1_rep1 cond1_rep2 cond1_rep3 cond1_rep4; cond2_rep1 cond2_rep2 cond2_rep3 cond2_rep4 >>> 1792310 1896603 1804500 2166961; 1792663 2492385 1781523 2222892 >>> >>> 1. For DESeq >>> I used all 50,000 genes to estimate size factor for normalization. >>> Then should I remove genes with low count before estimating variance or after that? >>> I got totally different results using the following two ways, which I was really surprised that NO overlap was found between the two DE gene sets. >>> >>> 1) no filtering before estimating variance >>> d <- newCountDataSet( dat, group ) >>> d <- estimateSizeFactors( d) >>> sf=sizeFactors(d) >>> d <- estimateVarianceFunctions( d ) >>> res <- nbinomTest( d,"cond1","cond2") >>> resSig <- res[ res$padj < .1, ] >>> >>> 2) filtering before estimating variance >>> filter=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ] ## remove low count genes >>> filter.ds <- newCountDataSet( filter, group ) >>> filter.ds$sizeFactor=sf ## use the sizefactor from 1) >>> filter.ds <- estimateVarianceFunctions( filter.ds ) >>> filter.res <- nbinomTest( filter.ds,"cond1","cond2") >>> filter.sig=filter.res[ filter.res$padj < .1, ] >>> nrow(filter.sig) >>> >>> Results: For my data, total 177 DE genes from 2) and 106 from 1), and no overlap between 1) and 2). >>> >>> 2. For EdgeR >>> 1) when should I remove low count genes? >>> f <- calcNormFactors(dat) ## I should calculate factor using the whole data, right? >>> f <- f/exp(mean(log(f))) >>> dat=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ] ## And, then filter out genes with low counts? >>> d <- DGEList(counts =dat, group=group,lib.size = colSums(dat) * f) >>> >>> 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after adjusting p.value, is this possible? Then, can I used the *unadjusted* p.value to get DE genes? >>> To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method = "BH") < 0.05) >>> >>> Any help or suggestions will be appreciated. >>> >>> Thanks, >>> Xiaohui ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:5}}

ADD REPLY • link 12.9 years ago Gordon Smyth 50k

0

Entering edit mode

Xiaohui Wu ▴ 280

@xiaohui-wu-4141

Last seen 9.6 years ago

Many thanks, Gordon. Sorry that I read your reply last time, but I wrongly understood it as filtering out tags by total counts not samples. Now I understand. Thanks, all! Best, Xiaohui ------------------------------------------------------------- Sender?Gordon K Smyth Date?2011-05-26 06:20:34 Recieve?Wu, Xiaohui Ms. Title?Re: Re: Filtering out tags with low counts in DESeq and EgdeR? Dear Xioahui, You seem to have ignored or perhaps not understood my original reply to you. If you read it again, you will see that it takes care of cases such as you mention. Best wishes Gordon On Wed, 25 May 2011, Xiaohui Wu wrote: > Thank you very much, Gordon. > Yes, it is really bad that the data is so wild. But before examining the > data, do you think it is valid that I remove the cases with high > dispersion, such as (0,0,100,0), and then take the new whole data to > find DE genes? > Thank you! > > Best, > Xiaohui > > ------------------------------------------------------------- > Sender: Gordon K Smyth > Date?2011-05-25 15:24:12 > Recieve?Wu, Xiaohui Ms. > Title?Re: Filtering out tags with low counts in DESeq and EgdeR? > > Dear Xiao, > > Your dispersion value of 5 is higher than I have even seen for real data, > This means that the coefficient of variation of expression is sqrt(5)=2.2, > meaning the standard deviation is more than twice the mean. > > With such wild data, it is no surprise that you get no DE genes. In fact, > I'd be a bit worried if edgeR did give you DE results. > > It is not valid to simply input a smaller dispersion value than your data > indicates. > > Judging from the dispersion value, I think it is likely that you need to > reconsider the basic quality of your data. Trying to repair things with > different statistical methods is not likely to solve the basic problem. > > Best wishes > Gordon > > --------------------------------------------- > Professor Gordon K Smyth, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Tel: (03) 9345 2326, Fax (03) 9347 0852, > smyth at wehi.edu.au > http://www.wehi.edu.au > http://www.statsci.org/smyth > > On Wed, 25 May 2011, Xiaohui Wu wrote: > >> Thanks, Gordon. >> >> I tried setting different common dispersion and prior.n in EdgeR, but the results were so different that I became more confused now. I would be very grateful if you or anyone could give me some idea. >> >> 1) Is the following step to filter out low counts gene right? Do I need to change the libsize using d$counts>=16 or just remain the original libsize? >> f <- calcNormFactors(dat) >> f <- f/exp(mean(log(f))) >> d <- DGEList(counts =dat, group=group,lib.size = colSums(dat) * f) >> d <- d[rowSums(d$counts) >=16, ] >> >> 2) How to estimate which common.dispersion or prior.n is better? >> When I used EdgeR, I tried to get some DE genes after adjusting pvalue. To estimate common or tagwise dispertion, I used: >> d <- estimateCommonDisp(d) >> d <- estimateTagwiseDisp(d, prior.n = 10) >> NO DE gene was found after adjusting pvalue, about 800~1800 DE genes were found with pvalue<0.1 (before adjusting). >> >> In my data, the biological variation between the four replicates are high, (the estimateCommonDisp was 5, which means very high variation I think). >> When I used: d$common.dispersion=2, I got 10209 genes with pvalue<0.1, and 9532 DE genes with adjusted pvalue<0.1. >> When I used: prior.n=0, I got 3428 genes with pvalue<0.1, and 339 DE genes with adjusted pvalue<0.1. >> But NO DE gene could be found even using prior.n=1. >> Is it OK to use prior.n=0, or to set my own common dispersion? Should I use tagwise dispersion or common dispersion? >> >> Any help or suggestions will be appreciated. >> >> Best, >> Xiaohui >> >> >> >> >> ------------------------------------------------------------- >> Send?Gordon K Smyth >> Date?2011-05-25 09:59:36 >> Receive?Wu, Xiaohui Ms. >> Title?Filtering out tags with low counts in DESeq and EgdeR? >> >> Dear Xiaohui, >> >> I'll answer for edgeR, which is what I have experience with. >> >> I personally pre-filter RNA-Seq data in much the same way that I filter >> microarray data. I aim to keep transcripts that show some evidence of >> being expressed at a worthwhile level in at least one treatment group. >> If I am comparing two groups with sample sizes n1=5 and n2=10, then I will >> keep transcripts that achieve a minimum level of expression in at least 5 >> samples (the minimum group size). Note that this is done without regard >> to the sample labels, so no knowledge of group membership is used in the >> filtering. The minimum expression level might be, say, 1 count per >> million. >> >> Best wishes >> Gordon >> >>> Date: Sat, 21 May 2011 16:25:08 +0800 >>> From: "Xiaohui Wu" <wux3 at="" muohio.edu=""> >>> To: "bioconductor" <bioconductor at="" r-project.org=""> >>> Subject: [BioC] Filtering out tags with low counts in DESeq and EgdeR? >>> >>> Hello, >>> >>> I want to find DE genes between two conditions, using DESeq, EdgeR or other tools if any. >>> I have some questions on how/when to filter out genes with low counts and I am also confused on the final results. >>> >>> I have 50,000 genes, 2 conditions, each condition 4 replicates. 25,000 genes have >=8 counts in at least one condition. >>> The total counts in each replicate is as following: >>> cond1_rep1 cond1_rep2 cond1_rep3 cond1_rep4; cond2_rep1 cond2_rep2 cond2_rep3 cond2_rep4 >>> 1792310 1896603 1804500 2166961; 1792663 2492385 1781523 2222892 >>> >>> 1. For DESeq >>> I used all 50,000 genes to estimate size factor for normalization. >>> Then should I remove genes with low count before estimating variance or after that? >>> I got totally different results using the following two ways, which I was really surprised that NO overlap was found between the two DE gene sets. >>> >>> 1) no filtering before estimating variance >>> d <- newCountDataSet( dat, group ) >>> d <- estimateSizeFactors( d) >>> sf=sizeFactors(d) >>> d <- estimateVarianceFunctions( d ) >>> res <- nbinomTest( d,"cond1","cond2") >>> resSig <- res[ res$padj < .1, ] >>> >>> 2) filtering before estimating variance >>> filter=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ] ## remove low count genes >>> filter.ds <- newCountDataSet( filter, group ) >>> filter.ds$sizeFactor=sf ## use the sizefactor from 1) >>> filter.ds <- estimateVarianceFunctions( filter.ds ) >>> filter.res <- nbinomTest( filter.ds,"cond1","cond2") >>> filter.sig=filter.res[ filter.res$padj < .1, ] >>> nrow(filter.sig) >>> >>> Results: For my data, total 177 DE genes from 2) and 106 from 1), and no overlap between 1) and 2). >>> >>> 2. For EdgeR >>> 1) when should I remove low count genes? >>> f <- calcNormFactors(dat) ## I should calculate factor using the whole data, right? >>> f <- f/exp(mean(log(f))) >>> dat=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8), ] ## And, then filter out genes with low counts? >>> d <- DGEList(counts =dat, group=group,lib.size = colSums(dat) * f) >>> >>> 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after adjusting p.value, is this possible? Then, can I used the *unadjusted* p.value to get DE genes? >>> To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method = "BH") < 0.05) >>> >>> Any help or suggestions will be appreciated. >>> >>> Thanks, >>> Xiaohui ______________________________________________________________________ The information in this email is confidential and intended solely for the addressee. You must not disclose, forward, print or use it without the permission of the sender. ______________________________________________________________________

ADD COMMENT • link 12.9 years ago Xiaohui Wu ▴ 280

0

Entering edit mode

I am running the development version of R, and I cannot do simple plots anymore, can anyone give me a hint? See below and thanks! Ina > x <- rnorm(100,0,1) > hist(x) Error in function (display = "", width, height, pointsize, gamma, bg, : X11 module cannot be loaded In addition: Warning message: In function (display = "", width, height, pointsize, gamma, bg, : unable to load shared object '/home/inah/R-devel/lib64/R/modules//R_X11.so': libpng12.so.0: cannot open shared object file: No such file or directory sessionInfo() R version 2.14.0 Under development (unstable) (2011-05-19 r55967) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] tools_2.14.0

ADD REPLY • link 12.9 years ago Ina Hoeschele ▴ 620

0

Entering edit mode

Hi Ina: You need to recompile R with X11 support. When you run the configure script, pass along the --with-X11 flag. You may have to also pass along the location of the header files and .so libraries for X11. Dave H Sent via DROID on Verizon Wireless -----Original message----- From: Ina Hoeschele <inah@vbi.vt.edu> To: Bioconductor mailing list <bioconductor@r-project.org> Cc: David Bynum <dbynum@vbi.vt.edu> Sent: Thu, May 26, 2011 20:26:10 GMT+00:00 Subject: Re: [BioC] development version of R I am running the development version of R, and I cannot do simple plots anymore, can anyone give me a hint? See below and thanks! Ina > x <- rnorm(100,0,1) > hist(x) Error in function (display = "", width, height, pointsize, gamma, bg, : X11 module cannot be loaded In addition: Warning message: In function (display = "", width, height, pointsize, gamma, bg, : unable to load shared object '/home/inah/R-devel/lib64/R/modules//R_X11.so': libpng12.so.0: cannot open shared object file: No such file or directory sessionInfo() R version 2.14.0 Under development (unstable) (2011-05-19 r55967) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] tools_2.14.0 _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD REPLY • link 12.9 years ago David Henderson ▴ 20

0

Entering edit mode

Somebody has removed your libpng12.so.0. One way to fix this is to recompile R, this time it may not support png() But this is really something you need to talk to your local sysadmin about. Kasper On Thu, May 26, 2011 at 4:24 PM, Ina Hoeschele <inah at="" vbi.vt.edu=""> wrote: > I am running the development version of R, and I cannot do simple plots anymore, can anyone give me a hint? See below and thanks! > Ina > >> x <- rnorm(100,0,1) >> hist(x) > Error in function (display = "", width, height, pointsize, gamma, bg, > : > ?X11 module cannot be loaded > In addition: Warning message: > In function (display = "", width, height, pointsize, gamma, bg, ?: > ?unable to load shared object '/home/inah/R-devel/lib64/R/modules//R_X11.so': > ?libpng12.so.0: cannot open shared object file: No such file or directory > > ?sessionInfo() > R version 2.14.0 Under development (unstable) (2011-05-19 r55967) > Platform: x86_64-unknown-linux-gnu (64-bit) > locale: > ?[1] LC_CTYPE=en_US.UTF-8 ? ? ? LC_NUMERIC=C > ?[3] LC_TIME=en_US.UTF-8 ? ? ? ?LC_COLLATE=en_US.UTF-8 > ?[5] LC_MONETARY=en_US.UTF-8 ? ?LC_MESSAGES=en_US.UTF-8 > ?[7] LC_PAPER=C ? ? ? ? ? ? ? ? LC_NAME=C > ?[9] LC_ADDRESS=C ? ? ? ? ? ? ? LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > attached base packages: > [1] stats ? ? graphics ?grDevices utils ? ? datasets ?methods ? base > > loaded via a namespace (and not attached): > [1] tools_2.14.0 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 12.9 years ago Kasper Daniel Hansen ★ 6.5k

Login before adding your answer.