Many thanks, Gordon.
Sorry that I read your reply last time, but I wrongly understood it as
filtering out tags by total counts not samples. Now I understand.
Sender?Gordon K Smyth
Recieve?Wu, Xiaohui Ms.
Title?Re: Re: Filtering out tags with low counts in DESeq and EgdeR?
You seem to have ignored or perhaps not understood my original reply
you. If you read it again, you will see that it takes care of cases
as you mention.
On Wed, 25 May 2011, Xiaohui Wu wrote:
> Thank you very much, Gordon.
> Yes, it is really bad that the data is so wild. But before examining
> data, do you think it is valid that I remove the cases with high
> dispersion, such as (0,0,100,0), and then take the new whole data to
> find DE genes?
> Thank you!
> Sender: Gordon K Smyth
> Date?2011-05-25 15:24:12
> Recieve?Wu, Xiaohui Ms.
> Title?Re: Filtering out tags with low counts in DESeq and EgdeR?
> Dear Xiao,
> Your dispersion value of 5 is higher than I have even seen for real
> This means that the coefficient of variation of expression is
> meaning the standard deviation is more than twice the mean.
> With such wild data, it is no surprise that you get no DE genes. In
> I'd be a bit worried if edgeR did give you DE results.
> It is not valid to simply input a smaller dispersion value than your
> Judging from the dispersion value, I think it is likely that you
> reconsider the basic quality of your data. Trying to repair things
> different statistical methods is not likely to solve the basic
> Best wishes
> Professor Gordon K Smyth,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> Tel: (03) 9345 2326, Fax (03) 9347 0852,
> smyth at wehi.edu.au
> On Wed, 25 May 2011, Xiaohui Wu wrote:
>> Thanks, Gordon.
>> I tried setting different common dispersion and prior.n in EdgeR,
but the results were so different that I became more confused now. I
would be very grateful if you or anyone could give me some idea.
>> 1) Is the following step to filter out low counts gene right? Do I
need to change the libsize using d$counts>=16 or just remain the
>> f <- calcNormFactors(dat)
>> f <- f/exp(mean(log(f)))
>> d <- DGEList(counts =dat, group=group,lib.size = colSums(dat) * f)
>> d <- d[rowSums(d$counts) >=16, ]
>> 2) How to estimate which common.dispersion or prior.n is better?
>> When I used EdgeR, I tried to get some DE genes after adjusting
pvalue. To estimate common or tagwise dispertion, I used:
>> d <- estimateCommonDisp(d)
>> d <- estimateTagwiseDisp(d, prior.n = 10)
>> NO DE gene was found after adjusting pvalue, about 800~1800 DE
genes were found with pvalue<0.1 (before adjusting).
>> In my data, the biological variation between the four replicates
are high, (the estimateCommonDisp was 5, which means very high
variation I think).
>> When I used: d$common.dispersion=2, I got 10209 genes with
pvalue<0.1, and 9532 DE genes with adjusted pvalue<0.1.
>> When I used: prior.n=0, I got 3428 genes with pvalue<0.1, and 339
DE genes with adjusted pvalue<0.1.
>> But NO DE gene could be found even using prior.n=1.
>> Is it OK to use prior.n=0, or to set my own common dispersion?
Should I use tagwise dispersion or common dispersion?
>> Any help or suggestions will be appreciated.
>> Send?Gordon K Smyth
>> Date?2011-05-25 09:59:36
>> Receive?Wu, Xiaohui Ms.
>> Title?Filtering out tags with low counts in DESeq and EgdeR?
>> Dear Xiaohui,
>> I'll answer for edgeR, which is what I have experience with.
>> I personally pre-filter RNA-Seq data in much the same way that I
>> microarray data. I aim to keep transcripts that show some evidence
>> being expressed at a worthwhile level in at least one treatment
>> If I am comparing two groups with sample sizes n1=5 and n2=10, then
>> keep transcripts that achieve a minimum level of expression in at
>> samples (the minimum group size). Note that this is done without
>> to the sample labels, so no knowledge of group membership is used
>> filtering. The minimum expression level might be, say, 1 count per
>> Best wishes
>>> Date: Sat, 21 May 2011 16:25:08 +0800
>>> From: "Xiaohui Wu" <wux3 at="" muohio.edu="">
>>> To: "bioconductor" <bioconductor at="" r-project.org="">
>>> Subject: [BioC] Filtering out tags with low counts in DESeq and
>>> I want to find DE genes between two conditions, using DESeq, EdgeR
or other tools if any.
>>> I have some questions on how/when to filter out genes with low
counts and I am also confused on the final results.
>>> I have 50,000 genes, 2 conditions, each condition 4 replicates.
25,000 genes have >=8 counts in at least one condition.
>>> The total counts in each replicate is as following:
>>> cond1_rep1 cond1_rep2 cond1_rep3 cond1_rep4; cond2_rep1 cond2_rep2
>>> 1792310 1896603 1804500 2166961; 1792663 2492385 1781523 2222892
>>> 1. For DESeq
>>> I used all 50,000 genes to estimate size factor for normalization.
>>> Then should I remove genes with low count before estimating
variance or after that?
>>> I got totally different results using the following two ways,
which I was really surprised that NO overlap was found between the two
DE gene sets.
>>> 1) no filtering before estimating variance
>>> d <- newCountDataSet( dat, group )
>>> d <- estimateSizeFactors( d)
>>> d <- estimateVarianceFunctions( d )
>>> res <- nbinomTest( d,"cond1","cond2")
>>> resSig <- res[ res$padj < .1, ]
>>> 2) filtering before estimating variance
>>> filter=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >=
8), ] ## remove low count genes
>>> filter.ds <- newCountDataSet( filter, group )
>>> filter.ds$sizeFactor=sf ## use the sizefactor from 1)
>>> filter.ds <- estimateVarianceFunctions( filter.ds )
>>> filter.res <- nbinomTest( filter.ds,"cond1","cond2")
>>> filter.sig=filter.res[ filter.res$padj < .1, ]
>>> Results: For my data, total 177 DE genes from 2) and 106 from 1),
and no overlap between 1) and 2).
>>> 2. For EdgeR
>>> 1) when should I remove low count genes?
>>> f <- calcNormFactors(dat) ## I should calculate factor using the
whole data, right?
>>> f <- f/exp(mean(log(f)))
>>> dat=dat[rowSums(dat[,group1] >= 8) | rowSums(dat[,group2] >= 8),
] ## And, then filter out genes with low counts?
>>> d <- DGEList(counts =dat, group=group,lib.size = colSums(dat) *
>>> 2) I got 800 DE genes with p.value<0.1, but got 0 DE genes after
adjusting p.value, is this possible? Then, can I used the *unadjusted*
p.value to get DE genes?
>>> To adjust pvalue, I used: nde.adjust=sum(p.adjust(de.p, method =
"BH") < 0.05)
>>> Any help or suggestions will be appreciated.
The information in this email is confidential and intended solely for
You must not disclose, forward, print or use it without the permission
of the sender.