Question

DESeq variance question

Steffen Priebe ▴ 10

@steffen-priebe-4988

I am using DESeq (and edgeR) for differential expression analysis. In my current dataset I compare 3 biological replicates of a control against 3 biological replicates of a mutant. The 4 top genes by adjusted p-value, from both DESeq and edgeR, have very high variance. (The reason is that these genes are located on chrY, and only one replicate of the mutant was male.)

My question is: how can genes with such high variance in their counts get such small p-values? Is there any way to avoid this? I think these are false positives.

Attached you can find the combined result table of DESeq and edgeR for the top 100 genes. The problem occurs for the first 4 genes. The raw counts are given in columns P-U (P-R: Mutant, T-U Control).

edgeR DESeq • 1.3k views

updated 12.4 years ago by Gordon Smyth 50k • written 12.4 years ago by Steffen Priebe ▴ 10

score 0 · Answer 1 · 2011-12-03

Dear Steffen,

Short answer: I suppose you used version 1.4.x of DESeq. In the new release (DESeq version 1.6.x), we made some major changes, which should make the problem disappear.

Longer answer: In the old version, the data frame returned by 'nbinomTest' contained, next to the p values, two vectors of "variance residuals", labeled "resVarA" and "resVarB". The vignette explained that p values should be considered unreliable if the variance residuals were too large and advised disregarding such hits. Your Y chromosome genes certainly had such large values in resVarA or resVarB, and you should have removed them because of this.

These variance residuals are the ratio of the per-gene estimate of the variance (which is very imprecise when there are few samples) to the fitted value obtained by sharing data across genes (which is stable but may be misleading for genes that behave very differently from the other genes of similar expression range). Previously, we used only the fitted dispersion values for the test and left it to the user to filter out those hits for which the two values disagreed too much. Many users overlooked the need for this last step; others found the solution unsatisfactory, as it turned out to be hard to advise on a good threshold for filtering on variance residuals.

The new version solves the issue with a pragmatic and simple approach that works surprisingly well: DESeq now simply uses the maximum of the two values. This costs power but avoids the need for filtering. In our experience, the power cost is surprisingly low for typical data sets, which, in our view, justifies the use of such a simple method, at least for now. See the updated vignette for more details on this topic.

You can switch back to the old behaviour using the 'sharingMode' argument of the 'estimateDispersions' function. This can be useful to see how this 'maximum rule' influences your result.

edgeR, with its empirical Bayesian approach (implemented in its function 'estimateTagwiseDisp'), should typically give p values in the middle between DESeq's results using the 'maximum' and the 'fitted-only' sharing modes. However, at least in your case, edgeR seems to have stayed too close to the fitted values (or: to the 'common dispersion', in edgeR's terminology), as you wrote that it also gave you p values for your high-variance genes that you considered implausibly low.

Simon
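The 'maximum rule' described in this answer can be illustrated with a small sketch. This is not DESeq's implementation: it assumes the usual negative binomial relation var = mu + dispersion * mu^2, uses a plain method-of-moments estimate for the per-gene dispersion, ignores size factors, and the counts, the fitted value of 0.05, the function names, and the mode strings are all hypothetical.

```python
from statistics import mean, variance


def nb_variance(mu, dispersion):
    """Negative binomial variance for mean mu: var = mu + dispersion * mu**2."""
    return mu + dispersion * mu ** 2


def per_gene_dispersion(counts):
    """Method-of-moments dispersion from one condition's replicates (assumption)."""
    mu = mean(counts)
    if mu == 0:
        return 0.0
    # Invert var = mu + dispersion * mu**2 and clamp negative estimates to zero.
    return max((variance(counts) - mu) / mu ** 2, 0.0)


def variance_residual(counts, fitted_dispersion):
    """Ratio of the per-gene variance to the variance implied by the fitted value."""
    mu = mean(counts)
    if mu == 0:
        return float("nan")
    return variance(counts) / nb_variance(mu, fitted_dispersion)


def dispersion_for_test(counts, fitted_dispersion, mode="maximum"):
    """Pick the dispersion handed to the test (mode names are illustrative)."""
    if mode == "fitted-only":
        return fitted_dispersion
    if mode == "maximum":
        return max(per_gene_dispersion(counts), fitted_dispersion)
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical chrY gene: only one of three mutant replicates is male.
y_gene_mutant = [0, 0, 600]
fitted = 0.05  # hypothetical fitted dispersion at this expression level

print(variance_residual(y_gene_mutant, fitted))
print(dispersion_for_test(y_gene_mutant, fitted, mode="fitted-only"))
print(dispersion_for_test(y_gene_mutant, fitted, mode="maximum"))
```

With these made-up numbers the variance residual is roughly 55, and the dispersion used for the test rises from 0.05 under the fitted-only rule to about 3.0 under the maximum rule, so the test treats the single large count as noise rather than as evidence of differential expression.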

score 0 · Answer 2 · 2011-12-05

Dear Steffen,

On Fri, 02 Dec 2011 13:53:42 +0100, Steffen Priebe wrote:

> I was using DESeq (and edgeR) for differentially expression analysis. In my current dataset I compare 3 biological replicates of control vs. 3 biol. replicates from a mutant. The resulting 4 top genes according adjusted pvalue by DESeq and edgeR have a very high variance. (The reason for this is, that this are genes located on the chrY and only one replicate of the mutant was male)

Replicates should be representative of the same population, so I would remove the male mutant from the experiment, or else remove all X and Y chromosome genes from the analysis. In our in-house analyses, we have tended to do the latter when faced with your situation.

More generally, this is exactly the issue that tagwise dispersion estimation in edgeR is intended to combat. In our experience, filtering so that genes are expressed in at least three libraries (for a 3 vs 3 study) and passing a reasonably low prior.n to estimateTagwiseDisp() will give a satisfying topTags gene list. You don't say whether you used tagwise dispersion estimation.

> My question is now, how can genes with such a high variance of the counts result in this small pvalues? Is there any way to avoid this, because I think this are False Positives?
>
> Attached you can find the combined result table of DESeq and edgeR for the top 100 genes. The problem occurs for the first 4 genes. The raw counts are stated in columns P-U (P-R: Mutant, T-U Control).

Note that we have not seen your attachments, which are removed by the list server. Nor do we know which version of the software you are using. If you post again, please give the output of sessionInfo() and the code for your edgeR analysis.

Best wishes
Gordon
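The two filtering suggestions in this answer (dropping X and Y chromosome genes, and keeping only genes expressed in at least three libraries) can be sketched as below. This is a minimal illustration rather than edgeR code: the gene names, counts, and chromosome labels are made up, and the threshold for calling a gene "expressed" in a library (`min_count=1`) is an assumption, since the answer does not state one.

```python
def is_expressed_enough(counts, min_libraries=3, min_count=1):
    """True if at least min_libraries libraries reach min_count reads."""
    return sum(1 for c in counts if c >= min_count) >= min_libraries


def filter_genes(table, excluded_chromosomes=("chrX", "chrY"),
                 min_libraries=3, min_count=1):
    """Drop sex-chromosome genes and genes expressed in too few libraries.

    table maps gene name -> (chromosome, list of raw counts per library).
    """
    kept = {}
    for gene, (chrom, counts) in table.items():
        if chrom in excluded_chromosomes:
            continue  # sex-linked genes are confounded with the one male sample
        if not is_expressed_enough(counts, min_libraries, min_count):
            continue  # too few libraries with any reads
        kept[gene] = counts
    return kept


# Hypothetical 3 vs 3 design: first three columns mutant, last three control.
table = {
    "geneA": ("chr1", [120, 98, 143, 110, 101, 131]),
    "geneB": ("chrY", [0, 0, 600, 0, 0, 0]),
    "geneC": ("chr2", [0, 3, 0, 0, 0, 1]),
    "geneD": ("chrX", [55, 60, 48, 51, 62, 58]),
}

print(sorted(filter_genes(table)))
```

In this toy table only geneA survives: geneB and geneD sit on excluded chromosomes, and geneC has reads in just two libraries.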

score 0 · Answer 3 · 2011-12-05

Dear Simon and Steffen,

On Sat, 03 Dec 2011 20:36:10 +0100, Simon Anders wrote:

> ...
>
> EdgeR, with its empirical Bayesian approach (implemented in its function 'estimateTagwiseDisp') should typically give p values in the middle between DESeq's result using the 'maximum' and its the 'fitted-only' sharing modes. However, at least in your case, edgeR seemed to have stayed too close to the fitted values (or: to the 'common dispersion', in edgeR's terminology)

Common dispersion is not edgeR terminology for DESeq's "fitted values", and (in the current Bioconductor release) edgeR moderates towards a local prior rather than towards the common dispersion. By default, edgeR does not fit any model to the dispersion, and hence does not have fitted values. Instead it uses a prior based on locally weighted likelihood.

> as you wrote it also gave you p values for your high-variance genes that you considered implausibly low.
>
> Simon

We don't actually know whether tagwise dispersion was used in the edgeR analysis, nor have we seen the gene list (at least I haven't). Without knowing either the analysis or the results, it would seem premature to draw conclusions about the behaviour of estimateTagwiseDisp.

Best wishes
Gordon