Question

edgeR outlier question for version 3.0.8

Simon Melov ▴ 340

@simon-melov-266

Last seen 9.7 years ago

Hi Gordon,

I'm analyzing a fairly large RNAseq data set (N of roughly 20 per group), and wanted to know what changes had been made to the prior, as you indicate below from last year. Do we still need to set the prior when looking at comparatively large data sets? Or does edgeR smooth less in the latest version?

thanks

Simon.

On Wed, 9 May 2012, Gordon K Smyth wrote:

> Hi Simon,
>
> edgeR does take into account the amount of biological variation within groups, and it does de-prioritize genes that are inconsistent within groups, although it seems not as strongly as you'd like in your case.
>
> Here are some quick solutions. First, the filtering you've done sounds good, but I would require a minimum cpm for at least ten samples for your experiment, rather than eight. That's because both of your groups are of size ten. From your description, if you filter genes that fail to achieve at least 2 cpm in >= 10 samples, that may take care of the one-offs.
>
> Second, edgeR (unlike limma) doesn't have the ability to automatically adapt the degree of empirical Bayes smoothing, but you can adjust it yourself. The default prior degrees of freedom for the edgeR empirical Bayes procedure is set at 20. You might need a smaller value, perhaps a lot smaller. Try prior df of 2, say, which you can achieve by setting prior.n=2/18 when you run estimateTagwiseDisp(). The smaller you make this value, the more strongly edgeR will down-weight genes that are inconsistent within replicates.
>
> A more radical solution would be to use edgeR's glm pipeline, and to use glmQLFTest() in place of the more usual glmLRT(). In this quasi glm pipeline, estimateGLMTagwiseDisp() is omitted, and instead edgeR calls limma functions to do the empirical Bayes shrinkage, meaning that the prior df is estimated rather than preset. This also provides a more conservative statistical test that fully takes into account the uncertainty with which the dispersion is estimated. This pipeline will strongly de-prioritize genes that are inconsistent within replicates.
>
> Finally, you could consider removing outlier genes manually. There are a few ways to do that. We always look at plotBCV() plots of the estimated dispersions, and sometimes if there are obvious outliers we will identify and filter them out. If you have a small percentage of extreme outliers, this is the way to go.
>
> Best wishes
> Gordon
>
>> Date: Mon, 7 May 2012 12:19:19 -0700
>> From: Simon Melov <smelov at buckinstitute.org>
>> To: "bioconductor at r-project.org" <bioconductor at r-project.org>
>> Subject: [BioC] edgeR outlier question
>>
>> I have a reasonable RNASeq data set of 10 biological replicates of a control group versus 10 biological replicates of an experimental group. I've gone through the edgeR workflow, and get a nice list of about 1000 genes differentially expressed due to the experimental manipulation. I input the data based on total reads per gene (I'd like to get to exons too, but first things first). The data is obtained via a paired end strategy, so it's pretty good quality. The number of reads per sample (library) is about 10 million reads each. My question is, as I go through the list of significant genes which are differentially expressed between the two groups (normalized via the workflow), ranked by BH FDR down to 0.05, I see genes being judged as differentially expressed which have very low expression in most samples, yet are thrown off by 1 or 2 values, thereby achieving statistical significance. For example, a gene might have between 1 and 2 counts per million reads in one group, and be basically the same in the other group, but one of the values is perhaps at 1000 or so counts, which seems to throw off the entire group, thereby becoming "significant".
>>
>> Shouldn't edgeR take into account this sort of biological variation within a group and account for it in assessing significance? It's clear that in the above example, that sample is an outlier, and therefore the variance is so high that the gene shouldn't be ranked as being differentially expressed. I filtered the data by applying the criterion of at least 1 count per sample, and I have to have at least 8 samples per group which have this. Should there be an additional filtering criterion to exclude these outliers? Or doesn't edgeR take into account this sort of situation (I thought it did)?
>>
>> Am I doing something wrong here?
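The filtering rule and the prior.n arithmetic from the quoted 2012 reply can be sketched in base R, without edgeR. This is only an illustration under stated assumptions: the count matrix is simulated, the names `counts`, `cpm.values`, `keep` and `filtered` are made up for the example, and the relation prior df = prior.n × residual df is inferred from the figures in the reply (20 samples, 2 groups, prior.n=2/18 giving prior df of 2). With edgeR loaded, `cpm()` would normally be used for the per-million scaling.

```r
# Simulated counts (an assumption): 1000 genes x 20 samples, 10 per group.
set.seed(1)
counts <- matrix(rnbinom(1000 * 20, mu = 20, size = 2), nrow = 1000)
group <- factor(rep(c("control", "treated"), each = 10))

# Make gene 1 a "one-off": zero everywhere except a single large value.
counts[1, ] <- 0
counts[1, 3] <- 1000

# Counts per million: scale each column (sample) by its library size.
lib.size <- colSums(counts)
cpm.values <- t(t(counts) / lib.size) * 1e6

# Keep genes reaching at least 2 cpm in at least 10 samples,
# 10 being the size of the smaller group.
min.samples <- min(table(group))
keep <- rowSums(cpm.values >= 2) >= min.samples
filtered <- counts[keep, , drop = FALSE]

# Prior df implied by prior.n, following the numbers in the reply:
# residual df = 20 samples - 2 groups = 18, so prior.n = 2/18 gives prior df 2.
residual.df <- ncol(counts) - nlevels(group)
prior.n <- 2 / residual.df
prior.df <- prior.n * residual.df
```

In this toy data the one-off gene fails the filter because only one of its samples reaches 2 cpm, which is the behaviour the reply is relying on.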

RNASeq GO edgeR • 1.1k views

updated 10.9 years ago by Gordon Smyth 50k • written 10.9 years ago by Simon Melov ▴ 340

score 0 · Answer 1 · 2013-06-26

Dear Simon,

Have you looked at the help page ?estimateGLMTagwiseDisp

That will tell you what the defaults are. edgeR adapts automatically to different numbers of arrays, and there is usually no need for you to intervene, even if your experiment has large sample sizes.

The default for prior.df was reduced from 20 to 10 in edgeR 3.2.X. That should be small enough for most data sets, unless there is a very serious problem with outliers, in which case I would recommend robust alternatives:

https://www.stat.math.ethz.ch/pipermail/bioconductor/2013-June/053357.html

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
http://www.statsci.org/smyth
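As a rough sketch of where prior.df enters the glm pipeline, the following passes it explicitly to estimateGLMTagwiseDisp() and then, as a variant, runs the quasi-likelihood route mentioned in the 2012 reply. Several things here are assumptions rather than anything stated in the thread: the counts are simulated, the object names are illustrative, and the quasi-likelihood calls use the glmQLFit()/glmQLFTest() pairing from current edgeR releases, which may differ from the interface available in 3.0.8.

```r
library(edgeR)

# Simulated counts (an assumption): 1000 genes x 40 samples, 20 per group.
set.seed(1)
counts <- matrix(rnbinom(1000 * 40, mu = 50, size = 5), nrow = 1000)
rownames(counts) <- paste0("gene", seq_len(nrow(counts)))
group <- factor(rep(c("control", "treated"), each = 20))
design <- model.matrix(~ group)

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)

# Dispersion estimation; prior.df is written out so the amount of
# empirical Bayes shrinkage is visible (try smaller values to shrink less).
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design, prior.df = 10)

# Likelihood ratio test for the group effect.
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef = 2)
top <- topTags(lrt)

# Quasi-likelihood variant: the prior df is estimated from the data
# rather than preset (current edgeR API, assumed here).
qlfit <- glmQLFit(y, design)
qlf <- glmQLFTest(qlfit, coef = 2)
top.ql <- topTags(qlf)
```

Comparing `top` and `top.ql` on real data, alongside a plotBCV(y) plot, is one way to see whether outlier-driven genes drop down the ranking.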