Question: A few Q's on using DEXSeq with mucho data
0
7.8 years ago by
Denali
Steve Lianoglou12k wrote:
Hi, Imagine, if you will, that I have data for 7-8 different conditions and I'd like to use DEXSeq to test for differential exon usage. Is it best to create an ExonCountSet with all of my data (with an appropriate design matrix that identifies which samples belong to which condition) and do the estimateDispersion step with that? My guess is yes, but I wanted to double check -- am I in danger of maybe flagging some genes/exons as non-testable if, for example, it is only expressed in 2 out of 8 conditions? Assuming I should use all of my data to estimateDisperions, what if I only have one replicate for one of the 8 conditions, is it best to remove it? I'm guessing it wouldn't provide any meaning information for dispersion estimation since it's only one observation for that condition. Lastly, when testing for differences in exon usage, assuming I've been using the data from all of my experiments up to this point, I don't see a way to specify which experiments I want to specifically test. If I run "the normal" DEXSeq analysis on my data, I end up with a DEUresultTable that looks like it has log2fold(x/y) values for all experiments against just one y. There is only one pvalue/padjust column, so I'm not sure what comparison that is for. Thanks for any help, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
cancer dexseq • 641 views
modified 7.8 years ago by Simon Anders3.6k • written 7.8 years ago by Steve Lianoglou12k
Answer: A few Q's on using DEXSeq with mucho data
0
7.8 years ago by
Denali
Steve Lianoglou12k wrote:
Hi Steve, I had some thought for you that struck me right after I sent out this email, specifically your "hunch" about using all the data to estimate and fit the dispersion for each "bin". Perhaps it's a good idea to use all the data to estimateDispersions? (Still quite curious about that), but it seems that doing the subsequent fitDispersionFunction step might not be a great idea to run against all of the data at once. I say this because by looking through the code, it seems like the dispersion is fit for each exon/bin by the (normalized) mean expression of that bin across the entire dataset. So, if the expression of the gene (and exon) is quite variable across all conditions we have data for, when we go back and try to test differential exon usage for a specific condition 1 vs. condition 2 case, I think we'd rather fit the dispersion for the mean expression of that bin for the two conditions under test. Isn't that right? Sorry for the spam, just trying to reason a bit about this stuff ... Thanks for any help, -steve On Thu, Mar 8, 2012 at 10:08 AM, Steve Lianoglou <mailinglist.honeypot at="" gmail.com=""> wrote: > Hi, > > Imagine, if you will, that I have data for 7-8 different conditions > and I'd like to use DEXSeq to test for differential exon usage. > > Is it best to create an ExonCountSet with all of my data (with an > appropriate design matrix that identifies which samples belong to > which condition) and do the estimateDispersion step with that? > > My guess is yes, but I wanted to double check -- am I in danger of > maybe flagging some genes/exons as non-testable if, for example, it is > only expressed in 2 out of 8 conditions? > > Assuming I should use all of my data to estimateDisperions, what if > I only have one replicate for one of the 8 conditions, is it best to > remove it? I'm guessing it wouldn't provide any meaning information > for dispersion estimation since it's only one observation for that > condition. > > Lastly, when testing for differences in exon usage, assuming I've been > using the data from all of my experiments up to this point, I don't > see a way to specify which experiments I want to specifically test. > > If I run "the normal" DEXSeq analysis on my data, I end up with a > DEUresultTable that looks like it has log2fold(x/y) values for all > experiments against just one y. There is only one pvalue/padjust > column, so I'm not sure what comparison that is for. > > Thanks for any help, > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > ?| Memorial Sloan-Kettering Cancer Center > ?| Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
Answer: A few Q's on using DEXSeq with mucho data
0
7.8 years ago by
Simon Anders3.6k
Zentrum für Molekularbiologie, Universität Heidelberg
Simon Anders3.6k wrote:
Hi Steve On 03/08/2012 04:08 PM, Steve Lianoglou wrote: > Imagine, if you will, that I have data for 7-8 different conditions > and I'd like to use DEXSeq to test for differential exon usage. > > Is it best to create an ExonCountSet with all of my data (with an > appropriate design matrix that identifies which samples belong to > which condition) and do the estimateDispersion step with that? > > My guess is yes, but I wanted to double check -- am I in danger of > maybe flagging some genes/exons as non-testable if, for example, it is > only expressed in 2 out of 8 conditions? An exon is flagged as not testable if the sum over all samples of the read counts mapped to it is less then 'minCounts' (default: 10). Hence, adding additional samples will never reduce the number of testable exons. > Assuming I should use all of my data to estimateDisperions, what if > I only have one replicate for one of the 8 conditions, is it best to > remove it? I'm guessing it wouldn't provide any meaning information > for dispersion estimation since it's only one observation for that > condition. The non-replicates sample would be ignored in the dispersion estimation anyway. > Lastly, when testing for differences in exon usage, assuming I've been > using the data from all of my experiments up to this point, I don't > see a way to specify which experiments I want to specifically test. > > If I run "the normal" DEXSeq analysis on my data, I end up with a > DEUresultTable that looks like it has log2fold(x/y) values for all > experiments against just one y. There is only one pvalue/padjust > column, so I'm not sure what comparison that is for. The p value is from a test against the null hypothesis that exon usage does not depend on condition, i.e., that it is the same between all conditions (similar to the F test in standard anova). If you want to contrast two specific conditions, you should subset. This leaves the question whether to subset before or after dispersion estimation, which is what your follow-up plot is concerned with, too: > I had some thought for you that struck me right after I sent out this > email, specifically your "hunch" about using all the data to estimate > and fit the dispersion for each "bin". > > Perhaps it's a good idea to use all the data to estimateDispersions? > (Still quite curious about that), but it seems that doing the > subsequent fitDispersionFunction step might not be a great idea to > run against all of the data at once. > > I say this because by looking through the code, it seems like the > dispersion is fit for each exon/bin by the (normalized) mean > expression of that bin across the entire dataset. So, if the > expression of the gene (and exon) is quite variable across all > conditions we have data for, when we go back and try to test > differential exon usage for a specific condition 1 vs. condition 2 > case, I think we'd rather fit the dispersion for the mean expression > of that bin for the two conditions under test. > > Isn't that right? If some of your conditions have much higher variability than the other conditions, they will in fact cause you to lose power even in comparisons which do not involve them. This would be an argument for subsetting before dispersion estimation. However, if the dispersions are roughly the same in all conditions, you are better off with estimating a pooled dispersion from all samples. This is because the per-exon estimates are then based on more degrees of freedom and have less sampling variance. Remember that DEXSeq uses in its test the maximum of the per-exon estimate and the fitted estimate. This is to avoid that exons with truly high variability are tested with a too low dispersion value, which would likely lead to a false positive. However, a high per-exon estimate can also arise from an exon that actually has low true dispersion, simply because of the estimator's sampling noise. This costs power, and estimating the dispersion from as many samples as possible can considerably reduce this loss of power. In classical OLS ANOVA, something like our maximum rule is not necessary because the F test is informed about the precision of the variance estimate by the specified lower DoF number. Nevertheless, the effect is the same: an ANOVA procedure has more power if the variance is estimated in a pooled fashion from all samples and this is hence what is usually done. (Furthermore, the standard F test would not even be valid if the conditions had unequal variances.) Hence, stick to pooling all samples, unless one of the conditions is really bad. Simon
Hi Simon, Thanks for the detailed response. I just wanted to clarify one bit regarding the call to fitDispersionFunction regarding the concern I rose in my email back to myself, which is that we are fitting the dispersion. You say: > If some of your conditions have much higher variability than the other > conditions, they will in fact cause you to lose power even in comparisons > which do not involve them. This would be an argument for subsetting before > dispersion estimation. I'm not so much concerned with, say, an experiment (or batch of experiments) showing more variability between read counts (although it is a possibility) for the same bin in the same condition. I was trying to convey my concern (maybe wrong) that the "fitted dispersion" we end up using is a function of the mean read count for the bin, where the mean is calculated from the expression of the gene across all samples/conditions. It could be that in one particular two-experiment comparison I want to make, the expression of the exon is quite high in both samples. In this case, the higher "averaged normalzed count value" (x-axis of fig 2 in your pre-paper) would likely be associated w/ a lower dispersion when doing the test therefore increasing our power. It could be, however, that in the rest of the conditions the gene (and therefore the exon) would be expressed at a lower level, and the dispersion for that bin would then be estimated at a higher amount, decreasing the power in this case. So -- I feel like I'd like to use all the data to calculate the "mean normalized count" to dispersion line of best fit (again, Fig 2 in the pre-paper) ... so far, so good. But for each pairwise test between conditions I now want to perform, I'd set the dispFitted column for the bin by using the mean expression of the bin only from the two conditions under test. Sorry if that's not all that clear -- hopefully what I'm trying to do makes sense -- but also (hopefully) it doesn't sound too boneheaded. What do you think? -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact