A few Q's on using DEXSeq with mucho data
2
0
Entering edit mode
@steve-lianoglou-2771
Last seen 13 months ago
United States
Hi, Imagine, if you will, that I have data for 7-8 different conditions and I'd like to use DEXSeq to test for differential exon usage. Is it best to create an ExonCountSet with all of my data (with an appropriate design matrix that identifies which samples belong to which condition) and do the estimateDispersion step with that? My guess is yes, but I wanted to double check -- am I in danger of maybe flagging some genes/exons as non-testable if, for example, it is only expressed in 2 out of 8 conditions? Assuming I should use all of my data to `estimateDisperion`s, what if I only have one replicate for one of the 8 conditions, is it best to remove it? I'm guessing it wouldn't provide any meaning information for dispersion estimation since it's only one observation for that condition. Lastly, when testing for differences in exon usage, assuming I've been using the data from all of my experiments up to this point, I don't see a way to specify which experiments I want to specifically test. If I run "the normal" DEXSeq analysis on my data, I end up with a DEUresultTable that looks like it has log2fold(x/y) values for all experiments against just one y. There is only one pvalue/padjust column, so I'm not sure what comparison that is for. Thanks for any help, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
Cancer DEXSeq Cancer DEXSeq • 1.0k views
ADD COMMENT
0
Entering edit mode
@steve-lianoglou-2771
Last seen 13 months ago
United States
Hi Steve, I had some thought for you that struck me right after I sent out this email, specifically your "hunch" about using all the data to estimate and fit the dispersion for each "bin". Perhaps it's a good idea to use all the data to estimateDispersions? (Still quite curious about that), but it seems that doing the subsequent `fitDispersionFunction` step might not be a great idea to run against all of the data at once. I say this because by looking through the code, it seems like the dispersion is fit for each exon/bin by the (normalized) mean expression of that bin across the entire dataset. So, if the expression of the gene (and exon) is quite variable across all conditions we have data for, when we go back and try to test differential exon usage for a specific condition 1 vs. condition 2 case, I think we'd rather fit the dispersion for the mean expression of that bin for the two conditions under test. Isn't that right? Sorry for the spam, just trying to reason a bit about this stuff ... Thanks for any help, -steve On Thu, Mar 8, 2012 at 10:08 AM, Steve Lianoglou <mailinglist.honeypot at="" gmail.com=""> wrote: > Hi, > > Imagine, if you will, that I have data for 7-8 different conditions > and I'd like to use DEXSeq to test for differential exon usage. > > Is it best to create an ExonCountSet with all of my data (with an > appropriate design matrix that identifies which samples belong to > which condition) and do the estimateDispersion step with that? > > My guess is yes, but I wanted to double check -- am I in danger of > maybe flagging some genes/exons as non-testable if, for example, it is > only expressed in 2 out of 8 conditions? > > Assuming I should use all of my data to `estimateDisperion`s, what if > I only have one replicate for one of the 8 conditions, is it best to > remove it? I'm guessing it wouldn't provide any meaning information > for dispersion estimation since it's only one observation for that > condition. > > Lastly, when testing for differences in exon usage, assuming I've been > using the data from all of my experiments up to this point, I don't > see a way to specify which experiments I want to specifically test. > > If I run "the normal" DEXSeq analysis on my data, I end up with a > DEUresultTable that looks like it has log2fold(x/y) values for all > experiments against just one y. There is only one pvalue/padjust > column, so I'm not sure what comparison that is for. > > Thanks for any help, > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > ?| Memorial Sloan-Kettering Cancer Center > ?| Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD COMMENT
0
Entering edit mode
Simon Anders ★ 3.7k
@simon-anders-3855
Last seen 3.7 years ago
Zentrum für Molekularbiologie, Universi…
Hi Steve On 03/08/2012 04:08 PM, Steve Lianoglou wrote: > Imagine, if you will, that I have data for 7-8 different conditions > and I'd like to use DEXSeq to test for differential exon usage. > > Is it best to create an ExonCountSet with all of my data (with an > appropriate design matrix that identifies which samples belong to > which condition) and do the estimateDispersion step with that? > > My guess is yes, but I wanted to double check -- am I in danger of > maybe flagging some genes/exons as non-testable if, for example, it is > only expressed in 2 out of 8 conditions? An exon is flagged as not testable if the sum over all samples of the read counts mapped to it is less then 'minCounts' (default: 10). Hence, adding additional samples will never reduce the number of testable exons. > Assuming I should use all of my data to `estimateDisperion`s, what if > I only have one replicate for one of the 8 conditions, is it best to > remove it? I'm guessing it wouldn't provide any meaning information > for dispersion estimation since it's only one observation for that > condition. The non-replicates sample would be ignored in the dispersion estimation anyway. > Lastly, when testing for differences in exon usage, assuming I've been > using the data from all of my experiments up to this point, I don't > see a way to specify which experiments I want to specifically test. > > If I run "the normal" DEXSeq analysis on my data, I end up with a > DEUresultTable that looks like it has log2fold(x/y) values for all > experiments against just one y. There is only one pvalue/padjust > column, so I'm not sure what comparison that is for. The p value is from a test against the null hypothesis that exon usage does not depend on condition, i.e., that it is the same between all conditions (similar to the F test in standard anova). If you want to contrast two specific conditions, you should subset. This leaves the question whether to subset before or after dispersion estimation, which is what your follow-up plot is concerned with, too: > I had some thought for you that struck me right after I sent out this > email, specifically your "hunch" about using all the data to estimate > and fit the dispersion for each "bin". > > Perhaps it's a good idea to use all the data to estimateDispersions? > (Still quite curious about that), but it seems that doing the > subsequent `fitDispersionFunction` step might not be a great idea to > run against all of the data at once. > > I say this because by looking through the code, it seems like the > dispersion is fit for each exon/bin by the (normalized) mean > expression of that bin across the entire dataset. So, if the > expression of the gene (and exon) is quite variable across all > conditions we have data for, when we go back and try to test > differential exon usage for a specific condition 1 vs. condition 2 > case, I think we'd rather fit the dispersion for the mean expression > of that bin for the two conditions under test. > > Isn't that right? If some of your conditions have much higher variability than the other conditions, they will in fact cause you to lose power even in comparisons which do not involve them. This would be an argument for subsetting before dispersion estimation. However, if the dispersions are roughly the same in all conditions, you are better off with estimating a pooled dispersion from all samples. This is because the per-exon estimates are then based on more degrees of freedom and have less sampling variance. Remember that DEXSeq uses in its test the maximum of the per-exon estimate and the fitted estimate. This is to avoid that exons with truly high variability are tested with a too low dispersion value, which would likely lead to a false positive. However, a high per-exon estimate can also arise from an exon that actually has low true dispersion, simply because of the estimator's sampling noise. This costs power, and estimating the dispersion from as many samples as possible can considerably reduce this loss of power. In classical OLS ANOVA, something like our maximum rule is not necessary because the F test is informed about the precision of the variance estimate by the specified lower DoF number. Nevertheless, the effect is the same: an ANOVA procedure has more power if the variance is estimated in a pooled fashion from all samples and this is hence what is usually done. (Furthermore, the standard F test would not even be valid if the conditions had unequal variances.) Hence, stick to pooling all samples, unless one of the conditions is really bad. Simon
ADD COMMENT
0
Entering edit mode
Hi Simon, Thanks for the detailed response. I just wanted to clarify one bit regarding the call to `fitDispersionFunction` regarding the concern I rose in my email back to myself, which is that we are fitting the dispersion. You say: > If some of your conditions have much higher variability than the other > conditions, they will in fact cause you to lose power even in comparisons > which do not involve them. This would be an argument for subsetting before > dispersion estimation. I'm not so much concerned with, say, an experiment (or batch of experiments) showing more variability between read counts (although it is a possibility) for the same bin in the same condition. I was trying to convey my concern (maybe wrong) that the "fitted dispersion" we end up using is a function of the mean read count for the bin, where the mean is calculated from the expression of the gene across all samples/conditions. It could be that in one particular two-experiment comparison I want to make, the expression of the exon is quite high in both samples. In this case, the higher "averaged normalzed count value" (x-axis of fig 2 in your pre-paper) would likely be associated w/ a lower dispersion when doing the test therefore increasing our power. It could be, however, that in the rest of the conditions the gene (and therefore the exon) would be expressed at a lower level, and the dispersion for that bin would then be estimated at a higher amount, decreasing the power in this case. So -- I feel like I'd like to use all the data to calculate the "mean normalized count" to dispersion line of best fit (again, Fig 2 in the pre-paper) ... so far, so good. But for each pairwise test between conditions I now want to perform, I'd set the `dispFitted` column for the bin by using the mean expression of the bin only from the two conditions under test. Sorry if that's not all that clear -- hopefully what I'm trying to do makes sense -- but also (hopefully) it doesn't sound too boneheaded. What do you think? -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD REPLY
0
Entering edit mode
Hi Steve On 2012-03-08 19:31, Steve Lianoglou wrote: [...] > I was trying to convey my concern (maybe wrong) that the "fitted > dispersion" we end up using is a function of the mean read count for > the bin, where the mean is calculated from the expression of the gene > across all samples/conditions. > > It could be that in one particular two-experiment comparison I want to > make, the expression of the exon is quite high in both samples. In > this case, the higher "averaged normalzed count value" (x-axis of fig > 2 in your pre-paper) would likely be associated w/ a lower dispersion > when doing the test therefore increasing our power. > > It could be, however, that in the rest of the conditions the gene (and > therefore the exon) would be expressed at a lower level, and the > dispersion for that bin would then be estimated at a higher amount, > decreasing the power in this case. [...] You are right, this could be an argument in favour of subsetting before dispersion estimation. I am not quite sure how important this effect is in practice, though. Bear in mind that the dispersion does not contain the Poisson noise. For mean ? and a dispersion ?, the variance is v = ? + ? ??, and the coefficient of variance (CV) squared is CV? = v/?? = 1/? + ?. Hence, the dominant term for the dependence of variance and hence power on mean is the Poisson term 1/?, and not so much any remaining dependence of ? on ?. A negative binomial generalized linear model takes this into account: it uses the mean-variance relation v = ? + ? ??, with ?, not v, considered constant across the model, precisely because this handles well the effect on variance of differences in overall mean in the different treatment groups. Nevertheless, as not only CV? but also ? itself seems to decrease with ?, this is not perfect. Simon
ADD REPLY
0
Entering edit mode
Hi Simon, Just wanted to say thanks for taking the time to provide these detailed responses. Cheers, -steve On Sun, Mar 11, 2012 at 7:04 AM, Simon Anders <anders at="" embl.de=""> wrote: > Hi Steve > > On 2012-03-08 19:31, Steve Lianoglou wrote: > [...] > >> I was trying to convey my concern (maybe wrong) that the "fitted >> dispersion" we end up using is a function of the mean read count for >> the bin, where the mean is calculated from the expression of the gene >> across all samples/conditions. >> >> It could be that in one particular two-experiment comparison I want to >> make, the expression of the exon is quite high in both samples. In >> this case, the higher "averaged normalzed count value" (x-axis of fig >> 2 in your pre-paper) would likely be associated w/ a lower dispersion >> when doing the test therefore increasing our power. >> >> It could be, however, that in the rest of the conditions the gene (and >> therefore the exon) would be expressed at a lower level, and the >> dispersion for that bin would then be estimated at a higher amount, >> decreasing the power in this case. > > [...] > > You are right, this could be an argument in favour of subsetting before > dispersion estimation. I am not quite sure how important this effect is in > practice, though. > > Bear in mind that the dispersion does not contain the Poisson noise. For > mean ? and a dispersion ?, the variance is v = ? + ? ??, and the coefficient > of variance (CV) squared is CV? = v/?? = 1/? + ?. > > Hence, the dominant term for the dependence of variance and hence power on > mean is the Poisson term 1/?, and not so much any remaining dependence of ? > on ?. A negative binomial generalized linear model takes this into account: > it uses the mean-variance relation > v = ? + ? ??, with ?, not v, considered constant across the model, precisely > because this handles well the effect on variance of differences in overall > mean in the different treatment groups. > > Nevertheless, as not only CV? but also ? itself seems to decrease with ?, > this is not perfect. > > ?Simon -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD REPLY

Login before adding your answer.

Traffic: 689 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6