Log transformation and left censoring

0

Entering edit mode

Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 4 months ago

EMBL European Molecular Biology Laborat…

Hi Paul given your description, one possibility to explore might be a variance stabilising transformation. E.g. DESeq provides one that smoothly interpolates between the square- root function for low counts and the log-transformation for higher counts, see Section 6 (and 7) of the vignette. Best wishes Wolfgang Il giorno Jan 31, 2013, alle ore 8:57 AM, Paul Harrison <paul.harrison at="" monash.edu=""> ha scritto: > Hello, > > We have been using voom and limma for some time now, and while we're > fairly happy with it, it seems to produce significance levels that are > on the conservative side. We also use edgeR to produce more optimistic > results, but don't entirely trust the significance levels that it > reports. I am looking for something in-between these extremes, and > want to run an idea past this list as a sanity check. I would > especially value Gordon and Charity's comments if they have time. > > The voom log transformation is essentially: > > log2( (count+0.5) / library.size ) > > It then does some clever things with weights. What I'm considering instead is > > log2( count / library.size + moderation.amount / mean.library.size ) > > where moderation.amount is much larger then 0.5, say 5. A couple of things here: > > - Instead of down-weighting low counts, I'm trying to get rid of the > extra variation from low counts by artificially left censoring the > data. > > - I'm using the mean of the libaray sizes because I want the left > censor to be in the same place for each sample even if the library > sizes are different, so that if a gene is entirely switched off in one > condition it won't look variable just because there is a different > left censor in each sample. > > I'm also using this transformation to create heatmaps. > > This seems to be working with the data set I am working with, I get > more significant results and they seem reasonable by eye. It seems to > me that even if this approach isn't ideal it should at least be safe, > at worst it will cause limma to reduce the df.prior and produce less > significant results. Anything I've missed? > > -- > Paul Harrison > > Victorian Bioinformatics Consortium / Monash University > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

limma edgeR DESeq limma edgeR DESeq • 2.4k views

ADD COMMENT • link updated 12.4 years ago by Gordon Smyth 52k • written 12.4 years ago by Wolfgang Huber ★ 13k

0

Entering edit mode

Paul Harrison ▴ 110

@paul-harrison-5740

Last seen 5 days ago

Australia/Melbourne/Monash University G…

On Sun, Feb 3, 2013 at 10:24 PM, Wolfgang Huber <whuber at="" embl.de=""> wrote: > > Hi Paul > > given your description, one possibility to explore might be a variance > stabilising transformation. > > E.g. DESeq provides one that smoothly interpolates between the square-root > function for low counts and the log-transformation for higher counts, see > Section 6 (and 7) of the vignette. > Thanks Wolfgang. Zero is tranformed to the same value in each of the samples, which is good. The ability of limma to detect differential expression with this transformation appears to be about the same as with voom. But perhaps I should be using DESeq's nbinomTest. I do wonder if this is still too mild near zero. When I plot variance for each gene vs mean for each gene, there is a distinct bump at low expression levels. Looking at vst.pdf the assumption seems to be that the variance of a zero count is zero, which isn't correct. Seeing a zero count, my expectation of the true mean would be somewhat greater than zero, and so the variance would also be somewhat greater than zero. -- Paul Harrison Victorian Bioinformatics Consortium / Monash University

ADD COMMENT • link 12.4 years ago Paul Harrison ▴ 110

0

Entering edit mode

Hi Paul the model (in the "parametric" case) is var(X) = a * E(X) + b * E(X)^2 with some a,b>0. And you are right that var(X)=0 <=> E(X)=0 but since the support of X is positive, the only way that E(X)=0 is when the distribution of X is a point mass at 0. So that assumption has to be correct. For microarray data (e.g. in vsn), a model with a constant term on the right hand side of above equation (and Var(X)>0 even if E(X)=0) was needed, but it is hard to see how that could be the case for count data. However, it could be that the variance-mean relationship of your data is not fit by any well-behaved smooth function, e.g. if you have other strong determinants of the variance than the mean (e.g. type of gene). In that case, there is no variance-stabilising transformation. One would need to explore your data for that. Best wishes Wolfgang Il giorno Feb 4, 2013, alle ore 7:33 AM, Paul Harrison <paul.harrison at="" monash.edu=""> ha scritto: > On Sun, Feb 3, 2013 at 10:24 PM, Wolfgang Huber <whuber at="" embl.de=""> wrote: >> >> Hi Paul >> >> given your description, one possibility to explore might be a variance >> stabilising transformation. >> >> E.g. DESeq provides one that smoothly interpolates between the square-root >> function for low counts and the log-transformation for higher counts, see >> Section 6 (and 7) of the vignette. >> > > Thanks Wolfgang. Zero is tranformed to the same value in each of the > samples, which is good. The ability of limma to detect differential > expression with this transformation appears to be about the same as > with voom. But perhaps I should be using DESeq's nbinomTest. > > I do wonder if this is still too mild near zero. When I plot variance > for each gene vs mean for each gene, there is a distinct bump at low > expression levels. Looking at vst.pdf the assumption seems to be that > the variance of a zero count is zero, which isn't correct. Seeing a > zero count, my expectation of the true mean would be somewhat greater > than zero, and so the variance would also be somewhat greater than > zero. > > -- > Paul Harrison > > Victorian Bioinformatics Consortium / Monash University

ADD REPLY • link 12.4 years ago Wolfgang Huber ★ 13k

0

Entering edit mode

On Mon, Feb 4, 2013 at 7:56 PM, Wolfgang Huber <whuber at="" embl.de=""> wrote: > > Hi Paul > > the model (in the "parametric" case) is > > var(X) = a * E(X) + b * E(X)^2 > > with some a,b>0. And you are right that var(X)=0 <=> E(X)=0 > but since the support of X is positive, the only way that E(X)=0 is when the distribution of X is a point mass at 0. So that assumption has to be correct. For microarray data (e.g. in vsn), a model with a constant term on the right hand side of above equation (and Var(X)>0 even if E(X)=0) was needed, but it is hard to see how that could be the case for count data. > > However, it could be that the variance-mean relationship of your data is not fit by any well-behaved smooth function, e.g. if you have other strong determinants of the variance than the mean (e.g. type of gene). In that case, there is no variance-stabilising transformation. One would need to explore your data for that. > Ok, if I wish to proceed with my folly it looks like the glog function log(x+sqrt(x*x+1)) used by vsn would be a good choice, since it is based on var(X) = a + b * E(X)^2. This does actually look marginally better. The method of getting a variance stabilizing transform from var(x) involves an approximation, maybe that's breaking down slightly, or maybe my data is weird. Thanks, I feel I have a few different perspectives and a better idea of the history of these ideas now. Paul Harrison Victorian Bioinformatics Consortium / Monash University

ADD REPLY • link 12.4 years ago Paul Harrison ▴ 110

0

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 12 minutes ago

WEHI, Melbourne, Australia

Dear Paul, The transformation that you propose is the same transformation that is done by predFC(y) in the edgeR package, or by cpm(y,log=TRUE) in the developmental version of the edgeR package. The argument prior.count controls the moderation amount. This is the same transformation that we recommend and use ourselves for heatmaps. See Section 2.10 of the edgeR User's Guide: http://bioconductor.org/packages/2.12/bioc/vignettes/edgeR/inst/doc/ed geRUsersGuide.pdf There is an example of its use on page 58 of the User's Guide. Belinda Phipson has shown as part of her PhD work that, under some assumptions, this transformation comes close to minimizing the mean square error when predicting the true log fold changes. Simply putting these logCPM values into limma will perform comparably to voom if the library sizes are not very different, provided that you use eBayes(fit,trend=TRUE). When the library sizes are different, however, voom is the clear winner. There is no censoring. A major reason for adding an offset (aka prior.count) to the counts is to avoid the need to censor, truncate or remove observations. Rather a mononotic transformation of the counts is performed for each library. Best wishes Gordon On Jan 31, 2013, 8:57 AM, Paul Harrison <paul.harrison at="" monash.edu=""> wrote: > Hello, > > We have been using voom and limma for some time now, and while we're > fairly happy with it, it seems to produce significance levels that are > on the conservative side. We also use edgeR to produce more optimistic > results, but don't entirely trust the significance levels that it > reports. I am looking for something in-between these extremes, and want > to run an idea past this list as a sanity check. I would especially > value Gordon and Charity's comments if they have time. > > The voom log transformation is essentially: > > log2( (count+0.5) / library.size ) > > It then does some clever things with weights. What I'm considering > instead is > > log2( count / library.size + moderation.amount / mean.library.size ) > > where moderation.amount is much larger then 0.5, say 5. A couple of > things here: > > - Instead of down-weighting low counts, I'm trying to get rid of the > extra variation from low counts by artificially left censoring the data. > > - I'm using the mean of the libaray sizes because I want the left censor > to be in the same place for each sample even if the library sizes are > different, so that if a gene is entirely switched off in one condition > it won't look variable just because there is a different left censor in > each sample. > > I'm also using this transformation to create heatmaps. > > This seems to be working with the data set I am working with, I get more > significant results and they seem reasonable by eye. It seems to me that > even if this approach isn't ideal it should at least be safe, at worst > it will cause limma to reduce the df.prior and produce less significant > results. Anything I've missed? > > -- > Paul Harrison > > Victorian Bioinformatics Consortium / Monash University ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 12.4 years ago Gordon Smyth 52k

0

Entering edit mode

BTW, the coefficients and log-fold-changes returned by glmFit() and exactTest() in edgeR also agree with what you would get from this transformation (but applied at the glm modelling level, not simply linear modelled on the log-scale). Gordon On Tue, 5 Feb 2013, Gordon K Smyth wrote: > Dear Paul, > > The transformation that you propose is the same transformation that is done > by predFC(y) in the edgeR package, or by cpm(y,log=TRUE) in the developmental > version of the edgeR package. The argument prior.count controls the > moderation amount. > > This is the same transformation that we recommend and use ourselves for > heatmaps. See Section 2.10 of the edgeR User's Guide: > > http://bioconductor.org/packages/2.12/bioc/vignettes/edgeR/inst/doc/ edgeRUsersGuide.pdf > > There is an example of its use on page 58 of the User's Guide. > > Belinda Phipson has shown as part of her PhD work that, under some > assumptions, this transformation comes close to minimizing the mean square > error when predicting the true log fold changes. > > Simply putting these logCPM values into limma will perform comparably to voom > if the library sizes are not very different, provided that you use > eBayes(fit,trend=TRUE). When the library sizes are different, however, voom > is the clear winner. > > There is no censoring. A major reason for adding an offset (aka prior.count) > to the counts is to avoid the need to censor, truncate or remove > observations. Rather a mononotic transformation of the counts is performed > for each library. > > Best wishes > Gordon > > > On Jan 31, 2013, 8:57 AM, Paul Harrison <paul.harrison at="" monash.edu=""> wrote: > >> Hello, >> >> We have been using voom and limma for some time now, and while we're fairly >> happy with it, it seems to produce significance levels that are on the >> conservative side. We also use edgeR to produce more optimistic results, >> but don't entirely trust the significance levels that it reports. I am >> looking for something in-between these extremes, and want to run an idea >> past this list as a sanity check. I would especially value Gordon and >> Charity's comments if they have time. >> >> The voom log transformation is essentially: >> >> log2( (count+0.5) / library.size ) >> >> It then does some clever things with weights. What I'm considering instead >> is >> >> log2( count / library.size + moderation.amount / mean.library.size ) >> >> where moderation.amount is much larger then 0.5, say 5. A couple of things >> here: >> >> - Instead of down-weighting low counts, I'm trying to get rid of the extra >> variation from low counts by artificially left censoring the data. >> >> - I'm using the mean of the libaray sizes because I want the left censor to >> be in the same place for each sample even if the library sizes are >> different, so that if a gene is entirely switched off in one condition it >> won't look variable just because there is a different left censor in each >> sample. >> >> I'm also using this transformation to create heatmaps. >> >> This seems to be working with the data set I am working with, I get more >> significant results and they seem reasonable by eye. It seems to me that >> even if this approach isn't ideal it should at least be safe, at worst it >> will cause limma to reduce the df.prior and produce less significant >> results. Anything I've missed? >> >> -- >> Paul Harrison >> >> Victorian Bioinformatics Consortium / Monash University > > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD REPLY • link 12.4 years ago Gordon Smyth 52k

0

Entering edit mode

On Tue, Feb 5, 2013 at 11:23 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > Dear Paul, > > The transformation that you propose is the same transformation that is done > by predFC(y) in the edgeR package, or by cpm(y,log=TRUE) in the > developmental version of the edgeR package. The argument prior.count > controls the moderation amount. > > This is the same transformation that we recommend and use ourselves for > heatmaps. See Section 2.10 of the edgeR User's Guide: > > http://bioconductor.org/packages/2.12/bioc/vignettes/edgeR/inst/doc/ edgeRUsersGuide.pdf > > There is an example of its use on page 58 of the User's Guide. > > Belinda Phipson has shown as part of her PhD work that, under some > assumptions, this transformation comes close to minimizing the mean square > error when predicting the true log fold changes. > > Simply putting these logCPM values into limma will perform comparably to > voom if the library sizes are not very different, provided that you use > eBayes(fit,trend=TRUE). When the library sizes are different, however, voom > is the clear winner. > It's reassuring to see better statisticians than me thinking along similar lines, including using the mean library size. A difference is that the default prior.count is much smaller than I have been using, 0.25. But then in the example in section 2.10, you're a using prior.count of 2. The ideal prior.count for estimating fold changes might be different than the ideal for visualization and clustering, and the ideal for detecting differential expression might be different from the first or both of these. With my large prior.count I'm detecting extra differentially expressed genes, and in particular these are genes with low average expression. Also, an artifact in my heatmaps, which seems to be due to the different sample sizes, is removed. It might be that my data is weird, but these improvements seem to be to do with having a spread of library sizes. > There is no censoring. A major reason for adding an offset (aka > prior.count) to the counts is to avoid the need to censor, truncate or > remove observations. Rather a mononotic transformation of the counts is > performed for each library. > Apologies for being unclear. Yes, from the perspective of untransformed counts, there is no censoring. I was viewing the problem from the perspective of log transformed expression values, which is an odd viewpoint. From this perspective, there's a (soft) line left of which it's impossible estimate the log expression, just because we're using count data (microarrays have a similar line, but for different reasons). Further, where the number of reads is different in different samples, they have different (soft) cutoff lines. Statistical methods that work with count data directly, such as edgeR, shouldn't be affected by this, but when feeding log transformed values into limma it is a potential problem. I definitely wouldn't want to remove low counts, because I'm then likely to miss those genes with such a large fold change that in one group the reads are almost entirely absent, and these are likely to be exactly the genes I am looking for. So I want to (softly) truncate low counts at a consistent point before log transforming them. One downside: even if this works for straightforward differential expression, it will tend to see interactions that don't actually exist. Voom's weighting system might mitigate this, and it would be an interesting demonstration of voom's power. Hmm. What I actually want is a test that works on counts directly, with empirical priors on coefficients and the dispersion. Um. Assuming we have model0 and model1, and some reasonable priors on the parameters in these models. Say for the count data for a gene, x, we have a way to compute P(x|model0) and P(x|model1). Use P(x|model1) as a test statistic, the significance is the likelihood that P(x2|model1) >= P(x|model1) for x2 sampled from model0. We can generate a few million samples from model0 to give a distribution to test against, only need to do this once, so it's computationally feasible. For actually computing P(x|model): P(x | coefficients, dispersion) is straightforward, and assume we have priors P(coefficients | dispersion), P(dispersion). Would need to numerically integrate over dispersion, I think. For a range of dispersion values find the MAP for the coefficients then use an exp(2nd order Taylor expansion) approximation around that to integrate P(x | coefficients, dispersion) P(coefficients | dispersion) down to P(x | dispersion). Then sum P(x | dispersion) P(dispersion) to get P(x). Tweak emprical priors and repeat a few times. It wouldn't exactly be fast, but it wouldn't be unworkably slow. -- Paul Harrison Victorian Bioinformatics Consortium / Monash University

ADD REPLY • link 12.4 years ago Paul Harrison ▴ 110

Login before adding your answer.