Search
Question: voom() vs. RPKM/FPKM or otherwise normalized counts, and GC correction, when fitting models to a small number of responses (per-feature counts)
0
6.3 years ago by
Tim Triche4.2k
United States
Tim Triche4.2k wrote:
Hi Dr. Smyth and Dr. Law, I have been reading the documentation for limma::voom() and trying to understand why there seems to be no correction for the size of the feature in the model: In an experiment, a count value is observed for each tag in each sample. A tag-wise mean-variance trend is computed using lowess. The tag-wise mean is the mean log2 count with an offset of 0.5, across samples for a given tag. The tag-wise variance is the quarter-root-variance of normalized log2 counts per million values with an offset of 0.5, across samples for a given tag. Tags with zero counts across all samples are not included in the lowess fit. Optional normalization is performed using normalizeBetweenArrays. Using fitted values of log2 counts from a linear model fit by lmFit, variances from the mean-variance trend were interpolated for each observation. This was carried out by approxfun. Inverse variance weights can be used to correct for mean-variance trend in the count data. I don't see a reference to the feature size in all of this. (?) Am I missing something? Probably something major (like, say, the relationship of GC content or read length to variance)... Is the idea that features with similar sequence properties/size and abundance will have their mean-variance relationship modeled appropriately and weights generated empirically? For comparison, what I have been doing (in lieu of knowing any better) is as follows: align with Rsubread, run subjunc and splicegrapher, and count against exon/gene/feature models: alignedToRPKM <- function(readcounts) { # the output of featureCounts() millionsMapped <- colSums(readcounts$counts)/1000000 if('ExonLength' %in% names(readcounts$annotation)) { geneLengthsInKB <- readcounts$annotation$ExonLength/1000 } else { geneLengthsInKB <- readcounts$annotation$GeneLength/1000 # works fine for ncRNA and splice graph edges } # example usage: readcountsRPKM <- alignedToRPKM(readcounts) return( sweep(readcountscounts, 2, millionsMapped, '/') / geneLengthsInKB ) } (When I did pretty much the same thing with Bowtie/TopHat/CuffLinks I got about the same results but slower, so I stuck with Rsubread. And featureCounts() is really handy.) So, given the feature sizes in readcountsannotation I can at least put things on something like a similar scale. Most of my modeling currently is focused on penalized local regressions and thus a performant (but accurate) measure that can be used for linear modeling on a large scale is desirable. Is the output of voom() what I want? Does one need to use limma/lmFit() to make use of voom()'s output? Last but not least, should I use something like Yuval Benjamini's GCcorrect package (http://www.stat.berkeley.edu/~yuvalb/YuvalWeb/Software.html) before/during/instead of voom()? And if the expression of a feature or several nearby features is often the response, does it matter a great deal what I use? Thanks for any input you might have time to provide. I have to assume that the minds at WEHI periodically scheme together how best to go about these things... -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]] ADD COMMENTlink modified 6.3 years ago by Gordon Smyth35k • written 6.3 years ago by Tim Triche4.2k 0 6.3 years ago by Gordon Smyth35k Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia Gordon Smyth35k wrote: Dear Tim, I don't follow what you are trying to do scientifically, and this makes all the difference when deciding what are the appropriate tools to use. If you are undertaking some sort of analysis that requires absolute gene (or feature) expression levels as responses, then you should not be using voom or limma or edgeR. limma and edgeR do not estimate absolute expression. If on the other hand, you want to detect differentially expressed genes (or features), which is what voom does, then there is no need to correct for gene length. The comments of Section 2.3 of the edgeR User's Guide and especially 2.3.2 "Adjustments for gene length, GC content, mappability and so on" are also relevant for voom. There is no need to correct for any characteristic of a gene that remains unchanged across samples. A good case has been made that GC content can have differential influence across samples, but that doesn't apply to gene length. voom does not work on RPKM or FPKM, or on the output from cufflinks. voom estimates a mean-variance relationship, and the variance is a function of count size, not of expression level. Yes, you need limma to use the output from voom, because other softwares do not generally have the ability to use quantitative weights. If you ignore the weights, then the output from voom is just logCPM, and you hardly need voom to compute that. Best wishes Gordon --------------------------------------------- Professor Gordon K Smyth, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. smyth at wehi.edu.au http://www.wehi.edu.au http://www.statsci.org/smyth On Thu, 24 May 2012, Tim Triche, Jr. wrote: > Hi Dr. Smyth and Dr. Law, > > I have been reading the documentation for limma::voom() and trying to > understand why there seems to be no correction for the size of the feature > in the model: > > In an experiment, a count value is observed for each tag in each sample. A > tag-wise mean-variance trend is computed using lowess. The tag-wise mean is > the mean log2 count with an offset of 0.5, across samples for a given tag. > The tag-wise variance is the quarter-root-variance of normalized log2 > counts per million values with an offset of 0.5, across samples for a given > tag. Tags with zero counts across all samples are not included in the > lowess fit. Optional normalization is performed using > normalizeBetweenArrays. Using fitted values of log2 counts from a linear > model fit by lmFit, variances from the mean-variance trend were > interpolated for each observation. This was carried out by approxfun. > Inverse variance weights can be used to correct for mean-variance trend in > the count data. > > > I don't see a reference to the feature size in all of this. (?) Am I > missing something? Probably something major (like, say, the relationship > of GC content or read length to variance)... > Is the idea that features with similar sequence properties/size and > abundance will have their mean-variance relationship modeled appropriately > and weights generated empirically? > > For comparison, what I have been doing (in lieu of knowing any better) is > as follows: align with Rsubread, run subjunc and splicegrapher, and count > against exon/gene/feature models: > > alignedToRPKM <- function(readcounts) { # the output of featureCounts() > millionsMapped <- colSums(readcountscounts)/1000000 > if('ExonLength' %in% names(readcounts$annotation)) { > geneLengthsInKB <- readcounts$annotation$ExonLength/1000 > } else { > geneLengthsInKB <- readcounts$annotation$GeneLength/1000 # works fine > for ncRNA and splice graph edges > } > > # example usage: readcounts$RPKM <- alignedToRPKM(readcounts) > return( sweep(readcounts$counts, 2, millionsMapped, '/') / > geneLengthsInKB ) > } > > (When I did pretty much the same thing with Bowtie/TopHat/CuffLinks I got > about the same results but slower, so I stuck with Rsubread. And > featureCounts() is really handy.) > > So, given the feature sizes in readcounts$annotation I can at least put > things on something like a similar scale. Most of my modeling currently is > focused on penalized local regressions and thus a performant (but accurate) > measure that can be used for linear modeling on a large scale is desirable. > Is the output of voom() what I want? Does one need to use limma/lmFit() > to make use of voom()'s output? > > Last but not least, should I use something like Yuval Benjamini's GCcorrect > package (http://www.stat.berkeley.edu/~yuvalb/YuvalWeb/Software.html) > before/during/instead of voom()? > And if the expression of a feature or several nearby features is often the > response, does it matter a great deal what I use? > > Thanks for any input you might have time to provide. I have to assume that > the minds at WEHI periodically scheme together how best to go about these > things... > > > -- > *A model is a lie that helps you see the truth.* > * > * > Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
Hi Tim, On thinking about this a little more, voom() could easily output logRPKM rather than logCPM, and the same weights would apply. Indeed you could convert the voom() output to logRPKM yourself and, in principle, undertake analyses using the values if you make use of the corresponding voom weights. However voom() does need to get raw counts as input, just like edgeR, rather than RPKM. voom() can cope with a re-scaling of the counts, but not with a transformation that is non-monotonic in the counts. RPKM is an unhelpful measure from a statistical point of view, because it "forgets" how large the count was in the first plae. The aims of Yuval's package are complementary to edgeR or voom, certainly neither replaces the other. These results may inform how we do the normalization step, but we have not yet reached the stage of doing this routinely. Best wishes Gordon On Fri, 25 May 2012, Gordon K Smyth wrote: > Dear Tim, > > I don't follow what you are trying to do scientifically, and this makes all > the difference when deciding what are the appropriate tools to use. > > If you are undertaking some sort of analysis that requires absolute gene (or > feature) expression levels as responses, then you should not be using voom or > limma or edgeR. limma and edgeR do not estimate absolute expression. > > If on the other hand, you want to detect differentially expressed genes (or > features), which is what voom does, then there is no need to correct for gene > length. The comments of Section 2.3 of the edgeR User's Guide and especially > 2.3.2 "Adjustments for gene length, GC content, mappability and so on" are > also relevant for voom. There is no need to correct for any characteristic > of a gene that remains unchanged across samples. > > A good case has been made that GC content can have differential influence > across samples, but that doesn't apply to gene length. > > voom does not work on RPKM or FPKM, or on the output from cufflinks. voom > estimates a mean-variance relationship, and the variance is a function of > count size, not of expression level. > > Yes, you need limma to use the output from voom, because other softwares do > not generally have the ability to use quantitative weights. If you ignore > the weights, then the output from voom is just logCPM, and you hardly need > voom to compute that. > > Best wishes > Gordon > > --------------------------------------------- > Professor Gordon K Smyth, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > smyth at wehi.edu.au > http://www.wehi.edu.au > http://www.statsci.org/smyth > > On Thu, 24 May 2012, Tim Triche, Jr. wrote: > >> Hi Dr. Smyth and Dr. Law, >> >> I have been reading the documentation for limma::voom() and trying to >> understand why there seems to be no correction for the size of the feature >> in the model: >> >> In an experiment, a count value is observed for each tag in each sample. A >> tag-wise mean-variance trend is computed using lowess. The tag-wise mean is >> the mean log2 count with an offset of 0.5, across samples for a given tag. >> The tag-wise variance is the quarter-root-variance of normalized log2 >> counts per million values with an offset of 0.5, across samples for a given >> tag. Tags with zero counts across all samples are not included in the >> lowess fit. Optional normalization is performed using >> normalizeBetweenArrays. Using fitted values of log2 counts from a linear >> model fit by lmFit, variances from the mean-variance trend were >> interpolated for each observation. This was carried out by approxfun. >> Inverse variance weights can be used to correct for mean-variance trend in >> the count data. >> >> >> I don't see a reference to the feature size in all of this. (?) Am I >> missing something? Probably something major (like, say, the relationship >> of GC content or read length to variance)... >> Is the idea that features with similar sequence properties/size and >> abundance will have their mean-variance relationship modeled appropriately >> and weights generated empirically? >> >> For comparison, what I have been doing (in lieu of knowing any better) is >> as follows: align with Rsubread, run subjunc and splicegrapher, and count >> against exon/gene/feature models: >> >> alignedToRPKM <- function(readcounts) { # the output of featureCounts() >> millionsMapped <- colSums(readcounts$counts)/1000000 >> if('ExonLength' %in% names(readcounts$annotation)) { >> geneLengthsInKB <- readcounts$annotation$ExonLength/1000 >> } else { >> geneLengthsInKB <- readcounts$annotation$GeneLength/1000 # works fine >> for ncRNA and splice graph edges >> } >> >> # example usage: readcountsRPKM <- alignedToRPKM(readcounts) >> return( sweep(readcountscounts, 2, millionsMapped, '/') / >> geneLengthsInKB ) >> } >> >> (When I did pretty much the same thing with Bowtie/TopHat/CuffLinks I got >> about the same results but slower, so I stuck with Rsubread. And >> featureCounts() is really handy.) >> >> So, given the feature sizes in readcountsannotation I can at least put >> things on something like a similar scale. Most of my modeling currently is >> focused on penalized local regressions and thus a performant (but accurate) >> measure that can be used for linear modeling on a large scale is desirable. >> Is the output of voom() what I want? Does one need to use limma/lmFit() >> to make use of voom()'s output? >> >> Last but not least, should I use something like Yuval Benjamini's GCcorrect >> package (http://www.stat.berkeley.edu/~yuvalb/YuvalWeb/Software.html) >> before/during/instead of voom()? >> And if the expression of a feature or several nearby features is often the >> response, does it matter a great deal what I use? >> >> Thanks for any input you might have time to provide. I have to assume that >> the minds at WEHI periodically scheme together how best to go about these >> things... >> >> >> -- >> *A model is a lie that helps you see the truth.* >> * >> * >> Howard >> Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> >> > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}} ADD REPLYlink written 6.3 years ago by Gordon Smyth35k Hi Dr. Smyth, Thank you for the helpful clarifications. It seems like RPM/CPM is useful for tasks such as plotting expression on a reasonably similar scale; taking logs and adjusting for mean-variance relationships can better satisfy expected mean-variance relationships for linear modeling and thus should dovetail better with the toolset in limma, offering a less computationally demanding alternative for exploratory analysis. On the other hand, if the primary goal is to detect differences, especially in rare or highly variably expressed features, an edgeR GLM with empirical Bayes estimates of the feature-wise dispersion is the most appropriate tool to maximize statistical power. Is this understanding reasonable? It would seem that, whether I use limma or rig up some sort of weighting for (e.g.) sparsenet, the output from voom() is most likely to be useful for my particular (EDA) needs at the moment. One last question (for anyone who wishes to answer, really) -- if gene/transcript length is not associated with the mean/variance relationship for read counts, why was it asserted in the original Mortazavi paper that: The sensitivity of RNA-Seq will be a function of both molar concentration and transcript length [nb: no citation given, presumably this is felt to be self-evident?]. We therefore quantified transcript levels in reads per kilobase of exon model per million mapped reads. It seems as if this is a red herring? GC% could clearly affect the degree to which a transcript "absorbs" read depth, but I continue to have difficulty understanding why the length of exon model is relevant in this context. Thank you so much for your time and effort in explaining these rather subtle issues. Tim Triche, Jr. USC Biostatistics On Wed, May 30, 2012 at 1:23 AM, Gordon K Smyth <smyth@wehi.edu.au> wrote: > Hi Tim, > > On thinking about this a little more, voom() could easily output logRPKM > rather than logCPM, and the same weights would apply. Indeed you could > convert the voom() output to logRPKM yourself and, in principle, undertake > analyses using the values if you make use of the corresponding voom weights. > > However voom() does need to get raw counts as input, just like edgeR, > rather than RPKM. voom() can cope with a re-scaling of the counts, but not > with a transformation that is non-monotonic in the counts. RPKM is an > unhelpful measure from a statistical point of view, because it "forgets" > how large the count was in the first plae. > > The aims of Yuval's package are complementary to edgeR or voom, certainly > neither replaces the other. These results may inform how we do the > normalization step, but we have not yet reached the stage of doing this > routinely. > > Best wishes > Gordon > > > On Fri, 25 May 2012, Gordon K Smyth wrote: > > Dear Tim, >> >> I don't follow what you are trying to do scientifically, and this makes >> all the difference when deciding what are the appropriate tools to use. >> >> If you are undertaking some sort of analysis that requires absolute gene >> (or feature) expression levels as responses, then you should not be using >> voom or limma or edgeR. limma and edgeR do not estimate absolute >> expression. >> >> If on the other hand, you want to detect differentially expressed genes >> (or features), which is what voom does, then there is no need to correct >> for gene length. The comments of Section 2.3 of the edgeR User's Guide and >> especially 2.3.2 "Adjustments for gene length, GC content, mappability and >> so on" are also relevant for voom. There is no need to correct for any >> characteristic of a gene that remains unchanged across samples. >> >> A good case has been made that GC content can have differential influence >> across samples, but that doesn't apply to gene length. >> >> voom does not work on RPKM or FPKM, or on the output from cufflinks. voom >> estimates a mean-variance relationship, and the variance is a function of >> count size, not of expression level. >> >> Yes, you need limma to use the output from voom, because other softwares >> do not generally have the ability to use quantitative weights. If you >> ignore the weights, then the output from voom is just logCPM, and you >> hardly need voom to compute that. >> >> Best wishes >> Gordon >> >> ------------------------------**--------------- >> Professor Gordon K Smyth, >> Bioinformatics Division, >> Walter and Eliza Hall Institute of Medical Research, >> 1G Royal Parade, Parkville, Vic 3052, Australia. >> smyth@wehi.edu.au >> http://www.wehi.edu.au >> http://www.statsci.org/smyth >> >> On Thu, 24 May 2012, Tim Triche, Jr. wrote: >> >> Hi Dr. Smyth and Dr. Law, >>> >>> I have been reading the documentation for limma::voom() and trying to >>> understand why there seems to be no correction for the size of the >>> feature >>> in the model: >>> >>> In an experiment, a count value is observed for each tag in each sample. >>> A >>> tag-wise mean-variance trend is computed using lowess. The tag- wise mean >>> is >>> the mean log2 count with an offset of 0.5, across samples for a given >>> tag. >>> The tag-wise variance is the quarter-root-variance of normalized log2 >>> counts per million values with an offset of 0.5, across samples for a >>> given >>> tag. Tags with zero counts across all samples are not included in the >>> lowess fit. Optional normalization is performed using >>> normalizeBetweenArrays. Using fitted values of log2 counts from a linear >>> model fit by lmFit, variances from the mean-variance trend were >>> interpolated for each observation. This was carried out by approxfun. >>> Inverse variance weights can be used to correct for mean-variance trend >>> in >>> the count data. >>> >>> >>> I don't see a reference to the feature size in all of this. (?) Am I >>> missing something? Probably something major (like, say, the relationship >>> of GC content or read length to variance)... >>> Is the idea that features with similar sequence properties/size and >>> abundance will have their mean-variance relationship modeled >>> appropriately >>> and weights generated empirically? >>> >>> For comparison, what I have been doing (in lieu of knowing any better) is >>> as follows: align with Rsubread, run subjunc and splicegrapher, and count >>> against exon/gene/feature models: >>> >>> alignedToRPKM <- function(readcounts) { # the output of featureCounts() >>> millionsMapped <- colSums(readcountscounts)/**1000000 >>> if('ExonLength' %in% names(readcounts$annotation)) { >>> geneLengthsInKB <- readcounts$annotation$**ExonLength/1000 >>> } else { >>> geneLengthsInKB <- readcounts$annotation$**GeneLength/1000 # works >>> fine >>> for ncRNA and splice graph edges >>> } >>> >>> # example usage: readcounts$RPKM <- alignedToRPKM(readcounts) >>> return( sweep(readcounts$counts, 2, millionsMapped, '/') / >>> geneLengthsInKB ) >>> } >>> >>> (When I did pretty much the same thing with Bowtie/TopHat/CuffLinks I got >>> about the same results but slower, so I stuck with Rsubread. And >>> featureCounts() is really handy.) >>> >>> So, given the feature sizes in readcounts$annotation I can at least put >>> things on something like a similar scale. Most of my modeling currently >>> is >>> focused on penalized local regressions and thus a performant (but >>> accurate) >>> measure that can be used for linear modeling on a large scale is >>> desirable. >>> Is the output of voom() what I want? Does one need to use limma/lmFit() >>> to make use of voom()'s output? >>> >>> Last but not least, should I use something like Yuval Benjamini's >>> GCcorrect >>> package (http://www.stat.berkeley.edu/**~yuvalb/YuvalWeb/Software. html<http: www.stat.berkeley.edu="" ~yuvalb="" yuvalweb="" software.html=""> >>> **) >>> before/during/instead of voom()? >>> And if the expression of a feature or several nearby features is often >>> the >>> response, does it matter a great deal what I use? >>> >>> Thanks for any input you might have time to provide. I have to assume >>> that >>> the minds at WEHI periodically scheme together how best to go about these >>> things... >>> >>> >>> -- >>> *A model is a lie that helps you see the truth.* >>> * >>> * >>> Howard Skipper<http: cancerres.**aacrjournals.org="" content="" 31="" 9="" **="">>> 1173.full.pdf<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.="" full.pdf=""> >>> > >>> >>> >> > ______________________________**______________________________**____ ______ > The information in this email is confidential and inte...{{dropped:18}}
On Wed, 30 May 2012, Tim Triche, Jr. wrote: > Hi Dr. Smyth, > > Thank you for the helpful clarifications. It seems like RPM/CPM is > useful for tasks such as plotting expression on a reasonably similar scale; > taking logs and adjusting for mean-variance relationships can better > satisfy expected mean-variance relationships for linear modeling and thus > should dovetail better with the toolset in limma, offering a less > computationally demanding alternative for exploratory analysis. On the > other hand, if the primary goal is to detect differences, especially in > rare or highly variably expressed features, an edgeR GLM with empirical > Bayes estimates of the feature-wise dispersion is the most appropriate tool > to maximize statistical power. > > Is this understanding reasonable? It would seem that, whether I use > limma or rig up some sort of weighting for (e.g.) sparsenet, the output > from voom() is most likely to be useful for my particular (EDA) needs at > the moment. It's certainly reasonable, although we're all still learning. > One last question (for anyone who wishes to answer, really) -- if > gene/transcript length is not associated with the mean/variance > relationship for read counts, why was it asserted in the original Mortazavi > paper that: > > The sensitivity of RNA-Seq will be a function of both molar concentration > and transcript length [nb: no citation given, presumably this is felt to be > self-evident?]. We therefore quantified transcript levels in reads per > kilobase of exon model per million mapped reads. I interpreted "sensitivity" here to mean expected number of reads (rather than standard deviation), and expected count size would seem to be proportional to molar concentration by transcript length by library size. So RPKM should be proportional to molar concentration, to a first approximation. Best wishes Gordon > It seems as if this is a red herring? GC% could clearly affect the degree > to which a transcript "absorbs" read depth, but I continue to have > difficulty understanding why the length of exon model is relevant in this > context. > > Thank you so much for your time and effort in explaining these rather > subtle issues. > > Tim Triche, Jr. > USC Biostatistics > > On Wed, May 30, 2012 at 1:23 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > >> Hi Tim, >> >> On thinking about this a little more, voom() could easily output logRPKM >> rather than logCPM, and the same weights would apply. Indeed you could >> convert the voom() output to logRPKM yourself and, in principle, undertake >> analyses using the values if you make use of the corresponding voom weights. >> >> However voom() does need to get raw counts as input, just like edgeR, >> rather than RPKM. voom() can cope with a re-scaling of the counts, but not >> with a transformation that is non-monotonic in the counts. RPKM is an >> unhelpful measure from a statistical point of view, because it "forgets" >> how large the count was in the first plae. >> >> The aims of Yuval's package are complementary to edgeR or voom, certainly >> neither replaces the other. These results may inform how we do the >> normalization step, but we have not yet reached the stage of doing this >> routinely. >> >> Best wishes >> Gordon >> >> >> On Fri, 25 May 2012, Gordon K Smyth wrote: >> >> Dear Tim, >>> >>> I don't follow what you are trying to do scientifically, and this makes >>> all the difference when deciding what are the appropriate tools to use. >>> >>> If you are undertaking some sort of analysis that requires absolute gene >>> (or feature) expression levels as responses, then you should not be using >>> voom or limma or edgeR. limma and edgeR do not estimate absolute >>> expression. >>> >>> If on the other hand, you want to detect differentially expressed genes >>> (or features), which is what voom does, then there is no need to correct >>> for gene length. The comments of Section 2.3 of the edgeR User's Guide and >>> especially 2.3.2 "Adjustments for gene length, GC content, mappability and >>> so on" are also relevant for voom. There is no need to correct for any >>> characteristic of a gene that remains unchanged across samples. >>> >>> A good case has been made that GC content can have differential influence >>> across samples, but that doesn't apply to gene length. >>> >>> voom does not work on RPKM or FPKM, or on the output from cufflinks. voom >>> estimates a mean-variance relationship, and the variance is a function of >>> count size, not of expression level. >>> >>> Yes, you need limma to use the output from voom, because other softwares >>> do not generally have the ability to use quantitative weights. If you >>> ignore the weights, then the output from voom is just logCPM, and you >>> hardly need voom to compute that. >>> >>> Best wishes >>> Gordon >>> >>> ------------------------------**--------------- >>> Professor Gordon K Smyth, >>> Bioinformatics Division, >>> Walter and Eliza Hall Institute of Medical Research, >>> 1G Royal Parade, Parkville, Vic 3052, Australia. >>> smyth at wehi.edu.au >>> http://www.wehi.edu.au >>> http://www.statsci.org/smyth >>> >>> On Thu, 24 May 2012, Tim Triche, Jr. wrote: >>> >>> Hi Dr. Smyth and Dr. Law, >>>> >>>> I have been reading the documentation for limma::voom() and trying to >>>> understand why there seems to be no correction for the size of the >>>> feature >>>> in the model: >>>> >>>> In an experiment, a count value is observed for each tag in each sample. >>>> A >>>> tag-wise mean-variance trend is computed using lowess. The tag- wise mean >>>> is >>>> the mean log2 count with an offset of 0.5, across samples for a given >>>> tag. >>>> The tag-wise variance is the quarter-root-variance of normalized log2 >>>> counts per million values with an offset of 0.5, across samples for a >>>> given >>>> tag. Tags with zero counts across all samples are not included in the >>>> lowess fit. Optional normalization is performed using >>>> normalizeBetweenArrays. Using fitted values of log2 counts from a linear >>>> model fit by lmFit, variances from the mean-variance trend were >>>> interpolated for each observation. This was carried out by approxfun. >>>> Inverse variance weights can be used to correct for mean-variance trend >>>> in >>>> the count data. >>>> >>>> >>>> I don't see a reference to the feature size in all of this. (?) Am I >>>> missing something? Probably something major (like, say, the relationship >>>> of GC content or read length to variance)... >>>> Is the idea that features with similar sequence properties/size and >>>> abundance will have their mean-variance relationship modeled >>>> appropriately >>>> and weights generated empirically? >>>> >>>> For comparison, what I have been doing (in lieu of knowing any better) is >>>> as follows: align with Rsubread, run subjunc and splicegrapher, and count >>>> against exon/gene/feature models: >>>> >>>> alignedToRPKM <- function(readcounts) { # the output of featureCounts() >>>> millionsMapped <- colSums(readcounts$counts)/**1000000 >>>> if('ExonLength' %in% names(readcounts$annotation)) { >>>> geneLengthsInKB <- readcounts$annotation$**ExonLength/1000 >>>> } else { >>>> geneLengthsInKB <- readcounts$annotation$**GeneLength/1000 # works >>>> fine >>>> for ncRNA and splice graph edges >>>> } >>>> >>>> # example usage: readcountsRPKM <- alignedToRPKM(readcounts) >>>> return( sweep(readcountscounts, 2, millionsMapped, '/') / >>>> geneLengthsInKB ) >>>> } >>>> >>>> (When I did pretty much the same thing with Bowtie/TopHat/CuffLinks I got >>>> about the same results but slower, so I stuck with Rsubread. And >>>> featureCounts() is really handy.) >>>> >>>> So, given the feature sizes in readcounts\$annotation I can at least put >>>> things on something like a similar scale. Most of my modeling currently >>>> is >>>> focused on penalized local regressions and thus a performant (but >>>> accurate) >>>> measure that can be used for linear modeling on a large scale is >>>> desirable. >>>> Is the output of voom() what I want? Does one need to use limma/lmFit() >>>> to make use of voom()'s output? >>>> >>>> Last but not least, should I use something like Yuval Benjamini's >>>> GCcorrect >>>> package (http://www.stat.berkeley.edu/**~yuvalb/YuvalWeb/Software .html<http: www.stat.berkeley.edu="" ~yuvalb="" yuvalweb="" software.html=""> >>>> **) >>>> before/during/instead of voom()? >>>> And if the expression of a feature or several nearby features is often >>>> the >>>> response, does it matter a great deal what I use? >>>> >>>> Thanks for any input you might have time to provide. I have to assume >>>> that >>>> the minds at WEHI periodically scheme together how best to go about these >>>> things... >>>> >>>> >>>> -- >>>> *A model is a lie that helps you see the truth.* >>>> * >>>> * >>>> Howard Skipper<http: cancerres.**aacrjournals.org="" content="" 31="" 9="" **="">>>> 1173.full.pdf<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173="" .full.pdf=""> >>>>> >>>> >>>> >>> >> ______________________________**______________________________**___ _______ >> The information in this email is confidential and intended solely for the >> addressee. >> You must not disclose, forward, print or use it without the permission of >> the sender. >> ______________________________**______________________________**___ _______ >> > > > > -- > *A model is a lie that helps you see the truth.* > * > * > Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}