Hi Dr. Smyth and Dr. Law,
I have been reading the documentation for limma::voom() and trying to understand why there seems to be no correction for the size of the feature in the model:
In an experiment, a count value is observed for each tag in each sample. A tag-wise mean-variance trend is computed using lowess. The tag-wise mean is the mean log2 count, with an offset of 0.5, across samples for a given tag. The tag-wise variance is the quarter-root variance of normalized log2 counts-per-million values, with an offset of 0.5, across samples for a given tag. Tags with zero counts across all samples are not included in the lowess fit. Optional normalization is performed using normalizeBetweenArrays. Using fitted values of log2 counts from a linear model fit by lmFit, variances from the mean-variance trend are interpolated for each observation; this is carried out by approxfun. Inverse variance weights can then be used to correct for the mean-variance trend in the count data.
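(For context, here is a sketch of how I understand the standard limma-voom pipeline to be wired together; the simulated counts and design matrix are my own toy assumptions, not anything from the limma docs.)

```r
library(limma)

## Toy data standing in for a real experiment: 1000 features x 6 samples,
## two groups of three. These are placeholder values for illustration only.
set.seed(1)
counts <- matrix(rpois(6000, lambda = 50), nrow = 1000,
                 dimnames = list(paste0("tag", 1:1000), paste0("s", 1:6)))
group  <- factor(rep(c("A", "B"), each = 3))
design <- model.matrix(~ group)

v   <- voom(counts, design)       # log2-CPM values plus observation-level weights
fit <- lmFit(v, design)           # weighted linear model fit, one per feature
fit <- eBayes(fit)                # moderated t-statistics
top <- topTable(fit, coef = 2)    # top-ranked features for the group effect
```

So, as I understand it, voom()'s output (an EList with $E and $weights) is meant to be handed straight to lmFit(), which is what propagates the inverse-variance weights into the linear model.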
I don't see a reference to the feature size in any of this. Am I missing something? Probably something major (like, say, the relationship of GC content or read length to variance)... Is the idea that features with similar sequence properties/size and abundance will have their mean-variance relationship modeled appropriately, with weights generated empirically?
For comparison, what I have been doing (in lieu of knowing any better) is as follows: align with Rsubread, run subjunc and SpliceGrapher, and count against exon/gene/feature models:
alignedToRPKM <- function(readcounts) { # readcounts is the output of featureCounts()
  millionsMapped <- colSums(readcounts$counts) / 1000000
  if ('ExonLength' %in% names(readcounts$annotation)) {
    geneLengthsInKB <- readcounts$annotation$ExonLength / 1000
  } else {
    # works fine for ncRNA and splice graph edges
    geneLengthsInKB <- readcounts$annotation$GeneLength / 1000
  }
  # example usage: readcounts$RPKM <- alignedToRPKM(readcounts)
  return(sweep(readcounts$counts, 2, millionsMapped, '/') / geneLengthsInKB)
}
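For concreteness, here is a toy check of that function against a hand-computed value. The mock readcounts object below only mimics the shape of featureCounts() output (a $counts matrix plus an $annotation with a GeneLength field); it is invented data, and the function is repeated so the snippet runs standalone.

```r
# Function as above, repeated so this snippet is self-contained.
alignedToRPKM <- function(readcounts) {
  millionsMapped <- colSums(readcounts$counts) / 1000000
  if ('ExonLength' %in% names(readcounts$annotation)) {
    geneLengthsInKB <- readcounts$annotation$ExonLength / 1000
  } else {
    geneLengthsInKB <- readcounts$annotation$GeneLength / 1000
  }
  return(sweep(readcounts$counts, 2, millionsMapped, '/') / geneLengthsInKB)
}

# Mock object mimicking the featureCounts() output structure (toy numbers).
readcounts <- list(
  counts = matrix(c(100, 200,
                    300, 400),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("geneA", "geneB"), c("s1", "s2"))),
  annotation = list(GeneLength = c(2000, 500))
)

rpkm <- alignedToRPKM(readcounts)
# geneA in s1: 100 reads / (0.0004 million mapped) / 2 kb = 125000
```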
(When I did pretty much the same thing with Bowtie/TopHat/Cufflinks I got about the same results but slower, so I stuck with Rsubread. And featureCounts() is really handy.)
So, given the feature sizes in readcounts$annotation, I can at least put things on something like a similar scale. Most of my modeling currently is focused on penalized local regressions, so a performant (but accurate) measure that can be used for linear modeling on a large scale is desirable.
Is the output of voom() what I want? Does one need to use limma::lmFit() to make use of voom()'s output?
Last but not least, should I use something like Yuval Benjamini's GCcorrect package (http://www.stat.berkeley.edu/~yuvalb/YuvalWeb/Software.html) before/during/instead of voom()?
And if the expression of a feature or several nearby features is often the response, does it matter a great deal what I use?
Thanks for any input you might have time to provide. I have to assume that the minds at WEHI periodically scheme together how best to go about these things...
--
*A model is a lie that helps you see the truth.*
-- Howard Skipper (http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf)