voom() vs. RPKM/FPKM or otherwise normalized counts, and GC correction, when fitting models to a small number of responses (per-feature counts)

0

Entering edit mode

On Wed, May 30, 2012 at 7:11 PM, Tim Triche, Jr. <tim.triche at="" gmail.com=""> wrote:
> Hi Dr. Smyth,
>
> ?Thank you for the helpful clarifications. ?It seems like RPM/CPM is
> useful for tasks such as plotting expression on a reasonably similar
scale;
> taking logs and adjusting for mean-variance relationships can better
> satisfy expected mean-variance relationships for linear modeling and
thus
> should dovetail better with the toolset in limma, offering a less
> computationally demanding alternative for exploratory analysis. ?On
the
> other hand, if the primary goal is to ?detect differences,
especially in
> rare or highly variably expressed features, an edgeR GLM with
empirical
> Bayes estimates of the feature-wise dispersion is the most
appropriate tool
> to maximize statistical power.
>
> ?Is this understanding reasonable? ?It would seem that, whether I
use
> limma or rig up some sort of weighting for (e.g.) sparsenet, the
output
> from voom() is most likely to be useful for my particular (EDA)
needs at
> the moment.
>
> ?One last question (for anyone who wishes to answer, really) -- if
> gene/transcript length is not associated with the mean/variance
> relationship for read counts, why was it asserted in the original
Mortazavi
> paper that:
>
> The sensitivity of RNA-Seq will be a function of both molar
concentration
> and transcript length [nb: no citation given, presumably this is
felt to be
> self-evident?]. We therefore quantified transcript levels in reads
per
> kilobase of exon model per million mapped reads.
>
> It seems as if this is a red herring? ?GC% could clearly affect the
degree
> to which a transcript "absorbs" read depth, but I continue to have
> difficulty understanding why the length of exon model is relevant in
this
> context.
While the Mortazavi paper is a very good paper on RNA-seq, this
section is not their best.
Because RNA is fragmented, there will be a relationship between read
counts (number of reads mapped to a gene model) and gene length. This
is indisputable. The question is whether this is something we want
to include in our model, beyond the fact that longer genes have more
counts and therefore a bigger mean (and since higher mean leads to
lower variance, this is probably what Mortazavi meant here),
RPKM tries to make an expression measure that is comparable between
genes inside a single sample. This is for example necessary for
making the "titration" curve in the Mortazavi paper showing a nice
relationship between actual concentration and RPKM, since each of the
points on the curve is a different gene. Note that that plot has
nothing to do with differential expression, but rather absolute
quantification.
In a typical gene expression analysis we are
(1) not interested in comparing genes, only samples within a fixed
gene.
(2) interested in relative changes, not absolute meaurements
This is really not something Mortazavi discusses.
EdgeR and DEseq tries to get at differential expression. And they
essentially use the fact that there is a mean-variance relationship to
improve their modeling. Now, it is clear (I would argue) that mean
does not in any way perfectly predict variability, so it entirely
possible that a better method may come along and improve on what we
have. But such a method would first have to prove itself.
Now, as I said above, gene length affects read counts through
fragmentation. In case fragmentation varies between samples, there
may be a problem. Same with GC content. We recently showed [1] that
GC content, and to a lesser extent gene length, can have a sample
specific effect. If that is the case, you need to account for that.
But that is because the effect is sample specific.
1. Removing technical variability in RNA-seq data using conditional
quantile normalization. Biostatistics 13, 204?216 (2012).
Kasper
>
> Thank you so much for your time and effort in explaining these
rather
> subtle issues.
>
> Tim Triche, Jr.
> USC Biostatistics
>
> On Wed, May 30, 2012 at 1:23 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote:
>
>> Hi Tim,
>>
>> On thinking about this a little more, voom() could easily output
logRPKM
>> rather than logCPM, and the same weights would apply. ?Indeed you
could
>> convert the voom() output to logRPKM yourself and, in principle,
undertake
>> analyses using the values if you make use of the corresponding voom
weights.
>>
>> However voom() does need to get raw counts as input, just like
edgeR,
>> rather than RPKM. ?voom() can cope with a re-scaling of the counts,
but not
>> with a transformation that is non-monotonic in the counts. ?RPKM is
an
>> unhelpful measure from a statistical point of view, because it
"forgets"
>> how large the count was in the first plae.
>>
>> The aims of Yuval's package are complementary to edgeR or voom,
certainly
>> neither replaces the other. ?These results may inform how we do the
>> normalization step, but we have not yet reached the stage of doing
this
>> routinely.
>>
>> Best wishes
>> Gordon
>>
>>
>> On Fri, 25 May 2012, Gordon K Smyth wrote:
>>
>> ?Dear Tim,
>>>
>>> I don't follow what you are trying to do scientifically, and this
makes
>>> all the difference when deciding what are the appropriate tools to
use.
>>>
>>> If you are undertaking some sort of analysis that requires
absolute gene
>>> (or feature) expression levels as responses, then you should not
be using
>>> voom or limma or edgeR. ?limma and edgeR do not estimate absolute
>>> expression.
>>>
>>> If on the other hand, you want to detect differentially expressed
genes
>>> (or features), which is what voom does, then there is no need to
correct
>>> for gene length. ?The comments of Section 2.3 of the edgeR User's
Guide and
>>> especially 2.3.2 "Adjustments for gene length, GC content,
mappability and
>>> so on" are also relevant for voom. ?There is no need to correct
for any
>>> characteristic of a gene that remains unchanged across samples.
>>>
>>> A good case has been made that GC content can have differential
influence
>>> across samples, but that doesn't apply to gene length.
>>>
>>> voom does not work on RPKM or FPKM, or on the output from
cufflinks. voom
>>> estimates a mean-variance relationship, and the variance is a
function of
>>> count size, not of expression level.
>>>
>>> Yes, you need limma to use the output from voom, because other
softwares
>>> do not generally have the ability to use quantitative weights. ?If
you
>>> ignore the weights, then the output from voom is just logCPM, and
you
>>> hardly need voom to compute that.
>>>
>>> Best wishes
>>> Gordon
>>>
>>> ------------------------------**---------------
>>> Professor Gordon K Smyth,
>>> Bioinformatics Division,
>>> Walter and Eliza Hall Institute of Medical Research,
>>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>>> smyth at wehi.edu.au
>>> http://www.wehi.edu.au
>>> http://www.statsci.org/smyth
>>>
>>> On Thu, 24 May 2012, Tim Triche, Jr. wrote:
>>>
>>> ?Hi Dr. Smyth and Dr. Law,
>>>>
>>>> I have been reading the documentation for limma::voom() and
trying to
>>>> understand why there seems to be no correction for the size of
the
>>>> feature
>>>> in the model:
>>>>
>>>> In an experiment, a count value is observed for each tag in each
sample.
>>>> A
>>>> tag-wise mean-variance trend is computed using lowess. The tag-
wise mean
>>>> is
>>>> the mean log2 count with an offset of 0.5, across samples for a
given
>>>> tag.
>>>> The tag-wise variance is the quarter-root-variance of normalized
log2
>>>> counts per million values with an offset of 0.5, across samples
for a
>>>> given
>>>> tag. Tags with zero counts across all samples are not included in
the
>>>> lowess fit. Optional normalization is performed using
>>>> normalizeBetweenArrays. Using fitted values of log2 counts from a
linear
>>>> model fit by lmFit, variances from the mean-variance trend were
>>>> interpolated for each observation. This was carried out by
approxfun.
>>>> Inverse variance weights can be used to correct for mean-variance
trend
>>>> in
>>>> the count data.
>>>>
>>>>
>>>> I don't see a reference to the feature size in all of this. (?)
?Am I
>>>> missing something? ?Probably something major (like, say, the
relationship
>>>> of GC content or read length to variance)...
>>>> Is the idea that features with similar sequence properties/size
and
>>>> abundance will have their mean-variance relationship modeled
>>>> appropriately
>>>> and weights generated empirically?
>>>>
>>>> For comparison, what I have been doing (in lieu of knowing any
better) is
>>>> as follows: align with Rsubread, run subjunc and splicegrapher,
and count
>>>> against exon/gene/feature models:
>>>>
>>>> alignedToRPKM <- function(readcounts) { # the output of
featureCounts()
>>>> ?millionsMapped <- colSums(readcounts$counts)/**1000000
>>>> ?if('ExonLength' %in% names(readcounts$annotation)) {
>>>> ? geneLengthsInKB <- readcounts$annotation$**ExonLength/1000
>>>> ?} else {
>>>> ? geneLengthsInKB <- readcounts$annotation$**GeneLength/1000 #
works
>>>> fine
>>>> for ncRNA and splice graph edges
>>>> ?}
>>>>
>>>> ?# example usage: readcounts$RPKM <- alignedToRPKM(readcounts)
>>>> ?return( sweep(readcounts$counts, 2, millionsMapped, '/') /
>>>> geneLengthsInKB )
>>>> }
>>>>
>>>> (When I did pretty much the same thing with
Bowtie/TopHat/CuffLinks I got
>>>> about the same results but slower, so I stuck with Rsubread. ?And
>>>> featureCounts() is really handy.)
>>>>
>>>> So, given the feature sizes in readcounts$annotation I can at
least put
>>>> things on something like a similar scale. ?Most of my modeling
currently
>>>> is
>>>> focused on penalized local regressions and thus a performant (but
>>>> accurate)
>>>> measure that can be used for linear modeling on a large scale is
>>>> desirable.
>>>> Is the output of voom() what I want? ?Does one need to use
limma/lmFit()
>>>> to make use of voom()'s output?
>>>>
>>>> Last but not least, should I use something like Yuval Benjamini's
>>>> GCcorrect
>>>> package (http://www.stat.berkeley.edu/**~yuvalb/YuvalWeb/Software
.html<http: www.stat.berkeley.edu="" ~yuvalb="" yuvalweb="" software.html="">
>>>> **)
>>>> before/during/instead of voom()?
>>>> And if the expression of a feature or several nearby features is
often
>>>> the
>>>> response, does it matter a great deal what I use?
>>>>
>>>> Thanks for any input you might have time to provide. ?I have to
assume
>>>> that
>>>> the minds at WEHI periodically scheme together how best to go
about these
>>>> things...
>>>>
>>>>
>>>> --
>>>> *A model is a lie that helps you see the truth.*
>>>> *
>>>> *
>>>> Howard
Skipper<http: cancerres.**aacrjournals.org="" content="" 31="" 9="" **="">>>> 1173.full.pdf<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173="" .full.pdf="">
>>>> >
>>>>
>>>>
>>>
>> ______________________________**______________________________**___
_______
>> The information in this email is confidential and
inte...{{dropped:18}}
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
http://news.gmane.org/gmane.science.biology.informatics.conductor

0

Entering edit mode

Hi Kasper,

I look forward to benefiting from your work on sample-specific GC and gene length effects. As you know, we cited the article you mention in our recent edgeR glm paper (McCarthy et al NAR 2012), and edgeR works nicely with the normalization matrix from your cqn package. When I get a chance, I'll rewrite the section on normalization in the edgeR User's Guide to mention these possibilities. Until now, I hadn't realized that you were finding sample-specific effects for length as well for GC content.

voom() isn't currently setup to accept the normalization matrix from cqn, but it will be down the track.

I don't understand your remarks on mean-variance relationships though. For one thing, the normalization matrix matrix from cqn affects the assumed link between expression levels and count size in edgeR. It doesn't have any effect on the form on the mean-variance relationship.

Secondly, edgeR has never assumed that the mean perfectly predicts variability. edgeR uses the mean-variance relationship as a guide towards which to moderate the genewise variances, but the fitted variances remain gene-specific and are not a monotonic function of the mean. DEseq (as described in the Genome Biol paper) did assume that the variance is a perfect function of the mean, but edgeR has never had this limitation.

Best wishes

Gordon

Similar Posts

Loading Similar Posts

Traffic: 308 users visited in the last hour

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.