I recently used the edgeR package to analyze an RNA-Seq dataset with 2
genotypes and 3 biological replicates each.
After running exactTest, the logFC and logCPM are provided for
each gene. I am a bit confused about how exactly these values are
calculated.
1) For logCPM, I assume that this is the average expression over all
samples. It is clearly not simply the averaged [counts/effective
library size for each sample].
I understand that, generally speaking, the original counts (or the CPMs
instead?) are moderated to avoid infinite values when taking logs of
samples/genes with zero counts/CPM, but I can't quite figure out exactly
how this is done.
a) Is the same small value added to each gene for each sample or is
the added value different for different genes? How is prior.count
determined?
b) Are only genes that have a "0" in one sample moderated, or are all
genes moderated by prior.count?
c) Is there a way to see the moderated CPM for each gene and sample
and not just the log (moderated CPM)?
2) How is the logFC calculated? Is it based on moderated CPMs for each
lane? Does it take the ratio of the average moderated CPM for each
group?
Thank you!
-- output of sessionInfo():
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] edgeR_3.2.4 limma_3.16.7
--
Sent via the guest posting facility at bioconductor.org.
Dear Karen,
> Date: Mon, 2 Dec 2013 10:55:38 -0800 (PST)
> From: "Karen [guest]" <guest at bioconductor.org>
> To: bioconductor at r-project.org, karenmenuz at hotmail.com
> Subject: [BioC] edgeR prior.count
>
> I recently used the edgeR package to analyze an RNA-Seq dataset with 2
> genotypes and 3 biological replicates each.
Please update to the current Bioconductor release (edgeR 3.4.1).
> After running exactTest, the logFC and logCPM are provided for each
> gene. I am a bit confused about how exactly these values are calculated.
It may be that you are expecting things to be somewhat simpler than they
actually are. edgeR uses generalized linear models to compute
statistically efficient estimates of logCPM and logFC values. These
involve an iterative computation for each gene that takes into account
the dispersion value, library sizes and so on. It's not just a matter of
computing moderated counts and then taking averages or differences.
> 1) For logCPM, I assume that this is the average expression over all
> samples. It is clearly not simply the averaged [counts/effective library
> size for each sample].
>
> I understand that, generally speaking, the original counts (or the CPMs
> instead?) are moderated to avoid infinite values when taking logs of
> samples/genes with zero counts/CPM, but I can't quite figure out exactly
> how this is done.
See ?aveLogCPM
> a) Is the same small value added to each gene for each sample or is the
> added value different for different genes? How is prior.count
> determined?
See ?predFC
As for determining the prior.count, you input the prior count yourself
when you run exactTest, or else the default value is used. The
prior.count has no effect on the p-values. It only affects the amount of
moderation applied to the reported fold changes.
> b) Are only genes that have a "0" in one sample moderated, or are all
> genes moderated by prior.count?
See ?predFC
> c) Is there a way to see the moderated CPM for each gene and sample and
> not just the log (moderated CPM)?
See ?cpm
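For readers who want to see the arithmetic behind a moderated log-CPM,
here is a minimal sketch in plain Python (it is not edgeR code). It
assumes the scheme described in ?cpm: the prior count is scaled in
proportion to each sample's library size and added to every count, not
only to zeros. It also assumes that twice the scaled prior is added to
the library size. The function name, the toy counts and the prior_count
value of 0.25 are illustrative assumptions; check ?cpm in your installed
version for the actual default.

```python
import math

def moderated_log_cpm(counts, lib_sizes, prior_count=0.25):
    """Moderated log2-CPM for a genes x samples table of counts.

    Illustrative only: prior_count=0.25 is an assumed value, not
    necessarily the default of any particular edgeR release.
    """
    mean_lib = sum(lib_sizes) / len(lib_sizes)
    # Same prior for every gene, scaled per sample by relative library size.
    scaled_prior = [prior_count * L / mean_lib for L in lib_sizes]
    # Library sizes are augmented by twice the scaled prior (assumption).
    adj_lib = [L + 2 * p for L, p in zip(lib_sizes, scaled_prior)]
    return [
        [math.log2((y + p) / L * 1e6)
         for y, p, L in zip(row, scaled_prior, adj_lib)]
        for row in counts
    ]

# Toy data: 2 genes x 3 samples (made-up numbers).
counts = [[0, 10, 5], [100, 220, 150]]
lib_sizes = [1_000_000, 2_000_000, 1_500_000]

log_cpm = moderated_log_cpm(counts, lib_sizes)
# Back-transform to get moderated CPMs on the unlogged scale.
moderated_cpm = [[2 ** v for v in row] for row in log_cpm]
print(moderated_cpm)
```

Under these assumptions every gene in every sample is moderated, the
zero count yields a small positive CPM rather than minus infinity on
the log scale, and raising 2 to the logged values recovers the
moderated CPMs themselves.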
> 2) How is the logFC calculated? Is it based on moderated CPMs for each
> lane? Does it take the ratio of the average moderated CPM for each
> group?
Generalized linear model. See ?glmFit. Note that a generalized linear
model is used for the fold changes, even when using the exactTest.
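
To illustrate how a prior count moderates a fold change without
touching the test itself, here is a simplified sketch in plain Python.
It is not edgeR's computation: it assumes zero dispersion (a Poisson
model), in which case a one-group fit reduces to total count divided by
total library size, whereas edgeR fits a negative binomial model
iteratively using the estimated dispersion. The function name, the toy
counts and the prior_count values are all illustrative assumptions.

```python
import math

def shrunk_log_fc(y1, lib1, y2, lib2, prior_count=0.125):
    """log2 fold change (group 2 vs group 1) for one gene.

    Simplified Poisson illustration: each group's abundance is
    (sum of augmented counts) / (sum of augmented library sizes).
    """
    libs = lib1 + lib2
    mean_lib = sum(libs) / len(libs)

    def group_rate(y, lib):
        # Prior scaled by relative library size, as in the log-CPM case.
        prior = [prior_count * L / mean_lib for L in lib]
        num = sum(c + p for c, p in zip(y, prior))
        den = sum(L + 2 * p for L, p in zip(lib, prior))
        return num / den

    return math.log2(group_rate(y2, lib2) / group_rate(y1, lib1))

# One gene with all zeros in group 1 (made-up numbers).
lib = [1_000_000] * 3
for pc in (0.125, 1, 5):
    print(pc, shrunk_log_fc([0, 0, 0], lib, [8, 12, 10], lib, pc))
```

In this toy case the fold change stays finite despite the zeros in one
group, and larger prior counts pull it further toward zero. The counts
entering any significance test are unchanged, which is consistent with
the statement above that prior.count does not affect the p-values.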
Best wishes
Gordon
Thank you!
> Date: Wed, 4 Dec 2013 15:14:06 +1100
> From: smyth@wehi.EDU.AU
> To: karenmenuz@hotmail.com
> CC: bioconductor@r-project.org
> Subject: edgeR prior.count