Question

edgeR prior.count

Guest User ★ 13k

@guest-user-4897

I recently used the edgeR package to analyze an RNA-Seq dataset with 2 genotypes and 3 biological replicates each. After running exactTest, the logFC and logCPM are provided for each gene. I am a bit confused about how exactly these values are calculated.

1) For logCPM, I assume that this is the average expression over all samples. It is clearly not simply the averaged [counts/effective library size for each sample]. I understand that, generally speaking, the original counts (or the CPM instead?) are moderated to avoid infinite values when taking logs of samples/genes with zero counts/CPM, but I cannot figure out exactly how this is done.

a) Is the same small value added to each gene for each sample, or is the added value different for different genes? How is prior.count determined?

b) Are only genes that have a "0" in one sample moderated, or are all genes moderated by prior.count?

c) Is there a way to see the moderated CPM for each gene and sample, and not just the log (moderated CPM)?

2) How is the logFC calculated? Is it based on moderated CPMs for each lane? Does it take the ratio of the average moderated CPM for each group?

Thank you!

-- output of sessionInfo():

R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] edgeR_3.2.4 limma_3.16.7

-- Sent via the guest posting facility at bioconductor.org.

edgeR edgeR • 3.9k views

updated 10.4 years ago by Gordon Smyth 50k • written 10.4 years ago by Guest User ★ 13k

score 0 · Answer 1 · 2013-12-04

Dear Karen,

> Date: Mon, 2 Dec 2013 10:55:38 -0800 (PST)
> From: "Karen [guest]" <guest at bioconductor.org>
> To: bioconductor at r-project.org, karenmenuz at hotmail.com
> Subject: [BioC] edgeR prior.count
>
> I recently used the edgeR package to analyze an RNA-Seq dataset, with 2
> genotypes and 3 biological replicates each.

Please update to the current Bioconductor release (edgeR 3.4.1).

> After running exactTest, the logFC and logCPM are provided for each
> gene. I am a bit confused about how exactly these values are calculated.

It may be that you are expecting things to be somewhat simpler than they actually are. edgeR uses generalized linear models to compute statistically efficient estimates of logCPM and logFC values. These involve an iterative computation for each gene that takes into account the dispersion value, library sizes and so on. It's not just a matter of computing moderated counts and then taking averages or differences.

> 1) For logCPM, I assume that this is the average expression over all
> samples. It is clearly not simply the averaged [counts/effective library
> size for each sample].
>
> I understand that generally speaking the original counts (or the CPM?
> instead) are moderated to avoid infinite values when taking logs of
> samples/genes with zero counts/CPM, but I'm not quite sure that I can
> figure out exactly how this is produced.

See ?aveLogCPM

> a) Is the same small value added to each gene for each sample or is the
> added value different for different genes? How is prior.count
> determined?

See ?predFC

As for determining the prior.count, you input the prior count yourself when you run exactTest, or else the default value is used. The prior.count has no effect on the p-values. It only affects the amount of moderation applied to the reported fold changes.

> b) Are only genes that have a "0" in one sample moderated or are all
> genes moderated by prior.count?

See ?predFC

> c) Is there a way to see the moderated CPM for each gene and sample and
> not just the log (moderated CPM)?

See ?cpm

> 2) How is the logFC calculated? Is it based on moderated CPMs for each
> lane? Does it take the ratio of the average moderated CPM for each
> group?

Generalized linear model. See ?glmFit. Note that a generalized linear model is used for the fold changes, even when using exactTest.

Best wishes
Gordon
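For readers who want to see the arithmetic behind per-sample moderated log-CPM values, here is a minimal base-R sketch. It assumes the scheme described in ?cpm for `log=TRUE`, as I understand it: the prior count is scaled in proportion to each library size and added to every gene in that sample, not only to genes with zero counts. The function name `moderated_log_cpm` and the toy counts are hypothetical, and the default `prior.count` differs between edgeR versions, so check the result against `cpm(y, log=TRUE, prior.count=...)` in your installation. It does not reproduce the logFC or logCPM columns from exactTest, which come from a generalized linear model fit as described above.

```r
# Hypothetical helper: a sketch of library-size-scaled prior counts for log2 CPM.
# Assumption: this mirrors what cpm(y, log = TRUE, prior.count = ...) does in edgeR;
# verify against your installed version before relying on it.
moderated_log_cpm <- function(counts, lib.size = colSums(counts), prior.count = 2) {
  counts <- as.matrix(counts)
  # Scale the prior count so larger libraries get a proportionally larger offset
  prior.scaled <- prior.count * lib.size / mean(lib.size)
  # Inflate the library sizes to match the counts that were added
  lib.adj <- lib.size + 2 * prior.scaled
  # Add the per-sample offset to every gene, then convert to log2 counts per million
  log2(t((t(counts) + prior.scaled) / lib.adj) * 1e6)
}

# Toy data (made up): 3 genes in rows, 2 samples in columns
counts <- matrix(c(0, 10, 90,
                   5, 20, 175),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))

logcpm <- moderated_log_cpm(counts, prior.count = 2)
modcpm <- 2^logcpm  # moderated CPM on the unlogged scale
```

Under these assumptions the offset is the same for every gene within a sample and differs between samples only through the library size, and raising 2 to the power of the log values recovers the moderated CPM on the original scale.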