edgeR: pseudocounts, logConc and logFC
Ann Hess ▴ 340
I am experimenting with edgeR for high throughput (next gen) sequence data and proteomics spectral count data and have a few questions.

1. Is it correct to think of the pseudocounts (pseudo.alt produced by estimateCommonDisp) as normalized counts? According to the edgeR vignette, "The pseudocounts are calculated using a quantile-to-quantile method for the negative binomial so that the library sizes for the pseudocounts are equal to the geometric mean of the original library sizes." For the data that I am working with, the column sums for pseudo.alt are very close to the common.lib.size, but the boxplots do not "line up". Is this because the pseudocounts are "generated under the alternative hypothesis"?

2. I noticed that within the estimatePs function, the minimum value is set to 8.783496e-16. I think the choice of this minimum will affect the estimated logConc and logFC values, but will it affect the test results (p-values)?

3. The ranges for logConc and logFC seem different when comparing the graph produced by plotSmear and the output produced by exactTest (for a single comparison). Specifically, for each of the examples in the edgeR vignette (and in my own data examples), the minimum logConc in the smear plot is ~ -16, while in the table from topTags the minimum is ~32. For logFC, the max shown in the smear plot is ~10, while the max in topTags is ~40. After changing xlim and ylim in plotSmear, this doesn't seem to be an issue of setting the axes.

I am using edgeR_1.4.7 with R version 2.10.1.

Thanks!

Ann
Mark Robinson ★ 1.1k
Hi Ann. Good questions. See comments below.

> I am experimenting with edgeR for high throughput (next gen) sequence data and proteomics spectral count data and have a few questions.
>
> 1. Is it correct to think of the pseudocounts (pseudo.alt produced by estimateCommonDisp) as normalized counts? According to the edgeR vignette, "The pseudocounts are calculated using a quantile-to-quantile method for the negative binomial so that the library sizes for the pseudocounts are equal to the geometric mean of the original library sizes." For the data that I am working with, the column sums for pseudo.alt are very close to the common.lib.size, but the boxplots do not "line up". Is this because the pseudocounts are "generated under the alternative hypothesis"?

Yes, you could use the pseudodata as normalized counts. If it's RNA-seq data, you might want to do something additional about gene length though (e.g. RPKMs). The reason for your boxplots not lining up (and another consideration for normalization) may be what we call composition bias:

http://genomebiology.com/2010/11/3/R25

As you may know, the Berkeley folks have methods for normalization:

http://www.biomedcentral.com/1471-2105/11/94

In general for differential expression, we make the statistical models operate directly on the raw counts (and incorporate 'normalization' into the model); for us, the normalized data is just for looking at, not for doing statistics on.

> 2. I noticed that within the estimatePs function, the minimum value is set to 8.783496e-16. I think the choice of this minimum will affect the estimated logConc and logFC values, but will it affect the test results (p-values)?

Yes, it definitely will affect logFC and logConc. It shouldn't affect the exact testing, since this is based on sums of group pseudocounts, which are at roughly the original scale of measurement.

> 3. The ranges for logConc and logFC seem different when comparing the graph produced by plotSmear and the output produced by exactTest (for a single comparison). Specifically, for each of the examples in the edgeR vignette (and in my own data examples), the minimum logConc in the smear plot is ~ -16, while in the table from topTags the minimum is ~32. For logFC, the max shown in the smear plot is ~10, while the max in topTags is ~40. After changing xlim and ylim in plotSmear, this doesn't seem to be an issue of setting the axes.

Actually, this is the whole reason for the 'smear' plots. The smear itself is composed of those genes/tags that have the minimum value in one of the two groups. The X values for the smear are chosen as random uniform (hence, the smear), just to the left of the non-minimum genes/tags. The Y values are a 'compressed' logFC, so that they are not so far out. So, plotSmear() gives a different visual representation of logFC/logConc than the exactTest() output table.

Hope that helps.

Cheers,
Mark
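To make the answer to question 2 concrete, here is a minimal sketch of how a floor on the estimated proportions drives logConc and logFC for a tag that is absent in one group. It is plain Python, not edgeR code. It assumes that logConc is the mean of the two groups' log2 proportions and logFC is their difference; the thread does not spell out these definitions, and the function name `log_conc_and_fc` and the example proportions are hypothetical.

```python
import math

# Minimum proportion quoted in the thread for estimatePs.
FLOOR = 8.783496e-16


def log_conc_and_fc(p1, p2, floor=FLOOR):
    """Return (logConc, logFC) on the log2 scale for two group proportions.

    Assumed definitions (not confirmed by the thread): logConc is the mean
    of the two log2 proportions and logFC is their difference. Each
    proportion is raised to at least `floor` before taking logs.
    """
    a = math.log2(max(p1, floor))
    b = math.log2(max(p2, floor))
    return (a + b) / 2, b - a


if __name__ == "__main__":
    # Hypothetical tag: absent in group 1, proportion 2**-14 in group 2.
    for floor in (FLOOR, 1e-8):
        log_conc, log_fc = log_conc_and_fc(0.0, 2.0 ** -14, floor)
        print(f"floor={floor:.3g}  logConc={log_conc:.1f}  logFC={log_fc:.1f}")

    # A tag observed in both groups is untouched by the floor.
    print(log_conc_and_fc(2.0 ** -12, 2.0 ** -14))
```

Under these assumptions, the floor from the thread puts the hypothetical tag near logConc -32 and logFC 36, which is the scale of the extreme topTags values described in the question, and a different floor moves both numbers. Tags observed in both groups are unaffected, and a test based on summed group pseudocounts never sees the floor at all.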