edgeR: pseudocounts, logConc and logFC

Ann Hess ▴ 340

@ann-hess-251

Last seen 9.6 years ago

I am experimenting with edgeR for high throughput (next gen) sequence data and proteomics spectral count data and have a few questions.

1. Is it correct to think of the pseudocounts (pseudo.alt produced by estimateCommonDisp) as normalized counts? According to the edgeR vignette, "The pseudocounts are calculated using a quantile-to-quantile method for the negative binomial so that the library sizes for the pseudocounts are equal to the geometric mean of the original library sizes." For the data that I am working with, the column sums for pseudo.alt are very close to the common.lib.size, but the boxplots do not "line up". Is this because the pseudocounts are "generated under the alternative hypothesis"?

2. I noticed that within the estimatePs function, the minimum value is set to 8.783496e-16. I think the choice of this minimum will affect the estimated logConc and logFC values, but will it affect the test results (p-values)?

3. The ranges for logConc and logFC seem different when comparing the graph produced by plotSmear and the output produced by exactTest (for a single comparison). Specifically, for each of the examples in the edgeR vignette (and in my own data examples), the minimum logConc in the smear plot is ~ -16, while in the table from topTags the minimum is ~32. For logFC, the max shown in the smear plot is ~10, while the max in topTags is ~40. After changing xlim and ylim in plotSmear, this doesn't seem to be an issue of setting the axes.

I am using edgeR_1.4.7 with R version 2.10.1.

Thanks!

Ann

Proteomics graph edgeR • 4.7k views

Updated 14.0 years ago by Mark Robinson ★ 1.1k • written 14.0 years ago by Ann Hess ▴ 340

Mark Robinson ★ 1.1k

@mark-robinson-2171

Last seen 9.6 years ago

Hi Ann.

Good questions. See comments below.

> I am experimenting with edgeR for high throughput (next gen) sequence
> data and proteomics spectral count data and have a few questions.
>
> 1. Is it correct to think of the pseudocounts (pseudo.alt produced by
> estimateCommonDisp) as normalized counts? According to the edgeR
> vignette, "The pseudocounts are calculated using a quantile-to-quantile
> method for the negative binomial so that the library sizes for the
> pseudocounts are equal to the geometric mean of the original library
> sizes." For the data that I am working with, the column sums for
> pseudo.alt are very close to the common.lib.size, but the boxplots do
> not "line up". Is this because the pseudocounts are "generated under
> the alternative hypothesis"?

Yes, you could use the pseudodata as normalized counts. If it's RNA-seq data, you might want to do something additional about gene length though (e.g. RPKMs).

The reason for your boxplots not lining up (and another consideration for normalization) may be what we call composition bias:

http://genomebiology.com/2010/11/3/R25

As you may know, the Berkeley folks have methods for normalization:

http://www.biomedcentral.com/1471-2105/11/94

In general for differential expression, we make the statistical models operate directly on the raw counts (and incorporate 'normalization' into the model); for us, the normalized data is just for looking at, not for doing statistics on.

> 2. I noticed that within the estimatePs function, the minimum value
> is set to 8.783496e-16. I think the choice of this minimum will
> affect the estimated logConc and logFC values, but will it affect the
> test results (p-values)?

Yes, it definitely will affect logFC and logConc. It shouldn't affect the exact testing, since this is based on sums of group pseudocounts, which are at roughly the original scale of measurement.

> 3. The ranges for logConc and logFC seem different when comparing
> the graph produced by plotSmear and the output produced by exactTest
> (for a single comparison). Specifically, for each of the examples in the
> edgeR vignette (and in my own data examples), the minimum logConc in
> the smear plot is ~ -16, while in the table from topTags the minimum is
> ~32. For logFC, the max shown in the smear plot is ~10, while the max in
> topTags is ~40. After changing xlim and ylim in plotSmear, this
> doesn't seem to be an issue of setting the axes.

Actually, this is the whole reason for the 'smear' plots. The smear itself is composed of those genes/tags that have the minimum value in one of the two groups. The X values for the smear are chosen as random uniform (hence, the smear), just to the left of the non-minimum genes/tags. The Y values are a 'compressed' logFC, so that they are not so far out. So, plotSmear() gives a different visual representation of logFC/logConc than the exactTest() output table.

Hope that helps.

Cheers,
Mark
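To make the answers to questions 2 and 3 a bit more concrete, here is a small Python sketch of the arithmetic (it is not edgeR code). It assumes logConc is the average of the two groups' log2 proportions and logFC is their difference, as on an MA-plot, and that a group with no counts is floored at the 8.783496e-16 value quoted in the question. The function name `log_conc_fc` and the example proportions are made up for illustration.

```python
import math

# Floor quoted in the question (reported there as coming from estimatePs).
MIN_PROPORTION = 8.783496e-16


def log_conc_fc(p1, p2, floor=MIN_PROPORTION):
    """Return (logConc, logFC) on the log2 scale for two group proportions.

    Assumption: logConc is the mean of the two log2 proportions and logFC
    is their difference. This mirrors an MA-plot and may not match the
    exact computation inside edgeR.
    """
    a = math.log2(max(p1, floor))  # group 1, floored if the tag is absent
    b = math.log2(max(p2, floor))  # group 2, floored if the tag is absent
    return (a + b) / 2.0, b - a


# Hypothetical tag seen in both groups (about 1 read in 2**14 vs 1 in 2**13).
both = log_conc_fc(2.0 ** -14, 2.0 ** -13)

# Hypothetical tag absent from group 1, so its proportion hits the floor.
absent = log_conc_fc(0.0, 2.0 ** -14)

print("expressed in both groups: logConc=%.1f logFC=%.1f" % both)
print("absent in group 1:        logConc=%.1f logFC=%.1f" % absent)
```

Under these assumptions the tag seen in both groups lands at logConc -13.5 and logFC 1, while the tag that is absent from one group comes out at roughly -32 and 36. Those extreme values are set by the floor rather than by the data, which is consistent with the far-out entries in the topTags table and with plotSmear drawing such tags in a compressed smear instead of at their tabulated coordinates.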

Written 14.0 years ago by Mark Robinson ★ 1.1k