Good questions. See comments below.
> I am experimenting with edgeR for high throughput (next gen)
> sequencing data and proteomics spectral count data and have a few
> questions.
> 1. Is it correct to think of the pseudocounts (pseudo.alt produced by
> estimateCommonDisp) as normalized counts? According to the edgeR
> vignette "The pseudocounts are calculated using a quantile-to-quantile
> method for the negative binomial so that the library sizes for the
> pseudocounts are equal to the geometric mean of the original library
> sizes." For the data that I am working with, the column sums for
> pseudo.alt are very close to the common.lib.size, but the boxplots do
> not "line-up". Is this because the pseudocounts are "generated under
> the alternative hypothesis"?
Yes, you could use the pseudodata as normalized counts. If it's RNA-seq
data, you might want to do something additional about gene length
(e.g. RPKMs). The reason for your boxplots not lining up (and another
consideration for normalization) may be what we call composition bias:
As you may know, the Berkeley folks have methods for normalization:
In general for differential expression, we make the statistical models
operate directly on the raw counts (and incorporate 'normalization' in
the model); for us, the normalized data is just for looking at, not for
doing statistics on.
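
To make the composition-bias point concrete, here is a minimal toy sketch in plain Python. It is not edgeR code and it uses simple scaling, not the quantile-to-quantile adjustment; the counts and the helper name scale_to_common_size are made up for illustration. Genes 1-4 are assumed to be truly unchanged, but sample B has a fifth gene that takes up half of its reads.

```python
import statistics

def scale_to_common_size(counts, common_size):
    """Rescale one library so that its counts sum to common_size."""
    total = sum(counts)
    return [c * common_size / total for c in counts]

# Hypothetical toy data: genes 1-4 are truly unchanged between samples,
# but sample B also expresses a fifth gene that soaks up half of its reads.
lib_a = [200, 200, 200, 200, 0]
lib_b = [100, 100, 100, 100, 400]

# Common library size taken as the geometric mean of the two library sizes.
common_size = (sum(lib_a) * sum(lib_b)) ** 0.5

norm_a = scale_to_common_size(lib_a, common_size)
norm_b = scale_to_common_size(lib_b, common_size)

# Column sums agree, yet the typical gene still looks 2-fold lower in B.
print("column sums:", sum(norm_a), sum(norm_b))
print("medians:    ", statistics.median(norm_a), statistics.median(norm_b))
```

The column sums match the common library size in both samples, but the medians differ by a factor of two, which is the kind of pattern that shows up as boxplots that do not line up.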
> 2. I noticed that within the estimatePs function, the minimum value
> is set to 8.783496e-16. I think the choice of this minimum will
> affect the estimated logConc and logFC values, but will it affect
> test results (p-values)?
Yes, it definitely will affect logFC and logConc. It shouldn't affect
exact testing, since this is based on sums of group pseudocounts, which
are at roughly the original scale of measurement.
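
As a rough illustration of why the floor matters for logFC and logConc, here is a small sketch in plain Python (not edgeR code). It assumes logConc is the average of the two log2 proportions and logFC is their difference; those definitions, the function name log_conc_and_fc, and the example proportion 2^-14 are assumptions for illustration. Only the floor value comes from the question.

```python
import math

FLOOR = 8.783496e-16  # minimum proportion quoted in the question (just under 2**-50)

def log_conc_and_fc(p1, p2, floor=FLOOR):
    """Return (logConc, logFC) for two group proportions.

    Assumed definitions, for illustration only:
      logConc = (log2(p1) + log2(p2)) / 2
      logFC   = log2(p2) - log2(p1)
    Proportions below the floor are raised to the floor first.
    """
    l1 = math.log2(max(p1, floor))
    l2 = math.log2(max(p2, floor))
    return (l1 + l2) / 2, l2 - l1

# Hypothetical tag: absent in group 1, proportion 2**-14 in group 2.
conc, fc = log_conc_and_fc(0.0, 2 ** -14)

# Same tag with a different (hypothetical) floor of 1e-10.
conc_alt, fc_alt = log_conc_and_fc(0.0, 2 ** -14, floor=1e-10)

print("default floor:", conc, fc)
print("floor = 1e-10:", conc_alt, fc_alt)
```

For a tag with zero counts in one group, both values are driven almost entirely by the floor, so changing the floor moves them a long way. The counts that go into the exact test are unchanged by this choice.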
> 3. The ranges for logConc and logFC seem different when comparing
> the graph produced by smearPlot and output produced by exactTest (for
> a single comparison). Specifically, for each of the examples in the
> edgeR vignette (and in my own data examples), the minimum logConc in
> the smearPlot is ~ -16, while in the table from topTags the minimum is
> ~32. For logFC, the max shown in smearPlot is ~10, while the max in
> topTags is ~40. After changing xlim and ylim in plotSmear, this
> doesn't seem to be an issue of setting the axes.
Actually, this is the whole reason for the 'smear' plots. The smear
itself is composed of those genes/tags that have the minimum value in one
of the two groups. The X values for the smear are chosen as random
uniform (hence, the smear), just to the left of the non-minimum
genes/tags. The Y values are a 'compressed' logFC, so that they are not
so far out. So, plotSmear() gives a different visual representation of
logFC/logConc than the exactTest() output table.
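
The idea can be sketched roughly as follows in plain Python. This is not the plotSmear() source: the function name smear_coordinates, the smear_width parameter, and in particular the compression rule (capping |logFC| at one unit beyond the largest non-smear value) are hypothetical choices made only to show the principle.

```python
import math
import random

def smear_coordinates(points, smear_width=1.0, seed=0):
    """Map (logConc, logFC, at_minimum) triples to plotting coordinates.

    Tags flagged at_minimum (minimum value in one group) are moved into a
    band just left of the smallest logConc among the other tags, with a
    uniform random X, and their logFC is capped (hypothetical rule).
    """
    rng = random.Random(seed)
    regular = [(a, m) for a, m, at_min in points if not at_min]
    min_a = min(a for a, _ in regular)
    cap = max(abs(m) for _, m in regular) + 1.0  # hypothetical compression limit

    coords = []
    for a, m, at_min in points:
        if at_min:
            x = min_a - rng.uniform(0.0, smear_width)  # random uniform X
            y = math.copysign(min(abs(m), cap), m)     # 'compressed' logFC
            coords.append((x, y))
        else:
            coords.append((a, m))
    return coords

# Hypothetical tags: two ordinary ones and two at the minimum in one group.
tags = [(-12.0, 0.5, False), (-16.0, -2.0, False),
        (-32.0, 36.0, True), (-31.0, -38.0, True)]
plotted = smear_coordinates(tags)
print(plotted)
```

The two ordinary tags keep their table values, while the other two land in a narrow band just left of -16 with a much smaller |logFC| than the table reports, which is why the plot ranges and the topTags ranges differ.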
Hope that helps.
> I am using edgeR_1.4.7 with R version 2.10.1.