Entering edit mode
Koen Van den Berge
▴
180
@koen-van-den-berge-6369
Last seen 5 months ago
Ghent University, Belgium
Dear Bioconductors,
In evaluating several RNA-seq analysis techniques, I decided to
analyse the same SEQC spiked-in dataset as mentioned in the limma-Voom
paper (Law et al. 2014, Genome Biology).
This dataset contains 92 genes, 69 of which are spiked-in at different
concentrations. As in the paper, I copied the 23 non DE genes twice,
providing me a dataset with a total of 138 genes, half of which are
(non) DE.
The authors use a filter on the dataset that requires a cpm higher
than 1 in at least 4 libraries.
First I decided to analyse the data using EdgeR:
spiked <- read.csv(filename)
#Copy non DE genes
copies <- spiked[id,]
spiked2 <- rbind(spiked,copies)
spiked3 <- rbind(spiked2,copies)
#Design matrix
libraries <- factor(rep(c("A" , "B" , "C" , "D"),each=4))
design <- model.matrix(~ libraries)
#make DGElist and normalize
y <- DGEList(spiked3[,-1], group= libraries)
y$samples$norm.factors <- normFactorsAll
#estimate dispersion
estimateDisip(y , design)
In doing so, the estimateDisp() function provided me an estimate of
prior df of 58.90.
In assessing the effect of the filtering process, I followed the same
steps as above, however not applying the filtering process. The
estimateDisp() function then provided me an estimate of prior df of
Infinity, suggesting the trended dispersion estimation should be used.
Why is this happening? Is the ML dispersion estimation sensitive to
these low counts, resulting in an estimate of infinity for the prior
degrees of freedom?
I can not really grasp this result, so any suggestions or help in this
topic are very welcome.
Sincerely,
Koen.
[[alternative HTML version deleted]]