Hi everyone,
I have a problem with edgeR right now, and I cannot figure out what the cause is.
I use these commands:
x <- read.delim("input.csv", row.names = "Contigs")
group <- c(1,1,2,2,3,3,1,1,4,4,5,5,1,1,6,6,7,7,8,8,9,9,10,10,11,11,12,13,13,14,14,15,15,16,16)
library("edgeR")
d <- DGEList(counts = x, group = group)
d$samples$lib.size <- colSums(d$counts)
d <- calcNormFactors(d)
d <- estimateCommonDisp(d, verbose = TRUE)
I then get the following error:
Error in if (any(input.mean < 0)) stop("input.mean must be non-negative") :
  missing value where TRUE/FALSE needed
Calls: estimateCommonDisp -> equalizeLibSizes -> q2qnbinom
The file does not contain any negative values (though it does contain quite a few zeros), so I don't understand where this error is coming from.
Any help would be appreciated :).
(I'd like to attach the input file, but I can't find a way to do that right now, help o_O)
EDIT: The input is here: http://www.mediafire.com/view/1pv8ocwlll2cth6/input.test.csv
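For anyone checking the same thing, here is a minimal diagnostic sketch (using the x and d objects from the commands above) to see whether the problem comes from the counts themselves or from the normalization step:

any(x < 0, na.rm = TRUE)    # should be FALSE -- no negative counts
sum(is.na(x))               # any missing values in the input?
colSums(x)                  # any samples with a library size of zero?
d$samples$norm.factors      # a NaN here (calcNormFactors could not compute a
                            # factor) would explain the downstream q2qnbinom error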
You could certainly use voom(), which will convert all those zeros to some non-zero value, and then run a conventional ANOVA analysis on your data. However, given the huge differences between samples, you end up with pretty much every contig being significantly different. But maybe this is just a subset of your data?
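As a rough sketch of that route, assuming the DGEList d and the group vector from the original post (the design is illustrative; if calcNormFactors() gave NaN factors, reset them first so voom only uses the library sizes):

library(limma)
d$samples$norm.factors <- 1               # fall back to library-size-only scaling
design <- model.matrix(~ factor(group))   # first group acts as the baseline
v <- voom(d, design, plot = TRUE)         # log-CPM values with precision weights
fit <- eBayes(lmFit(v, design))
topTable(fit, coef = 2:ncol(design))      # F-test: any difference between groups?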
I had a similarly sparse data set from some single-cell RNA-Seq, and I used voom() precisely because of the NaNs from calcNormFactors(). If you have single-cell data, you might also try the monocle package. I spoke with Cole Trapnell at the BioC2015 conference, and he said it should work fine with only dozens of samples, rather than the hundreds I thought it needed. I haven't tried it on my own data yet, but it's on my to-do list...
Jenny
Hmm... nothing obvious comes to mind. There just don't seem to be enough non-zero features (i.e., contigs) for reliable TMM normalization. Indeed, even in the best case where all the counts are non-zero, TMM normalization will still assume that the majority of contigs are non-DE. This may not be appropriate if you're expecting a large number of these contigs to be different between samples. Spike-ins might have been useful in this situation, as they would have allowed us to avoid the problems above (with some caveats), but I don't see any in your test data.
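A rough way to see how little TMM has to work with, assuming the count matrix x from the original post:

colSums(x > 0)                  # non-zero contigs per sample
sum(rowSums(x > 0) == ncol(x))  # contigs that are non-zero in every sample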
Also, to add to James' comment: while voom will technically work, I don't anticipate that it will give sensible results. For starters, normalization in voom is traditionally computed from the normalization factors from edgeR. If we used the latter, we'd end up with the same problem as in the original post; if we didn't normalize at all, that would compromise the quantitative interpretation of the results. Modelling of the mean-variance trend in voom also struggles at low counts with high discreteness, and this is exacerbated by the small number of features, which precludes reliable trend fitting.

Thanks for all the answers :).
The dataset is indeed only a part. The contigs are genes belonging to...er...a taxonomic entity (can't even call it an OTU) from a metatranscriptomic dataset, which is why it's so sparse in places. I need to separate the "organisms" in there, because the population changes vastly between samples, and normalizing everything as one dataset will not give results that make sense.
Using another method might be an option, but I'd need to apply it to the whole dataset.
I see that TMM might not be appropriate, but the problem is that it might be fine for some of the subsets and not for others, and there are too many of them to handle manually, so I need a general approach.
Maybe RPKM is the best approach here. The more advanced statistics are not really made for this type of data, and a method that does not derive anything from the overall behaviour of the dataset might be the safest.
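If you go down that route, edgeR can compute RPKM directly once you have contig lengths; a minimal sketch, where contig_lengths is a hypothetical vector of lengths in bp (in the same order as rownames(d)) taken from your assembly:

rpkm_values <- rpkm(d, gene.length = contig_lengths,
                    normalized.lib.sizes = FALSE)   # avoid relying on the NaN TMM factors
cpm_values  <- cpm(d, normalized.lib.sizes = FALSE) # length-free alternative (CPM)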
Any other thoughts would be appreciated :), because my statistics background is rather weak :/.