Hello, I am running DESeq2 on a large number (n=166) of paired RNA-seq samples, but beta estimation fails to converge for ~500 of 17.3k genes.
I'm fairly familiar with general(ized) linear modeling, and I know that collinearity can produce so-called numerical instability during estimation of beta; as such, what I read in this prior thread makes sense to me.
In that same thread, Michael Love explains that genes with low counts can also produce such a problem, but I have already excluded lowly expressed genes, and
badHombres <- which(mcols(GlassDds)$betaConv == FALSE)   # genes flagged as non-converged by DESeq2
rowSums(assays(GlassDds)$counts[badHombres, ])            # total raw counts for each of those genes
returns
USH1C UBR7 GABRA1 CBLN4 GOLGA5 CELSR3 SNAP91
63492 423038 75954 46280 377727 228592 195439
MT3 CGRRF1 MYBL2 PDYN FA2H STMN2 GSR
2768201 176736 191081 53601 123691 523764 62811
ALDH3A1 ABI3 IBSP PCDHB3 PCDHB5 LTF ZBTB47
11977 106502 109584 153525 39030 4189439 196215
CENPA EVA1A REG1A BCL9 NR5A2 HOXB8 KCNJ8
43393 23166 8477 48716 29786 15059 98766
GJB6 HOXA7 NEUROD4 PI3 COL21A1 MSTO1 GDF1
62758 81526 5535 82698 39279 229045 335964
LIF HOXD10 HOXD11 CEP170B CALY JCHAIN HMGB1P5
192423 71064 32479 288324 30076 42385 110157
Of course, discarding these data would solve the problem, but this list contains a few genes in which our co-investigators are fairly interested, and I am eager to avoid a conversation akin to:
wet: "where is Lif"
dry: "I discarded it"
wet: "why?"
dry: "because the (jargon jargon jargon) didn't (jargon)"
wet: "what?"
dry: "cancel my 2 o'clock, please"
I'm guessing that part of the problem here is the relatively large number of samples compounding issues related to how the dataset was constructed.
Other than adding
dds <- estimateSizeFactors(dds)
nc <- counts(dds, normalized=TRUE)
filter <- rowSums(nc >= 10) >= 2   # keep genes with normalized counts >= 10 in at least 2 samples
dds <- dds[filter,]
and upping maxit in a call to nbinomWaldTest(), are there other things I can try?
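(By "upping maxit" I mean something along these lines, with 500 as just an arbitrary cap larger than the default:)
dds <- nbinomWaldTest(dds, maxit = 500)   # allow more fitting iterations per gene before flagging non-convergence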
Thank you for considering this question!
filter is essentially just removing genes with only zeros, nothing else. I would try filterByExpr() from edgeR, or increase the n in the filter, e.g. more than 10 raw (not normalized) counts in more than 25 samples, or any n that makes sense at a total of 166 samples with respect to the smallest group.
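With a DESeqDataSet, that kind of raw-count filter would be something along the lines of:
keep <- rowSums(counts(dds) > 10) > 25   # more than 10 raw counts in more than 25 of the 166 samples
dds <- dds[keep, ]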
ATpoint - quite right; that block of code is a direct cut and paste from something Mike Love wrote; my code uses a row mean instead, as that avoids problems that could stem from running workflows of different sizes with the same pipeline. I'll take a look at filterByExpr().
TYVM
Such a filter does not remove problematic genes. For example, at n=166, say 3 samples are outliers with 1000 counts and the rest are 0:
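x <- c(rep(1000, 3), rep(0, 163))   # 3 outlier samples at 1000 counts, 163 samples at 0
mean(x)
# [1] 18.07229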
The mean above is 18.1 -- not removed by your filter, yet it is a problematic gene that should be removed. Mean-based filters are even more problematic when n is small, as a single outlier of 1000 counts then pulls the mean up even further. Something like "more than 10 raw counts in 20% of samples" would catch that, and so does filterByExpr. The good thing about filterByExpr is that it is automatic, aware of group size / design, and has stood the test of time. The latter is a big plus beyond what any benchmark could test: if many people use it routinely (it's in the standard edgeR code in the manual), then apparently it is at least "good enough" for most purposes.
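A minimal sketch of what that could look like with the object from the question (assuming the grouping column is called "condition"; substitute the real design factor):
library(edgeR)
keep <- filterByExpr(counts(GlassDds), group = GlassDds$condition)   # automatic, design-aware expression filter
GlassDds <- GlassDds[keep, ]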
That's true, though I would catch it downstream at a couple of different points. I was able to get past this hurdle through brute-force iteration.
thx for your help.
VAL