Hello, I am running DESeq2 on a large number (n=166) of paired RNA-seq samples, but beta estimation fails to converge for ~500 of 17.3k genes.
I'm fairly familiar with general(ized) linear modeling, and I know that collinearity can produce so-called numerical instability during estimation of beta; as such, what I read in this prior thread makes sense to me.
In that same thread, Michael Love explains that genes with low counts can also produce such a problem, but I have already excluded lowly expressed genes, and
badHombres <- which(mcols(GlassDds)$betaConv == FALSE)   # genes flagged as non-converged by DESeq2
rowSums(assays(GlassDds)$counts[badHombres, ])            # total raw counts for each of those genes
returns
USH1C UBR7 GABRA1 CBLN4 GOLGA5 CELSR3 SNAP91
63492 423038 75954 46280 377727 228592 195439
MT3 CGRRF1 MYBL2 PDYN FA2H STMN2 GSR
2768201 176736 191081 53601 123691 523764 62811
ALDH3A1 ABI3 IBSP PCDHB3 PCDHB5 LTF ZBTB47
11977 106502 109584 153525 39030 4189439 196215
CENPA EVA1A REG1A BCL9 NR5A2 HOXB8 KCNJ8
43393 23166 8477 48716 29786 15059 98766
GJB6 HOXA7 NEUROD4 PI3 COL21A1 MSTO1 GDF1
62758 81526 5535 82698 39279 229045 335964
LIF HOXD10 HOXD11 CEP170B CALY JCHAIN HMGB1P5
192423 71064 32479 288324 30076 42385 110157
Of course, discarding these data would solve the problem, but this list contains a few genes in which our co-investigators are fairly interested, and I am eager to avoid a conversation akin to:
wet: "where is Lif"
dry: "I discarded it"
wet: "why?"
dry: "because the (jargon jargon jargon) didn't (jargon)"
wet: "what?"
dry: "cancel my 2 o'clock, please"
I'm guessing that part of the problem here is the relatively large number of samples compounding issues related to how the dataset was constructed.
Other than adding
dds <- estimateSizeFactors(dds)
nc <- counts(dds, normalized=TRUE)
filter <- rowSums(nc >= 10) >= 2   # keep genes with normalized counts >= 10 in at least 2 samples
dds <- dds[filter,]
and upping maxit in a call to nbinomWaldTest(), are there other things I can try?
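(By "upping maxit" I mean something along these lines, with 500 as just an arbitrary cap larger than the default:)
dds <- nbinomWaldTest(dds, maxit = 500)   # allow more fitting iterations per gene before flagging non-convergence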
Thank you for considering this question!
filter is essentially just removing genes with only zeros, nothing else. I would try filterByExpr() from edgeR, or increase the n in the filter, e.g. more than 10 raw (not normalized) counts in more than 25 samples, or any n that makes sense at a total of 166 samples with respect to the smallest group.
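With a DESeqDataSet, that kind of raw-count filter would be something along the lines of:
keep <- rowSums(counts(dds) > 10) > 25   # more than 10 raw counts in more than 25 of the 166 samples
dds <- dds[keep, ]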
ATpoint - quite right; that block of code is a direct cut and paste from something Mike Love wrote; my code uses a row mean instead, as that avoids problems that could stem from running workflows of different sizes with the same pipeline. I'll take a look at filterByExpr().
TYVM
Such a filter does not remove problematic genes. For example, at n=166, say 3 samples are outliers with 1000 counts and the rest are 0:
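x <- c(rep(1000, 3), rep(0, 163))   # 3 outlier samples at 1000 counts, 163 samples at 0
mean(x)
# [1] 18.07229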
The mean above is 18.1 -- not removed by your filter, yet it is a problematic gene that should be removed. Mean-based filters are even more problematic when n is small, as a single outlier of 1000 counts then pulls the mean up even further. Something like "more than 10 raw counts in 20% of samples" would catch that, and so does filterByExpr. The good thing about filterByExpr is that it is automatic, aware of group size / design, and has stood the test of time. The latter is a big plus beyond what any benchmark could test: if many people use it routinely (it's in the standard edgeR code in the manual), then apparently it is at least "good enough" for most purposes.
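A minimal sketch of what that could look like with the object from the question (assuming the grouping column is called "condition"; substitute the real design factor):
library(edgeR)
keep <- filterByExpr(counts(GlassDds), group = GlassDds$condition)   # automatic, design-aware expression filter
GlassDds <- GlassDds[keep, ]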
That's true, though I would catch it downstream at a couple of different points. I was able to get past this hurdle through brute-force iteration.
thx for your help.
VAL