Question

DESeq2 Error: rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest

4

Entering edit mode

erin.gill81 ▴ 60

@eringill81-6831

Last seen 4.7 years ago

Canada

Hello,

I am trying to analyse a dataset that has data from three different sequencing runs, two cell types (one is a homozygous knockout of the other) and four stimuli (plus control). I know from previous analyses of this data that there are batch effects, so I am using a design formula that looks like the following:

design = ~ run + celltype + stimulus + celltype:stimulus

When I attempt to run DESeq2 using this design formula, I get the error "1650 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest"

I don't get the error when I combine celltype and stimulus into a single variable "condition".

I've searched for this error in the manual and online, but cannot figure out what it means. Could somebody please explain it?

Thanks!

Erin

output of sessionInfo():

R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] DESeq2_1.6.3 RcppArmadillo_0.4.600.0 Rcpp_0.11.3
[4] GenomicRanges_1.18.4 GenomeInfoDb_1.2.4 IRanges_2.0.1
[7] S4Vectors_0.4.0 BiocGenerics_0.12.1

loaded via a namespace (and not attached):
[1] acepack_1.3-3.3 annotate_1.44.0 AnnotationDbi_1.28.1
[4] base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.8
[7] Biobase_2.26.0 BiocParallel_1.0.0 brew_1.0-6
[10] checkmate_1.5.1 cluster_2.0.1 codetools_0.2-10
[13] colorspace_1.2-4 DBI_0.3.1 digest_0.6.8
[16] fail_1.2 foreach_1.4.2 foreign_0.8-62
[19] Formula_1.2-0 genefilter_1.48.1 geneplotter_1.44.0
[22] ggplot2_1.0.0 grid_3.1.2 gtable_0.1.2
[25] Hmisc_3.14-6 iterators_1.0.7 lattice_0.20-29
[28] latticeExtra_0.6-26 locfit_1.5-9.1 MASS_7.3-37
[31] munsell_0.4.2 nnet_7.3-9 plyr_1.8.1
[34] proto_0.3-10 RColorBrewer_1.1-2 reshape2_1.4.1
[37] rpart_4.1-9 RSQLite_1.0.0 scales_0.2.4
[40] sendmailR_1.2-1 splines_3.1.2 stringr_0.6.2
[43] survival_2.37-7 tools_3.1.2 XML_3.98-1.1
[46] xtable_1.7-4 XVector_0.6.0

deseq2 • 12k views

ADD COMMENT • link 9.2 years ago erin.gill81 ▴ 60

score 1 · Answer 1 · 2015-02-19

1

Entering edit mode

Michael Love 41k

@mikelove

Last seen 14 hours ago

United States

hi Erin,

Can you show what the column data looks like? It might be that there is some near-collinearity which poses a problem for the fitting algorithm. Also can you provide all the code you are using for construction and running DESeq2?

ADD COMMENT • link 9.2 years ago Michael Love 41k

score 0 · Answer 2 · 2015-02-19

0

Entering edit mode

erin.gill81 ▴ 60

@eringill81-6831

Last seen 4.7 years ago

Canada

The code that I'm using is as follows:

#start R

R

#read in new sample sheet 
samples = read.csv("samples.csv",header=TRUE)

#read in count matrix
counts = read.delim("sumtechreps.csv", col.names=samples$techreps)

#attach the column data to from the sample sheet to the variable coldata
coldata = with(samples,data.frame(shortname = I(shortname), celltype = celltype, run = run, stimulus = stimulus))

#attach the count data to the variable countdata
countdata = counts

#start DESeq2
library("DESeq2")

#construct your DESeq2 data set, making sure to specify the design matrix here
dds <- DESeqDataSetFromMatrix(countData = countdata,
colData = coldata,
design = ~ run + celltype + stimulus + celltype:stimulus)

#relevel controls
dds$celltype <- relevel(dds$celltype, "E")
dds$stimulus <- relevel(dds$stimulus, "C4")

#run DESeq2 on the dataset
dds <- DESeq(dds)

Here is what coldata looks like:

      shortname celltype run stimulus
1  2_E14-Co....        E  r2       C4
2     2_E14-CpG        E  r2      CpG
3  2_E14-TRIDAP        E  r2        T
4     2_E14-LPS        E  r2        L
5  2_E14-In....        E  r2      I24
6  2_IRAK1-....        I  r2       C4
7   2_IRAK1-CpG        I  r2      CpG
8  2_IRAK1-....        I  r2        T
9  2_IRAK1-....        I  r2      I24
10  2_IRAK1-LPS        I  r2        L
11    3_E14-CpG        E  r3      CpG
12 3_E14-TRIDAP        E  r3        T
13 3_E14-Co....        E  r3       C4
14    3_E14-LPS        E  r3        L
15 3_E14-In....        E  r3      I24
16  3_IRAK1-CpG        I  r3      CpG
17 3_IRAK1-....        I  r3        T
18 3_IRAK1-....        I  r3       C4
19  3_IRAK1-LPS        I  r3        L
20 3_IRAK1-....        I  r3      I24
21 1_E14-Co....        E  r1       C4
22    1_E14-LPS        E  r1        L
23    1_E14-CpG        E  r1      CpG
24 1_E14-TRIDAP        E  r1        T
25 1_Influe....        E  r1      I24
26  1_IRAK1-LPS        I  r1        L
27  1_IRAK1-CpG        I  r1      CpG
28 1_IRAK1-....        I  r1        T
29 1_IRAK1-....        I  r1       C4
30 1_IRAK1-....        I  r1      I24

ADD COMMENT • link 9.2 years ago erin.gill81 ▴ 60

2

Entering edit mode

Well there's nothing wrong with the design, it's perfectly balanced. Sometimes the GLM code can have an issue converging when there is a single count in a row of all 0's and there are many interactions terms. Can you try first subsetting out the genes with small counts? We don't have this kind of code in the manual, because these rows are omitted from multiple test correction automatically, but it is a perfectly valid filtering step. Some code which says we want a normalized count of at least 10 in two or more samples:

dds <- estimateSizeFactors(dds)
nc <- counts(dds, normalized=TRUE)
filter <- rowSums(nc >= 10) >= 2
dds <- dds[filter,]

...then run DESeq()

Hopefully this would eliminate the convergence problems. If not, ping this thread and I will take a look.

ADD REPLY • link 9.2 years ago Michael Love 41k

0

Entering edit mode

This has ameliorated the problem, but not completely solved it. Now instead of having 1650 rows not converge, I am left with 186 rows that don't converge. I also tried omitting genes with less than 15 counts in two or more samples. This gives me 131 rows that don't converge.

ADD REPLY • link 9.2 years ago erin.gill81 ▴ 60

1

Entering edit mode

Some more options, try increasing the 2 sample requirement to 3 or 4.

To deal with any remaining rows, you can either omit them from the results step: ddsClean <- dds[which(mcols(dds)$betaConv),], because these are typically genes with very small counts and little power, or you can increase the maximum iterations with the following code. Instead of running DESeq(), run:

dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
dds <- nbinomWaldTest(dds, maxit=500)

ADD REPLY • link 9.2 years ago Michael Love 41k

0

Entering edit mode

Thank you so much for your help. A combination of adding the filtering step and increasing the iterations have fixed the problem.

ADD REPLY • link 9.2 years ago erin.gill81 ▴ 60