Search
Question: Error in DESeq2 Unmix() function: L-BFGS-B needs finite values of 'fn'
0
5 months ago by
jennyl.smith120 wrote:

Hello,

I wanted to use the unmix() function from DESeq2 to investigate the mixing proportions of normal and tumor from bulk tumor RNAseq.  I was hoping to compare the results from using this function with A) normalized counts from the median ratio method and B) Transcripts per million (TPM).

First when I run the function using the median ratio normalized counts, I found that I receive the error, "Error in optim(par = rep(1, ncol(pure)), fn = distVST, gr = NULL, i, vst, : L-BFGS-B needs finite values of 'fn'".

Code that produced the error:

dds <- DESeqDataSetFromMatrix(countData = round(cts, digits = 0),
colData = blasts.anno,
design = ~1)

dds <- dds[rowSums(counts(dds)) > 10, ]
dim(dds)

dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds, fitType="parametric") #asymptDisp: 0.5999724

dispersionFunction(dds)

norm.cts.dds <- counts(dds, normalized=TRUE)

aml.dds <- norm.cts.dds[,grep("^PA", colnames(norm.cts.dds))]
nbm.dds <- norm.cts.dds[,grep("^RO|^BM", colnames(norm.cts.dds))]

unmix.dds <- unmix(x=aml.dds, pure = nbm.dds, alpha=0.5999724)

traceback():

Error in optim(par = rep(1, ncol(pure)), fn = distVST, gr = NULL, i, vst, : L-BFGS-B needs finite values of 'fn'
 5. optim(par = rep(1, ncol(pure)), fn = distVST, gr = NULL, i, vst, alpha, loss, method = "L-BFGS-B", lower = 0, upper = 100)
 4. FUN(X[[i]], ...)
 3. lapply(X[Split[[i]]], FUN, ...)
 2. lapply(seq_len(ncol(x)), function(i) { optim(par = rep(1, ncol(pure)), fn = distVST, gr = NULL, i, vst, alpha, loss, method = "L-BFGS-B", lower = 0, upper = 100)\$par ...
 1. unmix(x = aml.dds, pure = nbm.dds, alpha = 0.5999724)

Does anyone have any suggestions about where this error might be originating from? I updated all my bioconductor packages and tried to run it again, but it failed with the same error.

Also, should I be treating the tumors and the normals separately? Here you can see that I normalized the entire matrix of the tumors and the pure normals together, and then seperated them only for the unmix() function to run. Is that appropriate? My thoughts on doing it this way is that this ensure the same number of rows for both tumor and normals since rowSums(counts(dds)) > 10 would be different between the tumors and normals if I treated them as separate count matrices.

To use the function with a TPM matrix, my question is how to interpret the "shift" parameter that is required. I used the vsn::meanSdPlot() function, but I am unsure of what values are meant to be used in the "shift". Shift is defined as, "for TPMs, the shift which approximately stabilizes the variance of log shifted TPMs. Can be assessed with vsn::meanSdPlot"  in the documentation.

Any advice would be appreciated and thank you very much,

Jenny Smith

sessionInfo()

R version 3.4.0 (2017-04-21)

Platform: x86_64-pc-linux-gnu (64-bit)

Running under: Ubuntu 14.04.5 LTS

Matrix products: default

BLAS/LAPACK: /app/easybuild/software/OpenBLAS/0.2.18-GCC-5.4.0-2.26-LAPACK-3.6.1/lib/libopenblas_prescottp-r0.2.18.so

locale:

[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8

[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C

attached base packages:

[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:

[1] DESeq2_1.20.0              SummarizedExperiment_1.6.5 DelayedArray_0.2.7         matrixStats_0.53.1

[5] Biobase_2.36.2             GenomicRanges_1.28.6       GenomeInfoDb_1.12.3        IRanges_2.10.5

[9] S4Vectors_0.14.7           BiocGenerics_0.22.1        bindrcpp_0.2.2             pryr_0.1.4

[13] edgeR_3.18.1               limma_3.32.10              tidyr_0.6.1                tibble_1.4.2

[17] dplyr_0.7.5                ggplot2_2.2.1              magrittr_1.5               stringr_1.3.1

[21] knitr_1.20

loaded via a namespace (and not attached):

[1] vsn_3.44.0              bit64_0.9-7             splines_3.4.0           Formula_1.2-3

[5] assertthat_0.2.0        affy_1.54.0             latticeExtra_0.6-28     blob_1.1.1

[9] GenomeInfoDbData_0.99.0 yaml_2.1.19             RSQLite_2.1.1           pillar_1.2.3

[13] backports_1.1.2         lattice_0.20-35         glue_1.2.0              digest_0.6.15

[17] RColorBrewer_1.1-2      XVector_0.16.0          checkmate_1.8.5         colorspace_1.3-2

[21] htmltools_0.3.5         preprocessCore_1.38.0   Matrix_1.2-14           plyr_1.8.4

[25] XML_3.98-1.11           pkgconfig_2.0.1         pheatmap_1.0.8          genefilter_1.58.1

[29] zlibbioc_1.22.0         xtable_1.8-2            purrr_0.2.5             scales_0.4.1

[33] affyio_1.46.0           BiocParallel_1.10.1     annotate_1.54.0         htmlTable_1.9

[37] pbapply_1.3-4           nnet_7.3-12             lazyeval_0.2.1          survival_2.41-3

[41] memoise_1.1.0           foreign_0.8-70          BiocInstaller_1.26.1    data.table_1.11.4

[45] tools_3.4.0             munsell_0.5.0           locfit_1.5-9.1          cluster_2.0.7-1

[49] AnnotationDbi_1.38.2    compiler_3.4.0          rlang_0.2.1             grid_3.4.0

[53] RCurl_1.95-4.10         htmlwidgets_1.2         bitops_1.0-6            base64enc_0.1-3

[57] gtable_0.2.0            codetools_0.2-15        DBI_1.0.0               R6_2.2.2

[61] gridExtra_2.3           bit_1.1-14              bindr_0.1.1             Hmisc_4.0-2

[65] stringi_1.2.3           Rcpp_0.12.17            geneplotter_1.54.0      rpart_4.1-13

[69] acepack_1.4.1           tidyselect_0.2.4      
modified 5 months ago by Michael Love20k • written 5 months ago by jennyl.smith120
0
5 months ago by
Michael Love20k
United States
Michael Love20k wrote:

What is the correlation on log scale of the "pure" samples. E.g. you can do: cor(assay(vsd)) if vsd is an object with just the pure samples subsetted after running vst() on the dds. Having very correlated pure samples, even two of them, makes the optimization not converge, which makes sense: you can use X of one of the pure samples, or X of the other, of 1/2 X of both, all of these are then equally good at explaining the data.

If I understand you correctly, it should be fine to do the normalization and filtering on all the samples. unmix() then looks one sample at a time, totally independently. However, if there are large differences in the samples like tumor vs normal, this should be in the design so that the dispersions are properly estimated.

Re: shift, you can see how log2(x + shift) looks in the meanSdPlot. You want to stabilize the trend of most of the genes (not all the genes will be flat, this is ok, the important thing is the trend.)

Hi Michael,

Thank you for your response. I examined the correlation of the variance stabilized transformed counts and we do have very correlated pure samples.  The minimum correlation found was still 0.95.  I tried to run the algorithm again using those two pure samples (cor = 0.95), and  I was able to get to 24% progress completed, before the function returned the same error as in the title of this post.  I would suppose I would end up with the same issue, where the algorithm cannot converge using a TPM matrix, since my pure samples are so highly correlated. Thank you again for you time and assistance!