Hello,

I wanted to use the unmix() function from DESeq2 to investigate the mixing proportions of normal and tumor from bulk tumor RNAseq. I was hoping to compare the results from using this function with A) normalized counts from the median ratio method and B) Transcripts per million (TPM).

First when I run the function using the median ratio normalized counts, I found that I receive the error, "Error in optim(par = rep(1, ncol(pure)), fn = distVST, gr = NULL, i, vst, : L-BFGS-B needs finite values of 'fn'".

Code that produced the error:

```
dds <- DESeqDataSetFromMatrix(countData = round(cts, digits = 0),
colData = blasts.anno,
design = ~1)
```

dds <- dds[rowSums(counts(dds)) > 10, ]
dim(dds)
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds, fitType="parametric") #asymptDisp: 0.5999724
dispersionFunction(dds)
norm.cts.dds <- counts(dds, normalized=TRUE)
aml.dds <- norm.cts.dds[,grep("^PA", colnames(norm.cts.dds))]
nbm.dds <- norm.cts.dds[,grep("^RO|^BM", colnames(norm.cts.dds))]
unmix.dds <- unmix(x=aml.dds, pure = nbm.dds, alpha=0.5999724)
traceback():
Error in optim(par = rep(1, ncol(pure)), fn = distVST, gr = NULL, i, vst, : L-BFGS-B needs finite values of 'fn'

5. |
optim(par = rep(1, ncol(pure)), fn = distVST, gr = NULL, i, vst, alpha, loss, method = "L-BFGS-B", lower = 0, upper = 100) |

4. |
FUN(X[[i]], ...) |

3. |
lapply(X[Split[[i]]], FUN, ...) |

2. |
lapply(seq_len(ncol(x)), function(i) { optim(par = rep(1, ncol(pure)), fn = distVST, gr = NULL, i, vst, alpha, loss, method = "L-BFGS-B", lower = 0, upper = 100)$par ... |

1. |
unmix(x = aml.dds, pure = nbm.dds, alpha = 0.5999724) |

Does anyone have any suggestions about where this error might be originating from? I updated all my bioconductor packages and tried to run it again, but it failed with the same error.

Also, should I be treating the tumors and the normals separately? Here you can see that I normalized the entire matrix of the tumors and the pure normals together, and then seperated them only for the unmix() function to run. Is that appropriate? My thoughts on doing it this way is that this ensure the same number of rows for both tumor and normals since `rowSums(counts(dds)) > 10`

would be different between the tumors and normals if I treated them as separate count matrices.

To use the function with a TPM matrix, my question is how to interpret the "shift" parameter that is required. I used the vsn::meanSdPlot() function, but I am unsure of what values are meant to be used in the "shift". Shift is defined as, "for TPMs, the shift which approximately stabilizes the variance of log shifted TPMs. Can be assessed with `vsn::meanSdPlot`

" in the documentation.

Any advice would be appreciated and thank you very much,

Jenny Smith

sessionInfo() R version 3.4.0 (2017-04-21) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.5 LTS Matrix products: default BLAS/LAPACK: /app/easybuild/software/OpenBLAS/0.2.18-GCC-5.4.0-2.26-LAPACK-3.6.1/lib/libopenblas_prescottp-r0.2.18.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] DESeq2_1.20.0 SummarizedExperiment_1.6.5 DelayedArray_0.2.7 matrixStats_0.53.1 [5] Biobase_2.36.2 GenomicRanges_1.28.6 GenomeInfoDb_1.12.3 IRanges_2.10.5 [9] S4Vectors_0.14.7 BiocGenerics_0.22.1 bindrcpp_0.2.2 pryr_0.1.4 [13] edgeR_3.18.1 limma_3.32.10 tidyr_0.6.1 tibble_1.4.2 [17] dplyr_0.7.5 ggplot2_2.2.1 magrittr_1.5 stringr_1.3.1 [21] knitr_1.20 loaded via a namespace (and not attached): [1] vsn_3.44.0 bit64_0.9-7 splines_3.4.0 Formula_1.2-3 [5] assertthat_0.2.0 affy_1.54.0 latticeExtra_0.6-28 blob_1.1.1 [9] GenomeInfoDbData_0.99.0 yaml_2.1.19 RSQLite_2.1.1 pillar_1.2.3 [13] backports_1.1.2 lattice_0.20-35 glue_1.2.0 digest_0.6.15 [17] RColorBrewer_1.1-2 XVector_0.16.0 checkmate_1.8.5 colorspace_1.3-2 [21] htmltools_0.3.5 preprocessCore_1.38.0 Matrix_1.2-14 plyr_1.8.4 [25] XML_3.98-1.11 pkgconfig_2.0.1 pheatmap_1.0.8 genefilter_1.58.1 [29] zlibbioc_1.22.0 xtable_1.8-2 purrr_0.2.5 scales_0.4.1 [33] affyio_1.46.0 BiocParallel_1.10.1 annotate_1.54.0 htmlTable_1.9 [37] pbapply_1.3-4 nnet_7.3-12 lazyeval_0.2.1 survival_2.41-3 [41] memoise_1.1.0 foreign_0.8-70 BiocInstaller_1.26.1 data.table_1.11.4 [45] tools_3.4.0 munsell_0.5.0 locfit_1.5-9.1 cluster_2.0.7-1 [49] AnnotationDbi_1.38.2 compiler_3.4.0 rlang_0.2.1 grid_3.4.0 [53] RCurl_1.95-4.10 htmlwidgets_1.2 bitops_1.0-6 base64enc_0.1-3 [57] gtable_0.2.0 codetools_0.2-15 DBI_1.0.0 R6_2.2.2 [61] gridExtra_2.3 bit_1.1-14 bindr_0.1.1 Hmisc_4.0-2 [65] stringi_1.2.3 Rcpp_0.12.17 geneplotter_1.54.0 rpart_4.1-13 [69] acepack_1.4.1 tidyselect_0.2.4

Hi Michael,

Thank you for your response. I examined the correlation of the variance stabilized transformed counts and we do have very correlated pure samples. The minimum correlation found was still 0.95. I tried to run the algorithm again using those two pure samples (cor = 0.95), and I was able to get to 24% progress completed, before the function returned the same error as in the title of this post. I would suppose I would end up with the same issue, where the algorithm cannot converge using a TPM matrix, since my pure samples are so highly correlated. Thank you again for you time and assistance!