estimateSizeFactors throws "Error in .local(object, ..., value) : all(!is.na(value)) is not TRUE". Identified root cause within estimateSizeFactorsForMatrix lines 44-46. I want to do a custom hotfix
1
0
Entering edit mode
@fd8f3402
Last seen 15 days ago
United States

Hi, I am relatively new to R and the DESeq 2 package but do have a background in software engineering. Currently I am trying to run some analysis on pathway abundance within samples using DESeq2 package. As these pathways are not all present in every sample, the data frame has been left with 0s in at least one spot in every row. I attempted to use my own geometric means and using type="poscounts" as well, with both leading to the same issue. Most issues I have been able to solve with posts from previous people, however this specific error was briefly mentioned in a post 3 years ago without identifying the cause of the issue with all solutions commented not working in my case(link to the thread DESeq estimateSizeFactors() is returning error "Error: all(!is.na(value)) is not TRUE").

After using the debugger from Rstudio, the NA values are being generated within the method estimateSizeFactorsForMatrix lines 44-46

if (incomingGeoMeans) {
sf <- sf/exp(mean(log(sf)))
}


what ends up happening here is that the calculation sf/exp(mean(log(sf))) is saying that every number is NA. However within debugger (giving me temp access to sf), I was able to manually calculate every sf value within the sf vector.

Thus

if (incomingGeoMeans) {
sf <- sf/exp(mean(log(sf)))
}


produces values that are all NA

while

temp <- c()
for(I in 1:length(sf)){
temp <- append(temp,sf[I]/exp(mean(log(sf[I]))))
}
sf<-temp


produces actual geometric values

**note, code could probably be cleaner but I'm still new to the R syntax and project needs to finish by august so if it works it works

I am not sure why this is happening but this leads me to the question, What would be the best way to edit the source code of estimateSizeFactorsForMatrix such that: my changes are applied to the estimateSizeFactorsForMatrix function and the DESeq2 method estimateSizeFactors will call the new method instead of the originally integrated one.

error message:

In DESeqDataSet(se, design = design, ignoreRank) :

Error in .local(object, ..., value) : all(!is.na(value)) is not TRUE


estimateSizeFactorsForMatrix pulled from debugger viewer

function (counts, locfunc = stats::median, geoMeans, controlGenes,
type = c("ratio", "poscounts"))
{
type <- match.arg(type, c("ratio", "poscounts"))
if (missing(geoMeans)) {
incomingGeoMeans <- FALSE
if (type == "ratio") {
loggeomeans <- rowMeans(log(counts))
}
else if (type == "poscounts") {
lc <- log(counts)
lc[!is.finite(lc)] <- 0
loggeomeans <- rowMeans(lc)
allZero <- rowSums(counts) == 0
loggeomeans[allZero] <- -Inf
}
}
else {
incomingGeoMeans <- TRUE
if (length(geoMeans) != nrow(counts)) {
stop("geoMeans should be as long as the number of rows of counts")
}
loggeomeans <- log(geoMeans)
}
if (all(is.infinite(loggeomeans))) {
stop("every gene contains at least one zero, cannot compute log geometric means")
}
sf <- if (missing(controlGenes)) {
apply(counts, 2, function(cnts) {
exp(locfunc((log(cnts) - loggeomeans)[is.finite(loggeomeans) &
cnts > 0]))
})
}
else {
if (!(is.numeric(controlGenes) | is.logical(controlGenes))) {
stop("controlGenes should be either a numeric or logical vector")
}
loggeomeansSub <- loggeomeans[controlGenes]
apply(counts[controlGenes, , drop = FALSE], 2, function(cnts) {
exp(locfunc((log(cnts) - loggeomeansSub)[is.finite(loggeomeansSub) &
cnts > 0]))
})
}
if (incomingGeoMeans) {
sf <- sf/exp(mean(log(sf)))
}
sf
}


SessionInfo


sessionInfo( )
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.4

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] tidyr_1.1.3                 psych_2.1.6                 BiocParallel_1.26.0         DESeq2_1.32.0
[5] SummarizedExperiment_1.22.0 Biobase_2.52.0              MatrixGenerics_1.4.0        matrixStats_0.59.0
[9] GenomicRanges_1.44.0        GenomeInfoDb_1.28.0         IRanges_2.26.0              S4Vectors_0.30.0
[13] BiocGenerics_0.38.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.6             locfit_1.5-9.4         lattice_0.20-44        png_0.1-7              Biostrings_2.60.1
[6] utf8_1.2.1             R6_2.5.0               RSQLite_2.2.7          httr_1.4.2             ggplot2_3.3.4
[11] pillar_1.6.1           zlibbioc_1.38.0        rlang_0.4.11           annotate_1.70.0        blob_1.2.1
[16] Matrix_1.3-4           splines_4.1.0          geneplotter_1.70.0     RCurl_1.98-1.3         bit_4.0.4
[21] munsell_0.5.0          DelayedArray_0.18.0    compiler_4.1.0         pkgconfig_2.0.3        mnormt_2.0.2
[26] tmvnsim_1.0-2          tidyselect_1.1.1       KEGGREST_1.32.0        tibble_3.1.2           GenomeInfoDbData_1.2.6
[31] XML_3.99-0.6           fansi_0.5.0            crayon_1.4.1           dplyr_1.0.7            bitops_1.0-7
[36] grid_4.1.0             nlme_3.1-152           xtable_1.8-4           gtable_0.3.0           lifecycle_1.0.0
[41] DBI_1.1.1              magrittr_2.0.1         scales_1.1.1           cachem_1.0.5           XVector_0.32.0
[46] genefilter_1.74.0      generics_0.1.0         ellipsis_0.3.2         vctrs_0.3.8            RColorBrewer_1.1-2
[51] tools_4.1.0            bit64_4.0.5            glue_1.4.2             purrr_0.3.4            fastmap_1.1.0
[56] survival_3.2-11        AnnotationDbi_1.54.1   colorspace_2.0-1       memoise_2.0.0

DESeq2 • 187 views
0
Entering edit mode
@mikelove
Last seen 11 minutes ago
United States

However, your code does not produce the correct output. The point of that line of code is to center the size factors on 1. Your code sets the size factors to all 1.

Let's step back, what is the input data you are providing to DESeq2? What are the features (rows) and what are the samples (columns) and how many are there?

Typically, RNA-seq or other genomic data provided to DESeq2 has features with all samples having non-zero counts. And then if not, you can try alternative estimators, but first let's figure out what kind of data you are trying to provide to the functions.

0
Entering edit mode

Thank you for explaining the purpose of that line. The input data is ~420 microbe pathway abundances for 72 individual samples all obtained by humann3. The layout is Pathways (row) v samples (col) with the abundances from humann files as the values. In the case a pathway was not present in the sample, the abundance would be assigned 0. As no microbe pathway exists in every sample, it created a situation where there is at least one 0 for every pathway. When I looked into how they resolved their issue of 0's in every row, I would be brought to calculating the geometric means, then using the estimateSizeFactor method to then progress forward with DESeq analysis. Which resulted in the error previously mentioned. If I can get past this part, my group wants to use the method DESeq with the same designs as used on similar data. The analysis is primarily for metagenomic analysis across a longitudinal study.

And to get a better grasp on that line, why would centering the already calculated geomeans result in NA?

Thank you for the help

0
Entering edit mode

The error was due to a column/sample having no abundances. this was due to a naming error to the column name. Due to this, my program wasn't able to assign the abundances from the file to the corresponding sample, resulting in this sample being NA after the sf calculation.

In the end, remove any columns with a sum of 0 and the method should work.

0
Entering edit mode

After some digging, I figured out the issue and resolved it. As you suspected/hinted, it was my input data that was faulty and causing the error. Sorry about taking up your time on this