Hi, I am relatively new to R and the DESeq 2 package but do have a background in software engineering. Currently I am trying to run some analysis on pathway abundance within samples using DESeq2 package. As these pathways are not all present in every sample, the data frame has been left with 0s in at least one spot in every row. I attempted to use my own geometric means and using type="poscounts" as well, with both leading to the same issue. Most issues I have been able to solve with posts from previous people, however this specific error was briefly mentioned in a post 3 years ago without identifying the cause of the issue with all solutions commented not working in my case(link to the thread DESeq estimateSizeFactors() is returning error "Error: all(!is.na(value)) is not TRUE").
After using the debugger from Rstudio, the NA values are being generated within the method estimateSizeFactorsForMatrix lines 44-46
if (incomingGeoMeans) {
sf <- sf/exp(mean(log(sf)))
}
what ends up happening here is that the calculation sf/exp(mean(log(sf))) is saying that every number is NA. However within debugger (giving me temp access to sf), I was able to manually calculate every sf value within the sf vector.
Thus
if (incomingGeoMeans) {
sf <- sf/exp(mean(log(sf)))
}
produces values that are all NA
while
temp <- c()
for(I in 1:length(sf)){
temp <- append(temp,sf[I]/exp(mean(log(sf[I]))))
}
sf<-temp
produces actual geometric values
**note, code could probably be cleaner but I'm still new to the R syntax and project needs to finish by august so if it works it works
I am not sure why this is happening but this leads me to the question, What would be the best way to edit the source code of estimateSizeFactorsForMatrix such that: my changes are applied to the estimateSizeFactorsForMatrix function and the DESeq2 method estimateSizeFactors will call the new method instead of the originally integrated one.
error message:
In DESeqDataSet(se, design = design, ignoreRank) :
Error in .local(object, ..., value) : all(!is.na(value)) is not TRUE
estimateSizeFactorsForMatrix pulled from debugger viewer
function (counts, locfunc = stats::median, geoMeans, controlGenes,
type = c("ratio", "poscounts"))
{
type <- match.arg(type, c("ratio", "poscounts"))
if (missing(geoMeans)) {
incomingGeoMeans <- FALSE
if (type == "ratio") {
loggeomeans <- rowMeans(log(counts))
}
else if (type == "poscounts") {
lc <- log(counts)
lc[!is.finite(lc)] <- 0
loggeomeans <- rowMeans(lc)
allZero <- rowSums(counts) == 0
loggeomeans[allZero] <- -Inf
}
}
else {
incomingGeoMeans <- TRUE
if (length(geoMeans) != nrow(counts)) {
stop("geoMeans should be as long as the number of rows of counts")
}
loggeomeans <- log(geoMeans)
}
if (all(is.infinite(loggeomeans))) {
stop("every gene contains at least one zero, cannot compute log geometric means")
}
sf <- if (missing(controlGenes)) {
apply(counts, 2, function(cnts) {
exp(locfunc((log(cnts) - loggeomeans)[is.finite(loggeomeans) &
cnts > 0]))
})
}
else {
if (!(is.numeric(controlGenes) | is.logical(controlGenes))) {
stop("controlGenes should be either a numeric or logical vector")
}
loggeomeansSub <- loggeomeans[controlGenes]
apply(counts[controlGenes, , drop = FALSE], 2, function(cnts) {
exp(locfunc((log(cnts) - loggeomeansSub)[is.finite(loggeomeansSub) &
cnts > 0]))
})
}
if (incomingGeoMeans) {
sf <- sf/exp(mean(log(sf)))
}
sf
}
SessionInfo
sessionInfo( )
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.4
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] tidyr_1.1.3 psych_2.1.6 BiocParallel_1.26.0 DESeq2_1.32.0
[5] SummarizedExperiment_1.22.0 Biobase_2.52.0 MatrixGenerics_1.4.0 matrixStats_0.59.0
[9] GenomicRanges_1.44.0 GenomeInfoDb_1.28.0 IRanges_2.26.0 S4Vectors_0.30.0
[13] BiocGenerics_0.38.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 locfit_1.5-9.4 lattice_0.20-44 png_0.1-7 Biostrings_2.60.1
[6] utf8_1.2.1 R6_2.5.0 RSQLite_2.2.7 httr_1.4.2 ggplot2_3.3.4
[11] pillar_1.6.1 zlibbioc_1.38.0 rlang_0.4.11 annotate_1.70.0 blob_1.2.1
[16] Matrix_1.3-4 splines_4.1.0 geneplotter_1.70.0 RCurl_1.98-1.3 bit_4.0.4
[21] munsell_0.5.0 DelayedArray_0.18.0 compiler_4.1.0 pkgconfig_2.0.3 mnormt_2.0.2
[26] tmvnsim_1.0-2 tidyselect_1.1.1 KEGGREST_1.32.0 tibble_3.1.2 GenomeInfoDbData_1.2.6
[31] XML_3.99-0.6 fansi_0.5.0 crayon_1.4.1 dplyr_1.0.7 bitops_1.0-7
[36] grid_4.1.0 nlme_3.1-152 xtable_1.8-4 gtable_0.3.0 lifecycle_1.0.0
[41] DBI_1.1.1 magrittr_2.0.1 scales_1.1.1 cachem_1.0.5 XVector_0.32.0
[46] genefilter_1.74.0 generics_0.1.0 ellipsis_0.3.2 vctrs_0.3.8 RColorBrewer_1.1-2
[51] tools_4.1.0 bit64_4.0.5 glue_1.4.2 purrr_0.3.4 fastmap_1.1.0
[56] survival_3.2-11 AnnotationDbi_1.54.1 colorspace_2.0-1 memoise_2.0.0
Thank you for explaining the purpose of that line. The input data is ~420 microbe pathway abundances for 72 individual samples all obtained by humann3. The layout is Pathways (row) v samples (col) with the abundances from humann files as the values. In the case a pathway was not present in the sample, the abundance would be assigned 0. As no microbe pathway exists in every sample, it created a situation where there is at least one 0 for every pathway. When I looked into how they resolved their issue of 0's in every row, I would be brought to calculating the geometric means, then using the estimateSizeFactor method to then progress forward with DESeq analysis. Which resulted in the error previously mentioned. If I can get past this part, my group wants to use the method DESeq with the same designs as used on similar data. The analysis is primarily for metagenomic analysis across a longitudinal study.
And to get a better grasp on that line, why would centering the already calculated geomeans result in NA?
Thank you for the help
The error was due to a column/sample having no abundances. this was due to a naming error to the column name. Due to this, my program wasn't able to assign the abundances from the file to the corresponding sample, resulting in this sample being NA after the sf calculation.
In the end, remove any columns with a sum of 0 and the method should work.
After some digging, I figured out the issue and resolved it. As you suspected/hinted, it was my input data that was faulty and causing the error. Sorry about taking up your time on this