I'm attempting to implement sva on my RNA-Seq count data matrix. I have a table, myFactors, which describes various potential sources of known (e.g., myFactors$Condition - control, condition1, condition2) and suspected (e.g., myFactors$Date - date of sample preparation) variation.
I've followed along with the sva tutorial, but I'm getting an error I cannot interpret. Does anyone know what it means when it says 'x' contains missing values, or NaNs produced? Any guesses what I'm doing wrong here?
Thanks for your help. It seems like a number of people have asked this question previously, but I have not found any posted answers.
> myFactors <- read.csv('Location/On/Computer') # myFactors is 17 obs x 15 variable data.frame > data <- as.matrix(RNASeqCountDataVariable) # data is matrix (63,677 genes in rows, 17 samps in columns) # remove rows from data where 0 expression across all genes > row_sub = apply(data,1,function(x) !all(x==0) ) > data_noZeros <- data[row_sub,] # data_noZeros is matrix, 33,704 x 17 > myModel <- model.matrix(~as.factor(Condition), data=myFactors) > nullModel <- model.matrix(~1, data=myFactors) > numLatentFactors <- num.sv(data_noZeros, myModel, method="leek") > sva_obj <- sva(data_noZeros, myModel, nullModel, n.sv=numLatentFactors)
Number of significant surrogate variables is: 14
Iteration (out of 5 ):
Error in density.default(x, adjust = adj) : 'x' contains missing values
In addition: Warning message:
In pf(fstats, df1 = (df1 - df0), df2 = (n - df1)) : NaNs produced
Output of sessionInfo():
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.2 (El Capitan)
attached base packages:
 grid parallel stats4 stats graphics grDevices utils datasets methods
other attached packages:
 bladderbatch_1.8.0 limma_3.26.5 pamr_1.55
 survival_2.38-3 cluster_2.0.3 sva_3.18.0
 genefilter_1.52.0 mgcv_1.8-11 nlme_3.1-124
 VennDiagram_1.6.16 futile.logger_1.4.1 RColorBrewer_1.1-2
 gplots_2.17.0 biomaRt_2.26.1 DESeq2_1.10.1
 RcppArmadillo_0.6.400.2.2 Rcpp_0.12.3 SummarizedExperiment_1.0.2
 Biobase_2.30.0 GenomicRanges_1.22.3 GenomeInfoDb_1.6.3
 IRanges_2.4.6 S4Vectors_0.8.7 BiocGenerics_0.16.1
loaded via a namespace (and not attached):
 gtools_3.5.0 locfit_1.5-9.1 splines_3.2.2 lattice_0.20-33
 colorspace_1.2-6 XML_3.98-1.3 foreign_0.8-66 DBI_0.3.1
 BiocParallel_1.4.3 lambda.r_1.1.7 plyr_1.8.3 zlibbioc_1.16.0
 munsell_0.4.2 gtable_0.1.2 caTools_1.17.1 latticeExtra_0.6-26
 geneplotter_1.48.0 AnnotationDbi_1.32.3 acepack_1.3-3.3 KernSmooth_2.23-15
 xtable_1.8-0 scales_0.3.0 gdata_2.17.0 Hmisc_3.17-1
 annotate_1.48.0 XVector_0.10.0 gridExtra_2.0.0 ggplot2_2.0.0
 tools_3.2.2 bitops_1.0-6 RCurl_1.95-4.7 RSQLite_1.0.0
 Formula_1.2-1 futile.options_1.0.0 Matrix_1.2-3 rpart_4.1-10
**EDIT 1** I get the same error when I remove all rows with low variance instead of rows that are all 0s, using the following:
getVar <- apply(data,1,var) param <- 1 data_removeNoVariance <- data[getVar > param& !is.na(getVar),]
Though, granted, I'm not sure if there's a different variance threshold I should set to make it work with SVA...
**EDIT 2** I think I found the piece of code causing the bug here. When I put this code into a script, remove the function, and just declare dat = data_noZeros, mod = myModel, etc., the code works perfectly - i.e., when I do this:
>tst <- pf(fstats,df1=(df1-df0),df2=(n-df1)) >sum(is.nan(tst))
So there aren't any NaNs being made when I write this function like it's a script. Why would calling the function through sva result in an error...?