Question: Error running summarizeToGene() - RNA Seq Bioconductor pipeline
1
0
Entering edit mode
anthony.nash ▴ 20
@anthonynash-14843
Last seen 4.2 years ago
University of Oxford

Hi all, I'm seeing a bit of an unusually error that I've never come across before - hoping for some assistance. I had posted to Biostars, recommendations were to put it up here.

I quantified pair-end RNA seq data using Salmon, loaded it into RStudio via tximeta and continued with the general Bioconductor DESeq2 pipeline structure. Everything is working as expected. However, when I ask one of my project students to rerun the analysis on their machine, he runs into the error:

gse <- summarizeToGene(se)
loading existing EnsDb created: 2020-09-21 10:50:18
obtaining transcript-to-gene mapping from database
loading existing gene ranges created: 2020-09-21 10:56:56
summarizing abundance
summarizing counts
summarizing length
Error in abundanceMatTx * lengthMatTx : 
  non-numeric argument to binary operator

I've never come across this error before. From the face of things, R is complaining that a non-numeric value eg. a name of something is being passed into a binary operator e.g., "+". I'm having the student double check the meta-data that makes up the prior call to se <- tximeta(colDataDF). However, one thought that came to mind were the quant.sf files generated by Salmon.

Due to Covid lockdown we're having to do all of this remotely and sadly, the project student doesn't have the bandwidth at home to download the RNA seq data and process the reads using Salmon. Therefore, he's using the quantified .sf files I generated, a fraction of the size. My initial concern is some machine-dependent association with the quant.sf files.

I noticed we had different versions of tximeta, so I brought my version up to 1.7.14. There are quite a few changes in the object attributes returned by summarizeToGene between my original version of tximeta (1.7.4) and the later version, however, my machine, can still execute summarizeToGene() without throwing an error.

We tried using tximport as an alternative to tximeta. The tximport::summarizeToGene() yields a valid object on my machine, however, fails with a similar error to the one above on my students machine.

This is my R setup.

SessionInfo()

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] KEGGgraph_1.49.1            gprofiler2_0.2.0            ReportingTools_2.29.0      
 [4] knitr_1.29                  org.Hs.eg.db_3.11.4         AnnotationDbi_1.51.3       
 [7] RColorBrewer_1.1-2          pheatmap_1.0.12             genefilter_1.71.0          
[10] magrittr_1.5                dplyr_1.0.1                 GGally_2.0.0               
[13] tidyr_1.1.2                 broom_0.7.0                 ggplot2_3.3.2              
[16] vsn_3.57.0                  DESeq2_1.29.13              tximeta_1.7.14             
[19] SummarizedExperiment_1.19.6 DelayedArray_0.15.4         matrixStats_0.56.0         
[22] Matrix_1.2-18               Biobase_2.49.0              GenomicRanges_1.41.5       
[25] GenomeInfoDb_1.25.10        IRanges_2.23.10             S4Vectors_0.27.12          
[28] BiocGenerics_0.35.4        

loaded via a namespace (and not attached):
  [1] backports_1.1.10              GOstats_2.55.0                Hmisc_4.4-1                  
  [4] AnnotationHub_2.21.5          BiocFileCache_1.13.1          plyr_1.8.6                   
  [7] lazyeval_0.2.2                GSEABase_1.51.1               splines_4.0.2                
 [10] BiocParallel_1.23.2           digest_0.6.25                 ensembldb_2.13.1             
 [13] htmltools_0.5.0               GO.db_3.11.4                  checkmate_2.0.0              
 [16] memoise_1.1.0                 BSgenome_1.57.6               cluster_2.1.0                
 [19] limma_3.45.14                 Biostrings_2.57.2             annotate_1.67.1              
 [22] R.utils_2.10.1                ggbio_1.37.0                  askpass_1.1                  
 [25] prettyunits_1.1.1             jpeg_0.1-8.1                  colorspace_1.4-1             
 [28] blob_1.2.1                    rappdirs_0.3.1                xfun_0.16                    
 [31] crayon_1.3.4                  RCurl_1.98-1.2                jsonlite_1.7.1               
 [34] tximport_1.17.6               graph_1.67.1                  VariantAnnotation_1.35.3     
 [37] survival_3.2-3                glue_1.4.1                    gtable_0.3.0                 
 [40] zlibbioc_1.35.0               XVector_0.29.3                Rgraphviz_2.33.0             
 [43] scales_1.1.1                  DBI_1.1.0                     edgeR_3.31.4                 
 [46] Rcpp_1.0.5                    viridisLite_0.3.0             xtable_1.8-4                 
 [49] progress_1.2.2                htmlTable_2.1.0               foreign_0.8-80               
 [52] bit_4.0.4                     OrganismDbi_1.31.0            preprocessCore_1.51.0        
 [55] Formula_1.2-3                 AnnotationForge_1.31.2        htmlwidgets_1.5.1            
 [58] httr_1.4.2                    ellipsis_0.3.1                R.methodsS3_1.8.1            
 [61] pkgconfig_2.0.3               reshape_0.8.8                 XML_3.99-0.5                 
 [64] nnet_7.3-14                   dbplyr_1.4.4                  locfit_1.5-9.4               
 [67] tidyselect_1.1.0              rlang_0.4.7                   reshape2_1.4.4               
 [70] later_1.1.0.1                 munsell_0.5.0                 BiocVersion_3.12.0           
 [73] tools_4.0.2                   generics_0.0.2                RSQLite_2.2.0                
 [76] stringr_1.4.0                 fastmap_1.0.1                 yaml_2.2.1                   
 [79] bit64_4.0.5                   purrr_0.3.4                   AnnotationFilter_1.13.0      
 [82] RBGL_1.65.0                   mime_0.9                      R.oo_1.24.0                  
 [85] biomaRt_2.45.2                compiler_4.0.2                rstudioapi_0.11              
 [88] plotly_4.9.2.1                curl_4.3                      png_0.1-7                    
 [91] interactiveDisplayBase_1.27.5 affyio_1.59.0                 PFAM.db_3.11.4               
 [94] tibble_3.0.3                  geneplotter_1.67.0            stringi_1.5.3                
 [97] GenomicFeatures_1.41.3        lattice_0.20-41               ProtGenerics_1.21.0          
[100] vctrs_0.3.2                   pillar_1.4.6                  lifecycle_0.2.0              
[103] BiocManager_1.30.10           data.table_1.13.0             bitops_1.0-6                 
[106] httpuv_1.5.4                  rtracklayer_1.49.5            hwriter_1.3.2                
[109] R6_2.4.1                      latticeExtra_0.6-29           affy_1.67.0                  
[112] promises_1.1.1                gridExtra_2.3                 dichromat_2.0-0              
[115] assertthat_0.2.1              openssl_1.4.3                 Category_2.55.0              
[118] withr_2.2.0                   GenomicAlignments_1.25.3      Rsamtools_2.5.3              
[121] GenomeInfoDbData_1.2.3        hms_0.5.3                     grid_4.0.2                   
[124] rpart_4.1-15                  biovizBase_1.37.0             shiny_1.5.0                  
[127] base64enc_0.1-3

This is the setup that fails to generate a summarizeToGene result:

R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base
other attached packages:
 [1] tximeta_1.7.14              tximport_1.17.6             SummarizedExperiment_1.19.6
 [4] DelayedArray_0.15.8         matrixStats_0.56.0          Matrix_1.2-18
 [7] Biobase_2.49.1              GenomicRanges_1.41.6        GenomeInfoDb_1.25.11
[10] IRanges_2.23.10             S4Vectors_0.27.13           BiocGenerics_0.35.4
loaded via a namespace (and not attached):
 [1] httr_1.4.2                    jsonlite_1.7.1                bit64_4.0.5
 [4] AnnotationHub_2.21.5          shiny_1.5.0                   assertthat_0.2.1
 [7] askpass_1.1                   interactiveDisplayBase_1.27.5 BiocManager_1.30.10
[10] BiocFileCache_1.13.1          blob_1.2.1                    Rsamtools_2.5.3
[13] GenomeInfoDbData_1.2.3        yaml_2.2.1                    progress_1.2.2
[16] BiocVersion_3.12.0            pillar_1.4.6                  RSQLite_2.2.0
[19] lattice_0.20-41               glue_1.4.2                    digest_0.6.25
[22] promises_1.1.1                XVector_0.29.3                htmltools_0.5.0
[25] httpuv_1.5.4                  XML_3.99-0.5                  pkgconfig_2.0.3
[28] biomaRt_2.45.2                zlibbioc_1.35.0               purrr_0.3.4
[31] xtable_1.8-4                  later_1.1.0.1                 BiocParallel_1.23.2
[34] tibble_3.0.3                  openssl_1.4.3                 AnnotationFilter_1.13.0
[37] generics_0.0.2                ellipsis_0.3.1                GenomicFeatures_1.41.3
[40] lazyeval_0.2.2                magrittr_1.5                  crayon_1.3.4
[43] mime_0.9                      memoise_1.1.0                 evaluate_0.14
[46] tools_4.0.2                   prettyunits_1.1.1             hms_0.5.3
[49] lifecycle_0.2.0               stringr_1.4.0                 ensembldb_2.13.1
[52] AnnotationDbi_1.51.3          Biostrings_2.57.2             compiler_4.0.2
[55] tinytex_0.25                  rlang_0.4.7                   grid_4.0.2
[58] RCurl_1.98-1.2                rstudioapi_0.11               rappdirs_0.3.1
[61] bitops_1.0-6                  rmarkdown_2.3                 DBI_1.1.0
[64] curl_4.3                      R6_2.4.1                      GenomicAlignments_1.25.3
[67] rtracklayer_1.49.5            knitr_1.29                    dplyr_1.0.2
[70] fastmap_1.0.1                 bit_4.0.4                     ProtGenerics_1.21.0
[73] stringi_1.5.3                 Rcpp_1.0.5                    vctrs_0.3.4
[76] dbplyr_1.4.4                  tidyselect_1.1.0              xfun_0.17

Suggestions are most welcome.

tximeta tximport deseq2 • 1.7k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 16 hours ago
United States

Thanks for the detailed report. I wonder if the student somehow obtained a different DB than you. This shouldn’t happen because it is done programmatically, but let’s check that. You can load the BiocFileCache:

bfc <- BiocFileCache()
bfcinfo(bfc)

Maybe inspecting the origin of the EnsDb might help.

Also you can pull it down with:

retrieveDb(se)

Somehow the summary operation is not the same on each machine which shouldn’t occur, so I’m a bit confused.

ADD COMMENT
0
Entering edit mode

Any luck hunting down the bug from the database side?

ADD REPLY
0
Entering edit mode

Hi Mike, sorry to have left this hanging, I was waiting to hear back from the student after the weekend. It's very difficult getting to the bottom of this whilst we are all working remotely. I'm still waiting for the feedback from those commands you supplied, but in the mean while after he managed to get the raw reads (so no longer using the quant files I gave him), he's now seeing:

Error in reshapeMetaInfo(metaInfo) : 
all(out$index_name_hash == out$index_name_hash[1]) is not TRUE

I have a feeling this is all tied together with the meta_info.json file in each quant folder and the EnsDb you mentioned. I'm just not sure how as I've never had to delve this deep into tximeta as it's always worked for me.

I've just got back the information required: This is my setup:

> bfcinfo(bfc)$rname
[1] "Homo_sapiens.GRCh38.100.gtf.gz"          "txpRngs-Homo_sapiens.GRCh38.100.gtf.gz" 
[3] "geneRngs-Homo_sapiens.GRCh38.100.gtf.gz"
> bfcinfo(bfc)$fpath
[1] "C:\\Users\\yewro\\AppData\\Local\\Temp\\Rtmpm2ahfl\\BiocFileCache\\3f1863f97088_86435"
[2] "3f18bcb435a"                                                                          
[3] "3f18cc32abe"                                                                          
> bfcinfo(bfc)$rpath
[1] "C:\\Users\\yewro\\AppData\\Local\\BiocFileCache\\BiocFileCache\\Cache/3f18259750cb_3f1863f97088_86435"
[2] "C:\\Users\\yewro\\AppData\\Local\\BiocFileCache\\BiocFileCache\\Cache/3f18bcb435a_3f18bcb435a.rds"    
[3] "C:\\Users\\yewro\\AppData\\Local\\BiocFileCache\\BiocFileCache\\Cache/3f18cc32abe_3f18cc32abe.rds"    
> bfcinfo(bfc)$rid
[1] "BFC1" "BFC2" "BFC3"

This is his:

>bfcinfo(bfc)$rname
[1] "Homo_sapiens.GRCh38.100.gtf.gz"          "txpRngs-Homo_sapiens.GRCh38.100.gtf.gz"
[3] "geneRngs-Homo_sapiens.GRCh38.100.gtf.gz"
> bfcinfo(bfc)$fpath
[1] "/private/var/folders/y9/4smst0px70922f3w4qdqblc80000gp/T/Rtmpdr7sFi/BiocFileCache/3e659c0734c_86435"
[2] "3e645a2080a"
[3] "3e617f7b306"
> bfcinfo(bfc)$rpath
[1] "/Users/Ben/Library/Caches/BiocFileCache/3e669d1b09c_3e659c0734c_86435"
[2] "/Users/Ben/Library/Caches/BiocFileCache/3e645a2080a_3e645a2080a.rds"
[3] "/Users/Ben/Library/Caches/BiocFileCache/3e617f7b306_3e617f7b306.rds"
> bfcinfo(bfc)$rid
[1] "BFC1" "BFC2" "BFC3"
ADD REPLY
0
Entering edit mode

Thanks for the update. The new error above indicates that samples are being imported that were run with different Salmon index (and so the transcripts will not align to build a matrix of count data). That may be a clue to the other error (or it could be independent).

You can do:

grep \"index_seq_hash\" */aux_info/meta_info.json

To see which files were quantified with a different set of transcripts according to its hash value. You can also look at:

grep index */cmd_info.json

To see the path of the index that was provided to Salmon.

ADD REPLY
0
Entering edit mode

Thanks very much Mike.

Had a look at what his terminal spits out. The checksum is identical to mine (we used the same reference transcriptome), and index reference is what it should be (pointing to the reference he used during the indexing stage of salmon). Would this suggest that this checksum entry is not in his distribution of Tximeta? <-- I'm looking at the figure in the tximeta html document.

Also, I don't know whether the reference transcriptome file name is of a standard i.e., a standard set by Ensembl. The file was renamed on his machine, and I'm seeing:

quant_02/cmd_info.json:  "index": "Human_cDNA_index/"
...

I'm reading that Ensembl transcripts sould be automatically recognised by tximeta. I'm just taking a stab at the dark here.

I've looked through the info.json of the reference index folder, the "indexseqhash" matches what's in his auxinfo/metainfo.json files. An interesting development, and I have no idea if it would make any difference. He ran salmon index a second from the seventh sample onwards. I've checked for difference in the metadata of those samples that came before and those that came after, I can't spot anything obvious.

ADD REPLY
0
Entering edit mode

So the error indicates that the reference transcripts on his machine are not identical across samples, which kind of breaks the import process (e.g. tximport/tximeta won't be able to merge to create a single object).

Can you explain this more:

He ran salmon index a second from the seventh sample onwards.

This sounds like the source of an error that would break tximport/tximeta.

ADD REPLY
0
Entering edit mode

Indeed! Again, this is all remote some I'm trying to get to the bottom of it. I have a transcript of the commands used on the Mac terminal. From the looks of it, the first few samples he performed Salmon quant having first performed a salmon index. The machine was powered off. The next day it was powered on. An attempt was made at running salmon quant on the next sample but failed with:

Exception : [Error: This version of salmon does not support indexing using the RapMap index.] 
salmon quant was invoked improperly.

I've never seen that error before. Anyway, that's when salmon index was run a second time. It ran, then the remaining samples were processed with salmon quant. I took a look at what files I could get hold of. The json files before and after the second salmon index instantiation look the same.

Update: It is most certainly the second execution of salmon index that broke his pipeline. Thank you Mike for being extremely patient and helpful.

ADD REPLY
0
Entering edit mode

Thanks for the report. I think that I can look into a better error message... for some reason the function got to summarization when it should have caught on the mis-matching index across samples first.

ADD REPLY
0
Entering edit mode

Thanks again. I should have diagnosed the problem a lot earlier, but with supervision being remote, technical support for new project students is really tough!

ADD REPLY

Login before adding your answer.

Traffic: 1038 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6