tximport::summarizeToGene fails to ignoreTxVersion
1
0
Entering edit mode
Benjamin • 0
@benjamin-21234
Last seen 5.4 years ago
TU Dortmund

Using airway2 (https://github.com/mikelove/airway2/tree/master/inst) salmon transcript level count files, this is supposed to remove the versioning of Ensemble gene IDs:

library(tximeta)

srrs = list.files(path = "~/Downloads/airway2-master/inst/extdata/quants/", full.names = TRUE)

txm_raw = tximeta(file.path(srrs, "quant.sf.gz"), type = "salmon") 
txm_con = summarizeToGene(txm_raw, ignoreTxVersion = TRUE)

loading existing TxDb created: 2019-07-02 20:01:47
obtaining transcript-to-gene mapping from TxDb
Error in .local(object, ...) : 

  None of the transcripts in the quantification files are present
  in the first column of tx2gene. Check to see that you are using
  the same annotation for both.

Example IDs (file): [ENST00000456328, ENST00000450305, ENST00000488147, ...]

Example IDs (tx2gene): [ENST00000456328.2, ENST00000450305.2, ENST00000473358.1, ...]

  This can sometimes (not always) be fixed using 'ignoreTxVersion' or 'ignoreAfterBar'.

Similarly, the operation fails using mapIds of AnnotationDbi (for which it might be nice to have as well an option like ignoreTxVersion ).

Current work-around:

names(GRanges_object) <- sapply(names(GRanges_object), function(x) unlist(strsplit(x, "\\."))[[1]])

Session Info:

sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
[1] GenomicFeatures_1.36.3 AnnotationDbi_1.46.0   Biobase_2.44.0        
[4] GenomicRanges_1.36.0   GenomeInfoDb_1.20.0    IRanges_2.18.1        
[7] S4Vectors_0.22.0       BiocGenerics_0.30.0    tximeta_1.2.1         

loaded via a namespace (and not attached):
 [1] httr_1.4.0                  tidyr_0.8.3                 vsn_3.52.0                 
 [4] bit64_0.9-7                 jsonlite_1.6                assertthat_0.2.1           
 [7] BiocManager_1.30.4          affy_1.62.0                 BiocFileCache_1.8.0        
[10] blob_1.1.1                  GenomeInfoDbData_1.2.1      Rsamtools_2.0.0            
[13] progress_1.2.2              pillar_1.4.2                RSQLite_2.1.1              
[16] lattice_0.20-38             limma_3.40.2                glue_1.3.1                 
[19] digest_0.6.19               XVector_0.24.0              colorspace_1.4-1           
[22] preprocessCore_1.46.0       Matrix_1.2-17               XML_3.98-1.20              
[25] pkgconfig_2.0.2             biomaRt_2.40.1              zlibbioc_1.30.0            
[28] purrr_0.3.2                 scales_1.0.0                affyio_1.54.0              
[31] BiocParallel_1.18.0         tibble_2.1.3                AnnotationFilter_1.8.0     
[34] ggplot2_3.2.0               SummarizedExperiment_1.14.0 lazyeval_0.2.2             
[37] magrittr_1.5                crayon_1.3.4                memoise_1.1.0              
[40] tools_3.6.0                 prettyunits_1.0.2           hms_0.4.2                  
[43] matrixStats_0.54.0          stringr_1.4.0               munsell_0.5.0              
[46] DelayedArray_0.10.0         ensembldb_2.8.0             Biostrings_2.52.0          
[49] compiler_3.6.0              rlang_0.4.0                 grid_3.6.0                 
[52] RCurl_1.95-4.12             tximport_1.12.3             rstudioapi_0.10            
[55] rappdirs_0.3.1              bitops_1.0-6                gtable_0.3.0               
[58] DBI_1.0.0                   curl_3.3                    R6_2.4.0                   
[61] GenomicAlignments_1.20.1    knitr_1.23                  dplyr_0.8.2                
[64] rtracklayer_1.43.3          bit_1.1-14                  ProtGenerics_1.16.0        
[67] readr_1.3.1                 stringi_1.4.3               Rcpp_1.0.1                 
[70] dbplyr_1.4.2                tidyselect_0.2.5            xfun_0.8
tximport tximeta AnnotationDbi • 2.4k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 16 hours ago
United States

I think the files do have the version ending, and so it works just with defaults (here with airway2 files):

> coldata <- data.frame(files=file.path(list.files(),"quant.sf.gz"), names=letters[1:8])
> se <- tximeta(coldata)
importing quantifications
reading in files with read_tsv
1 2 3 4 5 6 7 8
found matching transcriptome:
[ Gencode - Homo sapiens - release 27 ]
loading existing TxDb created: 2018-10-25 12:50:54
loading existing transcript ranges created: 2019-05-20 15:35:22
fetching genome info
> gse <- summarizeToGene(se)
loading existing TxDb created: 2018-10-25 12:50:54
obtaining transcript-to-gene mapping from TxDb
loading existing gene ranges created: 2019-07-04 13:15:09
summarizing abundance
summarizing counts
summarizing length
> gse <- addIds(gse, "ENTREZID")
mapping to new IDs using 'org.Hs.eg.db' data package
if all matching IDs are desired, and '1:many mappings' are reported,
set multiVals='list' to obtain all the matching IDs
it appears the rows are gene IDs, setting 'gene' to TRUE
'select()' returned 1:many mapping between keys and columns
> mcols(gse)
DataFrame with 58288 rows and 2 columns
                              gene_id    ENTREZID
                          <character> <character>
ENSG00000000003.14 ENSG00000000003.14        7105
ENSG00000000005.5   ENSG00000000005.5       64102
ENSG00000000419.12 ENSG00000000419.12        8813
ENSG00000000457.13 ENSG00000000457.13       57147
ENSG00000000460.16 ENSG00000000460.16       55732
ADD COMMENT
0
Entering edit mode

Thank you, Michael. I didn't know about the tximeta::addIds() function.

A minor point with regards to this solution:

Given, I've run a DESeq2 analysis on gse (or txm) and I would like to add the ENTREZID afterwards, the addIds(dds_results, "ENTREZID") function will not work with an object of DESeqResults class (even if metadata(txm)$txomeInfo -> metadata(dds_results)$txomeInfo. I can imagine this holds true for a variety of objects derived from SummarizedExperiments that end up as data.frame or similar.

In this case, I still would have to do something like

mapIds(org.Hs.eg.db, keys = row.names(dds_results), keytype = "ENSEMBL",
   column = "SYMBOL")

and given that row.names(dds_results) are versioned, this would throw an error.

Of course, one can work around as specified in the original question by stripping off the versioning ... however, it would be nice to harmonize tximeta::addIds and AnnotationDbi::mapIds for such cases. Maybe a good feature request for AnnotationDbi? (Or a request to allow mappings via addIds() for any vector given a valid metadata(obj)$txomeInfo?)

ADD REPLY
1
Entering edit mode

DESeqResults is a DataFrame, coming from the S4Vectors package.

So far, I've just built out tximeta functions for SummarizedExperiments, and so one could add the IDs upstream of DESeq2 or I think on a DESeqDataSet.

I can look into changing the code to support GRanges and/or DataFrames.

ADD REPLY

Login before adding your answer.

Traffic: 1116 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6