Question: tximport::summarizeToGene fails to ignoreTxVersion
0
gravatar for Benjamin
3 months ago by
Benjamin0
TU Dortmund
Benjamin0 wrote:

Using airway2 (https://github.com/mikelove/airway2/tree/master/inst) salmon transcript level count files, this is supposed to remove the versioning of Ensemble gene IDs:

library(tximeta)

srrs = list.files(path = "~/Downloads/airway2-master/inst/extdata/quants/", full.names = TRUE)

txm_raw = tximeta(file.path(srrs, "quant.sf.gz"), type = "salmon") 
txm_con = summarizeToGene(txm_raw, ignoreTxVersion = TRUE)

loading existing TxDb created: 2019-07-02 20:01:47
obtaining transcript-to-gene mapping from TxDb
Error in .local(object, ...) : 

  None of the transcripts in the quantification files are present
  in the first column of tx2gene. Check to see that you are using
  the same annotation for both.

Example IDs (file): [ENST00000456328, ENST00000450305, ENST00000488147, ...]

Example IDs (tx2gene): [ENST00000456328.2, ENST00000450305.2, ENST00000473358.1, ...]

  This can sometimes (not always) be fixed using 'ignoreTxVersion' or 'ignoreAfterBar'.

Similarly, the operation fails using mapIds of AnnotationDbi (for which it might be nice to have as well an option like ignoreTxVersion ).

Current work-around:

names(GRanges_object) <- sapply(names(GRanges_object), function(x) unlist(strsplit(x, "\\."))[[1]])

Session Info:

sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
[1] GenomicFeatures_1.36.3 AnnotationDbi_1.46.0   Biobase_2.44.0        
[4] GenomicRanges_1.36.0   GenomeInfoDb_1.20.0    IRanges_2.18.1        
[7] S4Vectors_0.22.0       BiocGenerics_0.30.0    tximeta_1.2.1         

loaded via a namespace (and not attached):
 [1] httr_1.4.0                  tidyr_0.8.3                 vsn_3.52.0                 
 [4] bit64_0.9-7                 jsonlite_1.6                assertthat_0.2.1           
 [7] BiocManager_1.30.4          affy_1.62.0                 BiocFileCache_1.8.0        
[10] blob_1.1.1                  GenomeInfoDbData_1.2.1      Rsamtools_2.0.0            
[13] progress_1.2.2              pillar_1.4.2                RSQLite_2.1.1              
[16] lattice_0.20-38             limma_3.40.2                glue_1.3.1                 
[19] digest_0.6.19               XVector_0.24.0              colorspace_1.4-1           
[22] preprocessCore_1.46.0       Matrix_1.2-17               XML_3.98-1.20              
[25] pkgconfig_2.0.2             biomaRt_2.40.1              zlibbioc_1.30.0            
[28] purrr_0.3.2                 scales_1.0.0                affyio_1.54.0              
[31] BiocParallel_1.18.0         tibble_2.1.3                AnnotationFilter_1.8.0     
[34] ggplot2_3.2.0               SummarizedExperiment_1.14.0 lazyeval_0.2.2             
[37] magrittr_1.5                crayon_1.3.4                memoise_1.1.0              
[40] tools_3.6.0                 prettyunits_1.0.2           hms_0.4.2                  
[43] matrixStats_0.54.0          stringr_1.4.0               munsell_0.5.0              
[46] DelayedArray_0.10.0         ensembldb_2.8.0             Biostrings_2.52.0          
[49] compiler_3.6.0              rlang_0.4.0                 grid_3.6.0                 
[52] RCurl_1.95-4.12             tximport_1.12.3             rstudioapi_0.10            
[55] rappdirs_0.3.1              bitops_1.0-6                gtable_0.3.0               
[58] DBI_1.0.0                   curl_3.3                    R6_2.4.0                   
[61] GenomicAlignments_1.20.1    knitr_1.23                  dplyr_0.8.2                
[64] rtracklayer_1.43.3          bit_1.1-14                  ProtGenerics_1.16.0        
[67] readr_1.3.1                 stringi_1.4.3               Rcpp_1.0.1                 
[70] dbplyr_1.4.2                tidyselect_0.2.5            xfun_0.8
ADD COMMENTlink modified 3 months ago • written 3 months ago by Benjamin0
Answer: tximport::summarizeToGene fails to ignoreTxVersion
1
gravatar for Michael Love
3 months ago by
Michael Love25k
United States
Michael Love25k wrote:

I think the files do have the version ending, and so it works just with defaults (here with airway2 files):

> coldata <- data.frame(files=file.path(list.files(),"quant.sf.gz"), names=letters[1:8])
> se <- tximeta(coldata)
importing quantifications
reading in files with read_tsv
1 2 3 4 5 6 7 8
found matching transcriptome:
[ Gencode - Homo sapiens - release 27 ]
loading existing TxDb created: 2018-10-25 12:50:54
loading existing transcript ranges created: 2019-05-20 15:35:22
fetching genome info
> gse <- summarizeToGene(se)
loading existing TxDb created: 2018-10-25 12:50:54
obtaining transcript-to-gene mapping from TxDb
loading existing gene ranges created: 2019-07-04 13:15:09
summarizing abundance
summarizing counts
summarizing length
> gse <- addIds(gse, "ENTREZID")
mapping to new IDs using 'org.Hs.eg.db' data package
if all matching IDs are desired, and '1:many mappings' are reported,
set multiVals='list' to obtain all the matching IDs
it appears the rows are gene IDs, setting 'gene' to TRUE
'select()' returned 1:many mapping between keys and columns
> mcols(gse)
DataFrame with 58288 rows and 2 columns
                              gene_id    ENTREZID
                          <character> <character>
ENSG00000000003.14 ENSG00000000003.14        7105
ENSG00000000005.5   ENSG00000000005.5       64102
ENSG00000000419.12 ENSG00000000419.12        8813
ENSG00000000457.13 ENSG00000000457.13       57147
ENSG00000000460.16 ENSG00000000460.16       55732
ADD COMMENTlink written 3 months ago by Michael Love25k

Thank you, Michael. I didn't know about the tximeta::addIds() function.

A minor point with regards to this solution:

Given, I've run a DESeq2 analysis on gse (or txm) and I would like to add the ENTREZID afterwards, the addIds(dds_results, "ENTREZID") function will not work with an object of DESeqResults class (even if metadata(txm)$txomeInfo -> metadata(dds_results)$txomeInfo. I can imagine this holds true for a variety of objects derived from SummarizedExperiments that end up as data.frame or similar.

In this case, I still would have to do something like

mapIds(org.Hs.eg.db, keys = row.names(dds_results), keytype = "ENSEMBL",
   column = "SYMBOL")

and given that row.names(dds_results) are versioned, this would throw an error.

Of course, one can work around as specified in the original question by stripping off the versioning ... however, it would be nice to harmonize tximeta::addIds and AnnotationDbi::mapIds for such cases. Maybe a good feature request for AnnotationDbi? (Or a request to allow mappings via addIds() for any vector given a valid metadata(obj)$txomeInfo?)

ADD REPLYlink written 3 months ago by Benjamin0
1

DESeqResults is a DataFrame, coming from the S4Vectors package.

So far, I've just built out tximeta functions for SummarizedExperiments, and so one could add the IDs upstream of DESeq2 or I think on a DESeqDataSet.

I can look into changing the code to support GRanges and/or DataFrames.

ADD REPLYlink written 3 months ago by Michael Love25k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 136 users visited in the last hour