Question

summarizeToGene from tximeta: not all 'rnames' found or unique.

1

Entering edit mode

Pufall, Miles A ▴ 30

@pufall-miles-a-5909

Last seen 6 months ago

United States

I'm running the RNA-seq gene-level workflow largely described by Mike Love (http://master.bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html). It had been running just fine, but yesterday the workflow stopped and summarizeToGene with the following error:

"rnames' exact pattern
    'Mus_musculus.GRCm38.102.gtf.gz'
  is not unique; use 'bfcquery()' to see matches.Error in bfcrpath(bfc, txdbName) : not all 'rnames' found or unique."

This is after a fresh retrieval of metadata by tximeta. As I said - this had not caused any problems until yesterday, but I can't seem to solve it even after removing rids from the bfc cache.

Any thoughts? a paste from the .Rmd file is below:

```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) library(tximeta) library(DESeq2) library(tidyverse) library("pheatmap") library("RColorBrewer") library("PoiClaClu") library("org.Mm.eg.db") library(BiocFileCache)


## RNA-seq analysis in G9a KO mice

## Trimmed the reads using TrimGalore!

`trim_galore -o Sample-2 Sample-2/Sample-2_1.fq.gz`

This probably isn't 100% necessary, as the reads are single and short, but did it anyway.

## Quantifying counts

I'll use Salmon (v 1.3.0) to make count table (https://combine-lab.github.io/salmon/)

Downloaded transcriptome in fasta format from ensembl (ftp://ftp.ensembl.org/pub/release-101/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz)

As well as the gene structures in GTF format (ftp://ftp.ensembl.org/pub/release-101/gtf/mus_musculus/Mus_musculus.GRCm38.101.gtf.gz)

The first step - index the transcriptome:
`salmon index -t Mus_musculus.GRCm38.cdna.all.fa -i GRCm38_index`

The map the reads:
`#!/bin/bash
for fn in Sample-{1..12};
do
samp=`basename ${fn}`
echo "Processing sample ${samp}"
salmon quant -i GRCm38_index  -l A \
        -r ${samp}/${samp}_1_trimmed.fq.gz \
        -p 4 --validateMappings --seqBias --gcBias --numBootstraps 50 -o quants/${samp}_quant
done`

## Import data into R

We'll use tximeta.

```{r}
list.files(file.path("quants"))
files <- file.path("quants", paste0("Sample-", 1:12, "_quant"), "quant.sf")
names <- paste0("Sample_", 1:12)
cell <- factor(c("wt", "wt", "wt", "wt", "wt", "wt", "K182R", "K182R", "K182R", "K182R", "K182R", "K182R"), levels = c("wt","K182R"))
treat <- factor(c("PBS", "PBS", "PBS", "dex", "dex", "dex", "PBS", "PBS", "PBS", "dex", "dex", "dex"), levels = c("PBS", "dex"))
coldata <- tibble(files, names, cell, treat)
file.exists(coldata$files)

[1] "Sample-1_quant" "Sample-10_quant" "Sample-11_quant" "Sample-12_quant" [5] "Sample-2_quant" "Sample-3_quant" "Sample-4_quant" "Sample-5_quant" [9] "Sample-6_quant" "Sample-7_quant" "Sample-8_quant" "Sample-9_quant" [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

se <- tximeta(coldata)
gse <- summarizeToGene(se)

Error in bfcrpath(bfc, txdbName) : not all 'rnames' found or unique.

A more detailed output of the tximeta and smmarizeToGene is below:

> se <- tximeta(coldata)
importing quantifications
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 

tximeta needs a BiocFileCache directory to access and save TxDb objects.
Do you wish to use the default directory: '/home/dweebis/.cache/BiocFileCache'?
If not, a temporary directory that is specific to this R session will be used.

You can always change this directory later by running: setTximetaBFC()
Or enter [0] to exit and set this directory manually now.
This location can also be set by environmental variable TXIMETA_HUB_CACHE. 

1: Yes (use default)
2: No (use temp)

Selection: 1
/home/dweebis/.cache/BiocFileCache
  does not exist, create directory? (yes/no): yes
found matching transcriptome:
[ Ensembl - Mus musculus - release 102 ]
useHub=TRUE: checking for EnsDb via 'AnnotationHub'
snapshotDate(): 2020-10-27
did not find matching EnsDb via 'AnnotationHub'
building EnsDb with 'ensembldb' package
Importing GTF file ... trying URL 'ftp://ftp.ensembl.org/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.gtf.gz'
Content type 'unknown' length 33443321 bytes (31.9 MB)
==================================================
OK
Processing metadata ... OK
Processing genes ... 
 Attribute availability:
  o gene_id ... OK
  o gene_name ... OK
  o entrezid ... Nope
  o gene_biotype ... OK
OK
Processing transcripts ... 
 Attribute availability:
  o transcript_id ... OK
  o gene_id ... OK
  o transcript_biotype ... OK
OK
Processing exons ... OK
Processing chromosomes ... Fetch seqlengths from ensembl ... OK
Generating index ... OK
  -------------
Verifying validity of the information in the database:
Checking transcripts ... OK
Checking exons ... Warning in if ((One <- nargs() == 1L) && !missing(from)) { :
  closing unused connection 3 (ftp://ftp.ensembl.org/pub/release-102/mysql/)
OK
building TxDb with 'GenomicFeatures' package
Import genomic features from the file as a GRanges object ... trying URL 'ftp://ftp.ensembl.org/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.gtf.gz'
Content type 'unknown' length 33443321 bytes (31.9 MB)
==================================================
OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.Warning: call dbDisconnect() when finished working with a connection
OK
generating transcript ranges


Warning: the annotation is missing some transcripts that were quantified.
176 out of 117135 txps were missing from GTF/GFF but were in the indexed FASTA.
(This occurs sometimes with Ensembl txps on haplotype chromosomes.)
In order to build a ranged SummarizedExperiment, these txps were removed.
To keep these txps, and to skip adding ranges, use skipMeta=TRUE

Example missing txps: [ENSMUST00000181375, ENSMUST00000214094, ENSMUST00000215103, ...]
> gse <- summarizeToGene(se)
Error in (function (x)  : attempt to apply non-function
'rnames' exact pattern
    'Mus_musculus.GRCm38.102.gtf.gz'
  is not unique; use 'bfcquery()' to see matches.Error in bfcrpath(bfc, txdbName) : not all 'rnames' found or unique.

thx - Miles

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] BiocFileCache_1.14.0        dbplyr_2.0.0                org.Mm.eg.db_3.12.0        
 [4] AnnotationDbi_1.52.0        PoiClaClu_1.0.2.1           RColorBrewer_1.1-2         
 [7] pheatmap_1.0.12             forcats_0.5.0               stringr_1.4.0              
[10] dplyr_1.0.2                 purrr_0.3.4                 readr_1.4.0                
[13] tidyr_1.1.2                 tibble_3.0.4                ggplot2_3.3.2              
[16] tidyverse_1.3.0             DESeq2_1.30.0               SummarizedExperiment_1.20.0
[19] Biobase_2.50.0              MatrixGenerics_1.2.0        matrixStats_0.57.0         
[22] GenomicRanges_1.42.0        GenomeInfoDb_1.26.1         IRanges_2.24.0             
[25] S4Vectors_0.28.0            BiocGenerics_0.36.0         tximeta_1.8.2              

loaded via a namespace (and not attached):
 [1] colorspace_2.0-0              ellipsis_0.3.1               
 [3] XVector_0.30.0                fs_1.5.0                     
 [5] rstudioapi_0.13               bit64_4.0.5                  
 [7] fansi_0.4.1                   interactiveDisplayBase_1.28.0
 [9] lubridate_1.7.9.2             xml2_1.3.2                   
[11] splines_4.0.3                 tximport_1.18.0              
[13] geneplotter_1.68.0            knitr_1.30                   
[15] jsonlite_1.7.1                Rsamtools_2.6.0              
[17] broom_0.7.2                   annotate_1.68.0              
[19] shiny_1.5.0                   BiocManager_1.30.10          
[21] compiler_4.0.3                httr_1.4.2                   
[23] backports_1.2.0               assertthat_0.2.1             
[25] Matrix_1.2-18                 fastmap_1.0.1                
[27] lazyeval_0.2.2                cli_2.2.0                    
[29] later_1.1.0.1                 htmltools_0.5.0              
[31] prettyunits_1.1.1             tools_4.0.3                  
[33] gtable_0.3.0                  glue_1.4.2                   
[35] GenomeInfoDbData_1.2.4        rappdirs_0.3.1               
[37] tinytex_0.27                  Rcpp_1.0.5                   
[39] cellranger_1.1.0              vctrs_0.3.5                  
[41] Biostrings_2.58.0             rtracklayer_1.50.0           
[43] xfun_0.19                     rvest_0.3.6                  
[45] mime_0.9                      lifecycle_0.2.0              
[47] ensembldb_2.14.0              XML_3.99-0.5                 
[49] AnnotationHub_2.22.0          zlibbioc_1.36.0              
[51] scales_1.1.1                  hms_0.5.3                    
[53] promises_1.1.1                ProtGenerics_1.22.0          
[55] AnnotationFilter_1.14.0       yaml_2.2.1                   
[57] curl_4.3                      memoise_1.1.0                
[59] biomaRt_2.46.0                stringi_1.5.3                
[61] RSQLite_2.2.1                 BiocVersion_3.12.0           
[63] genefilter_1.72.0             GenomicFeatures_1.42.1       
[65] BiocParallel_1.24.1           rlang_0.4.9                  
[67] pkgconfig_2.0.3               bitops_1.0-6                 
[69] evaluate_0.14                 lattice_0.20-41              
[71] GenomicAlignments_1.26.0      bit_4.0.4                    
[73] tidyselect_1.1.0              magrittr_2.0.1               
[75] R6_2.5.0                      generics_0.1.0               
[77] DelayedArray_0.16.0           DBI_1.1.0                    
[79] withr_2.3.0                   pillar_1.4.7                 
[81] haven_2.3.1                   survival_3.2-7               
[83] RCurl_1.98-1.2                modelr_0.1.8                 
[85] crayon_1.3.4                  rmarkdown_2.5                
[87] progress_1.2.2                locfit_1.5-9.4               
[89] grid_4.0.3                    readxl_1.3.1                 
[91] blob_1.2.1                    reprex_0.3.0                 
[93] digest_0.6.27                 xtable_1.8-4                 
[95] httpuv_1.5.4                  openssl_1.4.3                
[97] munsell_0.5.0                 askpass_1.1

RNA-seq tximeta • 2.1k views

ADD COMMENT • link 3.4 years ago • updated 3.3 years ago Pufall, Miles A ▴ 30

0

Entering edit mode

Hello,

I face the same issue. I do not have a solution, but a workaround. It is certainly a terrible idea and nobody should do it. It will also be overwritten by a future tximeta update.

That said, find your R library path/tximeta/extdata/hashtable.csv

Line 63 holds the information for the ENSEMBL GRCm38 genome, release 102. It has the same SHA256 checksum as the 101 release (and the 100 release, as well; I do not know if that is relevant). You performed your Salmon quantification with the release 101 (as did I), so my best guess is that these get mixed up somehow, but I have no idea about the specifics. Deleting that line allowed my pipeline to run as it did previously.

Good luck!

ADD REPLY • link 3.4 years ago OT • 0

0

Entering edit mode

Thanks for the reply! That does seem like something I shouldn't do...but I may try it if I can't find a way to do it without editing tximeta.

ADD REPLY • link 3.4 years ago Pufall, Miles A ▴ 30

0

Entering edit mode

Maybe this is more appropriate as a BiocFileCache issue?

ADD REPLY • link 3.4 years ago Pufall, Miles A ▴ 30

score 1 · Answer 1 · 2020-12-15

1

Entering edit mode

Michael Love 41k

@mikelove

Last seen 4 hours ago

United States

Dear Miles,

Thanks for the bug report and sorry for the delay in response. This skipped my inbox for some reason.

Can you show me:

library(BiocFileCache)
bfc <- BiocFileCache()
as.data.frame( (bfcquery(bfc, "Mus_musculus.GRCm38.102.gtf.gz") )

ADD COMMENT • link 3.3 years ago Michael Love 41k

0

Entering edit mode

I've quantified some mouse data on my end with Ensembl release 102, and I was able to reproduce the bug wherein two entries are stored in the BFC, which leads to this error. This shouldn't ever happen under normal usage so the error was a good thing for helping catch the problem. Tximeta shouldn't add a database to the BFC if it already exists there with the same name.

I'll work on this today.

ADD REPLY • link 3.3 years ago Michael Love 41k

1

Entering edit mode

I've hopefully fixed this in v1.9.3 and v1.8.3 with this commit.

Duplicate entries were being added to the BFC because of a bug in how I was checking cases, introduced in the current release.

In ~2 days this code will be available via BiocManager::install, or you can test it immediately with install_github("mikelove/tximeta") (at the moment, release and devel branch are the same so this shouldn't break anything).

You will need to bfcremove the rids associated with Mus_musculus.GRCm38.102.gtf.gz before trying the new code. You can bfcquery as above and then do, e.g.:

bfcremove(bfc, c("BFC123","BFC234"))

ADD REPLY • link 3.3 years ago Michael Love 41k

0

Entering edit mode

Fantastic - thanks for the help, and I'll report back!

ADD REPLY • link 3.3 years ago Pufall, Miles A ▴ 30

0

Entering edit mode

I update to tximeta from your gitthub, removed previous rids, and everything is running smoothly again. Thanks!

ADD REPLY • link 3.3 years ago Pufall, Miles A ▴ 30

0

Entering edit mode

Sorry for the delay - I'm also not getting email notification of a comment. You've answered this below - but here's the output:

library(BiocFileCache) Loading required package: dbplyr

bfc <- BiocFileCache()

as.data.frame( (bfcquery(bfc, "Mus_musculus.GRCm38.102.gtf.gz") )

)

rid rname create_time access_time

1 BFC1 Mus_musculus.GRCm38.102.gtf.gz 2020-12-06 22:50:44 2020-12-06 22:50:44

2 BFC2 Mus_musculus.GRCm38.102.gtf.gz 2020-12-06 22:52:26 2020-12-06 22:52:26

3 BFC3 txpRngs-Mus_musculus.GRCm38.102.gtf.gz 2020-12-06 22:52:27 2020-12-06 22:52:27 rpath rtype 1 /Users/milespufall/Library/Caches/BiocFileCache/54305640c910_54305640c910.sqlite relative 2 /Users/milespufall/Library/Caches/BiocFileCache/54301fed178a_54301fed178a.sqlite relative 3 /Users/milespufall/Library/Caches/BiocFileCache/54306a47566_54306a47566.rds relative

fpath last_modified_time etag expires

1 54305640c910 NA <NA> NA

2 54301fed178a NA <NA> NA

3 54306a47566 NA <NA> NA

ADD REPLY • link 3.3 years ago Pufall, Miles A ▴ 30