Questions regarding failing tximeta import for swish
1
0
Entering edit mode
jvandinter • 0
@05c09cf1
Last seen 4 weeks ago
Utrecht

Hi all!

We have a custom pipeline in our lab to find cancer-specific transcripts in our RNA-seq sample group using Stringtie for transcript identification and GFFcompare for transcript class annotation. After filtering novel transcripts, we would like to quantify the novel transcripts using salmon and swish from the fishpond package on our sample set and a set of control samples. We currently use Ensembl v102.

Following the swish vignette, I create the following Txome (all using local files, as they are custom and differ per experiment):

# custom_gtf_loc = output from gffcompare and stringtie, with some additional transcript filtering. Merged with Ensembl v102 GTF
# gentrome_fasta_loc = 102/Homo_sapiens.GRCh38.dna.primary_assembly.fa + custom GTF converted to .fa with GFFread
# salmon_index_loc = created using gentrome

source = "stringtie",
organism = "Homo Sapiens",
release = "102",
genome = "GRCh38",
fasta = "gentrome_fasta_loc",
gtf = "custom_gtf_loc",
write=F)


When I try to create a tximeta file:

se <- tximeta(coldata, useHub = FALSE)
importing quantifications
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134
[ stringtie - Homo Sapiens - release 102 ]
building TxDb with 'GenomicFeatures' package
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error in .make_splicings(exons, cds, stop_codons) :
some CDS cannot be mapped to an exon
In .get_cds_IDX(mcols0$type, mcols0$phase) :
The "phase" metadata column contains non-NA values for features of type
stop_codon. This information was ignored.


And a recreation of the error

> txdb <- makeTxDbFromGFF("/location/custom.gtf")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Warning in .get_cds_IDX(mcols0$type, mcols0$phase) :
The "phase" metadata column contains non-NA values for features of type stop_codon. This information was ignored.
Error in .make_splicings(exons, cds, stop_codons) :
some CDS cannot be mapped to an exon


However, makeTxDb from Granges works fine:

> gtf_novel <- rtracklayer::import("/location/custom.gtf")
> txdb <- makeTxDbFromGRanges(gr = gtf_novel)
Warning in .get_cds_IDX(mcols0$type, mcols0$phase) :
The "phase" metadata column contains non-NA values for features of type stop_codon. This information was ignored.


My questions are as follows:

1. A bit naive, but what exactly is the added value of using tximeta to import counts instead of a 'simple' tximport() / add the variable skipMeta = T to the tximeta function call?
2. How can I circumvent the error in my tximeta call?
3. If 2. is impossible, how can I troubleshoot my GTF to fish out the erroneous CDS / exon?

And finally, the session info

> sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GenomicFeatures_1.46.5 AnnotationDbi_1.56.2   Biobase_2.54.0         rtracklayer_1.54.0     GenomicRanges_1.46.1   GenomeInfoDb_1.30.1
[7] IRanges_2.28.0         S4Vectors_0.32.4       BiocGenerics_0.40.0

loaded via a namespace (and not attached):
[1] MatrixGenerics_1.6.0        httr_1.4.2                  bit64_4.0.5                 splines_4.1.3
[5] assertthat_0.2.1            BiocFileCache_2.2.1         BiocManager_1.30.16         blob_1.2.2
[9] GenomeInfoDbData_1.2.7      Rsamtools_2.10.0            yaml_2.3.5                  progress_1.2.2
[13] pillar_1.7.0                RSQLite_2.2.12              lattice_0.20-45             glue_1.6.2
[17] digest_0.6.29               RColorBrewer_1.1-3          XVector_0.34.0              colorspace_2.0-3
[21] Matrix_1.4-1                DESeq2_1.34.0               XML_3.99-0.9                pkgconfig_2.0.3
[25] biomaRt_2.50.3              genefilter_1.76.0           zlibbioc_1.40.0             purrr_0.3.4
[29] xtable_1.8-4                scales_1.1.1                BiocParallel_1.28.3         tibble_3.1.6
[33] annotate_1.72.0             KEGGREST_1.34.0             generics_0.1.2              ggplot2_3.3.5
[37] ellipsis_0.3.2              cachem_1.0.6                SummarizedExperiment_1.24.0 cli_3.2.0
[41] survival_3.3-1              magrittr_2.0.3              crayon_1.5.1                memoise_2.0.1
[45] fansi_1.0.3                 xml2_1.3.3                  prettyunits_1.1.1           tools_4.1.3
[49] hms_1.1.1                   BiocIO_1.4.0                lifecycle_1.0.1             matrixStats_0.61.0
[53] stringr_1.4.0               munsell_0.5.0               locfit_1.5-9.5              DelayedArray_0.20.0
[57] Biostrings_2.62.0           compiler_4.1.3              rlang_1.0.2                 grid_4.1.3
[61] RCurl_1.98-1.6              rappdirs_0.3.3              rjson_0.2.21                bitops_1.0-7
[65] restfulr_0.0.13             gtable_0.3.0                curl_4.3.2                  DBI_1.1.2
[69] R6_2.5.1                    GenomicAlignments_1.30.0    dplyr_1.0.8                 fastmap_1.1.0
[73] bit_4.0.4                   utf8_1.2.2                  filelock_1.0.2              stringi_1.7.6
[77] parallel_4.1.3              Rcpp_1.0.8.3                vctrs_0.4.0                 geneplotter_1.72.0
[81] png_0.1-7                   dbplyr_2.1.1                tidyselect_1.1.2

tximeta GenomicFeatures swish • 372 views
2
Entering edit mode
@mikelove
Last seen 14 hours ago
United States

Thanks for the post, firstly 1) you can just use tximeta with skipMeta=TRUE. The value add is just that the metadata about the features is attached to the object as rowRanges, this facilitates downstream genomics analysis like, are DE genes near ATAC or ChIP peaks. But there's no problem using skipMeta=TRUE and then running swish(). Swish doesn't use the metadata at all to compute statistics.

2) I'll look into bypassing the GTF somehow. I haven't seen this before where makeTxDbFromGFF fails but makeTxDbFromGRanges doesn't. That seems mysterious, but I could just have an option to try this route I suppose.

1
Entering edit mode

I just made a commit on GitHub that would allow the GTF slot of the linkedTxome to be an .rda file path pointing to a saved GRanges object.

Could you try that solution? You can install the devel branch with devtools::install_github("mikelove/tximeta")

You may need to remove the linkedTxome table:

0
Entering edit mode

Thank you for the swift reply Michael, I will update the tximeta package and will get back to you whether it worked! Good to know that the metadata is not required per se.

0
Entering edit mode

I am not sure what I'm doing wrong here. I save the granges object as a rda and put it in the Txome. However, I get a return error that the object is no Granges (which I show it is). What am I missing?

Creation of the .rda file

> gr <- rtracklayer::import("/location/custom.gtf")
GRanges object with 6 ranges and 31 metadata columns:
""" Granges """
> save(gr, file = "/location/custom.rda")


Now, on to create a new txome:

> makeLinkedTxome(indexDir = "salmon_index_loc",
source = "stringtie",
organism = "Homo Sapiens",
release = "102",
genome = "GRCh38",
fasta = "gentrome_loc",
gtf = "/location/custom.rda",
write=F)

#Sample_info_test is a trunctated coldata table with only the first 5 samples
> y <- tximeta(sample_info_test)
importing quantifications
1 2 3 4 5
[ stringtie - Homo Sapiens - release 102 ]
building TxDb with 'GenomicFeatures' package
Error in makeTxDbFromGRanges(load(txomeInfo$gtf)) : 'gr' must be a GRanges object  here to show that I have the latest version: > sessionInfo() R version 4.1.3 (2022-03-10) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Big Sur 11.6 Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] readxl_1.4.0 GenomicFeatures_1.46.5 AnnotationDbi_1.56.2 Biobase_2.54.0 GenomicRanges_1.46.1 GenomeInfoDb_1.30.1 [7] IRanges_2.28.0 S4Vectors_0.32.4 BiocGenerics_0.40.0 BiocFileCache_2.2.1 dbplyr_2.1.1 **tximeta_1.13.10** [13] forcats_0.5.1 dplyr_1.0.8 purrr_0.3.4 readr_2.1.2 tidyr_1.2.0 tibble_3.1.6 [19] ggplot2_3.3.5 tidyverse_1.3.1 stringr_1.4.0  ADD REPLY 1 Entering edit mode Can you repeat (install and force installation, quit R, restart)? There was a minute when I had left out a load() command but I quickly pushed the fix. ADD REPLY 0 Entering edit mode I checked the commit you made on github and I think I installed 13.10 when it was already updated. To be sure, I re-downloaded the new version (using forced = T), and restarted R. Unfortunately with the same results: > y <- tximeta(sample_info_test) importing quantifications reading in files with read_tsv 1 2 3 4 5 found matching linked transcriptome: [ stringtie - Homo Sapiens - release 102 ] building TxDb with 'GenomicFeatures' package Error in makeTxDbFromGRanges(load(txomeInfo$gtf)) :
'gr' must be a GRanges object


I uncoupled the txome again and generated it again with the new version, but no luck.

I tried doing it via makeTxdbFromGranges directly, and this does seem to work:

> txdb <- makeTxDbFromGRanges(gr = load("/location/custom.rda"))
Error in makeTxDbFromGRanges(gr = load("/location/custom.rda")) :
'gr' must be a GRanges object
> txdb <- makeTxDbFromGRanges(rtracklayer::import("/location/custom.gtf"))
Warning in .get_cds_IDX(mcols0$type, mcols0$phase) :
The "phase" metadata column contains non-NA values for features of type stop_codon. This information was ignored.
> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Genome: NA
# Nb of transcripts: 242419
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2022-04-08 15:23:13 +0200 (Fri, 08 Apr 2022)
# GenomicFeatures version at creation time: 1.46.5
# RSQLite version at creation time: 2.2.12
# DBSCHEMAVERSION: 1.2

1
Entering edit mode

Thanks for checking again -- oh I see now I was being a bit sloppy with my R code. I've pushed a new fix if its possible for you to try version 1.13.11.

This would involve saving the GRanges as an rds instead of rda:

saveRDS(gr, file="custom.rds")


And then pointing to the rds file in the GTF slot.

0
Entering edit mode

Yes, it works now with v1.13.11 :)

Thank kindly you for the help!

0
Entering edit mode

Thanks again for following up with the extra checks. I think this is a nice workaround for others who may encounter TxDb construction issues. If you have any questions with running swish() feel free to post another issue, happy to help there as well.