Running swish on a non-reference transcript assembly / tximeta import issue
1
0
Entering edit mode
Jack Riley • 0
@b67671b2
Last seen 5 days ago
United Kingdom

Hello,

I am trying to run swish on a non-reference transcript assembly.

I've tried:

  • Importing the data via tximeta, although I am having trouble getting tximeta to recognise my custom transcriptome (which consists of referenced transcripts from Ensembl.GRCh38.v85 as well as novel transcripts assembled from StringTie). When using a LinkedTxome as follows:
#make the linkedTxome
indexDir = file.path("salmon_index/agg-agg-agg.gtf.salmon.index")
gtfPath = file.path("export/agg-agg-agg.gtf") ##<<--- custom transcriptome with REF+StringTie txs... getting errors
#gtfPath = "http://ftp.ensembl.org/pub/release-85/gtf/homo_sapiens/Homo_sapiens.GRCh38.85.gtf.gz" ##<<--- this one works but then removes all the novel transcripts from StringTie when tximeta is called
fastaPath = file.path("salmon_index/agg-agg-aggtranscripts.fasta")
tmp = tempdir()
jsonFile = file.path(tmp, paste0(basename(indexDir), ".json"))
library(ensembldb)
makeLinkedTxome(indexDir = indexDir,
                source = "Ensembl", organism = "Homo sapien",
                release = "85", genome = "GRCh38",
                fasta=fastaPath, gtf=gtfPath,
                jsonFile = jsonFile)
se = tximeta(sample_table, tx2gene="expression.dir/csvdb_files/tx2gene.txt")

I get the following errors:

importing quantifications
reading in files with read_tsv
1 2 3 4 
found matching linked transcriptome:
[ Ensembl - Homo sapien - release 85 ]
useHub=TRUE: checking for EnsDb via 'AnnotationHub'
snapshotDate(): 2020-10-27
did not find matching EnsDb via 'AnnotationHub'
building EnsDb with 'ensembldb' package
Importing GTF file ... OK
Processing metadata ... OK
Processing genes ... 
 Attribute availability:
  o gene_id ... OK
  o gene_name ... OK
  o entrezid ... Nope
  o gene_biotype ... Nope
OK
Processing transcripts ... 
 Attribute availability:
  o transcript_id ... OK
  o gene_id ... OK
  o source ... OK
 I can't find type=='CDS'! The resulting database will lack CDS information!
Error in FUN(X[[i]], ...) : 
  Column 'tx_cds_seq_start' is not of type integer!

#subsequent runs give this error
loading existing EnsDb created: 2021-04-27 09:55:06
Error in EnsDb(loadpath) : 
  Required tables tx, tx2exon, exon, chromosome are not present in the database!

I am primarily looking at UTRs so am not in need of the CDS information. is this an issue with how I have set up the linkedtxome? or is there another way to get tximeta to work without the cds information?

  • I also wondered whether I may be able to do this via Tximport, however, the inferential replicates from Salmon seem to come out different, i.e. as "infReps" as opposed to "infRep1, infRep2, infRepN" (as with tximeta):
library(tximport)
files = sample_table$files
names(files) = sample_table$names
se = tximport(files, type="salmon", txOut=TRUE)

se = scaleInfReps(se)

Gives the error:

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'metadata' for signature '"list"'

Is there a way to link the metadata from sample_table back to the se object? (e.g. similar to that of a DESeqDataSetFromTximport, does something similar to this exist for swish?)

Any help or advice to try to get this working would be greatly appreciated.

Many thanks in advance.

Jack.

sessionInfo()

R version 4.0.4 (2021-02-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tximport_1.18.0             org.Hs.eg.db_3.12.0         ensembldb_2.14.1            AnnotationFilter_1.14.0    
 [5] GenomicFeatures_1.42.3      AnnotationDbi_1.52.0        SummarizedExperiment_1.20.0 Biobase_2.50.0             
 [9] GenomicRanges_1.42.0        GenomeInfoDb_1.26.7         IRanges_2.24.1              S4Vectors_0.28.1           
[13] BiocGenerics_0.36.0         MatrixGenerics_1.2.1        matrixStats_0.58.0          qvalue_2.22.0              
[17] samr_3.0                    dplyr_1.0.5                 tximeta_1.8.5               fishpond_1.6.0             

loaded via a namespace (and not attached):
 [1] colorspace_2.0-0              ellipsis_0.3.1                XVector_0.30.0                fs_1.5.0                     
 [5] bit64_4.0.5                   interactiveDisplayBase_1.28.0 fansi_0.4.2                   xml2_1.3.2                   
 [9] splines_4.0.4                 cachem_1.0.4                  impute_1.64.0                 knitr_1.33                   
[13] jsonlite_1.7.2                Rsamtools_2.6.0               dbplyr_2.1.1                  shiny_1.6.0                  
[17] BiocManager_1.30.12           readr_1.4.0                   compiler_4.0.4                httr_1.4.2                   
[21] assertthat_0.2.1              Matrix_1.3-2                  fastmap_1.1.0                 lazyeval_0.2.2               
[25] later_1.1.0.1                 htmltools_0.5.1.1             prettyunits_1.1.1             tools_4.0.4                  
[29] gtable_0.3.0                  glue_1.4.2                    GenomeInfoDbData_1.2.4        reshape2_1.4.4               
[33] rappdirs_0.3.3                tinytex_0.31                  Rcpp_1.0.6                    vctrs_0.3.6                  
[37] Biostrings_2.58.0             rtracklayer_1.49.5            xfun_0.22                     stringr_1.4.0                
[41] openxlsx_4.2.3                mime_0.10                     lifecycle_1.0.0               gtools_3.8.2                 
[45] XML_3.99-0.5                  AnnotationHub_2.22.1          zlibbioc_1.36.0               scales_1.1.1                 
[49] hms_1.0.0                     promises_1.2.0.1              ProtGenerics_1.22.0           yaml_2.2.1                   
[53] curl_4.3                      memoise_2.0.0                 ggplot2_3.3.3                 biomaRt_2.46.3               
[57] stringi_1.5.3                 RSQLite_2.2.4                 BiocVersion_3.12.0            zip_2.1.1                    
[61] BiocParallel_1.24.1           GSA_1.03.1                    rlang_0.4.10                  pkgconfig_2.0.3              
[65] bitops_1.0-6                  evaluate_0.14                 lattice_0.20-41               purrr_0.3.4                  
[69] GenomicAlignments_1.26.0      bit_4.0.4                     tidyselect_1.1.0              plyr_1.8.6                   
[73] magrittr_2.0.1                R6_2.5.0                      generics_0.1.0                DelayedArray_0.16.3          
[77] DBI_1.1.1                     pillar_1.6.0                  withr_2.4.2                   svMisc_1.1.4                 
[81] RCurl_1.98-1.3                tibble_3.1.0                  crayon_1.4.1                  shinyFiles_0.9.0             
[85] utf8_1.1.4                    BiocFileCache_1.14.0          rmarkdown_2.7                 progress_1.2.2               
[89] grid_4.0.4                    blob_1.2.1                    digest_0.6.27                 xtable_1.8-4                 
[93] httpuv_1.5.5                  openssl_1.4.3                 munsell_0.5.0                 askpass_1.1
tximeta swish fishpond • 161 views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 2 hours ago
United States

hi Jack,

Can you try with the devel branch of tximeta (which will be released soon)? I want to check that it is fixed there, and then I can port the changes to release.

install_github("mikelove/tximeta", dependencies=FALSE)

If you can't get this above to work, a solution is to not use the tag "Ensembl" in the linkedTxome source, but anything else. E.g. "ModifiedEnsembl". This makes sense anyway, as the transcriptome is not simply Ensembl v85, but still I think I've fixed the bug in devel.

Re: tximport, basically, it will be much easier to get tximeta to give you the right shaped SE object. tximport provides a list of infReps per sample, whereas tximeta reshapes the infRep data to multiple assays (i.e. number Gibbs samples or bootstraps) all of size features x samples.

ADD COMMENT
0
Entering edit mode

Hi Michael,

Thank you for the rapid reply. I seem to be getting the same error with the development build as well:

 I can't find type=='CDS'! The resulting database will lack CDS information!
Error in FUN(X[[i]], ...) : 
  Column 'tx_cds_seq_start' is not of type integer!

However, as you suggested, using ModifiedEnsembl worked perfectly.

Thanks again

Jack.

ADD REPLY
0
Entering edit mode

Ok thanks for the report. I'll take a look at the code.

I think I do want to support people putting any string in source when they are building a linkedTxome but right now it seems "Ensembl" triggers an attempted download of the Ensembl txome via ensembldb. I need to do some more testing on my end.

ADD REPLY
0
Entering edit mode

Thanks again for the report Jack.

I've added some more messaging and documentation about what happens with linkedTxome when source="Ensembl" or "GENCODE".

I think the current behavior is actually best, but more messaging will help users realize that if they've modified the GTF, then they want to avoid having tximeta think that ensembldb is the correct package for parsing it.

If you get a chance could you see on your end (with the new github code) that the messaging appears when you try to import the data with source="Ensembl"?

ADD REPLY
0
Entering edit mode

Hi Michael,

Thank you for your help. As requested I can see that when trying with source="Ensembl" on the newest github version (1.9.6) that the warning message does appear.

NOTE: linkedTxome with source='Ensembl', ensembldb will be used to parse GTF.
this may produce errors if the GTF is not from Ensembl, or has been modified.
set useHub=FALSE in tximeta to avoid download of reference txome from AnnotationHub.
alternatively use a different string for source argument

I hadn't initially realised any string could be used, I assumed from the docs it was only "Ensembl", "GENCODE" or "de-novo", so this warning message helps to clarify that in that instance.

Jack.

ADD REPLY
0
Entering edit mode

Thanks, and I realized that the documentation was also confusing, and have cleared that up in the new man page:

source: the source of transcriptome (e.g. "de-novo").
Note: if you specify "GENCODE" or "Ensembl", this will trigger
behavior by tximeta that may not be desired: e.g. attempts to
download canonical transcriptome data from AnnotationHub
(unless useHub=FALSE when running tximeta) and parsing of
Ensembl GTF using ensembldb (which may fail if the GTF file
has been modified).}

I also added this to the vignette:

Source: when creating the linkedTxome one must specify the source of the transcriptome. See ?linkedTxome for a note on the implications of this text string. For canonical GENCODE or Ensembl transcriptomes, one can use "GENCODE" or "Ensembl", but for modified transcriptomes, it is recommended to use a different string, e.g. "ModifiedEnsembl", etc.

ADD REPLY

Login before adding your answer.

Traffic: 411 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6