ensembldb::getGenomeTwoBitFile() not working for many species
1
0
Entering edit mode
cjag • 0
@d9c4e73f
Last seen 14 months ago
United States

Hi, I am trying to retrieve gene sequences from AnnotationHub using ensemblb's getGenomeTwoBitFile() for many non-human species and am getting similar errors stating that there is no genome assembly fasta file available for these organisms. I tried with multiple versions of the AnnotationHub EnsDb object for each species and continue to get the same type of error. My code works fine for the latest human genome version 109 and is able to fetch version 95 of the Mus musculus genome (but it doesn't find anything more updated than that for mouse). Is there anything I can do to change this or am I out of luck? Ideally I would like to retrieve selected gene sequences from all species in the Ensembl database so a universal solution would be much appreciated if possible. I know I can do this using biomaRt but I need an alternative with different caching behavior, which is why I started looking into the ensembldb method. Thanks in advance for your assistance!

Edit: it's actually only retrieving v95 of the human twobit file as well, not v109 as I originally stated.

> ah <- AnnotationHub::AnnotationHub(cache = ".cache/annotationhub")
> xtah <- ah[["AH83359"]]
downloading 1 resources
retrieving 1 resource
  |=============================================================================================================================================| 100%

loading from cache
> xtgenome <- ensembldb::getGenomeTwoBitFile(xtah)
snapshotDate(): 2022-10-31
Error in .local(x, ...) : 
  No genome assembly fasta file available for organism: Xenopus tropicalis, data provider: Ensembl and genome version: Xenopus_tropicalis_v9.1!
> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: AlmaLinux 9.2 (Turquoise Kodkod)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libopenblasp-r0.3.21.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] biomartr_1.0.4          rtracklayer_1.58.0      ensembldb_2.22.0        AnnotationFilter_1.22.0 GenomicFeatures_1.50.4  GenomicRanges_1.50.2   
 [7] GenomeInfoDb_1.34.9     AnnotationDbi_1.60.2    IRanges_2.32.0          S4Vectors_0.36.2        Biobase_2.58.0          BiocGenerics_0.44.0    
[13] readr_2.1.4             biomaRt_2.54.1         

loaded via a namespace (and not attached):
 [1] ProtGenerics_1.30.0           bitops_1.0-7                  matrixStats_1.0.0             bit64_4.0.5                  
 [5] filelock_1.0.2                progress_1.2.2                httr_1.4.7                    tools_4.2.2                  
 [9] utf8_1.2.3                    R6_2.5.1                      DBI_1.1.3                     lazyeval_0.2.2               
[13] withr_2.5.0                   tidyselect_1.2.0              prettyunits_1.1.1             bit_4.0.5                    
[17] curl_5.0.2                    compiler_4.2.2                cli_3.6.1                     xml2_1.3.5                   
[21] DelayedArray_0.24.0           rappdirs_0.3.3                stringr_1.5.0                 digest_0.6.33                
[25] Rsamtools_2.14.0              XVector_0.38.0                pkgconfig_2.0.3               htmltools_0.5.6              
[29] MatrixGenerics_1.10.0         BSgenome_1.66.3               dbplyr_2.3.3                  fastmap_1.1.1                
[33] rlang_1.1.1                   RSQLite_2.3.1                 shiny_1.7.5                   BiocIO_1.8.0                 
[37] generics_0.1.3                jsonlite_1.8.7                BiocParallel_1.32.6           vroom_1.6.3                  
[41] dplyr_1.1.3                   RCurl_1.98-1.12               magrittr_2.0.3                GenomeInfoDbData_1.2.9       
[45] Matrix_1.5-1                  Rcpp_1.0.11                   fansi_1.0.4                   lifecycle_1.0.3              
[49] stringi_1.7.12                yaml_2.3.7                    SummarizedExperiment_1.28.0   zlibbioc_1.44.0              
[53] BiocFileCache_2.6.1           AnnotationHub_3.6.0           grid_4.2.2                    blob_1.2.4                   
[57] parallel_4.2.2                promises_1.2.1                crayon_1.5.2                  lattice_0.20-45              
[61] Biostrings_2.66.0             hms_1.1.3                     KEGGREST_1.38.0               pillar_1.9.0                 
[65] rjson_0.2.21                  codetools_0.2-18              XML_3.99-0.14                 glue_1.6.2                   
[69] BiocVersion_3.16.0            downloader_0.4                data.table_1.14.8             renv_1.0.2                   
[73] BiocManager_1.30.22           png_0.1-8                     vctrs_0.6.3                   tzdb_0.4.0                   
[77] httpuv_1.6.11                 purrr_1.0.2                   tidyr_1.3.0                   cachem_1.0.8                 
[81] mime_0.12                     xtable_1.8-4                  restfulr_0.0.15               later_1.3.1                  
[85] tibble_3.2.1                  GenomicAlignments_1.34.1      memoise_2.0.1                 ellipsis_0.3.2               
[89] interactiveDisplayBase_1.36.0
AnnotationHubSoftware ensembldb • 1.2k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 4 hours ago
United States

That's sort of a roundabout way of getting a TwoBit file. Why not query directly?

> z <-  query(hub, c("xenopus tropicalis","twobit"))
> z
AnnotationHub with 103 records
# snapshotDate(): 2023-04-25
# $dataprovider: Ensembl
# $species: Xenopus tropicalis, x...
# $rdataclass: TwoBitFile
# additional mcols():
#   taxonomyid, genome,
#   description,
#   coordinate_1_based,
#   maintainer, rdatadateadded,
#   preparerclass, tags,
#   rdatapath, sourceurl,
#   sourcetype 
# retrieve records with, e.g.,
#   'object[["AH49927"]]' 


  AH49927  |
  AH49928  |
  AH49929  |
  AH49930  |
  AH49931  |
  ...       
  AH104588 |
  AH107024 |
  AH107025 |
  AH107026 |
  AH107027 |
           title                   
  AH49927  Xenopus_tropicalis.JG...
  AH49928  Xenopus_tropicalis.JG...
  AH49929  Xenopus_tropicalis.JG...
  AH49930  Xenopus_tropicalis.JG...
  AH49931  Xenopus_tropicalis.JG...
  ...      ...                     
  AH104588 Xenopus_tropicalis.Xe...
  AH107024 Xenopus_tropicalis.UC...
  AH107025 Xenopus_tropicalis.UC...
  AH107026 Xenopus_tropicalis.UC...
  AH107027 Xenopus_tropicalis.UC...

## delving further
> zz <- as(mcols(z)[,c("dataprovider","rdatadateadded","genome","sourceurl","description")], "data.frame")
> zzsmall <- zz[grep("v9.1$", zz$genome),]
> zzsmall <- zzsmall[grep("cdna.all.fa.gz$", zzsmall$sourceurl),]
> zzsmall
         dataprovider rdatadateadded
AH78353       Ensembl     2019-10-29
AH83095       Ensembl     2020-04-27
AH85367       Ensembl     2020-10-26
AH89059       Ensembl     2020-10-27
AH91547       Ensembl     2020-10-27
AH93816       Ensembl     2021-05-17
AH100280      Ensembl     2021-10-20
AH104585      Ensembl     2022-04-21
                          genome
AH78353  Xenopus_tropicalis_v9.1
AH83095  Xenopus_tropicalis_v9.1
AH85367  Xenopus_tropicalis_v9.1
AH89059  Xenopus_tropicalis_v9.1
AH91547  Xenopus_tropicalis_v9.1
AH93816  Xenopus_tropicalis_v9.1
AH100280 Xenopus_tropicalis_v9.1
AH104585 Xenopus_tropicalis_v9.1
                                                                                                                             sourceurl
AH78353   ftp://ftp.ensembl.org/pub/release-98/fasta/xenopus_tropicalis/cdna/Xenopus_tropicalis.Xenopus_tropicalis_v9.1.cdna.all.fa.gz
AH83095  ftp://ftp.ensembl.org/pub/release-100/fasta/xenopus_tropicalis/cdna/Xenopus_tropicalis.Xenopus_tropicalis_v9.1.cdna.all.fa.gz
AH85367  ftp://ftp.ensembl.org/pub/release-101/fasta/xenopus_tropicalis/cdna/Xenopus_tropicalis.Xenopus_tropicalis_v9.1.cdna.all.fa.gz
AH89059  ftp://ftp.ensembl.org/pub/release-102/fasta/xenopus_tropicalis/cdna/Xenopus_tropicalis.Xenopus_tropicalis_v9.1.cdna.all.fa.gz
AH91547  ftp://ftp.ensembl.org/pub/release-103/fasta/xenopus_tropicalis/cdna/Xenopus_tropicalis.Xenopus_tropicalis_v9.1.cdna.all.fa.gz
AH93816  ftp://ftp.ensembl.org/pub/release-104/fasta/xenopus_tropicalis/cdna/Xenopus_tropicalis.Xenopus_tropicalis_v9.1.cdna.all.fa.gz
AH100280 ftp://ftp.ensembl.org/pub/release-105/fasta/xenopus_tropicalis/cdna/Xenopus_tropicalis.Xenopus_tropicalis_v9.1.cdna.all.fa.gz
AH104585 ftp://ftp.ensembl.org/pub/release-106/fasta/xenopus_tropicalis/cdna/Xenopus_tropicalis.Xenopus_tropicalis_v9.1.cdna.all.fa.gz
                                         description
AH78353  TwoBit cDNA sequence for Xenopus tropicalis
AH83095  TwoBit cDNA sequence for Xenopus tropicalis
AH85367  TwoBit cDNA sequence for xenopus tropicalis
AH89059  TwoBit cDNA sequence for xenopus tropicalis
AH91547  TwoBit cDNA sequence for xenopus tropicalis
AH93816  TwoBit cDNA sequence for xenopus tropicalis
AH100280 TwoBit cDNA sequence for xenopus tropicalis
AH104585 TwoBit cDNA sequence for xenopus tropicalis

Presumably you want the last one, but your choice.

ADD COMMENT
0
Entering edit mode

Oh wait. You want the entire genome, not the cDNA, right? In that case you probably have to go to their ftp site, download all the chromosomes and then convert to a TwoBit file

ADD REPLY
0
Entering edit mode

I didn't know I could do that, thank you! However, I am looking for the gene (exon + intron) sequences, not the cDNA. So I am trying to retrieve the primary assembly genomic DNA twobit files. Looking at the records in your z object, it looks like it only has the toplevel DNA objects, which might work for my purposes but are going to be substantially larger. I think that's prohibitive for my use-case (needing a lot of species) so if you have any suggestions for retrieving the primary assembly DNA I would greatly appreciate it!

Edit: I think I also need the AnnotationHub annotation object from which I am pulling exon coordinates to match the AnnotationHub object from which I pull the gene sequences, so that the genome versions are the same. Using ensembldb to get the twobit files allows me to do that, if I am understanding it correctly.

ADD REPLY
0
Entering edit mode

Ensembl says this:


PRIMARY ASSEMBLY

Primary assembly contains all toplevel sequence regions excluding haplotypes and patches. This file is best used for performing sequence similarity searches where patch and haplotype sequences would confuse analysis. If the primary assembly file is not present, that indicates that there are no haplotype/patch regions, and the 'toplevel' file is equivalent.

Which seems to be a different thing than what you think of as a primary assembly? But if you just mean the standard chromosomes, then for X. tropicalis, the toplevel are the same as primary assembly.

ADD REPLY
0
Entering edit mode

Ah, thank you! This was very helpful - "If the primary assembly file is not present, that indicates that there are no haplotype/patch regions, and the 'toplevel' file is equivalent."

ADD REPLY
0
Entering edit mode

There is a primary assembly for X. tropicalis, but it's not available on AnnotationHub. It's not what I think you think it is though.

> library(Biostrings)
> options(timeout = 1000)
> readDNAStringSet("https://ftp.ensembl.org/pub/current_fasta/xenopus_tropicalis/dna/Xenopus_tropicalis.UCB_Xtro_10.0.dna.primary_assembly.1.fa.gz")
trying URL 'https://ftp.ensembl.org/pub/current_fasta/xenopus_tropicalis/dna/Xenopus_tropicalis.UCB_Xtro_10.0.dna.primary_assembly.1.fa.gz'
Content type 'application/x-gzip' length 63624057 bytes (60.7 MB)
downloaded 60.7 MB

DNAStringSet object of length 1:
        width seqnames               
[1] 217471166 CC...CC 1 dna:primary_ass...

That's just a chromosome, not the gene sequences (not a thing, btw. There are transcript sequences, but a gene is a simplification of all the transcripts and doesn't really exist).

ADD REPLY
0
Entering edit mode

Yes, I know the gene sequences don't really exist per se. I'm having to build them by taking the start position of the first exon and the end position of the last exon and then extracting that range from the chromosomal sequences, or by using ensembldb::genes() to get a GRanges filter for the specific genes I'm interested in and then applying that to the twobit object with BSgenome::getSeq(). It's for a very specific use-case that just requires knowledge of each exon's position within the overall gene, not knowledge of how the exons relate to each other in terms of transcripts. I can get the exon position info using the AnnotationHub object and ensembldb::exons() and write that to a gff3 file, but I need the corresponding gene sequences in fasta format as well. So it's making my life a bit harder given that there's no easy method of fetching those sequences from a cached/downloaded genome file.

This is code that's working as intended, but it doesn't function for many of the non-human genomes and actually only gives v95 of the twobit file, not v109 which is the AnnotationHub version returned by AH109606. Hence my original question about the versions returned and the availability of non-human genomes.

#retrieve gene sequences from annotation hub (doesn't work for many non-human species?)  
ah <- AnnotationHub::AnnotationHub(cache = ".cache/annotationhub")
hsah <- ah[["AH109606"]]
genome <- ensembldb::getGenomeTwoBitFile(hsah)
genes <- c("ENSG00000169252", "ENSG00000043591")
gn <- ensembldb::genes(hsah, columns=listColumns(hsah), return.type = "GRanges", filter = AnnotationFilter::AnnotationFilterList(AnnotationFilter::GeneIdFilter(genes))) |> unique()
geneSeq <- BSgenome::getSeq(genome, gn) |> as.character()

It returns the following:

ENSG00000043591
"AGAAACATGCTGAA.........GAAGCACAA"

ENSG00000169252
"GCACTGCGAAGCG.......ACCATG"

If you have any suggestions for easier ways to do this, I'm all ears! Thanks again :)

ADD REPLY
0
Entering edit mode

The easy way would involve other people having already converted the existing FASTA files into 2bit and uploading to the AnnotationHub. But there are only a few species for which that has been done, so the easy way may not exist.

If this is something you have to do repeatedly for the same species, then it might make sense to download the FASTA files from Ensembl and convert to 2bit format yourself. Or even if you have to do it once, maybe it's worth the effort? There is the BSgenomeForge package, which will automatically download FASTA files from NCBI or UCSC and make BSgenome packages. That might not be directly relevant for you, although maybe it is? There are any number of differences between Ensembl and NCBI, as far as transcripts and genes go, but I don't know if there are any real differences between the genomes.

But anyway, at the very least BSgenomeForge has functions to download and convert FASTA files that you could copy and edit to get from Ensembl, and then maybe it's just a matter of firing the code off on a Friday and hoping it's done by Monday.

ADD REPLY
0
Entering edit mode

I think your original suggestion of querying AnnotationHub directly for the twobit files is probably the way to go - it looks like they exist for all the species in the main Ensembl database. Do you happen to know what the difference between the records marked dna_rm and dna_sm is? i.e. AH107025 | Xenopus_tropicalis.UCB_Xtro_10.0.dna_rm.toplevel.2bit vs. AH107026 | Xenopus_tropicalis.UCB_Xtro_10.0.dna_sm.toplevel.2bit? Thanks again!

ADD REPLY
0
Entering edit mode

FILE NAMES

The files are consistently named following this pattern: <species>.<assembly>.<sequence type>.<id type>.<id>.fa.gz

<species>: The systematic name of the species.

<assembly>: The assembly build name.

<sequence type>:

  • 'dna' - unmasked genomic DNA sequences.
    • 'dna_rm' - masked genomic DNA. Interspersed repeats and low complexity regions are detected with the RepeatMasker tool and masked by replacing repeats with 'N's.
    • 'dna_sm' - soft-masked genomic DNA. All repeats and low complexity regions have been replaced with lowercased versions of their nucleic base
ADD REPLY

Login before adding your answer.

Traffic: 729 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6