Question

Unable to locate BioMart dataset for Dog10K_Boxer_Tasha

ScafioRuo • 0

@caranlove-21533

United States

I am having trouble finding the Ensembl Dog10K_Boxer_Tasha dataset. The genome is available through the [webpage][1], but I have been trying to pull gene names using either the BioMart web portal or the biomaRt R package, with no luck. Has anyone managed to locate this dataset?

R BioMart biomaRt • 797 views

ADD COMMENT • link 5 months ago • updated 8 weeks ago ScafioRuo • 0

score 0 · Answer 1 · 2023-11-16

James W. MacDonald 65k

@james-w-macdonald-5106

United States

What's available on the webpage is information on the genome for that particular dog breed. The BioMart server only has information for ROS_Cfam_1.0. If you are just after gene names, they aren't positional, so it shouldn't matter. In other words, the only expected differences between ROS_Cfam_1.0 and Dog10K_Boxer_Tasha are in the positions of the genes; the names of those genes (or their Ensembl gene IDs) should not vary.

ADD COMMENT • link 5 months ago James W. MacDonald 65k

Unfortunately, I am looking to go from position to gene name (and ideally Ensembl gene ID), which means I might need to use another genome browser to go from position to gene name, then use biomaRt to translate the names into Ensembl IDs. But if you have any other (more streamlined) suggestions, I would love to hear them! Thanks!

ADD REPLY • link 5 months ago ScafioRuo • 0

You could just make a TxDb object and use that.

> txdb <- makeTxDbFromGFF("https://ftp.ensembl.org/pub/current_gff3/canis_lupus_familiarisboxer/Canis_lupus_familiarisboxer.Dog10K_Boxer_Tasha.110.gff3.gz")
Import genomic features from the file as a GRanges object ... trying URL 'https://ftp.ensembl.org/pub/current_gff3/canis_lupus_familiarisboxer/Canis_lupus_familiarisboxer.Dog10K_Boxer_Tasha.110.gff3.gz'
Content type 'application/x-gzip' length 17003104 bytes (16.2 MB)
downloaded 16.2 MB

OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: https://ftp.ensembl.org/pub/current_gff3/canis_lupus_familiarisboxer/Canis_lupus_familiarisboxer.Dog10K_Boxer_Tasha.110.gff3.gz
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# Nb of transcripts: 53114
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2023-11-17 11:01:45 -0500 (Fri, 17 Nov 2023)
# GenomicFeatures version at creation time: 1.54.0
# RSQLite version at creation time: 2.3.1
# DBSCHEMAVERSION: 1.2
> genes(txdb)
GRanges object with 29993 ranges and 1 metadata column:
                     seqnames            ranges strand |            gene_id
                        <Rle>         <IRanges>  <Rle> |        <character>
  ENSCAFG00000000001        1   1248021-1317801      - | ENSCAFG00000000001
  ENSCAFG00000000002        1   1360244-1361729      + | ENSCAFG00000000002
  ENSCAFG00000000005        1   1489349-1568646      + | ENSCAFG00000000005
  ENSCAFG00000000007        1   1604336-1638507      - | ENSCAFG00000000007
  ENSCAFG00000000008        1   1724670-1737276      + | ENSCAFG00000000008
                 ...      ...               ...    ... .                ...
  ENSCAFG00000059470        8 68812196-68812302      + | ENSCAFG00000059470
  ENSCAFG00000059471        9 19377613-19388137      - | ENSCAFG00000059471
  ENSCAFG00000059472       11 48779675-48782860      - | ENSCAFG00000059472
  ENSCAFG00000059473       10 13155710-13167443      + | ENSCAFG00000059473
  ENSCAFG00000059474       23   9212550-9216523      - | ENSCAFG00000059474
  -------
  seqinfo: 40 sequences from an unspecified genome; no seqlengths

This gives you the genomic position of each gene together with its Ensembl gene ID.
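To go from a position to a gene ID, one option is to overlap your positions with the gene ranges. The following is only a minimal sketch: it uses a small toy GRanges built from three of the genes printed in the output of genes(txdb), and the query positions are made up for illustration. With real data you would presumably use genes(txdb) in place of the toy object.

```r
library(GenomicRanges)

# Toy stand-in for genes(txdb): three of the genes printed in the output
gene_ranges <- GRanges(
  seqnames = c("1", "1", "1"),
  ranges = IRanges(
    start = c(1248021, 1360244, 1489349),
    end   = c(1317801, 1361729, 1568646)
  ),
  strand = c("-", "+", "+"),
  gene_id = c("ENSCAFG00000000001", "ENSCAFG00000000002", "ENSCAFG00000000005")
)
names(gene_ranges) <- gene_ranges$gene_id

# Hypothetical single-base query positions on chromosome 1
query <- GRanges(
  seqnames = "1",
  ranges = IRanges(start = c(1250000, 1361000, 1400000), width = 1)
)

# Find which gene (if any) each position falls in, regardless of strand
hits <- findOverlaps(query, gene_ranges, ignore.strand = TRUE)

# One row per (position, gene) overlap; positions outside any gene are dropped
result <- data.frame(
  chrom   = as.character(seqnames(query))[queryHits(hits)],
  pos     = start(query)[queryHits(hits)],
  gene_id = gene_ranges$gene_id[subjectHits(hits)],
  stringsAsFactors = FALSE
)
result
```

Positions that fall between genes (the third query here) simply produce no row, so you may want to check for those separately.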

ADD REPLY • link 5 months ago James W. MacDonald 65k

I forgot, you need library(GenomicFeatures) first.

ADD REPLY • link 5 months ago James W. MacDonald 65k

Oh fantastic! Thank you, this is a wonderful workaround!

ADD REPLY • link 5 months ago ScafioRuo • 0

@james-w-macdonald-5106, thank you again for the solution you provided previously! Unfortunately, these IDs seem to be unique to the Dog10K_Boxer_Tasha genome, and thus are not compatible with biomaRt in the end.

So I am now wondering what you would suggest for either 1) converting these IDs to ROS_Cfam_1.0 without a full liftover of coordinates, or 2) the easiest method for pulling gene names (not Ensembl IDs) from coordinates in R.

Thanks so much for your help!!

ADD REPLY • link 3 months ago ScafioRuo • 0
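For option 2, one possibility (a sketch, not a tested recipe for this genome) is to read the gene symbols straight from the GFF3 file used earlier in the thread. Ensembl GFF3 files generally carry a Name attribute on gene features that have a symbol, and rtracklayer's import() exposes attributes as metadata columns. The example below uses a tiny hand-written GFF3 so it runs offline; the symbol GENE_A is a hypothetical placeholder, and whether a given Dog10K_Boxer_Tasha gene has a Name at all is an assumption you would need to check.

```r
library(rtracklayer)

# Toy GFF3 in the Ensembl layout; "GENE_A" is a made-up symbol,
# and the second gene deliberately has no Name attribute
gff_lines <- c(
  "##gff-version 3",
  paste("1", "ensembl", "gene", "1248021", "1317801", ".", "-", ".",
        "ID=gene:ENSCAFG00000000001;Name=GENE_A;gene_id=ENSCAFG00000000001",
        sep = "\t"),
  paste("1", "ensembl", "gene", "1360244", "1361729", ".", "+", ".",
        "ID=gene:ENSCAFG00000000002;gene_id=ENSCAFG00000000002",
        sep = "\t")
)
gff_file <- tempfile(fileext = ".gff3")
writeLines(gff_lines, gff_file)

# Import as a GRanges; GFF3 attributes become metadata columns
gff <- import(gff_file)

# Keep gene-level features (real Ensembl files also use types such as
# "ncRNA_gene" and "pseudogene", which you may want to include)
gene_rows <- gff[gff$type == "gene"]

# Table of Ensembl gene ID and symbol; symbol is NA where Name is absent
gene_table <- data.frame(
  gene_id = gene_rows$gene_id,
  symbol  = gene_rows$Name,
  stringsAsFactors = FALSE
)
gene_table
```

The resulting gene_rows object keeps its coordinates, so it can be overlapped with query positions in the same way as the output of genes(txdb).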

Is there a particular reason you are using the boxer genome rather than the C. lupus familiaris genome? It seems that using the 'regular' genome would fix all your issues.

ADD REPLY • link 3 months ago James W. MacDonald 65k

Thank you for your comment! Yes, I think remapping might be helpful, but I was hoping to find a solution that does not require remapping. Thanks!

ADD REPLY • link 3 months ago • updated 8 weeks ago ScafioRuo • 0

I don't know why Ensembl does that. But they are not the only game in town. You could use the UCSC version, which is based on CanFam6 and uses NCBI IDs, which should be readily converted to symbols if that's what you want.

ADD REPLY • link 3 months ago James W. MacDonald 65k

Thanks, yeah, I am not sure why they do that either. Thanks for pondering with me, though!

ADD REPLY • link 3 months ago ScafioRuo • 0