Unable to locate BioMart dataset for Dog10K_Boxer_Tasha
ScafioRuo • 0
@caranlove-21533
Last seen 8 weeks ago
United States

I am having trouble finding the Ensembl Dog10K_Boxer_Tasha dataset. It is available through the [webpage][1], but I have had no luck pulling down gene names through either the BioMart web portal or the R package biomaRt. Has anyone managed to locate this dataset?

R BioMart biomaRt • 797 views
@james-w-macdonald-5106
Last seen 21 hours ago
United States

What's available on the webpage is information on the genome assembly for that particular dog (a boxer). The BioMart server only has information for ROS_Cfam_1.0. If you are just after gene names, they aren't positional, so it shouldn't matter. In other words, the only expected differences between ROS_Cfam_1.0 and Dog10K_Boxer_Tasha are positional differences in the genes; the names of those genes (or their Ensembl gene IDs) should not vary.


Unfortunately, I am looking to go from position to gene name (and ideally Ensembl gene ID), which means I might need to use another genome browser to go from position to gene name, then use biomaRt to translate the names into Ensembl IDs. But if you have any other (more streamlined) suggestions, I would love to hear them! Thanks!


You could just make a TxDb object and use that.

> txdb <- makeTxDbFromGFF("https://ftp.ensembl.org/pub/current_gff3/canis_lupus_familiarisboxer/Canis_lupus_familiarisboxer.Dog10K_Boxer_Tasha.110.gff3.gz")
Import genomic features from the file as a GRanges object ... trying URL 'https://ftp.ensembl.org/pub/current_gff3/canis_lupus_familiarisboxer/Canis_lupus_familiarisboxer.Dog10K_Boxer_Tasha.110.gff3.gz'
Content type 'application/x-gzip' length 17003104 bytes (16.2 MB)
downloaded 16.2 MB

OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: https://ftp.ensembl.org/pub/current_gff3/canis_lupus_familiarisboxer/Canis_lupus_familiarisboxer.Dog10K_Boxer_Tasha.110.gff3.gz
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# Nb of transcripts: 53114
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2023-11-17 11:01:45 -0500 (Fri, 17 Nov 2023)
# GenomicFeatures version at creation time: 1.54.0
# RSQLite version at creation time: 2.3.1
# DBSCHEMAVERSION: 1.2
> genes(txdb)
GRanges object with 29993 ranges and 1 metadata column:
                     seqnames            ranges strand |            gene_id
                        <Rle>         <IRanges>  <Rle> |        <character>
  ENSCAFG00000000001        1   1248021-1317801      - | ENSCAFG00000000001
  ENSCAFG00000000002        1   1360244-1361729      + | ENSCAFG00000000002
  ENSCAFG00000000005        1   1489349-1568646      + | ENSCAFG00000000005
  ENSCAFG00000000007        1   1604336-1638507      - | ENSCAFG00000000007
  ENSCAFG00000000008        1   1724670-1737276      + | ENSCAFG00000000008
                 ...      ...               ...    ... .                ...
  ENSCAFG00000059470        8 68812196-68812302      + | ENSCAFG00000059470
  ENSCAFG00000059471        9 19377613-19388137      - | ENSCAFG00000059471
  ENSCAFG00000059472       11 48779675-48782860      - | ENSCAFG00000059472
  ENSCAFG00000059473       10 13155710-13167443      + | ENSCAFG00000059473
  ENSCAFG00000059474       23   9212550-9216523      - | ENSCAFG00000059474
  -------
  seqinfo: 40 sequences from an unspecified genome; no seqlengths

This output includes the position of each gene along with its Ensembl gene ID.
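To go from a position to a gene ID, one option is to overlap the query positions with the gene ranges. The following is only a minimal sketch: `genes_gr` is a hand-built stand-in holding the first two genes from the output above (in practice it would be `genes(txdb)`), and the three query positions are made up for illustration.

```r
library(GenomicRanges)

# Stand-in for genes(txdb): the first two genes from the output above.
genes_gr <- GRanges(
  seqnames = c("1", "1"),
  ranges   = IRanges(start = c(1248021, 1360244), end = c(1317801, 1361729)),
  strand   = c("-", "+"),
  gene_id  = c("ENSCAFG00000000001", "ENSCAFG00000000002")
)
names(genes_gr) <- genes_gr$gene_id

# Hypothetical single-base query positions on chromosome 1.
query <- GRanges(
  seqnames = rep("1", 3),
  ranges   = IRanges(start = c(1250000, 1361000, 100), width = 1)
)

# Overlap positions with genes; strand is ignored because positions are unstranded.
hits <- findOverlaps(query, genes_gr, ignore.strand = TRUE)

# One row per (position, gene) pair; positions that hit no gene are dropped.
result <- data.frame(
  chrom   = as.character(seqnames(query))[queryHits(hits)],
  pos     = start(query)[queryHits(hits)],
  gene_id = genes_gr$gene_id[subjectHits(hits)]
)
print(result)
```

The seqnames of the query have to use the same naming style as the annotation (here `1`, not `chr1`), otherwise nothing will overlap.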


I forgot, you need library(GenomicFeatures) first.


Oh fantastic! Thank you, this is a wonderful workaround!


@james-w-macdonald-5106, Thank you again for the solution provided previously! Unfortunately, these IDs seem to be unique to the Dog10K_Boxer_Tasha genome... and thus are not compatible with biomart in the end...

So I now wonder what you would consider the best solution for either 1) converting these IDs to ROS_Cfam_1.0 IDs without a full liftover of coordinates, or 2) the easiest method for pulling gene names (not Ensembl IDs) from coordinates in R.

Thanks so much for your help!!
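For option 2, one possibility (an assumption, not something confirmed in this thread) is that the GFF3 file itself carries gene names: Ensembl GFF3 gene records usually have a `Name` attribute when a symbol has been assigned. A rough sketch with `rtracklayer::import()` follows. To keep it self-contained it reads a tiny stand-in file written to a temporary path; the coordinates and IDs come from the output above, but the `Name` value `GENE_A` is hypothetical.

```r
library(rtracklayer)

# Tiny stand-in for the Ensembl GFF3 file. The Name value is hypothetical;
# the second gene has no Name, as happens for genes without a symbol.
gff_lines <- c(
  "##gff-version 3",
  "1\tensembl\tgene\t1248021\t1317801\t.\t-\t.\tID=gene:ENSCAFG00000000001;Name=GENE_A;gene_id=ENSCAFG00000000001",
  "1\tensembl\tgene\t1360244\t1361729\t.\t+\t.\tID=gene:ENSCAFG00000000002;gene_id=ENSCAFG00000000002"
)
gff_file <- tempfile(fileext = ".gff3")
writeLines(gff_lines, gff_file)

# import() returns a GRanges with one metadata column per GFF3 attribute.
gff <- import(gff_file)

# Keep gene-level records; Ensembl files also use types such as ncRNA_gene.
gene_rows <- gff[gff$type %in% c("gene", "ncRNA_gene", "pseudogene")]

# Table of Ensembl gene ID and symbol (NA where no Name attribute exists).
annot <- data.frame(
  gene_id = gene_rows$gene_id,
  symbol  = gene_rows$Name
)
print(annot)
```

With the real file, `gene_rows` could be used directly as the subject of an overlap query, so the symbol comes back together with the ID.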


Is there a particular reason you are using the boxer genome rather than the C. lupus familiaris genome? It seems that using the 'regular' genome would fix all your issues.


Thank you for your comment! Yes, I think remapping might be helpful, but I was hoping to find a solution that doesn't require remapping. Thanks!


I don't know why Ensembl does that. But they are not the only game in town. You could use the UCSC version, which is based on CanFam6 and uses NCBI IDs, which should be readily converted to symbols if that's what you want.
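As a rough illustration of the ID-to-symbol step, NCBI (Entrez) Gene IDs for dog can be mapped to symbols with the `org.Cf.eg.db` annotation package. This sketch assumes that package is installed, and it takes a handful of IDs straight from the package rather than from a real UCSC annotation, so that no IDs are hard-coded.

```r
library(AnnotationDbi)
library(org.Cf.eg.db)

# Placeholder input: a few Entrez Gene IDs known to exist in the package.
# In practice these would be the gene IDs from the UCSC-based annotation.
entrez_ids <- head(keys(org.Cf.eg.db, keytype = "ENTREZID"))

# Map each Entrez Gene ID to a gene symbol (first match if there are several).
symbols <- mapIds(
  org.Cf.eg.db,
  keys      = entrez_ids,
  column    = "SYMBOL",
  keytype   = "ENTREZID",
  multiVals = "first"
)
print(symbols)
```

The result is a character vector named by the input IDs, so it can be joined back onto a table of overlap hits.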


Thanks, yeah, I am not sure why they do that either. Thanks for pondering with me, though!
