using the "promoters" function with an "OrganismDb" to generate "GRanges" with "REFSEQ" rather than UCSC gene names
1
1
Entering edit mode
efoss ▴ 10
@efoss-8908
Last seen 19 months ago
United States

I would like to use the "promoters" function in the "GenomicRanges" package to generate "GRanges" objects, but I would like these objects to have "REFSEQ" names rather than UCSC-style names.

This gives me UCSC-style names:

library(OrganismDbi)

a1 <- promoters(Mus.musculus)

Trying the following sort of approach, based on looking at the columns in Mus.musculus, allows me to change some of the metadata columns, but I quickly end up with error messages saying that the columns I'm trying to add are unavailable:

a2 <- promoters(Mus.musculus, columns = c("TXID", "TXSTART", "GENEID"))

Thank you.

Eric

promoter granges refseq ucsc organismdb • 1.9k views
1
Entering edit mode
Robert Castelo ★ 3.0k
@rcastelo
Last seen 3 hours ago
Barcelona/Universitat Pompeu Fabra

hi Eric,

I would say you need to use the GenomicFeatures package that allows one to fetch and build custom annotation TxDb objects. You can, for instance, build the TxDb package from the 'refGene' track at the UCSC using the function  makeTxDbFromUCSC() that will return you a TxDb object, or makeTxDbPackageFromUCSC() if you want to have an actual package file and install it. Look up the documentation to see how to call them. You can then use the promoters() function on the TxDb objects. The vignette of GenomicFeatures has also specific examples.

Another route that may be interesting for you is to go directly to the source of refseq annotations which is at NCBI, download the GFF and create a TxDb object from the GFF file with the function makeTxDbFromGFF(). Here is an example:

library(GenomicFeatures)

url <- "ftp://ftp.ncbi.nlm.nih.gov/genomes/M_musculus/GFF/ref_GRCm38.p3_top_level.gff3.gz"
download.file(url, "ref_GRCm38.p3_top_level.gff3.gz", method="curl")

txdbMm38 <- makeTxDbFromGFF(file=gzfile("ref_GRCm38.p3_top_level.gff3.gz"))

a1 <- promoters(txdbMm38)

GRanges object with 6 ranges and 2 metadata columns:
seqnames             ranges strand |     tx_id     tx_name
<Rle>          <IRanges>  <Rle> | <integer> <character>
[1] NC_000067.6 [3355323, 3357522]      + |         1 XR_865166.1
[2] NC_000067.6 [4520817, 4523016]      + |         2      Gm7357
[3] NC_000067.6 [4593021, 4595220]      + |         3 XR_865167.1
[4] NC_000067.6 [4593021, 4595220]      + |         4 XR_865168.1
[5] NC_000067.6 [4607962, 4610161]      + |         5      Gm7369
[6] NC_000067.6 [4614900, 4617099]      + |         6 XR_373196.1
-------
seqinfo: 163 sequences from an unspecified genome; no seqlengths

cheers,

robert.

0
Entering edit mode

Thanks so very much, Robert! That really helps.

Eric