Question

Does biovizBase contain a genesymbol data set for other model organisms?

0

Entering edit mode

moldach ▴ 20

@moldach-8829

Last seen 4.8 years ago

Canada/Montreal/Douglas Mental Health I…

I would like to use ggbio for C elegans and am following along with this tutorial for biovizBase.

When adding a gene model track the code looks like this:

library(ggbio)
library(Homo.sapiens)
#load gene symbol : GRanges, one gene/row
data(genesymbol, package = "biovizBase")

Looking at the help for ?genesymbol it looks like this is a dataset for humans. The package Homo.sapiens only contains data for Homo sapiens. OrganismDb, which hosts Homo.sapiens, looks like it only has information for two other models: Mus musculus, and Rattus norvegicus.

Is it possible to get a gene model track for C elegans with ggbio?

Adding the reference track doesn't seem to be an issue; I searched BSgenome.Hsapiens.UCSC.hg19 on Bioconductor and under Packages found under BSgenome there is a package called BSgenome.Celegans.UCSC.ce11.

biovizBase ggbio • 1.0k views

ADD COMMENT • link updated 5.0 years ago by James W. MacDonald 68k • written 5.0 years ago by moldach ▴ 20

0

Entering edit mode

Just to clarify, the genesymbol dataset is really just a convenience for demonstration purposes. Jim is right: ggbio should directly support TxDb and other annotation objects.

ADD REPLY • link 5.0 years ago Michael Lawrence ★ 11k

score 0 · Answer 1 · 2020-03-30

You can do that on the fly, using the TxDb and the OrgDb for that organism. Here is a function that I wrote recently to create a gene plot, followed by coverage plots for zero or more BAM files, that I use for non-model organisms.

makeGGplot <- function(gene, bams = NULL, txdb, orgdb, bamnames, geom = "gapped.pair"){
    ## gene is gene symbol
    ## bams is a vector of bamfile paths. Ensure they are indexed
    ## txdb and orgdb are obvious
    ## bamnames is a vector of names for resulting ggplot
    ## geom = "line" is in general faster
    require("biovizBase", quietly = TRUE, character.only = TRUE)
    require("ggbio", quietly = TRUE, character.only = TRUE)
    egid <- mapIds(orgdb, gene, "ENTREZID", "SYMBOL")
    gnrng <- genes(txdb, single.strand.genes.only = FALSE)[[egid]]
    if(length(gnrng) > 1) warning("Doubledog!")
    tx <- crunch(txdb, which = gnrng)
    tx <- split(tx, tx$tx_id)
    lbs <- select(txdb, names(tx), c("TXNAME","TXID","GENEID"), "TXID")
    lbs$SYMBOL <- mapIds(orgdb, lbs$GENEID, "SYMBOL", "ENTREZID")
    ind <- lbs$GENEID %in% egid
    lbs <- lbs[ind,]
    tx <- tx[ind]
    names(tx) <- lbs$SYMBOL
    gnplot <- ggplot() + do.call(geom_alignment, c(list(data = tx, names.expr = "SYMBOL")))
    if(is.null(bams)){
        gnplot
    } else {
        gnplot <- list(gnplot)
        gnrng <- keepSeqlevels(gnrng, as.character(seqnames(gnrng)))
        bl <- BamFileList(bams)
        bamplot <- lapply(bl, autoplot, which = gnrng, geom=geom)
        names(bamplot) <- bamnames
        tracks(c(gnplot, bamplot), heights = c(4, rep(1, length(bams))), label.text.angle = 0, xlim = gnrng,
               label.text.cex = 0.75)
    }
}

Doing say

> hub <- AnnotationHub()
  |======================================================================| 100%

snapshotDate(): 2019-10-29
> query(hub, c("elegans","txdb"))
AnnotationHub with 8 records
# snapshotDate(): 2019-10-29 
# $dataprovider: UCSC
# $species: Caenorhabditis elegans
# $rdataclass: TxDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH52249"]]' 

            title                                 
  AH52249 | TxDb.Celegans.UCSC.ce11.refGene.sqlite
  AH52250 | TxDb.Celegans.UCSC.ce6.ensGene.sqlite 
  AH53778 | TxDb.Celegans.UCSC.ce11.refGene.sqlite
  AH57984 | TxDb.Celegans.UCSC.ce11.refGene.sqlite
  AH61790 | TxDb.Celegans.UCSC.ce11.refGene.sqlite
  AH66167 | TxDb.Celegans.UCSC.ce11.refGene.sqlite
  AH70583 | TxDb.Celegans.UCSC.ce11.ensGene.sqlite
  AH70584 | TxDb.Celegans.UCSC.ce11.refGene.sqlite
> txdb <- hub[["AH70584"]]
> query(hub, c("elegans","orgdb"))
AnnotationHub with 2 records
# snapshotDate(): 2019-10-29 
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Drosophila elegans, Caenorhabditis elegans
# $rdataclass: OrgDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH75747"]]' 

            title                           
  AH75747 | org.Ce.eg.db.sqlite             
  AH76581 | org.Drosophila_elegans.eg.sqlite
> orgdb <- hub[["AH75747"]]
> makeGGplot("homt-1", NULL, txdb, orgdb, "nobams")

Seems to work OK.