Question

Can't display protein domain data from UCSC table unipDomain using Gviz

1

Entering edit mode

paul.jaschke ▴ 10

@pauljaschke-24116

Last seen 3.9 years ago

Hi all, Using Gviz I am trying to create a figure where I have the exons from a gene in one track and below that the protein domains mapped to the genome sequence. I am able to map the transcript/exon information easily by grabing info from the UCSC site using the UcscTrack() function but cannot seem to properly grab the data out of the protein domain table 'unipDomain' represented by the schema here.

The problem seems to be the way the data is organized within the table with no easy 1:1 mapping between the 'chromStart' and 'chromEnd' data columns because there are multiple start positions relative to chromStart represented in the 'chromStarts'. That is, you need to offset a certain amount from the chromStart value using the chromStarts values (comma separated values) and then represent the width of the feature by the 'blockSizes' data, stored as comma-separated values. I don't see any way to do this with the GeneRegionTrack, AnnotationTrack, or DataTrack classes. Any help would be greatly appreciated, either solving this problem or pointing me towards an easier way to represent domain information on the chromosome with another method. Thanks!

This is what I would expect it to look like based on the track in UCSC browser ucsc browser image

This is what mine looks like (one big box instead of split along with exons) Gviz image

Code used to generate image above

library(Gviz)
library(GenomicRanges)

gen <- "hg38"
chr <- "chr11"

# Create the Ideogram track
itrack <- IdeogramTrack(genome = gen, chromosome = chr)

# Create the GenomeAxisTrack
# GenomeAxisTrack class
gtrack <- GenomeAxisTrack()

# example of FKBP2
from <- 64240500
to <- 64244500
knownGenes <- UcscTrack(genome = "hg38", 
                        chromosome = "chr11",
                        track = "knownGene", 
                        from = from,
                        to = to, 
                        trackType = "GeneRegionTrack",
                        rstarts = "exonStarts", 
                        rends = "exonEnds", 
                        gene = "name",
                        symbol = "name",
                        transcript = "name",
                        strand = "strand",
                        fill = "#8282d2",
                        name = "UCSC Genes")

domains <- UcscTrack(genome = "hg38",
                     chromosome = "chr11",
                     track = "uniprot",
                     table = "unipDomain",
                     from = from,
                     to = to,
                     trackType = "GeneRegionTrack",
                     rstarts = "chromStart",
                     rends = "chromEnd",
                     gene = "name",
                     symbol = "name",
                     strand = "strand",
                     name = "domain")

plotTracks(list(itrack, gtrack, knownGenes, domains), transcriptAnnotation = "gene" )

Gviz protein visualization UcscTrack protein domains • 1.3k views

ADD COMMENT • link updated 3.9 years ago by James W. MacDonald 66k • written 3.9 years ago by paul.jaschke ▴ 10

0

Entering edit mode

Cross-posted: https://www.biostars.org/p/460044/

ADD REPLY • link 3.9 years ago Kevin Blighe ★ 4.0k

score 0 · Answer 1 · 2020-09-08

There might be a smarter way to do this, but that's not my forte, so here goes.

## get the data using rtracklayer
> library(rtracklayer)
> session <- browserSession()
> genome(session) <- "hg38"
> gr <- GRanges("chr11", IRanges(from, to))
> z <- getTable(ucscTableQuery(session, track = "uniprot", table = "unipDomain", range = gr))
> z
  chrom chromStart chromEnd             name score strand thickStart thickEnd
1 chr11   64242531 64244011 PPIase FKBP-type  1000      +   64242531 64244011
2 chr11   64242531 64243472 PPIase FKBP-type     0      +   64242531 64243472
   reserved blockCount      blockSizes         chromStarts
1 12,12,120          5 27,113,47,36,44 0,667,919,1309,1436
2 0,150,250          3       27,113,22           0,667,919
                          dbName annotationType
1 Manually reviewed (Swiss-Prot)         domain
2            Unreviewed (TrEMBL)         domain
                              position         comments uniProtId pmids
1 amino acids 49-137 on protein P26885 PPIase FKBP-type    P26885    NA
2 amino acids 49-102 on protein F5H0N4 PPIase FKBP-type    F5H0N4    NA

## that's boring. As you  noted, it's the start, plus the block starts, plus the block lengths
## let's have a function, shall we?

makeSplit <- function(d.f){
    splitsville <- lapply(strsplit(z$chromStarts, ","), as.numeric)
    splitsville2 <- lapply(strsplit(z$blockSizes, ","), as.numeric)
    splitsville <- mapply(function(x,y) x + y, splitsville, d.f[,2])
    splitsville2 <- mapply(function(x,y) x + y, splitsville2, splitsville)
    len <- length(unlist(splitsville))
    thatstuff <- rep("exon_1", len)
    gr <- GRanges(rep(d.f[1,1], len),
                  IRanges(unlist(splitsville), unlist(splitsville2)),
                  rep(z$strand, sapply(splitsville, length)),
                  feature = thatstuff, id = thatstuff, exon = thatstuff,
                  transcript = rep(paste0("transcript_", seq_len(length(splitsville))),
                                   sapply(splitsville, length)),
                  gene = rep(z$name, sapply(splitsville, length)),
                  symbol = rep(z$name, sapply(splitsville, length)),
                  density = rep(1, length(thatstuff)))
    gr
}

> domains2 <- GeneRegionTrack(makeSplit(z))
> plotTracks(list(itrack, gtrack, knownGenes, domains2), transcriptAnnotation = "gene" )

Seems like that's what you want?