Question

Complete variant toolbox: gmapR/VariantTools/VariantAnnotation - revived

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 2.3 years ago

United States

Robert, I've been thinking about this. The original 2036 answer was not wrong, it is the position relative to the transcript (i.e., concatenated exon ranges). The 1797 is position relative to the individual exon. The intent of mapCoords() is to return the mapped position relative to the outer list elements of 'to' (transcipt-level in this case). It looks like you want the position relative to the individual element hit (exon-level). We can have both. I've added an 'elt.loc' argument that returns position relative to the individual list elements as a metadata column. Original ranges: from <- GRanges("chrA", IRanges(43522349, width=1), strand="-") to <- GRangesList(GRanges("chrA", IRanges(c(43522244, 43528406), c(43524145, 43528644)), strand="-")) x <- mapCoords(from, to, elt.loc=TRUE) >> x > GRanges with 1 range and 3 metadata columns: > seqnames ranges strand | queryHits subjectHits eltLoc > <rle> <iranges> <rle> | <integer> <integer> <iranges> > [1] chrA [2036, 2036] - | 1 1 [1797, 1797] 'subjectHits' identifies the list element (transcript) that was hit. To identify which exon was hit use 'elt.hits=TRUE': x <- mapCoords(from, to, elt.loc=TRUE, elt.hits=TRUE) >> x > GRanges with 1 range and 4 metadata columns: > seqnames ranges strand | queryHits subjectHits eltLoc > <rle> <iranges> <rle> | <integer> <integer> <iranges> > [1] chrA [2036, 2036] - | 1 1 [1797, 1797] > eltHits > <integer> > [1] 1 Unlist 'to' and subset on 'eltHits' to see the exon(s) hit: unlist(to)[mcols(x)$eltHits] >> unlist(to)[mcols(x)$eltHits] > GRanges with 1 range and 0 metadata columns: > seqnames ranges strand > <rle> <iranges> <rle> > [1] chrA [43522244, 43524145] - As an fyi, there is also an ignore.strand arg: >> x <- mapCoords(from, to, elt.loc=TRUE, ignore.strand=TRUE) >> x > GRanges with 1 range and 3 metadata columns: > seqnames ranges strand | queryHits subjectHits eltLoc > <rle> <iranges> <rle> | <integer> <integer> <iranges> > [1] chrA [106, 106] * | 1 1 [106, 106] Available in GenomicRanges 1.17.33. Valerie On 07/28/14 20:41, Robert Castelo wrote: > Excellent! Thanks for fixing it, and again for this new feature that > enables calculating cDNA positions. > > best regards, > > robert. > > On 7/28/14 11:30 PM, Valerie Obenchain wrote: >> OK. We'll try it again. See 1.17.29. >> >> It looks like I did not introduce this with ignore.strand. The bug was >> deeper and could be handled in either .orderElementsByTranscription() >> or the pair of .listCumsum* functions. I chose to add an arg to >> .orderElementsByTranscription() that controls whether negative strand >> ranges are ordered large-to-small (3' to 5') or small-to-large (5' to >> 3'). >> >> Valerie >> >> >> On 07/28/2014 01:23 PM, Robert Castelo wrote: >>> hi Valerie, >>> >>> i'm still not getting the right answer, i've looked at the unit test you >>> added and i'm afraid it does not recreate the situation i described >>> below, which involved one transcript with two exons, while the unit test >>> forms two transcripts of one exon. so as it is now, doing: >>> >>> mapCoords(gr, grl, ignore.strand=TRUE) >>> > map >>> GRanges with 2 ranges and 2 metadata columns: >>> seqnames ranges strand | queryHits subjectHits >>> <rle> <iranges> <rle> | <integer> <integer> >>> [1] chrA [1797, 1797] - | 1 2 >>> [2] chrA [1797, 1797] - | 2 2 >>> >>> you get the right answer, but doing: >>> >>> mapCoords(gr, GRangesList(unlist(grl)), ignore.strand=TRUE) >>> > map >>> GRanges with 2 ranges and 2 metadata columns: >>> seqnames ranges strand | queryHits subjectHits >>> <rle> <iranges> <rle> | <integer> <integer> >>> [1] chrA [2036, 2036] - | 1 1 >>> [2] chrA [2036, 2036] - | 2 1 >>> >>> you get the previous wrong answer. you'll find my sessionInfo() below. >>> >>> cheers, >>> robert. >>> ps: sessionInfo() >>> R version 3.1.1 (2014-07-10) >>> Platform: x86_64-apple-darwin10.8.0 (64-bit) >>> >>> locale: >>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets methods >>> base >>> >>> other attached packages: >>> [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.14.0 GenomicFeatures_1.17.13 >>> AnnotationDbi_1.27.9 >>> [4] Biobase_2.25.0 VariantAnnotation_1.11.19 Rsamtools_1.17.31 >>> [7] Biostrings_2.33.13 XVector_0.5.7 GenomicRanges_1.17.28 >>> [10] GenomeInfoDb_1.1.17 IRanges_1.99.23 S4Vectors_0.1.2 >>> [13] BiocGenerics_0.11.3 vimcom_0.9-93 setwidth_1.0-3 >>> [16] colorout_1.0-2 >>> >>> loaded via a namespace (and not attached): >>> [1] BatchJobs_1.3 BBmisc_1.7 BiocParallel_0.7.9 >>> biomaRt_2.21.1 >>> [5] bitops_1.0-6 brew_1.0-6 BSgenome_1.33.8 >>> checkmate_1.2 >>> [9] codetools_0.2-8 DBI_0.2-7 digest_0.6.4 fail_1.2 >>> [13] foreach_1.4.2 GenomicAlignments_1.1.22 >>> iterators_1.0.7 Rcpp_0.11.2 >>> [17] RCurl_1.95-4.1 RSQLite_0.11.4 rtracklayer_1.25.13 >>> sendmailR_1.1-2 >>> [21] stats4_3.1.1 stringr_0.6.2 tools_3.1.1 >>> XML_3.98-1.1 >>> [25] zlibbioc_1.11.1 >>> >>> >>> On 7/28/14 1:25 PM, Valerie Obenchain wrote: >>>> Hi Robert, >>> > >>> > Now fixed in GenomicRanges 1.17.28. >>> > >>> > I introduced this bug when trying to be too clever with >>> > ignore.strand. Thanks for catching it and, as always, providing a >>> > clear reproducible example. >>> > >>> > Valerie >>> > >>> > On 07/24/2014 08:24 AM, Robert Castelo wrote: >>> >> Dear Valerie, >>> >> >>> >> sorry for my long delay in reacting to your message below. This >>> >> addition to VariantAnnotation by Michael and you is great. I just >>> >> would like to make a comment and report what looks to me as a bug >>> >> related to the computation of cDNA positions, just in case you have >>> >> suggestions and comments, specially with respect to what at the >>> >> moment seems like a bug to me. >>> >> >>> >> the way in which mapCoords() works is by finding all possible >>> >> overlaps of each GRanges entry, corresponding to a variant. This is >>> >> strictly not necessary once you have gone through locateVariants() >>> >> and/or predictCoding() since these functions already find all >>> >> possible overlaps of each variant to different transcripts and >>> >> regions. So, to get the corresponding annotation of cDNA positions >>> >> I had to write the following function which is slightly more >>> >> elaborated from what you suggest below: >>> >> >>> >> ## this function assumes that locateVariants() or predictCoding() >>> >> ## have been called so that there is a TXID metadata column >>> >> .cDNAloc <- function(gr, txdb) { if (!"TXID" %in% >>> >> colnames(mcols(gr))) stop("cDNApos: metadata column 'TXID' not >>> >> found in input GRanges object.") >>> >> >>> >> exonsbytx <- exonsBy(txdb, by="tx") map <- mapCoords(gr, exonsbytx, >>> >> eltHits=TRUE) eolap <- map$eltHits qolap <- map$queryHits txids <- >>> >> rep(names(exonsbytx), elementLengths(exonsbytx))[eolap] res <- >>> >> gr[qolap] mcols(res) <- append(values(res), >>> >> DataFrame(cDNALOC=ranges(map), TXID2=as.integer(txids))) ## >>> >> restrict results to cDNA positions of query TXID res <- >>> >> res[res$TXID == res$TXID2] >>> >> >>> >> start(res$cDNALOC) } >>> >> >>> >> i'm going to run it with a toy example of a variant that overlaps >>> >> a coding region of 7 transcripts in the positive strand and the >>> >> three UTR region of one transcript in the negative strand. >>> >> >>> >> library(VariantAnnotation) >>> >> library(TxDb.Hsapiens.UCSC.hg19.knownGene) >>> >> >>> >> gr <- GRanges("chr21", IRanges(start=c(43522349, 43522349), >>> >> width=c(1, 1)), strand=c("+", "-")) gr$TXID <- c(72816L, 73335L) >>> >> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene loc <- locateVariants(gr, >>> >> txdb, AllVariants()) loc$cDNALOC <- .cDNAloc(loc, txdb) >>> >> >>> >> now let's look to the cDNA position of the transcript in the >>> >> negative strand (the last entry of the resulting GRanges object): >>> >> >>> >> GRanges with 1 range and 2 metadata columns: seqnames >>> >> ranges strand | TXID cDNALOC <rle> <iranges> >>> >> <rle> | <integer> <integer> [1] chr21 [43522349, 43522349] >>> >> - | 73335 2036 >>> >> >>> >> and the exonic structure of this transcript >>> >> >>> >> exonsbytx <- exonsBy(txdb, by="tx") exonsbytx[["73335"]] GRanges >>> >> with 2 ranges and 3 metadata columns: seqnames ranges >>> >> strand | exon_id exon_name exon_rank <rle> <iranges> >>> >> <rle> | <integer> <character> <integer> [1] chr21 [43528406, >>> >> 43528644] - | 260152 <na> 1 [2] chr21 >>> >> [43522244, 43524145] - | 260151 <na> 2 >>> >> >>> >> from this structure one would expect that the reported cDNA >>> >> position would be: >>> >> >>> >> 43524145 - 43522349 + 1 = 1797 >>> >> >>> >> instead of 2036. >>> >> >>> >> after investigating a bit where this value may come from, it seems >>> >> that while the strand is being taken into account for the relative >>> >> position within the exon, the widths of upstream exons is wrongly >>> >> selected (in this case there are no upstream exons): >>> >> >>> >> width(exonsbytx[["73335"]]) [1] 239 1902 >>> >> >>> >> 239 + 1796 + 1 = 2036 >>> >> >>> >> >>> >> so, my guess is that this must be something to be fixed in >>> >> mapCoords(). >>> >> >>> >> >>> >> best regards!! >>> >> >>> >> robert. >>> >> >>> >> >>> >> On 07/13/2014 07:26 AM, Valerie Obenchain wrote: >>> >>> Hi, >>> >>> >>> >>> Following up on this thread: >>> >>> >>> >>> >>> https://stat.ethz.ch/pipermail/bioconductor/2013-December/056745.html >>> >>> >>> >>> >>> >>> >>> These changes are available in VariantAnnotation 1.11.15: >>>>>> >>> >>> 1) LOCSTART, LOCEND >>> >>> >>> >>> locateVariants() has 2 new output columns, LOCSTART and LOCEND. >>> >>> These are LOCATION-centric coordinates and can be different for >>> >>> each row so I thought these names were more descriptive than >>> >>> REFLOCS (discussed in thread). We have 2 values (start/end) >>> >>> instead of a single column of IRanges() because we can't make an >>> >>> IRanges() with missing values. Technically 'missing' ranges are >>> >>> represented by zero-width ranges but we still need a position; >>> >>> there is no position because there was no overlap. >>> >>> >>> >>> 2) mapCoords(), pmapCoords() >>> >>> >>> >>> These functions are courtesy of Michael. mapCoords() maps ranges >>> >>> onto another set of coordinates. You can map to cds-centric, >>> >>> exon-centric or any other type of coordinate. See ?mapCoords in >>> >>> both GenomicRanges and GenomicAlignments. >>> >>> >>> >>> >>> >>> In the previous thread we discussed added cDNA locations to >>> >>> predictCoding(). I've decided against this because it adds the >>> >>> additional overhead of the exonsBy() extraction and a >>> >>> findOverlaps() call. Not all users want the cDNA locations and >>> >>> those that do can now easily get them with mapCoords(). >>> >>> >>> >>> ## The usual predictCoding setup: >>> >>> library(BSgenome.Hsapiens.UCSC.hg19) >>> >>> library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- >>> >>> TxDb.Hsapiens.UCSC.hg19.knownGene fl <- system.file("extdata", >>> >>> "chr22.vcf.gz", package="VariantAnnotation") vcf <- readVcf(fl, >>> >>> "hg19") vcf <- renameSeqlevels(vcf, "chr22") coding <- >>> >>> predictCoding(vcf, txdb, Hsapiens) >>> >>> >>> >>> ## Exon-centric or cDNA locations: exonsbytx <- exonsBy(txdb, >>> >>> "tx") cDNA <- mapCoords(coding[!duplicated(ranges(coding))], >>> >>> exonsbytx) coding$cDNA <- ranges(cDNA)[togroup(coding$QUERYID)] >>> >>> >>> >>> Let me know if you run into problems or if the docs need more >>> >>> detail. >>> >>> >>> >>> Valerie >>> >>> >>> >> >>> > >>> >>> >> >

VariantAnnotation Annotation GenomicRanges VariantAnnotation GenomicAlignments Annotation • 956 views

ADD COMMENT • link 9.7 years ago Valerie Obenchain ★ 6.8k