locateVariant from VariantAnnotation returns wrong results
1
0
Entering edit mode
nhaus ▴ 70
@789c70a6
Last seen 12 weeks ago
Switzerland

Hi all,

I am using VariantAnnotation::locateVariants to get the location (i.e. intron, intergeneic, ...) of a list of SNPs. I have noticed however, that the results returned by wrong, when I manually look them up in the UCSC genome browser. Here is one such example (I am using R.4.2.2, TxDb.Hsapiens.UCSC.hg19.knownGene 3.2.2 and VariantAnnotation 1.44.1):

library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

snp <- GRanges(seqnames="chr1", ranges = 109706393) # corresponds to SNP rs649539
locateVariants(snp, txdb, AllVariants(), ignore.strand=T)

returns the following result:

GRanges object with 10 ranges and 9 metadata columns:
       seqnames    ranges strand | LOCATION  LOCSTART    LOCEND   QUERYID        TXID
          <Rle> <IRanges>  <Rle> | <factor> <integer> <integer> <integer> <character>
   [1]     chr1 109706393      + |   intron     49314     49314         1        1785
   [2]     chr1 109706393      + |   intron     49314     49314         1        1786
   [3]     chr1 109706393      + |   intron    135483    135483         1        1786
   [4]     chr1 109706393      + |   intron     49435     49435         1        1787
   [5]     chr1 109706393      + |   intron     49314     49314         1        1788
   [6]     chr1 109706393      + |   intron     49314     49314         1       74189
   [7]     chr1 109706393      + |   intron     49314     49314         1       74190
   [8]     chr1 109706393      + |   intron    135483    135483         1       74190
   [9]     chr1 109706393      + |   intron     49435     49435         1       74191
  [10]     chr1 109706393      + |   intron    135191    135191         1       74191
               CDSID      GENEID       PRECEDEID        FOLLOWID
       <IntegerList> <character> <CharacterList> <CharacterList>
   [1]                      5825                                
   [2]                      5825                                
   [3]                      5825                                
   [4]                      5825                                
   [5]                      5825                                
   [6]                    150365                                
   [7]                    150365                                
   [8]                    150365                                
   [9]                    150365                                
  [10]                    150365                                
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Gene ID 5825 corresponds to the ABCD3 gene and gene id 150365 corresponds to the MEI1 gene. However, if I go to the position of the SNP in the UCSC genome browser, it shows me that I am in an intronic region of ELAPOR1 (see image below).

enter image description here

Furthermore, if I use txdb to look up the chromosome on which these genes are located I get the following results:

select(txdb, keys = c("5825", "150365"), columns="TXCHROM", keytype="GENEID")

leads to:

  GENEID TXCHROM
1   5825    chr1
2 150365   chr22

I would very much appreciate any insights into what is happening here, because it seems like the tool is just completely off in this case.

VariantAnnotation locateVariant • 835 views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 12 hours ago
United States

This is a known and so far unfixed issue.

>  mapToTranscripts(snp, intronsByTranscript(TxDb.Hsapiens.UCSC.hg19.knownGene))
GRanges object with 10 ranges and 2 metadata columns:
       seqnames    ranges strand |
          <Rle> <IRanges>  <Rle> |
   [1]     1785     49314      + |
   [2]     1786     49314      + |
   [3]     1786    135483      + |
   [4]     1787     49435      + |
   [5]     1788     49314      + |
   [6]    74189     49314      + |
   [7]    74190     49314      + |
   [8]    74190    135483      + |
   [9]    74191     49435      + |
  [10]    74191    135191      + |
           xHits transcriptsHits
       <integer>       <integer>
   [1]         1            1785
   [2]         1            1786
   [3]         1            1786
   [4]         1            1787
   [5]         1            1788
   [6]         1           74189
   [7]         1           74190
   [8]         1           74190
   [9]         1           74191
  [10]         1           74191
  -------
  seqinfo: 82960 sequences from an unspecified genome
> z <- intronsByTranscript(TxDb.Hsapiens.UCSC.hg19.knownGene)
> mapToTranscripts(snp, z[lengths(z) > 0L])
GRanges object with 4 ranges and 2 metadata columns:
      seqnames    ranges strand |
         <Rle> <IRanges>  <Rle> |
  [1]     1925     49314      + |
  [2]     1926     49314      + |
  [3]     1927     49435      + |
  [4]     1928     49314      + |
          xHits transcriptsHits
      <integer>       <integer>
  [1]         1            1785
  [2]         1            1786
  [3]         1            1787
  [4]         1            1788
  -------
  seqinfo: 72404 sequences from an unspecified genome

You can get the correct results though.

> zz <- locateVariants(snp, z[lengths(z) > 0L], IntronVariants())
> zz$GENEID <- mapIds(TxDb.Hsapiens.UCSC.hg19.knownGene, zz$TXID, "GENEID","TXID")
'select()' returned 1:1 mapping
between keys and columns
> zz
GRanges object with 4 ranges and 9 metadata columns:
      seqnames    ranges strand |
         <Rle> <IRanges>  <Rle> |
  [1]     chr1 109706393      + |
  [2]     chr1 109706393      + |
  [3]     chr1 109706393      + |
  [4]     chr1 109706393      + |
      LOCATION  LOCSTART    LOCEND
      <factor> <integer> <integer>
  [1]   intron     49314     49314
  [2]   intron     49314     49314
  [3]   intron     49435     49435
  [4]   intron     49314     49314
        QUERYID        TXID
      <integer> <character>
  [1]         1        1925
  [2]         1        1926
  [3]         1        1927
  [4]         1        1928
              CDSID      GENEID
      <IntegerList> <character>
  [1]                     57535
  [2]                     57535
  [3]                     57535
  [4]                     57535
            PRECEDEID
      <CharacterList>
  [1]                
  [2]                
  [3]                
  [4]                
             FOLLOWID
      <CharacterList>
  [1]                
  [2]                
  [3]                
  [4]                
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

> select(org.Hs.eg.db, "57535", "SYMBOL")
'select()' returned 1:1 mapping
between keys and columns
  ENTREZID  SYMBOL
1    57535 ELAPOR1
ADD COMMENT
0
Entering edit mode

Thank you very much for your quick reply! I guess if I do not know a priori where the SNP is located, I have to iterate thorugh the different intronsByTranscript, fiveUTRsByTranscript, threeUTRsByTranscript,...? Or is there a way to combine the outputs from all of these commands?

ADD REPLY
0
Entering edit mode

You can also just stuff that 'z' object into an env and use that.

> cache <- new.env(parent = emptyenv())
> z <- intronsByTranscript(TxDb.Hsapiens.UCSC.hg19.knownGene)
> z <- z[lengths(z) > 0L]
> cache[["intbytx"]] <- z
> snp <- GRanges(seqnames="chr1", ranges = 109706393) # corresponds to SNP rs649539
> locateVariants(snp, TxDb.Hsapiens.UCSC.hg19.knownGene, AllVariants(), cache = cache)
'select()' returned 1:1 mapping
between keys and columns
GRanges object with 4 ranges and 9 metadata columns:
      seqnames    ranges strand |
         <Rle> <IRanges>  <Rle> |
  [1]     chr1 109706393      + |
  [2]     chr1 109706393      + |
  [3]     chr1 109706393      + |
  [4]     chr1 109706393      + |
      LOCATION  LOCSTART    LOCEND
      <factor> <integer> <integer>
  [1]   intron     49314     49314
  [2]   intron     49314     49314
  [3]   intron     49435     49435
  [4]   intron     49314     49314
        QUERYID        TXID
      <integer> <character>
  [1]         1        1925
  [2]         1        1926
  [3]         1        1927
  [4]         1        1928
              CDSID      GENEID
      <IntegerList> <character>
  [1]                     57535
  [2]                     57535
  [3]                     57535
  [4]                     57535
            PRECEDEID
      <CharacterList>
  [1]                
  [2]                
  [3]                
  [4]                
             FOLLOWID
      <CharacterList>
  [1]                
  [2]                
  [3]                
  [4]                
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
ADD REPLY
0
Entering edit mode

Thank you so much! Just for other people that stumble across this, I have tried to replicate the cache that is created by VaroiantAnnotation like this.

cache <- new.env(parent = emptyenv())

# CodingVariants
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene::TxDb.Hsapiens.UCSC.hg19.knownGene
z <- cdsBy(txdb)
z <- z[lengths(z) > 0L]
cache[["cdsbytx"]] <- z

# IntronVariants
z <- intronsByTranscript(txdb)
z <- z[lengths(z) > 0L]
cache[["intbytx"]] <- z

# ThreeUTRVariants 
z <- threeUTRsByTranscript(txdb)
z <- z[lengths(z) > 0L]
cache[["threeUTRbytx"]] <- z

# FiveUTRVariants 
z <- fiveUTRsByTranscript(txdb)
z <- z[lengths(z) > 0L]
cache[["fiveUTRbytx"]] <- z

# IntergenicVariants 
z <- transcriptsBy(txdb, "gene")
z <- z[lengths(z) > 0L]
cache[["txbygene"]]  <- z

# SpliceSiteVariants 
# no need to populate cache further ?

# PromoterVariants 
z <- transcripts(txdb)
z <- z[lengths(z) > 0L]
cache[["tx"]] <- splitAsList(z, seq_len(length(z))) 
names(cache[["tx"]]) <- z$tx_id
ADD REPLY
0
Entering edit mode

You can certainly do that, but it's extra unnecessary work, as the rest of the cache will already be correctly populated. You only need to do what I showed you.

ADD REPLY

Login before adding your answer.

Traffic: 1019 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6