Question

locateVariant from VariantAnnotation returns wrong results

0

Entering edit mode

nhaus ▴ 30

@789c70a6

Last seen 1 day ago

Switzerland

Hi all,

I am using VariantAnnotation::locateVariants to get the location (i.e. intron, intergeneic, ...) of a list of SNPs. I have noticed however, that the results returned by wrong, when I manually look them up in the UCSC genome browser. Here is one such example (I am using R.4.2.2, TxDb.Hsapiens.UCSC.hg19.knownGene 3.2.2 and VariantAnnotation 1.44.1):

library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

snp <- GRanges(seqnames="chr1", ranges = 109706393) # corresponds to SNP rs649539
locateVariants(snp, txdb, AllVariants(), ignore.strand=T)

returns the following result:

GRanges object with 10 ranges and 9 metadata columns:
       seqnames    ranges strand | LOCATION  LOCSTART    LOCEND   QUERYID        TXID
          <Rle> <IRanges>  <Rle> | <factor> <integer> <integer> <integer> <character>
   [1]     chr1 109706393      + |   intron     49314     49314         1        1785
   [2]     chr1 109706393      + |   intron     49314     49314         1        1786
   [3]     chr1 109706393      + |   intron    135483    135483         1        1786
   [4]     chr1 109706393      + |   intron     49435     49435         1        1787
   [5]     chr1 109706393      + |   intron     49314     49314         1        1788
   [6]     chr1 109706393      + |   intron     49314     49314         1       74189
   [7]     chr1 109706393      + |   intron     49314     49314         1       74190
   [8]     chr1 109706393      + |   intron    135483    135483         1       74190
   [9]     chr1 109706393      + |   intron     49435     49435         1       74191
  [10]     chr1 109706393      + |   intron    135191    135191         1       74191
               CDSID      GENEID       PRECEDEID        FOLLOWID
       <IntegerList> <character> <CharacterList> <CharacterList>
   [1]                      5825                                
   [2]                      5825                                
   [3]                      5825                                
   [4]                      5825                                
   [5]                      5825                                
   [6]                    150365                                
   [7]                    150365                                
   [8]                    150365                                
   [9]                    150365                                
  [10]                    150365                                
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Gene ID 5825 corresponds to the ABCD3 gene and gene id 150365 corresponds to the MEI1 gene. However, if I go to the position of the SNP in the UCSC genome browser, it shows me that I am in an intronic region of ELAPOR1 (see image below).

enter image description here

Furthermore, if I use txdb to look up the chromosome on which these genes are located I get the following results:

select(txdb, keys = c("5825", "150365"), columns="TXCHROM", keytype="GENEID")

leads to:

  GENEID TXCHROM
1   5825    chr1
2 150365   chr22

I would very much appreciate any insights into what is happening here, because it seems like the tool is just completely off in this case.

VariantAnnotation locateVariant • 476 views

ADD COMMENT • link updated 5 months ago by James W. MacDonald 65k • written 5 months ago by nhaus ▴ 30

score 0 · Answer 1 · 2023-11-28

This is a known and so far unfixed issue.

>  mapToTranscripts(snp, intronsByTranscript(TxDb.Hsapiens.UCSC.hg19.knownGene))
GRanges object with 10 ranges and 2 metadata columns:
       seqnames    ranges strand |
          <Rle> <IRanges>  <Rle> |
   [1]     1785     49314      + |
   [2]     1786     49314      + |
   [3]     1786    135483      + |
   [4]     1787     49435      + |
   [5]     1788     49314      + |
   [6]    74189     49314      + |
   [7]    74190     49314      + |
   [8]    74190    135483      + |
   [9]    74191     49435      + |
  [10]    74191    135191      + |
           xHits transcriptsHits
       <integer>       <integer>
   [1]         1            1785
   [2]         1            1786
   [3]         1            1786
   [4]         1            1787
   [5]         1            1788
   [6]         1           74189
   [7]         1           74190
   [8]         1           74190
   [9]         1           74191
  [10]         1           74191
  -------
  seqinfo: 82960 sequences from an unspecified genome
> z <- intronsByTranscript(TxDb.Hsapiens.UCSC.hg19.knownGene)
> mapToTranscripts(snp, z[lengths(z) > 0L])
GRanges object with 4 ranges and 2 metadata columns:
      seqnames    ranges strand |
         <Rle> <IRanges>  <Rle> |
  [1]     1925     49314      + |
  [2]     1926     49314      + |
  [3]     1927     49435      + |
  [4]     1928     49314      + |
          xHits transcriptsHits
      <integer>       <integer>
  [1]         1            1785
  [2]         1            1786
  [3]         1            1787
  [4]         1            1788
  -------
  seqinfo: 72404 sequences from an unspecified genome

You can get the correct results though.

> zz <- locateVariants(snp, z[lengths(z) > 0L], IntronVariants())
> zz$GENEID <- mapIds(TxDb.Hsapiens.UCSC.hg19.knownGene, zz$TXID, "GENEID","TXID")
'select()' returned 1:1 mapping
between keys and columns
> zz
GRanges object with 4 ranges and 9 metadata columns:
      seqnames    ranges strand |
         <Rle> <IRanges>  <Rle> |
  [1]     chr1 109706393      + |
  [2]     chr1 109706393      + |
  [3]     chr1 109706393      + |
  [4]     chr1 109706393      + |
      LOCATION  LOCSTART    LOCEND
      <factor> <integer> <integer>
  [1]   intron     49314     49314
  [2]   intron     49314     49314
  [3]   intron     49435     49435
  [4]   intron     49314     49314
        QUERYID        TXID
      <integer> <character>
  [1]         1        1925
  [2]         1        1926
  [3]         1        1927
  [4]         1        1928
              CDSID      GENEID
      <IntegerList> <character>
  [1]                     57535
  [2]                     57535
  [3]                     57535
  [4]                     57535
            PRECEDEID
      <CharacterList>
  [1]                
  [2]                
  [3]                
  [4]                
             FOLLOWID
      <CharacterList>
  [1]                
  [2]                
  [3]                
  [4]                
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

> select(org.Hs.eg.db, "57535", "SYMBOL")
'select()' returned 1:1 mapping
between keys and columns
  ENTREZID  SYMBOL
1    57535 ELAPOR1