Question

variantAnnotation::locateVariants() with AllVariants(): why do some input variants not get annotated?

0

Entering edit mode

Jakob Goldmann • 0

@jakob-goldmann-8715

Last seen 9.2 years ago

Radboud UMC, Nijmegen, Netherlands

I want to annotate a list of variants, but for some reason I do not understand, some of them refuse to appear in the output. I set the options for the intronic variants to be as including as possible.

Here an example:

library(VariantAnnotation)

library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

locateVariants(
  VRanges("chr10", IRanges(48451378, 48451378), "A", "T"), 
  txdb, 
  AllVariants(promoter = PromoterVariants(upstream = 0, downstream = 0),  
              intergenic = IntergenicVariants(upstream = 0, downstream = 0)))

The return value is an empty GRanges object:

GRanges object with 0 ranges and 9 metadata columns:
   seqnames    ranges strand | LOCATION  LOCSTART    LOCEND   QUERYID      TXID         CDSID      GENEID       PRECEDEID        FOLLOWID
      <Rle> <IRanges>  <Rle> | <factor> <integer> <integer> <integer> <integer> <IntegerList> <character> <CharacterList> <CharacterList>
  -------
  seqinfo: no sequences

The region that I chose here, does have some gene annotation:

subsetByOverlaps(genes(txdb), subject = GRanges(seqnames = "chr10", ranges=IRanges(48451378, 48451378), strand="*"))

...returns:

GRanges object with 5 ranges and 1 metadata column:
         seqnames               ranges strand |     gene_id
            <Rle>            <IRanges>  <Rle> | <character>
   26095    chr10 [46550123, 48827924]      - |       26095
  414212    chr10 [46764668, 48958083]      + |      414212
   55747    chr10 [47894023, 51893269]      + |       55747
  653129    chr10 [46550123, 48827924]      - |      653129
  728053    chr10 [46737613, 48952629]      - |      728053
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

What am I missing here? The function should return the variant, either annotated as being inside of the genes, or being intergenic, right?

Thanks a lot,

Jakob

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.2 GenomicFeatures_1.20.1                  AnnotationDbi_1.30.1                   
 [4] Biobase_2.28.0                          VariantAnnotation_1.14.6                Rsamtools_1.20.4                       
 [7] Biostrings_2.36.1                       XVector_0.8.0                           GenomicRanges_1.20.5                   
[10] GenomeInfoDb_1.4.1                      IRanges_2.2.5                           S4Vectors_0.6.3                        
[13] BiocGenerics_0.14.0                    

loaded via a namespace (and not attached):
 [1] zlibbioc_1.14.0         GenomicAlignments_1.4.1 BiocParallel_1.2.9      BSgenome_1.36.3         tools_3.2.1             DBI_0.3.1              
 [7] lambda.r_1.1.7          futile.logger_1.4.1     rtracklayer_1.28.6      futile.options_1.0.0    bitops_1.0-6            RCurl_1.95-4.7         
[13] biomaRt_2.24.0          RSQLite_1.0.0           XML_3.98-1.3

variantannotation locatevariants • 2.4k views

ADD COMMENT • link 9.2 years ago Jakob Goldmann • 0

score 0 · Answer 1 · 2015-08-28

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 2.9 years ago

United States

Hi Jakob,

You've restricted the search range for promoter and intergenic regions to zero so no overlaps are found.

Both constructors have defaults -

> PromoterVariants()
class: PromoterVariants 
upstream: 2000 
downstream: 200

Try expanding your range

PromoterVariants(400000, 20000)

and you'll see the overlaps.

Valerie

ADD COMMENT • link 9.2 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi Valerie,

Thanks for your quick reaction.

Expanding the ranges does not help it, I still don't get the variant annotated. The same holds true for playing with the parameters of the IntergenicVariants().

It looks to me like the variant might not get recognized as belonging to any of the categories, although this would contrast the notion of 'all variants' in AllVariants(). To my understanding I set the parameters such that there is no 'gap' in between genic regions and intergenic regions, where variants could hide. Or did I oversee something?

Thanks, Jakob

ADD REPLY • link 9.2 years ago Jakob Goldmann • 0

1

Entering edit mode

locateVariants() ids variants (or any ranges in 'query') in the intron, cds, spliceSite, intergenic, promoter, and utr regions. The regions are extracted from a TxDb with the functions from GenomicFeatures, cdsBy(), intronsBy(), etc.

You're using PromoterVariants() and IntergenicVariants() so you have some idea
that the variant doesn't fall in the transcript.

As you've already done, we can confirm the variant hits a few genes:

genes <- genes(txdb)
olaps <- findOverlaps(vr, genes)
> olaps
Hits object with 5 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1        8109
  [2]         1       12166
  [3]         1       15116
  [4]         1       17683
  [5]         1       18678
  -------
  queryLength: 1
  subjectLength: 23056

but does not hit any transcipts:

txbygene <- transcriptsBy(txdb, "gene")
> findOverlaps(vr, txbygene)
Hits object with 0 hits and 0 metadata columns:
   queryHits subjectHits
   <integer>   <integer>
  -------
  queryLength: 1
  subjectLength: 23459

Take a closer look at one of the genes hit. As an example we use 414212.

gene_hits <- genes[subjectHits(olaps)]
g0 <- gene_hits[2]
> g0
GRanges object with 1 range and 1 metadata column:
         seqnames               ranges strand |     gene_id
            <Rle>            <IRanges>  <Rle> | <character>
  414212    chr10 [46764668, 48958083]      + |      414212
  -------

Look at the transcripts for gene 414212:

tx0 <- txbygene[["414212"]]
> tx0
GRanges object with 2 ranges and 2 metadata columns:
      seqnames               ranges strand |     tx_id     tx_name
         <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]    chr10 [46764668, 46768346]      + |     38016  uc031puv.1
  [2]    chr10 [48954405, 48958083]      + |     38053  uc031puz.1
  -------

We can now see the variant falls between the two transcripts.

> end(tx0)[1] < start(vr) || start(tx0)[2]  > start(vr)
[1] TRUE

The 'upstream' and 'downstream' in PromoterVariants() are computed from the
transcription start site (TSS). This is a '+' strand gene so transcription
is from 5' to 3'. Our variant is 503027 upstream of the second transcript:

> start(tx0)[2] - start(vr)
[1] 503027

locateVariants() returns 8 ranges, each for a different transcript. The
'GENEID' column shows some transcripts are from the same gene, some different
and some have no associated gene id.

loc <- locateVariants(vr, txdb, PromoterVariants(upstream=510000))
> loc
GRanges object with 8 ranges and 9 metadata columns:
      seqnames               ranges strand | LOCATION  LOCSTART    LOCEND
         <Rle>            <IRanges>  <Rle> | <factor> <integer> <integer>
  [1]    chr10 [48451378, 48451378]      + | promoter      <NA>      <NA>
  [2]    chr10 [48451378, 48451378]      + | promoter      <NA>      <NA>
  [3]    chr10 [48451378, 48451378]      - | promoter      <NA>      <NA>
  [4]    chr10 [48451378, 48451378]      - | promoter      <NA>      <NA>
  [5]    chr10 [48451378, 48451378]      - | promoter      <NA>      <NA>
  [6]    chr10 [48451378, 48451378]      - | promoter      <NA>      <NA>
  [7]    chr10 [48451378, 48451378]      - | promoter      <NA>      <NA>
  [8]    chr10 [48451378, 48451378]      - | promoter      <NA>      <NA>
        QUERYID      TXID         CDSID      GENEID       PRECEDEID
      <integer> <integer> <IntegerList> <character> <CharacterList>
  [1]         1     38052                      <NA>                
  [2]         1     38053                    414212                
  [3]         1     39727                      <NA>                
  [4]         1     39728                      5949                
  [5]         1     39729                      2658                
  [6]         1     39730                      2662                
  [7]         1     39731                      2662                
  [8]         1     39732                      2662                
             FOLLOWID
      <CharacterList>
  [1]                
  [2]                
  [3]                
  [4]                
  [5]                
  [6]                
  [7]                
  [8]                
  -------

Gene with no associated gene id:

tx <- transcripts(txdb, columns=c("GENEID", "TXID"))
> tx[mcols(tx)$TXID == 38052]
GRanges object with 1 range and 2 metadata columns:
      seqnames               ranges strand |          GENEID      TXID
         <Rle>            <IRanges>  <Rle> | <CharacterList> <integer>
  [1]    chr10 [48902199, 48902228]      + |                     38052
  -------

Valerie

ADD REPLY • link 9.2 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi Valerie,

Thanks for the explanation. So I guess the problem is these "muted" regions that belong to a gene, but not to a transcript.

I had a look at the code, and all categories but the intergenic one is defined in a transcript-dependend manner:

category	definition
intron	intronsByTranscript(...)
cds	cdsBy(...,"transcript")
spliceSite	intronsByTranscripts()
promoter	transcripts(...)
fiveUTR	fiveUTRsByTranscripts()
threeUTR	threeUTRsByTranscripts()
intergenic	intronsBy(...,"gene")

I guess that for most use cases these definitions make sense. In particular, the intergenic category is called intergenic, and not "intertranscript". For my specific problem of simply annotating all input ranges, I guess I will make a local branch of the package and change the intergenic definition to a transcript-based one.

Cheers, Jakob

ADD REPLY • link 9.2 years ago Jakob Goldmann • 0