Question: No annotation for single SNPs
Lna wrote (5 months ago):

Dear all,

I want to annotate all SNPs on my Illumina chip. I generated a VCF file containing all 4284426 SNPs and annotated it with:

loc <- locateVariants(target, TxDb.Hsapiens.UCSC.hg19.knownGene, AllVariants(promoter=PromoterVariants(downstream=500)))

I do not get a result for every entry in my vcf file. What is the reason for this?

When I build a table using

names(loc) <- NULL
out <- as.data.frame(loc)
out$names <- names(target)[ out$QUERYID ]
annotT.all <- out[ , c("names", "seqnames","QUERYID", "start", "end", "LOCATION", "GENEID", "PRECEDEID", "FOLLOWID")]
annotT.all <- unique(annotT.all)

it has only 2712896 entries. What kind of SNPs cannot be annotated by VariantAnnotation?

I checked a SNP on chr1, position 768448, for which the entry is missing in the annotation, on the UCSC Genome Browser, and according to UCSC the SNP is part of an intron. So why is there no line in the output table?
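One way to see exactly which SNPs received no annotation row is to compare the QUERYID column against the indices of the query. The row count of the table and the SNP count are not directly comparable, because one SNP can yield several rows (one per overlapping transcript). The following is a minimal base-R sketch; `unannotated_queries` is a hypothetical helper, not part of VariantAnnotation, and the toy `query_id` vector stands in for `loc$QUERYID`.

```r
# Hypothetical helper (not part of VariantAnnotation): given the number of
# query ranges and the QUERYID column returned by locateVariants(), return
# the indices of queries that received no annotation row at all.
unannotated_queries <- function(n_query, query_id) {
  setdiff(seq_len(n_query), unique(query_id))
}

# Toy stand-in for loc$QUERYID: queries 1 and 3 each match several
# transcripts, queries 2 and 4 have no rows.
query_id <- c(1L, 1L, 1L, 3L, 3L)

n_annotated <- length(unique(query_id))      # distinct annotated queries
missing_idx <- unannotated_queries(4L, query_id)
print(n_annotated)
print(missing_idx)

# With the objects from this post (assuming `target` and `loc` exist as above):
#   missing_idx <- unannotated_queries(length(target), loc$QUERYID)
#   target[missing_idx]   # the SNPs without any annotation
```

Inspecting a handful of the ranges in `target[missing_idx]` by hand, as done for chr1:768448, would show whether a whole category of locations is being dropped.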

Thanks for your help!

 

R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8     LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] org.Hs.eg.db_3.4.1                      TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.28.3                  AnnotationDbi_1.38.1                   
 [5] BiocInstaller_1.26.0                    VariantAnnotation_1.22.1                Rsamtools_1.28.0                        Biostrings_2.44.1                      
 [9] XVector_0.16.0                          SummarizedExperiment_1.6.3              DelayedArray_0.2.7                      matrixStats_0.52.2                     
[13] GenomicRanges_1.28.3                    GenomeInfoDb_1.12.2                     IRanges_2.10.2                          S4Vectors_0.14.3                       
[17] GWASTools_1.22.0                        Biobase_2.36.2                          BiocGenerics_0.22.0                    

loaded via a namespace (and not attached):
 [1] quantsmooth_1.42.0       Rcpp_0.12.11             lattice_0.20-35          zoo_1.8-0                digest_0.6.9             lmtest_0.9-35           
 [7] plyr_1.8.3               MatrixModels_0.4-1       gdsfmt_1.12.0            RSQLite_1.1-2            ggplot2_2.2.1            zlibbioc_1.22.0         
[13] rlang_0.1.1              lazyeval_0.2.0           SparseM_1.77             rpart_4.1-11             Matrix_1.2-10            splines_3.4.0           
[19] BiocParallel_1.10.1      GWASExactHW_1.01         RCurl_1.95-4.8           biomaRt_2.32.1           munsell_0.4.2            compiler_3.4.0          
[25] rtracklayer_1.36.3       mgcv_1.8-17              nnet_7.3-12              tibble_1.3.3             GenomeInfoDbData_0.99.0  DNAcopy_1.50.1          
[31] XML_3.98-1.8             GenomicAlignments_1.12.1 MASS_7.3-47              bitops_1.0-6             grid_3.4.0               nlme_3.1-131            
[37] gtable_0.1.2             DBI_0.7                  scales_0.4.1             ncdf4_1.16               mice_2.30                sandwich_2.3-4          
[43] tools_3.4.0              BSgenome_1.44.0          survival_2.41-3          colorspace_1.2-4         memoise_0.2.1            logistf_1.22            
[49] quantreg_5.33 
Valerie Obenchain wrote (4 months ago):

Hi,

Yes, this is a bug.

* locateVariants(query, subject, region=AllVariants()) should return all 8 variant types described on the ?AllVariants man page

* 'upstream' and 'downstream' parameters in AllVariants() can be set for both PromoterVariants() and IntergenicVariants() but not IntronVariants(); see ?AllVariants for default values
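As a rough sketch of the second point, the region object can be built with explicit settings for promoters and intergenic regions. The numeric values below are illustrative assumptions, not recommendations; the package defaults are listed on ?AllVariants. The sketch assumes VariantAnnotation is installed.

```r
library(VariantAnnotation)

# 'upstream'/'downstream' are set per region type; the values here are
# illustrative only (see ?AllVariants for the package defaults).
region <- AllVariants(
  promoter   = PromoterVariants(upstream = 2000L, downstream = 500L),
  intergenic = IntergenicVariants(upstream = 10000L, downstream = 10000L)
)

# Printing the object shows which settings will be used by locateVariants().
region
```

The resulting object is passed as the `region` argument, e.g. locateVariants(query, subject, region = region).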

Your example SNP does fall in an intron:

library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
snp <- GRanges("chr1", IRanges(768448, width=1))
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
> findOverlaps(snp, intronsByTranscript(txdb))
Hits object with 6 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1          14
  [2]         1          15
  [3]         1          16
  [4]         1          17
  [5]         1          18
  [6]         1          19
  -------
  queryLength: 1 / subjectLength: 82960

> intronsByTranscript(txdb)[14]
GRangesList object of length 1:
$14
GRanges object with 2 ranges and 0 metadata columns:
      seqnames           ranges strand
         <Rle>        <IRanges>  <Rle>
  [1]     chr1 [763156, 764382]      +
  [2]     chr1 [764485, 776579]      +


It was not being reported by locateVariants(query, subject, region=AllVariants()) because the code that was supposed to retrieve IntronVariants() was actually retrieving IntergenicVariants(). The SNP is not intergenic so it was missed. This has been fixed in release (1.22.3) and devel (1.23.5) which are available in svn immediately or with biocLite() June 24 after 1pm EST.

> locateVariants(snp, txdb, AllVariants())
'select()' returned 1:1 mapping between keys and columns
GRanges object with 6 ranges and 9 metadata columns:
      seqnames           ranges strand | LOCATION  LOCSTART    LOCEND   QUERYID
         <Rle>        <IRanges>  <Rle> | <factor> <integer> <integer> <integer>
  [1]     chr1 [768448, 768448]      + |   intron      5191      5191         1
  [2]     chr1 [768448, 768448]      + |   intron      5191      5191         1
  [3]     chr1 [768448, 768448]      + |   intron      5191      5191         1
  [4]     chr1 [768448, 768448]      + |   intron      5191      5191         1
  [5]     chr1 [768448, 768448]      + |   intron      5191      5191         1
  [6]     chr1 [768448, 768448]      + |   intron      5117      5117         1
  ...                            
  -------
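To check whether a given VariantAnnotation version already includes this fix, the version number can be compared against the two versions named above. This is a minimal sketch: `has_intron_fix` is a hypothetical helper, and treating every version after 1.23.5 as fixed is an assumption.

```r
# Hypothetical helper: TRUE if a VariantAnnotation version is at or past the
# fixed versions named in this answer (1.22.3 in release, 1.23.5 in devel).
has_intron_fix <- function(version) {
  v <- package_version(version)
  if (v < "1.23.0") {
    v >= "1.22.3"   # release branch
  } else {
    v >= "1.23.5"   # devel branch and, by assumption, anything later
  }
}

has_intron_fix("1.22.1")   # the version in the sessionInfo() of the question
has_intron_fix("1.22.3")

# For the installed package:
#   has_intron_fix(packageVersion("VariantAnnotation"))
```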


Thanks for catching this.
Valerie

Vincent J. Carey, Jr. wrote (5 months ago):

"according to the UCSC the SNP is part of an intron" -- but you have used

AllVariants(promoter=PromoterVariants(downstream=500))

in your locateVariants() call. Try IntronVariants(). I would not be surprised if there were still some mismatches -- please provide full details if you find any; it is nice to work these out. Usually there is a good explanation!


Thank you for your answer, but honestly... now I am totally confused. I am using AllVariants() because I thought it would give me all available annotation types, including the output of CodingVariants(), IntronVariants(), FiveUTRVariants(), ThreeUTRVariants() and so on. (I thought the "promoter" argument only sets special parameters for promoters.) When I used locateVariants() some months ago, I did get annotations for intron SNPs. But you are right, my current list does not include any intron variants, and if I use IntronVariants() instead, I get them. This does not make any sense to me. It rather looks like a bug!? Why should AllVariants() report all types of SNP locations except introns???

It would be really nice if you could clarify this for me!

Reply written 5 months ago by Lna
I'd say the clue is to look at the output of AllVariants(). Maybe Val can comment further on how to increase scope.

Reply written 5 months ago by Vincent J. Carey, Jr. (sent by email in response to Lna's comment of Wed, Jun 21, 2017)