I want to annotate a list of variants, but for some reason I do not understand, some of them refuse to appear in the output. I set the options for the intronic variants to be as including as possible.
Here an example:
library(VariantAnnotation) library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene locateVariants( VRanges("chr10", IRanges(48451378, 48451378), "A", "T"), txdb, AllVariants(promoter = PromoterVariants(upstream = 0, downstream = 0), intergenic = IntergenicVariants(upstream = 0, downstream = 0)))
The return value is an empty GRanges object:
GRanges object with 0 ranges and 9 metadata columns: seqnames ranges strand | LOCATION LOCSTART LOCEND QUERYID TXID CDSID GENEID PRECEDEID FOLLOWID <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer> <integer> <IntegerList> <character> <CharacterList> <CharacterList> ------- seqinfo: no sequences
The region that I chose here, does have some gene annotation:
subsetByOverlaps(genes(txdb), subject = GRanges(seqnames = "chr10", ranges=IRanges(48451378, 48451378), strand="*"))
...returns:
GRanges object with 5 ranges and 1 metadata column: seqnames ranges strand | gene_id <Rle> <IRanges> <Rle> | <character> 26095 chr10 [46550123, 48827924] - | 26095 414212 chr10 [46764668, 48958083] + | 414212 55747 chr10 [47894023, 51893269] + | 55747 653129 chr10 [46550123, 48827924] - | 653129 728053 chr10 [46737613, 48952629] - | 728053 ------- seqinfo: 93 sequences (1 circular) from hg19 genome
What am I missing here? The function should return the variant, either annotated as being inside of the genes, or being intergenic, right?
Thanks a lot,
Jakob
> sessionInfo() R version 3.2.1 (2015-06-18) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1 locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C LC_TIME=English_United States.1252 attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.2 GenomicFeatures_1.20.1 AnnotationDbi_1.30.1 [4] Biobase_2.28.0 VariantAnnotation_1.14.6 Rsamtools_1.20.4 [7] Biostrings_2.36.1 XVector_0.8.0 GenomicRanges_1.20.5 [10] GenomeInfoDb_1.4.1 IRanges_2.2.5 S4Vectors_0.6.3 [13] BiocGenerics_0.14.0 loaded via a namespace (and not attached): [1] zlibbioc_1.14.0 GenomicAlignments_1.4.1 BiocParallel_1.2.9 BSgenome_1.36.3 tools_3.2.1 DBI_0.3.1 [7] lambda.r_1.1.7 futile.logger_1.4.1 rtracklayer_1.28.6 futile.options_1.0.0 bitops_1.0-6 RCurl_1.95-4.7 [13] biomaRt_2.24.0 RSQLite_1.0.0 XML_3.98-1.3 |
Hi Valerie,
Thanks for your quick reaction.
Expanding the ranges does not help it, I still don't get the variant annotated. The same holds true for playing with the parameters of the
IntergenicVariants()
.It looks to me like the variant might not get recognized as belonging to any of the categories, although this would contrast the notion of 'all variants' in
AllVariants()
. To my understanding I set the parameters such that there is no 'gap' in between genic regions and intergenic regions, where variants could hide. Or did I oversee something?Thanks, Jakob
locateVariants() ids variants (or any ranges in 'query') in the intron, cds, spliceSite, intergenic, promoter, and utr regions. The regions are extracted from a TxDb with the functions from GenomicFeatures, cdsBy(), intronsBy(), etc.
You're using PromoterVariants() and IntergenicVariants() so you have some idea
that the variant doesn't fall in the transcript.
As you've already done, we can confirm the variant hits a few genes:
but does not hit any transcipts:
Take a closer look at one of the genes hit. As an example we use 414212.
Look at the transcripts for gene 414212:
We can now see the variant falls between the two transcripts.
The 'upstream' and 'downstream' in PromoterVariants() are computed from the
transcription start site (TSS). This is a '+' strand gene so transcription
is from 5' to 3'. Our variant is 503027 upstream of the second transcript:
locateVariants() returns 8 ranges, each for a different transcript. The
'GENEID' column shows some transcripts are from the same gene, some different
and some have no associated gene id.
Gene with no associated gene id:
Valerie
Hi Valerie,
Thanks for the explanation. So I guess the problem is these "muted" regions that belong to a gene, but not to a transcript.
I had a look at the code, and all categories but the intergenic one is defined in a transcript-dependend manner:
I guess that for most use cases these definitions make sense. In particular, the intergenic category is called intergenic, and not "intertranscript". For my specific problem of simply annotating all input ranges, I guess I will make a local branch of the package and change the intergenic definition to a transcript-based one.
Cheers, Jakob