SNPs in multiple locations
1
0
Entering edit mode
Lna • 0
@lna-10651

Dear all,

I used the locateVariants() function in the VariantAnnotation package to annotate a big list of SNPs and I'm having some problems interpreting the results.

I checked for all available locations (intergenic, intron, coding, fiveUTR, threeUTR, promoter and splicesite) and I found out that one can pick any two of these locations and there are always some SNPs which are assigned to both categories, e.g. rs9778016 (chr1:996184) is annotated to be located in an intron as well as in an intergenic region. I'm not so sure what this means.

Is the reason that genes and gene predictions in the UCSC browser come from different sources and accordingly can be annotated differently? So the contradicting annotations refer to different sources? Or is it more probable that there are some errors in the annotation?

In the case of rs9778016 I checked the output of the UCSC table browser. For chr1:996184, I obtained a table with intron regions for two genes (uc009vjs.1 and uc001acl.1). I would like to understand where on the UCSC page the information that the SNP is located in an intergenic region comes from. Is there any way I can reproduce this directly by entering the SNP position into the browser? To me it just looks as if it is part of an intron.

@valerie-obenchain-4275

Hi,

On the locateVariants man page it says this under the 'Value' section:

A ‘GRanges’ object with a row for each variant-transcript match.

This means a single SNP may be classified as more than one feature and each feature is returned as a separate row. For your example it sounds like the SNP is in an intron region of a transcript but the transcript may not be part of a known gene. The term 'intergenic' means the range in 'query' does not fall in a gene range as defined by 'subject'.

I'm not sure what annotation you're using, so I'll use TxDb.Hsapiens.UCSC.hg19.knownGene as an example.

> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
> gr <- GRanges("chr1", IRanges(996184, width=1))

locateVariants() uses the extractors in the GenomicFeatures package to extract features and then finds overlaps between the given ranges and those features. Here we see the range falls in the introns of a couple of transcripts

> findOverlaps(gr, intronsByTranscript(txdb))
Hits object with 2 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1          65
  [2]         1          66
  -------
  queryLength: 1 / subjectLength: 82960

but does not fall in any known gene ranges.

> findOverlaps(gr, genes(txdb))
Hits object with 0 hits and 0 metadata columns:
   queryHits subjectHits
   <integer>   <integer>
  -------
  queryLength: 1 / subjectLength: 23056

Therefore the SNP would be classified as both 'intron' and 'intergenic'.
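
To see the same thing in the locateVariants() output itself, something along these lines might help. It is only a sketch, assuming the same hg19 knownGene annotation and the single position from the question; the variable names `loc` and `tx` are just for illustration. The last two lines list the transcripts overlapping the position together with their `gene_id` column. If that column turns out to be empty for those transcripts, it would be consistent with them not being grouped into any range returned by genes(txdb).

```r
library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
gr <- GRanges("chr1", IRanges(996184, width = 1))

# Ask for every region type at once; the result has one row per
# variant-feature match, so one SNP can appear on several rows.
loc <- locateVariants(gr, txdb, AllVariants())

# Cross-tabulate query index against the LOCATION category of each row.
table(loc$QUERYID, loc$LOCATION)

# Transcripts overlapping the position, with the gene ID (if any)
# that the annotation package attaches to each of them.
tx <- transcripts(txdb, columns = c("tx_name", "gene_id"))
subsetByOverlaps(tx, gr)
```

Any query index that has counts in more than one column of the table is a SNP reported under several locations.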

Valerie


Thank you very much for the detailed explanation. Sorry, I forgot to mention the annotation source; yes, I also used TxDb.Hsapiens.UCSC.hg19.knownGene. Now I understand how the output is generated from the annotation package, but I still don't really understand how to relate it to the information on the UCSC page. I don't have much experience with the UCSC page, so maybe I am getting something wrong. For the position I mentioned (chr1:996184) the result is: intron region in genes AK310350 and BC033949 on the UCSC genes track in the genome browser (uc009vjs.1 and uc001acl.1 in the table browser). According to your example query, aren't these "known genes"? I would be really grateful if you could comment on that.