Annotating all SNVs in a genomic area, based on HG38 coordinates
1
0
Entering edit mode
Emanuele • 0
@emanuele-24073
Last seen 4.0 years ago
United Kingdom

Hello, sorry for the probably silly question, but I've tried for days now. I'm a beginner. I have a list of genomic coordinates in GenomicRanges format, e.g.

> coords_GR
GRanges object with 3320 ranges and 0 metadata columns:
         seqnames            ranges strand
            <Rle>         <IRanges>  <Rle>
     [1]        1           1950293      *
     [2]        1 18336405-20142656      *
     [3]        1          29684252      *
     [4]        1 71218722-73861224      *
     [5]        1 80888506-83526065      *
     ...      ...               ...    ...
  [3316]       20    716956-2488855      *
  [3317]       21 41901419-44757190      *
  [3318]       22 29255810-31043932      *
  [3319]       22 35134992-38911889      *
  [3320]        X         154717327      *
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths

And I would like to annotate all SNVs in the range, obtaining information such as RSid, MAF, and, if exonic, if it's synonimous/non-synonimous. Also it would be good to be able to annotate non-SNP variations such as indels, CNVs, deletions etc for potential protein coding/clinical impact. I found a number of tools that can do that for genotyping data in VCF (such as ensembl VEP), but no tools that can do it based on genomic coordinates alone.

Many thanks for your help

Emanuele

Annotation variation HG38 • 1.4k views
ADD COMMENT
3
Entering edit mode
@herve-pages-1542
Last seen 3 days ago
Seattle, WA, United States

Hi,

As a first step you could use a SNPlocs package to map your genomic ranges to rs ids:

library(SNPlocs.Hsapiens.dbSNP144.GRCh38)
snps <- SNPlocs.Hsapiens.dbSNP144.GRCh38
my_regions <- GRanges(c("1:1950293", "1:18336405-20142656", "1:29684252",
                        "1:71218722-73861224", "22:35134992-38911889"))
my_snps <- snpsByOverlaps(snps, my_regions)
my_snps
# UnstitchedGPos object with 411257 positions and 2 metadata columns:
#            seqnames       pos strand |   RefSNP_id alleles_as_ambig
#               <Rle> <integer>  <Rle> | <character>      <character>
#        [1]        1   1950293      * | rs145840463                R
#        [2]        1  18336405      * | rs147541038                R
#        [3]        1  18336413      * | rs542452980                R
#        [4]        1  18336416      * | rs575136982                R
#        [5]        1  18336432      * | rs182869403                K
#        ...      ...       ...    ... .         ...              ...
#   [411253]       22  38911800      * | rs535959691                R
#   [411254]       22  38911824      * | rs555183244                R
#   [411255]       22  38911852      * | rs538293072                K
#   [411256]       22  38911861      * | rs189740977                Y
#   [411257]       22  38911889      * | rs181225505                Y
#   -------
#   seqinfo: 25 sequences (1 circular) from GRCh38.p2 genome

Then use your preferred tool for retrieving details about the rs ids.

Notes:

  1. Unfortunately the regions of origin of the SNPs are not reported but you can easily retrieve them with:

    hits <- findOverlaps(my_snps, my_regions)
    region_idx <- as(hits, "IntegerList")
    mcols(my_snps)$region_idx <- region_idx
    mcols(my_snps)$region <- extractList(my_regions, region_idx)
    my_snps
    # UnstitchedGPos object with 411257 positions and 4 metadata columns:
    #            seqnames       pos strand |   RefSNP_id alleles_as_ambig
    #               <Rle> <integer>  <Rle> | <character>      <character>
    #        [1]        1   1950293      * | rs145840463                R
    #        [2]        1  18336405      * | rs147541038                R
    #        [3]        1  18336413      * | rs542452980                R
    #        [4]        1  18336416      * | rs575136982                R
    #        [5]        1  18336432      * | rs182869403                K
    #        ...      ...       ...    ... .         ...              ...
    #   [411253]       22  38911800      * | rs535959691                R
    #   [411254]       22  38911824      * | rs555183244                R
    #   [411255]       22  38911852      * | rs538293072                K
    #   [411256]       22  38911861      * | rs189740977                Y
    #   [411257]       22  38911889      * | rs181225505                Y
    #               region_idx               region
    #            <IntegerList>        <GRangesList>
    #        [1]             1            1:1950293
    #        [2]             2  1:18336405-20142656
    #        [3]             2  1:18336405-20142656
    #        [4]             2  1:18336405-20142656
    #        [5]             2  1:18336405-20142656
    #        ...           ...                  ...
    #   [411253]             5 22:35134992-38911889
    #   [411254]             5 22:35134992-38911889
    #   [411255]             5 22:35134992-38911889
    #   [411256]             5 22:35134992-38911889
    #   [411257]             5 22:35134992-38911889
    #   -------
    #   seqinfo: 25 sequences (1 circular) from GRCh38.p2 genome
    
  2. If you want SNVs in addition to SNPs, do the same as the above with the XtraSNPlocs.Hsapiens.dbSNP144.GRCh38 package.

See ?SNPlocs and ?XtraSNPlocs in the BSgenome package for more information.

Best,

H.

ADD COMMENT
0
Entering edit mode

Dear Hervé, this is amazing, thank you so much. Bioconductor is great but so difficult to start with! At least for me. When you say "Then use your preferred tool for retrieving details about the rs ids." which tools can you recommend to annotate protein coding effects, both deletions/frameshift/truncations etc?

Many thanks Emanuele

ADD REPLY
3
Entering edit mode

You could try Annovar, or the UCSC Genome Browser or any of the other choices listed by dbNSFP.

ADD REPLY

Login before adding your answer.

Traffic: 434 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6