Question

How to map Chr and bp columns to the corresponding RSID column?

iago.junger • 0

Brazil

Hello dear all!

I have a summary statistics file that has Chr and bp columns. However, to run the LDSC script I need the same file with an RSID column, so I need a biomaRt script that can take the Chr and bp columns and return the corresponding RSIDs. My team and I are struggling with biomaRt. Does anyone here know how to do this? Please contact me :)

All the best,

Iago Junger

biomaRt • 1.2k views

ADD COMMENT • link updated 2.8 years ago by James W. MacDonald 65k • written 2.8 years ago by iago.junger • 0

What have you tried? Have you tried looking up a few coordinates in BioMart through the web interface?

ADD REPLY • link 2.8 years ago swbarnes2 ★ 1.3k

score 0 · Answer 1 · 2021-07-12

I would tend to use one of the SNPlocs packages for this, rather than biomaRt. As a completely contrived example,

> library(SNPlocs.Hsapiens.dbSNP144.GRCh37)

## fake GRanges - you need to use your Chr and bp columns to do this!
## also note that the chromosomes have no prepended 'chr'.

> fakeo <- GRanges(rep("1", 500), IRanges(sample(1:1e5, 500), width = 1))

> z <- snpsByOverlaps(SNPlocs.Hsapiens.dbSNP144.GRCh37, fakeo)
> z
UnstitchedGPos object with 6 positions and 2 metadata columns:
      seqnames       pos strand |   RefSNP_id alleles_as_ambig
         <Rle> <integer>  <Rle> | <character>      <character>
  [1]        1     14728      * | rs547701710                M
  [2]        1     15150      * |  rs11803681                Y
  [3]        1     17538      * | rs200046632                M
  [4]        1     63643      * | rs202004563                R
  [5]        1     66737      * | rs560785016                K
  [6]        1     69869      * | rs548049170                W
  -------
  seqinfo: 25 sequences (1 circular) from GRCh37.p13 genome

## and now you can get the RSIDs from the GPos object.

> fo <- findOverlaps(fakeo, z)
> fo
Hits object with 6 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         3           1
  [2]        95           2
  [3]       120           6
  [4]       229           5
  [5]       370           3
  [6]       465           4
  -------
  queryLength: 500 / subjectLength: 6
> mcols(fakeo)$rsid <- NA
> mcols(fakeo)$rsid[queryHits(fo)] <- mcols(z)$RefSNP_id[subjectHits(fo)]
> fakeo
GRanges object with 500 ranges and 1 metadata column:
        seqnames    ranges strand |        rsid
           <Rle> <IRanges>  <Rle> | <character>
    [1]        1     94944      * |        <NA>
    [2]        1     97983      * |        <NA>
    [3]        1     14728      * | rs547701710
    [4]        1     56186      * |        <NA>
    [5]        1     53476      * |        <NA>
    ...      ...       ...    ... .         ...
  [496]        1     91756      * |        <NA>
  [497]        1     70297      * |        <NA>
  [498]        1     27187      * |        <NA>
  [499]        1     66576      * |        <NA>
  [500]        1     81208      * |        <NA>
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Given that I just faked up some positions, there isn't much overlap. But you get the general idea, I hope.
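
The example uses fake positions, so here is a minimal sketch of the step it leaves out: building the GRanges from the Chr and bp columns of a summary statistics table. The data frame `sumstats`, its values, and the `P` column are made up for illustration; only the column names `Chr` and `bp` come from the question. It also assumes the coordinates are on GRCh37, which is the build of the SNPlocs package used above.

```r
library(GenomicRanges)

# Toy summary statistics. Chr and bp are the column names from the question;
# the values and the P column are hypothetical.
sumstats <- data.frame(
  Chr = c("chr1", "chr2", "chrX"),
  bp  = c(14728L, 20000L, 30000L),
  P   = c(0.01, 0.20, 0.50)
)

# The SNPlocs package above uses chromosome names without a prepended
# 'chr', so strip the prefix if the file has one.
sumstats$Chr <- sub("^chr", "", sumstats$Chr)

# One width-1 range per variant, in the same row order as sumstats.
gr <- GRanges(seqnames = sumstats$Chr,
              ranges = IRanges(start = sumstats$bp, width = 1))
```

With `gr` used in place of `fakeo`, the rest of the example should run unchanged. Because `gr` keeps the row order of `sumstats`, the IDs can then be copied back with `sumstats$rsid <- mcols(gr)$rsid`. Positions with no match in the SNPlocs package stay `NA`.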