Obtaining rsIDs with ENSG and variant ID information
bd2000 ▴ 30
United Kingdom

Hi all,

I'm trying to obtain rsid information on my dataset using biomaRt and I'm not too sure how to go about it. My dataset has over 8,000,000 rows and has the following columns:

      phenotype_id   variant_id tss_distance   maf ma_samples ma_count pval_nominal      slope  slope_se
1: ENSG00000112679 6_203909_A_G      -147446 0.065         13       13    0.7262169 0.08764783 0.2494609
2: ENSG00000112679 6_204072_G_T      -147283 0.065         13       13    0.7262169 0.08764783 0.2494609

Is it possible to get rsids with just this information? And if so, how would I go about it? Thank you in advance!

rsid biomaRt • 943 views
Robert Castelo ★ 3.4k
Barcelona/Universitat Pompeu Fabra

First and foremost, find out which version of the human reference genome your dataset was derived from. Assuming it was GRCh38, you can use the annotation package SNPlocs.Hsapiens.dbSNP155.GRCh38 to find the rsIDs, as illustrated in this previous answer in this forum to the same question. To build the input GPos object from a data.frame of the kind you have, you can do the following (others in this forum may suggest more compact solutions):

## this just simulates the first two columns of your input dataset
library(GenomicRanges)  ## provides GPos()
dat <- data.frame(phenotype_id=c("ENSG00000112679", "ENSG00000112679"),
                  variant_id=c("6_203909_A_G", "6_204072_G_T"))
dat
     phenotype_id   variant_id
1 ENSG00000112679 6_203909_A_G
2 ENSG00000112679 6_204072_G_T
## split each variant ID into chromosome, position, ref and alt alleles
my_snps <- strsplit(dat$variant_id, "_")
my_snps <- GPos(seqnames=sapply(my_snps, "[", 1),
                pos=as.integer(sapply(my_snps, "[", 2)))
my_snps
UnstitchedGPos object with 2 positions and 0 metadata columns:
      seqnames       pos strand
         <Rle> <integer>  <Rle>
  [1]        6    203909      *
  [2]        6    204072      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

and then apply the code in the answer linked above.
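
The linked answer is not reproduced in this thread. As a rough sketch only, and not necessarily the code from that answer, the lookup could go along these lines using snpsByOverlaps() from the BSgenome package. It assumes GRCh38 coordinates and that the SNPlocs package is already installed; the toy data frame, the deduplication step and the `rsid` column name are illustrative choices, not part of the original answer.

```r
library(GenomicRanges)
library(BSgenome)                            # provides snpsByOverlaps()
library(SNPlocs.Hsapiens.dbSNP155.GRCh38)    # assumes GRCh38 coordinates

## toy stand-in for the first two columns of the real dataset
dat <- data.frame(phenotype_id=c("ENSG00000112679", "ENSG00000112679"),
                  variant_id=c("6_203909_A_G", "6_204072_G_T"))

## query each distinct variant only once (many rows may share a variant)
ids <- unique(dat$variant_id)
parts <- strsplit(ids, "_", fixed=TRUE)
query <- GPos(seqnames=sapply(parts, "[", 1),
              pos=as.integer(sapply(parts, "[", 2)))

## known SNPs at the query positions; rsIDs are in the RefSNP_id column
hits <- snpsByOverlaps(SNPlocs.Hsapiens.dbSNP155.GRCh38, query)

## map hits back to the query; positions without a known SNP stay NA
ov <- findOverlaps(query, hits)
rsid <- rep(NA_character_, length(ids))
rsid[queryHits(ov)] <- mcols(hits)$RefSNP_id[subjectHits(ov)]

## carry the rsIDs over to every row of the original table
dat$rsid <- rsid[match(dat$variant_id, ids)]
```

Note that this matches on position only. The alleles in variant_id are not checked against dbSNP, and if several rsIDs share a position only one of them is kept. The alleles_as_ambig metadata column of the result could be used for an allele check. If the chromosome names in your data carried a "chr" prefix, they would need converting to the NCBI style used by the SNPlocs package first.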


Thank you for your answer. I've tried to download the package but it didn't work:

Error in download.file(url, destfile, method, mode = "wb", ...) : 
  download from 'https://bioconductor.org/packages/3.18/data/annotation/src/contrib/SNPlocs.Hsapiens.dbSNP155.GRCh38_0.99.24.tar.gz' failed
In addition: Warning messages:
1: In download.file(url, destfile, method, mode = "wb", ...) :
  downloaded length 0 != reported length 0
2: In download.file(url, destfile, method, mode = "wb", ...) :
  URL 'https://bioconductor.org/packages/3.18/data/annotation/src/contrib/SNPlocs.Hsapiens.dbSNP155.GRCh38_0.99.24.tar.gz': Timeout of 300 seconds was reached

Is there any other way I can download the package or maybe a different package I can use?


This is a large package, and downloading and installing large packages often runs into timeouts. Try setting a longer timeout and installing it again, i.e.:

options(timeout=1200)
BiocManager::install("SNPlocs.Hsapiens.dbSNP155.GRCh38")