Question

Good way to get location of in-del SNPs in UCSC coordinates?

1

Entering edit mode

garrisrg ▴ 10

@garrisrg-6920

Last seen 9.4 years ago

United States

Hi guys,

My question is somewhat related to this one posted almost 2 years ago: Problem locating SNP by rsID for SNPlocs.Hsapiens.dbSNP.20120608 package Bioconductor x. I am curious if there is any quick way to retrieve in-del snps that give USCS coordinates like the SNPlocs.Hsapiens.dbSNP.20120608 package? I know I can retrieve them using the biomaRt package but the snps given are in the NCBI coordinate system. Is there some way available to convert between coordinate systems or maybe another package or function that will help me get this information? Most of what I do involves pharmacogenomics which tends to involve quite a few in-del like snps.

Please let me know if what I'm saying is not clear.

Thanks

snplocs snp • 2.3k views

ADD COMMENT • link 9.4 years ago garrisrg ▴ 10

score 0 · Answer 1 · 2014-11-28

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 3 days ago

United States

Indels are available in the XtraSNPlocs.Hsapiens.dbSNP141.GRCh38 package, which is currently only available in the 'devel' branch of Bioconductor (see instructions). Relevant help pages are ?XtraSNPlocs and ?XtraSNPlocs.Hsapiens.dbSNP141.GRCh38. I'm not an expert, so take the following with a grain of salt. Here are the indels:

library(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
snps = XtraSNPlocs.Hsapiens.dbSNP141.GRCh38
my_snps1 = snpsBySeqname(snps, c("ch22", "chMT"),
                         c("RefSNP_id", "alleles", "snpClass"))
indels = my_snps1[my_snps1$snpClass == "in-del"]

The key difference between dbSNP, NCBI, and UCSC is the naming convention used for chromosomes; of course it is very important to have the appropriate genome builds when relating SNPs to, e.g., genome sequence or gene annotation. From the example on ?XtraSNPlocs.Hsapiens.dbSNP141.GRCh38, it's easy to convert to, e.g., UCSC

library(BSgenome.Hsapiens.NCBI.GRCh38)
genome <- BSgenome.Hsapiens.NCBI.GRCh38
seqlevelsStyle(indels)  # dbSNP
seqlevelsStyle(genome)  # NCBI
seqlevelsStyle(indels) <- seqlevelsStyle(genome)

And then sequence-based operations between the SNPs and the genome are in sync. Similarly with the annotation package TxDb.Hsapiens.UCSC.hg38.knownGene.

ADD COMMENT • link 9.4 years ago Martin Morgan 25k

0

Entering edit mode

Thank you Martin I will give the developer version a try. I figured there might be a way with the new GRCh38 packages but I was somewhat hesitant as I attempted to retrieve a few sequences from BSgenome.Hsapiens.NCBI.GRCh38 and found myself having similar problems as in this post: https://stat.ethz.ch/pipermail/bioc-devel/2014-April/005595.html (giving me different sequences every time I reran the code) using the getSeq() function. Anyway I'll give it a try, Thank you!

ADD REPLY • link 9.4 years ago garrisrg ▴ 10

0

Entering edit mode

It would be great to hear more about the 'same query different sequence' results if they persist; considerable effort went into trying to reproduce and correct this, but in the end the original poster stopped replying. The solution (using a different internal representation for the BSgenome objects) should have been a robust solution to our best guess as to what the problem was. It's important to make any problem reports as explicit and simple as possible, using a current R, ensuring BiocInstaller::biocValid() is happy, making sure no unnecessary packages are loaded, and providing a short code chunk that someone not on your computer can evaluate. If you do have problems with inconsistent sequence results, please do let us know!

ADD REPLY • link 9.4 years ago Martin Morgan 25k

score 0 · Answer 2 · 2014-12-01

Hi again Martin,

I tried to reproduce the sequence issue I was talking about. Curiously the problem seemed to show up with the BSgenome.Hsapiens.UCSC.hg19_1.3.99, although I've never really seen it for this one before:

> library("BSgenome.Hsapiens.UCSC.hg19")
> genome2<-BSgenome.Hsapiens.UCSC.hg19
> #start of SLC6A4 gene in UCSC coordinates 
> start2<-rep(28521337,10) 
> #seq strings for UCSC genome
> seqs2<-c()
> for(i in 1:length(start2)){
+   seqs2[i]<- as.character(getSeq(genome2, 
+                                  names= "chr17", start= start2[i], width=50))
+ }
> 
> print(DNAStringSet(seqs2))
  A DNAStringSet instance of length 10
     width seq
 [1]    50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
 [2]    50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
 [3]    50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
 [4]    50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
 [5]    50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
 [6]    50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
 [7]    50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
 [8]    50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
 [9]    50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
[10]    50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG

Normally I wouldn't use a for loop but I wanted to demonstrate 10 iterations of the same thing. This looks consistent. Now with the start of the CYP2D6 gene:

> library("BSgenome.Hsapiens.UCSC.hg19")
> genome2<-BSgenome.Hsapiens.UCSC.hg19
> #location of CYP2D6 start site UCSC (GRCh37/hg19)

> start4<-rep(42522501,10) 
> 
> #seq strings for UCSC genome
> seqs4<-c()
> for(i in 1:length(start4)){
+   seqs4[i]<- as.character(getSeq(genome2, names= "chr22", start= start4[i], 
+                           width=50, strand= "+"))
+ }
> 
> print(DNAStringSet(seqs4))
  A DNAStringSet instance of length 10
     width seq
 [1]    50 AATCATGTGTTTAAGTCGATTAGAACTTTTTTCTGTGTCCTGCCTGGCAT
 [2]    50 TCCTGCTAGCATAATATCATCAAAATAATAAAGCCAAATGTGATATTTCA
 [3]    50 ATAATCACAAAAATATCAAGATTTTGCTTGTCTATTATGACATTAGGCAG
 [4]    50 AAATCAGAAAAAAAATCCTCATGTAAGATAGTAAAAATGTACTTTTGCCC
 [5]    50 TGCTATAAAAAATAAAATAGCTTCTAATTATGTTTACTAATAGAAATTAA
 [6]    50 GAGAAAAATGCTCTATCAAAAGCTGCAGAGCAAGATAGCTGAATAGAACT
 [7]    50 TCCAGGAATTTCCCCCCACGCACACAAAGGAACACCAAATTGAACAACTA
 [8]    50 CCACACAAAAATACTTTCATAGAACCAAAAATTAGGTTGAGCAATCACAG
 [9]    50 AGCTGGTTTTAATCTTATATCAAAGAAAGAGGCATCGAAGAAGGTAGAGA
[10]    50 GGCAGTCTTAAATTGCCTATGCCACCCCTCCTCCATCCACTAGCTGCAGC

This time the sequence received is not consistent, but I'm unsure what's going on. Here's the session info for that:

R version 3.1.2 (2014-10-31)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.3.99 BSgenome_1.34.0                   
 [3] rtracklayer_1.26.2                 Biostrings_2.34.0                 
 [5] XVector_0.6.0                      GenomicRanges_1.18.3              
 [7] GenomeInfoDb_1.2.3                 IRanges_2.0.0                     
 [9] S4Vectors_0.4.0                    BiocGenerics_0.12.1               

loaded via a namespace (and not attached):
 [1] base64enc_0.1-2         BatchJobs_1.5           BBmisc_1.8             
 [4] BiocParallel_1.0.0      bitops_1.0-6            brew_1.0-6             
 [7] checkmate_1.5.0         codetools_0.2-9         DBI_0.3.1              
[10] digest_0.6.4            fail_1.2                foreach_1.4.2          
[13] GenomicAlignments_1.2.1 iterators_1.0.7         RCurl_1.95-4.4         
[16] Rsamtools_1.18.2        RSQLite_1.0.0           sendmailR_1.2-1        
[19] stringr_0.6.2           tools_3.1.2             XML_3.98-1.1           
[22] zlibbioc_1.12.0

I then retried it using the BSgenome.Hsapiens.NCBI.GRCh38 I got this:

> genome<-BSgenome.Hsapiens.NCBI.GRCh38

> #start of SLC6A4 gene
> start1<-rep(30194319,10)
> 
> #seq strings for NCBI build
> seqs1<-c()
> for(i in 1:length(start1)){
+   seqs1[i]<- as.character(getSeq(genome,"17", start=start1[i], 
+                                 width=50, strand= "+"))
+ }
Error in value[[3L]](cond) : 
   record 1 (17:30194319-30194368) contains invalid DNA letters
  file: C:/Program Files/R/R-3.1.2/library/BSgenome.Hsapiens.NCBI.GRCh38/extdata/single_sequences.fa.rz
> 
> print(DNAStringSet(seqs1))
  A DNAStringSet instance of length 0

If you re-run it after this error you get a string of NNN's for all iterations (which I've also seen with the hg19 genome too, but not always). The same thing happened when running the CYP2D6 start site (different chromosome). Currently this computer is not operating with the developmental version and I have both BSgenomes loaded; not sure if that has anything to do with it. Here's the sessionInfo for the last block:

R version 3.1.2 (2014-10-31)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] BSgenome.Hsapiens.NCBI.GRCh38_1.3.999
 [2] BSgenome_1.34.0                      
 [3] rtracklayer_1.26.2                   
 [4] Biostrings_2.34.0                    
 [5] XVector_0.6.0                        
 [6] GenomicRanges_1.18.3                 
 [7] GenomeInfoDb_1.2.3                   
 [8] IRanges_2.0.0                        
 [9] S4Vectors_0.4.0                      
[10] BiocGenerics_0.12.1                  

loaded via a namespace (and not attached):
 [1] base64enc_0.1-2         BatchJobs_1.5           BBmisc_1.8             
 [4] BiocParallel_1.0.0      bitops_1.0-6            brew_1.0-6             
 [7] checkmate_1.5.0         codetools_0.2-9         DBI_0.3.1              
[10] digest_0.6.4            fail_1.2                foreach_1.4.2          
[13] GenomicAlignments_1.2.1 iterators_1.0.7         RCurl_1.95-4.4         
[16] Rsamtools_1.18.2        RSQLite_1.0.0           sendmailR_1.2-1        
[19] stringr_0.6.2           tools_3.1.2             XML_3.98-1.1           
[22] zlibbioc_1.12.0

It's probably something with my machine or settings but any help you can give would be great,

Thanks again!

Ryan

score 0 · Answer 3 · 2014-12-02

Thanks Martin that did the trick,

I'll be sure I have the source file from now on. Here's the comparison with both genomes:

NCBI:

> library('BSgenome.Hsapiens.NCBI.GRCh38') 
> genome<-BSgenome.Hsapiens.NCBI.GRCh38
> start3<-rep(42126499,10)
> #CYP2D6 seq strings for NCBI build
> seqs3<-c()
> for(i in 1:length(start3)){
+   seqs3[i]<- as.character(getSeq(genome,names="22", start=start3[i], 
+                                 width=50, strand= "+"))
+ }
> print(DNAStringSet(seqs3))
  A DNAStringSet instance of length 10
     width seq
 [1]    50 TTGGAACTACCACATTGCTTTATTGTACATTAGAGCCTCTGGCTAGGGAG
 [2]    50 TTGGAACTACCACATTGCTTTATTGTACATTAGAGCCTCTGGCTAGGGAG
 [3]    50 TTGGAACTACCACATTGCTTTATTGTACATTAGAGCCTCTGGCTAGGGAG
 [4]    50 TTGGAACTACCACATTGCTTTATTGTACATTAGAGCCTCTGGCTAGGGAG
 [5]    50 TTGGAACTACCACATTGCTTTATTGTACATTAGAGCCTCTGGCTAGGGAG
 [6]    50 TTGGAACTACCACATTGCTTTATTGTACATTAGAGCCTCTGGCTAGGGAG
 [7]    50 TTGGAACTACCACATTGCTTTATTGTACATTAGAGCCTCTGGCTAGGGAG
 [8]    50 TTGGAACTACCACATTGCTTTATTGTACATTAGAGCCTCTGGCTAGGGAG
 [9]    50 TTGGAACTACCACATTGCTTTATTGTACATTAGAGCCTCTGGCTAGGGAG
[10]    50 TTGGAACTACCACATTGCTTTATTGTACATTAGAGCCTCTGGCTAGGGAG

And UCSC:

> #location of CYP2D6 start site UCSC (GRCh37/hg19)
> start4<-rep(42522501,10) 

#seq strings for UCSC genome
> library("BSgenome.Hsapiens.UCSC.hg19")
> genome2<-BSgenome.Hsapiens.UCSC.hg19

> seqs4<-c()
> for(i in 1:length(start4)){
+   seqs4[i]<- as.character(getSeq(genome2, names= "chr22", start= start4[i], 
+                           width=50, strand= "+"))
+ }
> 
> print(DNAStringSet(seqs4))
  A DNAStringSet instance of length 10
     width seq
 [1]    50 TTGGAACTACCACATTGCTTTATTGTACATTAGAGCCTCTGGCTAGGGAG
 [2]    50 TTGGAACTACCACATTGCTTTATTGTACATTAGAGCCTCTGGCTAGGGAG
 [3]    50 TTGGAACTACCACATTGCTTTATTGTACATTAGAGCCTCTGGCTAGGGAG
 [4]    50 TTGGAACTACCACATTGCTTTATTGTACATTAGAGCCTCTGGCTAGGGAG
 [5]    50 TTGGAACTACCACATTGCTTTATTGTACATTAGAGCCTCTGGCTAGGGAG
 [6]    50 TTGGAACTACCACATTGCTTTATTGTACATTAGAGCCTCTGGCTAGGGAG
 [7]    50 TTGGAACTACCACATTGCTTTATTGTACATTAGAGCCTCTGGCTAGGGAG
 [8]    50 TTGGAACTACCACATTGCTTTATTGTACATTAGAGCCTCTGGCTAGGGAG
 [9]    50 TTGGAACTACCACATTGCTTTATTGTACATTAGAGCCTCTGGCTAGGGAG
[10]    50 TTGGAACTACCACATTGCTTTATTGTACATTAGAGCCTCTGGCTAGGGAG

Thanks again,

Ryan

R version 3.1.2 (2014-10-31)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0      BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000 BSgenome_1.34.0                       
 [4] rtracklayer_1.26.2                     Biostrings_2.34.0                      XVector_0.6.0                         
 [7] GenomicRanges_1.18.3                   GenomeInfoDb_1.2.3                     IRanges_2.0.0                         
[10] S4Vectors_0.4.0                        BiocGenerics_0.12.1                    BiocInstaller_1.16.1                  

loaded via a namespace (and not attached):
 [1] base64enc_0.1-2         BatchJobs_1.5           BBmisc_1.8              BiocParallel_1.0.0      bitops_1.0-6           
 [6] brew_1.0-6              checkmate_1.5.0         codetools_0.2-9         DBI_0.3.1               digest_0.6.4           
[11] fail_1.2                foreach_1.4.2           GenomicAlignments_1.2.1 iterators_1.0.7         RCurl_1.95-4.4         
[16] Rsamtools_1.18.2        RSQLite_1.0.0           sendmailR_1.2-1         stringr_0.6.2           tools_3.1.2            
[21] XML_3.98-1.1            zlibbioc_1.12.0