Hi again Martin,
I tried to reproduce the sequence issue I was talking about. Curiously the problem seemed to show up with the BSgenome.Hsapiens.UCSC.hg19_1.3.99, although I've never really seen it for this one before:
> library("BSgenome.Hsapiens.UCSC.hg19")
> genome2<-BSgenome.Hsapiens.UCSC.hg19
> #start of SLC6A4 gene in UCSC coordinates
> start2<-rep(28521337,10)
> #seq strings for UCSC genome
> seqs2<-c()
> for(i in 1:length(start2)){
+ seqs2[i]<- as.character(getSeq(genome2,
+ names= "chr17", start= start2[i], width=50))
+ }
>
> print(DNAStringSet(seqs2))
A DNAStringSet instance of length 10
width seq
[1] 50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
[2] 50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
[3] 50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
[4] 50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
[5] 50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
[6] 50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
[7] 50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
[8] 50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
[9] 50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
[10] 50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
Normally I wouldn't use a for loop but I wanted to demonstrate 10 iterations of the same thing. This looks consistent. Now with the start of the CYP2D6 gene:
> library("BSgenome.Hsapiens.UCSC.hg19")
> genome2<-BSgenome.Hsapiens.UCSC.hg19
> #location of CYP2D6 start site UCSC (GRCh37/hg19)
> start4<-rep(42522501,10)
>
> #seq strings for UCSC genome
> seqs4<-c()
> for(i in 1:length(start4)){
+ seqs4[i]<- as.character(getSeq(genome2, names= "chr22", start= start4[i],
+ width=50, strand= "+"))
+ }
>
> print(DNAStringSet(seqs4))
A DNAStringSet instance of length 10
width seq
[1] 50 AATCATGTGTTTAAGTCGATTAGAACTTTTTTCTGTGTCCTGCCTGGCAT
[2] 50 TCCTGCTAGCATAATATCATCAAAATAATAAAGCCAAATGTGATATTTCA
[3] 50 ATAATCACAAAAATATCAAGATTTTGCTTGTCTATTATGACATTAGGCAG
[4] 50 AAATCAGAAAAAAAATCCTCATGTAAGATAGTAAAAATGTACTTTTGCCC
[5] 50 TGCTATAAAAAATAAAATAGCTTCTAATTATGTTTACTAATAGAAATTAA
[6] 50 GAGAAAAATGCTCTATCAAAAGCTGCAGAGCAAGATAGCTGAATAGAACT
[7] 50 TCCAGGAATTTCCCCCCACGCACACAAAGGAACACCAAATTGAACAACTA
[8] 50 CCACACAAAAATACTTTCATAGAACCAAAAATTAGGTTGAGCAATCACAG
[9] 50 AGCTGGTTTTAATCTTATATCAAAGAAAGAGGCATCGAAGAAGGTAGAGA
[10] 50 GGCAGTCTTAAATTGCCTATGCCACCCCTCCTCCATCCACTAGCTGCAGC
This time the sequence received is not consistent, but I'm unsure what's going on. Here's the session info for that:
R version 3.1.2 (2014-10-31)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] BSgenome.Hsapiens.UCSC.hg19_1.3.99 BSgenome_1.34.0
[3] rtracklayer_1.26.2 Biostrings_2.34.0
[5] XVector_0.6.0 GenomicRanges_1.18.3
[7] GenomeInfoDb_1.2.3 IRanges_2.0.0
[9] S4Vectors_0.4.0 BiocGenerics_0.12.1
loaded via a namespace (and not attached):
[1] base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.8
[4] BiocParallel_1.0.0 bitops_1.0-6 brew_1.0-6
[7] checkmate_1.5.0 codetools_0.2-9 DBI_0.3.1
[10] digest_0.6.4 fail_1.2 foreach_1.4.2
[13] GenomicAlignments_1.2.1 iterators_1.0.7 RCurl_1.95-4.4
[16] Rsamtools_1.18.2 RSQLite_1.0.0 sendmailR_1.2-1
[19] stringr_0.6.2 tools_3.1.2 XML_3.98-1.1
[22] zlibbioc_1.12.0
I then retried it using the BSgenome.Hsapiens.NCBI.GRCh38 I got this:
> genome<-BSgenome.Hsapiens.NCBI.GRCh38
> #start of SLC6A4 gene
> start1<-rep(30194319,10)
>
> #seq strings for NCBI build
> seqs1<-c()
> for(i in 1:length(start1)){
+ seqs1[i]<- as.character(getSeq(genome,"17", start=start1[i],
+ width=50, strand= "+"))
+ }
Error in value[[3L]](cond) :
record 1 (17:30194319-30194368) contains invalid DNA letters
file: C:/Program Files/R/R-3.1.2/library/BSgenome.Hsapiens.NCBI.GRCh38/extdata/single_sequences.fa.rz
>
> print(DNAStringSet(seqs1))
A DNAStringSet instance of length 0
If you re-run it after this error you get a string of NNN's for all iterations (which I've also seen with the hg19 genome too, but not always). The same thing happened when running the CYP2D6 start site (different chromosome). Currently this computer is not operating with the developmental version and I have both BSgenomes loaded; not sure if that has anything to do with it. Here's the sessionInfo for the last block:
R version 3.1.2 (2014-10-31)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] BSgenome.Hsapiens.NCBI.GRCh38_1.3.999
[2] BSgenome_1.34.0
[3] rtracklayer_1.26.2
[4] Biostrings_2.34.0
[5] XVector_0.6.0
[6] GenomicRanges_1.18.3
[7] GenomeInfoDb_1.2.3
[8] IRanges_2.0.0
[9] S4Vectors_0.4.0
[10] BiocGenerics_0.12.1
loaded via a namespace (and not attached):
[1] base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.8
[4] BiocParallel_1.0.0 bitops_1.0-6 brew_1.0-6
[7] checkmate_1.5.0 codetools_0.2-9 DBI_0.3.1
[10] digest_0.6.4 fail_1.2 foreach_1.4.2
[13] GenomicAlignments_1.2.1 iterators_1.0.7 RCurl_1.95-4.4
[16] Rsamtools_1.18.2 RSQLite_1.0.0 sendmailR_1.2-1
[19] stringr_0.6.2 tools_3.1.2 XML_3.98-1.1
[22] zlibbioc_1.12.0
It's probably something with my machine or settings but any help you can give would be great,
Thanks again!
Ryan
Thank you Martin I will give the developer version a try. I figured there might be a way with the new GRCh38 packages but I was somewhat hesitant as I attempted to retrieve a few sequences from BSgenome.Hsapiens.NCBI.GRCh38 and found myself having similar problems as in this post: https://stat.ethz.ch/pipermail/bioc-devel/2014-April/005595.html (giving me different sequences every time I reran the code) using the getSeq() function. Anyway I'll give it a try, Thank you!
It would be great to hear more about the 'same query different sequence' results if they persist; considerable effort went into trying to reproduce and correct this, but in the end the original poster stopped replying. The solution (using a different internal representation for the BSgenome objects) should have been a robust solution to our best guess as to what the problem was. It's important to make any problem reports as explicit and simple as possible, using a current R, ensuring BiocInstaller::biocValid() is happy, making sure no unnecessary packages are loaded, and providing a short code chunk that someone not on your computer can evaluate. If you do have problems with inconsistent sequence results, please do let us know!