Hi again Martin,
I tried to reproduce the sequence issue I was talking about. Curiously the problem seemed to show up with the BSgenome.Hsapiens.UCSC.hg19_1.3.99, although I've never really seen it for this one before:
> library("BSgenome.Hsapiens.UCSC.hg19")
> genome2<-BSgenome.Hsapiens.UCSC.hg19
> #start of SLC6A4 gene in UCSC coordinates
> start2<-rep(28521337,10)
> #seq strings for UCSC genome
> seqs2<-c()
> for(i in 1:length(start2)){
+ seqs2[i]<- as.character(getSeq(genome2,
+ names= "chr17", start= start2[i], width=50))
+ }
>
> print(DNAStringSet(seqs2))
A DNAStringSet instance of length 10
width seq
[1] 50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
[2] 50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
[3] 50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
[4] 50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
[5] 50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
[6] 50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
[7] 50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
[8] 50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
[9] 50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
[10] 50 TTGCAGTGAGCTGAGATTGCACCATTGCACCCCAGCCTGGTTGACAAGAG
Normally I wouldn't use a for loop but I wanted to demonstrate 10 iterations of the same thing. This looks consistent. Now with the start of the CYP2D6 gene:
> library("BSgenome.Hsapiens.UCSC.hg19")
> genome2<-BSgenome.Hsapiens.UCSC.hg19
> #location of CYP2D6 start site UCSC (GRCh37/hg19)
> start4<-rep(42522501,10)
>
> #seq strings for UCSC genome
> seqs4<-c()
> for(i in 1:length(start4)){
+ seqs4[i]<- as.character(getSeq(genome2, names= "chr22", start= start4[i],
+ width=50, strand= "+"))
+ }
>
> print(DNAStringSet(seqs4))
A DNAStringSet instance of length 10
width seq
[1] 50 AATCATGTGTTTAAGTCGATTAGAACTTTTTTCTGTGTCCTGCCTGGCAT
[2] 50 TCCTGCTAGCATAATATCATCAAAATAATAAAGCCAAATGTGATATTTCA
[3] 50 ATAATCACAAAAATATCAAGATTTTGCTTGTCTATTATGACATTAGGCAG
[4] 50 AAATCAGAAAAAAAATCCTCATGTAAGATAGTAAAAATGTACTTTTGCCC
[5] 50 TGCTATAAAAAATAAAATAGCTTCTAATTATGTTTACTAATAGAAATTAA
[6] 50 GAGAAAAATGCTCTATCAAAAGCTGCAGAGCAAGATAGCTGAATAGAACT
[7] 50 TCCAGGAATTTCCCCCCACGCACACAAAGGAACACCAAATTGAACAACTA
[8] 50 CCACACAAAAATACTTTCATAGAACCAAAAATTAGGTTGAGCAATCACAG
[9] 50 AGCTGGTTTTAATCTTATATCAAAGAAAGAGGCATCGAAGAAGGTAGAGA
[10] 50 GGCAGTCTTAAATTGCCTATGCCACCCCTCCTCCATCCACTAGCTGCAGC
This time the sequence received is not consistent, but I'm unsure what's going on. Here's the session info for that:
R version 3.1.2 (2014-10-31)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] BSgenome.Hsapiens.UCSC.hg19_1.3.99 BSgenome_1.34.0
[3] rtracklayer_1.26.2 Biostrings_2.34.0
[5] XVector_0.6.0 GenomicRanges_1.18.3
[7] GenomeInfoDb_1.2.3 IRanges_2.0.0
[9] S4Vectors_0.4.0 BiocGenerics_0.12.1
loaded via a namespace (and not attached):
[1] base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.8
[4] BiocParallel_1.0.0 bitops_1.0-6 brew_1.0-6
[7] checkmate_1.5.0 codetools_0.2-9 DBI_0.3.1
[10] digest_0.6.4 fail_1.2 foreach_1.4.2
[13] GenomicAlignments_1.2.1 iterators_1.0.7 RCurl_1.95-4.4
[16] Rsamtools_1.18.2 RSQLite_1.0.0 sendmailR_1.2-1
[19] stringr_0.6.2 tools_3.1.2 XML_3.98-1.1
[22] zlibbioc_1.12.0
I then retried it using the BSgenome.Hsapiens.NCBI.GRCh38 I got this:
> genome<-BSgenome.Hsapiens.NCBI.GRCh38
> #start of SLC6A4 gene
> start1<-rep(30194319,10)
>
> #seq strings for NCBI build
> seqs1<-c()
> for(i in 1:length(start1)){
+ seqs1[i]<- as.character(getSeq(genome,"17", start=start1[i],
+ width=50, strand= "+"))
+ }
Error in value[[3L]](cond) :
record 1 (17:30194319-30194368) contains invalid DNA letters
file: C:/Program Files/R/R-3.1.2/library/BSgenome.Hsapiens.NCBI.GRCh38/extdata/single_sequences.fa.rz
>
> print(DNAStringSet(seqs1))
A DNAStringSet instance of length 0
If you re-run it after this error you get a string of NNN's for all iterations (which I've also seen with the hg19 genome too, but not always). The same thing happened when running the CYP2D6 start site (different chromosome). Currently this computer is not operating with the developmental version and I have both BSgenomes loaded; not sure if that has anything to do with it. Here's the sessionInfo for the last block:
R version 3.1.2 (2014-10-31)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] BSgenome.Hsapiens.NCBI.GRCh38_1.3.999
[2] BSgenome_1.34.0
[3] rtracklayer_1.26.2
[4] Biostrings_2.34.0
[5] XVector_0.6.0
[6] GenomicRanges_1.18.3
[7] GenomeInfoDb_1.2.3
[8] IRanges_2.0.0
[9] S4Vectors_0.4.0
[10] BiocGenerics_0.12.1
loaded via a namespace (and not attached):
[1] base64enc_0.1-2 BatchJobs_1.5 BBmisc_1.8
[4] BiocParallel_1.0.0 bitops_1.0-6 brew_1.0-6
[7] checkmate_1.5.0 codetools_0.2-9 DBI_0.3.1
[10] digest_0.6.4 fail_1.2 foreach_1.4.2
[13] GenomicAlignments_1.2.1 iterators_1.0.7 RCurl_1.95-4.4
[16] Rsamtools_1.18.2 RSQLite_1.0.0 sendmailR_1.2-1
[19] stringr_0.6.2 tools_3.1.2 XML_3.98-1.1
[22] zlibbioc_1.12.0
It's probably something with my machine or settings but any help you can give would be great,
Thanks again!
Ryan