Getting "negative length vectors are not allowed" error in BSgenome::getSeq
1
1
Entering edit mode
rubi ▴ 90
@rubi-6462
Last seen 3.1 years ago

Hi,

 

I'm trying to retrieve the sequences of a certain transcript represented as a GenomicRanges object.

 

Here's the data.frame of that transcript:

> exon.df
        seqnames   start     end strand
NW_004624885.1 3278632 3279702      +
NW_004624885.1 3280386 3280638      +
NW_004624885.1 3280818 3280887      +
NW_004624885.1 3280993 3281162      +
NW_004624885.1 3281232 3281476      +
NW_004624885.1 3282194 3282260      +
NW_004624885.1 3282342 3282400      +

 

The genome fasta file was read into a DNAStringSet object using Biostrings::readDNAStringSet:

 

> class(genome.fa)
[1] "DNAStringSet"
attr(,"package")
[1] "Biostrings"
> head(genome.fa)
  A DNAStringSet instance of length 6
       width seq                                                                                                                                                                                       names              
[1] 78885850 AAGCCCTAAAAGAAAATAACTTGCAACCTAGACTGTTATATCCAGCAAAATTATCTTTCAAAATTGATGGGAAAATTAGATACTTCCATGA...CCAGCTGGCATCTCCAAACCTTTCCAAACTGCTGTGGACCACTTCCACAGGCCTCTGAACCACCTGGTGAGGTACAGTCGCTGCAGTGACC NW_004624730.1
[2] 46765855 AGGAATGGGGAAAAAATAAGAACCAAGATGCATTATGTAGAGGTACAATTCCCAATGAGAAATGTAATTGTGTTGTTTACCTAAAATGTAC...ATATGAATTTAATAGGGATGGCCGTTATCCTGTAGAATTGGCATTCTTGTAAGTGATGAAAACACACACACACACACACACACACACACAC NW_004624731.1
[3] 47073470 TAATGAACAATGAAAAGATTTTAAAAATTGACTAAAAATAAAAGTTCCATATCAGTGAGTTACAGGTACATTACTTTAATAAAAATGAATC...TTTTGTTGAGCTTTTTAGATGTATAAATGAATGTTTTTCATTAAATTTAGGGAATTTTCTAGTTCTAATAACTGTTAAGCCCCTCTAATGA NW_004624732.1
[4] 43462546 TAAAGTGCAGTTTCTTCTCAGAGAGAGCAGTTAGATCGCTCTCACTGCAGCAGTACTCTGCACTCTGTGAATGTCATGGAACAATATTCAC...GACCAAGTCACTCCCCTCGAGGTCCTGCCCTTGACACCTGCAGGGCACGTGAGTGACACCTCTCGGGGGCAGGTCCAACCTTGGCTCTTGG NW_004624733.1
[5] 42019962 TAACATTATGGTCTCAAGGTGCATCTATTTTCCTACAAATTTCCTACCAAGTTTCCTTCAAGATAATCCTTCTATATGACATAGTAATAAT...AGGTTATTATTTTTATTTCAAAGATGATTAGTGATAGCATTTATTCATAGAAATGCTGTTGGCCATTTCTTTTTTTAAAAATTAATTTTAT NW_004624734.1
[6] 42145720 GGGCCCCTTGTCTATTCCTATGGACTGCCTCGGCCTTCCTGACCCCACCAAGGCGCATGTGTGCGCTGCGTTGGCCAATCCAGGCTCACCC...TGGTTATTGGGATTGGGTCCTATACCTTAGGGTTAGGGTTAGGGTTAGGTGTAGGTTTATGTTTAGGGTTCCAATTAATACACCTGACATT NW_004624735.1

 

Now, trying to run getSeq:

BSgenome::getSeq(fa,as(exon.df,"GRanges"))

I get this error:

Error in .Call(.NAME, ..., PACKAGE = PACKAGE) : 
  negative length vectors are not allowed

 

Is it because the genome fasta file was not read all the way through?

width(genome.fa)[which(names(genome.fa) == "NW_004624885.1")]

returns:

3740447

 

But then even trying to get the sequence of the last exon:

subseq(query.genome.fa, start=3282342, end=3282400)

 

returns an error:

Error in .Call2("solve_user_SEW", refwidths, start, end, width, translate.negative.coord,  : 
  solving row 167: 'allow.nonnarrowing' is FALSE and the supplied start (3282342) is > refwidth + 1

 

 

 

 

> sessionInfo()

R version 3.3.2 (2016-10-31)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BSgenome_1.42.0      biomaRt_2.30.0       gplots_3.0.1         reshape2_1.4.2       plotrix_3.6-3        Hmisc_3.17-4         Formula_1.2-1        survival_2.40-1      lattice_0.20-34      annotationData_0.1.0
[11] plyr_1.8.4           magrittr_1.5         gtable_0.2.0         gridExtra_2.2.1      plotly_4.7.0         ggplot2_2.2.1.9000   eulerr_2.0.0         dplyr_0.5.0          rtracklayer_1.34.1   data.table_1.9.6    
[21] doParallel_1.0.10    iterators_1.0.8      foreach_1.4.3        GenomicRanges_1.26.2 GenomeInfoDb_1.10.0  snpEnrichment_1.7.0  stringi_1.1.5        Biostrings_2.42.1    XVector_0.14.0       IRanges_2.8.1       
[31] S4Vectors_0.12.1     BiocGenerics_0.20.0 

loaded via a namespace (and not attached):
 [1] Biobase_2.34.0             httr_1.2.1                 tidyr_0.6.3                jsonlite_1.4               viridisLite_0.2.0          splines_3.3.2              gtools_3.5.0              
 [8] assertthat_0.2.0           latticeExtra_0.6-28        Rsamtools_1.26.1           RSQLite_1.0.0              chron_2.3-47               digest_0.6.12              RColorBrewer_1.1-2        
[15] colorspace_1.3-2           htmltools_0.3.6            Matrix_1.2-7.1             XML_3.98-1.4               zlibbioc_1.20.0            purrr_0.2.2.2              scales_0.4.1              
[22] gdata_2.17.0               BiocParallel_1.8.1         tibble_1.3.3               SummarizedExperiment_1.2.3 nnet_7.3-12                lazyeval_0.2.0             foreign_0.8-67            
[29] tools_3.3.2                stringr_1.2.0              munsell_0.4.3              cluster_2.0.5              AnnotationDbi_1.36.0       snpStats_1.24.0            caTools_1.17.1            
[36] rlang_0.1.1                RCurl_1.95-4.8             htmlwidgets_0.8            bitops_1.0-6               codetools_0.2-15           DBI_0.5-1                  R6_2.2.0                  
[43] GenomicAlignments_1.8.4    KernSmooth_2.23-15         Rcpp_0.12.11.1             rpart_4.1-10               acepack_1.4.1             

 

 

BSgenome getSeq • 1.1k views
ADD COMMENT
3
Entering edit mode
@martin-morgan-1513
Last seen 9 weeks ago
United States

This looks like an integer overflow, occurring because the total number of nucleotides in the fasta file is larger than 2^31 - 1. It's possible to index the fasta file and read in only portions of it; see ?readDNAStringSet and the paragraph 'A subset of this data frame...'.

Another strategy is to convert the fasta file to 2bit, which allows very fast random access

library(Biostrings)
library(rtracklayer)

dna = Biostrings::readDNAStringSet("foo.fasta")
twobit = TwoBitFile("foo.2bit")
rtracklayer::export(dna, twobit)
getSeq(twobit, gr)
ADD COMMENT

Login before adding your answer.

Traffic: 189 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6