In GenomicRanges, how to narrow down search of duplicated names in case of "'seqlevels' cannot contain duplicated sequence names" error?
c.legrand
7.0 years ago


I'm trying to use summarizeOverlaps() to count reads from a BAM file, with annotation from a GFF3 file, following the standard procedure:


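library(GenomicFeatures)    # makeTxDbFromGFF(), exonsBy()
library(GenomicAlignments)  # summarizeOverlaps(); also attaches Rsamtools for BamFile()
txdb <- makeTxDbFromGFF(refgff3, format = "gff3", circ_seqs = character())  # or format = "gtf"
ebg <- exonsBy(txdb, by = "gene")
bamfile <- BamFile(readsBAM)
se <- summarizeOverlaps(features = ebg, reads = bamfile, mode = "Union",
                        singleEnd = FALSE, ignore.strand = TRUE, fragments = TRUE)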

The last step fails with the following error:

Error in .normargSeqlevels(seqnames) :
  supplied 'seqlevels' cannot contain duplicated sequence names

It seems clear that there must be duplicates either in the BAM file or in the exonsBy object. However, every check I could think of, such as duplicated(ebg) or which(duplicated(names(ebg))), reported no duplicates.

Hence my question: how can I narrow down which object contains the duplicates?

  • In particular, which object is examined just before the error occurs?
  • Is there another way to use the 'duplicated' command that would work?

Thanks a lot in advance for your answers.

PS: for completeness, here is the output of sessionInfo():

R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=de_DE.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rPython_0.0-6              RJSONIO_1.3-0              GenomicAlignments_1.8.4    Rsamtools_1.24.0           Biostrings_2.40.2         
 [6] XVector_0.12.1             SummarizedExperiment_1.2.3 GenomicFeatures_1.24.5     AnnotationDbi_1.34.4       Biobase_2.32.0            
[11] GenomicRanges_1.24.3       GenomeInfoDb_1.8.7         IRanges_2.6.1              S4Vectors_0.10.3           BiocGenerics_0.18.0       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.15       zlibbioc_1.18.0    BiocParallel_1.6.6 bit_1.1-12         rlang_0.2.0        blob_1.1.0         tools_3.3.3       
 [8] DBI_0.7            bit64_0.9-7        digest_0.6.15      tibble_1.4.2       rtracklayer_1.32.2 bitops_1.0-6       biomaRt_2.28.0    
[15] RCurl_1.95-4.10    memoise_1.1.0      RSQLite_2.0        pillar_1.1.0       XML_3.98-1.10      pkgconfig_2.0.1   




You need to look for duplicates in the seqlevels of your objects, so look at seqlevels(ebg) and seqlevels(bamfile). However, it's unlikely that you'll see duplicates there either. It looks to me like you've encountered a bug, and we would need to be able to reproduce it in order to help. Could you please provide a reproducible example or make your GFF3 and BAM files available?
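
As a minimal sketch of that check: the helper dup_values() below is a hypothetical name introduced only for illustration (it is not part of GenomicRanges or any other package), and the toy sequence names are made up. It simply returns the values that occur more than once in a character vector, which is all that is needed to inspect seqlevels().

```r
# Hypothetical helper (not from any package): return the values that
# occur more than once in a character vector.
dup_values <- function(x) unique(x[duplicated(x)])

# Toy illustration with made-up sequence names.
toy_levels <- c("chr1", "chr2", "chr1", "chrM")
print(dup_values(toy_levels))

# Applied to the objects from the question. This part only runs when
# 'ebg' and 'bamfile' already exist in the session.
if (exists("ebg") && exists("bamfile")) {
  print(dup_values(seqlevels(ebg)))      # annotation side
  print(dup_values(seqlevels(bamfile)))  # BAM header side
}
```

An empty result for both objects would support the suspicion that the duplication is introduced internally rather than present in the inputs. Calling traceback() immediately after the error prints the call stack, which shows which function called .normargSeqlevels().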

Also, you're using BioC 3.3, which is old and unsupported, so I strongly recommend that you upgrade to the current release (BioC 3.6, requires R 3.4). The bug may already be gone in that version.




