In GenomicRanges, how to narrow down search of duplicated names in case of "'seqlevels' cannot contain duplicated sequence names" error?
1
0
Entering edit mode
c.legrand • 0
@clegrand-15103

Hello,

I'm trying to use summarizeOverlaps() to count reads from a BAM file, with annotation taken from a GFF3 file, following the standard procedure:


library(GenomicFeatures)
library(GenomicAlignments)

txdb <- makeTxDbFromGFF(refgff3, format = "gff3", circ_seqs = character())  # or format = "gtf"
ebg <- exonsBy(txdb, by = "gene")
bamfile <- BamFile(readsBAM)
se <- summarizeOverlaps(features = ebg, reads = bamfile, mode = "Union",
                        singleEnd = FALSE, ignore.strand = TRUE, fragments = TRUE)

The last step returns the following error:

Error in .normargSeqlevels(seqnames) :
  supplied 'seqlevels' cannot contain duplicated sequence names

It seems clear that there must be duplicates either in the BAM file or in the exonsBy object. However, every combination I tried, such as duplicated(ebg) or which(duplicated(names(ebg))), returned no duplicates.

Hence my question: how can I narrow down the search to find which object contains the duplicates?

  • In particular, which object is being examined just before the error occurs?
  • Is there another way to use the 'duplicated' command that would work?

Thanks a lot in advance for your answers.


PS: for completeness, here is the output of sessionInfo():

R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=de_DE.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rPython_0.0-6              RJSONIO_1.3-0              GenomicAlignments_1.8.4    Rsamtools_1.24.0           Biostrings_2.40.2         
 [6] XVector_0.12.1             SummarizedExperiment_1.2.3 GenomicFeatures_1.24.5     AnnotationDbi_1.34.4       Biobase_2.32.0            
[11] GenomicRanges_1.24.3       GenomeInfoDb_1.8.7         IRanges_2.6.1              S4Vectors_0.10.3           BiocGenerics_0.18.0       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.15       zlibbioc_1.18.0    BiocParallel_1.6.6 bit_1.1-12         rlang_0.2.0        blob_1.1.0         tools_3.3.3       
 [8] DBI_0.7            bit64_0.9-7        digest_0.6.15      tibble_1.4.2       rtracklayer_1.32.2 bitops_1.0-6       biomaRt_2.28.0    
[15] RCurl_1.95-4.10    memoise_1.1.0      RSQLite_2.0        pillar_1.1.0       XML_3.98-1.10      pkgconfig_2.0.1   

 

software error • 844 views
@herve-pages-1542
Seattle, WA, United States

Hi,

You need to look for duplicates in the seqlevels of your objects, so look at seqlevels(ebg) and seqlevels(bamfile). However, it's unlikely that you'll see duplicates there either. It looks to me like you've encountered a bug, and we would need to be able to reproduce it in order to help. Could you please provide a reproducible example or make your GFF3 and BAM files available?
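
As a rough sketch of that check (not a confirmed fix), something like the following could be run. The helpers dup_values() and check_seqlevels() are hypothetical names introduced here for illustration, and the toy vector is made up. The last part assumes the ebg and bamfile objects from the question already exist in the session. If seqlevels() itself fails on one of them, the error message is returned instead, which already tells you which object triggers the problem.

```r
# Hypothetical helper: return the values that occur more than once
# in a character vector.
dup_values <- function(x) unique(x[duplicated(x)])

# Hypothetical helper: call a function that returns seqlevels and report
# either the duplicated names or the error message it raised.
check_seqlevels <- function(get_levels) {
  tryCatch(dup_values(get_levels()),
           error = function(e) paste("error:", conditionMessage(e)))
}

# Toy input with made-up sequence names, just to show the behaviour.
toy <- c("chr1", "chr2", "chr1", "chrM")
print(dup_values(toy))  # [1] "chr1"

# With the objects from the question (assumed to exist in the session,
# with GenomicFeatures and GenomicAlignments loaded):
if (exists("ebg") && exists("bamfile")) {
  print(check_seqlevels(function() seqlevels(ebg)))
  print(check_seqlevels(function() seqlevels(bamfile)))
}
```

Calling traceback() right after the failing summarizeOverlaps() call prints the call stack, which can also help show which object was being processed when the error was raised.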

Also, you're using BioC 3.3, which is old and unsupported, so I strongly recommend that you upgrade to the current release (BioC 3.6, requires R 3.4). It could be that the bug is gone in this version.

Cheers,

H.
