Question: Problems with summarizeOverlaps
gravatar for Walter F. Baumann
17 months ago by
Walter F. Baumann 10 wrote:


I want to create a table with counts per region (column), per transcript or gene (row).

First thing I did was to define the regions:

txdb <- makeTxDbFromGFF("/pathto/gencode.v19.annotation.gtf", dataSource="Gencode", organism="Homo sapiens", format="gtf")
txdb_5utr_transcript <- fiveUTRsByTranscript(txdb, use.names=T)
txdb_exons_transcripts <- exonsBy(txdb, by="tx",
txdb_cds_transcripts <- cdsBy(txdb, by = "tx",

Then I imported my RNA-Seq experiment (paired end RNA-Seq, aligned with STAR):

myrnaseq1 <- readGAlignments(file= "/pathto/PolyA_1.sortedByCoord.out.bam",
                                  index= "/pathto/PolyA_1.sortedByCoord.out.bam.bai")


First problem: There seem to be different regions per exon annotated(??). How can the same exon number have different regions?


chrX [99891605, 99891803]      - |    667159 ENSE00001855382.1


chrX [99891692, 99891803]      - |    667159 ENSE00001855382.1

Second problem:

readrna1ex_tx <- summarizeOverlaps(features= txdb_exons_transcripts, reads = myrnaseq1, mode=Union, singleEnd=F, inter.feature=FALSE) # uses same counting algorithm as in HTSeq; double counted in case of overlapping
Error: cannot allocate vector of size 1.3 Gb

Here my SessionInfo:

R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BiocInstaller_1.24.0       GenomicAlignments_1.10.1   SummarizedExperiment_1.4.0 Rsamtools_1.26.2           Biostrings_2.42.1         
 [6] XVector_0.14.1             GenomicFeatures_1.26.4     AnnotationDbi_1.36.2       Biobase_2.34.0             GenomicRanges_1.26.4      
[11] GenomeInfoDb_1.10.3        IRanges_2.8.2              S4Vectors_0.12.2           BiocGenerics_0.20.0      


ADD COMMENTlink modified 17 months ago by Martin Morgan ♦♦ 22k • written 17 months ago by Walter F. Baumann 10
gravatar for Martin Morgan
17 months ago by
Martin Morgan ♦♦ 22k
United States
Martin Morgan ♦♦ 22k wrote:

For processing whole files use the BAM file (via Rsamtools::BamFile() or BamFileList()) as the second argument. This iterates through the BAM file in chunks rather than trying  to put it all in memory.

ADD COMMENTlink written 17 months ago by Martin Morgan ♦♦ 22k

Thanks. I tried to apply the given examples of the Rsamtools and the GenomicAlignments manual on my case, but so far I was not successful. (This is a single-end RNA-Seq experiment). 

myalignment <- BamFile(file=„/pathto/Aligned.bam",


                      yieldSize = 100000, asMates=F)


scanBam(myalignment) then provides me all reads. However, I seem to misunderstand something essential, because the output of scanBam (a list) is not interpretable by readGAlignments, or by summarizeOverlaps. 


When I execute this: 


readrna1ex_tx <- summarizeOverlaps(features= txdb_exons_transcripts, 
reads = myalignment, mode=Union, singleEnd=T, inter.feature=FALSE)


I get the Error

Error in registered()[[bpparanClass]]:

attempt to select less than one element in get1index
ADD REPLYlink modified 16 months ago • written 16 months ago by Walter F. Baumann 10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 336 users visited in the last hour