Question: Problems with summarizeOverlaps
gravatar for Walter F. Baumann
8 months ago by
Walter F. Baumann 10 wrote:


I want to create a table with counts per region (column), per transcript or gene (row).

First thing I did was to define the regions:

txdb <- makeTxDbFromGFF("/pathto/gencode.v19.annotation.gtf", dataSource="Gencode", organism="Homo sapiens", format="gtf")
txdb_5utr_transcript <- fiveUTRsByTranscript(txdb, use.names=T)
txdb_exons_transcripts <- exonsBy(txdb, by="tx",
txdb_cds_transcripts <- cdsBy(txdb, by = "tx",

Then I imported my RNA-Seq experiment (paired end RNA-Seq, aligned with STAR):

myrnaseq1 <- readGAlignments(file= "/pathto/PolyA_1.sortedByCoord.out.bam",
                                  index= "/pathto/PolyA_1.sortedByCoord.out.bam.bai")


First problem: There seem to be different regions per exon annotated(??). How can the same exon number have different regions?


chrX [99891605, 99891803]      - |    667159 ENSE00001855382.1


chrX [99891692, 99891803]      - |    667159 ENSE00001855382.1

Second problem:

readrna1ex_tx <- summarizeOverlaps(features= txdb_exons_transcripts, reads = myrnaseq1, mode=Union, singleEnd=F, inter.feature=FALSE) # uses same counting algorithm as in HTSeq; double counted in case of overlapping
Error: cannot allocate vector of size 1.3 Gb

Here my SessionInfo:

R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BiocInstaller_1.24.0       GenomicAlignments_1.10.1   SummarizedExperiment_1.4.0 Rsamtools_1.26.2           Biostrings_2.42.1         
 [6] XVector_0.14.1             GenomicFeatures_1.26.4     AnnotationDbi_1.36.2       Biobase_2.34.0             GenomicRanges_1.26.4      
[11] GenomeInfoDb_1.10.3        IRanges_2.8.2              S4Vectors_0.12.2           BiocGenerics_0.20.0      


ADD COMMENTlink modified 8 months ago by Martin Morgan ♦♦ 21k • written 8 months ago by Walter F. Baumann 10
gravatar for Martin Morgan
8 months ago by
Martin Morgan ♦♦ 21k
United States
Martin Morgan ♦♦ 21k wrote:

For processing whole files use the BAM file (via Rsamtools::BamFile() or BamFileList()) as the second argument. This iterates through the BAM file in chunks rather than trying  to put it all in memory.

ADD COMMENTlink written 8 months ago by Martin Morgan ♦♦ 21k

Thanks. I tried to apply the given examples of the Rsamtools and the GenomicAlignments manual on my case, but so far I was not successful. (This is a single-end RNA-Seq experiment). 

myalignment <- BamFile(file=„/pathto/Aligned.bam",


                      yieldSize = 100000, asMates=F)


scanBam(myalignment) then provides me all reads. However, I seem to misunderstand something essential, because the output of scanBam (a list) is not interpretable by readGAlignments, or by summarizeOverlaps. 


When I execute this: 


readrna1ex_tx <- summarizeOverlaps(features= txdb_exons_transcripts, 
reads = myalignment, mode=Union, singleEnd=T, inter.feature=FALSE)


I get the Error

Error in registered()[[bpparanClass]]:

attempt to select less than one element in get1index
ADD REPLYlink modified 7 months ago • written 7 months ago by Walter F. Baumann 10
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 285 users visited in the last hour