Question

Problems with summarizeOverlaps

0

Entering edit mode

Walter F. Baumann ▴ 10

@walter-f-baumann-12439

Last seen 6.4 years ago

Hi,

I want to create a table with counts per region (column), per transcript or gene (row).

First thing I did was to define the regions:

txdb <- makeTxDbFromGFF("/pathto/gencode.v19.annotation.gtf", dataSource="Gencode", organism="Homo sapiens", format="gtf")

txdb_5utr_transcript <- fiveUTRsByTranscript(txdb, use.names=T)
txdb_exons_transcripts <- exonsBy(txdb, by="tx", use.name=T)
txdb_cds_transcripts <- cdsBy(txdb, by = "tx", use.name=T)

Then I imported my RNA-Seq experiment (paired end RNA-Seq, aligned with STAR):

myrnaseq1 <- readGAlignments(file= "/pathto/PolyA_1.sortedByCoord.out.bam",
                                  index= "/pathto/PolyA_1.sortedByCoord.out.bam.bai")

First problem: There seem to be different regions per exon annotated(??). How can the same exon number have different regions?

txdb_5utr_transcript$ENST00000373020.4

chrX [99891605, 99891803]      - |    667159 ENSE00001855382.1

txdb_exons_transcripts$ENST00000373020.4

chrX [99891692, 99891803]      - |    667159 ENSE00001855382.1

Second problem:

readrna1ex_tx <- summarizeOverlaps(features= txdb_exons_transcripts, reads = myrnaseq1, mode=Union, singleEnd=F, inter.feature=FALSE) # uses same counting algorithm as in HTSeq; double counted in case of overlapping
Error: cannot allocate vector of size 1.3 Gb

Here my SessionInfo:

R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BiocInstaller_1.24.0       GenomicAlignments_1.10.1   SummarizedExperiment_1.4.0 Rsamtools_1.26.2           Biostrings_2.42.1         
 [6] XVector_0.14.1             GenomicFeatures_1.26.4     AnnotationDbi_1.36.2       Biobase_2.34.0             GenomicRanges_1.26.4      
[11] GenomeInfoDb_1.10.3        IRanges_2.8.2              S4Vectors_0.12.2           BiocGenerics_0.20.0

bioconductor rnaseq summarizeoverlaps • 1.1k views

ADD COMMENT • link updated 7.0 years ago by Martin Morgan 25k • written 7.0 years ago by Walter F. Baumann ▴ 10

score 1 · Answer 1 · 2017-05-11

1

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 5 days ago

United States

For processing whole files use the BAM file (via Rsamtools::BamFile() or BamFileList()) as the second argument. This iterates through the BAM file in chunks rather than trying to put it all in memory.

ADD COMMENT • link 7.0 years ago Martin Morgan 25k

0

Entering edit mode

Thanks. I tried to apply the given examples of the Rsamtools and the GenomicAlignments manual on my case, but so far I was not successful. (This is a single-end RNA-Seq experiment).

myalignment <- BamFile(file=„/pathto/Aligned.bam",

                      index="/pathto/Aligned.bam.bai",

                      yieldSize = 100000, asMates=F)

scanBam(myalignment) then provides me all reads. However, I seem to misunderstand something essential, because the output of scanBam (a list) is not interpretable by readGAlignments, or by summarizeOverlaps.

When I execute this:

readrna1ex_tx <- summarizeOverlaps(features= txdb_exons_transcripts, 
reads = myalignment, mode=Union, singleEnd=T, inter.feature=FALSE)

I get the Error

Error in registered()[[bpparanClass]]:

attempt to select less than one element in get1index

ADD REPLY • link 6.9 years ago Walter F. Baumann ▴ 10