Problems with summarizeOverlaps
1
0
Entering edit mode
@walter-f-baumann-12439
Last seen 7.0 years ago

Hi,

I want to create a table with counts per region (column), per transcript or gene (row).

First thing I did was to define the regions:

txdb <- makeTxDbFromGFF("/pathto/gencode.v19.annotation.gtf", dataSource="Gencode", organism="Homo sapiens", format="gtf")
txdb_5utr_transcript <- fiveUTRsByTranscript(txdb, use.names=T)
txdb_exons_transcripts <- exonsBy(txdb, by="tx", use.name=T)
txdb_cds_transcripts <- cdsBy(txdb, by = "tx", use.name=T)

Then I imported my RNA-Seq experiment (paired end RNA-Seq, aligned with STAR):

myrnaseq1 <- readGAlignments(file= "/pathto/PolyA_1.sortedByCoord.out.bam",
                                  index= "/pathto/PolyA_1.sortedByCoord.out.bam.bai")

 

First problem: There seem to be different regions per exon annotated(??). How can the same exon number have different regions?

txdb_5utr_transcript$ENST00000373020.4

chrX [99891605, 99891803]      - |    667159 ENSE00001855382.1

txdb_exons_transcripts$ENST00000373020.4

chrX [99891692, 99891803]      - |    667159 ENSE00001855382.1

Second problem:

readrna1ex_tx <- summarizeOverlaps(features= txdb_exons_transcripts, reads = myrnaseq1, mode=Union, singleEnd=F, inter.feature=FALSE) # uses same counting algorithm as in HTSeq; double counted in case of overlapping
Error: cannot allocate vector of size 1.3 Gb

Here my SessionInfo:

R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BiocInstaller_1.24.0       GenomicAlignments_1.10.1   SummarizedExperiment_1.4.0 Rsamtools_1.26.2           Biostrings_2.42.1         
 [6] XVector_0.14.1             GenomicFeatures_1.26.4     AnnotationDbi_1.36.2       Biobase_2.34.0             GenomicRanges_1.26.4      
[11] GenomeInfoDb_1.10.3        IRanges_2.8.2              S4Vectors_0.12.2           BiocGenerics_0.20.0      

 

bioconductor rnaseq summarizeoverlaps • 1.2k views
ADD COMMENT
1
Entering edit mode
@martin-morgan-1513
Last seen 4 months ago
United States

For processing whole files use the BAM file (via Rsamtools::BamFile() or BamFileList()) as the second argument. This iterates through the BAM file in chunks rather than trying  to put it all in memory.

ADD COMMENT
0
Entering edit mode

Thanks. I tried to apply the given examples of the Rsamtools and the GenomicAlignments manual on my case, but so far I was not successful. (This is a single-end RNA-Seq experiment). 

myalignment <- BamFile(file=„/pathto/Aligned.bam",

                      index="/pathto/Aligned.bam.bai",

                      yieldSize = 100000, asMates=F)

 

scanBam(myalignment) then provides me all reads. However, I seem to misunderstand something essential, because the output of scanBam (a list) is not interpretable by readGAlignments, or by summarizeOverlaps. 

 

When I execute this: 

 

readrna1ex_tx <- summarizeOverlaps(features= txdb_exons_transcripts, 
reads = myalignment, mode=Union, singleEnd=T, inter.feature=FALSE)

 

I get the Error

Error in registered()[[bpparanClass]]:

attempt to select less than one element in get1index
ADD REPLY

Login before adding your answer.

Traffic: 723 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6