Entering edit mode
Walter F. Baumann
▴
10
@walter-f-baumann-12439
Last seen 7.1 years ago
Hi,
I want to create a table with counts per region (column), per transcript or gene (row).
First thing I did was to define the regions:
txdb <- makeTxDbFromGFF("/pathto/gencode.v19.annotation.gtf", dataSource="Gencode", organism="Homo sapiens", format="gtf")
txdb_5utr_transcript <- fiveUTRsByTranscript(txdb, use.names=T) txdb_exons_transcripts <- exonsBy(txdb, by="tx", use.name=T) txdb_cds_transcripts <- cdsBy(txdb, by = "tx", use.name=T)
Then I imported my RNA-Seq experiment (paired end RNA-Seq, aligned with STAR):
myrnaseq1 <- readGAlignments(file= "/pathto/PolyA_1.sortedByCoord.out.bam", index= "/pathto/PolyA_1.sortedByCoord.out.bam.bai")
First problem: There seem to be different regions per exon annotated(??). How can the same exon number have different regions?
txdb_5utr_transcript$ENST00000373020.4 chrX [99891605, 99891803] - | 667159 ENSE00001855382.1 txdb_exons_transcripts$ENST00000373020.4 chrX [99891692, 99891803] - | 667159 ENSE00001855382.1
Second problem:
readrna1ex_tx <- summarizeOverlaps(features= txdb_exons_transcripts, reads = myrnaseq1, mode=Union, singleEnd=F, inter.feature=FALSE) # uses same counting algorithm as in HTSeq; double counted in case of overlapping
Error: cannot allocate vector of size 1.3 Gb
Here my SessionInfo:
R version 3.3.3 (2017-03-06) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.2 LTS locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] BiocInstaller_1.24.0 GenomicAlignments_1.10.1 SummarizedExperiment_1.4.0 Rsamtools_1.26.2 Biostrings_2.42.1 [6] XVector_0.14.1 GenomicFeatures_1.26.4 AnnotationDbi_1.36.2 Biobase_2.34.0 GenomicRanges_1.26.4 [11] GenomeInfoDb_1.10.3 IRanges_2.8.2 S4Vectors_0.12.2 BiocGenerics_0.20.0
Thanks. I tried to apply the given examples of the Rsamtools and the GenomicAlignments manual on my case, but so far I was not successful. (This is a single-end RNA-Seq experiment).
scanBam(myalignment) then provides me all reads. However, I seem to misunderstand something essential, because the output of scanBam (a list) is not interpretable by readGAlignments, or by summarizeOverlaps.
When I execute this:
I get the Error