Entering edit mode
Stefanie
▴
360
@stefanie-5192
Last seen 10.3 years ago
Dear list,
I have some human data, mapped with bowtie to the transcriptome and
then
further processed with eXpress.
The output of eXpress is a sorted bam file, containing the alignments
of
the reads to different transcripts.
So, the bam file contains about 48 x 10^6 alignments, distributed over
about 50000 transcripts.
I read in the bam file with 'readGappedAlignments'.
> reads
GappedAlignments with 46765032 alignments and 0 metadata columns:
seqnames strand cigar qwidth
<rle> <rle> <character> <integer>
[1] hg19_ensGene_ENST00000237247 + 75M 75
[2] hg19_ensGene_ENST00000371036 - 75M 75
[3] hg19_ensGene_ENST00000371037 - 75M 75
[4] hg19_ensGene_ENST00000471889 - 75M 75
[5] hg19_ensGene_ENST00000377479 + 75M 75
... ... ... ... ...
[46765028] hg19_ensGene_ENST00000344867 - 75M 75
[46765029] hg19_ensGene_ENST00000400848 - 75M 75
[46765030] hg19_ensGene_ENST00000400848 - 75M 75
[46765031] hg19_ensGene_ENST00000400848 - 75M 75
[46765032] hg19_ensGene_ENST00000400829 - 75M 75
start end width ngap
<integer> <integer> <integer> <integer>
[1] 1783 1857 75 0
[2] 690 764 75 0
[3] 1632 1706 75 0
[4] 2121 2195 75 0
[5] 391 465 75 0
... ... ... ... ...
[46765028] 280 354 75 0
[46765029] 509 583 75 0
[46765030] 509 583 75 0
[46765031] 509 583 75 0
[46765032] 179 253 75 0
---
seqlengths:
hg19_ensGene_ENST00000237247 ... hg19_ensGene_ENST00000400829
2580 ... 999
Now, for whatever reason, I would like to have the coverage vector for
each transcript.
So, basically I would just compute
coverage(reads).
This works very well for smaller genomes and experiments (e.g. yeast
with
10 x 10^6 alignments).
But, using 'coverage' on this large ReadGappedAlignments Object takes
really very very long.
Is there a way to do this better (in R)? Or is this just not feasible
to
do in R and I should rather preprocess the data outside R?
I would appreciate any comments,
best,
Stefanie