I have a paired-end RNA-seq data in bam. It was read into R using readGAlignmentPairs, named gal2. The compatible reads pairs was parsed using findCompatibleOverlaps function. The following codes only specifically examines compatible read pairs for 4th gene in the dm3_transcripts list. There was only one paired-end aligned to this gene. It's clear that the two mates overlap at 94458625,94458626 and 94458627. As a result the coverageByTranscript returns coverage score of 2 in the three overlapped positions (which is wrong, it should be 1 instead of 3). Anyway to correct this error? thanks.
> gal2[unlist(tx2reads[4])]
GAlignmentPairs object with 1 pair, strandMode=1, and 0 metadata columns:
seqnames strand : ranges -- ranges
<Rle> <Rle> : <IRanges> -- <IRanges>
[1] chr1 - : 94458625-94458725 -- 94458527-94458627
-------
seqinfo: 25 sequences from an unspecified genome
> dm3_transcripts[4]
GRangesList object of length 1:
$ABCA4
GRanges object with 50 ranges and 2 metadata columns:
seqnames ranges strand | exon_id exon_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr1 94458394-94458798 - | 18097 <NA>
[2] chr1 94461665-94461751 - | 18098 <NA>
[3] chr1 94463417-94463666 - | 18099 <NA>
[4] chr1 94466392-94466484 - | 18100 <NA>
[5] chr1 94466558-94466661 - | 18101 <NA>
... ... ... ... . ... ...
[46] chr1 94568571-94568698 - | 18142 <NA>
[47] chr1 94574133-94574272 - | 18143 <NA>
[48] chr1 94576994-94577135 - | 18144 <NA>
[49] chr1 94578529-94578622 - | 18145 <NA>
[50] chr1 94586536-94586705 - | 18146 <NA>
-------
> coverageByTranscript(gal2[unlist(tx2reads[4])],
+ dm3_transcripts[4])
RleList of length 1
$ABCA4
integer-Rle of length 7325 with 5 runs
Lengths: 73 98 3 98 7053
Values : 0 1 2 1 0
Hi Hervé,
Thanks for your quick help. I understand and appreciate what you explained. I am just wondering whether your package could implement this as default for the paired-end data. I saw there are some other posts where users raised the same question. In my opinion, the coverage calculation for the paired end should NOT double count in the overlapped region because the two mates are from the same fragments.
Using the reduce function, we can merge the two GAranges from the two mates of one pair. When one gene has many pairs, we will have to do the reduce for each pairs before we can plug into the coverage function, right? Is there a shortcut that we can calculate the coverage for all genes from one chromosome with a few of lines of codes? I tried to use endoapply function to apply reduce to GRangesList, but it takes extremely long time for one chromosome.
Hi Jiping,
reduce()
is vectorized on a GRangesList object so there is no need to useendoapply()
:What drove the decision to not treat overlapping ends in a special way was the desire to implement a simple and straightforward semantic where the pairing does not affect the coverage. In other words there is some appeal in the fact that you get the same coverage whether you use
readGAlignments()
,readGAlignmentPairs()
, orreadGAlignmentsList()
to load your reads. Even if that means that the overlapping ends of a paired-end read contribute twice to the coverage, which is a rare event anyway. One could also see the double contribution as a default weighting scheme where more weight is given to the reads whose 2 ends align to the same region, which is not a completely nonsensical thing to do after all.Anyway after taking a quick look at this it seems that it would be easy enough to modify the
coverage()
method for GAlignmentPairs objects to avoid the double contribution. So if nobody objects I'll make the change. Note that this change will slightly affect the results obtained previously withcoverage()
,coverageByTranscript()
, andpcoverageByTranscript()
though.H.
Thanks so much. I also figured out the reduce function as well. Very grateful for your detailed help!