Question: coverageByTranscript coudble counts overlapped region for paired-end data
0
4 months ago by
Jiping Wang70
Jiping Wang70 wrote:

I have a paired-end RNA-seq data in bam. It was read into R using readGAlignmentPairs, named gal2. The compatible reads pairs was parsed using findCompatibleOverlaps function. The following codes only specifically examines compatible read pairs for 4th gene in the dm3_transcripts list. There was only one paired-end aligned to this gene. It's clear that the two mates overlap at 94458625,94458626 and 94458627. As a result the coverageByTranscript returns coverage score of 2 in the three overlapped positions (which is wrong, it should be 1 instead of 3). Anyway to correct this error? thanks.

> gal2[unlist(tx2reads[4])]
GAlignmentPairs object with 1 pair, strandMode=1, and 0 metadata columns:
seqnames strand :            ranges --            ranges
<Rle>  <Rle> :         <IRanges> --         <IRanges>
[1]     chr1      - : 94458625-94458725 -- 94458527-94458627
-------
seqinfo: 25 sequences from an unspecified genome
> dm3_transcripts[4]
GRangesList object of length 1:
$ABCA4 GRanges object with 50 ranges and 2 metadata columns: seqnames ranges strand | exon_id exon_name <Rle> <IRanges> <Rle> | <integer> <character> [1] chr1 94458394-94458798 - | 18097 <NA> [2] chr1 94461665-94461751 - | 18098 <NA> [3] chr1 94463417-94463666 - | 18099 <NA> [4] chr1 94466392-94466484 - | 18100 <NA> [5] chr1 94466558-94466661 - | 18101 <NA> ... ... ... ... . ... ... [46] chr1 94568571-94568698 - | 18142 <NA> [47] chr1 94574133-94574272 - | 18143 <NA> [48] chr1 94576994-94577135 - | 18144 <NA> [49] chr1 94578529-94578622 - | 18145 <NA> [50] chr1 94586536-94586705 - | 18146 <NA> ------- > coverageByTranscript(gal2[unlist(tx2reads[4])], + dm3_transcripts[4]) RleList of length 1$ABCA4
integer-Rle of length 7325 with 5 runs
Lengths:   73   98    3   98 7053
Values :    0    1    2    1    0


modified 4 months ago by Hervé Pagès ♦♦ 14k • written 4 months ago by Jiping Wang70
Answer: coverageByTranscript coudble counts overlapped region for paired-end data
0
4 months ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:

Hi Jiping,

coverage() and family (including coverageByTranscript()) always turn a GAlignmentPairs object into a GRangesList object internally before computing the coverage. More precisely, if x is a GAlignmentPairs object, coverage(x, ...) and coverageByTranscript(x, ...) are equivalent to coverage(as(x, "GRangesList"), ...) and coverageByTranscript(as(x, "GRangesList"), ...), respectively.

As a result, the 2 ends of a paired-end read both contribute to the coverage, whether they overlap with each other or not.

Let's observe this on a fabricated paired-end read where the 2 ends overlap with each other:

library(GenomicAlignments)
first <- GAlignments("chr2L", 7650L, "101M", strand=strand("+"))
last <- GAlignments("chr2L", 7700L, "101M", strand=strand("-"))
galp <- GAlignmentPairs(first, last)
galp
# GAlignmentPairs object with 1 pair, strandMode=1, and 0 metadata columns:
#       seqnames strand :    ranges --    ranges
#          <Rle>  <Rle> : <IRanges> -- <IRanges>
#   [1]    chr2L      + : 7650-7750 -- 7700-7800
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

grl <- as(galp, "GRangesList")
grl
# GRangesList object of length 1:
# [[1]]
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]    chr2L 7650-7750      +
#   [2]    chr2L 7700-7800      +
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

coverage(grl)
# RleList of length 1
# $chr2L # integer-Rle of length 7800 with 4 runs # Lengths: 7649 50 51 50 # Values : 0 1 2 1 coverage(galp) # equivalent to coverage(grl) # RleList of length 1 #$chr2L
# integer-Rle of length 7800 with 4 runs
#   Lengths: 7649   50   51   50
#   Values :    0    1    2    1


For coverageByTranscript() we'll use the transcript my_transcript from the examples in ?coverageByTranscript:

library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
dm3_transcripts <- exonsBy(txdb, by="tx", use.names=TRUE)
my_transcript <- dm3_transcripts["FBtr0300689"]
my_transcript
# GRangesList object of length 1:
# $FBtr0300689 # GRanges object with 2 ranges and 3 metadata columns: # seqnames ranges strand | exon_id exon_name exon_rank # <Rle> <IRanges> <Rle> | <integer> <character> <integer> # [1] chr2L 7529-8116 + | 1 <NA> 1 # [2] chr2L 8193-9484 + | 3 <NA> 2 # ------- # seqinfo: 15 sequences (1 circular) from dm3 genome coverageByTranscript(grl, my_transcript) # RleList of length 1 #$FBtr0300689
# integer-Rle of length 1880 with 5 runs
#   Lengths:  121   50   51   50 1608
#   Values :    0    1    2    1    0

coverageByTranscript(galp, my_transcript)  # equivalent to
# coverageByTranscript(grl,
#                      my_transcript)
# RleList of length 1
# $FBtr0300689 # integer-Rle of length 1880 with 5 runs # Lengths: 121 50 51 50 1608 # Values : 0 1 2 1 0  If we don't want the overlapping ends of the read to contribute twice to the coverage, we can reduce the GRangesList object: reduced_grl <- reduce(grl) reduced_grl # GRangesList object of length 1: # [[1]] # GRanges object with 1 range and 0 metadata columns: # seqnames ranges strand # <Rle> <IRanges> <Rle> # [1] chr2L 7650-7800 + # ------- # seqinfo: 1 sequence from an unspecified genome; no seqlengths  Then: coverage(reduced_grl) # RleList of length 1 #$chr2L
# integer-Rle of length 7800 with 2 runs
#   Lengths: 7649  151
#   Values :    0    1

coverageByTranscript(reduced_grl, my_transcript)
# RleList of length 1
# \$FBtr0300689
# integer-Rle of length 1880 with 3 runs
#   Lengths:  121  151 1608
#   Values :    0    1    0


Hope this helps,

H.

Hi Hervé,

Thanks for your quick help. I understand and appreciate what you explained. I am just wondering whether your package could implement this as default for the paired-end data. I saw there are some other posts where users raised the same question. In my opinion, the coverage calculation for the paired end should NOT double count in the overlapped region because the two mates are from the same fragments.

Using the reduce function, we can merge the two GAranges from the two mates of one pair. When one gene has many pairs, we will have to do the reduce for each pairs before we can plug into the coverage function, right? Is there a shortcut that we can calculate the coverage for all genes from one chromosome with a few of lines of codes? I tried to use endoapply function to apply reduce to GRangesList, but it takes extremely long time for one chromosome.

Hi Jiping,

reduce() is vectorized on a GRangesList object so there is no need to use endoapply():

library(GenomicRanges)
grl <- GRangesList(GRanges(c("chr1:21-30", "chr1:28-35")),
GRanges(c("chr1:25-30", "chr1:35-40")),
GRanges(c("chr1:31-37", "chr1:38-45")))
reduce(grl)
# GRangesList object of length 3:
# [[1]]
# GRanges object with 1 range and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chr1     21-35      *
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#
# [[2]]
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chr1     25-30      *
#   [2]     chr1     35-40      *
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#
# [[3]]
# GRanges object with 1 range and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chr1     31-45      *
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths


What drove the decision to not treat overlapping ends in a special way was the desire to implement a simple and straightforward semantic where the pairing does not affect the coverage. In other words there is some appeal in the fact that you get the same coverage whether you use readGAlignments(), readGAlignmentPairs(), or readGAlignmentsList() to load your reads. Even if that means that the overlapping ends of a paired-end read contribute twice to the coverage, which is a rare event anyway. One could also see the double contribution as a default weighting scheme where more weight is given to the reads whose 2 ends align to the same region, which is not a completely nonsensical thing to do after all.

Anyway after taking a quick look at this it seems that it would be easy enough to modify the coverage() method for GAlignmentPairs objects to avoid the double contribution. So if nobody objects I'll make the change. Note that this change will slightly affect the results obtained previously with coverage(), coverageByTranscript(), and pcoverageByTranscript() though.

H.