Question

coverageByTranscript coudble counts overlapped region for paired-end data

0

Entering edit mode

Jiping Wang ▴ 90

@jiping-wang-6687

Last seen 25 days ago

United States

I have a paired-end RNA-seq data in bam. It was read into R using readGAlignmentPairs, named gal2. The compatible reads pairs was parsed using findCompatibleOverlaps function. The following codes only specifically examines compatible read pairs for 4th gene in the dm3_transcripts list. There was only one paired-end aligned to this gene. It's clear that the two mates overlap at 94458625,94458626 and 94458627. As a result the coverageByTranscript returns coverage score of 2 in the three overlapped positions (which is wrong, it should be 1 instead of 3). Anyway to correct this error? thanks.

> gal2[unlist(tx2reads[4])]
GAlignmentPairs object with 1 pair, strandMode=1, and 0 metadata columns:
      seqnames strand :            ranges --            ranges
         <Rle>  <Rle> :         <IRanges> --         <IRanges>
  [1]     chr1      - : 94458625-94458725 -- 94458527-94458627
  -------
  seqinfo: 25 sequences from an unspecified genome
> dm3_transcripts[4]
GRangesList object of length 1:
$ABCA4 
GRanges object with 50 ranges and 2 metadata columns:
       seqnames            ranges strand |   exon_id   exon_name
          <Rle>         <IRanges>  <Rle> | <integer> <character>
   [1]     chr1 94458394-94458798      - |     18097        <NA>
   [2]     chr1 94461665-94461751      - |     18098        <NA>
   [3]     chr1 94463417-94463666      - |     18099        <NA>
   [4]     chr1 94466392-94466484      - |     18100        <NA>
   [5]     chr1 94466558-94466661      - |     18101        <NA>
   ...      ...               ...    ... .       ...         ...
  [46]     chr1 94568571-94568698      - |     18142        <NA>
  [47]     chr1 94574133-94574272      - |     18143        <NA>
  [48]     chr1 94576994-94577135      - |     18144        <NA>
  [49]     chr1 94578529-94578622      - |     18145        <NA>
  [50]     chr1 94586536-94586705      - |     18146        <NA>

-------
> coverageByTranscript(gal2[unlist(tx2reads[4])],
+                       dm3_transcripts[4])
RleList of length 1
$ABCA4
integer-Rle of length 7325 with 5 runs
  Lengths:   73   98    3   98 7053
  Values :    0    1    2    1    0

GenomicAlignments RNA-seq paired-end read coverage • 1.2k views

ADD COMMENT • link updated 5.5 years ago by Hervé Pagès 16k • written 5.5 years ago by Jiping Wang ▴ 90

score 0 · Answer 1 · 2019-08-03

Hi Jiping,

coverage() and family (including coverageByTranscript()) always turn a GAlignmentPairs object into a GRangesList object internally before computing the coverage. More precisely, if x is a GAlignmentPairs object, coverage(x, ...) and coverageByTranscript(x, ...) are equivalent to coverage(as(x, "GRangesList"), ...) and coverageByTranscript(as(x, "GRangesList"), ...), respectively.

As a result, the 2 ends of a paired-end read both contribute to the coverage, whether they overlap with each other or not.

Let's observe this on a fabricated paired-end read where the 2 ends overlap with each other:

library(GenomicAlignments)
first <- GAlignments("chr2L", 7650L, "101M", strand=strand("+"))
last <- GAlignments("chr2L", 7700L, "101M", strand=strand("-"))
galp <- GAlignmentPairs(first, last)
galp
# GAlignmentPairs object with 1 pair, strandMode=1, and 0 metadata columns:
#       seqnames strand :    ranges --    ranges
#          <Rle>  <Rle> : <IRanges> -- <IRanges>
#   [1]    chr2L      + : 7650-7750 -- 7700-7800
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

grl <- as(galp, "GRangesList")
grl
# GRangesList object of length 1:
# [[1]]
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]    chr2L 7650-7750      +
#   [2]    chr2L 7700-7800      +
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

coverage(grl)
# RleList of length 1
# $chr2L
# integer-Rle of length 7800 with 4 runs
#   Lengths: 7649   50   51   50
#   Values :    0    1    2    1

coverage(galp)  # equivalent to coverage(grl)
# RleList of length 1
# $chr2L
# integer-Rle of length 7800 with 4 runs
#   Lengths: 7649   50   51   50
#   Values :    0    1    2    1

For coverageByTranscript() we'll use the transcript my_transcript from the examples in ?coverageByTranscript:

library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
dm3_transcripts <- exonsBy(txdb, by="tx", use.names=TRUE)
my_transcript <- dm3_transcripts["FBtr0300689"]
my_transcript
# GRangesList object of length 1:
# $FBtr0300689
# GRanges object with 2 ranges and 3 metadata columns:
#       seqnames    ranges strand |   exon_id   exon_name exon_rank
#          <Rle> <IRanges>  <Rle> | <integer> <character> <integer>
#   [1]    chr2L 7529-8116      + |         1        <NA>         1
#   [2]    chr2L 8193-9484      + |         3        <NA>         2
#   -------
#   seqinfo: 15 sequences (1 circular) from dm3 genome

coverageByTranscript(grl, my_transcript)
# RleList of length 1
# $FBtr0300689
# integer-Rle of length 1880 with 5 runs
#   Lengths:  121   50   51   50 1608
#   Values :    0    1    2    1    0

coverageByTranscript(galp, my_transcript)  # equivalent to
                                           # coverageByTranscript(grl,
                                           #                      my_transcript)
# RleList of length 1
# $FBtr0300689
# integer-Rle of length 1880 with 5 runs
#   Lengths:  121   50   51   50 1608
#   Values :    0    1    2    1    0

If we don't want the overlapping ends of the read to contribute twice to the coverage, we can reduce the GRangesList object:

reduced_grl <- reduce(grl)
reduced_grl
# GRangesList object of length 1:
# [[1]]
# GRanges object with 1 range and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]    chr2L 7650-7800      +
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Then:

coverage(reduced_grl)
# RleList of length 1
# $chr2L
# integer-Rle of length 7800 with 2 runs
#   Lengths: 7649  151
#   Values :    0    1

coverageByTranscript(reduced_grl, my_transcript)
# RleList of length 1
# $FBtr0300689
# integer-Rle of length 1880 with 3 runs
#   Lengths:  121  151 1608
#   Values :    0    1    0

Hope this helps,

H.