Does coverage() split exon spanning RNA-seq reads?
1
1
Entering edit mode
Jake ▴ 90
@jake-7236
Last seen 20 months ago
United States

I couldn't find the answer to how the coverage() function handles reads split over exons when reading a bam file. When running

cvg <- coverage(BamFile("path/to/bam/file.bam"))

When calculating genome coverage in bedtools you need to specify the split command (see below). Does coverage handle this automatically?

coverage genomicranges genomicalignments • 1.9k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 3 days ago
United States

I think the answer is that it behaves like -d -split by default

> coverage(GAlignments("A", 20L, "5M10N5M", strand("+")))
RleList of length 1
$A
integer-Rle of length 39 with 4 runs
  Lengths: 19  5 10  5
  Values :  0  1  0  1

For what it's worth, a 'custom' coverage function isn't too hard to manage, using GenomicFiles::reduceByYield() with

library(GenomicAlignments)
library(GenomicFiles)
library(RNAseqData.HNRNPC.bam.chr14) # example BAM files
fl = RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
bf = BamFile(fl, yieldSize=100000) # for illustration; use larger (e.g., 1M) in practice
yield = readGAlignmentsList        # paired-end reads
map = function(aln) {
    ## whatever coverage function you want, e.g.,
    aln <- GenomeInfoDb::keepStandardChromosomes(aln)
    coverage(as(aln, "GRanges"))   # like genomecov -d (??)
}
reduce = `+`
reduceByYield(bf, yield, map, reduce)[["chr14"]]

 

 

ADD COMMENT
0
Entering edit mode

Hi Jake,

Just to try to give a little bit more context to Martin's suggestion for obtaining the behavior of bedtools -d option. This:

coverage(BamFile("path/to/bam/file.bam"))

is more memory efficient but semantically equivalent to:

gal <- readGAlignments(BamFile("path/to/bam/file.bam"))
coverage(as(gal, "GRangesList"))

As you can see, the GAlignments object returned by readGAlignments() is turned into a GRangesList object before coverage() is called. This coercion preserves the junctions (i.e. the N runs in the CIGAR):

as(GAlignments("A", 20L, "5M10N5M", strand("+")), "GRangesList")
# GRangesList object of length 1:
# [[1]] 
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]        A  [20, 24]      +
#   [2]        A  [35, 39]      +
#
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths

However, turning a GAlignments object into a GRanges object does NOT preserve the junctions:

as(GAlignments("A", 20L, "5M10N5M", strand("+")), "GRanges")
# GRanges object with 1 range and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]        A  [20, 39]      +
#   -------
#  seqinfo: 1 sequence from an unspecified genome; no seqlengths

So to get the behavior of bedtools -d option, you could do:

gal <- readGAlignments(BamFile("path/to/bam/file.bam"))
coverage(as(gal, "GRanges"))

or use Martin's memory efficient version above (which should be semantically equivalent).

H.

ADD REPLY

Login before adding your answer.

Traffic: 846 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6