Question: Does coverage() split exon spanning RNA-seq reads?
1
gravatar for Jake
4.4 years ago by
Jake60
United States
Jake60 wrote:

I couldn't find the answer to how the coverage() function handles reads split over exons when reading a bam file. When running

cvg <- coverage(BamFile("path/to/bam/file.bam"))

When calculating genome coverage in bedtools you need to specify the split command (see below). Does coverage handle this automatically?

ADD COMMENTlink modified 4.4 years ago by Martin Morgan ♦♦ 24k • written 4.4 years ago by Jake60
Answer: Does coverage() split exon spanning RNA-seq reads?
0
gravatar for Martin Morgan
4.4 years ago by
Martin Morgan ♦♦ 24k
United States
Martin Morgan ♦♦ 24k wrote:

I think the answer is that it behaves like -d -split by default

> coverage(GAlignments("A", 20L, "5M10N5M", strand("+")))
RleList of length 1
$A
integer-Rle of length 39 with 4 runs
  Lengths: 19  5 10  5
  Values :  0  1  0  1

For what it's worth, a 'custom' coverage function isn't too hard to manage, using GenomicFiles::reduceByYield() with

library(GenomicAlignments)
library(GenomicFiles)
library(RNAseqData.HNRNPC.bam.chr14) # example BAM files
fl = RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
bf = BamFile(fl, yieldSize=100000) # for illustration; use larger (e.g., 1M) in practice
yield = readGAlignmentsList        # paired-end reads
map = function(aln) {
    ## whatever coverage function you want, e.g.,
    aln <- GenomeInfoDb::keepStandardChromosomes(aln)
    coverage(as(aln, "GRanges"))   # like genomecov -d (??)
}
reduce = `+`
reduceByYield(bf, yield, map, reduce)[["chr14"]]

 

 

ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by Martin Morgan ♦♦ 24k

Hi Jake,

Just to try to give a little bit more context to Martin's suggestion for obtaining the behavior of bedtools -d option. This:

coverage(BamFile("path/to/bam/file.bam"))

is more memory efficient but semantically equivalent to:

gal <- readGAlignments(BamFile("path/to/bam/file.bam"))
coverage(as(gal, "GRangesList"))

As you can see, the GAlignments object returned by readGAlignments() is turned into a GRangesList object before coverage() is called. This coercion preserves the junctions (i.e. the N runs in the CIGAR):

as(GAlignments("A", 20L, "5M10N5M", strand("+")), "GRangesList")
# GRangesList object of length 1:
# [[1]] 
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]        A  [20, 24]      +
#   [2]        A  [35, 39]      +
#
# -------
# seqinfo: 1 sequence from an unspecified genome; no seqlengths

However, turning a GAlignments object into a GRanges object does NOT preserve the junctions:

as(GAlignments("A", 20L, "5M10N5M", strand("+")), "GRanges")
# GRanges object with 1 range and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]        A  [20, 39]      +
#   -------
#  seqinfo: 1 sequence from an unspecified genome; no seqlengths

So to get the behavior of bedtools -d option, you could do:

gal <- readGAlignments(BamFile("path/to/bam/file.bam"))
coverage(as(gal, "GRanges"))

or use Martin's memory efficient version above (which should be semantically equivalent).

H.

ADD REPLYlink written 4.4 years ago by Hervé Pagès ♦♦ 14k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 145 users visited in the last hour