Calculate fraction of bases (GRanges) that have non-zero coverage (RleList)
1
0
Entering edit mode
jma1991 ▴ 70
@jma1991-11856
Last seen 12 months ago
Cumbernauld

I have a RleList of read coverage:

> readCoverage
RleList of length 22
$chr10
integer-Rle of length 130694993 with 1966679 runs
  Lengths: 3100009      50    1954      50      48       3      47       3 ...       2      47       1      46   13009      50  101986
  Values :       0       1       0       1       0       1       2       1 ...       2       1       2       1       0       1       0

$chr11
integer-Rle of length 122082543 with 2655111 runs
  Lengths: 3100225      32      18      16      16      12      24      27 ...     276      50    4029      50    3096      50  107486
  Values :       0       1       3       2       3       1       2       1 ...       0       1       0       1       0       1       0

$chr12
integer-Rle of length 120129022 with 1566759 runs
  Lengths: 3000092      50      38      50    1230      50      18      50 ...     139      50     195      50       1      50  100839
  Values :       0       1       0       1       0       1       0       1 ...       0       1       0       1       0       1       0

$chr13
integer-Rle of length 120421639 with 1689775 runs
  Lengths: 3002660      50    1952      50    1090       3      47       3 ...      10      40    3207      50     608      40  102487
  Values :       0       2       0       2       0       1       2       1 ...       2       1       0       1       0       1       0

$chr14
integer-Rle of length 124902244 with 1559985 runs
  Lengths: 3022498      50      57      10       2      38      10       2 ...      66      43       7      39       5      50  102256
  Values :       0       1       0       1       3       5       4       2 ...       0       1       2       1       0       1       0

...
<17 more elements>

and a GRanges object of genomic regions:

> genomicRegions
GRanges object with 6665053 ranges and 0 metadata columns:
            seqnames               ranges strand
               <Rle>            <IRanges>  <Rle>
        [1]     chr1   [      1, 3000192]      +
        [2]     chr1   [3000193, 3000814]      +
        [3]     chr1   [3000815, 3001049]      +
        [4]     chr1   [3001050, 3001120]      +
        [5]     chr1   [3001121, 3001796]      +
        ...      ...                  ...    ...
  [6665049]     chrY [90843572, 90844088]      +
  [6665050]     chrY [90844089, 90844335]      +
  [6665051]     chrY [90844336, 90844497]      +
  [6665052]     chrY [90844498, 90844696]      +
  [6665053]     chrY [90844697, 91744698]      +
  -------
  seqinfo: 21 sequences from an unspecified genome; no seqlengths

I would like to calculate the fraction of bases in genomicRegions that have non-zero coverage from readCoverage. My end goal is to be able to create a matrix listing the fraction of bases covered for each genomic region (rows) for multiple bam files (columns).

So far I've only been able to calculate the read coverage and make a Views object of regions on just one of the chromosomes (I can't work out how to make a views object for all ranges, so I guess I'll just loop through the chromosomes):

# Calculate base pair coverage
readCoverage <- coverage(bamFile)

# Setup views along chromosome
chromRanges <- fragmentRanges[seqnames(fragmentRanges) == "chr1"]
​coverageViews <- Views(readCoverage, ranges(chromRanges))

Update:

The below code works, but it takes a long time to compute the number of bases. Is there a quicker way then using viewApply?

  coverageMatrix <- lapply(bamFiles, function(bamFile) {
    genomicViews <- Views(coverage(bamFile), as(genomicRanges, "RangesList"))
    basesCovered <- viewApply(genomicViews, function(x) sum(x > 0) / width(x))
  }

 

 

 

 

genomicalignments genomicranges • 1.7k views
ADD COMMENT
1
Entering edit mode
@michael-lawrence-3846
Last seen 2.4 years ago
United States

Couple of options. If you want to represent the data as Views (seems natural), then you could first test > 0, to get a logical RleList, and form the Views on that. Then it's just viewMeans().

That would be something like:

viewMeans(Views(coverage(bamFile) > 0, as(genomicRanges, "RangesList")))

Another way would be to approach it as an aggregation using Lists:

mean(coverage(bamFile)[genomicRanges] > 0)

When looking into this answer, I also found an obscure function called binnedAverage():

binnedAverage(genomicRanges, coverage(bamFile) > 0, "frac.nonzero")

Personally, I would use the second (List-based) option, because Views are limited by supporting only a single sequence, which necessitates a ViewsList and adds complexity without any real benefit in this case. I do like how Views preserves the original vector and ranges, but the split by sequence is annoying.

If we were starting over (and I think we probably should), we would return the coverage as a GPos, not an RleList, with an intuitively named function that would extract a GRanges of windows with an easily summarized RleList mcol holding the coverage values. Something like:

bamFile %>% compute_coverage() %>% extract_windows(genomicRanges) %>% aggregate(frac.nonzero=mean(score > 0))

 

ADD COMMENT
0
Entering edit mode

Thank you for the reply. Options 1 and 3 work but your preferred option gives me an integer overflow error. I guess it's because I'm working with a lot of ranges:

> viewMeans(Views(readCoverage > 0, as(genomicRanges, "RangesList")))
NumericList of length 21
[["chr1"]] 0 0.160771704180064 0 0 0.0695266272189349 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0.210084033613445 0 0 0 0 0 0 0 0 0 0 0
[["chr2"]] 0 0.241217798594848 0 0 0 0 0.315527950310559 ... 0 0 0.217142857142857 0 0 0.0605326876513317 0.00076577293340759
[["chr3"]] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0.290697674418605 0.151033386327504 0.0798722044728434 0 0 0.429889298892989 0 0 0 0 0 0 0 0 0
[["chr4"]] 0.000139964481144785 0 0.174800354924579 0.174377224199288 0.586565752128666 0 0 0 0.516616314199396 0 0 ... 0 0 0 0 0 0 0 0 0 0 0
[["chr5"]] 0 0 0.1463133640553 0.0656167979002625 0 0 0.21462829736211 0.267206477732794 ... 0 0 0 0 0.168918918918919 0 0.259314456035768 0
[["chr6"]] 0 0 0 0 0.425925925925926 0.266396761133603 0 0 0.0766871165644172 ... 0 0.0968992248062016 0.213360599863667 0 0 0 0 0
[["chr7"]] 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0.131294964028777 0.240631163708087 0.192678227360308 0.344797178130511 0
[["chr8"]] 0 0 0 0 0.0851900393184797 0 0 0 0 0 0 ... 0.0984251968503937 0 0 0 0.124688279301746 0 0.154831199068685 0 0.0597238895558223 0 0
[["chr9"]] 0 0 0 0 0 0 0.0483870967741935 0 0 0.0237304224015187 0.0625 0 0.0290224032586558 ... 0 0 0 0 0 0 0 0 0 0 0.191570881226054 0 0
[["chr10"]] 1.61283507051799e-05 0 0 0 0 0 0 0 0.266839378238342 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0.099601593625498 0 0 0 0
...
<11 more elements>

> mean(readCoverage[genomicRanges] > 0)
Error in .Call2("Rle_constructor", values, lengths, check, 0L, PACKAGE = "S4Vectors") :
  integer overflow while summing elements in 'lengths'

> binnedAverage(genomicRanges, readCoverage > 0, "frac.nonzero")
GRanges object with 6665053 ranges and 3 metadata columns:
            seqnames               ranges strand |        name     score frac.nonzero
               <Rle>            <IRanges>  <Rle> | <character> <numeric>    <numeric>
        [1]     chr1   [      1, 3000192]      + | GATC0000001         0   0.00000000
        [2]     chr1   [3000193, 3000814]      + | GATC0000002         0   0.16077170
        [3]     chr1   [3000815, 3001049]      + | GATC0000003         0   0.00000000
        [4]     chr1   [3001050, 3001120]      + | GATC0000004         0   0.00000000
        [5]     chr1   [3001121, 3001796]      + | GATC0000005         0   0.06952663
        ...      ...                  ...    ... .         ...       ...          ...
  [6665049]     chrY [90843572, 90844088]      + | GATC6665049         0            0
  [6665050]     chrY [90844089, 90844335]      + | GATC6665050         0            0
  [6665051]     chrY [90844336, 90844497]      + | GATC6665051         0            0
  [6665052]     chrY [90844498, 90844696]      + | GATC6665052         0            0
  [6665053]     chrY [90844697, 91744698]      + | GATC6665053         0            0
  -------
  seqinfo: 21 sequences from an unspecified genome; no seqlengths

Also all of the options throw an error if there are seqlevels in the readCoverage, which aren't in the genomicRanges. In my case genomicRanges doesn't contain any ranges on chrM, but my BAM files contains reads mapped to chrM:

> viewMeans(Views(coverage(bamFile) > 0, as(genomicRanges, "RangesList")))
Error in RleViewsList(rleList = subject, rangesList = start) : 
  'rleList' and 'rangesList' must have the same length

> mean(coverage(bamFile)[genomicRanges] > 0)
Error in .Call2("Rle_constructor", values, lengths, check, 0L, PACKAGE = "S4Vectors") : 
  integer overflow while summing elements in 'lengths'

> binnedAverage(genomicRanges, coverage(bamFile) > 0, "frac.nonzero")
Error in binnedAverage(genomicRanges, coverage(bamFile) > 0, "frac.nonzero") : 
  'seqlevels(bin)' and 'names(numvar)' must be identical

 

ADD REPLY
1
Entering edit mode

It's not the number of ranges but their total width. The seqlevels problem is fairly easily fixed by just assigning the Seqinfo from the BAM file (as a BamFile object) to your GRanges.

seqinfo(genomicRanges) <- seqinfo(bamFile)

 

ADD REPLY
0
Entering edit mode

Replacing seqinfo produces an error:

> seqinfo(genomicRanges)
Seqinfo object with 21 sequences from an unspecified genome; no seqlengths:
  seqnames seqlengths isCircular genome
  chr1           <NA>       <NA>   <NA>
  chr2           <NA>       <NA>   <NA>
  chr3           <NA>       <NA>   <NA>
  chr4           <NA>       <NA>   <NA>
  chr5           <NA>       <NA>   <NA>
  ...             ...        ...    ...
  chr17          <NA>       <NA>   <NA>
  chr18          <NA>       <NA>   <NA>
  chr19          <NA>       <NA>   <NA>
  chrX           <NA>       <NA>   <NA>
  chrY           <NA>       <NA>   <NA>
> seqinfo(bamFile)
Seqinfo object with 22 sequences from an unspecified genome:
  seqnames seqlengths isCircular genome
  chr10     130694993       <NA>   <NA>
  chr11     122082543       <NA>   <NA>
  chr12     120129022       <NA>   <NA>
  chr13     120421639       <NA>   <NA>
  chr14     124902244       <NA>   <NA>
  ...             ...        ...    ...
  chr8      129401213       <NA>   <NA>
  chr9      124595110       <NA>   <NA>
  chrM          16299       <NA>   <NA>
  chrX      171031299       <NA>   <NA>
  chrY       91744698       <NA>   <NA>
> seqinfo(genomicRanges) <- seqinfo(bamFile)
Error in GenomeInfoDb:::makeNewSeqnames(x, new2old = new2old, seqlevels(value)) : 
  when 'new2old' is NULL, the first elements in the
  supplied 'seqlevels' must be identical to 'seqlevels(x)'

But replacing the seqlevels works:

> seqlevels(genomicRanges) <- seqlevels(bamFile)
ADD REPLY

Login before adding your answer.

Traffic: 767 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6