Question: Calculate fraction of bases (GRanges) that have non-zero coverage (RleList)
0
2.5 years ago by
jma199130
jma199130 wrote:

I have a RleList of read coverage:

> readCoverage
RleList of length 22
$chr10 integer-Rle of length 130694993 with 1966679 runs Lengths: 3100009 50 1954 50 48 3 47 3 ... 2 47 1 46 13009 50 101986 Values : 0 1 0 1 0 1 2 1 ... 2 1 2 1 0 1 0$chr11
integer-Rle of length 122082543 with 2655111 runs
Lengths: 3100225      32      18      16      16      12      24      27 ...     276      50    4029      50    3096      50  107486
Values :       0       1       3       2       3       1       2       1 ...       0       1       0       1       0       1       0

$chr12 integer-Rle of length 120129022 with 1566759 runs Lengths: 3000092 50 38 50 1230 50 18 50 ... 139 50 195 50 1 50 100839 Values : 0 1 0 1 0 1 0 1 ... 0 1 0 1 0 1 0$chr13
integer-Rle of length 120421639 with 1689775 runs
Lengths: 3002660      50    1952      50    1090       3      47       3 ...      10      40    3207      50     608      40  102487
Values :       0       2       0       2       0       1       2       1 ...       2       1       0       1       0       1       0

\$chr14
integer-Rle of length 124902244 with 1559985 runs
Lengths: 3022498      50      57      10       2      38      10       2 ...      66      43       7      39       5      50  102256
Values :       0       1       0       1       3       5       4       2 ...       0       1       2       1       0       1       0

...
<17 more elements>

and a GRanges object of genomic regions:

> genomicRegions
GRanges object with 6665053 ranges and 0 metadata columns:
seqnames               ranges strand
<Rle>            <IRanges>  <Rle>
[1]     chr1   [      1, 3000192]      +
[2]     chr1   [3000193, 3000814]      +
[3]     chr1   [3000815, 3001049]      +
[4]     chr1   [3001050, 3001120]      +
[5]     chr1   [3001121, 3001796]      +
...      ...                  ...    ...
[6665049]     chrY [90843572, 90844088]      +
[6665050]     chrY [90844089, 90844335]      +
[6665051]     chrY [90844336, 90844497]      +
[6665052]     chrY [90844498, 90844696]      +
[6665053]     chrY [90844697, 91744698]      +
-------
seqinfo: 21 sequences from an unspecified genome; no seqlengths

I would like to calculate the fraction of bases in genomicRegions that have non-zero coverage from readCoverage. My end goal is to be able to create a matrix listing the fraction of bases covered for each genomic region (rows) for multiple bam files (columns).

So far I've only been able to calculate the read coverage and make a Views object of regions on just one of the chromosomes (I can't work out how to make a views object for all ranges, so I guess I'll just loop through the chromosomes):

# Calculate base pair coverage

# Setup views along chromosome
chromRanges <- fragmentRanges[seqnames(fragmentRanges) == "chr1"]
​coverageViews <- Views(readCoverage, ranges(chromRanges))

Update:

The below code works, but it takes a long time to compute the number of bases. Is there a quicker way then using viewApply?

  coverageMatrix <- lapply(bamFiles, function(bamFile) {
genomicViews <- Views(coverage(bamFile), as(genomicRanges, "RangesList"))
basesCovered <- viewApply(genomicViews, function(x) sum(x > 0) / width(x))
}

modified 2.5 years ago • written 2.5 years ago by jma199130
Answer: Calculate fraction of bases (GRanges) that have non-zero coverage (RleList)
1
2.5 years ago by
United States
Michael Lawrence11k wrote:

Couple of options. If you want to represent the data as Views (seems natural), then you could first test > 0, to get a logical RleList, and form the Views on that. Then it's just viewMeans().

That would be something like:

viewMeans(Views(coverage(bamFile) > 0, as(genomicRanges, "RangesList")))

Another way would be to approach it as an aggregation using Lists:

mean(coverage(bamFile)[genomicRanges] > 0)

When looking into this answer, I also found an obscure function called binnedAverage():

binnedAverage(genomicRanges, coverage(bamFile) > 0, "frac.nonzero")

Personally, I would use the second (List-based) option, because Views are limited by supporting only a single sequence, which necessitates a ViewsList and adds complexity without any real benefit in this case. I do like how Views preserves the original vector and ranges, but the split by sequence is annoying.

If we were starting over (and I think we probably should), we would return the coverage as a GPos, not an RleList, with an intuitively named function that would extract a GRanges of windows with an easily summarized RleList mcol holding the coverage values. Something like:

bamFile %>% compute_coverage() %>% extract_windows(genomicRanges) %>% aggregate(frac.nonzero=mean(score > 0))

Thank you for the reply. Options 1 and 3 work but your preferred option gives me an integer overflow error. I guess it's because I'm working with a lot of ranges:

> viewMeans(Views(readCoverage > 0, as(genomicRanges, "RangesList")))
NumericList of length 21
[["chr1"]] 0 0.160771704180064 0 0 0.0695266272189349 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0.210084033613445 0 0 0 0 0 0 0 0 0 0 0
[["chr2"]] 0 0.241217798594848 0 0 0 0 0.315527950310559 ... 0 0 0.217142857142857 0 0 0.0605326876513317 0.00076577293340759
[["chr3"]] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0.290697674418605 0.151033386327504 0.0798722044728434 0 0 0.429889298892989 0 0 0 0 0 0 0 0 0
[["chr4"]] 0.000139964481144785 0 0.174800354924579 0.174377224199288 0.586565752128666 0 0 0 0.516616314199396 0 0 ... 0 0 0 0 0 0 0 0 0 0 0
[["chr5"]] 0 0 0.1463133640553 0.0656167979002625 0 0 0.21462829736211 0.267206477732794 ... 0 0 0 0 0.168918918918919 0 0.259314456035768 0
[["chr6"]] 0 0 0 0 0.425925925925926 0.266396761133603 0 0 0.0766871165644172 ... 0 0.0968992248062016 0.213360599863667 0 0 0 0 0
[["chr7"]] 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0.131294964028777 0.240631163708087 0.192678227360308 0.344797178130511 0
[["chr8"]] 0 0 0 0 0.0851900393184797 0 0 0 0 0 0 ... 0.0984251968503937 0 0 0 0.124688279301746 0 0.154831199068685 0 0.0597238895558223 0 0
[["chr9"]] 0 0 0 0 0 0 0.0483870967741935 0 0 0.0237304224015187 0.0625 0 0.0290224032586558 ... 0 0 0 0 0 0 0 0 0 0 0.191570881226054 0 0
[["chr10"]] 1.61283507051799e-05 0 0 0 0 0 0 0 0.266839378238342 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0.099601593625498 0 0 0 0
...
<11 more elements>

Error in .Call2("Rle_constructor", values, lengths, check, 0L, PACKAGE = "S4Vectors") :
integer overflow while summing elements in 'lengths'

> binnedAverage(genomicRanges, readCoverage > 0, "frac.nonzero")
GRanges object with 6665053 ranges and 3 metadata columns:
seqnames               ranges strand |        name     score frac.nonzero
<Rle>            <IRanges>  <Rle> | <character> <numeric>    <numeric>
[1]     chr1   [      1, 3000192]      + | GATC0000001         0   0.00000000
[2]     chr1   [3000193, 3000814]      + | GATC0000002         0   0.16077170
[3]     chr1   [3000815, 3001049]      + | GATC0000003         0   0.00000000
[4]     chr1   [3001050, 3001120]      + | GATC0000004         0   0.00000000
[5]     chr1   [3001121, 3001796]      + | GATC0000005         0   0.06952663
...      ...                  ...    ... .         ...       ...          ...
[6665049]     chrY [90843572, 90844088]      + | GATC6665049         0            0
[6665050]     chrY [90844089, 90844335]      + | GATC6665050         0            0
[6665051]     chrY [90844336, 90844497]      + | GATC6665051         0            0
[6665052]     chrY [90844498, 90844696]      + | GATC6665052         0            0
[6665053]     chrY [90844697, 91744698]      + | GATC6665053         0            0
-------
seqinfo: 21 sequences from an unspecified genome; no seqlengths

Also all of the options throw an error if there are seqlevels in the readCoverage, which aren't in the genomicRanges. In my case genomicRanges doesn't contain any ranges on chrM, but my BAM files contains reads mapped to chrM:

> viewMeans(Views(coverage(bamFile) > 0, as(genomicRanges, "RangesList")))
Error in RleViewsList(rleList = subject, rangesList = start) :
'rleList' and 'rangesList' must have the same length

> mean(coverage(bamFile)[genomicRanges] > 0)
Error in .Call2("Rle_constructor", values, lengths, check, 0L, PACKAGE = "S4Vectors") :
integer overflow while summing elements in 'lengths'

> binnedAverage(genomicRanges, coverage(bamFile) > 0, "frac.nonzero")
Error in binnedAverage(genomicRanges, coverage(bamFile) > 0, "frac.nonzero") :
'seqlevels(bin)' and 'names(numvar)' must be identical

1

It's not the number of ranges but their total width. The seqlevels problem is fairly easily fixed by just assigning the Seqinfo from the BAM file (as a BamFile object) to your GRanges.

seqinfo(genomicRanges) <- seqinfo(bamFile)

Replacing seqinfo produces an error:

> seqinfo(genomicRanges)
Seqinfo object with 21 sequences from an unspecified genome; no seqlengths:
seqnames seqlengths isCircular genome
chr1           <NA>       <NA>   <NA>
chr2           <NA>       <NA>   <NA>
chr3           <NA>       <NA>   <NA>
chr4           <NA>       <NA>   <NA>
chr5           <NA>       <NA>   <NA>
...             ...        ...    ...
chr17          <NA>       <NA>   <NA>
chr18          <NA>       <NA>   <NA>
chr19          <NA>       <NA>   <NA>
chrX           <NA>       <NA>   <NA>
chrY           <NA>       <NA>   <NA>
> seqinfo(bamFile)
Seqinfo object with 22 sequences from an unspecified genome:
seqnames seqlengths isCircular genome
chr10     130694993       <NA>   <NA>
chr11     122082543       <NA>   <NA>
chr12     120129022       <NA>   <NA>
chr13     120421639       <NA>   <NA>
chr14     124902244       <NA>   <NA>
...             ...        ...    ...
chr8      129401213       <NA>   <NA>
chr9      124595110       <NA>   <NA>
chrM          16299       <NA>   <NA>
chrX      171031299       <NA>   <NA>
chrY       91744698       <NA>   <NA>
> seqinfo(genomicRanges) <- seqinfo(bamFile)
Error in GenomeInfoDb:::makeNewSeqnames(x, new2old = new2old, seqlevels(value)) :
when 'new2old' is NULL, the first elements in the
supplied 'seqlevels' must be identical to 'seqlevels(x)'

But replacing the seqlevels works:

> seqlevels(genomicRanges) <- seqlevels(bamFile)