How to extract coordinates of mismatches from a BAM file lodaded in as a GAlignments?
1
0
Entering edit mode
Charles • 0
@charles-23991
Last seen 4 months ago
Japan

Hello,

I have loaded a BAM file in Bioconductor with the GenomicAlignments package. I am looking for a way to extract the position of the mismatches (X in the CIGAR line) as a GRanges object, but so far I only found a function (cigarRangesAlongReferenceSpace) that returns an IRanges object. Is there a simple way to project these IRanges to the GAlignements in order to produce on GRange per IRange?

GenomicAlignments • 510 views
ADD COMMENT
0
Entering edit mode

I am trying this but it is slow.

aln2mismatches <- function (gal) {
  gr <- GRanges(gal)
  irl <- cigarRangesAlongReferenceSpace(cigar(gal), ops = "X", reduce.ranges = FALSE)
  keep <- sum(width(irl)) != 0
  if (sum(keep) == 0) return(GRanges())
  gr <- gr[keep]
  irl <- irl[keep]
  mismatchesGR <- function (r, i) {
    GRanges(seqnames(r), shift(IPos(i), start(r)))
  }
  lapply(seq_along(gr), \(i) {
    mismatchesGR(gr[i], irl[[i]])
  }) |> as("GRangesList") |> unlist()
}

In the end, I suppose that the reason why such a function does not exist is because one can just run bcftools on the BAM file and load the variant calls with VariantAnnotation::readVcf?

ADD REPLY
0
Entering edit mode
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States

This should be fast:

library(GenomicAlignments)

extractMismatchesFromGAlignments <- function(ga)
{
    stopifnot(is(ga, "GAlignments"))
    Xranges <- cigarRangesAlongReferenceSpace(cigar(ga), pos=start(ga), ops="X")
    GenomicAlignments:::make_GRangesList_from_CompressedIRangesList(Xranges,
                                         seqnames(ga), strand(ga), seqinfo(ga))
}

Then:

ga <- GAlignments(
    seqnames=c("chr1", "chr2"),
    pos=c(1001, 2001),
    cigar=c("19M1X10M2000N4M2X6M", "3X12M40N2M1X3M2X2M")
)

extractMismatchesFromGAlignments(ga)
# GRangesList object of length 2:
# [[1]]
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chr1      1020      *
#   [2]     chr1 3035-3036      *
#   -------
#   seqinfo: 2 sequences from an unspecified genome; no seqlengths
#
# [[2]]
# GRanges object with 3 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chr2 2001-2003      *
#   [2]     chr2      2058      *
#   [3]     chr2 2062-2063      *
#   -------
#   seqinfo: 2 sequences from an unspecified genome; no seqlengths

The key is to avoid using an lapply-loop to construct the final GRangesList object from CompressedIRangesList object Xranges, which is what internal helper GenomicAlignments:::make_GRangesList_from_CompressedIRangesList() does. You can take a look at how make_GRangesList_from_CompressedIRangesList() is implemented to get a sense of how it achieves this.

ADD COMMENT

Login before adding your answer.

Traffic: 950 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6