Finding mismatches between reads and reference sequence
Entering edit mode
Last seen 5.0 years ago



following up the post Finding mismatches between reads and reference sequence and, in particular, the two very handy functions 'mismatches' and 'subsetByIntegerList' presented in this post by Herve Pages:

Is there already some functionality available for finding the position and identity of mismatches between reads and references sequence when having insertions in my alignments?




bsgenome cigar alignment • 1.3k views
Entering edit mode
Last seen 1 day ago
Seattle, WA, United States

Hi Stefanie,

The "Finding mismatches between reads and reference sequence" post is old and the good news is that it's a little bit easier to do these things these days with sequenceLayer() (which I added since then). sequenceLayer() is defined in the GenomicAlignments package and supports insertions, deletions, and clipping (soft and hard).

bam <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
param <- ScanBamParam(what="seq")
reads <- readGAlignments(bam, param=param)
##        M         I         D         N         S         H         P         = 
## 57631946      2902      1560 659051789         0         0         0         0 
##        X 
##        0 

Our reads contain insertions (I), deletions (D), and skipped regions (N) (aka junctions).

Query sequences:

qseqs <- mcols(reads)$seq

Extract the reference sequences from the reference genome. Important: all the sequences must be extracted from the plus strand:

genome <- BSgenome.Hsapiens.UCSC.hg19
read_ranges <- granges(reads)
strand(read_ranges) <- "+"
rseqs <- getSeq(genome, read_ranges)

Use sequenceLayer() to bring the reference sequences "in the same space as the query sequences". Concretely this will remove from rseqs the nucleotides that correspond to deletions (D) and junctions (N), and will insert the letter "-" where insertions (I) occurred:  

rseqs2 <- sequenceLayer(rseqs, cigar(reads), from="reference", to="query")

A sanity check:

identical(width(qseqs), width(rseqs2))  # TRUE (that was not the case with rseqs)

From here we can re-use the mismatches() function to find the positions of the mismatches between qseqs and rseqs2:

mismatches <- function(x, y)
    if (!is(x, "DNAStringSet") || !is(y, "DNAStringSet"))
        stop("'x' and 'y' must be DNAStringSet objects")
    x_width <- width(x)
    y_width <- width(y)
    if (!identical(x_width, y_width))
        stop("'x' and 'y' must have the same shape ",
             "(i.e. same length and widths)")
    unlisted_ans <- which(as.raw(unlist(x)) != as.raw(unlist(y)))
    breakpoints <- cumsum(x_width)
    ans_eltlens <- tabulate(findInterval(unlisted_ans - 1L,
                                         breakpoints) + 1L,
    skeleton <- PartitioningByEnd(cumsum(ans_eltlens))
    ans <- relist(unlisted_ans, skeleton)
    offsets <- c(0L, breakpoints[-length(breakpoints)])
    ans <- ans - offsets

mm <- mismatches(qseqs, rseqs2)

The result is an IntegerList "parallel" to qseqs (and to rseqs2), that is, it has one list element per query sequence:

 > mm
IntegerList of length 800484
[[1]] integer(0)
[[2]] integer(0)
[[3]] integer(0)
[[4]] integer(0)
[[5]] integer(0)
[[6]] integer(0)
[[7]] integer(0)
[[8]] 7 9
[[9]] 7 9
[[10]] 7 9
<800474 more elements>

Each list element is an integer vector giving the 1-based position of the mismatches with respect to the corresponding sequence in qseqs (i.e. to the corresponding sequence in the SEQ field of the BAM file). Keep in mind that these sequences are the reverse complement of the original read sequences when they aligned to the minus strand. Also they could have been hard-clipped by the aligner.

Nb of mismatches per read:


Summary of nb of mismatches per read:

> summary(elementLengths(mm))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.0000  0.2147  0.0000  2.0000 

To see the nucleotides that correspopnd to these mismatches, qseqs and rseqs2 can be subsetted with mm (no need to use adhoc subsetByIntegerList() anymore):

> head(qseqs[mm], n=10)
  A DNAStringSet instance of length 10
     width seq
 [1]     0 
 [2]     0 
 [3]     0 
 [4]     0 
 [5]     0 
 [6]     0 
 [7]     0 
 [8]     2 GA
 [9]     2 GA
[10]     2 GA

> head(rseqs2[mm], n=10)
  A DNAStringSet instance of length 10
     width seq
 [1]     0 
 [2]     0 
 [3]     0 
 [4]     0 
 [5]     0 
 [6]     0 
 [7]     0 
 [8]     2 AG
 [9]     2 AG
[10]     2 AG

Hope this helps. These tools have been added quite recently. Let me know if you run into problems and/or if they need improvement.


Entering edit mode

Thanks so much for the detailed comment!


Login before adding your answer.

Traffic: 322 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6