Question: Finding mismatches between reads and reference sequence
0
4.1 years ago by
Germany
Stefanie Tauber40 wrote:

Hi,

following up the post Finding mismatches between reads and reference sequence and, in particular, the two very handy functions 'mismatches' and 'subsetByIntegerList' presented in this post by Herve Pages:

Is there already some functionality available for finding the position and identity of mismatches between reads and references sequence when having insertions in my alignments?

Best,

Stefanie

bsgenome cigar alignment • 906 views
modified 4.1 years ago by Hervé Pagès ♦♦ 13k • written 4.1 years ago by Stefanie Tauber40
1
4.1 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi Stefanie,

The "Finding mismatches between reads and reference sequence" post is old and the good news is that it's a little bit easier to do these things these days with sequenceLayer() (which I added since then). sequenceLayer() is defined in the GenomicAlignments package and supports insertions, deletions, and clipping (soft and hard).

library(RNAseqData.HNRNPC.bam.chr14)
bam <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
param <- ScanBamParam(what="seq")
##        M         I         D         N         S         H         P         =
## 57631946      2902      1560 659051789         0         0         0         0
##        X
##        0 

Our reads contain insertions (I), deletions (D), and skipped regions (N) (aka junctions).

Query sequences:

qseqs <- mcols(reads)\$seq

Extract the reference sequences from the reference genome. Important: all the sequences must be extracted from the plus strand:

library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
rseqs <- getSeq(genome, read_ranges)

Use sequenceLayer() to bring the reference sequences "in the same space as the query sequences". Concretely this will remove from rseqs the nucleotides that correspond to deletions (D) and junctions (N), and will insert the letter "-" where insertions (I) occurred:

rseqs2 <- sequenceLayer(rseqs, cigar(reads), from="reference", to="query")

A sanity check:

identical(width(qseqs), width(rseqs2))  # TRUE (that was not the case with rseqs)

From here we can re-use the mismatches() function to find the positions of the mismatches between qseqs and rseqs2:

mismatches <- function(x, y)
{
if (!is(x, "DNAStringSet") || !is(y, "DNAStringSet"))
stop("'x' and 'y' must be DNAStringSet objects")
x_width <- width(x)
y_width <- width(y)
if (!identical(x_width, y_width))
stop("'x' and 'y' must have the same shape ",
"(i.e. same length and widths)")
unlisted_ans <- which(as.raw(unlist(x)) != as.raw(unlist(y)))
breakpoints <- cumsum(x_width)
ans_eltlens <- tabulate(findInterval(unlisted_ans - 1L,
breakpoints) + 1L,
nbins=length(x))
skeleton <- PartitioningByEnd(cumsum(ans_eltlens))
ans <- relist(unlisted_ans, skeleton)
offsets <- c(0L, breakpoints[-length(breakpoints)])
ans <- ans - offsets
ans
}

mm <- mismatches(qseqs, rseqs2)

The result is an IntegerList "parallel" to qseqs (and to rseqs2), that is, it has one list element per query sequence:

 > mm
IntegerList of length 800484
[[1]] integer(0)
[[2]] integer(0)
[[3]] integer(0)
[[4]] integer(0)
[[5]] integer(0)
[[6]] integer(0)
[[7]] integer(0)
[[8]] 7 9
[[9]] 7 9
[[10]] 7 9
...
<800474 more elements>

Each list element is an integer vector giving the 1-based position of the mismatches with respect to the corresponding sequence in qseqs (i.e. to the corresponding sequence in the SEQ field of the BAM file). Keep in mind that these sequences are the reverse complement of the original read sequences when they aligned to the minus strand. Also they could have been hard-clipped by the aligner.

elementLengths(mm)

Summary of nb of mismatches per read:

> summary(elementLengths(mm))
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.0000  0.0000  0.0000  0.2147  0.0000  2.0000 

To see the nucleotides that correspopnd to these mismatches, qseqs and rseqs2 can be subsetted with mm (no need to use adhoc subsetByIntegerList() anymore):

> head(qseqs[mm], n=10)
A DNAStringSet instance of length 10
width seq
[1]     0
[2]     0
[3]     0
[4]     0
[5]     0
[6]     0
[7]     0
[8]     2 GA
[9]     2 GA
[10]     2 GA

A DNAStringSet instance of length 10
width seq
[1]     0
[2]     0
[3]     0
[4]     0
[5]     0
[6]     0
[7]     0
[8]     2 AG
[9]     2 AG
[10]     2 AG

Hope this helps. These tools have been added quite recently. Let me know if you run into problems and/or if they need improvement.

H.