Question

Counting SNPs where ambiguity codes represent heterozygosity.

ben.ward ▴ 30

@benward-7169

United Kingdom

Hi,

Suppose I have two DNAStrings of the same length in a DNAStringSet:

> dna <- DNAStringSet(c("ATYGRTCGATCG", "MTSGATCRATCG"))
> dna
A DNAStringSet instance of length 2
width seq
[1] 12 ATYGRTCGATCG
[2] 12 MTSGATCRATCG

How can I count the number of mismatches between the two sequences, ideally at each base, assuming that the ambiguity codes denote heterozygosity? For example, at the first base, A means homozygous A/A and M means heterozygous A/C, so that position counts as 1 mismatch. Likewise, at the third position, Y is heterozygous C/T whereas S is heterozygous C/G, which again counts as 1 mismatch.

I've thought of a few ways of doing this, for example using consensus matrices, or defining a custom scoring matrix that returns the appropriate score when indexed by the two bases, so that mat["A", "A"] = 0, mat["A", "M"] = 1, and so on.
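To make the scoring-matrix idea concrete, here is a minimal base R sketch. It rests on assumptions that are not stated in the post: the score for a pair of codes is taken to be the number of alleles that differ between the two diploid genotypes (so A vs. C would score 2), and only the single-base and two-base IUPAC codes are handled. The names `genotypes`, `allele_diff` and `count_mismatches` are made up for illustration.

```r
# Diploid genotype implied by each IUPAC code.
# Assumption: only one- and two-base codes are in scope (no B, D, H, V, N).
genotypes <- list(
  A = c("A", "A"), C = c("C", "C"), G = c("G", "G"), T = c("T", "T"),
  M = c("A", "C"), R = c("A", "G"), W = c("A", "T"),
  S = c("C", "G"), Y = c("C", "T"), K = c("G", "T")
)
bases <- c("A", "C", "G", "T")

# Hypothetical scoring rule: 2 minus the number of alleles shared
# between the two genotypes (multiset intersection).
allele_diff <- function(g1, g2) {
  n1 <- tabulate(match(g1, bases), nbins = 4L)
  n2 <- tabulate(match(g2, bases), nbins = 4L)
  2L - sum(pmin(n1, n2))
}

# Build the lookup matrix so that mat["A", "M"] gives the score.
codes <- names(genotypes)
mat <- outer(codes, codes, Vectorize(function(x, y) {
  allele_diff(genotypes[[x]], genotypes[[y]])
}))
dimnames(mat) <- list(codes, codes)

# Per-base scores for two equal-length sequences given as plain strings.
count_mismatches <- function(s1, s2, mat) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  stopifnot(length(a) == length(b))
  mat[cbind(a, b)]  # index the matrix by (row name, column name) pairs
}

per_base <- count_mismatches("ATYGRTCGATCG", "MTSGATCRATCG", mat)
total <- sum(per_base)
```

With a DNAStringSet, `as.character(dna)` would give the plain strings to pass in. Under a different scoring rule (say, 0/1 for any difference), only `allele_diff` would need to change.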

I wanted to ask the board whether this (or a similar task) has been done with Bioconductor before, and what the best way of doing it might be.

Many thanks,

Ben W.

Biostrings SNPs alignment • 1.5k views

updated 8.8 years ago by Hervé Pagès • written 8.8 years ago by ben.ward

score 0 · Answer 1 · 2015-07-24

Hervé Pagès 16k

@herve-pages-1542

Seattle, WA, United States

Hi Ben,

Maybe you want to use neditAt() for that:

> neditAt(dna[[1]], dna[[2]])
[1] 4

See ?neditAt for more information. Of particular interest is the fixed argument, which lets you control how the IUPAC ambiguity codes are interpreted.
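As a rough illustration of the above, the sketch below assumes the Biostrings package is installed and loaded. It calls neditAt() with the default fixed = TRUE, where every letter (including ambiguity codes) is compared literally, and then with fixed = FALSE, where ambiguity codes act as wildcards. The per-position comparison through as.matrix() is an addition for illustration, not part of the original answer, and the variable names are made up.

```r
library(Biostrings)

dna <- DNAStringSet(c("ATYGRTCGATCG", "MTSGATCRATCG"))

# Default (fixed = TRUE): letters are compared literally, so any pair of
# different letters counts as one edit, ambiguity codes included.
n_strict <- neditAt(dna[[1]], dna[[2]])

# fixed = FALSE: ambiguity codes are treated as wildcards; see ?neditAt
# for the exact matching rules before relying on this count.
n_wild <- neditAt(dna[[1]], dna[[2]], fixed = FALSE)

# Per-position view of the literal comparison: explode the equal-width
# sequences into a character matrix and compare the two rows.
m <- as.matrix(dna)
mismatch_pos <- which(m[1, ] != m[2, ])
```

Note that the literal comparison counts every differing position once, so it would not distinguish a one-allele difference from a two-allele difference.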

Cheers,

H.
