Hi, bioconductor members,
I have two questions here. The first one is basically about how to use narrow() to keep the wanted sequence in an efficient way. I have a read and corresponding cigar, a subset from my
GAlignments instance. I want to clip the insertion and soft-clip from the read so that I can find the type of mismatch between the read and the reference genome. (A similar question was asked here: Finding mismatches between reads and reference sequence. I have tried to use the code given by Herve there, but many of the functions are deprecated and I have to improvise. )
I start with a read with insertion:
cigar <- c("47M1I52M") qseq <- DNAStringSet(x="TTCAAATGTGAACAAGTGCTTTTCCCACAGCTATTGAAATAACCATATTTTTTTTCTTTTATTCAGTTAATGTGGTTAATTTCATTGTTTGGTTTTCTAA") clipping <- unlist(cigarRangesAlongQuerySpace(cigar, ops="M")) > clipping IRanges object with 2 ranges and 0 metadata columns: start end width <integer> <integer> <integer>  1 47 47  49 100 52
Then I want to keep the sequence from 1 to 47 and 49 to 100. I don't know how to do it efficiently, so I did it in a stupid way:
s1 <- narrow(qseq, start=1, end=47) s2 <- narrow(qseq, start=49, end=100) qseq <- DNAStringSet(x=paste0(s1, s2))
I don't understand why
narrow(qseq, start=start(clipping), end=end(clipping)) won't work. Tried to understand how solveUserSEW works and not really helping. Can you make a suggestion for a more robust way??
My goal is to find the A-G mismatch, which would suggest an A-I editing site, and ultimately finding the editing enriched region. So this is my first step to remove the soft-clip and insertion from the read sequence. To accomplish it, I just keep the range with CIGAR -OPS=M (
cigarRangesAlongQuerySpace(cigar, ops="M")). Does that make sense? Any flaws? Or is there any better way?