Hi, bioconductor members,
I have two questions here. The first one is basically about how to use narrow() to keep the wanted sequence in an efficient way. I have a read and corresponding cigar, a subset from my GAlignments
instance. I want to clip the insertion and soft-clip from the read so that I can find the type of mismatch between the read and the reference genome. (A similar question was asked here: Finding mismatches between reads and reference sequence. I have tried to use the code given by Herve there, but many of the functions are deprecated and I have to improvise. )
I start with a read with insertion:
cigar <- c("47M1I52M") qseq <- DNAStringSet(x="TTCAAATGTGAACAAGTGCTTTTCCCACAGCTATTGAAATAACCATATTTTTTTTCTTTTATTCAGTTAATGTGGTTAATTTCATTGTTTGGTTTTCTAA") clipping <- unlist(cigarRangesAlongQuerySpace(cigar, ops="M")) > clipping IRanges object with 2 ranges and 0 metadata columns: start end width <integer> <integer> <integer> [1] 1 47 47 [2] 49 100 52
Then I want to keep the sequence from 1 to 47 and 49 to 100. I don't know how to do it efficiently, so I did it in a stupid way:
s1 <- narrow(qseq, start=1, end=47) s2 <- narrow(qseq, start=49, end=100) qseq <- DNAStringSet(x=paste0(s1, s2))
I don't understand why narrow(qseq, start=start(clipping), end=end(clipping))
won't work. Tried to understand how solveUserSEW works and not really helping. Can you make a suggestion for a more robust way??
My goal is to find the A-G mismatch, which would suggest an A-I editing site, and ultimately finding the editing enriched region. So this is my first step to remove the soft-clip and insertion from the read sequence. To accomplish it, I just keep the range with CIGAR -OPS=M (cigarRangesAlongQuerySpace(cigar, ops="M")
). Does that make sense? Any flaws? Or is there any better way?
Thanks,
Chao-Jen
If your goal is to count mismatches, consider using the pileup functionality in Rsamtools, or the tallyVariants() function in VariantTools.
Hi, Michael,
I just pileup() and it seems to work for my purpose. I parse the return the data.frame of pileup() and compare with the reference genome to define the type of mismatch. I have not tried tallyVariants(), but it sounds even better. I will definitely try it tomorrow. Thanks for the helpful suggestions.
Thanks,
CJ