Question: How to clip a sequence (DNAStringSet) using narrow() and keep the wanted sequence?
gravatar for Chao-Jen Wong
2.5 years ago by
USA/Seattle/Fred Hutchinson Cancer Research Center
Chao-Jen Wong30 wrote:

Hi, bioconductor members, 

I have two questions here. The first one is basically about how to use narrow() to keep the wanted sequence in an efficient way. I have a read and corresponding cigar, a subset from my GAlignments instance. I want to clip the insertion and soft-clip from the read so that I can find the type of mismatch between the read and the reference genome. (A similar question was asked here: Finding mismatches between reads and reference sequence. I have tried to use the code given by Herve there, but many of the functions are deprecated and I have to improvise. )

I start with a read with insertion:

cigar <- c("47M1I52M")
clipping <- unlist(cigarRangesAlongQuerySpace(cigar, ops="M"))

> clipping
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         1        47        47
  [2]        49       100        52

Then I want to keep the sequence from 1 to 47 and 49 to 100. I don't know how to do it efficiently, so I did it in a stupid way:

s1 <- narrow(qseq, start=1, end=47)
s2 <- narrow(qseq, start=49, end=100)
qseq <- DNAStringSet(x=paste0(s1, s2))

I don't understand why narrow(qseq, start=start(clipping), end=end(clipping)) won't work. Tried to understand how solveUserSEW works and not really helping. Can you make a suggestion for a more robust way??

My goal is to find the A-G mismatch, which would suggest an A-I editing site, and ultimately finding the editing enriched region. So this is my first step to remove the soft-clip and insertion from the read sequence. To accomplish it, I just keep the range with CIGAR -OPS=M (cigarRangesAlongQuerySpace(cigar, ops="M")). Does that make sense? Any flaws? Or is there any better way?



ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by Chao-Jen Wong30

If your goal is to count mismatches, consider using the pileup functionality in Rsamtools, or the tallyVariants() function in VariantTools.

ADD REPLYlink written 2.5 years ago by Michael Lawrence11k

Hi, Michael,

I just pileup() and it seems to work for my purpose. I parse the return the data.frame of pileup() and compare with the reference genome to define the type of mismatch.  I have not tried tallyVariants(), but it sounds even better. I will definitely try it tomorrow.  Thanks for the helpful suggestions.  




ADD REPLYlink written 2.5 years ago by Chao-Jen Wong30
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 130 users visited in the last hour