How to clip a sequence (DNAStringSet) using narrow() and keep the wanted sequence?
0
0
Entering edit mode
@chao-jen-wong-7035
Last seen 9 months ago
USA/Seattle/Fred Hutchinson Cancer Rese…

Hi, bioconductor members, 

I have two questions here. The first one is basically about how to use narrow() to keep the wanted sequence in an efficient way. I have a read and corresponding cigar, a subset from my GAlignments instance. I want to clip the insertion and soft-clip from the read so that I can find the type of mismatch between the read and the reference genome. (A similar question was asked here: Finding mismatches between reads and reference sequence. I have tried to use the code given by Herve there, but many of the functions are deprecated and I have to improvise. )

I start with a read with insertion:

cigar <- c("47M1I52M")
qseq <- DNAStringSet(x="TTCAAATGTGAACAAGTGCTTTTCCCACAGCTATTGAAATAACCATATTTTTTTTCTTTTATTCAGTTAATGTGGTTAATTTCATTGTTTGGTTTTCTAA")
clipping <- unlist(cigarRangesAlongQuerySpace(cigar, ops="M"))

> clipping
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         1        47        47
  [2]        49       100        52

Then I want to keep the sequence from 1 to 47 and 49 to 100. I don't know how to do it efficiently, so I did it in a stupid way:

s1 <- narrow(qseq, start=1, end=47)
s2 <- narrow(qseq, start=49, end=100)
qseq <- DNAStringSet(x=paste0(s1, s2))

I don't understand why narrow(qseq, start=start(clipping), end=end(clipping)) won't work. Tried to understand how solveUserSEW works and not really helping. Can you make a suggestion for a more robust way??

My goal is to find the A-G mismatch, which would suggest an A-I editing site, and ultimately finding the editing enriched region. So this is my first step to remove the soft-clip and insertion from the read sequence. To accomplish it, I just keep the range with CIGAR -OPS=M (cigarRangesAlongQuerySpace(cigar, ops="M")). Does that make sense? Any flaws? Or is there any better way?

Thanks,

Chao-Jen

genomicalignments narrow mismatch • 1.2k views
ADD COMMENT
1
Entering edit mode

If your goal is to count mismatches, consider using the pileup functionality in Rsamtools, or the tallyVariants() function in VariantTools.

ADD REPLY
0
Entering edit mode

Hi, Michael,

I just pileup() and it seems to work for my purpose. I parse the return the data.frame of pileup() and compare with the reference genome to define the type of mismatch.  I have not tried tallyVariants(), but it sounds even better. I will definitely try it tomorrow.  Thanks for the helpful suggestions.  

 

Thanks, 

CJ

ADD REPLY

Login before adding your answer.

Traffic: 709 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6