Question

How to clip a sequence (DNAStringSet) using narrow() and keep the wanted sequence?

0

Entering edit mode

Chao-Jen Wong ▴ 40

@chao-jen-wong-7035

Last seen 9 months ago

USA/Seattle/Fred Hutchinson Cancer Rese…

Hi, bioconductor members,

I have two questions here. The first one is basically about how to use narrow() to keep the wanted sequence in an efficient way. I have a read and corresponding cigar, a subset from my GAlignments instance. I want to clip the insertion and soft-clip from the read so that I can find the type of mismatch between the read and the reference genome. (A similar question was asked here: Finding mismatches between reads and reference sequence. I have tried to use the code given by Herve there, but many of the functions are deprecated and I have to improvise. )

I start with a read with insertion:

cigar <- c("47M1I52M")
qseq <- DNAStringSet(x="TTCAAATGTGAACAAGTGCTTTTCCCACAGCTATTGAAATAACCATATTTTTTTTCTTTTATTCAGTTAATGTGGTTAATTTCATTGTTTGGTTTTCTAA")
clipping <- unlist(cigarRangesAlongQuerySpace(cigar, ops="M"))

> clipping
IRanges object with 2 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
  [1]         1        47        47
  [2]        49       100        52

Then I want to keep the sequence from 1 to 47 and 49 to 100. I don't know how to do it efficiently, so I did it in a stupid way:

s1 <- narrow(qseq, start=1, end=47)
s2 <- narrow(qseq, start=49, end=100)
qseq <- DNAStringSet(x=paste0(s1, s2))

I don't understand why narrow(qseq, start=start(clipping), end=end(clipping)) won't work. Tried to understand how solveUserSEW works and not really helping. Can you make a suggestion for a more robust way??

My goal is to find the A-G mismatch, which would suggest an A-I editing site, and ultimately finding the editing enriched region. So this is my first step to remove the soft-clip and insertion from the read sequence. To accomplish it, I just keep the range with CIGAR -OPS=M (cigarRangesAlongQuerySpace(cigar, ops="M")). Does that make sense? Any flaws? Or is there any better way?

Thanks,

Chao-Jen

genomicalignments narrow mismatch • 1.2k views

ADD COMMENT • link 7.3 years ago Chao-Jen Wong ▴ 40

1

Entering edit mode

If your goal is to count mismatches, consider using the pileup functionality in Rsamtools, or the tallyVariants() function in VariantTools.

ADD REPLY • link 7.3 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi, Michael,

I just pileup() and it seems to work for my purpose. I parse the return the data.frame of pileup() and compare with the reference genome to define the type of mismatch. I have not tried tallyVariants(), but it sounds even better. I will definitely try it tomorrow. Thanks for the helpful suggestions.

Thanks,

CJ

ADD REPLY • link 7.3 years ago Chao-Jen Wong ▴ 40