Compare DNA sequence pairs and compute lengths and coordinates
0
0
Entering edit mode
Md Abrar • 0
@68385df3
Last seen 2.3 years ago
Bangladesh

Dear altruists, the following code is running properly when I'm trying to compare AXT (from UCSC) sequence pairs and keep only [A, T, C, G] uppercases and compute the matched sequence lengths. But somehow, this function is producing a lot of NaN values. Actually it should not contain any NaN values at all. I implemented the same algorithm for Python and it worked properly. But it seems that Bioconductor's DNAStringSet has limitations like it can't compare lower cases. It converts lowercase to uppercase automatically and then compares. So, I converted the AXT file's lowercases to '-' and then used it as an input. Still, I'm getting NaN values now. How can I overcome this?

The full code can be found here: https://github.com/alexander-nash/kurtosis_conservation/blob/master/get_identical_seq_locations.R

N.B: I changed the logic for the identical_bases variable. After this change, I'm getting NaN values.

getLengthsOfIdenticalSeqs<-function(x){
  #x must be an Axt object imported by CNEr
  qr<-BStringSet(apply(as.data.frame(querySeqs(x)), 2, function(z){
    tmp<-paste0("M", z)
    out<-paste0(tmp, "N")
  }))
  tr<-BStringSet(apply(as.data.frame(targetSeqs(x)), 2, function(z){
    tmp<-paste0("K", z)
    out<-paste0(tmp, "V")
  }))

#  identical_bases <- mapply(x=qr, y=tr, function(x,y){
#    out<-which(as.raw(x)==as.raw(y) & (as.raw(x)==charToRaw('A') | as.raw(x)==charToRaw('T') | as.raw(x)==charToRaw('C') | as.raw(x)==charToRaw('G')) & (as.raw(y)==charToRaw('A') | as.raw(y)==charToRaw('T') | as.raw(y)==charToRaw('C') | as.raw(y)==charToRaw('G')))
#  })

  identical_bases <- mapply(x=qr, y=tr, function(x,y){
        out<-which((as.matrix(x) == as.matrix(y)) & 
                           (as.matrix(x) != '-')  & (as.matrix(y) != '-') & 
                           (as.matrix(x)=='A' | as.matrix(x)=='T' | as.matrix(x)=='C' | as.matrix(x)=='G') & 
                           (as.matrix(y)=='A' | as.matrix(y)=='T' | as.matrix(y)=='C' | as.matrix(y)=='G'))
  })


  relative_ranges <- lapply(identical_bases, function(x){
    reduce(IRanges(x-2, x-2))
  })

  absolute_ranges <- mapply(z=relative_ranges, y=as.list(x), function(z,y){
    if(length(z) == 0){
      out<-list(data.frame())
    } else {
      out<-list(data.frame("first.seqnames"=seqnames(first(y)),
                           "first.start"=(start(first(y)) + start(z)),
                           "first.end"=(start(first(y)) + end(z)),
                           "first.width"=(end(z) - start(z) + 1),
                           "first.strand"="*",
                           "second.seqnames"=seqnames(second(y)),
                           "second.start"=(start(second(y)) + start(z)),
                           "second.end"=(start(second(y)) + end(z)),
                           "second.width"=(end(z) - start(z) + 1),
                           "second.strand"="*"))
    }
  })

  return(absolute_ranges)
}
DNASeq CNEr • 635 views
ADD COMMENT

Login before adding your answer.

Traffic: 484 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6