confused interpreting results from scanBam in RSamtools
0
0
Entering edit mode
sara • 0
@0d527ed6
Last seen 2.2 years ago
Canada

Hi there, I am using Rsamtools to read in a BAM file. Then I want to search the results to see what reads map to refernce positions of interest. Reading the user manual I noticed that 'pos' refers to the 3' end if strand "-".

I have been messing around with my code trying to extract nucleotides from the seq field depending of where the reference posision of interest is.
I have a complicated non-equi join using data.table. If you look in there I have an ifelse statement to get the read_base depending on if strand is + or -. I don't really know why I have to add the +1, but it seems to work, and I have gotten different results so I"m not sure which is correct.

Can anyone who understands this substr let me know if the concept is correct ? Thanks

```what <- c("qname","rname", "strand", "pos", "qwidth", "cigar","seq","mrnm","mpos","mate_status") param <- ScanBamParam( what=what)

no "which" as we only have reads mapped to chr1 in the file

bam<-scanBam(bamFile,param=param)

custom unlist function from Rsamtools vignette

.unlist <- function (x) {

 ## do.call(c, ...) coerces factor to integer, which is undesired
   x1 <- x[[1L]]
   if (is.factor(x1)) {
     structure(unlist(x), class = "factor", levels = levels(x1))
     } else {
       do.call(c, x)
       }
   }

elts <- setNames(bamWhat(param), bamWhat(param)) lst <- lapply(elts, function(elt) .unlist(lapply(bam, "[[", elt)))

create a data.table from the scanBam output for faster accessing

dt_bam<-data.table(qname=lst[[1]],rname=lst[[2]],strand=lst[[3]],pos=lst[[4]],qwidth=lst[[5]],cigar=lst[[6]], seq=as.character(lst[[7]]))

#

add columns start and end

for - strands pos represents the 3' end

for + strands pos represents the 5' end

dt_bam[, c("start", "end") := .( pos - qwidth (strand == "-"), pos + qwidth (strand == "+") )]

use data.table for a non-equi join with table variable_sites which contains reference positions of interest

summ_tab<-variable_sites[dt_bam, .(locus_chr=rname,locus_pos=x.POS,locus_ref=REF, read_name=qname,read_len=qwidth, read_cigar_string=cigar,locus_pos_in_read=end-x.POS, read_base=ifelse(strand=="+",substr(seq, x.POS - i.start, x.POS - i.start), substr(seq, x.POS - i.start-1, x.POS - i.start-1)) ), on = .(POS > start, POS < end), nomatch=NULL] # join only if the var_site[POS] is between start and end of the read in the bam file

include your problematic code here with any corresponding output

please also include the results of running the following in an R session

sessionInfo( )

```

Rsamtools Rsa • 608 views
ADD COMMENT

Login before adding your answer.

Traffic: 819 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6