Hi there, I am using Rsamtools to read in a BAM file. Then I want to search the results to see what reads map to refernce positions of interest. Reading the user manual I noticed that 'pos' refers to the 3' end if strand "-".
I have been messing around with my code trying to extract nucleotides from the seq field depending of where the reference posision of interest is.
I have a complicated non-equi join using data.table.
If you look in there I have an ifelse statement to get the read_base depending on if strand is + or -.
I don't really know why I have to add the +1, but it seems to work, and I have gotten different results so I"m not sure which is correct.
Can anyone who understands this substr let me know if the concept is correct ? Thanks
```what <- c("qname","rname", "strand", "pos", "qwidth", "cigar","seq","mrnm","mpos","mate_status") param <- ScanBamParam( what=what)
no "which" as we only have reads mapped to chr1 in the file
bam<-scanBam(bamFile,param=param)
custom unlist function from Rsamtools vignette
.unlist <- function (x) {
## do.call(c, ...) coerces factor to integer, which is undesired
x1 <- x[[1L]]
if (is.factor(x1)) {
structure(unlist(x), class = "factor", levels = levels(x1))
} else {
do.call(c, x)
}
}
elts <- setNames(bamWhat(param), bamWhat(param)) lst <- lapply(elts, function(elt) .unlist(lapply(bam, "[[", elt)))
create a data.table from the scanBam output for faster accessing
dt_bam<-data.table(qname=lst[[1]],rname=lst[[2]],strand=lst[[3]],pos=lst[[4]],qwidth=lst[[5]],cigar=lst[[6]], seq=as.character(lst[[7]]))
#
add columns start and end
for - strands pos represents the 3' end
for + strands pos represents the 5' end
dt_bam[, c("start", "end") := .( pos - qwidth (strand == "-"), pos + qwidth (strand == "+") )]
use data.table for a non-equi join with table variable_sites which contains reference positions of interest
summ_tab<-variable_sites[dt_bam, .(locus_chr=rname,locus_pos=x.POS,locus_ref=REF, read_name=qname,read_len=qwidth, read_cigar_string=cigar,locus_pos_in_read=end-x.POS, read_base=ifelse(strand=="+",substr(seq, x.POS - i.start, x.POS - i.start), substr(seq, x.POS - i.start-1, x.POS - i.start-1)) ), on = .(POS > start, POS < end), nomatch=NULL] # join only if the var_site[POS] is between start and end of the read in the bam file
include your problematic code here with any corresponding output
please also include the results of running the following in an R session
sessionInfo( )
```