Thank you both for help!
Let me explain the purpose of this:
It is not to calculate coverage, but to select reads, together with their sequences, that will be relevant for further analysis.
I have BAMs of probands and I want to genotype them for insertions of certain DNA sequences with breakpoints at certain known loci.
Only the read pairs spanning the breakpoints are of interest since they can prove or disprove the presence of insertion.
I can do this with presence of multiple reads that start clipping at the same loci or with read pairs in which one pair maps to the reference and the other to the insertion sequence.
I have the location of the breakpoints and want reads that span them with at least 35bp. Thus I have created a list of breakpoints:
lsBP <- split(BPgr,seq(1,98)) #BPgr is just a granges object with all breakpoints (width = 1) but I want the results separately for each of them, so I put them in a list.
I also made lists of loci 35 upstream/downstream of these loci.
lsHP <- split(HPgr,seq(1,98))
lsRP <- split(RPgr,seq(1,98))
In the last of the following lines I choose only the reads that overlap with all of those. #BAMs is a list of readGAlignmentPairs(BAM)
for(i in 1:length(BAMs)) {
OverlapsBP <- lapply(lsBP, FUN = function(x){findOverlaps(BAMs[[i]],x,ignore.strand=T)})
OverlapsHP <- lapply(lsHP, FUN = function(x){findOverlaps(BAMs[[i]],x,ignore.strand=T)})
OverlapsRP <- lapply(lsRP, FUN = function(x){findOverlaps(BAMs[[i]],x,ignore.strand=T)})
ntxsBP <- lapply(OverlapsBP,countQueryHits)
ntxsHP <- lapply(OverlapsHP,countQueryHits)
ntxsRP <- lapply(OverlapsRP,countQueryHits)
readsINT <- mapply(function(BP,HP,RP){BAMs[[i]][which(BP!=0 & HP!=0 & RP!=0),]},ntxsBP,ntxsHP,ntxsRP)
...continues...
The code works perfectly except for getting me the reads that I'm actually interested in.
The parts that would map onto the insertion are clipped (insertion is not in the hg19 reference) and return HP==0.
You can now see why I also don't care if the NNNNN inbetween read pairs or the actual reads overlap with those 3 points.
To me it only matters if the whole insert somehow spans the putative insertion breakpoint with at least 35bp overhang.
It is important to know, that I then use the actual sequences from those reads, to call BLAT on them externally.
So the solution would be to use "granges(BAMs)" to get indexes and then use those indexes on BAMs to get the actual sequences, as you suggested:
"Do this by calling granges() on your GAlignmentPairs object."
Reading your edit, I'm not sure will this work ? Does "Oops, not so fast! " concern this solution as well? Does granges() depend on rglist method?
If it does, is there a way to setMethod in a way to do the job for me?
Maybe to use
ops <- c("M", "=", "X", "I", "D", "S") # "S" added!
ans <- cigarRangesAlongQuerySpace(x@cigar, pos=x@start, ops=ops, reduce.ranges=TRUE) #cigarRangesAlongQuerySpace instead of cigarRangesAlongReferenceSpace
?
Thank you !
Lovro
The soft-clipped part is not aligned to the genome. Are you sure it makes sense to assume that it was?
For the "PS" part of your question, you can get the ranges spanning the pair with granges() and use that to find overlaps.
Yes, I'm trying to select read pairs which span the breakpoint of an insertion (which is not in the reference genome) for further analysis. I want to differentiate between reads which overhang into the putative insertion by at least, say, 35bp and those that don't. Some of the insertions I'm genotyping are in fact in the reference genome (although they are not fixed in population), so I cant rely solely on the presence/absence of clipping if I want to make a unified algorithm.
All I want to do, is read BAMs into R and select read pairs that overlap a certain point on the reference genome, irrespective of their cigar string content. (But then I also need to access the sequences of reads... granges() doesn't work since not all read pairs are mapped to the same chromosome.. If I drop them, the indexes will not match the original readpair object.)