Search
Question: how to make findOverlaps from genomicAlignments package to consider soft-clipped parts of BWA paired reads in overlap calculation?
1
2.0 years ago by
lovro010
Slovenia/Ljubljana/Clinical institute of medical genetics
lovro010 wrote:

the problem is as follows:

findOverlaps() does not produce a hit if only the soft-clipped part of the read pair is overlapping the specified GRanges.
Is there a (easy) way to make soft-clipped parts of the read count in overlap calculation?

Thank you,
Lovro

PS: if the interval in-between read pairs overlaps a certain range, does findOverlaps() produce a hit or not ? I would like it to.

modified 2.0 years ago • written 2.0 years ago by lovro010

The soft-clipped part is not aligned to the genome. Are you sure it makes sense to assume that it was?

For the "PS" part of your question, you can get the ranges spanning the pair with granges() and use that to find overlaps.

Yes, I'm trying to select read pairs which span the breakpoint of an insertion (which is not in the reference genome) for further analysis. I want to differentiate between reads which overhang into the putative insertion by at least, say, 35bp and those that don't. Some of the insertions I'm genotyping are in fact in the reference genome (although they are not fixed in population), so I cant rely solely on the presence/absence of clipping if I want to make a unified algorithm.
All I want to do, is read BAMs into R and select read pairs that overlap a certain point on the reference genome, irrespective of their cigar string content. (But then I also need to access the sequences of reads... granges() doesn't work since not all read pairs are mapped to the same chromosome.. If I drop them, the indexes will not match the original readpair object.)

1
2.0 years ago by
lovro010
Slovenia/Ljubljana/Clinical institute of medical genetics
lovro010 wrote:

Thank you both for help!

Let me explain the purpose of this:

It is not to calculate coverage, but to select reads, together with their sequences, that will be relevant for further analysis.

I have BAMs of probands and I want to genotype them for insertions of certain DNA sequences with breakpoints at certain known loci.
Only the read pairs spanning the breakpoints are of interest since they can prove or disprove the presence of insertion.
I can do this with presence of multiple reads that start clipping at the same loci or with read pairs in which one pair maps to the reference and the other to the insertion sequence.

I have the location of the breakpoints and want reads that span them with at least 35bp. Thus I have created a list of breakpoints:

lsBP <- split(BPgr,seq(1,98))    #BPgr is just a granges object with all breakpoints (width = 1) but I want the results separately for each of them, so I put them in a list.

I also made lists of loci 35 upstream/downstream of these loci.

lsHP <- split(HPgr,seq(1,98))
lsRP <- split(RPgr,seq(1,98))

In the last of the following lines I choose only the reads that overlap with all of those. #BAMs is a list of readGAlignmentPairs(BAM)

for(i in 1:length(BAMs)) {

OverlapsBP <- lapply(lsBP, FUN = function(x){findOverlaps(BAMs[[i]],x,ignore.strand=T)})
OverlapsHP <- lapply(lsHP, FUN = function(x){findOverlaps(BAMs[[i]],x,ignore.strand=T)})
OverlapsRP <- lapply(lsRP, FUN = function(x){findOverlaps(BAMs[[i]],x,ignore.strand=T)})

ntxsBP <- lapply(OverlapsBP,countQueryHits)
ntxsHP <- lapply(OverlapsHP,countQueryHits)
ntxsRP <- lapply(OverlapsRP,countQueryHits)

readsINT <- mapply(function(BP,HP,RP){BAMs[[i]][which(BP!=0 & HP!=0 & RP!=0),]},ntxsBP,ntxsHP,ntxsRP)

...continues...

The code works perfectly except for getting me the reads that I'm actually interested in.
The parts that would map onto the insertion are clipped (insertion is not in the hg19 reference) and return HP==0.
You can now see why I also don't care if the NNNNN inbetween read pairs or the actual reads overlap with those 3 points.
To me it only matters if the whole insert somehow spans the putative insertion breakpoint with at least 35bp overhang.

It is important to know, that I then use the actual sequences from those reads, to call BLAT on them externally.
So the solution would be to use "granges(BAMs)" to get indexes and then use those indexes on BAMs to get the actual sequences, as you suggested:
"Do this by calling granges() on your GAlignmentPairs object."

Reading your edit, I'm not sure will this work ? Does "Oops, not so fast! " concern this solution as well? Does granges() depend on rglist method?

If it does, is there a way to setMethod in a way to do the job for me?
Maybe to use
ops <- c("M", "=", "X", "I", "D", "S")  # "S" added!
ans <- cigarRangesAlongQuerySpace(x@cigar, pos=x@start, ops=ops, reduce.ranges=TRUE) #cigarRangesAlongQuerySpace instead of cigarRangesAlongReferenceSpace

?

Thank you !
Lovro

> TEST<-granges(BAMs[[1]])
Error in .local(x, use.names, use.mcols, ...) :
For some pairs in 'x', the 2 alignments are not on the same chromosome. Cannot
associate a unique genomic range to such pairs. Please call granges() with
'on.discordant.seqnames="drop"' to drop these pairs, or with
'on.discordant.seqnames="split"' to represent each of them with 2 genomic
ranges in the returned GRanges object. Note that in both cases the returned
object won't be parallel to 'x'. Alternatively, please consider using
grglist() instead of granges() to turn 'x' into a GRangesList object instead

I guess using "drop" or "split" would be a problem, since the indexes between granges(BAMS[[1]]) and BAMS[[1]] would not match. Also I would prefer to keep those pairs, since the reason for them to be mapped on different chromosomes might be the presence of the insertion I'm trying to genotype. My downstream analysis is insensitive to this as long as I keep the read pair in the pipe.

0
2.0 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi,

If we agree that it wouldn't make sense to have the interval in-between read pairs account in the overlap calculation without also having the so called "skipped regions" (i.e. N in the cigars), then you can completely ignore the fine range structure of each pair and replace it by a single range. Do this by calling granges() on your GAlignmentPairs object.

As for accounting soft-clipped regions in the overlap calculation, I agree with Michael that it's a questionable thing to do. [Edit: The following solution is incorrect. See my comment below this answer for a follow-up.] A quick and dirty and hacky solution would be for you to redefine the rglist method for GAlignments objects (which is called behind the scene by grglist() and findOverlaps()) just before calling findOverlaps() or grglist() on your GAlignmentPairs, and to restore the original method right after the call returns. Redefine the rglist method with:

setMethod("rglist", "GAlignments",
function(x, use.names=TRUE, use.mcols=FALSE,
order.as.in.query=FALSE, drop.D.ranges=FALSE)
{
if (!isTRUEorFALSE(use.names))
stop("'use.names' must be TRUE or FALSE")
if (!isTRUEorFALSE(use.mcols))
stop("'use.mcols' must be TRUE or FALSE")
if (!isTRUEorFALSE(order.as.in.query))
stop("'order.as.in.query' must be TRUE or FALSE")
ops <- c("M", "=", "X", "I", "D", "S")  # "S" added!
ans <- cigarRangesAlongReferenceSpace(x@cigar, pos=x@start,
ops=ops, reduce.ranges=TRUE)
if (order.as.in.query)
ans <- revElements(ans, strand(x) == "-")
if (use.names)
names(ans) <- names(x)
if (use.mcols)
mcols(ans) <- mcols(x)
ans
}
)

Untested!!

H.

Oops, not so fast! Sorry I forgot that cigarRangesAlongReferenceSpace() reports a zero-width range for the S in the CIGAR:

> cigarRangesAlongReferenceSpace("2S47M3I24M3S", with.ops=TRUE)[[1]]
IRanges object with 5 ranges and 0 metadata columns:
start       end     width
<integer> <integer> <integer>
S         1         0         0
M         1        47        47
I        48        47         0
M        48        71        24
S        72        71         0

Just like for insertions (i.e. I in the CIGAR). Soft-clipped nucleotides can actually be seen as insertions at the ends of the reads.

A correct solution would be to turn your GAlignmentPairs object into a GRangesList object with grglist() and then to add the soft-clipped regions to this GRangesList object. You could use the cigarRangesAlongQuerySpace() low-level utility to compute these regions:

> cigarRangesAlongQuerySpace("2S47M3I24M3S", with.ops=TRUE)[[1]]
IRanges object with 5 ranges and 0 metadata columns:
start       end     width
<integer> <integer> <integer>
S         1         2         2
M         3        49        47
I        50        52         3
M        53        76        24
S        77        79         3

See ?cigarRangesAlongQuerySpace for more information.

Finally replace your GAlignmentPairs with this GRangesList object when calling findOverlaps().

But if all you're worrying about is that findOverlaps() could be missing some hits because BWA did too much soft-clipping on your reads then you have simpler options. First I would investigate BWA documentation and see if the soft-clipping can be controlled somehow. Another possibility is to try to set the maxgap argument to an non-zero value when calling findOverlaps().

Hope this helps,

H.

Hi, I appreciate your help, however I haven't been able to solve the problem yet.

maxgap wouldn't do anything for me, since I need to differentiate between reads which extend over the breakpoint with at least 35bp overhang, and those that extend over less.
All reads that extend over the breakpoint represent sequences of the insertions and are clipped. So setting maxgap under 35 wouldn't get any of the reads and over 35 would get all of them.

Regarding your proposed solution I am a little confused: "turn your GAlignmentPairs object into a GRangesList object with grglist() and then to add the soft-clipped regions to this GRangesList object."
grglist() will remove cigars from the object, so I cant run cigarRangesAlongQuerySpace on them. And running cigarRangesAlongQuerySpace on GAlignmentPairs object doesnt give me new start/end coordinates.
For example, I get
> cigarRangesAlongQuerySpace(BAMs[[1]][[1]]@cigar,with.ops = T)
IRangesList of length 2
[[1]]
IRanges object with 2 ranges and 0 metadata columns:
start       end     width
<integer> <integer> <integer>
S         1       209       209
M       210       251        42

[[2]]
IRanges object with 2 ranges and 0 metadata columns:
start       end     width
<integer> <integer> <integer>
M         1        59        59
S        60       250       191

What did you have in mind ? I could try something in the line of: if(first line is S){start = start - width} , if(last line is S){end = end + width}in GAlignmentPairs object. However I find that way too cumbersome to do for every of read I will ever process.

I am thinking of 3 possible solutions:

1. I noticed, that cigarRangesAlongQuerySpace() has an option "after.soft.clipping = T". This would do the trick, but cigarRangesAlongReferenceSpace() doesn't have it. Is there a reason? Maybe I could try to change that, but I would need a hint or 2.
2. I could try to edit all cigar strings with simply replacing all "S" with "M". However, I would probably have to do this before I read in BAMs into R ? Or is there a way to "refresh" GAlignmentPairs object... perhaps then I could do it it with cigarRangesAlongReferenceSpace().
3. I could try filter reads before R with samtools. However, the fact that I couldn't select for both read pairs at the same time was precisely the reason I decided to use R :) I could try selecting one and call the pair by its name...

I's sorry I'm thinking out loud here, I'm a MD and not an IT so I don't really know wtf I'm doing.

Tnx for help!
Lovro