Hi,
The summarizeOverlaps()
function has an argument called fragments
by which if fragments=FALSE
, then the function readGAlignmentPairs()
is used to read the BAM file, which outputs a GAlignmentPairs
object. If fragments=TRUE
, then the function readGAlignmentsList()
is used to read the BAM file, which outputs a GAlignmentsList
object. Let's read an example BAM file with these two options:
library(GenomicAlignments)
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE)
galp <- readGAlignmentPairs(bamfile)
gall <- readGAlignmentsList(bamfile)
Let's look now at the first paired alignment read with these two options:
galp[1]
GAlignmentPairs object with 1 pair, strandMode=1, and 0 metadata columns:
seqnames strand : ranges -- ranges
<Rle> <Rle> : <IRanges> -- <IRanges>
[1] seq1 + : 36-70 -- 185-219
-------
seqinfo: 2 sequences from an unspecified genome
gall[[1]]
GAlignments object with 2 alignments and 0 metadata columns:
seqnames strand cigar qwidth start end width
<Rle> <Rle> <character> <integer> <integer> <integer> <integer>
[1] seq1 + 35M 35 36 70 35
[2] seq1 - 35M 35 185 219 35
njunc
<integer>
[1] 0
[2] 0
-------
seqinfo: 2 sequences from an unspecified genome
As you can see, in the object galp
the first paired-end alignment has a positive strand because, by default, strandMode=1
in the call to readAlignmentPairs()
and therefore the strand of the paired alignment is the strand of the first mate. However, in the object gall
, there is no such parameter strandMode
and the mates have their original strand, which means that there is no strand associated with the paired alignment. This is, in my opinion, problematic when counting features with summarizeOverlaps()
and the argument fragments=TRUE
, because it may be counting features that overlap in the wrong strand. See the following example:
f <- GRanges(seqnames="seq1", IRanges(30, 250), strand="-")
f
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] seq1 30-250 -
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
findOverlaps(galp[1], f, ignore.strand=FALSE)
Hits object with 0 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
-------
queryLength: 1 / subjectLength: 1
findOverlaps(gall[1], f, ignore.strand=FALSE)
Hits object with 1 hit and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 1
-------
queryLength: 1 / subjectLength: 1
Here the feature is in the negative strand and consequently, if we take the strand into account with ignore.strand=FALSE
, the first alignment in the galp
object does not overlap. However, the first alignment in the gall
object does overlap, while, in my opinion, it should not.
My question is, shouldn't the elements of the GAlignments
list object have the strand of the corresponding mate? i.e.,
gall[[1]]
GAlignments object with 2 alignments and 0 metadata columns:
seqnames strand cigar qwidth start end width
<Rle> <Rle> <character> <integer> <integer> <integer> <integer>
[1] seq1 + 35M 35 36 70 35
[2] seq1 + 35M 35 185 219 35
njunc
<integer>
[1] 0
[2] 0
-------
seqinfo: 2 sequences from an unspecified genome
In other terms, shouldn't the function readGAlignmentsList()
have also a strandMode
parameter?
thanks!
robert.
Hi Robert,
Just to clarify, a GAlignmentsList object represents a list of GAlignments objects, in the same way that a GRangesList object represents a list of GRanges objects. In particular a list element in a GAlignmentsList object can contain an arbitrary number of GAlignments objects (0, 1, 2, or more).
OTOH the strandMode concept is a paired-end specific concept so doesn't apply to GAlignmentsList objects in general, unless the object contains paired-end RNA-seq data. But in this case shouldn't you use a GAlignmentPairs object instead? I'm trying to understand the reason why one would use a GAlignmentsList object instead of a GAlignmentPairs object if they have paired-end data.
Thanks,
H.
One reason to use a GAlignmentsList object is if you have 'too many' unpaired reads for some reason and you want to recover them. And it's what is going to happen if you use fragments = TRUE in a call to
summarizeOverlaps
. In which case something like thisWill not accurately count the second reads because they are on the wrong strand, unlike in the GAlignmentPairs situation. Unless I misunderstand something.
Got it. Yeah I guess having a
strandMode
parameter forreadGAlignmentsList()
would help handle this.@Robert: Do you want to open an issue on GitHub to request this feature? https://github.com/Bioconductor/GenomicAlignments/issues Unfortunately I won't have the resources to work on this in the near future so any help with this will be greatly appreciated. Thanks!
As James said, a
GAlignmentsList
object is the container for storing properly-paired, but with ambiguous pairing, alignments, and we're interested in using them. Moreover, right now settingfragments=TRUE
in the call tosummarizeOverlaps()
may lead to wrong results for a fraction of the aligned reads. I'll open an issue in theGenomicAlignments
GitHub repo, and will try to suggest the necessary edits in the code.Hi again, I've just created a PR in the GitHub repo of GenomicAlignments, providing a first implementation of the support for the
strandMode
parameter withGAlignmentsList
objects.