Strand information in a GAlignmentsList object derived from strand-specific paired-end RNA-seq data
0
2
Entering edit mode
Robert Castelo ★ 3.4k
@rcastelo
Last seen 2 days ago
Barcelona/Universitat Pompeu Fabra

Hi,

The summarizeOverlaps() function has an argument called fragments by which if fragments=FALSE, then the function readGAlignmentPairs() is used to read the BAM file, which outputs a GAlignmentPairs object. If fragments=TRUE, then the function readGAlignmentsList() is used to read the BAM file, which outputs a GAlignmentsList object. Let's read an example BAM file with these two options:

library(GenomicAlignments)

bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE)
galp <- readGAlignmentPairs(bamfile)
gall <- readGAlignmentsList(bamfile)

Let's look now at the first paired alignment read with these two options:

galp[1]
GAlignmentPairs object with 1 pair, strandMode=1, and 0 metadata columns:
      seqnames strand :    ranges --    ranges
         <Rle>  <Rle> : <IRanges> -- <IRanges>
  [1]     seq1      + :     36-70 --   185-219
  -------
  seqinfo: 2 sequences from an unspecified genome
gall[[1]]
GAlignments object with 2 alignments and 0 metadata columns:
      seqnames strand       cigar    qwidth     start       end     width
         <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
  [1]     seq1      +         35M        35        36        70        35
  [2]     seq1      -         35M        35       185       219        35
          njunc
      <integer>
  [1]         0
  [2]         0
  -------
  seqinfo: 2 sequences from an unspecified genome

As you can see, in the object galp the first paired-end alignment has a positive strand because, by default, strandMode=1 in the call to readAlignmentPairs() and therefore the strand of the paired alignment is the strand of the first mate. However, in the object gall, there is no such parameter strandMode and the mates have their original strand, which means that there is no strand associated with the paired alignment. This is, in my opinion, problematic when counting features with summarizeOverlaps() and the argument fragments=TRUE, because it may be counting features that overlap in the wrong strand. See the following example:

f <- GRanges(seqnames="seq1", IRanges(30, 250), strand="-")
f
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     seq1    30-250      -
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

findOverlaps(galp[1], f, ignore.strand=FALSE)
Hits object with 0 hits and 0 metadata columns:
   queryHits subjectHits
   <integer>   <integer>
  -------
  queryLength: 1 / subjectLength: 1

findOverlaps(gall[1], f, ignore.strand=FALSE)
Hits object with 1 hit and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  -------
  queryLength: 1 / subjectLength: 1

Here the feature is in the negative strand and consequently, if we take the strand into account with ignore.strand=FALSE, the first alignment in the galp object does not overlap. However, the first alignment in the gall object does overlap, while, in my opinion, it should not.

My question is, shouldn't the elements of the GAlignments list object have the strand of the corresponding mate? i.e.,

gall[[1]]
GAlignments object with 2 alignments and 0 metadata columns:
      seqnames strand       cigar    qwidth     start       end     width
         <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
  [1]     seq1      +         35M        35        36        70        35
  [2]     seq1      +         35M        35       185       219        35
          njunc
      <integer>
  [1]         0
  [2]         0
  -------
  seqinfo: 2 sequences from an unspecified genome

In other terms, shouldn't the function readGAlignmentsList() have also a strandMode parameter?

thanks!

robert.

GenomicAlignments • 1.1k views
ADD COMMENT
0
Entering edit mode

Hi Robert,

Just to clarify, a GAlignmentsList object represents a list of GAlignments objects, in the same way that a GRangesList object represents a list of GRanges objects. In particular a list element in a GAlignmentsList object can contain an arbitrary number of GAlignments objects (0, 1, 2, or more).

OTOH the strandMode concept is a paired-end specific concept so doesn't apply to GAlignmentsList objects in general, unless the object contains paired-end RNA-seq data. But in this case shouldn't you use a GAlignmentPairs object instead? I'm trying to understand the reason why one would use a GAlignmentsList object instead of a GAlignmentPairs object if they have paired-end data.

Thanks,

H.

ADD REPLY
1
Entering edit mode

One reason to use a GAlignmentsList object is if you have 'too many' unpaired reads for some reason and you want to recover them. And it's what is going to happen if you use fragments = TRUE in a call to summarizeOverlaps. In which case something like this

gall[[1]]
GAlignments object with 2 alignments and 0 metadata columns:
      seqnames strand       cigar    qwidth     start       end     width
         <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
  [1]     seq1      +         35M        35        36        70        35
  [2]     seq1      -         35M        35       185       219        35

Will not accurately count the second reads because they are on the wrong strand, unlike in the GAlignmentPairs situation. Unless I misunderstand something.

ADD REPLY
1
Entering edit mode

Got it. Yeah I guess having a strandMode parameter for readGAlignmentsList() would help handle this.

@Robert: Do you want to open an issue on GitHub to request this feature? https://github.com/Bioconductor/GenomicAlignments/issues Unfortunately I won't have the resources to work on this in the near future so any help with this will be greatly appreciated. Thanks!

ADD REPLY
0
Entering edit mode

As James said, a GAlignmentsList object is the container for storing properly-paired, but with ambiguous pairing, alignments, and we're interested in using them. Moreover, right now setting fragments=TRUE in the call to summarizeOverlaps() may lead to wrong results for a fraction of the aligned reads. I'll open an issue in the GenomicAlignments GitHub repo, and will try to suggest the necessary edits in the code.

ADD REPLY
0
Entering edit mode

Hi again, I've just created a PR in the GitHub repo of GenomicAlignments, providing a first implementation of the support for the strandMode parameter with GAlignmentsList objects.

ADD REPLY

Login before adding your answer.

Traffic: 627 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6